2025 Activity report: Project-Team TADAAM
RNSR: 201521202K
- Research center: Inria Centre at the University of Bordeaux
- In partnership with: Bordeaux INP, Université de Bordeaux
- Team name: Topology-aware system-scale data management for high-performance computing
- In collaboration with: Laboratoire Bordelais de Recherche en Informatique (LaBRI)
Creation of the Project-Team: 2017 December 01
Keywords
Computer Science and Digital Science
- A1.1.1. Multicore, Manycore
- A1.1.2. Hardware accelerators (GPGPU, FPGA, etc.)
- A1.1.3. Memory models
- A1.1.4. High performance computing
- A1.1.5. Exascale
- A1.1.9. Fault tolerant systems
- A1.2.4. QoS, performance evaluation
- A1.6. Green Computing
- A2.1.7. Distributed programming
- A2.2.2. Memory models
- A2.2.3. Memory management
- A2.2.4. Parallel architectures
- A2.2.5. Run-time systems
- A2.6.1. Operating systems
- A2.6.2. Middleware
- A2.6.4. Resource management
- A3.1.2. Data management, querying and storage
- A3.1.3. Distributed data
- A3.1.8. Big data (production, storage, transfer)
- A3.4. Machine learning and statistics
- A6.1.2. Stochastic Modeling
- A6.2.3. Probabilistic methods
- A6.2.6. Optimization
- A6.2.7. HPC for machine learning
- A6.3.3. Data processing
- A7.1.1. Distributed algorithms
- A7.1.2. Parallel algorithms
- A7.1.3. Graph algorithms
- A8.1. Discrete mathematics, combinatorics
- A8.2. Optimization
- A8.7. Graph theory
- A8.9. Performance evaluation
- A9.2. Machine learning
Other Research Topics and Application Domains
- B6.3.2. Network protocols
- B6.3.3. Network Management
- B9.5.1. Computer science
- B9.8. Reproducibility
1 Team members, visitors, external collaborators
Research Scientists
- Brice Goglin [Team leader, INRIA, Senior Researcher, HDR]
- Alexandre Denis [INRIA, Researcher]
- Luan Teylo Gouveia Lima [INRIA, ISFP]
- Mihail Popov [INRIA, ISFP]
Faculty Members
- Guillaume Mercier [BORDEAUX INP, Associate Professor Delegation, HDR]
- François Pellegrini [UNIV BORDEAUX, Professor, HDR]
- Francieli Zanon-Boito [UNIV BORDEAUX, Associate Professor Delegation, HDR]
PhD Students
- Charles Goedefroit [BULL, CIFRE]
- Serge Meurrens [INRIA, from Dec 2025]
- Thibaut Pepin [CEA/DAM]
- Tristan Riehs [INRIA, from Sep 2025]
- Meline Trochon [DDN (DataDirect Networks), CIFRE]
Technical Staff
- Mahamat Younous Abdraman [INRIA, Engineer]
- Pierre Clouzet [INRIA, Engineer]
- Ana Hourcau [INRIA, Engineer, from Oct 2025]
- Xavier Muller [INRIA, Engineer]
Interns and Apprentices
- Laora Aimi [INRIA, Intern, from Feb 2025 until Sep 2025]
- Tanguy Chatelain [INRIA, Intern, from Feb 2025 until Jul 2025]
- Axel Malmgren [INRIA, until Jul 2025]
- Noureddine Tamssaout [INRIA, Intern, from Feb 2025 until Aug 2025]
- Gael Valade [INRIA, Intern, from Jun 2025 until Aug 2025]
Administrative Assistant
- Fabienne Cuyollaa [INRIA]
External Collaborators
- Iheb Becher [CNRS, from Apr 2025]
- Julien Rodriguez [DGA, until Aug 2025]
2 Overall objectives
In TADaaM, we propose a new approach in which the application explicitly expresses the resource needs of its execution. The application needs to express its behavior, but in a different way from the compute-centric approach: the additional information is not necessarily focused on computation and instruction execution, but follows high-level semantics (need for large memory for some processes, start of a communication phase, need to refine the granularity, beginning of a storage access phase, description of data affinity, etc.). These needs will be expressed to a service through an API. These services will be system-wide (able to gather global knowledge) and stateful (able to make decisions based on the current request but also on previous ones). The API shall enable the application to access these services through a well-defined set of functions, based on carefully designed abstractions.
Hence, the goal of TADaaM is to design stateful services for HPC systems, in order to optimize applications execution according to their needs.
These layers will abstract low-level details of the architecture and the software stack, and will allow applications to register their needs. Then, according to these requests and to the environment characteristics, these services will feature engines to optimize the execution of the applications at system-scale, taking into account the gathered global knowledge and previous requests.
This approach exhibits several key characteristics:
- It is independent from the application parallelization, the programming model, the numerical scheme and, largely, from the data layout. Indeed, high-level semantic requests can easily be added to the application code after the problem has been modeled, parallelized, and most of the time after the data layout has been designed and optimized. Therefore, this approach is – to a large extent – orthogonal to other optimization mechanisms and does not require application developers to rewrite their code.
- Application developers are the people who know their code best, and therefore the needs of their application. Provided the interface is well designed and the abstractions are correctly exposed, they can easily express the application needs in terms of resource usage and interaction with the whole environment.
- Being stateful and shared by all the applications in the parallel environment, the proposed layer will therefore enable optimizations that:
- cannot be performed statically but require information only known at launch- or run-time,
- are incremental and require minimal changes to the application execution scheme,
- deal with several parts of the environment at the same time (e.g., batch scheduler, I/O, process manager and storage),
- take into account the needs of several applications at the same time and deal with their interaction. This will be useful, for instance, to handle network contention, storage access or any other shared resources.
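The notion of a stateful service shared by several applications can be illustrated with a minimal sketch. Everything below is hypothetical: the class `NeedsService`, its methods, and the application names are invented for illustration and do not describe an existing TADaaM interface. The sketch only shows how remembering earlier requests lets the service report which other applications are already using a shared resource.

```python
from collections import defaultdict


class NeedsService:
    """Toy stateful service: remembers every request from every application."""

    def __init__(self):
        # resource name -> set of applications currently in a phase using it
        self.active = defaultdict(set)
        # full request history, kept so later decisions can use earlier ones
        self.history = []

    def begin_phase(self, app, resource):
        """An application announces that it starts using a shared resource.

        Returns the other applications already using that resource, so that
        an optimization engine could react to likely contention.
        """
        others = sorted(self.active[resource] - {app})
        self.active[resource].add(app)
        self.history.append((app, "begin", resource))
        return others

    def end_phase(self, app, resource):
        """An application announces that the phase is over."""
        self.active[resource].discard(app)
        self.history.append((app, "end", resource))


service = NeedsService()
first = service.begin_phase("climate_sim", "storage")
second = service.begin_phase("training_job", "storage")
service.end_phase("climate_sim", "storage")
third = service.begin_phase("post_processing", "storage")
print(first, second, third)
```

A real service would also have to be distributed and scalable; this single-process sketch ignores both aspects.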
3 Research program
3.1 Need for System-Scale Optimization
Firstly, in order for applications to make the best possible use of the available resources, it is impossible to expose all the low-level details of the hardware to the program, as this would make portability impossible to achieve. Hence, the standard approach is to add intermediate layers (programming models, libraries, compilers, runtime systems, etc.) to the software stack so as to bridge the gap between the application and the hardware. With this approach, optimizing the application requires expressing its parallelism (within the imposed programming model), organizing the code, scheduling and load-balancing the computations, etc. In other words, the way the code is written and the way it is executed and interpreted by the lower layers drive the optimization. In any case, this approach is centered on how computations are performed. Such an approach is no longer sufficient, as the way an application executes depends less and less on the organization of computation and more and more on the way its data is managed.
Secondly, modern large-scale parallel platforms comprise tens to hundreds of thousands of nodes¹. However, very few applications use the whole machine. In general, an application runs only on a subset of the nodes². Therefore, most of the time, an application shares the network, the storage and other resources with other applications running concurrently during its execution. Depending on the allocated resources, it is not uncommon that the execution of one application interferes with the execution of a neighboring one.
Lastly, even if an application is running alone, each element of the software stack often performs its own optimization independently. For instance, when considering a hybrid MPI/OpenMP application, one may realize that threads are concurrently used within the OpenMP runtime system, within the MPI library for communication progression, and possibly within the computation library (BLAS) and even within the application itself (pthreads). However, none of these different classes of threads is aware of the existence of the others. Consequently, the way they are executed, scheduled, and prioritized depends neither on their relative roles, nor on their locations in the software stack, nor on the state of the application.
The above remarks show that in order to go beyond the state-of-the-art, it is necessary to design a new set of mechanisms allowing cross-layer and system-wide optimizations so as to optimize the way data is allocated, accessed and transferred by the application.
3.2 Scientific Challenges and Research Issues
In TADaaM, we will tackle the problem of efficiently executing an application, at system-scale, on an HPC machine. We assume that the application is already optimized (efficient data layout, use of effective libraries, usage of state-of-the-art compilation techniques, etc.). Nevertheless, even a statically optimized application will not be able to run at scale without considering the following dynamic constraints: machine topology, allocated resources, data movement and contention, other running applications, access to storage, etc. Thanks to the proposed layer, we will provide a simple and efficient way for already existing applications, as well as new ones, to express their needs in terms of resource usage, locality and topology, using high-level semantics.
It is important to note that we target the optimization of each application independently but also of several applications at the same time and at system-scale, taking into account their resource requirements, their network usage or their storage access. Furthermore, dealing with code-coupling applications is an intermediate use-case that will also be considered.
Several issues have to be considered. The first one consists in providing relevant abstractions and models to describe the topology of the available resources and the application behavior.
Therefore, the first question we want to answer is: “How to build scalable models and efficient abstractions that make it possible to understand the impact of data movement, topology and locality on performance?” These models must be sufficiently precise to grasp the reality, tractable enough to enable efficient solutions and algorithms, and simple enough to remain usable by non-hardware experts. We will work on (1) better describing the memory hierarchy, considering new memory technologies; (2) providing an integrated view of the nodes, the network and the storage; (3) exhibiting qualitative knowledge; (4) providing ways to express the multi-scale properties of the machine. Concerning abstractions, we will work on providing general concepts to be integrated at the application or programming model layers. The goal is to offer means for the application to express its high-level requirements in terms of data access, locality and communication, by providing abstractions on the notion of hierarchy, mesh, affinity, traffic metrics, etc.
In addition to the abstractions and the aforementioned models we need to define a clean and expressive API in a scalable way, in order for applications to express their needs (memory usage, affinity, network, storage access, model refinement, etc.).
Therefore, the second question we need to answer is: “How to build a system-scale, stateful, shared layer that can gather applications' needs expressed with high-level semantics?” This work will require not only defining a clean API where applications will express their needs, but also defining how such a layer will be shared across applications and will scale on future systems. The API will provide a simple yet effective way to express different needs such as: memory usage of a given portion of the code; start of a compute-intensive part; phase where the network is accessed intensively; topology-aware affinity management; usage of storage (in read and/or write mode); change of the data layout after mesh refinement, etc. From an engineering point of view, the layer will have a hierarchical design matching the hardware hierarchy, so as to achieve scalability.
Once this has been done, the service layer will have all the information about the environment characteristics and application requirements. We therefore need to design a set of mechanisms to optimize the execution of applications: communication, mapping, thread scheduling, data partitioning / mapping / movement, etc.
Hence, the last scientific question we will address is: “How to design fast and efficient algorithms, mechanisms and tools to enable execution of applications at system-scale, in a full HPC ecosystem, taking into account topology and locality?” A first set of research topics is related to thread and process placement according to the topology and the affinity. Another large field of study is related to data placement, allocation and partitioning: optimizing the way data is accessed and processed, especially for mesh-based applications. The issues of transferring data across the network will also be tackled, thanks to the global knowledge we have of the application behavior and the data layout. Concerning the interaction with other applications, several directions will be tackled. Among these directions, we will deal with matching process placement with the resource allocation given by the batch scheduler or with the storage management: switching from a best-effort, application-centric strategy to a global optimization scheme.
4 Application domains
TADaaM focuses on data management, hence on data-intensive applications, ranging from HPC to AI applications, that require lots of data movement between cores, storage, etc.
Mesh-based applications were the main focus when TADaaM was created, but they are now only one of our focuses among data-intensive applications, especially since the emergence of AI and data analytics.
4.1 Mesh-based applications
TADaaM targets scientific simulation applications on large-scale systems, as these applications present huge challenges in terms of performance, locality, scalability, parallelism and data management. Many of these HPC applications use meshes as the basic model for their computation. For instance, PDE-based simulations using finite differences, finite volumes, or finite elements methods operate on meshes that describe the geometry and the physical properties of the simulated objects.
Mesh-based applications not only represent the majority of HPC applications running on existing supercomputing systems, but also feature properties that should be taken into account to achieve scalability and performance on future large-scale systems. These properties are the following:
- Size: datasets are large; some meshes comprise hundreds of millions of elements, or even billions.
- Dynamicity: in many simulations, meshes are refined or coarsened at each time step, so as to account for the evolution of the physical simulation (moving parts, shockwaves, structural changes in the model resulting from collisions between mesh parts, etc.).
- Structure: many meshes are unstructured, and require advanced data structures so as to manage irregularity in data storage.
- Topology: due to their rooting in the physical world, meshes exhibit interesting topological properties (low dimensionality embedding, small maximum degree, large diameter, etc.). It is very important to take advantage of these properties when laying out mesh data on systems where communication locality matters.
All these features make mesh-based applications a very interesting and challenging use-case for the research we want to carry out in this project. Moreover, we believe that our proposed approach and solutions will contribute to enhance these applications and allow them to achieve the best possible usage of the available resources of future high-end systems.
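The topological properties listed above (small maximum degree, large diameter) can be checked on a toy example. The sketch below assumes a structured 20-by-20 quadrilateral mesh, chosen only because it is easy to generate; real application meshes are often unstructured. The helper names `grid_mesh`, `eccentricity` and `diameter` are ours.

```python
from collections import deque


def grid_mesh(nx, ny):
    """Adjacency lists of an nx-by-ny quadrilateral mesh (cells sharing an edge)."""
    adj = {}
    for x in range(nx):
        for y in range(ny):
            nbrs = []
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= x + dx < nx and 0 <= y + dy < ny:
                    nbrs.append((x + dx, y + dy))
            adj[(x, y)] = nbrs
    return adj


def eccentricity(adj, source):
    """Largest breadth-first-search distance from source to any vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())


def diameter(adj):
    """Exact diameter by running one BFS per vertex (fine for small graphs)."""
    return max(eccentricity(adj, v) for v in adj)


mesh = grid_mesh(20, 20)
max_degree = max(len(nbrs) for nbrs in mesh.values())
print(len(mesh), max_degree, diameter(mesh))  # prints "400 4 38"
```

With 400 cells, no cell has more than 4 neighbors, while the two opposite corners are 38 hops apart: the diameter grows with the side length of the mesh, which is why communication locality matters when laying out mesh data.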
5 Social and environmental responsibility
5.1 Footprint of research activities
Team members make common use of small to large-scale high performance computing platforms, which are energy consuming.
For this reason, previous research in the team [35] leveraged an existing consolidated simulation tool, SimGrid, for the bulk of experiments, using an experimental platform for validation only. For comparison, the validation experiments required hours on nine nodes, while the simulation results that made it into the paper would take at least 569 days to run. Although using and adapting the simulation tool took a certain effort, it allowed for more extensive evaluation, in addition to decreasing the footprint of this research. A similar strategy has been used in other projects since then [34].
Brice Goglin is involved in an emerging French initiative towards extending the lifetime of computing infrastructure [26].
5.2 Impact of research results
The digital sector is an ever-growing consumer of energy. Hence, it is of the utmost importance to increase the efficiency of use of digital tools. Our work on performance optimization, whether for high-end, energy consuming supercomputers, or more modest systems, aims at reducing the footprint of computations.
Because the aim of these machines is to be used at their maximum capacity, given their high production cost to amortize, we consider that our research results will not lead to a decrease in the overall use of computer systems; however, we expect them to lead to better modeling of the energy consumption of applications and hence to a better usage of their energy, resulting in “more science per watt”. Of course, it is always hard to evaluate the real impact, as a possible rebound effect is for more users to run on these machines, or for users to decide to run extra experiments “because it is possible”.
5.3 Influence of team members
Members of the team participated in the writing of the Inria global Action plan on F/M professional equality for 2021-2024.
Moreover, Méline TROCHON, Ph.D. student in the team, is a member of the Groupe de Travail Parité-Egalité from the Inria Center at the University of Bordeaux (project.inria.fr/pariteegalitebordeaux/).
6 Highlights of the year
- TADaaM members published 4 papers at the IPDPS 2025 conference in Milan, one of the major conferences in our research community (as well as 2 papers in IPDPS workshops).
- TADaaM also received the IPDPS 2025 Open Source Contribution Award in collaboration with STORM.
- Revision 5.0 of the MPI standard for communication in parallel applications was published. Guillaume Mercier is an editor, chapter leader and working group leader.
- A first industrial partner joined the Scotch consortium.
- Francieli Boito defended her habilitation (HDR).
7 Latest software developments, platforms, open data
7.1 Latest software developments
7.1.1 hwloc
- Name: Hardware Locality
- Keywords: NUMA, Multicore, GPU, Affinities, Open MPI, Topology, HPC, Locality
- Functional Description:
Hardware Locality (hwloc) is a library and set of tools aiming at discovering and exposing the topology of machines, including processors, cores, threads, shared caches, NUMA memory nodes and I/O devices. It builds a widely-portable abstraction of these resources and exposes it to applications so as to help them adapt their behavior to the hardware characteristics. They may consult the hierarchy of resources, their attributes, and bind task or memory on them.
hwloc targets many types of high-performance computing applications, from thread scheduling to placement of MPI processes. Most existing MPI implementations, several resource managers and task schedulers, and multiple other parallel libraries already use hwloc.
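To illustrate what consulting a hierarchy of resources can look like, here is a minimal sketch of a hardware tree and a locality query. It is a toy model written for this report, not the hwloc API: the `Obj` class, the `common_ancestor` function and the machine layout (2 packages, 2 L3 caches per package, 2 cores per cache) are all assumptions.

```python
class Obj:
    """One node of a toy hardware tree (machine, package, cache, core)."""

    def __init__(self, kind, index, parent=None):
        self.kind, self.index, self.parent = kind, index, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        """Return the chain from this object up to the root, self included."""
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain


def common_ancestor(a, b):
    """Deepest object of the tree that contains both a and b."""
    seen = set(id(node) for node in a.ancestors())
    for node in b.ancestors():
        if id(node) in seen:
            return node
    raise ValueError("objects belong to different trees")


# Hypothetical machine: 2 packages, each with 2 L3 caches, each with 2 cores.
machine = Obj("Machine", 0)
cores = []
for p in range(2):
    package = Obj("Package", p, machine)
    for c in range(2):
        l3 = Obj("L3", 2 * p + c, package)
        for _ in range(2):
            cores.append(Obj("Core", len(cores), l3))

for i, j in ((0, 1), (0, 2), (0, 7)):
    shared = common_ancestor(cores[i], cores[j])
    print(f"cores {i} and {j} share {shared.kind} #{shared.index}")
```

The deeper the common ancestor, the more hardware the two cores share, which is the kind of information a scheduler or an MPI library can use when binding tasks.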
- News of the Year:
In 2024, the support for heterogeneous memory was further improved to ease the selection of best memory targets in a more portable way. Support for GPUs from several vendors was also enhanced. Newest discovery and binding features in different operating systems were also leveraged in hwloc. Many internal changes were implemented to prepare the 3.0 major release in 2025.
- URL:
- Publications:
- Contact: Brice Goglin
- Participants: Samuel Thibault, Brice Goglin, an anonymous participant
- Partners: Open MPI consortium, Intel, AMD, IBM, Eviden
7.1.2 Hsplit
- Name: Hardware communicators split
- Keyword: Topology
- Scientific Description:
Hsplit is a library that implements an abstraction allowing programmers who use MPI in their parallel applications to access the underlying hardware structure through a hierarchy of communicators. Hsplit is based on the MPI_Comm_split_type routine and provides a new value for the split_type argument that specifically creates a hierarchy of subcommunicators where each new subcommunicator corresponds to a meaningful hardware level. The important point is that only the structure of the hardware is exploited, and neither the number of levels nor the level names are fixed, so as to propose a solution independent from future hardware evolutions (such as new levels, for instance). Another flavor of this MPI_Comm_split_type function is provided that creates a roots communicator at the same time as a subcommunicator is produced, in order to ease collective communication and/or synchronization among subcommunicators.
- Functional Description:
Hsplit implements an abstraction that allows programmers who use MPI in their parallel applications to access the underlying hardware structure through a hierarchy of communicators. Hsplit is based on the MPI_Comm_split_type routine and provides a new value for the split_type argument that specifically creates a hierarchy of subcommunicators where each new subcommunicator corresponds to a meaningful hardware level. The important point is that only the structure of the hardware is exploited, and neither the number of levels nor the level names are fixed, so as to propose a solution independent from future hardware evolutions (such as new levels, for instance). Another flavor of this MPI_Comm_split_type function is provided that creates a roots communicator at the same time as a subcommunicator is produced, in order to ease collective communication and/or synchronization among subcommunicators.
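The idea of a hierarchy of subcommunicators with unnamed levels can be sketched without MPI. The code below only simulates the grouping semantics on lists of ranks; it is not Hsplit and it makes no MPI call. The functions `split` and `hierarchy`, the `location` table and the 8-rank layout (2 nodes, 2 sockets per node, 2 ranks per socket) are assumptions made for illustration.

```python
def split(comm, color_of):
    """Group ranks by color, as a communicator split would.

    comm is a list of ranks; ranks with the same color end up in the same
    subcommunicator and keep their relative order.
    """
    groups = {}
    for rank in comm:
        groups.setdefault(color_of(rank), []).append(rank)
    return list(groups.values())


def hierarchy(comm, location, depth=0):
    """Recursively split comm along every hardware level that separates ranks.

    location[rank] is a tuple of indices, outermost hardware level first.
    Levels carry no name and their number is not fixed in advance.
    Returns a list of (depth, subcommunicator, roots) tuples, where roots
    holds the first rank of each sibling subcommunicator.
    """
    if len(comm) <= 1 or depth == len(location[comm[0]]):
        return []
    subs = split(comm, lambda rank: location[rank][depth])
    if len(subs) == 1:
        # This level does not separate the ranks: skip it.
        return hierarchy(comm, location, depth + 1)
    roots = [sub[0] for sub in subs]
    levels = []
    for sub in subs:
        levels.append((depth, sub, roots))
        levels.extend(hierarchy(sub, location, depth + 1))
    return levels


# Hypothetical placement: rank -> (node index, socket index within the node).
location = {rank: (rank // 4, (rank // 2) % 2) for rank in range(8)}
levels = hierarchy(list(range(8)), location)
for depth, sub, roots in levels:
    print(depth, sub, roots)
```

In this simulation, the roots list plays the role of the roots communicator: it gathers one representative per subcommunicator, which is what makes a two-step collective (inside each subcommunicator, then among roots) easy to express.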
- URL:
- Publications:
- Contact: Guillaume Mercier
- Participants: Guillaume Mercier, Brice Goglin, Emmanuel Jeannot, Thibaut Pepin
7.1.3 TopoMatch
- Keyword: High performance computing
- Scientific Description:
TopoMatch embeds a set of algorithms to map processors/cores in order to minimize the communication cost of the application.
Important features are: the number of processors can be greater than the number of application processes; it assumes that the topology is a tree and does not require valuation of the topology (e.g. communication speeds); it implements different placement algorithms that are switched according to the input size.
Some core algorithms are parallel to speed up the execution. TopoMatch optionally embeds Scotch for fixed-vertex mapping, enables exhaustive search if required, computes several mapping metrics, allows for oversubscribing of resources, and is multithreaded.
TopoMatch is integrated into various software such as the Charm++ programming environment as well as in both major open-source MPI implementations: Open MPI and MPICH2.
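The placement problem described above can be illustrated with a small sketch. It is not TopoMatch's algorithm: it simply enumerates all placements, which is only viable for a handful of processes, and it assumes a cost model (communication volume multiplied by the number of tree levels to climb) that we chose for illustration. The function names and the 4-leaf tree are hypothetical.

```python
from itertools import permutations


def tree_distance(leaf_a, leaf_b, arity):
    """Number of tree levels to climb before two leaves share an ancestor.

    Leaves are numbered left to right; arity[i] is the branching factor at
    height i above the leaves.
    """
    hops = 0
    for a in arity:
        if leaf_a == leaf_b:
            break
        leaf_a //= a
        leaf_b //= a
        hops += 1
    return hops


def mapping_cost(comm, placement, arity):
    """Sum of communication volume times tree distance over process pairs."""
    n = len(comm)
    return sum(
        comm[i][j] * tree_distance(placement[i], placement[j], arity)
        for i in range(n) for j in range(i + 1, n)
    )


def best_mapping(comm, arity):
    """Exhaustive search over placements of processes onto tree leaves."""
    leaves = 1
    for a in arity:
        leaves *= a
    return min(
        permutations(range(leaves), len(comm)),
        key=lambda placement: mapping_cost(comm, placement, arity),
    )


# Symmetric communication matrix: processes 0 and 2 exchange a lot of data,
# and so do processes 1 and 3.
comm = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]
arity = [2, 2]  # hypothetical tree: 2 cores per socket, 2 sockets
identity = (0, 1, 2, 3)
best = best_mapping(comm, arity)
print(mapping_cost(comm, identity, arity), best, mapping_cost(comm, best, arity))
```

On this input the default placement costs 46 while the best one costs 28, because heavily communicating processes end up under the same socket. Note that the search uses only the shape of the tree, not link speeds, in line with the "no valuation of the topology" feature.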
- Functional Description:
TreeMatch is a library for performing process placement based on the topology of the machine and the communication pattern of the application.
- URL:
- Publication:
- Contact: Emmanuel Jeannot
- Participants: Emmanuel Jeannot, François Tessier, Guillaume Mercier, 2 anonymous participants
7.1.4 NewMadeleine
- Name: NewMadeleine: An Optimizing Communication Library for High-Performance Networks
- Keywords: High-performance calculation, MPI communication
- Functional Description:
NewMadeleine is the fourth incarnation of the Madeleine communication library. The new architecture aims at enabling the use of a much wider range of communication flow optimization techniques. Its design is entirely modular: drivers and optimization strategies are dynamically loadable software components, allowing experimentations with multiple approaches or on multiple issues with regard to processing communication flows.
The optimizing scheduler SchedOpt targets applications with irregular, multi-flow communication schemes such as found in the increasingly common application conglomerates made of multiple programming environments and coupled pieces of code, for instance. SchedOpt itself is easily extensible through the concepts of optimization strategies (what to optimize for, what the optimization goal is) expressed in terms of tactics (how to optimize to reach the optimization goal). Tactics themselves are made of basic communication flows operations such as packet merging or reordering.
The communication library is fully multi-threaded through its close integration with PIOMan. It manages concurrent communication operations from multiple libraries and from multiple threads. Its MPI implementation MadMPI fully supports the MPI_THREAD_MULTIPLE multi-threading level.
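Packet merging, one of the basic communication flow operations mentioned above, can be sketched as follows. This is a simplified illustration, not NewMadeleine code: the function `merge_packets`, the `max_size` threshold and the sample payloads are assumptions, and a real implementation would also add headers so that the receiver can separate the merged messages.

```python
def merge_packets(pending, max_size):
    """Aggregate queued packets per destination, up to max_size bytes each.

    pending is a list of (destination, payload) tuples in submission order.
    Order is preserved within each destination.
    """
    merged = []      # output list of (destination, payload)
    open_slot = {}   # destination -> index in merged of its open packet
    for dest, payload in pending:
        slot = open_slot.get(dest)
        if slot is not None and len(merged[slot][1]) + len(payload) <= max_size:
            # Small enough: append to the packet already open for this peer.
            merged[slot] = (dest, merged[slot][1] + payload)
        else:
            # Start a new packet for this destination.
            open_slot[dest] = len(merged)
            merged.append((dest, payload))
    return merged


pending = [(1, b"aa"), (2, b"bbbb"), (1, b"cc"), (1, b"dddddd"), (2, b"ee")]
wire = merge_packets(pending, max_size=8)
print(len(pending), "->", len(wire), wire)
```

Here five submitted packets become three packets on the wire, which reduces the per-packet overhead when many small messages target the same peer.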
- URL:
- Publications: inria-00127356, inria-00177230, inria-00177167, inria-00327177, inria-00224999, inria-00327158, tel-00469488, hal-02103700, inria-00381670, inria-00408521, hal-00793176, inria-00586015, inria-00605735, hal-00716478, hal-01064652, hal-01087775, hal-01395299, hal-01587584, hal-02103700, hal-02407276, hal-03012097, hal-03118807
- Contact: Alexandre Denis
- Participants: Alexandre Denis, 6 anonymous participants
7.1.5 IOPS
- Name: IOPS - A generic benchmark orchestration framework
- Keywords: I/O, HPC, Benchmarking
- Functional Description:
IOPS is a benchmark orchestration framework for automated parametric experiments. Users define YAML configurations specifying parameters to sweep, commands to execute, and metrics to extract. IOPS generates execution plans, runs tests locally or on SLURM clusters, and produces interactive HTML reports. It supports exhaustive, random, and Bayesian search strategies, with execution caching to avoid redundant tests. While originally designed for I/O characterization, it now serves as a general-purpose tool for systematic exploration of configuration spaces.
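The combination of an exhaustive parameter sweep and execution caching can be sketched in a few lines. This is not IOPS code and does not reflect its YAML schema: the parameter names, the `fake_benchmark` stand-in and the in-memory cache are assumptions chosen to keep the example self-contained.

```python
from itertools import product


def execution_plan(parameters):
    """Exhaustive sweep: one dict per point of the Cartesian product."""
    names = sorted(parameters)
    return [dict(zip(names, values))
            for values in product(*(parameters[n] for n in names))]


def run_all(plan, run_one, cache):
    """Run each configuration once; reuse cached metrics on repeats."""
    results = []
    for config in plan:
        key = tuple(sorted(config.items()))
        if key not in cache:
            cache[key] = run_one(config)
        results.append((config, cache[key]))
    return results


# Hypothetical parameter space; the names are illustrative only.
parameters = {"nodes": [1, 2, 4], "block_size_kib": [64, 1024]}
calls = []


def fake_benchmark(config):
    """Stand-in for launching a real command and parsing its output."""
    calls.append(config)
    return config["nodes"] * config["block_size_kib"]


cache = {}
plan = execution_plan(parameters)
first = run_all(plan, fake_benchmark, cache)
second = run_all(plan, fake_benchmark, cache)  # served entirely from the cache
print(len(plan), len(calls))  # prints "6 6"
```

The second pass triggers no new run: the number of benchmark calls stays equal to the number of distinct configurations, which is the point of caching when a sweep is repeated or extended.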
- Contact: Luan Teylo Gouveia Lima
- Participants: Luan Teylo Gouveia Lima, Francieli Zanon-Boito, Mahamat Younous Abdraman
7.1.6 AGIOS
- Name: Application-guided I/O Scheduler
- Keywords: High-Performance Computing, Scheduling
- Functional Description:
AGIOS is a user-level I/O request scheduling library that works at file level. Any service that handles requests to files (parallel file system clients and/or data servers, I/O forwarding frameworks, etc.) may use the library to schedule these requests. AGIOS provides multiple scheduling algorithms, including dynamic options that change algorithms during the execution. It is also capable of providing many statistics in general and per file, such as average offset distance and time between requests. Finally, it may be used to create text-format traces.
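As an illustration of file-level request scheduling, the sketch below reorders a queue by file and offset, merges contiguous requests, and computes a per-file statistic. It does not use the AGIOS API and does not reproduce any of its algorithms; the function names and the definition of offset distance used here (mean absolute gap between the offsets of consecutive arrivals) are assumptions.

```python
def aggregate(requests):
    """Sort requests by (file, offset) and merge contiguous ones.

    Each request is a (file, offset, size) tuple.
    """
    merged = []
    for name, offset, size in sorted(requests):
        if merged:
            last_name, last_offset, last_size = merged[-1]
            if name == last_name and last_offset + last_size == offset:
                merged[-1] = (name, last_offset, last_size + size)
                continue
        merged.append((name, offset, size))
    return merged


def average_offset_distance(requests, name):
    """Mean absolute gap between offsets of consecutive arrivals for a file."""
    offsets = [off for f, off, _ in requests if f == name]
    gaps = [abs(b - a) for a, b in zip(offsets, offsets[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


queue = [("a.dat", 8192, 4096), ("b.dat", 0, 4096), ("a.dat", 0, 4096),
         ("a.dat", 4096, 4096), ("b.dat", 16384, 4096)]
schedule = aggregate(queue)
print(schedule)
print(average_offset_distance(queue, "a.dat"))
```

The three out-of-order requests to a.dat become a single 12 KiB access, while the two requests to b.dat stay separate because they are not contiguous.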
- URL:
- Publications:
- Contact: Francieli Zanon-Boito
- Participants: Luan Teylo Gouveia Lima, 2 anonymous participants
7.1.7 SCOTCH
- Name: Scotch / PT-Scotch
- Keywords: Mesh partitioning, Domain decomposition, Graph algorithmics, High-performance calculation, Sparse matrix ordering, Static mapping
- Scientific Description:
The aim of the Scotch project is to tackle the problems of partitioning and mapping very large graphs, by way of algorithms that rely only on graph topology, and to devise efficient shared-memory, distributed-memory, and hybrid parallel algorithms for this purpose.
- Functional Description:
Scotch is a graph partitioner. It helps optimise the division of a problem, by means of a graph, into a set of independent sub-problems of equivalent sizes. These sub-problems can also be solved in parallel.
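What a graph partitioner is asked to produce can be shown on a tiny example. The sketch below is not Scotch and does not use the libSCOTCH interface: it grows one part by breadth-first search until it holds half of the vertices, then measures the edge cut, a common quality metric that we assume here. The ladder graph and the function names are ours.

```python
from collections import deque


def bfs_bisection(adj, seed):
    """Grow part 0 from seed by BFS until it holds half of the vertices."""
    target = len(adj) // 2
    part = {v: 1 for v in adj}
    queue, seen = deque([seed]), {seed}
    grown = 0
    while queue and grown < target:
        v = queue.popleft()
        part[v] = 0
        grown += 1
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return part


def edge_cut(adj, part):
    """Number of edges whose two ends lie in different parts."""
    return sum(1 for v in adj for w in adj[v] if v < w and part[v] != part[w])


# Ladder graph: top row 0-1-2-3, bottom row 4-5-6-7, one rung per column.
adj = {0: [1, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [2, 7],
       4: [0, 5], 5: [1, 4, 6], 6: [2, 5, 7], 7: [3, 6]}

greedy = bfs_bisection(adj, seed=0)
left_right = {v: 0 if v % 4 < 2 else 1 for v in adj}  # split between columns
print(edge_cut(adj, greedy), edge_cut(adj, left_right))  # prints "4 2"
```

Both partitions are perfectly balanced (4 vertices each), yet the greedy one cuts 4 edges where the column split cuts only 2. Closing this kind of gap on very large graphs is what dedicated partitioning algorithms are designed for.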
- Release Contributions:
SCOTCH has many interesting features:
- Its capabilities can be used through a set of stand-alone programs as well as through the libSCOTCH library, which offers both C and Fortran interfaces.
- It provides algorithms to partition graph structures, as well as mesh structures defined as node-element bipartite graphs and which can also represent hypergraphs.
- The SCOTCH library dynamically takes advantage of POSIX threads to speed-up its computations. The PT-SCOTCH library, used to manage very large graphs distributed across the nodes of a parallel computer, uses the MPI interface as well as POSIX threads.
- It can map any weighted source graph onto any weighted target graph. The source and target graphs may have any topology, and their vertices and edges may be weighted. Moreover, both source and target graphs may be disconnected. This feature allows for the mapping of programs onto disconnected subparts of a parallel architecture made up of heterogeneous processors and communication links.
- It computes amalgamated block orderings of sparse matrices, for efficient solving using BLAS routines.
- Its running time is linear in the number of edges of the source graph, and logarithmic in the number of vertices of the target graph for mapping computations.
- It can handle indifferently graph and mesh data structures created within C or Fortran programs, with array indices starting from 0 or 1.
- It offers extended support for adaptive graphs and meshes through the handling of disjoint edge arrays.
- It is dynamically parametrizable thanks to strategy strings that are interpreted at run-time.
- It uses system memory efficiently, to process large graphs and meshes without incurring out-of-memory faults.
- It is highly modular and documented. Since it has been released under the CeCILL-C free/libre software license, it can be used as a testbed for the easy and quick development and testing of new partitioning and ordering methods.
- It can be easily interfaced to other programs.
- It provides many tools to build, check, and display graphs, meshes and matrix patterns.
- It is written in C and uses the POSIX interface, which makes it highly portable.
- News of the Year:
The Member's contract for the Scotch Consortium has been finalized. A full-time core software engineer has been hired.
- URL:
- Publications: hal-04404141, hal-01671156, hal-01968358, hal-00648735, tel-00540581, hal-00301427, hal-00402893, tel-00410402, hal-00402946, hal-00410408, hal-00410427
- Contact: François Pellegrini
- Participants: François Pellegrini, Marc Fuentes, Clément Barthelemy, Xavier Muller, 6 anonymous participants
- Partners: Université de Bordeaux, IPB, CNRS, Region Aquitaine
7.1.8 Raisin
- Keywords: Hypergraph, Partitioning, Graph algorithmics, Static mapping, FPGA
- Functional Description:
Raisin is a multi-valued oriented hypergraph partitioning software whose objective function is to minimize the length of the longest path between some types of vertices while limiting the number of cut hyper-arcs.
-
Release Contributions:
Raisin has been designed to solve the problem of circuit placement onto multi-FPGA architectures. It models the circuit to map as a set of red-black, directed, acyclic hypergraphs (DAHs). Hypergraph vertices can be either red vertices (which represent registers and external I/O ports) or black vertices (which represent internal combinatorial circuits). Vertices bear multiple weights, which define the types of resources needed to map the circuit (e.g., registers, ALUs, etc.). Every hyper-arc comprises a unique source vertex, all other ends of the hyper-arc being sinks (which models the transmission of signals through circuit wiring). A circuit is consequently represented as a set of DAHs that share some of their red vertices.
Target architectures are described by their number of target parts, the maximum resource capacities within each target part, and the connectivity between target parts.
The main metric to minimize is the length of the longest path between two red vertices, that is, the critical path that signals have to traverse during a circuit compute cycle, which correlates to the maximum frequency at which the circuit can operate on the given target architecture.
Raisin computes a partition in which resource capacity constraints are respected and the critical path length is kept as small as possible, while reducing the number of cut hyper-arcs. It produces an assignment list, which describes, for each vertex of the hypergraphs, the part to which the vertex is assigned.
Raisin has many interesting features:
- It can map any weighted source circuit (represented as a set of red-black DAHs) onto any weighted target graph.
- It is based on a set of graph algorithms, including a multi-level scheme and local optimization methods of the “Fiduccia-Mattheyses” kind.
- It contains two greedy initial partitioning algorithms that have a computation time that is linear in the number of vertices. Each algorithm can be used for a particular type of topology, which can make them both complementary and efficient, depending on the problem instances.
- It takes advantage of the properties of DAHs to model path lengths with a weighting scheme based on the computation of local critical paths. This weighting scheme makes it possible to constrain the clustering algorithms so that they achieve better results in less time.
- It can combine several of its algorithms to create dedicated mapping strategies, suited to specific types of circuits.
- It provides many tools to build, check and convert red-black DAHs to other hypergraph and graph formats.
- It is written in C.
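The two quantities that Raisin trades off, critical path length between red vertices and number of cut hyper-arcs, can be illustrated on a toy instance. The following Python sketch is not Raisin's code: the function names are hypothetical, and the cost model (one unit per traversed arc, plus `cut_delay` when the arc crosses two parts) is an assumption made for the example.

```python
from functools import lru_cache


def critical_path(arcs, red, part, cut_delay=1):
    """Longest red-to-red path length in a red-black DAH.

    arcs: list of (source, [sinks]) hyper-arcs
    red:  set of red vertices (paths start and stop at red vertices)
    part: dict mapping every vertex to its part
    Assumed cost model: 1 per arc traversal, plus cut_delay across parts.
    """
    succ = {}
    for src, sinks in arcs:
        succ.setdefault(src, []).extend(sinks)

    @lru_cache(maxsize=None)
    def longest_to_red(v):
        # Longest distance from v to the next red vertex, or None if none is reachable.
        best = None
        for w in succ.get(v, []):
            cost = 1 + (cut_delay if part[v] != part[w] else 0)
            if w in red:
                cand = cost
            else:
                tail = longest_to_red(w)
                cand = None if tail is None else cost + tail
            if cand is not None and (best is None or cand > best):
                best = cand
        return best

    lengths = [longest_to_red(r) for r in red]
    return max((l for l in lengths if l is not None), default=0)


def cut_hyperarcs(arcs, part):
    """Number of hyper-arcs with at least one sink outside the part of their source."""
    return sum(1 for src, sinks in arcs
               if any(part[s] != part[src] for s in sinks))


arcs = [("r1", ["a"]), ("a", ["b", "r2"]), ("b", ["r2"])]
red = {"r1", "r2"}
same_part = {"r1": 0, "a": 0, "b": 0, "r2": 0}
split = {"r1": 0, "a": 0, "b": 1, "r2": 0}
print(critical_path(arcs, red, same_part), cut_hyperarcs(arcs, same_part))
print(critical_path(arcs, red, split), cut_hyperarcs(arcs, split))
```

Moving the black vertex `b` to another part both lengthens the critical path and cuts two hyper-arcs, which is the kind of degradation a partitioner of this type tries to avoid.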
- Publications:
-
Contact:
Julien Rodriguez
-
Participants:
François Pellegrini, Julien Rodriguez, 2 anonymous participants
7.1.9 CORHPEX
-
Name:
COmpiler, Runtime and Hardware Parameter EXplorer
-
Keywords:
Energy, Performance, Optimization, Design space exploration
-
Scientific Description:
COmpiler, Runtime and Hardware Parameter EXplorer (CORHPEX), is a framework to explore performance optimization spaces for HPC applications.
-
Functional Description:
CORHPEX enables application developers to discover the influence of configurations of hardware, compilers and run-time options on optimization targets such as performance and energy consumption by proposing the efficient collection and exploration of metrics in application design spaces.
-
Release Contributions:
First public release
- URL:
- Publications:
-
Contact:
Mihail Popov
-
Participants:
Mihail Popov, 4 anonymous participants
-
Partner:
IFPEN
7.1.10 CERE
-
Name:
Codelet Extractor and REplayer
-
Keywords:
Checkpointing, Profiling
-
Functional Description:
CERE finds and extracts the hotspots of an application as isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. Code isolation reduces benchmarking cost and allows piecewise optimization of an application.
- URL:
-
Contact:
Mihail Popov
-
Participant:
Mihail Popov
-
Partners:
Université de Versailles St-Quentin-en-Yvelines, Exascale Computing Research
7.2 New platforms
7.2.1 PlaFRIM
Participants: Brice Goglin.
-
Name:
Plateforme Fédérative pour la Recherche en Informatique et Mathématiques
- Website:
-
Description:
PlaFRIM is an experimental platform for research in modeling, simulations and high performance computing. This platform was set up starting in 2009 under the leadership of Inria Bordeaux Sud-Ouest, in collaboration with the computer science and mathematics laboratories, respectively LaBRI and IMB, and with strong support from the Aquitaine region.
It aggregates different kinds of computational resources for research and development purposes. The latest technologies in terms of processors, memories and architecture are added when they are available on the market. As of 2023, it contains more than 6,000 cores, 50 GPUs and several large memory nodes that are available for all research teams of Inria Bordeaux, LaBRI and IMB.
Brice Goglin has been in charge of PlaFRIM since June 2021.
7.2.2 Abaca
Participants: Brice Goglin.
-
Name:
Abaca
- Website:
-
Description:
Abaca is Inria's mutualized computing infrastructure. It gathers computing resources from Inria research centers across the country into a uniform software environment. The platform currently contains more than 15 000 CPU cores and 600 000 GPU cores.
Brice Goglin has been a member of the executive committee since 2021, and the Product Owner since June 2025.
7.3 Open data
LLM4DiCE dataset
-
Contributors:
Asia Auville, Tim Jammer, Eric Petit, Pablo de Oliveira Castro, Emmanuelle Saillard, and Mihail Popov
-
Description:
Dataset containing both real-world programs and mutants. Mutants are created from the original files by inserting MPI errors using a mutation tool developed in collaboration with Technische Universität Darmstadt. The files are scraped from popular MPI projects on GitHub.
- Dataset PID (DOI,...):
- Project link:
-
Publications:
under review.
-
Contact:
asia.auville@inria.fr
I/O Traces from SDumont and PlaFRIM
-
Contributors:
Francieli Boito, Luan Teylo, Mihail Popov, Theo Jolivel, François Tessier, Jakob Luettgau, Julien Monniot, Ahmad Tarraf, André Carneiro, and Carla Osthoff.
-
Description:
This data set contains the PlaFRIM and Santos Dumont traces we made available after the study from the “A Deep Look Into the Temporal I/O Behavior of HPC Applications” paper, including all code used to analyze them.
- Dataset PID (DOI,...):
- Project link:
- Publications:
-
Contact:
Luan Teylo, luan.teylo@inria.fr
8 New results
8.1 Predicting and Fixing Errors in Parallel Applications with AI
Participants: Asia Auville, Emmanuelle Saillard, Mihail Popov.
Determining whether parallel applications are correct is a very challenging task. Yet, recent progress in ML and text embedding shows promising results in characterizing source code or the compiler intermediate representation to identify optimizations. We propose to transpose such characterization methods to the context of verification. In particular, we train ML models that take code correctness as labels and intermediate representations or source code embeddings as features. Results over small MPI verification benchmarks including MBI and DataRaceBench demonstrate that we can train models that detect if a code is correct with 90% accuracy and up to 75% over new unseen errors. This work 37 is a collaboration with Iowa State University.
In the context of Asia Auville's Ph.D. thesis, we are investigating the prediction capabilities of ML models to detect and fix errors in real-world applications. By crawling GitHub repositories and applying compiler mutations, we are creating complex application examples with MPI errors to train detection models. This work, in collaboration with the University of Versailles, Intel, and Technische Universität Darmstadt, is currently under review.
8.2 Optimizing Performance and Energy with AI Guided Exploration
Participants: Lana Scravaglieri, Mihail Popov, Pierre Clouzet, Laércio Lima Pilla, Noureddine Tamssaout.
HPC systems expose configuration options that help users optimize their applications' execution. Questions related to the best thread and data mapping, number of threads, or cache prefetching have been posed for different applications, yet they have been mostly limited to a single optimization objective (e.g., performance) and a fixed application problem size. Unfortunately, optimization strategies that work well in one scenario may generalize poorly when applied in new contexts.
In the context of Lana Scravaglieri's Ph.D. thesis and in collaboration with IFP Energies nouvelles (IFPEN), we carried this research further by focusing on the exploration of SIMD transformations over carbon storage applications. To do so, we are designing a more general exploration infrastructure, CORHPEX, that can easily incorporate more diverse optimization knobs and applications. This work was awarded the IPDPS'25 Best Open-Source Contribution Award. CORHPEX was also used on Software Defined Radio (SDR) applications in collaboration with Inria Topal, demonstrating the potential performance and energy gains when considering the target hardware 31.
Finally, in collaboration with the University of Uppsala, we are also investigating the interaction of hardware prefetching with the new hybrid architectures (Intel’s Efficiency- and Performance-cores or ARM big.LITTLE). Preliminary results 38, 36 showcase how energy gains can be achieved by tuning the system to the applications. This work is currently under review. We plan to further pursue the heterogeneous capabilities by adding memory effects to the AI-based exploration.
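When several objectives are considered at once, as with performance and energy above, a configuration is usually kept only if no other configuration is at least as good on both objectives. The following Python sketch shows such a Pareto filter. It is an illustration only: the function name, the configuration labels and the numbers are hypothetical and do not come from CORHPEX or from any measurement.

```python
def pareto_front(points):
    """Keep the configurations that are not dominated on (time, energy).

    points: list of (config, time, energy); lower is better for both.
    A point is dominated if another point is no worse on both objectives
    and strictly better on at least one.
    """
    front = []
    for cfg, t, e in points:
        dominated = any(
            t2 <= t and e2 <= e and (t2 < t or e2 < e)
            for _, t2, e2 in points
        )
        if not dominated:
            front.append((cfg, t, e))
    return front


# Hypothetical (configuration, time in s, energy in J) triples.
measurements = [
    ("8 threads, prefetch on", 1.0, 3.0),
    ("4 threads, prefetch on", 2.0, 2.0),
    ("2 threads, prefetch on", 3.0, 3.0),
    ("4 threads, prefetch off", 2.0, 1.0),
]
for cfg, t, e in pareto_front(measurements):
    print(cfg, t, e)
```

The fastest configuration and the most energy-efficient one both survive, while the two others are discarded, which is why a single-objective search may miss useful trade-offs.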
8.3 IOPS: I/O Performance Evaluation Suite
Participants: Mahamat Abdraman, Francieli Boito, Mihail Popov, Luan Teylo.
In high-performance computing, I/O operations can become a bottleneck when dealing with large-scale data processing tasks. In these systems, where files are distributed by a parallel file system (PFS) across multiple object storage targets (OSTs), the performance of the I/O operations is influenced by several interdependent parameters, such as the number of computing nodes, network performance, the number of OSTs, and the access pattern of the application, among others. Therefore, understanding how the combination of these parameters affects I/O performance is crucial for identifying anomalies, verifying expected performance from a storage system, and improving application I/O performance. In this work 18, we present IOPS, a tool that allows users to profile an HPC I/O infrastructure. IOPS is designed to be easy to use, flexible, and extensible. We demonstrate the capabilities of IOPS by evaluating the performance of BeeGFS, a popular parallel file system, using different configurations and access patterns. The results show that IOPS can automate the process of finding the best parameter combinations.
In addition to the paper presenting IOPS, in 2025 we extended it to cover other benchmarking situations, such as generating data sets for training AI models (see Section 8.4) and evaluating interference. Moreover, we explored the use of Bayesian Optimization to navigate a large space of parameters.
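The idea of searching for the best parameter combination can be sketched as an exhaustive sweep over a small parameter space. The Python code below is a minimal illustration and not the IOPS implementation: `sweep` and `toy_bandwidth` are hypothetical names, and the bandwidth model is an invented stand-in for a real benchmark run.

```python
from itertools import product


def sweep(benchmark, space):
    """Run `benchmark` on every combination of `space`, best result first.

    space: dict mapping a parameter name to the list of values to try.
    Returns a list of (score, configuration) sorted by decreasing score.
    """
    names = list(space)
    results = []
    for values in product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        results.append((benchmark(**cfg), cfg))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


def toy_bandwidth(nodes, osts, pattern):
    # Invented model, NOT a measurement: bandwidth (GB/s) is capped by the
    # client side or by the storage side, and random accesses halve it.
    bw = min(2.0 * nodes, 1.5 * osts)
    return bw if pattern == "sequential" else bw / 2


space = {"nodes": [1, 2, 4], "osts": [1, 2, 4],
         "pattern": ["sequential", "random"]}
best_bw, best_cfg = sweep(toy_bandwidth, space)[0]
print(best_bw, best_cfg)
```

An exhaustive sweep grows multiplicatively with the number of parameters, which motivates guided searches such as the Bayesian Optimization mentioned above.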
8.4 Prediction of HPC I/O Resources Usage with Machine Learning
Participants: Mahamat Abdraman, Laora Aimi, Francieli Boito, Mihail Popov, Luan Teylo.
During previous work on heuristics for allocation of I/O resources to HPC applications 34, we observed that the best algorithm requires knowing the number of resources that maximizes application I/O performance. Nonetheless, this information is not typically available, and obtaining it would involve running the application multiple times with multiple configurations. Instead, we focused on finding a good estimate of the number of I/O resources (e.g., OSTs and I/O nodes) that provides the maximal bandwidth while minimizing the system occupation and taking into account the natural I/O variability. We used machine learning techniques to do so, focusing on intrinsic application features and system configurations. Preliminary results were published as a pre-print on HAL in 2024 33.
In 2025, during the internship of Laora Aimi, we continued this work by improving the methodology for the evaluation of models' accuracy and further studying the importance of different parameters. Furthermore, we investigated approaches to decrease the amount of data required for training the model, and improved the IOPS tool so it can be used to obtain such data. This work is expected to result in a publication in 2026.
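The quantity to be estimated, the smallest number of I/O resources that still delivers close to the maximal bandwidth, can be made concrete with a short Python sketch. This is only one possible definition, written under stated assumptions: the function name and the 5% tolerance are hypothetical, averaging repeated runs is a crude way to absorb I/O variability, and the sketch works from measurements whereas the work described above predicts the value with machine learning.

```python
from statistics import mean


def smallest_sufficient(samples, tolerance=0.05):
    """Smallest resource count whose mean bandwidth is within `tolerance`
    of the best mean bandwidth observed.

    samples: dict mapping a resource count (e.g., number of OSTs) to the
    list of bandwidths measured over repeated runs.
    """
    means = {n: mean(bw) for n, bw in samples.items()}
    best = max(means.values())
    return min(n for n, m in means.items() if m >= (1 - tolerance) * best)


# Hypothetical bandwidths in GB/s for 1, 2, 4 and 8 OSTs.
samples = {1: [1.0, 1.1], 2: [2.0, 2.1], 4: [3.0, 3.1], 8: [3.1, 3.1]}
print(smallest_sufficient(samples))
```

In this invented example, going from 4 to 8 OSTs brings almost no gain, so 4 is the count that maximizes bandwidth while limiting system occupation.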
8.5 A Deep Look Into the Temporal I/O Behavior of HPC Applications
Participants: Francieli Boito, Mihail Popov, Luan Teylo.
We studied the temporal I/O behavior of over 440,000 jobs running on four HPC systems, all different in terms of infrastructure, scale, and users, covering several time periods over the last 11 years. The data we analyzed came either from parallel file systems (system-side traces) or from I/O monitoring tools (application-side traces).
The aim of analyzing these traces is to provide an in-depth study of data accesses by HPC applications in the wild. We have thus identified and addressed a number of questions dealing with the temporality of I/Os, their periodicity, the existence and prevalence of certain patterns, I/O concurrency between applications or user practices. We also proposed a classification of temporal I/O behaviors, which shows a few patterns are able to represent a vast majority of jobs. Overall, the results of this study provide relevant information for anyone working to improve high-performance I/O. They also serve as a basis for future research into both behavior detection tools and the use of trace analysis, particularly for scheduling and application optimization.
This work was submitted in 2024 and published at IPDPS'2025 9 (an extended version was published as a technical report 27) and is the result of a collaboration between Inria Bordeaux, Inria Rennes, the Technical University of Darmstadt, and the National Laboratory for Scientific Computing (LNCC).
We are currently extending our AI analysis on the dataset by considering more automated techniques. Our goal is to scale the I/O trace pattern detection to ease any new optimization.
8.6 On the Impact of Interference from Concurrent Jobs on Checkpointing Performance
Participants: Francieli Boito, Brice Goglin, Luan Teylo, Méline Trochon.
Among the most I/O-intensive operations in HPC systems is checkpointing, which is necessary to save the state of an application and allow it to be restarted at an advanced stage of computation. However, near the parallel file system, concurrency prevents checkpoint phases from reaching the best I/O performance. In this work, we studied I/O interference in this specific context: we looked at the performance of a checkpoint phase when faced with different interference patterns, exploring aspects such as scale, number of processes, operation, number of files, etc. Through extensive experimentation on two systems, we showed the impact of these aspects on checkpointing. Moreover, we showed that some configurations (namely, an application that performs random accesses) lead to degraded system I/O performance. This work was published as a pre-print in 2025 28 and is expected to result in a publication in 2026, as it provides important background for any effort to mitigate I/O interference.
8.7 Checkpointing Optimisation to Prepare Future Exascale Plasma Turbulence Simulations
Participants: Méline Trochon.
The advent of exascale computing has revolutionized high-performance computing (HPC) and enabled unprecedented advancements in nuclear fusion research. Simulations of plasma turbulence dynamics, such as the GYSELA code, now achieve unparalleled precision and complexity. However, this progress is accompanied by significant challenges in managing the exponential growth of data generated by these simulations. Traditional input/output (I/O) methods struggle to handle the massive data volumes, heightened concurrency, and fault-tolerance requirements inherent to exascale systems. This work 17 investigated the I/O bottlenecks inherent in exascale computing, with a particular focus on the checkpointing mechanisms of GYSELA. These mechanisms are critical for ensuring fault tolerance and must handle several terabytes of data efficiently to avoid undermining computational performance. We analyze the current implementation of GYSELA’s checkpointing mechanism managed via the PDI data interface, identifying its limitations and proposing two alternative approaches aimed at enhancing scalability and resilience. Experiments conducted on pre-exascale architectures validate the efficiency of these methods through both strong and weak scaling benchmarks.
We reduced the checkpointing execution time by a factor of four, achieving near-optimal bandwidth utilisation, and we have identified implementations well-suited for exascale architectures. Our findings suggest the potential for notable performance improvements and offer insights that could help optimise I/O operations in exascale simulations.
This work was started during Méline Trochon's internship at CEA, and will be continued in 2026. Indeed, GYSELA is one of the demonstrator applications of the NumPEx Exa-DoST project.
8.8 A Weighted Bi-objective Strategy for Executing Scientific Workflows in Containerized Environments
Participants: Wesley Ferreira, Liliane Kunstmann, Yuri Frota, Luan Teylo, Daniel de Oliveira.
Scientific workflows support the execution of complex simulation-based experiments across heterogeneous computing environments. Containerization technologies, such as Docker, improve portability by encapsulating tasks together with their dependencies. However, they also introduce challenges in resource management, as containers incur additional memory and CPU overhead and may execute concurrently on the same virtual or physical machine. These challenges are particularly critical in memory-constrained environments, where inefficient scheduling can lead to performance degradation or even task failures. To address this issue, we propose a weighted bi-objective scheduling strategy that considers memory consumption and execution time, allowing users to prioritize one objective or achieve a balance between the two. Experimental evaluations with both synthetic and real-world workflows demonstrate that our approach enhances performance and resource utilization.
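A weighted bi-objective choice of this kind can be sketched as a weighted sum of normalized objectives. The Python code below is an illustration under stated assumptions and not the strategy published in the paper: the function name, the normalization by the maximum, and the candidate values are all hypothetical.

```python
def pick_candidate(candidates, weight_time=0.5):
    """Pick the candidate minimizing a weighted sum of time and memory.

    candidates: list of (name, exec_time, memory)
    weight_time: 1.0 favours execution time only, 0.0 favours memory only.
    Each objective is normalized by its maximum over the candidates
    (assumption of this sketch) so that both are comparable.
    """
    max_t = max(c[1] for c in candidates)
    max_m = max(c[2] for c in candidates)

    def score(c):
        return weight_time * c[1] / max_t + (1 - weight_time) * c[2] / max_m

    return min(candidates, key=score)


# Hypothetical placements: (name, execution time in s, memory in GB).
candidates = [("vm_a", 10, 8), ("vm_b", 20, 2)]
print(pick_candidate(candidates, weight_time=1.0)[0])  # time only
print(pick_candidate(candidates, weight_time=0.0)[0])  # memory only
print(pick_candidate(candidates, weight_time=0.5)[0])  # balanced
```

Changing the weight lets the user move between the two extremes, which matches the prioritization described above.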
This study 11 was done in collaboration with the Federal Fluminense University in Brazil, through the Equipe Associée DecoHPC, and was published in the Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD), 2025.
8.9 Towards a Novel Vertical Scaling Approach for Bursty Workloads in Kubernetes
Participants: Miguel De Lima, Luan Teylo, Lúcia Drummond.
Traditional static computational resource allocation in cloud or on-premises clusters often results in inefficient overprovisioning. Users frequently lack precise knowledge of the memory and processors their applications require, leading them to request excess resources. This causes wasted capacity, higher costs, and, in shared environments, longer queue waiting times. Dynamic resource allocation through autoscaling addresses this issue by adjusting resources at runtime. Kubernetes, a widely used container orchestration platform, supports autoscaling via Horizontal and Vertical Pod Autoscalers. However, its default restart-based scaling can disrupt stateful, long-running workloads without checkpointing. This work leverages Kubernetes’ new in-place scaling, which resizes resources without restarts, to propose the Dynamic Resizing Strategy (DRS), a novel autoscaling approach that proactively manages contention by temporarily throttling co-located pods to prioritize a bursting application. We evaluate it with NAS Parallel Benchmarks and synthetic workloads in co-execution scenarios, showing improved efficiency and stability, increasing success rates and reducing global average wait time by over 18% compared to the Burstable QoS class.
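The principle of temporarily throttling co-located pods in favour of a bursting one can be illustrated with a small resource computation. The Python sketch below is not the DRS algorithm and does not use the Kubernetes API: the function name, the proportional reclaiming rule and the per-pod `floor` are assumptions made for the example.

```python
def throttle_for_burst(capacity, limits, bursting, demand, floor=0.1):
    """Return new CPU limits that try to give `demand` CPUs to `bursting`.

    capacity: total CPUs of the node
    limits:   dict pod -> current CPU limit (assumed to sum to <= capacity)
    Free capacity is used first; the remainder is reclaimed from the other
    pods proportionally to what they hold above `floor`.
    """
    free = capacity - sum(limits.values())
    needed = demand - limits[bursting] - free
    new = dict(limits)
    if needed <= 0:
        new[bursting] = demand
        return new
    reclaimable = {p: max(l - floor, 0.0)
                   for p, l in limits.items() if p != bursting}
    total = sum(reclaimable.values())
    taken = min(needed, total)
    for p, r in reclaimable.items():
        new[p] = limits[p] - (taken * r / total if total else 0.0)
    new[bursting] = limits[bursting] + free + taken
    return new


limits = {"a": 1.0, "b": 1.5, "c": 1.5}
print(throttle_for_burst(4.0, limits, bursting="a", demand=3.0, floor=0.5))
```

Restoring the original limits once the burst is over would complete the picture; that step is omitted here.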
This study 14 was done in collaboration with the Federal Fluminense University in Brazil, through the Equipe Associée DecoHPC, and was published in the 18th IEEE/ACM International Conference on Utility and Cloud Computing, 2025.
8.10 Network Topology Reconstruction
Participants: Brice Goglin, Guillaume Mercier, Thibaut Pépin.
With the increase in size and complexity of supercomputers, it has become crucial to match applications and communication libraries to the underlying physical topology. While new functionalities were recently added to the MPI standard regarding the access of topological information of computing nodes of the system, there is still a lack of tools to retrieve the network topology information.
We previously implemented the prototype for a tool allowing the reconstruction of the network topology using latency measurements 15. We continued to build upon this tool by improving our runtime MPI rank attribution library. The purpose of this library is to better define the execution environment of a succession of communications and to use this information to improve the performance of MPI collective communications. To this end, we defined a set of metrics that complement the information commonly used by collective communication algorithms. Our first implementation validated the usefulness of these metrics for improving data locality as well as communication load balancing. Experimental results using our tool show a great potential for performance improvements for a string of communications. The analysis of the results is still ongoing.
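To give an intuition of how latency measurements can reveal topology, the following Python sketch groups nodes whose pairwise latency falls below a threshold, on the assumption that such nodes sit behind the same switch. It is a simplified illustration and not the method of the tool mentioned above: the function name, the single threshold and the latency values are hypothetical.

```python
def group_by_latency(latency, threshold):
    """Group node indices whose pairwise latency is below `threshold`.

    latency: symmetric matrix (list of lists) of measured latencies.
    Uses a union-find structure; returns a sorted list of sorted groups.
    """
    n = len(latency)
    parent = list(range(n))

    def find(x):
        # Find the representative of x, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if latency[i][j] < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical latencies in microseconds between four nodes.
latency = [
    [0.0, 1.0, 2.5, 2.5],
    [1.0, 0.0, 2.5, 2.5],
    [2.5, 2.5, 0.0, 1.0],
    [2.5, 2.5, 1.0, 0.0],
]
print(group_by_latency(latency, threshold=2.0))
```

Repeating the grouping with increasing thresholds would yield successive layers of a hierarchy, at the cost of having to choose those thresholds from noisy measurements.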
We also started to adapt node-aware topological collective communication algorithms to use the network topology information. This adaptation requires shifting from a balanced topology (the node) to an unbalanced one (the network), as well as handling multiple topology layers.
Last, regarding dissemination, we presented our work on runtime rank attribution at the COMPAS conference in Bordeaux in July 2025, and our article on network topology reconstruction was accepted for the HPCAsia conference, which will take place in January 2026.
8.11 A novel interface to enforce mapping policies
Participants: Guillaume Mercier.
We are taking part in the design and development of the QuoVadis software, which aims at providing application developers with an interface to easily enforce mapping policies for processes and threads. Indeed, in a context of multi-kernel applications where various policies could be applied to improve performance, dynamically changing such policies is currently difficult and awkward. QuoVadis' goal is to help hybrid applications make efficient use of heterogeneous hardware, to ease programmability in the presence of multiple programming abstractions, and to enable portability across systems. QuoVadis' core interface is based on a split-like operation for processes that partitions hardware resources into an arbitrary number of pieces and assigns processes to these pieces, enabling concurrency and avoiding resource interference. Similar operations are also available for threads 30 (OpenMP and POSIX threads are both supported), enabling hybrid applications to fully take advantage of managing resources at a level of abstraction that computational scientists can employ rapidly. The QuoVadis thread interface features similar semantics (and syntax) to the process interface, allowing users to leverage a single-semantics model for partitioning and assignment of resources to workers. When combined with application-specific heuristics, QuoVadis enforces tailored execution policies by using dynamic hardware affinities exposed through straightforward stack semantics (push/pop). An arbitrary number of binding policies can be stacked, corresponding to the stacked composition of coupled components in a QuoVadis-enabled application.
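The split operation and the push/pop stack semantics described above can be mimicked with a toy model. The Python sketch below is not the QuoVadis API: the function `split`, the class `BindingStack` and the representation of hardware resources as a list of core identifiers are all hypothetical.

```python
def split(resources, pieces):
    """Partition a list of resource ids into `pieces` near-equal contiguous chunks."""
    base, extra = divmod(len(resources), pieces)
    chunks, start = [], 0
    for i in range(pieces):
        size = base + (1 if i < extra else 0)
        chunks.append(resources[start:start + size])
        start += size
    return chunks


class BindingStack:
    """Stack of bindings: the top of the stack is the binding in force."""

    def __init__(self, resources):
        self._stack = [list(resources)]

    def current(self):
        return self._stack[-1]

    def push(self, resources):
        # Enter a component that wants its own binding policy.
        self._stack.append(list(resources))

    def pop(self):
        # Leave the component and restore the previous binding.
        if len(self._stack) == 1:
            raise IndexError("cannot pop the initial binding")
        return self._stack.pop()


cores = list(range(8))
pieces = split(cores, 3)          # one piece per worker
binding = BindingStack(cores)
binding.push(pieces[0])           # a coupled component narrows its binding
print(binding.current())
binding.pop()                     # back to the enclosing policy
print(binding.current())
```

Because each component only pushes and pops its own binding, nested components can be composed without knowing each other's policies.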
8.12 Performance Projection for Design-Space Exploration on future HPC Architectures
Participants: Clément Gavoille, Brice Goglin, Emmanuel Jeannot.
To address the growing need for performance from future HPC machines, their processor designs are constantly evolving. Assessing the impact of changes in hardware, software stack, and applications on performance is crucial in a codesign process. Here, we propose a performance projection workflow to facilitate the initial exploration of design space for multicore nodes and multi-threaded applications. For this purpose, we analyze the architectural efficiency of an accessible source machine and determine the maximum sustainable flop/s performance of a hypothetical target machine based on its software stack on a per-thread basis. Finally, we use these characterizations to project the performance evolution from the source machine to the target machine.
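The projection step can be illustrated with a deliberately simplified calculation. The Python sketch below is not the workflow of the paper: it assumes that the efficiency measured on the source machine (achieved flop/s divided by its maximum sustainable flop/s) carries over unchanged to the target machine, and the function name and all numbers are hypothetical.

```python
def project_time(time_src, flop_count, sustained_src, sustained_tgt):
    """Project an execution time from a source to a target machine.

    time_src:      measured time on the source machine (s)
    flop_count:    floating-point operations executed by the code
    sustained_src: maximum sustainable flop/s of the source machine
    sustained_tgt: maximum sustainable flop/s of the target machine
    Assumption: the architectural efficiency is the same on both machines.
    """
    achieved = flop_count / time_src
    efficiency = achieved / sustained_src
    return flop_count / (efficiency * sustained_tgt)


# Hypothetical kernel: 1e12 flop in 10 s on a 200 Gflop/s source machine,
# projected onto a 400 Gflop/s target machine.
print(project_time(10.0, 1e12, 2e11, 4e11))
```

Under this assumption the projected time simply scales with the ratio of sustainable performance; a realistic projection has to characterize efficiency per thread and per code region.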
In this work, we assess the strengths and weaknesses of our approach by integrating it into the Fugaku-Next Feasibility Study. We compare the accuracy and overhead of our approach with the gem5 cycle-level simulations and a fast exploration methodology based on Machine Code Analyzer (MCA), using NAS Parallel benchmarks and CCS-QCD, a quantum chromodynamics miniapp. The study demonstrates that, compared to gem5, our approach has a prediction deviation of 5% for most cases and up to 30% for extreme cases. Additionally, it exhibits an execution overhead an order of magnitude bigger than MCA but orders of magnitude smaller than gem5.
Finally, we demonstrate our approach's capability to study larger scale and more representative applications than gem5, such as QWS and Genesis, two applications of RIKEN optimized for Fugaku.
This work 12 was performed in collaboration with CEA/DAM and RIKEN.
8.13 User-space interrupts for HPC communications
Participants: Alexandre Denis, Brice Goglin, Charles Goedefroit.
In HPC, networks are programmed directly from user space, since system calls have a significant cost on low-latency networks. Usually, the user performs polling: the network is polled at regular intervals to check whether a new message has arrived. However, polling wastes resources. Another solution is to rely on interrupts instead of polling, but since interrupts are managed by the kernel, they involve the very system calls we want to avoid.
Intel introduced user-level interrupts on its latest Sapphire Rapids CPUs, making it possible to use interrupts from user space. These user-space interrupts may be a viable alternative to polling, since they provide interrupts without the cost of system calls.
We have extended 20, 13 user-space interrupts to be able to trigger them from a device and not only from the CPU. We work with Eviden on their BXI network to make it trigger user-space interrupts so as to benefit from uintr in inter-node communications.
8.14 Interrupt-safe data structures
Participants: Alexandre Denis, Charles Goedefroit.
With the addition of interrupt-based communication in NewMadeleine, synchronization issues have emerged in some data structures. NewMadeleine relies on lock-free queues for a lot of its activities: progression through Pioman, submission queue, completion queue, deferred tasks. However, our implementation of lock-free queues was not non-blocking and was not suitable for use in an interrupt handler.
Other implementations found in the literature target scalability but exhibit high latency in the uncontended case. We have shown that, since the latencies of networks and queues differ by several orders of magnitude, even highly contended network operations do not impose a high pressure on queues.
We have proposed 10 a new non-blocking queue algorithm that is optimized for low contention, while degrading nicely in case of higher contention. We have shown that it exhibits the best performance in NewMadeleine when compared to 15 other queue designs on four different architectures.
8.15 Management of InfiniBand memory registration with StarPU/NewMadeleine
Participants: Alexandre Denis, Tanguy Chatelain, Tristan Riehs.
Until now, StarPU allocated memory to receive messages when tasks were submitted. This may cause needlessly large memory consumption. A better strategy would be to allocate memory just in time, when the sender is ready to send data.
However, networks used in HPC, like InfiniBand, are programmed from user space, and thus most of the time require memory to be registered; this is a costly operation that involves a system call. If we allocate memory just in time, memory registration delays the actual posting of the receive operation.
We have proposed 29, 19 to perform memory registration in advance, to remove it from the communication critical path: on the sender side, when we begin computation for a task, we know that we will have to send its data later; we register memory upfront at the beginning of computation, before the data is even available (we only need the pointer). At the same time, when a task starts, the sender sends a request to the receiver so that it may allocate and register the buffer for the receive operation, so that allocation and registration are completed when the message is ready to be sent. We have observed that these mechanisms save a lot of memory, thus making it possible to run larger datasets, while improving performance.
8.16 Composability of drivers and strategies in NewMadeleine
Participants: Alexandre Denis, Gael Valade.
The NewMadeleine communication library is built with software components, making it modular. Components are used for drivers and for the optimizing strategy. However, strategies are built as a single monolithic component. Therefore, the user has to choose between the strategies that implement message aggregation, priority-based scheduling or multi-rail, but cannot compose them.
We have worked 32 on a new structure that dispatches these features into virtual driver components, so as to make them composable.
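The benefit of expressing strategies as virtual drivers can be illustrated with wrappers that expose the same interface as the driver they wrap. The Python sketch below is a toy model and not NewMadeleine code: the class names, the `send`/`flush` interface and the batching rule are hypothetical.

```python
class Driver:
    """Stand-in for a real network driver: it only records what it sends."""

    def __init__(self):
        self.sent = []

    def send(self, packet):
        self.sent.append(packet)

    def flush(self):
        pass


class Aggregating:
    """Virtual driver that batches messages before passing them on."""

    def __init__(self, inner, batch=2):
        self.inner, self.batch, self.pending = inner, batch, []

    def send(self, packet):
        self.pending.append(packet)
        if len(self.pending) >= self.batch:
            self.flush()

    def flush(self):
        if self.pending:
            self.inner.send(tuple(self.pending))
            self.pending = []
        self.inner.flush()


class MultiRail:
    """Virtual driver that distributes packets over several rails in turn."""

    def __init__(self, rails):
        self.rails, self.next = rails, 0

    def send(self, packet):
        self.rails[self.next].send(packet)
        self.next = (self.next + 1) % len(self.rails)

    def flush(self):
        for rail in self.rails:
            rail.flush()


# Aggregate first, then spread the aggregated packets over two rails.
rails = [Driver(), Driver()]
stack = Aggregating(MultiRail(rails), batch=2)
for msg in "abcd":
    stack.send(msg)
stack.flush()
print(rails[0].sent, rails[1].sent)
```

Since every layer has the same interface, the order can also be reversed (one aggregating layer per rail under a multi-rail layer) without changing any class, which is what a monolithic strategy cannot offer.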
8.17 Improvement of the usability of Scotch and PT-Scotch
Participants: Clément Barthélemy, Marc Fuentes, Xavier Muller, François Pellegrini.
The Scotch software has undergone continuous development. A first axis of work concerned the implementation of a full set of multilevel vertex bipartitioning algorithms that aim at minimizing the vertex cut (i.e., the separator size), while also balancing the weights of halo vertices across both subdomains. This work, as a continuation of Astrid Casadei's work, described in her Ph.D. thesis, aims at reducing communication overhead when solving sparse linear systems.
A second axis of work concerned the improvement of the edge partitioning/mapping of huge meshes representing cardiac tissue, in the context of the MICROCARD-2 Euro-HPC project. In this project, elements bearing the same tags must always be placed in the same parts. This result has been achieved by designing and implementing multi-threaded centralized and distributed graph quotienting routines, seen as extensions of preexisting graph coarsening algorithms, which coalesce all vertices bearing the same tag into a single weighted vertex of a coarser graph, which is then partitioned. This method yielded the expected result, i.e., well-balanced partitions enforcing the placement constraint.
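The quotienting step can be sketched on a small graph. The Python code below is a sequential illustration and not the multi-threaded or distributed Scotch routines: the function name and the data layout (edge list, tag dictionary) are hypothetical, and unit weights are assumed for the original vertices and edges.

```python
def quotient_graph(edges, tags):
    """Coalesce all vertices bearing the same tag into one weighted vertex.

    edges: list of (u, v) undirected edges of the fine graph
    tags:  dict vertex -> tag
    Returns (vertex_weights, edge_weights) of the coarser graph, where a
    coarse vertex weight counts the fine vertices it contains and a coarse
    edge weight counts the fine edges it replaces.
    """
    vweight = {}
    for v, t in tags.items():
        vweight[t] = vweight.get(t, 0) + 1

    eweight = {}
    for u, v in edges:
        a, b = tags[u], tags[v]
        if a == b:
            continue  # edge internal to a coarse vertex: it disappears
        key = (a, b) if a <= b else (b, a)
        eweight[key] = eweight.get(key, 0) + 1
    return vweight, eweight


edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
tags = {0: "x", 1: "x", 2: "y", 3: "z"}
print(quotient_graph(edges, tags))
```

Partitioning the coarser graph and projecting the result back to the fine vertices guarantees that vertices with the same tag end up in the same part, while the vertex weights keep the partition balanced.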
All these improvements are already available to scientific partners in pre-release versions of Scotch, and will be provided to the global community in future public releases. Several bugfix releases took place in the meantime, up to version v7.0.10.
Also, version v1.0.0 of ScotchPy, a Python wrapper for Scotch and PT-Scotch, has been publicly released.
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
CEA
Participants: Clément Gavoille, Brice Goglin, Guillaume Mercier, Thibaut Pépin.
- CEA/DAM granted the funding of the PhD thesis of Thibaut Pépin on communication on modular supercomputer architectures.
- CEA/DAM granted the funding of the PhD thesis of Clément Gavoille, defended in January, which led to a publication with RIKEN 12.
ATOS/Bull/Eviden
Participants: Quentin Buot, Alexandre Denis, Brice Goglin, Emmanuel Jeannot, Guillaume Mercier, Richard Sartori.
- ATOS/Bull/Eviden is funding the CIFRE PhD thesis of Charles Goedefroit on Delivering Userspace Interrupts from the BXI network interface.
IFPEN
Participants: Mihail Popov, Lana Scravaglieri.
- IFPEN funded the PhD thesis of Lana Scravaglieri on the design of models to optimize numerical simulations by adjusting the programs to the underlying HPC systems.
DDN
Participants: Francieli Boito, Brice Goglin, Méline Trochon.
- DDN is funding the thesis of Méline Trochon (CIFRE) on improving checkpointing mechanisms in HPC to prevent network and I/O contention. She is advised by Francieli Boito and Brice Goglin in Bordeaux, François Tessier from Inria Rennes, and Jean-Thomas Acquaviva from DDN.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program
DECoHPC
Participants: Luan Teylo, Francieli Boito.
- Title: Data movement, Energy COnsumption and performance in High-Performance Computing
- Partner Institution(s):
  - National Laboratory for Scientific Computing (LNCC), Brazil
  - Federal Fluminense University (UFF), Brazil
  - Federal University of Rio Grande do Sul (UFRGS), Brazil
  - Federal Center for Technological Education of Rio de Janeiro (CEFET-RJ), Brazil
- Date/Duration: from 2024 to 2026
- Website:
- Additional info/keywords:
Supercomputers were conceived to efficiently run traditional HPC applications, namely numerical simulations. However, in the context of the convergence between HPC and big data, their workload is becoming more heterogeneous. In this new scenario, efficient application execution becomes more challenging. Moreover, energy consumption has emerged as an important concern for HPC and computer science in general. First, with the effects of climate change, environmental concerns have become a major focus across various scientific fields. Second, as more and more exascale machines emerge, the energy budget has become one of the main concerns for these machines, driven not only by environmental considerations but also by economic ones.
The previous HPCProSol associate team (2021–2023) provided us with performance insights about two kinds of representative applications running on the Santos Dumont system at LNCC (the largest supercomputer in Latin America): finite element methods (HPC) and bioinformatics workflows (HPDA). Moreover, we collaborated on advancing the system’s monitoring infrastructure by developing software to efficiently process the data it collects. Now, in the DECoHPC associate team, we aim to take these insights and tools and extend them towards our three main goals:
- (WP1) Based on Santos Dumont’s traces (recently made available), to obtain a holistic view of the I/O behavior of HPC applications. We want to classify applications according to their behaviors and to the different demands they place on the system.
- (WP2) To study and characterize the energy consumption of moving applications’ data through the network and I/O infrastructure.
- (WP3) To characterize the I/O performance and energy consumption of AI applications, which were not explored in HPCProSol, but are now among the most important users of HPC facilities.
10.2 International research visitors
10.2.1 Visits of international scientists
In the context of the DECoHPC Associate Team, in 2025 the team received the visits of:
10.3 European initiatives
10.3.1 H2020 projects
EUPEX
Participants: Brice Goglin.
- EUPEX: European Pilot for Exascale
- Program: H2020 EuroHPC
- Grant Agreement number: 101033975 – H2020-JTI-EuroHPC-2020-01
- 2022-2026
- Partners: Atos, FZJ, CEA, GENCI, CINECA, E4, ICS-FORTH, Cini National Lab, ECMWF, IT4I, FER, ParTec, EXAPSYS, INGV, Goethe University, SECO, CybeleTech
- The EUPEX pilot brings together academic and commercial stakeholders to co-design a European modular Exascale-ready pilot system. Together, they will deploy a pilot hardware and software platform integrating the full spectrum of European technologies, and will demonstrate the readiness and scalability of these technologies, and particularly of the Modular Supercomputing Architecture (MSA), towards Exascale.
EUPEX’s ambition is to support actively the European industrial ecosystem around HPC, as well as to prepare applications and users to efficiently exploit future European exascale supercomputers.
- Website: eupex.eu
- TADaaM funding: 150k€
10.4 National initiatives
InriaSoft: Scotch Consortium
Participants: François Pellegrini, Clément Barthélemy.
- Scotch Consortium
- Program: InriaSoft
- 2024–
- Website: gitlab.inria.fr/scotch/scotch
- Coordinator: François Pellegrini
- Abstract: The Scotch Consortium, supported by InriaSoft, has been created to bring together organizations interested in furthering the Scotch software currently developed within the TADaaM project. It will take care of the sustainability and development of the Scotch software environment, sharing the governance between its members. It will also allow every member to participate in the software roadmap, and to get adequate support. It will ensure Scotch stays permanently maintained, and available to the worldwide community under a free/libre software license.
While the consortium has not officially been launched, Inria has started populating the Scotch consortium engineering team by agreeing to hire a full-time core software engineer. Clément Barthélemy was recruited and started working on September 1, joining Marc Fuentes, the part-time environment software engineer.
Numpex PC2: Exa-SofT
Participants: Alexandre Denis.
- Exa-SofT: HPC softwares and tools
- Program: project PC2 in PEPR Numpex
- 2023-2029
- Partners: Université Paris-Saclay, Telecom SudParis, Bordeaux INP, ENSIIE, Université de Bordeaux, Université de Grenoble-Alpes, Université de Rennes 1, Université de Strasbourg, Université de Toulouse, CEA, CNRS, Inria.
- Website: numpex.org/exasoft-hpc-software-and-tools
- Coordinator: Raymond Namyst (Storm)
- Abstract: Though significant efforts have been devoted to the implementation and optimization of several crucial parts of a typical HPC software stack, most HPC experts agree that exascale supercomputers will raise new challenges, mostly because the trend in exascale compute-node hardware is toward heterogeneity and scalability: compute nodes of future systems will have a combination of regular CPUs and accelerators (typically GPUs), along with a diversity of GPU architectures. Meeting the needs of complex parallel applications and the requirements of exascale architectures raises numerous challenges which are still left unaddressed. As a result, several parts of the software stack must evolve to better support these architectures. More importantly, the links between these parts must be strengthened to form a coherent, tightly integrated software suite. Our project aims at consolidating the exascale software ecosystem by providing a coherent, exascale-ready software stack featuring breakthrough research advances enabled by multidisciplinary collaborations between researchers. The main scientific challenges we intend to address are: productivity, performance portability, heterogeneity, scalability and resilience, performance and energy efficiency.
Numpex PC3: Exa-DoST
Participants: Francieli Boito, Emmanuel Jeannot, Luan Teylo.
- Exa-DoST: Data-oriented Software and Tools for the Exascale
- Program: project PC3 in PEPR Numpex
- 2023-2029
- Partners: Université Paris-Saclay, Telecom SudParis, Bordeaux INP, ENSIIE, Université de Bordeaux, Université de Grenoble-Alpes, Université de Rennes 1, Université de Strasbourg, Université de Toulouse, CEA, CNRS, Inria.
- Website: numpex.org/exadost-data-oriented-software-and-tools-for-the-exascale/
- Coordinator: Gabriel Antoniu (KerData)
- Abstract: The advent of future Exascale supercomputers raises multiple data-related challenges. To enable applications to fully leverage the upcoming infrastructures, a major challenge concerns the scalability of techniques used for data storage, transfer, processing and analytics. Additional key challenges emerge from the need to adequately exploit emerging technologies for storage and processing, leading to new, more complex storage hierarchies. Finally, it now becomes necessary to support more and more complex hybrid workflows involving at the same time simulation, analytics and learning, running at extreme scales across supercomputers interconnected to clouds and edge-based systems. The Exa-DoST project will address most of these challenges, organized in 3 areas: 1. Scalable storage and I/O; 2. Scalable in situ processing; 3. Scalable smart analytics. As part of the NumPEx program, Exa-DoST will address the major data challenges by proposing operational solutions co-designed and validated in French and European applications. This will allow filling the gap left by previous international projects to ensure that French and European needs are taken into account in the roadmaps for building the data-oriented Exascale software stack.
Inria Exploratory Action
Participants: Asia Auville, Emmanuelle Saillard, Mihail Popov.
- Title: Large Language Models for Detection and Correction of Errors
- Website: LLM4DiCE
- 2024 - 2027 (36 months)
- Coordinator: Emmanuelle Saillard and Mihail Popov
- Abstract: Large Language Models (LLMs) are a hot and rapidly evolving research topic. In particular, their recent successes in summarization, question-answering, and code generation with AI pair programming make them attractive candidates in the field of error verification. We propose to harness these LLM capabilities by fine-tuning on carefully generated datasets, built through a novel clustering strategy based on Natural Language Processing (NLP) techniques and code embedding, to assist bug detection and correction, targeting hard domains such as parallel program verification.
10.5 Public policy support
François Pellegrini was short-listed by a jury set up by the European Commission for the position of chair of the European Data Protection Supervisor, the independent data protection authority that supervises the use of personal data by the bodies of the European Union. He was subsequently heard before the European Parliament and the European Council. Due to a lack of agreement between these two institutions, the hiring process is currently on hold.
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
Member of the organizing committees
- Francieli Boito is Deputy chair for the Workshops part of ISC 2026, and vice-chair for the Workshops part of IPDPS 2026. In both cases, the work started in 2025.
11.1.2 Scientific events: selection
Chair of conference program committees
- Mihail Popov was co-chair for the HiPC 2025 System Software (High Performance Computing) track.
Member of the conference program committees
- Francieli Boito was a member of the program committees of HPCAsia 2025, ICPP 2025, Cluster 2025, HiPC 2025, Rex-IO (a workshop held with Cluster) 2025, and PDSW (a workshop held with Supercomputing) 2025.
- Brice Goglin was a member of the program committees of Cluster 2025 and ICPP 2025.
- Mihail Popov was a member of the program committees of IPDPSW GrAPL 2025 and ISCW 2025.
- Guillaume Mercier was a member of the program committees of EuroMPI 2025, HiPC 2025 and ICPP 2025.
- Luan Teylo was a member of the program committees of CCGRID 2025, PMBS 2025, WAMCA 2025, ISPDC 2025, ESSA 2025, HiPC 2025 and SBAC 2025.
- Alexandre Denis was a member of the program committees of APDCM 2025 and HiPC 2025.
11.1.3 Journal
- Mihail Popov was a reviewer for TPDS, JPDC, Journal of Cloud Computing, and Journal of Supercomputing.
- Guillaume Mercier served as a reviewer for the Journal of Parallel and Distributed Computing.
- Luan Teylo was a reviewer for TPDS.
Reviewer - reviewing activities
- Francieli Boito was an external reviewer for IPDPS 2025.
- Guillaume Mercier served as an external reviewer for SC 2025.
- Luan Teylo was an external reviewer for SC 2025 and Cluster 2025.
- Alexandre Denis was an external reviewer for Cluster 2025.
11.1.4 Invited talks
- Francieli Boito gave an invited talk at the NHR Conference (a German national event) in September 2025 — “Investigating the temporal I/O behavior of HPC applications”.
- Francieli Boito gave a keynote during the ESSA workshop (held with IPDPS) in June 2025 — “Improving I/O Resource Usage in HPC”.
- Brice Goglin was invited to give a talk at the French Academy of Sciences in May as a follow-up to the 2024 "Innovation" award from the Académie des sciences, Dassault Systèmes and Inria.
11.1.5 Scientific expertise
- Brice Goglin was a member of the Khronos OpenCL Advisory Panel as well as the Unified Acceleration Foundation (formerly oneAPI) Hardware Abstraction SIG.
- Brice Goglin contributes expert assessments of HPC projects in Africa with IRD and AFD.
11.1.6 Research administration
- Francieli Boito has been a member of the council of the SIN department of the University of Bordeaux since 2022.
- Brice Goglin has been the product owner of Inria's nationwide computing infrastructure, Abaca, since June 2025.
- Brice Goglin is in charge of the computing infrastructures of the Inria Bordeaux research center.
11.1.7 Standardization Activities
Participation in the MPI Forum
- TADaaM attended the MPI Forum meetings on behalf of Inria (the MPI Forum is where the MPI standard for communication in parallel applications is developed and maintained). Guillaume Mercier leads the Topologies working group. He participates in several other working groups (Hybrid WG, ABI WG) and is also an editor of the MPI Standard, as a member of several chapter committees (Contexts, Topologies and Info object). He also serves as the Context chapter committee chair. This year, version 5.0 of the standard was approved by the MPI Forum. Its major addition is an Application Binary Interface (ABI), which is expected to improve application portability given that several MPI implementations coexist. This ABI support will enable applications to switch (more or less) effortlessly from one implementation to another (or even from one version of an implementation to another).
Participation in the PMIx ASC
- TADaaM is a member of the Administrative Steering Committee of the PMIx standard, which focuses on the orchestration of application launch and execution.
11.2 Teaching - Supervision - Juries
11.2.1 Teaching
Members of the TADaaM project gave hundreds of hours of teaching at Université de Bordeaux and the Bordeaux INP engineering school, covering a wide range of topics from basic use of computers and introduction to algorithms and C programming to advanced topics such as probabilities and statistics, scheduling, computer networks, computer architecture, operating systems, big data, cryptography, parallel programming and high-performance runtime systems, as well as software law and personal data law.
- François Pellegrini gave the introductory lecture of the Numerics graduate program at Université de Bordeaux, on the ethical issues of automated data processing.
- François Pellegrini taught a course in English on “Software Law” and “Personal data law” to 9 PhD students of Université de Bordeaux.
- François Pellegrini taught a new course on personal data and cybersecurity to 35 professionals (lawyers, IT managers, etc.) attending the University Degree (D.U.) of cyber-criminology of Université Bordeaux-Montaigne, with support from Gendarmerie Nationale.
- François Pellegrini taught a new course on "Technological and societal innovations" to the 15 students of the international master "Law for Innovation" of Université de Bordeaux.
- Luan Teylo taught a course on data visualization with Python to undergraduate students from various fields of study at the Université de Bordeaux.
- Alexandre Denis teaches a course on MPI+X at ENSEIRB-MATMECA.
- Mihail Popov is the head of the cryptography and parallel programming courses, both at ENSEIRB-MATMECA.
11.2.2 Supervision
- PhD finished: Lana Scravaglieri, Portable vectorization with numerical accuracy control for multi-precision simulation codes. Advisors: Olivier Aumage, Mihail Popov, Thomas Guignon (IFPEN) and Ani Anciaux-Sedrakian (IFPEN).
- PhD in progress: Asia Auville, Large Language Models for Detection and Correction of Errors in HPC Applications. Advisors: Emmanuelle Saillard, Mihail Popov, Pablo Oliveira (UVSQ) and Eric Petit (Intel).
- PhD in progress: Charles Goedefroit, Delivering userspace interrupts from the BXI network interface. Co-advised with ATOS/Bull/Eviden. Started in March 2024. Advisors: Alexandre Denis and Brice Goglin.
- PhD in progress: Serge Meurrens, Application-aware I/O scheduling in HPC systems. Started in December 2025. Advisors: Francieli Boito, François Tessier (Inria Rennes), and Luan Teylo.
- PhD in progress: Thibaut Pépin, MPI communication on modular supercomputing architectures. Started in May 2023. Advisor: Guillaume Mercier.
- PhD in progress: Tristan Riehs, Integration of communications and task scheduling. Started in October 2025. Advisors: Alexandre Denis, Philippe Swartvagher (TOPAL), Samuel Thibault (Storm).
- PhD in progress: Méline Trochon, Adaptive checkpointing strategies depending on the network load. Started in November 2024. Advisors: Francieli Boito, François Tessier, Brice Goglin and Jean-Thomas Acquaviva (DDN).
11.2.3 Juries
- Francieli Boito was a member of the PhD jury of Adrian KHELILI (University of Paris Saclay).
- Brice Goglin was a reviewer for the PhD theses of Ioannis Vardas (TUWien) and Mickael Boichot (TelecomSudParis and CEA/DAM).
- Brice Goglin was the president of the PhD defense jury of Radjasouria Vinayagame (Univ. Bordeaux and Eviden).
11.3 Popularization
11.3.1 Participation in Live events
- Brice Goglin gave talks about research in computer science and high-performance computing to high-school students as part of the Chiche programme and Circuit Scientifique Bordelais, and about science and research to elementary school classes.
- Brice Goglin and Mihail Popov presented research in HPC and AI to middle school interns.
12 Scientific production
12.1 Major publications
- 1 (inproceedings) Arbitration Policies for On-Demand User-Level I/O Forwarding on HPC Platforms. IPDPS 2021 - 35th IEEE International Parallel and Distributed Processing Symposium, Portland, Oregon / Virtual, United States, May 2021.
- 2 (article) IO-SETS: Simple and efficient approaches for I/O bandwidth management. IEEE Transactions on Parallel and Distributed Systems 34(10), August 2023, 2783-2796.
- 3 (inproceedings) Scalability of the NewMadeleine Communication Library for Large Numbers of MPI Point-to-Point Requests. CCGrid 2019 - 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing, Larnaca, Cyprus, May 2019.
- 4 (article) Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model. IEEE Transactions on Parallel and Distributed Systems 30(6), June 2019, 1374-1389.
- 5 (article) Profiles of upcoming HPC Applications and their Impact on Reservation Strategies. IEEE Transactions on Parallel and Distributed Systems 32(5), May 2021, 1178-1190.
- 6 (article) Hardware topology management in MPI applications through hierarchical communicators. Parallel Computing 76, August 2018, 70-90.
- 7 (article) Optimizing Performance and Energy Across Problem Sizes Through a Search Space Exploration and Machine Learning. Journal of Parallel and Distributed Computing 180, June 2023, 104720.
- 8 (misc) A Deep Look Into the Temporal I/O Behavior of HPC Applications. January 2025.
12.2 Publications of the year
International peer-reviewed conferences
Conferences without proceedings
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications
12.3 Cited publications
- 33 (unpublished) Prediction and Interpretability of HPC I/O Resources Usage with Machine Learning. 2024, working paper or preprint.
- 34 (inproceedings) Scheduling Distributed I/O Resources in HPC Systems. Euro-Par 2024: Parallel Processing: 30th European Conference on Parallel and Distributed Processing, Madrid, Spain, August 26–30, 2024, Proceedings, Part I. Berlin, Heidelberg: Springer-Verlag, 2024, 137–151. URL: https://doi.org/10.1007/978-3-031-69577-3_10
- 35 (article) IO-SETS: Simple and efficient approaches for I/O bandwidth management. IEEE Transactions on Parallel and Distributed Systems 34(10), August 2023, 2783-2796.
- 36 (online) Exploring the Search Space of Hybrid E/P Cores Prefetch Configurations using Machine Learning. 2025, URL: https://www.diva-portal.org/smash/get/diva2:1985357/FULLTEXT02.pdf (visited on 2025-12-30)
- 37 (inproceedings) MPI Errors Detection using GNN Embedding and Vector Embedding over LLVM IR. IPDPS 2024 - 38th International Symposium on Parallel and Distributed Processing, San Francisco, United States, May 2024.
- 38 (online) Hybrid E/P Cores Prefetch Optimization. 2024, URL: https://www.diva-portal.org/smash/get/diva2:1888603/FULLTEXT01.pdf (visited on 2024-12-30)