EN FR
EN FR
ROMA - 2025

2025Activity reportProject-Team‌​‌ROMA

RNSR: 201221200W
  • Research​​ center Inria Lyon Centre​​​‌
  • In partnership with:CNRS,‌ Université Claude Bernard (Lyon‌​‌ 1), Ecole normale supérieure​​ de Lyon
  • Team name:​​​‌ Optimisation des ressources :‌ modèles, algorithmes et ordonnancement‌​‌
  • In collaboration with:Laboratoire​​ de l'Informatique du Parallélisme​​​‌ (LIP)

Creation of the‌ Project-Team: 2015 January 01‌​‌

Each year, Inria research​​​‌ teams publish an Activity​ Report presenting their work​‌ and results over the​​ reporting period. These reports​​​‌ follow a common structure,​ with some optional sections​‌ depending on the specific​​ team. They typically begin​​​‌ by outlining the overall​ objectives and research programme,​‌ including the main research​​ themes, goals, and methodological​​​‌ approaches. They also describe​ the application domains targeted​‌ by the team, highlighting​​ the scientific or societal​​​‌ contexts in which their​ work is situated.

The​‌ reports then present the​​ highlights of the year,​​​‌ covering major scientific achievements,​ software developments, or teaching​‌ contributions. When relevant, they​​ include sections on software,​​​‌ platforms, and open data,​ detailing the tools developed​‌ and how they are​​ shared. A substantial part​​​‌ is dedicated to new​ results, where scientific contributions​‌ are described in detail,​​ often with subsections specifying​​​‌ participants and associated keywords.​

Finally, the Activity Report​‌ addresses funding, contracts, partnerships,​​ and collaborations at various​​​‌ levels, from industrial agreements​ to international cooperations. It​‌ also covers dissemination and​​ teaching activities, such as​​​‌ participation in scientific events,​ outreach, and supervision. The​‌ document concludes with a​​ presentation of scientific production,​​​‌ including major publications and​ those produced during the​‌ year.

Keywords

Computer Science​​ and Digital Science

  • A1.1.1.​​​‌ Multicore, Manycore
  • A1.1.2. Hardware​ accelerators (GPGPU, FPGA, etc.)​‌
  • A1.1.3. Memory models
  • A1.1.4.​​ High performance computing
  • A1.1.5.​​​‌ Exascale
  • A1.1.9. Fault tolerant​ systems
  • A1.6. Green Computing​‌
  • A6.1. Methods in mathematical​​ modeling
  • A6.2.3. Probabilistic methods​​​‌
  • A6.2.5. Numerical Linear Algebra​
  • A6.2.6. Optimization
  • A6.2.7. HPC​‌ for machine learning
  • A6.3.​​ Computation-data interaction
  • A7.1. Algorithms​​​‌
  • A7.1.2. Parallel algorithms
  • A8.1.​ Discrete mathematics, combinatorics
  • A8.2.​‌ Optimization
  • A8.7. Graph theory​​
  • A8.9. Performance evaluation

Other​​​‌ Research Topics and Application​ Domains

  • B3.2. Climate and​‌ meteorology
  • B3.3. Geosciences
  • B4.​​ Energy
  • B4.5.1. Green computing​​​‌
  • B5.2.3. Aviation
  • B5.5. Materials​

1 Team members, visitors,​‌ external collaborators

Research Scientists​​

  • Frédéric Vivien [Team​​​‌ leader, INRIA,​ Senior Researcher, HDR​‌]
  • Suraj Kumar [​​INRIA, ISFP]​​​‌
  • Loris Marchal [CNRS​, Senior Researcher,​‌ HDR]
  • Bora Uçar​​ [CNRS, Senior​​​‌ Researcher, HDR]​

Faculty Members

  • Anne Benoît​‌ [ENS DE LYON​​, Professor, HDR​​​‌]
  • Yves Robert [​ENS DE LYON,​‌ Emeritus, HDR]​​

Post-Doctoral Fellows

  • Antoine Jego​​​‌ [SORBONNE UNIVERSITE]​
  • Maher Mallem [INRIA​‌, Post-Doctoral Fellow]​​

PhD Students

  • Joachim Cendrier​​​‌ [CNRS]
  • Hadi​ Gholami [ENS DE​‌ LYON, from Nov​​ 2025]
  • Damien Lesens​​​‌ [ENS DE LYON​, from Oct 2025​‌]
  • Michel Nicolis [​​INRIA, from Dec​​​‌ 2025]
  • Adrien Obrecht​ [ENS DE LYON​‌, from Sep 2025​​]
  • Felix Wirth-Bonne [​​​‌KOG, CIFRE,​ from Oct 2025]​‌

Interns and Apprentices

  • Damien​​ Lesens [ENS DE​​​‌ LYON, Intern,​ until Jan 2025]​‌

Administrative Assistant

  • Chrystelle Mouton​​ [INRIA]

Visiting​​​‌ Scientists

  • Julien Langou [​UNIV COLORADO, from​‌ Nov 2025]
  • Julien​​ Langou [UNIV COLORADO​​​‌, until Jan 2025​]

External Collaborator

  • Theo​‌ Mary [CNRS]​​

2 Overall objectives

The​​ Roma project aims at​​​‌ designing models, algorithms, and‌ scheduling strategies to optimize‌​‌ the execution of scientific​​ applications.

Scientists now have​​​‌ access to tremendous computing‌ power. For instance, the‌​‌ top supercomputers contain several​​ hundreds of thousands of​​​‌ cores, and edge servers‌ represent many millions of‌​‌ resources. Furthermore, it had​​ never been so easy​​​‌ for scientists to have‌ access to parallel computing‌​‌ resources, either through the​​ multitude of local clusters​​​‌ or through distant cloud‌ computing platforms.

Because parallel‌​‌ computing resources are ubiquitous,​​ and because the available​​​‌ computing power is so‌ huge, one could believe‌​‌ that scientists no longer​​ need to worry about​​​‌ finding computing resources, even‌ less to optimize their‌​‌ usage. Nothing is farther​​ from the truth. Institutions​​​‌ and government agencies keep‌ building larger and more‌​‌ powerful computing platforms with​​ a clear goal. These​​​‌ platforms must allow to‌ solve problems in reasonable‌​‌ timescales, which were so​​ far out of reach.​​​‌ They must also allow‌ to solve problems more‌​‌ precisely where the existing​​ solutions are not deemed​​​‌ to be sufficiently accurate.‌ For those platforms to‌​‌ fulfill their purposes, their​​ computing power must therefore​​​‌ be carefully exploited and‌ not be wasted. This‌​‌ often requires an efficient​​ management of all types​​​‌ of platform resources: computation,‌ communication, memory, storage, energy,‌​‌ etc. This is often​​ hard to achieve because​​​‌ of the characteristics of‌ new and emerging platforms.‌​‌ Moreover, because of technological​​ evolutions, new problems arise,​​​‌ and fully tried and‌ tested solutions need to‌​‌ be thoroughly overhauled or​​ simply discarded and replaced.​​​‌ Here are some of‌ the difficulties that have,‌​‌ or will have, to​​ be overcome:

  • Computing platforms​​​‌ are hierarchical: a processor‌ includes several cores, a‌​‌ node includes several processors,​​ and the nodes themselves​​​‌ are gathered into clusters.‌ Algorithms must take this‌​‌ hierarchical structure into account,​​ in order to fully​​​‌ harness the available computing‌ power;
  • The probability for‌​‌ a platform to suffer​​ from a hardware fault​​​‌ automatically increases with the‌ number of its components.‌​‌ Fault-tolerance techniques become unavoidable​​ for large-scale platforms;
  • The​​​‌ ever increasing gap between‌ the computing power of‌​‌ nodes and the bandwidths​​ of memories and networks,​​​‌ in conjunction with the‌ organization of memories in‌​‌ deep hierarchies, requires to​​ take more and more​​​‌ care of the way‌ algorithms use memory;
  • Energy‌​‌ considerations are unavoidable nowadays.​​ Design specifications for new​​​‌ computing platforms always include‌ a maximal energy consumption.‌​‌ The energy bill of​​ a supercomputer may represent​​​‌ a significant share of‌ its cost over its‌​‌ lifespan. These issues must​​ be taken into account​​​‌ at the algorithm-design level.‌

We are convinced that‌​‌ dramatic breakthroughs in algorithms​​ and scheduling strategies are​​​‌ required for the scientific‌ computing community to overcome‌​‌ all the challenges posed​​ by new and emerging​​​‌ computing platforms. This is‌ required for applications to‌​‌ be successfully deployed at​​ very large scale, and​​​‌ hence for enabling the‌ scientific computing community to‌​‌ push the frontiers of​​ knowledge as far as​​​‌ possible. The Roma project-team‌ aims at providing fundamental‌​‌ algorithms, scheduling strategies, protocols,​​​‌ and software packages to​ fulfill the needs encountered​‌ by a wide class​​ of scientific computing applications,​​​‌ including domains as diverse​ as geophysics, structural mechanics,​‌ chemistry, electromagnetism, numerical optimization,​​ or computational fluid dynamics,​​​‌ to quote a few.​ To fulfill this goal,​‌ the Roma project-team takes​​ a special interest in​​​‌ dense and sparse linear​ algebra.

3 Research program​‌

The work in the​​ Roma team is organized​​​‌ along three research themes.​

3.1 Resilience for very​‌ large scale platforms

For​​ HPC applications, scale is​​​‌ a major opportunity. The​ largest supercomputers contain tens​‌ of thousands of nodes​​ and future platforms will​​​‌ certainly have to enroll​ even more computing resources​‌ to enter the Exascale​​ era. Unfortunately, scale is​​​‌ also a major threat.​ Indeed, even if each​‌ node provides an individual​​ MTBF (Mean Time Between​​​‌ Failures) of, say, one​ century, a machine with​‌ 100,000 nodes will encounter​​ a failure every 9​​​‌ hours in average, which​ is shorter than the​‌ execution time of many​​ HPC applications.

To further​​​‌ darken the picture, several​ types of errors need​‌ to be considered when​​ computing at scale. In​​​‌ addition to classical fail-stop​ errors (such as hardware​‌ failures), silent errors (a.k.a​​ silent data corruptions) must​​​‌ be taken into account.​ The cause for silent​‌ errors may be for​​ instance soft errors in​​​‌ L1 cache, or bit​ flips due to cosmic​‌ radiations. The problem is​​ that the detection of​​​‌ a silent error is​ not immediate, and that​‌ they only manifest later,​​ once the corrupted data​​​‌ has propagated and impacted​ the result.

Our work​‌ investigates new models and​​ algorithms for resilience at​​​‌ extreme-scale. Its main objective​ is to cope with​‌ both fail-stop and silent​​ errors, and to design​​​‌ new approaches that dramatically​ improve the efficiency of​‌ state-of-the-art methods. Application resilience​​ currently involves a broad​​​‌ range of techniques, including​ fault prediction, error detection,​‌ error containment, error correction,​​ checkpointing, replication, migration, recovery,​​​‌ etc. Extending these techniques,​ and developing new ones,​‌ to achieve efficient execution​​ at extreme-scale is a​​​‌ difficult challenge, but it​ is the key to​‌ a successful deployment and​​ usage of future computing​​​‌ platforms.

3.2 Multi-criteria scheduling​ strategies

In this theme,​‌ we focus on the​​ design of scheduling strategies​​​‌ that finely take into​ account some platform characteristics​‌ beyond the most classical​​ ones, namely the computing​​​‌ speed of processors and​ accelerators, and the communication​‌ bandwidth of network links.​​ Our work mainly considers​​​‌ the following two platform​ characteristics:

  • Energy consumption.
    Power​‌ management in HPC is​​ necessary due to both​​​‌ monetary and environmental constraints.​ Using dynamic voltage and​‌ frequency scaling (DVFS) is​​ a widely used technique​​​‌ to decrease energy consumption,​ but it can severely​‌ degrade performance and increase​​ execution time. Part of​​​‌ our work in this​ direction studies the trade-off​‌ between energy consumption and​​ performance (throughput or execution​​​‌ time). Furthermore, our work​ also focuses on the​‌ optimization of the power​​ consumption of fault-tolerant mechanisms.​​​‌ The problem of the​ energy consumption of these​‌ mechanisms is especially important​​ because resilience generally requires​​ redundant computations and/or redundant​​​‌ communications, either in time‌ (re-execution) or in space‌​‌ (replication), and because redundancy​​ consumes extra energy.
  • Memory​​​‌ usage and data movement.‌
    In many scientific computations,‌​‌ memory is a bottleneck​​ and should be carefully​​​‌ considered. Besides, data movements,‌ between main memory and‌​‌ secondary storages (I/Os) or​​ between different computing nodes​​​‌ (communications), are taking an‌ increasing part of the‌​‌ cost of computing, both​​ in term of performance​​​‌ and energy consumption. In‌ this context, our work‌​‌ focuses on scheduling scientific​​ applications described as task​​​‌ graphs both on memory‌ constrained platforms, and on‌​‌ distributed platforms with the​​ objective of minimizing communications.​​​‌ The task-based representation of‌ a computing application is‌​‌ very common in the​​ scheduling literature but meets​​​‌ an increasing interest in‌ the HPC field thanks‌​‌ to the use of​​ runtime schedulers. Our work​​​‌ on memory-aware scheduling is‌ naturally multi-criteria, as it‌​‌ is concerned with both​​ memory consumption, performance and​​​‌ data-movements.

3.3 Sparse direct‌ solvers and sparsity in‌​‌ computing

In this theme,​​ we work on various​​​‌ aspects of sparse direct‌ solvers for linear systems.‌​‌ Target applications lead to​​ sparse systems made of​​​‌ millions of unknowns. In‌ the scope of the‌​‌ PaStiX solver, co-developed with​​ the Inria HiePACS team,​​​‌ there are two main‌ objectives: reducing as much‌​‌ as possible memory requirements​​ and exploiting modern parallel​​​‌ architectures through the use‌ of runtime systems.

A‌​‌ first research challenge is​​ to exploit the parallelism​​​‌ of modern computers, made‌ of heterogeneous (CPUs+GPUs) nodes.‌​‌ The approach consists of​​ using dynamic runtime systems​​​‌ (in the context of‌ the PaStiX solver, Parsec‌​‌ or StarPU) to​​ schedule tasks.

Another important​​​‌ direction of research is‌ the exploitation of low-rank‌​‌ representations. Low-rank approximations are​​ commonly used to compress​​​‌ the representation of data‌ structures. The loss of‌​‌ information induced is often​​ negligible and can be​​​‌ controlled. In the context‌ of sparse direct solvers,‌​‌ we exploit the notion​​ of low-rank properties in​​​‌ order to reduce the‌ demand in terms of‌​‌ floating-point operations and memory​​ usage. To enhance sparse​​​‌ direct solvers using low-rank‌ compression, two orthogonal approaches‌​‌ are followed: (i) integrate​​ new strategies for a​​​‌ better scalability and (ii)‌ use preprocessing steps to‌​‌ better identify how to​​ cluster unknowns, when to​​​‌ perform compression and which‌ blocks not to compress.‌​‌

Combinatorial scientific computing (CSC)​​ is a term for​​​‌ interdisciplinary research at the‌ intersection of discrete mathematics,‌​‌ computer science, and scientific​​ computing. In particular, it​​​‌ refers to the development,‌ application, and analysis of‌​‌ combinatorial algorithms to enable​​ scientific computing applications. CSC’s​​​‌ deepest roots are in‌ the realm of direct‌​‌ methods for solving sparse​​ linear systems of equations​​​‌ where graph theoretical models‌ have been central to‌​‌ the exploitation of sparsity,​​ since the 1960s. The​​​‌ general approach is to‌ identify performance issues in‌​‌ a scientific computing problem,​​ such as memory use,​​​‌ parallel speed up, and/or‌ the rate of convergence‌​‌ of a method, and​​ to develop combinatorial algorithms​​​‌ and models to tackle‌ those issues. Most of‌​‌ the time, the research​​​‌ output includes experiments with​ real life data to​‌ validate the developed combinatorial​​ algorithms and fine tune​​​‌ them.

In this context,​ our work targets (i)​‌ the preprocessing phases of​​ direct methods, iterative methods,​​​‌ and hybrid methods for​ solving linear systems of​‌ equations; (ii) high performance​​ tensor computations. The core​​​‌ topics covering our contributions​ include partitioning and clustering​‌ in graphs and hypergraphs,​​ matching in graphs, data​​​‌ structures and algorithms for​ sparse matrices and tensors​‌ (different from partitioning), and​​ task mapping and scheduling.​​​‌

4 Application domains

Sparse​ linear system solvers have​‌ a wide range of​​ applications as they are​​​‌ used at the heart​ of many numerical methods​‌ in computational science: whether​​ a model uses finite​​​‌ elements or finite differences,​ or requires the optimization​‌ of a complex linear​​ or nonlinear function, one​​​‌ often ends up solving​ a system of linear​‌ equations involving sparse matrices.​​ There are therefore a​​​‌ number of application fields:​ structural mechanics, seismic modeling,​‌ biomechanics, medical image processing,​​ tomography, geophysics, electromagnetism, fluid​​​‌ dynamics, econometric models, oil​ reservoir simulation, magneto-hydro-dynamics, chemistry,​‌ acoustics, glaciology, astrophysics, circuit​​ simulation, and work on​​​‌ hybrid direct-iterative methods.

Tensors,​ or multidimensional arrays, are​‌ becoming very important because​​ of their use in​​​‌ many data analysis applications.​ The additional dimensions over​‌ matrices (or two dimensional​​ arrays) enable gleaning information​​​‌ that is otherwise unreachable.​ Tensors, like matrices, come​‌ in two flavors: dense​​ tensors and sparse tensors.​​​‌ Dense tensors arise usually​ in physical and simulation​‌ applications: signal processing for​​ electroencephalography (also named EEG,​​​‌ electrophysiological monitoring method to​ record electrical activity of​‌ the brain); hyperspectral image​​ analysis; compression of large​​​‌ grid-structured data coming from​ a high-fidelity computational simulation;​‌ quantum chemistry etc. Dense​​ tensors also arise in​​​‌ a variety of statistical​ and data science applications.​‌ Some of the cited​​ applications have structured sparsity​​​‌ in the tensors. We​ see sparse tensors, with​‌ no apparent/special structure, in​​ data analysis and network​​​‌ science applications. Well known​ applications dealing with sparse​‌ tensors are: recommender systems;​​ computer network traffic analysis​​​‌ for intrusion and anomaly​ detection; clustering in graphs​‌ and hypergraphs modeling various​​ relations; knowledge graphs/bases such​​​‌ as those in learning​ natural languages.

5 Social​‌ and environmental responsibility

5.1​​ Impact of research results​​​‌

Within the framework of​ our collaboration with the​‌ University of Chicago (see​​ Section 10.1.4), we explore​​​‌ novel scheduling algorithms that​ are able to adapt​‌ to dynamic power changes,​​ to reduce carbon emissions,​​​‌ and to give priority​ to using green energy​‌ sources.

6 Highlights of​​ the year

Anne Benoit​​​‌ was promoted full professor​ of ENS de Lyon.​‌

Anne Benoit, Loris Marchal​​ and Bora Uçar each​​​‌ spent the first semester​ of 2025 in sabbatical​‌ to build new collaborations​​ and acquire new expertise.​​​‌ Anne Benoit and Bora​ Uçar were hosted at​‌ the Georgia Institute of​​ Technology, and Loris Marchal​​​‌ at the ILLS laboratory​ of the École de​‌ Technologie Supérieure de Montréal.​​

7 Latest software developments,​​​‌ platforms, open data

7.1​ Latest software developments

7.1.1​‌ MatchMaker

  • Name:
    Maximum matchings​​ in bipartite graphs
  • Keywords:​​
    Graph algorithmics, Matching
  • Scientific​​​‌ Description:
    The implementations of‌ ten exact algorithms and‌​‌ four heuristics for solving​​ the problem of finding​​​‌ a maximum cardinality matching‌ in bipartite graphs are‌​‌ provided.
  • Functional Description:
    This​​ software provides algorithms to​​​‌ solve the maximum cardinality‌ matching problem in bipartite‌​‌ graphs.
  • Release Contributions:
    This​​ version includes the initialization​​​‌ algorithm/heuristic 5 and moves‌ files in suitable directories.‌​‌
  • URL:
  • Publications:
  • Contact:
    Bora Uçar
  • Participants:‌
    Ioannis Panagiotas, Bora Uçar,‌​‌ Kamer Kaya, Johannes Langguth​​

8 New results

8.1​​​‌ Resilience for very large‌ scale platforms

The ROMA‌​‌ team has been working​​ on resilience problems for​​​‌ several years. In 2025,‌ we have focused on‌​‌ the following problems.

8.1.1​​ Fixed-Work vs. Fixed-Time Checkpointing​​​‌ on Large-Scale Failure-Prone Platforms‌

Participants: Quentin Barbut,‌​‌ Lucas Perotin [Vanderbilt University]​​, Anne Benoit,​​​‌ Thomas Hérault [University of‌ Tennessee, Knoxville], Yves‌​‌ Robert, Frédéric Vivien​​.

Consider a High-Performance​​​‌ Computing (HPC) application executing‌ on a large-scale failure-prone‌​‌ platform. The Fixed-Work Checkpointing​​ (FWC) problem consists in​​​‌ minimizing the expected time‌ to execute a fixed‌​‌ amount of work (namely​​ a fraction or the​​​‌ totality of the application).‌ Strategies for the FWC‌​‌ problem have received considerable​​ attention and are well-understood.​​​‌ On the contrary, the‌ dual problem, namely the‌​‌ Fixed-Time Checkpointing (FTC) problem,​​ has been considered only​​​‌ very recently. The FTC‌ problem consists in maximizing‌​‌ the expected work achieved​​ during a fixed amount​​​‌ of time (namely the‌ duration of a reservation‌​‌ granted to the application).​​ This work provides a​​​‌ comparative overview of both‌ problems. First we review‌​‌ existing strategies for the​​ FWC problem and extend​​​‌ them to stochastic checkpoints,‌ i.e., when the checkpoint‌​‌ is no longer a​​ deterministic constant but obeys​​​‌ some probability distribution law‌ instead. Then we provide‌​‌ a comprehensive study of​​ the FTC problem. The​​​‌ problem turns out to‌ be surprisingly difficult, even‌​‌ when restricting to taking​​ one or two checkpoints.​​​‌ We provide a threshold-based‌ heuristic to solve the‌​‌ general instance of the​​ problem with an arbitrary​​​‌ number of checkpoints, and‌ we have to resort‌​‌ to time discretization to​​ provide an optimal strategy.​​​‌ We further extend this‌ latter strategy to stochastic‌​‌ checkpoints.

This work has​​ been published in the​​​‌ International Journal of High‌ Performance Computing Applications 8‌​‌.

8.1.2 Partial Detectors​​ Versus Replication To Cope​​​‌ With Silent Errors

Participants:‌ Anne Benoit, Thomas‌​‌ Hérault [Inria Bordeaux],​​ Yves Robert, Alix​​​‌ Tremodeux.

This work‌ studies an iterative algorithm‌​‌ running on an error-prone​​ platform, where silent errors​​​‌ strike each iteration with‌ some probability. A detector‌​‌ verifies correctness before taking​​ a checkpoint but may​​​‌ fail to detect errors.‌ Specifically, an error at‌​‌ iteration I is detected​​ only after iteration (I​​​‌ - 1) + X,‌ where X follows a‌​‌ bounded probability distribution like​​ a truncated geometric distribution.​​​‌ Intuitively, the error silently‌ amplifies during some iterations‌​‌ before it can be​​ detected at distance X​​​‌ or higher, and there‌ is the risk of‌​‌ missing an error that​​​‌ has struck recently but​ cannot be detected yet.​‌ X is bounded by​​ D, the maximum detection​​​‌ latency. To mitigate undetected​ errors during verification, a​‌ simple strategy keeps two​​ checkpoints and divides the​​​‌ execution into D -​ 1 iteration segments, each​‌ followed by verification and​​ checkpoint. In steady state:​​​‌ (i) if verification succeeds,​ the oldest checkpoint is​‌ erased and replaced; (ii)​​ if it fails, rollback​​​‌ occurs to the oldest​ verified checkpoint. This work​‌ explores whether this scheme​​ outperforms replication and determines​​​‌ the optimal number of​ checkpoints and segment lengths,​‌ both theoretically and via​​ Monte Carlo simulations.

This​​​‌ work was presented at​ the EuroPar 2025 conference​‌ 18.

8.1.3 Fault-tolerant​​ numerical iterative algorithms at​​​‌ scale

Participants: Alix Tremodeux​, Anne Benoit,​‌ Emmanuel Agullo [Inria Bordeaux]​​, Thomas Hérault [Inria​​​‌ Bordeaux; University of Tennessee,​ Knoxville], Luc Giraud​‌ [Inria Bordeaux], Yves​​ Robert.

This work​​​‌ investigates how to protect​ numerical iterative algorithms from​‌ all types of errors​​ that can strike at​​​‌ scale: fail-stop errors (a.k.a.​ failures) and silent errors,​‌ striking both as computation​​ errors and memory bit-flips.​​​‌ We combine various techniques:​ detectors for computation errors,​‌ checksums for memory errors,​​ and checkpoint/restart for failures.​​​‌ The objective is to​ minimize the expected time​‌ per iteration of the​​ algorithm. We design a​​​‌ hierarchical pattern that combines​ and interleaves all these​‌ fault-tolerance mechanisms, and we​​ determine the optimal periodic​​​‌ pattern that achieves this​ objective. We instantiate these​‌ results for the performance​​ analysis of the Preconditioned​​​‌ Conjugate Gradient (PCG) algorithm:​ we report several scenarios​‌ where the optimal pattern​​ dramatically decreases the overhead​​​‌ due to error mitigation.​

This work has been​‌ published in the International​​ Journal of High Performance​​​‌ Computing Applications 13.​

8.2 Multi-criteria scheduling strategies​‌

We report here the​​ work undertaken by the​​​‌ ROMA team in multi-criteria​ strategies, which focuses on​‌ taking into account energy​​ and memory constraints, but​​​‌ also budget constraints or​ specific constraints for scheduling​‌ online requests.

8.2.1 Green​​ Scheduling on the Edge​​​‌

Participants: Joachim Cendrier,​ Rajini Wijayawardana [University of​‌ Chicago], Anne Benoit​​, Yves Robert,​​​‌ Frédéric Vivien, Andrew​ A. Chien [University of​‌ Chicago].

This work​​ aims at designing and​​​‌ evaluating scheduling algorithms that​ minimize carbon cost on​‌ edge platforms. When a​​ job is released to​​​‌ some edge server, difficult​ scheduling questions arise: should​‌ the job be executed​​ on that server? If​​​‌ yes, when? If no,​ which other edge server​‌ should the job be​​ transferred to? Typically, jobs​​​‌ are submitted online, and​ have a deadline to​‌ enforce. Online scheduling problems​​ are already difficult without​​​‌ accounting for different energy​ sources, so one should​‌ not expect any optimal​​ solution. Still, an important​​​‌ research goal is to​ revisit standard algorithms such​‌ as Earliest Completion Time​​ (ECT) and Earliest Deadline​​​‌ First (EDF) in order​ to design and evaluate​‌ carbon-aware variants. This work​​ introduces several new algorithms​​​‌ that use sophisticated scheduling​ policies to efficiently decrease​‌ carbon cost; these algorithms​​ maximize the use of​​ green energy both on​​​‌ local and remote edge‌ servers, by re-evaluating previous‌​‌ decisions whenever needed to​​ accommodate newly released jobs.​​​‌ We provide a comprehensive‌ simulation campaign based on‌​‌ actual platform/job data and​​ carbon traces and report​​​‌ an average gain of‌ 42% over standard approaches.‌​‌

This work was presented​​ at the EuroPar 2025​​​‌ conference 19.

8.2.2‌ Carbon-Aware Workflow Scheduling with‌​‌ Fixed Mapping and Deadline​​ Constraint

Participants: Dominik Schweisgut​​​‌ [Humboldt University Berlin, Karlsruhe‌ Institute of Technology],‌​‌ Anne Benoit, Yves​​ Robert, Henning Meyerhenke​​​‌ [Karlsruhe Institute of Technology]‌.

Large data and‌​‌ computing centers consume a​​ significant share of the​​​‌ world's energy consumption. A‌ prominent subset of the‌​‌ workloads in such centers​​ are workflows with interdependent​​​‌ tasks, usually represented as‌ directed acyclic graphs (DAGs).‌​‌ To reduce the carbon​​ emissions resulting from executing​​​‌ such workflows in centers‌ with a mixed (renewable‌​‌ and non-renewable) energy supply,​​ it is advisable to​​​‌ move task executions to‌ time intervals with sufficient‌​‌ green energy when possible.​​ To this end, we​​​‌ formalize the above problem‌ as a scheduling problem‌​‌ with a given mapping​​ and ordering of the​​​‌ tasks. We show that‌ this problem can be‌​‌ solved in polynomial time​​ in the uniprocessor case.​​​‌ For at least two‌ processors, however, the problem‌​‌ becomes NP-hard. Hence, we​​ propose a heuristic framework​​​‌ called CaWoSched that combines‌ several greedy approaches with‌​‌ local search. To assess​​ the 16 heuristics resulting​​​‌ from different combinations, we‌ also devise a simple‌​‌ baseline algorithm and an​​ exact ILP-based solution. Our​​​‌ experimental results show that‌ our heuristics provide significant‌​‌ savings in carbon emissions​​ compared to the baseline.​​​‌

This work was presented‌ at the ICPP'25 conference‌​‌ 24.

8.2.3 Deadline-Aware​​ Scheduling of Mixed-Criticality Tasks​​​‌

Participants: Maxime Gonthier [University‌ of Chicago], Kyle‌​‌ Chard [University of Chicago]​​, Ian Foster [University​​​‌ of Chicago], Loris‌ Marchal, Frédéric Vivien‌​‌.

High-performance computing centers​​ and cloud providers host​​​‌ a wide variety of‌ workloads, ranging from routine‌​‌ calibration tasks with no​​ strict timing requirements to​​​‌ urgent real-time computations that‌ must be completed within‌​‌ hard deadlines. Traditional approaches​​ reserve resources for high-criticality​​​‌ tasks or preempt and‌ kill lower-criticality tasks when‌​‌ necessary, resulting in wasted​​ compute time and longer​​​‌ turnaround times for lower-criticality‌ tasks. We suggest that‌​‌ a better solution is​​ to interleave the execution​​​‌ of critical and non-critical‌ tasks. We formulate a‌​‌ bi-objective optimization problem: guarantee​​ that all critical tasks​​​‌ meet their deadlines, and‌ minimize the maximum flow,‌​‌ defined as the time​​ a task spends in​​​‌ the system, of non-critical‌ tasks. We introduce a‌​‌ formal model, derive an​​ approximation algorithm and a​​​‌ lower bound, and develop‌ several heuristics based on‌​‌ the approximation framework. Through​​ extensive simulations, based on​​​‌ synthetic and real-world workloads,‌ we show that one‌​‌ of our heuristics reduces​​ the maximum flow of​​​‌ non-critical tasks by up‌ to 14% compared to‌​‌ static resource partitioning.

This​​ work was presented at​​​‌ the ICPP'25 conference 21‌.

8.2.4 Memory-aware Adaptive‌​‌ Scheduling of Scientific Workflows​​​‌ on Heterogeneous Architectures

Participants:​ Svetlana Kulagina [Humboldt University​‌ Berlin], Anne Benoit​​, Henning Meyerhenke [Humboldt​​​‌ University Berlin].

The​ analysis of massive scientific​‌ data often happens in​​ the form of workflows​​​‌ with interdependent tasks. When​ such a scientific workflow​‌ needs to be scheduled​​ on a parallel or​​​‌ distributed system, one usually​ represents the workflow as​‌ a directed acyclic graph​​ (DAG). The vertices of​​​‌ the DAG represent the​ tasks, while its edges​‌ model the dependencies between​​ the tasks (usually data​​​‌ to be communicated to​ successor tasks). When executed,​‌ each task requires a​​ certain amount of memory​​​‌ and if that exceeds​ the available memory, the​‌ execution fails. The typical​​ goal is to execute​​​‌ the workflow without failures​ (i.e., satisfying the memory​‌ constraints) and with the​​ shortest possible execution time​​​‌ (i.e., to minimize its​ makespan). To address this​‌ problem, we investigate the​​ memory-aware scheduling of DAG-shaped​​​‌ workflows on heterogeneous plat-​ forms, where each processor​‌ can have a different​​ speed and a different​​​‌ memory size. We propose​ a variant of HEFT​‌ (Heterogeneous Earliest Finish Time)​​ that, in contrast to​​​‌ the original, accounts for​ memory and includes eviction​‌ strategies for cases when​​ it might be beneficial​​​‌ to remove some data​ from memory in order​‌ to have enough memory​​ to execute other tasks.​​​‌ Furthermore, while HEFT assumes​ perfect knowledge of the​‌ execution time and memory​​ usage of each task,​​​‌ the actual values might​ differ upon execution. Thus,​‌ we propose an adaptive​​ scheduling strategy, where a​​​‌ schedule is recomputed when​ there has been a​‌ significant variation in terms​​ of execution time or​​​‌ memory. The scheduler has​ been closely integrated with​‌ a runtime system, allowing​​ us to perform a​​​‌ thorough experimental evaluation on​ real-world workflows. The runtime​‌ system warns the scheduler​​ when the task parameters​​​‌ have changed, and a​ schedule can be recomputed​‌ on the fly. The​​ memory-aware strategy allows us​​​‌ to schedule task graphs​ that would run out​‌ of memory with a​​ state-of-the-art scheduler, and the​​​‌ adaptive setting allows us​ to significantly reduce the​‌ makespan.

This work was​​ presented at the CCGrid​​​‌ 2025 conference 22.​

8.2.5 A scheduler to​‌ foster data locality for​​ GPU and out-of-core task-based​​​‌ linear algebra applications

Participants:​ Maxime Gonthier [University of​‌ Chicago], Loris Marchal​​, Samuel Thibault [Inria​​​‌ Bordeaux].

Hardware accelerators​ like GPUs now provide​‌ a large part of​​ the computational power used​​​‌ for scientific simulations. Despite​ their efficacy, GPUs possess​‌ limited memory and are​​ connected to the main​​​‌ memory of the machine​ via a bandwidth limited​‌ bus. Scientific simulations often​​ operate on very large​​​‌ data, that surpasses the​ GPU's memory capacity. Therefore,​‌ one has to turn​​ to out-of-core computing: data​​​‌ is kept in a​ remote, slower memory (CPU​‌ memory), and moved back​​ and forth from/to the​​​‌ device memory (GPU memory),​ a process also present​‌ for multicore CPUs with​​ limited memory. In both​​​‌ cases, data movement quickly​ becomes a performance bottleneck.​‌ Task-based runtime schedulers have​​ emerged as a convenient​​ and efficient way to​​​‌ manage large applications on‌ such heterogeneous platforms. We‌​‌ propose a scheduler for​​ task-based runtimes that improves​​​‌ data locality for out-of-core‌ linear algebra computations, to‌​‌ reduce data movement. We​​ design a data-aware strategy​​​‌ for both task scheduling‌ and data eviction from‌​‌ limited memories. We compare​​ this scheduler to existing​​​‌ schedulers in runtime systems.‌ Using StarPU, we show‌​‌ that our new scheduling​​ strategy achieves comparable performance​​​‌ when memory is not‌ a constraint, and significantly‌​‌ better performance when application​​ input data exceeds memory,​​​‌ on both GPUs and‌ CPU cores.

This work‌​‌ has been published in​​ the Journal of Parallel​​​‌ and Distributed Computing 10‌.

8.2.6 Cache Management‌​‌ for Mixture-of-Experts LLMs

Participants:​​ Spyros Angelopoulos [IRL ILLS]​​​‌, Loris Marchal,‌ Adrien Obrecht, Bertrand‌​‌ Simon [CNRS IN2P3].​​

Large language models (LLMs)​​​‌ have demonstrated remarkable capabilities‌ across a variety of‌​‌ tasks. One of the​​ main challenges towards the​​​‌ successful deployment of LLMs‌ is memory management, since‌​‌ they typically involve billions​​ of parameters. To this​​​‌ end, architectures based on‌ Mixture-of-Experts have been proposed,‌​‌ which aim to reduce​​ the size of the​​​‌ parameters that are activated‌ when producing a token.‌​‌ This raises the equally​​ critical issue of efficiently​​​‌ managing the limited cache‌ of the system, in‌​‌ that frequently used experts​​ should be stored in​​​‌ the fast cache rather‌ than in the slower‌​‌ secondary memory. In this​​ work, we introduce and​​​‌ study a new paging‌ problem that models expert‌​‌ management optimization. Our formulation​​ captures both the layered​​​‌ architecture of LLMs and‌ the requirement that experts‌​‌ are cached efficiently. We​​ first present lower bounds​​​‌ on the competitive ratio‌ of both deterministic and‌​‌ randomized algorithms, which show​​ that under mild assumptions,​​​‌ LRU-like policies have good‌ theoretical competitive performance. We‌​‌ then propose a layer-based​​ extension of LRU that​​​‌ is tailored to the‌ problem at hand. Extensive‌​‌ simulations on both synthetic​​ datasets and actual traces​​​‌ of MoE usage show‌ that our algorithm outperforms‌​‌ policies for the classic​​ paging problem, such as​​​‌ the standard LRU.

This‌ work was presented at‌​‌ the EuroPar 2025 conference​​ 16.

8.2.7 Leveraging​​​‌ Expert Usage to Speed‌ up LLM Inference with‌​‌ Expert Parallelism

Participants: Olivier​​ Beaumont [Inria Bordeaux],​​​‌ Raphaël Bourgouin [Inria Bordeaux]‌, Maxime Darrin [IRL‌​‌ ILLS], Loris Marchal​​, Pablo Piantanida [IRL​​​‌ ILLS].

Large language‌ models have become indispensable‌​‌ for many text-processing applications.​​ Their inference, i.e., their​​​‌ use to generate text,‌ is a time-consuming task‌​‌ since tokens have to​​ be generated one after​​​‌ the other, even if‌ the computational load has‌​‌ been reduced by model​​ sparsification, e.g., by using​​​‌ a Mixture of Experts‌ (MoE) models. In the‌​‌ MoE context, a subset​​ of experts is selected​​​‌ at each stage. Note‌ that not all subsets‌​‌ of experts (pairs of​​ experts in most cases)​​​‌ in a given layer‌ have the same probability‌​‌ of being selected. When​​ experts are mapped to​​​‌ different GPUs, there is‌ a risk of load‌​‌ imbalance if the selected​​​‌ experts end up on​ a small number of​‌ GPUs. This work proposes​​ to leverage this heterogeneity​​​‌ in expert usage to​ map experts of popular​‌ subsets onto distinct GPUs,​​ allowing them to be​​​‌ processed in parallel and​ thus reducing the time​‌ needed for inference. Even​​ though this mapping problem​​​‌ is NP-complete, it is​ possible to design simple​‌ greedy strategies that significantly​​ reduce the need for​​​‌ sequential expert processing. Our​ proof-of-concept confirms that our​‌ mapping strategies effectively reduce​​ inference time on the​​​‌ Mixtral model.

This work​ was presented at the​‌ EuroPar 2025 conference 17​​.

8.2.8 Towards Parallel​​​‌ Transformer-Based Large Language Models​ for Fast Inference

Participants:​‌ Gaël Dellaleau [Kog company]​​, Morgan Giraud [Kog​​​‌ company], Loris Marchal​, Félix Wirth [Kog​‌ company].

Transformer-based models​​ have reached quality levels​​​‌ that make them attractive​ for a wide range​‌ of applications, including those​​ demanding faster generation speeds​​​‌ such as multimedia artifact​ generation. However, traditional parallelism​‌ techniques have struggled to​​ meet these speed requirements​​​‌ due to the inherently​ sequential structure of Transformers,​‌ which limits parallel execution.​​ In this work, we​​​‌ have proposed a simple​ modeling to the parallel​‌ execution of Transformer, which​​ identifies synchronization times across​​​‌ inference devices as the​ culprit. Leveraging real world​‌ figures from GPU manufacturers,​​ we highlight the increasing​​​‌ cost of this synchronization​ time with ever more​‌ powerful hardware. In response,​​ we introduce Tensor Parallel​​​‌ Blocks, a naturally parallelized​ architecture. To maintain quality,​‌ we incorporate multiple enhancements,​​ such as delayed inter-block​​​‌ communication, hybrid architecture, and​ different parallelism gateways, which​‌ mitigate potential degradation in​​ model output. We design​​​‌ and validate a dedicated​ training protocol tailored to​‌ this architecture and conduct​​ a preliminary hyperparameter study​​​‌ on 150M models, proving​ hardware-tied architecture can achieve​‌ relevant results. Using insights​​ from these trials, we​​​‌ establish a set of​ rules for scaling the​‌ model to larger size,​​ theorizing a 4.2x speedup​​​‌ in decoding time compared​ to an equivalently sized​‌ LLaMA based model on​​ the target inference platform.​​​‌

This work is detailed​ in a research report​‌ 33 describing the work​​ initiated during the 6​​​‌ month internship of Félix​ Wirth in the ROMA​‌ team and continued in​​ collaboration with Gael Delalleau​​​‌ and Morgan Giraud from​ the Kog company.

8.2.9​‌ Efficient and effective methods​​ for variant selection

Participants:​​​‌ Srinivas Aluru [Georgia Institute​ of Technology], Anne​‌ Benoit, Bora Uçar​​.

Variation graphs succinctly​​​‌ capture genetic variations among​ individuals within a species​‌ or a target population.​​ The use of variation​​​‌ graphs instead of a​ single reference genome is​‌ credited with reducing bias​​ and increasing the accuracy​​​‌ of sequence mapping algorithms.​ However, complete variation graphs​‌ that comprehensively incorporate all​​ genetic variations are often​​​‌ found to be ineffective​ and inaccurate in practice​‌ due to the presence​​ of a combinatorially explosive​​​‌ number of paths in​ the graph that do​‌ not correspond to any​​ observed genome. Thus, a​​​‌ balance is struck in​ carefully selecting a subset​‌ of variants to be​​ incorporated, for which mathematical​​ frameworks have recently been​​​‌ developed. We advance the‌ mathematical framework proposed by‌​‌ Jain et al., where​​ integer linear programming formulations​​​‌ were developed for optimal‌ variant selection. We propose‌​‌ novel graph-based formulations and​​ develop exact and fast​​​‌ algorithms for certain cases,‌ approximation methods for some‌​‌ others, and empirically close​​ to optimal results in​​​‌ all cases. The primary‌ advantage of the algorithms‌​‌ designed here is that​​ they provide near-optimal results​​​‌ at orders of magnitude‌ faster run time of‌​‌ an ILP solver.

This​​ work was presented at​​​‌ the 16th ACM Conference‌ on Bioinformatics, Computational Biology,‌​‌ and Health Informatics 15​​.

8.3 Sparse direct​​​‌ solvers and sparsity in‌ computing

We continued our‌​‌ work sparse tensors by​​ looking at the tensor​​​‌ contraction operation, which is‌ a higher order analogue‌​‌ of the sparse matrix​​ sparse matrix multiplication operation.​​​‌ We worked on combinatorial‌ problems arising in sparse‌​‌ tensor models. Further work​​ on graph algorithms and​​​‌ Birkhoff–von Neumann decomposition using‌ methods from the signal‌​‌ processing domain were also​​ conducted. We have also​​​‌ investigated communication lower bounds‌ and algorithms achieving those‌​‌ bounds for certain (dense)​​ matrix and tensor kernels.​​​‌

8.3.1 Efficient Parallel Sparse‌ Tensor Contraction

Participants: Somesh‌​‌ Singh, Bora Uçar​​.

We investigate the​​​‌ performance of algorithms for‌ sparse tensor-sparse tensor multiplication‌​‌ (SpGETT). This operation, also​​ called sparse tensor contraction,​​​‌ is a higher order‌ analogue of the sparse‌​‌ matrix-sparse matrix multiplication (SpGEMM)​​ operation. Therefore, SpGETT can​​​‌ be performed by first‌ converting the input tensors‌​‌ into matrices, then invoking​​ high performance variants of​​​‌ SpGEMM, and finally reconverting‌ the resultant matrix into‌​‌ a tensor. Alternatively, one​​ can carry out the​​​‌ scalar operations underlying SpGETT‌ in the realm of‌​‌ tensors without matrix formulation.​​ We discuss the building​​​‌ blocks in both approaches‌ and formulate a hashing-based‌​‌ method to avoid costly​​ search or redirection operations.​​​‌ We present performance results‌ with the current state-of-the-art‌​‌ SpGEMM-based approaches, existing SpGETT​​ approaches, and a carefully​​​‌ implemented SpGETT approach with‌ a new fine-tuned hashing‌​‌ method, proposed in this​​ work. We evaluate the​​​‌ methods on real world‌ tensors, contracting a tensor‌​‌ with itself along various​​ dimensions. Our proposed hashing-based​​​‌ method for SpGETT consistently‌ outperforms the state-of-the-art method,‌​‌ achieving a 25% reduction​​ in sequential execution time​​​‌ on average and a‌ 21% reduction in parallel‌​‌ execution time on average​​ across a variety of​​​‌ input instances.

This work‌ has been published in‌​‌ the IEEE Transactions on​​ Parallel and Distributed Systems​​​‌ journal 12. Codes‌ are available here.‌​‌

8.3.2 Semi-Streaming Algorithms for​​ Hypergraph Matching

Participants: Henrik​​​‌ Reinstädtler [Heidelberg University],‌ S.M. Ferdous [PNNL -‌​‌ Pacific Northwest National Laboratory]​​, Alex Pothen [Purdue​​​‌ University], Bora Uçar‌, Christian Schulz [Heidelberg‌​‌ University].

We propose​​ two one-pass streaming algorithms​​​‌ for the 𝒩𝒫-hard‌ hypergraph matching problem. The‌​‌ first algorithm stores a​​ small subset of potential​​​‌ matching edges in a‌ stack using dual variables‌​‌ to select edges. It​​ has an approximation guarantee​​​‌ of 1d(‌1+ε)‌​‌ and requires 𝒪(​​​‌(nε)​log2n)​‌ bits of memory, where​​ n is the number​​​‌ of vertices in the​ hypergraph, d is the​‌ maximum number of vertices​​ in a hyperedge, and​​​‌ ϵ>0 is​ a parameter to be​‌ chosen. The second algorithm​​ computes, stores, and updates​​​‌ a single matching as​ the edges stream, with​‌ an approximation ratio dependent​​ on a parameter α​​​‌. Its best approximation​ guarantee is 1(​‌2d-1​​)+2d​​​‌(d-1​), and it​‌ requires only 𝒪(​​n) memory.

We​​​‌ have implemented both algorithms​ and compared them with​‌ respect to solution quality,​​ memory consumption, and running​​​‌ times on two diverse​ sets of hypergraphs with​‌ a non-streaming greedy and​​ a naive streaming algorithm.​​​‌ Our results show that​ the streaming algorithms achieve​‌ much better solution quality​​ than naive algorithms when​​​‌ facing adverse orderings. Furthermore,​ these algorithms reduce the​‌ memory required by a​​ factor of 13 in​​​‌ the geometric mean on​ our test problems, and​‌ also outperform the offline​​ Greedy algorithm in running​​​‌ time.

This work was​ presented at the ESA​‌ 2025 conference 23.​​ Codes are available here​​​‌.

8.3.3 Algorithms for​ symmetric Birkhoff-von Neumann decomposition​‌ of symmetric doubly stochastic​​ matrices

Participants: Damien Lesens​​​‌, Jérémy E. Cohen​ [CNRS, Lyon], Bora​‌ Uçar.

The classical​​ Birkhoff–von Neumann (BvN) decomposition​​​‌ expresses a given doubly​ stochastic matrix as a​‌ convex combination of permutation​​ matrices. We investigate the​​​‌ BvN decomposition of symmetric​ doubly stochastic matrices where​‌ the permutation matrices in​​ the decomposition are also​​​‌ symmetric, called SymBvn decomposition.​ This decomposition is not​‌ always possible. Two pioneering​​ theoretical works establish the​​​‌ conditions under which such​ a decomposition is possible​‌ using graph terminology. We​​ propose a practical algorithm​​​‌ by combining these two​ works. A simple transformation​‌ converts any given symmetric​​ doubly stochastic matrix, with​​​‌ possibly nonzero diagonal elements,​ to be the adjacency​‌ matrix of an edge-weighted​​ undirected graph. The adjacency​​​‌ matrix of the resulting​ graph admits a SymBvn​‌ decomposition if and only​​ if the given matrix​​​‌ does so. The practicality​ of the proposed algorithm​‌ allows us to implement​​ it, release its source​​​‌ code, and report the​ first set of experiments​‌ ever performed for the​​ SymBvn decomposition. Our experiments​​​‌ suggest that the proposed​ algorithm is as effective​‌ as the state-of-the-art algorithms​​ for the classical BvN​​​‌ decomposition.

This work has​ been accepted to be​‌ published in SIAM Journal​​ on Matrix Analysis and​​​‌ Applications in Dec 2025​ 11. Codes are​‌ available here.

8.3.4​​ SUperman: Efficient Permanent Computation​​​‌ on GPUs

Participants: Deniz​ Elbek [Sabancı University],​‌ Fatih Taşyaran [Sabancı University]​​, Bora Uçar,​​​‌ Kamer Kaya [Sabancı University]​.

The permanent is​‌ a function, defined for​​ a square matrix, with​​​‌ applications in various domains​ including quantum computing, statistical​‌ physics, complexity theory, combinatorics,​​ and graph theory. Its​​​‌ formula is similar to​ that of the determinant,​‌ however unlike the determinant,​​ its exact computation is​​ #P-complete, i.e., there is​​​‌ no algorithm to compute‌ the permanent in polynomial‌​‌ time unless P=NP. For​​ an n×n​​​‌ matrix, the fastest algorithm‌ has a time complexity‌​‌ of O(2​​n-1n​​​‌). Although supercomputers‌ have been employed for‌​‌ permanent computation before, there​​ is no work and​​​‌ more importantly, no publicly‌ available software that leverages‌​‌ cutting-edge, yet widely accessible,​​ High-Performance Computing accelerators such​​​‌ as GPUs. In this‌ work, we designed, developed,‌​‌ and investigated the performance​​ of SUperman, a​​​‌ complete software suite that‌ can compute matrix permanents‌​‌ on multiple nodes/GPUs on​​ a cluster while handling​​​‌ various matrix types, e.g.,‌ real/complex/binary and sparse/dense etc.,‌​‌ with a unique treatment​​ for each type. Compared​​​‌ to a state-of-the-art parallel‌ algorithm on 44 cores,‌​‌ SUperman can be 86​​× faster on a​​​‌ single Nvidia A100 GPU.‌ Combining multiple GPUs, we‌​‌ also showed that SUperman​​ can compute the permanent​​​‌ of a 56×‌56 matrix which is‌​‌ the largest reported in​​ the literature.

This work​​​‌ has been recently accepted‌ to be published in‌​‌ Computer Physics Communications 30​​.

8.3.5 Communication Lower​​​‌ Bounds and Optimal Algorithms‌ for Symmetric Matrix Computations‌​‌

Participants: Hussam Al Daas​​ [Rutherford Appleton Laboratory],​​​‌ Grey Ballard [Wake Forest‌ University], Laura Grigori‌​‌ [EPFL], Suraj Kumar​​, Kathryn Rouse [Inmar​​​‌ Intelligence], Mathieu Vérité‌ [EPFL].

In this‌​‌ work, we focus on​​ the communication costs of​​​‌ three symmetric matrix computations:‌ i) multiplying a matrix‌​‌ with its transpose, known​​ as a symmetric rank-k​​​‌ update (SYRK) ii) adding‌ the result of the‌​‌ multiplication of a matrix​​ with the transpose of​​​‌ another matrix and the‌ transpose of that result,‌​‌ known as a symmetric​​ rank-2k update (SYR2K) iii)​​​‌ performing matrix multiplication with‌ a symmetric input matrix‌​‌ (SYMM). All three computations​​ appear in the Level​​​‌ 3 Basic Linear Algebra‌ Subroutines (BLAS) and have‌​‌ wide use in applications​​ involving symmetric matrices. We​​​‌ establish communication lower bounds‌ for these kernels using‌​‌ sequential and distributed-memory parallel​​ computational models, and we​​​‌ show that our bounds‌ are tight by presenting‌​‌ communication-optimal algorithms for each​​ setting. Our lower bound​​​‌ proofs rely on applying‌ a geometric inequality for‌​‌ symmetric computations and analytically​​ solving constrained nonlinear optimization​​​‌ problems. The symmetric matrix‌ and its corresponding computations‌​‌ are accessed and performed​​ according to a triangular​​​‌ block partitioning scheme in‌ the optimal algorithms.

This‌​‌ work has been published​​ in the ACM Transactions​​​‌ on Parallel Computing journal‌ 9.

8.3.6 Minimizing‌​‌ Communication for Parallel Symmetric​​ Tensor Times Same Vector​​​‌ Computation

Participants: Hussam Al‌ Daas [Rutherford Appleton Laboratory]‌​‌, Grey Ballard [Wake​​ Forest University], Laura​​​‌ Grigori [EPFL], Suraj‌ Kumar, Kathryn Rouse‌​‌ [Inmar Intelligence], Mathieu​​ Vérité [EPFL].

In​​​‌ this work, we focus‌ on the parallel communication‌​‌ cost of multiplying the​​ same vector along two​​​‌ modes of a 3-dimensional‌ symmetric tensor. This is‌​‌ a key computation in​​ the higher-order power method​​​‌ for determining eigenpairs of‌ a 3-dimensional symmetric tensor‌​‌ and in gradient-based methods​​​‌ for computing a symmetric​ CP decomposition. We establish​‌ communication lower bounds that​​ determine how much data​​​‌ movement is required to​ perform the specified computation​‌ in parallel. We demonstrate​​ that the communication lower​​​‌ bounds are tight by​ presenting an optimal algorithm​‌ where the data distribution​​ is a natural extension​​​‌ of the triangle block​ partition scheme for symmetric​‌ matrices to 3-dimensional symmetric​​ tensors.

This work was​​​‌ presented at the SPAA​ 2025 conference 14.​‌

9 Bilateral contracts and​​ grants with industry

9.1​​​‌ Bilateral contracts with industry​

Participants: Loris Marchal,​‌ Félix Wirth-Bonne.

  • Contrat​​ de collaboration entre la​​​‌ société Kog et l'équipe-projet​ ROMA pour le co-encadrement​‌ du stage de Félix​​ Wirth (2200€).

10 Partnerships​​​‌ and cooperations

10.1 International​ initiatives

10.1.1 Inria associate​‌ team not involved in​​ an IIL or an​​​‌ international program

MODS

Participants:​ Bora Uçar.

  • Title:​‌
    Match and Order: improving​​ direct solvers for cardiac​​​‌ simulations
  • Duration:
    2023 ->​ 2025.
  • Coordinator:
    Johannes Langguth​‌ (langguth@simula.no)
  • Partners:
    • University of​​ Bergen (Norway)
  • Inria contact:​​​‌
    Bora Uçar
  • Summary:
    The​ goal of the MODS​‌ project is to enhance​​ robustness, scalability, and performance​​​‌ of sparse direct solvers​ by developing novel parallel​‌ matching and ordering algorithms.​​ The results will be​​​‌ tested on and applied​ to simulations of cardiac​‌ electrophysiology developed by Simula.​​

10.1.2 Participation in other​​​‌ International Programs

JLESC —​ Joint Laboratory on Extreme​‌ Scale Computing.

Participants: Anne​​ Benoit, Yves Robert​​​‌, Frédéric Vivien.​

The University of Illinois​‌ at Urbana-Champaign, INRIA, the​​ French national computer science​​​‌ institute, Argonne National Laboratory,​ Barcelona Supercomputing Center, Jülich​‌ Supercomputing Centre and the​​ Riken Advanced Institute for​​​‌ Computational Science formed the​ Joint Laboratory on Extreme​‌ Scale Computing, a follow-up​​ of the Inria-Illinois Joint​​​‌ Laboratory for Petascale Computing.​ The Joint Laboratory is​‌ based at Illinois and​​ includes researchers from INRIA,​​​‌ and the National Center​ for Supercomputing Applications, ANL,​‌ BSC and JSC. It​​ focuses on software challenges​​​‌ found in extreme scale​ high-performance computers.

Research areas​‌ include:

  • Scientific applications (big​​ compute and big data)​​​‌ that are the drivers​ of the research in​‌ the other topics of​​ the joint-laboratory.
  • Modeling and​​​‌ optimizing numerical libraries, which​ are at the heart​‌ of many scientific applications.​​
  • Novel programming models and​​​‌ runtime systems, which allow​ scientific applications to be​‌ updated or reimagined to​​ take full advantage of​​​‌ extreme-scale supercomputers.
  • Resilience and​ Fault-tolerance research, which reduces​‌ the negative impact when​​ processors, disk drives, or​​​‌ memory fail in supercomputers​ that have tens or​‌ hundreds of thousands of​​ those components.
  • I/O and​​​‌ visualization, which are important​ parts of parallel execution​‌ for numerical simulations and​​ data analytics
  • HPC Clouds,​​​‌ that may execute a​ portion of the HPC​‌ workload in the near​​ future.

Several members of​​​‌ the ROMA team are​ involved in the JLESC​‌ joint lab through their​​ research on scheduling and​​​‌ resilience.

Collaboration with Humboldt​ University Berlin/KIT

Participants: Anne​‌ Benoit, Yves Robert​​.

  • Title: FONDA
  • Partner​​​‌ Institution(s):
    • Humboldt University Berlin​ and Karlsruhe Institute of​‌ Technoloty, Germany
  • Date/Duration:
    2024–2028​​
  • Additional info/keywords:
    Anne Benoit​​ is a Mercator Fellow​​​‌ for the Collaborative Research‌ Center FONDA. This project‌​‌ aims at studying the​​ foundations of workflows for​​​‌ large-scale scientific data analysis.‌
Collaboration with U. Darmstadt‌​‌

Participants: Anne Benoit,​​ Hadi Gholami, Frédéric​​​‌ Vivien.

  • Title: ANR-DFG‌ EnergyDoldrums
  • Partner Institution(s):
    • University‌​‌ of Darmstadt, Germany
  • Date/Duration:​​
    2025–2028
  • Additional info/keywords:
    The​​​‌ aim of this project‌ is to optimize the‌​‌ scheduling and execution of​​ adaptative HPC jobs on​​​‌ platforms whose energy supply‌ varies over time.
Collaboration‌​‌ with U. Chicago

Participants:​​ Anne Benoit, Joachim​​​‌ Cendrier, Loris Marchal‌, Yves Robert,‌​‌ Frédéric Vivien.

  • Title:​​ U. Chicago-CNRS project
  • Partner​​​‌ Institution(s):
    • University of Chicago,‌ USA
  • Date/Duration:
    2023–2026
  • Additional‌​‌ info/keywords:
    This project is​​ funding the PhD of​​​‌ J. Cendrier, on efficient‌ and environment-friendly scheduling and‌​‌ resource management algorithms at​​ the edge

Participants: Loris​​​‌ Marchal, Frédéric Vivien‌.

  • Title: FACCTS project‌​‌
  • Partner Institution(s):
    • University of​​ Chicago, USA
  • Date/Duration:
    2024–2026​​​‌
  • Additional info/keywords:
    In this‌ project, we focus on‌​‌ the heterogeneity in the​​ workload submitted to computing​​​‌ centers and clouds. Some‌ of the submitted jobs‌​‌ may be critical, with​​ hard deadlines on their​​​‌ completion, while other may‌ be scheduled on a‌​‌ best-effort basis to reduce​​ their response time. We​​​‌ aim at designing scheduling‌ strategies to accommodate both‌​‌ type of jobs with​​ their different constraints and​​​‌ objectives on a shared‌ computing platform.

10.2 International‌​‌ research visitors

10.2.1 Visits​​ of international scientists

Inria​​​‌ International Chair

Participants: Julien‌ Langou.

Julien Langou,‌​‌ professor at the University​​ Denver (USA) has been​​​‌ awarded an Inria International‌ Chair to visit the‌​‌ ROMA team in the​​ period 2023–2026. He spent​​​‌ 4.5 months in the‌ team in May-July 2025‌​‌ and starting mid-November 2025.​​

10.2.2 Visits to international​​​‌ teams

Research stays abroad‌
Loris Marchal
  • Visited institution:‌​‌
    École Supérieure de Technologie,​​ Montréal
  • Country:
    Canada
  • Dates:​​​‌
    01–07/2025
  • Context of the‌ visit:
    temporary assignement at‌​‌ the ILLS laboratory (CNRS,​​ McGill, ETS, MILA).
  • Mobility​​​‌ program/type of mobility:
    sabbatical‌
Anne Benoit
  • Visited institution:‌​‌
    Institute of Data Engineering​​ and Science (IDEaS), Georgia​​​‌ Institute of Technology
  • Country:‌
    US
  • Dates:
    02–07/2025
  • Context‌​‌ of the visit:
    Collaboration​​ with S. Aluru
  • Mobility​​​‌ program/type of mobility:
    Long‌ term stay abroad
Bora‌​‌ Uçar
  • Visited institution:
    Institute​​ of Data Engineering and​​​‌ Science (IDEaS), Georgia Institute‌ of Technology
  • Country:
    US‌​‌
  • Dates:
    02–07/2025
  • Context of​​ the visit:
    Collaboration with​​​‌ S. Aluru
  • Mobility program/type‌ of mobility:
    Long term‌​‌ stay abroad

10.3 European​​ initiatives

10.3.1 Horizon Europe​​​‌

Participants: Anne Benoit,‌ Michel Nicolis, Frédéric‌​‌ Vivien.

  • HORIZON-INFRA project​​ ODISSEE (2025–2027).
    The ODISSEE​​​‌ project aims to develop‌ innovative technologies and methodologies‌​‌ to process the unprecedented​​ volume of scientific data​​​‌ produced by research infrastructures‌ such as CERN’s HL-LHC‌​‌ and SKAO. It notably​​ plans to develop on-the-fly​​​‌ AI data processing, which‌ is a major challenge‌​‌ in research in the​​ physical sciences, and where​​​‌ the contribution of SLICES‌ RI will be decisive.‌​‌ Coordinated by Damien Gratadour,​​ researcher at the CNRS​​​‌ Laboratory for Instrumentation and‌ Research in Astrophysics, ODISSEE‌​‌ is leveraging the European​​​‌ HPC ecosystem to open​ up a new era​‌ in science, helping to​​ unravel fundamental mysteries such​​​‌ as the nature of​ dark matter. The three-year​‌ ODISSEE consortium brings together​​ 14 partners and 2​​​‌ associate partners from academia​ and industry. F. Vivien​‌ is the head of​​ work-package 5.

10.4 National​​​‌ initiatives

ECLAT

Participants: Frédéric​ Vivien.

  • Partner Institution(s):​‌
    CNRS, Inria, Eviden, Observatoire​​ de la Côte d’Azur,​​​‌ Observatoire de Paris-PSL
  • Date/Duration:​
    2023-
  • Summary
    ECLAT is​‌ a joint laboratory gathering​​ 14 laboratories. Its aim​​​‌ is to support the​ French contribution to the​‌ SKAO observatory.

11 Dissemination​​

11.1 Promoting scientific activities​​​‌

18th Workshop on Scheduling​ for Large Scale Systems:​‌

Loris Marchal , Pablo​​ Piantanida and Yves Robert​​​‌ have organized the 18th​ Workshop on Scheduling for​‌ Large Scale Systems in​​ Montréal in July 2025.​​​‌ Further details can be​ found on the workshop​‌ webpage.

11.1.1 Scientific​​ events: organisation

Bora Uçar​​​‌ was the organizing committee​ co-chair of SIAM Conference​‌ on Applied and Computational​​ Discrete Algorithms (ACDA25) Montréal,​​​‌ Canada. In SIAM conference​ systems this role corresponds​‌ to the general chair.​​

11.1.2 Scientific events: selection​​​‌

Chair of conference program​ committees
  • Anne Benoit was​‌ chair for Track B​​ of ESA, the European​​​‌ Symposium on Algorithms, in​ conjunction with ALGO'2025, Warsaw,​‌ Poland, September 15-19, 2025.​​
Member of the conference​​​‌ program committees
  • Anne Benoit​ was a member of​‌ the program committee of​​ ALENEX'25, EuroPar'25, ICPP'25, SC'25​​​‌ workshops and symposiums.
  • Suraj​ Kumar was a member​‌ of the program committee​​ of SC 2025.
  • Loris​​​‌ Marchal was a member​ of the program committee​‌ of EuroPar 2025.
  • Bora​​ Uçar was a member​​​‌ of the program committee​ of 39th IEEE International​‌ Parallel & Distributed Processing​​ Symposium, June 3–7, 2025,​​​‌ Milan, Italy.
  • Frédéric Vivien​ was a member of​‌ the program committee of​​ ESA 2025.
Reviewer

Bora​​​‌ Uçar reviewed articles for​ ALENEX2025:SIAM Symposium on Algorithm​‌ Engineering and Experiments, New​​ Orleans, Louisiana, USA.

11.1.3​​​‌ Journal

Member of the​ editorial boards
  • Anne Benoit​‌ is Editor in Chief​​ of ParCo, the journal​​​‌ of Parallel Computing: Systems​ and Applications, and she​‌ is also a member​​ of the editorial board​​​‌ of ACM TOPC (Transactions​ on Parallel Computing).
  • Yves​‌ Robert is a member​​ of the editorial board​​​‌ of the International Journal​ of High Performance Computing​‌ (IJHPCA) and the Journal​​ of Computational Science (JOCS).​​​‌
  • Bora Uçar is a​ member of the editorial​‌ board of IEEE Transactions​​ on Parallel and Distributed​​​‌ System, the journal of​ Parallel Computing (until May​‌ 2025), SIAM Journal on​​ Scientific Computing (SISC) until​​​‌ December 2025, and SIAM​ Journal on Matrix Analysis​‌ and Applications (SIMAX).
  • Frédéric​​ Vivien is a member​​​‌ of the editorial board​ of the Journal of​‌ Parallel and Distributed Computing.​​
Reviewer - reviewing activities​​​‌
  • Anne Benoit has reviewed​ manuscripts for ACM Transactions​‌ on Parallel Computing, IEEE​​ TPDS, IJHPCA.
  • Suraj Kumar​​​‌ has reviewed manuscripts for​ Journal of Parallel and​‌ Distributed Computing, SIAM Journal​​ on Scientific Computing, and​​​‌ Transactions on Parallel and​ Distributed Systems journal.
  • Bora​‌ Uçar was a reviewer​​ for ACM Transactions on​​ Parallel Computing, The Journal​​​‌ of Supercomputing.

11.1.4 Scientific‌ expertise

  • Frédéric Vivien was‌​‌ a member of the​​ evaluation committee for the​​​‌ Scientific Evaluation of Forschungszentrum‌ Jülich (FZJ), Germany, in‌​‌ April 2025.
  • Frédéric Vivien​​ is an elected member​​​‌ of INRIA commision d'évaluation‌.
  • Frédéric Vivien is‌​‌ a member of the​​ scientific council of the​​​‌ IRMIA labex.

11.2‌ Teaching - Supervision -‌​‌ Juries - Educational and​​ pedagogical outreach

11.2.1 Teaching​​​‌

  • Master: Anne Benoit, Parallel‌ and Distributed Algorithms and‌​‌ Programs, 39h, M1, ENS​​ Lyon, France
  • Master: Suraj​​​‌ Kumar, Loris Marchal and‌ Frédéric Vivien, Resource-aware computations‌​‌ on CPUs and GPUs,​​ 30h, M2, ENS Lyon,​​​‌ France.

11.2.2 Supervision

  • Anne‌ Benoit and Frédéric Vivien‌​‌ are co-supervising the PhD​​ of Joachim Cendrier. Joachim​​​‌ works on efficient and‌ environment-friendly scheduling and resource‌​‌ management algorithms at the​​ edge, in collaboration with​​​‌ Andrew Chien from U.‌ Chicago. He is funded‌​‌ by a CNRS -​​ U. Chicago project.
  • Anne​​​‌ Benoit and Frédéric Vivien‌ are co-supervising the PhD‌​‌ of Hadi Gholami. Hadi​​ works on resource allocation​​​‌ to jobs running on‌ variable-capacity platforms.
  • Anne Benoit‌​‌ is co-supervising the PhD​​ of Svetlana Kulagina with​​​‌ Henning Meyerhenke (Humboldt-Universitat zu‌ Berlin, Germany), as part‌​‌ of the FONDA project,​​ on the execution of​​​‌ large workflows in heterogeneous‌ execution environments.
  • Anne Benoit‌​‌ is co-supervising the PhD​​ of Dominik Schweisgut with​​​‌ Henning Meyerhenke (now at‌ Karlsruhe Institute of Technology,‌​‌ Germany), as part of​​ the FONDA project, on​​​‌ algorithms for carbon-aware workflow‌ scheduling.
  • Frédéric Vivien and‌​‌ Anne Benoit are co-supervising​​ the PhD of Michel​​​‌ Nicolis. Michel works on‌ predictive maintenance.
  • Anne Benoit‌​‌ and Loris Marchal are​​ supervising the PhD of​​​‌ Adrien Obrecht on scheduling‌ task graphs with application‌​‌ to large language models.​​
  • Loris Marchal is supervising​​​‌ the PhD of Felix‌ Wirth (Cifre collaboration with‌​‌ the Kog company) on​​ the design of generative​​​‌ models with efficient parallel‌ inference.
  • Bora Uçar is‌​‌ co-supervising the PhD of​​ Damien Lesens on nonnegative​​​‌ matrix factorization.

11.2.3 Juries‌

  • Anne Benoit was a‌​‌ reviewer of the PhD​​ of Diane Orhan, defended​​​‌ at Bordeaux University on‌ December 9, 2025.
  • Anne‌​‌ Benoit was an examiner​​ for the PhD of​​​‌ Roblex Nana Tchakouté, defended‌ at PSL University Paris‌​‌ on December 5, 2025.​​
  • Loris Marchal was a​​​‌ reviewer of the PhD‌ of Robin Boezennec, defended‌​‌ at Rennes University on​​ December 10, 2025.
  • Loris​​​‌ Marchal was a reviewer‌ of the PhD of‌​‌ Huijun Wang, to be​​ defended at Auckland university​​​‌ (New Zealand) in Spring‌ 2026.
  • Frédéric Vivien was‌​‌ a member of the​​ jury for hiring INRIA​​​‌ CRCN and ISFP researchers‌ for the Centre Inria‌​‌ de l'Université de Bordeaux.​​
  • Frédéric Vivien was a​​​‌ member of the jury‌ for hiring INRIA senior‌​‌ (DR2) researchers.

12 Scientific​​ production

12.1 Major publications​​​‌

  • 1 inproceedingsA.Anne‌ Benoit, T.Thomas‌​‌ Hérault, V.Valentin​​ Le Fèvre and Y.​​​‌Yves Robert. Replication‌ Is More Efficient Than‌​‌ You Think.SC​​ 2019 - International Conference​​​‌ for High Performance Computing,‌ Networking, Storage, and Analysis‌​‌ (SC'19)Denver, United States​​​‌November 2019HAL
  • 2​ articleA.Anne Benoit​‌, L.Lucas Perotin​​, Y.Yves Robert​​​‌ and F.Frédéric Vivien​. Checkpointing strategies to​‌ tolerate non-memoryless failures on​​ HPC platforms.ACM​​​‌ Transactions on Parallel Computing​September 2023HALDOI​‌
  • 3 inproceedingsM.Marin​​ Bougeret, H.Henri​​​‌ Casanova, M.Mikael​ Rabie, Y.Yves​‌ Robert and F.Frédéric​​ Vivien. Checkpointing strategies​​​‌ for parallel jobs..​SuperComputing (SC) - International​‌ Conference for High Performance​​ Computing, Networking, Storage and​​​‌ Analysis, 2011United States​2011, 1-11HAL​‌
  • 4 incollectionJ.Jack​​ Dongarra, T.Thomas​​​‌ Hérault and Y.Yves​ Robert. Fault Tolerance​‌ Techniques for High-Performance Computing​​.Fault-Tolerance Techniques for​​​‌ High-Performance ComputingSpringerMay​ 2015, 83HAL​‌
  • 5 articleF.Fanny​​ Dufossé and B.Bora​​​‌ Uçar. Notes on​ Birkhoff-von Neumann decomposition of​‌ doubly stochastic matrices.​​Linear Algebra and its​​​‌ Applications497February 2016​, 108--115HALDOI​‌
  • 6 articleL.Lionel​​ Eyraud-Dubois, L.Loris​​​‌ Marchal, O.Oliver​ Sinnen and F.Frédéric​‌ Vivien. Parallel scheduling​​ of task trees with​​​‌ limited memory.ACM​ Transactions on Parallel Computing​‌22July 2015​​, 36HALDOI​​​‌
  • 7 articleL.Loris​ Marchal, B.Bertrand​‌ Simon and F.Frédéric​​ Vivien. Limiting the​​​‌ memory footprint when dynamically​ scheduling DAGs on shared-memory​‌ platforms.Journal of​​ Parallel and Distributed Computing​​​‌128February 2019,​ 30-42HALDOI

12.2​‌ Publications of the year​​

International journals

International​​ peer-reviewed conferences

Reports​​ & preprints

Software