2025 Activity Report: Project-Team THOTH
RNSR: 201622034K
- Research center: Inria Centre at Université Grenoble Alpes
- Team name: Learning visual models from large-scale data
- In collaboration with: Laboratoire Jean Kuntzmann (LJK)
Creation of the Project-Team: 2016 March 01
Keywords
Computer Science and Digital Science
- A3.4. Machine learning and statistics
- A5.3. Image processing and analysis
- A5.9. Signal processing
- A6.2.6. Optimization
- A8.2. Optimization
- A9.2. Machine learning
- A9.3. Signal processing
- A9.7. AI algorithmics
- A9.11. Generative AI
- A9.12. Computer vision
Other Research Topics and Application Domains
- B9.5.6. Data science
1 Team members, visitors, external collaborators
Research Scientists
- Julien Mairal [Team leader, Inria, Senior Researcher, on secondment from the Corps des Mines, HDR]
- Karteek Alahari [Inria, Senior Researcher, HDR]
- Michael Arbel [Inria, Researcher]
- Pia Bideau [UGA, Chair]
- Jocelyn Chanussot [Inria, Senior Researcher, on secondment from Grenoble INP, HDR]
- Emanuele Dalsasso [Inria, ISFP, from Dec 2025]
- Pierre Gaillard [Inria, Researcher, HDR]
- Hadrien Hendrikx [Inria, Researcher]
Post-Doctoral Fellows
- Alessia Boccalatte [UGA, Post-Doctoral Fellow, until Jan 2025]
- Khaled Eldowa [Inria, Post-Doctoral Fellow, from Oct 2025]
- Charles-Gerard Lucas [Inria, Post-Doctoral Fellow, from Oct 2025]
- Giacomo Meanti [Inria, Post-Doctoral Fellow]
- Romain Menegaux [Inria, until Jan 2025]
- Scott Pesme [Inria, Post-Doctoral Fellow]
PhD Students
- Yedidia Agnimo [Ekimetrics, CIFRE, from Jul 2025]
- Loic Arbez [GRENOBLE INP]
- Eyal Benaroche [Meta, from Nov 2025]
- Tariq Berrada Ifriqi [Meta, CIFRE]
- Theo Bodrito [Inria, until Jul 2025, with Willow]
- Timothee Darcet [Meta, CIFRE, until Feb 2025]
- Fares El Khoury [Inria]
- Renaud Gaucher [Ecole Polytechnique]
- Bilal Yagiz Gündeger [UGA, from Nov 2025]
- Vincent Herfeld [ENHANCE LAB, CIFRE, from May 2025]
- Emmanuel Jehanno [Inria]
- Zhiqi Kang [Inria, until Sep 2025]
- Paul Liautaud [Sorbonne Univ]
- Bianca Marin Moreno [EDF, CIFRE, until Nov 2025]
- Juliette Marrie [NAVER LABS Europe, CIFRE, until Jul 2025]
- Ieva Petrulionyte [UGA]
- François Porcher [Meta, CIFRE, from Apr 2025, with WILLOW]
- Colin Prieur [Univ Montpellier, until Oct 2025]
- Romain Seailles [ENS Paris, with Willow]
- Amogh Tiwari [UGA]
- Eloise Touron [Inria]
- Kenta Vert [UGA, from Sep 2025]
- Julien Zhou [Criteo, CIFRE]
Technical Staff
- Juliette Bertrand [Inria, Engineer, until Oct 2025]
- Julien Horvat [UGA]
- Noé Peterlongo [INPG SA, Engineer, until Jan 2025]
- Thomas Ryckeboer [Inria, Engineer]
- Mathis Tailland [UGA, until Oct 2025]
Interns and Apprentices
- Theodore Batte [Polytech Grenoble, Intern, until Mar 2025]
- Augustin Cablant [Criteo, Intern, from May 2025 until Nov 2025]
- Romain Forestier [UGA, Intern, from May 2025 until Jun 2025]
- Manuela Giraldo Obando [INPG SA, Intern, from Jun 2025 until Jul 2025]
- Quentin Goizet [Polytech Grenoble, Intern]
- Geraud Ilinca [Inria, Intern, from Mar 2025 until Sep 2025]
- Lucas Montigon [Polytech Grenoble, Intern, until Mar 2025]
- Carlos Inaki Roman Martinez [UGA, Intern, from Feb 2025 until Jul 2025]
- Morgan Scalabrino [Inria, Intern, from Apr 2025 until Aug 2025]
- Kenta Vert [Inria, Intern, from Apr 2025 until Aug 2025]
Administrative Assistant
- Nathalie Gillot [Inria]
Visiting Scientists
- Nassim Ait Ali Braham [DLR, until Oct 2025]
- Yusuf Mehmet Colak [UNIV PAVIE, from Oct 2025]
- François Postic [INRAE, until Sep 2025]
- Francesca Razzano [Univ Padova, from Oct 2025]
External Collaborator
- Olivier Flasseur [CNRS]
2 Overall objectives
Thoth is a computer vision and machine learning team. Our initial goal was to develop machine learning models for analyzing the massive amounts of visual data that are currently available on the web. Since then, the focus of the team has become more diverse. More precisely, we share a common objective of developing machine learning models that are robust and efficient (in terms of computational cost and data requirements).
Our main research directions are the following ones:
- visual understanding from limited annotations and data: Many state-of-the-art computer vision models are typically trained on a huge corpus of fully annotated data. We want to reduce the cost by developing new algorithms for unsupervised, self-supervised, continual, or incremental learning.
- efficient deep learning models, from theory to applications: We want to invent a new generation of machine learning models (in particular deep learning) with theoretical guarantees, efficient algorithms, and a wide range of applications. We develop for instance models for images, videos, graphs, or sequences.
- statistical machine learning and optimization: we are also developing efficient machine learning methods, with a focus on stochastic optimization for processing large-scale data, and online learning.
- pluri-disciplinary collaborations: Machine learning being at the crossroads of several disciplines, we have successfully conducted collaborations in scientific domains that are relatively far from our domains of expertise. These fields are producing massive amounts of data and are in dire need of efficient tools to make predictions or interpretations. For example, we have had the chance to collaborate with many colleagues from natural language processing, robotics, neuroimaging, computational biology, genomics, and astrophysics for exoplanet detection, and we are currently involved in several remote sensing and hyperspectral imaging projects thanks to Jocelyn Chanussot (hosted by Thoth in the 2019 to 2022 period, now an Inria senior scientist on leave from Grenoble INP since September 2023).
3 Research program
3.1 Designing and learning structured models
The task of understanding image and video content has been interpreted in several ways over the past few decades, namely as image classification, object detection, recognition of objects and their spatial extents in an image, and recovery of scene geometry. However, addressing all these problems individually provides us with a partial understanding of the scene at best, leaving much of the visual data unexplained.
One of the main goals of this research axis is to go beyond the initial attempts that consider only a subset of tasks jointly, by developing novel models for a more complete understanding of scenes to address all the component tasks. We propose to incorporate the structure in image and video data explicitly into the models. In other words, our models aim to satisfy the complex sets of constraints that exist in natural images and videos. Examples of such constraints include: (i) relations between objects, e.g., shop signs indicate the presence of buildings, (ii) higher-level semantic relations involving the type of scene, geographic location, and the plausible actions as a global constraint, e.g., an image taken at a swimming pool is unlikely to contain cars, (iii) relating objects occluded in some of the video frames to content in other frames, where they are more clearly visible as the camera or the object itself moves, with the use of long-term trajectories and video object proposals.
This research axis will focus on two topics. The first is developing deep features for video. This involves designing rich features available in the form of long-range temporal interactions among pixels in a video sequence to learn a representation that is truly spatio-temporal in nature. The second topic is aimed at learning models that capture the relationships among several objects and regions in a single image scene, and additionally, among scenes in the case of an image collection or a video. The main scientific challenges in this topic stem from learning the structure of the probabilistic graphical model as well as the parameters of the cost functions quantifying the relationships among its entities. In the following, we present work related to both of these topics and then elaborate on our research directions.
- Deep features for vision. Deep learning models provide a rich representation of complex objects but in return have a large number of parameters. Thus, to work well on difficult tasks, a large amount of data is required. In this context, video presents several advantages: objects are observed from a large range of viewpoints, motion information allows the extraction of moving objects and parts, and objects can be differentiated by their motion patterns. We initially plan to develop deep features for videos that incorporate temporal information at multiple scales. We then plan to further exploit the rich content in video by incorporating additional cues such as minimal prior knowledge of the object of interest, with the goal of learning a representation that is more appropriate for video understanding. In other words, a representation that is learned from video data and targeted at specific applications.
- Structured models. The interactions among various elements in a scene, such as the objects and regions in it, the motion of object parts or entire objects themselves, form a key element for understanding image or video content. These rich cues define the structure of visual data and how it evolves spatio-temporally. We plan to develop a novel graphical model to exploit this structure. The main components in this graphical model are spatio-temporal regions (in the case of video, or simply image regions), which can represent object parts or entire objects themselves, and the interactions among several entities. The dependencies among the scene entities are defined with a higher-order or a global cost function. A higher-order constraint is a generalization of the pairwise interaction term, and is a cost function involving more than two components in the scene, e.g., several regions, whereas a global constraint imposes a cost term over the entire image or video, such as prior knowledge of the number of people expected in the scene. The constraints we plan to include generalize several existing methods, which are limited to pairwise interactions or a small restrictive set of higher-order costs. In addition to learning the parameters of these novel functions, we will focus on learning the structure of the graph itself, a challenging problem that is seldom addressed in current approaches. This provides an elegant way to go beyond state-of-the-art deep learning methods, which are limited to learning the high-level interaction among parts of an object, by learning the relationships among objects.
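One way to write down the kind of cost function described in the structured-models item is as a sum of unary, pairwise, higher-order, and global terms. The notation below is an illustrative assumption and does not come from the report: $x_i$ is the label of region $i$ in the set of regions $\mathcal{V}$, $\mathcal{E}$ is the set of interacting pairs, and $\mathcal{C}$ is a set of groups containing more than two regions.

```latex
E(x) = \sum_{i \in \mathcal{V}} \phi_i(x_i)
     + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j)
     + \sum_{c \in \mathcal{C}} \psi_c(x_c)
     + \Psi_{\mathrm{global}}(x)
```

In this sketch, the third sum collects the higher-order terms, and $\Psi_{\mathrm{global}}$ plays the role of a global constraint such as a prior on the number of people in the scene. Learning the structure of the graph would then correspond to learning $\mathcal{E}$ and $\mathcal{C}$ in addition to the parameters of each term.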
3.2 Learning of visual models from minimal supervision
Today's approaches to visual recognition learn models for a limited and fixed set of visual categories with fully supervised classification techniques. This paradigm was adopted in the early 2000s, and within it enormous progress has been made over the last decade.
The scale and diversity in today's large and growing image and video collections (such as broadcast archives and personal image/video collections) call for a departure from the current paradigm. This is the case because, to answer queries about such data, it is infeasible to learn the models of visual content by manually and precisely annotating every relevant concept, object, scene, or action category in a representative sample of everyday conditions. For one, it will be difficult, or even impossible, to decide a priori what the relevant categories and the proper granularity level are. Moreover, the cost of such annotations would be prohibitive in most application scenarios. One of the main goals of the Thoth project-team is to develop a new framework for learning visual recognition models by actively exploring large digital image and video sources (off-line archives as well as growing on-line content), and exploiting the weak supervisory signal provided by the accompanying metadata (such as captions, keywords, tags, subtitles, or scripts) and audio signal (from which we can for example extract speech transcripts, or exploit speaker recognition models).
Textual metadata has traditionally been used to index and search for visual content. The information in metadata is, however, typically sparse (e.g., the location and overall topic of newscasts in a video archive) and noisy (e.g., a movie script may tell us that two persons kiss in some scene, but not when, and the kiss may occur off the screen or not have survived the final cut). For this reason, metadata search should be complemented by visual content based search, where visual recognition models are used to localize content of interest that is not mentioned in the metadata, to increase the usability and value of image/video archives. The key insight that we build on in this research axis is that while the metadata for a single image or video is too sparse and noisy to rely on for search, the metadata associated with large video and image databases collectively provide an extremely versatile source of information to learn visual recognition models. This form of “embedded annotation” is rich, diverse and abundantly available. Mining these correspondences from the web, TV and film archives, and online consumer-generated content sites such as Flickr, Facebook, or YouTube helps make the learned models representative of many different situations, unlike models learned from manually collected, fully supervised training data sets, which are often biased.
The approach we propose to address the limitations of the fully supervised learning paradigm aligns with “Big Data” approaches developed in other areas: we rely on the orders-of-magnitude-larger training sets that have recently become available with metadata to compensate for less explicit forms of supervision. This will form a sustainable approach to learn visual recognition models for a much larger set of categories with little or no manual intervention. Reducing and ultimately removing the dependency on manual annotations will dramatically reduce the cost of learning visual recognition models. This in turn will allow such models to be used in many more applications, and enable new applications based on visual recognition beyond a fixed set of categories, such as natural language based querying for visual content. This is an ambitious goal, given the sheer volume and intrinsic variability of the everyday visual content available on-line, and the lack of a universally accepted formalism for modeling it. Yet, the potential payoff is a breakthrough in visual object recognition and scene understanding capabilities.
This research axis is organized into the following three sub-tasks:
- Weakly supervised learning. For object localization we will go beyond current methods that learn one category model at a time and develop methods that learn models for different categories concurrently. This allows “explaining away” effects to be leveraged, i.e., if a certain region in an image has been identified as an instance of one category, it cannot be an instance of another category at the same time. For weakly supervised detection in video we will consider detection proposal methods. While these are effective for still images, recent approaches for the spatio-temporal domain need further improvements to be similarly effective. Furthermore, we will exploit appearance and motion information jointly over a set of videos. In the video domain we will also continue to work on learning recognition models from subtitle and script information. The basis of leveraging the script data which does not have a temporal alignment with the video is to use matches in the narrative in the script and the subtitles (which do have a temporal alignment with the video). We will go beyond simple correspondences between names and verbs relating to self-motion, and match more complex sentences related to interaction with objects and other people. To deal with the limited number of occurrences of such actions in a single movie, we will consider approaches that learn action models across a collection of movies.
- Online learning of visual models. As a larger number of visual category models is being learned, online learning methods become important, since new training data and categories will arrive over time. We will develop online learning methods that can incorporate new examples for existing category models, and learn new category models from few examples by leveraging similarity to related categories using multi-task learning methods. Here we will develop new distance-based classifiers and attribute and label embedding techniques, and explore the use of NLP techniques such as skipgram models to automatically determine between which classes transfer should occur. Moreover, NLP will be useful in the context of learning models for many categories to identify synonyms, and to determine cases of polysemy (e.g., jaguar car brand vs. jaguar animal), and merge or refine categories accordingly. Ultimately this will result in methods that are able to learn an “encyclopedia” of visual models.
- Visual search from unstructured textual queries. We will build on recent approaches that learn recognition models on-the-fly (as the query is issued) from generic image search engines such as Google Images. While it is feasible to learn models in this manner in a matter of seconds, it is challenging to use the model to retrieve relevant content in real-time from large video archives of more than a few thousand hours. To achieve this requires feature compression techniques to store visual representations in memory, and cascaded search techniques to avoid exhaustive search. This approach, however, leaves untouched the core problem of how to associate visual material with the textual query in the first place. The second approach we will explore is based on image annotation models. In particular we will go beyond image-text retrieval methods by using recurrent neural networks such as Elman networks or long short-term memory (LSTM) networks to generate natural language sentences to describe images.
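To make the online-learning item above more concrete, here is a minimal sketch of one distance-based classifier, a nearest-class-mean rule with running-mean updates. It is only an illustration of the general idea (new examples and new categories can be added without retraining); the class name, method names, and toy feature vectors are hypothetical and do not describe the team's methods.

```python
import math


class NearestClassMean:
    """Toy distance-based classifier that is updated one example at a time."""

    def __init__(self):
        self.means = {}   # label -> running mean of the feature vectors
        self.counts = {}  # label -> number of examples seen so far

    def update(self, label, x):
        # A label seen for the first time simply creates a new category.
        if label not in self.means:
            self.means[label] = list(x)
            self.counts[label] = 1
            return
        # Incremental mean update: m <- m + (x - m) / n.
        self.counts[label] += 1
        n = self.counts[label]
        self.means[label] = [m + (xi - m) / n for m, xi in zip(self.means[label], x)]

    def predict(self, x):
        # Assign the label whose class mean is closest in Euclidean distance.
        return min(self.means, key=lambda c: math.dist(self.means[c], x))


clf = NearestClassMean()
clf.update("cat", [1.0, 1.0])
clf.update("cat", [1.2, 0.8])
clf.update("dog", [4.0, 4.0])
print(clf.predict([1.1, 1.0]))

# A new category can be introduced later from a single example.
clf.update("bird", [8.0, 0.0])
print(clf.predict([7.5, 0.5]))
```

The update cost is independent of the number of examples already seen, which is the property that makes this family of classifiers attractive when data and categories arrive over time.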
3.3 Large-scale learning and optimization
We have entered an era of massive data acquisition, leading to the revival of an old scientific utopia: it should be possible to better understand the world by automatically converting data into knowledge. It is also leading to a new economic paradigm, where data is a valuable asset and a source of activity. Therefore, developing scalable technology to make sense of massive data has become a strategic issue. Computer vision has already started to adapt to these changes.
In particular, very high-dimensional models such as deep networks are becoming highly popular and successful for visual recognition. This change is closely related to the advent of big data. On the one hand, these models involve a huge number of parameters and are rich enough to represent complex objects such as natural images or text corpora well. On the other hand, they are prone to overfitting (fitting too closely to training data without being able to generalize to new unseen data) despite regularization; to work well on difficult tasks, they require a large amount of labeled data that has been available only recently. Other factors may explain their success: the deep learning community has made significant engineering efforts, making it possible to learn in a day on a GPU large models that would have required weeks of computations on a traditional CPU, and it has accumulated enough empirical experience to find good hyper-parameters for its networks.
Learning the huge number of parameters of deep hierarchical models requires scalable optimization techniques and large amounts of data to prevent overfitting. This immediately raises two major challenges: how to learn without large amounts of labeled data, or with weakly supervised annotations, and how to efficiently learn such high-dimensional models. To answer these challenges, we will concentrate on the design and theoretical justifications of deep architectures including our recently proposed deep kernel machines, with a focus on weakly supervised and unsupervised learning, and develop continuous and discrete optimization techniques that push the state of the art in terms of speed and scalability.
This research axis will be developed into three sub-tasks:
- Deep kernel machines for structured data. Deep kernel machines combine advantages of kernel methods and deep learning. Both approaches rely on high-dimensional models. Kernels implicitly operate in a space of possibly infinite dimension, whereas deep networks explicitly construct high-dimensional nonlinear data representations. Yet, these approaches are complementary: Kernels can be built with deep learning principles such as hierarchies and convolutions, and approximated by multilayer neural networks. Furthermore, kernels work with structured data and have well understood theoretical principles. Thus, a goal of the Thoth project-team is to design and optimize the training of such deep kernel machines.
- Large-scale parallel optimization. Deep kernel machines produce nonlinear representations of input data points. After encoding these data points, a learning task is often formulated as a large-scale convex optimization problem; for example, this is the case for linear support vector machines, logistic regression classifiers, or more generally many empirical risk minimization formulations. We intend to pursue recent efforts for making convex optimization techniques that are dedicated to machine learning more scalable. Most existing approaches address scalability issues either in model size (meaning that the function to minimize is defined on a domain of very high dimension), or in the amount of training data (typically, the objective is a large sum of elementary functions). There is thus considerable room for improvement for techniques that jointly take these two criteria into account.
- Large-scale graphical models. To represent structured data, we will also investigate graphical models and their optimization. The challenge here is two-fold: designing an adequate cost function and minimizing it. While several cost functions are possible, their utility will be largely determined by the efficiency and the effectiveness of the optimization algorithms for solving them. It is a combinatorial optimization problem involving billions of variables and is NP-hard in general, requiring us to go beyond the classical approximate inference techniques. The main challenges in minimizing cost functions stem from the large number of variables to be inferred, the inherent structure of the graph induced by the interaction terms (e.g., pairwise terms), and the high-arity terms which constrain multiple entities in a graph.
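For reference, the empirical risk minimization problems mentioned in the parallel-optimization item are commonly written in the form below. The symbols are assumed here for illustration: $n$ training pairs $(x_i, y_i)$, a parameter vector $w$ of dimension $p$, a loss function $\ell$, and a regularizer $\Omega$ weighted by $\lambda \geq 0$.

```latex
\min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, w^\top x_i\right) + \lambda \, \Omega(w)
```

With this notation, scalability in the amount of training data corresponds to large $n$ (the large sum of elementary functions), and scalability in model size corresponds to large $p$. Logistic regression uses $\ell(y, z) = \log(1 + e^{-yz})$ and a linear support vector machine uses the hinge loss $\ell(y, z) = \max(0, 1 - yz)$, both with labels $y \in \{-1, +1\}$.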
4 Application domains
4.1 Visual applications
Any solution to automatically understanding images and videos on a semantic level will have an immediate impact on a wide range of applications. For example:
- Semantic-level image and video access is highly relevant for visual search on the Web, in professional archives and personal collections.
- Visual data organization is applicable to organizing family photo and video albums as well as to large-scale information retrieval.
- Visual object recognition has potential applications ranging from autonomous driving, to service robotics for assistance in day-to-day activities as well as the medical domain.
- Real-time scene understanding is relevant for human interaction through devices such as HoloLens or Oculus Rift.
4.2 Pluri-disciplinary research
Machine learning is intrinsically pluri-disciplinary. By developing large-scale machine learning models and algorithms for processing data, the Thoth team became naturally involved in pluri-disciplinary collaborations that go beyond visual modelling. During the last few years, Thoth has conducted several collaborations in other fields such as neuroimaging, bioinformatics, ecology, natural language processing, and remote sensing.
5 Social and environmental responsibility
5.1 Footprint of research activities
Compute
A significant amount of the team’s computations are performed on the Jean Zay national cluster. According to the cluster's reporting platform, the team used 50k normalized GPU hours, which amounts to 1.2 tons eqCO2. Besides computations performed on this cluster, the team maintained its own cluster, on which part of the computations are done as well. Assuming 10 GPUs are used at all times (which is a rather generous estimate), this amounts to less than 100k GPU hours over the year. Most of these machines are hosted in the datacenter of the IMAG building, which is probably slightly less efficient than the GENCI infrastructure. Overall, we estimate our local consumption to be under 3 tons eqCO2.
In total, we estimate the emissions of the team's compute to be about 4 tons eqCO2. While we do not report the impact in terms of resources, the team made a special effort to keep local computing servers running for as long as possible, upgrading them when possible rather than replacing them.
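The orders of magnitude above can be checked with a few lines of arithmetic. The sketch below takes the Jean Zay figures and the 10-GPU assumption from the text; the one added assumption, which is not in the report, is that a local GPU hour emits as much as a Jean Zay GPU hour. Since the local datacenter is described as probably less efficient, the resulting local figure should be read as a lower-end estimate. Variable names are illustrative.

```python
# Figures quoted in the report.
JEAN_ZAY_GPU_HOURS = 50_000   # normalized GPU hours used on Jean Zay
JEAN_ZAY_TONS_CO2E = 1.2      # reported emissions for those hours

# Report's assumption: 10 local GPUs busy around the clock for one year.
LOCAL_GPUS = 10
HOURS_PER_YEAR = 24 * 365

# Emission factor implied by the Jean Zay figures, in kg CO2e per GPU hour.
kg_per_gpu_hour = JEAN_ZAY_TONS_CO2E * 1000 / JEAN_ZAY_GPU_HOURS

local_gpu_hours = LOCAL_GPUS * HOURS_PER_YEAR

# Added assumption (not from the report): same emission factor locally.
local_tons = local_gpu_hours * kg_per_gpu_hour / 1000

print(f"implied factor:  {kg_per_gpu_hour * 1000:.0f} g CO2e per GPU hour")
print(f"local GPU hours: {local_gpu_hours}")
print(f"local estimate:  {local_tons:.2f} t CO2e")
print(f"total estimate:  {JEAN_ZAY_TONS_CO2E + local_tons:.2f} t CO2e")
```

Under these assumptions the local usage is 87,600 GPU hours (below the 100k bound) and roughly 2.1 tons eqCO2, which leaves headroom for a less efficient datacenter within the stated bound of 3 tons and is consistent with a total of about 4 tons.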
This does not count the Dino (V2 and V3) models, which are significantly more expensive to train but are also significantly more impactful than an average research paper, being used in more than 10K scientific projects, and for which full emissions data is available.
Travel
The other main CO2eq footprint is international flights. While we did not gather specific numbers, team members take special care in reducing their plane travels (several permanent researchers have not traveled by plane for several years), refusing distant invitations, as well as encouraging less travel-hungry community practices. This has led to a drastic reduction of our travel impact over the years, which we will try to quantify for the next activity report.
5.2 Impact of research results
A large part of the Thoth team's research contributes to advancing the field of machine learning as a whole. This improves and promotes Artificial Intelligence tools, which have a large, still growing, and controversial societal impact (automation, recommendation algorithms, mass surveillance...). Besides these impacts, machine learning has a substantial (and also growing) environmental footprint, and is especially prone to the rebound effect, so that efficiency improvements alone do not reduce this impact.
Beyond methodological contributions, team members make more targeted applied contributions that leverage Machine Learning for advancing other sciences (e.g., astrophysics, earth science, physics simulations…). Some of these projects focus on reducing carbon footprints (e.g., by making electricity management more efficient), or preserving biodiversity (e.g., by better understanding ecosystem responses to human pressure and global warming).
“Environmentally friendly” contributions do not offset the negative socio-environmental impacts of the current global AI race, which should be tackled at a larger scale. Hence, Thoth team members are involved at several levels (scientific policy, popularization of science, local socio-environmental initiatives) to support meaningful decision-making regarding these issues and future technological developments at a broader level.
6 Highlights of the year
6.1 Awards
- Prix Jeunes Talents L'Oréal-UNESCO 2025 for Bianca Marin Moreno
- Karteek Alahari received the Outstanding IJCV Editorial Board Member Award.
- Julien Mairal received a top reviewer award at NeurIPS 2025.
- Michael Arbel received a top reviewer award at AISTATS 2025.
- The spin-off Enhance Lab received the i-Lab prize from BPI.
- J. Chanussot received the 2025 IEEE GRSS Highest Impact Paper Award (HIPA), selected from the 18670 papers published in the journals of the IEEE Geoscience and Remote Sensing Society in 2020-2024.
- J. Chanussot was recognized as a Highly Cited Researcher (Clarivate Analytics).
7 Latest software developments, platforms, open data
7.1 Latest software developments
7.1.1 Cyanure
- Name: Cyanure: An Open-Source Toolbox for Empirical Risk Minimization
- Functional Description: Cyanure is an open-source C++ software package with a Python interface. The goal of Cyanure is to provide state-of-the-art solvers for learning linear models, based on variance-reduced stochastic optimization with acceleration mechanisms and Quasi-Newton principles. Cyanure can handle a large variety of loss functions (logistic, square, squared hinge, multinomial logistic) and regularization functions (l2, l1, elastic-net, fused Lasso, multi-task group Lasso). It provides a simple Python API, which is very close to that of scikit-learn, and which should be extended to other languages such as R or Matlab in the near future.
- Release Contributions: packaging on conda and PyPI + various improvements
- URL:
- Contact: Julien Mairal
- Participant: 2 anonymous participants
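To illustrate what variance-reduced stochastic optimization means in practice, the sketch below implements SVRG, one well-known member of that family, for l2-regularized logistic regression in pure Python. It is not Cyanure's code and does not use Cyanure's API; the function names, the step size, and the toy data are assumptions chosen for a small self-contained example, and acceleration and Quasi-Newton mechanisms are left out.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_i(w, x, y, lam):
    # Gradient of log(1 + exp(-y <w, x>)) + (lam / 2) ||w||^2, with y in {-1, +1}.
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    coeff = -y * sigmoid(-margin)
    return [coeff * xj + lam * wj for xj, wj in zip(x, w)]


def full_grad(w, X, Y, lam):
    # Average of the per-example gradients over the whole data set.
    n = len(X)
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        g = [a + b / n for a, b in zip(g, grad_i(w, x, y, lam))]
    return g


def svrg(X, Y, lam=0.1, step=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Snapshot of the iterate and its exact gradient, refreshed every epoch.
        anchor = list(w)
        mu = full_grad(anchor, X, Y, lam)
        for _ in range(n):
            i = rng.randrange(n)
            g_w = grad_i(w, X[i], Y[i], lam)
            g_a = grad_i(anchor, X[i], Y[i], lam)
            # Variance-reduced direction: unbiased, and its variance shrinks
            # as both w and the snapshot approach the optimum.
            w = [wj - step * (gw - ga + m)
                 for wj, gw, ga, m in zip(w, g_w, g_a, mu)]
    return w


X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
Y = [1, 1, -1, -1]
w = svrg(X, Y)
g = full_grad(w, X, Y, 0.1)
print("weights:", w)
print("gradient norm:", math.sqrt(sum(v * v for v in g)))
```

Unlike plain stochastic gradient descent, this scheme can use a constant step size, because the correction term removes most of the sampling noise near the solution.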
7.1.2 MLXP
- Name: Machine Learning eXperimentalist for Python
- Keywords: Reproducibility, Replication and consistency, Machine learning
- Functional Description: MLXP is an open-source, simple, and lightweight experiment management tool based on Python. It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility. As an open-source package, MLXP facilitates experiment launching, logging, and efficient result exploitation. Key components include automated job launching and hierarchical configuration files, logging of experiment outputs along with metadata, automated code and job version management, seamless multi-job submission to an HPC job scheduler, and intuitive result exploitation capabilities including querying results, grouping and aggregation operations.
- URL:
- Contact: Michael Arbel
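As a generic illustration of what grouping and aggregation over logged experiment results involves, the sketch below averages a metric across runs that share a configuration value. It deliberately uses only the Python standard library and does not use or reproduce MLXP's actual API; the record layout, the `group_mean` helper, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical logged runs: each record pairs a configuration with a final metric.
runs = [
    {"config": {"lr": 0.1, "seed": 0}, "val_acc": 0.5},
    {"config": {"lr": 0.1, "seed": 1}, "val_acc": 0.75},
    {"config": {"lr": 0.01, "seed": 0}, "val_acc": 0.25},
    {"config": {"lr": 0.01, "seed": 1}, "val_acc": 0.5},
]


def group_mean(records, key, metric):
    """Group records by one configuration key and average a metric per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["config"][key]].append(record[metric])
    return {value: mean(scores) for value, scores in groups.items()}


# Average validation accuracy per learning rate, across seeds.
print(group_mean(runs, key="lr", metric="val_acc"))
```

Keeping the configuration next to each logged metric is what makes this kind of query possible after the fact, which is the reproducibility benefit the description emphasizes.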
8 New results
8.1 Visual Recognition
Object-wise Distance Estimation for Event Camera Data
Participants: Nan Cai, Pia Bideau.
Event cameras provide a natural and data-efficient representation of visual information, motivating novel computational strategies for extracting visual information. Inspired by the biological vision system, in this work 26, we propose a behavior-driven approach for object-wise distance estimation from event camera data. This behavior-driven method mimics how biological systems, like the human eye, stabilize their view based on object distance: distant objects require minimal compensatory rotation to stay in focus, while nearby objects demand greater adjustments to maintain alignment. This adaptive strategy leverages natural stabilization behaviors to estimate relative distances effectively. Unlike traditional vision algorithms that estimate depth across the entire image, our approach targets local depth estimation within a specific region of interest. By aligning events within a small region, we estimate the angular velocity required to stabilize the image motion. We demonstrate that, under certain assumptions, the compensatory rotational flow is inversely proportional to the object's distance. The proposed approach achieves new state-of-the-art accuracy in distance estimation on the EVIMO2 dataset.
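The inverse relation between compensatory rotation and distance can be illustrated with elementary geometry. The sketch below assumes, as one simple setting that may differ from the assumptions used in the paper, a static object and a camera translating at speed v perpendicular to the line of sight: keeping the object stabilized then requires an angular velocity of about v / d, so d is about v / omega. Function names and numbers are hypothetical.

```python
def distance_from_stabilizing_rotation(lateral_speed, angular_velocity):
    """Distance d = v / omega under the pure lateral translation assumption.

    lateral_speed is in m/s, angular_velocity in rad/s; the result is in metres.
    """
    if angular_velocity <= 0:
        raise ValueError("angular velocity must be positive")
    return lateral_speed / angular_velocity


def relative_distance(omega_a, omega_b):
    """Ratio d_a / d_b of two object distances; the unknown speed cancels out."""
    if omega_a <= 0 or omega_b <= 0:
        raise ValueError("angular velocities must be positive")
    return omega_b / omega_a


# A nearby object needs a larger compensatory rotation than a distant one.
near = distance_from_stabilizing_rotation(1.0, 0.5)
far = distance_from_stabilizing_rotation(1.0, 0.125)
print(near, far, relative_distance(0.5, 0.125))
```

The second helper shows why relative distances are recoverable even when the camera speed is unknown: only the ratio of the two stabilizing angular velocities matters.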
Figure
Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation
Participants: Runfeng Qu, Ole Hall, Pia Bideau, Julie Ouerfelli-Ethier, Martin Rolfs, Klaus Obermayer, Olaf Hellwich.
Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this by implementing debiasing strategies, but often at the cost of spatial understanding—resulting in over-reliance on semantic priors. In 37, we introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels guiding ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves existing Unbiased-SGG methods in their spatial understanding as demonstrated by the Pairwise Localization Average Precision.
Code is available on GitHub.
Figure
Watching Swarm Dynamics from Above: A Framework for Advanced Object Tracking in Drone Videos
Participants: Pia Bideau, Duc Pham, Félicie Dhellemmes, Matthew Hansen, Jens Krause.
Easily accessible technologies, such as drones equipped with diverse onboard sensors, have greatly expanded opportunities to study animal behavior in natural environments. However, analyzing large volumes of unlabeled video data, often spanning hours, remains a significant challenge for machine learning, particularly in computer vision. Existing approaches typically process only a small number of frames, and accurate georeferencing of tracked positions is still largely unresolved, particularly in dynamic environments where static landmarks cannot be established. In this work, we focus on long-term tracking of animal behavior in real-world geographic coordinates. To address this challenge, we utilize classical probabilistic methods for state estimation, such as particle filtering. Particle filters offer a useful algorithmic structure for recursively incorporating new incoming information, thus ensuring temporal consistency. By incorporating recent developments in semantic object segmentation, we enable continuous tracking of rapidly evolving object formations, even in scenarios with limited data availability. We propose a novel approach for tracking schools of fish in the open ocean from drone videos. Our framework not only performs classical object tracking in image coordinates, but also tracks the position and spatial expansion of the fish school in geographic coordinates by fusing video data with the drone's onboard sensor information (GPS and IMU). No landmarks with known geographic coordinates are required, making the proposed method adaptable to unstructured, dynamic environments like the open ocean, where static landmarks are unavailable. With this, the presented framework enables researchers to study the collective behavior of fish schools within their social and environmental context.
Code and the newly introduced dataset for tracking collective animal behavior over long time horizons in marine environments are available here.
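To illustrate the recursive structure of particle filtering mentioned above, here is a minimal bootstrap particle filter for a one-dimensional state. It is a generic textbook sketch, not the tracking framework itself: the random-walk motion model, the Gaussian observation model, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def particle_filter(observations, n_particles=500, process_std=0.5,
                    obs_std=1.0, seed=0):
    """Bootstrap particle filter for a 1D random-walk state (illustrative)."""
    rng = random.Random(seed)
    # Initialise particles around the first observation.
    particles = [rng.gauss(observations[0], obs_std) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: propagate each particle through the (assumed) motion model.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # Update: weight particles by the Gaussian observation likelihood.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # Estimate: posterior mean under the weighted particle set.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample: draw a new particle set proportionally to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


estimates = particle_filter([5.0] * 20)
```

Each new observation is folded into the current particle set, so the cost per frame is constant, which suits videos spanning hours.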
Figure
LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes.
Participants: Juliette Marrie, Romain Menegaux, Michael Arbel, Diane Larlus, Julien Mairal.
In 34, we address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into 3D Gaussian Splatting scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion enriches features from a given model, such as CLIP, by leveraging 3D geometry and pairwise similarities induced by another strong model such as DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object detection tasks, highlighting the versatility of our approach.
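The aggregation-plus-diffusion idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the LUDVIG implementation: `weights[i][p]` stands for a hypothetical non-negative contribution of Gaussian `i` to pixel `p`, and the diffusion step is a generic neighbourhood averaging over a hypothetical similarity graph.

```python
def uplift_features(pixel_features, weights):
    """Weighted average of 2D pixel features for each 3D Gaussian.

    pixel_features: list of P feature vectors of length D.
    weights: weights[i][p] >= 0, assumed contribution of Gaussian i to pixel p.
    """
    dim = len(pixel_features[0])
    uplifted = []
    for w_row in weights:
        total = sum(w_row)
        if total == 0.0:
            uplifted.append([0.0] * dim)  # Gaussian not seen in any pixel
            continue
        uplifted.append([
            sum(w * f[d] for w, f in zip(w_row, pixel_features)) / total
            for d in range(dim)
        ])
    return uplifted


def diffuse(features, adjacency, steps=1, alpha=0.5):
    """Mix each node's feature with the weighted mean of its neighbours."""
    for _ in range(steps):
        new_features = []
        for i, feat in enumerate(features):
            total = sum(adjacency[i])
            if total == 0.0:
                new_features.append(list(feat))
                continue
            neigh = [
                sum(adjacency[i][j] * features[j][d]
                    for j in range(len(features))) / total
                for d in range(len(feat))
            ]
            new_features.append(
                [(1 - alpha) * f + alpha * n for f, n in zip(feat, neigh)]
            )
        features = new_features
    return features


uplifted = uplift_features([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [3.0, 1.0]])
```

No loss is minimized in this sketch: the 3D features are obtained in a single aggregation pass, which is where the speed-up over reconstruction-based training comes from.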
Figure
Cluster and Predict Latent Patches for Improved Masked Image Modeling
Participants: Maxime Oquab, Federico Baldassarre, Timothee Darcet, Julien Mairal, Piotr Bojanowski.
Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper 8, we systematically analyze target representations, loss functions, and architectures to introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state of the art, DINOv2. The approach is illustrated in Figure 5.
Figure
Entropy Rectifying Guidance for Diffusion and Flow Models
Participants: Tariq Berrada Ifriqi, Adriana Romero-Soriano, Michal Drozdzal, Jakob Verbeek, Karteek Alahari.
Guidance techniques are commonly used in diffusion and flow models to improve image quality and input consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) is the most widely adopted guidance technique. It results, however, in trade-offs across quality, diversity and consistency: improving some at the expense of others. While recent work has shown that it is possible to disentangle these factors to some extent, such methods come with the overhead of requiring an additional (weaker) model, or require more forward passes per sampling step. In this work 29, we propose Entropy Rectifying Guidance (ERG), a simple and effective guidance method based on inference-time changes in the attention mechanism of state-of-the-art diffusion transformer architectures, which allows for simultaneous improvements in image quality, diversity and prompt consistency. ERG is more general than CFG and similar guidance techniques, as it extends to unconditional sampling. We show that ERG results in significant improvements in various tasks, including text-to-image, class-conditional and unconditional image generation (see examples in Figure 6). We also show that ERG can be seamlessly combined with other recent guidance methods such as CADS and APG, further improving generation results.
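For reference, the standard CFG combination rule that ERG is compared against can be written in a few lines. The sketch below shows plain CFG only, not ERG; the function name is hypothetical and the model predictions are represented as plain lists of floats.

```python
def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction towards the conditional one.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    guidance_scale > 1 amplifies the conditioning signal.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


guided = cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0)
```

The rule needs both a conditional and an unconditional prediction at every sampling step, which is why CFG doubles the number of forward passes and has no direct counterpart for purely unconditional sampling.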
Figure
Boosting Latent Diffusion with Perceptual Objectives
Participants: Tariq Berrada Ifriqi, Pietro Astolfi, Melissa Hall, Marton Havasi, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari, Michal Drozdzal, Jakob Verbeek.
Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remedy this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL) 23. This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative – with boosts between and in FID – and qualitative results when using our perceptual loss (see examples in Figure 7).
Figure
Lightweight Structure-Aware Attention for Visual Understanding
Participants: Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari.
The attention operator has been widely used as a basic building block in visual understanding, since it provides some flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) the complexity in computation and memory is quadratic in the sequence length. In this work 13, we propose a novel attention operator, called Lightweight Structure-aware Attention (LiSA), which has better representation power with log-linear complexity (see Figure 8). Our operator transforms the attention kernels to be more discriminative by learning structural patterns. These structural patterns are encoded by exploiting a set of relative position embeddings (RPEs) as multiplicative weights, thereby improving the representation power of the attention kernels. Additionally, the RPEs are approximated to obtain log-linear complexity. Our experiments and analyses demonstrate that the proposed operator outperforms self-attention and other existing operators, achieving state-of-the-art results on ImageNet-1K and other downstream tasks such as video action recognition on Kinetics-400, object detection and instance segmentation on COCO, and semantic segmentation on ADE-20K.
Figure
Source-free video domain adaptation by learning from noisy labels
Participants: Avijit Dasgupta, C. V. Jawahar, Karteek Alahari.
Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent, as they require access to the source data during the adaptation stage. In this paper 9, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher–student (TS) framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed CleanAdapt and CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available.
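The small-loss selection step described above can be sketched as follows. This is an illustration of the general principle rather than the released CleanAdapt code; the function names and the fixed `keep_ratio` parameter are assumptions.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against a (pseudo-)label."""
    return -math.log(max(probs[label], 1e-12))


def select_small_loss(pred_probs, pseudo_labels, keep_ratio=0.5):
    """Return indices of the samples with the smallest cross-entropy loss.

    Samples with a small loss are treated as likely to carry a correct
    pseudo-label and are kept for fine-tuning.
    """
    losses = [cross_entropy(p, y) for p, y in zip(pred_probs, pseudo_labels)]
    n_keep = max(1, int(keep_ratio * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
labels = [0, 0, 1, 0]  # pseudo-labels, here the argmax of each prediction
kept = select_small_loss(probs, labels, keep_ratio=0.5)
```

In this toy example the two most confident predictions (indices 0 and 2) are retained, while the two near-ambiguous ones are discarded.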
Figure
Flowception: Temporally Expansive Flow Matching for Video Generation
Participants: Tariq Berrada Ifriqi, John Nguyen, Karteek Alahari, Jakob Verbeek, Ricky T. Q. Chen.
We present Flowception 46, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation and drift, as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context (see examples in Figure 10). Compared to full-sequence flows, our method reduces training FLOPs three-fold, while also being more amenable to local attention variants and allowing the length of videos to be learned jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.
Figure
Online In-Context Distillation for Low-Resource Vision Language Models
Participants: Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, Karteek Alahari.
As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method 48, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them (see overview in Figure 11). Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to ) using scarce teacher annotations (as low as ), and competes with the teacher's zero-shot performance.
Figure
8.2 Statistical Machine Learning and Optimization
Counterfactual Learning of Stochastic Policies with Continuous Actions
Participants: Houssam Zenati, Pierre Gaillard, Julien Mairal.
Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In 20, we address the problem of counterfactual learning of stochastic policies with continuous actions, which raises difficult challenges regarding (i) data modeling, (ii) optimization, and (iii) evaluation on real data. First, we introduce a modeling strategy based on a joint kernel embedding of contexts and actions, illustrated in Figure 12, which overcomes the shortcomings of previous discretization strategies as shown in 9. Second, we empirically show that the optimization aspect of counterfactual learning is more important than previously thought, and we demonstrate the benefits of proximal point algorithms and differentiable estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.
Figure
MAP Estimation with Denoisers: Convergence Rates and Guarantees
Participants: Scott Pesme, Giacomo Meanti, Michael Arbel, Julien Mairal.
Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimisation problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate, despite the lack of general theoretical justification for this substitution. In 36, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior p. We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods. This result is illustrated in Figure 13.
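For readers less familiar with the objects involved, the two optimisation problems can be written out as below. The notation is an assumption introduced here for illustration (observation y, linear forward operator A, Gaussian noise of level sigma, step parameter tau); only the prior p and the proximal operator of the negative log-prior are named in the text above.

```latex
% MAP estimate for an inverse problem y = A x + noise (Gaussian noise assumed)
\hat{x}_{\mathrm{MAP}} \in \arg\min_{x} \; \frac{1}{2\sigma^{2}} \lVert A x - y \rVert^{2} - \log p(x)

% Proximal operator of the negative log-prior at a point z
\operatorname{prox}_{-\tau \log p}(z) = \arg\min_{x} \; \frac{1}{2\tau} \lVert x - z \rVert^{2} - \log p(x)
```

When p is log-concave, the second objective is strongly convex, so its minimiser is unique. This makes the proximal operator a well-defined target for a convergence statement.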
Figure
Logarithmic Regret for Unconstrained Submodular Maximization Stochastic Bandit
Participants: Julien Zhou, Pierre Gaillard, Thibaud Rahier, Julyan Arbel.
In 40, we address the online unconstrained submodular maximization problem (Online USM), in a setting with stochastic bandit feedback. In this framework, a decision-maker receives noisy rewards from a nonmonotone submodular function, taking values in a known bounded interval. This paper proposes Double-Greedy - Explore-then-Commit (DG-ETC), adapting the Double-Greedy approach from the offline and online full-information settings. DG-ETC satisfies a problem dependent upper bound for the -approximate pseudo-regret, as well as a problem-free one at the same time, outperforming existing approaches. To that end, we introduce a notion of hardness for submodular functions, characterizing how difficult it is to maximize them with this type of strategy.
Hardness
Locally Adaptive Online Nonparametric Regression
Participants: Paul Liautaud, Pierre Gaillard, Olivier Wintenberger.
In 32 and 31, we study online adversarial regression with convex losses against a rich class of continuous yet highly irregular prediction rules, modeled by Besov spaces with general parameters and smoothness . We introduce an adaptive wavelet-based algorithm that performs sequential prediction without prior knowledge of , and establish minimax-optimal regret bounds against any comparator in . We further design a locally adaptive extension capable of dynamically tracking spatially inhomogeneous smoothness. This adaptive mechanism adjusts the resolution of the predictions over both time and space, yielding refined regret bounds in terms of local regularity. Consequently, in heterogeneous environments, our adaptive guarantees can significantly surpass those obtained by standard global methods.
Regret
Online Learning Approach for Survival Analysis
Participants: Camila Fernandez, Pierre Gaillard, Olivier Wintenberger.
In 10, we introduce an online mathematical framework for survival analysis, allowing real-time adaptation to dynamic environments and censored data. This framework enables the estimation of event time distributions through an optimal second-order online convex optimization algorithm, Online Newton Step (ONS). This approach, previously unexplored, presents substantial advantages, including explicit algorithms with non-asymptotic convergence guarantees. Moreover, we analyze the selection of ONS hyperparameters, which depends on the exp-concavity property and has a significant influence on the regret bound. We introduce an adaptive aggregation method that ensures robustness in hyperparameter selection while maintaining fast regret bounds. These findings can extend beyond the survival analysis field, and are relevant for any case characterized by poor exp-concavity and unstable ONS. Additionally, we propose a stochastic approach for ONS that guarantees logarithmic regret in the case of an exponential hazard model. These claims are illustrated by simulation experiments, followed by an application to a real dataset. Fernandez et al. 55 also provide an experimental comparison of existing algorithms for survival analysis.
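A one-dimensional instance gives a feel for how ONS operates on censored data. The sketch below is a simplified illustration, not the algorithm analysed in the paper: it assumes a constant-hazard (exponential) model with rate theta, the per-sample negative log-likelihood theta * t - delta * log(theta), and hypothetical values for the hyperparameters `gamma` and `eps` and for the feasible interval.

```python
import random


def simulate_censored(n, rate, censor_time, seed=0):
    """Exponential event times, right-censored at a fixed time."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        event_time = rng.expovariate(rate)
        observed = min(event_time, censor_time)
        delta = 1 if event_time <= censor_time else 0  # 1 = event observed
        data.append((observed, delta))
    return data


def ons_exponential_rate(data, theta0=1.0, theta_min=0.5, theta_max=5.0,
                         gamma=0.5, eps=1.0):
    """Online Newton Step in one dimension for the hazard rate theta."""
    theta, a = theta0, eps
    for observed, delta in data:
        # Gradient of the loss theta * t - delta * log(theta).
        grad = observed - delta / theta
        # Second-order statistic: running sum of squared gradients.
        a += grad * grad
        # Newton-like step, then projection onto [theta_min, theta_max]
        # (in one dimension the projection reduces to clipping).
        theta -= grad / (gamma * a)
        theta = min(max(theta, theta_min), theta_max)
    return theta


data = simulate_censored(20000, rate=2.0, censor_time=1.0, seed=0)
estimate = ons_exponential_rate(data)
```

The step size shrinks automatically as squared gradients accumulate. A poorly chosen `gamma` slows or destabilises the iterates, which is the hyperparameter sensitivity that the aggregation method aims to mitigate.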
Error
Efficient and Near-Optimal Online Portfolio Selection
Participants: Rémi Jézéquel, Dmitrii Ostrovski, Pierre Gaillard.
In 12, we study online portfolio selection as introduced by Cover (1991), where a trader allocates wealth over assets across rounds to maximize logarithmic return. Cover’s Universal Portfolios achieve worst-case optimal regret but require costly -dimensional integration, leading to a prohibitive per-round runtime. We propose a new algorithm achieving essentially the same regret—up to constants and replacing with —with a drastically improved runtime of per round. Our method selects portfolios by minimizing logarithmic loss regularized by a log-determinant barrier, revealing connections between online portfolio selection and classical cutting-plane and interior-point methods.
Online Convex Reinforcement Learning with applications to Demand-Side Management.
Participants: Bianca Marin Moreno, Khaled Eldowa, Margaux Brégère, Pierre Gaillard, Nadia Oudjane.
To counter the challenge of integrating fluctuating renewables into the grid, devices like thermostatically controlled loads (water-heaters, air conditioners, etc) offer flexible demand. However, efficiently controlling a large population of these devices to track desired consumption signals remains a complex challenge. Existing methods lack convergence guarantees and computational efficiency, or resort to regularization techniques instead of tackling the target tracking problem directly. 14 addresses these drawbacks. We propose to model the problem as a finite horizon episodic Markov decision process, enabling us to adapt convex optimization algorithms with convergence guarantees and computational efficiency. This framework also extends to online learning scenarios, where daily control decisions are made without prior knowledge of consumer behavior and with daily-changing target profiles due to fluctuations of energy production and inflexible consumption. We introduce a new algorithm, called Online Target Tracker (OTT), the first online learning load control method, for which we prove sub-linear regret. We demonstrate our claims with realistic experiments. This combination of optimization and learning lays the groundwork for more dynamic and efficient load control methods. 33 studies a generalization of episodic Reinforcement Learning to convex losses that could be applied for Demand-Side Management in an unknown environment. By introducing a reset-free framework called the periodic framework, 49 weakens the episodic assumption to avoid having to reset the population of the devices to the initial distribution at every episode.
DSM
Optimized projection-free algorithms for online learning: construction and worst-case analysis
Participants: Julien Weibel, Pierre Gaillard, Wouter Koolen, Adrien Taylor.
In 53, we study projection-free algorithms for online learning with linear optimization oracles (Frank–Wolfe methods) to handle constrained decision sets. We propose an optimized variant of an online Frank–Wolfe algorithm with a simple potential-based analysis, and introduce a semidefinite programming framework to jointly design and analyze such algorithms. Our numerical results suggest that no pure online Frank–Wolfe method in this model class can achieve regret better than without additional assumptions. We further observe suboptimal constants in existing methods, anytime guarantees of order , and limited benefits from multiple linear optimization steps per round.
Error
Optimal and Efficient Algorithms for Multinomial Logistic Bandits
Participants: Pierre Boudart, Pierre Gaillard, Alessandro Rudi, Aadirupa Saha.
In 38 and 44, we study active online assortment optimization with preference feedback, a framework for modeling user choice and subsetwise utility maximization with applications in advertising, online retail, recommendation systems, and language model fine-tuning. Existing approaches often rely on unrealistic assumptions such as strong reference items or repeated identical assortments. In 38, we design efficient regret-minimization algorithms that remove both of these assumptions. In 44, we improve the asymptotic regret by a constant that may be exponentially large in some cases.
MNL regret
Advancing Prompt-Based Methods for Replay-Independent General Continual Learning
Participants: Zhiqi Kang, Liyuan Wang, Xingxing Zhang, Karteek Alahari.
General continual learning (GCL) is a broad concept to describe real-world continual learning (CL) problems, which are often characterized by online data streams without distinct transitions between tasks, i.e., blurry task boundaries. Such requirements result in poor initial performance, limited generalizability, and severe catastrophic forgetting, heavily impacting the effectiveness of mainstream GCL models trained from scratch (see illustration in Figure 20). While the use of a frozen pretrained backbone with appropriate prompt tuning can partially address these challenges, such prompt-based methods remain suboptimal for CL of remaining tunable parameters on the fly. In this regard, we propose an innovative approach named MISA (Mask and Initial Session Adaption) to advance prompt-based methods in GCL 30. It includes a forgetting-aware initial session adaption that employs pretraining data to initialize prompt parameters and improve generalizability, as well as a non-parametric logit mask of the output layers to mitigate catastrophic forgetting. Empirical results demonstrate substantial performance gains of our approach compared to recent competitors, especially without a replay buffer (e.g., up to 18.39, 22.06, and 11.96 points performance lead on CIFAR-100, Tiny-ImageNet, and ImageNet-R, respectively). Moreover, our approach features the plug-in nature for prompt-based methods, independence of replay, ease of implementation, and avoidance of CL-relevant hyperparameters, serving as a strong baseline for GCL research. Our source code is publicly available.
Figure
Unified Breakdown Analysis for Byzantine Robust Gossip
Participants: Renaud Gaucher, Aymeric Dieuleveut, Hadrien Hendrikx.
Distributed approaches have many computational benefits, but they are vulnerable to attacks from a subset of devices transmitting incorrect information. This work 28 investigates Byzantine-resilient algorithms in a decentralized setting, where devices communicate directly with one another. We investigate the notion of breakdown point, and show an upper bound on the number of adversaries that decentralized algorithms can tolerate. This is done through careful study of a specific graph topology, presented in Figure 21. We introduce CG+, an algorithm at the intersection of ClippedGossip and NNA, two popular approaches for robust decentralized learning. CG+ meets our upper bound, and thus obtains optimal robustness guarantees, whereas neither of the existing two does. We provide experimental evidence for this gap by presenting an attack tailored to sparse graphs which breaks NNA but against which CG+ is robust.
Figure
Byzantine-Robust Gossip: Insights from a Dual Approach
Participants: Renaud Gaucher, Aymeric Dieuleveut, Hadrien Hendrikx.
Distributed learning has many computational benefits but is vulnerable to attacks from a subset of devices transmitting incorrect information. This paper 45 investigates Byzantine-resilient algorithms in a decentralized setting, where devices communicate directly in a peer-to-peer manner within a communication network. We leverage the so-called dual approach for decentralized optimization and propose a Byzantine-robust algorithm. We provide convergence guarantees in the average consensus subcase, discuss the potential of the dual approach beyond this subcase, and re-interpret existing algorithms using the dual framework, under the general update rule presented in Figure 22. Lastly, we experimentally show the soundness of our method.
Figure
A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation
Participants: Etienne Boursier, Scott Pesme, Radu-Alexandru Dragomir.
In 25, we study the dynamics of gradient flow with small weight decay on general training losses . Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay exhibits a two-phase behaviour as . During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of . Then, at time of order , the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the -norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the grokking effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks. This mechanism is illustrated in Figure 23.
Figure
Flow Matching for Robust Simulation-Based Inference under Model Misspecification
Participants: Pierre-Louis Ruhlmann, Pedro Rodrigues, Michael Arbel, Florence Forbes.
Simulation-based inference (SBI) is transforming experimental sciences by enabling parameter estimation in complex non-linear models from simulated data. A persistent challenge, however, is model misspecification: simulators are only approximations of reality, and mismatches between simulated and real data can yield biased or overconfident posteriors. In 51, we address this issue by introducing Flow Matching Corrected Posterior Estimation (FMCPE), a framework that leverages the flow matching paradigm to refine simulation-trained posterior estimators using a small set of real calibration samples, as illustrated in Figure 24. Our approach proceeds in two stages: first, a posterior approximator is trained on abundant simulated data; second, flow matching transports its predictions toward the true posterior supported by real observations, without requiring explicit knowledge of the misspecification. This design enables FMCPE to combine the scalability of SBI with robustness to distributional shift. Across synthetic benchmarks and real-world datasets, we show that our proposal consistently mitigates the effects of misspecification, delivering improved inference accuracy and uncertainty calibration compared to standard SBI baselines, while remaining computationally efficient.
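The flow matching ingredient can be illustrated with the common linear-interpolation formulation. The sketch below is generic and not the FMCPE training code: `x0` and `x1` stand for hypothetical paired samples from the source and target distributions, and `velocity_fn` is a placeholder for a learned velocity field.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 and its target velocity."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return xt, target_velocity


def cfm_loss(velocity_fn, x0, x1, t):
    """Squared error between predicted and target velocity at time t."""
    xt, target = cfm_training_pair(x0, x1, t)
    pred = velocity_fn(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


xt, target = cfm_training_pair([0.0, 0.0], [2.0, 4.0], t=0.5)
# A velocity field that already outputs the right displacement has zero loss.
loss = cfm_loss(lambda x, t: [2.0, 4.0], [0.0, 0.0], [2.0, 4.0], t=0.5)
```

Minimising this loss over many sampled pairs and times yields a velocity field whose integration transports source samples toward the target distribution.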
Figure
Simulation-based inference of yeast centromeres
Participants: Eloïse Touron, Pedro Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel.
Chromatin folding and the spatial arrangement of chromosomes in the cell play a crucial role in DNA replication and gene expression. Improper chromatin folding can lead to malfunctions and, over time, diseases. For eukaryotes, centromeres are essential for proper chromosome segregation and folding. Despite extensive research using de novo sequencing of genomes and annotation analysis, centromere locations in yeasts remain difficult to infer and are still unknown in most species. Recently, genome-wide chromosome conformation capture coupled with next-generation sequencing (Hi-C) has become one of the leading methods to investigate chromosome structures. Some recent studies have used Hi-C data to give a point estimate of each centromere, but those approaches rely heavily on a good pre-localization. In 39, we present a novel approach that infers in a stochastic manner the locations of all centromeres in budding yeast based on both the experimental Hi-C map and simulated contact maps using a neural network model, as illustrated in Figure 25.
Figure
Dual Perspectives on Non-Contrastive Self-Supervised Learning
Participants: Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel.
The stop gradient and exponential moving average iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. In 50, we investigate these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they do not optimize the original objective, or any other smooth function, they do avoid collapse. Following prior work, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse, as shown in Figure 26. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, asymptotically stable. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
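For concreteness, the exponential moving average procedure can be sketched as below. The parameter vectors and the `momentum` value are hypothetical, and the stop-gradient is represented only implicitly: the target parameters are never updated by a gradient step, only by averaging.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Move the target parameters slowly towards the online parameters."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


target = [1.0, -1.0]
online = [0.0, 0.0]
for _ in range(3):
    # The online parameters would be updated by gradient descent here;
    # the target is treated as a constant in that step (stop gradient).
    target = ema_update(target, online, momentum=0.5)
```

Because the target is not differentiated through, the combined update is not the gradient of a single objective, which is what motivates a dynamical-systems analysis.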
Learning Theory for Kernel Bilevel Optimization
Participants: Fares El Khoury, Edouard Pauwels, Samuel Vaiter, Michael Arbel.
Bilevel optimization has emerged as a technique for addressing a wide range of machine learning problems that involve an outer objective implicitly determined by the minimizer of an inner problem. In 27, we investigate the generalization properties for kernel bilevel optimization problems where the inner objective is optimized over a Reproducing Kernel Hilbert Space. This setting enables rich function approximation while providing a foundation for rigorous theoretical analysis. In this context, we establish novel generalization error bounds for the bilevel problem under finite-sample approximation. Our approach adopts a functional perspective, inspired by (Petrulionyte et al., 2024), and leverages tools from empirical process theory and maximal inequalities for degenerate U-processes to derive uniform error bounds. The results rely on an equivalence we establish between the estimator implemented in practice and an abstract one derived using the functional perspective that is more amenable to a statistical analysis, as shown in Figure 27. These generalization error estimates allow us to characterize the statistical accuracy of gradient-based methods applied to the empirical discretization of the bilevel problem.
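For orientation, the generic problem class can be written as follows. The symbols $\omega$, $\Omega$, $h$, $\mathcal{H}$, $L_{\mathrm{out}}$ and $L_{\mathrm{in}}$ are illustrative notation chosen here, not necessarily that of 27, and the inner minimizer is assumed to be unique.

```latex
\min_{\omega \in \Omega} \; \mathcal{F}(\omega) := L_{\mathrm{out}}\bigl(\omega, h^{\star}_{\omega}\bigr)
\quad \text{subject to} \quad
h^{\star}_{\omega} = \operatorname*{arg\,min}_{h \in \mathcal{H}} \; L_{\mathrm{in}}(\omega, h),
\qquad \mathcal{H} \text{ a reproducing kernel Hilbert space.}
```

In the finite-sample setting, both objectives are replaced by empirical averages, and the question is how far the resulting empirical problem is from the one written above.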
EquiTabPFN: A Target-Permutation Equivariant Prior Fitted Network
Participants: Michael Arbel, David Salinas, Frank Hutter.
Recent foundational models for tabular data, such as TabPFN, have demonstrated remarkable effectiveness in adapting to new tasks through in-context learning. However, these models overlook a crucial equivariance property: the arbitrary ordering of target dimensions should not influence model predictions. In 22, we identify this oversight as a source of incompressible error, termed the equivariance gap, which introduces instability in predictions. To mitigate these issues, we propose a novel model designed to preserve equivariance across output dimensions, as shown in Figure 28. Our experimental results indicate that our proposed model not only addresses these pitfalls effectively but also achieves competitive benchmark performance.
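The equivariance property itself is easy to test numerically. The sketch below is not EquiTabPFN or TabPFN: it uses two toy in-context predictors written for illustration (a kernel-weighted average, which is equivariant by construction, and a variant with an artificial bias depending on the target index), and the helper `max_equivariance_violation` is a hypothetical diagnostic, not the equivariance gap as defined in 22.

```python
import math
from itertools import permutations


def kernel_predictor(train_x, train_y, query):
    """Toy in-context predictor: Gaussian-kernel-weighted average of the training targets."""
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, query))) for x in train_x]
    total = sum(weights)
    n_targets = len(train_y[0])
    return [sum(w * y[j] for w, y in zip(weights, train_y)) / total for j in range(n_targets)]


def order_dependent_predictor(train_x, train_y, query):
    """Same predictor with an artificial bias that depends on the target position."""
    base = kernel_predictor(train_x, train_y, query)
    return [p + 0.1 * j for j, p in enumerate(base)]


def max_equivariance_violation(model, train_x, train_y, query):
    """Largest gap between predicting on permuted targets and permuting the predictions."""
    n_targets = len(train_y[0])
    reference = model(train_x, train_y, query)
    worst = 0.0
    for perm in permutations(range(n_targets)):
        permuted_y = [[y[j] for j in perm] for y in train_y]
        output = model(train_x, permuted_y, query)
        expected = [reference[j] for j in perm]
        worst = max(worst, max(abs(a - b) for a, b in zip(output, expected)))
    return worst


# Hypothetical context set with two features and three target dimensions.
train_x = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
train_y = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.5], [2.0, 0.0, 1.0]]
query = [0.2, 0.8]

print(max_equivariance_violation(kernel_predictor, train_x, train_y, query))
print(max_equivariance_violation(order_dependent_predictor, train_x, train_y, query))
```

The first predictor gives a violation of zero up to rounding, while the second one changes its output when the target columns are reordered.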
8.3 Scientific Imaging and Remote Sensing
A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations
Participants: Theo Bodrito, Olivier Flasseur, Julien Mairal, Jean Ponce, Maud Langlois, Anne-Marie Lagrange.
The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. The paper 24 presents a novel statistical model that captures nuisance fluctuations using a multiscale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall tradeoff, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys. The model is illustrated in Figure 29.
Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
Participants: Giacomo Meanti, Thomas Ryckeboer, Michael Arbel, Julien Mairal.
This work 35 addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches, which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images, the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or mis-specified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort. This approach is illustrated in Figure 30.
Optimal transport unlocks end-to-end learning for single-molecule localization
Participants: Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal.
Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores (fluorescent molecules stained onto the observed specimen) over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this work 52, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. This approach is illustrated in Figure 31.
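To make the set-matching viewpoint concrete, the sketch below computes a loss between a set of predicted positions and a set of reference positions that does not depend on how either set is ordered. It is a simplified stand-in, not the loss of 52: it assumes both sets have the same size and uniform weights, in which case optimal transport reduces to an optimal one-to-one assignment, and it finds that assignment by brute force, which is only practical for a handful of points. All names are hypothetical.

```python
from itertools import permutations


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def set_matching_loss(predicted, target):
    """Mean squared distance under the best one-to-one matching of two equal-sized sets."""
    if len(predicted) != len(target):
        raise ValueError("this sketch assumes sets of equal size")
    n = len(target)
    if n == 0:
        return 0.0
    # Brute-force search over all assignments; fine for a few points only.
    best = min(
        sum(squared_distance(predicted[i], target[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n


# Two hypothetical emitters; the reference list is in a different order than the predictions.
predicted = [(0.0, 0.0), (1.0, 1.0)]
target = [(1.0, 1.1), (0.1, 0.0)]
loss = set_matching_loss(predicted, target)
print(loss)
```

Because the minimum is taken over all assignments, reordering either list leaves the value unchanged, which is the property that removes the need to pair predictions with targets by position.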
SpectralEarth: Training Hyperspectral Foundation Models at Scale
Participants: Nassim Ait Ali Braham, C. Albrecht, Julien Mairal, Jocelyn Chanussot, Y. Wang, Xiao Xiang Zhu.
Foundation models have triggered a paradigm shift in computer vision and are increasingly being adopted in remote sensing, particularly for multispectral imagery. Yet, their potential in hyperspectral imaging (HSI) remains untapped due to the absence of comprehensive and globally representative hyperspectral datasets. To close this gap, in 4 we introduce SpectralEarth, a large-scale multitemporal dataset designed to pretrain hyperspectral foundation models leveraging data from the environmental mapping and analysis program (EnMAP). SpectralEarth comprises 538 974 image patches covering 415 153 unique locations from 11 636 globally distributed EnMAP scenes spanning two years of archive. In addition, a subset of these locations includes multiple timestamps, enabling multitemporal HSI analysis. Utilizing state-of-the-art self-supervised learning algorithms, we pretrain a series of foundation models on SpectralEarth, integrating a spectral adapter into classical vision backbones to accommodate the unique characteristics of HSI. In tandem, we construct nine downstream datasets for land-cover, crop-type mapping, and tree-species classification, providing benchmarks for model evaluation. Experimental results support the versatility of our models and their generalizability across different tasks and sensors. We also highlight computational efficiency during model fine-tuning. In Figure 32, we compare the size of various datasets published for Earth observation.
MicroFlow: Domain-Specific Optical Flow for Ground Deformation Estimation in Seismic Events
Participants: Juliette Bertrand, Sophie Giffard-Roisin, James Hollingsworth, Julien Mairal.
Dense ground displacement measurements are crucial for geological studies but are impractical to collect directly. Traditionally, displacement fields are estimated using patch matching on optical satellite images from different acquisition times. While deep learning-based optical flow models are promising, their adoption in ground deformation analysis is hindered by challenges such as the absence of real ground truth, the need for sub-pixel precision, and temporal variations due to geological or anthropogenic changes. In particular, we identify that deep learning models relying on explicit correlation layers struggle to estimate small displacements in real-world conditions. Instead, we propose a model that employs iterative refinements with explicit warping layers and a correlation-independent backbone, enabling sub-pixel precision. Additionally, a non-convex variant of Total Variation regularization preserves fault-line sharpness while maintaining smoothness elsewhere. Our model significantly outperforms widely used geophysics methods on semi-synthetic benchmarks and generalizes well to challenging real-world scenarios captured by both medium- and high-resolution sensors. This work is available in the paper 43 and is illustrated in Figure 33.
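The role of a non-convex penalty can be seen on a one-dimensional profile. The sketch below compares standard Total Variation with a log-based concave penalty on a sharp step and on a smooth ramp of the same total height. The log penalty and the value of `eps` are assumptions made for illustration; the exact regularizer used in 43 may differ.

```python
import math


def total_variation(signal):
    """Standard (convex) Total Variation: sum of absolute finite differences."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


def log_total_variation(signal, eps=0.1):
    """Concave log penalty on finite differences (one possible non-convex variant)."""
    return sum(math.log(1.0 + abs(b - a) / eps) for a, b in zip(signal, signal[1:]))


# Two 1-D displacement profiles rising from 0 to 1 over 11 samples.
step = [0.0] * 6 + [1.0] * 5          # one sharp jump, as across a fault line
ramp = [i / 10.0 for i in range(11)]  # the same rise spread evenly

print(total_variation(step), total_variation(ramp))
print(log_total_variation(step), log_total_variation(ramp))
```

Standard Total Variation assigns both profiles the same cost, so it does not favor the sharp jump. Under the concave penalty, one large difference costs less than many small ones, so the step is cheaper than the ramp, while small oscillations in flat regions are still penalized.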
Leveraging very high resolution optical remote sensing data and deep learning to assess the potential for photovoltaic energy production in urban areas
Participants: Alessia Boccalatte, Jocelyn Chanussot.
Convolutional Neural Networks (CNNs) have shown remarkable success in remote sensing tasks. In urban contexts, recent research has utilized CNNs to generate rooftop segmentation masks and determine rooftop section orientation from aerial images. This cost-effective approach is especially valuable for large-scale rooftop solar potential estimations when detailed three-dimensional data is unavailable. This research, published in 3, introduces SolarMTNet, a novel multitask dense-prediction network designed for rooftop solar potential prediction using only aerial images. Unlike previous studies that focus on small manually labeled datasets (approximately 2000 scenes) and only segment rooftop orientations while typically assuming constant slopes, SolarMTNet simultaneously segments both orientations and slopes, enhancing the accuracy of solar potential estimations by 40%. SolarMTNet leverages a large, automatically labeled dataset (up to 280000 scenes) created from open-source Swiss geospatial and aerial data, significantly improving generalization. The model is trained on rooftop data from the Zurich and Geneva cantons and cross-validated on the Canton of Vaud, Switzerland. The results show a mean Intersection over Union (mIoU) of 0.67 for orientation segmentation and 0.40 for slope segmentation. The estimated irradiance exhibits an absolute mean percentage difference of only 5% compared to real solar cadaster data derived from detailed model-based calculations, primarily due to shading issues. Finally, SolarMTNet has also been tested in different geographical areas outside Switzerland (France and Germany), demonstrating consistent performance across diverse regions and pixel resolutions. The quantification of urban solar potential losses from rooftop superstructures via aerial imagery and Convolutional Neural Networks has also been considered 2.
Hyperspectral Pansharpening
Participants: Jocelyn Chanussot.
Hyperspectral (HS) pansharpening consists of fusing a high-resolution panchromatic (PAN) band and a low-resolution HS image to obtain a new image with high resolution in both the spatial and spectral domains. These remote sensing products are valuable for a wide range of applications, driving ever-growing research efforts. Nonetheless, results still do not meet application demands. In part, this comes from the technical complexity of the task: compared to multispectral (MS) pansharpening, many more bands are involved, in a spectral range only partially covered by the PAN component and with overwhelming noise. However, another major limiting factor is the absence of a comprehensive framework for the rapid development and accurate evaluation of new methods. This article attempts to address this issue. We started by designing a dataset large and diverse enough to allow reliable training (for data-driven methods) and testing of new methods. Then, we selected a set of state-of-the-art methods, following different approaches characterized by promising performance, and reimplemented them in a single PyTorch framework. Finally, we carried out a critical comparative analysis of all methods, using the most accredited quality indicators. The analysis highlights the main limitations of current solutions in terms of spectral/spatial quality and computational efficiency, and it suggests promising research directions 7.
On a related topic, another work presents a critical survey of deep learning in remote sensing image fusion 16.
Probing Synergistic High-Order Interaction for Multi-Modal Image Fusion
Participants: Jocelyn Chanussot.
Multi-modal image fusion aims to generate a fused image by integrating and distinguishing the cross-modality complementary information from multiple source images. While the cross-attention mechanism with global spatial interactions appears promising, it only captures second-order spatial interactions, neglecting higher-order interactions in both spatial and channel dimensions. This limitation hampers the exploitation of synergies between multi-modalities. To bridge this gap, we introduce in 21 a Synergistic High-order Interaction Paradigm (SHIP), designed to systematically investigate spatial fine-grained and global statistics collaborations between the multi-modal images across two fundamental dimensions: 1) Spatial dimension: we construct spatial fine-grained interactions through element-wise multiplication, mathematically equivalent to global interactions, and then foster high-order formats by iteratively aggregating and evolving complementary information, enhancing both efficiency and flexibility. 2) Channel dimension: expanding on channel interactions with first-order statistics (mean), we devise high-order channel interactions to facilitate the discernment of inter-dependencies between source images based on global statistics. We further introduce an enhanced version of the SHIP model, called SHIP++, that enhances the cross-modality information interaction representation through a cross-order attention evolving mechanism, cross-order information integration, and a residual information memorizing mechanism. Harnessing high-order interactions significantly enhances our model’s ability to exploit multi-modal synergies, leading to superior performance over state-of-the-art alternatives, as shown through comprehensive experiments across various benchmarks in two significant multi-modal image fusion tasks: pan-sharpening, and infrared and visible image fusion.
Fully-Connected Transformer for Multi-Source Image Fusion
Participants: Jocelyn Chanussot.
Multi-source image fusion combines the information coming from multiple images into a single output, thus improving imaging quality. This topic has aroused great interest in the community. How to integrate information from different sources is still a big challenge, although the existing self-attention based transformer methods can capture spatial and channel similarities. In this paper 19, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, where the existing self-attentions are considered basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method to fully exploit local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation, embedding it into the FCSA framework as a non-local prior within an optimization problem. Several different fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. Compared with state-of-the-art methods, the proposed FC-Former method exhibits robust and superior performance, showing its capability of faithfully preserving information.
GeoFlowNet-SAR: Earthquake Displacement Estimation from Synthetic Aperture Radar Images
Participants: Jocelyn Chanussot.
Displacement estimation using remote sensing images is an effective approach for assessing surface displacement caused by natural disasters like earthquakes and landslides. By employing pixel correlation algorithms, high-precision displacement maps can be generated from images taken before and after surface movement. However, traditional methods often rely on spatial regularization or frequency masking to reduce high-frequency noise, which can smooth spatial details and result in biased displacement estimates, especially near sharp discontinuities typical of earthquake surface ruptures. Moreover, subpixel displacement estimation using synthetic aperture radar (SAR) images remains a challenge compared to optical images, due to the strong impact of speckle noise. This article 18 presents GeoFlowNet-SAR, an innovative subpixel displacement estimation method leveraging SAR images. SAR offers advantages thanks to all-weather observation and high penetration, making it suitable for conditions typically challenging for optical systems in the visible light spectrum. This study uses Sentinel-1 SAR single look complex (SLC) images with dual-polarization (VV and VH modes) and interferometric wide (IW) swath mode to balance coverage and resolution. By training on simulated displacement datasets with realistic sharp discontinuities, GeoFlowNet-SAR directly predicts surface displacement fields, providing highly efficient, robust, and precise results while overcoming some limitations of traditional methods. The effectiveness of the proposed methodological contribution is first quantitatively demonstrated using synthetic simulated earthquake datasets, including comparisons with state-of-the-art correlation methods. The method is further validated using two real remote sensing datasets, from the 2019 Ridgecrest earthquake and from the 2023 Turkey–Syria earthquake. The observed results from these real datasets confirm the effectiveness of GeoFlowNet-SAR in practical applications.
Kolmogorov–Arnold Network for Hyperspectral Change Detection
Participants: Jocelyn Chanussot.
Hyperspectral change detection (HCD) techniques to monitor Earth’s surface processes have advanced markedly in recent years. Seasonal variations and associated spectral signatures as well as nonlinear noise patterns emanating from sensors and atmospheric sources pose fundamental challenges in HCD. Advanced deep learning models, such as those that leverage convolutional neural networks (3D-Siamese) or transformers (MLP-Mixer), are increasingly employed to address these challenges. However, they often need substantial training data and computational resources. Here, we show that the Kolmogorov–Arnold network (KAN) can enhance HCD capabilities without the excessive training demand of deep networks. The Kolmogorov–Arnold theorem provides the theoretical foundation for our approach, which is particularly well-suited for hyperspectral data analysis by providing a rigorous basis for handling high-dimensional spectral signatures through dimensional reduction and feature extraction. Our architectural design employs this theoretical framework by incorporating specialized neural network layers that mirror the theorem’s compositional structure, thereby facilitating efficient processing of spectral bands. By replacing the linear weighting scheme with learnable nonlinear functions, the KAN provides a unique capability to capture intricate patterns and irregularities in high-dimensional data. Here, we compare five KAN-based architectures and deep learning models such as the MLP-Mixer, 3D-Siamese, dual-branch Siamese spatial–spectral Transformer attention network (DBS3TAN), and the Swin Transformer for HCD and show that the Chebyshev-KAN model, with an average overall accuracy of 97.35% over four real-world benchmark cases, outperforms other models while having a markedly lower complexity than the deep learning models.
We also show that the choice of fit nonlinear function and model structure is more important than the number of parameters in KAN-based models 15.
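The idea of replacing linear weights with learnable nonlinear functions can be sketched for a single layer. The code below assumes one common Chebyshev parameterization, in which each edge from input `i` to output `j` carries a function written as a linear combination of Chebyshev polynomials evaluated at `tanh(x_i)`; this is an illustrative assumption, not necessarily the architecture evaluated in 15, and all names are hypothetical.

```python
import math


def chebyshev_basis(t, degree):
    """Return [T_0(t), ..., T_degree(t)] using T_{k+1}(t) = 2 t T_k(t) - T_{k-1}(t)."""
    values = [1.0, t]
    for _ in range(2, degree + 1):
        values.append(2.0 * t * values[-1] - values[-2])
    return values[: degree + 1]


def chebyshev_kan_layer(inputs, coefficients):
    """Forward pass of one layer.

    coefficients[j][i][k] is the learnable weight of T_k on the edge from
    input i to output j; inputs are squashed to [-1, 1] with tanh first.
    """
    outputs = []
    for edge_coeffs in coefficients:
        total = 0.0
        for x, coeffs in zip(inputs, edge_coeffs):
            basis = chebyshev_basis(math.tanh(x), len(coeffs) - 1)
            total += sum(c * b for c, b in zip(coeffs, basis))
        outputs.append(total)
    return outputs


# Hypothetical layer with two inputs, one output, and degree-1 edge functions.
layer_output = chebyshev_kan_layer([0.0, 1.0], [[[0.0, 1.0], [0.0, 1.0]]])
print(layer_output)
```

In such a layer the learnable parameters are the polynomial coefficients on each edge, so the choice of basis and degree shapes the model as much as the raw parameter count does.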
ECSPLAIN: Explainability Constrained-claSsifier for Pairing the detection and the Localization of moving Areas from SAR INterferograms
Participants: Jocelyn Chanussot.
Detecting slope instabilities on synthetic aperture radar (SAR) interferograms using deep learning approaches presents several challenges. This detection task suffers from the lack of transparency of deep networks, the complexity of the input data (i.e., complex values, sensitivity to distortions, and presence of counterfactuals), and the complexity of the target phenomena (i.e., the variable velocities and the complex underground processes). In this article 5, we propose a new framework called explainability-constrained classifier for pairing the detection and the localization of moving areas on interferograms (ECSPLAIN), to generate decision, localization, and segmentation maps from a single but explainable classifier network. It consists of training a classifier to detect whether an instability is located in the patch or not, and to explain its decision with a class activation map (CAM) that matches the actual location of the instability. Therefore, by using a single classifier network, the framework can pair the detection and the localization of moving areas. Four CAMs are investigated for the training of the ECSPLAIN framework. Experiments on the ISSLIDE dataset show that our proposal achieves better explainability than standard a posteriori CAMs with more than 0.20 points of improvement in terms of Dice and IoU scores. It also allows competitive performance with segmentation-only networks, with only 0.04 points of difference in terms of Dice and intersection over union (IoU) scores. Thus, the proposed method is competitive with the most efficient methods while being lighter, faster, and delivering a decision based on a human-like reasoning process. Finally, the ECSPLAIN framework is applied to enrich the ISSLIDE dataset, discovering more than 470 manually validated slope instabilities over the Alps.
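For reference, the two overlap scores quoted above can be computed from a pair of binary masks as follows. This is a generic sketch of the standard definitions, not the evaluation code used on ISSLIDE; the function name and the convention for two empty masks are assumptions.

```python
def dice_and_iou(predicted, reference):
    """Dice and IoU scores for two binary masks given as nested lists of 0/1 values."""
    pred = [bool(v) for row in predicted for v in row]
    ref = [bool(v) for row in reference for v in row]
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and r for p, r in zip(pred, ref))
    pred_area = sum(pred)
    ref_area = sum(ref)
    union = pred_area + ref_area - intersection
    if union == 0:
        # Assumed convention: two empty masks are in perfect agreement.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + ref_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 2x3 masks: a thresholded activation map and a reference mask.
predicted_mask = [[1, 1, 0], [0, 1, 0]]
reference_mask = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(predicted_mask, reference_mask)
print(dice, iou)
```

The two scores are monotonically related (Dice = 2 IoU / (1 + IoU)), so they rank methods identically on a single mask pair, although their averages over a dataset can differ.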
8.4 Other pluri-disciplinary projects
Challenges in Non-Polymeric Crystal Structure Prediction: Why a Geometric, Permutation-Invariant Loss is Needed
Participants: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin.
Crystalline structure prediction is an essential prerequisite for designing materials with targeted properties. Yet, it is still an open challenge in materials design and drug discovery. Despite recent advances in computational materials science, accurately predicting three-dimensional non-polymeric crystal structures remains elusive. In this work 47, we focus on the molecular assembly problem, where a set S of identical rigid molecules is packed to form a crystalline structure. Such a simplified formulation provides a useful approximation to the actual problem. However, while recent state-of-the-art methods have increasingly adopted sophisticated techniques, the underlying learning objective remains ill-posed. We propose a better formulation that introduces a loss function, illustrated in Figure 34, capturing key geometric molecular properties while ensuring permutation invariance over S. Remarkably, we demonstrate that within this framework, a simple regression model already outperforms prior approaches, including flow matching techniques, on the COD-Cluster17 benchmark, a curated non-polymeric subset of the Crystallography Open Database (COD).
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
Participants: Julien Mairal, Karteek Alahari, Pierre Gaillard.
In 2025, we had:
- four CIFRE PhD students with Meta: Timothée Darcet (co-advised by J. Mairal), who defended in June 2025, Eyal Benaroche, who started in December 2025, Tariq Berrada Ifriqi (co-advised by K. Alahari), and Francois Porcher (co-advised by K. Alahari), who started in April 2025.
- one CIFRE PhD student with Naver Labs Europe: Juliette Marrie (co-advised by J. Mairal and M. Arbel) who defended in June 2025.
- one CIFRE PhD student with EDF R&D: Bianca Marin Moreno who defended in October 2025 (co-advised by P. Gaillard).
- one CIFRE PhD student with Criteo: Julien Zhou (co-advised by P. Gaillard).
- one CIFRE PhD student with Ekimetrics: Yedidia Agnimo (co-advised by K. Alahari), who started in July 2025.
- one CIFRE PhD student with Enhance Lab: Vincent Herfeld (co-advised by J. Mairal).
- a collaboration led by K. Alahari with Toyota Motor Europe.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Participation in other International Programs
Project EIFFEL
Participants: Karteek Alahari, Pia Bideau.
- Title: Efficient Distillation of Foundation Models for Computer Vision
- Duration: 2025 - 2028
- Summary:
This collaborative project with South Korea is supported by the Institute of Information Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project). Its focus is on efficient foundation models. Foundation models, which have been trained on massive amounts of curated data by using huge resources, constitute one of the most recent advancements in machine learning for computer vision and other domains. These are typically produced by large corporations or as part of industrial/academic collaborations, which raises fundamental challenges for academia. One of the scientific objectives is to widen the reach of these models by proposing computationally efficient counterparts as well as variants that leverage multiple modalities, e.g., text, image, video, audio, collectively. In particular, we are interested in developing new models under challenging but realistic scenarios, such as limited data, data with a temporally evolving distribution, or low computational resources, which occur in many industrial and scientific applications.
10.2 European initiatives
10.2.1 Horizon Europe
APHELEIA
APHELEIA project on cordis.europa.eu
- Title: Reconciling Classical and Modern (Deep) Machine Learning for Real-World Applications
- Duration: From September 1, 2023 to August 31, 2028
- Partners:
  - INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- Inria contact: Julien Mairal
- Summary:
Despite the undeniable success of machine learning in addressing a wide variety of technological and scientific challenges, the current trend of training predictive models with an ever-growing number of parameters from an ever-growing amount of data is not sustainable. These huge models, often engineered by large corporations benefiting from huge computational resources, typically require learning a billion or more parameters. They have proven to be very effective in solving prediction tasks in computer vision, natural language processing, and computational biology, for example, but they mostly remain black boxes that are hard to interpret, computationally demanding, and not robust to small data perturbations.
With a strong emphasis on visual modeling, the grand challenge of APHELEIA is to develop a new generation of machine learning models that are more robust, interpretable, and efficient, and do not require massive amounts of data to produce accurate predictions. To achieve this objective, we will foster new interactions between classical signal processing, statistics, optimization, and modern deep learning. Our goal is to reduce the need for massive data by enabling scientists and engineers to design trainable machine learning models that directly encode a priori knowledge of the task semantics and data formation process, while automatically preferring simple and stable solutions over complex ones. These models will be built on solid theoretical foundations with convergence and robustness guarantees, which are important to make real-life trustworthy predictions in the wild. We will implement these ideas in an open-source software toolbox readily applicable to visual recognition and inverse imaging problems, which will also handle other modalities. This will stimulate interdisciplinary collaborations, with the potential to be a game changer in the way scientists and engineers design machine learning problems.
10.2.2 Other European programs/initiatives
J. Chanussot is involved in a project funded by the European Space Agency (ESA): ROSE-L in Harmony: EO Data Integration for Global Land Cover and Vegetation Mapping, led by the Canadian company C-Core (2025-2028).
10.3 National initiatives
10.3.1 ANR Project BONSAI
Participants: Michael Arbel.
- Project BONSAI is a multi-disciplinary project aiming at integrating knowledge produced by experts, in the form of simulators, into current machine learning frameworks through bilevel optimization for accurate and efficient inference. We address three challenges. The first one is to develop a deep learning-based approach to simulation-based inference that can adapt to data using bilevel optimization. A second challenge is to deploy the methods to real-world problems, which have their own specificities. A third challenge is to develop bilevel optimization methods that can handle the non-convexity and over-parameterization arising from using deep learning. The principal investigator is Michael Arbel, and the project involves participants from Toulouse School of Economics, TIMC team at UGA and other INRIA teams (Statify). This project started in April 2024.
10.3.2 MIAI chair: Learning Visual Representations from Interaction for Robot Manipulation Tasks
Participants: Pia Bideau, Karteek Alahari.
- How to grasp an object has been studied in computer vision and robotics, and several approaches to this problem exist: either, given the 3D shape of an object, contact points are determined that lead to a stable hand-object configuration, or, in another line of work, stable hand-object configurations are reconstructed by jointly modelling hand pose and object pose. In both cases many solutions are possible, although a majority might not be the natural approach that humans would choose, mainly because the intention behind the grasp is omitted. This project aims at learning visual representations from interaction that encode activity information. Encoding such contextual information appears to be relevant not only for synthesising feasible grasps; it is also likely to enhance future generalisation skills, facilitating adaptation across the same activity but different objects: grasping a cup to pour something into something shares a similar motion pattern with grasping a bottle to pour something into something. Inspired by the effectiveness of human grasping, we aim at finding similarly adaptable representations that are capable of guiding complex manipulation skills. To this end, we will fuse ideas relying on classical probabilistic modeling of distributions over possible motion trajectories and latent action representations from a conditional variational autoencoder (CVAE). Both of these directions come with complementary strengths and thus provide promising capabilities of modulating the degree of action abstractions at test time to enable both coarse and fine-grained control for real-world robot manipulation tasks. The chair is taking place in collaboration with Karteek Alahari, Xavier Alameda-Pineda, and Pierre-Brice Wieber. We have recruited one PhD student and have an intern starting in February 2025.
10.3.3 MIAI Cluster chair: MOnitoring natural Hazards using AI and Remote sensing (MOHAIR)
Participants: Jocelyn Chanussot.
- J. Chanussot is the co-chair, with Sophie Giffard-Roisin (IRD junior researcher, Laboratoire IsTerre) and Yajing Yang (Associate Professor, LISTIC Univ. Savoie Mont-Blanc). This project started in September 2025. It gathers members from 7 different teams of 6 laboratories in Grenoble, Annecy and Clermont-Ferrand.
Satellite based remote sensing, using a variety of sensing modalities (optical, radar, hyperspectral, lidar) offers a unique source of information to monitor the environment, with fine spatial resolution, wide coverage and frequent revisit. This enables addressing the challenge of natural hazard monitoring and forecasting, which has a significant societal impact. To fully harness the potential of remote sensing data, advanced algorithms in machine learning, deep learning, or more broadly artificial intelligence, must be developed. Gathering an interdisciplinary team of experts, from data science, environmental and Earth sciences, as well as social sciences, this chair will focus on three important topics: forest monitoring, Earth deformation estimation and volcanic inverse modeling. From a methodological point of view, research will be conducted on the analysis of multimodal time series, multimodal deep and graph learning and foundation models.
10.3.4 MIAI chair: Fundamentals of Reinforcement Learning
Participants: Pierre Gaillard.
- P. Gaillard is the co-chair, with Bruno Gaujal (LIG, UGA), of this MIAI chair, which focuses on developing advanced methodologies for Reinforcement Learning (RL). The project aims to develop new RL algorithms with strong theoretical foundations and practical effectiveness by exploiting the problem's inherent structure. The focus areas include online control of queueing networks, weakly coupled stochastic dynamic systems (sometimes associated with bandits) and parametric learning for adaptive policies. These three approaches to structured learning will be used for innovative applications in energy, cloud computing, and resource allocation.
10.3.5 Deep Red
Participants: Jocelyn Chanussot.
- J. Chanussot is the chair of the Deep Red project from the Foundation Grenoble INP under the patronage of Lynred company (2022-2026). The project aims at popularizing the technology of infrared imaging for new usages.
10.3.6 PEPR project Numpex
Participants: Hadrien Hendrikx.
- The 'Numpex' programme's objectives are to design and develop the software building blocks required to equip future 'exascale machines' and to prepare the major application domains that aim to fully exploit the capabilities of such machines for scientific research and industry alike. This project is part of France's response to the next EuroHPC call for expressions of interest (Projet Exascale France) in hosting one of the two major exascale machines planned in Europe for 2024. In this way 'Numpex' will contribute to the creation of a set of tools, software, applications and training which will enable France to remain one of the leaders in the field of international competition through its national Exascale ecosystem that is in step with European strategy.
10.3.7 PEPR project Origins
Participants: Julien Mairal.
- Thoth is involved in the axis “Direct imaging and exoplanet characterization” of the PEPR Origins. This is an on-going collaboration with astronomers from Observatoire de Paris and Lyon and with the Willow team.
11 Dissemination
Participants: Julien Mairal, Karteek Alahari, Jocelyn Chanussot, Hadrien Hendrikx, Michael Arbel, Pierre Gaillard, Pia Bideau, Scott Pesme.
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
Member of the organizing committees
- M. Arbel, P. Gaillard, H. Hendrikx, J. Mairal, G. Meanti, S. Pesme and N. Gillot co-organized the PAISS summer school in Grenoble, which attracted about 200 students.
- P. Gaillard co-organized with EDF R&D a workshop on Meta-models that attracted around 50 participants.
- J. Chanussot was the general co-chair (with Prof Xiuping Jia, University of New South Wales, and Prof Jeffrey Walker, Monash University) of the IEEE Geoscience and Remote Sensing Symposium (IGARSS) that attracted 3200 participants in Brisbane, Australia, August 3-8, 2025.
- J. Chanussot was the co-chair of GeoCV (First Workshop on Computer Vision for Geospatial Image Analysis) at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV workshop) Tucson, AZ, March, 2025.
- J. Chanussot was the co-chair of MORSE (Workshop on Foundation and Large Vision Models in Remote Sensing) at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR Workshop) Nashville, TN, June 2025.
11.1.2 Scientific events: selection
Chair of conference program committees
- K. Alahari will be a program co-chair for ECCV 2028 (Bucharest, Romania).
- J. Chanussot will be the Technical Program Committee chair for the IEEE Geoscience and Remote Sensing Symposium (IGARSS) to be held in Reykjavik, Iceland in 2027
Member of the conference program committees
- K. Alahari was an area chair for CVPR 2025, ICCV 2025 and NeurIPS 2025, and will be an area chair for the upcoming ICML 2026.
- P. Gaillard was an area chair for ICML 2025, and will be an area chair for the upcoming ICML 2026.
- M. Arbel was an area chair for NeurIPS 2025, and will be area chair for the upcoming ICML 2026.
- J. Mairal will be an area chair for the upcoming ICML 2026.
Reviewer
- J. Mairal was reviewer for ICCV 2025, ICLR 2026 and NeurIPS 2025 (where he received a top reviewer award).
- K. Alahari was reviewer for CVPR 2026, BMVC 2025.
- P. Gaillard was reviewer for NeurIPS 2025.
- H. Hendrikx was reviewer for ICML 2025
- M. Arbel was reviewer for AISTATS 2025 (where he received a top reviewer award), ICCV 2025, ICLR 2026.
11.1.3 Journal
Member of the editorial boards
- J. Mairal. Editor for Journal of Machine Learning Research (JMLR).
- K. Alahari. Associate editor of International Journal of Computer Vision (IJCV).
- J. Chanussot is an Associate Editor for the IEEE Transactions on Geoscience and Remote Sensing
Reviewer - reviewing activities
- P. Gaillard was reviewer for JMLR.
- H. Hendrikx was reviewer for JMLR and SIOPT
- M. Arbel was reviewer for JMLR.
11.1.4 Invited talks
- J. Mairal was an invited speaker at the BASP workshop, Villars sur Ollon. Feb. 2025.
- J. Mairal was an invited speaker at the OSKI workshop, Aussois. March 2025.
- J. Mairal was an invited speaker at the GDR-IASIS workshop, Lyon. March 2025.
- J. Mairal was an invited speaker at Academie des Sciences (inter-section meeting). June 2025.
- J. Mairal gave an invited seminar at the ELLIS Stuttgart unit. June 2025.
- J. Mairal was an invited speaker at the Non-convex optimization: landscapes, dynamics and learning workshop, EPFL, Aug. 2025.
- J. Mairal was an invited speaker at the GDR-IASIS workshop, Paris. Sept. 2025.
- K. Alahari was a keynote speaker at the Inria-Waterloo workshop at Univ. Waterloo, Canada. May 2025
- K. Alahari was an invited speaker at Journées de statistique de la SFdS, Marseille. June 2025.
- K. Alahari was an invited speaker at the Global AI Frontiers Symposium, Seoul, South Korea. Oct. 2025.
- K. Alahari was an invited speaker at the Open Science Days@UGA, Grenoble. Nov. 2025.
- K. Alahari was a keynote speaker at the Sfen workshop on apport de l'IA a la science des materiaux pour l'industrie nucleaire, Paris. Dec. 2025.
- S. Pesme was an invited speaker at Journée scientifique du groupe SMAI-SIGMA, Dec. 2025, Paris.
- S. Pesme gave an invited seminar at Centrale Supelec. Nov. 2025.
- S. Pesme gave a talk at the Oberwolfach Mini-Workshop on Probabilistic Perspectives in Neural Network-Based Machine Learning, Oct. 2025, Oberwolfach, Germany
- S. Pesme gave an invited talk at the Workshop sur les modèles génératifs : diffusion, flow matching et leurs applications, Oct. 2025, Lyon.
- S. Pesme gave an invited seminar at Eindhoven University of Technology, Netherlands, Oct 2025
- S. Pesme gave an invited talk at the Workshop on the Statistical Theory of Neural Networks, May 2025, University of Twente, Netherlands.
- J. Marrie gave an invited seminar at Ecole des Ponts, Marne la Vallée, March 2025.
- T. Darcet gave an invited talk at the BLISS summer school, TU Berlin. May 2025.
- T. Bodrito gave a talk at the COBREX seminar. Feb. 2025.
- T. Bodrito gave a talk at the Journées de la Société Française d'Astronomie (SF2A). July 2025.
- P. Gaillard was an invited speaker at a scientific seminar organized by the LabEx EnergyAlps. May 2025.
- H. Hendrikx gave an invited seminar at Inria Montpellier, May 2025.
- H. Hendrikx gave a talk at project Redeem (PEPR IA) annual meeting, October 2025.
- M. Arbel was an invited speaker at the RKHS Seminars, METU, February 2025.
11.1.5 Scientific expertise
- J. Mairal was a member of the Helmholtz panel on scientific imaging.
- J. Mairal was a member of the Prairie panel for junior chairs.
- J. Mairal was a panel member for the research council of Norway.
- K. Alahari was a member of the CRCN/ISFP 2025 recruitment committee at Grenoble.
- P. Gaillard was a reviewer for the JCJC call from ANR.
- H. Hendrikx was a panel member for the TSIA call from ANR, subcommittee "IA & Environnements, écosystèmes, ressources biologiques"
11.1.6 Research administration
- J. Mairal is a member of the scientific committee (COS) of Inria Grenoble's research center, and also a member of the scientific committee of MIAI.
- K. Alahari is the deputy scientific director in charge of AI at Inria.
- K. Alahari is one of the scientific directors of the PEPR IA national research programme.
- K. Alahari is responsible for the Mathematics and Computer Science specialist field at the MSTII doctoral school.
- K. Alahari is a member of commission prospection postes at LJK.
- H. Hendrikx is Chargé de mission Science Environnement Société (SEnS) for Inria Grenoble.
- H. Hendrikx is the Inria Transformation Ecologique (TREC) representative at UGA.
- H. Hendrikx is a leading member of the Inria Grenoble socio-environmental roadmap.
- J. Chanussot is a member of the Commission Recherche, University Grenoble Alpes.
11.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
11.2.1 Supervision
- Théo Bodrito defended his PhD in June 2025. He was co-advised by Olivier Flasseur, Jean Ponce and Julien Mairal. See the manuscript 41.
- Timothée Darcet defended his PhD in June 2025. He was co-advised by Piotr Bojanowski, Maxim Oquab, and Julien Mairal. See the manuscript 42.
- Juliette Marrie defended her PhD in June 2025. She was co-advised by Michael Arbel, Diane Larlus and Julien Mairal.
- Bianca Marin Moreno defended her PhD in October 2025. She was co-advised by P. Gaillard.
- Zhiqi Kang defended his PhD in November 2025. He was advised by Karteek Alahari.
- Anandaramane Candassamy defended his PhD in September 2025. He was co-advised by J. Chanussot.
- Colin Prieur defended his PhD in November 2025. He was co-advised by J. Chanussot.
11.2.2 Juries
- J. Mairal was reviewer for the PhD thesis of Samuel Gruffaz, Univ. Paris Saclay. 2025.
- J. Mairal was reviewer for the HdR of Thomas Moreau, Univ. Paris Saclay. 2025.
- J. Mairal was a member of the PhD committee of Gaspard Dussert, Univ. Lyon 1. 2025.
- J. Mairal was a member of the HdR committee of Maxime Sangnier, PSL Sorbonne Université. 2025.
- K. Alahari was a member of the PhD jury of Mohammed-Yasser Benigmim, IP Paris. 2025.
- K. Alahari was a reviewer for the PhD thesis of Corentin Sautier, Ecole des Ponts ParisTech. 2025.
- K. Alahari was a reviewer for the PhD thesis of Tanay Agrawal, Université Côte d'Azur. 2025.
- K. Alahari was the president of the PhD jury of Guillaume Déau, Univ. Poitiers. 2025.
- P. Gaillard was reviewer for the PhD thesis of Lukas Zierahn, Politecnico di Torino, Italy. 2025.
- P. Gaillard was a member of the PhD committee of Antoine Picard, Univ. Lille. 2025.
- J. Chanussot was a reviewer for the PhD of Liang Zhao, University of South Australia (Australia) 2025.
- J. Chanussot was a reviewer for the PhD of Kimmo Riihiaho, University of Jyväskylä (Finland) 2025.
- J. Chanussot was a reviewer for the PhD of Dan Pineau, Université Paris-Saclay, 2025.
- J. Chanussot was a reviewer for the PhD of Yi Wang, TU Munich (Germany) 2025.
- J. Chanussot was a reviewer for the PhD of Sai Reddy B., GITAM - Deemed to be University (India) 2025.
- J. Chanussot was a reviewer for the PhD of Triem Pham, Université Paris-Saclay, 2025.
- J. Chanussot was the president of the PhD jury of Astrid Tazzioli, Université PSL Paris, 2025.
- J. Chanussot was a reviewer for the PhD of Ritu Yadav, KTH (Sweden), 2025
- J. Chanussot was a reviewer and the president of the committee for the PhD of Vadim Becquet, Université Paris PSL - Mines de Paris, 2025
- J. Chanussot was a reviewer for the HdR of Minh-Tan PHAM, Université de Bretagne Sud, 2025
- M. Arbel was a member of the PhD committee of Alessandro Pasqui, PSL Université de Paris, 2025.
11.2.3 Educational and pedagogical outreach
- Master: M. Arbel and J. Mairal, Kernel methods for statistical learning, 36h eqTD, M2, ENS Paris-Saclay/PSL, France.
- Master: M. Arbel, J. Mairal and S. Pesme, From Basic Machine Learning models to Advanced Kernel Learning, 54h eqTD, M2, UGA, Grenoble.
- Master: P. Gaillard, Sequential Learning, 12h eqTD, M2, MVA, ENS Paris-Saclay, France.
- Master: H. Hendrikx, Numerical Optimization, 40h eqTD, M1, UGA, Grenoble
- Master: J. Chanussot, Hyperspectral imaging, 25h eqTD, M2, Grenoble INP
11.3 Popularization
11.3.1 Productions (articles, videos, podcasts, serious games, ...)
- K. Alahari participated in a podcast interview for Interstices 54.
11.3.2 Participation in Live events
- K. Alahari co-animated the “Café IA” event at Inria Grenoble.
- S. Pesme participated in school workshops on October 9 and 10 as part of the "Éclats de sciences" programme on the Université Grenoble Alpes campus in Saint-Martin-d'Hères.
- S. Pesme participated in two “Café IA” events: one at Inria (September 30th 2025) and another with Digital League (December 2nd 2025).
- S. Pesme gave a talk at the Math Olympiad Ceremony (June 4, 2025, Université Grenoble-Alpes).
- S. Pesme participated in classroom sessions for the "Semaine des Maths" (March 10–19, 2025, schools in the Grenoble academy).
11.3.3 Other relevant science outreach activities
- S. Pesme was interviewed for the fête de la science 2025.
- J. Chanussot is a member of the scientific advisory board of the établissement public de coopération culturelle « Territoire de Sciences » with its two components: Cosmocité Museum and Grenoble Casemate.
- J. Chanussot organized a half-day event about thermal imaging for the 10th grade students doing their internship at INRIA.
12 Scientific production
12.1 Publications of the year
International journals
International peer-reviewed conferences
Doctoral dissertations and habilitation theses
Reports & preprints
Scientific popularization
12.2 Cited publications
- 55 Experimental Comparison of Ensemble Methods and Time-to-Event Analysis Models Through Integrated Brier Score and Concordance Index. 2024, working paper or preprint (unpublished). HAL.