
2025 Activity Report: Project-Team ROBOTLEARN

RNSR:‌​‌ 202124098G

Creation of the‌ Project-Team: 2021 July 01‌​‌

Each year, Inria research​​ teams publish an Activity​​​‌ Report presenting their work‌ and results over the‌​‌ reporting period. These reports​​ follow a common structure,​​​‌ with some optional sections‌ depending on the specific‌​‌ team. They typically begin​​ by outlining the overall​​​‌ objectives and research programme,‌ including the main research‌​‌ themes, goals, and methodological​​​‌ approaches. They also describe​ the application domains targeted​‌ by the team, highlighting​​ the scientific or societal​​​‌ contexts in which their​ work is situated.

The​‌ reports then present the​​ highlights of the year,​​​‌ covering major scientific achievements,​ software developments, or teaching​‌ contributions. When relevant, they​​ include sections on software,​​​‌ platforms, and open data,​ detailing the tools developed​‌ and how they are​​ shared. A substantial part​​​‌ is dedicated to new​ results, where scientific contributions​‌ are described in detail,​​ often with subsections specifying​​​‌ participants and associated keywords.​

Finally, the Activity Report​‌ addresses funding, contracts, partnerships,​​ and collaborations at various​​​‌ levels, from industrial agreements​ to international cooperations. It​‌ also covers dissemination and​​ teaching activities, such as​​​‌ participation in scientific events,​ outreach, and supervision. The​‌ document concludes with a​​ presentation of scientific production,​​​‌ including major publications and​ those produced during the​‌ year.

Keywords

Computer Science​​ and Digital Science

  • A5.7.3.​​​‌ Speech
  • A5.7.4. Analysis
  • A5.7.5.​ Synthesis
  • A5.10.2. Perception
  • A5.10.4.​‌ Robot control
  • A5.10.5. Robot​​ interaction (with the environment,​​​‌ humans, other robots)
  • A9.2.​ Machine learning
  • A9.3. Signal​‌ processing
  • A9.4. Natural language​​ processing
  • A9.5. Robotics and​​​‌ AI
  • A9.11. Generative AI​
  • A9.12.2. Activity recognition
  • A9.12.5.​‌ Object tracking and motion​​ analysis
  • A9.14. Evaluation of​​​‌ AI models

Other Research​ Topics and Application Domains​‌

  • B2. Digital health
  • B5.6.​​ Robotic systems

1 Team​​​‌ members, visitors, external collaborators​

Research Scientists

  • Xavier Alameda​‌ Pineda [Team leader​​, INRIA, Senior​​​‌ Researcher]
  • Laurent Girin​ [GRENOBLE INP,​‌ HDR]
  • Patrice Horaud​​ [retired, Emeritus​​​‌, HDR]
  • Thomas​ Hueber [CNRS]​‌
  • Stéphane Lathuilière [INRIA​​, ISFP]
  • Olivier​​​‌ Perrotin [CNRS,​ Researcher]

Post-Doctoral Fellows​‌

  • Xiaoyu Lin [UGA​​, Post-Doctoral Fellow,​​​‌ until May 2025]​
  • Samir Sadok [INRIA​‌, Post-Doctoral Fellow,​​ from May 2025]​​​‌
  • Samir Sadok [UGA​, Post-Doctoral Fellow,​‌ until May 2025]​​

PhD Students

  • Maxime Attwood​​​‌ [UGA, from​ Oct 2025]
  • Gaetan​‌ Lepage [INRIA,​​ until Jan 2025]​​​‌

Technical Staff

  • Ahamed Mohamed​ [INRIA, Engineer​‌]
  • Gianluca Zappavigna [​​INRIA, from Dec​​​‌ 2025]

Interns and​ Apprentices

  • Maxime Attwood [​‌INRIA, Intern,​​ from Feb 2025 until​​​‌ Jul 2025]
  • Manal​ Belouarda [INRIA,​‌ Intern, from May​​ 2025 until Jul 2025​​​‌]
  • Gianluca Zappavigna [​INRIA, Intern,​‌ from Jun 2025 until​​ Nov 2025]

Administrative​​​‌ Assistant

  • Nathalie Gillot [​INRIA]

Visiting Scientists​‌

  • Massimiliano Pappa [UNIV​​ ROME III, until​​​‌ Jul 2025]
  • Javier​ Venema Rodriguez [Panacea​‌ Cooperative Research, PhD student​​ at University of Granada​​​‌, from May 2025​ until Jun 2025]​‌

2 Overall objectives

In​​ recent years, social robots​​​‌ have been introduced into​ public spaces, such as​‌ museums, airports, commercial malls,​​ banks, show-rooms, schools, universities,​​​‌ hospitals, and retirement homes,​ to mention a few​‌ examples. In addition to​​ classical robotic skills such​​​‌ as navigating in complex​ environments, grasping and manipulating​‌ objects, i.e. physical interactions​​, social robots must​​ be able to communicate​​​‌ with people and to‌ adopt appropriate behavior. Welcoming‌​‌ newcomers, providing various pieces​​ of information, and entertaining​​​‌ groups of people are‌ typical services that social‌​‌ robots are expected to​​ provide in the near​​​‌ future.

Nevertheless, today's state-of-the-art‌ in robotics is not‌​‌ well-suited to fulfill these​​ needs, and there are​​​‌ two main bottlenecks: (i)‌ robots are limited to‌​‌ a handful of simple​​ scenarios which leads to​​​‌ (ii) social robots not‌ being well accepted by‌​‌ a large percentage of​​ users. While there are​​​‌ research programs and projects‌ which have tackled some‌​‌ of these challenges, existing​​ commercially available robots cannot​​​‌ (or only to a‌ very limited extent) recognize‌​‌ individual behaviors (e.g. facial​​ expressions, hand- and body-gestures,​​​‌ head- and eye-gaze) or‌ group behaviors (e.g. who‌​‌ looks at whom, who​​ speaks to whom, who​​​‌ needs robot assistance, etc.).‌ They do not have‌​‌ the ability to take​​ social (or non-verbal) signals​​​‌ into account while they‌ are engaged in spoken‌​‌ dialogue and they cannot​​ connect the dialogue with​​​‌ the persons and objects‌ that are physically present‌​‌ in their surroundings. We​​ would like to develop​​​‌ robots that are responsible‌ for their perception, and‌​‌ act to enhance the​​ quality of the signals​​​‌ they receive, instead of‌ asking the users to‌​‌ adapt their behavior to​​ the robotic platform.

The​​​‌ scientific ambition of RobotLearn‌ is to train robots‌​‌ to acquire the capacity​​ to look, listen​​​‌, learn, move‌ and speak in a‌​‌ socially acceptable manner. We​​ identify three main objectives:​​​‌

  1. Develop deep probabilistic models and methods that allow the fusion of audio and visual data, possibly sequential, recorded with cameras and microphones, and in particular with sensors onboard robots.
  2. Increase the​​​‌ performance of human behaviour‌ understanding using deep probabilistic‌​‌ models and jointly exploiting​​ auditory and visual information.​​​‌
  3. Learn robot-action policies that‌ are socially acceptable and‌​‌ that enable robots to​​ better perceive humans and​​​‌ the physical environment.

RobotLearn stands at the crossroads of several fields: computer vision, audio signal processing, speech technology, statistical learning, deep learning, and robotics. In partnership with several companies (e.g. PAL Robotics and ERM Automatismes Industriels), the technological objective is to launch a brand new generation of robots that are flexible enough to adapt to the needs of the users, and not the other way around. The experimental objective is to validate the scientific and technological progress in the real world. Furthermore, we believe that RobotLearn will contribute tools and methods able to process robotic data (perception and action signals) in such a way that connections with more abstract representations (semantics, knowledge) are possible. The developments needed to discover and use such connections could be addressed through collaborations. Similarly, aspects related to robot deployment in the consumer world, such as ethics and acceptability, will be addressed in collaboration, for instance, with the Broca day-care hospital in Paris.

From a methodological perspective, the challenge is at least three-fold. First, we need to reduce the amount of human intervention needed to adapt the designed learning models to a new environment. We aim to further develop strategies based on unsupervised learning and unsupervised domain adaptation, within the framework of deep probabilistic modeling with latent variables [55]. Second, we need to successfully exploit auditory and visual data for human behavior understanding, for instance by developing mechanisms that model and learn the complementarity between sounds and images [9]. Third, we need to develop reinforcement learning algorithms that can transfer previous knowledge to future tasks and environments. One potential way forward is to anchor the learning in key features that can be hand-crafted or learned [58].

3 Research program

RobotLearn is structured in three research axes that together aim at developing socially intelligent robots. First, we work on deep probabilistic models, which include the large family of deep neural network architectures, the large family of probabilistic models, and their intersection. Briefly, we investigate how to jointly exploit the representation power of deep networks and the flexibility of probabilistic models. A well-known example of such a combination is the variational autoencoder. Deep probabilistic models are the methodological backbone of the proposed project, and set the foundations of the two other research axes. Second, we develop methods for the automatic understanding of human behavior from both auditory and visual data. To this aim, we design our algorithms to exploit the complementary nature of these two modalities, and adapt their inference and on-line update procedures to the computational resources available when operating with robotic platforms. Third, we investigate models and tools allowing a robot to automatically learn the optimal social action policies. In other words, the robot should learn to select the best actions according to the social environment. Importantly, these action policies should also allow us to improve the robotic perception, in case this is needed to better understand the ongoing interaction. We believe that these two research axes, grounded in deep probabilistic models, will ultimately enable us to train robots to acquire social intelligence, meaning, as discussed in the introduction, the capacity to look, listen, learn, move and speak.

3.1​‌ Deep probabilistic models

A large number of perception and interaction processes require temporal modeling. Consider for example the task of extracting a clean speech signal from visual and audio data. Both modalities live in high-dimensional observation spaces and one challenge is to extract low-dimensional embeddings that encode information in a compact way and to update them over time. These high-dimensional to low-dimensional mappings are nonlinear in the general case. Moreover, audio and visual data are corrupted by various perturbations, e.g. by the presence of background noise which is mixed up with the speech signal uttered by a person of interest, or by head movements that overlap with lip movements. Finally, for robotics applications, the available data is scarce, and datasets captured in other settings can only serve as proxies, thus requiring either adaptation [62] or the use of unsupervised models [48]. Therefore, the problem is manifold: to extract low-dimensional compact representations from high-dimensional inputs, to disregard useless data in order to retain information that is relevant for the task at hand, to update and maintain reliable information over time, and to do so without (or with very few) annotated data from the robot.

This‌​‌ class of problems can​​ be addressed in the​​​‌ framework of state-space models‌ (SSMs). In their most‌​‌ general form, SSMs are​​ stochastic nonlinear systems with​​​‌ latent variables. Such a‌ system is composed of‌​‌ a state equation, that​​ describes the dynamics of​​​‌ the latent (or state)‌ variables, and M observation‌​‌ equations (an observation equation​​ for each sensorial modality​​​‌ m) that predict‌ observations from the state‌​‌ of the system, namely:​​

$$
\begin{aligned}
\mathbf{x}_{t+1} &= f(\mathbf{x}_t, \mathbf{u}_t) + \mathbf{v}_t \\
\mathbf{y}_t^m &= g^m(\mathbf{x}_t, \mathbf{u}_t) + \mathbf{w}_t^m, \quad m \in \{1 \cdots M\},
\end{aligned}
\tag{1}
$$

where the latent vector $\mathbf{x} \in \mathbb{R}^{L}$ evolves according to a nonlinear stationary Markov dynamic model driven by the observed control variable $\mathbf{u}$ and corrupted by the noise $\mathbf{v}$. Similarly, the observed vectors $\mathbf{y}^m \in \mathbb{R}^{D_m}$ are modeled with nonlinear stationary functions of the current state and current input, affected by noise $\mathbf{w}^m$. Models of this kind have been examined for decades and their complexity increases from linear-Gaussian models to nonlinear and non-Gaussian ones. Interestingly, they can also be viewed in the framework of probabilistic graphical models to represent the conditional dependencies between the variables. The objective of an SSM is to infer the sequence of latent variables by computing the posterior distribution of the latent variable, conditioned on the sequence of observations, $p(\mathbf{x}_t \mid \mathbf{y}_{1:t})$.

When the two functions are linear, the model boils down to a linear dynamical system, which can be learned with an exact Expectation-Maximization (EM) algorithm. Beyond this simple case, non-linearity can be achieved via mixtures of K linear models or more general non-linear (e.g. deep neural) functions. In either case, learning and inference cannot be exact and must be approximated, either by using variational EM algorithms [46, 56, 49, 3], amortized variational inference [55, 47], or a combination of both techniques [57, 18].
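To make the linear case concrete: when both functions are linear and the noises are Gaussian, the filtering posterior over the state given the observations so far is itself Gaussian and is computed in closed form by the Kalman filter. The sketch below is only an illustration under simplifying assumptions that are not taken from the report: a scalar state, a single modality, an observation equation that ignores the control input, and hypothetical parameter values (`a`, `b`, `c`, `q`, `r`).

```python
import random

def simulate(T, a=0.9, b=0.5, c=1.0, q=0.1, r=0.5, seed=0):
    """Sample a trajectory from a scalar linear-Gaussian SSM.

    x_{t+1} = a * x_t + b * u_t + v_t,   v_t ~ N(0, q)
    y_t     = c * x_t + w_t,             w_t ~ N(0, r)
    (all parameter values are illustrative assumptions)
    """
    rng = random.Random(seed)
    x, xs, us, ys = 0.0, [], [], []
    for t in range(T):
        u = 1.0 if t < T // 2 else -1.0          # known control input
        y = c * x + rng.gauss(0.0, r ** 0.5)     # noisy observation of x_t
        xs.append(x)
        us.append(u)
        ys.append(y)
        x = a * x + b * u + rng.gauss(0.0, q ** 0.5)
    return xs, us, ys

def kalman_filter(ys, us, a=0.9, b=0.5, c=1.0, q=0.1, r=0.5, m0=0.0, p0=1.0):
    """Return the means and variances of p(x_t | y_{1:t}) for every t."""
    m, p = m0, p0                                # Gaussian prior on the first state
    means, variances = [], []
    for y, u in zip(ys, us):
        # Update step: condition the current prior on the observation y_t.
        k = p * c / (c * c * p + r)              # Kalman gain
        m = m + k * (y - c * m)
        p = (1.0 - k * c) * p
        means.append(m)
        variances.append(p)
        # Prediction step: propagate the posterior through the state equation.
        m = a * m + b * u
        p = a * a * p + q
    return means, variances

xs, us, ys = simulate(200)
means, variances = kalman_filter(ys, us)
```

In this scalar setting the posterior variance does not depend on the observed values and converges to a fixed point, which is one reason the linear-Gaussian case is considered simple compared with the nonlinear models discussed in the text.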

We refer to the larger family of all these methods as Deep Probabilistic Models (DPMs), which form a backbone of the methodological foundations of RobotLearn. Learning DPMs is challenging from the theoretical, methodological and computational points of view. Indeed, the problem of learning, for instance, deep generative Bayesian filters in the framework of nonlinear and non-Gaussian SSMs remains intractable, and approximate solutions that are both optimal from a theoretical point of view and efficient from a computational point of view remain to be proposed. We plan to investigate both discriminative and generative deep recurrent Bayesian networks and to apply them to audio, visual and audio-visual processing tasks.

Exemplar application: deep​​​‌ probabilistic sequential modeling

We have investigated a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discrete observation-to-source assignment latent variable. The posterior distributions of both the discrete observation-to-source assignment variable and the continuous DVAE variables representing the content or position of the sources are estimated using the variational expectation-maximization algorithm, leading to the estimation of multi-source trajectories. We illustrated the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Consequently, this mixture model makes it possible to mix different non-linear source models within the maximum-likelihood framework and to combine the model with other probabilistic models as well.

Figure 1: MixDVAE overall diagram.
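The role of the discrete observation-to-source assignment variable can be illustrated with a much simpler stand-in for the full model. The sketch below is not the MixDVAE algorithm: it assumes that each source already comes with a predicted position and an isotropic Gaussian variance (in MixDVAE these would be produced by the pre-trained DVAE instances), and it only computes the posterior probability that each observation belongs to each source. The function names and the numbers in the usage lines are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) at point x."""
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d2 / var - 0.5 * len(x) * math.log(2.0 * math.pi * var)

def assignment_posterior(observations, source_means, source_vars, priors=None):
    """Return resp[k][n], the posterior probability that observation k
    was generated by source n (a mixture-model E-step)."""
    n_src = len(source_means)
    priors = priors or [1.0 / n_src] * n_src     # uniform prior by default
    resp = []
    for obs in observations:
        logp = [math.log(priors[n]) + log_gauss(obs, source_means[n], source_vars[n])
                for n in range(n_src)]
        top = max(logp)                          # log-sum-exp trick for stability
        weights = [math.exp(l - top) for l in logp]
        total = sum(weights)
        resp.append([w / total for w in weights])
    return resp

# Two hypothetical sources and three detections (2-D positions).
resp = assignment_posterior(
    observations=[(0.1, -0.2), (9.8, 10.1), (5.0, 5.0)],
    source_means=[(0.0, 0.0), (10.0, 10.0)],
    source_vars=[1.0, 1.0],
)
```

In a variational EM scheme, a step of this kind would alternate with an update of the continuous source variables, which is the part that the DVAE models handle in the method described above.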

3.2 Human behavior understanding​

Interactions between a robot and a group of people require human behavior understanding (HBU) methods. Consider for example the tasks of detecting eye-gaze and head-gaze and of tracking the gaze directions associated with a group of participants. This means that, in addition to gaze detection and gaze tracking, it is important to detect persons and to track them as well. Additionally, it is important to extract segments of speech and to associate these segments with persons, in order to determine over time who looks at whom, who is the speaker and who are the listeners. The temporal and spatial fusion of visual and audio cues is the basis for understanding social roles and for building a multimodal conversational model.

Performing HBU tasks in complex, cluttered and noisy environments is challenging for several reasons: participants come in and out of the camera field of view; their photometric features, e.g. facial texture, clothing, orientation with respect to the camera, etc., vary drastically, even over short periods of time; people look at an object of interest (a person entering the room, a speaking person, a TV/computer screen, a wall painting, etc.) by turning their heads away from the camera, hence facial image analysis is difficult; small head movements are often associated with speech, which perturbs both lip reading and head-gaze tracking; etc. Clearly, understanding multi-person human-robot interaction is complex because person-to-person and person-to-object interactions, in addition to person-to-robot interactions, must explicitly be taken into account.

We propose to perform audio-visual HBU by taking explicitly into account the complementary nature of these two modalities. Differently from one current trend in AV learning [45, 52, 54], we opt for unsupervised probabilistic methods that can (i) assign observations to persons without supervision, (ii) be combined with various probabilistic noise models and (iii) fuse various cues depending on their availability in time (i.e. handle missing data). Indeed, in face-to-face communication, the robot must choose with whom it should engage in dialog, e.g. based on proximity, eye gaze, head movements, lip movements, facial expressions, etc., in addition to speech. Unlike in the single-user human-robot interaction case, it is crucial to associate temporal segments of speech to participants, a task referred to as speaker diarization. Under such scenarios, speech signals are perturbed by noise, reverberation and competing audio sources, hence speech localization and speech enhancement methods must be used in conjunction with speech recognition.

It is also necessary‌ to perform some kind‌​‌ of adaptation to the​​ distribution of the particular​​​‌ data at hand, e.g.‌ collected with robot sensors.‌​‌ If these data are​​ available in advance, off-line​​​‌ adaptation can be done,‌ otherwise the adaptation needs‌​‌ to be performed on-line​​ or at run time.​​​‌ Such strategies will be‌ useful given the particular‌​‌ experimental conditions of practical​​ human-robot interaction scenarios. Either​​​‌ way we will need‌ some sort of on-line‌​‌ learning to perform final​​ adaptation. On-line learning based​​​‌ on deep neural networks‌ is far from being‌​‌ well understood. We plan​​ to thoroughly study the​​​‌ incorporation of on-line learning‌ into both Bayesian and‌​‌ discriminative deep networks. In​​ the practical case of​​​‌ interaction, real-time processing is‌ crucial. Therefore, a compromise‌​‌ must be found between​​ the size of the​​​‌ network, its discriminative power‌ and the computational cost‌​‌ of the learning and​​ prediction algorithms. Clearly, there​​​‌ is no single solution‌ given the large variety‌​‌ of problems and scenarios​​ that are encountered in​​​‌ practice.

Exemplar application: expression-preserving‌ face frontalization

Face frontalization‌​‌ consists of synthesizing a​​ frontally-viewed face from an​​​‌ arbitrarily-viewed one. We proposed‌ a frontalization methodology that‌​‌ preserves non-rigid facial deformations​​ in order to boost​​​‌ the performance of visually‌ assisted speech communication. The‌​‌ method alternates between the​​ estimation of (i) the​​​‌ rigid transformation (scale, rotation,‌ and translation) and (ii)‌​‌ the non-rigid deformation between​​ an arbitrarily-viewed face and​​​‌ a face model. The‌ method has two important‌​‌ merits: it can deal​​ with non-Gaussian errors in​​​‌ the data and it‌ incorporates a dynamical face‌​‌ deformation model. For that​​ purpose, we used the​​​‌ generalized Student t-distribution in‌ combination with a linear‌​‌ dynamic system in order​​ to account for both​​​‌ rigid head motions and‌ time-varying facial deformations caused‌​‌ by speech production. We​​ proposed to use the​​​‌ zero-mean normalized cross-correlation (ZNCC)‌ score to evaluate the‌​‌ ability of the method​​ to preserve facial expressions.​​​‌ We showed that the‌ method, when incorporated into‌​‌ deep learning pipelines, namely​​ lip reading and speech​​​‌ enhancement, improves word recognition‌ and speech intelligibility scores‌​‌ by a considerable margin.​​

Figure 2: Some results of the proposed expression-preserving face frontalization method.
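The ZNCC score mentioned above has a standard definition: both patches are centered on their mean intensity and their inner product is divided by the product of their norms, which yields a value in [-1, 1] that is invariant to gain and offset changes in intensity. The following sketch computes it for two patches stored as flat lists; which image regions are compared in the team's evaluation protocol is not specified here, so the inputs are purely illustrative.

```python
import math

def zncc(a, b):
    """Zero-mean normalized cross-correlation between two equally sized
    grayscale patches given as flat lists of pixel intensities."""
    if len(a) != len(b) or not a:
        raise ValueError("patches must be non-empty and of equal size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]                 # zero-mean version of patch a
    db = [y - mean_b for y in b]                 # zero-mean version of patch b
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0.0:
        return 0.0  # a constant patch has no defined correlation; 0 by convention
    return sum(x * y for x, y in zip(da, db)) / norm

patch = [10, 20, 30, 40, 25, 5]
brighter = [2 * x + 10 for x in patch]           # same pattern, different gain/offset
score = zncc(patch, brighter)
```

Because of the centering and normalization, a patch and its brightened copy obtain the maximum score, while an intensity-inverted copy obtains the minimum.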

3.3 Learning and‌ control for social robots‌​‌

Traditionally, research on human-robot interaction focused on single-person scenarios, also called dyadic interactions. However, over the past decade several studies were devoted to various aspects of multi-party interactions, meaning situations in which a robot interacts with a group of two or more people [59]. This line of research is much more challenging for two main reasons. First, the behavioral cues of each individual and of the group need to be faithfully extracted (and assigned to each individual). Second, the behavioral dynamics of groups of people can be pushed by the presence of the robot towards competition [51] or even bullying [50]. This is why some studies restrict the experimental conditions to very controlled collaborative scenarios, often led by the robot, such as quiz-like game playing [61] or very specific robot roles [53]. Intuitively, constraining the scenario also reduces the gesture variability and the overall interaction dynamics, leading to methods and algorithms with questionable generalisation to free and natural social multi-party interactions.

Whenever a robot participates in such multi-party interactions, it must perform social actions. Such robot social actions are typically associated with the need to perceive a person or a group of persons in an optimal way as well as to take appropriate decisions such as to safely move towards a selected group, to join a conversation or to answer a question. Therefore, one can distinguish between two types of robot social actions: (i) physical actions, which correspond to synthesizing appropriate motions using the robot actuators (motors), possibly within a sensorimotor loop, so as to enhance perception and maintain a natural interaction, and (ii) spoken actions, which correspond to synthesizing appropriate speech utterances by a spoken dialog system. In RobotLearn we will focus on the former, and integrate the latter via collaborations with research groups with established expertise in speech technologies.

In this regard we face several problems. First, given the complexity of the environment and the inherent limitations of the robot's perception capabilities, e.g. limited camera field of view, cluttered spaces, complex acoustic conditions, etc., the robot will only have access to a partial representation of the environment, and up to a certain degree of accuracy. Second, for learning purposes, there is no easy way to annotate which are the best actions the robot must choose given a situation: supervised methods are therefore not an option. Third, since the robot cannot learn from scratch by random exploration in a new environment, standard model-free RL approaches cannot be used. Some sort of previous knowledge on the environment or a similar one should be exploited. Finally, given that the robot moves within a populated environment, it is desirable to have the capability to enforce certain constraints, thus limiting the range of possible robot actions.

Building algorithms to endow robots with autonomous decision-making is not straightforward. Two relatively distinct paradigms are available in the literature. First, one can devise customized strategies based on techniques such as robot motion planning combined with sensor-based robot control. These techniques lack generalization, in particular when the robot acts in complex, dynamic and unconstrained environments. Second, one can let the robot devise its own strategies based on reinforcement learning (RL), a machine learning paradigm in which "agents" learn by themselves, by trial and error, to achieve successful strategies [60]. It is very difficult, however, to enforce any kind of soft or hard constraint within this framework. We will showcase these two scientific streams with one group of techniques for each one: model predictive control (MPC) and Q-learning, more precisely deep Q-networks (DQNs). These two techniques are promising. Moreover, they are well documented in the robotics and machine learning literature. Nevertheless, combining them is extremely challenging.
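As a reminder of what the trial-and-error paradigm looks like in its simplest form, the sketch below runs tabular Q-learning on a toy five-state corridor. It is only an illustration: the environment, the reward and the hyper-parameters are assumptions made for this example, and a DQN would replace the table `q` with a neural network. Note that nothing in the update rule expresses a constraint on the actions, which is the limitation pointed out above.

```python
import random

N_STATES, GOAL = 5, 4            # states 0..4; an episode ends in state 4
LEFT, RIGHT = 0, 1

def step(state, action):
    """Deterministic corridor: move left or right, reward 1 on reaching GOAL."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(200):                          # cap on episode length
            if rng.random() < epsilon or q[s][LEFT] == q[s][RIGHT]:
                a = rng.choice((LEFT, RIGHT))         # explore or break ties
            else:
                a = LEFT if q[s][LEFT] > q[s][RIGHT] else RIGHT
            s2, reward, done = step(s, a)
            target = reward if done else reward + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])     # temporal-difference update
            s = s2
            if done:
                break
    return q

q = q_learning()
# Greedy policy in the non-terminal states.
policy = [RIGHT if row[RIGHT] > row[LEFT] else LEFT for row in q[:GOAL]]
```

In this toy problem the greedy policy extracted from the learned table moves right in every non-terminal state, and the learned values approach the discounted distance to the goal.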

An additional challenge, independent from the learning and control combination foreseen, is the data distribution gap between simulation and the real world. Meta-learning, or the ability to learn how to learn, can provide partial answers to this problem. Indeed, developing machine learning methods able to understand how the learning is achieved can be used to extend this learning to a new task and speed up the learning process on the new task. Recent developments proposed meta-learning strategies specifically conceived for reinforcement learning, leading to Meta-RL methods. One promising trend in Meta-RL is to have a probabilistic formulation involving SSMs and VAEs, hence sharing the methodology based on dynamical variational autoencoders described before. Very importantly, we are not aware of any studies able to combine Meta-RL with MPC to handle the constraints within a unified formulation. From a methodological perspective, this is an important challenge we will face in the next few years.

Exemplar application: transferring policies via successor feature representations

Transfer in Reinforcement Learning aims to improve learning performance on target tasks using knowledge from experienced source tasks. Successor Representations (SR) and their extension Successor Features (SF) are prominent transfer mechanisms in domains where reward functions change between tasks. They reevaluate the expected return of previously learned policies in a new target task to transfer their knowledge. The SF framework extended SR by linearly decomposing rewards into successor features and a reward weight vector, allowing their application in high-dimensional tasks. However, this came at the cost of requiring a linear relationship between reward functions and successor features, limiting its application to tasks where such a linear relationship exists. We proposed a novel formulation of SR based on learning the cumulative discounted probability of successor features, called Successor Feature Representations (SFR). Crucially, SFR allows the expected return of policies to be reevaluated for general reward functions. We introduced different SFR variations, proved their convergence, and provided a guarantee on their transfer performance. Experimental evaluations based on SFR with function approximation demonstrated its advantage over SF not only for general reward functions, but also in the case of linearly decomposable reward functions.
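The linear decomposition used by standard SF, which SFR generalizes, can be checked on a toy example. The sketch below is an illustration of the SF case only, not of SFR: it assumes a hypothetical four-state chain, a fixed "always move right" policy, and one-hot transition features. The successor features are computed once and then reused to evaluate the same policy under two different reward weight vectors, without any further learning.

```python
GAMMA = 0.9
N = 4                             # states 0..3; state 3 is terminal

def phi(next_state):
    """Feature of a transition: one-hot encoding of the state reached."""
    return [1.0 if i == next_state else 0.0 for i in range(N)]

def successor_features():
    """psi[s] = discounted sum of future features when following the fixed
    policy 'always move right' from state s."""
    psi = [[0.0] * N for _ in range(N)]
    for s in range(N - 2, -1, -1):               # backward from the terminal state
        nxt = s + 1
        psi[s] = [f + GAMMA * p for f, p in zip(phi(nxt), psi[nxt])]
    return psi

def value(features, w):
    """Inner product features . w, i.e. the value under the linear reward."""
    return sum(f * wi for f, wi in zip(features, w))

def rollout_return(s, w):
    """Reference value: discounted return obtained by simulating the policy."""
    total, discount = 0.0, 1.0
    while s < N - 1:
        s += 1
        total += discount * value(phi(s), w)
        discount *= GAMMA
    return total

psi = successor_features()
task_a = [0.0, 0.0, 0.0, 1.0]     # reward only when the terminal state is reached
task_b = [0.0, 1.0, 0.0, 0.5]     # a different task: same features, new weights
values_a = [value(psi[s], task_a) for s in range(N)]
values_b = [value(psi[s], task_b) for s in range(N)]
```

The re-evaluation is exact here because the reward is, by construction, linear in the features; relaxing that requirement is precisely what the SFR formulation described above is designed for.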

4 Application​ domains

For the last decades, there has been an increasing interest in robots that cooperate and communicate with people. As already mentioned, we are interested in Socially Assistive Robots (SARs) that can communicate with people and that are perceived as social entities. So far, the humanoid robots developed to fill this role are mainly used as research platforms for human-robot collaboration and interaction and their prices, if at all commercially available, are in the 6-digit-euro category, e.g. 250,000 EUR for the iCub and Romeo humanoid robots, developed by the Italian Institute of Technology and SoftBank Robotics Europe, respectively, as well as the REEM-C and TALOS robots from PAL Robotics. A notable exception is the NAO robot, a humanoid (legged) robot available at an affordable price. Apart from humanoid robots, there are also several companion robots manufactured in Europe and available at a much lower price (in the range 10,000–30,000 EUR) that address the SAR market. For example, the Kompaï, the TIAGo, and the Pepper robots are wheeled indoor robotic platforms. The user interacts with these robots via touch screen and voice commands. The robots manage shopping lists, remember appointments, play music, and respond to simple requests. These affordable robots (Kompaï, TIAGo, NAO, and Pepper) rapidly became the platforms of choice for many researchers in cognitive robotics and in HRI, and they have been used by many EU projects, e.g. HUMAVIPS, EARS, VHIA, and ENRICHEME.

When interacting, these robots rely on a few selected modalities. The voice interface of this category of robots, e.g. Kompaï, NAO, and Pepper, is based on speech recognition similar to the speech technologies used by smartphones and table-top devices, e.g. Google Home. Their audio hardware architecture and software packages are designed to handle single-user face-to-face spoken dialogue based on keyword spotting, but they can neither perform multiple sound-source analysis, fuse audio and visual information for more advanced multi-modal/multi-party interactions, nor hold a conversation that exceeds a couple of turns or that falls outside a very narrow predefined domain.

To​‌ the best of our​​ knowledge, the only notable​​​‌ efforts to overcome some​ of the limitations mentioned​‌ above are the FP7​​ EARS and H2020 MuMMER​​​‌ projects. The EARS project's​ aim was to redesign​‌ the microphone-array architecture of​​ the commercially available humanoid​​​‌ robot NAO, and to​ build a robot head​‌ prototype that can support​​ software based on advanced​​​‌ multi-channel audio signal processing.​ The EARS partners were​‌ able to successfully demonstrate​​ the usefulness of this​​​‌ microphone array for speech-signal​ noise reduction, dereverberation, and​‌ multiple-speaker localisation. Moreover, the​​ recent IEEE-AASP Challenge on​​​‌ Acoustic Source Localisation and​ Tracking (LOCATA)​‌ comprises a dataset that​​ uses this microphone array.​​​‌ The design of NAO​ imposed severe constraints on​‌ the physical integration of​​ the microphones and associated​​​‌ hardware. Consequently and in​ spite of the scientific​‌ and practical promises of​​ this design, SoftBank Robotics​​ has not integrated this​​​‌ technology into their commercially‌ available robots NAO and‌​‌ Pepper. In order to​​ overcome problems arising from​​​‌ human-robot interaction in unconstrained‌ environments and open-domain dialogue‌​‌ on the Pepper robot,​​ the H2020 MuMMER project​​​‌ aimed to deploy an‌ entertaining and helpful robot‌​‌ assistant to a shopping​​ mall. While they had​​​‌ initial success with short‌ deployments of the robot‌​‌ to the mall, they​​ were not specifically addressing​​​‌ the issues arising from‌ multi-party interaction: Pepper's audio‌​‌ hardware/software design cannot locate​​ and separate several simultaneously​​​‌ emitting speech sources.

Figure 3:​​​‌ The two robotic platforms‌ of the team: (left)‌​‌ the ARI robot from​​ PAL Robotics and (right)​​​‌ the Miroka robot from‌ EnchantedTools.

To conclude, current‌​‌ robotic platforms available in​​ the consumer market, i.e.​​​‌ with large-scale deployment potential,‌ are neither equipped with‌​‌ the adequate hardware nor​​ endowed with the appropriate​​​‌ software required for multi-party‌ social interactions in real-world‌​‌ environments.

In the light of the above discussion, the partners of the H2020 SPRING project decided to build a robot prototype well suited for socially assistive tasks and shared by the SPRING partners as well as by other EU projects. We participated in the specification of the ARI robot prototype (Figure 3, left), designed, developed and manufactured by PAL Robotics, an industrial partner of the SPRING project. ARI is a ROS-enabled, non-holonomic, differential-drive wheeled robot, equipped with a pan and tilt head, with both color and depth cameras and with a microphone array that embeds the latest audio signal processing technologies. Seven ARI robot units were delivered to the SPRING partners in April 2021.

We are committed to implementing our algorithms and associated software packages on this advanced robotic platform, from low-level control to high-level perception, interaction and planning tasks, such that the robot has a socially-aware behaviour while it safely navigates in an ever-changing environment. We will experiment in environments of increasing complexity, e.g. our robotic lab, the Inria Grenoble cafeteria and Login exhibition, as well as the Broca hospital in Paris. The expertise that the team's engineers and researchers have acquired over the last decade will be crucial for present and future robotic developments and experiments.

5 Social​​ and environmental responsibility

5.1​​​‌ Impact of research results‌

Our line of research on developing unsupervised learning methods that exploit audio-visual data to understand social scenes, and to learn to interact within them, is both challenging and of large economic and societal impact. The economic impact stems from the fact that auditory and visual sensors are the most common ones: we can find (many of) them in almost every smartphone on the market. Beyond telephones, manufacturers designing new systems meant for human use should take into account the need for verbal interaction, and hence for audio-visual perception. A clear example of this potential is the transfer of our technology to a real robotic platform, for evaluation within a day-care hospital (DCH). This is possible thanks to the H2020 SPRING EU project, which assesses the interest of social robotics in the non-medical phases of a regular day for elderly patients in a DCH. We are evaluating the performance of our methods for AV speaker tracking, AV speech enhancement, and AV sound source separation, for future technology transfer to the robot manufacturer. This is the first step toward a robot that can be part of the social environment of the DCH, helping to reduce patient and companion stress while also being a useful tool for the medical personnel. We are confident that developing robust AV perception and action capabilities for robots and autonomous systems will make them more suitable for environments populated with humans.

6 Highlights of​ the year

6.1 Final​‌ results of the H2020​​ SPRING project

As the​​​‌ H2020 SPRING project concludes,​ these joint results highlight​‌ the potential of socially​​ assistive robots (SARs) in​​​‌ geriatric care. This research​ evaluated the humanoid robot​‌ ARI in a Paris​​ day-care hospital, focusing on​​​‌ its ability to support​ older adults and caregivers​‌ through multi-modal conversational dialogue.​​ Across several experimental waves​​​‌ involving over 120 participants,​ the studies assessed system​‌ performance, user engagement, and​​ the impact of Large​​​‌ Language Model (LLM) integration.​ Results from the Acceptability​‌ E-Scale (AES) and System​​ Usability Scale (SUS) indicate​​​‌ that end-users are highly​ receptive to this technology.​‌ Key findings demonstrate that​​ while LLMs improve interaction​​​‌ fluency, overall success depends​ on the robot's ability​‌ to minimize errors in​​ cluttered, real-world environments. The​​​‌ study also identified that​ personal user characteristics and​‌ robot adaptability significantly influence​​ long-term adoption and emotional​​​‌ engagement. Ultimately, robust perception​ and flexible action skills​‌ proved essential for moving​​ beyond lab settings into​​​‌ dynamic clinical facilities. These​ contributions provide a vital​‌ framework for deploying AI-driven​​ robotics to alleviate healthcare​​​‌ workloads and reduce patient​ loneliness. By bridging the​‌ gap between technical development​​ and clinical reality, SPRING​​​‌ has paved the way​ for future geriatric assistive​‌ technologies 26, 27​​.

6.2 Onboarding of Stéphane Lathuilière

A significant​ milestone in the team's​‌ recent evolution was the​​ arrival of Stéphane Lathuilière,​​​‌ who joined as a​ permanent Research Scientist (ISFP)​‌ in January 2025. His​​ integration into RobotLearn—and subsequently​​​‌ ComLearn, see below—brings specialized​ expertise in deep generative​‌ models, image and video​​ generation, and multimodal learning.​​​‌ Having previously served as​ an Associate Professor at​‌ Télécom Paris and completed​​ his PhD within the​​​‌ predecessor Perception team at​ Inria, Stéphane provides a​‌ vital bridge between high-level​​ scene perception and the​​​‌ synthesis of realistic social​ signals. His research focus​‌ on generative AI and​​ "Human Behavior Understanding" directly​​​‌ supports the new team's​ mission to develop Multimodal​‌ Foundation Models (MFMs). By​​ strengthening the "generation" pillar​​ of the team, his​​​‌ presence accelerates the development‌ of artificial agents capable‌​‌ of more fluid, context-sensitive,​​ and human-centric interactions.

6.3​​​‌ The genesis of ComLearn‌

The creation of ComLearn‌​‌ marks a strategic merger​​ between the CRISSP (GIPSA-lab)​​​‌ and RobotLearn (Inria) teams,‌ unifying their world-class expertise‌​‌ in speech synthesis and​​ computer vision. By combining​​​‌ CRISSP’s mastery of multimodal‌ generation with RobotLearn's advanced‌​‌ audiovisual perception, ComLearn establishes​​ a powerhouse for next-generation​​​‌ social robotics. This synergy‌ aims to overcome the‌​‌ "last mile" of human-agent​​ interaction by developing Multimodal​​​‌ Foundation Models (MFMs) that‌ ground reasoning and generation‌​‌ in real-world communicative environments.​​ Leveraging a shared methodological​​​‌ foundation in Deep Generative‌ Models, the team will‌​‌ design artificial agents capable​​ of seamless, context-sensitive dialogue​​​‌ within multi-party groups. Beyond‌ technical innovation, the project‌​‌ serves as a bridge​​ between signal processing and​​​‌ cognitive science, providing tools‌ to simulate and better‌​‌ understand fundamental human communication​​ mechanisms. The merger provides​​​‌ the critical mass necessary‌ to lead international research‌​‌ in audio-visual scene analysis​​ and user-adaptive assistive technologies.​​​‌ Ultimately, ComLearn will empower‌ social robots to navigate‌​‌ complex, cluttered social spaces​​ with unprecedented fluency and​​​‌ interpretability.

6.4 Welcome Miroka!‌

The acquisition of a‌​‌ Miroka robotic platform represents​​ a transformative step for​​​‌ the RobotLearn/ComLearn team, providing‌ a state-of-the-art vehicle for‌​‌ testing Multimodal Foundation Models​​ (MFMs) in real-world settings.​​​‌ Unlike traditional platforms, Miroka's‌ unique globe-based locomotion and‌​‌ "character-driven" design allow it​​ to navigate crowded hospital​​​‌ environments with an agility‌ and social presence that‌​‌ mimics human movement. This​​ platform serves as the​​​‌ ideal physical anchor to‌ ground the team's research‌​‌ in audiovisual perception and​​ generative social signals, bridging​​​‌ the gap between theoretical‌ AI and embodied interaction.‌​‌ Its expressive animated interface​​ provides a high-fidelity canvas​​​‌ for our work in‌ generative behavior synthesis, enabling‌​‌ more nuanced and emotionally​​ resonant communication. Furthermore, Miroka's​​​‌ specialized social capabilities allow‌ the team to study‌​‌ complex multi-party interactions. This​​ investment ensures the team​​​‌ remains at the global‌ forefront of social robotics,‌​‌ moving beyond basic dialogue​​ to truly integrated, context-sensitive​​​‌ assistance. Ultimately, Miroka transforms‌ the lab's algorithmic breakthroughs‌​‌ into tangible, observable social​​ behaviors.

7 Latest software​​​‌ developments, platforms, open data‌

7.1 New platforms

Participants: Xavier Alameda-Pineda, Ahamed Mohamed, Stéphane Lathuilière, Nicolas Turro, Soraya Arias.

During 2025, the RobotLearn team acquired the Miroka platform, see Figure 3 (right). This platform is built by EnchantedTools (a startup in Paris). It has some similarities with our previous platform ARI (which we will keep), namely the soft appearance, the design intended for social interaction, and the multi-sensory capabilities. However, there are some important differences. First, Miroka's face is projected, and therefore much more expressive than the static face of ARI. Second, Miroka comes with an integrated LIDAR, which could significantly improve its navigation skills. Third, Miroka moves by self-balancing on a sphere. While this is more complex to handle, it means that Miroka is a holonomic robot and can move in any direction. We hope this simplifies the issues related to “manoeuvring” in social settings.
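The practical difference between a differential-drive base (ARI) and a holonomic one (Miroka) can be illustrated with a minimal kinematic sketch. The function names, the Euler integration and the body-frame velocity convention below are illustrative assumptions; they are not taken from either robot's software stack.

```python
import math

def step_differential_drive(pose, v, omega, dt):
    """Advance a differential-drive pose (x, y, theta) by one Euler step.

    Only a forward speed v and a turn rate omega can be commanded, so the
    instantaneous velocity is always aligned with the heading theta.
    """
    x, y, theta = pose
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)

def step_holonomic(pose, vx, vy, omega, dt):
    """Advance a holonomic pose (x, y, theta) by one Euler step.

    vx (forward) and vy (lateral) are body-frame velocities that can be
    commanded independently of the heading.
    """
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return (x + (vx * c - vy * s) * dt,
            y + (vx * s + vy * c) * dt,
            theta + omega * dt)

# Hypothetical example: both robots start at the origin, facing along x.
start = (0.0, 0.0, 0.0)
forward = step_differential_drive(start, v=0.5, omega=0.0, dt=1.0)
sidestep = step_holonomic(start, vx=0.0, vy=0.5, omega=0.0, dt=1.0)
```

In this sketch the holonomic base steps sideways without changing its heading, whereas the differential-drive base would first have to rotate, which is the kind of manoeuvre that is awkward in a crowded social setting.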

8 New​‌ results

The new results​​ listed below are organised​​​‌ by research axis.

8.1​ Deep Probabilistic Models

8.1.1​‌ Diffusion-based Unsupervised Audio-visual Speech​​ Enhancement

Participants: Xavier Alameda-Pineda​​​‌.

We propose a​ new unsupervised audiovisual speech​‌ enhancement (AVSE) approach that​​ combines a diffusion-based audio-visual​​​‌ speech generative model with​ a non-negative matrix factorization​‌ (NMF) noise model. First,​​ the diffusion model is​​​‌ pre-trained on clean speech​ conditioned on corresponding video​‌ data to simulate the​​ speech generative distribution. This​​​‌ pre-trained model is then​ paired with the NMF-based​‌ noise model to estimate​​ clean speech iteratively. Specifically,​​​‌ a diffusion-based posterior sampling​ approach is implemented within​‌ the reverse diffusion process,​​ where after each iteration,​​​‌ a speech estimate is​ obtained and used to​‌ update the noise parameters.​​ Experimental results confirm that​​​‌ the proposed AVSE approach​ not only outperforms its​‌ audio-only counterpart but also​​ generalizes better than a​​​‌ recent supervised-generative AVSE method.​ Additionally, the new inference​‌ algorithm offers a better​​ balance between inference speed​​​‌ and performance compared to​ the previous diffusion-based method.​‌ Code and demo available​​ here.
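To make the role of the NMF noise model more concrete, the sketch below factorises a non-negative matrix (standing in for a noise power spectrogram) with the classical multiplicative updates for the generalised Kullback-Leibler divergence. It is a generic illustration only: the function names, the choice of divergence, the rank and the toy data are assumptions, and the update rule used in the actual method may differ. In the approach described above, the matrix to factorise would be derived from the current speech estimate at each reverse-diffusion iteration.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorise a non-negative matrix V (freq x time) as W @ H.

    Uses the classical multiplicative updates for the generalised
    Kullback-Leibler divergence; W and H stay non-negative throughout.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalised KL divergence between V and its approximation V_hat."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))

# Toy "noise spectrogram" built from two spectral templates (hypothetical data).
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 20))
W, H = nmf_kl(V, rank=2)
```

The low-rank structure is what makes such a noise model cheap to re-estimate between sampling steps: only the small matrices W and H need to be updated.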

8.1.2 No​​​‌ Images, No Problem: Retaining​ Knowledge in Continual VQA​‌ with Questions-Only Memory

Participants:​​ Stéphane Lathuilière.

Continual​​​‌ Learning in Visual Question​ Answering (VQACL) requires models​‌ to learn new visual-linguistic​​ tasks (plasticity) while retaining​​​‌ knowledge from previous tasks​ (stability). The multimodal nature​‌ of VQACL presents unique​​ challenges, requiring models to​​​‌ balance stability across visual​ and textual domains while​‌ maintaining plasticity to adapt​​ to novel objects and​​​‌ reasoning tasks. Existing methods,​ predominantly designed for unimodal​‌ tasks, often struggle to​​ balance these demands effectively.​​​‌ In this work, we​ introduce QUestion-only replay with​‌ Attention Distillation (QUAD), a​​ novel approach for VQACL​​​‌ that leverages only past​ task questions for regularisation,​‌ eliminating the need to​​ store visual data and​​​‌ addressing both memory and​ privacy concerns. QUAD achieves​‌ stability by introducing a​​ question-only replay mechanism that​​​‌ selectively uses questions from​ previous tasks to prevent​‌ overfitting to the current​​ task's answer space, thereby​​ mitigating the out-of-answer-set problem.​​​‌ Complementing this, we propose‌ attention consistency distillation, which‌​‌ uniquely enforces both intra-modal​​ and inter-modal attention consistency​​​‌ across tasks, preserving essential‌ visual-linguistic associations. Extensive experiments‌​‌ on VQAv2 and NExT-QA​​ demonstrate that QUAD significantly​​​‌ outperforms state-of-the-art methods, achieving‌ robust performance in continual‌​‌ VQA.

8.1.3 Group-robust Machine​​ Unlearning

Participants: Stéphane Lathuilière​​​‌.

Machine unlearning is‌ an emerging paradigm to‌​‌ remove the influence of​​ specific training data (i.e.,​​​‌ the forget set) from‌ a model while preserving‌​‌ its knowledge of the​​ rest of the data​​​‌ (i.e., the retain set).‌ Previous approaches assume the‌​‌ forget data to be​​ uniformly distributed from all​​​‌ training datapoints. However, if‌ the data to unlearn‌​‌ is dominant in one​​ group, we empirically show​​​‌ that performance for this‌ group degrades, leading to‌​‌ fairness issues. This work​​ tackles the overlooked problem​​​‌ of non-uniformly distributed forget‌ sets, which we call‌​‌ group-robust machine unlearning, by​​ presenting a simple, effective​​​‌ strategy that mitigates the‌ performance loss in dominant‌​‌ groups via sample distribution​​ reweighting. Moreover, we present​​​‌ MIU (Mutual Information-aware Machine‌ Unlearning), the first approach‌​‌ for group robustness in​​ approximate machine unlearning. MIU​​​‌ minimizes the mutual information‌ between model features and‌​‌ group information, achieving unlearning​​ while reducing performance degradation​​​‌ in the dominant group‌ of the forget set.‌​‌ Additionally, MIU exploits sample​​ distribution reweighting and mutual​​​‌ information calibration with the‌ original model to preserve‌​‌ group robustness. We conduct​​ experiments on three datasets​​​‌ and show that MIU‌ outperforms standard methods, achieving‌​‌ unlearning without compromising model​​ robustness. Source code available​​​‌ here.
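The summary above only states that the performance loss is mitigated "via sample distribution reweighting"; the exact scheme is not given here. As one plausible reading, the sketch below computes per-sample weights for the retain set so that weighted sampling restores the group proportions of the original training set. The function name and the toy data are hypothetical.

```python
from collections import Counter

def group_reweighting(train_groups, forget_mask):
    """Return one weight per retained sample.

    Weights are chosen so that the weighted group proportions of the
    retain set match the group proportions of the full training set.
    Assumes every group keeps at least one sample after forgetting.
    """
    original = Counter(train_groups)
    retain_groups = [g for g, f in zip(train_groups, forget_mask) if not f]
    retained = Counter(retain_groups)
    n_total, n_retain = len(train_groups), len(retain_groups)
    group_weight = {
        g: (original[g] / n_total) / (retained[g] / n_retain)
        for g in retained
    }
    return [group_weight[g] for g in retain_groups]

# Toy example: the forget set removes half of group "a" and nothing from "b".
groups = ["a"] * 6 + ["b"] * 4
forget = [True, True, True, False, False, False] + [False] * 4
weights = group_reweighting(groups, forget)
```

Here group "a" drops from 60% of the training data to 3 of the 7 retained samples; the weights up-weight its remaining samples so that it again accounts for 60% of the weighted mass.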

8.1.4 DiMO:‌ Distilling Masked Diffusion Models‌​‌ into One-step Generator

Participants:​​ Stéphane Lathuilière.

Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose DiMO, a novel approach that distills masked diffusion models into a one-step generator. DiMO addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an "on-policy framework" with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to the teacher training distribution. We show DiMO's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.

8.1.5 Don't Forget your‌​‌ Inverse DDIM for Image​​ Editing

Participants: Stéphane Lathuilière​​​‌.

The field of‌ text-to-image generation has undergone‌​‌ significant advancements with the​​ introduction of diffusion models.​​​‌ Nevertheless, the challenge of‌ editing real images persists,‌​‌ as most methods are​​ either computationally intensive or​​​‌ produce poor reconstructions. This‌ paper introduces SAGE (Self-Attention‌​‌ Guidance for image Editing)​​​‌ - a novel technique​ leveraging pre-trained diffusion models​‌ for image editing. SAGE​​ builds upon the DDIM​​​‌ algorithm and incorporates a​ novel guidance mechanism utilizing​‌ the self-attention layers of​​ the diffusion U-Net. This​​​‌ mechanism computes a reconstruction​ objective based on attention​‌ maps generated during the​​ inverse DDIM process, enabling​​​‌ efficient reconstruction of unedited​ regions without the need​‌ to precisely reconstruct the​​ entire input image. Thus,​​​‌ SAGE directly addresses the​ key challenges in image​‌ editing. The superiority of​​ SAGE over other methods​​​‌ is demonstrated through quantitative​ and qualitative evaluations and​‌ confirmed by a statistically​​ validated comprehensive user study,​​​‌ in which all 47​ surveyed users preferred SAGE​‌ over competing methods. Additionally,​​ SAGE ranks as the​​​‌ top-performing method in seven​ out of 10 quantitative​‌ analyses and secures second​​ and third places in​​​‌ the remaining three.
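For readers unfamiliar with the inverse DDIM process mentioned above, the standard deterministic DDIM step and its inversion are commonly written as follows. The notation (cumulative noise schedule \bar\alpha_t, noise predictor \epsilon_\theta) is the usual one from the diffusion literature and is an assumption here, not notation taken from this report.

```latex
% Clean-image estimate and deterministic DDIM step (eta = 0), from x_t to x_{t-1}
\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}},
\qquad
x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0(x_t) + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t)

% Inverse DDIM step, from x_t to x_{t+1}, under the approximation
% \epsilon_\theta(x_{t+1}, t+1) \approx \epsilon_\theta(x_t, t)
x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,\hat{x}_0(x_t) + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t)
```

Because the inverse step relies on this approximation, running the inversion and then the forward sampler does not in general reproduce the input image exactly.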

8.2​ Human Behavior Understanding

8.2.1​‌ MEGA: Masked Generative Autoencoder​​ for Human Mesh Recovery​​​‌

Participants: Xavier Alameda-Pineda.

Human Mesh Recovery​‌ (HMR) from a single​​ RGB image is a​​​‌ highly ambiguous problem, as​ similar 2D projections can​‌ correspond to multiple 3D​​ interpretations. Nevertheless, most HMR​​​‌ methods overlook this ambiguity​ and make a single​‌ prediction without accounting for​​ the associated uncertainty. A​​​‌ few approaches generate a​ distribution of human meshes,​‌ enabling the sampling of​​ multiple predictions; however, none​​​‌ of them is competitive​ with the latest single-output​‌ model when making a​​ single prediction. This work​​​‌ proposes a new approach​ based on masked generative​‌ modeling. By tokenizing the​​ human pose and shape,​​​‌ we formulate the HMR​ task as generating a​‌ sequence of discrete tokens​​ conditioned on an input​​​‌ image. We introduce MEGA,​ a MaskEd Generative Autoencoder​‌ trained to recover human​​ meshes from images and​​​‌ partial human mesh token​ sequences. Given an image,​‌ our flexible generation scheme​​ allows us to predict​​​‌ a single human mesh​ in deterministic mode or​‌ to generate multiple human​​ meshes in stochastic mode.​​​‌ MEGA enables us to​ propose multiple outputs and​‌ to evaluate the uncertainty​​ of the predictions. Experiments​​​‌ on in-the-wild benchmarks show​ that MEGA achieves state-of-the-art​‌ performance in deterministic and​​ stochastic modes, outperforming single-output​​​‌ and multi-output approaches.

8.2.2​ Unlearning personal data from​‌ a single image

Participants:​​ Stéphane Lathuilière.

Machine unlearning aims to erase data from a model as if the latter never saw them during training. While existing approaches unlearn information from complete or partial access to the training data, this access can be limited over time due to privacy regulations. Currently, no setting or benchmark exists to probe the effectiveness of unlearning methods in such scenarios. To fill this gap, we propose a novel task we call One-Shot Unlearning of Personal Identities (1-SHUI) that evaluates unlearning models when the training data is not available. We focus on unlearning identity data, which is particularly relevant due to current regulations requiring personal data deletion after training. To cope with data absence, we expect users to provide a portrait picture to aid unlearning. We design requests on CelebA, CelebA-HQ, and MUFAC with different unlearning set sizes to evaluate applicable methods in 1-SHUI. Moreover, we propose MetaUnlearn, an effective method that meta-learns to forget identities from a single image. Our findings indicate that existing approaches struggle when data availability is limited, especially when there is a dissimilarity between the provided samples and the training data. The source code is available here.

8.2.3 AnCoGen:​​ Analysis, Control and Generation​​​‌ of Speech with a‌ Masked Autoencoder

Participants: Samir Sadok, Xavier Alameda-Pineda.

This work​​​‌ introduces AnCoGen, a novel‌ method that leverages a‌​‌ masked autoencoder to unify​​ the analysis, control, and​​​‌ generation of speech signals‌ within a single model.‌​‌ AnCoGen can analyze speech​​ by estimating key attributes,​​​‌ such as speaker identity,‌ pitch, content, loudness, signal-to-noise‌​‌ ratio, and clarity index.​​ In addition, it can​​​‌ generate speech from these‌ attributes and allow precise‌​‌ control of the synthesized​​ speech by modifying them.​​​‌ Extensive experiments demonstrated the‌ effectiveness of AnCoGen across‌​‌ speech analysis-resynthesis, pitch estimation,​​ pitch modification, and speech​​​‌ enhancement. Code and audio‌ examples are available online.‌​‌

8.2.4 Posterior Transition Modeling​​ for Unsupervised Diffusion-Based Speech​​​‌ Enhancement

Participants: Xavier Alameda-Pineda.

We explore‌​‌ unsupervised speech enhancement using​​ diffusion models as expressive​​​‌ generative priors for clean‌ speech. Existing approaches guide‌​‌ the reverse diffusion process​​ using noisy speech through​​​‌ an approximate, noise-perturbed likelihood‌ score, combined with the‌​‌ unconditional score via a​​ trade-off hyperparameter. In this​​​‌ work, we propose two‌ alternative algorithms that directly‌​‌ model the conditional reverse​​ transition distribution of diffusion​​​‌ states. The first method‌ integrates the diffusion prior‌​‌ with the observation model​​ in a principled way,​​​‌ removing the need for‌ hyperparameter tuning. The second‌​‌ defines a diffusion process​​ over the noisy speech​​​‌ itself, yielding a fully‌ tractable and exact likelihood‌​‌ score. Experiments on the​​ WSJ0-QUT and VoiceBank-DEMAND datasets​​​‌ demonstrate improved enhancement metrics‌ and greater robustness to‌​‌ domain shifts compared to​​ both supervised and unsupervised​​​‌ baselines.

8.3 Learning and‌ Control for Social Robots‌​‌

8.3.1 OpenSocInt: A Multi-modal​​ Training Environment for Human-Aware​​​‌ Social Navigation

Participants: Xavier‌ Alameda-Pineda.

We introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We describe the software package and showcase its usefulness via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL here.

8.3.2 Socially Pertinent‌ Robots in Gerontological Healthcare‌​‌

Participants: Soraya Arias,​​ Nicolas Turro, Alex​​​‌ Auternaud, Chris Reinke‌, Victor Sanchez,‌​‌ Xavier Alameda-Pineda.

Despite​​ the many recent achievements​​​‌ in developing and deploying‌ social robotics, there are‌​‌ still many underexplored environments​​ and applications for which​​​‌ systematic evaluation of such‌ systems by end-users is‌​‌ necessary. While several robotic​​ platforms have been used​​​‌ in gerontological healthcare, the‌ question of whether or‌​‌ not a social interactive​​​‌ robot with multi-modal conversational​ capabilities will be useful​‌ and accepted in real-life​​ facilities is yet to​​​‌ be answered. This paper​ is an attempt to​‌ partially answer this question,​​ via two waves of​​​‌ experiments with patients and​ companions in a day-care​‌ gerontological facility in Paris​​ with a full-sized humanoid​​​‌ robot endowed with social​ and conversational interaction capabilities.​‌ The software architecture, developed​​ during the H2020 SPRING​​​‌ project, together with the​ experimental protocol, allowed us​‌ to evaluate the acceptability​​ (AES) and usability (SUS)​​​‌ with more than 60​ end-users. Overall, the users​‌ are receptive to this​​ technology, especially when the​​​‌ robot perception and action​ skills are robust to​‌ environmental clutter and flexible​​ to handle a plethora​​​‌ of different interactions.

8.4​ Integrating a Large Language​‌ Model Into a Socially​​ Assistive Robot in a​​​‌ Hospital Geriatric Unit: Two-Wave​ Comparative Study on Performance,​‌ Engagement, and User Perceptions​​

Participants: Xavier Alameda-Pineda.

Healthcare systems struggle​ to meet the complex​‌ needs of older adults​​ in resource-limited settings. Socially​​​‌ assistive robots (SARs) offer​ a potential solution by​‌ providing information and practical​​ support. This study evaluated​​​‌ the integration of Large​ Language Models (LLMs) into​‌ SARs to improve interaction​​ fluency. Researchers compared a​​​‌ basic dialogue system (Wave​ 1) to an LLM-based​‌ system (Wave 2). The​​ evaluation focused on system​​​‌ performance, interaction success, and​ multidimensional user engagement. Conducted​‌ over eight months in​​ a Paris geriatric hospital,​​​‌ the study involved 28​ participants aged 60+. Interactions​‌ were video-recorded to code​​ for technical errors and​​​‌ verbal, physical, and emotional​ engagement. Validated scales were​‌ used to measure the​​ robot's overall usability and​​​‌ user acceptability. Results analyzed​ how user characteristics influenced​‌ perceptions of the LLM-enhanced​​ technology. The findings aim​​​‌ to minimize conversational errors​ and optimize SAR adaptability​‌ for real-world use. This​​ research provides insights into​​​‌ successfully deploying AI-driven robotics​ in geriatric care.

8.5​‌ Acceptability and Usability of​​ a Socially Assistive Robot​​​‌ Integrated With a Large​ Language Model for Enhanced​‌ Human-Robot Interaction in a​​ Geriatric Care Institution: Mixed​​​‌ Methods Evaluation

Participants: Xavier Alameda-Pineda.

Socially​‌ assistive robots (SARs) aim​​ to support older adults​​​‌ and clinicians by promoting​ well-being and managing routine​‌ tasks. However, ensuring high​​ levels of acceptability and​​​‌ usability remains a significant​ hurdle in dynamic care​‌ settings. This study evaluated​​ these factors by deploying​​​‌ the ARI humanoid robot​ in a geriatric day​‌ care hospital. Over one​​ year, 97 participants—comprising 65​​​‌ older patients and 32​ informal caregivers—engaged with the​‌ robot. The evaluation took​​ place in a waiting​​​‌ area in Paris, where​ ARI utilized voice-based dialogue​‌ for interaction. Researchers employed​​ a mixed-methods approach to​​​‌ capture a holistic view​ of the user experience.​‌ Quantitative data were gathered​​ through the Acceptability E-scale​​​‌ and the System Usability​ Scale. These assessments were​‌ administered orally to accommodate​​ the participants' accessibility needs.​​​‌ Qualitative feedback was also​ collected to identify subjective​‌ perceptions and specific contextual​​ barriers. The study sought​​​‌ to pinpoint key factors​ influencing the adoption of​‌ SARs by both patients​​ and caregivers. Ultimately, the​​ findings provide a framework​​​‌ for improving robot integration‌ into real-world geriatric environments.‌​‌
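As a reminder of how the System Usability Scale is conventionally scored, the sketch below implements the standard SUS formula (ten items rated from 1 to 5, odd items positively worded, even items negatively worded, total rescaled to 0-100). The function name and the example answers are hypothetical; the report does not state how the study computed its scores, and the Acceptability E-scale is not covered here.

```python
def sus_score(responses):
    """Compute the standard System Usability Scale score (0 to 100).

    responses: ten integer ratings from 1 (strongly disagree) to
    5 (strongly agree), in questionnaire order.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item, rating in enumerate(responses, start=1):
        # Odd items contribute (rating - 1); even items contribute (5 - rating).
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5

# Hypothetical participants: one neutral, one giving the best possible answers.
neutral = sus_score([3] * 10)
best = sus_score([5, 1] * 5)
```

With this scoring, a participant answering "3" to every item obtains 50, and the most favourable answer pattern obtains 100.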

9 Partnerships and cooperations​​

9.1 International initiatives

9.1.1​​​‌ Inria associate team not‌ involved in an IIL‌​‌ or an international program​​

VisaSpeech

Participants: Xavier Alameda-Pineda, Samir Sadok, Stéphane Lathuilière.

  • Title:
    Visually Assisted Speech​​ Processing
  • Duration:
    2025 ->​​​‌
  • Coordinator:
    Mirco Ravanelli (mirco.ravanelli@concordia.ca)‌
  • Partners:
    • Concordia University Montréal‌​‌ (Canada)
  • Inria contact:
    Xavier Alameda-Pineda
  • Summary:
    Fostered by deep learning models trained on massive datasets, artificial intelligence (AI) has recently changed the face of many subfields of computer science and information processing, including speech and audio, computer vision, natural language processing, and robotics. Large language models (LLMs) have become central in modern AI to process the sensory information of the world around us. Originally developed for text, LLMs have now been successfully extended to multimodal signals. Recently, some models for audio-visual speech have also been proposed to learn a joint representation of the clean speech audio signal and the lip images. These models have proven to be very useful for tasks such as audio-visual speech enhancement and recognition. While this research provides valuable insights into exploiting lip-related visual content for speech processing, little is known about foundation models exploiting other visual cues for speech processing. For instance: the speaker's background provides information on the type of environment (e.g. living room, backyard, kitchen), and therefore on the characteristics of the noise, to better guide the enhancement algorithm; the understanding of the surrounding objects could guide the speech recognition model to better infer a missing word; the head orientation could bring insights on who the current speaker is in a conversation. To our knowledge, there is so far no methodology for developing or exploiting foundation models that use lip-unrelated visual cues for speech processing. VisaSpeech will develop models and algorithms to jointly exploit this rich amount of information, thanks to the complementary expertise of the RobotLearn Inria team and Mirco Ravanelli's lab at Concordia University.

9.2 International​​​‌ research visitors

Other international‌ visits to the team‌​‌
Massimiliano Pappa

Participants: Xavier Alameda-Pineda, Stéphane Lathuilière.

  • Status:
    PhD‌
  • Institution of origin:
    Università‌​‌ della Sapienza, Roma
  • Country:​​
    Italy
  • Dates:
  • Context of​​​‌ the visit:
    Mobility during‌ PhD
  • Mobility program/type of‌​‌ mobility:
    Research Stay
  • Summary:​​
    Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for safety. We introduce the Latent Sufficiency Hypothesis, positing that a good policy's internal representation, combined with its predicted actions, constitutes a sufficient statistic for predicting near-future observations. To harness this, we present DILLO (Distilled Language-Action World Model), a fast safety layer that shifts the paradigm from "simulate-then-act" to "describe-then-act". Crucially, DILLO creates a "Zero-Visual-Overhead" inference path, bypassing heavy visual encoders entirely. Experiments on MetaWorld tasks demonstrate that DILLO serves as an effective rejection sampling controller.
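The "describe-then-act" idea used as a rejection sampling controller can be illustrated with a minimal sketch. This is not the DILLO implementation: the function names (`rejection_sample_action`, `sample_action`, `describe_outcome`, `is_safe`), the retry budget, and the toy stand-ins are all assumptions made for illustration. The only point carried over from the summary is that a textual description of the outcome is predicted from the policy latent and the candidate action, without any visual encoding, and that unsafe candidates are rejected.

```python
import random
from typing import Callable, List, Optional

Latent = List[float]
Action = List[float]


def rejection_sample_action(
    latent: Latent,
    sample_action: Callable[[Latent], Action],
    describe_outcome: Callable[[Latent, Action], str],
    is_safe: Callable[[str], bool],
    max_tries: int = 8,
) -> Optional[Action]:
    """Return the first sampled action whose predicted description is judged safe.

    No image is rendered or encoded: the description is predicted from the
    policy latent and the candidate action only.
    """
    for _ in range(max_tries):
        action = sample_action(latent)
        description = describe_outcome(latent, action)
        if is_safe(description):
            return action
    return None  # every candidate was rejected; the caller picks a fallback


# Toy stand-ins (all hypothetical) so that the sketch runs end to end.
rng = random.Random(0)


def toy_sample_action(latent: Latent) -> Action:
    return [rng.uniform(-1.0, 1.0)]


def toy_describe_outcome(latent: Latent, action: Action) -> str:
    # A real model would decode text from (latent, action); here a threshold.
    if abs(action[0]) > 0.5:
        return "gripper collides with the wall"
    return "gripper reaches the target"


def toy_is_safe(description: str) -> bool:
    return "collides" not in description


chosen = rejection_sample_action(
    [0.0, 1.0], toy_sample_action, toy_describe_outcome, toy_is_safe
)
```

Returning `None` when all candidates are rejected leaves the fallback behaviour (stop, retry, hand over to an operator) to the caller, which keeps the safety layer independent of the policy it wraps.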
Javier Venema Rodriguez

Participants:​ Stéphane Lathuilière.

  • Status:
    PhD
  • Institution of origin:​​
    Panacea Cooperative Research, Universidad​​​‌ de Granada
  • Country:
    Spain​
  • Dates:
    May-June/2025
  • Context of​‌ the visit:
  • Mobility program/type​​ of mobility:
    Research Stay​​​‌
  • Summary:

    Craniofacial reconstruction (CFR) is an identification technique that reconstructs facial appearance from the skull structure alone. It is of special relevance in situations where there are no reference data or samples (e.g., medical records, family DNA). The main objective of this work is to develop a reliable and objective method that allows the automation of CFR and its forensic use, by comparing different strategies based on generative AI. With that aim, three strategies have been followed: (i) the use of generative adversarial networks (GANs) with volumetric images (3D), (ii) the use of GANs with multi-view depth maps (2.5D), building on the work of Pan et al. 2024 [1], and (iii) the use of diffusion models. The training has been carried out on a sample of more than a thousand examples sourced from public repositories (NMDID) and collaborations (NFS Seoul), facilitated by access to the computational resources of EuroHPC (MNS 5, BSC).

    Preliminary results point to superior performance of 2.5D GANs compared with the other strategies in terms of quality and fidelity to the real image. Within this approach, the best results so far have been obtained by using three views of the skull model (-30, 0, and 30 degrees) as input, in combination with a Wasserstein GAN with gradient penalty (WGAN-GP) for training. In a cross-comparison of CFR outputs and ground-truth images, a correspondence ranking was computed by combining different metrics (MAE and perceptual loss), placing the correct identity at position 4.88 on average. In summary, the use of GANs on 2.5D images constitutes a promising strategy for the development of an automatic CFR tool for forensic use, given that it also offers lower computational costs and environmental impact than other, more computationally intensive approaches. These results form the basis for future developments towards a photorealistic CFR tool.
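The mean-rank figure reported above can be illustrated with a small sketch. The exact way the two metrics were combined is not stated in the summary, so the equal-weight sum in `combine_distances` is an assumption, as are all function names and the toy 3x3 matrices. Entry `[i][j]` of each matrix is taken to be the distance between reconstruction `i` and ground-truth image `j`, so the correct identity lies on the diagonal.

```python
from typing import List

Matrix = List[List[float]]


def combine_distances(mae: Matrix, perceptual: Matrix, weight: float = 0.5) -> Matrix:
    """Weighted sum of two distance matrices (hypothetical combination rule)."""
    return [
        [weight * m + (1.0 - weight) * p for m, p in zip(mae_row, perc_row)]
        for mae_row, perc_row in zip(mae, perceptual)
    ]


def rank_of_correct_identity(distances: Matrix, index: int) -> int:
    """Rank (1 = best) of the true identity among all candidates for one reconstruction."""
    row = distances[index]
    closer = sum(1 for j, d in enumerate(row) if j != index and d < row[index])
    return 1 + closer


def mean_rank(distances: Matrix) -> float:
    """Average rank of the correct identity over all reconstructions."""
    n = len(distances)
    return sum(rank_of_correct_identity(distances, i) for i in range(n)) / n


# Toy data (made up for illustration, not results from the study).
toy_mae = [[0.1, 0.4, 0.3], [0.5, 0.2, 0.1], [0.3, 0.3, 0.2]]
toy_perceptual = [[0.2, 0.6, 0.5], [0.4, 0.3, 0.2], [0.6, 0.5, 0.1]]

toy_combined = combine_distances(toy_mae, toy_perceptual)
toy_ranks = [rank_of_correct_identity(toy_combined, i) for i in range(3)]
toy_mean_rank = mean_rank(toy_combined)
```

A mean rank of 1 would mean that every reconstruction is closest to its own ground-truth face; larger values indicate how many other identities are, on average, ranked ahead of the correct one.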

9.3​​​‌ European initiatives

9.3.1 H2020​ projects

Participants: Stéphane Lathuilière​‌.

  • Title:
    FaceGEN
  • Duration:​​
    1 year (2025)
  • Coordinator:​​​‌
    Victoria Ulloa (victoria.ulloa@panacea-coop.com)
  • Partners:​
    • Panacea Cooperative Research, Spain​‌
    • University of Granada, Spain​​
  • Inria contact:
    Stéphane Lathuilière​​​‌
  • Summary:
    Forensic human identification​ is an essential step​‌ in both criminal investigations​​ and humanitarian efforts. Traditional​​​‌ methods such as DNA​ profiling, fingerprints, and dental​‌ charts are often highly​​ reliable. Still, they depend​​​‌ on the availability of​ ante-mortem data and the​‌ physical condition of the​​ remains. Unfortunately, in many​​​‌ cases, particularly after natural​ disasters, armed conflicts, or​‌ when dealing with remains​​ that are decades old,​​​‌ these methods cannot be​ applied. In such scenarios,​‌ forensic anthropology provides alternative​​ routes. One of these​​​‌ is Craniofacial Reconstruction (CFR),​ the process of recreating​‌ a person’s facial appearance​​ starting from their skull.​​ CFR is based on​​​‌ the well-established correlation between‌ bone structure and soft‌​‌ tissue morphology. Today, however,​​ it remains largely a​​​‌ manual process, requiring the‌ expertise of highly specialized‌​‌ forensic artists. These reconstructions​​ are costly, time-intensive, and​​​‌ difficult to scale. This‌ is where AI and,‌​‌ in particular, generative AI​​ enter the picture. Recent​​​‌ advances in image generation‌ models and high-performance computing‌​‌ resources have opened the​​ door to automating CFR​​​‌ in a way that‌ was unthinkable just a‌​‌ few years ago. By​​ training AI systems to​​​‌ learn from large collections‌ of images, it is‌​‌ now possible to model​​ the relationship between skull​​​‌ shapes and facial features.‌ For forensic practitioners, this‌​‌ will mean faster, more​​ reproducible, and objective reconstructions.​​​‌ For society, it offers‌ new ways to provide‌​‌ closure in unsolved cases​​ and to address the​​​‌ growing number of unidentified‌ remains worldwide.

10 Dissemination‌​‌

10.1 Promoting scientific activities​​

10.1.1 Scientific events: organisation​​​‌

General chair, scientific chair‌

As General co-Chair of ACM Multimedia 2026, Xavier Alameda Pineda started working on the organisation of that edition of the conference.

Member‌​‌ of the organizing committees​​

As Web Chair of ACM Multimedia 2026, Stéphane Lathuilière started working on the website of the conference.

10.1.2 Scientific events:​​​‌ selection

Member of the‌ conference program committees

Xavier Alameda Pineda was Senior Area Chair of ACM Multimedia 2025, and Area Chair of IEEE ICASSP'25 and ICIAP'25.

Stéphane Lathuilière was Area Chair of ICCV 2025 and CVPR 2025.

Reviewer

Stéphane Lathuilière was a reviewer for WACV 2025 (rounds 1 and 2).

10.1.3‌ Journal

Member of the‌​‌ editorial boards

During 2025, Xavier Alameda Pineda was Associate Editor of ACM TOMM and CVIU.

Reviewer

Stéphane Lathuilière was a reviewer for TMLR.

10.1.4 Invited talks

Xavier‌ Alameda Pineda was invited‌​‌ to give a course​​ on the topic “From​​​‌ VAE to Diffusion: probabilistic‌ learning with audio-visual data”‌​‌ at the INPT AI​​ Summer School and an​​​‌ invited talk on “Multimodal‌ perception, action, and evaluation‌​‌ of socially intelligent robots”​​ at the International Workshop​​​‌ on AI for Robotics,‌ organised by Naver Labs‌​‌ Europe.

10.1.5 Leadership within​​ the scientific community

Xavier Alameda Pineda is deeply involved in the multimedia community at the European and international levels. At the European level, Xavier is one of the founders of the SIGMM European Chapter, serving first as Chair (2024-2025), then as Treasurer (2025-2028). At the international level, Xavier has been part of the Steering Committee of ACM Multimedia since 2022.

10.2 Teaching‌​‌ - Supervision - Juries​​ - Educational and pedagogical​​​‌ outreach

10.2.1 Supervision

Xavier‌ Alameda Pineda supervised the‌​‌ following PhD students: Gaétan​​ Lepage (defended), Jean-Eudes Ayilo,​​​‌ Sofiene Kammoun, and Maxime‌ Attwood.

Stéphane Lathuilière supervised‌​‌ the following PhD students:​​ Maxime Attwood, Hugo Malard,​​​‌ Sarra Khairi, Imad Marouf,‌ Yasser Benigmim (defended), Thomas‌​‌ De Min, Yuanzhi Zhu.​​

10.2.2 Juries

Xavier Alameda Pineda was the Chair of the HDR committee of Sergi Pujades, and the Chair of the PhD committees of Timothée Darcet and of Rim Rekik.

Xavier Alameda Pineda participated in the Selection Committee of the Public Exam for Research Positions at Inria U. Côte d'Azur and in the Selection Committee for a Maître de Conférences position at Télécom ParisTech.

Stéphane Lathuilière was a reviewer for the PhD theses of Paul Couairon and Nicola Dall'Asen.

10.2.3 Educational​‌ and pedagogical outreach

Xavier Alameda Pineda participated in two Masters courses: "Generative Multimodal AI" and "Learning, Probabilities, and Causality". Stéphane Lathuilière participated in one UGA Masters course, "Generative Multimodal AI", and one Ensimag course, "Perception, Vision et Apprentissage".

11​​​‌ Scientific production

11.1 Major​ publications

  • 1 (article) Louis Airale, Dominique Vaufreydaz and Xavier Alameda-Pineda. SocialInteractionGAN: Multi-person Interaction Sequence Generation. IEEE Transactions on Affective Computing, May 2022.
  • 2 (article) Yutong Ban, Xavier Alameda-Pineda, Christine Evers and Radu Horaud. Tracking Multiple Audio Sources with the Von Mises Distribution and Variational EM. IEEE Signal Processing Letters 26(6), June 2019, 798-802.
  • 3 (article) Yutong Ban, Xavier Alameda-Pineda, Laurent Girin and Radu Horaud. Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(5), May 2021, 1761-1776.
  • 4 (article) Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda and Laurent Girin. Unsupervised Speech Enhancement using Dynamical Variational Autoencoders. IEEE/ACM Transactions on Audio, Speech and Language Processing 30, September 2022, 2993-3007.
  • 5 (inproceedings) Guillaume Delorme, Yihong Xu, Stéphane Lathuilière, Radu Horaud and Xavier Alameda-Pineda. CANU-ReID: A Conditional Adversarial Network for Unsupervised person Re-IDentification. ICPR 2020 - 25th International Conference on Pattern Recognition, Milano, Italy, IEEE, 2021, 4428-4435.
  • 6 (article) Georgios Evangelidis and Radu Horaud. Joint Alignment of Multiple Point Sets with Batch and Incremental Expectation-Maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), June 2018, 1397-1410.
  • 7 (article) Israel Gebru, Sileye Ba, Xiaofei Li and Radu Horaud. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(5), July 2018, 1086-1099.
  • 8 (article) Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber and Xavier Alameda-Pineda. Dynamical Variational Autoencoders: A Comprehensive Review. Foundations and Trends in Machine Learning 15(1-2), December 2021, 1-175.
  • 9 (article) Zhiqi Kang, Mostafa Sadeghi, Radu Horaud and Xavier Alameda-Pineda. Expression-preserving face frontalization improves visually assisted speech processing. International Journal of Computer Vision, January 2023.
  • 10 (article) Stéphane Lathuilière, Benoît Massé, Pablo Mesejo and Radu Horaud. Neural Network Based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction. Pattern Recognition Letters 118, February 2019, 61-71.
  • 11 (article) Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda and Radu Horaud. A Comprehensive Analysis of Deep Regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(9), September 2020, 2065-2081.
  • 12 (article) Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud. Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments. IEEE Journal of Selected Topics in Signal Processing 13(1), March 2019, 88-103.
  • 13 (article) Xiaofei Li, Sharon Gannot, Laurent Girin and Radu Horaud. Multichannel Identification and Nonnegative Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function. IEEE/ACM Transactions on Audio, Speech and Language Processing 26(10), May 2018, 1755-1768.
  • 14 (article) Xiaofei Li, Laurent Girin, Sharon Gannot and Radu Horaud. Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function. IEEE/ACM Transactions on Audio, Speech and Language Processing 27(3), March 2019, 645-659.
  • 15 (article) Xiaofei Li, Simon Leglaive, Laurent Girin and Radu Horaud. Audio-noise Power Spectral Density Estimation Using Long Short-term Memory. IEEE Signal Processing Letters 26(6), June 2019, 918-922.
  • 16 (article) Xiaoyu Lin, Laurent Girin and Xavier Alameda-Pineda. Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation. Transactions on Machine Learning Research Journal, 2024, 1-19.
  • 17 (article) Benoît Massé, Silèye Ba and Radu Horaud. Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(11), November 2018, 2711-2724.
  • 18 (article) Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin and Radu Horaud. Audio-Visual Speech Enhancement Using Conditional Variational Auto-Encoders. IEEE/ACM Transactions on Audio, Speech and Language Processing 28, May 2020, 1788-1800.
  • 19 (article) Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda and Renaud Séguier. A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning. Neural Networks 172, April 2024, 106120.
  • 20 (article) Aliaksandr Siarohin, Gloria Zen, Cveta Majtanovic, Xavier Alameda-Pineda, Elisa Ricci and Nicu Sebe. Increasing Image Memorability with Neural Style Transfer. ACM Transactions on Multimedia Computing, Communications and Applications 15(2), June 2019.
  • 21 (inproceedings) Lorenzo Vaquero, Yihong Xu, Xavier Alameda-Pineda, Victor M. Brea and Manuel Mucientes. Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking. ECCV 24 - 18th European Conference on Computer Vision, Milan, Italy, July 2024, 1-30.
  • 22 (article) Dan Xu, Xavier Alameda-Pineda, Wanli Ouyang, Elisa Ricci, Xiaogang Wang and Nicu Sebe. Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(5), May 2022, 2673-2688.
  • 23 (article) Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus and Xavier Alameda-Pineda. TransCenter: Transformers With Dense Representations for Multiple-Object Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, November 2022, 1-16.
  • 24 (article) Guanglei Yang, Enrico Fini, Dan Xu, Paolo Rota, Mingli Ding, Moin Nabi, Xavier Alameda-Pineda and Elisa Ricci. Uncertainty-aware Contrastive Distillation for Incremental Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, March 2022, 1-14.
  • 25 (article) Guanglei Yang, Enrico Fini, Dan Xu, Paolo Rota, Mingli Ding, Hao Tang, Xavier Alameda-Pineda and Elisa Ricci. Continual Attentive Fusion for Incremental Learning in Semantic Segmentation. IEEE Transactions on Multimedia, April 2022.

11.2 Publications​ of the year

International​‌ journals

International​​ peer-reviewed conferences

  • 34 (inproceedings) Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel and Xavier Alameda-Pineda. Diffusion-based Unsupervised Audio-visual Speech Enhancement. ICASSP 2025 - International Conference on Acoustics Speech and Signal Processing, Hyderabad, India, IEEE, 2025, 1-5.
  • 35 (inproceedings) Amdjed Belaref, Samir Sadok, Karim Ibrahim, Zineb Noumir and Renaud Séguier. Can AI Decode the Circumplex Model of Affect? A Data-driven Study. Pattern Recognition. ICPR 2024 International Workshops and Challenges, Lecture Notes in Computer Science, vol. 15614, International Conference on Pattern Recognition, ICPR 2024, Kolkata, India, Springer Nature Switzerland, May 2025, 97-108.
  • 36 (inproceedings) Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda and Francesc Moreno-Noguer. MEGA: Masked Generative Autoencoder for Human Mesh Recovery. CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville (Tennessee), United States, IEEE, 2025, 1-16.
  • 37 (inproceedings) Imad Eddine Marouf, Enzo Tartaglione, Stéphane Lathuilière and Joost van de Weijer. Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering. ICCV 2025 - International Conference on Computer Vision, Honolulu, United States, October 2025.
  • 38 (inproceedings) Samir Sadok, Julien Hauret and Eric Bavu. Bringing Interpretability to Neural Audio Codecs. Interspeech 2025 - 26th edition of the Interspeech Conference, Rotterdam, Netherlands, August 2025, 1-5.
  • 39 (inproceedings) Samir Sadok, Simon Leglaive, Laurent Girin, Gaël Richard and Xavier Alameda-Pineda. AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder. ICASSP 2025 - IEEE International Conference on Acoustics, Speech, and Signal Processing, Hyderabad, India, IEEE, January 2025, 1-5.
  • 40 (inproceedings) Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière and Vicky Kalogeiton. Di[M]O: Distilling Masked Diffusion Models into One-step Generator. 2025 International Conference on Computer Vision (ICCV 2025), Hawaii, United States, October 2025.

National​​ peer-reviewed Conferences

  • 41 (inproceedings) Samir Sadok, Julien Hauret and Eric Bavu. Donner du sens aux Codecs Neuronaux : Interprétabilité des Tokens discrets produits pour des Signaux Vocaux. CFA 2025 - 17e Congrès Français d'Acoustique, Paris, France, 2025.

Reports & preprints

11.3 Cited publications​​

  • 45 (inproceedings) Triantafyllos Afouras, Andrew Owens, Joon Son Chung and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XVIII, Springer, 2020, 208-224.
  • 46 (article) Sileye Ba, Xavier Alameda-Pineda, Alessio Xompero and Radu Horaud. An On-line Variational Bayesian Model for Multi-Person Tracking from Cluttered Scenes. Computer Vision and Image Understanding 153, December 2016, 64-76.
  • 47 (misc) Anand Ballou, Xavier Alameda-Pineda and Chris Reinke. Variational Meta Reinforcement Learning for Social Robotics. December 2022.
  • 48 (inproceedings) Yutong Ban, Xavier Alameda-Pineda, Fabien Badeig, Sileye Ba and Radu Horaud. Tracking a Varying Number of People with a Visually-Controlled Robotic Head. IEEE/RSJ International Conference on Intelligent Robots and Systems, Vancouver, Canada, IEEE, September 2017, 4144-4151.
  • 49 (article) Yutong Ban, Xavier Alameda-Pineda, Christine Evers and Radu Horaud. Tracking Multiple Audio Sources with the Von Mises Distribution and Variational EM. IEEE Signal Processing Letters 26(6), June 2019, 798-802.
  • 50 (inproceedings) Dražen Brščić, Hiroyuki Kidokoro, Yoshitaka Suehiro and Takayuki Kanda. Escaping from children's abuse of social robots. Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, 2015, 59-66.
  • 51 (inproceedings) Wan-Ling Chang, Jeremy P. White, Joohyun Park, Anna Holm and Selma Šabanović. The effect of group size on people's attitudes and cooperative behaviors toward robots in interactive gameplay. RO-MAN International Symposium on Robot and Human Interactive Communication, IEEE, 2012, 845-850.
  • 52 (inproceedings) Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson and Kristen Grauman. Soundspaces: Audio-visual navigation in 3d environments. Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VI, Springer, 2020, 17-36.
  • 53 (article) Mary Ellen Foster, Andre Gaschler and Manuel Giuliani. Automatically classifying user engagement for dynamic multi-party human-robot interaction. International Journal of Social Robotics 9(5), 2017, 659-674.
  • 54 (inproceedings) Ruohan Gao and Kristen Grauman. Visualvoice: Audio-visual speech separation with cross-modal consistency. IEEE/CVF CVPR, 2021.
  • 55 (article) Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber and Xavier Alameda-Pineda. Dynamical Variational Autoencoders: A Comprehensive Review. Foundations and Trends in Machine Learning 15(1-2), December 2021, 1-175.
  • 56 (article) Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda and Radu Horaud. Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments. IEEE Journal of Selected Topics in Signal Processing 13(1), March 2019, 88-103.
  • 57 (misc) Xiaoyu Lin, Laurent Girin and Xavier Alameda-Pineda. Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder. February 2022.
  • 58 (misc) Chris Reinke and Xavier Alameda-Pineda. Successor Feature Representations. May 2022.
  • 59 (article) Sarah Sebo, Brett Stoll, Brian Scassellati and Malte F. Jung. Robots in groups and teams: a literature review. Proceedings of the ACM on Human-Computer Interaction 4(CSCW2), 2020, 1-36.
  • 60 (book) Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • 61 (article) Mateusz Żarkowski. Multi-party turn-taking in repeated human-robot interactions: an interdisciplinary evaluation. International Journal of Social Robotics 11(5), 2019, 693-707.
  • 62 (article) Jingwei Zhang, Lei Tai, Peng Yun, Yufeng Xiong, Ming Liu, Joschka Boedecker and Wolfram Burgard. Vr-goggles for robots: Real-to-sim domain adaptation for visual control. IEEE Robotics and Automation Letters 4(2), 2019, 1148-1155.