STARS - 2025

2025 Activity Report: Project-Team STARS

RNSR: 201221015V

Creation​‌ of the Project-Team: 2024​​ November 01

Keywords​​

Computer Science and Digital​​ Science

  • A9. Artificial intelligence​​​‌
  • A9.1. Knowledge
  • A9.2. Machine‌ learning
  • A9.3. Signal processing‌​‌
  • A9.8. Reasoning
  • A9.12. Computer​​ vision

Other Research Topics​​​‌ and Application Domains

  • B2.1.‌ Well being
  • B2.4. Therapies‌​‌

1 Team members, visitors,​​ external collaborators

Research Scientists​​​‌

  • François Brémond [Team‌ leader, INRIA,‌​‌ Senior Researcher, HDR​​]
  • Michal Balazia [​​​‌UNIV COTE D'AZUR,‌ ISFP]
  • Antitza Dantcheva‌​‌ [INRIA, Senior​​ Researcher, HDR]​​​‌
  • Monique Thonnat [INRIA‌, Emeritus, HDR‌​‌]

Post-Doctoral Fellows

  • Baptiste​​ Chopin [INRIA,​​​‌ Post-Doctoral Fellow, until‌ Aug 2025]
  • Olivier‌​‌ Huynh [INRIA,​​ Post-Doctoral Fellow, until​​​‌ Feb 2025]

PhD‌ Students

  • Tanay Agrawal [‌​‌INRIA, until Oct​​ 2025]
  • Yuan Gao​​​‌ [INRAE, until‌ Dec 2026]
  • Snehashis‌​‌ Majhi [INRIA,​​ until Mar 2026]​​​‌
  • Nabyl Quignon [INRIA‌, from Mar 2025‌​‌]
  • Aglind Reka [​​UNIV COTE D'AZUR,​​​‌ from Sep 2025]‌
  • Tomasz Stanczyk [INRIA‌​‌, until May 2026​​]
  • Valeriya Strizhkova [​​​‌INRIA, until Feb‌ 2025]
  • Charbel Yahchouchi‌​‌ [Probayes, CIFRE​​, from Jul 2025​​​‌]
  • Seongro Yoon [‌INRIA, until Oct‌​‌ 2027]

Technical Staff​​

  • Mahmoud Ali [INRIA​​​‌, Engineer]
  • Marios‌ Kaplanis [INRIA,‌​‌ Engineer, from Apr​​ 2025 until Jun 2025​​​‌]
  • Aowen Shi [‌INRIA, Engineer,‌​‌ from Feb 2025]​​
  • Yoann Torrado [INRIA​​​‌, Engineer, until‌ May 2025]

Interns‌​‌ and Apprentices

  • Hardik Agarwal​​ [INRIA, Intern​​​‌, from Apr 2025‌ until Jul 2025]‌​‌
  • Akshaya Ananda Murthy [​​INRIA, Intern,​​​‌ from Apr 2025 until‌ Aug 2025]
  • Andranik‌​‌ Arakelov [INRIA,​​ Intern, from Jun​​​‌ 2025 until Oct 2025‌]
  • Saurabh Atreya [‌​‌INRIA, Intern,​​ until May 2025]​​​‌
  • Aaryan Dhawan [UNIV‌ COTE D'AZUR, Intern‌​‌, from Apr 2025​​ until Sep 2025]​​​‌
  • Anil Egin [INRIA‌, Intern, until‌​‌ Apr 2025]
  • Utkarsh​​ Gupta [INRIA,​​​‌ Intern, from Apr‌ 2025 until Jul 2025‌​‌]
  • Khodor Hamadi [​​INRIA, Intern,​​​‌ from Jun 2025 until‌ Nov 2025]
  • Mingyun‌​‌ Jeong [INRIA,​​ Intern, from Jul​​​‌ 2025]
  • Jang Hyun‌ Kim [INRIA,‌​‌ Intern, from Jul​​ 2025 until Sep 2025​​​‌]
  • Dian-Wei Lai [‌INRIA, Intern,‌​‌ from Oct 2025]​​
  • Quentin Merilleau [INRIA​​​‌, from Feb 2025‌]
  • Nishit Poddar [‌​‌INRIA, Intern,​​ from Apr 2025 until​​​‌ Jun 2025]
  • Nabyl‌ Quignon [INRIA,‌​‌ until Feb 2025]​​
  • Aglind Reka [INRIA​​​‌, until Aug 2025‌]
  • Miriana Russo [‌​‌INRIA, Intern,​​ from Oct 2025]​​​‌
  • Ananya Sharma [INRIA‌, from Sep 2025‌​‌]
  • Utkarsh Tiwari [​​Inria, until Apr​​​‌ 2025]

Administrative Assistant‌

  • Marie-Cecile Lafont [INRIA‌​‌]

Visiting Scientists

  • Seungryul​​ Baek [Ulsan National​​​‌ Institute of Science and‌ Technology (UNIST), Republic of‌​‌ Korea, from Jul​​​‌ 2025 until Aug 2025​]
  • Donghyeon Cho [​‌Hanyang University, Seoul, Republic​​ of Korea, from​​​‌ Dec 2025]
  • Nesli​ Ergdogmus [Izmir Institute​‌ of Technology, Turkey,​​ from Jun 2025 until​​​‌ Jul 2025]
  • Salvatore​ Fiorilla [ University of​‌ Bologna, Italy, until​​ Feb 2025]
  • Eric​​​‌ Granger [ETS MONTREAL​, from Dec 2025​‌]
  • Jinsun Park [​​Pusan National University, Busan,​​​‌ Republic of Korea,​ from Jul 2025 until​‌ Jul 2025]
  • Teimuraz​​ Saginadze [MICM, Georgian​​​‌ Technical University, GTU, Tbilisi,​ Georgia, from Feb​‌ 2025 until Apr 2025​​]

External Collaborators

  • Abid​​​‌ Ali [INRIA, then​ University of Luxembourg,​‌ until Jul 2025]​​
  • Laura Ferrari [Scuola​​​‌ Superiore Sant'Anna, Pisa, Italy​]
  • Rachid Guerchouche [​‌CoBTek]
  • Alexandra Konig​​ [CoBTek, CHU NICE​​​‌]
  • Benoit Lagadec [​FairVision, from Jul​‌ 2025]
  • Hali Lindsay​​ [KIT - ALLEMAGNE​​​‌, until Jun 2025​]
  • Sabine Moisan [​‌retired, HDR]​​
  • Jean Rigault [retired​​​‌]
  • Philippe Robert [​CoBTek]
  • Yaohui Wang [Shanghai AI Lab]
  • Di Yang [Shanghai University, until Aug 2025]

2​‌ Overall objectives

2.1 Presentation​​

The STARS research project-team​​​‌ focuses on the design​ of computer vision methods​‌ for real-time understanding of​​ social interactions observed by​​​‌ sensors. Our objective is​ to propose new algorithms​‌ to analyze the behavior​​ of people suffering from​​​‌ behavioral disorders, in order​ to improve their quality​‌ of life. We study​​ long-term spatio-temporal interactions performed​​​‌ by humans in their​ natural environment. We address​‌ this challenge by proposing​​ novel deep-learning architectures to​​​‌ model behavioral traits such​ as facial expression, gaze,​‌ gestures, body behavior, and​​ body language. To cope​​​‌ with the limited amount​ of available data and​‌ the privacy issues of​​ medical data, we propose​​​‌ data generation for data​ augmentation and anonymization. Another​‌ important challenge is to​​ make the link between​​​‌ collected data, medical diagnosis,​ and ultimately treatments. To​‌ validate our research we​​ work closely with our​​​‌ clinical partners, in particular​ those of the Nice​‌ Hospital.

2.2 Motivation

Deep learning techniques are highly successful for simple action recognition; nevertheless, several important challenges remain in activity recognition in general, and specifically in our target medical application domain.

To validate our research, we work closely with our clinical partners. We have a strategic partnership named CoBTek with the clinicians of Nice Hospital (CHU Nice) to study the impact of video understanding approaches for cognitive disorders. This partnership started in January 2012 and has evolved into a University Côte d'Azur team and joint work, with regular monthly meetings between STARS and the clinicians of the Institut Claude Pompidou (ICP), Lenval, and Pasteur hospitals. The two directors of CoBTek are François Brémond and Florence Askenazy (PU-PH) at Lenval. Our objective of deepening research in social interaction is motivated by the needs of our clinician partners. A typical use case of social interactions observed by sensors appears in the clinical assessment of psychiatric patients, such as people suffering from major depression, bipolar disorder, or schizophrenia [25]. In these clinical assessments, interactions between the patient and the clinician are recorded with multiple modalities, i.e., video, audio, and physiological sensors. The goal is to extract digital markers (defined by formal interaction models), which are indicators that can characterize a digital phenotype. The digital markers are automatically extracted from the recorded data, and the digital phenotypes could lead to a treatment improving the patient's behavioral disorders.

Social interaction​​​‌ as a new study‌ target.

An abundance of valuable, diagnostically relevant information is extracted from the interaction between clinician and patient. This clinical interaction (e.g., conversation between patient and clinician, including verbal and nonverbal communication) is traditionally the clinician's most important source of information about patients' social skills, mood, and motivation levels. However, a comprehensive clinical interview requires sufficient consultation time as well as strong clinical competencies and expertise to detect early, subtle signs of changes in communication. Moreover, no standardized objective measures exist for detecting these changes during a clinical conversation, which leaves considerable room for speculation and subjective biases. Introducing methodologies to quantitatively assess behavioral dynamics during real-life social interaction could help indicate, for instance, the level of reciprocity and therapeutic alliance, which until now have been left to clinical intuition, as we have pointed out [25].

Need‌ for precise and sensitive‌​‌ digital markers.

To develop and test new measures of mental illness, a movement from traditional markers and phenotyping to digital markers and digital phenotyping is needed. "Digital phenotyping" refers to the moment-to-moment quantification of human behavior in everyday life using data from digital devices. Digital phenotyping involves collecting patient data that allow for non-intrusive and continuous monitoring of behavioral and mental states, ultimately revealing clinically relevant information. Similarly, "digital markers" (e.g., frequency of eye contact) are digitally obtained disease indicators that can be used to define a digital phenotype (e.g., eye gaze). Interaction-based phenotyping could provide various additional data to generate an observer-independent assessment of behavior during a social interaction, one that mirrors the patient's current symptomatology. Additionally, interaction-based measures such as social synchrony may have predictive value for treatment outcomes. Recent progress in computer vision, speech processing, and machine learning has enabled detailed and objective characterization of human interaction behavior [8]. Applying these advanced methods of artificial intelligence provides new opportunities to identify digital markers of patient behavior. Such markers have the potential to provide objective and continuous assessments of symptomatology in the context of patients' daily lives [30, 4], thereby allowing treatment to be precisely tailored to the concrete patient trajectory. So far, many of the developed techniques are based solely on verbal information during interviews; however, interpersonal communication often occurs non-verbally.
Thus, integrating computer vision-based measurement into a multi-modal approach would enhance the quality of analysis by allowing changes in the quality of communication to be detected as alterations in the dyadic interaction patterns.
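As an illustration of how a digital marker such as "frequency of eye contact" might be quantified, the following minimal sketch summarizes a sequence of per-frame eye-contact decisions. It assumes that an upstream detector (not shown) has already produced one boolean per video frame; the function names, the frame rate parameter, and the two summary statistics are hypothetical choices made for this example and do not describe the team's actual pipeline.

```python
from typing import Dict, List


def count_episodes(flags: List[bool]) -> int:
    """Count distinct eye-contact episodes, i.e. maximal runs of True."""
    episodes = 0
    previous = False
    for flag in flags:
        if flag and not previous:  # a new run starts on a False -> True transition
            episodes += 1
        previous = flag
    return episodes


def eye_contact_marker(flags: List[bool], fps: float) -> Dict[str, float]:
    """Summarize per-frame eye-contact flags (hypothetical input format)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    duration_min = len(flags) / fps / 60.0
    return {
        # how often eye contact is initiated, normalized by recording length
        "episodes_per_minute": count_episodes(flags) / duration_min if duration_min > 0 else 0.0,
        # share of frames in which eye contact is present
        "fraction_of_time": sum(flags) / len(flags) if flags else 0.0,
    }
```

For a one-minute recording at one frame per second containing two separate eye-contact runs of 10 and 5 frames, this sketch reports two episodes per minute and eye contact during a quarter of the recording.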

Digital​​​‌ markers and methods.

In recent years, behavior recognition methods based on artificial intelligence (i.e., machine or deep learning) have become increasingly effective in a variety of tasks, including action classification [19], body language and gestures [6], gaze estimation [26], eye contact detection, facial action units, and facial expression [27], as well as affect extracted from single or multiple modalities [2]. A growing number of approaches make use of this progress in human behavior sensing to analyze clinical interaction data (e.g., therapy sessions) and linguistic and paralinguistic characteristics of speech. As psychiatric disorders (depression, bipolar disorder, schizophrenia) impact the quality of social interactions, there is an emphasis on studying these quantifiable behavioral dynamics in real-life social interaction at the dyadic level rather than solely individual behavior [25]. While these initial results are promising, this research needs to be accelerated by further development of digital phenotyping technology focusing on scalability and equity, by establishing shared longitudinal data repositories, and by fostering multidisciplinary collaborations between clinical stakeholders, including patients, computer scientists, and medical researchers.

Sensors for​​​‌ analyzing human interactions.

We plan to keep using mainly RGB (i.e., Red, Green, and Blue) monocular cameras for video analysis. These off-the-shelf sensors are affordable and very precise, with a large dynamic range and high resolution. They are easily deployable in elderly homes and in hospitals. However, we also investigate new types of sensors (e.g., RGB-D, i.e., RGB colors and Depth, infrared cameras, physiological sensors, and microphones) to capture complementary information, depending on the use case. These new sensors can open up new avenues of research. As we do not want to disturb the everyday activities of the end-users, we proceed in three steps. First, we can train our models with a large variety of sensors in dedicated locations, such as laboratories. Second, we can distill the learned weights into lighter models trained only with RGB video streams. These lighter RGB models are more convenient and less intrusive, as they require only standard RGB cameras. Third, we can use only these lighter RGB models at run-time in embedded devices directly at the end-users' locations. Therefore, we only use the sensors and cameras pertinent to the end-users.
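The distillation step described above can be sketched with one common formulation, soft-label knowledge distillation, in which a lighter student model is trained to match the temperature-softened class probabilities of a larger teacher. This is an assumption made for illustration: the text does not specify which distillation objective is used, and the function names and the default temperature below are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature that softens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened probabilities, scaled by T^2.

    teacher_logits: outputs of the (hypothetical) multi-sensor model.
    student_logits: outputs of the (hypothetical) RGB-only model.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must score the same classes")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice it would typically be combined with a supervised loss on the ground-truth labels.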

2.3​​ Social interaction understanding: a​​​‌ challenging task

The major challenge in semantic interpretation of dynamic scenes is to bridge the gap between the task-dependent interpretation of data and the flood of measures provided by sensors. The problems we address range from physical object detection, activity understanding, and activity learning to vision system design and evaluation. The two principal classes of human activities we focus on are assistance to older adults and video analytics.

Typical examples of complex activity are shown in Figure 1 and Figure 2 for a homecare application (see the Toyota Smarthome Dataset). In this example, the monitoring of an older person's apartment could last several months. The activities involve interactions between the observed person and several pieces of equipment. The application goal is to recognize everyday activities at home through formal activity models and data captured by a network of sensors embedded in the apartment. Here, typical services include an objective assessment of the frailty level of the observed person, in order to provide more personalized care and to monitor the effectiveness of a prescribed therapy. The assessment of the frailty level is performed by an Activity Recognition System, which transmits a textual report (containing only meta-data) to the general practitioner who follows the older person. Thanks to the recognized activities, the quality of life of the observed people can thus be improved and their personal information can be preserved.

Figure 1: Homecare monitoring: the large diversity of activities collected in a three-room apartment. The image covers the dining room, kitchen, and living room (marked C1-C7), with pie charts of activity duration (short, medium, long) and temporal variability (low, medium, high), bar graphs of the frequency of activities and their respective environments, and a color-coded section showing where and how often each activity occurs.

Figure 2: Homecare monitoring: the annotation of a composed activity "Cook", captured by a video camera

The ultimate goal is for cognitive systems to perceive and understand their environment so that they can provide appropriate services to a potential user. An important step is to propose a computational representation of people's activities in order to adapt these services to them. Up to now, the most effective sensors have been video cameras, owing to the rich information they provide on the observed environment. However, these sensors are currently perceived as intrusive. A key issue is to capture the pertinent raw data for adapting the services to people while preserving their privacy. We study different solutions, including the local processing of data without transmission of images, and the use of new compact sensors developed for interaction (also called RGB-Depth sensors, an example being the Kinect) or of networks of small non-visual sensors.

2.4 International and​​​‌ Industrial Cooperation

Our work‌ has been applied in‌​‌ the context of more​​ than 10 European projects​​​‌ such as COFRIEND, ADVISOR,‌ SERKET, CARETAKER, VANAHEIM, SUPPORT,‌​‌ DEM@CARE, VICOMO, EIT Health.​​

We have or have had industrial collaborations in several domains: transportation (CCI Airport Toulouse Blagnac, SNCF, Inrets, Alstom, Ratp, Toyota, GTT (Italy)), banking (Crédit Agricole Bank Corporation, Eurotelis and Ciel), security (Thales R&T FR, Thales Security Syst, EADS, Sagem, Bertin, Alcatel, Keeneo), multimedia (Thales Communications), civil engineering (Centre Scientifique et Technique du Bâtiment (CSTB)), computer industry (BULL), software industry (AKKA), hardware industry (ST-Microelectronics) and health industry (Philips, Link Care Services, Vistek).

We have​‌ international cooperations with research​​ centers such as Reading​​​‌ University (UK), Idiap (Switzerland),​ Multitel (Belgium), National Cheng​‌ Kung University (Taiwan), National​​ Taiwan University (Taiwan), University​​​‌ of Southern California (USA),​ University of South Florida​‌ (USA), Michigan State University​​ (USA), Chinese Academy of​​​‌ Sciences (China), IIIT Delhi​ (India), Hochschule Darmstadt (Germany),​‌ Fraunhofer Institute for Computer​​ Graphics Research IGD (Germany).​​​‌

3 Research program

Our​ research objective is related​‌ to the recognition of​​ human actions, facial expressions,​​​‌ and body language in​ social interactions. Therefore we​‌ plan to work on​​ two main research axes:​​​‌

  • Axis 1:
    Human Interaction​ Recognition based on body​‌ and face analysis,
  • Axis​​ 2:
    Data Generation for​​​‌ Augmentation and Anonymization for​ solving data limitation and​‌ privacy issues.

3.1 Axis​​ 1: Human Interaction Recognition​​​‌

Participants: François Brémond,​ Michal Balazia, Antitza​‌ Dantcheva, Monique Thonnat​​.

3.1.1 Body Language​​​‌ Analysis

Participants: François Brémond​, Michal Balazia,​‌ Antitza Dantcheva, Monique​​ Thonnat.

Body language has been actively researched by psychologists for decades. Early work by Mehrabian found that, among other signals, backward leaning of the torso is indicative of liking. Prior research has shown that people believe power is expressed with nonverbal cues like open posture (i.e., no crossed arms or legs), more gesturing, and less self-touching (both hands and face). Displacement behaviors such as grooming, face touching, or fumbling are related to anxiety and stress regulation. As a consequence of these manifold connections of body language with important personal and social attributes, body language analysis has been a focus of automatic approaches attempting to infer high-level attributes such as emotion, leadership role, or personality type. In contrast to the human science studies discussed above, these automatic approaches commonly lack an explicit intermediate representation of functional bodily behavior categories. Instead, they rely on a generic feature representation encoding body postures and movements, or on deep learning approaches without a clearly interpretable internal structure. While such representations can be effective in prediction scenarios, they often lack interpretability and may miss subtle but meaningful differences, e.g., between fumbling and scratching.

Recognition of Actions and​​ Body Language.

RGB-based human action recognition has often been addressed by three main approaches. Two-stream 2D Convolutional Neural Networks (CNNs) generally contain two 2D CNN branches taking different input features extracted from the RGB videos for action recognition. Recurrent Neural Network (RNN) approaches usually employ 2D CNNs as feature extractors for an LSTM (i.e., Long Short-Term Memory) model. 3D CNN-based methods extend 2D CNNs to 3D structures, to simultaneously model the spatial and temporal context information in videos that is crucial for action recognition. For instance, one two-stream 2D CNN architecture divides each video into three segments and processes each segment with a two-stream network, fusing the individual classification scores by average pooling to produce the video-level prediction of the action class. Also, the two-stream Inflated 3D CNN (I3D) inflates the convolutional and pooling kernels of a 2D CNN with an additional temporal dimension to process a 3D block of pixels at once. The transformer, which was designed for natural language processing, has recently been extended to computer vision tasks to recognize human activities. In contrast to action recognition, which typically considers freely moving people, the limited work on body language recognition has addressed more constrained social interaction scenarios. We observe that the common denominator of body language analysis methods is the use of a general action recognition method without handling the specificities of body language, such as subtle motions or micro facial expressions.
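Two of the mechanisms mentioned above can be made concrete with a minimal sketch: averaging per-segment class scores into a video-level prediction, and inflating a 2D kernel along a temporal dimension. The function names are hypothetical. Dividing the repeated weights by the number of time steps is an assumption borrowed from the initialization commonly associated with I3D (so that a temporally constant input gives the same response as the 2D filter); it is not stated in the text above.

```python
from typing import List

Kernel2D = List[List[float]]


def average_consensus(segment_scores: List[List[float]]) -> List[float]:
    """Fuse per-segment class scores into video-level scores by averaging."""
    if not segment_scores:
        raise ValueError("at least one segment is required")
    n_segments = len(segment_scores)
    n_classes = len(segment_scores[0])
    return [sum(seg[c] for seg in segment_scores) / n_segments
            for c in range(n_classes)]


def inflate_kernel(kernel_2d: Kernel2D, time_steps: int) -> List[Kernel2D]:
    """Repeat a 2D kernel along a new temporal axis, rescaled by 1 / time_steps."""
    if time_steps < 1:
        raise ValueError("time_steps must be at least 1")
    return [[[w / time_steps for w in row] for row in kernel_2d]
            for _ in range(time_steps)]


def predict(segment_scores: List[List[float]]) -> int:
    """Index of the class with the highest video-level score."""
    fused = average_consensus(segment_scores)
    return max(range(len(fused)), key=fused.__getitem__)
```

With three segments scored over two classes, `predict` returns the class whose mean score across segments is highest, even if one segment individually favors the other class.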

To summarize, these body language analysis methods enable us to objectively measure the behavior of humans by recognizing their Activities of Daily Living (ADL), their emotions, eating habits, and lifestyle. Human behavior can be modeled by learning from a large amount of data, collected from a variety of sensors, to improve and optimize, for instance, the quality of life of people suffering from behavior disorders, such as anxiety or apathy. In previous work, STARS successfully detected the everyday life activities performed by an individual living alone at home, and we were able, for instance, to detect breakfast activities, such as "preparing coffee" and "cutting bread", with sufficient accuracy [19, 13, 15].

3.1.2 Face‌​‌ Analysis and Emotion Recognition​​

Participants: François Brémond,​​​‌ Michal Balazia, Antitza‌ Dantcheva.

An emotion is a mental state that arises spontaneously and is often accompanied by cognitive, physical, and physiological changes. Due to the complexity of human reactions, emotion recognition is still limited and remains the target of much scientific research. In fact, Emotion Recognition is a highly multidisciplinary field where psychology meets deep learning. Emotions are typically divided into basic categories, as theorized by Ekman, who identified basic discrete emotions. This categorization has since been extended to account for the interconnection between emotions and for multiple intensities.

Predicting emotions has been attempted via facial expression analysis in videos, which has been widely adopted both in research and in industry owing to its ease of use with just a camera. However, the accuracy of computer vision algorithms, such as those based on CNNs, is typically limited in identifying real emotions. Facial micro-expression recognition recently reported state-of-the-art performance when implemented with a transformer-based architecture. While the FaceReader system, launched in late 2005, is used worldwide in institutes and companies, there are still some limitations, such as image quality and facial angulation. Other main open challenges in the field are the small size of available datasets and subjective annotations. Typical datasets range from a few hundred to a few thousand videos, and the annotations are often noisy due to the complexity of human behavior. A person may be happy even if they are not smiling, and people differ widely in how expressive they are in showing their inner emotions. Emotion annotations are therefore very subjective, and this subjectivity needs to be adequately addressed. Moreover, emotions have multiple nuances, with different intensities.

Regarding emotional models, various architectures have been used, such as RNNs, LSTMs, and CNNs, with the aim of capturing the spatio-temporal information. In order to improve the recognition accuracy, multimodal transformers have been introduced, exploiting self- and cross-attention. Knowledge distillation from multimodal to unimodal (video) transformers has been reported, to reduce the acquisition complexity at inference time. The state of the art is achieved today with multimodal transformers, using video, audio, and language cues. Here, the video and the audio are processed by small transformer encoders receiving as input features from models pre-trained on other datasets. The model extracting features is frozen and therefore cannot be adapted to a new targeted dataset. For the video transformer, the inputs are fixed representations, such as DLN features, IResNet and DenseNet features, Facet/Openface features, R(2+1)D-152 features, and landmarks and action units. Such feature extractors and shallower encoders are typically used when small datasets are targeted. The main limitations of this approach are twofold: first, frozen representations are less appropriate for raw data than end-to-end trainable models; second, smaller models are less accurate for recognizing specific expressions. In order to use raw data and bigger encoders, proper pre-training is needed to limit overfitting. While self-supervised techniques, such as VideoMAE, can be used for that purpose, they may miss the fine details necessary to recognize facial micro-expressions. They are therefore not well adapted to the emotion recognition task.

3.1.3 Multimodal Recognition​​​‌ of Human Interactions

Participants:​ François Brémond, Michal​‌ Balazia, Monique Thonnat​​.

Behavior traits can be detected in self-presentation videos based on acoustic and visual non-verbal features such as pitch, intensity, movement, head orientation, posture, fidgeting, and eye gaze. According to [1, 2], modalities such as audiovisual, text, and demographic features are important for personality prediction. Emotion recognition has generated specific approaches for multimodal data processing. Deep bimodal models give state-of-the-art results on Multimodal Language Analysis in the Wild. It has been shown that body gestures, head movements, expressions, and speech lead to an effective diagnosis of apathy. Few models have dealt with trimodal fusion of features. Although multimodal approaches are commonly used to recognize personality traits, there is no comprehensive method to optimize and combine the considerable amount of informative features. All modality features may be concatenated together for behavior prediction; this approach is referred to as early fusion. However, most multimodal approaches perform late fusion on heterogeneous data, as it outperforms other techniques. Present research in the field aims to find efficient ways to extract and combine features. We aim to design new approaches able to utilize all available information in an optimal manner [3]. The objective is to develop and test Human Behavior Coding algorithms using RGB video cameras at test time [13, 1], while using multiple modalities at training time, drawn from multiple datasets, to better characterize human behavior during interactions. As it is challenging to be an expert in all modalities, we will rely on open-source code (when available) or on our partners (when needed) to obtain the most effective backbone models for extracting multi-modal features.
For instance, we are collaborating with DFKI (i.e., Deutsches Forschungszentrum für Künstliche Intelligenz) [24] to extract audio and text features for measuring neuropsychiatric symptoms in patients with early cognitive decline. For electrophysiological signals, we are working with the Biorobotic Institute - Scuola Superiore Sant'Anna (Pontedera, Italy) [21] to compute more objective measurements of emotion.
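The difference between early and late fusion discussed above can be illustrated with a minimal sketch. Early fusion concatenates the feature vectors of all modalities before a single predictor sees them, whereas late fusion combines the class scores that separate per-modality predictors have already produced. The function names, the list-based data layout, and the weighted-average rule used for late fusion are assumptions made for this example; other combination rules are possible.

```python
from typing import List, Optional


def early_fusion(features_per_modality: List[List[float]]) -> List[float]:
    """Concatenate per-modality feature vectors into one joint vector."""
    fused: List[float] = []
    for features in features_per_modality:
        fused.extend(features)
    return fused


def late_fusion(scores_per_modality: List[List[float]],
                weights: Optional[List[float]] = None) -> List[float]:
    """Combine per-modality class scores by a (weighted) average."""
    if not scores_per_modality:
        raise ValueError("at least one modality is required")
    if weights is None:
        weights = [1.0] * len(scores_per_modality)
    if len(weights) != len(scores_per_modality):
        raise ValueError("one weight per modality is required")
    total = sum(weights)
    n_classes = len(scores_per_modality[0])
    return [sum(w * s[c] for w, s in zip(weights, scores_per_modality)) / total
            for c in range(n_classes)]
```

In this sketch, `early_fusion` would feed a single downstream model, while `late_fusion` lets each modality keep its own model and only merges their decisions, which makes it easier to handle heterogeneous data.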

3.2​​ Axis 2: Data Generation​​​‌ for Augmentation and Anonymization‌

Participants: Antitza Dantcheva,‌​‌ François Brémond.

3.2.1​​ Data Generation

Participants: Antitza​​​‌ Dantcheva, François Brémond‌.

In the past‌​‌ decade, computer vision has​​ witnessed remarkable progress fueled​​​‌ by the triptych of‌ (a) algorithms for training‌​‌ computer vision models (e.g.,​​ backpropagation), (b) increased computational​​​‌ power (think of powerful‌ graphical processing units (GPUs)),‌​‌ but very importantly by​​ (c) increased volumes of​​​‌ training data. For‌ example, millions of facial‌​‌ images (i.e., MegaFace) have​​ rapidly driven progress in​​​‌ face recognition, showcasing that‌ better models are empowered‌​‌ by bigger data. Even​​ in the occasional abundance​​​‌ of raw data, there‌ is a plethora of‌​‌ remaining challenges in designing​​ data-driven intelligence approaches such​​​‌ as deep neural networks‌ (DNNs). These challenges stem‌​‌ from the fact that​​ data must be processed;​​​‌ for example, data must‌ be annotated (e.g., annotation‌​‌ of facial expressions in​​ facial videos), in order​​​‌ to optimize the millions‌ of network-parameters. To make‌​‌ things worse, the curation​​ of large datasets is​​​‌ tedious, costly, time-consuming and‌ is fundamentally bounded by‌​‌ the population sizes of​​ such data, as well​​​‌ as by the ever-increasing‌ privacy and usage considerations‌​‌ that have been recently​​ highlighted by the General​​​‌ Data Protection Regulation (GDPR).‌ The resulting real data‌​‌ and associated real-life datasets​​ are scarce, private, and​​​‌ they inherit human biases.‌ As such, these limitations‌​‌ threaten to bring any​​ advances in computer vision​​​‌ to a dramatic halt.‌ Therefore, we are now‌​‌ at a point, where​​ the availability of annotated​​​‌ data is the main‌ bottleneck in the development‌​‌ of data-hungry DNN models;​​ a bottleneck that far​​​‌ exceeds any algorithmic or‌ computational bottleneck. 
Based on the premise that data-driven computer vision is heavily influenced by the underlying data, we seek to understand how one can create data that will augment the learning space and the learning capabilities of computer vision models. Generated (synthetic) data provides a promising solution to the above challenges, as it is easier to obtain, inexhaustible, pre-annotated, and less expensive. In addition, synthetic data has the potential to avoid ethical and privacy concerns, as well as practical issues related to security. Further, synthetic data brings unique opportunities, allowing for the surgical injection of training data in scenarios where collecting real data may be impractical or impossible (e.g., talking dogs or faces that do not exist). Indeed, synthetic data allows for new training paradigms in computer vision models. We will design methods that allow synthetic data to be generated dynamically, directly as a function of the needs of learning algorithms.

Past attempts for​​ synthetic images and videos.​​​‌

Generative models of images have received unprecedented attention in computer vision, owing to recent breakthroughs in the underlying modeling methodology. The most powerful models today are built on generative adversarial networks (GANs), autoregressive transformers, and, most recently, diffusion models. Diffusion models (DMs) are neural networks trained to denoise images that have been successively corrupted with Gaussian noise, thereby learning to reverse this diffusion process. After training, such a model can generate data by passing randomly sampled noise through the learned denoising process. This synthesis procedure can be interpreted as an optimization algorithm that follows the gradient of the data density to produce likely samples. During denoising, conditioning information such as class labels can be supplied to the network to specialize its sampling process. Such DMs outperform previous generative methods, as they offer robust, stable, and scalable training procedures. DMs are largely unaffected by training limitations such as overfitting or the mode collapse observed in GANs. In addition, DMs generally involve fewer parameters than transformer-based counterparts, which typically require massive amounts of data and thus experience a performance plateau. As diverse synthetic data is a primary need for computer vision, DMs have been rapidly adopted in several settings such as image and video generation, image deblurring, high-resolution image generation, and image editing.
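
To make the noising and denoising idea concrete, the following is a minimal sketch of the forward corruption step and its algebraic inversion. It assumes a DDPM-style linear noise schedule, which the text above does not specify; the function names, schedule values, and the toy four-value signal are hypothetical, and the true noise stands in for the prediction that a trained network would supply.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise


def estimate_clean(xt, predicted_noise, alpha_bar):
    """Invert the forward blend, given an estimate of the injected noise."""
    return [(x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for x, n in zip(xt, predicted_noise)]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -1.0, 0.25, 0.0]            # toy "image" with four pixels
xt, noise = add_noise(x0, alpha_bars[500], rng)
# A trained denoiser would predict `noise` from `xt`; here the true noise is
# reused to show that a perfect prediction recovers the clean signal.
recovered = estimate_clean(xt, noise, alpha_bars[500])
```

In an actual diffusion model the inversion is applied step by step, starting from pure noise, with a neural network providing the noise estimate at each step.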

Challenges in​‌ video generation.

However, while the image domain has seen great progress, video has proven to be more challenging due to (i) the significant computational cost of training on video data and (ii) the lack of large-scale, general, and publicly available video datasets. Regarding the computational challenge (i), training current state-of-the-art image generation models is already extremely expensive, which makes it exceedingly hard to generate videos, particularly videos of variable length. Regarding the data challenge (ii), while image generation benefits from datasets with billions of images, video datasets are much smaller (e.g., the VoxCeleb2 dataset with about 1M videos) and thus cannot support the higher complexity of open-domain videos.

Limited settings of generated videos.

Very recently, video generation methods such as the DM-based Imagen Video and Make-a-Video showcased the stunning potential of generative AI. However, to date, the generated videos remain heavily constrained in quality, resolution, and length, mainly because their video encoders only encode fixed-size videos or encode frames independently. Such video generation methods are further limited as they currently produce results depicting only single persons performing simple motions in highly constrained settings with a mostly neutral background. A crucial goal of our effort will be to generate videos that encompass complex settings with multiple subjects who are able to interact in front of a non-uniform background.

Control.

While we are already beginning to understand some properties of DMs (for example, that DMs are superior to GANs in terms of reconstruction and encoding), understanding the limits of control of such models is still in its infancy. In an effort to control generated images, recent works explored the discovery of semantically meaningful directions in the latent space of pre-trained GANs, where linear navigation corresponds to the desired manipulation of images. In this context, supervised as well as unsupervised approaches were proposed to edit semantics such as facial attributes, colors, and basic visual transformations (e.g., rotation and zooming) in generated or inverted real images. The recent introduction of Latent Diffusion Models (LDMs) is a positive development in this direction, as LDMs reduce the heavy computational burden of training on high-resolution images. In addition, our own work revealed, in the context of autoencoder generation models, how to disentangle motion and appearance in videos, as well as how to manipulate decomposed, semantically meaningful motion directions. However, in the context of LDMs, disentanglement and manipulation of semantic attributes remain a key open research challenge of substantial potential impact, and these are challenges that we will explore.
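
The notion of linear navigation in a latent space can be illustrated with a few lines of code. This is only a sketch: the three-dimensional latent code and the "zoom" direction are hypothetical placeholders, and in practice the edited codes would be decoded by a pre-trained generator, which is not shown.

```python
import math


def unit(direction):
    """Normalize a direction vector to unit length."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [d / norm for d in direction]


def navigate(latent, direction, alpha):
    """Move a latent code by `alpha` along a (normalized) semantic direction."""
    u = unit(direction)
    return [z + alpha * d for z, d in zip(latent, u)]


latent_code = [0.0, 0.0, 1.0]        # hypothetical latent code
zoom_direction = [3.0, 4.0, 0.0]     # hypothetical discovered direction
# Sweeping alpha yields a family of edits; alpha = 0 leaves the code unchanged.
edited = [navigate(latent_code, zoom_direction, a) for a in (-5.0, 0.0, 5.0)]
```

The step size alpha controls the strength of the manipulation, and its sign controls the sense (for example, zooming in versus zooming out).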

3.2.2​​​‌ Data Augmentation and Anonymization‌

Participants: Antitza Dantcheva,‌​‌ François Brémond.

We aim to apply the data generation models proposed in the previous section in two application domains, namely data augmentation and data anonymization, which cater to the needs of Axis 1 (Human Interaction Recognition).

Data augmentation.

Data-driven computer vision algorithms focus on the automated extraction of patterns by finding complex data representations in large volumes of input data without human intervention, and on using these patterns to detect or classify unseen data. The powerful twist that we envision is that data generation gives full control over the distribution of the generated data, thus giving us the ability to ensure quality and diversity while saving cost and mitigating bias. As a consequence, we foresee that such synthetic data will allow for nothing less than a paradigm shift in training. For example, inspired by human systems, synthetic data will bring continual, multimodal, interactive, embodied learning to the next level, providing richer and more sophisticated representations. This contributes directly to the grand goal of allowing computer vision to approach human-level intelligence, a long-term goal that will require grasping key concepts related to the physical world and its composition, as well as an undiminished ability to learn continually, interactively, and multimodally 23. We aim to identify entirely new perception models and related learning paradigms that exploit synthetic data in a new, efficient, and dynamic manner. We consider such models for a variety of recognition settings that can target a broad spectrum of facial behaviors, including expressions and micro-expressions. By exploring the fundamental properties of learning with synthetic data, we anticipate computer vision models that generalize to a large class of human actions.

Data anonymization.

Privacy-preserving data processing has received increased attention in recent years, with the main challenge being to anonymize data while maintaining image quality. The General Data Protection Regulation (GDPR) came into effect on 25 May 2018, affecting all processing of personal data across Europe. GDPR requires regular consent from the individual for any use of their personal data. However, if the data does not allow an individual to be identified, companies are free to use the data without consent. To effectively anonymize images, we require a robust model that replaces the original face without destroying the existing data distribution; that is, the output should be a realistic face fitting the given situation.

Anonymizing images while retaining the original distribution is challenging, as it entails removing all privacy-sensitive information and generating a highly realistic face, while providing a seamless transition between original and anonymized parts. This requires a model that can perform complex semantic reasoning to generate a new anonymized face. For practical use, we want the model to handle a broad diversity of images, poses, backgrounds, and persons. Our proposed solution can successfully anonymize images in a large variety of cases and create realistic faces consistent with the given conditional information.

4 Application​​ domains

Video understanding consists of a complex pipeline made of various tasks, such as object detection, people tracking, pose estimation, and event detection. Many of these tasks are generic and can be shared between different application domains. The behavior analysis techniques we develop for other applications (for instance, for the sport or security domains) can be applied to medical applications and vice versa.

4.1 Medical Applications

Our main motivation, as explained before, is to help clinicians diagnose, monitor, and provide pertinent treatment to patients with behavior disorders. The applications we target are not general medical diseases but those related to the brain, and more precisely to psychiatric disorders. These disorders can appear very early in the life of the patient (for instance, autism spectrum disorder 4), they can concern adults (depression, bipolar disorder, schizophrenia 25) or the elderly (for instance, Alzheimer's disease). We have been working with elderly patients since the creation of the CoBTek joint team in January 2012. More recently, we have extended our study to the two other age categories. We now have clinical trials within these three categories of patients.

4.2​‌ Other Applications

Sport applications: Sport is an interesting application domain for human activity understanding for three reasons. First, data are often publicly available, and thus raise fewer ethical concerns than medical data. Moreover, much data has been recorded and annotated as part of international challenges (see Website Challenges). Second, human activities are complex at the level of individuals, of a team, and over time. Third, many companies are interested in funding research to advance the field of human activity understanding for sport. For instance, we have a collaboration with a local company, Fairvision (see Fairvision website on football games).

Security applications: The interest and investment in vision-based security systems are large and rapidly growing, fueled by applications ranging from autonomous vehicles to personalization of customer service. Accordingly, numerous companies, military and public organizations are interested in research in this context.

4.3 Ethical‌ and Acceptability Issues

The development and ultimate use of novel assistive technologies by a vulnerable user group such as individuals with dementia, and the assessment methodologies planned by STARS, are not free of ethical or even legal concerns, even if many studies have shown how these Information and Communication Technologies (ICT) can be useful and well accepted by older people with or without impairments. Thus, one goal of the STARS team is to design the right technologies that can provide the appropriate information to the medical carers while preserving people's privacy. Moreover, STARS pays particular attention to ethical, acceptability, legal, and privacy concerns that may arise, addressing them in a professional way following the corresponding established EU and national laws and regulations, especially when outside France. STARS can also benefit from the support of the COERLE (Comité Opérationnel d'Evaluation des Risques Légaux et Ethiques) to help it respect ethical policies in its applications.

As presented in Section 2, STARS aims at designing cognitive vision systems with perceptual capabilities to efficiently monitor people's activities. Vision sensors can be seen as intrusive, even if no images are acquired or transmitted (only metadata describing activities needs to be collected). Therefore, new communication paradigms and other sensors (e.g., accelerometers, RFID (Radio Frequency Identification), and new sensors to come in the future) are also envisaged to provide the most appropriate services to the observed people, while preserving their privacy. To better understand ethical issues, STARS members are already involved in several ethical organizations.

To address acceptability issues, focus groups and HMI (Human Machine Interaction) experts are consulted on the most adequate range of mechanisms to interact with and display information to older people.

5 Social and‌ environmental responsibility

5.1 Footprint‌​‌ of research activities

We have limited our travel by reducing our physical participation in conferences and international collaborations.

5.2​​ Impact of research results​​​‌

We have been involved for many years in promoting public transportation by improving safety on board and in stations. Moreover, we have been working on pedestrian detection for self-driving cars, which will also help reduce the number of individual cars.

6​​ Highlights of the year​​​‌

6.1 Awards

  • Antitza Dantcheva​ was appointed 3IA chair.​‌
  • Monique Thonnat has been nominated Coordinatrice Alpes Maritimes for the Fondation Un Avenir Ensemble (FUAE) of the Grande Chancellerie de la Légion d'Honneur (Website Fondation). The objective is to promote social mobility by offering recipients of national honors the opportunity to mentor deserving and motivated students from high school to higher education and entry into working life.

6.2​ Major results

  • A first line of work consisted of releasing novel tracking algorithms that can reliably track people through a video stream. These algorithms combine bounding-box detection with pixel masks to significantly improve tracking quality and to track people on a long-term basis.
  • During this period, several novel activity recognition algorithms have also been designed for Activities of Daily Living (ADLs) in real-world settings. These algorithms achieved the best performance on all relevant action datasets. Previously, these algorithms were built in more or less supervised settings. We have therefore proposed new algorithms for action detection in a weakly supervised setting with only video-level labels. These algorithms can reliably detect specific events with their time of occurrence within untrimmed videos.
  • We have also​ improved the quality and​‌ the capacity of action​​ recognition algorithms by processing​​​‌ long videos with a​ duration of more than​‌ 10 minutes. For that,​​ we have designed new​​​‌ adapters that can be​ plugged into strong video​‌ backbones and thus necessitate​​ only retraining the adapters,​​​‌ which reduces the training​ time and enables a​‌ training process with videos​​ of a much longer​​​‌ duration.
  • We have also designed novel algorithms for video action anticipation that can detect possible events after observing only a limited number of normal video streams.
  • All these algorithms have​​​‌ been successfully evaluated on​ the main international benchmarks​‌ and also on video​​ datasets depicting patients with​​​‌ cognitive disorders in order​ to help doctors to​‌ better monitor their patients.​​

7 Latest software developments,​​​‌ platforms, open data

7.1​ Open data

We have​‌ provided two benchmark datasets.​​

StressID Dataset: a Multimodal Dataset for Stress Identification
  • Contributors:
    Hava Chaptoukaev, Valeriya Strizhkova, Michele Panariello, Bianca Dalpaos, Aglind Reka, Valeria Manera, Susanne Thummler, Esma Ismailova, Nicholas Evans, François Brémond, Massimiliano Todisco, Maria A Zuluaga, Laura M Ferrari.
  • Description:​​​‌
    It contains RGB facial video, audio, and physiological signals (ECG, EDA, respiration). Different stress-inducing stimuli are used: emotional video clips, cognitive tasks, and public speaking. The total dataset consists of recordings from 65 participants who performed 11 tasks. Each task is labeled by the subjects in terms of stress, relaxation, arousal, and valence. The experimental set-up ensures synchronized, high-quality, and low-noise data.
  • Dataset PID​​​‌ (DOI,...):
  • Project​ link:
  • Publications:
    StressID: a Multimodal Dataset for Stress Identification, Thirty-seventh Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2023 7.
  • Contact:
    stressid.dataset@inria.fr
  • Release​​ contributions:
    The Dataset is​​​‌ licensed for non-commercial scientific‌ research purposes.
Toyota Smarthome‌​‌ Datasets: Real-World Activities of​​ Daily Living.
  • Contributors:
    Rui Dai, Srijan Das, Saurab Sharma, Luca Minciullo, Lorenzo Garattoni, François Brémond, Gianpiero Francesca.
  • Description:
    Smarthome has been recorded in an apartment equipped with 7 Kinect v1 cameras. It contains the common daily living activities of 18 subjects. The subjects are senior people aged 60 to 80 years. The dataset has a resolution of 640×480 and offers 3 modalities: RGB, depth, and 3D skeleton. The 3D skeleton joints were extracted from RGB. For privacy reasons, the subjects' faces are blurred. Currently, two versions of the dataset are provided: Toyota Smarthome Trimmed and Toyota Smarthome Untrimmed.
  • Dataset PID​​ (DOI,...):
    10.1109/TPAMI.2022.3169976
  • Project link:​​​‌
  • Publications:‌
    Toyota Smarthome Untrimmed: Real-World‌​‌ Untrimmed Videos for Activity​​ Detection, PAMI 2022 18​​​‌.
  • Contact:
    toyotasmarthome@inria.fr
  • Release‌ contributions:
    The Dataset is‌​‌ licensed for non-commercial scientific​​ research purposes.

8 New​​​‌ results

This year Stars‌ has proposed new results‌​‌ related to its two​​ main research axes: (i)​​​‌ Human Interaction Recognition and‌ (ii) Data Generation for‌​‌ Augmentation and Anonymization.

Human​​ Interaction Recognition

Participants: François​​​‌ Brémond, Antitza Dantcheva‌, Michal Balazia,‌​‌ Monique Thonnat, Baptiste​​ Chopin, Di Yang​​​‌, Abid Ali,‌ Olivier Huynh, Tomasz‌​‌ Stanczyk, Sanya Sinha​​, Mohammed Guermal,​​​‌ Tanay Agrawal, Snehashis‌ Majhi, Aglind Reka‌​‌.

The new results​​ for Human Interaction Recognition​​​‌ are:

  • No Train Yet‌ Gain: Towards Generic Multi-Object‌​‌ Tracking in Sports and​​ Beyond (see 8.1)​​​‌
  • Does Re-ID Really Help‌ in Multi-Object Tracking? (see‌​‌ 8.2)
  • CM3T: Framework​​ for Efficient Multimodal Learning​​​‌ for Inhomogeneous Interaction Datasets‌ (see 8.3)
  • Are‌​‌ Attention Maps Richer than​​ we Imagined for Action​​​‌ Recognition? (see 8.4)‌
  • Scaling Action Detection: AdaTAD++‌​‌ with Transformer-Enhanced Temporal-Spatial Adaptation​​ (see 8.5)
  • SKI​​​‌ Models: SKeleton Induced Vision-Language‌ Embeddings for Understanding Activities‌​‌ of Daily Living (see​​ 8.6)
  • LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living (see 8.7)
  • Human-Centric Video Understanding:​​​‌ From Single-Modality to Multi-Modal‌ Learning (see 8.8)‌​‌
  • B-MoE: A Body-Part-Aware Mixture-of-Experts​​ “All Parts Matter” Approach​​​‌ to Micro-Action Recognition (see‌ 8.9)
  • Loose Social-Interaction‌​‌ Recognition in Real-world Therapy​​ Scenarios (see 8.10)​​​‌
  • Just Dance with π!,‌ A Poly-modal Inductor for‌​‌ Weakly-supervised Video Anomaly Detection​​ (see 8.11)
  • Mixture​​​‌ of Experts Guided by‌ Gaussian Splatters Matters: A‌​‌ new Approach to Weakly-Supervised​​ Video Anomaly Detection (see​​​‌ 8.12)
  • Denoise, Divide, Distill, and Predict (𝒟³𝒫): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy (see 8.13)
  • Not All‌​‌ Blends Are Equal: The​​ BLEMORE Dataset of Blended​​​‌ Emotion Expressions with Relative‌ Salience Annotations (see 8.14‌​‌)
  • The INEMO Dataset:​​ A Multimodal Benchmark of​​​‌ Physiological and Behavioral Responses‌ to Social Media and‌​‌ Film Stimuli (see 8.15​​​‌)
  • EEG Classification with​ Limited Data: A Deep​‌ Clustering Approach. (see 8.16​​)
  • MEPHESTO: Multimodal Phenotyping​​​‌ of Psychiatric Disorders from​ Social Interaction (see 8.17​‌)
  • MultiMediate'25: Cross-Cultural Multi-domain​​ Engagement Estimation (see 8.18​​​‌)
  • Stress Estimation in​ Dancers for Injury Prevention​‌ (see 8.19)
  • Emotion​​ Recognition using Deep Learning​​​‌ (see 8.20)
  • Identifying​ Surgical Instruments in Pedagogical​‌ Cataract Surgery Videos through​​ an Optimized Aggregation Network​​​‌ (see 8.21)
  • TBDM:​ Temporal Boundary Distillation Module​‌ for Surgical Gesture Segmentation​​ (see 8.22)
  • Effective​​​‌ Video Feature Extraction for​ Training and Comprehension: Human-Centered​‌ Multimodal Video (see 8.23​​)
Data Generation for​​​‌ Augmentation and Anonymization

Participants:​ François Brémond, Antitza​‌ Dantcheva, Baptiste Chopin​​, Nabyl Quignon,​​​‌ Charbel Yahchouchi, Anil​ Egin, Michal Balazia​‌, Di Yang,​​ Valeriya Strizhkova.

The​​​‌ new results for Data​ Generation for Augmentation and​‌ Anonymization are:

  • Rotation-Induced Centroid​​ Shift in Latent Space​​​‌ (see 8.24)
  • Dual​ Volume Skeleton-Guided 3D Face​‌ Reconstruction from Sparse Views​​ (see 8.25)
  • Turbo​​​‌ Learning: 3D Face Reconstruction​ with Mesh Re-Projection and​‌ Re-Identification Consistency (see 8.26​​)
  • THEval. Evaluation Framework​​​‌ for Talking Head Video​ Generation (see 8.27)​‌
  • Beyond Real versus Fake​​ Towards Intent-Aware Video Analysis​​​‌ (see 8.28)
  • AI​ killed the video star.​‌ Audio-driven diffusion model for​​ expressive talking head generation​​​‌ (see 8.29)
  • LIA-X:​ Interpretable Latent Portrait Animator​‌ (see 8.30)
  • Simplicity-Bias-Aware​​ Adaptation of Foundation Models​​​‌ for Deepfake Detection (see​ 8.31)
  • Now You​‌ See Me, Now You​​ Don't: A Unified Framework​​​‌ for Expression Consistent Anonymization​ in Talking Head Videos​‌ (see 8.32)
  • Beyond​​ the visible: A survey​​​‌ on cross-spectral face recognition​ (see 8.33)

8.1​‌ No Train Yet Gain:​​ Towards Generic Multi-Object Tracking​​​‌ in Sports and Beyond​

Participants: Tomasz Stanczyk, Seongro Yoon, François Brémond.

We proposed​​​‌ McByte 46, a​ novel tracking-by-detection framework that​‌ enhances multi-object tracking (MOT)​​ by integrating temporally propagated​​​‌ segmentation masks as an​ additional association cue. The​‌ key objective was to​​ improve robustness and generalization​​​‌ in challenging sports scenarios​ - characterized by fast​‌ motion, occlusions, blur, and​​ camera shifts - without​​​‌ requiring any training or​ per-sequence parameter tuning.

Starting​‌ from a strong ByteTrack-based​​ baseline, we designed a​​​‌ pipeline that combines Kalman​ filter motion prediction, IoU-based​‌ matching, and a pre-trained​​ mask temporal propagation model.​​​‌ The propagated masks are​ not used blindly; instead,​‌ we introduced regulated policies​​ that activate mask-based guidance​​​‌ only in well-defined situations​ - namely ambiguity (multiple​‌ plausible associations) and isolation​​ (failure of IoU-based matching).​​​‌ This controlled fusion ensures​ that the mask cue​‌ strengthens association decisions while​​ avoiding instability caused by​​​‌ unreliable mask predictions.

Figure​ 3 illustrates the full​‌ tracking pipeline, showing how​​ bounding-box predictions, detections, and​​​‌ temporally propagated masks are​ jointly integrated into a​‌ unified association cost matrix​​ solved via Hungarian matching.​​​‌ This design allows McByte​ to preserve the strengths​‌ of tracking-by-detection while benefiting​​ from the spatial coherence​​​‌ provided by mask propagation.​
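
As a rough illustration of the association step described above, the sketch below builds a cost matrix from box IoU, applies a mask cue only to ambiguous or isolated tracklets, and solves the assignment. It is not the McByte implementation: the subtractive fusion rule, the `plausible_iou` threshold, and the mask-overlap values are hypothetical, and an exhaustive search stands in for the Hungarian algorithm (it returns the same optimum on tiny inputs, provided there are at least as many detections as tracklets).

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_cost(track_boxes, det_boxes, mask_overlap, plausible_iou=0.3):
    """Cost = 1 - IoU, with the mask cue gated by ambiguity or isolation."""
    cost = []
    for i, track in enumerate(track_boxes):
        ious = [iou(track, det) for det in det_boxes]
        plausible = sum(1 for v in ious if v >= plausible_iou)
        # Ambiguity: several plausible detections. Isolation: none at all.
        use_mask = plausible != 1
        cost.append([1.0 - v - (mask_overlap[i][j] if use_mask else 0.0)
                     for j, v in enumerate(ious)])
    return cost


def assign(cost):
    """Minimum-cost one-to-one assignment by exhaustive search."""
    num_tracks, num_dets = len(cost), len(cost[0])
    best_total, best = float("inf"), None
    for perm in permutations(range(num_dets), num_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best = total, list(perm)
    return best


# Two crossing players: box IoU alone cannot separate them (identical costs).
tracks = [(0, 0, 10, 10), (4, 0, 14, 10)]       # predicted tracklet boxes
dets = [(2, 0, 12, 10), (2, 1, 12, 11)]         # current detections
mask_overlap = [[0.1, 0.8], [0.9, 0.2]]         # hypothetical mask/detection overlap
matches = assign(build_cost(tracks, dets, mask_overlap))   # matches[i] = detection index
```

In this toy case both tracklets are ambiguous, so the mask cue is activated and decides the assignment; a tracklet with exactly one plausible detection would be matched on IoU alone.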

The image illustrates a​‌ tracking process in video​​ analysis. It begins with​​ tracklet boxes from the​​​‌ previous frame (t-1) processed‌ through a Kalman filter‌​‌ to predict tracklet boxes​​ for the current frame​​​‌ (t). In parallel, an‌ object detector identifies detection‌​‌ boxes in the current​​ frame. These predictions and​​​‌ detections are matched using‌ Hungarian matching assignment to‌​‌ update tracklets. Masks from​​ frame t-1 are propagated​​​‌ temporally and matched to‌ detections using IoU-based and‌​‌ mask-enhanced matching to ensure​​ accurate tracking of objects​​​‌ across frames, shown in‌ a basketball scene with‌​‌ two players. (Description generated​​ at January 19th, 2026​​​‌ by Albert AI with‌ the model Mistral-Small-3.2-24B)

Figure‌​‌ 3: The overview​​ of the McByte tracking​​​‌ pipeline.

We conducted extensive ablation studies to analyze the impact of each design choice, demonstrating that uncontrolled use of masks can degrade performance, whereas carefully gated mask usage yields consistent gains. Qualitative results further show McByte's ability to maintain identities through heavy occlusions and motion blur. In particular, Figure 4 highlights challenging football scenarios where McByte successfully preserves tracklets that baseline methods fail to maintain due to abrupt camera motion and degraded visual quality.

We evaluated​​ McByte on four diverse​​​‌ datasets - SportsMOT, DanceTrack,‌ SoccerNet-tracking 2022, and MOT17‌​‌ - using standard MOT​​ metrics (HOTA, IDF1, MOTA).​​​‌ Across all benchmarks, McByte‌ consistently outperformed strong tracking-by-detection‌​‌ baselines, especially in sports​​ datasets, while remaining competitive​​​‌ on pedestrian tracking. Importantly,‌ these improvements were achieved‌​‌ without training, dataset-specific tuning,​​ or additional annotations, demonstrating​​​‌ the method’s generality and‌ practical value.
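
For readers unfamiliar with the metrics, the sketch below computes MOTA and IDF1 from their standard definitions (HOTA is more involved and is omitted). The counts in the example are made up for illustration and are not results from this work.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, where GT counts ground-truth boxes."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), the F1 score over identity matches."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Illustrative counts only.
example_mota = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
```

MOTA is dominated by detection errors, whereas IDF1 rewards keeping the same identity over time, which is why both are usually reported together.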

Overall, this‌​‌ work introduces a generic,​​ training-free MOT framework that​​​‌ bridges the gap between‌ detection-based and mask-based tracking,‌​‌ offering a robust solution​​ applicable across sports and​​​‌ non-sports domains.

The image depicts a‌​‌ soccer match in the​​ Barclays Premier League between​​​‌ Liverpool (LIV) and Manchester‌ United (MU) with a‌​‌ score of 1-2. Liverpool​​ is playing with 10​​​‌ men due to a‌ red card to Gerrard‌​‌ at the 46th minute.​​ The time shown is​​​‌ 86:15. The scene focuses‌ on the goal area,‌​‌ where the goalkeeper and​​ other players are positioned.​​​‌ Several players are highlighted‌ with numbers such as‌​‌ 160, 131, 112, 106,​​ 132, 177, and 144.​​​‌ Arrows point to the‌ goalkeeper and two other‌​‌ players near the goalpost,​​ indicating their positions and​​​‌ possible actions. (Description generated‌ at January 19th, 2026‌​‌ by Albert AI with​​ the model Mistral-Small-3.2-24B)

Figure 4: Example comparison with the baseline in a challenging football setting. McByte maintains the tracklets of players blurred by abrupt camera movement (indicated by yellow arrows).

8.2 Does​​​‌ Re-ID Really Help in‌ Multi-Object Tracking?

Participants: Tomasz Stanczyk, François Brémond.

We conducted a​​​‌ systematic and critical analysis‌ of the role of‌​‌ person re-identification (re-ID) in​​ multi-object tracking (MOT) 49​​​‌. While re-ID is‌ widely assumed to improve‌​‌ association quality, its actual​​ contribution in practical tracking​​​‌ pipelines remains unclear. Our‌ goal was to rigorously‌​‌ evaluate when, how, and​​ to what extent re-ID​​​‌ genuinely benefits MOT performance.‌

We focused our study‌​‌ on the widely used​​​‌ BoT-SORT tracking framework and​ evaluated multiple re-ID configurations,​‌ including re-ID trained on​​ the target dataset, re-ID​​​‌ trained on external datasets,​ and a strong generic​‌ re-ID model. Experiments were​​ conducted on the MOT17​​​‌ validation set, using both​ ground-truth detections and realistic​‌ detector outputs to disentangle​​ the effects of detection​​​‌ quality from appearance-based association.​

Beyond standard tracking evaluations,​‌ we introduced a custom​​ re-ID assessment protocol tailored​​​‌ to tracking. This protocol​ directly measures correct and​‌ incorrect inter-frame matches produced​​ by re-ID, enabling a​​​‌ deeper understanding of re-ID​ behavior in realistic tracking​‌ scenarios. We analyzed cosine​​ distance distributions, match accuracy,​​​‌ and failure modes across​ sequences with varying crowd​‌ density, occlusion patterns, and​​ bounding-box sizes.
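
One simple way to count correct and incorrect inter-frame matches is sketched below. It is a hypothetical simplification, not the protocol used in the study: each current detection is matched to its nearest previous-frame embedding by cosine distance, the `max_distance` threshold is arbitrary, and the two-dimensional embeddings are toy values.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def count_matches(prev, curr, max_distance=0.3):
    """prev/curr: lists of (identity, embedding). Returns (correct, incorrect, unmatched)."""
    correct = incorrect = unmatched = 0
    for identity, emb in curr:
        best_id, best_d = None, max_distance
        for prev_id, prev_emb in prev:
            d = cosine_distance(emb, prev_emb)
            if d <= best_d:
                best_id, best_d = prev_id, d
        if best_id is None:
            unmatched += 1
        elif best_id == identity:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, unmatched


prev_frame = [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]
curr_frame = [("a", [0.9, 0.1]),    # close to its own previous embedding
              ("b", [0.8, 0.2]),    # drifted towards "a": an incorrect match
              ("c", [-1.0, 0.0])]   # new person: no match under the threshold
result = count_matches(prev_frame, curr_frame)
```

Counting matches in this way isolates the appearance cue from the rest of the tracker, so that errors can be attributed to re-ID itself.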

Our results​​​‌ show that re-ID often​ provides only marginal gains​‌ and, in several scenarios,​​ can even degrade tracking​​​‌ performance, especially when bounding​ boxes are small, heavily​‌ occluded, or visually ambiguous.​​ We further demonstrated that​​​‌ tuning re-ID similarity thresholds​ is non-trivial and highly​‌ sequence-dependent, undermining the robustness​​ and general applicability of​​​‌ re-ID-based association.

To mitigate​ these issues, we explored​‌ constraints on re-ID usage,​​ such as filtering based​​​‌ on occlusion level and​ minimum bounding-box size. While​‌ these constraints reduced incorrect​​ matches in isolation, their​​​‌ impact on full tracking​ performance remained limited and​‌ inconsistent across sequences.
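
A gating rule of this kind might look as follows. The sketch assumes that occlusion is approximated by the largest fraction of a box covered by any other single box; this proxy, the `min_side` and `max_occlusion` thresholds, and the example boxes are all hypothetical and are not taken from the study.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def covered_fraction(box, other):
    """Fraction of `box` that lies inside `other`."""
    iw = max(0.0, min(box[2], other[2]) - max(box[0], other[0]))
    ih = max(0.0, min(box[3], other[3]) - max(box[1], other[1]))
    area = box_area(box)
    return (iw * ih) / area if area > 0 else 0.0


def reid_allowed(index, boxes, min_side=32.0, max_occlusion=0.5):
    """Use appearance features only for large, mostly visible boxes."""
    box = boxes[index]
    if min(box[2] - box[0], box[3] - box[1]) < min_side:
        return False
    occlusion = max((covered_fraction(box, other)
                     for k, other in enumerate(boxes) if k != index),
                    default=0.0)
    return occlusion <= max_occlusion


boxes = [(0, 0, 100, 200),     # heavily overlapped by the next box
         (20, 0, 120, 200),    # heavily overlapped by the previous box
         (300, 0, 320, 40),    # too small
         (500, 0, 600, 200)]   # large and isolated
allowed = [reid_allowed(i, boxes) for i in range(len(boxes))]
```

Detections that fail the gate would fall back to motion and IoU cues only.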

Overall,​​ this work provides evidence-based​​​‌ insight into the limitations​ of re-ID in MOT​‌ and challenges the assumption​​ that stronger re-ID models​​​‌ automatically lead to better​ tracking. We conclude that​‌ re-ID is not a​​ universally reliable solution for​​​‌ improving MOT and that​ its effectiveness is strongly​‌ conditioned on scene characteristics,​​ detection quality, and careful​​​‌ integration into the tracking​ pipeline.

This study offers​‌ practical guidance for both​​ researchers and practitioners, encouraging​​​‌ more critical and context-aware​ use of re-ID in​‌ future MOT systems.

8.3​​ CM3T: Framework for Efficient​​​‌ Multimodal Learning for Inhomogeneous​ Interaction Datasets

Participants: Tanay Agrawal, Mohammed Guermal, Michal Balazia, François Brémond.

Challenges in cross-learning involve inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in natural language processing (NLP), namely adapters and prefix tuning, we present a new model-agnostic plugin architecture for cross-learning, called CM3T 36, that adapts transformer-based models to new or missing information (see Figure 5). We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient, as the backbone and other plugins do not need to be fine-tuned along with these additions.

Figure 5: This​​​‌ is a representation of​ the main problem CM3T​‌ aims to solve. Backbones​​ pretrained using self-supervised learning​​​‌ provide good general features,​ thus all methods of​‌ fine-tuning work well. In​​ the case of supervised​​ pretraining, adapters fail to​​​‌ perform well (in red)‌ and CM3T is introduced‌​‌ to solve this (in​​ green).

Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5) show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters compared to the backbone to process video input and only 22.3% trainable parameters for two additional modalities, we achieve results comparable to, and in some cases better than, the state of the art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.

8.4 Are Attention Maps‌ Richer than we Imagined‌​‌ for Action Recognition?

Participants:​​ Tanay Agrawal, Abid​​​‌ Ali, Francois Bremond‌.

Deep learning models are becoming more general and robust by the day. In particular, image foundation models have recently shown rapid growth. We introduce a way to exploit this growth in the field of video classification. The basic idea is that, with a good understanding of space, complicated spatio-temporal processing should not be required. We introduce the Attention Map (AM) flow, a way to identify the location of local changes between two frames in a video without adding parameters specifically for it. We utilize adapters, which have been growing in popularity in the field of parameter-efficient transfer learning. These help us incorporate AM flow into a pretrained image model without the need to fine-tune it. With just these changes and minimal temporal processing, an image model is able to achieve state-of-the-art results on popular action recognition datasets with low training time and minimal pretraining. This work explores the theory behind this idea and the intricacies involved. Through relevant experiments, we show the efficacy of this method and discuss various ideas to take this work forward. We evaluate on the Kinetics-400, Something-Something v2, and Toyota Smarthome datasets and achieve state-of-the-art or comparable results. We also show that video models require extensive pretraining on multiple datasets and long training times, two problems that our approach alleviates.

This work has been‌ published at WACV 2025‌​‌ 35.

8.5 Scaling​​ Action Detection: AdaTAD++ with​​​‌ Transformer-Enhanced Temporal-Spatial Adaptation

Participants:‌ Tanay Agrawal, Abid‌​‌ Ali, Francois Bremond​​.

Temporal Action Detection (TAD) is essential for analyzing long-form videos by identifying and segmenting actions within untrimmed sequences. While recent innovations like Temporal Informative Adapters (TIA) have improved resolution, memory constraints still limit large video processing. To address this issue, we introduce AdaTAD++, an enhanced framework that decouples temporal and spatial processing within adapters, organizing them into independently trainable modules. Our novel two-step training strategy first optimizes for high temporal and low spatial resolution, then vice versa. This allows the model to utilize both high spatial and high temporal resolutions during inference while maintaining training efficiency. Additionally, we incorporate a more sophisticated temporal module capable of capturing long-range dependencies more effectively than previous methods. Experiments on benchmark datasets, including ActivityNet-1.3, THUMOS14, and EPIC-Kitchens 100, demonstrate that AdaTAD++ achieves state-of-the-art performance. We also explore various adapter configurations and discuss their trade-offs between resource constraints and performance, providing insights into their optimal application.

This work​‌ has been published at​​ ICCV 2025 38.​​​‌

8.6 SKI Models: SKeleton​ Induced Vision-Language Embeddings for​‌ Understanding Activities of Daily​​ Living

Participants: Arkaprava Sinha​​​‌, Dominick Reilly,​ Francois Bremond, Srijan​‌ Das.

The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks. Our code is available at this GitHub page.

This work​​ has been published at​​​‌ AAAI 2025 45.​

8.7 LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Participants: Dominick Reilly​, Francois Bremond,​‌ Srijan Das.

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning, all of which are essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS (i.e., RGB and Segmentation) instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art (SOTA) performance across ADL benchmarks.

This​‌ work has been published​​ at CVPR 2025 43​​​‌.

8.8 Human-Centric Video​ Understanding: From Single-Modality to​‌ Multi-Modal Learning

Participants: Mahmoud​​ Ali, Di Yang​​​‌, Francois Bremond.​

Figure​​ 6: General pipeline​​ of MoVie for action​​​‌ detection. (a) We broaden‌ the views of a‌​‌ given observation segment by​​ extracting features for the​​​‌ previous segment and skeleton‌ motion. In this‌​‌ stage, we propose a​​ novel Mixed Motion-Visual Encoder,​​​‌ including a Motion Encoder‌ and a Motion-Visual Mixer‌​‌ (MVM) inside to mix​​ multi-modal features. (b) We​​​‌ process the history features‌ and mixed motion-visual features‌​‌ using a TCN- (i.e.,​​ Temporal Convolutional Network) and​​​‌ Transformer-based cross-modal temporal model‌ to obtain frame-level features‌​‌ for the observation video​​ segment. Finally, a multi-label​​​‌ classifier is stacked to‌ predict per-frame action categories‌​‌ within the observation segment.​​

Human action recognition is​​​‌ an active research field‌ with significant contributions to‌​‌ applications such as home-care​​ monitoring, human-computer interaction, and​​​‌ game control. However, recognizing‌ human activities in real-world‌​‌ videos remains challenging, especially​​ when learning effective video​​​‌ representations with a high‌ expressive power to represent‌​‌ human spatio-temporal motion, view-invariant​​ actions, complex composable actions,​​​‌ etc. To address this‌ challenge, we made three‌​‌ contributions toward learning effective​​ representations that can be​​​‌ applied and evaluated in‌ real-world human action classification,‌​‌ retrieval, prediction, detection, and​​ segmentation tasks by transfer​​​‌ learning.

The first contribution‌ (single modality): we improve‌​‌ the generalizability of human​​ skeleton motion representation models​​​‌ under the skeleton-only modality.‌ We introduce two novel‌​‌ self-supervised learning frameworks based​​ on contrastive learning to​​​‌ learn robust and transferable‌ skeleton representations without relying‌​‌ on action labels. By​​ exploiting the inherent spatio-temporal​​​‌ structure of human skeleton‌ sequences, our approach encourages‌​‌ discriminative motion representations through​​ instance-level and temporal consistency​​​‌ objectives. Extensive evaluations demonstrate‌ that the proposed frameworks‌​‌ improve performance across diverse​​ downstream tasks and scenarios,​​​‌ bridging the gap between‌ controlled 3D laboratory datasets‌​‌ (e.g., NTU-RGB-D) and challenging​​ 2D real-world datasets (e.g.,​​​‌ SmartHome), highlighting the strength‌ of SSL (i.e., Self-Supervised‌​‌ Learning) for skeleton-based motion​​ understanding.

The second contribution (two modalities): although skeleton-based models capture spatial and temporal dynamics effectively, they struggle to recognize fine-grained actions. In particular, they fail to distinguish between semantically similar actions, such as "drinking from a cup" versus "drinking from a bottle", as these models lack access to object-centric and semantic information. To address this, we propose MoVie (see Fig. 6), a motion-augmented framework designed to improve real-world human action detection by integrating skeleton motion features with visual information through the Motion-Visual Mixer and by incorporating history-aware temporal modeling.

Figure 7: Overview of the T-MOR framework. Given the skeleton sequence $\mathbf{sk}$, it begins with data augmentation to get $\mathbf{sk}^{+}$ to enrich the learning base. The core components include (i) Skeleton Embedding, utilizing a motion encoder $E_M$ to capture nuanced human movements; (ii) Visual Embedding with a pre-trained encoder $E_V$ for video frames $\mathbf{v}$, enhancing the ability to correlate visual cues with motion data; (iii) Text Embedding with a pre-trained encoder $E_T$, applying textual description $\mathbf{a}$ to refine the comprehension of actions; all three embeddings are followed by projection layers $\phi$ and are then sent to (iv) the Multi-modal Contrastive module, implementing a novel mechanism that synergizes skeleton, visual, and text embeddings to optimize the learning process. Finally, (v) the pre-trained $E_M$ can improve downstream action recognition tasks.

The third contribution (multiple modalities): our previous work shows that VLFMs (i.e., Vision-Language Foundation Models) are still far from satisfactory performance in all evaluated tasks, particularly on densely labeled and long video datasets that feature fine-grained activities in complex real-world scenarios. As shown in Fig. 7, we introduce the Transferable skeleton MOtion Representation learning architecture (T-MOR), based on a contrastive motion-video-language pre-training strategy. The pre-trained skeleton model is effective for action classification, segmentation, and zero-shot action recognition tasks.
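To make the contrastive motion-video-language idea more concrete, here is a minimal sketch of an InfoNCE-style objective that pulls each skeleton embedding towards the video and text embeddings of the same sample. The exact loss used by T-MOR is not given in this report, so the function `info_nce`, the temperature value, the sum of two pairwise terms, and the toy embeddings are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] and positives[i] form the positive pair."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy projected embeddings for two samples (hypothetical values).
skeleton = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9]]
text = [[0.8, 0.2], [0.2, 0.8]]

# Assumed combination: skeleton-video term plus skeleton-text term.
aligned = info_nce(skeleton, video) + info_nce(skeleton, text)
shuffled = info_nce(skeleton, video[::-1]) + info_nce(skeleton, text[::-1])
```

The loss is low when matching skeleton, video, and text embeddings point in similar directions and high when the pairs are mismatched, which is the property such a pre-training objective relies on.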

Overall, this work​​​‌ contributes to the field​ of human-centric video understanding​‌ by proposing novel methods​​ for skeleton-based action representation​​​‌ learning and general RGB​ video representation learning. Such​‌ representations benefit both action​​ classification and segmentation tasks.​​​‌

8.9 B-MoE: A Body-Part-Aware​ Mixture-of-Experts “All Parts Matter”​‌ Approach to Micro-Action Recognition​​

Participants: Aglind Reka,​​​‌ Nishit Poddar, Diana​ Borza, Snehashis Majhi​‌, Michal Balazia,​​ Francois Bremond.

Micro-action recognition (MAR) presents unique challenges due to the inherently subtle, fleeting, and ambiguous nature of micro-actions. Unlike conventional actions, which are often clearly distinguishable, micro-actions, such as a slight nod, a subtle shift in posture, or a brief glance, are characterized by their fine-grained motion and short duration. These movements often overlap in meaning and arise from reflexes or situational cues, making them difficult to interpret and classify. Additionally, micro-actions are influenced by environmental and social factors, further complicating their recognition.

A significant issue​ in current approaches is​‌ the failure to account​​ for the structured nature​​​‌ of human motion. Micro-actions​ often originate from specific​‌ body parts, such as​​ the head, torso, or​​​‌ limbs, and follow a​ consistent body-to-action hierarchy. However,​‌ most existing models treat​​ these actions as flat​​​‌ categories, overlooking the spatial​ dependencies between body regions.​‌ This oversight leads to​​ difficulties in isolating informative​​​‌ signals from background noise​ and differentiating between highly​‌ similar micro-movements within the​​ same body region. Another​​​‌ challenge lies in the​ imbalance and variability of​‌ micro-action datasets. Datasets like​​ MA-52, SocialGesture, and MPII-GroupInteraction​​​‌ capture a wide range​ of human movements, from​‌ short, dynamic gestures to​​ long, static postures. This​​​‌ variability in temporal scale​ and class frequency makes​‌ it challenging for models​​ to capture rare yet​​​‌ distinctive motion patterns, which​ are characteristic of micro-actions.​‌

To address these challenges, we introduce B-MoE, a body-part-aware Mixture-of-Experts framework (see Figure 8) designed to explicitly model the structured nature of human motion. B-MoE specializes in analyzing motions from localized body regions such as the head, torso, upper limbs, and lower limbs, allowing the model to focus on subtle movements and discriminative cues within each region. By doing so, B-MoE suppresses background interference and enhances the detection of fine-grained motion cues, improving the ability to differentiate between ambiguous action classes. Central to B-MoE is the Macro–Micro Motion Encoder (M3E), shown in Figure 9, a lightweight yet powerful backbone that captures both long-range contextual structure and fine-grained local motion. This dual capability enables the model to effectively recognize both prolonged poses and rapid micro-movements. A cross-attention routing mechanism further enhances the framework by dynamically selecting and fusing informative region-wise semantic cues (see Figure 10), which are then integrated with global motion features. Through this approach, B-MoE addresses the core challenges of MAR (subtlety, ambiguity, and class imbalance) by amplifying fine local cues, suppressing irrelevant regions, and providing complementary semantic and motion evidence. This work was submitted to CVPR 2026.
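The cross-attention routing described above can be sketched as a single global motion query attending over per-region expert features. This is a simplified illustration rather than the B-MoE implementation: learned query, key, and value projections are omitted, and the function `route_regions`, the region names, and the feature values are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_regions(global_feature, region_features):
    """Scaled dot-product attention with the global feature as the only query."""
    scale = math.sqrt(len(global_feature))
    names = list(region_features)
    scores = [
        sum(q * k for q, k in zip(global_feature, region_features[name])) / scale
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of region features, one output value per feature dimension.
    fused = [
        sum(w * region_features[name][d] for w, name in zip(weights, names))
        for d in range(len(global_feature))
    ]
    return dict(zip(names, weights)), fused


global_motion = [1.0, 0.0, 0.0, 0.0]  # stand-in for the global motion feature
experts = {  # stand-ins for region-wise expert outputs
    "head": [2.0, 0.0, 0.0, 0.0],
    "torso": [0.0, 1.0, 0.0, 0.0],
    "upper_limbs": [0.5, 0.0, 1.0, 0.0],
    "lower_limbs": [0.0, 0.0, 0.0, 1.0],
}
weights, fused = route_regions(global_motion, experts)
```

In this toy setting the region whose features align best with the global motion query receives the largest routing weight, while poorly aligned regions are down-weighted rather than discarded.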

Figure 8

The image depicts a​​ flowchart of a machine​​​‌ learning model. It takes‌ video input and processes‌​‌ it through multiple branches:​​ semantic branches for different​​​‌ body parts (head, body,‌ upper limb, lower limb),‌​‌ a semantic encoder, and​​ a motion encoder. The​​​‌ semantic branches are frozen,‌ while the experts, which‌​‌ are learnable components, adapt.​​ The outputs from these​​​‌ branches and encoders converge‌ and pass through a‌​‌ series of modules including​​ cross-attention mechanisms, a transformer,​​​‌ and a multi-layer perceptron‌ (MLP). The process aims‌​‌ to learn and integrate​​ semantic and motion features​​​‌ for analysis. (Description generated‌ at January 15th, 2026‌​‌ by Albert AI with​​ the model Mistral-Small-3.2-24B)

Figure 8: B-MoE: A dual-stream encoder extracts region-conditioned semantic features with a semantic encoder and global motion features with a global motion encoder. The semantic stream is routed through a region-aware MoE, where each expert specializes in modeling micro-movements within a specific body region. A cross-attention fusion head integrates expert outputs with motion saliency from the global stream, and a transformer-MLP (i.e., MultiLayer Perceptron) classifier produces the final predictions.

Figure 9

The​​​‌ image depicts a neural‌ network architecture consisting of‌​‌ an SGP layer and​​ a semantic embedding alignment​​​‌ mechanism. The SGP layer‌ includes components like MHSA‌​‌ (Multi-Head Self Attention), ConvC​​ (Convolution), and fully connected​​​‌ (FC) layers. The process‌ starts with input (T,‌​‌ D) through MHSA, followed​​ by several convolutional and​​​‌ pooling operations. The semantic‌ embedding alignment, used only‌​‌ during pre-training, aligns word​​ embeddings through a TAN​​​‌ module with fully connected‌ layers, minimizing the embedding‌​‌ loss. (Description generated at​​ January 16th, 2026 by​​​‌ Albert AI with the‌ model Mistral-Small-3.2-24B)

Figure 9‌​‌: Macro-Micro Motion Encoder​​ (M3E): Input sequence is​​​‌ processed with multi-head self-attention‌ to capture global temporal‌​‌ dependencies, followed by an​​ SGP (i.e., Scalable-Granularity Perception)​​​‌ module for fine-grained local‌ motion reasoning. During pretraining,‌​‌ a semantic alignment loss​​ aligns learned features with​​​‌ word embeddings of action‌ labels.

Figure 10

The image depicts‌​‌ a flowchart of a​​ video processing system. It​​​‌ begins with a video‌ input of a person‌​‌ sitting. The video frame​​ (T, C, H, W)​​​‌ is first processed by‌ a module called "Sapiens,"‌​‌ likely for segmentation, producing​​ segmentation maps (T, H,​​​‌ W). These maps are‌ combined with another input,‌​‌ represented by a bone​​​‌ structure image, which then​ goes to a "Semantic​‌ Encoder." The output is​​ a feature representation (T/16,​​​‌ 1408) (8 x 1408).​ Snowflake icons indicate the​‌ use of frozen parameters​​ or pre-trained weights in​​​‌ these modules. (Description generated​ at January 16th, 2026​‌ by Albert AI with​​ the model Mistral-Small-3.2-24B)

Figure​​​‌ 10: Semantic Branch:​ We segment each frame​‌ using SAPIENS, derive the​​ crop around the target​​​‌ body part (upper limb​ in this example), and​‌ apply the corresponding mask​​ to the cropped region.​​​‌ The resulting cropped and​ masked video is then​‌ processed by VideoMAE-V2, pretrained​​ on Kinetics.

Our extensive experiments on three socially contextual micro-action benchmarks (MA-52, MPII-GI, and SocialGesture) demonstrate significant improvements, with gains in macro-F1 score of +4.32%, +3.35%, and +1.17%, respectively. These results highlight B-MoE’s robustness in handling class imbalance and its superior performance in recognizing ambiguous, underrepresented, and low-amplitude actions.

8.10 Loose Social-Interaction Recognition​‌ in Real-world Therapy Scenarios​​

Participants: Abid Ali,​​​‌ Monique Thonnat, Francois​ Bremond.

The computer vision community has explored dyadic interactions for atomic actions such as pushing, carrying an object, etc. However, with the advancement of deep learning models, there is a need to explore more complex dyadic situations such as loose interactions. These are interactions where two people perform certain atomic activities to complete a global action irrespective of temporal synchronization and physical engagement, for example cooking together. Analyzing these types of dyadic interactions has several useful applications in the medical domain for social-skills development and mental health diagnosis. To achieve this, we propose a novel dual-path architecture to capture the loose interaction between two individuals. Our model learns global abstract features from each stream via a CNN backbone and fuses them using a new Global-Layer-Attention module based on a cross-attention strategy. We evaluate our model on real-world autism diagnosis data, namely our Loose-Interaction dataset and the publicly available Autism dataset for loose interactions. Our network achieves baseline results on the Loose-Interaction dataset and SOTA results on the Autism dataset. Moreover, we study different social interactions by experimenting on a publicly available dataset, NTU-RGB+D (interactive classes from both NTU-60 and NTU-120). We have found that different interactions require different network designs. We also evaluate a slightly different version of our method that incorporates time information to address tight interactions, achieving SOTA results.

This​​​‌ work has been published​ at WACV 2025 37​‌.

8.11 Just Dance​​ with π!, A Poly-modal​​​‌ Inductor for Weakly-supervised Video​ Anomaly Detection

Participants: Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva Ali, Francois Bremond.

Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. This is because RGB features are not sufficiently distinctive to set apart categories such as shoplifting from visually similar events. Therefore, for robust VAD in complex real-world scenarios, it is essential to augment RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: “PI-VAD” (or π-VAD), a novel approach that augments RGB representations with five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three-dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (Optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. π-VAD includes two plug-in modules, namely the Pseudo-modality Generation module and the Cross Modal Induction module, which generate modality-specific prototypical representations and thereby induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and require the five modality backbones only during training. Notably, π-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones at inference.

This‌​‌ work has been published​​ at CVPR 2025 40​​​‌.

8.12 Mixture of‌ Experts Guided by Gaussian‌​‌ Splatters Matters: A new​​ Approach to Weakly-Supervised Video​​​‌ Anomaly Detection

Participants: Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva, Francois Bremond.

We identify one of the main issues in the formulation of the weakly-supervised video anomaly detection (WSVAD) task. Multi-instance learning (MIL) strikes a balance between fully supervised methods, which perform well but require costly data annotation, and unsupervised methods, which do not require manual annotations but generally perform worse. The core idea of MIL is to create bags of snippets drawn from normal and abnormal videos, labeled only at the video level. During training, the model assigns a score between 0 and 1 to each snippet, with 0 indicating a normal snippet and 1 indicating an abnormal snippet. The highest-scoring samples in the normal bag are guided towards 0, allowing the model to learn most normal scenarios correctly. Conversely, the highest-scoring samples in the abnormal bag are pushed towards 1. As a result, the model is supervised on, and therefore learns from, only a few specific instances of anomalous events, ignoring useful information contained in neighboring snippets. Over time, this approach has proved to be powerful but insufficient to train a model to correctly capture the secondary and specific attributes of different anomalous classes. In recent works, different auxiliary objectives have been identified as priors for the VAD task to optimize the training process.
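The limitation of the MIL formulation described above can be seen in a minimal sketch of a top-k MIL objective. This is a generic illustration, not the training loss of any specific method: the function names, the binary cross-entropy form, the default `top_k=1`, and the snippet scores are assumptions.

```python
import math


def bce(score, target, eps=1e-7):
    """Binary cross-entropy for one score in [0, 1]."""
    score = min(max(score, eps), 1.0 - eps)
    return -(target * math.log(score) + (1.0 - target) * math.log(1.0 - score))


def top_k_mean(scores, top_k):
    """Mean of the top_k highest snippet scores in a bag."""
    best = sorted(scores, reverse=True)[:top_k]
    return sum(best) / len(best)


def mil_loss(normal_scores, abnormal_scores, top_k=1):
    """Only the highest-scoring snippets of each bag receive supervision."""
    normal_term = bce(top_k_mean(normal_scores, top_k), 0.0)
    abnormal_term = bce(top_k_mean(abnormal_scores, top_k), 1.0)
    return normal_term + abnormal_term


def supervised_indices(scores, top_k=1):
    """Indices of the snippets that actually contribute to the loss."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(order[:top_k])


normal_bag = [0.10, 0.20, 0.05]
abnormal_bag = [0.10, 0.40, 0.90, 0.50, 0.10]  # the anomaly spans snippets 1 to 3
loss = mil_loss(normal_bag, abnormal_bag)
used = supervised_indices(abnormal_bag)
```

With `top_k=1`, only snippet 2 of the abnormal bag is pushed towards 1; its neighbours at indices 1 and 3, which belong to the same event but score lower, receive no direct supervision.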

Figure‌ 11: Overview of‌​‌ the GS-MoE architecture: First,​​ in the feature extraction​​​‌ stage, the video encoder‌ extracts snippet-level features from‌​‌ the video, and the​​ task encoder refines them​​​‌ in the anomaly-detection latent‌ space. In the second‌​‌ stage, each class-expert is​​ trained only on refined​​​‌ features belonging to its‌ assigned class and to‌​‌ the normal class. In​​ the final stage, the​​​‌ gate model collects the‌ scores assigned by each‌​‌ expert and compares them​​ with the refined features​​​‌ of the task encoder,‌ producing the final abnormal‌​‌ score.

To address this​​​‌ issue, we propose to​ model the anomalies in​‌ a video as Gaussian​​ distributions (see Fig. 11​​​‌), rendering multiple Gaussian​ kernels in correspondence with​‌ peaks detected along the​​ temporal dimension of the​​​‌ scores estimated for abnormal​ videos. This technique, called​‌ Temporal Gaussian Splatting (TGS),​​ creates a more complete​​​‌ representation of an anomalous​ event over time, including​‌ snippets of the anomaly​​ with lower abnormal scores​​​‌ in the training objective.​ The Gaussian kernels are​‌ extracted from the abnormal​​ scores produced by the​​​‌ model.
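The following is a minimal sketch of the idea of rendering Gaussian kernels at score peaks along the temporal dimension. It is not the authors' TGS implementation: the strict local-maximum peak detector, the fixed `sigma`, the scaling of each kernel by its peak score, and the pointwise-maximum combination are all assumptions made for illustration.

```python
import math


def find_peaks(scores, min_height=0.5):
    """Indices of strict local maxima at or above min_height (simple stand-in detector)."""
    peaks = []
    for t, s in enumerate(scores):
        left = scores[t - 1] if t > 0 else float("-inf")
        right = scores[t + 1] if t < len(scores) - 1 else float("-inf")
        if s >= min_height and s > left and s > right:
            peaks.append(t)
    return peaks


def splat_gaussians(scores, sigma=1.5, min_height=0.5):
    """Render one Gaussian kernel per peak and combine kernels with a pointwise maximum."""
    peaks = find_peaks(scores, min_height)
    target = [0.0] * len(scores)
    for p in peaks:
        for t in range(len(scores)):
            kernel = scores[p] * math.exp(-((t - p) ** 2) / (2.0 * sigma ** 2))
            target[t] = max(target[t], kernel)
    return peaks, target


# Hypothetical snippet-level abnormal scores for one abnormal video.
snippet_scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.7, 0.2]
peaks, soft_target = splat_gaussians(snippet_scores)
```

The resulting soft target assigns a non-trivial abnormal value to snippets next to each peak, so snippets of the anomaly with lower predicted scores can also enter the training objective.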

An additional challenge is related to the intrinsic differences between abnormal classes. Under the MIL paradigm, the models are trained to learn the difference between normal and abnormal videos, while the specific differences between anomalous classes are overlooked. As a result, these methods mainly focus on coarse-level representations of anomalies that allow us to distinguish between normal and abnormal events, but ignore the fine-grained category-specific cues. Therefore, the more salient anomalies (e.g., an explosion) are likely to be easily detected, while subtle anomalies (e.g., shoplifting) are more likely to be confused with normal events. This constitutes a major limitation of most recent WSVAD methods. We address this issue via a Mixture-of-Experts (MoE) architecture, in which each expert is trained to model a single anomaly class, enhancing the specific attributes of each anomaly class that are often overlooked. To further leverage the correlations and differences between anomalies, a gate model mediates between the predictions of each expert and the more coarse-level anomalous features to learn potential interactions between anomalies.

This work has​ been published at ICCV​‌ 2025 48.

8.13 Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy

Participants: Quentin Merilleau, Snehashis Majhi, Antitza Dantcheva, Francois Bremond.

Forecasting abnormal human behavior (AHB) in unconstrained real-world environments is critical for enabling proactive safety interventions 42. Unlike short-term anomaly detection, long-horizon forecasting offers a vital reaction window but remains underexplored due to three core challenges: (i) noisy, complex human–agent interactions; (ii) weak temporal coupling between normal observations and distant anomalies; and (iii) data scarcity limiting the scalability of autoregressive models. To address these, we propose D3P (Denoise, Divide, Distill, and Predict), illustrated in Fig. 12, a novel encoder–decoder framework that bridges denoised pasts with distilled autoregressive futures. Our Differential Past Encoder (DiPE) disentangles scene-level and object-level dynamics via differential attention, suppressing irrelevant interactions and enhancing discriminative cues. The Distilled Future Auto-Regressive Decoder (D-FAD) adopts a divide-and-conquer strategy, segmenting future queries into temporal chunks for sequential prediction, while leveraging distillation to balance robustness and latency. We validate our approach on the AHB-F benchmark, the only dataset dedicated to abnormal behavior forecasting, and further integrate D-FAD with several state-of-the-art methods. In all cases, our framework consistently outperforms prior work in both forecasting accuracy and computational efficiency. This work has been accepted for publication at WACV 2026.
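The divide-and-conquer decoding described above, in which future queries are segmented into temporal chunks that are predicted sequentially, can be sketched as follows. This is only a control-flow illustration, not the D-FAD decoder: the helper names, the chunk size, and the placeholder `toy_predictor` are assumptions, and distillation is not modeled.

```python
def split_into_chunks(horizon, chunk_size):
    """Split future steps 1..horizon into consecutive temporal chunks."""
    steps = list(range(1, horizon + 1))
    return [steps[i:i + chunk_size] for i in range(0, horizon, chunk_size)]


def forecast(past_context, horizon, chunk_size, predict_chunk):
    """Predict chunk by chunk, feeding earlier predictions back as context."""
    context = list(past_context)
    predictions = []
    for chunk in split_into_chunks(horizon, chunk_size):
        outputs = predict_chunk(context, chunk)
        predictions.extend(outputs)
        context.extend(outputs)  # autoregressive step at chunk granularity
    return predictions


def toy_predictor(context, chunk):
    """Placeholder model: repeats the mean of the current context for every step."""
    mean = sum(context) / len(context)
    return [mean for _ in chunk]


observed = [0.2, 0.4]  # stand-in for encoded past observations
future = forecast(observed, horizon=8, chunk_size=4, predict_chunk=toy_predictor)
```

Predicting a whole chunk per step needs fewer sequential decoder calls than predicting one future step at a time, which is one way such a scheme can trade off latency against the ability to condition on earlier predictions.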

Figure 12

The​​ image compares traditional video​​​‌ anomaly detection with a‌ new video anomaly anticipation‌​‌ method. In the traditional​​ method, labeled "Video Anomaly​​​‌ Detection," anomalies are identified‌ in the current frame‌​‌ by analyzing past frames​​ and classifying future frames​​​‌ as normal or anomalous.‌ The new method, "Our‌​‌ Video Anomaly Anticipation," not​​ only detects anomalies in​​​‌ the current frame but‌ also predicts future anomalies‌​‌ by utilizing both short-term​​ (1-3 seconds) and long-term​​​‌ (4-8 seconds) anticipations. The‌ diagram is divided into‌​‌ sections showing offline and​​ online processes. Offline training​​​‌ involves multiple frames, while‌ online detection and anticipation‌​‌ use a reference model​​ (Dref) to predict anomalies.​​​‌ The overall aim is‌ to enhance early detection‌​‌ of irregular events in​​ videos. (Description generated at​​​‌ January 15th, 2026 by‌ Albert AI with the‌​‌ model Mistral-Small-3.2-24B)

Figure 12: Illustration of VAD (Video Anomaly Detection) vs. VAA (Video Anomaly Anticipation). Suppose the current time step is $t$. For online VAD, a parametrized model $f(\theta)$ predicts normal (N) or anomaly (A) for the current $t$ based on the observed time stamps $t-i, \dots, t-1, t$, where $i$ represents the observed duration. In contrast, our VAA predicts what kind of anomaly will occur in the future, over the range $[t+1, t+2, \dots, t+k]$, where $k$ represents the anticipation duration. Further, we cover both short- and long-term anticipation to identify the potential re-occurrence of an anomaly in the distant future.

8.14 Not‌ All Blends Are Equal:‌​‌ The BLEMORE Dataset of​​ Blended Emotion Expressions with​​​‌ Relative Salience Annotations

Participants:‌ Michal Balazia, Teimuraz‌​‌ Saghinadze, Francois Bremond​​.

(Both the paper and the competition have been accepted at FG 2026.)

Humans often‌​‌ experience not just a​​ single basic emotion at​​​‌ a time, but rather‌ a blend of several‌​‌ emotions with varying salience.​​ Despite the importance of​​​‌ such blended emotions, most‌ video-based emotion recognition approaches‌​‌ are designed to recognize​​ single emotions only. The​​​‌ few approaches that have‌ attempted to recognize blended‌​‌ emotions typically cannot assess​​ the relative salience of​​​‌ the emotions within a‌ blend. This limitation largely‌​‌ stems from the lack​​ of datasets containing a​​​‌ substantial number of blended‌ emotion samples annotated with‌​‌ relative salience. To address​​ this shortcoming, we introduce​​​‌ BLEMORE, a novel dataset‌ for multimodal (video, audio)‌​‌ BLended EMOtion​​ REcognition (see Figure​​​‌ 13) that includes‌ information on the relative‌​‌ salience of each emotion​​ within a blend.

Figure 13

The image‌​‌ shows a sequence of​​ four photos featuring a​​​‌ person with dark hair‌ tied back, wearing a‌​‌ black shirt. The photos​​ progressively depict increasing expressions​​​‌ of emotion, starting from‌ a neutral expression and‌​‌ moving through stages of​​ surprise or excitement. The​​​‌ person's mouth opens wider‌ and their facial muscles‌​‌ tense more in each​​ subsequent photo. The background​​​‌ is a plain, neutral‌ gray. (Description generated at‌​‌ January 15th, 2026 by​​​‌ Albert AI with the​ model Mistral-Small-3.2-24B)

Figure 13​‌: Examples of stills​​ from the video recordings.​​​‌ The actor portrays a​ combination of anger and​‌ fear.

BLEMORE comprises over​​ 3,000 clips from 58​​​‌ actors, performing 6 basic​ emotions (anger, disgust, fear,​‌ happiness, sadness, and neutral)​​ and 10 distinct blends​​​‌ consisting of all pairwise​ combinations of anger, disgust,​‌ fear, happiness, and sadness.​​ All pairwise combinations (see​​​‌ Figure 14) were​ further conveyed with three​‌ different blend conditions:

  • 50/50​​ = same amount of​​​‌ both emotions (e.g. 50/50​ happiness-sadness where both happiness​‌ and sadness are expressed​​ in equal proportions)
  • 70/30​​​‌ = the first emotion​ is more salient than​‌ the second emotion (e.g.​​ 70/30 happiness-sadness conveys mainly​​​‌ happiness blended with a​ tinge of sadness)
  • 30/70​‌ = the second emotion​​ is more salient than​​​‌ the first emotion (e.g.​ 30/70 happiness-sadness conveys mainly​‌ sadness blended with a​​ tinge of happiness)
Figure 14

Figure 14: Structure of the full BLEMORE dataset (train and test partitions), which contains single emotions and blended emotions expressed with equal (=) and unequal (<) salience.

Using this dataset,​‌ we conduct extensive evaluations​​ of state-of-the-art video classification​​​‌ approaches on two blended​ emotion prediction tasks: (1)​‌ predicting the presence of​​ emotions in a given​​​‌ sample, and (2) predicting​ the relative salience of​‌ emotions in a blend.​​ Our results show that​​​‌ unimodal classifiers achieve up​ to 29% presence accuracy​‌ and 13% salience accuracy​​ on the validation set,​​​‌ while multimodal methods yield​ clear improvements, with ImageBind+WavLM​‌ reaching 35% presence accuracy​​ and HiCMAE 18% salience​​​‌ accuracy. On the held-out​ test set, the best​‌ models achieve 33% presence​​ accuracy (VideoMAEv2+HuBERT) and 18%​​​‌ salience accuracy (HiCMAE).
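To make the two prediction tasks more tangible, the sketch below enumerates the blend classes and scores predictions for presence and relative salience. The label format and the two accuracy definitions (exact match of the set of emotions, and exact match of the dominant emotion) are assumptions made for illustration. The official BLEMORE metrics may be defined differently.

```python
from itertools import combinations

# Emotions that enter blends (neutral is only performed as a single emotion).
BLEND_EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness"]
PAIRS = list(combinations(BLEND_EMOTIONS, 2))   # all pairwise combinations
CONDITIONS = [(50, 50), (70, 30), (30, 70)]     # salience of (first, second)


def blend_label(first, second, condition):
    """Hypothetical label format: emotion name -> salience percentage."""
    return {first: condition[0], second: condition[1]}


def dominant(label):
    """More salient emotion of a two-emotion blend, or 'equal' for 50/50."""
    (e1, s1), (e2, s2) = label.items()
    if s1 == s2:
        return "equal"
    return e1 if s1 > s2 else e2


def presence_accuracy(truth, pred):
    """Assumed definition: the predicted set of emotions matches exactly."""
    return sum(set(t) == set(p) for t, p in zip(truth, pred)) / len(truth)


def salience_accuracy(truth, pred):
    """Assumed definition: same emotions present and same dominant emotion."""
    hits = sum(set(t) == set(p) and dominant(t) == dominant(p)
               for t, p in zip(truth, pred))
    return hits / len(truth)


truth = [blend_label("happiness", "sadness", (70, 30)),
         blend_label("anger", "fear", (50, 50))]
pred = [blend_label("happiness", "sadness", (30, 70)),
        blend_label("anger", "fear", (50, 50))]
print(len(PAIRS), presence_accuracy(truth, pred), salience_accuracy(truth, pred))
```

In this toy example both predictions name the right emotions, but the first one inverts the salience, so it counts for presence only.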

The BLEMORE dataset is also the basis of the BLEMORE competition, in which participants develop systems to predict the emotions present in each recording and the relative salience of each emotion. To support participation, we provide training data with labels, test data without labels, pre-extracted audio-visual feature embeddings, and baseline unimodal and multimodal classification results. The competition offers the first comprehensive platform for evaluating blended emotion recognition and aims to stimulate methodological innovation in multimodal affective computing.

8.15 The INEMO​​​‌ Dataset: A Multimodal Benchmark​ of Physiological and Behavioral​‌ Responses to Social Media​​ and Film Stimuli

Participants:​​​‌ Wenxin Xiong, Valeriya​ Strizhkova, Aowen Shi​‌, Michal Balazia,​​ Laura Ferrari, Francois​​​‌ Bremond.

The INEMO​ dataset is a multimodal​‌ benchmark designed to study​​ emotional and behavioral responses​​​‌ to influencer-style social media​ videos and emotion calibration​‌ film clips. As shown​​ in Figures 15 and​​​‌ 16, participants complete​ two tasks (Influencer and​‌ Calibration), in which they​​ watch short video clips​​​‌ and then rate their​ emotions using 1–9 Self-Assessment​‌ Manikin (SAM) scales for​​ valence and arousal, as​​​‌ well as provide preference​ judgments about the videos.​‌ During these sessions, multiple​​ synchronized modalities are recorded,​​​‌ including facial video, electrocardiography​ (ECG), electrodermal activity (EDA),​‌ eye tracking and screen​​ activity, all time-aligned and​​ stored in a structured​​​‌ metadata format organized by‌ participant, task and modality.‌​‌ This design makes INEMO​​ directly usable for machine​​​‌ learning and deep learning‌ models and positions it‌​‌ as a bridge between​​ traditional lab-based affective datasets​​​‌ and more realistic social‌ media scenarios.

Figure 15

The image‌​‌ depicts an experiment protocol​​ with two tasks. Task​​​‌ 1 involves watching influencer‌ video clips in three‌​‌ sets, each with three​​ videos, and using the​​​‌ Self-Assessment Manikin (SAM) to‌ gauge reactions. Participants also‌​‌ rank preferences for individuals​​ and others. Task 2​​​‌ includes emotion calibration with‌ videos evoking different emotions‌​‌ (amusement, tenderness, sadness, disgust,​​ fear), followed by SAM​​​‌ assessments. The process ends‌ with questionnaires. SAM measures‌​‌ valence (negative to positive)​​ and arousal (calm to​​​‌ exciting). (Description generated at‌ January 15th, 2026 by‌​‌ Albert AI with the​​ model Mistral-Small-3.2-24B)

Figure 15: Overview of the INEMO experiment protocol

The‌​‌ image shows a person​​ sitting at a table​​​‌ with medical devices attached‌ to their body. Electrodes‌​‌ are placed on their​​ chest and stomach, connected​​​‌ by wires to a‌ device strapped to their‌​‌ left wrist. Another device​​ is strapped to their​​​‌ right wrist, with wires‌ connected to electrodes on‌​‌ their right hand. The​​ person appears to be​​​‌ in a medical or‌ clinical setting, possibly undergoing‌​‌ a diagnostic or therapeutic​​ procedure involving muscle or​​​‌ nerve activity monitoring. (Description‌ generated at January 15th,‌​‌ 2026 by Albert AI​​ with the model Mistral-Small-3.2-24B)​​​‌

Figure 16:​​ Overview of the INEMO​​​‌ setup: placement of physiological‌ electrodes.

To evaluate the‌​‌ dataset and illustrate its​​ potential for multimodal emotion​​​‌ recognition, classical machine learning‌ models (SVM, Random Forest,‌​‌ Gradient Boosting) were trained​​ on handcrafted features extracted​​​‌ from ECG and EDA,‌ with and without video‌​‌ features, and compared to​​ a multimodal MVP-based (i.e.,​​​‌ Multimodal for Video and‌ Physio) baseline that jointly‌​‌ integrates ECG, EDA and​​ facial video. The best​​​‌ results are obtained with‌ a Gradient Boosting model‌​‌ using the combined ECG+EDA+Video​​ configuration, reaching weighted F1-scores​​​‌ of about 0.78 for‌ valence and 0.76 for‌​‌ arousal, and accuracies up​​ to 0.80 for valence​​​‌ and 0.70 for arousal.‌ These results confirm that‌​‌ the INEMO signals are​​ informative and that the​​​‌ associated classification tasks are‌ learnable, while still leaving‌​‌ room for more advanced​​ multimodal modeling approaches.
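For readers unfamiliar with the reported metric, the following sketch computes a weighted F1-score, i.e. the per-class F1 averaged with weights proportional to class support. The binary "low"/"high" labels are an assumption made for this example. The source does not state how valence and arousal ratings were discretized.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical binary valence labels for five clips.
y_true = ["high", "high", "low", "low", "low"]
y_pred = ["high", "low", "low", "low", "high"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by support keeps the score from being dominated by a minority class, which matters when rating distributions are imbalanced.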

8.16​​​‌ EEG Classification with Limited‌ Data: A Deep Clustering‌​‌ Approach.

Participants: Mohsen Tabejamaat​​, Farhood Negin,​​​‌ Francois Bremond.

The computer vision community has explored dyadic interactions for atomic actions such as pushing, carrying an object, etc. However, with the advancement of deep learning models, there is a need to explore more complex dyadic situations such as loose interactions. These are interactions in which two people perform certain atomic activities to complete a global action irrespective of temporal synchronization and physical engagement, for example cooking together. Analyzing these types of dyadic interactions has several useful applications in the medical domain, for social-skills development and mental health diagnosis. To achieve this, we propose a novel dual-path architecture to capture the loose interaction between two individuals. Our model learns global abstract features from each stream via a CNN backbone and fuses them using a new Global-Layer-Attention module based on a cross-attention strategy. We evaluate our model on real-world autism diagnosis data, namely our Loose-Interaction dataset and the publicly available Autism dataset for loose interactions. Our network achieves baseline results on the Loose-Interaction dataset and SOTA results on the Autism dataset. Moreover, we study different social interactions by experimenting on a publicly available dataset, NTU-RGB+D (interactive classes from both NTU-60 and NTU-120). We have found that different interactions require different network designs. We also evaluate a slightly different version of our method that incorporates time information to address tight interactions, achieving SOTA results.

This work has been​ published in Pattern Recognition​‌ 2025 34.

8.17​​ MEPHESTO: Multimodal Phenotyping of​​​‌ Psychiatric Disorders from Social​ Interaction

Participants: Michal Balazia​‌, Aowen Shi,​​ Miriana Russo, Francois​​​‌ Bremond.

Identifying objective​ and reliable markers to​‌ tailor diagnosis and treatment​​ of psychiatric patients remains​​​‌ a challenge, as conditions​ like major depression, bipolar​‌ disorder, or schizophrenia are​​ qualified by complex behavior​​​‌ observations or subjective self-reports​ instead of easily measurable​‌ somatic features. Recent progress​​ in computer vision, speech​​​‌ processing and machine learning​ has enabled detailed and​‌ objective characterization of human​​ behavior in social interactions.​​​‌ However, the application of​ these technologies to personalized​‌ psychiatry is limited due​​ to the lack of​​​‌ sufficiently large corpora that​ combine multimodal measurements with​‌ longitudinal assessments of patients​​ covering more than a​​​‌ single disorder. Our multi-centre,​ multi-disorder longitudinal corpus creation​‌ effort MEPHESTO is designed​​ to develop and validate​​​‌ novel multimodal markers for​ psychiatric conditions. MEPHESTO consists​‌ of multimodal audio, video,​​ and physiological recordings as​​​‌ well as clinical assessments​ of psychiatric patients covering​‌ a six-week main study​​ period as well as​​​‌ several follow-up recordings spread​ across twelve months.

Diagnoses include schizophrenia, depression and bipolar disorder. The dataset does not include control subjects. Each patient contributes 1–8 videos, roughly 5.5 videos on average. In addition to video, the recordings include the patients' and clinicians' biosignals: electrodermal activity (EDA), blood volume pulse (BVP), inter-beat interval (IBI), heart rate, temperature, and accelerometer data. Videos are recorded with Azure Kinect and biosignals with Empatica. Participants do not wear face masks while being recorded, although a large transparent plexiglass panel is in place to minimize the transmission of COVID-19. The dataset is confidential, but many patients agreed to the publication of their raw or anonymized data for research purposes. Figure 17 shows a screenshot from a mock recording.

Figure 17

The image shows two‌ people sitting in different‌​‌ rooms. Each person has​​ a set of physiological​​​‌ data displayed below them.‌ The data includes temperature,‌​‌ EDA (electrodermal activity), BVP​​ (blood volume pulse), ACC​​​‌ (accelerometer), and HR (heart‌ rate). The person on‌​‌ the left is seated​​ near a window and​​​‌ wears a black and‌ white striped shirt. The‌​‌ person on the right​​ sits near a bookshelf​​​‌ and wears a black‌ top. (Description generated at‌​‌ January 15th, 2026 by​​ Albert AI with the​​​‌ model Mistral-Small-3.2-24B)

Figure 17: Screenshot of a mock recording with two videos and biosignals. The person on the left represents a clinician and the person on the right a patient. To protect the identity of patients, this mock recording is acted by two clinicians.

This​​ year, we have made​​​‌ three major contributions regarding‌ therapeutic alliance, recognizing depression‌​‌ and schizophrenia, and detecting​​ childhood trauma from speech.​​​‌ These contributions are explained‌ in detail in the‌​‌ subsections below.

8.17.1 Contextualized​​ Synchrony for Therapeutic Alliance​​​‌

Non-verbal behavioral synchrony has‌ been widely studied as‌​‌ an indicator of relational​​ dynamics in clinical interactions​​​‌ and has been shown‌ to exhibit weak to‌​‌ moderate associations with therapeutic​​ alliance (TA). However, most​​​‌ existing synchrony measures are‌ computed in a content-agnostic‌​‌ manner, implicitly assuming that​​ synchrony occurring at different​​​‌ moments of an interaction‌ contributes equally to the‌​‌ development of the therapeutic​​ relationship. This work is​​​‌ motivated by the hypothesis‌ that the relational meaning‌​‌ of synchrony is context-dependent,​​ and that linguistic content​​​‌ may play a critical‌ role in determining when‌​‌ non-verbal coordination is most​​ relevant to therapeutic alliance.​​​‌ In our setting, TA‌ is assessed at the‌​‌ end of each session​​ via a seven-item patient​​​‌ questionnaire capturing liking, perceived‌ helpfulness, feeling understood and‌​‌ supported, and ease of​​ sharing personal information, with​​​‌ the global TA score‌ obtained by averaging item‌​‌ responses. By integrating semantic​​ information derived from spoken​​​‌ language with non-verbal synchrony‌ measures, this study aims‌​‌ to move beyond global,​​ uniform synchrony metrics toward​​​‌ a more fine-grained, context-sensitive‌ understanding of therapist–patient interaction‌​‌ dynamics. Non-verbal synchrony was​​ computed at the window​​​‌ level using Motion Energy‌ Analysis (MEA, see Figure‌​‌ 18 for an example​​ of patient–therapist MEA time​​​‌ series) and a cross-correlation‌ framework applied to the‌​‌ continuous motion energy time​​ series of patient and​​​‌ therapist.

Figure 18

The image is‌ a line graph titled‌​‌ "Motion Energy Analysis (MEA):​​ Patient vs. Therapist." It​​​‌ depicts standardized motion energy‌ on the y-axis versus‌​‌ time on the x-axis.​​ The graph compares the​​​‌ motion energy of a‌ patient and a therapist,‌​‌ represented by blue and​​ orange lines, respectively. The​​​‌ patient's motion energy shows‌ smaller, more frequent fluctuations,‌​‌ while the therapist's energy​​ exhibits larger, more sporadic​​​‌ peaks. Both lines show‌ significant activity at the‌​‌ beginning and end of​​ the time period. (Description​​​‌ generated at January 15th,‌ 2026 by Albert AI‌​‌ with the model Mistral-Small-3.2-24B)​​​‌

Figure 18: Example​ of patient–therapist Motion Energy​‌ Analysis (MEA) time series​​ over a single therapy​​​‌ session.

We evaluate all models by predicting session-level TA scores and using Pearson's correlation coefficient r between predicted and observed TA as the primary outcome measure, computed in a session-level cross-validation setting. We first replicated a stable baseline association between global MEA synchrony and patient-reported TA, with a content-agnostic aggregation over all windows yielding a correlation of r ≈ 0.22. Building on this foundation, transcript data were processed into semantic embeddings and temporally aligned with synchrony windows, enabling a multimodal representation in which textual context modulates how window-level synchrony is aggregated over time. In the current implementation, not all MEA windows have a corresponding text segment, so windows without aligned transcripts are ignored when applying text-informed weighting. Evaluating a uniform (all-ones) aggregation under this constraint leads to a reduced MEA-TA association of r ≈ 0.13, compared to the r ≈ 0.22 obtained when all MEA windows are used. Within this constrained evaluation setting, however, our text-informed weighting scheme increases the correlation to r ≈ 0.18, suggesting that linguistic information helps to highlight synchrony segments that are more informative about alliance. While the overall performance of this preliminary implementation does not yet surpass the full-window MEA baseline, the results support the view that synchrony is not uniformly informative throughout an interaction and highlight the potential of window-level, context-aware multimodal modeling combined with improved textual coverage for capturing subtle relational processes in therapeutic settings.
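The difference between uniform and text-informed aggregation can be sketched as a weighted mean over window-level synchrony values, where windows without an aligned transcript are skipped. The weights below are made-up numbers. How the actual weights are derived from the semantic embeddings is not specified here, so this is only a minimal illustration of the aggregation and of the Pearson evaluation.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient r between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def aggregate_synchrony(window_sync, window_weights):
    """Weighted mean of window-level synchrony.

    A weight of None marks a window without an aligned transcript segment;
    such windows are ignored, as in the constrained setting described above.
    """
    pairs = [(s, w) for s, w in zip(window_sync, window_weights) if w is not None]
    total = sum(w for _, w in pairs)
    if total == 0:
        raise ValueError("no usable windows")
    return sum(s * w for s, w in pairs) / total


sync = [0.2, 0.6, 0.4, 0.8]             # hypothetical window-level synchrony
uniform = [1.0, None, 1.0, 1.0]         # all-ones over windows that have text
text_informed = [0.5, None, 1.0, 2.5]   # hypothetical text-derived weights
print(aggregate_synchrony(sync, uniform), aggregate_synchrony(sync, text_informed))
```

One session-level score per session would then be correlated with the observed TA scores using `pearson`.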

8.17.2 Psychiatric Diagnosis Classification​‌ through Temporal Behavioral Analysis​​

This sub-project focuses on automated psychiatric diagnosis through multimodal behavioral analysis of clinical interview videos, with the objective of distinguishing between depression and schizophrenia. We use a subset of the MEPHESTO dataset comprising 34 patients: 25 with depression and 9 with schizophrenia. The dataset includes manual behavioral annotations provided by expert clinical annotators, who labeled over 3,000 video segments with observable behaviors. The implemented system (see Figure 19) follows a 7-stage pipeline: (1) input data acquisition from MEPHESTO with pre-annotated transcriptions, (2) low-level extraction using OpenFace 3.0 (8 Action Units: AU01, AU02, AU04, AU06, AU07, AU12, AU14, AU45 + gaze + head pose + 8 emotions), MediaPipe Holistic (33 pose, 42 hand, 468 face landmarks), and Whisper for speech (1,842 features/frame), (3) temporal alignment with frame-level synchronization (±1 frame precision, 33 ms), (4) multi-scale windowing (5 s, 10 s, 30 s windows, 50% overlap) extracting 188 features across 24,588 windows, (5) temporal variability aggregation computing 6 statistics per feature (mean, standard deviation, coefficient of variation, minimum, maximum, range), (6) feature selection via ANOVA F-test selecting the top 20 features (70% speech-based, 30% visual), and (7) classification with a random forest, using leave-one-out cross-validation across the 13 tested methods.
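Stages (4) and (5) above can be sketched for a single feature track as follows. The frame rate of 30 fps is an assumption inferred from the 33 ms frame precision, and the use of the population standard deviation is an arbitrary choice for this illustration. Neither detail is confirmed by the pipeline description.

```python
import statistics


def sliding_windows(series, fps, window_s, overlap=0.5):
    """Cut a per-frame feature track into fixed-length windows with overlap."""
    size = int(window_s * fps)
    step = max(1, int(size * (1 - overlap)))
    return [series[s:s + size] for s in range(0, len(series) - size + 1, step)]


def variability_stats(window):
    """The six statistics named in stage (5), computed for one window."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)  # population standard deviation (assumption)
    lo, hi = min(window), max(window)
    return {
        "mean": mean,
        "std": std,
        "cv": std / mean if mean else 0.0,  # coefficient of variation
        "min": lo,
        "max": hi,
        "range": hi - lo,
    }


FPS = 30                                                  # assumed frame rate
track = [float(i % 7) for i in range(60 * FPS)]           # 60 s of a dummy feature
for window_s in (5, 10, 30):                              # multi-scale windowing
    windows = sliding_windows(track, FPS, window_s)
    print(window_s, len(windows), variability_stats(windows[0])["range"])
```

Applying `variability_stats` to every feature in every window yields the aggregated matrix that a feature selector and classifier would consume downstream.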

Figure 19

The image depicts a​​​‌ process for diagnosing psychiatric​ conditions (Depression vs. Schizophrenia)​‌ using a baseline random​​ forest model with feature​​ fusion. It involves extracting​​​‌ multi-modal features from patient‌ interview videos using three‌​‌ pipelines: OpenFace 3.0 for​​ facial actions and gaze,​​​‌ MediaPipe for body and‌ hand movements, and Whisper‌​‌ with speech analysis for​​ speech features. These features​​​‌ are temporally windowed, fused,‌ and statistically analyzed. Feature‌​‌ selection is performed using​​ ANOVA F-test, reducing the​​​‌ dataset to 20 features.‌ A random forest classifier‌​‌ is trained and validated​​ using leave-one-out cross-validation, achieving​​​‌ 94.1% accuracy. A confusion‌ matrix and top feature‌​‌ importance are displayed, highlighting​​ the most influential features​​​‌ for diagnosis. (Description generated‌ at January 15th, 2026‌​‌ by Albert AI with​​ the model Mistral-Small-3.2-24B)

Figure 19: This architecture diagram illustrates a multimodal machine learning pipeline for binary psychiatric diagnosis (depression/schizophrenia) from clinical interview videos. The system combines three parallel feature extraction pipelines: OpenFace 3.0 for facial action units and gaze, MediaPipe for body pose and hand movements, and Whisper for speech transcription and linguistic analysis. Features are extracted across multi-scale temporal windows with statistical aggregations to capture temporal variability patterns. After feature fusion into a unified matrix, an ANOVA F-test ranks features by discriminative power, the top 20 are selected, and predictions are made by a random forest classifier.

The random forest achieves 94.1% accuracy (32 of 34 patients correctly classified), with only two schizophrenia patients misclassified. The top discriminative feature is the standard deviation of the patient's incomplete utterances. During our experiments, we found that temporal variability is the critical discriminative marker, that speech features dominate the top-20 features (70%), that feature fusion outperforms modality separation, and that traditional machine learning beats deep learning on small datasets. In the future, we will focus on temporal trauma detection in long untrimmed clinical interviews.

8.17.3 Childhood Trauma Affects​​​‌ Speech and Language Measures‌ in Patients with Major‌​‌ Depressive Disorder during Clinical​​ Interviews

Speech analysis has shown significant promise as a potential biomarker for depression. However, no studies to date have examined the impact of childhood trauma on speech and language patterns in individuals with depression 32. This study aims to explore the relationship between vocal characteristics and depressive symptoms, while also assessing how childhood trauma may shape these patterns. Twenty-seven participants with a major depressive episode were included. The severity of depression was assessed using the Montgomery–Åsberg Depression Rating Scale (MADRS) and the Beck Depression Inventory II. Childhood trauma was measured using the Childhood Trauma Questionnaire. Speech recordings from the MADRS semi-structured interview and a free clinical interview were analyzed using speaker diarization, automatic speech recognition, and feature extraction.

Several acoustic features were significantly associated with depression severity. Correlation analysis revealed that greater depression severity was linked to shorter, less diverse speech, characterized by fewer words, fewer semantic clusters, and reduced articulatory effort. In contrast, childhood trauma was positively associated with distinct speech characteristics. Higher trauma load was associated with richer, longer, and more syntactically complex speech. Additionally, utterances were shorter, with more frequent shifts between semantic clusters, reflecting a more fragmented speech pattern influenced by traumatic load. Our study highlights the influence of childhood trauma on vocal and linguistic characteristics of patients with depression. Automated language analysis offers the possibility to identify biomarkers of traumatic load in patients with depression. This could improve diagnostic accuracy, guide therapeutic management and monitor clinical progress.

8.18 MultiMediate'25: Cross-Cultural​‌ Multi-domain Engagement Estimation

Participants:​​ Michal Balazia, Francois​​​‌ Bremond.

Estimating momentary​ conversational engagement is central​‌ to assistive, socially aware​​ AI systems, yet models​​​‌ are typically trained and​ evaluated within a single​‌ domain, limiting real-world robustness.​​ The MultiMediate'25 challenge 47​​​‌ advances engagement estimation to​ more challenging, cross-cultural, and​‌ multi-domain settings. Building on​​ prior challenge editions, we​​​‌ expand beyond NOXI and​ MPIIGroupInteraction (see Figure 20​‌) as the sole​​ training source by introducing​​​‌ NOXI-J, a new multilingual​ corpus covering Japanese and​‌ Chinese interactions, enabling both​​ training and evaluation in​​​‌ diverse linguistic contexts. Although​ NOXI-J conceptually extends NOXI,​‌ we treat it as​​ a distinct domain because​​​‌ linguistic, cultural, capture, and​ annotation differences induce measurable​‌ distribution shifts. MultiMediate'25 continues​​ all previously defined tasks​​​‌ and creates another task:​ Cross-cultural Multi-domain Engagement Estimation.​‌

In this work, we​​ present new annotations, precomputed​​​‌ multi-modal features (visual, vocal,​ and verbal), baseline evaluations,​‌ and an analysis of​​ the best performing challenge​​​‌ solutions. Beyond accuracy, we​ quantify fairness using conditional​‌ demographic disparity for gender​​ and language. Our baselines​​​‌ confirm strong in-domain performance​ (e.g., paralinguistic eGeMAPS and​‌ video-transformer features) and reveal​​ notable cross-domain drops, underscoring​​​‌ the challenge of cultural,​ linguistic, and interactional shifts.​‌ Fairness analyses indicate generally​​ small discrepancies for our​​​‌ baselines. We observe the​ largest disparities for the​‌ proposed challenge solutions on​​ the Chinese language test​​​‌ set. All annotations, features,​ code, and leaderboards are​‌ made publicly available to​​ foster sustained progress on​​​‌ robust and fair engagement​ estimation.

Figure 20.a
Figure 20.b

The image consists​‌ of three side-by-side photos​​ of a man in​​​‌ different poses. In the​ first photo, he is​‌ talking on a phone,​​ in the second he​​​‌ is standing relaxed, and​ in the third he​‌ is gesturing with his​​ hands. Below each photo,​​​‌ there is a graphical​ representation indicating some form​‌ of measured data, possibly​​ volume or sound levels.​​​‌ Each graph has a​ red vertical line and​‌ a yellow-shaded area with​​ varying black line patterns.​​​‌ (Description generated at January​ 14th, 2026 by Albert​‌ AI with the model​​ Mistral-Small-3.2-24B)

Figure 20:​‌ Left: Snapshots of scenes​​ of a participant in​​ the NOXI corpus being​​​‌ disengaged, neutral and highly‌ engaged. Right: Setup of‌​‌ the MPIIGroupInteraction dataset.

As training datasets, we provide NOXI and NOXI-J to our participants. NOXI is a corpus of dyadic, screen-mediated face-to-face interactions in an expert-novice knowledge sharing context. In a session, one participant assumes the role of an expert and the other participant the role of a novice. NOXI includes interactions recorded at three locations (France, Germany and UK), spoken in seven languages (English, French, German, Spanish, Indonesian, Arabic and Italian), discussing a wide range of topics. The languages Indonesian, Arabic, Spanish, and Italian serve as an out-of-domain evaluation set. NOXI is extended by NOXI-J, consisting of 66 dyadic interactions and over 16 hours of material recorded with the same setup as the original NOXI. NOXI-J features 48 interactions in Japanese with native Japanese speakers and 18 interactions in Chinese with native Chinese speakers. See Table 1 for the train-validation-test split.

Table 1: Engagement estimation‌​‌ datasets used in the​​ MultiMediate'25 challenge. Languages covered​​​‌ by each dataset are‌ given in italics, with‌​‌ the respective number of​​ interactions in parentheses.
Training​​​‌ Data Validation Data Test‌ Data
NOXI NOXI NOXI‌​‌
English (23), French (7),​​ German (8) English (3),​​​‌ French (4), German (3)‌ English (6), French (6),‌​‌ German (4)
NOXI (additional​​ test languages)
Arabic (2),​​​‌ Italian (2), Indonesian (4),‌ Spanish (4)
MPIIGroupInteraction MPIIGroupInteraction‌​‌
German (6) German (6)​​
NOXI-J NOXI-J NOXI-J
Japanese​​​‌ (21), Chinese (10) Japanese‌ (6), Chinese (4) Japanese‌​‌ (6), Chinese (4)

The task is frame-wise prediction of each interlocutor's engagement on a continuous scale [0, 1]. Accuracy is measured with the Concordance Correlation Coefficient (CCC), ranging from −1 to +1. Participants are free to use the provided labeled data for training and validation, and their submissions undergo in-domain and out-of-domain evaluation on NOXI, NOXI-J, NOXI (additional languages), and MPIIGroupInteraction. We provide a multi-modal set of precomputed features to participants. From the audio signal, we provide transcripts generated with the Whisper model. Additionally, we supply GeMAPS features along with wav2vec 2.0 embeddings. From the video, we provide the backbone embeddings of Video Swin Transformer, DINOv2, CLIP and VideoMAEv2 and the outputs of OpenFace and OpenPose to cover facial as well as body behaviors.
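As a reminder of how the evaluation metric behaves, here is a minimal sketch of the standard Concordance Correlation Coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population moments. The engagement values are made up, and how the challenge aggregates scores across sessions is not covered here.

```python
def ccc(pred, gold):
    """Concordance Correlation Coefficient between predictions and annotations."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    vp = sum((p - mp) ** 2 for p in pred) / n
    vg = sum((g - mg) ** 2 for g in gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold)) / n
    denom = vp + vg + (mp - mg) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


gold = [0.1, 0.5, 0.9]               # hypothetical frame-wise engagement labels
shifted = [g + 0.2 for g in gold]    # perfectly correlated but biased predictions
print(ccc(gold, gold), ccc(shifted, gold), ccc(gold[::-1], gold))
```

Unlike Pearson's r, the CCC penalizes the constant offset in `shifted`, so a model must match the scale and level of the annotations, not only their trend.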

8.19 Stress Estimation​​ in Dancers for Injury​​​‌ Prevention

Participants: Dian-Wei Lai‌, Quentin Merilleau,‌​‌ Aowen Shi, Francois​​ Bremond.

Detecting stress​​​‌ in dancers is important,‌ as high stress levels‌​‌ are often related to​​ fatigue and injuries, which​​​‌ can negatively affect both‌ performance and health. However,‌​‌ stress detection itself is​​ not an easy task.​​​‌ This becomes even more‌ challenging when using indirect‌​‌ and non-invasive data such​​ as video. Although video​​​‌ is one of the‌ most commonly available modalities,‌​‌ extracting reliable stress information​​ from it remains highly​​​‌ challenging.

In this work,‌ we investigate automatic stress‌​‌ estimation from dance videos​​ using a small, weakly​​​‌ labeled dataset collected from‌ professional dancers at Université‌​‌ Côte d’Azur. Each dancer​​​‌ performs the same dance​ under three different difficulty​‌ levels and in different​​ scenes. The dataset currently​​​‌ includes 84 dancers, with​ two camera views (front​‌ view and diagonal view).​​ Each video is approximately​​​‌ 1 to 2 minutes​ long. Data collection is​‌ still ongoing to further​​ enrich the dataset and​​​‌ improve the reliability of​ the stress score distributions,​‌ PDF and CDF curves​​ in Figure 21.​​​‌

Figure 21

The image shows a​ study with 84 dancers​‌ performing three exercises of​​ varying difficulty: Easy (Exercise​​​‌ 1), Intermediate (Exercise 2),​ and Hard (Exercise 3).​‌ The right side displays​​ histograms and cumulative distribution​​​‌ functions (CDFs) of stress​ levels assessed by judges​‌ for each exercise. The​​ histograms show the distribution​​​‌ of stress scores, while​ the CDFs illustrate the​‌ cumulative probabilities of these​​ scores. Stress levels appear​​​‌ to increase with the​ difficulty of the exercise.​‌ (Description generated at January​​ 15th, 2026 by Albert​​​‌ AI with the model​ Mistral-Small-3.2-24B)

Figure 21:​‌ Dataset overview for stress​​ estimation in dancers. Left:​​​‌ three exercise difficulty levels​ performed by 84 dancers.​‌ Right: stress score distributions​​ for each exercise shown​​​‌ as PDF and CDF​ curves.

To obtain meaningful results with limited data, we leverage models pretrained on large-scale video and motion datasets to improve feature representations. We then study the contribution of different visual modalities: RGB, skeleton poses extracted with several methods that provide richer joint and hand motion information (see Figure 22), depth, and optical flow. By analyzing each modality separately and in combination, we aim to build a robust multi-modal pipeline for stress estimation and to identify which modalities and movement cues are most informative for stress prediction.

Figure 22

The image contains​​​‌ a series of four​ photos showing a person​‌ in a dynamic pose,​​ with different colored lines​​​‌ and dots overlaid on​ the person's body. These​‌ lines and dots appear​​ to represent a pose​​​‌ estimation or skeletal tracking​ system, mapping key points​‌ such as the head,​​ shoulders, elbows, wrists, hips,​​​‌ knees, and ankles. The​ person is dressed in​‌ a long-sleeved shirt and​​ pants. The background shows​​​‌ a room with a​ partially visible doorway and​‌ paintings on the wall.​​ (Description generated at January​​​‌ 15th, 2026 by Albert​ AI with the model​‌ Mistral-Small-3.2-24B)

Figure 22:​​ Pose extraction methods comparison.​​​‌ From left to right:​ the original video frame,​‌ YOLO-Pose, Posetics, and OpenPose​​ (body + hand). These​​​‌ methods are used to​ capture joint-level features and​‌ characterize dancer movements. The​​ dancer’s face is masked​​​‌ to preserve privacy.

8.20​ Emotion recognition using Deep​‌ Learning

Participants: Valeriya Strizhkova​​, Antitza Dantcheva,​​​‌ Francois Bremond.

Understanding human emotions is crucial in healthcare, human-robot interaction, and marketing. Despite progress in emotion recognition from a single modality, such as a facial video or a sequence of physiological signals, it remains challenging to improve performance by combining multiple modalities. Moreover, it is difficult to recognize emotions in long sequential data, such as long videos, although most real-world videos of people expressing emotions are long. Existing emotion datasets are limited in volume and quality, making it difficult to develop an effective deep learning-based emotion recognition system. An effective real-world emotion understanding system should be able to recognize emotions from long videos synchronized with multiple modalities. In this thesis 52, we focus on multimodal emotion recognition from long videos synchronized with physiological signals. Specifically, multimodal emotion recognition methods face three main challenges: (a) learning the emotion representation, (b) learning the representation of fine-grained emotions, and (c) combining modalities to predict emotions. We first introduce two large behavior analysis datasets: INEMO and StressID. INEMO is a multimodal dataset designed to facilitate emotion recognition from watching social media videos. StressID is a multimodal dataset designed for stress identification. Secondly, we propose two pre-training techniques for facial expression recognition: (1) supervised pre-training on synthetic data generated by our video generation method and (2) self-supervised pre-training on multi-view videos. We show that the proposed pre-training techniques yield rich facial representations, which improve fine-grained emotion recognition accuracy. Thirdly, we tackle the problem of emotion recognition from multiple modalities.
We propose a framework for multimodal fusion of videos and physiological signals to predict emotions. This framework consists mainly of two steps: (1) extracting features from long raw videos and physiological signals; (2) fusing the extracted features to predict emotions using a cross-modality approach based on an attention mechanism. Our methods leverage the additional modalities, resulting in better emotion recognition performance. They have been extensively evaluated on various emotion recognition benchmarks and outperform previous methods, pushing emotion recognition significantly closer to real-world deployment.
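
To make the second step more concrete, the sketch below shows one generic form of attention-based cross-modal fusion: each video time step queries the physiological sequence with scaled dot-product attention, and the attended vector is concatenated to the video feature. This is not the thesis implementation; the single-head design, the absence of learned projections, and the names `cross_attention` and `fuse` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def fuse(video_feats, physio_feats):
    """Concatenate each video feature with its attended physiological context."""
    attended = cross_attention(video_feats, physio_feats, physio_feats)
    return [v + a for v, a in zip(video_feats, attended)]


# Hypothetical toy features: 2 video steps and 3 physiological steps, dimension 2.
video = [[1.0, 0.0], [0.0, 1.0]]
physio = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
for row in fuse(video, physio):
    print(row)
```

Because the attention weights are computed per video step, the two modalities do not need the same sampling rate, which is convenient when a video stream and physiological sensors are only loosely synchronized.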

8.21 Identifying Surgical Instruments‌ in Pedagogical Cataract Surgery‌​‌ Videos through an Optimized​​ Aggregation Network

Participants: Sanya​​​‌ Sinha, Michal Balazia‌, Francois Bremond.‌​‌

Instructional cataract surgery videos are crucial for ophthalmologists and trainees to observe surgical details repeatedly. In 44, we present a deep learning model for real-time identification of surgical instruments in these videos, using a custom dataset scraped from open-access sources. Inspired by the architecture of YOLOv9, the model employs a Programmable Gradient Information (PGI) mechanism and a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN) to address the information bottleneck problem, enhancing mean Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores.

The Go-ELAN YOLOv9 architecture (see Figure 23) contains an auxiliary block that implements the Programmable Gradient Information (PGI) concept: it creates an auxiliary reverse branch that enables reliable gradient calculation by avoiding potential semantic loss. The GELAN block in the backbone feature extractor is replaced by the Go-ELAN block proposed in this paper. The Spatial Pyramid Pooling block SPPELAN removes the fixed-size limitation of the backbone. The ADown block downsamples the generated feature maps to target sizes. The CBLinear blocks extract higher-level features from the images, and the CBFuse block fuses these extracted features. The Neck combines the acquired features and the Head predicts the final bounding box outputs with their respective probabilities.

Figure 23

The image​‌ depicts a neural network​​ architecture with sections labeled​​​‌ Auxiliary, Backbone, Neck, and​ Head. The Backbone starts​‌ with a Silence layer,​​ followed by convolutional (Conv)​​​‌ layers, multiple Go-ELAN blocks,​ and ADown layers, ending​‌ with an SPPELAN block.​​ The Auxiliary section mirrors​​​‌ parts of the Backbone​ with convolutional layers, Go-ELAN​‌ blocks, and ADown layers,​​ and includes CBFuse and​​​‌ CBLinear blocks connecting to​ the Backbone. Both the​‌ Auxiliary and Backbone sections​​ feed into the Neck,​​​‌ which connects to the​ final Head section, consisting​‌ of Detect blocks for​​ making predictions. Connections between​​​‌ blocks are indicated by​ arrows showing data flow.​‌ (Description generated at January​​ 15th, 2026 by Albert​​​‌ AI with the model​ Mistral-Small-3.2-24B)

Figure 23: Architecture of Go-ELAN YOLOv9.

Our Go-ELAN YOLOv9 model,​​​‌ evaluated against YOLO v5,​ v7, v8, v9 vanilla,​‌ Laptool and DETR, achieves​​ a superior mAP of​​​‌ 73.74 at IoU 0.5​ on a dataset of​‌ 615 images with 10​​ instrument classes, demonstrating the​​​‌ effectiveness of the proposed​ model. To illustrate the​‌ visual and qualitative superiority​​ of our model, we​​​‌ have compared 12 ground-truth​ images with their respective​‌ model predictions in Figure​​ 24.

Figure 24

The image​​​‌ shows a collection of​ medical procedure photos, likely​‌ from eye surgeries. Each​​ image is labeled with​​​‌ various surgical tools such​ as speculum, cannula, forceps,​‌ hook, phacoprobe, and keratome.​​ The tools are highlighted​​​‌ with colored boxes and​ labels, indicating their position​‌ and type. The images​​ are arranged in a​​​‌ grid format, displaying different​ stages of the procedure.​‌ The eye is opened​​ with a speculum, and​​​‌ various tools are used​ for precise surgical actions.​‌ The photos include confidence​​ levels for the identified​​​‌ tools. (Description generated at​ January 15th, 2026 by​‌ Albert AI with the​​ model Mistral-Small-3.2-24B)

Figure 24: Qualitative examination of model performance. Rows 1 and 3 show the ground-truth labels, while rows 2 and 4 show the corresponding predictions.
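
Since the reported mAP is computed at an IoU threshold of 0.5, a short reminder of how Intersection over Union is obtained for two axis-aligned boxes may help. The sketch below is a generic illustration, not the evaluation code used in the paper; the `(x1, y1, x2, y2)` box format and the helper names are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a hit when its IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(matches((0, 0, 2, 2), (0, 0, 2, 3)))
```

Average precision is then obtained per class from the ranked list of hits and misses, and mAP is the mean over the instrument classes.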

8.22​ TBDM: Temporal Boundary Distillation​‌ Module for Surgical Gesture​​ Segmentation

This work was​​​‌ funded by 3IA Côte​ d'Azur.

Participants: Ezem Sura​‌ Ekmekci, Snehashis Majhi​​, Khodor Hamadi,​​​‌ Francois Bremond.

In​ 2025, in collaboration with​‌ CHU Nice and Caranx​​ Medical, a novel framework​​​‌ for surgical gesture segmentation​ was developed that addresses​‌ the challenging problem of​​ precise temporal localization during​​​‌ surgical action transitions. This​ work introduces the Temporal​‌ Boundary Distillation Module (TBDM),​​ an innovative approach that​​​‌ explicitly models temporal boundaries​ between surgical gestures using​‌ RGB-only video data (see​​ Figure  25). The​​​‌ framework employs knowledge distillation​ to learn boundary-aware features​‌ during training through cross-attention​​ mechanisms, while requiring no​​​‌ additional computational overhead at​ inference. TBDM was validated​‌ on two major surgical​​ datasets (CholecT50 and RARP-45),​​​‌ demonstrating consistent improvements across​ multiple baseline architectures, with​‌ up to +8.5 edit​​ score improvement on CholecT50.​​​‌ Notably, the approach achieved​ state-of-the-art performance on RARP-45​‌ (81.4 edit score, 77.9​​ F1@50), establishing TBDM as​​ a generalizable, plug-and-play solution​​​‌ for fine-grained surgical workflow‌ analysis. This work has‌​‌ been submitted to IPCAI​​ 2026.
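
The edit score quoted above is not defined in this report. In the action segmentation literature it usually denotes a segmental, Levenshtein-based score, and the sketch below follows that common definition under the assumption that it is the one used here. Frame-wise labels are first collapsed into an ordered list of segments, so the score penalizes over-segmentation and ordering errors rather than small boundary shifts.

```python
from itertools import groupby


def segments(frame_labels):
    """Collapse consecutive identical frame labels into one segment label."""
    return [label for label, _ in groupby(frame_labels)]


def levenshtein(a, b):
    """Edit distance between two label sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def edit_score(pred_frames, gt_frames):
    """Segmental edit score in [0, 100]; higher is better."""
    p, g = segments(pred_frames), segments(gt_frames)
    if not p and not g:
        return 100.0
    return 100.0 * (1.0 - levenshtein(p, g) / max(len(p), len(g)))


gt = list("AAABBBCCC")
print(edit_score(list("AABBBBCCC"), gt))  # shifted boundary, same segment order
print(edit_score(list("AAABABCCC"), gt))  # spurious extra segments
```

A shifted boundary leaves the segment order intact, whereas a brief flicker between two gestures inserts extra segments and lowers the score.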

Additionally, a comprehensive​​​‌ evaluation of YOLOv8 for‌ real-time surgical instrument recognition‌​‌ in robot-assisted and laparoscopic​​ surgeries was conducted 33​​​‌. Using a diverse‌ multi-source dataset of over‌​‌ 7,400 frames and 17,175​​ annotations, the model achieved​​​‌ a mean average precision‌ of 0.77 for binary‌​‌ detection and 0.72 for​​ multi-instrument classification across seven​​​‌ instrument types. The segmentation‌ performance demonstrated excellent accuracy‌​‌ with a mean Dice​​ score of 0.91 and​​​‌ mean intersection over union‌ of 0.86. With an‌​‌ inference speed of 1.12​​ milliseconds per frame, the​​​‌ model shows strong potential‌ for real-time clinical applications‌​‌ in surgical workflow analysis​​ and instrument tracking.
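
As a reading aid for the segmentation figures above, the sketch below computes the Dice score and the IoU of two binary masks and converts a per-frame latency into a frame rate. It is a generic illustration with hypothetical helper names, not the evaluation code of the study; masks are represented as flat lists of 0/1 values for simplicity.

```python
def dice_and_iou(pred_mask, gt_mask):
    """Dice score and IoU of two flat binary masks of equal length."""
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    n_pred = sum(1 for p in pred_mask if p)
    n_gt = sum(1 for g in gt_mask if g)
    union = n_pred + n_gt - inter
    # Two empty masks are treated as a perfect match (an assumption).
    dice = 2.0 * inter / (n_pred + n_gt) if (n_pred + n_gt) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


def frames_per_second(ms_per_frame):
    """Convert an inference latency in milliseconds into frames per second."""
    return 1000.0 / ms_per_frame


print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
print(round(frames_per_second(1.12)))
```

For a single mask pair the two scores are linked by Dice = 2 IoU / (1 + IoU), so Dice is never lower than IoU; averages over a dataset need not satisfy this relation exactly.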

Figure 25

The​​​‌ image depicts a machine‌ learning framework for gesture‌​‌ recognition. The process starts​​ with a sequence of​​​‌ video frames input into‌ a pre-trained VideoMAE-v2 model‌​‌ with frozen parameters. The​​ extracted features are then​​​‌ passed to a projection‌ layer and a temporal‌​‌ model for prediction. Additionally,​​ there is a temporal​​​‌ boundary distillation module that‌ only operates during training.‌​‌ This module uses cross-attention​​ mechanisms and class presence​​​‌ maps to aggregate gesture‌ class information. This module‌​‌ helps in refining the​​ model's decision boundaries through​​​‌ a distillation loss calculated‌ between the projection features‌​‌ and the distilled boundary​​ features. The framework aims​​​‌ to improve gesture recognition‌ accuracy by leveraging temporal‌​‌ information and class distinctions.​​ (Description generated at January​​​‌ 19th, 2026 by Albert‌ AI with the model‌​‌ Mistral-Small-3.2-24B)

Figure 25:​​ Overview of TBDM framework​​​‌ for Surgical Gesture Segmentation.‌ During training, the boundary‌​‌ blocks generate boundary-aware features​​ to guide the projection​​​‌ layer. During inference, only‌ the trained projection layer‌​‌ is used, adding no​​ extra cost, while achieving​​​‌ significant boundary precision in‌ segmentation.

8.23 Effective Video‌​‌ Feature Extraction for Training​​ and Comprehension: Human-Centered Multimodal​​​‌ Video

Participants: Tanay Agrawal‌, Antitza Dantcheva,‌​‌ Francois Bremond.

Understanding​​ actions in videos is​​​‌ a crucial element of‌ computer vision, with significant‌​‌ implications in many fields.​​ Given our increasing reliance​​​‌ on visual data, understanding‌ and interpreting human actions‌​‌ in videos are becoming​​ essential for developing technologies​​​‌ in surveillance, healthcare, autonomous‌ systems, and human-computer interaction.‌​‌ Accurate interpretation of actions​​ in videos is fundamental​​​‌ to creating intelligent systems‌ capable of navigating and‌​‌ responding effectively to the​​ complexities of the real​​​‌ world. In this context,‌ advances in action understanding‌​‌ are pushing the boundaries​​ of computer vision and​​​‌ playing a crucial role‌ in the development of‌​‌ cutting-edge applications that impact​​ our daily lives.

Computer vision has seen significant progress thanks to the rise of deep learning methods such as convolutional neural networks (CNNs) and transformers, which have enabled the community to advance in many areas, including image segmentation, object detection, and scene understanding. However, video processing remains limited compared to static images. In this thesis, we focus on video understanding, dividing it into two main parts, video classification and action detection, and on their application in affective computing, particularly in interaction-based scenarios. We explore efficient learning approaches for video feature extraction in various video classification and interaction understanding tasks. Our contributions 51 cover the computation of intermediate-level features for faster convergence, plugin adaptation for handling diverse datasets and modalities, and evolutionary temporal modeling for understanding long videos. We begin by improving personality and behavior recognition through geometry-based behavioral coding and segmentation-driven attention mechanisms. We then address the challenges of modality availability and data diversity using knowledge distillation and a novel adapter-based cross-learning framework that generalizes to all tasks. Finally, we tackle the analysis of long videos for temporal action detection using temporal adapters with image models, as well as modular adapters and a two-stage spatiotemporal learning strategy with a video basis. Together, this work contributes to building generalizable and efficient learning systems for a wide range of video understanding applications.

8.24 Rotation-Induced​‌ Centroid Shift in Latent​​ Space

Participants: Benoit Lagadec​​​‌, Matthieu Saumard,​ Francois Bremond.

Convolutional neural networks are not rotation-equivariant in practice: discrete image rotation requires interpolation and zero-padding, which makes the rotation operator non-invertible and prevents convolution and rotation from commuting. We formalize discrete in-plane rotation on pixel grids as a degraded permutation and show that it leads to a systematic and measurable shift of the feature-space centroid when images are rotated, even when the model is trained with standard rotation augmentation. We derive this centroid drift analytically and verify it empirically. To mitigate this effect, we introduce a set of angle-specialized Exponential Moving Average (EMA) teachers that provide stable feature anchors at different rotation angles, optionally enhanced with low-rank angle adapters. This approach directly suppresses rotation-induced centroid shift and significantly improves feature consistency and classification accuracy under rotation, outperforming both classical augmentation and mean-teacher baselines while requiring minimal additional computation. Many studies are dedicated to finding invariants for detection. An illustration is given in Figure 26.

Figure 26.a
Figure 26.b

This image​​ illustrates a series of​​​‌ transformations applied to a​ grid of pixels. 1.​‌ **Original Grid**: A small​​ grid with green highlighted​​​‌ pixels. 2. **Add Padding**:​ The grid is expanded​‌ with black padding around​​ it. 3. **Rotation 45​​​‌ Degrees**: The grid is​ rotated by 45 degrees​‌ within the padded area.​​ 4. **Real/Numerical Transformation**: The​​​‌ rotated grid is transformed​ into a new format​‌ while retaining its pixel​​ structure. 5. **Flatten Operation​​​‌ (Vectorization)**: Both the original​ and transformed grids are​‌ flattened into a linear​​ vector format. 6. **Re-ordering**:​​​‌ Pixels are re-ordered, with​ some pixels not affected​‌ by the transformation. 7.​​ **Limited Permutation**: The vectors​​​‌ are permuted to demonstrate​ how a rotation translates​‌ to a simple permutation​​ of pixels. 8. **Final​​​‌ Sub-Space**: After keeping only​ specific pixels, a sub-space​‌ of the image is​​ shown, illustrating that rotation​​ can be represented by​​​‌ a simple permutation of‌ pixels. (Description generated at‌​‌ January 15th, 2026 by​​ Albert AI with the​​​‌ model Mistral-Small-3.2-24B)

Figure 26: Left: illustration of the approximation error introduced by the rotation transformation. Right: the EMA mechanism corrects this approximation error.

Convolutional neural networks​​ are often assumed to​​​‌ be robust to rotations‌ when trained with rotation‌​‌ augmentations. However, this assumption​​ overlooks a key property​​​‌ of real image rotations:‌ discrete rotation on a‌​‌ pixel grid is implemented​​ using interpolation and padding,​​​‌ making the operation non-invertible‌ and causing it to‌​‌ not commute with convolution.​​ As a result, rotating​​​‌ an input image and‌ then extracting features is‌​‌ not equivalent to extracting​​ features and then rotating​​​‌ them. We show that‌ this mismatch induces a‌​‌ systematic and predictable shift​​ in the feature-space centroid​​​‌ across rotation angles, even‌ when the network is‌​‌ trained with extensive rotation​​ augmentation.

This observation reframes​​​‌ rotation robustness as a‌ problem of representation geometry‌​‌, rather than data​​ diversity alone. If rotation​​​‌ induces angle-dependent sub-clusters in‌ feature space, enforcing global‌​‌ consistency (e.g., with a​​ single Mean Teacher model)​​​‌ can suppress meaningful structure‌ and lead to underfitting.‌​‌ We therefore propose a​​ simple alternative: a set​​​‌ of angle-specialized EMA teachers‌ that provide stable feature‌​‌ targets at different rotation​​ angles, coupled with a​​​‌ feature-space centroid alignment loss‌ that prevents rotation-induced drift‌​‌ without collapsing intra-class variability.​​ Our approach is architecture-agnostic,​​​‌ computationally lightweight, and complementary‌ to standard training pipelines.‌​‌ It improves rotation robustness​​ in both classification and​​​‌ detection settings without requiring‌ group-equivariant architectures or spatial‌​‌ transformer modules. The core​​ contribution of this work​​​‌ is to characterize the‌ geometric effect of discrete‌​‌ rotation in CNN feature​​ space and to introduce​​​‌ a training strategy that‌ explicitly stabilizes this geometry.‌​‌
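
The two ingredients named above, angle-specialized EMA teachers and a centroid alignment loss, can be sketched in a few lines. The code below is a toy illustration and not the method's implementation: parameters are flat lists of floats, the momentum value is arbitrary, each teacher is assumed to be updated only from batches rotated by its own angle, and the loss is assumed to be the squared Euclidean distance between feature centroids.

```python
def ema_update(teacher, student, momentum=0.99):
    """Standard EMA rule: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def update_teacher_for_angle(teachers, angle, student, momentum=0.99):
    """Update only the teacher specialised for the angle of the current batch."""
    updated = dict(teachers)
    updated[angle] = ema_update(teachers[angle], student, momentum)
    return updated


def centroid(features):
    """Mean feature vector of a batch (list of equal-length vectors)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def centroid_alignment_loss(student_feats, teacher_feats):
    """Squared distance between the student and teacher batch centroids."""
    cs, ct = centroid(student_feats), centroid(teacher_feats)
    return sum((a - b) ** 2 for a, b in zip(cs, ct))


# Hypothetical setup: one teacher per rotation angle, all starting from the student.
student = [0.0, 2.0]
teachers = {angle: [1.0, 1.0] for angle in (0, 90, 180, 270)}
teachers = update_teacher_for_angle(teachers, 90, student, momentum=0.5)
print(teachers[90], teachers[0])
print(centroid_alignment_loss([[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]]))
```

Aligning centroids rather than individual features constrains only the batch mean, which is one way to limit drift without forcing all samples of a class onto a single point.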

Applied to detection (see Figure 26), our method ensures that rotation-induced feature sub-clusters remain compact and aligned. This contrasts with our former work, which used a related mechanism in person re-identification to enlarge inter-cluster separation; here, the objective is to preserve sub-cluster coherence.

8.25 Dual Volume Skeleton-Guided​​ 3D Face Reconstruction from​​​‌ Sparse Views

Participants: Benoit​ Lagadec, Seongro Yoon​‌, Francois Bremond.​​

Reconstructing high-fidelity 3D face​​​‌ meshes from sparse 2D​ inputs is challenging due​‌ to limited depth cues​​ and structural ambiguity. We​​​‌ present a skeleton-guided, dual-volume​ diffusion framework for reconstructing​‌ editable, high-fidelity 3D face​​ meshes from only two​​​‌ sparse views, see Figure​ 27. By integrating​‌ part-level latent diffusion with​​ skeleton-based conditioning and symmetry-aware​​​‌ dual-volume packing, our approach​ preserves pose-consistent geometry, enables​‌ part-aware editing, and maintains​​ bilateral alignment. A teacher–student​​​‌ strategy with multi-view consistency​ further improves stability and​‌ fidelity, yielding significant gains​​ over state-of-the-art baselines. Our​​​‌ contributions:

• A skeleton-conditioned​ diffusion pipeline that injects​‌ explicit structural priors to​​ improve pose-consistent geometry under​​​‌ sparse views.

• A dual-volume latent representation, inspired by bipartite packing, enabling part-aware decoding and preventing fusion of contacting parts. It allows a partial view to be completed in the final face generation.

•​​​‌ A symmetry-aware objective coupling​ reconstruction accuracy and bilateral​‌ regularization for realistic midline​​ geometry.

• A self-supervised teacher–student strategy that enhances multi-view consistency.

Figure 27.a
Figure 27.b

The image​‌ depicts a process for​​ generating a 3D mesh​​​‌ from a 2D input​ image of a person's​‌ face. The process begins​​ with facial landmark detection.​​​‌ These key points are​ used to construct a​‌ 3D skeleton, which is​​ encoded into a clip.​​​‌ Using a dual U-Net-based​ diffusion model with skeleton​‌ conditioning, two 3D single​​ view generations are created.​​​‌ These views are decoded​ into volumes (left and​‌ right) and combined using​​ marching cubes with part​​​‌ mesh to produce the​ final 3D mesh. Loss​‌ functions like mutual contrastive​​ loss and symmetry losses​​​‌ are applied during the​ process to ensure accuracy​‌ and symmetry. (Description generated​​ at January 15th, 2026​​​‌ by Albert AI with​ the model Mistral-Small-3.2-24B)

Figure 27: Left: illustration of the workflow. Right: projection of 2D landmarks to guide the generation of the new mesh.

Given two input images (frontal and profile), we detect 2D landmarks to form a facial skeleton. The raw landmarks are replaced by this facial skeleton to make the diffusion-based generation more realistic. A skeleton encoder produces a latent embedding that conditions a dual-UNet diffusion backbone via adaptive normalization. The denoiser outputs two latent volumes (left/right), which are decoded by a 3D VAE into SDF (signed distance function)/occupancy grids. Marching Cubes extracts one mesh per side; parts remain disjoint via dual-volume packing, see Figure 28. A symmetry loss regularizes left/right consistency. A complete view of the architecture is shown in Figure 27.

Figure 28

The image displays​​​‌ three 3D models of‌ a human head and‌​‌ shoulders. The first model​​ is entirely red with​​​‌ a hat. The second‌ model is entirely blue‌​‌ with a hat. The​​ third model has the​​​‌ head in red and‌ the shoulders and upper‌​‌ body in blue. All​​ models are set against​​​‌ a grey grid background.‌ (Description generated at January‌​‌ 15th, 2026 by Albert​​ AI with the model​​​‌ Mistral-Small-3.2-24B)

Figure 28: Two editable meshes/views are stitched.

8.26 Turbo Learning:​​ 3D Face Reconstruction with​​​‌ Mesh Re-Projection and Re-Identification‌ Consistency

Participants: Benoit Lagadec‌​‌, Francois Bremond.​​

We introduce Turbo Learning​​​‌, a two-stage iterative‌ refinement framework for 3D‌​‌ face reconstruction inspired by​​ the positive-feedback dynamics of​​​‌ turbocharged engines. Traditional pipelines‌ rely on sparse supervisory‌​‌ cues such as 2D​​ landmarks, limiting their ability​​​‌ to recover accurate geometry.‌ Our approach instead uses‌​‌ self-generated 3D meshes as​​ progressively stronger priors: Stage​​​‌ 1 predicts a coarse‌ mesh guided by MediaPipe‌​‌ landmarks, while Stage 2​​ uses this mesh as​​​‌ dense geometric supervision.

To‌ further enhance identity preservation,‌​‌ we introduce a Mesh​​ Re-Projection and Re-Identification Consistency​​​‌ Loss. By re-projecting‌ meshes from both stages‌​‌ into image space and​​ applying an InfoNCE contrastive​​​‌ Re-ID objective, we enforce‌ identity stability across refinement‌​‌ steps. The combination of​​ a geometric turbo loop​​​‌ and an identity turbo‌ loop produces reconstructions that‌​‌ are more stable, more​​ detailed, and more identity-faithful.​​​‌
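
For reference, the InfoNCE objective mentioned above can be written compactly. The sketch below is a generic, batch-level version and not the Turbo Learning implementation: it assumes that row i of `anchors` and row i of `positives` are embeddings of the same identity (for instance, re-projections from Stage 1 and Stage 2), that all other rows act as negatives, and that cosine similarity with an arbitrary temperature is used.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] matches anchors[i], other rows are negatives."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)
        # Log-sum-exp over all candidates, computed stably.
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical identity embeddings for two faces.
stage1 = [[1.0, 0.0], [0.0, 1.0]]
stage2_same = [[0.9, 0.1], [0.1, 0.9]]
stage2_swapped = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(stage1, stage2_same) < info_nce(stage1, stage2_swapped))
```

The loss is low when each re-projected mesh is closest to its own identity and high when identities are confused, which is the behaviour a re-identification consistency term needs.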

We compare Turbo Learning‌ to classical iterative strategies‌​‌ such as EM, diffusion-based​​ refinement, boosting, and teacher–student​​​‌ systems, and show that‌ it occupies a distinctive‌​‌ position among them, see​​ Fig. 29.

Figure 29

The​​​‌ image depicts a two-step‌ process for a machine‌​‌ learning framework aimed at​​ improving 3D mesh reconstruction.​​​‌ In Step 1, the‌ input includes face and‌​‌ profile images, generating depth,​​ mask, and normal outputs.​​​‌ These are processed through‌ a dual uncertainty block‌​‌ and re-projected to calculate​​ re-projection loss and re-ID​​​‌ contrastive loss. Step 2‌ builds on this by‌​‌ refining the 3D mesh​​ guided by the initial​​​‌ output and additional ground‌ truth mesh, further minimizing‌​‌ re-projection loss. The dual​​ uncertainty block plays a​​​‌ central role in both‌ steps, ensuring accurate depth‌​‌ and geometric information. (Description​​ generated at January 15th,​​​‌ 2026 by Albert AI‌ with the model Mistral-Small-3.2-24B)‌​‌

Figure 29: At each step, the generated mesh is reused. A re-identification metric is computed to learn the input image.

8.27 THEval: Evaluation​​​‌ Framework for Talking Head‌ Video Generation

Participants: Nabyl‌​‌ Quignon, Baptiste Chopin​​, Yaohui Wang,​​​‌ Antitza Dantcheva.

Generative‌ models for talking head‌​‌ videos have witnessed remarkable​​ progress, achieving high-resolution and​​​‌ realistic results. However, evaluating‌ these models remains a‌​‌ significant challenge, as the​​​‌ rapid advancement in generation​ has outpaced the development​‌ of adequate metrics. Current​​ evaluations primarily rely on​​​‌ general image quality metrics​ or lip-synchronization scores, which​‌ often fail to capture​​ essential aspects of realism​​​‌ such as motion quality,​ temporal coherence, and naturalness.​‌ Furthermore, these existing metrics​​ have been shown to​​​‌ correlate poorly with human​ preferences, necessitating a more​‌ robust and perceptually aligned​​ evaluation approach.

Figure 30

Overview of​​​‌ the THEval scheme.

Figure​ 30: Overview of​‌ the THEval benchmark. Evaluating​​ 17 SOTA methods on​​​‌ 85,000 videos reveals that​ existing metrics align poorly​‌ with human ratings (red​​ box). We propose THEval​​​‌ (center), a framework with​ 8 metrics covering (i)​‌ quality, (ii) naturalness​​, and (iii) synchronization​​​‌. Our final score​ (green box) achieves a​‌ 0.870 correlation with human​​ preference.

We introduce THEval 56, a novel evaluation framework designed to address these limitations by aligning closely with human perception; a visual summary of the framework is shown in Figure 30. We support this framework with a new, challenging evaluation dataset comprising over 5,000 videos sourced from diverse YouTube channels, ensuring the content was unseen during model training to test generalization. The dataset features a wide range of languages, head poses, and expressions. To assess performance comprehensively, we decompose evaluation into three core dimensions (quality, naturalness, and synchronization) and use eight fine-grained metrics to analyze dynamics such as lip and head motion alongside global aesthetics.

To validate our framework, we conduct an extensive benchmark of 17 state-of-the-art audio- and video-driven models, generating and analyzing over 85,000 videos. We leverage a user study to demonstrate that our final composite score achieves a strong Spearman correlation of 0.870 with human ratings, significantly outperforming traditional metrics such as FID and SyncNet. By applying this pipeline, we find that while many current algorithms excel in lip synchronization, they continue to face challenges in generating expressive facial behavior and artifact-free details, establishing THEval as a vital tool for fostering future progress in the field.
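
The Spearman correlation used to compare a metric with human ratings is the Pearson correlation of the ranks. The sketch below is a generic implementation with average ranks for ties, given for illustration only; the toy scores are invented and the function names are not taken from the THEval code.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation of two equal-length score lists."""
    return pearson(ranks(x), ranks(y))


# Invented example: metric scores and mean human ratings for five models.
metric_scores = [0.61, 0.48, 0.77, 0.70, 0.90]
human_ratings = [2.9, 3.1, 3.8, 3.5, 4.6]
print(spearman(metric_scores, human_ratings))
```

Because only the ordering matters, the coefficient is insensitive to the scale of the metric, which makes it suitable for comparing heterogeneous scores against human preference.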

8.28​ Beyond Real versus Fake​‌ Towards Intent-Aware Video Analysis​​

Participants: Saurabh Atreya,​​​‌ Nabyl Quignon, Baptiste​ Chopin, Abhijit Das​‌, Antitza Dantcheva.​​

The rapid advancement of​​​‌ generative models has led​ to increasingly realistic deepfake​‌ videos, posing significant societal​​ and security risks. While​​​‌ existing detection methods focus​ primarily on distinguishing real​‌ from fake videos, such​​ approaches fail to address​​​‌ a fundamental question regarding​ the intent behind manipulated​‌ content. With the proliferation​​ of AI-generated media, the​​​‌ binary distinction of authenticity​ is becoming less relevant​‌ than understanding whether content​​ is malicious or benign.​​​‌ This shift necessitates a​ new paradigm in video​‌ analysis that moves beyond​​ artifact detection to the​​​‌ contextual understanding of underlying​ motivations.

Figure 31

Three-Way Contrastive Alignment​‌ Pipeline

Figure 31:​​ Three-Way Contrastive Alignment Pipeline.​​​‌ Overview of the proposed​ training methodology. The augmented​‌ dataset is encoded using​​ modality-specific encoders (CLIP for​​​‌ video, WavLM for audio,​ CLIP Text for text),​‌ projected into a shared​​ space, and aligned through​​ a three-way contrastive loss.​​​‌ The pretrained encoders are‌ then fine-tuned using a‌​‌ supervised MLP classifier for​​ intent prediction.

We introduce​​​‌ IntentHQ 53, a‌ new benchmark for human-centered‌​‌ intent analysis designed to​​ formalize the task of​​​‌ intent recognition. We curate‌ a comprehensive dataset of‌​‌ 5,168 videos, meticulously annotated​​ with 23 fine-grained intent​​​‌ categories such as "Financial‌ fraud", "Political propaganda", and‌​‌ "Comedy", organized under five​​ broader dimensions including Deception​​​‌ and Persuasion. To effectively‌ analyze these videos, we‌​‌ propose a novel self-supervised​​ learning framework (see Figure​​​‌ 31) that leverages‌ a three-way contrastive alignment‌​‌ strategy. This method jointly​​ aligns video, audio, and​​​‌ textual modalities, utilizing data‌ augmentation techniques like semantic‌​‌ paraphrasing and text-to-speech synthesis​​ to learn robust representations​​​‌ without relying on manual‌ labels during pretraining.
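The three-way alignment idea can be sketched as a sum of pairwise contrastive terms over the video, audio, and text embeddings of the same clips. This is a minimal sketch under stated assumptions: it uses a symmetric InfoNCE loss per modality pair and a temperature of 0.07, neither of which is specified in this report, and the toy embeddings and function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i] within the batch."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def symmetric_pair_loss(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x contrastive terms."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def three_way_contrastive_loss(video, audio, text, temperature=0.07):
    """Sum of the pairwise losses over the three modality pairs (assumed form)."""
    return (
        symmetric_pair_loss(video, audio, temperature)
        + symmetric_pair_loss(video, text, temperature)
        + symmetric_pair_loss(audio, text, temperature)
    )


# Toy batch of two clips, already projected into a shared 3-D space.
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
text = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]

aligned = three_way_contrastive_loss(video, audio, text)
shuffled = three_way_contrastive_loss(video, audio[::-1], text)
```

In this toy batch the loss is low when the three modalities of each clip point in similar directions (`aligned`) and high when the audio embeddings are swapped between clips (`shuffled`), which is the signal that pulls matching modalities together during pretraining.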

To‌​‌ validate our approach, we​​ benchmark intent recognition using​​​‌ various state-of-the-art multimodal architectures.‌ Our proposed model, which‌​‌ integrates spatio-temporal video features​​ with audio and text​​​‌ analysis, achieves a classification‌ accuracy of 52.5%, establishing‌​‌ a new state-of-the-art by​​ significantly outperforming standard video​​​‌ classification baselines such as‌ VideoMAE and TimeSFormer. Ablation‌​‌ studies further reveal that,​​ while video remains the​​​‌ most predictive modality, the‌ fusion of text and‌​‌ audio is essential for​​ distinguishing complex, socially embedded​​​‌ intents. By releasing the‌ IntentHQ dataset and code,‌​‌ we aim to foster​​ further research in intent-aware​​​‌ media analysis, shifting the‌ focus towards a more‌​‌ nuanced understanding of digital​​ content.

8.29 AI killed​​​‌ the video star. Audio-driven‌ diffusion model for expressive‌​‌ talking head generation

Participants:​​ Baptiste Chopin, Antitza​​​‌ Dantcheva.

We proposed Dimitra++ 55, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, and head pose motion. Specifically, we proposed a conditional Motion Diffusion Transformer (cMDT) that models facial motion sequences using a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, and an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study, on two widely used datasets, VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ outperforms existing approaches in generating realistic talking heads that convey lip motion, facial expression, and head pose. Code and qualitative results are provided on our project page.

8.30 LIA-X: Interpretable Latent​​​‌ Portrait Animator

Participants: Antitza‌ Dantcheva, François Brémond‌​‌.

We introduce LIA-X​​ 57, a novel​​​‌ interpretable portrait animator designed‌ to transfer facial dynamics‌​‌ from a driving video​​ to a source portrait​​​‌ with fine-grained control. LIA-X‌ is an autoencoder that‌​‌ models motion transfer as​​ a linear navigation of​​​‌ motion codes in latent‌ space. Crucially, it incorporates‌​‌ a novel Sparse Motion​​ Dictionary that enables the​​​‌ model to disentangle facial‌ dynamics into interpretable factors.‌​‌ Deviating from previous 'warp-render'​​ approaches, the interpretability of​​​‌ the Sparse Motion Dictionary‌ allows LIA-X to support‌​‌ a highly controllable 'edit-warp-render'​​ strategy, enabling precise manipulation​​​‌ of fine-grained facial semantics‌ in the source portrait.‌​‌ This helps to narrow​​​‌ initial differences with the​ driving video in terms​‌ of pose and expression.​​ Moreover, we demonstrate the​​​‌ scalability of LIA-X by​ successfully training a large-scale​‌ model with approximately 1​​ billion parameters on extensive​​​‌ datasets. Experimental results show​ that our proposed method​‌ outperforms previous approaches in​​ both self-reenactment and cross-reenactment​​​‌ tasks across several benchmarks.​ Additionally, the interpretable and​‌ controllable nature of LIA-X​​ supports practical applications such​​​‌ as fine-grained, user-guided image​ and video editing, as​‌ well as 3D-aware portrait​​ video manipulation. Project Page​​​‌
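The notion of linear navigation in latent space can be sketched as follows: a motion code is moved along a small number of dictionary directions, each scaled by a magnitude. The dictionary entries, their semantic labels, the code dimension, and the function name `navigate` are hypothetical; the sketch does not reproduce the LIA-X architecture or its learned dictionary.

```python
def navigate(source_code, dictionary, magnitudes):
    """Return source_code + sum_i magnitudes[i] * dictionary[i]."""
    if len(dictionary) != len(magnitudes):
        raise ValueError("one magnitude per dictionary direction is required")
    target = list(source_code)
    for direction, magnitude in zip(dictionary, magnitudes):
        if magnitude == 0.0:
            continue  # sparse edit: unused directions leave the code unchanged
        for k, component in enumerate(direction):
            target[k] += magnitude * component
    return target


# Hypothetical dictionary of three motion directions in a 4-D latent space.
motion_dictionary = [
    [1.0, 0.0, 0.0, 0.0],  # assumed to control head yaw
    [0.0, 1.0, 0.0, 0.0],  # assumed to control mouth opening
    [0.0, 0.0, 1.0, 0.0],  # assumed to control eyebrow raising
]

source = [0.2, -0.1, 0.0, 0.5]
# Edit only the second factor; the other factors stay untouched.
edited = navigate(source, motion_dictionary, [0.0, 0.8, 0.0])
```

Because each direction is tied to one factor, setting a single magnitude changes one aspect of the motion while leaving the rest of the code intact, which is what makes an edit step before warping and rendering controllable.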

8.31 Simplicity-Bias-Aware Adaptation of​ Foundation Models for Deepfake​‌ Detection

Participants: Charbel Yahchouchi​​, Noemi Roggero,​​​‌ Laurent Saroul, Antitza​ Dantcheva.

Given the rapid advancement of deep learning and generative models, the synthesis of realistic and plausible images and videos has reached unprecedented levels. However, this progress also raises serious concerns, as such content can be misused for malicious purposes such as identity impersonation, misinformation, and social manipulation. Consequently, deepfake detection has emerged as a crucial area of research, aiming to develop robust and generalizable detectors capable of reliably identifying manipulated media. Despite impressive progress, most current detectors struggle to generalize to unseen manipulations, limiting their real-world reliability.

Figure 32: Overview of the proposed simplicity-bias-aware adaptation framework for deepfake detection. A frozen CLIP visual encoder is augmented with lightweight adapter modules, while the SIFER mechanism is applied at an intermediate representation to identify and suppress shortcut features during training. In the figure, an input image is converted into patch embeddings that pass through Transformer and adapter layers of the CLIP ViT backbone; an auxiliary linear head (the SIFER head) is trained with auxiliary and forget losses, and the main linear classification head with the main loss, both predicting whether the image is real or fake.

In this work, we study the limitations of foundation model adaptation for deepfake detection under distribution shifts and address the shortcut learning induced by parameter-efficient fine-tuning. We introduce a simplicity-bias-aware adaptation framework (see Figure 32) that augments a frozen CLIP visual encoder with lightweight adapter modules and integrates the SIFER feature-sieving mechanism to identify and suppress simple but non-generalizable cues during training. To validate our framework, we conduct an extensive evaluation on recent state-of-the-art deepfake detection datasets, focusing on cross-dataset and cross-manipulation generalization under distribution shifts. Experimental results show consistent improvements in video-level Area Under the Curve (AUC) compared to CLIP-based baselines and other parameter-efficient adaptation strategies, with particularly strong gains on subtle and localized manipulations.
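Video-level AUC can be obtained by aggregating frame-level scores into one score per video and then computing the area under the ROC curve. The sketch below assumes mean aggregation and the rank-based (Mann-Whitney) formulation of AUC; the video identifiers, scores, and labels are made up, and the report does not state which aggregation rule was used.

```python
from collections import defaultdict


def video_level_scores(frame_scores):
    """Average the frame-level fake probabilities of each video."""
    grouped = defaultdict(list)
    for video_id, score in frame_scores:
        grouped[video_id].append(score)
    return {video_id: sum(s) / len(s) for video_id, s in grouped.items()}


def auc(scores, labels):
    """Probability that a random fake (label 1) outscores a random real (label 0).

    Ties between a fake and a real score count as half a win.
    """
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    if not fakes or not reals:
        raise ValueError("both classes are required to compute AUC")
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in fakes
        for r in reals
    )
    return wins / (len(fakes) * len(reals))


# Hypothetical frame-level detector outputs: (video id, fake probability).
frames = [
    ("a", 0.9), ("a", 0.7),
    ("b", 0.2), ("b", 0.4),
    ("c", 0.3), ("c", 0.2),
    ("d", 0.1), ("d", 0.3),
]
labels = {"a": 1, "b": 0, "c": 1, "d": 0}  # 1 = fake, 0 = real

per_video = video_level_scores(frames)
video_ids = sorted(per_video)
video_auc = auc([per_video[v] for v in video_ids], [labels[v] for v in video_ids])
```

Here three of the four fake/real video pairs are ordered correctly, so `video_auc` is 0.75; a detector that ranks every fake video above every real one would reach 1.0.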

8.32 Now You See‌ Me, Now You Don't:‌​‌ A Unified Framework for​​ Expression Consistent Anonymization in​​​‌ Talking Head Videos

Participants:‌ Anil Egin, Antitza‌​‌ Dantcheva.

Face video anonymization aims at preserving privacy while still allowing videos to be analyzed in a number of downstream computer vision tasks such as expression recognition, people tracking, and action recognition. We propose a novel unified framework 39, referred to as Anon-NET, streamlined to de-identify facial videos while preserving the age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces with a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate the de-identified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of Anon-NET in obfuscating identity while retaining visual realism and temporal consistency. Project Page

8.33​​ Beyond the visible: A​​​‌ survey on cross-spectral face‌ recognition

Participants: Antitza Dantcheva‌​‌.

Cross-spectral face recognition (CFR) refers to recognizing individuals using face images stemming from different spectral bands, such as infrared versus visible. While CFR is inherently more challenging than classical face recognition due to significant variation in facial appearance caused by the modality gap, it is useful in many scenarios, including night-vision biometrics and the detection of presentation attacks. Recent advances in deep neural networks (DNNs) have resulted in significant improvement in the performance of CFR systems. Given these developments, the contributions of this survey are three-fold. First, we provide an overview of CFR by formalizing the CFR problem and presenting related applications. Second, we discuss the appropriate spectral bands for face recognition and review recent CFR methods, placing emphasis on deep neural networks. In particular, we describe techniques that have been proposed to extract and compare heterogeneous features emerging from different spectral bands. We also discuss the datasets that have been used for evaluating CFR methods. Finally, we discuss the challenges and future lines of research on this topic.

This work has been‌ published in Neurocomputing 31‌​‌.

9 Bilateral contracts​​ and grants with industry​​​‌

Participants: Antitza Dantcheva, François Brémond.

The Stars team currently has several technology-transfer collaborations with industry, which have made it possible to exploit research results.

9.1 Bilateral contracts​​ with industry

9.1.1 Toyota​​​‌

This project runs from August 1, 2013 to December 2025.

Toyota is working with Stars on action recognition software to be integrated on their robot platform. The project aims at detecting critical situations in the daily life of older adults living alone at home. This requires not only recognizing activities of daily living (ADLs) but also evaluating the way and timing in which they are carried out. The system we want to develop is intended to help older adults and their relatives feel more comfortable, knowing that potentially dangerous situations will be detected and reported to caregivers if necessary. The system is intended to work with a Partner Robot (HSR), to which it sends real-time information, so that the robot can better interact with the older adult.

9.1.2​‌ Fantastic Sourcing

Fantastic Sourcing is a French SME specialized in micro-electronics that develops e-health technologies. Fantastic Sourcing is collaborating with Stars through the Univ. Côte d'Azur Solitaria project, by providing their Nodeus system. Nodeus is an IoT (Internet of Things) system for home support for the elderly, which consists of a set of small sensors (without video cameras) that collect valuable data on the habits of isolated people. The Solitaria project performs multi-sensor activity analysis for the monitoring and safety of older and isolated people. With the ageing of the population in Europe and in the rest of the world, keeping elderly people at home, in their usual environment, for as long as possible is becoming a priority and a challenge for modern society. A system that monitors activities and raises alerts in case of danger, permanently connected to a device (a phone application, a surveillance system, etc.) that warns relatives (family, neighbors, friends, etc.) of isolated people still living in their natural environment, could save lives and avoid incidents that cause or worsen the loss of autonomy. In this R&D project, we propose to study a solution allowing the use of a set of innovative heterogeneous sensors in order to: 1) detect emergencies (falls, crises, etc.) and call relatives (neighbors, family, etc.); 2) detect, over short or longer predefined periods of time.

9.1.3 Probayes

STARS will be working with Probayes starting 01/07/2025 within a CIFRE Ph.D. on the development of advanced methods for detecting videos artificially generated by AI models. Recent advances in image and video generation based on neural networks make it possible to create highly realistic fake videos of individuals (deepfakes), which raises major security concerns for many organizations. This project aims at designing innovative approaches to assess the authenticity of video content. Particular emphasis will be placed on developing techniques that are generalizable and not specific to a given video generation model or application context. The proposed methods will rely on the analysis of spatio-temporal behavioral cues, such as mouth dynamics, in order to evaluate the veracity of video sequences.

10 Partnerships and cooperations​‌

Participants: François Brémond,​​ Antitza Dantcheva, Michal​​​‌ Balazia, Monique Thonnat​.

10.1 International research​‌ visitors

10.1.1 Visits of​​ international scientists

Other international​​​‌ visits to the team​

Participant: Donghyeon Cho.​‌

  • Status:
    Associate Professor
  • Institution​​ of origin:
    Ulsan National​​​‌ Institute of Science and​ Technology (UNIST)
  • Country:
    South​‌ Korea
  • Dates:
    July to​​ August
  • Context of the​​​‌ visit:
    Collaborations
  • Mobility program/type​ of mobility:
    Korean research​‌ stay.

Participant: Jinsun Park​​.

  • Status:
    Associate Professor​​
  • Institution of origin:
    Pusan​​​‌ National University in Busan‌
  • Country:
    South Korea
  • Dates:‌​‌
    July to August
  • Context​​ of the visit:
    Collaborations​​​‌
  • Mobility program/type of mobility:‌
    Korean research stay.

Participant:‌​‌ Seungryul Baek.

  • Status:
    Associate Professor
  • Institution of​​​‌ origin:
    Hanyang University in‌ Seoul
  • Country:
    South Korea‌​‌
  • Dates:
    July to August​​
  • Context of the visit:​​​‌
    Collaborations
  • Mobility program/type of‌ mobility:
    Korean research stay.‌​‌

Participant: Eric Granger.​​

  • Status:
    Full Professor
  • Institution​​​‌ of origin:
    École de‌ technologie supérieure, Université du‌​‌ Québec
  • Country:
    Canada
  • Dates:​​
    November to December
  • Context​​​‌ of the visit:
    Collaborations‌
  • Mobility program/type of mobility:‌​‌
    Sabbatical.

Participant: Nesli Erdogmus​​.

  • Status:
    Assistant Professor​​​‌
  • Institution of origin:
    Izmir‌ Institute of Technology
  • Country:‌​‌
    Turkey
  • Dates:
    July to​​ August
  • Context of the​​​‌ visit:
    Collaborations
  • Mobility program/type‌ of mobility:
    Franco-Turkish Research Fellowship Program "Prestij"

10.2 European initiatives​​​‌

10.2.1 Horizon Europe

GAIN‌

GAIN project on cordis.europa.eu‌​‌

  • Title:
    Georgian Artificial Intelligence​​ Networking and Twinning Initiative​​​‌
  • Duration:
    From October 1,‌ 2022 to September 30,‌​‌ 2025
  • Partners:
    • Institut National​​ De Recherche En Informatique​​​‌ Et Automatique (INRIA), France‌
    • Exolaunch GMBH (EXO), Germany‌​‌
    • Deutsches Forschungszentrum Fur Kunstliche​​ Intelligenz GMBH (DFKI), Germany​​​‌
    • Georgian Technical University (GTU),‌ Georgia
  • Inria contact:
    François Brémond
  • Coordinator:
    George Giorgobiani​​
  • Summary:
    GAIN will take a strategic step towards integrating Georgia, one of the Widening countries, into the system of European efforts aimed at ensuring Europe's leadership in one of the most transformative technologies of today and tomorrow: Artificial Intelligence (AI). This will be achieved by adjusting the research profile of the central Georgian ICT research institute, the Muskhelishvili Institute of Computational Mathematics (MICM), and linking it to the European AI research and innovation community. Two leading European research organizations (DFKI and INRIA), supported by the high-tech company EXOLAUNCH, will assist MICM in this endeavor. The Strategic Research and Innovation Programme (SRIP) designed by the partnership will provide the environment for the Georgian colleagues to get involved in the research projects of the European partners addressing a clearly delineated set of AI topics. Jointly, the partners will advance in capacity building and networking within the area of AI Methods and Tools for Human Activities Recognition and Evaluation, which will also contribute to strengthening core competences in such fundamental technologies as Machine (Deep) Learning. The results of the cooperation, presented through a series of scientific publications and events, will inform the European AI community about the potential of MICM and trigger the building of new partnerships, addressing for example Horizon Europe. The project will contribute to the career development of a cohort of young researchers at MICM through joint supervision and targeted capacity-building measures. The Innovation and Research Administration and Management capacities of MICM will also be strengthened to allow the Institute to be better connected to local, regional, and European innovation activities. Using their extensive research and innovation networking capacities, DFKI and INRIA will introduce MICM to the European AI research community by connecting it to networks such as CLAIRE, ELLIS, ADRA, AI NoEs, etc.

10.3​ National initiatives

ANR COMSEMA​‌

Website: ANR COMSEMA

  • Title:​​
    Communications Sémantiques pour les​​​‌ futurs réseaux - Semantic​ Communications for future networks​‌
  • Duration:
    From November 1,​​ 2024 to October 30,​​​‌ 2028
  • Partners:
    • Institut National​ De Recherche En Informatique​‌ Et Automatique (INRIA), France​​
    • Centrale-Supelec
    • Orange
  • Inria contact:​​​‌
    François Brémond
  • Coordinator:
    Mohamad​ Assaad
  • Summary:
    The ANR COMSEMA project, part of Thématiques Spécifiques en Intelligence Artificielle (TSIA), runs from November 1, 2024 to October 30, 2028 (48 months) and aims to improve future networks incorporating video interpretation applications. Wireless networks are currently witnessing a radical shift from a purely data-oriented architecture to service- and intelligence-based architectures, thereby supporting a diverse set of verticals. Thanks to the development of AI, future networks are expected to incorporate an even larger set of applications and services, such as ReID applications and human activity recognition, interactive holograms, e-health, and intelligent humanoid robots. In this project, we consider video interpretation applications and propose a fundamental semantic approach to redesign the entire process of information generation and transmission over the network. In particular, novel AI-based interference management that focuses on task achievement, rather than on bit-rate improvement over the air interface, will be investigated. Inria is in charge of customizing video interpretation applications to improve data transmission over the network. The Inria grant is about 196 keuros (24 person-months) out of 560 keuros.

Inria Exploratory Action XGAN

  • Title:
    Interpretable Representation​‌ Learning for Video GANs​​
  • Duration:
    From 2024 to​​​‌ 2028
  • Partners:
    • Inria Center​ at Université Côte d'Azur,​‌ France
  • Inria contact:
    Antitza​​ Dantcheva
  • Coordinator:
    Antitza Dantcheva​​​‌
  • Summary:
    The Inria Exploratory Action (AEx) XGAN, running from 2024 to 2028, aims at piercing the black box of generative models for video generation by proposing strategies to interpret the latent space through (a) designing interpretable architectures and (b) analyzing symmetric functions in the input and output of patch-based generation. Despite remarkable progress in generative models, such networks currently operate as black boxes. The Inria grant is about 170 keuros.

10.4 Regional initiatives

Since 2011, we have maintained a strategic partnership (called CobTek) with Nice hospital (CHU Nice, Prof. F. Askenazy) to carry out ambitious research activities dedicated to healthcare monitoring and assistive technologies. These studies address the analysis of more complex spatiotemporal activities (e.g., complex interactions, long-term activities).

11 Dissemination​‌

11.1 Promoting scientific activities​​

11.1.1 Scientific events: organization​​​‌

General chair, scientific chair:​

Participant: François Brémond.​‌

François Brémond was:

  • General Chair at IPAS 2025 [130 people], the IEEE International Conference on Image Processing, Applications and Systems, Lyon, January 2025. Member of the organizing committee (see 50).
  • General Chair at​​ the South Caucasus Conference​​​‌ on Artificial Intelligence -​ SCCAI 2025, MICM/GTU, Tbilisi,​‌ Georgia, September 16-18, 2025.​​

Member of the organizing committees:

Participants: Antitza Dantcheva, Michal Balazia.

Antitza Dantcheva was co-organizer of CV4BIOM, the Workshop on Computer Vision for Biometrics, Identity & Behaviour, associated with the International Conference on Computer Vision (ICCV 2025), on October 20th, 2025.

She was‌ also co-organizer of the‌​‌ 4th Vision-based Remote Physiological​​ Signal Sensing (RePSS) workshop​​​‌ in conjunction with the‌ International Joint Conference on‌​‌ Artificial Intelligence (IJCAI 2025)​​ on August 28th, 2025.​​​‌

Michal Balazia was on the technical committee of SCCAI 2025 and served as session chair. He was also session chair, program chair, and member of the organizing technical committee at ACMMM MultiMediate 2025.

11.1.2 Scientific events:‌ selection

Participants: François Brémond‌​‌, Antitza Dantcheva,​​ Michal Balazia, Monique​​​‌ Thonnat.

Reviewer:
  • François Brémond was a reviewer for major Computer Vision / Machine Learning conferences, including ICCV, ECCV, CVPR, NeurIPS, AAAI, ICLR, WACV.
  • Monique Thonnat was a member of the program committees of IJCAI 2025 and ICPRAM 2026.
  • Antitza Dantcheva was a reviewer and evaluator for SMASH, a Slovenian career-development training program.

    She also served as a reviewer for major Computer Vision / Machine Learning conferences such as ICCV, CVPR, NeurIPS, AAAI, ICLR, WACV.

  • In 2025, Michal Balazia was a reviewer for ACMMM MultiMediate, ACM Multimedia, ICPR, and WACV.

11.1.3 Journal

Michal‌​‌ Balazia served as reviewer​​ for TBIOM and MDPI​​​‌ Sensors.

11.1.4 Invited talks‌

Participants: François Brémond,‌​‌ Monique Thonnat, Antitza​​ Dantcheva, Michal Balazia​​​‌.

François Brémond gave the following invited talks:

  • invited talk (1h) at IPAS 2025, the IEEE International Conference on Image Processing, Applications and Systems, Lyon, January 9-11, 2025.
  • invited talk​​​‌ (1h) at the South‌ Caucasus Conference on Artificial‌​‌ Intelligence - SCCAI 2025,​​ MICM/GTU, Tbilisi, Georgia, September​​​‌ 16-18, 2025.
  • invited talk‌ (1h) on "Video Action‌​‌ Recognition" at the University​​ of Bristol, on 21​​​‌ October 2025.
  • Keynote speaker‌ at the ePictureThis workshop‌​‌ on "Video Action Recognition​​ for Human Behavior Analysis",​​​‌ TU-Eindhoven, on 28 October‌ 2025.

Monique Thonnat was invited as keynote speaker at the IEEE ICPRS conference in Viña del Mar, Chile, December 1-4, 2025.

Antitza Dantcheva gave the following invited talks:

  • invited talk at the Storyzy premises in Paris, May 7, 2025.
  • invited talk in the online US Seminar on "US Developments and Impact of AI on Biometric Vulnerabilities", June 26, 2025.
  • invited talk at the Workshop for "Synthetic Realities and Biometric Security: Advances in Forensic Analysis and Threat Mitigation (SRBS)", associated with BMVC, November 27.
  • invited‌​‌ talk at SophIA Summit​​ in Sophia Antipolis, November​​​‌ 20, 2025.
  • invited talk‌ at the University of‌​‌ Technology in Vienna (TU​​ Wien), Austria, December 4,​​​‌ 2025.

Michal Balazia gave‌ invited talks at Metascience‌​‌ and Guardians.

11.1.5 Contributed​​ talks

Monique Thonnat was a speaker at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) in Tucson, Arizona, February 28 - March 4, 2025.

11.1.6 Scientific expertise‌​‌

Participants: Monique Thonnat,​​ Michal Balazia.

Monique Thonnat evaluated ANR projects as part of the evaluation committee “CE38 – Interfaces : mathématiques, sciences du numérique – sciences humaines et sociales”.

Michal‌​‌ Balazia served as reviewer​​ for ANR and NSERC.​​​‌

11.2 Teaching - Supervision‌ - Juries - Educational‌​‌ and pedagogical outreach

Participant:​​​‌ François Brémond.

François Brémond taught AI courses on Computer Vision & Deep Learning in the Data Science and AI MSc program at Université Côte d'Azur. Academic year 2025: 24 hours.

11.2.1​ Supervision

Participants: François Brémond​‌, Antitza Dantcheva,​​ Michal Balazia.

François Brémond (co)-supervised 11 PhD students and many master's students:

  • Tomasz Stanczyk: 3IA​​ PhD student
  • Valeriya Strizhkova:​​​‌ 3IA PhD student, defended​ on March 14, 2025,​‌ 52.
  • Seongro Yoon:​​ 3IA PhD student
  • Tanay​​​‌ Agrawal: PhD student -​ Fellowship from European project​‌ Gain, defended on September​​ 26, 2025, 51.​​​‌
  • Abid Ali: PhD student​ - Fellowship from BoostUrCAreer​‌ CoFund
  • Snehashis Majhi: PhD​​ student - Fellowship from​​​‌ Toyota
  • Aglind Reka: Fellowship​ EUR Spectrum, Geoazur -​‌ Intelligent Mapping
  • Ezem Ekmekci:​​ 3IA PhD student
  • Wenxin​​​‌ Xiong: Gredeg PhD student​
  • Yuan Gao: INRAE PhD​‌ student
  • Sébastien Frey: Nice​​ Hospital PhD student.

François Brémond also took part in the supervision of several interns (master's and PhD level) hosted by the STARS team.

Antitza Dantcheva (co)-supervised​‌ 5 PhD students and​​ many master's students:

  • Valeriya​​​‌ Strizhkova: 3IA PhD student​
  • Tanay Agrawal: PhD student​‌ - Fellowship from European​​ project Gain
  • Snehashis Majhi:​​​‌ PhD student - Fellowship​ from Toyota
  • Nabyl Quignon: Inria AEX XGAN PhD
  • Charbel Yahchouchi: CIFRE PhD​​​‌ Probayes
  • Baptiste Chopin: Inria​ AEX XGAN Postdoc
  • Anil Egin: Master's student

Michal Balazia supervised the following researchers:

  • M2 interns: Aaryan​ Dhawan, Miriana Russo, Sanya​‌ Sinha
  • engineer: Aowen Shi​​
  • pre-docs: Quentin Merilleau, Aglind​​​‌ Reka.

11.2.2 Juries

Participants:​ François Brémond, Antitza​‌ Dantcheva.

François Brémond participated in the following juries:

  • HDR:
    • Carlos Crispim​ from Université Lumière -​‌ Lyon 2, September 22,​​ 2025
  • PhD Thesis Review:​​​‌
    • Nima Mehdi from Inria​ Centre at Université de​‌ Lorraine, December 17, 2024​​
    • Kevin Flanagan from the​​​‌ University of Bristol, October​ 22, 2025
    • Samy Tafasca​‌ from École Polytechnique Fédérale​​ de Lausanne - EPFL,​​​‌ December 5, 2025
    • Salvatore​ Fiorilla from Università di​‌ Bologna, December 11, 2025​​
  • CSI - Comité de​​​‌ suivi de thèse:
    • Marc​ Chapus, May 5, 2025​‌
    • Keqi Chen, May 21,​​ 2025
    • Monica Fossati, May​​​‌ 26, 2025
    • Federica Facente,​ May 31, 2025
    • Aela​‌ Le Sommer, June 3,​​ 2025
    • Franz Fabini Franco Gallo, June 10, 2025
    • Thomas​ Campagnolo, July 1, 2025​‌
    • Kaushik Bhowmik, July 1,​​ 2025
    • Sofia Alexopoulou, July​​​‌ 11, 2025
    • Yannick Porto,​ July 9, 2025
    • Idir​‌ Chatar, September 12, 2025​​

Antitza Dantcheva participated in the following juries:

  • PhD​ Thesis Review:
    • Sahar Husseini,​‌ Eurecom, June 17, 2025.​​
  • CSI - Comité de​​​‌ suivi de thèse:
    • Mehdi​ Atamna, December 8, 2025​‌
    • Huyen Trang Nguyen, October​​ 21, 2025
    • Yuanzhi Zhu,​​​‌ October 30, 2025
    • Huyen​ Trang Nguyen, July 7,​‌ 2025

11.3 Popularization

11.3.1​​ Specific official responsibilities in​​​‌ science outreach structures

Participants:​ François Brémond, Antitza​‌ Dantcheva, Michal Balazia​​.

  • François Brémond participated in the organization of the SophIA Summit 2025.
  • Michal Balazia gave an invited talk at Metascience on June 19, 2025.
  • Michal Balazia gave an invited talk at Guardians on October 9, 2025.

11.3.2 Productions​​ (articles, videos, podcasts, serious​​​‌ games, ...)

Participant: Michal‌ Balazia.

Michal Balazia‌​‌ made a demo visualization​​ tool for action detection​​​‌ in videos of psychiatric‌ interviews.

11.3.3 Participation in‌​‌ Live events

Participant: François​​ Brémond.

François Brémond participated in the following events:

  • Presentation on "Human Action Recognition", part of “Fête de la science” at the Village des sciences d'Antibes – Juan-les-Pins, on October 11, 2025;
  • Presentation for bachelor students, ENS de Lyon, Sophia Antipolis, November 2025;
  • Presentation for high school students, part of Terra Numerica, Sophia Antipolis, December 2025.

11.3.4‌​‌ Other science outreach relevant​​ activities

Participant: François Brémond​​​‌.

François Brémond gave an interview on "Automated video surveillance" to Bachelor students from Sciences Po, in February 2025.

12‌ Scientific production

12.1 Major‌​‌ publications

12.2 Publications of the​ year

International journals

  • 31 David Anghelone, Cunjian Chen, Arun Ross and Antitza Dantcheva. Beyond the Visible: A Survey on Cross-spectral Face Recognition. Neurocomputing, January 2025.
  • 32 Eric Ettore, Hali Lindsay, Johannes Tröger, Michal Balazia, Benoit Michel, Philippe Robert, Danilo Postin, Rene Hurlemann and Alexandra König. Childhood trauma affects speech and language measures in patients with major depressive disorder during clinical interviews. Journal of Affective Disorders, 388, November 2025, 119769.
  • 33 Sébastien Frey, Federica Facente, Wen Wei, Ezem Sura Ekmekci, Eric Séjor, Patrick Baqué, Matthieu Durand, Hervé Delingette, Francois Bremond, Pierre Berthet-Rayne and Nicholas Ayache. Optimizing Intraoperative AI: Evaluation of YOLOv8 for Real-Time Recognition of Robotic and Laparoscopic Instruments. Journal of Robotic Surgery, 19, 131, March 2025.
  • 34 Mohsen Tabejamaat, Hoda Mohammadzade, Farhood Negin and Francois Bremond. EEG Classification with Limited Data: A Deep Clustering Approach. Pattern Recognition, Volume 157, 110934, January 2025.

International​​​‌ peer-reviewed conferences

  • 35 Tanay Agrawal, Abid Ali, Antitza Dantcheva and Francois Bremond. Are Attention Maps Richer than we Imagined for Action Recognition? AVSS 2025 - IEEE International Conference on Advanced Video and Signal based Surveillance, Tainan, Taiwan, August 2025.
  • 36 Tanay Agrawal, Mohammed Guermal, Michal Balazia and Francois Bremond. CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets. IEEE Xplore, WACV 2025 - Winter Conference on Applications of Computer Vision, Tucson, United States, March 2025.
  • 37 Abid Ali, Rui Dai, Ashish Marisetty, Guillaume Astruc, Monique Thonnat, Jean-Marc Odobez, Susanne Thümmler and Francois Bremond. Loose Social-Interaction Recognition in Real-world Therapy Scenarios. IEEE Xplore, WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, United States, arXiv, February 2025.
  • 38 Abid Ali, Antitza Dantcheva, Francois Bremond and Tanay Agrawal. Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation. ICCV 2025 - IEEE International Conference on Computer Vision, Honolulu, Hawaii, United States, October 2025.
  • 39 Anil Egin, Andrea Tangherloni and Antitza Dantcheva. Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), CV4BIOM: Workshop on Computer Vision for Biometrics, Identity & Behaviour, ICCV 2025 - IEEE/CVF International Conference on Computer Vision Workshops, Hawaii-Honolulu, United States, IEEE, October 2025, 5925–5934.
  • 40 Snehashis Majhi, Giacomo d'Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev and Francois Bremond. Just Dance with π!, A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection. CVPR 2025 - Conference on Computer Vision and Pattern Recognition, Nashville, United States, June 2025.
  • 41 Snehashis Majhi, Mohammed Guermal, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca and Francois Bremond. Guess Future Anomalies from Normalcy: Forecasting Abnormal Behavior in Real-World Videos. IEEE Xplore, IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson (AZ), United States, February 2025.
  • 42 Snehashis Majhi, Mohammed Guermal, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca and Francois Bremond. Guess Future Anomalies from Normalcy: Forecasting Abnormal Behavior in Real-World Videos. Winter Conference on Applications of Computer Vision, WACV 2025, Tucson, United States, February 2025.
  • 43 Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar, Pu Wang, Francois Bremond, Le Xue and Srijan Das. LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living. CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States, March 2025.
  • 44 Sanya Sinha, Michal Balazia and Francois Bremond. Identifying Surgical Instruments in Pedagogical Cataract Surgery Videos through an Optimized Aggregation Network. IPAS 2025 - Sixth IEEE International Conference on Image Processing Applications and Systems, Lyon, France, January 2025.
  • 45 Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang and Srijan Das. SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living. 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025, Philadelphia, United States, February 2025.
  • 46 Tomasz Stanczyk, Seongro Yoon and Francois Bremond. No Train Yet Gain: Towards Generic Multi-Object Tracking in Sports and Beyond. IEEE Xplore, CVPR 2025 - Conference on Computer Vision and Pattern Recognition, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, United States, June 2025.
  • 47 Daksitha Senel Withanage Don, Marius Funk, Michal Balazia, Huajian Qiu, Shogo Okada, François Brémond, Jan Alexandersson, Andreas Bulling, Elisabeth André and Philipp Müller. MultiMediate '25: Cross-cultural Multi-domain Engagement Estimation. MM 2025 - 33rd ACM International Conference on MultiMedia, Dublin, Ireland, October 2025, 14150–14155.
  • 48 Giacomo d'Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev and Francois Bremond. Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection. ICCV 2025 - IEEE International Conference on Computer Vision, Honolulu, Hawaii, United States, October 2025.

Conferences‌​‌ without proceedings

  • 49 Tomasz Stanczyk and Francois Bremond. Does Re-ID Really Help in Multi-Object Tracking? SCCAI 2025 - South Caucasus Conference on Artificial Intelligence, Tbilisi, Georgia, September 2025.

Edition (books, proceedings, special​​​‌ issue of a journal)‌

  • 50 Serge Miguet, Mouna Zouari, Dorra Sellami, Habib M. Kammoun and Francois Bremond, eds. 6th IEEE International Conference on Image Processing, Applications and Systems - IPAS 2025 Conference Proceedings. IPAS 2025 - Sixth IEEE International Conference on Image Processing Applications and Systems, Lyon, France, IEEE, March 2025.

Doctoral‌ dissertations and habilitation theses‌​‌

Reports & preprints