STARS

2025 Activity Report, Project-Team STARS

RNSR: 201221015V

Research center: Inria Centre at Université Côte d'Azur
Team‌ name: Spatio-Temporal Activity Recognition of Social interactions

Creation‌ of the Project-Team: 2024 November 01

Keywords

Computer Science and Digital Science

A9. Artificial intelligence‌
A9.1. Knowledge
A9.2. Machine‌ learning
A9.3. Signal processing‌‌
A9.8. Reasoning
A9.12. Computer vision

1 Team members, visitors, external collaborators

Research Scientists‌

François Brémond [Team‌ leader, INRIA,‌‌ Senior Researcher, HDR]
Michal Balazia [‌UNIV COTE D'AZUR,‌ ISFP]
Antitza Dantcheva‌‌ [INRIA, Senior Researcher, HDR]‌
Monique Thonnat [INRIA‌, Emeritus, HDR‌‌]

Post-Doctoral Fellows

Baptiste Chopin [INRIA,‌ Post-Doctoral Fellow, until‌ Aug 2025]
Olivier‌‌ Huynh [INRIA, Post-Doctoral Fellow, until‌ Feb 2025]

PhD‌ Students

Tanay Agrawal [‌‌INRIA, until Oct 2025]
Yuan Gao‌ [INRAE, until‌ Dec 2026]
Snehashis‌‌ Majhi [INRIA, until Mar 2026]‌
Nabyl Quignon [INRIA‌, from Mar 2025‌‌]
Aglind Reka [UNIV COTE D'AZUR,‌ from Sep 2025]‌
Tomasz Stanczyk [INRIA‌‌, until May 2026]
Valeriya Strizhkova [‌INRIA, until Feb‌ 2025]
Charbel Yahchouchi‌‌ [Probayes, CIFRE, from Jul 2025‌]
Seongro Yoon [‌INRIA, until Oct‌‌ 2027]

Technical Staff

Mahmoud Ali [INRIA‌, Engineer]
Marios‌ Kaplanis [INRIA,‌‌ Engineer, from Apr 2025 until Jun 2025‌]
Aowen Shi [‌INRIA, Engineer,‌‌ from Feb 2025]
Yoann Torrado [INRIA‌, Engineer, until‌ May 2025]

Interns‌‌ and Apprentices

Hardik Agarwal [INRIA, Intern‌, from Apr 2025‌ until Jul 2025]‌‌
Akshaya Ananda Murthy [INRIA, Intern,‌ from Apr 2025 until‌ Aug 2025]
Andranik‌‌ Arakelov [INRIA, Intern, from Jun‌ 2025 until Oct 2025‌]
Saurabh Atreya [‌‌INRIA, Intern, until May 2025]‌
Aaryan Dhawan [UNIV‌ COTE D'AZUR, Intern‌‌, from Apr 2025 until Sep 2025]‌
Anil Egin [INRIA‌, Intern, until‌‌ Apr 2025]
Utkarsh Gupta [INRIA,‌ Intern, from Apr‌ 2025 until Jul 2025‌‌]
Khodor Hamadi [INRIA, Intern,‌ from Jun 2025 until‌ Nov 2025]
Mingyun‌‌ Jeong [INRIA, Intern, from Jul‌ 2025]
Jang Hyun‌ Kim [INRIA,‌‌ Intern, from Jul 2025 until Sep 2025‌]
Dian-Wei Lai [‌INRIA, Intern,‌‌ from Oct 2025]
Quentin Merilleau [INRIA‌, from Feb 2025‌]
Nishit Poddar [‌‌INRIA, Intern, from Apr 2025 until‌ Jun 2025]
Nabyl‌ Quignon [INRIA,‌‌ until Feb 2025]
Aglind Reka [INRIA‌, until Aug 2025‌]
Miriana Russo [‌‌INRIA, Intern, from Oct 2025]‌
Ananya Sharma [INRIA‌, from Sep 2025‌‌]
Utkarsh Tiwari [Inria, until Apr‌ 2025]

Administrative Assistant‌

Marie-Cecile Lafont [INRIA‌‌]

Visiting Scientists

Seungryul Baek [Ulsan National‌ Institute of Science and‌ Technology (UNIST), Republic of‌‌ Korea, from Jul‌ 2025 until Aug 2025]
Donghyeon Cho [‌Hanyang University, Seoul, Republic of Korea, from‌ Dec 2025]
Nesli Ergdogmus [Izmir Institute‌ of Technology, Turkey, from Jun 2025 until‌ Jul 2025]
Salvatore Fiorilla [University of Bologna, Italy, until Feb 2025]
Eric‌ Granger [ETS MONTREAL, from Dec 2025‌]
Jinsun Park [Pusan National University, Busan,‌ Republic of Korea, from Jul 2025 until‌ Jul 2025]
Teimuraz Saginadze [MICM, Georgian‌ Technical University, GTU, Tbilisi, Georgia, from Feb‌ 2025 until Apr 2025]

External Collaborators

Abid‌ Ali [INRIA, then University of Luxembourg,‌ until Jul 2025]
Laura Ferrari [Scuola‌ Superiore Sant'Anna, Pisa, Italy]
Rachid Guerchouche [‌CoBTek]
Alexandra Konig [CoBTek, CHU NICE‌]
Benoit Lagadec [FairVision, from Jul‌ 2025]
Hali Lindsay [KIT - ALLEMAGNE‌, until Jun 2025]
Sabine Moisan [‌retired, HDR]
Jean Rigault [retired‌]
Philippe Robert [CoBTek]
Yaohui Wang [Shanghai AI Lab]
Di Yang [Shanghai University, until Aug 2025]

2‌ Overall objectives

2.1 Presentation

The STARS research project-team‌ focuses on the design of computer vision methods‌ for real-time understanding of social interactions observed by‌ sensors. Our objective is to propose new algorithms‌ to analyze the behavior of people suffering from‌ behavioral disorders, in order to improve their quality‌ of life. We study long-term spatio-temporal interactions performed‌ by humans in their natural environment. We address‌ this challenge by proposing novel deep-learning architectures to‌ model behavioral traits such as facial expression, gaze,‌ gestures, body behavior, and body language. To cope‌ with the limited amount of available data and‌ the privacy issues of medical data, we propose‌ data generation for data augmentation and anonymization. Another‌ important challenge is to make the link between‌ collected data, medical diagnosis, and ultimately treatments. To‌ validate our research we work closely with our‌ clinical partners, in particular those of the Nice‌ Hospital.

2.2 Motivation

Deep learning techniques are highly successful for simple action recognition. Nevertheless, several important challenges remain in activity recognition in general, and specifically in our target medical application domain.

To‌ validate our research, we work closely with our‌ clinical partners. We have a strategic partnership named‌ CoBTek with the clinicians of Nice Hospital (CHU‌ Nice) to study the impact of video understanding‌ approaches for cognitive disorders. This partnership started in‌ January 2012 and has evolved to a University‌ Côte d'Azur team and joint work with monthly‌ regular meetings between STARS and the clinicians of‌ Institut Claude Pompidou (ICP), Lenval, and Pasteur hospitals.‌ The two directors of CoBTek are François Brémond‌ and Florence Askenazy (PU-PH) at Lenval. Our objective‌ to deepen research in social interaction is motivated‌ by the needs of our clinician partners. A‌ typical use-case of social interactions observed by sensors‌ appears in the clinical assessments of psychiatric patients, such as people suffering‌ from conditions like major‌ depression, bipolar disorder, or‌‌ schizophrenia 25. In these clinical assessments, interactions‌ between the patient and‌ the clinician are recorded‌‌ with multi-modalities, i.e., with video, audio, and physiological‌ sensors. The goal is‌ to extract digital markers‌‌ (defined by formal interaction models), which are indicators‌ that can characterize a‌ digital phenotype. The digital‌‌ markers are automatically extracted from the recorded data‌ and the digital phenotypes‌ could lead to a‌‌ treatment improving the patient's behavioral disorders.

Social interaction‌ as a new study‌ target.

An abundance of valuable, diagnostically relevant information is extracted from the interaction between clinician and patient. This clinical interaction (e.g., a conversation between patient and clinician, including verbal and nonverbal communication) is traditionally the clinician's most important source of information about patients' social skills, mood, and motivation levels. However, a comprehensive clinical interview requires sufficient consultation time as well as strong clinical competencies and expertise to detect subtle early signs of changes in communication. Moreover, no standardized objective measures exist for detecting these changes during a clinical conversation, leaving a lot of room for speculation and subjective biases. Introducing methodologies to quantitatively assess behavioral dynamics during real-life social interaction could help indicate, for instance, the level of reciprocity and therapeutic alliance, which until now has been left to clinical intuition, as we have pointed out 25.

Need‌ for precise and sensitive‌‌ digital markers.

To develop and test new measures of mental illness, a movement from traditional markers and phenotyping to digital markers and digital phenotyping is needed. "Digital phenotyping" refers to the moment-to-moment quantification of human behavior in everyday life using data from digital devices. It relies on collecting patient data in a way that allows non-intrusive and continuous monitoring of behavioral and mental states, ultimately revealing clinically relevant information. Similarly, "digital markers" (e.g., frequency of eye contact) are digitally obtained disease indicators that can be used to define a digital phenotype (e.g., eye gaze). Interaction-based phenotyping could provide various additional data to generate an observer-independent assessment of behavior during a social interaction, which mirrors the current symptomatology of a patient. Additionally, interaction-based measures such as social synchrony may have predictive value for treatment outcomes. Recent progress in computer vision, speech processing, and machine learning has enabled detailed and objective characterization of human interaction behavior 8. Applying these advanced methods of artificial intelligence provides new opportunities to identify digital markers of patient behavior. Such markers have the potential to provide objective and continuous assessments of symptomatology in the context of patients' daily lives 30, 4, thereby allowing treatment to be precisely tailored to the concrete patient trajectory. So far, many developed techniques are based solely on verbal information during interviews; however, interpersonal communication often occurs non-verbally. Thus, adding computer vision-based measurement in a multi-modal approach would enhance the quality of analysis by allowing changes in the quality of communication to be detected as alterations in the dyadic interaction patterns.
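
As an illustration of what a digital marker such as "frequency of eye contact" could look like once a detector has produced per-frame decisions, the following minimal sketch aggregates a sequence of booleans into a few summary statistics. The function name `eye_contact_markers`, the input format, and the three statistics are assumptions chosen for illustration; they are not the team's formal interaction models.

```python
def eye_contact_markers(frames, fps):
    """Summarize per-frame eye-contact detections into simple markers.

    frames: list of bool, True when eye contact was detected in that frame
            (hypothetical output of an upstream detector).
    fps:    frame rate of the recording, in frames per second.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")

    # Count episodes: an episode starts when detection switches from False to True.
    episodes = 0
    previous = False
    for current in frames:
        if current and not previous:
            episodes += 1
        previous = current

    duration_s = len(frames) / fps   # total recording length in seconds
    contact_s = sum(frames) / fps    # total time spent in eye contact

    return {
        "episodes_per_minute": 60.0 * episodes / duration_s if duration_s else 0.0,
        "contact_ratio": contact_s / duration_s if duration_s else 0.0,
        "mean_episode_s": contact_s / episodes if episodes else 0.0,
    }
```

For a 5-second clip sampled at 2 frames per second with three separate episodes of eye contact, the sketch reports 36 episodes per minute and a contact ratio of 0.5.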

Digital‌ markers and methods.

In recent years, behavior recognition‌ methods based on artificial intelligence (i.e., machine or‌ deep learning) have become increasingly effective in a‌ variety of tasks, including action classification 19,‌ body language and gestures 6, gaze estimation‌ 26, eye contact detection, facial action units,‌ facial expression 27, as well as affect‌ extracted from single or multiple modalities 2.‌ A growing number of approaches make use of‌ this progress in human behavior sensing to analyze‌ clinical interaction data (e.g., therapy sessions), linguistic and‌ paralinguistic characteristics from speech. As psychiatric disorders (depression,‌ bipolar, schizophrenia) impact the quality of social interactions,‌ there is an emphasis on studying these quantifiable‌ behavioral dynamics in real-life social interaction at the‌ dyadic level rather than solely individual behavior 25‌. While these initial results are promising, this‌ research needs to be accelerated by further development‌ of digital phenotyping technology focusing on scalability and‌ equity, by establishing shared longitudinal data repositories and‌ by fostering multidisciplinary collaborations between clinical stakeholders, including‌ patients, computer scientists, and medical researchers.

Sensors for‌ analyzing human interactions.

We are planning to keep‌ using mainly RGB (i.e. Red, Green and Blue‌ colors) monocular cameras for video analysis. These off-the-shelf‌ sensors are affordable, and very precise with a‌ large dynamic range and high resolution. They are‌ easily deployable in elderly homes and in hospitals.‌ However, we also investigate new types of sensors‌ (e.g. RGB-D, i.e. RGB colors and Depth, and‌ infrared cameras, physiological sensors, and microphones) to capture‌ complementary information and depending on the use-cases. These‌ new sensors can open up new avenues of‌ research. As we do not want to disturb‌ the everyday activities of the end-users, we can‌ first train our models with a large variety‌ of sensors in dedicated locations, such as laboratories.‌ Second, we can distill the learned weights into‌ lighter models trained only with RGB video streams.‌ These lighter RGB models are more convenient and‌ less intrusive, as they can be processed only‌ using standard RGB cameras. Third, we can use‌ only these lighter RGB models at run-time in‌ embedded devices directly at the end-users' locations. Therefore,‌ we only use the sensors and cameras pertinent‌ to the end-users.
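
The second step above, distilling a multi-sensor model into a lighter RGB-only model, can be sketched with a standard logit-matching distillation loss. This is a minimal sketch under stated assumptions: the text does not specify which distillation objective is used, so the temperature-scaled KL divergence below is only one common choice, and the names `softmax` and `distillation_loss` are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    teacher_logits: outputs of the multi-sensor model (assumed frozen).
    student_logits: outputs of the lighter RGB-only model for the same clip.
    The T**2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when the RGB student reproduces the teacher's distribution and grows as the two diverge; in practice it would typically be combined with a supervised loss on the ground-truth labels.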

2.3 Social interaction understanding: a‌ challenging task

The major challenge in semantic interpretation‌ of dynamic scenes is to bridge the gap‌ between the task dependent interpretation of data and‌ the flood of measures provided by sensors. The‌ problems we address range from physical object detection,‌ activity understanding, activity learning to vision system design‌ and evaluation. The two principal classes of human‌ activities we focus on are assistance to older‌ adults and video analytics.

Typical examples of complex activity are shown in Figure 1 and Figure 2 for a homecare application (see the Toyota Smarthome Dataset). In this example, the monitoring of an older person's apartment could last several months. The activities involve interactions between the observed person and several pieces of equipment. The application goal is to recognize everyday activities at home through formal activity models and data captured by a network of sensors embedded in the apartment. Here, typical services include an objective assessment of the frailty level of the observed person, in order to provide more personalized care and to monitor the effectiveness of a prescribed therapy. The assessment of the frailty level is performed by an Activity Recognition System, which transmits a textual report (containing only meta-data) to the general practitioner who follows the older person. Thanks to the recognized activities, the quality of life of the observed people can thus be improved and their personal information can be preserved.

Figure‌‌ 1: Homecare monitoring: the large diversity of‌ activities collected in a‌ three-room apartment

Figure 2: Homecare‌‌ monitoring: the annotation of a composed activity "Cook",‌ captured by a video‌ camera

The ultimate goal is for cognitive systems to perceive and understand their environment so that they can provide appropriate services to a potential user. An important step is to propose a computational representation of people's activities in order to adapt these services to them. Up to now, the most effective sensors have been video cameras, due to the rich information they can provide on the observed environment. However, these sensors are currently perceived as intrusive. A key issue is to capture the pertinent raw data for adapting the services to the people while preserving their privacy. We study different solutions, including the local processing of the data without transmission of images and the use of new compact sensors developed for interaction (also called RGB-Depth sensors, an example being the Kinect) or networks of small non-visual sensors.

2.4 International and‌ Industrial Cooperation

Our work‌ has been applied in‌‌ the context of more than 10 European projects‌ such as COFRIEND, ADVISOR,‌ SERKET, CARETAKER, VANAHEIM, SUPPORT,‌‌ DEM@CARE, VICOMO, EIT Health.

We had or have industrial collaborations in several domains: transportation (CCI Airport Toulouse Blagnac, SNCF, Inrets, Alstom, Ratp, Toyota, GTT (Italy)), banking (Crédit Agricole Bank Corporation, Eurotelis and Ciel), security (Thales R&T FR, Thales Security Syst, EADS, Sagem, Bertin, Alcatel, Keeneo), multimedia (Thales Communications), civil engineering (Centre Scientifique et Technique du Bâtiment (CSTB)), computer industry (BULL), software industry (AKKA), hardware industry (ST-Microelectronics) and health industry (Philips, Link Care Services, Vistek).

We have‌ international cooperations with research centers such as Reading‌ University (UK), Idiap (Switzerland), Multitel (Belgium), National Cheng‌ Kung University (Taiwan), National Taiwan University (Taiwan), University‌ of Southern California (USA), University of South Florida‌ (USA), Michigan State University (USA), Chinese Academy of‌ Sciences (China), IIIT Delhi (India), Hochschule Darmstadt (Germany),‌ Fraunhofer Institute for Computer Graphics Research IGD (Germany).‌

3 Research program

Our research objective is related‌ to the recognition of human actions, facial expressions,‌ and body language in social interactions. Therefore we‌ plan to work on two main research axes:‌

Axis 1: Human Interaction Recognition based on body and face analysis,
Axis 2: Data Generation for Augmentation and Anonymization for solving data limitation and privacy issues.

3.1 Axis 1: Human Interaction Recognition‌

Participants: François Brémond, Michal Balazia, Antitza‌ Dantcheva, Monique Thonnat.

3.1.1 Body Language‌ Analysis

Participants: François Brémond, Michal Balazia,‌ Antitza Dantcheva, Monique Thonnat.

Body language has been actively researched by psychologists for decades. Early work by Mehrabian found that, among other signals, backward leaning of the torso is indicative of liking. Previous research has shown that people believe power is expressed with nonverbal cues such as open posture (i.e., no crossed arms or legs), more gesturing, and less self-touching (of both hands and face). Displacement behaviors such as grooming, face touching, or fumbling are related to anxiety and stress regulation. As a consequence of these manifold connections of body language with important personal and social attributes, body language analysis has been a focus of automatic approaches attempting to infer high-level attributes such as emotion, leadership role, or personality type. In contrast to the human science studies discussed above, these automatic approaches commonly lack an explicit intermediate representation of functional bodily behavior categories. Instead, they rely either on a generic feature representation encoding body postures and movements, or on deep learning approaches without a clear, interpretable internal structure. While such representations can be effective in prediction scenarios, they often lack interpretability and may miss subtle but meaningful differences, e.g., between fumbling and scratching.

Recognition of Actions and Body Language.

RGB-based human action recognition has often been addressed by three main approaches. Two-stream 2D Convolutional Neural Networks (CNN) generally contain two 2D CNN branches taking different input features extracted from the RGB videos for action recognition. Recurrent Neural Networks (RNN) usually employ 2D CNNs as feature extractors for an LSTM (i.e., Long Short-Term Memory) model. 3D CNN-based methods extend 2D CNNs to 3D structures to simultaneously model the spatial and temporal context information in videos, which is crucial for action recognition. For instance, one two-stream 2D CNN architecture divides each video into three segments and processes each segment with a two-stream network, fusing the individual classification scores by average pooling to produce the video-level prediction of the action class. Also, the two-stream Inflated 3D CNN (I3D) inflates the convolutional and pooling kernels of a 2D CNN with an additional temporal dimension to process a 3D block of pixels at once. The transformer architecture, originally designed for natural language processing, has recently been extended to computer vision tasks to recognize human activities. In contrast to action recognition, which typically considers freely moving people, the limited work on body language recognition has addressed more constrained social interaction scenarios. We observe that the common denominator of body language analysis methods is the use of a general action recognition method without handling the specificity of body language, such as subtle motions or micro facial expressions.
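
Two of the mechanisms mentioned above, averaging per-segment scores into a video-level prediction and inflating a 2D kernel along time, can be sketched in a few lines. This is a minimal sketch using plain Python lists; the function names are illustrative, and the division by the number of time steps during inflation is an assumption borrowed from the usual way 3D filters are bootstrapped from 2D ones, not something stated in this report.

```python
def segment_consensus(segment_scores):
    """Average per-segment class scores into one video-level score vector.

    segment_scores: list of score vectors, one per video segment,
                    each of length num_classes.
    """
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [
        sum(scores[c] for scores in segment_scores) / num_segments
        for c in range(num_classes)
    ]


def inflate_kernel(kernel_2d, time_steps):
    """Turn a 2D kernel (rows x cols) into a 3D kernel (time x rows x cols).

    The 2D weights are repeated along the new temporal axis and divided by
    time_steps, so that a static video (the same frame repeated) yields the
    same response as the original 2D kernel on a single frame.
    """
    return [
        [[weight / time_steps for weight in row] for row in kernel_2d]
        for _ in range(time_steps)
    ]
```

With three segments scored `[0.2, 0.8]`, `[0.6, 0.4]`, and `[0.4, 0.6]`, the consensus is approximately `[0.4, 0.6]`, so the second class is predicted for the whole video.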

To summarize, these body‌‌ language analysis methods enable us to measure objectively‌ the behavior of humans‌ by recognizing their Activities‌‌ of Daily Living (ADL), their emotions, eating habits,‌ and lifestyle. Human behavior‌ can be modeled by‌‌ learning from a large number of data, collected‌ from a variety of‌ sensors, to improve and‌‌ optimize, for instance, the quality of life of‌ people suffering from behavior‌ disorders, such as anxiety‌‌ or apathy. In previous work, STARS successfully detected‌ the everyday life activities‌ performed by an individual‌‌ living alone at home and we were able,‌ for instance, to detect‌ breakfast activities, such as‌‌ “preparing coffee”, and “cutting bread”, with sufficient accuracy‌ 19, 13,‌ 15.

3.1.2 Face‌‌ Analysis and Emotion Recognition

Participants: François Brémond,‌ Michal Balazia, Antitza‌ Dantcheva.

An emotion is a mental state that arises spontaneously and is often accompanied by cognitive, physical, and physiological changes. Due to the complexity of human reactions, emotion recognition is still limited and remains the target of much scientific research. In fact, Emotion Recognition is a highly multidisciplinary field where psychology meets deep learning. Emotions are typically divided into basic categories, as theorized by Ekman, who identified basic discrete emotions. Such categorization has been extended by considering the interconnection between emotions and multiple intensities.

Predicting emotions has been attempted via facial expression analysis in videos, which has been widely adopted both in research and in industry owing to its ease of use with just a camera. However, the accuracy of computer vision algorithms, such as CNNs, is typically limited in identifying real emotions. Facial micro-expression recognition recently reported state-of-the-art performance when implemented with a transformer-based architecture. While the FaceReader system, launched in late 2005, is used worldwide in institutes and companies, it still has limitations related to image quality and facial angle. Other main open challenges in the field are small available datasets and subjective annotations. Typical datasets range from a few hundred to a few thousand videos, and the annotations are often noisy due to the complexity of human behavior. A person may be happy even when not smiling, and people differ widely in how expressive they are in showing their inner emotions. Emotion annotations are therefore very subjective, and this subjectivity needs to be adequately addressed. Moreover, emotions have multiple nuances, with different intensities.

Regarding emotional models, various architectures have been used, such as RNNs, LSTMs, and CNNs, with the aim of capturing spatio-temporal information. To improve recognition accuracy, multimodal transformers have been introduced, exploiting self- and cross-attention. Knowledge distillation from multimodal to unimodal (video) transformers has been reported, to reduce the acquisition complexity at inference time. The state of the art is achieved today with multimodal transformers, using video, audio, and language cues. Here, the video and the audio are processed by small transformer encoders that receive as input features from models pre-trained on other datasets. The model extracting features is frozen and therefore cannot be adapted to a new targeted dataset. For the video transformer, the inputs are fixed representations, such as DLN features, IResNet and DenseNet features, Facet/OpenFace features, R(2+1)D-152 features, and landmarks and action units. Such feature extractors and shallower encoders are typically used when small datasets are targeted. The main limitations of this approach are twofold: first, frozen representations are less appropriate for raw data than end-to-end trainable models; second, smaller models are less accurate for recognizing specific expressions. In order to use raw data and bigger encoders, proper pre-training is needed to limit overfitting. While self-supervised techniques, such as VideoMAE, can be used for that purpose, they may miss the fine details necessary to recognize facial micro-expressions. They are therefore not well adapted to the emotion recognition task.
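
The self- and cross-attention used by the multimodal transformers mentioned above both reduce to the same scaled dot-product operation; they differ only in where the queries, keys, and values come from. The sketch below is a simplified illustration that omits the learned projection matrices and multiple heads of a real transformer, and the token values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors.

    Each query attends to every key; the output for a query is the
    attention-weighted average of the corresponding values.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical token features for two modalities (same embedding size).
video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[0.5, 0.5], [1.0, -1.0]]

self_attended = attention(video_tokens, video_tokens, video_tokens)
cross_attended = attention(video_tokens, audio_tokens, audio_tokens)
```

In the cross-attention call, each video token is rewritten as a mixture of audio tokens, which is how one modality can be conditioned on another; the output keeps one vector per video token.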

3.1.3 Multimodal Recognition‌ of Human Interactions

Participants: François Brémond, Michal‌ Balazia, Monique Thonnat.

Behavior traits can be detected in self-presentation videos based on acoustic and visual non-verbal features such as pitch, intensity, movement, head orientation, posture, fidgeting, and eye gaze. According to 1, 2, modalities such as audiovisual, text, and demographic features are important for personality prediction. Emotion recognition has generated specific approaches for multimodal data processing. Deep bimodal models give state-of-the-art results on Multimodal Language Analysis in the Wild. It has been shown that body gestures, head movements, expressions, and speech lead to an effective diagnosis of apathy. Few models have dealt with trimodal fusion of features. Although multimodal approaches are commonly used to recognize personality traits, there is no comprehensive method to optimize and combine the considerable number of informative features. All modality features may be concatenated together for behavior prediction; this approach is referred to as early fusion. However, most multimodal approaches perform late fusion on heterogeneous data, as it outperforms other techniques. Present research in the field aims to find efficient ways to extract and combine features. We aim to design new approaches able to use all available information in an optimal manner 3. The objective is to develop and test Human Behavior Coding algorithms that use RGB video cameras at test time 13, 1, but multiple modalities at training time, drawing on multiple datasets with various modalities to better characterize human behavior during interactions. As it is challenging to be an expert in all modalities, we will rely on open-source code (when available) or on our partners (when needed) to obtain the most effective backbone models for extracting multi-modal features. For instance, we are collaborating with DFKI (i.e., Deutsches Forschungszentrum für Künstliche Intelligenz) 24 to extract audio and text features for measuring neuropsychiatric symptoms in patients with early cognitive decline. For electrophysiological signals, we are working with the Biorobotic Institute - Scuola Superiore Sant'Anna (Pontedera, Italy) 21 to compute more objective measurements of emotion.
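
The difference between early and late fusion can be made concrete with a small sketch. Early fusion concatenates per-modality features before a single predictor sees them, whereas late fusion combines the outputs of per-modality predictors. The weighted average below is only one possible late-fusion rule, and the function names, modality names, and numbers are illustrative assumptions.

```python
def early_fusion(features_by_modality):
    """Concatenate per-modality feature vectors into one input vector.

    Modalities are visited in sorted name order so the layout is stable.
    """
    fused = []
    for name in sorted(features_by_modality):
        fused.extend(features_by_modality[name])
    return fused


def late_fusion(scores_by_modality, weights=None):
    """Combine per-modality class scores with a weighted average.

    scores_by_modality: dict mapping modality name to a class-score vector.
    weights:            optional dict of non-negative modality weights;
                        defaults to equal weighting.
    """
    names = sorted(scores_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    num_classes = len(scores_by_modality[names[0]])
    return [
        sum(weights[name] * scores_by_modality[name][c] for name in names) / total
        for c in range(num_classes)
    ]
```

For example, `early_fusion({"video": [1, 2], "audio": [3]})` yields a single three-dimensional input, while `late_fusion({"video": [0.2, 0.8], "audio": [0.6, 0.4]})` averages two independent predictions into roughly `[0.4, 0.6]`.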

3.2 Axis 2: Data Generation‌ for Augmentation and Anonymization‌

Participants: Antitza Dantcheva,‌‌ François Brémond.

3.2.1 Data Generation

Participants: Antitza‌ Dantcheva, François Brémond‌.

In the past‌‌ decade, computer vision has witnessed remarkable progress fueled‌ by the triptych of‌ (a) algorithms for training‌‌ computer vision models (e.g., backpropagation), (b) increased computational‌ power (think of powerful‌ graphical processing units (GPUs)),‌‌ but very importantly by (c) increased volumes of‌ training data. For‌ example, millions of facial‌‌ images (i.e., MegaFace) have rapidly driven progress in‌ face recognition, showcasing that‌ better models are empowered‌‌ by bigger data. Even in the occasional abundance‌ of raw data, there‌ is a plethora of‌‌ remaining challenges in designing data-driven intelligence approaches such‌ as deep neural networks‌ (DNNs). These challenges stem‌‌ from the fact that data must be processed;‌ for example, data must‌ be annotated (e.g., annotation‌‌ of facial expressions in facial videos), in order‌ to optimize the millions‌ of network-parameters. To make‌‌ things worse, the curation of large datasets is‌ tedious, costly, time-consuming and‌ is fundamentally bounded by‌‌ the population sizes of such data, as well‌ as by the ever-increasing‌ privacy and usage considerations‌‌ that have been recently highlighted by the General‌ Data Protection Regulation (GDPR).‌ The resulting real data‌‌ and associated real-life datasets are scarce, private, and‌ they inherit human biases.‌ As such, these limitations‌‌ threaten to bring any advances in computer vision‌ to a dramatic halt.‌ Therefore, we are now‌‌ at a point, where the availability of annotated‌ data is the main‌ bottleneck in the development‌‌ of data-hungry DNN models; a bottleneck that far‌ exceeds any algorithmic or‌ computational bottleneck. Based on‌‌ the premise that computer vision data-driven intelligence is‌ heavily influenced by the‌ underlying data, we here‌‌ seek to understand how one can actually create‌ data that will augment‌ the learning space and‌‌ the learning capabilities of computer vision models. 
Generated‌ data or synthetic data‌ provides a promising solution‌‌ to the above challenges, as it is easier‌ to obtain, it is‌ inexhaustible, pre-annotated, and less‌‌ expensive. In addition, synthetic data has the potential‌ to avoid ethical and‌ privacy concerns, as well‌‌ as practical issues related to security. Further, synthetic‌ data brings to the‌ fore unique opportunities, allowing‌‌ for the surgical injection‌ of training data in scenarios where collecting real‌ data may be impractical or impossible (e.g., talking‌ dogs, faces that do not exist, etc.). Indeed‌ synthetic data allows for new training paradigms in‌ computer vision models. We will design methods that‌ allow synthetic data to be dynamically generated, directly‌ as a function of the needs of learning‌ algorithms.

Past attempts for synthetic images and videos.‌

Generative models of images have received unprecedented attention in computer vision, owing to recent breakthroughs in the underlying modeling methodology. The most powerful models today are built on generative adversarial networks (GANs), autoregressive transformers, and most recently diffusion models. Diffusion models (DMs) are neural networks trained to denoise images that have been successively corrupted with Gaussian noise, by learning to reverse this diffusion process. After training, such a model can generate data by simply passing randomly sampled noise through the learned denoising process. This synthesis procedure can be interpreted as an optimization algorithm that follows the gradient of the data density to produce likely samples. During denoising, conditional features such as class labels can be supplied to the network to specialize its sampling process. Such DMs outperform previous generative methods, as they offer robust, stable, and scalable training procedures. DMs are largely unaffected by training limitations such as overfitting, or mode collapse as is the case in GANs. In addition, DMs generally involve fewer parameters than transformer-based counterparts, which typically require massive amounts of data and thus experience a performance plateau. As diverse synthetic data is a primary need for computer vision, DMs have been rapidly adopted in several settings such as image and video generation, image deblurring, high-resolution image generation, and image editing.
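
The noising side of a diffusion model is simple enough to sketch directly. The code below assumes the common formulation in which a sample at step t is a weighted mix of the clean data and Gaussian noise, with weights given by the cumulative product of (1 - beta) over a linear schedule; the schedule values and function names are assumptions for illustration and do not describe any specific model discussed here. The denoising network, which would be trained to predict the returned noise, is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances increasing linearly from beta_start to beta_end."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal kept at each step."""
    kept = 1.0
    result = []
    for beta in betas:
        kept *= 1.0 - beta
        result.append(kept)
    return result


def noise_sample(x0, t, alpha_bars, rng):
    """Jump directly to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps
```

A training step would typically draw a random `t`, call `noise_sample`, and penalize the network for mispredicting `eps`; generation then runs the learned denoiser from pure noise back to data.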

Challenges in‌ video generation.

However, while the image domain has seen great progress, video has proven to be more challenging due to (i) the significant computational costs associated with training on video data, and (ii) the lack of large-scale, general, and publicly available video datasets. Regarding the computational challenge (i), training current state-of-the-art image generation models is already extremely expensive, which makes it exceedingly hard to generate videos, particularly videos of variable length. Regarding the data challenge (ii), while image generation can draw on datasets with billions of images, video datasets are much smaller (e.g., the VoxCeleb2 dataset contains about 1M videos) and thus cannot support the higher complexity of open-domain videos.

Limited settings of‌ generated videos. Very recently, video generation methods such‌ as DM-based Imagen Video and Make-a-Video, showcased the‌ stunning potential of generative AI. However, to date,‌ the generated videos remain heavily constrained in quality,‌ resolution, as well as length, mainly due to‌ having video encoders that only encode fixed size‌ videos or encode frames independently. Such video generation‌ methods are further limited as they currently produce results only depicting single‌ persons, performing simple motions‌ in highly constrained settings‌‌ with mostly a neutral background. Crucial in our‌ effort will be our‌ goal of generating videos‌‌ that encompass complex settings of multiple subjects, able‌ to interact in front‌ of a non-uniform background.‌‌

Control. While we are beginning to understand some properties of DMs (for example, that DMs are superior to GANs in terms of reconstruction and encoding), our understanding of the limits of control of such models is still in its infancy. In an effort to control generated images, recent works explored the discovery of semantically meaningful directions in the latent space of pre-trained GANs, where linear navigation corresponds to the desired manipulation of images. In this context, supervised as well as unsupervised approaches were proposed to edit semantics such as facial attributes, colors and basic visual transformations (e.g., rotation and zooming) in generated or inverted real images. The recent introduction of Latent Diffusion Models (LDMs) is a positive development in this direction, as LDMs reduce the heavy computational burden of training on high-resolution images. In addition, our own work revealed, in the context of autoencoder generation models, how to disentangle motion and appearance in videos, as well as how to manipulate decomposed, semantically meaningful motion directions. However, in the context of LDMs, the disentanglement and manipulation of semantic attributes remain a key open research challenge of substantial potential impact, which we will explore.
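The notion of linear navigation in a latent space can be illustrated with a few lines of code: an edit moves a latent code along a fixed direction, and the step size sets the strength of the manipulation. This is only a sketch; the latent code, the direction and the function name below are hypothetical, and a real system would decode each edited code with a pre-trained generator.

```python
def navigate(latent, direction, alpha):
    """Linear latent edit z' = z + alpha * d, with d normalised to unit length."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [z + alpha * d / norm for z, d in zip(latent, direction)]

z = [0.2, -1.0, 0.7]   # hypothetical latent code
d = [0.0, 3.0, 4.0]    # hypothetical semantic direction (e.g. one facial attribute)
# Sweeping alpha yields a family of edits; alpha = 0 leaves the code unchanged.
edits = [navigate(z, d, alpha) for alpha in (-2.0, 0.0, 2.0)]
```

Normalising the direction makes the step size comparable across directions, which is a design choice of this sketch rather than a requirement of the methods discussed above.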

3.2.2‌ Data Augmentation and Anonymization‌

Participants: Antitza Dantcheva,‌‌ François Brémond.

We aim to apply the data generation models proposed in the previous section in two application domains, namely data augmentation and data anonymization, which cater to the needs of Axis 1 (Human Interaction Recognition).

Data augmentation.

The‌ general focus of data-driven‌ computer vision algorithms has‌‌ to do with the automated extraction of patterns‌ by finding complex data‌ representations from large volumes‌‌ of input data without human interference, utilizing the‌ patterns to detect or‌ classify unseen data. The‌‌ powerful twist that we are envisioning is that‌ data generation places full‌ control over the distribution‌‌ of the generated data, thus endowing us with‌ the ability to ensure‌ quality and diversity, while‌‌ saving cost, and mitigating bias. As a consequence,‌ we foresee that such‌ synthetic data will allow‌‌ for nothing less than a paradigm shift in‌ training. For example, as‌ inspired by human systems,‌‌ synthetic data will bring continual, multimodal, interactive, embodied‌ learning to the next‌ level, providing richer and‌‌ more sophisticated representations. This applies directly toward the‌ grand goal of allowing‌ computer vision to approach‌‌ human-level intelligence; a long-term goal that will require‌ the grasping of key‌ concepts related to the‌‌ physical world and its composition, as well as‌ to entail a non-diluted‌ ability to learn continually,‌‌ interactively and multimodally 23‌. We aim to identify entirely new perception‌ models and related learning paradigms, which will exploit‌ synthetic data in an entirely new, efficient and‌ dynamic manner. We consider such models for a‌ variety of recognition settings that can target a‌ broad spectrum of facial behaviors including expressions and‌ micro-expressions. By exploring the fundamental properties of learning‌ with synthetic data, we anticipate computer vision models‌ that generalize onto a large class of human‌ actions.

Data anonymization.

Privacy-preserving data processing has received increased attention in recent years, with challenges relating to anonymizing data while maintaining image quality. The General Data Protection Regulation (GDPR) came into effect on 25 May 2018, affecting all processing of personal data across Europe. GDPR requires regular consent from the individual for any use of their personal data. However, if the data does not allow an individual to be identified, companies are free to use the data without consent. To effectively anonymize images, we require a robust model to replace the original face, without destroying the existing data distribution; that is, the output should be a realistic face fitting the given situation.

Anonymizing images while retaining the original distribution is challenging, as it entails removing all privacy-sensitive information and generating a highly realistic face, while providing a seamless transition between original and anonymized parts. This requires a model that can perform complex semantic reasoning to generate a new anonymized face. For practical use, we want the model to handle a broad diversity of images, poses, backgrounds, and persons. Our proposed solution can successfully anonymize images in a large variety of cases, and create realistic faces conditioned on the given information.

4 Application domains

Video understanding consists‌ of a complex pipeline made of various tasks,‌ such as object detection, people tracking, pose estimation,‌ and event detection. So, many tasks are generic,‌ and can be shared between different application domains.‌ The behavior analysis techniques we develop for other‌ applications (for instance for sport or security domains)‌ can be applied to medical applications and vice-versa.‌

4.1 Medical Applications

Our main motivation, as explained before, is to help clinicians diagnose, monitor and provide pertinent treatment to patients with behavior disorders. The applications we target are not general medical diseases but those related to the brain, and more precisely to psychiatric disorders. These disorders can appear very early in the life of the patient (for instance autism spectrum disorder 4), they can concern adults (depression, bipolar disorder, schizophrenia 25) or the elderly (for instance Alzheimer's disease). We have been working with elderly patients since the creation of the CoBTek joint team in January 2012. More recently, we have extended our study to the two other age categories. We now have clinical trials within these three categories of patients.

4.2‌ Other Applications

Sport applications: Sport is an interesting application domain for human activity understanding for three reasons. First, data are often publicly available and raise fewer ethical concerns than medical data. Moreover, many datasets have been recorded and annotated as part of international challenges (Website Challenges). Second, human activities are complex at the level of individuals, of a team and over time. Third, many companies are interested in funding research to advance the field of human activity understanding for sport. For instance, we have a collaboration with a local company, Fairvision (see Fairvision website on football games).

Security‌‌ applications: The interest and investment in vision-based security‌ systems is large and‌ rapidly growing and is‌‌ fueled by applications ranging from autonomous vehicles to‌ personalization of customer service.‌ Accordingly, numerous companies, military‌‌ and public organizations are interested in research in‌ this context.

4.3 Ethical‌ and Acceptability Issues

The development and ultimate use of novel assistive technologies by a vulnerable user group such as individuals with dementia, and the assessment methodologies planned by STARS, are not free of ethical, or even legal, concerns, even if many studies have shown how these Information and Communication Technologies (ICT) can be useful and well accepted by older people with or without impairments. Thus, one goal of the STARS team is to design the right technologies that can provide the appropriate information to the medical carers while preserving people's privacy. Moreover, STARS pays particular attention to ethical, acceptability, legal and privacy concerns that may arise, addressing them in a professional way following the corresponding established EU and national laws and regulations, especially when outside France. STARS can also benefit from the support of the COERLE (Comité Opérationnel d'Evaluation des Risques Légaux et Ethiques) to help it respect ethical policies in its applications.

As presented in Section 2, STARS aims at designing cognitive vision systems with perceptual capabilities to efficiently monitor people's activities. As a matter of fact, vision sensors can be seen as intrusive, even if no images are acquired or transmitted (only meta-data describing activities need to be collected). Therefore, new communication paradigms and other sensors (e.g., accelerometers, RFID (Radio Frequency Identification), and new sensors to come in the future) are also envisaged to provide the most appropriate services to the observed people, while preserving their privacy. To better understand ethical issues, STARS members are already involved in several ethical organizations.

For‌ addressing the acceptability issues,‌‌ focus groups and HMI (Human Machine Interaction) experts‌ are consulted on the‌ most adequate range of‌‌ mechanisms to interact and display information to older‌ people.

5 Social and‌ environmental responsibility

5.1 Footprint‌‌ of research activities

We have limited our travel by reducing our physical participation in conferences and in international collaborations.

5.2 Impact of research results‌

We have been involved for many years in promoting public transportation by improving safety onboard and in stations. Moreover, we have been working on pedestrian detection for self-driving cars, which will also help reduce the number of individual cars.

6 Highlights of the year‌

6.1 Awards

Antitza Dantcheva was appointed 3IA chair.‌
Monique Thonnat has been nominated Coordinatrice Alpes-Maritimes for the Fondation Un Avenir Ensemble (FUAE) of the Grande Chancellerie de la Légion d'Honneur (Website Fondation). The foundation's objective is to promote social mobility by offering recipients of national honors the opportunity to mentor deserving and motivated students from high school to higher education and entry into working life.

6.2 Major results

A first line of work consisted of releasing novel tracking algorithms that can reliably track people through a video stream. These algorithms can combine bounding box detection with pixel masks to significantly improve the quality of tracking and to track people on a long-term basis.
During this period, several novel activity recognition algorithms have also been designed for Activities of Daily Living (ADLs) in real-world settings. These algorithms achieved the best performance on all relevant action datasets. Previously, these algorithms were built in more or less supervised settings. Thus, we have proposed new algorithms for action detection in a weakly supervised setting with only video-level labels. These algorithms can reliably detect specific events with their time of occurrence within untrimmed videos.
We have also improved the quality and‌ the capacity of action recognition algorithms by processing‌ long videos with a duration of more than‌ 10 minutes. For that, we have designed new‌ adapters that can be plugged into strong video‌ backbones and thus necessitate only retraining the adapters,‌ which reduces the training time and enables a‌ training process with videos of a much longer‌ duration.
We have also designed novel algorithms for‌ video action anticipation that can detect some possible‌ events after having observed only a limited amount‌ of normal video streams.
All these algorithms have‌ been successfully evaluated on the main international benchmarks‌ and also on video datasets depicting patients with‌ cognitive disorders in order to help doctors to‌ better monitor their patients.

7 Latest software developments,‌ platforms, open data

7.1 Open data

We have‌ provided two benchmark datasets.

Stress ID Dataset: a‌ Multimodal Dataset for Stress Identification

Contributors:
Hava Chaptoukaev‌ , Valeriya Strizhkova , Michele Panariello , Bianca‌ Dalpaos , Aglind Reka , Valeria Manera ,‌ Susanne Thummler , Esma Ismailova , Nicholas Evans‌ , François Brémond , Massimiliano Todisco , Maria‌ A Zuluaga , Laura M Ferrari .
Description:‌
It contains RGB facial video, audio and physiological‌ signals (ECG, EDA, Respiration). Different stress-inducing stimuli are‌ used: emotional video-clips, cognitive tasks and public speaking.‌ The total dataset consists of recordings from 65‌ participants that performed 11 tasks. Each task is‌ labeled by the subjects in terms of stress,‌ relaxation, arousal, and valence. The experimental set-up ensures‌ synchronized, high-quality, and low noise data.
Dataset PID‌ (DOI,...):
NeurIPS 2023
Project link:
Stress ID dataset‌
Publications:
StressID: a Multimodal Dataset for Stress Identification Thirty-seventh Conference on Neural‌ Information Processing Systems Datasets‌ and Benchmarks Track 2023‌‌ 7
Contact:
stressid.dataset@inria.fr
Release contributions:
The Dataset is‌ licensed for non-commercial scientific‌ research purposes.

Toyota Smarthome‌‌ Datasets: Real-World Activities of Daily Living.

Contributors:
Rui‌ Dai , Srijan Das‌ , Saurab Sharma ,‌‌ Luca Minciullo , Lorenzo Garattoni , François Brémond‌ , Gianpiero Francesca .‌
Description:
Smarthome has been recorded in an apartment equipped with 7 Kinect v1 cameras. It contains the common daily living activities of 18 subjects. The subjects are seniors aged 60 to 80. The dataset has a resolution of 640×480 and offers 3 modalities: RGB + Depth + 3D Skeleton. The 3D skeleton joints were extracted from RGB. For privacy-preserving reasons, the subjects' faces are blurred. Currently, two versions of the dataset are provided: Toyota Smarthome Trimmed and Toyota Smarthome Untrimmed.
Dataset PID (DOI,...):
10.1109/TPAMI.2022.3169976
Project link:‌
Toyota Smarthome datasets
Publications:‌
Toyota Smarthome Untrimmed: Real-World‌‌ Untrimmed Videos for Activity Detection, PAMI 2022 18‌.
Contact:
toyotasmarthome@inria.fr
Release‌ contributions:
The Dataset is‌‌ licensed for non-commercial scientific research purposes.

8 New‌ results

This year Stars‌ has proposed new results‌‌ related to its two main research axes: (i)‌ Human Interaction Recognition and‌ (ii) Data Generation for‌‌ Augmentation and Anonymization.

Human Interaction Recognition

Participants: François‌ Brémond, Antitza Dantcheva‌, Michal Balazia,‌‌ Monique Thonnat, Baptiste Chopin, Di Yang‌, Abid Ali,‌ Olivier Huynh, Tomasz‌‌ Stanczyk, Sanya Sinha, Mohammed Guermal,‌ Tanay Agrawal, Snehashis‌ Majhi, Aglind Reka‌‌.

The new results for Human Interaction Recognition‌ are:

No Train Yet‌ Gain: Towards Generic Multi-Object‌‌ Tracking in Sports and Beyond (see 8.1)‌
Does Re-ID Really Help‌ in Multi-Object Tracking? (see‌‌ 8.2)
CM3T: Framework for Efficient Multimodal Learning‌ for Inhomogeneous Interaction Datasets‌ (see 8.3)
Are‌‌ Attention Maps Richer than we Imagined for Action‌ Recognition? (see 8.4)‌
Scaling Action Detection: AdaTAD++‌‌ with Transformer-Enhanced Temporal-Spatial Adaptation (see 8.5)
SKI‌ Models: SKeleton Induced Vision-Language‌ Embeddings for Understanding Activities‌‌ of Daily Living (see 8.6)
LLAVIDAL :‌ A Large LAnguage VIsion‌ Model for Daily Activities‌‌ of Living (see 8.7)
Human-Centric Video Understanding:‌ From Single-Modality to Multi-Modal‌ Learning (see 8.8)‌‌
B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach‌ to Micro-Action Recognition (see‌ 8.9)
Loose Social-Interaction‌‌ Recognition in Real-world Therapy Scenarios (see 8.10)‌
Just Dance with π!,‌ A Poly-modal Inductor for‌‌ Weakly-supervised Video Anomaly Detection (see 8.11)
Mixture‌ of Experts Guided by‌ Gaussian Splatters Matters: A‌‌ new Approach to Weakly-Supervised Video Anomaly Detection (see‌ 8.12)
Denoise, Divide, Distill, and Predict ($\mathcal{D}^{3}\mathcal{P}$): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy (see 8.13)
Not All‌‌ Blends Are Equal: The BLEMORE Dataset of Blended‌ Emotion Expressions with Relative‌ Salience Annotations (see 8.14‌‌)
The INEMO Dataset: A Multimodal Benchmark of‌ Physiological and Behavioral Responses‌ to Social Media and‌‌ Film Stimuli (see 8.15‌)
EEG Classification with Limited Data: A Deep‌ Clustering Approach. (see 8.16)
MEPHESTO: Multimodal Phenotyping‌ of Psychiatric Disorders from Social Interaction (see 8.17‌)
MultiMediate'25: Cross-Cultural Multi-domain Engagement Estimation (see 8.18‌)
Stress Estimation in Dancers for Injury Prevention‌ (see 8.19)
Emotion Recognition using Deep Learning‌ (see 8.20)
Identifying Surgical Instruments in Pedagogical‌ Cataract Surgery Videos through an Optimized Aggregation Network‌ (see 8.21)
TBDM: Temporal Boundary Distillation Module‌ for Surgical Gesture Segmentation (see 8.22)
Effective‌ Video Feature Extraction for Training and Comprehension: Human-Centered‌ Multimodal Video (see 8.23)

Data Generation for‌ Augmentation and Anonymization

Participants: François Brémond, Antitza‌ Dantcheva, Baptiste Chopin, Nabyl Quignon,‌ Charbel Yahchouchi, Anil Egin, Michal Balazia‌, Di Yang, Valeriya Strizhkova.

The‌ new results for Data Generation for Augmentation and‌ Anonymization are:

Rotation-Induced Centroid Shift in Latent Space‌ (see 8.24)
Dual Volume Skeleton-Guided 3D Face‌ Reconstruction from Sparse Views (see 8.25)
Turbo‌ Learning: 3D Face Reconstruction with Mesh Re-Projection and‌ Re-Identification Consistency (see 8.26)
THEval. Evaluation Framework‌ for Talking Head Video Generation (see 8.27)‌
Beyond Real versus Fake Towards Intent-Aware Video Analysis‌ (see 8.28)
AI killed the video star.‌ Audio-driven diffusion model for expressive talking head generation‌ (see 8.29)
LIA-X: Interpretable Latent Portrait Animator‌ (see 8.30)
Simplicity-Bias-Aware Adaptation of Foundation Models‌ for Deepfake Detection (see 8.31)
Now You‌ See Me, Now You Don't: A Unified Framework‌ for Expression Consistent Anonymization in Talking Head Videos‌ (see 8.32)
Beyond the visible: A survey‌ on cross-spectral face recognition (see 8.33)

8.1‌ No Train Yet Gain: Towards Generic Multi-Object Tracking‌ in Sports and Beyond

Participants: Tomasz Stanczyk, Seongro Yoon, François Brémond.

We proposed‌ McByte 46, a novel tracking-by-detection framework that‌ enhances multi-object tracking (MOT) by integrating temporally propagated‌ segmentation masks as an additional association cue. The‌ key objective was to improve robustness and generalization‌ in challenging sports scenarios - characterized by fast‌ motion, occlusions, blur, and camera shifts - without‌ requiring any training or per-sequence parameter tuning.

Starting‌ from a strong ByteTrack-based baseline, we designed a‌ pipeline that combines Kalman filter motion prediction, IoU-based‌ matching, and a pre-trained mask temporal propagation model.‌ The propagated masks are not used blindly; instead,‌ we introduced regulated policies that activate mask-based guidance‌ only in well-defined situations - namely ambiguity (multiple‌ plausible associations) and isolation (failure of IoU-based matching).‌ This controlled fusion ensures that the mask cue‌ strengthens association decisions while avoiding instability caused by‌ unreliable mask predictions.

Figure 3 illustrates the full‌ tracking pipeline, showing how bounding-box predictions, detections, and‌ temporally propagated masks are jointly integrated into a‌ unified association cost matrix solved via Hungarian matching.‌ This design allows McByte to preserve the strengths‌ of tracking-by-detection while benefiting from the spatial coherence‌ provided by mask propagation.
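The gated use of the mask cue in the association cost can be sketched as follows. This is an illustrative toy, not the McByte implementation: the IoU gate value, the way the mask score is subtracted from the cost, and the `mask_scores` input (a hypothetical track-to-detection mask overlap in [0, 1]) are all assumptions, and an exhaustive search stands in for the Hungarian algorithm so that the example needs only the standard library.

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def association_cost(track_boxes, det_boxes, mask_scores, iou_gate=0.3):
    """Cost 1 - IoU; the mask cue is applied only to ambiguous or isolated tracks."""
    cost = []
    for i, track_box in enumerate(track_boxes):
        ious = [iou(track_box, det_box) for det_box in det_boxes]
        plausible = sum(v >= iou_gate for v in ious)
        use_mask = plausible != 1  # 0 candidates: isolation, 2 or more: ambiguity
        row = []
        for j, v in enumerate(ious):
            c = 1.0 - v
            if use_mask:
                c -= mask_scores[i][j]  # assumed form of the mask guidance
            row.append(c)
        cost.append(row)
    return cost

def assign(cost):
    """Optimal one-to-one assignment by exhaustive search.

    Stand-in for Hungarian matching; assumes at least as many detections
    as tracks and only a handful of each.
    """
    n_tracks, n_dets = len(cost), len(cost[0])
    best = min(permutations(range(n_dets), n_tracks),
               key=lambda p: sum(cost[i][p[i]] for i in range(n_tracks)))
    return list(best)

# Two overlapping players: IoU alone is ambiguous, the mask cue decides.
tracks = [(0, 0, 10, 10), (4, 0, 14, 10)]
dets = [(2, 0, 12, 10), (3, 0, 13, 10)]
masks = [[0.1, 0.9], [0.9, 0.1]]
matches = assign(association_cost(tracks, dets, masks))
```

In this toy scene both tracks have two plausible detections, so the mask scores are used and flip the assignment that IoU alone would choose; for well-separated tracks with a single plausible detection the mask scores are ignored.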

Figure‌‌ 3: The overview of the McByte tracking‌ pipeline.

We conducted extensive‌ ablation studies to analyze‌‌ the impact of each design choice, demonstrating that‌ uncontrolled use of masks‌ can degrade performance, whereas‌‌ carefully gated mask usage yields consistent gains. Qualitative‌ results further show McByte’s‌ ability to maintain identities‌‌ through heavy occlusions and motion blur. In particular,‌ Fig. 4 highlights challenging‌ football scenarios where McByte‌‌ successfully preserves tracklets that baseline methods fail to‌ maintain due to abrupt‌ camera motion and degraded‌‌ visual quality.

We evaluated McByte on four diverse‌ datasets - SportsMOT, DanceTrack,‌ SoccerNet-tracking 2022, and MOT17‌‌ - using standard MOT metrics (HOTA, IDF1, MOTA).‌ Across all benchmarks, McByte‌ consistently outperformed strong tracking-by-detection‌‌ baselines, especially in sports datasets, while remaining competitive‌ on pedestrian tracking. Importantly,‌ these improvements were achieved‌‌ without training, dataset-specific tuning, or additional annotations, demonstrating‌ the method’s generality and‌ practical value.
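For readers less familiar with these metrics, the sketch below restates their standard definitions in code. The counts in the usage lines are made-up numbers for illustration only and are not results from this work; full HOTA additionally requires computing detection and association accuracy at each localisation threshold, which is omitted here.

```python
import math

def mota(num_gt, false_negatives, false_positives, id_switches):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), computed on identity matches."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

def hota(det_a_per_threshold, ass_a_per_threshold):
    """HOTA: mean over thresholds of sqrt(DetA * AssA) at each threshold."""
    scores = [math.sqrt(d * a)
              for d, a in zip(det_a_per_threshold, ass_a_per_threshold)]
    return sum(scores) / len(scores)

example_mota = mota(num_gt=1000, false_negatives=100, false_positives=50, id_switches=10)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
example_hota = hota([0.64], [0.81])
```

MOTA is dominated by detection errors, IDF1 by identity consistency, and HOTA balances the two, which is why the three are usually reported together.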

Overall, this‌‌ work introduces a generic, training-free MOT framework that‌ bridges the gap between‌ detection-based and mask-based tracking,‌‌ offering a robust solution applicable across sports and‌ non-sports domains.

Figure 4 (a, b): Example comparison with the baseline in a challenging football setting. McByte can maintain the tracklets of the players blurred by the abrupt camera movement (indicated by yellow arrows).

8.2 Does‌ Re-ID Really Help in‌ Multi-Object Tracking?

Participants: Tomasz Stanczyk, François Brémond.

We conducted a‌ systematic and critical analysis‌ of the role of‌‌ person re-identification (re-ID) in multi-object tracking (MOT) 49‌. While re-ID is‌ widely assumed to improve‌‌ association quality, its actual contribution in practical tracking‌ pipelines remains unclear. Our‌ goal was to rigorously‌‌ evaluate when, how, and to what extent re-ID‌ genuinely benefits MOT performance.‌

We focused our study‌‌ on the widely used‌ BoT-SORT tracking framework and evaluated multiple re-ID configurations,‌ including re-ID trained on the target dataset, re-ID‌ trained on external datasets, and a strong generic‌ re-ID model. Experiments were conducted on the MOT17‌ validation set, using both ground-truth detections and realistic‌ detector outputs to disentangle the effects of detection‌ quality from appearance-based association.

Beyond standard tracking evaluations,‌ we introduced a custom re-ID assessment protocol tailored‌ to tracking. This protocol directly measures correct and‌ incorrect inter-frame matches produced by re-ID, enabling a‌ deeper understanding of re-ID behavior in realistic tracking‌ scenarios. We analyzed cosine distance distributions, match accuracy,‌ and failure modes across sequences with varying crowd‌ density, occlusion patterns, and bounding-box sizes.
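A protocol of this kind can be sketched in a few lines: each detection in the current frame is matched to its nearest embedding in the previous frame under cosine distance, and the match is then scored against the ground-truth identity. The function names, the threshold values and the toy embeddings below are assumptions for illustration; the actual protocol used in the study may differ in its details.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def assess_reid_matches(previous, current, threshold):
    """Count correct, incorrect and rejected inter-frame re-ID matches.

    previous / current: lists of (ground_truth_identity, embedding).
    A match is rejected when the nearest distance exceeds the threshold.
    """
    counts = {"correct": 0, "incorrect": 0, "rejected": 0}
    for identity, embedding in current:
        best_id, best_dist = min(
            ((pid, cosine_distance(embedding, pemb)) for pid, pemb in previous),
            key=lambda pair: pair[1],
        )
        if best_dist > threshold:
            counts["rejected"] += 1
        elif best_id == identity:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts

prev_frame = [("A", [1.0, 0.0]), ("B", [0.0, 1.0])]
curr_frame = [("A", [0.9, 0.1]), ("B", [0.6, 0.5]), ("C", [-1.0, -1.0])]
loose = assess_reid_matches(prev_frame, curr_frame, threshold=0.3)
strict = assess_reid_matches(prev_frame, curr_frame, threshold=0.2)
```

In this toy example the embedding of identity B has drifted towards A, so the looser threshold produces an incorrect match while the stricter one rejects it, a small instance of the threshold sensitivity that such a protocol is meant to expose.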

Our results‌ show that re-ID often provides only marginal gains‌ and, in several scenarios, can even degrade tracking‌ performance, especially when bounding boxes are small, heavily‌ occluded, or visually ambiguous. We further demonstrated that‌ tuning re-ID similarity thresholds is non-trivial and highly‌ sequence-dependent, undermining the robustness and general applicability of‌ re-ID-based association.

To mitigate these issues, we explored‌ constraints on re-ID usage, such as filtering based‌ on occlusion level and minimum bounding-box size. While‌ these constraints reduced incorrect matches in isolation, their‌ impact on full tracking performance remained limited and‌ inconsistent across sequences.
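One simple way to express such constraints is a predicate that decides, per detection, whether the appearance cue may be used at all. The sketch below is an assumption-laden illustration: occlusion is approximated by the largest fraction of the box covered by another box, and the default thresholds are hypothetical values, not those used in the study.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def covered_fraction(box, other):
    """Fraction of `box` that is overlapped by `other`."""
    ix = max(0.0, min(box[2], other[2]) - max(box[0], other[0]))
    iy = max(0.0, min(box[3], other[3]) - max(box[1], other[1]))
    area = box_area(box)
    return (ix * iy) / area if area > 0 else 1.0

def reid_allowed(box, other_boxes, min_height=40.0, max_occlusion=0.5):
    """Use re-ID only for boxes that are tall enough and not too occluded."""
    if box[3] - box[1] < min_height:
        return False
    occlusion = max((covered_fraction(box, o) for o in other_boxes), default=0.0)
    return occlusion <= max_occlusion

clear_person = reid_allowed((0, 0, 20, 60), [])
tiny_person = reid_allowed((0, 0, 20, 30), [])
occluded_person = reid_allowed((0, 0, 20, 60), [(0, 0, 20, 40)])
```

Detections rejected by the predicate would fall back to motion and IoU cues only.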

Overall, this work provides evidence-based‌ insight into the limitations of re-ID in MOT‌ and challenges the assumption that stronger re-ID models‌ automatically lead to better tracking. We conclude that‌ re-ID is not a universally reliable solution for‌ improving MOT and that its effectiveness is strongly‌ conditioned on scene characteristics, detection quality, and careful‌ integration into the tracking pipeline.

This study offers‌ practical guidance for both researchers and practitioners, encouraging‌ more critical and context-aware use of re-ID in‌ future MOT systems.

8.3 CM3T: Framework for Efficient‌ Multimodal Learning for Inhomogeneous Interaction Datasets

Participants: Tanay Agrawal, Mohammed Guermal, Michal Balazia, François Brémond.

Challenges in cross-learning involve inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in NLP (i.e., natural language processing), namely adapters and prefix tuning, we present a new model-agnostic plugin architecture for cross-learning, called CM3T 36, that adapts transformer-based models to new or missing information (see Figure 5). We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient, as the backbone and other plugins do not need to be fine-tuned along with these additions.

Figure 5: This‌ is a representation of the main problem CM3T‌ aims to solve. Backbones pretrained using self-supervised learning‌ provide good general features, thus all methods of‌ fine-tuning work well. In the case of supervised pretraining, adapters fail to‌ perform well (in red)‌ and CM3T is introduced‌‌ to solve this (in green).

Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5) show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters compared to the backbone to process video input and only 22.3% trainable parameters for two additional modalities, we achieve results comparable to or better than the state of the art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.

8.4 Are Attention Maps‌ Richer than we Imagined‌‌ for Action Recognition?

Participants: Tanay Agrawal, Abid Ali, François Brémond.

Deep learning models are becoming more general and robust by the day. Specifically, image foundation models have recently shown exponential growth. We introduce a way to exploit this growth in the field of video classification. The basic idea is that if we have a good understanding of space, we should not require complicated spatio-temporal processing. We introduce the Attention Map (AM) flow, a way to identify the location of local changes between two frames in a video without adding parameters specifically for it. We utilize adapters, which have been growing in popularity in the field of parameter-efficient transfer learning. These help us incorporate AM flow in a pretrained image model without the need to fine-tune it. With just these changes and minimal temporal processing, an image model is able to achieve state-of-the-art results on popular action recognition datasets with low training time and minimal pretraining. This work explores the theory behind this idea and the intricacies involved. Through relevant experiments, we show the efficacy of this method and discuss various ideas to take this work forward. We use the Kinetics-400, Something-Something v2, and Toyota Smarthome datasets and achieve state-of-the-art or comparable results. We also show that video models suffer from extensive pretraining on multiple datasets and long training times, and that our work addresses these problems.

This work has been‌ published at WACV 2025‌‌ 35.

8.5 Scaling Action Detection: AdaTAD++ with‌ Transformer-Enhanced Temporal-Spatial Adaptation

Participants: Tanay Agrawal, Abid Ali, François Brémond.

Temporal Action Detection (TAD) is essential for analyzing long-form videos by identifying and segmenting actions within untrimmed sequences. While recent innovations like Temporal Informative Adapters (TIA) have improved resolution, memory constraints still limit large video processing. To address this issue, we introduce AdaTAD++, an enhanced framework that decouples temporal and spatial processing within adapters, organizing them into independently trainable modules. Our novel two-step training strategy, which first optimizes for high temporal and low spatial resolution and then vice versa, allows the model to utilize both high spatial and high temporal resolutions during inference while maintaining training efficiency. Additionally, we incorporate a more sophisticated temporal module capable of capturing long-range dependencies more effectively than previous methods. Experiments on benchmark datasets, including ActivityNet-1.3, THUMOS14, and EPIC-Kitchens 100, demonstrate that AdaTAD++ achieves state-of-the-art performance. We also explore various adapter configurations, discussing their trade-offs regarding resource constraints and performance, providing valuable insights into their optimal application.

This work‌ has been published at ICCV 2025 38.‌

8.6 SKI Models: SKeleton Induced Vision-Language Embeddings for‌ Understanding Activities of Daily Living

Participants: Arkaprava Sinha, Dominick Reilly, François Brémond, Srijan Das.

The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks. Our code is available on this GitHub page.

This work has been published at‌ AAAI 2025 45.

8.7 LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Participants: Dominick Reilly, Francois Bremond,‌ Srijan Das.

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS (i.e., RGB and Segmentation) instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art (SOTA) performance across ADL benchmarks.

This‌ work has been published at CVPR 2025 43‌.

8.8 Human-Centric Video Understanding: From Single-Modality to‌ Multi-Modal Learning

Participants: Mahmoud Ali, Di Yang‌, Francois Bremond.

Figure 6: General pipeline of MoVie for action‌ detection. (a) We broaden‌ the views of a‌‌ given observation segment by extracting features for the‌ previous segment and skeleton‌ motion. In this‌‌ stage, we propose a novel Mixed Motion-Visual Encoder,‌ including a Motion Encoder‌ and a Motion-Visual Mixer‌‌ (MVM) inside to mix multi-modal features. (b) We‌ process the history features‌ and mixed motion-visual features‌‌ using a TCN- (i.e., Temporal Convolutional Network) and‌ Transformer-based cross-modal temporal model‌ to obtain frame-level features‌‌ for the observation video segment. Finally, a multi-label‌ classifier is stacked to‌ predict per-frame action categories‌‌ within the observation segment.

Human action recognition is‌ an active research field‌ with significant contributions to‌‌ applications such as home-care monitoring, human-computer interaction, and‌ game control. However, recognizing‌ human activities in real-world‌‌ videos remains challenging, especially when learning effective video‌ representations with a high‌ expressive power to represent‌‌ human spatio-temporal motion, view-invariant actions, complex composable actions,‌ etc. To address this‌ challenge, we made three‌‌ contributions toward learning effective representations that can be‌ applied and evaluated in‌ real-world human action classification,‌‌ retrieval, prediction, detection, and segmentation tasks by transfer‌ learning.

The first contribution‌ (single modality): we improve‌‌ the generalizability of human skeleton motion representation models‌ under the skeleton-only modality.‌ We introduce two novel‌‌ self-supervised learning frameworks based on contrastive learning to‌ learn robust and transferable‌ skeleton representations without relying‌‌ on action labels. By exploiting the inherent spatio-temporal‌ structure of human skeleton‌ sequences, our approach encourages‌‌ discriminative motion representations through instance-level and temporal consistency‌ objectives. Extensive evaluations demonstrate‌ that the proposed frameworks‌‌ improve performance across diverse downstream tasks and scenarios,‌ bridging the gap between‌ controlled 3D laboratory datasets‌‌ (e.g., NTU-RGB-D) and challenging 2D real-world datasets (e.g.,‌ SmartHome), highlighting the strength‌ of SSL (i.e., Self-Supervised‌‌ Learning) for skeleton-based motion understanding.

The second contribution (two modalities): Despite the effectiveness of skeleton-based models in capturing spatial and temporal dynamics, they struggle to recognize fine-grained actions. In particular, they fail to distinguish between semantically similar actions, such as "drinking from a cup" versus "drinking from a bottle", as these models lack access to object-centric and semantic information. To address this, we propose MoVie, shown in Fig. 6, a motion-augmented framework designed to improve real-world human action detection by integrating skeleton motion features with visual information through the Motion-Visual Mixer (MVM) and by incorporating history-aware temporal modeling.

Figure 7: Overview of the T-MOR framework. Given the skeleton sequence $\mathbf{sk}$, it begins with data augmentation to get $\mathbf{sk}^{+}$ to enrich the learning base. The core components include (i) Skeleton Embedding, utilizing a motion encoder $E_M$ to capture nuanced human movements; (ii) Visual Embedding with a pre-trained encoder $E_V$ for video frames $\mathbf{v}$, enhancing the ability to correlate visual cues with motion data; (iii) Text Embedding with a pre-trained encoder $E_T$, applying the textual description $\mathbf{a}$ to refine the comprehension of actions; all three embeddings are followed by projection layers $\phi$ and are then sent to (iv) the Multi-modal Contrastive module, implementing a novel mechanism that synergizes skeleton, visual, and text embeddings to optimize the learning process. Finally, (v) the pre-trained $E_M$ can improve downstream action recognition tasks.

The third contribution (multi-modality): our previous work shows that VLFMs (i.e., Vision-Language Foundation Models) are still far from satisfactory performance in all evaluated tasks, particularly on densely labeled, long video datasets featuring fine-grained activities in complex real-world scenarios. As shown in Fig. 7, we introduce our proposed Transferable skeleton MOtion Representation learning architecture (T-MOR), based on a contrastive motion-video-language pre-training strategy. The pre-trained skeleton model is effective for action classification, segmentation, and zero-shot action recognition tasks.
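The contrastive motion-video-language pre-training described above can be illustrated with a small sketch. It assumes a symmetric InfoNCE objective with equal weights for the skeleton-video and skeleton-text terms and a temperature of 0.07; the report does not specify the loss, so the function names (`info_nce`, `motion_video_language_loss`) and all hyperparameters are hypothetical, and the toy vectors stand in for the projected embeddings of a batch.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] within the batch."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
        loss += log_z - logits[i]
    return loss / len(a)

def motion_video_language_loss(skel, video, text, temperature=0.07):
    """Skeleton-video plus skeleton-text terms, each symmetric (assumed equal weights)."""
    sv = 0.5 * (info_nce(skel, video, temperature) + info_nce(video, skel, temperature))
    st = 0.5 * (info_nce(skel, text, temperature) + info_nce(text, skel, temperature))
    return sv + st

# Toy batch of two samples with 2-D projected embeddings (illustrative values only).
skel = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9]]
text = [[1.0, 0.2], [0.2, 1.0]]

aligned = motion_video_language_loss(skel, video, text)
shuffled = motion_video_language_loss(skel, video[::-1], text[::-1])
print(aligned < shuffled)  # True: correctly paired modalities give the lower loss
```

Minimizing such a loss pulls the skeleton embedding of a clip towards the visual and text embeddings of the same clip and away from those of other clips in the batch.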

Overall, this work‌ contributes to the field of human-centric video understanding‌ by proposing novel methods for skeleton-based action representation‌ learning and general RGB video representation learning. Such‌ representations benefit both action classification and segmentation tasks.‌

8.9 B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter”‌ Approach to Micro-Action Recognition

Participants: Aglind Reka,‌ Nishit Poddar, Diana Borza, Snehashis Majhi‌, Michal Balazia, Francois Bremond.

Micro-action recognition (MAR) presents unique challenges due to the inherently subtle, fleeting, and ambiguous nature of micro-actions. Unlike conventional actions, which are often clearly distinguishable, micro-actions, such as a slight nod, a subtle shift in posture, or a brief glance, are characterized by their fine-grained motion and short duration. These movements often overlap in meaning and arise from reflexes or situational cues, making them difficult to interpret and classify. Additionally, micro-actions are influenced by environmental and social factors, further complicating their recognition.

A significant issue in current approaches is‌ the failure to account for the structured nature‌ of human motion. Micro-actions often originate from specific‌ body parts, such as the head, torso, or‌ limbs, and follow a consistent body-to-action hierarchy. However,‌ most existing models treat these actions as flat‌ categories, overlooking the spatial dependencies between body regions.‌ This oversight leads to difficulties in isolating informative‌ signals from background noise and differentiating between highly‌ similar micro-movements within the same body region. Another‌ challenge lies in the imbalance and variability of‌ micro-action datasets. Datasets like MA-52, SocialGesture, and MPII-GroupInteraction‌ capture a wide range of human movements, from‌ short, dynamic gestures to long, static postures. This‌ variability in temporal scale and class frequency makes‌ it challenging for models to capture rare yet‌ distinctive motion patterns, which are characteristic of micro-actions.‌

To address these challenges, we introduce B-MoE, a body-part-aware Mixture-of-Experts framework (see Figure 8) designed to explicitly model the structured nature of human motion. B-MoE specializes in analyzing motions from localized body regions such as the head, torso, upper limbs, and lower limbs, allowing the model to focus on subtle movements and discriminative cues within each region. By doing so, B-MoE suppresses background interference and enhances the detection of fine-grained motion cues, improving the ability to differentiate between ambiguous action classes. Central to B-MoE is the Macro–Micro Motion Encoder (M3E), shown in Figure 9, a lightweight yet powerful backbone that captures both long-range contextual structure and fine-grained local motion. This dual capability enables the model to effectively recognize both prolonged poses and rapid micro-movements. A cross-attention routing mechanism further enhances the framework by dynamically selecting and fusing informative region-wise semantic cues, as shown in Figure 10, which are then integrated with global motion features. Through this approach, B-MoE effectively addresses the core challenges of MAR (subtlety, ambiguity, and class imbalance) by amplifying fine local cues, suppressing irrelevant regions, and providing complementary semantic and motion evidence. This work was submitted to CVPR 2026.
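As a rough illustration of the cross-attention routing idea, the sketch below uses the global motion feature as a query over the outputs of the body-region experts and concatenates the attended result with the global feature. It is a simplification under stated assumptions: there are no learned query, key, or value projections, a single attention head is used, and the function name `route_and_fuse` and the toy feature values are hypothetical rather than taken from B-MoE.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def route_and_fuse(global_motion, region_feats):
    """Attend from the global motion feature over region-wise expert outputs.

    Returns the per-region routing weights and the fused feature
    (attended region feature concatenated with the global motion feature).
    """
    names = list(region_feats)
    scale = math.sqrt(len(global_motion))
    scores = [
        sum(q * k for q, k in zip(global_motion, region_feats[n])) / scale
        for n in names
    ]
    weights = softmax(scores)
    attended = [
        sum(w * region_feats[n][d] for w, n in zip(weights, names))
        for d in range(len(global_motion))
    ]
    return dict(zip(names, weights)), attended + list(global_motion)

# Toy 4-D features; in this made-up example the head expert agrees most with the global motion.
global_motion = [1.0, 0.0, 0.0, 0.0]
region_feats = {
    "head": [2.0, 0.0, 0.0, 0.0],
    "torso": [0.0, 1.0, 0.0, 0.0],
    "upper_limbs": [0.0, 0.0, 1.0, 0.0],
    "lower_limbs": [0.0, 0.0, 0.0, 1.0],
}
weights, fused = route_and_fuse(global_motion, region_feats)
print(max(weights, key=weights.get))  # head
```

The soft weights play the role of the routing decision: regions whose semantic cues agree with the global motion receive more weight, while the others are down-weighted rather than discarded.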

Figure 8: B-MoE: A dual-stream encoder extracts region-conditioned semantic features with a semantic encoder and global motion features with a global motion encoder. The semantic stream is routed through a region-aware MoE, where each expert specializes in modeling micro-movements within a specific body region. A cross-attention fusion head integrates expert outputs with motion saliency from the global stream, and a transformer-MLP (i.e., MultiLayer Perceptron) classifier produces the final predictions.

Figure 9‌‌: Macro-Micro Motion Encoder (M3E): Input sequence is‌ processed with multi-head self-attention‌ to capture global temporal‌‌ dependencies, followed by an SGP (i.e., Scalable-Granularity Perception)‌ module for fine-grained local‌ motion reasoning. During pretraining,‌‌ a semantic alignment loss aligns learned features with‌ word embeddings of action‌ labels.

Figure‌ 10: Semantic Branch: We segment each frame‌ using SAPIENS, derive the crop around the target‌ body part (upper limb in this example), and‌ apply the corresponding mask to the cropped region.‌ The resulting cropped and masked video is then‌ processed by VideoMAE-V2, pretrained on Kinetics.

Our extensive experiments on three socially contextual micro-action benchmarks (MA-52, MPII-GI, and SocialGesture) demonstrate significant improvements, with gains in macro-F1 of +4.32%, +3.35%, and +1.17%, respectively. These results highlight B-MoE’s robustness in handling class imbalance and its superior performance in recognizing ambiguous, underrepresented, and low-amplitude actions.

8.10 Loose Social-Interaction Recognition‌ in Real-world Therapy Scenarios

Participants: Abid Ali,‌ Monique Thonnat, Francois Bremond.

The computer vision community has explored dyadic interactions for atomic actions such as pushing, carrying-object, etc. However, with the advancement in deep learning models, there is a need to explore more complex dyadic situations such as loose interactions. These are interactions where two people perform certain atomic activities to complete a global action irrespective of temporal synchronization and physical engagement, like cooking-together for example. Analyzing these types of dyadic interactions has several useful applications in the medical domain for social-skills development and mental health diagnosis. To achieve this, we propose a novel dual-path architecture to capture the loose interaction between two individuals. Our model learns global abstract features from each stream via a CNN backbone and fuses them using a new Global-Layer-Attention module based on a cross-attention strategy. We evaluate our model on real-world autism diagnosis data, namely our Loose-Interaction dataset and the publicly available Autism dataset for loose interactions. Our network achieves baseline results on the Loose-Interaction dataset and SOTA results on the Autism dataset. Moreover, we study different social interactions by experimenting on a publicly available dataset, i.e., NTU-RGB+D (interactive classes from both NTU-60 and NTU-120). We have found that different interactions require different network designs. We also compare a slightly different version of our method that incorporates time information to address tight interactions, achieving SOTA results.

This‌ work has been published at WACV 2025 37‌.

8.11 Just Dance with π!, A Poly-modal‌ Inductor for Weakly-supervised Video Anomaly Detection

Participants: Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva Ali, Francois Bremond.

Weakly-supervised methods for video‌ anomaly detection (VAD) are conventionally based merely on‌ RGB spatiotemporal features, which continues to limit their‌ reliability in real-world scenarios. This is because RGB-features‌ are not sufficiently distinctive in setting apart categories‌ such as shoplifting from visually similar events. Therefore,‌ towards robust complex real-world VAD, it is essential‌ to augment RGB spatio-temporal features with additional modalities. Motivated by this, we‌ introduce the Poly-modal Induced‌ framework for VAD: “PI-VAD”‌‌ (or π-VAD), a novel approach that augments RGB‌ representations by five additional‌ modalities. Specifically, the modalities‌‌ include sensitivity to fine-grained motion (Pose), three-dimensional scene‌ and entity representation (Depth),‌ surrounding objects (Panoptic masks),‌‌ global motion (optical flow), as well as language‌ cues (VLM). Each modality‌ represents an axis of‌‌ a polygon, streamlined to add salient cues to‌ RGB. π-VAD includes two‌ plug-in modules, namely the‌‌ Pseudo-modality Generation module and the Cross Modal Induction‌ module, which generate modality-specific‌ prototypical representations and, thereby,‌‌ induce multi-modal information into RGB cues. These modules‌ operate by performing anomaly-aware‌ auxiliary tasks and necessitate‌‌ five modality backbones – only during training. Notably,‌ π-VAD achieves state-of-the-art accuracy‌ on three prominent VAD‌‌ datasets encompassing real-world scenarios, without requiring the computational‌ overhead of five modality‌ backbones at inference.

This‌‌ work has been published at CVPR 2025 40‌.

8.12 Mixture of‌ Experts Guided by Gaussian‌‌ Splatters Matters: A new Approach to Weakly-Supervised Video‌ Anomaly Detection

Participants: Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva, Francois Bremond.

We identify one of the main issues in the formulation of the weakly-supervised video anomaly detection (WSVAD) task. Multi-instance learning (MIL) strikes a balance between fully supervised methods, which exhibit good performance but require costly data annotation, and unsupervised methods, which do not require manual annotations but generally result in worse performance. The core idea of MIL is to create bags containing negative and positive data samples (i.e., snippets of normal and abnormal videos), labeled only at the video level. During training, the model assigns a score between 0 and 1 to each snippet, with 0 indicating a normal snippet and 1 indicating an abnormal snippet. The highest-scoring samples in the normal bag are guided towards 0, allowing the model to learn most normal scenarios correctly. On the other hand, the highest-scoring samples in the abnormal bag are pushed towards 1. As a result, the model is supervised by, and therefore learns, only a few specific instances of anomalous events, ignoring useful information contained in neighboring snippets. Over time, this approach has proved to be powerful but insufficient to train a model to correctly capture the secondary and specific attributes of different anomalous classes. Recent works have identified different auxiliary objectives as priors for the VAD task to optimize the training process.
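The limitation of the MIL formulation can be made concrete with a small sketch. It assumes a top-k binary cross-entropy objective (one common variant of MIL training, not necessarily the exact loss used in the works discussed here); the function names and the toy snippet scores are hypothetical.

```python
import math

def topk_mean(scores, k):
    """Average of the k highest snippet scores in a bag."""
    return sum(sorted(scores, reverse=True)[:k]) / k

def mil_loss(abnormal_bag, normal_bag, k=1, eps=1e-7):
    """Binary cross-entropy on the top-k snippet scores of each bag.

    The top-k scores of the abnormal bag are pushed towards 1 and the
    top-k scores of the normal bag are pushed towards 0.
    """
    p_abn = min(max(topk_mean(abnormal_bag, k), eps), 1.0 - eps)
    p_nor = min(max(topk_mean(normal_bag, k), eps), 1.0 - eps)
    return -math.log(p_abn) - math.log(1.0 - p_nor)

# Snippet-level scores for one abnormal and one normal video (illustrative values).
abnormal = [0.1, 0.2, 0.9, 0.3]
normal = [0.1, 0.05, 0.2, 0.1]
loss_a = mil_loss(abnormal, normal)

# Raising the score of a snippet next to the peak does not change the loss:
# with k=1 only the single highest-scoring snippet of each bag is supervised.
abnormal_neighbor_raised = [0.1, 0.6, 0.9, 0.3]
loss_b = mil_loss(abnormal_neighbor_raised, normal)
print(loss_a == loss_b)  # True
```

Because snippets adjacent to the peak contribute nothing to the objective, lower-scoring parts of the same anomalous event receive no direct supervision.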

Figure‌ 11: Overview of‌‌ the GS-MoE architecture: First, in the feature extraction‌ stage, the video encoder‌ extracts snippet-level features from‌‌ the video, and the task encoder refines them‌ in the anomaly-detection latent‌ space. In the second‌‌ stage, each class-expert is trained only on refined‌ features belonging to its‌ assigned class and to‌‌ the normal class. In the final stage, the‌ gate model collects the‌ scores assigned by each‌‌ expert and compares them with the refined features‌ of the task encoder,‌ producing the final abnormal‌‌ score.

To address this‌ issue, we propose to model the anomalies in‌ a video as Gaussian distributions (see Fig. 11‌), rendering multiple Gaussian kernels in correspondence with‌ peaks detected along the temporal dimension of the‌ scores estimated for abnormal videos. This technique, called‌ Temporal Gaussian Splatting (TGS), creates a more complete‌ representation of an anomalous event over time, including‌ snippets of the anomaly with lower abnormal scores‌ in the training objective. The Gaussian kernels are‌ extracted from the abnormal scores produced by the‌ model.
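A minimal sketch of the idea behind TGS follows: peaks are detected along the temporal axis of the snippet scores of an abnormal video, and a Gaussian kernel is rendered around each peak to form a soft target that also covers neighboring snippets. The peak rule, the kernel width `sigma`, the scaling of each kernel by its peak score, and the combination by maximum are all assumptions made for illustration; the report does not give these details.

```python
import math

def find_peaks(scores, min_height=0.5):
    """Indices of local maxima whose score is at least min_height."""
    peaks = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i < len(scores) - 1 else float("-inf")
        if s >= min_height and s > left and s >= right:
            peaks.append(i)
    return peaks

def splat_gaussians(scores, sigma=1.5, min_height=0.5):
    """Render one Gaussian kernel per detected peak and merge them by maximum."""
    peaks = find_peaks(scores, min_height)
    target = [0.0] * len(scores)
    for p in peaks:
        for t in range(len(scores)):
            g = scores[p] * math.exp(-0.5 * ((t - p) / sigma) ** 2)
            target[t] = max(target[t], g)
    return peaks, target

# Abnormal scores predicted for ten consecutive snippets (illustrative values).
scores = [0.05, 0.1, 0.4, 0.9, 0.35, 0.1, 0.05, 0.2, 0.7, 0.3]
peaks, target = splat_gaussians(scores)
print(peaks)  # [3, 8]
print([round(v, 2) for v in target])
```

In this toy example the snippet at index 2, which scores only 0.4, receives a noticeably higher soft target because it lies next to the peak at index 3, which is how lower-scoring parts of an anomaly enter the training objective.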

An additional challenge is related to the intrinsic differences between abnormal classes. Under the MIL paradigm, the models are trained to learn the difference between normal and abnormal videos, while the specific differences between anomalous classes are overlooked. As a result, these methods mainly focus on coarse-level representations of anomalies that allow us to distinguish between normal and abnormal events, but ignore the fine-grained category-specific cues. Therefore, the more salient anomalies (e.g., an explosion) are likely to be easily detected, while subtle anomalies (e.g., shoplifting) are more likely to be confused with normal events. This constitutes a major limitation of most recent methods based on WSVAD. We address this issue via a Mixture-of-Experts (MoE) architecture, in which each expert is trained to model a single anomaly class, enhancing the specific attributes of each anomaly class that are often overlooked. To further leverage the correlations and differences between anomalies, a gate model mediates between the predictions of each expert and the more coarse-level anomalous features to learn potential interactions between anomalies.

This work has been published at ICCV‌ 2025 48.

8.13 Denoise, Divide, Distill, and Predict (D3P): Towards Forecasting Long-horizon Real-world Anomaly from Normalcy

Participants: Quentin Merilleau, Snehashis Majhi, Antitza Dantcheva, Francois Bremond.

Forecasting abnormal human behavior (AHB) in unconstrained real-world environments is critical for enabling proactive safety interventions 42. Unlike short-term anomaly detection, long-horizon forecasting offers a vital reaction window but remains underexplored due to three core challenges: (i) noisy, complex human–agent interactions; (ii) weak temporal coupling between normal observations and distant anomalies; and (iii) data scarcity limiting the scalability of autoregressive models. To address these, we propose D3P (Denoise, Divide, Distill, and Predict), displayed in Fig. 12, a novel encoder–decoder framework that bridges denoised pasts with distilled autoregressive futures. This work has been accepted for publication at WACV 2026. Our Differential Past Encoder (DiPE) disentangles scene-level and object-level dynamics via differential attention, suppressing irrelevant interactions and enhancing discriminative cues. The Distilled Future Auto-Regressive Decoder (D-FAD) adopts a divide-and-conquer strategy, segmenting future queries into temporal chunks for sequential prediction, while leveraging distillation to balance robustness and latency. We validate our approach on the AHB-F benchmark, the only dataset dedicated to abnormal behavior forecasting, and further integrate D-FAD with several state-of-the-art methods. In all cases, our framework consistently outperforms prior work in both forecasting accuracy and computational efficiency.

Figure 12: Illustration of VAD (Video Anomaly Detection) vs. VAA (Video Anomaly Anticipation): Suppose the current time step is $t$. For online VAD, a parametrized model $f(\theta)$ can predict normal (N) or anomaly (A) for the current $t$ based on the observed time stamps $t-i, \dots, t-1, t$, where $i$ represents the observed duration. However, for our VAA, we predict what kind of anomaly will occur in the future in a range of $[t+1, t+2, \dots, t+k]$, where $k$ represents the anticipation duration. Further, we cover both short- and long-term anticipation to identify the potential re-occurrence of an anomaly in the distant future.

8.14 Not‌ All Blends Are Equal:‌‌ The BLEMORE Dataset of Blended Emotion Expressions with‌ Relative Salience Annotations

Participants:‌ Michal Balazia, Teimuraz‌‌ Saghinadze, Francois Bremond.

(Both the paper and the competition have been accepted at FG 2026.)

Humans often‌‌ experience not just a single basic emotion at‌ a time, but rather‌ a blend of several‌‌ emotions with varying salience. Despite the importance of‌ such blended emotions, most‌ video-based emotion recognition approaches‌‌ are designed to recognize single emotions only. The‌ few approaches that have‌ attempted to recognize blended‌‌ emotions typically cannot assess the relative salience of‌ the emotions within a‌ blend. This limitation largely‌‌ stems from the lack of datasets containing a‌ substantial number of blended‌ emotion samples annotated with‌‌ relative salience. To address this shortcoming, we introduce‌ BLEMORE, a novel dataset‌ for multimodal (video, audio)‌‌ BLended EMOtion REcognition (see Figure‌ 13) that includes‌ information on the relative‌‌ salience of each emotion within a blend.

Figure 13‌: Examples of stills from the video recordings.‌ The actor portrays a combination of anger and‌ fear.

BLEMORE comprises over 3,000 clips from 58‌ actors, performing 6 basic emotions (anger, disgust, fear,‌ happiness, sadness, and neutral) and 10 distinct blends‌ consisting of all pairwise combinations of anger, disgust,‌ fear, happiness, and sadness. All pairwise combinations (see‌ Figure 14) were further conveyed with three‌ different blend conditions:

50/50 = same amount of‌ both emotions (e.g. 50/50 happiness-sadness where both happiness‌ and sadness are expressed in equal proportions)
70/30‌ = the first emotion is more salient than‌ the second emotion (e.g. 70/30 happiness-sadness conveys mainly‌ happiness blended with a tinge of sadness)
30/70‌ = the second emotion is more salient than‌ the first emotion (e.g. 30/70 happiness-sadness conveys mainly‌ sadness blended with a tinge of happiness)
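The label space implied by this design can be enumerated directly. The short sketch below only restates the counts given above (6 single emotions, 10 pairwise blends, 3 blend conditions); the variable names and the dictionary representation of a blend label are illustrative choices, not the dataset's actual annotation format.

```python
from itertools import combinations

# Emotions that can be blended, plus neutral as a single-emotion-only category.
BLENDABLE = ["anger", "disgust", "fear", "happiness", "sadness"]
SINGLE = BLENDABLE + ["neutral"]

# Relative salience (first emotion, second emotion) for each blend condition.
CONDITIONS = [(50, 50), (70, 30), (30, 70)]

pairs = list(combinations(BLENDABLE, 2))
blend_labels = [
    {first: w1, second: w2}
    for first, second in pairs
    for w1, w2 in CONDITIONS
]

print(len(SINGLE), len(pairs), len(blend_labels))  # 6 10 30
print(blend_labels[0])  # {'anger': 50, 'disgust': 50}
```

Each of the 10 emotion pairs therefore appears under three salience conditions, giving 30 distinct blend configurations alongside the 6 single-emotion categories.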

Figure 14: Structure of the full BLEMORE dataset (train and test partitions), which contains single emotions and blended emotions expressed with equal ($=$) and unequal ($<$) salience.

Using this dataset,‌ we conduct extensive evaluations of state-of-the-art video classification‌ approaches on two blended emotion prediction tasks: (1)‌ predicting the presence of emotions in a given‌ sample, and (2) predicting the relative salience of‌ emotions in a blend. Our results show that‌ unimodal classifiers achieve up to 29% presence accuracy‌ and 13% salience accuracy on the validation set,‌ while multimodal methods yield clear improvements, with ImageBind+WavLM‌ reaching 35% presence accuracy and HiCMAE 18% salience‌ accuracy. On the held-out test set, the best‌ models achieve 33% presence accuracy (VideoMAEv2+HuBERT) and 18%‌ salience accuracy (HiCMAE).

The BLEMORE dataset is also the basis of the BLEMORE competition, where participants develop systems to predict the emotions present in each recording and the relative salience of each emotion. To support participation, we provide training data with labels, test data without labels, pre-extracted audio-visual feature embeddings, and baseline unimodal and multimodal classification results. The competition offers the first comprehensive platform for evaluating blended emotion recognition and aims to stimulate methodological innovation in multimodal affective computing.

8.15 The INEMO‌ Dataset: A Multimodal Benchmark of Physiological and Behavioral‌ Responses to Social Media and Film Stimuli

Participants:‌ Wenxin Xiong, Valeriya Strizhkova, Aowen Shi‌, Michal Balazia, Laura Ferrari, Francois‌ Bremond.

The INEMO dataset is a multimodal‌ benchmark designed to study emotional and behavioral responses‌ to influencer-style social media videos and emotion calibration‌ film clips. As shown in Figures 15 and‌ 16, participants complete two tasks (Influencer and‌ Calibration), in which they watch short video clips‌ and then rate their emotions using 1–9 Self-Assessment‌ Manikin (SAM) scales for valence and arousal, as‌ well as provide preference judgments about the videos.‌ During these sessions, multiple synchronized modalities are recorded,‌ including facial video, electrocardiography (ECG), electrodermal activity (EDA),‌ eye tracking and screen activity, all time-aligned and stored in a structured‌ metadata format organized by‌ participant, task and modality.‌‌ This design makes INEMO directly usable for machine‌ learning and deep learning‌ models and positions it‌‌ as a bridge between traditional lab-based affective datasets‌ and more realistic social‌ media scenarios.

Figure 15‌: Overview of the‌ INEMO experiment protocol

The‌‌ image shows a person sitting at a table‌ with medical devices attached‌ to their body. Electrodes‌‌ are placed on their chest and stomach, connected‌ by wires to a‌ device strapped to their‌‌ left wrist. Another device is strapped to their‌ right wrist, with wires‌ connected to electrodes on‌‌ their right hand. The person appears to be‌ in a medical or‌ clinical setting, possibly undergoing‌‌ a diagnostic or therapeutic procedure involving muscle or‌ nerve activity monitoring. (Description‌ generated at January 15th,‌‌ 2026 by Albert AI with the model Mistral-Small-3.2-24B)‌

Figure 16: Overview of the INEMO‌ setup: placement of physiological‌ electrodes.

To evaluate the‌‌ dataset and illustrate its potential for multimodal emotion‌ recognition, classical machine learning‌ models (SVM, Random Forest,‌‌ Gradient Boosting) were trained on handcrafted features extracted‌ from ECG and EDA,‌ with and without video‌‌ features, and compared to a multimodal MVP-based (i.e.,‌ Multimodal for Video and‌ Physio) baseline that jointly‌‌ integrates ECG, EDA and facial video. The best‌ results are obtained with‌ a Gradient Boosting model‌‌ using the combined ECG+EDA+Video configuration, reaching weighted F1-scores‌ of about 0.78 for‌ valence and 0.76 for‌‌ arousal, and accuracies up to 0.80 for valence‌ and 0.70 for arousal.‌ These results confirm that‌‌ the INEMO signals are informative and that the‌ associated classification tasks are‌ learnable, while still leaving‌‌ room for more advanced multimodal modeling approaches.

8.16 EEG Classification with Limited Data: A Deep Clustering Approach

Participants: Mohsen Tabejamaat, Farhood Negin,‌ Francois Bremond.

This work has been published in Pattern Recognition‌ 2025 34.

8.17 MEPHESTO: Multimodal Phenotyping of‌ Psychiatric Disorders from Social Interaction

Participants: Michal Balazia‌, Aowen Shi, Miriana Russo, Francois‌ Bremond.

Identifying objective and reliable markers to‌ tailor diagnosis and treatment of psychiatric patients remains‌ a challenge, as conditions like major depression, bipolar‌ disorder, or schizophrenia are qualified by complex behavior‌ observations or subjective self-reports instead of easily measurable‌ somatic features. Recent progress in computer vision, speech‌ processing and machine learning has enabled detailed and‌ objective characterization of human behavior in social interactions.‌ However, the application of these technologies to personalized‌ psychiatry is limited due to the lack of‌ sufficiently large corpora that combine multimodal measurements with‌ longitudinal assessments of patients covering more than a‌ single disorder. Our multi-centre, multi-disorder longitudinal corpus creation‌ effort MEPHESTO is designed to develop and validate‌ novel multimodal markers for psychiatric conditions. MEPHESTO consists‌ of multimodal audio, video, and physiological recordings as‌ well as clinical assessments of psychiatric patients covering‌ a six-week main study period as well as‌ several follow-up recordings spread across twelve months.

Diagnoses include schizophrenia, depression and bipolar disorder. The dataset does not include control subjects. Each patient contributes 1–8 videos, roughly 5.5 videos on average. In addition to video, the recordings include patients' and clinicians' biosignals: electrodermal activity (EDA), blood volume pulse (BVP), inter-beat interval (IBI), heart rate, temperature, and accelerometer data. Videos are recorded with Azure Kinect and biosignals with Empatica. People do not wear face masks while being recorded, although a large transparent plexiglass panel is used to minimize the transmission of COVID-19. The dataset is confidential, but many patients agreed to the publication of their raw or anonymized data for research purposes. Figure 17 shows a screenshot from a mock recording.

Figure 17: Screenshot of a mock recording with two videos and biosignals. The person on the left represents a clinician and the person on the right a patient. To protect the identity of patients, this mock recording is acted by two clinicians.

This year, we have made‌ three major contributions regarding‌ therapeutic alliance, recognizing depression‌‌ and schizophrenia, and detecting childhood trauma from speech.‌ These contributions are explained‌ in detail in the‌‌ subsections below.

8.17.1 Contextualized Synchrony for Therapeutic Alliance‌

Non-verbal behavioral synchrony has‌ been widely studied as‌‌ an indicator of relational dynamics in clinical interactions‌ and has been shown‌ to exhibit weak to‌‌ moderate associations with therapeutic alliance (TA). However, most‌ existing synchrony measures are‌ computed in a content-agnostic‌‌ manner, implicitly assuming that synchrony occurring at different‌ moments of an interaction‌ contributes equally to the‌‌ development of the therapeutic relationship. This work is‌ motivated by the hypothesis‌ that the relational meaning‌‌ of synchrony is context-dependent, and that linguistic content‌ may play a critical‌ role in determining when‌‌ non-verbal coordination is most relevant to therapeutic alliance.‌ In our setting, TA‌ is assessed at the‌‌ end of each session via a seven-item patient‌ questionnaire capturing liking, perceived‌ helpfulness, feeling understood and‌‌ supported, and ease of sharing personal information, with‌ the global TA score‌ obtained by averaging item‌‌ responses. By integrating semantic information derived from spoken‌ language with non-verbal synchrony‌ measures, this study aims‌‌ to move beyond global, uniform synchrony metrics toward‌ a more fine-grained, context-sensitive‌ understanding of therapist–patient interaction‌‌ dynamics. Non-verbal synchrony was computed at the window‌ level using Motion Energy‌ Analysis (MEA, see Figure‌‌ 18 for an example of patient–therapist MEA time‌ series) and a cross-correlation‌ framework applied to the‌‌ continuous motion energy time series of patient and‌ therapist.

Figure 18: Example of patient–therapist Motion Energy‌ Analysis (MEA) time series over a single therapy‌ session.
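As an illustration of window-level synchrony, the sketch below scores one window by the peak absolute lagged Pearson correlation between the two motion energy series. The exact cross-correlation variant, window size, step and maximum lag used in this work are not stated here, so the function names and all parameter values are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def window_synchrony(patient, therapist, start, size, max_lag):
    """Peak absolute correlation over lags in [-max_lag, max_lag] for one window.

    A positive lag compares patient[t] with therapist[t + lag].
    """
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        # Clip the window so that both slices stay inside their series.
        lo = max(start, -lag, 0)
        hi = min(start + size, len(patient), len(therapist) - lag)
        if hi - lo < 3:
            continue
        r = pearson(patient[lo:hi], therapist[lo + lag:hi + lag])
        best = max(best, abs(r))
    return best

def session_synchrony(patient, therapist, size, step, max_lag):
    """Window-level synchrony values for a whole session."""
    return [window_synchrony(patient, therapist, s, size, max_lag)
            for s in range(0, len(patient) - size + 1, step)]

# Toy example: the therapist mirrors the patient with a two-sample delay.
patient = [math.sin(0.3 * t) for t in range(100)]
therapist = [0.0, 0.0] + patient[:-2]
scores = session_synchrony(patient, therapist, size=30, step=15, max_lag=5)
```

Taking the peak over a small lag range lets the score capture delayed mirroring, at the cost of a positive bias on short windows.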

We evaluate all models by predicting session-level TA scores and using Pearson's correlation coefficient $r$ between predicted and observed TA as the primary outcome measure, computed in a session-level cross-validation setting. We first replicated a stable baseline association between global MEA synchrony and patient-reported TA, with a content-agnostic aggregation over all windows yielding a correlation of $r \approx 0.22$. Building on this foundation, transcript data were processed into semantic embeddings and temporally aligned with synchrony windows, enabling a multimodal representation in which textual context modulates how window-level synchrony is aggregated over time. In the current implementation, not all MEA windows have a corresponding text segment, so windows without aligned transcripts are ignored when applying text-informed weighting. Evaluating a uniform (all-ones) aggregation under this constraint leads to a reduced MEA-TA association of $r \approx 0.13$, compared to the $r \approx 0.22$ obtained when all MEA windows are used. Within this constrained evaluation setting, however, our text-informed weighting scheme increases the correlation to $r \approx 0.18$, suggesting that linguistic information helps to highlight synchrony segments that are more informative about alliance. While the overall performance of this preliminary implementation does not yet surpass the full-window MEA baseline, the results support the view that synchrony is not uniformly informative throughout an interaction and highlight the potential of window-level, context-aware multimodal modeling combined with improved textual coverage for capturing subtle relational processes in therapeutic settings.
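The three aggregation settings compared above (all windows, uniform weights on text-covered windows only, and text-informed weights) can be sketched as a single weighted mean. How the weights are derived from the semantic embeddings is not described here, so the weight values below are purely hypothetical, as is the function name.

```python
def aggregate_synchrony(window_sync, text_weights=None):
    """Weighted mean of window-level synchrony for one session.

    text_weights[i] is None when window i has no aligned transcript; such
    windows are skipped. With text_weights=None every window counts equally.
    """
    if text_weights is None:
        return sum(window_sync) / len(window_sync)
    pairs = [(s, w) for s, w in zip(window_sync, text_weights) if w is not None]
    total = sum(w for _, w in pairs)
    if not pairs or total == 0:
        return float("nan")  # no usable window in this session
    return sum(s * w for s, w in pairs) / total

# Hypothetical session with four windows; the first and last lack transcripts.
sync = [0.2, 0.6, 0.4, 0.8]
full_uniform = aggregate_synchrony(sync)                               # all windows
covered_uniform = aggregate_synchrony(sync, [None, 1.0, 1.0, None])    # all-ones, covered only
text_informed = aggregate_synchrony(sync, [None, 1.0, 3.0, None])      # hypothetical text weights
```

Separating the coverage mask from the weights makes it explicit that the drop from the full-window baseline comes from discarded windows, not from the weighting itself.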

8.17.2 Psychiatric Diagnosis Classification‌ through Temporal Behavioral Analysis

This sub-project focuses on automated psychiatric diagnosis through multimodal behavioral analysis of clinical interview videos, with the objective of distinguishing between depression and schizophrenia. We utilize a portion of the MEPHESTO dataset of 34 patients: 25 with depression and 9 with schizophrenia. The dataset includes manual behavioral annotations provided by expert clinical annotators who labeled over 3000 video segments with observable behaviors. The implemented system (see Figure 19) follows a 7-stage pipeline: (1) input data acquisition from MEPHESTO with pre-annotated transcriptions, (2) low-level extraction using OpenFace 3.0 (8 Action Units: AU01, AU02, AU04, AU06, AU07, AU12, AU14, AU45 + gaze + head pose + 8 emotions), MediaPipe holistic (33 pose, 42 hand, 468 face landmarks), and Whisper for speech (1,842 features/frame), (3) temporal alignment with frame-level synchronization (±1 frame precision, 33 ms), (4) multi-scale windowing (5 s, 10 s, 30 s windows, 50% overlap) extracting 188 features across 24,588 windows, (5) temporal variability aggregation computing 6 statistics per feature (mean, standard deviation, coefficient of variation, minimum, maximum, range), (6) feature selection via ANOVA F-test selecting top 20 features (70% speech-based, 30% visual), and (7) classification with leave-one-out cross-validation across 13 tested methods, among which random forest is retained.

Figure 19: This architecture diagram illustrates a multimodal machine learning pipeline for binary psychiatric diagnosis (depression/schizophrenia) from clinical interview videos. The system combines three parallel feature extraction pipelines: OpenFace 3.0 for facial action units and gaze, MediaPipe for body pose and hand movements, and Whisper for speech transcription and linguistic analysis. Features are extracted across multi-scale temporal windows with statistical aggregations to capture temporal variability patterns. After feature fusion into a unified matrix, an ANOVA F-test ranks features by discriminative power, the top 20 are selected, and predictions are made by a random forest classifier.
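Stages (4) and (5) of the pipeline can be sketched as follows for a single frame-level signal. This is a minimal illustration and not the project's code: the per-window descriptor (a plain mean), the 30 fps frame rate inferred from the 33 ms frame precision, and all function names are assumptions.

```python
import statistics

def sliding_windows(n_frames, fps, window_s, overlap=0.5):
    """(start, end) frame indices of fixed-length windows with fractional overlap."""
    size = int(round(window_s * fps))
    step = max(1, int(round(size * (1 - overlap))))
    return [(s, s + size) for s in range(0, n_frames - size + 1, step)]

def variability_stats(values):
    """The six aggregation statistics listed in stage (5)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    lo, hi = min(values), max(values)
    return {
        "mean": mean,
        "std": std,
        "cv": std / mean if mean != 0 else 0.0,  # coefficient of variation
        "min": lo,
        "max": hi,
        "range": hi - lo,
    }

def multiscale_features(signal, fps=30, scales=(5, 10, 30)):
    """Summarise one frame-level signal at several window lengths (in seconds)."""
    features = {}
    for window_s in scales:
        # Assumed per-window descriptor: the mean of the signal inside the window.
        per_window = [statistics.fmean(signal[a:b])
                      for a, b in sliding_windows(len(signal), fps, window_s)]
        if not per_window:
            continue  # recording shorter than this window length
        for name, value in variability_stats(per_window).items():
            features[f"{window_s}s_{name}"] = value
    return features

# Toy 30-second signal at 30 fps.
toy_signal = [float(i % 7) for i in range(900)]
toy_features = multiscale_features(toy_signal)
```

Each patient then ends up with a fixed-length vector regardless of interview duration, which is what allows an ANOVA F-test and a random forest to operate on the fused matrix.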

Random forest achieves 94.1% accuracy with only two schizophrenia patients misclassified. The top discriminative feature is the standard deviation of the patient's incomplete utterances. During our experiments, we found that temporal variability is the critical discriminative marker, that speech features dominate (70%) in the top-20 features, that feature fusion outperforms modality separation, and that traditional machine learning beats deep learning on small datasets. In the future, we are going to focus on temporal trauma detection in long untrimmed clinical interviews.

8.17.3 Childhood Trauma Affects‌ Speech and Language Measures‌ in Patients with Major‌‌ Depressive Disorder during Clinical Interviews

Speech analysis has‌ shown significant promise as‌ a potential biomarker for‌‌ depression. However, no studies to date have examined‌ the impact of childhood‌ trauma on speech and‌‌ language patterns in individuals with depression 32.‌ This study aims to‌ explore the relationship between‌‌ vocal characteristics and depressive symptoms, while also assessing‌ how childhood trauma may‌ shape these patterns. 27‌‌ participants with a major depressive episode were included.‌ The severity of depression‌ was assessed using the‌‌ Montgomery & Asberg Depression Rating Scale (MADRS) and‌ the Beck Depression Inventory‌ II. Childhood trauma was‌‌ measured using the Childhood Trauma Questionnaire. Speech recordings‌ from the MADRS semi-structured‌ interview and a free‌‌ clinical interview were analyzed using speaker diarization, automatic‌ speech recognition, and feature‌ extraction.

Several acoustic features were significantly associated with depression severity. Correlation analysis revealed that greater depression severity was linked to shorter, less diverse speech, characterized by fewer words, fewer semantic clusters, and reduced articulatory effort. In contrast, childhood trauma was positively associated with distinct speech characteristics. Higher trauma load was associated with richer, longer, and more syntactically complex speech. Additionally, utterances were shorter, with more frequent shifts between semantic clusters, reflecting a more fragmented speech pattern influenced by traumatic load. Our study highlights the influence of childhood trauma on vocal and linguistic characteristics of patients with depression. Automated language analysis offers the possibility to identify biomarkers of traumatic load in patients with depression. This could improve diagnostic accuracy, guide therapeutic management and help monitor clinical progress.

8.18 MultiMediate'25: Cross-Cultural‌ Multi-domain Engagement Estimation

Participants: Michal Balazia, Francois‌ Bremond.

Estimating momentary conversational engagement is central‌ to assistive, socially aware AI systems, yet models‌ are typically trained and evaluated within a single‌ domain, limiting real-world robustness. The MultiMediate'25 challenge 47‌ advances engagement estimation to more challenging, cross-cultural, and‌ multi-domain settings. Building on prior challenge editions, we‌ expand beyond NOXI and MPIIGroupInteraction (see Figure 20‌) as the sole training source by introducing‌ NOXI-J, a new multilingual corpus covering Japanese and‌ Chinese interactions, enabling both training and evaluation in‌ diverse linguistic contexts. Although NOXI-J conceptually extends NOXI,‌ we treat it as a distinct domain because‌ linguistic, cultural, capture, and annotation differences induce measurable‌ distribution shifts. MultiMediate'25 continues all previously defined tasks‌ and creates another task: Cross-cultural Multi-domain Engagement Estimation.‌

In this work, we present new annotations, precomputed‌ multi-modal features (visual, vocal, and verbal), baseline evaluations,‌ and an analysis of the best performing challenge‌ solutions. Beyond accuracy, we quantify fairness using conditional‌ demographic disparity for gender and language. Our baselines‌ confirm strong in-domain performance (e.g., paralinguistic eGeMAPS and‌ video-transformer features) and reveal notable cross-domain drops, underscoring‌ the challenge of cultural, linguistic, and interactional shifts.‌ Fairness analyses indicate generally small discrepancies for our‌ baselines. We observe the largest disparities for the‌ proposed challenge solutions on the Chinese language test‌ set. All annotations, features, code, and leaderboards are‌ made publicly available to foster sustained progress on‌ robust and fair engagement estimation.

Figure 20: Left: Snapshots of scenes of a participant in the NOXI corpus being disengaged, neutral and highly engaged. Right: Setup of the MPIIGroupInteraction dataset.

As training datasets, we provide‌ NOXI and NOXI-J to‌ our participants. NOXI is‌‌ a corpus of dyadic, screen-mediated face-to-face interactions in‌ an expert-novice knowledge sharing‌ context. In a session,‌‌ one participant assumes the role of an expert‌ and the other participant‌ the role of a‌‌ novice. NOXI includes interactions recorded at three locations‌ (France, Germany and UK),‌ spoken in seven languages‌‌ (English, French, German, Spanish, Indonesian, Arabic and Italian),‌ discussing a wide range‌ of topics. The languages‌‌ Indonesian, Arabic, Spanish, and Italian serve as an‌ out-of-domain evaluation set. NOXI‌ is extended by NOXI-J‌‌ consisting of 66 dyadic interactions and over 16‌ hours of material using‌ the same setup as‌‌ original NOXI. NOXI-J features 48 interactions in Japanese‌ with native Japanese speakers‌ and 18 interactions in‌‌ Chinese with Chinese native speakers. See Table 1‌ for the train-validation-test split.‌

Table 1: Engagement estimation‌‌ datasets used in the MultiMediate'25 challenge. Languages covered‌ by each dataset are‌ given in italics, with‌‌ the respective number of interactions in parentheses.
Training‌ Data	Validation Data	Test‌ Data
NOXI	NOXI	NOXI‌‌
English (23), French (7), German (8)	English (3),‌ French (4), German (3)‌	English (6), French (6),‌‌ German (4)
		NOXI (additional test languages)
		Arabic (2),‌ Italian (2), Indonesian (4),‌ Spanish (4)
	MPIIGroupInteraction	MPIIGroupInteraction‌‌
	German (6)	German (6)
NOXI-J	NOXI-J	NOXI-J
Japanese‌ (21), Chinese (10)	Japanese‌ (6), Chinese (4)	Japanese‌‌ (6), Chinese (4)

The task is frame-wise prediction of each interlocutor's engagement on a continuous scale $[0, 1]$. Accuracy is measured with the Concordance Correlation Coefficient (CCC), ranging from $-1$ to $+1$. Participants are free to use the provided labeled data for training and validation and undergo in-domain and out-of-domain evaluations on NOXI, NOXI-J, NOXI (Additional Languages), and MPIIGroupInteraction. We provide a multi-modal set of precomputed features to participants. From the audio signal, we provide transcripts generated with the Whisper model. Additionally, we supply GeMAPS features along with wav2vec 2.0 embeddings. From the video, we provide the backbone embeddings of Video Swin Transformer, DINOv2, CLIP and VideoMAEv2 and the outputs of OpenFace and OpenPose to cover facial as well as body behaviors.
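For reference, the CCC can be computed with the standard formula $2\,\mathrm{cov}(p, g) / (\sigma_p^2 + \sigma_g^2 + (\mu_p - \mu_g)^2)$. The sketch below is not the challenge's official scoring script; the function name and the handling of the degenerate constant case are assumptions.

```python
def concordance_cc(pred, gold):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p, mean_g = sum(pred) / n, sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        return 1.0  # assumed convention: two identical constant sequences agree perfectly
    return 2 * cov / denom

# Frame-wise engagement on [0, 1]; a constant offset lowers CCC even though
# the Pearson correlation of the two sequences would still be 1.
gold = [0.1, 0.5, 0.9]
shifted = [0.2, 0.6, 1.0]
ccc_shifted = concordance_cc(shifted, gold)
```

Unlike Pearson's correlation, the CCC penalises bias and scale mismatch, which matters when predictions must land on the same $[0, 1]$ scale as the annotations.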

8.19 Stress Estimation in Dancers for Injury‌ Prevention

Participants: Dian-Wei Lai‌, Quentin Merilleau,‌‌ Aowen Shi, Francois Bremond.

Detecting stress‌ in dancers is important,‌ as high stress levels‌‌ are often related to fatigue and injuries, which‌ can negatively affect both‌ performance and health. However,‌‌ stress detection itself is not an easy task.‌ This becomes even more‌ challenging when using indirect‌‌ and non-invasive data such as video. Although video‌ is one of the‌ most commonly available modalities,‌‌ extracting reliable stress information from it remains highly‌ challenging.

In this work, we investigate automatic stress estimation from dance videos using a small, weakly labeled dataset collected from professional dancers at Université Côte d'Azur. Each dancer performs the same dance under three different difficulty levels and in different scenes. The dataset currently includes 84 dancers, with two camera views (front view and diagonal view). Each video is approximately 1 to 2 minutes long. Data collection is still ongoing to further enrich the dataset and improve the reliability of the stress score distributions, shown as PDF and CDF curves in Figure 21.

Figure 21:‌ Dataset overview for stress estimation in dancers. Left:‌ three exercise difficulty levels performed by 84 dancers.‌ Right: stress score distributions for each exercise shown‌ as PDF and CDF curves.

To obtain meaningful results with limited data, we leverage pretrained models trained on large-scale video and motion datasets to improve feature representations. We then study the contribution of different visual modalities, including RGB, skeleton poses extracted using different methods with richer joint and hand motion information (see Figure 22), depth, and optical flow. By analyzing each modality separately and in combination, we aim to build a robust multi-modal pipeline for stress estimation and to identify which modalities and movement cues are most informative for effective stress prediction.

Figure 22: Pose extraction methods comparison.‌ From left to right: the original video frame,‌ YOLO-Pose, Posetics, and OpenPose (body + hand). These‌ methods are used to capture joint-level features and‌ characterize dancer movements. The dancer’s face is masked‌ to preserve privacy.

8.20 Emotion recognition using Deep‌ Learning

Participants: Valeriya Strizhkova, Antitza Dantcheva,‌ Francois Bremond.

Understanding human emotions is crucial in healthcare, human-robot interaction, and marketing. Despite the progress in emotion recognition from one modality, such as a facial video or a sequence of physiological signals, it is still challenging to improve it by combining multiple modalities. Moreover, it is difficult to recognize emotions in long sequential data, such as long videos, although most real-world videos of people expressing emotions are long. Existing emotion datasets are limited in volume and quality, making it difficult to develop an effective deep learning-based emotion recognition system. An effective real-world emotion understanding system should be able to recognize emotions from long videos synchronized with multiple modalities. In this thesis 52, we focus on multimodal emotion recognition from long videos synchronized with physiological signals. Specifically, multimodal emotion recognition methods face three main challenges: (a) learning the emotion representation, (b) learning the representation of fine-grained emotions, as well as (c) combining modalities to predict emotions. In this thesis, we first introduce two large behavior analysis datasets: INEMO and StressID. INEMO is a multimodal dataset designed to facilitate emotion recognition from watching social media videos. StressID is a multimodal dataset designed for stress identification. Secondly, we propose two pre-training techniques for facial expression recognition: (1) supervised pre-training on synthetic data generated by our video generation method and (2) self-supervised pre-training on multi-view videos. We show that the proposed pre-training techniques yield rich facial representations that improve fine-grained emotion recognition accuracy. Thirdly, we tackle the problem of emotion recognition from multiple modalities.
We propose a framework for multimodal fusion of videos and physiological signals to predict emotions. This framework consists mainly of two steps: (1) extracting features from long raw videos and physiological signals; (2) fusing the extracted features to predict emotions using a cross-modality approach based on an attention mechanism. Our methods leverage the additional modalities, resulting in better emotion recognition performance. Our methods have been extensively evaluated on various emotion recognition benchmarks. The proposed methods outperform previous methods, bringing emotion recognition closer to real-world deployment.

8.21 Identifying Surgical Instruments‌ in Pedagogical Cataract Surgery‌‌ Videos through an Optimized Aggregation Network

Participants: Sanya‌ Sinha, Michal Balazia‌, Francois Bremond.‌‌

Instructional cataract surgery videos are crucial for ophthalmologists and trainees to observe surgical details repeatedly. In 44, we present a deep learning model for real-time identification of surgical instruments in these videos, using a custom dataset scraped from open-access sources. Inspired by the architecture of YOLOv9, the model employs a Programmable Gradient Information (PGI) mechanism and a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN) to address the information bottleneck problem, enhancing mean Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores.

The Go-ELAN YOLOv9 architecture (see Figure 23) contains an auxiliary block that implements the Programmable Gradient Information (PGI) concept by creating an auxiliary reverse branch, which enables reliable gradient calculation by avoiding potential semantic loss. The GELAN block in the backbone feature extractor is replaced by the Go-ELAN block proposed in this paper. The Spatial Pyramid Pooling block SPPELAN removes the fixed size limitation of the backbone. The ADown block downsamples the generated feature maps to target sizes. The CBLinear blocks extract higher level features from the images, and the CBFuse block fuses these extracted features. The Neck combines the acquired features and the Head predicts the final bounding box outputs with their respective probabilities.

Figure 23: Architecture of Go-ELAN YOLOv9.

Our Go-ELAN YOLOv9 model,‌ evaluated against YOLO v5, v7, v8, v9 vanilla,‌ Laptool and DETR, achieves a superior mAP of‌ 73.74 at IoU 0.5 on a dataset of‌ 615 images with 10 instrument classes, demonstrating the‌ effectiveness of the proposed model. To illustrate the‌ visual and qualitative superiority of our model, we‌ have compared 12 ground-truth images with their respective‌ model predictions in Figure 24.

Figure 24‌: Qualitative Examination of Model Performance. Rows 1‌ and 3 are labels while 2 and 4‌ are respective predictions.

8.22 TBDM: Temporal Boundary Distillation‌ Module for Surgical Gesture Segmentation

This work was‌ funded by 3IA Côte d'Azur.

Participants: Ezem Sura‌ Ekmekci, Snehashis Majhi, Khodor Hamadi,‌ Francois Bremond.

In 2025, in collaboration with‌ CHU Nice and Caranx Medical, a novel framework‌ for surgical gesture segmentation was developed that addresses‌ the challenging problem of precise temporal localization during‌ surgical action transitions. This work introduces the Temporal‌ Boundary Distillation Module (TBDM), an innovative approach that‌ explicitly models temporal boundaries between surgical gestures using‌ RGB-only video data (see Figure 25). The‌ framework employs knowledge distillation to learn boundary-aware features‌ during training through cross-attention mechanisms, while requiring no‌ additional computational overhead at inference. TBDM was validated‌ on two major surgical datasets (CholecT50 and RARP-45),‌ demonstrating consistent improvements across multiple baseline architectures, with‌ up to +8.5 edit score improvement on CholecT50.‌ Notably, the approach achieved state-of-the-art performance on RARP-45‌ (81.4 edit score, 77.9 F1@50), establishing TBDM as a generalizable, plug-and-play solution‌ for fine-grained surgical workflow‌ analysis. This work has‌‌ been submitted to IPCAI 2026.
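The edit score reported above is commonly defined in the action segmentation literature as a normalised Levenshtein distance between the predicted and ground-truth sequences of segment labels. The sketch below follows that common definition; the exact evaluation code used for TBDM is not given here, and the function names and gesture labels are illustrative.

```python
def to_segments(frame_labels):
    """Collapse frame-wise labels into the ordered list of segment labels."""
    segments = []
    for label in frame_labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments

def levenshtein(a, b):
    """Edit distance between two label sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]

def edit_score(pred_frames, gt_frames):
    """Segmental edit score in [0, 100]; higher is better."""
    pred, gt = to_segments(pred_frames), to_segments(gt_frames)
    if not pred and not gt:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, gt) / max(len(pred), len(gt)))

# A boundary placed one frame early does not change the score,
# but a spurious extra segment does.
gt = ["cut", "cut", "cut", "grasp", "grasp", "tie", "tie"]
early_boundary = ["cut", "cut", "grasp", "grasp", "grasp", "tie", "tie"]
spurious = ["cut", "cut", "tie", "grasp", "grasp", "tie", "tie"]
```

Because the score ignores segment durations, it complements overlap-based metrics such as F1@50, which do penalise misplaced boundaries.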

Additionally, a comprehensive‌ evaluation of YOLOv8 for‌ real-time surgical instrument recognition‌‌ in robot-assisted and laparoscopic surgeries was conducted 33‌. Using a diverse‌ multi-source dataset of over‌‌ 7,400 frames and 17,175 annotations, the model achieved‌ a mean average precision‌ of 0.77 for binary‌‌ detection and 0.72 for multi-instrument classification across seven‌ instrument types. The segmentation‌ performance demonstrated excellent accuracy‌‌ with a mean Dice score of 0.91 and‌ mean intersection over union‌ of 0.86. With an‌‌ inference speed of 1.12 milliseconds per frame, the‌ model shows strong potential‌ for real-time clinical applications‌‌ in surgical workflow analysis and instrument tracking.
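The two segmentation metrics quoted above can be computed per mask as in the sketch below, which assumes flattened binary masks and uses a hypothetical function name. For a single mask the two are tied by $\mathrm{IoU} = D / (2 - D)$, where $D$ is the Dice score, although averages over many masks need not satisfy this identity exactly.

```python
def dice_and_iou(pred_mask, gt_mask):
    """Dice score and intersection over union for two flattened binary masks."""
    intersection = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    pred_area = sum(1 for p in pred_mask if p)
    gt_area = sum(1 for g in gt_mask if g)
    union = pred_area + gt_area - intersection
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks match perfectly
    dice = 2 * intersection / (pred_area + gt_area)
    iou = intersection / union
    return dice, iou

# Toy four-pixel masks overlapping on two pixels.
dice, iou = dice_and_iou([1, 1, 1, 0], [0, 1, 1, 1])
```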

Figure 25: Overview of TBDM framework‌ for Surgical Gesture Segmentation.‌ During training, the boundary‌‌ blocks generate boundary-aware features to guide the projection‌ layer. During inference, only‌ the trained projection layer‌‌ is used, adding no extra cost, while achieving‌ significant boundary precision in‌ segmentation.

8.23 Effective Video‌‌ Feature Extraction for Training and Comprehension: Human-Centered Multimodal‌ Video

Participants: Tanay Agrawal‌, Antitza Dantcheva,‌‌ Francois Bremond.

Understanding actions in videos is‌ a crucial element of‌ computer vision, with significant‌‌ implications in many fields. Given our increasing reliance‌ on visual data, understanding‌ and interpreting human actions‌‌ in videos are becoming essential for developing technologies‌ in surveillance, healthcare, autonomous‌ systems, and human-computer interaction.‌‌ Accurate interpretation of actions in videos is fundamental‌ to creating intelligent systems‌ capable of navigating and‌‌ responding effectively to the complexities of the real‌ world. In this context,‌ advances in action understanding‌‌ are pushing the boundaries of computer vision and‌ playing a crucial role‌ in the development of‌‌ cutting-edge applications that impact our daily lives.

Computer vision has seen significant progress thanks to the rise of deep learning methods such as convolutional neural networks (CNNs) and transformers, enabling the community to advance in many areas, including image segmentation, object detection, and scene understanding. However, video processing remains limited compared to static images. In this thesis, we focus on video understanding, dividing it into two main parts: video classification and action detection, and their application in affective computing, particularly in interaction-based scenarios. We explore efficient learning approaches for video feature extraction in various video classification and interaction understanding tasks. Our contributions 51 cover the computation of intermediate-level features for faster convergence, plugin adaptation for handling diverse datasets and modalities, and evolutionary temporal modeling for understanding long videos. We begin by improving personality and behavior recognition through geometry-based behavioral coding and segmentation-driven attention mechanisms. We then address the challenges of modality availability and data diversity using knowledge distillation and a novel adapter-based cross-learning framework that generalizes across tasks. Finally, we tackle the analysis of long videos for temporal action detection using temporal adapters with image models, as well as modular adapters and a two-stage spatiotemporal learning strategy with a video basis. Together, this work contributes to building generalizable and efficient learning systems for a wide range of video understanding applications.

8.24 Rotation-Induced‌ Centroid Shift in Latent Space

Participants: Benoit Lagadec‌, Matthieu Saumard, Francois Bremond.

Convolutional neural networks are not rotation-equivariant in practice: discrete image rotation requires interpolation and zero-padding, making the rotation operator non-invertible and causing convolution and rotation to not commute. We show that this leads to a systematic and measurable shift of the feature-space centroid when images are rotated, even when the model is trained with standard rotation augmentation. We formalize this centroid drift analytically, modeling discrete in-plane rotation on pixel grids as a degraded permutation, and verify it empirically. To mitigate this effect, we introduce a set of angle-specialized Exponential Moving Average (EMA) teachers that provide stable feature anchors at different rotation angles, optionally enhanced with low-rank angle adapters. This approach directly suppresses rotation-induced centroid shift and significantly improves feature consistency and classification accuracy under rotation, outperforming both classical augmentation and mean-teacher baselines while requiring minimal additional computation. Many studies are dedicated to finding invariants in detection. An illustration is given in Figure 26.

Figure 26: Left: illustration of the wrong approximation due to the rotation transformation. Right: the EMA mechanism makes it possible to correct this approximation.

Convolutional neural networks are often assumed to‌ be robust to rotations‌ when trained with rotation‌‌ augmentations. However, this assumption overlooks a key property‌ of real image rotations:‌ discrete rotation on a‌‌ pixel grid is implemented using interpolation and padding,‌ making the operation non-invertible‌ and causing it to‌‌ not commute with convolution. As a result, rotating‌ an input image and‌ then extracting features is‌‌ not equivalent to extracting features and then rotating‌ them. We show that‌ this mismatch induces a‌‌ systematic and predictable shift in the feature-space centroid‌ across rotation angles, even‌ when the network is‌‌ trained with extensive rotation augmentation.
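The non-invertibility of discrete rotation can be seen on a tiny grid. The sketch below uses nearest-neighbour sampling with zero padding as a simple stand-in for the interpolation schemes used in practice; the function name and the sign convention of the angle are arbitrary choices made for illustration.

```python
import math

def rotate(img, degrees):
    """Rotate a 2D list about its centre with nearest-neighbour sampling and zero padding."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: where does this output pixel come from in the source?
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
            # Otherwise the pixel falls outside the grid and keeps the padding value 0.
    return out

ones = [[1] * 5 for _ in range(5)]
round_trip_90 = rotate(rotate(ones, 90), -90)   # grid-aligned angle: lossless
round_trip_45 = rotate(rotate(ones, 45), -45)   # corners are lost to padding
lost_pixels = sum(map(sum, ones)) - sum(map(sum, round_trip_45))
```

Rotations by multiples of 90 degrees are true permutations of the grid, whereas other angles discard or duplicate pixels, so the rotated image cannot be mapped back exactly.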

This observation reframes‌ rotation robustness as a‌ problem of representation geometry‌‌, rather than data diversity alone. If rotation‌ induces angle-dependent sub-clusters in‌ feature space, enforcing global‌‌ consistency (e.g., with a single Mean Teacher model)‌ can suppress meaningful structure‌ and lead to underfitting.‌‌ We therefore propose a simple alternative: a set‌ of angle-specialized EMA teachers‌ that provide stable feature‌‌ targets at different rotation angles, coupled with a‌ feature-space centroid alignment loss‌ that prevents rotation-induced drift‌‌ without collapsing intra-class variability. Our approach is architecture-agnostic,‌ computationally lightweight, and complementary‌ to standard training pipelines.‌‌ It improves rotation robustness in both classification and‌ detection settings without requiring‌ group-equivariant architectures or spatial‌‌ transformer modules. The core contribution of this work‌ is to characterize the‌ geometric effect of discrete‌‌ rotation in CNN feature space and to introduce‌ a training strategy that‌ explicitly stabilizes this geometry.‌‌
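A minimal sketch of the two ingredients described above, angle-specialized EMA teachers and a centroid alignment term, is given below. The exact loss, the angle set, the momentum value and the representation of weights as flat lists are not specified in this report and are assumptions made for illustration.

```python
def ema_update(teacher, student, momentum=0.99):
    """Exponential moving average of student weights into teacher weights."""
    return [momentum * t + (1 - momentum) * s for t, s in zip(teacher, student)]

def centroid(features):
    """Mean feature vector of a batch given as a list of equal-length vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def centroid_alignment_loss(student_feats, teacher_feats):
    """Squared distance between the student and teacher batch centroids."""
    cs, ct = centroid(student_feats), centroid(teacher_feats)
    return sum((a - b) ** 2 for a, b in zip(cs, ct))

class AngleTeachers:
    """One EMA teacher per rotation angle, each tracking the shared student."""

    def __init__(self, init_weights, angles=(0, 90, 180, 270), momentum=0.99):
        self.momentum = momentum
        self.weights = {angle: list(init_weights) for angle in angles}

    def update(self, angle, student_weights):
        # Only the teacher matching the current batch's rotation angle is updated.
        self.weights[angle] = ema_update(
            self.weights[angle], student_weights, self.momentum)

# Hypothetical usage: after a training step on images rotated by 90 degrees.
teachers = AngleTeachers([0.0, 0.0], angles=(0, 90), momentum=0.9)
teachers.update(90, [1.0, 2.0])
```

Matching only the centroids, rather than every feature vector, is what leaves intra-class variability free while still anchoring each angle's sub-cluster.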

Applied to detection (see Figure 26), our‌ method ensures that rotation-induced‌ feature sub-clusters remain compact‌‌ and aligned. This contrasts with our former work,‌ which uses a related‌ mechanism in person re-identification‌‌ to enlarge inter-cluster separation,‌ whereas our objective is to preserve sub-cluster coherence.‌

8.25 Dual Volume Skeleton-Guided 3D Face Reconstruction from‌ Sparse Views

Participants: Benoit Lagadec, Seongro Yoon‌, Francois Bremond.

Reconstructing high-fidelity 3D face‌ meshes from sparse 2D inputs is challenging due‌ to limited depth cues and structural ambiguity. We‌ present a skeleton-guided, dual-volume diffusion framework for reconstructing‌ editable, high-fidelity 3D face meshes from only two‌ sparse views, see Figure 27. By integrating‌ part-level latent diffusion with skeleton-based conditioning and symmetry-aware‌ dual-volume packing, our approach preserves pose-consistent geometry, enables‌ part-aware editing, and maintains bilateral alignment. A teacher–student‌ strategy with multi-view consistency further improves stability and‌ fidelity, yielding significant gains over state-of-the-art baselines. Our‌ contributions:

• A skeleton-conditioned diffusion pipeline that injects‌ explicit structural priors to improve pose-consistent geometry under‌ sparse views.

• A dual-volume latent representation, inspired by bipartite packing, enabling part-aware decoding and preventing fusion of contacting parts. It also makes it possible to complete a partial view in the final face generation.

•‌ A symmetry-aware objective coupling reconstruction accuracy and bilateral‌ regularization for realistic midline geometry.

• A self-supervised teacher–student strategy that enhances multi-view consistency.

Figure 27: Left: Illustration of the workflow. Right: projection of 2D landmarks to guide the generation of a new mesh.

Given two input images (frontal and profile), we detect 2D landmarks to form a facial skeleton. The landmarks are replaced by this facial skeleton to make the diffusion-based generation more realistic. A skeleton encoder produces a latent embedding that conditions a dual-UNet diffusion backbone via adaptive normalization. The denoiser outputs two latent volumes (left/right), which are decoded by a 3D VAE into SDF (i.e., signed distance function)/occupancy grids. Marching Cubes extracts meshes per side; parts remain disjoint via dual-volume packing, see Figure 28. A symmetry loss regularizes left/right consistency. A complete view of the architecture is given in Figure 27.

Figure 28: Two editable meshes/views are stitched.

8.26 Turbo Learning: 3D Face Reconstruction with‌ Mesh Re-Projection and Re-Identification‌ Consistency

Participants: Benoit Lagadec, François Brémond.

We introduce Turbo Learning, a two-stage iterative refinement framework for 3D face reconstruction inspired by the positive-feedback dynamics of turbocharged engines. Traditional pipelines rely on sparse supervisory cues such as 2D landmarks, limiting their ability to recover accurate geometry. Our approach instead uses self-generated 3D meshes as progressively stronger priors: Stage 1 predicts a coarse mesh guided by MediaPipe landmarks, while Stage 2 uses this mesh as dense geometric supervision.

To‌ further enhance identity preservation,‌‌ we introduce a Mesh Re-Projection and Re-Identification Consistency‌ Loss. By re-projecting‌ meshes from both stages‌‌ into image space and applying an InfoNCE contrastive‌ Re-ID objective, we enforce‌ identity stability across refinement‌‌ steps. The combination of a geometric turbo loop‌ and an identity turbo‌ loop produces reconstructions that‌‌ are more stable, more detailed, and more identity-faithful.‌
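As an illustration of the contrastive Re-ID objective mentioned above, the following sketch computes a standard InfoNCE loss over a batch of embeddings. It assumes that `anchors[i]` (for example, an embedding of the input image) and `positives[i]` (an embedding of the corresponding re-projected mesh) share the same identity, while all other entries in the batch act as negatives. The function names, the cosine similarity, and the temperature value are illustrative assumptions, not details taken from the report.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors: List[Sequence[float]],
             positives: List[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Average InfoNCE loss; positives[i] is the match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is close to zero when each anchor is most similar to its own positive, and it becomes large when identities are confused across the batch.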

We compare Turbo Learning‌ to classical iterative strategies‌‌ such as EM, diffusion-based refinement, boosting, and teacher–student‌ systems, and show that‌ it occupies a distinctive‌‌ position among them, see Fig. 29.

Figure 29: At each step, the generated mesh is re-used. A re-identification metric is computed to learn the input image.

8.27 THEval: Evaluation‌ Framework for Talking Head‌ Video Generation

Participants: Nabyl‌‌ Quignon, Baptiste Chopin, Yaohui Wang,‌ Antitza Dantcheva.

Generative‌ models for talking head‌‌ videos have witnessed remarkable progress, achieving high-resolution and‌ realistic results. However, evaluating‌ these models remains a‌‌ significant challenge, as the‌ rapid advancement in generation has outpaced the development‌ of adequate metrics. Current evaluations primarily rely on‌ general image quality metrics or lip-synchronization scores, which‌ often fail to capture essential aspects of realism‌ such as motion quality, temporal coherence, and naturalness.‌ Furthermore, these existing metrics have been shown to‌ correlate poorly with human preferences, necessitating a more‌ robust and perceptually aligned evaluation approach.

Figure 30: Overview of the THEval benchmark. Evaluating 17 SOTA methods on 85,000 videos reveals that existing metrics align poorly with human ratings (red box). We propose THEval (center), a framework with 8 metrics covering (i) quality, (ii) naturalness, and (iii) synchronization. Our final score (green box) achieves a 0.870 correlation with human preference.

We introduce THEval [56], a novel evaluation framework designed to address these limitations by aligning closely with human perception. A visual summary of the framework is shown in Figure 30. We support this framework with a new, challenging evaluation dataset comprising over 5,000 videos sourced from diverse YouTube channels, ensuring the content was unseen during model training to test generalization. The dataset features a wide range of languages, head poses, and expressions. To assess performance comprehensively, we decompose evaluation into three core dimensions: quality, naturalness, and synchronization, utilizing eight fine-grained metrics to analyze dynamics such as lip and head motion alongside global aesthetics.

To validate our framework, we conduct an extensive benchmark of 17 state-of-the-art audio- and video-driven models, generating and analyzing over 85,000 videos. We leverage a user study to demonstrate that our final composite score achieves a strong Spearman correlation of 0.870 with human ratings, significantly outperforming traditional metrics like FID and SyncNet. By applying this pipeline, we identify that while many current algorithms excel in lip synchronization, they continue to face challenges in generating expressive facial behavior and artifact-free details, establishing THEval as a vital tool for fostering future progress in the field.
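For readers unfamiliar with the statistic, the sketch below shows how a Spearman correlation between a metric and human ratings can be computed: both lists are converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is returned. The function names are illustrative, and this is a generic textbook implementation rather than the THEval code.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(metric_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman rank correlation between a metric and human ratings."""
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))
```

Because only ranks matter, the coefficient rewards a metric that orders models the same way humans do, regardless of the scale of its raw scores.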

8.28 Beyond Real versus Fake‌ Towards Intent-Aware Video Analysis

Participants: Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva.

The rapid advancement of‌ generative models has led to increasingly realistic deepfake‌ videos, posing significant societal and security risks. While‌ existing detection methods focus primarily on distinguishing real‌ from fake videos, such approaches fail to address‌ a fundamental question regarding the intent behind manipulated‌ content. With the proliferation of AI-generated media, the‌ binary distinction of authenticity is becoming less relevant‌ than understanding whether content is malicious or benign.‌ This shift necessitates a new paradigm in video‌ analysis that moves beyond artifact detection to the‌ contextual understanding of underlying motivations.

Figure 31: Three-Way Contrastive Alignment Pipeline.‌ Overview of the proposed training methodology. The augmented‌ dataset is encoded using modality-specific encoders (CLIP for‌ video, WavLM for audio, CLIP Text for text),‌ projected into a shared space, and aligned through a three-way contrastive loss.‌ The pretrained encoders are‌ then fine-tuned using a‌‌ supervised MLP classifier for intent prediction.

We introduce IntentHQ [53], a new benchmark for human-centered intent analysis designed to formalize the task of intent recognition. We curate a comprehensive dataset of 5,168 videos, meticulously annotated with 23 fine-grained intent categories such as "Financial fraud", "Political propaganda", and "Comedy", organized under five broader dimensions including Deception and Persuasion. To effectively analyze these videos, we propose a novel self-supervised learning framework (see Figure 31) that leverages a three-way contrastive alignment strategy. This method jointly aligns video, audio, and textual modalities, utilizing data augmentation techniques like semantic paraphrasing and text-to-speech synthesis to learn robust representations without relying on manual labels during pretraining.
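One common way to realize a three-way alignment of this kind is to average a symmetric contrastive loss over the three modality pairs (video-audio, video-text, audio-text). The sketch below follows that recipe; it is an assumption about the general shape of such a loss, not the IntentHQ implementation, and the names `three_way_loss` and `_pair_loss` as well as the temperature are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence

Batch = List[Sequence[float]]


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diag(rows: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        top = max(row)
        total += top + math.log(sum(math.exp(s - top) for s in row)) - row[i]
    return total / len(rows)


def _pair_loss(a: Batch, b: Batch, temperature: float) -> float:
    """Symmetric contrastive loss between two batches of unit vectors."""
    sim = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))


def three_way_loss(video: Batch, audio: Batch, text: Batch,
                   temperature: float = 0.07) -> float:
    """Average pairwise contrastive loss over the three modality pairs."""
    batches = [[_normalize(v) for v in modality] for modality in (video, audio, text)]
    pair_losses = [_pair_loss(a, b, temperature) for a, b in combinations(batches, 2)]
    return sum(pair_losses) / len(pair_losses)
```

Row `i` of each batch is assumed to come from the same sample, so the loss is small only when all three embeddings of a sample agree with each other and differ from those of other samples.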

To validate our approach, we benchmark intent recognition using various state-of-the-art multimodal architectures. Our proposed model, which integrates spatio-temporal video features with audio and text analysis, achieves a classification accuracy of 52.5%, establishing a new state of the art by significantly outperforming standard video classification baselines such as VideoMAE and TimeSformer. Ablation studies further reveal that, while video remains the most predictive modality, the fusion of text and audio is essential for distinguishing complex, socially embedded intents. By releasing the IntentHQ dataset and code, we aim to foster further research in intent-aware media analysis, shifting the focus towards a more nuanced understanding of digital content.

8.29 AI killed‌ the video star. Audio-driven‌ diffusion model for expressive‌‌ talking head generation

Participants: Baptiste Chopin, Antitza‌ Dantcheva.

We proposed Dimitra++ [55], a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we proposed a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggested that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose. Code and qualitative results are provided on our project page.

8.30 LIA-X: Interpretable Latent‌ Portrait Animator

Participants: Antitza Dantcheva, François Brémond.

We introduce LIA-X [57], a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous 'warp-render' approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable 'edit-warp-render' strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation. A project page is available.
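The idea of linear navigation with a sparse dictionary can be sketched as follows: a target motion code is obtained by moving the source code along a few dictionary directions, each scaled by a magnitude, with most magnitudes equal to zero. The function `navigate`, the toy dictionary, and the reading of one direction as a specific facial factor are hypothetical and only illustrate the arithmetic; in the actual model, directions and magnitudes are learned.

```python
from typing import List, Sequence


def navigate(source_code: Sequence[float],
             dictionary: Sequence[Sequence[float]],
             magnitudes: Sequence[float]) -> List[float]:
    """Return source_code + sum_i magnitudes[i] * dictionary[i].

    Directions with zero magnitude are skipped, which is what makes the
    combination sparse: only a few interpretable factors are active.
    """
    target = list(source_code)
    for direction, magnitude in zip(dictionary, magnitudes):
        if magnitude == 0.0:
            continue
        for k, component in enumerate(direction):
            target[k] += magnitude * component
    return target


# Toy example: two directions in a 3-dimensional latent space.
toy_dictionary = [[1.0, 0.0, 0.0],   # hypothetical "head yaw" factor
                  [0.0, 1.0, 0.0]]   # hypothetical "mouth opening" factor
edited = navigate([0.0, 0.0, 0.0], toy_dictionary, [2.0, 0.0])
```

Editing a single magnitude changes one factor while leaving the others untouched, which is the property that an 'edit-warp-render' workflow relies on.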

8.31 Simplicity-Bias-Aware Adaptation of Foundation Models for Deepfake‌ Detection

Participants: Charbel Yahchouchi, Noemi Roggero,‌ Laurent Saroul, Antitza Dantcheva.

Given the rapid advancement of deep learning and generative models, the synthesis of realistic and plausible images and videos has reached unprecedented levels. However, this progress also raises serious concerns, as such content can be misused for malicious purposes such as identity impersonation, misinformation, and social manipulation. Consequently, deepfake detection has emerged as a crucial area of research, aiming to develop robust and generalizable detectors capable of reliably identifying manipulated media. Despite impressive progress, most current detectors struggle to generalize to unseen manipulations, limiting their real-world reliability.

Figure 32: Overview of the proposed‌ simplicity-bias-aware adaptation framework for deepfake detection. A frozen‌ CLIP visual encoder is augmented with lightweight adapter‌ modules, while the SIFER mechanism is applied at‌ an intermediate representation to identify and suppress shortcut‌ features during training.

In this work, we study‌ the limitations of foundation model adaptation for deepfake‌ detection under distribution shifts and address the impact‌ of shortcut learning induced by parameter-efficient fine-tuning for‌ deepfakes. We introduce a simplicity-bias-aware adaptation framework, see‌ Fig. 32, that augments a frozen CLIP‌ visual encoder with lightweight adapter modules and integrates‌ the SIFER feature-sieving mechanism to identify and suppress‌ simple but non-generalizable cues during training. To validate‌ our framework, we conduct an extensive evaluation on‌ recent state-of-the-art deepfake detection datasets, focusing on cross-dataset‌ and cross-manipulation generalization under distribution shifts. Experimental results‌ show consistent improvements in video-level Area Under the‌ Curve (AUC) compared to CLIP-based baselines and other‌ parameter-efficient adaptation strategies, with particularly strong gains on subtle and localized manipulations.‌
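As a reminder of how a video-level AUC is typically obtained, the sketch below first aggregates frame scores into one score per video and then computes the AUC as the probability that a fake video is scored higher than a real one. Averaging frame scores is an assumption (the report does not state the aggregation rule), and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def video_level_scores(frame_scores: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average per-frame fake probabilities into one score per video."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for video_id, score in frame_scores:
        grouped[video_id].append(score)
    return {video_id: sum(s) / len(s) for video_id, s in grouped.items()}


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise formulation is quadratic in the number of videos, which is fine for a sketch; evaluation libraries usually compute the same quantity from sorted scores.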

8.32 Now You See‌ Me, Now You Don't:‌‌ A Unified Framework for Expression Consistent Anonymization in‌ Talking Head Videos

Participants:‌ Anil Egin, Antitza‌‌ Dantcheva.

Face video anonymization is aimed at privacy preservation while allowing for the analysis of videos in a number of computer vision downstream tasks such as expression recognition, people tracking, and action recognition. We propose here a novel unified framework [39] referred to as Anon-NET, streamlined to de-identify facial videos, while preserving age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces by a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate de-identified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of Anon-NET in obfuscating identity while retaining visual realism and temporal consistency. A project page is available.

8.33 Beyond the visible: A‌ survey on cross-spectral face‌ recognition

Participants: Antitza Dantcheva‌‌.

Cross-spectral face recognition (CFR) refers to recognizing‌ individuals using face images‌ stemming from different spectral‌‌ bands, such as infrared versus visible. While CFR‌ is inherently more challenging‌ than classical face recognition‌‌ due to significant variation in facial appearance caused‌ by the modality gap,‌ it is useful in‌‌ many scenarios including night-vision biometrics and detecting presentation‌ attacks. Recent advances in‌ deep neural networks (DNNs)‌‌ have resulted in significant improvement in the performance‌ of CFR systems. Given‌ these developments, the contributions‌‌ of this survey are three-fold. First, we provide‌ an overview of CFR,‌ by formalizing the CFR‌‌ problem and presenting related applications. Secondly, we discuss‌ the appropriate spectral bands‌ for face recognition and‌‌ discuss recent CFR methods, placing emphasis on deep‌ neural networks. In particular,‌ we describe techniques that‌‌ have been proposed to extract and compare heterogeneous‌ features emerging from different‌ spectral bands. We also‌‌ discuss the datasets that have been used for‌ evaluating CFR methods. Finally,‌ we discuss the challenges‌‌ and future lines of research on this topic.‌

This work has been published in Neurocomputing [31].

9 Bilateral contracts and grants with industry‌

Participants: Antitza Dantcheva, François Brémond.

The Stars team currently has several technology-transfer collaborations with industry, which have made it possible to exploit research results.

9.1 Bilateral contracts with industry

9.1.1 Toyota‌

This project runs from August 1, 2013 to December 2025. It aims at detecting critical situations in the daily life of older adults living alone at home.

Toyota is working with Stars on action recognition software to be integrated on their robot platform. Detecting critical situations requires not only recognizing activities of daily living (ADLs) but also evaluating the way and timing in which they are carried out. The system is intended to help older adults and their relatives feel more comfortable, knowing that potentially dangerous situations will be detected and reported to caregivers if necessary. It is designed to work with the Partner Robot HSR, sending real-time information to the robot so that it can better interact with the older adult.

9.1.2‌ Fantastic Sourcing

Fantastic Sourcing is a French SME specialized in micro-electronics that develops e-health technologies. Fantastic Sourcing is collaborating with Stars through the Univ. Côte d'Azur Solitaria project, by providing their Nodeus system. Nodeus is an IoT (Internet of Things) system for home support for the elderly, which consists of a set of small sensors (without video cameras) to collect valuable data on the habits of isolated people. The Solitaria project performs a multi-sensor activity analysis for the monitoring and safety of older and isolated people. With the ageing of the population in Europe and in the rest of the world, keeping elderly people at home, in their usual environment, for as long as possible has become a priority and a challenge of modern society. A system for monitoring activities and alerting in case of danger, in permanent connection with a device (an application on a phone, a surveillance system ...) to warn relatives (family, neighbors, friends ...) of isolated people still living in their natural environment could save lives and avoid incidents that cause or worsen the loss of autonomy. In this R&D project, we propose to study a solution allowing the use of a set of innovative heterogeneous sensors in order to: 1) detect emergencies (falls, crises, etc.) and call relatives (neighbors, family, etc.); 2) detect, over short or longer predefined periods of time.

9.1.3 Probayes

STARS will be working with‌ Probayes starting 01/07/2025 within a CIFRE Ph.D. on‌ the development of advanced methods for detecting artificially‌ generated videos using artificial intelligence models. Recent advances‌ in image and video generation based on neural‌ networks make it possible to create highly realistic‌ fake videos of individuals (deepfakes), which raises major‌ security concerns for many organizations. This project aims‌ at designing innovative approaches to assess the authenticity‌ of video content. A particular emphasis will be‌ placed on developing techniques that are generalizable and‌ not specific to a given video generation model‌ or application context. The proposed methods will rely‌ on the analysis of spatio-temporal behavioral cues, such‌ as mouth dynamics, in order to evaluate the‌ veracity of video sequences.

10 Partnerships and cooperations‌

Participants: François Brémond, Antitza Dantcheva, Michal‌ Balazia, Monique Thonnat.

10.1 International research‌ visitors

10.1.1 Visits of international scientists

Other international‌ visits to the team

Participant: Donghyeon Cho.‌

Status:
Associate Professor
Institution of origin:
Ulsan National‌ Institute of Science and Technology (UNIST)
Country:
South‌ Korea
Dates:
July to August
Context of the‌ visit:
Collaborations
Mobility program/type of mobility:
Korean research‌ stay.

Participant: Jinsun Park.

Status:
Associate Professor
Institution of origin:
Pusan‌ National University in Busan‌
Country:
South Korea
Dates:‌‌
July to August
Context of the visit:
Collaborations‌
Mobility program/type of mobility:‌
Korean research stay.

Participant:‌‌ Seungryul Baek.

Status:
Associate Professor
Institution of‌ origin:
Hanyang University in‌ Seoul
Country:
South Korea‌‌
Dates:
July to August
Context of the visit:‌
Collaborations
Mobility program/type of‌ mobility:
Korean research stay.‌‌

Participant: Eric Granger.

Status:
Full Professor
Institution‌ of origin:
École de‌ technologie supérieure, Université du‌‌ Québec
Country:
Canada
Dates:
November to December
Context‌ of the visit:
Collaborations‌
Mobility program/type of mobility:‌‌
sabbatical.

Participant: Nesli Erdogmus.

Status:
Assistant Professor‌
Institution of origin:
Izmir‌ Institute of Technology
Country:‌‌
Turkey
Dates:
July to August
Context of the‌ visit:
Collaborations
Mobility program/type‌ of mobility:
Franco-Turkish Research Fellowship Program "Prestij"

10.2 European initiatives‌

10.2.1 Horizon Europe

GAIN‌

GAIN project on cordis.europa.eu‌‌

Title:
Georgian Artificial Intelligence Networking and Twinning Initiative‌
Duration:
From October 1,‌ 2022 to September 30,‌‌ 2025
Partners:
- Institut National De Recherche En Informatique‌ Et Automatique (INRIA), France‌
- Exolaunch GMBH (EXO), Germany‌‌
- Deutsches Forschungszentrum Fur Kunstliche Intelligenz GMBH (DFKI), Germany‌
- Georgian Technical University (GTU),‌ Georgia
Inria contact:
François Brémond
Coordinator:
George Giorgobiani
Summary:
GAIN will take a strategic step towards integrating Georgia, one of the Widening countries, into the system of European efforts aimed at ensuring Europe's leadership in one of the most transformative technologies of today and tomorrow: Artificial Intelligence (AI). This will be achieved by adjusting the research profile of the central Georgian ICT research institute, the Muskhelishvili Institute of Computational Mathematics (MICM), and linking it to the European AI research and innovation community. Two leading European research organizations (DFKI and INRIA), supported by the high-tech company EXOLAUNCH, will support MICM in this endeavor. The Strategic Research and Innovation Programme (SRIP) designed by the partnership will provide the environment for the Georgian colleagues to get involved in the research projects of the European partners addressing a clearly delineated set of AI topics. Jointly, the partners will advance in capacity building and networking within the area of AI Methods and Tools for Human Activities Recognition and Evaluation, which will also contribute to strengthening core competences in fundamental technologies such as Machine (Deep) Learning. The results of the cooperation, presented through a series of scientific publications and events, will inform the European AI community about the potential of MICM and trigger the building of new partnerships, addressing e.g. Horizon Europe. The project will contribute to the career development of a cohort of young researchers at MICM through joint supervision and targeted capacity building measures. Innovation and Research Administration and Management capacities of MICM will also be strengthened to allow the Institute to be better connected to local, regional and European innovation activities.
Using their extensive research and innovation networking capacities, DFKI and INRIA will introduce MICM to the European AI research community by connecting it to networks such as CLAIRE, ELLIS, ADRA, AI NoEs, etc.

10.3 National initiatives

ANR COMSEMA‌

Website: ANR COMSEMA

Title:
Communications Sémantiques pour les‌ futurs réseaux - Semantic Communications for future networks‌
Duration:
From November 1, 2024 to October 30,‌ 2028
Partners:
- Institut National De Recherche En Informatique‌ Et Automatique (INRIA), France
- Centrale-Supelec
- Orange
Inria contact:‌
François Brémond
Coordinator:
Mohamad Assaad
Summary:
The ANR COMSEMA project, part of Thématiques Spécifiques en Intelligence Artificielle (TSIA), runs from November 1, 2024 to October 30, 2028 (48 months) and aims to improve future networks incorporating video interpretation applications. Wireless networks are currently witnessing a radical shift from a purely data-oriented architecture to service- and intelligence-based architectures, thereby supporting a diverse set of verticals. Thanks to the development of AI, future networks are expected to incorporate an even larger set of applications and services, such as ReID applications and human activity recognition, interactive holograms, e-health, intelligent humanoid robots, etc. In this project, we consider video interpretation applications and propose a fundamental semantic approach to redesign the entire process of information generation and transmission over the network. In particular, novel AI-based interference management that focuses on task achievement, rather than on bit-rate improvement over the air interface, will be investigated. Inria is in charge of customizing video interpretation applications to improve data transmission over the network. The Inria grant is about 196 keuros (24 person-months) out of 560 keuros.

Inria Exploratory Action XGAN

Title:
Interpretable Representation‌ Learning for Video GANs
Duration:
From 2024 to‌ 2028
Partners:
- Inria Center at Université Côte d'Azur,‌ France
Inria contact:
Antitza Dantcheva
Coordinator:
Antitza Dantcheva‌
Summary:
The Inria Exploratory Action (Aex) XGAN, running from 2024 to 2028, aims at piercing the black box of generative models for video generation by proposing strategies to interpret the latent space by (a) designing interpretable architectures and (b) analyzing symmetric functions in the input and output of patch-based generation. Despite remarkable progress in generative models, such networks currently operate as black boxes. The Inria grant is about 170 keuros.

10.4 Regional initiatives

In 2011, we initiated a strategic partnership (called CobTek) with Nice hospital (CHU Nice, Prof. F. Askenazy) to start ambitious research activities dedicated to healthcare monitoring and assistive technologies. These studies address the analysis of more complex spatiotemporal activities (e.g., complex interactions, long-term activities).

11 Dissemination‌

11.1 Promoting scientific activities

11.1.1 Scientific events: organization‌

General chair, scientific chair:

Participant: François Brémond.‌

François Brémond was:

General Chair at IPAS 2025 (130 people), the IEEE International Conference on Image Processing, Applications and Systems, Lyon, January 2025. Member of the organizing committee (see [50]).
General Chair at the South Caucasus Conference‌ on Artificial Intelligence - SCCAI 2025, MICM/GTU, Tbilisi,‌ Georgia, September 16-18, 2025.

Member of the organizing‌ committees:

Participants: Antitza Dantcheva, Michal Balazia.

Antitza Dantcheva was co-organizer of CV4BIOM, the Workshop on Computer Vision for Biometrics, Identity & Behaviour, associated with the International Conference on Computer Vision (ICCV 2025), on October 20th, 2025.

She was‌ also co-organizer of the‌‌ 4th Vision-based Remote Physiological Signal Sensing (RePSS) workshop‌ in conjunction with the‌ International Joint Conference on‌‌ Artificial Intelligence (IJCAI 2025) on August 28th, 2025.‌

Michal Balazia was a member of the technical committee of SCCAI 2025, as well as a session chair. He was also session chair, program chair, and member of the organizing technical committee at ACMMM MultiMediate 2025.

11.1.2 Scientific events:‌ selection

Participants: François Brémond‌‌, Antitza Dantcheva, Michal Balazia, Monique‌ Thonnat.

Reviewer:

François Brémond was a reviewer for major Computer Vision / Machine Learning conferences, including ICCV, ECCV, CVPR, NeurIPS, AAAI, ICLR, WACV.
Monique Thonnat was a member of the conference program committees of IJCAI 2025 and ICPRAM 2026.
Antitza Dantcheva was a reviewer and evaluator for SMASH, a Slovenian career-development training program. She also served as a reviewer for major Computer Vision / Machine Learning conferences such as ICCV, CVPR, NeurIPS, AAAI, ICLR, WACV.
In 2025, Michal Balazia was a reviewer for ACMMM MultiMediate, ACM Multimedia, ICPR, and WACV.

11.1.3 Journal

Michal‌‌ Balazia served as reviewer for TBIOM and MDPI‌ Sensors.

11.1.4 Invited talks‌

Participants: François Brémond,‌‌ Monique Thonnat, Antitza Dantcheva, Michal Balazia‌.

François Brémond gave the following invited talks:

invited talk (1h) at IPAS 2025, the IEEE International Conference on Image Processing, Applications and Systems, Lyon, January 9-11, 2025.
invited talk‌ (1h) at the South‌ Caucasus Conference on Artificial‌‌ Intelligence - SCCAI 2025, MICM/GTU, Tbilisi, Georgia, September‌ 16-18, 2025.
invited talk‌ (1h) on "Video Action‌‌ Recognition" at the University of Bristol, on 21‌ October 2025.
Keynote speaker‌ at the ePictureThis workshop‌‌ on "Video Action Recognition for Human Behavior Analysis",‌ TU-Eindhoven, on 28 October‌ 2025.

Monique Thonnat was invited as keynote speaker at the IEEE ICPRS conference in Viña del Mar, Chile, December 1-4, 2025.

Antitza Dantcheva gave the following invited talks.‌

invited talk in the‌ Storyzy premises in Paris,‌‌ May 7, 2025.
invited talk in the online‌ US Seminar on "US‌ Developments and Impact of‌‌ AI on Biometric Vulnerabilities", June 26, 2025.
invited talk at the Workshop for "Synthetic Realities and Biometric Security: Advances in Forensic Analysis and Threat Mitigation (SRBS)", November 27, associated with BMVC.
invited‌‌ talk at SophIA Summit in Sophia Antipolis, November‌ 20, 2025.
invited talk‌ at the University of‌‌ Technology in Vienna (TU Wien), Austria, December 4,‌ 2025.

Michal Balazia gave‌ invited talks at Metascience‌‌ and Guardians.

11.1.5 Contributed talks

Monique Thonnat gave a talk at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) in Tucson, Arizona, February 28 - March 4, 2025.

11.1.6 Scientific expertise‌‌

Participants: Monique Thonnat, Michal Balazia.

Monique Thonnat evaluated ANR projects in the framework of the comité d’évaluation “CE38 – Interfaces : mathématiques, sciences du numérique – sciences humaines et sociales”.

Michal‌‌ Balazia served as reviewer for ANR and NSERC.‌

11.2 Teaching - Supervision‌ - Juries - Educational‌‌ and pedagogical outreach

Participant:‌ François Brémond.

François Brémond taught AI courses on Computer Vision & Deep Learning for the Data Science and AI MSc program at Université Côte d'Azur. Academic year 2025: 24 hours.

11.2.1 Supervision

Participants: François Brémond‌, Antitza Dantcheva, Michal Balazia.

François Brémond (co)-supervised 11 PhD students and many master's students:

Tomasz Stanczyk: 3IA PhD student
Valeriya Strizhkova: 3IA PhD student, defended on March 14, 2025 [52].
Seongro Yoon: 3IA PhD student
Tanay Agrawal: PhD student - Fellowship from European project GAIN, defended on September 26, 2025 [51].
Abid Ali: PhD student - Fellowship from BoostUrCAreer‌ CoFund
Snehashis Majhi: PhD student - Fellowship from‌ Toyota
Aglind Reka: Fellowship EUR Spectrum, Geoazur -‌ Intelligent Mapping
Ezem Ekmekci: 3IA PhD student
Wenxin‌ Xiong: Gredeg PhD student
Yuan Gao: INRAE PhD‌ student
Sébastien Frey: Nice Hospital PhD student.

François Brémond also took part in the supervision of several interns (master's and PhD level) hosted by the STARS team.

Antitza Dantcheva (co)-supervised‌ 5 PhD students and many master's students:

Valeriya‌ Strizhkova: 3IA PhD student
Tanay Agrawal: PhD student - Fellowship from European project GAIN
Snehashis Majhi:‌ PhD student - Fellowship from Toyota
Nabyl Quignon: Inria AEX XGAN PhD
Charbel Yahchouchi: CIFRE PhD‌ Probayes
Baptiste Chopin: Inria AEX XGAN Postdoc
Anil Egin: Master's student

Michal Balazia supervised the following‌ researchers.

M2 interns: Aaryan Dhawan, Miriana Russo, Sanya‌ Sinha
engineer: Aowen Shi
pre-docs: Quentin Merilleau, Aglind‌ Reka.

11.2.2 Juries

Participants: François Brémond, Antitza‌ Dantcheva.

François Brémond participated in the following juries:

HDR:
- Carlos Crispim from Université Lumière -‌ Lyon 2, September 22, 2025
PhD Thesis Review:‌
- Nima Mehdi from Inria Centre at Université de‌ Lorraine, December 17, 2024
- Kevin Flanagan from the‌ University of Bristol, October 22, 2025
- Samy Tafasca‌ from École Polytechnique Fédérale de Lausanne - EPFL,‌ December 5, 2025
- Salvatore Fiorilla from Università di‌ Bologna, December 11, 2025
CSI - Comité de‌ suivi de thèse:
- Marc Chapus, May 5, 2025‌
- Keqi Chen, May 21, 2025
- Monica Fossati, May‌ 26, 2025
- Federica Facente, May 31, 2025
- Aela‌ Le Sommer, June 3, 2025
- Franz Fabini Franco Gallo, June 10, 2025
- Thomas Campagnolo, July 1, 2025‌
- Kaushik Bhowmik, July 1, 2025
- Sofia Alexopoulou, July‌ 11, 2025
- Yannick Porto, July 9, 2025
- Idir‌ Chatar, September 12, 2025

Antitza Dantcheva participated in‌ the following juries.

PhD Thesis Review:
- Sahar Husseini,‌ Eurecom, June 17, 2025.
CSI - Comité de‌ suivi de thèse:
- Mehdi Atamna, December 8, 2025‌
- Huyen Trang Nguyen, October 21, 2025
- Yuanzhi Zhu,‌ October 30, 2025
- Huyen Trang Nguyen, July 7,‌ 2025

11.3 Popularization

11.3.1 Specific official responsibilities in‌ science outreach structures

Participants: François Brémond, Antitza‌ Dantcheva, Michal Balazia.

François Brémond participated in the organization of the SophIA Summit 2025.
Michal Balazia gave invited talks at Metascience on‌ June 19, 2025.
Michal Balazia gave invited talks‌ at Guardians on October 9, 2025.

11.3.2 Productions (articles, videos, podcasts, serious‌ games, ...)

Participant: Michal‌ Balazia.

Michal Balazia‌‌ made a demo visualization tool for action detection‌ in videos of psychiatric‌ interviews.

11.3.3 Participation in‌‌ Live events

Participant: François Brémond.

Francois Bremond gave presentations at the following events:

Presentation on "Human Action Recognition", part of “Fête‌ de la science” at‌ the Village des sciences‌‌ d'Antibes – Juan-les-Pins, on October 11, 2025;
Presentation‌ for bachelor students, ENS‌ de Lyon, Sophia Antipolis,‌‌ November 2025;
Presentation for high school students, part of Terra Numerica, Sophia Antipolis, December 2025.

11.3.4‌‌ Other science outreach relevant activities

Participant: François Brémond‌.

Francois Bremond gave‌ an interview on "Automated‌‌ video surveillance" to Bachelor students from Sciences Po,‌ in February 2025.

12‌ Scientific production

12.1 Major‌‌ publications

1 inproceedings Dhruv Agarwal, Tanay Agrawal, Laura Ferrari and Francois F Bremond. From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation. AVSS 2021 - 17th IEEE International Conference on Advanced Video and Signal-based Surveillance, Virtual, United States, November 2021.
2 inproceedings Tanay Agrawal, Dhruv Agarwal, Michal Balazia, Neelabh Sinha and Francois F Bremond. Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding. VISAPP '22: International Conference on Computer Vision Theory and Applications, Virtual, United States, IEEE; SCITEPRESS - Science and Technology Publications, February 2022, 501-508.
3 inproceedings Tanay Agrawal, Michal Balazia, Philipp Müller and Francois F Bremond. Multimodal Vision Transformers with Forced Attention for Behavior Analysis. WACV '23: IEEE International Winter Conference on Applications in Computer Vision, Waikoloa, United States, January 2023.
4 inproceedings Abid Ali, Farhood F Negin, Francois F Bremond and Susanne Thümmler. Video-based Behavior Understanding of Children for Objective Diagnosis of Autism. VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications, Online, France, February 2022.
5 inproceedings Mahmoud Ali, Di Yang, Arkaprava Sinha, Dominick Reilly, Srijan Das, Gianpiero Francesca and Francois Bremond. Quo Vadis, Video Understanding with Vision-Language Foundation Models? NeurIPS Proceedings, NeurIPS 2024 - 38th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 2024.
6 inproceedings Michal Balazia, Philipp Müller, Ákos Levente Tánczos, August Von Liechtenstein and François Brémond. Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation. MM'22: The 30th ACM International Conference on Multimedia, Lisbon, Portugal, ACM, October 2022, 70-79.
7 inproceedings Hava Chaptoukaev, Valeriya Strizhkova, Michele Panariello, Bianca D’alpaos, Aglind Reka, Valeria Manera, Susanne Thümmler, Esma Ismailova, Nicholas Evans, Francois F Bremond, Massimiliano Todisco, Maria A. Zuluaga and Laura M Ferrari. StressID: a Multimodal Dataset for Stress Identification. NeurIPS 2023 - 37th Conference on Neural Information Processing Systems, New Orleans, United States, December 2023.
8 article Siyuan Chen, Youngjun Cho, Kun Yu, Laura Ferrari and Francois F Bremond. Editorial: Recognizing the state of emotion, cognition and action from physiological and behavioral signals. Frontiers in Computer Science, 4, August 2022.
9 inproceedings Hao Chen, Yaohui Wang, Benoit Lagadec, Antitza Dantcheva and Francois F Bremond. Joint Generative and Contrastive Learning for Unsupervised Person Re-identification. CVPR 2021 - IEEE Conference on Computer Vision and Pattern Recognition, Virtual, United States, June 2021.
10 article Hao Chen, Yaohui Wang, Benoit Lagadec, Antitza Dantcheva and Francois Bremond. Learning Invariance from Generated Variance for Unsupervised Person Re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2022, 1-15. In press.
11 article Carlos F Crispim-Junior, Vincent Buso, Konstantinos Avgerinakis, Georgios Meditskos, Alexia Briassouli, Jenny Benois-Pineau, Yiannis Kompatsiaris and Francois Bremond. Semantic Event Fusion of Different Visual Modality Concepts for Activity Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 2016, 1598-1611.
12 inproceedings Rui Dai, Srijan Das and Francois F Bremond. CTRN: Class Temporal Relational Network For Action Detection. BMVC 2021 - The British Machine Vision Conference, Virtual, United Kingdom, November 2021.
13 inproceedings Rui Dai, Srijan Das and Francois F Bremond. Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection. ICCV 2021 - IEEE/CVF International Conference on Computer Vision, Montreal, Canada, October 2021.
14 inproceedings Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael Ryoo and Francois F Bremond. MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection. CVPR - Conference on Computer Vision and Pattern Recognition, New Orleans, United States, June 2022.
15 inproceedings Rui Dai, Srijan Das, Luca Minciullo, Lorenzo Garattoni, Gianpiero Francesca and Francois F Bremond. PDAN: Pyramid Dilated Attention Network for Action Detection. WACV 2021 - Winter Conference on Applications of Computer Vision 2021, Waikoloa / Virtual, United States, January 2021.
16 inproceedings Rui Dai, Srijan Das, Michael Ryoo and Francois Bremond. AAN: Attributes-Aware Network for Temporal Action Detection. BMVC 2023 - The 34th British Machine Vision Conference, Aberdeen, United Kingdom, November 2023.
17 article Antitza Dantcheva and François Brémond. Gender estimation based on smile-dynamics. IEEE Transactions on Information Forensics and Security, 2016, 11.
18 inproceedings Srijan Das, Rui Dai, Michal Koperski, Luca Minciullo, Lorenzo Garattoni, François Bremond and Gianpiero Francesca. Toyota Smarthome: Real-World Activities of Daily Living. ICCV 2019 - 17th International Conference on Computer Vision, Seoul, South Korea, October 2019.
19 article Srijan Das, Rui Dai, Di Yang and Francois F Bremond. VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living. IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2021.
20 inproceedings Srijan Das, Saurav Sharma, Rui Dai, Francois F Bremond and Monique Thonnat. VPN: Learning Video-Pose Embedding for Activities of Daily Living. ECCV 2020 - 16th European Conference on Computer Vision, Glasgow (Virtual), United Kingdom, August 2020.
21 inproceedings Laura M Ferrari, Guy Abi Hanna, Paolo Volpe, Esma Ismailova, Francois F Bremond and Maria A. Zuluaga. One-class autoencoder approach for optimal electrode set-up identification in wearable EEG event monitoring. EMBC 2021 - 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Virtual, France, October 2021.
22 inproceedings Jen-Cheng Hou, Aileen Mcgonigal, Fabrice Bartolomei and Monique Thonnat. A Self-Supervised Pre-Training Framework for Vision-Based Seizure Classification. 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing proceedings, IEEE ICASSP 2022: IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, Singapore, May 2022.
23 article Indu Joshi, Marcel Grimmer, Christian Rathgeb, Christoph Busch, Francois Bremond and Antitza Dantcheva. Synthetic Data in Human Analysis: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 1-20.
24 article Alexandra König, Elisa Mallick, Johannes Tröger, Nicklas Linz, Radia Zeghari, Valeria Manera and Philippe Robert. Measuring neuropsychiatric symptoms in patients with early cognitive decline using speech analysis. European Psychiatry, 64, 1, 2021, e64.
25 article Alexandra König, Philipp Müller, Johannes Tröger, Hali Lindsay, Jan Alexandersson, Jonas Hinze, Matthias Riemenschneider, Danilo Postin, Eric Ettore, Amandine Lecomte, Michel Musiol, Maxime Amblard, François Bremond, Michal Balazia and Rene Hurlemann. Multimodal phenotyping of psychiatric disorders from social interaction: Protocol of a clinical multicenter prospective study. Personalized Medicine in Psychiatry, 33-34, May 2022, 100094.
26 inproceedings Neelabh Sinha, Michal Balazia and Francois F Bremond. FLAME: Facial Landmark Heatmap Activated Multimodal Gaze Estimation. AVSS 2021 - 17th IEEE International Conference on Advanced Video and Signal-based Surveillance, Virtual, United States, November 2021.
27 inproceedings Valeriya Strizhkova, Yaohui Wang, David Anghelone, Di Yang, Antitza Dantcheva and François Brémond. Emotion Editing in Head Reenactment Videos using Latent Space Manipulation. 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), FG 2021 - IEEE International Conference on Automatic Face and Gesture Recognition, Jodhpur, India, December 2021.
28 inproceedings Yaohui Wang, Piotr Bilinski, Francois F Bremond and Antitza Dantcheva. G3AN: Disentangling Appearance and Motion for Video Generation. CVPR 2020 - IEEE Conference on Computer Vision and Pattern Recognition, Seattle / Virtual, United States, June 2020.
29 inproceedings Di Yang, Yaohui Wang, Quan Kong, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca and Francois F Bremond. Self-Supervised Video Representation Learning via Latent Time Navigation. Technical Tracks 3, AAAI 2023 - AAAI Conference on Artificial Intelligence, 37, Proceedings of the 37th AAAI Conference on Artificial Intelligence, 3, Washington, D.C., United States, June 2023.
30 article Radia Zeghari, Rachid Guerchouche, Minh Tran-Duc, François Bremond, Kai Langel, Inez Ramakers, Nathalie Amiel, Maria Pascale Lemoine, Vincent Bultingaire, Valeria Manera, Philippe Robert and Alexandra König. Feasibility Study of an Internet-Based Platform for Tele-Neuropsychological Assessment of Elderly in Remote Areas. Diagnostics, 12, April 2022.

12.2 Publications of the year

International journals

31 article David Anghelone, Cunjian Chen, Arun Ross and Antitza Dantcheva. Beyond the Visible: A Survey on Cross-spectral Face Recognition. Neurocomputing, January 2025.
32 article Eric Ettore, Hali Lindsay, Johannes Tröger, Michal Balazia, Benoit Michel, Philippe Robert, Danilo Postin, Rene Hurlemann and Alexandra König. Childhood trauma affects speech and language measures in patients with major depressive disorder during clinical interviews. Journal of Affective Disorders, 388, November 2025, 119769.
33 article Sébastien Frey, Federica Facente, Wen Wei, Ezem Sura Ekmekci, Eric Séjor, Patrick Baqué, Matthieu Durand, Hervé Delingette, Francois Bremond, Pierre Berthet-Rayne and Nicholas Ayache. Optimizing Intraoperative AI: Evaluation of YOLOv8 for Real-Time Recognition of Robotic and Laparoscopic Instruments. Journal of Robotic Surgery, 19, 131, March 2025.
34 article Mohsen Tabejamaat, Hoda Mohammadzade, Farhood Negin and Francois Bremond. EEG Classification with Limited Data: A Deep Clustering Approach. Pattern Recognition, Volume 157, 110934, January 2025.

International‌ peer-reviewed conferences

35 inproceedings Tanay Agrawal, Abid Ali, Antitza Dantcheva and Francois Bremond. Are Attention Maps Richer than we Imagined for Action Recognition? AVSS 2025 - IEEE International Conference on Advanced Video and Signal based Surveillance, Tainan, Taiwan, August 2025.
36 inproceedings Tanay Agrawal, Mohammed Guermal, Michal Balazia and Francois Bremond. CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets. IEEE Xplore, WACV 2025 - Winter Conference on Applications of Computer Vision, Tucson, United States, March 2025.
37 inproceedings Abid Ali, Rui Dai, Ashish Marisetty, Guillaume Astruc, Monique Thonnat, Jean-Marc Odobez, Susanne Thümmler and Francois Bremond. Loose Social-Interaction Recognition in Real-world Therapy Scenarios. IEEE Xplore, WACV 2025 - IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, United States, arXiv, February 2025.
38 inproceedings Abid Ali, Antitza Dantcheva, Francois Bremond and Tanay Agrawal. Scaling Action Detection: AdaTAD++ with Transformer-Enhanced Temporal-Spatial Adaptation. ICCV 2025 - IEEE International Conference on Computer Vision, Honolulu, Hawaii, United States, October 2025.
39 inproceedings Anil Egin, Andrea Tangherloni and Antitza Dantcheva. Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2025), CV4BIOM: Workshop on Computer Vision for Biometrics, Identity & Behaviour, ICCV 2025 - IEEE/CVF International Conference on Computer Vision Workshops, Honolulu, Hawaii, United States, IEEE, October 2025, 5925-5934.
40 inproceedings Snehashis Majhi, Giacomo D’amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev and Francois Bremond. Just Dance with π!, A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection. CVPR 2025 - Conference on Computer Vision and Pattern Recognition, Nashville, United States, June 2025.
41 inproceedings Snehashis Majhi, Mohammed Guermal, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca and Francois Bremond. Guess Future Anomalies from Normalcy: Forecasting Abnormal Behavior in Real-World Videos. IEEE Xplore, IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson (AZ), United States, February 2025.
42 inproceedings Snehashis Majhi, Mohammed Guermal, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca and Francois Bremond. Guess Future Anomalies from Normalcy: Forecasting Abnormal Behavior in Real-World Videos. Winter Conference on Applications of Computer Vision, WACV 2025, Tucson, United States, February 2025.
43 inproceedings Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar, Pu Wang, Francois Bremond, Le Xue and Srijan Das. LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living. CVPR 2025 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States, March 2025.
44 inproceedings Sanya Sinha, Michal Balazia and Francois Bremond. Identifying Surgical Instruments in Pedagogical Cataract Surgery Videos through an Optimized Aggregation Network. IPAS 2025 - Sixth IEEE International Conference on Image Processing Applications and Systems, Lyon, France, January 2025.
45 inproceedings Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang and Srijan Das. SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living. 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025, Philadelphia, United States, February 2025.
46 inproceedings Tomasz Stanczyk, Seongro Yoon and Francois Bremond. No Train Yet Gain: Towards Generic Multi-Object Tracking in Sports and Beyond. IEEE Xplore, CVPR 2025 - Conference on Computer Vision and Pattern Recognition, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, United States, June 2025.
47 inproceedings Daksitha Senel Withanage Don, Marius Funk, Michal Balazia, Huajian Qiu, Shogo Okada, François Brémond, Jan Alexandersson, Andreas Bulling, Elisabeth André and Philipp Müller. MultiMediate '25: Cross-cultural Multi-domain Engagement Estimation. MM 2025 - 33rd ACM International Conference on MultiMedia, Dublin, Ireland, October 2025, 14150-14155.
48 inproceedings Giacomo d'Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev and Francois Bremond. Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection. ICCV 2025 - IEEE International Conference on Computer Vision, Honolulu, Hawaii, United States, October 2025.

Conferences‌‌ without proceedings

49 inproceedings Tomasz Stanczyk and Francois Bremond. Does Re-ID Really Help in Multi-Object Tracking? SCCAI 2025 - South Caucasus Conference on Artificial Intelligence, Tbilisi, Georgia, September 2025.

Edition (books, proceedings, special‌ issue of a journal)‌

50 proceedings Serge Miguet, Mouna Zouari, Dorra Sellami, Habib M. Kammoun and Francois Bremond, eds. 6th IEEE International Conference on Image Processing, Applications and Systems - IPAS 2025 Conference Proceedings. IPAS 2025 - Sixth IEEE international conference on Image Processing Applications and Systems, Lyon, France, IEEE, March 2025.

Doctoral‌ dissertations and habilitation theses‌‌

51 thesis Tanay Agrawal. Training-efficient video feature extraction for human-centric multimodal video understanding. Université Côte d'Azur, September 2025.
52 thesis Valeriya Strizhkova. Emotion recognition using Deep Learning. Université Côte d'Azur, March 2025.

Reports & preprints

53 misc Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das and Antitza Dantcheva. Beyond Real versus Fake Towards Intent-Aware Video Analysis. November 2025.
54 misc Johnata Brayan, Sihao Deng, Armando Alves Neto, Iaroslav Okunevich, Tomas Krajnik, Francois Bremond and Zhi Yan. NavWareSet: A Dataset of Socially Compliant and Non-Compliant Robot Navigation. August 2025.
55 misc Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang and Antitza Dantcheva. Dimitra: Audio-driven Diffusion model for Expressive Talking Head Generation. February 2025.
56 misc Nabyl Quignon, Baptiste Chopin, Yaohui Wang and Antitza Dantcheva. THEval. Evaluation Framework for Talking Head Video Generation. November 2025.
57 misc Yaohui Wang, Di Yang, Xinyuan Chen, Francois Bremond, Yu Qiao and Antitza Dantcheva. LIA-X: Interpretable Latent Portrait Animator. 2025.
