
2025 Activity report, Project-Team MULTISPEECH

RNSR: 201421147E
  • Research​ center Inria Centre at​‌ Université de Lorraine
  • In partnership with: CNRS, Université de Lorraine
  • Team name:​ Multimodal Speech in Interaction​‌
  • In collaboration with: Laboratoire lorrain de recherche en informatique et ses applications (LORIA)

Creation of the​‌ Project-Team: 2024 October 01​​

Each year, Inria research​​​‌ teams publish an Activity​ Report presenting their work​‌ and results over the​​ reporting period. These reports​​​‌ follow a common structure,​ with some optional sections​‌ depending on the specific​​ team. They typically begin​​​‌ by outlining the overall​ objectives and research programme,​‌ including the main research​​ themes, goals, and methodological​​​‌ approaches. They also describe​ the application domains targeted​‌ by the team, highlighting​​ the scientific or societal​​​‌ contexts in which their​ work is situated.

The​‌ reports then present the​​ highlights of the year,​​​‌ covering major scientific achievements,​ software developments, or teaching​‌ contributions. When relevant, they​​ include sections on software,​​​‌ platforms, and open data,​ detailing the tools developed​‌ and how they are​​ shared. A substantial part​​​‌ is dedicated to new​ results, where scientific contributions​‌ are described in detail,​​ often with subsections specifying​​​‌ participants and associated keywords.​

Finally, the Activity Report​‌ addresses funding, contracts, partnerships,​​ and collaborations at various​​​‌ levels, from industrial agreements​ to international cooperations. It​‌ also covers dissemination and​​ teaching activities, such as​​​‌ participation in scientific events,​ outreach, and supervision. The​‌ document concludes with a​​ presentation of scientific production,​​​‌ including major publications and​ those produced during the​‌ year.

Keywords

Computer Science​​ and Digital Science

  • A3.4.​​​‌ Machine learning and statistics​
  • A3.5. Social networks
  • A4.8.​‌ Privacy-enhancing technologies
  • A5.1.5. Body-based​​ interfaces
  • A5.1.7. Multimodal interfaces​​​‌
  • A5.6.2. Augmented reality
  • A5.6.3.​ Avatar simulation and embodiment​‌
  • A5.7. Audio modeling and​​ processing
  • A5.7.1. Sound
  • A5.7.3.​​​‌ Speech
  • A5.7.4. Analysis
  • A5.7.5.​ Synthesis
  • A5.8. Natural language​‌ processing
  • A5.9. Signal processing​​
  • A5.9.1. Sampling, acquisition
  • A5.9.2.​​​‌ Estimation, modeling
  • A5.9.3. Reconstruction,​ enhancement
  • A5.10.2. Perception
  • A5.10.5.​‌ Robot interaction (with the​​ environment, humans, other robots)​​​‌
  • A6.2.4. Statistical methods
  • A6.3.1.​ Inverse problems
  • A6.3.4. Model​‌ reduction
  • A6.3.5. Uncertainty Quantification​​
  • A9.2. Machine learning
  • A9.2.1.​​​‌ Supervised learning
  • A9.2.2. Unsupervised​ learning
  • A9.2.3. Reinforcement learning​‌
  • A9.2.4. Optimization and learning​​
  • A9.2.6. Neural networks
  • A9.2.8.​​​‌ Deep learning
  • A9.3. Signal​ processing
  • A9.4. Natural language​‌ processing
  • A9.5. Robotics and​​ AI
  • A9.11. Generative AI​​​‌

Other Research Topics and​ Application Domains

  • B8.1.2. Sensor​‌ networks for smart buildings​​
  • B8.4. Security and personal​​​‌ assistance
  • B9.1.1. E-learning, MOOC​
  • B9.5.1. Computer science
  • B9.5.2.​‌ Mathematics
  • B9.5.6. Data science​​
  • B9.6.8. Linguistics
  • B9.6.10. Digital​​​‌ humanities
  • B9.10. Privacy

1​ Team members, visitors, external​‌ collaborators

Research Scientists

  • Yves​​ Laprie [CNRS,​​ Senior Researcher, HDR​​​‌]
  • Paul Magron [‌INRIA, Researcher]‌​‌
  • Mostafa Sadeghi [INRIA​​, ISFP]
  • Emmanuel​​​‌ Vincent [INRIA,‌ Senior Researcher, HDR‌​‌]

Faculty Members

  • Slim​​ Ouni [Team leader​​​‌, UL, Professor‌, HDR]
  • Domitille‌​‌ Caillat [UNIV MONTPELLIER​​ III, Associate Professor​​​‌ Delegation, until Aug‌ 2025]
  • Vincent Colotte‌​‌ [UL, Associate​​ Professor]
  • Pascale Erhart​​​‌ [UNIV STRASBOURG,‌ Associate Professor Delegation,‌​‌ from Sep 2025]​​
  • Irina Illina [UL​​​‌, Associate Professor,‌ HDR]
  • Romain Serizel‌​‌ [UL, Professor​​, HDR]

Post-Doctoral​​​‌ Fellows

  • Tom Bourgeade [‌UL, Post-Doctoral Fellow‌​‌]
  • Constance Douwes [​​UL, Post-Doctoral Fellow​​​‌, until Mar 2025‌]
  • François Effa [‌​‌UL, ATER, from​​ Oct 2025]
  • Natalia​​​‌ Tomashenko [UL, ATER‌, from Oct 2025‌​‌]

PhD Students

  • Louis​​ Abel [UL,​​​‌ until Feb 2025]‌
  • Jean Eudes Ayilo [‌​‌INRIA]
  • Sofiane Azzouz​​ [UL]
  • Raphael​​​‌ Bagat [CNRS]‌
  • Zahra Hafida Benslimane [‌​‌CEA]
  • Doria Bonzi​​ [UL, from​​​‌ Oct 2025]
  • Aine‌ Drelingyte [UL]‌​‌
  • Orane Dufour [UL​​]
  • Guilhem Faure [​​​‌INRIA]
  • Imed Eddine‌ Ghebriout [CNRS]‌​‌
  • Mickaella Grondin-Verdon [UL​​, from Oct 2025​​​‌]
  • Mickaella Grondin-Verdon [‌UL, ATER,‌​‌ until Aug 2025]​​
  • Taous Iatariene [ORANGE​​​‌, CIFRE]
  • Isobelle‌ Miles [INRIA,‌​‌ from Feb 2025]​​
  • Mayank Mishra [UL​​​‌]
  • Nasser-Eddine Monir [‌UL]
  • Nhat Nam‌​‌ Nguyen [UL,​​ from Nov 2025]​​​‌
  • Robin San Roman [‌META, CIFRE]‌​‌
  • Alex Stasica [INRIA​​, from Sep 2025​​​‌]
  • Natalia Tomashenko [‌UL, ATER,‌​‌ from Sep 2025]​​

Technical Staff

  • Louis Abel​​​‌ [UL, Engineer‌, from Dec 2025‌​‌]
  • Romuald Ait Bachir​​ [CNRS, Engineer​​​‌, from Oct 2025‌]
  • Hugo Bergerat [‌​‌UL, Engineer,​​ from Oct 2025]​​​‌
  • Theo Biasutto-Lervat [INRIA‌, Engineer]
  • Sam‌​‌ Bigeard [INRIA,​​ Engineer]
  • Emma Granier​​​‌ [UL, Engineer‌, from Nov 2025‌​‌]
  • Colombe M'Boungou [​​INRIA, Engineer,​​​‌ from Feb 2025]‌
  • Malek Yaich [INRIA‌​‌, Engineer]

Interns​​ and Apprentices

  • Elliot Abarca​​​‌ [INRIA, Intern‌, from Apr 2025‌​‌ until Jun 2025]​​
  • Abbas Awarkeh [CNRS​​​‌, Intern, from‌ Mar 2025 until Sep‌​‌ 2025]
  • Nacera Elarbi​​ Tolehi [UL,​​​‌ Intern, from Mar‌ 2025 until Sep 2025‌​‌]
  • Tian Huang [​​UL, Intern,​​​‌ from Mar 2025 until‌ Sep 2025]
  • Camille‌​‌ Lavigne [INRIA,​​ Intern, from Jun​​​‌ 2025 until Aug 2025‌]
  • Kehina Manseri [‌​‌INRIA, Intern,​​ from Feb 2025 until​​​‌ Jul 2025]
  • Celie‌ Ponroy [INRIA,‌​‌ Intern, from Apr​​ 2025 until Jun 2025​​​‌]
  • Nina Rouffaud [‌UL, Intern,‌​‌ from Jun 2025 until​​​‌ Jun 2025]
  • Nicolas​ Russo [UL,​‌ Intern, from Jul​​ 2025 until Aug 2025​​​‌]

Administrative Assistants

  • Emmanuelle​ Deschamps [INRIA]​‌
  • Cecilia Olivier [INRIA​​]

2 Overall objectives​​​‌

In Multispeech, we consider speech as a multimodal signal with different facets: acoustic, facial, articulatory, gestural, etc. Historically, speech was mainly considered under its acoustic facet, which is still the most important one. However, the acoustic signal is a consequence of the temporal evolution of the shape of the vocal tract (pharynx, tongue, jaws, lips, etc.); this is the articulatory facet of speech. Since the vocal tract configuration is partly reflected in facial movements, these constitute the primary visual facet of speech.

The face can provide additional information on the speaker's state through facial expressions. Speech can be accompanied by gestures (head nodding, arm and hand movements, etc.) that help to clarify the linguistic message. In some cases, such as in sign language, these gestures can bear the main linguistic content and be the only means of communication.

The general​​ objective of Multispeech is​​​‌ to study the analysis​ and synthesis of the​‌ different facets of this​​ multimodal signal and their​​​‌ multimodal coordination in the​ context of human-human or​‌ human-computer interaction. While this​​ multimodal signal carries all​​​‌ of the information used​ in spoken communication, the​‌ collection, processing, and extraction​​ of meaningful information by​​​‌ a machine system remains​ a challenge. In particular,​‌ to operate in real-world​​ conditions, such a system​​​‌ must be robust to​ noisy or missing facets.​‌ We are especially interested​​ in designing models and​​​‌ learning techniques that rely​ on limited amounts of​‌ labeled data and that​​ preserve privacy.

Therefore, Multispeech​​​‌ addresses data-efficient, privacy-preserving learning​ methods, and the robust​‌ extraction of various streams​​ of information from speech​​​‌ signals. These two axes​ will allow us to​‌ address multimodality, i.e., the​​ analysis and the generation​​​‌ of multimodal speech and​ its consideration in an​‌ interactional context.

The outcomes​​ will crystallize into a​​​‌ unified software platform for​ the development of embodied​‌ voice assistants. Our main​​ objective is that the​​​‌ results of our research​ feed this platform, and​‌ that the platform itself​​ facilitates our research and​​​‌ that of other researchers​ in the general domain​‌ of human-computer interaction, as​​ well as the development​​​‌ of concrete applications that​ help humans to interact​‌ with one another or​​ with machines. We will​​​‌ focus on two main​ application areas: language learning​‌ and health assistance.

3​​ Research program

3.1 Axis​​​‌ 1 — Data-efficient and​ privacy-preserving learning

A central​‌ aspect of our research​​ is to design machine​​​‌ learning models and methods​ for multimodal speech data,​‌ whether acoustic, visual or​​ gestural. By contrast with​​​‌ big tech companies, we​ focus on scenarios where​‌ the amount of speech​​ data is limited and/or​​​‌ access to the raw​ data is infeasible due​‌ to privacy requirements, and​​ little or no human​​​‌ labels are available.

3.1.1​ Axis 1.1 — Integrating​‌ domain knowledge

State-of-the-art methods​​ for speech and audio​​ processing are based on​​​‌ discriminative neural networks trained‌ for the targeted task.‌​‌ This paradigm faces major​​ limitations: lack of interpretability,​​​‌ large data requirements and‌ inability to generalize to‌​‌ unseen classes or tasks.​​ Our approach is to​​​‌ combine the representation power‌ of deep learning with‌​‌ our acoustic expertise to​​ obtain smaller generative models​​​‌ describing the probability distribution‌ of speech and audio‌​‌ signals. Particular attention will​​ be paid to designing​​​‌ physically-motivated input layers, output‌ layers, and unsupervised representations‌​‌ that capture complex-valued, multi-scale​​ spectro-temporal dependencies. Given these​​​‌ models, we derive computationally‌ efficient inference algorithms that‌​‌ address the above limitations.​​ We also explore the​​​‌ integration of deep learning‌ with symbolic reasoning and‌​‌ common-sense knowledge to increase​​ the generalization ability of​​​‌ deep models.

3.1.2 Axis‌ 1.2 — Learning from‌​‌ little/no labeled data

While supervised learning from fully labeled data is economically costly, unlabeled data are inexpensive but provide intrinsically less information. Our goal is to learn representations that disentangle the attributes of speech by equipping the unsupervised representation learning methods above with supervised branches exploiting the available labels and supervisory signals, and with multiple adversarial branches overcoming the usual limitations of adversarial training.

3.1.3‌​‌ Axis 1.3 — Preserving​​ privacy

To preserve privacy,​​​‌ speech must be transformed‌ to hide the users'‌​‌ identity and other privacy-sensitive​​ attributes (e.g., accent, health​​​‌ status) while leaving intact‌ those attributes which are‌​‌ required for the task​​ (e.g., phonetic content for​​​‌ automatic speech recognition) and‌ preserving the data variability‌​‌ for training purposes. We​​ develop strong attacks to​​​‌ evaluate privacy. We also‌ seek to hide personal‌​‌ identifiers and privacy-sensitive attributes​​ in the linguistic content,​​​‌ focusing on their robust‌ extraction and replacement from‌​‌ speech signals.

3.1.4 Axis​​ 1.4 — Reducing computational​​​‌ footprint

This axis includes‌ proposing reliable methods to‌​‌ quantify fine-grained energy consumption,​​ computational footprint (in terms​​​‌ of operations), and memory‌ footprint, so as to‌​‌ identify potential bottlenecks in​​ the network at training​​​‌ and test time before‌ applying compression methods.

3.2‌​‌ Axis 2 — Extracting​​ information from speech signals​​​‌

In this axis, we‌ focus on extracting meaningful‌​‌ information from speech signals​​ in real conditions. This​​​‌ information can be related‌ (1) to the linguistic‌​‌ content, (2) to the​​ speaker, and (3) to​​​‌ the speech environment.

3.2.1‌ Axis 2.1 — Linguistic‌​‌ speech content

Speech recognition is the main means to extract linguistic information from speech. Although it is a mature research area, performance drops in real-world environments motivate the development of speech enhancement and source separation methods to improve robustness in such scenarios. Semantic content analysis is required to interpret the spoken message. The challenges include learning from little real data, quickly adapting to new topics, and robustness to speech recognition errors. The detection and classification of hate speech in social media videos will also be considered as a benchmark, thereby extending the work on text-only detection. Finally, we also consider extracting phonetic and prosodic information to study the categorization of speech sounds and certain aspects of prosody by learners of a foreign language.

3.2.2 Axis 2.2​​ — Speaker identity and​​​‌ states

Speaker identity is​ required for the personalization​‌ of human-computer interaction. Speaker​​ recognition and diarization are​​​‌ still challenging in real-world​ conditions. The speaker states​‌ that we aim to​​ recognize include emotion and​​​‌ stress, which can be​ used to adapt the​‌ interaction in real time.​​

3.2.3 Axis 2.3 —​​​‌ Speech environment information

We​ develop audio event detection​‌ methods that exploit both​​ strongly/weakly labeled and unlabeled​​​‌ data, operate in real-world​ conditions, can discover new​‌ events, and provide a​​ semantic interpretation. Modeling the​​​‌ temporal, spatial and logical​ structure of ambient sound​‌ scenes over a long​​ duration is also considered.​​​‌

3.3 Axis 3 —​ Multimodal Speech: generation and​‌ interaction

In our project,​​ we consider speech as​​​‌ a multimodal object, where​ we study (1) multimodality​‌ modeling and analysis, focusing​​ on multimodal fusion and​​​‌ coordination, (2) the generation​ of multimodal speech by​‌ taking into account its​​ different facets (acoustic, articulatory,​​​‌ visual, gestural), separately or​ combined, and (3) interaction,​‌ in the context of​​ human-human or human-computer interaction.​​​‌

3.3.1 Axis 3.1 -​ Multimodality modeling and analysis​‌

The study of multimodality concerns the interaction between modalities, their fusion, coordination and synchronization for a single speaker, as well as their synchronization across the speakers in a conversation. We focus on audiovisual speech enhancement to improve the intelligibility and quality of noisy speech by considering the speaker’s lip movements. We also consider semi-supervised, weakly supervised, and self-supervised learning methods for multimodal data to obtain interpretable representations that disentangle in each modality the attributes related to linguistic and semantic content, emotion, reaction, etc. We also study the contribution of each modality to the intelligibility of spoken communication.

3.3.2 Axis​‌ 3.2 - Multimodal speech​​ generation

Multimodal speech generation​​​‌ refers to articulatory, acoustic,​ and audiovisual speech synthesis​‌ techniques which output one​​ or more facets. Articulatory​​​‌ speech synthesis relies on​ 2D and 3D modeling​‌ of the dynamics of​​ the vocal tract from​​​‌ real-time MRI (rtMRI) data.​ We consider the generation​‌ of the full vocal​​ tract, from the vocal​​​‌ folds to the lips,​ first in 2D then​‌ in 3D. This comprises​​ the generation of the​​​‌ face and the prediction​ of the glottis opening.​‌ We also consider audiovisual​​ speech synthesis. Both the​​​‌ animation of the lower​ part of the face​‌ related to speech and​​ of the upper part​​​‌ related to the facial​ expressions are considered, and​‌ development continues towards a​​ multilingual talking head. We​​​‌ investigate further the modeling​ of expressivity for both​‌ audio-only and audiovisual speech​​ synthesis, for a better​​​‌ control of expressivity, where​ we consider several disentangled​‌ attributes at the same​​ time.

3.3.3 Axis 3.3​​​‌ — Interaction

Interaction is a new field of research for our project-team that we will approach gradually. We start by studying the multimodal components (prosody, facial expressions, gestures) used during interaction, both by the speaker and by the listener, where the goal is to simultaneously generate speech and gestures for the speaker, and to generate regulatory gestures for the listener. We will introduce different dialog bricks progressively: spoken language understanding, dialog management, and natural language generation. Dialog will be considered in a multimodal context (gestures, emotional states of the interlocutor, etc.) and we will break the classical dialog management scheme to dynamically account for the interlocutor's evolution during the speaker's response.

3.4 Software​​ platform: Multimodal Voice assistant​​​‌

This research program aims to develop a unified software platform for embodied voice assistants, fueled by our research outcomes. The platform will not only aid our research but also support other researchers in the field of human-computer interaction. It will also help in creating practical applications for human interactions, with a primary focus on language learning and health assistance.

4​​ Application domains

The approaches​​​‌ and models developed in‌ Multispeech will have several‌​‌ applications to help humans​​ interact with one another​​​‌ or with machines. Each‌ application will typically rely‌​‌ on an embodied voice​​ assistant developed via our​​​‌ generic software platform or‌ on individual components, as‌​‌ presented above. We will​​ put special effort into​​​‌ two application domains: language‌ learning and health assistance.‌​‌ We chose these domains​​ mainly because of their​​​‌ economic and social impact.‌ Moreover, many outcomes of‌​‌ our research will be​​ naturally applicable in these​​​‌ two domains, which will‌ help us showcase their‌​‌ relevance.

4.1 Language Learning​​

Learning a second language, or acquiring the native language for people suffering from language disorders, is a challenge for the learner and represents a significant cognitive load. Many scientific activities have therefore been devoted to these issues, both from the point of view of production and perception. We aim to show the learner (native or second language) how to articulate the sounds of the target language by illustrating articulation with a talking head augmented by the vocal tract, which allows animating the articulators of speech. Moreover, based on the analysis of the learner’s production, an automatic diagnosis can be envisaged. However, reliable diagnosis depends on the accuracy of speech recognition and prosodic analysis techniques, and remains an open question.

4.2 Health Assistance

Speech technology can facilitate healthcare access for all patients and it provides an unprecedented opportunity to transform the healthcare industry. This includes patients with speech disorders and hearing impairments. For instance, it is possible to use automatic techniques to diagnose disfluencies from an acoustic or an audiovisual signal, as in the case of stuttering. Speech enhancement and separation can enhance speech intelligibility for hearing aid wearers in complex acoustic environments, while articulatory feedback tools can be beneficial for articulatory rehabilitation of cochlear implant wearers. More generally, voice assistants are a valuable tool for senior or disabled people, especially for those who are unable to use other interfaces due to lack of hand dexterity, mobility, and/or good vision. Speech technologies can also facilitate communication between hospital staff and patients, and help emergency call operators triage the callers by quantifying their stress level and getting the maximum amount of information automatically thanks to a robust speech recognition system adapted to these extreme conditions.

5 New​​​‌ results

5.1 Axis 1​ — Data-efficient and privacy-preserving​‌ learning

Participants: Vincent Colotte, Pascale Erhart, Irina Illina, Paul Magron, Slim Ouni, Mostafa Sadeghi, Romain Serizel, Emmanuel Vincent, Jean Eudes Ayilo, Zahra Hafida Benslimane, Sam Bigeard, Constance Douwes, Orane Dufour, Mohamed Imed Eddine Ghebriout, Isobelle Miles, Robin San Roman, Natalia Tomashenko, Malek Yaich.

5.1.1 Axis 1.1​‌ — Integrating domain knowledge​​

Hybrid signal processing and​​​‌ deep learning.

We propose to combine traditional signal processing-based filtering with deep learning for automatic speech recognition (ASR). More specifically, the beamforming filter processes specific angular sectors based on their spherical polar coordinates before applying an end-to-end multichannel, multi-speaker ASR system. This method is data-independent and training-free. We demonstrate that using a group of beamformed signals improves performance compared to using the same number of raw microphone signals. Moreover, increasing the number of signals used for beamforming further enhances recognition accuracy, leading to a more efficient use of multichannel signals while reducing the overall input load for the ASR system. This strategy is promising for exploiting the best of both worlds, thus reducing the need for supervision while maintaining high performance [15].
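As an illustration of steering a beamformer toward a direction given in spherical polar coordinates, the following minimal sketch computes far-field steering delays and applies a plain delay-and-sum beamformer. The report does not specify the beamformer design used in [15]; delay-and-sum, the angle convention, the speed of sound, and all function names below are assumptions made for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def unit_direction(azimuth, polar):
    """Unit vector for a direction in spherical polar coordinates (radians).

    Assumed convention: polar angle measured from the +z axis,
    azimuth measured in the x-y plane from the +x axis.
    """
    return (
        math.sin(polar) * math.cos(azimuth),
        math.sin(polar) * math.sin(azimuth),
        math.cos(polar),
    )


def steering_delays(mic_positions, azimuth, polar, c=SPEED_OF_SOUND):
    """Far-field arrival delay (seconds) at each microphone, relative to the origin.

    A plane wave coming from direction u reaches a microphone at position p
    earlier than the origin by (p . u) / c, hence the minus sign.
    """
    u = unit_direction(azimuth, polar)
    return [-(p[0] * u[0] + p[1] * u[1] + p[2] * u[2]) / c for p in mic_positions]


def delay_and_sum(signals, delays, sample_rate):
    """Align each channel by its steering delay (rounded to samples) and average."""
    shifts = [round(d * sample_rate) for d in delays]
    base = min(shifts)
    shifts = [s - base for s in shifts]  # make every shift non-negative
    length = min(len(x) - s for x, s in zip(signals, shifts))
    return [
        sum(x[n + s] for x, s in zip(signals, shifts)) / len(signals)
        for n in range(length)
    ]
```

Steering the same array toward several directions, one per angular sector, would yield a group of beamformed signals that could be fed to a multichannel ASR system in place of the raw microphone signals.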

Generative-based​​ speech enhancement.

A widely used approach for speech enhancement is to directly train a deep neural network (DNN) to estimate clean speech from a noisy input. Despite strong performance, this paradigm faces two main challenges. First, it often requires very large models trained on huge datasets that span many noise types, noise levels, and recording conditions. Second, its generalization is typically limited, with performance degrading in unseen environments. To tackle these challenges, we organized the UDASE task of the 7th CHiME challenge, which leverages real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models [6]. We also provided comprehensive objective and subjective evaluations of the submitted systems to better understand the limitations of different approaches. In the same research direction, we proposed a novel diffusion-based unsupervised speech enhancement framework [8] that leverages diffusion models, which currently achieve state-of-the-art performance in generative modeling. This framework combines a pre-trained diffusion-based speech generative model with a parametric noise model, and performs speech enhancement through an iterative expectation-maximization procedure. The results confirm that this unsupervised approach is more robust than its supervised counterpart under mismatched conditions.

5.1.2 Axis 1.2​​ - Learning from little/no​​​‌ labeled data

ASR for​ regional languages.

Using Mozilla Common Voice, we pursued the collection of speech data covering some of the regional, overseas, and non-territorial languages of France from data archives, media, associations, and individuals. This data will support the training of automatic speech recognition (ASR) and text-to-speech (TTS) models for these languages. We presented a comprehensive overview of the tools already developed, notably ASR [37]. The work focused on developing a recognition system based on Whisper. Fine-tuning and error correction techniques using adapted language models were applied for the recognition of Basque, Occitan, Alsatian and Shimaoré.

Accented ASR.​​​‌

Two approaches addressing accented ASR are presented. The first introduces Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent [14]. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate 4% and 14% relative Word Error Rate (WER) improvement compared to regular LoRA and no fine-tuning when the accent is unknown. The second approach, BEARD, focuses on adapting Whisper's encoder with unlabeled data [13]. It combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder. On the ATCO2 corpus of Air Traffic Control (ATC) communications, using 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, BEARD achieves a relative improvement of 12% compared to the fine-tuned model.
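For readers unfamiliar with the metric, the sketch below shows how a WER and a relative WER improvement are commonly computed. It is a generic illustration, not the scoring pipeline used in [13] or [14]; the function names are hypothetical and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, system_wer):
    """Relative WER reduction of a system with respect to a baseline."""
    return (baseline_wer - system_wer) / baseline_wer
```

With purely hypothetical numbers, a baseline at 20% WER and a system at 15% WER give a 25% relative improvement, although the absolute gain is only 5 points.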

TTS‌ for regional languages.

We have been working on speech synthesis for the Alsatian language. Using a system based on ToucanTTS, we first explored the possibility of integrating a specific phonetizer. The difficulty in achieving high-quality synthesis is mainly due to the lack of data for this type of language and a lack of normalization in the textual transcription. Speech synthesis system architectures are generally designed for a critical volume of training data, so we used several multilingual synthesis systems to evaluate their performance on this type of regional language. At the same time, a campaign to acquire oral data in collaboration with Mozilla Common Voice is underway and should increase the volume of data, especially with dialect variations. Alsatian contains several dialects that bring a great variability in terms of both vocabulary and pronunciation. Current TTS and ASR systems remain poorly adapted to this type of regional language.

Joint punctuated +​​ normalized ASR (limited punctuated​​​‌ data).

We proposed two end-to-end strategies for joint punctuated and normalized ASR when punctuated supervision is scarce [16]: (i) using a language model to generate punctuated targets from normalized transcripts, improving out-of-domain performance (up to 17% relative PC-WER reduction), and (ii) a single conditional decoder that outputs either punctuated or normalized transcripts on demand. The conditional-decoder model achieved a 42% relative PC-WER reduction compared to Whisper-base and remained effective with as little as 5% punctuated training data.
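To make the relation between the two transcript forms concrete, here is a minimal sketch that derives a normalized transcript from a punctuated one and selects the training target according to a requested mode. The normalization rules (lowercasing, removing punctuation, keeping word-internal apostrophes) and the names `normalize_transcript` and `make_target` are assumptions for illustration; the rules actually used in [16] are not described here.

```python
import re
import unicodedata


def normalize_transcript(punctuated):
    """Map a punctuated, cased transcript to a lowercase, punctuation-free one."""
    text = punctuated.lower()
    kept = []
    for i, ch in enumerate(text):
        # Unicode categories starting with "P" cover punctuation marks
        is_punct = unicodedata.category(ch).startswith("P")
        # Keep apostrophes that sit between two alphanumeric characters ("don't")
        inner_apostrophe = (
            ch == "'"
            and 0 < i < len(text) - 1
            and text[i - 1].isalnum()
            and text[i + 1].isalnum()
        )
        kept.append(ch if (not is_punct or inner_apostrophe) else " ")
    # Collapse the whitespace left behind by removed punctuation
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def make_target(punctuated, mode):
    """Return the target text for a hypothetical output-mode switch."""
    if mode == "punctuated":
        return punctuated
    if mode == "normalized":
        return normalize_transcript(punctuated)
    raise ValueError(f"unknown mode: {mode!r}")
```

The reverse direction, restoring punctuation and casing from normalized text, is not a deterministic mapping, which is why strategy (i) relies on a language model to generate punctuated targets.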

5.1.3 Axis 1.3​​​‌ - Preserving privacy

Speaker‌ anonymization.

Speech signals convey a lot of private information. To protect speakers, we pursued our investigation of x-vector based voice anonymization, which relies on splitting the speech signal into the speaker (x-vector), phonetic and pitch features and resynthesizing the signal with a different target x-vector. In particular, we measured and reduced the amount of speaker information carried by phoneme durations [32, 33]. Our method, which extracts an embedding of phoneme durations with an ECAPA-TDNN model, achieves a low equal error rate (EER) of 2% for 8 test signals.
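The EER is the operating point of a speaker verification system at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping a threshold over the observed scores; it is a generic illustration with hypothetical names and scores, not the evaluation code used in [32, 33].

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is greater than or equal to the threshold.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where both error rates are closest to each other
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

When the scores come from an attacker, a low EER means that speakers are easily re-identified, whereas an EER approaching 50% means that the attacker does no better than chance.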

We looked at the privacy of video game users [9]. Speech in video games is a spontaneous human conversation, which is associated with the player's pseudonym and could be recorded by any player using screen capture software to build and augment identifying records. We also looked at privacy in multi-speaker recordings [43] using a target speaker extraction module to extract the speaker to be anonymized before speaker anonymization and speech recombination. We achieved an EER of 36% and a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 18% on mixtures of two speakers (SparseLibri2Mix).

Privacy metrics and​ evaluation.

Beyond anonymization systems themselves, we investigated how to measure and challenge privacy guarantees. We looked at the definition of new privacy metrics inspired by the Article 29 Working Party’s Opinion 05/2014 on Anonymization Techniques, which characterize Singling Out, Linkability, and Inference [34]. Experiments across various attack scenarios reveal that, while the EER remains stable, Singling Out and Linkability vary much more. Finally, we looked at the emotion displayed by anonymized speech data. Two alternative strategies were examined to ensure that the original emotion is kept: first, integrating emotion embeddings from a pre-trained emotion encoder, and second, processing the speaker by a speaker anonymizer and by an emotion indicator to select the emotion-matched SVM accurately [7].

We published the rules of the 1st VoicePrivacy Attacker Challenge [31], which focuses on developing speaker re-identification attacks against three baseline anonymization systems and four anonymization systems developed by the VoicePrivacy 2024 Challenge participants. The best attacker systems reduced the EER by 25–44% relative w.r.t. the semi-informed attack used in the VoicePrivacy 2024 Challenge.

Sensitive content​ replacement.

Complementarily, we explored privacy protection at the content level rather than the speaker level. As part of the ANR SpeechPrivacy project, we studied the replacement of sensitive speech content. The process involves detecting sensitive personal data in the transcript, such as names, addresses or references to age, and then carrying out the substitution in the acoustic signal itself. The work focused on re-synthesising the original sentence using a codec-based approach. Initial results show that replacing tokens in the same prosodic context allows for good integration of the new elements.

5.1.4 Axis 1.4​​​‌ — Reducing computational footprint​

We studied the relation between the performance of audio generation models and their energy consumption. In particular, as most recent models are based on diffusion, we focused on the relation between the number of diffusion iterations and the quality of the generated signal [29].

During her PhD thesis work, Zahra Benslimane studied in detail the relationship between algorithm latency, computational footprint and speech enhancement performance. She then proposed architecture simplifications to achieve low-latency (2 ms) and low-complexity processing (the number of operations is divided by a factor of 100 compared to the original algorithm) while preserving the speech enhancement performance.

5.2 Axis​​ 2 — Extracting information​​​‌ from speech signals

Participants: Irina Illina, Paul Magron, Mostafa Sadeghi, Romain Serizel, Emmanuel Vincent, Romuald Ait Bachir, Raphaël Bagat, Doria Bonzi, Aine Drelingyte, Taous Iatariene, Mayank Mishra, Nasser-Eddine Monir.

5.2.1 Axis 2.1​​ — Linguistic speech content​​​‌

Joint beamforming + speaker-attributed‌ ASR for meetings.

We introduced a multichannel beamforming front-end for distant-microphone speaker-attributed ASR, including a real-data alignment/augmentation method to pretrain a neural beamformer [17]. On AMI, channel-fusion baselines did not help, but beamforming did: fine-tuning SA-ASR on fixed beamformer output reduced WER by 8% relative, and joint fine-tuning with a neural beamformer achieved a 9% relative WER reduction.

LLM​​​‌ compression.

Current LLM compression typically requires two steps: calibration-based compression followed by costly continued pretraining on billions of tokens. We eliminate this second step with a one-shot compression method that locally distills low-rank weights 30, leveraging the observation that activations are low-rank. SVD initialization, a joint teacher-student activation loss, and local gradient updates ensure fast convergence and low memory usage. Our method compresses Mixtral-8x7B in minutes on a single A100 GPU, removing 10B parameters while retaining over 95% of its performance, and reduces Phi-2 3B by 40% using only 13M calibration tokens, yielding a model competitive with similarly-sized alternatives. The approach generalizes beyond transformer architectures.
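To make the parameter savings of low-rank weights concrete: a dense matrix of shape (m, n) holds m*n parameters, whereas a rank-r factorization into factors of shape (m, r) and (r, n) holds r*(m + n). The sketch below only illustrates this accounting; the layer shape and the 60% budget are hypothetical and are not taken from 30, which may allocate ranks differently across layers.

```python
def dense_params(m: int, n: int) -> int:
    """Number of parameters of a dense (m, n) weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, rank: int) -> int:
    """Number of parameters of a rank-`rank` factorization (m, rank) x (rank, n)."""
    return rank * (m + n)


def max_rank_for_budget(m: int, n: int, keep_fraction: float) -> int:
    """Largest rank whose factorization keeps at most `keep_fraction` of the dense parameters."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return int(keep_fraction * m * n // (m + n))


# Hypothetical square layer, compressed so that 60% of its parameters remain.
m, n = 4096, 4096
rank = max_rank_for_budget(m, n, 0.6)
print(rank, low_rank_params(m, n, rank), dense_params(m, n))
```

Note that a factorization only saves parameters when the rank is below m*n/(m + n), i.e. below 2048 for this hypothetical layer.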

Pruning for​​ Low-resource Speech Recognition.

Pruning​​​‌ large pre-trained transformers for‌ low-resource languages is challenging‌​‌ due to limited retraining​​ data. Can Whisper be​​​‌ made lighter and faster‌ for edge devices in‌​‌ data-scarce settings? Focusing on​​ Bambara (32h of speech-to-text​​​‌ data), we propose a‌ pruning recipe combining low-rank‌​‌ embedding decomposition with feature​​ distillation and layer merging​​​‌ — bypassing vocabulary pruning,‌ unsuitable given frequent code-switching.‌​‌ The resulting model is​​ 48% smaller and 2.15x​​​‌ faster on a MacBook‌ Air M1, while preserving‌​‌ 90% of the original​​ performance.

Speech Language Modeling​​​‌ for Wolof.

We present‌ our work on training‌​‌ a speech language model​​ for Wolof, an underrepresented​​​‌ language spoken in West‌ Africa, and share key‌​‌ insights. We first emphasize​​ the importance of collecting​​​‌ large-scale, spontaneous, high-quality unsupervised‌ speech data, and show‌​‌ that continued pretraining HuBERT​​ on this dataset outperforms​​​‌ both the base model‌ and African-centric models on‌​‌ ASR. We then integrate​​ this speech encoder into​​​‌ a Wolof LLM to‌ train the first Speech‌​‌ LLM for this language,​​ extending its capabilities to​​​‌ tasks such as speech‌ translation. Furthermore, we explore‌​‌ training the Speech LLM​​ to perform multi-step Chain-of-Thought​​​‌ before transcribing or translating.‌ Our results show that‌​‌ the Speech LLM not​​ only improves speech recognition​​​‌ but also performs well‌ in speech translation. The‌​‌ models and the code​​ will be openly shared.​​​‌

5.2.2 Axis 2.2 —‌ Speaker identity and states‌​‌

Speaker localisation and tracking.​​​‌

We investigated the problem of localizing and tracking speaker positions while keeping tracks consistent even after speech pauses. We proposed a formal description of the task together with a dataset, a set of metrics adapted from object tracking in computer vision and a first baseline 24. We then proposed refined approaches that rely on speaker identity to ensure track consistency 22 and that base decisions on a short temporal context to move towards low-latency processing 23.

5.2.3 Axis​​​‌ 2.3 — Speech in​ its environment

Ambient sound​‌ detection and separation.

Pursuing our involvement in the community on ambient sound analysis, we initiated a novel task called spatial semantic segmentation of sound scenes (S5), which consists in combining ambient sound detection and audio source separation. To foster this new topic, we co-organized a task as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 challenge 35. We also addressed the difficult topic of designing a metric for this joint task. Indeed, to evaluate S5 systems, one can consider two individual metrics, i.e., one for source separation and another for sound event classification, but this approach makes it challenging to compare S5 systems. Therefore, we proposed and analyzed joint metrics that can better reflect the actual contribution of classification and separation errors 42.

In order to assess our continued involvement in the topic, we also published an analysis of the evolution of the tasks proposed in the DCASE challenge over the past 10 editions 26.

Speech​​ enhancement.

Targeting speech enhancement for hearing aids, we continued investigating the performance of speech enhancement at a fine-grained phonetic level. The goal here is to link the results obtained with objective metrics to the outcome of listening tests conducted at our partner site (Institut de l'audition). To that end, we conducted an extensive evaluation of state-of-the-art speech enhancement algorithms at the phoneme level (rather than at the commonly-considered utterance level), and across genders. Results show that the tested algorithms better reduce interference with fewer artifacts on female speech, particularly in plosives, fricatives, and vowels. Additionally, they demonstrate greater performance for female speech in terms of perceptual and speech recognition metrics 27. We exploited these findings in a subsequent work, where we proposed perceptually-informed variants of common speech enhancement training losses. These are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong, in order to better account for variability across phonemes. Spectral analysis indicates better consonant reconstruction, which points to a better preservation of certain acoustic cues 28.

Speech intelligibility​​​‌ reduction.

We investigated masking noise generation to reduce speech intelligibility in open-plan offices. The goal is to attenuate the annoyance caused by concurrent speech from co-workers. While commercial systems rely on stationary noise at a constant level (resulting in overexposure to sound), we explored adjusting the noise level to specific parts of speech (different phoneme classes) and assessed the impact on speech intelligibility 18.

5.3 Axis 3 —‌ Multimodal Speech: generation and‌​‌ interaction

Participants: Théo Biasutto-Lervat, Domitille Caillat, Vincent Colotte, Yves Laprie, Slim Ouni, Mostafa Sadeghi, Emmanuel Vincent, Louis Abel, Hugo Bergerat, Jean Eudes Ayilo, Sofiane Azzouz, Tom Bourgeade, Guilhem Faure, Mickaella Grondin-Verdon, Colombe M'Boungou, Nhat Nam Nguyen, Alex Stasica.

5.3.1​​​‌ Axis 3.1 — Multimodality‌ modeling and analysis

Audio-visual‌​‌ speech enhancement (AVSE).

We​​ addressed audio–visual fusion for​​​‌ unsupervised speech enhancement with‌ diffusion models. Our framework‌​‌ combines a visually conditioned​​ diffusion speech prior with​​​‌ an NMF noise model‌ 10. The diffusion‌​‌ prior is first pre-trained​​ on clean speech conditioned​​​‌ on video: visual features‌ are extracted from the‌​‌ input stream and fused​​ with audio features in​​​‌ the diffusion network via‌ cross-attention. At inference, the‌​‌ model performs iterative posterior​​ sampling within the reverse​​​‌ diffusion process, while the‌ NMF noise parameters are‌​‌ updated using intermediate speech​​ estimates. Experiments show improvements​​​‌ over the audio-only variant,‌ better generalization than a‌​‌ recent supervised-generative AVSE approach,​​ and a more favorable​​​‌ speed–quality trade-off than prior‌ diffusion-based inference.

Automatic isolated‌​‌ sign recognition in French​​ Sign Language (LSF).

We​​​‌ investigated an isolated sign‌ recognition system for sign‌​‌ language WordNet resources, aimed​​ at identifying and grouping​​​‌ phonologically similar signs across‌ different sign languages with‌​‌ minimal training data and​​ providing similarity suggestions for​​​‌ manual validation 25.‌ The approach relied on‌​‌ video-only analysis, combining key-frame​​ extraction, pose estimation with​​​‌ MediaPipe, and normalization strategies‌ to mitigate biases related‌​‌ to handedness, non-dominant arm​​ position, and signer morphology.​​​‌ Representations were analyzed using‌ Uniform Manifold Approximation and‌​‌ Projection (UMAP) and compared​​ with a Vision Transformer​​​‌ for pairwise similarity ranking.‌ Evaluation on a manually‌​‌ annotated subset of the​​ WordNet corpus achieved reasonable​​​‌ accuracy.

5.3.2 Axis 3.2‌ — Multimodal speech generation‌​‌

Acquisition of rt-MRI (real-time​​ Magnetic Resonance Imaging) data.​​​‌

This year, in collaboration with the IADI laboratory (P.-A. Vuissoz), we started the acquisition of a large Arabic corpus for one speaker. Since no corpus was available, a set of 2000 sentences was designed with the help of Gemini. This is of interest since the same approach could be used for other languages involved in the new ANR ArtAny project. In addition, we recorded a speaker producing several non-standard voice qualities (falsetto, very deep or cracked voice) in order to study the areas of variability of the vocal tract articulators and generalise the automatic articulator tracking algorithms.

Acoustic to‌ articulatory inversion.

Acoustic to articulatory inversion is a major processing challenge, with a wide range of applications from speech synthesis to feedback systems for language learning and rehabilitation. Last year, we conducted the first experiments on acoustic to articulatory inversion for the tongue, which is the most mobile and deformable speech articulator 11. This was the first time that inversion covered the entire contour of the tongue (from its root to its tip), since inversion generally only covers a few points corresponding to sensors attached to the tongue. We extended the approach to all articulators (lips, tongue, velum, epiglottis, arytenoid cartilages, glottis) by sizing the output layers so as to clearly separate the articulators. The average error is 1.67 mm, close to the pixel size of the images (1.6 mm) 12, 36. To our knowledge, this is the first inversion experiment to recover the complete geometry of the vocal tract in the form of the contour of all articulators in the mid-sagittal plane.

Quaternion pose​​​‌ encoding and contrastive learning​ for robust sign language​‌ production.

We tackled a key challenge in neural sign language production: high intra-class variability caused by signer morphology and stylistic differences. Building on Progressive Transformers, we introduced two complementary improvements 19. First, we represented poses using bone rotations in quaternion space and optimized them with a geodesic loss, which better captures angular motion and improves joint articulation. Second, we added a semantically guided contrastive loss that structures decoder embeddings using sentence-level similarity (via gloss overlap or SBERT), encouraging the model to focus on meaning-relevant motion while reducing anatomical and stylistic bias. On Phoenix14T, a widely used corpus, the contrastive objective alone improves Probability of Correct Keypoint by 16% over the baseline, and combining it with quaternion encoding reduces Mean Bone Angle Error by 6%, highlighting the benefit of skeletal-structure modeling and semantic supervision in Transformer-based sign language production.
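For illustration, a geodesic loss on bone rotations can be sketched as the mean rotation angle between predicted and target unit quaternions. The code below uses one common definition of that angle, 2*arccos(|<q1, q2>|); this is an assumption made for the example, and the exact loss used in 19 may differ. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Quaternion = Tuple[float, float, float, float]  # (w, x, y, z)


def normalize(q: Sequence[float]) -> Quaternion:
    """Scale a quaternion to unit norm."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero quaternion")
    return tuple(c / norm for c in q)


def geodesic_angle(q1: Sequence[float], q2: Sequence[float]) -> float:
    """Rotation angle (radians, in [0, pi]) between two rotations given as quaternions."""
    a, b = normalize(q1), normalize(q2)
    # The absolute value makes q and -q equivalent, since both encode the same rotation.
    dot = abs(sum(x * y for x, y in zip(a, b)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return 2.0 * math.acos(dot)


def geodesic_loss(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean geodesic angle over all bones of a pose."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum(geodesic_angle(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy pose with two bones: one predicted exactly, one off by a 90-degree rotation about z.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
print(geodesic_loss([identity, identity], [identity, quarter_turn_z]))
```

Unlike a Euclidean distance on keypoint coordinates, this quantity depends only on the angular difference between rotations, not on bone lengths.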

5.3.3 Axis 3.3​ — Interaction

Formal description​‌ and annotation of co-speech​​ gestures.

A first line​​​‌ of work focused on​ the formal characterization of​‌ gestures, aiming to identify​​ necessary and sufficient descriptive​​​‌ features and to automate​ their extraction, leading to​‌ the definition of six​​ complementary modalities (manuality, trajectory,​​​‌ location, hand configuration, speed,​ and size) to support​‌ objective annotation and gesture-aware​​ neural systems 38.​​​‌ This formalization effort was​ extended by an in-depth​‌ investigation of the spatial​​ aspects of gestures, proposing​​​‌ a dual spatial encoding​ based on positioning and​‌ orientation within dedicated three-dimensional​​ reference spaces, and demonstrating​​​‌ how automatically derived spatial​ attributes can enrich corpora​‌ and support the analysis​​ of gesture-speech relationships 40​​​‌. In parallel, methodological​ contributions addressed the practical​‌ limitations of manual annotation​​ through the development of​​​‌ COSMOS, a semi-automatic tool​ based on motion capture​‌ data and encoder-decoder models,​​ designed to assist gesture​​​‌ segmentation with limited training​ data while significantly reducing​‌ annotation effort 44.​​

Recognition and evaluation of​​​‌ co-speech gestures.

Within the​ broader scientific framework of​‌ multimodal communication and speech-gesture​​ modeling, and in parallel​​​‌ with our work on​ the automatic generation of​‌ co-verbal gestures using graph-based​​ neural networks 41,​​​‌ several complementary studies addressed​ the formal description, annotation,​‌ recognition, and evaluation of​​ co-verbal manual gestures, combining​​​‌ approaches from linguistics, computer​ science, and movement sciences.​‌ Finally, an exploratory interdisciplinary​​ study examined the evaluation​​ of hand-gesture synthesis quality,​​​‌ showing that expert annotations‌ can reveal systematic differences‌​‌ between natural and synthetic​​ gestures in terms of​​​‌ communicative efficiency and movement‌ dynamics, and highlighting the‌​‌ need for combined computational​​ and linguistic criteria to​​​‌ assess and improve gesture‌ generation systems 39,‌​‌ 21.

Medical dialog​​ summarization.

In the context​​​‌ of our collaboration with‌ a medical doctor in‌​‌ Paris, we introduced QUARTZ,​​ a framework for task-oriented​​​‌ unsupervised dialogue summarization 20‌. For medical dialogs,‌​‌ task-specific medical accuracy is​​ important. QUARTZ starts by​​​‌ generating multiple summaries and‌ task-specific question-answer pairs using‌​‌ large language models (LLMs).​​ Summaries are evaluated by​​​‌ having the LLMs respond‌ to task-related questions before‌​‌ (i) selecting the best​​ candidate responses and (ii)​​​‌ identifying the most informative‌ summary. Finally, we finetune‌​‌ the best LLM on​​ the selected summaries. When​​​‌ validated on multiple datasets,‌ QUARTZ achieves competitive zero-shot‌​‌ performance, rivaling fully-supervised state-of-the-art​​ approaches.
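The selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the QUARTZ implementation from 20: `answer_fn` is a hypothetical stand-in for an LLM call, exact string matching is an assumed scoring rule, and `keyword_answer` is a toy function used only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple


def score_summary(
    summary: str,
    qa_pairs: Sequence[Tuple[str, str]],
    answer_fn: Callable[[str, str], str],
) -> int:
    """Count the task questions answered correctly when only the summary is available."""
    return sum(
        answer_fn(summary, question).strip().lower() == reference.strip().lower()
        for question, reference in qa_pairs
    )


def select_summary(
    summaries: List[str],
    qa_pairs: Sequence[Tuple[str, str]],
    answer_fn: Callable[[str, str], str],
) -> str:
    """Return the candidate summary that supports the most correct answers."""
    if not summaries:
        raise ValueError("at least one candidate summary is required")
    return max(summaries, key=lambda s: score_summary(s, qa_pairs, answer_fn))


def keyword_answer(summary: str, question: str) -> str:
    """Toy stand-in for an LLM: answers 'yes' if the queried keyword appears in the summary."""
    return "yes" if question.lower() in summary.lower() else "no"


# Hypothetical candidates and question-answer pairs.
candidates = [
    "Patient reports fever.",
    "Patient reports fever and a penicillin allergy.",
]
questions = [("fever", "yes"), ("allergy", "yes")]
print(select_summary(candidates, questions, keyword_answer))
```

In this toy setting the second candidate is selected because it supports both answers; in a real pipeline, the retained summaries would then serve as fine-tuning targets.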

6 Bilateral contracts​​​‌ and grants with industry‌

6.1 Bilateral grants with‌​‌ industry

6.1.1 Meta AI​​

  • Company: Meta AI (France)​​​‌
  • Duration: May 2022 –‌ Apr 2025
  • Participants: Robin‌​‌ San Roman, Romain Serizel​​
  • Abstract: This CIFRE grant​​​‌ funds the PhD of‌ Robin San Roman on‌​‌ self-supervised disentangled representation learning​​ of audio data for​​​‌ compression and generation.

6.1.2‌ Orange Labs

  • Company: Orange‌​‌ Labs (France)
  • Duration: March​​ 2023 – Feb 2026​​​‌
  • Participants: Taous Iatariene, Romain‌ Serizel
  • Abstract: This CIFRE‌​‌ grant funds the PhD​​ of Taous Iatariene on​​​‌ sound source tracking.

7‌ Partnerships and cooperations

7.1‌​‌ International initiatives

7.1.1 Inria​​ associate team not involved​​​‌ in an IIL or‌ an international program

TrustedSpeech‌​‌
  • Title:
    Trusted speech dataset​​ generation
  • Duration:
    Jan 2025​​​‌ – Dec 2027
  • Coordinator:‌
    Junichi Yamagishi (jyamagis@nii.ac.jp)
  • Partners:‌​‌
    • National Institute of Informatics Tokyo (Japan)
  • Inria contact:​​​‌
    Emmanuel Vincent
  • Summary:
    The‌ TrustedSpeech associate team will‌​‌ conduct joint research aiming​​ to improve the privacy,​​​‌ fairness and utility of‌ anonymized and synthetic speech‌​‌ data, so as to​​ offer a complete methodology​​​‌ to produce trusted speech‌ datasets.

7.1.2 Participation in‌​‌ other International Programs

ANR-JST​​ CONFLUENCE
  • Title:
    Semantic Segmentation​​​‌ of Complex Sound Scenes‌ on Edge Devices
  • Duration:‌​‌
    Dec 2024 - Nov​​ 2027
  • Coordinator:
    Sonaide
  • Partners:​​​‌
    Université de Lorraine, CEA-List‌ (FR), the company Sonaide‌​‌ (FR), Nippon Telegraph and​​ Telephone Corporation (NTT, JP)​​​‌ and Tokyo Metropolitan University‌ (JP)
  • Participants:
    Paul Magron,‌​‌ Mayank Mishra, Romain Serizel​​
  • Abstract:
    The CONFLUENCE project​​​‌ aims to develop artificial‌ intelligence (AI) technologies for‌​‌ sound semantic segmentation of​​ acoustic signals that can​​​‌ recognize sound events and‌ separate/isolate the signals of‌​‌ the sound sources forming​​ semantic entities.

7.2 International​​​‌ research visitors

7.2.1 Visits‌ to international teams

R. Bagat: Short stay (15 days) at the National Institute of Informatics Tokyo (Japan) in the framework of the TrustedSpeech associate team

7.3 European initiatives

7.3.1​​​‌ Horizon Europe

PSST

PSST‌ project on cordis.europa.eu

  • Title:‌​‌
    Privacy for Smart Speech​​ Technology
  • Duration:
    Feb 2025​​​‌ – Jan 2030
  • Partners:‌
    Inria, UNIVERSITE DE LORRAINE,‌​‌ ORANGE SA (FR), KI​​ ELEMENTS GMBH (DE), STICHTING​​​‌ RADBOUD UNIVERSITEIT (NL), VOICEINTERACTION‌ (PT), OMILIA (GR), AALTO‌​‌ KORKEAKOULUSAATIO SR (FI), TECHNISCHE​​​‌ UNIVERSITAT BERLIN (DE), NAVER​ FRANCE, Commission nationale de​‌ l'informatique et des libertés​​ (FR), EVALUATIONS AND LANGUAGE​​​‌ RESOURCES DISTRIBUTION AGENCY (FR),​ RUHR-UNIVERSITAET BOCHUM (DE), Loihde​‌ Advisory Oy, Finland (FI),​​ voice INTER connect GmbH​​​‌ (DE), VOCAPIA RESEARCH (FR),​ EURECOM (FR), INESC ID​‌ (PT), Voicemod Inc. (ES),​​ INSTITUTO SUPERIOR TECNICO (PT),​​​‌ SORBONNE UNIVERSITE (FR)
  • Inria​ contact:
    Emmanuel Vincent
  • Coordinator:​‌
    Tom Bäckström
  • Summary:
    The​​ PSST joint doctoral training​​​‌ network will train a​ new cohort of PhD​‌ students to develop voice​​ privacy technologies using cutting-edge​​​‌ deep learning methods.

7.3.2​ Digital Europe

LLMs4EU
  • Title:​‌
    Large Language Models for​​ the European Union
  • Duration:​​​‌
    Mar 2025 – Feb​ 2028
  • Partners:
    • Inria, France​‌
    • 65 other partners in​​ Europe
  • Inria contact:
    Emmanuel​​​‌ Vincent
  • Coordinator:
    Edouard Geoffrois​
  • Summary:
    The LLMs4EU project​‌ coordinated by the Alliance​​ for Language Technologies (ALT-EDIC)​​​‌ brings together Europe's leading​ players in the field​‌ of generative AI to​​ ensure that European companies​​​‌ and especially SMEs have​ access to the tools​‌ and resources to become​​ competitive regarding language technologies​​​‌ and especially Large Language​ Models (LLMs).

7.4 National​‌ initiatives

ANR ENACT
  • Title:​​
    IA Cluster — Centre​​​‌ Européen en Intelligence Artificielle​ par l'Innovation
  • Duration:
    Jan​‌ 2025 - Dec 2029​​
  • Coordinator:
    Emmanuel Vincent (until Jun 2025), Jean-Baptiste Mouret (from Jun to Dec 2025)
  • Partners:
    Université de​​ Lorraine, Université de Strasbourg,​​​‌ Inria, CNRS, CHRU de​ Nancy, Région Grand Est,​‌ Métropole Grand-Nancy, Métropole de​​ Strasbourg, Métropole de Metz.​​​‌
  • Participants:
    Emmanuel Vincent, Irina Illina
  • Abstract:
    ENACT​‌ is the AI Cluster​​ of Region Grand Est,​​​‌ with a budget of​ 30 MEUR. It aims​‌ to make Grand Est​​ a European leader in​​​‌ artificial intelligence (AI), with​ a structuring strategy of​‌ training, research and innovation​​ designed in a global​​​‌ way to benefit the​ entire territory of the​‌ Region and beyond. Emmanuel​​ Vincent holds a chair​​​‌ with Nancy's hospital on​ LLMs for emergency medicine,​‌ and Irina Illina has​​ a PhD student funded​​​‌ by the project.
ANR​ Full3DTalkingHead
  • Title:
    Synthèse articulatoire​‌ phonétique
  • Duration:
    Apr 2021​​ - Sept 2025
  • Coordinator:​​​‌
    Yves Laprie
  • Partners:
    Loria,​ Gipsa-Lab, LEGI, IADI, LPP.​‌
  • Participants:
    Yves Laprie, Slim Ouni, Vinicius Ribeiro
  • Abstract:
    The objective​ is to realize a​‌ complete three-dimensional digital talking​​ head including the vocal​​​‌ tract from the vocal​ folds to the lips​‌ and the face, and​​ integrating the digital simulation​​​‌ of the aero-acoustic phenomena.​
ANR ArtAny
  • Title:
    Articulateur​‌ universel
  • Duration:
    Nov 2025​​ - Oct 2030
  • Coordinator:​​​‌
    IADI (Nancy)
  • Partners:
    IADI (Nancy),​ LPP (Paris)
  • Participants:
    Yves Laprie, Slim Ouni, Emmanuel Vincent, Vincent Colotte
  • Abstract:
    The​ Articulator Anything project aims​‌ to reconstruct the three-dimensional​​ dynamic evolution of the​​​‌ vocal tract for any​ language and any speaker.​‌ It falls within the​​ field of articulatory synthesis,​​​‌ modeling and simulating the​ physical process of human​‌ speech production using advanced​​ artificial intelligence methods. Current​​​‌ approaches are limited as​ they rely on static​‌ representations of phonemes and​​ fail to capture the​​​‌ temporal dynamics essential for​ coarticulation and anticipation in​‌ natural speech.
ANR CODIM​​
  • Title:
    COmpositionality and DIscourse​​ Markers
  • Duration:
    Jan 2023​​​‌ - Dec 2027
  • Coordinator:‌
    ATILF (Nancy)
  • Partners:
    ATILF (Nancy), LORIA (Nancy), LLF
  • Participants:
    Vincent Colotte​​
  • Abstract:
    The CODIM project focuses on the two main linguistic resources for organizing monologues or conversations in human languages: Discourse Markers (therefore/donc, well/ben, bon, etc. in English/French) and prosody (in particular, intonation). It will evaluate their status with respect to two major views on communication: compositionality (the possibility of combining meaningful expressions into more complex meaningful expressions) and pattern or construction-based approaches (the idea that language users exploit partly ‘frozen’ strings of words). We will compare the semantic and prosodic properties of simple and complex French DMs (e.g. ah + bon) found in corpora for written and spoken French.
ANR​​​‌ LLM4all
  • Title:
    Large Language‌ Models for All
  • Duration:‌​‌
    Oct 2023 - Mar 2027
  • Coordinator:
    Christophe Cerisara​​​‌
  • Partners:
    LORIA-Synalp, LORIA-Multispeech, LIX,‌ Linagora
  • Participants:
    Irina Illina, Emmanuel Vincent
  • Abstract:​​
    Large Language Models (LLMs) of sufficient size exhibit outstanding emergent abilities, such as learning from their input context and decomposing a complex problem into a chain of simpler steps. The LLM4all project will thus focus on such large models, or on models at the same level of generic performance, and will propose methods to solve two related fundamental issues: how to update these LLMs automatically, and how to reduce their computing requirements in order to facilitate their deployment.
ANR Lorraine Artificial Intelligence – LOR-AI
  • Title:
    Lorraine Artificial Intelligence Cofinancement de thèses en IA
  • Duration:
    Sep 2020 - Dec 2025
  • Coordinator:​​​‌
    Yves Laprie
  • Partners:
    CNRS,‌ Inria, Regional University Hospital‌​‌ Centre (CHRU)
  • Participants:
    Doctoral​​ school of Université de​​​‌ Lorraine
  • Abstract:
    This Artificial Intelligence project, led by the Université de Lorraine (UL), provides 12 co-fundings for doctoral theses with a double objective: on the one hand, to strengthen UL areas of excellence in AI and in domains tightly connected to AI, particularly Health, and on the other hand, to open other research areas to AI with the aim of achieving scientific breakthroughs.
ANR REFINED‌
  • Title:
    Real-Time Artificial Intelligence‌​‌ for Hearing Aids
  • Duration:​​
    Mar 2022 - Mar​​​‌ 2026
  • Coordinator:
    CEA List‌ (Saclay)
  • Partners:
    CEA List‌​‌ (Saclay), Institut de l'audition​​ (Paris), LORIA (Nancy)
  • Participants:​​​‌
    Paul Magron, Nasser-Eddine Monir,‌ Romain Serizel
  • Abstract:
    The Refined project brings together audiologists, computer scientists and specialists in hardware implementation to design new speech enhancement algorithms that fit both the needs of patients suffering from hearing loss and the computational constraints of hearing aid devices.
ANR ReNAR
  • Title:​​​‌
    Reducing Noise with Augmented‌ Reality
  • Duration:
    Feb 2024‌​‌ - Jan 2028
  • Coordinator:​​
    CEA List (Saclay)
  • Partners:​​​‌
    Ircam (Paris), Laboratoire des‌ Sciences du Numérique de‌​‌ Nantes (Nantes), LORIA (Nancy)​​
  • Participants:
    Romain Serizel, Aine​​​‌ Drelingyte
  • Abstract:
    The aim of the ReNAR project is to design a solution that can attenuate the impact of noise in office working scenarios (in particular in open spaces). We will target two aspects: generating noise maskers that result in sound scenes that are pleasant to hear for workers, and generating signals that can obfuscate surrounding speech.
ANR SPEECHPRIVACY
  • Title:​​​‌
    Multiple-attribute disentanglement and semantic​ privacy
  • Duration:
    Feb 2024​‌ - Jan 2028
  • Coordinator:​​
    Vincent Colotte
  • Partners:
    LORIA​​​‌ (Nancy), EURECOM (Sophia Antipolis),​ LIA (Avignon)
  • Participants:
    Vincent Colotte, Emmanuel Vincent, Orane Dufour, Natalia Tomashenko.
  • Abstract:
    SpeechPrivacy will​ deliver a flexible solution​‌ to privacy preservation based​​ on isolated/disentangled representations and​​​‌ the selective obfuscation/modification of​ individual attributes beyond the​‌ usual voice identity/sex and​​ sensitive keywords.
ANR Syncogest​​​‌
  • Title:
    Gesture and Speech​ Synchronization
  • Duration:
    Apr 2025​‌ - Mar 2029
  • Coordinator:​​
    Slim Ouni
  • Partners:
    LORIA​​​‌ (Nancy), PRAXILING (Montpellier), EUROMOV​ (Montpellier)
  • Participants:
    Slim Ouni, Vincent Colotte, Louis Abel, Hugo Bergerat, Domitille Caillat
  • Abstract:
    SYNCOGEST aims​ to model spontaneous human​‌ gestures—facial expressions, postures, and​​ body movements—and their synchronization​​​‌ with speech in face-to-face​ communication. By combining insights​‌ from artificial intelligence, language​​ sciences, and movement sciences,​​​‌ the project will develop​ deep learning–based models for​‌ automatic gesture generation, enabling​​ more natural and effective​​​‌ embodied conversational agents.
PEPR​ Cybersécurité, projet iPOP
  • Title:​‌
    Protection des données personnelles​​
  • Duration:
    Oct 2022 –​​​‌ Sep 2028
  • Coordinator:
    Vincent​ Roca (Inria PRIVATICS)
  • Partners:​‌
    Inria Multispeech (Nancy), PRIVATICS​​ (Lyon), COMETE, PETRUS (Saclay),​​​‌ MAGNET, SPIRALS (Lille), IRISA​ (Rennes), LIFO (Bourges), DCS​‌ (Nantes), CESICE (Grenoble), EDHEC​​ (Lille), CNIL (Paris)
  • Participant:​​​‌
    Emmanuel Vincent
  • Summary:
    The​ objectives of iPOP are​‌ to study the threats​​ on privacy introduced by​​​‌ new digital technologies, and​ to design privacy-preserving solutions​‌ compatible with French and​​ European regulations. Within this​​​‌ scope, Multispeech focuses on​ speech data.
Défi Inria​‌ COLaF
  • Title:
    Corpus et​​ Outils pour les Langues​​​‌ de France
  • Duration:
    Aug​ 2023 – Jul 2027​‌
  • Coordinator:
    Slim Ouni and​​ Benoît Sagot (Inria ALMANACH)​​​‌
  • Partners:
    Inria Multispeech (Nancy),​ ALMANACH (Paris)
  • Participants:
    Slim Ouni, Sam Bigeard, Vincent Colotte, Emmanuel Vincent, Pascale Erhart
  • Summary:
    This project​‌ aims to increase the​​ inclusiveness of speech technologies​​​‌ by releasing open data,​ models and software for​‌ accented French and for​​ regional, overseas and non-territorial​​​‌ languages of France.
DGA​ DEEP MAUVES
  • Title:
    Deep automatic aircraft speech recognition for non-native speakers
  • Duration:
    Dec 2022 –​ Dec 2026
  • Coordinator:
    Irina​‌ Illina
  • Participants:
    Irina Illina, Raphaël Bagat, Emmanuel Vincent, Romuald Ait Bachir
  • Summary:
    This​‌ project proposes methods and​​ tools that increase the​​​‌ usability of ASR systems​ for non-native speakers in​‌ noisy conditions in the​​ aeronautical domain.
ANSES IPIAMA​​​‌
  • Title:
    Reducing Noise with​ Augmented Reality
  • Duration:
    Dec​‌ 2023 - Dec 2026​​
  • Coordinator:
    Jean-Pierre Arz, INRS​​​‌ (Nancy)
  • Partners:
    INRS (Nancy),​ Laboratoire Énergies et Mécanique​‌ Théorique et Appliquée (Nancy),​​ LORIA (Nancy)
  • Participants:
    Romain​​​‌ Serizel
  • Abstract:
    The IPIAMA​ project aims to propose​‌ binaural speech intelligibility measurements​​ (with both ears) for​​​‌ people equipped with hearing​ aids. The project will​‌ rely jointly on classic​​ listening tests (reliable but​​​‌ expensive) and models based​ on data collected in​‌ realistic conditions.

8 Dissemination​​

8.1 Promoting scientific activities​​​‌

8.1.1 Scientific events: organisation​

General chair, scientific chair​‌
  • Main organizer, UDICE-U15 workshop​​ on AI: Stronger together​​ – How to train​​​‌ and retain the next‌ generation of talent in‌​‌ Europe and develop efficient​​ and competitive French-German ecosystems?,​​​‌ Nancy, Mar 2025 (E.‌ Vincent)
Member of the‌​‌ organizing committees
  • Organizer, 1st​​ VoicePrivacy Attacker Challenge (N.​​​‌ Tomashenko, E. Vincent)
  • Challenge‌ co-chair, DCASE Challenge 2025‌​‌ (R. Serizel)

8.1.2 Scientific​​ events: selection

Member of​​​‌ the conference program committees‌
  • ICASSP 2026 – IEEE‌​‌ International Conference on Acoustics,​​ Speech, and Signal Processing​​​‌ (R. Serizel)
  • WASPAA 2025‌ – IEEE Workshop on‌​‌ Applications of Signal Processing​​ to Audio and Acoustics​​​‌ (R. Serizel)
Reviewer
  • ICASSP‌ 2026 - IEEE International‌​‌ Conference on Acoustics, Speech,​​ and Signal Processing (P.​​​‌ Magron, E. Vincent, M.‌ Sadeghi, R. Serizel)
  • ICASSP‌​‌ 2025 - IEEE International​​ Conference on Acoustics, Speech,​​​‌ and Signal Processing (I.‌ Illina)
  • INTERSPEECH 2025 (P.‌​‌ Magron, I. Illina, Y.​​ Laprie, V. Colotte)
  • EUSIPCO​​​‌ 2025 - European Signal‌ Processing Conference (V. Colotte)‌​‌
  • WASPAA 2025 - IEEE​​ Workshop on Applications of​​​‌ Signal Processing to Audio‌ and Acoustics (P. Magron)‌​‌
  • ASRU 2025 - IEEE Automatic Speech Recognition and Understanding Workshop (I. Illina)
  • DCASE‌ 2025 - Workshop on‌​‌ Detection and Classification of​​ Acoustic Scenes and Events​​​‌ (R. Serizel)
  • Revue TAL: Traitement Automatique des Langues (I. Illina)
  • NAACL 2025, Demo Track (I. Illina)
  • ICMI 2025, Industrial track‌ (S. Ouni)

8.1.3 Journal‌​‌

Member of the editorial​​ boards
  • IEEE Transactions on​​​‌ Audio, Speech and Language‌ Processing (R. Serizel)
Reviewer‌​‌ - reviewing activities
  • IEEE​​ Signal Processing Letters (P.​​​‌ Magron, M. Sadeghi)
  • IEEE‌ Transactions on Audio, Speech‌​‌ and Language Processing (P.​​ Magron, E. Vincent, M.​​​‌ Sadeghi)
  • ACL 2025 -‌ Association for Computational Linguistics‌​‌ (I. Illina)

8.1.4 Invited​​ talks

  • Keynote "The rise,​​​‌ fall, and resurgence of‌ NMF for audio source‌​‌ separation", Workshop on Low-Rank​​ Models and Applications (Mons,​​​‌ Belgium), Sep 2025 (P.‌ Magron)
  • Seminar "Machine learning‌​‌ for music separation: Combining​​ data-driven models and expert​​​‌ knowledge", University of Strasbourg‌ (Strasbourg, France), May 2025‌​‌ (P. Magron)
  • Keynote "Modéliser​​ la communication parlée multimodale",​​​‌ Workshop RJCP (Paris), Nov‌ 2025 (S. Ouni)

8.1.5‌​‌ Leadership within the scientific​​ community

  • Member of the​​​‌ Steering Committee of ISCA's‌ Special Interest Group on‌​‌ Security and Privacy in​​ Speech Communication (E. Vincent)​​​‌
  • Board member of Le‌ VoiceLab, the association of‌​‌ French voice tech players​​ (E. Vincent)
  • Chair of​​​‌ the DCASE Steering Committee‌ (R. Serizel)
  • Board member‌​‌ of AFCP - Association​​ Francophone de la Communication​​​‌ Parlée (V. Colotte, S.‌ Ouni)
  • Secretary/Treasurer, executive member‌​‌ of AVISA (Auditory-VIsual Speech​​ Association), an ISCA Special​​​‌ Interest Group (S. Ouni)‌

8.1.6 Scientific expertise

  • Scientific‌​‌ Expert for CIFRE grant​​ allocation, Ministère de l'Enseignement​​​‌ supérieur, de la Recherche‌ et de l'Innovation (R.‌​‌ Serizel, S. Ouni)
  • Project​​ expert for Direction Générale​​​‌ Déléguée Recherche, Innovation, Valorisation‌ et Ecoles doctorales (I.‌​‌ Illina)

8.1.7 Research administration​​

  • Inria representative on the​​​‌ Lorraine Steering Committee for‌ Open Science (E. Vincent)‌​‌
  • Head of the pôle scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) of the Université de Lorraine (Y. Laprie)
  • Member of the​​ executive board of the​​​‌ Université de Lorraine (Y.‌ Laprie)
  • Local correspondent for Inria's Quadrant high-risk research programme (Y. Laprie)
  • Member​ of the steering committee​‌ for the digital strategy​​ of the Université de​​​‌ Lorraine (Y. Laprie)
  • Member of the bureau of the pôle scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) (I. Illina)
  • Member of the committee of the pôle scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) (I. Illina)
  • Member of the​ RIPEC jury, UL (I.​‌ Illina)
  • Member of the​​ promotion committee, UL (I.​​​‌ Illina)
  • Member of the​ admission committee for Master​‌ TAL, UL (I. Illina)​​
  • Member of the admission​​​‌ committee for ATER, UL,​ IUT Charlemagne (I. Illina)​‌
  • Member of the selection committee for MCF, UL (I. Illina)
  • Member of the​ IUT Charlemagne Council, UL​‌ IUT Charlemagne (I. Illina)​​
  • Member of the IUT​​​‌ Charlemagne Restricted Council, UL​ IUT Charlemagne (I. Illina)​‌
  • Member of the PhD​​ grant allocation committee, Avignon​​​‌ University (I. Illina)
  • Member of the laboratory council of LORIA (V. Colotte)
  • Member​​ of the selection committee​​​‌ for the position of​ assistant professor at Université​‌ de Paris-Saclay (S. Ouni)​​
  • Member of the selection​​​‌ committee for the position​ of professor at Université​‌ de Toulouse (S. Ouni)​​
  • Co-Chair of the selection​​​‌ committee for the position​ of professor at Université​‌ de Lorraine (S. Ouni)​​
  • Member of the repyramidage​​​‌ committee for the position​ of professor at Université​‌ de Toulouse (S. Ouni)​​
  • Member of the evaluation​​​‌ committee of Haut Conseil​ de l'évaluation de la​‌ recherche et de l'enseignement​​ supérieur (HCERES) for LJK​​​‌ (S. Ouni)
  • Co-head of​ the Computer Science track​‌ at the IAEM Doctoral​​ School (S. Ouni)
  • Chair​​​‌ of the ATER Recruitment​ Committee, Department of Computer​‌ Science, IUT Nancy-Charlemagne (S.​​ Ouni)
  • Member of the Comité Utilisateurs des Moyens de Calculs, Inria Research Center at Université de Lorraine (T. Biasutto-Lervat)
  • Referent Plateformes-Outils, Inria Research Center at Université de Lorraine (T. Biasutto-Lervat)

8.2 Teaching​​ - Supervision - Juries​​​‌ - Educational and pedagogical​ outreach

8.2.1 Teaching

  • Master:​‌ P. Magron
    • "Neural networks"​​ (54 hours), M2, UL​​​‌
    • "Professional insertion" (2 hours),​ M2, IRCAM / Sorbonne​‌ University
  • Master: M. Sadeghi​​
    • "Machine learning" (20 hours),​​​‌ M1, UL
    • "Statistics" (20​ hours), M1, UL
  • BUT:​‌ I. Illina
    • Java programming​​ (100 hours), L1, UL​​​‌
    • Linux programming (58 hours),​ L1, UL
    • Advanced Java​‌ programming (40 hours), L1,​​ UL
    • Supervision of student​​​‌ projects and internships (30​ hours), L2, UL
  • Master:​‌ I. Illina
    • Speech recognition​​ and text-to-speech (10 hours),​​​‌ M2, UL
  • BUT: R.​ Serizel
    • "Bases informatiques" (14​‌ hours), BUT1, UL
    • "Publication​​ web" (84 hours), BUT1,​​​‌ UL
    • "Métadonnées internes" (14​ hours), BUT1, UL
    • "Bases​‌ de données relationnelles" (8​​ hours), BUT1, UL
    • "Indexation​​​‌ de contenus multimédias" (16​ hours), BUT2, UL
    • "Systèmes​‌ d'information" (18 hours), BUT2,​​ UL
    • "Introduction à l'audio​​​‌ numérique" (14 hours), BUT2,​ UL
    • "Données ouvertes" (8​‌ hours), BUT3, UL
    • "Visualisation​​ de données" (8 hours),​​​‌ BUT3, UL
    • "Usages de​ l'IA" (14 hours), BUT3,​‌ UL
  • Master: R. Serizel​​
    • "Robustesse de la parole"​​​‌ (15 HETD), M2, UL​
    • "Impacts environnementaux de l'IA" (6 hours), M2, UL
  • Eng: R. Serizel
    • "Algorithmique"​​​‌ (18 hours), L3, UL​
    • "Bases de l'apprentissage automatique"​‌ (12 hours), M1, UL​​
    • "Impacts environnementaux de l'IA" (21 hours), M2, UL
  • BUT: S. Ouni
    • Programming‌ in Java (24 hours),‌​‌ BUT1, UL
    • Web Programming​​ (24 hours), BUT1, UL​​​‌
    • Graphical User Interface (96‌ hours), BUT1, UL
    • Advanced‌​‌ Algorithms (24 hours), BUT2,​​ UL
    • Algorithm analysis (24​​​‌ hours), BUT3, UL
    • Multimedia‌ (24 hours), BUT3, UL‌​‌
    • AI Agent (24 hours),​​ BUT3, UL
  • Master: Y.​​​‌ Laprie
    • "Speech corpora" (30‌ hours), M1, UL
  • Licence:‌​‌ Y. Laprie
    • Phonetics (16​​ hours), L2, École d'audioprothèse,​​​‌ UL
  • Licence: V. Colotte‌
    • Digital literacy and tools‌​‌ (hybrid courses, 50 hours),​​ L1, UL
    • System (80​​​‌ hours), L2-L3, UL
    • Introduction‌ to speech processing (20‌​‌ hours), L3, UL
  • Master:​​ V. Colotte
    • Integration project:​​​‌ multimodal interaction with Pepper‌ Robot (17 hours), M2,‌​‌ UL
    • Multimodal oral communication​​ (24 hours), M2, UL​​​‌
    • AI introduction (9 hours),‌ M2 - intellectual property‌​‌ rights, UL
    • Introduction to​​ speech processing (24 hours),​​​‌ M1, UL
  • Other: V.‌ Colotte
    • Co-responsible for NUMOC (digital literacy through hybrid courses) at UL (7,000 students)
  • Other: S. Ouni‌
    • Co-responsible for the RA-IL track in the BUT Computer Science program, UL

8.2.2 Supervision

  • PhD defended:‌ Louis Abel, "Co-speech gesture‌​‌ synthesis : Towards a​​ controllable and interpretable model​​​‌ using a graph deterministic‌ approach", Jan 2025, V.‌​‌ Colotte and S. Ouni​​ 41
  • PhD in progress:​​​‌ Nasser-Eddine Monir, "Multichannel speech‌ enhancement for patients with‌​‌ auditory neuropathy spectrum disorders",​​ Dec 2022, R. Serizel​​​‌ and P. Magron
  • PhD‌ in progress: Mickaëlla Grondin,‌​‌ "Modeling gestures and speech​​ in interactions", Nov 2021,​​​‌ S. Ouni and D.‌ Caillat (Praxiling).
  • PhD in‌​‌ progress: Jean-Eudes Ayilo, "Audio-visual​​ Speech Enhancement: Bridging the​​​‌ Gap between Supervised and‌ Unsupervised Approaches", Oct. 2023,‌​‌ M. Sadeghi and R.​​ Serizel
  • PhD in progress:​​​‌ Guilhem Fauré, "End-to-end Speech-to-Sign‌ Language Generation", Oct. 2024,‌​‌ S. Ouni and M.​​ Sadeghi
  • PhD in progress:​​​‌ Zahra-Hafida Benslimane, "Embedded speech‌ enhancement for hearing aids",‌​‌ Nov. 2023, Fabrice Auzanneau​​ (CEA-List) and R. Serizel​​​‌
  • PhD in progress: Raphaël Bagat, "Automatic speech recognition for non-native speakers in a noisy environment", Oct 2023, I. Illina and E. Vincent
  • PhD in progress: Mohamed Imed Eddine Ghebriout, "LLM adaptation and exploitation for medical emergency call triage", Apr 2024, G. Guibon (LIPN) and E. Vincent
  • PhD in​​​‌ progress: Orane Dufour, "Towards‌ a comprehensive speech anonymization‌​‌ framework", Oct 2024, E.​​ Vincent, M. Rouvier (LIA),​​​‌ and P. Magron
  • PhD in progress: Aine Drelingyte, "Speech intelligibility attenuation", Nov 2024, Mathieu Lagrange (LS2N) and R. Serizel
  • PhD in progress: Lilian Rodriguez, "Detection and anonymization of sensitive content in speech", Oct 2024, Yannick Estève (LIA) and V. Colotte
  • PhD in progress: Mayank Mishra, "Semantic segmentation of audio soundscapes on edge devices", Dec 2024, R. Serizel and P. Magron
  • PhD in progress: Isobelle Miles, "Regional and low-resource language speech synthesis", Feb 2025, E. Vincent, V. Colotte and P. Erhart (UNISTRA-LILPA)
  • PhD in progress: Elio Stasica, "Differential diagnosis of heart attack from speech", Sep 2025, V. Martin, R. Serizel and E. Vincent
  • PhD in progress: Doria Bonzi, "Social-behavior-aware chatbot for a communication skills coaching of medical students", Oct 2025, I. Illina, P. Gallet and F. Lefèvre
  • PhD in progress: Yaya Sy, "Efficient Continued Pre-training of Large Language Models", Nov 2023, C. Cerisara and I. Illina
  • PhD in progress: Sofiane Azzouz, "Acoustic to articulatory inversion based on rt-MRI data", Nov 2023, Y. Laprie
  • PhD in progress: Nhat-Nam Nguyen, "Multispeaker Acoustic to articulatory inversion based on rt-MRI data", Nov 2025, Y. Laprie

8.2.3 Juries

  • Participation in​​​‌ the PhD jury of​ Thibault Banerat-Roux (University of​‌ Nantes, January 2025), I.​​ Illina, reviewer
  • Participation in​​​‌ the PhD jury of​ Lucas Maison (University of​‌ Avignon, November 2025), I.​​ Illina, reviewer
  • Participation in​​​‌ the PhD jury of​ Nicolas André (University of​‌ Avignon, December 2025), I.​​ Illina, reviewer
  • Participation in​​​‌ the PhD jury of​ Nathan Griot (University of​‌ Avignon, December 2025), I.​​ Illina, reviewer
  • Participation in​​​‌ the PhD jury of​ Adrien Pupier (University of​‌ Grenoble, June 2025), I.​​ Illina, examiner
  • Participation in the PhD jury of David Genova (Sorbonne University, October 2025), I. Illina, examiner
  • Participation in​​​‌ the PhD jury of​ Paul Primus (Johannes Kepler​‌ University, February 2025), R.​​ Serizel, reviewer
  • Participation in​​​‌ the PhD jury of​ Sreenivasa Upadhyaya (KU Leuven,​‌ February 2025), R. Serizel,​​ examiner
  • Participation in the​​​‌ PhD jury of Benno​ Weck-Hufnagel (University of Grenoble,​‌ July 2025), R. Serizel,​​ reviewer
  • Participation in the​​​‌ PhD jury of Benno​ Weck-Hufnagel (Universitat Pompeu Fabra,​‌ October 2025), R. Serizel,​​ reviewer
  • Participation in the​​​‌ PhD jury of Modan​ Tailleur (Ecole Centrale de​‌ Nantes, November 2025), R.​​ Serizel, examiner
  • Participation in​​​‌ the PhD jury of​ Ricardo Falcom Perez (Aalto​‌ University, November 2025), R.​​ Serizel, reviewer
  • Participation in​​​‌ the PhD jury of​ Alexis Plaquet (Université Paul​‌ Sabatier, December 2025), R.​​ Serizel, reviewer
  • Participation in​​​‌ the HDR jury of​ Angélique Amelot (Université de​‌ Lorraine, December 2025), S.​​ Ouni, Chair
  • Participation in​​​‌ the HDR jury of​ Angélique Amelot (Université de​‌ Lorraine, December 2025), Y.​​ Laprie, supervisor
  • Participation in​​​‌ the PhD jury of​ Al Oualid Eliraki (University​‌ of Grenoble, June 2025),​​ Y. Laprie, reviewer
  • Participation​​​‌ in the PhD jury​ of Nezih Younsi (ISIR,​‌ April 2025), S. Ouni,​​ reviewer
  • Participation in the PhD jury of Yanis Ouakrim (University of Grenoble, May 2025), S. Ouni, examiner

8.3 Popularization

  • "M-PHASIS : un projet de recherche pour lutter contre les discours de haine sur internet", journal "Numérique et société", interview with I. Illina, March 2025
  • "Et si nos​​ voix pouvaient aider pour​​​‌ server nos langues", RFI,​ radio broadcast "De Vives​‌ Voix". P. Erhart, S.​​ Bigeard, Jan 2026
  • "Langues régionales : l'intelligence artificielle au secours de l'alsacien", France Bleu, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
  • "Traduction, voix de synthèse... Ces chercheurs veulent que l'IA parle breton ou alsacien", Ouest France, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
  • "L'alsacien à l'heure de l'IA : intégrer les langues régionales dans les modèles numériques", Sciences et Avenir, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
  • "Künstliche Intelligenz befördert Elsässisch in die digitale Welt", Badische Neueste Nachrichten (BNN), news article, P. Erhart, S. Bigeard, Oct 2025
  • "Alsacien 2.0 : quels usages pour les parlers dialectaux alsaciens ?", DNA, news article, P. Erhart, S. Bigeard, Jan 2026

8.3.1 Participation in Live​​​‌ events

  • Fête de la‌ science, "La puce à‌​‌ l'oreille" (R. Serizel)
  • Nuit de la science, "IA pour le son" (R. Serizel)
  • Procès du robot, 6 shows (R. Serizel, S. Bigeard)
  • Chiche! 1 scientifique, 1 classe, 1 visit (R. Serizel)
  • "Alsacien 2.0 : quels usages pour les parlers dialectaux alsaciens ?", public seminar, Strasbourg, Jan 2026 (P. Erhart, S. Bigeard)
  • Press conference for the launch of the "Parole Spontanée" voice collection, Strasbourg (P. Erhart, S. Bigeard, S. Ouni)

8.3.2 Other relevant science outreach activities

  • Journée Colaf, annual seminar of Défi Colaf, Paris, June 2025 (S. Bigeard, I. Miles, M. Yaich, G. Fauré, S. Ouni)

9 Scientific‌​‌ production

9.1 Major publications​​

9.2 Publications​​​‌ of the year

International‌ journals

International peer-reviewed​ conferences

National peer-reviewed Conferences

Conferences without proceedings‌

Doctoral​​​‌ dissertations and habilitation theses‌

Reports &​​ preprints

Other scientific​​​‌ publications