MULTISPEECH

MULTISPEECH - 2025

2025 Activity report, Project-Team MULTISPEECH

RNSR: 201421147E

Research center: Inria Centre at Université de Lorraine
In partnership with: CNRS, Université de Lorraine
Team name: Multimodal Speech in Interaction
In collaboration with: Laboratoire lorrain de recherche en informatique et ses applications (LORIA)

Creation of the‌ Project-Team: 2024 October 01


Keywords

Computer Science and Digital Science

A3.4.‌ Machine learning and statistics
A3.5. Social networks
A4.8.‌ Privacy-enhancing technologies
A5.1.5. Body-based interfaces
A5.1.7. Multimodal interfaces‌
A5.6.2. Augmented reality
A5.6.3. Avatar simulation and embodiment‌
A5.7. Audio modeling and processing
A5.7.1. Sound
A5.7.3.‌ Speech
A5.7.4. Analysis
A5.7.5. Synthesis
A5.8. Natural language‌ processing
A5.9. Signal processing
A5.9.1. Sampling, acquisition
A5.9.2.‌ Estimation, modeling
A5.9.3. Reconstruction, enhancement
A5.10.2. Perception
A5.10.5.‌ Robot interaction (with the environment, humans, other robots)‌
A6.2.4. Statistical methods
A6.3.1. Inverse problems
A6.3.4. Model‌ reduction
A6.3.5. Uncertainty Quantification
A9.2. Machine learning
A9.2.1.‌ Supervised learning
A9.2.2. Unsupervised learning
A9.2.3. Reinforcement learning‌
A9.2.4. Optimization and learning
A9.2.6. Neural networks
A9.2.8.‌ Deep learning
A9.3. Signal processing
A9.4. Natural language‌ processing
A9.5. Robotics and AI
A9.11. Generative AI‌

1 Team members, visitors, external‌ collaborators

Research Scientists

Yves Laprie [CNRS, Senior Researcher, HDR‌]
Paul Magron [‌INRIA, Researcher]‌‌
Mostafa Sadeghi [INRIA, ISFP]
Emmanuel‌ Vincent [INRIA,‌ Senior Researcher, HDR‌‌]

Faculty Members

Slim Ouni [Team leader‌, UL, Professor‌, HDR]
Domitille‌‌ Caillat [UNIV MONTPELLIER III, Associate Professor‌ Delegation, until Aug‌ 2025]
Vincent Colotte‌‌ [UL, Associate Professor]
Pascale Erhart‌ [UNIV STRASBOURG,‌ Associate Professor Delegation,‌‌ from Sep 2025]
Irina Illina [UL‌, Associate Professor,‌ HDR]
Romain Serizel‌‌ [UL, Professor, HDR]

Post-Doctoral‌ Fellows

Tom Bourgeade [‌UL, Post-Doctoral Fellow‌‌]
Constance Douwes [UL, Post-Doctoral Fellow‌, until Mar 2025‌]
François Effa [‌‌UL, ATER, from Oct 2025]
Natalia‌ Tomashenko [UL, ATER‌, from Oct 2025‌‌]

PhD Students

Louis Abel [UL,‌ until Feb 2025]‌
Jean Eudes Ayilo [‌‌INRIA]
Sofiane Azzouz [UL]
Raphael‌ Bagat [CNRS]‌
Zahra Hafida Benslimane [‌‌CEA]
Doria Bonzi [UL, from‌ Oct 2025]
Aine‌ Drelingyte [UL]‌‌
Orane Dufour [UL]
Guilhem Faure [‌INRIA]
Imed Eddine‌ Ghebriout [CNRS]‌‌
Mickaella Grondin-Verdon [UL, from Oct 2025‌]
Mickaella Grondin-Verdon [‌UL, ATER,‌‌ until Aug 2025]
Taous Iatariene [ORANGE‌, CIFRE]
Isobelle‌ Miles [INRIA,‌‌ from Feb 2025]
Mayank Mishra [UL‌]
Nasser-Eddine Monir [‌UL]
Nhat Nam‌‌ Nguyen [UL, from Nov 2025]‌
Robin San Roman [‌META, CIFRE]‌‌
Alex Stasica [INRIA, from Sep 2025‌]
Natalia Tomashenko [‌UL, ATER,‌‌ from Sep 2025]

Technical Staff

Louis Abel‌ [UL, Engineer‌, from Dec 2025‌‌]
Romuald Ait Bachir [CNRS, Engineer‌, from Oct 2025‌]
Hugo Bergerat [‌‌UL, Engineer, from Oct 2025]‌
Theo Biasutto-Lervat [INRIA‌, Engineer]
Sam‌‌ Bigeard [INRIA, Engineer]
Emma Granier‌ [UL, Engineer‌, from Nov 2025‌‌]
Colombe M'Boungou [INRIA, Engineer,‌ from Feb 2025]‌
Malek Yaich [INRIA‌‌, Engineer]

Interns and Apprentices

Elliot Abarca‌ [INRIA, Intern‌, from Apr 2025‌‌ until Jun 2025]
Abbas Awarkeh [CNRS‌, Intern, from‌ Mar 2025 until Sep‌‌ 2025]
Nacera Elarbi Tolehi [UL,‌ Intern, from Mar‌ 2025 until Sep 2025‌‌]
Tian Huang [UL, Intern,‌ from Mar 2025 until‌ Sep 2025]
Camille‌‌ Lavigne [INRIA, Intern, from Jun‌ 2025 until Aug 2025‌]
Kehina Manseri [‌‌INRIA, Intern, from Feb 2025 until‌ Jul 2025]
Celie‌ Ponroy [INRIA,‌‌ Intern, from Apr 2025 until Jun 2025‌]
Nina Rouffaud [‌UL, Intern,‌‌ from Jun 2025 until‌ Jun 2025]
Nicolas Russo [UL,‌ Intern, from Jul 2025 until Aug 2025‌]

Administrative Assistants

Emmanuelle Deschamps [INRIA]‌
Cecilia Olivier [INRIA]

2 Overall objectives‌

In Multispeech, we consider speech as a multimodal signal with different facets: acoustic, facial, articulatory, gestural, etc. Historically, speech was mainly considered under its acoustic facet, which is still the most important one. However, the acoustic signal is a consequence of the temporal evolution of the shape of the vocal tract (pharynx, tongue, jaws, lips, etc.); this is the articulatory facet of speech. Since the vocal tract configuration is partly reflected in facial movements, these constitute the primary visual facet of speech.

The face can provide additional information on the speaker's state through facial expressions. Speech can be accompanied by gestures (head nodding, arm and hand movements, etc.) that help to clarify the linguistic message. In some cases, such as in sign language, these gestures can bear the main linguistic content and be the only means of communication.

The general objective of Multispeech is‌ to study the analysis and synthesis of the‌ different facets of this multimodal signal and their‌ multimodal coordination in the context of human-human or‌ human-computer interaction. While this multimodal signal carries all‌ of the information used in spoken communication, the‌ collection, processing, and extraction of meaningful information by‌ a machine system remains a challenge. In particular,‌ to operate in real-world conditions, such a system‌ must be robust to noisy or missing facets.‌ We are especially interested in designing models and‌ learning techniques that rely on limited amounts of‌ labeled data and that preserve privacy.

Therefore, Multispeech‌ addresses data-efficient, privacy-preserving learning methods, and the robust‌ extraction of various streams of information from speech‌ signals. These two axes will allow us to‌ address multimodality, i.e., the analysis and the generation‌ of multimodal speech and its consideration in an‌ interactional context.

The outcomes will crystallize into a‌ unified software platform for the development of embodied‌ voice assistants. Our main objective is that the‌ results of our research feed this platform, and‌ that the platform itself facilitates our research and‌ that of other researchers in the general domain‌ of human-computer interaction, as well as the development‌ of concrete applications that help humans to interact‌ with one another or with machines. We will‌ focus on two main application areas: language learning‌ and health assistance.

3 Research program

3.1 Axis‌ 1 — Data-efficient and privacy-preserving learning

A central aspect of our research is to design machine learning models and methods for multimodal speech data, whether acoustic, visual or gestural. By contrast with big tech companies, we focus on scenarios where the amount of speech data is limited and/or access to the raw data is infeasible due to privacy requirements, and few or no human labels are available.

3.1.1 Axis 1.1 — Integrating‌ domain knowledge

State-of-the-art methods for speech and audio processing are based on‌ discriminative neural networks trained‌ for the targeted task.‌‌ This paradigm faces major limitations: lack of interpretability,‌ large data requirements and‌ inability to generalize to‌‌ unseen classes or tasks. Our approach is to‌ combine the representation power‌ of deep learning with‌‌ our acoustic expertise to obtain smaller generative models‌ describing the probability distribution‌ of speech and audio‌‌ signals. Particular attention will be paid to designing‌ physically-motivated input layers, output‌ layers, and unsupervised representations‌‌ that capture complex-valued, multi-scale spectro-temporal dependencies. Given these‌ models, we derive computationally‌ efficient inference algorithms that‌‌ address the above limitations. We also explore the‌ integration of deep learning‌ with symbolic reasoning and‌‌ common-sense knowledge to increase the generalization ability of‌ deep models.

3.1.2 Axis‌ 1.2 — Learning from‌‌ little/no labeled data

While supervised learning from fully labeled data is economically costly, unlabeled data are inexpensive but provide intrinsically less information. Our goal is to learn representations that disentangle the attributes of speech by equipping the unsupervised representation learning methods above with supervised branches exploiting the available labels and supervisory signals, and with multiple adversarial branches overcoming the usual limitations of adversarial training.

3.1.3‌‌ Axis 1.3 — Preserving privacy

To preserve privacy,‌ speech must be transformed‌ to hide the users'‌‌ identity and other privacy-sensitive attributes (e.g., accent, health‌ status) while leaving intact‌ those attributes which are‌‌ required for the task (e.g., phonetic content for‌ automatic speech recognition) and‌ preserving the data variability‌‌ for training purposes. We develop strong attacks to‌ evaluate privacy. We also‌ seek to hide personal‌‌ identifiers and privacy-sensitive attributes in the linguistic content,‌ focusing on their robust‌ extraction and replacement from‌‌ speech signals.

3.1.4 Axis 1.4 — Reducing computational‌ footprint

This axis includes‌ proposing reliable methods to‌‌ quantify fine-grained energy consumption, computational footprint (in terms‌ of operations), and memory‌ footprint, so as to‌‌ identify potential bottlenecks in the network at training‌ and test time before‌ applying compression methods.

3.2‌‌ Axis 2 — Extracting information from speech signals‌

In this axis, we‌ focus on extracting meaningful‌‌ information from speech signals in real conditions. This‌ information can be related‌ (1) to the linguistic‌‌ content, (2) to the speaker, and (3) to‌ the speech environment.

3.2.1‌ Axis 2.1 — Linguistic‌‌ speech content

Speech recognition is the main means to extract linguistic information from speech. Although it is a mature research area, drops in performance in real-world environments motivate the continued development of speech enhancement and source separation methods that improve robustness in such scenarios. Semantic content analysis is required to interpret the spoken message. The challenges include learning from little real data, quickly adapting to new topics, and robustness to speech recognition errors. The detection and classification of hate speech in social media videos will also be considered as a benchmark, thereby extending the work on text-only detection. Finally, we also consider extracting phonetic and prosodic information to study the categorization of speech sounds and certain aspects of prosody by learners of a foreign language.

3.2.2 Axis 2.2 — Speaker identity and‌ states

Speaker identity is required for the personalization‌ of human-computer interaction. Speaker recognition and diarization are‌ still challenging in real-world conditions. The speaker states‌ that we aim to recognize include emotion and‌ stress, which can be used to adapt the‌ interaction in real time.

3.2.3 Axis 2.3 —‌ Speech environment information

We develop audio event detection‌ methods that exploit both strongly/weakly labeled and unlabeled‌ data, operate in real-world conditions, can discover new‌ events, and provide a semantic interpretation. Modeling the‌ temporal, spatial and logical structure of ambient sound‌ scenes over a long duration is also considered.‌

3.3 Axis 3 — Multimodal Speech: generation and‌ interaction

In our project, we consider speech as‌ a multimodal object, where we study (1) multimodality‌ modeling and analysis, focusing on multimodal fusion and‌ coordination, (2) the generation of multimodal speech by‌ taking into account its different facets (acoustic, articulatory,‌ visual, gestural), separately or combined, and (3) interaction,‌ in the context of human-human or human-computer interaction.‌

3.3.1 Axis 3.1 - Multimodality modeling and analysis‌

The study of multimodality concerns the interaction between modalities, their fusion, coordination and synchronization for a single speaker, as well as their synchronization across the speakers in a conversation. We focus on audiovisual speech enhancement to improve the intelligibility and quality of noisy speech by considering the speaker's lip movements. We also consider semi-, weakly- and self-supervised learning methods for multimodal data to obtain interpretable representations that disentangle in each modality the attributes related to linguistic and semantic content, emotion, reaction, etc. We also study the contribution of each modality to the intelligibility of spoken communication.

3.3.2 Axis‌ 3.2 - Multimodal speech generation

Multimodal speech generation‌ refers to articulatory, acoustic, and audiovisual speech synthesis‌ techniques which output one or more facets. Articulatory‌ speech synthesis relies on 2D and 3D modeling‌ of the dynamics of the vocal tract from‌ real-time MRI (rtMRI) data. We consider the generation‌ of the full vocal tract, from the vocal‌ folds to the lips, first in 2D then‌ in 3D. This comprises the generation of the‌ face and the prediction of the glottis opening.‌ We also consider audiovisual speech synthesis. Both the‌ animation of the lower part of the face‌ related to speech and of the upper part‌ related to the facial expressions are considered, and‌ development continues towards a multilingual talking head. We‌ investigate further the modeling of expressivity for both‌ audio-only and audiovisual speech synthesis, for a better‌ control of expressivity, where we consider several disentangled‌ attributes at the same time.

3.3.3 Axis 3.3‌ — Interaction

Interaction is a new field of research for our project-team that we will approach gradually. We start by studying the multimodal components (prosody, facial expressions, gestures) used during interaction, both by the speaker and by the listener, where the goal is to simultaneously generate speech and gestures for the speaker and regulatory gestures for the listener. We will introduce different dialog bricks progressively: spoken language understanding, dialog management, and natural language generation. Dialog will be considered in a multimodal context (gestures, emotional states of the interlocutor, etc.) and we will break the classical dialog management scheme to dynamically account for the interlocutor's evolution during the speaker's response.

3.4 Software platform: Multimodal Voice assistant‌

This research program aims‌ to develop a unified‌‌ software platform for embodied voice assistants, fueled by‌ our research outcomes. The‌ platform will not only‌‌ aid our research but also facilitate other researchers‌ in the field of‌ human-computer interaction. It will‌‌ also help in creating practical applications for human‌ interactions, with a primary‌ focus on language learning‌‌ and health assistance.

4 Application domains

The approaches‌ and models developed in‌ Multispeech will have several‌‌ applications to help humans interact with one another‌ or with machines. Each‌ application will typically rely‌‌ on an embodied voice assistant developed via our‌ generic software platform or‌ on individual components, as‌‌ presented above. We will put special effort into‌ two application domains: language‌ learning and health assistance.‌‌ We chose these domains mainly because of their‌ economic and social impact.‌ Moreover, many outcomes of‌‌ our research will be naturally applicable in these‌ two domains, which will‌ help us showcase their‌‌ relevance.

4.1 Language Learning

Learning a second language,‌ or acquiring the native‌ language for people suffering‌‌ from language disorders, is a challenge for the‌ learner and represents a‌ significant cognitive load. Many‌‌ scientific activities have therefore been devoted to these‌ issues, both from the‌ point of view of‌‌ production and perception. We aim to show the‌ learner (native or second‌ language) how to articulate‌‌ the sounds of the target language by illustrating‌ articulation with a talking‌ head augmented by the‌‌ vocal tract which allows animating the articulators of‌ speech. Moreover, based on‌ the analysis of the‌‌ learner’s production, an automatic diagnosis can be envisaged.‌ However, reliable diagnosis remains‌ a challenge, which depends‌‌ on the accuracy of speech recognition and prosodic‌ analysis techniques. This is‌ still an open question.‌‌

4.2 Health Assistance

Speech technology can facilitate healthcare access for all patients, and it provides an unprecedented opportunity to transform the healthcare industry. This applies in particular to speech disorders and hearing impairments. For instance, it is possible to use automatic techniques to diagnose disfluencies from an acoustic or an audiovisual signal, as in the case of stuttering. Speech enhancement and separation can enhance speech intelligibility for hearing aid wearers in complex acoustic environments, while articulatory feedback tools can be beneficial for articulatory rehabilitation of cochlear implant wearers. More generally, voice assistants are a valuable tool for senior or disabled people, especially for those who are unable to use other interfaces due to lack of hand dexterity, mobility, and/or good vision. Speech technologies can also facilitate communication between hospital staff and patients, and help emergency call operators triage callers by quantifying their stress level and getting the maximum amount of information automatically thanks to a robust speech recognition system adapted to these extreme conditions.

5 New‌ results

5.1 Axis 1 — Data-efficient and privacy-preserving‌ learning

Participants: Vincent Colotte, Pascale Erhart,‌ Irina Illina, Paul Magron, Slim Ouni‌, Mostafa Sadeghi, Romain Serizel, Emmanuel‌ Vincent, Jean Eudes Ayilo, Zahra Hafida‌ Benslimane, Sam Bigeard, Constance Douwes,‌ Orane Dufour, Mohamed Imed Eddine Ghebriout,‌ Isobelle Miles, Robin San Roman, Natalia‌ Tomashenko, Malek Yaich.

5.1.1 Axis 1.1‌ — Integrating domain knowledge

Hybrid signal processing and‌ deep learning.

We propose to combine traditional signal‌ processing-based filtering with deep learning for automatic speech‌ recognition (ASR). More specifically, the beamforming filter processes‌ specific angular sectors based on their spherical polar‌ coordinates before applying an end-to-end multichannel, multi-speaker ASR‌ system. This method is data-independent and training-free. We‌ demonstrate that using a group of beamformed signals‌ improves performance compared to using the same number‌ of raw microphone signals. Moreover, increasing the number‌ of signals used for beamforming further enhances recognition‌ accuracy, leading to a more efficient use of‌ multichannel signals while reducing the overall input load‌ for the ASR system. This strategy is promising‌ for exploiting the best of both worlds, thus‌ reducing the need for supervision while maintaining high‌ performance 15.
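
As an illustration of a training-free, data-independent spatial filter steered towards angular sectors, the following minimal sketch implements a delay-and-sum beamformer. It assumes a far-field source and integer-sample delays, and it is not necessarily the beamformer used in the cited work: the array geometry, the function names (`steering_delays`, `delay_and_sum`, `beamform_sectors`) and the fixed zero elevation are all hypothetical choices made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value


def steering_delays(mic_positions, azimuth_deg, elevation_deg, sample_rate):
    """Integer delays (in samples) that align a far-field source on all microphones.

    mic_positions: list of (x, y, z) coordinates in metres (hypothetical array).
    The look direction is given in spherical polar coordinates.
    """
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    # Unit vector pointing from the array towards the look direction.
    direction = (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))
    projections = [sum(p * d for p, d in zip(pos, direction)) for pos in mic_positions]
    # Microphones closer to the source receive the wavefront earlier,
    # so they need the largest delay to line up with the others.
    reference = min(projections)
    return [round((proj - reference) * sample_rate / SPEED_OF_SOUND) for proj in projections]


def delay_and_sum(channels, delays):
    """Delay each channel by its integer delay, then average the channels."""
    length = len(channels[0])
    output = [0.0] * length
    for signal, delay in zip(channels, delays):
        for n in range(delay, length):
            output[n] += signal[n - delay]
    return [value / len(channels) for value in output]


def beamform_sectors(channels, mic_positions, azimuths_deg, sample_rate):
    """One beamformed signal per angular sector (elevation fixed to 0 in this sketch)."""
    return [
        delay_and_sum(channels, steering_delays(mic_positions, az, 0.0, sample_rate))
        for az in azimuths_deg
    ]
```

Each returned signal could then be fed to the multichannel ASR system in place of a raw microphone channel, which is the substitution the paragraph above describes.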

Generative-based speech enhancement.

A widely used approach for speech enhancement is to directly train a deep neural network (DNN) to estimate clean speech from a noisy input. Despite strong performance, this paradigm faces two main challenges. First, it often requires very large models trained on huge datasets that span many noise types, noise levels, and recording conditions. Second, its generalization is typically limited, with performance degrading in unseen environments. To tackle these challenges, we organized the UDASE task of the 7th CHiME challenge, which leverages real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models 6. We also provided comprehensive objective and subjective evaluations of the submitted systems to better understand the limitations of different approaches. In the same research direction, we proposed a novel diffusion-based unsupervised speech enhancement framework 8 that leverages diffusion models, which currently achieve state-of-the-art performance in generative modeling. This framework combines a pre-trained diffusion-based speech generative model with a parametric noise model, and performs speech enhancement through an iterative expectation-maximization procedure. The results confirm that this unsupervised approach is more robust than its supervised counterpart under mismatched conditions.

5.1.2 Axis 1.2 - Learning from little/no‌ labeled data

ASR for regional languages.

Using Mozilla Common Voice, we pursued the collection of speech data covering some of the regional, overseas, and non-territorial languages of France from data archives, media, associations, and individuals. This data will support the training of automatic speech recognition (ASR) and text-to-speech (TTS) models for these languages. We presented a comprehensive overview of the tools already developed, notably for ASR 37. The work focused on developing a recognition system based on Whisper. Fine-tuning and error correction techniques using adapted language models were applied for the recognition of Basque, Occitan, Alsatian and Shimaoré.

Accented ASR.‌

Two approaches addressing accented ASR are presented. The first introduces Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent 14. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate 4% and 14% relative Word Error Rate (WER) improvements compared to regular LoRA and no fine-tuning, respectively, when the accent is unknown. The second approach, BEARD, focuses on adapting Whisper's encoder with unlabeled data 13. It combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder. On the ATCO2 corpus of Air Traffic Control (ATC) communications, using 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, BEARD achieves a relative improvement of 12% compared to the fine-tuned model.
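
To make the idea of a mixture of LoRA experts concrete, the sketch below computes the output of one linear layer as a frozen weight plus a gated sum of low-rank updates, one per accent expert. This is a generic illustration under stated assumptions, not the MAS-LoRA implementation: how the gate weights are obtained when the accent is unknown is not described here, so they are simply passed in, and all names (`matvec`, `lora_update`, `mixture_of_loras`) are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_update(down, up, vector, scale=1.0):
    """Low-rank update scale * up @ (down @ vector), with rank = len(down)."""
    return [scale * value for value in matvec(up, matvec(down, vector))]


def mixture_of_loras(frozen_weight, experts, gate_weights, vector):
    """Frozen layer output plus a gated sum of per-accent low-rank updates.

    experts: list of (down, up) matrix pairs, one pair per accent expert.
    gate_weights: one mixing coefficient per expert (assumed given, e.g. uniform).
    """
    output = matvec(frozen_weight, vector)
    for (down, up), gate in zip(experts, gate_weights):
        update = lora_update(down, up, vector)
        output = [o + gate * u for o, u in zip(output, update)]
    return output
```

With uniform gate weights every expert contributes equally; setting one gate to 1 and the others to 0 recovers a single accent-specific LoRA.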

TTS‌ for regional languages.

We have been working on speech synthesis for the Alsatian language. Using a system based on ToucanTTS, we first explored the possibility of integrating a specific phonetizer. The difficulty in achieving high-quality synthesis is mainly due to the lack of data for this type of language and a lack of normalization in the textual transcription. Speech synthesis system architectures are generally designed for a critical volume of training data, so we used several multilingual synthesis systems to evaluate their performance on this type of regional language. At the same time, a campaign to acquire oral data in collaboration with Mozilla Common Voice is underway and should increase the volume of data, especially with dialect variations. Alsatian contains several dialects that bring great variability in terms of both vocabulary and pronunciation. Current TTS and ASR systems are still poorly adapted to this type of regional accent.

Joint punctuated + normalized ASR (limited punctuated‌ data).

We proposed two‌ end-to-end strategies for joint‌‌ punctuated and normalized ASR when punctuated supervision is‌ scarce 16: (i)‌ using a language model‌‌ to generate punctuated targets from normalized transcripts, improving‌ out-of-domain performance (up to‌ 17% relative PC-WER reduction),‌‌ and (ii) a single conditional decoder that outputs‌ either punctuated or normalized‌ transcripts on demand. The‌‌ conditional-decoder model achieved a 42% relative PC-WER reduction‌ vs Whisper-base and remained‌ effective with as little‌‌ as 5% punctuated training data.
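
Several results in this section are reported as relative error-rate reductions. As a reminder of what such figures mean, here is a minimal sketch of a word error rate computed by edit distance over whitespace-separated tokens, together with the relative reduction between two systems. The function names are hypothetical, and the exact scoring rules behind PC-WER are not reproduced here; the sketch only shows that scoring punctuated, cased tokens makes punctuation and casing errors count.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the processed reference prefix and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_error, improved_error):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline_error - improved_error) / baseline_error
```

For example, going from a 20% to a 10% error rate is a 50% relative reduction, even though the absolute gain is 10 points.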

5.1.3 Axis 1.3‌ - Preserving privacy

Speaker‌ anonymization.

Speech signals convey a lot of private information. To protect speakers, we pursued our investigation of x-vector based voice anonymization, which relies on splitting the speech signal into speaker (x-vector), phonetic and pitch features and resynthesizing the signal with a different target x-vector. In particular, we measured and reduced the amount of speaker information carried by phoneme durations 32, 33. Our method, which extracts an embedding of phoneme durations with an ECAPA-TDNN model, achieves a low equal error rate (EER) of 2% for 8 test signals.

We looked at the privacy of video game users 9. Speech in video games is a spontaneous human conversation, which is associated with the player's pseudonym and could be recorded by any player using screen capture software to build and augment identifying records. We also looked at privacy in multi-speaker recordings 43, using a target speaker extraction module to extract the speaker to be anonymized before speaker anonymization and speech recombination. We achieved an EER of 36% and a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 18% on mixtures of two speakers (SparseLibri2Mix).
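
The equal error rate used above can be sketched as follows: given verification scores for same-speaker (target) and different-speaker (non-target) trials, it is the operating point where the false acceptance and false rejection rates coincide. This is a simple threshold sweep under the assumption that higher scores mean "same speaker"; the function name is hypothetical and evaluation toolkits typically use an interpolated variant.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a decision threshold.

    A trial is accepted when its score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(false_accept - false_reject)
        if gap < best_gap:
            # Keep the threshold where both error rates are closest to each other.
            best_gap, eer = gap, (false_accept + false_reject) / 2
    return eer
```

In a privacy evaluation the scores come from an attacker, so a low EER means that the attacker still recognizes speakers well, whereas an EER approaching 50% means the attacker does no better than chance.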

Privacy metrics and evaluation.

Beyond anonymization systems‌ themselves, we investigated how to measure and challenge‌ privacy guarantees. We looked at the definition of‌ new privacy metrics inspired by the Article 29‌ Working Party’s Opinion 05/2014 on Anonymization Techniques, which‌ characterize Singling Out, Linkability, and Inference 34.‌ Experiments across various attack scenarios reveal that, while‌ the EER remains stable, Singling Out and Linkability‌ vary much more. Finally, we looked at the‌ emotion displayed by anonymized speech data. Two alternative‌ strategies were examined to ensure that the original‌ emotion is kept: first integrating emotion embeddings from‌ a pre-trained emotion encoder, and second processing the‌ speaker by a speaker anonymizer and by an‌ emotion indicator to select the emotion-matched SVM accurately‌ 7.

We published the rules of the‌ 1st VoicePrivacy Attacker Challenge 31, which focuses‌ on developing speaker re-identification attacks against three baseline‌ anonymization systems and four anonymization systems developed by‌ the Voice Privacy 2024 Challenge participants. The best‌ attacker systems reduced the EER by 25–44% relative‌ w.r.t. the semi-informed attack used in the VoicePrivacy‌ 2024 Challenge.

Sensitive content replacement.

Complementarily, we explored‌ privacy protection at the content level rather than‌ the speaker level. As part of the ANR‌ SpeechPrivacy project, we explored the replacement of sensitive‌ speech content. The process involves detecting sensitive personal‌ data in the transcript, such as names, addresses‌ or references to age. The substitution is carried‌ out in the acoustic signal itself. The work‌ focused on re-synthesising the original sentence using the‌ codec approach. Initial results show that replacing tokens‌ in the same prosodic context allows for good‌ integration of the new elements.

5.1.4 Axis 1.4‌ — Reducing computational footprint

We studied the relation between the performance of audio generation models and their energy consumption. In particular, as most recent models are based on diffusion, we focused on the relation between the number of diffusion iterations and the quality of the generated signal 29.

During her PhD thesis work, Zahra Benslimane studied in detail the relationship between algorithm latency, computational footprint and speech enhancement performance. She then proposed an architecture simplification to achieve low-latency (2 ms) and low-complexity processing (the number of operations is divided by a factor of 100 compared to the original algorithm) while preserving the speech enhancement performance.

5.2 Axis 2 — Extracting information‌ from speech signals

Participants:‌ Irina Illina, Paul‌‌ Magron, Mostafa Sadeghi, Romain Serizel,‌ Emmanuel Vincent, Romuald‌ Ait Bachir, Raphaël‌‌ Bagat, Doria Bonzi, Aine Drelingyte,‌ Taous Iatariene, Mayank‌ Mishra, Nasser-Eddine Monir‌‌.

5.2.1 Axis 2.1 — Linguistic speech content‌

Joint beamforming + speaker-attributed‌ ASR for meetings.

We‌‌ introduced a multichannel beamforming front-end for distant-microphone speaker-attributed‌ ASR, including a real-data‌ alignment/augmentation method to pretrain‌‌ a neural beamformer 17. On AMI, channel-fusion‌ baselines did not help,‌ but beamforming did: fine-tuning‌‌ SA-ASR on fixed beamformer output reduced WER by‌ 8% relative, and joint‌ fine-tuning with a neural‌‌ beamformer achieved a 9% relative WER reduction.

LLM‌ compression.

Current LLM compression typically requires two steps: calibration-based compression followed by costly continued pretraining on billions of tokens. We eliminate this second step with a one-shot compression method that locally distills low-rank weights 30, leveraging the observation that activations are low-rank. SVD initialization, a joint teacher-student activation loss, and local gradient updates ensure fast convergence and low memory usage. Our method compresses Mixtral-8x7B in minutes on a single A100 GPU, removing 10B parameters while retaining over 95% performance, and reduces Phi-2 3B by 40% using only 13M calibration tokens, yielding a model competitive with similarly-sized alternatives. The approach generalizes beyond transformer architectures.
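
The parameter savings offered by low-rank weights follow from simple counting: a dense matrix with `rows * cols` entries is replaced by two factors of sizes `rows x rank` and `rank x cols`. The sketch below only illustrates this arithmetic. The layer sizes in the usage note are hypothetical and are not those of Mixtral or Phi-2, and the distillation procedure itself is not reproduced.

```python
def dense_parameters(rows, cols):
    """Number of entries in a dense weight matrix."""
    return rows * cols


def low_rank_parameters(rows, cols, rank):
    """Number of entries in a rank-limited factorization (rows x rank) @ (rank x cols)."""
    return rank * (rows + cols)


def max_rank_for_budget(rows, cols, keep_fraction):
    """Largest rank whose factorization keeps at most keep_fraction of the dense parameters."""
    return int(keep_fraction * rows * cols / (rows + cols))
```

For a hypothetical 4096 x 4096 layer, a rank of 512 keeps a quarter of the parameters, and removing 40% of the parameters leaves room for a rank of at most 1228.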

Pruning for Low-resource Speech Recognition.

Pruning‌ large pre-trained transformers for‌ low-resource languages is challenging‌‌ due to limited retraining data. Can Whisper be‌ made lighter and faster‌ for edge devices in‌‌ data-scarce settings? Focusing on Bambara (32h of speech-to-text‌ data), we propose a‌ pruning recipe combining low-rank‌‌ embedding decomposition with feature distillation and layer merging‌ — bypassing vocabulary pruning,‌ unsuitable given frequent code-switching.‌‌ The resulting model is 48% smaller and 2.15x‌ faster on a MacBook‌ Air M1, while preserving‌‌ 90% of the original performance.

Speech Language Modeling‌ for Wolof.

We present‌ our work on training‌‌ a speech language model for Wolof, an underrepresented‌ language spoken in West‌ Africa, and share key‌‌ insights. We first emphasize the importance of collecting‌ large-scale, spontaneous, high-quality unsupervised‌ speech data, and show‌‌ that continued pretraining HuBERT on this dataset outperforms‌ both the base model‌ and African-centric models on‌‌ ASR. We then integrate this speech encoder into‌ a Wolof LLM to‌ train the first Speech‌‌ LLM for this language, extending its capabilities to‌ tasks such as speech‌ translation. Furthermore, we explore‌‌ training the Speech LLM to perform multi-step Chain-of-Thought‌ before transcribing or translating.‌ Our results show that‌‌ the Speech LLM not only improves speech recognition‌ but also performs well‌ in speech translation. The‌‌ models and the code will be openly shared.‌

5.2.2 Axis 2.2 —‌ Speaker identity and states‌‌

Speaker localisation and tracking.‌

We investigated the problem of localizing and tracking the positions of speakers while ensuring track consistency even after speech pauses. We proposed a formal description of the task together with a dataset, a set of metrics adapted from object tracking in computer vision and a first baseline 24. We then proposed refined approaches that rely on speaker identity to ensure track consistency 22 and that base decisions on a short temporal context to move towards low-latency processing 23.

5.2.3 Axis‌ 2.3 — Speech in its environment

Ambient sound‌ detection and separation.

Pursuing our involvement in the community on ambient sound analysis, we initiated a novel task called spatial semantic segmentation of sound scenes (S5), which combines ambient sound detection and audio source separation. To foster this new topic, we co-organized a task as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 challenge 35. We also addressed the difficult topic of designing a metric for this joint task. Indeed, to evaluate S5 systems, one can consider two individual metrics, i.e., one for source separation and another for sound event classification, but this approach makes it challenging to compare S5 systems. Therefore, we proposed and analyzed joint metrics that can better reflect the actual contribution of classification and separation errors 42.

In order to assess our continued involvement in the topic, we also published an analysis of the evolution of the tasks proposed in the DCASE challenge over the past 10 editions 26.

Speech enhancement.

Targeting speech enhancement for hearing aids, we continued investigating the performance of speech enhancement at a fine-grained phonetic level. The goal here is to link the results obtained with objective metrics to the outcome of listening tests conducted at our partner site (Institut de l'audition). To that end, we conducted an extensive evaluation of state-of-the-art speech enhancement algorithms at the phoneme level (rather than at the commonly considered utterance level), and across genders. Results show that the tested algorithms better reduce interference with fewer artifacts on female speech, particularly in plosives, fricatives, and vowels. Additionally, they demonstrate greater performance for female speech in terms of perceptual and speech recognition metrics 27. We exploited these findings in a subsequent work, where we proposed perceptually-informed variants of common speech enhancement training losses. These are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong, in order to better account for variability across phonemes. Spectral analysis indicates better consonant reconstruction, which points to a better preservation of certain acoustic cues 28.

Speech intelligibility‌ reduction.

We investigated masking noise generation to reduce speech intelligibility in open-plan offices. The goal is to attenuate the annoyance caused by concurrent speech produced by co-workers. While commercial systems rely on stationary noise at a constant level (resulting in overexposure to sound), we explored adjusting the noise level to specific parts of speech (different phoneme classes) and assessed the impact on speech intelligibility 18.

5.3 Axis 3 —‌ Multimodal Speech: generation and‌‌ interaction

Participants: Théo Biasutto-Lervat, Domitille Caillat, Vincent Colotte, Yves Laprie, Slim Ouni, Mostafa Sadeghi, Emmanuel Vincent, Louis Abel, Hugo Bergerat, Jean Eudes Ayilo, Sofiane Azzouz, Tom Bourgeade, Guilhem Faure, Mickaella Grondin-Verdon, Colombe M'Boungou, Nhat Nam Nguyen, Alex Stasica.

5.3.1‌ Axis 3.1 — Multimodality‌ modeling and analysis

Audio-visual‌‌ speech enhancement (AVSE).

We addressed audio–visual fusion for‌ unsupervised speech enhancement with‌ diffusion models. Our framework‌‌ combines a visually conditioned diffusion speech prior with‌ an NMF noise model‌ 10. The diffusion‌‌ prior is first pre-trained on clean speech conditioned‌ on video: visual features‌ are extracted from the‌‌ input stream and fused with audio features in‌ the diffusion network via‌ cross-attention. At inference, the‌‌ model performs iterative posterior sampling within the reverse‌ diffusion process, while the‌ NMF noise parameters are‌‌ updated using intermediate speech estimates. Experiments show improvements‌ over the audio-only variant,‌ better generalization than a‌‌ recent supervised-generative AVSE approach, and a more favorable‌ speed–quality trade-off than prior‌ diffusion-based inference.
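The NMF noise model can be illustrated with the following Python sketch, which fits a nonnegative factorization to a nonnegative "noise power spectrogram" using the classic multiplicative updates for the generalized Kullback-Leibler divergence. This is only an illustration under stated assumptions: the divergence, the update rule, the rank and the synthetic data are hypothetical choices, not necessarily those used in 10, and the diffusion prior and posterior sampling are not shown.

```python
import numpy as np

def kl_divergence(v, v_hat, eps=1e-12):
    """Generalized Kullback-Leibler divergence between nonnegative arrays."""
    return float(np.sum(v * np.log((v + eps) / (v_hat + eps)) - v + v_hat))

def nmf_step(v, w, h, eps=1e-12):
    """One round of multiplicative updates (activations h, then dictionary w)."""
    ratio = v / (w @ h + eps)
    h = h * (w.T @ ratio) / (w.sum(axis=0)[:, None] + eps)
    ratio = v / (w @ h + eps)
    w = w * (ratio @ h.T) / (h.sum(axis=1)[None, :] + eps)
    return w, h

# Synthetic stand-in for a noise power spectrogram (assumption).
rng = np.random.default_rng(0)
n_freq, n_frames, rank = 20, 30, 4
noise_power = rng.random((n_freq, n_frames)) + 0.1
w = rng.random((n_freq, rank)) + 0.1   # spectral templates
h = rng.random((rank, n_frames)) + 0.1  # temporal activations

divergences = [kl_divergence(noise_power, w @ h)]
for _ in range(50):
    w, h = nmf_step(noise_power, w, h)
    divergences.append(kl_divergence(noise_power, w @ h))
print(divergences[0], divergences[-1])
```

In an enhancement loop of the kind described above, the matrix being factorized would be re-estimated at each step from the noisy input and the current speech estimate, rather than fixed as in this sketch.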

Automatic isolated‌‌ sign recognition in French Sign Language (LSF).

We‌ investigated an isolated sign‌ recognition system for sign‌‌ language WordNet resources, aimed at identifying and grouping‌ phonologically similar signs across‌ different sign languages with‌‌ minimal training data and providing similarity suggestions for‌ manual validation 25.‌ The approach relied on‌‌ video-only analysis, combining key-frame extraction, pose estimation with‌ MediaPipe, and normalization strategies‌ to mitigate biases related‌‌ to handedness, non-dominant arm position, and signer morphology.‌ Representations were analyzed using‌ Uniform Manifold Approximation and‌‌ Projection (UMAP) and compared with a Vision Transformer‌ for pairwise similarity ranking.‌ Evaluation on a manually‌‌ annotated subset of the WordNet corpus achieved reasonable‌ accuracy.

5.3.2 Axis 3.2‌ — Multimodal speech generation‌‌

Acquisition of rt-MRI (real-time Magnetic Resonance Imaging) data.‌

This year, in collaboration with the IADI laboratory (P.-A. Vuissoz), we started the acquisition of a large Arabic corpus for one speaker. Since no corpus was available, a set of 2000 sentences was designed with the help of Gemini. This is of interest since the same approach could be used for other languages involved in the new ANR ArtAny project. In addition, we recorded a speaker producing several non-standard voice qualities (falsetto, very deep or cracked voice) in order to study the range of variability of the vocal tract articulators and to generalise the automatic articulator tracking algorithms.

Acoustic to‌ articulatory inversion.

Acoustic to articulatory inversion is a major processing challenge, with a wide range of applications from speech synthesis to feedback systems for language learning and rehabilitation. Last year, we conducted the first experiments on acoustic to articulatory inversion for the tongue, which is the most mobile and deformable speech articulator 11. This was the first time that inversion covered the entire contour of the tongue (from its root to its tip), since inversion generally only covers a few points corresponding to sensors attached to the tongue. We extended the approach to all articulators (lips, tongue, velum, epiglottis, arytenoid cartilages, glottis) by sizing the output layers so as to clearly separate the articulators. The average error is 1.67 mm, for a pixel size of 1.6 mm in the images 12, 36. To our knowledge, this is the first inversion experiment to recover the complete geometry of the vocal tract in the form of the contour of all articulators in the mid-sagittal plane.

Quaternion pose‌ encoding and contrastive learning for robust sign language‌ production.

We tackled a key challenge in neural sign language production: high intra-class variability caused by signer morphology and stylistic differences. Building on Progressive Transformers, we introduced two complementary improvements 19. First, we represented poses using bone rotations in quaternion space and optimized them with a geodesic loss, which better captures angular motion and improves joint articulation. Second, we added a semantically guided contrastive loss that structures decoder embeddings using sentence-level similarity (via gloss overlap or SBERT), encouraging the model to focus on meaning-relevant motion while reducing anatomical and stylistic bias. On Phoenix14T, a widely used corpus, the contrastive objective alone improves Probability of Correct Keypoint by 16% over the baseline, and combining it with quaternion encoding reduces Mean Bone Angle Error by 6%, highlighting the benefit of skeletal-structure modeling and semantic supervision in Transformer-based sign language production.
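A geodesic loss between bone rotations expressed as unit quaternions can be sketched as follows in Python. The formula used here, angle = 2 arccos(|<q1, q2>|), is the standard rotation angle between two orientations. Whether 19 uses exactly this form is an assumption, as are the function name geodesic_loss and the (w, x, y, z) component ordering.

```python
import numpy as np

def geodesic_loss(q_pred, q_true):
    """Mean rotation angle (radians) between batches of quaternions.

    Inputs have shape (..., 4) in (w, x, y, z) order and must be non-zero;
    they are normalized to unit length before comparison.
    """
    q_pred = q_pred / np.linalg.norm(q_pred, axis=-1, keepdims=True)
    q_true = q_true / np.linalg.norm(q_true, axis=-1, keepdims=True)
    # q and -q encode the same rotation, hence the absolute value.
    dot = np.abs(np.sum(q_pred * q_true, axis=-1))
    angle = 2.0 * np.arccos(np.clip(dot, 0.0, 1.0))
    return float(np.mean(angle))

# Identity rotation and a 90-degree rotation about the z axis.
q_identity = np.array([[1.0, 0.0, 0.0, 0.0]])
q_z90 = np.array([[np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]])

print(geodesic_loss(q_identity, q_z90))
```

Unlike a Euclidean distance on quaternion components, this quantity is invariant to the sign ambiguity of quaternions and is measured directly in terms of rotation angle.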

5.3.3 Axis 3.3 — Interaction

Formal description‌ and annotation of co-speech gestures.

A first line‌ of work focused on the formal characterization of‌ gestures, aiming to identify necessary and sufficient descriptive‌ features and to automate their extraction, leading to‌ the definition of six complementary modalities (manuality, trajectory,‌ location, hand configuration, speed, and size) to support‌ objective annotation and gesture-aware neural systems 38.‌ This formalization effort was extended by an in-depth‌ investigation of the spatial aspects of gestures, proposing‌ a dual spatial encoding based on positioning and‌ orientation within dedicated three-dimensional reference spaces, and demonstrating‌ how automatically derived spatial attributes can enrich corpora‌ and support the analysis of gesture-speech relationships 40‌. In parallel, methodological contributions addressed the practical‌ limitations of manual annotation through the development of‌ COSMOS, a semi-automatic tool based on motion capture‌ data and encoder-decoder models, designed to assist gesture‌ segmentation with limited training data while significantly reducing‌ annotation effort 44.

Recognition and evaluation of‌ co-speech gestures.

Within the broader scientific framework of‌ multimodal communication and speech-gesture modeling, and in parallel‌ with our work on the automatic generation of‌ co-verbal gestures using graph-based neural networks 41,‌ several complementary studies addressed the formal description, annotation,‌ recognition, and evaluation of co-verbal manual gestures, combining‌ approaches from linguistics, computer science, and movement sciences.‌ Finally, an exploratory interdisciplinary study examined the evaluation of hand-gesture synthesis quality,‌ showing that expert annotations‌ can reveal systematic differences‌‌ between natural and synthetic gestures in terms of‌ communicative efficiency and movement‌ dynamics, and highlighting the‌‌ need for combined computational and linguistic criteria to‌ assess and improve gesture‌ generation systems 39,‌‌ 21.

Medical dialog summarization.

In the context‌ of our collaboration with‌ a medical doctor in‌‌ Paris, we introduced QUARTZ, a framework for task-oriented‌ unsupervised dialogue summarization 20‌. For medical dialogs,‌‌ task-specific medical accuracy is important. QUARTZ starts by‌ generating multiple summaries and‌ task-specific question-answer pairs using‌‌ large language models (LLMs). Summaries are evaluated by‌ having the LLMs respond‌ to task-related questions before‌‌ (i) selecting the best candidate responses and (ii)‌ identifying the most informative‌ summary. Finally, we finetune‌‌ the best LLM on the selected summaries. When‌ validated on multiple datasets,‌ QUARTZ achieves competitive zero-shot‌‌ performance, rivaling fully-supervised state-of-the-art approaches.

6 Bilateral contracts‌ and grants with industry‌

6.1 Bilateral grants with‌‌ industry

6.1.1 Meta AI

Company: Meta AI (France)‌
Duration: May 2022 –‌ Apr 2025
Participants: Robin‌‌ San Roman, Romain Serizel
Abstract: This CIFRE grant‌ funds the PhD of‌ Robin San Roman on‌‌ self-supervised disentangled representation learning of audio data for‌ compression and generation.

6.1.2‌ Orange Labs

Company: Orange‌‌ Labs (France)
Duration: Mar 2023 – Feb 2026
Participants: Taous Iatariene, Romain‌ Serizel
Abstract: This CIFRE‌‌ grant funds the PhD of Taous Iatariene on‌ sound source tracking.

7‌ Partnerships and cooperations

7.1‌‌ International initiatives

7.1.1 Inria associate team not involved‌ in an IIL or‌ an international program

TrustedSpeech‌‌

Title:
Trusted speech dataset generation
Duration:
Jan 2025‌ – Dec 2027
Coordinator:‌
Junichi Yamagishi (jyamagis@nii.ac.jp)
Partners:‌‌
- National Institute of Informatics, Tokyo (Japan)
Inria contact:‌
Emmanuel Vincent
Summary:
The‌ TrustedSpeech associate team will‌‌ conduct joint research aiming to improve the privacy,‌ fairness and utility of‌ anonymized and synthetic speech‌‌ data, so as to offer a complete methodology‌ to produce trusted speech‌ datasets.

7.1.2 Participation in‌‌ other International Programs

ANR-JST CONFLUENCE

Title:
Semantic Segmentation‌ of Complex Sound Scenes‌ on Edge Devices
Duration:‌‌
Dec 2024 - Nov 2027
Coordinator:
Sonaide
Partners:‌
Université de Lorraine, CEA-List‌ (FR), the company Sonaide‌‌ (FR), Nippon Telegraph and Telephone Corporation (NTT, JP)‌ and Tokyo Metropolitan University‌ (JP)
Participants:
Paul Magron,‌‌ Mayank Mishra, Romain Serizel
Abstract:
The CONFLUENCE project‌ aims to develop artificial‌ intelligence (AI) technologies for‌‌ sound semantic segmentation of acoustic signals that can‌ recognize sound events and‌ separate/isolate the signals of‌‌ the sound sources forming semantic entities.

7.2 International‌ research visitors

7.2.1 Visits‌ to international teams

R. Bagat: Short stay (15 days) at the National Institute of Informatics, Tokyo (Japan), in the framework of the TrustedSpeech associate team

7.3 European initiatives

7.3.1‌ Horizon Europe

PSST

PSST‌ project on cordis.europa.eu

Title:‌‌
Privacy for Smart Speech Technology
Duration:
Feb 2025‌ – Jan 2030
Partners:‌
Inria, UNIVERSITE DE LORRAINE,‌‌ ORANGE SA (FR), KI ELEMENTS GMBH (DE), STICHTING‌ RADBOUD UNIVERSITEIT (NL), VOICEINTERACTION‌ (PT), OMILIA (GR), AALTO‌‌ KORKEAKOULUSAATIO SR (FI), TECHNISCHE‌ UNIVERSITAT BERLIN (DE), NAVER FRANCE, Commission nationale de‌ l'informatique et des libertés (FR), EVALUATIONS AND LANGUAGE‌ RESOURCES DISTRIBUTION AGENCY (FR), RUHR-UNIVERSITAET BOCHUM (DE), Loihde‌ Advisory Oy, Finland (FI), voice INTER connect GmbH‌ (DE), VOCAPIA RESEARCH (FR), EURECOM (FR), INESC ID‌ (PT), Voicemod Inc. (ES), INSTITUTO SUPERIOR TECNICO (PT),‌ SORBONNE UNIVERSITE (FR)
Inria contact:
Emmanuel Vincent
Coordinator:‌
Tom Bäckström
Summary:
The PSST joint doctoral training‌ network will train a new cohort of PhD‌ students to develop voice privacy technologies using cutting-edge‌ deep learning methods.

7.3.2 Digital Europe

LLMs4EU

Title:‌
Large Language Models for the European Union
Duration:‌
Mar 2025 – Feb 2028
Partners:
- Inria, France‌
- 65 other partners in Europe
Inria contact:
Emmanuel‌ Vincent
Coordinator:
Edouard Geoffrois
Summary:
The LLMs4EU project‌ coordinated by the Alliance for Language Technologies (ALT-EDIC)‌ brings together Europe's leading players in the field‌ of generative AI to ensure that European companies‌ and especially SMEs have access to the tools‌ and resources to become competitive regarding language technologies‌ and especially Large Language Models (LLMs).

7.4 National‌ initiatives

ANR ENACT

Title:
IA Cluster — Centre‌ Européen en Intelligence Artificielle par l'Innovation
Duration:
Jan‌ 2025 - Dec 2029
Coordinator:
Emmanuel Vincent (until Jun 2025), Jean-Baptiste Mouret (from Jun to Dec 2025)
Partners:
Université de Lorraine, Université de Strasbourg,‌ Inria, CNRS, CHRU de Nancy, Région Grand Est,‌ Métropole Grand-Nancy, Métropole de Strasbourg, Métropole de Metz.‌
Participants:
Emmanuel Vincent , Irina Illina
Abstract:
ENACT‌ is the AI Cluster of Region Grand Est,‌ with a budget of 30 MEUR. It aims‌ to make Grand Est a European leader in‌ artificial intelligence (AI), with a structuring strategy of‌ training, research and innovation designed in a global‌ way to benefit the entire territory of the‌ Region and beyond. Emmanuel Vincent holds a chair‌ with Nancy's hospital on LLMs for emergency medicine,‌ and Irina Illina has a PhD student funded‌ by the project.

ANR Full3DTalkingHead

Title:
Synthèse articulatoire‌ phonétique
Duration:
Apr 2021 - Sep 2025
Coordinator:‌
Yves Laprie
Partners:
Loria, Gipsa-Lab, LEGI, IADI, LPP.‌
Participants:
Yves Laprie , Slim Ouni , Vinicius‌ Ribeiro
Abstract:
The objective is to realize a‌ complete three-dimensional digital talking head including the vocal‌ tract from the vocal folds to the lips‌ and the face, and integrating the digital simulation‌ of the aero-acoustic phenomena.

ANR ArtAny

Title:
Articulateur‌ universel
Duration:
Nov 2025 - Oct 2030
Coordinator:‌
IADI(Nancy)
Partners:
IADI (Nancy), LPP (Paris)
Participants:
Yves‌ Laprie , Slim Ouni , Emmanuel Vincent ,‌ Vincent Colotte
Abstract:
The Articulator Anything project aims‌ to reconstruct the three-dimensional dynamic evolution of the‌ vocal tract for any language and any speaker.‌ It falls within the field of articulatory synthesis,‌ modeling and simulating the physical process of human‌ speech production using advanced artificial intelligence methods. Current‌ approaches are limited as they rely on static‌ representations of phonemes and fail to capture the‌ temporal dynamics essential for coarticulation and anticipation in‌ natural speech.

ANR CODIM

Title:
COmpositionality and DIscourse Markers
Duration:
Jan 2023‌ - Dec 2027
Coordinator:‌
ATILF(Nancy)
Partners:
ATILF(Nancy), LORIA(Nancy),‌‌ LLF
Participants:
Vincent Colotte
Abstract:
The CODIM project‌ focuses on the two‌ main linguistic resources for‌‌ organizing monologues or conversations in human languages :‌ Discourse Markers (therefore/donc, well/ben,bon‌ etc. in English/French) and‌‌ prosody (in particular, intonation). It will evaluate their‌ status with respect to‌ two major views on‌‌ communication: compositionality (the possibility of combining meaningful expressions‌ into more complex meaningful‌ expressions) and pattern or‌‌ construction-based approaches (the idea that language users exploit‌ partly ‘frozen’ strings of‌ words). We will compare‌‌ the semantic and prosodic properties of simple and‌ complex French DM (e.g.‌ ah + bon) found‌‌ in corpora for written and spoken French.

ANR‌ LLM4all

Title:
Large Language‌ Models for All
Duration:‌‌
Oct 2023 - Mar 2027
Coordinator:
Christophe Cerisara‌
Partners:
LORIA-Synalp, LORIA-Multispeech, LIX,‌ Linagora
Participants:
Irina Illina‌‌ , Emmanuel Vincent
Abstract:
Large Language Models (LLM)‌ of sufficient size exhibit‌ outstanding emergent abilities, such‌‌ as learning from their input context and decomposing‌ a complex problem into‌ a chain of simpler‌‌ steps. The LLM4all project will thus focus on‌ such large models, or‌ on models at the‌‌ same level of generic performances, and will propose‌ methods to solve two‌ related fundamental issues: how‌‌ to update these LLMs automatically, and how to‌ reduce their computing requirements‌ in order to facilitate‌‌ their deployment.

ANR Lorraine Artificial Intelligence – LOR-AI

Title:
Lorraine Artificial Intelligence – Cofinancement de thèses en IA
Duration:
Sep 2020 - Dec 2025
Coordinator:‌
Yves Laprie
Partners:
CNRS,‌ Inria, Regional University Hospital‌‌ Centre (CHRU)
Participants:
Doctoral school of Université de‌ Lorraine
Abstract:
This Artificial Intelligence project, led by the Université de Lorraine (UL), provides 12 co-fundings for doctoral theses with a double objective: on the one hand, to strengthen UL areas of excellence in AI and in domains tightly connected to AI, particularly Health, and on the other hand, to open other research areas to AI with the objective of leading to scientific breakthroughs.

ANR REFINED‌

Title:
Real-Time Artificial Intelligence‌‌ for Hearing Aids
Duration:
Mar 2022 - Mar‌ 2026
Coordinator:
CEA List‌ (Saclay)
Partners:
CEA List‌‌ (Saclay), Institut de l'audition (Paris), LORIA (Nancy)
Participants:‌
Paul Magron, Nasser-Eddine Monir,‌ Romain Serizel
Abstract:
The REFINED project brings together audiologists, computer scientists and specialists in hardware implementation to design new speech enhancement algorithms that fit both the needs of patients suffering from hearing loss and the computational constraints of hearing aid devices.

ANR ReNAR

Title:‌
Reducing Noise with Augmented‌ Reality
Duration:
Feb 2024‌‌ - Jan 2028
Coordinator:
CEA List (Saclay)
Partners:‌
Ircam (Paris), Laboratoire des‌ Sciences du Numérique de‌‌ Nantes (Nantes), LORIA (Nancy)
Participants:
Romain Serizel, Aine‌ Drelingyte
Abstract:
The aim of the ReNAR project is to design a solution that can attenuate the impact of noise in office working scenarios (in particular in open spaces). We will target two aspects: generating noise maskers that result in sound scenes that are pleasant to hear for workers, and generating signals that can obfuscate surrounding speech.

ANR SPEECHPRIVACY

Title:‌
Multiple-attribute disentanglement and semantic privacy
Duration:
Feb 2024‌ - Jan 2028
Coordinator:
Vincent Colotte
Partners:
LORIA‌ (Nancy), EURECOM (Sophia Antipolis), LIA (Avignon)
Participants:
Vincent‌ Colotte , Emmanuel Vincent , Orane Dufour, Natalia‌ Tomashenko.
Abstract:
SpeechPrivacy will deliver a flexible solution‌ to privacy preservation based on isolated/disentangled representations and‌ the selective obfuscation/modification of individual attributes beyond the‌ usual voice identity/sex and sensitive keywords.

ANR Syncogest‌

Title:
Gesture and Speech Synchronization
Duration:
Apr 2025‌ - Mar 2029
Coordinator:
Slim Ouni
Partners:
LORIA‌ (Nancy), PRAXILING (Montpellier), EUROMOV (Montpellier)
Participants:
Slim Ouni‌ , Vincent Colotte, Louis Abel, Hugo Bergerat, Domitille‌ Caillat
Abstract:
SYNCOGEST aims to model spontaneous human‌ gestures—facial expressions, postures, and body movements—and their synchronization‌ with speech in face-to-face communication. By combining insights‌ from artificial intelligence, language sciences, and movement sciences,‌ the project will develop deep learning–based models for‌ automatic gesture generation, enabling more natural and effective‌ embodied conversational agents.

PEPR Cybersécurité, projet iPOP

Title:‌
Protection des données personnelles
Duration:
Oct 2022 –‌ Sep 2028
Coordinator:
Vincent Roca (Inria PRIVATICS)
Partners:‌
Inria Multispeech (Nancy), PRIVATICS (Lyon), COMETE, PETRUS (Saclay),‌ MAGNET, SPIRALS (Lille), IRISA (Rennes), LIFO (Bourges), DCS‌ (Nantes), CESICE (Grenoble), EDHEC (Lille), CNIL (Paris)
Participant:‌
Emmanuel Vincent
Summary:
The objectives of iPOP are‌ to study the threats on privacy introduced by‌ new digital technologies, and to design privacy-preserving solutions‌ compatible with French and European regulations. Within this‌ scope, Multispeech focuses on speech data.

Défi Inria‌ COLaF

Title:
Corpus et Outils pour les Langues‌ de France
Duration:
Aug 2023 – Jul 2027‌
Coordinator:
Slim Ouni and Benoît Sagot (Inria ALMANACH)‌
Partners:
Inria Multispeech (Nancy), ALMANACH (Paris)
Participant:
Slim‌ Ouni , Sam Bigeard , Vincent Colotte ,‌ Emmanuel Vincent , Pascale Erhart
Summary:
This project‌ aims to increase the inclusiveness of speech technologies‌ by releasing open data, models and software for‌ accented French and for regional, overseas and non-territorial‌ languages of France.

DGA DEEP MAUVES

Title:
Deep‌ automatic aircraft speech recognition for non native speakers‌
Duration:
Dec 2022 – Dec 2026
Coordinator:
Irina‌ Illina
Participant:
Irina Illina , Raphaël Bagat ,‌ Emmanuel Vincent , Romuald Ait Bachir
Summary:
This‌ project proposes methods and tools that increase the‌ usability of ASR systems for non-native speakers in‌ noisy conditions in the aeronautical domain.

ANSES IPIAMA‌

Title:
Reducing Noise with Augmented Reality
Duration:
Dec‌ 2023 - Dec 2026
Coordinator:
Jean-Pierre Arz, INRS‌ (Nancy)
Partners:
INRS (Nancy), Laboratoire Énergies et Mécanique‌ Théorique et Appliquée (Nancy), LORIA (Nancy)
Participants:
Romain‌ Serizel
Abstract:
The IPIAMA project aims to propose‌ binaural speech intelligibility measurements (with both ears) for‌ people equipped with hearing aids. The project will‌ rely jointly on classic listening tests (reliable but‌ expensive) and models based on data collected in‌ realistic conditions.

8 Dissemination

8.1 Promoting scientific activities‌

8.1.1 Scientific events: organisation

General chair, scientific chair‌

Main organizer, UDICE-U15 workshop on AI: Stronger together – How to train‌ and retain the next‌ generation of talent in‌‌ Europe and develop efficient and competitive French-German ecosystems?,‌ Nancy, Mar 2025 (E.‌ Vincent)

Member of the‌‌ organizing committees

Organizer, 1st VoicePrivacy Attacker Challenge (N.‌ Tomashenko, E. Vincent)
Challenge‌ co-chair, DCASE Challenge 2025‌‌ (R. Serizel)

8.1.2 Scientific events: selection

Member of‌ the conference program committees‌

ICASSP 2026 – IEEE‌‌ International Conference on Acoustics, Speech, and Signal Processing‌ (R. Serizel)
WASPAA 2025‌ – IEEE Workshop on‌‌ Applications of Signal Processing to Audio and Acoustics‌ (R. Serizel)

Reviewer

ICASSP‌ 2026 - IEEE International‌‌ Conference on Acoustics, Speech, and Signal Processing (P.‌ Magron, E. Vincent, M.‌ Sadeghi, R. Serizel)
ICASSP‌‌ 2025 - IEEE International Conference on Acoustics, Speech,‌ and Signal Processing (I.‌ Illina)
INTERSPEECH 2025 (P.‌‌ Magron, I. Illina, Y. Laprie, V. Colotte)
EUSIPCO‌ 2025 - European Signal‌ Processing Conference (V. Colotte)‌‌
WASPAA 2025 - IEEE Workshop on Applications of‌ Signal Processing to Audio‌ and Acoustics (P. Magron)‌‌
ASRU 2025 - IEEE Automatic Speech Recognition and Understanding Workshop (I. Illina)
DCASE‌ 2025 - Workshop on‌‌ Detection and Classification of Acoustic Scenes and Events‌ (R. Serizel)
Revue TAL: Traitement Automatique des Langues (I. Illina)
NAACL 2025, Demo Track (I. Illina)
ICMI 2025, Industrial track‌ (S. Ouni)

8.1.3 Journal‌‌

Member of the editorial boards

IEEE Transactions on‌ Audio, Speech and Language‌ Processing (R. Serizel)

Reviewer‌‌ - reviewing activities

IEEE Signal Processing Letters (P.‌ Magron, M. Sadeghi)
IEEE‌ Transactions on Audio, Speech‌‌ and Language Processing (P. Magron, E. Vincent, M.‌ Sadeghi)
ACL 2025 -‌ Association for Computational Linguistics‌‌ (I. Illina)

8.1.4 Invited talks

Keynote "The rise,‌ fall, and resurgence of‌ NMF for audio source‌‌ separation", Workshop on Low-Rank Models and Applications (Mons,‌ Belgium), Sep 2025 (P.‌ Magron)
Seminar "Machine learning‌‌ for music separation: Combining data-driven models and expert‌ knowledge", University of Strasbourg‌ (Strasbourg, France), May 2025‌‌ (P. Magron)
Keynote "Modéliser la communication parlée multimodale",‌ Workshop RJCP (Paris), Nov‌ 2025 (S. Ouni)

8.1.5‌‌ Leadership within the scientific community

Member of the‌ Steering Committee of ISCA's‌ Special Interest Group on‌‌ Security and Privacy in Speech Communication (E. Vincent)‌
Board member of Le‌ VoiceLab, the association of‌‌ French voice tech players (E. Vincent)
Chair of‌ the DCASE Steering Committee‌ (R. Serizel)
Board member‌‌ of AFCP - Association Francophone de la Communication‌ Parlée (V. Colotte, S.‌ Ouni)
Secretary/Treasurer, executive member‌‌ of AVISA (Auditory-VIsual Speech Association), an ISCA Special‌ Interest Group (S. Ouni)‌

8.1.6 Scientific expertise

Scientific‌‌ Expert for CIFRE grant allocation, Ministère de l'Enseignement‌ supérieur, de la Recherche‌ et de l'Innovation (R.‌‌ Serizel, S. Ouni)
Project expert for Direction Générale‌ Déléguée Recherche, Innovation, Valorisation‌ et Ecoles doctorales (I.‌‌ Illina)

8.1.7 Research administration

Inria representative on the‌ Lorraine Steering Committee for‌ Open Science (E. Vincent)‌‌
Head of pole scientifique Automatique, Mathématiques, Informatique et‌ leurs interactions (AM2I) de‌ l'Université de Lorraine (Y.‌‌ Laprie)
Member of the executive board of the‌ Université de Lorraine (Y.‌ Laprie)
Local correspondent for‌‌ Inria's Quadran high-risk research‌ programme (Y. Laprie)
Member of the steering committee‌ for the digital strategy of the Université de‌ Lorraine (Y. Laprie)
Member of the bureau du pole scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) (I. Illina)
Member of the Comité du pole scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) (I. Illina)
Member of the RIPEC jury, UL (I.‌ Illina)
Member of the promotion committee, UL (I.‌ Illina)
Member of the admission committee for Master‌ TAL, UL (I. Illina)
Member of the admission‌ committee for ATER, UL, IUT Charlemagne (I. Illina)‌
Member of the selection committee for MCF, UL (I. Illina)
Member of the IUT Charlemagne Council, UL‌ IUT Charlemagne (I. Illina)
Member of the IUT‌ Charlemagne Restricted Council, UL IUT Charlemagne (I. Illina)‌
Member of the PhD grant allocation committee, Avignon‌ University (I. Illina)
Member of the laboratory council of LORIA (V. Colotte)
Member of the selection committee‌ for the position of assistant professor at Université‌ de Paris-Saclay (S. Ouni)
Member of the selection‌ committee for the position of professor at Université‌ de Toulouse (S. Ouni)
Co-Chair of the selection‌ committee for the position of professor at Université‌ de Lorraine (S. Ouni)
Member of the repyramidage‌ committee for the position of professor at Université‌ de Toulouse (S. Ouni)
Member of the evaluation‌ committee of Haut Conseil de l'évaluation de la‌ recherche et de l'enseignement supérieur (HCERES) for LJK‌ (S. Ouni)
Co-head of the Computer Science track‌ at the IAEM Doctoral School (S. Ouni)
Chair‌ of the ATER Recruitment Committee, Department of Computer‌ Science, IUT Nancy-Charlemagne (S. Ouni)
Member of the‌ Comité Utilisateurs des Moyens de Calculs, Inria Research‌ Center at Université de Lorraine (T. Biasutto–Lervat)
Referent‌ Plateformes-Outils, Inria Research Center at Université de Lorraine‌ (T. Biasutto–Lervat)

8.2 Teaching - Supervision - Juries‌ - Educational and pedagogical outreach

8.2.1 Teaching

Master:‌ P. Magron
- "Neural networks" (54 hours), M2, UL‌
- "Professional insertion" (2 hours), M2, IRCAM / Sorbonne‌ University
Master: M. Sadeghi
- "Machine learning" (20 hours),‌ M1, UL
- "Statistics" (20 hours), M1, UL
BUT:‌ I. Illina
- Java programming (100 hours), L1, UL‌
- Linux programming (58 hours), L1, UL
- Advanced Java‌ programming (40 hours), L1, UL
- Supervision of student‌ projects and internships (30 hours), L2, UL
Master:‌ I. Illina
- Speech recognition and text-to-speech (10 hours),‌ M2, UL
BUT: R. Serizel
- "Bases informatiques" (14‌ hours), BUT1, UL
- "Publication web" (84 hours), BUT1,‌ UL
- "Métadonnées internes" (14 hours), BUT1, UL
- "Bases‌ de données relationnelles" (8 hours), BUT1, UL
- "Indexation‌ de contenus multimédias" (16 hours), BUT2, UL
- "Systèmes‌ d'information" (18 hours), BUT2, UL
- "Introduction à l'audio‌ numérique" (14 hours), BUT2, UL
- "Données ouvertes" (8‌ hours), BUT3, UL
- "Visualisation de données" (8 hours),‌ BUT3, UL
- "Usages de l'IA" (14 hours), BUT3,‌ UL
Master: R. Serizel
- "Robustesse de la parole"‌ (15 HETD), M2, UL
- "Impacts environnementaux de l'IA" (6 hours), M2, UL
Eng: R. Serizel
- "Algorithmique"‌ (18 hours), L3, UL
- "Bases de l'apprentissage automatique"‌ (12 hours), M1, UL
- "Impacts environnementaux de l'IA" (21 hours), M2, UL
BUT: S. Ouni
- Programming‌ in Java (24 hours),‌‌ BUT1, UL
- Web Programming (24 hours), BUT1, UL‌
- Graphical User Interface (96‌ hours), BUT1, UL
- Advanced‌‌ Algorithms (24 hours), BUT2, UL
- Algorithm analysis (24‌ hours), BUT3, UL
- Multimedia‌ (24 hours), BUT3, UL‌‌
- AI Agent (24 hours), BUT3, UL
Master: Y.‌ Laprie
- "Speech corpora" (30‌ hours), M1, UL
Licence:‌‌ Y. Laprie
- Phonetics (16 hours), L2, École d'audioprothèse,‌ UL
Licence: V. Colotte‌
- Digital literacy and tools‌‌ (hybrid courses, 50 hours), L1, UL
- System (80‌ hours), L2-L3, UL
- Introduction‌ to speech processing (20‌‌ hours), L3, UL
Master: V. Colotte
- Integration project:‌ multimodal interaction with Pepper‌ Robot (17 hours), M2,‌‌ UL
- Multimodal oral communication (24 hours), M2, UL‌
- AI introduction (9 hours),‌ M2 - intellectual property‌‌ rights, UL
- Introduction to speech processing (24 hours),‌ M1, UL
Other: V.‌ Colotte
- Co-responsible for NUMOC (digital literacy through hybrid courses) at UL (7000 students)
Other: S. Ouni‌
- Co-responsible for the RA-IL track in the BUT Computer Science program, UL

8.2.2 Supervision

PhD defended:‌ Louis Abel, "Co-speech gesture‌‌ synthesis : Towards a controllable and interpretable model‌ using a graph deterministic‌ approach", Jan 2025, V.‌‌ Colotte and S. Ouni 41
PhD in progress:‌ Nasser-Eddine Monir, "Multichannel speech‌ enhancement for patients with‌‌ auditory neuropathy spectrum disorders", Dec 2022, R. Serizel‌ and P. Magron
PhD‌ in progress: Mickaëlla Grondin,‌‌ "Modeling gestures and speech in interactions", Nov 2021,‌ S. Ouni and D.‌ Caillat (Praxiling).
PhD in‌‌ progress: Jean-Eudes Ayilo, "Audio-visual Speech Enhancement: Bridging the‌ Gap between Supervised and‌ Unsupervised Approaches", Oct. 2023,‌‌ M. Sadeghi and R. Serizel
PhD in progress:‌ Guilhem Fauré, "End-to-end Speech-to-Sign‌ Language Generation", Oct. 2024,‌‌ S. Ouni and M. Sadeghi
PhD in progress:‌ Zahra-Hafida Benslimane, "Embedded speech‌ enhancement for hearing aids",‌‌ Nov. 2023, Fabrice Auzanneau (CEA-List) and R. Serizel‌
PhD in progress: Raphaël‌ Bagat, “Automatic speech recognition‌‌ for non-native speakers in a noisy environment”, Oct‌ 2023, I. Illina and‌ E. Vincent.
PhD in‌‌ progress: Mohamed Imed Eddine Ghebriout, “LLM adaptation and‌ exploitation for medical emergency‌ call triage”, Apr 2024,‌‌ G. Guibon (LIPN) and E. Vincent.
PhD in‌ progress: Orane Dufour, "Towards‌ a comprehensive speech anonymization‌‌ framework", Oct 2024, E. Vincent, M. Rouvier (LIA),‌ and P. Magron
PhD in progress: Aine Drelingyte, "Speech intelligibility attenuation", Nov 2024, Mathieu Lagrange (LS2N) and R. Serizel
PhD in progress: Lilian Rodriguez, "Detection and anonymization of sensitive content in speech", Oct 2024, Yannick Estève (LIA) and V. Colotte
PhD in progress: Mayank Mishra, "Semantic segmentation of audio soundscapes on edge devices", Dec 2024, R. Serizel and P. Magron
PhD in progress: Isobelle Miles, "Regional and low-resource language speech synthesis", Feb 2025, E. Vincent, V. Colotte and P. Erhart (UNISTRA-LILPA)
PhD in progress: Elio Stasica, "Differential diagnosis of heart attack from speech", Sep 2025, V. Martin, R. Serizel and E. Vincent
PhD in progress: Doria Bonzi, "Social-behavior-aware chatbot for communication skills coaching of medical students", Oct 2025, I. Illina, Patrice Gallet and Fabrice Lefèvre
PhD in progress: Yaya Sy, "Efficient Continued Pre-training of Large Language Models", Nov 2023, C. Cerisara and I. Illina
PhD in progress: Sofiane Azzouz, "Acoustic to articulatory inversion based on rt-MRI data", Nov 2023, Y. Laprie
PhD in progress: Nhat-Nam Nguyen, "Multispeaker Acoustic to articulatory inversion based on rt-MRI data", Nov 2025, Y. Laprie

8.2.3 Juries

Participation in the PhD jury of Thibault Bañeras-Roux (University of Nantes, January 2025), I. Illina, reviewer
Participation in‌ the PhD jury of Lucas Maison (University of‌ Avignon, November 2025), I. Illina, reviewer
Participation in‌ the PhD jury of Nicolas André (University of‌ Avignon, December 2025), I. Illina, reviewer
Participation in‌ the PhD jury of Nathan Griot (University of‌ Avignon, December 2025), I. Illina, reviewer
Participation in‌ the PhD jury of Adrien Pupier (University of‌ Grenoble, June 2025), I. Illina, examiner
Participation in the PhD jury of David Genova (Sorbonne University, October 2025), I. Illina, examiner
Participation in‌ the PhD jury of Paul Primus (Johannes Kepler‌ University, February 2025), R. Serizel, reviewer
Participation in‌ the PhD jury of Sreenivasa Upadhyaya (KU Leuven,‌ February 2025), R. Serizel, examiner
Participation in the‌ PhD jury of Benno Weck-Hufnagel (University of Grenoble,‌ July 2025), R. Serizel, reviewer
Participation in the‌ PhD jury of Benno Weck-Hufnagel (Universitat Pompeu Fabra,‌ October 2025), R. Serizel, reviewer
Participation in the‌ PhD jury of Modan Tailleur (Ecole Centrale de‌ Nantes, November 2025), R. Serizel, examiner
Participation in the PhD jury of Ricardo Falcón Pérez (Aalto University, November 2025), R. Serizel, reviewer
Participation in‌ the PhD jury of Alexis Plaquet (Université Paul‌ Sabatier, December 2025), R. Serizel, reviewer
Participation in‌ the HDR jury of Angélique Amelot (Université de‌ Lorraine, December 2025), S. Ouni, Chair
Participation in‌ the HDR jury of Angélique Amelot (Université de‌ Lorraine, December 2025), Y. Laprie, supervisor
Participation in‌ the PhD jury of Al Oualid Eliraki (University‌ of Grenoble, June 2025), Y. Laprie, reviewer
Participation‌ in the PhD jury of Nezih Younsi (ISIR,‌ April 2025), S. Ouni, reviewer
Participation in the PhD jury of Yanis Ouakrim (University of Grenoble, May 2025), S. Ouni, examiner

8.3 Popularization

"M-PHASIS Un projet de recherche pour lutter contre les discours de haine sur internet", journal "Numérique et société", interview with I. Illina, March 2025
"Et si nos voix pouvaient aider pour server nos langues", RFI, radio broadcast "De Vives Voix", P. Erhart, S. Bigeard, Jan 2026
"Langues régionales : l'intelligence artificielle au secours de l'alsacien", France Bleu, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
"Traduction, voix de synthèse... Ces chercheurs veulent que l'IA parle breton ou alsacien", Ouest France, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
"L'alsacien à l'heure de l'IA : intégrer les langues régionales dans les modèles numériques", Sciences et Avenir, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
"Künstliche Intelligenz befördert Elsässisch in die digitale Welt", Badische Neueste Nachrichten (BNN), news article, P. Erhart, S. Bigeard, Oct 2025
"Alsacien 2.0 : quels usages pour les parlers dialectaux alsaciens ?", DNA, news article, P. Erhart, S. Bigeard, Jan 2026

8.3.1 Participation in Live‌ events

Fête de la science, "La puce à l'oreille" (R. Serizel)
Nuit de la science, "IA pour le son" (R. Serizel)
Procès du robot, 6 shows (R. Serizel, S. Bigeard)
Chiche 1 scientifique, 1 classe, 1 visit (R. Serizel)
"Alsacien 2.0 : quels usages pour les parlers dialectaux alsaciens ?", public seminar, Strasbourg, Jan 2026 (P. Erhart, S. Bigeard)
Press conference for the launch of the "Parole Spontanée" voice collection, Strasbourg (P. Erhart, S. Bigeard, S. Ouni)

8.3.2 Other relevant science outreach activities

Journée Colaf, annual seminar of Défi Colaf, Paris, June 2025 (S. Bigeard, I. Miles, M. Yaich, G. Fauré, S. Ouni)

9 Scientific‌‌ production

9.1 Major publications

1 Sara Dahmani, Vincent Colotte, Valérian Girard and Slim Ouni. Learning emotions latent representation with CVAE for Text-Driven Expressive AudioVisual Speech Synthesis. Neural Networks 141, 2021, 315-329.
2 Nasser-Eddine Monir, Paul Magron and Romain Serizel. A Phoneme-Scale Assessment of Multichannel Speech Enhancement Algorithms. Trends in Hearing 28, December 2024.
3 Manuel Pariente, Samuele Cornell, Joris Cosentino, Sunit Sivasankaran, Efthymios Tzinis, Jens Heitkaemper, Michel Olvera, Fabian-Robert Stöter, Mathieu Hu, Juan M. Martín-Doñas, David Ditter, Ariel Frank, Antoine Deleforge and Emmanuel Vincent. Asteroid: the PyTorch-based audio source separation toolkit for researchers. Interspeech 2020, Shanghai, China, October 2020.
4 Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz and Yves Laprie. Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated. Speech Communication 141, April 2022, 1-13.
5 Brij Mohan Lal Srivastava, Mohamed Maouche, Md Sahidullah, Emmanuel Vincent, Aurélien Bellet, Marc Tommasi, Natalia Tomashenko, Xin Wang and Junichi Yamagishi. Privacy and utility of x-vector based speaker anonymization. IEEE/ACM Transactions on Audio, Speech and Language Processing, June 2022.

9.2 Publications‌ of the year

International‌ journals

6 Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer and Jon P. Barker. Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge. Computer Speech and Language 89, January 2025.
7 Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh and Ian Mcloughlin. Adapting general disentanglement-based speaker anonymization for enhanced emotion preservation. Computer Speech and Language 94, November 2025, 101810.
8 Mostafa Sadeghi, Jean-Eudes Ayilo, Romain Serizel and Xavier Alameda-Pineda. Posterior Transition Modeling for Unsupervised Diffusion-Based Speech Enhancement. IEEE Signal Processing Letters 32, 2025, 2694-2698.

International peer-reviewed conferences

9 Jarno Van Arkel, Martha Larson and Emmanuel Vincent. Video games and speech privacy: A case study of Fortnite. SPSC 2025 - 5th ISCA Symposium on Security and Privacy in Speech Communication, Delft, Netherlands, August 2025, 43-47.
10 Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel and Xavier Alameda-Pineda. Diffusion-based Unsupervised Audio-visual Speech Enhancement. ICASSP 2025 - International Conference on Acoustics Speech and Signal Processing, Hyderabad, India, IEEE, 2025, 1-5.
11 Sofiane Azzouz, Pierre-André Vuissoz and Yves Laprie. Complete Reconstruction of the Tongue Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, IEEE, April 2025, 1-5.
12 Sofiane Azzouz, Pierre-André Vuissoz and Yves Laprie. Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI Data. Interspeech 2025, Rotterdam, Netherlands, ISCA, August 2025, 978-982.
13 Raphaël Bagat, Irina Illina and Emmanuel Vincent. BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation. ICASSP 2026 - IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2026.
14 Raphaël Bagat, Irina Illina and Emmanuel Vincent. Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition. 26th Interspeech Conference (Interspeech 2025), Rotterdam, Netherlands, August 2025.
15 Can Cui, Paul Magron, Mostafa Sadeghi and Emmanuel Vincent. Data-independent Beamforming for End-to-end Multichannel Multi-speaker ASR. IEEE 27th International Workshop on Multimedia Signal Processing (MMSP 2025), Beijing, China, September 2025.
16 Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi and Emmanuel Vincent. End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data. European Signal Processing Conference (EUSIPCO 2025), Palermo, Italy, September 2025.
17 Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi and Emmanuel Vincent. Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription. European Signal Processing Conference (EUSIPCO 2025), Palermo, Italy, September 2025.
18 Aine Drelingyte, Romain Serizel and Mathieu Lagrange. Phoneme-Level Speech Intelligibility Reduction. EUSIPCO 2025 - 33rd European Signal Processing Conference, Palermo, Italy, September 2025.
19 Guilhem Fauré, Mostafa Sadeghi, Sam Bigeard and Slim Ouni. Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning. IVA Adjunct '25: Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents (SLTAT 2025: 9th Workshop on Sign Language Translation and Avatar Technologies), 10, Berlin, Germany, 2025, 1-9.
20 Mohamed Imed Eddine Ghebriout, Gaël Guibon, Ivan Lerner and Emmanuel Vincent. QUARTZ: QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China, November 2025.
21 Mickaëlla Grondin-Verdon, Domitille Caillat, Louis Abel and Slim Ouni. Evaluating Automatic Hand-Gesture Generation Using Multimodal Corpus Annotations: The Benefits of a Multidisciplinary Approach. GENEA '25: Proceedings of the International Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents, MM '25: The 33rd ACM International Conference on Multimedia, Dublin, Ireland, ACM (Association for Computing Machinery), October 2025, 3-11.
22 Taous Iatariene, Can Cui, Alexandre Guérin and Romain Serizel. Speaker Embeddings to Improve Tracking of Intermittent and Moving Speakers. Proceedings of the 33rd European Signal Processing Conference (EUSIPCO 2025), Palermo, Italy, September 2025.
23 Taous Iatariene, Alexandre Guérin and Romain Serizel. Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings. 2025 IEEE 27th International Workshop on Multimedia Signal Processing (MMSP), Beijing, China, September 2025.
24 Taous Iatariene, Alexandre Guérin and Romain Serizel. Tracking of Intermittent and Moving Speakers: Dataset and Metrics. Proceedings of the 11th Convention of the European Acoustics Association Forum Acusticum 2025, Malaga, Spain, June 2025.
25 Kehina Manseri, Sam Bigeard and Slim Ouni. Preprocessing MediaPipe Joint Annotation for Sign Language Similarity Analysis. SLTAT 2025: 9th Workshop on Sign Language Translation and Avatar Technologies, Berlin, Germany, ACM, September 2025, 1-9.
26 Annamaria Mesaros, Romain Serizel, Toni Heittola, Tuomas Virtanen and Mark Plumbley. A decade of DCASE: Achievements, practices, evaluations and future challenges. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, IEEE, May 2025, 1-5.
27 Nasser-Eddine Monir, Paul Magron and Romain Serizel. Evaluating Multichannel Speech Enhancement Algorithms at the Phoneme Scale Across Genders. 33rd European Signal Processing Conference, Palermo, Italy, arXiv, June 2025.
28 Nasser-Eddine Monir, Paul Magron and Romain Serizel. Frequency-Weighted Training Losses for Phoneme-Level DNN-based Speech Enhancement. IEEE 27th International Workshop on Multimedia Signal Processing, Beijing, China, 2025.
29 Passoni Ricardo, Francesca Ronchini, Luca Comanducci, Romain Serizel and Fabio Antonacci. Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models. Proc. WASPAA 2025, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Granlibakken Tahoe, United States, October 2025.
30 Yaya Sy, Christophe Cerisara and Irina Illina. Efficient One-shot Compression via Low-Rank Local Feature Distillation. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, Albuquerque, New Mexico, United States, April 2025.
31 Natalia Tomashenko, Xiaoxiao Miao, Emmanuel Vincent and Junichi Yamagishi. The First VoicePrivacy Attacker Challenge. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 2025, 1-2. doi: 10.1109/ICASSP49660.2025.10888513.
32 Natalia Tomashenko, Emmanuel Vincent and Marc Tommasi. Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization. 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), Hyderabad, India, April 2025.
33 Natalia Tomashenko, Emmanuel Vincent and Marc Tommasi. Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems. Interspeech 2025, Rotterdam, Netherlands, August 2025.
34 Nathalie Vauquier, Brij Mohan Lal Srivastava, Seyed Ahmad Hosseini and Emmanuel Vincent. Legally validated evaluation framework for voice anonymization. Interspeech 2025 - Annual Conference of the International Speech Communication Association, Rotterdam, Netherlands, August 2025, 3229-3233.
35 Masahiro Yasuda, Binh Thien Nguyen, Noboru Harada, Romain Serizel, Mayank Mishra, Marc Delcroix, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, Tomohiro Nakatani, Takao Kawamura and Nobutaka Ono. Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes. Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025), Barcelona, Spain, Music Technology Group - Universitat Pompeu Fabra, June 2025.

National peer-reviewed Conferences

36 Sofiane Azzouz, Pierre-André Vuissoz and Yves Laprie. Inversion acoustique-articulatoire complète du conduit vocal. CFA 2025 - 17e Congrès Français d'Acoustique, Paris, France, 2025.
37 Benoît Sagot, Slim Ouni, Sam Bigeard, Lucence Ing, Rasul Dent, Juliette Janes, Thibault Clérice, Rachel Bawden, Emmanuel Vincent, Oriane Nédey, Malek Yaich, Panagiotis Tsolakis, Vincent Colotte and Mostafa Sadeghi. COLaF: Corpus and Tools for Languages of France and Varieties of French. Actes de la session industrielle de CORIA-TALN 2025: 20e Conférence en Recherche d'Information et Applications (CORIA), 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL), Les 18e Rencontres Jeunes Chercheurs en RI (RJCRI), Marseille, France, ATALA & ARIA, 2025, 33-47.

Conferences without proceedings‌

38 Domitille Caillat, Mickaëlla Grondin-Verdon and Slim Ouni. Exploring Gesture Formalization: Encoding Features and Automation Strategies. 10th Conference of the International Society for Gesture Studies, Nijmegen, Netherlands, July 2025.
39 Domitille Caillat, Mickaëlla Grondin-Verdon and Slim Ouni. Iconic Co-verbal Gestures: Study of Some Facets of Iconicity. Workshop on dimensions of iconicity in the visual modality, Göttingen, Germany, February 2025.
40 Mickaëlla Grondin-Verdon, Domitille Caillat and Slim Ouni. Encoding the Spatial Features of Co-Verbal Manual Gestures: a Framework for Automated Annotation. 12th International Conference on Multimodality (ICOM), Groningen, Netherlands, October 2025.

Doctoral‌ dissertations and habilitation theses‌

41 Louis Abel. Co-speech gesture synthesis: Towards a controllable and interpretable model using a graph deterministic approach. Université de Lorraine, February 2025.

Reports & preprints

42 Mayank Mishra, Paul Magron and Romain Serizel. Metric analysis for spatial semantic segmentation of sound scenes. November 2025.
43 Natalia Tomashenko, Junichi Yamagishi, Xin Wang, Yun Liu and Emmanuel Vincent. Target Speaker Anonymization In Multi-Speaker Recordings. October 2025.

Other scientific‌ publications

44 Mickaëlla Grondin-Verdon, Domitille Caillat and Slim Ouni. COSMOS: Semi-automatic tool development for co-speech gesture segmentation based on expert frame annotations. GENEA 2025 - Generation and Evaluation of Non-verbal Behaviour for Embodied Agents Workshop, Dublin, Ireland, 2025.
