2025 Activity report, Project-Team MULTISPEECH
RNSR: 201421147E
- Research center: Inria Centre at Université de Lorraine
- In partnership with: CNRS, Université de Lorraine
- Team name: Multimodal Speech in Interaction
- In collaboration with: Laboratoire lorrain de recherche en informatique et ses applications (LORIA)
Creation of the Project-Team: 2024 October 01
Keywords
Computer Science and Digital Science
- A3.4. Machine learning and statistics
- A3.5. Social networks
- A4.8. Privacy-enhancing technologies
- A5.1.5. Body-based interfaces
- A5.1.7. Multimodal interfaces
- A5.6.2. Augmented reality
- A5.6.3. Avatar simulation and embodiment
- A5.7. Audio modeling and processing
- A5.7.1. Sound
- A5.7.3. Speech
- A5.7.4. Analysis
- A5.7.5. Synthesis
- A5.8. Natural language processing
- A5.9. Signal processing
- A5.9.1. Sampling, acquisition
- A5.9.2. Estimation, modeling
- A5.9.3. Reconstruction, enhancement
- A5.10.2. Perception
- A5.10.5. Robot interaction (with the environment, humans, other robots)
- A6.2.4. Statistical methods
- A6.3.1. Inverse problems
- A6.3.4. Model reduction
- A6.3.5. Uncertainty Quantification
- A9.2. Machine learning
- A9.2.1. Supervised learning
- A9.2.2. Unsupervised learning
- A9.2.3. Reinforcement learning
- A9.2.4. Optimization and learning
- A9.2.6. Neural networks
- A9.2.8. Deep learning
- A9.3. Signal processing
- A9.4. Natural language processing
- A9.5. Robotics and AI
- A9.11. Generative AI
Other Research Topics and Application Domains
- B8.1.2. Sensor networks for smart buildings
- B8.4. Security and personal assistance
- B9.1.1. E-learning, MOOC
- B9.5.1. Computer science
- B9.5.2. Mathematics
- B9.5.6. Data science
- B9.6.8. Linguistics
- B9.6.10. Digital humanities
- B9.10. Privacy
1 Team members, visitors, external collaborators
Research Scientists
- Yves Laprie [CNRS, Senior Researcher, HDR]
- Paul Magron [INRIA, Researcher]
- Mostafa Sadeghi [INRIA, ISFP]
- Emmanuel Vincent [INRIA, Senior Researcher, HDR]
Faculty Members
- Slim Ouni [Team leader, UL, Professor, HDR]
- Domitille Caillat [UNIV MONTPELLIER III, Associate Professor Delegation, until Aug 2025]
- Vincent Colotte [UL, Associate Professor]
- Pascale Erhart [UNIV STRASBOURG, Associate Professor Delegation, from Sep 2025]
- Irina Illina [UL, Associate Professor, HDR]
- Romain Serizel [UL, Professor, HDR]
Post-Doctoral Fellows
- Tom Bourgeade [UL, Post-Doctoral Fellow]
- Constance Douwes [UL, Post-Doctoral Fellow, until Mar 2025]
- François Effa [UL, ATER, from Oct 2025]
- Natalia Tomashenko [UL, ATER, from Oct 2025]
PhD Students
- Louis Abel [UL, until Feb 2025]
- Jean Eudes Ayilo [INRIA]
- Sofiane Azzouz [UL]
- Raphael Bagat [CNRS]
- Zahra Hafida Benslimane [CEA]
- Doria Bonzi [UL, from Oct 2025]
- Aine Drelingyte [UL]
- Orane Dufour [UL]
- Guilhem Faure [INRIA]
- Imed Eddine Ghebriout [CNRS]
- Mickaella Grondin-Verdon [UL, from Oct 2025]
- Mickaella Grondin-Verdon [UL, ATER, until Aug 2025]
- Taous Iatariene [ORANGE, CIFRE]
- Isobelle Miles [INRIA, from Feb 2025]
- Mayank Mishra [UL]
- Nasser-Eddine Monir [UL]
- Nhat Nam Nguyen [UL, from Nov 2025]
- Robin San Roman [META, CIFRE]
- Alex Stasica [INRIA, from Sep 2025]
- Natalia Tomashenko [UL, ATER, from Sep 2025]
Technical Staff
- Louis Abel [UL, Engineer, from Dec 2025]
- Romuald Ait Bachir [CNRS, Engineer, from Oct 2025]
- Hugo Bergerat [UL, Engineer, from Oct 2025]
- Theo Biasutto-Lervat [INRIA, Engineer]
- Sam Bigeard [INRIA, Engineer]
- Emma Granier [UL, Engineer, from Nov 2025]
- Colombe M'Boungou [INRIA, Engineer, from Feb 2025]
- Malek Yaich [INRIA, Engineer]
Interns and Apprentices
- Elliot Abarca [INRIA, Intern, from Apr 2025 until Jun 2025]
- Abbas Awarkeh [CNRS, Intern, from Mar 2025 until Sep 2025]
- Nacera Elarbi Tolehi [UL, Intern, from Mar 2025 until Sep 2025]
- Tian Huang [UL, Intern, from Mar 2025 until Sep 2025]
- Camille Lavigne [INRIA, Intern, from Jun 2025 until Aug 2025]
- Kehina Manseri [INRIA, Intern, from Feb 2025 until Jul 2025]
- Celie Ponroy [INRIA, Intern, from Apr 2025 until Jun 2025]
- Nina Rouffaud [UL, Intern, from Jun 2025 until Jun 2025]
- Nicolas Russo [UL, Intern, from Jul 2025 until Aug 2025]
Administrative Assistants
- Emmanuelle Deschamps [INRIA]
- Cecilia Olivier [INRIA]
2 Overall objectives
In Multispeech, we consider speech as a multimodal signal with different facets: acoustic, facial, articulatory, gestural, etc. Historically, speech was mainly considered under its acoustic facet, which is still the most important one. However, the acoustic signal is a consequence of the temporal evolution of the shape of the vocal tract (pharynx, tongue, jaws, lips, etc.); this is the articulatory facet of speech. Since the vocal tract configuration is partly reflected in facial movements, these constitute the primary visual facet of speech.
The face can provide additional information on the speaker's state through facial expressions. Speech can be accompanied by gestures (head nodding, arm and hand movements, etc.) that help to clarify the linguistic message. In some cases, such as in sign language, these gestures can bear the main linguistic content and be the only means of communication.
The general objective of Multispeech is to study the analysis and synthesis of the different facets of this multimodal signal and their multimodal coordination in the context of human-human or human-computer interaction. While this multimodal signal carries all of the information used in spoken communication, the collection, processing, and extraction of meaningful information by a machine system remains a challenge. In particular, to operate in real-world conditions, such a system must be robust to noisy or missing facets. We are especially interested in designing models and learning techniques that rely on limited amounts of labeled data and that preserve privacy.
Therefore, Multispeech addresses data-efficient, privacy-preserving learning methods, and the robust extraction of various streams of information from speech signals. These two axes will allow us to address multimodality, i.e., the analysis and the generation of multimodal speech and its consideration in an interactional context.
The outcomes will crystallize into a unified software platform for the development of embodied voice assistants. Our main objective is that the results of our research feed this platform, and that the platform itself facilitates our research and that of other researchers in the general domain of human-computer interaction, as well as the development of concrete applications that help humans to interact with one another or with machines. We will focus on two main application areas: language learning and health assistance.
3 Research program
3.1 Axis 1 — Data-efficient and privacy-preserving learning
A central aspect of our research is to design machine learning models and methods for multimodal speech data, whether acoustic, visual or gestural. By contrast with big tech companies, we focus on scenarios where the amount of speech data is limited and/or access to the raw data is infeasible due to privacy requirements, and where few or no human labels are available.
3.1.1 Axis 1.1 — Integrating domain knowledge
State-of-the-art methods for speech and audio processing are based on discriminative neural networks trained for the targeted task. This paradigm faces major limitations: lack of interpretability, large data requirements and inability to generalize to unseen classes or tasks. Our approach is to combine the representation power of deep learning with our acoustic expertise to obtain smaller generative models describing the probability distribution of speech and audio signals. Particular attention will be paid to designing physically-motivated input layers, output layers, and unsupervised representations that capture complex-valued, multi-scale spectro-temporal dependencies. Given these models, we derive computationally efficient inference algorithms that address the above limitations. We also explore the integration of deep learning with symbolic reasoning and common-sense knowledge to increase the generalization ability of deep models.
3.1.2 Axis 1.2 — Learning from little/no labeled data
While supervised learning from fully labeled data is economically costly, unlabeled data are inexpensive but provide intrinsically less information. Our goal is to learn representations that disentangle the attributes of speech by equipping the unsupervised representation learning methods above with supervised branches exploiting the available labels and supervisory signals, and with multiple adversarial branches overcoming the usual limitations of adversarial training.
3.1.3 Axis 1.3 — Preserving privacy
To preserve privacy, speech must be transformed to hide the users' identity and other privacy-sensitive attributes (e.g., accent, health status) while leaving intact those attributes which are required for the task (e.g., phonetic content for automatic speech recognition) and preserving the data variability for training purposes. We develop strong attacks to evaluate privacy. We also seek to hide personal identifiers and privacy-sensitive attributes in the linguistic content, focusing on their robust extraction and replacement from speech signals.
3.1.4 Axis 1.4 — Reducing computational footprint
This axis includes proposing reliable methods to quantify fine-grained energy consumption, computational footprint (in terms of operations), and memory footprint, so as to identify potential bottlenecks in the network at training and test time before applying compression methods.
3.2 Axis 2 — Extracting information from speech signals
In this axis, we focus on extracting meaningful information from speech signals in real conditions. This information can be related (1) to the linguistic content, (2) to the speaker, and (3) to the speech environment.
3.2.1 Axis 2.1 — Linguistic speech content
Speech recognition is the main means to extract linguistic information from speech. Although it is a mature research area, performance drops in real-world environments motivate the development of speech enhancement and source separation methods that improve robustness in such scenarios. Semantic content analysis is required to interpret the spoken message. The challenges include learning from little real data, quickly adapting to new topics, and robustness to speech recognition errors. The detection and classification of hate speech in social media videos will also be considered as a benchmark, thereby extending the work on text-only detection. Finally, we also consider extracting phonetic and prosodic information to study the categorization of speech sounds and certain aspects of prosody by learners of a foreign language.
3.2.2 Axis 2.2 — Speaker identity and states
Speaker identity is required for the personalization of human-computer interaction. Speaker recognition and diarization are still challenging in real-world conditions. The speaker states that we aim to recognize include emotion and stress, which can be used to adapt the interaction in real time.
3.2.3 Axis 2.3 — Speech environment information
We develop audio event detection methods that exploit both strongly/weakly labeled and unlabeled data, operate in real-world conditions, can discover new events, and provide a semantic interpretation. Modeling the temporal, spatial and logical structure of ambient sound scenes over a long duration is also considered.
3.3 Axis 3 — Multimodal Speech: generation and interaction
In our project, we consider speech as a multimodal object, where we study (1) multimodality modeling and analysis, focusing on multimodal fusion and coordination, (2) the generation of multimodal speech by taking into account its different facets (acoustic, articulatory, visual, gestural), separately or combined, and (3) interaction, in the context of human-human or human-computer interaction.
3.3.1 Axis 3.1 - Multimodality modeling and analysis
The study of multimodality concerns the interaction between modalities, their fusion, coordination and synchronization for a single speaker, as well as their synchronization across the speakers in a conversation. We focus on audiovisual speech enhancement to improve the intelligibility and quality of noisy speech by considering the speaker’s lip movements. We also consider semi-supervised, weakly supervised and self-supervised learning methods for multimodal data to obtain interpretable representations that disentangle, in each modality, the attributes related to linguistic and semantic content, emotion, reaction, etc. Finally, we study the contribution of each modality to the intelligibility of spoken communication.
3.3.2 Axis 3.2 - Multimodal speech generation
Multimodal speech generation refers to articulatory, acoustic, and audiovisual speech synthesis techniques which output one or more facets. Articulatory speech synthesis relies on 2D and 3D modeling of the dynamics of the vocal tract from real-time MRI (rtMRI) data. We consider the generation of the full vocal tract, from the vocal folds to the lips, first in 2D then in 3D. This comprises the generation of the face and the prediction of the glottis opening. We also consider audiovisual speech synthesis. Both the animation of the lower part of the face related to speech and of the upper part related to the facial expressions are considered, and development continues towards a multilingual talking head. We further investigate the modeling of expressivity for both audio-only and audiovisual speech synthesis, aiming at better control of expressivity by considering several disentangled attributes at the same time.
3.3.3 Axis 3.3 — Interaction
Interaction is a new field of research for our project-team that we will approach gradually. We start by studying the multimodal components (prosody, facial expressions, gestures) used during interaction, both by the speaker and by the listener, where the goal is to simultaneously generate speech and gestures for the speaker and regulatory gestures for the listener. We will introduce different dialog bricks progressively: spoken language understanding, dialog management, and natural language generation. Dialog will be considered in a multimodal context (gestures, emotional states of the interlocutor, etc.) and we will break the classical dialog management scheme to dynamically account for the interlocutor's evolution during the speaker's response.
3.4 Software platform: Multimodal Voice assistant
This research program aims to develop a unified software platform for embodied voice assistants, fueled by our research outcomes. The platform will not only aid our research but also facilitate other researchers in the field of human-computer interaction. It will also help in creating practical applications for human interactions, with a primary focus on language learning and health assistance.
4 Application domains
The approaches and models developed in Multispeech will have several applications to help humans interact with one another or with machines. Each application will typically rely on an embodied voice assistant developed via our generic software platform or on individual components, as presented above. We will put special effort into two application domains: language learning and health assistance. We chose these domains mainly because of their economic and social impact. Moreover, many outcomes of our research will be naturally applicable in these two domains, which will help us showcase their relevance.
4.1 Language Learning
Learning a second language, or acquiring the native language for people suffering from language disorders, is a challenge for the learner and represents a significant cognitive load. Many scientific activities have therefore been devoted to these issues, both from the point of view of production and perception. We aim to show the learner (native or second language) how to articulate the sounds of the target language by illustrating articulation with a talking head augmented with the vocal tract, which allows the speech articulators to be animated. Moreover, based on the analysis of the learner’s production, an automatic diagnosis can be envisaged. However, reliable diagnosis remains an open challenge, as it depends on the accuracy of speech recognition and prosodic analysis techniques.
4.2 Health Assistance
Speech technology can facilitate healthcare access for all patients and provides an unprecedented opportunity to transform the healthcare industry. This includes speech disorders and hearing impairments. For instance, it is possible to use automatic techniques to diagnose disfluencies from an acoustic or an audiovisual signal, as in the case of stuttering. Speech enhancement and separation can enhance speech intelligibility for hearing aid wearers in complex acoustic environments, while articulatory feedback tools can be beneficial for articulatory rehabilitation of cochlear implant wearers. More generally, voice assistants are a valuable tool for senior or disabled people, especially for those who are unable to use other interfaces due to lack of hand dexterity, mobility, and/or good vision. Speech technologies can also facilitate communication between hospital staff and patients, and help emergency call operators triage the callers by quantifying their stress level and getting the maximum amount of information automatically thanks to a robust speech recognition system adapted to these extreme conditions.
5 New results
5.1 Axis 1 — Data-efficient and privacy-preserving learning
Participants: Vincent Colotte, Pascale Erhart, Irina Illina, Paul Magron, Slim Ouni, Mostafa Sadeghi, Romain Serizel, Emmanuel Vincent, Jean Eudes Ayilo, Zahra Hafida Benslimane, Sam Bigeard, Constance Douwes, Orane Dufour, Mohamed Imed Eddine Ghebriout, Isobelle Miles, Robin San Roman, Natalia Tomashenko, Malek Yaich.
5.1.1 Axis 1.1 — Integrating domain knowledge
Hybrid signal processing and deep learning.
We propose to combine traditional signal processing-based filtering with deep learning for automatic speech recognition (ASR). More specifically, the beamforming filter processes specific angular sectors based on their spherical polar coordinates before applying an end-to-end multichannel, multi-speaker ASR system. This method is data-independent and training-free. We demonstrate that using a group of beamformed signals improves performance compared to using the same number of raw microphone signals. Moreover, increasing the number of signals used for beamforming further enhances recognition accuracy, leading to a more efficient use of multichannel signals while reducing the overall input load for the ASR system. This strategy is promising for exploiting the best of both worlds, thus reducing the need for supervision while maintaining high performance 15.
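To make the idea of steering a data-independent beamformer toward a direction given in spherical coordinates more concrete, here is a minimal sketch of a far-field delay-and-sum beamformer. The array geometry, the function names, the speed of sound and the choice of delay-and-sum itself are assumptions made for illustration; the cited work may rely on a different filter design.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def unit_vector(azimuth, elevation):
    """Unit vector pointing toward a direction given in radians."""
    return (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )


def steering_vector(mic_positions, azimuth, elevation, freq_hz):
    """Far-field steering vector: one complex phase term per microphone."""
    u = unit_vector(azimuth, elevation)
    phases = []
    for pos in mic_positions:
        # A plane wave coming from direction u reaches a microphone at
        # position p earlier than the origin by (p . u) / c seconds.
        advance = sum(p_k * u_k for p_k, u_k in zip(pos, u)) / SPEED_OF_SOUND
        phases.append(cmath.exp(2j * math.pi * freq_hz * advance))
    return phases


def delay_and_sum_weights(mic_positions, azimuth, elevation, freq_hz):
    """Weights w such that w^H d = 1 in the look direction."""
    d = steering_vector(mic_positions, azimuth, elevation, freq_hz)
    return [x / len(d) for x in d]


def beamformer_response(weights, steering):
    """Complex gain w^H d of the beamformer for a given steering vector."""
    return sum(w.conjugate() * s for w, s in zip(weights, steering))


# Hypothetical 4-microphone linear array with 5 cm spacing along the x axis.
mics = [(0.05 * m, 0.0, 0.0) for m in range(4)]
w = delay_and_sum_weights(mics, azimuth=0.0, elevation=0.0, freq_hz=1000.0)
look = steering_vector(mics, 0.0, 0.0, 1000.0)
side = steering_vector(mics, math.pi / 2, 0.0, 1000.0)
gain_look = abs(beamformer_response(w, look))
gain_side = abs(beamformer_response(w, side))
```

In this toy setting the gain is one in the look direction and lower for a source arriving from the side, which is the property that lets a set of such beams cover distinct angular sectors without any training.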
Generative-based speech enhancement.
A widely used approach for speech enhancement is to directly train a deep neural network (DNN) to estimate clean speech from a noisy input. Despite strong performance, this paradigm faces two main challenges. First, it often requires very large models trained on huge datasets that span many noise types, noise levels, and recording conditions. Second, its generalization is typically limited, with performance degrading in unseen environments. To tackle this problem, we organized the UDASE task of the 7th CHiME challenge, which leverages real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models 6. We also provided comprehensive objective and subjective evaluations of the submitted systems to better understand the limitations of different approaches. In the same research direction, we proposed a novel diffusion-based unsupervised speech enhancement framework 8 that leverages diffusion models, which currently achieve state-of-the-art performance in generative modeling. This framework combines a pre-trained diffusion-based speech generative model with a parametric noise model, and performs speech enhancement through an iterative expectation–maximization procedure. The results confirm that this unsupervised approach is more robust than its supervised counterpart under mismatched conditions.
5.1.2 Axis 1.2 - Learning from little/no labeled data
ASR for regional languages.
Using Mozilla Common Voice, we pursued the collection of speech data covering some of the regional, overseas, and non-territorial languages of France from data archives, media, associations, and individuals. This data will support the training of automatic speech recognition (ASR) and text-to-speech (TTS) models for these languages. We presented a comprehensive overview of the tools already developed, notably ASR 37. The work focused on developing a recognition system based on Whisper. Fine-tuning and error correction techniques using adapted language models were applied for the recognition of Basque, Occitan, Alsatian and Shimaoré.
Accented ASR.
We present two approaches to accented ASR. The first introduces Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent 14. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate 4% and 14% relative Word Error Rate (WER) improvement compared to regular LoRA and to no fine-tuning, respectively, when the accent is unknown. The second approach, BEARD, focuses on adapting Whisper's encoder with unlabeled data 13. It combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder. On the ATCO2 corpus of Air Traffic Control (ATC) communications, using 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, BEARD achieves a relative improvement of 12% compared to the fine-tuned model.
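The general principle of mixing several LoRA experts can be sketched as follows. This is only an illustration under stated assumptions: the softmax gate, the tensor shapes and all names (`mixture_of_loras_forward`, `gate_logits`, etc.) are hypothetical and do not reproduce the MAS-LoRA implementation.

```python
import numpy as np


def lora_delta(A, B, scale=1.0):
    """Low-rank update B @ A added to a frozen weight matrix."""
    return scale * (B @ A)


def mixture_of_loras_forward(x, W, experts, gate_logits):
    """Apply a frozen layer W plus a softmax-weighted mix of LoRA experts.

    x: (d_in,) input vector
    W: (d_out, d_in) frozen pretrained weights
    experts: list of (A, B) pairs with A: (r, d_in) and B: (d_out, r)
    gate_logits: (n_experts,) scores, e.g. produced by an accent classifier
    """
    gate = np.exp(gate_logits - np.max(gate_logits))
    gate = gate / gate.sum()
    y = W @ x
    for g, (A, B) in zip(gate, experts):
        y = y + g * (lora_delta(A, B) @ x)
    return y, gate


rng = np.random.default_rng(0)
d_in, d_out, rank, n_accents = 8, 6, 2, 3
W = rng.standard_normal((d_out, d_in))
experts = [
    (rng.standard_normal((rank, d_in)), rng.standard_normal((d_out, rank)))
    for _ in range(n_accents)
]
x = rng.standard_normal(d_in)

# Unknown accent: uniform gate, so every expert contributes equally.
y_unknown, gate_unknown = mixture_of_loras_forward(
    x, W, experts, np.zeros(n_accents)
)
# Known accent: a large logit makes the gate pick one expert almost exclusively.
y_known, gate_known = mixture_of_loras_forward(
    x, W, experts, np.array([20.0, 0.0, 0.0])
)
```

The frozen matrix `W` is shared by all accents, and only the small `(A, B)` pairs are accent-specific, which keeps the number of trainable parameters low.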
TTS for regional languages.
We have been working on speech synthesis for the Alsatian language. Using a system based on ToucanTTS, we first explored the possibility of integrating a specific phonetizer. The difficulty in achieving high-quality synthesis is mainly due to the lack of data for this type of language and a lack of normalization in the textual transcription. Speech synthesis system architectures are generally designed for a critical volume of training data, so we used several multilingual synthesis systems to evaluate their performance on this type of regional language. At the same time, a campaign to acquire oral data in collaboration with Mozilla Common Voice is underway and should increase the volume of data, especially with dialect variations. Alsatian contains several dialects that bring great variability in terms of both vocabulary and pronunciation. Current TTS and ASR systems are still poorly adapted to this type of regional accent.
Joint punctuated + normalized ASR (limited punctuated data).
We proposed two end-to-end strategies for joint punctuated and normalized ASR when punctuated supervision is scarce 16: (i) using a language model to generate punctuated targets from normalized transcripts, improving out-of-domain performance (up to 17% relative PC-WER reduction), and (ii) a single conditional decoder that outputs either punctuated or normalized transcripts on demand. The conditional-decoder model achieved a 42% relative PC-WER reduction compared with Whisper-base and remained effective with as little as 5% punctuated training data.
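For readers less familiar with these metrics, the sketch below shows how a word error rate and a relative reduction can be computed. It assumes that PC-WER is a WER scored on text that keeps punctuation and casing, and it treats punctuation as attached to words, which is a simplification; the helper names and the normalization rule are illustrative, not the scoring protocol of the cited work.

```python
import string


def normalize_text(text):
    """Lowercase and strip ASCII punctuation (a simple, assumed normalization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, normalize=False):
    """Word error rate; with normalize=False, punctuation and case count."""
    if normalize:
        reference = normalize_text(reference)
        hypothesis = normalize_text(hypothesis)
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, system):
    """Relative error reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


reference = "Hello, how are you?"
hypothesis = "hello how are you"
normalized_wer = wer(reference, hypothesis, normalize=True)
punctuated_wer = wer(reference, hypothesis)
```

With these illustrative strings, the normalized score is zero while the punctuation- and case-sensitive score is 0.5, since two of the four reference tokens differ. Likewise, `relative_reduction(0.50, 0.29)` gives 0.42, i.e. a 42% relative reduction (made-up absolute values, used only to show the arithmetic).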
5.1.3 Axis 1.3 - Preserving privacy
Speaker anonymization.
Speech signals convey a lot of private information. To protect speakers, we pursued our investigation of x-vector based voice anonymization, which relies on splitting the speech signal into the speaker (x-vector), phonetic and pitch features and resynthesizing the signal with a different target x-vector. In particular, we measured and reduced the amount of speaker information carried by phoneme durations 32, 33. Our method, which extracts an embedding of phoneme durations with an ECAPA-TDNN model, achieves a low equal error rate (EER) of 2% for 8 test signals.
We looked at the privacy of video game users 9. Speech in video games is a spontaneous human conversation, which is associated with the player's pseudonym and could be recorded by any player using screen capture software to build and augment identifying records. We also looked at privacy in multi-speaker recordings 43, using a target speaker extraction module to extract the speaker to be anonymized before speaker anonymization and speech recombination. We achieved an EER of 36% and a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 18% on mixtures of two speakers (SparseLibri2Mix).
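The equal error rate used above can be approximated from verification scores with a simple threshold sweep. The following sketch is a generic illustration (the function name and the toy scores are made up); evaluation toolkits usually interpolate between thresholds rather than picking the closest one.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: error rate at the threshold where the false
    acceptance rate and the false rejection rate are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted as "same speaker" when its score is >= t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Attacker scores that separate same-speaker and different-speaker trials
# perfectly: the speaker is fully re-identifiable.
eer_separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])

# Attacker scores that carry no speaker information at all.
eer_chance = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
```

From a privacy standpoint, an attacker EER close to 0% means speakers are easy to re-identify, whereas an EER close to 50% means the attacker does no better than chance.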
Privacy metrics and evaluation.
Beyond anonymization systems themselves, we investigated how to measure and challenge privacy guarantees. We looked at the definition of new privacy metrics inspired by the Article 29 Working Party’s Opinion 05/2014 on Anonymization Techniques, which characterize Singling Out, Linkability, and Inference 34. Experiments across various attack scenarios reveal that, while the EER remains stable, Singling Out and Linkability vary much more. Finally, we looked at the emotion displayed by anonymized speech data. Two alternative strategies were examined to ensure that the original emotion is kept: first integrating emotion embeddings from a pre-trained emotion encoder, and second processing the speaker by a speaker anonymizer and by an emotion indicator to select the emotion-matched SVM accurately 7.
We published the rules of the 1st VoicePrivacy Attacker Challenge 31, which focuses on developing speaker re-identification attacks against three baseline anonymization systems and four anonymization systems developed by the VoicePrivacy 2024 Challenge participants. The best attacker systems reduced the EER by 25–44% relative to the semi-informed attack used in the VoicePrivacy 2024 Challenge.
Sensitive content replacement.
Complementarily, we explored privacy protection at the content level rather than the speaker level. As part of the ANR SpeechPrivacy project, we explored the replacement of sensitive speech content. The process involves detecting sensitive personal data in the transcript, such as names, addresses or references to age. The substitution is carried out in the acoustic signal itself. The work focused on re-synthesising the original sentence using the codec approach. Initial results show that replacing tokens in the same prosodic context allows for good integration of the new elements.
5.1.4 Axis 1.4 — Reducing computational footprint
We studied the relation between the performance of audio generation models and their energy consumption. In particular, as most recent models are based on diffusion, we focused on the relation between the number of diffusion iterations and the quality of the generated signal 29.
During her PhD thesis work, Zahra Benslimane studied in detail the relationship between algorithm latency, computational footprint and speech enhancement performance. She then proposed architecture simplifications to achieve low-latency (2 ms) and low-complexity processing (the number of operations is divided by a factor of 100 compared to the original algorithm) while preserving the speech enhancement performance.
5.2 Axis 2 — Extracting information from speech signals
Participants: Irina Illina, Paul Magron, Mostafa Sadeghi, Romain Serizel, Emmanuel Vincent, Romuald Ait Bachir, Raphaël Bagat, Doria Bonzi, Aine Drelingyte, Taous Iatariene, Mayank Mishra, Nasser-Eddine Monir.
5.2.1 Axis 2.1 — Linguistic speech content
Joint beamforming + speaker-attributed ASR for meetings.
We introduced a multichannel beamforming front-end for distant-microphone speaker-attributed ASR, including a real-data alignment/augmentation method to pretrain a neural beamformer 17. On AMI, channel-fusion baselines did not help, but beamforming did: fine-tuning SA-ASR on fixed beamformer output reduced WER by 8% relative, and joint fine-tuning with a neural beamformer achieved a 9% relative WER reduction.
LLM compression.
Current LLM compression typically requires two steps: calibration-based compression followed by costly continued pretraining on billions of tokens. We eliminate this second step with a one-shot compression method that locally distills low-rank weights 30, leveraging the observation that activations are low-rank. SVD initialization, a joint teacher-student activation loss, and local gradient updates ensure fast convergence and low memory usage. Our method compresses Mixtral-8x7B in minutes on a single A100 GPU, removing 10B parameters while retaining over 95% of the performance, and reduces Phi-2 3B by 40% using only 13M calibration tokens, yielding a model competitive with similarly-sized alternatives. The approach generalizes beyond transformer architectures.
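As an illustration of the SVD initialization step only, the sketch below factorizes a weight matrix into two low-rank factors with a truncated SVD. It is a simplified stand-in applied directly to a synthetic weight matrix: the function names and dimensions are hypothetical, and the activation-based loss and local distillation of the cited method are not reproduced.

```python
import numpy as np


def svd_low_rank_init(W, rank):
    """Return factors (U_r, V_r) such that U_r @ V_r is the best rank-`rank`
    approximation of W in Frobenius norm (Eckart-Young theorem)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    U_r = U[:, :rank] * sqrt_s          # shape (d_out, rank)
    V_r = sqrt_s[:, None] * Vt[:rank]   # shape (rank, d_in)
    return U_r, V_r


def low_rank_parameter_count(d_out, d_in, rank):
    """Number of parameters stored by the two factors."""
    return rank * (d_out + d_in)


rng = np.random.default_rng(0)
d_out, d_in, true_rank = 64, 48, 8
# Synthetic weight matrix: exactly rank 8 plus a small amount of noise.
W = rng.standard_normal((d_out, true_rank)) @ rng.standard_normal((true_rank, d_in))
W += 1e-3 * rng.standard_normal((d_out, d_in))

U_r, V_r = svd_low_rank_init(W, rank=true_rank)
rel_error = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
dense_params = d_out * d_in
low_rank_params = low_rank_parameter_count(d_out, d_in, true_rank)
```

Here the dense layer stores 64 × 48 = 3072 values whereas the two factors store 8 × (64 + 48) = 896, and the relative reconstruction error stays small because the synthetic matrix is close to rank 8. Real weight matrices are rarely this close to low-rank, which is presumably why a distillation step is needed on top of the initialization.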
Pruning for Low-resource Speech Recognition.
Pruning large pre-trained transformers for low-resource languages is challenging due to limited retraining data. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara (32h of speech-to-text data), we propose a pruning recipe combining low-rank embedding decomposition with feature distillation and layer merging. This recipe bypasses vocabulary pruning, which is unsuitable given frequent code-switching. The resulting model is 48% smaller and 2.15x faster on a MacBook Air M1, while preserving 90% of the original performance.
Speech Language Modeling for Wolof.
We present our work on training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality unsupervised speech data, and show that continued pretraining HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.
5.2.2 Axis 2.2 — Speaker identity and states
Speaker localisation and tracking.
We investigated the problem of localizing and tracking the position of speakers while ensuring track consistency even after speech pauses. We proposed a formal description of the task together with a dataset, a set of metrics adapted from object tracking in computer vision and a first baseline 24. We then proposed refined approaches relying on speaker identity to ensure track consistency 22 and targeting decisions on small temporal context to move towards low latency processing 23.
5.2.3 Axis 2.3 — Speech in its environment
Ambient sound detection and separation.
Pursuing our involvement in the ambient sound analysis community, we initiated a novel task called spatial semantic segmentation of sound scenes (S5), which combines ambient sound detection and audio source separation. To foster this new topic, we co-organized a task as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 challenge 35. We also addressed the difficult topic of designing a metric for this joint task. Indeed, S5 systems can be evaluated with two individual metrics, i.e., one for source separation and another for sound event classification, but this makes it challenging to compare S5 systems. Therefore, we proposed and analyzed joint metrics that can better reflect the actual contribution of classification and separation errors 42.
In order to assess our continued involvement in this topic, we also published an analysis of the evolution of the tasks proposed in the DCASE challenge over its past 10 editions 26.
Speech enhancement.
Targeting speech enhancement for hearing aids, we continued investigating the performance of speech enhancement at a fine-grained phonetic level. The goal here is to link the results obtained with objective metrics to the outcome of listening tests conducted at our partner site (Institut de l'audition). To that end, we conducted an extensive evaluation of state-of-the-art speech enhancement algorithms at the phoneme level (rather than at the commonly considered utterance level), and across genders. Results show that the tested algorithms reduce interference better, and with fewer artifacts, on female speech, particularly in plosives, fricatives, and vowels. They also perform better on female speech in terms of perceptual and speech recognition metrics 27. We exploited these findings in a subsequent work, where we proposed perceptually informed variants of common speech enhancement training losses. These are designed to emphasize time-frequency regions where speech is prominent or where the interfering noise is particularly strong, in order to better account for variability across phonemes. Spectral analysis indicates better consonant reconstruction, which points to a better preservation of certain acoustic cues 28.
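To make the idea of emphasizing selected time-frequency regions concrete, here is a minimal sketch of a weighted spectral loss. It is not the loss proposed in 28: the weighting rule (weights proportional to the target magnitude, with a floor) and all function names are assumptions made for illustration only.

```python
def speech_prominence_weights(target, floor=0.1):
    """Hypothetical weighting rule: bins where the clean-speech magnitude is
    large get weights close to 1, quiet bins get the floor value."""
    peak = max(max(row) for row in target)
    if peak <= 0:
        return [[1.0 for _ in row] for row in target]
    return [[floor + (1.0 - floor) * t / peak for t in row] for row in target]


def weighted_spectral_loss(estimate, target, weights):
    """Weighted mean squared error over time-frequency bins."""
    num = sum(w * (e - t) ** 2
              for e_row, t_row, w_row in zip(estimate, target, weights)
              for e, t, w in zip(e_row, t_row, w_row))
    den = sum(w for w_row in weights for w in w_row)
    return num / den


# Toy magnitude "spectrogram" with one prominent and one quiet bin.
target = [[1.0, 0.0]]
weights = speech_prominence_weights(target)
error_on_speech = weighted_spectral_loss([[0.5, 0.0]], target, weights)
error_on_silence = weighted_spectral_loss([[1.0, 0.5]], target, weights)
print(error_on_speech > error_on_silence)
```

With this kind of weighting, the same absolute error costs more in a speech-dominated bin than in a quiet one, which is the behaviour the paragraph above describes qualitatively.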
Speech intelligibility reduction.
We investigated masking noise generation to reduce speech intelligibility in open-plan offices. The goal is to reduce the annoyance caused by concurrent speech from co-workers. While commercial systems rely on stationary noise at a constant level (resulting in overexposure to sound), we explored adjusting the noise level to specific parts of speech (different phoneme classes) and assessed the impact on speech intelligibility 18.
5.3 Axis 3 — Multimodal Speech: generation and interaction
Participants: Théo Biasutto-Lervat, Domitille Caillat, Vincent Colotte, Yves Laprie, Slim Ouni, Mostafa Sadeghi, Emmanuel Vincent, Louis Abel, Hugo Bergerat, Jean Eudes Ayilo, Sofiane Azzouz, Tom Bourgeade, Guilhem Faure, Mickaella Grondin-Verdon, Colombe M'Boungou, Nhat Nam Nguyen, Alex Stasica.
5.3.1 Axis 3.1 — Multimodality modeling and analysis
Audio-visual speech enhancement (AVSE).
We addressed audio–visual fusion for unsupervised speech enhancement with diffusion models. Our framework combines a visually conditioned diffusion speech prior with an NMF noise model 10. The diffusion prior is first pre-trained on clean speech conditioned on video: visual features are extracted from the input stream and fused with audio features in the diffusion network via cross-attention. At inference, the model performs iterative posterior sampling within the reverse diffusion process, while the NMF noise parameters are updated using intermediate speech estimates. Experiments show improvements over the audio-only variant, better generalization than a recent supervised-generative AVSE approach, and a more favorable speed–quality trade-off than prior diffusion-based inference.
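For readers unfamiliar with NMF, the sketch below factorizes a small non-negative matrix V (standing in for a noise spectrogram, frequency by time) into spectral templates W and activations H using the classical multiplicative updates for the Euclidean cost. This is a generic textbook variant written in plain Python; the divergence, update rule and initialization used in 10 may differ, and the function names are ours.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=100, seed=0, eps=1e-12):
    """Approximate V (n_freq x n_frames) by W (n_freq x rank) times H (rank x n_frames)."""
    rng = random.Random(seed)
    n_freq, n_frames = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_freq)]
    h = [[rng.random() + 0.1 for _ in range(n_frames)] for _ in range(rank)]
    errors = [squared_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_frames)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_freq)]
        errors.append(squared_error(v, w, h))
    return w, h, errors
```

In the framework described above, updates of this kind would be interleaved with the reverse diffusion steps, with the noise model refitted from the intermediate speech estimates; that coupling is not shown here.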
Automatic isolated sign recognition in French Sign Language (LSF).
We investigated an isolated sign recognition system for sign language WordNet resources, aimed at identifying and grouping phonologically similar signs across different sign languages with minimal training data and providing similarity suggestions for manual validation 25. The approach relied on video-only analysis, combining key-frame extraction, pose estimation with MediaPipe, and normalization strategies to mitigate biases related to handedness, non-dominant arm position, and signer morphology. Representations were analyzed using Uniform Manifold Approximation and Projection (UMAP) and compared with a Vision Transformer for pairwise similarity ranking. Evaluation on a manually annotated subset of the WordNet corpus achieved reasonable accuracy.
5.3.2 Axis 3.2 — Multimodal speech generation
Acquisition of rt-MRI (real-time Magnetic Resonance Imaging) data.
This year, in collaboration with the IADI laboratory (P.-A. Vuissoz), we started the acquisition of a large single-speaker corpus of Arabic. Since no corpus was available, a set of 2000 sentences was designed with the help of Gemini. The same approach could be used for the other languages involved in the new ANR ArtAny project. In addition, we recorded a speaker producing several non-standard voice qualities (falsetto, very deep or cracked voice) in order to study the areas of variability of the vocal tract articulators and to generalise the automatic articulator tracking algorithms.
Acoustic to articulatory inversion.
Acoustic to articulatory inversion is a major processing challenge, with a wide range of applications from speech synthesis to feedback systems for language learning and rehabilitation. Last year, we conducted the first experiments on acoustic to articulatory inversion for the tongue, which is the most mobile and deformable speech articulator 11. This was the first time that inversion covered the entire contour of the tongue (from its root to its tip), since inversion generally only covers a few points corresponding to sensors attached to the tongue. We extended the approach to all articulators (lips, tongue, velum, epiglottis, arytenoid cartilages, glottis) by sizing the output layers so as to clearly separate the articulators. The average accuracy is 1.67 mm, to be compared with the 1.6 mm pixel size of the images 12, 36. To our knowledge, this is the first inversion experiment to recover the complete geometry of the vocal tract in the form of the contour of all articulators in the mid-sagittal plane.
Quaternion pose encoding and contrastive learning for robust sign language production.
We tackled a key challenge in neural sign language production: high intra-class variability caused by signer morphology and stylistic differences. Building on Progressive Transformers, we introduced two complementary improvements 19. First, we represented poses using bone rotations in quaternion space and optimized them with a geodesic loss, which better captures angular motion and improves joint articulation. Second, we added a semantically guided contrastive loss that structures decoder embeddings using sentence-level similarity (via gloss overlap or SBERT), encouraging the model to focus on meaning-relevant motion while reducing anatomical and stylistic bias. On Phoenix14T, a widely used corpus, the contrastive objective alone improves Probability of Correct Keypoint by 16% over the baseline, and combining it with quaternion encoding reduces Mean Bone Angle Error by 6%, highlighting the benefit of skeletal-structure modeling and semantic supervision in Transformer-based sign language production.
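The geodesic distance between two bone rotations expressed as unit quaternions can be sketched as follows. This uses the standard angle formula 2·arccos(|⟨q1, q2⟩|); whether 19 uses exactly this form, a squared variant or another reduction over bones is an assumption on our part, and the function names are illustrative.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit norm."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def geodesic_angle(q1, q2):
    """Rotation angle (radians) between two orientations given as quaternions.
    The absolute value makes q and -q, which encode the same rotation, equivalent."""
    q1, q2 = normalize(q1), normalize(q2)
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))  # clamp guards against rounding above 1


def geodesic_loss(predicted, target):
    """Mean geodesic angle over a list of bones (one quaternion per bone)."""
    angles = [geodesic_angle(p, t) for p, t in zip(predicted, target)]
    return sum(angles) / len(angles)


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
print(geodesic_angle(identity, quarter_turn_z))  # angle of a 90 degree rotation about z
```

Unlike a Euclidean loss on joint coordinates, this distance depends only on the relative rotation between predicted and target bones, not on bone lengths.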
5.3.3 Axis 3.3 — Interaction
Formal description and annotation of co-speech gestures.
A first line of work focused on the formal characterization of gestures, aiming to identify necessary and sufficient descriptive features and to automate their extraction, leading to the definition of six complementary modalities (manuality, trajectory, location, hand configuration, speed, and size) to support objective annotation and gesture-aware neural systems 38. This formalization effort was extended by an in-depth investigation of the spatial aspects of gestures, proposing a dual spatial encoding based on positioning and orientation within dedicated three-dimensional reference spaces, and demonstrating how automatically derived spatial attributes can enrich corpora and support the analysis of gesture-speech relationships 40. In parallel, methodological contributions addressed the practical limitations of manual annotation through the development of COSMOS, a semi-automatic tool based on motion capture data and encoder-decoder models, designed to assist gesture segmentation with limited training data while significantly reducing annotation effort 44.
Recognition and evaluation of co-speech gestures.
Within the broader scientific framework of multimodal communication and speech-gesture modeling, and in parallel with our work on the automatic generation of co-verbal gestures using graph-based neural networks 41, several complementary studies addressed the formal description, annotation, recognition, and evaluation of co-verbal manual gestures, combining approaches from linguistics, computer science, and movement sciences. Finally, an exploratory interdisciplinary study examined the evaluation of hand-gesture synthesis quality, showing that expert annotations can reveal systematic differences between natural and synthetic gestures in terms of communicative efficiency and movement dynamics, and highlighting the need for combined computational and linguistic criteria to assess and improve gesture generation systems 39, 21.
Medical dialog summarization.
In the context of our collaboration with a medical doctor in Paris, we introduced QUARTZ, a framework for task-oriented unsupervised dialogue summarization 20. For medical dialogs, task-specific medical accuracy is important. QUARTZ starts by generating multiple summaries and task-specific question-answer pairs using large language models (LLMs). Summaries are evaluated by having the LLMs respond to task-related questions before (i) selecting the best candidate responses and (ii) identifying the most informative summary. Finally, we finetune the best LLM on the selected summaries. When validated on multiple datasets, QUARTZ achieves competitive zero-shot performance, rivaling fully-supervised state-of-the-art approaches.
6 Bilateral contracts and grants with industry
6.1 Bilateral grants with industry
6.1.1 Meta AI
- Company: Meta AI (France)
- Duration: May 2022 – Apr 2025
- Participants: Robin San Roman, Romain Serizel
- Abstract: This CIFRE grant funds the PhD of Robin San Roman on self-supervised disentangled representation learning of audio data for compression and generation.
6.1.2 Orange Labs
- Company: Orange Labs (France)
- Duration: Mar 2023 – Feb 2026
- Participants: Taous Iatariene, Romain Serizel
- Abstract: This CIFRE grant funds the PhD of Taous Iatariene on sound source tracking.
7 Partnerships and cooperations
7.1 International initiatives
7.1.1 Inria associate team not involved in an IIL or an international program
TrustedSpeech
- Title: Trusted speech dataset generation
- Duration: Jan 2025 – Dec 2027
- Coordinator: Junichi Yamagishi (jyamagis@nii.ac.jp)
- Partners: National Institute of Informatics Tokyo (Japan)
- Inria contact: Emmanuel Vincent
- Summary: The TrustedSpeech associate team will conduct joint research aiming to improve the privacy, fairness and utility of anonymized and synthetic speech data, so as to offer a complete methodology to produce trusted speech datasets.
7.1.2 Participation in other International Programs
ANR-JST CONFLUENCE
- Title: Semantic Segmentation of Complex Sound Scenes on Edge Devices
- Duration: Dec 2024 – Nov 2027
- Coordinator: Sonaide
- Partners: Université de Lorraine, CEA-List (FR), the company Sonaide (FR), Nippon Telegraph and Telephone Corporation (NTT, JP) and Tokyo Metropolitan University (JP)
- Participants: Paul Magron, Mayank Mishra, Romain Serizel
- Abstract: The CONFLUENCE project aims to develop artificial intelligence (AI) technologies for semantic segmentation of acoustic signals that can recognize sound events and separate/isolate the signals of the sound sources forming semantic entities.
7.2 International research visitors
7.2.1 Visits to international teams
R. Bagat: Short stay (15 days) at the National Institute of Informatics Tokyo (Japan) in the framework of the TrustedSpeech associate team
7.3 European initiatives
7.3.1 Horizon Europe
PSST
PSST project on cordis.europa.eu
- Title: Privacy for Smart Speech Technology
- Duration: Feb 2025 – Jan 2030
- Partners: Inria, UNIVERSITE DE LORRAINE, ORANGE SA (FR), KI ELEMENTS GMBH (DE), STICHTING RADBOUD UNIVERSITEIT (NL), VOICEINTERACTION (PT), OMILIA (GR), AALTO KORKEAKOULUSAATIO SR (FI), TECHNISCHE UNIVERSITAT BERLIN (DE), NAVER FRANCE, Commission nationale de l'informatique et des libertés (FR), EVALUATIONS AND LANGUAGE RESOURCES DISTRIBUTION AGENCY (FR), RUHR-UNIVERSITAET BOCHUM (DE), Loihde Advisory Oy, Finland (FI), voice INTER connect GmbH (DE), VOCAPIA RESEARCH (FR), EURECOM (FR), INESC ID (PT), Voicemod Inc. (ES), INSTITUTO SUPERIOR TECNICO (PT), SORBONNE UNIVERSITE (FR)
- Inria contact: Emmanuel Vincent
- Coordinator: Tom Bäckström
- Summary: The PSST joint doctoral training network will train a new cohort of PhD students to develop voice privacy technologies using cutting-edge deep learning methods.
7.3.2 Digital Europe
LLMs4EU
- Title: Large Language Models for the European Union
- Duration: Mar 2025 – Feb 2028
- Partners: Inria (France) and 65 other partners in Europe
- Inria contact: Emmanuel Vincent
- Coordinator: Edouard Geoffrois
- Summary: The LLMs4EU project, coordinated by the Alliance for Language Technologies (ALT-EDIC), brings together Europe's leading players in the field of generative AI to ensure that European companies, and especially SMEs, have access to the tools and resources they need to become competitive in language technologies, in particular Large Language Models (LLMs).
7.4 National initiatives
ANR ENACT
- Title: IA Cluster – Centre Européen en Intelligence Artificielle par l'Innovation
- Duration: Jan 2025 – Dec 2029
- Coordinator: Emmanuel Vincent (until Jun 2025), Jean-Baptiste Mouret (from Jun to Dec 2025)
- Partners: Université de Lorraine, Université de Strasbourg, Inria, CNRS, CHRU de Nancy, Région Grand Est, Métropole Grand-Nancy, Métropole de Strasbourg, Métropole de Metz
- Participants: Emmanuel Vincent, Irina Illina
- Abstract: ENACT is the AI Cluster of Region Grand Est, with a budget of 30 MEUR. It aims to make Grand Est a European leader in artificial intelligence (AI), with a structuring strategy of training, research and innovation designed in a global way to benefit the entire territory of the Region and beyond. Emmanuel Vincent holds a chair with Nancy's hospital on LLMs for emergency medicine, and Irina Illina has a PhD student funded by the project.
ANR Full3DTalkingHead
- Title: Synthèse articulatoire phonétique
- Duration: Apr 2021 – Sep 2025
- Coordinator: Yves Laprie
- Partners: Loria, Gipsa-Lab, LEGI, IADI, LPP
- Participants: Yves Laprie, Slim Ouni, Vinicius Ribeiro
- Abstract: The objective is to realize a complete three-dimensional digital talking head including the vocal tract from the vocal folds to the lips and the face, and integrating the digital simulation of the aero-acoustic phenomena.
ANR ArtAny
- Title: Articulateur universel
- Duration: Nov 2025 – Oct 2030
- Coordinator: IADI (Nancy)
- Partners: IADI (Nancy), LPP (Paris)
- Participants: Yves Laprie, Slim Ouni, Emmanuel Vincent, Vincent Colotte
- Abstract: The Articulator Anything project aims to reconstruct the three-dimensional dynamic evolution of the vocal tract for any language and any speaker. It falls within the field of articulatory synthesis, modeling and simulating the physical process of human speech production using advanced artificial intelligence methods. Current approaches are limited as they rely on static representations of phonemes and fail to capture the temporal dynamics essential for coarticulation and anticipation in natural speech.
ANR CODIM
- Title: COmpositionality and DIscourse Markers
- Duration: Jan 2023 – Dec 2027
- Coordinator: ATILF (Nancy)
- Partners: ATILF (Nancy), LORIA (Nancy), LLF
- Participants: Vincent Colotte
- Abstract: The CODIM project focuses on the two main linguistic resources for organizing monologues or conversations in human languages: Discourse Markers (DMs, such as therefore/donc, well/ben, bon, etc. in English/French) and prosody (in particular, intonation). It will evaluate their status with respect to two major views on communication: compositionality (the possibility of combining meaningful expressions into more complex meaningful expressions) and pattern or construction-based approaches (the idea that language users exploit partly 'frozen' strings of words). We will compare the semantic and prosodic properties of simple and complex French DMs (e.g. ah + bon) found in corpora of written and spoken French.
ANR LLM4all
- Title: Large Language Models for All
- Duration: Oct 2023 – Mar 2027
- Coordinator: Christophe Cerisara
- Partners: LORIA-Synalp, LORIA-Multispeech, LIX, Linagora
- Participants: Irina Illina, Emmanuel Vincent
- Abstract: Large Language Models (LLMs) of sufficient size exhibit outstanding emergent abilities, such as learning from their input context and decomposing a complex problem into a chain of simpler steps. The LLM4all project will thus focus on such large models, or on models with the same level of generic performance, and will propose methods to solve two related fundamental issues: how to update these LLMs automatically, and how to reduce their computing requirements in order to facilitate their deployment.
ANR Lorraine Artificial Intelligence – LOR-AI
- Title: Lorraine Artificial Intelligence – Cofinancement de thèses en IA
- Duration: Sep 2020 – Dec 2025
- Coordinator: Yves Laprie
- Partners: CNRS, Inria, Regional University Hospital Centre (CHRU)
- Participants: Doctoral school of Université de Lorraine
- Abstract: This project on Artificial Intelligence, led by the Université de Lorraine (UL), pursues two objectives by co-funding 12 doctoral theses: on the one hand, to strengthen UL's areas of excellence in AI and in domains tightly connected to AI, particularly Health, and on the other hand, to open other research areas to AI with the aim of achieving scientific breakthroughs.
ANR REFINED
- Title: Real-Time Artificial Intelligence for Hearing Aids
- Duration: Mar 2022 – Mar 2026
- Coordinator: CEA List (Saclay)
- Partners: CEA List (Saclay), Institut de l'audition (Paris), LORIA (Nancy)
- Participants: Paul Magron, Nasser-Eddine Monir, Romain Serizel
- Abstract: The REFINED project brings together audiologists, computer scientists and specialists in hardware implementation to design new speech enhancement algorithms that fit both the needs of patients suffering from hearing loss and the computational constraints of hearing aid devices.
ANR ReNAR
- Title: Reducing Noise with Augmented Reality
- Duration: Feb 2024 – Jan 2028
- Coordinator: CEA List (Saclay)
- Partners: Ircam (Paris), Laboratoire des Sciences du Numérique de Nantes (Nantes), LORIA (Nancy)
- Participants: Romain Serizel, Aine Drelingyte
- Abstract: The aim of the ReNAR project is to design a solution that can attenuate the impact of noise in office working scenarios (in particular in open spaces). We will target two aspects: generating noise maskers that result in sound scenes that are pleasant to hear for workers, and generating signals that can obfuscate surrounding speech.
ANR SPEECHPRIVACY
- Title: Multiple-attribute disentanglement and semantic privacy
- Duration: Feb 2024 – Jan 2028
- Coordinator: Vincent Colotte
- Partners: LORIA (Nancy), EURECOM (Sophia Antipolis), LIA (Avignon)
- Participants: Vincent Colotte, Emmanuel Vincent, Orane Dufour, Natalia Tomashenko
- Abstract: SpeechPrivacy will deliver a flexible solution to privacy preservation based on isolated/disentangled representations and the selective obfuscation/modification of individual attributes beyond the usual voice identity/sex and sensitive keywords.
ANR Syncogest
- Title: Gesture and Speech Synchronization
- Duration: Apr 2025 – Mar 2029
- Coordinator: Slim Ouni
- Partners: LORIA (Nancy), PRAXILING (Montpellier), EUROMOV (Montpellier)
- Participants: Slim Ouni, Vincent Colotte, Louis Abel, Hugo Bergerat, Domitille Caillat
- Abstract: SYNCOGEST aims to model spontaneous human gestures (facial expressions, postures, and body movements) and their synchronization with speech in face-to-face communication. By combining insights from artificial intelligence, language sciences, and movement sciences, the project will develop deep learning–based models for automatic gesture generation, enabling more natural and effective embodied conversational agents.
PEPR Cybersécurité, projet iPOP
- Title: Protection des données personnelles
- Duration: Oct 2022 – Sep 2028
- Coordinator: Vincent Roca (Inria PRIVATICS)
- Partners: Inria Multispeech (Nancy), PRIVATICS (Lyon), COMETE, PETRUS (Saclay), MAGNET, SPIRALS (Lille), IRISA (Rennes), LIFO (Bourges), DCS (Nantes), CESICE (Grenoble), EDHEC (Lille), CNIL (Paris)
- Participant: Emmanuel Vincent
- Summary: The objectives of iPOP are to study the threats on privacy introduced by new digital technologies, and to design privacy-preserving solutions compatible with French and European regulations. Within this scope, Multispeech focuses on speech data.
Défi Inria COLaF
- Title: Corpus et Outils pour les Langues de France
- Duration: Aug 2023 – Jul 2027
- Coordinator: Slim Ouni and Benoît Sagot (Inria ALMANACH)
- Partners: Inria Multispeech (Nancy), ALMANACH (Paris)
- Participants: Slim Ouni, Sam Bigeard, Vincent Colotte, Emmanuel Vincent, Pascale Erhart
- Summary: This project aims to increase the inclusiveness of speech technologies by releasing open data, models and software for accented French and for regional, overseas and non-territorial languages of France.
DGA DEEP MAUVES
- Title: Deep automatic aircraft speech recognition for non-native speakers
- Duration: Dec 2022 – Dec 2026
- Coordinator: Irina Illina
- Participants: Irina Illina, Raphaël Bagat, Emmanuel Vincent, Romuald Ait Bachir
- Summary: This project proposes methods and tools that increase the usability of ASR systems for non-native speakers in noisy conditions in the aeronautical domain.
ANSES IPIAMA
- Title: Reducing Noise with Augmented Reality
- Duration: Dec 2023 – Dec 2026
- Coordinator: Jean-Pierre Arz, INRS (Nancy)
- Partners: INRS (Nancy), Laboratoire Énergies et Mécanique Théorique et Appliquée (Nancy), LORIA (Nancy)
- Participants: Romain Serizel
- Abstract: The IPIAMA project aims to propose binaural speech intelligibility measurements (with both ears) for people equipped with hearing aids. The project will rely jointly on classic listening tests (reliable but expensive) and models based on data collected in realistic conditions.
8 Dissemination
8.1 Promoting scientific activities
8.1.1 Scientific events: organisation
General chair, scientific chair
- Main organizer, UDICE-U15 workshop on AI: Stronger together – How to train and retain the next generation of talent in Europe and develop efficient and competitive French-German ecosystems?, Nancy, Mar 2025 (E. Vincent)
Member of the organizing committees
- Organizer, 1st VoicePrivacy Attacker Challenge (N. Tomashenko, E. Vincent)
- Challenge co-chair, DCASE Challenge 2025 (R. Serizel)
8.1.2 Scientific events: selection
Member of the conference program committees
- ICASSP 2026 – IEEE International Conference on Acoustics, Speech, and Signal Processing (R. Serizel)
- WASPAA 2025 – IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (R. Serizel)
Reviewer
- ICASSP 2026 - IEEE International Conference on Acoustics, Speech, and Signal Processing (P. Magron, E. Vincent, M. Sadeghi, R. Serizel)
- ICASSP 2025 - IEEE International Conference on Acoustics, Speech, and Signal Processing (I. Illina)
- INTERSPEECH 2025 (P. Magron, I. Illina, Y. Laprie, V. Colotte)
- EUSIPCO 2025 - European Signal Processing Conference (V. Colotte)
- WASPAA 2025 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (P. Magron)
- ASRU 2025 - IEEE Automatic Speech Recognition and Understanding Workshop (I. Illina)
- DCASE 2025 - Workshop on Detection and Classification of Acoustic Scenes and Events (R. Serizel)
- Revue TAL: Traitement Automatique des Langues (I. Illina)
- NAACL 2025, DemoTrack (I. Illina)
- ICMI 2025, Industrial track (S. Ouni)
8.1.3 Journal
Member of the editorial boards
- IEEE Transactions on Audio, Speech and Language Processing (R. Serizel)
Reviewer - reviewing activities
- IEEE Signal Processing Letters (P. Magron, M. Sadeghi)
- IEEE Transactions on Audio, Speech and Language Processing (P. Magron, E. Vincent, M. Sadeghi)
- ACL 2025 - Association for Computational Linguistics (I. Illina)
8.1.4 Invited talks
- Keynote "The rise, fall, and resurgence of NMF for audio source separation", Workshop on Low-Rank Models and Applications (Mons, Belgium), Sep 2025 (P. Magron)
- Seminar "Machine learning for music separation: Combining data-driven models and expert knowledge", University of Strasbourg (Strasbourg, France), May 2025 (P. Magron)
- Keynote "Modéliser la communication parlée multimodale", Workshop RJCP (Paris), Nov 2025 (S. Ouni)
8.1.5 Leadership within the scientific community
- Member of the Steering Committee of ISCA's Special Interest Group on Security and Privacy in Speech Communication (E. Vincent)
- Board member of Le VoiceLab, the association of French voice tech players (E. Vincent)
- Chair of the DCASE Steering Committee (R. Serizel)
- Board member of AFCP - Association Francophone de la Communication Parlée (V. Colotte, S. Ouni)
- Secretary/Treasurer, executive member of AVISA (Auditory-VIsual Speech Association), an ISCA Special Interest Group (S. Ouni)
8.1.6 Scientific expertise
- Scientific Expert for CIFRE grant allocation, Ministère de l'Enseignement supérieur, de la Recherche et de l'Innovation (R. Serizel, S. Ouni)
- Project expert for Direction Générale Déléguée Recherche, Innovation, Valorisation et Ecoles doctorales (I. Illina)
8.1.7 Research administration
- Inria representative on the Lorraine Steering Committee for Open Science (E. Vincent)
- Head of the Pôle scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) of the Université de Lorraine (Y. Laprie)
- Member of the executive board of the Université de Lorraine (Y. Laprie)
- Local correspondent for Inria's Quadran high-risk research programme (Y. Laprie)
- Member of the steering committee for the digital strategy of the Université de Lorraine (Y. Laprie)
- Member of the bureau of the Pôle scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) (I. Illina)
- Member of the committee of the Pôle scientifique Automatique, Mathématiques, Informatique et leurs interactions (AM2I) (I. Illina)
- Member of the RIPEC jury, UL (I. Illina)
- Member of the promotion committee, UL (I. Illina)
- Member of the admission committee for Master TAL, UL (I. Illina)
- Member of the admission committee for ATER, UL, IUT Charlemagne (I. Illina)
- Member of the selection committee for MCF, UL (I. Illina)
- Member of the IUT Charlemagne Council, UL IUT Charlemagne (I. Illina)
- Member of the IUT Charlemagne Restricted Council, UL IUT Charlemagne (I. Illina)
- Member of the PhD grant allocation committee, Avignon University (I. Illina)
- Member of the laboratory council of LORIA (V. Colotte)
- Member of the selection committee for the position of assistant professor at Université de Paris-Saclay (S. Ouni)
- Member of the selection committee for the position of professor at Université de Toulouse (S. Ouni)
- Co-Chair of the selection committee for the position of professor at Université de Lorraine (S. Ouni)
- Member of the repyramidage committee for the position of professor at Université de Toulouse (S. Ouni)
- Member of the evaluation committee of Haut Conseil de l'évaluation de la recherche et de l'enseignement supérieur (HCERES) for LJK (S. Ouni)
- Co-head of the Computer Science track at the IAEM Doctoral School (S. Ouni)
- Chair of the ATER Recruitment Committee, Department of Computer Science, IUT Nancy-Charlemagne (S. Ouni)
- Member of the Comité Utilisateurs des Moyens de Calculs, Inria Research Center at Université de Lorraine (T. Biasutto-Lervat)
- Referent Plateformes-Outils, Inria Research Center at Université de Lorraine (T. Biasutto-Lervat)
8.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
8.2.1 Teaching
- Master: P. Magron
- "Neural networks" (54 hours), M2, UL
- "Professional insertion" (2 hours), M2, IRCAM / Sorbonne University
- Master: M. Sadeghi
- "Machine learning" (20 hours), M1, UL
- "Statistics" (20 hours), M1, UL
- BUT: I. Illina
- Java programming (100 hours), L1, UL
- Linux programming (58 hours), L1, UL
- Advanced Java programming (40 hours), L1, UL
- Supervision of student projects and internships (30 hours), L2, UL
- Master: I. Illina
- Speech recognition and text-to-speech (10 hours), M2, UL
- BUT: R. Serizel
- "Bases informatiques" (14 hours), BUT1, UL
- "Publication web" (84 hours), BUT1, UL
- "Métadonnées internes" (14 hours), BUT1, UL
- "Bases de données relationnelles" (8 hours), BUT1, UL
- "Indexation de contenus multimédias" (16 hours), BUT2, UL
- "Systèmes d'information" (18 hours), BUT2, UL
- "Introduction à l'audio numérique" (14 hours), BUT2, UL
- "Données ouvertes" (8 hours), BUT3, UL
- "Visualisation de données" (8 hours), BUT3, UL
- "Usages de l'IA" (14 hours), BUT3, UL
- Master: R. Serizel
- "Robustesse de la parole" (15 HETD), M2, UL
- "Impacts environnementaux de l'IA" (6 hours), M2, UL
- Eng: R. Serizel
- "Algorithmique" (18 hours), L3, UL
- "Bases de l'apprentissage automatique" (12 hours), M1, UL
- "Impacts environnementaux de l'IA" (21 hours), M2, UL
- BUT: S. Ouni
- Programming in Java (24 hours), BUT1, UL
- Web Programming (24 hours), BUT1, UL
- Graphical User Interface (96 hours), BUT1, UL
- Advanced Algorithms (24 hours), BUT2, UL
- Algorithm analysis (24 hours), BUT3, UL
- Multimedia (24 hours), BUT3, UL
- AI Agent (24 hours), BUT3, UL
- Master: Y. Laprie
- "Speech corpora" (30 hours), M1, UL
- Licence: Y. Laprie
- Phonetics (16 hours), L2, École d'audioprothèse, UL
- Licence: V. Colotte
- Digital literacy and tools (hybrid courses, 50 hours), L1, UL
- System (80 hours), L2-L3, UL
- Introduction to speech processing (20 hours), L3, UL
- Master: V. Colotte
- Integration project: multimodal interaction with Pepper Robot (17 hours), M2, UL
- Multimodal oral communication (24 hours), M2, UL
- AI introduction (9 hours), M2 - intellectual property rights, UL
- Introduction to speech processing (24 hours), M1, UL
- Other: V. Colotte
- Co-responsible for NUMOC (digital literacy through hybrid courses) at UL (for 7000 students)
- Other: S. Ouni
- Co-responsible for the RA-IL track in the BUT Computer Science program, UL
8.2.2 Supervision
- PhD defended: Louis Abel, "Co-speech gesture synthesis: Towards a controllable and interpretable model using a graph deterministic approach", Jan 2025, V. Colotte and S. Ouni 41
- PhD in progress: Nasser-Eddine Monir, "Multichannel speech enhancement for patients with auditory neuropathy spectrum disorders", Dec 2022, R. Serizel and P. Magron
- PhD in progress: Mickaëlla Grondin, "Modeling gestures and speech in interactions", Nov 2021, S. Ouni and D. Caillat (Praxiling).
- PhD in progress: Jean-Eudes Ayilo, "Audio-visual Speech Enhancement: Bridging the Gap between Supervised and Unsupervised Approaches", Oct. 2023, M. Sadeghi and R. Serizel
- PhD in progress: Guilhem Fauré, "End-to-end Speech-to-Sign Language Generation", Oct. 2024, S. Ouni and M. Sadeghi
- PhD in progress: Zahra-Hafida Benslimane, "Embedded speech enhancement for hearing aids", Nov. 2023, Fabrice Auzanneau (CEA-List) and R. Serizel
- PhD in progress: Raphaël Bagat, "Automatic speech recognition for non-native speakers in a noisy environment", Oct 2023, I. Illina and E. Vincent
- PhD in progress: Mohamed Imed Eddine Ghebriout, "LLM adaptation and exploitation for medical emergency call triage", Apr 2024, G. Guibon (LIPN) and E. Vincent
- PhD in progress: Orane Dufour, "Towards a comprehensive speech anonymization framework", Oct 2024, E. Vincent, M. Rouvier (LIA), and P. Magron
- PhD in progress: Aine Drelingyte, "Speech intelligibility attenuation", Nov 2024, Mathieu Lagrange (LS2N) and R. Serizel
- PhD in progress: Lilian Rodriguez, "Detection and anonymization of sensitive content in speech", Oct 2024, Yannick Estève (LIA) and V. Colotte
- PhD in progress: Mayank Mishra, "Semantic segmentation of audio soundscapes on edge devices", Dec 2024, R. Serizel and P. Magron
- PhD in progress: Isobelle Miles, "Regional and low-resource language speech synthesis", Feb 2025, E. Vincent, V. Colotte and P. Erhart (UNISTRA-LILPA)
- PhD in progress: Elio Stasica, "Differential diagnosis of heart attack from speech", Sep 2025, V. Martin, R. Serizel and E. Vincent
- PhD in progress: Doria Bonzi, "Social-behavior-aware chatbot for a communication skills coaching of medical students", Oct 2025, I. Illina, Patrice Gallet and Fabrice Lefèvre
- PhD in progress: Yaya Sy, "Efficient Continued Pre-training of Large Language Models", Nov 2023, C. Cerisara and I. Illina
- PhD in progress: Sofiane Azzouz, "Acoustic to articulatory inversion based on rt-MRI data", Nov 2023, Y. Laprie
- PhD in progress: Nhat-Nam Nguyen, "Multispeaker Acoustic to articulatory inversion based on rt-MRI data", Nov 2025, Y. Laprie
8.2.3 Juries
- Participation in the PhD jury of Thibault Bañeras-Roux (University of Nantes, January 2025), I. Illina, reviewer
- Participation in the PhD jury of Lucas Maison (University of Avignon, November 2025), I. Illina, reviewer
- Participation in the PhD jury of Nicolas André (University of Avignon, December 2025), I. Illina, reviewer
- Participation in the PhD jury of Nathan Griot (University of Avignon, December 2025), I. Illina, reviewer
- Participation in the PhD jury of Adrien Pupier (University of Grenoble, June 2025), I. Illina, examiner
- Participation in the PhD jury of David Genova (Sorbonne University, October 2025), I. Illina, examiner
- Participation in the PhD jury of Paul Primus (Johannes Kepler University, February 2025), R. Serizel, reviewer
- Participation in the PhD jury of Sreenivasa Upadhyaya (KU Leuven, February 2025), R. Serizel, examiner
- Participation in the PhD jury of Benno Weck-Hufnagel (University of Grenoble, July 2025), R. Serizel, reviewer
- Participation in the PhD jury of Benno Weck-Hufnagel (Universitat Pompeu Fabra, October 2025), R. Serizel, reviewer
- Participation in the PhD jury of Modan Tailleur (Ecole Centrale de Nantes, November 2025), R. Serizel, examiner
- Participation in the PhD jury of Ricardo Falcón Pérez (Aalto University, November 2025), R. Serizel, reviewer
- Participation in the PhD jury of Alexis Plaquet (Université Paul Sabatier, December 2025), R. Serizel, reviewer
- Participation in the HDR jury of Angélique Amelot (Université de Lorraine, December 2025), S. Ouni, Chair
- Participation in the HDR jury of Angélique Amelot (Université de Lorraine, December 2025), Y. Laprie, supervisor
- Participation in the PhD jury of Al Oualid Eliraki (University of Grenoble, June 2025), Y. Laprie, reviewer
- Participation in the PhD jury of Nezih Younsi (ISIR, April 2025), S. Ouni, reviewer
- Participation in the PhD jury of Yanis Ouakrim (University of Grenoble, May 2025), S. Ouni, examiner
8.3 Popularization
- "M-PHASIS : un projet de recherche pour lutter contre les discours de haine sur internet", journal "Numérique et société", interview with I. Illina, March 2025
- "Et si nos voix pouvaient aider pour sauver nos langues", RFI, radio broadcast "De Vives Voix", P. Erhart, S. Bigeard, Jan 2026
- "Langues régionales : l'intelligence artificielle au secours de l'alsacien", France Bleu, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
- "Traduction, voix de synthèse... Ces chercheurs veulent que l'IA parle breton ou alsacien", Ouest France, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
- "L'alsacien à l'heure de l'IA : intégrer les langues régionales dans les modèles numériques", Sciences et Avenir, news article, P. Erhart, S. Bigeard, S. Ouni, Oct 2025
- "Künstliche Intelligenz befördert Elsässisch in die digitale Welt", Badische Neueste Nachrichten (BNN), news article, P. Erhart, S. Bigeard, Oct 2025
- "Alsacien 2.0 : quels usages pour les parlers dialectaux alsaciens ?", DNA, news article, P. Erhart, S. Bigeard, Jan 2026
8.3.1 Participation in live events
- Fête de la science, "La puce à l'oreille" (R. Serizel)
- Nuit de la science, "IA pour le son" (R. Serizel)
- Procès du robot, 6 shows (R. Serizel, S. Bigeard)
- Chiche ! "1 scientifique, 1 classe", 1 visit (R. Serizel)
- "Alsacien 2.0 : quels usages pour les parlers dialectaux alsaciens ?", public seminar, Strasbourg, Jan 2026 (P. Erhart, S. Bigeard)
- Press conference for the launch of the "Parole Spontanée" voice collection, Strasbourg (P. Erhart, S. Bigeard, S. Ouni)
8.3.2 Other relevant science outreach activities
- Journée Colaf, annual seminar of Défi Colaf, Paris, June 2025 (S. Bigeard, I. Miles, M. Yaich, G. Fauré, S. Ouni)
9 Scientific production
9.1 Major publications
- 1 (article) Learning emotions latent representation with CVAE for Text-Driven Expressive AudioVisual Speech Synthesis. Neural Networks, 141, 2021, 315-329.
- 2 (article) A Phoneme-Scale Assessment of Multichannel Speech Enhancement Algorithms. Trends in Hearing, 28, December 2024.
- 3 (inproceedings) Asteroid: the PyTorch-based audio source separation toolkit for researchers. Interspeech 2020, Shanghai, China, October 2020.
- 4 (article) Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated. Speech Communication, 141, April 2022, 1-13.
- 5 (article) Privacy and utility of x-vector based speaker anonymization. IEEE/ACM Transactions on Audio, Speech and Language Processing, June 2022.
9.2 Publications of the year
International journals
International peer-reviewed conferences
National peer-reviewed Conferences
Conferences without proceedings
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications