Section: Research Program
Beyond black-box supervised learning
This research axis focuses on fundamental, domain-agnostic challenges in deep learning: the integration of domain knowledge, data efficiency, and privacy preservation. Its results naturally apply to the domains studied in the other two research axes.
Integrating domain knowledge
State-of-the-art methods in speech and audio are based on neural networks trained for the targeted task. This paradigm faces major limitations: lack of interpretability and of guarantees, large data requirements, and inability to generalize to unseen classes or tasks. We intend to investigate deep generative models as a way to learn task-agnostic probabilistic models of audio signals, and to design inference methods that combine and reuse them across a variety of tasks. We will pursue our investigation of hybrid methods that combine the representational power of deep learning with statistical signal processing expertise, leveraging recent optimization techniques for non-convex, non-linear inverse problems. We will also explore the integration of deep learning and symbolic reasoning, both to increase the generalization ability of deep models and to empower researchers and engineers to improve them.
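To illustrate the hybrid viewpoint, the sketch below solves a linear inverse problem y = Ax + noise by gradient descent on a maximum a posteriori (MAP) objective. A fixed Gaussian prior stands in for a learned deep generative model of the signal; the function name, step size, and all values are illustrative assumptions, not the project's actual method.

```python
import numpy as np

def map_estimate(y, A, sigma2, mu, Sigma_inv, lr=1e-3, steps=20000):
    """MAP estimate of x in y = A x + noise (noise variance sigma2),
    under a Gaussian prior N(mu, Sigma) standing in for a learned
    generative model of the clean signal."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        # gradient of the data-fit term ||y - A x||^2 / (2 sigma2)
        grad = A.T @ (A @ x - y) / sigma2
        # gradient of the negative log-prior (x - mu)^T Sigma_inv (x - mu) / 2
        grad += Sigma_inv @ (x - mu)
        x -= lr * grad
    return x
```

In practice the Gaussian log-prior would be replaced by the (non-convex) log-density of a trained generative model, which is exactly where recent non-convex optimization techniques come into play.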
Learning from little/no labeled data
While fully labeled data are costly, unlabeled data are cheap but intrinsically less informative. Weakly supervised learning, based on less expensive incomplete and/or noisy labels, is a promising middle ground. This entails modeling label noise and leveraging that model for unbiased training; the noise model may depend on the labeler, on the spoken context (voice commands), or on the temporal structure (ambient sound analysis). We will also continue studying transfer learning to adapt an expressive (audiovisual) speech synthesizer trained on a given speaker to another speaker for whom only neutral voice data has been collected.
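One standard recipe for unbiased training under label noise is forward loss correction: if the probability T[i, j] of observing noisy label j given true label i is known (or estimated), the model's clean-label predictions are mapped through T before computing the loss, so that minimizing it recovers a predictor of the clean labels. A minimal sketch with hypothetical inputs:

```python
import numpy as np

def forward_corrected_nll(probs, noisy_labels, T):
    """Negative log-likelihood of observed noisy labels, with the model's
    clean-label probabilities mapped through the label-noise transition
    matrix T, where T[i, j] = P(noisy label j | true label i)."""
    noisy_probs = probs @ T  # predicted distribution over *noisy* labels
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked))
```

With T equal to the identity this reduces to the usual cross-entropy; a labeler- or context-dependent noise model, as envisioned above, simply amounts to using a different T per labeler or per context.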
Preserving privacy
Some voice technology companies process users' voices in the cloud and store them for training purposes, which raises privacy concerns. We aim to conceal the speaker's identity and (some) speaker states and traits in the speech signal, and to evaluate the result in terms of automatic speech/speaker recognition accuracy and subjective quality, intelligibility, and identifiability, possibly after removing privacy-sensitive words from the training data. We will also explore semi-decentralized learning methods for model personalization, and seek to obtain statistical guarantees.
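As a minimal sketch of the semi-decentralized setting, the snippet below implements one round of federated averaging for a toy linear least-squares model: raw data stays on each client, and only model weights are shared with the server. The function, model, and hyperparameters are illustrative assumptions, not the project's actual protocol.

```python
import numpy as np

def federated_round(global_w, client_data, lr=0.1, local_steps=10):
    """One round of federated averaging: each client refines a copy of the
    global weights on its own data (which never leaves the device), then
    the server averages the resulting copies."""
    updates = []
    for X, y in client_data:  # each client holds its own (X, y)
        w = global_w.copy()
        for _ in range(local_steps):
            # local gradient step on the client's least-squares loss
            w -= lr * X.T @ (X @ w - y) / len(y)
        updates.append(w)
    return np.mean(updates, axis=0)  # server-side aggregation
```

Personalization would keep part of the model client-specific instead of averaging everything, and statistical guarantees would come from adding, e.g., differentially private noise to the shared updates.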