2025 Activity Report
Project-Team DATAVERS
RNSR: 202524718N
- Research center: Inria Centre at the University of Lille
- In partnership with: Université de Lille, Centre Hospitalier Universitaire de Lille
- Team name: From health Data universe to advances in statistical learning
- In collaboration with: Laboratoire Paul Painlevé (LPP), Evaluation des technologies de santé et des pratiques médicales
Creation of the Project-Team: 2025 August 01
Keywords
Computer Science and Digital Science
- A3.1.4. Uncertain data
- A3.1.10. Heterogeneous data
- A3.2.3. Inference
- A3.3.2. Data mining
- A3.3.3. Big data analysis
- A5.2. Data visualization
- A5.9.2. Estimation, modeling
- A6.2.3. Probabilistic methods
- A6.2.4. Statistical methods
- A6.3.3. Data processing
- A9.2. Machine learning
- A9.2.1. Supervised learning
- A9.2.2. Unsupervised learning
- A9.2.5. Bayesian methods
- A9.2.7. Kernel methods
Other Research Topics and Application Domains
- B2.2.3. Cancer
- B9.5.6. Data science
- B9.6.3. Economy, Finance
- B9.6.5. Sociology
1 Team members, visitors, external collaborators
Research Scientist
- Christophe Biernacki [INRIA, Senior Researcher, from Aug 2025, HDR]
Faculty Members
- Cristian Preda [Team leader, UNIV LILLE, Professor, from Aug 2025, HDR]
- Evgéniya Babykina [UNIV LILLE, Associate Professor, from Aug 2025]
- Emmanuel Chazard [UNIV LILLE, Professor, from Aug 2025]
- Sophie Dabo [UNIV LILLE, Professor, from Aug 2025]
- Guillemette Marot [UNIV LILLE, Professor, from Aug 2025]
Post-Doctoral Fellows
- Gaurav Dhar [INRIA, Post-Doctoral Fellow, from Aug 2025 until Aug 2025]
- Komlan Midodzi Noukpoape [UNIV LILLE, Post-Doctoral Fellow, from Aug 2025 until Nov 2025]
PhD Students
- Mustapha Atmani [CEREMA]
- François Bassac [DECATHLON, CIFRE, from Aug 2025]
- Clarisse Boinay [DAUPHINE PSL, from Aug 2025 until Aug 2025]
- Hugo Cannafarina [INRIA, from Aug 2025 until Oct 2025]
- Violaine Courrier [WITHINGS, CIFRE, from Aug 2025]
- Clara Dubois [UNIV LILLE, CIFRE, from Aug 2025, UCCS]
- Cécile Verrier [UNIV LILLE, from Aug 2025, LPP]
Technical Staff
- Paul Faye [INRIA, Engineer, from Nov 2025]
- Nicolas Jankovsky [SATT NORD, from Aug 2025]
Interns and Apprentices
- Marin Bahut [INRIA, Intern, from Aug 2025 until Aug 2025]
- Theo Dufresne [INRIA, Intern, from Aug 2025 until Aug 2025]
Administrative Assistant
- Anne Rejl [INRIA]
Visiting Scientist
- Rahul Bordoloi [UNIV ROSTOCK, from Oct 2025]
2 Overall objectives
The overall objective of Datavers is to provide a framework in which theoretical and applied developments in statistical learning with complex and heterogeneous data meet the expectations of users of clinical decision support systems. The aim is to steer methodological developments according to the needs of clinical decision support and, reciprocally, to find applications in clinical decision support for those methodological developments.
2.1 Context
Health data is the main ingredient for building clinical decision systems that help doctors and health policy makers in their decisions. The way health data is collected (volume) and organized (complexity of types) evolves constantly. It is becoming ever more available and accessible, and it clearly represents a considerable source of inspiration for new research, primarily medical but also applicable to other domains such as computer science, physics and chemistry. In France, the universe of health data is being continuously structured: EHR (Electronic Health Record), also known as EDS (Entrepôt de données de santé), SNDS (Système National des Données de Santé), Health Data Hub, IoT health data, etc. As a consequence, the complexity of the health data to be analyzed has increased significantly. This complexity can be understood in terms of size (number of individuals and variables), structural complexity (relational schema, number of levels of qualitative variables), missingness, and relation to time (almost all variables are time-dependent and are collected in real life on a schedule that follows no protocol), to space, or to both (in many real problems, data have a complex spatial or spatio-temporal dependence structure). This new generation of data challenges current statistical learning methods, which need to be adapted or conceptually reshaped: curse of dimensionality, variable selection, visualisation and interpretation, new algorithms for new data types, etc. Nevertheless, the objectives of data analysis remain the same: description (visualisation, clustering), prediction (supervised learning, regression) or any combination of those.
2.2 Goals
This new generation of data is a growing challenge for today's statistical learning methods. Datavers addresses it by developing data-driven research in statistics, built on strong ongoing collaborations with partners from the health domain. The main goal of Datavers is to develop statistical learning methods in order to build a clinical decision support system based on heterogeneous, high-dimensional, time-space dependent data such as public health, clinical and multi-omics data.
Datavers intends to address clinical questions related to patient health trajectory (define patterns, rehospitalisation prediction, disease dynamics, etc.) and precision medicine (definition of subgroups of patients, biomarker selection, notably with application to adverse drug event prevention, etc.). In this applied framework, theoretical developments in statistical learning are carried out in several directions, such as variable selection and prediction with heterogeneous and censored data, clustering of massive data, and space-time and functional data analysis.
3 Research program
The research program is structured within three main interacting axes:
- Axis 1 provides a framework for building a clinical decision system based on heterogeneous data coming from EHR or other databases (public health, clinical, biological and multi-omics data).
- Axis 2 deals with modeling time-space dependent data. It includes functional data analysis, point processes, recurrent events and competing risks.
- Axis 3 develops methods for other data types, namely complex, high-dimensional and tall data. It includes networks, multivariate censored data, missing data and frugal methods for massive data.
3.1 Axis 1: Clinical decision system based on complex health data
This axis provides a framework for building a clinical decision system based on heterogeneous data coming from EHR and other complementary databases (public health, clinical, biological and multi-omics data). It is applied research based on collaborations with researchers from CHU Lille and on medical data from EDS Include (CHU Lille) and SNDS. The main topic developed in this axis is understanding the complexity of native structured healthcare data and translating clinical questions and raw EHR data into statistical learning statements. A first example of application to real data is the development of models and precision medicine methods for the health trajectories of elderly and diabetic/obese patients. Because of its broad scope, this axis involves all the permanent team members, in particular the members from the Metrics laboratory, who interact closely with clinicians and hospital structures. PhD students and some engineers are associated.
3.2 Axis 2: Space-time dependent data analysis
One main concern of the Datavers team is to develop statistical methods that take into account the temporal and spatial dimensions of data. The clinical decision support systems of interest in this project concern both patient health trajectories and event occurrences over time (death, hospitalisation, adverse drug events). In both cases, time and space are essential factors in the statistical analysis. As a general observation, EHR systems massively record the spatio-temporal features of clinical data. Datavers develops methodologies for functional data analysis, spatial statistics and multistate models, with a focus on categorical functional data, spatial dependency of observations in supervised learning, and survival analysis with competing risks and recurrent events.
3.3 Axis 3: Complex, high dimensional and tall data analysis
The originality of this axis is to properly account for the fact that records in EHR are increasingly numerous (in both the number of patients and the number of features) and that the recorded features are increasingly heterogeneous. As a concrete example for the number of individuals, the national PMSI database (medicalization of information systems program) contains about 24 to 27 million MCO (medicine, surgery, obstetrics) stays per year. As for the number of features, in genomics one can collect up to 5 billion base pairs per individual. In addition, IoT and wearable devices are a new source of data that inflates databases significantly and over the long term. This axis addresses generic clustering challenges fueled by EHR, such as dealing with an avalanche of small clusters within massive and heterogeneous data sets, selecting among a priori "irreconcilable" clustering methods, and jointly modeling longitudinal high-dimensional data and multivariate censored data.
4 Application domains
The application areas of statistical modeling for complex data are extensive. The Datavers team focuses mainly on Biology and Health applications, where high-throughput technologies and clinical decision systems raise new challenges. Secondary application areas are considered in Industry, Retail, Finance, Marketing and Cybersecurity.
4.1 Health applications
Beyond specific applications arising from individual collaborations between team members and researchers from the health domain, Datavers contributes to statistical modeling of the patient pathway in hospital and to precision medicine, in particular for elderly, diabetic and oncology patients. As a general remark, data coming from hospitals will continue to grow in volume and complexity, especially because of the recent national policy to build EHR systems, which make it possible to structure medical data optimally and record them over time. Consequently, it is essential to rely on the fundamental advances provided in Axis 2 and Axis 3, in particular to anticipate the increasing complexity of future EHR-like databases.
4.2 Economic and other field applications
Collaborations with companies such as Decathlon (Sport), ADEO (Marketing), Seckiot (Cybersecurity), Worldline (Finance) and Withings (wearable devices) are a source of applied research, which we carry out by supervising CIFRE PhD theses that have a strong impact on the development plans of these companies. This research covers dynamic clustering and predictive clustering from time series data, unsupervised learning models from IoT graph data, functional data analysis, etc. Even though health applications are the core of Datavers, the members of the team continue to develop such collaborations to ensure a broad transfer of our research beyond health applications.
5 Social and environmental responsibility
5.1 Footprint of research activities
Datavers develops innovative learning methods as an added value to medical decision support based on real-life health data. Precision medicine, patient health trajectory modeling and multistate analysis are the core of Datavers research activity.
5.2 Impact of research results
The development of statistical learning methods to be integrated into a clinical decision system is a founding axis of Datavers. Besides the integration of new methodologies as dedicated software within the medical information systems of specific hospital units (for instance, the geriatric unit), internships of master's students from the Master Biologie et Santé (Faculty of Medicine) or postdoctoral positions in public health can be shared between the Datavers team and research units from the University of Lille and CHU Lille. The objective of Datavers is to publish these methods and disseminate them to researchers wishing to develop clinical decision support systems.
Establishing partnerships with companies, for example through CIFRE PhD theses, is a powerful tool for transferring our research to industry and private companies. We intend to continue this process, as in the ongoing collaborations with Withings, Decathlon, Worldline, Seckiot, Alicante, Horiba and ADEO.
6 Highlights of the year
6.1 Awards
The NUMETAB project, coordinated by Guillemette Marot (Datavers) and Francois Pattou (Inserm U1190), was selected as a 2025 laureate of the Cross Disciplinarity Projects call "Initiative d'excellence Université de Lille et France 2030" (2026-2029, 3.2M euros).
Sophie Dabo received the first paper award of the ICMSEM 2025 conference.
6.2 Nomination
Sophie Dabo has been elected Vice-President of CIMPA (International Center of Pure and Applied Mathematics), a UNESCO Category 2 centre.
Sophie Dabo has been designated by CNRS INSMI to lead the IRL (International Research Laboratory) project in mathematics based in Africa.
6.3 Interview
Sophie Dabo gave an interview on October 15th on Radio France Internationale for the programme "Around the Question, the magazine for all the sciences: How to do mathematics differently and on every continent".
7 Latest software developments, platforms, open data
7.1 Latest software developments
7.1.1 cfda
- Name: Categorical functional data analysis
- Keyword: Functional data
- Functional Description: The R package cfda performs:
  - descriptive statistics for categorical functional data
  - dimension reduction and optimal encoding of states (multiple correspondence analysis extended to functional data)
  - approximation for multivariate categorical functional data analysis.
- Release Contributions:
  - approximation for multivariate categorical functional data analysis.
- URL:
- Contact: Cristian Preda
- Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle
- Partner: Université de Lille
7.1.2 MixtComp.V4
- Keyword: Clustering
- Functional Description: MixtComp (Mixture Computation) is a model-based clustering package for mixed data from the Modal team (Inria Lille). It has been engineered around the idea of easy and quick integration of new univariate models, under the conditional independence assumption. New models will eventually become available from research carried out by the Modal team or by other teams. Currently, the central architecture of MixtComp is built and its functionality has been field-tested through industry partnerships. Five basic models (Gaussian, Multinomial, Poisson, Weibull, NegativeBinomial) are implemented, as well as two advanced models (Functional and Rank). MixtComp natively manages missing data (completely missing or missing by interval). MixtComp is used as an R package, but its internals are coded in C++ using state-of-the-art libraries for faster computation.
- Release Contributions:
  - New I/O system
  - Replacement of regex library
  - Improvement of initialization
  - Criteria for stopping the algorithm
  - Added management of partially missing data for several models
  - User documentation
  - Adding user features in R
- URL:
- Contact: Christophe Biernacki
- Participants: Christophe Biernacki, Vincent Kubicki, Matthieu Marbac-Lourdelle, Serge Iovleff, Quentin Grimonprez, Etienne Goffinet
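To illustrate the conditional independence assumption mentioned in the MixtComp description above, the following minimal Python sketch computes class membership probabilities for one individual described by a continuous variable and a count variable. It is not MixtComp's API: the function names, the two-variable setting and all parameter values are hypothetical. Under conditional independence, a completely missing value simply drops its factor from the product, which is one reason this assumption makes missing data easy to handle.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian variable at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def poisson_pmf(n, rate):
    """Probability of observing the count n under a Poisson law."""
    return math.exp(-rate) * rate ** n / math.factorial(n)


def class_posteriors(x_cont, x_count, proportions, gauss_params, poisson_rates):
    """Posterior class probabilities for one individual.

    Within each class the two variables are assumed independent, so the
    class-conditional density is a product of univariate terms.
    A value equal to None is treated as completely missing and skipped.
    """
    joint = []
    for pi_k, (mean, sd), rate in zip(proportions, gauss_params, poisson_rates):
        value = pi_k
        if x_cont is not None:
            value *= gaussian_pdf(x_cont, mean, sd)
        if x_count is not None:
            value *= poisson_pmf(x_count, rate)
        joint.append(value)
    total = sum(joint)
    return [v / total for v in joint]


# Hypothetical two-class model (illustrative values only).
proportions = [0.3, 0.7]
gauss_params = [(0.0, 1.0), (5.0, 1.0)]   # (mean, sd) per class
poisson_rates = [1.0, 6.0]                # rate per class

# Count variable missing: only the Gaussian factor is used.
posterior_partial = class_posteriors(4.8, None, proportions, gauss_params, poisson_rates)
# Everything missing: the posterior falls back to the class proportions.
posterior_empty = class_posteriors(None, None, proportions, gauss_params, poisson_rates)
```

In this sketch, adding a new variable type only requires a new univariate density function, which mirrors the design goal of easy integration of univariate models.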
7.1.3 HDSpatialScan
- Name: Multivariate and Functional Spatial Scan Statistics
- Keywords: Functional data, Clustering, Spatial information, Multivariate data
- Scientific Description: Scan statistics in high dimensional spaces
- Functional Description: Detects spatial clusters of abnormal values in multivariate or functional data
- URL:
- Contact: Sophie Dabo
- Participants: Sophie Dabo, Michael Genin, Camille Frevent
7.1.4 visCorVar
- Name: visualization of correlated variables in the context of statistical integration of omics data
- Keywords: Data integration, Visualization
- Functional Description: The R package visCorVar visualizes results from data integration performed with the function block.splsda (Bioconductor mixOmics package). The data integration is performed for different types of omic datasets (transcriptomics, metabolomics, metagenomics) in order to select variables of an omic dataset which are correlated with the variables of the other omic datasets and with the response variables, and to predict the class membership of a new sample. These correlated variables can be visualized with correlation circles and networks.
- URL:
- Contact: Guillemette Marot
- Participants: Maxime Brunin, Guillemette Marot, Pierre Pericard
- Partner: Université de Lille
7.1.5 MLGL
- Name: Multi-Layer Group Lasso
- Keywords: Variable selection, Statistical learning
- Functional Description: The MLGL R-package, standing for Multi-Layer Group-Lasso, implements a procedure of variable selection in the context of redundancy between explanatory variables, which holds true with high dimensional data. The MLGL approach combines variable aggregation and selection in order to improve interpretability and performance. First, a hierarchical clustering procedure provides at each level a partition of the variables into groups. Then, the set of groups of variables from the different levels of the hierarchy is given as input to group-Lasso, with weights adapted to the structure of the hierarchy. At this step, group-Lasso outputs sets of candidate groups of variables for each value of the regularization parameter. The versatility offered by MLGL to choose groups at different levels of the hierarchy a priori induces a high computational complexity. MLGL however exploits the structure of the hierarchy and the weights used in group-Lasso to greatly reduce the final time cost. The final choice of the regularization parameter (and therefore the final choice of groups) is made by a multiple hierarchical testing procedure.
- URL:
- Contact: Guillemette Marot
- Participants: Guillemette Marot, Quentin Grimonprez
8 New results
8.1 Axis 1 and Axis 3: From Unsupervised to Guided Clustering: A Variational Implementation
Participants: Christophe Biernacki, Violaine Courrier, Cristian Preda.
Clustering is viewed as an unsupervised technique, but in practice it requires guidance to uncover meaningful structures. We formalize this with guided clustering, a paradigm that uses a guiding variable to steer the discovery process, and introduce the Guided Clustering Variational Autoencoder (GCVAE) as its deep generative realization. GCVAE learns a latent space structured as a Gaussian Mixture Model by optimizing a variational objective that forces the representation to be maximally informative about the guiding variable. This framework allows the resulting clustering to be dynamically reoriented by altering the guiding variable, yielding clusters that are both interpretable and meaningful for the specified context. Experiments on public (MNIST-SVHN) and proprietary connected health devices data demonstrate GCVAE’s ability to discover coherent and task-relevant clusters in complex, high-dimensional settings.
This work has been presented at the team seminar [49], at a national conference [35, 31] and at an international conference [29]. A submission to an international journal is in preparation.
It is joint work with Benjamin Vittrant from the Withings company.
8.2 Axis 1 and Axis 3: Levels Merging in the Latent Class Model
Participants: Christophe Biernacki, Emmanuel Chazard, Johan Lyrvall.
The latent class model (LCM), dedicated to clustering categorical data, suffers from the curse of dimensionality when the number of levels is large, a situation frequently encountered in practice. We propose to extend LCM with a natural model that limits the number of levels by merging them, a process which is equivalent to a specific clustering of the levels. Related estimation and model selection procedures are also presented and discussed.
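To make the dimensionality issue concrete, the short Python sketch below counts the free parameters of a standard latent class model before and after merging levels. The function name and the example sizes are hypothetical, and the count assumes the simplest case in which levels are merged identically in all classes; the model studied in this work may differ.

```python
def lcm_free_parameters(n_classes, levels_per_variable):
    """Number of free parameters of a standard latent class model.

    (n_classes - 1) mixing proportions, plus, for each class and each
    categorical variable with m levels, (m - 1) free probabilities.
    """
    mixing = n_classes - 1
    within_class = n_classes * sum(m - 1 for m in levels_per_variable)
    return mixing + within_class


# Hypothetical example: 3 classes, three variables with many levels.
params_full = lcm_free_parameters(3, [50, 20, 10])
# Same variables after merging levels down to 5, 4 and 3 groups.
params_merged = lcm_free_parameters(3, [5, 4, 3])
```

In this toy setting the parameter count drops from 233 to 29, which shows why limiting the number of levels can stabilize estimation when the sample size is moderate.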
Submissions to a national conference, an international conference and an international journal are in preparation.
This is joint work with Christine Keribin from University Paris-Saclay.
8.3 Axis 1 and Axis 2: Joint Latent Class Models: A Tutorial on Practical Applications in Clinical Research
Participant: Genia Babykina.
The joint latent class model is a statistical approach that allows the simultaneous analysis of two outcomes related to disease progression (a longitudinal outcome and a time-to-event outcome) in the presence of population heterogeneity. The theoretical properties of the model have been established, and it has been implemented in dedicated software. However, due to its complexity, the model remains challenging for clinicians to specify and use in practice. This work, published in [18], provides a detailed tutorial aimed at clinicians and applied statisticians. It explains how to specify joint latent class models in the R software to address concrete clinical questions, and how to explore, manipulate, and interpret the resulting outputs. The tutorial is based on a real clinical dataset; for each clinical question, the corresponding mathematical model specification and R implementation are presented, along with a detailed interpretation of estimation results and goodness-of-fit measures. This work was carried out within the framework of the PhD thesis of M. Kyheng, co-supervised by G. Babykina, and in collaboration with A. Duhamel (University of Lille, CHU Lille).
8.4 Axis 1: Metabolite biomarker discovery for pancreatic neuroendocrine tumors using metabolomic approach
Participant: Sophie Dabo.
Metabolic flexibility, a key hallmark of cancer, reflects aberrant tumour changes associated with metabolites. The metabolic plasticity of pancreatic neuroendocrine tumours (pNETs) remains largely unexplored. Notably, the heterogeneity of pNETs complicates their diagnosis, prognosis, and therapeutic management. In this paper, we compared the plasma metabolomic profiles of patients with pNETs and non-cancerous individuals to understand metabolic dysregulation. The study highlights the distinct plasma metabolic signatures of pNETs, including the critical role of fatty acid oxidation (FAO) and elevated glutamate levels in metastasis, supporting the energy and biosynthetic needs of rapidly proliferating tumour cells. Mapping these dysregulated metabolites may facilitate the identification of new therapeutic targets for pNET management. The paper [17] is a collaboration with Dr Arnaud Jannin of Lille CHU and colleagues from Oncolille and Lille CHU.
8.5 Axis 1: Long-term outcome of oesophageal atresia in adolescence (TransEAsome): a national French cohort study protocol
Participant: Guillemette Marot.
TransEAsome is a national multicentre population-based cohort study recruiting participants from all French centres qualified for oesophageal atresia (OA) surgery at birth. The primary objective is to assess the prevalence of gastro-oesophageal reflux disease in adolescence among patients with OA, with several secondary objectives including the identification of risk factors and multiomic profiles from oesophageal biopsies and blood samples collected between 13 and 14 years of age, compared with a control group. This comprehensive characterisation of phenotype and omic profiles aims to enhance the understanding of disease evolution in patients with OA and inform tailored care management strategies. This work has been published in a BMJ journal [55].
8.6 Axis 1: Analysis of Dependency Levels in Psychiatric Hospitalizations by Psychiatric Nurses: A Retrospective Study in France
Participants: Emmanuel Chazard, Alexis Dias, Antoine Lamer.
This study aims to evaluate and describe the dependency levels among adults hospitalised for psychiatric care in France between 2013 and 2022, leveraging medico-administrative data from the French National Health Data System (SNDS). We conducted a retrospective cohort study using SNDS data, analysing activities of daily living (ADL) scores collected during psychiatric admissions. Dependency was categorized into six levels based on established criteria, with a specific focus on physical dimensions (e.g., mobility, continence) and relational dimensions (e.g., behavioural interactions). See [16] for more details.
8.7 Axis 1: Comparison of youth psychiatric hospitalizations by type of facility in 2022
Participants: Emmanuel Chazard, Antoine Lamer, Antoine Teston.
This work was conducted using the French national insurance database (SNDS). Patients under 18 years of age who were discharged from psychiatric hospitals in 2022 were included. Characteristics of stays were described according to the type of facility: public, private not-for-profit, or private for-profit hospitals. In 2022, 20,598 patients were hospitalized in psychiatric facilities in France, totaling 46,222 stays. Of these stays, 76.92% took place in public facilities, 13.39% in non-profit facilities, and 9.70% in for-profit facilities. In public and non-profit facilities, patients were more frequently male, younger, and had shorter lengths of stay compared to those in for-profit facilities. Public facilities take care of the majority of patients, and the characteristics of patients and stays differ according to the type of facility. A significant share of patients is common to the public and private sectors. See [23] for more details.
8.8 Axis 1: Risk factors for severe morbidity and mortality
Participants: Emmanuel Chazard, Antoine Lamer, Océane Pécheux.
Using the French national discharge summary database, we retrospectively analysed all hospital stays that included pelvic organ prolapse (POP) surgery in public- or private-sector healthcare facilities between January 1st, 2015, and September 1st, 2024. A total of 375,705 surgical procedures were included. In a multivariate analysis, the risk of death was higher for laparotomy, transanal and multiple approaches than for vaginal surgery. The composite outcome rate (death or admission to an intensive care unit during the hospital stay for POP surgery) was 0.57% (n = 2,124). The patient-related and surgery-related risk factors were age, heart failure, respiratory insufficiency, diabetes, obesity, and laparoscopic, laparotomy, transanal and multiple approaches. See [19] for more details.
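As a quick consistency check of the figures quoted above, the composite outcome rate can be recomputed from the two counts given in the text. The variable names are chosen for illustration only.

```python
# Figures quoted in the paragraph above.
n_procedures = 375_705   # surgical procedures included
n_composite = 2_124      # deaths or intensive care admissions during the stay

# Composite outcome rate, expressed as a percentage.
rate_percent = 100 * n_composite / n_procedures
rate_rounded = round(rate_percent, 2)
print(f"{rate_rounded}%")
```

Rounded to two decimals, the result agrees with the 0.57% reported in the text.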
8.9 Axis 1: Mortality and fracture risk in children with osteogenesis imperfecta
Participants: Emmanuel Chazard, Antoine Lamer, Cécile Philippoteaux.
In this work, we used data from the French nationwide hospital discharge database (2014-2022). Based on age at index stay, patients were classified as newborns (< 1 month), infants (>1 and <24 months), or children (> 2 years). Immediate mortality (during the index stay or after same-day transfer) and long-term mortality were analyzed along with fracture risk using descriptive statistics, Kaplan-Meier estimates, and Cox models. See [20] for more details.
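For readers unfamiliar with the Kaplan-Meier estimator mentioned above, here is a minimal Python sketch of the product-limit computation. It is a generic textbook implementation, not the code used in the study, and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up durations
    events : 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all individuals sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Product-limit update: multiply by the conditional survival at t.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored individuals both leave the risk set after t.
        n_at_risk -= n_leaving
    return curve


# Hypothetical toy data: months of follow-up and event indicators.
toy_times = [2, 3, 3, 5, 8, 8, 12]
toy_events = [1, 1, 0, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
```

Individuals censored at a given time are counted as still at risk at that time, which is the usual convention.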
8.10 Axis 1: Evaluation of a score for identifying hospital stays that trigger a pharmacist intervention
Participants: Emmanuel Chazard, Laurine Robert.
The study was retrospective and observational, conducted within the clinical pharmacy team. The patient risk score was adapted from a Canadian score and integrated into the clinical decision support system (CDSS). For each hospital stay, the score was calculated at the beginning of hospitalization, and we retrospectively checked whether a medication review and a pharmacist intervention (PI) were conducted. The optimal patient risk score threshold was then determined to help pharmacists optimize medication reviews. See [21] for more details.
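One common way to choose such a threshold is to maximize Youden's J statistic (sensitivity + specificity - 1). The text above does not state which criterion the study used, so the Python sketch below is only an illustration under that assumption, with hypothetical scores and labels.

```python
def best_threshold(scores, had_intervention):
    """Return (threshold, J) maximizing Youden's J.

    A stay is flagged when its score is >= the threshold.
    Assumes both outcomes (1 and 0) are present in had_intervention.
    """
    n_pos = sum(had_intervention)
    n_neg = len(had_intervention) - n_pos
    best = None
    for thr in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, had_intervention) if s >= thr and y == 1)
        false_pos = sum(1 for s, y in zip(scores, had_intervention) if s >= thr and y == 0)
        sensitivity = true_pos / n_pos
        specificity = 1 - false_pos / n_neg
        j = sensitivity + specificity - 1
        # Keep the lowest threshold in case of ties.
        if best is None or j > best[1]:
            best = (thr, j)
    return best


# Hypothetical toy data: risk score per stay and whether a PI occurred.
toy_scores = [1, 2, 2, 3, 4, 5, 6, 7]
toy_labels = [0, 0, 1, 0, 1, 1, 1, 1]
threshold, youden_j = best_threshold(toy_scores, toy_labels)
```

Other criteria, for example a fixed minimum sensitivity or a workload constraint on the number of reviews per day, would lead to a different threshold.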
8.11 Axis 2: Clustering of recurrent events data
Participants: Genia Babykina, Vincent Vandewalle.
A novel statistical methodology was developed for the analysis of recurrent event data, which commonly arise in fields such as healthcare, epidemiology, and reliability studies. Specifically, we proposed a mixture model for recurrent events that accounts for unobserved heterogeneity through latent classes. This framework enables the unsupervised clustering of individuals into homogeneous subgroups, while modeling recurrent event intensities within each cluster and adjusting for covariates. Model parameters are estimated by maximum likelihood using the EM algorithm. The feasibility and performance of the method were assessed through simulation studies and illustrated using real-world hospital readmission data, providing improved insight into heterogeneous recurrent event dynamics. The associated methodological article is published in [15].
This methodology was subsequently applied in a large prospective multicentre cohort study aimed at identifying subgroups of older patients at risk of repeated hospital readmissions and death following discharge from acute geriatric units. Using the proposed approach, two distinct patient subgroups with markedly different post-discharge outcomes were identified. Further analyses revealed that only a limited number of clinical characteristics were weakly associated with membership in the high-risk subgroup, highlighting the difficulty of predicting adverse outcomes based solely on standard clinical variables. This applied work demonstrates the practical relevance of the proposed methodological framework and underscores the need for improved predictive tools to better target high-risk older patients. The results are published in [24].
This work was carried out in collaboration with V. Vandewalle (Inria Modal) and J. Bravo (University of Cádiz, Spain). The clinical application was conducted in close collaboration with clinicians from CHU Lille, notably F. Visade and J.-B. Beuscart.
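The general idea of clustering recurrent event data with a latent class mixture fitted by EM can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the published model: it assumes a constant event rate per class (homogeneous Poisson process), no covariates, and hypothetical toy data and starting values.

```python
import math


def em_poisson_mixture(counts, exposures, rates, weights, n_iter=200):
    """EM for a K-class mixture of homogeneous Poisson processes.

    counts[i]      : number of events observed for individual i
    exposures[i]   : follow-up time of individual i
    rates, weights : initial class-specific event rates and class proportions
    Returns the estimated rates, proportions and posterior class probabilities.
    """
    k = len(rates)
    rates, weights = list(rates), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities, computed on the log scale.
        # The log(n!) term is the same for all classes and is omitted.
        resp = []
        for n, t in zip(counts, exposures):
            logp = [math.log(weights[c]) + n * math.log(rates[c] * t) - rates[c] * t
                    for c in range(k)]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([v / total for v in unnorm])
        # M-step: closed-form updates of proportions and rates.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / len(counts)
            rates[c] = (sum(r[c] * n for r, n in zip(resp, counts))
                        / sum(r[c] * t for r, t in zip(resp, exposures)))
    return rates, weights, resp


# Hypothetical toy data: readmission counts over 2 years of follow-up.
toy_counts = [0, 1, 1, 0, 1, 1, 6, 8, 7, 9, 5, 7]
toy_exposures = [2.0] * len(toy_counts)
est_rates, est_weights, posteriors = em_poisson_mixture(
    toy_counts, toy_exposures, rates=[0.5, 3.0], weights=[0.5, 0.5])

# Hard clustering: assign each individual to its most probable class.
clusters = [max(range(2), key=lambda c: p[c]) for p in posteriors]
```

Extending this sketch to time-varying intensities and covariates would replace the closed-form rate update with a numerical optimization inside the M-step.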
8.12 Axis 2: Variable selection with FDR control for high dimensional competing risk data
Participants: Guillemette Marot, Genia Babykina, Hugo Cannafarina.
In biomedical research, high-dimensional data are increasingly common, particularly in fields such as genomics, transcriptomics, proteomics, and metabolomics, while clinical outcomes are often subject to competing risks. In this context, the penalized Fine–Gray model is widely used to identify covariates associated with the outcome of interest. However, variable selection remains challenging, as penalized approaches may retain a large number of irrelevant variables, thereby complicating the identification of meaningful biomarkers. To address this issue, we proposed the use of Integrated Path Stability Selection (IPSS) to enhance variable selection in high-dimensional competing risks settings. This approach improves the control of false positives while maintaining the ability to detect truly influential variables and ensuring control of the False Discovery Rate (FDR). Simulation studies demonstrate that IPSS substantially reduces the number of false positives compared with existing methods, while preserving strong performance in terms of true positive detection and predictive accuracy. The practical relevance of the method is further illustrated through a real-world biomedical case study. This work has been disseminated in a conference paper [37]. This research was conducted within the framework of the PhD thesis of H. Cannafarina, co-supervised by C. Preda, G. Marot and G. Babykina.
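For reference, the penalized Fine–Gray model mentioned above can be written in its standard textbook form as follows. The lasso penalty is an assumption made here for illustration, since the text does not specify the penalty, and the symbols ($x$, $\beta$, $\gamma$, $F_1$, $\ell$) are introduced only for this sketch: $F_1$ is the cumulative incidence function of the event of interest and $\ell$ is the log partial likelihood associated with the subdistribution hazard.

```latex
% Subdistribution hazard of the event of interest (cause 1)
\lambda_1(t \mid x) = \lambda_{10}(t)\,\exp\!\left(x^\top \beta\right),
\qquad
\lambda_1(t \mid x) = -\frac{\mathrm{d}}{\mathrm{d}t}\,\log\!\bigl(1 - F_1(t \mid x)\bigr)

% Penalized estimator for a regularization parameter \gamma \ge 0
\hat{\beta}(\gamma) = \arg\max_{\beta \in \mathbb{R}^p}
  \left\{ \ell(\beta) - \gamma \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

Stability selection procedures then look at how often each coordinate of $\hat{\beta}(\gamma)$ is non-zero across random subsamples and across values of $\gamma$, rather than relying on a single fit.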
8.13 Axis 2: Longitudinal Data: A Lever for Precision Medicine
Participant: Genia Babykina.
A keynote presentation was delivered by a member of the team (G. Babykina) during the Journées PEPR Santé Numérique held in Lille (October 2025). The keynote provided a comprehensive overview of statistical methodologies enabling the use of longitudinal data from multiple sources as a lever for precision medicine. The presentation highlighted methodological challenges and recent advances in modeling complex longitudinal trajectories to better support individualized clinical decision-making. The presentation slides are available in 27.
8.14 Axis 2: Fusion regression methods with repeated functional data
Participants: Sophie Dabo, Cristian Preda, Issam Moindjie.
Linear regression and classification methods with repeated functional data are considered in this work. For each statistical unit in the sample, a real-valued parameter is observed over time under different conditions related by some neighborhood structure (spatial, group, etc.). Two regression methods based on fusion penalties are proposed to account for the dependence induced by this structure. These methods aim to obtain parsimonious regression coefficient functions by determining whether neighboring conditions are associated with common regression coefficient functions. The first method is a generalization to functional data of the variable fusion methodology based on the 1-nearest neighbor. The second relies on the group fusion lasso penalty, which assumes some grouping structure of conditions and allows for homogeneity among the regression coefficient functions within groups. Numerical simulations and an application to electroencephalography data are presented 10.
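For orientation, a fusion-penalized functional linear model of this kind can be written in the generic form below. The notation is an assumption made here for illustration and may differ from that of the paper: $y_i$ is the response of unit $i$, $X_{ij}$ is the curve observed for unit $i$ under condition $j$ on a time domain $T$, $\beta_j$ is the coefficient function of condition $j$, $j \sim k$ means that conditions $j$ and $k$ are neighbors, and $\lambda \ge 0$ is a tuning parameter.

```latex
\min_{\beta_1,\dots,\beta_p}\;
\sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{p}\int_{T} X_{ij}(t)\,\beta_j(t)\,dt\Big)^{2}
\;+\;\lambda\sum_{j\sim k}\big\|\beta_j-\beta_k\big\|_{L^2(T)}
```

When $\lambda$ is large enough, the penalty sets $\beta_j-\beta_k$ to zero for some neighboring pairs, so that those conditions share a common coefficient function.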
8.15 Axis 2: Principal component analysis of multivariate spatial functional data
Participant: Sophie Dabo.
This paper is devoted to the study of dimension reduction techniques for multivariate, spatially indexed functional data defined on different domains. We present a method called Spatial Multivariate Functional Principal Component Analysis (SMFPCA), which performs principal component analysis for multivariate spatial functional data. In contrast to the multivariate Karhunen-Loève approach for independent data, SMFPCA captures spatial dependencies among multiple functions. SMFPCA applies spectral functional component analysis to multivariate functional spatial data, focusing on data points arranged on a regular grid. The methodological framework and algorithm of SMFPCA have been developed to address the lack of appropriate methods for this type of data. The finite-sample performance of the proposed method has been assessed on simulated datasets and on a sea-surface temperature dataset. Additionally, we conducted comparative studies of SMFPCA against some existing methods, providing insights into the finite-sample properties of multivariate spatial functional data. The paper 22 is in collaboration with Idris Siahmed, Leila Hamdad (Algeria), Christelle Judith Agonkoui and Yoba Kande (University of Lille).
8.16 Axis 2: Forecasting mortality rates with functional signatures
Participant: Sophie Dabo.
This study introduces an innovative methodology for mortality forecasting, which integrates signature-based methods within the functional data framework of the Hyndman-Ullah (HU) model. This new approach, termed the Hyndman-Ullah with truncated signatures (HUts) model, aims to enhance the accuracy and robustness of mortality predictions. By utilizing signature regression, the HUts model is able to capture complex, nonlinear dependencies in mortality data, which enhances forecasting accuracy across various demographic conditions. The model is applied to mortality data from 12 countries, comparing its forecasting performance against variants of the HU models across multiple forecast horizons. Our findings indicate that, overall, the HUts model not only provides more precise point forecasts but also shows robustness against data irregularities, such as those observed in countries with historical outliers. The integration of signature-based methods enables the HUts model to capture complex patterns in mortality data, making it a powerful tool for actuaries and demographers. Prediction intervals are also constructed with bootstrapping methods. This paper 25 is in collaboration with Zhong Jing Yap and Dharini Pathmanathan from Malaya University (Kuala Lumpur, Malaysia).
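To make the notion of a truncated signature concrete, the following sketch computes the signature of a piecewise-linear path up to depth 2 using Chen's identity. It only illustrates the feature map. How the HUts model builds its paths and feeds the signature terms into the regression is not specified here, and the function name `signature_depth2` is hypothetical.

```python
def signature_depth2(path):
    """Depth-2 truncated signature of a piecewise-linear path.

    `path` is a list of points, each a list of d coordinates.
    Returns (level1, level2): the total increment (length d) and the
    d x d matrix of second-order iterated integrals.
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for start, end in zip(path[:-1], path[1:]):
        step = [b - a for a, b in zip(start, end)]
        # Chen's identity for appending one linear segment to the path.
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * step[j] + 0.5 * step[i] * step[j]
        for i in range(d):
            level1[i] += step[i]
    return level1, level2
```

For the path through (0, 0), (1, 0) and (1, 1), the first level is `[1.0, 1.0]` and the second level is `[[0.5, 1.0], [0.0, 0.5]]`. Half the difference of the off-diagonal terms, 0.5, is the Lévy area of the path, an example of the order-dependent information that signature features encode.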
8.17 Axis 2: Predictive model for running performance using scalar and qualitative functional data
Participants: François Bassac, Cristian Preda, Cédric Morio.
Predicting the performance of runners using the training programs offered by Decathlon is a major challenge for both the company and the runners themselves. Data measuring the runners' physical capacities, the training themes followed and the results of training sessions are available at Decathlon. Their longitudinal nature (time dimensionality) and heterogeneity (type, length of observation) require effective pre-processing before they can be used in a predictive model. The extension of principal component analysis and multiple correspondence analysis to scalar and categorical functional data is used to reduce the dimension, visualize the data and fit a linear regression model. The coefficient functions of the regression model allow interpretation and prediction with new data. See 36 for more details.
8.18 Axis 3: Detection of anomalies in dynamic graphs with application in cybersecurity for OT
Participants: Christophe Biernacki, Cristian Preda.
The increasing number of cyber attacks on industrial networks puts human life and economies at risk. Firms usually implement fixed rules rather than anomaly detection to prevent such attacks. However, anomaly detection methods would allow for a more flexible grasp of deviations from normal behaviour. For instance, anomaly detection in graphs modeling industrial networks can sense changes in the behaviour of machines. In this work, we seek to establish whether the number of messages sent from one or more machines to one or more machines is normal or not. To this end, we first model interactions between IP addresses with dynamic graphs. Then, we construct a test statistic based on the likelihood of a graph, computed using generative models such as the stochastic block model and kernel estimators. Finally, we evaluate the power of the test in realistic and generic attack scenarios.
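The testing idea can be sketched as follows, under strong simplifying assumptions that are not taken from the thesis: message counts between machines follow a Poisson stochastic block model with known block labels and known block-to-block rates, and the null distribution of the log-likelihood is approximated by Monte Carlo simulation. All function names are hypothetical, and the kernel estimators mentioned above are not covered.

```python
import math
import random

def poisson_sbm_loglik(counts, labels, rates):
    """Log-likelihood of a directed count matrix under a Poisson SBM."""
    n = len(labels)
    loglik = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-messages in this toy model
            lam = rates[labels[i]][labels[j]]
            x = counts[i][j]
            loglik += x * math.log(lam) - lam - math.lgamma(x + 1)
    return loglik

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_counts(labels, rates, rng):
    """Draw one count matrix from the assumed 'normal behaviour' model."""
    n = len(labels)
    return [[0 if i == j else sample_poisson(rates[labels[i]][labels[j]], rng)
             for j in range(n)] for i in range(n)]

def monte_carlo_pvalue(counts, labels, rates, n_sim=500, seed=0):
    """P-value of the test 'the observed graph is as likely as normal traffic'."""
    rng = random.Random(seed)
    observed = poisson_sbm_loglik(counts, labels, rates)
    simulated = [poisson_sbm_loglik(sample_counts(labels, rates, rng), labels, rates)
                 for _ in range(n_sim)]
    # Small p-value: the observed graph is unusually unlikely under the model.
    return (1 + sum(s <= observed for s in simulated)) / (n_sim + 1)
```

A graph in which one machine suddenly sends many more messages than its block-level rate predicts yields a much lower log-likelihood than the simulated graphs, and therefore a small p-value.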
Clarisse Boinay defended her PhD thesis this year 47. She presented this work at Ecole d’été de Saint-Flour (July 2025, Clermont-Ferrand) and a paper for an international journal is in preparation.
8.19 Axis 3: Model-based co-clustering: high dimension and estimation challenges
Participant: Christophe Biernacki.
Model-based co-clustering can be seen as a particularly important extension of model-based clustering. It allows for a significant reduction of both the number of rows (individuals) and columns (variables) of a data set in a parsimonious manner, and also allows interpretability of the resulting reduced data set since the meaning of the initial individuals and features is preserved. Moreover, it benefits from the rich statistical theory for both estimation and model selection. Many works have produced new advances on this topic in recent years, and we offer a general update of the related literature. This is an opportunity to put forward two main messages, supported by specific research material: (1) co-clustering requires further research to fix some well-identified estimation issues, and (2) co-clustering is one of the most promising approaches for clustering in the (very) high-dimensional setting, which corresponds to the global trend in modern data sets.
A presentation on this topic was given at an international online seminar ("DaSSWeb - Data Science and Statistics Webinar") 48.
This is joint work with Julien Jacques from University Lyon 2 and Christine Keribin from University Paris-Saclay.
8.20 Axis 3: An EM Stopping Rule for Avoiding Degeneracy in Gaussian-Based Clustering with Missing Data
Participant: Christophe Biernacki.
Missing data become more frequent as modern multivariate datasets grow in size. In Gaussian model-based clustering, the EM algorithm handles such data easily, but the degeneracy problem is dramatically aggravated during EM runs: parameter degeneracy is quite slow and also more frequent than with complete data. Consequently, degenerate solutions may be mistaken for valid ones and, in addition, computing time may be wasted on runs that end in degeneracy. In this work, a simple, low-information condition on the latent partition leads to a very simple partition-based stopping rule for EM, which behaves well in numerical experiments.
This work has been presented at Journée PS-MAASAI 2025 28 and an article for an international journal is in preparation.
This is joint work with Vincent Vandewalle from University Côte d’Azur.
8.21 Axis 3: Probabilistic estimation of fatigue damage based on binned data from passive sensors
Participants: Mustapha Atmani, Christophe Biernacki.
The monitoring of the structural integrity of civil engineering structures, particularly bridges subjected to variable loads due to traffic, is crucial for safety and predictive maintenance. Conventional strain gauges are active sensors that measure continuous amplitudes, but they lack robustness over time and need to be powered. To overcome these constraints, the company SilMach has developed a passive mechanical sensor that requires no power supply and is designed to detect strain amplitudes at the installation location. This sensor provides aggregated data in the form of counts (binned data): it indicates the number of cycles whose amplitude falls within predefined, often wide, intervals, without reproducing the exact values of the fluctuations experienced over time.
This study proposes an estimation methodology using an Expectation–Maximization (EM) algorithm adapted to binned data, in order to efficiently identify the parameters of the chosen distribution (or mixture) based solely on the interval counts from the passive sensor. Once the parameters are estimated, the damage is calculated via its integral expression, and its uncertainty is then quantified. Looking ahead, we propose to study the effect of the counting interval bounds (size, position) on the estimation accuracy and the width of the confidence intervals, in order to jointly optimize the design of passive sensors and the aggregation strategy.
This work has been submitted to EWSHM 2026 (12th European Workshop on Structural Health Monitoring).
This is joint work with André Orcesi from Cerema. See 14 for more details.
9 Bilateral contracts and grants with industry
9.1 Bilateral Grants with Industry
9.1.1 Withings
Participants: Christophe Biernacki, Violaine Courrier, Cristian Preda.
Withings is a French consumer electronics company which designs and innovates in connected devices, such as the first Wi-Fi scale on the market (introduced in 2009), an FDA-cleared blood pressure monitor, a smart sleep system, and a line of automatic activity tracking watches. It also provides B2B services for healthcare providers and researchers.
The PhD thesis of Violaine Courrier began in September 2023 on the analysis of multivariate, sparse longitudinal data, with mixed covariates, from connected medical objects.
9.1.2 Seckiot
Participants: Christophe Biernacki, Clarisse Boinay, Cristian Preda.
Seckiot is a publisher of cybersecurity software that protects industrial systems and IoT. In December 2021, Clarisse Boinay began her Cifre PhD thesis (with AID, Agence de l'Innovation de Défense) with Seckiot on the topic of "anomaly detection and change point detection in contextual dynamic asynchronous graphs with applications in OT cybersecurity" under the co-supervision of Thomas Anglade (Seckiot), Christophe Biernacki and Cristian Preda.
Clarisse Boinay defended her PhD thesis on December 16 2025 47.
9.1.3 SilMach
Participants: Mustapha Atmani, Christophe Biernacki.
Through their joint ROAD-AI project, Inria and Cerema are studying digital tools allowing these phenomena to be modeled using structural instrumentation. This initiative is complemented and reinforced by the SIRCAPASS project, coordinated by the company SilMach, which aims to use new passive MEMS (Micro Electro-Mechanical Systems) sensor technology for this instrumentation.
In this context, Mustapha Atmani began his PhD thesis, entitled “Statistical processing of ‘low data’ from passive sensors: application to the monitoring of engineering structures”, on December 1, 2024. The thesis is co-supervised by André Orcesi from Cerema.
9.1.4 Decathlon
Participants: François Bassac, Cristian Preda.
Decathlon is a brand specializing in the large-scale retail of sports equipment and materials. In September 2022, François Bassac began his PhD thesis within the Inria-Decathlon partnership on the topic of predicting performances and injuries with training data, under the supervision of Cristian Preda.
9.1.5 Horiba
Participants: Komlan Noukpoape, Sophie Dabo, Cristian Preda.
Horiba is a company specialized in optical spectrometry. Datavers is working with this company and Centrale Lille on Raman spectroscopy and artificial intelligence for chemical synthesis.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Participation in other International Programs
IRN AFRIMath
Participant: Sophie Dabo.
- Partner Institution(s): CNRS
- AFRIMath is an International Research Network of the CNRS bringing together mathematicians located mainly in sub-Saharan Africa and in France.
- Date/Duration: 2021-2028
- Additional info/keywords: Numerical Analysis, Probability and Statistics
IRL CRM-CNRS
Participant: Sophie Dabo.
- Partner Institution(s): CNRS, University of Montreal
- IRL CRM-CNRS is a joint International Research Laboratory between CNRS and the University of Montreal, based in Montreal. Sophie Dabo was on CNRS delegation at this IRL from September 2024 to February 2025.
- Date/Duration: 2024-2025
PHC Tournesol
Participant: Sophie Dabo.
- Partner Institution(s): University of Lille, ULB (Brussels)
- Sophie Dabo is co-PI of the project with Prof. Thomas Verdebout.
- Date/Duration: 2024-2025
- Additional info/keywords: Functional Data, Statistical test, PCA
10.1.2 Visits to international teams
Sabbatical programme
Participant: Sophie Dabo.
Sophie Dabo was on a CNRS sabbatical programme at the University of Montreal, IRL CRM-CNRS, from September 2024 to February 2025. She holds a CNRS Chair JRP FANE-MATH-PE with a 3-month sabbatical programme each year (2024-2027) at AIMS South Africa, AIMS Senegal and North-West University in South Africa.
10.1.3 Other European programs/initiatives
Participant: Sophie Dabo.
Sophie Dabo is part of the Mathematics for Humanity scientific team of the ICMS of the London Mathematical Society.
10.2 National initiatives
10.2.1 CDP - Cross Disciplinary Project
Participants: Guillemette Marot, Genia Babykina, Cristian Preda, Emmanuel Chazard.
- Consortium: Inria, University of Lille, Inserm, CHU Lille.
- Coordinators: François Pattou (INSERM 1190 (EGID)) and Guillemette Marot (Inria, Datavers)
- Project title: Molecular signatures of esophageal atresia: towards the identification of the molecular causes of the different forms of esophageal atresia and prenatal diagnosis
- Objective: Characterize inter-individual heterogeneity in weight-loss trajectories; identify the biological mechanisms underlying variability in response to interventions; link weight-loss trajectories to major long-term clinical outcomes; translate these insights into predictive, clinically actionable decision-support tools.
- Funding: 3.2M euros.
- Duration: 2026-2029.
10.2.2 “Inria Challenge” ROAD-AI with Cerema
Participant: Christophe Biernacki.
Cerema (Centre d'études et d'expertise sur les risques, l'environnement, la mobilité et l'aménagement - Centre for Studies on Risks, the Environment, Mobility and Urban Planning) is a public institution dedicated to supporting public policies, under the dual supervision of the ministry for ecological transition and the ministry for regional cohesion and local authority relations. Datavers is involved in the ROAD-AI (Routes et Ouvrages d'Art Diversiformes, Augmentés & Intégrés) “Inria Challenge”, with six other Inria teams (ACENTAURI, COATI, FUN, I4S, STATIFY, TITANE) covering statistics, robotics, telecommunications, sensor networks and 3D modeling. This four-year project (starting in 2021) aims at more sustainable, safer and more resilient transport infrastructures.
10.2.3 ANR
Oesomics
- Participants: Guillemette Marot.
- Type: ANR AAP Recherche translationnelle en santé
- Acronym: Oesomics
- Project title: Molecular signatures of esophageal atresia: towards the identification of the molecular causes of the different forms of esophageal atresia and prenatal diagnosis
- Coordinator: Frédéric Gottrand (Univ. Lille, CHU Lille, Infinite)
- Duration: 36 months (2022–2027)
- Funding: 233k euros
- Partners: CHU Lille, PRISM, PLBS-Goal, PLBS-bilille
- Contribution: Statistical analysis of multi-omics (mainly transcriptomics and proteomics) data
TransEAsome
- Participants: Guillemette Marot.
- Type: AMI Maladies rares
- Acronym: TransEAsome
- Project title: Long term outcome of esophageal atresia: transomics profiles in adolescence
- Coordinator: Frédéric Gottrand (Univ. Lille, CHU Lille, Infinite)
- Duration: 72 months (2022–2027)
- Funding: 1.4M euros
- Partners: CHU Lille, Univ. Lille, Inserm NO, Inserm ADR - GO, CRACMO, FIMATHO
- Contribution: Statistical analysis of multi-omics (mainly transcriptomics and proteomics) data
10.2.4 FHU
An FHU is a federative project and a label required to apply for an RHU.
- Acronym: PRECISE
- Project title: PREcision health in Complex Immune-mediated inflammatory diseaSEs
- Coordinator: David Launay (U. Lille, CHU Lille)
- Duration: 5 years (2021–2025)
- Partners: CHU Lille, CHU Amiens, CHU Rouen, CHU Caen, Université de Lille, Université de Picardie, Université de Rouen, Inserm
- Contribution: The objective of FHU PRECISE is to structure care, research and teaching related to patients who suffer from complex IMID (immune-mediated inflammatory diseases) with an interdisciplinary approach. Guillemette Marot is co-head, with Vincent Sobanski, of work package WP2, which aims at creating a «virtual patient» and clustering patients based on their clinical and omic profiles. In this WP, she is involved both in the analysis task with the Bilille platform and in the research task led by Christophe Biernacki, involving the Datavers team. This research task aims at combining complex data and integrating temporal structure in order to identify patients' care pathways. Guillemette Marot is also participating with the Bilille platform in WP3 in the search for a molecular signature predictive of treatment response (resistance and complication).
Participants: Christophe Biernacki, Sophie Dabo, Genia Babykina, Cristian Preda.
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
Sophie Dabo organized:
- Four research schools:
African Mathematical School on Quantitative Biology: Applications in Epidemiology, Ecology and Cancer, 19-27 Feb 2024, NWU, SA
3MC-PIMS-ICMS school on Multiscale Modeling: Infectious Diseases, Cancer and Treatments, 2 - 13 Dec 2024, Edinburgh, UK
CIMPA School on Mathematical and Statistical Modeling in Oncology, 3 - 14 Feb 2025, North-West University, South Africa
3MC-PIMS-IDMS-ICMS school on Quantitative molecular and cellular biology, 16-27 Jun 2025, University of Manitoba, Canada
- Five conferences:
International Conference on Mathematical Modeling in Biology and Life Sciences, 28 Feb-1 Mar 2024, North-West University, SA
Colloque Francophone International de Statistique, Probabilités et Interactions, 8-10 Jul 2025, AIMS-Senegal & Université Gaston Berger
Conférence Internationale Annuelle: Femmes Mathématiques et Interactions, 12 Jun 2025, Association des Femmes Scientifiques Africaines du Québec, Canada
Women in SAGE-Tunisia, 30 Sep - 4 Oct 2025, Université de Tunis
African Women in Mathematics: Challenging Questions, 6-8 Oct 2025, Tunisia
- One exhibition (to promote mathematics, inspire young people, and highlight African heritage), 8-10 May 2025, Dakar, Senegal
General chair, scientific chair
- Sophie Dabo was the general chair of the African Mathematical Biology Society conference, December 5th, 2025.
Member of the conference program committees
- Christophe Biernacki was a program committee member of the 29th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES'25) for the session on "Detection of Complex Attacks".
- Sophie Dabo was a member of the scientific program committee of the EcoSta2025 conference, 21-23 August 2025.
11.1.2 Journal
Member of the editorial boards
- Christophe Biernacki is an Associate Editor for the international journal Advances in Data Analysis and Classification (ADAC).
- Sophie Dabo is an Associate Editor of the Journal of Statistical Modeling and Analytics, the Journal of Nonparametric Statistics, and Afrika Matematika.
- Cristian Preda is an Associate Editor of Methodology and Computing in Applied Probability.
Reviewer - reviewing activities
- Christophe Biernacki acted as a reviewer for different journals (Journal of Classification, Methodology and Computing in Applied Probability, Communications in Statistics - Theory and Methods, Computational Statistics, Statistics and Computing, Austrian Journal of Statistics) and two conferences (AISTATS 2025, CAp 2025).
- Genia Babykina acted as a reviewer for Brazilian Journal of Biometrics.
- Sophie Dabo acted as a reviewer for different journals (JRSS B, C, JASA, Electronic Journal of Statistics, Bernoulli, Journal of Nonparametric Statistics, ADAC, ...) and several conferences worldwide.
- Cristian Preda acted as a reviewer for BMJ, CSDA, JMVA journals.
11.1.3 Invited talks
- Genia Babykina gave an invited talk at Journées PEPR Santé Numérique held in France 27.
- Sophie Dabo has been invited to several conferences and seminars:
- Seminar, North West University, South Africa, 13th January 2025.
- Journées de Statistique et Optimisation, Perpignan, 2nd-4th April 2025.
- Conference Afrimath, Abidjan, Ivory Coast, 31st March 2025.
- Conference SAMPTA 2025, July 28th - August 1st, Vienna, Austria.
- First virtual symposium of the African Society for Biomathematics, 26-27th June, 2025, Virtual.
- ICMSEM 2025, Lille, July, 24-25, 2025.
- Non-stationarity and Statistics for EEG, Paris, 4th September 2025.
- SSADS25, Oujda, Morocco, 15th October 2025.
- JSMDS 2025, Tunisia, November 13th-15th, 2025.
- SASA 2025, Riverside, South Africa, November 26th, 2025.
- Mathematics in Africa, London Mathematical Society, UK/Africa Partnerships, May 14th, Edinburgh.
- CIMAD 2025, November 26th-27th, 2025, N'Djamena, Chad.
- Seminar IBENS, ENS Paris, 8th December 2025
- Cristian Preda has been invited for a plenary talk in the StatMod Conference, September 2025, University of Piraeus, Greece.
11.1.4 Leadership within the scientific community
- Christophe Biernacki has been President of the SFdS (Société Française de Statistique) since July 2024. The SFdS is the French learned society for statistics; its mission is to promote the use and understanding of statistics and to foster its methodological developments.
- Sophie Dabo is:
- vice-president of CIMPA since January 2025,
- member of the Diversity committee of the IMU (International Mathematical Union), 2020-2025,
- member of the Scientific Committee of ICMS (International Centre for Mathematical Sciences) Mathematics for Humanity, 2023-2025,
- member of the scientific committee of the Ibni Oumar Mahamat Saleh (Association pour la promotion scientifique de l'Afrique) Award, 2019-2025,
- associate member of the European Mathematical Society Committee for Developing Countries since 2022, after chairing the committee.
11.1.5 Scientific expertise
- Christophe Biernacki was a member of the ANR scientific evaluation committee "AI for scientific discovery".
- Sophie Dabo was a member of the ANR scientific evaluation committee 40 "Mathematics", of INSERM CSS7 and of INRAE CSS Misti.
11.1.6 Research administration
- Since January 2020, Christophe Biernacki acts as a deputy scientific director of Inria at the national level in charge of the domain “Applied mathematics, computation and simulation”.
- Since January 2024, Sophie Dabo acts as chair of the CNRS Network FANE MATH PE.
11.2 Teaching
- Sophie Dabo, as a professor at the University of Lille, teaches Data Mining (Master 1), Spatial Statistics (Master 2), Data Analysis (Master 1) and Probability (Bachelor).
- Cristian Preda, as a professor at the University of Lille, teaches statistics and probability (192 hours per year) to engineering students at Polytech'Lille.
- Genia Babykina, as an associate professor at the University of Lille, teaches statistics at the ILIS institute for 192 hours per year.
- Emmanuel Chazard, as a professor at the University of Lille, teaches statistics and probability to students of the Faculty of Medicine.
11.3 Supervision
- Clarisse Boinay works on anomaly detection and change point detection in contextual dynamic asynchronous graphs with applications in OT cybersecurity, under the supervision of Christophe Biernacki and Cristian Preda. She defended her PhD thesis on December 16, 2025 47.
- Violaine Courrier works on the analysis of multivariate, sparse longitudinal data, with mixed covariates, from connected medical objects. Her PhD started in September 2023 under the supervision of Christophe Biernacki and Cristian Preda.
- Mustapha Atmani began his PhD thesis, entitled “Statistical processing of ‘low data’ from passive sensors: application to the monitoring of engineering structures”, on December 1, 2024. The thesis is co-supervised by André Orcesi from Cerema.
- Hugo Cannafarina (PhD) worked on variable selection in the context of high-dimensional data and competing risks outcomes from November 2024 to November 2025 under the co-supervision of C. Preda, G. Marot and G. Babykina. In November 2025, Hugo decided not to continue the PhD project.
- Cécile Verrier works on mathematical oncology to model senescence 51. She started her PhD in February 2025 and is supervised by Sophie Dabo, Alexandre Poulain and Vanessa Dehault (Oncolille, Canther).
- Clara Dubois works on Raman spectroscopy and AI for chemical synthesis. Her Cifre PhD started in 2023 with Horiba and Centrale Lille. She is supervised by Sophie Dabo, Christophe Dujardin (Centrale Lille) and Sebastien Legendre (Horiba).
- François Bassac works on functional data for predicting the sport performance of runners. His PhD started in 2022 under the supervision of Cristian Preda.
11.4 Juries
- Christophe Biernacki acted as a reviewer for 2 PhD theses and was president of the jury for two other PhD theses.
- Sophie Dabo acted as a reviewer for 6 PhD theses and was president of the jury for one of these PhD theses.
11.4.1 Educational and pedagogical outreach
- Sophie Dabo is involved in several activities worldwide for training a new generation of scientists from the global South via the CIMPA, EMS-CDC, ICMS and IMM-ICTP programs.
11.5 Popularization
11.5.1 Specific official responsibilities in science outreach structures
- Christophe Biernacki is director of the collection "Pratique de la statistique" at the Presses Universitaires de Rennes.
- Sophie Dabo is chair of the Mathematical Statistics theme of the Encyclopedia of SCIENCES.
11.5.2 Productions (articles, videos, podcasts, serious games, ...)
11.5.3 Participation in Live events
Sophie Dabo participated in the Radio France Internationale live event "Around the Question, the magazine for all the sciences: How to do mathematics differently and on every continent", 10th October 2025.
12 Scientific production
12.1 Major publications
- 1 misc. Frugal Gaussian clustering of huge imbalanced datasets through a bin-marginal approach. December 2021. HAL
- 2 article. Unifying Data Units and Models in (Co-)Clustering. Advances in Data Analysis and Classification 13, May 2018, 7-31. HAL, DOI
- 3 article. Circulating proteomic signature of early death in heart failure patients with reduced ejection fraction. Scientific Reports 9, December 2019, 19202. HAL, DOI
- 4 article. Short-term air temperature forecasting using Nonparametric Functional Data Analysis and SARMA models. Environmental Modelling and Software 111, January 2019, 394-408. HAL, DOI
- 5 article. MLGL: An R package implementing correlated variable selection by hierarchical clustering and group-Lasso. Journal of Statistical Software 106(3), March 2023. HAL, DOI
- 6 article. Model-based clustering for multivariate functional data. Computational Statistics and Data Analysis 71, March 2014, 92-106. HAL, DOI
- 7 article. Identification of metabolite biomarkers for pancreatic neuroendocrine tumours using a metabolomic approach. European Journal of Endocrinology 192(4), April 2025, 466-480. HAL, DOI
- 8 article. Association between sarcopenia and risk of major adverse cardiac and cerebrovascular events-UK Biobank database. Journal of the American Geriatrics Society 72(3), November 2023, 693-706. HAL, DOI
- 9 article. Joint latent class model: Simulation study of model properties and application to amyotrophic lateral sclerosis disease. BMC Medical Research Methodology 21(1), September 2021, 198. HAL, DOI
- 10 article. Fusion regression methods with repeated functional data. Computational Statistics and Data Analysis 203, March 2025, 108069. HAL, DOI
- 11 article. Categorical Functional Data Analysis. The cfda R Package. Mathematics 9(23), December 2021, 31. HAL, DOI
- 12 article. Principal component analysis of multivariate spatial functional data. Big Data Research 39, February 2025, 100504. HAL, DOI
- 13 article. Forecasting mortality rates with functional signatures. ASTIN Bulletin, January 2025, 1-24. HAL, DOI
12.2 Publications of the year
International journals
Invited conferences
International peer-reviewed conferences
Conferences without proceedings
Scientific books
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Other scientific publications
Scientific popularization
12.3 Cited publications
- 55 article. Long-term outcome of oesophageal atresia in adolescence (TransEAsome): a national French cohort study protocol. BMJ Open 15(1), 2025