Keywords
Computer Science and Digital Science
- A3.1.4. Uncertain data
- A3.2.3. Inference
- A3.3.2. Data mining
- A3.3.3. Big data analysis
- A3.4.1. Supervised learning
- A3.4.2. Unsupervised learning
- A3.4.5. Bayesian methods
- A3.4.7. Kernel methods
- A5.2. Data visualization
- A5.9.2. Estimation, modeling
- A6.2.3. Probabilistic methods
- A6.2.4. Statistical methods
- A6.3.3. Data processing
- A9.2. Machine learning
Other Research Topics and Application Domains
- B2.2.3. Cancer
- B9.5.6. Data science
- B9.6.3. Economy, Finance
- B9.6.5. Sociology
1 Team members, visitors, external collaborators
Research Scientists
- Christophe Biernacki [Inria, Senior Researcher, HDR]
- Benjamin Guedj [Inria, Researcher]
- Hemant Tyagi [Inria, Researcher]
Faculty Members
- Cristian Preda [Team leader, Université de Lille, Professor, HDR]
- Sophie Dabo [Univ Henri Poincaré, Professor]
- Guillemette Marot [Université de Lille, Associate Professor, HDR]
- Vincent Vandewalle [Université de Lille, Associate Professor, HDR]
Post-Doctoral Fellows
- Ernesto Javier Araya Valdivia [Inria]
- Florent Dewez [Inria, until Jan 2021]
- Valentina Zantedeschi [Inria]
PhD Students
- Reuben Adams [University College London]
- Filippo Antonazzo [Inria]
- Felix Biggs [University College London]
- Rajeev Bopche [Inria, until May 2021]
- Guillaume Braun [Institut national de la statistique et des études économiques]
- Theophile Cantelobre [Inria, from Oct 2021]
- Maxime Haddouche [Université de Lille, from Oct 2021]
- Wilfried Heyse [INSERM]
- Eglantine Karlé [Inria]
- Etienne Kronert [Worldline]
- Issam Ali Moindjie [Inria]
- Axel Potier [Groupe Adeo, CIFRE]
- Antonin Schrab [University College London]
- Antoine Vendeville [University College London]
- Luxin Zhang [Worldline, CIFRE]
Technical Staff
- Maxime Brunin [Inria, Engineer, until Mar 2021]
- Ismat Yahia Chaib Draa [Société Alicante, Seclin, Engineer, from Sep 2021]
- Florent Dewez [Inria, Engineer, from Feb 2021 until Apr 2021]
Interns and Apprentices
- Myriam Benbahlouli [Inria, Apprentice, from Oct 2021]
- Claire Devisme [Inria, from Apr 2021 until Aug 2021]
- Ahoua Jean Marc Ehile [Inria, from Mar 2021 until Aug 2021]
- Elias Giraud-Audine [NC, Jul 2021]
- Valentin Kilian [École normale supérieure de Rennes, from May 2021 until Jul 2021]
- Ilyas Lebleu [École Normale Supérieure de Paris, from Apr 2021 until Aug 2021]
- Cecilia Alejandra Rivera Martinez [École Nationale Supérieure d'Arts et Métiers, from Mar 2021 until Aug 2021]
- Seydina Mouhamed Sow [Inria, from Mar 2021 until Aug 2021]
Administrative Assistant
- Anne Rejl [Inria]
External Collaborator
- Alain Celisse [Univ Panthéon Sorbonne, HDR]
2 Overall objectives
2.1 Context
In several respects, modern society has strengthened the need for statistical analysis from both an applied and a theoretical point of view. This stems from the easier availability of data thanks to technological breakthroughs (storage, transfer, computing); data are now so widespread that they are no longer limited to large human organizations. The more or less conscious goal of this data availability is the expectation of improving the two age-old statistical tasks, namely discovering new knowledge and making better predictions. These two central tasks can be referred to respectively as unsupervised learning and supervised learning, even if statistics is not limited to them and other names exist depending on the community. In short, it pursues the following hope: “more data for better quality and more numerous results”.
However, today's data are increasingly complex. They gather mixed-type features (for instance continuous data mixed with categorical data), missing or partially missing items (like intervals) and numerous variables (high-dimensional situations). As a consequence, the target “better quality and more numerous results” of the previous adage (both parts matter: “better quality” and also “more numerous”) cannot be reached in a somewhat “manual” way, but should inevitably rely on some theoretical formalization and guarantee. Indeed, data can be so numerous and so complex (they can live in quite abstract spaces) that the “empirical” statistician is quickly overwhelmed. Since data are by nature subject to randomness, the probabilistic framework is a very sensible theoretical environment to serve as a general guide for modern statistical analysis.
2.2 Goals
Modal is a project-team working on today's complex data sets (mixed data, missing data, high-dimensional data), for classical statistical targets (unsupervised learning, supervised learning, regression, etc.) with approaches relying on the probabilistic framework. The latter can be tackled through both model-based methods (such as mixture models, a generic tool) and model-free methods (such as probabilistic bounds on empirical quantities). Furthermore, Modal is connected to the real world by applications, typically biological ones (some members have this expertise), but many others are also considered since the application coverage of the Modal methodology is very broad. It is also important to note that, in return, applications are often real opportunities for initiating academic questioning for the statistician (as with some projects handled by the bilille platform and some bilateral contracts of the team).
From the academic communities' point of view, Modal can be seen as belonging simultaneously to both the statistical learning and machine learning communities, as attested by its publications. In this respect, it is an opportunity to build a bridge between these two stochastic communities around a common and broad probabilistic framework.
3 Research program
3.1 Research axis 1: Unsupervised learning
Scientific locks related to unsupervised learning are numerous, concerning the validity of the clustering outcome, the ability to handle different kinds of data, the treatment of missing data, the dimensionality of the data set, etc. Many of them are addressed by the team, leading to publications, often with the delivery of a specific package (sometimes upgraded into software or even into a platform grouping several software packages). Because of the variety of the scope, this axis involves nearly all the permanent team members, often with PhD students and some engineers. The related works are always embedded inside a probabilistic framework, typically model-based approaches but also model-free ones like PAC-Bayes (PAC stands for Probably Approximately Correct), because such a mathematical environment offers both a well-posed problem and a rigorous answer.
3.2 Research axis 2: Performance assessment
One main concern of the Modal team is to provide theoretical justifications for the procedures it designs. Such guarantees are important to avoid misleading conclusions resulting from any unsuitable use. For example, one ingredient in proving these guarantees is the use of the PAC framework, leading to finite-sample concentration inequalities. More precisely, contributions to PAC learning rely on classical empirical process theory and on PAC-Bayesian theory. The Modal team exploits such non-asymptotic tools to analyze the performance of iterative algorithms (such as gradient descent), cross-validation estimators, online change-point detection procedures, ranking algorithms, matrix factorization techniques and clustering methods, for instance. The team also develops expertise in the formal dynamical study of algorithms related to mixture models (important models used in the unsupervised setting above), such as degeneracy for the EM algorithm or label switching for the Gibbs sampler.
3.3 Research axis 3: Functional data
Mainly due to technological advances, functional data are more and more widespread in many application domains. Functional data analysis (FDA) is concerned with the modeling of data such as curves, shapes, images or more complex mathematical objects, thought of as smooth realizations of a stochastic process (an infinite-dimensional data object valued in a space of possibly infinite dimension, e.g. the space of squared integrable functions). Time series are an emblematic example, even if FDA is not limited to them (spectral data, spatial data, etc.). Basically, FDA considers that data correspond to realizations of stochastic processes, usually assumed to live in a metric, semi-metric, Hilbert or Banach space. One may consider independent or dependent (in time or space) functional data objects of different types (qualitative, quantitative, ordinal, multivariate, time-dependent, spatial-dependent, etc.). The last decade has seen a dynamic literature on parametric and non-parametric FDA approaches (principal component analysis, clustering, regression, prediction) for different types of data and applications to various domains.
3.4 Research axis 4: Applications motivating research
The fourth axis consists in translating real application issues into statistical problems raising new (academic) challenges for the models developed in the Modal team. CIFRE PhDs in industry and interdisciplinary projects with research teams in health and biology are at the core of this objective. The main originality of this objective lies in the use of statistics with complex data, including in particular ultra-high-dimensional problems. We focus on real applications which cannot be solved by classical data analysis.
4 Application domains
4.1 Economic world
The Modal team applies its research to the economic world through the supervision of CIFRE PhDs with companies such as CACF (credit scoring), A-Volute (expert in 3D sound), Meilleur Taux (insurance comparator) and Worldline. It also has several contracts with companies such as COLAS, Nokia-Apsys/Airbus, Safety Line (through the PERF-AI consortium) and the Agence d'Urbanisme Métropole Européenne de Lille.
4.2 Biology and health
The second main application domain of the team is biology and health. Members of the team are involved in the supervision and scientific animation of bilille, the bioinformatics platform of Lille, and of OncoLille Institute. Members of the team also co-supervise PhD students of Inserm teams.
5 Social and environmental responsibility
MODAL does not have any specific social and environmental responsibility actions.
6 Highlights of the year
Action Exploratoire (AEx)
Participants: Sophie Dabo, Christophe Biernacki, Cristian Preda, Vincent Vandewalle, Guillemette Marot.
Project acronym and name:
PATH (PAtient PaThway in the Hospital environment)
Collaboration: CHU Lille and Faculty of Medicine (METRICS team)
Researchers involved: Jean-Baptiste Beuscart, Grégoire Ficheur, Emmanuel Chazard, Michaël Genin, Antoine Lamer, Génia Babykina, Cyrielle Dumont.
European healthcare systems are faced with multiple challenges, including an aging population, an increase in chronic diseases and patients with multiple illnesses, and limited financial and human resources. The response to these challenges relies in particular on the organization of care into care pathways, justified by the scientific literature and supported in France by political orientations. The analysis of care pathways and their adequacy to needs and resources has thus become a major scientific and administrative challenge. Although the numerical data available for this purpose are rapidly increasing, the methods and statistical tools available to researchers and health authorities remain limited and inefficient. PATH proposes to develop statistical methods for the construction/analysis of the patient pathway through two applications dealing with the re-hospitalization of the elderly and post-operative complications.
Inria London.
Benjamin Guedj has led the emerging Inria London Programme since 2019. The partnership involves Inria and University College London (UCL); the official kickoff took place on February 1st, 2021.
7 New software and platforms
7.1 New software
7.1.1 MixtComp.V4
-
Keywords:
Clustering, Statistics, Missing data, Mixed data
-
Functional Description:
MixtComp (Mixture Computation) is a model-based clustering package for mixed data originating from the Modal team (Inria Lille). It has been engineered around the idea of easy and quick integration of new univariate models, under the conditional independence assumption. New models will eventually become available from research carried out by the Modal team or by other teams. The central architecture of MixtComp is now in place and its functionality has been field-tested through industry partnerships. Five basic models (Gaussian, Multinomial, Poisson, Weibull, NegativeBinomial) are implemented, as well as two advanced models (Functional and Rank). MixtComp natively manages missing data (completely missing or missing by interval). MixtComp is used as an R package, but its internals are coded in C++ using state-of-the-art libraries for faster computation. A minimal sketch of the underlying mixed-data mixture idea is given after this software entry.
-
Release Contributions:
- New I/O system - Replacement of regex library - Improvement of initialization - Criteria for stopping the algorithm - Added management of partially missing data for several models - User documentation - Adding user features in R
-
Contact:
Christophe Biernacki
-
Participants:
Christophe Biernacki, Vincent Kubicki, Matthieu Marbac-Lourdelle, Serge Iovleff, Quentin Grimonprez, Etienne Goffinet
-
Partners:
Université de Lille, CNRS
7.1.2 MASSICCC
-
Name:
Massive Clustering with Cloud Computing
-
Keywords:
Statistic analysis, Big data, Machine learning, Web Application
-
Scientific Description:
The web application lets users use several software packages developed by Inria directly in a web browser. Mixmod is a classification library for continuous and categorical data. MixtComp allows for missing data and a larger choice of data types. BlockCluster is a library for co-clustering of data. When using the web application, the user can first upload a data set, then configure a job using one of the libraries mentioned and start the execution of the job on a cluster. The results are then displayed directly in the browser, allowing for rapid understanding and interactive visualisation.
-
Functional Description:
The MASSICCC web application offers a simple and dynamic interface for analysing heterogeneous data with a web browser. Various software packages for statistical analysis are available (Mixmod, MixtComp, BlockCluster) which allow for supervised and unsupervised classification of large data sets.
- URL:
-
Contact:
Christophe Biernacki
7.1.3 cfda
-
Name:
Categorical functional data analysis
-
Keyword:
Functional data
-
Functional Description:
The R package cfda performs: - descriptive statistics for categorical functional data - dimension reduction and optimal encoding of states (multiple correspondence analysis extended to functional data)
- URL:
-
Contact:
Cristian Preda
-
Participants:
Cristian Preda, Quentin Grimonprez, Vincent Vandewalle
-
Partner:
Université de Lille
7.1.4 PyRotor
-
Name:
Python Route Trajectory Optimiser
-
Keywords:
Optimization, Machine learning, Trajectory Modeling
-
Scientific Description:
PyRotor is a Python implementation of the trajectory optimisation method introduced in the paper: “An end-to-end data-driven optimisation framework for constrained trajectories”
The method proposes trajectories optimizing a given criterion. Unlike classical approaches (such as optimal control), the method is based on the information contained in the available data. This permits restricting the search space to a neighborhood of the observed trajectories and incorporating the correlations estimated from the data. This is achieved by means of a regularization term in the cost function. An iterative approach is also developed to verify additional constraints.
-
Functional Description:
PyRotor leverages available trajectory data to focus the search space and to estimate some properties which are then incorporated in the optimisation problem. This naturally and simply constrains the optimisation problem, whose solution inherits realistic patterns from the data. In particular, PyRotor does not require any knowledge of the dynamics of the system.
-
News of the Year:
Methodology development and implementation of the first results
- URL:
- Publication:
-
Contact:
Florent Dewez
-
Participants:
Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle
7.1.5 ClusPred
-
Name:
Simultaneous Semi-Parametric Estimation of Clustering and Regression
-
Keywords:
Regression, Clustering, Semi-parametric model, Finite mixture
-
Functional Description:
Parameter estimation of regression models with fixed group effects, when the group variable is missing while group-related variables are available. Parametric and semi-parametric approaches described in Marbac et al. (2020) <arXiv:2012.14159> are implemented in this R package.
- URL:
-
Authors:
Matthieu Marbac-Lourdelle, Mohammed Sedki, Christophe Biernacki, Vincent Vandewalle
-
Contact:
Matthieu Marbac-Lourdelle
7.1.6 MPAGenomics
-
Name:
Multi-Patient Analysis of Genomic markers
-
Keywords:
Segmentation, Genomics, Marker selection, Biostatistics
-
Scientific Description:
MPAgenomics is an R package for multi-patient analysis of genomic markers. It enables the study of several copy number and SNP data profiles at the same time. It wraps commonly used packages to offer a pipeline for beginners in R. It also proposes a dedicated way of choosing some crucial parameters, changing default values which were not adapted in the original packages. For multi-patient analysis, it wraps some penalized regression methods implemented in HDPenReg.
-
Functional Description:
MPAgenomics provides functions to preprocess and analyze genomic data. It is devoted to: (i) efficient segmentation and (ii) genomic marker selection from multi-patient copy number and SNP data profiles.
-
Release Contributions:
The initial version of MPAGenomics relied on the CGHSeg R package, which provided fast segmentation. However, CGHSeg is no longer maintained, and as it was not developed by the MODAL team, we did not want to maintain it. In order to keep MPAGenomics on CRAN, Samuel Blanck created a new version of MPAGenomics without the dependence on CGHSeg (version 1.2.3) and offers a complete version with the dependence on CGHseg (which is of good quality, even if no longer maintained) at https://github.com/sblanck/MPAgenomics.
- URL:
-
Contact:
Samuel Blanck
-
Participants:
Guillemette Marot, Quentin Grimonprez, Samuel Blanck
-
Partner:
Université de Lille
7.1.7 visCorVar
-
Name:
visualization of correlated variables in the context of statistical integration of omics data
-
Keywords:
Data integration, Visualization
-
Functional Description:
The R package visCorVar allows visualizing results from data integration performed with the function block.splsda (Bioconductor mixOmics package). The data integration is performed for different types of omics datasets (transcriptomics, metabolomics, metagenomics) in order to select variables of an omics dataset that are correlated with the variables of the other omics datasets and with the response variables, and to predict the class membership of a new sample. These correlated variables can be visualized with correlation circles and networks.
- URL:
-
Contact:
Guillemette Marot
-
Participants:
Maxime Brunin, Guillemette Marot, Pierre Pericard
-
Partner:
Université de Lille
7.1.8 metaRNASeq
-
Name:
RNA-Seq data meta-analysis
-
Keywords:
Transcriptomics, Meta-analysis, Differential analysis, High throughput sequencing, Biostatistics
-
Functional Description:
MetaRNASeq is specialised software for RNA-seq experiments. It is an R package adapted from the metaMA package, which performs meta-analysis of microarray data. Both take advantage of empirical Bayesian approaches, especially appropriate in a high-dimensional context. The specificities of the two types of technologies nevertheless require adaptations to each one, explaining the development of two different packages. To facilitate their use by a large audience, a Galaxy web instance named SMAGEXP has been created and gathers the two packages.
-
Release Contributions:
Minimal maintenance was ensured to correct a bug reported by a user, which occurred on Windows systems but not on Linux. This bug was related to the treatment of missing values. Guillemette Marot, who created and largely contributed to the initial versions of the metaRNASeq package, handed over the maintenance in September 2021 to Samuel Blanck, engineer in the METRICS ULR2694 team (Univ. Lille, CHU Lille).
- URL:
-
Contact:
Guillemette Marot
-
Participants:
Guillemette Marot, Andrea Rau, Samuel Blanck
-
Partners:
INRAE, Université de Lille
7.1.9 HDSpatialScan
-
Name:
Multivariate and Functional Spatial Scan Statistics
-
Keywords:
Functional data, Clustering, Spatial information, Multivariate data
-
Functional Description:
Allows the detection of spatial clusters of abnormal values in multivariate or functional data.
-
Contact:
Sophie Dabo
7.2 New platforms
7.2.1 MASSICCC Platform
Participants: Christophe Biernacki, Julien Vandaele.
MASSICCC is a demonstration platform giving access, through a SaaS (software as a service) concept, to data analysis libraries developed at Inria. It allows obtaining results either directly through a website-specific display (specific and interactive visual outputs) or through an R data object download. It started in October 2015 for two years and is common to the Modal team (Inria Lille) and the Select team (Inria Saclay). In 2016, two packages were integrated: Mixmod and MixtComp (see the specific section about MixtComp). In 2017, the BlockCluster package was integrated, and particular attention to providing meaningful graphical outputs (for Mixmod, MixtComp and BlockCluster) directly in the web platform itself led to specific developments. In 2019, a new version of the MixtComp software was developed. In 2020, Julien Vandaele joined the MODAL team as a research engineer to upgrade the MixtComp software and to replace the MASSICCC platform by three R notebooks dedicated to the three packages Mixmod, BlockCluster and MixtComp. All these notebooks can be found on the MODAL webpage.
8 New results
8.1 Axis 1: Model-based Co-clustering for Ordinal Data of Different Dimensions
Participants: Christophe Biernacki.
This work has been motivated by a psychological survey on women affected by a breast tumor. Patients replied at different moments of their treatment to questionnaires with answers on an ordinal scale. The questions relate to aspects of their life called dimensions. To assist the psychologists in analyzing the results, it is useful to emphasize a structure in the dataset. The clustering method achieves that by creating groups of individuals that are depicted by a representative of the group. From a psychological standpoint, it is also useful to observe how questions may be grouped. This is why a clustering should also be performed on the features, which is called a co-clustering problem. However, gathering questions that are not related to the same dimension does not make sense from a psychologist's standpoint. Therefore, the present work performs constrained co-clustering, aiming to prevent questions from different dimensions from being assembled in the same column-cluster. In addition, the evolution of co-clusters along time has been investigated. The method relies on a constrained Latent Block Model embedding a probability distribution for ordinal data. Parameter estimation relies on a Stochastic EM algorithm associated with a Gibbs sampler, and the ICL-BIC criterion is used for selecting the numbers of co-clusters. The resulting work was accepted in an international journal in 2019 and the related R package ordinalClust has been accepted this year in another international journal 26.
This is a joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2, and Florence Cousson-Gélie from Université Paul Valéry Montpellier 3.
8.2 Axis 1: Gaussian-based Visualization of Gaussian and non-Gaussian Model-based Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
A generic method is introduced to visualize, in a Gaussian-like way and onto R^2, results of Gaussian or non-Gaussian model-based clustering. The key point is to explicitly force a spherical Gaussian mixture visualization to inherit the within-cluster overlap which is present in the initial clustering mixture. The result is a particularly user-friendly drawing of the clusters, allowing any practitioner to get a thorough overview of the potentially complex clustering result. An entropic measure informs about the quality of the drawn overlap, in comparison with the true one in the initial space. The proposed method is illustrated on four real data sets of different types (categorical, mixed, functional and network) and is implemented in the R package ClusVis. This work was published last year in an international journal 85 and has been presented at a conference 40. This is a joint work with Matthieu Marbac from ENSAI.
8.3 Axis 1: Dealing with Missing Data in Model-based Clustering through a MNAR Model
Participants: Christophe Biernacki.
Since the 90s, model-based clustering has been widely used to classify data. Nowadays, with the increase of available data, missing values are more frequent. Traditional ways to deal with them consist in obtaining a filled data set, either by discarding missing values or by imputing them. In the first case, some information is lost; in the second case, the final clustering purpose is not taken into account in the imputation step. Thus, both solutions risk blurring the clustering estimation. Alternatively, we defend the need to embed the missingness mechanism directly within the clustering modeling step. There exist three types of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In all situations, logistic regression is proposed as a natural and flexible candidate model. In particular, its flexibility allows us to design meaningful parsimonious variants, such as dependency on the missing values themselves or on the cluster label. In this unified context, standard model selection criteria can be used to select between such different missing-data mechanisms, simultaneously with the number of clusters. The practical interest of our proposal is illustrated on data derived from medical studies with many missing values. Currently, a preprint is being finalized for submission to an international journal and a talk has been given at a conference 37. A more general talk on missing data and its impact on mixtures and clustering was also given this year in a workshop 38.
It is a joint work with Claire Boyer from Sorbonne Université, Gilles Celeux from Inria Saclay, Julie Josse from Inria Montpellier, Fabien Laporte from Institut Pasteur and Matthieu Marbac from ENSAI.
8.4 Axis 1: Predictive Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
Many data, for instance in biostatistics, contain sets of variables which permit evaluating unobserved traits of the subjects (e.g. we ask questions about how many pizzas, hamburgers, chips, etc. are eaten in order to assess how healthy the subjects' food habits are). Moreover, we often want to measure the relations between these unobserved traits and some target variables (e.g. obesity). Thus, a two-step procedure is often used: first, a clustering of the observations is performed on the sets of variables related to the same topic; second, the predictive model is fitted by plugging the estimated partitions in as covariates. Generally, the estimated partitions are not exactly equal to the true ones. We investigate the impact of these measurement errors on the estimators of the regression parameters, and we explain when this two-step procedure is consistent. We also present a specific EM algorithm which simultaneously estimates the parameters of the clustering and predictive models. A paper has now been accepted in an international journal 24 and the work has been presented at a national conference 34.
It is a joint work with Matthieu Marbac from ENSAI and Mohammed Sedki from Université Paris-Saclay.
8.5 Axis 1: A Binned Technique for Scalable Model-based Clustering on Huge Datasets
Participants: Filippo Antonazzo, Christophe Biernacki.
Clustering is impacted by the regular increase of sample sizes, which provides an opportunity to reveal information previously out of scope. However, the volume of data leads to issues related to the need for large computational resources and also to high energy consumption. Resorting to binned data on an adaptive grid is expected to give a proper answer to such green-computing concerns while not harming the quality of the related estimation. After a brief review of existing methods, a first application in the context of univariate model-based clustering is provided, with a numerical illustration of its advantages. Finally, an initial formalization of the multivariate extension is done, highlighting both issues and possible strategies. This work has been submitted to an international journal 54, a short version has been accepted in a book of short papers associated with an international conference 47, and the work has also led to a talk in an international workshop and to a seminar.
It is a joint work with Christine Keribin from Université Paris-Saclay.
8.6 Axis 1: Regularized spectral methods for clustering signed networks
Participants: Hemant Tyagi.
We study the problem of k-way clustering in signed graphs. Considerable attention in recent years has been devoted to analyzing and modeling signed graphs, where the affinity measure between nodes takes either positive or negative values. Recently, Cucuringu et al. [CDGT 2019] proposed a spectral method, namely SPONGE (Signed Positive over Negative Generalized Eigenproblem), which casts the clustering task as a generalized eigenvalue problem optimizing a suitably defined objective function. This approach is motivated by social balance theory, where the clustering task aims to decompose a given network into disjoint groups, such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are mainly connected by negative edges. Through extensive numerical simulations, SPONGE was shown to achieve state-of-the-art empirical performance. On the theoretical front, [CDGT 2019] analyzed SPONGE and the popular Signed Laplacian method under the setting of a Signed Stochastic Block Model (SSBM), for equal-sized clusters, in the regime where the graph is moderately dense. In this work, we build on the results in [CDGT 2019] on two fronts for the normalized versions of SPONGE and the Signed Laplacian. Firstly, for both algorithms, we extend the theoretical analysis in [CDGT 2019] to the general setting of unequal-sized clusters in the moderately dense regime. Secondly, we introduce regularized versions of both methods to handle sparse graphs, a regime where standard spectral methods underperform, and provide theoretical guarantees under the same SSBM model. To the best of our knowledge, regularized spectral methods have so far not been considered in the setting of clustering signed graphs. We complement our theoretical results with an extensive set of numerical experiments on synthetic data.
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom), Apoorv Vikram Singh (NYU) and Deborah Sulem (University of Oxford, United Kingdom). It was initiated when Apoorv Vikram Singh visited the MODAL team to work with Hemant Tyagi from Oct 2019 to Jan 2020. This has now been accepted for publication in the Journal of Machine Learning Research 12. A summary of the results was presented at the GCLR (Graphs and more Complex structures for Learning and Reasoning) workshop at AAAI 2021.
8.7 Axis 1: Dynamic Ranking with the BTL Model: A Nearest Neighbor based Rank Centrality Method
Participants: Eglantine Karlé, Hemant Tyagi.
Many applications such as recommendation systems or sports tournaments involve pairwise comparisons within a collection of n items, the goal being to aggregate the binary outcomes of the comparisons in order to recover the latent strengths and/or global ranking of the items. In recent years, this problem has received significant interest from a theoretical perspective, with a number of methods being proposed along with associated statistical guarantees under the assumption of a suitable generative model. These results typically collect the pairwise comparisons in a single comparison graph G; however, in many applications, such as the outcomes of soccer matches during a tournament, the nature of pairwise outcomes can evolve with time. Theoretical results for such a dynamic setting are relatively limited compared to the aforementioned static setting. We study in this paper an extension of the classic BTL (Bradley-Terry-Luce) model for the static setting to our dynamic setup, under the assumption that the probabilities of the pairwise outcomes evolve smoothly over the time domain [0, 1]. Given a sequence of comparison graphs observed on a regular grid T of [0, 1], we aim at recovering the latent strengths of the items at any time t in [0, 1]. To this end, we adapt the Rank Centrality method, a popular spectral approach for ranking in the static case, by locally averaging the available data on a suitable neighborhood of t. When the comparison graphs form a sequence of Erdős-Rényi graphs, we provide non-asymptotic ℓ2 and ℓ∞ error bounds for estimating the latent strengths, which in particular establish the consistency of this method in terms of n and the grid size |T|. We also complement our theoretical analysis with experiments on real and synthetic data. This PhD work 65 has been submitted to a journal and is currently under review.
8.8 Axis 1: An extension of the angular synchronization problem to the heterogeneous setting
Participants: Hemant Tyagi.
Given an undirected measurement graph G = ([n], E), the classical angular synchronization problem consists of recovering unknown angles θ_1, ..., θ_n from a collection of noisy pairwise measurements of the form (θ_i − θ_j) mod 2π, for each edge {i, j} of E. This problem arises in a variety of applications, including computer vision, time synchronization of distributed networks, and ranking from preference relationships. In this paper, we consider a generalization to the setting where there exist k unknown groups of angles θ_{l,1}, ..., θ_{l,n}, for l = 1, ..., k. For each edge {i, j} of E, we are given noisy pairwise measurements of the form (θ_{l,i} − θ_{l,j}) mod 2π for an unknown group index l. This can be thought of as a natural extension of the angular synchronization problem to the heterogeneous setting of multiple groups of angles, where the measurement graph has an unknown edge-disjoint decomposition G_1, ..., G_k, where the G_l's denote the subgraphs of edges corresponding to each group. We propose a probabilistic generative model for this problem, along with a spectral algorithm for which we provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise. The theoretical findings are complemented by a comprehensive set of numerical experiments, showcasing the efficacy of our algorithm under various parameter regimes. Finally, we consider an application of bi-synchronization to the graph realization problem, and provide along the way an iterative graph disentangling procedure that uncovers the subgraphs G_l, which is of independent interest, as it is shown to improve the final recovery accuracy across all the experiments considered.
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom) and has been accepted for publication in the journal Foundations of Data Science 13.
8.9 Axis 1&2: Clustering on Multilayer Graphs with Missing Values
Participants: Christophe Biernacki, Guillaume Braun, Hemant Tyagi.
Clustering of multilayer graphs has gained increasing interest over the last decade due to numerous applications in various fields. Several clustering methods have been proposed, but they all rely on the assumption that the network is fully observed. We propose a statistical framework to handle nodes that are missing in some layers, as well as a method to estimate the model parameters and to impute missing edge values.
This work has been published in an international conference 41.
8.10 Axis 1&2: An iterative clustering algorithm for the Contextual Stochastic Block Model with optimality guarantees
Participants: Christophe Biernacki, Guillaume Braun, Hemant Tyagi.
Real-world networks often come with side information that can help to improve the performance of network analysis tasks such as clustering. Despite a large number of empirical and theoretical studies conducted on network clustering methods during the past decade, the added value of side information and the methods used to incorporate it optimally in clustering algorithms are relatively less understood. We propose a new iterative algorithm to cluster networks with side information for nodes (in the form of covariates) and show that our algorithm is optimal under the Contextual Symmetric Stochastic Block Model. Our algorithm can be applied to general Contextual Stochastic Block Models and avoids hyperparameter tuning in contrast to previously proposed methods. We confirm our theoretical results on synthetic data experiments where our algorithm significantly outperforms other methods, and show that it can also be applied to signed graphs. Finally we demonstrate the practical interest of our method on real data.
This work has been submitted to an international conference and is currently available as a preprint 57.
8.11 Axis 2: Denoising modulo samples: k-NN regression and tightness of SDP relaxation
Participants: Hemant Tyagi.
Many modern applications involve the acquisition of noisy modulo samples of a function f, with the goal being to recover estimates of the original samples of f. For a Lipschitz function f, suppose we are given the samples y_i = (f(x_i) + η_i) mod 1, where η_i denotes noise. Assuming the η_i are zero-mean i.i.d. Gaussians and the x_i's form a uniform grid, we derive a two-stage algorithm that recovers estimates of the samples f(x_i) with a uniform error rate holding with high probability. The first stage involves embedding the points on the unit complex circle, and obtaining denoised estimates of f(x_i) mod 1 via a kNN (nearest neighbor) estimator.
Recently, Cucuringu and Tyagi proposed an alternative way of denoising modulo 1 data which works with their representation on the unit complex circle. They formulated a smoothness-regularized least squares problem on the product manifold of unit circles, where the smoothness is measured with respect to the Laplacian of a proximity graph involving the x_i's. This is a nonconvex quadratically constrained quadratic program (QCQP), hence they proposed solving its semidefinite programming (SDP) relaxation. We derive sufficient conditions under which the SDP is a tight relaxation of the QCQP. Hence, under these conditions, the global solution of the QCQP can be obtained in polynomial time.
This is joint work with Michael Fanuel (KU Leuven). It has been accepted for publication in the journal Information and Inference: A Journal of the IMA 17.
8.12 Axis 2: Error analysis for denoising smooth modulo signals on a graph
Participants: Hemant Tyagi.
In many applications, we are given access to noisy modulo samples of a smooth function with the goal being to robustly unwrap the samples, i.e. to estimate the original samples of the function. In a recent work, Cucuringu and Tyagi proposed denoising the modulo samples by first representing them on the unit complex circle and then solving a smoothness regularized least squares problem – the smoothness measured w.r.t. the Laplacian of a suitable proximity graph – on the product manifold of unit circles. This problem is a quadratically constrained quadratic program (QCQP) which is nonconvex, hence they proposed solving its sphere-relaxation leading to a trust region subproblem (TRS). In terms of theoretical guarantees, error bounds were derived for (TRS). These bounds are however weak in general and do not really demonstrate the denoising performed by (TRS).
In this work, we analyse the (TRS) as well as an unconstrained relaxation of the (QCQP). For both these estimators we provide a refined analysis in the setting of Gaussian noise and derive noise regimes where they provably denoise the modulo observations w.r.t. the ℓ2 norm. The analysis is performed in a general setting where the proximity graph is any connected graph.
This work has been accepted for publication in the journal Applied and Computational Harmonic Analysis 27.
8.13 Axis 2: Recovering Hölder smooth functions from noisy modulo samples
Participants: Hemant Tyagi.
In signal processing, several applications involve the recovery of a function given noisy modulo samples. The setting considered in this paper is that the samples, corrupted by additive Gaussian noise, are wrapped due to the modulo operation. Typical examples of this problem arise in phase unwrapping problems or in the context of self-reset analog-to-digital converters. We consider a fixed design setting where the modulo samples are given on a regular grid. Then, a three-stage recovery strategy is proposed to recover the ground truth signal up to a global integer shift. The first stage denoises the modulo samples by using local polynomial estimators. In the second stage, an unwrapping algorithm is applied to the denoised modulo samples on the grid. Finally, a spline-based quasi-interpolant operator is used to yield an estimate of the ground truth function up to a global integer shift. For a function in a Hölder class, uniform error rates are given for the recovery performance with high probability. This extends recent results obtained by Fanuel and Tyagi for Lipschitz smooth functions, wherein kNN regression was used in the denoising step.
This is joint work with Michaël Fanuel (CRIStAL, Université de Lille) and was presented in an invited session on Computational Sampling at an international conference (ASILOMAR 2021). The paper 63 will appear in the proceedings of the conference.
8.14 Axis 2: Asymptotic efficiency of some nonparametric tests for location on hyperspheres
Participants: Sophie Dabo-Niang.
In the paper, we show that several classical nonparametric tests for multivariate location in the Euclidean case can be adapted to nonparametric tests for the location problem on hyperspheres. The tests we consider are spatial sign and spatial signed-rank tests for location on hyperspheres. We compute the asymptotic powers of the latter tests in the classical rotationally symmetric case. In particular, we show that the spatial signed-rank test uniformly dominates the spatial sign test and has performance extremely close to that of the asymptotically optimal test in the well-known von Mises-Fisher case. Monte-Carlo simulations confirm our asymptotic results.
It is a joint work with Baba Thiam (University of Lille, Painlevé) and Thomas Verdebout (ULB, Belgium). This work has been submitted for publication 62.
8.15 Axis 2: k-nearest neighbors prediction and classification for spatial data
Participants: Sophie Dabo-Niang.
This paper proposes a spatial k-nearest neighbor method for nonparametric prediction of real-valued spatial data and supervised classification of categorical spatial data. The proposed method is based on a double nearest neighbor rule which combines two kernels to control the distances between observations and between locations. It uses a random bandwidth in order to fit the distributions of the covariates more appropriately. The almost complete convergence with rate of the proposed predictor is established, and the almost sure convergence of the supervised classification rule is deduced. Finite sample properties are given for two applications of the k-nearest neighbor prediction and classification rule.
It is a joint work with Mohamed Salem Ahmed (University of Lille, CERIM), Mohamed Attouch (University Sidi Bel Abbes, Algeria) and Mamadou Ndiaye (UCAD, Senegal). This work is under revision 53.
8.16 Axis 3: Regression models for spatially distributed autoregressive functional data
Participants: Sophie Dabo-Niang.
A functional linear autoregressive spatial model, where the explanatory variable takes values in a function space while the response process is real-valued and spatially autocorrelated, is proposed. The specificity of the model is due to the functional nature of the explanatory variable and the structure of a spatial weight matrix that defines the spatial dependency between neighbors. The estimation procedure consists of reducing the infinite dimension of the functional explanatory variable and maximizing the quasi-maximum likelihood. We establish the consistency and asymptotic normality of the estimator. The ability of the methodology is illustrated via simulations and by application to real data.
It is a joint work with Mohamed Salem Ahmed (University of Lille, CERIM), Zied Gharbi (University of Lille) and Laurence Broze (University of Lille). This work has been published in the book Geostatistical Functional Data Analysis: Theory and Methods, J. Mateu and R. Giraldo (Eds), John Wiley and Sons, Chichester, UK, ISBN: 978-1-119-38784-8.
8.17 Axis 3: Non-parametric statistical analysis of spatially distributed functional data
Participants: Sophie Dabo-Niang.
A nonparametric estimator of the regression function of a scalar spatial variable given a functional spatial variable is proposed. Mean square and almost complete consistencies of the estimator are obtained when the sample considered is an α-mixing sequence composed of non-i.i.d. observations. Lastly, an application to spatial prediction and numerical results are provided to illustrate the behavior of our estimator.
It is a joint work with Baba Thiam (University of Lille), Camille Ternynck (University of Lille, CERIM), Anne-Françoise Yao (University of Clermont Auvergne). This work has been published in the book Geostatistical Functional Data Analysis: Theory and Methods. J. Mateu and R. Giraldo (Eds). John Wiley and Sons, Chichester, UK. ISBN: 978-1-119-38784-8
8.18 Axis 3: Clustering spatial functional data
Participants: Vincent Vandewalle, Cristian Preda, Sophie Dabo-Niang.
In this work we present two approaches for clustering spatial functional data. The first one is the model-based clustering that uses the concept of density for functional random variables. The second one is the hierarchical clustering based on univariate statistics for functional data such as the functional mode or the functional mean. These two approaches take into account the spatial features of the data: two observations that are spatially close share a common distribution of the associated random variables. The two methodologies are illustrated by an application to air quality data.
This work has been published in the book Geostatistical Functional Data Analysis: Theory and Methods. J. Mateu and R. Giraldo (Eds). John Wiley and Sons, Chichester, UK. ISBN: 978-1-119-38784-8 49.
8.19 Axis 3: Investigating spatial scan statistics for multivariate functional data
Participants: Sophie Dabo-Niang.
This paper introduces the R package HDSpatialScan, which allows users to easily apply spatial scan statistics to real-valued multivariate data as well as to univariate and multivariate functional data. It also allows plotting the detected clusters and summarizing them. In this article, the methods are presented and the use of the package is illustrated through examples on environmental data provided in the package.
It is a joint work with Camille Frévent (University of Lille, CERIM), Mohamed-Salem Ahmed (University of Lille, CERIM) and Michaël Genin (University of Lille, CERIM). The R package HDSpatialScan has been developed, and a related article is under revision in the R Journal 64.
8.20 Axis 3: Categorical functional data analysis
Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle.
We have developed a methodology to visualize, perform dimension reduction on, and extract features from categorical functional data. For this, the cfda R package has been developed and added to the CRAN repository. A paper presenting the features of the cfda R package (7.1.3), with an application to care data (clustering of patient paths), is published in 9. The cfda R package has been presented at the ERCIM 77 and SFDS 78 conferences.
8.21 Axis 3: Clustering categorical functional data
Participants: Cristian Preda, Vincent Vandewalle.
The objective of this research direction was: (i) to propose possible modelling approaches for categorical functional data and (ii) to investigate the identifiability of such models. A first modelling framework is to consider that an observed functional data path represents a sample path of a Markov process, and thus that the sample paths come from several, say K, different processes. Consequently, we have here a mixture of different Markov processes. A second modelling framework is to consider that the observed sample paths come from several semi-Markov processes. The parameter estimation is obtained through techniques based on the EM algorithm, while the selection of the number of classes is based on information criteria. An important problem is to determine the class membership of each sample path, but our main concern is related to the identifiability problem. The identifiability of this type of model cannot be obtained in general, but only by imposing restrictions on the parameters of the model, cf. 87, 88. Our work in progress is related to finding sufficiently general conditions that guarantee this identifiability.
8.22 Axis 4: Statistical analysis of high-throughput proteomic data
Participants: Guillemette Marot, Vincent Vandewalle, Wilfried Heyse.
Since November 2019, Wilfried Heyse has been working on a PhD thesis granted by INSERM and supervised by Christophe Bauters, Guillemette Marot and Vincent Vandewalle. The aim is to identify, early after myocardial infarction (MI), patients at high risk of developing left ventricular remodelling (LVR), which is quantified by imaging one year after MI, or to identify patients at high risk of death. For that purpose, a high-throughput proteomic approach is used. This technology allows the measurement of 5000 proteins simultaneously. In parallel to these measures, which correspond to the concentration of a protein in a plasma sample collected from one patient at a specific time, echocardiographic and clinical information has been collected on each of the 200 patients. One of the main challenges is to take into account the variations of the biomarkers over time (several measurement times), in order to improve the understanding of the biological mechanisms involved in LVR or in the survival of the patient.
By selecting 46 proteins significantly associated with long-term survival in Cox models, we have identified 2 groups of patients (one group with high risk and the other with lower risk). Network analysis identified common pathways from the 46 proteins related to cell death and survival and to cell-to-cell communication. This work has been submitted for possible publication in an international journal, and has been presented at a national conference 42.
We are now investigating possibilities to take into account the temporal structure and the high dimension of the data. The aim of this work is to jointly model the temporal structure of all the proteins and the long-term survival of the patients. The main challenges of this work are the number of proteins and the repeated measurements; the strategy to address them is to introduce information on known groups of proteins defined by the GO categories (known categories of proteins which are part of the same biological function). This work could lead to the identification of new biomarkers for heart failure.
This is a joint work with Florence Pinet from INSERM.
8.23 Axis 4: Contribution to the nutritional transition
Participants: Wilfried Heyse.
The nutritional transition of a country is characterized by a shift from a traditional, generally plant-rich diet to a meat-rich diet. In the last half century, several countries have experienced significant economic development and have seen their food supply increase. The aim of this work was to identify similarities in food transitions across countries of the world over the past 60 years. The food availability data of 171 countries (public FAOSTAT data) gather information on total food availability, as well as per capita animal and plant product availability per year, for each country over the period 1961-2018. In order to identify transition patterns, we used unsupervised (hierarchical) clustering analyses, which led us to identify 5 distinct clusters with different food transition patterns. In conclusion, between 1961 and 2018, total food availability increased overall, but with regional disparities. Several transition patterns were identified, characterized by a fairly marked increase in the availability of plant products and, to a lesser degree, of meat. The level of economic development and the geographical location are strong indicators of food transition patterns in the world. This work has been presented at a national peer-reviewed conference 35.
8.24 Axis 4: Identification of a new master regulator through transcriptomics and epigenomics data analysis
Participants: Guillemette Marot.
Thanks to a suitable analysis of RNA-seq and ChIP-seq data, our collaborators have identified a master regulator which controls asexual cell cycle division patterns in Toxoplasma gondii. Guillemette Marot mainly participated in the statistical analysis of the transcriptomic and epigenomic data of the project, bringing her expertise on empirical Bayesian approaches, useful to obtain and interpret the results. This joint work with Mathieu Gissot (PI), H. Touzet and P. Pericard for the bioinformatics part, and other collaborators for the biological part, was published in Nature Communications 23.
8.25 Axis 4: Reject Inference Methods in Credit Scoring
Participants: Christophe Biernacki, Adrien Ehrhardt, Philippe Heinrich, Vincent Vandewalle.
The granting process of all credit institutions rejects applicants having a low credit score. Developing a scorecard, i.e. a correspondence table between a client's characteristics and his or her score, requires a learning dataset in which the target variable good/bad borrower is known. Rejected applicants are de facto excluded from the process. This biased learning population might have deep consequences on the relevance of the scorecard. Some works, mostly empirical ones, try to exploit rejected applicants in the scorecard building process. This work proposes a rational criterion to evaluate the quality of a scoring model for the existing Reject Inference methods and uncovers their implicit mathematical hypotheses. It is shown that, up to now, no such Reject Inference method can guarantee a better credit scorecard. These conclusions are illustrated on simulated and real data from the French branch of Crédit Agricole Consumer Finance (CACF). This work is now published in an international journal 16.
This is a joint work with Sébastien Beben of Crédit Agricole Consumer Finance.
8.26 Axis 4: Usability study
Participants: Vincent Vandewalle.
Since 2018, Vincent Vandewalle has been working with Alexandre Caron and Benoît Dervaux (ULR 2694 - METRICS) on issues of estimating the number of problems and the value of information in the field of usability. Based on a usability study of a medical device, the objective is to determine the number of possible problems linked to the use of the device (e.g. an insulin pump) as well as their respective occurrence probabilities. Estimating this number and the different probabilities is essential to determine whether or not an additional usability study should be conducted, and to determine the number of users to include in such a study to maximize the expected benefits.
The discovery process can be modeled by a binary matrix whose number of columns depends on the number of defects discovered by the users. In this framework, they have proposed a probabilistic model of this matrix. They have included this model in a Bayesian framework where the number of problems and the probabilities of discovery are considered as random variables. This shows the interest of the approach compared to the approaches proposed in the usability state of the art. Beyond point estimation, the approach also makes it possible to obtain the distribution of the number of problems and of their respective probabilities given the discovery matrix.
The model published last year 89 also allows implementing an approach aiming at measuring the value of additional information with respect to the discovery process. In this framework, they have written a second paper, accepted for publication in Value in Health. They are also developing the R package useval, to be available soon.
8.27 Axis 4: Artificial intelligence for aviation
Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle.
Since November 2018, Benjamin Guedj and Vincent Vandewalle have been participating in the European PERF-AI project (Enhance Aircraft Performance and Optimization through the utilization of Artificial Intelligence) in partnership with the company Safety Line. In particular, the project involves developing machine learning models on data collected during flights in order to optimize the aircraft's trajectory, for example with respect to fuel consumption. In this context they have hired Florent Dewez (post-doctoral researcher) and Arthur Talpaert (engineer).
The article 86 was published last year. It explains how, using flight recording data, it is possible to fit learning models on variables that have not been directly observed, and in particular to predict the drag and lift coefficients as a function of the angle and speed of the aircraft.
A second article, about the optimization of the aircraft's trajectory based on a consumption model learned from the data, is under revision.
The originality of the approach consists in decomposing the trajectory on a functional basis and carrying out the optimization on the coefficients of this decomposition, rather than approaching the problem from the angle of optimal control. Furthermore, to guarantee compliance with aeronautical constraints, an approach penalized by a term measuring the deviation from reference flights has been proposed. A generic Python module (PyRotor) has been developed to solve such optimization problems in conjunction with the proposed approach; a minimal sketch of the underlying idea is given below.
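To make the idea concrete, here is a minimal sketch (not based on the PyRotor API) of optimising the coefficients of a basis decomposition of a one-dimensional trajectory under a penalty towards a reference flight; the basis, the consumption model and all numbers are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

t = np.linspace(0.0, 1.0, 200)
basis = np.vander(t, N=5, increasing=True)        # simple polynomial basis functions
c_ref = np.array([0.0, 1.0, -0.5, 0.2, 0.0])      # hypothetical reference-flight coefficients

def consumption(trajectory):
    # Placeholder for a consumption model learned from flight data (assumption).
    return np.trapz(trajectory ** 2, t)

def objective(c, lam=10.0):
    trajectory = basis @ c
    penalty = np.sum((c - c_ref) ** 2)            # deviation from the reference flight
    return consumption(trajectory) + lam * penalty

res = minimize(objective, x0=c_ref, method="L-BFGS-B")
optimised_trajectory = basis @ res.x
print(res.fun)
```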
8.28 Axis 4: Interpretable Domain Adaptation for Hidden Subdomain Alignment in the Context of Pre-trained Source Models
Participants: Christophe Biernacki, Luxin Zhang.
Domain adaptation aims to leverage source domain knowledge to predict target domain labels. Most domain adaptation methods tackle a single-source, single-target scenario, whereas source and target domain data can often be subdivided into data from different distributions in real-life applications (e.g., when the distribution of the collected data changes with time). However, such subdomains are rarely given and should be discovered automatically. To this end, some recent domain adaptation works seek separations of hidden subdomains, w.r.t. a known or fixed number of subdomains. In contrast, this paper introduces a new subdomain combination method that leverages a variable number of subdomains. Precisely, we propose an inter-subdomain divergence maximization criterion to exploit hidden subdomains. Moreover, our method addresses a target-to-source domain adaptation scenario in which a pre-trained source model is exploited as a black box; the proposed method is thus model-agnostic. By providing interpretability at two complementary levels (transformation and subdomain levels), our method can also be easily interpreted by practitioners with or without a machine learning background. Experimental results on two fraud detection datasets demonstrate the efficiency of our method.
It is a joint work with Pascal Germain from Université Laval (Canada) and with Yacine Kessaci from Worldline company.
8.29 Axis 4: Interpretable Domain Adaptation Using Unsupervised Feature Selection on Pretrained Source Models
Participants: Christophe Biernacki, Luxin Zhang.
We study a realistic domain adaptation setting where one has access to an already existing “black-box” machine learning model. Indeed, in real-life scenarios, an efficient pre-trained source domain predictive model is often available and required to be preserved. The solution we propose to this problem has the advantage of providing an interpretable target-to-source transformation, by seeking a sparse and ordered coordinate-wise adaptation of the feature space, in addition to elementary mapping functions. To automatically select the subset of features to be adapted, we first introduce a weakly-supervised process relying on scarce labeled target data. Then, we address a more challenging unsupervised version of this domain adaptation scenario. To this end, we propose a new pseudo-label estimator over unlabeled target examples, based on rank-stability with regard to the source model predictions. Such estimated “labels” are further used in a feature selection process to assess whether each feature needs to be transformed to achieve adaptation. We provide theoretical foundations of our method as well as an efficient implementation. Numerical experiments on real datasets show particularly encouraging results, since they approach the supervised case, where one has access to labeled target samples. This work has been submitted to an international journal 75.
It is a joint work with Pascal Germain from Université Laval (Canada) and with Yacine Kessaci from Worldline company.
8.30 Other: Projection Under Pairwise Control
Participants: Christophe Biernacki.
Visualization of high-dimensional and possibly complex (non-continuous for instance) data onto a low-dimensional space may be difficult. Several projection methods have already been proposed for displaying such high-dimensional structures in a lower-dimensional space, but the information lost is not always easy to use. Here, a new projection paradigm is presented to describe a non-linear projection method that takes into account the projection quality of each projected point in the reduced space, this quality being directly available on the same scale as this reduced space. More specifically, this novel method allows a straightforward visualization of data in R2 with a simple reading of the approximation quality, and thus provides a novel variant of dimensionality reduction.
It is a joint work with Hiba Alawieh and Nicolas Wicker, both from Université de Lille.
8.31 Other: On the Local and Global Properties of the Gravitational Spheres of Influence
Participants: Christophe Biernacki.
We revisit the concept of sphere of gravitational activity, to which we give both a geometrical and a physical meaning. This study aims to refine this concept in a much broader context that could, for instance, be applied to exo-planetary problems (in a Galactic stellar disc-Star-Planets system) to define a first-order “border” of a planetary system. The methods used in this paper rely on classical Celestial Mechanics and develop the equations of motion in the framework of the 3-body problem (e.g. Star-Planet-Satellite system). We start with the basic definition of a planet’s sphere of activity as the region of space in which it is feasible to assume the planet as the central body and the Sun as the perturbing body when computing perturbations of the satellite’s motion. We then investigate the geometrical properties and physical meaning of the ratios of Solar accelerations (central and perturbing) and planetary accelerations (central and perturbing), and the boundaries they define. We clearly distinguish throughout the paper between the sphere of activity, the Chebotarev sphere (a particular case of the sphere of activity), the Laplace sphere, and the Hill sphere. The last two are often wrongfully thought to be one and the same. Furthermore, taking a closer look and comparing the ratio of the star’s accelerations (central/perturbing) to that of the planetary accelerations (central/perturbing) as a function of the planetocentric distance, we identify different dynamical regimes, which are presented in the semi-analytical analysis.
This is a joint work with Damya Souami from the Observatoire de Paris and with Jacky Cresson from Université de Pau et des Pays de l’Adour.
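For orientation only, the classical first-order radii of the Laplace sphere of influence and of the Hill sphere, for a planet of mass m orbiting a star of mass M at distance a, are the standard textbook expressions below; the paper refines and distinguishes these notions beyond this first order.

\[
r_{\mathrm{Laplace}} \approx a \left(\frac{m}{M}\right)^{2/5},
\qquad
r_{\mathrm{Hill}} \approx a \left(\frac{m}{3M}\right)^{1/3}.
\]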
8.32 Axis 4: Single cell classification using statistical learning on mechanical properties measured by MEMS tweezers
Participants: Sophie Dabo-Niang.
Cell populations are heterogeneous and thus present a wide range of properties, such as metastatic potential. Using rare cells for clinical applications therefore requires precise classification of individual cells. Here, we propose a multi-parameter analysis of single cells to classify them using statistical learning techniques and to predict the sub-population of each cell, even though sub-populations may have close characteristics. We used MEMS tweezers to measure mechanical properties (stiffness, viscosity, and size) of single cells from two different breast cancer cell lines in a controlled environment, and ran supervised learning methods to predict the population they belong to. This label-free method is a significant step forward to distinguish rare cell sub-populations for clinical applications.
This work was presented at an international conference, the 35th International Conference on Micro Electro Mechanical Systems, in January 2022 58.
It is a joint work with Dominique Collard (LIMMS, CNRS, Universities of Lille and Tokyo), Cagatay Mehmed (LIMMS, CNRS, Universities of Lille and Tokyo) and other colleagues from the University of Tokyo.
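As an illustration of the statistical learning step, the sketch below trains a standard classifier on synthetic stand-ins for the measured mechanical features (stiffness, viscosity, size); the data generation, thresholds and model choice are hypothetical and not those of the published study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins for single-cell measurements: stiffness, viscosity, size.
features = np.column_stack([
    rng.normal(1.0, 0.2, n),    # stiffness
    rng.normal(0.5, 0.1, n),    # viscosity
    rng.normal(15.0, 2.0, n),   # size
])
# Hypothetical rule linking features to the cell line, only to create labels.
cell_line = (features[:, 0] + 0.5 * features[:, 1] + rng.normal(0, 0.2, n) > 1.25).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, features, cell_line, cv=5).mean())
```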
8.33 Axis 4: Dimensionality Reduction and Bandwidth Selection for Spatial Kernel Discriminant Analysis
Participants: Sophie Dabo-Niang.
Spatial Kernel Discriminant Analysis (SKDA) is a powerful tool for the classification of spatially dependent data. It allows taking into consideration the spatial autocorrelation of data based on a spatial kernel density estimator. The performance of SKDA is highly influenced by the choice of the smoothing parameters, also known as bandwidths. Moreover, computing a kernel density estimate is computationally intensive for high-dimensional datasets. In this paper, we consider bandwidth selection as an optimization problem, which we solve using a Particle Swarm Optimization algorithm. In addition, we investigate the use of Principal Component Analysis as a feature extraction technique to reduce computational complexity and overcome the curse of dimensionality. We examined the performance of our model on hyperspectral image classification. Experiments have given promising results on a commonly used dataset.
This work was presented at the 13th International Conference on Agents and Artificial Intelligence (ICAART) and published in its proceedings.
It is a joint work with Soumia Boumeddane, Leila Hamdad and Hamid Haddadou (ESI, Algeria).
8.34 Axis 4: A kernel discriminant analysis for spatially dependent data
Participants: Sophie Dabo-Niang.
We propose a novel supervised classification algorithm for spatially dependent data, built as an extension of kernel discriminant analysis, which we name Spatial Kernel Discriminant Analysis (SKDA). Our algorithm is based on a kernel estimate of the spatial probability density function, which integrates a second kernel to take into account the spatial dependency of the data. In fact, classical data mining algorithms assume that data samples are independent and identically distributed. However, this assumption does not hold for spatial data characterized by the spatial autocorrelation phenomenon. To make an accurate analysis, it is necessary to exploit this rich source of information and to capture this property. We have applied our algorithm to a relevant domain, namely the classification of remotely sensed hyperspectral images. In order to assess the efficiency of the proposed method, we conducted experiments on two remotely sensed image datasets (Indian Pines and Pavia University) with different characteristics and scenarios. The experimental results show that our method is competitive and achieves higher classification accuracy than other contextual classification methods.
This work has been published in Distributed and Parallel Databases. It is a joint work with Soumia Boumeddane, Leila Hamdad and Hamid Haddadou (ESI, Algeria).
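A minimal sketch of the underlying idea, namely a feature kernel weighted by a second kernel on the spatial coordinates followed by assignment to the class with the largest estimated density, is given below; it is an illustration under simplifying assumptions (Gaussian kernels, synthetic data), not the published SKDA estimator.

```python
import numpy as np

def spatial_class_density(x0, s0, X, S, y, label, h_x=1.0, h_s=1.0):
    # Feature kernel weighted by a second kernel on spatial coordinates (toy version).
    mask = (y == label)
    k_feat = np.exp(-np.sum((X[mask] - x0) ** 2, axis=1) / (2 * h_x ** 2))
    k_spat = np.exp(-np.sum((S[mask] - s0) ** 2, axis=1) / (2 * h_s ** 2))
    return np.mean(k_feat * k_spat)

def classify(x0, s0, X, S, y):
    classes = np.unique(y)
    densities = [spatial_class_density(x0, s0, X, S, y, c) for c in classes]
    return classes[int(np.argmax(densities))]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # spectral features of 200 pixels
S = rng.uniform(0, 10, size=(200, 2))     # spatial coordinates of the pixels
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
print(classify(X[0], S[0], X, S, y))
```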
8.35 Axis 2: Progress in Self-Certified Neural Networks
Participants: Benjamin Guedj.
A learning method is self-certified if it uses all available data to simultaneously learn a predictor and certify its quality with a tight statistical certificate that is valid with high confidence on any random data point. Self-certified learning promises to bring two major advantages to the machine learning community. First, it avoids the need to hold out data for validation and test purposes, both for certifying the model’s performance and for model selection. This could simplify the machine learning data pipeline, while using all the available data for training could also lead to better representations of the underlying data distribution and ultimately to more accurate models. Secondly, self-certified learning focuses on delivering performance certificates that are valid with high confidence and are informative of the out-of-sample error, properties that are crucial for appropriately comparing machine learning models as well as for setting performance standards for the algorithmic governance of these models in the real world. In this paper, we assess how close we are to achieving self-certification in neural networks. In particular, recent work has shown that probabilistic neural networks trained by optimising PAC-Bayes generalisation bounds could bear promise towards achieving self-certified learning, since these can leverage all the available data to learn a posterior and simultaneously certify its risk with tight statistical performance certificates. In this work we empirically compare (on 4 classification datasets) test set generalisation bounds for deterministic predictors and a PAC-Bayes bound for randomised predictors obtained by a self-certified learning strategy (i.e. using all available data for training). We first show that both of these generalisation bounds are not too far from test set errors. We then show that in small data regimes, holding out data for the test set bound adversely affects generalisation performance, while self-certified strategies based on PAC-Bayes bounds do not suffer from this drawback, showing that they might be a suitable choice for this small data regime. We also find that self-certified probabilistic neural networks learnt by PAC-Bayes inspired objectives lead to certificates that can be surprisingly competitive compared to commonly used test set bounds.
Accepted at the Bayesian Deep Learning workshop at NeurIPS 2021.
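To make the notion of a risk certificate concrete, the snippet below evaluates a classical McAllester-style PAC-Bayes bound for a randomised predictor with loss in [0, 1]; the numbers are hypothetical, and the cited works rely on tighter kl-based bounds rather than this simple form.

```python
import numpy as np

def mcallester_bound(emp_risk, kl_div, n, delta=0.05):
    """Upper bound on the true risk of a randomised predictor, holding with
    probability at least 1 - delta for a loss bounded in [0, 1]."""
    complexity = (kl_div + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n)
    return emp_risk + np.sqrt(complexity)

# Hypothetical numbers: empirical risk of the posterior, KL(posterior || prior), sample size.
print(mcallester_bound(emp_risk=0.08, kl_div=50.0, n=60000))
```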
8.36 Axis 2: MMD Aggregated Two-Sample Test
Participants: Benjamin Guedj, Antonin Schrab.
We propose a novel nonparametric two-sample test based on the Maximum Mean Discrepancy (MMD), which is constructed by aggregating tests with different kernel bandwidths. This aggregation procedure, called MMDAgg, ensures that test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We work in the non-asymptotic framework, and prove that our aggregated test is minimax adaptive over Sobolev balls. Our guarantees are not restricted to a specific kernel, but hold for any product of one-dimensional translation invariant characteristic kernels which are absolutely and square integrable. Moreover, our results apply for popular numerical procedures to determine the test threshold, namely permutations and the wild bootstrap. Through numerical experiments on both synthetic and real-world datasets, we demonstrate that MMDAgg outperforms alternative state-of-the-art approaches to MMD kernel adaptation for two-sample testing.
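For a concrete picture, the sketch below computes a biased MMD estimate with a single Gaussian kernel and calibrates the test with a permutation threshold (MMDAgg aggregates such tests over a collection of bandwidths); the kernel, bandwidth and data are illustrative.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth):
    # Biased (V-statistic) estimate of the squared MMD between samples x and y.
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(100, 2))
y = rng.normal(0.5, 1.0, size=(100, 2))

bandwidth = 1.0
stat = mmd2(x, y, bandwidth)

# Permutation threshold for this single bandwidth.
pooled = np.vstack([x, y])
perm_stats = []
for _ in range(200):
    idx = rng.permutation(len(pooled))
    perm_stats.append(mmd2(pooled[idx[:100]], pooled[idx[100:]], bandwidth))
print("reject H0:", stat > np.quantile(perm_stats, 0.95))
```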
8.37 Axis 2: Learning PAC-Bayes Priors for Probabilistic Neural Networks
Participants: Benjamin Guedj.
Recent works have investigated deep learning models trained by optimising PAC-Bayes bounds, with priors that are learnt on subsets of the data. This combination has been shown to lead not only to accurate classifiers, but also to remarkably tight risk certificates, bearing promise towards self-certified learning (i.e. use all the data to learn a predictor and certify its quality). In this work, we empirically investigate the role of the prior. We experiment on 6 datasets with different strategies and amounts of data to learn data-dependent PAC-Bayes priors, and we compare them in terms of their effect on test performance of the learnt predictors and tightness of their risk certificate. We ask what is the optimal amount of data which should be allocated for building the prior and show that the optimum may be dataset dependent. We demonstrate that using a small percentage of the prior-building data for validation of the prior leads to promising results. We include a comparison of underparameterised and overparameterised models, along with an empirical study of different training objectives and regularisation strategies to learn the prior distribution.
8.38 Axis 2: On Margins and Derandomisation in PAC-Bayes
Participants: Benjamin Guedj, Felix Biggs.
We give a general recipe for derandomising PAC-Bayesian bounds using margins, with the critical ingredient being that our randomised predictions concentrate around some value. The tools we develop straightforwardly lead to margin bounds for various classifiers, including linear prediction – a class that includes boosting and the support vector machine – single-hidden-layer neural networks with an unusual erf activation function, and deep ReLU networks. Further, we extend to partially-derandomised predictors where only some of the randomness is removed, letting us extend bounds to cases where the concentration properties of our predictors are otherwise poor.
Accepted at AISTATS 2022.
8.39 Axis 2: Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound
Participants: Benjamin Guedj, Valentina Zantedeschi.
We investigate a stochastic counterpart of majority votes over finite ensembles of classifiers, and study its generalization properties. While our approach holds for arbitrary distributions, we instantiate it with Dirichlet distributions: this allows for a closed-form and differentiable expression for the expected risk, which then turns the generalization bound into a tractable training objective. The resulting stochastic majority vote learning algorithm achieves state-of-the-art accuracy and benefits from (non-vacuous) tight generalization bounds, in a series of numerical experiments when compared to competing algorithms which also minimize PAC-Bayes objectives – both with uninformed (data-independent) and informed (data-dependent) priors.
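The toy Monte Carlo sketch below illustrates a stochastic majority vote with Dirichlet-distributed weights over a small ensemble (the paper instead derives a closed-form, differentiable expression for the expected risk); the ensemble, labels and Dirichlet parameters are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble: K classifiers, n points, predictions and labels in {-1, +1}.
K, n = 5, 100
votes = rng.choice([-1, 1], size=(K, n))
labels = rng.choice([-1, 1], size=n)

alpha = np.ones(K)  # Dirichlet parameters over the ensemble weights

# Monte Carlo estimate of the expected risk of the stochastic majority vote:
# draw weights theta ~ Dirichlet(alpha), predict sign(theta @ votes), average errors.
errors = []
for _ in range(1000):
    theta = rng.dirichlet(alpha)
    pred = np.sign(theta @ votes)
    errors.append(np.mean(pred != labels))
print(np.mean(errors))
```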
8.40 Axis 4: Covid-19 and AI: Unexpected Challenges and Lessons
Participants: Benjamin Guedj.
On May 21st, 2021, we held the webinar "Covid-19 and AI: unexpected challenges and lessons". This short note presents its highlights.
8.41 Axis 1: Forecasting elections results via the voter model with stubborn nodes
Participants: Benjamin Guedj, Antoine Vendeville.
In this paper we propose a novel method to forecast the result of elections using only official results of previous ones. It is based on the voter model with stubborn nodes and uses theoretical results developed in a previous work of ours. We look at popular vote shares for the Conservative and Labour parties in the UK and the Republican and Democrat parties in the US. We are able to perform time-evolving estimates of the model parameters and use these to forecast the vote shares for each party in any election. We obtain a mean absolute error of 4.74%. As a side product, our parameters estimates provide meaningful insight on the political landscape, informing us on the proportion of voters that are strong supporters of each of the considered parties.
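For illustration, the sketch below simulates a basic voter model with stubborn nodes on a complete graph and reads off a vote-share forecast; the graph, the proportion of stubborn nodes and the update rule are simplified stand-ins, not the calibrated model of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Opinions are 0 or 1; stubborn nodes never change their opinion (hypothetical 10%).
opinion = rng.integers(0, 2, size=n)
stubborn = rng.random(n) < 0.1

for _ in range(20000):
    i = rng.integers(n)
    if stubborn[i]:
        continue
    j = rng.integers(n)          # complete graph: copy a uniformly chosen node's opinion
    opinion[i] = opinion[j]

print("forecast vote share for opinion 1:", opinion.mean())
```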
8.42 Axis 2: Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks
Participants: Benjamin Guedj, Felix Biggs.
We make two related contributions motivated by the challenge of training stochastic neural networks, particularly in a PAC–Bayesian setting: (1) we show how averaging over an ensemble of stochastic neural networks enables a new class of partially-aggregated estimators, proving that these lead to unbiased lower-variance output and gradient estimators; (2) we reformulate a PAC–Bayesian bound for signed-output networks to derive in combination with the above a directly optimisable, differentiable objective and a generalisation guarantee, without using a surrogate loss or loosening the bound. We show empirically that this leads to competitive generalisation guarantees and compares favourably to other methods for training such networks. Finally, we note that the above leads to a simpler PAC–Bayesian training scheme for sign-activation networks than previous work.
8.43 Axis 2: PAC-Bayes Unleashed: Generalisation Bounds with Unbounded Losses
Participants: Benjamin Guedj, Maxime Haddouche.
We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0;1]). In order to relax this classical assumption, we propose to allow the range of the loss to depend on each predictor. This relaxation is captured by our new notion of HYPothesis-dependent rangE (HYPE). Based on this, we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.
8.44 Axis 2: Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds
Participants: Benjamin Guedj.
“No free lunch” results state the impossibility of obtaining meaningful bounds on the error of a learning algorithm without prior assumptions and modelling, which are more or less realistic for a given problem. Some models are “expensive” (strong assumptions, such as sub-Gaussian tails), others are “cheap” (simply finite variance). As is well known, the more you pay, the more you get: in other words, the most expensive models yield the most interesting bounds. Recent advances in robust statistics have investigated procedures to obtain tight bounds while keeping the cost of assumptions minimal. The present paper explores and exhibits what the limits are for obtaining tight probably approximately correct (PAC)-Bayes bounds in a robust setting for cheap models.
8.45 Axis 1: Online k-means Clustering
Participants: Benjamin Guedj.
See the AISTATS 2021 proceedings for the abstract.
8.46 Axis 1: Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly
Participants: Benjamin Guedj.
When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA raises theoretical and algorithmic pitfalls. A principal curve acts as a nonlinear generalization of PCA, and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation (called slpc, for sequential learning principal curves) that incorporates both sleeping experts and multi-armed bandit ingredients is presented, along with its regret computation and performance on synthetic and real-life data.
9 Bilateral contracts and grants with industry
9.1 Bilateral contracts with industry
Diagrams Technologies startup
Participants: Christophe Biernacki, Cristian Preda.
Christophe Biernacki and Cristian Preda act as scientific experts for the Diagrams Technologies startup, specialized in industrial data analysis and in software dedicated to predictive maintenance. This startup is a spinoff of the MODAL team.
Program France-Relance : MODAL-Alicante
Participants: Cristian Preda, Vincent Vandewalle.
The objective of this collaboration is to develop statistical learning models that explore the temporal dimension of health data within the framework of projects developed by the company ALICANTE and whose solutions are provided by the research work of the MODAL team.
Duration: 2 years (15/12/2021 - 15/12/2023)
Ordoclic company
Participants: Christophe Biernacki, Cristian Preda.
Duration: April 1st – August 31, 2021.
ORDOCLIC is a private company whose main activity is electronic medical prescription. It wishes to take advantage of the richness of its offer to build the first French platform for intelligent prescription taking into account the sanitary situation in real time across the territory: suggesting treatments according to the season and the clusters, but also alerting on epidemics in connection with the anonymized contents of prescriptions. In this project, MODAL helped with contextualised prescription aid for infectious diseases in general practice (influenza, gastroenteritis, SARS, chickenpox, etc.). Claire Devisme (Polytech'Lille student) worked on this subject for 5 months as an intern.
COLAS company
Participants: Christophe Biernacki.
COLAS is a world leader in the construction and maintenance of transport infrastructure. This bilateral contract aims at classifying mixed data obtained with sensors coming from a study of the aging of road surfacing. The challenge is to deal with many missing data (sensor failures) and correlated data (sensor proximity). This second contract with COLAS finished in 2021.
PAY-BACK company
Participants: Christophe Biernacki.
PAY-BACK Group is an audit firm specializing in the analysis and reliability of transactions. This bilateral contract aims at predicting store sales both from past sales (time series) and by exploiting external covariates (of different types). The proposed solution is based on the MixtComp software. In 2021, PAY-BACK implemented the MixtComp software in its own information system.
ADULM
Participants: Sophie Dabo-Niang, Cristian Preda.
The main goal of this project with the Lille Metropole Urban Development and Planning Agency (ADULM) is to design a tool for the Territorial Coherence Scheme (SCoT) to monitor urban developments and develop territorial observation.
Saint-Gobain
Participants: Christophe Biernacki, Vincent Vandewalle, Myriam Benbahlouli.
Saint-Gobain designs, produces and distributes materials and solutions for construction, mobility, healthcare and other industrial applications. The purpose of this contract is to perform multi-product forecasting. This work was initiated during the internship of Myriam Benbahlouli at Saint-Gobain during July and August, and continues with her apprentice contract at Inria. This work is done under the supervision of Christophe Biernacki and Vincent Vandewalle.
9.2 Bilateral grants with industry
Worldline
Participants: Christophe Biernacki.
Worldline is the new world-class leader in the payments and transactional services industry, with a global reach. A PhD began in Feb. 2019 with Luxin Zhang under the supervision of Christophe Biernacki, Pascal Germain (Laval University, Canada) and Yacine Kessaci (Worldline), on the topic of domain adaptation from a pre-trained source model (with application to fraud detection in electronic payments).
ADEO
Participants: Christophe Biernacki, Vincent Vandewalle.
Adeo is No. 1 in Europe and No. 3 worldwide in the DIY market. A PhD began in Dec. 2020 with Axel Potier under the supervision of Christophe Biernacki, Vincent Vandewalle, Matthieu Marbac (ENSAI) and Julien Favre (ADEO), on the topic of sales forecasting for “slow movers” items (i.e. items sold in low quantities).
Seckiot
Participants: Christophe Biernacki, Cristian Preda.
Seckiot is an editor of cybersecurity software protecting industrial systems & IoT. In December 2021, Clarisse Boinay began her CIFRE PhD thesis (with AID, Agence de l'Innovation de Défense) with Seckiot on the topic of “anomaly detection and change point detection in contextual dynamic asynchronous graphs with applications in OT cybersecurity”, under the co-supervision of Thomas Anglade (Seckiot), Christophe Biernacki and Cristian Preda.
10 Partnerships and cooperations
10.1 European initiatives
10.1.1 FP7 & H2020 projects
H2020 FAIR
Participants: Guillemette Marot, Maxime Brunin.
- Acronym: FAIR
- Project title: Flagellin aerosol therapy as an immunomodulatory adjunct to the antibiotic treatment of drug-resistant bacterial pneumonia
- Coordinator: JC Sirard (Inserm, CIIL)
- Duration: 4 years (2020-2023)
- Partners: Inserm (France), Univ Lille (France), Freie Universitaet Berlin (DE), Epithelix (CH), Aerogen (IE), Statens Serum Institut (DK), CHRU Tours (France), Academisch Medisch Centrum bij de Universiteit van Amsterdam (NL), University of Southampton (UK), European respiratory society (CH)
- Abstract: FAIR, a project coordinated by JC. Sirard (Inserm, CIIL), aims at evaluating an alternative adjunct strategy to standard-of-care antibiotics for treating pneumonia caused by antibiotic-resistant bacteria: activation of the innate immune system in the airways. Guillemette Marot is involved in this H2020 project as the scientific head of the bilille platform, and supervises and participates in the analysis of omic data. She also contributes as a researcher, following Maxime Brunin's preliminary work, to the development of a tool to facilitate multi-omics data integration.
10.2 National initiatives
”Inria Challenge” ROAD-AI with Cerema
Participants: Vincent Vandewalle, Christophe Biernacki, Cristian Preda.
Cerema (Centre d'études et d'expertise sur les risques, l'environnement, la mobilité et l'aménagement - Centre for Studies on Risks, the Environment, Mobility and Urban Planning) is a public institution dedicated to supporting public policies, under the dual supervision of the ministry for ecological transition and the ministry for regional cohesion and local authority relations. MODAL is involved in the ROAD-AI (Routes et Ouvrages d'Art Diversiformes, Augmentés & Intégrés) “Inria Challenge”, with five other Inria teams (ACENTAURI, COATI, FUN, STATIFY, TITANE) covering statistics, robotics, telecommunication, sensor networks and 3D modeling. This four-year project (starting in 2021) aims at having more sustainable, safer and more resilient transport infrastructures.
Program "Action Exploratoire" PATH : METRICS and CHU Lille
Participants: Sophie Dabo (coordinator), Vincent Vandewalle, Christophe Biernacki, Guillemette Marot, Cristian Preda.
The research project is part of an INRIA exploratory action by a consortium of doctors, bio-statisticians and statisticians. The aim is to provide a better understanding of the key stages in the patient's care pathway by bringing together the producers of data as close to the patient as possible, those who manage them, those who pre-process them, and those who analyse them, in order to obtain results as close to the field as possible and to provide the most efficient feedback to the clinician and the patient.
The project, which is essentially interdisciplinary and exploratory, is a continuation of past collaborations between members of the two units INRIA-MODAL and METRICS (University of Lille/CHU Lille). It could not be carried out without close collaboration between doctors and researchers in applied mathematics.
The analysis of care pathways and their adequacy to needs and resources has thus become a major scientific and administrative challenge. Although the digital data available for this purpose is increasing rapidly, the statistical methods and tools available to researchers and health authorities remain limited and inefficient.
The types of care pathways are very numerous. As part of this exploratory action, we propose to focus on two cases of application: 1) an ambulatory care pathway (city-hospital link); 2) an intra-hospital care pathway. This choice is justified by METRICS' solid expertise in these pathways, based on several years of research, as well as close links with clinicians who are experts in these issues.
Duration: 2 years (1/09/2021 - 31/08/2023)
Industrial Chair Smart digicat
Participants: Vincent Vandewalle, Cristian Preda, Sophie Dabo.
SmartDigiCat is a project led by Sebastien Paul (Professor at Centrale Lille, researcher at Unité de Catalyse et Chimie du Solide (UCCS – UMR CNRS 8181)) and involving several companies (SOLVAY, HORIBA, TEAMCAT SOLUTIONS) and academic laboratories (UCCS, CRIStAL, Inria and l’Institut Eugène Chevreul).
The consortium of the SmartDigiCat chair will develop an innovative approach for safer and more environmentally-friendly catalytic processes design. The innovation will emerge from the powerful combination of high-throughput experiments, theoretical chemistry and artificial intelligence. The domains of application of the tools developed for catalysis will be extended, among others, to materials and formulations.
Vincent Vandewalle, Cristian Preda and Sophie Dabo are involved in the artificial intelligence part of the project. This part requires functional data analysis tools and challenging developments, for example to optimize the chemical process in order to obtain a target spectrum.
COVIDOM project
Participants: Christophe Biernacki, Vincent Vandewalle.
- Abstract: During the 1st lockdown in France, Christophe Biernacki supervised a task force composed of three Inria research teams (MODAL, STATIFY, TAU) for analysing data coming from the medical database COVIDOM of AP-HP concerning suspected COVID-19 patients. This project was included in the overall national Inria “mission COVID” initiative.
French Institute of Bioinformatics and equipex+ MuDiS4LS
Participants: Guillemette Marot.
- Coordinators: Claudine Medigue and Jacques Van Helden (Co-heads IFB)
- Duration: 7 years (2021 – 2028)
- Abstract: Bilille, the bioinformatics platform of Lille, is a member of IFB, the French Institute of Bioinformatics. IFB has obtained the funding of the equipex+ MuDiS4LS (Mutualised Digital Spaces for FAIR data in Life and Health Science). As the scientific head of the bilille platform, Guillemette Marot is also the scientific head of the Univ. Lille partner for this equipex+. As a researcher, she will participate in implementation studies involving the integration of complex data (IS1 and IS4). More information is available from IFB and on the bilille website.
10.2.1 ANR
CYTOMEMS
Participants: Sophie Dabo, Cristian Preda, Vincent Vandewalle.
- Type: ANR AAPG
- Acronym: CYTOMEMS
- Project title: Smart MEMS Instrumentation for Biophysical flow Cytometry with Statistical Learning
- Coordinator: Dominique Collard (CNRS)
- Duration: 2022–2024
- Funding: 600k EUR
- Partners: MODAL, Laboratoire Hubert Curien (UMR LIMMS CNRS IMU 2820)
APRIORI
Participants: Benjamin Guedj, Hemant Tyagi.
- Type: ANR PRC
- Acronym: APRIORI
- Project title: PAC-Bayesian theory and algorithms for deep learning and representation learning
- Coordinator: Emilie Morvant (Université Jean Monnet)
- Duration: 2019–2023
- Funding: 300k EUR
- Partners: MODAL, Laboratoire Hubert Curien (UMR CNRS 5516)
BEAGLE
Participants: Benjamin Guedj [coordinator], Pascal Germain.
- Type: ANR JCJC
- Acronym: BEAGLE
- Duration: 2019–2023
- Project title: PAC-Bayesian theory and algorithms for agnostic learning
- Funding: 180k EUR
- Partners: Pierre Alquier (RIKEN AIP, Japan), Peter Grünwald (CWI, The Netherlands), Rémi Bardenet (UMR CRIStAL 9189)
SMILE
Participants: Christophe Biernacki, Vincent Vandewalle.
- Acronym: SMILE
- Duration: 2018–2022
- Project title: Statistical Modeling and Inference for unsupervised Learning at LargE-Scale
- Coordinator: Faicel Chamroukhi (LMNO, Université de Caen)
- Partners: MODAL, LMNO UMR CNRS 6139 (Caen), LMRS UMR CNRS 6085 (Rouen), LIS UMR CNRS 7020 (Toulon)
TheraSCUD2022
Participants: Guillemette Marot, Maxime Brunin.
- Acronym: TheraSCUD2022
- Project title: Targeting the IL-20/IL-22 balance to restore pulmonary, intestinal and metabolic homeostasis after cigarette smoking and unhealthy diet
- Coordinator: P. Gosset (Institut Pasteur de Lille)
- Duration: 42 months (2017–2021)
- Partners: CIIL Institut Pasteur de Lille and UMR 1019 INRA Clermont-Ferrand
- Abstract: The TheraSCUD2022 project studies inflammatory disorders associated with cigarette smoking and unhealthy diet (SCUD). Guillemette Marot is involved in this ANR project as head of the bilille platform, and has supervised Maxime Brunin on the integration of omic data. More information is available on the ANR website.
10.2.2 RHU and FHU
A RHU (recherche hospitalo-universitaire) is an excellence programme funded by PIA (program of investment for the future) and selected by ANR. A FHU is a federative project and a label necessary to postulate for a RHU.
RHU PreciNASH
Participants: Guillemette Marot.
- Acronym: PreciNASH
- Project title: Non-alcoholic steato-hepatitis (NASH) from disease stratification to novel therapeutic approaches
- Coordinator: François Pattou (Université de Lille, CHU Lille)
- Duration: 6 years (2016 –2022)
- Partners: FHU Integra and Sanofi
- Abstract: PreciNASH, a project coordinated by Pr. F. Pattou (UMR 859, EGID), aims at better understanding non-alcoholic steatohepatitis (NASH) and improving its diagnosis and care. In this RHU, Guillemette Marot has supervised a 2-year post-doc, as her team ULR 2694 METRICS is a member of the FHU Integra. She has also supervised, during two years, an engineer of the bilille platform for this project. METRICS is involved in WP1 for the development of a clinical-biological model for the prediction of NASH. Bilille is involved in the task which consists in better stratifying patients using unsupervised clustering. Other partners of the FHU are UMR 859, UMR 1011 and UMR 8199, these last three teams being part of the labex EGID (European Genomic Institute for Diabetes). Sanofi is the main industrial partner of the RHU PreciNASH. More information is available on the PreciNASH project page.
FHU PRECISE
Participants: Guillemette Marot, Christophe Biernacki.
- Coordinator: Pr D. Launay (U. Lille, CHU Lille)
- Acronym: PRECISE
- Project title: PREcision health in Complex Immune-mediated inflammatory diseaSEs
- Duration: 5 years (2021 – 2025)
- Partners: CHU Lille, CHU Amiens, CHU Rouen, CHU Caen, Université de Lille, Université de Picardie, Université de Rouen, Inserm
- Abstract: The objective of FHU PRECISE is to structure care, research and teaching relative to the care of patients who suffer from complex IMIDs (Immune-mediated inflammatory diseases) with an interdisciplinary approach. Guillemette Marot is the co-head, with Vincent Sobanski and Grégoire Ficheur, of the WP2 workpackage, which aims at creating a « virtual patient » and clustering patients based on their clinical and omic profiles. In this WP, she is involved both in the analysis task with the bilille platform and in the research task led by Christophe Biernacki, involving the MODAL team. This research task aims at combining complex data and integrating temporal structure in order to identify patients' care pathways. Guillemette Marot also participates with bilille in WP3 for the research of a molecular signature predictive of the treatment response (resistance and complication).
10.2.3 Working groups
- Sophie Dabo-Niang belongs to the following working groups:
- STAFAV (STatistiques pour l'Afrique Francophone et Applications au Vivant)
- ERCIM Working Group on computational and Methodological Statistics, Nonparametric Statistics Team
- Franco-African IRN (International Research Network) in Mathematics, funded by CNRS
- ONCOLille (Cancer Research Institute in Lille)
- Benjamin Guedj belongs to the following working groups (GdR) of CNRS:
- ISIS (local referee for Inria Lille - Nord Europe)
- MaDICS
- MASCOT-NUM (local referee for Inria Lille - Nord Europe)
- Guillemette Marot belongs to the StatOmique working group
10.3 Regional initiatives
10.3.1 bilille, the bioinformatics platform of Lille
Participants: Guillemette Marot, Maxime Brunin.
The bioinformatics platform of Lille, bilille, is part of UMS 2014/US 41 PLBS (Plateformes Lilloises en Biologie Santé). Guillemette Marot is the scientific head of the platform. In 2021, Inria employed Maxime Brunin as an engineer for this platform during 3 months: he participated in the development of visCorVar, a tool which facilitates multi-block analysis for the statistical integration of omics data. This collaboration was needed for the transition between the ANR TheraSCUD2022 project and the European H2020 project FAIR.
Collaborations of the year linked to bilille
Participants: Guillemette Marot.
Guillemette Marot has supervised the data analysis part, or provided support in biostatistics tool testing, for the following research projects involving engineers from bilille (only the names of the principal investigators are given, even if several partners are sometimes involved in the project):
- Virology lab ULR3610, I. Engelmann
- Life Imaging Platform PLBS UMS2014 – US41, R Viard
- Canther UMR9020 - U1277, E. Bonnelye
10.3.2 ONCOLILLE
Participants: Sophie Dabo, Cristian Preda, Guillemette Marot, Vincent Vandewalle.
The Institute for Interdisciplinary Research in Cancerology, named ONCOLille, was created on January 1, 2020. The objective of the institute is to develop interdisciplinary research associating biology, physics, chemistry, mathematics, bioinformatics, economics, health technologies, human and social sciences, by developing fundamental research and strong translational/pre-clinical research (development of alternative and original study models) in order to move towards transfer to the clinic (clinical trials, new molecules). ONCOLille is supported by the University of Lille, Inserm, CNRS, the Lille University Hospital, the Oscar Lambret Center CRCC, the Pasteur Institute of Lille (IPL), and the Lille Cancer Research Institute (IRCL), as well as strong support from the State, the Hauts-de-France region, the MEL (Lille European Metropolis) and the ERDF (European Regional Development Fund). ONCOLille researchers come from 7 laboratories (CANTHER, PHYCELL, ONCOTHAI, LIMMS, SCALab, LPP, LEM) covering all the disciplines of the institute (plus 2 laboratories associated with ONCOLille: PRISM and UGSF). The central research theme for all the ONCOLille teams is resistance to treatment and tumor dormancy. This resistance, which can take many forms, is very often the cause of the failure of current therapies. It is this resistance that researchers and clinicians, through knowledge of its mechanisms, will seek to combat and/or divert in order to make the tumor sensitive to treatment again, to propose new treatments through the development of new therapeutic molecules and/or new drug combinations, and to develop new technologies to treat cancers.
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
General chair, scientific chair
- Hemant Tyagi is the organizer of the MODAL team scientific seminar.
- Christophe Biernacki has been president of the scientific committee of JdS 2021, the annual national meeting of the French statistical society (SFdS).
- Sophie Dabo is co-chair of the group Statistics, applied math and computer science of Pan-African Scientific Research Council, funded by Princeton University (USA)
11.1.2 Scientific events: selection
Member of the conference program committees
- Christophe Biernacki has been a member of the program committee of the CLADAG 2021 conference.
- Cristian Preda has been a member of the scientific program committee of the Young Researchers Workshop of the SPSR (Romanian Society of Statistics and Probability).
- Sophie Dabo has been a member of the scientific program committee of the 14th International Conference of the ERCIM WG on Computational and Methodological Statistics (CMStatistics 2021).
- Benjamin Guedj has been a member of the programme committees (as a reviewer) for ALT, AISTATS, ICML, ICLR, NeurIPS and COLT.
- Benjamin Guedj has served as Area Chair for NeurIPS 2021.
11.1.3 Journal
Member of the editorial boards
- Cristian Preda is an Associate Editor for Methodology and Computing in Applied Probability journal, Springer.
- Christophe Biernacki is an Associate Editor of the North-Western European Journal of Mathematics (NWEJM).
- Sophie Dabo is an associate editor of Revista Colombiana de Estadística, Journal of Statistical Modeling and Analytics, and Journal of Nonparametric Statistics.
- Benjamin Guedj is a member of the Editorial Boards of the journals Information and Inference, and Data-Centric Engineering.
Reviewer - reviewing activities
- Hemant Tyagi has reviewed for the following journals during 2021: IEEE Open Journal of Signal Processing, IEEE Transactions on Signal Processing, Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Information and Inference.
- Hemant Tyagi has reviewed for the following conferences in 2021: ICML, ICLR, NeurIPS.
- Christophe Biernacki has reviewed for CAp 2021 (Conférence sur l'Apprentissage Automatique) and also for several journals (LSTA, ADAC, JCGS, CSDA, CLAS, AIRE).
- Sophie Dabo has reviewed for the following journals: JRSS B, JASA, JNP, Spatial Statistics.
- Cristian Preda has reviewed for the following journals during 2021: Methodology and Computing in Applied Probability, Electronic Journal of Statistics, Australian and New Zealand Journal of Statistics.
11.1.4 Invited talks
- Hemant Tyagi gave an invited talk in the session “Computational Sampling” at the ASILOMAR Conference on Signals, Systems and Computers, 2021.
- Christophe Biernacki has been invited to give talks at several conferences and workshops: MHC2021 38, MIMO 2021 40, and a conference on modeling and forecasting in Poland 37.
- Cristian Preda gave an invited talk at the Romanian Academy Research Conference, CCSAR 2021.
- Sophie Dabo gave an invited talk at the Stochastic Geometry Days (9th edition, November 15th-19th, 2021, Dunkerque).
- Sophie Dabo gave an invited talk at the monthly Inria LIRIMA seminar, 2021.
- Benjamin Guedj gave a number of invited talks, mostly in the UK.
11.1.5 Scientific expertise
- Christophe Biernacki has acted as an expert for HCERES for the mathematics laboratory of the Brest-Vannes University. He also acted as an expert helping Reims University prepare the future HCERES evaluation of its mathematics laboratory.
- Benjamin Guedj has served as expert for the ERC, the DFG (Germany), the ISF (Israel) and the ANR.
11.1.6 Research administration
Since 2020, Christophe Biernacki has been part of the scientific management of Inria at the national level, acting as a deputy scientific director in charge of the domain “Applied mathematics, computation and simulation”.
Benjamin Guedj is an elected member of the board of the Evaluation Committee.
11.2 Teaching - Supervision - Juries
11.2.1 Teaching
- Hemant Tyagi is teaching
- Master: Statistics I, 24h, M1, Centrale Lille, France (Nov. 2021 - 13 Dec. 2021)
- Sophie Dabo-Niang is teaching
- Master: Spatial Statistics, 24h, M2, Université de Lille, France
- Master: Advanced Statistics, 24h, M2, Université de Lille, France
- Master: Multivariate Data Analyses, 24h, M2, Université de Lille, France
- Licence: Probability, 24h, L2, Université de Lille, France
- Licence: Multivariate Statistics, 24h, L3, Université de Lille, France
- Guillemette Marot (on maternity leave between January and July 2021) taught between September and December 2021
- Master: Biostatistics, 22.5h, M1, Université de Lille (Faculty of Medicine), France
- Master: Supervised classification, 20h, M1, Polytech'Lille, France
- Master: Biostatistics, 35h, M1, Université de Lille (Departments of Computer Science and Biology), France
- Doctorat: Introduction to statistical analysis of omic data, 12.5h, Université de Lille (Faculty of Medicine), France
- Doctorat: Statistical analysis of RNA-Seq data, 12.5h, Université de Lille (Faculty of Medicine), France
- Cristian Preda is teaching
- Polytech'Lille engineer school: Linear Models, 48h.
- Polytech'Lille engineer school: Advanced statistics, 48h.
- Polytech'Lille engineer school: Biostatistics, 10h.
- Polytech'Lille engineer school: Supervised clustering, 24h, France
- Christophe Biernacki is teaching four lessons in statistics with missing data for the “Ateliers statistiques de la SFdS”.
- Benjamin Guedj is teaching
- Advanced machine learning (M2, 6h), University College London, United Kingdom
- Vincent Vandewalle is teaching
- Licence: Probability, 60h, Université de Lille, DUT STID
- Licence: Case study in statistics, 45h, Université de Lille, DUT STID
- Licence: R programming, 45h, Université de Lille, DUT STID
- Licence: Supervised clustering, 32h, Université de Lille, DUT STID
- Licence: Analysis, 24h, Université de Lille, DUT STID
11.2.2 Supervision
PhD in progress:
- Eglantine Karle, November 2020, Hemant Tyagi and Cristian Preda
- Guillaume Braun, January 2020, Christophe Biernacki and Hemant Tyagi
- Wilfried Heyse, 2019, Guillemette Marot and Vincent Vandewalle
- Axel Potier, Sale prediction for low turn-over products, November 2020, Christophe Biernacki, Matthieu Marbac, Vincent Vandewalle
- Etienne Kronert, Reproducing-kernel anomaly detection applied to the IT domain, September 2020, Alain Celisse and Cristian Preda.
- Issam Moindjie, Functional data analysis for the identification of biomarkers in EEG and MEG in premature infants and foetuses, October 2020, Sophie Dabo, Cristian Preda.
- Luxin Zhang, Model Agnostic Domain Adaptation: application to Fraud Detection, February 2019, Christophe Biernacki, Pascal Germain, Yacine Kessaci
- Filippo Antonazzo, Frugal Gaussian clustering of huge imbalanced datasets through a bin-marginal approach, October 2019, Christophe Biernacki, Christine Keribin
- Clarisse Boinay, anomaly detection and change point detection in contextual dynamic asynchronous graphs with applications in OT cybersecurity, December 2021, Christophe Biernacki, Cristian Preda
- Felix Biggs, PAC-Bayes, deep neural networks and generative models. Started Sept 2019, University College London, supervisors Benjamin Guedj and John Shawe-Taylor.
- Antoine Vendeville, Graph models for cybersecurity and information diffusion on networks. Started Sept 2019, University College London, supervisors Benjamin Guedj and Shi Zhou.
- Antonin Schrab, PAC-Bayes, generative models and hypothesis testing. Started Sept 2020, University College London, supervisors Benjamin Guedj and Arthur Gretton.
- Reuben Adams, PAC-Bayes theory and computational statistics. Started Sept 2020, University College London, supervisors Benjamin Guedj and John Shawe-Taylor.
- Maxime Haddouche, PAC-Bayes, representation learning and online learning. Started Sept 2021, University College London, supervisors Benjamin Guedj and John Shawe-Taylor.
- Théophile Cantelobre, PAC-Bayes, kernel methods and representation learning. Started Sept 2021, University College London, supervisors Benjamin Guedj, Alessandro Rudi and Carlo Ciliberto.
- Mathieu Alain, PAC-Bayes and information theory. Started Sept 2021, University College London, supervisors Benjamin Guedj and Miguel Rodrigues.
- Antoine Picard, Expert aggregation and multi-task learning: application to modelling the anaerobic digestion process for optimising the management of organic waste. Started Sept 2021, CIFRE Suez, supervisors Benjamin Guedj, Roman Moscoviz and Gilles Faÿ.
11.2.3 Juries
- Guillemette Marot acted as an examiner for the PhD thesis of Nathanaël Randriamihamison, Oct 2021 (Univ Toulouse)
- Christophe Biernacki acted as a reviewer for two PhD theses and as an examiner for two PhD theses
- Cristian Preda acted as a referee for the PhD thesis of Zaineb Smida, Université de Montpellier, December 2021.
- Cristian Preda acted as a referee for the PhD thesis of Mohamed Es. Sebaiy, Université Cadi Ayyad, Marrakesh, Morocco, March 2021.
11.3 Popularization
11.3.1 Interventions
Christophe Biernacki gave a talk at the “Session Olympique Universitaire” in September 2021 39
12 Scientific production
12.1 Major publications
- 1 article: Simpler PAC-Bayesian Bounds for Hostile Data. Machine Learning, 2018.
- 2 article: An R Package and C++ library for Latent block models: Theory, usage and applications. Journal of Statistical Software, 2016.
- 3 article: Unifying Data Units and Models in (Co-)Clustering. Advances in Data Analysis and Classification, 12(4), May 2018.
- 4 article: Optimal cross-validation in density estimation with the L2-loss. The Annals of Statistics, 42(5), 2014, 1879-1910.
- 5 article: Nonparametric prediction in the multivariate spatial context. Journal of Nonparametric Statistics, 28(2), 2016, 428-458.
- 6 article: The logic of transcriptional regulator recruitment architecture at cis-regulatory modules controlling liver functions. Genome Research, 27(6), June 2017, 985-996.
- 7 inproceedings: Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. NeurIPS 2019, Vancouver, Canada, December 2019.
- 8 article: Model-based clustering of Gaussian copulas for mixed data. Communications in Statistics - Theory and Methods, December 2016.
- 9 article: Categorical Functional Data Analysis. The cfda R Package. Mathematics, 9(23), December 2021, 31.
- 10 article: Learning general sparse additive models from point queries in high dimensions. Constructive Approximation, January 2019.
12.2 Publications of the year
International journals
International peer-reviewed conferences
National peer-reviewed Conferences
Conferences without proceedings
Scientific book chapters
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications
12.3 Other
Scientific popularization
Educational activities
12.4 Cited publications
- 85 article: Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering. Journal of Classification, July 2020.
- 86 article: From industry-wide parameters to aircraft-centric on-flight inference: improving aeronautics performance prediction with machine learning. Data-Centric Engineering, October 2020.
- 87 article: Estimation in the Mixture of Markov Chains Moving With Different Speeds. Journal of the American Statistical Association, 100(471), 2005, 1046-1053.
- 88 inproceedings: On mixtures of Markov chains. Proceedings of the 30th International Conference on Neural Information Processing Systems, Citeseer, 2016, 3449-3457.
- 89 article: Estimating the number of usability problems affecting medical devices: modelling the discovery matrix. BMC Medical Research Methodology, 20(1), September 2020.