Section:
Scientific Foundations
Regression models of supervised learning
The most obvious contribution of statistics to machine learning is to
consider the supervised learning scenario as a special case of regression
estimation: given independent pairs
of observations $(X_i, Y_i)$, $i = 1, \ldots, n$, the aim is to
“learn” the dependence of $Y$ on $X$. Thus, classical results
about statistical regression estimation apply, with the caveat that
the hypotheses we can reasonably assume about the distribution
of the pairs $(X_i, Y_i)$ are much weaker than what is usually
considered in statistical studies. The aim here is to assume very
little, maybe only independence of the observed sequence of input-output
pairs, and to validate model and variable selection schemes.
These schemes should produce the best possible approximation of the
joint distribution of $(X, Y)$ within some restricted family
of models. Their performance is evaluated according
to some measure of discrepancy between
distributions, a standard choice being the Kullback-Leibler
divergence.
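For reference, recall that the Kullback-Leibler divergence between two probability distributions $P$ and $Q$ (with $P$ absolutely continuous with respect to $Q$; the notation here is ours) is
\[
\mathrm{KL}(P \,\|\, Q) \;=\; \int \log\!\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right) \mathrm{d}P,
\]
a nonnegative quantity that vanishes if and only if $P = Q$. Minimizing over a family of models $\{Q_\theta\}$ an empirical estimate of $\mathrm{KL}(P \,\|\, Q_\theta)$, where $P$ denotes the joint distribution of $(X, Y)$, amounts to maximum-likelihood-type estimation, which is one way to make the discrepancy criterion above operational.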
PAC-Bayes inequalities
One of the specialties of the team in this direction is the use of
PAC-Bayes inequalities to combine thresholded exponential moment
inequalities. The name of this theory was coined by its founder,
David McAllester, and may be misleading: its cornerstone is rather a set of non-asymptotic entropy inequalities,
combined with a perturbative approach to parameter estimation. The team
has made major contributions to the theory, first focused on
classification [6], then on regression [1].
It has introduced the idea of combining the PAC-Bayesian approach
with the use of thresholded exponential moments, in order to
derive bounds under very weak assumptions on the noise.
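As a point of reference (with illustrative notation, and not the team's specific result), a classical PAC-Bayes bound in the style of McAllester states that, for a loss taking values in $[0,1]$, a prior distribution $\pi$ on the parameter space and a confidence level $\delta \in (0,1)$, with probability at least $1-\delta$ over an i.i.d. sample of size $n$, simultaneously for all posterior distributions $\rho$,
\[
\mathbb{E}_{\theta \sim \rho}\, R(\theta) \;\le\; \mathbb{E}_{\theta \sim \rho}\, r_n(\theta) \;+\; \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
\]
where $R(\theta)$ denotes the expected risk of the parameter $\theta$ and $r_n(\theta)$ its empirical counterpart. The entropy term $\mathrm{KL}(\rho \,\|\, \pi)$ is the price paid for letting the (pseudo-)posterior $\rho$ depend on the data; the thresholded exponential moment technique mentioned above serves, roughly speaking, to obtain inequalities of this flavor beyond the bounded-loss case, under much weaker assumptions on the noise.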
Sparsity and $\ell^1$-regularization
Another line of research in regression estimation is the use
of sparse models, and its link with $\ell^1$-regularization.
Regularization is the joint minimization of some empirical criterion and some penalty function;
it should lead to a model that not only fits the data well but is also as simple as possible.
For instance, the Lasso uses an $\ell^1$-regularization instead of an $\ell^0$-one; it is
popular mostly because it leads to sparse solutions (the estimate has only a few nonzero coordinates),
which usually have a clear interpretation (e.g., the influence or lack of influence of some variables).
In addition, unlike $\ell^0$-penalization, the Lasso is computationally feasible for high-dimensional data.
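For concreteness, in a linear model with design matrix $X \in \mathbb{R}^{n \times p}$ and response vector $Y \in \mathbb{R}^n$ (illustrative notation), the Lasso estimate is
\[
\hat{\beta} \;\in\; \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1,
\]
where the tuning parameter $\lambda > 0$ balances fit and sparsity: the larger $\lambda$, the more coordinates of $\hat{\beta}$ are set exactly to zero. The following minimal sketch (assuming scikit-learn is available; the variable names are ours) illustrates this sparsity on simulated data:

import numpy as np
from sklearn.linear_model import Lasso  # coordinate-descent solver for the objective above

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]              # only three truly relevant variables
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)            # alpha plays the role of lambda above
print("nonzero coordinates:", np.flatnonzero(lasso.coef_))

Most coordinates of the estimate are exactly zero, which is what makes the selected variables easy to read off.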
Pushing it to the extreme: no assumption on the data
The next brick of our scientific foundations explains why and how, in certain cases, we may formulate
absolutely no assumption on the data $(x_i, y_i)$, $i = 1, \ldots, n$, which are then considered as a
deterministic set of input–output pairs.