
FIX/API Fix AIC/BIC in LassoLarsIC and introduce noise_variance #21481


Merged (39 commits, Nov 23, 2021)
Commits

c3839d9
FIX compute the AIC and BIC using MSE
glemaitre Oct 27, 2021
b46e6ee
iter
glemaitre Oct 27, 2021
053db7c
Merge remote-tracking branch 'origin/main' into is/14566
glemaitre Oct 27, 2021
e864611
add changelog
glemaitre Oct 27, 2021
0b61363
acknowledge
glemaitre Oct 27, 2021
776490e
iter
glemaitre Oct 29, 2021
caea26c
iter
glemaitre Oct 30, 2021
979f629
iter
glemaitre Oct 30, 2021
61a6b6c
DOC add a new example
glemaitre Oct 30, 2021
668461a
iter
glemaitre Oct 30, 2021
880edbf
iter
glemaitre Oct 30, 2021
e3ef92b
iter
glemaitre Oct 30, 2021
d81bca9
iter
glemaitre Oct 30, 2021
ecb1401
Merge branch 'main' into is/14566
glemaitre Oct 30, 2021
97c3259
Apply suggestions from code review
glemaitre Oct 30, 2021
ee7b6a4
iter
glemaitre Oct 30, 2021
fada91c
Apply suggestions from code review
glemaitre Oct 30, 2021
25e3e6e
Update sklearn/linear_model/_least_angle.py
glemaitre Oct 31, 2021
794418a
iter
glemaitre Nov 2, 2021
30386c4
iter
glemaitre Nov 2, 2021
228ca79
Apply suggestions from code review
glemaitre Nov 2, 2021
10ececf
iter
glemaitre Nov 2, 2021
e353b6a
Apply suggestions from code review
glemaitre Nov 3, 2021
1c07447
Apply suggestions from code review
glemaitre Nov 3, 2021
1eaee6f
Merge remote-tracking branch 'origin/main' into is/14566
glemaitre Nov 9, 2021
47be241
iter
glemaitre Nov 9, 2021
121b48f
iter
glemaitre Nov 9, 2021
50c862d
iter
glemaitre Nov 9, 2021
37c2c93
iter
glemaitre Nov 9, 2021
73a9f5e
christian improvements
glemaitre Nov 22, 2021
e3dc244
last review
glemaitre Nov 22, 2021
227f2a1
last review
glemaitre Nov 22, 2021
19f9032
black
glemaitre Nov 22, 2021
e3a4daf
Merge remote-tracking branch 'origin/main' into is/14566
glemaitre Nov 22, 2021
d282d94
Apply suggestions from code review
glemaitre Nov 22, 2021
369f6a7
iter
glemaitre Nov 22, 2021
e426eb4
Apply suggestions from code review
glemaitre Nov 23, 2021
6f08ddf
reviews
glemaitre Nov 23, 2021
e347478
Update examples/linear_model/plot_lasso_model_selection.py
glemaitre Nov 23, 2021
93 changes: 82 additions & 11 deletions doc/modules/linear_model.rst
@@ -298,6 +298,7 @@ features, it is often faster than :class:`LassoCV`.

.. centered:: |lasso_cv_1| |lasso_cv_2|

.. _lasso_lars_ic:

Information-criteria based model selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -306,22 +307,92 @@ Alternatively, the estimator :class:`LassoLarsIC` proposes to use the
Akaike information criterion (AIC) and the Bayes Information criterion (BIC).
It is a computationally cheaper alternative to find the optimal value of alpha
as the regularization path is computed only once instead of k+1 times
- when using k-fold cross-validation. However, such criteria needs a
- proper estimation of the degrees of freedom of the solution, are
- derived for large samples (asymptotic results) and assume the model
- is correct, i.e. that the data are generated by this model.
- They also tend to break when the problem is badly conditioned
- (more features than samples).
-
- .. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_001.png
-    :target: ../auto_examples/linear_model/plot_lasso_model_selection.html
+ when using k-fold cross-validation.

Indeed, these criteria are computed in-sample, i.e. on the training set
itself. In short, they penalize the over-optimistic scores of the different
Lasso models by their flexibility (cf. the "Mathematical details" section
below).

However, such criteria need a proper estimation of the degrees of freedom of
the solution. They are derived for large samples (asymptotic results) and
assume the correct model is among the candidates under investigation. They
also tend to break when the problem is badly conditioned (e.g. more features
than samples).
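
As a minimal sketch, selecting `alpha` with :class:`LassoLarsIC` might look as
follows (the diabetes dataset and the BIC criterion are illustrative choices)::

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LassoLarsIC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)
    # The criterion is computed in-sample, so the model is simply fit on the
    # full training set; no cross-validation split is required.
    model = make_pipeline(
        StandardScaler(), LassoLarsIC(criterion="bic", normalize=False)
    ).fit(X, y)
    print(model[-1].alpha_)  # regularization strength selected by the BIC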

.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_lars_ic_001.png
:target: ../auto_examples/linear_model/plot_lasso_lars_ic.html
:align: center
:scale: 50%

.. _aic_bic:

**Mathematical details**

The definition of AIC (and thus BIC) might differ in the literature. In this
section, we give more information regarding the criterion computed in
scikit-learn. The AIC criterion is defined as:

.. math::
    AIC = -2 \log(\hat{L}) + 2 d

where :math:`\hat{L}` is the maximum likelihood of the model and
:math:`d` is the number of parameters (also referred to as degrees of
freedom in the previous section).

The definition of BIC replaces the constant :math:`2` by :math:`\log(N)`:

.. math::
    BIC = -2 \log(\hat{L}) + \log(N) d

where :math:`N` is the number of samples.

For a linear Gaussian model, the maximum log-likelihood is defined as:

.. math::
    \log(\hat{L}) = - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \log(\sigma^2) - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2}

where :math:`\sigma^2` is an estimate of the noise variance,
:math:`y_i` and :math:`\hat{y}_i` are respectively the true and predicted
targets, and :math:`n` is the number of samples.

Plugging the maximum log-likelihood in the AIC formula yields:

.. math::
    AIC = n \log(2 \pi \sigma^2) + \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sigma^2} + 2 d

The first term of the above expression is sometimes discarded since it is a
constant when :math:`\sigma^2` is provided. In addition,
it is sometimes stated that the AIC is equivalent to the :math:`C_p` statistic
[12]_. In a strict sense, however, it is equivalent only up to some constant
and a multiplicative factor.
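
Following the formulas above, a minimal NumPy sketch of both criteria could
read as follows (the helper name `information_criteria` and its arguments are
illustrative, not scikit-learn API)::

    import numpy as np

    def information_criteria(y_true, y_pred, noise_variance, d):
        # Residual sum of squares of the fitted model.
        n = y_true.shape[0]
        rss = np.sum((y_true - y_pred) ** 2)
        # Maximum log-likelihood of a Gaussian linear model (see above).
        log_likelihood = (
            -n / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(noise_variance)
            - rss / (2 * noise_variance)
        )
        aic = -2 * log_likelihood + 2 * d
        bic = -2 * log_likelihood + np.log(n) * d
        return aic, bic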

Finally, as mentioned above, :math:`\sigma^2` is an estimate of the noise
variance. In :class:`LassoLarsIC`, when the parameter `noise_variance` is
not provided (default), the noise variance is estimated via the unbiased
estimator [13]_ defined as:

.. math::
    \sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p}

where :math:`p` is the number of features and :math:`\hat{y}_i` is the
target predicted by an ordinary least squares regression. Note that this
formula is valid only when `n_samples > n_features`.
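
A minimal sketch of this estimate, assuming an ordinary least squares fit on
the training data (illustrative code, not the exact scikit-learn
implementation)::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def estimate_noise_variance(X, y):
        # Unbiased estimate of the noise variance from OLS residuals,
        # only defined when n_samples > n_features.
        n_samples, n_features = X.shape
        y_pred = LinearRegression().fit(X, y).predict(X)
        return np.sum((y - y_pred) ** 2) / (n_samples - n_features)

When `n_samples <= n_features`, this estimate is not available and a value
obtained by other means can be passed through the `noise_variance` parameter
of :class:`LassoLarsIC` instead.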

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py`
* :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_lars_ic.py`

.. topic:: References

.. [12] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani.
"On the degrees of freedom of the lasso."
The Annals of Statistics 35.5 (2007): 2173-2192.
<0712.0881.pdf>`

.. [13] `Cherkassky, Vladimir, and Yunqian Ma.
"Comparison of model selection for regression."
Neural computation 15.7 (2003): 1691-1714.
<https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.392.8794&rep=rep1&type=pdf>`_

Comparison with the regularization parameter of SVM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -934,8 +1005,8 @@ to warm-starting (see :term:`Glossary <warm_start>`).

.. [6] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. <https://hal.inria.fr/hal-00860051/document>`_

.. [7] Aaron Defazio, Francis Bach, Simon Lacoste-Julien:
:arxiv:`SAGA: A Fast Incremental Gradient Method With Support for
Non-Strongly Convex Composite Objectives. <1407.0202>`

.. [8] https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm
15 changes: 15 additions & 0 deletions doc/whats_new/v1.1.rst
@@ -141,6 +141,21 @@ Changelog
multilabel classification.
:pr:`19689` by :user:`Guillaume Lemaitre <glemaitre>`.

:mod:`sklearn.linear_model`
...........................

- |API| :class:`linear_model.LassoLarsIC` now exposes `noise_variance` as
a parameter in order to provide an estimate of the noise variance.
This is particularly relevant when `n_features > n_samples`, since the
default estimate of the noise variance cannot be computed in that case.
:pr:`21481` by :user:`Guillaume Lemaitre <glemaitre>`.

- |Fix| :class:`linear_model.LassoLarsIC` now correctly computes AIC
and BIC. An error is now raised when `n_features > n_samples` and
when the noise variance is not provided.
:pr:`21481` by :user:`Guillaume Lemaitre <glemaitre>` and
:user:`Andrés Babino <ababino>`.

:mod:`sklearn.metrics`
......................

123 changes: 123 additions & 0 deletions examples/linear_model/plot_lasso_lars_ic.py
@@ -0,0 +1,123 @@
"""
==============================================
Lasso model selection via information criteria
==============================================

This example reproduces Fig. 2 of [ZHT2007]_. A
:class:`~sklearn.linear_model.LassoLarsIC` estimator is fit on a
diabetes dataset and the AIC and the BIC criteria are used to select
the best model.

.. note::
It is important to note that the optimization to find `alpha` with
:class:`~sklearn.linear_model.LassoLarsIC` relies on the AIC or BIC
criteria that are computed in-sample, thus on the training set directly.
This approach differs from the cross-validation procedure. For a comparison
of the two approaches, you can refer to the following example:
:ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py`.

.. topic:: References

.. [ZHT2007] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani.
"On the degrees of freedom of the lasso."
The Annals of Statistics 35.5 (2007): 2173-2192.
<0712.0881>`
"""

# Author: Alexandre Gramfort
# Guillaume Lemaitre
# License: BSD 3 clause

# %%
import sklearn

sklearn.set_config(display="diagram")

# %%
# We will use the diabetes dataset.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
n_samples = X.shape[0]
X.head()

# %%
# Scikit-learn provides an estimator called
# :class:`~sklearn.linear_model.LassoLarsIC` that uses either Akaike's
# information criterion (AIC) or the Bayesian information criterion (BIC) to
# select the best model. Before fitting
# this model, we will scale the dataset.
#
# In the following, we are going to fit two models to compare the values
# reported by AIC and BIC.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline

lasso_lars_ic = make_pipeline(
    StandardScaler(), LassoLarsIC(criterion="aic", normalize=False)
).fit(X, y)


# %%
# To be in line with the definition in [ZHT2007]_, we need to rescale the
# AIC and the BIC. Indeed, Zou et al. ignore some constant terms compared to
# the original definition of AIC derived from the maximum log-likelihood of a
# linear model. You can refer to the
# :ref:`mathematical details section of the User Guide <lasso_lars_ic>`.
def zou_et_al_criterion_rescaling(criterion, n_samples, noise_variance):
    """Rescale the information criterion to follow the definition of Zou et al."""
    return criterion - n_samples * np.log(2 * np.pi * noise_variance) - n_samples


# %%
import numpy as np

aic_criterion = zou_et_al_criterion_rescaling(
    lasso_lars_ic[-1].criterion_,
    n_samples,
    lasso_lars_ic[-1].noise_variance_,
)

index_alpha_path_aic = np.flatnonzero(
    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
)[0]

# %%
lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)

bic_criterion = zou_et_al_criterion_rescaling(
    lasso_lars_ic[-1].criterion_,
    n_samples,
    lasso_lars_ic[-1].noise_variance_,
)

index_alpha_path_bic = np.flatnonzero(
    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
)[0]

# %%
# Now that we have collected the AIC and BIC, we can also check that the
# minima of both criteria occur at the same alpha, which lets us simplify
# the following plot.
index_alpha_path_aic == index_alpha_path_bic

# %%
# Finally, we can plot the AIC and BIC criteria and the selected
# regularization parameter.
import matplotlib.pyplot as plt

plt.plot(aic_criterion, color="tab:blue", marker="o", label="AIC criterion")
plt.plot(bic_criterion, color="tab:orange", marker="o", label="BIC criterion")
plt.vlines(
    index_alpha_path_bic,
    aic_criterion.min(),
    aic_criterion.max(),
    color="black",
    linestyle="--",
    label="Selected alpha",
)
plt.legend()
plt.ylabel("Information criterion")
plt.xlabel("Lasso model sequence")
_ = plt.title("Lasso model selection via AIC and BIC")