
FIX/API Fix AIC/BIC in LassoLarsIC and introduce noise_variance #21481


Merged (39 commits, Nov 23, 2021)
Commits

c3839d9
FIX compute the AIC and BIC using MSE
glemaitre Oct 27, 2021
b46e6ee
iter
glemaitre Oct 27, 2021
053db7c
Merge remote-tracking branch 'origin/main' into is/14566
glemaitre Oct 27, 2021
e864611
add changelog
glemaitre Oct 27, 2021
0b61363
acknowledge
glemaitre Oct 27, 2021
776490e
iter
glemaitre Oct 29, 2021
caea26c
iter
glemaitre Oct 30, 2021
979f629
iter
glemaitre Oct 30, 2021
61a6b6c
DOC add a new example
glemaitre Oct 30, 2021
668461a
iter
glemaitre Oct 30, 2021
880edbf
iter
glemaitre Oct 30, 2021
e3ef92b
iter
glemaitre Oct 30, 2021
d81bca9
iter
glemaitre Oct 30, 2021
ecb1401
Merge branch 'main' into is/14566
glemaitre Oct 30, 2021
97c3259
Apply suggestions from code review
glemaitre Oct 30, 2021
ee7b6a4
iter
glemaitre Oct 30, 2021
fada91c
Apply suggestions from code review
glemaitre Oct 30, 2021
25e3e6e
Update sklearn/linear_model/_least_angle.py
glemaitre Oct 31, 2021
794418a
iter
glemaitre Nov 2, 2021
30386c4
iter
glemaitre Nov 2, 2021
228ca79
Apply suggestions from code review
glemaitre Nov 2, 2021
10ececf
iter
glemaitre Nov 2, 2021
e353b6a
Apply suggestions from code review
glemaitre Nov 3, 2021
1c07447
Apply suggestions from code review
glemaitre Nov 3, 2021
1eaee6f
Merge remote-tracking branch 'origin/main' into is/14566
glemaitre Nov 9, 2021
47be241
iter
glemaitre Nov 9, 2021
121b48f
iter
glemaitre Nov 9, 2021
50c862d
iter
glemaitre Nov 9, 2021
37c2c93
iter
glemaitre Nov 9, 2021
73a9f5e
christian improvements
glemaitre Nov 22, 2021
e3dc244
last review
glemaitre Nov 22, 2021
227f2a1
last review
glemaitre Nov 22, 2021
19f9032
black
glemaitre Nov 22, 2021
e3a4daf
Merge remote-tracking branch 'origin/main' into is/14566
glemaitre Nov 22, 2021
d282d94
Apply suggestions from code review
glemaitre Nov 22, 2021
369f6a7
iter
glemaitre Nov 22, 2021
e426eb4
Apply suggestions from code review
glemaitre Nov 23, 2021
6f08ddf
reviews
glemaitre Nov 23, 2021
e347478
Update examples/linear_model/plot_lasso_model_selection.py
glemaitre Nov 23, 2021
93 changes: 82 additions & 11 deletions doc/modules/linear_model.rst
@@ -298,6 +298,7 @@ features, it is often faster than :class:`LassoCV`.

.. centered:: |lasso_cv_1| |lasso_cv_2|

.. _lasso_lars_ic:

Information-criteria based model selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -306,22 +307,92 @@ Alternatively, the estimator :class:`LassoLarsIC` proposes to use the
Akaike information criterion (AIC) and the Bayes Information criterion (BIC).
It is a computationally cheaper alternative to find the optimal value of alpha
as the regularization path is computed only once instead of k+1 times
- when using k-fold cross-validation. However, such criteria needs a
- proper estimation of the degrees of freedom of the solution, are
- derived for large samples (asymptotic results) and assume the model
- is correct, i.e. that the data are generated by this model.
- They also tend to break when the problem is badly conditioned
- (more features than samples).
-
- .. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_model_selection_001.png
-    :target: ../auto_examples/linear_model/plot_lasso_model_selection.html
+ when using k-fold cross-validation.

Indeed, these criteria are computed in-sample, i.e. on the training set
itself. In short, they penalize the over-optimistic scores of the different
Lasso models by their flexibility (cf. the "Mathematical details" section
below).

However, such criteria need a proper estimation of the degrees of freedom of
the solution. They are derived for large samples (asymptotic results) and
assume the correct model is among the candidates under investigation. They
also tend to break when the problem is badly conditioned (e.g. more features
than samples).
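
As a minimal sketch, selecting `alpha` with :class:`LassoLarsIC` might look as
follows (the diabetes dataset and the BIC criterion are illustrative choices)::

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LassoLarsIC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)
    # The criterion is computed in-sample, so the model is simply fit on the
    # full training set; no cross-validation split is required.
    model = make_pipeline(
        StandardScaler(), LassoLarsIC(criterion="bic", normalize=False)
    ).fit(X, y)
    print(model[-1].alpha_)  # regularization strength selected by the BIC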

.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_lasso_lars_ic_001.png
:target: ../auto_examples/linear_model/plot_lasso_lars_ic.html
:align: center
:scale: 50%

.. _aic_bic:

**Mathematical details**

The definition of AIC (and thus BIC) might differ in the literature. In this
section, we give more information regarding the criterion computed in
scikit-learn. The AIC criterion is defined as:

.. math::
    AIC = -2 \log(\hat{L}) + 2 d

where :math:`\hat{L}` is the maximum likelihood of the model and
:math:`d` is the number of parameters (also referred to as degrees of
freedom in the previous section).

The definition of BIC replaces the constant :math:`2` by :math:`\log(N)`:

.. math::
    BIC = -2 \log(\hat{L}) + \log(N) d

where :math:`N` is the number of samples.

For a linear Gaussian model, the maximum log-likelihood is defined as:

.. math::
    \log(\hat{L}) = - \frac{n}{2} \log(2 \pi) - \frac{n}{2} \log(\sigma^2) - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{2\sigma^2}

where :math:`\sigma^2` is an estimate of the noise variance,
:math:`y_i` and :math:`\hat{y}_i` are respectively the true and predicted
targets, and :math:`n` is the number of samples.

Plugging the maximum log-likelihood in the AIC formula yields:

.. math::
    AIC = n \log(2 \pi \sigma^2) + \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sigma^2} + 2 d

The first term of the above expression is sometimes discarded since it is a
constant when :math:`\sigma^2` is provided. In addition,
it is sometimes stated that the AIC is equivalent to the :math:`C_p` statistic
[12]_. In a strict sense, however, it is equivalent only up to some constant
and a multiplicative factor.
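
Following the formulas above, a minimal NumPy sketch of both criteria could
read as follows (the helper name `information_criteria` and its arguments are
illustrative, not scikit-learn API)::

    import numpy as np

    def information_criteria(y_true, y_pred, noise_variance, d):
        # Residual sum of squares of the fitted model.
        n = y_true.shape[0]
        rss = np.sum((y_true - y_pred) ** 2)
        # Maximum log-likelihood of a Gaussian linear model (see above).
        log_likelihood = (
            -n / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(noise_variance)
            - rss / (2 * noise_variance)
        )
        aic = -2 * log_likelihood + 2 * d
        bic = -2 * log_likelihood + np.log(n) * d
        return aic, bic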

Finally, as mentioned above, :math:`\sigma^2` is an estimate of the noise
variance. In :class:`LassoLarsIC`, when the parameter `noise_variance` is
not provided (default), the noise variance is estimated via the unbiased
estimator [13]_ defined as:

.. math::
    \sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p}

where :math:`p` is the number of features and :math:`\hat{y}_i` is the
target predicted by an ordinary least squares regression. Note that this
formula is valid only when `n_samples > n_features`.
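
A minimal sketch of this estimate, assuming an ordinary least squares fit on
the training data (illustrative code, not the exact scikit-learn
implementation)::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def estimate_noise_variance(X, y):
        # Unbiased estimate of the noise variance from OLS residuals,
        # only defined when n_samples > n_features.
        n_samples, n_features = X.shape
        y_pred = LinearRegression().fit(X, y).predict(X)
        return np.sum((y - y_pred) ** 2) / (n_samples - n_features)

When `n_samples <= n_features`, this estimate is not available and a value
obtained by other means can be passed through the `noise_variance` parameter
of :class:`LassoLarsIC` instead.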

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py`
* :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_lars_ic.py`

.. topic:: References

.. [12] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani.
"On the degrees of freedom of the lasso."
The Annals of Statistics 35.5 (2007): 2173-2192.
<0712.0881.pdf>`

.. [13] `Cherkassky, Vladimir, and Yunqian Ma.
"Comparison of model selection for regression."
Neural computation 15.7 (2003): 1691-1714.
<https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.392.8794&rep=rep1&type=pdf>`_

Comparison with the regularization parameter of SVM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -934,8 +1005,8 @@ to warm-starting (see :term:`Glossary <warm_start>`).

.. [6] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. <https://hal.inria.fr/hal-00860051/document>`_

.. [7] Aaron Defazio, Francis Bach, Simon Lacoste-Julien:
:arxiv:`SAGA: A Fast Incremental Gradient Method With Support for
Non-Strongly Convex Composite Objectives. <1407.0202>`

.. [8] https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm
15 changes: 15 additions & 0 deletions doc/whats_new/v1.1.rst
@@ -141,6 +141,21 @@ Changelog
multilabel classification.
:pr:`19689` by :user:`Guillaume Lemaitre <glemaitre>`.

:mod:`sklearn.linear_model`
...........................

- |API| :class:`linear_model.LassoLarsIC` now exposes `noise_variance` as
a parameter in order to provide an estimate of the noise variance.
This is particularly relevant when `n_features > n_samples`, since the
default estimate of the noise variance cannot be computed in that case.
:pr:`21481` by :user:`Guillaume Lemaitre <glemaitre>`.

- |Fix| :class:`linear_model.LassoLarsIC` now correctly computes AIC
and BIC. An error is now raised when `n_features > n_samples` and
when the noise variance is not provided.
:pr:`21481` by :user:`Guillaume Lemaitre <glemaitre>` and
:user:`Andrés Babino <ababino>`.

:mod:`sklearn.metrics`
......................

123 changes: 123 additions & 0 deletions examples/linear_model/plot_lasso_lars_ic.py
@@ -0,0 +1,123 @@
"""
==============================================
Lasso model selection via information criteria
==============================================

This example reproduces Fig. 2 of [ZHT2007]_. A
:class:`~sklearn.linear_model.LassoLarsIC` estimator is fit on a
diabetes dataset and the AIC and the BIC criteria are used to select
the best model.

.. note::
It is important to note that the optimization to find `alpha` with
:class:`~sklearn.linear_model.LassoLarsIC` relies on the AIC or BIC
criteria that are computed in-sample, thus on the training set directly.
This approach differs from the cross-validation procedure. For a comparison
of the two approaches, you can refer to the following example:
:ref:`sphx_glr_auto_examples_linear_model_plot_lasso_model_selection.py`.

.. topic:: References

.. [ZHT2007] :arxiv:`Zou, Hui, Trevor Hastie, and Robert Tibshirani.
"On the degrees of freedom of the lasso."
The Annals of Statistics 35.5 (2007): 2173-2192.
<0712.0881>`
"""

# Author: Alexandre Gramfort
# Guillaume Lemaitre
# License: BSD 3 clause

# %%
import sklearn

sklearn.set_config(display="diagram")

# %%
# We will use the diabetes dataset.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
n_samples = X.shape[0]
X.head()

# %%
# Scikit-learn provides an estimator called
# :class:`~sklearn.linear_model.LassoLarsIC` that uses either Akaike's
# information criterion (AIC) or the Bayesian information criterion (BIC) to
# select the best model. Before fitting
# this model, we will scale the dataset.
#
# In the following, we are going to fit two models to compare the values
# reported by AIC and BIC.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline

lasso_lars_ic = make_pipeline(
    StandardScaler(), LassoLarsIC(criterion="aic", normalize=False)
).fit(X, y)


# %%
# To be in line with the definition in [ZHT2007]_, we need to rescale the
# AIC and the BIC. Indeed, Zou et al. ignore some constant terms compared to
# the original definition of AIC derived from the maximum log-likelihood of a
# linear model. You can refer to the
# :ref:`mathematical details section of the User Guide <lasso_lars_ic>`.
def zou_et_al_criterion_rescaling(criterion, n_samples, noise_variance):
    """Rescale the information criterion to follow the definition of Zou et al."""
    return criterion - n_samples * np.log(2 * np.pi * noise_variance) - n_samples


# %%
import numpy as np

aic_criterion = zou_et_al_criterion_rescaling(
    lasso_lars_ic[-1].criterion_,
    n_samples,
    lasso_lars_ic[-1].noise_variance_,
)

index_alpha_path_aic = np.flatnonzero(
    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
)[0]

# %%
lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)

bic_criterion = zou_et_al_criterion_rescaling(
    lasso_lars_ic[-1].criterion_,
    n_samples,
    lasso_lars_ic[-1].noise_variance_,
)

index_alpha_path_bic = np.flatnonzero(
    lasso_lars_ic[-1].alphas_ == lasso_lars_ic[-1].alpha_
)[0]

# %%
# Now that we have collected the AIC and BIC, we can also check that the
# minima of both criteria occur at the same alpha, which lets us simplify
# the following plot.
index_alpha_path_aic == index_alpha_path_bic

# %%
# Finally, we can plot the AIC and BIC criteria and the selected
# regularization parameter.
import matplotlib.pyplot as plt

plt.plot(aic_criterion, color="tab:blue", marker="o", label="AIC criterion")
plt.plot(bic_criterion, color="tab:orange", marker="o", label="BIC criterion")
plt.vlines(
    index_alpha_path_bic,
    aic_criterion.min(),
    aic_criterion.max(),
    color="black",
    linestyle="--",
    label="Selected alpha",
)
plt.legend()
plt.ylabel("Information criterion")
plt.xlabel("Lasso model sequence")
_ = plt.title("Lasso model selection via AIC and BIC")