FIX the tests for convergence to the minimum norm solution of unpenalized ridge / OLS #25948

Draft: wants to merge 24 commits into base: main

Commits (24)
d7d5ac0
DOC document why we center linear regression and set intercept a-post…
ogrisel Mar 22, 2023
abd5f4c
typo
ogrisel Mar 23, 2023
977ad8b
Reference the equivalence of solutions where appropriate in the code
ogrisel Mar 23, 2023
1580e6c
consistent notation for column vector dot products
ogrisel Mar 23, 2023
c4d5e3b
WIP update test_ridge.py
ogrisel Mar 23, 2023
06afd44
Update the doc to properly cover the underdetermined case and give mo…
ogrisel Mar 24, 2023
a4c0295
Fix header levels for the new section
ogrisel Mar 24, 2023
8475086
Use pinv to deal with singular X_c @ X_c.T in fixture for all admissi…
ogrisel Mar 24, 2023
ddd8f56
Merge branch 'main' into least-norm-ridge
ogrisel Mar 27, 2023
303dbe0
remove incorrect derivation for the underdetermined case and link to …
ogrisel Mar 27, 2023
39a60e6
Improve / fix the docstring of Ridge based on the findings of this PR
ogrisel Mar 27, 2023
831137a
Fix typo in last commit
ogrisel Mar 27, 2023
816d6bd
Typos and style
ogrisel Apr 11, 2023
bc4f8d6
Typos / grammar fixes
ogrisel Apr 16, 2023
102bfd8
More explicit phrasing
ogrisel Apr 13, 2023
00e4159
Take @agramfort's suggestion into account
ogrisel Apr 17, 2023
5ebfcfa
equation => expression
ogrisel Apr 17, 2023
1730aa1
Merge branch 'main' into least-norm-ridge
ogrisel Oct 21, 2024
0305814
Post merge conflict fix.
ogrisel Oct 21, 2024
d567350
Fix grammar and improve phrasing in linear_model.rst
ogrisel Oct 22, 2024
2375dfe
Fix tests for unpenalized case
ogrisel Oct 23, 2024
e73ac96
Add some TODOs for a later PR
ogrisel Oct 23, 2024
1795e3c
Remove wrong comment.
ogrisel Oct 23, 2024
036552b
Improve notes in the docstring of the fixture itself
ogrisel Oct 23, 2024
137 changes: 126 additions & 11 deletions doc/modules/linear_model.rst
@@ -24,13 +24,12 @@ To perform classification with generalized linear models, see
Ordinary Least Squares
=======================

:class:`LinearRegression` fits a linear model with coefficients
:math:`w = (w_1, ..., w_p)` to minimize the residual sum
of squares between the observed targets in the dataset, and the
targets predicted by the linear approximation. Mathematically it
solves a problem of the form:
:class:`LinearRegression` fits a linear model with coefficients :math:`w =
(w_1, ..., w_p)` to minimize the residual sum of squares between the observed
targets in the dataset, and the targets predicted by the linear approximation.
Mathematically it solves a problem of the form:

.. math:: \min_{w} || X w - y||_2^2
.. math:: \min_{w, w_0} || X w + w_0 - y||_2^2
Review comment (Member):
    We should make it then consistent throughout this document.

Review comment (Member Author):
    I agree. Do we introduce the 1_s vector everywhere for consistency?

Review comment (Member):
    I would not do it because it is harder to read and most people, I guess, will understand without it.

Review comment (Member):
    It could be used in a "mathematical details" section.


.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_ols_001.png
:target: ../auto_examples/linear_model/plot_ols.html
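
The objective above can be checked numerically with a minimal sketch (synthetic
data and illustrative variable names only): the fitted ``coef_`` and
``intercept_`` of :class:`LinearRegression` match a plain least-squares solve
on an intercept-augmented design matrix::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(20, 3)
    y = rng.randn(20)

    # Closed-form least squares on X augmented with a column of ones (the
    # last entry of the solution plays the role of the intercept w_0).
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

    reg = LinearRegression().fit(X, y)
    np.testing.assert_allclose(reg.coef_, w_aug[:-1])
    np.testing.assert_allclose(reg.intercept_, w_aug[-1])
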
@@ -100,12 +99,19 @@ of squares:

.. math::

\min_{w} || X w - y||_2^2 + \alpha ||w||_2^2
\min_{w, w_0} || X w + w_0 - y||_2^2 + \alpha ||w||_2^2

- :math:`X` is the feature matrix with shape `(n_samples, n_features)`;
- :math:`y` is the target vector with shape `(n_samples,)`;
- :math:`w` is the coefficient vector with shape `(n_features,)`;
- :math:`w_0` is the intercept (a scalar);
- :math:`1_s` is a vector of size `n_samples` where all entries are 1.

The complexity parameter :math:`\alpha \geq 0` controls the amount
of shrinkage: the larger the value of :math:`\alpha`, the greater the amount
of shrinkage and thus the coefficients become more robust to collinearity.
The regularization parameter :math:`\alpha \geq 0` controls the amount of
shrinkage: the larger the value of :math:`\alpha`, the greater the amount of
regularization, and thus the coefficients become more robust to collinearity
and to overfitting when `n_samples` is not large enough or when there are
many noisy features.
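
A minimal sketch of this shrinkage effect (synthetic data and illustrative
values of `alpha` only)::

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(30, 10)
    y = X @ rng.randn(10) + 0.1 * rng.randn(30)

    for alpha in [1e-2, 1.0, 1e2]:
        reg = Ridge(alpha=alpha).fit(X, y)
        # The norm of the coefficient vector decreases as alpha increases.
        print(alpha, np.linalg.norm(reg.coef_))
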

.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_ridge_path_001.png
:target: ../auto_examples/linear_model/plot_ridge_path.html
@@ -122,7 +128,7 @@ its ``coef_`` member::
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5)
>>> reg.coef_
array([0.34545455, 0.34545455])
array([0.34545..., 0.34545...])
Review comment (Member Author, @ogrisel, Mar 24, 2023):
    Note: this was not actually needed to make the doctest pass: this PR does not change the Ridge implementation. But I think it makes the docstring more readable and consistent with the 5 digits we display for the intercept.

>>> reg.intercept_
0.13636...
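
The convergence property targeted by this PR can be sketched on synthetic
underdetermined data (illustrative solver choice and tolerances): with a
vanishingly small `alpha`, the ridge coefficients match the minimum norm OLS
solution of the centered problem::

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(10, 50)  # more features than samples: no unique OLS solution
    y = rng.randn(10)

    # Minimum norm solution of the centered problem via the pseudo-inverse.
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    w_min_norm = np.linalg.pinv(X_c) @ y_c

    ridge = Ridge(alpha=1e-10, solver="svd").fit(X, y)
    ols = LinearRegression().fit(X, y)

    np.testing.assert_allclose(ridge.coef_, w_min_norm, atol=1e-6)
    np.testing.assert_allclose(ols.coef_, w_min_norm, atol=1e-6)
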

@@ -849,6 +855,115 @@ Ridge Regression`_, see the example below.

.. [4] Tristan Fletcher: `Relevance Vector Machines Explained <https://citeseerx.ist.psu.edu/doc_view/pid/3dc9d625404fdfef6eaccc3babddefe4c176abd4>`_

.. _linear_regression_intercept:

Note on the handling of the intercept in linear regression models
=================================================================

All least-squares linear regression estimators in scikit-learn solve a centered
version of the problem without intercept and then compute the intercept a
posteriori from that solution.

For underdetermined problems (more features than samples or in the presence of
collinear features), this ensures that the ridge solution converges to the
minimum norm OLS solution when `alpha` tends to zero. Note that to ensure the
continuity of the solution path with respect to `alpha`, we consider a
formulation of the OLS problem that excludes the intercept from the computation
of the norm in the objective function.

The following explains why this approach is valid, both in the penalized and
the unpenalized underdetermined cases.

Intercept of the penalized least-squares solution
-------------------------------------------------

Recall the definition of the ridge estimator:

.. math:: \min_{w, w_0} { ||X w + w_0 1_s - y||_2 ^ 2 + \alpha ||w||_2 ^ 2}

Here :math:`1_s` is a vector of size `n_samples` where all entries are 1.
:math:`w_0` is a scalar parameter and is not regularized.

Let :math:`\bar{X}` be the column vector whose entries are the means of the
columns of :math:`X`, :math:`\bar{y}` the mean of :math:`y`, and :math:`X_c`
and :math:`y_c` the centered versions of :math:`X` and :math:`y`, such that:

.. math:: X = X_c + 1_s \bar{X}^{T} \text{ and } y = y_c + \bar{y} 1_s

Optimizing the original ridge objective with intercept is equivalent to solving
the following ridge problem on the centered dataset with no intercept:

.. math:: \hat{w} = \arg\min_{w} { ||X_c w - y_c||_2 ^ 2 + \alpha ||w||_2 ^ 2}

and subsequently define the intercept as :math:`\hat{w_0} = \bar{y} -
\bar{X}^{T} \hat{w}`.

To see this, let's rewrite the data-fit term of the original ridge problem as
follows (after rearranging the terms and factoring out :math:`1_s`):

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c + (\bar{X}^{T} w + w_0 - \bar{y}) 1_s ||_2^2

Let's now define the scalar :math:`\bar{r} = \bar{X}^{T} w + w_0 - \bar{y}`. We
can then expand the data-fit term of the original ridge problem as follows:

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c ||_2^2 + ||\bar{r} 1_s ||_2^2 + 2 \bar{r} 1_s^{T} (X_c w - y_c)

If :math:`n` is the number of data points, we have:

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c ||_2^2 + \bar{r}^2 n + 2 \bar{r} (1_s^{T} X_c w - 1_s^{T} y_c)

Since :math:`X_c` and :math:`y_c` are the centered versions of X and y, we have
:math:`1_s^{T} X_c = 0` and :math:`1_s^{T} y_c = 0`. Therefore:

.. math:: ||X w + w_0 1_s - y||_2^2 + \alpha ||w||_2^2 = ||X_c w - y_c ||_2^2 + \alpha ||w||_2^2 + \bar{r}^2 n

Let's take the derivative of the above expression with respect to :math:`w_0`
and set it to 0. The first two terms do not depend on :math:`w_0`, so we only
need to set the derivative of the last term to 0 to obtain the value of the
intercept at the minimum:

.. math:: \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}

and as a result, at the minimum, :math:`\bar{r} = 0` and therefore:

.. math:: ||X w + \hat{w_0} 1_s - y||_2^2 + \alpha ||w||_2^2 = ||X_c w - y_c ||_2^2 + \alpha ||w||_2^2

This is minimized by :math:`\hat{w}`, the solution of the centered ridge
problem without intercept.

Note that the same argument holds for any other penalized linear regression
estimator (as long as the penalty is such that the solution is unique).
Review comment (Member):
    This explanation is nice and easy to follow.

    When I do this in a class I say that at the optimum the gradient wrt w and w_0 should be zero, and when you write the gradient wrt w_0 you get directly \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}.
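
The equivalence derived above can be checked numerically with a short sketch
(synthetic data; the explicit normal-equations solve is only for
illustration)::

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(20, 5)
    y = rng.randn(20)
    alpha = 0.5

    # Solve the centered ridge problem without intercept ...
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    w_hat = np.linalg.solve(X_c.T @ X_c + alpha * np.eye(X.shape[1]), X_c.T @ y_c)
    # ... then set the intercept a posteriori.
    w0_hat = y.mean() - X.mean(axis=0) @ w_hat

    reg = Ridge(alpha=alpha).fit(X, y)
    np.testing.assert_allclose(reg.coef_, w_hat, rtol=1e-6)
    np.testing.assert_allclose(reg.intercept_, w0_hat, rtol=1e-6)
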


Intercept of the ordinary least-squares solution
------------------------------------------------

For the OLS estimator (as implemented in :class:`LinearRegression` or
equivalently for :class:`Ridge` with `alpha=0`), the solution is not unique
when the rank of :math:`X` is lower than `n_features`. In this case,
scikit-learn estimators converge to the minimum norm solution defined as
follows:

.. math:: \hat{w} = \arg\min_{w, w_0} ||w||_2^2 \text{ s.t. } X w + w_0 1_s = y

Note that the intercept :math:`w_0` does not take part in the computation of
the norm.

Again this is equivalent to the minimum norm solution of the centered problem
without intercept:

.. math:: \hat{w} = \arg\min_{w} ||w||_2^2 \text{ s.t. } X_c w = y_c

along with:

.. math:: \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}

TODO: include proof? if so where should we put this, in maintainers notes somewhere?

Here is an external draft of the proof:

https://github.com/ogrisel/minimum-norm-ols/blob/main/minimum-norm-ols-intercept.pdf
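
Independently of the proof, the minimum norm property can be illustrated with a
short sketch (synthetic wide data; the construction of the alternative solution
is only illustrative): any other coefficient vector that also fits the training
data exactly has a larger norm than the one returned by
:class:`LinearRegression`::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(5, 20)  # rank(X) < n_features: infinitely many exact fits
    y = rng.randn(5)

    reg = LinearRegression().fit(X, y)
    assert np.allclose(reg.predict(X), y)  # the fitted model interpolates

    # Perturb the coefficients along the null space of the centered X: the
    # perturbed vector still interpolates the training data but has a larger
    # norm.
    X_c = X - X.mean(axis=0)
    e = np.zeros(X.shape[1])
    e[0] = 1.0
    null_component = e - np.linalg.pinv(X_c) @ (X_c @ e)
    w_other = reg.coef_ + null_component
    w0_other = y.mean() - X.mean(axis=0) @ w_other

    assert np.allclose(X @ w_other + w0_other, y)
    assert np.linalg.norm(w_other) > np.linalg.norm(reg.coef_)
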

.. _Logistic_regression:

Logistic regression
9 changes: 9 additions & 0 deletions sklearn/linear_model/_base.py
@@ -137,6 +137,11 @@ def _preprocess_data(
If `fit_intercept=False`, no centering is performed and `X_offset`, `y_offset`
are set to zero.

Note that it is always possible to use the solution to the (penalized) least
squares problem on centered data to build the solution to the original
problem with or without intercept by setting the intercept a posteriori as
explained in the user guide: :ref:`linear_regression_intercept`.

Returns
-------
X_out : {ndarray, sparse matrix} of shape (n_samples, n_features)
@@ -307,6 +312,10 @@ def _set_intercept(self, X_offset, y_offset, X_scale):
coef_ = xp.astype(self.coef_, X_scale.dtype, copy=False)
coef_ = self.coef_ = xp.divide(coef_, X_scale)

# If self.coef_ is the solution of the penalized least squares on
# centered data then it is the solution of the least squares on the
# uncentered data with the following intercept:
# https://scikit-learn.org/stable/modules/linear_model.html#linear_regression_intercept
if coef_.ndim == 1:
intercept_ = y_offset - X_offset @ coef_
else:
36 changes: 25 additions & 11 deletions sklearn/linear_model/_ridge.py
@@ -253,6 +253,7 @@ def _solve_cholesky_kernel(K, y, alpha, sample_weight=None, copy=False):
"Singular matrix in solving dual problem. Using "
"least-squares solution instead."
)
# TODO: always pass an explicit value of cond to lstsq.
dual_coef = linalg.lstsq(K, y)[0]

# K is expensive to compute and store in memory so change it back in
@@ -270,6 +271,7 @@ def _solve_cholesky_kernel(K, y, alpha, sample_weight=None, copy=False):
for dual_coef, target, current_alpha in zip(dual_coefs, y.T, alpha):
K.flat[:: n_samples + 1] += current_alpha

# TODO: handle np.linalg.LinAlgError as done for one_alpha case.
dual_coef[:] = linalg.solve(
K, target, assume_a="pos", overwrite_a=False
).ravel()
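
For context, a minimal sketch of the dual (kernel) system solved by this
helper, on synthetic data with illustrative names (intercept handling and
sample weights are omitted):

    import numpy as np
    from scipy import linalg

    rng = np.random.RandomState(0)
    X = rng.randn(8, 30)  # n_samples < n_features: the dual system is smaller
    y = rng.randn(8)
    alpha = 0.1

    K = X @ X.T
    K.flat[:: K.shape[0] + 1] += alpha  # add alpha on the diagonal of K
    dual_coef = linalg.solve(K, y, assume_a="pos")
    coef = X.T @ dual_coef  # recover the primal coefficients

    # Matches the primal closed-form ridge solution on the same data.
    coef_primal = linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    np.testing.assert_allclose(coef, coef_primal)
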
@@ -1015,7 +1017,7 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):

Minimizes the objective function::

||y - Xw||^2_2 + alpha * ||w||^2_2
||y - Xw - w_0||^2_2 + alpha * ||w||^2_2

This model solves a regression model where the loss function is
the linear least squares function and regularization is given by
@@ -1032,17 +1034,24 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):
strength. `alpha` must be a non-negative float i.e. in `[0, inf)`.

When `alpha = 0`, the objective is equivalent to ordinary least
squares, solved by the :class:`LinearRegression` object. For numerical
reasons, using `alpha = 0` with the `Ridge` object is not advised.
Instead, you should use the :class:`LinearRegression` object.
squares, solved by the :class:`LinearRegression` class. If `X.T @ X` is
singular (typically when `n_features > n_samples` or in the presence of
collinear features) then both classes should converge to the minimum
norm solution where the intercept does not participate in the
computation of the norm.

Some solvers of the `Ridge` class are known to be unstable or to fail
for `alpha = 0`. In that case, you should use the
:class:`LinearRegression` class instead.

If an array is passed, penalties are assumed to be specific to the
targets. Hence they must correspond in number.

fit_intercept : bool, default=True
Whether to fit the intercept for this model. If set
to false, no intercept will be used in calculations
(i.e. ``X`` and ``y`` are expected to be centered).
(i.e. ``X`` and ``y`` are expected to be centered for
the model to be well-specified).

copy_X : bool, default=True
If True, X will be copied; else, it may be overwritten.
@@ -1083,25 +1092,28 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):
- 'auto' chooses the solver automatically based on the type of data.

- 'svd' uses a Singular Value Decomposition of X to compute the Ridge
coefficients. It is the most stable solver, in particular more stable
coefficients. It is a stable solver, in particular more stable
for singular matrices than 'cholesky' at the cost of being slower.

- 'cholesky' uses the standard scipy.linalg.solve function to
obtain a closed-form solution.
- 'cholesky' uses the standard `scipy.linalg.solve` function to obtain
a closed-form solution. This solver can fail when the regularization
is too small on highly correlated data. In this case, the 'svd' solver
is automatically used instead.

- 'sparse_cg' uses the conjugate gradient solver as found in
scipy.sparse.linalg.cg. As an iterative algorithm, this solver is
more appropriate than 'cholesky' for large-scale data
(possibility to set `tol` and `max_iter`).

- 'lsqr' uses the dedicated regularized least-squares routine
scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative
procedure.
`scipy.sparse.linalg.lsqr`. It is often one of the fastest solvers and
uses an iterative procedure. It can efficiently handle training data with
singular `X.T @ X`, even with low regularization.

- 'sag' uses a Stochastic Average Gradient descent, and 'saga' uses
its improved, unbiased version named SAGA. Both methods also use an
iterative procedure, and are often faster than other solvers when
both n_samples and n_features are large. Note that 'sag' and
both n_samples and n_features are very large. Note that 'sag' and
'saga' fast convergence is only guaranteed on features with
approximately the same scale. You can preprocess the data with a
scaler from sklearn.preprocessing.
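
A minimal sketch comparing a few of these solvers on collinear data
(illustrative solver subset, `alpha`, and tolerances):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(30, 5)
    X = np.hstack([X, X[:, :2]])  # duplicated columns: X.T @ X is singular
    y = rng.randn(30)

    coefs = {}
    for solver in ["svd", "cholesky", "lsqr"]:
        reg = Ridge(alpha=1e-6, solver=solver, tol=1e-12).fit(X, y)
        coefs[solver] = reg.coef_

    # With a small but positive alpha the solution is unique, so the solvers
    # agree even though the unpenalized problem is degenerate.
    np.testing.assert_allclose(coefs["cholesky"], coefs["svd"], rtol=1e-4, atol=1e-8)
    np.testing.assert_allclose(coefs["lsqr"], coefs["svd"], rtol=1e-4, atol=1e-8)
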
@@ -1164,6 +1176,8 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):

See Also
--------
LinearRegression : Ordinary least squares Linear Regression, equivalent to
`Ridge(alpha=0)`.
RidgeClassifier : Ridge classifier.
RidgeCV : Ridge regression with built-in cross validation.
:class:`~sklearn.kernel_ridge.KernelRidge` : Kernel ridge regression