FIX the tests for convergence to the minimum norm solution of unpenalized ridge / OLS #25948

Draft: wants to merge 24 commits into base: main

Commits (24)
d7d5ac0
DOC document why we center linear regression and set intercept a-post…
ogrisel Mar 22, 2023
abd5f4c
typo
ogrisel Mar 23, 2023
977ad8b
Reference the equivalence of solutions where appropriate in the code
ogrisel Mar 23, 2023
1580e6c
consistent notation for column vector dot products
ogrisel Mar 23, 2023
c4d5e3b
WIP update test_ridge.py
ogrisel Mar 23, 2023
06afd44
Update the doc to properly cover the underdetermined case and give mo…
ogrisel Mar 24, 2023
a4c0295
Fix header levels for the new section
ogrisel Mar 24, 2023
8475086
Use pinv to deal with singular X_c @ X_c.T in fixture for all admissi…
ogrisel Mar 24, 2023
ddd8f56
Merge branch 'main' into least-norm-ridge
ogrisel Mar 27, 2023
303dbe0
remove incorrect derivation for the underdetermined case and link to …
ogrisel Mar 27, 2023
39a60e6
Improve / fix the docstring of Ridge based on the findings of this PR
ogrisel Mar 27, 2023
831137a
Fix typo in last commit
ogrisel Mar 27, 2023
816d6bd
Typos and style
ogrisel Apr 11, 2023
bc4f8d6
Typos / grammar fixes
ogrisel Apr 16, 2023
102bfd8
More explicit phrasing
ogrisel Apr 13, 2023
00e4159
Take @agramfort's suggestion into account
ogrisel Apr 17, 2023
5ebfcfa
equation => expression
ogrisel Apr 17, 2023
1730aa1
Merge branch 'main' into least-norm-ridge
ogrisel Oct 21, 2024
0305814
Post merge conflict fix.
ogrisel Oct 21, 2024
d567350
Fix grammar and improve phrasing in linear_model.rst
ogrisel Oct 22, 2024
2375dfe
Fix tests for unpenalized case
ogrisel Oct 23, 2024
e73ac96
Add some TODOs for a later PR
ogrisel Oct 23, 2024
1795e3c
Remove wrong comment.
ogrisel Oct 23, 2024
036552b
Improve notes in the docstring of the fixture itself
ogrisel Oct 23, 2024
137 changes: 126 additions & 11 deletions doc/modules/linear_model.rst
@@ -24,13 +24,12 @@ To perform classification with generalized linear models, see
Ordinary Least Squares
=======================

:class:`LinearRegression` fits a linear model with coefficients
:math:`w = (w_1, ..., w_p)` to minimize the residual sum
of squares between the observed targets in the dataset, and the
targets predicted by the linear approximation. Mathematically it
solves a problem of the form:
:class:`LinearRegression` fits a linear model with coefficients :math:`w =
(w_1, ..., w_p)` to minimize the residual sum of squares between the observed
targets in the dataset, and the targets predicted by the linear approximation.
Mathematically it solves a problem of the form:

.. math:: \min_{w} || X w - y||_2^2
.. math:: \min_{w, w_0} || X w + w_0 - y||_2^2
Review comment (Member):
    We should make it then consistent throughout this document.

Review comment (Member Author):
    I agree. Do we introduce the 1_s vector everywhere for consistency?

Review comment (Member):
    I would not do it because it is harder to read and most people, I guess, will understand without it.

Review comment (Member):
    It could be used in a "mathematical details" section.


.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_ols_001.png
:target: ../auto_examples/linear_model/plot_ols.html
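
The objective above can be checked numerically with a minimal sketch (synthetic
data and illustrative variable names only): the fitted ``coef_`` and
``intercept_`` of :class:`LinearRegression` match a plain least-squares solve
on an intercept-augmented design matrix::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(20, 3)
    y = rng.randn(20)

    # Closed-form least squares on X augmented with a column of ones (the
    # last entry of the solution plays the role of the intercept w_0).
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

    reg = LinearRegression().fit(X, y)
    np.testing.assert_allclose(reg.coef_, w_aug[:-1])
    np.testing.assert_allclose(reg.intercept_, w_aug[-1])
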
@@ -100,12 +99,19 @@ of squares:

.. math::

\min_{w} || X w - y||_2^2 + \alpha ||w||_2^2
\min_{w, w_0} || X w + w_0 - y||_2^2 + \alpha ||w||_2^2

- :math:`X` is the feature matrix with shape `(n_samples, n_features)`;
- :math:`y` is the target vector with shape `(n_samples,)`;
- :math:`w` is the coefficient vector with shape `(n_features,)`;
- :math:`w_0` is the intercept (a scalar);
- :math:`1_s` is a vector of size `n_samples` where all entries are 1.

The complexity parameter :math:`\alpha \geq 0` controls the amount
of shrinkage: the larger the value of :math:`\alpha`, the greater the amount
of shrinkage and thus the coefficients become more robust to collinearity.
The regularization parameter :math:`\alpha \geq 0` controls the amount of
shrinkage: the larger the value of :math:`\alpha`, the greater the amount of
regularization, and thus the coefficients become more robust to collinearity
and to overfitting when `n_samples` is not large enough or when there are
many noisy features.
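
A minimal sketch of this shrinkage effect (synthetic data and illustrative
values of `alpha` only)::

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(30, 10)
    y = X @ rng.randn(10) + 0.1 * rng.randn(30)

    for alpha in [1e-2, 1.0, 1e2]:
        reg = Ridge(alpha=alpha).fit(X, y)
        # The norm of the coefficient vector decreases as alpha increases.
        print(alpha, np.linalg.norm(reg.coef_))
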

.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_ridge_path_001.png
:target: ../auto_examples/linear_model/plot_ridge_path.html
@@ -122,7 +128,7 @@ its ``coef_`` member::
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5)
>>> reg.coef_
array([0.34545455, 0.34545455])
array([0.34545..., 0.34545...])
Review comment (Member Author, @ogrisel, Mar 24, 2023):
    Note: this was not actually needed to make the doctest pass: this PR does not change the Ridge implementation. But I think it makes the docstring more readable and consistent with the 5 digits we display for the intercept.

>>> reg.intercept_
0.13636...
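
The convergence property targeted by this PR can be sketched on synthetic
underdetermined data (illustrative solver choice and tolerances): with a
vanishingly small `alpha`, the ridge coefficients match the minimum norm OLS
solution of the centered problem::

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(10, 50)  # more features than samples: no unique OLS solution
    y = rng.randn(10)

    # Minimum norm solution of the centered problem via the pseudo-inverse.
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    w_min_norm = np.linalg.pinv(X_c) @ y_c

    ridge = Ridge(alpha=1e-10, solver="svd").fit(X, y)
    ols = LinearRegression().fit(X, y)

    np.testing.assert_allclose(ridge.coef_, w_min_norm, atol=1e-6)
    np.testing.assert_allclose(ols.coef_, w_min_norm, atol=1e-6)
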

@@ -849,6 +855,115 @@ Ridge Regression`_, see the example below.

.. [4] Tristan Fletcher: `Relevance Vector Machines Explained <https://citeseerx.ist.psu.edu/doc_view/pid/3dc9d625404fdfef6eaccc3babddefe4c176abd4>`_

.. _linear_regression_intercept:

Note on the handling of the intercept in linear regression models
=================================================================

All least-squares linear regression estimators in scikit-learn solve a centered
version of the problem without intercept and then compute the intercept a
posteriori from that solution.

For underdetermined problems (more features than samples or in the presence of
collinear features), this ensures that the ridge solution converges to the
minimum norm OLS solution when `alpha` tends to zero. Note that to ensure the
continuity of the solution path with respect to `alpha`, we consider a
formulation of the OLS problem that excludes the intercept from the computation
of the norm in the objective function.

The following explains why this approach is valid, both in the penalized and
the unpenalized underdetermined cases.

Intercept of the penalized least-squares solution
-------------------------------------------------

Recall the definition of the ridge estimator:

.. math:: \min_{w, w_0} { ||X w + w_0 1_s - y||_2 ^ 2 + \alpha ||w||_2 ^ 2}

Here :math:`1_s` is a vector of size `n_samples` where all entries are 1.
:math:`w_0` is a scalar parameter and is not regularized.

Let :math:`\bar{X}` be the column vector whose entries are the means of the
columns of :math:`X`, :math:`\bar{y}` the mean of :math:`y`, and :math:`X_c`
and :math:`y_c` the centered versions of :math:`X` and :math:`y`, such that:

.. math:: X = X_c + 1_s \bar{X}^{T} \text{ and } y = y_c + \bar{y} 1_s

Optimizing the original ridge objective with intercept is equivalent to solving
the following ridge problem on the centered dataset with no intercept:

.. math:: \hat{w} = \arg\min_{w} { ||X_c w - y_c||_2 ^ 2 + \alpha ||w||_2 ^ 2}

and subsequently define the intercept as :math:`\hat{w_0} = \bar{y} -
\bar{X}^{T} \hat{w}`.

To see this, let's rewrite the data-fit term of the original ridge problem as
follows (after rearranging the terms and factoring out :math:`1_s`):

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c + (\bar{X}^{T} w + w_0 - \bar{y}) 1_s ||_2^2

Let's now define the scalar :math:`\bar{r} = \bar{X}^{T} w + w_0 - \bar{y}`. We
can then expand the data-fit term of the original ridge problem as follows:

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c ||_2^2 + ||\bar{r} 1_s ||_2^2 + 2 \bar{r} 1_s^{T} (X_c w - y_c)

If :math:`n` is the number of data points, we have:

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c ||_2^2 + \bar{r}^2 n + 2 \bar{r} (1_s^{T} X_c w - 1_s^{T} y_c)

Since :math:`X_c` and :math:`y_c` are the centered versions of X and y, we have
:math:`1_s^{T} X_c = 0` and :math:`1_s^{T} y_c = 0`. Therefore:

.. math:: ||X w + w_0 1_s - y||_2^2 + \alpha ||w||_2^2 = ||X_c w - y_c ||_2^2 + \alpha ||w||_2^2 + \bar{r}^2 n

Let's take the derivative of the above expression with respect to :math:`w_0`
and set it to 0. The first two terms do not depend on :math:`w_0`, so we only
need to set the derivative of the last term to 0 to obtain the value of the
intercept at the minimum:

.. math:: \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}

and as a result, at the minimum, :math:`\bar{r} = 0` and therefore:

.. math:: ||X w + \hat{w_0} 1_s - y||_2^2 + \alpha ||w||_2^2 = ||X_c w - y_c ||_2^2 + \alpha ||w||_2^2

This is minimized by :math:`\hat{w}`, the solution of the centered ridge
problem without intercept.

Note that the same argument holds for any other penalized linear regression
estimator (as long as the penalty is such that the solution is unique).
Review comment (Member):
    This explanation is nice and easy to follow.

    When I do this in a class I say that at the optimum the gradient wrt w and w_0 should be zero, and when you write the gradient wrt w_0 you get directly \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}.
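
The equivalence derived above can be checked numerically with a short sketch
(synthetic data; the explicit normal-equations solve is only for
illustration)::

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(20, 5)
    y = rng.randn(20)
    alpha = 0.5

    # Solve the centered ridge problem without intercept ...
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    w_hat = np.linalg.solve(X_c.T @ X_c + alpha * np.eye(X.shape[1]), X_c.T @ y_c)
    # ... then set the intercept a posteriori.
    w0_hat = y.mean() - X.mean(axis=0) @ w_hat

    reg = Ridge(alpha=alpha).fit(X, y)
    np.testing.assert_allclose(reg.coef_, w_hat, rtol=1e-6)
    np.testing.assert_allclose(reg.intercept_, w0_hat, rtol=1e-6)
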


Intercept of the ordinary least-squares solution
------------------------------------------------

For the OLS estimator (as implemented in :class:`LinearRegression` or
equivalently for :class:`Ridge` with `alpha=0`), the solution is not unique
when the rank of :math:`X` is lower than `n_features`. In this case,
scikit-learn estimators converge to the minimum norm solution defined as
follows:

.. math:: \hat{w} = \arg\min_{w, w_0} ||w||_2^2 \text{ s.t. } X w + w_0 1_s = y

Note that the intercept :math:`w_0` does not take part in the computation of
the norm.

Again this is equivalent to the minimum norm solution of the centered problem
without intercept:

.. math:: \hat{w} = \arg\min_{w} ||w||_2^2 \text{ s.t. } X_c w = y_c

along with:

.. math:: \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}

TODO: include proof? if so where should we put this, in maintainers notes somewhere?

Here is an external draft of the proof:

https://github.com/ogrisel/minimum-norm-ols/blob/main/minimum-norm-ols-intercept.pdf
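
Independently of the proof, the minimum norm property can be illustrated with a
short sketch (synthetic wide data; the construction of the alternative solution
is only illustrative): any other coefficient vector that also fits the training
data exactly has a larger norm than the one returned by
:class:`LinearRegression`::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(5, 20)  # rank(X) < n_features: infinitely many exact fits
    y = rng.randn(5)

    reg = LinearRegression().fit(X, y)
    assert np.allclose(reg.predict(X), y)  # the fitted model interpolates

    # Perturb the coefficients along the null space of the centered X: the
    # perturbed vector still interpolates the training data but has a larger
    # norm.
    X_c = X - X.mean(axis=0)
    e = np.zeros(X.shape[1])
    e[0] = 1.0
    null_component = e - np.linalg.pinv(X_c) @ (X_c @ e)
    w_other = reg.coef_ + null_component
    w0_other = y.mean() - X.mean(axis=0) @ w_other

    assert np.allclose(X @ w_other + w0_other, y)
    assert np.linalg.norm(w_other) > np.linalg.norm(reg.coef_)
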

.. _Logistic_regression:

Logistic regression
9 changes: 9 additions & 0 deletions sklearn/linear_model/_base.py
@@ -137,6 +137,11 @@ def _preprocess_data(
If `fit_intercept=False`, no centering is performed and `X_offset`, `y_offset`
are set to zero.

Note that it is always possible to use the solution to the (penalized) least
squares problem on centered data to build the solution to the original
problem with or without intercept by setting the intercept a posteriori as
explained in the user guide: :ref:`linear_regression_intercept`.

Returns
-------
X_out : {ndarray, sparse matrix} of shape (n_samples, n_features)
@@ -307,6 +312,10 @@ def _set_intercept(self, X_offset, y_offset, X_scale):
coef_ = xp.astype(self.coef_, X_scale.dtype, copy=False)
coef_ = self.coef_ = xp.divide(coef_, X_scale)

# If self.coef_ is the solution of the penalized least squares on
# centered data then it is the solution of the least squares on the
# uncentered data with the following intercept:
# https://scikit-learn.org/stable/modules/linear_model.html#linear_regression_intercept
if coef_.ndim == 1:
intercept_ = y_offset - X_offset @ coef_
else:
36 changes: 25 additions & 11 deletions sklearn/linear_model/_ridge.py
@@ -253,6 +253,7 @@ def _solve_cholesky_kernel(K, y, alpha, sample_weight=None, copy=False):
"Singular matrix in solving dual problem. Using "
"least-squares solution instead."
)
# TODO: always pass an explicit value of cond to lstsq.
dual_coef = linalg.lstsq(K, y)[0]

# K is expensive to compute and store in memory so change it back in
@@ -270,6 +271,7 @@ def _solve_cholesky_kernel(K, y, alpha, sample_weight=None, copy=False):
for dual_coef, target, current_alpha in zip(dual_coefs, y.T, alpha):
K.flat[:: n_samples + 1] += current_alpha

# TODO: handle np.linalg.LinAlgError as done for one_alpha case.
dual_coef[:] = linalg.solve(
K, target, assume_a="pos", overwrite_a=False
).ravel()
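
For context, a minimal sketch of the dual (kernel) system solved by this
helper, on synthetic data with illustrative names (intercept handling and
sample weights are omitted):

    import numpy as np
    from scipy import linalg

    rng = np.random.RandomState(0)
    X = rng.randn(8, 30)  # n_samples < n_features: the dual system is smaller
    y = rng.randn(8)
    alpha = 0.1

    K = X @ X.T
    K.flat[:: K.shape[0] + 1] += alpha  # add alpha on the diagonal of K
    dual_coef = linalg.solve(K, y, assume_a="pos")
    coef = X.T @ dual_coef  # recover the primal coefficients

    # Matches the primal closed-form ridge solution on the same data.
    coef_primal = linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    np.testing.assert_allclose(coef, coef_primal)
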
@@ -1015,7 +1017,7 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):

Minimizes the objective function::

||y - Xw||^2_2 + alpha * ||w||^2_2
||y - Xw - w_0||^2_2 + alpha * ||w||^2_2

This model solves a regression model where the loss function is
the linear least squares function and regularization is given by
@@ -1032,17 +1034,24 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):
strength. `alpha` must be a non-negative float i.e. in `[0, inf)`.

When `alpha = 0`, the objective is equivalent to ordinary least
squares, solved by the :class:`LinearRegression` object. For numerical
reasons, using `alpha = 0` with the `Ridge` object is not advised.
Instead, you should use the :class:`LinearRegression` object.
squares, solved by the :class:`LinearRegression` class. If `X.T @ X` is
singular (typically when `n_features > n_samples` or in the presence of
collinear features) then both classes should converge to the minimum
norm solution where the intercept does not participate in the
computation of the norm.

Some solvers of the `Ridge` class are known to be unstable or to fail
for `alpha = 0`. In that case, you should use the
:class:`LinearRegression` class instead.

If an array is passed, penalties are assumed to be specific to the
targets. Hence they must correspond in number.

fit_intercept : bool, default=True
Whether to fit the intercept for this model. If set
to false, no intercept will be used in calculations
(i.e. ``X`` and ``y`` are expected to be centered).
(i.e. ``X`` and ``y`` are expected to be centered for
the model to be well-specified).

copy_X : bool, default=True
If True, X will be copied; else, it may be overwritten.
@@ -1083,25 +1092,28 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):
- 'auto' chooses the solver automatically based on the type of data.

- 'svd' uses a Singular Value Decomposition of X to compute the Ridge
coefficients. It is the most stable solver, in particular more stable
coefficients. It is a stable solver, in particular more stable
for singular matrices than 'cholesky' at the cost of being slower.

- 'cholesky' uses the standard scipy.linalg.solve function to
obtain a closed-form solution.
- 'cholesky' uses the standard `scipy.linalg.solve` function to obtain
a closed-form solution. This solver can fail when the regularization
is too small on highly correlated data. In this case, the 'svd' solver
is automatically used instead.

- 'sparse_cg' uses the conjugate gradient solver as found in
scipy.sparse.linalg.cg. As an iterative algorithm, this solver is
more appropriate than 'cholesky' for large-scale data
(possibility to set `tol` and `max_iter`).

- 'lsqr' uses the dedicated regularized least-squares routine
scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative
procedure.
`scipy.sparse.linalg.lsqr`. It is often one of the fastest solvers and
uses an iterative procedure. It can efficiently handle training data with
singular `X.T @ X`, even with low regularization.

- 'sag' uses a Stochastic Average Gradient descent, and 'saga' uses
its improved, unbiased version named SAGA. Both methods also use an
iterative procedure, and are often faster than other solvers when
both n_samples and n_features are large. Note that 'sag' and
both n_samples and n_features are very large. Note that 'sag' and
'saga' fast convergence is only guaranteed on features with
approximately the same scale. You can preprocess the data with a
scaler from sklearn.preprocessing.
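
A minimal sketch comparing a few of these solvers on collinear data
(illustrative solver subset, `alpha`, and tolerances):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(30, 5)
    X = np.hstack([X, X[:, :2]])  # duplicated columns: X.T @ X is singular
    y = rng.randn(30)

    coefs = {}
    for solver in ["svd", "cholesky", "lsqr"]:
        reg = Ridge(alpha=1e-6, solver=solver, tol=1e-12).fit(X, y)
        coefs[solver] = reg.coef_

    # With a small but positive alpha the solution is unique, so the solvers
    # agree even though the unpenalized problem is degenerate.
    np.testing.assert_allclose(coefs["cholesky"], coefs["svd"], rtol=1e-4, atol=1e-8)
    np.testing.assert_allclose(coefs["lsqr"], coefs["svd"], rtol=1e-4, atol=1e-8)
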
@@ -1164,6 +1176,8 @@ class Ridge(MultiOutputMixin, RegressorMixin, _BaseRidge):

See Also
--------
LinearRegression : Ordinary least squares Linear Regression, equivalent to
`Ridge(alpha=0)`.
RidgeClassifier : Ridge classifier.
RidgeCV : Ridge regression with built-in cross validation.
:class:`~sklearn.kernel_ridge.KernelRidge` : Kernel ridge regression