FIX the tests for convergence to the minimum norm solution of unpenalized ridge / OLS #25948
@@ -24,13 +24,12 @@ To perform classification with generalized linear models, see

Ordinary Least Squares
=======================

:class:`LinearRegression` fits a linear model with coefficients
:math:`w = (w_1, ..., w_p)` to minimize the residual sum
of squares between the observed targets in the dataset, and the
targets predicted by the linear approximation. Mathematically it
solves a problem of the form:

:class:`LinearRegression` fits a linear model with coefficients :math:`w =
(w_1, ..., w_p)` to minimize the residual sum of squares between the observed
targets in the dataset, and the targets predicted by the linear approximation.
Mathematically it solves a problem of the form:

.. math:: \min_{w} || X w - y||_2^2

.. math:: \min_{w, w_0} || X w + w_0 - y||_2^2

.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_ols_001.png
   :target: ../auto_examples/linear_model/plot_ols.html
@@ -100,12 +99,19 @@ of squares:

.. math::

   \min_{w} || X w - y||_2^2 + \alpha ||w||_2^2

   \min_{w, w_0} || X w + w_0 - y||_2^2 + \alpha ||w||_2^2

- :math:`X` is the feature matrix with shape `(n_samples, n_features)`;
- :math:`y` is the target vector with shape `(n_samples,)`;
- :math:`w` is the coefficient vector with shape `(n_features,)`;
- :math:`w_0` is the intercept (a scalar);
- :math:`1_s` is a vector of size `n_samples` where all entries are 1.

The complexity parameter :math:`\alpha \geq 0` controls the amount
of shrinkage: the larger the value of :math:`\alpha`, the greater the amount
of shrinkage and thus the coefficients become more robust to collinearity.

The regularization parameter :math:`\alpha \geq 0` controls the amount of
shrinkage: the larger the value of :math:`\alpha`, the greater the amount of
regularization, and thus the more robust the coefficients become to
collinearity and to overfitting when `n_samples` is not large enough or when
there is a large number of noisy features.
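
The following is a minimal illustrative sketch of this shrinkage effect (not
part of the documented example; the data and the `alpha` values are arbitrary
choices made here)::

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(20, 10)
    y = rng.randn(20)

    # The L2 norm of the fitted coefficients decreases as alpha increases.
    for alpha in [1e-3, 1e0, 1e3]:
        w = Ridge(alpha=alpha).fit(X, y).coef_
        print(f"alpha={alpha:g}  ||w||_2={np.linalg.norm(w):.4f}")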

.. figure:: ../auto_examples/linear_model/images/sphx_glr_plot_ridge_path_001.png
   :target: ../auto_examples/linear_model/plot_ridge_path.html
@@ -122,7 +128,7 @@ its ``coef_`` member::

    >>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
    Ridge(alpha=0.5)
    >>> reg.coef_
    array([0.34545455, 0.34545455])
    array([0.34545..., 0.34545...])
Reviewer note: this was not actually needed to make the doctest pass, since this PR does not change the Ridge implementation, but it makes the docstring more readable and consistent with the 5 digits we display for the intercept.
    >>> reg.intercept_
    0.13636...

@@ -849,6 +855,115 @@ Ridge Regression`_, see the example below.

.. [4] Tristan Fletcher: `Relevance Vector Machines Explained <https://citeseerx.ist.psu.edu/doc_view/pid/3dc9d625404fdfef6eaccc3babddefe4c176abd4>`_

.. _linear_regression_intercept:

Note on the handling of the intercept in linear regression models
==================================================================

All least-squares linear regression estimators in scikit-learn solve a centered
version of the problem without intercept and then compute the intercept a
posteriori from that solution.
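
As a minimal sketch of this claim (illustrative only, with random data; the
two-step procedure below mirrors the internal behavior under the assumptions of
dense input and uniform sample weights)::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(50, 3)
    y = rng.randn(50)

    reg = LinearRegression().fit(X, y)

    # Center the data, fit without an intercept, then recover the intercept
    # from the column and target means.
    X_bar, y_bar = X.mean(axis=0), y.mean()
    reg_c = LinearRegression(fit_intercept=False).fit(X - X_bar, y - y_bar)

    assert np.allclose(reg.coef_, reg_c.coef_)
    assert np.isclose(reg.intercept_, y_bar - X_bar @ reg_c.coef_)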

For underdetermined problems (more features than samples, or in the presence of
collinear features), this ensures that the ridge solution converges to the
minimum norm OLS solution when `alpha` tends to zero. Note that to ensure the
continuity of the solution path with respect to `alpha`, we consider a
formulation of the OLS problem that excludes the intercept from the computation
of the norm in the objective function.
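
A hedged numerical sketch of this convergence (illustrative only; the tiny
`alpha` value and the explicit `solver="svd"` choice are assumptions made here
for numerical stability, not requirements of the estimator)::

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(10, 50)   # underdetermined: more features than samples
    y = rng.randn(10)

    # LinearRegression returns the minimum norm solution on the centered data.
    ols = LinearRegression().fit(X, y)

    # Ridge with a vanishing penalty should converge to the same solution.
    ridge = Ridge(alpha=1e-10, solver="svd").fit(X, y)

    assert np.allclose(ridge.coef_, ols.coef_, atol=1e-6)
    assert np.isclose(ridge.intercept_, ols.intercept_, atol=1e-6)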

The following explains why this approach is valid, both in the penalized and
the unpenalized underdetermined cases.

Intercept of the penalized least-squares solution
-------------------------------------------------

Recall the definition of the ridge estimator:

.. math:: \min_{w, w_0} { ||X w + w_0 1_s - y||_2 ^ 2 + \alpha ||w||_2 ^ 2}

Here :math:`1_s` is a vector of size `n_samples` where all entries are 1.
:math:`w_0` is a scalar parameter and is not regularized.

Let :math:`\bar{X}` denote the column vector whose entries are the means of the
columns of :math:`X`, :math:`\bar{y}` the mean value of :math:`y`, and
:math:`X_c` and :math:`y_c` the centered versions of :math:`X` and :math:`y`,
such that:

.. math:: X = X_c + 1_s \bar{X}^{T} \text{ and } y = y_c + \bar{y} 1_s

Optimizing the original ridge objective with intercept is equivalent to solving
the following ridge problem on the centered dataset with no intercept:

.. math:: \hat{w} = \operatorname{arg\,min}_{w} { ||X_c w - y_c||_2 ^ 2 + \alpha ||w||_2 ^ 2}

and subsequently defining the intercept as :math:`\hat{w_0} = \bar{y} -
\bar{X}^{T} \hat{w}`.
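
A small sketch of this equivalence (illustrative only; random data and an
arbitrary `alpha` value chosen here)::

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(30, 5)
    y = rng.randn(30)
    alpha = 1.0

    # Ridge with an (unpenalized) intercept.
    ridge = Ridge(alpha=alpha).fit(X, y)

    # Centered ridge problem without intercept, then recover the intercept
    # as w0_hat = y_bar - X_bar^T w_hat.
    X_bar, y_bar = X.mean(axis=0), y.mean()
    ridge_c = Ridge(alpha=alpha, fit_intercept=False).fit(X - X_bar, y - y_bar)
    w_hat = ridge_c.coef_
    w0_hat = y_bar - X_bar @ w_hat

    assert np.allclose(ridge.coef_, w_hat)
    assert np.isclose(ridge.intercept_, w0_hat)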

To see this, let's rewrite the data-fit term of the original ridge problem as
follows (after rearranging the terms and factoring out :math:`1_s`):

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c + (\bar{X}^{T} w + w_0 - \bar{y}) 1_s ||_2^2

Let's now define the scalar :math:`\bar{r} = \bar{X}^{T} w + w_0 - \bar{y}`. We
can then expand the data-fit term of the original ridge problem as follows:

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c ||_2^2 + ||\bar{r} 1_s ||_2^2 + 2 \bar{r} 1_s^{T} (X_c w - y_c)

If :math:`n` is the number of data points, we have:

.. math:: ||X w + w_0 1_s - y||_2^2 = ||X_c w - y_c ||_2^2 + \bar{r}^2 n + 2 \bar{r} (1_s^{T} X_c w - 1_s^{T} y_c)

Since :math:`X_c` and :math:`y_c` are the centered versions of :math:`X` and
:math:`y`, we have :math:`1_s^{T} X_c = 0` and :math:`1_s^{T} y_c = 0`.
Therefore:

.. math:: ||X w + w_0 1_s - y||_2^2 + \alpha ||w||_2^2 = ||X_c w - y_c ||_2^2 + \alpha ||w||_2^2 + \bar{r}^2 n
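
The algebraic identity above (without the penalty term, which is identical on
both sides) can be checked numerically for arbitrary values of :math:`w` and
:math:`w_0`; this is only a sanity-check sketch with random data::

    import numpy as np

    rng = np.random.RandomState(0)
    n, p = 20, 4
    X = rng.randn(n, p)
    y = rng.randn(n)
    w = rng.randn(p)     # arbitrary coefficients, not a fitted solution
    w0 = rng.randn()

    X_bar, y_bar = X.mean(axis=0), y.mean()
    X_c, y_c = X - X_bar, y - y_bar
    r_bar = X_bar @ w + w0 - y_bar

    lhs = np.sum((X @ w + w0 - y) ** 2)
    rhs = np.sum((X_c @ w - y_c) ** 2) + n * r_bar ** 2
    assert np.isclose(lhs, rhs)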

Let's take the derivative of the above expression with respect to :math:`w_0`
and set it to 0. The first two terms do not depend on :math:`w_0`, therefore we
only need to set the derivative of the last term to 0 to obtain the following
value of the intercept at the minimum:

.. math:: \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}

As a result, at the minimum, :math:`\bar{r} = 0` and therefore:

.. math:: ||X w + \hat{w_0} 1_s - y||_2^2 + \alpha ||w||_2^2 = ||X_c w - y_c ||_2^2 + \alpha ||w||_2^2

This is minimized by :math:`\hat{w}`, the solution of the centered ridge
problem without intercept.

Note that the same argument holds for any other penalized linear regression
estimator (as long as the penalty is such that the solution is unique).
Reviewer note: this explanation is nice and easy to follow. When I do this in a
class, I say that at the optimum the gradient with respect to :math:`w` and
:math:`w_0` should be zero, and when you write the gradient with respect to
:math:`w_0` you directly get :math:`\hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}`.

Intercept of the ordinary least-squares solution
------------------------------------------------

For the OLS estimator (as implemented in :class:`LinearRegression`, or
equivalently for :class:`Ridge` with `alpha=0`), the solution is not unique
when the rank of :math:`X` is lower than `n_features`. In this case,
scikit-learn estimators converge to the minimum norm solution defined as
follows:

.. math:: \hat{w} = \operatorname{arg\,min}_{w, w_0} ||w||_2^2 \text{ s.t. } X w + w_0 1_s = y

Note that the intercept :math:`w_0` does not take part in the computation of
the norm.

Again, this is equivalent to the minimum norm solution of the centered problem
without intercept:

.. math:: \hat{w} = \operatorname{arg\,min}_{w} ||w||_2^2 \text{ s.t. } X_c w = y_c

along with:

.. math:: \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}
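
A minimal sketch of this construction (illustrative only; it relies on the fact
that `numpy.linalg.pinv` applied to the centered data yields the minimum norm
solution of the centered problem, and compares it to the fitted estimator)::

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(10, 30)   # rank-deficient: more features than samples
    y = rng.randn(10)

    reg = LinearRegression().fit(X, y)

    # Minimum norm solution of the centered problem via the pseudo-inverse,
    # then the intercept recovered from the means.
    X_bar, y_bar = X.mean(axis=0), y.mean()
    w_hat = np.linalg.pinv(X - X_bar) @ (y - y_bar)
    w0_hat = y_bar - X_bar @ w_hat

    assert np.allclose(reg.coef_, w_hat)
    assert np.isclose(reg.intercept_, w0_hat)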

TODO: include the proof? If so, where should we put it (in the maintainer notes
somewhere)?

Here is an external draft of the proof:
https://github.com/ogrisel/minimum-norm-ols/blob/main/minimum-norm-ols-intercept.pdf

.. _Logistic_regression:

Logistic regression

Review discussion on the :math:`1_s` notation:

- We should then make it consistent throughout this document.
- I agree. Do we introduce the `1_s` vector everywhere for consistency?
- I would not do it because it is harder to read and most people, I guess,
  will understand without it.
- It could be used in a "mathematical details" section.