FIX the tests for convergence to the minimum norm solution of unpenalized ridge / OLS #25948
Conversation
I did not mean to approve this PR, only to comment on it.
…tivation in the intro of the section
```diff
@@ -122,7 +128,7 @@ its ``coef_`` member::
     >>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
     Ridge(alpha=0.5)
     >>> reg.coef_
-    array([0.34545455, 0.34545455])
+    array([0.34545..., 0.34545...])
```
Note: this was not actually needed to make the doctest pass: this PR does not change the Ridge implementation. But I think it makes the docstring more readable and consistent with the 5 digits we display for the intercept.
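(For anyone curious why the ellipsized output still passes: a minimal standalone sketch, assuming doctest's `ELLIPSIS` option flag, which scikit-learn's doctest configuration enables.)

```python
import doctest

def demo():
    """
    >>> from sklearn.linear_model import Ridge
    >>> reg = Ridge(alpha=0.5).fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
    >>> reg.coef_
    array([0.34545..., 0.34545...])
    """

# With ELLIPSIS enabled, "0.34545..." matches the full repr, so the example
# no longer depends on how many trailing digits NumPy chooses to print.
doctest.run_docstring_examples(demo, globals(), optionflags=doctest.ELLIPSIS)
```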
@agramfort @lorentzenchr @jjerphan I think the clarification of the motivations for the centering in I still need to think about how to update the ridge tests to be consistent with this for all cases. I probably won't have time to do it today though.
…ble values of global_random_seed
In my recently terse style, but with much implicit appreciation, just a few points:
Can you be more precise? EDIT: indeed there is a problem where I wrote "we recover:". There could be other values of w_0 that cause the sample-wise sum equality to hold.
I agree. We can move them to another place (once fixed). But I wanted to benefit from the LaTeX rendering. An ASCII version would be hard to read in my opinion. Maybe we could move that to a maintainer-oriented section of the doc.
I have a proof sketch for the under-determined case using the method of Lagrange multipliers (for the minimization problem that does not include the intercept in the norm).
OK, I pushed my proof for the under-determined case here: https://github.com/ogrisel/minimum-norm-ols/blob/main/minimum-norm-ols-intercept.pdf /cc @lorentzenchr @agramfort @jjerphan I will update the reStructuredText versions of the notes based on your feedback.
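For quick reference, a condensed sketch of the constrained formulation behind the linked PDF (assuming the interpolation constraint is feasible, i.e. the under-determined case; here \(\mathbf{1}\) is the all-ones vector):

```latex
% Minimum norm OLS with the intercept excluded from the norm:
\min_{w, w_0} \tfrac{1}{2} \|w\|_2^2
\quad \text{subject to} \quad X w + w_0 \mathbf{1} = y

% Lagrangian and stationarity conditions:
\mathcal{L}(w, w_0, \lambda)
  = \tfrac{1}{2} \|w\|_2^2 + \lambda^\top (y - X w - w_0 \mathbf{1}),
\qquad
\partial_w \mathcal{L} = 0 \Rightarrow w = X^\top \lambda,
\quad
\partial_{w_0} \mathcal{L} = 0 \Rightarrow \mathbf{1}^\top \lambda = 0

% Since 1^T lambda = 0, we have w = (X - 1 \bar{X}^\top)^\top \lambda:
% the solution lies in the row space of the *centered* data, which is
% the link to the centering trick used in the implementation.
```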
…external proof for now
Thank you for writing this down, @ogrisel. To me, the derivation is correct, and we just need an argument showing that
The only user-impacting thing is that we would now guarantee that
Well, standard lectures do not cover the case where we do not penalize the intercept or include it in the computation of the minimum norm. Making it explicit as I did in my linked PDF is boring but not that easy to check, and we got the initial version of the fixture wrong because of this.
Indeed. The solution would still be valid with the pseudo-inverse instead of the inverse (this is what I have to do in the fixture to make it work with any value of
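To illustrate that point, a minimal NumPy sketch (a hypothetical helper, not the actual fixture code) of the minimum norm OLS solution with an unpenalized intercept; the pseudo-inverse makes it work in both the over- and under-determined cases:

```python
import numpy as np

def min_norm_ols(X, y):
    """Minimum norm OLS coef and unpenalized intercept (sketch)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    # Least squares on centered data; pinv selects the minimum norm
    # solution when the problem is under-determined, whereas inverting
    # X^T X would fail on a rank-deficient design.
    coef = np.linalg.pinv(X - X_mean) @ (y - y_mean)
    # Recover the intercept analytically: w_0 = y_bar - x_bar^T w.
    intercept = y_mean - X_mean @ coef
    return coef, intercept
```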
The minimizer of the objective function is a critical point, but the converse is not generally true. Here we proceed with the converse to derive the solution, so we still need an argument to show that there is an equivalence here.
I think the equivalence comes from the Lagrange multiplier theorem:
```rst
zero by setting :math:`\hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}`.

Note that the same argument holds for any other penalized linear regression
estimator (as long as the penalty is such that the solution is unique).
```
This explanation is nice and easy to follow. When I do this in a class, I say that at the optimum the gradient w.r.t. w and w_0 should be zero, and when you write the gradient w.r.t. w_0 you directly get \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}.
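Spelling that derivation out in the thread's notation (with \(\mathbf{1}\) the all-ones vector and \(n\) the number of samples):

```latex
\frac{\partial}{\partial w_0} \|X w + w_0 \mathbf{1} - y\|_2^2
  = 2\, \mathbf{1}^\top (X w + w_0 \mathbf{1} - y) = 0
\;\Longrightarrow\;
\hat{w}_0 = \bar{y} - \bar{X}^\top \hat{w},
\qquad
\bar{y} = \tfrac{1}{n} \mathbf{1}^\top y,
\quad
\bar{X} = \tfrac{1}{n} X^\top \mathbf{1}
```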
Co-authored-by: Alexandre Gramfort <alexandre.gramfort@m4x.org>
Just for the record, I still plan to follow up on this. Thanks @agramfort for the review. Do you have any opinion on how to document the under-determined case? Do you think I should port the content of https://github.com/ogrisel/minimum-norm-ols/blob/main/minimum-norm-ols-intercept.pdf into our Sphinx doc (assuming you agree with the content)?
@ogrisel I also plan to review this PR. From the glance I took, my main concern is the proposal to not include the intercept in the norm.
```diff
-.. math:: \min_{w} || X w - y||_2^2
+.. math:: \min_{w, w_0} || X w + w_0 - y||_2^2
```
We should then make it consistent throughout this document.
I agree. Do we introduce the 1_s vector everywhere for consistency?
I would not do it because it is harder to read and most people, I guess, will understand without it.
It could be used in a "mathematical details" section.
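For instance, such a section could make the ones vector explicit (a sketch, writing \(\mathbf{1}_n\) for the all-ones vector of length \(n\)):

```latex
\min_{w, w_0} \| X w + w_0 \mathbf{1}_n - y \|_2^2,
\qquad \mathbf{1}_n = (1, \dots, 1)^\top \in \mathbb{R}^n
```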
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@lorentzenchr @agramfort I finally took the time to conduct a numerical study of the impact of the problem formulation: https://gist.github.com/ogrisel/18bbf02128a3890b03534d52b7fc8bd0#file-minimum_norm_ridge_limit-ipynb The current formulation (ridge on centered data followed by analytical intercept computation) seems to be beneficial for several reasons detailed in the notebook. The main problems with what I named "type b" are:
We could improve the notebook to check that those conclusions hold when changing the solver. EDIT: I updated the linked notebook to add a ridge solver based on
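As a rough standalone illustration of the two formulations compared in the notebook (naming assumed from the thread: "type a" = solving on centered data plus analytical intercept, "type b" = intercept as an extra feature included in the minimized norm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # under-determined: n_samples < n_features
y = rng.normal(size=5)

# "type a": minimum norm solution on centered data + analytical intercept.
X_mean = X.mean(axis=0)
coef_a = np.linalg.pinv(X - X_mean) @ (y - y.mean())
intercept_a = y.mean() - X_mean @ coef_a

# "type b": intercept as an extra constant column, included in the norm.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_b = np.linalg.pinv(X_aug) @ y
coef_b, intercept_b = w_b[:-1], w_b[-1]

# Both interpolate the training data, but the coefficients differ:
print(np.allclose(X @ coef_a + intercept_a, y))  # True
print(np.allclose(X @ coef_b + intercept_b, y))  # True
print(np.allclose(coef_a, coef_b))               # False in general
```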
@ogrisel I'll have a deeper look at your great analysis, but I need some time. You could even consider writing a paper about it, or a blog post.
Thanks. I won't have the time to dig deeper next week, but ideally we could try to understand better why/how the different solvers break on extreme values of the regularization parameter. We could also probably find out analytically why type b ridge (without intercept penalization) cannot reach the type b OLS (with intercept in the norm). We could do that either generally (for any X, y training set) or for some degenerate cases, such as a dataset with a single data point. In the shorter term, I would like to first finalize the updates of this PR to make sure that all solvers in scikit-learn behave correctly using the centered ridge formulation.
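For the single data point case, a closed-form check is easy (a worked example for a sample \((x_1, y_1)\), not taken from the notebook):

```latex
% Centered formulation ("type a"): the centered design is zero, so the
% minimum norm coef vanishes and the intercept absorbs the target:
\hat{w} = 0, \qquad \hat{w}_0 = y_1

% Intercept-in-norm formulation ("type b"): the minimum norm solution of
% the single constraint x_1^\top w + w_0 = y_1 is
(\hat{w}, \hat{w}_0) = \frac{y_1}{\|x_1\|_2^2 + 1} \, (x_1, 1)
```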
@ogrisel https://gist.github.com/ogrisel/18bbf02128a3890b03534d52b7fc8bd0#file-minimum_norm_ridge_limit-ipynb has a bug. For I additionally added Fix + additional solver in https://github.com/lorentzenchr/notebooks/blob/master/minimum_norm_ridge_limit.ipynb. My summary
@lorentzenchr I updated my copy of the notebook to include your fix + the For the latter, the results are very similar to what we observe with I plan to resume the work on updating the tests in this PR and making them pass for all solvers and all data shapes using the type "a" formulation as reference solution for the minimum norm OLS problem. Some related parallel fixes for
Fixes: #22947.
Related to: #22910.
The existing test assumes that we should recover the minimum norm solution to the OLS problem where the intercept fitting is implemented by adding an extra constant feature and its matching coef.
However, this does not hold. In particular, the intercept component should not contribute to the coef norm in the least norm solution, as explained in the new note added to the documentation. Furthermore, this justifies our pre-centering strategy in `_preprocess_data` and the generic `_set_intercept` in all linear regression models with a least squares data-fit term, whatever the regularization term.

TODO:

- update the `ols_ridge_dataset` fixture accordingly and cross-link to the doc;
- check all combinations of data shape and intercept fitting:
  - `wide` and `fit_intercept=True`
  - `wide` and `fit_intercept=False`
  - `long` and `fit_intercept=True`
  - `long` and `fit_intercept=False`
- run the tests with all admissible values of `global_random_seed` on the CI;
- fix the `"cholesky"` failures observed when running the tests with all admissible values for `global_random_seed`;
- document `_preprocess_data` and `_set_intercept`, because this trick was not trivial to me and it took me a while to fully grasp all its implications;
- check that `coef_` converges to the minimum norm solution for OLS when alpha -> 0 for all solvers (see the sketch below).
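For reference, a minimal sketch of the convergence being checked (my own illustration, not the actual test code; the achievable tolerance depends on the solver and its `tol`):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 50))  # under-determined problem
y = rng.normal(size=10)

# Reference: minimum norm OLS solution with unpenalized intercept,
# computed via centering and the pseudo-inverse.
coef_ref = np.linalg.pinv(X - X.mean(axis=0)) @ (y - y.mean())

# Ridge with alpha -> 0 should approach the reference solution.
ridge = Ridge(alpha=1e-10, fit_intercept=True, solver="lsqr", tol=1e-12)
ridge.fit(X, y)
print(np.max(np.abs(ridge.coef_ - coef_ref)))  # expected to be small
```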