
FIX the tests for convergence to the minimum norm solution of unpenalized ridge / OLS #25948


Draft · wants to merge 24 commits into base: main

Conversation

@ogrisel (Member) commented Mar 22, 2023

Fixes: #22947.
Related to: #22910.

The existing test assumes that we should recover the minimum norm solution to the OLS problem where the intercept fitting is implemented by adding an extra constant feature and its matching coef.

However, this does not hold: in particular, the intercept component should not contribute to the coef_ norm in the minimum norm solution, as explained in the new note added to the documentation. This also justifies our pre-centering strategy in _preprocess_data and the generic _set_intercept used by all linear regression models with a least squares data-fit term, whatever the regularization term.
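
For illustration, here is a minimal sketch of that trick (hypothetical, simplified code; the real _preprocess_data / _set_intercept also handle sample weights, sparse input, copying, etc.):

```python
import numpy as np

def fit_with_unpenalized_intercept(X, y, solve):
    """Fit a least-squares-type model with an unpenalized intercept.

    `solve` is any solver for the problem *without* intercept, e.g. an
    OLS or ridge solver returning a coef vector.
    """
    X_mean = X.mean(axis=0)
    y_mean = y.mean()
    # Fit on centered data: the optimal (unpenalized) intercept of the
    # original problem is absorbed by the centering.
    coef = solve(X - X_mean, y - y_mean)
    # Recover the intercept analytically afterwards.
    intercept = y_mean - X_mean @ coef
    return coef, intercept

# Minimum norm OLS solution (wide / underdetermined case) via the pseudo-inverse:
# coef, intercept = fit_with_unpenalized_intercept(
#     X, y, lambda Xc, yc: np.linalg.pinv(Xc) @ yc
# )
```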

TODO:

  • document why solving a penalized linear regression with an intercept is equivalent to fitting on centered data without an intercept;
    • find the correct derivation for the underdetermined case
    • decide where to put this (in the doc or link to an external resource?)
    • reorganize the doc to leverage one or more mathematical details sections
  • update the ols_ridge_dataset fixture accordingly and cross-link to the doc;
    • update the underdetermined case (wide)
    • update the overdetermined case (long)?
  • make sure the updated tests pass for all ridge solvers;
    • for wide and fit_intercept=True
    • for wide and fit_intercept=False
    • for long and fit_intercept=True
    • for long and fit_intercept=False
    • run the fixture tests with all global_random_seed on the CI
    • investigate and fix the rare "cholesky" failures observed when running the tests with all admissible values for global_random_seed
  • add inline comments to _preprocess_data and _set_intercept because this trick was not trivial to me and it took me a while to fully grasp all its implications;
  • add a new test to check that the Ridge solution converges to the minimum coef_ norm solution for OLS when alpha -> 0 for all solvers (a rough sketch is shown after this list).
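
For the last item, a rough, hypothetical sketch of what such a test could look like (the reference solution uses the pseudo-inverse of the centered data; the solver list, alpha value and tolerances are illustrative and would need per-solver tuning):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.linear_model import Ridge

def test_ridge_alpha_to_zero_matches_minimum_norm_ols(global_random_seed=0):
    # Wide (underdetermined) problem: more features than samples.
    rng = np.random.RandomState(global_random_seed)
    X, y = rng.randn(10, 30), rng.randn(10)

    # Reference: minimum ||coef_||_2 OLS solution with an unpenalized
    # intercept, computed on centered data via the pseudo-inverse.
    X_mean, y_mean = X.mean(axis=0), y.mean()
    coef_ref = np.linalg.pinv(X - X_mean) @ (y - y_mean)
    intercept_ref = y_mean - X_mean @ coef_ref

    for solver in ["svd", "cholesky", "lsqr", "sparse_cg"]:
        ridge = Ridge(alpha=1e-10, solver=solver, fit_intercept=True, tol=1e-12)
        ridge.fit(X, y)
        assert_allclose(ridge.coef_, coef_ref, atol=1e-6)
        assert_allclose(ridge.intercept_, intercept_ref, atol=1e-6)
```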

@lorentzenchr lorentzenchr self-requested a review March 23, 2023 12:54
@lorentzenchr lorentzenchr changed the title Fix least norm solution to unpenalized ridge / OLS when fit_intercept=True FIX minimum norm solution of unpenalized ridge / OLS when fit_intercept=True Mar 23, 2023
@jjerphan previously approved these changes Mar 24, 2023
@jjerphan dismissed their stale review March 24, 2023 08:48

I did not mean to approve this PR, only to comment on it.

@@ -122,7 +128,7 @@ its ``coef_`` member::
  >>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
  Ridge(alpha=0.5)
  >>> reg.coef_
- array([0.34545455, 0.34545455])
+ array([0.34545..., 0.34545...])
@ogrisel (Member Author) Mar 24, 2023


Note: this was not actually needed to make the doctest pass: this PR does not change the Ridge implementation. But I think it makes the docstring more readable and consistent with the 5 digits we display for the intercept.

@ogrisel (Member Author) commented Mar 24, 2023

@agramfort @lorentzenchr @jjerphan I think the clarification of the motivations for the centering in _preprocess_data and the post-hoc intercept computation in _set_intercept is ready to review. Here are the rendered docs:

I still need to think about how to update the ridge tests to be consistent with this for all cases. I probably won't have time to do it today though.

@lorentzenchr (Member)

In my recently terse style, but with much implicit appreciation, just a few points:

  • I think it does not solve #22947 (BUG unpenalized Ridge does not give minimum norm solution) because the math for centering with n_features > n_samples does not work out.
  • These notes only concern the implementation but should not concern a user. Should we really place a solution to a standard exercise of a stats/ml lecture in our user guide?

@ogrisel (Member Author) commented Mar 24, 2023

> I think it does not solve #22947 because the math for centering with n_features > n_samples does not work out.

Can you be more precise?

EDIT: indeed there is a problem where I wrote "we recover:". There could be other values of w_0 that cause the sample-wise sum equality to hold.

> These notes only concern the implementation but should not concern a user. Should we really place a solution to a standard exercise of a stats/ml lecture in our user guide?

I agree. We can move them to another place (once fixed). But I wanted to benefit from the LaTeX rendering; an ASCII version would be hard to read in my opinion.

Maybe we could move that to a maintainer-oriented section of the doc.

@ogrisel (Member Author) commented Mar 25, 2023

I have a proof sketch for the under-determined case using the method of Lagrange multipliers (for the minimization problem that does not include w_0 in the computation of the norm). I will try to finalize it on Monday.
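
For reference, a condensed restatement of that argument (my own compression, assuming the constraint is feasible, i.e. the underdetermined case):

$$\min_{w, w_0} \|w\|_2^2 \quad \text{s.t.} \quad X w + w_0 \mathbf{1} = y$$

The Lagrangian is $\mathcal{L}(w, w_0, \lambda) = \|w\|_2^2 + \lambda^\top (X w + w_0 \mathbf{1} - y)$ and stationarity gives

$$2 w + X^\top \lambda = 0, \qquad \mathbf{1}^\top \lambda = 0, \qquad X w + w_0 \mathbf{1} = y.$$

Since $\mathbf{1}^\top \lambda = 0$, we have $X^\top \lambda = X_c^\top \lambda$, so $w$ lies in the row space of the centered data $X_c$. Averaging the constraint over the samples gives $w_0 = \bar{y} - \bar{X}^\top w$, and subtracting that average back from the constraint yields $X_c w = y_c$, hence $w = X_c^\top (X_c X_c^\top)^{+} y_c$ (with the pseudo-inverse, for the case where $X_c X_c^\top$ is singular, as discussed below).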

@ogrisel (Member Author) commented Mar 27, 2023

Ok I pushed my proof for the underdetermined case here:

https://github.com/ogrisel/minimum-norm-ols/blob/main/minimum-norm-ols-intercept.pdf

/cc @lorentzenchr @agramfort @jjerphan

I will update the restructured text versions of the notes based on your feedback.

@jjerphan (Member) commented Mar 27, 2023

Thank you for writing this down, @ogrisel.

To me, the derivation is correct, and we just need an argument showing that $X_c X_c^\top$ is invertible.

(Edit: also, since the objective is convex, the minimizer is necessarily the critical point and we can directly proceed with equivalences.)

@ogrisel (Member Author) commented Mar 27, 2023

> These notes only concern the implementation but should not concern a user.

The only user-impacting thing is that we would now guarantee that LinearRegression(fit_intercept=True).fit(X, y).coef_ is the limit of Ridge(fit_intercept=True, alpha=alpha).fit(X, y).coef_ when alpha goes to zero, which was never made explicit before. But we can just write it in the user guide and/or the docstring of Ridge's alpha and LinearRegression's "See Also" section.

> Should we really place a solution to a standard exercise of a stats/ml lecture in our user guide?

Well, standard lectures do not cover the case where we neither penalize the intercept nor include it in the computation of the minimum norm. Making this explicit, as I did in the linked PDF, is boring but not that easy to check, and we got the initial version of the fixture wrong because of this.

@ogrisel (Member Author) commented Mar 27, 2023

> To me, the derivation is correct, and we just need an argument showing that $X_c X_c^\top$ is invertible.

Indeed. The solution would still be valid with the pseudo-inverse instead of the inverse (this is what I have to do in the fixture to make it work with any value of global_random_seed) but I am too lazy to prove it :)

@jjerphan (Member)

The minimizer of the objective function is a critical point, but the converse is not true generally. Here we proceed with the converse to derive the solution, but we still need an argument to show that there's an equivalence here.

@ogrisel (Member Author) commented Mar 27, 2023

> The minimizer of the objective function is a critical point, but the converse is not true generally. Here we proceed with the converse to derive the solution, but we still need an argument to show that there's an equivalence here.

I think the equivalence comes from the Lagrange multiplier theorem:

https://en.wikipedia.org/wiki/Lagrange_multiplier#Statement

zero by setting :math:`\hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}`.

Note that the same argument holds for any other penalized linear regression
estimator (as long as the penalty is such that the solution is unique).
Member

this explanation is nice and easy to follow.

When I do this in a class, I say that at the optimum the gradient w.r.t. w and w_0 should be zero, and when you write out the gradient w.r.t. w_0 you directly get \hat{w_0} = \bar{y} - \bar{X}^{T} \hat{w}.

Co-authored-by: Alexandre Gramfort <alexandre.gramfort@m4x.org>
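
Written out, that gradient computation (for a data-fit term of the form $\|X w + w_0 \mathbf{1} - y\|_2^2$ plus any penalty that does not involve $w_0$):

$$\frac{\partial}{\partial w_0} \|X w + w_0 \mathbf{1} - y\|_2^2 = 2\,\mathbf{1}^\top (X w + w_0 \mathbf{1} - y) = 2 n \left(\bar{X}^\top w + w_0 - \bar{y}\right) = 0 \;\Longrightarrow\; \hat{w_0} = \bar{y} - \bar{X}^\top \hat{w}.$$
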
@ogrisel (Member Author) commented Apr 14, 2023

Just for the record, I still plan to follow up on this. Thanks @agramfort for the review. Do you have any opinion on how to document the under-determined case?

Do you think I should port the content of https://github.com/ogrisel/minimum-norm-ols/blob/main/minimum-norm-ols-intercept.pdf into our sphinx doc (assuming you agree with the content)?

@lorentzenchr (Member)
@ogrisel I also plan to review this PR. From the glance I took, my main concern is the proposal to not include the intercept in the norm.


-.. math:: \min_{w} || X w - y||_2^2
+.. math:: \min_{w, w_0} || X w + w_0 - y||_2^2
Member

We should then make it consistent throughout this document.

@ogrisel (Member Author)

I agree. Do we introduce the 1_s vector everywhere for consistency?

Member

I would not do it because it is harder to read and most people, I guess, will understand without it.

Member

It could be used in a "mathematical details" section.

@ogrisel (Member Author) commented Apr 21, 2023

@lorentzenchr @agramfort I finally took the time to conduct a numerical study of the impact of the problem formulation:

https://gist.github.com/ogrisel/18bbf02128a3890b03534d52b7fc8bd0#file-minimum_norm_ridge_limit-ipynb

[figure from the linked notebook]

The current formulation (ridge on centered data followed by an analytical intercept computation) seems to be beneficial for several reasons detailed in the notebook. The main problems with what I named "type b" are:

  • the intercept estimate is numerically discontinuous when changing alpha to very low or very large values, at least when keeping the rcond singular value threshold of the underlying SVD-based solver constant;
  • for high regularization values, the intercept is wrong: it does not converge to y_train.mean() as expected;
  • I don't see how it would be possible to get a continuous intercept when alpha goes to zero if the intercept is included in the computation of the minimum norm solution but not penalized in the ridge problem, even if we had not fixed the rcond parameter.

We could improve the notebook to check that those conclusions hold when changing the solver.

EDIT: I updated the linked notebook to add a ridge solver based on lstsq instead of pinv and the conclusions still hold (although the lstsq solver seems slightly less stable for a reason I do not understand).

@lorentzenchr (Member)

@ogrisel I'll have a deeper look at your great analysis, but I need some time. You could even consider writing a paper or a blog post about it.

@ogrisel (Member Author) commented Apr 22, 2023

Thanks. I won't have the time to dig deeper next week, but ideally we could try to better understand why/how the different solvers break on extreme values of the regularization parameter. We could also probably find out analytically why type b ridge (without intercept penalization) cannot reach type b OLS (with the intercept in the norm). We could do that either generally (for any X, y training set) or for some degenerate case such as a single data point dataset: X = [[1.]], y = [1.0].

In the shorter term, I would first like to finalize the updates of this PR to make sure that all solvers in scikit-learn behave correctly using the centered ridge formulation.

@lorentzenchr (Member) commented Apr 29, 2023

@ogrisel https://gist.github.com/ogrisel/18bbf02128a3890b03534d52b7fc8bd0#file-minimum_norm_ridge_limit-ipynb has a bug. For lstsq, one needs to take the square root of alpha. It then never fails and is faster than pinv, at least for smaller alpha.

I additionally added np.linalg.solve. It fails for alpha=0, but except for this case, it is as stable (well, X here is well behaved) as pinv and faster.

Fix + additional solver in https://github.com/lorentzenchr/notebooks/blob/master/minimum_norm_ridge_limit.ipynb.

My summary: I find the higher test R2 for the centered variant ("a"), in addition to the smoother behaviour of the intercept as a function of alpha, somewhat striking.
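
For context, a minimal sketch of the augmentation trick behind the sqrt(alpha) fix (not the notebook's exact code): ridge on the centered data can be solved with lstsq by stacking $\sqrt{\alpha} I$ below $X_c$.

```python
import numpy as np

def ridge_lstsq(X_centered, y_centered, alpha):
    # Solve min_w ||X_c w - y_c||^2 + alpha * ||w||^2 as a plain least
    # squares problem on the augmented system [X_c; sqrt(alpha) * I].
    n_features = X_centered.shape[1]
    A = np.vstack([X_centered, np.sqrt(alpha) * np.eye(n_features)])
    b = np.concatenate([y_centered, np.zeros(n_features)])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef
```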

github-actions bot commented Oct 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 036552b. Link to the linter CI: here

@ogrisel (Member Author) commented Oct 22, 2024

@lorentzenchr I updated my copy of the notebook to include your fix and the numpy.linalg.solve solver, and further added, for the lsqr solver, type "a" and type "b" solutions of the minimum norm OLS and ridge problems to the analysis.

For the latter, the results are very similar to what we observe with lstsq: for the type "b" formulation, we observe discontinuous (unstable) solution paths both for low and high values of alpha, which are in turn detrimental to the test R2 performance. Type "a" is always stable, whatever the choice of the solver, except numpy.linalg.solve for small alpha values, but this is expected (as per its docstring).

I plan to resume the work on updating the tests in this PR and making them pass for all solvers and all data shapes using the type "a" formulation as reference solution for the minimum norm OLS problem.

Some related parallel fixes for LinearRegression:

@ogrisel ogrisel changed the title FIX minimum norm solution of unpenalized ridge / OLS when fit_intercept=True FIX the tests for convergence to the minimum norm solution of unpenalized ridge / OLS Oct 23, 2024
Projects: In Progress

Successfully merging this pull request may close these issues:

  • BUG unpenalized Ridge does not give minimum norm solution (#22947)

4 participants