ENH change Ridge tol to 1e-4 #24465

Merged — 6 commits merged into scikit-learn:main on Oct 6, 2022

Conversation

@lorentzenchr (Member) commented Sep 18, 2022

Reference Issues/PRs

Closes #19615.

What does this implement/fix? Explain your changes.

This PR changes the default tol of Ridge from 1e-3 to 1e-4, which is the default of many other linear models such as ElasticNet and LogisticRegression.
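
A minimal sketch of the change from a user's perspective (the dataset and alpha are made up for illustration): after this PR, Ridge() fits with tol=1e-4 by default, and the previous behavior remains available by passing tol explicitly.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1_000, n_features=10, random_state=0)

new_default = Ridge(alpha=1.0).fit(X, y)             # tol=1e-4 after this PR
old_behavior = Ridge(alpha=1.0, tol=1e-3).fit(X, y)  # pre-PR default, set explicitly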

Any other comments?

Does this warrant a deprecation cycle? The change can alter model results obtained with default parameters, but only in a beneficial way. The downside is a potentially longer fit time.

@lorentzenchr lorentzenchr changed the title Ridge tol Change Ridge tol to 1e-4 Sep 18, 2022
@lorentzenchr lorentzenchr changed the title Change Ridge tol to 1e-4 ENH change Ridge tol to 1e-4 Sep 18, 2022
@ogrisel (Member) left a comment

I have the feeling that introducing a FutureWarning will be more annoying than beneficial to most users.

I am +0 for this PR in its current state (that is, with the behavior change explicitly documented in both the changelog and the docstrings, but without a FutureWarning).

@lorentzenchr (Member, Author) commented Sep 19, 2022

How to interpret an approval in combination with a "+0"? 😄
My motivation is to close #19615 and I think this PR does it.

@ogrisel (Member) commented Sep 23, 2022

> How to interpret an approval in combination with a "+0"? 😄
> My motivation is to close #19615 and I think this PR does it.

Ok +1 then ;)

@ogrisel (Member) commented Sep 23, 2022

@agramfort any opinion on this and #19615?

@agramfort (Member) commented

tol can mean many different things depending on the stopping criterion employed (gradient norm, gain in the training objective from one update, etc.), so the consistency-among-estimators argument is not very strong to me. That said, 1e-3 seems high. @lorentzenchr, did you benchmark the impact of this on a few datasets in terms of cross-val score and fit time?
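
To illustrate the point, here is a toy sketch (not scikit-learn code; gd_ridge, its parameters, and the data are all made up): the same numeric tol stops very different criteria depending on the rule a solver uses.

import numpy as np

def gd_ridge(X, y, alpha, tol, criterion, lr=1e-4, max_iter=10_000):
    """Toy gradient descent on the ridge objective with two stopping rules."""
    w = np.zeros(X.shape[1])

    def obj(w):
        return np.sum((X @ w - y) ** 2) + alpha * (w @ w)

    prev = obj(w)
    for _ in range(max_iter):
        grad = 2 * (X.T @ (X @ w - y)) + 2 * alpha * w
        w -= lr * grad
        cur = obj(w)
        # Same numeric tol, two very different meanings:
        if criterion == "grad" and np.linalg.norm(grad) < tol:
            break
        if criterion == "gain" and prev - cur < tol:
            break
        prev = cur
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.ones(5)
# With the same tol=1e-4, the "gain" rule stops far earlier than the "grad" rule.
w_grad = gd_ridge(X, y, alpha=1.0, tol=1e-4, criterion="grad")
w_gain = gd_ridge(X, y, alpha=1.0, tol=1e-4, criterion="gain")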

@lorentzenchr (Member, Author) commented Sep 24, 2022

The impact is as follows:

  • auto: see lbfgs, sag, cholesky, sparse_cg
  • svd: tol has no impact
  • cholesky: tol has no impact
  • lsqr: tol is set as atol and btol of lsmr, which specify stopping conditions involving the residual and the norms of the matrix and of the coefficients (see the sketch after this list)
  • sparse_cg: stops once the relative or absolute residual is smaller than tol
  • sag and saga: stop once the change of coef is smaller than tol
  • lbfgs: stops once the (projected) gradient, i.e. the residual, is smaller than tol
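
A minimal sketch of what the lsqr stopping rule means, assuming tol is forwarded unchanged as atol/btol and the penalty enters as damp = sqrt(alpha) (both assumptions about internals, stated here for illustration; the data is made up):

import numpy as np
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(1_000)

alpha, tol = 1.0, 1e-4  # assumed penalty strength; tol as in Ridge
# lsmr solves min ||X w - y||^2 + damp^2 ||w||^2, so damp = sqrt(alpha)
# yields the ridge objective; atol/btol bound quantities built from the
# residual and the norms of the matrix and of the coefficients.
coef = lsmr(X, y, damp=np.sqrt(alpha), atol=tol, btol=tol)[0]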

It may be worth adding that to the docstring 😏

Here is a simple benchmark:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


X, y = make_regression(
    n_samples=100_000,
    n_features=100,
    n_informative=100,
    bias=10,
    random_state=42,
)

def max_abs_diff(a, b):
    # Maximum relative difference of the entries of a w.r.t. b.
    return np.max(np.abs(a / b - 1))

%time svd = Ridge(solver="svd").fit(X, y)

Wall time: 692 ms

%time lsqr_3 = Ridge(solver="lsqr", tol=1e-3).fit(X, y)

Wall time: 125 ms

%time lsqr_4 = Ridge(solver="lsqr", tol=1e-4).fit(X, y)

Wall time: 114 ms

max_abs_diff(lsqr_3.coef_, svd.coef_), max_abs_diff(lsqr_4.coef_, svd.coef_)

(8.287694934638878e-05, 2.798488532462784e-06)

Same game for all solvers:

Solver      time 1e-3   time 1e-4   accuracy 1e-3   accuracy 1e-4
lsmr        125 ms      125 ms      8.29e-05        2.80e-06
sparse_cg   126 ms      116 ms      0.00153         0.00153
sag         5.54 s      7.92 s      0.0341          0.00374
saga        1.57 s      2.11 s      0.0109          0.000807
lbfgs       115 ms      125 ms

I don't understand the results for sparse_cg.

@agramfort (Member) commented

thanks @lorentzenchr

a few remarks:

  • what you call accuracy here is R², and if so the numbers are super low, suggesting that despite n_samples >> n_features it does not learn much
  • the fact that the R² are quite different suggests the solvers do not fully converge with tol=1e-4. The story may be simpler if the problem is easier and R² is closer to 1.
  • testing with n_samples >> n_features is necessary for SAG and SAGA to be decent solvers, but it makes the problem very well posed, with excellent conditioning. The picture could be different with n_features > n_samples.

wdyt?

@lorentzenchr (Member, Author) commented

With accuracy, I meant max_abs_diff, i.e. the maximum relative difference of the coefficients w.r.t. the solution found by svd. It shows that tol=1e-4 comes much closer to the correct values; only sparse_cg seems odd. A sketch of the full comparison follows below.
These results are good evidence that decreasing the default tol from 1e-3 to 1e-4 brings the found solutions closer to the correct values at a comparatively small runtime cost.
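
A minimal sketch of that comparison, reusing X, y, Ridge, and max_abs_diff from the benchmark above (the solver list is illustrative; lbfgs is left out since scikit-learn only enables it together with positive=True):

# Coefficients from the direct svd solver serve as the reference solution.
svd_coef = Ridge(solver="svd").fit(X, y).coef_

for solver in ["lsqr", "sparse_cg", "sag", "saga"]:
    for tol in [1e-3, 1e-4]:
        coef = Ridge(solver=solver, tol=tol).fit(X, y).coef_
        print(f"{solver:<10} tol={tol:g}  max_abs_diff={max_abs_diff(coef, svd_coef):.3g}")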

@ogrisel (Member) commented Sep 26, 2022

> It may be worth adding that to the docstring 😏

+1

@ogrisel (Member) commented Sep 26, 2022

Thanks for the experiments @lorentzenchr. I agree with your conclusions.

@haiatn (Contributor) commented Oct 4, 2022

Good job @lorentzenchr

@agramfort agramfort merged commit ab5ba84 into scikit-learn:main Oct 6, 2022
@agramfort (Member) commented

thanks @lorentzenchr

@lorentzenchr lorentzenchr deleted the ridge_tol branch October 6, 2022 15:45
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Oct 9, 2022
* ENH set default tol of Ridge to 1e-4 instead of 1e-3

* MNT rename internal least squares method in discriminant_analysis

* DOC add versionchanged

* DOC add whats_new entry

* DOC add pr to whatsnew

* DOC better versionchanged message
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2022
Successfully merging this pull request may close these issues.

[RFC] Improve the Ridge solver speed / convergence tradeoff with better solver defaults