[RFC] Improve the Ridge solver speed / convergence tradeoff with better solver defaults #19615
+1 for the plan except the "other ideas". I don't think that adding L-BFGS is necessary, and the QR-based solver also seems like overkill. For 2, you say `lsqr` for the sparse case?
I am not sure what you mean. I edited the description to remove a previous version of item 2 because it was wrong (my `tol` was too lax and that was leading me to wrong conclusions).
I'm also on board with the Proposed plan of actions, and, like @agramfort, don't think that adding lbfgs is necessary. A few further comments:
Remove `sparse_cg`? I would think it's the right thing to do for sparse X, no?
Also in my benchmarks, …
@ogrisel Which benchmark are you referring to?
Sorry, that suggestion of mine was typed too quickly and too late in the evening. I mixed up
On the corresponding LSMR page: …
Interactive tests that I did by tweaking the end of the gist linked in the description.
Regarding adding or not LBFGS as a solver to `Ridge`: if we would like to add support for such an option …
+1
I am not sure if there is a performance drawback (maybe LBFGS would be faster than the CD used in ElasticNet?).
I am sure you can find data for which LBFGS > coordinate descent and vice versa.
@glemaitre @agramfort I opened #19772 for …
I dared to check the 2nd point with #22910. Those are pretty tight tests. What is left to do is maybe decreasing (making tighter) the default `tol`.
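For illustration only (not from the thread): a minimal sketch of how a tighter `tol` changes the `lsqr` solution relative to the exact SVD solution. The dataset, `alpha` and tolerance values are arbitrary assumptions.

```python
# Illustrative sketch: impact of `tol` on the lsqr solver, using the SVD
# solution as reference. Dataset and alpha are arbitrary assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=2_000, n_features=500, noise=1.0, random_state=0)
alpha = 1e-3

ref = Ridge(alpha=alpha, solver="svd").fit(X, y).coef_
loose = Ridge(alpha=alpha, solver="lsqr", tol=1e-3).fit(X, y).coef_   # historical default tol
tight = Ridge(alpha=alpha, solver="lsqr", tol=1e-10).fit(X, y).coef_

print("tol=1e-3 :", np.max(np.abs(loose - ref)))
print("tol=1e-10:", np.max(np.abs(tight - ref)))
```

With the loose `tol=1e-3`, the lsqr coefficients can sit noticeably further from the SVD reference than with a tight tolerance, which is the motivation for tightening the default.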
Let me reopen this issue because I believe that the default choice of parameters for the solver of the `Ridge` estimator can still be improved.
Based on complementary experiments conducted in #22855 (comment), it seems that we should probably use …
There were also other interesting potential improvements proposed by @lorentzenchr in #19615 (comment).
Related issue when …
This statement is for the unpenalized case and might apply for tiny alpha as well. I think `lsqr`/`lsmr` could qualify as a good default, but we should also do some experiments for `n_features > n_samples`.
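To make the `lsqr`/`lsmr` suggestion concrete, here is a small sketch (added for illustration, not from the thread) of the kind of experiment that could be run for `n_features > n_samples`. Both SciPy routines accept a `damp` parameter, so `damp = sqrt(alpha)` solves the ridge problem directly; the dataset and `alpha` below are arbitrary assumptions.

```python
# Illustrative sketch: solving min ||Xw - y||^2 + alpha ||w||^2 with SciPy's
# LSQR and LSMR via damp = sqrt(alpha), on a wide (n_features > n_samples)
# problem, compared against the exact SVD-based Ridge solution.
import numpy as np
from scipy.sparse.linalg import lsqr, lsmr
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=5_000, noise=1.0, random_state=0)
alpha = 1.0

w_lsqr = lsqr(X, y, damp=np.sqrt(alpha), atol=1e-10, btol=1e-10)[0]
w_lsmr = lsmr(X, y, damp=np.sqrt(alpha), atol=1e-10, btol=1e-10)[0]
w_svd = Ridge(alpha=alpha, fit_intercept=False, solver="svd").fit(X, y).coef_

print(np.max(np.abs(w_lsqr - w_svd)), np.max(np.abs(w_lsmr - w_svd)))
```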
I did some follow-up work for the discussions on potential problems for `Ridge(normalize=False)` in #19426 and #17444 in the following gist: https://gist.github.com/ogrisel/cf7ac8bca6725ec02d4cc8d03ce096a7

This gist compares the results of a minimal implementation of ridge with L-BFGS by @agramfort against scikit-learn's `Ridge(normalize=False, solver="cholesky")`, `Ridge(normalize=False, solver="lsqr")`, `Ridge(normalize=False, solver="svd")` and other solvers.

The main conclusion is that those models seem to converge to the same solution if we set a low value of `tol` and if we are not too strict on the comparison tolerance: see how I defined `atol` in the gist. I cannot get convergence to machine precision (only around 1e-6/1e-7 in float64), but maybe this is expected. I am not so sure: I tried to tweak `pgtol` on top of `factr` but it did not improve.

On large-ish problems, L-BFGS can be significantly faster than our always accurate SVD solver (but not always, it depends on the regularization), yet never faster than the `lsqr` solver (based on `scipy.sparse.linalg.lsqr`). The cholesky solver (our default for dense data) is sometimes faster and sometimes slower than lsqr and L-BFGS (depending on the regularization) but always faster than SVD (at least with high regularization). The sag/saga solvers are not necessarily as fast as they could be: I noticed that they do not benefit from multi-threading on a multicore machine, so that might be the reason. I have never seen them run faster than lsqr. sag/saga can be significantly faster than L-BFGS despite using fewer cores, on problems with `n_samples=5e3` / `n_features=1e3` and not too much regularization (alpha=1e-6).

There are still problems with `normalize=True` as reported in #17444, but maybe we don't care so much because we are going to deprecate this option: see #17772 by @maikia that I plan to review soon.
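For readers without access to the gist, the comparison it performs is roughly of the following shape: a hypothetical minimal L-BFGS ridge (via `scipy.optimize.minimize`) checked against scikit-learn's `svd` and `lsqr` solvers. This is a sketch under assumed data and tolerances, not the gist itself.

```python
# Hypothetical sketch: minimal L-BFGS ridge vs. scikit-learn solvers
# (illustrative only; the actual gist linked above may differ).
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=5_000, n_features=1_000, noise=1.0, random_state=0)
alpha = 1e-6

def objective_and_grad(w):
    # 1/2 ||Xw - y||^2 + alpha/2 ||w||^2: same minimizer as Ridge's
    # ||Xw - y||^2 + alpha ||w||^2, up to a constant factor.
    residual = X @ w - y
    obj = 0.5 * residual @ residual + 0.5 * alpha * w @ w
    grad = X.T @ residual + alpha * w
    return obj, grad

# ftol/gtol are the minimize() counterparts of the factr/pgtol knobs
# mentioned above.
res = minimize(objective_and_grad, np.zeros(X.shape[1]), jac=True,
               method="L-BFGS-B",
               options={"maxiter": 1000, "ftol": 1e-12, "gtol": 1e-8})

coef_lbfgs = res.x
coef_svd = Ridge(alpha=alpha, fit_intercept=False, solver="svd").fit(X, y).coef_
coef_lsqr = Ridge(alpha=alpha, fit_intercept=False, solver="lsqr", tol=1e-12).fit(X, y).coef_

# Agreement is typically only ~1e-6/1e-7 in float64, not machine precision.
print(np.max(np.abs(coef_lbfgs - coef_svd)), np.max(np.abs(coef_lsqr - coef_svd)))
```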
Proposed plan of actions:

- ~~deprecate `normalize=True` in DEP Deprecate 'normalize' in ridge models #17772 and therefore put Ridge(normalize=True) fails check_sample_weights_invariance #17444 on hold (or less of a priority)~~ actually no, see: [MRG] FIX sample_weight invariance for linear models #19616.
- add tests that check that the solvers converge to the same solution when `tol` is low, e.g. based on the `check_min` of my gist (a hypothetical sketch of such a check is shown after the lists below). Tight tests (but without checking the minimum explicitly): TST tight and clean tests for Ridge #22910
- evaluate `lsqr` more systematically: it seems to be very promising, both in terms of speed and accuracy of the solution (according to my `check_min` thingy).
- reconsider whether the default `tol=1e-3` is safe. It's probably very solver dependent, but I doubt that it's a good idea for the fast `lsqr` solver for instance (to be benchmarked if we do the above evaluation). ENH change Ridge tol to 1e-4 #24465

Other ideas:

- add L-BFGS as a solver? Not sure it's worth it compared to `lsqr` or `cholesky` though: they seem to always be more accurate than l-bfgs and faster (especially `lsqr`). And l-bfgs can be very slow when alpha is small.
- deprecate `sparse_cg` in favor of the `lsqr` solver?
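The `check_min` helper itself lives in the gist and is not reproduced here. Below is a hypothetical sketch of what such a minimum check could look like: perturb the fitted coefficients and assert that the ridge objective does not decrease, so the solver really ended near a minimum. All names, datasets and tolerances are assumptions, not the gist's actual code.

```python
# Hypothetical check_min-style assertion (the actual helper in the gist may
# differ): the ridge objective at the solver's solution should not be larger
# than at slightly perturbed coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

def ridge_objective(X, y, w, alpha):
    return np.sum((X @ w - y) ** 2) + alpha * np.sum(w ** 2)

def check_min(X, y, coef, alpha, n_trials=30, scale=1e-4, rtol=1e-10, seed=0):
    rng = np.random.RandomState(seed)
    obj = ridge_objective(X, y, coef, alpha)
    for _ in range(n_trials):
        perturbed = coef + scale * rng.randn(coef.shape[0])
        # No perturbation should reach a lower objective (up to rtol).
        assert obj <= ridge_objective(X, y, perturbed, alpha) * (1 + rtol)

X, y = make_regression(n_samples=1_000, n_features=300, noise=1.0, random_state=0)
alpha = 1.0
for solver in ["svd", "cholesky", "lsqr", "sparse_cg"]:
    model = Ridge(alpha=alpha, fit_intercept=False, solver=solver, tol=1e-10).fit(X, y)
    check_min(X, y, model.coef_, alpha)
```

Such a perturbation-based check only verifies that the solution is a near-minimizer; it does not pin the accuracy to machine precision, which matches the observation above that agreement between solvers is around 1e-6/1e-7 in float64.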