[RFC] Improve the Ridge solver speed / convergence tradeoff with better solver defaults #19615

Open · 3 of 5 tasks
ogrisel opened this issue Mar 4, 2021 · 15 comments · Fixed by #24465

@ogrisel (Member) commented Mar 4, 2021

I did some follow-up work on the discussions about potential problems with Ridge(normalize=False) in #19426 and #17444 in the following gist:

https://gist.github.com/ogrisel/cf7ac8bca6725ec02d4cc8d03ce096a7

The gist compares a minimal L-BFGS implementation of ridge by @agramfort against scikit-learn's Ridge(normalize=False, solver="cholesky"), Ridge(normalize=False, solver="lsqr"), Ridge(normalize=False, solver="svd") and other solvers.

The main conclusion is that those models seem to converge to the same solution, provided we set a low value of tol and are not too strict on the comparison tolerance: see how I defined atol in the gist. I cannot get convergence to machine precision (only around 1e-6/1e-7 in float64), but maybe this is expected; I am not so sure: I tried to tweak pgtol on top of factr but it did not improve things.

On large-ish problems, L-BFGS can be significantly faster than our always-accurate svd solver (but not always; it depends on the regularization), but never faster than the lsqr solver (based on scipy.sparse.linalg.lsqr). The cholesky solver (our default for dense data) is sometimes faster and sometimes slower than lsqr and L-BFGS (depending on the regularization), but always faster than SVD (at least with high regularization). The sag/saga solvers are not necessarily as fast as they could be: I noticed that they do not benefit from multi-threading on a multicore machine, which might be the reason; I have never seen them run faster than lsqr. Still, sag/saga can be significantly faster than L-BFGS, despite using fewer cores, on problems with n_samples=5e3 / n_features=1e3 and not too much regularization (alpha=1e-6).
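For illustration, here is a minimal sketch (not the gist itself) of the kind of agreement check described above; the dataset shape, solver list and tolerances are arbitrary choices:

```python
# Fit the same ridge problem with several solvers at a tight solver tol and check
# that the coefficients agree up to a loose comparison tolerance. In practice the
# solutions agree to roughly 1e-6/1e-7 in float64, not to machine precision.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=5_000, n_features=1_000, noise=1.0, random_state=0)
alpha = 1e-6

coefs = {
    solver: Ridge(alpha=alpha, solver=solver, tol=1e-12).fit(X, y).coef_
    for solver in ["svd", "cholesky", "lsqr", "sparse_cg"]
}
for solver, coef in coefs.items():
    np.testing.assert_allclose(coef, coefs["svd"], rtol=1e-5, atol=1e-4)
```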

There are still problems with normalize=True as reported in #17444 but maybe we don't care so much because we are going to deprecate this option, see #17772 by @maikia that I plan to review soon.

Proposed plan of actions:

  • proceed with the deprecation of normalize=True in DEP Deprecate 'normalize' in ridge models (#17772), and therefore put "Ridge(normalize=True) fails check_sample_weights_invariance" (#17444) on hold (or treat it as less of a priority). Actually no, see [MRG] FIX sample_weight invariance for linear models (#19616).
  • improve the ridge tests in scikit-learn by explicitly evaluating the minimized objective for the different solvers, maybe using a utility such as the check_min helper of my gist (a sketch of this idea follows this list).
    Tight tests (but without checking the minimum explicitly) were added in TST tight and clean tests for Ridge #22910.
  • maybe add the ridge_bfgs function as a reference to compare against in our test suite? Since it does not seem to always converge to a very high quality optimum, I am not so sure of the value.
  • maybe we could conduct a more systematic evaluation of various datasets with different shapes, conditioning and regularization to select a better default than "cholesky" for dense data? "cholesky" is reputed to not be numerically stable, although I did not observe problems in my gist, maybe because of regularization. lsqr seems to be very promising, both in terms of speed and accuracy of the solution (according to my check_min thingy).
  • I am not sure that our default value of tol=1e-3 is safe. It's probably very solver dependent, but I doubt that it's a good idea for the fast lsqr solver, for instance (to be benchmarked if we do the above evaluation). See ENH change Ridge tol to 1e-4 #24465.
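
As mentioned in the second item, a hypothetical check_min-style helper could look roughly like this (inspired by, but not identical to, the helper in the gist; the dataset shape, alpha and the relative slack are arbitrary):

```python
# Evaluate the objective minimized by Ridge for each solver's coefficients and check
# that no solver ends up noticeably above the best value found.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def ridge_objective(X, y, coef, alpha):
    # Objective of Ridge with fit_intercept=False: ||y - X @ coef||^2 + alpha * ||coef||^2
    residual = y - X @ coef
    return residual @ residual + alpha * coef @ coef


X, y = make_regression(n_samples=1_000, n_features=300, noise=1.0, random_state=0)
alpha = 1.0

objectives = {}
for solver in ["svd", "cholesky", "lsqr", "sparse_cg"]:
    coef = Ridge(alpha=alpha, solver=solver, tol=1e-12, fit_intercept=False).fit(X, y).coef_
    objectives[solver] = ridge_objective(X, y, coef, alpha)

best = min(objectives.values())
for solver, obj in objectives.items():
    # The relative slack below is an arbitrary illustrative threshold.
    assert obj <= best * (1 + 1e-6), f"{solver} did not reach the minimum: {obj} > {best}"
```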

Other ideas:

  • maybe we should consider adding l-bfgs as an option (a minimal sketch follows this list)? That would require more evaluation on sparse data / ill-conditioned data with various values of alpha. I am not sure of the benefit compared to lsqr or cholesky though: they seem to always be more accurate than l-bfgs and faster (especially lsqr). And l-bfgs can be very slow when alpha is small.
  • and maybe add a QR-based solver that is supposed to be more numerically stable? or is this what LAPACK is doing internally when we use the lsqr solver?
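
For reference, a rough sketch of what an l-bfgs ridge solver could look like (this is not @agramfort's ridge_bfgs; the function name, tolerances and iteration budget are illustrative):

```python
# Minimize the Ridge objective ||X @ w - y||^2 + alpha * ||w||^2 with L-BFGS-B,
# returning both the loss and its gradient to scipy.optimize.minimize.
import numpy as np
from scipy.optimize import minimize


def ridge_lbfgs(X, y, alpha, tol=1e-12, max_iter=10_000):
    def loss_and_grad(w):
        residual = X @ w - y
        loss = residual @ residual + alpha * (w @ w)
        grad = 2 * (X.T @ residual + alpha * w)
        return loss, grad

    result = minimize(
        loss_and_grad,
        np.zeros(X.shape[1]),
        method="L-BFGS-B",
        jac=True,
        options={"maxiter": max_iter, "ftol": tol, "gtol": tol},
    )
    return result.x
```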
@agramfort (Member)

+1 for the plan except the "other ideas". I don't think that adding L-BFGS is necessary, and the QR-based solver also seems like overkill.

for 2 you say lsqr for sparse case?

@ogrisel (Member, Author) commented Mar 4, 2021

for 2 you say lsqr for sparse case?

I am not sure what you mean. I edited the description to remove a previous version of item 2 because it was wrong (my tol was too lax and that was leading me to wrong conclusions).

@ogrisel added the module:test-suite and module:linear_model labels on Mar 4, 2021
@lorentzenchr (Member)

I'm also on board with the Proposed plan of actions, and, like @agramfort, don't think that adding lbfgs is necessary. A few further comments:

  • tol=1e-3 is indeed a bit large. On top of that, it means different things for different solvers.
  • "cholesky" solves the normal equations, i.e. it computes X.T @ X + penalty and therefore doubles the condition number of X, which directly impacts the precision of the solution. The penalty should help, indeed.
  • You propose a QR-based solution. I could imagine that solving the augmented least-squares system [X; lambda * Identity] @ coef ≈ [y; 0] in some way could be interesting, as it avoids doubling the condition number (see the sketch after this list).
  • lsmr competes with lsqr, but in the past I could not get a clear picture of which one to prefer in general.
    Both first use a Golub-Kahan bidiagonalization and then Givens rotations (leading to a QR decomposition); solving triangular linear systems is more or less all that is done.
  • lsqr should also work with sparse input and fit_intercept. If that were already the case, I could imagine removing "sparse_cg".
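
A sketch of the augmented formulation from the third bullet above; scipy.sparse.linalg.lsqr exposes this penalty directly through its damp parameter (the ridge penalty is damp**2 * ||coef||^2). Shapes and tolerances are arbitrary:

```python
# Solve ridge as a single (augmented) least-squares problem instead of forming the
# normal equations X.T @ X + penalty.
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 300))
y = rng.standard_normal(1_000)
lam = 1.0  # the ridge penalty is lam**2 * ||coef||^2 in this parametrization

# Explicit augmented system [X; lam * Identity] @ coef ≈ [y; 0]:
A = np.vstack([X, lam * np.eye(X.shape[1])])
b = np.concatenate([y, np.zeros(X.shape[1])])
coef_augmented, *_ = np.linalg.lstsq(A, b, rcond=None)

# The same solution via lsqr's built-in damping (no augmented matrix is built):
coef_lsqr = lsqr(X, y, damp=lam, atol=1e-12, btol=1e-12)[0]
np.testing.assert_allclose(coef_augmented, coef_lsqr, atol=1e-6)
```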

@agramfort (Member) commented Mar 12, 2021 via email

remove sparse_cg ? I would think it's the right thing to do for sparse X no?

@ogrisel (Member, Author) commented Mar 12, 2021

Also in my benchmarks, sparse_cg seemed to perform reasonably well, even on dense array data.

@lorentzenchr (Member)

@ogrisel Which benchmark are you referring to?

remove sparse_cg ? I would think it's the right thing to do for sparse X no?

Sorry, that suggestion of mine was typed too quickly and too late in the evening. I mixed up _solve_sparse_cg with CG-based optimization routines. However, https://web.stanford.edu/group/SOL/software/lsqr/ states:

It [LSQR --Ed.] is algebraically equivalent to applying CG to the normal equation (A^T A + lambda^2 I) x = A^T b, but has better numerical properties, especially if A is ill-conditioned.

On the corresponding LSMR page:

Special feature: Both ||r|| and ||A^T r|| decrease monotonically, where r = b - Ax is the current residual. For LSQR, only ||r|| is monotonic. LSQR is recommended for compatible systems Ax = b, but on least-squares problems with loose stopping tolerances, LSMR may be able to terminate significantly sooner than LSQR.

@ogrisel (Member, Author) commented Mar 12, 2021

Which benchmark are you referring to?

Interactive tests that I did by tweaking the end of the gist linked in the description.

@glemaitre (Member)

Regarding whether or not to add LBFGS as a solver to Ridge: I just came across a use case modelling a classic mechanics problem (based on Newton's second law) where the weights of the linear model are expected to be positive. Right now, we allow positivity constraints in LinearRegression, Lasso, and ElasticNet. Ridge is the only lonely linear model that does not currently support this use case.

If we would like to add support for such an option, LBFGS might be the solver we want to use for this type of problem. It would make our linear models consistent in this regard. It is true that one can still use ElasticNet and cancel the l1 penalty to achieve this use case, but I am not sure whether there is a performance drawback (maybe LBFGS would be faster than the coordinate descent solver used in ElasticNet?).
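
For illustration only (not necessarily what a scikit-learn implementation would do), a positivity constraint on the ridge coefficients maps naturally onto a bound-constrained L-BFGS-B problem:

```python
# Ridge objective ||X @ w - y||^2 + alpha * ||w||^2 minimized under w >= 0 using
# L-BFGS-B box constraints.
import numpy as np
from scipy.optimize import minimize


def positive_ridge_lbfgsb(X, y, alpha):
    def loss_and_grad(w):
        residual = X @ w - y
        return residual @ residual + alpha * (w @ w), 2 * (X.T @ residual + alpha * w)

    n_features = X.shape[1]
    result = minimize(
        loss_and_grad,
        np.zeros(n_features),
        method="L-BFGS-B",
        jac=True,
        bounds=[(0, None)] * n_features,  # coefficients constrained to be non-negative
    )
    return result.x
```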

@agramfort (Member) commented Mar 26, 2021 via email

@lorentzenchr (Member)

@glemaitre @agramfort I opened #19772 for positive in Ridge. As you'll see there, I did not propose LBFGS, as specialized linear-algebra-based options are available.

@lorentzenchr (Member)

I dared to check off the 2nd point with #22910. Those are pretty tight tests.

What is left to do is maybe decreasing (tightening) the default tol from 1e-3 to at least 1e-4 (the same as in most other estimators like Lasso, LogisticRegression, ...). tol is irrelevant for the following solvers:

  • cholesky
  • svd

@ogrisel changed the title from "[RFC] Ridge correctness evaluation" to "[RFC] Improve the Ridge solver speed / convergence tradeoff with better solver defaults" on Mar 10, 2023
@ogrisel reopened this on Mar 10, 2023
@ogrisel (Member, Author) commented Mar 10, 2023

Let me reopen this issue because I believe that the default solver parameters of the Ridge estimator could still be improved. In particular, the following point of the description was not addressed:

maybe we could conduct a more systematic evaluation of various datasets with different shapes, conditioning and regularization to select a better default than "cholesky" for dense data? "cholesky" is reputed to not be numerically stable, although I did not observe problems in my gist, maybe because of regularization. lsqr seems to be very promising, both in terms of speed and accuracy of the solution (according to my check_min thingy).

Based on complementary experiments conducted in #22855 (comment), it seems that we should probably use lsqr either by default or as a fallback for cholesky when the problem is singular, at least for some problem shapes. The fallback for cholesky should also raise a warning telling the user to either increase the penalization or switch to lsqr directly to silence the warning.
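
A sketch of the kind of fallback suggested above (this is not the actual scikit-learn code; the function name, tolerances and warning message are illustrative):

```python
# Try a Cholesky solve of the (penalized) normal equations; if it fails because the
# system is numerically singular, warn and fall back to lsqr.
import warnings

import numpy as np
from scipy import linalg
from scipy.sparse.linalg import lsqr


def ridge_cholesky_with_fallback(X, y, alpha):
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    Xy = X.T @ y
    try:
        # assume_a="pos" makes scipy use a Cholesky factorization under the hood.
        return linalg.solve(gram, Xy, assume_a="pos")
    except linalg.LinAlgError:
        warnings.warn(
            "Singular system encountered in the cholesky solver, falling back to "
            "lsqr. Increase the penalization (alpha) or use solver='lsqr' directly "
            "to silence this warning."
        )
        return lsqr(X, y, damp=np.sqrt(alpha), atol=1e-10, btol=1e-10)[0]
```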

@ogrisel (Member, Author) commented Mar 10, 2023

There were also other interesting potential improvements proposed by @lorentzenchr in #19615 (comment).

@ogrisel (Member, Author) commented Mar 10, 2023

Related issue when alpha=0:

@lorentzenchr (Member)

"cholesky" is reputed to not be numerically stable

This statement is for the unpenalized case and might apply for tiny alpha as well.

I think lsqr/lsmr could qualify as a good default, but we should also run some experiments for n_features > n_samples.
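
For example, a quick wide-data experiment could look like this (shapes, alpha values and tol are arbitrary choices):

```python
# Compare lsqr against the svd reference on an n_features > n_samples problem for a
# few regularization strengths.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=2_000, noise=1.0, random_state=0)

for alpha in [1e-6, 1e-3, 1.0]:
    coef_svd = Ridge(alpha=alpha, solver="svd").fit(X, y).coef_
    coef_lsqr = Ridge(alpha=alpha, solver="lsqr", tol=1e-12).fit(X, y).coef_
    max_diff = np.max(np.abs(coef_lsqr - coef_svd))
    print(f"alpha={alpha:g}: max |lsqr - svd| = {max_diff:.2e}")
```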
