FEA add newton-lsmr solver to LogisticRegression and GLMs #25462
Conversation
@ogrisel @mathurinm @TomDLT @agramfort @rth might be interested, as this seems to be new ground for GLM solvers, especially the multinomial logistic regression! It was a very stony path to arrive with all (added) tests green. Right now, I have no energy to do extensive benchmarking, but I hope that this work will prove useful and find its way into scikit-learn in the end. I'm sure there are opportunities left for performance optimization.
Glad to see this! I just re-ran the previous benchmark for Poisson regression on the French MTPL dataset from the previous PR, and the results on my laptop look very good.
I have adapted the above benchmark to turn it into an imbalanced multiclass classification problem by binning the target. Since 0 is overly represented, many bins are collapsed to 0 when choosing a large number of bins and the quantile strategy. Here is the code:
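The code itself is collapsed in the thread; a minimal sketch of the binning idea, assuming `KBinsDiscretizer` with the quantile strategy on a zero-dominated target (names, sizes, and parameters here are illustrative guesses, not the actual snippet):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical stand-in for a zero-dominated regression target such as the
# MTPL claim frequency: most values are 0, so many quantile edges coincide.
rng = np.random.default_rng(0)
y_continuous = rng.poisson(0.1, size=100_000).astype(np.float64)

# With a large n_bins and strategy="quantile", the duplicated edges make
# several bins collapse onto the 0 class, yielding an imbalanced
# multiclass target.
binner = KBinsDiscretizer(n_bins=20, encode="ordinal", strategy="quantile")
y_class = binner.fit_transform(y_continuous.reshape(-1, 1)).ravel().astype(int)

print(np.bincount(y_class))  # heavily imbalanced class counts
```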
DISCLAIMER: the plot above displays the unpenalized objective.

EDIT: I did another run with …

This task is very challenging for all solvers, and I had to decrease the number of samples to get it to run in a reasonable time on my laptop. I also stopped recording a solver once tol decreased to the point where a single fit would last more than a few minutes. Here are the resulting plots:

Note that the handling of the stopping criterion of LBFGS is not working properly for the multiclass case.
Note that for lower tolerance values, the above snippet can trigger … for the … solver.

Finally, I think it would be interesting to adapt this benchmark to use benchopt and include it in this panel, since it's quite challenging for most solvers yet still realistic enough.
Some early feedback after a quick glance at the code.
Out of curiosity, have you tried to profile this to pinpoint the bottlenecks for both the multinomial and non-multinomial cases?
I have the impression that multithreading is barely used (in the multiclass case), so it's probably not a few large BLAS calls as is the case for newton-cholesky.
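One way to answer the profiling question above (an illustrative sketch; `solver="newton-lsmr"` only exists on this PR's branch, and the dataset sizes are arbitrary):

```python
import cProfile
import pstats

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=20_000, n_features=200, n_informative=50, n_classes=5,
    random_state=0,
)
clf = LogisticRegression(solver="newton-lsmr", max_iter=100)

# Profile a single fit and list the 20 heaviest calls by cumulative time.
cProfile.run("clf.fit(X, y)", "fit.prof")
pstats.Stats("fit.prof").sort_stats("cumulative").print_stats(20)
```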
```diff
@@ -516,3 +532,460 @@ def inner_solve(self, X, y, sample_weight):
            )
            self.use_fallback_lbfgs_solve = True
            return


class NewtonLSMRSolver(NewtonSolver):
```
I have the feeling the code would be simpler to follow if we had a special subclass to handle the `if self.linear_loss.base_loss.is_multiclass` case.
That would require introducing a factory function to do the dispatching to the correct solver class based on the loss object, but I have the feeling that would be worth it.
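A minimal sketch of what that dispatching could look like; the subclass names are hypothetical and not part of the PR:

```python
class NewtonLSMRSolver:
    """Stub for illustration; the PR's class derives from NewtonSolver."""

    def __init__(self, linear_loss, **kwargs):
        self.linear_loss = linear_loss


class BinaryNewtonLSMRSolver(NewtonLSMRSolver):
    """Hypothetical subclass for the single-output case."""


class MultinomialNewtonLSMRSolver(NewtonLSMRSolver):
    """Hypothetical subclass for the multiclass case."""


def make_newton_lsmr_solver(linear_loss, **kwargs):
    """Factory: dispatch to the right solver subclass based on the loss."""
    if linear_loss.base_loss.is_multiclass:
        return MultinomialNewtonLSMRSolver(linear_loss, **kwargs)
    return BinaryNewtonLSMRSolver(linear_loss, **kwargs)
```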
For the multinomial/multiclass case, the bottleneck clearly is … Edit: I was able to significantly speed up those 2 functions in e5f5f48. They are still the bottleneck, but much reduced (~2x).
force-pushed from 885b413 to d0eea42
- keep dtype float32 after LSMR
- lower test precision in test_NewtonLSMRSolver_multinomial_A_b_on_3_classes
force-pushed from d0eea42 to 7ef5877
With the latest improvements it looks a bit better (btw …).

Sparse X (as above): [plots]

Dense X: [plots]

Conclusion: this solver can be used for very fast but rough estimates or for high-precision estimates :smirk:

Code for reproducibility:
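The snippet itself is collapsed in the thread; a rough re-creation of this kind of tol-sweep benchmark (dataset, grid, and solver list are assumptions, and `solver="newton-lsmr"` requires a build from this PR's branch):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=10_000, n_features=100, n_informative=50, n_classes=5,
    random_state=0,
)

records = []
for solver in ("lbfgs", "newton-cholesky", "newton-lsmr"):
    for tol in np.logspace(-2, -10, 9):
        clf = LogisticRegression(solver=solver, tol=tol, max_iter=10_000)
        tic = time.perf_counter()
        clf.fit(X, y)
        # Wall-clock time per (solver, tol) pair; plotting time against the
        # achieved objective gives curves like the ones shown above.
        records.append((solver, tol, time.perf_counter() - tic))
```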
@lorentzenchr I am not sure if you saw, but your last push triggered a CI failure. I did not investigate myself, but I wanted to make sure that it does not go unnoticed.
Looking at the last plot, I wonder why the LSMR-based solver seems to slow down after the first 4 iterations, before it accelerates again in the last two. Perhaps the choice of …
force-pushed from fa2586f to 0dc87cf
The remaining CI error will be automatically fixed by setting scipy>=1.4, see scipy/scipy#7396. Note that the transpose is only taken in a few tests; the solver itself works fine with those older scipy versions.
For reference, the bump to scipy>=1.4 in …
CI all 🟢 again.
Commit 83ce34f passes …
When trying to trigger the LBFGS fallback by making the model extremely confident, so as to make the Hessian numerically degenerate, I triggered the following:
We might need a …

Also note that …
Actually there is a problem with lbfgs on the above problem, it does not converge: … and … While I don't know if the failure of lbfgs here should be considered a bug or not, since lbfgs is our robust fallback, this might be a problem :) Maybe LBFGS should try a simple gradient step when the line search fails?
I found a way to trigger the LBFGS fallback for the multiclass case:
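The reproducer itself is collapsed; a plausible sketch of the recipe from the earlier comment (perfectly separable, widely spaced classes with a huge C so the probabilities saturate and the Hessian degenerates), with all values being guesses rather than the actual reproducer:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Three perfectly separable, widely spaced 1D clusters: the fitted
# probabilities saturate at 0/1, so the multinomial Hessian becomes
# numerically degenerate and the inner solver may give up.
X = np.concatenate(
    [rng.normal(loc, 0.01, size=(50, 1)) for loc in (-100.0, 0.0, 100.0)]
)
y = np.repeat([0, 1, 2], 50)

# solver="newton-lsmr" assumes a scikit-learn build from this PR branch.
clf = LogisticRegression(solver="newton-lsmr", C=1e12, tol=1e-12, max_iter=1_000)
clf.fit(X, y)
```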
I opened #26707 for investigating the inner solver stopping criterion and ran a lot of benchmarks. There is no clear winner. I have to leave it as is: either it is good enough in its current shape, or someone else needs to dig deeper. My conclusion is that we have quite some room for improvement in the current solvers, like #24752. Also, the …
There are cases where it's indeed quite impressive, based on the last benchmarks that are now collapsed in the discussion. But I agree that fixing #24752 would be helpful to get a clearer picture.

Also, based on benchopt, it seems that SAG & SAGA are better reference solvers for 20 newsgroups, see e.g. https://benchopt.github.io/results/preprint_results_preprint_results_logreg_l2.html — I have no intuition on why this should be the case. You'll have to switch the dataset in the menu on the left to see the results on 20 newsgroups.

UPDATE: actually the results on this dataset are completely different with and without scaling.
Some comments I wrote a few months ago (they might not all be relevant).
```python
if self.p.dtype == np.float32:
    eps = 2 * np.finfo(np.float32).resolution
else:
    eps = 2 * np.finfo(np.float64).resolution
```
Suggested change:

```diff
-if self.p.dtype == np.float32:
-    eps = 2 * np.finfo(np.float32).resolution
-else:
-    eps = 2 * np.finfo(np.float64).resolution
+eps = 2 * np.finfo(self.p.dtype).resolution
```
I'm not 100% sure that proba, and hence self.p, is always either float32 or float64.
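If that concern turns out to be real, a defensive variant (illustrative only, reusing the surrounding `self.p` context) could normalize the dtype first:

```python
# Fall back to float64 resolution for any dtype other than float32,
# instead of assuming self.p is always float32 or float64.
dtype = self.p.dtype if self.p.dtype == np.float32 else np.float64
eps = 2 * np.finfo(dtype).resolution
```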
```python
elif solver == "newton-lsmr":
    clf.set_params(tol=1e-6)
```
This adaptation makes me think of a UX consideration: do we want the default tolerance to depend on the choice of solver?
What do you have in mind?
Reference Issues/PRs
Supersedes #23507.
Fixes #16634.
What does this implement/fix? Explain your changes.
This PR adds a further NewtonSolver: NewtonLSMRSolver. This solver uses the iteratively reweighted least squares (IRLS) formulation of a Newton step. This means the inner solver uses the square root of the Hessian and solves the corresponding least squares problem with the iterative LSMR solver (as opposed to solving the normal equation, as "newton-cholesky" does). This solver is therefore suited for dense and sparse X.
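To make the IRLS formulation concrete, here is a minimal sketch of one unpenalized Newton step (not the PR's code; `v` denotes the per-sample gradient of the loss w.r.t. the linear predictor and `w` the per-sample Hessian weights, both assumed names):

```python
import numpy as np
from scipy.sparse.linalg import lsmr

def newton_step_lsmr(X, v, w):
    """Solve the Newton system (X' W X) d = -X' v as a least squares problem.

    With W = diag(w), the normal equations of
        min_d || sqrt(W) X d + W^{-1/2} v ||^2
    are exactly the Newton system.  LSMR solves it from matrix-vector
    products alone, so the Hessian is never formed explicitly (dense X
    shown here; a LinearOperator would keep a sparse X sparse).
    """
    sqrt_w = np.sqrt(w)
    A = X * sqrt_w[:, None]   # sqrt(W) X
    b = -v / sqrt_w           # -W^{-1/2} v
    return lsmr(A, b)[0]
```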
.Any other comments?
The multinomial/multiclass case deserves special attention, as there are different ways to look at the Hessian $X' W X$: there, $W$ is block diagonal with one block per sample, e.g. with n_classes=3 one gets n_samples blocks $W_i = \mathrm{diag}(p_i) - p_i p_i'$. Then a square root of each $W_i$ is needed to set up the least squares problem. Now, one can use the LDL decomposition of this particular matrix, $W_i = L D L'$, so that $L \sqrt{D}$ serves as the required square root.

This is the chosen approach; see the numerical check below.
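As a quick numerical illustration (not the PR's code) that $L\sqrt{D}$ from the LDL' factorization of such a block is indeed a usable square root, assuming scipy.linalg.ldl:

```python
import numpy as np
from scipy.linalg import ldl

p = np.array([0.2, 0.3, 0.5])        # class probabilities of one sample
W_i = np.diag(p) - np.outer(p, p)    # per-sample multinomial Hessian block

lu, d, perm = ldl(W_i)               # W_i == lu @ d @ lu.T
sqrt_W_i = lu @ np.sqrt(np.maximum(d, 0.0))  # L sqrt(D)

np.testing.assert_allclose(sqrt_W_i @ sqrt_W_i.T, W_i, atol=1e-12)
```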
Edit: For benchmarks, see #25462 (comment).