MRG Logistic regression preconditioning #15583


Closed
wants to merge 43 commits

Conversation

amueller
Member

@amueller amueller commented Nov 10, 2019

Partially addresses #15556.

Right now this is only implemented for l-bfgs.
I added a parameter so people can confirm that it behaves as expected and to make debugging easier, but I think we should not expose a parameter and should always do this.

It might be a bit late, but honestly I would like to avoid users getting another 6 months of convergence warnings whenever they fit logistic regression. The results are also much more stable, better, and obtained faster.

To demonstrate the effect:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.linear_model._logistic import _logistic_loss, _logistic_loss_and_grad
import numpy as np

X, y = make_classification(n_samples=100, n_features=60, random_state=0)
X[:, 1] += 10000
X[:, 0] *= 10000

lr = LogisticRegression(tol=1e-8).fit(X, y) # convergence warning
lr_pre = LogisticRegression(tol=1e-8, precondition=True).fit(X, y) # no warning
lr1000 = LogisticRegression(max_iter=1000, tol=1e-8).fit(X, y) # no warning, still worse solution!

print(_logistic_loss(np.hstack([lr.coef_.ravel(), lr.intercept_]), X, 2 * y - 1, 1))
# 14.289
print(_logistic_loss(np.hstack([lr1000.coef_.ravel(), lr1000.intercept_]), X, 2 * y - 1, 1))
# 13.478
print(_logistic_loss(np.hstack([lr_pre.coef_.ravel(), lr_pre.intercept_]), X, 2 * y - 1, 1))
# 12.354

In other words: it converges faster and to a better solution.
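
For intuition, the preconditioning trick roughly amounts to letting l-bfgs work on standardized columns and mapping the solution back to the original feature space afterwards. Below is a rough by-hand sketch of that reparameterization, reusing X, y and the imports from the snippet above. This is not the PR's internal code, and fitting on explicitly scaled data like this changes the effective L2 penalty, whereas the PR keeps the original penalized objective, so the losses are only roughly comparable:

X_mean = X.mean(axis=0)
X_scale = X.std(axis=0)
X_pre = (X - X_mean) / X_scale

lr_manual = LogisticRegression(tol=1e-8).fit(X_pre, y)  # much better conditioned problem

# map the coefficients back to the original feature space:
# X_pre @ w + b == X @ (w / X_scale) - X_mean @ (w / X_scale) + b
coef_orig = lr_manual.coef_.ravel() / X_scale
intercept_orig = lr_manual.intercept_ - X_mean @ coef_orig

print(_logistic_loss(np.hstack([coef_orig, intercept_orig]), X, 2 * y - 1, 1))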

Multinomial

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import scipy
from sklearn.linear_model._logistic import _multinomial_loss, _multinomial_loss_grad
import numpy as np
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=100, n_features=60, n_classes=3, random_state=0, n_informative=6)
sample_weight = np.ones(X.shape[0])
Y = label_binarize(y, [0, 1, 2])
X[:, 1] += 10000
X[:, 0] *= 10000

lr = LogisticRegression(max_iter=100).fit(X, y) # convergence warning
lr10000 = LogisticRegression(max_iter=10000).fit(X, y) # no convergence warning, still worse. 1000 is not enough.
lr_pre = LogisticRegression(precondition=True, max_iter=100).fit(X, y)
# convergence warning, though max_iter=200 would fix it

loss, p, w = _multinomial_loss(np.hstack([lr.coef_, lr.intercept_.reshape(-1, 1)]), X, Y, 1, sample_weight=sample_weight)
print(loss) # 31.075

loss_10000, p, w = _multinomial_loss(np.hstack([lr10000.coef_, lr10000.intercept_.reshape(-1, 1)]), X, Y, 1, sample_weight=sample_weight)
print(loss_10000)  # 18.972

loss_pre, p, w = _multinomial_loss(np.hstack([lr_pre.coef_, lr_pre.intercept_.reshape(-1, 1)]), X, Y, 1, sample_weight=sample_weight)
print(loss_pre) # 18.308

For multinomial, it still warns with the defaults, but it produces a better result than master does with 10000 iterations (where master no longer warns).
So this doesn't remove all warnings, but it makes the solutions much more stable and allows the problem to be solved by increasing max_iter within a reasonable range (in this example).

With n_samples=10000, the above script produces:

max_iter=100: loss 7999.571, warns
max_iter=10000: loss 7999.5717, doesn't warn
max_iter=100, precondition=True: loss 7734.327, warns
max_iter=200, precondition=True: loss 7733.976, doesn't warn

@amueller amueller changed the title from "WIP Logistic precondition" to "RFC Logistic regression preconditioning" on Nov 10, 2019
@amueller
Member Author

The same trick applies to all other solvers and I think we should implement it for those as well. I'd be happy to have this merged just for the default solver, though.

@amueller
Member Author

Hm, right now this subtracts the mean even in the sparse case... I need to benchmark/think about the sparse case a bit more.
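
For the sparse case, one standard way to avoid densifying X is to keep the column means separately and apply the centering implicitly inside the matrix-vector products. A minimal sketch of that identity (illustrative code, not taken from this PR):

import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
X_sparse = sp.random(50, 20, density=0.1, format="csr", random_state=rng)
X_mean = np.asarray(X_sparse.mean(axis=0)).ravel()
w = rng.randn(20)

# (X - X_mean) @ w == X @ w - (X_mean @ w), so X never has to be centered explicitly
implicit = X_sparse @ w - X_mean @ w
explicit = (X_sparse.toarray() - X_mean) @ w
assert np.allclose(implicit, explicit)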

@amueller
Member Author

amueller commented Nov 10, 2019

This is actually a regression because this used to work with liblinear, which was the default solver in 0.21.

In the binary case, liblinear is slightly worse than the preconditioned solution by default, but much better than current master.

@glemaitre
Member

If we want to scale the data, I think it should be done by default (whether behind a parameter or not); otherwise one will still get the warning anyway, so it does not solve anything.

If we start scaling the data for the user, then I wonder about the following: one will no longer be able to robust-scale the data, since we would reapply standard scaling on top. Would the optimization problem being solved be affected, and does it matter?

@amueller
Member Author

amueller commented Nov 10, 2019

@glemaitre just to be clear, this PR doesn't scale the data. It solves the original optimization problem, just better and faster (if I had scaled the data, the loss on the unscaled data would be high; it's not). That was @GaelVaroquaux's suggestion and I agree that it's more in line with our general principles (and it addresses your concern).

@thomasjpfan
Member

Can we pass X_pre, X_mean, and X_scale into _logistic_loss_and_grad and friends and use these to compute the loss and gradients? (This is to try to avoid scaling sparse data.)

@amueller
Member Author

Right now I'm doing that for X_scale (it's the only way to make this work), but not for X_mean. It should be possible, though.
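
A hypothetical sketch of what such a signature could look like for the binary case. The name and details are illustrative, not the PR's actual implementation; y is assumed to be in {-1, +1}, and the penalty is applied to the back-transformed coefficients so the original objective is preserved:

import numpy as np
from scipy.special import expit


def _loss_and_grad_precondition(w, X, y, alpha, X_mean=None, X_scale=None):
    # Binary logistic loss and gradient where the decision function is
    # evaluated on the implicitly preconditioned data (X - X_mean) / X_scale,
    # so a sparse X never has to be centered or rescaled explicitly.
    c, w = w[-1], w[:-1]
    w_orig = w / X_scale if X_scale is not None else w  # original-space coefficients
    z = X @ w_orig + c
    if X_mean is not None:
        z -= X_mean @ w_orig                  # implicit centering
    yz = y * z
    loss = np.logaddexp(0, -yz).sum() + 0.5 * alpha * (w_orig @ w_orig)
    dz = -y * expit(-yz)                      # d loss / d z_i
    grad = X.T @ dz
    if X_mean is not None:
        grad -= X_mean * dz.sum()
    grad += alpha * w_orig
    if X_scale is not None:
        grad /= X_scale                       # chain rule back to the scaled variable
    return loss, np.concatenate([grad, [dz.sum()]])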

Member

@adrinjalali adrinjalali left a comment


Could you please also add the new parameters to the docstrings everywhere?

if fit_intercept:
    X_offset = -X_mean
# can we actually do inplace here?
inplace_column_scale(X_pre, 1 / X_scale)
Member


should this be inplace?

Member


we copy if it's not sparse

Member Author


Sorry, I'm not sure I understand your second comment.


def test_illconditioned_lbfgs():
    # check that lbfgs converges even with ill-conditioned X
    X, y = make_classification(n_samples=100, n_features=60, random_state=0)
Member


Please test both the binary and the multiclass case.
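
A possible shape for that, as a hedged sketch (a hypothetical test relying on the precondition parameter from this PR and on max_iter=200 as discussed above; the score threshold is purely illustrative):

import warnings

import pytest
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression


@pytest.mark.parametrize("n_classes", [2, 3])
def test_illconditioned_lbfgs_binary_and_multiclass(n_classes):
    # lbfgs should converge even with badly offset / badly scaled columns
    X, y = make_classification(n_samples=100, n_features=60, n_classes=n_classes,
                               n_informative=6, random_state=0)
    X[:, 1] += 10000
    X[:, 0] *= 10000
    with warnings.catch_warnings():
        warnings.simplefilter("error", ConvergenceWarning)
        lr = LogisticRegression(precondition=True, max_iter=200).fit(X, y)
    assert lr.score(X, y) > 0.8  # illustrative sanity threshold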

@adrinjalali
Member

Also, it does feel weird to have it for one solver and not the others. But I guess the docs can explain that.

@amueller
Member Author

@adrinjalali as I said above, I intend to remove the parameter everywhere again; it's just there for easy comparison.

Co-Authored-By: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali
Member

removing from the milestone (this one got complicated)

@adrinjalali adrinjalali removed this from the 0.23 milestone Apr 21, 2020
@thomasjpfan thomasjpfan added this to the 0.24 milestone Apr 21, 2020
@ogrisel
Member

ogrisel commented May 2, 2020

@tomMoral the synthetic data I mentioned to trigger conditioning issues for benchopt is here: #15583 (comment)

@amueller
Member Author

Hm, the SVD-on-subsampled-data idea sounds good... damn, I haven't looked at this in a while...
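
In case it helps the discussion, here is a rough sketch of how an SVD-on-subsampled-data preconditioner might look. This is purely illustrative: the thread does not spell out the exact idea, and none of these names exist in the PR.

import numpy as np


def subsample_svd_preconditioner(X, n_sub=200, random_state=0):
    # Estimate a right preconditioner M from a row subsample of X so that
    # X_sub @ M has singular values close to one.
    rng = np.random.RandomState(random_state)
    idx = rng.choice(X.shape[0], size=min(n_sub, X.shape[0]), replace=False)
    U, s, Vt = np.linalg.svd(X[idx], full_matrices=False)
    k = np.sum(s > 1e-10 * s[0])      # drop numerically negligible directions
    return Vt[:k].T / s[:k]           # shape (n_features, k)


# usage sketch: the solver would work on X @ M, and the solution w_pre is
# mapped back to the original feature space via coef = M @ w_pre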

@amueller
Member Author

Regarding #15583 (comment), it was pointed out to me that we might be favoring the version without preconditioning because the first and last features are actually much more important given how we set up the problem.
If we adjust the coefficients to reflect the change of units we did for the features, the results are different:

from time import time
import numpy as np
from sklearn.datasets import make_low_rank_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

n_samples, n_features = 1000, 10000
max_iter = 1000
print(f"n_samples={n_samples}, n_features={n_features}")

rng = np.random.RandomState(0)
w_true = rng.randn(n_features)

X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
# X = rng.randn(n_samples, n_features)
X[:, 0] *= 1e3
X[:, -1] *= 1e3
w_true[0] /= 1e3
w_true[-1] /= 1e3

z = X @ w_true + 1
z += 1e-1 * rng.randn(n_samples)

# Balanced binary classification problem
y = (z > np.median(z)).astype(np.int32)

for C in [1e3, 1, 1e-3]:
    print(f"\nC={C}")
    print("Without data preconditioning:")
    clf = LogisticRegression(precondition=False, C=C,
                             max_iter=max_iter)
    tic = time()
    clf.fit(X, y)
    duration = time() - tic
    print(f"n_iter={clf.n_iter_[0]},"
          f" obj={clf.objective_value_:.6f},"
          f" duration={duration:.3f}s")

    print("Without data preconditioning with scaling:")
    clf_scaled = LogisticRegression(precondition=False, C=C,
                                    max_iter=max_iter)

    print("With data preconditioning:")
    clf_pre = LogisticRegression(precondition=True, C=C,
                                 max_iter=max_iter)
    tic = time()
    clf_pre.fit(X, y)
    duration_pre = time() - tic
    print(f"n_iter={clf_pre.n_iter_[0]},"
          f" obj={clf_pre.objective_value_:.6f},"
          f" duration={duration_pre:.3f}s")
    print(f"speedup (pre): {duration/duration_pre:.1f}x")
n_samples=1000, n_features=10000

C=1000.0
Without data preconditioning:
n_iter=809, obj=367.675372, duration=5.530s
Without data preconditioning with scaling:
With data preconditioning:
n_iter=144, obj=367.674329, duration=1.083s
speedup (pre): 5.1x

C=1
Without data preconditioning:
n_iter=24, obj=675.094608, duration=0.196s
Without data preconditioning with scaling:
With data preconditioning:
n_iter=31, obj=675.094611, duration=0.295s
speedup (pre): 0.7x

C=0.001
Without data preconditioning:
n_iter=8, obj=692.023023, duration=0.083s
Without data preconditioning with scaling:
With data preconditioning:
n_iter=23, obj=692.024219, duration=0.249s
speedup (pre): 0.3x

@ogrisel
Member

ogrisel commented Nov 25, 2020

Note: as discussed in #18795, it seems that when features are preprocessed with PolynomialCountSketch(degree=2, n_components=1000) or more, we get another real-life case where our linear model solvers struggle with data-conditioning issues (although I did not investigate in detail).
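
A hypothetical reproduction along those lines (the dataset and parameters are made up for illustration; whether it actually warns depends on the data):

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

# degree-2 sketched feature maps tend to produce columns on very different
# scales, which is hard on an unpreconditioned l-bfgs
clf = make_pipeline(
    MinMaxScaler(),
    PolynomialCountSketch(degree=2, n_components=1000, random_state=0),
    LogisticRegression(max_iter=100),
)
clf.fit(X, y)  # may emit a ConvergenceWarning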

@ogrisel
Member

ogrisel commented Nov 26, 2020

I also wonder whether multiclass problems with many classes, including some rare ones, could cause another family of data-related conditioning issues (for LogisticRegression with the multinomial loss).

@lorentzenchr
Member

lorentzenchr commented Jan 24, 2023

This seems a bit stalled.

The findings of our study confirm that preconditioning least-squares problems is hard and
that at present there is no single approach that works well for all problems.

Gould, N.I., & Scott, J.A. (2017). The State-of-the-Art of Preconditioners for Sparse Linear Least-Squares Problems. ACM Transactions on Mathematical Software (TOMS), 43, 1 - 35. https://doi.org/10.1145/3014057 PDF

Edit: More on that matter in

It is also a bit ironic to remove the "normalize" parameter from so many linear models and to try to reinvent it for logistic regression 😄

Note that the glum authors mention that internal standardization is beneficial for the optimizer; see https://glum.readthedocs.io/en/latest/background.html#Standardization and https://glum.readthedocs.io/en/latest/motivation.html#software-optimizations.

@lorentzenchr
Member

See #15556 (comment).
