MRG Logistic regression preconditioning #15583


Closed
wants to merge 43 commits

Conversation

amueller
Member

@amueller amueller commented Nov 10, 2019

Partially addresses #15556.

Right now this is only implemented for l-bfgs.
I added a parameter so people can confirm that it behaves as expected and to make debugging easier, but I think we should not expose a parameter and should always do this.

It might be a bit late, but honestly I would like to avoid users getting another 6 months of convergence warnings whenever they fit logistic regression. The results are also much more stable, better, and obtained faster.

To demonstrate the effect:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.linear_model._logistic import _logistic_loss, _logistic_loss_and_grad
import numpy as np

X, y = make_classification(n_samples=100, n_features=60, random_state=0)
X[:, 1] += 10000
X[:, 0] *= 10000

lr = LogisticRegression(tol=1e-8).fit(X, y) # convergence warning
lr_pre = LogisticRegression(tol=1e-8, precondition=True).fit(X, y) # no warning
lr1000 = LogisticRegression(max_iter=1000, tol=1e-8).fit(X, y) # no warning, still worse solution!

print(_logistic_loss(np.hstack([lr.coef_.ravel(), lr.intercept_]), X, 2 * y - 1, 1))
# 14.289
print(_logistic_loss(np.hstack([lr1000.coef_.ravel(), lr1000.intercept_]), X, 2 * y - 1, 1))
# 13.478
print(_logistic_loss(np.hstack([lr_pre.coef_.ravel(), lr_pre.intercept_]), X, 2 * y - 1, 1))
# 12.354

In other words: it converges faster and to a better solution.
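
For intuition, the preconditioning trick roughly amounts to letting l-bfgs work on standardized columns and mapping the solution back to the original feature space afterwards. Below is a rough by-hand sketch of that reparameterization, reusing X, y and the imports from the snippet above. This is not the PR's internal code, and fitting on explicitly scaled data like this changes the effective L2 penalty, whereas the PR keeps the original penalized objective, so the losses are only roughly comparable:

X_mean = X.mean(axis=0)
X_scale = X.std(axis=0)
X_pre = (X - X_mean) / X_scale

lr_manual = LogisticRegression(tol=1e-8).fit(X_pre, y)  # much better conditioned problem

# map the coefficients back to the original feature space:
# X_pre @ w + b == X @ (w / X_scale) - X_mean @ (w / X_scale) + b
coef_orig = lr_manual.coef_.ravel() / X_scale
intercept_orig = lr_manual.intercept_ - X_mean @ coef_orig

print(_logistic_loss(np.hstack([coef_orig, intercept_orig]), X, 2 * y - 1, 1))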

Multinomial

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import scipy
from sklearn.linear_model._logistic import _multinomial_loss, _multinomial_loss_grad
import numpy as np
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=100, n_features=60, n_classes=3, random_state=0, n_informative=6)
sample_weight = np.ones(X.shape[0])
Y = label_binarize(y, [0, 1, 2])
X[:, 1] += 10000
X[:, 0] *= 10000

lr = LogisticRegression(max_iter=100).fit(X, y) # convergence warning
lr10000 = LogisticRegression(max_iter=10000).fit(X, y) # no convergence warning, still worse. 1000 is not enough.
lr_pre = LogisticRegression(precondition=True, max_iter=100).fit(X, y)
# convergence warning, though max_iter=200 would fix it

loss, p, w = _multinomial_loss(np.hstack([lr.coef_, lr.intercept_.reshape(-1, 1)]), X, Y, 1, sample_weight=sample_weight)
print(loss) # 31.075

loss_10000, p, w = _multinomial_loss(np.hstack([lr10000.coef_, lr10000.intercept_.reshape(-1, 1)]), X, Y, 1, sample_weight=sample_weight)
print(loss_10000)  # 18.972

loss_pre, p, w = _multinomial_loss(np.hstack([lr_pre.coef_, lr_pre.intercept_.reshape(-1, 1)]), X, Y, 1, sample_weight=sample_weight)
print(loss_pre) # 18.308

For multinomial, it still warns with the defaults, but it produces a better result than master does with 10000 iterations (where master no longer warns).
So this doesn't remove all warnings, but it makes the solutions much more stable and allows the problem to be solved by increasing max_iter within a reasonable range (in this example).

With n_samples=10000, the above script produces:

max_iter=100: loss 7999.571, warns
max_iter=10000: loss 7999.5717, doesn't warn
max_iter=100, precondition=True: loss 7734.327, warns
max_iter=200, precondition=True: loss 7733.976, doesn't warn

@amueller amueller changed the title from "WIP Logistic precondition" to "RFC Logistic regression preconditioning" on Nov 10, 2019
@amueller
Member Author

The same trick applies to all other solvers and I think we should implement it for those as well. I'd be happy to have this merged just for the default solver, though.

@amueller
Member Author

Hm, right now this subtracts the mean even in the sparse case... I need to benchmark/think about the sparse case a bit more.
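
For the sparse case, one standard way to avoid densifying X is to keep the column means separately and apply the centering implicitly inside the matrix-vector products. A minimal sketch of that identity (illustrative code, not taken from this PR):

import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
X_sparse = sp.random(50, 20, density=0.1, format="csr", random_state=rng)
X_mean = np.asarray(X_sparse.mean(axis=0)).ravel()
w = rng.randn(20)

# (X - X_mean) @ w == X @ w - (X_mean @ w), so X never has to be centered explicitly
implicit = X_sparse @ w - X_mean @ w
explicit = (X_sparse.toarray() - X_mean) @ w
assert np.allclose(implicit, explicit)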

@amueller
Member Author

amueller commented Nov 10, 2019

This is actually a regression because this used to work with liblinear, which was the default solver in 0.21.

In the binary case, liblinear is slightly worse than the preconditioned solution by default, but much better than current master.

@glemaitre
Member

If we want to scale the data, I think it should be done by default (whether behind a parameter or not); otherwise one will still get the warning anyway, so it does not solve anything.

If we start scaling the data for the user, then I wonder about the following: one will no longer be able to robust-scale the data, since we would reapply standard scaling on top. Would the optimization problem being solved be affected, and does it matter?

@amueller
Member Author

amueller commented Nov 10, 2019

@glemaitre just to be clear, this PR doesn't scale the data. It solves the original optimization problem, just better and faster (if I had scaled the data, the loss on the unscaled data would be high; it's not). That was @GaelVaroquaux's suggestion and I agree that it's more in line with our general principles (and it addresses your concern).

@thomasjpfan
Member

Can we pass X_pre, X_mean, and X_scale into _logistic_loss_and_grad and friends and use these to compute the loss and gradients? (This is to try to avoid scaling sparse data.)

@amueller
Member Author

Right now I'm doing that for X_scale (it's the only way to make this work), but not for X_mean. It should be possible, though.
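
A hypothetical sketch of what such a signature could look like for the binary case. The name and details are illustrative, not the PR's actual implementation; y is assumed to be in {-1, +1}, and the penalty is applied to the back-transformed coefficients so the original objective is preserved:

import numpy as np
from scipy.special import expit


def _loss_and_grad_precondition(w, X, y, alpha, X_mean=None, X_scale=None):
    # Binary logistic loss and gradient where the decision function is
    # evaluated on the implicitly preconditioned data (X - X_mean) / X_scale,
    # so a sparse X never has to be centered or rescaled explicitly.
    c, w = w[-1], w[:-1]
    w_orig = w / X_scale if X_scale is not None else w  # original-space coefficients
    z = X @ w_orig + c
    if X_mean is not None:
        z -= X_mean @ w_orig                  # implicit centering
    yz = y * z
    loss = np.logaddexp(0, -yz).sum() + 0.5 * alpha * (w_orig @ w_orig)
    dz = -y * expit(-yz)                      # d loss / d z_i
    grad = X.T @ dz
    if X_mean is not None:
        grad -= X_mean * dz.sum()
    grad += alpha * w_orig
    if X_scale is not None:
        grad /= X_scale                       # chain rule back to the scaled variable
    return loss, np.concatenate([grad, [dz.sum()]])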

Member

@adrinjalali adrinjalali left a comment


Could you please also add the new parameters to the docstrings everywhere?

if fit_intercept:
    X_offset = -X_mean
# can we actually do inplace here?
inplace_column_scale(X_pre, 1 / X_scale)
Member


should this be inplace?

Member


we copy if it's not sparse

Member Author


Sorry, I'm not sure I understand your second comment.


def test_illconditioned_lbfgs():
    # check that lbfgs converges even with ill-conditioned X
    X, y = make_classification(n_samples=100, n_features=60, random_state=0)
Member


Please test both the binary and the multiclass case.
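
A possible shape for that, as a hedged sketch (a hypothetical test relying on the precondition parameter from this PR and on max_iter=200 as discussed above; the score threshold is purely illustrative):

import warnings

import pytest
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression


@pytest.mark.parametrize("n_classes", [2, 3])
def test_illconditioned_lbfgs_binary_and_multiclass(n_classes):
    # lbfgs should converge even with badly offset / badly scaled columns
    X, y = make_classification(n_samples=100, n_features=60, n_classes=n_classes,
                               n_informative=6, random_state=0)
    X[:, 1] += 10000
    X[:, 0] *= 10000
    with warnings.catch_warnings():
        warnings.simplefilter("error", ConvergenceWarning)
        lr = LogisticRegression(precondition=True, max_iter=200).fit(X, y)
    assert lr.score(X, y) > 0.8  # illustrative sanity threshold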

@adrinjalali
Member

Also, it does feel weird to have it for one solver and not the others. But I guess the docs can explain that.

@amueller
Member Author

@adrinjalali as I said above, I intend to remove the parameter everywhere again; it's just there for easy comparison.

Co-Authored-By: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali
Member

removing from the milestone (this one got complicated)

@adrinjalali adrinjalali removed this from the 0.23 milestone Apr 21, 2020
@thomasjpfan thomasjpfan added this to the 0.24 milestone Apr 21, 2020
@ogrisel
Member

ogrisel commented May 2, 2020

@tomMoral the synthetic data I mentioned to trigger conditioning issues for benchopt is here: #15583 (comment)

@amueller
Member Author

Hm, the SVD-on-subsampled-data idea sounds good... damn, I haven't looked at this in a while...
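
In case it helps the discussion, here is a rough sketch of how an SVD-on-subsampled-data preconditioner might look. This is purely illustrative: the thread does not spell out the exact idea, and none of these names exist in the PR.

import numpy as np


def subsample_svd_preconditioner(X, n_sub=200, random_state=0):
    # Estimate a right preconditioner M from a row subsample of X so that
    # X_sub @ M has singular values close to one.
    rng = np.random.RandomState(random_state)
    idx = rng.choice(X.shape[0], size=min(n_sub, X.shape[0]), replace=False)
    U, s, Vt = np.linalg.svd(X[idx], full_matrices=False)
    k = np.sum(s > 1e-10 * s[0])      # drop numerically negligible directions
    return Vt[:k].T / s[:k]           # shape (n_features, k)


# usage sketch: the solver would work on X @ M, and the solution w_pre is
# mapped back to the original feature space via coef = M @ w_pre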

@amueller
Member Author

Regarding #15583 (comment), it was pointed out to me that we might be favoring the version without preconditioning because the first and last features are actually much more important given how we set up the problem.
If we adjust the coefficients to reflect the change of units we did for the features, the results are different:

from time import time
import numpy as np
from sklearn.datasets import make_low_rank_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

n_samples, n_features = 1000, 10000
max_iter = 1000
print(f"n_samples={n_samples}, n_features={n_features}")

rng = np.random.RandomState(0)
w_true = rng.randn(n_features)

X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
# X = rng.randn(n_samples, n_features)
X[:, 0] *= 1e3
X[:, -1] *= 1e3
w_true[0] /= 1e3
w_true[-1] /= 1e3

z = X @ w_true + 1
z += 1e-1 * rng.randn(n_samples)

# Balanced binary classification problem
y = (z > np.median(z)).astype(np.int32)

for C in [1e3, 1, 1e-3]:
    print(f"\nC={C}")
    print("Without data preconditioning:")
    clf = LogisticRegression(precondition=False, C=C,
                             max_iter=max_iter)
    tic = time()
    clf.fit(X, y)
    duration = time() - tic
    print(f"n_iter={clf.n_iter_[0]},"
          f" obj={clf.objective_value_:.6f},"
          f" duration={duration:.3f}s")

    print("Without data preconditioning with scaling:")
    clf_scaled = LogisticRegression(precondition=False, C=C,
                                    max_iter=max_iter)

    print("With data preconditioning:")
    clf_pre = LogisticRegression(precondition=True, C=C,
                                 max_iter=max_iter)
    tic = time()
    clf_pre.fit(X, y)
    duration_pre = time() - tic
    print(f"n_iter={clf_pre.n_iter_[0]},"
          f" obj={clf_pre.objective_value_:.6f},"
          f" duration={duration_pre:.3f}s")
    print(f"speedup (pre): {duration/duration_pre:.1f}x")
n_samples=1000, n_features=10000

C=1000.0
Without data preconditioning:
n_iter=809, obj=367.675372, duration=5.530s
Without data preconditioning with scaling:
With data preconditioning:
n_iter=144, obj=367.674329, duration=1.083s
speedup (pre): 5.1x

C=1
Without data preconditioning:
n_iter=24, obj=675.094608, duration=0.196s
Without data preconditioning with scaling:
With data preconditioning:
n_iter=31, obj=675.094611, duration=0.295s
speedup (pre): 0.7x

C=0.001
Without data preconditioning:
n_iter=8, obj=692.023023, duration=0.083s
Without data preconditioning with scaling:
With data preconditioning:
n_iter=23, obj=692.024219, duration=0.249s
speedup (pre): 0.3x

@ogrisel
Member

ogrisel commented Nov 25, 2020

Note: as discussed in #18795, it seems that when features are preprocessed with PolynomialCountSketch(degree=2, n_components=1000) or more, we get another real-life case where our linear model solvers struggle with data-conditioning issues (although I did not investigate in detail).
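
A hypothetical reproduction along those lines (the dataset and parameters are made up for illustration; whether it actually warns depends on the data):

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

# degree-2 sketched feature maps tend to produce columns on very different
# scales, which is hard on an unpreconditioned l-bfgs
clf = make_pipeline(
    MinMaxScaler(),
    PolynomialCountSketch(degree=2, n_components=1000, random_state=0),
    LogisticRegression(max_iter=100),
)
clf.fit(X, y)  # may emit a ConvergenceWarning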

@ogrisel
Member

ogrisel commented Nov 26, 2020

I also wonder whether multiclass problems with many classes, including some rare ones, could cause another family of data-related conditioning issues (for LogisticRegression with the multinomial loss).

@lorentzenchr
Member

lorentzenchr commented Jan 24, 2023

This seems a bit stalled.

The findings of our study confirm that preconditioning least-squares problems is hard and
that at present there is no single approach that works well for all problems.

Gould, N.I., & Scott, J.A. (2017). The State-of-the-Art of Preconditioners for Sparse Linear Least-Squares Problems. ACM Transactions on Mathematical Software (TOMS), 43, 1 - 35. https://doi.org/10.1145/3014057 PDF

Edit: More on that matter in

It is also a bit ironic to remove the "normalize" parameter from so many linear models and to try to reinvent it for logistic regression 😄

Note that the glum authors mention that internal standardization is beneficial for the optimizer; see https://glum.readthedocs.io/en/latest/background.html#Standardization and https://glum.readthedocs.io/en/latest/motivation.html#software-optimizations.

@lorentzenchr
Member

See #15556 (comment).
