MRG Logistic regression preconditioning #15583
Conversation
the same trick applies to all other solvers and I think we should implement it for those as well. I'd be happy to have this in just for the default solver, though. |
hm right now subtracting the mean even in the sparse case... need to benchmark/think about the sparse case a bit more |
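For context, a common way to handle the sparse case is to scale the columns only and keep the centering implicit, rather than subtracting the mean from the data (which would densify X). A rough sketch of that pattern, illustrative only and not necessarily what this PR ends up doing:

import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import inplace_column_scale, mean_variance_axis

rng = np.random.RandomState(0)
X = sp.random(100, 20, density=0.1, format="csr", random_state=rng)

# Column means and variances, computed without densifying X.
X_mean, X_var = mean_variance_axis(X, axis=0)
X_scale = np.sqrt(X_var)
X_scale[X_scale == 0] = 1.0  # constant columns: avoid dividing by zero

# Scale the columns of a copy in place; the mean is NOT subtracted, so the
# matrix stays sparse. The centering can instead be carried as an offset
# (e.g. X_offset = -X_mean / X_scale) and applied when computing X @ w.
X_pre = X.copy()
inplace_column_scale(X_pre, 1 / X_scale)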
This is actually a regression because this used to work with liblinear, which was the default solver in 0.21. In the binary case, liblinear is slightly worse than the preconditioned solution by default, but much better than current master. |
If we want to scale the data, I think it should happen by default (with or without a parameter); otherwise one will still get the warning anyway, so it does not solve anything. If we start scaling the data for the user, then I am wondering about the following: one will not be able to robust-scale the data, since we would reapply a standard scaling on top of it. Will the optimization problem to be solved be affected, and does it matter? |
@glemaitre just to be clear, this PR doesn't scale the data. It solves the original optimization problem, just better and faster (if I scaled the data, the loss on the unscaled data would be high, it's not). That was @GaelVaroquaux's suggestion and I agree that it's more in line with our general principles (and it addresses your concern). |
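For readers following along, here is a rough sketch of that idea: solve on column-scaled features, then map the coefficients back so the returned model corresponds to the original, unscaled X. This is only an illustration of the back-mapping; with a non-zero penalty the scaled and unscaled problems are not exactly equivalent, which is presumably why the PR preconditions inside the solver rather than simply standardizing the data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X[:, 0] *= 1e3  # one badly scaled column makes the problem ill-conditioned

# Precondition: center and scale the columns.
X_mean = X.mean(axis=0)
X_scale = X.std(axis=0)
X_pre = (X - X_mean) / X_scale

clf = LogisticRegression(max_iter=1000).fit(X_pre, y)

# Map the solution back to the scale of the original features:
# w_orig = w_pre / scale, b_orig = b_pre - w_orig @ mean.
coef = clf.coef_ / X_scale
intercept = clf.intercept_ - coef @ X_mean

# The mapped-back model scores the *unscaled* data directly.
decision = X @ coef.ravel() + intercept
print(np.mean((decision > 0) == y))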
Can we pass the |
Right now I'm doing that for |
Could you please also add the new params everywhere to their docstrings?
sklearn/linear_model/_logistic.py (outdated)
if fit_intercept:
    X_offset = -X_mean
# can we actually do inplace here?
inplace_column_scale(X_pre, 1 / X_scale)
should this be inplace?
we copy if it's not sparse
Sorry not sure if I understand your second comment.
def test_illconditioned_lbfgs():
    # check that lbfgs converges even with ill-conditioned X
    X, y = make_classification(n_samples=100, n_features=60, random_state=0)
please test both the binary and the multiclass case.
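A possible shape for such a test, sketched here with made-up names, data setup, and assertions (not taken from the PR):

import numpy as np
import pytest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


@pytest.mark.parametrize("n_classes", [2, 3])
def test_illconditioned_lbfgs_binary_and_multiclass(n_classes):
    # lbfgs should converge even with ill-conditioned X,
    # in both the binary and the multiclass case.
    X, y = make_classification(n_samples=100, n_features=60,
                               n_informative=10, n_classes=n_classes,
                               random_state=0)
    X[:, 0] *= 1e3  # blow up the scale of one feature
    clf = LogisticRegression(solver="lbfgs", max_iter=1000)
    clf.fit(X, y)
    assert np.all(clf.n_iter_ < 1000)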
Also, it does feel weird to have it for one solver and not the others. But I guess the docs can explain that. |
@adrinjalali as I said above, I intend to remove the parameter everywhere again and it's just for easy comparison. |
Co-Authored-By: Adrin Jalali <adrin.jalali@gmail.com>
removing from the milestone (this one got complicated) |
@tomMoral the synthetic data I mentioned to trigger conditioning issues for benchopt is here: #15583 (comment) |
Hm the svd on subsampled data idea sounds good... damn, I haven't looked at this in a while... |
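For reference, if the "SVD on subsampled data" idea means estimating the conditioning (or a scaling) from a row subsample, a purely illustrative sketch could look like this; nothing here is taken from the PR:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils.extmath import randomized_svd

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X[:, 0] *= 1e3  # badly scaled column

rng = np.random.RandomState(0)
# Work on a row subsample so the SVD stays cheap even for large n_samples.
subsample = rng.choice(X.shape[0], size=500, replace=False)

# A truncated randomized SVD of the subsample gives a cheap estimate of the
# spectrum; a large spread of singular values flags ill-conditioning.
_, singular_values, _ = randomized_svd(X[subsample], n_components=10,
                                       random_state=0)
print(f"spectrum spread: {singular_values[0] / singular_values[-1]:.1f}")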
Regarding #15583 (comment), it was pointed out to me that we might be favoring the version without preconditioning because the first and last features are actually much more important given how we set up the problem.

from time import time
import numpy as np
from sklearn.datasets import make_low_rank_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale
n_samples, n_features = 1000, 10000
max_iter = 1000
print(f"n_samples={n_samples}, n_features={n_features}")
rng = np.random.RandomState(0)
w_true = rng.randn(n_features)
X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
# X = rng.randn(n_samples, n_features)
X[:, 0] *= 1e3
X[:, -1] *= 1e3
w_true[0] /= 1e3
w_true[-1] /= 1e3
z = X @ w_true + 1
z += 1e-1 * rng.randn(n_samples)
# Balanced binary classification problem
y = (z > np.median(z)).astype(np.int32)
for C in [1e3, 1, 1e-3]:
    print(f"\nC={C}")
    print("Without data preconditioning:")
    clf = LogisticRegression(precondition=False, C=C,
                             max_iter=max_iter)
    tic = time()
    clf.fit(X, y)
    duration = time() - tic
    print(f"n_iter={clf.n_iter_[0]},"
          f" obj={clf.objective_value_:.6f},"
          f" duration={duration:.3f}s")
    print("Without data preconditioning with scaling:")
    clf_scaled = LogisticRegression(precondition=False, C=C,
                                    max_iter=max_iter)
    tic = time()
    clf_scaled.fit(scale(X), y)  # fit on standardized features
    duration_scaled = time() - tic
    print(f"n_iter={clf_scaled.n_iter_[0]},"
          f" duration={duration_scaled:.3f}s")
    print("With data preconditioning:")
    clf_pre = LogisticRegression(precondition=True, C=C,
                                 max_iter=max_iter)
    tic = time()
    clf_pre.fit(X, y)
    duration_pre = time() - tic
    print(f"n_iter={clf_pre.n_iter_[0]},"
          f" obj={clf_pre.objective_value_:.6f},"
          f" duration={duration_pre:.3f}s")
    print(f"speedup (pre): {duration/duration_pre:.1f}x")
|
Note: as discussed in #18795, it seems that when features are preprocessed with |
I also wonder whether multiclass problems with many classes, including some rare classes, could cause another family of data-related conditioning issues (for LogisticRegression with the multinomial loss). |
This seems a bit stalled.
Gould, N. I., & Scott, J. A. (2017). The State-of-the-Art of Preconditioners for Sparse Linear Least-Squares Problems. ACM Transactions on Mathematical Software (TOMS), 43, 1–35. https://doi.org/10.1145/3014057 Edit: More on that matter in
It is also a bit ironic to remove the
Note that the glum authors mention that internal standardization is beneficial for the optimizer; see https://glum.readthedocs.io/en/latest/background.html#Standardization and https://glum.readthedocs.io/en/latest/motivation.html#software-optimizations. |
See #15556 (comment). |
Partially addresses #15556.
Right now this is implemented only for lbfgs.
I added a parameter so people can confirm that it behaves as expected and to make debugging easier, but I think we should not add a parameter and should always do this.
Might be a bit late, but honestly I would like to avoid users having another 6 months of convergence warnings whenever they fit logistic regression. Also, the results are much more stable and better, and fitting is faster.
To demonstrate the effect:
in other words: it converges faster and to a better solution.
Multinomial
For multinomial, it still warns with the defaults, but produces a better result than master with 10000 iterations (in which case master doesn't warn).
So this doesn't remove all warnings, but it makes the solutions much more stable and allows the problem to be solved by increasing max_iter within a reasonable range (in this example). Changing to n_samples=10000, the above script produces