FEA add Cholesky based Newton solver to GLMs #23314
Conversation
2698b9f to ad65d89
Interesting. Looking forward to some benchmarks vs lbfgs :)
Based on our example, see also #21787.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
[
("passthrough_numeric", "passthrough", ["BonusMalus"]),
("binned_numeric", KBinsDiscretizer(n_bins=10, subsample=None), ["VehAge", "DrivAge"]),
("log_scaled_numeric", log_scale_transformer, ["Density"]),
(
"onehot_categorical",
OneHotEncoder(),
["VehBrand", "VehPower", "VehGas", "Region", "Area"],
),
],
remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)

%time PoissonRegressor(solver="lbfgs", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y); None
%time PoissonRegressor(solver="newton-cholesky", alpha=1e-4, tol=1e-4).fit(X, y); None
Note: On synthetic data (...). Note to myself: For tiny ...
Is there a thread-safe way to detect this situation and raise a higher-level warning that suggests increasing the regularization or using the "lbfgs" solver in this case?
Interesting benchmark results. I can reproduce locally. The "newton-cholesky" solver converged (with ...). I am not sure the convergence criterion for both solvers is the same, though. It would be interesting to plot the final objective value on the training set vs ...
@jeremiedbb's work on the callbacks would really be useful here.
Here are some notes from a first quick code review. As discussed with @GaelVaroquaux and @agramfort, it would be interesting to see how this solver compares with an equivalent of the solver="newton-cg" implemented in LogisticRegression. The latter does not require materializing the Hessian in memory.
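For context, here is a minimal sketch (not the PR's code) of that difference: a Cholesky-based step materializes the penalized Hessian H = X' W X / n + alpha * I, whereas a Newton-CG step only needs Hessian-vector products and never forms H. The design matrix, Hessian weights and gradient below are made up for illustration.

import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n_samples, n_features, alpha = 1_000, 20, 1e-4
X = rng.standard_normal((n_samples, n_features))
mu = np.exp(X @ (rng.standard_normal(n_features) / 10))  # stand-in Poisson Hessian weights
grad = rng.standard_normal(n_features)                   # stand-in gradient

# Cholesky-based step: build the (n_features x n_features) Hessian explicitly.
H = X.T @ (mu[:, None] * X) / n_samples + alpha * np.eye(n_features)
step_cholesky = cho_solve(cho_factor(H), -grad)

# Newton-CG style step: only Hessian-vector products, H is never materialized.
def hessp(v):
    return X.T @ (mu * (X @ v)) / n_samples + alpha * v

op = LinearOperator((n_features, n_features), matvec=hessp, dtype=X.dtype)
step_cg, _ = cg(op, -grad)
print(np.max(np.abs(step_cholesky - step_cg)))  # the two steps agree up to the CG tolerance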
intercept = 0
model.fit(X, y)

rtol = 3e-5 if solver == "lbfgs" else 1e-11
Does this high rtol requirement reveal a bug in our use of LBFGS for linear models?
At least, it's hard to achieve high precision with lbfgs.
This is confirmed by #23314 (comment) where I vary the tol parameter for all solvers.
It's even worse in our use of LBFGS in LogisticRegression, probably because the tol criterion is not handled correctly, but this is a different problem.
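As a rough way to quantify that precision gap (not part of the PR), one could compare the norm of the penalized gradient at each solver's solution. The snippet below is only a sketch: it reuses the private LinearModelLoss helper already used in the benchmark scripts of this thread, on a made-up positive-target dataset.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model._linear_loss import LinearModelLoss

X, y = make_regression(n_samples=5_000, n_features=50, random_state=0)
y = np.exp((y - y.mean()) / y.std())  # positive targets for the Poisson loss

for solver in ["lbfgs", "newton-cholesky"]:
    reg = PoissonRegressor(solver=solver, alpha=1e-4, tol=1e-8, max_iter=10_000).fit(X, y)
    lml = LinearModelLoss(base_loss=reg._get_loss(), fit_intercept=True)
    grad = lml.gradient(
        coef=np.r_[reg.coef_, reg.intercept_],
        X=X,
        y=y,
        sample_weight=np.full_like(y, 1.0 / y.shape[0]),
        l2_reg_strength=reg.alpha,
    )
    # a smaller gradient norm means the solver stopped closer to a true stationary point
    print(solver, np.linalg.norm(grad))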
Sets self.coef_newton.
"""

def line_search(self, X, y, sample_weight):
We could probably have dedicated unit tests to check the correctness of the line search in isolation. To ease testing, this method could return extra info, such as the final value of t or i, for instance, or maybe some indicator of which line-search convergence check triggered the breaking of the search loop.
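To make this concrete, here is a hypothetical example of such an isolated test. The backtracking_line_search helper and its (t, i, reason) return values are invented for illustration; the PR's actual line_search method would need to expose similar information for a test like this to work.

import numpy as np

def backtracking_line_search(f, grad, x, direction, t=1.0, beta=0.5, c=1e-4, max_iter=21):
    # toy Armijo backtracking, returning the step size, the number of halvings
    # and which condition stopped the loop
    fx, gx = f(x), grad(x)
    for i in range(max_iter):
        if f(x + t * direction) <= fx + c * t * gx @ direction:  # Armijo condition
            return t, i, "armijo"
        t *= beta
    return t, max_iter, "max_iter"

def test_line_search_on_quadratic():
    f = lambda x: 0.5 * x @ x
    grad = lambda x: x
    x = np.array([3.0, -4.0])
    t, i, reason = backtracking_line_search(f, grad, x, direction=-grad(x))
    assert reason == "armijo"          # the Armijo check, not the iteration cap, ended the loop
    assert i == 0 and t == 1.0         # the full step is accepted on this quadratic
    assert f(x + t * -grad(x)) < f(x)  # the accepted step decreases the objective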
I did some benchmarks by varying the tolerances and I confirm that this solver is much more efficient than LBFGS on this problem.
Edit: the y axis of those plots is the excess loss over the loss of the best model (+ 1e-12).
Here is the script:
import warnings
from pathlib import Path
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd
import joblib
m = joblib.Memory(location=".")
@m.cache
def prepare_data():
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
[
("passthrough_numeric", "passthrough", ["BonusMalus"]),
(
"binned_numeric",
KBinsDiscretizer(n_bins=10, subsample=None),
["VehAge", "DrivAge"],
),
("log_scaled_numeric", log_scale_transformer, ["Density"]),
(
"onehot_categorical",
OneHotEncoder(),
["VehBrand", "VehPower", "VehGas", "Region", "Area"],
),
],
remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)
return X, y, w
X, y, w = prepare_data()
# X = X.toarray()
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
X, y, w, train_size=100_000, test_size=10_000, random_state=0
)
print(f"{X_train.shape = }")
results = []
loss_sw = np.full_like(y_train, fill_value=(1. / y_train.shape[0]))
# loss_sw = None
for tol in np.logspace(-1, -10, 10):
for solver in ["lbfgs", "newton-cholesky"]:
tic = perf_counter()
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=ConvergenceWarning)
reg = PoissonRegressor(
alpha=1e-4, solver=solver, tol=tol, max_iter=10000
).fit(X_train, y_train)
toc = perf_counter()
train_time = toc - tic
train_loss = LinearModelLoss(
base_loss=reg._get_loss(), fit_intercept=reg.fit_intercept
).loss(
coef=np.r_[reg.coef_, reg.intercept_],
X=X_train,
y=y_train.values,
l2_reg_strength=reg.alpha,
sample_weight=loss_sw,
)
result = {
"solver": solver,
"tol": tol,
"train_loss": train_loss,
"train_time": train_time,
"train_score": reg.score(X_train, y_train),
"test_score": reg.score(X_test, y_test),
"n_iter": reg.n_iter_,
"converged": reg.n_iter_ < reg.max_iter,
}
print(result)
results.append(result)
results = pd.DataFrame.from_records(results)
results.to_csv(Path(__file__).parent / "bench_poisson_reg.csv")
# %%
import pandas as pd
from pathlib import Path
results = pd.read_csv(Path(__file__).parent / "bench_poisson_reg.csv")
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-12
results
# %%
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
# %%
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
It seems that the LBFGS solver can get stuck and never converges exactly to the optimum (one order of magnitude above the solution found by the new solver), even for small tolerances.
You can set ... Concerning timing benchmarks: I have something in the pipe, where I first find tolerances that give comparable results. Unfortunately, synthetic data, even ...
It would be interesting to understand what makes lbfgs comparatively fail on this data. It's weird that it needs hundreds or even thousands of iterations to reach a tol below 1e-7 on a convex linear model fitting optimization problem. We could introspect the spectrum of the Hessian at the optimum and maybe forge a synthetic quadratic optimization problem? Or maybe it's the evolution of the Hessian along the optimization path that makes the problem hard for LBFGS to track? Is this a result of the very noisy regression problem, possibly with significant heteroscedasticity?
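One possible starting point for that introspection (a sketch only: it reuses reg and X_train from the script above, ignores the intercept handling, and assumes the log link so that the Hessian weights are the predicted means):

import numpy as np
from scipy.sparse import issparse

X_dense = X_train.toarray() if issparse(X_train) else np.asarray(X_train)
mu = reg.predict(X_train)  # Poisson GLM with log link: Hessian weights are the predicted means
n_samples = X_dense.shape[0]
H = X_dense.T @ (mu[:, None] * X_dense) / n_samples + reg.alpha * np.eye(X_dense.shape[1])
eigvals = np.linalg.eigvalsh(H)
print(f"smallest / largest eigenvalue: {eigvals.min():.3e} / {eigvals.max():.3e}")
print(f"condition number: {eigvals.max() / eigvals.min():.3e}")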
Binary logistic regression on kick dataset

Things become really interesting for binary logistic regression. I chose the "kick" dataset, because it is comparable in size to the freMTPL2freq dataset. It has ...

Scripts

import numpy as np
import pandas as pd
from sklearn._loss import HalfBinomialLoss
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model._glm.glm import _GeneralizedLinearRegressor
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
class BinomialRegressor(_GeneralizedLinearRegressor):
def _get_loss(self):
return HalfBinomialLoss()
df = fetch_openml(data_id=41162, as_frame=True).frame
linear_model_preprocessor = ColumnTransformer(
[
(
"passthrough_numeric",
make_pipeline(SimpleImputer(), StandardScaler()),
[
"MMRAcquisitionAuctionAveragePrice",
"MMRAcquisitionAuctionCleanPrice",
"MMRCurrentAuctionAveragePrice",
"MMRCurrentAuctionCleanPrice",
"MMRCurrentRetailAveragePrice",
"MMRCurrentRetailCleanPrice",
"MMRCurrentRetailAveragePrice",
"MMRCurrentRetailCleanPrice",
"VehBCost",
"VehYear",
"VehicleAge",
"WarrantyCost",
],
),
(
"onehot_categorical",
OneHotEncoder(min_frequency=10),
[
"Auction",
"Color",
"IsOnlineSale",
"Make",
"Model",
"Nationality",
"Size",
"SubModel",
"Transmission",
"Trim",
"WheelType",
],
),
],
remainder="drop",
)
y = np.asarray(df["IsBadBuy"] == "1", dtype=float)
X = linear_model_preprocessor.fit_transform(df)
alpha = 1e-4

%time clf_lbfgs = LogisticRegression(solver="lbfgs", C=1 / alpha / X.shape[0], tol=1e-4, max_iter=1000).fit(X, y)
Wall time: 1.48 s

%time clf_newton_cg = LogisticRegression(solver="newton-cg", C=1 / alpha / X.shape[0], tol=1e-4, max_iter=1000).fit(X, y)
Wall time: 1.42 s

%time clf_lbfgs2 = BinomialRegressor(solver="lbfgs", alpha=alpha, tol=1e-4, max_iter=1000).fit(X, y)
Wall time: 478 ms

%time clf_newton_cholesky = BinomialRegressor(solver="newton-cholesky", alpha=alpha, tol=1e-4).fit(X, y)
Wall time: 781 ms

They achieve different losses:

lml = LinearModelLoss(base_loss=HalfBinomialLoss(), fit_intercept=True)
sw = np.full_like(y, fill_value=1 / y.shape[0])
loss_lbfgs = lml.loss(coef=np.r_[clf_lbfgs.coef_[0], clf_lbfgs.intercept_], X=X, y=y, sample_weight=sw, l2_reg_strength=alpha)
loss_newton_cg = lml.loss(coef=np.r_[clf_newton_cg.coef_[0], clf_newton_cg.intercept_], X=X, y=y, sample_weight=sw, l2_reg_strength=alpha)
loss_lbfgs2 = lml.loss(coef=np.r_[clf_lbfgs2.coef_, clf_lbfgs2.intercept_], X=X, y=y, sample_weight=sw, l2_reg_strength=alpha)
loss_newton_cholesky = lml.loss(coef=np.r_[clf_newton_cholesky.coef_, clf_newton_cholesky.intercept_], X=X, y=y, sample_weight=sw, l2_reg_strength=alpha)

loss_newton_cg 0.3065839722555997 < loss_newton_cholesky 0.30658397857185354 < loss_lbfgs 0.306584001191482 < loss_lbfgs2 0.3066001389832086

Notes
Things become really interesting for binary logistic regression.
Thanks! Those results are really interesting. One problem in interpreting these is that the stopping criteria seem to have triggered in different ways.
With regards to Newton-CG vs Newton-Cholesky: the difference should be more marked in higher dimensions. It would be interesting to have a benchmark with p in the hundreds.
X.shape = (72983, 1075)
X.shape = (72983, 1075)

OK, taking this back. Makes the benchmark very interesting.
Those experiments shed new light on past discussions on LogisticRegression with lbfgs on ill-conditioned problems: ... There is also this attempt to use the related ..., but it wasn't successful (at this point).
I ran some further benchmarks on an imbalanced, binarized variant of the original regression data to be able to compare with the binary classifiers of scikit-learn. I can reproduce the strong speed-up of the new ... Here ... To reproduce, here are the scripts (EDIT by @lorentzenchr: I corrected ...):

import warnings
from pathlib import Path
import numpy as np
from sklearn.metrics import log_loss
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn._loss import HalfBinomialLoss
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.linear_model._glm.glm import _GeneralizedLinearRegressor
from time import perf_counter
import pandas as pd
import joblib
class BinomialRegressor(_GeneralizedLinearRegressor):
def _get_loss(self):
return HalfBinomialLoss()
@joblib.Memory(location=".").cache
def prepare_data():
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
[
("passthrough_numeric", "passthrough", ["BonusMalus"]),
(
"binned_numeric",
KBinsDiscretizer(n_bins=10, subsample=None),
["VehAge", "DrivAge"],
),
("log_scaled_numeric", log_scale_transformer, ["Density"]),
(
"onehot_categorical",
OneHotEncoder(),
["VehBrand", "VehPower", "VehGas", "Region", "Area"],
),
],
remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)
return X, y, w
X, y, w = prepare_data()
y = y.values
y = (y > np.quantile(y, q=0.95)).astype(np.float64)
print(y.mean())
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
X, y, w, train_size=5_000, test_size=10_000, random_state=0
)
print(f"{X_train.shape = }")
results = []
loss_sw = np.full_like(y_train, fill_value=(1.0 / y_train.shape[0]))
# loss_sw = None
alpha = 1e-12
C = 1. / alpha / X.shape[0]
max_iter = 10_000
for tol in np.logspace(-1, -10, 10):
for solver in ["lbfgs", "lbfgs2", "newton-cg", "newton-cholesky"]:
tic = perf_counter()
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=ConvergenceWarning)
if solver in ["lbfgs", "newton-cg"]:
model = LogisticRegression(
C=C, solver=solver, tol=tol, max_iter=max_iter
).fit(X_train, y_train)
toc = perf_counter()
else:
if solver == "lbfgs2":
binomial_solver = "lbfgs"
else:
binomial_solver = solver
model = BinomialRegressor(
solver=binomial_solver, alpha=alpha, tol=tol, max_iter=max_iter
).fit(X_train, y_train)
toc = perf_counter()
train_time = toc - tic
lml = LinearModelLoss(base_loss=HalfBinomialLoss(), fit_intercept=True)
train_loss = lml.loss(
coef=np.r_[model.coef_.ravel(), model.intercept_.ravel()],
X=X_train,
y=y_train.astype(np.float64),
sample_weight=loss_sw,
l2_reg_strength=alpha,
)
result = {
"solver": solver,
"tol": tol,
"train_time": train_time,
"train_loss": train_loss,
"n_iter": int(np.squeeze(model.n_iter_)),
"converged": np.squeeze(model.n_iter_ < model.max_iter).all(),
"coef_sq_norm": (model.coef_**2).sum(),
}
print(result)
results.append(result)
results = pd.DataFrame.from_records(results)
results.to_csv(Path(__file__).parent / "bench_quantile_clf.csv")

and to plot the results (using the VS Code notebook-style cell separators):

# %%
import pandas as pd
from pathlib import Path
results = pd.read_csv(Path(__file__).parent / "bench_quantile_clf.csv")
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-12
results
# %%
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
# %%
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)

Here is the std output of the bench:
Observations and comments
More comments. I wanted to do a full pass but GitHub prevents me from replying to existing discussions while I have pending review comments, hence the fragmented review.
I still want to do a review and we will need to check that ...
f"#{self.iteration} did no converge after 21 line search refinement "
"iterations.",
ConvergenceWarning,
)
I found a way to trigger this case with some non-42 random seed in test_glm_regression_hstacked_X when tweaking glm_dataset as done in https://github.com/ogrisel/scikit-learn/tree/line_search_warning_in_test_glm_regression_hstacked_X.
I think we don't know what coef we should return in this case. I think it deserves manual investigation.
With 34e297e, it now also falls back to lbfgs.
I have to stop here for today and I don't plan to work on this much next week, but I definitely want to come back to this PR soon.
Could you please add a test with verbose=True and the capsys fixture to check that important convergence messages are correctly printed on stdout?
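A possible shape for such a test (sketch only): it assumes the newton-cholesky solver reports its progress through plain print() calls when verbose is set, so that capsys can capture it; asserting on the exact wording would have to be adjusted to the actual implementation.

import numpy as np
from sklearn.linear_model import PoissonRegressor

def test_newton_cholesky_verbose(capsys):
    rng = np.random.RandomState(42)
    X = rng.normal(size=(100, 3))
    y = rng.poisson(lam=np.exp(X[:, 0])).astype(float)
    PoissonRegressor(solver="newton-cholesky", verbose=2, max_iter=10).fit(X, y)
    captured = capsys.readouterr()
    assert captured.out != ""  # some convergence information was printed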
I pushed a commit to update the examples to use the new solver. In particular, for the regression of the claim amounts using Tweedie / Poisson * Gamma regressors, the speed is much better even when using the full dataset, which was not possible previously with LBFGS. Note however that the results heavily depend on the random split. We could refactor this example to use cross-validation, but despite the new solver that would probably be too expensive, and it would require a significant rewrite of the example, which is not the focus of this PR.
Sounds like it could soon be merged :)
Summary

There is so much information here. I'll try to summarize:
Proposed Actions
This sounds like something I can give a shot once 1. is reviewed and merged.
For the interested reader, have a look at the extensive benchmarking in https://github.com/lorentzenchr/notebooks/blob/master/bench_glm_newton_solvers_cholesky_lsmr.ipynb.
With #24637 merged, I'll close.
Reference Issues/PRs
Fixes #16634.
#23619 should be merged first.
What does this implement/fix? Explain your changes.
This PR adds Newton solvers where the Newton step is obtained by a Cholesky decomposition. Another variant first uses a QR decomposition of X', which is beneficial for n_features >> n_samples. This is basically the same as iteratively reweighted least squares (IRLS) with an inner Cholesky based solver on the normal equations.
Any other comments?
2 points:
1. PoissonRegressor, GammaRegressor and TweedieRegressor have the new solvers. It is very easy to extend it to binary LogisticRegression as well.
2. "newton-cholesky" and "newton-qr-cholesky": these could finally be merged into a single one which automatically selects one based on n_samples and n_features (see the sketch below).
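A hypothetical sketch of the automatic selection mentioned in the second point; the function name and the threshold are made up and not part of the PR:

def select_newton_solver(n_samples: int, n_features: int) -> str:
    # The QR variant factorizes X' first, which is expected to pay off when
    # there are many more features than samples; otherwise use the plain
    # Cholesky variant on the normal equations.
    if n_features > n_samples:
        return "newton-qr-cholesky"
    return "newton-cholesky"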
Summary (Edit)
See #23314 (comment)