
ENH multiclass/multinomial newton cholesky for LogisticRegression #28840


Merged: 24 commits merged into scikit-learn:main from multiclass_newton_cholesky on Oct 18, 2024

Conversation

@lorentzenchr (Member) commented Apr 15, 2024

Reference Issues/PRs

In a way a follow-up of #24767.

What does this implement/fix? Explain your changes.

This extends the "newton-cholesky" solver of LogisticRegression and LogisticRegressionCV to the full multinomial loss. In particular, the full Hessian is computed, so this solver no longer needs to resort to one-vs-rest (OvR) for multiclass targets.
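For concreteness, here is a minimal numpy sketch (illustration only, not the scikit-learn implementation) of the full multinomial Hessian structure: block (k, l) of the Hessian over the flattened coefficients is a weighted Gram matrix of X with per-sample weights p_k * (delta_kl - p_l).

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 50, 4, 3
X = rng.standard_normal((n_samples, n_features))
coef = rng.standard_normal((n_classes, n_features))

# Softmax probabilities, shape (n_samples, n_classes).
logits = X @ coef.T
logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
proba = np.exp(logits)
proba /= proba.sum(axis=1, keepdims=True)

# Block (k, l) of the multinomial Hessian w.r.t. the flattened coefficients:
# X.T @ diag(proba[:, k] * (delta_kl - proba[:, l])) @ X.
H = np.empty((n_classes * n_features, n_classes * n_features))
for k in range(n_classes):
    for l in range(n_classes):
        w_kl = proba[:, k] * ((k == l) - proba[:, l])
        H[k * n_features:(k + 1) * n_features,
          l * n_features:(l + 1) * n_features] = X.T @ (w_kl[:, None] * X)

# The full Hessian is symmetric and positive semi-definite.
assert np.allclose(H, H.T)
assert np.linalg.eigvalsh(H).min() > -1e-8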

Any other comments?

There are two tricky parts:

  1. Some index bookkeeping: the coefficients are usually indexed hierarchically by n_classes and n_features, but in the end the Hessian must be a plain 2-dimensional matrix - and it is!
  2. The multinomial loss is over-parameterized in any unpenalized coefficient, so at least in the intercepts. We therefore choose the last class as reference and set its intercept to zero (see the sketch after this list).
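To illustrate the second point, a small numpy sketch (again illustration only) of why the unpenalized intercepts are only identified up to an additive constant, so that pinning the last class intercept to zero loses nothing:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 5, 3, 4
X = rng.standard_normal((n_samples, n_features))
coef = rng.standard_normal((n_classes, n_features))
intercept = rng.standard_normal(n_classes)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)

# Shifting every class intercept by the same constant leaves the predicted
# probabilities unchanged, hence the over-parameterization.
proba = softmax(X @ coef.T + intercept)
proba_pinned = softmax(X @ coef.T + (intercept - intercept[-1]))
assert np.allclose(proba, proba_pinned)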

github-actions bot commented Apr 15, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit c690194.

@lorentzenchr lorentzenchr changed the title from "ENH multiclass newton cholesky for LogisticRegression" to "ENH multiclass/multinomial newton cholesky for LogisticRegression" on Apr 15, 2024
@lorentzenchr (Member Author)

Benchmark

As of 1de85b7

X_train.shape = (10000, 75)
sparse.issparse(X_train)=False
n_classes=12
[benchmark figure: suboptimality vs. n_iter and vs. train_time for the lbfgs, newton-cg and newton-cholesky solvers]

import warnings
from pathlib import Path
import numpy as np
from scipy import sparse
from sklearn._loss import HalfMultinomialLoss
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd



def prepare_data():
    df = fetch_openml(data_id=41214, as_frame=True, parser='auto').frame
    df["Frequency"] = df["ClaimNb"] / df["Exposure"]
    log_scale_transformer = make_pipeline(
        FunctionTransformer(np.log, validate=False), StandardScaler()
    )
    linear_model_preprocessor = ColumnTransformer(
        [
            ("passthrough_numeric", "passthrough", ["BonusMalus"]),
            (
                "binned_numeric",
                KBinsDiscretizer(n_bins=10, subsample=None),
                ["VehAge", "DrivAge"],
            ),
            ("log_scaled_numeric", log_scale_transformer, ["Density"]),
            (
                "onehot_categorical",
                OneHotEncoder(),
                ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
            ),
        ],
        remainder="drop",
    )
    y = df["Frequency"]
    w = df["Exposure"]
    X = linear_model_preprocessor.fit_transform(df)
    return X, y, w


X, y_orig, w = prepare_data()

print("binning the target...")
binner = KBinsDiscretizer(
    n_bins=300, encode="ordinal", strategy="quantile", subsample=int(2e5), random_state=0
)
y = binner.fit_transform(y_orig.to_numpy().reshape(-1, 1)).ravel().astype(float)

X = X.toarray()
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, train_size=10_000, test_size=10_000, random_state=0
)
print(f"{X_train.shape = }")
print(f"{sparse.issparse(X_train)=}")
n_classes = len(np.unique(y_train))
print(f"{n_classes=}")
print("y_train.value_counts() :")
print(pd.Series(y_train).value_counts())


results = []
slow_solvers = set()
loss_sw = np.full_like(y_train, fill_value=(1. / y_train.shape[0]))
alpha = 1e-6  # A bit larger than in the LSMR benchmarks to avoid ConvergenceWarnings
for tol in np.logspace(-1, -10, 10):
    for solver in ["lbfgs", "newton-cg", "newton-cholesky"]:
        if solver in slow_solvers:
            # skip slow solvers to keep the benchmark runtime reasonable
            continue
        tic = perf_counter()
        # with warnings.catch_warnings():
        #     warnings.filterwarnings("ignore", category=ConvergenceWarning)
        clf = LogisticRegression(
            C=1/alpha,
            solver=solver,
            tol=tol,
            max_iter=10_000 if solver=="lbfgs" else 1000,
        ).fit(X_train, y_train)
        toc = perf_counter()
        train_time = toc - tic
        n_iter = clf.n_iter_[0]
        if train_time > 200 or n_iter >= clf.max_iter:
            # skip this solver from now on...
            slow_solvers.add(solver)
        # Look inside _GeneralizedLinearRegressor to check the parameters.
        # Or run once with verbose=1 and compare to the reported loss.
        train_loss = LinearModelLoss(
            base_loss=HalfMultinomialLoss(n_classes=n_classes), fit_intercept=clf.fit_intercept
        ).loss(
            coef=np.c_[clf.coef_, clf.intercept_],
            X=X_train,
            y=y_train,
            l2_reg_strength=alpha / X_train.shape[0],
            sample_weight=loss_sw,
        )
        result = {
            "solver": solver,
            "tol": tol,
            "train_loss": train_loss,
            "train_time": train_time,
            "train_score": clf.score(X_train, y_train),
            "test_score": clf.score(X_test, y_test),
            "n_iter": n_iter,
            "converged": n_iter < clf.max_iter,
        }
        print(result)
        results.append(result)


results = pd.DataFrame.from_records(results)
filepath = Path().resolve() / "bench_multinomial_logistic_regression_mtpl_dense_newton_cholesky.csv"
results.to_csv(filepath)


import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt


filepath = Path().resolve() / "bench_multinomial_logistic_regression_mtpl_dense_newton_cholesky.csv"

results = pd.read_csv(filepath)
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-16

fig, axes = plt.subplots(ncols=2, figsize=(8*2, 6))
for label, group in results.groupby("solver"):
    group.sort_values("tol").plot(
        x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=axes[0]
    )
axes[0].set_ylabel("suboptimality")
axes[0].set_title("Suboptimality by iterations")

for label, group in results.groupby("solver"):
    group.sort_values("tol").plot(
        x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=axes[1]
    )
axes[1].set_ylabel("suboptimality")
axes[1].set_title("Suboptimality by time")
plt.show()

@lorentzenchr lorentzenchr added this to the 1.5 milestone Apr 18, 2024
@ogrisel (Member) left a comment

Thanks @lorentzenchr, this is a very interesting PR. Here is a first pass of feedback.

nitpick: I think Hessian should always be capitalized in the docstrings and comments.

@lorentzenchr (Member Author)

nitpick: I think Hessian should always be capitalized in the docstrings and comments.

That's right, but not the standard in our code base. If you wish to correct that, I propose a separate PR.

@ogrisel (Member) commented Apr 25, 2024

That's right, but not the standard in our code base. If you wish to correct that, I propose a separate PR.

I think we can just make sure that we don't propagate this error in new docstrings / comments and use a follow-up PR to fix existing docstrings/comments that are not logically related to the scope of this PR.

# While a dedicated Cython routine could exploit the symmetry, it is very hard to
# beat BLAS GEMM, even though the latter cannot exploit the symmetry, unless one
# pays the price of taking square roots and implements
# sqrtWX = sqrt(W)[:, None] * X
@ogrisel (Member) commented Apr 25, 2024

Note that exploiting symmetry is not the only reason why a dedicated sandwich product kernel would make sense.

The line above would trigger a read/write round trip between RAM and the CPU of the size of X (when X is too large to fit in the CPU cache, which is typically the case of interest). When n_samples >> n_features, a dedicated fused sandwich product kernel would only have to:

  • read n_samples * (n_features + 1) / 2 from RAM;
  • write n_features ** 2 to RAM;

while what you propose would:

  • read n_samples * (n_features + 1) from RAM, # sqrtWX = sqrt(W)[:, None] * X
  • write n_samples * n_features to RAM, # sqrtWX = sqrt(W)[:, None] * X
  • read n_samples * n_features / 2 from RAM, # np.dot(sqrtWX.T, sqrtWX)
  • write n_features ** 2 to RAM. # np.dot(sqrtWX.T, sqrtWX)

Assuming that this kernel is memory bound, I would expect a ~3x speed-up from the fused kernel over the 2-step numpy code.

The problem is that writing an efficient blocked sandwich product kernel in Cython with OpenMP threading and hardware adapted SIMD vector instructions is far from trivial.

For CPU, https://github.com/Quantco/tabmat already presumably does that.

For GPU, something like https://github.com/openai/triton/ might be able to do it in a vendor agnostic way.

EDIT: some triton developers are working on a CPU backend. It's very preliminary at this point but may be interesting in the medium term, as it would allow a single-source code base for optimized CPU + GPU kernels written in a high-level programming language: https://github.com/triton-lang/triton-cpu/
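To make the discussion concrete, here is a minimal numpy sketch of the two-step GEMM-based sandwich product X.T @ diag(W) @ X discussed above (variable names are mine; the einsum call is only a single-pass correctness reference, not a proposal for the implementation):

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100_000, 20
X = rng.standard_normal((n_samples, n_features))
W = rng.uniform(0.1, 1.0, size=n_samples)  # per-sample Hessian weights

# Two-step GEMM approach: materializes the intermediate array sqrtWX
# (n_samples x n_features) in RAM before handing it to BLAS GEMM.
sqrtWX = np.sqrt(W)[:, None] * X
H_gemm = sqrtWX.T @ sqrtWX

# Single-pass reference via einsum: no intermediate array, but without an
# optimized GEMM path it is typically much slower than the two-step version.
H_ref = np.einsum("ji,j,jk->ik", X, W, X)

assert np.allclose(H_gemm, H_ref)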

@lorentzenchr (Member Author)

I can remove some of that comment. I wanted to stress two facts:

  • this is the CPU bottleneck;
  • replacing it with some self-written BLAS-like function is a ludicrous undertaking (even tabmat is only faster when there are categoricals!). GEMM might be the algorithm on which the most human time has been spent writing and optimizing.

@lorentzenchr (Member Author)

Let’s not get off-topic too much.

@ogrisel (Member) commented Jun 21, 2024

Replacing it with some self-written BLAS-like function is a ludicrous undertaking (even tabmat is only faster when there are categoricals!). GEMM might be the algorithm on which the most human time has been spent writing and optimizing.

tabmat is nearly 4x faster than GEMM on dense numeric inputs according to:

This matches my memory bandwidth analysis above. My suggestion is not to change the implementation as part of this PR or in the near future but rather to improve the comment and converge on a shared understanding of the remaining achievable performance improvements if we move away from GEMM to a dedicated sandwich product fused kernel (as the one implemented in tabmat).

@lorentzenchr (Member Author)

@agramfort @TomDLT @rth friendly ping in case you find time. IMO, this PR closes a gap in the linear model solvers and enables high-precision solutions with unprecedented speed (orders of magnitude faster) for multiclass problems when the Hessian fits into memory.

@lorentzenchr lorentzenchr force-pushed the multiclass_newton_cholesky branch from 4c48a40 to a0428e2 on October 1, 2024 16:50
@agramfort (Member) left a comment

I did not review in detail, but since this is a convex problem, if the loss reaches the global minimum, as evidenced by the plot and the tests, then the numerics must be correct. This is a new feature, so there is no risk of regression for the existing algorithms. I would just suggest someone carefully checks that this does not lead to any API change.

@lorentzenchr (Member Author)

I would just suggest someone carefully checks that this does not lead to any API change.

This PR makes the following public API change:
if multi_class="auto" (effectively the default), n_classes >= 3, and solver="newton-cholesky", then the effectively used multi_class changes from "ovr" to "multinomial".
Please note that multi_class is deprecated and will be removed in 1.7.
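For reference, a minimal check (my own sketch, assuming a scikit-learn build that includes this PR): with the full multinomial Newton-Cholesky solver, newton-cholesky and lbfgs now minimize the same penalized multinomial loss on a multiclass problem and should agree closely, whereas the previous OvR fallback would in general yield different probabilities.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)
clf_nc = LogisticRegression(solver="newton-cholesky", tol=1e-8).fit(X, y)
clf_lbfgs = LogisticRegression(solver="lbfgs", tol=1e-8, max_iter=10_000).fit(X, y)

# Both solvers minimize the same penalized multinomial loss, so the predicted
# probabilities should agree closely.
assert np.allclose(clf_nc.predict_proba(X), clf_lbfgs.predict_proba(X), atol=1e-4)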

@ogrisel (Member) commented Oct 18, 2024

While reviewing this PR and testing it on some data, I realized that the reported number of iterations was always 0 whenever the L-BFGS-B fallback would kick in.

I pushed a quick fix in 1fa8e14.

I made it such that any completed iteration of the newton-cholesky solver is subtracted from max_iter before calling lbfgs, and the sum of the two solvers' iteration counts is reported in the end. In practice, this does not seem to change anything, because the Hessian conditioning problem always happens during the first iteration in my experiments with weakly regularized, rank-deficient problems that typically trigger the lbfgs fallback mechanism.

Now I realize that this bug was already present for the binary classification problem, so I should have opened a separate PR with a proper changelog entry. Let me revert this commit and do that instead.

Note that I find the lbfgs fallback warning quite annoying when it is triggered while tuning the regularization level (e.g. using LogisticRegressionCV or RandomizedSearchCV). I have the feeling that this should be a regular verbose print instead, but we can tackle that in a separate PR.
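A rough sketch of the iteration bookkeeping described above (purely illustrative pseudocode with hypothetical helper names, not the actual scikit-learn internals):

def fit_with_lbfgs_fallback(newton_solve, lbfgs_solve, max_iter):
    # Hypothetical helpers: each returns the coefficients, the number of
    # completed iterations, and a convergence flag.
    coef, n_iter_newton, converged = newton_solve(max_iter=max_iter)
    if converged:
        return coef, n_iter_newton
    # Give the lbfgs fallback only the remaining iteration budget and report
    # the total number of iterations across both solvers.
    remaining = max_iter - n_iter_newton
    coef, n_iter_lbfgs, _ = lbfgs_solve(coef_init=coef, max_iter=remaining)
    return coef, n_iter_newton + n_iter_lbfgs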

@ogrisel (Member) left a comment

I had another pass over the code. The tests look good; I trust them. I will open a follow-up PR for the bug I found in the LBFGS fallback mechanism.

@ogrisel ogrisel enabled auto-merge (squash) October 18, 2024 10:11
@lorentzenchr (Member Author)

@ogrisel Thanks for reviewing. I would prefer if you did not push commits on (my) PRs. Please first communicate with me.

@ogrisel ogrisel merged commit c08b433 into scikit-learn:main Oct 18, 2024
29 checks passed
@jjerphan (Member)

Thank you very much for this contribution, Christian. 🙌
