
DEP deprecate multi_class in LogisticRegression #28703


Merged
merged 25 commits into scikit-learn:main on May 3, 2024

Conversation

lorentzenchr
Member

@lorentzenchr lorentzenchr commented Mar 26, 2024

Reference Issues/PRs

Towards #11865.

What does this implement/fix? Explain your changes.

This PR deprecates the multi_class parameter in LogisticRegression. Using that option is equivalent to OneVsRestClassifier(LogisticRegression()), so no functionality is lost, and once gone, it would simplify the code of logreg quite a bit and make it more maintainable.
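As a quick illustrative sketch of that equivalence (example data, solver, and tolerance are arbitrary choices, not from this PR):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_classes=3, n_informative=4, random_state=0)

ovr_builtin = LogisticRegression(multi_class="ovr", solver="lbfgs").fit(X, y)
ovr_meta = OneVsRestClassifier(LogisticRegression(solver="lbfgs")).fit(X, y)

# Both fit one binary logistic regression per class, so the stacked
# coefficients and the predictions should match up to solver tolerance.
coef_meta = np.vstack([est.coef_ for est in ovr_meta.estimators_])
print(np.abs(ovr_builtin.coef_ - coef_meta).max())
assert (ovr_builtin.predict(X) == ovr_meta.predict(X)).all()
```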

Any other comments?

This PR starts very simple with only LogisticRegression. In case of positive feedback, I'll extend it to LogisticRegressionCV and adapt all the docstrings and so on.

@lorentzenchr lorentzenchr added this to the 1.5 milestone Mar 26, 2024

github-actions bot commented Mar 26, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 10e73d8.

@lorentzenchr lorentzenchr added the Needs Decision label Mar 26, 2024
@lorentzenchr
Member Author

@scikit-learn/core-devs friendly ping for visibility.

@jjerphan
Member

I have not had time to look at this PR at all, but before deprecating, can we have tests to make sure the two are equivalent in results and potentially in UX (e.g. would accessing fitted attributes still be possible)?

@agramfort
Member

would be ok for me. I think it's ok to live with

OneVsRestClassifier(LogisticRegression(..))

when using liblinear.

@GaelVaroquaux
Member

Just to make sure that I understand things correctly: the plan is that, by default, multi-class is supported via the multinomial loss, but if the solver is liblinear, multi-class raises an error and the user must use OvR?

If so, that strategy is fine by me, but I think that the deprecation message should first suggest using solvers that support multinomial, or using OvR if liblinear is desired.

@lorentzenchr
Member Author

Just to make sure that I understand things correctly: the plan is that, by default, multi-class is supported via the multinomial loss, but if the solver is liblinear, multi-class raises an error and the user must use OvR?

Yes, exactly. And yes, with an informative deprecation warning, later an error advertising solvers that support the multinomial loss.

Member

@thomasjpfan thomasjpfan left a comment


It also looks like solver="newton-cholesky" does not support ovr.

From a high-level API point of view, LinearSVC only does OvR with liblinear, and LogisticRegression will no longer support it. In the future, should LinearSVC also force users to use OneVsRestClassifier for multi-class problems?

Over the last few years, we have slowly been making meta-estimators more necessary for certain tasks (e.g., the removal of normalize, or this PR). It kind of goes against the history of "let's make estimators easy to use". For example, the classifiers encode string labels to make them easy to use. This is my observation and I am undecided on the current path.

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 28, 2024 via email

@lorentzenchr
Member Author

Over the last few years, we have slowly been making meta-estimators more necessary for certain tasks.

You raise good points. For this particular case, I have 3 answers:

  • Out of the box, logreg will continue to just work. Only if a user wants to use a specific solver will they be directed to the meta-estimator. (We can even extend the newton-cholesky solver to support multinomial, no big deal. Then only liblinear would be left.)
  • From a statistical point of view, multinomial logreg should clearly be favored over OvR logreg. (In fact, I find it hard to even provide a good statistical reference for OvR logreg.)
  • The logic of LogisticRegression will become significantly less "opaque" and the code more maintainable. (And yes, this is a concern, as the code complexity has been a problem in several PRs, making contributing hard.)

@lorentzenchr
Member Author

I interpret the conversation so far as a decision to deprecate multi_class in LogisticRegression.

I have an open question: shall we raise an error for multiclass liblinear, or internally switch to OvR (and state this in the docs)?

@ogrisel
Member

ogrisel commented Apr 10, 2024

I am fine with deprecating the multi_class thingy in LogisticRegression, but not in LinearSVC, because there is no good default alternative to handle the multiclass case for the latter, so that would be a regression from a UX point of view.

@jeremiedbb
Member

We just have a different factor of 2 somewhere which interferes with the penalty parameters.

Yes, I noticed that when setting C=np.inf the coefs differ by exactly a factor of 2.
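For reference, a minimal sketch reproducing that observation (assuming a version where multi_class is still accepted; the comparison holds only up to solver tolerance):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_classes=2, n_samples=1000, random_state=0)

# With C=np.inf the penalty vanishes, so the two parametrizations describe
# the same model and the fitted coefs should differ by exactly a factor of 2.
unpenalized = dict(C=np.inf, solver="lbfgs", tol=1e-10, max_iter=10_000)
logit = LogisticRegression(**unpenalized).fit(X, y)
softmax = LogisticRegression(multi_class="multinomial", **unpenalized).fit(X, y)
print(np.abs(logit.coef_ - 2 * softmax.coef_).max())  # ~0 up to solver tolerance
```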

@lorentzenchr
Member Author

Anything else I can do for this PR?

@jeremiedbb
Member

Thanks for the clarification. There's still a difference, I think, because in binary + multi_class="multinomial", a softmax is used to compute the probabilities. And it looks like it's used by some users (see #9889). We even have special comments in the docs regarding binary + multi_class="multinomial", e.g.:

```
`coef_` is of shape (1, n_features) when the given problem is binary.
In particular, when `multi_class='multinomial'`, `coef_` corresponds
to outcome 1 (True) and `-coef_` corresponds to outcome 0 (False).

intercept_ : ndarray of shape (1,) or (n_classes,)
    Intercept (a.k.a. bias) added to the decision function.
    If `fit_intercept` is set to False, the intercept is set to zero.
    `intercept_` is of shape (1,) when the given problem is binary.
    In particular, when `multi_class='multinomial'`, `intercept_`
    corresponds to outcome 1 (True) and `-intercept_` corresponds to
    outcome 0 (False).
```

and dedicated tests like:

```python
def test_multinomial_binary_probabilities(global_random_seed):
    # Test multinomial LR gives expected probabilities based on the
    # decision function, for a binary problem.
    X, y = make_classification(random_state=global_random_seed)
    clf = LogisticRegression(
        multi_class="multinomial",
        solver="saga",
        tol=1e-3,
        random_state=global_random_seed,
    )
    clf.fit(X, y)
    decision = clf.decision_function(X)
    proba = clf.predict_proba(X)
    expected_proba_class_1 = np.exp(decision) / (np.exp(decision) + np.exp(-decision))
    expected_proba = np.c_[1 - expected_proba_class_1, expected_proba_class_1]
    assert_almost_equal(proba, expected_proba)
```

I don't have an opinion on the decision to remove the binary + multi_class="multinomial" use case, I trust you on that, but I think that we need an additional future warning for this special case, because the current warning says that from 1.7 it will always be multinomial, which is not true for binary problems, and we don't propose any alternative.

@lorentzenchr
Member Author

lorentzenchr commented May 2, 2024

I don't have an opinion on the decision to remove the binary + multi_class="multinomial" use case, I trust you on that, but I think that we need an additional future warning for this special case, because the current warning says that from 1.7 it will always be multinomial, which is not true for binary problems, and we don't propose any alternative.

🤔 We don't have good terms available: multi_class : {'auto', 'ovr', 'multinomial'}. There is no "binary", and misusing OvR to mean binary would be awkward (the logic of the code does this, which is very awkward). Therefore, I don't know how to improve the warning; I would just neglect that case.

And we can always improve warning messages later, in case someone comes up with a good idea for how to improve them.

@jeremiedbb
Member

jeremiedbb commented May 2, 2024

I would trigger a dedicated warning if type_of_target(y) == "binary" and multi_class == "multinomial"
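A minimal sketch of what such a dedicated warning might look like (the helper name and message wording are hypothetical, not the code merged in this PR):

```python
import warnings

from sklearn.utils.multiclass import type_of_target


def _warn_binary_multinomial(y, multi_class):
    # Hypothetical helper: warn specifically about the binary +
    # multi_class="multinomial" case, which will switch to the standard
    # binomial (logit) formulation once multi_class is removed.
    if multi_class == "multinomial" and type_of_target(y) == "binary":
        warnings.warn(
            "multi_class='multinomial' with a binary target is deprecated; "
            "in the future, binary problems will always be fit with the "
            "binomial (logit) formulation.",
            FutureWarning,
        )
```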

@GaelVaroquaux
Member

GaelVaroquaux commented May 2, 2024 via email

@lorentzenchr
Member Author

lorentzenchr commented May 2, 2024

I would trigger a dedicated warning if type_of_target(y) == "binary" and multi_class == "multinomial"

I'll add one.

The risk is to break people's pipelines

A scikit-learn pipeline won't break. It only breaks if you rely on the shapes of coef_ in some analysis after the fit, or in a custom object using LogisticRegression inside. (EDIT: even the shapes are the same.) I'll add that warning.

@lorentzenchr
Member Author

Ready again if CI passes.

Member

@jeremiedbb jeremiedbb left a comment


LGTM. Ping @ogrisel: does your +1 still hold, especially w.r.t. the binary situation?

@ogrisel
Member

ogrisel commented May 2, 2024

Yes, binomial and multinomial are mathematically equivalent for 2 classes: in both cases, the loss is the log loss. We just have a different factor of 2 somewhere which interferes with the penalty parameters.
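As a sketch of where that factor of 2 comes from, assuming the L2-penalized multinomial solution takes the symmetric form $w_1 = -w_0 = w$:

$$
p(y = 1 \mid x) = \frac{e^{w \cdot x}}{e^{w \cdot x} + e^{-w \cdot x}} = \sigma(2\, w \cdot x),
$$

so the equivalent binomial (logit) coefficient is $v = 2w$. The L2 penalty on the full softmax coefficient matrix then reads

$$
\frac{1}{C_{\text{softmax}}} \left( \lVert w \rVert^2 + \lVert -w \rVert^2 \right)
= \frac{2}{C_{\text{softmax}}} \left\lVert \frac{v}{2} \right\rVert^2
= \frac{1}{2\, C_{\text{softmax}}} \lVert v \rVert^2,
$$

which matches the binomial penalty $\lVert v \rVert^2 / C_{\text{logit}}$ exactly when $C_{\text{logit}} = 2\, C_{\text{softmax}}$.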

When the problem is penalized, the (projected) over-parametrized solution (multinomial parametrization with softmax link) and the standard binary solution (binomial parametrization with logit link) can be very different, both in coef space and in function space, and even when setting the tolerance to a very small value:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_classes=2,
    n_samples=1000,
    n_features=5,
    n_informative=2,
    n_redundant=2,
    random_state=0,
)

hparams = dict(
    C=1e-2,
    tol=1e-16,
    solver="lbfgs",
)
logreg_logit = LogisticRegression(**hparams).fit(X, y)
logreg_softmax = LogisticRegression(multi_class="multinomial", **hparams).fit(X, y)

logreg_logit.coef_ / logreg_softmax.coef_
```

```
array([[ 1.59470659,  1.54629272,  1.60829714, -0.25944454,  1.36905979]])
```

```python
import numpy as np

X_test = np.random.RandomState(0).randn(10, 5)
np.abs(
    logreg_logit.predict_proba(X_test)[:, 1] - logreg_softmax.predict_proba(X_test)[:, 1]
).max()
```

```
0.056755531461336606
```

Note that the above results are mostly the same with the other solvers ("newton-cg", "sag" and "saga"), as expected given the strongly convex objective functions.

I confirm that when increasing or decreasing penalization, the two models become equivalent.

The discrepancy between the two solutions also goes away asymptotically with larger sample sizes.

I would not have expected such a strong discrepancy for intermediate penalties and sample sizes, though. But indeed, the penalties are not equivalent in the two formulations.

That being said, I agree that setting multi_class="multinomial" vs multi_class="ovr" to switch between the over-parametrized softmax formulation and the standard binary logit formulation is really counter-intuitive/misleading.

If we want to keep the possibility to fit (over/softmax) parametrized binary logistic regression in the future, we should probably do so with a better name, but I am not sure which.

@lorentzenchr
Member Author

lorentzenchr commented May 2, 2024

If we want to keep the possibility to fit (over/softmax) parametrized binary logistic regression in the future

We don't. At least I don't want that. I consider it bad practice; there are only disadvantages, and I see absolutely no reason beyond curiosity to do so. Furthermore, if you select the penalty via CV, you will end up with models with the same predictions, up to some uncertainty (solver, numerics, ...).

🙏 And I really beg you not to over-complicate this deprecation. 🙏

@GaelVaroquaux
Member

GaelVaroquaux commented May 2, 2024 via email

@lorentzenchr
Member Author

Here is my last drop of rationality:

If we are concerned about the effect on millions of users, then let's do this deprecation. This way, the functionality is still there, just a warning about future removal is emitted. If users really care that much and have good reasons for using an over-parametrized multinomial on a binomial problem, then they will complain about that. The deprecation is a way to get user feedback (it's a communication device anyway).

For info: before #23040, HistGradientBoostingClassifier with loss="categorical_crossentropy" raised a ValueError on binary problems. Now the only option is loss="log_loss", which automatically selects the proper log-loss setting with a single tree instead of two for binary problems, much like what is proposed in this PR. We never got any complaints.

@ogrisel
Member

ogrisel commented May 3, 2024

I agree with @lorentzenchr's rationale above.

Furthermore, I believe that using LogisticRegression(multi_class="multinomial") is actually broken (not doing what the docstring/comments say it does and what users would expect):

```python
if multi_class == "multinomial":
    n_classes = max(2, classes.size)
    if solver in ["lbfgs", "newton-cg"]:
        multi_w0 = np.reshape(w0, (n_classes, -1), order="F")
    else:
        multi_w0 = w0
    if n_classes == 2:
        multi_w0 = multi_w0[1][np.newaxis, :]
    coefs.append(multi_w0.copy())
else:
    coefs.append(w0.copy())
```

Line 577 should actually be:

```python
            multi_w0 = 2 * multi_w0[1][np.newaxis, :]
```

We are missing a factor of 2 to get the correct mathematical equivalence; hence the computed predictions, which use the logistic sigmoid instead of the redundant softmax when n_classes=2, are not the ones we should have obtained at convergence: they are twice too weak (uncertain) in terms of log-odds ratios.

EDIT: Actually, I checked the code in predict_proba and it's consistent with the original multi_w0 = multi_w0[1][np.newaxis, :] line, so there is no bug: it is "correctly" using softmax with duplicated coefs and swapped signs when n_classes == 2 and multi_class == "multinomial". This is overkill and makes it hard to understand what the coef_ values mean in this case, but the estimated class membership probabilities are correct.

@jeremiedbb
Member

I think it's still better to keep this deprecation, because choosing between the two through multi_class="ovr"/"multinomial" is a bad design. If we want to keep the two behaviors, we'd rather introduce a new hyperparameter. We can do that between the RC and the final release, or wait to see if we get negative feedback from the future warning.

Member

@ogrisel ogrisel left a comment


+1 to use the deprecation period to see if any user cares about the over-parametrized formulation.

If so, we will have time to add a better constructor parameter to expose that feature.

@ogrisel
Member

ogrisel commented May 3, 2024

@GaelVaroquaux are you -1 on using the deprecation period to discover if there are users that actually rely on the overparameterized formulation?

I am also -0 on introducing an alternative option to reintroduce this formulation with a cleaner API if we don't have a strong clue that it is actually useful in any way, as it would risk making the code complex for nothing.

@GaelVaroquaux
Member

GaelVaroquaux commented May 3, 2024 via email

@ogrisel
Member

ogrisel commented May 3, 2024

Alright, let's merge this as-is, but be mindful to take user feedback on this issue into consideration during the deprecation period, in case someone does find a valid reason to use the softmax parametrization.

@jeremiedbb jeremiedbb merged commit 20c8e69 into scikit-learn:main May 3, 2024
30 checks passed
@ogrisel
Member

ogrisel commented May 3, 2024

Furthermore, if you select the penalty via CV, you will end up with models with the same predictions, up to some uncertainty (solver, numerics, ...).

For the record, I checked this empirically: using a C_logit = 2 * C_softmax penalty leads to the same solutions along the regularization path (up to small, solver-dependent numerical discrepancies):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
import numpy as np
from time import perf_counter


X, y = make_classification(
    n_classes=2,
    n_samples=1000,
    n_features=5,
    n_informative=2,
    n_redundant=2,
    random_state=0,
)


Cs_logit = np.logspace(-6, 6, 27)
hparams = dict(
    tol=1e-10,
    solver="lbfgs",
    scoring="neg_log_loss",
    max_iter=10_000,
)
tic = perf_counter()
logreg_logit = LogisticRegressionCV(Cs=Cs_logit, **hparams).fit(X, y)
print(
    f"Fit logit model in {perf_counter() - tic:.4f} seconds, optimal C={logreg_logit.C_}"
)
tic = perf_counter()
logreg_softmax = LogisticRegressionCV(
    Cs=Cs_logit / 2, multi_class="multinomial", **hparams
).fit(X, y)
print(
    f"Fit softmax model in {perf_counter() - tic:.4f} seconds, optimal C={logreg_softmax.C_}"
)
np.testing.assert_allclose(logreg_logit.coef_, 2 * logreg_softmax.coef_)

class_idx = 1
np.testing.assert_allclose(
    logreg_logit.scores_[class_idx], logreg_softmax.scores_[class_idx]
)
max_abs_coef_path_diff = np.abs(
    logreg_logit.coefs_paths_[class_idx] - 2 * logreg_softmax.coefs_paths_[class_idx]
).max()
print(f"Max absolute difference in coef paths: {max_abs_coef_path_diff:.2e}")
```

Here is the typical output of that script with the lbfgs solver:

```
Fit logit model in 0.1500 seconds, optimal C=[1.]
Fit softmax model in 0.1527 seconds, optimal C=[0.5]
Max absolute difference in coef paths: 4.69e-07
```

I get similar results with Newton-CG/SAG/SAGA (I had to increase tol a bit, otherwise some solvers would have various kinds of convergence problems during that grid search, but I think this is kind of expected).

So in the end, this confirms Christian's statement, and I cannot imagine any reason that would justify exposing the over-parametrized case in our public API in the future.
