DEP deprecate multi_class in LogisticRegression #28703
Conversation
@scikit-learn/core-devs friendly ping for visibility.
I have not had time to look at this PR at all, but before deprecating, can we have tests to be sure they are equivalent in results and potentially in UX (e.g. would accessing fitted attributes still be possible)?
Would be ok for me. I think it's ok to live with OneVsRestClassifier(LogisticRegression(..)) when using liblinear.
Just to make sure that I understand things correctly: the plan is that by default multi-class is supported via the multinomial loss, but if the solver is liblinear, multi-class raises an error and the user must use OvR? If so, that strategy is fine by me, but I think that the deprecation message should first suggest solvers that support the multinomial loss, or suggest OvR if liblinear is desired.
Yes, exactly. And yes, with an informative deprecation warning, later an error advertising solvers that support the multinomial loss.
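For illustration, here is a minimal sketch of the suggested replacement (the dataset and solver are example choices for this sketch, not taken from the PR):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# Instead of LogisticRegression(multi_class="ovr", solver="liblinear"),
# wrap the binary estimator in the explicit meta-estimator:
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X, y)
print(ovr.predict(X[:5]))
```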
It also looks like solver="newton-cholesky" does not support ovr. From a high-level API point of view, LinearSVC only does OvR with liblinear, and LogisticRegression will no longer support it. In the future, should LinearSVC also force users to use OneVsRestClassifier for multi-class problems?
Over the last few years, we have slowly been making meta-estimators more necessary for certain tasks (e.g., the removal of normalize, or this PR). It kind of goes against the history of "let's make estimators easy to use". For example, the classifiers encode string labels to make things easy. This is my observation and I am undecided on the current path.
> Over the last few years, we have slowly been making meta-estimators more necessary for certain tasks (e.g., the removal of normalize, or this PR). It kind of goes against the history of "let's make estimators easy to use". For example, the classifiers encode string labels to make things easy. This is my observation and I am undecided on the current path.
Yes, I worry a lot about this trend.
Everybody that I talk to values a lot the fact that it's easy to get things going with scikit-learn. It's the number one benefit that people mention. If we lose this, we lose what made scikit-learn.
You raise good points. For this particular case, I have 3 answers:
I interpret the conversation so far as a decision to deprecate. I have an open question: shall we raise an error for multiclass liblinear, or internally switch to OvR (and state this in the docs)?
I am fine with deprecating the multi_class parameter.
Yes, I noticed that when setting C=np.inf the coefs differ by exactly a factor of 2.
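For reference, a sketch of why the factor of 2 appears, assuming the solver converges to the symmetric over-parametrized solution with $w_0 = -w_1$:

$$p(y=1 \mid x) = \frac{e^{w_1^\top x}}{e^{w_0^\top x} + e^{w_1^\top x}} = \frac{1}{1 + e^{-2 w_1^\top x}} = \sigma(2 w_1^\top x),$$

so the coefficient vector of the plain binomial fit corresponds to $2 w_1$, twice the multinomial one.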
Anything else I can do for this PR?
Thanks for the clarification. There's still a difference, I think, because in binary + multi_class="multinomial", a softmax is used to compute the probabilities. And it looks like it's used by some users (see #9889). We even have special comments in the docs regarding binary + multi_class="multinomial", e.g.:
scikit-learn/sklearn/linear_model/_logistic.py Lines 1003 to 1014 in 2bafd7b
scikit-learn/sklearn/linear_model/tests/test_logistic.py Lines 271 to 289 in 2bafd7b
I don't have an opinion on the decision of removing the binary + multi_class="multinomial" use case, I trust you on that, but I think that we need an additional future warning for this special case, because the current warning says that from 1.7 it will always be multinomial, which is not true for binary problems, and we don't propose any alternative.
🤔 We don't have good terms available. And we can always improve warning messages later, in case someone comes up with a good idea for how to improve them.
I would trigger a dedicated warning if
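A hypothetical sketch of such a dedicated warning; the helper name, trigger condition, and message below are assumptions for illustration, not the code of this PR:

```python
import warnings

def _warn_binary_multinomial(n_classes, multi_class):
    # Hypothetical check: warn only for the special case discussed here,
    # i.e. an explicit multi_class="multinomial" on a binary problem.
    if n_classes == 2 and multi_class == "multinomial":
        warnings.warn(
            "Binary problems fitted with multi_class='multinomial' use a "
            "redundant softmax parametrization; this option is deprecated "
            "and will raise an error in a future release. Leave multi_class "
            "unset to get the standard binomial (sigmoid) fit.",
            FutureWarning,
        )
```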
> I would just neglect that case.

I'm a bit worried about this stance, unless we have strong evidence that it is not used. The risk is to break people's pipelines, and they may have good reasons for such a pipeline.
I'll add one.
A scikit-learn pipeline won't break.
Ready again if CI passes.
LGTM. Ping @ogrisel, does your +1 still hold, especially w.r.t. the binary situation?
When the problem is penalized, the (projected) over-parametrized solution (multinomial parametrization with softmax link) and the standard binary solution (binomial parametrization with logit link) can be very different, both in coef space and in function space, even when setting the tolerance to a very small value:

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_classes=2,
    n_samples=1000,
    n_features=5,
    n_informative=2,
    n_redundant=2,
    random_state=0,
)
hparams = dict(
    C=1e-2,
    tol=1e-16,
    solver="lbfgs",
)
logreg_logit = LogisticRegression(**hparams).fit(X, y)
logreg_softmax = LogisticRegression(multi_class="multinomial", **hparams).fit(X, y)

# Ratio of the coefficients of the two parametrizations.
logreg_logit.coef_ / logreg_softmax.coef_

# Maximum absolute difference in predicted probabilities on random test points.
X_test = np.random.RandomState(0).randn(10, 5)
np.abs(
    logreg_logit.predict_proba(X_test)[:, 1] - logreg_softmax.predict_proba(X_test)[:, 1]
).max()
```
Note that the above results are mostly the same with other solvers ("newton-cg", "sag", and "saga"), as expected given the strongly convex objective functions. I confirm that when increasing or decreasing penalization, the two models become equivalent. The discrepancy between the two solutions also goes away asymptotically with larger sample sizes. I would not have expected such a strong discrepancy for intermediate penalties and sample sizes, though. But indeed the penalties are not equivalent for the two formulations. That being said, I agree that setting

If we want to keep the possibility to fit the (over/softmax-)parametrized binary logistic regression in the future, we should probably do so with a better name, but I am not sure which.
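A sketch of why the penalties are not equivalent, under the same symmetry assumption $w_0 = -w_1$: the over-parametrized L2 term is $\|w_0\|^2 + \|w_1\|^2 = 2\|w_1\|^2$, while the binomial model with $w = 2w_1$ is penalized by $\|w\|^2 = 4\|w_1\|^2$. Matching the two objectives therefore requires halving C for the multinomial variant, which is consistent with the Cs_logit / 2 grid used in the benchmark further down.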
We don't. At least I don't want that. I consider it bad practice. There are only disadvantages. I see absolutely no reason beyond curiosity to do so. 🙏 And I really beg you to not over-complicate this deprecation. 🙏
> We don't. At least I don't want that. I consider it bad practice.

This should not be an argument. That someone considers something bad practice is not a good reason to modify a core library used by millions to make this practice impossible, unless there is a documented wide consensus that the practice is very problematic.
Here is my last drop of rationality: if we are concerned about the effect on millions of users, then let's do this deprecation. This way, the functionality is still there; just a warning about future removal is emitted. If users really care that much and have good reasons for using an over-parametrized multinomial on a binomial problem, then they will complain about that. The deprecation is a way to get user feedback (it's a communication device anyway). For info: before #23040,
I agree with @lorentzenchr's rationale above. Furthermore, I believe that using
scikit-learn/sklearn/linear_model/_logistic.py Lines 570 to 580 in 4e20b01
The line 577 should actually be:

multi_w0 = 2 * multi_w0[1][np.newaxis, :]

We are missing a factor of 2 to get the correct mathematical equivalence, hence the computed predictions that use the logistic sigmoid instead of the redundant softmax when

EDIT: Actually, I checked the code in
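For illustration, a minimal sketch with made-up numbers of how the redundant two-row multinomial coefficients collapse to the equivalent binomial ones, including that factor of 2:

```python
import numpy as np

# A symmetric multinomial solution for a binary problem: two redundant rows.
multi_w0 = np.array([[-0.5, 0.25, -1.0],
                     [0.5, -0.25, 1.0]])

# Equivalent binomial coefficients: twice the row of the positive class.
binary_w = 2 * multi_w0[1][np.newaxis, :]

x = np.array([0.3, -1.2, 0.7])
logits = multi_w0 @ x
softmax_p1 = np.exp(logits)[1] / np.exp(logits).sum()
sigmoid_p1 = 1.0 / (1.0 + np.exp(-(binary_w @ x)))

# Both parametrizations yield the same positive-class probability.
np.testing.assert_allclose(softmax_p1, sigmoid_p1)
```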
I think it's still better to keep this deprecation, because choosing between the 2 through
+1 to use the deprecation period to see if any user cares about the over-parametrized formulation.
If so, we will have time to add a better constructor parameter to expose that feature.
@GaelVaroquaux are you -1 on using the deprecation period to discover if there are users that actually rely on the overparameterized formulation? I am also -0 on introducing an alternative option to reintroduce this formulation with a cleaner API if we don't have a strong clue that it is actually useful in any way, as it would risk making the code complex for nothing.
I have no strong opinion. I'm neither +1 nor -1 on the choice, and I don't feel that I have evidence to have an informed opinion.
I have an issue with the attitude that I stressed above. It's not the right way to make decisions.
Alright, let's merge this as is, but be mindful to take users' feedback on this issue into consideration during the deprecation period, in case they do find a valid reason to use the softmax parametrization.
For the record, I checked empirically that the two parametrizations are equivalent, by checking that using a C grid divided by 2 for the multinomial variant yields the same LogisticRegressionCV results:

```python
import numpy as np
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(
    n_classes=2,
    n_samples=1000,
    n_features=5,
    n_informative=2,
    n_redundant=2,
    random_state=0,
)
Cs_logit = np.logspace(-6, 6, 27)
hparams = dict(
    tol=1e-10,
    solver="lbfgs",
    scoring="neg_log_loss",
    max_iter=10_000,
)

tic = perf_counter()
logreg_logit = LogisticRegressionCV(Cs=Cs_logit, **hparams).fit(X, y)
print(
    f"Fit logit model in {perf_counter() - tic:.4f} seconds, optimal C={logreg_logit.C_}"
)

tic = perf_counter()
logreg_softmax = LogisticRegressionCV(
    Cs=Cs_logit / 2, multi_class="multinomial", **hparams
).fit(X, y)
print(
    f"Fit softmax model in {perf_counter() - tic:.4f} seconds, optimal C={logreg_softmax.C_}"
)

# The binomial coefficients are exactly twice the multinomial ones.
np.testing.assert_allclose(logreg_logit.coef_, 2 * logreg_softmax.coef_)

# Cross-validation scores and coefficient paths match as well.
class_idx = 1
np.testing.assert_allclose(
    logreg_logit.scores_[class_idx], logreg_softmax.scores_[class_idx]
)
max_abs_coef_path_diff = np.abs(
    logreg_logit.coefs_paths_[class_idx] - 2 * logreg_softmax.coefs_paths_[class_idx]
).max()
print(f"Max absolute difference in coef paths: {max_abs_coef_path_diff:.2e}")
```

Here is the typical output for that script with the lbfgs solver.
I get similar results with Newton-CG/SAG/SAGA. (I had to increase tol a bit, otherwise some solvers would have various kinds of convergence problems during that grid search, but I think this is kind of expected.) So in the end, this confirms Christian's statement, and I cannot imagine any reason to justify exposing the over-parametrized case in our public API in the future.
Reference Issues/PRs
Towards #11865.
What does this implement/fix? Explain your changes.
This PR deprecates the multi_class parameter in LogisticRegression. Using that option is equivalent to OneVsRestClassifier(LogisticRegression()), so no functionality is lost and, once gone, it would simplify the code of logreg quite a bit and make it more maintainable.
Any other comments?
This PR starts very simple with only LogisticRegression. In case of positive feedback, I'll extend it to LogisticRegressionCV and adapt all the docstrings and so on.