[MRG] Change default solver in LogisticRegression #11476
Conversation
Thanks a lot for this.
I'm still trying to spot any interesting trends in the plots... |
Have we given up on this for 0.20? :( |
A minimal change that could make it to the release could be to use:

```python
if solver == 'auto':
    if penalty == 'l1':
        solver = 'saga'
    else:
        solver = 'lbfgs'
```

This change is probably not optimal, but I am not confident about adding more heuristics if we have a strong contract with the 'auto' mode. Otherwise we can either:
|
The three changes at the beginning are kind of independent, right? Do we want to do a FutureWarning that we're gonna move to |
Well, the three changes are not entirely independent. We can't only change the default. But we can limit changes to:
|
It is very tempting to take that simplified lbfgs approach... |
So I removed the 'auto' solver, and the PR is ready to be merged. It includes:
|
I think it's not too late to change that!
|
I have to dig through that. It seems very weird to me to offer two different versions in the binary case as they should be reparametrizations, right? |
So yes, I think the right fix would be to never do multinomial in the binary case as we're just doing redundant computations, right? |
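For reference, here is a quick sketch of the equivalence being appealed to above (the weight and intercept symbols are just illustrative notation, not scikit-learn attributes). In the unpenalised binary case, the two-class softmax

$$P(y=1 \mid x) = \frac{e^{w_1^\top x + b_1}}{e^{w_0^\top x + b_0} + e^{w_1^\top x + b_1}} = \sigma\big((w_1 - w_0)^\top x + (b_1 - b_0)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

depends on the parameters only through the differences $w_1 - w_0$ and $b_1 - b_0$, i.e. it is exactly a logistic regression reparametrised with twice as many parameters, which is why fitting both weight vectors is redundant.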
I disagree with the forcing of binary logistic regression when using `multi_class='multinomial'`.

Preliminary

I will refer to binary logistic regression as the usual model with a single weight vector, a single intercept and a sigmoid prediction function, and I will refer to binary softmax regression as the model whose prediction function is a two-class softmax, where each class has its own weight vector and intercept.

Reasons

Solution

If there is going to be a big change, however, my suggestion would be
I also certainly agree that the default should be |
Thanks for the quick reply @rwolst.

So my main concern is about the default behavior. We will switch the default behavior to multinomial, and so people with binary problems would get the softmax formulation, which is probably not what they expect. We could have an `auto` mode to implement what you're suggesting and allow users to explicitly ask for softmax. Or would you argue the softmax is a good default? I don't see a benefit for the user in general, it's slower and it's surprising.

For your Preliminaries, the soft-max should have two different intercepts and weight vectors, right?

For 1)

> In fact I would guess the most likely reason anyone is performing a binary softmax regression is because they have coded a general algorithm that for some use cases only has two classes in the input.

They wouldn't be explicitly asking for it, as I said, it would be the default behavior. And I think if the main reason someone would use this is to use the weights in their custom implementation, then I think that's a good argument to exclude it. This is not really a use-case we support or should worry about.

For 2)

This is called LogisticRegression, and not SoftmaxRegression, and so I think the default behavior should be that of vanilla logistic regression (our default of using regularization is a bit of a tangential topic). I want the default behavior to be the thing the user expects.

Finally, in your conclusion I don't understand why you'd want to change the shape in ovr to k-1. Does that even work? And having `coef_` shape dependent on `multi_class` will result in a giant heap of trouble.

There's three things we can do:

1. Leave it as is, and use binary softmax instead of binary logistic regression by default in logistic regression (and have people set the `multi_class` parameter to get standard logistic regression)
2. Use binary logistic regression by default but allow explicitly asking for soft-max regression, possibly by having `multi_class='auto'`
3. Don't support binary softmax regression in LogisticRegression and just ignore `multi_class` in the binary case.

I argued for 3). I think 1) would be very strange. We could do 2) but the way the parameters are named is really weird - why should the formulation of the binary problem be determined by the `multi_class` parameter? That's very counter-intuitive. Also, I don't see a strong argument for implementing softmax regression. This code is already way too complex imho, and adding a new parameter to allow a relatively obscure formulation seems not worth it. |
It's a property of constrained softmax regression that in the binary case the coefficients and intercept will be the negative of one another for each class prediction.

I agree that getting softmax regression when running a `LogisticRegression` is confusing, and I'm probably just accustomed to it now. But by that reasoning it also seems strange to default to `multi_class='multinomial'` for more than 2 classes, as I would expect multiclass logistic regression to be one versus rest (as it is in the binary case).

The reason in my conclusion I said to use `k-1` is because this is technically the fewest parameters needed to represent the `ovr` solution. It seems strange to me that in the binary case this minimum representation would be returned but not in the multiclass case.

After thinking about what you said, I think the best solution would be to split into a `LogisticRegression` and `SoftmaxRegression`, for the `ovr` case and `multinomial` case respectively. I believe this would make most sense to a normal user.
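A short sketch of the symmetry claim at the top of this reply, assuming a symmetric L2 penalty on both class weight vectors (the argument is standard; the notation is illustrative): since the two-class softmax likelihood depends only on the differences $w_1 - w_0$ and $b_1 - b_0$, fixing $w_1 - w_0 = w$ leaves the data-fit term unchanged, and minimising the penalty

$$\lVert w_0 \rVert^2 + \lVert w_1 \rVert^2 \quad \text{subject to} \quad w_1 - w_0 = w$$

gives $w_1 = w/2$ and $w_0 = -w/2$ (and similarly for the intercepts if they are penalised). The two rows therefore come out as negatives of each other, and the effective penalty on $w$ is $\lVert w \rVert^2 / 2$, which is also why a penalised binary softmax fit is not in general recoverable from a penalised binary logistic fit with the same regularisation strength.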
|
I guess it depends on the community the user is from, but I think the OvR solution is very uncommon both in ML and stats. It was mostly introduced in liblinear from what I know. My understanding is that most textbooks that describe logistic regression use multinomial logistic regression as the standard multi-class case. And while k-1 is technically the least parameters needed to represent the OVR solution, I don't think that actually works. How would you learn that? |
I'm not sure I actually understand that sentence. What do you mean by "one vs rest in the binary case"? |
I think #11859 is good now? |
I pushed a commit to address some of the most straightforward of @amueller's comments. Also removed the sentence about the change of default from the docstring, since it already says so in the
For the other comments, someone should push the remaining fixes, once it is decided what should be done :) |
Agreed. For multiclass, OvR is uncommon. What I would find strange is to default to OvR in the binary case but softmax in the multiclass case (in my opinion; others may not think so).

To learn OvR with `k-1` shape, simply write your likelihood as

    Pr(Y = 1 | x) = 1 / (1 + sum_{i=2}^k exp(intercept_[i] + coef_[i] x))

and

    Pr(Y = j | x) = exp(intercept_[j] + coef_[j] x) / (1 + sum_{i=2}^k exp(intercept_[i] + coef_[i] x))

for `j` not equal to 1. Then use your favourite optimiser (the gradients and Hessian are easy to find analytically).
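As an aside, here is a rough, self-contained sketch of fitting that `k-1`-parameter formulation with a generic optimiser. It is purely illustrative and not part of scikit-learn; the function and variable names are made up, and classes are 0-indexed so class 0 plays the role of the reference class above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def fit_reference_class_model(X, y, n_classes):
    """Fit the k-1-parameter model sketched above, with class 0 as the
    reference class whose logit is fixed to zero (y encoded as 0..k-1)."""
    n_samples, n_features = X.shape
    Xb = np.hstack([np.ones((n_samples, 1)), X])           # intercept column + features

    def neg_log_likelihood(params):
        W = params.reshape(n_classes - 1, n_features + 1)   # one row per non-reference class
        logits = np.hstack([np.zeros((n_samples, 1)), Xb @ W.T])  # reference logit = 0
        log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
        return -log_probs[np.arange(n_samples), y].sum()

    x0 = np.zeros((n_classes - 1) * (n_features + 1))
    res = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
    return res.x.reshape(n_classes - 1, n_features + 1)     # shape (k-1, 1 + n_features)
```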
|
To answer the question

> I'm not sure I actually understand that sentence. What do you mean by "one vs rest in the binary case"?

The binary logistic regression definition I used is essentially one versus rest. In the binary case we are comparing whether the output is in one class versus the other. Note this is different from (binary) softmax regression, where each class essentially gets a weighting which is then normalised by the sum of all other classes' weights.
|
The point of OvR is to learn independent classifiers using a binary model. OvR is a reduction of the multi-class problem to the binary problem. So if you optimize as in your first comment, that's not OvR. It's a new formulation that's certainly possible but that is not a reduction and it's unclear to me why anyone would use that. And saying "The binary logistic regression definition I used is essentially one versus rest." doesn't make sense to me. One-vs-rest means reducing a multi-class problem to a binary problem. It's not a description of a particular likelihood. |
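To make the "reduction" point concrete, here is a minimal illustrative sketch of OvR as defined here (conceptually what `sklearn.multiclass.OneVsRestClassifier` does; the helper names are made up, and `y` is assumed to be a NumPy array of labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ovr(X, y):
    """Reduce a multi-class problem to independent binary problems:
    one binary logistic regression per class against all the rest."""
    return {c: LogisticRegression().fit(X, (y == c).astype(int)) for c in np.unique(y)}

def predict_ovr(models, X):
    """Predict the class whose one-vs-rest classifier scores highest."""
    classes = np.array(sorted(models))
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return classes[np.argmax(scores, axis=1)]
```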
Anyway, do you think there's an important use-case for having a binary softmax classifier? I think just providing logistic regression should be sufficient. |
Ah right, thanks for clearing that up. I didn't read the OvR description closely enough (honestly I've never used it). Regardless, from that description of OvR, that is still the form of (from my definition) binary logistic regression. Hence it is still strange that for multiclass it would default to softmax regression but in the binary case it would default to OvR.

For a use case, I'm not entirely sure what to say other than anyone wanting to perform binary softmax regression would no longer be able to. I'm not sure if this is a common use case, but I can say that I do personally use it.

A final point to note is that if you do remove the possibility of binary softmax regression, then it will break any code that relies on it and there will be no workaround. This is due to the behaviour of both optimisations with respect to the constraints (as mentioned earlier), i.e. there is no general way of transforming the solution of constrained binary logistic regression to constrained binary softmax regression.
|
I'm not against having a way to perform binary softmax regression. I don't think "multi_class" is the right parameter name to control it (since we use multi-class as a term here in explicit contrast to binary).

I also think users expect LogisticRegression to do multinomial LR = softmax in the multiclass case, though we could indeed consider splitting softmax into a separate classifier, or replacing multi_class with softmax={True, False, 'auto'} or something. coef_ should have shape (2, n_features) if binary data and softmax=True, and otherwise (1, n_features).

Importantly, binary softmax was available in 0.19, but its predict_proba was broken. If we are going to (temporarily) stop offering it, now is the time to do it, before the 0.20 release.
|
I would say multi_class='auto' is a good middle ground here. I agree it's maybe confusing to use multi_class to fix the behavior in the binary case, but it still allows doing multinomial in the binary case, which can be used as argued above.
|
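For concreteness, a hypothetical sketch of the kind of dispatch a `multi_class='auto'` option implies; the function name and exact conditions are illustrative, not the merged implementation:

```python
def resolve_multi_class(multi_class, n_classes, solver):
    """Illustrative resolution of multi_class='auto'."""
    if multi_class == 'auto':
        # Binary problems keep plain (OvR-style) logistic regression, and so do
        # solvers that cannot handle the multinomial loss (liblinear is OvR-only);
        # genuine multi-class problems otherwise get the multinomial formulation.
        if n_classes <= 2 or solver == 'liblinear':
            return 'ovr'
        return 'multinomial'
    return multi_class
```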
@agramfort as the final solution? That behavior is basically impossible to discover I'd say. |
I could throw together a PR, perhaps tonight, which:
It's a pity that this involves a deprecation, although in some ways that simplifies the current PR. Would this be acceptable to all parties? |
@amueller, I'd love to get your +1 or otherwise on that plan before you holiday!! |
That is probably the most reasonable course of action if we want to keep the binary softmax. |
'compat' is just a default value while we deprecate multi_class.
|
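A hedged sketch of what such a 'compat' sentinel default could look like during the deprecation cycle; the helper name, wording and fallback are illustrative only:

```python
import warnings

def _resolve_multi_class_default(multi_class):
    """Keep the old behaviour under a sentinel default while warning users."""
    if multi_class == 'compat':
        warnings.warn(
            "The default multi_class will change in a future release; "
            "set multi_class explicitly to silence this warning.",
            FutureWarning)
        return 'ovr'   # current behaviour preserved during the deprecation period
    return multi_class
```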
This is about changing the default solver in `LogisticRegression`.
Fixes #9997
Closes #10001

Main goals:

- Change the default `multi_class='ovr'` into `multi_class='multinomial'`.
- Change the default `solver='liblinear'`, since it regularizes the intercept.
- Add a `solver='auto'`.

Main solvers:

- `'liblinear'`: (current default) regularizes the intercept when `fit_intercept=True`, does not handle `multi_class='multinomial'`, and is the only solver handling `dual=True`.
- `'saga'`: the only solver handling `penalty='l1'` with `multi_class='multinomial'`.
- `'lbfgs'`: does not handle `penalty='l1'`.

Basic proposed solution:

The auto solver would simply select `'saga'` for `penalty='l1'`, and `'lbfgs'` for `penalty='l2'`. This is simple, but may not lead to the best performance.

Possible performance refinements (a rough sketch combining these is given below):

- Use `'liblinear'` when `fit_intercept=False` and `multi_class='ovr'`.
- Use `'liblinear'` when `dual=True`.
- Use `'saga'` for large datasets.
- Avoid `'lbfgs'` for unscaled features.
- Change `tol` and `max_iter` depending on the solver.

Some benchmark results to come...
@amueller and @NicolasHug don't hesitate to take over, as I may not have much time during the sprint.
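Finally, a rough sketch of the kind of heuristic a `solver='auto'` mode could use, combining the basic rule with the refinements listed above; the function name and the "large dataset" threshold are made up for illustration:

```python
def select_solver(penalty, dual, fit_intercept, multi_class, n_samples):
    """Illustrative 'auto' solver selection following the heuristics above."""
    if dual:
        return 'liblinear'        # only liblinear supports dual=True
    if not fit_intercept and multi_class == 'ovr':
        return 'liblinear'        # liblinear's intercept regularization is irrelevant here
    if penalty == 'l1':
        return 'saga'             # the only solver handling l1 with multinomial
    if n_samples > 100_000:
        return 'saga'             # saga tends to scale better on large datasets
    return 'lbfgs'                # default for l2; works best on scaled features
```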