
[MRG] Change default solver in LogisticRegression #11476


Closed
TomDLT wants to merge 20 commits from the logistic_default branch

Conversation

@TomDLT (Member) commented Jul 11, 2018

This is about changing the default solver in LogisticRegression.
Fixes #9997
Closes #10001

Main goals:

  • We want to change the default multi_class='ovr' into multi_class='multinomial'.
  • We want to change the default away from solver='liblinear', since it regularizes the intercept.
  • We want to use the change of default solver to introduce a new solver='auto'.

Main solvers:

  • 'liblinear': (current default)
    • Regularizes the intercept, which is wrong; we should avoid it when fit_intercept=True.
    • Does not handle multi_class='multinomial'.
    • Only solver which handles dual=True.
  • 'saga':
    • Only solver which handles penalty='l1' with multi_class='multinomial'.
    • Very fast for large datasets, dense or sparse.
    • May suffer slower convergence when the features are not scaled.
    • May suffer slower convergence when there are large-norm outlier samples. (This might be improved with a better step-size heuristic, or with an adaptive step-size.)
  • 'lbfgs':
    • Good default, but not the fastest for large datasets.
    • Does not handle penalty='l1'.

Basic proposed solution:
The auto solver would simply select 'saga' for penalty='l1', and 'lbfgs' for penalty='l2'.
This is simple, but may not lead to the best performance (a sketch combining the refinements listed below follows that list).

Possible performance refinements:

  • Select 'liblinear' when fit_intercept=False and multi_class='ovr'.
  • Select 'liblinear' when dual=True.
  • Select 'saga' for large datasets.
  • Select 'lbfgs' for unscaled features.
  • Change tol and max_iter depending on the solver.
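A sketch of what such a refined heuristic could look like (the function name, the dataset-size threshold and the exact ordering of the rules are illustrative assumptions, not part of this PR):

def select_solver(penalty, dual, fit_intercept, multi_class, n_samples):
    # Illustrative only: rules taken from the lists above, threshold made up.
    if dual:
        return 'liblinear'  # only solver supporting dual=True
    if not fit_intercept and multi_class == 'ovr':
        return 'liblinear'  # intercept regularization is not a concern here
    if penalty == 'l1':
        return 'saga'       # only solver handling l1 (including multinomial)
    if n_samples > 100000:
        return 'saga'       # fast on large datasets, dense or sparse
    return 'lbfgs'          # solid general-purpose default for l2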

Some benchmark results to come...


@amueller and @NicolasHug don't hesitate to take over, as I may not have much time during the sprint.

@TomDLT TomDLT force-pushed the logistic_default branch from d8b5fda to eb70f8a Compare July 12, 2018 15:19
@TomDLT (Member, Author) commented Jul 13, 2018

[Benchmark plots omitted: solver comparisons for multinomial + l2, ovr + l1 and ovr + l2, each on the 20news, digits, iris, mnist and rcv1 datasets.]

@TomDLT TomDLT force-pushed the logistic_default branch from eb70f8a to 8f708da Compare July 13, 2018 08:54
@jnothman (Member) commented:

Thanks a lot for this.

  • It may be a bit unfair to compare liblinear timings to the others without comparing them in a LogisticRegressionCV context (with StratifiedShuffleSplit(n_splits=1)), and ideally one where intercept_scaling is also optimised for liblinear (at least for the binary classification case); a possible setup is sketched below.
  • I'm still a bit undecided about 'auto' changing without notice. Basically we want to be able to say that results should be minimally different if we change things, so that they are equivalent to runtime efficiency enhancements which we make without notice. But I'm not sure how to characterise "minimally different".

I'm still trying to review any interesting trends in the plots...
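As a rough sketch of that comparison (the dataset choice and grid size are placeholders, and the intercept_scaling search is omitted):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_digits(return_X_y=True)
cv = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# Time each solver under the same single-split CV protocol.
for solver in ('liblinear', 'lbfgs', 'saga'):
    clf = LogisticRegressionCV(Cs=10, cv=cv, solver=solver, max_iter=1000)
    clf.fit(X, y)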

@jnothman (Member) commented:

Have we given up on this for 0.20? :(

@TomDLT (Member, Author) commented Jul 25, 2018

A minimal change that could still make it into the release would be:

if solver == 'auto':
    if penalty == 'l1':
        solver = 'saga'
    else:
        solver = 'lbfgs'

This change is probably not optimal, but I am not comfortable adding more heuristics if we have a strong contract for the 'auto' mode.

Otherwise we can either:

  • have a weak contract with the 'auto' mode, i.e. warn that it might change without notice
  • use FutureWarnings at each change of the 'auto' behavior
  • carefully tune the 'auto' mode to avoid changing it in the future, but this would probably need more benchmarking, and I won't have a lot of time before the release.
    Feel free to take over if someone has enough time.

@amueller (Member) commented:

The three changes at the beginning are kind of independent, right? Do we want to do a FutureWarning that we're gonna move to multinomial? That's the thing we care most about, right?

@TomDLT (Member, Author) commented Jul 26, 2018

Well, the three changes are not entirely independent. We can't change only the default to multi_class='multinomial', since LIBLINEAR does not handle the multinomial loss.

But we can limit changes to:

  • Warn that we will change the default to multi_class='multinomial' with a FutureWarning.
  • Warn that we will change the default to solver='lbfgs' with a FutureWarning.
  • Postpone the 'auto' mode to the next release.

@TomDLT TomDLT force-pushed the logistic_default branch from 8806dd6 to 25f343d Compare July 27, 2018 14:18
@jnothman (Member) commented:

It is very tempting to take that simplified lbfgs approach...

@TomDLT TomDLT force-pushed the logistic_default branch from 0906fe5 to cc53212 Compare July 29, 2018 08:32
@TomDLT TomDLT force-pushed the logistic_default branch 3 times, most recently from b8a118a to 7f9da90 Compare July 29, 2018 19:57
@TomDLT TomDLT force-pushed the logistic_default branch from 7f9da90 to 1e21983 Compare July 30, 2018 08:46
@TomDLT TomDLT changed the title [WIP] Change default solver in LogisticRegression [MRG] Change default solver in LogisticRegression Jul 30, 2018
@TomDLT TomDLT force-pushed the logistic_default branch from e6c776e to 28317e6 Compare July 30, 2018 10:01
@TomDLT (Member, Author) commented Jul 30, 2018

So I removed the 'auto' solver, and the PR is ready to be merged. It includes:

  • Warn that we will change the default to multi_class='multinomial' with a FutureWarning.
  • Warn that we will change the default to solver='lbfgs' with a FutureWarning.
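Assuming the warnings are only raised when the parameters are left at their defaults, users can silence them (and opt in to the future behaviour) by passing the values explicitly, e.g.:

from sklearn.linear_model import LogisticRegression

# Setting solver and multi_class explicitly avoids the FutureWarnings.
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')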

@TomDLT TomDLT added this to the 0.20 milestone Jul 30, 2018
@TomDLT TomDLT force-pushed the logistic_default branch from 28317e6 to 85f0ecc Compare July 30, 2018 10:06
@jnothman (Member) commented Aug 19, 2018 via email

@jnothman (Member) commented Aug 19, 2018 via email

@amueller (Member) commented:

I have to dig through that. It seems very weird to me to offer two different versions in the binary case, as they should be reparametrizations of each other, right?

@amueller (Member) commented:

So yes, I think the right fix would be to never do multinomial in the binary case as we're just doing redundant computations, right?

@rwolst (Contributor) commented Aug 20, 2018

I disagree with forcing binary logistic regression when multi_class='multinomial' is used, for the reasons below.

Preliminary
To be clear, I will refer to binary logistic regression as the model with prediction function

Pr(y=1 | x) = exp(intercept_ + coef_ @ x)/(1 + exp(intercept_ + coef_ @ x))

and I will refer to binary softmax regression as the model with prediction function

Pr(y=1 | x) = exp(intercept_ + coef_ @ x)/(exp(-intercept_ - coef_ @ x) + exp(intercept_ + coef_ @ x))

where y is the dependent variable and x are the observed independent variables.

Reasons

  1. Note that binary softmax regression generalises to softmax regression, whereas binary logistic regression does not (it generalises to one-versus-rest regression). When someone uses multi_class='multinomial', I believe they are explicitly asking for a softmax regression, so even in the binary case they should get a softmax regression. In fact, I would guess the most likely reason anyone performs a binary softmax regression is that they have coded a general algorithm that, for some use cases, only has two classes in the input. Forcing binary logistic regression will break this if their algorithm then wants to use the learnt parameters intercept_ and coef_.

  2. There is more to this than a binary softmax regression just being another representation of binary logistic regression. You may argue that the binary logistic regression predictor can be written

     f(x) = exp(intercept_/2 + coef_/2 @ x)/(exp(-intercept_/2 - coef_/2 @ x) + exp(intercept_/2 + coef_/2 @ x))
    

where coef_ and intercept_ are the values learnt in a binary logistic regression, so in the worst case one could simply divide both coef_ and intercept_ by 2 to get the binary softmax regression coefficients. However, as soon as we introduce, for example, L2 regularisation, coef_ is penalised (non-linearly) more heavily than coef_/2 and -coef_/2, and the transformation between the two methods is no longer exact. Hence the two methods solve different optimisation problems.
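A quick numerical check of that reparametrisation argument in the unregularised case (the numbers below are made up for illustration):

import numpy as np

x = np.array([0.5, -1.2, 3.0])
coef_ = np.array([0.8, -0.3, 0.1])
intercept_ = 0.2

# Binary logistic regression probability.
z = intercept_ + coef_ @ x
p_logistic = np.exp(z) / (1 + np.exp(z))

# Binary softmax regression probability with the halved parameters.
z_half = intercept_ / 2 + (coef_ / 2) @ x
p_softmax = np.exp(z_half) / (np.exp(-z_half) + np.exp(z_half))

print(np.isclose(p_logistic, p_softmax))  # True: equivalent without regularisation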

Solution
Unfortunately I'm not in the best position to comment on what to do, as I don't know the codebase well enough. The LogisticRegression code was not really designed around the separate ideas of logistic and softmax regression; in fact, as per #9889, the documentation of coef_ even says that coef_ has shape (1, n_features) when the given problem is binary. I believe this is wrong for the binary softmax regression case, which should generalise to the standard softmax regression case (so coef_ should have shape (2, n_features)), but as mentioned this would change the API and potentially have large knock-on effects.

If there is going to be a big change however, my suggestion would be

  1. If multi_class='multinomial', give the user softmax regression, i.e. if there are k classes, they get a coef_ matrix of shape (k, n_features).
  2. If multi_class='ovr', give the user one-versus-rest regression, i.e. if there are k classes, they get a coef_ matrix of shape (k-1, n_features).

I also certainly agree that the default should be multi_class='multinomial': it has much more sensible regularisation and generalises to neural networks.
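For context, with the current API both multi_class modes expose coef_ of shape (n_classes, n_features) on a multiclass problem; a quick check on iris (expected shapes in the comments):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

ovr = LogisticRegression(multi_class='ovr', solver='liblinear').fit(X, y)
multi = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000).fit(X, y)

print(ovr.coef_.shape)    # (3, 4)
print(multi.coef_.shape)  # (3, 4)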

@amueller (Member) commented:

Thanks for the quick reply @rwolst.

So my main concern is about the default behavior. We will switch the default behavior to multinomial, and so people with binary problems would get the softmax formulation, which is probably not what they expect. We could have an auto mode to implement what you're suggesting and allow users to explicitly ask for softmax.
Or would you argue the softmax is a good default? I don't see a benefit for the user in general: it's slower and it's surprising.

For your Preliminaries, the soft-max should have two different intercepts and weight vectors, right?

For 1)

In fact I would guess the most likely reason anyone is performing a binary softmax regression is because they have coded a general algorithm that for some use cases only has two classes in the input.

They wouldn't be explicitly asking for it; as I said, it would be the default behavior. And if the main reason someone would use this is to use the weights in their custom implementation, then I think that's a good argument to exclude it. This is not really a use-case we support or should worry about.

For 2)
This is called LogisticRegression, and not SoftmaxRegression, and so I think the default behavior should be that of vanilla logistic regression (our default of using regularization is a bit of a tangential topic). I want the default behavior to be the thing the user expects.

Finally, in your conclusion I don't understand why you'd want to change the shape in ovr to k-1. Does that even work? And having the coef_ shape depend on multi_class will result in a giant heap of trouble.

There are three things we can do:

  1. Leave it as is, and use binary softmax instead of binary logistic regression by default in logistic regression (and have people set the multi_class parameter to get standard logistic regression)
  2. Use binary logistic regression by default but allow explicitly asking for soft-max regression, possibly by having multi_class='auto'
  3. Don't support binary softmax regression in LogisticRegression and just ignore multi_class in the binary case.

I argued for 3). I think 1) would be very strange. We could do 2) but the way the parameters are named is really weird - why should the formulation of the binary problem be determined by the multi_class parameter? That's very counter-intuitive. Also, I don't see a strong argument for implementing softmax regression.
This code is already way too complex imho, and adding a new parameter to allow a relatively obscure formulation seems not worth it.

@rwolst (Contributor) commented Aug 20, 2018 via email

@amueller (Member) commented:

I guess it depends on the community the user is from, but I think the OvR solution is very uncommon both in ML and stats. It was mostly introduced in liblinear from what I know. My understanding is that most textbooks that describe logistic regression use multinomial logistic regression as the standard multi-class case.

And while k-1 is technically the minimal number of parameters needed to represent the OvR solution, I don't think that actually works. How would you learn that?

@amueller (Member) commented:

"as I would expect multiclass logistic regression to be one versus rest (as it is in the binary case)."

I'm not sure I actually understand that sentence. What do you mean by "one vs rest in the binary case"?

@amueller (Member) commented:

I think #11859 is good now?

@rth (Member) commented Aug 20, 2018

I pushed a commit to address some of the most straightforward of @amueller's comments. I also removed the sentence about the change of default from the docstring, since it is already stated in the .. versionchanged directive 3 lines below, but maybe I shouldn't have.

For the other comments, someone should push the remaining fixes, once it is decided what should be done :)

@rwolst (Contributor) commented Aug 20, 2018 via email

@rwolst (Contributor) commented Aug 20, 2018 via email

@amueller (Member) commented:

The point of OvR is to learn independent classifiers using a binary model. OvR is a reduction of the multi-class problem to the binary problem. So if you optimize as in your first comment, that's not OvR. It's a new formulation that's certainly possible but that is not a reduction and it's unclear to me why anyone would use that.

And saying "The binary logistic regression definition I used is essentially one versus rest." doesn't make sense to me. One-vs-rest means reducing a multi-class problem to a binary problem. It's not a description of a particular likelihood.
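To make the distinction concrete, a bare-bones sketch of OvR as a reduction to independent binary problems (roughly what OneVsRestClassifier automates; illustrative only):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One independent binary logistic regression per class ("one vs rest").
binary_models = [LogisticRegression(solver='liblinear').fit(X, y == c) for c in classes]

# Predict the class whose binary model is most confident.
scores = np.column_stack([m.decision_function(X) for m in binary_models])
y_pred = classes[np.argmax(scores, axis=1)]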

@amueller (Member) commented:

Anyway, do you think there's an important use-case for having a binary softmax classifier? I think just providing logistic regression should be sufficient.

@rwolst (Contributor) commented Aug 20, 2018 via email

@jnothman (Member) commented Aug 20, 2018 via email

@agramfort (Member) commented Aug 21, 2018 via email

@amueller (Member) commented:

@agramfort as the final solution? That behavior is basically impossible to discover I'd say.

@jnothman (Member) commented Aug 22, 2018

I could throw together a PR, perhaps tonight, which:

  • adopts the changes to solver here
  • deprecates multi_class
  • adds a parameter softmax with values False (implying binarised), True (implying softmax always), and 'multiclass' (implying softmax only if non-binary). While deprecating multi_class, the default value would be 'compat', which reverts to the multi_class behaviour.
  • for softmax=True, we would adopt the multinomial behaviour currently in master. That is, for binary y, coef_ would have shape (1, n_features) (I'm still uncertain if this is correct) and predict_proba would renormalise. I am taking this approach primarily because it is already implemented and tested.
  • Instead of softmax we could consider calling the new parameter multinomial. WDYT?

It's a pity that this involves a deprecation, although in some ways that simplifies the current PR.

Would this be acceptable to all parties?
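For concreteness, a minimal sketch of how the proposed parameter could be resolved internally (the helper name and the exact return values are hypothetical; it only illustrates the mapping described above):

def resolve_formulation(softmax, multi_class, n_classes):
    # 'compat' is the deprecation-period default: keep the old multi_class behaviour.
    if softmax == 'compat':
        return multi_class
    if softmax is True:
        return 'multinomial'  # softmax always, even for binary y
    if softmax == 'multiclass':
        # softmax only when the problem is genuinely multiclass
        return 'multinomial' if n_classes > 2 else 'ovr'
    return 'ovr'              # softmax=False: binarised / one-vs-rest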

@jnothman (Member) commented:

@amueller, I'd love to get your +1 or otherwise on that plan before you holiday!!

@amueller (Member) commented:

That is probably the most reasonable course of action if we want to keep the binary softmax.
I'm not entirely sure what the compat behavior would be.
So softmax=True would mean OVR? That seems odd, as OVR doesn't really use a softmax.

@jnothman (Member) commented Aug 22, 2018 via email
