
RFC LogisticRegression code is way too opaque #11865


Open
amueller opened this issue Aug 20, 2018 · 8 comments
Labels
Enhancement Moderate Anything that requires some knowledge of conventions and best practices module:linear_model RFC

Comments

@amueller
Member

After the release we should refactor the logistic regression code.
Right now, path computation, optimizers and OVR are all a tangled mess.

@amueller amueller added Enhancement help wanted Moderate Anything that requires some knowledge of conventions and best practices labels Aug 20, 2018
@amueller
Member Author

Also, using the term 'ovr' in the binary case (which after #11859 will be everywhere) is very confusing.

@alexsieusahai

alexsieusahai commented Aug 23, 2018

I'll take this on.

@jnothman
Member

jnothman commented Aug 23, 2018 via email

@alexsieusahai

Awesome; thanks for letting me know!

@lorentzenchr
Member

I agree with the assessment of the code being opaque. Are there any more actionable insights except for better naming of "ovr" in the binary case?

@lorentzenchr lorentzenchr changed the title from "LogisticRegression code is way too opaque" to "RFC LogisticRegression code is way too opaque" on Feb 25, 2023
@lorentzenchr
Member

lorentzenchr commented Feb 25, 2023

Trying to improve a few things in the linear models, I got stuck in the swamp of _logistic.py and test_logistic.py. TBH, I can't and won't maintain that code, and this despite the fact that, given how much functionality it covers, the code is quite compact. (IMHO, the HGBT code, for instance, is long but very understandable and maintainable.)

If I were to rewrite it without keeping legacy, I would do:

  • Use penalty parametrization as in ElasticNet
  • Remove "ovr". If a solver supports the full multinomial loss, there is no reason to use OvR, is there? And if a solver does not support multinomial, there is no choice anyway.
    Alternative option: new class for OvR classifiers. And we already have OneVsRestClassifier
    DEP deprecate multi_class in LogisticRegression #28703
  • Use a single best penalty for LogisticRegressionCV not one per class (which might stem from the OvR legacy)
  • Get rid of (most of) joblib parallelism
  • Big overhaul of the tests
  • ...
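On the first bullet: ElasticNet parametrizes the penalty with `alpha` (overall strength) and `l1_ratio` (L1/L2 mix), whereas LogisticRegression uses `C`, an *inverse* regularization strength applied against a summed (not averaged) data term. A rough correspondence between the two, assuming the usual scaling conventions (this mapping is an illustration of the proposal, not a documented contract), can be sketched as:

```python
# Hedged sketch of the two penalty parametrizations being compared.
# ElasticNet objective (per sklearn docs):
#   (1 / (2 * n_samples)) * ||y - Xw||^2
#   + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2
# LogisticRegression objective uses C as inverse strength on a *summed* loss,
# so the approximate mapping is alpha ~ 1 / (C * n_samples).

def c_to_alpha(C, n_samples):
    """Approximate ElasticNet-style alpha for a given LogisticRegression C."""
    return 1.0 / (C * n_samples)

def alpha_to_c(alpha, n_samples):
    """Inverse of the mapping above."""
    return 1.0 / (alpha * n_samples)

# Round-trip sanity check of the conversion
C = 2.0
n = 100
assert abs(alpha_to_c(c_to_alpha(C, n), n) - C) < 1e-12
```

Adopting the `alpha`/`l1_ratio` form would make the penalty strength independent of how the data term is scaled, which is presumably part of the motivation here.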

@scikit-learn/core-devs ping for discussion.

@adrinjalali
Member

Is this one of those cases, like what we did with model_selection, where it would be easier to implement in a new module somehow?

I like the ideas you have here, but I'm not sure why the parallelism needs to go away or why it's not helpful.

I think it'd be nice to have maintainable code there.

@GaelVaroquaux
Member

@lorentzenchr The changes that you are proposing seem so significant that, should we go in this direction, it is probably better to ramp up a new object to ensure smooth transition to our users (rather than an abrupt break of backward compatibility).

However, I feel that most of the changes that you propose are a net loss of useful functionality for the user.

With regards to the penalization parametrization, I clearly agree with you, and there are several places in scikit-learn where I would like to change how we parametrize our penalties (for instance, in the ridge, it should account for a norm of the design matrix, in order to have good defaults). One question is: how to do this smoothly? It can be addressed, but I am not sure that it will simplify the code.

With regards to ovr, I think that it is actually an important aspect of these objects, in particular the CV one. In my experience, in multiclass settings one often needs a different penalty per class, for instance because the prevalence of the classes varies widely. One could argue that the OneVsRestClassifier meta-estimator means the corresponding functionality can be implemented via a pipeline, but that is much harder to do. Having good patterns that are easy for the user to implement seems like a priority.
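For reference, the per-class-penalty pattern described above can already be approximated with the existing meta-estimator: wrapping LogisticRegressionCV in OneVsRestClassifier fits one binary model per class, each cross-validating its own C. A minimal sketch (the argument values are illustrative, not a recommendation):

```python
# Hedged sketch: recovering per-class penalties via OneVsRestClassifier,
# rather than via LogisticRegression's internal multi_class='ovr' path.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary LogisticRegressionCV per class; each sub-estimator
# selects its own regularization strength C over a small grid.
clf = OneVsRestClassifier(LogisticRegressionCV(Cs=5, max_iter=1000))
clf.fit(X, y)

# Each fitted binary estimator carries its own cross-validated C_
per_class_C = [est.C_[0] for est in clf.estimators_]
print(per_class_C)  # three values, one per iris class
```

The counterargument in the thread is that this composition is harder for users to discover and wire up than a built-in per-class option on the CV estimator itself.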

With regards to parallelism, I actually think that parallelism is something very important, and that we should be working on having more parallelism in scikit-learn. Modern computers have many cores (my laptop has 8, many compute nodes have much more).
