Skip to content

Minimal Generalized linear models implementation (L2 + lbfgs) #14300

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 292 commits into from
Mar 4, 2020

Conversation

rth
Copy link
Member

@rth rth commented Jul 9, 2019

A minimal implementation of Generalized Linear Models (GLM) from #9405 by @lorentzenchr

This only includes L2 penalty with the lbfgs solver. In other words, in excludes L1 penalties (or CD solver), matrix penalties, warm start, some distributions (e.g. BinomialDistribution), newton-cg & irls solvers from the original GLM PR.

The goals is to get an easier to review initial implementation. Benchmarks were done in #9405 (comment)

TODO

Christian Lorentzen and others added 30 commits February 17, 2019 18:38
* smarter initialization of intercept

* PEP 257 -- Docstring Conventions

* minor docstring changes
* P2 also accepts 1d array and interprets it as diagonal matrix

* improved input checks for P1 and P2
…hecks

* adapt examples of GeneralizedLinearModel to new defaults for
  P1, P2 and selection

* fix precision/decimal issue in test_poisson_enet

* use more robust least squares instead of solve in IRLS

* fix sign error in input checks
    * add Binomial distribution

    * add Logit link

    * tests for binomial against LogisticRegression

    * option 'auto' for link

    * reduce code duplication by replacing @abstractproperty by @Property
* refactor into function _irls_solver

* refactor into function _cd_solver

* replace of safe_sparse_dot by matmul operator @

* more efficient handling of fisher matrix

* sparse coo matrices are converted to csc or csr

* sample weights don't except sparse matrices

* minor doc changes
* renamed option irls into guess

* removed option least_squares

* updated tests
* new estimator GeneralizedLinearRegressor
* loss functions for Tweedie family and Binomial
* elasitc net penalties
* control of penalties by matrix P2 and vector P1
* new solvers: coordinate descent, irls
* tests
* documentation
* example for Poisson regression
* make glm module private
* typos
* improve tests
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made minor pushes to compare the perf of the 2 models side by side in the example (293214c)

CI is green, LGTM. I think most of the comments are either minor or addressed.

Thank you very much for the great work and for your patience @lorentzenchr @rth . Much appreciated. Looking forward to seeing the future improvements!

Feel free to review my last changes and merge when you see fit

@lorentzenchr
Copy link
Member

And the second prize of chocolate goes to ... @NicolasHug 🥈 🍫 🎉

@rth
Copy link
Member Author

rth commented Mar 1, 2020

Thanks a lot for the review @NicolasHug and for quickly addressing comments @lorentzenchr !

Let's give it another couple of days in case @ogrisel has other suggestions.

For the person who is going to merge this please edit the merge commit description to remove individual commits but to keep one of each of Co-authored-by tags as discussed in #16550 to keep attributions.

@rth rth closed this Mar 1, 2020
@rth rth reopened this Mar 1, 2020
@rth
Copy link
Member Author

rth commented Mar 1, 2020

(Sorry accidental key press, updated message above with the missing part).

@jnothman
Copy link
Member

jnothman commented Mar 2, 2020

Very excited for this! Sorry I've not been able to review in a while.

@lorentzenchr
Copy link
Member

@ogrisel As a last remark: How do you feel about reversing the sorting of the x-axis in the last plot of plot_tweedie_regression_insurance_claims.py from "safest to riskiest" to "riskiest to safest"? This way, both examples would have a comparable "rank" plot at the end.

@rth
Copy link
Member Author

rth commented Mar 4, 2020

OK, merging with +2 and several partial reviews. Thanks @lorentzenchr and to all reviewers! I'm really glad this is done.

How do you feel about reversing the sorting of the x-axis in the last plot of plot_tweedie_regression_insurance_claims.py from "safest to riskiest" to "riskiest to safest"? This way, both examples would have a comparable "rank" plot at the end.

Feel free to open a follow up PR. Also looking forward to other additional features we could add from #9405. Will comment there.

@rth rth merged commit 69ea066 into scikit-learn:master Mar 4, 2020
@rth rth deleted the GLM-minimal branch March 4, 2020 13:47
@agramfort
Copy link
Member

🍺 !

@lorentzenchr
Copy link
Member

Many thanks to all of you! This was hard work and with enough perseverance we got it done. 👏

@jnothman
Copy link
Member

jnothman commented Mar 5, 2020

Awesome! Might be worth announcing on the mailing list and asking people to take it for a ride.

@jnothman
Copy link
Member

jnothman commented Mar 5, 2020 via email

@rth
Copy link
Member Author

rth commented Mar 5, 2020

Awesome! Might be worth announcing on the mailing list and asking people to take it for a ride.

@lorentzenchr Would you like to write that announcement email? Users can try it with nightly wheels https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds and the dev documentation https://scikit-learn.org/dev/modules/linear_model.html#generalized-linear-regression

@lorentzenchr
Copy link
Member

@rth Thanks for asking. As I'm not (yet) on the scikit-learn mailing list and as I can't find a precedent for new features as template, I'd rather let you go ahead 😊

ashutosh1919 pushed a commit to ashutosh1919/scikit-learn that referenced this pull request Mar 13, 2020
…ikit-learn#14300)

Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
…ikit-learn#14300)

Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants