[MRG] META Add Generalized Linear Models #9405


Closed
wants to merge 72 commits

Conversation

lorentzenchr
Member

@lorentzenchr lorentzenchr commented Jul 18, 2017

This PR adds Generalized Linear Models (full Tweedie family) with elastic net penalty, comparable to R's glmnet (see issue #5975). Further distributions of the exponential dispersion family can be added easily.
It also solves #11566.

Sorry for not posting this earlier.

I'm excited about your feedback.

TODO

  • loss function (deviance) for Tweedie family (Normal, Poisson, Gamma, ...)
  • link functions
  • estimator class GeneralizedLinearRegressor
  • elastic net penalties
  • solver: coordinate descent, irls
  • tests
  • documentation
  • example
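The Tweedie deviance loss in the first TODO item can be sketched in plain Python (a minimal illustration, not the PR's actual code; the unit deviance formulas are the standard ones for power parameter `p`, and the function name is made up here):

```python
import math

def tweedie_unit_deviance(y, mu, p):
    """Unit deviance d(y, mu) of the Tweedie family with power p.

    p=0: Normal, p=1: Poisson, p=2: Gamma.
    """
    if p == 0:                       # Normal: squared error
        return (y - mu) ** 2
    if p == 1:                       # Poisson (y*log(y/mu) -> 0 as y -> 0)
        return 2 * ((y * math.log(y / mu) if y > 0 else 0.0) - y + mu)
    if p == 2:                       # Gamma
        return 2 * (math.log(mu / y) + y / mu - 1)
    # general Tweedie power p (e.g. 1 < p < 2 for compound Poisson-Gamma)
    return 2 * (max(y, 0) ** (2 - p) / ((1 - p) * (2 - p))
                - y * mu ** (1 - p) / (1 - p)
                + mu ** (2 - p) / (2 - p))
```

All cases give a deviance of zero at `y == mu`, which is what makes the deviance usable as a loss function.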

Member

@jnothman jnothman left a comment

Overall, I think this would be something good to have, and the code and docstrings are high quality.

Yes, I think we want to focus on regularised models, so this probably won't be merged without the elastic net penalty.

However, if I were you, I'd start by getting the API right (strings preferred to custom objects), and by starting to sell the usefulness of glms ideally by composing an example that shows off its predictive and inferential capabilities on some real-world dataset. Maybe another example could generate datasets each of which requires a different family to get minimum loss.

And you'd do well with more tests. You might also want to fix up pep8 and keep it fixed by adding a checker to your editor.

Thanks!
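The dataset-generation idea above, each dataset best fit by a different family, could look roughly like this (a sketch with numpy; the function name and parameters are illustrative, not part of the PR):

```python
import numpy as np

rng = np.random.RandomState(0)

def make_glm_dataset(family, n_samples=1000, n_features=3):
    """Draw a regression target from the given family with a log link."""
    X = rng.uniform(-1, 1, size=(n_samples, n_features))
    coef = rng.uniform(-0.5, 0.5, size=n_features)
    mu = np.exp(X @ coef)                # inverse of the log link
    if family == "poisson":
        y = rng.poisson(mu)              # non-negative integer counts
    elif family == "gamma":
        y = rng.gamma(shape=2.0, scale=mu / 2.0)   # positive, mean mu
    else:                                # "normal"
        y = rng.normal(loc=mu, scale=0.5)
    return X, y
```

Fitting each dataset with every family and comparing deviances would then show that the matching family attains the minimum loss.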

Member

@jnothman jnothman left a comment

Please also add one or more entries to doc/modules/classes.rst so that the API reference docs are generated.

# So far, coefficients=beta and weight=w (as standard literature)
# TODO: Add l2-penalty
# TODO: Add l1-penalty (elastic net)
# TODO: Add cross validation
Member

What do you intend to implement cross validation for?

Member Author

Like LogisticRegressionCV, once penalties are added ...

lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Jul 19, 2017
* Fixed pep8
* Fixed flake8
* Rename GeneralizedLinearModel as GeneralizedLinearRegressor
* Use of six.with_metaclass
* PEP257: summary should be on same line as quotes
* Docstring of class GeneralizedLinearRegressor: \ before mu
* Arguments family and link accept strings
* Use of ConvergenceWarning
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Jul 19, 2017
* GeneralizedLinearRegressor added to doc/modules/classes.rst
@lorentzenchr
Member Author

@jnothman Thank you very much for your fast and encouraging feedback. I tried my best to make the suggested changes. For sure, I'd like to include more tests, better documentation, examples and penalties (like glmnet). But I will be away for the next month.

@agramfort
Member

agramfort commented Jul 19, 2017 via email

@jnothman
Member

@agramfort, I think the point here is to provide Poisson regression etc, and to leave more specialised tools alone. Is there an issue with that?

@lorentzenchr, rather than creating CV variants, I would focus on supporting warm_start. From there a generic cv adaptation is fairly straightforward.

And a month's break is fine. Plenty to go on with here.
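The warm_start-based path fitting suggested above can be sketched generically (the `ToyRidge` estimator here is a stand-in for illustration only, not a real scikit-learn class; a true warm-started iterative solver would reuse the previous coefficients as its starting point):

```python
import numpy as np

class ToyRidge:
    """Stand-in estimator with a warm_start flag (illustrative only)."""
    def __init__(self, alpha=1.0, warm_start=False):
        self.alpha = alpha
        self.warm_start = warm_start
        self.coef_ = None

    def fit(self, X, y):
        # An iterative solver would start from self.coef_ when
        # warm_start=True; the closed-form ridge solution is used here
        # just to keep the sketch short.
        n_features = X.shape[1]
        A = X.T @ X + self.alpha * np.eye(n_features)
        self.coef_ = np.linalg.solve(A, X.T @ y)
        return self

def path_scores(est, X, y, alphas):
    """Fit along a decreasing alpha path, reusing coefficients."""
    scores = []
    for alpha in sorted(alphas, reverse=True):
        est.alpha = alpha
        est.fit(X, y)                    # warm start reuses est.coef_
        resid = y - X @ est.coef_
        scores.append(float(resid @ resid))
    return scores
```

Wrapping such a path in cross-validation then needs no dedicated CV estimator class, which is the point of preferring warm_start over CV variants.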

@agramfort
Member

agramfort commented Jul 20, 2017 via email

@jmschrei
Member

Isn't there a great deal of similarity between the backends of these models that could be captured with a single GLM object? I agree that, since we already have some GLMs, we shouldn't re-invent the wheel, but it seems like there would be an explosion of linear models if we name them all separately.

@agramfort
Member

agramfort commented Jul 20, 2017 via email

@jnothman
Member

jnothman commented Jul 20, 2017 via email

@agramfort
Member

how are we going to document this?

choose between:

LogisticRegression(solver='newton-cg')
or
GeneralizedLinearRegressor(family=Log, solver='newton-cg')

?

see also https://github.com/glm-tools/pyglmnet

also lightning has a lot of super efficient code.

@lorentzenchr do you have the energy to unify all this?

@amueller
Member

@agramfort the same issue exists already with SGDClassifier as @jnothman points out. It's not ideal, but I think we should not expose the same solver in different places if we can avoid it.

So maybe not add log here?

@agramfort
Member

agramfort commented Jul 23, 2017 via email

@lorentzenchr
Member Author

lorentzenchr commented Aug 8, 2017

Let me try to list some thoughts I had in mind with this PR:

  • Functionality of glmnet in R would be great in sklearn
  • The deviance of an Exponential Dispersion Model (EDM) can serve as a loss function. Examples: If it's a count process, the Poisson deviance might be better suited than squared error. If it's a heavy tailed process, the Gamma or Inverse Gaussian deviance might be a good choice (the higher the Tweedie parameter the more heavy-tailed the distribution). These loss functions could be used by other estimators like GBRT.
  • GLMs with penalties are quite generic, so I implemented a generic class GeneralizedLinearRegressor capable of dealing with all EDMs and all link functions (if family=Binomial and link=Logistic were implemented, they would give logistic regression). I would prefer to keep this generic class/functionality and to also provide a simple PoissonRegressor (since this might be the most common use case) and maybe a TweedieRegressor (since this covers most of the commonly used EDMs).
  • My personal preference for dealing with exposure is to model the scaled target and provide weights, e.g. for Poisson: model y=n/w with n=counts and w=weight or exposure (observation time or number of trials or ...). This seems more general to me for all kinds of EDM distributions/families and links, compared to offsets.

@agramfort I certainly do not have the energy to unify this all, but I hope I'm not alone ;-)
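The exposure convention in the last bullet above can be made concrete with an intercept-only Poisson model: fitting y = n/w with sample weights w recovers the classical rate estimate sum(n)/sum(w). A minimal check (illustrative code with a made-up helper name, not the PR's implementation):

```python
def weighted_mean_rate(counts, exposure):
    """Weighted Poisson MLE for a constant rate.

    Models the scaled target y = n/w with sample weights w; the
    weighted mean of y equals total counts / total exposure.
    """
    y = [n / w for n, w in zip(counts, exposure)]
    return sum(w * yi for w, yi in zip(exposure, y)) / sum(exposure)
```

The same weighting generalizes to any EDM family and link, whereas an offset term is specific to log-link models.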

lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Aug 9, 2017
* fixed bug: init parameter max_iter
* fix API for family and link:
  default parameter changed to string
  non public variables self._family_instance and self._link_instance
* fixed bug in score, minus sign forgotten
* added check_is_fitted to estimate_phi and score
* added check_array(X) in predict
* replaced lambda functions in TweedieDistribution
* some documentation
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Sep 18, 2017
* make raw docstrings where appropriate
* make ExponentialDispersionModel (i.e. TweedieDistribution) picklable:
  ExponentialDispersionModel has new properties include_lower_bound,
  method in_y_range is not abstract anymore.
* set self.intercept_=0 if fit_intercept=False, such that it is always defined.
* set score to D2, a generalized R2 with deviance instead of squared error,
  as does glmnet. This also solves issues with
  check_regressors_train(GeneralizedLinearRegressor), which assumes R2 score.
* change of names: weight to weights in ExponentialDispersionModel and to
  sample_weight in GeneralizedLinearRegressor
* add class method linear_predictor
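The D² score mentioned in the commit message above generalizes R² by replacing squared error with deviance: D² = 1 − D(y, ŷ) / D(y, ȳ). A sketch using the Poisson deviance (illustrative only, not the PR's code):

```python
import math

def poisson_deviance(y, mu):
    """Total Poisson deviance: sum of unit deviances d(y_i, mu_i)."""
    return sum(2 * ((yi * math.log(yi / mi) if yi > 0 else 0.0) - yi + mi)
               for yi, mi in zip(y, mu))

def d2_score(y, y_pred):
    """D^2: fraction of deviance explained (1.0 for a perfect fit,
    0.0 for the constant model that predicts the mean of y)."""
    y_mean = sum(y) / len(y)
    dev_null = poisson_deviance(y, [y_mean] * len(y))
    return 1.0 - poisson_deviance(y, y_pred) / dev_null
```

With the Normal family (squared-error deviance), D² reduces to the ordinary R², which is why it is a drop-in replacement for the default score.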
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Sep 18, 2017
* added L2 penalty
* api change: alpha, l1_ratio, P1, P2, warm_start, check_input, copy_X
* added entry in user guide
* improved docstrings
* helper function _irls_step
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Sep 18, 2017
* fix some bugs in user guide linear_model.rst
* fix some pep8 issues in test_glm.py
@rth
Member

rth commented Jul 16, 2019

Thanks @lorentzenchr, I addressed your comments. Also synced changes in #14300.

@adrinjalali adrinjalali modified the milestones: 0.22, 0.23 Oct 29, 2019
@adrinjalali adrinjalali added and then removed the High Priority label Oct 29, 2019
@jboarman

jboarman commented Nov 6, 2019

It's a real bummer that the milestone has to be pushed back. The addition of these GLM models is REALLY exciting news, so I'm saddened to have to tell the team that we can't expect to use this for at least 6(?) months.

Anyhow, I'm super appreciative of the tutorial that @lorentzenchr has put together. VERY nice work! :D

@amueller
Member

amueller commented Nov 6, 2019

@jboarman you could use the code now by using the branch, and you could use it once it's merged by using the sklearn dev version ;) but yes, it will only be in a release in about 6 months.

@jnothman
Member

jnothman commented Nov 6, 2019 via email

@lorentzenchr
Member Author

@jboarman Thank you so much for your encouraging feedback. As I've been pushing this feature for over 2.5 years, I learned to be patient 😏 And everyone is doing her/his best.
BTW, the to-be-merged PR is now #14300.

@jboarman

jboarman commented Nov 6, 2019

Thanks guys ... Obviously a lot of ppl are interested in this and working hard to merge in this PR. I shouldn't complain! 😊

@lorentzenchr
Member Author

The tutorial https://github.com/lorentzenchr/Tutorial_freMTPL2 now has spatial smoothing and thus shows a good use case for, and the intention behind, the penalty matrix P2.

@rth
Member

rth commented Mar 4, 2020

A minimal implementation of this PR with L2 penalty and LBFGS solver is now merged in #14300 and should be available in the next release. I have opened follow-up issues for some of the other features included here.

The easiest might be to keep this PR open for visibility, but address them in separate smaller PRs.

@adrinjalali
Member

Removing from the milestone, since parts of this are already in.

@adrinjalali adrinjalali removed this from the 0.23 milestone Apr 21, 2020
@lorentzenchr lorentzenchr marked this pull request as draft August 19, 2020 09:37
@cmarmo cmarmo added the Superseded (PR has been replaced by a newer PR) label Oct 22, 2020
@cmarmo cmarmo removed the Superseded (PR has been replaced by a newer PR) label Oct 30, 2020
Base automatically changed from master to main January 22, 2021 10:49
@jjerphan
Member

jjerphan commented Aug 3, 2022

@lorentzenchr: just for confirmation before closing this PR; has all the content of this PR been integrated into scikit-learn, or are there still some pieces that have not been integrated yet?

@lorentzenchr
Member Author

@jjerphan With the current release, we only have the minimal GLM implementation and a lot is still missing. I think we have issues for all those additional features so we could close this PR (which is old with lots of details that I would do differently today).
The most important ones for me are #16637 and #16634 (a faster solver than lbfgs, no matter if irls or another one).
