[MRG] META Add Generalized Linear Models #9405


Closed
wants to merge 72 commits

Conversation

lorentzenchr
Member

@lorentzenchr lorentzenchr commented Jul 18, 2017

This PR adds Generalized Linear Models (full Tweedie family) with elastic net penalty, comparable to R's glmnet (see issue #5975). Further distributions of the exponential dispersion family can be added easily.
It also solves #11566.

Sorry for not posting this earlier.

I'm excited about your feedback.

TODO

  • loss function (deviance) for Tweedie family (Normal, Poisson, Gamma, ...)
  • link functions
  • estimator class GeneralizedLinearRegressor
  • elastic net penalties
  • solver: coordinate descent, irls
  • tests
  • documentation
  • example
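The Tweedie deviance loss in the first TODO item can be sketched in plain Python (a minimal illustration, not the PR's actual code; the unit deviance formulas are the standard ones for power parameter `p`, and the function name is made up here):

```python
import math

def tweedie_unit_deviance(y, mu, p):
    """Unit deviance d(y, mu) of the Tweedie family with power p.

    p=0: Normal, p=1: Poisson, p=2: Gamma.
    """
    if p == 0:                       # Normal: squared error
        return (y - mu) ** 2
    if p == 1:                       # Poisson (y*log(y/mu) -> 0 as y -> 0)
        return 2 * ((y * math.log(y / mu) if y > 0 else 0.0) - y + mu)
    if p == 2:                       # Gamma
        return 2 * (math.log(mu / y) + y / mu - 1)
    # general Tweedie power p (e.g. 1 < p < 2 for compound Poisson-Gamma)
    return 2 * (max(y, 0) ** (2 - p) / ((1 - p) * (2 - p))
                - y * mu ** (1 - p) / (1 - p)
                + mu ** (2 - p) / (2 - p))
```

All cases give a deviance of zero at `y == mu`, which is what makes the deviance usable as a loss function.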

Member

@jnothman jnothman left a comment

Overall, I think this would be something good to have, and the code and docstrings are high quality.

Yes, I think we want to focus on regularised models, so this probably won't be merged without the elastic net penalty.

However, if I were you, I'd start by getting the API right (strings preferred to custom objects), and by starting to sell the usefulness of glms ideally by composing an example that shows off its predictive and inferential capabilities on some real-world dataset. Maybe another example could generate datasets each of which requires a different family to get minimum loss.

And you'd do well with more tests. You might also want to fix up pep8 and keep it fixed by adding a checker to your editor.

Thanks!
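The dataset-generation idea above, each dataset best fit by a different family, could look roughly like this (a sketch with numpy; the function name and parameters are illustrative, not part of the PR):

```python
import numpy as np

rng = np.random.RandomState(0)

def make_glm_dataset(family, n_samples=1000, n_features=3):
    """Draw a regression target from the given family with a log link."""
    X = rng.uniform(-1, 1, size=(n_samples, n_features))
    coef = rng.uniform(-0.5, 0.5, size=n_features)
    mu = np.exp(X @ coef)                # inverse of the log link
    if family == "poisson":
        y = rng.poisson(mu)              # non-negative integer counts
    elif family == "gamma":
        y = rng.gamma(shape=2.0, scale=mu / 2.0)   # positive, mean mu
    else:                                # "normal"
        y = rng.normal(loc=mu, scale=0.5)
    return X, y
```

Fitting each dataset with every family and comparing deviances would then show that the matching family attains the minimum loss.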

Member

@jnothman jnothman left a comment

Please also add one or more entries to doc/modules/classes.rst so that the API reference docs are generated.

# So far, coefficients=beta and weight=w (as standard literature)
# TODO: Add l2-penalty
# TODO: Add l1-penalty (elastic net)
# TODO: Add cross validation
Member

What do you intend to implement cross validation for?

Member Author

Like LogisticRegressionCV, once penalties are added ...

lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Jul 19, 2017
* Fixed pep8
* Fixed flake8
* Rename GeneralizedLinearModel as GeneralizedLinearRegressor
* Use of six.with_metaclass
* PEP257: summary should be on same line as quotes
* Docstring of class GeneralizedLinearRegressor: \ before mu
* Arguments family and link accept strings
* Use of ConvergenceWarning
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Jul 19, 2017
* GeneralizedLinearRegressor added to doc/modules/classes.rst
@lorentzenchr
Member Author

@jnothman Thank you very much for your fast and encouraging feedback. I tried my best to make the suggested changes. For sure, I'd like to include more tests, better documentation, examples and penalties (like glmnet). But I will be away for the next month.

@agramfort
Member

agramfort commented Jul 19, 2017 via email

@jnothman
Member

@agramfort, I think the point here is to provide Poisson regression etc, and to leave more specialised tools alone. Is there an issue with that?

@lorentzenchr, rather than creating CV variants, I would focus on supporting warm_start. From there a generic cv adaptation is fairly straightforward.

And a month's break is fine. Plenty to go on with here.
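The warm_start-based path fitting suggested above can be sketched generically (the `ToyRidge` estimator here is a stand-in for illustration only, not a real scikit-learn class; a true warm-started iterative solver would reuse the previous coefficients as its starting point):

```python
import numpy as np

class ToyRidge:
    """Stand-in estimator with a warm_start flag (illustrative only)."""
    def __init__(self, alpha=1.0, warm_start=False):
        self.alpha = alpha
        self.warm_start = warm_start
        self.coef_ = None

    def fit(self, X, y):
        # An iterative solver would start from self.coef_ when
        # warm_start=True; the closed-form ridge solution is used here
        # just to keep the sketch short.
        n_features = X.shape[1]
        A = X.T @ X + self.alpha * np.eye(n_features)
        self.coef_ = np.linalg.solve(A, X.T @ y)
        return self

def path_scores(est, X, y, alphas):
    """Fit along a decreasing alpha path, reusing coefficients."""
    scores = []
    for alpha in sorted(alphas, reverse=True):
        est.alpha = alpha
        est.fit(X, y)                    # warm start reuses est.coef_
        resid = y - X @ est.coef_
        scores.append(float(resid @ resid))
    return scores
```

Wrapping such a path in cross-validation then needs no dedicated CV estimator class, which is the point of preferring warm_start over CV variants.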

@agramfort
Member

agramfort commented Jul 20, 2017 via email

@jmschrei
Member

Isn't there a great deal of similarity between the backends of these models that could be captured with a single GLM object? I agree that, since we already have some GLMs, we shouldn't re-invent the wheel, but it seems like there would be an explosion of linear models if we name them all separately.

@agramfort
Member

agramfort commented Jul 20, 2017 via email

@jnothman
Member

jnothman commented Jul 20, 2017 via email

@agramfort
Member

how are we going to document this?

choose between:

LogisticRegression(solver='newton-cg')
or
GeneralizedLinearRegressor(family=Log, solver='newton-cg')

?

see also https://github.com/glm-tools/pyglmnet

also lightning has a lot of super efficient code.

@lorentzenchr do you have the energy to unify all this?

@amueller
Member

@agramfort the same issue exists already with SGDClassifier as @jnothman points out. It's not ideal, but I think we should not expose the same solver in different places if we can avoid it.

So maybe not add log here?

@agramfort
Member

agramfort commented Jul 23, 2017 via email

@lorentzenchr
Member Author

lorentzenchr commented Aug 8, 2017

Let me try to list some thoughts I had in mind with this PR:

  • Functionality of glmnet in R would be great in sklearn
  • The deviance of an Exponential Dispersion Model (EDM) can serve as a loss function. Examples: If it's a count process, the Poisson deviance might be better suited than squared error. If it's a heavy tailed process, the Gamma or Inverse Gaussian deviance might be a good choice (the higher the Tweedie parameter the more heavy-tailed the distribution). These loss functions could be used by other estimators like GBRT.
  • GLMs with penalties are quite generic, so I implemented a generic class GeneralizedLinearRegressor capable of dealing with all EDMs and all link functions (if family=Binomial and link=Logistic were implemented, they would give logistic regression). I would prefer to keep this generic class/functionality and to also provide a simple PoissonRegressor (since this might be the most common use case) and maybe a TweedieRegressor (since this covers most of the commonly used EDMs).
  • My personal preference for dealing with exposure is to model the scaled target and provide weights, e.g. for Poisson: model y=n/w with n=counts and w=weight or exposure (observation time or number of trials or ...). This seems more general to me for all kinds of EDM distributions/families and links, compared to offsets.

@agramfort I certainly do not have the energy to unify this all, but I hope I'm not alone ;-)
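The exposure convention in the last bullet above can be made concrete with an intercept-only Poisson model: fitting y = n/w with sample weights w recovers the classical rate estimate sum(n)/sum(w). A minimal check (illustrative code with a made-up helper name, not the PR's implementation):

```python
def weighted_mean_rate(counts, exposure):
    """Weighted Poisson MLE for a constant rate.

    Models the scaled target y = n/w with sample weights w; the
    weighted mean of y equals total counts / total exposure.
    """
    y = [n / w for n, w in zip(counts, exposure)]
    return sum(w * yi for w, yi in zip(exposure, y)) / sum(exposure)
```

The same weighting generalizes to any EDM family and link, whereas an offset term is specific to log-link models.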

lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Aug 9, 2017
* fixed bug: init parameter max_iter
* fix API for family and link:
  default parameter changed to string
  non public variables self._family_instance and self._link_instance
* fixed bug in score, minus sign forgotten
* added check_is_fitted to estimate_phi and score
* added check_array(X) in predict
* replaced lambda functions in TweedieDistribution
* some documentation
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Sep 18, 2017
* make raw docstrings where appropriate
* make ExponentialDispersionModel (i.e. TweedieDistribution) picklable:
  ExponentialDispersionModel has new properties include_lower_bound,
  method in_y_range is not abstract anymore.
* set self.intercept_=0 if fit_intercept=False, such that it is always defined.
* set score to D2, a generalized R2 with deviance instead of squared error,
  as does glmnet. This also solves issues with
  check_regressors_train(GeneralizedLinearRegressor), which assumes R2 score.
* change of names: weight to weights in ExponentialDispersionModel and to
  sample_weight in GeneralizedLinearRegressor
* add class method linear_predictor
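The D² score mentioned in the commit message above generalizes R² by replacing squared error with deviance: D² = 1 − D(y, ŷ) / D(y, ȳ). A sketch using the Poisson deviance (illustrative only, not the PR's code):

```python
import math

def poisson_deviance(y, mu):
    """Total Poisson deviance: sum of unit deviances d(y_i, mu_i)."""
    return sum(2 * ((yi * math.log(yi / mi) if yi > 0 else 0.0) - yi + mi)
               for yi, mi in zip(y, mu))

def d2_score(y, y_pred):
    """D^2: fraction of deviance explained (1.0 for a perfect fit,
    0.0 for the constant model that predicts the mean of y)."""
    y_mean = sum(y) / len(y)
    dev_null = poisson_deviance(y, [y_mean] * len(y))
    return 1.0 - poisson_deviance(y, y_pred) / dev_null
```

With the Normal family (squared-error deviance), D² reduces to the ordinary R², which is why it is a drop-in replacement for the default score.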
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Sep 18, 2017
* added L2 penalty
* api change: alpha, l1_ratio, P1, P2, warm_start, check_input, copy_X
* added entry in user guide
* improved docstrings
* helper function _irls_step
lorentzenchr pushed a commit to lorentzenchr/scikit-learn that referenced this pull request Sep 18, 2017
* fix some bugs in user guide linear_model.rst
* fix some pep8 issues in test_glm.py
@rth
Member

rth commented Jul 16, 2019

Thanks @lorentzenchr, I addressed your comments. Also synced changes in #14300.

@adrinjalali adrinjalali modified the milestones: 0.22, 0.23 Oct 29, 2019
@adrinjalali adrinjalali added and then removed the High Priority label Oct 29, 2019
@jboarman

jboarman commented Nov 6, 2019

It's a real bummer that the milestone has to be pushed back. The addition of these GLM models is REALLY exciting news, so I'm saddened to have to tell the team that we can't expect to use this for at least 6(?) months.

Anyhow, I'm super appreciative of the tutorial that @lorentzenchr has put together. VERY nice work! :D

@amueller
Member

amueller commented Nov 6, 2019

@jboarman you could use the code now by using the branch, and you could use it once it's merged by using the sklearn dev version ;) but yes, it will only be in a release in about 6 months.

@jnothman
Member

jnothman commented Nov 6, 2019 via email

@lorentzenchr
Member Author

@jboarman Thank you so much for your encouraging feedback. As I've been pushing this feature for over 2.5 years, I learned to be patient 😏 And everyone is doing her/his best.
BTW, the to-be-merged PR is now #14300.

@jboarman

jboarman commented Nov 6, 2019

Thanks guys ... Obviously a lot of ppl are interested in this and working hard to merge in this PR. I shouldn't complain! 😊

@lorentzenchr
Member Author

The tutorial https://github.com/lorentzenchr/Tutorial_freMTPL2 now has spatial smoothing and thus shows a good use case for, and the intention behind, the penalty matrix P2.

@rth
Member

rth commented Mar 4, 2020

A minimal implementation of this PR with L2 penalty and LBFGS solver is now merged in #14300 and should be available in the next release. I have opened follow-up issues for some of the other features included here.

The easiest might be to keep this PR open for visibility, but address them in separate smaller PRs.

@adrinjalali
Member

Removing from the milestone, since parts of this are already in.

@adrinjalali adrinjalali removed this from the 0.23 milestone Apr 21, 2020
@lorentzenchr lorentzenchr marked this pull request as draft August 19, 2020 09:37
@cmarmo cmarmo added the Superseded (PR has been replaced by a newer PR) label Oct 22, 2020
@cmarmo cmarmo removed the Superseded (PR has been replaced by a newer PR) label Oct 30, 2020
Base automatically changed from master to main January 22, 2021 10:49
@jjerphan
Member

jjerphan commented Aug 3, 2022

@lorentzenchr: just for confirmation before closing this PR; has all the content of this PR been integrated into scikit-learn, or are there still some pieces that have not been integrated yet?

@lorentzenchr
Member Author

@jjerphan With the current release, we only have the minimal GLM implementation and a lot is still missing. I think we have issues for all those additional features so we could close this PR (which is old with lots of details that I would do differently today).
The most important ones for me are #16637 and #16634 (a faster solver than lbfgs, no matter if irls or another one).
