
Minimal Generalized linear models implementation (L2 + lbfgs) #14300


Merged
merged 292 commits into from
Mar 4, 2020
292 commits
919912c
Smarter intercept initialization and docstring improvements
Feb 17, 2019
01033e3
Fix false formula in starting_mu and improve start_params
Feb 20, 2019
4071a8a
Improve argument handling of P1 and P2
Feb 20, 2019
757bc3c
Fix doctest, test_poisson_enet, change IRLS to use lstsq, fix input c…
Feb 20, 2019
ed8e74f
Use pytest decorators and pytest.raises
Feb 23, 2019
fe876da
Add Logistic regression=Binomial + Logit
Feb 24, 2019
2993e03
More efficient sparse matrices and refactor of irls and cd solver
Apr 7, 2019
a6f9f13
Treat the intercept separately, i.e. X, P1, P2 never include intercept
Apr 20, 2019
c9a7a95
Revised option start_params
Apr 21, 2019
a7755de
Fix a few typos
rth Jun 4, 2019
9aa1fc4
Make module private
rth Jun 4, 2019
ca3eae2
Working on tests
rth Jun 4, 2019
61bc6b8
Improve tests
rth Jun 5, 2019
b24a7ca
Remove unused dec parameter in tests
rth Jun 5, 2019
f95b390
ENH: add Generalized Linear Models, issue #5975
Jul 18, 2017
09176b4
MAINT: merge branch 'GLM-impr' of https://github.com/rth/scikit-learn
Jun 9, 2019
def12ae
[MAINT] make glm private, fix typos, improve tests
Jun 9, 2019
9b574bd
Fix docstrings for the new print_changed_only=True by default
rth Jun 11, 2019
90299fd
Increase coverage
rth Jun 12, 2019
e3a5a9a
More tests and addressing some review comments
rth Jun 12, 2019
54b80b8
TST More specific checks of error messages in tests
rth Jun 13, 2019
e962859
Merge branch 'master' into GLM
rth Jun 27, 2019
7db0320
Add PoissonRegressor alias
rth Jun 14, 2019
dcfe9ed
TST Simplify comparison with ridge
rth Jun 27, 2019
4879bb6
EXA Add plot_tweedie_regression_insurance_claims.py
rth Jun 28, 2019
56069e5
EXA Fix issues with older pandas versions in example
rth Jun 28, 2019
53f3c5f
DOC Add second poisson regression example
rth Jul 9, 2019
ac1fef3
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 9, 2019
be5a3c4
Add GeneralizedHyperbolicSecant and BinomialDistributions
rth Jul 9, 2019
e67fecb
Remove start params option
rth Jul 9, 2019
62f4448
Remove L1 penalty and CD solver
rth Jul 9, 2019
d25042e
Remove newton CG algorithm
rth Jul 9, 2019
07ee495
Remove fisher_matrix, _observed_information and _eta_mu_score_fisher
rth Jul 9, 2019
d0eb285
Remove matrix L2 penalty and IRLS solver
rth Jul 9, 2019
1e4b538
Remove plot_poisson_spline_regression.py example
rth Jul 9, 2019
3265148
Remove random_state parameter
rth Jul 9, 2019
1862ab6
Lint
rth Jul 9, 2019
4154074
Fix docstring
rth Jul 10, 2019
c5d77d7
Remove unused core
rth Jul 10, 2019
9ab5ac2
Update examples/linear_model/plot_poisson_regression_non_normal_loss.py
rth Jul 13, 2019
e4d0be1
Update examples/linear_model/plot_poisson_regression_non_normal_loss.py
rth Jul 13, 2019
6ff4d58
Update doc/modules/linear_model.rst
rth Jul 13, 2019
13102d5
Update doc/modules/linear_model.rst
rth Jul 13, 2019
af89e52
Update doc/modules/linear_model.rst
rth Jul 13, 2019
3802420
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 13, 2019
ddc4b71
Use scipy.optimize.minimize interface for LBFGS optimizer
rth Jul 13, 2019
426ae1d
EXA wording and score in plot_tweedie_regression_insurance_claims.html
Jul 14, 2019
a404384
Address review comments
rth Jul 15, 2019
65796a3
Review comments on the documentation
rth Jul 16, 2019
e44afe7
Split the implementation into several files
rth Jul 16, 2019
5927379
Fix CI
rth Jul 16, 2019
a6df2a7
Add test_deviance_derivative
rth Jul 16, 2019
5af89a7
Fix sklearn/linear_model/setup.py
rth Jul 16, 2019
cd347d4
Remove variance and variance_derivative methods from distributions
rth Jul 17, 2019
0d7f9cd
Improve coverage
rth Jul 17, 2019
dbffad8
Remove mentions of the binomial distribution
rth Jul 17, 2019
d914ab2
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 19, 2019
3187204
Use common simple weight validation
rth Jul 19, 2019
cc03c1a
Simplify comments formatting
rth Jul 19, 2019
4d433d1
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 19, 2019
aa52b4a
Refactor to use TweedieDistribition in metrics
rth Jul 22, 2019
816aa8f
WIP
rth Jul 25, 2019
6500c81
Use Poisson deviance in examples
rth Jul 25, 2019
59a6d9d
Use PoissonRegressor and GammaRegressor in examples
rth Jul 25, 2019
03a8a2d
Improve documentation wording
rth Jul 26, 2019
bbf7f38
Use dataframe OpenML fetcher
rth Jul 26, 2019
49a3a8e
Refactor distibution bounds
rth Jul 26, 2019
228e8c8
Move deviance checks under destribution
rth Jul 26, 2019
09a57c9
Expose TweedieRegressor
rth Jul 26, 2019
4b485ca
Improve documentation
rth Jul 26, 2019
aa0adf1
Lint
rth Jul 26, 2019
abd47d7
Fix __init__
rth Jul 30, 2019
c65ac12
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 31, 2019
7a9d067
Update doc/modules/linear_model.rst
rth Aug 2, 2019
18b4503
Update doc/modules/linear_model.rst
rth Aug 2, 2019
29658d6
Update doc/modules/linear_model.rst
rth Aug 2, 2019
1ea70d3
Fix typos in documentation
Aug 7, 2019
efdcb5b
Update doc/modules/linear_model.rst
rth Aug 9, 2019
ef0d063
Update doc/modules/linear_model.rst
rth Aug 9, 2019
0125e1c
Update doc/modules/linear_model.rst
rth Aug 9, 2019
6a8a600
Update examples/linear_model/plot_poisson_regression_non_normal_loss.py
rth Aug 9, 2019
73f3bd1
Rename inverse.gaussian to inverse-gaussian
rth Aug 9, 2019
11b178f
Remove sample_weight parameter from predict
rth Aug 9, 2019
3806fbe
Remove redundant check_array in predict
rth Aug 9, 2019
ae1c672
Update doc/modules/linear_model.rst
Aug 11, 2019
f07c831
Remove dispersion
Aug 11, 2019
e34fb57
Merge branch 'upstream/master' into GLM-minimal
Aug 13, 2019
ebbbe9c
Update doc/modules/linear_model.rst
Aug 13, 2019
918e257
Update doc/modules/linear_model.rst
Aug 13, 2019
5236cd8
Merge branch 'GLM-minimal' of https://github.com/rth/scikit-learn int…
Aug 13, 2019
37d0f47
Use double `` when necessary
rth Aug 16, 2019
9c337f2
ax -> axes in plot_poisson_regression_non_normal_loss.py
rth Aug 16, 2019
5e05935
Update sklearn/linear_model/_glm/distribution.py
rth Aug 16, 2019
4a68213
Remove solver=auto
rth Aug 16, 2019
8ee5c85
Update sklearn/linear_model/_glm/glm.py
rth Aug 16, 2019
b106c25
Merge branch 'GLM-minimal' of github.com:rth/scikit-learn into GLM-mi…
rth Aug 16, 2019
a1f8aab
More review comments
rth Aug 16, 2019
c0999ea
Addressing reviews in tests
rth Aug 16, 2019
e09e336
More comments in tests
rth Aug 16, 2019
6601d30
Update linear_model.rst
Aug 17, 2019
81eabe3
Merge upstream/master into GLM-minimal
Aug 17, 2019
5174dae
Address check_is_fitted deprication of attributes
Aug 17, 2019
61dc13f
No LaTeX in docstrings
Aug 17, 2019
44524ca
Replace Tweedie p->power
Aug 17, 2019
58d2409
Replace Tweedie p->power
Aug 17, 2019
ee351e1
Fix tests due to Tweedie p->power
Aug 17, 2019
33fe9be
Simplify super(...)
Aug 18, 2019
94272e7
Replace Link.link(..) by __call__(..)
Aug 18, 2019
2457039
Replace 1. -> 1
Aug 18, 2019
6396d2c
Fix table in TweedieRegressor
Aug 18, 2019
8be0387
Improve docstring in plot_tweedie_regression_insurance_claims.py
rth Aug 22, 2019
da66fd5
Use train_test_split in tests
rth Aug 22, 2019
b9bc170
Fix TODO in test_warm_start
rth Aug 22, 2019
ab6c5d8
Revert "No LaTeX in docstrings"
rth Aug 22, 2019
b424a07
Remove n_iter_ check when warm start.
rth Aug 22, 2019
95a9058
Rename variable L2 -> coef_scaled
rth Aug 22, 2019
59eceb4
Minor fixes
rth Aug 22, 2019
12a5067
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Aug 28, 2019
04f30f4
Better wording in example
rth Aug 28, 2019
3630b52
Improvements in plot_poisson_regression_non_normal_loss.py
rth Aug 28, 2019
516eadb
Improvements in plot_tweedie_regression_insurance_claims.py
rth Aug 28, 2019
5e14928
Drop unused ExponentialDispersionModel._upper_bound
rth Aug 28, 2019
6cc1df5
Move notes and references from docstrings to user manual
rth Aug 28, 2019
752d6aa
More explanatory comments in the code
rth Aug 28, 2019
38a4ad4
Fix requires_positive_y tag
rth Aug 28, 2019
c15a1cc
Remove Link.inverse_derivative2
rth Aug 28, 2019
37de07b
Rename p to power parameter in mean_tweedie_deviance
rth Aug 30, 2019
adbf997
Rename predicted mean mu to y_pred
rth Aug 30, 2019
47dbc84
Fix link parameter documentation in TweedieRegression
rth Aug 30, 2019
3b526e9
EXA Use a simpler pipeline for GBDT in poisson regression example
rth Aug 30, 2019
b1eb611
Minor fixes for user guide
Sep 1, 2019
d964c01
EXA Poisson: minor changes
Sep 1, 2019
a1844b8
Fix mu->y_pred and p->power
Sep 2, 2019
f513392
EXA Tweedie: some improvements
Sep 3, 2019
84229a6
Fix doc test
Sep 3, 2019
dd22699
Merge branch 'master' into GLM-minimal
rth Sep 11, 2019
059aeb7
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Sep 11, 2019
8c6c255
Fix test
rth Sep 11, 2019
0a23313
EXA Use Ridge and remove eps
rth Sep 12, 2019
29964af
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Sep 16, 2019
976b436
Address comments in plot_poisson_regression_non_normal_loss.py
rth Sep 16, 2019
7c850d1
Lint
rth Sep 16, 2019
f64dc4a
Simplify plot_tweedie_regression_insurance_claims.py example
rth Sep 16, 2019
b1f5bde
Add "lift curve" for model validation in Poisson example
rth Sep 18, 2019
a9ab4e4
Various improvements to the model comparison example
ogrisel Sep 25, 2019
be7bb67
Add cumulated claims plot
ogrisel Sep 25, 2019
4125c20
Improve the cumulated nb claims plot
ogrisel Sep 26, 2019
0070d52
Fix wrong xlabel in histogram plot
ogrisel Sep 26, 2019
9d6bb52
More example improvements (preprocessors + plots)
ogrisel Sep 26, 2019
b353b2d
Simplify dataset + use more data
ogrisel Sep 26, 2019
88757fd
Remove solver parameter from {Poisson,Gamma,Tweedie}Regression
rth Sep 26, 2019
6d119d4
Revert some accidental changes from 88757fdb99cc516be230fe08ec1ebfb7b…
rth Sep 26, 2019
b735eb7
Additional comment about the use of properties with setters
rth Sep 26, 2019
2d91114
Add additional tests for link derivatives
rth Sep 26, 2019
89103bc
cosmits + typos
agramfort Sep 29, 2019
4f28a44
Address some of Alex's comments
rth Sep 30, 2019
d4dfd0b
Removing unnecessary comments / asarray call
rth Sep 30, 2019
64d6fbd
Update doc/modules/linear_model.rst
rth Oct 3, 2019
82ace9f
Remove unused solver parameter in tests
rth Oct 3, 2019
5288a0f
Add test for sample_weight consistency
rth Oct 3, 2019
499e8d2
Move GLM losses under sklearn._loss.glm_distribution
rth Oct 3, 2019
f4aa839
Update sklearn/linear_model/_glm/glm.py
rth Oct 3, 2019
48fcbe6
Add missing config.add_subpackage in setup.py
rth Oct 3, 2019
d71fb9f
Address Nicolas comments in the documentation (partial)
rth Oct 3, 2019
fa90272
More cleanups in the plot_tweedie_regression_insurance_claims.py example
rth Oct 3, 2019
4d16f31
Typos and text improvement in poisson example
Oct 6, 2019
15eb1d3
EXA sharey for histograms
Oct 6, 2019
3d097c6
Plot y_pred histograms on the test set
ogrisel Oct 8, 2019
6372287
Merge remote-tracking branch 'origin/master' into GLM-minimal
ogrisel Oct 8, 2019
b117856
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Oct 8, 2019
31f5b3d
Compound Poisson => Compound Poisson Gamma
ogrisel Oct 9, 2019
a498ff5
Compound Poisson => Compound Poisson Gamma
ogrisel Oct 9, 2019
3fae28a
Various improvement in Tweedie regression example
ogrisel Oct 9, 2019
a2b6841
Merge remote-tracking branch 'origin/master' into GLM-minimal
ogrisel Oct 9, 2019
a47798a
Update doc/modules/linear_model.rst
rth Oct 10, 2019
83391dd
Use latest docstring conventions everywhere
rth Oct 10, 2019
3bfb54e
Drop check_input parameter
rth Oct 10, 2019
d325fe2
Use keyword only arguments SLEP009
rth Oct 10, 2019
661cf56
Move _y_pred_deviance_derivative from losses as a private function
rth Oct 10, 2019
560c180
Fix cumulated claim amount curve in Tweedie regression example
ogrisel Oct 10, 2019
0ea2dce
PEP8
ogrisel Oct 10, 2019
4ca2e95
MNT remove function body in abstract methods
rth Oct 14, 2019
89b429d
Improvements to Pure Premium example
ogrisel Oct 14, 2019
2d0b195
s/ordered Lorenz/Lorenz/
ogrisel Oct 14, 2019
ea6a3e8
More doc improvements to the pure premium example
ogrisel Oct 14, 2019
ddae396
doc update, simplification and fixes
NicolasHug Oct 15, 2019
21572d9
put back doc for BaseLink removed by mistake
NicolasHug Oct 15, 2019
d7ff6f4
TST GLM/Ridge comparison with sample_weight (xfail for now)
rth Oct 17, 2019
939d240
TST More invariance checks for sample_weight
rth Oct 17, 2019
da0d2a6
Remove copy_X parameter
rth Oct 17, 2019
162fb3b
Minor doc improvements
Oct 17, 2019
cafc92f
DOC consistent coef_=w for OMP in user guide
Oct 17, 2019
d3235db
EXA typos
Oct 17, 2019
9401230
EXA set less extreme Tweedie power=1.9
Oct 17, 2019
5bc48e5
TST fix normal_ridge_comparison with sample_weight
Oct 17, 2019
e572c31
DOC advice against cv for Tweedie power in UG
Oct 20, 2019
8d70042
DOC improve advice for power cv
Oct 20, 2019
6dc5e34
Merger master into GLM-minimal
Nov 10, 2019
c479889
EXA rely on default of FunctionTransformer
Nov 10, 2019
3628cff
Merge remote-tracking branch 'rth/GLM-minimal' into GLM-minimal
Nov 10, 2019
88d150e
EXA print all columns of DataFrame with max_columns option
Nov 10, 2019
0d31f47
EXA improve wording Poisson target
Nov 10, 2019
3ab2877
EXA increase digits in scores
Nov 10, 2019
87de01b
EXA convergence issue solve with max_iter
Nov 10, 2019
072f4fb
Merge branch 'master' into GLM-minimal
Jan 29, 2020
fbc22d8
Update sklearn/metrics/_regression.py
rth Jan 29, 2020
27ae4a2
DOC Add what's new entry
rth Jan 29, 2020
51be7e1
Removed myself from what's new
NicolasHug Jan 30, 2020
25f0e53
CLN Minor link clean up
thomasjpfan Feb 4, 2020
5eddf9c
CLN one word too much
Feb 5, 2020
81068ce
Merge branch 'master' of github.com:scikit-learn/scikit-learn into GL…
NicolasHug Feb 28, 2020
c700acf
minor typo
NicolasHug Feb 28, 2020
a126f4a
remove unused symbol
NicolasHug Feb 28, 2020
82c4483
use j instead of i to index features
NicolasHug Feb 28, 2020
d3083db
removed redundant variable
NicolasHug Feb 28, 2020
f63f795
removed unused var
NicolasHug Feb 28, 2020
45de110
Added comment and basic test to family attribute
NicolasHug Feb 28, 2020
b8459b0
Added test for link='auto'
NicolasHug Feb 28, 2020
581b4d7
more comment about OneHotEncoder vs OrdinalEncoder for forests
NicolasHug Feb 28, 2020
bb75435
minor typos or formulations
NicolasHug Feb 28, 2020
94dfc00
Remove unused ExponentialDispersionModel.unit_variance_derivative
rth Feb 29, 2020
0810bf3
DOC more detailed description of the dataset used in examples
rth Feb 29, 2020
a90a0aa
EXA fix minor typo
Feb 29, 2020
f74ab96
EXA compare metric by metric
Feb 29, 2020
a349be7
Add examples of use-cases
rth Feb 29, 2020
497a76c
Add figure with Poisson, Gamma and Tweedie distributions
rth Feb 29, 2020
c8c902f
DOC fix minor typo
Feb 29, 2020
bda7ad6
DOC add point mass to plot
Feb 29, 2020
83c2ba6
TST remove futile arguments
Feb 29, 2020
7498f3e
TST increase rtol
Feb 29, 2020
90b1213
TST add fit_intercept to test_glm_identity_regression
Feb 29, 2020
697bda2
TST ignore one specific ConvergenceWarning
Feb 29, 2020
578408c
TST add fit_intercept to test_glm_log_regression
Feb 29, 2020
2668910
EXA comment on penalty strenght GLM vs Ridge
Feb 29, 2020
d1c3dc9
EXA fix nitpicks
Feb 29, 2020
a9686f6
EXA remove empty line
Feb 29, 2020
04e7aca
EXA add blank line after function definition E305
Feb 29, 2020
21a739c
Gamma -> Poisson
NicolasHug Feb 29, 2020
79ada1e
Used X @ coef as suggested by Roman
NicolasHug Feb 29, 2020
39eeb44
minimal addition to clearly separate links in example
NicolasHug Feb 29, 2020
e3cf69d
Added a few definitions from the insurance jargon
NicolasHug Feb 29, 2020
56aa0d7
minor comment
NicolasHug Feb 29, 2020
27a344c
maybe fixed doc
NicolasHug Mar 1, 2020
e817b2c
forgot these
NicolasHug Mar 1, 2020
0fdc518
Update comment about read-only family attribute
rth Mar 1, 2020
6d4ecb2
Update figures to illustrate that they are not defined for Y<0
rth Mar 1, 2020
b96e021
updated remaining comment about property
NicolasHug Mar 1, 2020
293214c
Compare perfs of models side by side in example
NicolasHug Mar 1, 2020
987239a
Shorten text for df to fit fully in width
NicolasHug Mar 1, 2020
edba3b8
Use context manager instead?
NicolasHug Mar 1, 2020
15 changes: 15 additions & 0 deletions doc/modules/classes.rst
@@ -837,6 +837,21 @@ Any estimator using the Huber loss would also be robust to outliers, e.g.
linear_model.RANSACRegressor
linear_model.TheilSenRegressor

Generalized linear models (GLM) for regression
----------------------------------------------

These models allow for response variables to have error distributions other
than a normal distribution:

.. autosummary::
:toctree: generated/
:template: class.rst

linear_model.PoissonRegressor
linear_model.TweedieRegressor
linear_model.GammaRegressor


Miscellaneous
-------------

149 changes: 146 additions & 3 deletions doc/modules/linear_model.rst
@@ -556,13 +556,13 @@ orthogonal matching pursuit can approximate the optimum solution vector with a
fixed number of non-zero elements:

.. math::
\underset{\gamma}{\operatorname{arg\,min\,}} ||y - X\gamma||_2^2 \text{ subject to } ||\gamma||_0 \leq n_{\text{nonzero\_coefs}}
\underset{w}{\operatorname{arg\,min\,}} ||y - Xw||_2^2 \text{ subject to } ||w||_0 \leq n_{\text{nonzero\_coefs}}

Alternatively, orthogonal matching pursuit can target a specific error instead
of a specific number of non-zero coefficients. This can be expressed as:

.. math::
\underset{\gamma}{\operatorname{arg\,min\,}} ||\gamma||_0 \text{ subject to } ||y-X\gamma||_2^2 \leq \text{tol}
\underset{w}{\operatorname{arg\,min\,}} ||w||_0 \text{ subject to } ||y-Xw||_2^2 \leq \text{tol}


OMP is based on a greedy algorithm that includes at each step the atom most
@@ -906,7 +906,7 @@ with 'log' loss, which might be even faster but requires more tuning.
It is possible to obtain the p-values and confidence intervals for
coefficients in cases of regression without penalization. The `statsmodels
package <https://pypi.org/project/statsmodels/>`_ natively supports this.
Within sklearn, one could use bootstrapping instead as well.


:class:`LogisticRegressionCV` implements Logistic Regression with built-in
@@ -928,6 +928,149 @@ to warm-starting (see :term:`Glossary <warm_start>`).
.. [9] `"Performance Evaluation of Lbfgs vs other solvers"
<http://www.fuzihao.org/blog/2016/01/16/Comparison-of-Gradient-Descent-Stochastic-Gradient-Descent-and-L-BFGS/>`_

.. _Generalized_linear_regression:

Generalized Linear Regression
=============================

Generalized Linear Models (GLM) extend linear models in two ways
[10]_. First, the predicted values :math:`\hat{y}` are linked to a linear
combination of the input variables :math:`X` via an inverse link function
:math:`h` as

.. math:: \hat{y}(w, X) = h(Xw).

Member:

Since this is an inverse function, would it make sense to denote it

$$h^{-1}$$

It would also be nice to suggest what this link function is doing. From what I saw, this is a link between the linear predictor and the mean of the distribution?

Member:

First, this is standard literature nomenclature: link function g: g(E[Y]) = Xw and inverse link function h: h(Xw) = g^{-1}(Xw) = E[Y].

Second, we had this info/equation in the user guide, but some reviewer wanted it gone 😢

Member:

I agree the link function g could be introduced e.g. below in the usage part.

I don't like using E[Y] because:

  • we don't talk about E[Y] for the rest of the linear models, so why here
  • it's actually E[Y|X], right? It seems to me that calling that E[Y] is another implicit nomenclature of GLMs, which we want to avoid in the UG precisely because it's implicit.

Member:

> this is standard literature nomenclature

We should go with it then :) (my remarks are from somebody who is not familiar with GLMs)

> Second, we had this info/equation in the user guide, but some reviewer wanted it gone

I was not thinking of going to this level of detail, but more about defining in a couple of words what the link function is. Regarding the mathematical section, I'm fine with it if the majority of reviewers thought it was overkill. I will not be pushy on this :)

Member:

@NicolasHug First excuse - it was late 😊 Second excuse, GLM literature is lazy and skips the conditional most of the time, e.g. E[Y] really means E[Y|X].

More seriously: maybe I missed that piece in the actual user guide, but I can't find a reference as to what regressors actually estimate/predict (most of the time E[Y|X]). That's why I bring this up again (last time, promise). Should we add that info somewhere in the UG?

Member:

> Should we add that info somewhere in the UG?

That's a possibility, yes. Though I would prefer doing that in another PR to make sure the UG is overall consistent, since this would be a non-trivial update.

In the meantime, I still think we can have a reasonable UG for the GLMs that doesn't mention E[Y|X]; we can just talk about predictions.

Secondly, the squared loss function is replaced by the unit deviance
:math:`d` of a distribution in the exponential family (or more precisely, a
reproductive exponential dispersion model (EDM) [11]_).

The minimization problem becomes:

.. math:: \min_{w} \frac{1}{2 n_{\text{samples}}} \sum_i d(y_i, \hat{y}_i) + \frac{\alpha}{2} ||w||_2^2,

where :math:`\alpha` is the L2 regularization penalty. When sample weights are
provided, the average becomes a weighted average.
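
As an illustration only (a minimal NumPy sketch under simplifying assumptions,
not the scikit-learn internals), this objective for the Poisson unit deviance
with a log link, ignoring the intercept, could be written as::

    import numpy as np

    def poisson_objective(coef, X, y, alpha):
        """1/(2*n_samples) * sum of Poisson unit deviances + alpha/2 * ||coef||_2^2."""
        coef, X, y = np.asarray(coef), np.asarray(X), np.asarray(y)
        y_pred = np.exp(X @ coef)                  # inverse link: h(Xw) = exp(Xw)
        ratio = np.where(y > 0, y / y_pred, 1.0)   # convention: y*log(y/y_pred) = 0 when y == 0
        dev = 2 * (y * np.log(ratio) - y + y_pred)
        return 0.5 * dev.mean() + 0.5 * alpha * coef @ coef

In the estimators added in this PR, an objective of this form is minimized with
the L-BFGS solver through `scipy.optimize.minimize`.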

The following table lists some specific EDMs and their unit deviance (all of
these are instances of the Tweedie family):

================= =============================== ============================================
Distribution Target Domain Unit Deviance :math:`d(y, \hat{y})`
================= =============================== ============================================
Normal :math:`y \in (-\infty, \infty)` :math:`(y-\hat{y})^2`
Poisson :math:`y \in [0, \infty)` :math:`2(y\log\frac{y}{\hat{y}}-y+\hat{y})`
Gamma :math:`y \in (0, \infty)` :math:`2(\log\frac{\hat{y}}{y}+\frac{y}{\hat{y}}-1)`
Inverse Gaussian :math:`y \in (0, \infty)` :math:`\frac{(y-\hat{y})^2}{y\hat{y}^2}`
================= =============================== ============================================

Member:

Would it make sense to add the Tweedie distribution to the list as well?

The first regressor below is the TweedieRegressor, and when reading the section this is the first time that the term occurs. So I was wondering if it makes sense to mention something about the distribution earlier?

Member:

I would say so. The only objection is having this info twice. Maybe a link to the mean_tweedie_deviance metric here suffices?
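
These unit deviances are also exposed through the
:func:`~sklearn.metrics.mean_tweedie_deviance` metric; a small illustrative
snippet::

    from sklearn.metrics import mean_tweedie_deviance

    y_true = [1.0, 2.0, 3.0]
    y_pred = [1.5, 1.5, 3.5]
    # power=0: squared error, power=1: Poisson deviance, power=2: Gamma deviance
    for power in (0, 1, 2):
        print(power, mean_tweedie_deviance(y_true, y_pred, power=power))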

The Probability Density Functions (PDF) of these distributions are illustrated
in the following figure,

.. figure:: ./glm_data/poisson_gamma_tweedie_distributions.png
:align: center
:scale: 100%

PDF of a random variable Y following Poisson, Tweedie (power=1.5) and Gamma
distributions with different mean values (:math:`\mu`). Observe the point
mass at :math:`Y=0` for the Poisson distribution and the Tweedie (power=1.5)
distribution, but not for the Gamma distribution which has a strictly
positive target domain.

The choice of the distribution depends on the problem at hand:

* If the target values :math:`y` are counts (non-negative integer valued) or
relative frequencies (non-negative), you might use a Poisson deviance
with log-link.
* If the target values are positive valued and skewed, you might try a
Gamma deviance with log-link.
* If the target values seem to be heavier tailed than a Gamma distribution,
you might try an Inverse Gaussian deviance (or even higher variance powers
of the Tweedie family).


Examples of use cases include:

* Agriculture / weather modeling: number of rain events per year (Poisson),
amount of rainfall per event (Gamma), total rainfall per year (Tweedie /
Compound Poisson Gamma).
* Risk modeling / insurance policy pricing: number of claim events /
policyholder per year (Poisson), cost per event (Gamma), total cost per
policyholder per year (Tweedie / Compound Poisson Gamma).
* Predictive maintenance: number of production interruption events per year:
Poisson, duration of interruption: Gamma, total interruption time per year
(Tweedie / Compound Poisson Gamma).


.. topic:: References:

.. [10] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models,
Second Edition. Boca Raton: Chapman and Hall/CRC. ISBN 0-412-31760-5.

.. [11] Jørgensen, B. (1992). The theory of exponential dispersion models
and analysis of deviance. Monografias de matemática, no. 51. See also
`Exponential dispersion model.
<https://en.wikipedia.org/wiki/Exponential_dispersion_model>`_

Usage
-----

:class:`TweedieRegressor` implements a generalized linear model for the
Tweedie distribution, which allows modeling any of the above-mentioned
distributions using the appropriate ``power`` parameter. In particular:

- ``power = 0``: Normal distribution. Specific estimators such as
:class:`Ridge`, :class:`ElasticNet` are generally more appropriate in
this case.
- ``power = 1``: Poisson distribution. :class:`PoissonRegressor` is exposed
for convenience. However, it is strictly equivalent to
`TweedieRegressor(power=1, link='log')`.
- ``power = 2``: Gamma distribution. :class:`GammaRegressor` is exposed for
convenience. However, it is strictly equivalent to
`TweedieRegressor(power=2, link='log')`.
- ``power = 3``: Inverse Gaussian distribution.

The link function is determined by the `link` parameter.

Usage example::

>>> from sklearn.linear_model import TweedieRegressor
>>> reg = TweedieRegressor(power=1, alpha=0.5, link='log')
>>> reg.fit([[0, 0], [0, 1], [2, 2]], [0, 1, 2])
TweedieRegressor(alpha=0.5, link='log', power=1)
>>> reg.coef_
array([0.2463..., 0.4337...])
>>> reg.intercept_
-0.7638...
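
As a quick illustration of the equivalence noted above (reusing the toy data
from the usage example; the fitted values themselves are not asserted here)::

    import numpy as np
    from sklearn.linear_model import PoissonRegressor, TweedieRegressor

    X = [[0, 0], [0, 1], [2, 2]]
    y = [0, 1, 2]
    poisson = PoissonRegressor(alpha=0.5).fit(X, y)
    tweedie = TweedieRegressor(power=1, link='log', alpha=0.5).fit(X, y)
    # Both estimators optimize the same objective with the same solver.
    assert np.allclose(poisson.coef_, tweedie.coef_)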


.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_linear_model_plot_poisson_regression_non_normal_loss.py`
* :ref:`sphx_glr_auto_examples_linear_model_plot_tweedie_regression_insurance_claims.py`

Practical considerations
------------------------

The feature matrix `X` should be standardized before fitting. This ensures
that the penalty treats features equally.

Member:

Would we always apply standard-scaling, or should one also consider robust-scaling and so on?

Member:

Since we can have L-BFGS running, shall we have a similar concern as in #15556 and #15583?

Member:

Yes, and we should apply the same method. It will remove the need for scaling from an optimization perspective. Unscaled data means that the penalization is weirdly weighted, but if we don't scale we should make sure the algorithm still converges if the data is not scaled.

Member:

Good point. I have added an item to the todo list of this PR.

Member Author:

> Would we always apply standard-scaling, or should one also consider robust-scaling and so on?

I think it's standard practice to say that the data should be standardized or scaled. Which scaler to use and whether it should be robust to outliers is, I think, off topic in this section of the user guide. Users can read the preprocessing docs if interested.

> Since we can have L-BFGS running, shall we have a similar concern as in #15556 and #15583?

Copying what I responded in #14300 (comment): since I'm doing another pass on comments, I don't think it belongs in this PR. This PR is a minimal implementation, and while preconditioning may improve convergence, it is not essential; we have been using LBFGS without it for years. Also, as the above linked PR showed, the results were not as clear cut as expected.

Since the linear predictor :math:`Xw` can be negative and Poisson,
Gamma and Inverse Gaussian distributions don't support negative values, it
is necessary to apply an inverse link function that guarantees the
non-negativity. For example, with `link='log'`, the inverse link function
becomes :math:`h(Xw)=\exp(Xw)`.
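
Combining the scaling and link recommendations above, a minimal sketch (a plain
:class:`~sklearn.preprocessing.StandardScaler` is assumed here; any suitable
scaler could be used instead)::

    import numpy as np
    from sklearn.linear_model import PoissonRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]])
    y = np.array([0.0, 1.0, 2.0, 4.0])

    # Standardize features so the L2 penalty treats them equally, then fit a
    # Poisson GLM whose log link keeps predictions strictly positive.
    model = make_pipeline(StandardScaler(), PoissonRegressor(alpha=1.0))
    model.fit(X, y)
    assert (model.predict(X) > 0).all()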

If you want to model a relative frequency, i.e. counts per exposure (time,
volume, ...) you can do so by using a Poisson distribution and passing
:math:`y=\frac{\mathrm{counts}}{\mathrm{exposure}}` as target values
together with :math:`\mathrm{exposure}` as sample weights. For a concrete
example see e.g.
:ref:`sphx_glr_auto_examples_linear_model_plot_tweedie_regression_insurance_claims.py`.
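
A minimal sketch of this pattern (the data and feature meaning below are purely
illustrative)::

    import numpy as np
    from sklearn.linear_model import PoissonRegressor

    X = np.array([[20.0], [35.0], [50.0], [65.0]])   # e.g. driver age
    counts = np.array([0.0, 1.0, 3.0, 2.0])          # observed number of events
    exposure = np.array([0.5, 1.0, 2.0, 0.8])        # e.g. policy duration in years

    freq_model = PoissonRegressor(alpha=1e-4)
    # Model the frequency counts / exposure, weighting each row by its exposure.
    freq_model.fit(X, counts / exposure, sample_weight=exposure)
    expected_counts = freq_model.predict(X) * exposure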

When performing cross-validation for the `power` parameter of
`TweedieRegressor`, it is advisable to specify an explicit `scoring` function,
because the default scorer :meth:`TweedieRegressor.score` is a function of
`power` itself.
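
For instance, one possible sketch using
:class:`~sklearn.model_selection.GridSearchCV` with an explicit metric (the toy
data and the candidate ``power`` grid below are purely illustrative)::

    from sklearn.linear_model import TweedieRegressor
    from sklearn.model_selection import GridSearchCV

    X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
    y = [2.0, 3.5, 5.0, 7.0, 9.5, 12.0]

    # Fix the scoring metric so candidates with different `power` values are
    # compared on a common scale rather than each with its own default deviance.
    grid = GridSearchCV(
        TweedieRegressor(link='log', alpha=0.1),
        param_grid={'power': [1.5, 1.7, 1.9]},
        scoring='neg_mean_absolute_error',
        cv=3,
    )
    grid.fit(X, y)
    print(grid.best_params_)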

Stochastic Gradient Descent - SGD
=================================
7 changes: 7 additions & 0 deletions doc/whats_new/v0.23.rst
@@ -222,6 +222,13 @@ Changelog
:mod:`sklearn.linear_model`
...........................

- |MajorFeature| Added generalized linear models (GLM) with non-normal error
distributions, including :class:`linear_model.PoissonRegressor`,
:class:`linear_model.GammaRegressor` and :class:`linear_model.TweedieRegressor`
which use Poisson, Gamma and Tweedie distributions respectively.
:pr:`14300` by :user:`Christian Lorentzen <lorentzenchr>`, `Roman Yurchak`_,
and `Olivier Grisel`_.

- |Feature| Support of `sample_weight` in :class:`linear_model.ElasticNet` and
:class:`linear_model:Lasso` for dense feature matrix `X`.
:pr:`15436` by :user:`Christian Lorentzen <lorentzenchr>`.