[WIP] Implement Gini coefficient for model selection with positive regression GLMs #15176

Status: Open. Wants to merge 228 commits into base: main.
d5e8810
[WIP] Add Generalized Linear Model, issue #5975, initial commit
Jul 18, 2017
2fc189d
[WIP] Add Generalized Linear Models (#9405)
Jul 19, 2017
a6137d8
[WIP] Add Generalized Linear Models (#9405)
Jul 19, 2017
b0be167
[WIP] Add Generalized Linear Models (#9405)
Aug 9, 2017
85c52ec
[WIP] Add Generalized Linear Models (#9405)
Aug 12, 2017
0f4bdb3
[WIP] Add Generalized Linear Models (#9405)
Sep 18, 2017
5b46c23
[WIP] Add Generalized Linear Models (#9405)
Sep 18, 2017
10dd146
[WIP] Add Generalized Linear Models (#9405)
Dec 3, 2017
72485b6
[WIP] Add Generalized Linear Models (#9405)
Jan 8, 2018
5c1369b
[WIP] Add Generalized Linear Models (#9405)
Jan 24, 2018
91497a2
[WIP] Add Generalized Linear Models (#9405)
Jan 24, 2018
b9e5105
[WIP] Add Generalized Linear Models (#9405)
Jan 24, 2018
e317422
[WIP] Add Generalized Linear Models (#9405)
Jan 25, 2018
9a98184
[WIP] Add Generalized Linear Models (#9405)
Jan 25, 2018
db9defe
[WIP] Add Generalized Linear Models (#9405)
Jan 26, 2018
dc7fdd7
[WIP] Add Generalized Linear Models (#9405)
Jan 26, 2018
b11d06b
[WIP] Add Generalized Linear Models (#9405)
Jan 26, 2018
9e6c013
[WIP] Add Generalized Linear Models (#9405)
Jan 26, 2018
bad0190
[WIP] Add Generalized Linear Models (#9405)
Jan 27, 2018
48137d8
[WIP] Add Generalized Linear Models (#9405)
Jan 28, 2018
2c2a077
[WIP] Add Generalized Linear Models (#9405)
Jan 28, 2018
15931c3
[WIP] Add Generalized Linear Models (#9405)
Jan 28, 2018
feedba3
[MRG] Add Generalized Linear Models (#9405)
Mar 30, 2018
6fdfb47
[MRG] Add Generalized Linear Models (#9405)
Mar 30, 2018
d489f56
[MRG] Add Generalized Linear Models (#9405)
Aug 5, 2018
809e3a2
Remove test_glm_P2_argument
Aug 26, 2018
4edce36
Filter out DeprecationWarning in old versions of scipy.sparse.linalg.…
Aug 30, 2018
46df5b6
import pytest
Aug 30, 2018
21f2136
Document arguments of abstact methods
Aug 30, 2018
1faedf8
Pytest filter warnings use two colons
Aug 30, 2018
992f981
Improve documentation of arguments that were so far undocumented
Aug 30, 2018
06b8451
Further improve documentation of arguments
Aug 30, 2018
c93f60d
Remove parameters docstring for __init__
Aug 31, 2018
66ec63b
Fix typos in docstring of TweedieDistribution
Aug 31, 2018
53c6970
Change docstring section of TweedieDistribution from Attributes to Pa…
Aug 31, 2018
87d5ba3
Minor doc improvements of GeneralizedLinearRegressor
Oct 7, 2018
a9ae023
Double escape in doctring of GeneralizedLinearRegressor
Oct 8, 2018
bb62485
Add example for GeneralizedLinearRegressor
Dec 31, 2018
16d064d
Resolve merge conflicts
Jan 1, 2019
1a02a90
Adapt for minimum numpy version
Jan 1, 2019
177eb4c
Remove six dependencies as in #12639
Jan 6, 2019
3d4c784
Improve user guide, doc and fix penalty parameter for Ridge
Feb 3, 2019
919912c
Smarter intercept initialization and docstring improvements
Feb 17, 2019
01033e3
Fix false formula in starting_mu and improve start_params
Feb 20, 2019
4071a8a
Improve argument handling of P1 and P2
Feb 20, 2019
757bc3c
Fix doctest, test_poisson_enet, change IRLS to use lstsq, fix input c…
Feb 20, 2019
ed8e74f
Use pytest decorators and pytest.raises
Feb 23, 2019
fe876da
Add Logistic regression=Binomial + Logit
Feb 24, 2019
2993e03
More efficient sparse matrices and refactor of irls and cd solver
Apr 7, 2019
a6f9f13
Treat the intercept separately, i.e. X, P1, P2 never include intercept
Apr 20, 2019
c9a7a95
Revised option start_params
Apr 21, 2019
a7755de
Fix a few typos
rth Jun 4, 2019
9aa1fc4
Make module private
rth Jun 4, 2019
ca3eae2
Working on tests
rth Jun 4, 2019
61bc6b8
Improve tests
rth Jun 5, 2019
b24a7ca
Remove unused dec parameter in tests
rth Jun 5, 2019
f95b390
ENH: add Generalized Linear Models, issue #5975
Jul 18, 2017
09176b4
MAINT: merge branch 'GLM-impr' of https://github.com/rth/scikit-learn
Jun 9, 2019
def12ae
[MAINT] make glm private, fix typos, improve tests
Jun 9, 2019
9b574bd
Fix docstrings for the new print_changed_only=True by default
rth Jun 11, 2019
90299fd
Increase coverage
rth Jun 12, 2019
e3a5a9a
More tests and addressing some review comments
rth Jun 12, 2019
54b80b8
TST More specific checks of error messages in tests
rth Jun 13, 2019
e962859
Merge branch 'master' into GLM
rth Jun 27, 2019
7db0320
Add PoissonRegressor alias
rth Jun 14, 2019
dcfe9ed
TST Simplify comparison with ridge
rth Jun 27, 2019
4879bb6
EXA Add plot_tweedie_regression_insurance_claims.py
rth Jun 28, 2019
56069e5
EXA Fix issues with older pandas versions in example
rth Jun 28, 2019
53f3c5f
DOC Add second poisson regression example
rth Jul 9, 2019
ac1fef3
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 9, 2019
be5a3c4
Add GeneralizedHyperbolicSecant and BinomialDistributions
rth Jul 9, 2019
e67fecb
Remove start params option
rth Jul 9, 2019
62f4448
Remove L1 penalty and CD solver
rth Jul 9, 2019
d25042e
Remove newton CG algorithm
rth Jul 9, 2019
07ee495
Remove fisher_matrix, _observed_information and _eta_mu_score_fisher
rth Jul 9, 2019
d0eb285
Remove matrix L2 penalty and IRLS solver
rth Jul 9, 2019
1e4b538
Remove plot_poisson_spline_regression.py example
rth Jul 9, 2019
3265148
Remove random_state parameter
rth Jul 9, 2019
1862ab6
Lint
rth Jul 9, 2019
4154074
Fix docstring
rth Jul 10, 2019
c5d77d7
Remove unused core
rth Jul 10, 2019
9ab5ac2
Update examples/linear_model/plot_poisson_regression_non_normal_loss.py
rth Jul 13, 2019
e4d0be1
Update examples/linear_model/plot_poisson_regression_non_normal_loss.py
rth Jul 13, 2019
6ff4d58
Update doc/modules/linear_model.rst
rth Jul 13, 2019
13102d5
Update doc/modules/linear_model.rst
rth Jul 13, 2019
af89e52
Update doc/modules/linear_model.rst
rth Jul 13, 2019
3802420
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 13, 2019
ddc4b71
Use scipy.optimize.minimize interface for LBFGS optimizer
rth Jul 13, 2019
426ae1d
EXA wording and score in plot_tweedie_regression_insurance_claims.html
Jul 14, 2019
a404384
Address review comments
rth Jul 15, 2019
65796a3
Review comments on the documentation
rth Jul 16, 2019
e44afe7
Split the implementation into several files
rth Jul 16, 2019
5927379
Fix CI
rth Jul 16, 2019
a6df2a7
Add test_deviance_derivative
rth Jul 16, 2019
5af89a7
Fix sklearn/linear_model/setup.py
rth Jul 16, 2019
cd347d4
Remove variance and variance_derivative methods from distributions
rth Jul 17, 2019
0d7f9cd
Improve coverage
rth Jul 17, 2019
dbffad8
Remove mentions of the binomial distribution
rth Jul 17, 2019
d914ab2
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 19, 2019
3187204
Use common simple weight validation
rth Jul 19, 2019
cc03c1a
Simplify comments formatting
rth Jul 19, 2019
4d433d1
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 19, 2019
aa52b4a
Refactor to use TweedieDistribition in metrics
rth Jul 22, 2019
816aa8f
WIP
rth Jul 25, 2019
6500c81
Use Poisson deviance in examples
rth Jul 25, 2019
59a6d9d
Use PoissonRegressor and GammaRegressor in examples
rth Jul 25, 2019
03a8a2d
Improve documentation wording
rth Jul 26, 2019
bbf7f38
Use dataframe OpenML fetcher
rth Jul 26, 2019
49a3a8e
Refactor distibution bounds
rth Jul 26, 2019
228e8c8
Move deviance checks under destribution
rth Jul 26, 2019
09a57c9
Expose TweedieRegressor
rth Jul 26, 2019
4b485ca
Improve documentation
rth Jul 26, 2019
aa0adf1
Lint
rth Jul 26, 2019
abd47d7
Fix __init__
rth Jul 30, 2019
c65ac12
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Jul 31, 2019
7a9d067
Update doc/modules/linear_model.rst
rth Aug 2, 2019
18b4503
Update doc/modules/linear_model.rst
rth Aug 2, 2019
29658d6
Update doc/modules/linear_model.rst
rth Aug 2, 2019
1ea70d3
Fix typos in documentation
Aug 7, 2019
efdcb5b
Update doc/modules/linear_model.rst
rth Aug 9, 2019
ef0d063
Update doc/modules/linear_model.rst
rth Aug 9, 2019
0125e1c
Update doc/modules/linear_model.rst
rth Aug 9, 2019
6a8a600
Update examples/linear_model/plot_poisson_regression_non_normal_loss.py
rth Aug 9, 2019
73f3bd1
Rename inverse.gaussian to inverse-gaussian
rth Aug 9, 2019
11b178f
Remove sample_weight parameter from predict
rth Aug 9, 2019
3806fbe
Remove redundant check_array in predict
rth Aug 9, 2019
ae1c672
Update doc/modules/linear_model.rst
Aug 11, 2019
f07c831
Remove dispersion
Aug 11, 2019
e34fb57
Merge branch 'upstream/master' into GLM-minimal
Aug 13, 2019
ebbbe9c
Update doc/modules/linear_model.rst
Aug 13, 2019
918e257
Update doc/modules/linear_model.rst
Aug 13, 2019
5236cd8
Merge branch 'GLM-minimal' of https://github.com/rth/scikit-learn int…
Aug 13, 2019
37d0f47
Use double `` when necessary
rth Aug 16, 2019
9c337f2
ax -> axes in plot_poisson_regression_non_normal_loss.py
rth Aug 16, 2019
5e05935
Update sklearn/linear_model/_glm/distribution.py
rth Aug 16, 2019
4a68213
Remove solver=auto
rth Aug 16, 2019
8ee5c85
Update sklearn/linear_model/_glm/glm.py
rth Aug 16, 2019
b106c25
Merge branch 'GLM-minimal' of github.com:rth/scikit-learn into GLM-mi…
rth Aug 16, 2019
a1f8aab
More review comments
rth Aug 16, 2019
c0999ea
Addressing reviews in tests
rth Aug 16, 2019
e09e336
More comments in tests
rth Aug 16, 2019
6601d30
Update linear_model.rst
Aug 17, 2019
81eabe3
Merge upstream/master into GLM-minimal
Aug 17, 2019
5174dae
Address check_is_fitted deprication of attributes
Aug 17, 2019
61dc13f
No LaTeX in docstrings
Aug 17, 2019
44524ca
Replace Tweedie p->power
Aug 17, 2019
58d2409
Replace Tweedie p->power
Aug 17, 2019
ee351e1
Fix tests due to Tweedie p->power
Aug 17, 2019
33fe9be
Simplify super(...)
Aug 18, 2019
94272e7
Replace Link.link(..) by __call__(..)
Aug 18, 2019
2457039
Replace 1. -> 1
Aug 18, 2019
6396d2c
Fix table in TweedieRegressor
Aug 18, 2019
8be0387
Improve docstring in plot_tweedie_regression_insurance_claims.py
rth Aug 22, 2019
da66fd5
Use train_test_split in tests
rth Aug 22, 2019
b9bc170
Fix TODO in test_warm_start
rth Aug 22, 2019
ab6c5d8
Revert "No LaTeX in docstrings"
rth Aug 22, 2019
b424a07
Remove n_iter_ check when warm start.
rth Aug 22, 2019
95a9058
Rename variable L2 -> coef_scaled
rth Aug 22, 2019
59eceb4
Minor fixes
rth Aug 22, 2019
12a5067
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Aug 28, 2019
04f30f4
Better wording in example
rth Aug 28, 2019
3630b52
Improvements in plot_poisson_regression_non_normal_loss.py
rth Aug 28, 2019
516eadb
Improvements in plot_tweedie_regression_insurance_claims.py
rth Aug 28, 2019
5e14928
Drop unused ExponentialDispersionModel._upper_bound
rth Aug 28, 2019
6cc1df5
Move notes and references from docstrings to user manual
rth Aug 28, 2019
752d6aa
More explanatory comments in the code
rth Aug 28, 2019
38a4ad4
Fix requires_positive_y tag
rth Aug 28, 2019
c15a1cc
Remove Link.inverse_derivative2
rth Aug 28, 2019
37de07b
Rename p to power parameter in mean_tweedie_deviance
rth Aug 30, 2019
adbf997
Rename predicted mean mu to y_pred
rth Aug 30, 2019
47dbc84
Fix link parameter documentation in TweedieRegression
rth Aug 30, 2019
3b526e9
EXA Use a simpler pipeline for GBDT in poisson regression example
rth Aug 30, 2019
b1eb611
Minor fixes for user guide
Sep 1, 2019
d964c01
EXA Poisson: minor changes
Sep 1, 2019
a1844b8
Fix mu->y_pred and p->power
Sep 2, 2019
f513392
EXA Tweedie: some improvements
Sep 3, 2019
84229a6
Fix doc test
Sep 3, 2019
dd22699
Merge branch 'master' into GLM-minimal
rth Sep 11, 2019
059aeb7
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Sep 11, 2019
8c6c255
Fix test
rth Sep 11, 2019
0a23313
EXA Use Ridge and remove eps
rth Sep 12, 2019
29964af
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Sep 16, 2019
976b436
Address comments in plot_poisson_regression_non_normal_loss.py
rth Sep 16, 2019
7c850d1
Lint
rth Sep 16, 2019
f64dc4a
Simplify plot_tweedie_regression_insurance_claims.py example
rth Sep 16, 2019
b1f5bde
Add "lift curve" for model validation in Poisson example
rth Sep 18, 2019
a9ab4e4
Various improvements to the model comparison example
ogrisel Sep 25, 2019
be7bb67
Add cumulated claims plot
ogrisel Sep 25, 2019
4125c20
Improve the cumulated nb claims plot
ogrisel Sep 26, 2019
0070d52
Fix wrong xlabel in histogram plot
ogrisel Sep 26, 2019
9d6bb52
More example improvements (preprocessors + plots)
ogrisel Sep 26, 2019
b353b2d
Simplify dataset + use more data
ogrisel Sep 26, 2019
88757fd
Remove solver parameter from {Poisson,Gamma,Tweedie}Regression
rth Sep 26, 2019
6d119d4
Revert some accidental changes from 88757fdb99cc516be230fe08ec1ebfb7b…
rth Sep 26, 2019
b735eb7
Additional comment about the use of properties with setters
rth Sep 26, 2019
2d91114
Add additional tests for link derivatives
rth Sep 26, 2019
89103bc
cosmits + typos
agramfort Sep 29, 2019
4f28a44
Address some of Alex's comments
rth Sep 30, 2019
d4dfd0b
Removing unnecessary comments / asarray call
rth Sep 30, 2019
64d6fbd
Update doc/modules/linear_model.rst
rth Oct 3, 2019
82ace9f
Remove unused solver parameter in tests
rth Oct 3, 2019
5288a0f
Add test for sample_weight consistency
rth Oct 3, 2019
499e8d2
Move GLM losses under sklearn._loss.glm_distribution
rth Oct 3, 2019
f4aa839
Update sklearn/linear_model/_glm/glm.py
rth Oct 3, 2019
48fcbe6
Add missing config.add_subpackage in setup.py
rth Oct 3, 2019
d71fb9f
Address Nicolas comments in the documentation (partial)
rth Oct 3, 2019
fa90272
More cleanups in the plot_tweedie_regression_insurance_claims.py example
rth Oct 3, 2019
4d16f31
Typos and text improvement in poisson example
Oct 6, 2019
15eb1d3
EXA sharey for histograms
Oct 6, 2019
3d097c6
Plot y_pred histograms on the test set
ogrisel Oct 8, 2019
6372287
Merge remote-tracking branch 'origin/master' into GLM-minimal
ogrisel Oct 8, 2019
b117856
Merge remote-tracking branch 'upstream/master' into GLM-minimal
rth Oct 8, 2019
31f5b3d
Compound Poisson => Compound Poisson Gamma
ogrisel Oct 9, 2019
a498ff5
Compound Poisson => Compound Poisson Gamma
ogrisel Oct 9, 2019
3fae28a
Various improvement in Tweedie regression example
ogrisel Oct 9, 2019
a2b6841
Merge remote-tracking branch 'origin/master' into GLM-minimal
ogrisel Oct 9, 2019
a47798a
Update doc/modules/linear_model.rst
rth Oct 10, 2019
83391dd
Use latest docstring conventions everywhere
rth Oct 10, 2019
3bfb54e
Drop check_input parameter
rth Oct 10, 2019
d325fe2
Use keyword only arguments SLEP009
rth Oct 10, 2019
661cf56
Move _y_pred_deviance_derivative from losses as a private function
rth Oct 10, 2019
560c180
Fix cumulated claim amount curve in Tweedie regression example
ogrisel Oct 10, 2019
0ea2dce
PEP8
ogrisel Oct 10, 2019
a608c70
WIP implementation of Gini coeff and Lorenz curve
ogrisel Oct 10, 2019
853f8b7
Use Lorenz curve in Tweedie example
ogrisel Oct 10, 2019
b3b55e8
PEP8
ogrisel Oct 10, 2019
640f017
Make sure labels/weights are floats before normalizing
ogrisel Oct 11, 2019
6dd197a
Update scorer test framework
ogrisel Oct 11, 2019
16 changes: 16 additions & 0 deletions doc/modules/classes.rst
@@ -854,6 +854,22 @@ Any estimator using the Huber loss would also be robust to outliers, e.g.
linear_model.RANSACRegressor
linear_model.TheilSenRegressor

Generalized linear models (GLM) for regression
----------------------------------------------

A generalization of linear models that allows the response variable to have
an error distribution other than the normal distribution is implemented in
the following models:

.. autosummary::
:toctree: generated/
:template: class.rst

linear_model.PoissonRegressor
linear_model.TweedieRegressor
linear_model.GammaRegressor


Miscellaneous
-------------

164 changes: 163 additions & 1 deletion doc/modules/linear_model.rst
@@ -875,7 +875,7 @@ with 'log' loss, which might be even faster but requires more tuning.
It is possible to obtain the p-values and confidence intervals for
coefficients in cases of regression without penalization. The `statsmodels
package <https://pypi.org/project/statsmodels/>`_ natively supports this.
Within sklearn, one could use bootstrapping instead as well.


:class:`LogisticRegressionCV` implements Logistic Regression with built-in
@@ -897,6 +897,168 @@ to warm-starting (see :term:`Glossary <warm_start>`).
.. [9] `"Performance Evaluation of Lbfgs vs other solvers"
<http://www.fuzihao.org/blog/2016/01/16/Comparison-of-Gradient-Descent-Stochastic-Gradient-Descent-and-L-BFGS/>`_

.. _Generalized_linear_regression:

Generalized Linear Regression
=============================

Generalized Linear Models (GLM) extend linear models in two ways
[10]_. First, the predicted values :math:`\hat{y}` are linked to a linear
combination of the input variables :math:`X` via an inverse link function
:math:`h` as

.. math:: \hat{y}(w, X) = h(x^\top w) = h(w_0 + w_1 X_1 + ... + w_p X_p).

Secondly, the squared loss function is replaced by the unit deviance :math:`d`
of a reproductive exponential dispersion model (EDM) [11]_. The minimization
problem becomes

.. math:: \min_{w} \frac{1}{2 \sum_i s_i} \sum_i s_i \cdot d(y_i, \hat{y}(w, X_i)) + \frac{\alpha}{2} ||w||_2^2

with sample weights :math:`s_i`, and L2 regularization penalty :math:`\alpha`.
The unit deviance is defined by the log of the :math:`\mathrm{EDM}(\mu, \phi)`
likelihood as

.. math:: d(y, \mu) = -2\phi\cdot
\left( \log p(y|\mu,\phi)
- \log p(y|y,\phi)\right).

The following table lists some specific EDM distributions, all of which are
members of the Tweedie family, together with some of their properties.

================= =============================== ====================================== ============================================
Distribution Target Domain Unit Variance Function :math:`v(\mu)` Unit Deviance :math:`d(y, \mu)`
================= =============================== ====================================== ============================================
Normal :math:`y \in (-\infty, \infty)` :math:`1` :math:`(y-\mu)^2`
Poisson :math:`y \in [0, \infty)` :math:`\mu` :math:`2(y\log\frac{y}{\mu}-y+\mu)`
Gamma :math:`y \in (0, \infty)` :math:`\mu^2` :math:`2(\log\frac{\mu}{y}+\frac{y}{\mu}-1)`
Inverse Gaussian :math:`y \in (0, \infty)` :math:`\mu^3` :math:`\frac{(y-\mu)^2}{y\mu^2}`
================= =============================== ====================================== ============================================
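As a quick numerical sanity check, the unit deviances from the table can be evaluated directly; a minimal NumPy sketch (the helper name ``unit_deviance`` is illustrative, not part of the proposed API):

```python
import numpy as np

def unit_deviance(y, mu, power):
    """Tweedie unit deviances for power in {0, 1, 2, 3}, per the table above."""
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    if power == 0:    # Normal
        return (y - mu) ** 2
    if power == 1:    # Poisson, with the convention 0 * log(0) = 0
        with np.errstate(divide="ignore", invalid="ignore"):
            ylogy = np.where(y > 0, y * np.log(y / mu), 0.0)
        return 2 * (ylogy - y + mu)
    if power == 2:    # Gamma
        return 2 * (np.log(mu / y) + y / mu - 1)
    if power == 3:    # Inverse Gaussian
        return (y - mu) ** 2 / (y * mu ** 2)
    raise ValueError("unsupported power")

y_true = np.array([0.5, 1.0, 2.0])
y_pred = np.array([1.0, 1.0, 1.5])
for p in (0, 1, 2, 3):
    d = unit_deviance(y_true, y_pred, p)
    assert np.all(d >= 0)                                    # non-negative
    assert np.allclose(unit_deviance(y_true, y_true, p), 0)  # zero at mu == y
```

Averaging these unit deviances over a sample gives the deviance-based metrics (e.g. ``mean_tweedie_deviance``) used for model evaluation in this PR.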


Usage
-----

A GLM loss different from the classical squared loss might be appropriate in
the following cases:

* If the target values :math:`y` are counts (non-negative integer valued) or
frequencies (non-negative), you might use a Poisson deviance with log-link.

* If the target values are positive valued and skewed, you might try a
Gamma deviance with log-link.

* If the target values seem to be heavier tailed than a Gamma distribution,
you might try an Inverse Gaussian deviance (or even higher variance powers
of the Tweedie family).

Since the linear predictor :math:`x^\top w` can be negative while the
Poisson, Gamma and Inverse Gaussian distributions do not support negative
values, it is convenient to apply a link function different from the identity
link :math:`h(x^\top w)=x^\top w` that guarantees non-negativity, e.g. the
log-link ``link='log'`` with :math:`h(x^\top w)=\exp(x^\top w)`.

:class:`TweedieRegressor` implements a generalized linear model for the
Tweedie distribution, which allows modeling any of the above-mentioned
distributions via the appropriate ``power`` parameter, i.e. the exponent
of the unit variance function:

- ``power = 0``: Normal distribution. Specialized solvers such as
:class:`Ridge`, :class:`ElasticNet` are generally
more appropriate in this case.

- ``power = 1``: Poisson distribution. :class:`PoissonRegressor` is exposed for
convenience. However, it is strictly equivalent to
`TweedieRegressor(power=1)`.

- ``power = 2``: Gamma distribution. :class:`GammaRegressor` is exposed for
convenience. However, it is strictly equivalent to
`TweedieRegressor(power=2)`.

- ``power = 3``: Inverse Gaussian distribution.
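The stated equivalences can be verified numerically; a minimal sketch, assuming the estimators proposed in this PR are importable from ``sklearn.linear_model``:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

poisson = PoissonRegressor(alpha=0.5).fit(X, y)
tweedie = TweedieRegressor(power=1, alpha=0.5, link='log').fit(X, y)

# Same deviance, same penalty, same solver: the two fits agree.
assert np.allclose(poisson.coef_, tweedie.coef_, rtol=1e-3, atol=1e-6)
assert np.allclose(poisson.intercept_, tweedie.intercept_, rtol=1e-3, atol=1e-6)
```

The same check with ``power=2`` against :class:`GammaRegressor` should behave identically.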


.. note::

* The feature matrix `X` should be standardized before fitting. This
ensures that the penalty treats features equally.
* If you want to model a relative frequency, i.e. counts per exposure (time,
volume, ...), you can do so by using a Poisson distribution and passing
:math:`y=\frac{\mathrm{counts}}{\mathrm{exposure}}` as target values
together with :math:`s=\mathrm{exposure}` as sample weights.

As an example, consider Poisson distributed counts z (integers) and
weights s=exposure (time, money, person-years, ...). Then you fit
y = z/s, i.e. ``PoissonRegressor.fit(X, y, sample_weight=s)``.
The weights are necessary for the right (finite sample) mean.
Considering :math:`\bar{y} = \frac{\sum_i s_i y_i}{\sum_i s_i}`,
in this case one might say that y has a 'scaled' Poisson distribution.
The same holds for other distributions.

* The fit itself does not need Y to be from an EDM, but only assumes
the first two moments to be :math:`E[Y_i]=\mu_i=h((Xw)_i)` and
:math:`Var[Y_i]=\frac{\phi}{s_i} v(\mu_i)`.
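The exposure recipe in the note above can be written out explicitly; a sketch with made-up data, assuming the ``PoissonRegressor`` API proposed in this PR:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
exposure = rng.uniform(0.5, 2.0, size=100)   # s, e.g. policy duration in years
lam = exposure * np.exp(0.1 * X[:, 0])       # expected counts scale with exposure
counts = rng.poisson(lam)                    # observed counts z

# Fit the frequency y = z / s with sample_weight s = exposure.
freq = counts / exposure
reg = PoissonRegressor(alpha=1e-3).fit(X, freq, sample_weight=exposure)

# Predicted counts are frequency predictions rescaled by exposure.
predicted_counts = reg.predict(X) * exposure
```

Without the sample weights, observations with little exposure would count as much as fully exposed ones and bias the estimated frequency.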

The estimator can be used as follows::

>>> from sklearn.linear_model import TweedieRegressor
>>> reg = TweedieRegressor(power=1, alpha=0.5, link='log')
>>> reg.fit([[0, 0], [0, 1], [2, 2]], [0, 1, 2])
TweedieRegressor(alpha=0.5, link='log', power=1)
>>> reg.coef_
array([0.2463..., 0.4337...])
>>> reg.intercept_
-0.7638...


.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_linear_model_plot_tweedie_regression_insurance_claims.py`
* :ref:`sphx_glr_auto_examples_linear_model_plot_poisson_regression_non_normal_loss.py`
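The Gini coefficient from this PR's title, used in the insurance examples above to compare models, can be sketched with plain NumPy (the function names are illustrative, not part of the scikit-learn API):

```python
import numpy as np

def lorenz_curve(y_true, y_pred, exposure):
    """Cumulative exposure share vs. cumulative claim share, with
    observations ordered by increasing predicted risk."""
    order = np.argsort(y_pred)
    y_true, exposure = y_true[order], exposure[order]
    cum_exposure = np.cumsum(exposure) / np.sum(exposure)
    cum_claims = np.cumsum(y_true * exposure) / np.sum(y_true * exposure)
    # prepend the origin so both curves start at (0, 0)
    return np.r_[0.0, cum_exposure], np.r_[0.0, cum_claims]

def gini_coefficient(y_true, y_pred, exposure):
    """Twice the area between the diagonal and the Lorenz curve."""
    x, y = lorenz_curve(y_true, y_pred, exposure)
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)  # trapezoidal rule
    return 1.0 - 2.0 * area

rng = np.random.RandomState(42)
true_risk = rng.uniform(0.1, 2.0, size=1000)
y = rng.poisson(true_risk).astype(float)
w = np.ones_like(y)

# A model that ranks by the true risk separates better than a random ranking.
assert gini_coefficient(y, true_risk, w) > gini_coefficient(y, rng.permutation(true_risk), w)
```

Being rank-based, this metric only assesses how well a model orders risks; it says nothing about calibration, so it complements the deviance-based scores discussed above.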

Mathematical formulation
------------------------

In the unpenalized case, the assumptions are the following:

* The target values :math:`y_i` are realizations of random variables
:math:`Y_i \overset{i.i.d}{\sim} \mathrm{EDM}(\mu_i, \frac{\phi}{s_i})`
with expectation :math:`\mu_i=\mathrm{E}[Y_i]`, dispersion parameter
:math:`\phi` and sample weights :math:`s_i`.
* The aim is to predict the expectation :math:`\mu_i` with
:math:`\hat{y}_i = h(\eta_i)`, linear predictor
:math:`\eta_i=(Xw)_i` and inverse link function :math:`h`.

Note that the first assumption implies
:math:`\mathrm{Var}[Y_i]=\frac{\phi}{s_i} v(\mu_i)` with unit variance
function :math:`v(\mu)`. Specifying a particular EDM distribution is the same
as specifying its unit variance function (the two are in one-to-one
correspondence).

A few remarks:

* The deviance is independent of :math:`\phi`. Therefore, the estimation of
the coefficients :math:`w` is likewise independent of the dispersion
parameter of the EDM.
* The minimization is equivalent to (penalized) maximum likelihood estimation.
* The deviances for at least Normal, Poisson and Gamma distributions are
strictly consistent scoring functions for the mean :math:`\mu`, see Eq.
(19)-(20) in [12]_. This means that, given an appropriate feature matrix `X`,
you get good (asymptotic) estimators for the expectation when using these
deviances.
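To make the equivalence with maximum likelihood concrete, the Poisson case (:math:`\phi = 1`, :math:`\log p(y|\mu) = y\log\mu - \mu - \log(y!)`) can be worked through the unit deviance definition:

```latex
d(y, \mu) = -2\left(\log p(y|\mu) - \log p(y|y)\right)
          = -2\left(y\log\mu - \mu - y\log y + y\right)
          = 2\left(y\log\frac{y}{\mu} - y + \mu\right)
```

which recovers the Poisson row of the table above; since :math:`\log p(y|y)` does not depend on :math:`w`, minimizing :math:`\sum_i s_i\, d(y_i, \hat{y}_i)` is the same as maximizing the weighted log-likelihood.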


.. topic:: References:

.. [10] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models,
Second Edition. Boca Raton: Chapman and Hall/CRC. ISBN 0-412-31760-5.

.. [11] Jørgensen, B. (1992). The theory of exponential dispersion models
and analysis of deviance. Monografias de matemática, no. 51. See also
`Exponential dispersion model.
<https://en.wikipedia.org/wiki/Exponential_dispersion_model>`_

.. [12] Gneiting, T. (2010). `Making and Evaluating Point Forecasts.
<https://arxiv.org/pdf/0912.0902.pdf>`_

Stochastic Gradient Descent - SGD
=================================