
ENH migrate GLMs / TweedieRegressor to linear loss #22548


Merged: 48 commits merged into scikit-learn:main from lorentzenchr:migrate_glm_to_linear_loss on Mar 28, 2022

Conversation

lorentzenchr (Member)

Reference Issues/PRs

This is a follow-up of #21808 and #20567.
It also fixes #22124 (~partial fix of #21406).

What does this implement/fix? Explain your changes.

This PR plugs the new `LinearModelLoss` into the private `GeneralizedLinearRegressor`, thereby removing `sklearn/_loss/glm_distribution.py` and `sklearn/linear_model/_glm/link.py`.
The Tweedie deviance code is copied into the metric `mean_tweedie_deviance`.

Any other comments?

This should be a backward-compatible change to the user-facing API (`PoissonRegressor`, `GammaRegressor`, `TweedieRegressor`, and `mean_tweedie_deviance`).
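
As a quick illustration, a minimal sketch (synthetic data, not taken from the PR) of the public calls that should behave identically before and after the migration:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 3))
y = rng.poisson(lam=np.exp(X @ np.array([0.5, -0.2, 0.1])))

# Same estimator and metric API as before this PR.
reg = TweedieRegressor(power=1.5, link="log").fit(X, y)
print(mean_tweedie_deviance(y, reg.predict(X), power=1.5))
```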

@lorentzenchr (Member Author) left a comment:

Some hints for reviewers.

Comment on lines +761 to +762
class HalfTweedieLossIdentity(BaseLoss):
"""Half Tweedie deviance loss with identity link, for regression.
@lorentzenchr (Member Author):

This new loss class is needed for `TweedieRegressor(link="identity")`.
See the Cython implementation in the file `_loss.pyx.tp`.
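
For intuition, a minimal NumPy sketch (not the Cython code from the PR) of the half Tweedie deviance evaluated directly on `y_pred`, i.e. with an identity link, for a power `p` outside {0, 1, 2}; the sketch keeps all terms, so it equals half the unit deviance:

```python
import numpy as np

def half_tweedie_deviance_identity(y_true, y_pred, p=1.5):
    # With an identity link, raw_prediction == y_pred.
    return (
        np.maximum(y_true, 0) ** (2 - p) / ((1 - p) * (2 - p))
        - y_true * y_pred ** (1 - p) / (1 - p)
        + y_pred ** (2 - p) / (2 - p)
    )
```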

Comment on lines 82 to 100
family : {'normal', 'poisson', 'gamma', 'inverse-gaussian'} \
        or an ExponentialDispersionModel instance, default='normal'
    The distributional assumption of the GLM, i.e. which distribution from
    the EDM, specifies the loss function to be minimized.
base_loss_class : subclass of BaseLoss, default=HalfSquaredError
    A `base_loss_class` contains a specific loss function as well as the link
    function. The loss to be minimized specifies the distributional
    assumption of the GLM, i.e. the distribution from the EDM. Here are some
    examples:

    ======================= ======== ==========================
    base_loss_class         Link     Target Domain
    ======================= ======== ==========================
    HalfSquaredError        identity y any real number
    HalfPoissonLoss         log      0 <= y
    HalfGammaLoss           log      0 < y
    HalfInverseGaussianLoss log      0 < y
    HalfTweedieLoss         log      dependent on Tweedie power
    ======================= ======== ==========================

link : {'auto', 'identity', 'log'} or an instance of class BaseLink, \
        default='auto'
    The link function of the GLM, i.e. mapping from linear predictor
    `X @ coeff + intercept` to prediction `y_pred`. For instance, with a log
    link, we have `y_pred = exp(X @ coeff + intercept)`. Option 'auto' sets
    the link depending on the chosen family as follows:

    - 'identity' for Normal distribution
    - 'log' for Poisson, Gamma and Inverse Gaussian distributions

base_loss_params : dictionary, default={}
    Arguments to be passed to base_loss_class, e.g. {"power": 1.5} with
    `base_loss_class=HalfTweedieLoss`.
@lorentzenchr (Member Author):

`family` and `link` were private attributes. They are replaced by `base_loss_class` and `base_loss_params`. The new losses have the link functions baked into them.
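
A hedged sketch of what constructing the private estimator with the new parameters could look like, using the names from the diff above (internal API, so both the import path and the parameter names are assumptions and may differ from what was finally merged):

```python
from sklearn._loss.loss import HalfTweedieLoss
# Private estimator; the import path is assumed from the PR's file layout.
from sklearn.linear_model._glm.glm import GeneralizedLinearRegressor

glm = GeneralizedLinearRegressor(
    base_loss_class=HalfTweedieLoss,
    base_loss_params={"power": 1.5},  # forwarded as HalfTweedieLoss(power=1.5)
)
```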

Comment on lines 258 to 261
self._linear_loss = LinearModelLoss(
base_loss=self._get_base_loss_instance(),
fit_intercept=self.fit_intercept,
)
@lorentzenchr (Member Author) on Feb 19, 2022:

This is the whole point of this PR: to use `LinearModelLoss`!
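
To make that concrete, a hedged sketch of how a generic solver can work against `LinearModelLoss` (names and signatures taken from `sklearn/linear_model/_linear_loss.py` around the time of this PR; treat them as assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn._loss.loss import HalfPoissonLoss
from sklearn.linear_model._linear_loss import LinearModelLoss

rng = np.random.RandomState(0)
X = rng.uniform(size=(50, 2))
y = rng.poisson(lam=np.exp(X @ np.array([0.3, -0.1]))).astype(float)

# The base loss carries its own link; the wrapper adds intercept handling.
linear_loss = LinearModelLoss(base_loss=HalfPoissonLoss(), fit_intercept=True)
coef0 = np.zeros(X.shape[1] + 1)  # intercept is the last entry

res = minimize(
    lambda coef: linear_loss.loss_gradient(coef, X, y),
    coef0,
    jac=True,  # loss_gradient returns the (loss, gradient) pair
    method="L-BFGS-B",
)
print(res.x)
```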

@pytest.mark.parametrize(
"name, link_class", [("identity", IdentityLink), ("log", LogLink)]
)
def test_tweedie_link_argument(name, link_class):
@lorentzenchr (Member Author):

Replaces the former `test_glm_link_argument`.

(3, LogLink), # inverse-gaussian
],
)
def test_tweedie_link_auto(power, expected_link_class):
@lorentzenchr (Member Author):

Replaces the former `test_glm_link_auto`.

target_type=numbers.Real,
)

message = f"Mean Tweedie deviance error with power={p} can only be used on "
@lorentzenchr (Member Author) on Feb 19, 2022:

From here on, it is really copy & paste from `glm_distribution.py`, which is (edit:) to be deleted in 2 releases.
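
For reference, a hedged NumPy sketch of the quantity `mean_tweedie_deviance` computes, with the unit-deviance branches spelled out (not the copied implementation itself):

```python
import numpy as np
from scipy.special import xlogy

def mean_tweedie_deviance_sketch(y, y_pred, power=0.0, sample_weight=None):
    p = power
    if p == 0:  # Normal distribution
        dev = (y - y_pred) ** 2
    elif p == 1:  # Poisson distribution
        dev = 2 * (xlogy(y, y / y_pred) - y + y_pred)
    elif p == 2:  # Gamma distribution
        dev = 2 * (np.log(y_pred / y) + y / y_pred - 1)
    else:  # general Tweedie case
        dev = 2 * (
            np.maximum(y, 0) ** (2 - p) / ((1 - p) * (2 - p))
            - y * y_pred ** (1 - p) / (1 - p)
            + y_pred ** (2 - p) / (2 - p)
        )
    return np.average(dev, weights=sample_weight)
```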

@lorentzenchr (Member Author):

This might interest @agramfort @rth @TomDLT.
After this, one could imagine developing a better second-order optimizer.
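
Purely as a thought experiment (no scikit-learn API implied; `loss_grad_hess` below is a hypothetical callable), the kind of Newton iteration such a solver would run:

```python
import numpy as np

def newton_solve(loss_grad_hess, coef, X, y, max_iter=20, tol=1e-8):
    # loss_grad_hess is assumed to return (loss, gradient, Hessian)
    # of the penalized GLM objective at the current coefficients.
    for _ in range(max_iter):
        _, grad, hess = loss_grad_hess(coef, X, y)
        if np.linalg.norm(grad) < tol:
            break
        coef = coef - np.linalg.solve(hess, grad)  # Newton step
    return coef
```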

@lorentzenchr (Member Author):

> I think we should do a regular deprecation cycle for the `family` parameter. I know this imposes keeping around an otherwise useless Python module for 2 releases, but so be it.

What changed your mind?
I would like to hear a second reviewer's opinion to avoid back and forth, even if it is almost as simple as undoing 719bb1b.

@lorentzenchr (Member Author):

> > I think we should do a regular deprecation cycle for the `family` parameter. I know this imposes keeping around an otherwise useless Python module for 2 releases, but so be it.
>
> What changed your mind? I would like to hear a second reviewer's opinion to avoid back and forth, even if it is almost as simple as undoing 719bb1b.

I changed my mind, too. Let's do proper deprecation and hopefully move forward with this. This PR is the last missing piece for possible new second-order solvers for all/most GLMs!

@ogrisel (Member) left a comment:

LGTM, thank you very much for the PR.

@thomasjpfan (Member) left a comment:

I opened a PR to your fork to showcase a way to avoid `_base_loss`:
lorentzenchr#3

@jjerphan jjerphan self-requested a review March 24, 2022 07:42
@thomasjpfan (Member) left a comment:

I double-checked the math for `HalfTweedieLossIdentity` and it looks good to me. I have a few small comments on the code.

There is a memory issue with `d2_tweedie_score`.

@thomasjpfan (Member) left a comment:

I ran the benchmarks from #22548 (review) and see the same performance improvements.

LGTM

@@ -770,6 +771,52 @@ def constant_to_optimal_zero(self, y_true, sample_weight=None):
return term


class HalfTweedieLossIdentity(BaseLoss):
A Member left a comment:

Out of curiosity: when does it make sense to use the identity link with power != 0?

@lorentzenchr (Member Author):

Years ago, I thought it a good idea. In the meantime, I have come to doubt its usefulness, which is why I opened #19086, without much response.

Another commenter:

> Out of curiosity: when does it make sense to use the identity link with power != 0?

I have a problem where I know the expectation y' follows the linear model y' = w x. My measurements, y, have Poisson errors.

(The specific problem involves analysis of radiation measurements. The expectation is linear in the amount of source; the measurements are Poisson distributed.)

Using a log link function is just not the right description of my problem. Yes, the whole thing breaks down when evaluating negative values of w, but it seems much better to offer a constraint that avoids ever evaluating negative values of w than to exclude situations where you have an actual linear relationship with Poisson measurements.
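
A hedged illustration of this use case on synthetic data (all values made up; note that with the identity link the fit can fail if the optimizer wanders into negative predictions):

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.RandomState(0)
x = rng.uniform(1, 10, size=200)  # e.g. amount of source, strictly positive
y = rng.poisson(lam=3.0 * x)      # E[y] = w * x with true w = 3

# power=1 selects the Poisson deviance while the mean stays linear in x.
reg = TweedieRegressor(power=1, link="identity", alpha=0)
reg.fit(x.reshape(-1, 1), y)
print(reg.coef_, reg.intercept_)  # slope should come out close to 3
```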

@thomasjpfan thomasjpfan merged commit 75a94f5 into scikit-learn:main Mar 28, 2022
@lorentzenchr lorentzenchr deleted the migrate_glm_to_linear_loss branch March 29, 2022 16:16
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>