[MRG] ENH add Poisson splitting criterion for single trees #17386
Conversation
thanks @lorentzenchr, made a first pass, looks good
we'll also need some docs in the UG ;)
sklearn/tree/tests/test_tree.py (Outdated)

    score = mean_squared_error(diabetes.target, reg.predict(diabetes.data))
    assert score < 60 and score > 0, (
        f"Failed with {name}, criterion = {criterion} and score = {score}"
    loss = mean_squared_error(diabetes.target, reg.predict(diabetes.data))
Does it make sense to check the MSE when the tree was trained with a Poisson criterion? The need for a limit of 4000 seems to indicate that the check is super loose?
Also a nit, but "score" seemed like the correct term here; there's no loss in these trees.
MSE elicits the expectation, and so does the Poisson deviance. So yes, it makes sense.
I tried to use mean_poisson_deviance, but that did not change much, i.e. it still needs a high max_loss. I don't know why.
"score" in scikit-learn means "higher values are better". Error and loss are, for me, often synonyms. I can change it to "error"?
OK for using loss.
I'm still skeptical about the looseness of the check: enforcing an MSE bound of 4000 when other criteria achieve 100 times better than this. It seems to reveal that either there's something wrong in the Poisson deviance, or this test isn't relevant for Poisson?
I'm also skeptical here. I changed to using the Poisson deviance as the score for the "poisson" splitter. It's much better now.
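For reference, a minimal sketch of what scoring with the deviance looks like (the dataset matches the test, but max_depth is illustrative, not the test suite's value):

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_poisson_deviance
    from sklearn.tree import DecisionTreeRegressor

    X, y = load_diabetes(return_X_y=True)  # strictly positive targets
    reg = DecisionTreeRegressor(criterion="poisson", max_depth=5, random_state=0)
    reg.fit(X, y)
    # the deviance requires strictly positive predictions, which "poisson" guarantees
    print(mean_poisson_deviance(y, reg.predict(X)))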
sklearn/tree/tests/test_tree.py (Outdated)

    @pytest.mark.parametrize("criterion", ["mse", "poisson"])
    @pytest.mark.parametrize("tree", REG_TREES.values())
    def test_balance_property(criterion, tree):
        """Test that sum(y_pred)=sum(y_true) on training set."""
Maybe add a comment that this can't be true for mae, because the median is used instead of the mean (if that's true, I haven't double-checked)?
What about friedman_mse?
I added "friedman_mse"
- I had not double checked - and a comment about mae and median.
This could also be tested for each leaf.
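To make the property concrete, a hedged sketch of the check under discussion (criterion names as spelled at the time of this PR; "mae" is excluded because its leaves predict medians, not means):

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.tree import DecisionTreeRegressor

    X, y = load_diabetes(return_X_y=True)
    for criterion in ("mse", "friedman_mse", "poisson"):
        reg = DecisionTreeRegressor(criterion=criterion, random_state=0).fit(X, y)
        # every leaf predicts the mean of its training samples, so the sums match
        assert np.isclose(reg.predict(X).sum(), y.sum())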
    impurity:
        1/n * sum(y_i * log(y_i / y_pred))
    """
    # FIXME in 0.25:
Is this relevant considering we have proxy_impurity_improvement?
Isn't the impurity calculated for each node (after a split)? If so, the FIXME might improve fit time, as xlogy must be calculated for every sample in the node; there is no algebraic simplification. I have not done benchmarks though.
I guess my question becomes: do we really want to drop the constant terms, as suggested by the comment? Maybe we do that for other criteria, I haven't checked, but my understanding is that node_impurity should return the ground-truth impurity of the node?
If that's the case, I'd suggest moving this "FIXME" comment down to proxy_impurity_improvement to explain the proxy.
If not, we should maybe indicate that this 'trick' could also be used in children_impurity, so that we can compute both left and right impurity in one pass instead of 2.
Might be worth considering 😄
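For what it's worth, a small derivation (mine, not from the PR) of why dropping the constant terms is safe when only ranking splits: for a child c with weighted mean y_bar_c, weighted_n_c * impurity_c = sum_{i in c} w_i * y_i * log(y_i) - log(y_bar_c) * sum_{i in c} w_i * y_i. Summed over the left and right children, the first term always covers exactly the parent's samples and is therefore identical for every candidate split of that node, so a proxy that keeps only the -log(y_bar_c) * sum_{i in c} w_i * y_i parts orders splits the same way.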
The user guide does not mention
Yes, under scikit-learn.org/stable/modules/tree.html#regression-criteria for the formula. (Feel free to add Friedman if you feel like it ;) ) Also, a small paragraph in https://scikit-learn.org/stable/modules/tree.html#regression to distinguish the criteria would be helpful, e.g. mae is less sensitive to outliers, friedman_mse is (?????), and poisson is (?????).
@NicolasHug I added the Poisson deviance to the user guide and also did some more improvements. Please tell me if you like them. About the
@NicolasHug I merged master to resolve merge conflicts. I think I addressed all the comments from your first review pass. Anything else I can do?
Thank you for the PR @lorentzenchr
This is a partial review.
@NicolasHug @thomasjpfan I hope you're still able to see the forest for the trees, in particular for boosted and histogrammed trees. :smirk: Just a little, well-meant reminder for this PR.
I tried to push an empty commit but it looks like I'm not allowed to push to this PR. Could you just merge with master @lorentzenchr so that the docs get rendered?
Thanks for the patience @lorentzenchr, made another pass.
It would be nice to have some high-level tests ensuring that using Poisson results in a better score than using e.g. MSE on a specific problem where Poisson is expected to perform better.
I also feel like the UG could be a bit more detailed on the cases in which one might want to use the Poisson criterion vs MSE/MAE.
Also, could you quickly benchmark the Poisson criterion vs MSE? I'm curious about the "overhead" incurred in children_impurity.
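Something along these lines, perhaps (purely a sketch: the data, sizes, and depth are made up, and it assumes the criterion from this PR; with a sensible seed the assertion should hold):

    import numpy as np
    from sklearn.metrics import mean_poisson_deviance
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(42)
    X = rng.normal(size=(500, 4))
    y = rng.poisson(lam=np.exp(X[:, 0]))  # target really is Poisson distributed
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    deviance = {}
    for criterion in ("mse", "poisson"):
        reg = DecisionTreeRegressor(criterion=criterion, max_depth=4, random_state=0)
        reg.fit(X_train, y_train)
        # clip so the deviance stays defined even if an "mse" leaf predicts 0
        pred = np.clip(reg.predict(X_test), 1e-15, None)
        deviance[criterion] = mean_poisson_deviance(y_test, pred)
    assert deviance["poisson"] <= deviance["mse"]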
    @@ -1950,6 +1966,30 @@ def test_apply_path_readonly_all_trees(name):
        check_apply_path_readonly(name)


    @pytest.mark.parametrize("criterion", ["mse", "friedman_mse", "poisson"])
    @pytest.mark.parametrize("Tree", REG_TREES.values())
    def test_balance_property(criterion, Tree):
nice test!
Thanks. An often neglected but very important property. I'm wondering if some common test would make sense - maybe not.
you mean a common test for which other estimators?
I think we could definitely have this for the histogram-GBDT trees but I can't think of any other one
It would be for a lot of models using MSE, Tweedie deviance, or log loss. Counterexamples are linear models without intercept and uncentered data.
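A quick sanity argument (my summary, not from the thread) for why mean-type losses have this property: in each leaf, the fitted value c minimizes the in-leaf loss, and for both the squared error sum_i w_i * (y_i - c)^2 and the Poisson deviance 2 * sum_i w_i * (y_i * log(y_i / c) - y_i + c), setting the derivative with respect to c to zero yields c = sum_i w_i * y_i / sum_i w_i, the weighted mean. Summing the predictions over all leaves then reproduces sum_i w_i * y_i exactly. The median, which minimizes the absolute error, has no such first-order property.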
sklearn/tree/_criterion.pyx (Outdated)

    impurity_right[0] = 0.
    for k in range(self.n_outputs):
        y_mean = self.sum_right[k] / self.weighted_n_right
        for p in range(pos, end):
            i = self.samples[p]

            if self.sample_weight != NULL:
                w = self.sample_weight[i]

            impurity_right[0] += w * xlogy(y[i, k], y[i, k] / y_mean)
    impurity_right[0] /= self.weighted_n_right * self.n_outputs
This bit of code seems to be duplicated 3 times. Would it make sense to extract it into a private method so that we could use it here and in node_impurity?
Same could be done for MAE.
@NicolasHug Thanks for your last/next review round.
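A hypothetical sketch of the suggested extraction, written as Python-style pseudocode for the Cython class (the method name and signature are my invention, not the PR's actual refactoring):

    def _poisson_loss(self, start, end, y_sum, weighted_n):
        """Poisson deviance (omitting constant terms) of samples[start:end]."""
        loss = 0.0
        w = 1.0
        for k in range(self.n_outputs):
            y_mean = y_sum[k] / weighted_n
            for p in range(start, end):
                i = self.samples[p]
                if self.sample_weight is not None:  # NULL check in Cython
                    w = self.sample_weight[i]
                loss += w * xlogy(self.y[i, k], self.y[i, k] / y_mean)
        return loss / (weighted_n * self.n_outputs)

node_impurity and children_impurity could then delegate to it, e.g. impurity_right[0] = self._poisson_loss(pos, end, self.sum_right, self.weighted_n_right).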
sklearn/tree/_criterion.pyx (Outdated)

            impurity += w * xlogy(self.y[i, k], self.y[i, k] / y_mean)

        return impurity / (self.weighted_n_node_samples * self.n_outputs)
Do we have a test for sample weight?
There is check_sample_weights_invariance in estimator_checks.py. IIUC, the common tests use only the default arguments, right? That means only DecisionTreeRegressor(criterion="mse") is tested.
Can I add check_sample_weights_invariance here in test_tree.py (for all regression criteria)?
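Perhaps along these lines (a sketch; it assumes check_sample_weights_invariance can be called directly with an estimator name and instance, like the other common checks):

    import pytest
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.utils.estimator_checks import check_sample_weights_invariance

    @pytest.mark.parametrize("criterion", ["mse", "friedman_mse", "mae", "poisson"])
    def test_sample_weight_invariance(criterion):
        # unit sample weights must give the same tree as no weights at all
        reg = DecisionTreeRegressor(criterion=criterion, random_state=42)
        check_sample_weights_invariance("DecisionTreeRegressor", reg)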
    # Choose a training set with non-negative targets (for poisson)
    X, y = diabetes.data, diabetes.target
    reg = Tree(criterion=criterion)
This test feels like it should pass regardless of random_state, but do we want to set it anyway, just to make the test deterministic (for different platforms etc.)?
Good question! I can only tell that the test has to hold regardless of random_state. It is a property of the tree and the training data, no matter how the tree parameters (except criterion) were set.
    y = [0, 0, 0, 0, 1, 2, 3, 4]
    reg = DecisionTreeRegressor(criterion="poisson", random_state=1)
    reg.fit(X, y)
    assert np.all(reg.predict(X) > 0)
Interestingly, the impurity value for those nodes that predict zero is np.nan.
Is this issue specific to this PR, or a general issue with the tree splitting?
LGTM
Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Feature|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
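For illustration, an entry in the style of the surrounding ones might read (the wording is mine, not from the PR):

    - |Feature| :class:`tree.DecisionTreeRegressor` and
      :class:`tree.ExtraTreeRegressor` now support the new splitting criterion
      ``criterion="poisson"``. :pr:`17386` by :user:`lorentzenchr`.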
I added tests for
@glemaitre, two approvals here: shall we merge? Thanks!
LGTM. Only 2-3 nits.
I would be in favour of merging afterwards.
@glemaitre Thanks for looking into this PR. I addressed your comments. Note that this PR enables
Should I mention this in the what's new entry and docstring? So far, I did not, as tests are not yet implemented for those.
I agree that I would prefer a separate PR. In addition to the tests and docstring, we might also need to illustrate this via an example.
Oopsy I forgot to remove [MRG], my bad.
Reference Issues/PRs

Contributes to #16668.

What does this implement/fix? Explain your changes.

This PR adds a Poisson splitting criterion to DecisionTreeRegressor and ExtraTreeRegressor with the option criterion="poisson".

Any other comments?

The implemented splitting criterion forbids splits with child nodes having sum(y_i) = 0, as non-positive predictions are not allowed for Poisson regression/loss. This is the simplest solution to this problem, i.e. without introducing another option.

If this is good to go and merged, the additional option "poisson" can be propagated to RandomForestRegressor and other tree-based ensemble regressors.
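For readers of this thread, a minimal usage sketch of the new option (synthetic data; assumes the PR as described above):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(size=(200, 3))
    y = rng.poisson(lam=np.exp(1 + 2 * X[:, 0]))  # non-negative count targets

    reg = DecisionTreeRegressor(criterion="poisson", max_depth=3, random_state=0)
    reg.fit(X, y)
    # splits that would give a child with sum(y_i) = 0 are forbidden,
    # so all leaf values, and hence all predictions, are strictly positive
    print(reg.predict(X[:5]))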