[MRG] ENH add Poisson splitting criterion for single trees #17386

Merged
merged 33 commits into scikit-learn:master on Nov 2, 2020

Conversation

lorentzenchr
Member

Reference Issues/PRs

Contributes to #16668.

What does this implement/fix? Explain your changes.

This PR adds a Poisson splitting criterion to DecisionTreeRegressor and ExtraTreeRegressor with option criterion="poisson".

Any other comments?

The implemented splitting criterion forbids splits that would produce a child node with sum(y_i) = 0, as non-positive predictions are not allowed for Poisson regression/loss. This is the simplest solution to this problem, i.e. it does not introduce another option.
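
As a minimal usage sketch (the diabetes data and the max_depth value below are only for illustration; any non-negative target works):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)  # non-negative target values

reg = DecisionTreeRegressor(criterion="poisson", max_depth=3, random_state=0)
reg.fit(X, y)

# Splits that would create a child node with sum(y_i) = 0 are forbidden,
# so all predictions are strictly positive.
assert np.all(reg.predict(X) > 0)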

If this is good to go and merged, the additional option "poisson" can be propagated to RandomForestRegressor and other tree based ensemble regressors.

@lorentzenchr lorentzenchr changed the title [WIP] ENH add Poisson splitting criterion for single trees [MRG] ENH add Poisson splitting criterion for single trees May 29, 2020
Member

@NicolasHug NicolasHug left a comment

thanks @lorentzenchr , made a first pass, looks good

we'll also need some docs to the UG ;)

score = mean_squared_error(diabetes.target, reg.predict(diabetes.data))
assert score < 60 and score > 0, (
f"Failed with {name}, criterion = {criterion} and score = {score}"
loss = mean_squared_error(diabetes.target, reg.predict(diabetes.data))
Member

Does it make sense to check the MSE when the tree was trained with a Poisson criterion? The need for a limit of 4000 seems to indicate that the check is super loose?

Member

Also a nit, but score seemed like the correct term here. There's no loss in these trees.

Member Author

MSE elicits the expectation, and so does the Poisson deviance. So yes, it makes sense.
I tried to use mean_poisson_deviance, but that did not change much, i.e. it still needs a high max_loss. I don't know why.

"score" in scikit-learn means "higher values are better". Error and loss are, for me, often synonyms. I can change it to "error"?

Member

ok for using loss.

I'm still skeptical about the looseness of the check: allowing an MSE of up to 4000 when other criteria achieve 100 times better than this. It seems that this reveals that either there's something wrong in the Poisson deviance, or that this test isn't relevant for Poisson?

Member Author

I'm also skeptical here. I changed to using the Poisson deviance as the score for the "poisson" splitter. It's much better now.
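
For context, a self-contained sketch of scoring a "poisson"-criterion tree with the matching metric (the max_depth here is illustrative, not the value used in the test):

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_poisson_deviance, mean_squared_error
from sklearn.tree import DecisionTreeRegressor

diabetes = load_diabetes()  # strictly positive targets, so "poisson" applies

reg = DecisionTreeRegressor(criterion="poisson", max_depth=4, random_state=0)
reg.fit(diabetes.data, diabetes.target)
pred = reg.predict(diabetes.data)

# Scoring the Poisson-trained tree with the criterion-matched metric gives a
# much tighter check than bounding its MSE.
print("MSE:             ", mean_squared_error(diabetes.target, pred))
print("Poisson deviance:", mean_poisson_deviance(diabetes.target, pred))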

@pytest.mark.parametrize("criterion", ["mse", "poisson"])
@pytest.mark.parametrize("tree", REG_TREES.values())
def test_balance_property(criterion, tree):
"""Test that sum(y_pred)=sum(y_true) on training set."""
Member

Maybe add a comment that this can't be true for mae because the median is used instead of the mean (if that's true, I haven't double-checked)?
What about friedman_mse?

Member Author

I added "friedman_mse" - I had not double checked - and a comment about mae and median.
This could also be tested for each leaf.
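
For reference, a self-contained sketch of the balance property being checked (criterion names as in this PR; "mae" is excluded because its leaves predict the median rather than the mean; the actual test code differs):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

for criterion in ("mse", "friedman_mse", "poisson"):
    # Limit the depth so that each leaf aggregates several samples.
    reg = DecisionTreeRegressor(criterion=criterion, max_depth=3, random_state=0)
    reg.fit(X, y)
    # Leaf values are (weighted) means of the training samples in the leaf,
    # hence sum(y_pred) == sum(y_true) on the training set.
    np.testing.assert_allclose(reg.predict(X).sum(), y.sum())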

impurity:
1/n * sum(y_i * log(y_i / y_pred))
"""
# FIXME in 0.25:
Member

Is this relevant considering we have proxy_impurity_improvement?

Member Author

@lorentzenchr lorentzenchr Jun 3, 2020

Isn't the impurity calculated for each node (after a split)? If so, the FIXME might improve fit time, as xlogy must be calculated for every sample in the node and there is no algebraic simplification.
I have not done benchmarks, though.

Member

I guess my question becomes: do we really want to drop the constant terms, as suggested by the comment? Maybe we do that for other criteria, I haven't checked, but my understanding is that node_impurity should return the ground truth impurity of the node?

If that's the case, I'd suggest moving this "FIXME" comment down to proxy_impurity_improvement to explain the proxy

If not, we should maybe indicate that this 'trick' could also be used in children_impurity, so that we can compute both left and right impurity in one pass instead of 2.
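
To make the "drop constant terms" idea concrete (a hedged illustration, not the PR's Cython code): the sum(y_i * log(y_i)) part of the impurity is identical for a parent node and the union of its two children, so it cancels in the impurity improvement, and ranking candidate splits only needs the remaining sum(y_child) * log(mean(y_child)) terms. A small numerical check of this claim:

import numpy as np
from scipy.special import xlogy

rng = np.random.RandomState(0)
y = rng.poisson(lam=3.0, size=50) + 1.0  # strictly positive targets

def poisson_impurity(y_node):
    # Half Poisson deviance of a node, as in the docstring above.
    return np.mean(xlogy(y_node, y_node / y_node.mean()))

def full_improvement(split):
    n, n_left = len(y), split
    left, right = y[:split], y[split:]
    return (poisson_impurity(y)
            - (n_left / n) * poisson_impurity(left)
            - ((n - n_left) / n) * poisson_impurity(right))

def proxy_improvement(split):
    # Keep only the terms that depend on the split position.
    left, right = y[:split], y[split:]
    return left.sum() * np.log(left.mean()) + right.sum() * np.log(right.mean())

splits = np.arange(5, 45)
best_full = splits[np.argmax([full_improvement(s) for s in splits])]
best_proxy = splits[np.argmax([proxy_improvement(s) for s in splits])]
assert best_full == best_proxy  # both orderings pick the same best split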

@lorentzenchr
Member Author

we'll also need some docs to the UG ;)

Might be worth considering 😄

@lorentzenchr
Member Author

The user guide does not mention "friedman_mse". Where should "poisson" be mentioned? Under https://scikit-learn.org/stable/modules/tree.html#regression-criteria?

@NicolasHug
Member

Yes under scikit-learn.org/stable/modules/tree.html#regression-criteria for the formula. (Feel free to add Friedman if you feel like it ;) )

Also a small paragraph in https://scikit-learn.org/stable/modules/tree.html#regression to distinguish criteria would be helpful. E.g. mae is less sensitive to outliers, friedman mse is (?????), and poisson is (?????)

@lorentzenchr
Member Author

@NicolasHug I added the Poisson deviance to the user guide and also made some further improvements. Please tell me if you like them.

About the "friedman_mse" - maybe better as separate issue: It is never mentioned in the User Guide or examples. After consulting the literature [1], I don't understand why it is implemented/API-exposed for DecisionTreeRegressor. It seems to have its place in multiclass gradient-boosted trees.

[1] https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

@lorentzenchr
Member Author

@NicolasHug I merged master to resolve merge conflicts. I think I addressed all the comments from your first review pass. Anything else I can do?

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR @lorentzenchr

This is a partial review.

@lorentzenchr
Member Author

@NicolasHug @thomasjpfan I hope you're still able to see the forest for the trees, in particular for boosted and histogrammed trees. :smirk: Just a little, well-meant reminder for this PR.

@NicolasHug
Member

I tried to push an empty commit but it looks like I'm not allowed to push to this PR. Could you just merge with master @lorentzenchr so that the docs get rendered?

Member

@NicolasHug NicolasHug left a comment

thanks for the patience @lorentzenchr , made another pass

It would be nice to have some high-level tests ensuring that using Poisson results in a better score than using e.g. MSE on a specific problem where Poisson is expected to perform better.

I also feel like the UG could be a bit more detailed about the cases in which one might want to use the Poisson criterion vs MSE/MAE.

Also, could you quickly benchmark the Poisson criterion vs MSE? I'm curious about the "overhead" incurred in children_impurity.
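
One possible shape for such a comparison (a sketch on simulated data with the criterion names from this PR, not the test that was eventually added):

import numpy as np
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
n_samples, n_features = 1000, 5
X = rng.normal(size=(n_samples, n_features))
# Poisson-distributed target with a log-link dependence on the first two features.
y = rng.poisson(lam=np.exp(X[:, 0] + X[:, 1]))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for criterion in ("mse", "poisson"):
    reg = DecisionTreeRegressor(criterion=criterion, max_depth=5, random_state=0)
    reg.fit(X_train, y_train)
    # Clip to keep predictions strictly positive, as the deviance requires;
    # only the "mse" tree can predict exactly zero.
    pred = np.clip(reg.predict(X_test), 1e-15, None)
    results[criterion] = mean_poisson_deviance(y_test, pred)

# On such data one would expect the Poisson criterion to do at least as well.
print(results)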

@@ -1950,6 +1966,30 @@ def test_apply_path_readonly_all_trees(name):
check_apply_path_readonly(name)


@pytest.mark.parametrize("criterion", ["mse", "friedman_mse", "poisson"])
@pytest.mark.parametrize("Tree", REG_TREES.values())
def test_balance_property(criterion, Tree):
Member

nice test!

Member Author

Thanks. An often neglected, but very important property. I'm wondering if some common test would make sense - maybe not.

Member

@NicolasHug NicolasHug Aug 19, 2020

you mean a common test for which other estimators?

I think we could definitely have this for the histogram-GBDT trees but I can't think of any other one

Member Author

It would apply to a lot of models using MSE, Tweedie deviance or Log Loss. Counterexamples are linear models without an intercept and with uncentered data.

Comment on lines 1436 to 1446
impurity_right[0] = 0.
for k in range(self.n_outputs):
    y_mean = self.sum_right[k] / self.weighted_n_right
    for p in range(pos, end):
        i = self.samples[p]

        if self.sample_weight != NULL:
            w = self.sample_weight[i]

        impurity_right[0] += w * xlogy(y[i, k], y[i, k] / y_mean)
impurity_right[0] /= self.weighted_n_right * self.n_outputs
Member

this bit of code seems to be duplicated 3 times. Would it make sense to extract it into a private method so that we could use it here and in node_impurity?

Member Author

Same could be done for MAE.

@lorentzenchr
Member Author

lorentzenchr commented Aug 19, 2020

@NicolasHug Thanks for your latest review round.

  • I decided to keep the FIXME comment and elaborate a bit more.
  • Added a test poisson vs mse; so far it is failing and I have no clue why.
  • So far, I'd like to focus on the implementation rather than the User Guide (maybe a separate issue).
  • Refer to the GLM User Guide and add a note about when to use criterion="poisson".


impurity += w * xlogy(self.y[i, k], self.y[i, k] / y_mean)

return impurity / (self.weighted_n_node_samples * self.n_outputs)
Member

Do we have a test for sample weight?

Member Author

There is check_sample_weights_invariance in estimator_checks.py. IIUC, the common tests use only the default arguments, right? That means, only DecisionTreeRegressor(criterion="mse") is tested.

Can I add check_sample_weights_invariance here in test_tree.py (for all regression criteria)?
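
A sketch of the kind of consistency check being discussed (integer weights versus repeated rows, criterion names as in this PR; the test actually added may differ, and exact equality can be sensitive to tie-breaking in the splitter):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
rng = np.random.RandomState(0)
sample_weight = rng.randint(0, 3, size=y.shape[0])

for criterion in ("mse", "friedman_mse", "poisson"):
    # Fitting with integer sample weights should be equivalent to fitting on
    # correspondingly repeated rows (a weight of 0 drops the sample entirely).
    tree_weighted = DecisionTreeRegressor(criterion=criterion, random_state=0)
    tree_weighted.fit(X, y, sample_weight=sample_weight)

    X_repeated = np.repeat(X, sample_weight, axis=0)
    y_repeated = np.repeat(y, sample_weight)
    tree_repeated = DecisionTreeRegressor(criterion=criterion, random_state=0)
    tree_repeated.fit(X_repeated, y_repeated)

    np.testing.assert_allclose(tree_weighted.predict(X), tree_repeated.predict(X))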


# Choose a training set with non-negative targets (for poisson)
X, y = diabetes.data, diabetes.target
reg = Tree(criterion=criterion)
Member

This test feels like it should pass regardless of random_state, but do we want to set it anyway, just to make the test deterministic (for different platforms, etc.)?

Member Author

Good question! I can only say that the test has to hold regardless of random_state. It is a property of the tree and the training data, no matter how the tree parameters (except criterion) were set.

y = [0, 0, 0, 0, 1, 2, 3, 4]
reg = DecisionTreeRegressor(criterion="poisson", random_state=1)
reg.fit(X, y)
assert np.all(reg.predict(X) > 0)
Member

Interestingly, the impurity value for those nodes that predict zero is np.nan.

Is this issue specific to this PR or a general issue with the tree splitting?

@cmarmo cmarmo added this to the 0.24 milestone Oct 8, 2020
Member

@thomasjpfan thomasjpfan left a comment

LGTM

Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Feature|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@lorentzenchr
Member Author

I added tests for sample_weight consistency, a what's new entry, and a mention of the slower fitting time for poisson and mae. Please note these changes before merging. And thank you both for your reviews!

@cmarmo
Contributor

cmarmo commented Oct 23, 2020

@glemaitre, two approvals here: shall we merge? Thanks!

Member

@glemaitre glemaitre left a comment

LGTM. Only 2-3 nits.
I would be in favour of merging afterwards.

@lorentzenchr
Member Author

lorentzenchr commented Nov 2, 2020

@glemaitre Thanks for looking into this PR. I addressed your comments.

Note that this PR enables criterion="poisson" not only in DecisionTreeRegressor but also automatically in

  • tree.ExtraTreeRegressor
  • ensemble.ExtraTreesRegressor
  • ensemble.RandomForestRegressor

Should I mention this in the what's new entry and docstring? So far, I did not as tests are not yet implemented for those.
So, I'd prefer a thorough separate PR after v0.24.

@glemaitre
Member

Should I mention this in the what's new entry and docstring? So far, I did not as tests are not yet implemented for those.
So, I'd prefer a thorough separate PR after v0.24.

I agree that I would prefer a separate PR. In addition to the tests and docstring, we might need to illustrate it with an example.
I think that we will be ready for the release before this subsequent PR.

@glemaitre glemaitre merged commit de59efe into scikit-learn:master Nov 2, 2020
@glemaitre
Member

Oopsy I forgot to remove [MRG], my bad.
Thanks @lorentzenchr
