[MRG] ENH add Poisson splitting criterion for single trees #17386
Conversation
thanks @lorentzenchr, made a first pass, looks good
we'll also need some docs in the UG ;)
sklearn/tree/tests/test_tree.py (Outdated)

    score = mean_squared_error(diabetes.target, reg.predict(diabetes.data))
    assert score < 60 and score > 0, (
        f"Failed with {name}, criterion = {criterion} and score = {score}"
    loss = mean_squared_error(diabetes.target, reg.predict(diabetes.data))
Does it make sense to check the MSE when the tree was trained with a Poisson criterion? The need for a limit of 4000 seems to indicate that the check is super loose?
Also a nit, but "score" seemed like the correct term here; there's no loss in these trees.
MSE elicits the expectation, and so does the Poisson deviance. So yes, it makes sense.
I tried to use mean_poisson_deviance, but that did not change much, i.e. it still needs a high max_loss. I don't know why.
"score" in scikit-learn means "higher values are better". Error and loss are, for me, often synonyms. I can change it to "error"?
OK for using loss.
I'm still skeptical about the looseness of the check: enforcing an MSE bound of 4000 when other criteria achieve 100 times better than this. It seems to reveal that either there's something wrong in the Poisson deviance, or this test isn't relevant for Poisson?
I'm also skeptical here. I changed to using the Poisson deviance as the score for the "poisson" splitter. It's much better now.
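For reference, a minimal sketch of what scoring with the deviance looks like (the dataset matches the test, but max_depth is illustrative, not the test suite's value):

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_poisson_deviance
    from sklearn.tree import DecisionTreeRegressor

    X, y = load_diabetes(return_X_y=True)  # strictly positive targets
    reg = DecisionTreeRegressor(criterion="poisson", max_depth=5, random_state=0)
    reg.fit(X, y)
    # the deviance requires strictly positive predictions, which "poisson" guarantees
    print(mean_poisson_deviance(y, reg.predict(X)))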
sklearn/tree/tests/test_tree.py (Outdated)

    @pytest.mark.parametrize("criterion", ["mse", "poisson"])
    @pytest.mark.parametrize("tree", REG_TREES.values())
    def test_balance_property(criterion, tree):
        """Test that sum(y_pred)=sum(y_true) on training set."""
Maybe add a comment that this can't be true for mae, because the median is used instead of the mean (if that's true, I haven't double-checked)?
What about friedman_mse?
I added "friedman_mse"
- I had not double checked - and a comment about mae and median.
This could also be tested for each leaf.
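To make the property concrete, a hedged sketch of the check under discussion (criterion names as spelled at the time of this PR; "mae" is excluded because its leaves predict medians, not means):

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.tree import DecisionTreeRegressor

    X, y = load_diabetes(return_X_y=True)
    for criterion in ("mse", "friedman_mse", "poisson"):
        reg = DecisionTreeRegressor(criterion=criterion, random_state=0).fit(X, y)
        # every leaf predicts the mean of its training samples, so the sums match
        assert np.isclose(reg.predict(X).sum(), y.sum())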
    impurity:
        1/n * sum(y_i * log(y_i / y_pred))
    """
    # FIXME in 0.25:
Is this relevant considering we have proxy_impurity_improvement?
Isn't the impurity calculated for each node (after a split)? If so, the FIXME might improve fit time, as xlogy must be calculated for every sample in the node; there is no algebraic simplification. I have not done benchmarks though.
I guess my question becomes: do we really want to drop the constant terms, as suggested by the comment? Maybe we do that for other criteria, I haven't checked, but my understanding is that node_impurity should return the ground-truth impurity of the node?
If that's the case, I'd suggest moving this "FIXME" comment down to proxy_impurity_improvement to explain the proxy.
If not, we should maybe indicate that this 'trick' could also be used in children_impurity, so that we can compute both left and right impurity in one pass instead of 2.
Might be worth considering 😄
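For what it's worth, a small derivation (mine, not from the PR) of why dropping the constant terms is safe when only ranking splits: for a child c with weighted mean y_bar_c, weighted_n_c * impurity_c = sum_{i in c} w_i * y_i * log(y_i) - log(y_bar_c) * sum_{i in c} w_i * y_i. Summed over the left and right children, the first term always covers exactly the parent's samples and is therefore identical for every candidate split of that node, so a proxy that keeps only the -log(y_bar_c) * sum_{i in c} w_i * y_i parts orders splits the same way.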
The user guide does not mention
Yes, under scikit-learn.org/stable/modules/tree.html#regression-criteria for the formula. (Feel free to add Friedman if you feel like it ;) ) Also, a small paragraph in https://scikit-learn.org/stable/modules/tree.html#regression to distinguish the criteria would be helpful, e.g. mae is less sensitive to outliers, friedman_mse is (?????), and poisson is (?????).
@NicolasHug I added the Poisson deviance to the user guide and also did some more improvements. Please tell me if you like them. About the
@NicolasHug I merged master to resolve merge conflicts. I think I addressed all the comments from your first review pass. Anything else I can do?
Thank you for the PR @lorentzenchr
This is a partial review.
@NicolasHug @thomasjpfan I hope you're still able to see the forest for the trees, in particular for boosted and histogrammed trees. :smirk: Just a little, well-meant reminder for this PR.
I tried to push an empty commit but it looks like I'm not allowed to push to this PR. Could you just merge with master @lorentzenchr so that the docs get rendered?
Thanks for the patience @lorentzenchr, made another pass.
It would be nice to have some high-level tests ensuring that using Poisson results in a better score than using e.g. MSE on a specific problem where Poisson is expected to perform better.
I also feel like the UG could be a bit more detailed on the cases in which one might want to use the Poisson criterion vs MSE/MAE.
Also, could you quickly benchmark the Poisson criterion vs MSE? I'm curious about the "overhead" incurred in children_impurity.
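Something along these lines, perhaps (purely a sketch: the data, sizes, and depth are made up, and it assumes the criterion from this PR; with a sensible seed the assertion should hold):

    import numpy as np
    from sklearn.metrics import mean_poisson_deviance
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(42)
    X = rng.normal(size=(500, 4))
    y = rng.poisson(lam=np.exp(X[:, 0]))  # target really is Poisson distributed
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    deviance = {}
    for criterion in ("mse", "poisson"):
        reg = DecisionTreeRegressor(criterion=criterion, max_depth=4, random_state=0)
        reg.fit(X_train, y_train)
        # clip so the deviance stays defined even if an "mse" leaf predicts 0
        pred = np.clip(reg.predict(X_test), 1e-15, None)
        deviance[criterion] = mean_poisson_deviance(y_test, pred)
    assert deviance["poisson"] <= deviance["mse"]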
    @@ -1950,6 +1966,30 @@ def test_apply_path_readonly_all_trees(name):
        check_apply_path_readonly(name)


    @pytest.mark.parametrize("criterion", ["mse", "friedman_mse", "poisson"])
    @pytest.mark.parametrize("Tree", REG_TREES.values())
    def test_balance_property(criterion, Tree):
nice test!
Thanks. An often neglected but very important property. I'm wondering if some common test would make sense - maybe not.
you mean a common test for which other estimators?
I think we could definitely have this for the histogram-GBDT trees but I can't think of any other one
It would be for a lot of models using MSE, Tweedie deviance, or log loss. Counterexamples are linear models without intercept and uncentered data.
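A quick sanity argument (my summary, not from the thread) for why mean-type losses have this property: in each leaf, the fitted value c minimizes the in-leaf loss, and for both the squared error sum_i w_i * (y_i - c)^2 and the Poisson deviance 2 * sum_i w_i * (y_i * log(y_i / c) - y_i + c), setting the derivative with respect to c to zero yields c = sum_i w_i * y_i / sum_i w_i, the weighted mean. Summing the predictions over all leaves then reproduces sum_i w_i * y_i exactly. The median, which minimizes the absolute error, has no such first-order property.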
sklearn/tree/_criterion.pyx (Outdated)

    impurity_right[0] = 0.
    for k in range(self.n_outputs):
        y_mean = self.sum_right[k] / self.weighted_n_right
        for p in range(pos, end):
            i = self.samples[p]

            if self.sample_weight != NULL:
                w = self.sample_weight[i]

            impurity_right[0] += w * xlogy(y[i, k], y[i, k] / y_mean)
    impurity_right[0] /= self.weighted_n_right * self.n_outputs
This bit of code seems to be duplicated 3 times. Would it make sense to extract it into a private method so that we could use it here and in node_impurity?
Same could be done for MAE.
@NicolasHug Thanks for your last/next review round.
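A hypothetical sketch of the suggested extraction, written as Python-style pseudocode for the Cython class (the method name and signature are my invention, not the PR's actual refactoring):

    def _poisson_loss(self, start, end, y_sum, weighted_n):
        """Poisson deviance (omitting constant terms) of samples[start:end]."""
        loss = 0.0
        w = 1.0
        for k in range(self.n_outputs):
            y_mean = y_sum[k] / weighted_n
            for p in range(start, end):
                i = self.samples[p]
                if self.sample_weight is not None:  # NULL check in Cython
                    w = self.sample_weight[i]
                loss += w * xlogy(self.y[i, k], self.y[i, k] / y_mean)
        return loss / (weighted_n * self.n_outputs)

node_impurity and children_impurity could then delegate to it, e.g. impurity_right[0] = self._poisson_loss(pos, end, self.sum_right, self.weighted_n_right).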
sklearn/tree/_criterion.pyx (Outdated)

            impurity += w * xlogy(self.y[i, k], self.y[i, k] / y_mean)

        return impurity / (self.weighted_n_node_samples * self.n_outputs)
Do we have a test for sample weight?
There is check_sample_weights_invariance in estimator_checks.py. IIUC, the common tests use only the default arguments, right? That means only DecisionTreeRegressor(criterion="mse") is tested.
Can I add check_sample_weights_invariance here in test_tree.py (for all regression criteria)?
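Perhaps along these lines (a sketch; it assumes check_sample_weights_invariance can be called directly with an estimator name and instance, like the other common checks):

    import pytest
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.utils.estimator_checks import check_sample_weights_invariance

    @pytest.mark.parametrize("criterion", ["mse", "friedman_mse", "mae", "poisson"])
    def test_sample_weight_invariance(criterion):
        # unit sample weights must give the same tree as no weights at all
        reg = DecisionTreeRegressor(criterion=criterion, random_state=42)
        check_sample_weights_invariance("DecisionTreeRegressor", reg)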
    # Choose a training set with non-negative targets (for poisson)
    X, y = diabetes.data, diabetes.target
    reg = Tree(criterion=criterion)
This test feels like it should pass regardless of random_state, but do we want to set it anyway, just to make the test deterministic (for different platforms etc.)?
Good question! I can only tell that the test has to hold regardless of random_state. It is a property of the tree and the training data, no matter how the tree parameters (except criterion) were set.
    y = [0, 0, 0, 0, 1, 2, 3, 4]
    reg = DecisionTreeRegressor(criterion="poisson", random_state=1)
    reg.fit(X, y)
    assert np.all(reg.predict(X) > 0)
Interestingly, the impurity value for those nodes that predict zero is np.nan.
Is this issue specific to this PR, or a general issue with the tree splitting?
LGTM
Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Feature|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
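For illustration, an entry in the style of the surrounding ones might read (the wording is mine, not from the PR):

    - |Feature| :class:`tree.DecisionTreeRegressor` and
      :class:`tree.ExtraTreeRegressor` now support the new splitting criterion
      ``criterion="poisson"``. :pr:`17386` by :user:`lorentzenchr`.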
I added tests for
@glemaitre, two approvals here: shall we merge? Thanks!
LGTM. Only 2-3 nits.
I would be in favour of merging afterwards.
@glemaitre Thanks for looking into this PR. I addressed your comments. Note that this PR enables
Should I mention this in the what's new entry and docstring? So far, I did not, as tests are not yet implemented for those.
I agree that I would prefer a separate PR. In addition to the tests and docstring, we might also need to illustrate this via an example.
Oopsy I forgot to remove [MRG], my bad.
Reference Issues/PRs

Contributes to #16668.

What does this implement/fix? Explain your changes.

This PR adds a Poisson splitting criterion to DecisionTreeRegressor and ExtraTreeRegressor with the option criterion="poisson".

Any other comments?

The implemented splitting criterion forbids splits with child nodes having sum(y_i) = 0, as non-positive predictions are not allowed for Poisson regression/loss. This is the simplest solution to this problem, i.e. without introducing another option.

If this is good to go and merged, the additional option "poisson" can be propagated to RandomForestRegressor and other tree-based ensemble regressors.
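For readers of this thread, a minimal usage sketch of the new option (synthetic data; assumes the PR as described above):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(size=(200, 3))
    y = rng.poisson(lam=np.exp(1 + 2 * X[:, 0]))  # non-negative count targets

    reg = DecisionTreeRegressor(criterion="poisson", max_depth=3, random_state=0)
    reg.fit(X, y)
    # splits that would give a child with sum(y_i) = 0 are forbidden,
    # so all leaf values, and hence all predictions, are strictly positive
    print(reg.predict(X[:5]))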