[MRG] Fast PDP for DecisionTreeRegressor #15848
Conversation
CC @thomasjpfan and @glemaitre maybe
A first pass with a few comments/questions:
        response_method='auto')
    pdp_tree = _partial_dependence_brute(tree, grid, features, X,
                                         response_method='auto')
    assert np.allclose(pdp_gbdt, pdp_tree)
Using np.testing.assert_allclose yields more informative error messages in case of failure.
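For illustration, a minimal sketch of the difference (the array values here are made up, not from the test):

    import numpy as np

    pdp_gbdt = np.array([0.1, 0.2, 0.3])
    pdp_tree = np.array([0.1, 0.2, 0.3001])

    # A bare `assert np.allclose(pdp_gbdt, pdp_tree)` fails with an empty
    # AssertionError, while assert_allclose raises with a detailed report:
    # mismatch count, max absolute/relative differences, and both arrays.
    np.testing.assert_allclose(pdp_gbdt, pdp_tree)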
    @@ -206,6 +212,48 @@ def test_partial_dependence_helpers(est, method, target_feature):
        assert np.allclose(pdp, mean_predictions, rtol=rtol)

    @skip_if_32bit
Do you have any idea why this is necessary? The differences seem very large. Maybe you should try with a non-random dataset (e.g. make_regression) so that the tree and the GBRT model have a more numerically stable structure?
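A sketch of that suggestion (the sample sizes and seed are arbitrary, not from the PR):

    from sklearn.datasets import make_regression

    # A deterministic, linearly-generated dataset tends to give both the
    # single tree and the GBRT a more stable structure than pure noise.
    X, y = make_regression(n_samples=100, n_features=5, random_state=0)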
I suspect it comes from a split with equal gain, but I'm not sure; I was really confused. Now I'm properly testing the recursion method, so maybe we can remove this.
    pdp_gbdt = _partial_dependence_brute(gbdt, grid, features, X,
                                         response_method='auto')
    pdp_tree = _partial_dependence_brute(tree, grid, features, X,
Why do you test with the brute force method? Why not the recursion method, or the public partial_dependence function with auto mode?
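For reference, a sketch of how that comparison could go through the public API, assuming the recursion support this PR adds, the tuple return of scikit-learn 0.22, and that tree and X are the fitted estimator and data from the test:

    import numpy as np
    from sklearn.inspection import partial_dependence

    # 'recursion' traverses the fitted tree structure; 'brute' predicts on
    # modified copies of X for every grid point. For a plain
    # DecisionTreeRegressor the two should agree.
    pdp_rec, _ = partial_dependence(tree, X, features=[0],
                                    method='recursion')
    pdp_brute, _ = partial_dependence(tree, X, features=[0],
                                      method='brute')
    np.testing.assert_allclose(pdp_rec, pdp_brute)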
It should definitely be the recursion method, my bad.
Ok so the test is still failing with 32 bits: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=11194. I'm not sure why, since the prediction sanity check passes. Note that using
@ogrisel, I added a check to make sure that both trees are exactly equal. As you can see, this fails (only) for 32 bits. So naturally the PDP can't be equal either. Since this failure isn't related to the current PR, I would propose to add something like:

    if not trees_are_equal:
        assert is_32bits  # 32 bits doesn't grow the same tree for some reason
        return
    # check pdp here for all other platforms

Would that be OK? EDIT: did just that in c09565a
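The trees_are_equal check itself isn't shown in the thread; a hypothetical version (my guess, not necessarily what the PR uses) could compare the fitted Tree arrays directly:

    import numpy as np

    def trees_are_equal(a, b):
        """Hypothetical structural comparison of two fitted trees.

        a and b are sklearn Tree objects (estimator.tree_).
        """
        return (np.array_equal(a.children_left, b.children_left)
                and np.array_equal(a.children_right, b.children_right)
                and np.array_equal(a.feature, b.feature)
                and np.array_equal(a.threshold, b.threshold)
                and np.array_equal(a.value, b.value))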
I'll close in favor of #15864, which also adds the RandomForestRegressor support (for very few additional lines).
This PR implements the fast 'recursion' method for DecisionTreeRegressor. We're only exposing the method that already exists for the gradient boosting estimators.
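A minimal end-to-end sketch of what the PR enables (dataset and parameters are illustrative; the tuple return matches the scikit-learn 0.22-era API):

    from sklearn.datasets import make_regression
    from sklearn.inspection import partial_dependence
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=200, n_features=4, random_state=0)
    tree = DecisionTreeRegressor(random_state=0).fit(X, y)

    # With this PR, method='recursion' (the fast tree traversal previously
    # reserved for gradient boosting) also works for a plain decision tree.
    averaged_predictions, values = partial_dependence(
        tree, X, features=[0], method='recursion')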