-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Tree apply #4065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Tree apply #4065
Conversation
The issue with the transform is one of the reasons I added the |
Should we test something? |
Something is tested, but in a very awkward place. |
I'm blind apparently. But you are right, that is not the right place. |
I was also hesitant about that inserted line of test code. It was my understanding that the only need in making a public API was that the input matrix X be converted to the correct format. From reading the code, I felt this meant two things:
So these tests are indeed bad in hindsight, as they simply check that calling the private method returns the same as calling the public method, instead of these conditions. I will look more thoroughly through the test file. Though looking back I now realize that this misses the edge case where the indices of X are not 32-bit ints. (My understanding of what a correct input is was gleaned from _check_input() in _tree.pyx) validation.check_array() does not check for this. I can add a manual check of this in apply(), in addition to validation.check_array(), at the cost of code duplication; otherwise, I would have to expose TreeBuilder's _check_input() in to the python code by changing its cdef declaration cpdef in _tree.pxd and _tree.pyx to do all the requisite checks. I'm hesitant to do the latter since the ensemble methods build off the decision tree, and a cpdef declaration would add overhead. Not to mention __check_input() makes assumption which don't apply here, such as that a TreeBuilder would be present, and that there are y values to be estimated, so the code would look somewhat obscure. Suggestions on which route to take? |
I would not do this automatically, but let the user decides what to do with the leaf ids instead. |
@@ -1137,6 +1137,8 @@ def check_explicit_sparse_zeros(tree, max_depth=3, | |||
Xs = (X_test, X_sparse_test) | |||
for X1, X2 in product(Xs, Xs): | |||
assert_array_almost_equal(s.tree_.apply(X1), d.tree_.apply(X2)) | |||
assert_array_almost_equal(s.apply(X1), d.apply(X2)) | |||
assert_array_almost_equal(s.apply(X1), s.tree_.apply(X1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write instead an independent test to only check the correctness of apply
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, the test is there, but it should be removed here, right?
Can you please rebase? |
there is a test error (sorry for late reply). |
@@ -1268,3 +1270,38 @@ def check_min_weight_leaf_split_level(name): | |||
def test_min_weight_leaf_split_level(): | |||
for name in ALL_TREES: | |||
yield check_min_weight_leaf_split_level, name | |||
<<<<<<< HEAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge error here.
Fixed by #4488 |
apply is now a public method for all decision trees, for #3832.
Added docstring and example demonstrating two uses of apply(): Reducing number of classes to predict, and make a one-hot feature encoding.
There is currently no function to do the one-hot encoding automatically. Should we consider adding one? transform() is already used to select only the features with highest importance.