Skip to content

[MRG] Tree apply #4065

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed

[MRG] Tree apply #4065

wants to merge 8 commits into from

Conversation

galv
Copy link
Contributor

@galv galv commented Jan 8, 2015

apply is now a public method for all decision trees, for #3832.

Added docstring and example demonstrating two uses of apply(): Reducing number of classes to predict, and make a one-hot feature encoding.

There is currently no function to do the one-hot encoding automatically. Should we consider adding one? transform() is already used to select only the features with highest importance.

@amueller
Copy link
Member

amueller commented Jan 8, 2015

The issue with the transform is one of the reasons I added the RandomTreeEmbedding. Having that for trees that are trained in a supervised way would be good, though.

@amueller
Copy link
Member

amueller commented Jan 8, 2015

Should we test something?

@jnothman
Copy link
Member

jnothman commented Jan 8, 2015

Something is tested, but in a very awkward place.

@amueller
Copy link
Member

amueller commented Jan 8, 2015

I'm blind apparently. But you are right, that is not the right place.

@galv
Copy link
Contributor Author

galv commented Jan 8, 2015

I was also hesitant about that inserted line of test code.

It was my understanding that the only need in making a public API was that the input matrix X be converted to the correct format. From reading the code, I felt this meant two things:

  • Convert 64-bit floating point data to 32-bit floating point data.
  • Ensure all sparse matrices are CSC matrices, or convert them if not.

So these tests are indeed bad in hindsight, as they simply check that calling the private method returns the same as calling the public method, instead of these conditions. I will look more thoroughly through the test file.

Though looking back I now realize that this misses the edge case where the indices of X are not 32-bit ints. (My understanding of what a correct input is was gleaned from _check_input() in _tree.pyx)

validation.check_array() does not check for this. I can add a manual check of this in apply(), in addition to validation.check_array(), at the cost of code duplication; otherwise, I would have to expose TreeBuilder's _check_input() in to the python code by changing its cdef declaration cpdef in _tree.pxd and _tree.pyx to do all the requisite checks.

I'm hesitant to do the latter since the ensemble methods build off the decision tree, and a cpdef declaration would add overhead. Not to mention __check_input() makes assumption which don't apply here, such as that a TreeBuilder would be present, and that there are y values to be estimated, so the code would look somewhat obscure.

Suggestions on which route to take?

@glouppe
Copy link
Contributor

glouppe commented Jan 8, 2015

There is currently no function to do the one-hot encoding automatically. Should we consider adding one?

I would not do this automatically, but let the user decides what to do with the leaf ids instead.

@@ -1137,6 +1137,8 @@ def check_explicit_sparse_zeros(tree, max_depth=3,
Xs = (X_test, X_sparse_test)
for X1, X2 in product(Xs, Xs):
assert_array_almost_equal(s.tree_.apply(X1), d.tree_.apply(X2))
assert_array_almost_equal(s.apply(X1), d.apply(X2))
assert_array_almost_equal(s.apply(X1), s.tree_.apply(X1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write instead an independent test to only check the correctness of apply.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, the test is there, but it should be removed here, right?

@amueller
Copy link
Member

Can you please rebase?

@amueller amueller changed the title Tree apply [MRG] Tree apply Jan 15, 2015
@amueller
Copy link
Member

amueller commented Mar 9, 2015

there is a test error (sorry for late reply).

@@ -1268,3 +1270,38 @@ def check_min_weight_leaf_split_level(name):
def test_min_weight_leaf_split_level():
for name in ALL_TREES:
yield check_min_weight_leaf_split_level, name
<<<<<<< HEAD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge error here.

@glouppe
Copy link
Contributor

glouppe commented Apr 11, 2015

Fixed by #4488

@glouppe glouppe closed this Apr 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants