
Apply method for trees #3832


Closed
amueller opened this issue Nov 6, 2014 · 14 comments
Labels
Easy, Enhancement

Comments

@amueller
Member

amueller commented Nov 6, 2014

I think it would be nice to add an apply method to the tree. Currently there is one in the RandomForest, but not in the tree. There is one in tree.tree_, but the tree object is not publicly documented. I think the idea was that we might want to change the structure of the tree object, so we don't make it public.
Still we could provide a public interface to the lower level functions so that people might find them more easily.

Opinions?

@amueller amueller added the Easy and Enhancement labels Nov 6, 2014
@ogrisel
Member

ogrisel commented Nov 21, 2014

+1 and for GB models as well.

@arjoly
Member

arjoly commented Nov 21, 2014

Why would it be useful to users? Do you have some applications (examples?) in mind?

@ogrisel
Member

ogrisel commented Nov 21, 2014

Making it easier to do this kind of transform (see the end of the notebook):

http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb
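
A minimal sketch of this kind of transform, written for illustration rather than taken from the notebook: each sample is mapped to the leaf it reaches in every tree of a forest, the leaf indices are one-hot encoded, and a linear model is fit on the result. The dataset, the hyperparameters, and the variable names are assumptions. In practice the forest and the linear model would usually be fit on separate data to limit overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = load_iris(return_X_y=True)

# Small forest; the sizes here are arbitrary choices for the sketch.
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# One leaf index per sample and per tree: shape (n_samples, n_estimators).
leaves = forest.apply(X)

# Treat each tree's leaf index as a categorical feature.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)

# Linear model on top of the sparse leaf-membership features.
linear = LogisticRegression(max_iter=1000).fit(leaf_features, y)
```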

@arjoly
Member

arjoly commented Nov 21, 2014

If my quick search of your notebook is right, what you want is a RandomTreesEmbedding that is not built from totally randomized trees. I am +1 for this idea. :-)

This doesn't seem to be an example / application for this issue.

@amueller
Member Author

I think we should add an example :)

@jnothman
Member

Briefly looking at Olivier's notebook, I have seen Jerome Friedman speak of a similar model as "rule ensembles", wherein one uses randomised trees as a means of extracting feature combinations that can then be weighted with logistic regression et al. I do not recall the details of his 2008 paper (or 2005 tech report) on the topic, but from his presentation, I gathered these rules could be the path to any node (starting from the root, or from any node, I can't recall which), not only to the leaf. In any case, it is a little different from what's given above. I think it's a nice idea in terms of producing models that can be understood.

@jnothman
Member

But I would also be unsurprised if it's a technique with many parallel reinventions...

@ogrisel
Member

ogrisel commented Nov 23, 2014

Yes, @pprett also mentioned over Twitter that RuleFit uses sub-paths starting from the root as categorical features, instead of just the leaves (the full path from the root).
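
To illustrate the difference between leaf features and root sub-path features, here is a rough sketch, assuming a scikit-learn release that provides `decision_path` on tree estimators. It is not RuleFit itself, only a way to see which nodes each sample passes through; the dataset and the `max_depth` value are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Sparse indicator matrix of shape (n_samples, n_nodes): entry (i, j) is
# nonzero when sample i passes through node j on its way to a leaf.
# Every column is a candidate "sub-path from the root" feature.
node_indicator = tree.decision_path(X)

# The leaf reached by each sample is the last node on its path.
leaf_ids = tree.apply(X)
```

Using only `leaf_ids` keeps one feature per full root-to-leaf path, while the columns of `node_indicator` also expose the internal nodes.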

@davidcieslak-zz

Sorry to jump in on this issue unannounced, but it seemed related to something I wanted to do. I'm hoping to use sklearn to get predicted leaf node ids. I think I'm using the apply method, like @ogrisel does in his notebook, as follows:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
mdl = clf.fit(iris.data, iris.target)
mdl.tree_.apply(iris.data)

However, when I do that, I'm getting the following error:

    print mdl.tree_.apply(iris.data)
  File "_tree.pyx", line 2382, in sklearn.tree._tree.Tree.apply (sklearn/tree/_tree.c:19595)
ValueError: Buffer dtype mismatch, expected 'DTYPE_t' but got 'double'

I'm also getting the same error with a call to the first tree of a random forest:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
mdl = clf.fit(iris.data, iris.target)
mdl.estimators_[0].tree_.apply(iris.data)

Do I need to re-type the data somehow in order to get this to work? Thanks in advance for any help!

@jnothman
Member

jnothman commented Jan 6, 2015

I think the development version should give a more specific error message, along the lines of "X.dtype should be np.float32, got np.float64". But yes, this is another reason not to require users to directly use Tree.apply.

@amueller
Member Author

amueller commented Jan 6, 2015

You can work around this by re-typing to float32.
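
A minimal sketch of that workaround, reusing the iris example from above. The name `X32` is only illustrative, and the check on `children_left` assumes that leaves are marked with -1 there, as they are in current scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
mdl = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# The low-level Tree.apply expects float32 input, so cast explicitly.
X32 = iris.data.astype(np.float32)
leaf_ids = mdl.tree_.apply(X32)

# Every returned id should point at a leaf (no left child).
all_leaves = bool((mdl.tree_.children_left[leaf_ids] == -1).all())
```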

@davidcieslak-zz

Brilliant. Thanks all!

@galv galv mentioned this issue Jan 8, 2015
@galv
Contributor

galv commented Jan 8, 2015

For the record, it's true that this method has multiple reinventions. It's used in speech recognition for a slightly different purpose: to cluster the hidden Markov models (HMMs) for triphones that are seen few or no times in the training data (there is a one-to-one correspondence between HMMs and triphones), such that HMMs in the same cluster share parameters and more robust estimates of these parameters can be made.

@glouppe
Contributor

glouppe commented Apr 11, 2015

Fixed by #4488
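
For readers arriving later, a short sketch of the public method, assuming a scikit-learn release in which tree estimators expose `apply` directly. It handles the dtype conversion itself, so float64 input works without a manual cast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Public estimator-level method: accepts the float64 data as-is.
leaf_ids = clf.apply(iris.data)

# Same result as the low-level call on explicitly cast input.
low_level_ids = clf.tree_.apply(iris.data.astype(np.float32))
same = bool(np.array_equal(leaf_ids, low_level_ids))
```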

@glouppe glouppe closed this as completed Apr 11, 2015