
[MRG + 1] Public apply method for decision trees #4488


Merged: 10 commits merged into scikit-learn:master from glouppe:tree-apply on Apr 11, 2015

Conversation

@glouppe commented Apr 2, 2015

This supersedes #4065. I rebased @galv's branch on top of master and rewrote the tests.
Fixes #3832.
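
For context, a minimal sketch of what the new public method enables (the toy data and parameter values here are illustrative, not from the PR):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)
    clf.fit(iris.data, iris.target)

    # apply() maps each sample to the id of the leaf node it reaches.
    leaf_ids = clf.apply(iris.data)
    print(leaf_ids.shape)       # (150,)
    print(np.unique(leaf_ids))  # ids of the leaf nodes; not contiguous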

@glouppe changed the title from "Public apply method for decision trees" to "[MRG] Public apply method for decision trees" on Apr 2, 2015
@glouppe commented Apr 2, 2015

Random pings to people at the sprint @ogrisel @betatim

@glouppe changed the title from "[MRG] Public apply method for decision trees" to "[WIP] Public apply method for decision trees" on Apr 2, 2015
@glouppe commented Apr 2, 2015

Still have to look at the example...

@landscape-bot

Code Health
Repository health increased by 0.00% when pulling 9fb131f on glouppe:tree-apply into 6e54079 on scikit-learn:master.

@glouppe changed the title from "[WIP] Public apply method for decision trees" to "[MRG] Public apply method for decision trees" on Apr 2, 2015
@glouppe commented Apr 2, 2015

I improved the example a bit, but in fact I am still not happy with it. I would vote for removing it. What do you think?

Note that we already have an example showing the capabilities of trees for embedding:
http://scikit-learn.org/dev/auto_examples/manifold/plot_lle_digits.html#example-manifold-plot-lle-digits-py

@glouppe force-pushed the tree-apply branch 2 times, most recently from 2e851cf to 9d4bb37, on April 2, 2015 at 11:03
@coveralls

Coverage Status

Coverage increased (+0.0%) to 95.12% when pulling 9d4bb37 on glouppe:tree-apply into 6e54079 on scikit-learn:master.

@amueller commented Apr 2, 2015

I was looking at this yesterday and thought of rebasing it :) Thanks for picking it up. I'll have a look later.

@arjoly commented Apr 2, 2015

Do you have the picture generated by the example?


Note that in the below double bar graph demonstrating the first transformation,
all setosas fall into the first leaf, and none into the second leaf.
Similarly, all versicolors and virginicas fall only into the second leaf. This
Review comment on the hunk above:

There are some double spaces in this and the previous line.

@vene commented Apr 2, 2015

This is the plot that the example produces; it doesn't say much to me.

[image: tree_apply — the plot produced by the example]

I agree it's not the most convincing example. I think the "sparse" feature extraction point is not well made by a tree with only two leaves. As for part 1 of the example, I am not sure why I would want to transform the classification problem I care about into one where I just predict the noisy leaves of a decision tree trained on the original problem. The only reason I can find is model compression, but why would I compress a single tree? Maybe a whole forest.

How about an example that focuses on the second part but actually creates a Transformer class and pipelines it into a linear classifier? Would that make sense?
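
A hedged sketch of what this could look like (the class name TreeLeafTransformer and all parameters are invented here for illustration; nothing below is part of this PR or of scikit-learn):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    class TreeLeafTransformer(BaseEstimator, TransformerMixin):
        """One-hot encode the leaf id that tree.apply() returns."""

        def __init__(self, max_depth=3):
            self.max_depth = max_depth

        def fit(self, X, y=None):
            self.tree_ = DecisionTreeClassifier(max_depth=self.max_depth)
            self.tree_.fit(X, y)
            # apply() returns node ids, which are not contiguous, so
            # remember which ids actually belong to leaves.
            self.leaf_ids_ = np.unique(self.tree_.apply(X))
            return self

        def transform(self, X):
            leaves = self.tree_.apply(X)
            return (leaves[:, np.newaxis] == self.leaf_ids_).astype(np.float64)

    # Pipelined into a linear classifier, as suggested above.
    pipe = make_pipeline(TreeLeafTransformer(max_depth=3), LogisticRegression())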

@vene commented Apr 2, 2015

Come to think of it, given the digits embedding example referenced by @glouppe, the only usefulness of a tree.apply example would be to show users how to roll their own. Might be simpler to just remove the example completely.

@amueller commented Apr 2, 2015

+1 on removing the example.
Possible other example: comparing RandomTreesEmbedding, RandomForestClassifier and GradientBoostingClassifier for feature extraction, and comparing PCA vs. a trained classifier on them?

The GradientBoostingClassifier doesn't have apply (as it is not necessarily tree-based), so we could use the tree apply.

X_small32 = X_small.astype(tree._tree.DTYPE)

est = ALL_TREES[name]()
est.fit(X_small32, y_small)
Review comment on the hunk above:

use X_small here

@vene commented Apr 2, 2015

> Possible other example: comparing RandomTreesEmbedding, RandomForestClassifier and GradientBoostingClassifier for feature extraction, and comparing PCA vs. a trained classifier on them?
>
> The GradientBoostingClassifier doesn't have apply (as it is not necessarily tree-based), so we could use the tree apply.

Ah, this would be a great example indeed. GB not being necessarily tree-based is a great reason to provide this as an example rather than as a transformer in the library. In the old PR, @ogrisel pointed to a notebook where he used this:

http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb

with a link to this paper:

"Practical Lessons from Predicting Clicks on Ads at Facebook", Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, Joaquin Quiñonero Candela. International Workshop on Data Mining for Online Advertising (ADKDD), 2014.

https://www.facebook.com/publications/329190253909587/
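
For reference, a rough sketch of the recipe from that paper using the new tree-level apply() (illustrative only: the dataset is synthetic, and since GradientBoostingClassifier has no apply() of its own here, we loop over its per-stage trees):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder

    X, y = make_classification(n_samples=1000, random_state=0)
    gb = GradientBoostingClassifier(n_estimators=20, max_depth=3,
                                    random_state=0).fit(X, y)

    # One column of leaf ids per boosting stage; for a binary problem
    # gb.estimators_ is an (n_stages, 1) array of regression trees.
    leaves = np.column_stack([t.apply(X) for t in gb.estimators_.ravel()])

    # One-hot encode the leaf ids, then fit a linear model on top.
    features = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)
    lr = LogisticRegression(max_iter=1000).fit(features, y)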

@amueller commented Apr 2, 2015

Hehe, yeah, Ralf Herbrich was my boss at Amazon, which is why I thought about it ;)

X_trans = np.zeros((len(X), max_leaves))

for i in range(max_leaves):
    X_trans[:, i] = y_reduced == (i + 1)  # Leaf indices begin at 1
Review comment on the hunk above:

I just realized that this is incorrect and would break with deeper trees, given @amueller's documentation observation below.
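
A corrected sketch of the hunk above (reusing its y_reduced variable): apply() returns node ids, which are neither contiguous nor 1-based, so build the columns from the unique leaf ids instead:

    import numpy as np

    leaf_ids = np.unique(y_reduced)  # node ids that are actually leaves
    X_trans = (y_reduced[:, np.newaxis] == leaf_ids).astype(np.float64)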

@glouppe commented Apr 8, 2015

Sorry for the delay. I removed the example, improved the tests, and made the method raise NotFittedError instead of ValueError.
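
In practice (a minimal sketch; sklearn.exceptions is the import path scikit-learn later settled on for NotFittedError):

    from sklearn.exceptions import NotFittedError
    from sklearn.tree import DecisionTreeClassifier

    try:
        # Calling apply() before fit() now raises NotFittedError.
        DecisionTreeClassifier().apply([[0.0, 0.0]])
    except NotFittedError as exc:
        print("not fitted:", exc)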

@amueller changed the title from "[MRG] Public apply method for decision trees" to "[MRG + 1] Public apply method for decision trees" on Apr 8, 2015
@amueller commented Apr 8, 2015

LGTM. whatsnew?

@glouppe commented Apr 8, 2015

> LGTM. whatsnew?

Done :)

@amueller commented Apr 8, 2015

Opened #4549 in case anyone is bored ;)

@coveralls

Coverage Status

Coverage increased (+0.0%) to 95.16% when pulling 6ea24c7 on glouppe:tree-apply into b1eaac0 on scikit-learn:master.

@glouppe commented Apr 10, 2015

One last review? It should be quick; there is not much left. CC: @vene

@glouppe commented Apr 11, 2015

I improved the documentation. Merging.

glouppe added a commit that referenced this pull request Apr 11, 2015
[MRG + 1] Public `apply` method for decision trees
@glouppe merged commit 28c893d into scikit-learn:master on Apr 11, 2015
@glouppe deleted the tree-apply branch on October 20, 2015 at 07:39