MRG: Multi-output decision trees #923
Conversation
…ee-mo Conflicts: sklearn/tree/_tree.c
ENH: MultiOutputTree (wip)
…ee-mo Conflicts: sklearn/tree/_tree.c
Hey Gilles. Too bad we didn't talk about this. I have a working version of this for classification. But it is not so hard to do. I'll have a look at your implementation later and compare it to mine. BTW, your pull request can't be merged :-/
@amueller It can now :) But anyway, this is nowhere near ready. I still have to change the ensemble estimators.
Thanks Gilles - I'm looking forward to studying it in more detail; the weather this weekend should be really fine in the Alps, so I might not make it in the next couple of days.
@glouppe this looks pretty cool! Can I ask what your motivation was? I used this for multilabel classification. Your implementation might actually be able to cope with structured class labels. What I did used a […]. Having completely arbitrary objects as […]
Wow... straight from my alma mater (TU Graz) - neat - thanks for the ref.
@pprett do you know Hough forests? They are pretty related and also from somewhere around there, I think ;) PS: sorry for the OT
@amueller We are planning to apply that to classify windows of pixels in images. Regarding my implementation, I consider […]
@glouppe that sounds a lot like the paper I cited above :) (but it is earlier, I think).
@amueller Haha, yeah! I am going to carefully read yours, I guess ;)
I haven't really read your code, but does it do any "extra work" when y is 1d? I think multi-label would be a pretty standard application for this algorithm.
@amueller Well, basically it shouldn't. All loops degenerate into single iterations, but some (little) overhead is likely.
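A toy illustration of why the overhead should be small (a sketch in pure Python/NumPy; the real criterion lives in Cython and this is not that code): the per-output loop simply collapses to a single iteration when there is one output.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of one output's class-count vector."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def average_impurity(label_counts_per_output):
    # With a single output the loop body runs exactly once, so the
    # only extra cost versus a single-output criterion is the loop
    # bookkeeping itself.
    total = 0.0
    for counts in label_counts_per_output:
        total += entropy(counts)
    return total / len(label_counts_per_output)

# Single output: degenerates to plain entropy.
print(average_impurity([np.array([5, 3])]))
# Two outputs: per-output impurities are averaged.
print(average_impurity([np.array([5, 3]), np.array([2, 2, 4])]))
```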
Oh... I remember why I used lists... in a multi-label setup, each instance can have a different number of labels. That might not play well with your approach of a fixed-length y. We could use 1-of-n encodings, but that would make it unnecessarily slow to work with many classes, I guess. We could also fill the remaining entries with -1, but that does not really excite me either.
@amueller I handle that :)
@glouppe SWEET! OK, I'll keep quiet until I read your code ;)
Please note that multi-output is different from multi-label. I don't know if we are exactly talking about the same thing? In a multi-output classification setting, each output column has its own set of classes and only one of them can be picked at prediction time. In a multi-label setting, as I understand it, you are allowed to predict several classes from the same unique set of classes, which is different. However, you can indeed transform a multi-label problem into a multi-output problem using binary encoding (i.e., use n binary outputs, one for each class).
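To make the distinction concrete, here is a minimal sketch of the two label layouts (plain NumPy; the values are made up for illustration):

```python
import numpy as np

# Multi-output: one column per output, each column drawing from its
# own class set (the sets may even differ in size). Exactly one
# class per output is predicted for each sample.
y_multioutput = np.array([["red",  "circle"],
                          ["blue", "square"],
                          ["red",  "triangle"]])

# Multi-label: each sample carries an arbitrary subset of one shared
# class set (here as ragged Python lists).
y_multilabel = [[0, 2], [1], [0, 1, 2]]
```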
What I wanted to say was that it would be good if we could handle both multi-output and multi-label. I was afraid the 1-of-n coding might be inefficient if there are many classes but only a few are active at any time.
Well, that it cannot handle. :/ However, it can handle multi-output classification problems with sets of classes of different sizes at each output. That is what I wanted to say earlier.
Well, maybe doing the binary coding isn't so bad. And doing patch-based learning in sklearn is mega-awesome (which I might not have said enough before ;).
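For reference, a minimal sketch of the 1-of-n (binary) coding being discussed, turning ragged multi-label lists into a fixed-width y (plain NumPy; the names are illustrative):

```python
import numpy as np

# Ragged multi-label data: each sample has its own number of labels.
labels = [[0, 2], [1], [0, 1, 2]]
n_classes = 3

# 1-of-n (binary) encoding: one binary output column per class,
# 1 where the label is present for that sample.
y = np.zeros((len(labels), n_classes), dtype=int)
for i, sample_labels in enumerate(labels):
    y[i, sample_labels] = 1

# y is now:
# [[1 0 1]
#  [0 1 0]
#  [1 1 1]]
```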
I just wanted to say thanks guys, especially @glouppe! This has been on my list since I started and I've just not been getting around to it, so I support this 100%!
Very nice example, BTW. Except for the few comments above, +1 for merging.
I'm also +1 for merge, but if you want I can do another review of the Cython code in the evening. Great example and PR! PS: we should talk to @vene about a vbench script which checks for performance regressions in the tree module for future feature requests.
@ogrisel All your comments have been addressed. @pprett Yes, I am not against a review of the Cython code. Actually, I found a serious bug this morning regarding regressors (segfault and crash). Could you re-run your benchmark on Boston? I guess it'll take longer this time :( (But at least the results will be correct.)
Sure - I'll check.
Any more review? :)
Sorry for the delay - I've been away on holiday. I'd like to give it a […]
@bdholt1 Sure!
…ee-mo Conflicts: doc/whats_new.rst
Glouppe tree mo
Thanks for this additional example, Brian! @pprett Waiting for your approval to hit the green button :)
@glouppe Thanks very much for undertaking to implement this, it's a very welcome addition! This functionality achieves what mvpart does for R, taking it one step closer to being a complete suite.
@glouppe I re-ran the benchmarks on Boston - looks very good.
@glouppe there are two formatting errors in the doctests (tree.rst); apart from that, I'm +1. Great work - thx!
…ee-mo Conflicts: sklearn/tree/_tree.c
Thank you all for the reviews! I'll merge :)
MRG: Multi-output decision trees
Great work. Thanks a lot!
@@ -165,9 +175,10 @@ class Tree(object):
     LEAF = -1
     UNDEFINED = -2

-    def __init__(self, n_classes, n_features, capacity=3):
+    def __init__(self, n_classes, n_features, n_outputs, capacity=3):
I am working on #941, and I was thinking of making n_outputs an optional argument (with a default of 1). Would you mind?
I don't mind, I am okay with that.
While I am at it, I must warn you though: I am currently making huge changes to the tree structure (see #946). I don't know how we should resolve our future conflicts :/
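For clarity, the change proposed above would amount to something like the following sketch (hypothetical; n_outputs defaulting to 1 is the suggestion under discussion, not the committed code):

```python
class Tree(object):
    LEAF = -1
    UNDEFINED = -2

    # Hypothetical signature sketch: n_outputs becomes optional with
    # a default of 1, so existing single-output callers need no change.
    def __init__(self, n_classes, n_features, n_outputs=1, capacity=3):
        self.n_classes = n_classes
        self.n_features = n_features
        self.n_outputs = n_outputs
        self.capacity = capacity
```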
Hi folks!
Just to let you know, I am currently working on a multi-output extension of our decision trees.
Basically, this will make our implementation capable of handling classification or regression problems with several outputs. As I was discussing with @pprett, a very simple way to solve this kind of problem is to build n independent models, i.e. one for each output. However, by doing that you lose the (likely) correlation between the outputs (classes or regression values). Hence, an often better way is to build a single model that predicts all n outputs simultaneously. With regard to decision trees, this amounts to storing n output values in each leaf and to using splitting criteria that compute the average reduction over all outputs.
This PR includes a working prototype of multi-output decision trees. I tried as much as possible not to impair training time on single-output problems.
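As a rough usage sketch of what this enables (assuming the estimator simply accepts a 2-D y after this PR; the data here is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])

# Two outputs: each column of y is its own classification target,
# possibly with its own set of classes. A single tree is fit on both.
y = np.array([[0, 2],
              [1, 3],
              [1, 2],
              [0, 3]])

clf = DecisionTreeClassifier().fit(X, y)

# One predicted class per output, shape (n_samples, n_outputs).
print(clf.predict(np.array([[0.5, 1.0]])))
```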
A lot of things still need to be done:
- Write multi-output unit tests
- Patch RandomForest* and ExtraTrees* to account for the API changes
- Patch GradientBoosting* to account for the API changes
- Update the documentation