MRG: Multi-output decision trees #923


Merged · 38 commits · Jul 9, 2012

Conversation

@glouppe (Contributor) commented Jun 29, 2012

Hi folks!

Just to let you know, I am currently working on a multi-output extension of our decision trees.

Basically, this will make our implementation capable of handling classification or regression problems with several outputs. As I was discussing with @pprett, a very simple way to solve this kind of problem is to build n independent models, one for each output. However, by doing that you lose the (likely) correlations between the outputs (classes or regression values). Hence, an often better way is to build a single model that predicts all n outputs simultaneously. For decision trees, this amounts to storing n output values in each leaf and using splitting criteria that compute the average impurity reduction over all outputs.
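To make the intended usage concrete, here is a minimal sketch (the API is still in flux at this point, so the exact shapes and names are assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy data: 4 samples, 2 features; y has one column per output.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0, 1],
                  [0, 1],
                  [1, 0],
                  [1, 1]])

    clf = DecisionTreeClassifier()
    clf.fit(X, y)                  # a single tree fits both outputs at once
    print(clf.predict([[1, 0]]))   # one class per output, e.g. [[1 0]]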

This PR includes a working prototype of multi-output decision trees. I tried as much as possible not to impair training time on single-output problems.

A lot of things still need to be done:

  1. Write multi-output unit tests
  2. Patch RandomForest* and ExtraTrees* to account for the API changes
  3. Patch GradientBoosting* to account for the API changes
  4. Update the documentation

glouppe added 6 commits June 25, 2012 16:37
ENH: MultiOutputTree (wip)

ENH: MultiOutputTree (wip)

ENH: MultiOutputTree (wip)

ENH: MultiOutputTree (wip)

ENH: MultiOutputTree (wip)

ENH: MultiOutputTree (wip)
@amueller (Member):
Hey Gilles. Too bad we didn't talk about this. I have a working version of this for classification, but it is not so hard to do. I'll have a look at your implementation later and compare it to mine. BTW, your pull request can't be merged :-/

@glouppe (Contributor, Author) commented Jun 29, 2012

@amueller It can now :) But anyway, this is nowhere near ready. I still have to change the ensemble estimators.

@pprett (Member) commented Jun 29, 2012

Thanks Gilles - I'm looking forward to studying it in more detail; the weather this weekend should be really fine in the Alps, so I might not make it in the next couple of days.

@amueller (Member):
@glouppe This looks pretty cool! Can I ask what your motivation was? I used this for multi-label classification. Your implementation might actually be able to cope with structured class labels.

What I did was use a list for y. It did not seem to impact performance much in the single-label case and was very flexible. Your method is probably more efficient and more amenable to optimization.

Having completely arbitrary objects as y would be pretty sweet, though (and would pose no theoretical problem as long as one can define what node purity is).

@pprett (Member) commented Jun 29, 2012

Wow, straight from my alma mater (TU Graz) - neat - thanks for the ref.

@amueller (Member):
@pprett Do you know Hough forests? They are pretty related, and also from somewhere around there, I think ;)

PS: sorry for going off-topic.

@glouppe (Contributor, Author) commented Jun 29, 2012

@amueller We are planning to apply this to classify windows of pixels in images:
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2009/DMWG09/

Regarding my implementation, I treat y as a 2D array (or convert it to that format). A few things may still need to be done, though, to convert it appropriately. I'll check that when writing the unit tests; a sketch of the idea is below.
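The conversion would look something like this (a sketch of the idea, not the exact code in this PR):

    import numpy as np

    y = np.asarray([0, 1, 1, 0])   # classic 1D single-output target
    if y.ndim == 1:
        y = y.reshape((-1, 1))     # handled internally as shape (n_samples, 1)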

@amueller (Member):
@glouppe That sounds a lot like the paper I cited above :) (but is earlier, I think).

@glouppe (Contributor, Author) commented Jun 29, 2012

@amueller Haha, yeah! I am going to read yours carefully, I guess ;)

@amueller (Member):
I haven't really read your code, but does it do any "extra work" when y is 1D? I think multi-label would be a pretty standard application for this algorithm.

@glouppe (Contributor, Author) commented Jun 29, 2012

@amueller Basically, it shouldn't. All loops degenerate into single iterations, but some (small) overhead is likely.

@amueller (Member):
Oh... I remember why I used lists: in a multi-label setup, each instance can have a different number of labels. That might not play well with your approach of a fixed-length y. We could use 1-of-n encodings, but that would make it unnecessarily slow to work with many classes, I guess. We could also fill the remaining entries with -1, but that does not really excite me either.
If we could address both the 2D patch setting and the multi-label setting, that would be awesome!

@glouppe (Contributor, Author) commented Jun 29, 2012

@amueller I handle that :)

@amueller (Member):
@glouppe SWEET! OK, I'll keep quiet until I read your code ;)

@glouppe (Contributor, Author) commented Jun 29, 2012

Please note that multi-output is different from multi-label. I don't know if we are talking about exactly the same thing? In a multi-output classification setting, each output column has its own set of classes, and only one of them can be picked at prediction time. In a multi-label setting, as I understand it, you are allowed to predict several classes from the same unique set of classes, which is different. However, you can indeed transform a multi-label problem into a multi-output problem using binary encoding (i.e., use n binary outputs, one for each class), as sketched below.
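For instance, a minimal sketch of that binary encoding (assuming the label sets are given as lists of class indices):

    import numpy as np

    labels = [[0, 2], [1], [0, 1, 2]]    # variable-length label sets, 3 classes
    Y = np.zeros((len(labels), 3), dtype=int)
    for i, active in enumerate(labels):
        Y[i, active] = 1                 # one binary output column per class
    # Y == [[1, 0, 1],
    #       [0, 1, 0],
    #       [1, 1, 1]]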

@amueller (Member):
What I wanted to say was that it would be good if we could handle both multi-output and multi-label. I was afraid the 1-of-n coding might be inefficient when there are many classes but only a few are active at any time.

@glouppe (Contributor, Author) commented Jun 29, 2012

Well, that it cannot handle. :/

However, it can handle multi-output classification problems where the set of classes has a different size at each output. That is what I wanted to say earlier.

@amueller (Member):
Well, maybe doing the binary coding isn't so bad. And doing patch-based learning in sklearn is mega-awesome (which I might not have said enough before ;).

@bdholt1 (Member) commented Jun 29, 2012

I just wanted to say thanks, guys, especially @glouppe! This has been on my list since I started, and I've just not gotten around to it, so I support this 100%!

@ogrisel (Member) commented Jul 4, 2012

Very nice example, BTW. Except for the few comments above, +1 for merging.

@pprett (Member) commented Jul 4, 2012

I'm also +1 for merging, but if you want, I can do another review of the Cython code in the evening.

Great example and PR!

PS: we should talk to @vene about a vbench script that checks for performance regressions in the tree module for future feature requests.

@glouppe (Contributor, Author) commented Jul 4, 2012

@ogrisel All your comments have been addressed.

@pprett Yes, I am not against a review of the Cython code. Actually, I found a serious bug this morning in the regressors (segfault and crash). Could you re-run your benchmark on Boston? I guess it'll take longer this time :( (But at least the results will be correct.)

@pprett (Member) commented Jul 4, 2012

Sure - I'll check.

@glouppe (Contributor, Author) commented Jul 7, 2012

Any more reviews? :)

@ogrisel (Member) commented Jul 8, 2012

Looks good to me, but as I am not a tree expert/user, I would rather have @amueller, @bdholt1 or @pprett (or someone else interested in multi-output trees) give it another round of review.

@bdholt1 (Member) commented Jul 8, 2012

Sorry for the delay - I've been away on holiday. I'd like to give it a final round if that's possible - perhaps I'll post reviews later this evening?


@glouppe (Contributor, Author) commented Jul 8, 2012

@bdholt1 Sure!

@glouppe (Contributor, Author) commented Jul 9, 2012

Thanks for this additional example, Brian!

@pprett Waiting for your approval to hit the green button :)

@bdholt1 (Member) commented Jul 9, 2012

@glouppe Thanks very much for undertaking to implement this; it's a very welcome addition! This functionality achieves what mvpart does for R, taking scikit-learn one step closer to being a complete suite.

@pprett (Member) commented Jul 9, 2012

@glouppe I re-ran the benchmarks on Boston - looks very good.

Fit
+--------+--------+-------+-------+
|        | Master | MO    | MOv2  |
+--------+--------+-------+-------+
|Tree(20)| 41.1   | 43.7  | 44.6  |
+--------+--------+-------+-------+
|Tree(1) | 0.6    | 0.712 | 0.72  |
+--------+--------+-------+-------+
|RF      | 338    | 296   | 321   |
+--------+--------+-------+-------+
|GBRT    | 90     | 109   | 112   |
+--------+--------+-------+-------+

Predict
+--------+--------+-------+-------+
|        | Master | MO    | MOv2  |
+--------+--------+-------+-------+
|Tree(20)| 0.09   | 0.1   | 0.1   |
+--------+--------+-------+-------+
|Tree(1) | 0.031  | 0.034 | 0.037 |
+--------+--------+-------+-------+
|RF      | 30     | 1.1   | 1.2   |
+--------+--------+-------+-------+
|GBRT    | 0.775  | 0.768 | 0.82  |
+--------+--------+-------+-------+

@pprett (Member) commented Jul 9, 2012

@glouppe There are two formatting errors in the doctests (tree.rst); apart from that, I'm +1.

Great work - thanks!

@glouppe (Contributor, Author) commented Jul 9, 2012

Thank you all for the reviews! Merging :)

glouppe added a commit that referenced this pull request on Jul 9, 2012: MRG: Multi-output decision trees
@glouppe merged commit aad531f into scikit-learn:master on Jul 9, 2012
@amueller (Member) commented Jul 9, 2012

Great work. Thanks a lot!

@glouppe mentioned this pull request on Jul 11, 2012
@@ -165,9 +175,10 @@ class Tree(object):
     LEAF = -1
     UNDEFINED = -2

-    def __init__(self, n_classes, n_features, capacity=3):
+    def __init__(self, n_classes, n_features, n_outputs, capacity=3):
Review comment on this diff:
I am working on #941, and I was thinking of making n_outputs an optional argument (with default 1). Would you mind?
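Hypothetically, the signature would then become (just an illustration of the proposed default, not code from this PR):

    def __init__(self, n_classes, n_features, n_outputs=1, capacity=3):
        ...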

@glouppe (Contributor, Author) replied:

I don't mind; I am okay with that.

While I am at it, I must warn you: I am currently making big changes to the tree structure (see #946). I don't know how we will resolve our future conflicts :/
