
[WIP] Gradient Boosting speed improvement #5007

Closed. jmschrei wants to merge 4 commits from the _tree_speed branch.

Conversation

jmschrei (Member)

Through more aggressive caching of intermediate results and by hardcoding the FriedmanMSE criterion into a PresortBestSplitter, I propose the FriedmanMSESplitter, which builds the same trees 2-3x faster than before. The effect scales more with the number of estimators than with depth, so building many small trees gives a better speed improvement (~3x) than building a few large trees (~2x).

Instead of making any calls to a Criterion class, this method will simply store the cumulative weight, weight times residual, and weight times residual squared values in a vector, from which it can calculate improvement and impurity. Previously, this required multiple scans of the full sample space, which was less efficient.
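
To make the caching concrete, here is a minimal NumPy sketch (illustrative Python, not the PR's Cython; the function names are made up) of how cumulative sums of w, w*y, and w*y^2 over the presorted samples let the weighted MSE impurity of both children and the Friedman improvement be computed in constant time for any split position:

    import numpy as np

    def prefix_stats(y, w):
        # y: residuals in presorted feature order, w: sample weights
        cw = np.cumsum(w)            # cumulative weight
        cwy = np.cumsum(w * y)       # cumulative weight * residual
        cwyy = np.cumsum(w * y * y)  # cumulative weight * residual^2
        return cw, cwy, cwyy

    def split_stats(cw, cwy, cwyy, pos):
        # pos indexes the last sample sent to the left child (0 <= pos < n - 1)
        w_l, w_r = cw[pos], cw[-1] - cw[pos]
        s_l, s_r = cwy[pos], cwy[-1] - cwy[pos]
        q_l, q_r = cwyy[pos], cwyy[-1] - cwyy[pos]
        # weighted MSE of each child: E[y^2] - E[y]^2
        mse_l = q_l / w_l - (s_l / w_l) ** 2
        mse_r = q_r / w_r - (s_r / w_r) ** 2
        # Friedman MSE improvement (up to normalization by the total weight):
        # w_l * w_r / (w_l + w_r) * (mean_l - mean_r)^2
        diff = s_l / w_l - s_r / w_r
        improvement = w_l * w_r / (w_l + w_r) * diff * diff
        return mse_l, mse_r, improvement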

In addition to being faster, this also simplifies the code base by removing the (in my opinion) verbose criterion class. The next step for me is to add in multithreaded training, and so making the code as compact as possible is beneficial moving forward. While I hardcoded FriedmanMSE into this splitter, I believe all tree-based methods can benefit from the caching done here.

Lastly, I've added some more documentation to some Criterion and Splitting objects, which I found helpful in learning the purpose of these objects.

Here are train times and accuracies on master and on this branch (mine) for a variety of datasets in two settings: 25 estimators of depth 2, and 5 estimators of depth 6. (A rough sketch of the benchmarking setup is shown after the tables.)

25 estimators, max depth 2

          accuracy   train time
madelon    
master     0.68833     2.487
branch     0.68833     0.860

gaussian (50,000 points, 2 classes, 200 dimensions, each dimension z=0.5 apart)
master     0.91712     30.815
branch     0.91712     10.336 

spambase
master     0.92122     0.444
branch     0.92122     0.165 

mnist
master     0.83786     877.432
branch     0.83786     340.564

covtype
master     0.71034     462.218
branch     0.71034     249.624

5 estimators, max depth 6

          accuracy   train time
madelon    
master     0.835        2.460
branch     0.835        1.197

gaussian (50,000 points, 2 classes, 200 dimensions, each dimension z=0.5 apart)
master     0.83908     21.9226
branch     0.83908       9.4118

spambase
master     0.91471     0.341
branch     0.91536     0.160

mnist
master     0.89471     720.519
branch     0.89457     373.040

covtype
master     0.75713     290.271
branch     0.75713     178.403
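
The exact benchmark scripts are not part of the PR; a run for the 25-estimator, depth-2 setting might look roughly like the sketch below (the synthetic data, split, and seeds are assumptions; only n_estimators and max_depth come from the tables above):

    import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Placeholder data; the actual runs used madelon, spambase, mnist, covtype,
    # and a synthetic 50,000 x 200 Gaussian problem.
    X, y = make_classification(n_samples=50000, n_features=200, n_classes=2,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = GradientBoostingClassifier(n_estimators=25, max_depth=2)

    start = time.time()
    clf.fit(X_train, y_train)
    print("train time: %.3f" % (time.time() - start))
    print("accuracy:   %.5f" % clf.score(X_test, y_test))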

Of note: the two versions differ slightly (<1%) in the spambase and mnist accuracies for 5 trees of depth 6. This seems to happen when the dimensionality is very high (several hundred features).

I am also currently getting segmentation faults when subsample is less than 1, which is odd because I didn't touch the allocation and deallocation steps. However, I thought I would open this WIP so that others can comment while I track down the issue.

glouppe (Contributor) commented Jul 20, 2015

This is great! Definitely adding that to my review list.

(Review may take time; don't hesitate to ping us. CC: @pprett)

Inline review comments on the following hunk of the tree builder:

    if first:
        impurity = splitter.node_impurity()
        first = 0

    is_leaf = is_leaf or (impurity <= MIN_IMPURITY_SPLIT)

    if not is_leaf:
        splitter.node_split(impurity, &split, &n_constant_features)
        is_leaf = is_leaf or (split.pos >= end)
Member:

This change is apparently causing the classification trees to be different and therefore causes some test failures.

Member:

Hmm... @jmschrei, the order is the only difference here, right? What's the reason for moving the block before the first check?

jmschrei (Member, Author):

The code was organized such that the impurity of the entire dataset was calculated before the first split. However, when you calculate the split, you also get this value. My code caches this value when it does the split, but needs to do the split before it can return the impurity.

This has been updated in the latest commits, but only through a hack. I am now working on making this cleaner.
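
As a rough illustration of the pattern being described (plain Python with made-up names, not the actual Cython splitter interface): the splitter computes the node impurity as a by-product of the split scan and caches it, so a later impurity query is a cached read rather than a second pass over the samples.

    import numpy as np

    class CachingSplitter:
        # Illustrative only: caches the node impurity computed during node_split.

        def __init__(self):
            self._cached_impurity = None

        def node_split(self, y, w):
            # The scan for the best split already accumulates sum(w), sum(w*y),
            # and sum(w*y**2); the node impurity falls out of these for free.
            w_sum = w.sum()
            mean = (w * y).sum() / w_sum
            self._cached_impurity = (w * y * y).sum() / w_sum - mean ** 2
            # ... candidate split positions would be evaluated here ...
            return self._cached_impurity

        def node_impurity(self):
            # Cached read instead of a second pass over the node's samples.
            return self._cached_impurity

    splitter = CachingSplitter()
    y = np.array([1.0, -0.5, 0.25, 2.0])
    w = np.ones_like(y)
    splitter.node_split(y, w)        # computing the split also caches the impurity
    print(splitter.node_impurity())  # no rescan needed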

arjoly (Member) commented Jul 20, 2015

Interesting stuff!!

jmschrei force-pushed the _tree_speed branch 2 times, most recently from fd99cab to ac43c1e on July 21, 2015 at 07:39.
pprett (Member) commented Jul 21, 2015

This is like Christmas and Easter on the same day...

jmschrei (Member, Author)

The newest changes have fixed the segfaulting by inheriting from PresortSplitter instead of BaseDenseSplitter (apparently I was allocating memory incorrectly somewhere). The code has also been hacked in order to pass the tree nosetests that Olivier pointed out. However, it still cannot handle subsampling or per-leaf weights, because there is an issue in the splitter API that makes it difficult to get leaf weights. splitter.node_reset has previously been used to get the weight of the current node, but it is called before splitter.node_split, which is where the weight vector is calculated in my implementation.

I would suggest that, instead of calculating the weight with splitter.node_reset and the impurity with splitter.node_impurity, these values be returned in a StackRecord produced by splitter.node_split. That would make better use of the caching and require fewer code hacks. However, it would require rewriting the entire _tree.pyx module, as all tree builders and splitters would have to be modified to incorporate this API change.
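
A rough Python-level sketch of the suggested API change (the real code is Cython with a C struct for StackRecord; the extra fields and the node_split return values below are the proposal, not existing scikit-learn API):

    from dataclasses import dataclass

    # Illustrative stand-in for the Cython StackRecord struct; the impurity and
    # weighted_n_node_samples fields would be filled in by node_split.
    @dataclass
    class StackRecord:
        start: int
        end: int
        depth: int
        impurity: float = 0.0
        weighted_n_node_samples: float = 0.0

    def process_node(splitter, record):
        # Proposed flow: a single node_split call returns the split together with
        # the node statistics it already computed while scanning the samples, so
        # the builder no longer needs separate node_reset / node_impurity calls.
        split, impurity, weighted_n = splitter.node_split(record.start, record.end)
        record.impurity = impurity
        record.weighted_n_node_samples = weighted_n
        return split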

As a side note, fixing the memory issue has apparently also led to another ~10% speedup.

ogrisel (Member) commented Jul 21, 2015

@jmschrei quick note: it would be great to avoid merging the master branch into your PR branch; instead, rebase your branch on master when you need to get recent changes from master into your branch. Otherwise the history is more branchy and harder to read once the branch is merged.

Undoing a merge commit can be complicated. If you want to remove the merge commit from this PR, I would just copy your changes to the _tree.pyx and gradient_boosting.py files into a temp folder, do git reset --hard master in your branch, re-add your changes, re-run Cython, commit your changes again, and force-push to the same branch on your GitHub repo.

jmschrei mentioned this pull request on Jul 28, 2015.
glouppe (Contributor) commented Jul 31, 2015

Is this a duplicate of #5041? (if yes, just close this one)

jmschrei (Member, Author)

Close enough.

jmschrei closed this on Jul 31, 2015.