
[WIP] Gradient Boosting speed improvement #5007

Closed. jmschrei wants to merge 4 commits from the _tree_speed branch.

Conversation

jmschrei (Member)

Through more aggressive caching of intermediate results and by hardcoding the FriedmanMSE criterion into a PresortBestSplitter, I propose the FriedmanMSESplitter, which builds the same trees 2-3x faster than before. The effect scales more with the number of estimators than with depth, so building many small trees gives a better speed improvement (~3x) than building a few large trees (~2x).

Instead of making any calls to a Criterion class, this method will simply store the cumulative weight, weight times residual, and weight times residual squared values in a vector, from which it can calculate improvement and impurity. Previously, this required multiple scans of the full sample space, which was less efficient.
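
To make the caching concrete, here is a minimal NumPy sketch (illustrative Python, not the PR's Cython; the function names are made up) of how cumulative sums of w, w*y, and w*y^2 over the presorted samples let the weighted MSE impurity of both children and the Friedman improvement be computed in constant time for any split position:

    import numpy as np

    def prefix_stats(y, w):
        # y: residuals in presorted feature order, w: sample weights
        cw = np.cumsum(w)            # cumulative weight
        cwy = np.cumsum(w * y)       # cumulative weight * residual
        cwyy = np.cumsum(w * y * y)  # cumulative weight * residual^2
        return cw, cwy, cwyy

    def split_stats(cw, cwy, cwyy, pos):
        # pos indexes the last sample sent to the left child (0 <= pos < n - 1)
        w_l, w_r = cw[pos], cw[-1] - cw[pos]
        s_l, s_r = cwy[pos], cwy[-1] - cwy[pos]
        q_l, q_r = cwyy[pos], cwyy[-1] - cwyy[pos]
        # weighted MSE of each child: E[y^2] - E[y]^2
        mse_l = q_l / w_l - (s_l / w_l) ** 2
        mse_r = q_r / w_r - (s_r / w_r) ** 2
        # Friedman MSE improvement (up to normalization by the total weight):
        # w_l * w_r / (w_l + w_r) * (mean_l - mean_r)^2
        diff = s_l / w_l - s_r / w_r
        improvement = w_l * w_r / (w_l + w_r) * diff * diff
        return mse_l, mse_r, improvement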

In addition to being faster, this also simplifies the code base by removing the (in my opinion) verbose criterion class. The next step for me is to add in multithreaded training, and so making the code as compact as possible is beneficial moving forward. While I hardcoded FriedmanMSE into this splitter, I believe all tree-based methods can benefit from the caching done here.

Lastly, I've added some more documentation to some Criterion and Splitting objects, which I found helpful in learning the purpose of these objects.

Here are train times and accuracies on master and on this branch (mine) for a variety of datasets in two settings: 25 estimators of depth 2, and 5 estimators of depth 6. (A rough sketch of the benchmarking setup is shown after the tables.)

25 estimators, max depth 2

          accuracy   train time
madelon    
master     0.68833     2.487
branch     0.68833     0.860

gaussian (50,000 points, 2 classes, 200 dimensions, each dimension z=0.5 apart)
master     0.91712     30.815
branch     0.91712     10.336 

spambase
master     0.92122     0.444
branch     0.92122     0.165 

mnist
master     0.83786     877.432
branch     0.83786     340.564

covtype
master     0.71034     462.218
branch     0.71034     249.624

5 estimators, max depth 6

          accuracy   train time
madelon    
master     0.835        2.460
branch     0.835        1.197

gaussian (50,000 points, 2 classes, 200 dimensions, each dimension z=0.5 apart)
master     0.83908     21.9226
branch     0.83908       9.4118

spambase
master     0.91471     0.341
branch     0.91536     0.160

mnist
master     0.89471     720.519
branch     0.89457     373.040

covtype
master     0.75713     290.271
branch     0.75713     178.403
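
The exact benchmark scripts are not part of the PR; a run for the 25-estimator, depth-2 setting might look roughly like the sketch below (the synthetic data, split, and seeds are assumptions; only n_estimators and max_depth come from the tables above):

    import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Placeholder data; the actual runs used madelon, spambase, mnist, covtype,
    # and a synthetic 50,000 x 200 Gaussian problem.
    X, y = make_classification(n_samples=50000, n_features=200, n_classes=2,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = GradientBoostingClassifier(n_estimators=25, max_depth=2)

    start = time.time()
    clf.fit(X_train, y_train)
    print("train time: %.3f" % (time.time() - start))
    print("accuracy:   %.5f" % clf.score(X_test, y_test))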

Of note: the two versions differ slightly (<1%) in the spambase and mnist accuracies for 5 trees of depth 6. This seems to happen when the dimensionality is very high (several hundred features).

I am also currently getting segmentation faults when subsample is less than 1, which is odd because I didn't touch the allocation and deallocation steps. However, I thought I would open this WIP so that others can comment while I track down the issue.

glouppe (Contributor) commented Jul 20, 2015

This is great! Definitely adding that to my review list.

(Review may take time; don't hesitate to ping us. CC: @pprett)

Inline review comments on the following hunk of the tree builder:

    if first:
        impurity = splitter.node_impurity()
        first = 0

    is_leaf = is_leaf or (impurity <= MIN_IMPURITY_SPLIT)

    if not is_leaf:
        splitter.node_split(impurity, &split, &n_constant_features)
        is_leaf = is_leaf or (split.pos >= end)
Member:

This change is apparently causing the classification trees to be different and therefore causes some test failures.

Member:

Hmm... @jmschrei, the order is the only difference here, right? What's the reason for moving the block before the first check?

jmschrei (Member, Author):

The code was organized such that the impurity of the entire dataset was calculated before the first split. However, when you calculate the split, you also get this value. My code caches this value when it does the split, but needs to do the split before it can return the impurity.

This has been updated in the latest commits, but only through a hack. I am now working on making this cleaner.
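
As a rough illustration of the pattern being described (plain Python with made-up names, not the actual Cython splitter interface): the splitter computes the node impurity as a by-product of the split scan and caches it, so a later impurity query is a cached read rather than a second pass over the samples.

    import numpy as np

    class CachingSplitter:
        # Illustrative only: caches the node impurity computed during node_split.

        def __init__(self):
            self._cached_impurity = None

        def node_split(self, y, w):
            # The scan for the best split already accumulates sum(w), sum(w*y),
            # and sum(w*y**2); the node impurity falls out of these for free.
            w_sum = w.sum()
            mean = (w * y).sum() / w_sum
            self._cached_impurity = (w * y * y).sum() / w_sum - mean ** 2
            # ... candidate split positions would be evaluated here ...
            return self._cached_impurity

        def node_impurity(self):
            # Cached read instead of a second pass over the node's samples.
            return self._cached_impurity

    splitter = CachingSplitter()
    y = np.array([1.0, -0.5, 0.25, 2.0])
    w = np.ones_like(y)
    splitter.node_split(y, w)        # computing the split also caches the impurity
    print(splitter.node_impurity())  # no rescan needed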

arjoly (Member) commented Jul 20, 2015

Interesting stuff!!

jmschrei force-pushed the _tree_speed branch 2 times, most recently from fd99cab to ac43c1e on July 21, 2015 at 07:39.
pprett (Member) commented Jul 21, 2015

This is like Christmas and Easter on the same day...

jmschrei (Member, Author)

The newest changes have fixed the segfaulting by inheriting from PresortSplitter instead of BaseDenseSplitter (apparently I was allocating memory incorrectly somewhere). The code has also been hacked in order to pass the tree nosetests that Olivier pointed out. However, it still cannot handle subsampling or per-leaf weights, because there is an issue in the splitter API that makes it difficult to get leaf weights. splitter.node_reset has previously been used to get the weight of the current node, but it is called before splitter.node_split, which is where the weight vector is calculated in my implementation.

I would suggest that, instead of calculating the weight with splitter.node_reset and the impurity with splitter.node_impurity, these values be returned in a StackRecord produced by splitter.node_split. That would make better use of the caching and require fewer code hacks. However, it would require rewriting the entire _tree.pyx module, as all tree builders and splitters would have to be modified to incorporate this API change.
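
A rough Python-level sketch of the suggested API change (the real code is Cython with a C struct for StackRecord; the extra fields and the node_split return values below are the proposal, not existing scikit-learn API):

    from dataclasses import dataclass

    # Illustrative stand-in for the Cython StackRecord struct; the impurity and
    # weighted_n_node_samples fields would be filled in by node_split.
    @dataclass
    class StackRecord:
        start: int
        end: int
        depth: int
        impurity: float = 0.0
        weighted_n_node_samples: float = 0.0

    def process_node(splitter, record):
        # Proposed flow: a single node_split call returns the split together with
        # the node statistics it already computed while scanning the samples, so
        # the builder no longer needs separate node_reset / node_impurity calls.
        split, impurity, weighted_n = splitter.node_split(record.start, record.end)
        record.impurity = impurity
        record.weighted_n_node_samples = weighted_n
        return split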

As a side note, fixing the memory issue has apparently also led to another ~10% speedup.

ogrisel (Member) commented Jul 21, 2015

@jmschrei quick note: it would be great to avoid merging the master branch into your PR branch; instead, rebase your branch on master when you need to get recent changes from master into your branch. Otherwise the history is more branchy and harder to read once the branch is merged.

Undoing a merge commit can be complicated. If you want to remove the merge commit from this PR, I would just copy your changes to the _tree.pyx and gradient_boosting.py files into a temp folder, do git reset --hard master in your branch, re-add your changes, re-run Cython, commit your changes again, and force-push to the same branch on your GitHub repo.

jmschrei mentioned this pull request on Jul 28, 2015.
glouppe (Contributor) commented Jul 31, 2015

Is this a duplicate of #5041? (if yes, just close this one)

jmschrei (Member, Author)

Close enough.

jmschrei closed this on Jul 31, 2015.