[WIP] Gradient Boosting speed improvement #5007
Conversation
This is great! Definitely adding that to my review list. (Review may take time, don't hesitate to ping us. CC: @pprett)
```cython
if first:
    impurity = splitter.node_impurity()
    first = 0

is_leaf = is_leaf or (impurity <= MIN_IMPURITY_SPLIT)

if not is_leaf:
    splitter.node_split(impurity, &split, &n_constant_features)
    is_leaf = is_leaf or (split.pos >= end)
```
This change apparently causes the classification trees to be different, and therefore causes some test failures.
Hmm... @jmschrei, the order is the only difference here, right? What's the reason for moving the block before the first check?
The code was organized such that the impurity of the entire dataset was calculated before the first split. However, when you calculate the split, you also get this value. My code caches this value when it does the split, but it needs to do the split before it can return the impurity.
This has been updated in the latest commits, but only through a hack. I am now working on making this cleaner.
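To illustrate the caching described above, here is a minimal Python sketch; the real code is Cython in the tree module, and the class and method bodies here are my own illustration, not the PR's implementation:

```python
import numpy as np

class CachingSplitter:
    """Toy illustration only: the splitter caches the node impurity it
    computes as a byproduct of searching for a split, so node_impurity()
    never needs to rescan the samples."""

    def __init__(self, y, w):
        self.y = np.asarray(y, dtype=float)
        self.w = np.asarray(w, dtype=float)
        self._impurity = None  # filled in by node_split

    def node_split(self, start, end):
        y, w = self.y[start:end], self.w[start:end]
        # The sums needed for the weighted MSE impurity are accumulated
        # anyway while scanning candidate thresholds, so cache the result.
        W, S, Q = w.sum(), (w * y).sum(), (w * y * y).sum()
        self._impurity = Q / W - (S / W) ** 2
        # ... a real splitter would now search thresholds with these sums;
        # a placeholder midpoint stands in for the chosen split position.
        return (start + end) // 2

    def node_impurity(self):
        # Returned from the cache instead of a second pass over the data.
        return self._impurity
```

The key point is that `node_split` and `node_impurity` share the same accumulated statistics, so the node impurity comes for free once the split has been searched.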
Interesting stuff!!
Force-pushed from fd99cab to ac43c1e.
This is like Christmas and Easter on the same day...
The newest changes have fixed the segfaulting by inheriting from PresortSplitter instead of BaseDenseSplitter (apparently I was allocating memory incorrectly somewhere). The code has also been hacked in order to pass the tree nosetests that Olivier pointed out. However, it cannot handle subsampling or per-leaf weights, because an issue in the splitter API makes it difficult to get leaf weights. The function […] I would suggest that instead of calculating the weight using […]

As a side note, fixing the memory issue has apparently also led to another ~10% speed-up.
@jmschrei quick note: it would be great to avoid merging the master branch into your PR branch, and instead rebase your branch on master when you need to get recent changes from master into your branch. Otherwise the history is more branchy and harder to read once the branch is merged, and undoing a merge commit can be complicated. If you want to clean the merge commit from this PR, I would just copy your change in the […]
Is this a duplicate of #5041? (If yes, just close this one.)
Close enough.
Through more stringent caching of intermediate results, and by hardcoding the FriedmanMSE criterion into a PresortBestSplitter, I propose the FriedmanMSESplitter, which builds the same trees but 2-3x faster than before. The effect scales with the number of estimators more than with depth, so building many small trees gives a better speed improvement (~3x) than building a few large trees (~2x).
Instead of making any calls to a Criterion class, this method simply stores the cumulative weight, weight times residual, and weight times squared residual in a vector, from which it can compute the impurity and improvement of any split. Previously, this required multiple scans of the full sample space, which was less efficient.
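A minimal NumPy sketch of this cumulative-statistics idea; it is not the PR's Cython code, and the names `friedman_mse_split_stats`, `impurity`, and `improvement` are hypothetical:

```python
import numpy as np

def friedman_mse_split_stats(y, w):
    """Build prefix sums of w, w*y, and w*y**2 in one pass (y is the
    residual in gradient boosting); afterwards, node impurities and
    Friedman MSE improvements are O(1) lookups instead of rescans."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    # A leading zero makes slice sums O(1): sum over [start, pos) is
    # c[pos] - c[start].
    cw   = np.concatenate(([0.0], np.cumsum(w)))
    cwy  = np.concatenate(([0.0], np.cumsum(w * y)))
    cwy2 = np.concatenate(([0.0], np.cumsum(w * y * y)))

    def impurity(start, end):
        # Weighted MSE of the node: E[y^2] - E[y]^2.
        W = cw[end] - cw[start]
        mean = (cwy[end] - cwy[start]) / W
        return (cwy2[end] - cwy2[start]) / W - mean ** 2

    def improvement(start, pos, end):
        # Friedman's MSE improvement for splitting [start, end) at pos:
        # w_l * w_r * (mean_r - mean_l)^2 / (w_l + w_r).
        w_l = cw[pos] - cw[start]
        w_r = cw[end] - cw[pos]
        diff = (cwy[end] - cwy[pos]) / w_r - (cwy[pos] - cwy[start]) / w_l
        return (w_l * w_r * diff ** 2) / (w_l + w_r)

    return impurity, improvement
```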
In addition to being faster, this also simplifies the code base by removing the (in my opinion) verbose criterion class. My next step is to add multithreaded training, so making the code as compact as possible is beneficial moving forward. While I hardcoded FriedmanMSE into this splitter, I believe all tree-based methods could benefit from the caching done here.
Lastly, I've added some more documentation to some of the Criterion and Splitter objects, which I found helpful for understanding their purpose.
Here are time tests and accuracies on master and on my branch across a variety of datasets, in two settings: 25 estimators of depth 2, and 5 estimators of depth 6.
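For context, a benchmark along these lines could look like the following; the dataset, sizes, and exact timing code are my assumptions, not the PR's actual benchmark script:

```python
import time
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(n_samples=10000, random_state=0)

# The two settings from the description: many shallow trees vs. few deep ones.
for n_estimators, max_depth in [(25, 2), (5, 6)]:
    clf = GradientBoostingClassifier(n_estimators=n_estimators,
                                     max_depth=max_depth, random_state=0)
    tic = time.time()
    clf.fit(X, y)
    print("n_estimators=%2d  max_depth=%d  fit: %.2fs  train acc: %.4f"
          % (n_estimators, max_depth, time.time() - tic, clf.score(X, y)))
```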
Of note: the two versions differ slightly (<1%) in the spambase and MNIST accuracies for 5 trees of depth 6. This seems to happen when the dimensionality is very high (several hundred features).
I am also currently getting segmentation faults when subsample is less than 1, which is odd because I didn't touch the allocation and deallocation steps. However, I thought I would open this WIP so that others can comment while I track down this issue.
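For reference, a minimal way to exercise the subsampling code path (with the branch installed) would be something like this; the dataset and parameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# subsample < 1 switches on stochastic gradient boosting, the code path
# reported to segfault on the branch.
clf = GradientBoostingClassifier(n_estimators=25, max_depth=2,
                                 subsample=0.5, random_state=0)
clf.fit(X, y)
```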