[WIP] _tree.pyx rewrite #5041

Conversation
Thanks, this is really great! It will take some time to review it, though. Will try to find some time later this week to do that properly.

By the way, can we try to have #5010 finished, and then this PR rebased on top of it, to make the review shorter?
Multithreaded decision tree building has been added (currently just my toy splitter). It works by having each thread calculate the best split on a different feature, and then figuring out which feature had the best split overall after all threads have returned. This is just a first pass at it. Here are some benchmarks on my machine:
To keep score, Gaussian with 8 threads is now 10x faster than the master branch, but covtypes with 8 threads is only ~3.3x faster. This can be made better. Single-threaded training has taken a hit, as I need to allocate an array for each thread in the inner loop, instead of one array total (covtypes went from 249s to 276s). It also looks like the jump from 2 to 4 threads is the biggest improvement, whereas 1 to 2 is not that great. I'm going to look into allocating one array per thread, instead of one array per call, to help with this overhead. Another issue is that it uses prange, whose backend is OpenMP. I'm unsure what others' thoughts are on using OpenMP, but @GaelVaroquaux raised some concerns that it may not be trivial to get working on all compilers. @ogrisel, do you have any insight into this?
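The per-feature parallel scheme described above can be sketched in plain Python (the real code uses Cython prange/OpenMP; the thread pool, function names, and the variance-proxy scoring below are illustrative assumptions, not the PR's code):

```python
# Sketch: each worker finds the best threshold on ONE feature, then the
# results are reduced to a single winning (feature, threshold) pair.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def best_split_for_feature(X, y, feature):
    """Return (score, feature, threshold) for the best split on one feature."""
    order = np.argsort(X[:, feature])
    x_sorted, y_sorted = X[order, feature], y[order]
    best = (-np.inf, feature, None)
    csum = np.cumsum(y_sorted)
    total = csum[-1]
    n = len(y_sorted)
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # cannot split between equal feature values
        # Proxy for variance reduction: sum over children of (sum y)^2 / n.
        left, right = csum[i - 1], total - csum[i - 1]
        score = left * left / i + right * right / (n - i)
        if score > best[0]:
            best = (score, feature, 0.5 * (x_sorted[i] + x_sorted[i - 1]))
    return best

def best_split(X, y, n_threads=4):
    # One task per feature; the reduction picks the highest proxy score.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(lambda f: best_split_for_feature(X, y, f),
                           range(X.shape[1]))
    return max(results)
```

In the PR itself this happens without the GIL via prange; the sketch only shows the map-then-reduce shape of the strategy.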
It will not be possible to get OpenMP to work with all compilers. The main thing is that it should degrade gracefully, so no one should get compiler / linker errors.
As a note, I've determined the reason things are slower, but am unsure what the best answer is. For the single-thread implementation, I do not sort y and w before processing them (which I previously had done), so the best-split processing isn't a single scan over contiguous elements. I had thought that the cost of creating a sorted array would be equivalent to looking up those elements for processing later, and so removed making the sorted array, but I was mistaken, since I make multiple calls to these elements. For the multi-thread implementation, each call to the criterion allocates new memory. I will be looking into both of these issues now, but any input would be appreciated!
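As a toy numpy illustration of the trade-off described here (not the actual Cython): materializing y in sorted order pays one copy, after which every pass is a contiguous scan, whereas keeping only the argsort indices re-gathers on each of the multiple passes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

order = np.argsort(x)

# Option 1: pay one copy up front, then every later pass over y is a
# contiguous scan in feature order.
y_sorted = y[order]
pass_1 = y_sorted.cumsum()[-1]

# Option 2: keep only the index array; every pass re-gathers through it.
pass_2 = y[order].cumsum()[-1]

# Both compute the same result; option 2 just repeats the gather cost
# on every one of the "multiple calls to these elements".
assert np.isclose(pass_1, pass_2)
```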
This is a fantastic change, but OpenMP is definitely not supported by the default toolchain available to Mac users. Is this something wheel deals with? Also, the cython …
@@ -27,37 +27,33 @@ cdef class Criterion:
# impurity of a split on that node. It also computes the output statistics
# such as the mean in regression and class probabilities in classification.

# Internal structures
Can you keep the documentation?
It is bad practice, but since this is still a work in progress I have been a little lax about keeping the documentation up to date, as I have been changing things fairly frequently.

Changes on …
I think that since I've made so many changes, displaying them all would be problematic, for the same reasons it doesn't show changes to the C files.

Yes, but it is fairly hard to discuss your changes, or even have an overview of your changes :/
Looks like the individual diff limit is 100KB according to this ... and the _tree.pyx one is above the limit
Not quite about the C files since I believe this is done via .gitattributes.

Ah, thanks for the clarification @lesteve. Is there a recommendation for what I should do? Unfortunately, since I am changing the internal API, there are a lot of changes, but I can understand how difficult this makes review.
Is it me or did you remove support for multi-output? (At least, it does not seem to be taken into account in any of the criteria.)

I have currently removed support for multi-output while I make sure single output is working. It can be added in simply once all the criteria work on single output. Also, the only criteria which are correct are MSE and FriedmanMSE. The best_split method for the remaining criteria is filler for now, but I am working on them today. Apologies for the confusion on that aspect.

Can you add a list of tasks to your PR, of the things that remain to be done? (Among the most important ones: support for multi-output and support for sparse data.)

Yep. I've written an initial list of what I've done so far and the major things which need to be done, and added it to the initial post. Another issue is support for random splits in addition to best splits. Instead of having multiple splitter objects for this, it makes sense to have one splitter coordinate feature selection and selection of the best split per node, but different methods to calculate the split (either best per feature, or random per feature).
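One way the single-splitter design floated here could look (all class and function names below are invented for illustration, not the PR's actual classes):

```python
import numpy as np

def best_threshold(x, y):
    """Exhaustive scan: try every midpoint, keep the lowest total variance."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_score = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # equal feature values cannot be separated
        score = ys[:i].var() * i + ys[i:].var() * (len(ys) - i)
        if score < best_score:
            best_t, best_score = 0.5 * (xs[i] + xs[i - 1]), score
    return best_t, best_score

def random_threshold(x, y, rng):
    """ExtraTrees-style: draw one threshold uniformly in the feature's range."""
    t = rng.uniform(x.min(), x.max())
    left, right = y[x <= t], y[x > t]
    score = left.var() * len(left) + right.var() * len(right)
    return t, score

class Splitter:
    """Coordinates feature iteration; the per-feature rule is pluggable."""
    def __init__(self, split_fn):
        self.split_fn = split_fn

    def node_split(self, X, y):
        candidates = [(self.split_fn(X[:, f], y), f) for f in range(X.shape[1])]
        (t, score), f = min(candidates, key=lambda c: c[0][1])
        return f, t
```

With `Splitter(best_threshold)` you get best-split behaviour; passing `lambda x, y: random_threshold(x, y, rng)` swaps in the random rule without touching the coordination logic.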
Sorry, I'm a little confused. Is my task list not comprehensive enough? This is my first time with a major pull request, so I apologize for not being as coordinated as I should be.

My bad, I missed it at the beginning of the PR ...
After tracking down an issue with assigning the proper node values throughout the trees, I've gotten RandomForestRegressors working, and they show the same speedups as GradientBoosting, 1.5x-3x faster. Here are some time tests.
It looks like there's a numerics issue with the smaller sample-size data sets. I'm unsure if this is due to random chance or an error in the code, and I need to rush out the door to catch my bus now. I'll post more when I get home.

It's probably time to separate the _tree module into _splitter, _criterion and _tree_structure modules.

I would personally be against that, as it would make the code more difficult to follow and promote future bloat. In an attempt to simplify _tree.pyx, I have removed several hundred lines of code from the _tree.pyx module and reduced the number of classes. The consequence is that I've exceeded the diff limit for this one PR. Future changes should be smaller!

Somewhat unrelated, has anyone looked into #1036 recently? I feel that could also really help the gradient boosting in practice....

@amueller, I agree that work should be done in that direction. I was going to take a closer look at it when I was done with this rewrite. I had a vague idea about sequential cross-validation which I wanted to discuss with @pprett, involving sequentially fitting trees until the validation set error stops decreasing significantly (instead of doing some form of grid search on n_estimators).

Cool :) I am not sure how your idea is different from what he implemented.
Currently, all dense splitters have been replaced by DenseSplitter, the shiny version of PresortBestSplitter, which presorts the entire dataset before fitting trees. This means sorting needs to be done only once, regardless of the number of trees grown. However, RandomForest creates a deep copy of each estimator, causing the presorting to be redone for each tree (it still manages to be faster than master). I tried a preliminary approach to cache the presorting across trees, which is to share the same splitter object across all tree objects. This fits naturally in with bootstrapping and random sampling, because they only manipulate the sample weight vector, which the splitter is already OK with!

It looks like we get another factor-of-two improvement on RandomForestRegressor. A note is that this will not work with parallel building of forests. However, the presorting can be refactored to be done outside of the splitter, and fed into it, at which point it remains constant for each tree. Apologies for the lack of rigor in not having more test cases. I'm working at home on a tiny laptop, and left all my benchmarking code at my desk machine! Thoughts?
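The caching idea can be sketched like this (hypothetical names; the PR shares the splitter object itself, while this sketch factors the cached sort out into its own object):

```python
import numpy as np

class PresortCache:
    """Per-feature argsort of X, computed once for the whole forest."""
    def __init__(self, X):
        # One O(n log n) sort per feature, amortized over every tree grown.
        self.sorted_idx = np.argsort(X, axis=0)

def bootstrap_weights(n_samples, rng):
    """Bagging expressed as a weight vector: how often each sample is drawn."""
    idx = rng.integers(0, n_samples, n_samples)
    return np.bincount(idx, minlength=n_samples).astype(float)

rng = np.random.default_rng(0)
X = rng.random((100, 5))
cache = PresortCache(X)

# Every tree would reuse cache.sorted_idx unchanged; bootstrapping and
# random sampling only perturb each tree's sample-weight vector.
weights_per_tree = [bootstrap_weights(len(X), rng) for _ in range(10)]
```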
By the way, I wanted to check whether the impurity values were still the same (this is very important, e.g. for all derived calculations such as variable importances), but the following script currently fails (segfault) for your branch:
This might be useful as well to ensure that this branch still builds exactly the same trees as before.
if self.n_outputs_ == 1:
    self.n_classes_ = self.n_classes_[0]
    self.classes_ = self.classes_[0]
cosmetics: trailing spaces
@glouppe, you were right about the presorting not being efficient given the parameters you mention, and I was wrong! I had forgotten that presorting requires a full pass over the dataset at each feature at each split. With presorting, I was getting ~4 seconds to train a decision tree on 5000 samples of the MNIST data, versus ~1.3 seconds on master. When I switched to not use presorting, it takes ~1.2 seconds on my branch. To sum it up: https://www.youtube.com/watch?v=YoSjNnCDiW0

I am going to make presorting an option which is default for gradient boosting, but otherwise not used. That gives the best of both worlds without forcing the user to know when it is a good idea.

I am currently experiencing some issues with segfaults when there are zero-weight points present, so my progress has been impeded while I try to track those down. DecisionTreeRegressor and DecisionTreeClassifier look close to master right now. I will post more benchmarks when I track down all the segfault issues.
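A plausible shape for such an 'auto' policy, later mentioned as a fit-time option (the thresholds below are invented for illustration and are not what the PR ships):

```python
def resolve_presort(presort, n_samples, max_depth):
    """Decide whether to presort; 'auto' uses an illustrative heuristic."""
    if presort != 'auto':
        return bool(presort)
    # Presorting pays off when the one-time sort is amortized over many
    # shallow trees on modest data (e.g. gradient boosting); the exact
    # cut-offs here are assumptions, not measured values from the PR.
    return max_depth is not None and max_depth <= 8 and n_samples <= 100_000
```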
+1
Estimators produce accuracies identical to the master branch on most tests, except for deep trees on large datasets, where they can vary by up to 1% (though usually less), which I think is likely due to rounding. I will post benchmarks as soon as the presorting option is added.

I am also removing support for multithreading from this PR. It seems unclear what the best strategy is, and more worthwhile to get this merged before dealing with that issue. The code is factored in such a manner that it will be easy to re-add in the future.
Please delete code sections that you commented out in …

Done, and pushed. However, the code was identical to code on the master branch, so I don't see how it would change the size of the diff.

There is still a large commented-out block for BestFirstTreeBuilder which causes a large number of spurious additions. Not sure it's enough but it's worth a try.
Okay, new benchmarks out. I'm getting exact results with DecisionTrees, GradientBoosting, and most RandomForests (a few out of place), but I think I still have a bug in my ExtraTrees random-split algorithm. Everything is faster (GradientBoosting is the fastest), except for ExtraTrees and deep DecisionTreeClassifiers (deep DecisionTreeRegressor is faster).

Like I mentioned previously, the deep decision trees seem to vary, with regression trees being faster and classification trees being slightly slower. Presorting is now an argument to the fit method, and can be turned on or off. It is currently always off except for GradientBoosting, but I am going to do some testing to find a good 'auto' setting, which will be the default. If I turn presorting on for RandomForests, here are some new benchmarks:
So, presorting can help tremendously on smaller datasets with constrained models (3x faster in the regression case), especially since people will make models with lots of trees. Here is a time test with 250 trees of maximum depth 6 using presorting:
Most unit tests are passing now. The ones which are not mostly relate to sparse data, best splitting, or multioutput; three features I have not yet added. I will be working on those now.

@ogrisel done
Do you seed all the models with a fixed seed?

It's weird that you don't get the same tree with …

All models are run with random_seed=0. It is weird; I'll take a look at the tree being built after I get BestFirstBuilder finished.
Also, in your bench report, could you please report the speedup (that is, the ratio master_time / branch_time) to make it easier to quickly review the perf results?

For forests, please systematically test for …

Also please rebase the branch on top of the current master and resolve the conflicts (probably in the generated …).

Also there is not much point in benchmarking on a small dataset such as iris. To check the correctness of the algorithm it's better to rely on having the tests pass.

And finally, please run the benchmarks for classification models only on classification datasets (e.g. diabetes, arcene, spambase, MNIST, covertype) and regression models on regression datasets (e.g. boston, california housing, regression...).
If the gain is in having a presorting à la Breiman, why not have a BreimanSplitter / PresortBestSplitter instead of changing the overall API?

The presorting is only one of the speedups, in the limited case of using small datasets and/or growing small trees. In my current revisions, it is an option which is off by default except in the case of GradientBoosting (which used to utilize PresortBestSplitter). The main speedups come from rewriting the criteria, specifically MSE.
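For reference, the "reduced" formula the MSE rewrite relies on (stated here as I understand it, with illustrative numpy rather than the PR's Cython): over a fixed node, minimizing N_L·Var(y_L) + N_R·Var(y_R) across split positions is equivalent to maximizing (Σy_L)²/N_L + (Σy_R)²/N_R, since the two differ only by the split-independent constant Σy². A single cumulative-sum pass then scores every candidate split:

```python
import numpy as np

def check_mse_proxy(y_sorted):
    """Verify: full weighted-variance objective == Σy² minus the cheap proxy."""
    n = len(y_sorted)
    csum = np.cumsum(y_sorted)
    sq = np.cumsum(y_sorted ** 2)
    for i in range(1, n):
        nl, nr = i, n - i
        sl, sr = csum[i - 1], csum[-1] - csum[i - 1]
        # What we want to minimize: N_L*Var(y_L) + N_R*Var(y_R),
        # using N*Var(y) = sum(y^2) - (sum y)^2 / N.
        full = (sq[i - 1] - sl**2 / nl) + ((sq[-1] - sq[i - 1]) - sr**2 / nr)
        # The proxy we can maximize instead, with no per-split y^2 terms.
        proxy = sl**2 / nl + sr**2 / nr
        assert np.isclose(full, sq[-1] - proxy)
    return True
```

Because the proxy needs only running sums of y, the inner loop touches far less data per candidate split than recomputing variances.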
I think it would be great to implement this without changing the whole splitter interface with many algorithm modifications / re-implementations. It indeed blurs the overall picture. I believe it's easier at all levels to focus on one incremental improvement at a time. Using the "reduced" impurity formula for impurity-decrease maximisation is probably possible by changing the way that the criterion is used and behaves during the splitting process. This might require some new methods, new returned values, new properties and / or changing the current criterion interface. There are some benchmarks available that were used previously. See for instance bench_mnist.py and bench_covertype.py for classification. (For mnist, a default parameter set could be added for gbrt in another pull request.) Adding something similar for regression tasks would be worthwhile.
I agree that it would be better to incrementally add in features, to ensure consistency. However, it is not possible to implement my improvements without either (1) significantly hacking the way the code works currently, which would make it more difficult to clean up later, or (2) rewriting the API to be cleaner and incorporate these changes.

Best-first building and sparse splitting are both functional. However, I seem to have reintroduced a bug where overly large trees get built when I refactored Criterion to work with sparse data structures. All that's left before review is to hunt down this bug again, and add multioutput predictions!
Multioutput support is now available for all criteria. This means that all features have been rewritten and work. However, in settings with large numbers of features, my implementations of Gini and Entropy are slower than the master branch by ~15%. I will need to track down the reason for this, but I hope to have a pull request ready for review by the end of Friday. Here are some benchmarks on multioutput data using make_multilabel_classification: 100,000 samples (50k train / 50k test), 250 features, 3 targets, no depth limit set on the trees.
The code is here:

from sklearn.ensemble import *
from sklearn.tree import *
from sklearn.utils import shuffle
from sklearn import datasets

import random
import time
import numpy as np

random.seed(0)
np.random.seed(0)

def multioutput_benchmark(clf, X_train, X_test, y_train, y_test):
    tic = time.time()
    clf.fit(X_train, y_train)
    dur = time.time() - tic
    # Fraction of mismatched labels across all samples and targets.
    acc = 1. * np.abs(y_test - clf.predict(X_test)).sum() / (y_test.shape[0] * y_test.shape[1])
    return dur, acc

X, y = datasets.make_multilabel_classification(n_samples=100000,
    n_features=250, n_classes=3, n_labels=1, return_indicator=True)

print "DecisionTreeClassifier (Entropy)"
clf = DecisionTreeClassifier(criterion='entropy')
print multioutput_benchmark(clf, X[::2], X[1::2], y[::2], y[1::2])

print "ExtraTreeClassifier (Entropy)"
clf = ExtraTreeClassifier(criterion='entropy')
print multioutput_benchmark(clf, X[::2], X[1::2], y[::2], y[1::2])

print "DecisionTreeClassifier (Gini)"
clf = DecisionTreeClassifier(criterion='gini')
print multioutput_benchmark(clf, X[::2], X[1::2], y[::2], y[1::2])

print "ExtraTreeClassifier (Gini)"
clf = ExtraTreeClassifier(criterion='gini')
print multioutput_benchmark(clf, X[::2], X[1::2], y[::2], y[1::2])

print "DecisionTreeRegressor (mse)"
clf = DecisionTreeRegressor()
print multioutput_benchmark(clf, X[::2], X[1::2], y[::2], y[1::2])

print "ExtraTreeRegressor (mse)"
clf = ExtraTreeRegressor()
print multioutput_benchmark(clf, X[::2], X[1::2], y[::2], y[1::2])
Given #5203 I am closing this PR, and will be incorporating it over the next few weeks as smaller PRs.

Great!! I am looking forward to your next pull requests @jmschrei!
The following tasks must be completed:
I attempted to get my changes proposed in #5007 to include min_sample and min_weight constraints, but the API imposed by the Criterion class/splitters made it difficult for me to do so efficiently without a large number of hacks. Instead, I chose to change the API in _tree.pyx to be simpler to use and easier for me to understand. The API is as follows:

TreeBuilder currently has the same API, but has been changed internally to handle the new one. This makes it clearer that Splitter handles shuffling the samples around based on previous splits and determining the best global split, while Criterion handles splitting a single feature, as coordinated by the Splitter.

Time tests show that this is slightly slower than the embedded criterion/splitter object I proposed in #5007 with many small trees, and even slower with fewer deep trees, with time tests below. It is also slightly more accurate, but I don't know a specific reason for that.
25 estimators, max depth 2
Deep trees seem a lot slower than before (though still faster than master); I'll need to track down where this slowdown comes from.
Another change I am considering is removing the caching of constant features, for two reasons: (1) it would make multithreading easier to handle, and (2) in none of these examples do constant features arise, and they can cause a speed decrease in larger examples. I have a modification (not in this pull request) which will check whether a feature is constant before checking for the best split, but won't cache the result. Here are time tests I get on covtypes between this branch, and without caching.
Thoughts? @glouppe @pprett @ogrisel
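The check-without-caching idea described above could look roughly like this (a hypothetical helper, not the PR's code): before scanning a feature for the best split, test whether it is constant over the node's samples and skip it if so, without recording the result anywhere that other threads or later splits would have to keep in sync.

```python
import numpy as np

def is_constant(X, samples, feature, tol=1e-12):
    """True if `feature` takes a single value over the node's `samples`."""
    col = X[samples, feature]
    return col.max() - col.min() <= tol
```

Because nothing is cached, each node repeats this cheap O(n) check, but there is no shared constant-features state to coordinate across threads.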