
[WIP] _tree.pyx rewrite #5041

Closed · wants to merge 1 commit

Conversation

jmschrei
Member

The following tasks must be completed:

  • Propose new API to simplify Criterion, Splitter, and TreeBuilder methods
  • FriedmanMSE rewritten
  • MSE rewritten
  • DecisionTreeRegressor works with new API
  • GradientBoostingClassifier works with new API
  • Initial multi-threading proposal
  • GradientBoostingRegressor works with new API
  • RandomForestRegressor works with new API
  • Entropy rewritten
  • DecisionTreeClassifier works with new API
  • Gini rewritten
  • CHECKPOINT: All criterion rewritten
  • RandomForestClassifier works with new API
  • Random splitting method added to DenseSplitter (for ExtraTrees estimators)
  • ExtraTreesClassifier/Regressor works with new API
  • Refactor Random Forest internals
    • Fix segfault issue with Ensemble classifiers on more than one tree (recent)
    • Cache presort outside splitter objects to share among ensembles
  • CHECKPOINT: All classifiers run on dense single output data
  • Utilize caching more thoroughly in criterion
  • Prediction differences debugged or understood
  • Use counts of training labels in each leaf instead of hot-encoding of prediction
  • Carry this vector around until the end, instead of a single value
  • Make presorting an option
  • Best splitter addition
  • Multi-output predictions
  • Support for sparse data through SparseSplitter
  • Resolve multi-threading support issue
  • CHECKPOINT: All code written
  • Add proper documentation to all objects, mostly from [MRG] _tree.pyx documentation additions #5010.

I attempted to extend the changes I proposed in #5007 to include min_sample and min_weight constraints, but the API imposed by the Criterion class and the splitters made it difficult to do so efficiently without a large number of hacks. Instead, I chose to change the API in _tree.pyx to be simpler to use and easier for me to understand. The API is as follows:

cdef class Criterion:
    def __cinit__( self, n_output ): # Store the number of targets, same as before
    cdef void __init__( self, X, X_sample_stride, X_feature_stride, y, y_stride, w, size, min_leaf_samples, min_leaf_weight ): # Store constants about this dataset and the constraints the user imposed
    cdef SplitRecord best_split(self, samples, start, end, feature) nogil: # Find the best split for a specific feature using this criterion, looking at start, end, the sample mapping, and return a SplitRecord of the best split

cdef class Splitter:
    cdef void init( self, X, y, sample_weight ) # Same as before
    cdef SplitRecord best_split(self, start, end, n_constant_features) # Find the best split across all features. 

TreeBuilder currently has the same external API, but has been changed internally to work with the new one. This makes it clearer that the Splitter handles shuffling the samples around based on previous splits and determining the best global split, while the Criterion handles splitting a single feature, as coordinated by the Splitter.
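For illustration, here is a rough plain-Python sketch of the division of labour this API implies; the names and the `improvement` attribute are simplifications, since the real code is Cython, operates on sample-index arrays and strides, and returns SplitRecord structs.

class Criterion:
    def best_split(self, samples, start, end, feature):
        # Scan a single feature over samples[start:end] and return the best
        # split for that feature (or None if the feature is constant there).
        raise NotImplementedError

class Splitter:
    def __init__(self, criterion, features):
        self.criterion = criterion
        self.features = features

    def best_split(self, samples, start, end):
        # The splitter coordinates: it asks the criterion for the best split
        # on every candidate feature and keeps the best one overall.
        best = None
        for feature in self.features:
            split = self.criterion.best_split(samples, start, end, feature)
            if split is not None and (best is None or
                                      split.improvement > best.improvement):
                best = split
        return best

# A TreeBuilder repeatedly calls splitter.best_split(...) on the current
# node's sample range, partitions `samples` around the returned threshold,
# and recurses (or pushes the child ranges onto a stack / priority queue).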

Time tests (below) show that this is slightly slower than the embedded criterion/splitter object I proposed in #5007 for many small trees, and even slower for fewer, deeper trees. It is also slightly more accurate, but I don't know the specific reason for that.

25 estimators, max depth 2

          accuracy   train time
madelon    
master     0.68833     2.487
branch     0.70833     0.882
xgboost    0.7         0.663

gaussian (50,000 points, 2 classes, 200 dimensions, each dimension z=0.5 apart)
master     0.91712     30.815
branch     0.9192      10.757
xgboost    0.92068     7.232

spambase
master     0.92122     0.444
branch     0.92122     0.166
xgboost    0.92122     0.110 

mnist
master     0.83786     877.432
branch     0.83871     359.564
xgboost    0.81014     200.688

covtype
master     0.71034     462.218
branch     0.71034     249.624
xgboost    0.68720     103.372

5 estimators, max depth 6

          accuracy   train time
madelon    
master     0.835        2.460
branch     0.832        1.218
xgboost    0.818        0.5404

gaussian (50,000 points, 2 classes, 200 dimensions, each dimension z=0.5 apart)
master     0.83908     21.9226
branch     0.84588      9.4118
xgboost    0.84288      4.8209

spambase
master     0.91471      0.341
branch     0.91862      0.160
xgboost    0.92252      0.070

mnist
master     0.89471     720.519
branch     0.89542     463.294
xgboost    0.90257     118.749

covtype
master     0.75713     290.271
branch     0.75730     241.988
xgboost    0.75476      57.540

Deep trees seem a lot slower than before (though still faster than master); I'll need to track down where this slowdown comes from.

Another change I am considering is removing the caching of constant features, for two reasons: (1) it would make multithreading easier to handle, and (2) constant features do not arise in any of these examples, and the caching can cause a slowdown in larger examples. I have a modification (not in this pull request) which checks whether a feature is constant before searching for the best split, but does not cache the result; a sketch follows the timings below. Here are the time tests I get on covtypes for this branch with and without the caching.

Thoughts? @glouppe @pprett @ogrisel

25 trees, depth 2
covtypes
branch       249.62s
no_cache     235.04s

mnist
branch       359.564
no_cache     333.52

5 trees, depth 6
covtypes
branch       241.988
no_cache     220.25

mnist         
branch       463.294
no_cache     428.1
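For reference, a minimal NumPy sketch of the "check but don't cache" idea for constant features mentioned above (hypothetical names and tolerance; the real code works on the sorted feature values in Cython):

import numpy as np

FEATURE_THRESHOLD = 1e-7  # hypothetical tolerance for "effectively constant"

def feature_is_constant(X, samples, start, end, feature):
    # Look only at the samples in the current node: if the feature does not
    # vary there, it cannot yield a split, so skip it without caching the fact.
    values = X[samples[start:end], feature]
    return values.max() - values.min() <= FEATURE_THRESHOLD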

@glouppe
Contributor

glouppe commented Jul 28, 2015

Thanks, this is really great! It will take some time to review it though. Will try to find some time later this week to do that properly.

@glouppe
Contributor

glouppe commented Jul 28, 2015

By the way, can we try to have #5010 finished, and then this PR to be rebased on top of it, to make the review shorter?

@jmschrei
Member Author

I will finish #5010 first thing tomorrow. However, since I am fundamentally altering the API which #5010 documents, I am unsure whether rebasing on top of it would speed up the review.

@glouppe
Contributor

glouppe commented Jul 29, 2015

CC: @pprett @ndawe to help with the review

@jmschrei
Member Author

Multithreaded decision tree building has been added (currently just my toy splitter). It works by having each thread calculate the best split on a different feature, and then figuring out which feature had the best split overall after all threads have returned. This is just a first pass at it.
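Conceptually, the per-node parallelism looks like the sketch below (plain Python with a thread pool purely for illustration; the actual implementation uses Cython prange/OpenMP in nogil code, and `best_split`/`improvement` are the hypothetical names from the API sketch above):

from concurrent.futures import ThreadPoolExecutor

def parallel_best_split(criterion, samples, start, end, features, n_threads=4):
    # Each worker computes the best split for one feature; the per-feature
    # results are then reduced to find the best split overall for this node.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        per_feature = list(pool.map(
            lambda f: criterion.best_split(samples, start, end, f), features))
    candidates = [s for s in per_feature if s is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.improvement)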

Here are some benchmarks on my machine:

25 trees each of depth 2

Gaussian (50,000 points, 200 dimensions)
            accuracy  time
1 thread    0.91848   10.769
2 threads   0.91848    8.202
4 threads   0.91848    4.539
8 threads   0.91848    3.165

covtypes
            accuracy  time
1 thread    0.71085   276.567
2 threads   0.71085   240.414
4 threads   0.71085   169.613
8 threads   0.71085   138.591

To keep score, Gaussian with 8 threads is now 10x faster than the master branch, but covtypes with 8 threads is only ~3.3x faster. This can be made better.

Single-threaded training has taken a hit, as I need to allocate an array for each thread in the inner loop, instead of one array total (covtypes went from 249s to 276s). It also looks like the jump from 2 to 4 threads is the biggest improvement, whereas 1 to 2 is not that great. I'm going to look into how to allocate one array per thread, instead of one array per call, to help with this overhead.

Another issue is that it uses prange, whose backend is OpenMP. I'm unsure what others' thoughts are on using OpenMP, but @GaelVaroquaux raised some concerns that it may not be trivial to get working on all compilers. @ogrisel, do you have any insight into this?

@amueller
Member

It will not be possible to get OpenMP to work with all compilers. The main thing is that it should degrade gracefully, so no one should get compiler / linker errors.
Not sure what the status is there. @ogrisel what does appveyor use to build?

@jmschrei
Member Author

As a note, I've determined the reason things are slower, but am unsure what the best answer is.

For the single-thread implementation, I do not sort y and w before processing them (which I previously did), so the best-split processing is not a single scan over contiguous elements. I had thought that the cost of creating a sorted array would be equivalent to the cost of looking those elements up later, and so removed the sorted array, but I was mistaken, since the elements are accessed multiple times.

For the multi-thread implementation, each call to the criterion allocates new memory for w_cl, yw_cl and yw_sq, the sufficient statistics, and frees them after the best split is found. Previously I could do the computations in a single preallocated array, but that is not thread-safe, so I had each thread allocate its own array. I am unsure how to have each thread preallocate a single array that it keeps around.

I will be looking into both of these issues now, but any input would be appreciated!

@wlattner

This is a fantastic change, but OpenMP is definitely not supported by the default toolchain available to Mac users. Is this something wheel deals with?

Also, the cython parallel directive may be helpful for allocating w_cl, yw_cl, and yw_sq on a per-thread basis.
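As a plain-Python illustration of the per-thread scratch buffer idea (threading.local here stands in for what a Cython parallel block with thread-private pointers would do; the buffer names follow the statistics mentioned above, and `size` is a hypothetical length):

import threading
import numpy as np

_scratch = threading.local()

def get_scratch(size):
    # Allocate w_cl / yw_cl / yw_sq once per thread and reuse them across
    # best-split calls, instead of allocating and freeing on every call.
    if getattr(_scratch, "w_cl", None) is None or len(_scratch.w_cl) < size:
        _scratch.w_cl = np.empty(size)
        _scratch.yw_cl = np.empty(size)
        _scratch.yw_sq = np.empty(size)
    return _scratch.w_cl, _scratch.yw_cl, _scratch.yw_sq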

@@ -27,37 +27,33 @@ cdef class Criterion:
# impurity of a split on that node. It also computes the output statistics
# such as the mean in regression and class probabilities in classification.

# Internal structures
Contributor

Can you keep the documentation?

Member Author

It is bad practice, but since this is still a work in progress I have been a little lax with keeping documentation up to date, as I have been changing things fairly frequently.

@glouppe
Contributor

glouppe commented Jul 31, 2015

Changes on _tree.pyx cannot be reviewed through the github interface... Does anyone know how to fix that?

@jmschrei
Member Author

I think that since I've made so many changes, displaying them all would be problematic, for the same reasons it doesn't show changes to the C files.

@glouppe
Contributor

glouppe commented Jul 31, 2015

yes, but it is fairly hard to discuss your changes, or even to get an overview of them :/

@lesteve
Member

lesteve commented Jul 31, 2015

Looks like the individual diff limit is 100KB according to this ... and the _tree.pyx one is above the limit

I think that since I've made so many changes, displaying them all would be problematic, for the same reasons it doesn't show changes to the C files.

Not quite about the C files since I believe this is done via .gitattributes.

@jmschrei
Member Author

Ah, thanks for the clarification @lesteve . Is there a recommendation for what I should do? Unfortunately, since I am changing the internal API, there are a lot of changes, but I can understand how difficult this makes review.

@glouppe
Contributor

glouppe commented Jul 31, 2015

Is it me or did you remove support for multi-output? (At least, it does not seem to be taken into account in any of the criteria.)

@jmschrei
Member Author

I have currently removed support for multi-output while I make sure single output is working. It can be added back in easily once all the criteria work on single output.

Also, the only criteria which are correct are MSE and FriedmanMSE. The best_split methods for the remaining criteria are filler for now, but I am working on them today. Apologies for the confusion on that aspect.

@glouppe
Contributor

glouppe commented Jul 31, 2015

Can you add a list of tasks to your PR, of the things that remain to be done? (among the most important ones, support for multi-output and support for sparse data)

@jmschrei
Member Author

Yep. I've written an initial list of what I've done so far and the major things which remain to be done, and added it to the initial post. Another issue is support for random splits in addition to best splits. Instead of having multiple splitter objects for this, it makes sense to have one splitter that coordinates feature selection and the per-node choice of split, with different methods to calculate the split for each feature (either the best split or a random one).
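A hedged sketch of that design, with hypothetical method names (the real ExtraTrees-style split draws a threshold uniformly within the feature's range in the node and evaluates only that split):

import numpy as np

def split_feature(criterion, values, method="best", random_state=np.random):
    # One splitter, two ways of choosing a threshold for a single feature.
    if method == "best":
        # Exhaustively evaluate every candidate threshold for this feature.
        return criterion.best_threshold(values)                   # hypothetical call
    elif method == "random":
        # Draw one threshold at random in the feature's range and score it.
        threshold = random_state.uniform(values.min(), values.max())
        return criterion.evaluate_threshold(values, threshold)    # hypothetical call
    raise ValueError("method must be 'best' or 'random'")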

@lesteve
Member

lesteve commented Jul 31, 2015

Not 100% sure that's what @glouppe was after, but you can use task lists in github, see here for more information. That may both help you remember what needs to be done and help other people track the progress of the PR.

@jmschrei
Member Author

Sorry, I'm a little confused. Is my task list not comprehensive enough? This is my first time with a major pull request, so I apologize for not being as coordinated as I should be.

@lesteve
Member

lesteve commented Jul 31, 2015

Sorry, I'm a little confused. Is my task list not comprehensive enough?

My bad, I missed it at the beginning of the PR ...

@jmschrei
Member Author

After tracking down an issue with assigning the proper node values throughout the trees, I've gotten RandomForestRegressors working, and they show the same speedups as GradientBoosting, 1.5x-3x faster. Here are some time tests.

Using RandomForestRegressor

boston
trees    depth    version   mse        time
10       2        master    2.427      0.0138
                  branch    2.539      0.0114

25       2        master    3.104      0.0336
                  branch    3.145      0.0274

50       2        master    2.908      0.0670
                  branch    2.945      0.0544

50       4        master    2.095      0.1100
                  branch    2.102      0.0708

50       6        master    1.445      0.1476
                  branch    1.683      0.0948

diabetes
trees    depth    version   mse        time
10       2        master    42.49      0.0106
                  branch    46.09      0.0092

25       2        master    47.48      0.0260
                  branch    47.40      0.0227

50       2        master    43.57      0.0519
                  branch    43.65      0.0460

50       4        master    37.67      0.0806
                  branch    38.55      0.0568

50       6        master    38.03      0.1100
                  branch    40.35      0.0767

regression (25000 points, 200 dimensions y ~ x.sum())
trees    depth    version   mse        time
10       2        master    4.810      13.754
                  branch    4.810       7.612

25       2        master    4.756      32.446
                  branch    4.756      18.941

50       2        master    4.755      66.157
                  branch    4.755      38.284

50       4        master    4.744     126.169
                  branch    4.744      54.739

50       6        master    4.661     182.548
                  branch    4.661      88.256

It looks like there's a numerics issue with the smaller sample size data sets. I'm unsure if this is due to random chance or an error in the code, and I need to rush out the door to catch my bus now. I'll post more when I get home.

@arjoly
Member

arjoly commented Jul 31, 2015

Looks like the individual diff limit is 100KB according to this ... and the _tree.pyx one is above the limit

I think that since I've made so many changes, displaying them all would be problematic, for the same reasons it doesn't show changes to the C files.
Not quite about the C files since I believe this is done via .gitattributes.

It's probably time to separate the _tree module into _splitter, _criterion, and _tree_structure modules.

@jmschrei
Member Author

I would personally be against that, as it would make the code more difficult to follow and promote future bloat. In an attempt to simplify _tree.pyx, I have removed several hundred lines of code from the _tree.pyx module and reduced the number of classes. The consequence is that I've exceeded the diff limit for this one PR. Future changes should be smaller!

@amueller
Member

Somewhat unrelated, has anyone looked into #1036 recently? I feel that could also really help the gradient boosting in practice....

@jmschrei
Member Author

@amueller, I agree that work should be done in that direction. I was going to take a closer look at it when I was done with this rewrite. I had a vague idea about sequential cross-validation which I wanted to discuss with @pprett, involving sequentially fitting trees until the validation set error stops decreasing significantly (instead of doing some form of grid search on n_estimators).
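In rough sklearn terms the idea could look like the sketch below, which approximates "fit until the validation error stops decreasing" after the fact via staged_predict rather than actually halting the fit early; the parameter values are made up:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def fit_with_validation_stop(X_train, y_train, X_val, y_val,
                             max_estimators=500, patience=5):
    # Fit once with many stages, then scan the staged predictions on a
    # held-out validation set to pick how many trees were actually useful.
    gbr = GradientBoostingRegressor(n_estimators=max_estimators)
    gbr.fit(X_train, y_train)

    best_err, best_stage, since_best = np.inf, 0, 0
    for stage, y_pred in enumerate(gbr.staged_predict(X_val)):
        err = mean_squared_error(y_val, y_pred)
        if err < best_err:
            best_err, best_stage, since_best = err, stage, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error stopped improving
                break

    # Refit with the selected number of trees (stages are 0-indexed).
    gbr.set_params(n_estimators=best_stage + 1)
    return gbr.fit(X_train, y_train)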

@amueller
Member

Cool :) I am not sure how your idea differs from what he implemented.

@jmschrei
Member Author

Currently, all dense splitters have been replaced by DenseSplitter, the shiny version of PresortBestSplitter, which presorts the entire dataset before fitting trees. This means sorting needs to be done only once, regardless of the number of trees grown. However, RandomForest creates a deep copy of each estimator, causing the presorting to be redone for each tree (it still manages to be faster than master).

I tried a preliminary approach to caching the presorting across trees, which is to share the same splitter object across all tree objects. This fits naturally with bootstrapping and random sampling, because they only manipulate the sample weight vector, which the splitter already handles!

regression (25000 samples, 100 dimensions, y ~ x.sum())
         mse      time
master   0.0989   45.8
branch   0.0989   25.3
presort  0.0979   12.2

It looks like we get another factor-of-two improvement on RandomForestRegressor. Note that this will not work with parallel building of forests. However, the presorting can be refactored so that it is computed once outside the splitter and fed into it, at which point it remains constant for every tree.
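A minimal sketch of what computing the presort once and handing it to each tree's splitter could look like (NumPy stand-in; the splitter call at the end is a hypothetical signature, not the current one):

import numpy as np

def presort(X):
    # For every feature, the indices of the samples sorted by that feature.
    # Computed once per dataset and shared by all trees in the ensemble;
    # bootstrapping / subsampling only change the sample-weight vector, so
    # the sort order stays valid for every tree.
    return np.argsort(X, axis=0).astype(np.int32)

# X_idx_sorted = presort(X)
# splitter.init(X, y, sample_weight, X_idx_sorted)   # hypothetical signature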

Apologies about the lack of rigor in having many test cases. I'm working at home on a tiny laptop, and left all my benchmarking code at my desk machine!

Thoughts?

@glouppe
Contributor

glouppe commented Aug 1, 2015

By the way, I wanted to check whether the impurity values were still the same (this is very important, e.g. for all derived calculations such as variable importances), but the following script currently fails (segfault) for your branch:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
clf = DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz
import pydot
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("iris-branch.pdf")

This might be useful as well to ensure that this branch still builds exactly the same trees as before.


if self.n_outputs_ == 1:
    self.n_classes_ = self.n_classes_[0]
    self.classes_ = self.classes_[0]

Member

cosmetics: trailing spaces

@jmschrei
Member Author

@glouppe, you were right about the presorting not being efficient for the parameters you mention, and I was wrong! I had forgotten that presorting requires a full pass over the dataset for each feature at each split. With presorting, I was getting ~4 seconds to train a decision tree on 5000 samples of the MNIST data, versus ~1.3 seconds on master. When I switch presorting off, it takes ~1.2 seconds on my branch.

To sum it up: https://www.youtube.com/watch?v=YoSjNnCDiW0

I am going to make presorting an option which defaults to on for gradient boosting but is otherwise off. That gives the best of both worlds without requiring the user to know when it is a good idea.

I am currently experiencing some issues with segfaults when there are 0 weight points present, so my progress has been impeded while I try to track those down. DecisionTreeRegressor and DecisionTreeClassifiers look close to master right now. I will post more benchmarks when I track down all the segfault issues.

@glouppe
Contributor

glouppe commented Aug 19, 2015

I am going to make presorting an option which is default for gradient boosting, but otherwise not used. That gives the best of both worlds without forcing the user to know when it is a good idea.

+1

@jmschrei
Member Author

Estimators produce accuracies identical to the master branch on most tests, except for deep trees on large datasets, where they can vary by up to 1% (though usually less), which I think is likely due to rounding. I will post benchmarks as soon as the presorting option is added.

@jmschrei
Member Author

I am also removing support for multithreading from this PR. It is unclear what the best strategy is, and it seems more worthwhile to get this merged before dealing with that issue. The code is factored so that multithreading will be easy to re-add in the future.

@ogrisel
Member

ogrisel commented Aug 21, 2015

Please delete the code sections that you commented out in _tree.pyx: deletion will likely reduce the diff size, make the _tree.pyx diff reappear in the github diff view, and make it easier for reviewers to follow the progress on this PR.

@jmschrei
Member Author

Done, and pushed. However, the code was identical to the code on the master branch, so I don't see how it would change the size of the diff. _tree.pyx doesn't look like it reappeared; I think the 1000+ lines of changed code above the commented-out code are more likely to blame.

@ogrisel
Member

ogrisel commented Aug 21, 2015

There is still a large commented-out block for BestFirstTreeBuilder which causes a large number of spurious additions. Not sure it's enough but it's worth a try.

@jmschrei
Member Author

Okay, new benchmarks are out. I'm getting identical results to master with DecisionTrees, GradientBoosting, and most RandomForests (a few out of place), but I think I still have a bug in my ExtraTrees random-split algorithm. Everything is faster, with GradientBoosting showing the biggest speedup, except for ExtraTrees and deep DecisionTreeClassifiers (deep DecisionTreeRegressor is faster).

n_estimators max_depth   master_acc  master_time   branch_acc  branch_time

RandomForestClassifier
arcene
1   1   0.660  0.003   0.560  0.005
3   2   0.680  0.006   0.690  0.005
5   3   0.820  0.010   0.730  0.009
10  4   0.760  0.022   0.740  0.019

iris
1   1   0.667  0.001   0.600  0.001
3   2   0.933  0.002   0.893  0.002
5   3   0.947  0.003   0.947  0.003
10  4   0.960  0.006   0.947  0.007

spambase
1   1   0.691  0.002   0.738  0.002
3   2   0.831  0.007   0.814  0.006
5   3   0.895  0.013   0.890  0.012
10  4   0.924  0.029   0.921  0.027

Gaussian
1   1   0.599  0.063   0.599  0.056
3   2   0.651  0.250   0.688  0.215
5   3   0.765  0.597   0.769  0.444
10  4   0.867  1.619   0.868  1.118

madelon
1   1   0.518  0.004   0.498  0.004
3   2   0.532  0.014   0.575  0.012
5   3   0.573  0.032   0.550  0.027
10  4   0.667  0.086   0.607  0.069


ExtraTreesClassifier
arcene
1   1   0.680  0.002   0.690  0.002
3   2   0.720  0.004   0.740  0.005
5   3   0.700  0.006   0.740  0.010
10  4   0.710  0.013   0.790  0.021

iris
1   1   0.520  0.001   0.533  0.001
3   2   0.960  0.002   0.827  0.002
5   3   0.933  0.003   0.933  0.003
10  4   0.973  0.006   0.947  0.006

spambase
1   1   0.596  0.002   0.596  0.002
3   2   0.611  0.005   0.675  0.006
5   3   0.717  0.009   0.714  0.013
10  4   0.780  0.019   0.751  0.029

Gaussian
1   1   0.601  0.027   0.598  0.062
3   2   0.687  0.064   0.685  0.266
5   3   0.762  0.124   0.769  0.614
10  4   0.875  0.286   0.868  1.541

madelon
1   1   0.572  0.003   0.500  0.004 
3   2   0.615  0.006   0.578  0.015
5   3   0.580  0.010   0.587  0.034
10  4   0.613  0.024   0.577  0.086


GradientBoostingClassifier
arcene
1   1   0.560  0.099   0.560  0.054
3   2   0.660  0.434   0.620  0.177
5   3   0.640  1.021   0.640  0.429
10  4   0.660  2.623   0.730  0.992

iris
1   1   0.973  0.001   0.973  0.001
3   2   0.960  0.004   0.960  0.004
5   3   0.960  0.007   0.960  0.007
10  4   0.960  0.015   0.960  0.014

spambase
1   1   0.614  0.015   0.614  0.009
3   2   0.799  0.055   0.799  0.022
5   3   0.902  0.136   0.902  0.050
10  4   0.922  0.368   0.922  0.138

Gaussian
1   1   0.602  1.114   0.602  0.679
3   2   0.693  4.160   0.693  1.468
5   3   0.783 10.097   0.783  3.089
10  4   0.885 26.788   0.885  8.613

madelon
1   1   0.612  0.101   0.612  0.076
3   2   0.692  0.335   0.692  0.144
5   3   0.728  0.845   0.728  0.327
10  4   0.782  2.380   0.782  0.887


DecisionTreeClassifier
arcene
    1   0.660  0.063   0.660  0.047  
    2   0.590  0.125   0.570  0.093
    3   0.630  0.159   0.660  0.128
    4   0.650  0.191   0.670  0.155

iris
    1   0.667  0.0   0.667 0.0
    2   0.947  0.0   0.947 0.0
    3   0.960  0.0   0.960 0.0
    4   0.960  0.0   0.960 0.0


spambase
    1   0.788  0.005   0.788 0.006
    2   0.870  0.015   0.870 0.010
    3   0.886  0.017   0.886 0.015
    4   0.892  0.022   0.892 0.020

Gaussian
    1   0.592  0.977   0.592  0.672
    2   0.613  1.832   0.613  1.285
    3   0.650  2.491   0.650  1.866
    4   0.658  3.244   0.658  2.402

madelon
    1   0.612  0.051   0.612  0.050
    2   0.665  0.106   0.665  0.100
    3   0.715  0.166   0.715  0.155
    4   0.748  0.234   0.748  0.213


GradientBoostingRegressor
boston
1   1  4.861  0.001   4.861  0.001
3   2  4.803  0.003   4.803  0.002
5   3  2.873  0.008   2.873  0.003
10  4  2.367  0.020   2.373  0.008


iris
1   1   0.512  0.000   0.512  0.000
3   2   0.390  0.001   0.390  0.001
5   3   0.281  0.001   0.281  0.001
10  4   0.173  0.002   0.173  0.002


regression
1   1   3.475  0.232   3.475  0.135
3   2   3.469  0.969   3.469  0.270
5   3   3.458  2.314   3.458  0.577
10  4   3.328  5.639   3.328 1.501


diabetes
1   1  54.154  0.001   54.154  0.001
3   2  56.728  0.002   56.728  0.001
5   3  57.747  0.005   57.747  0.003
10  4  48.930  0.013   48.753  0.006


RandomForestRegressor
boston
1   1   4.220  0.001    4.220  0.001
3   2   2.870  0.004    2.870  0.003
5   3   2.972  0.008    2.929  0.006
10  4   2.417  0.020    2.449  0.013

iris
1   1   0.221  0.001   0.221  0.001
3   2   0.007  0.002   0.007  0.002
5   3   0.008  0.003   0.008  0.003
10  4   0.005  0.005   0.004 0.005

regression
1   1   3.448  0.147   3.448  0.085
3   2   3.388  0.841   3.388  0.475
5   3   3.368  2.134   3.368  1.182
10  4   3.298  5.496   3.298  3.104

diabetes
1   1  55.310  0.001  55.310  0.001
3   2  43.158  0.003  43.158  0.002
5   3  43.876  0.006  43.876  0.005
10  4  47.410  0.015  46.977  0.011

DecisionTreeRegressor
boston
1   1   5.354  0.001   5.354  0.0
3   2   3.535  0.001   3.535  0.001
5   3   2.656  0.002   2.656  0.001
10  4   2.198  0.002   2.198  0.001

iris
1   1   0.222  0.0   0.222  0.0
3   2   0.031  0.0   0.031  0.0
5   3   0.004  0.0   0.004  0.0
10  4   0.002  0.0   0.002  0.0

regression
1   1   3.427  0.234   3.427  0.137
3   2   3.437  0.449   3.437  0.261
5   3   3.431  0.661   3.431  0.386
10  4   3.436  0.878   3.436  0.497

diabetes
1   1  50.013  0.001  50.013  0.000
3   2  39.449  0.001  39.449  0.001
5   3  48.345  0.001  48.345  0.001
10  4  40.437  0.002  40.437  0.001

Deep DecisionTreeClassifier on MNIST
0.8751  29.221    0.8716  36.6203

Deep DecisionTreeRegressor on MNIST
0.8426, 44.774    0.8313  28.2852

Like I mentioned previously, the deep decision trees seem to vary, with regression trees being faster and classification trees being slightly slower.

Presorting is now an argument to the fit method, and can be turned on or off. It is currently always off except for GradientBoosting, but I am going to do some testing to find a good 'auto' setting, which will be the default. If I turn presorting on for RandomForests, here are some new benchmarks (the third pair of columns is this branch with presorting on):

RandomForestClassifier
Gaussian
1   1   0.599  0.063   0.599  0.056   0.597  0.589
3   2   0.651  0.250   0.688  0.215   0.682  0.651
5   3   0.765  0.597   0.769  0.444   0.776  0.795
10  4   0.867  1.619   0.868  1.118   0.866  1.111

RandomForestRegressor
regression
1   1   3.448  0.147   3.448 0.085   3.463  0.129
3   2   3.388  0.841   3.388 0.475   3.406  0.235
5   3   3.368  2.134   3.368 1.182   3.367  0.461   
10  4   3.298  5.496   3.298 3.104   3.322  1.147

So, presorting can help tremendously on smaller datasets with constrained models (3x faster in the regression case), especially since people will build models with lots of trees. Here is a time test with 250 trees of maximum depth 6 using presorting:

RandomForestClassifier
Gaussian
250  6   0.996  52.806   0.995  26.992

RandomForestRegressor
250  6   3.213 198.031   3.213  56.603 

Most unit tests are passing now. The ones that are not passing mostly relate to sparse data, best splitting, or multioutput, three features I have not yet added. I will be working on those now.

@jmschrei
Member Author

@ogrisel done

@ogrisel
Member

ogrisel commented Aug 21, 2015

Do you seed all the models with a fixed seed?

RandomForestClassifier
arcene
1   1   0.660  0.003   0.560  0.005

It's weird that you don't get the same tree with n_estimators=1 and max_depth=1 here.

@jmschrei
Member Author

All models are run with random_seed=0. It is weird; I'll take a look at the tree being built after I get BestFirstBuilder finished.

@ogrisel
Member

ogrisel commented Aug 21, 2015

Also in your bench report could you please report the speedup (that is the ratio master_time / branch_time) to make it easier to quickly review the perf results?

@glouppe
Contributor

glouppe commented Aug 21, 2015

For forests, please systematically test for max_depth=None, as this is the default setting.

@ogrisel
Member

ogrisel commented Aug 21, 2015

Also please rebase the branch on top of the current master and resolve the conflicts (probably in the generated .c files) to have travis run the tests.

@ogrisel
Member

ogrisel commented Aug 21, 2015

Also there is not much point in benchmarking on small datasets such as iris. To check the correctness of the algorithm it's better to rely on having the tests pass.

@ogrisel
Member

ogrisel commented Aug 21, 2015

And finally please run the benchmarks for classification models only on classification datasets (e.g. diabetes, arcene, spambase, MNIST, covertype) and regression models on regression datasets (e.g. boston, california housing, regression...).

@arjoly
Member

arjoly commented Aug 24, 2015

If the gain is in having a presorting à la Breiman, why not have a BreimanSplitter / PresortBestSplitter instead of changing the overall API?

@jmschrei
Member Author

The presorting is only one of the speedups, and only in the limited case of small datasets and/or small trees. In my current revisions it is an option which defaults to off, except in the case of Gradient Boosting (which used to use PresortBestSplitter). The main speedups come from rewriting the criteria, specifically MSE.
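For context, the core of the MSE speedup is keeping running sufficient statistics (the w_cl / yw_cl / yw_sq quantities mentioned earlier) so that one sorted scan per feature yields the impurity of every candidate split; a simplified, unweighted NumPy sketch:

import numpy as np

def best_mse_split(x, y):
    # Sort once by the feature, then sweep left to right keeping cumulative
    # sums of y and y**2; since Var = E[y^2] - E[y]^2, the left/right sum of
    # squared errors at every threshold falls out of these sums in one pass.
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    csum, csum_sq = np.cumsum(y), np.cumsum(y ** 2)
    total, total_sq = csum[-1], csum_sq[-1]

    best_pos, best_score = None, np.inf
    for i in range(1, n):                   # candidate split between i-1 and i
        if x[i] == x[i - 1]:
            continue                        # equal values cannot be separated
        sse_left = csum_sq[i - 1] - csum[i - 1] ** 2 / float(i)
        sse_right = (total_sq - csum_sq[i - 1]
                     - (total - csum[i - 1]) ** 2 / float(n - i))
        if sse_left + sse_right < best_score:
            best_pos, best_score = i, sse_left + sse_right
    if best_pos is None:
        return None
    threshold = (x[best_pos - 1] + x[best_pos]) / 2.0
    return threshold, best_score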

@arjoly
Member

arjoly commented Aug 24, 2015

The main speedups come from rewriting the criteria, specifically MSE.

I think it would be great to implement this without changing the whole splitter interface and without so many algorithm modifications / re-implementations. It indeed blurs the overall picture. I believe it's easier at all levels to focus on one incremental improvement at a time.

Using the "reduced" impurity formula for impurity-decrease maximisation is probably possible by changing the way the criterion is used and behaves during the splitting process. This might require some new methods, new return values, new properties, and/or changes to the current criterion interface.

There are some benchmarks available that were used previously. See for instance bench_mnist.py and bench_covertype.py for classification. (For mnist, a default parameter set could be added for gbrt in another pull request.) Adding something similar for regression tasks would be worthwhile.

@jmschrei
Member Author

I agree that it would be better to incrementally add in features, to ensure consistency. However, it is not possible to implement my improvements without either (1) significantly hacking the way the code works currently, which would make it more difficult to clean it up later, or (2) rewriting the API to be cleaner and incorporate these changes.

@jmschrei
Member Author

Best first building and sparse splitting are both functional. However, I seem to have reintroduced a bug where overly large trees get built when I refactored Criterion to work with sparse data structures. All that's left before review is to hunt down this bug again, and add multioutput predictions!

@jmschrei
Member Author

jmschrei commented Sep 2, 2015

Multioutput support is now available for all criteria. This means that all features have been rewritten and work. However, in settings with large numbers of features my implementations of Gini and Entropy are slower than the master branch by ~15%. I will need to track down the reason for this, but I hope to have a pull request ready for review by the end of Friday.

Here are some benchmarks on multioutput data using make_multilabel_classification, 100,000 samples (50k train 50k test), 250 features, 3 targets, no depth limit set on the trees.

                                         master          branch
                                     time   acc         time   acc
DecisionTreeClassifier (entropy)    11.55   0.266      13.20   0.267
ExtraTreeClassifier (entropy)        0.68   0.316       0.91   0.316
DecisionTreeClassifier (gini)       12.31   0.269      13.54   0.269
ExtraTreeClassifier (gini)           0.58   0.314       0.73   0.312
DecisionTreeRegressor (mse)         19.48   0.268      10.32   0.269
ExtraTreeRegressor (mse)            17.95   0.267       9.14   0.272

The code is here:

from sklearn.ensemble import *
from sklearn.tree import *
from sklearn.utils import shuffle
from sklearn import datasets as datasets
import random
import time
import numpy as np

random.seed(0)
np.random.seed(0)

def multioutput_benchmark( clf, X_train, X_test, y_train, y_test ):
    tic = time.time()
    clf.fit( X_train, y_train )
    dur = time.time() - tic

    acc = 1. * np.abs(y_test - clf.predict( X_test )).sum() / (y_test.shape[0] * y_test.shape[1])

    return dur, acc

X, y = datasets.make_multilabel_classification( n_samples=100000, 
    n_features=250, n_classes=3, n_labels=1, return_indicator=True )

print "DecisionTreeClassifier (Entropy)"
clf = DecisionTreeClassifier( criterion='entropy' )
print multioutput_benchmark( clf, X[::2], X[1::2], y[::2], y[1::2] )

print "ExtraTreeClassifier (Entropy)"
clf = ExtraTreeClassifier( criterion='entropy' )
print multioutput_benchmark( clf, X[::2], X[1::2], y[::2], y[1::2] )

print "DecisionTreeClassifier (Gini)"
clf = DecisionTreeClassifier( criterion='gini' )
print multioutput_benchmark( clf, X[::2], X[1::2], y[::2], y[1::2] )

print "ExtraTreeClassifier (Gini)"
clf = ExtraTreeClassifier( criterion='gini' )
print multioutput_benchmark( clf, X[::2], X[1::2], y[::2], y[1::2] )

print "DecisionTreeRegressor (mse)"
clf = DecisionTreeRegressor()
print multioutput_benchmark( clf, X[::2], X[1::2], y[::2], y[1::2] )

print "ExtraTreeRegressor (mse)"
clf = ExtraTreeRegressor()
print multioutput_benchmark( clf, X[::2], X[1::2], y[::2], y[1::2] )

@jmschrei
Member Author

jmschrei commented Sep 3, 2015

Given #5203 I am closing this PR, and will be incorporating it over the next few weeks as smaller PRs.

@arjoly
Member

arjoly commented Sep 3, 2015

Great!! I am looking forward to your next pull requests @jmschrei!

@jmschrei jmschrei closed this Sep 3, 2015
@jmschrei jmschrei mentioned this pull request Sep 4, 2015