[MRG+1] Optimize MSE criterion by avoiding computing constant terms during split optimization. #5203
Conversation
The big difference for gbrt could be due to the difference in the computation of impurity improvement. |
Which version of my code triggered the seg fault? I just pushed the last major code contribution to #5041, which shouldn't produce any seg faults. If so, let me know and I will track it down. |
With the fix, I still have a nice speed up
|
@jmschrei I will try with your last version. |
I'm not sure how helpful this is, but here are the results of the mnist regression benchmark code you provided, comparing master against my PR on my machine (which is apparently very slow!)
|
Here is the benchmark for MNIST in regression, on my machine:
|
I find that benchmark very confusing, since it doesn't match my results at all (particularly ExtraTrees being slower). Which version of my code is that with? |
|
@jmschrei I just ran the benchmark once again for your branch:
So it seems there is some variance on my machine. One thing for sure is that extra-trees are slower than they should be. |
@jmschrei I ran the benchmark against #5041 (af9c1cb) on my machine.
|
Here are the results of the regressors on MNIST on my box (Xeon X5660 @ 2.80GHz with 12MB cache), compiled with GCC 4.8 under Ubuntu 14.04:

from sklearn.datasets import fetch_mldata
from sklearn.ensemble import *
from sklearn.tree import *
import time

data = fetch_mldata('MNIST original')

estimators = [
    ("Random Forest", RandomForestRegressor),
    ("Extra Trees", ExtraTreesRegressor),
    ("CART", DecisionTreeRegressor),
    ("GBRT", GradientBoostingRegressor),
]

for name, estimator in estimators:
    est = estimator(random_state=0)
    tic = time.time()
    est.fit(data.data, data.target)
    toc = time.time() - tic
    print(name, toc)

# #5041
# Random Forest 150.26244401931763
# Extra Trees 216.57464289665222
# CART 26.032639026641846
# GBRT 371.58685517311096

# #5203
# Random Forest 137.6741919517517
# Extra Trees 112.64128398895264
# CART 22.449392080307007
# GBRT 325.93634605407715

# master
# Random Forest 264.32571601867676
# Extra Trees 263.0369408130646
# CART 45.76998996734619
# GBRT 708.4197769165039

So indeed @arjoly's refactoring seems to always be the best from a speed point of view. Before merging we need to check that the grown trees are still the same as in master, or at least that the accuracy is still the same, in case random feature sampling has been affected in one way or another. Also we should check that this does not incur a regression from a memory usage point of view.

Edit: I included the results for GBRT as well in the previous script / results. |
Here are the results for regressors on MNIST on my Mac (clang / llvm 6.1.0 compiler, Core i7 @ 2GHz with 4MB cache):

# master
# Random Forest 182.51679611206055
# Extra Trees 160.08595299720764
# CART 30.50537896156311

# #5203
# Random Forest 177.5755307674408
# Extra Trees 169.99811220169067
# CART 29.187539100646973

# #5041
# Random Forest 193.5168468952179
# Extra Trees 278.5783009529114
# CART 34.431031942367554

Again #5203 shows an improvement, although not as strong as on the Xeon proc with GCC. #5041 is slower than both master and #5203. |
I am getting the same results as @ogrisel, where #5203 is the fastest. My computer is a Xeon E5640 @ 2.67GHz with 12MB cache and gcc 4.8.4. However, #5203 is not building the same tree as master, and gets slightly different accuracies. Here are the tree sizes for the regression trees:
I support merging this PR over #5041 if this difference can be accounted for. |
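As an aside, here is a minimal sketch of the kind of check being discussed: fit the same estimator with a fixed random_state under each branch and compare tree sizes and scores. This is illustrative only, not the script used to produce the numbers above.

from sklearn.datasets import fetch_mldata
from sklearn.tree import DecisionTreeRegressor

data = fetch_mldata('MNIST original')
est = DecisionTreeRegressor(random_state=0).fit(data.data, data.target)
# run once per branch (e.g. master and this PR) and diff the outputs
print("node count:", est.tree_.node_count)
print("training R^2:", est.score(data.data, data.target))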
@jmschrei If this PR gets merged, can I put your name in the what's new entry for this feature? Without #5041, we would have missed these improvements. I hope that you will bring in, as small pull requests, all the remaining improvements that you thought of in #5041 (caching weight_n_samples, total statistics caching, pre-computation of (y ** 2, w * y ** 2) for MSE, etc.). If you have time, it would be great if you could review this pull request. |
@@ -197,6 +197,16 @@ cdef class Criterion:

        pass

    cdef double approx_impurity_improvement(self) nogil:
        """Approximate the impurity reduction removing constants for a split"""
Could you please extend that docstring to make it explicit that this is the only part of the impurity improvement computation that is required to find the best split threshold for a given feature, and that this method is used to speed up the search for the best split. The absolute impurity improvement is only computed by the impurity_improvement method once the best split has been found.
Maybe calling that "approximate" is misleading (it's a shifted and scaled version of the true value that preserves the location of the best split). But I don't personally have a better naming suggestion at hand.
+1, the method name is misleading.
This thing is in fact a proxy quantity such that the split that maximizes this value also maximizes the impurity improvement.
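For reference, here is a minimal NumPy sketch (an illustration of the idea, not the PR's Cython code) of why dropping the constant terms is safe for MSE: within a fixed node, the split position that maximizes the full impurity decrease also maximizes sum_left ** 2 / n_left + sum_right ** 2 / n_right, since the two only differ by quantities that are constant across candidate splits.

import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(50)  # targets of the node's samples, sorted along some feature

def full_improvement(pos):
    # decrease in weighted impurity (up to the constant n_node / n_total factor)
    left, right = y[:pos], y[pos:]
    mse = lambda a: np.mean((a - a.mean()) ** 2)
    n, n_l, n_r = float(len(y)), len(left), len(right)
    return mse(y) - n_l / n * mse(left) - n_r / n * mse(right)

def proxy_improvement(pos):
    # keep only the terms that vary with the split position
    left, right = y[:pos], y[pos:]
    return left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)

positions = range(1, len(y))
assert max(positions, key=full_improvement) == max(positions, key=proxy_improvement)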
What do you think of proxy_impurity_improvement?
+1 to @ogrisel. proxy_impurity_improvement is okay, or scaled_impurity_improvement might be better. We don't want to imply it is approximate in any way, since it is not.
One more data point, same mnist regression benchmark, big modern Xeon E5-2697 v2 @ 2.70GHz with 30MB cache:
|
FYI, on a random sample of 1000 rows of MNIST:
Still somewhat slower than xgboost, but we are closing the gap! @jmschrei do you have insights about what further improvements would be needed to get up to speed? |
@arjoly of course, thanks! I do plan on submitting small PRs. It was a mistake on my part to try to redo a core part of sklearn in one go. I hope you'll have time to review them in the coming weeks. @glouppe I am taking a quick break from decision trees to see if I can contribute anywhere else in sklearn until next week, when I'll start reviewing the code of master more thoroughly. The solution is likely more caching between splits on different levels of the tree, such as wy and wy**2. |
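A sketch of that caching idea in plain NumPy (an illustration of the principle, not the actual Cython splitter): with w * y and w * y ** 2 cached once, prefix sums give the left/right statistics of every candidate split in O(1) each instead of re-summing over samples.

import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(100)   # targets, sorted along the candidate feature
w = rng.rand(100)   # sample weights

wy, wy2 = w * y, w * y ** 2                                  # cached once per node
cw, cwy, cwy2 = np.cumsum(w), np.cumsum(wy), np.cumsum(wy2)  # prefix sums

def weighted_mse_children(pos):
    # O(1) weighted MSE of the left/right children for a split after `pos`,
    # using E[y^2] - E[y]^2 under the sample weights
    w_l, s_l, q_l = cw[pos - 1], cwy[pos - 1], cwy2[pos - 1]
    w_r, s_r, q_r = cw[-1] - w_l, cwy[-1] - s_l, cwy2[-1] - q_l
    return q_l / w_l - (s_l / w_l) ** 2, q_r / w_r - (s_r / w_r) ** 2

print(weighted_mse_children(50))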
@jmschrei Thanks for your positive attitude! I have to admit that we were a bit uncomfortable with how best to deal with your PR. You did amazing work and came up with great ideas, but also wanted to change too many things at the same time. I truly hope this was not done for nothing and that these ideas will eventually land in master. |
        cdef double* sum_right = self.sum_right

        cdef SIZE_t k
        cdef double proxy_impuriy_left = 0.
impuriy -> impurity
I took the comments into account and added an entry in the what's new. |
The last optimization, of the Friedman MSE, allows GBRT to be sped up a little more:
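As an aside, the same proxy trick works for the Friedman criterion: its improvement term, w_l * w_r / (w_l + w_r) * (mean_l - mean_r) ** 2, differs from the simpler w_l * w_r * (mean_l - mean_r) ** 2 only by the node weight w_l + w_r, which is constant while scanning splits within a node. A small NumPy check of that claim (illustrative only, unit sample weights assumed):

import numpy as np

rng = np.random.RandomState(42)
y = rng.rand(50)  # targets of one node, sorted along a feature

def friedman_improvement(pos):
    left, right = y[:pos], y[pos:]
    n_l, n_r = float(len(left)), float(len(right))
    return n_l * n_r / (n_l + n_r) * (left.mean() - right.mean()) ** 2

def friedman_proxy(pos):
    # drop the division by the constant node size
    left, right = y[:pos], y[pos:]
    return len(left) * len(right) * (left.mean() - right.mean()) ** 2

positions = range(1, len(y))
assert max(positions, key=friedman_improvement) == max(positions, key=friedman_proxy)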
|
Once this PR gets merged, I am going to propose another speed-up idea (different from those of @jmschrei). The idea is to perform the update in reverse order (from end to new_pos) if it is beneficial (branch fast-reverse-split). Results are nice on datasets such as covertype. There is a gain for all tree-based classifiers.
|
Each time you benchmark, can you include results from the master branch as well? It would be easier to read, and it would remove the possibility of anything else influencing the timings. |
@jmschrei I have added the other benchmarks. |
Can you comment on how GradientBoosting is being sped up? GradientBoosting is hard-coded to use the FriedmanMSE criterion, not the MSE one, and I'm not seeing you change it. Am I missing something? Also, I meant rerun the tests on master, not copy/paste the results from your first run. This is to try to factor out anything which is affecting the runtime of your computer during that time. |
In the branch fast-reverse-split, I update the left and right node statistics from end to new_pos instead of from pos to new_pos if it's beneficial (end - new_pos < new_pos - pos). The update method is shared between FriedmanMSE and MSE.
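A minimal plain-Python sketch of that trick (names are illustrative; the real update method maintains per-output weighted sums in Cython): when the split position moves from pos to new_pos, the left-node sum can either be updated forward by adding the samples in [pos, new_pos), or, when that range is the larger one, rebuilt from the cached node total by subtracting the samples in [new_pos, end).

def update_left_sum(y, sum_left, sum_total, pos, new_pos, end):
    # sum_left covers [start, pos); return the sum over [start, new_pos)
    if new_pos - pos <= end - new_pos:
        # forward update: add the newly included samples
        for i in range(pos, new_pos):
            sum_left += y[i]
    else:
        # reverse update: fewer samples to visit from the right end
        sum_left = sum_total
        for i in range(new_pos, end):
            sum_left -= y[i]
    return sum_left

y = list(range(10))                                      # node spans [0, 10)
print(update_left_sum(y, sum(y[:2]), sum(y), 2, 8, 10))  # 28 == sum(y[:8])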
I will re-do those more formally once this PR gets merged. But the results should be comparable. |
That sounds like a good improvement. I meant, how does fast-mse improve GBRT's performance? |
Sorry, it wasn't clear that fast-mse is this branch :-) I will change the names. |
Ah, I understand. The majority of the speed updates don't come from the faster impurity calculation, but in needing to calculate less in the update step. |
Thanks for the changes. +1 for merge as it is already a huge improvement. Then please open a new PR for your next batches of improvements. |
@jmschrei : could you do a review of this PR and give your 👍 if you think that it should be merged please. |
    - :class:`tree.DecisionTreeClassifier` now exposes an ``apply`` method
      for retrieving the leaf indices samples are predicted as. By
      `Daniel Galvez`_ and `Gilles Louppe`_.

    - Speed up decision tree regressor, random forest regressor, extra trees
      regressor and gradient regression tree boosting by computing a proxy
It should speed up both Gradient Boosting Regression and Gradient Boosting Classification, since it turns a classification problem into a regression problem. Maybe better to say "extra trees regressor and gradient boosting"
The gradient boosting is not general and only allows a regression tree base estimator. But OK, I will remove the regression tree mention.
This looks good for now, just one minor documentation issue. |
By avoiding computing constant terms during split optimization.
+1 for merge as well! |
I have squashed everything. This should be good to go. |
Let's wait for the CI to finish. Once merged @arjoly can you please rebase the |
Since it is green, I have merged by rebase. |
🍻 |
🎺 Let's get fast-reverse-split merged and then split the code into multiple files. |
🍻 thank you all! |
Indeed! Good job to the team. I am looking forward to seeing the rest of the improvements. Sent from my phone. Please forgive brevity and misspelling.
|
Cheers 🍻 !! |
We (aka someone from the tree team) might want to do a blog post on the speed improvements for 0.17? |
In the pull request #5041, @jmschrei proposes to approximate the impurity improvement during the optimisation by avoiding computing all the constant terms. Since this could be done simply by adding only one method, I have decided to implement this optimisation.
The benchmarks show that:
PS : I haven't been able to benchmark against #5041 since I had segmentation faults.
Here are the raw benchmark results.