[MRG+1] Optimize MSE criterion by avoiding computing constant terms during split optimization. #5203

Merged: 1 commit (from the fast-mse branch) merged into scikit-learn:master on Sep 7, 2015

Conversation

@arjoly (Member) commented Sep 2, 2015

In pull request #5041, @jmschrei proposes to approximate the impurity
improvement during split optimisation by leaving out all the constant terms.
Since this can be done simply by adding a single method, I have decided to
implement this optimisation here.
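
For reference, here is a rough NumPy sketch of the idea (illustrative only, not the Cython code in this PR): for MSE, once the constant terms are dropped, the only split-dependent part of the impurity improvement is sum(y_left)**2 / n_left + sum(y_right)**2 / n_right, so maximizing that cheaper quantity locates the same split as the full variance reduction.

import numpy as np

rng = np.random.RandomState(0)
y = np.sort(rng.rand(200))  # targets ordered along one candidate feature

def full_improvement(pos):
    # variance reduction: impurity(parent) - weighted impurities of the children
    left, right = y[:pos], y[pos:]
    return y.var() - (len(left) * left.var() + len(right) * right.var()) / len(y)

def proxy_improvement(pos):
    # same maximizer, but without the terms that are constant across split positions
    left, right = y[:pos], y[pos:]
    return left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)

positions = np.arange(1, len(y))
best_full = positions[np.argmax([full_improvement(p) for p in positions])]
best_proxy = positions[np.argmax([proxy_improvement(p) for p in positions])]
assert best_full == best_proxy  # same split point, fewer operations per candidate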

The benchmarks show that:

  • GBRT is 35% faster on covertype;
  • random forest and decision tree are 10% faster on mnist regression;
  • there is no impact on the extra trees / random forest / decision tree classifiers;
  • the trees grown on master and on this PR differ slightly, due to differences in numerical errors and in tie breaking.

PS: I haven't been able to benchmark against #5041 because I ran into segmentation faults.

Here are the raw benchmark results.

± make in && python benchmarks/bench_mnist.py --classifiers RandomForest ExtraTrees CART

...

Classifier                          train-time test-time error-rate

master ExtraTrees                   44.73s       0.50s       0.0288
master RandomForest                 42.24s       0.43s       0.0304
master CART                         20.87s       0.02s       0.1214

thispr ExtraTrees                   46.92s       0.50s       0.0294
thispr RandomForest                 41.75s       0.42s       0.0318
thispr CART                         20.56s       0.18s       0.1219
± python benchmarks/bench_covertype.py --classifiers CART ExtraTrees GBRT RandomForest

...

Classifier             train-time  test-time  error-rate

master RandomForest      62.1709s    0.3572s      0.0334
master ExtraTrees        40.9040s    0.5348s      0.0366
master CART              15.2244s    0.0225s      0.0425
master GBRT             682.0862s    0.2816s      0.1777

thispr RandomForest      66.0639s    0.3583s      0.0330
thispr ExtraTrees        43.6342s    0.5493s      0.0372
thispr CART              14.8608s    0.0683s      0.0424
thispr GBRT             507.0393s    0.3322s      0.1777
# mnist in regression

from sklearn.datasets import fetch_mldata
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
try:
    from time import process_time as bench_time
except ImportError:
    from time import time as bench_time
data = fetch_mldata('MNIST original')
for name, EstimatorClass in  [("random forest", RandomForestRegressor),
                              ("extra trees", ExtraTreesRegressor),
                              ('cart', DecisionTreeRegressor)]:
    est = EstimatorClass(random_state=0)
    time_start = bench_time()
    est.fit(data['data'], data['target'])
    chrono = bench_time() - time_start
    print('%s = %s' % (name, chrono))


# ± make in && python bench_mnist_regression.py
#
# master random forest = 155.378248
# master extra trees   = 136.87747000000002
# master cart = 25.426823
#
# thispr random forest = 139.21123
# thispr extra trees   = 134.557126
# thispr cart          = 23.304993

@arjoly (Member, Author) commented Sep 2, 2015

Ping @pprett @glouppe

@arjoly (Member, Author) commented Sep 2, 2015

The big difference for gbrt could be due to the difference in the computation of impurity improvement.

@jmschrei (Member) commented Sep 2, 2015

Which version of my code triggered the seg fault? I just pushed the last major code contribution to #5041, which shouldn't produce any seg faults. If it still does, let me know and I will track it down.

@arjoly (Member, Author) commented Sep 2, 2015

With the fix, I still get a nice speedup:

thispr GBRT         507.0393s   0.3322s     0.1777  

@arjoly (Member, Author) commented Sep 2, 2015

@jmschrei I will try with your latest version.

@jmschrei (Member) commented Sep 2, 2015

I'm not sure how helpful this is, but here are the results of the mnist regression benchmark code you provided, comparing master against my PR on my machine (which is apparently very slow!)

#5041
random forest = 171.877454996
extra trees = 240.350306988
cart = 28.9209749699

master
random forest = 291.686326981
extra trees = 291.1176579
cart = 49.7908821106

@GaelVaroquaux (Member) commented Sep 2, 2015 via email

@glouppe (Contributor) commented Sep 3, 2015

Here is the benchmark for MNIST in regression, on my machine:

# master
random forest = 190.507488012
extra trees = 165.347410917
cart = 28.303839922

# arjoly@fast-mse
random forest = 150.614963055
extra trees = 117.137885094
cart = 23.5541858673

# jmschrei@_tree_rewrite
random forest = 159.706566095
extra trees = 215.314203978
cart = 26.6369800568

@jmschrei (Member) commented Sep 3, 2015

I find that benchmark very confusing, since it doesn't match my results at all (in particular ExtraTrees being slower). Which version of my code is that with?

@glouppe (Contributor) commented Sep 3, 2015

> I find that benchmark very confusing, since it doesn't match my results at all (in particular ExtraTrees being slower). Which version of my code is that with?

4283b20

@glouppe (Contributor) commented Sep 3, 2015

@jmschrei I just ran the benchmark once again for your branch:

# jmschrei@_tree_rewrite, 2nd run
random forest = 173.878245115
extra trees = 231.97809577
cart = 23.894657135

So it seems there is some variance on my machine. One thing is for sure: the extra-trees are slower than they should be.

@arjoly (Member, Author) commented Sep 3, 2015

@jmschrei I ran the benchmark against #5041 (af9c1cb) on my machine.

# jmschrei@_tree_rewrite #5041 (af9c1cb) - first run
random forest = 161.766252
extra trees = 236.50657800000002
cart = 28.638787000000036

# jmschrei@_tree_rewrite #5041 (af9c1cb) - second run
random forest = 161.42367199999998
extra trees = 235.89488899999998
cart = 28.593887999999993

@ogrisel (Member) commented Sep 3, 2015

Here are the results of the regressors on mnist on my box (Xeon X5660 @ 2.80GHz with 12MB cache), compiled with GCC 4.8 under Ubuntu 14.04:

from sklearn.datasets import fetch_mldata
from sklearn.ensemble import *
from sklearn.tree import *
import time

data = fetch_mldata('MNIST original')
estimators = [
    ("Random Forest", RandomForestRegressor),
    ("Extra Trees", ExtraTreesRegressor),
    ("CART", DecisionTreeRegressor),
    ("GBRT", GradientBoostingRegressor),
]

for name, estimator in estimators:
    est = estimator(random_state=0)
    tic = time.time()
    est.fit(data.data, data.target)
    toc = time.time() - tic
    print(name, toc)
# #5041
# Random Forest 150.26244401931763
# Extra Trees 216.57464289665222
# CART 26.032639026641846
# GBRT 371.58685517311096

# #5203
# Random Forest 137.6741919517517
# Extra Trees 112.64128398895264
# CART 22.449392080307007
# GBRT 325.93634605407715

# master
# Random Forest 264.32571601867676
# Extra Trees 263.0369408130646
# CART 45.76998996734619
# GBRT 708.4197769165039

So indeed @arjoly's refactoring seems to always be the best from a speed point of view. Before merging we need to check that the grown trees are still the same as in master, or at least that the accuracy is still the same in case random feature sampling has been affected in one way or another. We should also check that this does not incur a regression from a memory usage point of view, using memory_profiler.

Edit: I included the results for GBRT as well in the previous script / results.

@ogrisel (Member) commented Sep 3, 2015

Here are the results for the regressors on mnist on my Mac (clang / llvm 6.1.0 compiler, Core i7 @ 2GHz with 4MB cache):

# master
# Random Forest 182.51679611206055
# Extra Trees 160.08595299720764
# CART 30.50537896156311

# #5203
# Random Forest 177.5755307674408
# Extra Trees 169.99811220169067
# CART 29.187539100646973

# #5041
# Random Forest 193.5168468952179
# Extra Trees 278.5783009529114
# CART 34.431031942367554

Again, #5203 shows an improvement, although not as strong as on the Xeon processor with GCC. #5041 is slower than both master and #5203.

@jmschrei (Member) commented Sep 3, 2015


                        #5041              master               #5203
                    time(s)  score      time(s)  score      time(s)  score
MNIST (50% train/test split)
Random Forest        71.29   0.862      122.53   0.862       62.9    0.863
Extra Trees         101.00   0.890      126.44   0.889       47.9    0.892
Regression Tree      11.79   0.693       22.27   0.693       10.3    0.688

covtype (50% train/test split)
Random Forest        50.38   0.904       72.66   0.904       43.5    0.904
Extra Trees          83.60   0.926       75.14   0.925       37.2    0.925
Regression Tree       7.88   0.819       10.82   0.822        6.22   0.821

I am getting the same results as @ogrisel, where #5203 is the fastest. My computer is a Xeon E5640 @ 2.67GHz with 12MB cache, with gcc 4.8.4. However, #5203 is not building the same tree as master, and gets slightly different accuracies. Here are the tree sizes for the regression trees:

           #5041   master  #5203
MNIST       6857    6857    6869
covtype    40371   40381   40349

I support merging this PR over #5041 if this difference can be accounted for.

@arjoly (Member, Author) commented Sep 3, 2015

@jmschrei If this PR gets merged, can I put your name in the what's new entry for this feature? Without #5041, we would have missed these improvements.

I hope that you will bring in, as small pull requests, all the remaining improvements that you thought of in #5041 (caching weight_n_samples, total statistics caching, pre-computation of y ** 2 and w * y ** 2 for MSE, etc.).
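
For illustration, here is a rough sketch of that last pre-computation idea (not code from either PR; the loop below is just a plain-Python stand-in for the criterion update): with w * y and w * y ** 2 computed once before scanning the candidate split positions, the running child statistics can be maintained with additions only, with no multiplications in the inner loop.

import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(1000)   # targets, assumed ordered along the candidate feature
w = rng.rand(1000)   # sample weights

# computed once per node, before scanning the candidate split positions
wy = w * y
wy2 = w * y ** 2

# the scan over split positions then only needs additions
weighted_n_left = sum_left = sq_sum_left = 0.0
for pos in range(1, len(y)):
    weighted_n_left += w[pos - 1]
    sum_left += wy[pos - 1]
    sq_sum_left += wy2[pos - 1]
    # ... evaluate the (proxy) impurity improvement at this split position ...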

If you have time, it would also be great if you could review this pull request.

@@ -197,6 +197,16 @@ cdef class Criterion:

pass

cdef double approx_impurity_improvement(self) nogil:
"""Approximate the impurity reduction removing constants for a split"""
Review comment (Member):

Could you please extend that docstring to make it explicit that this is the only part of the impurity improvement computation that is required to find the best split threshold for a given feature? This method is used to speed up the search for the best split.

The absolute impurity improvement is only computed by the impurity_improvement method once the best split has been found.

Maybe calling that "approximate" is misleading (it's a shifted and scaled version of the true value that preserves the location of the best split), but I don't personally have a better naming suggestion at hand.

Review comment (Contributor):

+1, the method name is misleading.

This thing is in fact a proxy quantity such that the split that maximizes this value also maximizes the impurity improvement.

Review comment (Member, Author):

What do you think of proxy_impurity_improvement?

Review comment (Member):

+1 to @ogrisel

proxy_impurity_improvement is okay, or scaled_impurity_improvement might be better. We don't want to imply it is approximate in any way, since it is not.

@ogrisel (Member) commented Sep 3, 2015

One more data point, same mnist regression benchmark, big modern Xeon E5-2697 v2 @ 2.70GHz with 30MB cache:

master
('Random Forest', 163.37543892860413)
('Extra Trees', 135.14275002479553)
('CART', 26.503570079803467)

#5203
('Random Forest', 116.93278503417969)
('Extra Trees', 96.99032807350159)
('CART', 19.511062145233154)

@glouppe (Contributor) commented Sep 3, 2015

FYI, on a random sample of 1000 rows of MNIST:

master GBRT (n_estimators=100, max_depth=3), train time=30.6134212017 
#5041  GBRT (n_estimators=100, max_depth=3), train time=19.3394122124   
#5203  GBRT (n_estimators=100, max_depth=3), train time=17.5684459209   
XGBoostClassifier (n_estimators=100, max_depth=3, nthread=1), train time=11.0144069195

Still somewhat slower than xgboost, but we are closing the gap!

@jmschrei do you have insights about what further improvements would be needed to get up to speed?

@jmschrei (Member) commented Sep 3, 2015

@arjoly of course, thanks! I do plan on submitting small PRs. It was a mistake on my part to try to redo a core part of sklearn in one go. I hope you'll have time to review them in the coming weeks.

@glouppe I am taking a quick break from decision trees to see if I can contribute anywhere else in sklearn until next week, when I'll start reviewing the code of master more thoroughly. The solution is likely more caching between splits on different levels of the tree, such as wy and wy**2.

jmschrei mentioned this pull request on Sep 3, 2015
@glouppe (Contributor) commented Sep 3, 2015

@jmschrei Thanks for your positive attitude! I have to admit that we were a bit uncomfortable with how best to deal with your PR. You did amazing work and came up with great ideas, but also wanted to change too many things at the same time. I truly hope this was not done for nothing and that these ideas will eventually land in master.

cdef double* sum_right = self.sum_right

cdef SIZE_t k
cdef double proxy_impuriy_left = 0.
Review comment (Contributor):

impuriy -> impurity

@arjoly (Member, Author) commented Sep 4, 2015

I took the comments into account and added an entry in the what's new.

@arjoly (Member, Author) commented Sep 4, 2015

The last optimization of the Friedman MSE criterion speeds up GBRT a little more:

covertype - this pr
GBRT         498.4850s   0.3465s     0.1777  

@arjoly (Member, Author) commented Sep 4, 2015

Once this PR is merged, I am going to propose another speed-up idea (different from those of @jmschrei): perform the update in reverse order (from end to new_pos) whenever that is cheaper (branch fast-reverse-split).

Results are nice on datasets such as covertype. There is a gain for all tree-based classifiers.

# mnist master
ExtraTrees                   44.73s       0.50s       0.0288
RandomForest                 42.24s       0.43s       0.0304
CART                         20.87s       0.02s       0.1214

# mnist  this pr
ExtraTrees                   46.92s       0.50s       0.0294
RandomForest                 41.75s       0.42s       0.0318
CART                         20.56s       0.18s       0.1219

# mnist fast-reverse-split +  this pr
ExtraTrees                   43.51s       0.55s       0.0294
RandomForest                 39.31s       0.42s       0.0318
CART                         20.14s       0.19s       0.1219

# covertype master
RandomForest  62.1709s   0.3572s     0.0334
ExtraTrees    40.9040s   0.5348s     0.0366
CART          15.2244s   0.0225s     0.0425
GBRT         682.0862s   0.2816s     0.1777

# covertype this pr
RandomForest  66.0639s   0.3583s     0.0330
ExtraTrees    43.6342s   0.5493s     0.0372
CART          14.8608s   0.0683s     0.0424
GBRT         507.0393s   0.3322s     0.1777  

# covertype fast-reverse-split + this pr
RandomForest  45.4110s   0.3656s     0.0330  
ExtraTrees    33.9126s   0.5563s     0.0372  
CART          11.7812s   0.0702s     0.0424  
GBRT         439.4366s   0.2855s     0.1777

# Master mnist regression
random forest = 155.378248
extra trees   = 136.87747000000002
cart = 25.426823

# mnist regression this pr
random forest = 139.21123
extra trees   = 134.557126
cart          = 23.304993

# mnist regression fast-reverse-split + this pr
random forest = 129.846281
extra trees = 121.103736
cart = 22.655389000000042

@jmschrei (Member) commented Sep 4, 2015

Each time you benchmark, can you include results from the master branch as well? It would be easier to read, and it would reduce the possibility of anything else influencing the timings.

@arjoly (Member, Author) commented Sep 4, 2015

@jmschrei I have added the other benchmarks.

@jmschrei (Member) commented Sep 4, 2015

Can you comment on how GradientBoosting is being sped up? GradientBoosting is hard-coded to use the FriedmanMSE criterion, not the MSE one, and I don't see you changing it. Am I missing something?

Also, I meant rerunning the tests on master, not copy/pasting the results from your first run. This is to try to factor out anything that might be affecting your machine's runtime at the time.

@arjoly (Member, Author) commented Sep 4, 2015

> Can you comment on how GradientBoosting is being sped up? GradientBoosting is hard-coded to use the FriedmanMSE criterion, not the MSE one, and I don't see you changing it. Am I missing something?

In the branch fast-reverse-split, I update the left and right node statistics from end to new_pos instead of from pos to new_pos if it's beneficial (end - new_pos < new_pos - pos).

The update method is shared between FriedmanMSE and MSE.
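
Roughly, as a pure-Python sketch (illustrative only; the real code lives in the Cython update method, and the names below are hypothetical):

def update_sum_left(y, sum_left, sum_total, pos, new_pos, end):
    """Move the split position from `pos` to `new_pos`, maintaining the running
    sum of targets in the left child, going in whichever direction touches
    fewer samples. `sum_total` is the sum over the whole node y[0:end]."""
    if (new_pos - pos) <= (end - new_pos):
        # forward: add the samples that just moved into the left child
        sum_left += sum(y[pos:new_pos])
    else:
        # reverse: start from the node total and subtract the right child
        sum_left = sum_total - sum(y[new_pos:end])
    return sum_left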

> Also, I meant rerunning the tests on master, not copy/pasting the results from your first run. This is to try to factor out anything that might be affecting your machine's runtime at the time.

I will redo those more formally once this PR is merged, but the results should be comparable.

@jmschrei (Member) commented Sep 4, 2015

That sounds like a good improvement. I meant, how does fast-mse improve GBRT's performance?

@arjoly (Member, Author) commented Sep 4, 2015

Sorry, it wasn't clear: fast-mse is this branch :-) I will change the names.

@jmschrei (Member) commented Sep 4, 2015

Ah, I understand. The majority of the speedup doesn't come from the faster impurity calculation, but from having to compute less in the update step.

@glouppe (Contributor) commented Sep 4, 2015

Thanks for the changes. +1 for merge, as this already is a huge improvement. Then please open a new PR for your next batch of improvements.

glouppe changed the title from "[MRG] Optimize MSE criterion by avoiding computing constant terms during split optimization." to "[MRG+1] Optimize MSE criterion by avoiding computing constant terms during split optimization." on Sep 4, 2015
@GaelVaroquaux (Member) commented:

@jmschrei: could you do a review of this PR and give your 👍 if you think that it should be merged, please?

glouppe mentioned this pull request on Sep 4, 2015
- :class:`tree.DecisionTreeClassifier` now exposes an ``apply`` method
for retrieving the leaf indices samples are predicted as. By
`Daniel Galvez`_ and `Gilles Louppe`_.

- Speed up decision tree regressor, random forest regressor, extra trees
regressor and gradient regression tree boosting by computing a proxy
Review comment (Member):

It should speed up both Gradient Boosting Regression and Gradient Boosting Classification, since the latter internally turns a classification problem into a regression problem. It might be better to say "extra trees regressor and gradient boosting".

Review comment (Member, Author):

The gradient boosting implementation is not generic and only allows regression trees as base estimators. But OK, I will remove the regression tree mention.

@jmschrei (Member) commented Sep 5, 2015

This looks good for now, just one minor documentation issue.

By avoiding computing constant terms during split optimization.
@ogrisel (Member) commented Sep 7, 2015

+1 for merge as well!

@arjoly (Member, Author) commented Sep 7, 2015

I have squashed everything. This should be good to go.

@ogrisel (Member) commented Sep 7, 2015

Let's wait for the CI to finish. Once merged, @arjoly, can you please rebase the fast-reverse-split branch and open a new PR for that one?

@arjoly (Member, Author) commented Sep 7, 2015

Since it is green, I have merged by rebase.

arjoly merged commit d39716e into scikit-learn:master on Sep 7, 2015
@glouppe (Contributor) commented Sep 7, 2015

🍻

@jmschrei (Member) commented Sep 7, 2015

🎺

Let's get fast-reverse-split merged and then split the code into multiple files.

@ogrisel (Member) commented Sep 7, 2015

🍻 thank you all!

@GaelVaroquaux (Member) commented via email:

Indeed! Good job to the team.

I am looking forward to seeing the rest of the improvements.

arjoly deleted the fast-mse branch on September 7, 2015 at 09:35
@arjoly (Member, Author) commented Sep 7, 2015

Cheers 🍻 !!

@amueller (Member) commented Sep 9, 2015

We (i.e. someone from the tree team) might want to do a blog post on the speed improvements for 0.17?
Or maybe Olivier or I could write a blog post and feature some benchmarks?
