[MRG+1] Optimize MSE criterion by avoiding computing constant terms during split optimization. #5203
Conversation
The big difference for gbrt could be due to the difference in the computation of impurity improvement. |
Which version of my code triggered the seg fault? I just pushed the last major code contribution to #5041, which shouldn't produce any seg faults. If so, let me know and I will track it down. |
With the fix, I still have a nice speed up
|
@jmschrei I will try with your last version. |
I'm not sure how helpful this is, but here are the results of the mnist regression benchmark code you provided, comparing master against my PR on my machine (which is apparently very slow!)
|
Here is the benchmark for MNIST in regression, on my machine:
|
I find that benchmark very confusing, since it doesn't match my results at all (particularly ExtraTrees being slower). Which version of my code is that with? |
|
@jmschrei I just ran the benchmark once again for your branch:
So it seems there is some variance on my machine. One thing for sure is that extra-trees are slower than they should be. |
@jmschrei I ran the benchmark against #5041 (af9c1cb) on my machine.
|
Here are the results of the regressors on MNIST on my box (Xeon X5660 @ 2.80GHz with 12MB cache), compiled with GCC 4.8 under Ubuntu 14.04:

from sklearn.datasets import fetch_mldata
from sklearn.ensemble import *
from sklearn.tree import *
import time

data = fetch_mldata('MNIST original')

estimators = [
    ("Random Forest", RandomForestRegressor),
    ("Extra Trees", ExtraTreesRegressor),
    ("CART", DecisionTreeRegressor),
    ("GBRT", GradientBoostingRegressor),
]

for name, estimator in estimators:
    est = estimator(random_state=0)
    tic = time.time()
    est.fit(data.data, data.target)
    toc = time.time() - tic
    print(name, toc)

# #5041
# Random Forest 150.26244401931763
# Extra Trees 216.57464289665222
# CART 26.032639026641846
# GBRT 371.58685517311096

# #5203
# Random Forest 137.6741919517517
# Extra Trees 112.64128398895264
# CART 22.449392080307007
# GBRT 325.93634605407715

# master
# Random Forest 264.32571601867676
# Extra Trees 263.0369408130646
# CART 45.76998996734619
# GBRT 708.4197769165039

So indeed @arjoly's refactoring seems to always be the best from a speed point of view. Before merging we need to check that the grown trees are still the same as in master, or at least that the accuracy is still the same, in case random feature sampling has been affected in one way or another. Also we should check that this does not incur a regression from a memory usage point of view.

Edit: I included the results for GBRT as well in the previous script / results. |
Here are the results for regressors on MNIST on my Mac (clang / llvm 6.1.0 compiler, Core i7 @ 2GHz with 4MB cache):

# master
# Random Forest 182.51679611206055
# Extra Trees 160.08595299720764
# CART 30.50537896156311

# #5203
# Random Forest 177.5755307674408
# Extra Trees 169.99811220169067
# CART 29.187539100646973

# #5041
# Random Forest 193.5168468952179
# Extra Trees 278.5783009529114
# CART 34.431031942367554

Again #5203 shows an improvement, although not as strong as on the Xeon proc with GCC. #5041 is slower than both master and #5203. |
I am getting the same results as @ogrisel, where #5203 is the fastest. My computer is a Xeon E5640 @ 2.67GHz with 12MB cache and gcc 4.8.4. However, #5203 is not building the same tree as master, and gets slightly different accuracies. Here are the tree sizes for the regression trees:
I support merging this PR over #5041 if this difference can be accounted for. |
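As an aside, here is a minimal sketch of the kind of check being discussed: fit the same estimator with a fixed random_state under each branch and compare tree sizes and scores. This is illustrative only, not the script used to produce the numbers above.

from sklearn.datasets import fetch_mldata
from sklearn.tree import DecisionTreeRegressor

data = fetch_mldata('MNIST original')
est = DecisionTreeRegressor(random_state=0).fit(data.data, data.target)
# run once per branch (e.g. master and this PR) and diff the outputs
print("node count:", est.tree_.node_count)
print("training R^2:", est.score(data.data, data.target))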
@jmschrei If this PR gets merged, can I put your name in the what's new entry for this feature? Without #5041, we would have missed these improvements. I hope that you will bring in, as small pull requests, all the remaining improvements that you thought of in #5041 (caching weight_n_samples, total statistics caching, pre-computation of (y ** 2, w * y ** 2) for MSE, etc.). If you have time, it would be great if you could review this pull request. |
@@ -197,6 +197,16 @@ cdef class Criterion:

        pass

    cdef double approx_impurity_improvement(self) nogil:
        """Approximate the impurity reduction removing constants for a split"""
Could you please extend that docstring to make it explicit that this is the only part of the impurity improvement computation that is required to find the best split threshold for a given feature, and that this method is used to speed up the search for the best split. The absolute impurity improvement is only computed by the impurity_improvement method once the best split has been found.
Maybe calling that "approximate" is misleading (it's a shifted and scaled version of the true value that preserves the location of the best split). But I don't personally have a better naming suggestion at hand.
+1, the method name is misleading.
This thing is in fact a proxy quantity such that the split that maximizes this value also maximizes the impurity improvement.
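For reference, here is a minimal NumPy sketch (an illustration of the idea, not the PR's Cython code) of why dropping the constant terms is safe for MSE: within a fixed node, the split position that maximizes the full impurity decrease also maximizes sum_left ** 2 / n_left + sum_right ** 2 / n_right, since the two only differ by quantities that are constant across candidate splits.

import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(50)  # targets of the node's samples, sorted along some feature

def full_improvement(pos):
    # decrease in weighted impurity (up to the constant n_node / n_total factor)
    left, right = y[:pos], y[pos:]
    mse = lambda a: np.mean((a - a.mean()) ** 2)
    n, n_l, n_r = float(len(y)), len(left), len(right)
    return mse(y) - n_l / n * mse(left) - n_r / n * mse(right)

def proxy_improvement(pos):
    # keep only the terms that vary with the split position
    left, right = y[:pos], y[pos:]
    return left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)

positions = range(1, len(y))
assert max(positions, key=full_improvement) == max(positions, key=proxy_improvement)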
What do you think of proxy_impurity_improvement?
+1 to @ogrisel. proxy_impurity_improvement is okay, or scaled_impurity_improvement might be better. We don't want to imply it is approximate in any way, since it is not.
One more data point, same mnist regression benchmark, big modern Xeon E5-2697 v2 @ 2.70GHz with 30MB cache:
|
FYI, on a random sample of 1000 rows of MNIST:
Still somewhat slower than xgboost, but we are closing the gap! @jmschrei do you have insights about what further improvements would be needed to get up to speed? |
@arjoly of course, thanks! I do plan on submitting small PRs. It was a mistake on my part to try to redo a core part of sklearn in one go. I hope you'll have time to review them in the coming weeks. @glouppe I am taking a quick break from decision trees to see if I can contribute anywhere else in sklearn until next week, when I'll start reviewing the code of master more thoroughly. The solution is likely more caching between splits on different levels of the tree, such as wy and wy**2. |
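A sketch of that caching idea in plain NumPy (an illustration of the principle, not the actual Cython splitter): with w * y and w * y ** 2 cached once, prefix sums give the left/right statistics of every candidate split in O(1) each instead of re-summing over samples.

import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(100)   # targets, sorted along the candidate feature
w = rng.rand(100)   # sample weights

wy, wy2 = w * y, w * y ** 2                                  # cached once per node
cw, cwy, cwy2 = np.cumsum(w), np.cumsum(wy), np.cumsum(wy2)  # prefix sums

def weighted_mse_children(pos):
    # O(1) weighted MSE of the left/right children for a split after `pos`,
    # using E[y^2] - E[y]^2 under the sample weights
    w_l, s_l, q_l = cw[pos - 1], cwy[pos - 1], cwy2[pos - 1]
    w_r, s_r, q_r = cw[-1] - w_l, cwy[-1] - s_l, cwy2[-1] - q_l
    return q_l / w_l - (s_l / w_l) ** 2, q_r / w_r - (s_r / w_r) ** 2

print(weighted_mse_children(50))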
@jmschrei Thanks for your positive attitude! I have to admit that we were a bit uncomfortable with how best to deal with your PR. You did amazing work and came up with great ideas, but also wanted to change too many things at the same time. I truly hope this was not done for nothing and that these ideas will eventually land in master. |
        cdef double* sum_right = self.sum_right

        cdef SIZE_t k
        cdef double proxy_impuriy_left = 0.
impuriy -> impurity
I took the comments into account and added an entry in the what's new. |
The last optimization, of the Friedman MSE, allows GBRT to be sped up a little more:
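As an aside, the same proxy trick works for the Friedman criterion: its improvement term, w_l * w_r / (w_l + w_r) * (mean_l - mean_r) ** 2, differs from the simpler w_l * w_r * (mean_l - mean_r) ** 2 only by the node weight w_l + w_r, which is constant while scanning splits within a node. A small NumPy check of that claim (illustrative only, unit sample weights assumed):

import numpy as np

rng = np.random.RandomState(42)
y = rng.rand(50)  # targets of one node, sorted along a feature

def friedman_improvement(pos):
    left, right = y[:pos], y[pos:]
    n_l, n_r = float(len(left)), float(len(right))
    return n_l * n_r / (n_l + n_r) * (left.mean() - right.mean()) ** 2

def friedman_proxy(pos):
    # drop the division by the constant node size
    left, right = y[:pos], y[pos:]
    return len(left) * len(right) * (left.mean() - right.mean()) ** 2

positions = range(1, len(y))
assert max(positions, key=friedman_improvement) == max(positions, key=friedman_proxy)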
|
Once this PR gets merged, I am going to propose another speed-up idea (different from those of @jmschrei). The idea is to perform the update in reverse order (from end to new_pos) if it is beneficial (branch fast-reverse-split). Results are nice on datasets such as covertype. There is a gain for all tree-based classifiers.
|
Each time you benchmark, can you include results from the master branch as well? It would be easier to read, and it would remove the possibility of anything else influencing the timings. |
@jmschrei I have added the other benchmarks. |
Can you comment on how GradientBoosting is being sped up? GradientBoosting is hard-coded to use the FriedmanMSE criterion, not the MSE one, and I'm not seeing you change it. Am I missing something? Also, I meant rerun the tests on master, not copy/paste the results from your first run. This is to try to factor out anything which is affecting the runtime of your computer during that time. |
In the branch fast-reverse-split, I update the left and right node statistics from end to new_pos instead of from pos to new_pos if it's beneficial (end - new_pos < new_pos - pos). The update method is shared between FriedmanMSE and MSE.
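A minimal plain-Python sketch of that trick (names are illustrative; the real update method maintains per-output weighted sums in Cython): when the split position moves from pos to new_pos, the left-node sum can either be updated forward by adding the samples in [pos, new_pos), or, when that range is the larger one, rebuilt from the cached node total by subtracting the samples in [new_pos, end).

def update_left_sum(y, sum_left, sum_total, pos, new_pos, end):
    # sum_left covers [start, pos); return the sum over [start, new_pos)
    if new_pos - pos <= end - new_pos:
        # forward update: add the newly included samples
        for i in range(pos, new_pos):
            sum_left += y[i]
    else:
        # reverse update: fewer samples to visit from the right end
        sum_left = sum_total
        for i in range(new_pos, end):
            sum_left -= y[i]
    return sum_left

y = list(range(10))                                      # node spans [0, 10)
print(update_left_sum(y, sum(y[:2]), sum(y), 2, 8, 10))  # 28 == sum(y[:8])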
I will re-do those more formally once this PR gets merged. But the results should be comparable. |
That sounds like a good improvement. I meant, how does fast-mse improve GBRT's performance? |
Sorry, it wasn't clear that fast-mse is this branch :-) I will change the names. |
Ah, I understand. The majority of the speed updates don't come from the faster impurity calculation, but in needing to calculate less in the update step. |
Thanks for the changes. +1 for merge as it is already a huge improvement. Then please open a new PR for your next batches of improvements. |
@jmschrei : could you do a review of this PR and give your 👍 if you think that it should be merged please. |
    - :class:`tree.DecisionTreeClassifier` now exposes an ``apply`` method
      for retrieving the leaf indices samples are predicted as. By
      `Daniel Galvez`_ and `Gilles Louppe`_.

    - Speed up decision tree regressor, random forest regressor, extra trees
      regressor and gradient regression tree boosting by computing a proxy
It should speed up both Gradient Boosting Regression and Gradient Boosting Classification, since it turns a classification problem into a regression problem. Maybe better to say "extra trees regressor and gradient boosting"
The gradient boosting is not general and only allows a regression tree base estimator. But OK, I will remove the regression tree mention.
This looks good for now, just one minor documentation issue. |
By avoiding computing constant terms during split optimization.
+1 for merge as well! |
I have squashed everything. This should be good to go. |
Let's wait for the CI to finish. Once merged @arjoly can you please rebase the |
Since it is green, I have merged by rebase. |
🍻 |
🎺 Let's get fast-reverse-split merged and then split the code into multiple files. |
🍻 thank you all! |
Indeed! Good job to the team. I am looking forward to seeing the rest of the improvements. Sent from my phone. Please forgive brevity and misspelling.
|
Cheers 🍻 !! |
We (aka someone from the tree team) might want to do a blog post on the speed improvements for 0.17? |
In the pull request #5041, @jmschrei proposes to approximate the impurity improvement during the optimisation by avoiding computing all the constant terms. Since this could be done simply by adding only one method, I have decided to implement this optimisation.
The benchmarks show that:
PS : I haven't been able to benchmark against #5041 since I had segmentation faults.
Here are the raw benchmark results.