[WIP] Alternative architecture for regression tree #8458
Conversation
Use FEATURE_THRESHOLD to detect constant features
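For context, a minimal sketch of what detecting a constant feature with such a threshold could look like; the FEATURE_THRESHOLD value and the helper below are illustrative only, not the PR's actual code:

```python
import numpy as np

# Illustrative tolerance playing the role of FEATURE_THRESHOLD; the actual
# constant lives in the splitter code and its value may differ.
FEATURE_THRESHOLD = 1e-7

def is_constant_feature(values):
    """A feature is treated as constant if its values span less than the threshold."""
    values = np.asarray(values, dtype=np.float64)
    return values.max() - values.min() <= FEATURE_THRESHOLD

X_node = np.array([[1.0, 3.0],
                   [1.0 + 1e-9, 4.0],
                   [1.0, 5.0]])
print([is_constant_feature(X_node[:, j]) for j in range(X_node.shape[1])])
# [True, False] -> the first feature would be skipped when searching for a split
```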
Reflecting on the challenges of the last attempt at refactoring the tree code, making sure that the changes are beneficial for everybody is crucial. There is a real risk that the PR is too hard to review to convince the historical tree maintainers (@glouppe, @arjoly, @jmschrei) that it brings a benefit without the risk of incorrect code or otherwise degraded functionality. Maybe one direction to consider is to keep both tree implementations in the codebase for a while, at least in the PR, to be able to write tests comparing one with the other.
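A minimal sketch of the kind of side-by-side test suggested above, assuming a hypothetical NewDecisionTreeRegressor that follows the usual estimator API (the name is a placeholder, not part of this PR):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def test_new_tree_matches_old(NewDecisionTreeRegressor):
    # NewDecisionTreeRegressor stands in for the alternative implementation.
    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)
    y = rng.rand(200)

    old = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
    new = NewDecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

    # With the same data, depth and seed, both implementations should agree.
    np.testing.assert_allclose(old.predict(X), new.predict(X))
```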
Force-pushed from c717a2b to e4b4c07
This pull request introduces 3 new alerts - view on lgtm.com. (Comment posted by lgtm.com)
I didn't realize you were working on this, nice :) Have you run https://github.com/guolinke/boosting_tree_benchmarks with the current code to see how bad it is?
I have some benchmarks there: https://github.com/glemaitre/gbrt-benchmarks/blob/master/benchmark/benchmark_plotting.ipynb What is sure is that I am currently comparing against the exact method from XGBoost, and it is usually at least 5 times faster. I have an implementation which is as fast as XGBoost up to 10^5 samples. However, with a larger number of samples, the performance decreases back to the original performance of scikit-learn.
What is the difference between the first two columns (cache vs not? What does that mean?). If you say "this" is at least 5 times faster, what is "this"? Also, I assume you limit XGBoost to a single core? You don't do binning, so as fast as XGBoost exact, I guess?
XGBoost exact vs. scikit-learn master branch, both on a single core. So no binning for the moment.
XGBoost has a cache-optimized version (offering a gain of ~2x) by creating blocks of small data.
XGBoost exact is ~5x faster than scikit-learn master. This is something that I am currently benchmarking and profiling while developing my branch. I will update those benchmarks to be sure once I have finished what I am working on right now.
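For reference, a rough single-core comparison of the kind discussed here could be run as follows; the dataset, parameters and number of trees are placeholders and do not reproduce the linked gbrt-benchmarks setup (parameter names assume a recent xgboost release):

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb

rng = np.random.RandomState(42)
X, y = rng.rand(100000, 20), rng.rand(100000)

tic = time.time()
GradientBoostingRegressor(n_estimators=50, max_depth=3).fit(X, y)
print("scikit-learn master: %.1f s" % (time.time() - tic))

tic = time.time()
xgb.XGBRegressor(n_estimators=50, max_depth=3,
                 tree_method="exact", n_jobs=1).fit(X, y)
print("XGBoost exact (1 core): %.1f s" % (time.time() - tic))
```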
Great that you're posting the benchmarks. It would be very useful to include a comparison of your approach to scikit-learn master, so that we know how much we are gaining.
Thanks for the explanation @glemaitre. I somehow assumed we were more than 5x slower from hearsay. Are the default parameters very different? Or is/was there a difference in early stopping?
(or maybe it's just that 5x plus multicore means 20x, which is kind of a lot lol)
I had considered jumping in on parallelizing the current tree architecture or looking into categorical support. Should I wait until this new architecture is added in? I've also forgotten, is this solely going to be for regression trees, or for both CART variants (classification and regression)?
Once I fix my current stuff, I will provide a benchmark for different numbers of samples on the Higgs dataset.
> I had considered jumping in on parallelizing the current tree architecture or looking into categorical support. Should I wait until this new architecture is added in?
Since gradient boosting uses regression trees, we call them like that. But I don't see any blocker to making it happen for classification trees. However, there is currently one implementation detail which could be annoying for random forests. Since the trees are grown depth by depth and we take advantage of this to make only a single pass over the samples, the randomly selected features will be the same for all nodes at a given depth. This is not an issue for gradient boosting, since you usually do not subsample the features, but it is one for random forests, since you would lose the per-node feature randomization. It would need to be well benchmarked on different problems to see the influence of this particularity.
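A toy sketch of the difference described above (the function names are illustrative, not part of the PR):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, max_features = 10, 3

def per_node_features(nodes_at_depth):
    # Current forests: an independent feature subset is drawn for every node.
    return {node: rng.choice(n_features, max_features, replace=False)
            for node in nodes_at_depth}

def per_depth_features(nodes_at_depth):
    # Depth-wise single pass: all nodes at this depth share the same subset.
    shared = rng.choice(n_features, max_features, replace=False)
    return {node: shared for node in nodes_at_depth}

print(per_node_features(["left", "right"]))   # usually two different subsets
print(per_depth_features(["left", "right"]))  # the same subset for both nodes
```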
Any news on a benchmark? ;)
@amueller I got trapped with a side project at work which will take me until mid-March. Then I will be able to dedicate much more time and come back to the implementation.
Reference Issue
Partial fix #8231
What does this implement/fix? Explain your changes.
A pure Python implementation of a regression tree. Instead of evaluating one split at a time, the data are scanned once per depth and distributed to a list of splitters.
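A rough pure-Python sketch of this idea (illustrative only, not the PR's actual classes): during one scan per depth, every sample is routed to the splitter of the node it currently falls into, and all nodes of that depth are then evaluated at once.

```python
import numpy as np

class NodeSplitter:
    """Accumulates the samples of one node, then finds its best split."""
    def __init__(self):
        self.rows = []

    def add(self, i):
        self.rows.append(i)

    def best_split(self, X, y):
        rows = np.array(self.rows)
        best = None  # (variance reduction, feature, threshold)
        for j in range(X.shape[1]):
            for t in np.unique(X[rows, j])[:-1]:
                left, right = rows[X[rows, j] <= t], rows[X[rows, j] > t]
                gain = len(rows) * y[rows].var() - (
                    len(left) * y[left].var() + len(right) * y[right].var())
                if best is None or gain > best[0]:
                    best = (gain, j, t)
        return best

def grow_one_level(X, y, node_of_sample, splitters):
    # Single pass over the data: distribute each sample to its node's splitter...
    for i, node in enumerate(node_of_sample):
        splitters[node].add(i)
    # ...then evaluate all the nodes of this depth in one go.
    return {node: s.best_split(X, y) for node, s in splitters.items()}

rng = np.random.RandomState(0)
X, y = rng.rand(20, 3), rng.rand(20)
print(grow_one_level(X, y, np.zeros(20, dtype=int), {0: NodeSplitter()}))
```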
Any other comments?
The implementation is far from optimal in terms of speed. It is intended to illustrate the design of the trees.
TODO:
- DecisionTreeRegressor
EDIT by @raghavrv
@ogrisel @raghavrv Let me know if I forgot anything in the description
@glouppe @jmschrei @jnothman @nelson-liu @GaelVaroquaux @agramfort @lesteve here we come