
[WIP] Alternative architecture for regression tree #8458


Closed
wants to merge 64 commits

Conversation

Member

@glemaitre glemaitre commented Feb 25, 2017

Reference Issue

Partial fix #8231

What does this implement/fix? Explain your changes.

A pure Python implementation of a regression tree. Instead of evaluating one split at a time, the data are scanned in a single pass and each sample is dispatched to the splitter of the node it currently belongs to (a splitter-list, one splitter per open node).
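A minimal sketch of the idea, with made-up names and a toy median-split criterion (the PR's actual classes and criterion differ): the tree grows depth by depth, and one scan over the samples routes each row to the splitter of the leaf it currently occupies, instead of re-scanning the data once per node.

```python
import numpy as np


def best_median_split(x_col, y, rows):
    """Toy criterion: median threshold, kept only if it reduces the SSE."""
    threshold = np.median(x_col[rows])
    left = rows[x_col[rows] <= threshold]
    right = rows[x_col[rows] > threshold]
    if len(left) == 0 or len(right) == 0:
        return None

    def sse(idx):
        return ((y[idx] - y[idx].mean()) ** 2).sum()

    if sse(left) + sse(right) < sse(rows):
        return threshold, left, right
    return None


def grow_depth_wise(X, y, max_depth=3, feature=0):
    n_samples = len(y)
    leaf_of = np.zeros(n_samples, dtype=int)  # which leaf each sample sits in
    next_leaf = 1
    for depth in range(max_depth):
        # Single pass over the samples: distribute each row to the
        # "splitter" (here simply a list of row indices) of its leaf.
        splitters = {}
        for i in range(n_samples):
            splitters.setdefault(leaf_of[i], []).append(i)
        # Then evaluate a split for every open leaf at this depth.
        grew = False
        for leaf, row_list in splitters.items():
            split = best_median_split(X[:, feature], y, np.asarray(row_list))
            if split is None:
                continue
            _, left, right = split
            leaf_of[left] = next_leaf       # re-route samples to the children
            leaf_of[right] = next_leaf + 1
            next_leaf += 2
            grew = True
        if not grew:
            break
    # The prediction of a leaf is the mean target of its samples.
    return {leaf: y[leaf_of == leaf].mean() for leaf in np.unique(leaf_of)}
```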

Any other comments?

The implementation is far from optimal in terms of speed; it is meant to illustrate the design of the trees.

TODO:

  • Write unit tests for each class (multi-output and sparse features are bypassed for the moment);
  • Benchmark the implementation with DecisionTreeRegressor;
  • Sparse data support.

EDIT by @raghavrv

  • Time profile;
  • Minimally cythonize;
  • Parallelize the implementation;
  • Support the MAE criterion;
  • Support multi-output?

@ogrisel @raghavrv Let me know if I forgot anything in the description.

@glouppe @jmschrei @jnothman @nelson-liu @GaelVaroquaux @agramfort @lesteve here we come

@GaelVaroquaux
Member

Reflecting on the challenges of the last attempt at refactoring the tree code, making sure that the changes benefit everybody is crucial. There is a real risk that the PR is too hard to review to convince the historical tree maintainers (@glouppe, @arjoly, @jmschrei) that it is a benefit that comes without the risk of incorrect code or otherwise degraded functionality.

Maybe one direction to consider is to keep both tree implementations in the codebase for a while, at least in the PR, so as to be able to write tests comparing one with the other.
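For instance, having both implementations side by side would allow tests along these lines (`PyRegressionTree` is a hypothetical placeholder for the estimator introduced by this PR):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor


def test_new_tree_matches_current_implementation():
    X, y = make_regression(n_samples=500, n_features=10, random_state=42)
    old = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
    # ``PyRegressionTree`` is a placeholder name for the PR's estimator.
    new = PyRegressionTree(max_depth=5, random_state=0).fit(X, y)
    # With the same hyper-parameters and an exhaustive split search,
    # both trees should yield (near-)identical predictions.
    np.testing.assert_allclose(old.predict(X), new.predict(X))
```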

@jnothman
Member

This pull request introduces 3 alerts - view on lgtm.com

new alerts:

  • 1 for Inconsistent equality and inequality
  • 1 for Unreachable code
  • 1 for Inconsistent equality and hashing

Comment posted by lgtm.com

@amueller
Member

I didn't realize you were working on this, nice :) Have you run https://github.com/guolinke/boosting_tree_benchmarks with the current code to see how bad it is?

@glemaitre
Member Author

I have some results here: https://github.com/glemaitre/gbrt-benchmarks/blob/master/benchmark/benchmark_plotting.ipynb
I think these are not the latest results (I might have forgotten to turn on -O3 for XGBoost :))

What is sure is that I am currently comparing against the exact method of XGBoost, and it is usually at least 5 times faster. I have an implementation which is as fast as XGBoost up to 10^5 samples. However, with a larger number of samples, the performance degrades back to the original performance of scikit-learn.
I am currently profiling the code to find the reason and the bottleneck.
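The profiling setup is not shown in the thread; a generic cProfile run like this sketch is one way to locate such a bottleneck (for the pure-Python branch you would substitute its estimator for `DecisionTreeRegressor`):

```python
import cProfile
import pstats

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100000, n_features=20, random_state=0)

profiler = cProfile.Profile()
profiler.enable()
# Swap in the pure-Python estimator here; cProfile only sees
# Python-level calls, so Cython internals stay opaque.
DecisionTreeRegressor().fit(X, y)
profiler.disable()

# Show the 15 most expensive call sites by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```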

@amueller
Member

What is the difference between the first two columns (cache vs. no cache? What does that mean?).
And is that comparing against your branch or against master?

If you say "this" is at least 5 times faster, what is "this"? Also, I assume you limit XGBoost to a single core? You don't do binning, so as fast as XGBoost exact, I guess?

@glemaitre
Member Author

If you say "this" is at least 5 times faster, what is "this"? Also, I assume you limit XGBoost to a single core? You don't do binning, so as fast as XGBoost exact, I guess?

XGBoost exact vs. sklearn master branch, both on a single core. So no binning for the moment.

cache vs. no cache? What does that mean?

XGBoost has a cache-optimized version (offering a gain of ~2x) that works on small blocks of data.
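Roughly, the point is that the split scan gathers gradient statistics through a sorted index, which is a cache-unfriendly access pattern; processing that index in small blocks keeps the working set in cache. A toy sketch of the pattern (block size made up, not XGBoost's actual code):

```python
import numpy as np


def blocked_prefix_sums(sort_idx, grad, block=8192):
    """Prefix sums of gradients in feature-sorted order, one block at a time."""
    out = np.empty_like(grad)
    total = 0.0
    for start in range(0, len(sort_idx), block):
        chunk = grad[sort_idx[start:start + block]]  # gather one small block
        out[start:start + len(chunk)] = total + np.cumsum(chunk)
        total = out[start + len(chunk) - 1]
    return out
```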

"this" is at least 5 times faster, what is "this

XGBoost exact is 5x faster than scikit-learn master. This is something I am currently benchmarking and profiling while developing my branch.

I will update those benchmarks once I have finished what I am working on right now.
I want to benchmark on several datasets to be sure.
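For reference, the kind of single-core comparison being discussed looks roughly like this sketch (the real benchmarks live in the repository linked above; the `tree_method`/`n_jobs` spellings depend on the XGBoost version):

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

X, y = make_regression(n_samples=100000, n_features=50, random_state=0)

estimators = [
    # scikit-learn's GBRT is single-core by design.
    ("scikit-learn master", GradientBoostingRegressor(n_estimators=100)),
    # Pin XGBoost to the exact (non-binned) method on a single core.
    ("XGBoost exact", XGBRegressor(n_estimators=100, tree_method="exact",
                                   n_jobs=1)),
]
for name, est in estimators:
    tic = perf_counter()
    est.fit(X, y)
    print("%s: %.1f s" % (name, perf_counter() - tic))
```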

@GaelVaroquaux
Member

Great that you're posting the benchmarks. It would be very useful to include a comparison of your approach to scikit-learn master, so that we know how much we are gaining.

@amueller
Member

Thanks for the explanation @glemaitre. I somehow assumed we were more than 5x slower from hearsay. Are the default parameters very different? Or is/was there a difference in early stopping?

@amueller
Member

(or maybe it's just that 5x plus multicore means 20x, which is kind of a lot lol)

@jmschrei
Member

I had considered jumping in on parallelizing the current tree architecture or looking into categorical support. Should I wait until this new architecture is added in? I've also forgotten: is this solely going to be for regression trees, or for both CART variants?

@glemaitre
Member Author

Great that you're posting the benchmarks. It would be very useful to include a comparison of your approach to scikit-learn master, so that we know how much we are gaining.

Once I fix my current stuff, I will provide a benchmark for different numbers of samples on the Higgs dataset.

@glemaitre
Member Author

glemaitre commented Nov 23, 2017

I had considered jumping in on parallelizing the current tree architecture or looking into categorical support. Should I wait until this new architecture is added in?

I've also forgotten: is this solely going to be for regression trees, or for both CART variants?

Since gradient boosting uses regression trees, that is what we call them; I don't see any blocker to making this happen for classification trees as well. However, there is currently one implementation detail which could be annoying for random forests: since the trees are grown depth by depth, and we take advantage of this to make only a single pass over the samples, the randomly selected features will be the same for all nodes at a given depth.

This is not an issue for gradient boosting, since you usually do not subsample the features, but it is one for random forests, which rely on this per-node feature randomization. It would need to be benchmarked well on different problems to see the influence of this particularity.
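To make the constraint concrete, a small sketch of the two sampling schemes (illustrative numbers only):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, max_features = 100, 10

# Depth-wise single-pass scheme: one draw per depth, shared by every
# node at that depth.
features_at_depth = rng.choice(n_features, size=max_features, replace=False)

# Classic random forest scheme: a fresh draw per node, so sibling nodes
# explore different feature subsets.
features_per_node = [rng.choice(n_features, size=max_features, replace=False)
                     for _ in range(8)]  # e.g. 8 open nodes at one depth
```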

@amueller
Member

Any news on a benchmark? ;)

@glemaitre
Member Author

@amueller I got trapped in a side project at work which will take me until mid-March. Then I will be able to dedicate much more time and come back to this implementation.

Development

Successfully merging this pull request may close these issues.

[RFC] Gradient Boosting improvement