
[WIP] Alternative architecture for regression tree #8458


Closed
wants to merge 64 commits

Conversation

Member

@glemaitre glemaitre commented Feb 25, 2017

Reference Issue

Partial fix #8231

What does this implement/fix? Explain your changes.

A pure Python implementation of a regression tree. Instead of evaluating one split at a time, the data are scanned in a single pass and each sample is dispatched to the splitter of the node it currently belongs to (a splitter-list, one splitter per open node).
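A minimal sketch of the idea, with made-up names and a toy median-split criterion (the PR's actual classes and criterion differ): the tree grows depth by depth, and one scan over the samples routes each row to the splitter of the leaf it currently occupies, instead of re-scanning the data once per node.

```python
import numpy as np


def best_median_split(x_col, y, rows):
    """Toy criterion: median threshold, kept only if it reduces the SSE."""
    threshold = np.median(x_col[rows])
    left = rows[x_col[rows] <= threshold]
    right = rows[x_col[rows] > threshold]
    if len(left) == 0 or len(right) == 0:
        return None

    def sse(idx):
        return ((y[idx] - y[idx].mean()) ** 2).sum()

    if sse(left) + sse(right) < sse(rows):
        return threshold, left, right
    return None


def grow_depth_wise(X, y, max_depth=3, feature=0):
    n_samples = len(y)
    leaf_of = np.zeros(n_samples, dtype=int)  # which leaf each sample sits in
    next_leaf = 1
    for depth in range(max_depth):
        # Single pass over the samples: distribute each row to the
        # "splitter" (here simply a list of row indices) of its leaf.
        splitters = {}
        for i in range(n_samples):
            splitters.setdefault(leaf_of[i], []).append(i)
        # Then evaluate a split for every open leaf at this depth.
        grew = False
        for leaf, row_list in splitters.items():
            split = best_median_split(X[:, feature], y, np.asarray(row_list))
            if split is None:
                continue
            _, left, right = split
            leaf_of[left] = next_leaf       # re-route samples to the children
            leaf_of[right] = next_leaf + 1
            next_leaf += 2
            grew = True
        if not grew:
            break
    # The prediction of a leaf is the mean target of its samples.
    return {leaf: y[leaf_of == leaf].mean() for leaf in np.unique(leaf_of)}
```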

Any other comments?

The implementation is far from optimal in terms of speed; it is meant to illustrate the design of the trees.

TODO:

  • Write unit tests for each class (multi-output and sparse features are bypassed for the moment);
  • Benchmark the implementation with DecisionTreeRegressor;
  • Sparse data support.

EDIT by @raghavrv

  • Time profile;
  • Minimally cythonize;
  • Parallelize the implementation;
  • Support the MAE criterion;
  • Support multi-output?

@ogrisel @raghavrv Let me know if I forgot anything in the description.

@glouppe @jmschrei @jnothman @nelson-liu @GaelVaroquaux @agramfort @lesteve here we come

@GaelVaroquaux
Member

Reflecting on the challenges of the last attempt at refactoring the tree code, making sure that the changes benefit everybody is crucial. There is a real risk that the PR is too hard to review to convince the historical tree maintainers (@glouppe, @arjoly, @jmschrei) that it is a benefit that comes without the risk of incorrect code or otherwise degraded functionality.

Maybe one direction to consider is to keep both tree implementations in the codebase for a while, at least in the PR, so as to be able to write tests comparing one with the other.
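For instance, having both implementations side by side would allow tests along these lines (`PyRegressionTree` is a hypothetical placeholder for the estimator introduced by this PR):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor


def test_new_tree_matches_current_implementation():
    X, y = make_regression(n_samples=500, n_features=10, random_state=42)
    old = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
    # ``PyRegressionTree`` is a placeholder name for the PR's estimator.
    new = PyRegressionTree(max_depth=5, random_state=0).fit(X, y)
    # With the same hyper-parameters and an exhaustive split search,
    # both trees should yield (near-)identical predictions.
    np.testing.assert_allclose(old.predict(X), new.predict(X))
```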

@jnothman
Member

This pull request introduces 3 alerts - view on lgtm.com

new alerts:

  • 1 for Inconsistent equality and inequality
  • 1 for Unreachable code
  • 1 for Inconsistent equality and hashing

Comment posted by lgtm.com

@amueller
Member

I didn't realize you were working on this, nice :) Have you run https://github.com/guolinke/boosting_tree_benchmarks with the current code to see how bad it is?

@glemaitre
Member Author

I have some results here: https://github.com/glemaitre/gbrt-benchmarks/blob/master/benchmark/benchmark_plotting.ipynb
I think these are not the latest results (I might have forgotten to turn on -O3 for XGBoost :))

What is sure is that I am currently comparing against the exact method of XGBoost, and it is usually at least 5 times faster. I have an implementation which is as fast as XGBoost up to 10^5 samples. However, with a larger number of samples, the performance degrades back to the original performance of scikit-learn.
I am currently profiling the code to find the reason and the bottleneck.
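The profiling setup is not shown in the thread; a generic cProfile run like this sketch is one way to locate such a bottleneck (for the pure-Python branch you would substitute its estimator for `DecisionTreeRegressor`):

```python
import cProfile
import pstats

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100000, n_features=20, random_state=0)

profiler = cProfile.Profile()
profiler.enable()
# Swap in the pure-Python estimator here; cProfile only sees
# Python-level calls, so Cython internals stay opaque.
DecisionTreeRegressor().fit(X, y)
profiler.disable()

# Show the 15 most expensive call sites by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```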

@amueller
Member

What is the difference between the first two columns (cache vs. no cache? What does that mean?).
And is that comparing against your branch or against master?

If you say "this" is at least 5 times faster, what is "this"? Also, I assume you limit XGBoost to a single core? You don't do binning, so as fast as XGBoost exact, I guess?

@glemaitre
Member Author

If you say "this" is at least 5 times faster, what is "this"? Also, I assume you limit XGBoost to a single core? You don't do binning, so as fast as XGBoost exact, I guess?

XGBoost exact vs. sklearn master branch, both on a single core. So no binning for the moment.

cache vs. no cache? What does that mean?

XGBoost has a cache-optimized version (offering a gain of ~2x) that works on small blocks of data.
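Roughly, the point is that the split scan gathers gradient statistics through a sorted index, which is a cache-unfriendly access pattern; processing that index in small blocks keeps the working set in cache. A toy sketch of the pattern (block size made up, not XGBoost's actual code):

```python
import numpy as np


def blocked_prefix_sums(sort_idx, grad, block=8192):
    """Prefix sums of gradients in feature-sorted order, one block at a time."""
    out = np.empty_like(grad)
    total = 0.0
    for start in range(0, len(sort_idx), block):
        chunk = grad[sort_idx[start:start + block]]  # gather one small block
        out[start:start + len(chunk)] = total + np.cumsum(chunk)
        total = out[start + len(chunk) - 1]
    return out
```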

"this" is at least 5 times faster, what is "this

XGBoost exact is 5x faster than scikit-learn master. This is something I am currently benchmarking and profiling while developing my branch.

I will update those benchmarks once I have finished what I am working on right now.
I want to benchmark on several datasets to be sure.
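For reference, the kind of single-core comparison being discussed looks roughly like this sketch (the real benchmarks live in the repository linked above; the `tree_method`/`n_jobs` spellings depend on the XGBoost version):

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

X, y = make_regression(n_samples=100000, n_features=50, random_state=0)

estimators = [
    # scikit-learn's GBRT is single-core by design.
    ("scikit-learn master", GradientBoostingRegressor(n_estimators=100)),
    # Pin XGBoost to the exact (non-binned) method on a single core.
    ("XGBoost exact", XGBRegressor(n_estimators=100, tree_method="exact",
                                   n_jobs=1)),
]
for name, est in estimators:
    tic = perf_counter()
    est.fit(X, y)
    print("%s: %.1f s" % (name, perf_counter() - tic))
```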

@GaelVaroquaux
Member

Great that you're posting the benchmarks. It would be very useful to include a comparison of your approach to scikit-learn master, so that we know how much we are gaining.

@amueller
Member

Thanks for the explanation @glemaitre. I somehow assumed we were more than 5x slower from hearsay. Are the default parameters very different? Or is/was there a difference in early stopping?

@amueller
Member

(or maybe it's just that 5x plus multicore means 20x, which is kind of a lot lol)

@jmschrei
Member

I had considered jumping in on parallelizing the current tree architecture or looking into categorical support. Should I wait until this new architecture is added in? I've also forgotten: is this solely going to be for regression trees, or for both CART variants?

@glemaitre
Member Author

Great that you're posting the benchmarks. It would be very useful to include a comparison of your approach to scikit-learn master, so that we know how much we are gaining.

Once I fix my current stuff, I will provide a benchmark for different numbers of samples on the Higgs dataset.

@glemaitre
Member Author

glemaitre commented Nov 23, 2017

I had considered jumping in on parallelizing the current tree architecture or looking into categorical support. Should I wait until this new architecture is added in?

I've also forgotten: is this solely going to be for regression trees, or for both CART variants?

Since gradient boosting uses regression trees, that is what we call them; I don't see any blocker to making this happen for classification trees as well. However, there is currently one implementation detail which could be annoying for random forests: since the trees are grown depth by depth, and we take advantage of this to make only a single pass over the samples, the randomly selected features will be the same for all nodes at a given depth.

This is not an issue for gradient boosting, since you usually do not subsample the features, but it is one for random forests, which rely on this per-node feature randomization. It would need to be benchmarked well on different problems to see the influence of this particularity.
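To make the constraint concrete, a small sketch of the two sampling schemes (illustrative numbers only):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, max_features = 100, 10

# Depth-wise single-pass scheme: one draw per depth, shared by every
# node at that depth.
features_at_depth = rng.choice(n_features, size=max_features, replace=False)

# Classic random forest scheme: a fresh draw per node, so sibling nodes
# explore different feature subsets.
features_per_node = [rng.choice(n_features, size=max_features, replace=False)
                     for _ in range(8)]  # e.g. 8 open nodes at one depth
```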

@amueller
Member

Any news on a benchmark? ;)

@glemaitre
Member Author

@amueller I got trapped in a side project at work which will take me until mid-March. Then I will be able to dedicate much more time and come back to this implementation.

Development

Successfully merging this pull request may close these issues.

[RFC] Gradient Boosting improvement