
[WIP] Issue #11993 - add node_bootstrap param to RandomForest #17504

Closed
wants to merge 10 commits into from

Conversation

JosephTLucas
Contributor

@JosephTLucas JosephTLucas commented Jun 6, 2020

#DataUmbrella
cc: @ab-anssi

Reference Issues/PRs

Issue #11993

What does this implement/fix? Explain your changes.

@ab-anssi and I worked to implement node-level bootstrapping, but after implementing a prototype/proof of concept example, we did not get the results we expected. Do you have any advice on how/where this feature should be implemented?
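To make the intended feature concrete, here is a toy, purely illustrative sketch of node-level bootstrapping (this is not the PR's code; the function name `fit_node` and the trivial split rule are our own stand-ins): at each node, the rows that reached it are re-sampled with replacement before the split is chosen.

```python
import numpy as np

def fit_node(X, y, rng, depth=0, max_depth=3, boot_size=None):
    # Toy recursive fit: at *each node*, draw a bootstrap sample of the
    # rows that reached it, then search for a split on that sample only.
    n = X.shape[0]
    if depth == max_depth or n < 2 or len(np.unique(y)) < 2:
        return {"leaf": True, "value": float(np.mean(y))}
    size = n if boot_size is None else min(boot_size, n)
    ind = rng.randint(0, n, size)            # node-level bootstrap draw
    Xb = X[ind]
    feat = 0                                 # trivial stand-in "split search"
    thr = float(np.median(Xb[:, feat]))
    left = X[:, feat] <= thr
    if left.all() or (~left).all():
        return {"leaf": True, "value": float(np.mean(y))}
    return {
        "leaf": False, "feature": feat, "threshold": thr,
        "left": fit_node(X[left], y[left], rng, depth + 1, max_depth, boot_size),
        "right": fit_node(X[~left], y[~left], rng, depth + 1, max_depth, boot_size),
    }

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = (X[:, 0] > 0.5).astype(float)
tree = fit_node(X, y, rng, boot_size=30)
```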

Any other comments?

@ab-anssi

ab-anssi commented Jun 6, 2020

For the node-level bootstrapping we want to use the same helper functions as for tree-level bootstrapping (_get_n_samples_bootstrap and _generate_sample_indices). We have copy-pasted the code for now because of circular-dependency issues (these functions should live in utils, for example).

@ab-anssi

ab-anssi commented Jun 6, 2020

We do not understand why the implementation we propose does not pass the tests, since we tried not to change the behavior of the code. Even when we set boot_size to X.shape[0], the tests do not pass. Do you have any leads?
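One possible lead, offered as speculation rather than a diagnosis: even with boot_size equal to X.shape[0], a bootstrap draws indices with replacement, so the sample almost surely repeats some rows and drops others, and the fitted tree can legitimately differ from one trained on the untouched data:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100
ind = rng.randint(0, n, n)  # "full-size" bootstrap draw

# with replacement, some rows repeat and others never appear,
# so X[ind] differs from X even though len(ind) == n
n_unique = len(np.unique(ind))
print(n_unique)  # strictly fewer than n with overwhelming probability
```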

@jnothman
Member

jnothman commented Jun 6, 2020

The initial failures here are in linting, not test failures:

sklearn/tree/_classes.py:52:1: E303 too many blank lines (3)
def _get_n_samples_bootstrap(n_samples, max_samples):
^
sklearn/tree/_classes.py:418:26: W291 trailing whitespace
        proportion = 0.99 
                         ^
sklearn/tree/_classes.py:420:80: E501 line too long (80 > 79 characters)
        ind = _generate_sample_indices(self.random_state, X.shape[0], boot_size)
                                                                               ^
sklearn/tree/_classes.py:422:80: E501 line too long (87 > 79 characters)
            builder.build(self.tree_, X[ind], y[ind], sample_weight, X_idx_sorted[ind])
                                                                               ^
sklearn/tree/_classes.py:424:80: E501 line too long (82 > 79 characters)
            builder.build(self.tree_, X[ind], y[ind], sample_weight, X_idx_sorted)

@jnothman
Member

jnothman commented Jun 6, 2020

But yes, the other test failures seem to imply that your code is producing different models, and that deserves investigating. Maybe choose one of those failing tests and run it while adding debug output to guide you to the point of failure.

@jnothman
Member

jnothman commented Jun 6, 2020 via email

@ab-anssi

@jnothman Thanks for your answer about stratification. Indeed, the current implementation always bootstraps the data at each node of the trees, without any stratification, just like the bootstrapping at the tree level. Should we stratify the bootstrap at the node level? Should we also perform stratified bootstrapping at the tree level?

@JosephTLucas JosephTLucas changed the title [WIP] Issue #11993 Prototyped Solution, Need recommendations [WIP] Issue #11993 - add node_bootstrap param to RandomForest Jun 10, 2020
@reshamas
Member

@cmarmo @thomasjpfan
Are you able to assist with this PR?

@cmarmo
Contributor

cmarmo commented Jun 18, 2020

Hi @reshamas, @JosephTLucas, I can't reproduce the test failure, as it happens on Linux 32-bit systems. I'm not able to comment on the algorithm, but FYI, if the failure really is a matter of floating-point precision on different architectures, there is a way to skip tests on 32-bit systems by adding the

@skip_if_32bit

marker just before the failing test.
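For illustration, here is roughly what the 32-bit check behind such a marker boils down to, with a hypothetical stand-in decorator (the real marker is `skip_if_32bit` in scikit-learn's testing utilities and uses pytest's skip machinery rather than silently returning):

```python
import struct

# pointer size in bits: 32 on 32-bit builds, 64 on 64-bit builds
IS_32BIT = struct.calcsize("P") * 8 == 32

def skip_if_32bit(test_func):
    # hypothetical stand-in for the scikit-learn marker: on a 32-bit
    # build the wrapped test is skipped instead of executed
    def wrapper(*args, **kwargs):
        if IS_32BIT:
            return None  # the real marker calls pytest.skip(...) here
        return test_func(*args, **kwargs)
    return wrapper

@skip_if_32bit
def test_precision_sensitive():
    # the kind of assertion that may only fail on 32-bit architectures
    assert abs((0.1 + 0.2) - 0.3) < 1e-12
```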

JosephTLucas and others added 9 commits June 20, 2020 17:22
- to classes RandomForestClassifier (and upper classes) and DecisionTree (and upper classes)
- Replicated bootstrap attributes from DecisionTreeClassifier to DecisionTreeRegressor
- Wrote unit test to ensure that trees trained on the full dataset are more accurate than bootstrapped trees
- Thinking about ways to test the training time (the real motivation behind providing this functionality)
- _get_n_samples() and _generate_sample_indices() previously resided in ensemble/_forest.py, leading to a circular import
- Moved them both to utils/__init__.py so they could be referenced by ensemble/_forest and tree/_class
- Ensured all unit tests still pass
- Unit test now compares feature importance between full and bootstrapped trees to ensure similarity in tree construction.
@ab-anssi

I have added @skip_if_32bit to the test, and now the CI is ok. Thanks @cmarmo.

@amueller
Member

I'm not sure if I follow entirely. Right now you're doing tree-level bootstrapping, right? To do node-level bootstrapping, you'd have to change the Cython code, I assume?
I feel like it might be worth doing tree-level bootstrapping with fewer samples first as an option, using bootstrap=float as @jnothman suggested; I think node-level bootstrapping will be much harder.

@amueller
Member

btw, I think #13227 might be somewhat related.

@ab-anssi

@amueller We intend to do bootstrapping at the node level. I think that bootstrapping at the tree level has already been implemented and merged (#14682). The argument bootstrap is still a boolean, and the sampling rate is provided by the argument max_samples.
We thought we could implement bootstrap at the node level without changing the Cython code.
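For comparison, the tree-level subsampling from #14682 is already usable through max_samples; a minimal sketch on synthetic data (assuming scikit-learn >= 0.22; dataset and parameter values are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # easy synthetic target

# bootstrap stays a boolean; max_samples sets the per-tree sample size
clf = RandomForestClassifier(
    n_estimators=10, bootstrap=True, max_samples=0.5, random_state=0
)
clf.fit(X, y)
acc = clf.score(X, y)
print(acc)
```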

@amueller
Member

@ab-anssi oh thanks, I didn't remember #14682.
Can you say how you would do the node-level bootstrapping without changing the Cython code? I don't really see how.

@ab-anssi

Our current implementation modifies the fit function of the BaseDecisionTree class. But, now that I look at it again, it does not seem to be the right place (there is no recursive call here). I am going to think about a better (right!) way to implement this feature.

@reshamas
Member

@JosephTLucas @ab-anssi
Are you still working on this PR?

@JosephTLucas
Contributor Author

> @JosephTLucas @ab-anssi
> Are you still working on this PR?

No, closed it.
