
[WIP] Issue #11993 - add node_bootstrap param to RandomForest #17504

Closed
wants to merge 10 commits into from

Conversation

JosephTLucas
Contributor

@JosephTLucas JosephTLucas commented Jun 6, 2020

#DataUmbrella
cc: @ab-anssi

Reference Issues/PRs

Issue #11993

What does this implement/fix? Explain your changes.

@ab-anssi and I worked to implement node-level bootstrapping, but after implementing a prototype/proof of concept example, we did not get the results we expected. Do you have any advice on how/where this feature should be implemented?
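To make the intended feature concrete, here is a toy, purely illustrative sketch of node-level bootstrapping (this is not the PR's code; the function name `fit_node` and the trivial split rule are our own stand-ins): at each node, the rows that reached it are re-sampled with replacement before the split is chosen.

```python
import numpy as np

def fit_node(X, y, rng, depth=0, max_depth=3, boot_size=None):
    # Toy recursive fit: at *each node*, draw a bootstrap sample of the
    # rows that reached it, then search for a split on that sample only.
    n = X.shape[0]
    if depth == max_depth or n < 2 or len(np.unique(y)) < 2:
        return {"leaf": True, "value": float(np.mean(y))}
    size = n if boot_size is None else min(boot_size, n)
    ind = rng.randint(0, n, size)            # node-level bootstrap draw
    Xb = X[ind]
    feat = 0                                 # trivial stand-in "split search"
    thr = float(np.median(Xb[:, feat]))
    left = X[:, feat] <= thr
    if left.all() or (~left).all():
        return {"leaf": True, "value": float(np.mean(y))}
    return {
        "leaf": False, "feature": feat, "threshold": thr,
        "left": fit_node(X[left], y[left], rng, depth + 1, max_depth, boot_size),
        "right": fit_node(X[~left], y[~left], rng, depth + 1, max_depth, boot_size),
    }

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = (X[:, 0] > 0.5).astype(float)
tree = fit_node(X, y, rng, boot_size=30)
```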

Any other comments?

@ab-anssi

ab-anssi commented Jun 6, 2020

For the node-level bootstrapping we want to use the same helper functions as for tree-level bootstrapping (_get_n_samples_bootstrap and _generate_sample_indices). We have copy-pasted the code for now because of circular-dependency issues (these functions should live in utils, for example).

@ab-anssi

ab-anssi commented Jun 6, 2020

We do not understand why the implementation we propose does not pass the tests, since we tried not to change the behavior of the code. Even when we set boot_size to X.shape[0], the tests do not pass. Do you have any leads?
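One possible lead, offered as speculation rather than a diagnosis: even with boot_size equal to X.shape[0], a bootstrap draws indices with replacement, so the sample almost surely repeats some rows and drops others, and the fitted tree can legitimately differ from one trained on the untouched data:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100
ind = rng.randint(0, n, n)  # "full-size" bootstrap draw

# with replacement, some rows repeat and others never appear,
# so X[ind] differs from X even though len(ind) == n
n_unique = len(np.unique(ind))
print(n_unique)  # strictly fewer than n with overwhelming probability
```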

@jnothman
Member

jnothman commented Jun 6, 2020

The initial failures here are in linting, not test failures:

sklearn/tree/_classes.py:52:1: E303 too many blank lines (3)
def _get_n_samples_bootstrap(n_samples, max_samples):
^
sklearn/tree/_classes.py:418:26: W291 trailing whitespace
        proportion = 0.99 
                         ^
sklearn/tree/_classes.py:420:80: E501 line too long (80 > 79 characters)
        ind = _generate_sample_indices(self.random_state, X.shape[0], boot_size)
                                                                               ^
sklearn/tree/_classes.py:422:80: E501 line too long (87 > 79 characters)
            builder.build(self.tree_, X[ind], y[ind], sample_weight, X_idx_sorted[ind])
                                                                               ^
sklearn/tree/_classes.py:424:80: E501 line too long (82 > 79 characters)
            builder.build(self.tree_, X[ind], y[ind], sample_weight, X_idx_sorted)

@jnothman
Member

jnothman commented Jun 6, 2020

But yes, the other test failures seem to imply that your code is producing different models, and that deserves investigating. Maybe choose one of those failing tests and run it while adding debug output to guide you to the point of failure.

@jnothman
Member

jnothman commented Jun 6, 2020 via email

@ab-anssi

@jnothman Thanks for your answer about stratification. Indeed, the current implementation always bootstraps the data at each node of the trees, without any stratification, just like the bootstrapping at the tree level. Should we stratify the bootstrap at the node level? Should we also perform stratified bootstrapping at the tree level?

@JosephTLucas JosephTLucas changed the title [WIP] Issue #11993 Prototyped Solution, Need recommendations [WIP] Issue #11993 - add node_bootstrap param to RandomForest Jun 10, 2020
@reshamas
Member

@cmarmo @thomasjpfan
Are you able to assist with this PR?

@cmarmo
Contributor

cmarmo commented Jun 18, 2020

Hi @reshamas, @JosephTLucas, I can't reproduce the test failure, as it happens on Linux 32-bit systems. I'm not able to comment on the algorithm, but FYI, if the failure really is a matter of floating-point precision on different architectures, there is a way to skip tests on 32-bit systems by adding the

@skip_if_32bit

marker just before the failing test.
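For illustration, here is roughly what the 32-bit check behind such a marker boils down to, with a hypothetical stand-in decorator (the real marker is `skip_if_32bit` in scikit-learn's testing utilities and uses pytest's skip machinery rather than silently returning):

```python
import struct

# pointer size in bits: 32 on 32-bit builds, 64 on 64-bit builds
IS_32BIT = struct.calcsize("P") * 8 == 32

def skip_if_32bit(test_func):
    # hypothetical stand-in for the scikit-learn marker: on a 32-bit
    # build the wrapped test is skipped instead of executed
    def wrapper(*args, **kwargs):
        if IS_32BIT:
            return None  # the real marker calls pytest.skip(...) here
        return test_func(*args, **kwargs)
    return wrapper

@skip_if_32bit
def test_precision_sensitive():
    # the kind of assertion that may only fail on 32-bit architectures
    assert abs((0.1 + 0.2) - 0.3) < 1e-12
```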

JosephTLucas and others added 9 commits June 20, 2020 17:22
- to classes RandomForestClassifier (and upper classes) and DecisionTree (and upper classes)
- Replicated bootstrap attributes from DecisionTreeClassifier to DecisionTreeRegressor
- Wrote unit test to ensure that trees trained on the full dataset are more accurate than bootstrapped trees
- Thinking about ways to test the training time (the real motivation behind providing this functionality)
- _get_n_samples() and _generate_sample_indices() previously resided in ensemble/_forest.py, leading to a circular import
- Moved them both to utils/__init__.py so they could be referenced by ensemble/_forest and tree/_class
- Ensured all unit tests still pass
- Unit test now compares feature importance between full and bootstrapped trees to ensure similarity in tree construction.
@ab-anssi

I have added @skip_if_32bit to the test, and now the CI is ok. Thanks @cmarmo.

@amueller
Member

I'm not sure if I follow entirely. Right now you're doing tree-level bootstrapping, right? To do node-level bootstrapping, you'd have to change the Cython code, I assume?
I feel like it might be worth doing tree-level bootstrapping with fewer samples first as an option, using bootstrap=float as @jnothman suggested; I think node-level bootstrapping will be much harder.

@amueller
Member

btw, I think #13227 might be somewhat related.

@ab-anssi

@amueller We intend to do bootstrapping at the node level. I think that bootstrapping at the tree level has already been implemented and merged (#14682). The argument bootstrap is still a boolean, and the sampling rate is provided by the argument max_samples.
We thought we could implement bootstrap at the node level without changing the Cython code.
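For comparison, the tree-level subsampling from #14682 is already usable through max_samples; a minimal sketch on synthetic data (assuming scikit-learn >= 0.22; dataset and parameter values are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # easy synthetic target

# bootstrap stays a boolean; max_samples sets the per-tree sample size
clf = RandomForestClassifier(
    n_estimators=10, bootstrap=True, max_samples=0.5, random_state=0
)
clf.fit(X, y)
acc = clf.score(X, y)
print(acc)
```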

@amueller
Member

@ab-anssi oh thanks, I didn't remember #14682.
Can you say how you would do the node-level bootstrapping without changing the Cython code? I don't really see how.

@ab-anssi

Our current implementation modifies the fit function of the BaseDecisionTree class. But, now that I look at it again, it does not seem to be the right place (there is no recursive call here). I am going to think about a better (right!) way to implement this feature.

@reshamas
Member

@JosephTLucas @ab-anssi
Are you still working on this PR?

@JosephTLucas
Contributor Author

> @JosephTLucas @ab-anssi
> Are you still working on this PR?

No, closed it.
