[WIP] Issue #11993 - add node_bootstrap param to RandomForest #17504
Conversation
For the node-level bootstrapping we want to use the same helper functions as for tree-level bootstrapping (_get_n_samples_bootstrap and _generate_sample_indices). We have copy-pasted the code for now because of circular dependency issues (these functions should be in utils, for example).
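For context, here is a simplified, hypothetical sketch of what these two helpers do. The real private implementations live in sklearn.ensemble._forest; the function names below mirror them, but the bodies are illustrative only, not the actual library code:

```python
import numpy as np

def get_n_samples_bootstrap(n_samples, max_samples):
    # Simplified sketch: how many draws make up one bootstrap sample.
    # max_samples may be None (use all), a float fraction, or an int count.
    if max_samples is None:
        return n_samples
    if isinstance(max_samples, float):
        return max(round(n_samples * max_samples), 1)
    return max_samples

def generate_sample_indices(random_state, n_samples, n_samples_bootstrap):
    # Draw indices uniformly with replacement: one bootstrap sample.
    rng = np.random.RandomState(random_state)
    return rng.randint(0, n_samples, n_samples_bootstrap)

indices = generate_sample_indices(0, 100, get_n_samples_bootstrap(100, 0.5))
```

Because these helpers only need numpy and a seedable RNG, moving them into a utility module (as suggested above) would let both the forest and tree code import them without a circular dependency.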
We do not understand why the implementation we propose does not pass the tests, since we tried not to change the behavior of the code. Even when we set …
The initial failures here are in linting, not test failures.
But yes, the other failing tests seem to imply that your code is producing different models, and that deserves investigation. Maybe choose one of those failing tests and run it while adding debug output to guide you to the point of failure.
Sorry, that was maybe a silly comment before looking at your code. It looks like your code will always bootstrap, without stratification, which can distinctly change the distribution of samples input to the tree-building process. Why don't you put in some code to investigate what the distribution of y looks like that then results in test failure?
@jnothman Thanks for your answer about stratification. Indeed, the current implementation always bootstraps the data at each node of the trees, without any stratification, just like the bootstrapping at the tree level. Should we stratify the bootstrap at the node level? Should we also perform stratified bootstrapping at the tree level?
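For reference, a stratified bootstrap could be sketched as follows. This is a hypothetical numpy-only illustration (stratified_bootstrap_indices is an invented name, not a scikit-learn API): resampling with replacement *within* each class keeps the class proportions of y exactly, which avoids the distribution shift @jnothman describes.

```python
import numpy as np

def stratified_bootstrap_indices(y, rng):
    # Resample with replacement within each class, so the class
    # counts of the bootstrap sample match those of y exactly.
    y = np.asarray(y)
    out = []
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)
        out.append(rng.choice(cls_idx, size=len(cls_idx), replace=True))
    return np.concatenate(out)

rng = np.random.RandomState(42)
y = np.array([0] * 6 + [1] * 4)
idx = stratified_bootstrap_indices(y, rng)
```

An unstratified bootstrap, by contrast, only preserves class proportions in expectation; for the small sample sizes seen at deep tree nodes, the realized proportions can drift substantially.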
@cmarmo @thomasjpfan
Hi @reshamas, @JosephTLucas, I can't reproduce the test failure, as it happens on linux 32-bit systems. I'm not able to comment on the algorithm, but FYI, if the failure is really a problem of computing precision on different architectures, there is a way to skip tests on 32-bit systems by adding the marker just before the failing test.
- Added the node_bootstrap parameter to RandomForestClassifier (and its parent classes) and DecisionTree (and its parent classes)
- Replicated bootstrap attributes from DecisionTreeClassifier to DecisionTreeRegressor
- Wrote a unit test to ensure that trees trained on the full dataset are more accurate than bootstrapped trees
- Thinking about ways to test the training time (the real motivation behind providing this functionality)
- _get_n_samples() and _generate_sample_indices() previously resided in ensemble/_forest.py, leading to a circular dependency; moved them both to utils/__init__.py so they can be referenced by both ensemble/_forest and tree/_class
- Ensured all unit tests still pass
- The unit test now compares feature importances between the full and bootstrapped trees to ensure similarity in tree construction.
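A similarity check of that kind might be sketched as follows. This is a hypothetical helper (importances_similar and the tolerance are invented for illustration); the real test would compare the feature_importances_ attributes of the two fitted trees:

```python
import numpy as np

def importances_similar(imp_full, imp_boot, atol=0.1):
    # Normalize both importance vectors, then check that they are
    # close elementwise within an absolute tolerance.
    imp_full = np.asarray(imp_full, dtype=float)
    imp_boot = np.asarray(imp_boot, dtype=float)
    imp_full = imp_full / imp_full.sum()
    imp_boot = imp_boot / imp_boot.sum()
    return bool(np.allclose(imp_full, imp_boot, atol=atol))
```

Comparing normalized importances rather than raw ones makes the test robust to differences in total impurity reduction between the full and bootstrapped trees.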
I have added @skip_if_32bit to the test, and now the CI is ok. Thanks @cmarmo.
I'm not sure if I follow entirely. Right now you're doing tree-level bootstrapping, right? To do node-level bootstrapping, you'd have to change the Cython code, I assume?
btw, I think #13227 might be somewhat related.
@amueller We intend to do bootstrapping at the node level. I think that bootstrapping at the tree level has already been implemented and merged (#14682). The argument …
Our current implementation modifies the …
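The intended node-level behavior could be sketched as a toy illustration like the one below. Everything here is hypothetical (this is not the scikit-learn Cython splitter; split_node and the median threshold are invented): the key idea is that each node resamples *its own* samples with replacement before a split is evaluated, rather than resampling once per tree.

```python
import numpy as np

def split_node(X_col, indices, rng, node_bootstrap=True):
    # Toy sketch of node-level bootstrapping: resample this node's
    # samples with replacement before choosing a split threshold.
    if node_bootstrap:
        indices = rng.choice(indices, size=len(indices), replace=True)
    threshold = np.median(X_col[indices])  # toy split criterion
    left = indices[X_col[indices] <= threshold]
    right = indices[X_col[indices] > threshold]
    return threshold, left, right

rng = np.random.RandomState(0)
X_col = np.arange(10, dtype=float)
indices = np.arange(10)
threshold, left, right = split_node(X_col, indices, rng)
```

Because the resampling happens per node, doing this in the real library would indeed require changes inside the tree-building (Cython) code, as @amueller notes above.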
@JosephTLucas @ab-anssi
No, closed it. |
#DataUmbrella
cc: @ab-anssi
Reference Issues/PRs
Issue #11993
What does this implement/fix? Explain your changes.
@ab-anssi and I worked to implement node-level bootstrapping, but after implementing a prototype/proof of concept example, we did not get the results we expected. Do you have any advice on how/where this feature should be implemented?
Any other comments?