Feature request: Random Forest node splitting sample size #11993


Open
parrt opened this issue Sep 4, 2018 · 13 comments
Labels: Enhancement, Moderate (Anything that requires some knowledge of conventions and best practices), module:ensemble

Comments

parrt commented Sep 4, 2018

During the development of a model, the faster one can iterate, the better. That means getting your training time down to a manageable level, albeit at the cost of some accuracy. With large training sets, training time for random forests can be an impediment to rapid iteration. @jph00 showed me this awesome trick that convinces scikit-learn's RF implementation to use a subsample of the full training data set when selecting split variables and values at each node:

from sklearn.ensemble import forest  # private module in older scikit-learn versions

def set_RF_sample_size(n):
    # Monkey-patch the helper that draws each tree's bootstrap indices so that
    # every tree is built from only n rows instead of the full training set.
    forest._generate_sample_indices = \
        (lambda rs, n_samples: forest.check_random_state(rs).randint(0, n_samples, n))

def reset_RF_sample_size():
    # Restore the default behaviour: one bootstrap index per training row.
    forest._generate_sample_indices = \
        (lambda rs, n_samples: forest.check_random_state(rs).randint(0, n_samples, n_samples))
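
A minimal usage sketch of the trick, assuming an older scikit-learn release where sklearn.ensemble.forest still exposes _generate_sample_indices (the synthetic data here is only for illustration):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400_000, n_features=20, random_state=0)

set_RF_sample_size(20_000)    # each tree now bootstraps only 20k rows
rf = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X, y)

reset_RF_sample_size()        # restore the default for the final model
final_rf = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X, y)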

We both find that splitting using, say, 20,000 samples rather than the full 400,000 gives nearly the same results with much faster training time. Naturally, it takes a bit of experimentation on each data set to find a suitable subsample size, but in general we find that feature engineering and other preparation work is totally fine using a subsample. We then reset the sample size for the final model.

We appreciate your consideration of this feature request!

jnothman (Member) commented Sep 4, 2018

I'm surprised there is no existing issue for this... I would consider supporting it as bootstrap=float to indicate the fraction of the sample size used to train each tree.

parrt (Author) commented Sep 4, 2018

Heh, that sounds pretty good to me; it might be better than an absolute number that I have to fiddle with on different-sized data sets. Maybe "overloading" bootstrap from bool to float isn't quite the best way, though, as it might be confused with the overall per-tree bootstrapping. node_bootstrap? bootstrap_node? (edit: ah, you said per tree and we were thinking per node, but I guess we only really care about reducing the size of big nodes near the root, so maybe bootstrap=float is OK.)

Implementation note: hopefully this parameter would be per RF model object... I think Jeremy's trick currently sets a global function, sklearn.ensemble.forest._generate_sample_indices(), but out of necessity not desire.

jnothman (Member) commented Sep 4, 2018 via email

jph00 commented Sep 4, 2018 via email

jnothman (Member) commented Sep 4, 2018 via email

jph00 commented Sep 4, 2018

I don't think you can both use weights for sampling and use them as weights in the info gain function. I believe sklearn currently does the latter, right? So it would make sense to continue doing only that.

MDouriez (Contributor) commented Nov 2, 2019

The title indicates that the sample-size request is at the node level, but the comments lean towards the tree level.
If it's at the tree level, it seems that #14682 resolved the issue, no? It introduced a max_samples parameter: "If bootstrap is True, the number of samples to draw from X to train each base estimator."
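
For reference, a short sketch of that tree-level option, assuming scikit-learn >= 0.22 where max_samples is available (the synthetic data here is only for illustration):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400_000, n_features=20, random_state=0)

# max_samples takes an int (absolute row count) or a float (fraction of X);
# with bootstrap=True, each tree is then drawn from that many rows.
rf = RandomForestRegressor(n_estimators=100, bootstrap=True,
                           max_samples=20_000, n_jobs=-1, random_state=0)
rf.fit(X, y)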

jph00 commented Nov 2, 2019

Yes, thanks @MDouriez, that does resolve the issue at the tree level! A node-level option would also be very helpful though... :)

loldja (Contributor) commented Jun 6, 2020

@jph00 Can you clarify how node-level would be useful? Is there literature on this you can point to?

jph00 commented Jun 11, 2020

Sorry I don't have any literature I can point to - just my own research over the last few years, which I never got around to writing up, unfortunately.

reshamas (Member) commented

@cmarmo
Do you think this issue is beginner friendly? Is it doable for someone at one of our sprints?

cmarmo (Contributor) commented Oct 20, 2020

"Do you think this issue is beginner friendly? Is it doable for someone at one of our sprints?"

@reshamas, checking the pull requests that tried to solve it, I'm under the impression that this issue does not have a straightforward solution. It will perhaps require some experience to take it on.

reshamas added the Moderate label on Oct 20, 2020
reshamas (Member) commented

@cmarmo
Ok, I labeled it moderate, so it is not picked up by beginners at sprints.
