Feature request: Random Forest node splitting sample size #11993
Comments
I'm surprised there is no existing issue for this... I would consider supporting it as `bootstrap=float` to indicate the fraction of the sample size used to train each tree.
Heh, that sounds pretty good to me; it might be better than an absolute number that I have to fiddle with on differently sized data sets. Maybe "overloading" `bootstrap` with a float is the way to go. Implementation note: hopefully this parameter would be per RF model object... I think Jeremy's trick currently sets a global function, so it affects every forest in the process.
Yes, it would be a parameter of the RF estimators, as `bootstrap` is currently.
Perfect!
One implementation detail that we need to consider is whether sample weights should be considered when sampling. I would argue no, simply on the basis that they are not considered in the current bootstrap.
I don't think you can both use the weights for sampling and use them as weights in the information gain function. I believe scikit-learn currently does the latter, right? So it would make sense to continue doing only that.
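To make the distinction concrete, here is a small sketch in plain numpy (not scikit-learn internals) contrasting the current unweighted bootstrap with the weighted sampling being argued against; the array names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8
sample_weight = np.array([4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Current behaviour: bootstrap indices are drawn uniformly at random,
# ignoring sample weights entirely.
unweighted_idx = rng.integers(0, n_samples, size=n_samples)

# The alternative being argued against: heavier rows would be drawn
# proportionally more often when building each tree's sample.
weighted_idx = rng.choice(n_samples, size=n_samples,
                          p=sample_weight / sample_weight.sum())
```

As far as I understand, passing `sample_weight` to `fit` instead influences the impurity computation at each split, which is the "latter" behaviour referred to above.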
The title indicates that the requested sample size is at the node level, but the comments lean towards the tree level.
Yes, thanks @MDouriez, that does resolve the issue at the tree level! A node-level option would also be very helpful though... :)
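For readers landing here later: the tree-level resolution referred to above is presumably the `max_samples` parameter added to the forest estimators in scikit-learn 0.22. A minimal usage sketch, assuming that release or later:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400_000, n_features=20, random_state=0)

# Each tree is grown on a bootstrap sample of 20,000 rows instead of all 400,000.
rf = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,
    max_samples=20_000,  # or a float fraction of the rows, e.g. max_samples=0.05
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)
```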
@jph00 Can you clarify how a node-level option would be useful? Is there literature on this you can point to?
Sorry, I don't have any literature I can point to, just my own research over the last few years, which I never got around to writing up, unfortunately.
@cmarmo
@reshamas, checking the pull requests that tried to solve it, I'm under the impression that this issue does not have a straightforward solution. It will probably take someone with some experience to take it over.
@cmarmo
During the development of a model, the faster one can iterate, the better. That means getting your training time down to a manageable level, albeit at the cost of some accuracy. With large training sets, training time for random forests can be an impediment to rapid iteration. @jph00 showed me this awesome trick that convinces scikit-learn's RF implementation to use a subsample of the full training data set when selecting split variables and values at each node:
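Below is a sketch of the kind of monkey-patch being described (not the original snippet), assuming an older scikit-learn (0.21 or earlier) in which `sklearn.ensemble.forest._generate_sample_indices(random_state, n_samples)` is the private helper that draws each tree's bootstrap indices; the helper names `set_rf_samples` and `reset_rf_samples` are illustrative:

```python
from sklearn.ensemble import forest  # module layout in scikit-learn <= 0.21


def set_rf_samples(n):
    """Patch the forests so every tree is built on a random sample of n rows
    instead of a bootstrap sample the size of the full training set."""
    forest._generate_sample_indices = (
        lambda random_state, n_samples:
            forest.check_random_state(random_state).randint(0, n_samples, n)
    )


def reset_rf_samples():
    """Undo set_rf_samples and restore the default full-size bootstrap."""
    forest._generate_sample_indices = (
        lambda random_state, n_samples:
            forest.check_random_state(random_state).randint(0, n_samples, n_samples)
    )


# Typical workflow while iterating on a large data set:
# set_rf_samples(20_000)   # fast experimentation, 20k rows per tree
# ...fit and evaluate RandomForestRegressor / RandomForestClassifier as usual...
# reset_rf_samples()       # back to the full-size bootstrap for the final model
```

Because this replaces a module-level function, it affects every forest built in the process, which is the global-state caveat mentioned earlier in the thread.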
We both find that splitting using, say, 20,000 samples rather than the full 400,000 gives nearly the same results, but with much faster training. Naturally it requires a bit of experimentation on each data set to find a suitable subsample size, but in general we find that feature engineering and other preparation work is totally fine on a subsample. We then reset the sample size for the final model.
We appreciate your consideration of this feature request!