
Feature request: Random Forest node splitting sample size #11993

Open
@parrt

Description


During the development of a model, the faster one can iterate, the better. That means getting training time down to a manageable level, albeit at the cost of some accuracy. With large training sets, random forest training time can be an impediment to rapid iteration. @jph00 showed me this awesome trick that convinces scikit-learn's RF implementation to use a subsample of the full training data set when selecting split variables and values at each node:

from sklearn.ensemble import forest

def set_RF_sample_size(n):
    # Patch the private helper that draws each tree's bootstrap sample so
    # every tree is built from only n randomly chosen rows.
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))

def reset_RF_sample_size():
    # Restore the default behavior: bootstrap samples of the full training set size.
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))

We both find that building trees from, say, 20,000 samples rather than the full 400,000 gives nearly the same results with much faster training time. Naturally, it takes a bit of experimentation on each data set to find a suitable subsample size, but in general we find that feature engineering and other preparation work is perfectly fine on a subsample. We then reset the sample size to train the final model; a sketch of this workflow follows below.
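For reference, a minimal sketch of the workflow we use, assuming the helper functions above; X, y, and the estimator settings are placeholders, not part of this request:

    from sklearn.ensemble import RandomForestRegressor

    # During iteration: build each tree from a 20,000-row subsample
    # (X and y are hypothetical training arrays).
    set_RF_sample_size(20_000)
    rf = RandomForestRegressor(n_estimators=40, n_jobs=-1)
    rf.fit(X, y)

    # For the final model: restore full-size bootstrap samples and refit.
    reset_RF_sample_size()
    final_rf = RandomForestRegressor(n_estimators=200, n_jobs=-1)
    final_rf.fit(X, y)

A supported parameter for this (rather than monkey-patching a private helper) is what we are asking for here.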

We appreciate your consideration of this feature request!

Metadata

Labels: Enhancement, Moderate (anything that requires some knowledge of conventions and best practices), module:ensemble
