
[MRG+1] EHN Add bootstrap sample size limit to forest ensembles #14682


Merged: 31 commits, Sep 20, 2019

Conversation

notmatthancock (Contributor)

Adds a max_samples kwarg to forest ensembles that limits the size of the bootstrap samples used to train each estimator. This PR is intended to supersede two previous stalled PRs (#5963 and #9645), which had not been touched for a couple of years. This PR also adds unit tests for the new functionality.

Why is this useful?

When training a random forest classifier on a large dataset with correlated/redundant examples, training on the entire dataset is memory intensive and unnecessary. Further, it results in models that occupy an unwieldy amount of memory. As an example, consider training on image patches for a segmentation-as-classification problem. It would be useful in this situation to train only on a subset of the available image patches, because an image patch obtained at one location is expected to be highly correlated with the patch obtained one pixel over. Limiting the size of the bootstrap sample used to train each estimator is useful in this and similar applications.

Pickled model disk space comparison

Here's a simple test script to show the difference between occupied disk space of the pickled model, using the full dataset size for each bootstrap vs. using just a bootstrap size of 1 (obviously this is dramatic):

Output:

File size `max_samples=None`: 0.154GB
File size `max_samples=1`: 5.286e-05GB

Script:
import os
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier


rs = np.random.RandomState(1234)
X = rs.randn(100000, 1000)
y = rs.randn(X.shape[0]) > 0

# Default: each tree trains on a bootstrap of all 100000 samples.
rfc = RandomForestClassifier(
    n_estimators=100, random_state=rs)
rfc.fit(X, y)
with open('rfc.pkl', 'wb') as f:
    pickle.dump(rfc, f)
size = os.stat('rfc.pkl').st_size / 1e9
print("File size `max_samples=None`: {}GB".format(size))

# Limit each bootstrap to a single sample (deliberately extreme).
rfc = RandomForestClassifier(
    n_estimators=100, random_state=rs, max_samples=1)
rfc.fit(X, y)
with open('rfc.pkl', 'wb') as f:
    pickle.dump(rfc, f)
size = os.stat('rfc.pkl').st_size / 1e9
print("File size `max_samples=1`: {}GB".format(size))

@glemaitre (Member) left a comment

You should also add the parameter to RandomForestRegressor, shouldn't you?

@glemaitre (Member) left a comment

Additional comments regarding the tests.

@@ -1330,3 +1330,93 @@ def test_forest_degenerate_feature_importances():
    gbr = RandomForestRegressor(n_estimators=10).fit(X, y)
    assert_array_equal(gbr.feature_importances_,
                       np.zeros(10, dtype=np.float64))


def test__get_n_bootstrap_samples():
Member:

We usually don't test a private function. Instead, we would test through the different estimators (RandomForest, ExtraTrees).

Member:

You can easily parametrize this case and check it. You can make a test function where you check that errors are raised properly (you can parametrize that as well).

The other behavior should be tested by fitting the estimator and checking that we fitted on a subset of the data.
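
For instance, a minimal sketch of such a parametrized error test (the invalid values and the caught exception types are illustrative):

```python
import pytest
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Invalid values: negative, zero, a float above 1, and a wrong type.
@pytest.mark.parametrize(
    "Forest", [RandomForestClassifier, ExtraTreesClassifier])
@pytest.mark.parametrize("max_samples", [-1, 0, 2.0, "str"])
def test_max_samples_invalid(Forest, max_samples):
    est = Forest(n_estimators=2, bootstrap=True, max_samples=max_samples)
    with pytest.raises((ValueError, TypeError)):
        est.fit(X, y)
```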

Contributor Author:

> check that we fitted on a subset of data.

where do you get that info?

As far as I can see, the sample indices for each bootstrap are not stored anywhere, but translated into sample weights and passed to the estimator.
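
Roughly, this is the mechanism (a paraphrase of sklearn/ensemble/forest.py from that era, with simplified names, not the exact source):

```python
import numpy as np

def bootstrap_as_sample_weight(seed, n_samples, n_samples_bootstrap):
    # Draw bootstrap indices, then encode them as per-sample draw counts.
    # The full X is passed to every tree; a sample drawn k times gets its
    # weight multiplied by k, and an undrawn sample gets weight 0.
    rng = np.random.RandomState(seed)
    indices = rng.randint(0, n_samples, n_samples_bootstrap)
    return np.bincount(indices, minlength=n_samples).astype(np.float64)
```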

Contributor Author:

(exception check unit tests refactored in: 081b7b7)

Member:

> where do you get that info?

Yep, I thought that we had a private function to get back the indices, but it does not look so simple to do without some hacks. I would need a bit more time to think about how to test this.

  unsampled_indices = _generate_unsampled_indices(
-     estimator.random_state, n_samples)
+     estimator.random_state, n_samples, n_bootstrap_samples)
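
(For context, a sketch of what _generate_unsampled_indices computes after this change; paraphrased, not the exact source. The OOB indices are the complement of the drawn bootstrap indices:)

```python
import numpy as np

def generate_unsampled_indices(seed, n_samples, n_bootstrap_samples):
    rng = np.random.RandomState(seed)
    sample_indices = rng.randint(0, n_samples, n_bootstrap_samples)
    sample_counts = np.bincount(sample_indices, minlength=n_samples)
    # Any sample never drawn into the bootstrap is out-of-bag.
    return np.arange(n_samples)[sample_counts == 0]
```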
Member:

@jnothman here we will test on the left-out samples. Since we used max_samples, we will use a lot of samples for the OOB. Is this something that we want?

If this is the case, we should add a test to make sure that we have the right complement in case max_samples is not None.
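
(For a sense of scale: a given sample is left out of a bootstrap of size m drawn from n samples with probability (1 - 1/n)^m ≈ exp(-m/n). For the classic m = n that is about 36.8%, but with max_samples=1000 on a dataset of 10^6 samples it is about 99.9%, i.e. nearly the whole dataset is OOB for every tree.)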

Member:

Is there a problem with having a large OOB sample for test? Testing with trees isn't fast, but is a lot faster than training...?

Member:

> Testing with trees isn't fast, but is a lot faster than training...?

True. Should not be an issue then

Member:

If the sample is much smaller than the dataset, I suppose it may be an issue. Maybe we should make a rule that the OOB sample is constrained to be no larger than the training set... But that may confuse users trying different values for this parameter.

Contributor Author:

> If the sample is much smaller than the dataset

I can see this situation easily arising, e.g., your dataset has 10^6 examples and you want to fit with, say, max_samples=1000 and n_estimators=1000.

@glemaitre (Member) left a comment

A couple of additional comments

@Shihab-Shahriar:

Hello, does this actually limit the sample size given to each base tree? I noticed a few days back that _parallel_build_trees doesn't actually subsample training instances. To inject randomness, it uses the bootstrap as a way of re-weighting the sample_weight of instances. In other words, the whole dataset is passed to each base tree, but with different sample_weight.

I tried putting a print(len(y)) just before the tree.fit call when bootstrap=True in your code, and it confirmed my suspicion. Please let me know if there's any error in my reasoning.

@glemaitre (Member):

> Hello, does this actually limit the sample size given to each base tree? I noticed a few days back that _parallel_build_trees doesn't actually subsample training instances. To inject randomness, it uses the bootstrap as a way of re-weighting the sample_weight of instances. In other words, the whole dataset is passed to each base tree, but with different sample_weight.

That is the definition of a bootstrap sample: sampling with replacement with n_samples = X.shape[0].
Now, we also allow sampling with replacement with n_samples < X.shape[0].
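
Concretely (a toy sketch, illustrative sizes):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 8
# Classic bootstrap: n draws with replacement (duplicates expected).
classic = rng.randint(0, n, size=n)
# With max_samples=3: still sampling with replacement, just fewer draws.
limited = rng.randint(0, n, size=3)
```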

@glemaitre glemaitre self-requested a review September 9, 2019 11:34
@glemaitre (Member):

> @glemaitre, Sorry about that particular choice of word. I realize this may have sounded a lot different than I originally intended.

It is just to be certain that we don't document something wrong :)

@glemaitre glemaitre self-requested a review September 13, 2019 08:48
@glemaitre (Member) left a comment

Apart from the small pytest.raises comment, LGTM.

@glemaitre (Member):

FYI: codecov seems to be wrong. All the diff code is covered.

@glemaitre glemaitre changed the title Add bootstrap sample size limit to forest ensembles [MRG+1] EHN Add bootstrap sample size limit to forest ensembles Sep 13, 2019
@glemaitre (Member):

Thanks @notmatthancock, let's wait for a second review before merging.

Maybe @jnothman @adrinjalali could have a look

@jnothman (Member) left a comment

I'm good with the implementation. I'm wondering if it needs mentioning in the documentation for bootstrap... This is no longer quite a bootstrap, but I think former proposals reused that parameter name.

If we are happy with a new parameter, this LGTM.

@adrinjalali (Member) left a comment

Otherwise LGTM, thanks @notmatthancock

if max_samples is None:
    return n_samples

if isinstance(max_samples, numbers.Integral):
Member:

I think we should be consistent w.r.t how we treat these fractions. For instance, in optics, we have:

def _validate_size(size, n_samples, param_name):
    if size <= 0 or (size != int(size) and size > 1):
        raise ValueError('%s must be a positive integer '
                         'or a float between 0 and 1. Got %r' %
                         (param_name, size))
    elif size > n_samples:
        raise ValueError('%s must be no greater than the'
                         ' number of samples (%d). Got %d' %
                         (param_name, n_samples, size))

And then 1 always means 100% of the data, at least in optics. Do we have a similar semantics in other places?

Member:

With PCA, n_components=1 means 1 component while n_components<1 will be a percentage.
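
For instance:

```python
from sklearn.decomposition import PCA

PCA(n_components=1)     # keep exactly one component
PCA(n_components=0.95)  # keep enough components for 95% of the variance
```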

Member:

Excluding 1 from the float range avoids issues with float comparison as well.

to train each base estimator.
- If None (default), then draw `X.shape[0]` samples.
- If int, then draw `max_samples` samples.
- If float, then draw `max_samples * X.shape[0]` samples.
Member:

I'm okay with the behavior of the float, but we need to document that here, i.e. explicitly say that if it is a float, it must be in (0, 1).
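
A sketch of validation consistent with that documentation (the helper name follows the test name above; the error messages are illustrative, not the exact merged code):

```python
import numbers

def _get_n_bootstrap_samples(n_samples, max_samples):
    if max_samples is None:
        return n_samples
    # Check Integral before Real: every int is also a numbers.Real.
    if isinstance(max_samples, numbers.Integral):
        if not 1 <= max_samples <= n_samples:
            raise ValueError("`max_samples` must be in range 1 to {} but "
                             "got value {}".format(n_samples, max_samples))
        return max_samples
    if isinstance(max_samples, numbers.Real):
        if not 0 < max_samples < 1:
            raise ValueError("`max_samples` must be in range (0, 1) but "
                             "got value {}".format(max_samples))
        return int(round(n_samples * max_samples))
    raise TypeError("`max_samples` should be int or float, but got type "
                    "'{}'".format(type(max_samples)))
```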

@glemaitre glemaitre merged commit 746efb5 into scikit-learn:master Sep 20, 2019
@glemaitre (Member):

@notmatthancock I made the small change requested by @adrinjalali and merged.
Thanks a lot for your contribution.

@notmatthancock (Contributor Author):

Likewise @glemaitre, thanks to you and @jnothman for the helpful review comments.

sebp added a commit to sebp/scikit-survival that referenced this pull request Apr 10, 2020

- Deprecate presort (scikit-learn/scikit-learn#14907)
- Add Minimal Cost-Complexity Pruning to Decision Trees (scikit-learn/scikit-learn#12887)
- Add bootstrap sample size limit to forest ensembles (scikit-learn/scikit-learn#14682)
- Fix deprecated imports (scikit-learn/scikit-learn#9250)

Do not add ccp_alpha to SurvivalTree, because it relies on
node_impurity, which is not set for SurvivalTree.