Integration and test cases for RandomForest subsampling #9645

leandroleal · 2017-08-29T21:17:36Z

This work is based on #5963.
Thanks @DrDub

jnothman · 2017-08-29T23:56:05Z

Thanks. Will review soon!

jnothman · 2017-08-30T03:46:39Z

Doctests are failing because you've not updated the set of parameters shown when a RandomForestRegressor() etc. is printed:

----------------------------------------------------------------------
File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/forest.py", line 972, in sklearn.ensemble.forest.RandomForestClassifier
Failed example:
    clf.fit(X, y)
Expected:
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=2, max_features='auto', max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                oob_score=False, random_state=0, verbose=0, warm_start=False)
Got:
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=2, max_features='auto', max_leaf_nodes=None,
                max_samples=1.0, min_impurity_decrease=0.0,
                min_impurity_split=None, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
                verbose=0, warm_start=False)
>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x7f9fadd15050>.getvalue()))
    
======================================================================
FAIL: Doctest: sklearn.ensemble.forest.RandomForestRegressor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python2.7/doctest.py", line 2226, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.ensemble.forest.RandomForestRegressor
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/forest.py", line 1055, in RandomForestRegressor
----------------------------------------------------------------------
File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/forest.py", line 1219, in sklearn.ensemble.forest.RandomForestRegressor
Failed example:
    regr.fit(X, y)
Expected:
    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
               max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=0, verbose=0, warm_start=False)
Got:
    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
               max_features='auto', max_leaf_nodes=None, max_samples=1.0,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=0, verbose=0, warm_start=False)
>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x7f9fadd09200>.getvalue()))
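
The mismatch follows from how the estimator repr is built: it lists every constructor parameter in alphabetical order, so a new `max_samples` argument appears in the middle and shifts the line wrapping. A minimal sketch of that mechanism, using a hypothetical `ToyForest` class rather than scikit-learn's actual `BaseEstimator` code:

```python
import inspect


class ToyForest:
    """Hypothetical stand-in for an estimator; not scikit-learn code."""

    def __init__(self, bootstrap=True, max_depth=None, max_samples=1.0,
                 n_estimators=10):
        self.bootstrap = bootstrap
        self.max_depth = max_depth
        self.max_samples = max_samples
        self.n_estimators = n_estimators

    def get_params(self):
        # Discover parameter names from the constructor signature,
        # then report them in sorted order.
        names = sorted(inspect.signature(self.__init__).parameters)
        return {name: getattr(self, name) for name in names}

    def __repr__(self):
        args = ", ".join("%s=%r" % item for item in self.get_params().items())
        return "%s(%s)" % (type(self).__name__, args)


print(ToyForest(max_depth=2))
```

Any doctest that prints the estimator would therefore need `max_samples=1.0` added to its expected output, as in the "Got" blocks above.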

Flake8 is also failing:

Running flake8 on the diff in the range 468d6e71a..c5cb044fb (1 commit(s)):
--------------------------------------------------------------------------------
./sklearn/ensemble/forest.py:89:80: E501 line too long (83 > 79 characters)
    sample_indices = _generate_sample_indices(random_state, n_samples, max_samples)
                                                                               ^
./sklearn/ensemble/forest.py:112:80: E501 line too long (117 > 79 characters)
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                               ^
./sklearn/ensemble/forest.py:112:116: E202 whitespace before ')'
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                                                                   ^
./sklearn/ensemble/forest.py:119:72: E231 missing whitespace after ','
        indices = _generate_sample_indices(tree.random_state, n_samples,max_samples)
                                                                       ^
./sklearn/ensemble/forest.py:119:80: E501 line too long (84 > 79 characters)
        indices = _generate_sample_indices(tree.random_state, n_samples,max_samples)
                                                                               ^
./sklearn/ensemble/forest.py:451:80: E501 line too long (117 > 79 characters)
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                               ^
./sklearn/ensemble/forest.py:451:116: E202 whitespace before ')'
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                                                                   ^
./sklearn/ensemble/forest.py:735:80: E501 line too long (117 > 79 characters)
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                               ^
./sklearn/ensemble/forest.py:735:116: E202 whitespace before ')'
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                                                                   ^
./sklearn/ensemble/tests/test_forest.py:202:80: E501 line too long (84 > 79 characters)
    for name, max_samples in product(FOREST_CLASSIFIERS, (1.0, 0.8, 0.5, 0.3, 0.1)):
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:361:1: E302 expected 2 blank lines, found 1
def check_max_samples_equal_0(name):
^
./sklearn/ensemble/tests/test_forest.py:365:1: W293 blank line contains whitespace
^
./sklearn/ensemble/tests/test_forest.py:370:1: F811 redefinition of unused 'check_max_samples_equal_0' from line 361
def check_max_samples_equal_0():
^
./sklearn/ensemble/tests/test_forest.py:374:1: E302 expected 2 blank lines, found 1
def check_oob_score(name, X, y, n_estimators=20, max_samples = 1.0):
^
./sklearn/ensemble/tests/test_forest.py:374:61: E251 unexpected spaces around keyword / parameter equals
def check_oob_score(name, X, y, n_estimators=20, max_samples = 1.0):
                                                            ^
./sklearn/ensemble/tests/test_forest.py:374:63: E251 unexpected spaces around keyword / parameter equals
def check_oob_score(name, X, y, n_estimators=20, max_samples = 1.0):
                                                              ^
./sklearn/ensemble/tests/test_forest.py:380:80: E501 line too long (103 > 79 characters)
                                  n_estimators=n_estimators, bootstrap=True, max_samples = max_samples)
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:380:89: E251 unexpected spaces around keyword / parameter equals
                                  n_estimators=n_estimators, bootstrap=True, max_samples = max_samples)
                                                                                        ^
./sklearn/ensemble/tests/test_forest.py:380:91: E251 unexpected spaces around keyword / parameter equals
                                  n_estimators=n_estimators, bootstrap=True, max_samples = max_samples)
                                                                                          ^
./sklearn/ensemble/tests/test_forest.py:394:80: E501 line too long (96 > 79 characters)
                                      n_estimators=1, bootstrap=True, max_samples = max_samples)
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:394:82: E251 unexpected spaces around keyword / parameter equals
                                      n_estimators=1, bootstrap=True, max_samples = max_samples)
                                                                                 ^
./sklearn/ensemble/tests/test_forest.py:394:84: E251 unexpected spaces around keyword / parameter equals
                                      n_estimators=1, bootstrap=True, max_samples = max_samples)
                                                                                   ^
./sklearn/ensemble/tests/test_forest.py:403:80: E501 line too long (88 > 79 characters)
        yield check_oob_score, name, csc_matrix(iris.data), iris.target, 20, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:406:80: E501 line too long (84 > 79 characters)
        yield check_oob_score, name, iris.data, iris.target * 2 + 1, 20, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:409:80: E501 line too long (80 > 79 characters)
        yield check_oob_score, name, boston.data, boston.target, 50, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:412:80: E501 line too long (92 > 79 characters)
        yield check_oob_score, name, csc_matrix(boston.data), boston.target, 50, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:458:80: E501 line too long (96 > 79 characters)
    forest = ForestEstimator(n_estimators=10, max_samples=max_samples, n_jobs=3, random_state=0)
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:471:80: E501 line too long (84 > 79 characters)
    for name, max_samples in product(FOREST_CLASSIFIERS, (1.0, 0.8, 0.5, 0.3, 0.1)):
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:474:80: E501 line too long (83 > 79 characters)
    for name, max_samples in product(FOREST_REGRESSORS, (1.0, 0.8, 0.5, 0.3, 0.1)):
                                                                               ^

jnothman · 2017-08-30T03:48:10Z

I don't think distinguishing between float and int on type alone is generally a good idea. I'd rather make the default 'all' and make 1.0 raise an error.
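
One possible reading of this suggestion, sketched with a hypothetical `resolve_max_samples` helper. The name, the accepted ranges and the error messages are all assumptions, not something agreed in this thread: the string 'all' selects every sample, an integer is an absolute count, a float strictly between 0 and 1 is a fraction, and 1.0 is rejected so that 1 and 1.0 cannot silently mean different things.

```python
import numbers


def resolve_max_samples(max_samples, n_samples):
    """Return how many samples to draw per tree (hypothetical helper)."""
    if isinstance(max_samples, str):
        if max_samples == "all":
            return n_samples
        raise ValueError("unknown max_samples string: %r" % max_samples)
    if isinstance(max_samples, numbers.Integral):
        # Integers are read as an absolute number of samples.
        if not 1 <= max_samples <= n_samples:
            raise ValueError("integer max_samples must be in [1, %d]"
                             % n_samples)
        return int(max_samples)
    if isinstance(max_samples, numbers.Real):
        # Floats are read as a fraction; 1.0 is deliberately excluded.
        if not 0.0 < max_samples < 1.0:
            raise ValueError("float max_samples must be in (0, 1); "
                             "use 'all' to draw every sample")
        return max(1, int(round(max_samples * n_samples)))
    raise TypeError("max_samples must be 'all', an int or a float")
```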

Silly question: should the sampling be weighted by sample_weight?
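
To make the question concrete, here is a sketch of what weighted subsampling could look like, using only the standard library. The function `draw_bootstrap_indices` and its parameters are hypothetical and only illustrate the idea; this is not a description of what the forest code does:

```python
import random


def draw_bootstrap_indices(n_samples, n_draws, sample_weight=None, seed=0):
    """Draw n_draws indices with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    # weights=None gives a uniform bootstrap; otherwise each index is
    # drawn with probability proportional to its weight.
    return rng.choices(range(n_samples), weights=sample_weight, k=n_draws)
```

With `sample_weight=None` this reduces to the usual uniform bootstrap, so the open point is only whether the weights should influence which rows each tree sees in addition to how they are used during fitting.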

amueller · 2017-08-30T16:16:58Z

This seems closely related to class-weight-based resampling, for which there is a PR somewhere else? That strategy is useful in nearly all cases, so I might want to give that priority?

jnothman · 2017-08-30T23:19:26Z

I'm not sure what you're referring to, @amueller... This is taking over a stalled PR.

amueller · 2017-12-12T19:08:31Z

I was talking about #8732.

Integration and test cases for RandomForest subsampling implementation (scikit-learn#5963) c5cb044

leandroleal mentioned this pull request Aug 29, 2017

Added subsampling to RandomForest #5963

Closed

jnothman mentioned this pull request Aug 30, 2017

Trees with MAE criterion are slow to train #9626

Open

amueller added the Waiting for Reviewer label Aug 5, 2019

notmatthancock mentioned this pull request Aug 18, 2019

[MRG+1] EHN Add bootstrap sample size limit to forest ensembles #14682

Merged

glemaitre closed this Sep 20, 2019
