Integration and test cases for RandomForest subsampling #9645

Closed

leandroleal

This work is based on #5963.
Thanks @DrDub

@jnothman
Member

Thanks. Will review soon!

@jnothman
Member

Doctests are failing because you haven't updated the expected parameter list that is shown when a RandomForestRegressor() etc. is printed:

----------------------------------------------------------------------
File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/forest.py", line 972, in sklearn.ensemble.forest.RandomForestClassifier
Failed example:
    clf.fit(X, y)
Expected:
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=2, max_features='auto', max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                oob_score=False, random_state=0, verbose=0, warm_start=False)
Got:
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=2, max_features='auto', max_leaf_nodes=None,
                max_samples=1.0, min_impurity_decrease=0.0,
                min_impurity_split=None, min_samples_leaf=1,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
                verbose=0, warm_start=False)
>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x7f9fadd15050>.getvalue()))
    
======================================================================
FAIL: Doctest: sklearn.ensemble.forest.RandomForestRegressor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python2.7/doctest.py", line 2226, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.ensemble.forest.RandomForestRegressor
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/forest.py", line 1055, in RandomForestRegressor
----------------------------------------------------------------------
File "/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/forest.py", line 1219, in sklearn.ensemble.forest.RandomForestRegressor
Failed example:
    regr.fit(X, y)
Expected:
    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
               max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=0, verbose=0, warm_start=False)
Got:
    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
               max_features='auto', max_leaf_nodes=None, max_samples=1.0,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=0, verbose=0, warm_start=False)
>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x7f9fadd09200>.getvalue()))
    
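For reference, the mismatch arises because the printed parameters appear in alphabetical order in the output above, so the new `max_samples` entry lands between `max_leaf_nodes` and `min_impurity_decrease` and shifts the line wrapping of everything after it. A minimal sketch of that ordering effect, using a hypothetical `format_params` helper (an illustration only, not scikit-learn's actual repr code):

```python
def format_params(name, params):
    """Render ``name(k=v, ...)`` with parameters sorted alphabetically."""
    body = ", ".join(
        "{}={!r}".format(key, params[key]) for key in sorted(params)
    )
    return "{}({})".format(name, body)


# A small subset of the parameters from the doctest output above.
old_params = {
    "max_features": "auto",
    "max_leaf_nodes": None,
    "min_impurity_decrease": 0.0,
}
# Adding the new parameter changes the rendered string.
new_params = dict(old_params, max_samples=1.0)

old_repr = format_params("RandomForestClassifier", old_params)
new_repr = format_params("RandomForestClassifier", new_params)
print(old_repr)
print(new_repr)
```

Every docstring example that prints a fitted forest would need its expected output regenerated in the same way.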

Flake8 is also failing:

Running flake8 on the diff in the range 468d6e71a..c5cb044fb (1 commit(s)):
--------------------------------------------------------------------------------
./sklearn/ensemble/forest.py:89:80: E501 line too long (83 > 79 characters)
    sample_indices = _generate_sample_indices(random_state, n_samples, max_samples)
                                                                               ^
./sklearn/ensemble/forest.py:112:80: E501 line too long (117 > 79 characters)
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                               ^
./sklearn/ensemble/forest.py:112:116: E202 whitespace before ')'
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                                                                   ^
./sklearn/ensemble/forest.py:119:72: E231 missing whitespace after ','
        indices = _generate_sample_indices(tree.random_state, n_samples,max_samples)
                                                                       ^
./sklearn/ensemble/forest.py:119:80: E501 line too long (84 > 79 characters)
        indices = _generate_sample_indices(tree.random_state, n_samples,max_samples)
                                                                               ^
./sklearn/ensemble/forest.py:451:80: E501 line too long (117 > 79 characters)
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                               ^
./sklearn/ensemble/forest.py:451:116: E202 whitespace before ')'
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                                                                   ^
./sklearn/ensemble/forest.py:735:80: E501 line too long (117 > 79 characters)
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                               ^
./sklearn/ensemble/forest.py:735:116: E202 whitespace before ')'
            raise ValueError("max_samples = " + str(max_samples) + " and it must be in (0, " + str(n_samples) + ")" )
                                                                                                                   ^
./sklearn/ensemble/tests/test_forest.py:202:80: E501 line too long (84 > 79 characters)
    for name, max_samples in product(FOREST_CLASSIFIERS, (1.0, 0.8, 0.5, 0.3, 0.1)):
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:361:1: E302 expected 2 blank lines, found 1
def check_max_samples_equal_0(name):
^
./sklearn/ensemble/tests/test_forest.py:365:1: W293 blank line contains whitespace
^
./sklearn/ensemble/tests/test_forest.py:370:1: F811 redefinition of unused 'check_max_samples_equal_0' from line 361
def check_max_samples_equal_0():
^
./sklearn/ensemble/tests/test_forest.py:374:1: E302 expected 2 blank lines, found 1
def check_oob_score(name, X, y, n_estimators=20, max_samples = 1.0):
^
./sklearn/ensemble/tests/test_forest.py:374:61: E251 unexpected spaces around keyword / parameter equals
def check_oob_score(name, X, y, n_estimators=20, max_samples = 1.0):
                                                            ^
./sklearn/ensemble/tests/test_forest.py:374:63: E251 unexpected spaces around keyword / parameter equals
def check_oob_score(name, X, y, n_estimators=20, max_samples = 1.0):
                                                              ^
./sklearn/ensemble/tests/test_forest.py:380:80: E501 line too long (103 > 79 characters)
                                  n_estimators=n_estimators, bootstrap=True, max_samples = max_samples)
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:380:89: E251 unexpected spaces around keyword / parameter equals
                                  n_estimators=n_estimators, bootstrap=True, max_samples = max_samples)
                                                                                        ^
./sklearn/ensemble/tests/test_forest.py:380:91: E251 unexpected spaces around keyword / parameter equals
                                  n_estimators=n_estimators, bootstrap=True, max_samples = max_samples)
                                                                                          ^
./sklearn/ensemble/tests/test_forest.py:394:80: E501 line too long (96 > 79 characters)
                                      n_estimators=1, bootstrap=True, max_samples = max_samples)
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:394:82: E251 unexpected spaces around keyword / parameter equals
                                      n_estimators=1, bootstrap=True, max_samples = max_samples)
                                                                                 ^
./sklearn/ensemble/tests/test_forest.py:394:84: E251 unexpected spaces around keyword / parameter equals
                                      n_estimators=1, bootstrap=True, max_samples = max_samples)
                                                                                   ^
./sklearn/ensemble/tests/test_forest.py:403:80: E501 line too long (88 > 79 characters)
        yield check_oob_score, name, csc_matrix(iris.data), iris.target, 20, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:406:80: E501 line too long (84 > 79 characters)
        yield check_oob_score, name, iris.data, iris.target * 2 + 1, 20, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:409:80: E501 line too long (80 > 79 characters)
        yield check_oob_score, name, boston.data, boston.target, 50, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:412:80: E501 line too long (92 > 79 characters)
        yield check_oob_score, name, csc_matrix(boston.data), boston.target, 50, max_samples
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:458:80: E501 line too long (96 > 79 characters)
    forest = ForestEstimator(n_estimators=10, max_samples=max_samples, n_jobs=3, random_state=0)
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:471:80: E501 line too long (84 > 79 characters)
    for name, max_samples in product(FOREST_CLASSIFIERS, (1.0, 0.8, 0.5, 0.3, 0.1)):
                                                                               ^
./sklearn/ensemble/tests/test_forest.py:474:80: E501 line too long (83 > 79 characters)
    for name, max_samples in product(FOREST_REGRESSORS, (1.0, 0.8, 0.5, 0.3, 0.1)):
                                                                               ^
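Most of the warnings above are E501 (line too long) and E251 (spaces around a keyword equals sign). A rough sketch of how such lines might be reshaped to fit in 79 characters follows. The helper `max_samples_error`, the constant `MAX_SAMPLES_VALUES`, and the two-entry `FOREST_CLASSIFIERS` tuple are hypothetical stand-ins, not names taken from the PR:

```python
from itertools import product

# Hypothetical stand-in for the classifier names used in the test module.
FOREST_CLASSIFIERS = ("ExtraTreesClassifier", "RandomForestClassifier")

# Hoisting the tuple into a constant keeps the loop header short (E501).
MAX_SAMPLES_VALUES = (1.0, 0.8, 0.5, 0.3, 0.1)


def max_samples_error(max_samples, n_samples):
    """Build the error with str.format instead of long '+' chains.

    The message text matches the flagged line; there is no space before
    the closing parenthesis (E202) and each line stays under 80 columns.
    """
    return ValueError(
        "max_samples = {} and it must be in (0, {})".format(
            max_samples, n_samples))


# Keyword arguments are written without spaces around '=' (E251).
cases = list(product(FOREST_CLASSIFIERS, MAX_SAMPLES_VALUES, repeat=1))

for name, max_samples in cases:
    pass  # a real test would call its check function here
```

Building the message once in a helper would also avoid repeating the same long line at the three flagged locations in forest.py.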

@jnothman
Member

I don't think distinguishing between float and int on type alone is generally a good idea. I'd rather make the default 'all' and make 1.0 raise an error.

Silly question: should the sampling be weighted by sample_weight?
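One possible reading of the suggestion about the default, sketched under assumptions: `'all'` means every sample, an integer is an absolute count, a float strictly between 0 and 1 is a fraction, and `1.0` is rejected so that it cannot be confused with the integer `1`. The function name `resolve_max_samples` and the exact bounds are hypothetical, not taken from the PR:

```python
import numbers


def resolve_max_samples(max_samples, n_samples):
    """Return how many samples to draw per tree (hypothetical helper)."""
    if isinstance(max_samples, str):
        if max_samples == "all":
            return n_samples
        raise ValueError("unknown max_samples option: %r" % max_samples)
    if isinstance(max_samples, numbers.Integral):
        # Integers are absolute counts (assumed bounds: 1..n_samples).
        if not 1 <= max_samples <= n_samples:
            raise ValueError(
                "max_samples = {} and it must be in [1, {}]".format(
                    max_samples, n_samples))
        return int(max_samples)
    if isinstance(max_samples, numbers.Real):
        # Floats are fractions; 1.0 is deliberately an error here.
        if not 0.0 < max_samples < 1.0:
            raise ValueError(
                "max_samples = {} and it must be in (0, 1); "
                "use 'all' for every sample".format(max_samples))
        return max(1, int(max_samples * n_samples))
    raise ValueError("invalid max_samples: %r" % (max_samples,))


print(resolve_max_samples("all", 150))
print(resolve_max_samples(0.5, 150))
print(resolve_max_samples(10, 150))
```

With this layout the type still selects the interpretation, but no value is valid under both interpretations, which is the ambiguity the comment is concerned about.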

@amueller
Member

This seems closely related to class-weight-based resampling, for which I think there is a PR somewhere else? That strategy is useful in nearly all cases, so I might want to give it priority.

@jnothman
Member

I'm not sure what you're referring to, @amueller... This is taking over a stalled PR.

@amueller
Member

I was talking about #8732.
