Added subsampling to RandomForest #5963

DrDub · 2015-12-05T01:14:23Z

The current implementation uses bagging with the number of selected
samples equal to the whole training set. The current patch adds
a parameter to enable training the trees in the decision forest
on a smaller number of training samples.

For example, if you have 1 million training instances, the
current implementation will train each decision tree in the
forest on 1 million random instances drawn from the original
set (with repetition). With this patch, it is possible to train
each tree on, for example, 50% of the overall training data.

To avoid modifying the signature of the fit(X, y) method, the
number of samples used by each estimator is passed to the
constructors. This value can also be set to a floating-point
number between 0.0 and 1.0 to refer to a fraction of the total
training data.
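
A minimal sketch of the semantics described above, not the code in this patch: an integer is read as a sample count and a float as a fraction of the training set, and each tree then draws that many indices with repetition. The helper names `resolve_max_samples` and `bootstrap_indices` are hypothetical, and the standard-library `random` module stands in for the NumPy random state used in scikit-learn.

```python
import random


def resolve_max_samples(max_samples, n_samples):
    """Turn an int count or a float fraction into a number of draws."""
    if isinstance(max_samples, int):
        n_draw = max_samples                   # absolute number of samples
    else:
        n_draw = int(max_samples * n_samples)  # fraction of the training set
    if not 0 < n_draw <= n_samples:
        raise ValueError("max_samples must resolve to a value in (0, n_samples]")
    return n_draw


def bootstrap_indices(rng, n_samples, n_draw):
    """Draw n_draw row indices with repetition from range(n_samples)."""
    return [rng.randrange(n_samples) for _ in range(n_draw)]


rng = random.Random(0)
n_draw = resolve_max_samples(0.5, 1000)  # 50% of 1000 rows
indices = bootstrap_indices(rng, 1000, n_draw)
print(n_draw, len(set(indices)))  # number of draws, number of distinct rows
```

Note that under this reading `1.0` (a float) means the whole training set, while `1` (an integer) means a single sample.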

Minor changes to ensure PEP8 are also included, plus extended
test cases.

glouppe · 2015-12-05T08:28:34Z

Thanks! Could you rename this added parameter to max_samples and mimic what is done in BaggingClassifier, where this functionality is also implemented? The goal is to have a consistent API across estimators implementing the same behaviour. There might also be code to factor out.

DrDub · 2015-12-05T14:01:31Z

Very nice. Yes, I just added the two changes, and it looks much nicer now. There might be some code to factor out (which makes sense even at the conceptual level, as bagging and random forests are related to the same key researchers), but refactoring code is outside what I'm comfortable with, given my current understanding of the code. Maybe this change can be merged and I can then create an enhancement request for others to look into refactoring opportunities? Thanks so much for looking into this PR so promptly!

amueller · 2016-09-14T17:02:30Z

@DrDub sorry for the slow turn-around. Could you please rebase?

The current implementation uses bagging with the number of selected samples equal to the whole training set. The current patch adds a parameter to enable training the trees in the decision forest on a smaller number of training samples. For example, if you have 1 million training instances, the current implementation will train each decision tree in the forest on 1 million random instances drawn from the original set (with repetition). With this patch, it is possible to train each tree on, for example, 50% of the overall training data. Following the example in the Bagging class, the number of samples per estimator is passed to the constructor as the parameter max_samples. Minor changes to ensure PEP8 are also included, plus extended test cases.

DrDub · 2016-11-03T22:34:50Z

Rebased.

jnothman

At a skim, this looks good. It would be nice if we had a stronger test. We should also aim for test coverage of the error messages.

jnothman · 2016-11-06T11:28:21Z

sklearn/ensemble/tests/test_forest.py

    clf.fit(X, y)
    assert_array_equal(clf.predict(T), true_result)
    assert_equal(10, len(clf))

-    clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1)
+    clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1, max_samples=max_samples)


please limit lines to 79 chars

Thanks for the review. I have fixed the line lengths.

I'm open to contributing better tests in a separate PR. I can @ you on it, if you'd be interested in reviewing it.

Why should we leave testing to a separate PR?

My PR has tests for the changes, in line with the existing tests in the class. You are suggesting separate work on improving the test cases. My interest in this PR is in bringing Random Forests in line with Breiman (2001) by sampling training instances. This is very useful when training over large sets. I have used this code for my own experiments and it works. I hope others might benefit from this improvement, which brings the scikit-learn implementation in line with other ML libraries and published work on RFs.

I am offering to improve the test cases of the class as a compromise, seeing that it is something that interests you.

jnothman · 2016-11-28T09:59:21Z

Tests are good engineering practice. I don't see why you feel the functionality doesn't need testing.

Sometimes it requires some imagination to design a test for machine learning. Here you could easily use an X that has a different single feature active for each sample, and ensure that a subsampled tree, built to fit the data perfectly, uses the correct number of features.
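
A minimal sketch of the counting argument behind such a test. It does not fit a tree: it only shows that, with one active feature per sample, the features that vary within a subsample are limited to those of the drawn rows, so a tree grown on that subsample can split on at most that many features. The helper names `one_hot_rows` and `varying_features` are hypothetical, and the standard-library `random` module stands in for the forest's random state.

```python
import random


def one_hot_rows(n_samples):
    """Row i has a single active feature, at column i."""
    return [[1 if j == i else 0 for j in range(n_samples)]
            for i in range(n_samples)]


def varying_features(X, indices):
    """Columns that take more than one value among the drawn rows."""
    n_features = len(X[0])
    return [j for j in range(n_features)
            if len({X[i][j] for i in indices}) > 1]


n_samples, max_samples = 50, 10
X = one_hot_rows(n_samples)

rng = random.Random(0)
drawn = [rng.randrange(n_samples) for _ in range(max_samples)]

# Only features belonging to drawn rows can vary inside the subsample,
# so only those features are candidates for a split.
usable = varying_features(X, drawn)
print(len(set(drawn)), len(usable))
```

A real test would fit the forest with the new parameter and compare the set of features each tree actually splits on against this bound.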

leandroleal · 2017-08-28T22:35:11Z

Hi, I did this:

- Merged with the current master code (sklearn/ensemble/forest.py);
- Designed some test cases (sklearn/ensemble/tests/test_forest.py).

All the tests passed! :)

How can I send it to you?
I can't commit to this pull request.

This url has my two patches: https://www.lapig.iesa.ufg.br/drive/index.php/s/kT1HAgwHt8XS2dT

jnothman · 2017-08-29T00:04:37Z

Create a new pull request, please.

leandroleal · 2017-08-29T21:18:36Z

Pull request created #9645

Borda · 2019-07-13T10:15:31Z

sklearn/ensemble/forest.py

@@ -94,19 +96,28 @@ def _generate_unsampled_indices(random_state, n_samples):


 def _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees,
-                          verbose=0, class_weight=None):
+                          verbose=0, class_weight=None, max_samples=1.0):


by max_samples I would naturally understand a total number, not a ratio... add explanation or nb_samples, used_samples?

Borda · 2019-07-13T10:17:10Z

sklearn/ensemble/forest.py

+
+        # if max_samples is float:
+        if not isinstance(max_samples, (numbers.Integral, np.integer)):
+            max_samples = int(max_samples * X.shape[0])


int(round(max_samples * len(X)))
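
A short sketch of why the rounding matters: a fraction times a row count can land just below the intended integer in floating point, and `int()` then truncates it down by one. The values below are illustrative only.

```python
n_samples = 100
fraction = 0.29

# 0.29 * 100 evaluates to 28.999999999999996 in binary floating point.
truncated = int(fraction * n_samples)
rounded = int(round(fraction * n_samples))

print(truncated, rounded)  # prints "28 29"
```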

Borda · 2019-07-13T10:30:02Z

sklearn/ensemble/forest.py


    def _set_oob_score(self, X, y):
        """Compute out-of-bag score"""
        X = check_array(X, dtype=DTYPE, accept_sparse='csr')

        n_classes_ = self.n_classes_
        n_samples = y.shape[0]
+        max_samples = self.max_samples
+
+        # if max_samples is float:


it is repeating, think about a separate function

    def __adjust_nb_samples(max_samples, nb_samples):
        # An integer max_samples is a count; a float is a fraction of nb_samples.
        if isinstance(max_samples, (numbers.Integral, np.integer)):
            n_samples = max_samples
        else:
            n_samples = int(max_samples * nb_samples)
        if not (0 < n_samples <= nb_samples):
            raise ValueError("max_samples must be in (0, n_samples]")
        return n_samples
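
A minimal sketch of why the out-of-bag code needs the same resolved sample count as the tree-building code: the out-of-bag rows are recovered by replaying the tree's draws from the same seed, which only works if the number of draws matches. The function names `in_bag_indices` and `out_of_bag_indices` are hypothetical, and the standard-library `random` module stands in for the NumPy random state used in forest.py.

```python
import random


def in_bag_indices(seed, n_samples, n_draw):
    """Replay the bootstrap draw for one tree from its seed."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_draw)]


def out_of_bag_indices(seed, n_samples, n_draw):
    """Rows that the tree with this seed never saw during training."""
    drawn = set(in_bag_indices(seed, n_samples, n_draw))
    return [i for i in range(n_samples) if i not in drawn]


in_bag = in_bag_indices(seed=7, n_samples=20, n_draw=5)
oob = out_of_bag_indices(seed=7, n_samples=20, n_draw=5)
print(sorted(set(in_bag)), oob)
```

With a smaller number of draws per tree, more rows end up out-of-bag for each tree, which is why a shared helper for resolving the count keeps the two code paths consistent.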

Borda · 2019-07-13T10:30:32Z

sklearn/ensemble/forest.py

            self.classes_.append(classes_k)
            self.n_classes_.append(classes_k.shape[0])
        y = y_store_unique_indices

        if self.class_weight is not None:
-            valid_presets = ('auto', 'balanced', 'subsample', 'balanced_subsample')
+            valid_presets = ('auto', 'balanced', 'subsample',


why two lines?

Borda · 2019-07-13T10:31:00Z

sklearn/ensemble/forest.py

@@ -552,8 +579,8 @@ def predict_proba(self, X):

        The predicted class probabilities of an input sample are computed as
        the mean predicted class probabilities of the trees in the forest. The
-        class probability of a single tree is the fraction of samples of the same
-        class in a leaf.
+        class probability of a single tree is the fraction of samples of the


Borda · 2019-07-13T10:31:38Z

sklearn/ensemble/forest.py

@@ -651,7 +678,8 @@ def __init__(self,
                 n_jobs=1,
                 random_state=None,
                 verbose=0,
-                 warm_start=False):
+                 warm_start=False,
+                 max_samples=1.0):


Borda · 2019-07-13T10:32:45Z

sklearn/ensemble/forest.py

-                              "max_features", "max_leaf_nodes", "min_impurity_split",
-                              "random_state"),
+                              "max_features", "max_leaf_nodes",
+                              "min_impurity_split", "random_state"),


is this formatting really needed?

DrDub force-pushed the random-forest-subsampling branch from 1989580 to c0f0a42 Compare November 3, 2016 20:26

DrDub force-pushed the random-forest-subsampling branch from c0f0a42 to a361f67 Compare November 3, 2016 20:30

jnothman added Enhancement Waiting for Reviewer labels Nov 6, 2016

jnothman requested changes Nov 6, 2016

DrDub added 2 commits November 26, 2016 04:28

Merge branch 'master' into random-forest-subsampling

deae699

Fixed line lengths

0331103

leandroleal added a commit to lapig-ufg/scikit-learn that referenced this pull request Aug 29, 2017

Integration and test cases for RandomForest subsampling implementation (scikit-learn#5963)

c5cb044

leandroleal mentioned this pull request Aug 29, 2017

Integration and test cases for RandomForest subsampling #9645

Closed

jnothman mentioned this pull request Feb 21, 2018

RandomForest: Different subsampling per tree for faster training #10668

Closed

jnothman mentioned this pull request Nov 15, 2018

Setting subsample size in RandomForestClassifier and RandomForestRegressor? #12594

Closed

jnothman mentioned this pull request Apr 22, 2019

Feature request: allow bootstrap size less than training set size in random forest fit #13687

Closed

Borda reviewed Jul 13, 2019

amueller added the Superseded PR has been replace by a newer PR label Aug 5, 2019

notmatthancock mentioned this pull request Aug 18, 2019

[MRG+1] EHN Add bootstrap sample size limit to forest ensembles #14682

Merged

glemaitre closed this Sep 20, 2019
