Skip to content

Added subsampling to RandomForest #5963

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

DrDub
Copy link

@DrDub DrDub commented Dec 5, 2015

The current implementation uses bagging with the number of selected
samples equal to the whole training set. The current patch adds
a parameter to enable training the trees in the decision forest
on a smaller number of training samples.

For example, if you have a 1 million training instances, the
current implementation will train each decision tree in the
forest on 1 million random instances drawn from the original
set (with repetition). With this patch, it is possible to train
each tree, for example, in 50% of the overall training data.

To avoid modifying the signature of the fit(X,y) method, the
number of samples used in each estimator is passed to the
constructors, but this value can be set to a floating point
between 0.0 and 1.0 to refer to a percentage of the total
training data.

Minor changes to ensure PEP8 are also included, plus extended
test cases.

@glouppe
Copy link
Contributor

glouppe commented Dec 5, 2015

Thanks! Could you rename this added parameter to max_samples and mimick what is done in BaggingClassifier, where this functionality is also implemented? The goal is to have a consistent API across estimators implementing the same behaviour. There might be code also to factor out.

@DrDub
Copy link
Author

DrDub commented Dec 5, 2015

Very nice. Yes, I just added the two changes, it looks much nicer now. There might be some code to factor out (which makes sense even at the conceptual level, as bagging and random forests are related to the same key researchers) but refactoring code is outside what I'm comfortable given my current understanding of the code. Maybe this change can be merged and I then create an enhancement request for others to look into refactoring opportunities? Thanks so much for looking into this PR so promptly!

@amueller
Copy link
Member

@DrDub sorry for the slow turn-around. Could you please rebase?

@DrDub DrDub force-pushed the random-forest-subsampling branch from 1989580 to c0f0a42 Compare November 3, 2016 20:26
The current implementation uses bagging with the number of selected
samples equal to the whole training set. The current patch adds
a parameter to enable training the trees in the decision forest
on a smaller number of training samples.

For example, if you have a 1 million training instances, the
current implementation will train each decision tree in the
forest on 1 million random instances drawn from the original
set (with repetition). With this patch, it is possible to train
each tree, for example, in 50% of the overall training data.

Following the example in the Bagging class, the estimator samples
are passed as the parameter max_samples to the constructor.

Minor changes to ensure PEP8 are also included, plus extended
test cases.
@DrDub DrDub force-pushed the random-forest-subsampling branch from c0f0a42 to a361f67 Compare November 3, 2016 20:30
@DrDub
Copy link
Author

DrDub commented Nov 3, 2016

Rebased.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a skim, this looks good. It would be nice if we had a stronger test. We should also aim for test coverage of the error messages.

clf.fit(X, y)
assert_array_equal(clf.predict(T), true_result)
assert_equal(10, len(clf))

clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1)
clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1, max_samples=max_samples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please limit lines to 79 chars

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I have fixed the line lengths.

I'm open to contribute better tests in a separate PR, I can @ you for it, if you'd be interested in reviewing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we leave testing to a separate PR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR has tests for the changes in line with existing tests in the class. You suggest a different work on improving the test cases. My interest in this PR is in bringing Random Forests in line with Breiman (2001) by sampling training instances. This is very useful when training over large sets. I have used this code for my own experiments and it works, I hope others might benefit from this improvement which puts scikit-learn implementation in line with other ML libraries and published work on RFs.

I offer improving the test cases of the class as a compromise, seeing that is something that interests you.

@jnothman
Copy link
Member

Tests are good engineering practice. I don't see why you feel the functionality doesn't need testing.

Sometimes it requires some imagination to design a test for machine learning. Here you could easily use an X that has a different single feature active for each sample, and ensure that a subsampled tree, built to fit the data perfectly, uses the correct number of features.

@leandroleal
Copy link

Hi, I did this:

  • Merge with current master code (sklearn/ensemble/forest.py);
  • Design some test cases (sklearn/ensemble/tests/test_forest.py);

All the tests was passed ! :)

How I can send it to you ?
I can't commit in this pull request.

This url has my two patches: https://www.lapig.iesa.ufg.br/drive/index.php/s/kT1HAgwHt8XS2dT

@jnothman
Copy link
Member

jnothman commented Aug 29, 2017 via email

@leandroleal
Copy link

Pull request created #9645

@@ -94,19 +96,28 @@ def _generate_unsampled_indices(random_state, n_samples):


def _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees,
verbose=0, class_weight=None):
verbose=0, class_weight=None, max_samples=1.0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by max_samples I would naturally understand a total number, not a ratio... add explanation or nb_samples, used_samples?


# if max_samples is float:
if not isinstance(max_samples, (numbers.Integral, np.integer)):
max_samples = int(max_samples * X.shape[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int(round(max_samples * len(X)))


def _set_oob_score(self, X, y):
"""Compute out-of-bag score"""
X = check_array(X, dtype=DTYPE, accept_sparse='csr')

n_classes_ = self.n_classes_
n_samples = y.shape[0]
max_samples = self.max_samples

# if max_samples is float:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is repeating, think about a separate function

def __adjust_nb_samples(max_samples, nb_samples):
     n_samples = int(max_samples * nb_samples) if not isinstance(max_samples, (numbers.Integral, np.integer)) else nb_samples

     if not (0 < n_samples <= len(X)):
        raise ValueError("max_samples must be in (0, n_samples]")

    return nb_samples

self.classes_.append(classes_k)
self.n_classes_.append(classes_k.shape[0])
y = y_store_unique_indices

if self.class_weight is not None:
valid_presets = ('auto', 'balanced', 'subsample', 'balanced_subsample')
valid_presets = ('auto', 'balanced', 'subsample',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why two lines?

@@ -552,8 +579,8 @@ def predict_proba(self, X):

The predicted class probabilities of an input sample are computed as
the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the same
class in a leaf.
class probability of a single tree is the fraction of samples of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

@@ -651,7 +678,8 @@ def __init__(self,
n_jobs=1,
random_state=None,
verbose=0,
warm_start=False):
warm_start=False,
max_samples=1.0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. is default

"max_features", "max_leaf_nodes", "min_impurity_split",
"random_state"),
"max_features", "max_leaf_nodes",
"min_impurity_split", "random_state"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this formatting really needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Superseded PR has been replace by a newer PR Waiting for Reviewer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants