Added subsampling to RandomForest #5963
Conversation
Thanks! Could you rename this added parameter to
Very nice. Yes, I just added the two changes; it looks much nicer now. There might be some code to factor out (which makes sense even at the conceptual level, as bagging and random forests go back to the same key researchers), but refactoring is outside what I'm comfortable with given my current understanding of the code. Maybe this change can be merged and I then create an enhancement request for others to look into refactoring opportunities? Thanks so much for looking into this PR so promptly!
@DrDub sorry for the slow turn-around. Could you please rebase?
force-pushed from 1989580 to c0f0a42
The current implementation uses bagging with the number of selected samples equal to the size of the whole training set. This patch adds a parameter to enable training the trees in the decision forest on a smaller number of training samples. For example, if you have 1 million training instances, the current implementation will train each decision tree in the forest on 1 million random instances drawn from the original set (with repetition). With this patch, it is possible to train each tree on, for example, 50% of the overall training data. Following the example in the Bagging class, the per-estimator sample size is passed as the parameter max_samples to the constructor. Minor changes to ensure PEP8 compliance are also included, plus extended test cases.
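For illustration, a minimal usage sketch assuming the max_samples constructor argument proposed in this patch (names and sizes are only examples):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, random_state=0)

# Proposed behaviour: fit each tree on a bootstrap sample containing 50%
# of the training set instead of a sample the size of the full set.
forest = RandomForestClassifier(n_estimators=100, max_samples=0.5,
                                random_state=0)
forest.fit(X, y)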
force-pushed from c0f0a42 to a361f67
Rebased.
At a skim, this looks good. It would be nice if we had a stronger test. We should also aim for test coverage of the error messages.
     clf.fit(X, y)
     assert_array_equal(clf.predict(T), true_result)
     assert_equal(10, len(clf))

-    clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1)
+    clf = ForestClassifier(n_estimators=10, max_features=1, random_state=1, max_samples=max_samples)
please limit lines to 79 chars
Thanks for the review. I have fixed the line lengths.
I'm open to contributing better tests in a separate PR; I can @ you on it if you'd be interested in reviewing it.
Why should we leave testing to a separate PR?
My PR has tests for the changes, in line with the existing tests in the class. You are suggesting different work, on improving those test cases. My interest in this PR is in bringing Random Forests in line with Breiman (2001) by subsampling training instances. This is very useful when training over large sets. I have used this code for my own experiments and it works; I hope others might benefit from this improvement, which puts the scikit-learn implementation in line with other ML libraries and with published work on RFs.
I offer to improve the test cases of the class as a compromise, since that is something that interests you.
Tests are good engineering practice. I don't see why you feel the functionality doesn't need testing. Sometimes it requires some imagination to design a test for machine learning. Here you could easily use an X that has a different single feature active for each sample, and ensure that a subsampled tree, built to fit the data perfectly, uses the correct number of features.
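To make the suggestion concrete, here is a rough sketch of such a test, assuming the max_samples parameter proposed in this PR (sizes and the exact assertion are only illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each sample activates a different single feature, so a fully grown tree
# can only gain impurity decrease from features of samples it actually saw.
n_samples = 100
X = np.eye(n_samples)
y = np.arange(n_samples) % 2

clf = RandomForestClassifier(n_estimators=5, max_features=None,
                             bootstrap=True, max_samples=0.5,
                             random_state=0).fit(X, y)

for tree in clf.estimators_:
    # A tree fit on a 50% subsample can use at most 50 distinct features,
    # since the features of unseen samples are constant (all zero) in-bag.
    n_features_used = np.sum(tree.feature_importances_ > 0)
    assert n_features_used <= int(0.5 * n_samples)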
Hi, I did this:
- Merged with the current master code (sklearn/ensemble/forest.py);
- Designed some test cases (sklearn/ensemble/tests/test_forest.py);
All the tests passed! :) How can I send it to you? I can't commit to this pull request. This URL has my two patches: https://www.lapig.iesa.ufg.br/drive/index.php/s/kT1HAgwHt8XS2dT
Create a new pull request, please.
Pull request created: #9645
@@ -94,19 +96,28 @@ def _generate_unsampled_indices(random_state, n_samples):


 def _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees,
-                          verbose=0, class_weight=None):
+                          verbose=0, class_weight=None, max_samples=1.0):
By max_samples I would naturally understand a total number, not a ratio... add an explanation, or use nb_samples / used_samples?
    # if max_samples is float:
    if not isinstance(max_samples, (numbers.Integral, np.integer)):
        max_samples = int(max_samples * X.shape[0])
int(round(max_samples * len(X)))
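A quick illustration of the difference (example numbers only): plain int() truncates, which can lose a sample to floating-point error, while round() does not:

>>> int(0.29 * 100)          # 0.29 * 100 evaluates to 28.999999999999996
28
>>> int(round(0.29 * 100))
29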
    def _set_oob_score(self, X, y):
        """Compute out-of-bag score"""
        X = check_array(X, dtype=DTYPE, accept_sparse='csr')

        n_classes_ = self.n_classes_
        n_samples = y.shape[0]
        max_samples = self.max_samples

        # if max_samples is float:
This is repeated; think about a separate function, something like:

def __adjust_nb_samples(max_samples, nb_samples):
    if not isinstance(max_samples, (numbers.Integral, np.integer)):
        n_samples = int(round(max_samples * nb_samples))
    else:
        n_samples = max_samples
    if not (0 < n_samples <= nb_samples):
        raise ValueError("max_samples must be in (0, n_samples]")
    return n_samples
                self.classes_.append(classes_k)
                self.n_classes_.append(classes_k.shape[0])
            y = y_store_unique_indices

            if self.class_weight is not None:
-               valid_presets = ('auto', 'balanced', 'subsample', 'balanced_subsample')
+               valid_presets = ('auto', 'balanced', 'subsample',
+                                'balanced_subsample')
why two lines?
@@ -552,8 +579,8 @@ def predict_proba(self, X):

         The predicted class probabilities of an input sample are computed as
         the mean predicted class probabilities of the trees in the forest. The
-        class probability of a single tree is the fraction of samples of the same
-        class in a leaf.
+        class probability of a single tree is the fraction of samples of the
+        same class in a leaf.
why?
@@ -651,7 +678,8 @@ def __init__(self,
                  n_jobs=1,
                  random_state=None,
                  verbose=0,
-                 warm_start=False):
+                 warm_start=False,
+                 max_samples=1.0):
- is default
"max_features", "max_leaf_nodes", "min_impurity_split", | ||
"random_state"), | ||
"max_features", "max_leaf_nodes", | ||
"min_impurity_split", "random_state"), |
is this formatting really needed?
The current implementation uses bagging with the number of selected samples equal to the size of the whole training set. This patch adds a parameter to enable training the trees in the decision forest on a smaller number of training samples.

For example, if you have 1 million training instances, the current implementation will train each decision tree in the forest on 1 million random instances drawn from the original set (with repetition). With this patch, it is possible to train each tree on, for example, 50% of the overall training data.

To avoid modifying the signature of the fit(X, y) method, the number of samples used in each estimator is passed to the constructors; this value can be set to a floating point number between 0.0 and 1.0 to refer to a fraction of the total training data.

Minor changes to ensure PEP8 compliance are also included, plus extended test cases.
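As a rough sketch (not the actual scikit-learn code), the per-tree sampling described above can be thought of as follows; the helper name is hypothetical:

import numpy as np

def draw_bootstrap_indices(n_samples, max_samples, random_state=0):
    # max_samples may be an absolute count or a fraction of the data.
    if isinstance(max_samples, float):
        n_samples_bootstrap = int(round(max_samples * n_samples))
    else:
        n_samples_bootstrap = max_samples
    rng = np.random.RandomState(random_state)
    # Draw with replacement, as in standard bagging.
    return rng.randint(0, n_samples, n_samples_bootstrap)

# 50% of 1,000,000 training instances -> 500,000 indices per tree.
print(len(draw_bootstrap_indices(1000000, 0.5)))  # 500000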