Stratifying Across Classes During Training in ShuffleSplit #5965 #5972
Conversation
…ate_shuffle_split.
…PyCharm seems to have not saved my previous changes!
…train and n_test.
… sizes, and to deal with nose tests properly.
Hey @jnothman, can you take a look? Thanks.
I don't think we should call this stratifying (even though this is a common usage of the word). In scikit-learn, stratifying means keeping the class frequencies the same for each split. Btw, have you compared this against using …?
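For reference, a minimal sketch of the stratification behaviour described above: StratifiedShuffleSplit keeps the class frequencies roughly the same in both parts of each split. The 30/80 class sizes below are made up to match the first example reported later in this PR.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced toy labels: 30 samples of class 0, 80 of class 1.
y = np.array([0] * 30 + [1] * 80)
X = np.zeros((len(y), 1))  # features do not affect the split itself

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Both sides keep roughly the original 30:80 ratio,
    # e.g. train counts [21 56] and test counts [ 9 24].
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))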
Thanks for the pointer, Andreas, I took a quick look. Correct me if I am wrong, but it seems to me that the current implementation … Depending on the goals of the user, e.g. minimizing the …
I didn't claim there was no use-case, I was just wondering how the methods compare on your application. I'm not arguing against adding this, I agree our understanding is not complete ;)
I have yet to compare the two approaches for my application. My efforts so far have focused on improving the accuracy and interpretability of the various features I have developed. As I make more use of scikit-learn, I will try performing the comparison. Did you take a look at the changes in the code itself? Is it ready to be merged? Thanks.
Sorry for the long turn-around. …
That's a good idea. And I like the name, as it clearly conveys the motivation and its function. Did you get a chance to review the code?
Well, the code needs to change if we want this in a separate class...
With the new splitter structure, resamplers as meta-estimators would be the right approach, I think:

import numpy as np

from sklearn.model_selection import check_cv
from sklearn.utils import check_random_state
from sklearn.utils.random import sample_without_replacement


class Undersampler(object):
    def __init__(self, base_cv=None, random_state=None):
        self.base_cv = base_cv
        self.random_state = random_state

    def split(self, X, y):
        random_state = check_random_state(self.random_state)
        # TODO: check and encode y
        for train_idx, test_idx in check_cv(self.base_cv, classifier=True).split(X, y):
            y_split = y.take(train_idx)
            counts = np.bincount(y_split)
            # undersample every class down to the size of the smallest class
            n_per_class = np.min(counts)
            out_train_idx = []
            for i in range(len(counts)):
                i_sample = sample_without_replacement(counts[i], n_per_class,
                                                      random_state=random_state)
                # map positions within the training fold back to dataset indices
                out_train_idx.append(train_idx[np.flatnonzero(y_split == i)].take(i_sample))
            yield np.hstack(out_train_idx), test_idx

Is there something I've missed about how this would operate in a cross-validation context?
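For concreteness, here is one way the sketch above might be exercised directly. The toy labels and the StratifiedKFold(3) base_cv are arbitrary choices for illustration, and y is assumed to be integer-encoded since the TODO about encoding is left unimplemented.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: 30 samples of class 0, 80 of class 1.
rng = np.random.RandomState(0)
y = np.array([0] * 30 + [1] * 80)
X = rng.rand(len(y), 2)

sampler = Undersampler(base_cv=StratifiedKFold(3), random_state=0)
for train_idx, test_idx in sampler.split(X, y):
    # Training folds are balanced down to the minority class;
    # test folds keep the proportions produced by the base CV.
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))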
Also relevant is that …
In terms of a general resampler, see #1454
@jnothman you mean meta-splitters, right? I feel this feature is somehow tied to the cross-validation strategy. I think this behavior would be more natural for the stratified CV objects than the non-stratified ones. It's not entirely clear to me whether we want to balance the test set, keep the test set as-is, or put "the rest" into the test set.
I would support balancing the training set (based on a user-requested percentage of the smallest class) and adding the rest (from each class that is larger than the smallest class) to the test set. This is what my original implementation does.
I maybe need to review what I wrote, but I'm not sure I intended to throw away samples, @amueller. Depends on the sampling regime..? I don't think you usually want to balance the test set: surely you're better off just reporting metrics sensitive to class imbalance.
@jnothman you don't change the test set, but you subsample the training set, right? That would discard samples?
Undersample or oversample; both should work since we use indices for CV.
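As a rough counterpart to the Undersampler sketch above, an oversampling variant might look like the following: each class is resampled with replacement up to the size of the largest class, relying on the fact that index-based CV tolerates duplicated training indices. This Oversampler class is purely illustrative and not part of the PR.

import numpy as np
from sklearn.model_selection import check_cv
from sklearn.utils import check_random_state


class Oversampler(object):
    def __init__(self, base_cv=None, random_state=None):
        self.base_cv = base_cv
        self.random_state = random_state

    def split(self, X, y):
        rng = check_random_state(self.random_state)
        # TODO: check and encode y, as in the Undersampler sketch
        for train_idx, test_idx in check_cv(self.base_cv, classifier=True).split(X, y):
            y_split = y.take(train_idx)
            counts = np.bincount(y_split)
            # oversample every class up to the size of the largest class
            n_per_class = np.max(counts)
            out_train_idx = []
            for i in np.flatnonzero(counts):
                class_idx = train_idx[np.flatnonzero(y_split == i)]
                out_train_idx.append(rng.choice(class_idx, n_per_class, replace=True))
            yield np.hstack(out_train_idx), test_idx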
I think this would require a new PR, and probably #26821 would be a better way forward. |
This is the PR regarding Stratifying Across Classes in ShuffleSplit #5965, in which I have done the following:
Introduced a new optional flag stratify_across_classes to StratifiedShuffleSplit.
Programmed the inner workings to achieve the expected CV splits.
Enhanced the validation of the train_size and test_size specification (ensuring they sum to 1).
Also, I feel there might be a potential bug in the ShuffleSplit class, which primarily anchors on test_size and doesn't handle the case where only train_size is specified (inferring test_size appropriately). It seems that not much testing has been performed for the case where the user specifies only train_size.
Example outputs from running the committed code:
Class sizes: [30, 80]
test_size: 0.3
------- specifying test_size --------
With the proposed stratification:
Training: 21 21 Testing: 9 59
Without
Training: 21 56 Testing: 9 24
Class sizes: [70, 50]
test_size: 0.3
------- specifying test_size --------
With the proposed stratification:
Training: 35 35 Testing: 35 15
Without
Training: 49 35 Testing: 21 15
Class sizes: [20, 90]
test_size: 0.1
------- specifying test_size --------
With the proposed stratification:
Training: 18 18 Testing: 2 72
Without
Training: 18 81 Testing: 2 9
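To make the arithmetic behind these outputs explicit, here is a small stand-alone sketch consistent with the numbers above; proposed_split_counts is just an illustrative helper, not the PR's code. The scheme appears to keep floor((1 - test_size) * smallest class size) samples from every class for training and to put everything left over into the test set.

import math

def proposed_split_counts(class_sizes, test_size):
    # Train on an equal number of samples from each class, limited by the
    # smallest class; everything left over goes into the test set.
    n_train_per_class = math.floor((1 - test_size) * min(class_sizes))
    train = [n_train_per_class] * len(class_sizes)
    test = [size - n_train_per_class for size in class_sizes]
    return train, test

print(proposed_split_counts([30, 80], 0.3))  # ([21, 21], [9, 59])
print(proposed_split_counts([70, 50], 0.3))  # ([35, 35], [35, 15])
print(proposed_split_counts([20, 90], 0.1))  # ([18, 18], [2, 72])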