
Stratifying Across Classes During Training in ShuffleSplit #5965 #5972


Closed
wants to merge 5 commits

Conversation

raamana
Contributor

@raamana commented Dec 7, 2015

This is the PR regarding Stratifying Across Classes in ShuffleSplit #5965, in which I have done the following:
Introduced a new optional flag stratify_across_classes to StratifiedShuffleSplit.
Implemented the inner workings to achieve the expected CV splits.
Enhanced the validation of the train_size and test_size specification (they must sum to 1).

Also, I feel there might be a bug in the ShuffleSplit class: it primarily anchors on test_size and doesn't handle the case where only train_size is specified (inferring test_size appropriately from it). It seems that not much testing has been done for the case where the user specifies only train_size.

Example outputs from running the committed code:

Class sizes: [30, 80]
test_size: 0.3
------- specifying test_size --------
With the proposed stratification:
Training: 21 21 Testing: 9 59
Without
Training: 21 56 Testing: 9 24

Class sizes: [70, 50]
test_size: 0.3
------- specifying test_size --------
With the proposed stratification:
Training: 35 35 Testing: 35 15
Without
Training: 49 35 Testing: 21 15

Class sizes: [20, 90]
test_size: 0.1
------- specifying test_size --------
With the proposed stratification:
Training: 18 18 Testing: 2 72
Without
Training: 18 81 Testing: 2 9
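
For illustration, here is a minimal NumPy sketch of the split behaviour these numbers reflect (a sketch of the proposal, not the committed code; the helper name balanced_shuffle_split is made up here): each class contributes floor((1 - test_size) * smallest_class_size) samples to the training set, and everything left over goes to the test set.

import numpy as np

def balanced_shuffle_split(y, test_size=0.3, random_state=None):
    # Sketch of the proposed behaviour: draw an equal number of training
    # samples from every class, sized by the smallest class, and put the
    # remainder of every class into the test set.
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_train_per_class = int(np.floor((1 - test_size) * counts.min()))
    train_idx, test_idx = [], []
    for klass in classes:
        idx = rng.permutation(np.flatnonzero(y == klass))
        train_idx.append(idx[:n_train_per_class])
        test_idx.append(idx[n_train_per_class:])
    return np.hstack(train_idx), np.hstack(test_idx)

# Class sizes [30, 80] with test_size=0.3 reproduce the first block above:
y = np.repeat([0, 1], [30, 80])
train, test = balanced_shuffle_split(y, test_size=0.3, random_state=0)
print(np.bincount(y[train]), np.bincount(y[test]))  # [21 21] [ 9 59]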



raamana and others added 5 commits December 7, 2015 02:58
…huffleSplit.

…PyCharm seems to have not saved my previous changes!
… sizes, and to deal with nose tests properly.
@raamana
Contributor Author

raamana commented Dec 8, 2015

Hey @jnothman, can you take a look? Thanks.

@amueller
Member

amueller commented Dec 8, 2015

I don't think we should call this stratifying (although that is a common usage of the word). In scikit-learn, stratifying means keeping the class frequencies the same in each split.

Btw, have you compared this against using class_weight='balanced'?
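
For reference, a minimal sketch of that alternative, using LogisticRegression on a synthetic imbalanced dataset (both the estimator choice and the data here are illustrative, not from this PR):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A roughly 90% / 10% two-class problem.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' reweights the loss inversely to class frequency,
# rather than changing which samples land in the training split.
clf = LogisticRegression(class_weight='balanced').fit(X, y)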

@raamana
Contributor Author

raamana commented Dec 9, 2015

Thanks for the pointer, Andreas; I took a quick look. Correct me if I am wrong, but it seems to me that the current class_weight implementation in scikit-learn is only useful if the classifier supports it. Moreover, this is still an open area of research, with no consensus on how to select class weights. Their influence on misclassification cost and classifier performance is understood only in a limited sense, and varies depending on the dataset. Some recent studies [1, 2] suggest a combination of cost-sensitive learning and resampling may be preferable.

Depending on the user's goals, e.g. minimizing the misclassification cost (under balanced or natural distributions) or increasing or balancing classifier performance (in sensitivity and specificity), the preference between cost-sensitive learning and class resampling changes. Given the diversity of user needs and our limited understanding, I feel it is better to offer as many options as possible (class_weight, resampling, stratification, etc.). Hence there may be many use cases for the approach I suggested, which balances the classes in the training set (e.g. when the classifier doesn't support cost-sensitive learning).

  1. http://cscwd2006.seu.edu.cn/people/xyliu/publication/icdm06a.pdf
  2. http://www.ismll.uni-hildesheim.de/pub/pdfs/Nguyen_et_al_IJCNN2010_CSL.pdf

@amueller
Member

amueller commented Dec 9, 2015

I didn't claim there was no use case; I was just wondering how the methods compare on your application. I'm not arguing against adding this; I agree our understanding is not complete ;)
Class weights (in most cases) correspond to oversampling the minority class.
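
A small illustration of that correspondence (the class counts below are made up): the 'balanced' heuristic gives each class a weight inversely proportional to its frequency, so each minority sample counts as if it had been duplicated.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.repeat([0, 1], [80, 20])  # 80 majority vs. 20 minority samples
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
# n_samples / (n_classes * bincount(y)) -> [0.625, 2.5]: each minority
# sample weighs 4x as much, mirroring 4x oversampling of that class.
print(weights)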

@raamana
Contributor Author

raamana commented Dec 9, 2015

I have yet to compare the two approaches for my application. My efforts so far have focused on improving the accuracy and interpretability of the various features I have developed. As I use scikit-learn more, I will try to perform the comparison.

Did you take a look at the code changes themselves? Is it ready to be merged? Thanks.

@amueller
Member

Sorry for the long turnaround.
I think this functionality should go in a separate class, not ShuffleSplit. I don't have a good name for it, though, as we already used the term "stratified" for exactly the opposite. Maybe BalancedShuffleSplit?
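
For discussion, a minimal sketch of how the split logic described in this PR could be packaged in the model_selection splitter style; the name BalancedShuffleSplit and its behaviour (equal per-class training samples, remainder to the test set) follow the suggestions in this thread and are not existing scikit-learn API.

import numpy as np
from sklearn.utils import check_random_state

class BalancedShuffleSplit(object):
    """Sketch: equal-sized training classes, remainder goes to the test set."""

    def __init__(self, n_splits=10, test_size=0.3, random_state=None):
        self.n_splits = n_splits
        self.test_size = test_size
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y, groups=None):
        rng = check_random_state(self.random_state)
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        n_train = int(np.floor((1 - self.test_size) * counts.min()))
        for _ in range(self.n_splits):
            train_idx, test_idx = [], []
            for klass in classes:
                idx = rng.permutation(np.flatnonzero(y == klass))
                train_idx.append(idx[:n_train])
                test_idx.append(idx[n_train:])
            yield np.hstack(train_idx), np.hstack(test_idx)

Because it implements split and get_n_splits, an object like this should be usable as the cv argument to cross_val_score and similar utilities.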

@raamana
Contributor Author

raamana commented Sep 14, 2016

That's a good idea. And I like the name, as it clearly conveys the motivation and its function.

Did you get a chance to review the code?

@amueller
Member

Well, the code needs to change if we want this in a separate class...

@jnothman
Member

jnothman commented Sep 15, 2016

With the new splitter structure, resamplers as meta-estimators would be the right approach, I think:

import numpy as np

from sklearn.model_selection import check_cv
from sklearn.utils import check_random_state
from sklearn.utils.random import sample_without_replacement

class Undersampler(object):
    def __init__(self, base_cv=None, random_state=None):
        self.base_cv = base_cv
        self.random_state = random_state

    def split(self, X, y):
        random_state = check_random_state(self.random_state)
        # TODO: check and encode y
        for train_idx, test_idx in check_cv(self.base_cv, y, classifier=True).split(X, y):
            y_split = y.take(train_idx)
            counts = np.bincount(y_split)
            # undersample every class in the training fold to the size of
            # the smallest class; the test fold is left untouched
            n_per_class = np.min(counts)
            out_train_idx = []
            for i in range(len(counts)):
                i_sample = sample_without_replacement(counts[i], n_per_class,
                                                      random_state=random_state)
                # map fold-relative positions back to dataset indices
                out_train_idx.append(
                    train_idx.take(np.flatnonzero(y_split == i).take(i_sample)))
            yield np.hstack(out_train_idx), test_idx

Is there something I've missed about how this would operate in a cross-validation context?
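
A hypothetical usage of the sketch above with synthetic data, assuming the Undersampler class has been defined and StratifiedShuffleSplit is used as the base CV:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced data: 30 samples of class 0, 80 of class 1.
rng = np.random.RandomState(0)
X = rng.randn(110, 5)
y = np.repeat([0, 1], [30, 80])

cv = Undersampler(base_cv=StratifiedShuffleSplit(n_splits=3, test_size=0.3,
                                                 random_state=0),
                  random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # training folds come out balanced, e.g. [21 21]; test folds keep the
    # base splitter's stratified proportions, e.g. [ 9 24]
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))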

@jnothman
Member

Also relevant is that class_weight='balanced' is often available to achieve similar ends.

@jnothman
Member

In terms of a general resampler, see #1454

@amueller
Member

@jnothman you mean meta-splitters, right?
Your implementation is quite different from @raamana's in that it throws away some samples, while @raamana's puts them in the test set.

I feel this feature is somehow tied to the cross-validation strategy. I think this behavior would be more natural for the stratified CV objects than the non-stratified ones.

It's not entirely clear to me whether we want to balance the test set, keep the test set as-is, or put "the rest" into the test set.

@raamana
Contributor Author

raamana commented Oct 13, 2016

I would support balancing the training set (based on a user-requested percentage of the smallest class) and adding the rest (from each class that is larger than the smallest class) to the test set. This is what my original implementation does.

@jnothman
Member

I may need to review what I wrote, but I'm not sure I intended to throw away samples, @amueller. It depends on the sampling regime...?

I don't think you usually want to balance the test set: surely you're better off just reporting metrics sensitive to class imbalance.
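
One concrete way to do that (illustrative numbers, loosely matching the 9 vs. 59 test split from the first example above): report per-class recall or balanced accuracy instead of plain accuracy.

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# An imbalanced test set (9 vs. 59) where the classifier favours class 1.
y_true = np.repeat([0, 1], [9, 59])
y_pred = np.concatenate([np.repeat(0, 5), np.repeat(1, 4), np.repeat(1, 59)])

print(accuracy_score(y_true, y_pred))              # ~0.94, looks great
print(recall_score(y_true, y_pred, average=None))  # [~0.56, 1.0] per class
print(balanced_accuracy_score(y_true, y_pred))     # ~0.78, less flattering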

@amueller
Member

@jnothman you don't change the test-set but you subsample the training set, right? That should discard samples?

@jnothman
Member

Undersample or oversample, both should work since we use indices for CV splitters.


@amueller added the API and Needs Decision labels on Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:48
@adrinjalali
Member

I think this would require a new PR, and probably #26821 would be a better way forward.
