Stratifying Across Classes During Training in ShuffleSplit #5965 #5972
Conversation
…ate_shuffle_split.
…PyCharm seems to have not saved my previous changes!
…train and n_test.
… sizes, and to deal with nose tests properly.
Hey @jnothman, can you take a look? Thanks.
I don't think we should call this stratifying (even though this is a common usage of the word). In scikit-learn, stratifying means keeping the class frequencies the same for each split. Btw, have you compared this against using …?
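For reference, a minimal sketch of the stratification behaviour described above: StratifiedShuffleSplit keeps the class frequencies roughly the same in both parts of each split. The 30/80 class sizes below are made up to match the first example reported later in this PR.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced toy labels: 30 samples of class 0, 80 of class 1.
y = np.array([0] * 30 + [1] * 80)
X = np.zeros((len(y), 1))  # features do not affect the split itself

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Both sides keep roughly the original 30:80 ratio,
    # e.g. train counts [21 56] and test counts [ 9 24].
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))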
Thanks for the pointer, Andreas, I took a quick look. Correct me if I am wrong, but it seems to me that the current implementation … Depending on the goals of the user, e.g. minimizing the …
I didn't claim there was no use-case, I was just wondering how the methods compare on your application. I'm not arguing against adding this, I agree our understanding is not complete ;)
I have yet to compare the two approaches for my application. My efforts so far have focused on improving the accuracy and interpretability of the various features I have developed. As I make more use of scikit-learn, I will try performing the comparison. Did you take a look at the changes in the code itself? Is it ready to be merged? Thanks.
Sorry for the long turn-around. …
That's a good idea. And I like the name, as it clearly conveys the motivation and its function. Did you get a chance to review the code?
Well, the code needs to change if we want this in a separate class...
With the new splitter structure, resamplers as meta-estimators would be the right approach, I think:

import numpy as np

from sklearn.model_selection import check_cv
from sklearn.utils import check_random_state
from sklearn.utils.random import sample_without_replacement


class Undersampler(object):
    def __init__(self, base_cv=None, random_state=None):
        self.base_cv = base_cv
        self.random_state = random_state

    def split(self, X, y):
        random_state = check_random_state(self.random_state)
        # TODO: check and encode y
        for train_idx, test_idx in check_cv(self.base_cv, classifier=True).split(X, y):
            y_split = y.take(train_idx)
            counts = np.bincount(y_split)
            # undersample every class down to the size of the smallest class
            n_per_class = np.min(counts)
            out_train_idx = []
            for i in range(len(counts)):
                i_sample = sample_without_replacement(counts[i], n_per_class,
                                                      random_state=random_state)
                # map positions within the training fold back to dataset indices
                out_train_idx.append(train_idx[np.flatnonzero(y_split == i)].take(i_sample))
            yield np.hstack(out_train_idx), test_idx

Is there something I've missed about how this would operate in a cross-validation context?
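For concreteness, here is one way the sketch above might be exercised directly. The toy labels and the StratifiedKFold(3) base_cv are arbitrary choices for illustration, and y is assumed to be integer-encoded since the TODO about encoding is left unimplemented.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: 30 samples of class 0, 80 of class 1.
rng = np.random.RandomState(0)
y = np.array([0] * 30 + [1] * 80)
X = rng.rand(len(y), 2)

sampler = Undersampler(base_cv=StratifiedKFold(3), random_state=0)
for train_idx, test_idx in sampler.split(X, y):
    # Training folds are balanced down to the minority class;
    # test folds keep the proportions produced by the base CV.
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))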
Also relevant is that …
In terms of a general resampler, see #1454
@jnothman you mean meta-splitters, right? I feel this feature is somehow tied to the cross-validation strategy. I think this behavior would be more natural for the stratified CV objects than the non-stratified ones. It's not entirely clear to me whether we want to balance the test set, keep the test set as-is, or put "the rest" into the test set.
I would support balancing the training set (based on a user-requested percentage of the smallest class) and adding the rest (from each class that is larger than the smallest class) to the test set. This is what my original implementation does.
I maybe need to review what I wrote, but I'm not sure I intended to throw away samples, @amueller. Depends on the sampling regime..? I don't think you usually want to balance the test set: surely you're better off just reporting metrics sensitive to class imbalance.
@jnothman you don't change the test set, but you subsample the training set, right? That would discard samples?
Undersample or oversample; both should work since we use indices for CV.
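As a rough counterpart to the Undersampler sketch above, an oversampling variant might look like the following: each class is resampled with replacement up to the size of the largest class, relying on the fact that index-based CV tolerates duplicated training indices. This Oversampler class is purely illustrative and not part of the PR.

import numpy as np
from sklearn.model_selection import check_cv
from sklearn.utils import check_random_state


class Oversampler(object):
    def __init__(self, base_cv=None, random_state=None):
        self.base_cv = base_cv
        self.random_state = random_state

    def split(self, X, y):
        rng = check_random_state(self.random_state)
        # TODO: check and encode y, as in the Undersampler sketch
        for train_idx, test_idx in check_cv(self.base_cv, classifier=True).split(X, y):
            y_split = y.take(train_idx)
            counts = np.bincount(y_split)
            # oversample every class up to the size of the largest class
            n_per_class = np.max(counts)
            out_train_idx = []
            for i in np.flatnonzero(counts):
                class_idx = train_idx[np.flatnonzero(y_split == i)]
                out_train_idx.append(rng.choice(class_idx, n_per_class, replace=True))
            yield np.hstack(out_train_idx), test_idx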
I think this would require a new PR, and probably #26821 would be a better way forward. |
This is the PR regarding Stratifying Across Classes in ShuffleSplit #5965, in which I have done the following:
Introduced a new optional flag stratify_across_classes to StratifiedShuffleSplit.
Programmed the inner workings to achieve the expected CV splits.
Enhanced the validation of the train_size and test_size specification (ensuring they sum to 1).
Also, I feel there might be a potential bug in the ShuffleSplit class, which primarily anchors on test_size and doesn't handle the case where only train_size is specified (inferring test_size appropriately). It seems that not much testing has been performed for the case where the user specifies only train_size.
Example outputs from running the committed code:
Class sizes: [30, 80]
test_size: 0.3
------- specifying test_size --------
With the proposed stratification:
Training: 21 21 Testing: 9 59
Without
Training: 21 56 Testing: 9 24
Class sizes: [70, 50]
test_size: 0.3
------- specifying test_size --------
With the proposed stratification:
Training: 35 35 Testing: 35 15
Without
Training: 49 35 Testing: 21 15
Class sizes: [20, 90]
test_size: 0.1
------- specifying test_size --------
With the proposed stratification:
Training: 18 18 Testing: 2 72
Without
Training: 18 81 Testing: 2 9
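To make the arithmetic behind these outputs explicit, here is a small stand-alone sketch consistent with the numbers above; proposed_split_counts is just an illustrative helper, not the PR's code. The scheme appears to keep floor((1 - test_size) * smallest class size) samples from every class for training and to put everything left over into the test set.

import math

def proposed_split_counts(class_sizes, test_size):
    # Train on an equal number of samples from each class, limited by the
    # smallest class; everything left over goes into the test set.
    n_train_per_class = math.floor((1 - test_size) * min(class_sizes))
    train = [n_train_per_class] * len(class_sizes)
    test = [size - n_train_per_class for size in class_sizes]
    return train, test

print(proposed_split_counts([30, 80], 0.3))  # ([21, 21], [9, 59])
print(proposed_split_counts([70, 50], 0.3))  # ([35, 35], [35, 15])
print(proposed_split_counts([20, 90], 0.1))  # ([18, 18], [2, 72])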