Stratifying Across Classes in ShuffleSplit #5965

Open
raamana opened this issue Dec 5, 2015 · 2 comments

@raamana
Contributor

raamana commented Dec 5, 2015

Hello sk'ers,

Great work guys in developing scikit-learn!

An under-appreciated yet important issue in machine learning is class imbalance during classifier training [1,2,3]; SVMs in particular are sensitive to it [1]. Class imbalance is not uncommon, especially in my world of neuroimaging datasets, where sample sizes are small and it can be difficult to let go of data to equalize the class sizes during training. I have overcome this by implementing a variant of repeated hold-out (implemented here as ShuffleSplit) that stratifies the class sizes within the training set: the training size for every class is a fixed, user-chosen percentage of the smallest class. This has helped me achieve balanced sensitivity and specificity in my predictive models [4], and I feel it would be a worthy inclusion in scikit-learn.

This could be achieved with an optional flag, such as

stratify_across_classes_in_training_set=True

on StratifiedShuffleSplit, which would change the behaviour at this line:

 n_i = np.round(n_train * p_i).astype(int)
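
For concreteness, a minimal sketch of the proposed sizing rule (the helper name balanced_train_counts and the train_frac parameter are illustrative, not scikit-learn internals):

```python
import numpy as np

# Hypothetical sketch of the proposed flag's effect; names are illustrative.
#   current:  n_i = round(n_train * p_i)          -> preserves the imbalance
#   proposed: n_i = round(train_frac * min count) -> equal sizes for all classes

def balanced_train_counts(class_counts, train_frac):
    """Per-class training-set size: a fixed fraction of the smallest class,
    applied identically to every class."""
    class_counts = np.asarray(class_counts)
    n_per_class = int(np.round(train_frac * class_counts.min()))
    return np.full(len(class_counts), n_per_class)

# e.g. balanced_train_counts([100, 30], 0.9) -> array([27, 27])
```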

Let me know what you think. I would be very happy to contribute this to the ShuffleSplit (and perhaps KFold) implementations of CV.

References:

  1. Batuwita, R., & Palade, V. (2012). Class imbalance learning methods for support vector machines.
  2. Visa, S., & Ralescu, A. (2005). Issues in mining imbalanced data sets: a review paper. Proceedings of the Sixteenth Midwest Artificial ….
  3. Wallace, B. C., Small, K., Brodley, C. E., & Wang, L. (2011). Class imbalance, redux. In Proceedings of the IEEE International Conference on Data Mining (ICDM).
  4. Raamana, P. R., Weiner, M. W., Wang, L., & Beg, M. F. (2015). Thickness network features for prognostic applications in dementia. Neurobiology of Aging, 36, S91–S102. http://doi.org/10.1016/j.neurobiolaging.2014.05.040
@jnothman
Member

jnothman commented Dec 6, 2015

To make this explicit, could you please submit a PR that modifies StratifiedShuffleSplit and includes a test of the sought behaviour that distinguishes it from the current behaviour?

@raamana
Contributor Author

raamana commented Dec 6, 2015

Sure, I would be happy to. I just wanted to get your input before going ahead. Basically it boils down to this: suppose you have two classes of sizes 100 and 30 (a large class imbalance) and the user chooses a 90% training size for StratifiedShuffleSplit. The current behaviour trains the classifier on 90% of class 1 (90 samples) and 90% of class 2 (27 samples) and tests on the rest (10 from class 1, 3 from class 2), which leaves a large imbalance between the two classes within the training set (90 vs. 27). Under the proposed change, both classes would be restricted to 90% of the smallest class, i.e. 27 samples from each class for training, with the rest (73 and 3) for testing.
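
To make the arithmetic concrete, here is a small self-contained sketch of the proposed behaviour (an illustration only, not a patch against scikit-learn; balanced_shuffle_split is a hypothetical name):

```python
import numpy as np

def balanced_shuffle_split(y, n_iter=10, train_frac=0.9, random_state=None):
    """Yield (train, test) index arrays in which every class contributes the
    same number of training samples: train_frac of the smallest class."""
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_train_per_class = int(np.round(train_frac * counts.min()))
    for _ in range(n_iter):
        train, test = [], []
        for cls in classes:
            # Shuffle this class's indices, then take the capped training share.
            idx = rng.permutation(np.where(y == cls)[0])
            train.extend(idx[:n_train_per_class])
            test.extend(idx[n_train_per_class:])
        yield np.array(train), np.array(test)

# Classes of size 100 and 30 with train_frac=0.9: each split draws 27
# training samples per class; the remaining 73 and 3 form the test set.
y = np.array([0] * 100 + [1] * 30)
train_idx, test_idx = next(balanced_shuffle_split(y, random_state=0))
assert len(train_idx) == 54 and len(test_idx) == 76
```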

What that would mean, whether it would be okay, and when it would be okay are important questions worthy of discussion, but in my opinion adding this option to this great toolkit is worth it.

I will be submitting the PR soon, thank you.
