Stratifying Across Classes in ShuffleSplit #5965

Open
raamana opened this issue Dec 5, 2015 · 2 comments

@raamana
Contributor

raamana commented Dec 5, 2015

Hello sk'ers,

Great work guys in developing scikit-learn!

An under-appreciated yet important issue in machine learning is class imbalance during classifier training [1,2,3]; SVMs in particular are sensitive to it [1]. Class imbalance is not uncommon, especially in my world of neuroimaging datasets, where sample sizes are small and it can be difficult to let go of data to equalize the class sizes during training. I have overcome this by implementing a variant of repeated hold-out (implemented here as ShuffleSplit) that stratifies the class sizes within the training set: the training size for every class is a fixed, user-chosen percentage of the smallest class. This has helped me achieve balanced sensitivity and specificity in my predictive models [4], and I feel it would be a worthy inclusion in scikit-learn.

This could be achieved with an optional flag, such as

stratify_across_classes_in_training_set=True

on StratifiedShuffleSplit, which would change the behaviour at this line:

 n_i = np.round(n_train * p_i).astype(int)
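
For concreteness, a minimal sketch of the proposed sizing rule (the helper name balanced_train_counts and the train_frac parameter are illustrative, not scikit-learn internals):

```python
import numpy as np

# Hypothetical sketch of the proposed flag's effect; names are illustrative.
#   current:  n_i = round(n_train * p_i)          -> preserves the imbalance
#   proposed: n_i = round(train_frac * min count) -> equal sizes for all classes

def balanced_train_counts(class_counts, train_frac):
    """Per-class training-set size: a fixed fraction of the smallest class,
    applied identically to every class."""
    class_counts = np.asarray(class_counts)
    n_per_class = int(np.round(train_frac * class_counts.min()))
    return np.full(len(class_counts), n_per_class)

# e.g. balanced_train_counts([100, 30], 0.9) -> array([27, 27])
```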

Let me know what you think. I would be very happy to contribute this to the ShuffleSplit (and perhaps KFold) implementations of CV.

References:

  1. Batuwita, R., & Palade, V. (2012). Class imbalance learning methods for support vector machines.
  2. Visa, S., & Ralescu, A. (2005). Issues in mining imbalanced data sets: a review paper. Proceedings of the Sixteenth Midwest Artificial ….
  3. Wallace, B. C., Small, K., Brodley, C. E., & Wang, L. (2011). Class imbalance, redux. In Proceedings of the IEEE International Conference on Data Mining (ICDM).
  4. Raamana, P. R., Weiner, M. W., Wang, L., & Beg, M. F. (2015). Thickness network features for prognostic applications in dementia. Neurobiology of Aging, 36, S91–S102. http://doi.org/10.1016/j.neurobiolaging.2014.05.040
@jnothman
Member

jnothman commented Dec 6, 2015

To make this explicit, could you please submit a PR that modifies StratifiedShuffleSplit and includes a test of the sought behaviour that distinguishes it from the current behaviour?

@raamana
Contributor Author

raamana commented Dec 6, 2015

Sure, I would be happy to. I just wanted to get your input before going ahead. Basically it boils down to this: suppose you have two classes of sizes 100 and 30 (a large class imbalance) and the user chooses a 90% training size for StratifiedShuffleSplit. The current behaviour trains the classifier on 90% of class 1 (90 samples) and 90% of class 2 (27 samples) and tests on the rest (10 from class 1, 3 from class 2), which leaves a large imbalance between the two classes within the training set (90 vs. 27). Under the proposed change, both classes would be restricted to 90% of the smallest class, i.e. 27 samples from each class for training, with the rest (73 and 3) for testing.
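
To make the arithmetic concrete, here is a small self-contained sketch of the proposed behaviour (an illustration only, not a patch against scikit-learn; balanced_shuffle_split is a hypothetical name):

```python
import numpy as np

def balanced_shuffle_split(y, n_iter=10, train_frac=0.9, random_state=None):
    """Yield (train, test) index arrays in which every class contributes the
    same number of training samples: train_frac of the smallest class."""
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_train_per_class = int(np.round(train_frac * counts.min()))
    for _ in range(n_iter):
        train, test = [], []
        for cls in classes:
            # Shuffle this class's indices, then take the capped training share.
            idx = rng.permutation(np.where(y == cls)[0])
            train.extend(idx[:n_train_per_class])
            test.extend(idx[n_train_per_class:])
        yield np.array(train), np.array(test)

# Classes of size 100 and 30 with train_frac=0.9: each split draws 27
# training samples per class; the remaining 73 and 3 form the test set.
y = np.array([0] * 100 + [1] * 30)
train_idx, test_idx = next(balanced_shuffle_split(y, random_state=0))
assert len(train_idx) == 54 and len(test_idx) == 76
```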

What that would mean, whether it would be okay, and when it would be okay are important questions worthy of discussion, but in my opinion adding this option to this great toolkit is worth it.

I will be submitting the PR soon, thank you.
