Hello sk'ers,
Great work in developing scikit-learn!
An under-appreciated yet important issue in machine learning is class imbalance during classifier training [1,2,3]. SVMs in particular are sensitive to it [1]. Class imbalance is not uncommon, especially in my field of neuroimaging datasets. Where sample sizes are small, it can be difficult to discard data to equalize the class sizes during training. I have addressed this by implementing a variant of repeated hold-out (implemented as ShuffleSplit here) that stratifies the class sizes within the training set: the training size for every class is a fixed, user-chosen percentage of the smallest class. This has helped me achieve balanced sensitivity and specificity in my predictive models, and I feel it would be a worthy inclusion in scikit-learn.
This could be achieved via an optional flag, such as
stratify_across_classes_in_training_set=True
passed to StratifiedShuffleSplit, which would change the behaviour at this line:
n_i = np.round(n_train * p_i).astype(int)
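To make the proposal concrete, here is a minimal sketch of the suggested behaviour as a standalone generator. This is a hypothetical helper (the function name and signature are my own, not scikit-learn API): every class contributes the same number of training samples, namely the chosen fraction of the smallest class.

```python
import numpy as np

def balanced_shuffle_split(y, train_frac=0.9, n_splits=5, random_state=None):
    """Sketch of the proposed option: draw train_frac of the *smallest*
    class from every class, so the training set is perfectly balanced.
    Remaining samples go to the test set. Hypothetical, not sklearn API."""
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # per-class training size is based on the smallest class
    n_per_class = int(np.round(train_frac * counts.min()))
    for _ in range(n_splits):
        train_parts = []
        for cls in classes:
            idx = np.flatnonzero(y == cls)
            rng.shuffle(idx)
            train_parts.append(idx[:n_per_class])
        train = np.concatenate(train_parts)
        test = np.setdiff1d(np.arange(len(y)), train)
        yield train, test
```

With classes of sizes 100 and 30 and train_frac=0.9, every split trains on 27 samples per class (0.9 of the smallest class) and tests on the rest.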
Let me know what you think - I would be very happy to contribute this to the ShuffleSplit (and perhaps KFold) implementations of CV.
References:
Batuwita, R., & Palade, V. (2012). Class imbalance learning methods for support vector machines.
Visa, S., & Ralescu, A. (2005). Issues in mining imbalanced data sets - a review paper. Proceedings of the Sixteen Midwest Artificial ….
Wallace, B. C., Small, K., Brodley, C. E., & Wang, L. (2011). Class imbalance, redux. Data Mining (ICDM).
Raamana, P. R., Weiner, M. W., Wang, L., & Beg, M. F. (2015). Thickness network features for prognostic applications in dementia. Neurobiology of Aging, 36, S91–S102. http://doi.org/10.1016/j.neurobiolaging.2014.05.040
To make this explicit, could you please submit a PR that modifies StratifiedShuffleSplit and includes a test of the sought behaviour that distinguishes it from the current behaviour?
Sure, I would be happy to. I just wanted your input before going ahead with it. Basically, it boils down to this: suppose you have two classes of sizes 100 and 30 (a large class imbalance!) and the user chooses a 90% training size for StratifiedShuffleSplit. The current behaviour trains the classifier on 90% of class 1 (90 samples) and 90% of class 2 (27 samples), and tests on the rest (10 from class 1 and 3 from class 2). There is still a large imbalance between the two classes in the training set (90 vs. 27). Under the proposed change, I would restrict the training size of both classes to 90% of the smallest class - that is, 27 subjects from each class for training, with the rest for testing (73 and 3).
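The arithmetic in that example can be checked with a few lines of NumPy (a plain calculation, not an actual call into either splitter):

```python
import numpy as np

counts = np.array([100, 30])  # class sizes from the example
train_frac = 0.9

# current StratifiedShuffleSplit behaviour: 90% of *each* class
current_train = np.round(train_frac * counts).astype(int)          # [90, 27]
current_test = counts - current_train                              # [10, 3]

# proposed behaviour: 90% of the *smallest* class, for every class
n_per_class = int(np.round(train_frac * counts.min()))
proposed_train = np.full_like(counts, n_per_class)                 # [27, 27]
proposed_test = counts - proposed_train                            # [73, 3]
```

The training set goes from 90 vs. 27 (a 10:3 imbalance) to a perfectly balanced 27 vs. 27, at the cost of a much larger and now imbalanced test set.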
What that would mean for the resulting test set, whether that would be okay, and when that would be okay are important questions worthy of discussion - but in my opinion, adding this option to this great toolkit is worth it.