Description
Add a flag to StratifiedKFold that ensures each class is present in the training set.
For StratifiedKFold.split, if some class has only 1 sample, that sample can currently end up in the test split rather than the training split. (sklearn does give a warning.)
While this can be acceptable for some applications, a flag that forces classes with a single sample to always be in the training split would be helpful.
Steps/Code to Reproduce
from sklearn.model_selection import StratifiedKFold
import numpy as np
X = np.array([0, 1, 2])
y = np.array([0, 0, 1])
skf = StratifiedKFold(n_splits=2, random_state=0, shuffle=True)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Expected Results
There should be a flag in StratifiedKFold so that at least 1 element of class 1 is always present in the training split (in this case, the element at index 2).
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=2.
% (min_groups, self.n_splits)), Warning)
TRAIN: [0 2] TEST: [1]
TRAIN: [1 2] TEST: [0]
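Until such a flag exists, the behaviour can be approximated outside of scikit-learn. The following is only a rough sketch of a workaround, not a proposed implementation: the helper name split_keeping_singletons_in_train is hypothetical and not part of scikit-learn. It post-processes the folds from StratifiedKFold.split and moves any sample whose class occurs exactly once from the test indices into the training indices. It also silences the warning while splitting, which may not be desirable in every case.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold


def split_keeping_singletons_in_train(skf, X, y):
    """Yield (train, test) index arrays with single-sample classes kept in train.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Boolean mask over samples: True where the sample's class occurs only once.
    singleton_mask = np.isin(y, classes[counts == 1])

    # The splitter warns about the under-populated class; suppress it here.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        splits = list(skf.split(X, y))

    for train_index, test_index in splits:
        in_test = singleton_mask[test_index]
        # Move singleton-class samples from the test fold to the training fold.
        train_index = np.sort(np.concatenate([train_index, test_index[in_test]]))
        test_index = test_index[~in_test]
        yield train_index, test_index


X = np.array([0, 1, 2])
y = np.array([0, 0, 1])
skf = StratifiedKFold(n_splits=2, random_state=0, shuffle=True)
for train_index, test_index in split_keeping_singletons_in_train(skf, X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
```

With this wrapper the element at index 2 appears in every training split. The trade-off is that a single-sample class is never evaluated on, and the test folds can become slightly smaller than what StratifiedKFold produced.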
Actual Results
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=2.
% (min_groups, self.n_splits)), Warning)
TRAIN: [0 2] TEST: [1]
TRAIN: [1] TEST: [0 2]
Versions
Linux-4.13.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1