[Feature-Request] Add a flag to StratifiedKFold to force classes with only 1 sample in training #10767

Open
@akhilkedia

Description

Add a flag to StratifiedKFold that ensures every class is present in the training set.
With StratifiedKFold.split, if a class has only 1 sample, that sample may currently be placed in the test split rather than the training split, so the model never sees that class during training for that fold. (sklearn does emit a warning.)
While this can be acceptable for some applications, a flag that forces classes with a single sample to always be in the training split would be helpful.

Steps/Code to Reproduce

from sklearn.model_selection import StratifiedKFold
import numpy as np
X = np.array([0, 1, 2])
y = np.array([0, 0, 1])


skf = StratifiedKFold(n_splits=2, random_state=0, shuffle=True)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Expected Results

There should be a flag in StratifiedKFold so that at least 1 element of class 1 is always present in the training split (in this case, the element at index 2).

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=2.
  % (min_groups, self.n_splits)), Warning)
TRAIN: [0 2] TEST: [1]
TRAIN: [1 2] TEST: [0]

Actual Results

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=2.
  % (min_groups, self.n_splits)), Warning)
TRAIN: [0 2] TEST: [1]
TRAIN: [1] TEST: [0 2]
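
Until such a flag exists, the expected behaviour can be approximated outside the library. The following is only a minimal sketch of one possible workaround, not a proposed implementation: `split_singletons_in_train` is a hypothetical helper (it is not part of scikit-learn) that stratifies only the samples whose class has more than one member and then appends the single-sample classes to every training split. It assumes at least one class has enough members for the chosen n_splits.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def split_singletons_in_train(cv, y):
    """Yield (train, test) index arrays with single-sample classes kept in train.

    Hypothetical helper for illustration; not a scikit-learn API.
    `cv` is any splitter with a split(X, y) method, e.g. StratifiedKFold.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Boolean mask of samples whose class occurs exactly once in y.
    singleton_mask = np.isin(y, classes[counts == 1])
    singleton_idx = np.flatnonzero(singleton_mask)
    rest_idx = np.flatnonzero(~singleton_mask)

    # Stratify only the remaining samples; rest_idx stands in for X because
    # the splitter only needs the number of samples and the labels.
    for train_pos, test_pos in cv.split(rest_idx, y[rest_idx]):
        # Map positions back to original indices and add the singletons to train.
        train = np.sort(np.concatenate([rest_idx[train_pos], singleton_idx]))
        test = rest_idx[test_pos]
        yield train, test


# Same data as in the reproduction above.
X = np.array([0, 1, 2])
y = np.array([0, 0, 1])

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
splits = list(split_singletons_in_train(skf, y))
for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```

With this sketch, index 2 appears in every TRAIN split and never in a TEST split. Classes with more than one sample but fewer than n_splits members are not handled and would still trigger the warning.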

Versions

Linux-4.13.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1
