
Stratified splitters user warnings #28628

Open · wants to merge 6 commits into main

Conversation

myenugula (Contributor)

Reference Issues/PRs

#28609

What does this implement/fix? Explain your changes.

Implement a warning in StratifiedKFold, StratifiedGroupKFold, and StratifiedShuffleSplit to alert users when only a single class is present in the target variable, suggesting that stratified splitting might not be appropriate and guiding towards more suitable cross-validation strategies.
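
For reference, here is a minimal sketch of the kind of single-class check behind these warnings; the helper name and its placement are illustrative assumptions, not the actual diff in _split.py.

import warnings
import numpy as np

def _warn_if_single_class(y, cls_name, alternative=None):
    # Hypothetical helper: warn when the target contains fewer than two classes.
    if len(np.unique(y)) < 2:
        msg = (
            f"Only one class present in y. {cls_name} is designed to be used "
            "with data that contains two or more classes."
        )
        if alternative is not None:
            msg += f" Consider using {alternative} instead."
        warnings.warn(msg, UserWarning)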

Any other comments?

Here is a code sample using the three classes; each one emits the corresponding user warning.

import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold, StratifiedShuffleSplit
import warnings

# Function to suppress the traceback of warnings
def custom_formatwarning(msg, *args, **kwargs):
    return str(msg) + '\n'

warnings.formatwarning = custom_formatwarning
warnings.filterwarnings('default', category=UserWarning)

# StratifiedKFold example
print("StratifiedKFold Example:")
X = np.random.rand(10, 2)  # 10 samples, 2 features
y_skf = np.zeros(10)  # Target variable with one class
skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y_skf):
    print(f"TRAIN indices: {train_index}, TEST indices: {test_index}")

# StratifiedGroupKFold example
print("\nStratifiedGroupKFold Example:")
X = np.random.rand(15, 2)  # 15 samples, 2 features
y_sgkf = np.ones(15)  # Target variable with one class
groups = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7])
sgkf = StratifiedGroupKFold(n_splits=3)
for train_index, test_index in sgkf.split(X, y_sgkf, groups=groups):
    print(f"TRAIN indices: {train_index}, TEST indices: {test_index}")

# StratifiedShuffleSplit example
print("\nStratifiedShuffleSplit Example:")
X = np.random.rand(20, 2)  # 20 samples, 2 features
y_sss = np.zeros(20)  # Target variable with one class
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=42)
for train_index, test_index in sss.split(X, y_sss):
    print(f"TRAIN indices: {train_index}, TEST indices: {test_index}")

Output:

Only one class present in y. StratifiedKFold is designed to be used with data that contains two or more classes. Consider using KFold instead.
Only one class present in y. StratifiedGroupKFold is designed to be used with data that contains two or more classes. The single-class scenario might not be suitable for stratified folds.
Only one class present in y. StratifiedShuffleSplit is designed to be used with data that contains two or more classes. Consider using ShuffleSplit instead.


github-actions bot commented Mar 14, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c21608d.

@oasidorshin

@myenugula Also don't forget train_test_split with stratify=True and RepeatedStratifiedKFold.

@myenugula (Contributor, Author)

@oasidorshin The RepeatedStratifiedKFold class uses StratifiedKFold internally to split, which is how the warning gets raised. For example:

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 0])
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=1)

for train, test in rskf.split(X, y):
    pass

Output:

/Users/myenugula/PycharmProjects/scikit-learn/sklearn/model_selection/_split.py:850: UserWarning: Only one class present in y. StratifiedKFold is designed to be used with data that contains two or more classes. Consider using KFold instead.

Similarly, train_test_split uses StratifiedShuffleSplit to split the data when stratify is passed. For example:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
train_test_split(y, shuffle=True, stratify=[0, 0, 0, 0, 0])

Output:

/Users/myenugula/PycharmProjects/scikit-learn/sklearn/model_selection/_split.py:2343: UserWarning: Only one class present in y. StratifiedShuffleSplit is designed to be used with data that contains two or more classes. Consider using ShuffleSplit instead.

However, I'm thinking of replacing

        if stratify is not None:
            CVClass = StratifiedShuffleSplit
        else:
            CVClass = ShuffleSplit

with

        if stratify is not None and len(np.unique(stratify)) > 1:
            CVClass = StratifiedShuffleSplit
        else:
            CVClass = ShuffleSplit

in train_test_split to avoid the user warning when only one class is passed.
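
As a quick illustration of the behaviour this targets, the call below currently triggers the warning through StratifiedShuffleSplit; with the check sketched above (an assumption about the proposed behaviour, not merged code), it would fall back to ShuffleSplit and stay silent.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = [0, 0, 0, 0, 0]  # single-class stratify target
# With this PR's warning, stratifying on one class emits the UserWarning via
# StratifiedShuffleSplit; the proposed check would route this to ShuffleSplit.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)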

@oasidorshin

@myenugula Thank you, looks good!

I think that

        if stratify is not None and len(np.unique(stratify)) > 1:
            CVClass = StratifiedShuffleSplit
        else:
            CVClass = ShuffleSplit

is the way to go

@myenugula (Contributor, Author)

Any update on this?

@VyankateshRohokale

Is there something more you want to add to this issue? Maybe I could help.

@myenugula (Contributor, Author)

Hello, any update?

@myenugula (Contributor, Author)

Hi @lesteve, could you please review this PR?
