FIX Shuffle each class's samples with different random_state in StratifiedKFold #13124

qinhanmin2014 · 2019-02-09T02:15:23Z

Fixes #13110
Our StratifiedKFold shuffles each class's samples with the same random_state (i.e., in the same way). This is not good IMO:
(1) We can only get certain splits.

import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(20)
y = [0] * 10 + [1] * 10
StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
# all the test folds in the split will be [a, b, 10+a, 10+b]

(2) When each test fold contains only two samples (one per class), users will always get the same splits, whatever random_state they pass.

import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(10)
y = [0] * 5 + [1] * 5
StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
# the test folds will always be [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
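
The two cases above can be reproduced without scikit-learn. The following is a minimal sketch using only Python's standard `random` module; it is a toy model of "K-fold within each class", not the library's implementation, and the names `simulate_folds`, `is_mirrored`, and `advance_rng` are hypothetical. It assumes equal class sizes that are divisible by `n_splits`.

```python
import random


def simulate_folds(n_per_class, n_classes, n_splits, seed, advance_rng):
    """Toy model of 'K-fold within each class' (not scikit-learn's code).

    Assumes n_per_class is divisible by n_splits, for simplicity.
    """
    shared = random.Random(seed)  # single generator, used when advance_rng is True
    fold_size = n_per_class // n_splits
    folds = [[] for _ in range(n_splits)]
    for c in range(n_classes):
        idx = list(range(c * n_per_class, (c + 1) * n_per_class))
        # advance_rng=False: re-seed for every class, so each class receives
        # the same positional permutation (the behaviour reported above).
        # advance_rng=True: keep drawing from one generator, so each class
        # receives its own permutation.
        rng = shared if advance_rng else random.Random(seed)
        rng.shuffle(idx)
        for k in range(n_splits):
            folds[k].extend(idx[k * fold_size:(k + 1) * fold_size])
    return [sorted(f) for f in folds]


def is_mirrored(fold, n_per_class):
    """For two classes: True if both classes contribute the same offsets."""
    half = len(fold) // 2
    first = [i % n_per_class for i in fold[:half]]
    second = [i % n_per_class for i in fold[half:]]
    return first == second


old = simulate_folds(10, 2, 5, seed=0, advance_rng=False)
new = simulate_folds(10, 2, 5, seed=0, advance_rng=True)
print(all(is_mirrored(f, 10) for f in old))  # True: every fold is [a, b, 10+a, 10+b]
print(new)  # folds are no longer forced into the mirrored pattern
```

With `n_per_class=5` and `n_splits=5`, the re-seeded variant yields the same set of folds for every seed, which matches case (2).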

This reverts part of #7823. I think it's a regression introduced in 0.19. Ping @jnothman @amueller

agramfort · 2019-02-10T17:35:14Z

I would add a test that, when shuffle is True, the results depend on random_state, for all CV classes that support this.

sklearn/model_selection/_split.py

jnothman · 2019-02-11T02:38:52Z

PR #7823 was included in 0.19.0, so if it's due to that, I don't think it's a regression since 0.19.x.

jnothman · 2019-02-11T02:53:51Z

There's no test here, and I'm quite confused about what's going on.

It seems that what's going on is we currently pass each class's samples to KFold(self.n_splits, shuffle=self.shuffle, random_state=self.random_state). This PR changes it so that instead of passing self.random_state we pass a RandomState that is progressed in each call to KFold. I don't really see why the current code causes the stated bug.

I will also note that I think there are two design errors here that have been raised previously which make this hard to work with:

1. In the sklearn.cross_validation -> sklearn.model_selection rewrite, we did not carefully design what to do with random states. Ideally check_random_state should be called in some fit method, or perhaps in __init__, such that each call to split is different. Instead we ended up calling check_random_state in split, such that you get the surprising behaviour that passing in a RandomState will give different results for each split but passing in an int will give fixed results for each split.
2. StratifiedKFold uses this "apply K-fold to each class" approach, which gives some weird (imbalanced) results. We previously had a sort-and-round-robin approach, which I think was rejected for the wrong reasons (all we needed was a stable sort), and it should be reinstated/available as a much simpler approach.
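
The first point can be illustrated with a small sketch. It uses Python's standard `random` module in place of NumPy's `RandomState`, and `resolve_rng` and `ToyShuffleSplitter` are hypothetical stand-ins rather than scikit-learn code. The only thing it is meant to show is what happens when the generator is resolved inside `split` rather than once up front.

```python
import random


def resolve_rng(random_state):
    """Hypothetical stand-in for a check_random_state-style helper."""
    if isinstance(random_state, random.Random):
        return random_state             # reuse the caller's generator (stateful)
    return random.Random(random_state)  # build a fresh generator from the seed


class ToyShuffleSplitter:
    """Illustrative splitter that resolves its generator inside split()."""

    def __init__(self, random_state):
        self.random_state = random_state

    def split(self, n_samples):
        rng = resolve_rng(self.random_state)  # resolved on every call
        order = list(range(n_samples))
        rng.shuffle(order)
        return order


int_seeded = ToyShuffleSplitter(random_state=0)
gen_seeded = ToyShuffleSplitter(random_state=random.Random(0))
print(int_seeded.split(20) == int_seeded.split(20))  # True: same order on every call
print(gen_seeded.split(20) == gen_seeded.split(20))  # the generator advanced in between
```

Under these assumptions an integer seed rebuilds the same generator on each call, while a generator instance carries its state from one call to the next.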

qinhanmin2014 · 2019-02-11T08:47:13Z

@jnothman I'm aware that there are some design issues (maybe errors) in StratifiedKFold, but my intention here is to fix bugs.
Apologies, I didn't describe the problem very well. I've updated the PR body. Please take some time to have a look.

jnothman · 2019-02-11T10:12:42Z

Do you still believe it's a regression since 0.19?

jnothman · 2019-02-11T10:14:27Z

Please add a test

qinhanmin2014 · 2019-02-11T10:59:00Z

> Do you still believe it's a regression since 0.19?

#7823 is included in 0.19.X (not included in 0.18.X), so I think it's a regression.

jnothman · 2019-02-11T11:30:27Z

Right, so a regression introduced in 0.19... so we should not be scheduling it as a 0.20.x fix.

NicolasHug

LGTM!

doc/whats_new/v0.21.rst

sklearn/model_selection/tests/test_split.py

jnothman

Thanks.

doc/whats_new/v0.21.rst

sklearn/model_selection/tests/test_split.py

qinhanmin2014 · 2019-02-12T01:14:24Z

> not each stratification, per se, but each class's samples

This term comes from the doc, and I've also updated the doc accordingly.

jnothman · 2019-02-12T01:43:44Z

Thanks @Arjannikov and @qinhanmin2014

Arjannikov · 2019-02-12T06:13:09Z

I think the solution is simpler than the title of this thread suggests.

My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all problems, however, it renders shuffle=True useless.

Therefore, I think the appropriate solution is to have StratifiedKFold() shuffle the data with the given random seed (we only have one random seed), and then go about things as normal. In this case, KFold() won't need to shuffle, so that option can be turned off.

jnothman · 2019-02-12T06:57:11Z

What is "the data" you want to shuffle? We want shuffled indices within each class. Shuffling the data, i.e. y, requires inverting the permutation before returning, so it's not really any simpler. But I agree that the description of the fix gives too much detail about the implementation.
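
To make the "inverting the permutation" step concrete, here is a minimal sketch in plain Python. The function name `shuffle_split_map_back` is hypothetical, and the round-robin dealing is only a stand-in for an unshuffled stratified split, not what scikit-learn does internally.

```python
import random


def shuffle_split_map_back(y, n_splits, seed):
    """Shuffle once up front, split without shuffling, then map indices back."""
    rng = random.Random(seed)
    perm = list(range(len(y)))
    rng.shuffle(perm)  # perm[pos] is the original index now sitting at pos
    y_shuffled = [y[i] for i in perm]

    # Stand-in for an unshuffled stratified split: deal each class's
    # positions round-robin into folds, in order of appearance.
    folds = [[] for _ in range(n_splits)]
    seen = {}
    for pos, label in enumerate(y_shuffled):
        k = seen.get(label, 0)
        folds[k % n_splits].append(pos)
        seen[label] = k + 1

    # The folds hold positions in the shuffled data; callers need indices
    # into the original data, so translate each position through perm.
    return [sorted(perm[pos] for pos in fold) for fold in folds]


y = [0] * 10 + [1] * 10
for fold in shuffle_split_map_back(y, n_splits=5, seed=0):
    print(fold)
```

The final translation through `perm` is the extra bookkeeping that shuffling the data up front would add.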

qinhanmin2014 · 2019-02-12T08:03:51Z

> My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all problems, however, it renders shuffle=True useless.

I agree that this can solve the problem. I chose another approach because that is what we did previously, and I guess there's not much difference between the two.

jnothman · 2019-02-13T00:58:13Z

I don't understand

qinhanmin2014 · 2019-02-13T01:34:49Z

> I don't understand

I mean we can either shuffle the entire dataset or shuffle each class's samples (with a different random_state). I chose the second way because we did so previously (and it's consistent with our doc).

jnothman · 2019-02-13T03:43:25Z

Oh, I wrote "I don't understand" in response to a comment by @Arjannikov that appears to have since been deleted

Arjannikov · 2019-02-13T04:05:40Z

Oh, my comment was: we only have one random seed, so it makes more sense to do the shuffle once, at the very beginning. But later I thought that using the same random seed for different subsets of the data would also work, so I deleted my comment. Sorry.

Arjannikov · 2019-02-13T04:06:58Z

What confuses me is @qinhanmin2014's note: "(with different random_state)".

Arjannikov · 2019-02-16T01:28:52Z

@qinhanmin2014, where do you get "different random_state"? StratifiedKFold() takes only one integer as its random_state parameter.

qinhanmin2014 · 2019-02-16T02:14:39Z

> @qinhanmin2014, where do you get "different random_state"? StratifiedKFold() takes only one integer as its random_state parameter.

We can produce different random states using one integer (the random_state parameter).
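
As a rough illustration of that point, the sketch below seeds one generator from a single integer and draws a separate permutation for each class. It uses Python's standard `random` module rather than NumPy, and `per_class_permutations` is a hypothetical name.

```python
import random


def per_class_permutations(seed, n_classes, n_per_class):
    """Draw one permutation per class from a single seeded generator."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_classes):
        p = list(range(n_per_class))
        rng.shuffle(p)  # the generator's state advances, so the next draw differs
        perms.append(p)
    return perms


for perm in per_class_permutations(seed=0, n_classes=2, n_per_class=10):
    print(perm)
```

The output is still fully determined by the one integer, so repeated runs with the same seed give the same per-class permutations.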

jnothman · 2019-02-27T09:09:14Z

We can put this in 0.20.X if people want. It should be merged either way (another pro-forma review anyone, or we can accept @NicolasHug's above?)

GaelVaroquaux

LGTM. 👍 for merge.

Merging.

ogrisel · 2019-03-01T16:30:39Z

The 0.20.3 tag has already been pushed without this fix but I think we should include it in the next LTS bugfix release.

jnothman · 2019-03-03T09:35:19Z

Did I fail to cherry-pick this? Sorry :(

…ifiedKFold (scikit-learn#13124) * Enable StratifiedKFold to produce different splits * what's new * redundant statement * update what's new * redundant comment * add a test * move what's new entry * review comment * review comment


Enable StratifiedKFold to produce different splits

743e73f

qinhanmin2014 added the Regression label Feb 9, 2019

qinhanmin2014 added this to the 0.20.3 milestone Feb 9, 2019

qinhanmin2014 mentioned this pull request Feb 9, 2019

sklearn.model_selection.StratifiedKFold either shuffling is wrong or documentation is misleading #13110

Closed

what's new

a2921c6

NicolasHug reviewed Feb 10, 2019

sklearn/model_selection/_split.py (outdated)

redundant statement

6d36012

jnothman removed this from the 0.20.3 milestone Feb 11, 2019

qinhanmin2014 changed the title ~~FIX Enable StratifiedKFold to produce different splits~~ FIX Shuffle each stratification with different random_state in StratifiedKFold Feb 11, 2019

update what's new

b7b8e44

redundant comment

8675188

add a test

3bec2f1

move what's new entry

ca2f430

NicolasHug approved these changes Feb 11, 2019

doc/whats_new/v0.21.rst (outdated)

sklearn/model_selection/tests/test_split.py (outdated)

review comment

4307d64

jnothman approved these changes Feb 12, 2019

doc/whats_new/v0.21.rst

sklearn/model_selection/tests/test_split.py (outdated)

review comment

85c2d67

qinhanmin2014 changed the title ~~FIX Shuffle each stratification with different random_state in StratifiedKFold~~ FIX Shuffle each class's samples with different random_state in StratifiedKFold Feb 14, 2019

GaelVaroquaux approved these changes Feb 27, 2019

GaelVaroquaux merged commit afc6cc5 into scikit-learn:master Feb 27, 2019

qinhanmin2014 deleted the StratifiedKFold-split branch February 27, 2019 13:37

ogrisel added this to the 0.20.4 milestone Mar 1, 2019

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX Shuffle each class's samples with different random_state …

7cc75c8

…in StratifiedKFold (scikit-learn#13124)" This reverts commit 596c615.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX Shuffle each class's samples with different random_state …

25695ab

…in StratifiedKFold (scikit-learn#13124)" This reverts commit 596c615.
