Skip to content

FIX Shuffle each class's samples with different random_state in StratifiedKFold #13124

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Feb 27, 2019
Merged

FIX Shuffle each class's samples with different random_state in StratifiedKFold #13124

merged 9 commits into from
Feb 27, 2019

Conversation

qinhanmin2014
Copy link
Member

@qinhanmin2014 qinhanmin2014 commented Feb 9, 2019

Fixes #13110
Our StratifiedKFold shuffle each class's samples with same random_state (i.e., in the same way). This is not good IMO:
(1) We can only get certain splits.

import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(20)
y = [0] * 10 + [1] * 10
StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
# all the test folds in the split will be [a, b, 10+a, 10+b]

(2) When there're only two samples in the groups, users will always get the same splits with different random_state.

import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(10)
y = [0] * 5 + [1] * 5
StratifiedKFold(n_splits=5, shuffle=True, random_state=XXX)
# the test folds will always be [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]

Revert part of #7823, I think it's a regression introduced in 0.19, ping @jnothman @amueller

@agramfort
Copy link
Member

I would add a test that if shuffle is True then results depends on random_state for all cv classes that support this.

@jnothman
Copy link
Member

PR #7823 was included in 0.19.0, so if it's due to that, I don't think it's a regression since 0.19.x.

@jnothman
Copy link
Member

There's no test here, and I'm quite confused about what's going on.

It seems that what's going on is we currently pass each class's samples to KFold(self.n_splits, shuffle=self.shuffle, random_state=self.random_state). This PR changes it so that instead of passing self.random_state we pass a RandomState that is progressed in each call to KFold. I don't really see why the current code causes the stated bug.

I will also note that I think there are two design errors here that have been raised previously which make this hard to work with:

  1. In the sklearn.cross_validation -> sklearn.model_selection rewrite, we did not carefully design what to do with random states. Ideally check_random_state should be called in some fit method, or perhaps in __init__, such that each call to split is different. Instead we landed up calling check_random_state in split, such that you get the surprising behaviour that passing in a RandomState will give different results for each split but passing in an int will give fixed results for each split.
  2. StratifiedKFold uses this "apply K fold to each class" which gives some weird (imbalanced) results. We previously had a sort-and-round-robin approach, which I think was rejected for the wrong reasons (all we needed was a stable sort), and it should be reinstated/available as a much simpler approach.

@jnothman jnothman removed this from the 0.20.3 milestone Feb 11, 2019
@qinhanmin2014 qinhanmin2014 changed the title FIX Enable StratifiedKFold to produce different splits FIX Shuffle each stratification with different random_state in StratifiedKFold Feb 11, 2019
@qinhanmin2014
Copy link
Member Author

@jnothman I'm aware that there's some design issues (maybe errors) in StratifiedKFold, but my intention here is to fix bugs.
Apologies I didn't describe the problem very well. I've updated the PR body. Please take some time to have a look.

@jnothman
Copy link
Member

Do you still believe it's a regression since 0.19?

@jnothman
Copy link
Member

jnothman commented Feb 11, 2019 via email

@qinhanmin2014
Copy link
Member Author

Do you still believe it's a regression since 0.19?

#7823 is included in 0.19.X (not included in 0.18.X), so I think it's a regression.

@jnothman
Copy link
Member

Right, so a regression introduced in 0.19... so we should not be scheduling it as a 0.20.x fix.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@qinhanmin2014
Copy link
Member Author

not each stratification, per se, but each class's samples

this term comes from the doc and I've also updated the doc accordingly.

@jnothman
Copy link
Member

Thanks @Arjannikov and @qinhanmin2014

@Arjannikov
Copy link

I think the solution is simpler than the title of this thread suggest.

My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all problems, however, it renders shuffle=True useless.

Therefore, I think the appropriate solution is to have StratifiedKFold() shuffle the data with the given random seed (we only have one random seed), and then go about things as normal. In this case, KFold() won't need to shuffle, so that option can be turned off.

@jnothman
Copy link
Member

jnothman commented Feb 12, 2019 via email

@qinhanmin2014
Copy link
Member Author

My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all problems, however, it renders shuffle=True useless.

I agree that this can solve the problem. I choose another way because we did so previously and I guess there's not too much difference between these two ways.

@jnothman
Copy link
Member

jnothman commented Feb 13, 2019 via email

@qinhanmin2014
Copy link
Member Author

I don't understand

I mean we can either shuffle the entire dataset or shuffle each class's samples (with a different random_state). I choose the second way because we did so previously (and it's consistent with our doc).

@jnothman
Copy link
Member

Oh, I wrote "I don't understand" in response to a comment by @Arjannikov that appears to have since been deleted

@Arjannikov
Copy link

Oh, my comment was. We only have one random seed. So it makes more sense to do the shuffle once, at the very beginning. But later I thought about using the same random seed for different subsets of the data would work also. So, I deleted my comment. Sorry.

@Arjannikov
Copy link

What confuses me is @qinhanmin2014 note: "(with different random_state)".

@qinhanmin2014 qinhanmin2014 changed the title FIX Shuffle each stratification with different random_state in StratifiedKFold FIX Shuffle each class's samples with different random_state in StratifiedKFold Feb 14, 2019
@Arjannikov
Copy link

@qinhanmin2014, where do you get "different random_state"? StartifiedKFold() takes only one integer as random_state parameter.

@qinhanmin2014
Copy link
Member Author

@qinhanmin2014, where do you get "different random_state"? StartifiedKFold() takes only one integer as random_state parameter.

We can produce different random states using one integer (the random_state parameter).

@jnothman
Copy link
Member

We can put this in 0.20.X if people want. It should be merged either way (another pro-forma review anyone, or we can accept @NicolasHug's above?)

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. 👍 for merge.

Merging.

@GaelVaroquaux GaelVaroquaux merged commit afc6cc5 into scikit-learn:master Feb 27, 2019
@qinhanmin2014 qinhanmin2014 deleted the StratifiedKFold-split branch February 27, 2019 13:37
@ogrisel ogrisel added this to the 0.20.4 milestone Mar 1, 2019
@ogrisel
Copy link
Member

ogrisel commented Mar 1, 2019

The 0.20.3 tag has already been pushed without this fix but I think we should include it in the next LTS bugfix release.

@jnothman
Copy link
Member

jnothman commented Mar 3, 2019

Did I fall to cherry pick this? Sorry :(

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
…ifiedKFold (scikit-learn#13124)

* Enable StratifiedKFold to produce different splits

* what's new

* redundant statement

* update what's new

* redundant comment

* add a test

* move what's new entry

* review comment

* review comment
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
…ifiedKFold (scikit-learn#13124)

* Enable StratifiedKFold to produce different splits

* what's new

* redundant statement

* update what's new

* redundant comment

* add a test

* move what's new entry

* review comment

* review comment
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Jul 23, 2019
…ifiedKFold (scikit-learn#13124)

* Enable StratifiedKFold to produce different splits

* what's new

* redundant statement

* update what's new

* redundant comment

* add a test

* move what's new entry

* review comment

* review comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants