-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX Shuffle each class's samples with different random_state in StratifiedKFold #13124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX Shuffle each class's samples with different random_state in StratifiedKFold #13124
Conversation
I would add a test that if shuffle is True then results depends on random_state for all cv classes that support this. |
PR #7823 was included in 0.19.0, so if it's due to that, I don't think it's a regression since 0.19.x. |
There's no test here, and I'm quite confused about what's going on. It seems that what's going on is we currently pass each class's samples to I will also note that I think there are two design errors here that have been raised previously which make this hard to work with:
|
@jnothman I'm aware that there's some design issues (maybe errors) in StratifiedKFold, but my intention here is to fix bugs. |
Do you still believe it's a regression since 0.19? |
Please add a test
|
#7823 is included in 0.19.X (not included in 0.18.X), so I think it's a regression. |
Right, so a regression introduced in 0.19... so we should not be scheduling it as a 0.20.x fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
this term comes from the doc and I've also updated the doc accordingly. |
Thanks @Arjannikov and @qinhanmin2014 |
I think the solution is simpler than the title of this thread suggest. My personal solution is to shuffle the data before passing it to StratifiedKFold() with shuffle=False. This fixes all problems, however, it renders shuffle=True useless. Therefore, I think the appropriate solution is to have StratifiedKFold() shuffle the data with the given random seed (we only have one random seed), and then go about things as normal. In this case, KFold() won't need to shuffle, so that option can be turned off. |
What is "the data" you want to shuffle? We want shuffled indices within
each class. Shuffling the data, i.e. y requires inverting the permutation
before returning... So it's not really any simpler. But I agree that the
description of the fix tells too much detail about the implementation.
|
I agree that this can solve the problem. I choose another way because we did so previously and I guess there's not too much difference between these two ways. |
I don't understand
|
I mean we can either shuffle the entire dataset or shuffle each class's samples (with a different random_state). I choose the second way because we did so previously (and it's consistent with our doc). |
Oh, I wrote "I don't understand" in response to a comment by @Arjannikov that appears to have since been deleted |
Oh, my comment was. We only have one random seed. So it makes more sense to do the shuffle once, at the very beginning. But later I thought about using the same random seed for different subsets of the data would work also. So, I deleted my comment. Sorry. |
What confuses me is @qinhanmin2014 note: "(with different random_state)". |
@qinhanmin2014, where do you get "different random_state"? StartifiedKFold() takes only one integer as random_state parameter. |
We can produce different random states using one integer (the random_state parameter). |
We can put this in 0.20.X if people want. It should be merged either way (another pro-forma review anyone, or we can accept @NicolasHug's above?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. 👍 for merge.
Merging.
The 0.20.3 tag has already been pushed without this fix but I think we should include it in the next LTS bugfix release. |
Did I fall to cherry pick this? Sorry :( |
…ifiedKFold (scikit-learn#13124) * Enable StratifiedKFold to produce different splits * what's new * redundant statement * update what's new * redundant comment * add a test * move what's new entry * review comment * review comment
…in StratifiedKFold (scikit-learn#13124)" This reverts commit 596c615.
…in StratifiedKFold (scikit-learn#13124)" This reverts commit 596c615.
…ifiedKFold (scikit-learn#13124) * Enable StratifiedKFold to produce different splits * what's new * redundant statement * update what's new * redundant comment * add a test * move what's new entry * review comment * review comment
…ifiedKFold (scikit-learn#13124) * Enable StratifiedKFold to produce different splits * what's new * redundant statement * update what's new * redundant comment * add a test * move what's new entry * review comment * review comment
Fixes #13110
Our StratifiedKFold shuffle each class's samples with same random_state (i.e., in the same way). This is not good IMO:
(1) We can only get certain splits.
(2) When there're only two samples in the groups, users will always get the same splits with different random_state.
Revert part of #7823, I think it's a regression introduced in 0.19, ping @jnothman @amueller