DOC Add note on overlapping test sets in GroupShuffleSplit
#29676
Conversation
O_o a bit confused. If the test sets are not different, then the train sets also would be the same, wouldn't they?
So my understanding (which may be wrong) is that when we say the test sets are 'different', we mean that the test sets do not overlap between folds (i.e. no sample appears in more than one test set across folds). With the train sets, there is generally always overlap between folds. I don't think it simply means that the test/train sets are 'different' between folds because, generally, they are not exactly the same anyway. I noticed this after reading: scikit-learn/doc/modules/cross_validation.rst Lines 820 to 822 in 4e44ede
(which is the only 'shuffle' splitter where we use the term 'with replacement')
So I think instead of saying "different" we can say, "test sets are not guaranteed to be mutually exclusive, and might include overlapping samples".
Thanks @adrinjalali, I agree that 'different' is too vague and slightly misleading, but I couldn't think of something better. I've made the changes!
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Adds the note on random splits not guaranteeing different test sets to GroupShuffleSplit.
Changes the wording of this note to make it clear it is talking about the test subset.
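For illustration only (this is not part of the PR diff), here is a minimal sketch of what "not guaranteed to be mutually exclusive" means in practice. The toy data, the group labels, and the helper `count_test_appearances` are assumptions made up for this example. With 4 groups, `test_size=0.5` and `n_splits=3`, the three test sets hold 6 group slots in total, so at least one group has to show up in more than one test set, whereas GroupKFold puts every sample in exactly one test fold.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Toy data (assumed for illustration): 8 samples in 4 groups of 2.
X = np.arange(16).reshape(8, 2)
y = np.zeros(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])


def count_test_appearances(splitter):
    """Count how many times each sample index lands in a test set."""
    counts = Counter()
    for _, test_idx in splitter.split(X, y, groups=groups):
        counts.update(test_idx.tolist())
    return counts


# Each split draws its 2 test groups independently of the other splits,
# so test sets from different splits can share samples.
gss_counts = count_test_appearances(
    GroupShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
)

# GroupKFold partitions the groups, so test folds never overlap.
gkf_counts = count_test_appearances(GroupKFold(n_splits=4))

print("GroupShuffleSplit test appearances:", dict(gss_counts))
print("GroupKFold test appearances:", dict(gkf_counts))
```

Any sample whose count is 2 or more under GroupShuffleSplit appears in several test sets, which is the overlap the note warns about. The train sets overlap under both splitters, which is why the note is worded in terms of the test subset.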
Any other comments?