Skip to content

DOC Add note on overlapping test sets in GroupShuffleSplit #29676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 23, 2024

Conversation

lucyleeow
Copy link
Member

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Adds the note on random splits not guaranteeing different test sets to GroupShuffleSplit.
Changes the wording of this note to make it clear it is talking about the test subset.

Any other comments?

Copy link

github-actions bot commented Aug 15, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9ec90ad. Link to the linter CI: here

@adrinjalali
Copy link
Member

O_o a bit confused. If the test sets are not different, then the train sets also would be the same, wouldn't they?

@lucyleeow
Copy link
Member Author

lucyleeow commented Aug 15, 2024

So my understanding (which may be wrong) is that when we say the test sets are 'different', we mean that test sets do not overlap between folds (i.e. no sample appears in more than one test set across folds). With train set, there is generally always overlap between folds.

I don't think it means simply that the test/train sets are 'different' between folds because, generally they are not exactly the same.

I noticed this after reading:

expensive. In such a scenario, :class:`GroupShuffleSplit` provides
a random sample (with replacement) of the train / test splits
generated by :class:`LeavePGroupsOut`.

(which is the only 'shuffle' splitter where we use the term 'with replacement')

@adrinjalali
Copy link
Member

So I think instead of saying "different" we can say, "test test are not guaranteed to be mutually exclusive, and might include overlapping samples".

@lucyleeow
Copy link
Member Author

Thanks @adrinjalali , I agree that 'different' is too vague and slightly mis-leading but couldn't think of something better. I've made the changes!

@adrinjalali adrinjalali merged commit 635d83f into scikit-learn:main Aug 23, 2024
30 checks passed
@lucyleeow lucyleeow deleted the doc_split_shuffle branch August 23, 2024 10:16
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants