
Add parameter to train_test_split for deterministic splitting and tests #31097


Open
wants to merge 4 commits into
base: main
from

Conversation


kaekkr commented Mar 28, 2025

Reference Issues/PRs

Fixes #30992
See also the discussion in #30992.

What does this implement/fix? Explain your changes.

This PR adds a new uid parameter to sklearn.model_selection.train_test_split, allowing for deterministic splitting of datasets using unique identifiers. This ensures that samples with the same UID are always assigned to the same split (train or test), regardless of dataset order. It helps with stability and reproducibility, especially in production environments or pipelines where data might be reshuffled.

Key details:

  • Introduces a uid argument (mutually exclusive with stratify and shuffle=False)
  • Performs hashing via MD5 on UID values to produce reproducible splits
  • Adds corresponding unit tests for deterministic behavior
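The idea behind the key details above can be sketched in a few lines. This is a minimal illustration only, not the code in this PR: the helper names `uid_to_unit_interval` and `uid_split` are hypothetical, and scaling the digest by a fixed power of two is an assumption about one reasonable way to turn an MD5 hash into a split decision.

```python
import hashlib


def uid_to_unit_interval(uid):
    """Map a UID to a float in [0, 1) via MD5 (hypothetical helper)."""
    digest = hashlib.md5(str(uid).encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so the value is an exact
    # float strictly below 1.0 after dividing by 2**53.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def uid_split(samples, uids, test_size=0.25):
    """Assign each sample to train or test based only on its own UID hash."""
    train, test = [], []
    for sample, uid in zip(samples, uids):
        if uid_to_unit_interval(uid) < test_size:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


# Usage: reordering the data does not change which UIDs land in the test set.
uids = [f"user-{i}" for i in range(1000)]
_, test_a = uid_split(uids, uids)
_, test_b = uid_split(uids[::-1], uids[::-1])
assert set(test_a) == set(test_b)
```

Because each sample is assigned independently, the realized test fraction is only approximately `test_size`, which differs from the exact counts that `train_test_split` produces with shuffling.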

Any other comments?

Let me know if there are naming or API design considerations you'd like improved — happy to adjust the implementation. Thanks!


github-actions bot commented Mar 28, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c1a1049. Link to the linter CI: here


sciencecw commented Apr 5, 2025

Why not use murmurhash, as in my sample code in #30992? We don't need a cryptographic hash function, and this adds hashlib as a dependency (not sure whether the checks cover this).

Also, it does not seem right to take the hash mod 10^8 and then divide the hashed array by its max value.
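One possible reading of this concern, sketched under assumptions: the PR's code is not shown in this thread, so the functions below are hypothetical reconstructions based only on the description in the comment. If hashes are divided by the maximum hash in the current batch, each sample's threshold depends on which other UIDs are present; dividing by the fixed modulus keeps the decision a function of the UID alone.

```python
import hashlib

MOD = 10**8  # modulus mentioned in the review comment


def hash_mod(uid):
    """MD5 hash of a UID reduced modulo MOD (hypothetical helper)."""
    return int(hashlib.md5(str(uid).encode("utf-8")).hexdigest(), 16) % MOD


def in_test_max_scaled(uids, test_size):
    """Scale by the largest hash in this batch (assumed behavior under review).

    The result for one UID can change when other UIDs are added or removed,
    because the batch maximum, and hence the effective threshold, changes.
    """
    hashes = [hash_mod(u) for u in uids]
    top = max(hashes)
    return {u: h / top < test_size for u, h in zip(uids, hashes)}


def in_test_fixed_scaled(uids, test_size):
    """Scale by the fixed modulus: the result depends only on each UID."""
    return {u: hash_mod(u) / MOD < test_size for u in uids}


# Usage: with fixed scaling, a subset gets the same assignments as the full set.
uids = [f"user-{i}" for i in range(500)]
full = in_test_fixed_scaled(uids, 0.2)
part = in_test_fixed_scaled(uids[:100], 0.2)
assert all(full[u] == part[u] for u in uids[:100])
```

With max scaling, the sample holding the largest hash always scales to exactly 1.0, so it can never fall in the test set for any `test_size` below 1, and removing it shifts the threshold for every remaining sample.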

Development

Successfully merging this pull request may close these issues.

UID-based Stable Train-Test Split
3 participants