Add parameter to train_test_split for deterministic splitting and tests #31097

kaekkr · 2025-03-28T09:16:51Z

Reference Issues/PRs

Fixes #30992
See also the discussion in #30992.

What does this implement/fix? Explain your changes.

This PR adds a new uid parameter to sklearn.model_selection.train_test_split, allowing for deterministic splitting of datasets using unique identifiers. This ensures that samples with the same UID are always assigned to the same split (train or test), regardless of dataset order. It helps with stability and reproducibility, especially in production environments or pipelines where data might be reshuffled.

Key details:

Introduces a uid argument (mutually exclusive with stratify and shuffle=False)
Performs hashing via MD5 on UID values to produce reproducible splits
Adds corresponding unit tests for deterministic behavior
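
For readers following along, here is a minimal sketch of the idea described above, not the PR's actual code. The helper names `uid_to_unit_interval` and `split_by_uid` are hypothetical, and SHA256 is used because a later commit in this PR switches away from MD5. It assumes each UID can be converted to a string.

```python
import hashlib


def uid_to_unit_interval(uid):
    """Map a UID to a float in [0, 1) that depends only on the UID itself."""
    digest = hashlib.sha256(str(uid).encode("utf-8")).digest()
    # Read the first 4 bytes as an unsigned 32-bit integer and scale by 2**32.
    return int.from_bytes(digest[:4], "big") / 2**32


def split_by_uid(uids, test_size=0.25):
    """Return (train_indices, test_indices), decided per UID, not per position."""
    train_idx, test_idx = [], []
    for i, uid in enumerate(uids):
        if uid_to_unit_interval(uid) < test_size:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx


uids = [f"user-{n}" for n in range(1000)]
train_idx, test_idx = split_by_uid(uids, test_size=0.25)
test_uids = {uids[i] for i in test_idx}

# Reordering the data does not change which UIDs land in the test set.
reversed_uids = uids[::-1]
_, test_idx_rev = split_by_uid(reversed_uids, test_size=0.25)
assert {reversed_uids[i] for i in test_idx_rev} == test_uids
```

With this kind of scheme the test fraction is only approximately `test_size`, since each UID is assigned independently of the others.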

Any other comments?

Let me know if there are naming or API design considerations you'd like improved — happy to adjust the implementation. Thanks!

github-actions · 2025-03-28T09:18:04Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c1a1049.

sklearn/model_selection/_split.py

sciencecw · 2025-04-05T03:39:10Z

Why not use murmurhash, as in my sample code in #30992? We don't need a cryptographic hash function, and this adds hashlib as a dependency (not sure whether the checks cover this).

Also, it does not seem right to take (hash mod 10^8) and then divide the hashed array by its max value.
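
To illustrate the normalization concern, here is a rough sketch, not taken from the PR. It contrasts scaling by a fixed constant with scaling by the largest hash in the batch, as the comment describes the current approach. CRC32 from the standard library's `zlib` stands in for a non-cryptographic hash such as murmurhash, and the function names are hypothetical.

```python
import zlib


def hash32(uid):
    """Non-cryptographic 32-bit hash (CRC32 here, as a stdlib stand-in)."""
    return zlib.crc32(str(uid).encode("utf-8"))  # unsigned int in [0, 2**32)


def select_test_fixed(uids, test_size, hash_fn=hash32):
    """Scale by the constant 2**32, so each UID is judged on its own."""
    return {u for u in uids if hash_fn(u) / 2**32 < test_size}


def select_test_max(uids, test_size, hash_fn=hash32):
    """Scale by the largest reduced hash in this batch (the criticized variant)."""
    reduced = {u: hash_fn(u) % 10**8 for u in uids}
    largest = max(reduced.values())
    return {u for u in uids if reduced[u] / largest < test_size}


uids = [f"user-{n}" for n in range(1000)]
subset = uids[:500]

# Fixed scale: dropping rows never moves a remaining UID to the other side.
full = select_test_fixed(uids, 0.25)
assert select_test_fixed(subset, 0.25) == full & set(subset)

# Max scale: the cut-off is 0.25 * largest, which can shift when rows change.
print(len(select_test_max(uids, 0.25)), len(select_test_max(subset, 0.25)))
```

Because the max-scaled cut-off depends on whichever sample has the largest hash, adding or removing a single row can move other UIDs across the boundary, which works against the stated goal of a split that is stable per UID.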

Add parameter to train_test_split for deterministic splitting and tests

871cfa6

github-actions bot added the module:model_selection label Mar 28, 2025

Merge branch 'main' into stable-train-test-split

0be210f

github-advanced-security bot found potential problems Mar 28, 2025

sklearn/model_selection/_split.py (fixed)

Karassay and others added 2 commits March 28, 2025 14:34

Change hash function from MD5 to SHA256

93b835e

Merge branch 'main' into stable-train-test-split

c1a1049

StefanieSenger added the Waiting for Reviewer label Jun 26, 2025
