Add parameter to train_test_split for deterministic splitting and tests #31097
+84
−0
Reference Issues/PRs
Fixes #30992
See also the discussion in #30992.
What does this implement/fix? Explain your changes.
This PR adds a new `uid` parameter to `sklearn.model_selection.train_test_split`, allowing for deterministic splitting of datasets using unique identifiers. This ensures that samples with the same UID are always assigned to the same split (train or test), regardless of dataset order. It helps with stability and reproducibility, especially in production environments or pipelines where data might be reshuffled.
Key details:
- `uid` argument (mutually exclusive with `stratify` and `shuffle=False`)
Any other comments?
Let me know if there are naming or API design points you'd like changed; I'm happy to adjust the implementation. Thanks!
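For reviewers who want a concrete picture of UID-based deterministic splitting, here is a minimal standalone sketch. It is not the code from this PR: the description does not show the implementation, so the hashing scheme is an assumption, and the helper names `uid_in_test` and `split_by_uid` are hypothetical. The idea illustrated is that each sample's split is derived only from its own UID, so row order cannot affect the result.

```python
import hashlib


def uid_in_test(uid, test_size=0.25):
    """Return True if `uid` maps to the test split (hypothetical helper)."""
    # Python's built-in hash() is salted per process for strings, so a
    # stable digest is used to keep the assignment reproducible across runs.
    digest = hashlib.sha256(str(uid).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a fraction between 0 and 1.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < test_size


def split_by_uid(samples, uids, test_size=0.25):
    """Split `samples` into (train, test) based only on each sample's UID."""
    train, test = [], []
    for sample, uid in zip(samples, uids):
        if uid_in_test(uid, test_size):
            test.append(sample)
        else:
            train.append(sample)
    return train, test


if __name__ == "__main__":
    samples = list(range(10))
    uids = [f"user-{i}" for i in samples]

    train, test = split_by_uid(samples, uids)
    # Reversing the dataset does not change which samples land in each split.
    rev_train, rev_test = split_by_uid(samples[::-1], uids[::-1])
    print(sorted(train) == sorted(rev_train), sorted(test) == sorted(rev_test))
```

With a scheme like this, `test_size` is only matched approximately, since each UID is assigned independently rather than by counting samples.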