Add balance_regression option to train_test_split for regression problems #30009

Closed
@Pablitosalinero

Describe the workflow you want to enable

Currently, train_test_split supports stratified sampling for classification problems: the stratify parameter keeps the class proportions the same in the training and test sets. There is no equivalent for regression problems, where a random split can leave the training and test sets with noticeably different distributions of the target variable. This can bias both the fitted model and its evaluation, especially when the target variable follows a skewed or otherwise non-uniform distribution.

This proposal introduces a balance_regression parameter to train_test_split that keeps the distribution of the target variable similar in the training and test sets for regression tasks. The goal is for both sets to reflect the underlying distribution of the target, so that test scores give a more reliable estimate of how well a model generalizes.

Describe your proposed solution

The solution is to add an optional balance_regression parameter to train_test_split. When enabled, the target variable is discretized into quantile-based bins using pd.qcut, and stratified sampling is then applied on those bins so that the distribution of the target variable is consistent across the training and test sets.

The steps are as follows:

1. Add the balance_regression parameter to train_test_split, with a default value of False.
2. When balance_regression=True, use pd.qcut to divide the target variable into n_bins quantiles.
3. Use the stratified sampling mechanism based on these quantiles to perform the train/test split.
4. Ensure that the existing functionality for classification with stratify remains unaffected, and that balance_regression applies only to regression problems.

The feature will help users maintain a balanced target variable distribution when splitting datasets in regression problems, ensuring that training and testing sets have similar distributions, leading to better model performance and fairness in evaluation.
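
The steps above can be sketched with the API that exists today. This is only an illustration of the idea, not the proposed implementation: the helper name balanced_regression_split and its n_bins argument are hypothetical, and the synthetic data is there just to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def balanced_regression_split(X, y, n_bins=10, test_size=0.25, random_state=None):
    """Hypothetical helper: stratify a regression split on quantile bins of y."""
    # Assign each target value to a quantile bin (integer labels 0..n_bins-1).
    # duplicates="drop" merges bins whose edges coincide, e.g. with many ties.
    bins = pd.qcut(y, q=n_bins, labels=False, duplicates="drop")
    # Reuse the existing stratify mechanism, passing the bin labels as "classes".
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=bins
    )


# Synthetic, skewed regression target (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.lognormal(size=1000)

X_train, X_test, y_train, y_test = balanced_regression_split(
    X, y, n_bins=10, random_state=0
)
```

Each bin must contain at least two samples for stratification to work, so n_bins has to be chosen with the dataset size in mind.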

Describe alternatives you've considered, if relevant

An alternative solution could be to implement a completely separate function for splitting regression datasets with balanced distributions. However, this would introduce additional complexity and redundancy in the library, and could lead to confusion among users. By extending train_test_split, we leverage the familiar interface while maintaining consistency with the existing workflow for classification problems.

Additionally, a user could manually split their data by binning the target variable themselves, but this would require extra effort and knowledge of the appropriate binning strategies, which may not be intuitive for all users. Embedding this functionality directly in train_test_split streamlines the process and reduces the likelihood of errors.

Additional context

Regression problems often involve continuous target variables with highly skewed distributions. With a purely random split, the training and test sets can end up with significantly different distributions of the target variable, so a model may perform well on the training data yet appear to generalize poorly (or misleadingly well) on the test data.

This issue is particularly relevant for datasets with long tails or highly concentrated ranges of values. A balanced split based on quantiles mitigates it and helps ensure that models trained on these splits are evaluated more consistently from one split to the next.
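
As a rough way to see the effect described here, the following sketch compares a plain random split with a quantile-binned stratified split on a synthetic long-tailed target, using the gap between training and test medians as a simple indicator. The data, the helper median_gap, and the choice of the median as a summary are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # synthetic long-tailed target
X = np.arange(len(y)).reshape(-1, 1)              # placeholder features

# Quantile-bin labels, used only as the stratification key.
bins = pd.qcut(y, q=10, labels=False, duplicates="drop")


def median_gap(stratify, seed):
    """Absolute difference between train and test medians for one split."""
    _, _, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=stratify
    )
    return abs(np.median(y_tr) - np.median(y_te))


# Repeat over several seeds so the comparison does not hinge on one split.
plain_gaps = [median_gap(None, seed) for seed in range(50)]
binned_gaps = [median_gap(bins, seed) for seed in range(50)]

print("mean median gap, random split:", np.mean(plain_gaps))
print("mean median gap, binned split:", np.mean(binned_gaps))
```

On data like this one would expect the binned split to show a smaller average gap, although the size of the difference depends on the sample size, the number of bins, and the shape of the distribution.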

The proposed solution builds on the existing stratification mechanism in train_test_split to extend its applicability to regression tasks, without introducing breaking changes or significant overhead.
