Add balance_regression option to train_test_split for regression problems

### Describe the workflow you want to enable

Currently, `train_test_split` supports stratified sampling for classification problems through the `stratify` parameter, which keeps class proportions the same in the training and test sets. However, there is no equivalent functionality for regression problems, where a random split can leave the training and test sets with noticeably different distributions of the target variable. This can lead to biased models, especially when the target variable follows a skewed or non-uniform distribution.

This proposal aims to introduce a `balance_regression` parameter to `train_test_split` that allows for maintaining a similar distribution of the target variable in both the training and test sets for regression tasks. The goal is to ensure that the train/test split better reflects the underlying distribution of the target variable in regression problems, improving the generalization of models trained on these splits.

### Describe your proposed solution

The solution is to modify the current implementation of `train_test_split` by adding an optional `balance_regression` parameter. When enabled, this parameter will discretize the target variable into quantiles (or bins) using `pd.qcut`, and then apply stratified sampling based on these quantiles to ensure that the distribution of the target variable is consistent across both training and test sets.

The steps are as follows:

1. Add the `balance_regression` parameter to `train_test_split`, with a default value of `False`.
2. When `balance_regression=True`, use `pd.qcut` to divide the target variable into `n_bins` quantiles.
3. Use the existing stratified sampling mechanism on these quantile bins to perform the train/test split.
4. Ensure that the existing functionality for classification with `stratify` remains unaffected, and that `balance_regression` applies only to regression problems.

The feature will help users keep the target variable distribution similar in the training and test sets when splitting regression datasets, so that models are trained and evaluated on comparable data and evaluation scores depend less on the particular split.
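The steps above can already be approximated with the existing `stratify` parameter. The following is only a rough sketch of the intended behavior, not an implementation proposal: the helper name `split_with_balanced_target` and its `n_bins` argument are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_balanced_target(X, y, n_bins=10, **split_kwargs):
    """Hypothetical helper that mimics the proposed balance_regression=True."""
    # Discretize the continuous target into quantile bins (integer codes).
    # duplicates="drop" merges bins whose edges coincide (heavily tied targets).
    bins = pd.qcut(y, q=n_bins, labels=False, duplicates="drop")
    # Reuse the existing stratification mechanism on the bin labels.
    return train_test_split(X, y, stratify=bins, **split_kwargs)


# Synthetic regression data with a right-skewed target (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.exponential(scale=1.0, size=1000)

X_train, X_test, y_train, y_test = split_with_balanced_target(
    X, y, n_bins=10, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

Because every quantile bin contributes to both subsets, the test set is guaranteed to contain samples from the tail of the target distribution as well as from its bulk.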


### Describe alternatives you've considered, if relevant

An alternative solution could be to implement a completely separate function for splitting regression datasets with balanced distributions. However, this would introduce additional complexity and redundancy in the library, and could lead to confusion among users. By extending `train_test_split`, we leverage the familiar interface while maintaining consistency with the existing workflow for classification problems.

Additionally, a user could manually split their data by binning the target variable themselves, but this would require extra effort and knowledge of the appropriate binning strategies, which may not be intuitive for all users. Embedding this functionality directly in `train_test_split` streamlines the process and reduces the likelihood of errors.
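For reference, the manual route might look like the sketch below, which bins the target with scikit-learn's own `KBinsDiscretizer` rather than pandas. The data is synthetic and the choice of five quantile bins is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic regression data with a right-skewed target (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.exponential(scale=1.0, size=1000)

# The user has to pick the number of bins and the binning strategy.
# strategy="quantile" gives bins with roughly equal sample counts.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)

# Stratify on the bin labels while still splitting the original X and y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y_binned
)
```

The strategy matters: with equal-width bins (`strategy="uniform"`) a skewed target can leave a bin holding a single sample, and `train_test_split` raises an error when a stratification class has fewer than two members. This is the kind of detail a built-in option could handle for the user.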


### Additional context

Regression problems often involve continuous target variables with highly skewed distributions. When such data is split at random, the training and test sets can end up with significantly different distributions of the target variable, which can produce biased models that perform well on the training data but fail to generalize to new data.

This issue is particularly relevant for datasets with long tails or highly concentrated ranges of values. A balanced split based on quantiles should mitigate it and help ensure that models trained on these splits perform more consistently from one split to the next.
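A small experiment can make the effect concrete. The sketch below uses a synthetic exponential target (an assumption, not data from a real project) and compares how much the mean of the test-set target moves across random seeds with a plain split and with a split stratified on quantile bins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic right-skewed target (illustration only).
rng = np.random.default_rng(42)
y = rng.exponential(scale=1.0, size=1000)

# Decile labels used for the quantile-stratified split.
bins = pd.qcut(y, q=10, labels=False)

plain_means, balanced_means = [], []
for seed in range(100):
    # Plain random split.
    _, y_test_plain = train_test_split(y, test_size=0.2, random_state=seed)
    # Split stratified on the decile of the target.
    _, y_test_balanced = train_test_split(
        y, test_size=0.2, random_state=seed, stratify=bins
    )
    plain_means.append(y_test_plain.mean())
    balanced_means.append(y_test_balanced.mean())

# Spread of the test-set target mean across the 100 seeds.
plain_spread = float(np.std(plain_means))
balanced_spread = float(np.std(balanced_means))
print(f"plain split:    std of test mean = {plain_spread:.4f}")
print(f"balanced split: std of test mean = {balanced_spread:.4f}")
```

A smaller spread for the stratified split means the test set's target distribution depends less on the random seed. The top decile of a long-tailed target still varies a lot internally, so increasing the number of bins may help further.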

The proposed solution builds on the existing stratification mechanism in `train_test_split` to extend its applicability to regression tasks, without introducing breaking changes or significant overhead.

Add balance_regression option to train_test_split for regression problems #30009

Describe the workflow you want to enable

Describe your proposed solution

Describe alternatives you've considered, if relevant

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Add balance_regression option to train_test_split for regression problems #30009

Description

Describe the workflow you want to enable

Describe your proposed solution

Describe alternatives you've considered, if relevant

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions