Description
Describe the workflow you want to enable
Currently, train_test_split supports stratified sampling for classification problems through the stratify parameter, which keeps the class proportions in the training and test sets close to those of the full dataset. However, there is no equivalent functionality for regression problems, where the target variable can end up distributed unevenly between the training and test sets. This can lead to biased models, especially when the target variable follows a skewed or non-uniform distribution.
This proposal aims to introduce a balance_regression parameter to train_test_split that allows for maintaining a similar distribution of the target variable in both the training and test sets for regression tasks. The goal is to ensure that the train/test split better reflects the underlying distribution of the target variable in regression problems, improving the generalization of models trained on these splits.
Describe your proposed solution
The solution is to modify the current implementation of train_test_split by adding an optional balance_regression parameter. When enabled, this parameter will discretize the target variable into quantiles (or bins) using pd.qcut, and then apply stratified sampling based on these quantiles to ensure that the distribution of the target variable is consistent across both training and test sets.
The steps are as follows:
1. Add the balance_regression parameter to train_test_split, with a default value of False.
2. When balance_regression=True, use pd.qcut to divide the target variable into n_bins quantiles.
3. Use the stratified sampling mechanism based on these quantiles to perform the train/test split.
4. Ensure that the existing functionality for classification with stratify remains unaffected, and that balance_regression applies only to regression problems.
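The steps above can be approximated today with the existing API. The following is only a minimal sketch, not a proposed implementation: split_with_balanced_target is a hypothetical helper name, and n_bins here is an argument of that helper rather than an existing train_test_split parameter. It bins the target with pd.qcut and passes the bin codes to the existing stratify argument.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_balanced_target(X, y, *, n_bins=10, test_size=0.25, random_state=None):
    """Hypothetical helper: stratify a regression split on quantile bins of y."""
    # Discretize the continuous target into at most n_bins quantile bins.
    # labels=False returns integer bin codes; duplicates="drop" merges bins
    # whose edges coincide, which can happen when y has many tied values.
    bins = pd.qcut(np.asarray(y), q=n_bins, labels=False, duplicates="drop")
    # Reuse the existing stratification mechanism, treating bin codes as classes.
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=bins
    )


# Synthetic data with a skewed (lognormal) target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

X_train, X_test, y_train, y_test = split_with_balanced_target(
    X, y, n_bins=10, test_size=0.2, random_state=0
)
```

The original continuous y is what gets returned in y_train and y_test; the bin codes are used only to drive the stratification.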
The feature will help users maintain a balanced target variable distribution when splitting datasets in regression problems, ensuring that training and testing sets have similar distributions, leading to better model performance and fairness in evaluation.
Describe alternatives you've considered, if relevant
An alternative solution could be to implement a completely separate function for splitting regression datasets with balanced distributions. However, this would introduce additional complexity and redundancy in the library, and could lead to confusion among users. By extending train_test_split, we leverage the familiar interface while maintaining consistency with the existing workflow for classification problems.
Additionally, a user could manually split their data by binning the target variable themselves, but this would require extra effort and knowledge of the appropriate binning strategies, which may not be intuitive for all users. Embedding this functionality directly in train_test_split streamlines the process and reduces the likelihood of errors.
Additional context
Regression problems often involve continuous target variables with highly skewed distributions. When splitting data into training and test sets, the resulting subsets can have significantly different distributions of the target variable, leading to biased models that perform well on training data but fail to generalize to new data.
This issue is particularly relevant for datasets with long tails or highly concentrated ranges of values. Implementing a balanced split based on quantiles will mitigate this issue and help ensure that models trained on these splits perform more consistently across different datasets.
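One way to see the effect on a long-tailed target is to measure, for each decile of the target, what fraction of its samples lands in the test set. The sketch below uses synthetic lognormal data and a hypothetical helper named share_in_test; it compares a plain random split with a split stratified on decile codes. With test_size=0.2, a perfectly balanced split would put a 0.2 share of every decile in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic long-tailed target, for illustration only.
rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=1.5, size=500)

# Integer decile code (0..9) for every sample of the full target.
deciles = pd.qcut(y, q=10, labels=False)


def share_in_test(stratify):
    """Fraction of each decile's samples that end up in the test set."""
    _, idx_test = train_test_split(
        np.arange(y.size), test_size=0.2, random_state=0, stratify=stratify
    )
    in_test = np.zeros(y.size, dtype=bool)
    in_test[idx_test] = True
    return pd.Series(in_test).groupby(deciles).mean()


random_share = share_in_test(None)      # plain random split
binned_share = share_in_test(deciles)   # split stratified on decile codes

print(pd.DataFrame({"random": random_share, "quantile-binned": binned_share}))
```

In the stratified column every decile contributes the same share to the test set, while the random column varies from decile to decile; how much it varies depends on the sample size and the seed.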
The proposed solution builds on the existing stratification mechanism in train_test_split to extend its applicability to regression tasks, without introducing breaking changes or significant overhead.