Description
Would there be interest in implementing oversampling and hybrid oversampling+undersampling techniques for combating class imbalance in classification problems?
The API I have envisioned (for which I have rough working code) would be as follows:
```python
from sklearn import preprocessing
import numpy as np

X, y = get_dataset()
n, d = X.shape

balancer = preprocessing.ClassBalancer(
    method='smote', ratio=2, n_neighbors=5, random_state=0)
balancedX, balancedY, synthetic_indices = balancer.fit_transform(
    X, y, minority_class=1, shuffle=True)
```
where `balancedX` and `balancedY` contain both the real and the synthetic examples. For example, with the `smote` method (oversampling only), the output would hold the `n` real examples plus `(ratio - 1) * n_minority_class` synthetic minority examples, along with matching labels; e.g., with 100 minority examples and `ratio=2`, 100 synthetic examples are appended. `fit_transform` takes both the features and the labels, and the caller can either specify `minority_class` explicitly or let the `ClassBalancer` infer it from raw class frequencies.
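For concreteness, here is a minimal sketch of the generation step the `smote` method would perform internally, following Chawla et al.: each synthetic point is interpolated between a minority example and one of its `n_neighbors` nearest minority neighbors. The helper name `smote_oversample` and its signature are placeholders of mine, not part of the proposed API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, n_neighbors=5, random_state=0):
    """Hypothetical helper: generate n_synthetic SMOTE points from the
    minority-class rows X_min (assumes len(X_min) > n_neighbors)."""
    rng = np.random.RandomState(random_state)
    # Each point's nearest neighbor is itself, so query k+1 and drop column 0.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_min)
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # Pick a random base point and a random neighbor for each synthetic sample,
    # then interpolate a random fraction along the connecting segment.
    base = rng.randint(0, X_min.shape[0], size=n_synthetic)
    picked = neighbors[base, rng.randint(0, n_neighbors, size=n_synthetic)]
    gap = rng.uniform(size=(n_synthetic, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

With `ratio=2`, `fit_transform` would call something like this with `n_synthetic = (ratio - 1) * n_minority_class` and stack the result (and the matching labels) onto the real data.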
There are a few nuances to the API that need discussion in order to encompass the full spectrum of oversampling, undersampling, and hybrids thereof, but the above is the core use case.
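As one concrete point on that spectrum, the undersampling half of a hybrid such as SMOTE-RSB∗ could, in its crudest form, be random removal of majority examples. This is only a rough sketch (`random_undersample` is a hypothetical helper; SMOTE-RSB∗ itself selects examples via rough set theory rather than uniformly at random):

```python
import numpy as np

def random_undersample(X, y, majority_class, n_keep, random_state=0):
    """Hypothetical helper: keep n_keep randomly chosen majority examples
    and all examples from the other classes."""
    rng = np.random.RandomState(random_state)
    majority_idx = np.flatnonzero(y == majority_class)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    selected = np.concatenate(
        [np.flatnonzero(y != majority_class), kept_majority])
    rng.shuffle(selected)  # mirror the shuffle=True option above
    return X[selected], y[selected]
```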
A couple of academic references:
- SMOTE: Synthetic Minority Over-sampling Technique
- SMOTE-RSB∗: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory
If there is interest I'll put together a PR, but I wanted to establish that before cleaning the code up to match scikit-learn conventions/formats.