Description
Would there be interest in implementing oversampling and hybrid oversampling+undersampling techniques for combating class imbalance in classification problems?
The API I have envisioned (for which I have rough working code) would be as follows:
```python
from sklearn import preprocessing
import numpy as np

X, y = get_dataset()
n, d = X.shape

balancer = preprocessing.ClassBalancer(
    method='smote', ratio=2, n_neighbors=5, random_state=0)
balancedX, balancedY, synthetic_indices = balancer.fit_transform(
    X, y, minority_class=1, shuffle=True)
```
where `balancedX` and `balancedY` contain both the real and the synthetic examples. For example, with the `smote` method (oversampling only), the output would hold the `n` real examples plus `(ratio - 1) * n_minority_class` synthetic minority examples, along with matching labels; e.g., with 100 minority examples and `ratio=2`, 100 synthetic examples are appended. `fit_transform` takes both the features and the labels, and the caller can either specify `minority_class` explicitly or let the `ClassBalancer` infer it from raw class frequencies.
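For concreteness, here is a minimal sketch of the generation step the `smote` method would perform internally, following Chawla et al.: each synthetic point is interpolated between a minority example and one of its `n_neighbors` nearest minority neighbors. The helper name `smote_oversample` and its signature are placeholders of mine, not part of the proposed API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, n_neighbors=5, random_state=0):
    """Hypothetical helper: generate n_synthetic SMOTE points from the
    minority-class rows X_min (assumes len(X_min) > n_neighbors)."""
    rng = np.random.RandomState(random_state)
    # Each point's nearest neighbor is itself, so query k+1 and drop column 0.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_min)
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    # Pick a random base point and a random neighbor for each synthetic sample,
    # then interpolate a random fraction along the connecting segment.
    base = rng.randint(0, X_min.shape[0], size=n_synthetic)
    picked = neighbors[base, rng.randint(0, n_neighbors, size=n_synthetic)]
    gap = rng.uniform(size=(n_synthetic, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

With `ratio=2`, `fit_transform` would call something like this with `n_synthetic = (ratio - 1) * n_minority_class` and stack the result (and the matching labels) onto the real data.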
There are a few nuances to the API that need discussion in order to encompass the full spectrum of oversampling, undersampling, and hybrids thereof, but the above is the core use case.
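As one concrete point on that spectrum, the undersampling half of a hybrid such as SMOTE-RSB∗ could, in its crudest form, be random removal of majority examples. This is only a rough sketch (`random_undersample` is a hypothetical helper; SMOTE-RSB∗ itself selects examples via rough set theory rather than uniformly at random):

```python
import numpy as np

def random_undersample(X, y, majority_class, n_keep, random_state=0):
    """Hypothetical helper: keep n_keep randomly chosen majority examples
    and all examples from the other classes."""
    rng = np.random.RandomState(random_state)
    majority_idx = np.flatnonzero(y == majority_class)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    selected = np.concatenate(
        [np.flatnonzero(y != majority_class), kept_majority])
    rng.shuffle(selected)  # mirror the shuffle=True option above
    return X[selected], y[selected]
```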
A couple of academic references:
- SMOTE: Synthetic Minority Over-sampling Technique
- SMOTE-RSB∗: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory
If there is interest I'll put together a PR, but I wanted to establish that before cleaning the code up to match scikit-learn conventions/formats.