
ClassBalancer as method for combating classification imbalance #4617

Closed
@worldveil

Description


Would there be interest in implementing oversampling and hybrid oversampling+undersampling techniques for combating class imbalance in classification problems?

The API I have envisioned (and have code roughly implementing) would be as follows:

from sklearn import preprocessing
import numpy as np

X, y = get_dataset()
n, d = X.shape
balancer = preprocessing.ClassBalancer(
    method='smote', ratio=2, n_neighbors=5, random_state=0)
balancedX, balancedY, synthetic_indices = balancer.fit_transform(
    X, y, minority_class=1, shuffle=True)

Here, balancedX and balancedY contain both real and synthetic examples. For example, with the smote method (oversampling only), the output would contain all n real examples plus (ratio - 1) * n_minority_class synthetic examples, along with their labels. fit_transform takes both the features and the labels, and the caller can either specify minority_class explicitly or let the ClassBalancer infer it from raw class frequencies.
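To make the oversampling semantics concrete, here is a minimal NumPy sketch of SMOTE-style synthesis: each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbors. The helper name `smote_oversample` and its signature are hypothetical, for illustration only; they are not the proposed ClassBalancer API.

```python
import numpy as np

def smote_oversample(X_min, ratio=2, n_neighbors=5, random_state=0):
    """Return (ratio - 1) * len(X_min) synthetic minority samples.

    Hypothetical helper sketching the SMOTE idea: interpolate each
    minority sample toward a randomly chosen one of its k nearest
    minority-class neighbors.
    """
    rng = np.random.RandomState(random_state)
    n = len(X_min)
    k = min(n_neighbors, n - 1)
    # Pairwise squared distances among minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Indices of each sample's k nearest minority neighbors.
    nn = np.argsort(d2, axis=1)[:, :k]
    synthetic = []
    for _ in range(ratio - 1):
        for i in range(n):
            j = nn[i, rng.randint(k)]
            gap = rng.rand()  # interpolation factor in [0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

With ratio=2 this doubles the minority class (n synthetic points added to n real ones), which is where the `(ratio - 1) * n_minority_class` count above comes from.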

There are a few nuances to the API that should be discussed to encompass the spectrum of oversampling, undersampling, and hybrids thereof, but this is the use case.
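For the undersampling end of that spectrum, the simplest variant is random undersampling of the majority class. The sketch below (the helper `random_undersample` is hypothetical, not part of the proposed API) keeps every minority sample and a random majority subset of equal size:

```python
import numpy as np

def random_undersample(X, y, minority_class=1, random_state=0):
    """Balance classes by randomly discarding majority samples.

    Hypothetical helper: keeps all minority-class rows and an equal-sized
    random subset of the majority class, then shuffles the result.
    """
    rng = np.random.RandomState(random_state)
    min_mask = (y == minority_class)
    n_min = int(min_mask.sum())
    maj_idx = np.flatnonzero(~min_mask)
    # Sample without replacement so no majority row appears twice.
    keep_maj = rng.choice(maj_idx, size=n_min, replace=False)
    idx = np.concatenate([np.flatnonzero(min_mask), keep_maj])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

A hybrid method would combine both directions, e.g. synthesize some minority samples and then discard a fraction of the majority class.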

A couple academic references:

If there is interest I'll put together a PR, but I wanted to establish that before cleaning the code up to scikit-learn conventions and formats.
