MRG+1: Add resample to preprocessing. #1454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status: Closed (wants to merge 4 commits)
130 changes: 130 additions & 0 deletions doc/modules/preprocessing.rst
@@ -667,3 +667,133 @@ error with a ``filterwarnings``::
For a full code example that demonstrates using a :class:`FunctionTransformer`
to do custom feature selection,
see :ref:`sphx_glr_auto_examples_preprocessing_plot_function_transformer.py`

.. _preprocessing_resample_labels:

Resampling of labels
====================

Balancing labels
----------------
Unless carefully designed, datasets rarely contain an equal number of samples
from each class. Resampling a dataset with :func:`resample_labels` lets the user
change the label distribution and simultaneously grow or shrink the dataset by
drawing samples from the original data. Noise may then be added to obtain a
training set that gives better accuracy or generalization from a machine
learning algorithm, or the resampled data can be used to test running time as
the dataset scales.

As an example of resampling, assume there is an unbalanced dataset with three
class labels ``[0, 1, 2]`` and sample counts ``[100, 125, 150]``. The ``method``
keyword of :func:`resample_labels` accepts one of three string options or,
alternatively, a ``dict``. Setting ``method='undersample'`` makes the number of
samples in the least common class determine the count for every class, in this
case 300 total samples, 100 from each class::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,100), np.repeat(1,125), np.repeat(2,150)])
>>> indices = resample_labels(y, method='undersample')
>>> print(np.bincount(y[indices]))
[100 100 100]

With ``method='oversample'``, the 150 samples of the most common class determine
the count for every class, for an output length of 450::

>>> indices = resample_labels(y, method='oversample')
>>> print(np.bincount(y[indices]))
[150 150 150]


Using ``method='balance'`` keeps the length of the dataset at 375 but equalizes
the count of each class by undersampling or oversampling classes as needed::

>>> indices = resample_labels(y, method='balance')
>>> print(np.bincount(y[indices]))
[125 125 125]

Keep in mind that if your dataset starts with very few samples of a class and
you choose the ``undersample`` option, the output dataset will be very small.
Using the ``scaling`` keyword as described below can keep the dataset large, at
the cost of many repeated samples. You may then add noise or reconsider your
approach.
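
For instance, a minimal sketch with a rare class of only ten samples (this
assumes ``method`` and ``scaling`` can be combined, with ``method`` fixing the
label distribution and ``scaling`` fixing the total output size, as described in
the sections below)::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,10), np.repeat(1,125), np.repeat(2,150)])
>>> print(np.bincount(y[resample_labels(y, method='undersample')]))
[10 10 10]
>>> indices = resample_labels(y, method='undersample', scaling=375)
>>> print(np.bincount(y[indices]))
[125 125 125]

Plain undersampling leaves only 30 samples, while adding ``scaling=375`` keeps
the dataset at its original size by repeating the rare class's ten samples.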


Custom label distribution
-------------------------
The three string options for the ``method`` keyword produce a balanced class
distribution, but unbalanced classes are also supported by passing a ``dict`` as
``method``. For instance, in the previous example, passing ``method={0: .5, 1: .25, 2: .25}``
gives a dataset where the 0 class label is twice as likely as a 1 or a 2.

Skew the distribution with a dict::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,100), np.repeat(1,125), np.repeat(2,150)])
>>> indices = resample_labels(y, method={0: .5, 1: .25, 2: .25}, random_state=4)
>>> print(np.bincount(y[indices]))
[203 89 83]


Resizing the dataset
--------------------
Changing the size of a dataset is also supported, which can lead to interesting
training possibilities. Perhaps the full dataset is very large and would take a
long time to train on. Setting ``scaling=3000``, for example, outputs 3000
samples with the desired label distribution. Setting ``scaling`` to a float
instead scales the original number of samples: ``scaling=.5`` outputs half the
number of samples in the original dataset, while ``scaling=3.0`` triples the
dataset.
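
Assuming ``scaling`` only affects the number of returned indices, a minimal
sketch reusing the ``y`` array from the earlier examples::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,100), np.repeat(1,125), np.repeat(2,150)])
>>> print(len(resample_labels(y, scaling=3000)))
3000
>>> print(len(resample_labels(y, scaling=2.0)))
750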

Scaling up the size of a training set is also useful for making sure that
a machine learning algorithm can handle training and predicting large amounts
of data in a reasonable amount of time.

Take scaling up the iris dataset from 150 samples to 7,500 as an example::

>>> from sklearn import svm, datasets
>>> from sklearn.preprocessing.resample import resample_labels
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target
>>> indices = resample_labels(y, scaling=50.0)
>>> X0 = X[indices]   # 7,500 resampled samples
>>> y0 = y[indices]
>>> svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)      # original 150 samples
>>> svc0 = svm.SVC(kernel='linear', C=1.0).fit(X0, y0)   # 50x the data

Training is almost instantaneous with the original 150 samples, takes slightly
longer with the 7,500 samples shown above, and can take over ten seconds with
1000x the data.
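
A rough way to check this on your own machine is to time the fits directly,
continuing from the example above. The loop below is only a sketch (timings
depend on hardware, so it is not run as a doctest)::

>>> import time                                              # doctest: +SKIP
>>> for factor in [1.0, 50.0, 1000.0]:                       # doctest: +SKIP
...     indices = resample_labels(y, scaling=factor)
...     tic = time.time()
...     _ = svm.SVC(kernel='linear', C=1.0).fit(X[indices], y[indices])
...     print(factor, time.time() - tic)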

Other resampling options
------------------------
Sampling can be done with or without replacement. The default,
``replace=False``, samples without replacement. So that the scaling feature
still works when sampling without replacement, the sample pool is refilled from
the original dataset whenever it runs out. Thus, if you set ``scaling=10.0`` and
sample without replacement, each sample will appear ten times in the output. In
effect, when sampling without replacement only the last repetition of the
dataset can vary, and only when the scaled size is not an exact multiple of the
original dataset.
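
As a quick check of this rule, here is a sketch on a tiny, already balanced
array (so that the result does not depend on the default ``method``); counting
how often each index is drawn shows every sample repeated ten times::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> indices = resample_labels(y, scaling=10.0, replace=False)
>>> print(np.bincount(indices, minlength=len(y)))
[10 10 10 10 10 10]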

Shuffling the data can take considerable CPU time and is turned off by
default, but is possible by setting the keyword argument ``shuffle=True``.

Finally, for repeatability, you may set the ``random_state`` keyword
argument.

Keep training and testing datasets separate
-------------------------------------------
It is a fundamental machine learning practice to keep the training and
testing datasets completely separate, because estimators do very well on
samples they were trained on, and testing results would be overly
optimistic. When resampling, since you are duplicating samples exactly,
there is now the possibility that a sample finds its way into both the
training and testing sets.

Even ruling out coding errors, techniques such as scaling up a dataset and
then running cross-validation on it can place copies of the same sample in
both the training and validation sets. The danger is that your results are
not as good as the metrics report, or that the model generalizes worse than
expected. Therefore, take extreme care when resampling so that you do not
mix the training and testing datasets.
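
One way to avoid this pitfall is to split the data first and resample only the
training portion. A minimal sketch using
:func:`~sklearn.model_selection.train_test_split` (any equivalent splitting
strategy works)::

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing.resample import resample_labels
>>> iris = datasets.load_iris()
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.25, random_state=0)
>>> indices = resample_labels(y_train, scaling=10.0, random_state=0)
>>> X_big, y_big = X_train[indices], y_train[indices]

Because only the training labels are resampled, none of the duplicated samples
can leak into the held-out test set.
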
4 changes: 3 additions & 1 deletion sklearn/preprocessing/__init__.py
@@ -1,6 +1,6 @@
"""
The :mod:`sklearn.preprocessing` module includes scaling, centering,
normalization, binarization and imputation methods.
normalization, binarization, imputation and resampling methods.
"""

from ._function_transformer import FunctionTransformer
@@ -34,6 +34,7 @@

from .imputation import Imputer

from .resample import resample_labels

__all__ = [
'Binarizer',
@@ -63,4 +64,5 @@
'label_binarize',
'quantile_transform',
'power_transform',
'resample_labels',
]