MRG+1: Add resample to preprocessing. #1454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status: Closed (wants to merge 4 commits)
130 changes: 130 additions & 0 deletions doc/modules/preprocessing.rst
@@ -667,3 +667,133 @@ error with a ``filterwarnings``::
For a full code example that demonstrates using a :class:`FunctionTransformer`
to do custom feature selection,
see :ref:`sphx_glr_auto_examples_preprocessing_plot_function_transformer.py`

.. _preprocessing_resample_labels:

Resampling of labels
====================

Balancing labels
----------------
Unless carefully designed, datasets rarely contain an equal number of samples
from each class. Resampling a dataset with :func:`resample_labels` lets the user
change the label distribution and simultaneously grow or shrink the dataset by
drawing samples from the original data. Noise may then be added to obtain a
training set that gives better accuracy or generalization from a machine
learning algorithm, or the resampled data can be used to test running time as
the dataset scales.

As an example of resampling, assume there is an unbalanced dataset with three
class labels ``[0, 1, 2]`` and sample counts ``[100, 125, 150]``. The ``method``
keyword of :func:`resample_labels` accepts one of three string options or,
alternatively, a ``dict``. Setting ``method='undersample'`` makes the number of
samples in the least common class determine the count for every class, in this
case 300 total samples, 100 from each class::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,100), np.repeat(1,125), np.repeat(2,150)])
>>> indices = resample_labels(y, method='undersample')
>>> print(np.bincount(y[indices]))
[100 100 100]

With ``method='oversample'``, the 150 samples of the most common class determine
the count for every class, for an output length of 450::

>>> indices = resample_labels(y, method='oversample')
>>> print(np.bincount(y[indices]))
[150 150 150]


Using ``method='balance'`` keeps the length of the dataset at 375 but equalizes
the count of each class by undersampling or oversampling classes as needed::

>>> indices = resample_labels(y, method='balance')
>>> print(np.bincount(y[indices]))
[125 125 125]

Keep in mind that if your dataset starts with very few samples of a class and
you choose the ``undersample`` option, the output dataset will be very small.
Using the ``scaling`` keyword as described below can keep the dataset large, at
the cost of many repeated samples. You may then add noise or reconsider your
approach.
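
For instance, a minimal sketch with a rare class of only ten samples (this
assumes ``method`` and ``scaling`` can be combined, with ``method`` fixing the
label distribution and ``scaling`` fixing the total output size, as described in
the sections below)::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,10), np.repeat(1,125), np.repeat(2,150)])
>>> print(np.bincount(y[resample_labels(y, method='undersample')]))
[10 10 10]
>>> indices = resample_labels(y, method='undersample', scaling=375)
>>> print(np.bincount(y[indices]))
[125 125 125]

Plain undersampling leaves only 30 samples, while adding ``scaling=375`` keeps
the dataset at its original size by repeating the rare class's ten samples.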


Custom label distribution
-------------------------
The three string options for the ``method`` keyword produce a balanced class
distribution, but unbalanced classes are also supported by passing a ``dict`` as
``method``. For instance, in the previous example, passing ``method={0: .5, 1: .25, 2: .25}``
gives a dataset where the 0 class label is twice as likely as a 1 or a 2.

Skew the distribution with a dict::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,100), np.repeat(1,125), np.repeat(2,150)])
>>> indices = resample_labels(y, method={0: .5, 1: .25, 2: .25}, random_state=4)
>>> print(np.bincount(y[indices]))
[203 89 83]


Resizing the dataset
--------------------
Changing the size of a dataset is also supported, which can lead to interesting
training possibilities. Perhaps the full dataset is very large and would take a
long time to train on. Setting ``scaling=3000``, for example, outputs 3000
samples with the desired label distribution. Setting ``scaling`` to a float
instead scales the original number of samples: ``scaling=.5`` outputs half the
number of samples in the original dataset, while ``scaling=3.0`` triples the
dataset.
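
Assuming ``scaling`` only affects the number of returned indices, a minimal
sketch reusing the ``y`` array from the earlier examples::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.concatenate([np.repeat(0,100), np.repeat(1,125), np.repeat(2,150)])
>>> print(len(resample_labels(y, scaling=3000)))
3000
>>> print(len(resample_labels(y, scaling=2.0)))
750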

Scaling up the size of a training set is also useful for making sure that
a machine learning algorithm can handle training and predicting large amounts
of data in a reasonable amount of time.

Take scaling up the iris dataset from 150 samples to 7,500 as an example::

>>> from sklearn import svm, datasets
>>> from sklearn.preprocessing.resample import resample_labels
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target
>>> indices = resample_labels(y, scaling=50.0)
>>> X0 = X[indices]   # 7,500 resampled samples
>>> y0 = y[indices]
>>> svc = svm.SVC(kernel='linear', C=1.0).fit(X, y)      # original 150 samples
>>> svc0 = svm.SVC(kernel='linear', C=1.0).fit(X0, y0)   # 50x the data

Training is almost instantaneous with the original 150 samples, takes slightly
longer with the 7,500 samples shown above, and can take over ten seconds with
1000x the data.
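
A rough way to check this on your own machine is to time the fits directly,
continuing from the example above. The loop below is only a sketch (timings
depend on hardware, so it is not run as a doctest)::

>>> import time                                              # doctest: +SKIP
>>> for factor in [1.0, 50.0, 1000.0]:                       # doctest: +SKIP
...     indices = resample_labels(y, scaling=factor)
...     tic = time.time()
...     _ = svm.SVC(kernel='linear', C=1.0).fit(X[indices], y[indices])
...     print(factor, time.time() - tic)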

Other resampling options
------------------------
Sampling can be done with or without replacement. The default,
``replace=False``, samples without replacement. So that the scaling feature
still works when sampling without replacement, the sample pool is refilled from
the original dataset whenever it runs out. Thus, if you set ``scaling=10.0`` and
sample without replacement, each sample will appear ten times in the output. In
effect, when sampling without replacement only the last repetition of the
dataset can vary, and only when the scaled size is not an exact multiple of the
original dataset.
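
As a quick check of this rule, here is a sketch on a tiny, already balanced
array (so that the result does not depend on the default ``method``); counting
how often each index is drawn shows every sample repeated ten times::

>>> import numpy as np
>>> from sklearn.preprocessing.resample import resample_labels
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> indices = resample_labels(y, scaling=10.0, replace=False)
>>> print(np.bincount(indices, minlength=len(y)))
[10 10 10 10 10 10]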

Shuffling the data can take considerable CPU time and is turned off by
default, but is possible by setting the keyword argument ``shuffle=True``.

Finally, for repeatability, you may set the ``random_state`` keyword
argument.

Keep training and testing datasets separate
-------------------------------------------
It is a fundamental machine learning practice to keep the training and
testing datasets completely separate, because estimators do very well on
samples they were trained on, and testing results would be overly
optimistic. When resampling, since you are duplicating samples exactly,
there is now the possibility that a sample finds its way into both the
training and testing sets.

Even ruling out coding errors, techniques such as scaling up a dataset and
then running cross-validation on it can place copies of the same sample in
both the training and validation sets. The danger is that your results are
not as good as the metrics report, or that the model generalizes worse than
expected. Therefore, take extreme care when resampling so that you do not
mix the training and testing datasets.
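
One way to avoid this pitfall is to split the data first and resample only the
training portion. A minimal sketch using
:func:`~sklearn.model_selection.train_test_split` (any equivalent splitting
strategy works)::

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing.resample import resample_labels
>>> iris = datasets.load_iris()
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.25, random_state=0)
>>> indices = resample_labels(y_train, scaling=10.0, random_state=0)
>>> X_big, y_big = X_train[indices], y_train[indices]

Because only the training labels are resampled, none of the duplicated samples
can leak into the held-out test set.
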
4 changes: 3 additions & 1 deletion sklearn/preprocessing/__init__.py
@@ -1,6 +1,6 @@
"""
The :mod:`sklearn.preprocessing` module includes scaling, centering,
normalization, binarization and imputation methods.
normalization, binarization, imputation and resampling methods.
"""

from ._function_transformer import FunctionTransformer
@@ -34,6 +34,7 @@

from .imputation import Imputer

from .resample import resample_labels

__all__ = [
'Binarizer',
@@ -63,4 +64,5 @@
'label_binarize',
'quantile_transform',
'power_transform',
'resample_labels',
]