[MRG+2] Merge discrete branch into master #9342

Merged: 36 commits into master from discrete, Jul 12, 2018
36 commits
7ef342e
ENH add fixed-width discretizer
hlin117 Jul 12, 2017
56902d3
Merge branch 'master' into discrete
jnothman Aug 31, 2017
a9cd64e
Fix docstring section heading
jnothman Sep 1, 2017
eef7bdb
add encoding option to KBinsDiscretizer (#9647)
qinhanmin2014 Sep 7, 2017
a099c26
[MRG] DOC fix link to user guide (#9705)
glemaitre Sep 8, 2017
0a91bce
Merge branch 'master' into discrete
TomDLT Nov 23, 2017
fb2a065
flake8 fix (a forced unrelated change)
qinhanmin2014 Nov 23, 2017
4e60a3d
drop duplicate section
qinhanmin2014 Nov 24, 2017
2c9134e
[MRG+2] discrete branch: add an example for KBinsDiscretizer (#10192)
qinhanmin2014 Nov 27, 2017
430af30
DOC add a second example for KBinsDiscretizer (#10195)
TomDLT Nov 27, 2017
4c034c4
DOC what's new for discretizer
jnothman Nov 27, 2017
070c969
Merge branch 'master' into discrete
qinhanmin2014 Nov 30, 2017
c36fa21
Remove blank line
jnothman Nov 30, 2017
2249f8a
comments from hlin177
qinhanmin2014 Nov 30, 2017
9b3d995
flake8 fail again, so forced to recover the unrelated change
qinhanmin2014 Nov 30, 2017
ef57c34
Merge branch 'master' into discrete
qinhanmin2014 Dec 6, 2017
f737fe6
Merge branch 'master' into discrete
jnothman May 30, 2018
ca44533
Merge branch 'discrete' of github.com:scikit-learn/scikit-learn into …
jnothman May 31, 2018
8c1c488
FIX check_methods_subset_invariance where estimator produces sparse o…
jnothman May 31, 2018
fd7c217
Merge branch 'master' into discrete
qinhanmin2014 Jun 4, 2018
8cf5567
Merge branch 'discrete' of github.com:scikit-learn/scikit-learn into …
jnothman Jun 4, 2018
7f29f15
Whitespace
jnothman Jun 4, 2018
fa07fc2
Update classes.rst
glemaitre Jun 4, 2018
0be6e84
doc test
qinhanmin2014 Jun 4, 2018
9603c3f
doc test
qinhanmin2014 Jun 4, 2018
e2b6240
Merge branch 'master' into 9342
TomDLT Jun 14, 2018
7caf178
FIX new parameter dtype after merging master
TomDLT Jun 14, 2018
e861914
FIX lazy merge
TomDLT Jun 14, 2018
8aa62ab
FIX add KBinsDiscretizer in DONT_TEST list
TomDLT Jun 15, 2018
582b42f
ENH remove warnings from example
TomDLT Jun 18, 2018
5a61af9
[MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (di…
TomDLT Jul 9, 2018
40d1f00
Merge branch 'master' into discrete
jnothman Jul 9, 2018
e4d1884
preprocessing.discretization -> preprocessing._discretization
jnothman Jul 9, 2018
bb719e1
Move _transform_selected helper to base.py
jnothman Jul 9, 2018
d1e2615
Merge branch 'master' into discrete
qinhanmin2014 Jul 10, 2018
e4089d1
[MRG] ENH Remove ignored_features in KBinsDiscretizer (#11467)
qinhanmin2014 Jul 11, 2018
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1240,6 +1240,7 @@ Model validation

preprocessing.Binarizer
preprocessing.FunctionTransformer
preprocessing.KBinsDiscretizer
preprocessing.KernelCenterer
preprocessing.LabelBinarizer
preprocessing.LabelEncoder
179 changes: 118 additions & 61 deletions doc/modules/preprocessing.rst
@@ -432,67 +432,6 @@ The normalizer instance can then be used on sample vectors as any transformer::
efficient Cython routines. To avoid unnecessary memory copies, it is
recommended to choose the CSR representation upstream.

.. _preprocessing_binarization:

Binarization
============

Feature binarization
--------------------

**Feature binarization** is the process of **thresholding numerical
features to get boolean values**. This can be useful for downstream
probabilistic estimators that make the assumption that the input data
is distributed according to a multi-variate `Bernoulli distribution
<https://en.wikipedia.org/wiki/Bernoulli_distribution>`_. For instance,
this is the case for the :class:`sklearn.neural_network.BernoulliRBM`.

It is also common among the text processing community to use binary
feature values (probably to simplify the probabilistic reasoning) even
if normalized counts (a.k.a. term frequencies) or TF-IDF valued features
often perform slightly better in practice.

As for the :class:`Normalizer`, the utility class
:class:`Binarizer` is meant to be used in the early stages of
:class:`sklearn.pipeline.Pipeline`. The ``fit`` method does nothing
as each sample is treated independently of others::

>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]

>>> binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[1., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])

It is possible to adjust the threshold of the binarizer::

>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[0., 0., 1.],
[1., 0., 0.],
[0., 0., 0.]])

As for the :class:`StandardScaler` and :class:`Normalizer` classes, the
preprocessing module provides a companion function :func:`binarize`
to be used when the transformer API is not necessary.

.. topic:: Sparse input

:func:`binarize` and :class:`Binarizer` accept **both dense array-like
and sparse matrices from scipy.sparse as input**.

For sparse input the data is **converted to the Compressed Sparse Rows
representation** (see ``scipy.sparse.csr_matrix``).
To avoid unnecessary memory copies, it is recommended to choose the CSR
representation upstream.


.. _preprocessing_categorical_features:

Encoding categorical features
@@ -589,6 +528,124 @@ columns for this feature will be all zeros
See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.

.. _preprocessing_discretization:

Discretization
==============

`Discretization <https://en.wikipedia.org/wiki/Discretization_of_continuous_features>`_
(otherwise known as quantization or binning) provides a way to partition continuous
features into discrete values. Certain datasets with continuous features
may benefit from discretization, because it can transform a dataset of
continuous attributes into one with only nominal attributes.

K-bins discretization
---------------------

:class:`KBinsDiscretizer` discretizes features into ``k`` bins::

>>> X = np.array([[ -3., 5., 15 ],
... [ 0., 6., 14 ],
... [ 6., 3., 11 ]])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

By default the output is one-hot encoded into a sparse matrix
(see :ref:`preprocessing_categorical_features`);
this behaviour can be configured with the ``encode`` parameter.
For each feature, the bin edges are computed during ``fit``; together with
the number of bins, they define the intervals. For the current example,
these intervals are:

- feature 1: :math:`{[-\infty, -1), [-1, 2), [2, \infty)}`
- feature 2: :math:`{[-\infty, 5), [5, \infty)}`
- feature 3: :math:`{[-\infty, 14), [14, \infty)}`

Based on these bin intervals, ``X`` is transformed as follows::

>>> est.transform(X) # doctest: +SKIP
array([[ 0., 1., 1.],
[ 1., 1., 1.],
[ 2., 0., 0.]])

The resulting dataset contains ordinal attributes which can be further used
in a :class:`sklearn.pipeline.Pipeline`.
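
For instance, the discretizer can be chained with a downstream estimator.
The following is a minimal sketch on made-up toy data (it is not one of the
shipped examples; the target ``y`` is invented purely for illustration)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(100, 2))      # two continuous features
    y = (X[:, 0] * X[:, 1] > 0).astype(int)    # a simple non-linear rule

    # one-hot encoded bins (sparse output) feeding a linear classifier
    clf = make_pipeline(KBinsDiscretizer(n_bins=10, encode='onehot'),
                        LogisticRegression())
    clf.fit(X, y)
    print(clf.score(X, y))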

Discretization is similar to constructing histograms for continuous data.
However, histograms focus on counting the samples that fall into particular
bins, whereas discretization focuses on assigning feature values to these bins.

:class:`KBinsDiscretizer` implements different binning strategies, which can be
selected with the ``strategy`` parameter. The 'uniform' strategy uses
constant-width bins. The 'quantile' strategy uses the quantile values to produce
equally populated bins in each feature. The 'kmeans' strategy defines bins based
on a k-means clustering procedure performed on each feature independently.
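
As a short illustration on made-up data (again not one of the shipped
examples), the bin edges found by each strategy on a single skewed feature
can be compared directly::

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.array([0., 0.5, 2., 3., 9., 10.]).reshape(-1, 1)  # one skewed feature

    for strategy in ('uniform', 'quantile', 'kmeans'):
        est = KBinsDiscretizer(n_bins=3, encode='ordinal',
                               strategy=strategy).fit(X)
        print(strategy, est.bin_edges_[0])

The 'uniform' edges are evenly spaced over the range of the feature, while
the 'quantile' and 'kmeans' edges adapt to where the values actually lie.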

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_discretization.py`
* :ref:`sphx_glr_auto_examples_plot_discretization_classification.py`
* :ref:`sphx_glr_auto_examples_plot_discretization_strategies.py`

.. _preprocessing_binarization:

Feature binarization
--------------------

**Feature binarization** is the process of **thresholding numerical
features to get boolean values**. This can be useful for downstream
probabilistic estimators that make the assumption that the input data
is distributed according to a multi-variate `Bernoulli distribution
<https://en.wikipedia.org/wiki/Bernoulli_distribution>`_. For instance,
this is the case for the :class:`sklearn.neural_network.BernoulliRBM`.

It is also common among the text processing community to use binary
feature values (probably to simplify the probabilistic reasoning) even
if normalized counts (a.k.a. term frequencies) or TF-IDF valued features
often perform slightly better in practice.

As for the :class:`Normalizer`, the utility class
:class:`Binarizer` is meant to be used in the early stages of
:class:`sklearn.pipeline.Pipeline`. The ``fit`` method does nothing
as each sample is treated independently of others::

>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]

>>> binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[1., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])

It is possible to adjust the threshold of the binarizer::

>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[0., 0., 1.],
[1., 0., 0.],
[0., 0., 0.]])

As for the :class:`StandardScaler` and :class:`Normalizer` classes, the
preprocessing module provides a companion function :func:`binarize`
to be used when the transformer API is not necessary.
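
A minimal sketch of the functional form, reusing the data above (no fitted
transformer is required)::

    from sklearn.preprocessing import binarize

    X = [[ 1., -1.,  2.],
         [ 2.,  0.,  0.],
         [ 0.,  1., -1.]]
    print(binarize(X, threshold=1.1))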

Note that the :class:`Binarizer` is similar to the :class:`KBinsDiscretizer`
when ``k = 2``, and when the bin edge is at the value ``threshold``.

.. topic:: Sparse input

:func:`binarize` and :class:`Binarizer` accept **both dense array-like
and sparse matrices from scipy.sparse as input**.

For sparse input the data is **converted to the Compressed Sparse Rows
representation** (see ``scipy.sparse.csr_matrix``).
To avoid unnecessary memory copies, it is recommended to choose the CSR
representation upstream.

.. _imputation:

Imputation of missing values
Expand Down
2 changes: 2 additions & 0 deletions doc/whats_new/_contributors.rst
@@ -153,3 +153,5 @@
.. _Joris Van den Bossche: https://github.com/jorisvandenbossche

.. _Roman Yurchak: https://github.com/rth

.. _Hanmin Qin: https://github.com/qinhanmin2014
7 changes: 7 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -127,6 +127,13 @@ Preprocessing
the maximum value in the features. :issue:`9151` and :issue:`10521` by
:user:`Vighnesh Birodkar <vighneshbirodkar>` and `Joris Van den Bossche`_.

- Added :class:`preprocessing.KBinsDiscretizer` for turning
continuous features into categorical or one-hot encoded
features. :issue:`7668`, :issue:`9647`, :issue:`10195`,
:issue:`10192`, :issue:`11272` and :issue:`11467`
by :user:`Henry Lin <hlin117>`, `Hanmin Qin`_
and `Tom Dupre la Tour`_.

- Added :class:`compose.ColumnTransformer`, which allows to apply
different transformers to different columns of arrays or pandas
DataFrames. :issue:`9012` by `Andreas Müller`_ and `Joris Van den Bossche`_,
86 changes: 86 additions & 0 deletions examples/preprocessing/plot_discretization.py
@@ -0,0 +1,86 @@
# -*- coding: utf-8 -*-

"""
================================================================
Using KBinsDiscretizer to discretize continuous features
================================================================

This example compares the predictions of linear regression (a linear model)
and decision tree regression (a tree-based model) with and without
discretization of real-valued features.

As is shown in the result before discretization, a linear model is fast to
build and relatively straightforward to interpret, but it can only model
linear relationships, while a decision tree can build a much more complex
model of the data. One way to make a linear model more powerful on
continuous data is to use discretization (also known as binning). In the
example, we discretize the feature and one-hot encode the transformed data.
Note that if the bins are not reasonably wide, there would appear to be a
substantially increased risk of overfitting, so the discretizer parameters
should usually be tuned under cross-validation.

After discretization, linear regression and the decision tree make exactly
the same predictions. As features are constant within each bin, any model
must predict the same value for all points within a bin. Compared with the
result before discretization, the linear model becomes much more flexible
while the decision tree becomes much less flexible. Note that binning
features generally has no beneficial effect for tree-based models, as these
models can learn to split up the data anywhere.

"""

# Author: Andreas Müller
# Hanmin Qin <qinhanmin2005@sina.com>
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

print(__doc__)

# construct the dataset
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y = np.sin(X) + rnd.normal(size=len(X)) / 3
X = X.reshape(-1, 1)

# transform the dataset with KBinsDiscretizer
enc = KBinsDiscretizer(n_bins=10, encode='onehot')
X_binned = enc.fit_transform(X)

# predict with original dataset
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
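# evaluation grid: evenly spaced points used to plot the functions learned by each model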
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
reg = LinearRegression().fit(X, y)
ax1.plot(line, reg.predict(line), linewidth=2, color='green',
label="linear regression")
reg = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X, y)
ax1.plot(line, reg.predict(line), linewidth=2, color='red',
label="decision tree")
ax1.plot(X[:, 0], y, 'o', c='k')
ax1.legend(loc="best")
ax1.set_ylabel("Regression output")
ax1.set_xlabel("Input feature")
ax1.set_title("Result before discretization")

# predict with transformed dataset
line_binned = enc.transform(line)
reg = LinearRegression().fit(X_binned, y)
ax2.plot(line, reg.predict(line_binned), linewidth=2, color='green',
linestyle='-', label='linear regression')
reg = DecisionTreeRegressor(min_samples_split=3,
random_state=0).fit(X_binned, y)
ax2.plot(line, reg.predict(line_binned), linewidth=2, color='red',
linestyle=':', label='decision tree')
ax2.plot(X[:, 0], y, 'o', c='k')
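# draw the learned bin edges as faint vertical lines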
ax2.vlines(enc.bin_edges_[0], *plt.gca().get_ylim(), linewidth=1, alpha=.2)
ax2.legend(loc="best")
ax2.set_xlabel("Input feature")
ax2.set_title("Result after discretization")

plt.tight_layout()
plt.show()