[MRG+2] Merge discrete branch into master #9342

Merged: 36 commits into master from discrete, Jul 12, 2018
36 commits
7ef342e
ENH add fixed-width discretizer
hlin117 Jul 12, 2017
56902d3
Merge branch 'master' into discrete
jnothman Aug 31, 2017
a9cd64e
Fix docstring section heading
jnothman Sep 1, 2017
eef7bdb
add encoding option to KBinsDiscretizer (#9647)
qinhanmin2014 Sep 7, 2017
a099c26
[MRG] DOC fix link to user guide (#9705)
glemaitre Sep 8, 2017
0a91bce
Merge branch 'master' into discrete
TomDLT Nov 23, 2017
fb2a065
flake8 fix (a forced unrelated change)
qinhanmin2014 Nov 23, 2017
4e60a3d
drop duplicate section
qinhanmin2014 Nov 24, 2017
2c9134e
[MRG+2] discrete branch: add an example for KBinsDiscretizer (#10192)
qinhanmin2014 Nov 27, 2017
430af30
DOC add a second example for KBinsDiscretizer (#10195)
TomDLT Nov 27, 2017
4c034c4
DOC what's new for discretizer
jnothman Nov 27, 2017
070c969
Merge branch 'master' into discrete
qinhanmin2014 Nov 30, 2017
c36fa21
Remove blank line
jnothman Nov 30, 2017
2249f8a
comments from hlin177
qinhanmin2014 Nov 30, 2017
9b3d995
flake8 fail again, so forced to recover the unrelated change
qinhanmin2014 Nov 30, 2017
ef57c34
Merge branch 'master' into discrete
qinhanmin2014 Dec 6, 2017
f737fe6
Merge branch 'master' into discrete
jnothman May 30, 2018
ca44533
Merge branch 'discrete' of github.com:scikit-learn/scikit-learn into …
jnothman May 31, 2018
8c1c488
FIX check_methods_subset_invariance where estimator produces sparse o…
jnothman May 31, 2018
fd7c217
Merge branch 'master' into discrete
qinhanmin2014 Jun 4, 2018
8cf5567
Merge branch 'discrete' of github.com:scikit-learn/scikit-learn into …
jnothman Jun 4, 2018
7f29f15
Whitespace
jnothman Jun 4, 2018
fa07fc2
Update classes.rst
glemaitre Jun 4, 2018
0be6e84
doc test
qinhanmin2014 Jun 4, 2018
9603c3f
doc test
qinhanmin2014 Jun 4, 2018
e2b6240
Merge branch 'master' into 9342
TomDLT Jun 14, 2018
7caf178
FIX new parameter dtype after merging master
TomDLT Jun 14, 2018
e861914
FIX lazy merge
TomDLT Jun 14, 2018
8aa62ab
FIX add KBinsDiscretizer in DONT_TEST list
TomDLT Jun 15, 2018
582b42f
ENH remove warnings from example
TomDLT Jun 18, 2018
5a61af9
[MRG+2] Implement two non-uniform strategies for KBinsDiscretizer (di…
TomDLT Jul 9, 2018
40d1f00
Merge branch 'master' into discrete
jnothman Jul 9, 2018
e4d1884
preprocessing.discretization -> preprocessing._discretization
jnothman Jul 9, 2018
bb719e1
Move _transform_selected helper to base.py
jnothman Jul 9, 2018
d1e2615
Merge branch 'master' into discrete
qinhanmin2014 Jul 10, 2018
e4089d1
[MRG] ENH Remove ignored_features in KBinsDiscretizer (#11467)
qinhanmin2014 Jul 11, 2018
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1240,6 +1240,7 @@ Model validation

preprocessing.Binarizer
preprocessing.FunctionTransformer
preprocessing.KBinsDiscretizer
preprocessing.KernelCenterer
preprocessing.LabelBinarizer
preprocessing.LabelEncoder
179 changes: 118 additions & 61 deletions doc/modules/preprocessing.rst
@@ -432,67 +432,6 @@ The normalizer instance can then be used on sample vectors as any transformer::
efficient Cython routines. To avoid unnecessary memory copies, it is
recommended to choose the CSR representation upstream.

.. _preprocessing_binarization:

Binarization
============

Feature binarization
--------------------

**Feature binarization** is the process of **thresholding numerical
features to get boolean values**. This can be useful for downstream
probabilistic estimators that make the assumption that the input data
is distributed according to a multi-variate `Bernoulli distribution
<https://en.wikipedia.org/wiki/Bernoulli_distribution>`_. For instance,
this is the case for the :class:`sklearn.neural_network.BernoulliRBM`.

It is also common among the text processing community to use binary
feature values (probably to simplify the probabilistic reasoning) even
if normalized counts (a.k.a. term frequencies) or TF-IDF valued features
often perform slightly better in practice.

As for the :class:`Normalizer`, the utility class
:class:`Binarizer` is meant to be used in the early stages of
:class:`sklearn.pipeline.Pipeline`. The ``fit`` method does nothing
as each sample is treated independently of others::

>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]

>>> binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[1., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])

It is possible to adjust the threshold of the binarizer::

>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[0., 0., 1.],
[1., 0., 0.],
[0., 0., 0.]])

As for the :class:`StandardScaler` and :class:`Normalizer` classes, the
preprocessing module provides a companion function :func:`binarize`
to be used when the transformer API is not necessary.

.. topic:: Sparse input

:func:`binarize` and :class:`Binarizer` accept **both dense array-like
and sparse matrices from scipy.sparse as input**.

For sparse input the data is **converted to the Compressed Sparse Rows
representation** (see ``scipy.sparse.csr_matrix``).
To avoid unnecessary memory copies, it is recommended to choose the CSR
representation upstream.


.. _preprocessing_categorical_features:

Encoding categorical features
@@ -589,6 +528,124 @@ columns for this feature will be all zeros
See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.

.. _preprocessing_discretization:

Discretization
==============

`Discretization <https://en.wikipedia.org/wiki/Discretization_of_continuous_features>`_
(otherwise known as quantization or binning) provides a way to partition continuous
features into discrete values. Certain datasets with continuous features
may benefit from discretization, because it can transform a dataset of
continuous attributes into one with only nominal attributes.

K-bins discretization
---------------------

:class:`KBinsDiscretizer` discretizes features into ``k`` bins::

>>> X = np.array([[ -3., 5., 15 ],
... [ 0., 6., 14 ],
... [ 6., 3., 11 ]])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

By default the output is one-hot encoded into a sparse matrix
(see :ref:`preprocessing_categorical_features`);
this behaviour can be configured with the ``encode`` parameter.
For each feature, the bin edges are computed during ``fit``; together with
the number of bins, they define the intervals. For the current example,
these intervals are:

- feature 1: :math:`{[-\infty, -1), [-1, 2), [2, \infty)}`
- feature 2: :math:`{[-\infty, 5), [5, \infty)}`
- feature 3: :math:`{[-\infty, 14), [14, \infty)}`

Based on these bin intervals, ``X`` is transformed as follows::

>>> est.transform(X) # doctest: +SKIP
array([[ 0., 1., 1.],
[ 1., 1., 1.],
[ 2., 0., 0.]])

The resulting dataset contains ordinal attributes which can be further used
in a :class:`sklearn.pipeline.Pipeline`.
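
For instance, the discretizer can be chained with a downstream estimator.
The following is a minimal sketch on made-up toy data (it is not one of the
shipped examples; the target ``y`` is invented purely for illustration)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(100, 2))      # two continuous features
    y = (X[:, 0] * X[:, 1] > 0).astype(int)    # a simple non-linear rule

    # one-hot encoded bins (sparse output) feeding a linear classifier
    clf = make_pipeline(KBinsDiscretizer(n_bins=10, encode='onehot'),
                        LogisticRegression())
    clf.fit(X, y)
    print(clf.score(X, y))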

Discretization is similar to constructing histograms for continuous data.
However, histograms focus on counting the samples that fall into particular
bins, whereas discretization focuses on assigning feature values to these bins.

:class:`KBinsDiscretizer` implements different binning strategies, which can be
selected with the ``strategy`` parameter. The 'uniform' strategy uses
constant-width bins. The 'quantile' strategy uses the quantile values to produce
equally populated bins in each feature. The 'kmeans' strategy defines bins based
on a k-means clustering procedure performed on each feature independently.
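
As a short illustration on made-up data (again not one of the shipped
examples), the bin edges found by each strategy on a single skewed feature
can be compared directly::

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.array([0., 0.5, 2., 3., 9., 10.]).reshape(-1, 1)  # one skewed feature

    for strategy in ('uniform', 'quantile', 'kmeans'):
        est = KBinsDiscretizer(n_bins=3, encode='ordinal',
                               strategy=strategy).fit(X)
        print(strategy, est.bin_edges_[0])

The 'uniform' edges are evenly spaced over the range of the feature, while
the 'quantile' and 'kmeans' edges adapt to where the values actually lie.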

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_discretization.py`
* :ref:`sphx_glr_auto_examples_plot_discretization_classification.py`
* :ref:`sphx_glr_auto_examples_plot_discretization_strategies.py`

.. _preprocessing_binarization:

Feature binarization
--------------------

**Feature binarization** is the process of **thresholding numerical
features to get boolean values**. This can be useful for downstream
probabilistic estimators that make the assumption that the input data
is distributed according to a multi-variate `Bernoulli distribution
<https://en.wikipedia.org/wiki/Bernoulli_distribution>`_. For instance,
this is the case for the :class:`sklearn.neural_network.BernoulliRBM`.

It is also common among the text processing community to use binary
feature values (probably to simplify the probabilistic reasoning) even
if normalized counts (a.k.a. term frequencies) or TF-IDF valued features
often perform slightly better in practice.

As for the :class:`Normalizer`, the utility class
:class:`Binarizer` is meant to be used in the early stages of
:class:`sklearn.pipeline.Pipeline`. The ``fit`` method does nothing
as each sample is treated independently of others::

>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]

>>> binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[1., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])

It is possible to adjust the threshold of the binarizer::

>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[0., 0., 1.],
[1., 0., 0.],
[0., 0., 0.]])

As for the :class:`StandardScaler` and :class:`Normalizer` classes, the
preprocessing module provides a companion function :func:`binarize`
to be used when the transformer API is not necessary.
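
A minimal sketch of the functional form, reusing the data above (no fitted
transformer is required)::

    from sklearn.preprocessing import binarize

    X = [[ 1., -1.,  2.],
         [ 2.,  0.,  0.],
         [ 0.,  1., -1.]]
    print(binarize(X, threshold=1.1))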

Note that the :class:`Binarizer` is similar to the :class:`KBinsDiscretizer`
when ``k = 2``, and when the bin edge is at the value ``threshold``.

.. topic:: Sparse input

:func:`binarize` and :class:`Binarizer` accept **both dense array-like
and sparse matrices from scipy.sparse as input**.

For sparse input the data is **converted to the Compressed Sparse Rows
representation** (see ``scipy.sparse.csr_matrix``).
To avoid unnecessary memory copies, it is recommended to choose the CSR
representation upstream.

.. _imputation:

Imputation of missing values
Expand Down
2 changes: 2 additions & 0 deletions doc/whats_new/_contributors.rst
@@ -153,3 +153,5 @@
.. _Joris Van den Bossche: https://github.com/jorisvandenbossche

.. _Roman Yurchak: https://github.com/rth

.. _Hanmin Qin: https://github.com/qinhanmin2014
7 changes: 7 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -127,6 +127,13 @@ Preprocessing
the maximum value in the features. :issue:`9151` and :issue:`10521` by
:user:`Vighnesh Birodkar <vighneshbirodkar>` and `Joris Van den Bossche`_.

- Added :class:`preprocessing.KBinsDiscretizer` for turning
continuous features into categorical or one-hot encoded
features. :issue:`7668`, :issue:`9647`, :issue:`10195`,
:issue:`10192`, :issue:`11272` and :issue:`11467`
by :user:`Henry Lin <hlin117>`, `Hanmin Qin`_
and `Tom Dupre la Tour`_.

- Added :class:`compose.ColumnTransformer`, which allows to apply
different transformers to different columns of arrays or pandas
DataFrames. :issue:`9012` by `Andreas Müller`_ and `Joris Van den Bossche`_,
86 changes: 86 additions & 0 deletions examples/preprocessing/plot_discretization.py
@@ -0,0 +1,86 @@
# -*- coding: utf-8 -*-

"""
================================================================
Using KBinsDiscretizer to discretize continuous features
================================================================

This example compares the predictions of linear regression (a linear model)
and decision tree regression (a tree-based model) with and without
discretization of real-valued features.

As is shown in the result before discretization, a linear model is fast to
build and relatively straightforward to interpret, but it can only model
linear relationships, while a decision tree can build a much more complex
model of the data. One way to make a linear model more powerful on
continuous data is to use discretization (also known as binning). In the
example, we discretize the feature and one-hot encode the transformed data.
Note that if the bins are not reasonably wide, there would appear to be a
substantially increased risk of overfitting, so the discretizer parameters
should usually be tuned under cross-validation.

After discretization, linear regression and the decision tree make exactly
the same predictions. As features are constant within each bin, any model
must predict the same value for all points within a bin. Compared with the
result before discretization, the linear model becomes much more flexible
while the decision tree becomes much less flexible. Note that binning
features generally has no beneficial effect for tree-based models, as these
models can learn to split up the data anywhere.

"""

# Author: Andreas Müller
# Hanmin Qin <qinhanmin2005@sina.com>
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

print(__doc__)

# construct the dataset
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y = np.sin(X) + rnd.normal(size=len(X)) / 3
X = X.reshape(-1, 1)

# transform the dataset with KBinsDiscretizer
enc = KBinsDiscretizer(n_bins=10, encode='onehot')
X_binned = enc.fit_transform(X)

# predict with original dataset
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
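# evaluation grid: evenly spaced points used to plot the functions learned by each model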
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
reg = LinearRegression().fit(X, y)
ax1.plot(line, reg.predict(line), linewidth=2, color='green',
label="linear regression")
reg = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X, y)
ax1.plot(line, reg.predict(line), linewidth=2, color='red',
label="decision tree")
ax1.plot(X[:, 0], y, 'o', c='k')
ax1.legend(loc="best")
ax1.set_ylabel("Regression output")
ax1.set_xlabel("Input feature")
ax1.set_title("Result before discretization")

# predict with transformed dataset
line_binned = enc.transform(line)
reg = LinearRegression().fit(X_binned, y)
ax2.plot(line, reg.predict(line_binned), linewidth=2, color='green',
linestyle='-', label='linear regression')
reg = DecisionTreeRegressor(min_samples_split=3,
random_state=0).fit(X_binned, y)
ax2.plot(line, reg.predict(line_binned), linewidth=2, color='red',
linestyle=':', label='decision tree')
ax2.plot(X[:, 0], y, 'o', c='k')
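# draw the learned bin edges as faint vertical lines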
ax2.vlines(enc.bin_edges_[0], *plt.gca().get_ylim(), linewidth=1, alpha=.2)
ax2.legend(loc="best")
ax2.set_xlabel("Input feature")
ax2.set_title("Result after discretization")

plt.tight_layout()
plt.show()