
[MRG + 1] ENH: new CategoricalEncoder class #9151


Merged (37 commits, Nov 21, 2017)
Commits
70d8165
Added CategoricalEncoder class - deprecating OneHotEncoder
vighneshbirodkar Mar 18, 2016
bea23a5
First round of updates
jorisvandenbossche Jun 19, 2017
fda6d27
fix + test specifying of categories
jorisvandenbossche Jun 26, 2017
5f2b403
further clean-up + tests
jorisvandenbossche Jun 27, 2017
e175e4c
fix skipping pandas test
jorisvandenbossche Jun 27, 2017
dfaa9c0
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Aug 1, 2017
4f64648
feedback andy
jorisvandenbossche Aug 1, 2017
01c3bd4
add encoding keyword to support ordinal encoding
jorisvandenbossche Aug 7, 2017
dcef19c
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Aug 9, 2017
2ed91e8
remove y from transform signature
jorisvandenbossche Aug 9, 2017
a589dd9
Remove sparse keyword in favor of encoding='onehot-dense'
jorisvandenbossche Aug 9, 2017
17e5e69
Let encoding='ordinal' follow dtype keyword
jorisvandenbossche Aug 9, 2017
47a88dd
add categories_ attribute
jorisvandenbossche Aug 9, 2017
7b5b476
expand docs on ordinal + feedback
jorisvandenbossche Aug 25, 2017
5f26bdc
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Oct 19, 2017
3dcc07f
feedback Andy
jorisvandenbossche Oct 19, 2017
5f5934f
add whatsnew note
jorisvandenbossche Oct 19, 2017
c6a5d30
for now raise on unsorted passed categories
jorisvandenbossche Oct 20, 2017
ad5fdc7
Implement inverse_transform
jorisvandenbossche Oct 20, 2017
eb2f4b8
fix example to have sorted categories
jorisvandenbossche Oct 20, 2017
ce82c28
backport scipy sparse argmax
jorisvandenbossche Oct 27, 2017
64aeff5
check handle_unknown before computation in fit
jorisvandenbossche Oct 27, 2017
4f8efcf
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Oct 27, 2017
a1c0982
make scipy backport private
jorisvandenbossche Oct 27, 2017
85cf315
Directly construct CSR matrix
jorisvandenbossche Oct 30, 2017
b40bd8e
try to preserve original dtype if resulting dtype is not string
jorisvandenbossche Oct 30, 2017
2d9b4dd
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Oct 31, 2017
a31bb2a
Remove copying of data, only copy when needed in transform + add test
jorisvandenbossche Oct 31, 2017
2ef5fb9
add test for input dtypes / categories_ dtypes
jorisvandenbossche Oct 31, 2017
937446e
doc updates based on feedback
jorisvandenbossche Oct 31, 2017
a83102c
fix docstring example for python 2
jorisvandenbossche Oct 31, 2017
fbe9ea7
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Nov 7, 2017
21d9c0c
add checking of shape of X in inverse_transform
jorisvandenbossche Nov 7, 2017
929362f
loopify dtype tests
jorisvandenbossche Nov 9, 2017
a6d55d1
reword example on unknown categories
jorisvandenbossche Nov 9, 2017
9aeeb6d
clarify docs
jorisvandenbossche Nov 9, 2017
c39aa0c
remove repeated one
jorisvandenbossche Nov 9, 2017
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1198,6 +1198,7 @@ Model validation
preprocessing.MinMaxScaler
preprocessing.Normalizer
preprocessing.OneHotEncoder
preprocessing.CategoricalEncoder
preprocessing.PolynomialFeatures
preprocessing.QuantileTransformer
preprocessing.RobustScaler
112 changes: 76 additions & 36 deletions doc/modules/preprocessing.rst
@@ -455,47 +455,87 @@ Such features can be efficiently coded as integers, for instance
``[0, 1, 3]`` while ``["female", "from Asia", "uses Chrome"]`` would be
``[1, 2, 1]``.

Such integer representation can not be used directly with scikit-learn estimators, as these
expect continuous input, and would interpret the categories as being ordered, which is often
not desired (i.e. the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is
implemented in :class:`OneHotEncoder`. This estimator transforms each
categorical feature with ``m`` possible values into ``m`` binary features, with
only one active.
To convert categorical features to such integer codes, we can use the
:class:`CategoricalEncoder`. When specifying that we want to perform an
ordinal encoding, the estimator transforms each categorical feature to one
new feature of integers (0 to n_categories - 1)::

>>> enc = preprocessing.CategoricalEncoder(encoding='ordinal')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
encoding='ordinal', handle_unknown='error')
>>> enc.transform([['female', 'from US', 'uses Safari']])
array([[ 0., 1., 1.]])
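The mechanics of the ordinal encoding above can be sketched with plain numpy (a minimal illustration, not scikit-learn's implementation; `ordinal_encode` is a hypothetical helper):

```python
import numpy as np

def ordinal_encode(X):
    # Each column's sorted unique values become its categories; every entry
    # is replaced by the index of its category, mirroring encoding='ordinal'.
    X = np.asarray(X, dtype=object)
    categories = []
    codes = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        cats, inv = np.unique(X[:, j], return_inverse=True)
        categories.append(cats)
        codes[:, j] = inv
    return codes, categories

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
codes, cats = ordinal_encode(X)
# codes -> [[1., 1., 1.], [0., 0., 0.]]
# so transforming ['female', 'from US', 'uses Safari'] gives [0., 1., 1.],
# matching the doctest above
```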

Such integer representation can, however, not be used directly with all
scikit-learn estimators, as these expect continuous input, and would interpret
the categories as being ordered, which is often not desired (i.e. the set of
browsers was ordered arbitrarily).

Another possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K, also known as one-hot or
dummy encoding.
This type of encoding is the default behaviour of the :class:`CategoricalEncoder`.
The :class:`CategoricalEncoder` then transforms each categorical feature with
``n_categories`` possible values into ``n_categories`` binary features, with
one of them 1, and all others 0.

Continuing the example above::

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])

By default, how many values each feature can take is inferred automatically from the dataset.
It is possible to specify this explicitly using the parameter ``n_values``.
There are two genders, three possible continents and four web browsers in our
dataset.
Then we fit the estimator, and transform a data point.
In the result, the first two numbers encode the gender, the next set of three
numbers the continent and the last four the web browser.

Note that, if there is a possibility that the training data might have missing categorical
features, one has to explicitly set ``n_values``. For example,

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
>>> enc = preprocessing.CategoricalEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
encoding='onehot', handle_unknown='error')
>>> enc.transform([['female', 'from US', 'uses Safari'],
... ['male', 'from Europe', 'uses Safari']]).toarray()
array([[ 1., 0., 0., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 1.]])

By default, the values each feature can take are inferred automatically
from the dataset and can be found in the ``categories_`` attribute::

>>> enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
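The one-hot expansion itself can be sketched in a few lines of numpy (a hypothetical illustration of the behaviour described above, not the estimator's actual code; `categories` stands in for the fitted ``categories_``):

```python
import numpy as np

def one_hot_encode(X, categories):
    # Each feature with n_categories values becomes n_categories 0/1
    # columns, with exactly one 1 per row.
    X = np.asarray(X, dtype=object)
    blocks = []
    for j, cats in enumerate(categories):
        pos = np.searchsorted(cats, X[:, j])        # index of each value
        block = np.zeros((X.shape[0], len(cats)))
        block[np.arange(X.shape[0]), pos] = 1.0
        blocks.append(block)
    return np.hstack(blocks)

categories = [np.array(['female', 'male']),
              np.array(['from Europe', 'from US']),
              np.array(['uses Firefox', 'uses Safari'])]
one_hot_encode([['female', 'from US', 'uses Safari']], categories)
# -> [[1., 0., 0., 1., 0., 1.]], matching the doctest above
```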

It is possible to specify this explicitly using the parameter ``categories``.
There are two genders, four possible continents and four web browsers in our
dataset::

>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.CategoricalEncoder(categories=[genders, locations, browsers])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categories=[...],
dtype=<... 'numpy.float64'>, encoding='onehot',
handle_unknown='error')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

If there is a possibility that the training data might have missing categorical
features, it can often be better to specify ``handle_unknown='ignore'`` instead
of setting the ``categories`` manually as above. When
``handle_unknown='ignore'`` is specified and unknown categories are encountered
during transform, no error will be raised but the resulting one-hot encoded
columns for this feature will be all zeros
Member:
Really? If a category doesn't exist in training, how can it produce a zero column?? I thought this was the description for when specifying categories.

@jorisvandenbossche (Member, Author), Nov 9, 2017:
OK, I agree this can be confusing, but you can read it both ways: also in the case of handle_unknown='ignore' you end up with all zeros, just fewer zeros than with the manually specified categories.

Note that this is also how it is explained in the class docstring's explanation of handle_unknown

@jorisvandenbossche (Member, Author):

On second thought: IMO it is correct as I stated it, and I don't think it can be read both ways:

  • if you manually specify categories, and you encounter a category that didn't exist in training but was specified in the categories: you get a 1 for that category (so e.g. [0, 0, 0, 1])
  • if you didn't specify categories but used handle_unknown='ignore', and you encounter a category that didn't exist in training: you get all zeros (so e.g. [0, 0, 0])

So for me the above explanation is correct. It is not that easy to clearly word the idea of "a zero for each dummy column for that specific feature in that row", so if you have a better wording than "the resulting one-hot encoded columns for this feature will be all zeros", always welcome.

Or if the above is not clear, can you try to clarify your confusion with the text?
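The two bullet cases can be made concrete with a tiny numpy sketch (a hypothetical illustration for a single feature fit on ['a', 'b', 'c'] and then shown the unseen value 'd', not code from the estimator):

```python
import numpy as np

train_unique = np.array(['a', 'b', 'c'])   # categories inferred from training

# Case 1: categories=['a', 'b', 'c', 'd'] was passed manually, so 'd' gets
# its own column even though it never appeared in training.
manual = np.array(['a', 'b', 'c', 'd'])
row = np.isin(manual, ['d']).astype(float)         # -> [0., 0., 0., 1.]

# Case 2: categories were inferred and handle_unknown='ignore'; 'd' is
# unknown, so every inferred column stays zero.
ignored = np.isin(train_unique, ['d']).astype(float)   # -> [0., 0., 0.]
```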

Member:

You're right. I misread. Thanks for the explanation.

(``handle_unknown='ignore'`` is only supported for one-hot encoding)::

>>> enc = preprocessing.CategoricalEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
encoding='onehot', handle_unknown='ignore')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[ 1., 0., 0., 0., 0., 0.]])
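The ignore behaviour can be mimicked with a small numpy sketch (hypothetical helper, assuming the same inferred categories as above): a known value sets its column to 1, an unknown value simply leaves that feature's block all zeros instead of raising.

```python
import numpy as np

def one_hot_ignore_unknown(X, categories):
    X = np.asarray(X, dtype=object)
    blocks = []
    for j, cats in enumerate(categories):
        block = np.zeros((X.shape[0], len(cats)))
        for i, val in enumerate(X[:, j]):
            hits = np.flatnonzero(cats == val)
            if hits.size:              # known category -> mark its column
                block[i, hits[0]] = 1.0
            # unknown category -> row stays all zeros for this feature
        blocks.append(block)
    return np.hstack(blocks)

categories = [np.array(['female', 'male']),
              np.array(['from Europe', 'from US']),
              np.array(['uses Firefox', 'uses Safari'])]
one_hot_ignore_unknown([['female', 'from Asia', 'uses Chrome']], categories)
# -> [[1., 0., 0., 0., 0., 0.]], matching the doctest above
```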


See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.
as a dict, not as scalars.

.. _imputation:

Expand Down
2 changes: 2 additions & 0 deletions doc/whats_new/_contributors.rst
@@ -150,3 +150,5 @@
.. _Neeraj Gangwar: http://neerajgangwar.in

.. _Arthur Mensch: https://amensch.fr

.. _Joris Van den Bossche: https://github.com/jorisvandenbossche
11 changes: 11 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -44,6 +44,17 @@ Classifiers and regressors
Naive Bayes classifier described in Rennie et al. (2003).
By :user:`Michael A. Alcorn <airalcorn2>`.

Preprocessing

- Added :class:`preprocessing.CategoricalEncoder`, which allows encoding
categorical features as a numeric array, either using a one-hot (or
dummy) encoding scheme or by converting to ordinal integers.
Compared to the existing :class:`OneHotEncoder`, this new class handles
encoding of all feature types (also handles string-valued features) and
derives the categories based on the unique values in the features instead of
the maximum value in the features.
By :user:`Vighnesh Birodkar <vighneshbirodkar>` and `Joris Van den Bossche`_.
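The difference in category derivation called out here can be sketched with plain numpy (an illustration of the contrast as described, not code from either class):

```python
import numpy as np

# For an integer feature, the legacy OneHotEncoder derived the categories
# from the maximum value (range 0..max), while CategoricalEncoder uses the
# unique values actually present in the data.
col = np.array([0, 2, 5])

legacy_categories = np.arange(col.max() + 1)   # [0 1 2 3 4 5] -> 6 dummy columns
unique_categories = np.unique(col)             # [0 2 5]       -> 3 dummy columns
```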

Model evaluation

- Added the :func:`metrics.balanced_accuracy` metric and a corresponding
6 changes: 3 additions & 3 deletions examples/ensemble/plot_feature_transformation.py
@@ -34,7 +34,7 @@
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import CategoricalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline
@@ -62,7 +62,7 @@

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_enc = CategoricalEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
@@ -72,7 +72,7 @@
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_enc = CategoricalEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
7 changes: 4 additions & 3 deletions sklearn/feature_extraction/dict_vectorizer.py
@@ -39,7 +39,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
However, note that this transformer will only do a binary one-hot encoding
when feature values are of type string. If categorical features are
represented as numeric values such as int, the DictVectorizer can be
followed by OneHotEncoder to complete binary one-hot encoding.
followed by :class:`sklearn.preprocessing.CategoricalEncoder` to complete
binary one-hot encoding.

Features that do not occur in a sample (mapping) will have a zero value
in the resulting array/matrix.
@@ -88,8 +89,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
See also
--------
FeatureHasher : performs vectorization using only a hash function.
sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
encoded as columns of integers.
sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
features encoded as columns of arbitrary data types.
"""

def __init__(self, dtype=np.float64, separator="=", sparse=True,
2 changes: 2 additions & 0 deletions sklearn/preprocessing/__init__.py
@@ -22,6 +22,7 @@
from .data import minmax_scale
from .data import quantile_transform
from .data import OneHotEncoder
from .data import CategoricalEncoder

from .data import PolynomialFeatures

@@ -46,6 +47,7 @@
'QuantileTransformer',
'Normalizer',
'OneHotEncoder',
'CategoricalEncoder',
'RobustScaler',
'StandardScaler',
'add_dummy_feature',