[MRG] Added CategoricalEncoder class deprecating OneHotEncoder #6559

Closed
2 changes: 1 addition & 1 deletion doc/modules/classes.rst
@@ -1176,7 +1176,7 @@ See the :ref:`metrics` section of the user guide for further details.
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.Normalizer
preprocessing.OneHotEncoder
preprocessing.CategoricalEncoder
preprocessing.PolynomialFeatures
preprocessing.RobustScaler
preprocessing.StandardScaler
43 changes: 25 additions & 18 deletions doc/modules/preprocessing.rst
@@ -390,38 +390,45 @@ not desired (i.e. the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is
implemented in :class:`OneHotEncoder`. This estimator transforms each
implemented in :class:`CategoricalEncoder`. This estimator transforms each
categorical feature with ``m`` possible values into ``m`` binary features, with
only one active.

Continuing the example above::

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
>>> enc = preprocessing.CategoricalEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categorical_features='all', classes='auto',
dtype=<... 'numpy.float64'>, handle_unknown='error',
sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari']]).toarray()
array([[ 1., 0., 0., 1., 0., 1.]])


By default, the values each feature can take are inferred automatically from the dataset.
It is possible to specify this explicitly using the parameter ``n_values``.
It is possible to specify this explicitly using the parameter ``classes``.
There are two genders, two possible locations and two web browsers in our
dataset.
Then we fit the estimator, and transform a data point.
In the result, the first two numbers encode the gender, the next two the
location and the last two the web browser.
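
To make the column layout concrete, here is a minimal sketch that reproduces
the same six-column encoding by hand with plain NumPy, assuming (as the output
above suggests) that the categories of each feature are ordered
lexicographically::

    import numpy as np

    X = [['male', 'from US', 'uses Safari'],
         ['female', 'from Europe', 'uses Firefox']]

    # Categories per column, sorted lexicographically, as inferred from X.
    categories = [sorted({row[i] for row in X}) for i in range(3)]

    def one_hot(row):
        # One indicator block per feature, concatenated left to right.
        return np.hstack([[float(value == c) for c in cats]
                          for value, cats in zip(row, categories)])

    one_hot(['female', 'from US', 'uses Safari'])
    # -> array([ 1., 0., 0., 1., 0., 1.]), matching the transform above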

Note that, if the training data might not contain all possible values of the categorical
features, one has to explicitly set ``n_values``. For example,
features, one has to explicitly set ``classes``. For example,

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> genders = ['male', 'female']
>>> locations = ['from Europe', 'from US', 'from Africa', 'from Asia']
>>> browsers = ['uses Safari', 'uses Firefox', 'uses IE', 'uses Chrome']
>>> enc = preprocessing.CategoricalEncoder(classes=[genders, locations, browsers])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categorical_features='all',
classes=[...],
dtype=<... 'numpy.float64'>, handle_unknown='error',
sparse=True)

>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[ 1., 0., 0., 0., 0., 1., 1., 0., 0., 0.]])
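
The repr above shows ``handle_unknown='error'`` as the default. A minimal
sketch of what that presumably implies, assuming this branch is installed and
that categories not listed in ``classes`` are rejected at transform time (an
assumption based on the parameter name, not verified against this PR)::

    try:
        enc.transform([['female', 'from Antarctica', 'uses Chrome']])
    except Exception as exc:  # the exact exception type is an assumption
        print('unlisted category rejected:', exc)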

See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.
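
For instance, a minimal :class:`sklearn.feature_extraction.DictVectorizer`
sketch (its API is not changed by this pull request; the data is purely
illustrative)::

    from sklearn.feature_extraction import DictVectorizer

    measurements = [
        {'city': 'Dubai', 'temperature': 33.},
        {'city': 'London', 'temperature': 12.},
    ]
    vec = DictVectorizer(sparse=False)
    # The string-valued 'city' feature is one-hot encoded; the numeric
    # 'temperature' feature is passed through unchanged.
    vec.fit_transform(measurements)
    vec.get_feature_names()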
6 changes: 3 additions & 3 deletions examples/ensemble/plot_feature_transformation.py
@@ -34,7 +34,7 @@
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import CategoricalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline
@@ -62,7 +62,7 @@

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_enc = CategoricalEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
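# rf.apply(X_train) returns, for each sample, the index of the leaf it falls
# into in every tree of the forest; the encoder below one-hot encodes these
# leaf indices so the linear model gets one binary column per (tree, leaf).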
rf_enc.fit(rf.apply(X_train))
@@ -72,7 +72,7 @@
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_enc = CategoricalEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
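# GradientBoostingClassifier.apply returns an array of shape
# (n_samples, n_estimators, n_classes); for binary classification the last
# axis has length 1, so [:, :, 0] keeps one leaf index per boosting stage.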
grd_enc.fit(grd.apply(X_train)[:, :, 0])
7 changes: 4 additions & 3 deletions sklearn/feature_extraction/dict_vectorizer.py
@@ -40,7 +40,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
However, note that this transformer will only do a binary one-hot encoding
when feature values are of type string. If categorical features are
represented as numeric values such as int, the DictVectorizer can be
followed by OneHotEncoder to complete binary one-hot encoding.
followed by :class:`sklearn.preprocessing.CategoricalEncoder` to complete
binary one-hot encoding.

Features that do not occur in a sample (mapping) will have a zero value
in the resulting array/matrix.
@@ -89,8 +90,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
See also
--------
FeatureHasher : performs vectorization using only a hash function.
sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
encoded as columns of integers.
sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
features encoded as columns of arbitrary data types.
"""

def __init__(self, dtype=np.float64, separator="=", sparse=True,
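A minimal sketch of the note in the docstring above, assuming this PR's branch
is installed and that CategoricalEncoder accepts integer-coded columns (as the
"columns of arbitrary data types" wording in the See also entry suggests):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import CategoricalEncoder

    # Categories already coded as integers pass through DictVectorizer
    # unchanged, so a follow-up encoder is needed to finish one-hot encoding.
    D = [{'color': 2, 'shape': 0}, {'color': 0, 'shape': 1}]
    codes = DictVectorizer(sparse=False).fit_transform(D)
    onehot = CategoricalEncoder(sparse=False).fit_transform(codes)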
2 changes: 2 additions & 0 deletions sklearn/preprocessing/__init__.py
@@ -20,6 +20,7 @@
from .data import maxabs_scale
from .data import minmax_scale
from .data import OneHotEncoder
from .data import CategoricalEncoder

from .data import PolynomialFeatures

@@ -43,6 +44,7 @@
'MaxAbsScaler',
'Normalizer',
'OneHotEncoder',
'CategoricalEncoder',
'RobustScaler',
'StandardScaler',
'add_dummy_feature',
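Both names remain importable during the deprecation period, as the ``__all__``
entries above show; a minimal sketch, assuming this PR's branch is installed:

    # Old and new class side by side while OneHotEncoder goes through its
    # deprecation cycle (assumption: this branch is installed).
    from sklearn.preprocessing import OneHotEncoder, CategoricalEncoder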