
[MRG+1] UnaryEncoder to encode ordinal features into unary levels #8652

Closed
wants to merge 15 commits
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1201,6 +1201,7 @@ Model validation
preprocessing.Normalizer
preprocessing.OneHotEncoder
preprocessing.CategoricalEncoder
preprocessing.UnaryEncoder
preprocessing.PolynomialFeatures
preprocessing.PowerTransformer
preprocessing.QuantileTransformer
60 changes: 60 additions & 0 deletions doc/modules/preprocessing.rst
@@ -589,6 +589,66 @@ columns for this feature will be all zeros
See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.

.. _preprocessing_ordinal_features:

Encoding ordinal features
=============================
Often, categorical features have a clear ordering. For example, a person could
have features

* ``["short", "tall"]``
* ``["low income", "medium income", "high income"]``
* ``["elementary school graduate", "high school graduate", "some college",
"college graduate"]``

Review comment (Member): There is one space too much here at the beginning of the line (to align it with the previous line, you can always check the built docs in https://15748-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/_changed.html, see explanation in http://scikit-learn.org/stable/developers/contributing.html#documentation).

Even though these features can be ordered, we shouldn't necessarily assign
scores to them, as the difference between categories one and two is not the
same as the difference between categories two and three.

One possibility to convert these ordinal features to features that can be used
with scikit-learn estimators is to use a unary encoding, which is
implemented in :class:`UnaryEncoder`. This estimator transforms each
ordinal feature with ``m`` possible values into ``m - 1`` binary features,
where the ith feature is active if x > i (for i = 0, ... k - 1).

Review comment (Member): Where is k coming from? Is this the same as m? (or can k - 1 be m - 2?)


.. note::

This encoding is likely to help when used with linear models and
kernel-based models like SVMs with the standard kernels. On the other hand, this
transformation is unlikely to help when used with tree-based models,
since those already work on the basis of a particular feature value being
below or above a threshold, unlike linear and kernel-based models.
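
For intuition, here is a small NumPy sketch of the same expansion for a single
feature (illustrative only; ``unary_encode_column`` is not part of the
proposed API)::

    >>> import numpy as np
    >>> def unary_encode_column(x, m):
    ...     # one binary column per threshold i = 0, ..., m - 2;
    ...     # column i is active exactly when x > i
    ...     return (x[:, None] > np.arange(m - 1)).astype(np.float64)
    >>> unary_encode_column(np.array([0, 1, 3]), m=4)
    array([[ 0., 0., 0.],
           [ 1., 0., 0.],
           [ 1., 1., 1.]])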

Continuing the example above::

Review comment (Member): It is not directly clear to me that this refers to the list of example features (I was first searching for the last code example). Also, this example at once starts with integers while the example features above are strings, and this step is not explained.


>>> enc = preprocessing.UnaryEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
UnaryEncoder(dtype=<... 'numpy.float64'>, handle_greater='warn',
n_values='auto', ordinal_features='all', sparse=False)
>>> enc.transform([[0, 1, 1]])
array([[ 0., 1., 0., 1., 0., 0.]])

By default, how many values each feature can take is inferred automatically
from the dataset. It is possible to specify this explicitly using the parameter
``n_values``.

Review comment (Member): Can you specify that this 'automatically' is done by looking at the maximum?

* There are two height values, three income levels and four education levels
  in our dataset.
* Then we fit the estimator, and transform a data point.
* In the result, the first number encodes the height, the next two numbers the
  income level, and the next three numbers the education level.

Review comment (Member), on the first list item: Need a blank line above this line (to get the list to render well).
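
As the reviewer asks above: with the default ``n_values='auto'``, the range is
inferred during ``fit`` as the column-wise maximum plus one (see ``_fit`` in
the diff below). A minimal check of that behaviour::

    >>> import numpy as np
    >>> X = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    >>> np.max(X, axis=0) + 1  # what fit() stores as n_values_
    array([2, 3, 4])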

Note that, if there is a possibility that the training data might have missing
categorical values, one has to explicitly set ``n_values``. For example::

>>> enc = preprocessing.UnaryEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
UnaryEncoder(dtype=<... 'numpy.float64'>, handle_greater='warn',
n_values=[2, 3, 4], ordinal_features='all', sparse=False)
>>> enc.transform([[1, 1, 2]])
array([[ 1., 1., 0., 1., 1., 0.]])
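
With the default ``handle_greater='warn'``, values at or above ``n_values``
are clipped to ``n_values - 1`` during ``transform`` (with a warning). A
sketch of the expected behaviour for the encoder above (illustrative, hence
the skip directive)::

    >>> # feature 1 sees the value 3, but only range(3) was declared during
    >>> # fit: the value is clipped to 2, so both of its unary columns are set
    >>> enc.transform([[0, 3, 1]])  # doctest: +SKIP
    array([[ 0., 1., 1., 1., 0., 0.]])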

.. _imputation:

Imputation of missing values
Expand Down
2 changes: 2 additions & 0 deletions sklearn/preprocessing/__init__.py
@@ -26,6 +26,7 @@
from .data import PowerTransformer
from .data import CategoricalEncoder
from .data import PolynomialFeatures
from .data import UnaryEncoder

from .label import label_binarize
from .label import LabelBinarizer
@@ -65,4 +66,5 @@
'label_binarize',
'quantile_transform',
'power_transform',
'UnaryEncoder'
]
220 changes: 218 additions & 2 deletions sklearn/preprocessing/data.py
@@ -62,6 +62,7 @@
'minmax_scale',
'quantile_transform',
'power_transform',
'UnaryEncoder'
]


@@ -1957,6 +1958,8 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
matrix indicating the presence of a class label.
sklearn.preprocessing.LabelEncoder : encodes labels with values between 0
and n_classes-1.
sklearn.preprocessing.UnaryEncoder: encodes ordinal integer features
using a unary scheme.
"""
def __init__(self, n_values="auto", categorical_features="all",
dtype=np.float64, sparse=True, handle_unknown='error'):
@@ -2064,8 +2067,8 @@ def _transform(self, X):
mask = (X < self.n_values_).ravel()
if np.any(~mask):
if self.handle_unknown not in ['error', 'ignore']:
raise ValueError("handle_unknown should be either error or "
"unknown got %s" % self.handle_unknown)
raise ValueError("handle_unknown should be either 'error' or "
"'ignore' got %s" % self.handle_unknown)
if self.handle_unknown == 'error':
raise ValueError("unknown categorical feature present %s "
"during transform." % X.ravel()[~mask])
@@ -3147,3 +3150,216 @@ def inverse_transform(self, X):
X_tr[mask, idx] = None

return X_tr


class UnaryEncoder(BaseEstimator, TransformerMixin):
"""Encode ordinal integer features using a unary scheme.

The input to this transformer should be a matrix of non-negative integers,
denoting the values taken on by ordinal (discrete) features. The output
will be a matrix where each column corresponds to one possible value of
one feature. It is assumed that input features take on values in the range
0 to (n_values - 1).

This encoding is needed for feeding ordinal features to many scikit-learn
estimators, notably linear models and kernel-based models like SVMs with
the standard kernels.
This transformation is unlikely to help when used with tree-based models,
since those already work on the basis of a particular feature value being
below or above a threshold, unlike linear and kernel-based models.

Read more in the :ref:`User Guide <preprocessing_ordinal_features>`.

Parameters
----------
n_values : 'auto', int or array of ints
Number of values per feature.

- 'auto' : determine value range from training data.
- int : number of ordinal values per feature.
Each feature value should be in ``range(n_values)``
- array : ``n_values[i]`` is the number of ordinal values in
``X[:, i]``. Each feature value should be
in ``range(n_values[i])``

ordinal_features : "all" or array of indices or mask

Review comment (Member): I know this feature is consistent with OneHotEncoder, but in the CategoricalEncoder we decided to drop this keyword (with the idea that you will be able to do this with the coming ColumnTransformer). Personally, I would rather strive for consistency with the more newly added CategoricalEncoder (by dropping the keyword here or adding it in CategoricalEncoder).

Specify what features are treated as ordinal.

- 'all' (default): All features are treated as ordinal.
- array of indices: Array of ordinal feature indices.
- mask: Array of length n_features and with dtype=bool.

Non-ordinal features are always stacked to the right of the matrix.

dtype : number type, default=np.float
Desired dtype of output.

sparse : boolean, default=False
Will return sparse matrix if set True else will return an array.

handle_greater : str, 'warn' or 'error' or 'clip'
Whether to raise an error or clip or warn if an
ordinal feature >= n_values is passed in.

- 'warn' (default): same as clip but with warning.

Review comment (Member): For CategoricalEncoder the default is error (and for OneHotEncoder as well). Is there a good reason to deviate in this case?

Review comment (Member): I see the discussion about this, @jnotham saying:

    I now wonder if we should have a default handle_greater='warn'. I think in practice that raising an error when a count-valued feature exceeds its training set range is too intrusive. Better off clipping but warning unless the user has specified otherwise.

I understand that reasoning for count-like values. But the examples in the documentation are not count-like, and for those this makes less sense I think as a default (but I don't know with which type of data this will be used more often).

- 'error': raise error if feature >= n_values is passed in.
- 'clip': all the feature values >= n_values are clipped to
(n_values-1) during transform.

Review comment (Member): This line should align with the 'clip' of the previous line (otherwise sphinx sees it as a definition, which is rendered differently from the previous entries in the list).


Attributes
----------
feature_indices_ : array of shape (n_features,)
Indices to feature ranges.
Feature ``i`` in the original data is mapped to features
from ``feature_indices_[i]`` to ``feature_indices_[i+1]``

n_values_ : array of shape (n_features,)
Maximum number of values per feature.

Examples
--------
Given a dataset with three features and four samples, we let the encoder
find the maximum value per feature and transform the data to a unary
encoding.

>>> from sklearn.preprocessing import UnaryEncoder
>>> enc = UnaryEncoder()
>>> enc.fit([[0, 0, 3],
... [1, 1, 0],
... [0, 2, 1],
... [1, 0, 2]]) # doctest: +ELLIPSIS
UnaryEncoder(dtype=<... 'numpy.float64'>, handle_greater='warn',
n_values='auto', ordinal_features='all', sparse=False)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 1, 3, 6])
>>> enc.transform([[0, 1, 2]])
array([[ 0., 1., 0., 1., 1., 0.]])

See also
--------
sklearn.preprocessing.OneHotEncoder: encodes categorical integer features
using a one-hot aka one-of-K scheme.

Review comment (Member): categorical -> ordinal?

Review comment (Member): Can you add here a see also to CategoricalEncoder as well?
"""
def __init__(self, n_values="auto", ordinal_features="all",
dtype=np.float64, sparse=False, handle_greater='warn'):
self.n_values = n_values
self.ordinal_features = ordinal_features
self.dtype = dtype
self.sparse = sparse
self.handle_greater = handle_greater

def fit(self, X, y=None):
"""Fit UnaryEncoder to X.

Parameters
----------
X : array-like, shape [n_samples, n_feature]
Input array of type int.
All feature values should be non-negative, otherwise a ValueError
will be raised.
"""
_transform_selected(X, self._fit, self.ordinal_features, copy=True)
return self

def _fit(self, X):
"""Assumes X contains only ordinal features."""
X = check_array(X, dtype=np.int)
if self.handle_greater not in ['warn', 'error', 'clip']:
raise ValueError("handle_greater should be either 'warn', 'error' "
"or 'clip' got %s" % self.handle_greater)
if np.any(X < 0):
raise ValueError("X needs to contain only non-negative integers.")
n_samples, n_features = X.shape

if (isinstance(self.n_values, six.string_types) and
self.n_values == 'auto'):
n_values = np.max(X, axis=0) + 1
elif isinstance(self.n_values, numbers.Integral):
n_values = np.empty(n_features, dtype=np.int)
n_values.fill(self.n_values)
else:
try:
n_values = np.asarray(self.n_values, dtype=int)
except (ValueError, TypeError):
raise TypeError("Wrong type for parameter `n_values`. Expected"
" 'auto', int or array of ints, got %r"
% self.n_values)
if n_values.ndim < 1 or n_values.shape[0] != X.shape[1]:

Review comment (Member): This if statement is never reached in current tests.

Reply (Author): Added a new test to cover it.

raise ValueError("Shape mismatch: if n_values is an array,"
" it has to be of shape (n_features,).")

self.n_values_ = n_values
n_values = np.hstack([[0], n_values - 1])
indices = np.cumsum(n_values)
self.feature_indices_ = indices
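# Editor's illustration (not in the original diff): with n_values_ = [2, 3, 4]
# the hstack above gives [0, 1, 2, 3] and the cumsum gives
# feature_indices_ = [0, 1, 3, 6], i.e. feature i occupies output columns
# feature_indices_[i] to feature_indices_[i + 1] - 1.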

mask = (X >= self.n_values_).ravel()
if np.any(mask):
if self.handle_greater == 'error':
raise ValueError("handle_greater='error' but found %d feature"
" values which exceeds n_values."
% np.count_nonzero(mask))

Review comment (Member): In principle this check is not needed in case of 'auto' (but not sure whether a small performance improvement for this case is worth moving it elsewhere).

return X

def _transform(self, X):
"""Assumes X contains only ordinal features."""
X = check_array(X, dtype=np.int)
if np.any(X < 0):
raise ValueError("X needs to contain only non-negative integers.")
n_samples, n_features = X.shape

indices = self.feature_indices_
if n_features != indices.shape[0] - 1:
raise ValueError("X has different shape than during fitting."
" Expected %d, got %d."
% (indices.shape[0] - 1, n_features))

# If self.handle_greater is "clip" (or "warn"), ordinal features of X that
# are >= n_values_ are clipped to n_values_ - 1 via ``mask``. As a result,
# all row_indices and column_indices entries for such a feature are filled,
# i.e. every unary column of that feature is set to one.
mask = (X >= self.n_values_).ravel()
if np.any(mask):
if self.handle_greater == 'warn':
warnings.warn("Found %d feature values which exceeds "
"n_values during transform, clipping them."
% np.count_nonzero(mask))
elif self.handle_greater == 'error':
raise ValueError("handle_greater='error' but found %d feature"
" values which exceeds n_values during "
"transform." % np.count_nonzero(mask))

X_ceil = np.where(mask.reshape(X.shape), self.n_values_ - 1, X)
column_start = np.tile(indices[:-1], n_samples)
column_end = (indices[:-1] + X_ceil).ravel()
column_indices = np.hstack([np.arange(s, e) for s, e
in zip(column_start, column_end)])
row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
X_ceil.sum(axis=1))
data = np.ones(X_ceil.ravel().sum())
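# Editor's worked example (not in the original diff): with
# X_ceil = [[0, 1, 2]] and feature_indices_ = [0, 1, 3, 6]:
#   column_start = [0, 1, 3], column_end = [0, 2, 5],
#   column_indices = [1, 3, 4], row_indices = [0, 0, 0], data = ones(3),
# which yields the dense row [0., 1., 0., 1., 1., 0.] as in the class example.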
out = sparse.coo_matrix((data, (row_indices, column_indices)),
shape=(n_samples, indices[-1]),
dtype=self.dtype).tocsr()

Review comment (Member): Maybe not too important, but it is also possible to directly create a csr matrix instead of first constructing coo and then converting (for CategoricalEncoder I edited the construction of the indices to do this: 85cf315).

return out if self.sparse else out.toarray()
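
Editor's note: a minimal standalone sketch of the direct-CSR construction the
reviewer suggests above (illustrative only, not part of the diff); variable
names mirror ``_transform``, and indptr is derived from the per-row counts so
the COO intermediate becomes unnecessary:

    import numpy as np
    from scipy import sparse

    X_ceil = np.array([[0, 1, 2]])    # clipped input, as in the class example
    indices = np.array([0, 1, 3, 6])  # feature_indices_
    n_samples = X_ceil.shape[0]
    # active output columns per sample, feature by feature
    column_indices = np.hstack([np.arange(s, s + v) for s, v
                                in zip(np.tile(indices[:-1], n_samples),
                                       X_ceil.ravel())])
    # row boundaries follow directly from the per-row number of active columns
    indptr = np.insert(np.cumsum(X_ceil.sum(axis=1)), 0, 0)
    data = np.ones(column_indices.shape[0])
    out = sparse.csr_matrix((data, column_indices, indptr),
                            shape=(n_samples, indices[-1]))
    # out.toarray() -> [[0., 1., 0., 1., 1., 0.]], matching the COO route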

def transform(self, X):
"""Transform X using Ordinal encoding.

Parameters
----------
X : array-like, shape [n_samples, n_features]
Input array of type int.
All feature values should be non-negative, otherwise a ValueError
will be raised.

Returns
-------
X_out : sparse matrix if sparse=True else a 2-d array
Transformed input.
"""
return _transform_selected(X, self._transform,
self.ordinal_features, copy=True)