Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 35 additions & 3 deletions sklearn/preprocessing/_encoders.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@

import numpy as np
from scipy import sparse
import numbers

from ..base import BaseEstimator, TransformerMixin
from ..utils import check_array
Expand Down Expand Up @@ -622,6 +623,14 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
Desired dtype of output.

unknown_value : 'error' or int, default='error'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be called handle_unknown for consistency with the OneHotEncoder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, not sure. We used another name to emphasize that we can choose the encoded value here. But you are right, in principle, it is a similar thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's slightly strange to have handle_unknown=-10, since -10 is not an "action".

Being the same name may case some confusion, since a user may set it to 'ignore', (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to be an integer in OneHotEncoder.

Those two reasons was why I suggested a different name for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a user may set it to 'ignore', (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to be an integer in OneHotEncoder

For me that's not really an issue: users would get an error immediately, it's not like the OHE is going to have an unexpected behavior.
We already have other places where the same parameter name accepts slightly different values depending on the nature of the estimator.

This parameter is about how unknown values are handled in transform, so its semantic is exactly the same as OHE's handle_unknown. I think it should be called the same.

Also, at least to me, the current unknown_value name suggests that the parameter means "this is what an unknown value looks like in fit", as in e.g. "treat these values as missing", which is a different thing.

It's slightly strange to have handle_unknown=-10, since -10 is not an "action".

An alternative is to have handle_unkown='replace' with an additional replace_value parameter.

Maybe others should weigh in, consistency is important.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X = np.array([['a', 'b', 'd']], dtype=object).T
enc = OridinalEncoder(handle_unknown='use_category',
				      unknown_category='c').fit(X)

As I mentioned before this can only be supported if we assume that X has no integer values. If X has integer values then there is now way to know whether the parameter refers to raw values or encoded values. So I don't think we should consider this option.

map unknown to -2.

Why -2?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why -2?

There is no particular reason, other than I want -1 to be missing. If we have a demand for enabling mapping to other integers, we can add the parameter then. I think this solves 95% of the problem already with unknown values in OrdinalEncoder.

Copy link
Member

@ogrisel ogrisel May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second parameter (unknown_category or replace_value) should specify an encoded value, not a raw value. Otherwise this will conflict when raw values are integers.

I still like the ability to be able to pass a category (in input category space) for the unknown values since it then make it possible to implement inverse_transform on test data with unknown values which would not be possible when we only pass the integer value.

We could support the two options though.

  • OrdinalEncoder(handle_unknown="use_category", unknown_category="unknown")
  • OrdinalEncoder(handle_unknown="use_value", unkown_encoded_value=-1)

Or even just use the handle_unknown="use_category" option but make it possible to specify both an input space category and an encoded space value:

  • OrdinalEncoder(handle_unknown="use_category", unknown_category="other", unkown_encoded_value=-1)

But then we need to be careful if unknown_category is already part of the category labels of the training set.

Another motivation for making it possible to pass the input space category is to make it easy to map unknown values to a meaningful garbage category that is already present in the training set (e.g. "other", "misc", "background").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about having two parameters, one to decide on the handling (use category or encoded value for unknowns) and one on the value for unknowns (with meaning depending on the handling decision):

  • OrdinalEncoder(handle_unknown="use_category", unknown="other")

  • OrdinalEncoder(handle_unknown="use_encoded_value", unknown=-1)
    Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.

Copy link
Member

@ogrisel ogrisel Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find my original suggestion more explicit but I am also fine with yours if others like it better. For the default encoded value I would rather leave it to None and force the user to be explicit.

Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.

The default values would be:

`OrdinalEncoder(handle_unknown="error", unknown_value=None)`

which would not raise any error at fit time but would raise an error at predict time if unknown values are encountered.

`OrdinalEncoder(handle_unknown="use_category", unknown_value=None)`

or

`OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=None)`

would raise an error at fit time asking the user to ask the user to set unknown_value to a value with a valid dtype at fit time (e.g. an integer of handle_unknown == "use_encoded_value" or a str if handle_unknown="use_category" and all category values in the training set as str instances.

When set to 'raise' an error will be raised in case an unknown
categorical feature is present during transform. When unknown_value is
an integer, the encoded value of unknown categories will be set to
this value. The integer can be negative and must not be an integer
already used for encoded categories seen in :meth:fit. In
:meth:inverse_transform, an unknown category will be denoted as None.

Attributes
----------
categories_ : list of arrays
Expand Down Expand Up @@ -658,9 +667,11 @@ class OrdinalEncoder(_BaseEncoder):
"""

@_deprecate_positional_args
def __init__(self, *, categories='auto', dtype=np.float64):
def __init__(self, *, categories='auto', dtype=np.float64,
unknown_value='error'):
self.categories = categories
self.dtype = dtype
self.unknown_value = unknown_value

def fit(self, X, y=None):
"""
Expand Down Expand Up @@ -697,7 +708,21 @@ def transform(self, X):
X_out : sparse matrix or a 2-d array
Transformed input.
"""
X_int, _ = self._transform(X)
X_int, X_mask = self._transform(X, handle_unknown=self.unknown_value)

# create separate category for unknown values
if self.unknown_value != 'error':
if not isinstance(self.unknown_value, numbers.Integral):
raise TypeError("The used value for unknown_value "
f"{self.unknown_value} is not an integer as "
"required.")
for i in range(len(self.categories_)):
if 0 <= self.unknown_value < len(self.categories_[i]):
raise ValueError(
"The used value for unknown_value "
f"{self.unknown_value} is one of the values already "
"used for encoding the seen categories.")
X_int[~X_mask[:, i], i] = self.unknown_value
return X_int.astype(self.dtype, copy=False)

def inverse_transform(self, X):
Expand Down Expand Up @@ -732,6 +757,13 @@ def inverse_transform(self, X):

for i in range(n_features):
labels = X[:, i].astype('int64', copy=False)
X_tr[:, i] = self.categories_[i][labels]
# set unknown values to None
if self.unknown_value != 'error':
X_tr[:, i] = np.where(
labels == self.unknown_value, None,
self.categories_[i][np.where(
labels == self.unknown_value, 0, labels)])
else:
X_tr[:, i] = self.categories_[i][labels]

return X_tr
34 changes: 34 additions & 0 deletions sklearn/preprocessing/tests/test_encoders.py
Original file line number Diff line number Diff line change
Expand Up @@ -553,6 +553,40 @@ def test_ordinal_encoder_raise_missing(X):
ohe.transform(X)


def test_ordinal_encoder_handle_unknowns():
enc = OrdinalEncoder(unknown_value=-999)
X_fit = np.array([['a', 'x'], ['b', 'y'], ['c', 'z']], dtype=object)
X_trans = np.array([['c', 'xy'], ['bla', 'y'], ['a', 'x']], dtype=object)
enc.fit(X_fit)

X_trans_enc = enc.transform(X_trans)
exp = np.array([[2, -999], [-999, 1], [0, 0]], dtype='int64')
assert_array_equal(X_trans_enc, exp)

X_trans_inv = enc.inverse_transform(X_trans_enc)
inv_exp = np.array([['c', None], [None, 'y'], ['a', 'x']], dtype=object)
assert_array_equal(X_trans_inv, inv_exp)


def test_ordinal_encoder_raise_wrong_unknowns():
X_fit = np.array([['a', 'x'], ['b', 'y']], dtype=object)
X_trans = np.array([['c', 'xy'], ['bla', 'y']], dtype=object)

enc = OrdinalEncoder(unknown_value="wrong")
enc.fit(X_fit)
msg = ("The used value for unknown_value wrong is not an integer as "
"required.")
with pytest.raises(TypeError, match=msg):
enc.transform(X_trans)

enc = OrdinalEncoder(unknown_value=1)
enc.fit(X_fit)
msg = ("The used value for unknown_value 1 is one of the values already "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain that we should raise an error in this case, but let's do it and at worse we can make it more lenient at a later date.

"used for encoding the seen categories.")
with pytest.raises(ValueError, match=msg):
enc.transform(X_trans)


def test_ordinal_encoder_raise_categories_shape():

X = np.array([['Low', 'Medium', 'High', 'Medium', 'Low']], dtype=object).T
Expand Down