
Conversation

FelixWick (Contributor)

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Sometimes it can be convenient to allow values (categories) in the transform of OrdinalEncoder that were not present in the fit data set. For example, a machine learning method that is able to set such an unknown sample of the corresponding feature to a neutral (i.e. non-informative) value could use the information from all other features of the sample and still output a prediction (instead of no prediction at all).

Any other comments?

handle_unknown docstring excerpt (from the diff):

    Whether to raise an error or ignore if an unknown categorical feature
    is present during transform (default is to raise). When this parameter
    is set to 'ignore' and an unknown category is encountered during
    transform, the resulting ordinal encoded value for this feature will be
    -999. In the inverse transform, an unknown category will be denoted as
    None.
Member
I agree using -999 would be nice, but it would not really be "Ordinal". In other words, is an "unknown category" considered less than every other category?

In general, I think users want an "IntegerEncoder", which will map strings to ints with no assumption of order.

Contributor Author

Yes, I think an unknown category is different and, in a sense, considered less than other categories seen in the fit, because you have to treat them in a special way in a subsequent ML prediction (there were no parameters learned for an unknown category in the fit). Therefore, I think it should definitely not be an integer. Of course, -999 is an arbitrary choice and I have no preference here, but a negative number seems better than nan, as nan can be an original category (missing value) and would therefore be confusing.

@sebastian-neubauer-by Apr 24, 2020

In the end one would want an explicit representation of "unknown" in the type itself. But most algorithms need a numerical type (here integer), which does not have such an explicit representation. Then -999 actually has some advantages over other choices:

  • "0": very dangerous, as this is often interpreted as just the first "bin" (here, too, if I understand it correctly: 0 is the first category), so introducing it would be a breaking change
  • "-1": better than 0, but it carries the risk that some might interpret it as belonging to the "0" bin, i.e. something like an overflow bin "lower than the lower boundary", which is not the same as "unknown"
  • "-999": this is far enough away from any accidental collision; it is a very unnatural number, so we can quite safely abuse it to encode a type "unknown". Even if negative numbers are allowed, hitting -999 by chance is highly unlikely.

One more, but weaker, argument: in my working group at university years back this was common practice and it "just worked" for years (disclaimer: @FelixWick was in the same group).

One caveat: this might break some use cases, e.g. if an algorithm expects positive integers. Would it make sense to make this configurable? Even then, what would the alternative be in this case, "0"?
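A sentinel encoding along these lines can be sketched in plain Python (a hypothetical standalone helper for illustration, not this PR's implementation):

```python
UNKNOWN_SENTINEL = -999  # arbitrary, far away from any valid category code

def fit_categories(values):
    """Assign codes 0, 1, 2, ... to the sorted distinct categories,
    mimicking OrdinalEncoder's categories='auto' behavior."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def transform(values, mapping):
    """Encode values; categories unseen during fit get the sentinel."""
    return [mapping.get(v, UNKNOWN_SENTINEL) for v in values]

mapping = fit_categories(["b", "a", "c"])  # -> {"a": 0, "b": 1, "c": 2}
transform(["b", "z"], mapping)             # "z" was never seen -> [1, -999]
```

A downstream model can then test for the sentinel and treat those samples specially (e.g. assign a neutral value for that feature), which is the workflow described above.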

Contributor Author

0 is not an alternative, as it is the first category. And I think -999 is not a problem for algorithms expecting positive integers, because it is mandatory to handle the case of an unknown category specifically before the algorithm call anyway (e.g. set it to a neutral value).

@thomasjpfan (Member)

Yes, I think an unknown category is different and, in a sense, considered less than other categories seen in the fit, because you have to treat them in a special way in a subsequent ML prediction

Considering the unknown category "less than all other categories seen in fit" is up to the model taking it as input and should not be assumed by the encoder. Alternatively, unknown can be seen as "greater than all categories seen in fit"; in this case unknown can be encoded as "cardinality of the categories + 1".

A user may already have a category they want to use for "unknown". For example a "misc" category that was seen in training.

@FelixWick (Contributor Author)

I agree, it is not about less or greater than other categories, just about being different, as all other categories were seen in the fit (including what you call the misc or unknown category). So the encoder assumes nothing here; it just makes clear that this is a special case.

@thomasjpfan (Member) commented Apr 25, 2020

So, the encoder assumes nothing here, it just makes clear that this is a special case.

Encoding unknown to -999 implicitly assumes that unknown is less than all the other categories seen during training.

as all other categories were seen in the fit (including what you call the misc or unknown category).

Let's say the categories seen during training were ["clothing", "household", "misc"], encoded as [0, 1, 2]. At transform time, it is reasonable to map unknown categories to the "misc" category, i.e. 2, which was seen during training.
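The "fold unknowns into misc" option above can be sketched as a plain dict lookup (a hypothetical illustration, not a proposed implementation):

```python
# Category-to-code mapping learned during fit.
mapping = {"clothing": 0, "household": 1, "misc": 2}

def encode(value):
    # Unseen categories fall back to the code of the existing "misc" bin.
    return mapping.get(value, mapping["misc"])

encode("clothing")     # 0, seen in training
encode("electronics")  # 2, folded into "misc"
```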

@thomasjpfan (Member) commented Apr 25, 2020

API wise, I would like a way to configure the unknown value when encoding unknown categories without adding too many parameters. This way a user can state how they want to encode the unknown categories.

Currently, saying handle_unknown='ignore' is not semantically true: we are not ignoring the unknown, we are encoding it.

@thomasjpfan (Member)

The following API could work:

OrdinalEncoder(unknown_value='error')  # default error
OrdinalEncoder(unknown_value=-999)  # sets unknown category to be encoded as -999

Note we are focused on releasing 0.23 soon, so my responses may be delayed. Feel free to ping me after the 0.23 release.

@FelixWick (Contributor Author)

Sure, -999 is less than the other values. But the order does not matter in this encoding here, does it?

And no, I would not map it to “misc” in your example. It is a new value in the transform, that was not seen in the fit. So, it should be treated differently. Mapping it to “misc” adds information that is not there.

I agree on the API. To be honest, I took a shortcut here ;). Happy to change it if you are fine with the general idea.

And all fine with delayed answers, no rush.

@thomasjpfan (Member)

Sure, -999 is less than the other values. But the order does not matter in this encoding here, does it?

It matters in this context because the encoder is named OrdinalEncoder.

And no, I would not map it to “misc” in your example. It is a new value in the transform, that was not seen in the fit. So, it should be treated differently. Mapping it to “misc” adds information that is not there.

Okay, let's not allow the unknown_encoded_value to be any of the seen categories.

I am happy with having an unknown_encoded_value parameter and raising an error if the unknown_encoded_value is one of the values already used for encoding the seen categories.

@FelixWick (Contributor Author)

Ok, great. Then I will change the API accordingly in the next days.

Concerning the ordering of categories, I just realized that we are both right. Usually the encoding here is unordered (just the incidental sequence of different categories during the fit), but one can also enforce an ordered encoding by passing explicit categories instead of leaving the categories parameter as 'auto'.
Anyway, with the more flexible API that you suggested, this is no longer an issue.

@FelixWick (Contributor Author)

Any comments on my changes?

@jnothman (Member) left a comment

thanks


enc = OrdinalEncoder(unknown_value=1)
enc.fit(X_fit)
msg = ("The used value for unknown_value 1 is one of the values already "
Member
I'm not certain that we should raise an error in this case, but let's do it and at worse we can make it more lenient at a later date.

.format(self.unknown_value))
n_features = len(self.categories_)
for i in range(n_features):
if np.isin(self.unknown_value, np.unique(X_int[:, i])):
Member
shouldn't we just be checking if it's in the range 0 to len(self.categories_[i])? I.e. if we're going to say this should raise an error, it should raise it even if this specific dataset does not contain an instance of that category.
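The distinction can be illustrated with a small sketch (hypothetical names; the real check would operate on the fitted `self.categories_`):

```python
def unknown_value_collides(unknown_value, categories_per_feature):
    # A fitted encoder reserves codes 0 .. len(cats) - 1 for each feature,
    # so the check must cover that whole range, not just the codes that
    # happen to occur in the particular dataset being transformed.
    return any(0 <= unknown_value < len(cats)
               for cats in categories_per_feature)

cats = [["a", "b", "c"]]  # three fitted categories -> valid codes 0, 1, 2
unknown_value_collides(1, cats)     # True, even if no "b" appears in the data
unknown_value_collides(-999, cats)  # False: outside the reserved range
```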

Contributor Author
you are right. nice catch, thanks. fixed.

@thomasjpfan (Member) left a comment

Looking good!

enc.fit(X_fit)

X_trans_enc = enc.transform(X_trans)
exp = np.array([[2, -999], [-999, 1], [0, 0]], dtype=object)
Member
The dtype of the output should not be object. Is this the case here?

Contributor Author
right, it is integer.

Comment on lines 721 to 723
for i in range(n_features):
if np.isin(self.unknown_value,
range(len(self.categories_[i]))):
Member
Nit:

for cats in self.categories_:
    if 0 <= self.unknown_value < len(cats):
        ...

"The used value for unknown_value {} is one of the "
"values already used for encoding the seen "
"categories.".format(self.unknown_value))
X_int[:, i][~X_mask[:, i]] = self.unknown_value
Member
Suggested change
X_int[:, i][~X_mask[:, i]] = self.unknown_value
X_int[~X_mask[:, i], i] = self.unknown_value

if np.isin(self.unknown_value,
range(len(self.categories_[i]))):
raise ValueError(
"The used value for unknown_value {} is one of the "
Member
We can use f-strings now!

Suggested change
"The used value for unknown_value {} is one of the "
f"The used value for unknown_value {self.unknown_value} is one of the "

(and wrap lines to stay under the 80-character limit)


# create separate category for unknown values
if self.unknown_value != 'error':
if not isinstance(self.unknown_value, np.int):
Member
With import numbers:

Suggested change
if not isinstance(self.unknown_value, np.int):
if not isinstance(self.unknown_value, numbers.Integral):

# create separate category for unknown values
if self.unknown_value != 'error':
if not isinstance(self.unknown_value, np.int):
raise TypeError("The used value for unknown_value {} is not "
Member
Suggested change
raise TypeError("The used value for unknown_value {} is not "
raise TypeError(f"The used value for unknown_value {self.unknown_value} is not "

@@ -622,6 +622,16 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
Desired dtype of output.

unknown_value : 'error' or (ordinal) encoded value for unknown cateogry,
Member
We usually specify the type in the docstring:

Suggested change
unknown_value : 'error' or (ordinal) encoded value for unknown cateogry,
unknown_value : 'error' or int, default='error'

Comment on lines 627 to 633
Whether to raise an error or use a specific encoded value if an
unknown categorical feature is present during transform (default is
to raise). When this parameter is set to something else than 'error'
and an unknown category is encountered during transform, the resulting
(ordinal) encoded value for this feature will be set to this value.
Ordinal is not strictly required, a negative value works as well. In
the inverse transform, an unknown category will be denoted as None.
Member
When set to 'error', an error will be raised when an unknown categorical feature is present during transform. When unknown_value is an integer, unknown categories will be encoded with this value. The integer can be negative and must not be one of the integers already used to encode the categories seen in :meth:fit. In :meth:inverse_transform, an unknown category will be denoted as None.

@@ -622,6 +623,14 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
Desired dtype of output.

unknown_value : 'error' or int, default='error'
Member
should this be called handle_unknown for consistency with the OneHotEncoder?

Contributor Author
Hm, not sure. We used another name to emphasize that we can choose the encoded value here. But you are right, in principle, it is a similar thing.


Member

It's slightly strange to have handle_unknown=-10, since -10 is not an "action".

Being the same name may cause some confusion, since a user may set it to 'ignore' (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to an integer in OneHotEncoder.

Those two reasons were why I suggested a different name for this PR.

Member

a user may set it to 'ignore', (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to be an integer in OneHotEncoder

For me that's not really an issue: users would get an error immediately, it's not like the OHE is going to have an unexpected behavior.
We already have other places where the same parameter name accepts slightly different values depending on the nature of the estimator.

This parameter is about how unknown values are handled in transform, so its semantic is exactly the same as OHE's handle_unknown. I think it should be called the same.

Also, at least to me, the current unknown_value name suggests that the parameter means "this is what an unknown value looks like in fit", as in e.g. "treat these values as missing", which is a different thing.

It's slightly strange to have handle_unknown=-10, since -10 is not an "action".

An alternative is to have handle_unknown='replace' with an additional replace_value parameter.

Maybe others should weigh in, consistency is important.

Member

X = np.array([['a', 'b', 'd']], dtype=object).T
enc = OrdinalEncoder(handle_unknown='use_category',
                     unknown_category='c').fit(X)

As I mentioned before, this can only be supported if we assume that X has no integer values. If X has integer values then there is no way to know whether the parameter refers to raw values or encoded values. So I don't think we should consider this option.

map unknown to -2.

Why -2?

Member

Why -2?

There is no particular reason, other than I want -1 to be missing. If we have a demand for enabling mapping to other integers, we can add the parameter then. I think this solves 95% of the problem already with unknown values in OrdinalEncoder.

@ogrisel (Member) May 29, 2020

I think the second parameter (unknown_category or replace_value) should specify an encoded value, not a raw value. Otherwise this will conflict when raw values are integers.

I still like the ability to pass a category (in input category space) for the unknown values, since it then makes it possible to implement inverse_transform on test data with unknown values, which would not be possible when we only pass the integer value.

We could support the two options though.

  • OrdinalEncoder(handle_unknown="use_category", unknown_category="unknown")
  • OrdinalEncoder(handle_unknown="use_value", unknown_encoded_value=-1)

Or even just use the handle_unknown="use_category" option but make it possible to specify both an input space category and an encoded space value:

  • OrdinalEncoder(handle_unknown="use_category", unknown_category="other", unknown_encoded_value=-1)

But then we need to be careful if unknown_category is already part of the category labels of the training set.

Another motivation for making it possible to pass the input space category is to make it easy to map unknown values to a meaningful garbage category that is already present in the training set (e.g. "other", "misc", "background").

Contributor Author

What about having two parameters, one to decide on the handling (use category or encoded value for unknowns) and one on the value for unknowns (with meaning depending on the handling decision):

  • OrdinalEncoder(handle_unknown="use_category", unknown="other")

  • OrdinalEncoder(handle_unknown="use_encoded_value", unknown=-1)
    Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.

@ogrisel (Member) Jun 2, 2020

I find my original suggestion more explicit, but I am also fine with yours if others like it better. For the default encoded value I would rather leave it as None and force the user to be explicit.

Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.

The default values would be:

`OrdinalEncoder(handle_unknown="error", unknown_value=None)`

which would not raise any error at fit time but would raise an error at predict time if unknown values are encountered.

`OrdinalEncoder(handle_unknown="use_category", unknown_value=None)`

or

`OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=None)`

would raise an error at fit time asking the user to set unknown_value to a value with a valid dtype (e.g. an integer if handle_unknown == "use_encoded_value", or a str if handle_unknown == "use_category" and all category values in the training set are str instances).

@jnothman (Member) commented Jun 1, 2020

I suspect there is a lot of over-engineering going on here. If we took @thomasjpfan's minimal solution of always mapping to -2, will it be easy for users to post-process it into some other useful form? I can't see it being trivial to map it back to a specific category, but it's not super hard either?

@thomasjpfan (Member)

If we took @thomasjpfan's minimal solution of always mapping to -2 will it be easy for users to post-process it into some other useful form?

Users can still develop algorithms where -2 means "unknown", which they can use directly in their algorithm.

I can't see it being trivial to map it back to a specific category, but it's not super hard either?

If they need the specific category after processing, users would need to pass in the original data. With this encoding scheme, multiple categories can be mapped to -2.

The goal of my suggestion is to make a small step forward while still allowing us to add more flexibility in future PRs.

@jnothman (Member) commented Jun 1, 2020

The goal of my suggestion is to make a small step forward while still allowing us to add more flexibly in future PRs.

Yes, and I think it's a good place to start.

@FelixWick FelixWick changed the title [MRG] allow unknowns in OrdinalEncoder transform [WIP] allow unknowns in OrdinalEncoder transform Jun 1, 2020
@FelixWick (Contributor Author) commented Jun 1, 2020

Okidoki. I have created another pull request with this simplified version: #17406
(Sorry, messed up my fork with this one.)

@jnothman (Member) commented Jun 2, 2020 via email

@thomasjpfan (Member) commented Jun 2, 2020

Can I be annoying and ask @thomasjpfan why we'd rather NA be between
unknown and known categories?

-1 matches pandas convention when they encode missing categories.

The other case we need to handle at some
point is infrequent categories... Will these be mapped to unknown at least
in some usages?

We can adopt logic similar to what we have in #16018:

  • If there is an infrequent-categories bin from training, then unknowns are mapped to that bin.
  • If there is no infrequent-categories bin from training, then unknowns are mapped to their own bin.
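The two bullet points can be sketched as follows (a hypothetical helper following the logic described for #16018, not actual scikit-learn code):

```python
def code_for_unknown(n_seen_categories, infrequent_code=None):
    """Return the encoded value for an unknown category at transform time."""
    if infrequent_code is not None:
        # An infrequent bin exists from training: unknowns join it.
        return infrequent_code
    # No infrequent bin: unknowns get their own bin after all seen codes.
    return n_seen_categories

code_for_unknown(3, infrequent_code=2)  # joins the infrequent bin -> 2
code_for_unknown(3)                     # gets its own new bin -> 3
```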

@jnothman (Member) commented Jun 2, 2020 via email

@jnothman (Member) commented Jun 2, 2020 via email

@thomasjpfan (Member)

It's reasonable to follow the pandas convention, except that it's not
intended for ML. I'd expect NA to be distinguishable from all other
categories, and hence lower than other non-specific category values...

I do not have a strong preference for using -1 for np.nan. We can use -2 for NA and -1 for unknown.

Although infrequent categories in a strictly ordered setting should usually
be mapped to a neighboring category, rather than a separate category?

I recall you had a similar suggestion in one of the many encoder threads. At this point, I think users use OrdinalEncoder more as an IntegerEncoder, and the ordered nature is an implementation detail. I am unsure how much demand there is for a "merge-up" or "merge-down" approach.

As for the "unknown category", we could also take this "merge-up" or "merge-down" approach, but I think most want a separate category. The algorithms that benefit from this, trees (when they have native support for categories, #16909) and categorical NB, do not consider the encoding's order.

@GaelVaroquaux (Member)

As far as I am concerned, nan is problematic: it imposes special casing in the code later.

@GaelVaroquaux (Member)

As pointed out by @cmarmo: we need to focus commenting and reviewing on #17406, which supersedes this PR.

@FelixWick (Contributor Author)

So, should we just close this one and finish the discussions there?

@thomasjpfan (Member) commented Jul 3, 2020

I think #17406, implementing the part of #16959 (comment) given by:

OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=None)

would be the first step in getting something merged.

Edit: After the meeting, I am happy with @ogrisel proposal.

@FelixWick (Contributor Author)

Ok, just a quick check before I re-write #17406.
So, you agreed on the following, right?

  • having two parameters handle_unknown and unknown_value.
  • handle_unknown can have the values "error", "use_encoded_value", or "use_category". (Or should we only have "use_encoded_value" for now?)
  • default for unknown_value is None, so the user has to provide a value.

To be explicit:
The default values would be:
OrdinalEncoder(handle_unknown="error", unknown_value=None)
which would not raise any error at fit time but would raise an error at predict time if unknown values are encountered.
OrdinalEncoder(handle_unknown="use_category", unknown_value=None)
or
OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=None)
would raise an error at fit time asking the user to set unknown_value to a value with a valid dtype (e.g. an integer for handle_unknown="use_encoded_value", or a str for handle_unknown="use_category" when all category values in the training set are str instances).
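The agreed fit-time validation could look roughly like this (a sketch of the recap above, not scikit-learn's actual implementation):

```python
import numbers

def check_unknown_params(handle_unknown, unknown_value):
    """Validate the (handle_unknown, unknown_value) pair at fit time."""
    if handle_unknown == "error":
        if unknown_value is not None:
            raise ValueError(
                "unknown_value must be None when handle_unknown='error'.")
    elif handle_unknown == "use_encoded_value":
        if not isinstance(unknown_value, numbers.Integral):
            raise TypeError(
                "unknown_value must be an integer when "
                "handle_unknown='use_encoded_value'.")
    else:
        raise ValueError(f"Unsupported handle_unknown: {handle_unknown!r}")

check_unknown_params("error", None)            # the defaults: OK
check_unknown_params("use_encoded_value", -1)  # explicit integer: OK
```

With these rules the default configuration raises nothing at fit time, while handle_unknown="use_encoded_value" forces the user to pick an explicit integer.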

@thomasjpfan (Member)

@FelixWick

To keep the PR smaller, I think we can first work on handle_unknown="use_encoded_value", since #17406 is closest to that version.

handle_unknown="use_category" can be worked on later in another PR.

@FelixWick (Contributor Author)

@thomasjpfan
Ok. But you want me to include the second parameter unknown_value as described, right? (Currently, it is fixed to -2.)

@thomasjpfan (Member)

Yes, include the unknown_value and have it default to None. This way the user will have to explicitly set it if they want to use handle_unknown="use_encoded_value".

8 participants