[WIP] allow unknowns in OrdinalEncoder transform #16959
Conversation
sklearn/preprocessing/_encoders.py
Outdated
    is present during transform (default is to raise). When this parameter
    is set to 'ignore' and an unknown category is encountered during
    transform, the resulting ordinal encoded value for this feature will be
    -999. In the inverse transform, an unknown category will be denoted as
I agree using -999 would be nice, but it would not really be "Ordinal". In other words, is an "unknown category" considered less than every other category?
In general, I think users want an "IntegerEncoder", which will map strings to ints with no assumption of order.
Yes, I think an unknown category is different and, in a sense, considered less than other categories seen in the fit, because you have to treat them in a special way in a subsequent ML prediction (there were no parameters learned for an unknown category in the fit). Therefore, I think it should definitely not be an integer. Of course, -999 is an arbitrary choice and I have no preference here, but a negative number seems better than nan, as nan can be an original category (missing value) and would therefore be confusing.
In the end one would want some explicit representation of "unknown" in the type itself. But for most algorithms we need a numerical type (here integer), which does not have such an explicit representation. Then -999 actually has some advantages over other choices:
- "0": very dangerous, as this is quite often interpreted as just the first "bin"; that is also the case here, if I understand correctly (0 is the first category), so introducing this would be a breaking change.
- "-1": better than 0, but there is a risk that some might interpret it as belonging next to the "0" bin, i.e. something like an overflow bin "lower than the lower boundary", which is not the same as "unknown".
- "-999": this is far enough away from any accidental collision; it is a very unnatural number, so we can quite safely abuse it to encode the type "unknown". Even if negative numbers are allowed, hitting -999 by chance is highly unlikely.
One more, but weak, argument: in my working group at university years back this was common practice and it "just worked" for years (disclaimer: @FelixWick was in the same group).
One caveat: this might break some use cases, e.g. if algorithms expect positive integers, so would it make sense to make this configurable? Even then, what would the alternative be in this case, "0"?
0 is not an alternative, as it is the first category. And I think -999 is not a problem for algorithms expecting positive integers, because it is mandatory to handle the case of an unknown category specifically before the algorithm call (e.g. set it to a neutral value).
Considering the unknown category "less than all other categories seen in fit" is up to the model taking it as input and should not be assumed by the encoder. Alternatively, unknown can be seen as "greater than all categories seen in fit"; in this case it can be encoded as "cardinality of categories + 1". A user may already have a category they want to use for "unknown", for example a "misc" category that was seen in training.
I agree, it is not about less or greater than other categories, just about being different, as all other categories were seen in the fit (including what you call the misc or unknown category). So the encoder assumes nothing here, it just makes clear that this is a special case.
Encoding unknown to
Let's say the categories were
API-wise, I would like a way to configure the unknown value when encoding unknown categories without adding too many parameters. This way a user can state how they want to encode the unknown categories. Currently saying
The following API could work:

    OrdinalEncoder(unknown_value='error')  # default: error
    OrdinalEncoder(unknown_value=-999)     # encode unknown categories as -999

Note we are focused on releasing 0.23 soon, so my responses may be delayed. Feel free to ping me after the 0.23 release.
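The semantics of the proposed `unknown_value` parameter could be sketched roughly as follows (a toy illustration with hypothetical helper names, not scikit-learn's actual implementation):

```python
import numpy as np

# Toy sketch of the proposed `unknown_value` semantics
# (hypothetical helpers, not scikit-learn's implementation).
def fit_codes(column):
    # Assign ordinal codes 0..n-1 to the categories seen during fit.
    return {cat: code for code, cat in enumerate(sorted(set(column)))}

def ordinal_transform(column, codes, unknown_value='error'):
    out = []
    for value in column:
        if value in codes:
            out.append(codes[value])
        elif unknown_value == 'error':
            # Default: raise on categories not seen during fit.
            raise ValueError(f"Found unknown category {value!r}")
        else:
            # e.g. unknown_value=-999 encodes unknowns as -999.
            out.append(unknown_value)
    return np.array(out)
```

With `unknown_value=-999`, a category unseen during fit would simply come out as -999 instead of raising.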
Sure, -999 is less than the other values. But the order does not matter in this encoding here, does it? And no, I would not map it to "misc" in your example. It is a new value in the transform, that was not seen in the fit. So, it should be treated differently. Mapping it to "misc" adds information that is not there. I agree on the API. To be honest, I took a shortcut here ;). Happy to change it if you are fine with the general idea. And all fine with delayed answers, no rush.
It matters in this context because the encoder is named OrdinalEncoder.
Okay, let's not allow the I am happy with having a
Ok, great. Then I will change the API accordingly in the next days. Concerning the ordering of categories, I just realized that we are both right: usually the encoding here is unordered (just the accidental sequence of different categories during the fit), but one can also enforce an ordered encoding by passing explicit categories instead of leaving the categories parameter as 'auto'.
Any comments on my changes?
thanks
    enc = OrdinalEncoder(unknown_value=1)
    enc.fit(X_fit)
    msg = ("The used value for unknown_value 1 is one of the values already "
I'm not certain that we should raise an error in this case, but let's do it and at worst we can make it more lenient at a later date.
sklearn/preprocessing/_encoders.py
Outdated
    .format(self.unknown_value))
    n_features = len(self.categories_)
    for i in range(n_features):
        if np.isin(self.unknown_value, np.unique(X_int[:, i])):
Shouldn't we just be checking if it's in the range 0 to len(self.categories_[i])? I.e. if we're going to say this should raise an error, it should raise it even if this specific dataset does not contain an instance of that category.
You are right, nice catch, thanks. Fixed.
Looking good!
    enc.fit(X_fit)
    X_trans_enc = enc.transform(X_trans)
    exp = np.array([[2, -999], [-999, 1], [0, 0]], dtype=object)
The dtype of the output should not be object. Is this the case here?
Right, it is integer.
sklearn/preprocessing/_encoders.py
Outdated
    for i in range(n_features):
        if np.isin(self.unknown_value,
                   range(len(self.categories_[i]))):
Nit:

    for cats in self.categories_:
        if 0 <= self.unknown_value < len(cats):
            ...
sklearn/preprocessing/_encoders.py
Outdated
    "The used value for unknown_value {} is one of the "
    "values already used for encoding the seen "
    "categories.".format(self.unknown_value))
    X_int[:, i][~X_mask[:, i]] = self.unknown_value
Suggested change:

    - X_int[:, i][~X_mask[:, i]] = self.unknown_value
    + X_int[~X_mask[:, i], i] = self.unknown_value
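For context on this suggestion: with NumPy, a basic column slice is a view, so assigning through the chained form also modifies the array in place; the single fancy-indexed assignment is simply more direct. A small check with a toy `X_int` and mask (names assumed for illustration):

```python
import numpy as np

X_int = np.array([[0, 1], [2, 0], [1, 2]])
X_mask = np.array([True, False, True])  # False marks an unknown entry

a = X_int.copy()
a[:, 0][~X_mask] = -999   # chained form: works because a[:, 0] is a view
b = X_int.copy()
b[~X_mask, 0] = -999      # suggested single-step fancy indexing

assert (a == b).all()
```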
sklearn/preprocessing/_encoders.py
Outdated
    if np.isin(self.unknown_value,
               range(len(self.categories_[i]))):
        raise ValueError(
            "The used value for unknown_value {} is one of the "
We can use f-strings now!

    - "The used value for unknown_value {} is one of the "
    + f"The used value for unknown_value {self.unknown_value} is one of the "

(and line wrapping to be < 80 line-length)
sklearn/preprocessing/_encoders.py
Outdated
    # create separate category for unknown values
    if self.unknown_value != 'error':
        if not isinstance(self.unknown_value, np.int):
With `import numbers`:

    - if not isinstance(self.unknown_value, np.int):
    + if not isinstance(self.unknown_value, numbers.Integral):
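The point of this suggestion is that `numbers.Integral` matches both builtin Python ints and NumPy integer scalars, while rejecting floats. A small validation helper in that spirit (hypothetical name, sketching the check the review asks for):

```python
import numbers
import numpy as np

def check_unknown_value(unknown_value):
    # Validation along the lines suggested in the review:
    # accept Python ints and NumPy integer scalars, reject anything else.
    if not isinstance(unknown_value, numbers.Integral):
        raise TypeError(
            f"The used value for unknown_value {unknown_value!r} "
            "is not an integer")
    return unknown_value
```

Note that `isinstance(np.int64(5), int)` can be False, which is exactly why `numbers.Integral` is the safer check here.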
sklearn/preprocessing/_encoders.py
Outdated
    # create separate category for unknown values
    if self.unknown_value != 'error':
        if not isinstance(self.unknown_value, np.int):
            raise TypeError("The used value for unknown_value {} is not "
Suggested change:

    - raise TypeError("The used value for unknown_value {} is not "
    + raise TypeError(f"The used value for unknown_value {self.unknown_value} is not "
sklearn/preprocessing/_encoders.py
Outdated
    @@ -622,6 +622,16 @@ class OrdinalEncoder(_BaseEncoder):
        dtype : number type, default np.float64
            Desired dtype of output.

        unknown_value : 'error' or (ordinal) encoded value for unknown cateogry,
We usually specify the type in the docstring:

    - unknown_value : 'error' or (ordinal) encoded value for unknown cateogry,
    + unknown_value : 'error' or int, default='error'
sklearn/preprocessing/_encoders.py
Outdated
    Whether to raise an error or use a specific encoded value if an
    unknown categorical feature is present during transform (default is
    to raise). When this parameter is set to something else than 'error'
    and an unknown category is encountered during transform, the resulting
    (ordinal) encoded value for this feature will be set to this value.
    Ordinal is not strictly required, a negative value works as well. In
    the inverse transform, an unknown category will be denoted as None.
When set to 'error', an error will be raised when an unknown categorical feature is present during transform. When `unknown_value` is an integer, the encoded value of unknown categories will be set to this value. The integer can be negative and must not be one of the integers already used to encode the categories seen in :meth:`fit`. In :meth:`inverse_transform`, an unknown category will be denoted as `None`.
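The documented behavior (sentinel code on transform, None on inverse transform) can be illustrated with a minimal stand-in encoder (a hypothetical class for illustration, not the real OrdinalEncoder):

```python
import numpy as np

class MiniOrdinalEncoder:
    """Minimal stand-in illustrating the documented unknown handling."""

    def __init__(self, unknown_value='error'):
        self.unknown_value = unknown_value

    def fit(self, column):
        self.categories_ = sorted(set(column))
        self._code = {c: i for i, c in enumerate(self.categories_)}
        return self

    def transform(self, column):
        out = []
        for v in column:
            if v in self._code:
                out.append(self._code[v])
            elif self.unknown_value == 'error':
                raise ValueError(f"Found unknown category {v!r}")
            else:
                out.append(self.unknown_value)
        return np.array(out)

    def inverse_transform(self, codes):
        # Codes outside the fitted range (i.e. the sentinel) map to None.
        return [self.categories_[c] if 0 <= c < len(self.categories_)
                else None for c in codes]
```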
    @@ -622,6 +623,14 @@ class OrdinalEncoder(_BaseEncoder):
        dtype : number type, default np.float64
            Desired dtype of output.

        unknown_value : 'error' or int, default='error'
Should this be called `handle_unknown` for consistency with the OneHotEncoder?
Hm, not sure. We used another name to emphasize that we can choose the encoded value here. But you are right, in principle, it is a similar thing.
It's slightly strange to have handle_unknown=-10, since -10 is not an "action".
Using the same name may cause some confusion, since a user may set it to 'ignore' (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to an integer in OneHotEncoder.
Those two reasons were why I suggested a different name for this PR.
a user may set it to 'ignore', (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to be an integer in OneHotEncoder
For me that's not really an issue: users would get an error immediately, it's not like the OHE is going to have an unexpected behavior.
We already have other places where the same parameter name accepts slightly different values depending on the nature of the estimator.
This parameter is about how unknown values are handled in transform, so its semantics are exactly the same as OHE's handle_unknown. I think it should be called the same.
Also, at least to me, the current unknown_value name suggests that the parameter means "this is what an unknown value looks like in fit", as in e.g. "treat these values as missing", which is a different thing.
It's slightly strange to have handle_unknown=-10, since -10 is not an "action".
An alternative is to have handle_unknown='replace' with an additional replace_value parameter.
Maybe others should weigh in, consistency is important.
    X = np.array([['a', 'b', 'd']], dtype=object).T
    enc = OrdinalEncoder(handle_unknown='use_category',
                         unknown_category='c').fit(X)

As I mentioned before, this can only be supported if we assume that X has no integer values. If X has integer values, then there is no way to know whether the parameter refers to raw values or encoded values. So I don't think we should consider this option.
map unknown to -2.
Why -2?
Why -2?

There is no particular reason, other than that I want -1 to be reserved for missing. If we have demand for mapping to other integers, we can add the parameter then. I think this already solves 95% of the problem with unknown values in OrdinalEncoder.
I think the second parameter (unknown_category or replace_value) should specify an encoded value, not a raw value. Otherwise this will conflict when raw values are integers.
I still like the ability to pass a category (in input category space) for the unknown values, since it then makes it possible to implement inverse_transform on test data with unknown values, which would not be possible when we only pass the integer value.
We could support the two options though.
    OrdinalEncoder(handle_unknown="use_category", unknown_category="unknown")
    OrdinalEncoder(handle_unknown="use_value", unknown_encoded_value=-1)

Or even just use the handle_unknown="use_category" option but make it possible to specify both an input space category and an encoded space value:

    OrdinalEncoder(handle_unknown="use_category", unknown_category="other", unknown_encoded_value=-1)

But then we need to be careful if unknown_category is already part of the category labels of the training set.
Another motivation for making it possible to pass the input space category is to make it easy to map unknown values to a meaningful garbage category that is already present in the training set (e.g. "other", "misc", "background").
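The "use_category" idea above, mapping unknown values onto a designated category that already exists in the training set, could be sketched like this (names and the fallback category are hypothetical):

```python
# Sketch of "use_category" handling: unseen values are replaced by a
# designated training-set category before ordinal encoding.
categories = ['misc', 'a', 'b']                  # categories seen during fit
code = {c: i for i, c in enumerate(categories)}

def encode_with_fallback(values, unknown_category='misc'):
    # Unknown raw values are encoded as the fallback category's code.
    return [code[v] if v in code else code[unknown_category]
            for v in values]
```

One nice property of this scheme is that inverse_transform stays well-defined, since every encoded value corresponds to a real category.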
What about having two parameters, one to decide on the handling (use category or encoded value for unknowns) and one for the value for unknowns (with its meaning depending on the handling decision):

- OrdinalEncoder(handle_unknown="use_category", unknown="other")
- OrdinalEncoder(handle_unknown="use_encoded_value", unknown=-1)

Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.
I find my original suggestion more explicit, but I am also fine with yours if others like it better. For the default encoded value I would rather leave it as None and force the user to be explicit.

Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.

The default values would be:

    OrdinalEncoder(handle_unknown="error", unknown_value=None)

which would not raise any error at fit time but would raise an error at predict time if unknown values are encountered.

    OrdinalEncoder(handle_unknown="use_category", unknown_value=None)

or

    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=None)

would raise an error at fit time, asking the user to set unknown_value to a value with a valid dtype (e.g. an integer if handle_unknown == "use_encoded_value", or a str if handle_unknown == "use_category" and all category values in the training set are str instances).
I suspect there is a lot of over-engineering going on here. If we took @thomasjpfan's minimal solution of always mapping to -2, will it be easy for users to post-process it into some other useful form? I can't see it being trivial to map it back to a specific category, but it's not super hard either.
Users can still develop algorithms where
If they need the specific category after processing, users would need to pass in the original data. With this encoding scheme, multiple categories can be mapped to
The goal of my suggestion is to make a small step forward while still allowing us to add more flexibility in future PRs.
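The minimal scheme discussed here (unknown always mapped to -2, with -1 reserved for missing) could be sketched as follows; the function name and the use of None for missing values are assumptions for illustration:

```python
import numpy as np

# Sketch of the minimal scheme: known categories get codes 0..n-1,
# missing values get -1, and unknown categories get -2.
def encode_minimal(values, categories):
    code = {c: i for i, c in enumerate(categories)}
    return np.array([-1 if v is None else code.get(v, -2) for v in values])
```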
Yes, and I think it's a good place to start.
Okidoki. I have created another pull request with this simplified version: #17406
Can I be annoying and ask @thomasjpfan why we'd rather NA be between unknown and known categories? The other case we need to handle at some point is infrequent categories... Will these be mapped to unknown at least in some usages?
We can adopt logic similar to what we have in #16018.
It's reasonable to follow the pandas convention, except that it's not intended for ML. I'd expect NA to be distinguishable from all other categories, and hence lower than other non-specific category values... Although infrequent categories in a strictly ordered setting should usually be mapped to a neighboring category, rather than a separate category?

Sorry, that "although" was in relation to the other comment.
I do not have a strong preference to using
I recall you had a similar suggestion in one of the many encoder threads. At this point, I think users are used to
As for the "unknown category", we could also have this "merge-up" or "merge-down" approach, but I think most want a separate category. The algorithms that benefit from this, trees (when they have native support for categories, #16909) and categorical NB, do not consider the encoding's order.
As far as I am concerned, nan is problematic: it imposes special casing in the code later.
So, should we just close this one and finish the discussions there?
I think #17406, which implements a part of #16959 (comment), would be the first step in getting something merged.

Edit: After the meeting, I am happy with @ogrisel's proposal.
Ok, just a quick check before I re-write #17406.
To be explicit:
To keep the PR smaller, I think we can first work on
@thomasjpfan |
Yes, include the
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Sometimes it can be convenient to allow values (categories) in the transform of OrdinalEncoder that were not present in the fit data set. For example, a machine learning method that is able to set such an unknown value of the corresponding feature to a neutral (i.e. non-informative) value could use the information from all other features of the sample and still output a prediction (instead of no prediction at all).
Any other comments?
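The downstream step described in the PR motivation, replacing unknown-coded entries with a neutral value before prediction, might look like this (the -999 sentinel and the per-feature neutral values are assumptions for illustration):

```python
import numpy as np

# Encoded feature matrix where -999 marks unknown categories
# (assumed sentinel value from the encoder).
X = np.array([[2.0, -999.0], [-999.0, 1.0], [0.0, 0.0]])

# Per-feature neutral values, e.g. derived from training data (assumed).
neutral = np.array([1.0, 0.5])

# Replace each unknown entry with its feature's neutral value.
unknown = (X == -999.0)
X[unknown] = neutral[np.nonzero(unknown)[1]]
```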