[WIP] allow unknowns in OrdinalEncoder transform #16959
Conversation
sklearn/preprocessing/_encoders.py
Outdated
    is present during transform (default is to raise). When this parameter
    is set to 'ignore' and an unknown category is encountered during
    transform, the resulting ordinal encoded value for this feature will be
    -999. In the inverse transform, an unknown category will be denoted as
I agree using -999 would be nice, but it would not really be "Ordinal". In other words, is an "unknown category" considered less than every other category?
In general, I think users want an "IntegerEncoder", which will map strings to ints with no assumption of order.
Yes, I think an unknown category is different and, in a sense, considered less than other categories seen in the fit, because you have to treat them in a special way in a subsequent ML prediction (there were no parameters learned for an unknown category in the fit). Therefore, I think it should definitely not be an integer. Of course, -999 is an arbitrary choice and I have no preference here, but a negative number seems better than nan, as nan can be an original category (missing value) and would therefore be confusing.
In the end one would want some explicit representation of "unknown" in the type itself. But for most algorithms we need a numerical type (here integer), which does not have such an explicit representation. Then -999 actually has some advantages over other choices:
- "0": very dangerous, as this is quite often interpreted as just the first "bin"; that is also the case here, if I understand correctly (0 is the first category), so introducing this would be a breaking change.
- "-1": better than 0, but there is a risk that some might interpret it as belonging next to the "0" bin, i.e. something like an overflow bin "lower than the lower boundary", which is not the same as "unknown".
- "-999": this is far enough away from any accidental collision; it is a very unnatural number, so we can quite safely abuse it to encode the type "unknown". Even if negative numbers are allowed, hitting -999 by chance is highly unlikely.
One more, but weak, argument: in my working group at university years back this was common practice and it "just worked" for years (disclaimer: @FelixWick was in the same group).
One caveat: this might break some use cases, e.g. if algorithms expect positive integers, so would it make sense to make this configurable? Even then, what would the alternative be in this case, "0"?
0 is not an alternative, as it is the first category. And I think -999 is not a problem for algorithms expecting positive integers, because it is mandatory to handle the case of an unknown category specifically before the algorithm call (e.g. set it to a neutral value).
Considering the unknown category "less than all other categories seen in fit" is up to the model taking it as input and should not be assumed by the encoder. Alternatively, unknown can be seen as "greater than all categories seen in fit"; in this case it can be encoded as "cardinality of categories + 1". A user may already have a category they want to use for "unknown", for example a "misc" category that was seen in training.
I agree, it is not about less or greater than other categories, just about being different, as all other categories were seen in the fit (including what you call the misc or unknown category). So the encoder assumes nothing here, it just makes clear that this is a special case.
Encoding unknown to
Let's say the categories were
API-wise, I would like a way to configure the unknown value when encoding unknown categories without adding too many parameters. This way a user can state how they want to encode the unknown categories. Currently saying
The following API could work:

    OrdinalEncoder(unknown_value='error')  # default: error
    OrdinalEncoder(unknown_value=-999)     # encode unknown categories as -999

Note we are focused on releasing 0.23 soon, so my responses may be delayed. Feel free to ping me after the 0.23 release.
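The semantics of the proposed `unknown_value` parameter could be sketched roughly as follows (a toy illustration with hypothetical helper names, not scikit-learn's actual implementation):

```python
import numpy as np

# Toy sketch of the proposed `unknown_value` semantics
# (hypothetical helpers, not scikit-learn's implementation).
def fit_codes(column):
    # Assign ordinal codes 0..n-1 to the categories seen during fit.
    return {cat: code for code, cat in enumerate(sorted(set(column)))}

def ordinal_transform(column, codes, unknown_value='error'):
    out = []
    for value in column:
        if value in codes:
            out.append(codes[value])
        elif unknown_value == 'error':
            # Default: raise on categories not seen during fit.
            raise ValueError(f"Found unknown category {value!r}")
        else:
            # e.g. unknown_value=-999 encodes unknowns as -999.
            out.append(unknown_value)
    return np.array(out)
```

With `unknown_value=-999`, a category unseen during fit would simply come out as -999 instead of raising.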
Sure, -999 is less than the other values. But the order does not matter in this encoding here, does it? And no, I would not map it to "misc" in your example. It is a new value in the transform, that was not seen in the fit. So, it should be treated differently. Mapping it to "misc" adds information that is not there. I agree on the API. To be honest, I took a shortcut here ;). Happy to change it if you are fine with the general idea. And all fine with delayed answers, no rush.
It matters in this context because the encoder is named OrdinalEncoder.
Okay, let's not allow the I am happy with having a
Ok, great. Then I will change the API accordingly in the next days. Concerning the ordering of categories, I just realized that we are both right: usually the encoding here is unordered (just the accidental sequence of different categories during the fit), but one can also enforce an ordered encoding by passing explicit categories instead of leaving the categories parameter as 'auto'.
Any comments on my changes?
thanks
    enc = OrdinalEncoder(unknown_value=1)
    enc.fit(X_fit)
    msg = ("The used value for unknown_value 1 is one of the values already "
I'm not certain that we should raise an error in this case, but let's do it and at worst we can make it more lenient at a later date.
sklearn/preprocessing/_encoders.py
Outdated
    .format(self.unknown_value))
    n_features = len(self.categories_)
    for i in range(n_features):
        if np.isin(self.unknown_value, np.unique(X_int[:, i])):
Shouldn't we just be checking if it's in the range 0 to len(self.categories_[i])? I.e. if we're going to say this should raise an error, it should raise it even if this specific dataset does not contain an instance of that category.
You are right, nice catch, thanks. Fixed.
Looking good!
    enc.fit(X_fit)
    X_trans_enc = enc.transform(X_trans)
    exp = np.array([[2, -999], [-999, 1], [0, 0]], dtype=object)
The dtype of the output should not be object. Is this the case here?
Right, it is integer.
sklearn/preprocessing/_encoders.py
Outdated
    for i in range(n_features):
        if np.isin(self.unknown_value,
                   range(len(self.categories_[i]))):
Nit:

    for cats in self.categories_:
        if 0 <= self.unknown_value < len(cats):
            ...
sklearn/preprocessing/_encoders.py
Outdated
    "The used value for unknown_value {} is one of the "
    "values already used for encoding the seen "
    "categories.".format(self.unknown_value))
    X_int[:, i][~X_mask[:, i]] = self.unknown_value
Suggested change:

    - X_int[:, i][~X_mask[:, i]] = self.unknown_value
    + X_int[~X_mask[:, i], i] = self.unknown_value
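For context on this suggestion: with NumPy, a basic column slice is a view, so assigning through the chained form also modifies the array in place; the single fancy-indexed assignment is simply more direct. A small check with a toy `X_int` and mask (names assumed for illustration):

```python
import numpy as np

X_int = np.array([[0, 1], [2, 0], [1, 2]])
X_mask = np.array([True, False, True])  # False marks an unknown entry

a = X_int.copy()
a[:, 0][~X_mask] = -999   # chained form: works because a[:, 0] is a view
b = X_int.copy()
b[~X_mask, 0] = -999      # suggested single-step fancy indexing

assert (a == b).all()
```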
sklearn/preprocessing/_encoders.py
Outdated
    if np.isin(self.unknown_value,
               range(len(self.categories_[i]))):
        raise ValueError(
            "The used value for unknown_value {} is one of the "
We can use f-strings now!

    - "The used value for unknown_value {} is one of the "
    + f"The used value for unknown_value {self.unknown_value} is one of the "

(and line wrapping to be < 80 line-length)
sklearn/preprocessing/_encoders.py
Outdated
    # create separate category for unknown values
    if self.unknown_value != 'error':
        if not isinstance(self.unknown_value, np.int):
With `import numbers`:

    - if not isinstance(self.unknown_value, np.int):
    + if not isinstance(self.unknown_value, numbers.Integral):
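The point of this suggestion is that `numbers.Integral` matches both builtin Python ints and NumPy integer scalars, while rejecting floats. A small validation helper in that spirit (hypothetical name, sketching the check the review asks for):

```python
import numbers
import numpy as np

def check_unknown_value(unknown_value):
    # Validation along the lines suggested in the review:
    # accept Python ints and NumPy integer scalars, reject anything else.
    if not isinstance(unknown_value, numbers.Integral):
        raise TypeError(
            f"The used value for unknown_value {unknown_value!r} "
            "is not an integer")
    return unknown_value
```

Note that `isinstance(np.int64(5), int)` can be False, which is exactly why `numbers.Integral` is the safer check here.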
sklearn/preprocessing/_encoders.py
Outdated
    # create separate category for unknown values
    if self.unknown_value != 'error':
        if not isinstance(self.unknown_value, np.int):
            raise TypeError("The used value for unknown_value {} is not "
Suggested change:

    - raise TypeError("The used value for unknown_value {} is not "
    + raise TypeError(f"The used value for unknown_value {self.unknown_value} is not "
sklearn/preprocessing/_encoders.py
Outdated
    @@ -622,6 +622,16 @@ class OrdinalEncoder(_BaseEncoder):
        dtype : number type, default np.float64
            Desired dtype of output.

        unknown_value : 'error' or (ordinal) encoded value for unknown cateogry,
We usually specify the type in the docstring:

    - unknown_value : 'error' or (ordinal) encoded value for unknown cateogry,
    + unknown_value : 'error' or int, default='error'
sklearn/preprocessing/_encoders.py
Outdated
    Whether to raise an error or use a specific encoded value if an
    unknown categorical feature is present during transform (default is
    to raise). When this parameter is set to something else than 'error'
    and an unknown category is encountered during transform, the resulting
    (ordinal) encoded value for this feature will be set to this value.
    Ordinal is not strictly required, a negative value works as well. In
    the inverse transform, an unknown category will be denoted as None.
When set to 'error', an error will be raised when an unknown categorical feature is present during transform. When `unknown_value` is an integer, the encoded value of unknown categories will be set to this value. The integer can be negative and must not be one of the integers already used to encode the categories seen in :meth:`fit`. In :meth:`inverse_transform`, an unknown category will be denoted as `None`.
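The documented behavior (sentinel code on transform, None on inverse transform) can be illustrated with a minimal stand-in encoder (a hypothetical class for illustration, not the real OrdinalEncoder):

```python
import numpy as np

class MiniOrdinalEncoder:
    """Minimal stand-in illustrating the documented unknown handling."""

    def __init__(self, unknown_value='error'):
        self.unknown_value = unknown_value

    def fit(self, column):
        self.categories_ = sorted(set(column))
        self._code = {c: i for i, c in enumerate(self.categories_)}
        return self

    def transform(self, column):
        out = []
        for v in column:
            if v in self._code:
                out.append(self._code[v])
            elif self.unknown_value == 'error':
                raise ValueError(f"Found unknown category {v!r}")
            else:
                out.append(self.unknown_value)
        return np.array(out)

    def inverse_transform(self, codes):
        # Codes outside the fitted range (i.e. the sentinel) map to None.
        return [self.categories_[c] if 0 <= c < len(self.categories_)
                else None for c in codes]
```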
    @@ -622,6 +623,14 @@ class OrdinalEncoder(_BaseEncoder):
        dtype : number type, default np.float64
            Desired dtype of output.

        unknown_value : 'error' or int, default='error'
Should this be called `handle_unknown` for consistency with the OneHotEncoder?
Hm, not sure. We used another name to emphasize that we can choose the encoded value here. But you are right, in principle, it is a similar thing.
It's slightly strange to have handle_unknown=-10, since -10 is not an "action".
Using the same name may cause some confusion, since a user may set it to 'ignore' (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to an integer in OneHotEncoder.
Those two reasons were why I suggested a different name for this PR.
a user may set it to 'ignore', (which should error). Or the inverse may happen: if handle_unknown is allowed to be an integer here, a user may set it to be an integer in OneHotEncoder
For me that's not really an issue: users would get an error immediately, it's not like the OHE is going to have an unexpected behavior.
We already have other places where the same parameter name accepts slightly different values depending on the nature of the estimator.
This parameter is about how unknown values are handled in transform, so its semantics are exactly the same as OHE's handle_unknown. I think it should be called the same.
Also, at least to me, the current unknown_value name suggests that the parameter means "this is what an unknown value looks like in fit", as in e.g. "treat these values as missing", which is a different thing.
It's slightly strange to have handle_unknown=-10, since -10 is not an "action".
An alternative is to have handle_unknown='replace' with an additional replace_value parameter.
Maybe others should weigh in, consistency is important.
    X = np.array([['a', 'b', 'd']], dtype=object).T
    enc = OrdinalEncoder(handle_unknown='use_category',
                         unknown_category='c').fit(X)

As I mentioned before, this can only be supported if we assume that X has no integer values. If X has integer values, then there is no way to know whether the parameter refers to raw values or encoded values. So I don't think we should consider this option.
map unknown to -2.
Why -2?
Why -2?

There is no particular reason, other than that I want -1 to be reserved for missing. If we have demand for mapping to other integers, we can add the parameter then. I think this already solves 95% of the problem with unknown values in OrdinalEncoder.
I think the second parameter (unknown_category or replace_value) should specify an encoded value, not a raw value. Otherwise this will conflict when raw values are integers.
I still like the ability to pass a category (in input category space) for the unknown values, since it then makes it possible to implement inverse_transform on test data with unknown values, which would not be possible when we only pass the integer value.
We could support the two options though.
    OrdinalEncoder(handle_unknown="use_category", unknown_category="unknown")
    OrdinalEncoder(handle_unknown="use_value", unknown_encoded_value=-1)

Or even just use the handle_unknown="use_category" option but make it possible to specify both an input space category and an encoded space value:

    OrdinalEncoder(handle_unknown="use_category", unknown_category="other", unknown_encoded_value=-1)

But then we need to be careful if unknown_category is already part of the category labels of the training set.
Another motivation for making it possible to pass the input space category is to make it easy to map unknown values to a meaningful garbage category that is already present in the training set (e.g. "other", "misc", "background").
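The "use_category" idea above, mapping unknown values onto a designated category that already exists in the training set, could be sketched like this (names and the fallback category are hypothetical):

```python
# Sketch of "use_category" handling: unseen values are replaced by a
# designated training-set category before ordinal encoding.
categories = ['misc', 'a', 'b']                  # categories seen during fit
code = {c: i for i, c in enumerate(categories)}

def encode_with_fallback(values, unknown_category='misc'):
    # Unknown raw values are encoded as the fallback category's code.
    return [code[v] if v in code else code[unknown_category]
            for v in values]
```

One nice property of this scheme is that inverse_transform stays well-defined, since every encoded value corresponds to a real category.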
What about having two parameters, one to decide on the handling (use category or encoded value for unknowns) and one for the value for unknowns (with its meaning depending on the handling decision):

- OrdinalEncoder(handle_unknown="use_category", unknown="other")
- OrdinalEncoder(handle_unknown="use_encoded_value", unknown=-1)

Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.
I find my original suggestion more explicit, but I am also fine with yours if others like it better. For the default encoded value I would rather leave it as None and force the user to be explicit.

Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.

The default values would be:

    OrdinalEncoder(handle_unknown="error", unknown_value=None)

which would not raise any error at fit time but would raise an error at predict time if unknown values are encountered.

    OrdinalEncoder(handle_unknown="use_category", unknown_value=None)

or

    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=None)

would raise an error at fit time, asking the user to set unknown_value to a value with a valid dtype (e.g. an integer if handle_unknown == "use_encoded_value", or a str if handle_unknown == "use_category" and all category values in the training set are str instances).
I suspect there is a lot of over-engineering going on here. If we took @thomasjpfan's minimal solution of always mapping to -2, will it be easy for users to post-process it into some other useful form? I can't see it being trivial to map it back to a specific category, but it's not super hard either.
Users can still develop algorithms where
If they need the specific category after processing, users would need to pass in the original data. With this encoding scheme, multiple categories can be mapped to
The goal of my suggestion is to make a small step forward while still allowing us to add more flexibility in future PRs.
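The minimal scheme discussed here (unknown always mapped to -2, with -1 reserved for missing) could be sketched as follows; the function name and the use of None for missing values are assumptions for illustration:

```python
import numpy as np

# Sketch of the minimal scheme: known categories get codes 0..n-1,
# missing values get -1, and unknown categories get -2.
def encode_minimal(values, categories):
    code = {c: i for i, c in enumerate(categories)}
    return np.array([-1 if v is None else code.get(v, -2) for v in values])
```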
Yes, and I think it's a good place to start.
Okidoki. I have created another pull request with this simplified version: #17406
Can I be annoying and ask @thomasjpfan why we'd rather NA be between unknown and known categories? The other case we need to handle at some point is infrequent categories... Will these be mapped to unknown at least in some usages?
We can adopt logic similar to what we have in #16018.
It's reasonable to follow the pandas convention, except that it's not intended for ML. I'd expect NA to be distinguishable from all other categories, and hence lower than other non-specific category values... Although infrequent categories in a strictly ordered setting should usually be mapped to a neighboring category, rather than a separate category?

Sorry, that "although" was in relation to the other comment.
I do not have a strong preference to using
I recall you had a similar suggestion in one of the many encoder threads. At this point, I think users are used to
As for the "unknown category", we could also have this "merge-up" or "merge-down" approach, but I think most want a separate category. The algorithms that benefit from this, trees (when they have native support for categories, #16909) and categorical NB, do not consider the encoding's order.
As far as I am concerned, nan is problematic: it imposes special casing in the code later.
So, should we just close this one and finish the discussions there?
I think #17406, which implements a part of #16959 (comment), would be the first step in getting something merged.

Edit: After the meeting, I am happy with @ogrisel's proposal.
Ok, just a quick check before I re-write #17406.
To be explicit:
To keep the PR smaller, I think we can first work on
@thomasjpfan |
Yes, include the
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Sometimes it can be convenient to allow values (categories) in the transform of OrdinalEncoder that were not present in the fit data set. For example, a machine learning method that is able to set such an unknown value of the corresponding feature to a neutral (i.e. non-informative) value could use the information from all other features of the sample and still output a prediction (instead of no prediction at all).
Any other comments?
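The downstream step described in the PR motivation, replacing unknown-coded entries with a neutral value before prediction, might look like this (the -999 sentinel and the per-feature neutral values are assumptions for illustration):

```python
import numpy as np

# Encoded feature matrix where -999 marks unknown categories
# (assumed sentinel value from the encoder).
X = np.array([[2.0, -999.0], [-999.0, 1.0], [0.0, 0.0]])

# Per-feature neutral values, e.g. derived from training data (assumed).
neutral = np.array([1.0, 0.5])

# Replace each unknown entry with its feature's neutral value.
unknown = (X == -999.0)
X[unknown] = neutral[np.nonzero(unknown)[1]]
```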