-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
[WIP] allow unknowns in OrdinalEncoder transform #16959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
FelixWick
wants to merge
6
commits into
scikit-learn:master
from
FelixWick:allow_unknowns_ordinalencoder
Closed
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
1dd97cb
allow unknown values in OrdinalEncoder transform
1b61c03
PEP8 compatibility
632b4e5
added unknown_value parameter to encode new category
42625ec
Merge remote-tracking branch 'upstream/master' into allow_unknowns_or…
99a687a
raise for all fitted categories
767af21
using f strings and some other cosmetics
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -553,6 +553,40 @@ def test_ordinal_encoder_raise_missing(X): | |
ohe.transform(X) | ||
|
||
|
||
def test_ordinal_encoder_handle_unknowns(): | ||
enc = OrdinalEncoder(unknown_value=-999) | ||
X_fit = np.array([['a', 'x'], ['b', 'y'], ['c', 'z']], dtype=object) | ||
X_trans = np.array([['c', 'xy'], ['bla', 'y'], ['a', 'x']], dtype=object) | ||
enc.fit(X_fit) | ||
|
||
X_trans_enc = enc.transform(X_trans) | ||
exp = np.array([[2, -999], [-999, 1], [0, 0]], dtype='int64') | ||
assert_array_equal(X_trans_enc, exp) | ||
|
||
X_trans_inv = enc.inverse_transform(X_trans_enc) | ||
inv_exp = np.array([['c', None], [None, 'y'], ['a', 'x']], dtype=object) | ||
assert_array_equal(X_trans_inv, inv_exp) | ||
|
||
|
||
def test_ordinal_encoder_raise_wrong_unknowns(): | ||
X_fit = np.array([['a', 'x'], ['b', 'y']], dtype=object) | ||
X_trans = np.array([['c', 'xy'], ['bla', 'y']], dtype=object) | ||
|
||
enc = OrdinalEncoder(unknown_value="wrong") | ||
enc.fit(X_fit) | ||
msg = ("The used value for unknown_value wrong is not an integer as " | ||
"required.") | ||
with pytest.raises(TypeError, match=msg): | ||
enc.transform(X_trans) | ||
|
||
enc = OrdinalEncoder(unknown_value=1) | ||
enc.fit(X_fit) | ||
msg = ("The used value for unknown_value 1 is one of the values already " | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not certain that we should raise an error in this case, but let's do it and at worse we can make it more lenient at a later date. |
||
"used for encoding the seen categories.") | ||
with pytest.raises(ValueError, match=msg): | ||
enc.transform(X_trans) | ||
|
||
|
||
def test_ordinal_encoder_raise_categories_shape(): | ||
|
||
X = np.array([['Low', 'Medium', 'High', 'Medium', 'Low']], dtype=object).T | ||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be called
handle_unknown
for consistency with the OneHotEncoder?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, not sure. We used another name to emphasize that we can choose the encoded value here. But you are right, in principle, it is a similar thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @thomasjpfan @jnothman
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's slightly strange to have
handle_unknown=-10
, since-10
is not an "action".Being the same name may case some confusion, since a user may set it to
'ignore'
, (which should error). Or the inverse may happen: ifhandle_unknown
is allowed to be an integer here, a user may set it to be an integer inOneHotEncoder
.Those two reasons was why I suggested a different name for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me that's not really an issue: users would get an error immediately, it's not like the OHE is going to have an unexpected behavior.
We already have other places where the same parameter name accepts slightly different values depending on the nature of the estimator.
This parameter is about how unknown values are handled in transform, so its semantic is exactly the same as OHE's
handle_unknown
. I think it should be called the same.Also, at least to me, the current
unknown_value
name suggests that the parameter means "this is what an unknown value looks like in fit", as in e.g. "treat these values as missing", which is a different thing.An alternative is to have
handle_unkown='replace'
with an additionalreplace_value
parameter.Maybe others should weigh in, consistency is important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned before this can only be supported if we assume that X has no integer values. If X has integer values then there is now way to know whether the parameter refers to raw values or encoded values. So I don't think we should consider this option.
Why -2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no particular reason, other than I want
-1
to be missing. If we have a demand for enabling mapping to other integers, we can add the parameter then. I think this solves 95% of the problem already with unknown values inOrdinalEncoder
.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still like the ability to be able to pass a category (in input category space) for the unknown values since it then make it possible to implement
inverse_transform
on test data with unknown values which would not be possible when we only pass the integer value.We could support the two options though.
OrdinalEncoder(handle_unknown="use_category", unknown_category="unknown")
OrdinalEncoder(handle_unknown="use_value", unkown_encoded_value=-1)
Or even just use the
handle_unknown="use_category"
option but make it possible to specify both an input space category and an encoded space value:OrdinalEncoder(handle_unknown="use_category", unknown_category="other", unkown_encoded_value=-1)
But then we need to be careful if
unknown_category
is already part of the category labels of the training set.Another motivation for making it possible to pass the input space category is to make it easy to map unknown values to a meaningful garbage category that is already present in the training set (e.g. "other", "misc", "background").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about having two parameters, one to decide on the handling (use category or encoded value for unknowns) and one on the value for unknowns (with meaning depending on the handling decision):
OrdinalEncoder(handle_unknown="use_category", unknown="other")
OrdinalEncoder(handle_unknown="use_encoded_value", unknown=-1)
Not sure if we should use a default for category usage. For encoded value usage I like either -1 or -999.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find my original suggestion more explicit but I am also fine with yours if others like it better. For the default encoded value I would rather leave it to
None
and force the user to be explicit.The default values would be:
`OrdinalEncoder(handle_unknown="error", unknown_value=None)`
which would not raise any error at fit time but would raise an error at
predict
time if unknown values are encountered.`OrdinalEncoder(handle_unknown="use_category", unknown_value=None)`
or
`OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=None)`
would raise an error at fit time asking the user to ask the user to set
unknown_value
to a value with a valid dtype at fit time (e.g. an integer ofhandle_unknown == "use_encoded_value"
or astr
ifhandle_unknown="use_category"
and all category values in the training set asstr
instances.