FIX Allow OrdinalEncoder's encoded_missing_value set to the cardinality #25704

Merged 2 commits on Feb 28, 2023
7 changes: 7 additions & 0 deletions doc/whats_new/v1.2.rst
@@ -72,6 +72,13 @@ Changelog
   when the global configuration sets `transform_output="pandas"`.
   :pr:`25500` by :user:`Guillaume Lemaitre <glemaitre>`.

+:mod:`sklearn.preprocessing`
+............................
+
+- |Fix| :class:`preprocessing.OrdinalEncoder` now correctly supports
+  `encoded_missing_value` or `unknown_value` set to a categories' cardinality
+  when there is missing values in the training data. :pr:`25704` by `Thomas Fan`_.
+
 :mod:`sklearn.utils`
 ....................

28 changes: 17 additions & 11 deletions sklearn/preprocessing/_encoders.py
@@ -1300,24 +1300,30 @@ def fit(self, X, y=None):
         # `_fit` will only raise an error when `self.handle_unknown="error"`
         self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")

-        if self.handle_unknown == "use_encoded_value":
-            for feature_cats in self.categories_:
-                if 0 <= self.unknown_value < len(feature_cats):
-                    raise ValueError(
-                        "The used value for unknown_value "
-                        f"{self.unknown_value} is one of the "
-                        "values already used for encoding the "
-                        "seen categories."
-                    )
+        cardinalities = [len(categories) for categories in self.categories_]

         # stores the missing indices per category
         self._missing_indices = {}
         for cat_idx, categories_for_idx in enumerate(self.categories_):
             for i, cat in enumerate(categories_for_idx):
                 if is_scalar_nan(cat):
                     self._missing_indices[cat_idx] = i
+
+                    # missing values are not considered part of the cardinality
+                    # when considering unknown categories or encoded_missing_value
+                    cardinalities[cat_idx] -= 1
                     continue

+        if self.handle_unknown == "use_encoded_value":
+            for cardinality in cardinalities:
+                if 0 <= self.unknown_value < cardinality:
+                    raise ValueError(
+                        "The used value for unknown_value "
+                        f"{self.unknown_value} is one of the "
+                        "values already used for encoding the "
+                        "seen categories."
+                    )
+
         if self._missing_indices:
             if np.dtype(self.dtype).kind != "f" and is_scalar_nan(
                 self.encoded_missing_value
@@ -1336,9 +1342,9 @@ def fit(self, X, y=None):
                 # known category
                 invalid_features = [
                     cat_idx
-                    for cat_idx, categories_for_idx in enumerate(self.categories_)
+                    for cat_idx, cardinality in enumerate(cardinalities)
                     if cat_idx in self._missing_indices
-                    and 0 <= self.encoded_missing_value < len(categories_for_idx)
+                    and 0 <= self.encoded_missing_value < cardinality
                 ]

                 if invalid_features:
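
The validation changed in `sklearn/preprocessing/_encoders.py` can be read in isolation as the minimal sketch below. It is plain Python, not the scikit-learn implementation: the helper names `is_nan`, `effective_cardinalities`, and `validate_reserved_code` are hypothetical, `is_nan` is only a stand-in for scikit-learn's `is_scalar_nan`, and the error message is shortened.

```python
import math


def is_nan(value):
    """Stand-in for scikit-learn's is_scalar_nan (assumption: float NaN only)."""
    return isinstance(value, float) and math.isnan(value)


def effective_cardinalities(categories):
    """Count the categories per feature, leaving out a missing-value entry."""
    cardinalities = [len(cats) for cats in categories]
    for feature_idx, cats in enumerate(categories):
        if any(is_nan(cat) for cat in cats):
            # NaN gets its own code, so it does not occupy an ordinal slot.
            cardinalities[feature_idx] -= 1
    return cardinalities


def validate_reserved_code(code, cardinalities):
    """Raise if `code` collides with the ordinal code of a seen category."""
    for cardinality in cardinalities:
        if 0 <= code < cardinality:
            raise ValueError(f"{code} is already used for encoding a seen category.")


# One feature whose learned categories are ["cat", "dog", nan].
categories = [["cat", "dog", float("nan")]]
cards = effective_cardinalities(categories)

# Accepted: 2 equals the cardinality once NaN is excluded.
validate_reserved_code(2, cards)

try:
    # Rejected: 1 is the code that already encodes "dog".
    validate_reserved_code(1, cards)
    collision_detected = False
except ValueError:
    collision_detected = True
```

Counting NaN as a regular category would give a cardinality of 3 here, so a reserved code of 2 would be rejected even though no seen category uses it. Excluding NaN from the count is what lets the reserved code equal the cardinality.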
12 changes: 12 additions & 0 deletions sklearn/preprocessing/tests/test_encoders.py
@@ -2003,3 +2003,15 @@ def test_predefined_categories_dtype():
     for n, cat in enumerate(enc.categories_):
         assert cat.dtype == object
         assert_array_equal(categories[n], cat)
+
+
+def test_ordinal_encoder_missing_unknown_encoding_max():
+    """Check missing value or unknown encoding can equal the cardinality."""
+    X = np.array([["dog"], ["cat"], [np.nan]], dtype=object)
+    X_trans = OrdinalEncoder(encoded_missing_value=2).fit_transform(X)
+    assert_allclose(X_trans, [[1], [0], [2]])
+
+    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=2).fit(X)
+    X_test = np.array([["snake"]])
+    X_trans = enc.transform(X_test)
+    assert_allclose(X_trans, [[2]])
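
To make the expected arrays in the new test easier to follow, here is a hedged, pure-Python sketch of how a single column could be mapped to codes. It is not scikit-learn's `transform`: `encode_column` and its parameters are hypothetical names chosen for illustration, and the categories are assumed to be sorted with the missing value left out.

```python
import math


def encode_column(values, categories, missing_code=None, unknown_code=None):
    """Map each value to its index in `categories`.

    Hypothetical helper: NaN maps to `missing_code`, and any other value
    that is absent from `categories` maps to `unknown_code`.
    """
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            encoded.append(missing_code)
        elif value in index:
            encoded.append(index[value])
        else:
            encoded.append(unknown_code)
    return encoded


# Sorted categories seen during fit, with NaN excluded (assumption).
categories = ["cat", "dog"]

# Missing value encoded as the cardinality (2), as in the first assertion.
train_codes = encode_column(["dog", "cat", float("nan")], categories, missing_code=2)

# Unknown category encoded as the cardinality (2), as in the second assertion.
test_codes = encode_column(["snake"], categories, unknown_code=2)
```

In this sketch the seen categories take the codes 0 and 1, so 2 is the first code that cannot collide with them, which is why the test uses it for both the missing and the unknown case.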