ENH Adds nan passthrough in OrdinalEncoder #19069
Conversation
Thanks for working on this. What's the rationale for supporting `None`?
Thanks @thomasjpfan
sklearn/preprocessing/_encoders.py
Outdated
```python
f"There are missing values in feature {cat_idx}. With "
"handle_missing=passthrough, OrdinalEncoder's dtype "
"parameter must be float")
```
Why only `float`?
Because they are going to be mapped to `np.nan`, which cannot be stored in an integer array.
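This constraint can be checked directly with NumPy: NaN is a float sentinel and cannot be assigned into an integer array, which is why the encoder's `dtype` must be a float type when NaNs pass through. A minimal sketch:

```python
import numpy as np

# NaN is a float value, so an integer array cannot hold it
int_codes = np.zeros(3, dtype=np.int64)
try:
    int_codes[0] = np.nan
except ValueError as exc:
    print(f"integer array rejected NaN: {exc}")

# a float array stores NaN without trouble
float_codes = np.zeros(3, dtype=np.float64)
float_codes[0] = np.nan
print(np.isnan(float_codes[0]))
```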
I guess we could support `int` once we support pandas data types here :D
sklearn/preprocessing/_encoders.py
Outdated
```python
if self.handle_missing == 'passthrough':
    force_all_finite = 'allow-nan'
else:
    force_all_finite = True
```
Also check for an invalid `handle_missing` value? Probably worth putting this logic in a method since it's the same code.
sklearn/preprocessing/_encoders.py
Outdated
```python
X_trans = X_int.astype(self.dtype, copy=False)

if self.handle_missing == 'passthrough':
    for cat_idx, cats in enumerate(self.categories_):
```
is it `cat_idx, cat` or `cat_idx, cats`?
As noted on the issue, I see no clear justification for needing a parameter to do this. Why not do it by default?
Updated PR to no longer need a parameter.
Currently,
doc/modules/preprocessing.rst
Outdated
```rst
@@ -482,6 +482,15 @@ scikit-learn estimators, as these expect continuous input, and would interpret
the categories as being ordered, which is often not desired (i.e. the set of
browsers was ordered arbitrarily).

:class:`OrdinalEncoder` will also passthrough missing values::
```
Please also document `None`.
sklearn/preprocessing/_encoders.py
Outdated
```
If there are categories that are missing, then
both `None` and `np.nan` are inverse transformed as `np.nan`, i.e.
`inverse_transform(transform([None])) == np.nan`.
```
It seems that this snippet doesn't illustrate the above textual description: `transform([None])` is `[nan]`, so all this is saying is that `inverse_transform([nan]) == [nan]`, as expected.

Instead, the text above describes the fact that `inverse_transform([None]) == [nan]`, but:

- this doesn't seem to be tested (I might have missed something)
- I find that quite unexpected, since `transform()` is unable to produce None as output. So I would expect `inverse_transform` to fail if we give it None, or any other unexpected value (like a floating value)
I adjusted the wording here:

> If a categorical feature uses `None` and `np.nan` as a missing value indicator, the inverse_transform will map `np.nan` to `np.nan`.

This behavior is tested in `test_ordinal_encoder_passthrough_missing_values_object`. (I added a comment there to explicitly say this.)
@thomasjpfan I think there are a few inconsistencies in the current behavior:

- Inverse-transforming None currently doesn't fail but it should: None is not a valid transformed value, so we should error in this case, since we already properly error when one tries to inverse-transform a value that is out of range (e.g. `inverse_transform(4)` when there are only 3 values in fit properly fails).
- The inverse-transformed value of nan and None depends on whether NaNs were seen during fit. This is quite unexpected, as the following 3 snippets illustrate:

```python
oe.fit_transform([[0], [1], [np.nan], [None]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan],
        [nan]]),
 array([[nan],
        [nan]], dtype=object))
```

```python
oe.fit_transform([[0], [1], [np.nan]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan]]),
 array([[nan],
        [nan]]))
```

```python
oe.fit_transform([[0], [1], [None]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan]]),
 array([[None],
        [None]], dtype=object))
```

Note how both NaN and None are inverse-transformed to None instead of nan when NaNs aren't present in fit. (Again, I think we should not allow inverse-transforming None, but in any case the inconsistency for NaN should be fixed.)

Also, inverse-transforming NaN or None when neither of them is present in fit fails with an index-error message (which could be improved, ideally).
doc/modules/preprocessing.rst
Outdated
```rst
:class:`OrdinalEncoder` will also passthrough missing values. Note that
missing values indicated by `np.nan` or `None` will both be mapped to
`np.nan`::
```
```diff
- :class:`OrdinalEncoder` will also passthrough missing values. Note that
- missing values indicated by `np.nan` or `None` will both be mapped to
- `np.nan`::
+ :class:`OrdinalEncoder` will also passthrough missing values. Missing values
+ can be indicated by `np.nan` or `None`, which will both be mapped to
+ `np.nan`::
```
sklearn/preprocessing/_encoders.py
Outdated
```
If a categorical feature uses `None` *and* `np.nan` as a missing
value indicator, the inverse_transform will map `np.nan` to `np.nan`.
```
The current wording (using "and") suggests that the sentence only applies if X contains both nans and Nones. However I think the following will be clearer:

```diff
- If a categorical feature uses `None` *and* `np.nan` as a missing
- value indicator, the inverse_transform will map `np.nan` to `np.nan`.
+ NaNs will be inverse-transformed to NaNs. Since both None and NaNs
+ are mapped to NaN during `transform`, this means that
+ `inverse_transform(transform([[None], [NaN]])) == [[NaN], [NaN]]`
```
Actually my suggested sentence isn't even correct due to the bug noted in #19069 (review).
I still think this sentence should be improved, which we can do once the bug is fixed
Implementation wise, this would require explicitly checking for
Assuming we error when we inverse-transform
Stepping back, I think it would be good to remove a portion of this PR to get something useful merged. We can either:
WDYT @NicolasHug?
I would go with option 1, since it's probably going to address 99.9% of the use-cases and it would simplify the logic. If we choose to support None, I think we should just have
Error is unrelated. This PR has been updated to remove support for
Otherwise LGTM
doc/whats_new/v1.0.rst
Outdated
```rst
:mod:`sklearn.naive_bayes`
..........................

- |API| The attribute ``sigma_`` is now deprecated in
```
merge error?
sklearn/preprocessing/_encoders.py
Outdated
```python
# None and np.nan will only appear at the end of feature_cats
if None in categories_for_idx[-2:]:
    raise ValueError(
        "None is an unsupported missing value indicator and was "
```
FWIW, this is backwards incompatible. We currently transform None as another category label. Are we sure we want to do this?
Ah, I did not know `None` was already supported. This kind of changes things a little.
`None` support doesn't seem to be tested nor documented anywhere though. If we choose to keep the previous behaviour we probably want to add tests and docs.
sklearn/preprocessing/_encoders.py
Outdated
```python
X_trans = X_int.astype(self.dtype, copy=False)

for cat_idx, categories_for_idx in enumerate(self.categories_):
    missing_indices = [i for i, v in enumerate(categories_for_idx)
                       if is_scalar_nan(v) or v is None]
```
is this `or v is None` a vestige of a previous proposal?
sklearn/preprocessing/_encoders.py
Outdated
```python
for cat_idx, categories_for_idx in enumerate(self.categories_):
    missing_indices = [i for i, v in enumerate(categories_for_idx)
                       if is_scalar_nan(v) or v is None]
    for missing_idx in missing_indices:
```
Ordinarily we would expect `len(missing_indices) <= 1`. I look at this code and immediately incline to reduce its complexity, but if `len(missing_indices)` is rarely more than 1 then it's fine.
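For intuition, the comprehension being discussed can be exercised standalone. Here `is_scalar_nan` is a minimal stand-in for `sklearn.utils.is_scalar_nan`, and the category array is a hypothetical fitted result; in practice at most one NaN (plus possibly one None) sorts to the end, so `missing_indices` rarely holds more than one or two entries:

```python
import numpy as np

def is_scalar_nan(x):
    # minimal stand-in: NaN is the only float that is not equal to itself
    return isinstance(x, float) and x != x

# hypothetical fitted categories for one feature; missing markers sort last
categories_for_idx = np.array(["a", "b", None, np.nan], dtype=object)
missing_indices = [i for i, v in enumerate(categories_for_idx)
                   if is_scalar_nan(v) or v is None]
print(missing_indices)  # [2, 3]
```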
sklearn/preprocessing/_encoders.py
Outdated
```
@@ -833,6 +861,17 @@ def inverse_transform(self, X):

for i in range(n_features):
    labels = X[:, i].astype('int64', copy=False)

    # place values of X[:, i] that were nan with actual indices
```
```diff
- # place values of X[:, i] that were nan with actual indices
+ # replace values of X[:, i] that were nan with actual indices
```
```python
def test_ordinal_encoder_handle_missing_and_unknown(
    X, expected_X_trans, X_test
):
    """Test the interaction between handle_missing and handle_unknown"""
```
`handle_missing` isn't a thing.

```diff
- """Test the interaction between handle_missing and handle_unknown"""
+ """Test the interaction between missing values and handle_unknown"""
```
From #19069 (comment), I see that:

```python
from sklearn.preprocessing import OrdinalEncoder

X = [["a"], ["b"], [None]]
OrdinalEncoder().fit_transform(X)
# array([[0.],
#        [1.],
#        [2.]])
```

Since we already encode `None` as its own category, WDYT @NicolasHug @jnothman @adrinjalali?
Is this "support" for `None` intentional, or an accident of the implementation? It seems to me that it could be the latter, as there are no docs nor tests for this (I may have missed something). Also, we used to explicitly not support NaNs, which I assume is because we don't know where to map them (should it be the biggest number, the lowest, something in the middle? The choice is pretty arbitrary). So if we explicitly don't support NaNs, why support None? It doesn't make much sense to me to map `None` but not `np.nan`.
This would not let the final estimator (e.g. HistGradientBoosting) natively handle the NaNs, which was the original point of #18968 (granted, the behaviour would be almost equivalent as NaNs are treated as categories, but still).
It's mainly a no-brainer in
The way we encode
I am thinking of the case where someone is using `None` intentionally as a category. In the meantime, I updated the PR to still reject `None`.
@jnothman what would be your thoughts on this?
I don't know what standard processing produces None, so it's very hard to reason about what users might do by accident. I can imagine users intentionally differentiating None from other categories without it specifically meaning "missing". I am okay to break on this, certainly in the context of 1.0 and where we are breaking with an error rather than a silent change in behaviour.
One could argue that the user can use
I updated this PR to keep the current `None` behaviour. We can decide on whether to change that in a follow-up.
Okay.
Maybe the behaviour with None should be tested rather than implicit!
LGTM
sklearn/preprocessing/_encoders.py
Outdated
```
@@ -775,6 +776,23 @@ def fit(self, X, y=None):
                f"values already used for encoding the "
                f"seen categories.")

# stories the missing indices per category
```
```diff
- # stories the missing indices per category
+ # stores the missing indices per category
```
really just an innocuous commit to check scikit-learn#19155
Thanks @thomasjpfan!
Reference Issues/PRs

Resolves #18968

What does this implement/fix? Explain your changes.

- `None` and `np.nan` will be "passthrough" as `np.nan`.
- If `None` and `np.nan` are in `categories_[i]`, then `inverse_transform` will use `np.nan` as the inverse of `None` or `np.nan`.