ENH Adds nan passthrough in OrdinalEncoder #19069

Merged
25 commits merged into main on Feb 24, 2021

Conversation

thomasjpfan
Member

Reference Issues/PRs

Resolves #18968

What does this implement/fix? Explain your changes.

  1. Both None and np.nan will be passed through as np.nan.
  2. If both None and np.nan are in categories_[i], then inverse_transform will use np.nan as the inverse of either of them.
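The intended transform behavior can be sketched in a few lines (a toy illustration only, not the actual PR code; `passthrough_encode` is a hypothetical name):

```python
import numpy as np

def passthrough_encode(column, categories):
    """Toy sketch of the proposed behavior: encode one object column,
    passing both None and np.nan through as np.nan in the float output."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and np.isnan(v))

    out = np.empty(len(column), dtype=float)
    for i, v in enumerate(column):
        # Missing indicators are not encoded; they become nan directly.
        out[i] = np.nan if is_missing(v) else categories.index(v)
    return out
```

For example, `passthrough_encode(["a", "b", None, np.nan], ["a", "b"])` returns `[0., 1., nan, nan]`.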

@NicolasHug
Member

Thanks for working on this. What's the rationale for supporting None?


@adrinjalali adrinjalali left a comment


Thanks @thomasjpfan

Comment on lines 797 to 799
f"There are missing values in feature {cat_idx}. With "
"handle_missing=passthrough, OrdinalEncoder's dtype "
"parameter must be float")
Member

Why only f?

Member Author

Because they are going to be mapped to np.nan which can not be stored in an integer array.

Member

I guess we could support int once we support pandas data types here :D

if self.handle_missing == 'passthrough':
force_all_finite = 'allow-nan'
else:
force_all_finite = True
Member

also check for an invalid handle_missing value? Probably worth putting this logic in a method since it's the same code.

X_trans = X_int.astype(self.dtype, copy=False)

if self.handle_missing == 'passthrough':
for cat_idx, cats in enumerate(self.categories_):
Member

is it cat_idx, cat or cat_idx, cats?

@jnothman
Member

As noted on the issue, I see no clear justification for needing a parameter to do this. Why not do it by default?

@thomasjpfan thomasjpfan changed the title from "ENH Adds handle_missing=passthrough in OrdinalEncoder" to "ENH Adds nan passthrough in OrdinalEncoder" on Jan 20, 2021
@thomasjpfan
Member Author

As noted on the issue, I see no clear justification for needing a parameter to do this. Why not do it by default?

Updated PR to no longer need a parameter.

Thanks for working on this. What's the rationale for supporting None?

Currently, OneHotEncoder supports both None and np.nan, although I have not really encountered None as a representation of missing values when working in pandas.

@@ -482,6 +482,15 @@ scikit-learn estimators, as these expect continuous input, and would interpret
the categories as being ordered, which is often not desired (i.e. the set of
browsers was ordered arbitrarily).

:class:`OrdinalEncoder` will also passthrough missing values::
Member

Please also document None

Comment on lines 818 to 820
If there are categories that are missing, then
both `None` and `np.nan` are inverse transformed as `np.nan`, i.e.
`inverse_transform(transform([None])) == np.nan`.
Member

It seems that this snippet doesn't illustrate the above textual description:

transform([None]) is [nan] so all this is saying is that inverse_transform([nan]) == [nan], as expected

Instead, the text above describes the fact that inverse_transform([None]) == [nan] but:

  • this doesn't seem to be tested (I might have missed something)
  • I find that quite unexpected since transform() is unable to produce None as output. So I would expect inverse_transform to fail if we give it None, or any other unexpected value (like a floating value)

Member Author

I adjusted the wording here:

If a categorical feature uses None and np.nan as a missing value indicator, the inverse_transform will map np.nan to np.nan.

This behavior is tested in test_ordinal_encoder_passthrough_missing_values_object. (I added a comment there to explicitly say this.)

Base automatically changed from master to main January 22, 2021 10:53

@NicolasHug NicolasHug left a comment


@thomasjpfan I think there are a few inconsistencies in the current behavior:

  • inverse-transforming None currently doesn't fail but it should: None is not a valid transformed value, so we should error in this case, since we already properly error when one tries to inverse-transform a value that is out of range (e.g. inverse_transform(4) when there are only 3 values in fit properly fails).
  • The inverse-transformed value of nan and None depends on whether NaNs were seen during fit. This is quite unexpected, as the following 3 snippets illustrate:
oe.fit_transform([[0], [1], [np.nan], [None]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan],
        [nan]]),
 array([[nan],
        [nan]], dtype=object))
oe.fit_transform([[0], [1], [np.nan]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan]]),
 array([[nan],
        [nan]]))
oe.fit_transform([[0], [1], [None]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan]]),
 array([[None],
        [None]], dtype=object))

Note how both NaN and None are inverse-transformed to None instead of nan when NaNs aren't present in fit. (Again, I think we should not allow inverse-transforming None, but in any case the inconsistency for NaN should be fixed.)

Also, inverse-transforming NaN or None when neither of them is present in fit fails with an index-error message (which could ideally be improved).

Comment on lines 485 to 487
:class:`OrdinalEncoder` will also passthrough missing values. Note that
missing values indicated by `np.nan` or `None` will both be mapped to
`np.nan`::
Member

Suggested change
:class:`OrdinalEncoder` will also passthrough missing values. Note that
missing values indicated by `np.nan` or `None` will both be mapped to
`np.nan`::
:class:`OrdinalEncoder` will also passthrough missing values. Missing values
can be indicated by `np.nan` or `None`, which will both be mapped to
`np.nan`::

Comment on lines 818 to 819
If a categorical feature uses `None` *and* `np.nan` as a missing
value indicator, the inverse_transform will map `np.nan` to `np.nan`.
Member

The current wording (using "and") suggests that the sentence only applies if X contains both nans and Nones. However, I think the following will be clearer:

Suggested change
If a categorical feature uses `None` *and* `np.nan` as a missing
value indicator, the inverse_transform will map `np.nan` to `np.nan`.
Nans will be inverse-transformed to Nans. Since both None and NaNs
are mapped to NaN during `transform`, this means that
`inverse_transform(transform([[None], [NaN]])) == [[NaN], [NaN]]`

Member

Actually my suggested sentence isn't even correct due to the bug noted in #19069 (review).

I still think this sentence should be improved, which we can do once the bug is fixed

@thomasjpfan
Member Author

inverse-transforming None currently doesn't fail but it should: None is not a valid transformed value, so we should error in this case since we already properly error when one tries to inverse-transform a value that is out of range (e.g. inverse_transform(4) when there are only 3 values in fit properly fails).

Implementation-wise, this would require explicitly checking for None before check_array gets a chance to convert None to np.nan.
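To illustrate the point (plain NumPy, nothing scikit-learn specific): building a float array from data that contains None silently turns None into nan, so after the conversion the two can no longer be told apart.

```python
import numpy as np

# NumPy converts None to nan when coercing to a float dtype, so a check for
# None has to happen before the array is validated/converted.
arr = np.array([1.0, None], dtype=float)
print(arr)  # [ 1. nan]
```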

The inverse-transformed value of nan and None depends on whether NaNs were seen during fit. This is quite unexpected as the following 3 snippets illustrate:

Assuming we error when we inverse-transform None, I feel like this would be expected behavior:

  • Only None seen in fit: inverse_transform(np.nan) = None
  • Only np.nan seen in fit: inverse_transform(np.nan) = np.nan
  • Both None and np.nan seen in fit: inverse_transform(np.nan) = np.nan (this was chosen arbitrarily)
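That proposed mapping can be written down as a tiny helper (hypothetical, for illustration only; `inverse_of_nan` is not part of the PR):

```python
import numpy as np

def inverse_of_nan(categories):
    """Proposed inverse of an encoded nan, given the fitted categories:
    np.nan wins whenever it was seen in fit; None is used only when it was
    the sole missing indicator; otherwise nan is not a valid encoded value."""
    has_nan = any(isinstance(v, float) and np.isnan(v) for v in categories)
    has_none = any(v is None for v in categories)
    if has_nan:
        return np.nan  # covers both "only np.nan" and "None and np.nan"
    if has_none:
        return None    # only None was seen in fit
    raise ValueError("nan is not a valid encoded value for these categories")
```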

Stepping back, I think it would be good to remove a portion of this PR to get something useful merged. We can either:

  1. Only support np.nan as you @NicolasHug suggested. I think this is the more common missing value indicator.
  2. Disable inverse_transform when there are any missing value indicators during fit.

WDYT @NicolasHug ?

@NicolasHug
Member

I would go with option 1 since it's probably going to address 99.9% of the use-cases and it would simplify the logic

If we choose to support None, I think we should just have inverse_transform(nan) == nan no matter what was seen in fit. This allows outputting a float dtype instead of an object dtype. And instead of erroring for inverse_transform(None), we can specifically document that a) this makes no sense and should not be done and b) the returned value is undefined and may change / fail in the future?

@thomasjpfan
Member Author

Error is unrelated.

This PR has been updated to remove support for None. It will raise an error when None is seen during fit.


@jnothman jnothman left a comment


Otherwise LGTM

:mod:`sklearn.naive_bayes`
..........................

- |API| The attribute ``sigma_`` is now deprecated in
Member

merge error?

# None and np.nan will only appear at the end of feature_cats
if None in categories_for_idx[-2:]:
raise ValueError(
"None is an unsupported missing value indicator and was "
Member

FWIW, this is backwards incompatible. We currently transform None as another category label. Are we sure we want to do this?

Member Author

Ah I did not know None was already supported. This kind of changes things a little.

Member

None support doesn't seem to be tested nor documented anywhere though. If we choose to keep the previous behaviour we probably want to add tests and docs

X_trans = X_int.astype(self.dtype, copy=False)

for cat_idx, categories_for_idx in enumerate(self.categories_):
missing_indices = [i for i, v in enumerate(categories_for_idx)
if is_scalar_nan(v) or v is None]
Member

is the `or v is None` a vestige of a previous proposal?

for cat_idx, categories_for_idx in enumerate(self.categories_):
missing_indices = [i for i, v in enumerate(categories_for_idx)
if is_scalar_nan(v) or v is None]
for missing_idx in missing_indices:
Member

Ordinarily we would expect len(missing_indices) <= 1. I look at this code and am immediately inclined to reduce its complexity, but if len(missing_indices) is rarely more than 1 then it's fine.

@@ -833,6 +861,17 @@ def inverse_transform(self, X):

for i in range(n_features):
labels = X[:, i].astype('int64', copy=False)

# place values of X[:, i] that were nan with actual indices
Member

Suggested change
# place values of X[:, i] that were nan with actual indices
# replace values of X[:, i] that were nan with actual indices

def test_ordinal_encoder_handle_missing_and_unknown(
X, expected_X_trans, X_test
):
"""Test the interaction between handle_missing and handle_unknown"""
Member

handle_missing isn't a thing

Suggested change
"""Test the interaction between handle_missing and handle_unknown"""
"""Test the interaction between missing values and handle_unknown"""

@thomasjpfan
Member Author

From #19069 (comment), I see that None is already supported and encoded by OrdinalEncoder on main:

from sklearn.preprocessing import OrdinalEncoder
X = [["a"], ["b"], [None]]
OrdinalEncoder().fit_transform(X)
# array([[0.],
#        [1.],
#        [2.]])

Since we already encode None, I think this PR can encode np.nan as well rather than passing it through. We can treat np.nan and None as distinct categories, which is consistent with OneHotEncoder.

WDYT @NicolasHug @jnothman @adrinjalali ?

@NicolasHug
Member

Is this "support" for None something that was done by design, or something that just happens to be possible because we didn't put the right guards in place?

It seems to me that it could be the latter, as there are no docs nor tests for this (I may have missed something). Also, we used to explicitly not support NaNs, which I assume is because we don't know where to map them (should it be the biggest number, the lowest, something in the middle? The choice is pretty arbitrary). So if we explicitly don't support NaNs, why support None? It doesn't make much sense to me to map None to the highest value - if anything, there should be a way to specify the encoded value. So this is fine to break for me, but I'm happy to consider other opinions.

Since we already encode None, I think this PR can encode np.nan as well rather than passing it through. We can treat np.nan and None as distinct categories, which is consistent with OneHotEncoder.

This would not let the final estimator (e.g. HistGradientBoosting) natively handle the NaNs, which was the original point of #18968 (granted, the behaviour would be almost equivalent as NaNs are treated as categories, but still).

which is consistent with OneHotEncoder.

It's mainly a no-brainer in OneHotEncoder because order doesn't matter there. Things are more subtle with the OrdinalEncoder, so I'm not sure consistency is something that's relevant in this case.

@thomasjpfan
Member Author

Also, we used to explicitly not support NaNs, which I assume is because we don't know where to map them (should it be the biggest number, the lowest, something in the middle? The choice is pretty arbitrary).

The way we encode None now is completely arbitrary and I suspect it was not by design. I am wondering whether this arbitrary behavior is currently being used and deemed useful.

It's mainly a no-brainer in OneHotEncoder because order doesn't matter there.

I am thinking of the case where someone is using OneHotEncoder with None and finds out they need to do more preprocessing on the Nones for OrdinalEncoder.

In the meantime, I updated the PR to still reject None and to handle the passthrough more cleanly.

@NicolasHug
Member

I am wondering whether this arbitrary behavior is currently being used and deemed useful.

@jnothman what would be your thoughts on this?

@jnothman
Member

jnothman commented Feb 22, 2021 via email

@adrinjalali
Member

One could argue that the user might use np.nan not as a missing value but as an artifact of some calculation, and instead use None as the missing indicator. This distinction may be quite useful to users. But on the other hand, if the user is doing some computation, they're probably doing it in a container that doesn't support both np.nan and None, unless they use the new pandas types, which support missing values explicitly, albeit not through None. So I'm not really sure.

@thomasjpfan
Member Author

thomasjpfan commented Feb 23, 2021

I updated this PR to keep the current None encoding behavior while also passing through np.nan. This way, this PR is backward compatible.

We can decide on whether None should throw an error by adding f78c25c (#19069) back in a follow-up PR.
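In rough strokes, the behavior this update keeps can be sketched as follows (a toy model with a hypothetical `encode_column` helper, not the scikit-learn implementation):

```python
import numpy as np

def encode_column(values, categories):
    """Toy sketch of the merged behavior: None remains an ordinary category
    (backward compatible), while np.nan passes through as np.nan."""
    out = np.empty(len(values), dtype=float)
    for i, v in enumerate(values):
        if isinstance(v, float) and np.isnan(v):
            out[i] = np.nan               # nan passes through untouched
        else:
            out[i] = categories.index(v)  # None is encoded like any category
    return out

print(encode_column(["a", "b", None, np.nan], ["a", "b", None]))
# [ 0.  1.  2. nan]
```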


@jnothman jnothman left a comment


Okay.

Maybe the behaviour with None should be tested rather than left implicit!

LGTM

@@ -775,6 +776,23 @@ def fit(self, X, y=None):
f"values already used for encoding the "
f"seen categories.")

# stories the missing indices per category
Member

Suggested change
# stories the missing indices per category
# stores the missing indices per category

@jnothman
Member

Thanks @thomasjpfan!


Successfully merging this pull request may close these issues.

Let NaNs pass through in OrdinalEncoder
4 participants