ENH Adds nan passthrough in OrdinalEncoder #19069

Merged
25 commits merged into main on Feb 24, 2021

Conversation

thomasjpfan
Member

Reference Issues/PRs

Resolves #18968

What does this implement/fix? Explain your changes.

  1. Both None and np.nan will be passed through as np.nan.
  2. If both None and np.nan are in categories_[i], then inverse_transform will use np.nan as the inverse of either of them.
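The intended transform behavior can be sketched in a few lines (a toy illustration only, not the actual PR code; `passthrough_encode` is a hypothetical name):

```python
import numpy as np

def passthrough_encode(column, categories):
    """Toy sketch of the proposed behavior: encode one object column,
    passing both None and np.nan through as np.nan in the float output."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and np.isnan(v))

    out = np.empty(len(column), dtype=float)
    for i, v in enumerate(column):
        # Missing indicators are not encoded; they become nan directly.
        out[i] = np.nan if is_missing(v) else categories.index(v)
    return out
```

For example, `passthrough_encode(["a", "b", None, np.nan], ["a", "b"])` returns `[0., 1., nan, nan]`.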

@NicolasHug
Member

Thanks for working on this. What's the rationale for supporting None?


@adrinjalali adrinjalali left a comment


Thanks @thomasjpfan

Comment on lines 797 to 799
f"There are missing values in feature {cat_idx}. With "
"handle_missing=passthrough, OrdinalEncoder's dtype "
"parameter must be float")
Member

Why only f?

Member Author

Because they are going to be mapped to np.nan which can not be stored in an integer array.

Member

I guess we could support int once we support pandas data types here :D

if self.handle_missing == 'passthrough':
force_all_finite = 'allow-nan'
else:
force_all_finite = True
Member

also check for an invalid handle_missing value? Probably worth putting this logic in a method since it's the same code.

X_trans = X_int.astype(self.dtype, copy=False)

if self.handle_missing == 'passthrough':
for cat_idx, cats in enumerate(self.categories_):
Member

is it cat_idx, cat or cat_idx, cats?

@jnothman
Member

As noted on the issue, I see no clear justification for needing a parameter to do this. Why not do it by default?

@thomasjpfan thomasjpfan changed the title from "ENH Adds handle_missing=passthrough in OrdinalEncoder" to "ENH Adds nan passthrough in OrdinalEncoder" on Jan 20, 2021
@thomasjpfan
Member Author

As noted on the issue, I see no clear justification for needing a parameter to do this. Why not do it by default?

Updated PR to no longer need a parameter.

Thanks for working on this. What's the rationale for supporting None?

Currently, OneHotEncoder supports both None and np.nan, although I have not really encountered None as a representation of missing values when working in pandas.

@@ -482,6 +482,15 @@ scikit-learn estimators, as these expect continuous input, and would interpret
the categories as being ordered, which is often not desired (i.e. the set of
browsers was ordered arbitrarily).

:class:`OrdinalEncoder` will also passthrough missing values::
Member

Please also document None

Comment on lines 818 to 820
If there are categories that are missing, then
both `None` and `np.nan` are inverse transformed as `np.nan`, i.e.
`inverse_transform(transform([None])) == np.nan`.
Member

It seems that this snippet doesn't illustrate the above textual description:

transform([None]) is [nan] so all this is saying is that inverse_transform([nan]) == [nan], as expected

Instead, the text above describes the fact that inverse_transform([None]) == [nan] but:

  • this doesn't seem to be tested (I might have missed something)
  • I find that quite unexpected since transform() is unable to produce None as output. So I would expect inverse_transform to fail if we give it None, or any other unexpected value (like a floating value)

Member Author

I adjusted the wording here:

If a categorical feature uses None and np.nan as a missing value indicator, the inverse_transform will map np.nan to np.nan.

This behavior is tested in test_ordinal_encoder_passthrough_missing_values_object. (I added a comment there to explicitly say this.)

Base automatically changed from master to main January 22, 2021 10:53

@NicolasHug NicolasHug left a comment


@thomasjpfan I think there are a few inconsistencies in the current behavior:

  • inverse-transforming None currently doesn't fail but it should: None is not a valid transformed value, so we should error in this case, since we already properly error when one tries to inverse-transform a value that is out of range (e.g. inverse_transform(4) when there are only 3 values in fit properly fails).
  • The inverse-transformed value of nan and None depends on whether NaNs were seen during fit. This is quite unexpected, as the following 3 snippets illustrate:
oe.fit_transform([[0], [1], [np.nan], [None]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan],
        [nan]]),
 array([[nan],
        [nan]], dtype=object))
oe.fit_transform([[0], [1], [np.nan]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan]]),
 array([[nan],
        [nan]]))
oe.fit_transform([[0], [1], [None]]), oe.inverse_transform(np.array([[np.nan], [None]], dtype=object))
# gives
(array([[ 0.],
        [ 1.],
        [nan]]),
 array([[None],
        [None]], dtype=object))

Note how both NaN and None are inverse-transformed to None instead of nan when NaNs aren't present in fit. (Again, I think we should not allow inverse-transforming None, but in any case the inconsistency for NaN should be fixed.)

Also, inverse-transforming NaN or None when neither of them is present in fit fails with an index-error message (which could ideally be improved).

Comment on lines 485 to 487
:class:`OrdinalEncoder` will also passthrough missing values. Note that
missing values indicated by `np.nan` or `None` will both be mapped to
`np.nan`::
Member

Suggested change
:class:`OrdinalEncoder` will also passthrough missing values. Note that
missing values indicated by `np.nan` or `None` will both be mapped to
`np.nan`::
:class:`OrdinalEncoder` will also passthrough missing values. Missing values
can be indicated by `np.nan` or `None`, which will both be mapped to
`np.nan`::

Comment on lines 818 to 819
If a categorical feature uses `None` *and* `np.nan` as a missing
value indicator, the inverse_transform will map `np.nan` to `np.nan`.
Member

The current wording (using "and") suggests that the sentence only applies if X contains both nans and Nones. However, I think the following will be clearer:

Suggested change
If a categorical feature uses `None` *and* `np.nan` as a missing
value indicator, the inverse_transform will map `np.nan` to `np.nan`.
Nans will be inverse-transformed to Nans. Since both None and NaNs
are mapped to NaN during `transform`, this means that
`inverse_transform(transform([[None], [NaN]])) == [[NaN], [NaN]]`

Member

Actually my suggested sentence isn't even correct due to the bug noted in #19069 (review).

I still think this sentence should be improved, which we can do once the bug is fixed

@thomasjpfan
Member Author

inverse-transforming None currently doesn't fail but it should: None is not a valid transformed value, so we should error in this case since we already properly error when one tries to inverse-transform a value that is out of range (e.g. inverse_transform(4) when there are only 3 values in fit properly fails).

Implementation-wise, this would require explicitly checking for None before check_array gets a chance to convert None to np.nan.
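To illustrate the point (plain NumPy, nothing scikit-learn specific): building a float array from data that contains None silently turns None into nan, so after the conversion the two can no longer be told apart.

```python
import numpy as np

# NumPy converts None to nan when coercing to a float dtype, so a check for
# None has to happen before the array is validated/converted.
arr = np.array([1.0, None], dtype=float)
print(arr)  # [ 1. nan]
```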

The inverse-transformed value of nan and None depends on whether NaNs were seen during fit. This is quite unexpected as the following 3 snippets illustrate:

Assuming we error when we inverse-transform None, I feel like this would be expected behavior:

  • Only None seen in fit: inverse_transform(np.nan) = None
  • Only np.nan seen in fit: inverse_transform(np.nan) = np.nan
  • Both None and np.nan seen in fit: inverse_transform(np.nan) = np.nan (this was chosen arbitrarily)
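That proposed mapping can be written down as a tiny helper (hypothetical, for illustration only; `inverse_of_nan` is not part of the PR):

```python
import numpy as np

def inverse_of_nan(categories):
    """Proposed inverse of an encoded nan, given the fitted categories:
    np.nan wins whenever it was seen in fit; None is used only when it was
    the sole missing indicator; otherwise nan is not a valid encoded value."""
    has_nan = any(isinstance(v, float) and np.isnan(v) for v in categories)
    has_none = any(v is None for v in categories)
    if has_nan:
        return np.nan  # covers both "only np.nan" and "None and np.nan"
    if has_none:
        return None    # only None was seen in fit
    raise ValueError("nan is not a valid encoded value for these categories")
```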

Stepping back, I think it would be good to remove a portion of this PR to get something useful merged. We can either:

  1. Only support np.nan as you @NicolasHug suggested. I think this is the more common missing value indicator.
  2. Disable inverse_transform when there are any missing value indicators during fit.

WDYT @NicolasHug ?

@NicolasHug
Member

I would go with option 1 since it's probably going to address 99.9% of the use-cases and it would simplify the logic

If we choose to support None, I think we should just have inverse_transform(nan) == nan no matter what was seen in fit. This allows outputting a float dtype instead of an object dtype. And instead of erroring for inverse_transform(None), we can specifically document that a) this makes no sense and should not be done and b) the returned value is undefined and may change / fail in the future?

@thomasjpfan
Member Author

Error is unrelated.

This PR has been updated to remove support for None. It will raise an error when None is seen during fit.


@jnothman jnothman left a comment


Otherwise LGTM

:mod:`sklearn.naive_bayes`
..........................

- |API| The attribute ``sigma_`` is now deprecated in
Member

merge error?

# None and np.nan will only appear at the end of feature_cats
if None in categories_for_idx[-2:]:
raise ValueError(
"None is an unsupported missing value indicator and was "
Member

FWIW, this is backwards incompatible. We currently transform None as another category label. Are we sure we want to do this?

Member Author

Ah I did not know None was already supported. This kind of changes things a little.

Member

None support doesn't seem to be tested nor documented anywhere though. If we choose to keep the previous behaviour we probably want to add tests and docs

X_trans = X_int.astype(self.dtype, copy=False)

for cat_idx, categories_for_idx in enumerate(self.categories_):
missing_indices = [i for i, v in enumerate(categories_for_idx)
if is_scalar_nan(v) or v is None]
Member

is the `or v is None` a vestige of a previous proposal?

for cat_idx, categories_for_idx in enumerate(self.categories_):
missing_indices = [i for i, v in enumerate(categories_for_idx)
if is_scalar_nan(v) or v is None]
for missing_idx in missing_indices:
Member

Ordinarily we would expect len(missing_indices) <= 1. I look at this code and am immediately inclined to reduce its complexity, but if len(missing_indices) is rarely more than 1 then it's fine.

@@ -833,6 +861,17 @@ def inverse_transform(self, X):

for i in range(n_features):
labels = X[:, i].astype('int64', copy=False)

# place values of X[:, i] that were nan with actual indices
Member

Suggested change
# place values of X[:, i] that were nan with actual indices
# replace values of X[:, i] that were nan with actual indices

def test_ordinal_encoder_handle_missing_and_unknown(
X, expected_X_trans, X_test
):
"""Test the interaction between handle_missing and handle_unknown"""
Member

handle_missing isn't a thing

Suggested change
"""Test the interaction between handle_missing and handle_unknown"""
"""Test the interaction between missing values and handle_unknown"""

@thomasjpfan
Member Author

From #19069 (comment), I see that None is already supported and encoded by OrdinalEncoder on main:

from sklearn.preprocessing import OrdinalEncoder
X = [["a"], ["b"], [None]]
OrdinalEncoder().fit_transform(X)
# array([[0.],
#        [1.],
#        [2.]])

Since we already encode None, I think this PR can encode np.nan as well rather than passing it through. We can treat np.nan and None as distinct categories, which is consistent with OneHotEncoder.

WDYT @NicolasHug @jnothman @adrinjalali ?

@NicolasHug
Member

Is this "support" for None something that was done by design, or something that just happens to be possible because we didn't put the right guards in place?

It seems to me that it could be the latter, as there are no docs nor tests for this (I may have missed something). Also, we used to explicitly not support NaNs, which I assume is because we don't know where to map them (should it be the biggest number, the lowest, something in the middle? The choice is pretty arbitrary). So if we explicitly don't support NaNs, why support None? It doesn't make much sense to me to map None to the highest value - if anything, there should be a way to specify the encoded value. So this is fine to break for me, but I'm happy to consider other opinions.

Since we already encode None, I think this PR can encode np.nan as well rather than passing it through. We can treat np.nan and None as distinct categories, which is consistent with OneHotEncoder.

This would not let the final estimator (e.g. HistGradientBoosting) natively handle the NaNs, which was the original point of #18968 (granted, the behaviour would be almost equivalent as NaNs are treated as categories, but still).

which is consistent with OneHotEncoder.

It's mainly a no-brainer in OneHotEncoder because order doesn't matter there. Things are more subtle with the OrdinalEncoder, so I'm not sure consistency is something that's relevant in this case.

@thomasjpfan
Member Author

Also, we used to explicitly not support NaNs, which I assume is because we don't know where to map them (should it be the biggest number, the lowest, something in the middle? The choice is pretty arbitrary).

The way we encode None now is completely arbitrary and I suspect it was not by design. I am wondering whether this arbitrary behavior is currently being used and deemed useful.

It's mainly a no-brainer in OneHotEncoder because order doesn't matter there.

I am thinking of the case where someone is using OneHotEncoder with None and finds out they need to do more preprocessing on the Nones for OrdinalEncoder.

In the meantime, I updated the PR to still reject None and to handle the passthrough more cleanly.

@NicolasHug
Member

I am wondering whether this arbitrary behavior is currently being used and deemed useful.

@jnothman what would be your thoughts on this?

@jnothman
Member

jnothman commented Feb 22, 2021 via email

@adrinjalali
Member

One could argue that the user might use np.nan not as a missing value but as an artifact of some calculation, and instead use None as the missing indicator. This distinction may be quite useful to users. But on the other hand, if the user is doing some computation, they're probably doing it in a container that doesn't support both np.nan and None, unless they use the new pandas types, which support missing values explicitly, albeit not through None. So I'm not really sure.

@thomasjpfan
Member Author

thomasjpfan commented Feb 23, 2021

I updated this PR to keep the current None encoding behavior while also passing through np.nan. This way, this PR is backward compatible.

We can decide on whether None should throw an error by adding f78c25c (#19069) back in a follow-up PR.
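In rough strokes, the behavior this update keeps can be sketched as follows (a toy model with a hypothetical `encode_column` helper, not the scikit-learn implementation):

```python
import numpy as np

def encode_column(values, categories):
    """Toy sketch of the merged behavior: None remains an ordinary category
    (backward compatible), while np.nan passes through as np.nan."""
    out = np.empty(len(values), dtype=float)
    for i, v in enumerate(values):
        if isinstance(v, float) and np.isnan(v):
            out[i] = np.nan               # nan passes through untouched
        else:
            out[i] = categories.index(v)  # None is encoded like any category
    return out

print(encode_column(["a", "b", None, np.nan], ["a", "b", None]))
# [ 0.  1.  2. nan]
```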


@jnothman jnothman left a comment


Okay.

Maybe the behaviour with None should be tested rather than left implicit!

LGTM

@@ -775,6 +776,23 @@ def fit(self, X, y=None):
f"values already used for encoding the "
f"seen categories.")

# stories the missing indices per category
Member

Suggested change
# stories the missing indices per category
# stores the missing indices per category

@jnothman
Member

Thanks @thomasjpfan!


Successfully merging this pull request may close these issues.

Let NaNs pass through in OrdinalEncoder
4 participants