Confusing error message in OrdinalEncoder with None-encoded missing values #16703

Closed
@ogrisel

Description

Sister issue for #16702 (OneHotEncoder)

Code to reproduce

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder


df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
OrdinalEncoder().fit(df)

Observed result

Got: TypeError: '<' not supported between instances of 'str' and 'NoneType'

Full traceback:

TypeError                                 Traceback (most recent call last)
~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    111         try:
--> 112             res = _encode_python(values, uniques, encode)
    113         except TypeError:

~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode_python(values, uniques, encode)
     59     if uniques is None:
---> 60         uniques = sorted(set(values))
     61         uniques = np.array(uniques, dtype=values.dtype)

TypeError: '<' not supported between instances of 'str' and 'NoneType'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-35-eb249f0af3d2> in <module>
      4 
      5 df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
----> 6 OrdinalEncoder().fit(df)

~/code/scikit-learn/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    673         self
    674         """
--> 675         self._fit(X)
    676 
    677         return self

~/code/scikit-learn/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
     84             Xi = X_list[i]
     85             if self.categories == 'auto':
---> 86                 cats = _encode(Xi)
     87             else:
     88                 cats = np.array(self.categories[i], dtype=Xi.dtype)

~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    112             res = _encode_python(values, uniques, encode)
    113         except TypeError:
--> 114             raise TypeError("argument must be a string or number")
    115         return res
    116     else:

TypeError: argument must be a string or number
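The first traceback points at the root cause: `sorted(set(values))` has to compare a `str` with `None`, and Python 3 defines no ordering between the two. A minimal sketch in plain Python, independent of scikit-learn (the helper name `sorted_uniques_or_error` is made up for illustration):

```python
def sorted_uniques_or_error(values):
    """Mimic the sorted(set(values)) step; return the TypeError instead of raising it."""
    try:
        return sorted(set(values))
    except TypeError as exc:
        return exc


# Mixed str / None input: sorting needs a str-vs-None comparison and fails.
result = sorted_uniques_or_error(["a", None, "b", "a"])
print(type(result).__name__, result)

# Homogeneous str input sorts fine.
print(sorted_uniques_or_error(["a", "b", "a"]))
```

The generic "argument must be a string or number" message is raised while handling this first error, which is why the original cause is hard to spot.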

Expected result

A more informative ValueError, for instance:

ValueError: OrdinalEncoder does not accept None typed values. Missing values should be imputed first, for instance using sklearn.impute.SimpleImputer.
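One way such a message could be produced is an explicit check before the sorting step. The sketch below is only an illustration of the idea, not scikit-learn code: `check_no_none` and its `estimator_name` parameter are hypothetical, and the message points at `sklearn.impute`, which is where `SimpleImputer` lives.

```python
def check_no_none(values, estimator_name="OrdinalEncoder"):
    """Hypothetical helper: fail early with an actionable message if None is present."""
    if any(value is None for value in values):
        raise ValueError(
            f"{estimator_name} does not accept None typed values. "
            "Missing values should be imputed first, for instance using "
            "sklearn.impute.SimpleImputer."
        )
    return values


try:
    check_no_none(["a", None, "b", "a"])
except ValueError as exc:
    print(exc)
```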

We could even include the URL of an FAQ entry or example that shows how to handle a mix of str and None typed values, for instance by applying the following imputer before ordinal encoding:

SimpleImputer(strategy="constant", missing_values=None, fill_value="missing")
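For reference, a minimal sketch of that workaround, chaining the imputer and the encoder with `make_pipeline`. It assumes a scikit-learn version in which `SimpleImputer` accepts `missing_values=None`, and it uses a NumPy object array rather than the DataFrame from the report so that the `None` entries are guaranteed to reach the imputer unchanged.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Single categorical column containing a None-encoded missing value.
X = np.array([["a"], [None], ["b"], ["a"]], dtype=object)

# Replace None with the placeholder category "missing", then encode.
encoder = make_pipeline(
    SimpleImputer(strategy="constant", missing_values=None, fill_value="missing"),
    OrdinalEncoder(),
)
encoded = encoder.fit_transform(X)

# The placeholder becomes a regular category alongside "a" and "b".
print(encoder[-1].categories_)
print(encoded.ravel())
```

Treating missingness as its own category is one design choice; imputing the most frequent value (`strategy="most_frequent"`) is an alternative when a separate category is not wanted.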
