Description
Sister issue for #16702 (OneHotEncoder)
Code to reproduce
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
OrdinalEncoder().fit(df)
Observed result
Got: TypeError: '<' not supported between instances of 'str' and 'NoneType'
Full traceback:
TypeError Traceback (most recent call last)
~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
111 try:
--> 112 res = _encode_python(values, uniques, encode)
113 except TypeError:
~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode_python(values, uniques, encode)
59 if uniques is None:
---> 60 uniques = sorted(set(values))
61 uniques = np.array(uniques, dtype=values.dtype)
TypeError: '<' not supported between instances of 'str' and 'NoneType'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-35-eb249f0af3d2> in <module>
4
5 df = pd.DataFrame({"cat_feature": ["a", None, "b", "a"]})
----> 6 OrdinalEncoder().fit(df)
~/code/scikit-learn/sklearn/preprocessing/_encoders.py in fit(self, X, y)
673 self
674 """
--> 675 self._fit(X)
676
677 return self
~/code/scikit-learn/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
84 Xi = X_list[i]
85 if self.categories == 'auto':
---> 86 cats = _encode(Xi)
87 else:
88 cats = np.array(self.categories[i], dtype=Xi.dtype)
~/code/scikit-learn/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
112 res = _encode_python(values, uniques, encode)
113 except TypeError:
--> 114 raise TypeError("argument must be a string or number")
115 return res
116 else:
TypeError: argument must be a string or number
Expected result
A more informative ValueError, for instance:
ValueError: OrdinalEncoder does not accept None typed values. Missing values should be imputed first, for instance using sklearn.impute.SimpleImputer.
Maybe we could even include the URL of an FAQ entry or example that shows how to deal with a mix of str and None typed values, for instance by applying the following prior to ordinal encoding:
SimpleImputer(strategy="constant", missing_values=None, fill_value="missing")
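For illustration, here is a minimal sketch of how that imputation step could be chained with the encoder. It assumes an object-dtype NumPy array in place of the DataFrame from the reproducer (so that None is stored as-is regardless of the pandas string dtype in use), and the use of make_pipeline is only one possible way to combine the two steps, not something this issue prescribes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumption: an object-dtype array stands in for the "cat_feature" column,
# so the missing entry is a literal None rather than NaN.
X = np.array([["a"], [None], ["b"], ["a"]], dtype=object)

# Replace None with the placeholder string "missing", then ordinal-encode.
pipeline = make_pipeline(
    SimpleImputer(strategy="constant", missing_values=None, fill_value="missing"),
    OrdinalEncoder(),
)
encoded = pipeline.fit_transform(X)

# The placeholder becomes an ordinary category alongside "a" and "b".
categories = pipeline.named_steps["ordinalencoder"].categories_[0]
print(categories)
print(encoded.ravel())
```

With this approach the placeholder "missing" is treated as a regular category and receives its own integer code, so downstream models cannot distinguish it from a genuine level unless they are told about it.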