Closed
Description
Describe the bug
OneHotEncoder creates non-unique column names when the values None (NoneType) and "None" (str) are both present in the same column.
This behavior causes downstream problems (e.g. ColumnTransformer raising because of non-unique column names).
It's not 100% clear to me what is expected of a transformer.
Should every transformer emit unique names for features?
Why does the ColumnTransformer raise if there are non-unique names?
Steps/Code to Reproduce
With OneHotEncoder alone
from sklearn.preprocessing import OneHotEncoder
X = [['None'], [None]]
t = OneHotEncoder().fit(X)
feature_names = t.get_feature_names_out()
# ['x0_None' 'x0_None']
assert len(feature_names) == len(set(feature_names))
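The collision can be reproduced without scikit-learn. The sketch below assumes that feature names are built by string-formatting each category as `f"{input_feature}_{category}"`, which matches the observed output but is not taken from the library source; `duplicated` is a hypothetical helper for listing the colliding names.

```python
from collections import Counter

# Assumption: names are built roughly as f"{input_feature}_{category}",
# so None and "None" both render as the text "None".
categories = ["None", None]
names = [f"x0_{category}" for category in categories]


def duplicated(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


print(names)
print(duplicated(names))
```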
With ColumnTransformer
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder
t = make_column_transformer(
    (OneHotEncoder(), make_column_selector("x0")),
    verbose_feature_names_out=False,
)
t.fit(pd.DataFrame({"x0": [None, "None"]}))
print(t.get_feature_names_out(["x0"]))
There's a subtle bug in the ColumnTransformer as well. If verbose_feature_names_out is True, it skips the uniqueness check and emits non-unique names without raising.
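To illustrate why prefixing does not help here, this is a minimal sketch of a uniqueness check applied after prefixing. `prefixed_names` and `check_unique` are hypothetical helpers, not scikit-learn functions, and the `name__feature` prefix format is assumed from the verbose output convention.

```python
def prefixed_names(transformer_name, feature_names):
    """Prefix each name, e.g. 'onehotencoder__x0_None' (assumed format)."""
    return [f"{transformer_name}__{name}" for name in feature_names]


def check_unique(names):
    """Raise ValueError if any name appears more than once."""
    seen, dupes = set(), []
    for name in names:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    if dupes:
        raise ValueError(f"Output feature names: {dupes} are not unique.")
    return names


# Both columns come from the same transformer, so they get the same prefix
# and stay identical after prefixing.
verbose = prefixed_names("onehotencoder", ["x0_None", "x0_None"])
try:
    check_unique(verbose)
except ValueError as exc:
    message = str(exc)
else:
    message = None
print(message)
```

Because the duplicates originate inside a single transformer, a check of this kind would need to run in both modes to catch them.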
Expected Results
I'd expect distinct feature names to be generated.
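One possible way to get distinct names, shown only as a sketch: append the category's type name when two categories have the same string form. The naming scheme and the `disambiguated_names` helper are hypothetical and are not what scikit-learn does.

```python
from collections import Counter


def disambiguated_names(feature, categories):
    """Build feature names, tagging a category with its type only on collision."""
    labels = [str(category) for category in categories]
    counts = Counter(labels)
    return [
        f"{feature}_{label}[{type(category).__name__}]"
        if counts[label] > 1
        else f"{feature}_{label}"
        for category, label in zip(categories, labels)
    ]


print(disambiguated_names("x0", ["None", None]))
# ['x0_None[str]', 'x0_None[NoneType]']
```

Names are left unchanged when there is no collision, so existing feature names would stay stable in the common case.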
Actual Results
First code snippet
['x0_None' 'x0_None']
Second code snippet
Traceback (most recent call last):
  File "src/ohe_bug.py", line 10, in <module>
    print(t.get_feature_names_out(["x0"]))
  File "/home/pilotuser/.local/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 518, in get_feature_names_out
    raise ValueError(
ValueError: Output feature names: ['x0_None'] are not unique. Please set verbose_feature_names_out=True to add prefixes to feature names
Versions
System:
python: 3.8.12 (default, Sep 28 2021, 19:23:30) [GCC 8.3.0]
executable: /usr/local/bin/python
machine: Linux-4.14.238-182.422.amzn2.x86_64-x86_64-with-glibc2.2.5
Python dependencies:
pip: 21.2.4
setuptools: 46.0.0
sklearn: 1.0.2
numpy: 1.18.5
scipy: 1.5.4
Cython: None
pandas: 1.1.1
matplotlib: 3.2.2
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True