OneHotEncoder creates non unique column names, when values None (NoneType) and "None" (str) are both present #22488

Closed
@mariokostelac

Describe the bug

OneHotEncoder creates non unique column names when values None (NoneType) and "None" (str) are both present in the same column.

This behavior causes downstream problems (e.g. ColumnTransformer raises because the column names are not unique).

It's not 100% clear to me what is expected of a transformer.
Should every transformer emit unique feature names?
Why does ColumnTransformer raise when the names are not unique?

Steps/Code to Reproduce

With OneHotEncoder alone

from sklearn.preprocessing import OneHotEncoder

X = [['None'], [None]]
t = OneHotEncoder().fit(X)
feature_names = t.get_feature_names_out()
print(feature_names)
# ['x0_None' 'x0_None']

# Fails: both categories produce the same feature name.
assert len(feature_names) == len(set(feature_names))
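
The collision can be reproduced without scikit-learn. The sketch below assumes that each output name is built as the input feature name, an underscore, and the string form of the category; that assumption matches the output shown above but is not taken from the library source here. Both helper functions are hypothetical. The second one only illustrates one way the names could differ (using repr()) and is not a proposed fix.

```python
# Assumption: each output name is "<input feature>_<str(category)>".
def naive_feature_names(feature, categories):
    """Hypothetical stand-in for the encoder's naming step."""
    return [f"{feature}_{category}" for category in categories]


def repr_feature_names(feature, categories):
    """Hypothetical alternative: repr() keeps str and NoneType apart."""
    return [f"{feature}_{category!r}" for category in categories]


categories = ["None", None]

# str(None) == "None", so both categories map to the same name.
naive = naive_feature_names("x0", categories)
print(naive)

# repr("None") is "'None'" while repr(None) is "None", so the names differ.
distinct = repr_feature_names("x0", categories)
print(distinct)
```

The root of the ambiguity is that str() erases the type of the category, so any naming scheme based on it cannot tell the two values apart.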

With ColumnTransformer

import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

t = make_column_transformer(
    (OneHotEncoder(), make_column_selector("x0")),
    verbose_feature_names_out=False,
)
t.fit(pd.DataFrame({"x0": [None, "None"]}))
print(t.get_feature_names_out(["x0"]))

ColumnTransformer has a subtle bug as well: when verbose_feature_names_out is True, it does not check the names for uniqueness and emits non-unique names without raising.
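
Until names are checked in both modes, a caller can guard against duplicates explicitly. This is a minimal sketch: find_duplicate_names is a hypothetical helper, not part of scikit-learn, and the example list is illustrative rather than captured output. In practice the argument would be the result of get_feature_names_out().

```python
from collections import Counter


def find_duplicate_names(feature_names):
    """Hypothetical helper: return the sorted names that occur more than once."""
    counts = Counter(str(name) for name in feature_names)
    return sorted(name for name, count in counts.items() if count > 1)


# Illustrative names with a transformer prefix, as in verbose mode.
names = ["onehotencoder__x0_None", "onehotencoder__x0_None", "onehotencoder__x1_a"]

duplicates = find_duplicate_names(names)
if duplicates:
    print(f"Non-unique feature names: {duplicates}")
```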

Expected Results

I'd expect distinct feature names to be generated for None and "None".

Actual Results

First code snippet

['x0_None' 'x0_None']

Second code snippet

Traceback (most recent call last):
  File "src/ohe_bug.py", line 10, in <module>
    print(t.get_feature_names_out(["x0"]))
  File "/home/pilotuser/.local/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 518, in get_feature_names_out
    raise ValueError(
ValueError: Output feature names: ['x0_None'] are not unique. Please set verbose_feature_names_out=True to add prefixes to feature names

Versions

System:
    python: 3.8.12 (default, Sep 28 2021, 19:23:30)  [GCC 8.3.0]
executable: /usr/local/bin/python
   machine: Linux-4.14.238-182.422.amzn2.x86_64-x86_64-with-glibc2.2.5

Python dependencies:
          pip: 21.2.4
   setuptools: 46.0.0
      sklearn: 1.0.2
        numpy: 1.18.5
        scipy: 1.5.4
       Cython: None
       pandas: 1.1.1
   matplotlib: 3.2.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True
