Description
Describe the bug
In Scikit-Learn 1.0.0, if you fit a OneHotEncoder with a Pandas DataFrame, it records the feature names in feature_names_in_. Great! But if you fit it again with a NumPy array, the old feature names are not removed, even when the number of features in the NumPy array does not match the number of feature names in the old feature_names_in_.
This is probably true as well for other estimators, but I haven't tested it.
I feel like this could be a source of confusion and bugs. I believe feature_names_in_ should always be deleted when fit() is called (and possibly replaced with a new one if a DataFrame is passed to fit()). At the very least, it should be deleted if the number of features is different.
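The proposed behavior could be sketched roughly as follows. This is only a toy illustration under stated assumptions, not scikit-learn's implementation: ToyEstimator and LabeledTable are hypothetical names, and checking for a columns attribute is an assumed stand-in for however the library detects DataFrame input.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator; not scikit-learn code."""

    def fit(self, X):
        # Always drop names recorded by a previous fit() first.
        if hasattr(self, "feature_names_in_"):
            del self.feature_names_in_

        # Record names only when the input exposes column labels,
        # as a pandas DataFrame does through its `columns` attribute.
        columns = getattr(X, "columns", None)
        if columns is not None:
            self.feature_names_in_ = list(columns)
            self.n_features_in_ = len(self.feature_names_in_)
        else:
            # Plain nested sequence: count the entries of the first row.
            self.n_features_in_ = len(X[0])
        return self


class LabeledTable:
    """Hypothetical minimal object with a `columns` attribute."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows


est = ToyEstimator()

# First fit: the input carries column labels, so they are recorded.
est.fit(LabeledTable(["cities", "eyes"], [["Paris", "Blue"]]))
names_after_first_fit = list(est.feature_names_in_)

# Second fit: unlabeled input, so the stale names are removed.
est.fit([["Orange"], ["Banana"]])
has_names_after_second_fit = hasattr(est, "feature_names_in_")
```

With this reset-on-fit approach, the attribute never outlives the data it was derived from, whether or not the number of features changes between fits.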
Steps/Code to Reproduce
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"cities": ["Paris", "Tokyo", "Paris", "Auckland"],
                   "eyes": ["Blue", "Brown", "Green", "Blue"]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])

# First fit on a DataFrame: the column names are recorded.
encoder = OneHotEncoder()
encoder.fit(df)
assert list(encoder.feature_names_in_) == ["cities", "eyes"]

# Second fit on unlabeled data with a single feature.
encoder.fit([["Orange"], ["Banana"], ["Apple"], ["Banana"]])
assert not hasattr(encoder, "feature_names_in_")  # fails: still equal to ["cities", "eyes"]!
Expected Results
I expect feature_names_in_ to be removed after the second call to fit() with a NumPy array, especially since the number of features has changed.
Actual Results
>>> encoder.feature_names_in_
array(['cities', 'eyes'], dtype=object)
Versions
System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.0.1
threadpoolctl: 3.0.0
Built with OpenMP: True