feature_names_in_ is not reset after calling fit() again with a NumPy array #21383

Closed
@ageron

Describe the bug

In Scikit-Learn 1.0.0, if you fit a OneHotEncoder with a Pandas DataFrame, it records the feature names in feature_names_in_. Great! But if you fit it again with a NumPy array, the old feature names are not removed, even when the number of features in the NumPy array does not match the number of feature names in the old feature_names_in_.

This is probably true as well for other estimators, but I haven't tested it.

I feel like this could be a source of confusion and bugs. I believe feature_names_in_ should always be deleted when fit() is called (and possibly replaced with a new one if a DataFrame is passed to fit()). At the very least, it should be deleted if the number of features is different.
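To make the proposed behavior concrete, here is a minimal sketch of a fit() that always discards previously recorded names before deciding whether to store new ones. This is not scikit-learn's code: `NameTrackingEstimator` and `FakeFrame` are hypothetical names invented for illustration, and `FakeFrame` only stands in for a DataFrame by exposing a `columns` attribute.

```python
class NameTrackingEstimator:
    """Hypothetical toy estimator illustrating the proposed reset (not scikit-learn code)."""

    def fit(self, X, y=None):
        # Always drop names recorded by a previous fit() first.
        if hasattr(self, "feature_names_in_"):
            del self.feature_names_in_

        # Record names only when the input carries column labels.
        columns = getattr(X, "columns", None)
        if columns is not None:
            self.feature_names_in_ = list(columns)
            self.n_features_in_ = len(self.feature_names_in_)
        else:
            # Assumes X is a non-empty sequence of rows.
            self.n_features_in_ = len(X[0])
        return self


class FakeFrame:
    """Hypothetical stand-in for a DataFrame: it only exposes .columns."""

    def __init__(self, columns):
        self.columns = columns


est = NameTrackingEstimator()
est.fit(FakeFrame(["cities", "eyes"]))
names_after_first_fit = est.feature_names_in_

# Refitting on unnamed data removes the stale attribute.
est.fit([["Orange"], ["Banana"], ["Apple"], ["Banana"]])
has_names_after_second_fit = hasattr(est, "feature_names_in_")
```

With this approach the attribute never outlives the fit() call that created it, regardless of whether the number of features changed.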

Steps/Code to Reproduce

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"cities": ["Paris", "Tokyo", "Paris", "Auckland"],
                   "eyes": ["Blue", "Brown", "Green", "Blue"]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])
encoder = OneHotEncoder()
encoder.fit(df)
assert list(encoder.feature_names_in_) == ["cities", "eyes"]

encoder.fit([["Orange"], ["Banana"], ["Apple"], ["Banana"]])
assert not hasattr(encoder, "feature_names_in_")  # fails: still equal to ["cities", "eyes"]!

Expected Results

I expect feature_names_in_ to be removed after the second call to fit() with a NumPy array (or, as in the snippet above, a plain nested list), especially since the number of features has changed.

Actual Results

>>> encoder.feature_names_in_
array(['cities', 'eyes'], dtype=object)

Versions

System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.0.1
threadpoolctl: 3.0.0

Built with OpenMP: True
