feature_names_in_ is not reset after calling fit() again with a NumPy array #21383

Closed
ageron opened this issue Oct 21, 2021 · 3 comments · Fixed by #21389

@ageron
Contributor

ageron commented Oct 21, 2021

Describe the bug

In Scikit-Learn 1.0.0, if you fit a OneHotEncoder with a Pandas DataFrame, it records the feature names in feature_names_in_. Great! But if you fit it again with a NumPy array, the old feature names are not removed, even when the number of features in the NumPy array does not match the number of feature names in the old feature_names_in_.

This is probably true as well for other estimators, but I haven't tested it.

I feel like this could be a source of confusion and bugs. I believe feature_names_in_ should always be deleted when fit() is called (and possibly replaced with a new one if a DataFrame is passed to fit()). At the very least, it should be deleted if the number of features is different.
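A minimal sketch of the proposed behavior, assuming a toy estimator rather than scikit-learn's actual implementation: `ToyEncoder` is a hypothetical class, and a `SimpleNamespace` with a `columns` attribute stands in for a DataFrame so the example has no dependencies.

```python
from types import SimpleNamespace


class ToyEncoder:
    """Hypothetical estimator used only to illustrate the proposal."""

    def fit(self, X):
        # Always discard names recorded by a previous fit() ...
        if hasattr(self, "feature_names_in_"):
            del self.feature_names_in_

        # ... and record new ones only if the input carries column names.
        columns = getattr(X, "columns", None)
        if columns is not None:
            self.feature_names_in_ = list(columns)
            self.n_features_in_ = len(self.feature_names_in_)
        else:
            self.n_features_in_ = len(X[0])
        return self


# Stand-in for a DataFrame: only its `columns` attribute is used here.
named_X = SimpleNamespace(columns=["cities", "eyes"])
unnamed_X = [["Orange"], ["Banana"], ["Apple"], ["Banana"]]

encoder = ToyEncoder()
encoder.fit(named_X)
print(encoder.feature_names_in_)              # ['cities', 'eyes']
encoder.fit(unnamed_X)
print(hasattr(encoder, "feature_names_in_"))  # False
```

Deleting the attribute unconditionally at the start of `fit()` means the result never depends on whether the new input happens to have the same number of features as the old one.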

Steps/Code to Reproduce

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"cities": ["Paris", "Tokyo", "Paris", "Auckland"],
                   "eyes": ["Blue", "Brown", "Green", "Blue"]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])
encoder = OneHotEncoder()
encoder.fit(df)
assert list(encoder.feature_names_in_) == ["cities", "eyes"]

encoder.fit([["Orange"], ["Banana"], ["Apple"], ["Banana"]])
assert not hasattr(encoder, "feature_names_in_")  # fails: still equal to ["cities", "eyes"]!

Expected Results

I expect feature_names_in_ to be removed after the second call to fit() with a NumPy array, especially since the number of features has changed.

Actual Results

>>> encoder.feature_names_in_
array(['cities', 'eyes'], dtype=object)

Versions

System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.0.1
threadpoolctl: 3.0.0

Built with OpenMP: True

@ogrisel ogrisel added this to the 1.0.1 milestone Oct 21, 2021
@ogrisel
Member

ogrisel commented Oct 21, 2021

Thanks for the report. I think it would be great to include a fix for this in 1.0.1, but I'll let @jeremiedbb and @glemaitre decide since they volunteered to be release managers for 1.0.1.

@jeremiedbb
Member

take

@glemaitre
Member

Yep, it makes sense to solve it. A common test (at least with the mocking transformer) would be nice.
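One way such a common test might look, as a rough sketch only: `check_feature_names_in_reset`, `NamedTable`, and both mock transformers are hypothetical names invented for this illustration and are not part of scikit-learn. The check fits on named input, refits on unnamed input, and reports whether `feature_names_in_` was removed.

```python
class NamedTable:
    """Stand-in for a DataFrame: only the `columns` attribute matters here."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns


class StaleNamesTransformer:
    """Hypothetical mock reproducing the reported bug: names are never cleared."""

    def fit(self, X):
        columns = getattr(X, "columns", None)
        if columns is not None:
            self.feature_names_in_ = list(columns)
        return self


class ResettingTransformer(StaleNamesTransformer):
    """Hypothetical mock with the expected behavior: names are cleared first."""

    def fit(self, X):
        if hasattr(self, "feature_names_in_"):
            del self.feature_names_in_
        return super().fit(X)


def check_feature_names_in_reset(estimator, named_X, unnamed_X):
    """Return True if refitting on unnamed data removes feature_names_in_."""
    estimator.fit(named_X)
    if not hasattr(estimator, "feature_names_in_"):
        raise ValueError("estimator did not record feature names on named input")
    estimator.fit(unnamed_X)
    return not hasattr(estimator, "feature_names_in_")


named = NamedTable([["Paris", "Blue"], ["Tokyo", "Brown"]], ["cities", "eyes"])
unnamed = [["Orange"], ["Banana"]]

print(check_feature_names_in_reset(StaleNamesTransformer(), named, unnamed))  # False
print(check_feature_names_in_reset(ResettingTransformer(), named, unnamed))   # True
```

Written against a mock first, the same check could then be run over every estimator that records feature names, which would also cover the estimators not tested in the original report.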
