Skip to content

Handle with type problem for OneHotEncoder.get_feature_names #16593 #21754

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

jfreneau
Copy link

Reference Issues/PRs

Handle with type problem for get_feature_names #16593

Hello, I have to work on this issue for a school project.
I propose to cast integer-columns to string-columns and put a warning to inform the user that it is possible that others errors will happen later.

I don't think we should force the user to put strings because some of the cases stated above are common (like csv files without headers). We can't change all the consistency problems of Pandas. """

- cast features names into strings 
- raise a warning to inform user that non-string features names could cause problem
@jfreneau jfreneau marked this pull request as ready for review November 23, 2021 10:37
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we intended it the way it was but in the face of use-cases we can maybe change our mind.

A PR can be a good way to have a concrete discussions but it requires detailed test to set the explicit behavior we want to implement on concrete cases.

Furthermore, if we decide to go this way, we should probably factor-out this input validation logic into a utility function _check_input_feature_names in charge of raising the warning with an informative error message and using this function consistently in all the get_feature_names methods of scikit-learn. This means that we will also need a new common test to check this on invalid inputs for all the scikit-learn estimators with a get_feature_names method.

Even if we decide against the automatic str conversion with the warning, we probably want to instead raise a TypeError with an informative error message instead.

@@ -692,7 +692,9 @@ def get_feature_names(self, input_features=None):

feature_names = []
for i in range(len(cats)):
names = [input_features[i] + "_" + str(t) for t in cats[i]]
if type(input_features) is not str:
warnings.warn("Features names aren't strings, errors could occur")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide always the observed invalid value in error message, for instance:

    warnings.warn(
        "Unexpected type of feature name feature_names[{i}]={feature_names[i]!r}. "
        "Converting to str now. To silence this warning, please make sure encode "
        "feature names as `str` from the start."
    )

@jfreneau jfreneau changed the title Handle with type problem for get_feature_names #16593 Handle with type problem for OneHotEncoder.get_feature_names #16593 Nov 23, 2021
@cmarmo cmarmo added the Superseded PR has been replace by a newer PR label Oct 28, 2022
@thomasjpfan
Copy link
Member

I am closing this PR as it was superseded by #21754. With #21754, a user can configure get_feature_names_out to combine the feature names and categories however they want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:preprocessing Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants