Skip to content

OneHotEncoder.get_feature_names doesn't work with integer column names #16593

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
amueller opened this issue Feb 28, 2020 · 29 comments
Open

OneHotEncoder.get_feature_names doesn't work with integer column names #16593

amueller opened this issue Feb 28, 2020 · 29 comments
Assignees

Comments

@amueller
Copy link
Member

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
X = pd.DataFrame({1: np.random.randint(0, 10, size=400)})
OneHotEncoder().fit(X).get_feature_names([1])
TypeError: unsupported operand type(s) for +: 'int' and 'str'
@amueller amueller added Bug: triage Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted labels Feb 28, 2020
@amueller
Copy link
Member Author

What's the etiquette on bug: triage. Should I move the label to bug or should I wait for someone else to do that?

@rth
Copy link
Member

rth commented Feb 28, 2020

What's the etiquette on bug: triage. Should I move the label to bug or should I wait for someone else to do that?

Introduced in #16560 lol I have not realized it would also affect core-devs, we would probably want to open issues as "Other" reason and manually assign tags. It's was intended for unconfirmed users reports, there may be a better way to do it.

@rth rth added Bug and removed Bug: triage labels Feb 28, 2020
@amueller
Copy link
Member Author

Totally fine with opening as 'other' and labeling manually. I wasn't sure what workflow we wanted to use :)

@arunavkonwar
Copy link
Contributor

take

@cmarmo
Copy link
Contributor

cmarmo commented Feb 29, 2020

HI everybody, sorry for stepping in. I would like to point that there is another issue on OneHotEncoder (#16552) and @glemaitre and I we were waiting for a comment about the possibility of changing the drop_idx_ array from an array of integer to an array of objects (#16552 (comment)). This choice will probably change the way this issue should be solved too.

@arunavkonwar
Copy link
Contributor

arunavkonwar commented Mar 2, 2020

@cmarmo thanks for the heads up. I'll wait for you to complete your PR before I start working on this :)

@GaelVaroquaux
Copy link
Member

What's the desired solution: 1) raise a better error message, 2) magically cast the integer to a string?

I tend to frown on magical casting, as it is a source of strange behaviors and bugs. For instance, if X is:

X = pd.DataFrame({1: np.random.randint(0, 10, size=400),
                               "1": np.random.randint(0, 10, size=400)})

we are going to have bugs / strange behaviors anyhow.

@adrinjalali
Copy link
Member

Yeah, I also really don't like integer column names, since they get mixed up with indexing the columns. I would argue that a column name should be a string, and raise otherwise. People can convert their integer names to strings if they really want to use them. WDYT?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Mar 3, 2020 via email

@jnothman
Copy link
Member

jnothman commented Mar 3, 2020 via email

@amueller
Copy link
Member Author

amueller commented Mar 5, 2020

Should we ask users? Honestly I have no idea how commonly integer column names are used. You can transpose dataframes and integer indices are pretty natural. Not opposed to not allowing it, and it will certainly simplify things. But I'm not sure how frustrating this will be to users once we start interacting more with pandas column names. If we want column name consistency (#7242), does that mean that we'll always error when someone passes a pandas dataframe with integer column names? Reading a CSV file without headers, the default is integer, and so we'd add extra boiler plate.

@amueller
Copy link
Member Author

amueller commented Mar 5, 2020

@cmarmo the reason I labeled this as a bug and not as a new feature is that ColumnTransformer seems to work when using integer column names, but when requesting the feature names we get an unexpected error. When something errors in an unpredictable way, I would consider it a bug. Though it's maybe a matter of opinion what you'd call unexpected or unpredictable.

@cmarmo
Copy link
Contributor

cmarmo commented Mar 5, 2020

@cmarmo the reason I labeled this as a bug and not as a new feature is that ColumnTransformer seems to work when using integer column names, but when requesting the feature names we get an unexpected error. When something errors in an unpredictable way, I would consider it a bug. Though it's maybe a matter of opinion what you'd call unexpected or unpredictable.

Understood. Thanks for your answer @amueller.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Mar 5, 2020 via email

@adrinjalali
Copy link
Member

would it make sense not to have the default as integers @TomAugspurger ?

Even if this adds a step for the users to convert their column names into strings, I'm happy to force that. I'm really worried about getting into those inconsistencies, and I think we should constraint the types of pandas objects users can pass.

To be clear, not saying we shouldn't accept the pandas dataframe if it has integer column names, I just want us to reject integer column names anywhere we understand column names.

@amueller
Copy link
Member Author

amueller commented Mar 5, 2020

To be clear, not saying we shouldn't accept the pandas dataframe if it has integer column names, I just want us to reject integer column names anywhere we understand column names.

My point was that if we implement column name consistency (and/or feature names), that's everywhere, right?

@adrinjalali
Copy link
Member

adrinjalali commented Mar 5, 2020

My point was that if we implement column name consistency (and/or feature names), that's everywhere, right?

hmm, that's a tricky one. So let say I have a pipeline, fit the pipeline, and not I realize I have an alternative way of doing the first step. Then, assuming pipeline is what I had:

alt_transformer = Transformer().fit(X)

# predict:
Xt = alt_transformer.transform(X)
pipeline[1:].predict(Xt)

Should the above code raise? My intuition says no.

EDIT:
i.e. assuming feature names are propagated.

@TomAugspurger
Copy link
Contributor

If scikit-learn is getting feature names from the columns, and is doing string operations on the feature names, then requiring strings as column names seems sensible to me.

For users, the fix is potentially is as simple as df.rename(columns=str) prior to passing the dataframe into scikit-learn.

@chrispe
Copy link

chrispe commented Mar 6, 2020

Hi, can I pick this up? To my understanding the expected behaviour should be that an exception is raised when an integer is set as a column name?

@cmarmo
Copy link
Contributor

cmarmo commented Mar 6, 2020

Hi, @chrispe92, @arunavkonwar is willing to work on that and waiting for another PR (#16585) to be merged.... Thanks for your patience!

@chrispe
Copy link

chrispe commented Mar 6, 2020

Alright, I'll find something else to pick up then, thanks!

@cmarmo
Copy link
Contributor

cmarmo commented Mar 10, 2020

#16585 merged (Thanks @rth!). @arunavkonwar your turn now! :)

@arunavkonwar
Copy link
Contributor

Thanks @cmarmo I'm working on this now.

@amueller
Copy link
Member Author

@adrinjalali sorry I'm missing your point. That seemed to be related to column name consistency in general, not integers, right? I thought we had agreed that column name consistency is useful. I'm not sure if we want a slep for that right now given all the other discussions though...

@adrinjalali
Copy link
Member

I see your point, fair enough.

@amueller amueller added Needs Decision Requires decision and removed good first issue Easy with clear instructions to resolve labels Jun 22, 2020
@mystery2828
Copy link

Try typecasting the column name to str(string) and you'll get the feature names

@deepthi1107
Copy link

Hello, I am starting with this issue.

@deepthi1107
Copy link

In order to convert the data to one hot encoder is difficult as we are dealing with numerical values, so I guess by importing np_utils from Keras, can convert numerical in the form of one-hot encoding, getting the names of the features is difficult as it does not have any column names. So it's better to assign column names and then use get_feature_names.

@adrinjalali
Copy link
Member

@deepthi1107 I don't think this issue is a good one to work on right now. I'm also not sure what your proposed solution is, and we won't be depending on keras.

@cmarmo cmarmo removed the Easy Well-defined and straightforward way to resolve label Nov 30, 2021
mariokostelac added a commit to mariokostelac/scikit-learn that referenced this issue Feb 16, 2022
This change makes sure that OneHotEncoder.get_feature_names_out returns
unique feature names, as every transformer should do. The approach is
different than discussed in the issue, erring on the side of working by
default, rather than forcing the user to provide a custom function that
transforms (feature_name, value) into the feature name of the encoded
column.

It fixes scikit-learn#22488.

I'm happy to chance the approach here, but I'd expect this being good
enough for majority of usecases.

---

This also changes the behaviour of the case explained in
scikit-learn#16593.
Instead of raising by accident on impossible operation between str and
int, we cast the feature name to str - because we're producing strings at
the end after all. The exception was accidental.

I'm not sure whether we should close that other issue because there's a
discussion on whether we should raise if column names are integers.

One could argue that we should remove the str(feature_name) piece from this PR.

I think we shouldn't because this change enables them to use
OneHotEncoder without renaming fields. I suspect they
have integer feature names because they don't care about interpreting
them after the pipeline, at least until the expectation of feature names
is clearly set.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet