-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
OneHotEncoder.get_feature_names doesn't work with integer column names #16593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
What's the etiquette on bug: triage. Should I move the label to bug or should I wait for someone else to do that? |
Introduced in #16560 lol I have not realized it would also affect core-devs, we would probably want to open issues as "Other" reason and manually assign tags. It's was intended for unconfirmed users reports, there may be a better way to do it. |
Totally fine with opening as 'other' and labeling manually. I wasn't sure what workflow we wanted to use :) |
take |
HI everybody, sorry for stepping in. I would like to point that there is another issue on OneHotEncoder (#16552) and @glemaitre and I we were waiting for a comment about the possibility of changing the |
@cmarmo thanks for the heads up. I'll wait for you to complete your PR before I start working on this :) |
What's the desired solution: 1) raise a better error message, 2) magically cast the integer to a string? I tend to frown on magical casting, as it is a source of strange behaviors and bugs. For instance, if X is: X = pd.DataFrame({1: np.random.randint(0, 10, size=400), "1": np.random.randint(0, 10, size=400)}) we are going to have bugs / strange behaviors anyhow. |
Yeah, I also really don't like integer column names, since they get mixed up with indexing the columns. I would argue that a column name should be a string, and raise otherwise. People can convert their integer names to strings if they really want to use them. WDYT? |
I would argue that a column name should be a string, and raise
otherwise. People can convert their integer names to strings if they
really want to use them. WDYT?
I would rather have that. It's not the same behavior as pandas, but I
find that pandas often imposes too little guarantees and as a consequence
can hide bugs in users' code.
|
I'm happy to require input feature names to be strings.
|
Should we ask users? Honestly I have no idea how commonly integer column names are used. You can transpose dataframes and integer indices are pretty natural. Not opposed to not allowing it, and it will certainly simplify things. But I'm not sure how frustrating this will be to users once we start interacting more with pandas column names. If we want column name consistency (#7242), does that mean that we'll always error when someone passes a pandas dataframe with integer column names? Reading a CSV file without headers, the default is integer, and so we'd add extra boiler plate. |
@cmarmo the reason I labeled this as a bug and not as a new feature is that ColumnTransformer seems to work when using integer column names, but when requesting the feature names we get an unexpected error. When something errors in an unpredictable way, I would consider it a bug. Though it's maybe a matter of opinion what you'd call unexpected or unpredictable. |
Understood. Thanks for your answer @amueller. |
Reading a CSV file without headers, the default is integer
That's a problem (doesn't help me liking pandas, it's full of inconsistencies)
I suggest the following: we convert integers to strings at fit time, raising a warning, and err if there is a collision.
|
would it make sense not to have the default as integers @TomAugspurger ? Even if this adds a step for the users to convert their column names into strings, I'm happy to force that. I'm really worried about getting into those inconsistencies, and I think we should constraint the types of pandas objects users can pass. To be clear, not saying we shouldn't accept the pandas dataframe if it has integer column names, I just want us to reject integer column names anywhere we understand column names. |
My point was that if we implement column name consistency (and/or feature names), that's everywhere, right? |
hmm, that's a tricky one. So let say I have a pipeline, fit the pipeline, and not I realize I have an alternative way of doing the first step. Then, assuming alt_transformer = Transformer().fit(X)
# predict:
Xt = alt_transformer.transform(X)
pipeline[1:].predict(Xt) Should the above code raise? My intuition says no. EDIT: |
If scikit-learn is getting feature names from the columns, and is doing string operations on the feature names, then requiring strings as column names seems sensible to me. For users, the fix is potentially is as simple as |
Hi, can I pick this up? To my understanding the expected behaviour should be that an exception is raised when an integer is set as a column name? |
Hi, @chrispe92, @arunavkonwar is willing to work on that and waiting for another PR (#16585) to be merged.... Thanks for your patience! |
Alright, I'll find something else to pick up then, thanks! |
#16585 merged (Thanks @rth!). @arunavkonwar your turn now! :) |
Thanks @cmarmo I'm working on this now. |
@adrinjalali sorry I'm missing your point. That seemed to be related to column name consistency in general, not integers, right? I thought we had agreed that column name consistency is useful. I'm not sure if we want a slep for that right now given all the other discussions though... |
I see your point, fair enough. |
Try typecasting the column name to str(string) and you'll get the feature names |
Hello, I am starting with this issue. |
In order to convert the data to one hot encoder is difficult as we are dealing with numerical values, so I guess by importing np_utils from Keras, can convert numerical in the form of one-hot encoding, getting the names of the features is difficult as it does not have any column names. So it's better to assign column names and then use get_feature_names. |
@deepthi1107 I don't think this issue is a good one to work on right now. I'm also not sure what your proposed solution is, and we won't be depending on keras. |
This change makes sure that OneHotEncoder.get_feature_names_out returns unique feature names, as every transformer should do. The approach is different than discussed in the issue, erring on the side of working by default, rather than forcing the user to provide a custom function that transforms (feature_name, value) into the feature name of the encoded column. It fixes scikit-learn#22488. I'm happy to chance the approach here, but I'd expect this being good enough for majority of usecases. --- This also changes the behaviour of the case explained in scikit-learn#16593. Instead of raising by accident on impossible operation between str and int, we cast the feature name to str - because we're producing strings at the end after all. The exception was accidental. I'm not sure whether we should close that other issue because there's a discussion on whether we should raise if column names are integers. One could argue that we should remove the str(feature_name) piece from this PR. I think we shouldn't because this change enables them to use OneHotEncoder without renaming fields. I suspect they have integer feature names because they don't care about interpreting them after the pipeline, at least until the expectation of feature names is clearly set.
The text was updated successfully, but these errors were encountered: