ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556
Comments
I agree that using As you noted
The semantics for Assuming we do not change
P.S. I got a bit confused with the details, but I have a strong feeling that the default behaviour of |
Selecting columns based on two different types is not well supported and would require a design discussion. It also breaks backward compatibility. If this is a major issue, I think we could add a parameter such as `use_index_directly`:

ct = ColumnTransformer([('t2', Normalizer(norm="l1"), [1, 2])], use_index_directly=True)
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})
ct.fit_transform(df2)
# Internally we will do `df2[[1, 2]]` for selection.

What do you think?
Your solution seems reasonable to me. I think that some users might want What bothers me is why Notwithstanding your proposed solution, would it make sense to change the behaviour of |
Okay, I see why changing the default behaviour of After your proposed solution is implemented, I will be using the parameter |
Yup! That is the goal. The other benefit of |
Describe the bug
Unexpected behaviour when a DataFrame has integer column labels other than the natural ordering [0, 1, ...].
The only difference between `df1` and `df2` is the type of column index. In my opinion, the results for these dataframes must be similar, but an error is raised for the latter.

As far as I could see, the problem stems from semantic ambiguity as to when to use `iloc`-based indexing vs `loc`-based indexing. In `_get_column_indices` L382 this decision is based on the type of index and not on the type of the array. Whichever criterion is chosen, if it is followed consistently in column selectors, the error shall be avoided. Probably.

Steps/Code to Reproduce
(See above)
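Since the original snippet is not shown here, the ambiguity described above can be sketched in plain Python, without pandas or scikit-learn. This is not the original reproduction and not scikit-learn's implementation: the helper names `resolve_positional` and `resolve_by_label` and the label lists are hypothetical, chosen only to show how the same integer keys resolve under an `iloc`-like and a `loc`-like reading.

```python
from typing import Hashable, List, Sequence


def resolve_positional(labels: Sequence[Hashable], keys: Sequence[int]) -> List[int]:
    """Hypothetical helper: treat integer keys as positions (iloc-like)."""
    n = len(labels)
    for k in keys:
        if not -n <= k < n:
            raise IndexError(f"position {k} is out of bounds for {n} columns")
    # Normalise negative positions to their non-negative equivalents.
    return [k % n for k in keys]


def resolve_by_label(labels: Sequence[Hashable], keys: Sequence[Hashable]) -> List[int]:
    """Hypothetical helper: treat keys as column labels (loc-like)."""
    positions = {label: i for i, label in enumerate(labels)}
    missing = [k for k in keys if k not in positions]
    if missing:
        raise KeyError(f"labels not found: {missing}")
    return [positions[k] for k in keys]


natural = [0, 1]  # assumed column labels in natural ordering
shifted = [1, 2]  # assumed integer column labels that do not start at 0

# With natural ordering, both readings select the same columns.
agree = resolve_positional(natural, [0, 1]) == resolve_by_label(natural, [0, 1])

# With labels [1, 2], the key 2 is a valid label but not a valid position.
by_label = resolve_by_label(shifted, [1, 2])
try:
    resolve_positional(shifted, [1, 2])
    positional_error = None
except IndexError as exc:
    positional_error = str(exc)
```

Under these assumptions the two readings only diverge once the integer labels stop coinciding with positions, which is the situation the report describes.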
Expected Results
(See above)
Actual Results
(See above)
Versions
Python dependencies:
  pip: 22.0.3
  setuptools: 60.8.1
  sklearn: 1.1.dev0
  numpy: 1.22.2
  scipy: 1.8.0
  Cython: 0.29.27
  pandas: 1.3.5
  matplotlib: 3.5.0
  joblib: 1.1.0
  threadpoolctl: 3.1.0

commit b28c5bba66529217ceedd497201a684e5d35b73c (upstream/main, origin/main, origin/HEAD, main)
Author: Thomas J. Fan
Date: Tue Feb 15 11:46:54 2022 -0500

    FIX DummyRegressor overriding constant (#22486)