ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556

avm19 · 2022-02-20T05:17:56Z

Describe the bug

Unexpected behaviour when using integer column index in a DataFrame, other than natural ordering [0, 1, ...].

select_int = make_column_selector(dtype_include=np.int_)
ct = ColumnTransformer([('t2', Normalizer(norm="l1"), select_int)])
df1 = pd.DataFrame({'1': [1, 2, 3], '2': [9, 8, 7]})
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})
ct.fit_transform(df1) # OK
ct.fit_transform(df2) # IndexError

The only difference between df1 and df2 is the type of column index. In my opinion, the results for these dataframes must be similar, but an error is raised for the latter.

As far as I could see, the problem stems from semantic ambiguity as to when to use iloc-based indexing vs loc-based indexing. In _get_column_indices L382 this decision is based on the type of index and not on the type of the array. Whichever criterion is chosen, if it followed consistently in column selectors, the error shall be avoided. Probably.

Steps/Code to Reproduce

(See above)

Expected Results

(See above)

Actual Results

(See above)

Versions

Python dependencies:
          pip: 22.0.3
   setuptools: 60.8.1
      sklearn: 1.1.dev0
        numpy: 1.22.2
        scipy: 1.8.0
       Cython: 0.29.27
       pandas: 1.3.5
   matplotlib: 3.5.0
       joblib: 1.1.0
threadpoolctl: 3.1.0

commit b28c5bba66529217ceedd497201a684e5d35b73c (upstream/main, origin/main, origin/HEAD, main)
Author: Thomas J. Fan 
Date:   Tue Feb 15 11:46:54 2022 -0500

    FIX DummyRegressor overriding constant (#22486)

The text was updated successfully, but these errors were encountered:

thomasjpfan · 2022-03-10T19:12:44Z

I agree that using make_column_selector when selecting a DataFrame with integer column names is not great.

As you noted ColumnTransformer interprets integer selection as positional indices of the DataFrame. This behavior is detailed in the ColumnTransformer docstring:

Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.

The semantics for ColumnTransformer was decided a while ago and changing it would be backward incompatible. For now, the workaround would be to avoid using integer column names.

Assuming we do not change ColumnTransformer, here is a API proposal: We can add a returns= kwarg to make_column_selector such that:

make_column_selector(..., returns="names"): The default which is what we do now.
make_column_selector(..., return="positional"): Returns the columns as positions. For your use case, you can use this to select on any datafarme regardless of dtype of the columns.

avm19 · 2022-03-10T21:23:41Z

~~What could possibly break if you apply the same rule:~~

Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.

~~to _get_column_indices ?~~

I skimmed through an older version of code, and I think it used to be exactly this way, see e.g. this commit of #14035 . Probably some time after that, someone changed the logic: instead of checking the type of passed indices they decided to check the type of the dataframe's column Index.

P.S. I got a bit confused with the details, but I have a strong feeling that the default behaviour of make_column_selector can be changed without any harm done. I see a little problem with your proposition: if I want my pipeline to work well when the data is like df['1'] and also when it is df[1], I have to use make_column_selector(..., return="positional") and completely forgo the convenience offered by strings.

thomasjpfan · 2022-03-10T21:52:51Z

if I want my pipeline to work well when the data is like df['1'] and also when it is df[1], I have to use make_column_selector(..., return="positional") and completely forgo the convenience offered by strings.

Selecting columns based on two different types is not well supported and would require a design discussion. It also breaks backward compatibility.

If this is a major issue, I think we could add a use_index_directly parameter to ColumnTransformer that turns off these semantics and selects from the column directly. For example:

ct = ColumnTransformer([('t2', Normalizer(norm="l1"), [1, 2])], use_index_directly=True)
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})

ct.fit_transform(df2)
# Internally we will do `df2[[1, 2]]` for selection.

What do you think?

avm19 · 2022-03-10T22:50:38Z

Your solution seems reasonable to me. I think that some users might want ColumnTransformer to have this parameter even if they don't use make_column_selector or other callables. I would also be fine even with a cautionary note in the documentation as an interim solution.

What bothers me is why ColumnsTransformer and make_column_selector use different semantics in the first place. Namely, the output of make_column_selector is always names (str or int), while the input of ColumnsTransformer is either names (str) or positional (int). Afaiac, they were introduced together to be used together.

Notwithstanding your proposed solution, would it make sense to change the behaviour of make_column_selector and break the backward compatibility in the future? (if the policy allows, through a deprecation cycle or somehow). I wonder, whether ColumnsTransformer is the only "user" of make_column_selector or if there are conceivable alternative use cases for make_column_selector, in which the lack backward compatibility can break actual code.

avm19 · 2022-03-10T22:59:48Z

Okay, I see why changing the default behaviour of make_column_selector in the future is even worse than changing it now.

After your proposed solution is implemented, I will be using the parameter ColumnTransformer(..., use_index_directly=True) to make the pipeline work with both str and int column names. If make_column_selector is changed after that, then I will have to remove use_index_directly=True from all my code. So it is now or never.

thomasjpfan · 2022-03-11T16:48:34Z

After your proposed solution is implemented, I will be using the parameter ColumnTransformer(..., use_index_directly=True) to make the pipeline work with both str and int column names.

Yup! That is the goal. The other benefit of use_index_directly is that it allows for using panda's multiindex: #13781

avm19 added Bug Needs Triage Issue requires triage labels Feb 20, 2022

avm19 changed the title ~~Integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices)~~ ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) Feb 20, 2022

thomasjpfan added module:compose Needs Decision - API and removed Needs Triage Issue requires triage labels Mar 10, 2022

thomasjpfan added New Feature API and removed Bug labels Mar 11, 2022

avm19 mentioned this issue Sep 13, 2023

API ColumnTransformer.transformers_: passthrough -> FunctionTransformer #27204

Merged

lorentzenchr added Needs Decision Requires decision and removed Needs Decision - API labels Mar 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556

ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556

avm19 commented Feb 20, 2022 •

edited

Loading

thomasjpfan commented Mar 10, 2022

Uh oh!

avm19 commented Mar 10, 2022 •

edited

Loading

Uh oh!

thomasjpfan commented Mar 10, 2022

Uh oh!

avm19 commented Mar 10, 2022

Uh oh!

avm19 commented Mar 10, 2022 •

edited

Loading

Uh oh!

thomasjpfan commented Mar 11, 2022

Uh oh!

Uh oh!

ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556

ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556

Comments

avm19 commented Feb 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

thomasjpfan commented Mar 10, 2022

Uh oh!

avm19 commented Mar 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasjpfan commented Mar 10, 2022

Uh oh!

avm19 commented Mar 10, 2022

Uh oh!

avm19 commented Mar 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thomasjpfan commented Mar 11, 2022

Uh oh!

avm19 commented Feb 20, 2022 •

edited

Loading

avm19 commented Mar 10, 2022 •

edited

Loading

avm19 commented Mar 10, 2022 •

edited

Loading