Skip to content

ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
avm19 opened this issue Feb 20, 2022 · 6 comments

Comments

@avm19
Copy link
Contributor

avm19 commented Feb 20, 2022

Describe the bug

Unexpected behaviour when using integer column index in a DataFrame, other than natural ordering [0, 1, ...].

select_int = make_column_selector(dtype_include=np.int_)
ct = ColumnTransformer([('t2', Normalizer(norm="l1"), select_int)])
df1 = pd.DataFrame({'1': [1, 2, 3], '2': [9, 8, 7]})
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})
ct.fit_transform(df1) # OK
ct.fit_transform(df2) # IndexError

The only difference between df1 and df2 is the type of column index. In my opinion, the results for these dataframes must be similar, but an error is raised for the latter.

As far as I could see, the problem stems from semantic ambiguity as to when to use iloc-based indexing vs loc-based indexing. In _get_column_indices L382 this decision is based on the type of index and not on the type of the array. Whichever criterion is chosen, if it followed consistently in column selectors, the error shall be avoided. Probably.

Steps/Code to Reproduce

(See above)

Expected Results

(See above)

Actual Results

(See above)

Versions

Python dependencies:
          pip: 22.0.3
   setuptools: 60.8.1
      sklearn: 1.1.dev0
        numpy: 1.22.2
        scipy: 1.8.0
       Cython: 0.29.27
       pandas: 1.3.5
   matplotlib: 3.5.0
       joblib: 1.1.0
threadpoolctl: 3.1.0

commit b28c5bba66529217ceedd497201a684e5d35b73c (upstream/main, origin/main, origin/HEAD, main)
Author: Thomas J. Fan 
Date:   Tue Feb 15 11:46:54 2022 -0500

    FIX DummyRegressor overriding constant (#22486)
@avm19 avm19 added Bug Needs Triage Issue requires triage labels Feb 20, 2022
@avm19 avm19 changed the title Integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) Feb 20, 2022
@thomasjpfan
Copy link
Member

I agree that using make_column_selector when selecting a DataFrame with integer column names is not great.

As you noted ColumnTransformer interprets integer selection as positional indices of the DataFrame. This behavior is detailed in the ColumnTransformer docstring:

Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.

The semantics for ColumnTransformer was decided a while ago and changing it would be backward incompatible. For now, the workaround would be to avoid using integer column names.

Assuming we do not change ColumnTransformer, here is a API proposal: We can add a returns= kwarg to make_column_selector such that:

  1. make_column_selector(..., returns="names"): The default which is what we do now.
  2. make_column_selector(..., return="positional"): Returns the columns as positions. For your use case, you can use this to select on any datafarme regardless of dtype of the columns.

@avm19
Copy link
Contributor Author

avm19 commented Mar 10, 2022

What could possibly break if you apply the same rule:

Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.

to _get_column_indices ?

I skimmed through an older version of code, and I think it used to be exactly this way, see e.g. this commit of #14035 . Probably some time after that, someone changed the logic: instead of checking the type of passed indices they decided to check the type of the dataframe's column Index.

P.S. I got a bit confused with the details, but I have a strong feeling that the default behaviour of make_column_selector can be changed without any harm done. I see a little problem with your proposition: if I want my pipeline to work well when the data is like df['1'] and also when it is df[1], I have to use make_column_selector(..., return="positional") and completely forgo the convenience offered by strings.

@thomasjpfan
Copy link
Member

if I want my pipeline to work well when the data is like df['1'] and also when it is df[1], I have to use make_column_selector(..., return="positional") and completely forgo the convenience offered by strings.

Selecting columns based on two different types is not well supported and would require a design discussion. It also breaks backward compatibility.

If this is a major issue, I think we could add a use_index_directly parameter to ColumnTransformer that turns off these semantics and selects from the column directly. For example:

ct = ColumnTransformer([('t2', Normalizer(norm="l1"), [1, 2])], use_index_directly=True)
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})

ct.fit_transform(df2)
# Internally we will do `df2[[1, 2]]` for selection.

What do you think?

@avm19
Copy link
Contributor Author

avm19 commented Mar 10, 2022

Your solution seems reasonable to me. I think that some users might want ColumnTransformer to have this parameter even if they don't use make_column_selector or other callables. I would also be fine even with a cautionary note in the documentation as an interim solution.

What bothers me is why ColumnsTransformer and make_column_selector use different semantics in the first place. Namely, the output of make_column_selector is always names (str or int), while the input of ColumnsTransformer is either names (str) or positional (int). Afaiac, they were introduced together to be used together.

Notwithstanding your proposed solution, would it make sense to change the behaviour of make_column_selector and break the backward compatibility in the future? (if the policy allows, through a deprecation cycle or somehow). I wonder, whether ColumnsTransformer is the only "user" of make_column_selector or if there are conceivable alternative use cases for make_column_selector, in which the lack backward compatibility can break actual code.

@avm19
Copy link
Contributor Author

avm19 commented Mar 10, 2022

Okay, I see why changing the default behaviour of make_column_selector in the future is even worse than changing it now.

After your proposed solution is implemented, I will be using the parameter ColumnTransformer(..., use_index_directly=True) to make the pipeline work with both str and int column names. If make_column_selector is changed after that, then I will have to remove use_index_directly=True from all my code. So it is now or never.

@thomasjpfan
Copy link
Member

After your proposed solution is implemented, I will be using the parameter ColumnTransformer(..., use_index_directly=True) to make the pipeline work with both str and int column names.

Yup! That is the goal. The other benefit of use_index_directly is that it allows for using panda's multiindex: #13781

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants