ENH Improve set_output compatibility in ColumnTransformer #24699

Merged

Conversation

Member

@thomasjpfan thomasjpfan commented Oct 18, 2022

Reference Issues/PRs

Follow up to #23734

What does this implement/fix? Explain your changes.

On main, if an inner transformer does not define get_feature_names_out, ColumnTransformer raises an error even if all the transformers return DataFrames. This is because ColumnTransformer.get_feature_names_out is called to adjust the column names to follow verbose_feature_names_out.

This PR makes ColumnTransformer more lenient toward transformers that return DataFrames but do not define get_feature_names_out. Output feature names are prefixed according to verbose_feature_names_out. The prefixing logic is shared with get_feature_names_out and has been refactored into an _add_prefix_for_feature_names_out method.
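As a rough illustration of the prefixing described above, here is a dependency-free sketch. It is not the code from this PR: the standalone function name, its signature, and the error message are assumptions made for the example. It only models the two behaviours of verbose_feature_names_out (prefix every column with the transformer name, or keep names as-is and reject duplicates).

```python
from collections import Counter
from itertools import chain


def add_prefix_for_feature_names_out(named_feature_lists, verbose=True):
    """Simplified, hypothetical model of the prefixing step.

    named_feature_lists: list of (transformer_name, feature_names) pairs.
    verbose: stands in for ColumnTransformer's verbose_feature_names_out.
    """
    if verbose:
        # Prefix every feature with "<transformer_name>__".
        return [
            f"{name}__{feature}"
            for name, features in named_feature_lists
            for feature in features
        ]

    # Without prefixes, names must be unique across all transformers.
    all_names = list(
        chain.from_iterable(features for _, features in named_feature_lists)
    )
    duplicates = sorted(n for n, count in Counter(all_names).items() if count > 1)
    if duplicates:
        raise ValueError(
            f"Output feature names are not unique: {duplicates}. "
            "Consider enabling prefixes."
        )
    return all_names


named = [("num", ["feat1", "feat2"]), ("cat", ["color"])]
prefixed = add_prefix_for_feature_names_out(named, verbose=True)
plain = add_prefix_for_feature_names_out(named, verbose=False)
```

With the input above, `prefixed` is `["num__feat1", "num__feat2", "cat__color"]` and `plain` is `["feat1", "feat2", "color"]`. Because one helper serves both the column renaming and get_feature_names_out, the two cannot drift apart.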

Any other comments?

I think it is common to have third-party transformers that expect only DataFrames and always return DataFrames, regardless of how set_output is configured.

@glemaitre glemaitre self-requested a review October 19, 2022 08:39
Member

@glemaitre glemaitre left a comment

I am fine with the proposed behaviour. I think that this is something that third-party libraries could find useful.

X_df = pd.DataFrame({"feat1": [1, 2, 3], "feat2": [3, 4, 5]})

X_wrapped = _wrap_in_pandas_container(X_df, columns=get_columns)
assert_array_equal(X_wrapped.columns, X_df.columns)
Member

Since the documentation mentions that raising an error is equivalent to None, I think we should also test the case where the callable raises an error and we pass something other than a DataFrame, to check that we return range(X.shape[1]).
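The fallback requested in this comment can be modelled without pandas. This is a minimal sketch under stated assumptions, not the body of _wrap_in_pandas_container: the helper name `resolve_columns` and its `n_columns` parameter are hypothetical, and the real function builds a DataFrame rather than returning a list.

```python
def resolve_columns(columns, n_columns):
    """Hypothetical helper: decide which column labels to use.

    columns: None, a sequence of labels, or a zero-argument callable
        returning labels. A callable that raises is treated like None.
    n_columns: number of columns in the data, i.e. X.shape[1].
    """
    if callable(columns):
        try:
            columns = columns()
        except Exception:
            # Raising is treated as "no column names available".
            columns = None

    if columns is None:
        # Default labels for non-DataFrame input: 0, 1, ..., n_columns - 1.
        return list(range(n_columns))
    return list(columns)


def get_columns():
    # Stand-in for a callable that cannot provide names.
    raise ValueError("No feature names defined")


fallback = resolve_columns(get_columns, n_columns=3)
explicit = resolve_columns(lambda: ["a", "b", "c"], n_columns=3)
```

Here `fallback` is `[0, 1, 2]` and `explicit` is `["a", "b", "c"]`, which is the pair of outcomes the proposed test would distinguish.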

Member Author

Added test in 0fc62e6 (#24699) and adjusted it slightly in 2fb935f (#24699)

@cmarmo cmarmo added the "Waiting for Second Reviewer" label Nov 12, 2022
Member

@jeremiedbb jeremiedbb left a comment

LGTM

@jeremiedbb jeremiedbb merged commit 7dcb5ef into scikit-learn:main Nov 17, 2022
Labels
module:compose, module:utils, Waiting for Second Reviewer
4 participants