Skip to content

ENH ColumnTransformer.transform returns dataframes when transformers output them #20110

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #20035

What does this implement/fix? Explain your changes.

ColumnTransformer.transform returns dataframes when transformers output them. If the index does not match, then a warning is raised and the old behavior is preserved.

@amueller
Copy link
Member

the test failures are relevant, right?
And should there be an option for backward-compatibility? I like this behavior much better but it is a breaking change. Also, @jnothman expressed interest ;)

@amueller amueller closed this Jun 11, 2021
@amueller amueller reopened this Jun 11, 2021
@ogrisel
Copy link
Member

ogrisel commented Jun 11, 2021

Shall we introduce the output="pandas" option to make this explicit (and make it possible to implement backward compat)?

Also, if output="pandas" and all transformers that return a numpy array or scipy sparse matrix have a working get_feature_names we could force the convertion to a pandas dataframe which would make this option even more useful.

amueller
amueller previously approved these changes Jul 9, 2021
@amueller amueller dismissed their stale review July 9, 2021 19:35

misclick lol

@thomasjpfan
Copy link
Member Author

Also, if output="pandas" and all transformers that return a numpy array or scipy sparse matrix have a working get_feature_names we could force the convertion to a pandas dataframe which would make this option even more useful.

That does work in principle. This would mean ColumnTransformer would be the only estimator the outputs dataframes. It would cover the use case:

  1. The final step of a pipeline would have its feature names.
  2. categorical_features can be selected by name in HistGradientBoosting*.

The transformers in the ColumnTransformer would not have access to the names.


return self._hstack_np(Xs)

def _hstack_np(self, Xs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be a purist, sparse is not really np but coming from scipy :)

I am wondering if we should have 2 functions here. I would prefer to have an if/else statement in the _hstack.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the _safe_indexing we indeed have 3 functions: _array_indexing, _pandas_indexing, and _list_indexing. We could indeed have something similar then.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM. Just a couple of thoughts.

X_trans[:, ct.output_indices_['trans2']])
assert_array_equal(X_trans[:, []],
X_trans[:, ct.output_indices_['remainder']])
assert_array_equal(X_trans.iloc[:, [0]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you could use _safe_indxing here


return self._hstack_np(Xs)

def _hstack_np(self, Xs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the _safe_indexing we indeed have 3 functions: _array_indexing, _pandas_indexing, and _list_indexing. We could indeed have something similar then.


@pytest.mark.parametrize("first_kwargs", [
{"index_start": 2}, {"reverse_index": True}])
def test_pandas_index_not_aligned_warns(first_kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add a switch to get a sparse matrix as well.

@giacomov
Copy link

+1 for this

@mattiasAngqvist
Copy link

Great initiative!

@kellybean-tulcolabs
Copy link

catboost users could really use this but it's been sitting for over a year, any idea when this might get worked into a release?

@amueller
Copy link
Member

even with the new set_output this is still relevant, right?

@thomasjpfan thomasjpfan added the Superseded PR has been replace by a newer PR label Oct 14, 2022
@thomasjpfan
Copy link
Member Author

Yea, this PR has been superseded by #23734

@amueller
Copy link
Member

@giacomov @mattiasAngqvist @kellybean-tulcolabs does using set_output or the global set_config(transform_output="pandas") solve your use-case? If we do something "magic" for ColumnTransformer, it's a bit inconsistent with other estimators.

@mattiasAngqvist
Copy link

@giacomov @mattiasAngqvist @kellybean-tulcolabs does using set_output or the global set_config(transform_output="pandas") solve your use-case? If we do something "magic" for ColumnTransformer, it's a bit inconsistent with other estimators.

Good question, it was a while since I ran into this issue. Just for my understanding: If I send in a pandas dataframe will I get out a pandas dataframe or would I need to specify that using set_output?

@amueller
Copy link
Member

you need to specify it either once globally (that you always want a pd dataframe out, no matter what the input), or you need to specify it for the pipeline / transformer individually.

Determining the output type based on the input type is something we discussed (for a long time) but ultimately rejected. There's not really a reason ColumnTransformer should change it's output based on the input types any more than StandardScaler should, but if you try to do it for all estimators, there's a whole bunch of issues.

@giacomov
Copy link

In the case of Column transformer maybe pandas should be the default output. I mean, that transformer is specifically for pandas Dataframes... But short of this, a pipeline-level set_output is also fine. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:compose Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow stacking pandas dataframes in ColumnTransformer?
7 participants