ENH ColumnTransformer.transform returns dataframes when transformers output them #20110

thomasjpfan · 2021-05-19T02:50:57Z

Reference Issues/PRs

Fixes #20035

What does this implement/fix? Explain your changes.

ColumnTransformer.transform returns dataframes when transformers output them. If the index does not match, then a warning is raised and the old behavior is preserved.
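
To illustrate the rule described above, here is a minimal sketch assuming pandas and NumPy are available. The helper name `hstack_outputs` and the warning text are hypothetical and are not the code in this PR; the sketch only shows the idea of concatenating as a dataframe when every block is a dataframe with the same index, and otherwise warning and falling back to an array.

```python
import warnings

import numpy as np
import pandas as pd


def hstack_outputs(Xs):
    """Hypothetical helper: stack transformer outputs column-wise."""
    all_frames = bool(Xs) and all(isinstance(X, pd.DataFrame) for X in Xs)
    if all_frames:
        first_index = Xs[0].index
        if all(X.index.equals(first_index) for X in Xs[1:]):
            # Indexes agree, so a column-wise concat keeps rows aligned.
            return pd.concat(Xs, axis=1)
        warnings.warn(
            "Transformer outputs have different indexes; returning a "
            "NumPy array instead of a dataframe.",
            UserWarning,
        )
    # Fallback (the old behavior): positional stacking into a plain array.
    return np.hstack([np.asarray(X) for X in Xs])


a = pd.DataFrame({"x": [1.0, 2.0]}, index=[0, 1])
b = pd.DataFrame({"y": [3.0, 4.0]}, index=[0, 1])
c = pd.DataFrame({"y": [3.0, 4.0]}, index=[2, 3])

aligned = hstack_outputs([a, b])  # indexes match: dataframe output
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = hstack_outputs([a, c])  # indexes differ: warning + ndarray
```

Using `Index.equals` here is one possible definition of "the index matches"; a real implementation might choose a looser or stricter check.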

amueller · 2021-06-11T00:41:04Z

the test failures are relevant, right?
And should there be an option for backward-compatibility? I like this behavior much better but it is a breaking change. Also, @jnothman expressed interest ;)

ogrisel · 2021-06-11T15:02:20Z

Shall we introduce the output="pandas" option to make this explicit (and make it possible to implement backward compat)?

Also, if output="pandas" and all transformers that return a numpy array or scipy sparse matrix have a working get_feature_names, we could force the conversion to a pandas dataframe, which would make this option even more useful.

misclick lol

thomasjpfan · 2021-07-09T21:52:45Z

> Also, if output="pandas" and all transformers that return a numpy array or scipy sparse matrix have a working get_feature_names, we could force the conversion to a pandas dataframe, which would make this option even more useful.

That does work in principle. This would mean ColumnTransformer would be the only estimator that outputs dataframes. It would cover these use cases:

- The final step of a pipeline would have its feature names.
- categorical_features can be selected by name in HistGradientBoosting*.

The transformers in the ColumnTransformer would not have access to the names.
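
As a rough illustration of the forced conversion discussed here, the sketch below wraps a dense array in a dataframe using a list of feature names. It assumes pandas and NumPy; the helper `wrap_in_frame` and the column names are hypothetical, and it does not cover the sparse case.

```python
import numpy as np
import pandas as pd


def wrap_in_frame(X, feature_names, index=None):
    """Hypothetical helper: attach column names to a dense 2D array."""
    X = np.asarray(X)
    if X.ndim != 2 or X.shape[1] != len(feature_names):
        raise ValueError("feature_names must match the number of columns")
    return pd.DataFrame(X, columns=list(feature_names), index=index)


# Made-up output of a ColumnTransformer and made-up feature names.
X_out = np.array([[0.5, 1.0], [1.5, 0.0]])
frame = wrap_in_frame(X_out, ["age_scaled", "city_paris"], index=[10, 11])
```

With names attached only at the end, a downstream step can select columns by name, while the inner transformers still see whatever they were given.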

glemaitre · 2021-07-23T08:48:42Z

sklearn/compose/_column_transformer.py

+
+        return self._hstack_np(Xs)
+
+    def _hstack_np(self, Xs):


To be a purist, sparse is not really np but coming from scipy :)

I am wondering if we should have 2 functions here. I would prefer to have an if/else statement in the _hstack.


glemaitre

LGTM. Just a couple of thoughts.

glemaitre · 2021-07-23T08:51:49Z

sklearn/compose/tests/test_column_transformer.py

-                       X_trans[:, ct.output_indices_['trans2']])
-    assert_array_equal(X_trans[:, []],
-                       X_trans[:, ct.output_indices_['remainder']])
+    assert_array_equal(X_trans.iloc[:, [0]],


I think that you could use _safe_indexing here.

glemaitre · 2021-07-23T08:52:48Z

sklearn/compose/_column_transformer.py

+
+        return self._hstack_np(Xs)
+
+    def _hstack_np(self, Xs):


In the _safe_indexing we indeed have 3 functions: _array_indexing, _pandas_indexing, and _list_indexing. We could indeed have something similar then.
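
One way to picture the "one helper per container type" layout is the sketch below, assuming NumPy, SciPy, and pandas are installed. All function names are hypothetical and this is not the PR's implementation; it only shows a small dispatcher in front of three container-specific helpers.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def _hstack_sparse(Xs):
    # Dense blocks are converted to arrays so scipy can stack them.
    blocks = [X if sparse.issparse(X) else np.asarray(X) for X in Xs]
    return sparse.hstack(blocks).tocsr()


def _hstack_pandas(Xs):
    return pd.concat(Xs, axis=1)


def _hstack_array(Xs):
    return np.hstack([np.asarray(X) for X in Xs])


def hstack_dispatch(Xs):
    """Hypothetical dispatcher: pick a helper based on the block types."""
    if any(sparse.issparse(X) for X in Xs):
        return _hstack_sparse(Xs)
    if all(isinstance(X, pd.DataFrame) for X in Xs):
        return _hstack_pandas(Xs)
    return _hstack_array(Xs)


dense = np.array([[1.0], [2.0]])
sp = sparse.csr_matrix([[0.0, 3.0], [4.0, 0.0]])
out = hstack_dispatch([dense, sp])  # sparse result with 3 columns
```

The same branching could equally live as an if/else inside a single `_hstack`; the split into helpers mainly keeps each branch testable on its own.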

glemaitre · 2021-07-23T08:54:17Z

sklearn/compose/tests/test_column_transformer.py

+
+@pytest.mark.parametrize("first_kwargs", [
+    {"index_start": 2}, {"reverse_index": True}])
+def test_pandas_index_not_aligned_warns(first_kwargs):


Do you want to add a switch to get a sparse matrix as well?

giacomov · 2021-12-13T18:09:48Z

+1 for this

mattiasAngqvist · 2022-08-03T11:20:43Z

Great initiative!

kellybean-tulcolabs · 2022-09-20T19:37:20Z

catboost users could really use this but it's been sitting for over a year, any idea when this might get worked into a release?

amueller · 2022-10-14T17:32:41Z

even with the new set_output this is still relevant, right?

thomasjpfan · 2022-10-14T18:25:32Z

Yea, this PR has been superseded by #23734

amueller · 2022-10-14T19:48:45Z

@giacomov @mattiasAngqvist @kellybean-tulcolabs does using set_output or the global set_config(transform_output="pandas") solve your use-case? If we do something "magic" for ColumnTransformer, it's a bit inconsistent with other estimators.

mattiasAngqvist · 2022-10-14T20:08:56Z

> @giacomov @mattiasAngqvist @kellybean-tulcolabs does using set_output or the global set_config(transform_output="pandas") solve your use-case? If we do something "magic" for ColumnTransformer, it's a bit inconsistent with other estimators.

Good question, it has been a while since I ran into this issue. Just for my understanding: if I send in a pandas dataframe, will I get out a pandas dataframe, or would I need to specify that using set_output?

amueller · 2022-10-14T20:15:34Z

you need to specify it either once globally (that you always want a pd dataframe out, no matter what the input), or you need to specify it for the pipeline / transformer individually.

Determining the output type based on the input type is something we discussed (for a long time) but ultimately rejected. There's not really a reason ColumnTransformer should change its output based on the input types any more than StandardScaler should, but if you try to do it for all estimators, there's a whole bunch of issues.

giacomov · 2022-10-15T00:39:18Z

In the case of ColumnTransformer, maybe pandas should be the default output. I mean, that transformer is specifically for pandas DataFrames... But short of this, a pipeline-level set_output is also fine. Thanks!

ENH ColumnTransformer.transform returns dataframes when transformers output them

696fe53

github-actions bot added the module:compose label May 19, 2021

thomasjpfan added 4 commits May 18, 2021 22:51

DOC Adds whats new with PR number

0c69145

CLN Move pandas till later

1334bcb

DOC Adds docstring

c37ddb7

DOC Adds docstring to _hstack_np

59ad232

amueller closed this Jun 11, 2021

amueller reopened this Jun 11, 2021

thomasjpfan added 2 commits June 11, 2021 09:13

Merge remote-tracking branch 'upstream/main' into column_transformer_pandas_out

24dcc0c

TST Fixes test

d6f7a1e

thomasjpfan mentioned this pull request Jun 14, 2021

API options for Pandas output #20258

Closed

ogrisel mentioned this pull request Jun 17, 2021

FIX ColumnTransformer raise TypeError when remainder columns have incompatible dtype #20287

Closed

amueller previously approved these changes Jul 9, 2021

glemaitre reviewed Jul 23, 2021

thomasjpfan added 2 commits July 25, 2021 07:42

Merge remote-tracking branch 'upstream/main' into column_transformer_pandas_out

ef5e5e3

TST Adds test for sparse output

6513755

thomasjpfan added the Superseded label (PR has been replaced by a newer PR) Oct 14, 2022

thomasjpfan closed this Oct 14, 2022
