ENH ColumnTransformer.transform returns dataframes when transformers output them #20110
Conversation
The test failures are relevant, right?
Shall we introduce the Also, if |
That does work in principle. This would mean
The transformers in the |
return self._hstack_np(Xs)

def _hstack_np(self, Xs):
To be a purist, sparse is not really np but coming from scipy :)
I am wondering if we should have 2 functions here. I would prefer to have an if/else statement in the _hstack.
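The if/else suggestion above could look roughly like the following. This is only a sketch under stated assumptions, not the PR's code: `hstack_outputs` is a hypothetical name, and the dense blocks are converted to CSR before stacking for simplicity.

```python
import numpy as np
from scipy import sparse


def hstack_outputs(Xs):
    """Horizontally stack transformer outputs (hypothetical helper)."""
    if any(sparse.issparse(X) for X in Xs):
        # Sparse containers come from scipy, not numpy, so stack with scipy.
        # Converting every block to CSR first is an assumption of this sketch.
        return sparse.hstack([sparse.csr_matrix(X) for X in Xs]).tocsr()
    # Dense case: plain NumPy stacking.
    return np.hstack([np.asarray(X) for X in Xs])


dense = hstack_outputs([np.ones((2, 1)), np.zeros((2, 2))])
mixed = hstack_outputs([np.ones((2, 1)), sparse.eye(2, format="csr")])
```

Keeping both branches in one function avoids a helper named after NumPy that also handles scipy containers.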
In the _safe_indexing we indeed have 3 functions: _array_indexing, _pandas_indexing, and _list_indexing. We could indeed have something similar then.
LGTM. Just a couple of thoughts.
                   X_trans[:, ct.output_indices_['trans2']])
assert_array_equal(X_trans[:, []],
                   X_trans[:, ct.output_indices_['remainder']])
assert_array_equal(X_trans.iloc[:, [0]],
I think that you could use _safe_indexing here.
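To illustrate the suggestion, here is a minimal sketch of how `_safe_indexing` selects columns the same way for a NumPy array and a DataFrame, so a test would not need separate `[:, ...]` and `.iloc[:, ...]` branches. `_safe_indexing` is a private scikit-learn helper and may change between releases; the data below is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import _safe_indexing  # private helper, subject to change

# Illustrative data, not taken from the PR's test suite.
X_np = np.array([[1, 2], [3, 4]])
X_df = pd.DataFrame(X_np, columns=["a", "b"])

# The same call works for both containers when axis=1 selects columns.
first_np = _safe_indexing(X_np, [0], axis=1)
first_df = _safe_indexing(X_df, [0], axis=1)

np.testing.assert_array_equal(first_np, np.asarray(first_df))
```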
@pytest.mark.parametrize("first_kwargs", [
    {"index_start": 2}, {"reverse_index": True}])
def test_pandas_index_not_aligned_warns(first_kwargs):
Do you want to add a switch to get a sparse matrix as well?
+1 for this
Great initiative!
CatBoost users could really use this, but it's been sitting for over a year. Any idea when this might get worked into a release?
even with the new |
Yea, this PR has been superseded by #23734
@giacomov @mattiasAngqvist @kellybean-tulcolabs does using |
Good question, it was a while since I ran into this issue. Just for my understanding: If I send in a pandas dataframe will I get out a pandas dataframe or would I need to specify that using |
you need to specify it either once globally (that you always want a pd dataframe out, no matter what the input), or you need to specify it for the pipeline / transformer individually. Determining the output type based on the input type is something we discussed (for a long time) but ultimately rejected. There's not really a reason |
In the case of ColumnTransformer, maybe pandas should be the default output. I mean, that transformer is specifically for pandas DataFrames... But short of this, a pipeline-level set_output is also fine. Thanks!
Reference Issues/PRs
Fixes #20035
What does this implement/fix? Explain your changes.
ColumnTransformer.transform returns dataframes when transformers output them. If the index does not match, then a warning is raised and the old behavior is preserved.
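The behavior described above can be sketched roughly as follows. This is not the PR's implementation: `hstack_maybe_frames` and the warning text are hypothetical, and the fallback simply stacks the values as a NumPy array.

```python
import warnings

import numpy as np
import pandas as pd


def hstack_maybe_frames(Xs):
    """Return a DataFrame when all outputs are aligned DataFrames (hypothetical)."""
    if all(isinstance(X, pd.DataFrame) for X in Xs):
        first_index = Xs[0].index
        if all(X.index.equals(first_index) for X in Xs[1:]):
            # Indices match, so a column-wise concat keeps rows aligned.
            return pd.concat(Xs, axis=1)
        # Mismatched indices: warn and fall through to the array behavior.
        warnings.warn(
            "Transformer outputs have non-aligned indices; "
            "returning a NumPy array instead.",
            UserWarning,
        )
    return np.hstack([np.asarray(X) for X in Xs])


a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"y": [3, 4]})
aligned = hstack_maybe_frames([a, b])

# Same values as b, but with an index that starts at 2.
b_shifted = pd.DataFrame({"y": [3, 4]}, index=[2, 3])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = hstack_maybe_frames([a, b_shifted])
```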