ENH Support dataframe exchange protocol in ColumnTransformer as input #26115
Conversation
elif hasattr(X, "__dataframe__"):
    with suppress(ImportError):
        from pandas.api.interchange import from_dataframe

        return from_dataframe(X)
this looks kinda dirty to me. Shouldn't we instead have something in _validate_data or check_X_y, ... to convert to whatever format is requested by the set_config for the output?
set_config(transform_output=) does not control what to do with input data. This PR enables DataFrame protocol input for ColumnTransformer and does not change output semantics. Even with this PR, the ColumnTransformer can only output pandas DataFrames and ndarrays depending on configuration.
For DataFrame input in non-ColumnTransformer estimators, we will end up calling np.asarray on them, so it will work as long as the DataFrame defines a correct __array__. This is the case with the latest version of Polars. As an aside, we can support DataFrames that do not implement __array__ by adding __dataframe__ support to check_array, but that is another topic.
As for the solution with _check_X, I think this is fairly clean with from_dataframe(X). The logic is: if the input is dataframe-like, convert it to a pandas DataFrame so that the ColumnTransformer code can work on it.
To me these topics are very closely related:
- understanding feature names
- outputting things other than an ndarray
- accepting things other than an ndarray
What I don't like here is that we're making ColumnTransformer very specific, which doesn't make much sense to me. Why would the user need to use a ColumnTransformer in a Pipeline to be able to support this input?
I think in my mind, the solution here would be in _validate_data: read the feature names from this type, store them, and then probably convert it directly to a numpy array if needed, or a pandas DataFrame if we want to. I don't see why we should put this in ColumnTransformer.
Ah, I see what you mean by moving the logic to _validate_data and getting it to understand other types of DataFrames. I'll open another PR to update _validate_data to understand feature_names_in_ from the dataframe protocol to get the ball rolling.
As an aside, with the recent addition of cast_to_ndarray to _validate_data, we can finally adapt ColumnTransformer to use _validate_data so it can reuse the logic.
Can you link the new PR (once it exists) from here?
@thomasjpfan now that we are converging on #26411 and that more and more people express interest in other dataframes (e.g. #26435) it might be a good time to revive this PR (or the one you intended to open).
I opened a PR for feature names here: #26464
As for this PR, I really want the ColumnTransformer to pass around the same DataFrame container as the input to the inner transformers. I suspect I still need data-apis/dataframe-api#42 to work. Although, I can work around it by defining a custom "from_dataframe" that is library specific, and then update it when the DataFrame spec adds a "from_dataframe" functionality.
+1 for a temporary, library-specific from_dataframe: a private routing helper with a link to data-apis/dataframe-api#42 in its docstring.
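A sketch of what such a routing helper could look like, dispatching on the input's library (the polars branch is an assumption about that library's API; anything else interchange-capable falls back to pandas' from_dataframe):

```python
import pandas as pd

def _route_from_dataframe(X):
    # Temporary, library-specific routing: return the input in its own
    # library's container. See data-apis/dataframe-api#42 for the
    # standardization discussion this works around.
    root = type(X).__module__.split(".")[0]
    if root == "pandas":
        return X  # already the container we want
    if root == "polars":
        import polars  # assumption: polars provides a from_dataframe
        return polars.from_dataframe(X)
    # Fallback: convert any other interchange-capable object to pandas.
    from pandas.api.interchange import from_dataframe
    return from_dataframe(X)

df = pd.DataFrame({"a": [1]})
same = _route_from_dataframe(df)  # pandas input is returned as-is
```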
@thomasjpfan shall we close this? The code-base has evolved a lot. Now I am not sure about the status from other
Yup, let's close this one.
Reference Issues/PRs
Towards #25896
What does this implement/fix? Explain your changes.
This PR enables ColumnTransformer to ingest any DataFrame that supports the DataFrame Exchange protocol.
This PR is a little strange in that the ColumnTransformer converts the input DataFrame into a Pandas DataFrame and passes that around. For example, if the input is a Polars DataFrame, the inner transformers will end up with a Pandas DataFrame. Converting early is the easiest way to implement the feature, but the behavior is strange to me.
An alternative is to actually support Polars DataFrames directly in ColumnTransformer. This means adjusting all the Pandas-specific code to support Polars. The dataframe exchange protocol does not have the API to perform the column indexing required by ColumnTransformer.
If there is a standard for from_dataframe, then we can internally convert the input DataFrame into a Pandas DataFrame, which is processed by ColumnTransformer. When it comes time to pass the DataFrame to the inner transformers, we convert them back to the same DataFrame library as the input.