Skip to content

ENH Support dataframe exchange protocol in ColumnTransformer as input #26115

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Towards #25896

What does this implement/fix? Explain your changes.

This PR enables ColumnTransformer to ingest any DataFrame that supports the DataFrame Exchange protocol.

This PR is a little strange in that the ColumnTransformer converts the input DataFrame into a Pandas DataFrame and passes that around. For example, if the input is a Polars DataFrame, the inner transformers will end up with a Pandas DataFrame. Converting early is the easiest way to implement the feature, but the behavior is strange to me.

  • An alternative is to actually support Polars DataFrames directly in ColumnTransformer. This means adjusted all the Pandas specific code to support Polars. The dataframe exchange protocol does not have the API to perform the column indexing required by ColumnTransformer.

  • If there is a standard for from_dataframe, then we can internally convert the input DataFrame into a Pandas DataFrame, which is processed by ColumnTransformer. When it comes time to pass the DataFrame to the inner transformer, we convert them back to the same DataFrame library as the input.

@thomasjpfan thomasjpfan marked this pull request as draft April 10, 2023 13:23
Comment on lines +907 to +911
elif hasattr(X, "__dataframe__"):
with suppress(ImportError):
from pandas.api.interchange import from_dataframe

return from_dataframe(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks kinda dirty to me. Shouldn't we instead have something in _validate_data or check_X_y, ... to convert to whatever format is requested by the set_config for the output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_config(transform_output=) does not control what to do with input data. This PR enables DataFrame protocol input for ColumnTransformer and does not change output semantics. Even with this PR, the ColumnTransformer can only output pandas DataFrames and ndarrays depending on configuration.

For DataFrame input in non-ColumnTransfromer, we will end up calling np.asarray on them, so it will work as long as the DataFrame defines a correct __array__. This is the case with the latest version of Polars. As an aside, we can support DataFrames that do not implement __array__ by adding __dataframe__ support to check_array but that is another topic.

As for the solution with _check_X, I think this is fairly clean with from_dataframe(x). The logic is "if input is dataframe-like convert it to a pandas DataFrame so that ColumnTransformer code can work on it".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me these topics are very closely related:

  • understanding feature names
  • outputing things other than an ndarray
  • accepting things other than ndarray

What I don't like here, is that we're making ColumnTransformer very specific, which doesn't make much sense to me. Why would the user need to use a ColumnTransformer in a Pipeline to be able to support this input?

I think in my mind, the solution here would be in _validate_data, to read the feature names from this type, store them, and probably then convert it directly to a numpy array if needed, or a pandas DataFrame if we want to. I don't see why we should put this in ColumnTransformer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you mean by having the logic to _validate_data and getting it to understand other types of DataFrames. I'll open another PR to update _validate_data to understand feature_names_in_ from the dataframe protocol to get the ball rolling.

As an aside, with the recent addition of cast_to_ndarray to _validate_data, we can finally adapt ColumnTransformer to use _validate_data so it can reuse the logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link the new PR (once it exists) from here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan now that we are converging on #26411 and that more and more people express interest in other dataframes (e.g. #26435) it might be a good time to revive this PR (or the one you intended to open).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened a PR for feature names here: #26464

As for this PR, I really want the ColumnTransformer to pass around the same DataFrame container as the input to the inner transformers. I suspect I still need data-apis/dataframe-api#42 to work. Although, I can workaround it with defining a custom "from_dataframe" that is library specific and then update it when the DataFrame SPEc adds a "from_dataframe" functionality.

Copy link
Member

@ogrisel ogrisel May 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a temporary, library-specific from_dataframe, private routing helper with a link to data-apis/dataframe-api#42 in its docstring.

@thomasjpfan thomasjpfan changed the title ENH Support dataframe exchange protocol in ColumnTransformer ENH Support dataframe exchange protocol in ColumnTransformer as input Apr 13, 2023
@ogrisel
Copy link
Member

ogrisel commented Dec 4, 2023

@thomasjpfan shall we close this? The code-base has evolved a lot. Now ColumnTransformer accepts polars and pyarrow tables as input and polars as output and most transformers fetch their features names from the dataframe interchange protocol #26464.

I am not sure about the status from other __dataframe__ libraries but this should probably dealt with in dedicated PRs.

@thomasjpfan
Copy link
Member Author

Yup, lets close this one.

@thomasjpfan thomasjpfan closed this Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants