Skip to content

ENH Adds polars output support to ColumnTransformer #26683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Towards #25896

What does this implement/fix? Explain your changes.

This PR uses the DataFrame interchange protocol in ColumnTransformer to parse DataFrames and slices them for the inner transformers to consume.

Any other comments?

The pandas code path is kept because __dataframe__ can sometimes make a copy with the default allow_copy=True.

@github-actions
Copy link

github-actions bot commented Jun 23, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bc7628b. Link to the linter CI: here

@glemaitre glemaitre self-requested a review June 23, 2023 14:24
@lorentzenchr
Copy link
Member

@thomasjpfan Could you increase test coverage a little? You can ping me for review when ready.

@thomasjpfan
Copy link
Member Author

@adrinjalali @lorentzenchr This PR is ready to review. With the recent updates, I simplified the PR a bit and reduced the diff.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan Great work, as always! Do we have a CI with polars?

Comment on lines 2031 to 2034
if to_dataframe_library == "pandas":
import pandas as pd

return pd.api.interchange.from_dataframe(df_interchange)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just FYI, there were some pretty bad bugs in pandas <2.0.2 for this method, e.g.:

In [1]: df = pl.DataFrame({'a': [1,2,3]})

In [2]: pd.api.interchange.from_dataframe(df[1:].__dataframe__())
Out[2]:
                 a
0                2
1                3
2  125822987010162

plotly have set 2.0.2 as the minimum version to try interchanging to pandas, don't know if you'd want to do the same thing, just bringing this up FYI in case you weren't aware

Copy link
Member Author

@thomasjpfan thomasjpfan Jul 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, this code path is not being used in ColumnTransformer, so I think it is okay.

ColumnTransformer is still special casing pandas dataframes because of how the __dataframe__ interchange protocol can sometimes make a copy.

Copy link
Member Author

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, as always! Do we have a CI with polars?

Yes! Polars is already running on the CI. It was enabled in #26464

Comment on lines 2031 to 2034
if to_dataframe_library == "pandas":
import pandas as pd

return pd.api.interchange.from_dataframe(df_interchange)
Copy link
Member Author

@thomasjpfan thomasjpfan Jul 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, this code path is not being used in ColumnTransformer, so I think it is okay.

ColumnTransformer is still special casing pandas dataframes because of how the __dataframe__ interchange protocol can sometimes make a copy.

Comment on lines 661 to 662
"The output of the '{0}' transformer should be 2D (scipy "
"matrix, array, or pandas DataFrame).".format(name)
"matrix, array, or DataFrames).".format(name)
Copy link
Member Author

@thomasjpfan thomasjpfan Jul 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with changing the order. Looking at it again, it should say something like "array, sparse matrix, or dataframe".

@thomasjpfan thomasjpfan changed the title ENH Adds polars support to ColumnTransformer ENH Adds polars output support to ColumnTransformer Sep 8, 2023
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise it's looking good to me.

Comment on lines 108 to 111
- |Feature| Adds `polars <https://www.pola.rs>`__ input support to
:class:`compose.ColumnTransformer` through the `DataFrame Interchange Protocol
<https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html>`__.
The minimum support version for polars is `0.18.2`. :pr:`26683` by `Thomas Fan`_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and output?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you state the relation to #27315?

Comment on lines 363 to 365
"'X' should be a 2D NumPy array, 2D sparse matrix or pandas "
"dataframe when indexing the columns (i.e. 'axis=1'). "
"Got {} instead with {} dimension(s).".format(type(X), X.ndim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit confused, with this PR, we do support axis=1 indexing on polars / dataframes, don't we?

@glemaitre glemaitre requested review from adrinjalali, lorentzenchr and MarcoGorelli and removed request for glemaitre and MarcoGorelli October 31, 2023 14:47
Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM again. Some small things to address before merging.

Comment on lines 108 to 111
- |Feature| Adds `polars <https://www.pola.rs>`__ input support to
:class:`compose.ColumnTransformer` through the `DataFrame Interchange Protocol
<https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html>`__.
The minimum support version for polars is `0.18.2`. :pr:`26683` by `Thomas Fan`_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you state the relation to #27315?

axis == 1
and indices_dtype == "str"
and not (_is_pandas_df(X) or _is_polars_df(X))
):
raise ValueError(
"Specifying the columns using strings is only supported for "
"pandas DataFrames"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"pandas DataFrames"
"pandas and polars DataFrames"

What about dataframe interchange protocol supporting objects?

@@ -1746,6 +1747,34 @@ def test_is_pandas_df_pandas_not_installed(hide_available_pandas):
assert not _is_pandas_df(1)


@pytest.mark.parametrize(
"constructor_name, minversion",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is setting minversion as a global fixure worth a thought?

@glemaitre glemaitre self-requested a review November 24, 2023 10:47
thomasjpfan and others added 2 commits November 26, 2023 15:23
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@glemaitre glemaitre added this to the 1.4 milestone Nov 27, 2023
@glemaitre
Copy link
Member

I put the 1.4 milestone. It seems a reasonable target.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we are happy now here. Merging.

@adrinjalali adrinjalali enabled auto-merge (squash) December 4, 2023 12:40
@adrinjalali adrinjalali merged commit d319344 into scikit-learn:main Dec 4, 2023
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realised I had started a pending review on this PR but completely forgot about it. Here are my comments just in case in inspires someone for a follow-up. No big problem though.

@@ -1957,6 +1991,14 @@ def test_check_array_multiple_extensions(
assert_array_equal(X_regular_checked, X_extension_checked)


def test_num_samples_dataframe_protocol():
"""Use DataFrame protocol to get n_samples from polars dataframe."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Use DataFrame protocol to get n_samples from polars dataframe."""
"""Use the DataFrame interchange protocol to get n_samples from polars."""

assert _is_polars_df(df)


def test_is_polars_df_pandas_not_installed():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how "pandas_not_installed" is related to the body or the docstring of the test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also checked the code for _is_polars_df and it does not seem to be related to whether pandas is installed or not at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this test was to check that _is_polars_df is False "ducktyped polars dataframe". Specifically, objects that pass the initial check:

if hasattr(X, "columns") and hasattr(X, "schema"):

but not a polars dataframe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants