ENH Adds polars output support to ColumnTransformer #26683

thomasjpfan · 2023-06-23T13:03:00Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR uses the DataFrame interchange protocol in ColumnTransformer to parse DataFrames and slices them for the inner transformers to consume.

Any other comments?

The pandas code path is kept because __dataframe__ can sometimes make a copy with the default allow_copy=True.
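
To make the description above concrete, here is a minimal sketch of how a consumer could use the interchange protocol to count rows and slice columns. `ToyFrame` and `ToyInterchange` are hypothetical stand-ins, not scikit-learn, pandas, or polars classes; only `__dataframe__`, `allow_copy`, `num_rows`, `column_names`, and `select_columns_by_name` are names from the protocol specification. It is an illustration under those assumptions, not the code in this PR.

```python
class ToyInterchange:
    """Hypothetical object implementing a small subset of the interchange API."""

    def __init__(self, data):
        self._data = data  # dict: column name -> list of values

    def num_rows(self):
        # All columns share one length; an empty frame has zero rows.
        return len(next(iter(self._data.values()), []))

    def column_names(self):
        return list(self._data)

    def select_columns_by_name(self, names):
        # Return a new interchange object restricted to the requested columns.
        return ToyInterchange({name: self._data[name] for name in names})


class ToyFrame:
    """Hypothetical dataframe that only knows how to export itself."""

    def __init__(self, data):
        self._data = data

    def __dataframe__(self, allow_copy=True):
        # A real producer may copy data here when allow_copy=True.
        return ToyInterchange(self._data)


def num_samples(X):
    """Row count of any object exposing __dataframe__ (sketch)."""
    return X.__dataframe__().num_rows()


def slice_columns(X, names):
    """Select columns by name through the protocol (sketch)."""
    return X.__dataframe__().select_columns_by_name(names)


frame = ToyFrame({"age": [31, 45, 27], "city": ["Oslo", "Lima", "Pune"]})
subset = slice_columns(frame, ["city"])
print(num_samples(frame), subset.column_names())
```

The consumer never touches the producer's own API, which is why a copy made inside `__dataframe__` is invisible to it and why a dedicated pandas path can still be worthwhile.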

github-actions · 2023-06-23T13:04:54Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bc7628b.

lorentzenchr · 2023-06-23T18:35:11Z

@thomasjpfan Could you increase test coverage a little? You can ping me for review when ready.

thomasjpfan · 2023-07-21T20:12:50Z

@adrinjalali @lorentzenchr This PR is ready to review. With the recent updates, I simplified the PR a bit and reduced the diff.

lorentzenchr

@thomasjpfan Great work, as always! Do we have a CI with polars?

doc/whats_new/v1.4.rst

sklearn/utils/__init__.py

sklearn/utils/tests/test_utils.py

sklearn/utils/validation.py

sklearn/compose/_column_transformer.py

sklearn/compose/tests/test_column_transformer.py

MarcoGorelli · 2023-07-26T13:06:45Z

sklearn/utils/validation.py

+    if to_dataframe_library == "pandas":
+        import pandas as pd
+
+        return pd.api.interchange.from_dataframe(df_interchange)


just FYI, there were some pretty bad bugs in pandas <2.0.2 for this method, e.g.:

In [1]: df = pl.DataFrame({'a': [1,2,3]})

In [2]: pd.api.interchange.from_dataframe(df[1:].__dataframe__())
Out[2]:
                 a
0                2
1                3
2  125822987010162

Plotly has set 2.0.2 as the minimum pandas version for interchanging to pandas. I don't know whether you'd want to do the same; I'm just raising it in case you weren't aware.
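
A version gate of the kind mentioned above could look roughly like this. The helper names `parse_release` and `can_interchange_to_pandas` are hypothetical, and the sketch ignores pre-release suffixes; the only fact taken from the comment is the 2.0.2 threshold.

```python
import re

MIN_PANDAS_FOR_INTERCHANGE = (2, 0, 2)  # threshold quoted in the comment above


def parse_release(version):
    """Return the leading numeric release of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def can_interchange_to_pandas(pandas_version):
    """True when the given pandas version is at least 2.0.2 (sketch)."""
    release = parse_release(pandas_version)
    # Pad short versions so that "2.1" compares as (2, 1, 0).
    release += (0,) * (len(MIN_PANDAS_FOR_INTERCHANGE) - len(release))
    return release >= MIN_PANDAS_FOR_INTERCHANGE
```

In practice the argument would be `pandas.__version__`; a project that already depends on a version-parsing library would likely use that instead of a regular expression.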

thomasjpfan

> Great work, as always! Do we have a CI with polars?

Yes! Polars is already running on the CI. It was enabled in #26464

sklearn/compose/tests/test_column_transformer.py

doc/whats_new/v1.4.rst

sklearn/compose/tests/test_column_transformer.py

thomasjpfan · 2023-07-30T01:05:28Z

sklearn/utils/validation.py

+    if to_dataframe_library == "pandas":
+        import pandas as pd
+
+        return pd.api.interchange.from_dataframe(df_interchange)


Functionally, this code path is not being used in ColumnTransformer, so I think it is okay.

ColumnTransformer is still special casing pandas dataframes because of how the __dataframe__ interchange protocol can sometimes make a copy.

thomasjpfan · 2023-07-30T01:06:29Z

sklearn/compose/_column_transformer.py

                    "The output of the '{0}' transformer should be 2D (scipy "
-                    "matrix, array, or pandas DataFrame).".format(name)
+                    "matrix, array, or DataFrames).".format(name)


I am okay with changing the order. Looking at it again, it should say something like "array, sparse matrix, or dataframe".

adrinjalali

Otherwise it's looking good to me.

adrinjalali · 2023-10-19T10:39:16Z

doc/whats_new/v1.4.rst

+- |Feature| Adds `polars <https://www.pola.rs>`__ input support to
+  :class:`compose.ColumnTransformer` through the `DataFrame Interchange Protocol
+  <https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html>`__.
+  The minimum support version for polars is `0.18.2`. :pr:`26683` by `Thomas Fan`_.


and output?

sklearn/compose/tests/test_column_transformer.py

adrinjalali · 2023-10-19T10:53:16Z

sklearn/utils/__init__.py

            "'X' should be a 2D NumPy array, 2D sparse matrix or pandas "
            "dataframe when indexing the columns (i.e. 'axis=1'). "
            "Got {} instead with {} dimension(s).".format(type(X), X.ndim)


I'm a bit confused: with this PR, we do support axis=1 indexing on polars / dataframes, don't we?

lorentzenchr

LGTM again. Some small things to address before merging.

lorentzenchr · 2023-11-23T17:46:03Z

doc/whats_new/v1.4.rst

+- |Feature| Adds `polars <https://www.pola.rs>`__ input support to
+  :class:`compose.ColumnTransformer` through the `DataFrame Interchange Protocol
+  <https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html>`__.
+  The minimum support version for polars is `0.18.2`. :pr:`26683` by `Thomas Fan`_.


Could you state the relation to #27315?

lorentzenchr · 2023-11-23T17:53:46Z

sklearn/utils/__init__.py

+        axis == 1
+        and indices_dtype == "str"
+        and not (_is_pandas_df(X) or _is_polars_df(X))
+    ):
        raise ValueError(
            "Specifying the columns using strings is only supported for "
            "pandas DataFrames"


Suggested change

-            "pandas DataFrames"
+            "pandas and polars DataFrames"

What about dataframe interchange protocol supporting objects?
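
One way to read the question above is that the guard could accept any object exposing `__dataframe__` rather than naming specific libraries. The sketch below explores that reading only; it is not what the diff does. `supports_string_column_keys`, `check_column_keys`, and `NamedColumns` are hypothetical names, and the error wording is an assumption.

```python
def supports_string_column_keys(X):
    """Assumption: any object exporting __dataframe__ has named columns."""
    return hasattr(X, "__dataframe__")


def check_column_keys(X, axis, indices_dtype):
    """Raise if string keys are used to select columns on an unsupported object."""
    if (
        axis == 1
        and indices_dtype == "str"
        and not supports_string_column_keys(X)
    ):
        raise ValueError(
            "Specifying the columns using strings is only supported for "
            "objects implementing the dataframe interchange protocol"
        )


class NamedColumns:
    """Hypothetical container that exports itself through the protocol."""

    def __dataframe__(self, allow_copy=True):
        return self


check_column_keys(NamedColumns(), axis=1, indices_dtype="str")  # accepted
```

Passing the check says nothing about whether the object can then be sliced efficiently; that part would still need library-specific or protocol-based indexing.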

lorentzenchr · 2023-11-23T17:56:13Z

sklearn/utils/tests/test_validation.py

@@ -1746,6 +1747,34 @@ def test_is_pandas_df_pandas_not_installed(hide_available_pandas):
    assert not _is_pandas_df(1)


+@pytest.mark.parametrize(
+    "constructor_name, minversion",


Is setting minversion as a global fixture worth a thought?

glemaitre · 2023-11-27T16:01:54Z

I set the 1.4 milestone. It seems a reasonable target.

adrinjalali

Seems like we are happy now here. Merging.

ogrisel

I realised I had started a pending review on this PR but completely forgot about it. Here are my comments, just in case it inspires someone for a follow-up. No big problem though.

ogrisel · 2023-11-27T15:35:17Z

sklearn/utils/tests/test_validation.py

@@ -1957,6 +1991,14 @@ def test_check_array_multiple_extensions(
    assert_array_equal(X_regular_checked, X_extension_checked)


+def test_num_samples_dataframe_protocol():
+    """Use DataFrame protocol to get n_samples from polars dataframe."""


Suggested change

-    """Use DataFrame protocol to get n_samples from polars dataframe."""
+    """Use the DataFrame interchange protocol to get n_samples from polars."""

ogrisel · 2023-11-27T15:37:47Z

sklearn/utils/tests/test_validation.py

+        assert _is_polars_df(df)
+
+
+def test_is_polars_df_pandas_not_installed():


I don't understand how "pandas_not_installed" is related to the body or the docstring of the test.

I also checked the code for _is_polars_df and it does not seem to be related to whether pandas is installed or not at all.

Yeah, this test was there to check that _is_polars_df returns False for a "duck-typed polars dataframe". Specifically, objects that pass the initial check:

scikit-learn/sklearn/utils/validation.py

Line 2031 in 87e6908

if hasattr(X, "columns") and hasattr(X, "schema"):

but are not polars dataframes.
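
The behaviour described in this reply can be illustrated with a small sketch. The attribute check is the line quoted above; the rest (the function name `is_polars_df_sketch`, the `sys.modules` lookup, and the `DuckTypedFrame` class) is an assumption made for illustration and may differ from the real `_is_polars_df`.

```python
import sys


def is_polars_df_sketch(X):
    """Return True only for genuine polars DataFrames (illustrative sketch)."""
    if hasattr(X, "columns") and hasattr(X, "schema"):
        # Looks like polars; confirm the type without importing polars here.
        polars = sys.modules.get("polars")
        if polars is None:
            # polars was never imported, so X cannot be a polars DataFrame.
            return False
        return isinstance(X, polars.DataFrame)
    return False


class DuckTypedFrame:
    """Hypothetical object that passes the attribute check but is not polars."""

    columns = ["a", "b"]
    schema = {"a": int, "b": int}


print(is_polars_df_sketch(DuckTypedFrame()))  # False
```

The same result holds whether or not polars is installed, which is the point the test under discussion is meant to cover.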

ENH Adds polars support to ColumnTransformer

2bee590

github-actions bot added module:compose module:utils labels Jun 23, 2023

thomasjpfan added 2 commits June 23, 2023 15:09

DOC Update PR number

2f3146b

DOC Fixes indent

1115d2f

glemaitre self-requested a review June 23, 2023 14:24

thomasjpfan added 2 commits June 23, 2023 17:10

CLN Use protocol

d606f15

ENH Improves error message

9d73e39

thomasjpfan added 14 commits July 19, 2023 15:36

Merge remote-tracking branch 'upstream/main' into dataframe_protocol_column_transformer

485ada1

DOC Adds docstrings for protocol

76a02c2

TST Fixes coverage

c3259c3

TST Increase coverage

db65930

DOC Update whats new number

d494a20

TST Increase coverage

96e3d92

CLN Revert commenting

02df48f

Merge remote-tracking branch 'upstream/main' into dataframe_protocol_column_transformer

0e334a9

CLN Simplify logic

314f411

CLN Simplify logic more about indexing

2b75d0f

CLN Remove need for DataFrame Interchange protocol

623a73e

CLN Less code again

fd2976e

FIX Remove protocol tests

e6844ac

Merge remote-tracking branch 'upstream/main' into dataframe_protocol_column_transformer

560225f

lorentzenchr approved these changes Jul 25, 2023

MarcoGorelli reviewed Jul 26, 2023

thomasjpfan added 2 commits July 29, 2023 20:54

CLN Address comments

9e4c02b

TST Adds test about fitting and transforming with different dataframes objects

e4e8249

thomasjpfan commented Jul 30, 2023

thomasjpfan added 3 commits September 6, 2023 23:23

CLN Simplify indexing logic

12c5d66

CLN Simplify to polars indexing

a3e2efa

FIX Fixes polars indexing

7b10847

thomasjpfan changed the title from "ENH Adds polars support to ColumnTransformer" to "ENH Adds polars output support to ColumnTransformer" Sep 8, 2023

TST Fixes codecov

693129e

thomasjpfan mentioned this pull request Sep 8, 2023

ENH Adds polars output support to set_output API #27315

Merged

adrinjalali reviewed Oct 19, 2023

glemaitre requested review from adrinjalali, lorentzenchr and MarcoGorelli and removed request for glemaitre and MarcoGorelli October 31, 2023 14:47

lorentzenchr approved these changes Nov 23, 2023

glemaitre self-requested a review November 24, 2023 10:47

thomasjpfan added 4 commits November 25, 2023 15:03

Merge remote-tracking branch 'upstream/main' into dataframe_protocol_column_transformer

12a06c7

DOC Update min version

1387223

CLN Address comments

9b263ef

STY Ruff linting

f18f374

lorentzenchr reviewed Nov 26, 2023

thomasjpfan and others added 2 commits November 26, 2023 15:23

Update doc/whats_new/v1.4.rst

3fa90bb

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

STY mypy linting

3671d21

glemaitre added this to the 1.4 milestone Nov 27, 2023

Adjust error wording

f848776

adrinjalali approved these changes Dec 4, 2023

Merge branch 'main' into dataframe_protocol_column_transformer

bc7628b

adrinjalali enabled auto-merge (squash) December 4, 2023 12:40

adrinjalali merged commit d319344 into scikit-learn:main Dec 4, 2023

ogrisel reviewed Dec 4, 2023

thomasjpfan mentioned this pull request Dec 4, 2023

CLN Update docs and test name for polars output in ColumnTransformer #27902

Merged