-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Adds polars output support to ColumnTransformer #26683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Adds polars output support to ColumnTransformer #26683
Conversation
@thomasjpfan Could you increase test coverage a little? You can ping me for review when ready. |
…column_transformer
…column_transformer
…column_transformer
@adrinjalali @lorentzenchr This PR is ready to review. With the recent updates, I simplified the PR a bit and reduced the diff. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan Great work, as always! Do we have a CI with polars?
sklearn/utils/validation.py
Outdated
if to_dataframe_library == "pandas": | ||
import pandas as pd | ||
|
||
return pd.api.interchange.from_dataframe(df_interchange) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just FYI, there were some pretty bad bugs in pandas <2.0.2 for this method, e.g.:
In [1]: df = pl.DataFrame({'a': [1,2,3]})
In [2]: pd.api.interchange.from_dataframe(df[1:].__dataframe__())
Out[2]:
a
0 2
1 3
2 125822987010162
plotly have set 2.0.2
as the minimum version to try interchanging to pandas, don't know if you'd want to do the same thing, just bringing this up FYI in case you weren't aware
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally, this code path is not being used in ColumnTransformer
, so I think it is okay.
ColumnTransformer
is still special casing pandas dataframes because of how the __dataframe__
interchange protocol can sometimes make a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, as always! Do we have a CI with polars?
Yes! Polars is already running on the CI. It was enabled in #26464
sklearn/utils/validation.py
Outdated
if to_dataframe_library == "pandas": | ||
import pandas as pd | ||
|
||
return pd.api.interchange.from_dataframe(df_interchange) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally, this code path is not being used in ColumnTransformer
, so I think it is okay.
ColumnTransformer
is still special casing pandas dataframes because of how the __dataframe__
interchange protocol can sometimes make a copy.
"The output of the '{0}' transformer should be 2D (scipy " | ||
"matrix, array, or pandas DataFrame).".format(name) | ||
"matrix, array, or DataFrames).".format(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am okay with changing the order. Looking at it again, it should say something like "array, sparse matrix, or dataframe".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise it's looking good to me.
doc/whats_new/v1.4.rst
Outdated
- |Feature| Adds `polars <https://www.pola.rs>`__ input support to | ||
:class:`compose.ColumnTransformer` through the `DataFrame Interchange Protocol | ||
<https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html>`__. | ||
The minimum support version for polars is `0.18.2`. :pr:`26683` by `Thomas Fan`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you state the relation to #27315?
sklearn/utils/__init__.py
Outdated
"'X' should be a 2D NumPy array, 2D sparse matrix or pandas " | ||
"dataframe when indexing the columns (i.e. 'axis=1'). " | ||
"Got {} instead with {} dimension(s).".format(type(X), X.ndim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit confused, with this PR, we do support axis=1 indexing on polars / dataframes, don't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM again. Some small things to address before merging.
doc/whats_new/v1.4.rst
Outdated
- |Feature| Adds `polars <https://www.pola.rs>`__ input support to | ||
:class:`compose.ColumnTransformer` through the `DataFrame Interchange Protocol | ||
<https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html>`__. | ||
The minimum support version for polars is `0.18.2`. :pr:`26683` by `Thomas Fan`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you state the relation to #27315?
sklearn/utils/__init__.py
Outdated
axis == 1 | ||
and indices_dtype == "str" | ||
and not (_is_pandas_df(X) or _is_polars_df(X)) | ||
): | ||
raise ValueError( | ||
"Specifying the columns using strings is only supported for " | ||
"pandas DataFrames" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"pandas DataFrames" | |
"pandas and polars DataFrames" |
What about dataframe interchange protocol supporting objects?
@@ -1746,6 +1747,34 @@ def test_is_pandas_df_pandas_not_installed(hide_available_pandas): | |||
assert not _is_pandas_df(1) | |||
|
|||
|
|||
@pytest.mark.parametrize( | |||
"constructor_name, minversion", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is setting minversion
as a global fixure worth a thought?
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
I put the 1.4 milestone. It seems a reasonable target. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we are happy now here. Merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realised I had started a pending review on this PR but completely forgot about it. Here are my comments just in case in inspires someone for a follow-up. No big problem though.
@@ -1957,6 +1991,14 @@ def test_check_array_multiple_extensions( | |||
assert_array_equal(X_regular_checked, X_extension_checked) | |||
|
|||
|
|||
def test_num_samples_dataframe_protocol(): | |||
"""Use DataFrame protocol to get n_samples from polars dataframe.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Use DataFrame protocol to get n_samples from polars dataframe.""" | |
"""Use the DataFrame interchange protocol to get n_samples from polars.""" |
assert _is_polars_df(df) | ||
|
||
|
||
def test_is_polars_df_pandas_not_installed(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how "pandas_not_installed"
is related to the body or the docstring of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also checked the code for _is_polars_df
and it does not seem to be related to whether pandas is installed or not at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, this test was to check that _is_polars_df
is False "ducktyped polars dataframe". Specifically, objects that pass the initial check:
scikit-learn/sklearn/utils/validation.py
Line 2031 in 87e6908
if hasattr(X, "columns") and hasattr(X, "schema"): |
but not a polars dataframe.
Reference Issues/PRs
Towards #25896
What does this implement/fix? Explain your changes.
This PR uses the DataFrame interchange protocol in ColumnTransformer to parse DataFrames and slices them for the inner transformers to consume.
Any other comments?
The pandas code path is kept because
__dataframe__
can sometimes make a copy with the defaultallow_copy=True
.