-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX _safe_indexing for pyarrow #31040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
❌ Linting issuesThis PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling You can see the details of the linting issues under the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for supporting arrow, and I fine with doing it this way as a pragmatic and incremental improvement pending a potential refactoring to make it more general in the future (to be discussed in a follow-up issue or PR).
However I am worried that this code is untested on our CI. Could you add PyArrow as a soft dependencies on one of the worker CIs (like we do for pandas
/ polars
/ pytorch
/ array-api-strict
...)?
I don't follow. |
Should the changelog be in |
Actually, everything is fine: the coverage data is collected on the CI config that runs with Some lines such as the The others could maybe benefit from expanding the existing tests a bit but not big deal wither. Still +1 for merge on my side. |
@ogrisel Do we need a 2nd reviewer? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, given that there seems to be a consensus to use narwhals for our dataframe support tools, do you think that this PR is still relevant ?
We can easily fit it in 1.7 while it's unlikely for narwhals. But do we want to introduce this is we're going to rewrite it soon (ref #31049) ?
Yes, it’s a simple fix for a bug. If we later decide to go with narwhals, there is no cost of having this PR. |
@ogrisel @jeremiedbb I still would like to have this PR in 1.7. But I changed my mind about the implementation. I replaced the "dataframe interchange protocol" indexing by proper pyarrow indexing. This is much cleaner. I'm now convinced that the dataframe interchange protocol is not usable for anything else but interchange which we don't really do or need, we need an API for indexing and assigning. Therefore, I defer this topic to #31049. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Reference Issues/PRs
Partially addresses #25896 (comment).
What does this implement/fix? Explain your changes.
_safe_indexing(.., axis=1)
is used in theColumnTransformer
and raises an error if apyarrow.Table
is passed even though it implements the dataframe interchange protocol:results in
which shows that the wrong branch (
elif hasattr(X, "shape"):
) is taken.Any other comments?
There is no general solution with
__dataframe__
because of data-apis/dataframe-api#85.Therefore a dirtier solution is taken.Therefore, just pyarrow indexing is implemented.
narwhals would be much cleaner, but needs its own dedicated issue for discussion. This PR just fixes a bug with pyarrow.Table passed around.