Skip to content

FIX use pyarrow types in pyarrow.filter() for older pyarrow versions #31605

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

StefanieSenger
Copy link
Contributor

Reference Issues/PRs

closes #31604

What does this implement/fix? Explain your changes.

This fixes a bug that appears for older pyarrow versions, when the filter method is used with other indexes than pyarrow types.

See the issue description for more details.

Copy link

github-actions bot commented Jun 20, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6202316. Link to the linter CI: here

@StefanieSenger StefanieSenger changed the title MNT fix bug with pyarrow filter function FIX convert input to pyarrow filter() to pyaor older pyarrow versionsrrow type f Jun 20, 2025
@StefanieSenger StefanieSenger changed the title FIX convert input to pyarrow filter() to pyaor older pyarrow versionsrrow type f FIX convert indices passed into to pyarrow filter() to pyarrow type for older pyarrow versions Jun 20, 2025
@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Jun 22, 2025

I added pyarrow to pymin_conda_forge_openblas_min_dependencies, which I had hoped would fix the codecov errors (it didn't), but it is the right thing to do anyways.

The Windows pymin_conda_forge_mkl timeout error comes seems unrelated and also appeared in other PRs.

@StefanieSenger StefanieSenger changed the title FIX convert indices passed into to pyarrow filter() to pyarrow type for older pyarrow versions FIX use pyarrow types in pyarrow.filter() for older pyarrow versions Jun 22, 2025
@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Jun 22, 2025

So it seems that not all the CIs get into the coverage report for codecov. I have found in the azure-pipelines.yml that there can be a Coverage: 'false' setting, that some CIs have, but ours, the pymin_conda_forge_openblas_min_dependencies doesn't seem to possess this setting. I would think that means that it is not reporting coverage? Do we want this to be like this?

Edit: I found the coverage is set to true in posix.yml, which applies to our job as well. So that means that it should make a coverage report. That is strange, because locally, with pyarrow 14.0.0, 15.0.0 and 16.0.0 I do reach into the new branches. I didn't however manage to install pyarrow==12.0.0 or pyarrow==13.0.0 into the same environment as pymin_conda_forge_openblas_min_dependencies because there were conflicting version definitions between python_abi and tornado. I now think that might happen to the CI invironment as well (but then I wonder why it keeps running ...)

@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Jun 22, 2025

To sum up: it seems that the codecov errors come from pymin_conda_forge_openblas_min_dependencies not installing pyarrow==12.0.0, because of a version mismatch.

Proposing solution: change min version of pyarrow to 14.0.0.

Edit: I tried that (not pushed) but I am not sure anymore, since the changes I did in pyproject.toml, sklearn/min_dependencies.py and sklearn/tests/test_base.py don't affect the build at all. There is no change in the newly generated lock file and now I see that pyarrow is not mentioned in the current one. That means that adding "pyarrow": "min" did not have the effect of adding pyarrow's min version to the job?

Edit: 🤦 Now I got it! I should have added pyarrow to the conda_dependencies part too, not just to the version restrictions.

@StefanieSenger
Copy link
Contributor Author

All passes except for the Windows CI failing which this is unrelated and coming from main.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix.

import pyarrow

pyarrow_version = parse_version(pyarrow.__version__)
if pyarrow_version < parse_version("17.0.0"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what happens if pyarrow < 12 is installed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried this.

Comment on lines +140 to +143
import pyarrow

if not isinstance(key, pyarrow.BooleanArray):
key = pyarrow.array(key, type=pyarrow.bool_())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import pyarrow
if not isinstance(key, pyarrow.BooleanArray):
key = pyarrow.array(key, type=pyarrow.bool_())
try:
pa = sys.modules["pyarrow"]
except KeyError: # pragma: no cover
raise ImportError("pyarrow must be installed to use pyarrow data objects")
if not isinstance(key, pa.BooleanArray):
key = pa.array(key, type=pyarrow.bool_())

Question for the 2. reviewer: Does it actually make a big enough difference (compared to import pyarrow)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a second reviewer, but I can reason you through what I did and why:

When we are in line import pyarrow (currently 140) we already know that the user has pyarrow installed.

This is because this line can only be executed conditioned on that PYARROW_VERSION_BELOW_17 returns True. The condition check one line above makes fixes.py to be run and in it, we already have a try-except block evaluating import pyarrow (and the version, which doesn't matter here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case you can just do pa = sys.modules["pyarrow"] without the try except.

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Copy link
Contributor Author

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing, @lorentzenchr!

I've commited your changes with the clearer TODO commands.

import pyarrow

pyarrow_version = parse_version(pyarrow.__version__)
if pyarrow_version < parse_version("17.0.0"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried this.

Comment on lines +140 to +143
import pyarrow

if not isinstance(key, pyarrow.BooleanArray):
key = pyarrow.array(key, type=pyarrow.bool_())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a second reviewer, but I can reason you through what I did and why:

When we are in line import pyarrow (currently 140) we already know that the user has pyarrow installed.

This is because this line can only be executed conditioned on that PYARROW_VERSION_BELOW_17 returns True. The condition check one line above makes fixes.py to be run and in it, we already have a try-except block evaluating import pyarrow (and the version, which doesn't matter here).

@StefanieSenger StefanieSenger added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jun 24, 2025
@adrinjalali adrinjalali enabled auto-merge (squash) June 24, 2025 11:19
@adrinjalali adrinjalali merged commit b51965a into scikit-learn:main Jun 26, 2025
36 checks passed
@StefanieSenger StefanieSenger deleted the fix_safe_indexing_pyarrow branch June 26, 2025 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:utils Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_safe_indexing fails with pyarrow==16.0.0
3 participants