Skip to content

ENH Array API for check_consistent_length #29519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Dec 5, 2024

Conversation

StefanieSenger
Copy link
Contributor

Reference Issues/PRs

towards #26024

What does this implement/fix? Explain your changes.

This PR adds Array API for check_consistent_length.

Some little doc enhancements/typo corrections along the way.

I didn't run the MPS test.

Copy link

github-actions bot commented Jul 18, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2142635. Link to the linter CI: here


with pytest.raises(TypeError):
check_consistent_length([1, 2], np.array(1))
check_consistent_length(xp.asarray([1, 2], device=device), np.array(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wouldn't raise when xp==np, would it?

Copy link
Contributor Author

@StefanieSenger StefanieSenger Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just tried it. I raises with TypeError: Singleton array array(1) cannot be considered a valid collection. from within _num_samples() in sklearn/utils/validation.py:376.

The last four pytest.raises checks are not really testing check_consistent_length(), but some other error mechanism outside of it. It feels not very clean. What would be a good way to handle this? Separate those tests?

Copy link
Contributor Author

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your questions @adrinjalali! I have answered them and maybe we can refactor the tests to make these things more obvious.


with pytest.raises(TypeError):
check_consistent_length([1, 2], np.array(1))
check_consistent_length(xp.asarray([1, 2], device=device), np.array(1))
Copy link
Contributor Author

@StefanieSenger StefanieSenger Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just tried it. I raises with TypeError: Singleton array array(1) cannot be considered a valid collection. from within _num_samples() in sklearn/utils/validation.py:376.

The last four pytest.raises checks are not really testing check_consistent_length(), but some other error mechanism outside of it. It feels not very clean. What would be a good way to handle this? Separate those tests?

@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Jul 19, 2024

Also, I don't know why the CI tests "scikit-learn.scikit-learn Expected — Waiting for status to be reported" don't run in my PR?
Edit: now they ran and failed. I will look into it on Monday.

@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Jul 22, 2024

I've fixed the PR, @adrinjalali, though now there is another issue:

FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[accuracy_score-check_array_api_multilabel_classification_metric-numpy-None-None] - TypeError: csr_matrix is not a supported array type
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[zero_one_loss-check_array_api_multilabel_classification_metric-numpy-None-None] - TypeError: csr_matrix is not a supported array type

Which comes from within _check_targets that would convert the input arrays into sparse, because they are multilabel. (Also see @EdAbati 's comment in lines 137-140.)

Then, the sparse multilabel y_true and y_pred are input into check_consistent_length within accuracy_score, and the tests fail.

Do I assume right that the PRs that will fix #29452 will also fix this here?

@EdAbati
Copy link
Contributor

EdAbati commented Jul 22, 2024

Hi both :) if you are interested in the discussion around that comment, please see #29336 and #29321 (comment)

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

My main suggestion would be to rather keep test_check_consistent_length unchanged to make sure that adding array API support does not introduce any behavioral change: in particular, we still want to test that check_consistent_length can work on Python lists, including nested lists with in-homogeneous lengths such as [[1, 2], [[1, 2]]] for which no equivalent NumPy array exist.

Instead of changing the existing test_check_consistent_length function, I would suggest to add a new test named test_check_consistent_length with the yield_namespace_device_dtype_combinations parametrization to check that calling check_consistent_length on valid array API inputs under the sklearn.config_context(array_api_dispatch=True) context.

Do I assume right that the PRs that will fix #29452 will also fix this here?

Indeed, the TypeError: csr_matrix is not a supported array type failures shall be addressed by #29476 which I am concurrently reviewing.

@StefanieSenger
Copy link
Contributor Author

Thank you, @ogrisel! I have refactored the tests as you had suggested (though ruff didn't allow me to have them both have the same function name). Please let me know if there is anything else I should change.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another pass of feedback:

@StefanieSenger
Copy link
Contributor Author

Thanks for your review and the suggestions, @ogrisel.

I have committed what you had suggested and the tests (including the cuda test on Colab) all pass.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @StefanieSenger! LGTM with small details.

@@ -1184,7 +1184,7 @@ def fowlkes_mallows_score(labels_true, labels_pred, *, sparse=False):

.. versionadded:: 0.18

The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
The Fowlkes-Mallows index (FMI) is defined as the geometric mean of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the change, but this seems unrelated to the scope of the PR. Maybe next time open small quick dedicated PRs for such quick documentation fixes.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@lesteve lesteve merged commit a67e833 into scikit-learn:main Dec 5, 2024
30 checks passed
@lesteve
Copy link
Member

lesteve commented Dec 5, 2024

Merging, thanks!

@StefanieSenger StefanieSenger deleted the array_api_one branch December 5, 2024 09:30
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants