Skip to content

FIX: accuracy and zero_loss support for multilabel with Array API #29336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 2, 2024

Conversation

EdAbati
Copy link
Contributor

@EdAbati EdAbati commented Jun 22, 2024

Reference Issues/PRs

Related to #29269
Previosly implemented in #29321, but moved to a separate PR

What does this implement/fix? Explain your changes.

Currently the below Array API tests fail in main. This fixes the support for multilabel in accuracy and zero_loss.

FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[accuracy_score-check_array_api_multilabel_classification_metric-cupy-None-None] - ValueError: unrecognized csr_matrix constructor usage
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[accuracy_score-check_array_api_multilabel_classification_metric-cupy.array_api-None-None] - TypeError: bool is only allowed on arrays with 0 dimensions
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[accuracy_score-check_array_api_multilabel_classification_metric-torch-cuda-float64] - ValueError: unrecognized csr_matrix constructor usage
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[accuracy_score-check_array_api_multilabel_classification_metric-torch-cuda-float32] - ValueError: unrecognized csr_matrix constructor usage
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[zero_one_loss-check_array_api_multilabel_classification_metric-cupy-None-None] - ValueError: unrecognized csr_matrix constructor usage
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[zero_one_loss-check_array_api_multilabel_classification_metric-cupy.array_api-None-None] - TypeError: bool is only allowed on arrays with 0 dimensions
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[zero_one_loss-check_array_api_multilabel_classification_metric-torch-cuda-float64] - ValueError: unrecognized csr_matrix constructor usage
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[zero_one_loss-check_array_api_multilabel_classification_metric-torch-cuda-float32] - ValueError: unrecognized csr_matrix constructor usage
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[accuracy_score-check_array_api_multilabel_classification_metric-torch-mps-float32] - ValueError: unrecognized csr_matrix constructor usage
FAILED sklearn/metrics/tests/test_common.py::test_array_api_compliance[zero_one_loss-check_array_api_multilabel_classification_metric-torch-mps-float32] - ValueError: unrecognized csr_matrix constructor usage

Any other comments?

cc @Tialo @ogrisel

Comment on lines +135 to +139
if _is_numpy_namespace(xp):
# XXX: do we really want to sparse-encode multilabel indicators when
# they are passed as a dense arrays? This is not possible for array
# API inputs in general hence we only do it for NumPy inputs. But even
# for NumPy the usefulness is questionable.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

original comment: #29321 (comment)

Copy link

github-actions bot commented Jun 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: eabdf62. Link to the linter CI: here

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@EdAbati EdAbati changed the title ENH: Add multilabel support toaccuracy and zero_loss with Array API FIX: multilabel support toaccuracy and zero_loss with Array API Jun 22, 2024
@EdAbati EdAbati changed the title FIX: multilabel support toaccuracy and zero_loss with Array API FIX: accuracy and zero_loss support for multilabel with Array API Jun 22, 2024
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

differing_labels = count_nonzero(y_true - y_pred, axis=1)
else:
differing_labels = xp.sum(
xp.astype(xp.astype(y_true - y_pred, xp.bool), xp.int8),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be casted to xp.int8 directly?

Suggested change
xp.astype(xp.astype(y_true - y_pred, xp.bool), xp.int8),
xp.astype(y_true - y_pred, xp.int8),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that too, but then it turns into an actual sum of everything that is not 0 (after the subtraction).
I initially just used bool actually, but array_api_scrict was not happy.

I guess an alternative could be xp.where, but I feel if could me even more verbose. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe xp.sum(xp.abs(xp.astype(y_true - y_pred)) would be more explicit instead of relying on casting semantics to turn negative differences into positive counts?

Also I am not sure about the use of xp.int8. If you have more than 127 classes in your target and y_pred never makes a good label prediction for a give data point, then the xp.sum would overflow, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might require sample_weight in the count_nonzero method as well. In the PR I created for f1_score we ignored the multilabel case for the array api integration. f1 score array api PR
If we are now including this I think it might make sense to extract _count_nonzero into a utility function.

Copy link
Contributor Author

@EdAbati EdAbati Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decide to make a _count_zero function : 6123e64

It doesn't have sample_weight yet (it is not required here, but I can add it)

What do you think about this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. Thanks for the addition.

@EdAbati
Copy link
Contributor Author

EdAbati commented Jun 26, 2024

mps and CUDA tests are green on my side

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind integrating sample_weight support in this PR? I tried it out on my local system and it seems to work. I can provide the changes I did as suggestions in this PR so you can also test them out and improve them?

@EdAbati
Copy link
Contributor Author

EdAbati commented Jun 27, 2024

@OmarManzoor sounds good! I have some time later today to update this. Feel free to comment with the suggestion if you have something already :)

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EdAbati I have added the suggestions.

EdAbati and others added 9 commits June 27, 2024 18:37
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good now. I checked that the cuda tests pass.

@OmarManzoor
Copy link
Contributor

@ogrisel Could you kindly have another look at this PR with the latest changes?

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nits, but otherwise, LGTM.

I launched the CUDA tests here:

Comment on lines 974 to 977
if axis == -1:
axis = 1
elif axis == -2:
axis = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think xp.sum should always support negative axis values, no? If not, then we can simplify / generalize this logic to nd inputs as:

Suggested change
if axis == -1:
axis = 1
elif axis == -2:
axis = 0
if axis < 0:
axis += X.ndim

Also: if _count_nonzero is only valid on 2d inputs, I think it should be made explicit in the docstring and maybe add an assertions such as:

    assert X.ndim == 2

at the beginning to avoid accidental mis-use in the code base.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the negative logic was taken directly from the original method that has been defined for sparse arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the original function seems to be defined for 2d inputs,

def count_nonzero(X, axis=None, sample_weight=None):
"""A variant of X.getnnz() with extension to weighting on axis 0.
Useful in efficiently calculating multilabel metrics.
Parameters
----------
X : sparse matrix of shape (n_samples, n_labels)
Input data. It should be of CSR format.
axis : {0, 1}, default=None
The axis on which the data is aggregated.
sample_weight : array-like of shape (n_samples,), default=None
Weight for each row of X.
Returns
-------
nnz : int, float, ndarray of shape (n_samples,) or ndarray of shape (n_features,)
Number of non-zero values in the array along a given axis. Otherwise,
the total number of non-zero values in the array is returned.
"""
if axis == -1:
axis = 1
elif axis == -2:
axis = 0
elif X.format != "csr":
raise TypeError("Expected CSR sparse format, got {0}".format(X.format))

how about updating the docstring and adding the assertion as you suggested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I just took this bit from the count_nonzero implementation . I updated now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ogrisel
Copy link
Member

ogrisel commented Jun 28, 2024

I checked that the cuda tests pass.

Oops, I had not seen that before retriggering the CUDA CI...

@OmarManzoor
Copy link
Contributor

@ogrisel Does this look good to merge?

@betatim
Copy link
Member

betatim commented Jul 2, 2024

Tests pass in https://github.com/scikit-learn/scikit-learn/actions/runs/9756974834/job/26928380940. Merging

@betatim betatim merged commit 53ed13c into scikit-learn:main Jul 2, 2024
30 checks passed
@EdAbati
Copy link
Contributor Author

EdAbati commented Jul 2, 2024

Thank you everyone for the review (as always).

@EdAbati EdAbati deleted the fix-multilabel-accuracy-zero-loss branch July 2, 2024 11:01
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024
…scikit-learn#29336)

Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
11 tasks
jeremiedbb pushed a commit that referenced this pull request Jul 2, 2024
…#29336)

Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
snath-xoc pushed a commit to snath-xoc/scikit-learn that referenced this pull request Jul 5, 2024
…scikit-learn#29336)

Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants