FIX: `accuracy` and `zero_loss` support for multilabel with Array API (#29336)
Conversation
```python
if _is_numpy_namespace(xp):
    # XXX: do we really want to sparse-encode multilabel indicators when
    # they are passed as dense arrays? This is not possible for array
    # API inputs in general, hence we only do it for NumPy inputs. But even
    # for NumPy the usefulness is questionable.
```
original comment: #29321 (comment)
LGTM!
Thank you for the PR!
sklearn/metrics/_classification.py
Outdated
```python
    differing_labels = count_nonzero(y_true - y_pred, axis=1)
else:
    differing_labels = xp.sum(
        xp.astype(xp.astype(y_true - y_pred, xp.bool), xp.int8),
```
Can this be cast to `xp.int8` directly?
Suggested change:
```diff
-        xp.astype(xp.astype(y_true - y_pred, xp.bool), xp.int8),
+        xp.astype(y_true - y_pred, xp.int8),
```
I tried that too, but then it turns into an actual sum of everything that is not 0 (after the subtraction). I initially just used `bool`, actually, but `array_api_strict` was not happy. I guess an alternative could be `xp.where`, but I feel it could be even more verbose. What do you think?
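A small sketch of the difference being discussed, with plain NumPy standing in for the `xp` namespace (illustration only, not the PR's code): a direct cast to `int8` keeps the signed differences, which can cancel in the sum, while the bool-then-int cast counts the non-zero entries.

```python
import numpy as np

# Multilabel indicator matrices; NumPy stands in for the xp namespace.
y_true = np.array([[0, 1, 1], [1, 0, 1]])
y_pred = np.array([[0, 0, 1], [1, 1, 1]])

diff = y_true - y_pred  # entries in {-1, 0, 1}

# Direct cast keeps signed differences, so +1 and -1 can cancel:
direct = np.sum(diff.astype(np.int8), axis=1)
print(direct)   # [ 1 -1]  -- row 1 wrongly reports -1 differing labels

# bool-then-int cast counts the non-zero entries, which is what we want:
counted = np.sum(diff.astype(bool).astype(np.int8), axis=1)
print(counted)  # [1 1]
```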
Maybe `xp.sum(xp.abs(xp.astype(y_true - y_pred, xp.int8)))` would be more explicit, instead of relying on casting semantics to turn negative differences into positive counts?

Also, I am not sure about the use of `xp.int8`. If you have more than 127 classes in your target and `y_pred` never makes a good label prediction for a given data point, then the `xp.sum` would overflow, no?
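The overflow concern can be reproduced in NumPy by forcing an `int8` accumulator. Note this is only an illustration: NumPy's `sum` happens to upcast small integer dtypes to a wider accumulator by default, but other Array API namespaces are not required to do the same.

```python
import numpy as np

n_labels = 200  # more than int8's maximum of 127
diff = np.ones((1, n_labels), dtype=np.int8)  # every label differs

# NumPy's default accumulator is wider than int8, so this is correct:
print(np.sum(diff, axis=1))                 # [200]

# Forcing an int8 accumulator shows the wrap-around being worried about:
print(np.sum(diff, axis=1, dtype=np.int8))  # [-56], i.e. 200 - 256
```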
I think we might require `sample_weight` in the `count_nonzero` method as well. In the PR I created for `f1_score`, we ignored the multilabel case for the Array API integration (see the f1 score Array API PR). If we are now including this, I think it might make sense to extract `_count_nonzero` into a utility function.
I decided to make a `_count_nonzero` function: 6123e64

It doesn't have `sample_weight` yet (it is not required here, but I can add it). What do you think about this approach?
I like it. Thanks for the addition.
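For illustration, here is a hedged sketch of what a dense `count_nonzero` helper with `sample_weight` support could look like. The name, signature, and weighting behavior are assumptions for this sketch, not the code actually added in the PR.

```python
import numpy as np

def count_nonzero_dense(X, axis=None, sample_weight=None):
    """Hypothetical sketch: count non-zero entries of a 2d dense array
    along `axis`, optionally weighting each row by `sample_weight`.
    Not the helper added in this PR."""
    assert X.ndim == 2
    if axis is not None and axis < 0:
        axis += X.ndim  # normalize negative axes for 2d input
    nonzero = X != 0
    if sample_weight is None:
        return nonzero.sum(axis=axis)
    weights = np.asarray(sample_weight, dtype=np.float64)
    if axis == 1:
        # per-row count scaled by that row's weight
        return nonzero.sum(axis=1) * weights
    # axis=0 or axis=None: weight each row before aggregating
    weighted = nonzero * weights[:, np.newaxis]
    return weighted.sum(axis=axis)

X = np.array([[1, 0, 2], [0, 0, 3]])
print(count_nonzero_dense(X, axis=0, sample_weight=[2.0, 0.5]))
# per-column weighted counts: 2.0, 0.0, 2.5
```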
Would you mind integrating `sample_weight` support in this PR? I tried it out on my local system and it seems to work. I can provide the changes I did as suggestions in this PR so you can also test them out and improve them?

@OmarManzoor sounds good! I have some time later today to update this. Feel free to comment with the suggestion if you have something already :)
@EdAbati I have added the suggestions.
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
I think this looks good now. I checked that the CUDA tests pass.

@ogrisel Could you kindly have another look at this PR with the latest changes?
Small nits, but otherwise, LGTM.
I launched the CUDA tests here:
sklearn/utils/_array_api.py
Outdated
```python
if axis == -1:
    axis = 1
elif axis == -2:
    axis = 0
```
I think `xp.sum` should always support negative `axis` values, no? If not, then we can simplify / generalize this logic to nd inputs as:
Suggested change:
```diff
-    if axis == -1:
-        axis = 1
-    elif axis == -2:
-        axis = 0
+    if axis < 0:
+        axis += X.ndim
```
Also: if `_count_nonzero` is only valid on 2d inputs, I think it should be made explicit in the docstring, and maybe add an assertion such as `assert X.ndim == 2` at the beginning to avoid accidental misuse in the code base.
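A quick check that the generic normalization reproduces the hardcoded 2d mapping and extends to nd inputs (`normalize_axis` is a throwaway name for this sketch):

```python
def normalize_axis(axis, ndim):
    # Generic negative-axis normalization, valid for any ndim:
    if axis < 0:
        axis += ndim
    if not 0 <= axis < ndim:
        raise ValueError(f"axis out of bounds for a {ndim}d input")
    return axis

# Matches the hardcoded 2d mapping (-1 -> 1, -2 -> 0) ...
assert normalize_axis(-1, 2) == 1
assert normalize_axis(-2, 2) == 0
# ... and generalizes to nd:
assert normalize_axis(-1, 3) == 2
```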
I think the negative-axis logic was taken directly from the original method that has been defined for sparse arrays.
Since the original function seems to be defined for 2d inputs,
scikit-learn/sklearn/utils/sparsefuncs.py
Lines 599 to 626 in 2107404
```python
def count_nonzero(X, axis=None, sample_weight=None):
    """A variant of X.getnnz() with extension to weighting on axis 0.

    Useful in efficiently calculating multilabel metrics.

    Parameters
    ----------
    X : sparse matrix of shape (n_samples, n_labels)
        Input data. It should be of CSR format.

    axis : {0, 1}, default=None
        The axis on which the data is aggregated.

    sample_weight : array-like of shape (n_samples,), default=None
        Weight for each row of X.

    Returns
    -------
    nnz : int, float, ndarray of shape (n_samples,) or ndarray of shape (n_features,)
        Number of non-zero values in the array along a given axis. Otherwise,
        the total number of non-zero values in the array is returned.
    """
    if axis == -1:
        axis = 1
    elif axis == -2:
        axis = 0
    elif X.format != "csr":
        raise TypeError("Expected CSR sparse format, got {0}".format(X.format))
```
how about updating the docstring and adding the assertion as you suggested?
Yes, I just took this bit from the `count_nonzero` implementation. I have updated it now.
Thanks!
Oops, I had not seen that before retriggering the CUDA CI...

@ogrisel Does this look good to merge?

Tests pass in https://github.com/scikit-learn/scikit-learn/actions/runs/9756974834/job/26928380940. Merging

Thank you everyone for the review (as always).
…scikit-learn#29336) Co-authored-by: Omar Salman <omar.salman2007@gmail.com> Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Reference Issues/PRs
Related to #29269
Previously implemented in #29321, but moved to a separate PR
What does this implement/fix? Explain your changes.
Currently, the below Array API tests fail on `main`. This fixes the support for multilabel in `accuracy` and `zero_loss`.

Any other comments?
cc @Tialo @ogrisel