ENH Array API for contingency_matrix #29251


Open
wants to merge 27 commits into base: main

Conversation

Tialo
Contributor

@Tialo Tialo commented Jun 13, 2024

Reference Issues/PRs

Towards #26024

What does this implement/fix? Explain your changes.

Added a common test similar to the ones in metrics/tests/test_common.py. Since supervised clustering metrics operate on labels, IMO it is wrong to use dtype_name from yield_namespace_device_dtype_combinations, because those are float types.

Any other comments?

Should the common test check whether the metric array is on the same device?
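A minimal sketch of the common-test pattern being described (hypothetical helper name and signature; the actual helper is quoted from the diff later in this thread) could look like:

```python
import numpy as np

# Hypothetical sketch of a common check for supervised clustering
# metrics: labels are small non-negative integers, so the default
# integer dtype is used instead of a floating point dtype_name.
def check_supervised_metric(metric, asarray):
    labels_true = np.array([0, 0, 1, 1, 2, 2])
    labels_pred = np.array([1, 0, 2, 1, 0, 2])
    reference = metric(labels_true, labels_pred)
    # Re-run with the arrays converted into the namespace under test.
    result = metric(asarray(labels_true), asarray(labels_pred))
    assert float(result) == float(reference)
    return float(reference)
```

With NumPy itself standing in for the array namespace, a toy agreement metric such as `lambda a, b: np.mean(a == b)` passes this check.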

@Tialo Tialo changed the title ENH Array API for contingency matrix ENH Array API for contingency_matrix Jun 13, 2024

github-actions bot commented Jun 13, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7d7d58f. Link to the linter CI: here

@EmilyXinyi
Contributor

Since supervised clustering metrics operate on labels, IMO it is wrong to use dtype_name from yield_namespace_device_dtype_combinations, because those are float types.

Out of curiosity, and because I am not familiar with this, do you mind elaborating on why you think this? Thanks!

@Tialo
Contributor Author

Tialo commented Jun 18, 2024

Since supervised clustering metrics operate on labels, IMO it is wrong to use dtype_name from yield_namespace_device_dtype_combinations, because those are float types.

Out of curiosity, and because I am not familiar with this, do you mind elaborating on why you think this? Thanks!

dtype_name is either float32 or float64. It would be more reasonable to use integer types for labels, as they are usually numbers from 0 to n.

But IIRC some metrics, e.g. entropy, can use float labels like 4.0, 2.0; I am not sure about 4.2, 1.7, etc.
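As a quick plain-NumPy illustration of this point (not part of the PR): label-based metrics only care about category identity, so integral-valued float labels behave exactly like integers, while non-integral floats simply form distinct categories.

```python
import numpy as np

# Float labels 4.0/2.0 map to the same category codes as 4/2 ...
_, int_inverse = np.unique(np.array([4, 2, 4, 2]), return_inverse=True)
_, float_inverse = np.unique(np.array([4.0, 2.0, 4.0, 2.0]), return_inverse=True)
assert np.array_equal(int_inverse, float_inverse)

# ... while 4.2/1.7 are just treated as two other distinct categories.
_, frac_inverse = np.unique(np.array([4.2, 1.7, 4.2, 1.7]), return_inverse=True)
assert np.array_equal(frac_inverse, float_inverse)
```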

@ogrisel
Member

ogrisel commented Jul 8, 2024

I agree with the comment above. To avoid the confusion, the value returned by yield_namespace_device_dtype_combinations could be renamed to floating_dtype, and the docstring updated accordingly to explain that GPU devices are typically optimized for float32 (or even lower precision), but some also offer float64 support.
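The suggested rename might be sketched like this (illustrative only; the real fixture in sklearn.utils._array_api yields more namespace/device combinations than shown here):

```python
# Illustrative sketch: the yielded dtype entry is called
# floating_dtype to make explicit that only floating point dtypes
# are generated by this fixture.
def yield_namespace_device_dtype_combinations():
    for array_namespace, device in [("numpy", None), ("torch", "cuda")]:
        # GPU devices are typically optimized for float32 (or even
        # lower precision), but some also offer float64 support.
        for floating_dtype in ("float32", "float64"):
            yield array_namespace, device, floating_dtype
```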

Member

@ogrisel ogrisel left a comment


Thanks for the PR. Here is some feedback:


def check_array_api_metric_supervised(metric, array_namespace, device, dtype_name):
labels_true = np.array([0, 0, 1, 1, 2, 2], dtype=dtype_name)
labels_pred = np.array([1, 0, 2, 1, 0, 2], dtype=dtype_name)
Member

@ogrisel ogrisel Jul 8, 2024


Please (also or only) test with the default integer dtype (without casting to dtype_name). The dtype_name values generated by our testing infrastructure are floating point dtypes, which do not really make sense for cluster labels.

Related: #29251 (comment)
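In plain NumPy terms, the distinction this comment draws is (illustrative only):

```python
import numpy as np

# Cluster labels created without a cast keep the platform's default
# integer dtype; casting to a generated dtype_name makes them floats.
labels_default = np.array([0, 0, 1, 1, 2, 2])
labels_cast = labels_default.astype("float32")  # what dtype=dtype_name does
assert np.issubdtype(labels_default.dtype, np.integer)
assert np.issubdtype(labels_cast.dtype, np.floating)
```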

Contributor Author


Do you mean to add a test where dtype_name is, for example, "float32"?

Member


The phrasing of my comment was grammatically incorrect and confusing. I rewrote it.

Contributor Author


I see the confusion here. Actually, as defined in test_array_api_compliance, dtype_name is either int32 or int64, so float types are not checked anyway.

)

if sparse and is_array_api_compliant:
raise ValueError("Cannot use sparse=True while using array api dispatch")
Member


Coming back to this, maybe raising this exception is unnecessarily annoying. We could instead just return

Accepting array API inputs would not necessarily mean outputting a data structure of the same type when the output is explicitly requested to use another container type.

Contributor Author


We could instead just return

Did you mean to write something after return?

Member


Yes, I meant to return a sparse SciPy data structure (backed by NumPy arrays).
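A sketch of that alternative (hypothetical helper name; `to_numpy` stands in for sklearn's internal `_convert_to_numpy`): instead of raising, convert the index arrays to NumPy and always hand back a SciPy sparse matrix.

```python
import numpy as np
import scipy.sparse as sp

def build_sparse_contingency(class_idx, cluster_idx, n_classes, n_clusters,
                             to_numpy=np.asarray):
    # Convert out of the array API namespace first; SciPy sparse
    # containers are always backed by NumPy arrays.
    class_idx = to_numpy(class_idx)
    cluster_idx = to_numpy(cluster_idx)
    return sp.coo_matrix(
        (np.ones(class_idx.shape[0]), (class_idx, cluster_idx)),
        shape=(n_classes, n_clusters),
        dtype=np.int64,
    )
```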

Member

@lucyleeow lucyleeow Jul 4, 2025


Does #31286 mean that it should return the same type as the input..? Edit: I missed the sparse parameter.


classes, class_idx = xp.unique_inverse(labels_true)
clusters, cluster_idx = xp.unique_inverse(labels_pred)
class_idx = _convert_to_numpy(class_idx, xp)
Contributor Author

@Tialo Tialo Oct 17, 2024


Otherwise CuPy would raise here:

contingency = sp.coo_matrix(
    (np.ones(class_idx.shape[0]), (class_idx, cluster_idx)),
    shape=(n_classes, n_clusters),
    dtype=dtype,
)

Member


@ogrisel do you think it is worthwhile to do some benchmarking, to compare conversion to NumPy vs creating the table via other functions, or should we just leave it as is?
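For reference, one alternative worth benchmarking (a sketch, not the PR's code) is a flattened bincount, which needs only operations most array API namespaces provide; on NumPy inputs it matches the coo_matrix construction exactly:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_classes, n_clusters = 5, 7
class_idx = rng.integers(0, n_classes, size=10_000)
cluster_idx = rng.integers(0, n_clusters, size=10_000)

# Current approach: duplicate coordinates are summed on densification.
via_coo = sp.coo_matrix(
    (np.ones(class_idx.shape[0]), (class_idx, cluster_idx)),
    shape=(n_classes, n_clusters),
).toarray()

# Candidate alternative: count flattened (class, cluster) pair codes.
via_bincount = np.bincount(
    class_idx * n_clusters + cluster_idx, minlength=n_classes * n_clusters
).reshape(n_classes, n_clusters)

assert np.array_equal(via_coo, via_bincount)
```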

@Tialo Tialo force-pushed the array-api/contingency_matrix branch from f97c960 to 02d4e6d Compare October 17, 2024 15:41
@Tialo
Contributor Author

Tialo commented Dec 2, 2024

I am not sure why my test fails with this error:
FAILED metrics/cluster/tests/test_supervised.py::test_contingency_matrix_array_api_sparse - ImportError: array_api_compat is required to dispatch arrays using the API

And if the module is not installed, I don't know why the other array API tests do not fail.

@lucyleeow
Member

Checking in on this, as it seems that it would unblock many other metrics (see #26024 (comment)).

Thanks so much for your patience @Tialo , just checking, are you still interested in working on this?

@Tialo
Contributor Author

Tialo commented Jul 4, 2025

Hi, yes I am

Member

@lucyleeow lucyleeow left a comment


Thanks for your patience @Tialo, this looks like it's in good shape. There is just a question around whether it is worth exploring alternate table computation; it would be nice to get other maintainers' thoughts on that.

We now vendor array-api-compat and array-api-extra (#30340), so maybe this will fix your error: #29251 (comment)

Could you please merge main and fix the merge conflicts? Thank you

np.array([1, 0]),
sparse=True,
)
assert isinstance(res, sp.csr_matrix)
Member


Should this check be used with yield_namespace_device_dtype_combinations ?

Contributor Author


Not sure; I think checking that using array_api_dispatch=True together with sparse=True returns a sparse matrix is sufficient.

@@ -232,3 +242,60 @@ def test_returned_value_consistency(name):

assert isinstance(score, float)
assert not isinstance(score, (np.float64, np.float32))


def check_array_api_metric(
Member


Do you think we could re-use check_array_api_metric from sklearn/metrics/tests/test_common.py ?

Contributor Author


There is a processing of specific kwargs inside this function e.g.

if metric_kwargs.get("sample_weight") is not None:
    metric_kwargs["sample_weight"] = xp.asarray(
        metric_kwargs["sample_weight"], device=device
    )

Would it be OK to add similar kwargs processing to check_array_api_metric in sklearn/metrics/tests/test_common.py if it is needed in the future?
For example, for the function mutual_info_score and its contingency parameter.
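The kind of kwargs handling being discussed could be factored into a small helper, roughly like the sketch below (hypothetical names; `contingency` is the mutual_info_score parameter mentioned above, and `xp_asarray` stands in for `partial(xp.asarray, device=device)`):

```python
import numpy as np

# Hypothetical list of keyword arguments that hold arrays and thus
# need conversion into the namespace/device under test.
ARRAY_KWARGS = ("sample_weight", "contingency")

def convert_array_kwargs(metric_kwargs, xp_asarray):
    # Convert array-valued keywords, leaving other keywords untouched.
    converted = dict(metric_kwargs)
    for key in ARRAY_KWARGS:
        if converted.get(key) is not None:
            converted[key] = xp_asarray(converted[key])
    return converted
```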

@lucyleeow
Member

@Tialo are you still interested in working on this PR? Thanks

@Tialo
Contributor Author

Tialo commented Jul 21, 2025

I won't be able to get back to this PR for a month, so if this is a blocker, you can take it over

4 participants