-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
ENH Array API for contingency_matrix
#29251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
contingency_matrix
Out of curiosity and because I am not familiar with this, do you mind elaborating on why you think that this? Thanks! |
But iirc, some metrics, e.g. |
I agree with the comment above. To avoid the confusion, the returned value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Here is some feedback:
|
||
def check_array_api_metric_supervised(metric, array_namespace, device, dtype_name): | ||
labels_true = np.array([0, 0, 1, 1, 2, 2], dtype=dtype_name) | ||
labels_pred = np.array([1, 0, 2, 1, 0, 2], dtype=dtype_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please (also or only) test with the default integer dtype (without casting to dtype_name
). The dtype_name
s generated by our testing infrastructure are floating point dtypes. Using floating point dtypes does not really make sense for cluster labels.
Related: #29251 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to add a test, where dtype_name will be for example "float32"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The phrasing of my comment was grammatically incorrect and confusing. I rewrote it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the confusion here, actually as defined in test_array_api_compliance
dtype_name either int32
or int64
. So float types are not checked anyway.
) | ||
|
||
if sparse and is_array_api_compliant: | ||
raise ValueError("Cannot use sparse=True while using array api dispatch") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coming back to this, maybe raising this exception is unnecessarily annoying. We could instead just return
Accepting array API inputs would not necessarily mean to output a data structure of the same type when output is request explicitly to use another container type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could instead just return
Did you mean to write something after return
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I mean to return a sparse scipy data-structures (backed by numpy arrays).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does #31286 mean that should return the same type as input..? Edit - I missed the sparse
parameter
|
||
classes, class_idx = xp.unique_inverse(labels_true) | ||
clusters, cluster_idx = xp.unique_inverse(labels_pred) | ||
class_idx = _convert_to_numpy(class_idx, xp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise cupy would raise here
contingency = sp.coo_matrix(
(np.ones(class_idx.shape[0]), (class_idx, cluster_idx)),
shape=(n_classes, n_clusters),
dtype=dtype,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel do you think it is worthwhile to do some benchmarking, to compare conversion to numpy vs creating the table via other functions, or should we just leave as is?
f97c960
to
02d4e6d
Compare
…y-api/contingency_matrix
…y-api/contingency_matrix
I am not sure why my test fails with this error. And if module is not installed, then I don't know why other array-api tests do not fail |
Checking in on this as it seems that this would unblock many other metrics (see #26024 (comment)) Thanks so much for your patience @Tialo , just checking, are you still interested in working on this? |
Hi, yes I am |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience @Tialo , this looks like it's in good shape, just a question around whether it is worth exploring alternate table computation, that it would be nice to get other maintainers thoughts.
We now vendor array-api-compat and array-api-extra ( #30340), so maybe this will fix your error: #29251 (comment)
Could you please merge main and fix the merge conflicts? Thank you
np.array([1, 0]), | ||
sparse=True, | ||
) | ||
assert isinstance(res, sp.csr_matrix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be used with yield_namespace_device_dtype_combinations
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, I think checking that use of array_api_dispatch=True
and sparse=True
returns sparse matrix is sufficient
@@ -232,3 +242,60 @@ def test_returned_value_consistency(name): | |||
|
|||
assert isinstance(score, float) | |||
assert not isinstance(score, (np.float64, np.float32)) | |||
|
|||
|
|||
def check_array_api_metric( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we could re-use check_array_api_metric
from sklearn/metrics/tests/test_common.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a processing of specific kwargs inside this function e.g.
if metric_kwargs.get("sample_weight") is not None:
metric_kwargs["sample_weight"] = xp.asarray(
metric_kwargs["sample_weight"], device=device
)
Will it be ok to add similar kwargs processing to check_array_api_metric
from sklearn/metrics/tests/test_common.py
if it will be needed in future?
For example for function mutual_info_score
and parameter contingency
@Tialo are you still interested in working on this PR? Thanks |
I won't be able to get back to this PR for a month, so if this is a blocker, you can take it over |
…ontingency_matrix
…y-api/contingency_matrix
Reference Issues/PRs
Towards #26024
What does this implement/fix? Explain your changes.
Added common test like in
metrics/tests/test_common.py
. As supervised clusters metrics operate with labels imo it is wrong to usedtype_name
fromyield_namespace_device_dtype_combinations
which are float types.Any other comments?
Should common test check if metric array is on the same device?