ENH Array API support for confusion_matrix #30562

Conversation
result = confusion_matrix(y_true, y_pred)
xp_result, _ = get_namespace(result)
assert _is_numpy_namespace(xp_result)

# Since the computation always happens with NumPy / SciPy on the CPU, this
# function is expected to return an array allocated on the CPU even when it does
# not match the input array's device.
assert result.device == "cpu"
I have adjusted this test to your suggestions from this comment, @ogrisel. But here, the test is narrower, because we made the return value of `confusion_matrix` always be a numpy array on cpu.
Regarding the question of how to document the return type of
I suspect the CI failure (about the Test Library) is a false positive, as I couldn't reproduce the same error after running the tests multiple times on my local machine. Instead, I encountered a different set of errors related to CUDA. I believe the issue might be linked to how we convert `sample_weight` into a NumPy array.
I think this looks good overall. Thanks @StefanieSenger.
We just need to finalize whether we want to keep the current flow, where we return a numpy array on the cpu, or convert back to the initial namespace and device.
CC: @ogrisel
About the namespace, I would say for now the least controversial thing to do (from the point of view of array API support) would be to convert to the initial namespace; for example, see #30440 (comment). About the device, I am not quite sure ... the SciPy array API support blog post's compiled-code section (from October 2023) says:
Maybe @lucascolley has some opinion on what to do with device based on his scipy experience?
thanks for the ping @lesteve, I left my initial thoughts at #30440 (comment) and then discussed this with Olivier and Tim in a call. The summary is that I think you'll want to think carefully before setting a precedent for different defaults across functions as to whether they return the input namespace or NumPy. That seems like it could be confusing for users and difficult to document. In this case, if (some of) the computation has to be done with NumPy, then I would suggest attempting to convert back to the input namespace. In SciPy we have chosen to let this error for now and skip the tests for device transfers, but you could choose to let the device transfers happen (as I think you do in your
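For concreteness, here is a minimal sketch of the "compute with NumPy, convert back" pattern being suggested. It uses scikit-learn's private helpers `get_namespace` and `_convert_to_numpy`, whose exact locations and signatures may vary between versions; the placeholder computation and the `device` handling are illustrative assumptions, not the actual code in this PR.

```python
import numpy as np

# Private scikit-learn helpers; locations/signatures may differ by version.
from sklearn.utils._array_api import get_namespace, _convert_to_numpy


def _numpy_then_convert_back(y_true, y_pred):
    """Illustrative sketch: do the work in NumPy, return in the input namespace."""
    xp, _ = get_namespace(y_true, y_pred)
    device = getattr(y_true, "device", None)

    # Move inputs to NumPy so that NumPy/SciPy-only code (e.g. coo_matrix)
    # can be used for the actual computation.
    y_true_np = _convert_to_numpy(y_true, xp)
    y_pred_np = _convert_to_numpy(y_pred, xp)

    # Placeholder for the real NumPy/SciPy computation.
    result_np = np.equal(y_true_np, y_pred_np).astype(np.int64)

    # Convert the result back to the caller's namespace and device.
    return xp.asarray(result_np, device=device)
```

The key trade-off discussed above lives in the last line: `asarray` with an explicit `device` may trigger a device transfer (e.g. CPU to GPU), which SciPy currently chooses to disallow.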
I ran some benchmarks on a Kaggle kernel with CUDA using this branch, where in one case the current flow (returning a numpy array) is used, while in the other we convert to the input namespace and device when returning.
By looking at this, I think we might as well return the confusion matrix in the original namespace and device.
Thanks, everyone, for your thoughts on the return type! To summarize: the original decision (based on @ogrisel's suggestion) was to default to returning numpy arrays. So, should we finalise the PR by converting back to the input namespace and device, despite the minor performance difference? Or should we keep the numpy default? I believe the advantages of converting back to the input namespace and device outweigh the costs, and in case we later judge differently for some reason, array API support is experimental, so we could still change the behaviour without annoying users too much. @ogrisel, since your suggestion was to convert to numpy, do you have any thoughts on whether we should stick with that or return arrays from the input namespace, based on the discussion?
+1 from me! For clarity, what is the NumPy code (i.e. which lines) that we have been unable to write in a performant way with the standard (which caused the initial performance discussion)?
Basically where we create the confusion matrix using a coo_matrix: see scikit-learn/sklearn/metrics/_classification.py, lines 413 to 417 at 869f568.
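For reference, those lines construct the matrix roughly like this (a paraphrased, self-contained sketch with toy inputs, not an exact quote of the scikit-learn source). `scipy.sparse.coo_matrix` sums the weights of duplicate (row, column) coordinates when materialized, which performs the per-class counting in a single vectorized step and is hard to express performantly with the array API standard alone.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy stand-ins for the variables used inside confusion_matrix; in the real
# code y_true/y_pred are label indices and sample_weight defaults to ones.
y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 2, 2, 2])
sample_weight = np.ones_like(y_true, dtype=np.int64)
n_labels = 3

# Each (y_true[i], y_pred[i]) pair contributes sample_weight[i] to one cell;
# duplicate coordinates are summed on conversion to a dense array.
cm = coo_matrix(
    (sample_weight, (y_true, y_pred)),
    shape=(n_labels, n_labels),
    dtype=np.int64,
).toarray()
print(cm)
# [[1 0 0]
#  [0 0 1]
#  [0 0 2]]
```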
This seems the least surprising to me for now and we can revisit this later if the need arises. I would think metrics computation is unlikely to be the bottleneck in most real-life use cases.
@ogrisel Could you kindly provide your feedback so that we can finalize this PR?
Based on the discussion in #31286 (comment), the decision is to convert such output arrays back to the original namespace.
Thank you for the updates @StefanieSenger
@@ -141,6 +141,7 @@ Metrics
-------

- :func:`sklearn.metrics.accuracy_score`
- :func:`sklearn.metrics.confusion_matrix`
I think we should have another section where we specify metrics that don't support the array API in internal computations and use numpy instead, because the array API does not support the optimizations needed. Using numpy is therefore more appropriate internally, and we convert the output back to the desired array type.
This should be done because we don't guarantee any speedups for such metrics.
I understand, @OmarManzoor. But metrics computation is not the bottleneck for performance, and it is hard to know where to draw the line, because `confusion_matrix` is internally used in so many other functions and methods. So, I am not so sure we can decide on a good way to document that in this PR. Should we maybe rather discuss that somewhere else?
I think it's fine to not add the additional documentation in this PR. But don't you think that when the other documented metrics use the array API internally, while confusion matrix itself and the metrics built on it don't, it would be better to clarify that point? I think we can ask for opinions from others as well.
Thanks for understanding and your support on this PR, @OmarManzoor!
I don't mind either way and am happy to help move the docs forward in either case.
If you ask my opinion (but it's really not deciding), I would tend to say we don't need to justify in detail what happens internally at the moment. I would consider `confusion_matrix`, and any other function that doesn't break when we use other array libraries on it, to be array API compatible: you can feed it arrays of any type and it returns arrays of the same type, so it's compatible with use as part of a data science pipeline.
For metrics, the speedup from actually staying on the GPU is not so high. And there will be many other places where we cannot support staying on the GPU (yet) for different reasons, and that might change over time as the array API evolves or we find better solutions. I think it is very ambitious to try to document internal details at such an early, experimental stage of the array API implementation.
By the way, the main reason for getting other opinions is that this was suggested previously in #29881 (comment).
Thanks for your reviews, @OmarManzoor and @lesteve!
I have applied your suggestions and commented regarding the documentation.
Now, this PR looks very straightforward. :)
Reference Issues/PRs
towards #26024
closes #30440 (supersedes)
This PR is an alternative, discussed in #30440. It accepts array inputs from all namespaces, converts the input arrays to numpy arrays right away to do the calculations in numpy (which is necessary for the coo_matrix at least), and returns the confusion matrix in the same namespace as the input and on a cpu device.
That's what we had discussed. For more details see the discussions on both PRs.
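To illustrate the described flow end to end, here is a hypothetical usage sketch. It assumes PyTorch is installed and array API dispatch is enabled via `sklearn.config_context`; the expected output type follows the behavior described above rather than quoting code from this PR.

```python
import torch
import sklearn
from sklearn.metrics import confusion_matrix

with sklearn.config_context(array_api_dispatch=True):
    y_true = torch.tensor([0, 1, 2, 2])
    y_pred = torch.tensor([0, 2, 2, 2])

    # Inputs are converted to NumPy internally (for the coo_matrix step);
    # the result is converted back, so cm should be a torch tensor on cpu.
    cm = confusion_matrix(y_true, y_pred)
    print(type(cm), cm)
```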