ENH Add fast path for binary confusion matrix #28578
Conversation
@jjerphan @jeremiedbb would you be able to have a look here?
I ran some quick benchmarks and didn't find an improvement in many cases. This is what I tested:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

N = 2**20
a = np.random.choice([True, False], size=N)  # replace [True, False] by any 2 labels
b = np.random.choice([True, False], size=N)

%timeit confusion_matrix(a, b)
```

I only found some benefit when the labels are not [0, 1].
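For reference, the exact fast path proposed in the linked comment is elided above; a common technique for this situation is to encode each (true, pred) pair as a small integer and count with `np.bincount`. A minimal sketch of that idea, with a hypothetical helper name (not part of scikit-learn):

```python
import numpy as np

def binary_confusion_matrix_fast(y_true, y_pred):
    # Hypothetical helper (not scikit-learn API): 2x2 counts for boolean
    # labels, encoding each (true, pred) pair as an integer in 0..3.
    y_true = np.asarray(y_true, dtype=bool).astype(np.int64)
    y_pred = np.asarray(y_pred, dtype=bool).astype(np.int64)
    counts = np.bincount(y_true * 2 + y_pred, minlength=4)
    # Same layout as sklearn: rows = true label, cols = predicted label.
    return counts.reshape(2, 2)
```

This is a single pass over the data with no label inference, which is why it can beat the generic path when labels are already boolean.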
Thank you for running the benchmarks! In that case I am happy to close this PR, and we could maybe close the other related PR and issue as well?
Unless we consider that the speed-up we gain when labels are not [0, 1] is worth the added complexity :)
Reading #15388 more, it seems the poster was calling confusion_matrix repeatedly on small batches; see #15388 (comment) for the benchmark he used. Again, even if it does improve performance in that case, is it worth the complexity?
The issue with the benchmarks in #15388 is that they mostly measure the input validation and setup overhead. It's the same with small repeated batches and with a one-shot large batch. The actual time to compute the confusion matrix, once everything is checked and set up, is very low. So I think there's definitely room for improvement, but we should start with the duplicated call to np.unique.
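To illustrate the np.unique point above: the label set only needs to be derived once from both arrays, and callers can also bypass the inference entirely since `confusion_matrix` accepts a `labels=` argument. A small sketch of computing the labels a single time (the variable names are illustrative):

```python
import numpy as np

# Toy inputs with non-[0, 1] labels.
y_true = np.array([3, 7, 7, 3, 7])
y_pred = np.array([7, 7, 3, 3, 7])

# Derive the sorted label set once from both arrays, instead of
# calling np.unique separately on y_true and y_pred.
labels = np.unique(np.concatenate([y_true, y_pred]))
```

A caller who already knows the labels could pass `confusion_matrix(y_true, y_pred, labels=labels)` and skip the inference altogether.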
Thanks for profiling. I think we can reduce a part of that overhead. With your results I think we could close this PR and #15403, as it seems this is not the way to improve performance here. WDYT?
I'm okay with that.
I don't think there's an issue with check_consistent_length; the 2nd call to […]
My bad, the […]
Actually, there's already ongoing work on that in #26820.
Reference Issues/PRs
closes #15388
closes #15403 (supersedes it)
What does this implement/fix? Explain your changes.
Continues from stalled PR #15403
Adds a fast path for the binary confusion matrix, as suggested here: #15388 (comment)
Any other comments?
Will add some benchmarks.
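As a sketch of the kind of benchmark that could accompany this PR, the candidate bincount fast path can be compared against a baseline that mimics the generic label-inference route. Everything below is illustrative (pure NumPy, hypothetical function names), not the PR's actual benchmark:

```python
import timeit
import numpy as np

N = 2**20
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=N).astype(bool)
b = rng.integers(0, 2, size=N).astype(bool)

def bincount_path():
    # Candidate fast path: encode (true, pred) pairs into 0..3 and count.
    return np.bincount(a * 2 + b, minlength=4).reshape(2, 2)

def unique_path():
    # Baseline mimicking the generic route: infer labels, then map
    # each value to its label index before counting.
    labels = np.unique(np.concatenate([a, b]))
    n = len(labels)
    idx = np.searchsorted(labels, a) * n + np.searchsorted(labels, b)
    return np.bincount(idx, minlength=n * n).reshape(n, n)

t_fast = timeit.timeit(bincount_path, number=20)
t_base = timeit.timeit(unique_path, number=20)
print(f"bincount path: {t_fast:.3f}s  unique path: {t_base:.3f}s")
```

Both paths produce the same 2x2 matrix, so the timing difference isolates the cost of the label inference and index mapping.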