
ENH Add fast path for binary confusion matrix #28578


Closed
wants to merge 8 commits

Conversation

lucyleeow
Member

Reference Issues/PRs

closes #15388
closes #15403 (supersedes it)

What does this implement/fix? Explain your changes.

Continues from stalled PR #15403
Adds a fast path for binary confusion matrix using:

confusion = np.bincount(y_true * 2 + y_pred, minlength=4).reshape(2, 2)

as suggested here: #15388 (comment)
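
The encoding is easy to sanity-check in isolation. A minimal sketch (pure NumPy, not the PR diff itself) comparing the fast path against a direct per-cell count:

```python
# Sketch of the fast path's encoding (pure NumPy; not the PR's code).
# For 0/1 labels, y_true * 2 + y_pred maps each pair to a cell index:
# (0,0) -> 0 (TN), (0,1) -> 1 (FP), (1,0) -> 2 (FN), (1,1) -> 3 (TP).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)

fast = np.bincount(y_true * 2 + y_pred, minlength=4).reshape(2, 2)

# Direct count of each cell for comparison (rows = true label).
direct = np.array(
    [[np.sum((y_true == i) & (y_pred == j)) for j in (0, 1)] for i in (0, 1)]
)
assert np.array_equal(fast, direct)
```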

Any other comments?

Will add some benchmarks.


github-actions bot commented Mar 5, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3019c08. Link to the linter CI: here

@adrinjalali
Member

@jjerphan @jeremiedbb would you be able to have a look here?

@jeremiedbb
Member

I ran some quick benchmarks and didn't find an improvement in many cases. This is what I tested:

import numpy as np
from sklearn.metrics import confusion_matrix

N = 2**20
a = np.random.choice([True, False], size=N)  # replace [True, False] with any 2 labels
b = np.random.choice([True, False], size=N)

%timeit confusion_matrix(a, b)

I only found some benefit when the labels are not [0, 1], like ["a", "b"] or [1, 2].
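
The benefit for non-[0, 1] labels plausibly comes from the encoding step the generic path needs. As a rough illustration (an assumed simplification, not sklearn's actual internals), arbitrary labels must first be mapped to integer codes before any counting can happen, which is the work the fast path skips when the input is already 0/1:

```python
# Illustration only (not sklearn's internals): with arbitrary labels,
# a label-to-code encoding pass is needed before counting.
import numpy as np

y_true = np.array(["a", "b", "a", "b", "b"])
y_pred = np.array(["a", "a", "a", "b", "b"])

# Map labels to integer codes (here both arrays share the same label
# set); binary 0/1 input would not need this step.
labels, true_codes = np.unique(y_true, return_inverse=True)
pred_codes = np.searchsorted(labels, y_pred)

n = len(labels)
cm = np.bincount(true_codes * n + pred_codes, minlength=n * n).reshape(n, n)
# cm rows = true label, columns = predicted label
```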

@lucyleeow
Member Author

lucyleeow commented Mar 5, 2024

Thank you for running the benchmarks! In that case I am happy to close this PR and we could maybe close the other related PR and issue as well?

@jeremiedbb
Member

Unless we consider that the speed-up we gain when labels are not [0,1] is worth the added complexity :)

@lucyleeow
Member Author

Reading #15388 more closely, it seems the poster was doing

batches of about 4096 but doing it a great many times as part of a modeling algorithm for generic detection of Boolean formulas

see: #15388 (comment) for a benchmark he used.

Again, even if it does improve performance in that case, is it worth the complexity?

@jeremiedbb
Member

The issue with the benchmarks in #15388 is that they only measure the bincount part. A quick profiling of the snippet I used above shows that almost all the time is spent in _check_targets and labels = unique_labels(y_true, y_pred). In both places, the call to np.unique in particular is the most costly part.

It's the same with small repeated batches and with a one-shot large batch.

The actual time to compute the confusion matrix, once everything is checked and set up, is very low. So I think there's definitely room for improvement, but we should start with the duplicated call to np.unique.
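
The gap between validation and counting is easy to reproduce in isolation. A rough sketch (pure NumPy timing; absolute numbers are machine-dependent) comparing np.unique, which sorts, against the single linear bincount pass that actually builds the matrix:

```python
# Rough illustration: np.unique (which sorts the array) costs much
# more than the single O(n) bincount pass that builds the matrix.
import timeit

import numpy as np

y = np.random.randint(0, 2, size=2**20)

t_unique = timeit.timeit(lambda: np.unique(y), number=20)
t_bincount = timeit.timeit(lambda: np.bincount(y, minlength=2), number=20)
print(f"np.unique:   {t_unique:.4f}s")
print(f"np.bincount: {t_bincount:.4f}s")
```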

@lucyleeow
Member Author

Thanks for profiling. I think we can avoid one np.unique call by replacing check_consistent_length with _check_sample_weight, as we've already used _check_targets on y_true and y_pred. I'll open another PR with the changes and try benchmarking!

I think with your results we could close this PR and #15403, as it seems this is not the way to improve performance here. WDYT?

@jeremiedbb
Member

I think with your results we could close this PR and #15403, as it seems this is not the way to improve performance here. WDYT?

I'm okay with that.

I think we can avoid one np.unique call by replacing check_consistent_length with _check_sample_weight

I don't think there's an issue with check_consistent_length; the second call to np.unique comes from unique_labels. The thing is that when we run _check_targets to determine y_type, we already call np.unique, so we could have access to the labels at that point. We could make _check_targets return those labels, but I guess that would impact many places, and it's hard to make generic because other y types don't always have labels. There's a bit of refactoring to figure out there :)

@lucyleeow
Member Author

My bad, the np.unique call in check_consistent_length is just on the lengths of the arrays. Using _check_sample_weight would just make the code look neater and avoid an if/else.

@lucyleeow lucyleeow deleted the cmspeed branch March 6, 2024 10:16
@jeremiedbb
Member

So I think there's definitely room for improvement, but we should start with the duplicated call to np.unique.

Actually there's already ongoing work for that in #26820

Successfully merging this pull request may close these issues.

metrics.confusion_matrix far too slow for Boolean cases