
ENH Add fast path for binary confusion matrix #28578


Closed
wants to merge 8 commits

Conversation

lucyleeow
Member

Reference Issues/PRs

closes #15388
closes #15403 (supersedes it)

What does this implement/fix? Explain your changes.

Continues from stalled PR #15403
Adds a fast path for binary confusion matrix using:

confusion = np.bincount(y_true * 2 + y_pred, minlength=4).reshape(2, 2)

as suggested here: #15388 (comment)
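
The encoding is easy to sanity-check in isolation. A minimal sketch (pure NumPy, not the PR diff itself) comparing the fast path against a direct per-cell count:

```python
# Sketch of the fast path's encoding (pure NumPy; not the PR's code).
# For 0/1 labels, y_true * 2 + y_pred maps each pair to a cell index:
# (0,0) -> 0 (TN), (0,1) -> 1 (FP), (1,0) -> 2 (FN), (1,1) -> 3 (TP).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)

fast = np.bincount(y_true * 2 + y_pred, minlength=4).reshape(2, 2)

# Direct count of each cell for comparison (rows = true label).
direct = np.array(
    [[np.sum((y_true == i) & (y_pred == j)) for j in (0, 1)] for i in (0, 1)]
)
assert np.array_equal(fast, direct)
```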

Any other comments?

Will add some benchmarks.


github-actions bot commented Mar 5, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3019c08. Link to the linter CI: here

@adrinjalali
Member

@jjerphan @jeremiedbb would you be able to have a look here?

@jeremiedbb
Member

I ran some quick benchmarks and didn't find an improvement in many cases. This is what I tested:

import numpy as np
from sklearn.metrics import confusion_matrix

N = 2**20
a = np.random.choice([True, False], size=N)  # replace [True, False] with any 2 labels
b = np.random.choice([True, False], size=N)

%timeit confusion_matrix(a, b)

I only found some benefit when the labels are not [0, 1], like ["a", "b"] or [1, 2].
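
The benefit for non-[0, 1] labels plausibly comes from the encoding step the generic path needs. As a rough illustration (an assumed simplification, not sklearn's actual internals), arbitrary labels must first be mapped to integer codes before any counting can happen, which is the work the fast path skips when the input is already 0/1:

```python
# Illustration only (not sklearn's internals): with arbitrary labels,
# a label-to-code encoding pass is needed before counting.
import numpy as np

y_true = np.array(["a", "b", "a", "b", "b"])
y_pred = np.array(["a", "a", "a", "b", "b"])

# Map labels to integer codes (here both arrays share the same label
# set); binary 0/1 input would not need this step.
labels, true_codes = np.unique(y_true, return_inverse=True)
pred_codes = np.searchsorted(labels, y_pred)

n = len(labels)
cm = np.bincount(true_codes * n + pred_codes, minlength=n * n).reshape(n, n)
# cm rows = true label, columns = predicted label
```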

@lucyleeow
Member Author

lucyleeow commented Mar 5, 2024

Thank you for running the benchmarks! In that case I am happy to close this PR and we could maybe close the other related PR and issue as well?

@jeremiedbb
Member

Unless we consider that the speed-up we gain when labels are not [0,1] is worth the added complexity :)

@lucyleeow
Member Author

Reading #15388 more closely, it seems the poster was doing

batches of about 4096 but doing it a great many times as part of a modeling algorithm for generic detection of Boolean formulas

see: #15388 (comment) for a benchmark he used.

Again, even if it does improve performance in that case, is it worth the complexity?

@jeremiedbb
Member

The issue with the benchmarks in #15388 is that they only measure the bincount part. A quick profiling of the snippet I used above shows that almost all the time is spent in _check_targets and labels = unique_labels(y_true, y_pred). In both places, the call to np.unique in particular is the most costly part.

It's the same with small repeated batches and with a one-shot large batch.

The actual time to compute the confusion matrix, once everything is checked and set up, is very low. So I think there's definitely room for improvement, but we should start with the duplicated call to np.unique.
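
The gap between validation and counting is easy to reproduce in isolation. A rough sketch (pure NumPy timing; absolute numbers are machine-dependent) comparing np.unique, which sorts, against the single linear bincount pass that actually builds the matrix:

```python
# Rough illustration: np.unique (which sorts the array) costs much
# more than the single O(n) bincount pass that builds the matrix.
import timeit

import numpy as np

y = np.random.randint(0, 2, size=2**20)

t_unique = timeit.timeit(lambda: np.unique(y), number=20)
t_bincount = timeit.timeit(lambda: np.bincount(y, minlength=2), number=20)
print(f"np.unique:   {t_unique:.4f}s")
print(f"np.bincount: {t_bincount:.4f}s")
```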

@lucyleeow
Member Author

Thanks for profiling. I think we can avoid one np.unique call by replacing check_consistent_length with _check_sample_weight, as we've already used _check_targets on y_true and y_pred. I'll open another PR with the changes and try benchmarking!

I think with your results we could close this PR and #15403, as it seems this is not the way to improve performance here. WDYT?

@jeremiedbb
Member

I think with your results we could close this PR and #15403, as it seems this is not the way to improve performance here. WDYT?

I'm okay with that.

I think we can avoid one np.unique call by replacing check_consistent_length with _check_sample_weight

I don't think there's an issue with check_consistent_length; the second call to np.unique comes from unique_labels. The thing is that when we run _check_targets to determine y_type, we already call np.unique, so we could have access to the labels at that point. We could make _check_targets return those labels, but I guess that would impact many places, and it's hard to make generic because other y types don't always have labels. There's a bit of refactoring to figure out there :)

@lucyleeow
Member Author

My bad, the np.unique call in check_consistent_length is just on the lengths of the arrays. Using _check_sample_weight would just make the code look neater and avoid an if/else.

@lucyleeow lucyleeow deleted the cmspeed branch March 6, 2024 10:16
@jeremiedbb
Member

So I think there's definitely room for improvement, but we should start with the duplicated call to np.unique.

Actually there's already ongoing work for that in #26820

Successfully merging this pull request may close these issues.

metrics.confusion_matrix far too slow for Boolean cases