ENH fast path for binary confusion matrix #15403


Closed
wants to merge 5 commits

Conversation

jnothman
Member

A patch for @GregoryMorse to benchmark.

Fixes #15388.

@@ -879,12 +879,6 @@ def test_confusion_matrix_dtype():
assert cm[0, 0] == 4294967295
assert cm[1, 1] == 8589934590

# np.iinfo(np.int64).max should cause an overflow
Member Author

There might be a neater solution than removing this, but it turns out that different implementations will give different results in the case of overflow
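For illustration (an aside, not code from the PR): NumPy's fixed-width integers wrap on overflow while Python's arbitrary-precision ints do not, which is one way two implementations can legitimately disagree once counts exceed `np.iinfo(np.int64).max`:

```python
import numpy as np

# NumPy int64 array arithmetic wraps silently on overflow, while
# Python ints have arbitrary precision, so results diverge past the max.
big = np.iinfo(np.int64).max
wrapped = (np.array([big], dtype=np.int64) + np.array([1], dtype=np.int64))[0]
exact = int(big) + 1
# wrapped == np.iinfo(np.int64).min, whereas exact == 2**63
```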

Contributor

Should it just test for the binary type and exact selection conditions as in the optimized code before making the assertions to exclude them?

@GregoryMorse (Contributor) left a comment

Thank you so much for the PR. Helpful question: what about handling of the normalization case? I do not see anything that deals with it. Update: never mind, it appears I must have been looking at another function or version previously; I realize there is currently no normalize argument for confusion_matrix, though one could theoretically be added, as it's easy enough to divide by np.sum.
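As an aside, the normalization mentioned above really is a one-line division on top of any confusion matrix (a hypothetical illustration, not code from this PR):

```python
import numpy as np

# Given any integer confusion matrix, normalization is just a division.
cm = np.array([[4, 1],
               [2, 3]])
cm_all = cm / cm.sum()                        # over all entries
cm_true = cm / cm.sum(axis=1, keepdims=True)  # per true class (rows)
cm_pred = cm / cm.sum(axis=0, keepdims=True)  # per predicted class (columns)
```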

@jnothman
Member Author

Looks like CI is unhappy anyway

@GregoryMorse
Contributor

GregoryMorse commented Oct 30, 2019

Perhaps use not sample_weight.dtype.kind in {'i', 'u', 'b'} instead of sample_weight.dtype.kind == 'O'. It also looks like the final return should be .astype(np.float64) rather than float, and the first return should get .astype(np.int64); then all tests should pass. I had forgotten about sample weights. It's interesting that bincount supposedly uses int64 internally but returns ints, and some complaints about that had been raised due to the doubled memory. But I kept forgetting the sample_weight parameter.
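A sketch of the kind of fast path and dtype handling under discussion (hypothetical code, not the PR's actual diff; `binary_confusion_matrix_sketch` is an invented name):

```python
import numpy as np

def binary_confusion_matrix_sketch(y_true, y_pred, sample_weight=None):
    """Hypothetical bincount-based fast path for the binary case."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if sample_weight is None:
        sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
    else:
        sample_weight = np.asarray(sample_weight)
    # Encode each (true, pred) pair as an index in 0..3 and sum the weights.
    idx = 2 * y_true.astype(np.int64) + y_pred.astype(np.int64)
    cm = np.bincount(idx, weights=sample_weight, minlength=4).reshape(2, 2)
    # bincount with weights returns float64; cast back for integer-like
    # weights as discussed ('i' = signed int, 'u' = unsigned, 'b' = bool).
    if sample_weight.dtype.kind in {'i', 'u', 'b'}:
        cm = cm.astype(np.int64)
    return cm
```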

@GregoryMorse
Contributor

Nice job, it works now; only the doc system has some sort of issue. I would consider putting sample_weight.dtype.kind in 'iub' in a variable rather than repeating that expression (and the trivial computation it entails) four or so times.

@jnothman
Member Author

Please feel free to run some benchmarks, @GregoryMorse

@GregoryMorse
Contributor

Would using timeit between the old and new version be sufficient?

@jnothman
Member Author

jnothman commented Jan 26, 2020 via email

Base automatically changed from master to main January 22, 2021 10:51
Comment on lines +295 to +296
sample_weight = np.asarray(sample_weight)
check_consistent_length(y_true, y_pred, sample_weight)
Member

Suggested change
sample_weight = np.asarray(sample_weight)
check_consistent_length(y_true, y_pred, sample_weight)
sample_weight = _check_sample_weight(sample_weight, X)

@lorentzenchr
Copy link
Member

This PR needs a simple benchmark (posted as a comment here on GitHub), e.g. with %time or %timeit.
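A minimal benchmark along the requested lines can be run with plain timeit and NumPy; here `cm_bincount` stands in for the proposed fast path and `cm_generic` is a crude stand-in for a label-based path (both hypothetical, not scikit-learn's actual implementations):

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(2, size=1_000_000)
y_pred = rng.randint(2, size=1_000_000)

def cm_bincount(y_true, y_pred):
    # Candidate fast path: one bincount over encoded (true, pred) pairs.
    return np.bincount(2 * y_true + y_pred, minlength=4).reshape(2, 2)

def cm_generic(y_true, y_pred):
    # Crude stand-in for a generic label-based path.
    labels = np.unique(np.concatenate([y_true, y_pred]))
    n = labels.size
    cm = np.zeros((n, n), dtype=np.int64)
    for i, t in enumerate(labels):
        for j, p in enumerate(labels):
            cm[i, j] = np.sum((y_true == t) & (y_pred == p))
    return cm

fast = timeit.timeit(lambda: cm_bincount(y_true, y_pred), number=5) / 5
slow = timeit.timeit(lambda: cm_generic(y_true, y_pred), number=5) / 5
print(f"bincount: {fast * 1e3:.1f} ms, generic: {slow * 1e3:.1f} ms")
```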

@jeremiedbb
Copy link
Member

Some profiling done in #28578 showed that it's actually the input checks that dominate in confusion_matrix, especially the call(s) to np.unique. The specialization in the binary case only brings a marginal improvement, so we decided that it's not worth the added complexity for now. With that in mind, I'm closing this PR. If we manage to reduce the overhead of this function, we'll reconsider implementing this optimization.
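The shape of that profiling observation is easy to reproduce in isolation: label discovery via np.unique (done during input validation) sorts the array, which tends to cost more than the linear-time counting a binary fast path needs (a pure-NumPy illustration, not the actual profile from #28578):

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
y = rng.randint(2, size=1_000_000)

# Cost of the label discovery done during validation...
unique_ms = timeit.timeit(lambda: np.unique(y), number=20) / 20 * 1e3
# ...versus the counting itself.
bincount_ms = timeit.timeit(lambda: np.bincount(y, minlength=2),
                            number=20) / 20 * 1e3
print(f"np.unique: {unique_ms:.2f} ms, np.bincount: {bincount_ms:.2f} ms")
```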

@jeremiedbb jeremiedbb closed this Mar 6, 2024
Successfully merging this pull request may close these issues.

metrics.confusion_matrix far too slow for Boolean cases