ENH fast path for binary confusion matrix #15403


Closed
wants to merge 5 commits

Conversation

jnothman
Member

A patch for @GregoryMorse to benchmark.

Fixes #15388.

@@ -879,12 +879,6 @@ def test_confusion_matrix_dtype():
assert cm[0, 0] == 4294967295
assert cm[1, 1] == 8589934590

# np.iinfo(np.int64).max should cause an overflow
Member Author

There might be a neater solution than removing this, but it turns out that different implementations will give different results in the case of overflow
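For illustration (an aside, not code from the PR): NumPy's fixed-width integers wrap on overflow while Python's arbitrary-precision ints do not, which is one way two implementations can legitimately disagree once counts exceed `np.iinfo(np.int64).max`:

```python
import numpy as np

# NumPy int64 array arithmetic wraps silently on overflow, while
# Python ints have arbitrary precision, so results diverge past the max.
big = np.iinfo(np.int64).max
wrapped = (np.array([big], dtype=np.int64) + np.array([1], dtype=np.int64))[0]
exact = int(big) + 1
# wrapped == np.iinfo(np.int64).min, whereas exact == 2**63
```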

Contributor

Should it just test for the binary type and exact selection conditions as in the optimized code before making the assertions to exclude them?

@GregoryMorse (Contributor) left a comment

Thank you so much for the PR. Helpful question: what about handling of the normalization case? I do not see anything that deals with it. Update: never mind, it appears I must have been looking at another function or version previously; I realize there is currently no normalize argument for confusion_matrix, though one could theoretically be added, as it's easy enough to divide by np.sum.
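As an aside, the normalization mentioned above really is a one-line division on top of any confusion matrix (a hypothetical illustration, not code from this PR):

```python
import numpy as np

# Given any integer confusion matrix, normalization is just a division.
cm = np.array([[4, 1],
               [2, 3]])
cm_all = cm / cm.sum()                        # over all entries
cm_true = cm / cm.sum(axis=1, keepdims=True)  # per true class (rows)
cm_pred = cm / cm.sum(axis=0, keepdims=True)  # per predicted class (columns)
```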

@jnothman
Member Author

Looks like CI is unhappy anyway

@GregoryMorse
Contributor

GregoryMorse commented Oct 30, 2019

Perhaps use not sample_weight.dtype.kind in {'i', 'u', 'b'} instead of sample_weight.dtype.kind == 'O'. It also looks like the final return should be .astype(np.float64) rather than float, and the first return should get .astype(np.int64); then all tests should pass. I had forgotten about sample weights. It's interesting that bincount supposedly uses int64 internally but returns ints, and some complaints about that had been raised due to the doubled memory. But I kept forgetting the sample_weight parameter.
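A sketch of the kind of fast path and dtype handling under discussion (hypothetical code, not the PR's actual diff; `binary_confusion_matrix_sketch` is an invented name):

```python
import numpy as np

def binary_confusion_matrix_sketch(y_true, y_pred, sample_weight=None):
    """Hypothetical bincount-based fast path for the binary case."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if sample_weight is None:
        sample_weight = np.ones(y_true.shape[0], dtype=np.int64)
    else:
        sample_weight = np.asarray(sample_weight)
    # Encode each (true, pred) pair as an index in 0..3 and sum the weights.
    idx = 2 * y_true.astype(np.int64) + y_pred.astype(np.int64)
    cm = np.bincount(idx, weights=sample_weight, minlength=4).reshape(2, 2)
    # bincount with weights returns float64; cast back for integer-like
    # weights as discussed ('i' = signed int, 'u' = unsigned, 'b' = bool).
    if sample_weight.dtype.kind in {'i', 'u', 'b'}:
        cm = cm.astype(np.int64)
    return cm
```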

@GregoryMorse
Contributor

Nice job, it works now; only the doc system has some sort of issue. I would consider putting sample_weight.dtype.kind in 'iub' in a variable rather than repeating that expression (and the trivial computation it entails) four or so times.

@jnothman
Member Author

Please feel free to run some benchmarks, @GregoryMorse

@GregoryMorse
Contributor

Would using timeit between the old and new version be sufficient?

@jnothman
Member Author

jnothman commented Jan 26, 2020 via email

Base automatically changed from master to main January 22, 2021 10:51
Comment on lines +295 to +296
sample_weight = np.asarray(sample_weight)
check_consistent_length(y_true, y_pred, sample_weight)
Member

Suggested change
sample_weight = np.asarray(sample_weight)
check_consistent_length(y_true, y_pred, sample_weight)
sample_weight = _check_sample_weight(sample_weight, X)

@lorentzenchr
Copy link
Member

This PR needs a simple benchmark (posted as a comment here on GitHub), e.g. with %time or %timeit.
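A minimal benchmark along the requested lines can be run with plain timeit and NumPy; here `cm_bincount` stands in for the proposed fast path and `cm_generic` is a crude stand-in for a label-based path (both hypothetical, not scikit-learn's actual implementations):

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(2, size=1_000_000)
y_pred = rng.randint(2, size=1_000_000)

def cm_bincount(y_true, y_pred):
    # Candidate fast path: one bincount over encoded (true, pred) pairs.
    return np.bincount(2 * y_true + y_pred, minlength=4).reshape(2, 2)

def cm_generic(y_true, y_pred):
    # Crude stand-in for a generic label-based path.
    labels = np.unique(np.concatenate([y_true, y_pred]))
    n = labels.size
    cm = np.zeros((n, n), dtype=np.int64)
    for i, t in enumerate(labels):
        for j, p in enumerate(labels):
            cm[i, j] = np.sum((y_true == t) & (y_pred == p))
    return cm

fast = timeit.timeit(lambda: cm_bincount(y_true, y_pred), number=5) / 5
slow = timeit.timeit(lambda: cm_generic(y_true, y_pred), number=5) / 5
print(f"bincount: {fast * 1e3:.1f} ms, generic: {slow * 1e3:.1f} ms")
```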

@jeremiedbb
Copy link
Member

Some profiling done in #28578 showed that it's actually the input checks that dominate in confusion_matrix, especially the call(s) to np.unique. The specialization in the binary case only brings a marginal improvement, so we decided that it's not worth the added complexity for now. With that in mind, I'm closing this PR. If we manage to reduce the overhead of this function, we'll reconsider implementing this optimization.
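The shape of that profiling observation is easy to reproduce in isolation: label discovery via np.unique (done during input validation) sorts the array, which tends to cost more than the linear-time counting a binary fast path needs (a pure-NumPy illustration, not the actual profile from #28578):

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
y = rng.randint(2, size=1_000_000)

# Cost of the label discovery done during validation...
unique_ms = timeit.timeit(lambda: np.unique(y), number=20) / 20 * 1e3
# ...versus the counting itself.
bincount_ms = timeit.timeit(lambda: np.bincount(y, minlength=2),
                            number=20) / 20 * 1e3
print(f"np.unique: {unique_ms:.2f} ms, np.bincount: {bincount_ms:.2f} ms")
```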

@jeremiedbb jeremiedbb closed this Mar 6, 2024
Successfully merging this pull request may close these issues.

metrics.confusion_matrix far too slow for Boolean cases