[MRG+1] Solves integer overlow in mutual_info_score #10414

thechargedneutron · 2018-01-07T07:10:36Z

Reference Issues/PRs

thechargedneutron · 2018-01-07T07:13:35Z

Not sure what the best fix actually is. Maybe cast pi and pj to int64; this way you make sure that pi.sum() and pj.sum() do not overflow either.

I don't think changing pi to int64 is of great use. We mostly encounter an integer overflow when we multiply two large numbers. The sum should work in most cases.
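To illustrate the point about products versus sums, here is a minimal sketch with made-up marginal counts (the values 60000 and 70000 are assumptions, not taken from the issue). With 32-bit integers the sum fits comfortably, while the product wraps around silently:

```python
import numpy as np

# Hypothetical marginal counts, stored as 32-bit integers as they would be
# on a platform where the default integer is 32 bits wide.
pi = np.array([60000], dtype=np.int32)
pj = np.array([70000], dtype=np.int32)

int32_max = np.iinfo(np.int32).max  # 2147483647

# The sum stays far below the int32 limit.
total = pi + pj

# The true product (4.2e9) exceeds the limit, so the int32 result wraps.
wrapped = pi * pj

# Casting to int64 before multiplying gives the exact product.
exact = pi.astype(np.int64) * pj.astype(np.int64)

print(total[0], wrapped[0], exact[0])
```

Only the multiplication needs the wider type here, which matches the argument that the sums are rarely the problem.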

jnothman · 2018-01-07T23:23:49Z

sklearn/metrics/cluster/supervised.py

@@ -602,6 +602,8 @@ def mutual_info_score(labels_true, labels_pred, contingency=None):
    contingency_nm = nz_val / contingency_sum
    # Don't need to calculate the full outer product, just for non-zeroes
    outer = pi.take(nzx) * pj.take(nzy)
+    if np.any(outer < 0):


Add a comment that this checks for overflow

jnothman · 2018-01-07T23:25:23Z

sklearn/metrics/cluster/supervised.py

@@ -602,6 +602,8 @@ def mutual_info_score(labels_true, labels_pred, contingency=None):
    contingency_nm = nz_val / contingency_sum
    # Don't need to calculate the full outer product, just for non-zeroes
    outer = pi.take(nzx) * pj.take(nzy)


Is it worthwhile always casting to int64? Usually there should either be a small contingency matrix, or a sparse one, so I don't think memory is a big issue...

IMO, I don't think that's required. But happy to do that if there's some special advantage over the current proposed method.

It merely avoids duplicated code

@jnothman do you mean casting the contingency variable to int64? IIRC this is what I was thinking as well.

That's certainly an option if contingency is sparse. I just meant removing the if line

Do you want the outer variable to be always converted to int64?

outer = pi.take(nzx).astype(np.int64) * pj.take(nzy).astype(np.int64)

Yes, that's my suggestion
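One further consideration in favour of always casting, offered as a sketch rather than something stated in the thread: a wrapped 32-bit product is not necessarily negative, so a check like `np.any(outer < 0)` may miss some overflows. The counts below are hypothetical and chosen so that the wrapped value happens to be positive.

```python
import numpy as np

# Hypothetical counts whose int32 product wraps to a positive value.
pi = np.array([100000], dtype=np.int32)
pj = np.array([100000], dtype=np.int32)

outer32 = pi * pj                                     # wraps silently
outer64 = pi.astype(np.int64) * pj.astype(np.int64)   # exact: 10_000_000_000

# A sign test does not flag this overflow, because the wrapped value is positive.
sign_check_fires = bool(np.any(outer32 < 0))
overflowed = bool(np.any(outer32 != outer64))

print(sign_check_fires, overflowed)
```

Casting unconditionally sidesteps the question of how to detect the overflow in the first place.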

glemaitre · 2018-01-10T22:35:58Z

sklearn/metrics/cluster/tests/test_supervised.py

+                       np.repeat(0, 814), np.repeat(1, 39),
+                       np.repeat(0, 316), np.repeat(1, 20)))
+
+    mutual_info_score(x.ravel(), y.ravel())


Did you forget to assert on the score, to be sure that we don't get NaN?

No, this check is only there to ensure that there's no overflow. Why should I check for NaN?

Previously, it was already possible to call the function, and it returned NaN (due to the overflow feeding into the log). You should at least check that the output is finite to be sure that there is no overflow. Otherwise, your current test also passes on the current master branch.

The test does not pass with 32-bit Python on the master branch but passes otherwise. I'll add the suggested check soon.
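The reasoning above can be sketched with NumPy alone. The helper `log_outer` below is hypothetical: it only mimics the `outer` product and a logarithm taken from it, and it is not the scikit-learn implementation. It shows why merely calling the function is not much of a test, while asserting finiteness is:

```python
import numpy as np


def log_outer(pi, pj, cast):
    """Hypothetical helper: log of the product of marginal counts."""
    if cast:
        pi = pi.astype(np.int64)
        pj = pj.astype(np.int64)
    outer = pi * pj
    # The log of a negative (wrapped) product is NaN; silence the warning
    # so the example runs quietly.
    with np.errstate(invalid="ignore"):
        return np.log(outer)


# Hypothetical counts whose true product (3.6e9) does not fit in int32.
pi = np.array([60000], dtype=np.int32)
pj = np.array([60000], dtype=np.int32)

without_cast = log_outer(pi, pj, cast=False)  # runs without raising, yields NaN
with_cast = log_outer(pi, pj, cast=True)

print(np.isfinite(without_cast).all(), np.isfinite(with_cast).all())
```

A test that only calls the uncast version would pass, because nothing raises; an assertion on `np.isfinite` is what separates the two cases.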

glemaitre · 2018-01-10T22:38:41Z

sklearn/metrics/cluster/supervised.py

@@ -601,7 +601,7 @@ def mutual_info_score(labels_true, labels_pred, contingency=None):
    log_contingency_nm = np.log(nz_val)
    contingency_nm = nz_val / contingency_sum
    # Don't need to calculate the full outer product, just for non-zeroes
-    outer = pi.take(nzx) * pj.take(nzy)
+    outer = pi.take(nzx).astype(np.int64) * pj.take(nzy).astype(np.int64)


I would cast before

pi = np.ravel(contingency.sum(axis=1, dtype=np.int64))
pj = np.ravel(contingency.sum(axis=0, dtype=np.int64))

This does not work for all NumPy versions, and it is not worth backporting the feature for one line. Hence I switched back to casting later.
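For a dense array, the two placements of the cast discussed here can be compared as follows. This is a sketch with a made-up contingency matrix; it says nothing about which NumPy or SciPy versions accept the `dtype` argument for other matrix types.

```python
import numpy as np

# Made-up dense contingency matrix with 32-bit counts.
contingency = np.array([[52632, 2529],
                        [14660, 793]], dtype=np.int32)

# Option 1: ask sum() to accumulate in int64.
pi_a = np.ravel(contingency.sum(axis=1, dtype=np.int64))
pj_a = np.ravel(contingency.sum(axis=0, dtype=np.int64))

# Option 2: sum in the native dtype, then cast before multiplying.
pi_b = np.ravel(contingency.sum(axis=1)).astype(np.int64)
pj_b = np.ravel(contingency.sum(axis=0)).astype(np.int64)

# Both give int64 marginals, so the outer product is computed in int64.
outer_a = np.outer(pi_a, pj_a)
outer_b = np.outer(pi_b, pj_b)

print(np.array_equal(outer_a, outer_b), outer_a.max())
```

With these counts the largest product exceeds the int32 range, and both options represent it exactly. Option 2 relies on the sums themselves not overflowing in the native dtype.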

glemaitre · 2018-01-10T22:42:16Z

sklearn/metrics/cluster/tests/test_supervised.py

+def test_int_overflow_mutual_info_score():
+    # Test overflow in mutual_info_classif
+    x = np.concatenate((np.repeat(1, 52632 + 2529), np.repeat(2, 14660+793),
+                       np.repeat(3, 3271+204), np.repeat(4, 814+39),


Put spaces around the arithmetic operators.

I think that np.array([0] * 10000 + [1] * 10000 + [2] * 10000) is easier to read than the calls to np.repeat.
You can also write the sums directly instead of spelling out the additions.
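As a quick check that the two spellings are interchangeable, the following sketch builds the same array both ways, using the counts from the reviewer's example rather than those of the actual test:

```python
import numpy as np

# Construction with np.repeat and np.concatenate, in the style of the test.
x_repeat = np.concatenate((np.repeat(0, 10000),
                           np.repeat(1, 10000),
                           np.repeat(2, 10000)))

# Construction with list repetition, as suggested in the review.
x_list = np.array([0] * 10000 + [1] * 10000 + [2] * 10000)

same = np.array_equal(x_repeat, x_list)
print(same, x_list.shape)
```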

glemaitre · 2018-01-14T11:08:03Z

The only thing missing is an entry in the what's new.
@thechargedneutron
Please add an entry to the change log at doc/whats_new/v0.20.rst under bug fixes. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

jnothman · 2018-01-14T20:42:03Z

Thanks @thechargedneutron

initial changes

80a29ea

thechargedneutron added 3 commits January 7, 2018 12:49

tests added

45ac46b

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

6763ae6

…into mutual_info_classif

print removed

c2258d6

jnothman reviewed Jan 7, 2018

thechargedneutron added 4 commits January 9, 2018 01:30

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

2827887

…into mutual_info_classif

comment added

d2c35aa

if removed

f36dc9f

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

3e6bd6d

…into mutual_info_classif

glemaitre requested changes Jan 10, 2018

thechargedneutron added 8 commits January 13, 2018 01:50

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

3272157

…into mutual_info_classif

casted before

c0685c8

casted again

912170a

test variable changed

c5c4e55

restored changes:

95d971f

pep8 corrected

8ea4594

finite added

f5a7b1d

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

d29976a

…into mutual_info_classif

jnothman approved these changes Jan 14, 2018

jnothman changed the title ~~[MRG] Solves integer overlow in mutual_info_score~~ [MRG+1] Solves integer overlow in mutual_info_score Jan 14, 2018

glemaitre approved these changes Jan 14, 2018

thechargedneutron added 2 commits January 15, 2018 00:06

what's new entry added

5721753

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

1551858

…into mutual_info_classif

jnothman merged commit 4a2b96f into scikit-learn:master Jan 14, 2018

aliddell mentioned this pull request Mar 20, 2018

fowlkes_mallows_score returns RuntimeWarning when variables get too big #9515

Closed

qinhanmin2014 mentioned this pull request Oct 2, 2018

[MRG] Fix numpy.int overflow in make_classification #10811

Merged
