Skip to content

[MRG+1] Solves integer overlow in mutual_info_score #10414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jan 14, 2018

Conversation

thechargedneutron
Copy link
Contributor

@thechargedneutron thechargedneutron commented Jan 7, 2018

Reference Issues/PRs

Fixes #9772

@thechargedneutron
Copy link
Contributor Author

@lesteve

Not sure what the best fix actually is, maybe casting pi and pj as int64, this way you make sure that pi.sum() and pj.sum() do not overflow either.

It don't think changing pi to int64 is of great use. We encounter an integer overflow mostly when we multiply two large numbers. Sum should work in most cases.

@@ -602,6 +602,8 @@ def mutual_info_score(labels_true, labels_pred, contingency=None):
contingency_nm = nz_val / contingency_sum
# Don't need to calculate the full outer product, just for non-zeroes
outer = pi.take(nzx) * pj.take(nzy)
if np.any(outer < 0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment that this checks for overflow

@@ -602,6 +602,8 @@ def mutual_info_score(labels_true, labels_pred, contingency=None):
contingency_nm = nz_val / contingency_sum
# Don't need to calculate the full outer product, just for non-zeroes
outer = pi.take(nzx) * pj.take(nzy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worthwhile always casting to int64? Usually there should either be a small contingency matrix, or a sparse one, so I don't think memory is a big issue...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I don't think that's required. But happy to do that if there's some special advantage over the current proposed method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It merely avoids duplicated code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnothman do you mean casting the contigency variable to int64? IIRC this is what I was thinking as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's certainly an option if contingency is sparse. I just meant removing the if line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want the outer variable to be always converted to int64?

outer = pi.take(nzx).astype(np.int64) * pi.take(nzy).astype(np.int64)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's my suggestion

np.repeat(0, 814), np.repeat(1, 39),
np.repeat(0, 316), np.repeat(1, 20)))

mutual_info_score(x.ravel(), y.ravel())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you forgot to assert the score to be sure that we don't have NaN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this check is only to be assured there's no overflow. Why do I check for a NaN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, it was already possible to call the function and it was resulting to nan (due to the overflow in the log). You should at least check that the output is finite to be sure that they is no overflow. Otherwise, your current test is also passing in the current master branch,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test does not pass in 32 bit Python in master branch but passes otherwise. I'll soon add the suggested check.

@@ -601,7 +601,7 @@ def mutual_info_score(labels_true, labels_pred, contingency=None):
log_contingency_nm = np.log(nz_val)
contingency_nm = nz_val / contingency_sum
# Don't need to calculate the full outer product, just for non-zeroes
outer = pi.take(nzx) * pj.take(nzy)
outer = pi.take(nzx).astype(np.int64) * pj.take(nzy).astype(np.int64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would cast before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pi = np.ravel(contingency.sum(axis=1, dtype=np.int64))
pj = np.ravel(contingency.sum(axis=0, dtype=np.int64))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work for all NumPy versions. And not worth to backport the feature for one line. Hence switched back to casting it later.

def test_int_overflow_mutual_info_score():
# Test overflow in mutual_info_classif
x = np.concatenate((np.repeat(1, 52632 + 2529), np.repeat(2, 14660+793),
np.repeat(3, 3271+204), np.repeat(4, 814+39),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put space between the arithmetic signs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is easier to read np.array([0] * 10000 + [1] * 10000 + [2] * 10000) than the call with np.repeat.
You can also make the sum directly instead of making the addition.

@jnothman jnothman changed the title [MRG] Solves integer overlow in mutual_info_score [MRG+1] Solves integer overlow in mutual_info_score Jan 14, 2018
@glemaitre
Copy link
Member

It is only missing an entry in the what's new
@thechargedneutron
Please add an entry to the change log at doc/whats_new/v0.20.rst under bug fixes. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@jnothman jnothman merged commit 4a2b96f into scikit-learn:master Jan 14, 2018
@jnothman
Copy link
Member

Thanks @thechargedneutron

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Windows only: problem with mutual_info_classif when discrete=True
4 participants