Skip to content

FIX Fixes issue with exatly_zero_info_score #19179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jan 16, 2021

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #19165

What does this implement/fix? Explain your changes.

The issue stemmed from an numerical error where log_N did not cancel out log_a

@glemaitre
Copy link
Member

@thomasjpfan How did you reproduce the error. I installed the same library version as on the CI and I could not reproduce it. Would it be linked to a different compiler?

@thomasjpfan
Copy link
Member Author

I was never able to reproduce the error locally. I had to slowly use the CI to figure out where the issue was.

@thomasjpfan thomasjpfan marked this pull request as draft January 14, 2021 21:53
@glemaitre
Copy link
Member

glemaitre commented Jan 14, 2021 via email

@thomasjpfan
Copy link
Member Author

Even so, the issue persist. I think I can create a small test example using np.log that fails on the CI, but passes locally.

@glemaitre
Copy link
Member

It seems that this is only an issue in openml now?

@thomasjpfan thomasjpfan marked this pull request as ready for review January 15, 2021 22:45
Copy link
Member Author

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now ready for review. This is the smallest diff I can come up with while also resolving the issue on [scipy-dev].

Comment on lines +41 to +42
log_a = np.log(a)
log_b = np.log(b)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more memory efficient because we would not need to create the 2d array anymore.

@@ -38,9 +38,10 @@ def expected_mutual_information(contingency, int n_samples):
term1 = nijs / N
# term2 is log((N*nij) / (a * b)) == log(N * nij) - log(a * b)
# term2 uses the outer product
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you adapt/remove this comment. With this PR, no outer product anymore.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @thomasjpfan

@@ -795,6 +795,7 @@ def mutual_info_score(labels_true, labels_pred, *, contingency=None):
log_outer = -np.log(outer) + log(pi.sum()) + log(pj.sum())
mi = (contingency_nm * (log_contingency_nm - log(contingency_sum)) +
contingency_nm * log_outer)
mi = np.where(np.abs(mi) < np.finfo(mi.dtype).eps, 0.0, mi)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, but I wonder what other places we need to be doing it!

@@ -54,12 +54,12 @@ def expected_mutual_information(contingency, int n_samples):
start = np.maximum(start, 1)
end = np.minimum(np.resize(a, (C, R)).T, np.resize(b, (R, C))) + 1
# emi itself is a summation over the various values.
emi = 0
emi = 0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emi is defined as DOUBLE, not sure why you've added the .0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is more explicit if you did not read the definition :)

@adrinjalali adrinjalali merged commit 4b8ab92 into scikit-learn:master Jan 16, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 18, 2021
* ENH Fixes issue with exatly_zero_info_score [scipy-dev]

* ENH Remove unneeded line [scipy-dev]

* WIP Keep types [scipy-dev]

* REV Smaller diff [scipy-dev]

* WIP Expand mutual_info_score [scipy-dev]

* WIP Removes float casting [scipy-dev]

* WIP Adds casting back in

* CI [scipy-dev]

* WIP Casting is not needed [scipy-dev]

* WIP Only clip [scipy-dev]

* REV Smaller diff [scipy-dev]

* WIP Place expected_mutual_information diff back [scipy-dev]

* ENH Uses around

* WIP Use where again [scipy-dev]

* ENH Adjust comments to match code
jeremiedbb pushed a commit that referenced this pull request Jan 19, 2021
* ENH Fixes issue with exatly_zero_info_score [scipy-dev]

* ENH Remove unneeded line [scipy-dev]

* WIP Keep types [scipy-dev]

* REV Smaller diff [scipy-dev]

* WIP Expand mutual_info_score [scipy-dev]

* WIP Removes float casting [scipy-dev]

* WIP Adds casting back in

* CI [scipy-dev]

* WIP Casting is not needed [scipy-dev]

* WIP Only clip [scipy-dev]

* REV Smaller diff [scipy-dev]

* WIP Place expected_mutual_information diff back [scipy-dev]

* ENH Uses around

* WIP Use where again [scipy-dev]

* ENH Adjust comments to match code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

numerical stability bug in adjusted_mutual_info_score with numpy master (1.20 dev)
4 participants