Skip to content

DOC: Specify Units for Mutual Information Metrics #18288 #18641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

amrcode
Copy link
Contributor

@amrcode amrcode commented Oct 19, 2020

Fixes #18288

This PR adds documentation to the return values of the mutual-information-score functions to make it clear that the units are based on the natural logarithm. It also indicates how the parameters are related to the mathematical representations.

@cmarmo
Copy link
Contributor

cmarmo commented Oct 19, 2020

Hi @armcode , thanks for your pull request. I'm under the impression that something went wrong in your last merge with upstream.
To revert your branch to your first commit (f8a079d) you might want to use

git reset --hard f8a079d0717ac15dbcc37b409782c13f8ffc9055

@amrcode amrcode force-pushed the specify-units-for-mutual-information-metrics branch from 3e32ad7 to f8a079d Compare October 19, 2020 14:46
@amrcode
Copy link
Contributor Author

amrcode commented Oct 19, 2020

Yes I apologize--mixed in a rebase that shouldn't have been there. Thanks!

@cmarmo
Copy link
Contributor

cmarmo commented Oct 20, 2020

You have lint issues

sklearn/metrics/cluster/_supervised.py:625:80: E501 line too long (93 > 79 characters)
sklearn/metrics/cluster/_supervised.py:838:74: W291 trailing whitespace

To check the code that you changed for lint issues , you can run the following command:

git diff upstream/master -u -- "*.py" | flake8 --diff

or make flake8-diff (on unix-like system)

Copy link
Contributor

@cmarmo cmarmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @armcode. Just a minor comment to fix.
Also, adding U and V is not directly related to this pull request, but let's wait for a core dev review.

@cmarmo
Copy link
Contributor

cmarmo commented Dec 15, 2020

Hi @amrcode the failing test is probably unrelated with your pull request. Do you mind synchronizing with the last version in upstream? A useful tutorial is available here if you need help in doing that. Thanks!

@amrcode
Copy link
Contributor Author

amrcode commented Dec 15, 2020

Looks like it's ok now. Thanks!

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nitpicks but this is a good change.

@@ -740,10 +740,10 @@ def mutual_info_score(labels_true, labels_pred, *, contingency=None):
Parameters
----------
labels_true : int array, shape = [n_samples]
A clustering of the data into disjoint subsets.
A clustering of the data into disjoint subsets (U).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be nice to have something like

Suggested change
A clustering of the data into disjoint subsets (U).
A clustering of the data into disjoint subsets, called $U$ in the above formula.


labels_pred : int array-like of shape (n_samples,)
A clustering of the data into disjoint subsets.
A clustering of the data into disjoint subsets (V).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same reference here.

@@ -965,7 +967,8 @@ def normalized_mutual_info_score(labels_true, labels_pred, *,
Returns
-------
nmi : float
score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
score between 0.0 and 1.0 in normalized nats (based on the natural
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
score between 0.0 and 1.0 in normalized nats (based on the natural
Score between 0.0 and 1.0 in normalized nats (based on the natural

@@ -828,10 +829,10 @@ def adjusted_mutual_info_score(labels_true, labels_pred, *,
Parameters
----------
labels_true : int array, shape = [n_samples]
A clustering of the data into disjoint subsets.
A clustering of the data into disjoint subsets (U).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refer to the formula

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these make sense to me. I just pushed an update that should incorporate all of these. Thanks a lot!

@glemaitre
Copy link
Member

The linter is failing. This is probably due to some lines that are too long. Can you ensure to not have more than 79 characters?

Base automatically changed from master to main January 22, 2021 10:53
@amrcode
Copy link
Contributor Author

amrcode commented Feb 2, 2021

Sorry about that; I fixed the line lengths and couldn't figure out why the linter was still failing. There was a trailing space on one line that sneaked in. Looks good now.

Copy link
Contributor

@cmarmo cmarmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amrcode!

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amrcode! Here are a few suggestions.

Comment on lines 743 to 744
A clustering of the data into disjoint subsets, called $U$ in the
above formula.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To render the math formatting.

Suggested change
A clustering of the data into disjoint subsets, called $U$ in the
above formula.
A clustering of the data into disjoint subsets, called :math:`U` in the
above formula.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe of the sentence in a paragraph above can be changed to use:

This metric is furthermore symmetric: switching :math:`U` (i.e 
``label_true``) with :math:`V` (i.e. ``label_pred``) will return the 
same score value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that those patches can also be applied in some of the others metrics' signatures (e.g. the one of normalized_mutual_info_score for the math notation).

Comment on lines 747 to 748
A clustering of the data into disjoint subsets, called $V$ in the
above formula.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Suggested change
A clustering of the data into disjoint subsets, called $V$ in the
above formula.
A clustering of the data into disjoint subsets, called :math:`V` in the
above formula.

Comment on lines 835 to 840
A clustering of the data into disjoint subsets, called $U$ in the
above formula.

labels_pred : int array-like of shape (n_samples,)
A clustering of the data into disjoint subsets.
A clustering of the data into disjoint subsets, called $V$ in the
above formula.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same suggestions apply also here.

@amrcode amrcode force-pushed the specify-units-for-mutual-information-metrics branch from 349ceb7 to 26e2064 Compare June 11, 2021 12:59
@cmarmo
Copy link
Contributor

cmarmo commented Oct 5, 2021

@glemaitre , @jjerphan, if all your comments have been addressed this pull request can probably be merged? Thanks!

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up, @cmarmo.

LGTM, thank you @amrcode!

@jjerphan
Copy link
Member

jjerphan commented Oct 6, 2021

(I can't merge but @glemaitre can).

@glemaitre glemaitre merged commit 3bfb033 into scikit-learn:main Oct 12, 2021
@glemaitre
Copy link
Member

Thanks @amrcode LGTM

@amrcode
Copy link
Contributor Author

amrcode commented Oct 13, 2021

You're welcome! Thanks for reviewing and merging!

@glemaitre glemaitre mentioned this pull request Oct 23, 2021
10 tasks
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
glemaitre pushed a commit that referenced this pull request Oct 25, 2021
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specify Units for Mutual Information Metrics
4 participants