Skip to content

[MRG+2] Raise ValueError for metrics.cluster.supervised with too many classes #5445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 19, 2015

Conversation

tomMoral
Copy link
Contributor

Fix for #4976

  • Add max_n_classes param to cluster.supervised metric
  • Add testing for the Value Error
  • Check that n_clusters,n_classes are not too high in contingency matrix

- Add testing for the Value Error
- Check that n_clusters,n_classes are not too high in contingency matrix
@@ -671,7 +728,7 @@ def adjusted_mutual_info_score(labels_true, labels_pred):
return ami


def normalized_mutual_info_score(labels_true, labels_pred):
def normalized_mutual_info_score(labels_true, labels_pred, max_n_classes=5000):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seems to be above column 79 here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a 79 line I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my mistake

-Now the overall test time is below .2s for cluster metrics
@TomDLT
Copy link
Member

TomDLT commented Oct 19, 2015

LGTM

@@ -226,6 +246,11 @@ def homogeneity_completeness_v_measure(labels_true, labels_pred):
labels_pred : array, shape = [n_samples]
cluster labels to evaluate

max_n_classes : int
Maximal number of class handled by the adjusted_rand_score
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "number of classes", IMHO

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Raise ValueError for metrics.cluster.supervised with too many classes [MRG+1] Raise ValueError for metrics.cluster.supervised with too many classes Oct 19, 2015
@GaelVaroquaux GaelVaroquaux changed the title [MRG+1] Raise ValueError for metrics.cluster.supervised with too many classes [MRG+2] Raise ValueError for metrics.cluster.supervised with too many classes Oct 19, 2015
@GaelVaroquaux
Copy link
Member

LGTM. Waiting for travis to finish and then merging.

@GaelVaroquaux
Copy link
Member

Travis ran. Merging!

GaelVaroquaux added a commit that referenced this pull request Oct 19, 2015
[MRG+2] Raise ValueError for metrics.cluster.supervised with too many classes
@GaelVaroquaux GaelVaroquaux merged commit f6b85d9 into scikit-learn:master Oct 19, 2015
@tomMoral tomMoral deleted the devTom branch October 19, 2015 21:38
@amueller
Copy link
Member

That's a clever way to deal with this. I'm not sure it is better than the type_of_target logic, which uses _is_integral_float, though. The issue came up because someone gave floats to the functions. And I think we should deal with this issue consistently.

@amueller
Copy link
Member

If we do keep this solution, it needs a whatsnew entry, refering to this PR and a versionadded for the option.

@jnothman
Copy link
Member

FWIW, I think this is the wrong solution, when most of those metrics should be using a sparse contingency matrix, as per #4788. (Apart from concerns like passing continuous targets by accident, there are problem setting for clustering where the assignments are very sparse, such as coreference resolution.) Would anyone mind if we deprecated max_n_classes? I think it's very obscure.

@jnothman
Copy link
Member

The issue here is really a validation issue. If someone puts continuous scores into a clustering metric we can either explicitly catch that and raise an error; otherwise, we should return a result of 0.

There are, on the other hand, real clustering problems with a large number of clusters, but any likely clustering is sparse.

@ogrisel
Copy link
Member

ogrisel commented Sep 14, 2016

For the sake of navigating between issues, a rebased version of #4788 to implement sparse contengy matrix computation has been re-submitted as #7203.

@jnothman
Copy link
Member

Never mind that, I'm about to make a PR of a more complete version of #7203.

@ogrisel
Copy link
Member

ogrisel commented Sep 14, 2016

I was about to do the same :) I will wait for yours.

@jnothman
Copy link
Member

oh dear. :)

On 14 September 2016 at 21:38, Olivier Grisel notifications@github.com
wrote:

I was about to do the same :) I will wait for yours.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#5445 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6wi1kefvLT7r1WO9p00w9h4p-7v7ks5qp9ylgaJpZM4GROGL
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants