Supervised cluster metrics using sparse contingency matrix #4788
Conversation
Thanks for the PR.
btw, what kind of clustering do you use to get 100k clusters and how many data points do you have? This seems like a rather borderline use-case ^^
We could probably catch a MemoryError.
I think catching a memory error is not great, because numpy might segfault instead of throwing a memory error (sometimes?).
Also you might fill the RAM, bringing other applications down. So if the sparse matrix avoids that, it seems preferable.
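For concreteness, a minimal sketch of the try-dense-then-catch-MemoryError idea being debated here; this is not code from the PR, and the helper name and signature are made up. As the comments above note, the dense attempt may already exhaust RAM (or fail less gracefully) before the exception can be caught:

```python
import numpy as np
from scipy import sparse


def build_contingency(class_idx, cluster_idx, n_classes, n_clusters):
    # Illustrative fallback: try the dense contingency table first, and
    # only switch to a sparse representation if allocation fails.
    try:
        contingency = np.zeros((n_classes, n_clusters), dtype=np.int64)
        np.add.at(contingency, (class_idx, cluster_idx), 1)
        return contingency
    except MemoryError:
        # Same counts as a sparse matrix: one entry per sample, duplicates
        # are summed when converting COO -> CSR.
        data = np.ones(len(class_idx), dtype=np.int64)
        return sparse.coo_matrix((data, (class_idx, cluster_idx)),
                                 shape=(n_classes, n_clusters)).tocsr()
```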
I think this is the right way to go, and was just about to suggest an issue for it. Unfortunately, in the meantime, @amueller, you can get many clusters in something like cross-document coreference resolution; perhaps not 100k, but only an order of magnitude off for current manually annotated datasets.
@jnothman Sorry for the long delay. Yes, I can. Are you saying I should remove the
You don't need to resubmit the pull request, just push your changes to the same branch.
I'm not sure I'm doing this right. I have merge conflicts in sklearn/metrics/cluster/supervised.py. Do I need to (manually) merge the changes?
You've somehow reset your branch to master!
Yes, you'll have to manually merge the changes. First you'll have to go back and find your changes!
In sklearn.metrics.cluster.supervised, with large numbers of clusters (approx. >100k), construction of the contingency matrix runs out of memory (throws MemoryError). Using a sparse matrix instead allows construction of the contingency matrix.
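A rough sketch of the idea the description outlines, assuming the counts are accumulated in a scipy.sparse COO matrix; the helper name below is illustrative and not the PR's actual code:

```python
import numpy as np
from scipy import sparse


def sparse_contingency(labels_true, labels_pred):
    # Encode each labelling as integer indices 0..n-1.
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    # One entry per sample; duplicate (class, cluster) entries are summed on
    # conversion to CSR, so memory scales with the number of occupied cells
    # rather than n_classes * n_clusters.
    data = np.ones(class_idx.shape[0], dtype=np.int64)
    contingency = sparse.coo_matrix(
        (data, (class_idx, cluster_idx)),
        shape=(classes.shape[0], clusters.shape[0]))
    return contingency.tocsr()
```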
Modified adjusted_rand_score, homogeneity_completeness_v_measure, and mutual_info_score to accept a sparse matrix.
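A hedged usage sketch, assuming (as the description suggests) that the public call signatures of these metrics are unchanged and the sparse contingency matrix remains an internal detail:

```python
from sklearn.metrics import (adjusted_rand_score,
                             homogeneity_completeness_v_measure,
                             mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

# Callers still pass label arrays; whether the contingency matrix is built
# dense or sparse is handled inside the metrics.
print(adjusted_rand_score(labels_true, labels_pred))
print(mutual_info_score(labels_true, labels_pred))
print(homogeneity_completeness_v_measure(labels_true, labels_pred))
```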