
Supervised cluster metrics using sparse contingency matrix #4788


Closed

Conversation

@stuppie (Contributor) commented May 29, 2015

In sklearn.metrics.cluster.supervised, with a large number of clusters (approx. >100k), constructing the dense contingency matrix runs out of memory (raises MemoryError):

>>> from random import randrange
>>> from sklearn.metrics.cluster import contingency_matrix
>>> labels_true = [randrange(100000) for x in range(100000)]
>>> labels_pred = [randrange(100000) for x in range(100000)]
>>> contingency = contingency_matrix(labels_true, labels_pred)
Traceback (most recent call last):
  ...
MemoryError
>>> contingency = contingency_matrix(labels_true, labels_pred, sparse=True)  # no error

Using a sparse matrix instead allows construction of the contingency matrix.

Modified the following functions to accept a sparse contingency matrix:

- adjusted_rand_score
- homogeneity_completeness_v_measure
- mutual_info_score
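As a minimal usage sketch (assuming the sparse code paths from this PR; mutual_info_score already accepts a precomputed matrix through its contingency keyword):

>>> from sklearn.metrics.cluster import contingency_matrix, mutual_info_score
>>> contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
>>> mi = mutual_info_score(labels_true, labels_pred, contingency=contingency)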

@amueller (Member) commented Jun 3, 2015

Thanks for the PR.
Do we want to (or can we) guess whether to go sparse? It seems like the number of non-zeros or the shape of the matrix might be a good heuristic.
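A hypothetical version of that heuristic (should_use_sparse and the byte budget are made up here, not part of the PR) could compare the dense matrix's footprint against a threshold:

import numpy as np

def should_use_sparse(labels_true, labels_pred, max_dense_bytes=10**8):
    # A dense int64 contingency matrix costs 8 bytes per
    # (true class, predicted class) cell, but at most len(labels_true)
    # cells can be non-zero, so very large shapes are mostly zeros.
    n_true = np.unique(labels_true).size
    n_pred = np.unique(labels_pred).size
    return n_true * n_pred * 8 > max_dense_bytes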

@amueller (Member) commented Jun 3, 2015

btw, what kind of clustering do you use to get 100k clusters and how many data points do you have? This seems like a rather borderline use-case ^^

@stuppie (Contributor, Author) commented Jun 3, 2015

We could probably catch a MemoryError in contingency_matrix and call it again with sparse=True, and generate a warning saying this is happening?
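Something like this sketch (contingency_matrix_with_fallback is a made-up wrapper, just to illustrate the idea):

import warnings

from sklearn.metrics.cluster import contingency_matrix

def contingency_matrix_with_fallback(labels_true, labels_pred):
    try:
        # try the existing dense construction first
        return contingency_matrix(labels_true, labels_pred)
    except MemoryError:
        # the dense matrix did not fit; warn and retry sparsely
        warnings.warn("contingency matrix too large for dense storage; "
                      "falling back to sparse=True")
        return contingency_matrix(labels_true, labels_pred, sparse=True)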
Clustering tens of millions of proteins (using external tools); validating the clustering with sklearn.

@amueller (Member) commented Jun 4, 2015

I think catching a MemoryError is not great, because numpy might segfault instead of raising a MemoryError (sometimes?)

@amueller (Member) commented Jun 4, 2015

Also, you might fill the RAM, bringing other applications down. So if the contingency matrix is large and sparse, maybe we don't want to call toarray on it; then we might be able to keep the interface much simpler.
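For instance, mutual information can be computed from the non-zero cells alone, without densifying. A sketch (assuming a COO matrix; this is not the PR's actual code):

import numpy as np
from scipy import sparse

def mi_from_sparse_contingency(contingency):
    c = sparse.coo_matrix(contingency)
    n = c.sum()                   # total number of samples
    a = np.ravel(c.sum(axis=1))   # marginals over true labels
    b = np.ravel(c.sum(axis=0))   # marginals over predicted labels
    nz = c.data.astype(float)
    # only cells with n_ij > 0 contribute to
    # MI = sum_ij (n_ij / n) * log(n_ij * n / (a_i * b_j))
    return np.sum((nz / n) * np.log(nz * n / (a[c.row] * b[c.col])))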

@jnothman (Member)
I think this is the right way to go, and was just about to suggest an issue for it. Unfortunately, in the meantime max_n_classes has been introduced, which I think is the wrong solution; this PR needs a rebase as a result. If I give this PR a review, will you be able to work on it to completion, @stuppie?

@amueller you can get many clusters in something like cross-document coreference resolution; perhaps not 100k, but only an order of magnitude off for current manually annotated datasets.

@stuppie (Contributor, Author) commented Aug 17, 2016
stuppie commented Aug 17, 2016

@jnothman Sorry for the long delay. Yes, I can. Are you saying I should remove the max_n_classes code, rebase, and re-submit the pull request?

@jnothman (Member)
You don't need to resubmit the pull request; just git push -f your branch back to GitHub.

@stuppie closed this Aug 17, 2016
@stuppie (Contributor, Author) commented Aug 17, 2016
stuppie commented Aug 17, 2016

I'm not sure I'm doing this right. I have merge conflicts in sklearn/metrics/cluster/supervised.py. Do I need to (manually) merge the sparse and max_n_classes changes?

@jnothman (Member)
You've somehow reset your branch to master!

@jnothman (Member)
Yes, you'll have to manually merge the changes. First you'll have to go back and find your changes!
