
Supervised cluster metrics using sparse contingency matrix #4788


Closed

Conversation

@stuppie (Contributor) commented May 29, 2015

In sklearn.metrics.cluster.supervised, with a large number of clusters (approx. >100k), constructing the dense contingency matrix runs out of memory (raises MemoryError):

>>> from random import randrange
>>> from sklearn.metrics.cluster import contingency_matrix
>>> labels_true = [randrange(100000) for x in range(100000)]
>>> labels_pred = [randrange(100000) for x in range(100000)]
>>> contingency = contingency_matrix(labels_true, labels_pred)
Traceback (most recent call last):
  ...
MemoryError
>>> contingency = contingency_matrix(labels_true, labels_pred, sparse=True)  # no error

Using a sparse matrix instead allows construction of the contingency matrix.

Modified the following functions to accept a sparse contingency matrix:

- adjusted_rand_score
- homogeneity_completeness_v_measure
- mutual_info_score
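As a minimal usage sketch (assuming the sparse code paths from this PR; mutual_info_score already accepts a precomputed matrix through its contingency keyword):

>>> from sklearn.metrics.cluster import contingency_matrix, mutual_info_score
>>> contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
>>> mi = mutual_info_score(labels_true, labels_pred, contingency=contingency)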

@amueller (Member) commented Jun 3, 2015

Thanks for the PR.
Do we want to (or can we) guess whether to go sparse? It seems like the number of non-zeros or the shape of the matrix might be a good heuristic.
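A hypothetical version of that heuristic (should_use_sparse and the byte budget are made up here, not part of the PR) could compare the dense matrix's footprint against a threshold:

import numpy as np

def should_use_sparse(labels_true, labels_pred, max_dense_bytes=10**8):
    # A dense int64 contingency matrix costs 8 bytes per
    # (true class, predicted class) cell, but at most len(labels_true)
    # cells can be non-zero, so very large shapes are mostly zeros.
    n_true = np.unique(labels_true).size
    n_pred = np.unique(labels_pred).size
    return n_true * n_pred * 8 > max_dense_bytes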

@amueller (Member) commented Jun 3, 2015

btw, what kind of clustering do you use to get 100k clusters and how many data points do you have? This seems like a rather borderline use-case ^^

@stuppie (Contributor, Author) commented Jun 3, 2015

We could probably catch a MemoryError in contingency_matrix and call it again with sparse=True, and generate a warning saying this is happening?
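Something like this sketch (contingency_matrix_with_fallback is a made-up wrapper, just to illustrate the idea):

import warnings

from sklearn.metrics.cluster import contingency_matrix

def contingency_matrix_with_fallback(labels_true, labels_pred):
    try:
        # try the existing dense construction first
        return contingency_matrix(labels_true, labels_pred)
    except MemoryError:
        # the dense matrix did not fit; warn and retry sparsely
        warnings.warn("contingency matrix too large for dense storage; "
                      "falling back to sparse=True")
        return contingency_matrix(labels_true, labels_pred, sparse=True)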
Clustering tens of millions of proteins (using external tools); validating the clustering with sklearn.

@amueller (Member) commented Jun 4, 2015

I think catching a MemoryError is not great, because numpy might segfault instead of raising a MemoryError (sometimes?)

@amueller (Member) commented Jun 4, 2015

Also, you might fill the RAM, bringing other applications down. So if the contingency matrix is large and sparse, maybe we don't want to call toarray on it; then we might be able to keep the interface much simpler.
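For instance, mutual information can be computed from the non-zero cells alone, without densifying. A sketch (assuming a COO matrix; this is not the PR's actual code):

import numpy as np
from scipy import sparse

def mi_from_sparse_contingency(contingency):
    c = sparse.coo_matrix(contingency)
    n = c.sum()                   # total number of samples
    a = np.ravel(c.sum(axis=1))   # marginals over true labels
    b = np.ravel(c.sum(axis=0))   # marginals over predicted labels
    nz = c.data.astype(float)
    # only cells with n_ij > 0 contribute to
    # MI = sum_ij (n_ij / n) * log(n_ij * n / (a_i * b_j))
    return np.sum((nz / n) * np.log(nz * n / (a[c.row] * b[c.col])))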

@jnothman (Member)
I think this is the right way to go, and was just about to suggest an issue for it. Unfortunately, in the meantime max_n_classes has been introduced, which I think is the wrong solution; this PR needs a rebase as a result. If I give this PR a review, will you be able to work on it to completion, @stuppie?

@amueller you can get many clusters in something like cross-document coreference resolution; perhaps not 100k, but only an order of magnitude off for current manually annotated datasets.

@stuppie (Contributor, Author) commented Aug 17, 2016
stuppie commented Aug 17, 2016

@jnothman Sorry for the long delay. Yes, I can. Are you saying I should remove the max_n_classes code, rebase, and re-submit the pull request?

@jnothman (Member)
You don't need to resubmit the pull request; just git push -f your branch back to GitHub.

@stuppie closed this Aug 17, 2016
@stuppie (Contributor, Author) commented Aug 17, 2016
stuppie commented Aug 17, 2016

I'm not sure I'm doing this right. I have merge conflicts in sklearn/metrics/cluster/supervised.py. Do I need to (manually) merge the sparse and max_n_classes changes?

@jnothman (Member)
You've somehow reset your branch to master!

@jnothman (Member)
Yes, you'll have to manually merge the changes. First you'll have to go back and find your changes!
