
[MRG+1] Use sparse cluster contingency matrix by default #7419


Closed

Conversation

jnothman
Member

@jnothman jnothman commented Sep 14, 2016

Supersedes #7203. Obviates the need for max_n_classes to some extent, and removes it while it is still unreleased API (it was introduced in #5445 to fix #4976).

A sparse contingency matrix should handle the common case where there is a large number of clusters and the overlap among them is sparse; where there are few clusters we pay a penalty for using a sparse contingency matrix, but that penalty is small in absolute terms.
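
For illustration, a minimal sketch of the intended usage (the sparse keyword of contingency_matrix is the one added on this branch; the labelings below are random stand-ins, not the benchmark data):

# Sketch: contrasting dense and sparse contingency matrices when there are
# many clusters. Not code from the PR; labels are random for illustration.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

rng = np.random.RandomState(0)
labels_true = rng.randint(0, 250, size=31000)   # many "true" clusters
labels_pred = rng.randint(0, 250, size=31000)   # many predicted clusters

dense = contingency_matrix(labels_true, labels_pred)                 # 250 x 250 ndarray
sparse = contingency_matrix(labels_true, labels_pred, sparse=True)   # scipy.sparse matrix

# The sparse form stores only the non-empty (true, pred) cluster pairs,
# so memory and iteration cost scale with the overlap rather than with R * C.
print(dense.size, sparse.nnz)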

The change is being benchmarked on my sparse_cluster_contingency_option branch: an image is segmented into different numbers of clusters, then the evaluation of the metrics is timed for the different contingency matrix shapes. Benchmarks will be posted shortly.

@jnothman
Member Author

jnothman commented Sep 14, 2016

Benchmarks at https://github.com/jnothman/scikit-learn/blob/sparse_cluster_contingency_option/stuppie-metrics_sparse_comparison.ipynb (see the end) evaluate timings for a 31000-point dataset (an image), with contingency matrix shapes drawn from {5, 10, 20, 50, 100, 250} × {5, 10, 20, 50, 100, 250}. We see the following (a rough reproduction sketch is given after the list):

  • small absolute differences (<4ms) for homogeneity_completeness_v_measure
  • negligible or negative differences (substantially faster at low density) for adjusted_rand_score
  • small absolute (<10ms) and relative (<0.3x, proportional to density) differences for adjusted_mutual_info_score
  • small absolute differences (<3ms) for normalized_mutual_info_score
  • small absolute differences (<3ms) for fowlkes_mallows_score
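
A rough sketch of how such a timing sweep could be reproduced (illustrative only: it uses random labelings rather than the image segmentations from the linked notebook, and only a subset of the cluster counts above):

import time
import numpy as np
from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score, fowlkes_mallows_score)

rng = np.random.RandomState(42)
n_samples = 31000
for n_true in (5, 50, 250):
    for n_pred in (5, 50, 250):
        # Random labelings stand in for the image segmentations used in the notebook.
        y_true = rng.randint(0, n_true, size=n_samples)
        y_pred = rng.randint(0, n_pred, size=n_samples)
        for metric in (adjusted_rand_score, adjusted_mutual_info_score,
                       normalized_mutual_info_score, fowlkes_mallows_score):
            tic = time.time()
            metric(y_true, y_pred)
            print("%3d x %3d  %-30s %.1f ms"
                  % (n_true, n_pred, metric.__name__, 1e3 * (time.time() - tic)))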

@jnothman jnothman force-pushed the sparse_cluster_contingency branch from 78e0464 to d91f939 on September 14, 2016 13:53

# Compute the ARI using the contingency data
sum_comb_c = sum(comb2(n_c) for n_c in contingency.sum(axis=1))
sum_comb_k = sum(comb2(n_k) for n_k in contingency.sum(axis=0))
if isinstance(contingency, np.ndarray):
Member

Why keep the code for the dense case if sparse=True in the previous line?

Member Author

forgot :)

Member

Given the previous line with sparse=True there is no need for supporting array-based contingency matrix here. We can delete this branch of the if statement as well as the final else clause.
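
For context, a sketch of what the sparse-only computation might look like once the dense branch is removed (illustrative; comb2 is assumed to be the pairs-count helper used in the diff above, and contingency is assumed to be a scipy.sparse matrix):

import numpy as np
from scipy.special import comb

def comb2(n):
    # Number of unordered pairs among n items: n * (n - 1) / 2.
    return comb(n, 2, exact=False)

def ari_pair_counts(contingency):
    # Marginal sums of a sparse matrix come back as 2D matrices, hence the ravel.
    sum_comb_c = sum(comb2(n_c) for n_c in np.ravel(contingency.sum(axis=1)))
    sum_comb_k = sum(comb2(n_k) for n_k in np.ravel(contingency.sum(axis=0)))
    # Only the explicitly stored (non-zero) cells contribute co-clustered pairs.
    sum_comb = sum(comb2(n_ij) for n_ij in contingency.data)
    return sum_comb, sum_comb_c, sum_comb_k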

@ogrisel
Member

ogrisel commented Sep 14, 2016

Thanks very much, I am in favor of always using sparse=True in those metrics and removing max_n_classes.

@jnothman jnothman force-pushed the sparse_cluster_contingency branch from 84f1758 to 5b76cd5 on September 14, 2016 14:13
@jnothman jnothman force-pushed the sparse_cluster_contingency branch from 4469114 to 0cbe30d on September 14, 2016 14:17
@jnothman
Member Author

After giving myself a review and making numerous cosmetic tweaks, I'm going to bed. Would be a nice surprise to wake up to see this merged :p

@jnothman
Member Author

(Unfortunately, my numerous cosmetic tweaks have overwhelmed appveyor, and I do not have perms to cancel queued tasks there.)

@ogrisel
Member

ogrisel commented Sep 14, 2016

(Unfortunately, my numerous cosmetic tweaks have overwhelmed appveyor, and I do not have perms to cancel queued tasks there.)

I will do it and send you the credentials by private email.

@ogrisel
Member

ogrisel commented Sep 14, 2016

BTW I confirm that this PR solves the original issue (#4976) with large floating-point-valued vectors passed to clustering metrics:

>>> from sklearn.metrics import mutual_info_score
>>> import numpy as np
>>> y1 = np.random.randn(int(1e5))
>>> y2 = np.random.randn(int(1e5))
>>> %time mutual_info_score(y1, y2)
CPU times: user 96 ms, sys: 8 ms, total: 104 ms
Wall time: 99.1 ms
11.512925464970223
>>> %load_ext memory_profiler
>>> %memit mutual_info_score(y1, y2)
peak memory: 75.73 MiB, increment: 8.95 MiB

There is definitely no need for the ugly max_n_classes in our public API.

max_n_classes=max_n_classes)
contingency = np.array(contingency, dtype='float')
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
contingency = contingency.astype(float)
Member

.astype(float) is not guaranteed to use the same level of precision on all platforms. To get platform agnostic results it's better to be specific:

contingency = contingency.astype(np.float64)

@ogrisel
Member

ogrisel commented Sep 14, 2016

Hmm, it seems that I made a bunch of comments on an old version of the diff. Sorry for the noise.

Anyway, besides my comments on astype(float), +1 for merge.

@ogrisel ogrisel changed the title from [MRG] Use sparse cluster contingency matrix by default to [MRG+1] Use sparse cluster contingency matrix by default on Sep 14, 2016
@@ -28,8 +28,8 @@ def expected_mutual_information(contingency, int n_samples):
#cdef np.ndarray[int, ndim=2] start, end
R, C = contingency.shape
N = <DOUBLE>n_samples
a = np.sum(contingency, axis=1).astype(np.int32)
b = np.sum(contingency, axis=0).astype(np.int32)
a = np.ravel(contingency.sum(axis=1).astype(np.int32))
Member

why is the ravel needed?

Member

@ogrisel ogrisel Sep 14, 2016

Because the contingency matrix can now be a sparse matrix, and summing a sparse matrix over its rows returns a 2D matrix with a single column instead of a 1D array.
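
A quick standalone illustration of that point (a toy example, not code from the diff):

import numpy as np
import scipy.sparse as sp

contingency = sp.csr_matrix(np.array([[2, 0], [1, 3]]))
row_sums = contingency.sum(axis=1)
print(row_sums.shape)        # (2, 1): a 2D matrix with a single column
flat = np.ravel(row_sums)
print(flat, flat.shape)      # [2 4] (2,): the 1D array the Cython code expects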

@ogrisel
Member

ogrisel commented Sep 14, 2016

@amueller if you agree I can address the comments, merge and backport while @jnothman is AFK :)

@amueller
Member

@ogrisel sounds good :)

@ogrisel
Member

ogrisel commented Sep 14, 2016

polished, squashed, rebased and backported.

@ogrisel ogrisel closed this Sep 14, 2016
@jnothman
Member Author

ooh :) you two are so kindly obliging!


Successfully merging this pull request may close these issues.

metrics.mutual_info_score hangs when given real vectors