[MRG+1] Use sparse cluster contingency matrix by default #7419
Conversation
Benchmarks at https://github.com/jnothman/scikit-learn/blob/sparse_cluster_contingency_option/stuppie-metrics_sparse_comparison.ipynb evaluate timings for a 31000-point dataset (an image), with contingency matrix shapes in ({5, 10, 20, 50, 100, 250}, {5, 10, 20, 50, 100, 250}); see the end of the notebook for the resulting timings.
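For reference, a minimal sketch of the two contingency representations being compared, written with plain NumPy/SciPy rather than the benchmark notebook's code (the 31000-point, 50x50 shape mirrors one of the benchmarked settings):

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
# Roughly the benchmark's scale: 31000 points, 50 true classes x 50 clusters.
c = rng.randint(0, 50, size=31000)  # encoded true labels
k = rng.randint(0, 50, size=31000)  # encoded predicted labels

# Dense: one counter per (class, cluster) cell, even for empty cells.
dense = np.zeros((50, 50), dtype=np.int64)
np.add.at(dense, (c, k), 1)

# Sparse: COO format sums duplicate (row, col) entries on conversion,
# and only nonzero cells are ever stored.
ones = np.ones(len(c), dtype=np.int64)
sp = sparse.coo_matrix((ones, (c, k)), shape=(50, 50)).tocsr()

assert np.array_equal(sp.toarray(), dense)
```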
Force-pushed "Remove max_n_classes option": compare 78e0464 to d91f939.
```python
# Compute the ARI using the contingency data
sum_comb_c = sum(comb2(n_c) for n_c in contingency.sum(axis=1))
sum_comb_k = sum(comb2(n_k) for n_k in contingency.sum(axis=0))
if isinstance(contingency, np.ndarray):
```
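The snippet above relies on a pair-counting helper; here is a standalone sketch in which `comb2` is reimplemented with `scipy.special.comb` (this stand-in may differ from scikit-learn's internal helper):

```python
import numpy as np
from scipy.special import comb

def comb2(n):
    # Number of unordered pairs among n items: n * (n - 1) / 2.
    return comb(n, 2, exact=True)

contingency = np.array([[10, 2],
                        [1, 7]])
# Row marginals count points per true class; column marginals per cluster.
sum_comb_c = sum(comb2(int(n_c)) for n_c in contingency.sum(axis=1))
sum_comb_k = sum(comb2(int(n_k)) for n_k in contingency.sum(axis=0))
```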
Why keep the code for the dense case, given `sparse=True` in the previous line?
forgot :)
Given the previous line with `sparse=True`, there is no need to support an array-based contingency matrix here. We can delete this branch of the if statement as well as the final else clause.
Thanks very much, I am in favor of always using
Force-pushed: compare 84f1758 to 5b76cd5.
Force-pushed: compare 4469114 to 0cbe30d.
After giving myself a review and making numerous cosmetic tweaks, I'm going to bed. Would be a nice surprise to wake up to see this merged :p
(Unfortunately, my numerous cosmetic tweaks have overwhelmed appveyor, and I do not have perms to cancel queued tasks there.)
I will do it and send you the credentials by private email.
BTW I confirm that this PR solves the original issue (#4976) with large floating point valued vectors passed to clustering metrics:

```python
>>> from sklearn.metrics import mutual_info_score
>>> import numpy as np
>>> y1 = np.random.randn(int(1e5))
>>> y2 = np.random.randn(int(1e5))
>>> %time mutual_info_score(y1, y2)
CPU times: user 96 ms, sys: 8 ms, total: 104 ms
Wall time: 99.1 ms
11.512925464970223
>>> %load_ext memory_profiler
>>> %memit mutual_info_score(y1, y2)
peak memory: 75.73 MiB, increment: 8.95 MiB
```

There is definitely no need for the ugly `max_n_classes` workaround.
```diff
-                                      max_n_classes=max_n_classes)
-    contingency = np.array(contingency, dtype='float')
+    contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
+    contingency = contingency.astype(float)
```
`.astype(float)` is not guaranteed to use the same level of precision on all platforms. To get platform-agnostic results it's better to be specific:

```python
contingency = contingency.astype(np.float64)
```
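A small illustration of the suggestion, assuming a SciPy CSR matrix as in this PR (a sketch, not the PR's own code):

```python
import numpy as np
from scipy import sparse

C = sparse.csr_matrix(np.array([[2, 0], [1, 3]]))

# Python's `float` maps to the platform's default double; np.float64
# pins the width explicitly, which is the reviewer's point here.
C64 = C.astype(np.float64)
assert C64.dtype == np.float64
```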
Hum, it seems that I made a bunch of comments on an old version of the diff. Sorry for the noise. Anyways, besides my comments on
```diff
@@ -28,8 +28,8 @@ def expected_mutual_information(contingency, int n_samples):
     #cdef np.ndarray[int, ndim=2] start, end
     R, C = contingency.shape
     N = <DOUBLE>n_samples
-    a = np.sum(contingency, axis=1).astype(np.int32)
-    b = np.sum(contingency, axis=0).astype(np.int32)
+    a = np.ravel(contingency.sum(axis=1).astype(np.int32))
```
Why is the `ravel` needed?
Because it can now be a sparse matrix and the sum over the rows of a sparse matrix is a sparse matrix with a single column instead of a 1D array.
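This behaviour can be seen directly in a small sketch (not the Cython code itself):

```python
import numpy as np
from scipy import sparse

C = sparse.csr_matrix(np.array([[1, 2],
                                [3, 4]]))

row_sums = C.sum(axis=1)  # a 2-D np.matrix of shape (2, 1), not a 1-D array
a = np.ravel(row_sums.astype(np.int32))  # flattened to a 1-D ndarray

assert row_sums.shape == (2, 1)
assert a.ndim == 1
```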
@ogrisel sounds good :)
Polished, squashed, rebased and backported.
Ooh :) you two are so kindly obliging!
Supersedes #7203. Obviates the need for `max_n_classes` to some extent, and removes it as unreleased API (introduced in #5445 to fix #4976).

A sparse contingency matrix should handle the common case where there are many clusters and the overlap among them is sparse; where there are few clusters we suffer a penalty for using a sparse contingency matrix, but this penalty is small in absolute terms.
The change is being benchmarked on my `sparse_cluster_contingency_option` branch, performing multiple segmentations of an image into different numbers of clusters, then timing the evaluation of metrics for different contingency matrix shapes. Benchmarks will be posted shortly.
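To illustrate why sparsity helps here, a hypothetical sketch (not scikit-learn's implementation) of computing mutual information from a sparse contingency matrix, touching only its nonzero cells:

```python
import numpy as np
from scipy import sparse

def mi_from_sparse_contingency(C):
    # Mutual information summed only over the nonzero cells of a sparse
    # contingency matrix; zero cells contribute nothing to the sum.
    C = C.tocoo()
    n = C.sum()
    pi = np.ravel(C.sum(axis=1)) / n  # row (class) marginals
    pj = np.ravel(C.sum(axis=0)) / n  # column (cluster) marginals
    p = C.data / n                    # joint probabilities, nonzeros only
    return float(np.sum(p * (np.log(p) - np.log(pi[C.row] * pj[C.col]))))

# Labellings that mostly agree have positive mutual information.
C = sparse.coo_matrix(np.array([[5, 1],
                                [1, 5]]))
mi = mi_from_sparse_contingency(C)
```

With many clusters, the number of nonzero cells grows with the number of points rather than with the product of the cluster counts, which is the case this PR targets.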