[MRG+2] ENH use pairwise_distances_chunked in silhouette_score #11135
Conversation
Note that I believe no tests are needed: the functionality of …
It will be great to close all those issues!
Not a full review - just a couple of minor comments.
doc/whats_new/v0.20.rst (outdated)

@@ -255,6 +255,10 @@ Metrics
     False and both inputs are sparse, will return a sparse matrix.
     :issue:`10999` by :user:`Taylor G Smith <tgsmith61591>`.

+- :func:`metrics.cluster.silhouette_score` and
+  :func:`metrics.cluster.silhouette_samples` are more memory efficient,
+  causing them to run faster. :issue:`11135` by `Joel Nothman`_.
As previously discussed, it would run faster only assuming that swapping occurs, meaning that the full distance matrix takes more than the RAM size but less than RAM + swap disk size (typical sizes here), which is, IMO, too strong an assumption.
Given the reported freezes...?
It's definitely significantly faster in light of benchmarks below - but I think because of a better implementation, not just due to chunking.
Maybe "are more memory efficient, and faster"? or "are significantly more ..."
# accumulate distances from each sample to each cluster
clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
np.add.at(clust_dists.T, labels, D_chunk.T)
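For context, here is a minimal, self-contained sketch (not the PR's actual code; the helper name reduce_chunk and the toy data are made up) of how a per-chunk reduction like the one above can be driven by sklearn.metrics.pairwise_distances_chunked:

import functools
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def reduce_chunk(D_chunk, start, labels, label_freqs):
    # For each sample (row) in this chunk of the distance matrix,
    # accumulate its distances to every cluster.
    clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                           dtype=D_chunk.dtype)
    np.add.at(clust_dists.T, labels, D_chunk.T)
    return clust_dists

rng = np.random.RandomState(0)
X = rng.rand(500, 10)
labels = rng.randint(3, size=500)
label_freqs = np.bincount(labels)

reduce_func = functools.partial(reduce_chunk, labels=labels,
                                label_freqs=label_freqs)
# Chunks of the 500 x 500 distance matrix are generated and reduced one at
# a time, so the full matrix never has to be held in memory at once.
clust_dists = np.concatenate(list(
    pairwise_distances_chunked(X, reduce_func=reduce_func)))
print(clust_dists.shape)  # (500, 3)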
There are some reports about np.add.at being slow (numpy/numpy#5922, numpy/numpy#11156), and it's currently almost never used in the code base; do you have a different experience with it?
I guess there are not many alternatives, since clust_dists.T[labels] += D_chunk.T and np.bincount (as suggested in the first issue) wouldn't work here, as far as I understand?
I'm unaware of those reports. I'll have a quick look.
clust_dists.T[labels] += D_chunk.T will certainly not work if labels has duplicates, which it does. No, bincount won't work either, given that the output is 2d.
We could do this in a for loop. Is it worth trying that, or should we assume it's not the bottleneck?
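A tiny illustration of that point (my own toy numbers, not from the PR): with duplicate indices, buffered fancy-indexed += applies only one contribution per index, while np.add.at accumulates all of them.

import numpy as np

labels = np.array([0, 0, 1])        # label 0 appears twice
weights = np.array([1.0, 2.0, 5.0])

out = np.zeros(2)
out[labels] += weights              # buffered: one write per index survives
print(out)                          # [2. 5.] -- the 1.0 contribution is lost

out = np.zeros(2)
np.add.at(out, labels, weights)     # unbuffered: duplicates accumulate
print(out)                          # [3. 5.]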
Yes, I'm getting better mileage with a for loop (assuming I've correctly implemented it):

import numpy as np

n_samples = 10000
n_samples_chunk = 100
n_clusters = 10

D_chunk = np.random.rand(n_samples_chunk, n_samples)
labels = np.random.randint(n_clusters, size=n_samples)
label_freqs = np.bincount(labels)

clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit np.add.at(clust_dists.T, labels, D_chunk.T)

clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit for i in range(len(D_chunk)): clust_dists[i] += np.bincount(labels, weights=D_chunk[i], minlength=len(label_freqs))
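As a sanity check (my own snippet, using the same kind of synthetic data as above), the per-row bincount loop gives the same result as np.add.at:

import numpy as np

rng = np.random.RandomState(0)
D_chunk = rng.rand(100, 10000)
labels = rng.randint(10, size=10000)
label_freqs = np.bincount(labels)

ref = np.zeros((len(D_chunk), len(label_freqs)), dtype=D_chunk.dtype)
np.add.at(ref.T, labels, D_chunk.T)

alt = np.zeros_like(ref)
for i in range(len(D_chunk)):
    alt[i] += np.bincount(labels, weights=D_chunk[i],
                          minlength=len(label_freqs))

np.testing.assert_allclose(ref, alt)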
LGTM
I double-checked that the results are identical on a few examples.
A quick benchmark also showed that computations now take about 30% less time, even without swapping.
Great work!
I made sure that there is a test in which results on data from the original publication are replicated perfectly. Thanks for the review!
Adapting earlier benchmarks from #7177 (comment), where

import numpy as np
from sklearn.metrics.cluster import silhouette_score
from neurtu import delayed, Benchmark
n_features = 100
def runs():
for n_samples in [100, 1000, 5000, 10000, 20000, 40000]:
for n_labels in [4, 10, 100]:
tags = {'n_samples': n_samples, 'n_labels': n_labels}
rng = np.random.RandomState(42)
X = rng.rand(n_samples, n_features)
y = rng.randint(n_labels, size=n_samples)
yield delayed(silhouette_score, tags=tags)(X, y)
df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['n_labels', 'n_samples'], inplace=True)
df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})
df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)

I get,

which corresponds to a 2.5x speedup for large …
(Ignore cells with zero peak_memory - it's a measurement artifact.)
LGTM, really great work!
P.S.: could you please partially revert the what's new formulation changes from #11135 (comment) - sorry about that.
BTW, if I read #7177 (comment) right, looking at the benchmarks above, this PR should result in faster calculations than #7177 too.
The speed improvement since #7177 might be due to larger working memory, or due to np.add.at being removed...
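For anyone reproducing the benchmarks: the chunk size used by pairwise_distances_chunked is governed by scikit-learn's global working_memory setting (in MiB), which can be changed like this (a usage sketch; 2048 is an arbitrary example value):

import numpy as np
import sklearn
from sklearn.metrics import silhouette_score

# Allow larger chunks of the pairwise distance matrix to be materialised
# at a time; the default working_memory is 1024 MiB.
sklearn.set_config(working_memory=2048)

rng = np.random.RandomState(42)
X = rng.rand(10000, 100)
y = rng.randint(10, size=10000)
print(silhouette_score(X, y))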
Thanks for the reviews!
Used to be part of #10280 and elsewhere
Closes #7175
Closes #10279
Closes #7177