
[MRG+2] ENH use pairwise_distances_chunked in silhouette_score #11135


Merged (7 commits into scikit-learn:master on Jun 2, 2018)

Conversation

@jnothman jnothman (Member) commented May 25, 2018

Used to be part of #10280 and elsewhere

Closes #7175
Closes #10279
Closes #7177
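
For context, the approach relies on sklearn.metrics.pairwise_distances_chunked, which yields the pairwise distance matrix a block of rows at a time and can apply a reduce_func to each block, so per-sample statistics can be accumulated without ever materialising the full n_samples x n_samples matrix. A minimal illustrative sketch (not the code merged in this PR) that computes each sample's mean distance to all other samples:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(5000, 100)

def reduce_func(D_chunk, start):
    # D_chunk has shape (chunk_size, n_samples); reducing each block to a
    # 1-d vector keeps peak memory bounded by the chunk size.
    return D_chunk.sum(axis=1) / (X.shape[0] - 1)

mean_dists = np.concatenate(list(
    pairwise_distances_chunked(X, reduce_func=reduce_func,
                               working_memory=64)))  # cap chunks at ~64MB
print(mean_dists.shape)  # (5000,)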

@jnothman jnothman changed the title ENH use pairwise_distances_chunked in silhouette_score [MRG] ENH use pairwise_distances_chunked in silhouette_score May 25, 2018
@jnothman (Member, Author) commented:

Note that I believe no tests are needed: the functionality of silhouette_score and silhouette_samples is tested, and the memory consumption of pairwise_distances_chunked is tested elsewhere.

@rth rth (Member) left a comment

It will be great to close all those issues!

Not a full review, just a couple of minor comments.

@@ -255,6 +255,10 @@ Metrics
False and both inputs are sparse, will return a sparse matrix.
:issue:`10999` by :user:`Taylor G Smith <tgsmith61591>`.

- :func:`metrics.cluster.silhouette_score` and
  :func:`metrics.cluster.silhouette_samples` are more memory efficient,
  causing them to run faster. :issue:`11135` by `Joel Nothman`_.
@rth (Member):

As previously discussed, it would run faster only assuming that swapping occurs, i.e. the full distance matrix is larger than RAM but smaller than RAM plus swap (typical sizes here), which is IMO too strong an assumption.

@jnothman (Member, Author):

Given the reported freezes...?

@rth rth (Member) Jun 1, 2018:

It's definitely significantly faster in light of the benchmarks below, but I think that's because of a better implementation, not just the chunking.

Maybe "are more memory efficient, and faster"? or "are significantly more ..."

# accumulate distances from each sample to each cluster
clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
np.add.at(clust_dists.T, labels, D_chunk.T)
@rth (Member):

There are some reports of np.add.at being slow (numpy/numpy#5922, numpy/numpy#11156), and it's currently almost never used in the code base; do you have a different experience with it?

I guess there aren't many alternatives, since

clust_dists.T[labels] += D_chunk.T

and np.bincount (as suggested in the first issue) wouldn't work here either, as far as I understand?

@jnothman (Member, Author):

I'm unaware of those reports. I'll have a quick look.

@jnothman (Member, Author):

clust_dists.T[labels] += D_chunk.T will certainly not work if labels has duplicates, which it does. No, bincount won't work either, given that the output is 2d.

We could do this in a for loop. Is it worth trying that, or should we assume that it's not the bottleneck?

@jnothman (Member, Author):

Yes, I'm getting better mileage with a for loop (assuming I've correctly implemented it):

import numpy as np

n_samples = 10000
n_samples_chunk = 100
n_clusters = 10

D_chunk = np.random.rand(n_samples_chunk, n_samples)
labels = np.random.randint(n_clusters, size=n_samples)
label_freqs = np.bincount(labels)


clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit np.add.at(clust_dists.T, labels, D_chunk.T)

clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit for i in range(len(D_chunk)): clust_dists[i] += np.bincount(labels, weights=D_chunk[i], minlength=len(label_freqs))
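
For completeness, a quick sanity check (not part of the PR) that the two accumulation strategies agree on the arrays defined above:

ref = np.zeros((len(D_chunk), len(label_freqs)), dtype=D_chunk.dtype)
np.add.at(ref.T, labels, D_chunk.T)

alt = np.zeros_like(ref)
for i in range(len(D_chunk)):
    # per-row bincount: sum D_chunk[i] into one bucket per cluster label
    alt[i] = np.bincount(labels, weights=D_chunk[i],
                         minlength=len(label_freqs))

np.testing.assert_allclose(ref, alt)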

@TomDLT TomDLT (Member) left a comment

LGTM

I double-checked that the results are identical on a few examples.
A quick benchmark also showed that computations now take about 30% less time, even without swapping.
Great work!

@jnothman (Member, Author) commented:

I made sure there is a test where the results on data from the original publication are replicated perfectly.

Thanks for the review!

@jnothman jnothman changed the title [MRG] ENH use pairwise_distances_chunked in silhouette_score [MRG+1] ENH use pairwise_distances_chunked in silhouette_score May 29, 2018
@rth rth (Member) commented Jun 1, 2018

Adapting earlier benchmarks from #7177 (comment), with n_features=100:

import numpy as np
from sklearn.metrics.cluster import silhouette_score

from neurtu import delayed, Benchmark

n_features = 100

def runs():
    for n_samples in [100, 1000, 5000, 10000, 20000, 40000]:
        for n_labels in [4, 10, 100]:
            tags = {'n_samples': n_samples, 'n_labels': n_labels}
            
            rng = np.random.RandomState(42)
            X = rng.rand(n_samples, n_features)
            y = rng.randint(n_labels, size=n_samples)
            yield delayed(silhouette_score, tags=tags)(X, y)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['n_labels', 'n_samples'], inplace=True)

df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})

df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)

I get,

                   wall_time (seconds)      peak_memory (MB)
                      master PR #11135      master PR #11135
n_labels n_samples
4        100          0.0010    0.0010        0.00      0.00
         1000         0.0149    0.0179        0.31      0.62
         5000         0.7804    0.4628      287.30    191.32
         10000        3.4167    1.7711     1144.00    763.28
         20000       14.4864    4.9563     4576.68   1023.87
         40000       59.7319   19.5900    18329.18   1023.87
10       100          0.0037    0.0010        0.00      0.00
         1000         0.0190    0.0182        0.00      0.00
         5000         0.6339    0.4578      247.71    190.74
         10000        3.1061    1.7667      919.54    762.94
         20000       13.3901    4.9475     3676.61   1023.87
         40000       58.4792   19.5537    14704.77   1023.87
100      100          0.1148    0.0010        0.00      0.00
         1000         0.3631    0.0180        0.00      0.00
         5000         1.0177    0.4606      190.74    190.74
         10000        2.8273    1.7763      762.94    762.94
         20000       11.7602    4.9412     3173.65   1023.87
         40000       48.2780   19.5229    12465.71   1023.87

which corresponds to a 2.5x speedup for large n_samples and with memory capped at 1GB (instead of 18GB+ in this example). This is awesome!

(Ignore cells with zero peak_memory; it's a measurement artifact.)
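
As a usage note not discussed in this thread (assuming scikit-learn >= 0.20): the ~1 GB plateau in the peak_memory column matches scikit-learn's default working_memory of 1024 MB, and the cap used by chunked routines such as pairwise_distances_chunked (and hence silhouette_score) can be changed via the global configuration:

import numpy as np
import sklearn
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(2000, 100)
y = np.random.RandomState(1).randint(10, size=2000)

# Temporarily cap per-chunk working memory at ~256 MB instead of the
# default 1024 MB; silhouette_score then processes smaller distance chunks.
with sklearn.config_context(working_memory=256):
    print(silhouette_score(X, y))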

@rth rth (Member) left a comment

LGTM, really great work!

P.S.: could you please partially revert the what's new wording changes from #11135 (comment)? Sorry about that.

@rth rth (Member) commented Jun 1, 2018

BTW, if I read #7177 (comment) right, the benchmarks above suggest this PR should also result in faster calculations than #7177.

@rth rth changed the title [MRG+1] ENH use pairwise_distances_chunked in silhouette_score [MRG+2] ENH use pairwise_distances_chunked in silhouette_score Jun 1, 2018
@jnothman jnothman (Member, Author) commented Jun 2, 2018

The speed improvement since #7177 might be due to larger working memory, or due to np.add.at being removed...

@jnothman jnothman merged commit 86ab5a4 into scikit-learn:master Jun 2, 2018
@jnothman jnothman (Member, Author) commented Jun 2, 2018

Thanks for the reviews!
