
[MRG+2] ENH use pairwise_distances_chunked in silhouette_score #11135


Merged (7 commits into scikit-learn:master on Jun 2, 2018)

Conversation

@jnothman jnothman (Member) commented May 25, 2018

Used to be part of #10280 and elsewhere

Closes #7175
Closes #10279
Closes #7177
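
For context, the approach relies on sklearn.metrics.pairwise_distances_chunked, which yields the pairwise distance matrix a block of rows at a time and can apply a reduce_func to each block, so per-sample statistics can be accumulated without ever materialising the full n_samples x n_samples matrix. A minimal illustrative sketch (not the code merged in this PR) that computes each sample's mean distance to all other samples:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(5000, 100)

def reduce_func(D_chunk, start):
    # D_chunk has shape (chunk_size, n_samples); reducing each block to a
    # 1-d vector keeps peak memory bounded by the chunk size.
    return D_chunk.sum(axis=1) / (X.shape[0] - 1)

mean_dists = np.concatenate(list(
    pairwise_distances_chunked(X, reduce_func=reduce_func,
                               working_memory=64)))  # cap chunks at ~64MB
print(mean_dists.shape)  # (5000,)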

@jnothman jnothman changed the title ENH use pairwise_distances_chunked in silhouette_score [MRG] ENH use pairwise_distances_chunked in silhouette_score May 25, 2018
@jnothman (Member, Author) commented:

Note that I believe no tests are needed: the functionality of silhouette_score and silhouette_samples is tested, and the memory consumption of pairwise_distances_chunked is tested elsewhere.

@rth rth (Member) left a comment

It will be great to close all those issues!

Not a full review, just a couple of minor comments.

@@ -255,6 +255,10 @@ Metrics
False and both inputs are sparse, will return a sparse matrix.
:issue:`10999` by :user:`Taylor G Smith <tgsmith61591>`.

- :func:`metrics.cluster.silhouette_score` and
  :func:`metrics.cluster.silhouette_samples` are more memory efficient,
  causing them to run faster. :issue:`11135` by `Joel Nothman`_.
@rth (Member):

As previously discussed, it would run faster only assuming that swapping occurs, i.e. the full distance matrix is larger than RAM but smaller than RAM plus swap (typical sizes here), which is IMO too strong an assumption.

@jnothman (Member, Author):

Given the reported freezes...?

@rth rth (Member) Jun 1, 2018:

It's definitely significantly faster in light of the benchmarks below, but I think that's because of a better implementation, not just the chunking.

Maybe "are more memory efficient, and faster"? or "are significantly more ..."

# accumulate distances from each sample to each cluster
clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
np.add.at(clust_dists.T, labels, D_chunk.T)
@rth (Member):

There are some reports of np.add.at being slow (numpy/numpy#5922, numpy/numpy#11156), and it's currently almost never used in the code base; do you have a different experience with it?

I guess there aren't many alternatives, since

clust_dists.T[labels] += D_chunk.T

and np.bincount (as suggested in the first issue) wouldn't work here either, as far as I understand?

@jnothman (Member, Author):

I'm unaware of those reports. I'll have a quick look.

@jnothman (Member, Author):

clust_dists.T[labels] += D_chunk.T will certainly not work if labels has duplicates, which it does. No, bincount won't work either, given that the output is 2d.

We could do this in a for loop. Is it worth trying that, or should we assume that it's not the bottleneck?

@jnothman (Member, Author):

Yes, I'm getting better mileage with a for loop (assuming I've correctly implemented it):

import numpy as np

n_samples = 10000
n_samples_chunk = 100
n_clusters = 10

D_chunk = np.random.rand(n_samples_chunk, n_samples)
labels = np.random.randint(n_clusters, size=n_samples)
label_freqs = np.bincount(labels)


clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit np.add.at(clust_dists.T, labels, D_chunk.T)

clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit for i in range(len(D_chunk)): clust_dists[i] += np.bincount(labels, weights=D_chunk[i], minlength=len(label_freqs))
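
For completeness, a quick sanity check (not part of the PR) that the two accumulation strategies agree on the arrays defined above:

ref = np.zeros((len(D_chunk), len(label_freqs)), dtype=D_chunk.dtype)
np.add.at(ref.T, labels, D_chunk.T)

alt = np.zeros_like(ref)
for i in range(len(D_chunk)):
    # per-row bincount: sum D_chunk[i] into one bucket per cluster label
    alt[i] = np.bincount(labels, weights=D_chunk[i],
                         minlength=len(label_freqs))

np.testing.assert_allclose(ref, alt)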

@TomDLT TomDLT (Member) left a comment

LGTM

I double-checked that the results are identical on a few examples.
A quick benchmark also showed that computations now take about 30% less time, even without swapping.
Great work!

@jnothman (Member, Author) commented:

I made sure there is a test where the results on data from the original publication are replicated perfectly.

Thanks for the review!

@jnothman jnothman changed the title [MRG] ENH use pairwise_distances_chunked in silhouette_score [MRG+1] ENH use pairwise_distances_chunked in silhouette_score May 29, 2018
@rth rth (Member) commented Jun 1, 2018

Adapting earlier benchmarks from #7177 (comment), with n_features=100:

import numpy as np
from sklearn.metrics.cluster import silhouette_score

from neurtu import delayed, Benchmark

n_features = 100

def runs():
    for n_samples in [100, 1000, 5000, 10000, 20000, 40000]:
        for n_labels in [4, 10, 100]:
            tags = {'n_samples': n_samples, 'n_labels': n_labels}
            
            rng = np.random.RandomState(42)
            X = rng.rand(n_samples, n_features)
            y = rng.randint(n_labels, size=n_samples)
            yield delayed(silhouette_score, tags=tags)(X, y)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['n_labels', 'n_samples'], inplace=True)

df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})

df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)

I get,

                   wall_time (seconds)      peak_memory (MB)
                      master PR #11135      master PR #11135
n_labels n_samples
4        100          0.0010    0.0010        0.00      0.00
         1000         0.0149    0.0179        0.31      0.62
         5000         0.7804    0.4628      287.30    191.32
         10000        3.4167    1.7711     1144.00    763.28
         20000       14.4864    4.9563     4576.68   1023.87
         40000       59.7319   19.5900    18329.18   1023.87
10       100          0.0037    0.0010        0.00      0.00
         1000         0.0190    0.0182        0.00      0.00
         5000         0.6339    0.4578      247.71    190.74
         10000        3.1061    1.7667      919.54    762.94
         20000       13.3901    4.9475     3676.61   1023.87
         40000       58.4792   19.5537    14704.77   1023.87
100      100          0.1148    0.0010        0.00      0.00
         1000         0.3631    0.0180        0.00      0.00
         5000         1.0177    0.4606      190.74    190.74
         10000        2.8273    1.7763      762.94    762.94
         20000       11.7602    4.9412     3173.65   1023.87
         40000       48.2780   19.5229    12465.71   1023.87

which corresponds to a 2.5x speedup for large n_samples and with memory capped at 1GB (instead of 18GB+ in this example). This is awesome!

(Ignore cells with zero peak_memory; it's a measurement artifact.)
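
As a usage note not discussed in this thread (assuming scikit-learn >= 0.20): the ~1 GB plateau in the peak_memory column matches scikit-learn's default working_memory of 1024 MB, and the cap used by chunked routines such as pairwise_distances_chunked (and hence silhouette_score) can be changed via the global configuration:

import numpy as np
import sklearn
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(2000, 100)
y = np.random.RandomState(1).randint(10, size=2000)

# Temporarily cap per-chunk working memory at ~256 MB instead of the
# default 1024 MB; silhouette_score then processes smaller distance chunks.
with sklearn.config_context(working_memory=256):
    print(silhouette_score(X, y))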

@rth rth (Member) left a comment

LGTM, really great work!

P.S.: could you please partially revert the what's new wording changes from #11135 (comment)? Sorry about that.

@rth rth (Member) commented Jun 1, 2018

BTW, if I read #7177 (comment) right, the benchmarks above suggest this PR should also result in faster calculations than #7177.

@rth rth changed the title [MRG+1] ENH use pairwise_distances_chunked in silhouette_score [MRG+2] ENH use pairwise_distances_chunked in silhouette_score Jun 1, 2018
@jnothman jnothman (Member, Author) commented Jun 2, 2018

The speed improvement since #7177 might be due to larger working memory, or due to np.add.at being removed...

@jnothman jnothman merged commit 86ab5a4 into scikit-learn:master Jun 2, 2018
@jnothman jnothman (Member, Author) commented Jun 2, 2018

Thanks for the reviews!
