[MRG] Block-wise silhouette calculation to avoid memory consumption #7177
@@ -5,10 +5,14 @@
# Thierry Guillemot <thierry.guillemot.work@gmail.com>
# License: BSD 3 clause

from __future__ import division

import numpy as np

from ...utils import check_random_state
from ...utils import check_X_y
from ...utils import _get_n_jobs
from ...externals.joblib import Parallel, delayed
from ..pairwise import pairwise_distances
from ...preprocessing import LabelEncoder

@@ -19,7 +23,12 @@ def check_number_of_labels(n_labels, n_samples):
                         "to n_samples - 1 (inclusive)" % n_labels)


DEFAULT_BLOCK_SIZE = 64
BYTES_PER_FLOAT = 8


def silhouette_score(X, labels, metric='euclidean', sample_size=None,
                     block_size=DEFAULT_BLOCK_SIZE, n_jobs=1,
                     random_state=None, **kwds):
    """Compute the mean Silhouette Coefficient of all samples.

@@ -56,6 +65,18 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
        <sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance
        array itself, use ``metric="precomputed"``.

    block_size : int, optional, default=64

[Inline review comments attached to this line:]
- Could it be
- In the GPU computing language (OpenCl, Cuda),
- I'm not committed to it. (Nor am I, by the way, committed to this being per job. I'm happy for it to be divided by the jobs.) But it's not explicitly the maximum memory consumption either.
- Okay then can we have it as Because as a user, I'd blindly set this to my RAM maximum. And
- If you're an average user, I'm better off not making it a parameter and hard-coding something sensible! :P
- lol
- Indeed :)

        The maximum number of mebibytes (MiB) of memory per job (see
        ``n_jobs``) to use at a time for calculating pairwise distances.

        .. versionadded:: 0.18

    n_jobs : int, optional (default = 1)
        The number of parallel jobs to run.
        If ``-1``, then the number of jobs is set to the number of CPU cores.

        .. versionadded:: 0.18

    sample_size : int or None
        The size of the sample to use when computing the Silhouette Coefficient
        on a random subset of the data.
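For concreteness, a minimal usage sketch of the signature proposed here (made-up data; assumes a build of this branch, where silhouette_score accepts block_size and n_jobs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up data: 5000 samples, 20 features.
X = np.random.RandomState(0).rand(5000, 20)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# Compute pairwise distances in blocks of at most ~64 MiB per job,
# spread over 2 threads (per the signature proposed in this PR).
score = silhouette_score(X, labels, metric='euclidean',
                         block_size=64, n_jobs=2)
```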
@@ -103,10 +124,61 @@
        X, labels = X[indices].T[indices].T, labels[indices]
    else:
        X, labels = X[indices], labels[indices]
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
    return np.mean(silhouette_samples(X, labels, metric=metric,
                                      block_size=block_size, n_jobs=n_jobs,
                                      **kwds))


def _silhouette_block(X, labels, label_freqs, start, block_n_rows,
                      block_range, add_at, dist_kwds):
    """Accumulate silhouette statistics for X[start:start+block_n_rows]

    Parameters
    ----------
    X : shape (n_samples, n_features) or precomputed (n_samples, n_samples)
        data
    labels : array, shape (n_samples,)
        corresponding cluster labels, encoded as {0, ..., n_clusters-1}
    label_freqs : array
        distribution of cluster labels in ``labels``
    start : int
        first index in block
    block_n_rows : int
        length of block
    block_range : array
        precomputed range ``0..(block_n_rows-1)``
    add_at : array, shape (block_n_rows * n_clusters,)
        indices into a flattened array of shape (block_n_rows, n_clusters)
        where distances from block points to each cluster are accumulated
    dist_kwds : dict
        kwargs for ``pairwise_distances``
    """
    # get distances from block to every other sample
    stop = min(start + block_n_rows, X.shape[0])
    if stop - start == X.shape[0]:
        # allow pairwise_distances to use fast paths
        block_dists = pairwise_distances(X, **dist_kwds)
    else:
        block_dists = pairwise_distances(X[start:stop], X, **dist_kwds)

    # accumulate distances from each sample to each cluster
    clust_dists = np.bincount(add_at[:block_dists.size],
                              block_dists.ravel())
    clust_dists = clust_dists.reshape((stop - start, len(label_freqs)))

    # intra_index selects intra-cluster distances within clust_dists
    intra_index = (block_range[:len(clust_dists)], labels[start:stop])
    # intra_clust_dists are averaged over cluster size outside this function
    intra_clust_dists = clust_dists[intra_index]
    # of the remaining distances we normalise and extract the minimum
    clust_dists[intra_index] = np.inf
    clust_dists /= label_freqs
    inter_clust_dists = clust_dists.min(axis=1)
    return intra_clust_dists, inter_clust_dists
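To illustrate what the add_at/bincount trick above computes, here is a small self-contained check on toy data (not part of the PR), comparing it with a direct per-cluster sum for a single block covering all rows:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(42)
X = rng.rand(7, 3)                         # toy block: 7 samples, 3 features
labels = np.array([0, 1, 1, 2, 0, 2, 1])   # encoded cluster labels
n_clusters = 3
block_n_rows = len(X)                      # a single block covering all rows

dists = pairwise_distances(X)

# bincount over flattened (row, cluster) bins, as in _silhouette_block
block_range = np.arange(block_n_rows)
add_at = np.ravel_multi_index((np.repeat(block_range, len(X)),
                               np.tile(labels, block_n_rows)),
                              dims=(block_n_rows, n_clusters))
clust_dists = np.bincount(add_at, dists.ravel()).reshape(block_n_rows,
                                                         n_clusters)

# direct equivalent: summed distance from each sample to each cluster
expected = np.array([dists[:, labels == k].sum(axis=1)
                     for k in range(n_clusters)]).T
assert np.allclose(clust_dists, expected)
```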


def silhouette_samples(X, labels, metric='euclidean', **kwds):
def silhouette_samples(X, labels, metric='euclidean',
                       block_size=DEFAULT_BLOCK_SIZE, n_jobs=1, **kwds):
    """Compute the Silhouette Coefficient for each sample.

    The Silhouette Coefficient is a measure of how well samples are clustered
@@ -144,6 +216,18 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
        allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`. If X is
        the distance array itself, use "precomputed" as the metric.

    block_size : int, optional, default=64
        The maximum number of mebibytes (MiB) of memory per job (see
        ``n_jobs``) to use at a time for calculating pairwise distances.

        .. versionadded:: 0.18

    n_jobs : int, optional (default = 1)
        The number of parallel jobs to run.
        If ``-1``, then the number of jobs is set to the number of CPU cores.

        .. versionadded:: 0.18

    `**kwds` : optional keyword parameters
        Any further parameters are passed directly to the distance function.
        If using a ``scipy.spatial.distance`` metric, the parameters are still
@@ -166,46 +250,58 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
       <https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_

    """
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    distances = pairwise_distances(X, metric=metric, **kwds)
    unique_labels = le.classes_

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.ones(distances.shape[0], dtype=distances.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf * intra_clust_dists

    for curr_label in unique_labels:

        # Find inter_clust_dist for all samples belonging to the same
        # label.
        mask = labels == curr_label
        current_distances = distances[mask]

        # Leave out current sample.
        n_samples_curr_lab = np.sum(mask) - 1
        if n_samples_curr_lab != 0:
            intra_clust_dists[mask] = np.sum(
                current_distances[:, mask], axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in unique_labels:
            if other_label != curr_label:
                other_mask = labels == other_label
                other_distances = np.mean(
                    current_distances[:, other_mask], axis=1)
                inter_clust_dists[mask] = np.minimum(
                    inter_clust_dists[mask], other_distances)

    n_samples = len(labels)
    label_freqs = np.bincount(labels)

    n_jobs = _get_n_jobs(n_jobs)
    block_n_rows = block_size * (2 ** 20) // (BYTES_PER_FLOAT * n_samples)
    if block_n_rows > n_samples:
        block_n_rows = min(block_n_rows, n_samples)
    if block_n_rows < 1:
        min_block_mib = np.ceil(n_samples * BYTES_PER_FLOAT * 2 ** -20)
        raise ValueError('block_size should be at least n_samples * %d bytes '
                         '= %.0f MiB, got %r' % (BYTES_PER_FLOAT,
                                                 min_block_mib, block_size))
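To make the sizing concrete, a small sketch of what the block_n_rows formula above yields for the default budget (the sample count below is made up):

```python
# Each block holds block_n_rows * n_samples float64 distances (8 bytes each),
# so a 64 MiB budget translates into this many rows per block:
block_size = 64            # MiB, the DEFAULT_BLOCK_SIZE above
n_samples = 100000
BYTES_PER_FLOAT = 8

block_n_rows = block_size * (2 ** 20) // (BYTES_PER_FLOAT * n_samples)
print(block_n_rows)        # 83 -> the distances are computed in ~1205 row blocks
```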

    intra_clust_dists = []
    inter_clust_dists = []

    # We use these indices as bins to accumulate distances from each sample in
    # a block to each cluster.
    # NB: we currently use np.bincount but could use np.add.at when Numpy >=1.8
    # is minimum dependency, which would avoid materialising this index.
    block_range = np.arange(block_n_rows)
    add_at = np.ravel_multi_index((np.repeat(block_range, n_samples),
                                   np.tile(labels, block_n_rows)),
                                  dims=(block_n_rows, len(label_freqs)))
    parallel = Parallel(n_jobs=n_jobs, backend='threading')

    kwds['metric'] = metric
    results = parallel(delayed(_silhouette_block)(X, labels, label_freqs,
                                                  start, block_n_rows,
                                                  block_range, add_at, kwds)
                       for start in range(0, n_samples, block_n_rows))
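For intuition about how the loop above splits the work into joblib tasks, a toy illustration with made-up numbers (not PR code):

```python
# With, say, 250 samples and 100 rows per block, three tasks are dispatched,
# the last one covering a partial block.
n_samples, block_n_rows = 250, 100
print(list(range(0, n_samples, block_n_rows)))   # [0, 100, 200]
# _silhouette_block truncates the final block via
# stop = min(start + block_n_rows, X.shape[0])
```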

    intra_clust_dists, inter_clust_dists = zip(*results)
    if len(intra_clust_dists) == 1:
        intra_clust_dists = intra_clust_dists[0]
        inter_clust_dists = inter_clust_dists[0]
    else:
        intra_clust_dists = np.hstack(intra_clust_dists)
        inter_clust_dists = np.hstack(inter_clust_dists)

    denom = (label_freqs - 1).take(labels, mode='clip')
    with np.errstate(divide="ignore", invalid="ignore"):
        intra_clust_dists /= denom

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    return sil_samples
    with np.errstate(divide="ignore", invalid="ignore"):
        sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # nan values are for clusters of size 1, and should be 0
    return np.nan_to_num(sil_samples)
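The final lines assemble the usual silhouette value s = (b - a) / max(a, b) per sample; a tiny sketch of that last step on made-up values, where a is the mean intra-cluster distance and b the mean distance to the nearest other cluster:

```python
import numpy as np

a = np.array([0.5, 1.0, np.nan])   # nan: a sample alone in its cluster (0/0 above)
b = np.array([1.5, 0.8, 2.0])

sil = b - a
with np.errstate(divide="ignore", invalid="ignore"):
    sil /= np.maximum(a, b)
print(np.nan_to_num(sil))          # [ 0.66666667 -0.2         0.        ]
```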


def calinski_harabaz_score(X, labels):
Review discussion on the choice and default of block_size:

- Should we instead have block_size as 0 and let it denote the "use all the memory" setting?

- Or -1 like n_jobs?

- For what benefit? The default 64MB will run a 2896-sample problem or smaller in a single block. For problems much larger than that, you're likely to benefit from splitting the problem up, as suggested by our benchmark, which shows a 2x speedup over "use all memory" for a dataset less than 4x that size (and >9x the number of pairwise calculations). Yes, this is only my machine, but it's hard to imagine why we should suggest that the user use all possible memory.

- Not dissimilar to n_jobs=-2 often being a better choice than n_jobs=-1.

- Why not select block_size automatically as min(64MB, np.ceil(n_samples * BYTES_PER_FLOAT * 2 ** -20)) and not let the user choose a specific value?

- That would be identical to just setting it to 64, no? That's tempting, especially because this isn't a learning routine. I don't expect my benchmarks to be optimal, certainly not for any particular platform, or with n_jobs != 1 (using the default or some other parallel backend); hence my inclination to keep it public. I'll have to chew on this.
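As a quick check of the 2896-sample figure above (my arithmetic, not from the thread): a single 64 MiB block holds 64 * 2**20 / 8 = 8,388,608 float64 distances, so a full n-by-n distance matrix fits in one block when n <= sqrt(8,388,608):

```python
import math

block_bytes = 64 * 2 ** 20          # default block_size of 64 MiB
n_floats = block_bytes // 8         # 8,388,608 float64 values per block
print(int(math.sqrt(n_floats)))     # 2896
```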