[MRG] Blockwise parallel silhouette computation #1976

Closed · wants to merge 9 commits

Conversation

AlexandreAbraham
Contributor

This pull request introduces a blockwise computation of the silhouette, which uses less memory but is slightly slower.

First, the vocabulary can be debated. I have chosen the word "global" for the original silhouette strategy (as the global distance matrix is computed; I could not come up with a better word) and "blockwise" for my method.

The other point is that I have 3 implementations of the silhouette:

  • the original one
  • a blockwise single-threaded version
  • a blockwise multi-threaded version (which uses a bit more memory than the single-threaded version, even if n_jobs=1)

I have decided to leave the original version untouched, as the code is far more readable than mine. I have also not kept the most memory-efficient blockwise single-threaded version, because the memory gain is not worth the added code complexity (I think).

It is in fact possible to have "one implementation to rule them all". This could be done by keeping the same code skeleton and using a "precomputed" distance matrix for the original version. But I think that the code would become too opaque.
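To illustrate the blockwise idea (a simplified sketch with made-up helper names, not this PR's actual code): the full n_samples × n_samples distance matrix is never materialized; only one intra-cluster or cluster-pair block exists at a time.

```python
import numpy as np
from itertools import combinations


def pairwise_dists(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))


def silhouette_blockwise(X, labels):
    """Silhouette scores computed one cluster block at a time.

    Illustrative sketch: peak memory is bounded by the largest
    cluster-pair block instead of the full distance matrix.
    """
    labels = np.asarray(labels)
    unique = np.unique(labels)
    n = len(labels)
    intra = np.zeros(n)          # a(i): mean distance to own cluster
    inter = np.full(n, np.inf)   # b(i): min mean distance to another cluster

    # Intra-cluster blocks.
    for c in unique:
        mask = labels == c
        k = mask.sum()
        if k > 1:
            d = pairwise_dists(X[mask], X[mask])
            # The self-distance on the diagonal is 0, so dividing the
            # row sums by (k - 1) gives the mean over the other members.
            intra[mask] = d.sum(axis=1) / (k - 1)

    # Inter-cluster blocks, one pair of clusters at a time.
    for c1, c2 in combinations(unique, 2):
        m1, m2 = labels == c1, labels == c2
        d = pairwise_dists(X[m1], X[m2])
        inter[m1] = np.minimum(inter[m1], d.mean(axis=1))
        inter[m2] = np.minimum(inter[m2], d.mean(axis=0))

    return (inter - intra) / np.maximum(intra, inter)
```

The largest temporary is a block of shape (size of cluster i, size of cluster j), which is where the roughly-squared-number-of-clusters memory saving comes from.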

The method used to compute distance matrix between samples. Default is
``global`` which means that the full distance matrix is computed
yielding in fast computation but high memory consumption. ``blockwise``
option compute clusterwise distance matrices, dividing approximately
Member

compute -> computes :)

and maybe add a the before blockwise

Member

approximately should maybe be between by and the squared in the next line.
So it reads: ...dividing memory consumption by approximately the squared number of clusters.
Does that make sense? Currently it's a bit hard to read.

Contributor Author

I agree. Plus, it is a bit superfluous. Do you think I should remove this part and just say that it lowers memory consumption, without giving an order of magnitude?

Member

I think it's okay as is now 👍

@jaquesgrobler
Member

I had a quick look. Thanks for the add-on, @AlexandreAbraham. This seems very useful considering the conversation on the mailing list here

@@ -43,10 +48,24 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
<sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance
array itself, use ``metric="precomputed"``.

method: string
Member

The docstring should be {'global', 'blockwise'}, and not 'string'.

@GaelVaroquaux
Member

Looks good to me.

def _intra_cluster_distances_block(X_cluster, metric, **kwds):
''' Calculate the mean intra-cluster distance for a given cluster

Parameters
Member

I think I would remove the parameter documentation and leave only the short description above. Those two functions are private and short. The documentation takes more space than the actual implementation.

Member

I think I would remove the parameter documentation and leave only the
short description above. Those two functions are private and short. The
documentation takes more space than the actual implementation.

I was actually thinking the same. So +1 on this remark

Contributor Author

Done. I have also removed the parameters documentation for the other private methods (original method).

@jaquesgrobler
Member

+1 for merge

@mblondel
Member

mblondel commented Jan 5, 2014

Hmm, this PR has two +1s for merge and is still open?!

@@ -179,25 +238,24 @@ def _intra_cluster_distance(distances_row, labels, i):

def _nearest_cluster_distance(distances_row, labels, i):
Member

why remove the docstrings?

@AlexandreAbraham
Contributor Author

I fixed the only remaining remark (docstring removed for no reason) and rebased. This should be good to go.

@AlexandreAbraham AlexandreAbraham changed the title Blockwise parallel silhouette computation [MRG] Blockwise parallel silhouette computation Oct 19, 2015
``blockwise`` option computes clusterwise distance matrices, dividing
memory consumption by approximately the squared number of clusters.
The ``blockwise`` method allows parallelization through ``n_jobs``
parameter.
Member

I suggest that we rename this argument to 'blockwise', and have as options False, True, and 'auto', where 'auto' is the default, and you do some clever check on the size of the input to decide whether blocking is good or not.

@AlexandreAbraham
Contributor Author

I used this code to bench the silhouette computation:

import time

from sklearn import datasets
from sklearn.metrics.cluster.unsupervised import silhouette_score


def test_silhouette(n_samples, n_clusters):
    X, y = datasets.make_blobs(n_samples=n_samples, n_features=500,
                               centers=n_clusters)
    t0 = time.time()
    for i in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=False)
    t_global = time.time() - t0
    t0 = time.time()
    for i in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=True)
    t_block = time.time() - t0

    return t_global, t_block

for n_clusters in [50, 100, 200, 300]:
    for n_samples in [200, 300, 400, 500, 1000]:
        if n_clusters >= n_samples - 1:
            continue
        t_g, t_b = test_silhouette(n_samples, n_clusters)
        print('%d,%d,%f,%f' % (n_clusters, n_samples, t_g, t_b))

On my machine, blockwise is always faster below 50 clusters. Above 50 clusters, the global version is faster if there are fewer than 5 * n_labels samples. I set up a heuristic in the code accordingly.
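That heuristic might look roughly like this (a hypothetical helper, not the PR's actual code; the thresholds come from the benchmark above):

```python
def resolve_blockwise(blockwise, n_samples, n_clusters):
    """Resolve an 'auto' blockwise flag using the benchmark heuristic.

    Hypothetical sketch: blockwise always wins below 50 clusters;
    above that, the global strategy wins only when there are fewer
    than 5 * n_clusters samples.
    """
    if blockwise != 'auto':
        # Explicit True/False is passed through unchanged.
        return blockwise
    if n_clusters < 50:
        return True
    # Global is faster only when samples are scarce relative to clusters.
    return n_samples >= 5 * n_clusters
```

Note that, as discussed above, 'auto' picks the fastest option without considering memory consumption.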

# Test with the block method
silhouette_metric_block = silhouette_score(X, y, metric='euclidean',
                                           blockwise=True)
assert_almost_equal(silhouette, silhouette_metric_block)
Member

We need the auto logic covered too (just to explore all codepaths).

Contributor Author

The 'auto' case is covered below. But I can add an extra verification here if you want.


@AlexandreAbraham
Contributor Author

Yes, given the last comments, I'm working on the docs.

@GaelVaroquaux
Member

Your tests are failing under Windows because of the added parallelism. You need to disable it in the test (check AppVeyor to see the problem).

@@ -133,6 +160,21 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`. If X is
the distance array itself, use "precomputed" as the metric.

blockwise: {'auto', True, False}
Member

can you add versionadded to both new parameters.

Contributor Author

How do you add the flag to a parameter? I didn't manage to do it :-/

@ogrisel
Member

ogrisel commented Oct 21, 2015

The failure under Windows might be random and not necessarily related to this specific PR. We had seemingly similar random failures in the past on completely unrelated PRs, and they disappeared without us doing anything.

Apparently here we first get a WindowsError: [Error 5] Access is denied, and then a series of failures that reference "Attempting to do parallel computing without protecting your import", but which are probably actually triggered by a problem in the runtime environment on the CI server.

Let me relaunch appveyor manually on that PR.

approximately the squared number of clusters. The latter allows
parallelization through ``n_jobs`` parameter.
Default is 'auto' that chooses the fastest option without
consideration for memory consumption.
Member

Here, just after this line, add a ``.. versionadded:: 0.17``. Not sure if it needs a newline before it, though.
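For reference, the numpydoc convention puts the directive on its own line, indented under the parameter description and separated by a blank line, roughly like this (a docstring-only stub for illustration, not the actual diff):

```python
def silhouette_score(X, labels, metric='euclidean', blockwise='auto', **kwds):
    """Compute the mean silhouette coefficient of all samples.

    Parameters
    ----------
    blockwise : {'auto', True, False}
        Whether to compute the distance matrix in cluster-sized blocks.

        .. versionadded:: 0.17
    """
```

Sphinx then renders a "New in version 0.17" note under that parameter.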

@krishnateja614 commented May 24, 2016

@ogrisel, I am getting the Windows "Error 5: Access is denied" inconsistently (about 8 out of 10 times) on an Azure VM, but not on my PC. It occurs after GridSearchCV fits all the folds but before it can calculate the best parameters and best scores. I raised an issue with all the details here: #6820. The VM is Windows Server 2012 R2. Is it something related to the VM environment, or is there something we can do?

@AlexandreAbraham
Contributor Author

The CircleCI failure is not due to the PR. Should I amend and re-launch the tests?

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 27, 2015 via email

@AlexandreAbraham
Contributor Author

Done.

@AlexandreAbraham
Contributor Author

I'll rebase again but I don't get why CircleCI is failing.

@jnothman
Member

jnothman commented Aug 11, 2016

#7175 suggests this would still be of interest to users.

However, I have my doubts about how much time we save by doing this block computation cluster-wise, where clusters are very unevenly sized. If we simply processed b samples at a time, we could accumulate as follows:

def process_block(start, intra_cluster):
    """compute minimal inter-cluster distance for points in block and add to intra-cluster totals"""
    block_dists = pairwise_distances(X[start:start+b], X)
    cluster_dists = np.zeros((b, n_clusters))
    # Following is equivalent to cluster_dists = np.vstack([np.bincount(y, row, n_clusters) for row in block_dists])
    np.add.at(cluster_dists, (np.repeat(np.arange(b), len(y)), np.tile(y, b)), block_dists.ravel())
    np.add.at(intra_cluster, y[start:start+b], cluster_dists[np.arange(b), y[start:start+b]])
    cluster_dists[np.arange(b), y[start:start+b]] = np.inf
    return (cluster_dists / np.bincount(y)).min(axis=1)

This would also much more easily benefit from parallelism (obviously without sharing intra_cluster accumulators), I think.

@jnothman
Member

That code is wrong (I forgot the definition of intra-cluster distance), but the idea still applies.
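For illustration, a corrected, self-contained version of that fixed-size-block idea might look like this (my sketch, not proposed PR code; plain NumPy with an explicit Euclidean computation standing in for pairwise_distances):

```python
import numpy as np


def silhouette_by_row_blocks(X, y, b=64):
    """Silhouette samples computed b rows of the distance matrix at a time.

    Only a (b, n_samples) distance block is ever materialized, regardless
    of how unevenly the clusters are sized.
    """
    y = np.asarray(y)
    n = len(y)
    n_clusters = y.max() + 1
    sizes = np.bincount(y, minlength=n_clusters)
    scores = np.empty(n)
    for start in range(0, n, b):
        stop = min(start + b, n)
        block = X[start:stop]
        # Euclidean distances from the block rows to every sample.
        d = np.sqrt(((block[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        # Sum of distances from each block row to each cluster.
        cluster_sums = np.stack([np.bincount(y, row, n_clusters) for row in d])
        rows = np.arange(stop - start)
        own = y[start:stop]
        # a(i): mean distance to the other members of i's own cluster
        # (the zero self-distance is in the sum, hence the size - 1).
        a = cluster_sums[rows, own] / np.maximum(sizes[own] - 1, 1)
        # b(i): smallest mean distance to any other cluster.
        cluster_means = cluster_sums / sizes
        cluster_means[rows, own] = np.inf
        b_i = cluster_means.min(axis=1)
        scores[start:stop] = (b_i - a) / np.maximum(a, b_i)
    return scores
```

As noted above, each block is embarrassingly parallel as long as each worker returns its own slice of scores rather than sharing accumulators.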

@jnothman
Member

jnothman commented Jun 6, 2017

FWIW, I think this is still in the pipeline to fix, via #7979

8 participants