[MRG] Blockwise parallel silhouette computation #1976

AlexandreAbraham · 2013-05-21T08:10:30Z

This pull request introduces a blockwise computation of the silhouette that uses less memory but is slightly slower.

First, the vocabulary can be debated. I have chosen the word "global" for the original silhouette strategy (as the global distance matrix is computed, I could not come up with a better word) and "blockwise" for my method.

The other point is that I have 3 implementations of the silhouette:

the original one
a blockwise single threaded version
a blockwise multi threaded version (which uses a bit more memory than the single threaded version, even if n_jobs=1)

I have decided to leave the original version untouched, as its code is far more readable than mine. I have also not kept the blockwise single-threaded version, even though it uses the least memory, because I think the memory gain is not worth the added code complexity.

It is in fact possible to have "one implementation to rule them all". This could be done by keeping the same code skeleton and using a "precomputed" distance matrix for the original version. But I think that the code would become too opaque.
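To make the difference between the two strategies concrete, here is a minimal NumPy sketch of a clusterwise ("blockwise") silhouette. It is not the code from this pull request: the names `silhouette_blockwise` and `_euclidean` are made up for illustration, only the Euclidean metric is handled, and giving singleton clusters a score of 0 is an assumption. The point it illustrates is that only one cluster-pair distance block exists at a time, rather than the full n_samples x n_samples matrix.

```python
import numpy as np


def _euclidean(A, B):
    """Pairwise Euclidean distances between rows of A and rows of B.

    Uses plain broadcasting, so the temporary array has shape
    (len(A), len(B), n_features); a real implementation would call a
    dedicated pairwise-distance routine instead.
    """
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def silhouette_blockwise(X, labels):
    """Per-sample silhouette computed one cluster-pair block at a time.

    Assumes at least two distinct labels (hypothetical helper).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    masks = [labels == c for c in np.unique(labels)]
    n = X.shape[0]
    a = np.zeros(n)                      # mean intra-cluster distance
    b = np.full(n, np.inf)               # mean distance to nearest other cluster
    singleton = np.zeros(n, dtype=bool)  # samples that are alone in their cluster

    for i, mask_i in enumerate(masks):
        X_i = X[mask_i]
        n_i = X_i.shape[0]
        if n_i > 1:
            # The self-distance is 0, so divide by n_i - 1 rather than n_i.
            a[mask_i] = _euclidean(X_i, X_i).sum(axis=1) / (n_i - 1)
        else:
            singleton[mask_i] = True
        for j in range(i + 1, len(masks)):
            mask_j = masks[j]
            block = _euclidean(X_i, X[mask_j])  # shape (n_i, n_j)
            # One block serves both clusters: row means for i, column means for j.
            b[mask_i] = np.minimum(b[mask_i], block.mean(axis=1))
            b[mask_j] = np.minimum(b[mask_j], block.mean(axis=0))

    s = (b - a) / np.maximum(a, b)
    s[singleton] = 0.0  # assumed convention for clusters of size one
    return s
```

The mean of the returned array corresponds to the overall silhouette score. Each cluster pair is visited once, so the independent blocks are also the natural unit of work to distribute across jobs.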

jaquesgrobler · 2013-05-21T08:39:29Z

sklearn/metrics/cluster/unsupervised.py

+        The method used to compute distance matrix between samples. Default is
+        ``global`` which means that the full distance matrix is computed
+        yielding in fast computation but high memory consumption. ``blockwise``
+        option compute clusterwise distance matrices, dividing approximately


compute -> computes :)

and maybe add a the before blockwise

approximately should maybe be between by and the squared in the next line.
So it reads: ...dividing memory consumption by approximately the squared number of clusters.
Does that make sense? Currently it's a bit hard to read.

I agree. Plus, it is a bit superfluous. Do you think I should remove this part and just say that it lowers memory consumption, without giving an order of magnitude?

I think it's okay as is now 👍
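As a rough illustration of the "squared number of clusters" figure discussed in this thread, the arithmetic can be checked in a few lines. The sample count, cluster count, float64 storage, and equally sized clusters are all assumptions chosen for the example; with uneven clusters the largest block dominates and the saving is smaller.

```python
def distance_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols matrix (float64 by default)."""
    return n_rows * n_cols * itemsize


n_samples, n_clusters = 10_000, 10
cluster_size = n_samples // n_clusters  # assumes equally sized clusters

# "global": one matrix over all samples
global_bytes = distance_matrix_bytes(n_samples, n_samples)
# "blockwise": one cluster-by-cluster block at a time
block_bytes = distance_matrix_bytes(cluster_size, cluster_size)

print(global_bytes / 1e6)           # 800.0 (MB)
print(block_bytes / 1e6)            # 8.0 (MB)
print(global_bytes // block_bytes)  # 100, i.e. n_clusters ** 2
```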

jaquesgrobler · 2013-05-21T08:52:19Z

I had a quick look. Thanks for the add-on, @AlexandreAbraham. This seems very useful considering the conversation on the mailing list here

GaelVaroquaux · 2013-05-21T16:53:54Z

sklearn/metrics/cluster/unsupervised.py

@@ -43,10 +48,24 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
        <sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance
        array itself, use ``metric="precomputed"``.

+    method: string


The docstring should be {'global', 'blockwise'}, and not 'string'.

GaelVaroquaux · 2013-05-21T16:56:44Z

Looks good to me.

mblondel · 2013-05-21T23:10:29Z

sklearn/metrics/cluster/unsupervised.py

+def _intra_cluster_distances_block(X_cluster, metric, **kwds):
+    ''' Calculate the mean intra-cluster distance for a given cluster
+
+    Parameters


I think I would remove the parameter documentation and leave only the short description above. Those two functions are private and short. The documentation takes more space than the actual implementation.


I was actually thinking the same. So +1 on this remark

Done. I have also removed the parameters documentation for the other private methods (original method).

jaquesgrobler · 2013-05-22T09:31:03Z

+1 for merge

mblondel · 2014-01-05T08:52:37Z

Hum this PR has two +1 for merge and is still open?!

amueller · 2014-01-05T10:13:08Z

sklearn/metrics/cluster/unsupervised.py

@@ -179,25 +238,24 @@ def _intra_cluster_distance(distances_row, labels, i):

 def _nearest_cluster_distance(distances_row, labels, i):


why remove the docstrings?

AlexandreAbraham · 2015-10-19T08:48:04Z

I fixed the only remaining remark (docstring removed for no reason) and rebased. This should be good to go.

GaelVaroquaux · 2015-10-19T09:55:56Z

sklearn/metrics/cluster/unsupervised.py

+        ``blockwise`` option computes clusterwise distance matrices, dividing
+        memory consumption by approximately the squared number of clusters.
+        The ``blockwise`` method allows parallelization through ``n_jobs``
+        parameter.


I suggest that we rename this argument to 'blockwise', and have as options False, True, and 'auto', where 'auto' is the default and does some clever check on the size of the input to decide whether blocking is worthwhile.

AlexandreAbraham · 2015-10-19T14:03:05Z

I used this code to bench the silhouette computation:

import time

from sklearn import datasets
from sklearn.metrics.cluster.unsupervised import silhouette_score


def test_silhouette(n_samples, n_clusters):
    # Time 20 silhouette computations for each strategy on the same blobs.
    X, y = datasets.make_blobs(n_samples=n_samples, n_features=500,
                               centers=n_clusters)
    t0 = time.time()
    for _ in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=False)
    t_global = time.time() - t0
    t0 = time.time()
    for _ in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=True)
    t_block = time.time() - t0

    return t_global, t_block

# Output columns: n_clusters, n_samples, global time (s), blockwise time (s)
for n_clusters in [50, 100, 200, 300]:
    for n_samples in [200, 300, 400, 500, 1000]:
        # Skip settings with no more than one sample beyond the cluster count.
        if n_clusters >= n_samples - 1:
            continue
        t_g, t_b = test_silhouette(n_samples, n_clusters)
        print('%d,%d,%f,%f' % (n_clusters, n_samples, t_g, t_b))

On my machine, blockwise is always faster below 50 clusters. Above 50 clusters, the global version is faster if there are fewer than 5 * n_labels samples. I set up a heuristic in the code accordingly.
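The rule of thumb described in this comment could be expressed roughly as follows. This is a sketch and not the code from the pull request: the helper name `use_blockwise` is hypothetical, and treating exactly 50 clusters like the "above 50" case is an assumption, since the comment does not say which side that boundary falls on.

```python
def use_blockwise(n_samples, n_clusters, blockwise="auto"):
    """Decide whether to take the blockwise code path (hypothetical helper)."""
    if isinstance(blockwise, bool):
        # An explicit True/False from the caller always wins.
        return blockwise
    if blockwise != "auto":
        raise ValueError(
            "blockwise must be 'auto', True or False, got %r" % (blockwise,)
        )
    if n_clusters < 50:
        # Few clusters: blockwise was always faster in the benchmark.
        return True
    # Many clusters: global is faster when there are fewer than
    # 5 * n_clusters samples.
    return n_samples >= 5 * n_clusters


print(use_blockwise(1000, 10))   # True
print(use_blockwise(200, 100))   # False
print(use_blockwise(1000, 100))  # True
```

Because the thresholds come from timings on a single machine, they are best read as tunable defaults rather than fixed constants.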

GaelVaroquaux · 2015-10-19T14:11:25Z

sklearn/metrics/cluster/tests/test_unsupervised.py

+    # Test with the block method
+    silhouette_metric_block = silhouette_score(X, y, metric='euclidean',
+                                               blockwise=True)
+    assert_almost_equal(silhouette, silhouette_metric_block)


We need the auto logic covered too (just to explore all codepaths).

The 'auto' case is covered below. But I can add an extra verification here if you want.


OK. Sorry, I was being stupid!

AlexandreAbraham · 2015-10-19T14:15:10Z

Yes, given the last comments, I'm working on the docs.

GaelVaroquaux · 2015-10-21T14:19:33Z

Your tests are failing under Windows because of the added parallelism. You need to disable it in the test (check AppVeyor to see the problem).

amueller · 2015-10-21T14:24:08Z

sklearn/metrics/cluster/unsupervised.py

@@ -133,6 +160,21 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
        allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`. If X is
        the distance array itself, use "precomputed" as the metric.

+    blockwise: {'auto', True, False}


can you add versionadded to both new parameters.

How do you add the flag to a parameter? I didn't manage to do it :-/

ogrisel · 2015-10-21T14:28:06Z

The failure under Windows might be random and not necessarily related to this specific PR. We have had seemingly similar random failures in the past on completely unrelated PRs, and they disappeared without us doing anything.

Apparently here we first get a WindowsError: [Error 5] Access is denied and then a series of failures that reference "Attempting to do parallel computing without protecting your import" but which are probably actually triggered by a problem in the runtime environment on the CI server.

Let me relaunch appveyor manually on that PR.

amueller · 2015-10-21T16:09:33Z

sklearn/metrics/cluster/unsupervised.py

+        approximately the squared number of clusters. The latter allows
+        parallelization through ``n_jobs`` parameter.
+        Default is 'auto' that chooses the fastest option without
+        consideration for memory consumption.


Here, just after this line, add a .. versionadded:: 0.17 directive. I'm not sure whether it needs a newline before it, though.
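For reference, a per-parameter version note in a numpydoc-style docstring is usually written as below: the directive is indented to the level of the parameter description and separated from it by a blank line, since reStructuredText only recognizes a directive at the start of a new block. The function name and the descriptions are placeholders, and treating ``n_jobs`` as the second new parameter is an assumption.

```python
def silhouette_score_stub(X, labels, blockwise="auto", n_jobs=1):
    """Illustrative stub; only the docstring layout matters here.

    Parameters
    ----------
    blockwise : {'auto', True, False}
        Whether to compute clusterwise distance blocks instead of the
        full distance matrix.

        .. versionadded:: 0.17

    n_jobs : int
        Number of parallel jobs used by the blockwise computation.

        .. versionadded:: 0.17
    """
    raise NotImplementedError("docstring layout example only")
```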

krishnateja614 · 2016-05-24

@ogrisel, I am getting the Windows "Error 5: Access is denied" failure inconsistently (but roughly 8 out of 10 times) on an Azure VM, though not on my PC. It occurs after GridSearchCV fits all the folds but before it can calculate the best parameters and best scores. I raised an issue with all the details in #6820. The VM runs Windows Server 2012 R2. Is it something related to the VM environment, or is there something we can do?

AlexandreAbraham · 2015-10-27T16:53:57Z

The CircleCI failure is not due to the PR. Should I amend and re-launch the tests?

GaelVaroquaux · 2015-10-27T16:55:01Z


How about rebasing on master

AlexandreAbraham · 2015-10-27T16:58:24Z

Done.

AlexandreAbraham · 2015-11-08T09:04:10Z

I'll rebase again but I don't get why CircleCI is failing.

jnothman · 2016-08-11T04:07:29Z

#7175 suggests this would still be of interest to users.

However, I have my doubts about how much time we save by doing this block computation cluster-wise, where clusters are very unevenly sized. If we simply processed b samples at a time, we could accumulate as follows:

def process_block(start, intra_cluster):
    """compute minimal inter-cluster distance for points in block and add to intra-cluster totals"""
    block_dists = pairwise_distances(X[start:start+b], X)
    cluster_dists = np.zeros((b, n_clusters))
    # Following is equivalent to cluster_dists = np.vstack([np.bincount(y, row, n_clusters) for row in block_dists])
    np.add.at(cluster_dists, (np.repeat(np.arange(b), len(y)), np.tile(y, b)), block_dists.ravel())
    np.add.at(intra_cluster, y[start:start+b], cluster_dists[np.arange(b), y[start:start+b]])
    cluster_dists[np.arange(b), y[start:start+b]] = np.inf
    return (cluster_dists / np.bincount(y)).min(axis=1)

This would also much more easily benefit from parallelism (obviously without sharing intra_cluster accumulators), I think.

jnothman · 2016-08-11T09:05:01Z

That code is wrong (I forgot the definition of intra-cluster distance), but the idea still applies.
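One possible reading of the idea, with the intra-cluster distance handled per sample, is sketched below. It is not the code from any of the linked pull requests: the function name `silhouette_by_sample_blocks` and the `block_size` parameter are made up for illustration, only the Euclidean metric is handled, and scoring singleton clusters as 0 is an assumption. Each block of rows is self-contained, so no accumulator has to be shared between blocks.

```python
import numpy as np


def silhouette_by_sample_blocks(X, y, block_size=64):
    """Per-sample silhouette, processing block_size rows at a time.

    Assumes at least two distinct labels (hypothetical helper).
    """
    X = np.asarray(X, dtype=float)
    # Relabel clusters to 0 .. n_clusters - 1 so they can index columns.
    _, y = np.unique(np.asarray(y), return_inverse=True)
    counts = np.bincount(y)
    n_clusters = counts.shape[0]
    n_samples = X.shape[0]
    out = np.zeros(n_samples)

    for start in range(0, n_samples, block_size):
        stop = min(start + block_size, n_samples)
        rows = np.arange(stop - start)
        own = y[start:stop]

        # Distances from the block to every sample, shape (block, n_samples).
        # Broadcasting creates a (block, n_samples, n_features) temporary;
        # a real implementation would use a pairwise-distance routine.
        diff = X[start:stop, None, :] - X[None, :, :]
        block_dists = np.sqrt((diff ** 2).sum(axis=-1))

        # Sum of distances from each row to each cluster: (block, n_clusters).
        cluster_sums = np.vstack([
            np.bincount(y, weights=row, minlength=n_clusters)
            for row in block_dists
        ])

        # a(i): mean distance to the other members of the sample's own cluster.
        own_counts = counts[own]
        intra = cluster_sums[rows, own] / np.maximum(own_counts - 1, 1)

        # b(i): smallest mean distance to any other cluster.
        cluster_means = cluster_sums / counts
        cluster_means[rows, own] = np.inf
        inter = cluster_means.min(axis=1)

        sil = (inter - intra) / np.maximum(intra, inter)
        sil[own_counts == 1] = 0.0  # assumed convention for singletons
        out[start:stop] = sil
    return out
```

Peak memory for the distance slab is governed by `block_size` and not by the cluster sizes, which is what makes this variant insensitive to very uneven clusters.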

jnothman · 2017-06-06T13:51:03Z

FWIW, I think this is still in the pipeline to fix, via #7979

jaquesgrobler reviewed May 21, 2013

GaelVaroquaux reviewed May 21, 2013

mblondel reviewed May 21, 2013

amueller reviewed Jan 5, 2014

larsmans force-pushed the master branch from 58a55ad to 4b82379 Compare August 25, 2014 21:50

MechCoder force-pushed the master branch from 6deaea0 to 3f49cee Compare November 3, 2014 12:36

amueller mentioned this pull request Dec 2, 2014

[MRG+1] LDA refactoring #3523

Merged

AlexandreAbraham changed the title ~~Blockwise parallel silhouette computation~~ [MRG] Blockwise parallel silhouette computation Oct 19, 2015

GaelVaroquaux reviewed Oct 19, 2015

amueller reviewed Oct 21, 2015

AlexandreAbraham added 9 commits November 8, 2015 10:06

Add silhouette blockwise test

0c66127

Add silhouette blockwise method

5e40052

Fix doc

811bd95

Typos

7831391

Fix docstring.

4edc68e

Modify the function signature as requested.

84cc22b

Set an automatic heuristic for blockwise silouhette

a9c73de

Improve doc and error handling

9de762b

Improve parameter checking.

9e10f6d

amueller added the Waiting for Reviewer label Dec 10, 2015

jnothman mentioned this pull request Aug 11, 2016

System freeze on Silhouette scoring #7175

Closed

jnothman mentioned this pull request Aug 11, 2016

[MRG] Block-wise silhouette calculation to avoid memory consumption #7177

Closed

10 tasks

AlexandreAbraham closed this Jun 6, 2017

jnothman mentioned this pull request Dec 10, 2017

[MRG+1] ENH Add working_memory global config for chunked operations #10280

Merged

10 tasks

