-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Blockwise parallel silhouette computation #1976
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Blockwise parallel silhouette computation #1976
Conversation
The method used to compute distance matrix between samples. Default is | ||
``global`` which means that the full distance matrix is computed | ||
yielding in fast computation but high memory consumption. ``blockwise`` | ||
option compute clusterwise distance matrices, dividing approximately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compute -> computes :)
and maybe add a the
before blockwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approximately
should maybe be between by
and the squared
in the next line.
So it reads: ...dividing memory consumption by approximately the squared number of clusters.
That make sense? currently its a bit hard to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Plus, it is a bit superfluous. Do you think I should remove this part and just that it lowers memory consumption without giving an order of magnitude ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's okay as is now 👍
I had a quick look. Thanks for the add-on, @AlexandreAbraham. This seems very useful considering the conversation on the mailing list here |
@@ -43,10 +48,24 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None, | |||
<sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance | |||
array itself, use ``metric="precomputed"``. | |||
|
|||
method: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring should be {'global', 'blockwise'}, and not 'string'.
Looks good to me. |
def _intra_cluster_distances_block(X_cluster, metric, **kwds): | ||
''' Calculate the mean intra-cluster distance for a given cluster | ||
|
||
Parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would remove the parameter documentation and leave only the short description above. Those two functions are private and short. The documentation takes more space than the actual implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would remove the parameter documentation and leave only the
short description above. Those two functions are private and short. The
documentation takes more space than the actual implementation.
I was actually thinking the same. So +1 on this remark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I have also removed the parameters documentation for the other private methods (original method).
+1 for merge |
Hum this PR has two +1 for merge and is still open?! |
@@ -179,25 +238,24 @@ def _intra_cluster_distance(distances_row, labels, i): | |||
|
|||
def _nearest_cluster_distance(distances_row, labels, i): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove the docstrings?
I fixed the only remaining remark (docstring removed for no reason) and rebased. This should be good to go. |
``blockwise`` option computes clusterwise distance matrices, dividing | ||
memory consumption by approximately the squared number of clusters. | ||
The ``blockwise`` method allows parallelization through ``n_jobs`` | ||
parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we rename this argument to 'blockwise', and have as options False, True, and 'auto', where 'auto' is the default, and you do some clever check on the size of the input to decided whether blocking is good or not.
I used this code to bench the silhouette computation:
On my machine, blockwise is always faster below 50 clusters. Above 50 clusters, the global version is faster if there is less then |
# Test with the block method | ||
silhouette_metric_block = silhouette_score(X, y, metric='euclidean', | ||
blockwise=True) | ||
assert_almost_equal(silhouette, silhouette_metric_block) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need the auto logic covered too (just to explore all codepaths).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'auto' case is covered below. But I can add an extra verification here if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, given the last comments, I'm working on the docs. |
Your tests are failing under windows because of the added parallelism. You need to disable it in the test (check appveyor to see the problem). |
@@ -133,6 +160,21 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds): | |||
allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`. If X is | |||
the distance array itself, use "precomputed" as the metric. | |||
|
|||
blockwise: {'auto', True, False} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add versionadded
to both new parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you add the flag to a parameter? I didn't manage to do it :-/
The failure under windows might be random and not necessarily related to this specific PR. We had seemingly similar random failure in the past on completely unrelated PRs and the failure disappeared without having us do anything. Apparently here we first get a Let me relaunch appveyor manually on that PR. |
approximately the squared number of clusters. The latter allows | ||
parallelization through ``n_jobs`` parameter. | ||
Default is 'auto' that chooses the fastest option without | ||
consideration for memory consumption. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here just after this line add a .. versionadded::0.17
not sure if it needs a newline before it though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel , I am getting the windows error 5 access is denied inconsistently (but like 8 out of 10 times) on an Azure VM but not my PC. This is occurring after gridsearchcv fits all the folds but before it can calculate the best parameters and best scores. I raised an issue with all the details here #6820 . The VM is Windows Server 2012 R2. Is it something related to VM environment or is there something we can do?
The CircleCI failure is not due to the PR. Should I amend and re-launch the tests? |
The CircleCI failure is not due to the PR. Should I amend and re-launch the
tests?
How about rebasing on master
|
Done. |
I'll rebase again but I don't get why CircleCI is failing. |
#7175 suggest this would still be of interest to users. However, I have my doubts about how much time we save by doing this block computation cluster-wise, where clusters are very unevenly sized. If we simply processed def process_block(start, intra_cluster):
"""compute minimal inter-cluster distance for points in block and add to intra-cluster totals"""
block_dists = pairwise_distances(X[start:start+b], X)
cluster_dists = np.zeros((b, n_clusters))
# Following is equivalent to cluster_dists = np.vstack([np.bincount(y, row, n_clusters) for row in block_dists])
np.add.at(cluster_dists, (np.repeat(np.arange(b), len(y)), np.tile(y, b)), block_dists.ravel())
np.add.at(intra_cluster, y[start:start+b], cluster_dists[np.arange(b), y[start:start+b]])
cluster_dists[np.arange(b), y[start:start+b]] = np.inf
return (cluster_dists / np.bincount(y)).min(axis=1) This would also much more easily benefit from parallelism (obviously without sharing |
That code is wrong (I forgot the definition of intra-cluster distance), but the idea still applies. |
FWIW, I think this is still in the pipeline to fix, via #7979 |
This pull request introduces a blockwise computation of the silhouette, using less memory, but slightly slower.
First, the vocabulary can be debated. I have chosen the word "global" for the original silhouette strategy (as the global distance matrix is computed, I could not come with a better word) and "blockwise" for my method.
The other point is that I have 3 implementations of the silhouette:
I have decided to leave the original version untouched, as the code is far more readable than mine. Then I have not kept the most efficient blockwise single threaded version because the memory gain is not worth the code complication (I think).
It is in fact possible to have "one implementation to rule them all". This could be done by keeping the same code skeleton and using a "precomputed" distance matrix for the original version. But I think that the code would become too opaque.