[MRG+1] Fix regression in silhouette_score for clusters of size 1 #7438
Conversation
The test could probably be strengthened.
@jnothman here is an excerpt from the Silhouette paper mentioned in the docstring:
http://www.sciencedirect.com/science/article/pii/0377042787901257

Unfortunately your PR does not seem to implement this:

>>> import numpy as np
>>> from sklearn.metrics import silhouette_samples
>>> silhouette_samples(np.array([[0], [0.1], [1]]), np.array([0, 0, 1]))
array([ 0.9       ,  0.88888889, -0.05      ])

The last value of that array should be 0 instead of -0.05.
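For reference, the paper's convention (s(i) = 0 when i's cluster is a singleton) can be reproduced by hand. A minimal NumPy sketch for 1-D inputs; silhouette_by_hand is my own illustrative helper, not a scikit-learn function:

```python
import numpy as np

def silhouette_by_hand(X, labels):
    # Pairwise Euclidean distances for 1-D points.
    D = np.abs(X[:, None, 0] - X[None, :, 0])
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            # Rousseeuw's convention: a singleton cluster scores 0.
            scores[i] = 0.0
            continue
        # a(i): mean distance to other members of i's cluster (excluding i).
        a = D[i, same].sum() / (same.sum() - 1)
        # b(i): mean distance to the nearest other cluster.
        b = min(D[i, labels == l].mean() for l in set(labels) if l != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[0.], [0.1], [1.]])
labels = np.array([0, 0, 1])
print(silhouette_by_hand(X, labels))  # third value is 0, not -0.05
```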
Good catch. Though I think it might have been similarly misbehaved in 0.17. And annoyingly, the main example in the paper appears not to cover this case.
Yes, this patch is wrong in a few ways.
@ogrisel, what do you think behaviour should be for a cluster exclusively of multiple identical points? Score 0 or 1? Should we leave this broken for 0.18.0? Add a note that we know it's broken?
I think it should be 1 by continuity with a cluster with 2 very close points. |
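The continuity argument can be checked numerically: as the two points of one cluster approach each other, a(i) -> 0 while b(i) stays bounded away from zero, so s(i) = (b - a) / max(a, b) -> 1. A sketch, assuming the fixed silhouette_samples behaviour:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Shrink the gap between the two points of cluster 1 and watch
# their silhouette scores climb towards 1.
for eps in (0.5, 0.05, 0.005):
    X = np.array([[0.], [0.2], [5.], [5. + eps]])
    labels = np.array([0, 0, 1, 1])
    print(eps, silhouette_samples(X, labels)[2:])
```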
That would be great to fix this before the 0.18 release although I would not make it a strict blocker.
labels = np.array([1, 0, 1, 1, 1])
# The distance matrix doesn't actually matter.
D = np.random.RandomState(0).rand(len(labels), len(labels))
silhouette = silhouette_score(D, labels, metric='precomputed')
assert_false(np.isnan(silhouette))
ss = silhouette_samples(D, labels, metric='precomputed')
assert_false(np.isnan(ss).any())
assert_equal(ss[1], 0.)
Instead of using a random precomputed metric array, could you please use a toy 2D X array with shape (4, 1) in a 1-D Euclidean space, such as X = [[1.], [0.], [1.], [2.]], and then check that the output of silhouette_samples is [1., 0., 1., 0.] as expected?
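That suggestion can be sketched as a concrete check. The labels below are my own assumption (not stated in the comment), chosen so that the two identical points at x = 1 share a cluster (a(i) = 0, score 1) while x = 0 and x = 2 are singleton clusters (score 0 by convention):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[1.], [0.], [1.], [2.]])      # shape (4, 1), 1-D Euclidean
labels = np.array([0, 1, 0, 2])             # hypothetical labelling
print(silhouette_samples(X, labels))        # expected: [1., 0., 1., 0.]
```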
Thanks for the review and for handing an example to me on a silver platter.
Force-pushed from 48aa0ef to 3c850dd.
Please consider the following comments:
# silhouette = [.5, .5, 0]
# Cluster 3: intra-cluster = [0, 0]
#            inter-cluster = [arbitrary, arbitrary]
#            silhouette = [1., 1.]
Please rename the cluster ids (1, 2, 3) in this comment to match the 0, 1, and 2 used in the labels variable.
@@ -180,22 +182,24 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
     # cluster in inter_clust_dists[i]
     inter_clust_dists = np.inf * intra_clust_dists
 
-    for curr_label in unique_labels:
+    for curr_label in range(len(unique_labels)):
 
         # Find inter_clust_dist for all samples belonging to the same
         # label.
         mask = labels == curr_label
This is now broken if labels are not contiguous integers starting at zero, no?
Yes, it was broken previously :/
# is not discussed in reference material, and we choose for it a sample
# score of 1.
X = [[0.], [1.], [1.], [2.], [3.], [3.]]
labels = np.array([0, 1, 1, 1, 2, 2])
Actually please update this test to have non-contiguous labels that do not start at zero, for instance:
labels = np.array([1, 3, 3, 3, 4, 4])
I think that would be confusing here. Indeed there's already a "test_non_encoded_labels", and if I add a *2 in it, it's gappy. Actually, it was only testing silhouette_score, not silhouette_samples. Changed. Also removed the double encoding of labels from silhouette_score.
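The strengthened invariant can be sketched like this (a hypothetical condensed version of the test, not the exact patch): an affine relabelling that is gappy and does not start at zero must leave both score and per-sample values unchanged.

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.array([[0.], [1.], [1.], [2.], [3.], [3.]])
labels = np.array([0, 1, 1, 1, 2, 2])

# labels * 2 + 10 -> [10, 12, 12, 12, 14, 14]: non-contiguous, nonzero-based.
assert np.allclose(silhouette_samples(X, labels * 2 + 10),
                   silhouette_samples(X, labels))
assert np.isclose(silhouette_score(X, labels * 2 + 10),
                  silhouette_score(X, labels))
```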
@@ -205,6 +209,8 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
 
     sil_samples = inter_clust_dists - intra_clust_dists
     sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
This will still raise a division-by-zero warning if all samples are duplicates of one another and all belong to the same cluster. But maybe we don't care that much about such a rare edge case...
Also if they belong to different clusters. A quick example:
silhouette_samples([[1.0], [1.0], [1.0], [1.0]], [0, 0, 1, 1])
However, according to the formula, this might be expected behaviour.
If all samples belong to one cluster, silhouette_score is undefined according to the paper. All points being identical is not covered in the paper. Understandably.
Fair enough then.
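The 0/0 case discussed above can be silenced with NumPy's errstate context manager plus nan_to_num, in the spirit of the error context manager mentioned in the PR description. A minimal sketch, not necessarily the exact patch:

```python
import numpy as np

intra = np.array([0.0, 0.0])   # duplicate points: a(i) = 0
inter = np.array([0.0, 0.0])   # identical points in the other cluster: b(i) = 0

# Suppress the invalid-value warning from 0/0, then map nan -> 0.
with np.errstate(divide="ignore", invalid="ignore"):
    sil = (inter - intra) / np.maximum(intra, inter)
sil = np.nan_to_num(sil)
print(sil)  # [0. 0.]
```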
# (cluster 0). We also test the case where there are identical samples
# as the only members of a cluster (cluster 2). To our knowledge, this case
# is not discussed in reference material, and we choose for it a sample
# score of 1.
Why is having identical samples in a cluster a special case? It comes directly from the formula, right? After fixing i, a(i) can be viewed as the mean distance to all other identical points, which is zero, and b(i) can be anything. So the formula reduces to b(i) / b(i), which is 1, no?
Why is having identical samples in a cluster a special case?
It's not a special case, it just represents a non-smooth part of the function, where you might have expected, for clustering evaluation, that duplicating a point and assigning it the same label should not affect the score.
As a matter of implementation, I could have, even accidentally, handled the single-point cluster case in a way that assigned this 0, so it's important that we make a decision on it and test it.
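The non-smoothness is easy to exhibit: a singleton cluster scores 0 by convention, but duplicating its point with the same label makes a(i) = 0 and the score jumps to 1. A sketch, assuming the fixed silhouette_samples behaviour:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Singleton cluster 0: score 0 by convention.
print(silhouette_samples(np.array([[0.], [5.], [6.]]),
                         np.array([0, 1, 1]))[0])      # 0.0

# Duplicate that point under the same label: a(i) = 0, b(i) = 5.5, score 1.
print(silhouette_samples(np.array([[0.], [0.], [5.], [6.]]),
                         np.array([0, 0, 1, 1]))[0])   # 1.0
```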
LGTM pending @ogrisel's comments.
if n_samples_curr_lab != 0:
    intra_clust_dists[mask] = np.sum(
        current_distances[:, mask], axis=1) / n_samples_curr_lab
else:
    intra_clust_dists[mask] = 0
Is it worth initializing intra_clust_dists with np.zeros at the start and getting rid of the else?
Done.
-    silhouette_score(X, labels + 10), silhouette_score(X, labels))
+    silhouette_score(X, labels * 2 + 10), silhouette_score(X, labels))
+assert_array_equal(
+    silhouette_samples(X, labels * 2 + 10), silhouette_samples(X, labels))
Great checks, thanks!
LGTM 👍
* tag '0.18': (1286 commits) [MRG + 1] More versionadded everywhere! (scikit-learn#7403) minor doc fixes fix lbfgs rename (scikit-learn#7503) minor fixes to whatsnew fix scoring function table fix rebase messup DOC more what's new subdivision DOC Attempt to impose some order on What's New 0.18 no fixed width within bold REL changes for release in 0.18.X branch (scikit-learn#7414) [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325) DOC: Added Nested Cross Validation Example (scikit-learn#7111) Sync docstring and definition default argument in kneighbors (scikit-learn#7476) added contributors for 0.18, minor formatting fixes. Fix typo in whats_new.rst [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411) Addressing issue scikit-learn#7468. (scikit-learn#7472) Reorganize README clean up deprecation warning stuff in common tests [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438) ...
Fixes the regression described in #5988. #6089 and the initial patch here were incorrect. Replaces #6089: this is #6089 rebased, with an error context manager.