[MRG+1] Fix regression in silhouette_score for clusters of size 1 #7438
Conversation
The test could probably be strengthened.
@jnothman here is an excerpt from the Silhouette paper mentioned in the docstring:
http://www.sciencedirect.com/science/article/pii/0377042787901257

Unfortunately your PR does not seem to implement this:

>>> import numpy as np
>>> from sklearn.metrics import silhouette_samples
>>> silhouette_samples(np.array([[0], [0.1], [1]]), np.array([0, 0, 1]))
array([ 0.9       ,  0.88888889, -0.05      ])

The last value of that array should be 0 instead of -0.05.
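For reference, the paper's convention (s(i) = 0 when i's cluster is a singleton) can be reproduced by hand. A minimal NumPy sketch for 1-D inputs; silhouette_by_hand is my own illustrative helper, not a scikit-learn function:

```python
import numpy as np

def silhouette_by_hand(X, labels):
    # Pairwise Euclidean distances for 1-D points.
    D = np.abs(X[:, None, 0] - X[None, :, 0])
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            # Rousseeuw's convention: a singleton cluster scores 0.
            scores[i] = 0.0
            continue
        # a(i): mean distance to other members of i's cluster (excluding i).
        a = D[i, same].sum() / (same.sum() - 1)
        # b(i): mean distance to the nearest other cluster.
        b = min(D[i, labels == l].mean() for l in set(labels) if l != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[0.], [0.1], [1.]])
labels = np.array([0, 0, 1])
print(silhouette_by_hand(X, labels))  # third value is 0, not -0.05
```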
Good catch. Though I think it might have been similarly misbehaved in 0.17. And annoyingly, the main example in the paper appears not to cover this case.
Yes, this patch is wrong in a few ways.
@ogrisel, what do you think behaviour should be for a cluster exclusively of multiple identical points? Score 0 or 1? Should we leave this broken for 0.18.0? Add a note that we know it's broken?
I think it should be 1 by continuity with a cluster with 2 very close points. |
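The continuity argument can be checked numerically: as the two points of one cluster approach each other, a(i) -> 0 while b(i) stays bounded away from zero, so s(i) = (b - a) / max(a, b) -> 1. A sketch, assuming the fixed silhouette_samples behaviour:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Shrink the gap between the two points of cluster 1 and watch
# their silhouette scores climb towards 1.
for eps in (0.5, 0.05, 0.005):
    X = np.array([[0.], [0.2], [5.], [5. + eps]])
    labels = np.array([0, 0, 1, 1])
    print(eps, silhouette_samples(X, labels)[2:])
```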
That would be great to fix this before the 0.18 release although I would not make it a strict blocker.
labels = np.array([1, 0, 1, 1, 1])
# The distance matrix doesn't actually matter.
D = np.random.RandomState(0).rand(len(labels), len(labels))
silhouette = silhouette_score(D, labels, metric='precomputed')
assert_false(np.isnan(silhouette))
ss = silhouette_samples(D, labels, metric='precomputed')
assert_false(np.isnan(ss).any())
assert_equal(ss[1], 0.)
Instead of using a random precomputed metric array, could you please use a toy 2D X array with shape (4, 1) in a 1-D Euclidean space, such as X = [[1.], [0.], [1.], [2.]], and then check that the output of silhouette_samples is [1., 0., 1., 0.] as expected?
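That suggestion can be sketched as a concrete check. The labels below are my own assumption (not stated in the comment), chosen so that the two identical points at x = 1 share a cluster (a(i) = 0, score 1) while x = 0 and x = 2 are singleton clusters (score 0 by convention):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[1.], [0.], [1.], [2.]])      # shape (4, 1), 1-D Euclidean
labels = np.array([0, 1, 0, 2])             # hypothetical labelling
print(silhouette_samples(X, labels))        # expected: [1., 0., 1., 0.]
```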
Thanks for the review and for handing an example to me on a silver platter.
Force-pushed from 48aa0ef to 3c850dd.
Please consider the following comments:
# silhouette = [.5, .5, 0]
# Cluster 3: intra-cluster = [0, 0]
#            inter-cluster = [arbitrary, arbitrary]
#            silhouette = [1., 1.]
Please rename the cluster ids (1, 2, 3) in this comment to match the 0, 1, and 2 used in the labels variable.
@@ -180,22 +182,24 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
     # cluster in inter_clust_dists[i]
     inter_clust_dists = np.inf * intra_clust_dists
 
-    for curr_label in unique_labels:
+    for curr_label in range(len(unique_labels)):
 
         # Find inter_clust_dist for all samples belonging to the same
         # label.
         mask = labels == curr_label
This is now broken if labels are not contiguous integers starting at zero, no?
Yes, it was broken previously :/
# is not discussed in reference material, and we choose for it a sample
# score of 1.
X = [[0.], [1.], [1.], [2.], [3.], [3.]]
labels = np.array([0, 1, 1, 1, 2, 2])
Actually please update this test to have non-contiguous labels that do not start at zero, for instance:
labels = np.array([1, 3, 3, 3, 4, 4])
I think that would be confusing here. Indeed there's already a "test_non_encoded_labels", and if I add a *2 in it, it's gappy. Actually, it was only testing silhouette_score, not silhouette_samples. Changed. Also removed the double encoding of labels from silhouette_score.
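The strengthened invariant can be sketched like this (a hypothetical condensed version of the test, not the exact patch): an affine relabelling that is gappy and does not start at zero must leave both score and per-sample values unchanged.

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.array([[0.], [1.], [1.], [2.], [3.], [3.]])
labels = np.array([0, 1, 1, 1, 2, 2])

# labels * 2 + 10 -> [10, 12, 12, 12, 14, 14]: non-contiguous, nonzero-based.
assert np.allclose(silhouette_samples(X, labels * 2 + 10),
                   silhouette_samples(X, labels))
assert np.isclose(silhouette_score(X, labels * 2 + 10),
                  silhouette_score(X, labels))
```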
@@ -205,6 +209,8 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
 
     sil_samples = inter_clust_dists - intra_clust_dists
     sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
This will still raise a division-by-zero warning if all samples are duplicates of one another and all belong to the same cluster. But maybe we don't care that much about such a rare edge case...
Also if they belong to different clusters. A quick example:
silhouette_samples([[1.0], [1.0], [1.0], [1.0]], [0, 0, 1, 1])
However, according to the formula, this might be expected behaviour.
If all samples belong to one cluster, silhouette_score is undefined according to the paper. All points being identical is not covered in the paper. Understandably.
Fair enough then.
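The 0/0 case discussed above can be silenced with NumPy's errstate context manager plus nan_to_num, in the spirit of the error context manager mentioned in the PR description. A minimal sketch, not necessarily the exact patch:

```python
import numpy as np

intra = np.array([0.0, 0.0])   # duplicate points: a(i) = 0
inter = np.array([0.0, 0.0])   # identical points in the other cluster: b(i) = 0

# Suppress the invalid-value warning from 0/0, then map nan -> 0.
with np.errstate(divide="ignore", invalid="ignore"):
    sil = (inter - intra) / np.maximum(intra, inter)
sil = np.nan_to_num(sil)
print(sil)  # [0. 0.]
```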
# (cluster 0). We also test the case where there are identical samples
# as the only members of a cluster (cluster 2). To our knowledge, this case
# is not discussed in reference material, and we choose for it a sample
# score of 1.
Why is having identical samples in a cluster a special case? It comes directly from the formula, right? After fixing i, a(i) can be viewed as the mean distance to all other identical points, which is zero, and b(i) can be anything. So the formula reduces to b(i) / b(i), which is 1, no?
Why is having identical samples in a cluster a special case?
It's not a special case, it just represents a non-smooth part of the function, where you might have expected, for clustering evaluation, that duplicating a point and assigning it the same label should not affect the score.
As a matter of implementation, I could have, even accidentally, handled the single-point cluster case in a way that assigned this 0, so it's important that we make a decision on it and test it.
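The non-smoothness is easy to exhibit: a singleton cluster scores 0 by convention, but duplicating its point with the same label makes a(i) = 0 and the score jumps to 1. A sketch, assuming the fixed silhouette_samples behaviour:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Singleton cluster 0: score 0 by convention.
print(silhouette_samples(np.array([[0.], [5.], [6.]]),
                         np.array([0, 1, 1]))[0])      # 0.0

# Duplicate that point under the same label: a(i) = 0, b(i) = 5.5, score 1.
print(silhouette_samples(np.array([[0.], [0.], [5.], [6.]]),
                         np.array([0, 0, 1, 1]))[0])   # 1.0
```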
LGTM pending @ogrisel's comments.
if n_samples_curr_lab != 0:
    intra_clust_dists[mask] = np.sum(
        current_distances[:, mask], axis=1) / n_samples_curr_lab
else:
    intra_clust_dists[mask] = 0
Is it worth initializing intra_clust_dists with np.zeros at the start and getting rid of the else?
Done.
-    silhouette_score(X, labels + 10), silhouette_score(X, labels))
+    silhouette_score(X, labels * 2 + 10), silhouette_score(X, labels))
+assert_array_equal(
+    silhouette_samples(X, labels * 2 + 10), silhouette_samples(X, labels))
Great checks, thanks!
LGTM 👍
* tag '0.18': (1286 commits) [MRG + 1] More versionadded everywhere! (scikit-learn#7403) minor doc fixes fix lbfgs rename (scikit-learn#7503) minor fixes to whatsnew fix scoring function table fix rebase messup DOC more what's new subdivision DOC Attempt to impose some order on What's New 0.18 no fixed width within bold REL changes for release in 0.18.X branch (scikit-learn#7414) [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325) DOC: Added Nested Cross Validation Example (scikit-learn#7111) Sync docstring and definition default argument in kneighbors (scikit-learn#7476) added contributors for 0.18, minor formatting fixes. Fix typo in whats_new.rst [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411) Addressing issue scikit-learn#7468. (scikit-learn#7472) Reorganize README clean up deprecation warning stuff in common tests [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438) ...
Fixes the regression described in #5988. #6089 and the initial patch here were incorrect. Replaces #6089: this is #6089 rebased, with an error context manager.