Skip to content

[MRG] Fix regression in silhouette_score for clusters of size 1 #6089

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

Sentient07
Copy link
Contributor

fix #5988

@@ -162,7 +162,7 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
B = np.array([_nearest_cluster_distance(distances[i], labels, i)
for i in range(n)])
sil_samples = (B - A) / np.maximum(A, B)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could also revert that deleted comment Addressed

@raghavrv
Copy link
Member

raghavrv commented Jan 4, 2016

This looks good to me (apart from the minor nitpick)! Wait for reviews from @MechCoder or @amueller

(PS: Could you also squash the two commits? ;) )

@@ -48,7 +48,8 @@ def test_no_nan():
D = np.random.RandomState(0).rand(len(labels), len(labels))
silhouette = silhouette_score(D, labels, metric='precomputed')
assert_false(np.isnan(silhouette))

silhouette_sample = silhouette_samples([[3],[3]], np.array([2,4]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 ([2,4] --> [2, 4])

@MechCoder MechCoder changed the title Reverted the change, added regression test [MRG] Fix regression in silhouette_score for clusters of size 1 Jan 7, 2016
@@ -48,7 +48,8 @@ def test_no_nan():
D = np.random.RandomState(0).rand(len(labels), len(labels))
silhouette = silhouette_score(D, labels, metric='precomputed')
assert_false(np.isnan(silhouette))

silhouette_sample = silhouette_samples([[3],[3]], np.array([2,4]))
assert_false(np.isnan(silhouette_sample).any())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not a regression test. It will pass in master.

Can you remove this and add a new test testing something like this

distances = np.random.RandomState(0).rand(5)
A = _intra_cluster_distance(distance, np.array([0, 1, 1, 1, 1]), 0)
assert_true(np.isnan(A))

@MechCoder
Copy link
Member

Would be great if you could hold on till we clarify that the issue is actually an issue.

@amueller
Copy link
Member

it is according to the definition on wikipedia, right? (at least when I last thought about that ;)

@Sentient07
Copy link
Contributor Author

There's a merge conflict, which I realized just now. Should i resolve it or wait and do it when it's ready to be merged ? @MechCoder @amueller @rvraghav93

@raghavrv
Copy link
Member

Resolving merge conflicts as soon as you encounter one saves a lot of headache later on, (say if you happen to accumulate more merge conflicts, it gets harder to fix them).

@Sentient07
Copy link
Contributor Author

@rvraghav93 there are some changes that has been made in the master, which i don't understand why it's been made. Like
from sklearn.utils.testing import assert_false, assert_almost_equal
has been changed to

from sklearn.utils.testing import assert_false
from sklearn.utils.testing import assert_almost_equal

I'd go ahead with those changes ?
Edit1: have pinged you in the commit that made this change

@GaelVaroquaux
Copy link
Member

You should follow the changes in master.

@Sentient07
Copy link
Contributor Author

@rvraghav93 sorry for the delay,fixed merge conflicts. Shall i amend my commits ?
@GaelVaroquaux Yeah, now it's in agreement with master

@@ -85,4 +88,4 @@ def test_non_numpy_labels():
X = dataset.data
y = dataset.target
assert_equal(
silhouette_score(list(X), list(y)), silhouette_score(X, y))
silhouette_score(list(X), list(y)), silhouette_score(X, y))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line at the end

reverted the comment

Resolved merge conflicts
@MechCoder
Copy link
Member

Thinking about it again, I would suggest we just raise an error for this case

@jnothman
Copy link
Member

I have justified at f0f174b#commitcomment-18600086 why I think setting a sample score to 0 for a sample clustered alone is reasonable. I do not think an error is appropriate.

@amueller
Copy link
Member

fixed in #7438

@amueller amueller closed this Oct 25, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

regression in silhouette score
6 participants