-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Fix regression in silhouette_score for clusters of size 1 #6089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -162,7 +162,7 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds): | |||
B = np.array([_nearest_cluster_distance(distances[i], labels, i) | |||
for i in range(n)]) | |||
sil_samples = (B - A) / np.maximum(A, B) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could also revert that deleted comment Addressed
This looks good to me (apart from the minor nitpick)! Wait for reviews from @MechCoder or @amueller (PS: Could you also squash the two commits? ;) ) |
@@ -48,7 +48,8 @@ def test_no_nan(): | |||
D = np.random.RandomState(0).rand(len(labels), len(labels)) | |||
silhouette = silhouette_score(D, labels, metric='precomputed') | |||
assert_false(np.isnan(silhouette)) | |||
|
|||
silhouette_sample = silhouette_samples([[3],[3]], np.array([2,4])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8 ([2,4]
--> [2, 4]
)
@@ -48,7 +48,8 @@ def test_no_nan(): | |||
D = np.random.RandomState(0).rand(len(labels), len(labels)) | |||
silhouette = silhouette_score(D, labels, metric='precomputed') | |||
assert_false(np.isnan(silhouette)) | |||
|
|||
silhouette_sample = silhouette_samples([[3],[3]], np.array([2,4])) | |||
assert_false(np.isnan(silhouette_sample).any()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not a regression test. It will pass in master.
Can you remove this and add a new test testing something like this
distances = np.random.RandomState(0).rand(5)
A = _intra_cluster_distance(distance, np.array([0, 1, 1, 1, 1]), 0)
assert_true(np.isnan(A))
Would be great if you could hold on till we clarify that the issue is actually an issue. |
it is according to the definition on wikipedia, right? (at least when I last thought about that ;) |
There's a merge conflict, which I realized just now. Should i resolve it or wait and do it when it's ready to be merged ? @MechCoder @amueller @rvraghav93 |
Resolving merge conflicts as soon as you encounter one saves a lot of headache later on, (say if you happen to accumulate more merge conflicts, it gets harder to fix them). |
@rvraghav93 there are some changes that has been made in the master, which i don't understand why it's been made. Like
I'd go ahead with those changes ? |
8b212b9
to
0d31aaf
Compare
You should follow the changes in master. |
209e6d9
to
508516d
Compare
@rvraghav93 sorry for the delay,fixed merge conflicts. Shall i amend my commits ? |
@@ -85,4 +88,4 @@ def test_non_numpy_labels(): | |||
X = dataset.data | |||
y = dataset.target | |||
assert_equal( | |||
silhouette_score(list(X), list(y)), silhouette_score(X, y)) | |||
silhouette_score(list(X), list(y)), silhouette_score(X, y)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new line at the end
508516d
to
38cdf05
Compare
reverted the comment Resolved merge conflicts
38cdf05
to
498ccce
Compare
Thinking about it again, I would suggest we just raise an error for this case |
I have justified at f0f174b#commitcomment-18600086 why I think setting a sample score to 0 for a sample clustered alone is reasonable. I do not think an error is appropriate. |
fixed in #7438 |
fix #5988