-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] support sample_weight in silhouette_score #4087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ccf6cf9
to
e56d3af
Compare
silhouette_score, X, y) | ||
|
||
|
||
def test_paper_example(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you maybe add the reference Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis again?
LGTM apart from an explanation of the strategy and possibly a reference. |
Btw, I don't know its value as a reference, but http://cs.au.dk/~simina/weighted.pdf intuits the notion of weighted clustering as I did (last para section 4), but I've not yet read on to see how they extend this to the real-valued case |
needs a rebase btw. |
rebased |
distances = pairwise_distances(X, metric=metric, **kwds) | ||
if sample_weight is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also be viewed like,
for a given sample, if another sample belonging to the same cluster has a very high sample weight and is far away, it should reduce the silhouette score more
right?
And vice versa.
You need to Rebase. Again. |
@@ -68,3 +68,64 @@ def test_correct_labelsize(): | |||
'Number of labels is %d\. Valid values are 2 ' | |||
'to n_samples - 1 \(inclusive\)' % len(np.unique(y)), | |||
silhouette_score, X, y) | |||
|
|||
|
|||
def test_paper_example(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_paper_example is not a great name for a test :P . Would be great to mention the name of the paper.
Is it possible #11135 should have closed this? I see the current version has |
@jnothman would you be able to give this a fresh update? If not, @StefanieSenger would you be able to take this over and update per our codebase these days? |
I sought
sample_weight
insilhouette_score
, to account for multiple points that are merged into one when calculating average distances. Hacking it into the current implementation resulted in a very slow solution. Thus this PR also rewrites the implementation, yielding something that's a bit slower than the solution at master, but supportssample_weight
.I've also added tests for correctness which I haven't otherwise found in the code.