Skip to content

[MRG] support sample_weight in silhouette_score #4087

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jnothman
Copy link
Member

I sought sample_weight in silhouette_score, to account for multiple points that are merged into one when calculating average distances. Hacking it into the current implementation resulted in a very slow solution. Thus this PR also rewrites the implementation, yielding something that's a bit slower than the solution at master, but supports sample_weight.

I've also added tests for correctness which I haven't otherwise found in the code.

@jnothman jnothman force-pushed the silhouette branch 2 times, most recently from ccf6cf9 to e56d3af Compare January 12, 2015 13:52
silhouette_score, X, y)


def test_paper_example():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe add the reference Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis again?

@amueller
Copy link
Member

LGTM apart from an explanation of the strategy and possibly a reference.

@jnothman
Copy link
Member Author

jnothman commented Jun 3, 2015

Btw, I don't know its value as a reference, but http://cs.au.dk/~simina/weighted.pdf intuits the notion of weighted clustering as I did (last para section 4), but I've not yet read on to see how they extend this to the real-valued case

@amueller
Copy link
Member

amueller commented Jun 3, 2015

needs a rebase btw.

@jnothman
Copy link
Member Author

jnothman commented Jun 6, 2015

rebased

distances = pairwise_distances(X, metric=metric, **kwds)
if sample_weight is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be viewed like,

for a given sample, if another sample belonging to the same cluster has a very high sample weight and is far away, it should reduce the silhouette score more
right?
And vice versa.

@MechCoder
Copy link
Member

You need to Rebase.

Again.

@@ -68,3 +68,64 @@ def test_correct_labelsize():
'Number of labels is %d\. Valid values are 2 '
'to n_samples - 1 \(inclusive\)' % len(np.unique(y)),
silhouette_score, X, y)


def test_paper_example():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_paper_example is not a great name for a test :P . Would be great to mention the name of the paper.

@haiatn
Copy link
Contributor

haiatn commented Aug 26, 2023

Is it possible #11135 should have closed this? I see the current version has D_chunk that calculates sample_weights, although I must say I am familiar with the algorithm. I also see the paper example test is in the main branch

@adrinjalali
Copy link
Member

@jnothman would you be able to give this a fresh update?

If not, @StefanieSenger would you be able to take this over and update per our codebase these days?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants