Skip to content

[MRG+1] MAINT: Refactor and speed up silhoutte_score (samples) #6151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

MechCoder
Copy link
Member

A refactor of metrics.silhouette_score. Able to get at least 2 times speedup in most cases.

 n_samples=10000, n_labels=6
 In this branch: 13.4s, in master:23.2s
 n_samples=10000, n_labels=4
 In this branch: 14.4s, in master:23.9s

 n_samples=1000, n_labels=6
 In this branch: 115ms, in master:287ms
 n_samples=1000, n_labels=4
 In this branch: 110ms, in master:249ms

 n_samples=100, n_labels=6
 In this branch: 1.87ms, in master:7.13ms
 n_samples=100, n_labels=4
 In this branch: 1.54ms, in master:5.84ms

Also fixed a bug related to non-encoded labels.

@MechCoder MechCoder force-pushed the silhouette_score_refactor branch from ff2f719 to 76b1b3d Compare January 10, 2016 07:33
@GaelVaroquaux
Copy link
Member

I think that this code would greatly benefit from a few comments, and maybe calling 'A' and 'B' with more explicite names. Right now, it is very hard to follow. Harder than the code it replaces.

@MechCoder
Copy link
Member Author

@GaelVaroquaux I have added comments and replaced a few variable names. Can you tell me if it still harder to follow?

@MechCoder MechCoder force-pushed the silhouette_score_refactor branch from ec3ce2f to 7bb62bd Compare January 10, 2016 15:51
@MechCoder
Copy link
Member Author

In general, I think the tests can be made more stronger but I'll leave that for another PR

B = np.array([_nearest_cluster_distance(distances[i], labels, i)
for i in range(n)])
sil_samples = (B - A) / np.maximum(A, B)
n_labels = labels.shape[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is n_samples, not n_labels

@MechCoder
Copy link
Member Author

@TomDLT looks ok now?

@MechCoder
Copy link
Member Author

@agramfort if you have the time :-)

@TomDLT
Copy link
Member

TomDLT commented Jan 13, 2016

LGTM, you can squash

@TomDLT TomDLT changed the title [MRG] MAINT: Refactor and speed up silhoutte_score (samples) [MRG+1] MAINT: Refactor and speed up silhoutte_score (samples) Jan 13, 2016
@MechCoder MechCoder force-pushed the silhouette_score_refactor branch from f91abff to 11f9b08 Compare January 13, 2016 14:07
@amueller
Copy link
Member

needs rebase ;)

@amueller
Copy link
Member

how does it scale to very large n_labels?

@MechCoder
Copy link
Member Author

It scales pretty well ! :-) (best of 3)

For n_samples=10000, n_labels=100
In this branch: 12.3s, In master: 28s
For n_samples=1000, n_labels=100
In this branch: 277ms, In master: 992ms

Do I haz your +1 as well?

@MechCoder MechCoder force-pushed the silhouette_score_refactor branch from 11f9b08 to c7ce0ab Compare January 15, 2016 21:18
@MechCoder
Copy link
Member Author

rebased

@GaelVaroquaux
Copy link
Member

👍 on my side. This is a good rewrite. It is clear and fast. I am merging this!

GaelVaroquaux added a commit that referenced this pull request Jan 17, 2016
[MRG+1] MAINT: Refactor and speed up silhoutte_score (samples)
@GaelVaroquaux GaelVaroquaux merged commit 769127c into scikit-learn:master Jan 17, 2016
@MechCoder MechCoder deleted the silhouette_score_refactor branch January 17, 2016 15:03
@MechCoder
Copy link
Member Author

thanks for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants