DOC Rework k-means assumptions example #24970
Conversation
- Anisotropically distributed blobs: k-means consists of minimizing the samples'
  Euclidean distances to the centroid of the cluster they are assigned
  to. As a consequence, k-means is more appropriate for clusters that are
  normally distributed with a spherical covariance matrix.
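To make the quoted passage concrete, here is a minimal sketch (my own illustration, not code from this PR) of the anisotropic-blobs scenario: blobs generated with `make_blobs` are stretched by a linear transformation, which is exactly the kind of non-spherical data on which plain k-means tends to cut clusters along the wrong boundaries.

```python
# Illustrative sketch (not the PR's actual code): anisotropic blobs
# obtained by linearly transforming isotropic Gaussian blobs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Isotropic blobs: spherical Gaussian clusters.
X, y = make_blobs(n_samples=1500, random_state=170)

# Stretch the data; the clusters now have ellipsoidal contours.
transformation = np.array([[0.6, -0.6], [-0.4, 0.8]])
X_aniso = X @ transformation

# k-means still assumes spherical clusters when assigning points
# to the nearest centroid by Euclidean distance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=170)
labels = kmeans.fit_predict(X_aniso)
```

Plotting `X_aniso` colored by `labels` versus the true `y` would show the mismatch the docstring warns about.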
Does the average person on the street know what a "spherical covariance matrix" is? I don't :( Is there a way to say the same thing with simpler words, without it becoming super long?
I imagine a spherical covariance matrix is one you print out and then crumple up the paper into a ball, hence spherical ;)
I am still hesitating about the simplest yet clearest wording. Which one do you like the best?
a) "k-means is more appropriate for clusters that are isotropic and normally distributed."
b) "k-means is more appropriate for clusters that are normally distributed and have spherical contours (as opposed to Gaussian distributions with ellipsoidal contours)."
c) "k-means is more appropriate for clusters that are normally distributed and have spherical contours." <- and then explain in the data generation section that we are creating Gaussian distributions with ellipsoidal contours.
d) Other suggestion :)
Probably (b), because it includes a counter-example, which helps me understand what is meant. Maybe it makes sense to include a link/reference to the section of the docs about rescaling features before feeding them to KMeans. Something like "and this is why it is smart to scale your features."?
By addressing this comment by @glemaitre I changed the narrative of the example. Now I introduce the concept of spherical/elliptical Gaussians in the first section, before presenting the scenarios where the concept is required. What do you think?
As for rescaling the features, it does not seem to have an effect in this particular case, as both features have the same range. Instead, I added a comment on another possible issue: sparse high-dimensional spaces. If we really want to pass a message on the importance of scaling, maybe we can try reviving the discussion in #12282.
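For readers following the rescaling tangent above, here is a minimal sketch of the idea (my example, not this PR's code): when one feature has a much larger range, it dominates the Euclidean distance, and standardizing the features before k-means restores a sensible geometry.

```python
# Hedged sketch of the rescaling argument (not from the PR):
# standardize features so none dominates the Euclidean distance.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X[:, 1] *= 100  # exaggerate one feature's scale

# StandardScaler brings both features to zero mean and unit variance
# before clustering.
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
```

As noted in the comment, this would not change much in the example under review, since both features there already share the same range.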
Works for me 👍
Co-authored-by: Tim Head <betatim@gmail.com>
plt.show()

# %%
# Possible solution
I like this new section.
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
minor typo
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
#
# - Non-optimal number of clusters: in a real setting there is no uniquely
# defined **true** number of clusters. An appropriate number of clusters has
# to be decided from data-based criteria and knowledge of aim.
For the ending "knowledge of aim." - what are you trying to say? It reads like some words got lost?
Should it be something like "knowledge of the goal"? But then "what goal?"
I hope it is clearer now.
Co-authored-by: Tim Head <betatim@gmail.com>
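As a concrete illustration of the "data-based criteria" mentioned in the quoted diff, here is one common option (my own sketch, not part of the PR): comparing silhouette scores across a small range of candidate cluster counts.

```python
# Hedged sketch (not the PR's code): silhouette analysis as one
# data-based criterion for choosing the number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges over [-1, 1]; higher means tighter, better
    # separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

Domain knowledge of the aim of the analysis should still weigh in alongside any such score.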
# Possible solution
# -----------------
#
# For an example on how to find a correct number of blobs, see
Can you make the text describing the different solutions have the same order as the code examples below? It is possible to work out which bit of text is meant to go with which code snippet but I think it would be easier if the first solution described in the text was also the first code snippet, and so on.
I decided instead to refactor this section for clarity, even if that meant splitting the subplots. Thoughts on that?
Also sounds good to me
Co-authored-by: Tim Head <betatim@gmail.com>
This is a great improvement to this example. Thanks @ArturoAmorQ. LGTM
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Good to go. Thanks @ArturoAmorQ
Co-authored-by: Tim Head <betatim@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
Related to #24928.
What does this implement/fix? Explain your changes.
As mentioned in this comment by @glemaitre, this example can benefit from a "tutorialization", especially with regard to adding information on the `n_init` parameter.
The original narrative shows 3 cases where k-means gives unexpected results and one where it works fine. In this PR I propose first showing 4 cases where it gives such unintuitive results and then suggesting solutions for each case.
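Since the PR description highlights `n_init`, here is a brief sketch of why it matters (my illustration, not the PR's code): k-means is sensitive to centroid initialization, and running several random initializations and keeping the run with the lowest inertia guards against a poor local minimum.

```python
# Hedged sketch (not from the PR): comparing a single initialization
# with the best of several initializations via final inertia.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=6, random_state=1)

# One random initialization: may land in a bad local minimum.
single = KMeans(n_clusters=6, n_init=1, random_state=2).fit(X)

# Ten initializations: scikit-learn keeps the run with the lowest
# inertia (sum of squared distances to the assigned centroids).
multi = KMeans(n_clusters=6, n_init=10, random_state=2).fit(X)

# The best of several runs can only match or improve on one run.
improved_or_equal = single.inertia_ >= multi.inertia_
```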
Any other comments?
There are probably cleaner or more "pythonic" ways to code this, so please feel free to comment! :)
Side effect: Implements notebook style as intended in #22406.