DOC Rework k-means assumptions example #24970


Merged: 26 commits, Dec 5, 2022

Conversation

ArturoAmorQ (Member):

Reference Issues/PRs

Related to #24928.

What does this implement/fix? Explain your changes.

As mentioned in this comment by @glemaitre, this example can benefit from a "tutorialization", especially with regard to adding information on the initialization parameter n_init.

The original narrative shows 3 cases where k-means gives unexpected results and one where it works fine. In this PR I propose first showing 4 cases where it gives such unintuitive results and then suggesting a solution for each case.

Any other comments?

There are probably cleaner or more "pythonic" ways to write the code, so please feel free to comment! :)

Side effect: Implements notebook style as intended in #22406.
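To make the role of n_init concrete, here is a minimal sketch (not code from this PR; the dataset parameters are illustrative) showing that keeping the best of several random initializations can only improve the final inertia:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative toy data: four blobs (parameters are assumptions, not the PR's).
X, _ = make_blobs(n_samples=500, centers=4, random_state=10)

# A single random initialization can get stuck in a poor local minimum ...
km_single = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)

# ... whereas n_init=10 runs ten initializations and keeps the one with the
# lowest inertia (sum of squared distances to the assigned centroids).
km_multi = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)

print(km_single.inertia_, km_multi.inertia_)
```

Since the best of ten runs is at least as good as any single run, `km_multi.inertia_` cannot exceed `km_single.inertia_`.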

- Anisotropically distributed blobs: k-means consists of minimizing samples'
Euclidean distances to the centroid of the cluster they are assigned
to. As a consequence, k-means is more appropriate for clusters that are
normally distributed with a spherical covariance matrix.
Member:

Does the average person on the street know what a "spherical covariance matrix" is? I don't :( Is there a way to say the same thing with simpler words, without it becoming super long?


I imagine a spherical covariance matrix is one you print out and then crumple up the paper into a ball, hence spherical ;)

ArturoAmorQ (Member, Author), Nov 18, 2022:

I am still hesitating about the simplest yet clearest wording. Which one do you like best?

a) "k-means is more appropriate for clusters that are isotropic and normally distributed."

b) "k-means is more appropriate for clusters that are normally distributed and have spherical contours (as opposed to Gaussian distributions with ellipsoidal contours)."

c) "k-means is more appropriate for clusters that are normally distributed and have spherical contours." <- and then explain in the data generation section that we are creating Gaussian distributions with ellipsoidal contours.

d) Other suggestion :)

Member:

Probably (b), because it includes a counter-example, which helps me understand what is meant. Maybe it also makes sense to include a link/reference to the section of the docs about rescaling features before feeding them to KMeans, something like "and this is why it is smart to scale your features"?

ArturoAmorQ (Member, Author):

By addressing this comment by @glemaitre I changed the narrative of the example. Now I introduce the concept of spherical/elliptical Gaussians in the first section, before presenting the scenarios where the concept is required. What do you think?

About rescaling features: it does not seem to have an effect in this particular case, as both features have the same range. Instead, I added a comment on another possible issue: sparse high-dimensional spaces. If we really want to pass a message on the importance of scaling, maybe we can try reviving the discussion in #12282.

Member:

Works for me 👍
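The spherical-vs-elliptical distinction discussed above can be illustrated with a short sketch (an assumption-laden illustration, not the example's actual code): isotropic blobs are stretched by a linear map, after which k-means' spherical-cluster assumption breaks down, while a Gaussian mixture with full covariance matrices can model the elongation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Isotropic blobs, then a linear map that stretches them into ellipses.
# The transformation matrix is illustrative.
X, y = make_blobs(n_samples=1500, random_state=170)
X_aniso = np.dot(X, [[0.6, -0.6], [-0.4, 0.8]])

# k-means implicitly assumes spherical clusters, so it tends to cut the
# stretched blobs along straight, equidistant boundaries.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aniso)

# A Gaussian mixture with full (per-component) covariance matrices can
# represent the ellipsoidal contours directly.
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X_aniso)
```

Plotting `X_aniso` colored by `km_labels` versus `gmm_labels` makes the difference visible; the mixture model typically recovers the stretched blobs that k-means splits incorrectly.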

plt.show()

# %%
# Possible solution
Member:

I like this new section.

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
ArturoAmorQ and others added 2 commits November 23, 2022 17:59
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre (Member) left a comment:

minor typo

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
#
# - Non-optimal number of clusters: in a real setting there is no uniquely
# defined **true** number of clusters. An appropriate number of clusters has
# to be decided from data-based criteria and knowledge of aim.
betatim (Member), Dec 1, 2022:

For the ending "knowledge of aim." - what are you trying to say? It reads like some words got lost?

Should it be something like "knowledge of the goal"? But then "what goal?"

ArturoAmorQ (Member, Author):

I hope it is clearer now.
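Picking the number of clusters "from data-based criteria", as the hunk above puts it, can be sketched with a silhouette analysis (a hedged illustration with assumed toy data, not the example's code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs at explicitly chosen (illustrative) centers.
X, _ = make_blobs(
    n_samples=600, centers=[[0, 0], [5, 5], [0, 5]], cluster_std=0.5, random_state=0
)

# Score a small range of candidate cluster counts; the silhouette score
# rewards tight, well-separated clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On data this cleanly separated, the silhouette criterion picks out the true number of blobs; on real data it is one criterion among several, to be weighed against domain knowledge.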

Co-authored-by: Tim Head <betatim@gmail.com>
# Possible solution
# -----------------
#
# For an example on how to find a correct number of blobs, see
Member:

Can you make the text describing the different solutions have the same order as the code examples below? It is possible to work out which bit of text is meant to go with which code snippet but I think it would be easier if the first solution described in the text was also the first code snippet, and so on.

ArturoAmorQ (Member, Author):

I decided instead to refactor this section for clarity, even if that meant splitting the subplots. Thoughts on that?

Member:

Also sounds good to me

jeremiedbb (Member) left a comment:

This is a great improvement to this example. Thanks @ArturoAmorQ. LGTM

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
@glemaitre glemaitre merged commit cbfb6ab into scikit-learn:main Dec 5, 2022
glemaitre (Member):

Good to go. Thanks @ArturoAmorQ

@ArturoAmorQ ArturoAmorQ deleted the kmeans_assumptions branch December 5, 2022 10:55
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Dec 21, 2022
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre added a commit that referenced this pull request Dec 21, 2022
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
4 participants