DOC Rework k-means assumptions example #24970
Conversation
- Anisotropically distributed blobs: k-means consists of minimizing the samples'
  Euclidean distances to the centroid of the cluster they are assigned
  to. As a consequence, k-means is more appropriate for clusters that are
  normally distributed with a spherical covariance matrix.
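To make the quoted passage concrete, here is a minimal sketch (my own illustration, not code from this PR) of the anisotropic-blobs scenario: blobs generated with `make_blobs` are stretched by a linear transformation, which is exactly the kind of non-spherical data on which plain k-means tends to cut clusters along the wrong boundaries.

```python
# Illustrative sketch (not the PR's actual code): anisotropic blobs
# obtained by linearly transforming isotropic Gaussian blobs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Isotropic blobs: spherical Gaussian clusters.
X, y = make_blobs(n_samples=1500, random_state=170)

# Stretch the data; the clusters now have ellipsoidal contours.
transformation = np.array([[0.6, -0.6], [-0.4, 0.8]])
X_aniso = X @ transformation

# k-means still assumes spherical clusters when assigning points
# to the nearest centroid by Euclidean distance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=170)
labels = kmeans.fit_predict(X_aniso)
```

Plotting `X_aniso` colored by `labels` versus the true `y` would show the mismatch the docstring warns about.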
Does the average person on the street know what a "spherical covariance matrix" is? I don't :( Is there a way to say the same thing with simpler words, without it becoming super long?
I imagine a spherical covariance matrix is one you print out and then crumple up the paper into a ball, hence spherical ;)
I am still hesitating about the simplest yet clearest wording. Which one do you like the best?
a) "k-means is more appropriate for clusters that are isotropic and normally distributed."
b) "k-means is more appropriate for clusters that are normally distributed and have spherical contours (as opposed to Gaussian distributions with ellipsoidal contours)."
c) "k-means is more appropriate for clusters that are normally distributed and have spherical contours." <- and then explain in the data generation section that we are creating Gaussian distributions with ellipsoidal contours.
d) Other suggestion :)
Probably (b), because it includes a counter-example, which helps me understand what is meant. Maybe it makes sense to include a link/reference to the section of the docs about rescaling features before feeding them to KMeans. Something like "and this is why it is smart to scale your features."?
By addressing this comment by @glemaitre I changed the narrative of the example. Now I introduce the concept of spherical/elliptical Gaussians in the first section, before presenting the scenarios where the concept is required. What do you think?
As for rescaling the features, it does not seem to have an effect in this particular case, as both features have the same range. Instead, I added a comment on another possible issue: sparse high-dimensional spaces. If we really want to pass a message on the importance of scaling, maybe we can try reviving the discussion in #12282.
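For readers following the rescaling tangent above, here is a minimal sketch of the idea (my example, not this PR's code): when one feature has a much larger range, it dominates the Euclidean distance, and standardizing the features before k-means restores a sensible geometry.

```python
# Hedged sketch of the rescaling argument (not from the PR):
# standardize features so none dominates the Euclidean distance.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X[:, 1] *= 100  # exaggerate one feature's scale

# StandardScaler brings both features to zero mean and unit variance
# before clustering.
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
```

As noted in the comment, this would not change much in the example under review, since both features there already share the same range.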
Works for me 👍
Co-authored-by: Tim Head <betatim@gmail.com>
plt.show()

# %%
# Possible solution
I like this new section.
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
minor typo
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
#
# - Non-optimal number of clusters: in a real setting there is no uniquely
# defined **true** number of clusters. An appropriate number of clusters has
# to be decided from data-based criteria and knowledge of aim.
For the ending "knowledge of aim." - what are you trying to say? It reads like some words got lost?
Should it be something like "knowledge of the goal"? But then "what goal?"
I hope it is clearer now.
Co-authored-by: Tim Head <betatim@gmail.com>
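As a concrete illustration of the "data-based criteria" mentioned in the quoted diff, here is one common option (my own sketch, not part of the PR): comparing silhouette scores across a small range of candidate cluster counts.

```python
# Hedged sketch (not the PR's code): silhouette analysis as one
# data-based criterion for choosing the number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges over [-1, 1]; higher means tighter, better
    # separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

Domain knowledge of the aim of the analysis should still weigh in alongside any such score.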
# Possible solution
# -----------------
#
# For an example on how to find a correct number of blobs, see
Can you make the text describing the different solutions have the same order as the code examples below? It is possible to work out which bit of text is meant to go with which code snippet but I think it would be easier if the first solution described in the text was also the first code snippet, and so on.
I decided instead to refactor this section for clarity, even if that meant splitting the subplots. Thoughts on that?
Also sounds good to me
Co-authored-by: Tim Head <betatim@gmail.com>
This is a great improvement to this example. Thanks @ArturoAmorQ. LGTM
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Good to go. Thanks @ArturoAmorQ
Co-authored-by: Tim Head <betatim@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
Related to #24928.
What does this implement/fix? Explain your changes.
As mentioned in this comment by @glemaitre, this example can benefit from a "tutorialization", especially with regard to adding information on the `n_init` parameter.
The original narrative shows 3 cases where k-means gives unexpected results and one where it works fine. In this PR I propose first showing 4 cases where it gives such unintuitive results and then suggesting solutions for each case.
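Since the PR description highlights `n_init`, here is a brief sketch of why it matters (my illustration, not the PR's code): k-means is sensitive to centroid initialization, and running several random initializations and keeping the run with the lowest inertia guards against a poor local minimum.

```python
# Hedged sketch (not from the PR): comparing a single initialization
# with the best of several initializations via final inertia.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=6, random_state=1)

# One random initialization: may land in a bad local minimum.
single = KMeans(n_clusters=6, n_init=1, random_state=2).fit(X)

# Ten initializations: scikit-learn keeps the run with the lowest
# inertia (sum of squared distances to the assigned centroids).
multi = KMeans(n_clusters=6, n_init=10, random_state=2).fit(X)

# The best of several runs can only match or improve on one run.
improved_or_equal = single.inertia_ >= multi.inertia_
```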
Any other comments?
There are probably cleaner or more "pythonic" ways to code this, so please feel free to comment! :)
Side effect: Implements notebook style as intended in #22406.