Commit 056864d

DOC Add links to KMeans examples in docstrings and the user guide (scikit-learn#27799)
1 parent 34061da commit 056864d

5 files changed: +48 -14 lines


doc/modules/clustering.rst

Lines changed: 22 additions & 6 deletions
@@ -182,6 +182,10 @@ It suffers from various drawbacks:
     :align: center
     :scale: 50

+For more detailed descriptions of the issues shown above and how to address them,
+refer to the examples :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_assumptions.py`
+and :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.
+
 K-means is often referred to as Lloyd's algorithm. In basic terms, the
 algorithm has three steps. The first step chooses the initial centroids, with
 the most basic method being to choose :math:`k` samples from the dataset

@@ -218,7 +222,9 @@ initializations of the centroids. One method to help address this issue is the
 k-means++ initialization scheme, which has been implemented in scikit-learn
 (use the ``init='k-means++'`` parameter). This initializes the centroids to be
 (generally) distant from each other, leading to probably better results than
-random initialization, as shown in the reference.
+random initialization, as shown in the reference. For a detailed example of
+comparing different initialization schemes, refer to
+:ref:`sphx_glr_auto_examples_cluster_plot_kmeans_digits.py`.

 K-means++ can also be called independently to select seeds for other
 clustering algorithms, see :func:`sklearn.cluster.kmeans_plusplus` for details

@@ -231,7 +237,17 @@ weight of 2 to a sample is equivalent to adding a duplicate of that sample
 to the dataset :math:`X`.

 K-means can be used for vector quantization. This is achieved using the
-transform method of a trained model of :class:`KMeans`.
+``transform`` method of a trained model of :class:`KMeans`. For an example of
+performing vector quantization on an image, refer to
+:ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`.
+
+.. topic:: Examples:
+
+ * :ref:`sphx_glr_auto_examples_cluster_plot_cluster_iris.py`: Example usage of
+   :class:`KMeans` using the iris dataset
+
+ * :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering
+   using :class:`KMeans` and :class:`MiniBatchKMeans` based on sparse data

 Low-level parallelism
 ---------------------

@@ -291,11 +307,11 @@ small, as shown in the example and cited reference.

 .. topic:: Examples:

- * :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of KMeans and
-   MiniBatchKMeans
+ * :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of
+   :class:`KMeans` and :class:`MiniBatchKMeans`

- * :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering using sparse
-   MiniBatchKMeans
+ * :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering
+   using :class:`KMeans` and :class:`MiniBatchKMeans` based on sparse data

 * :ref:`sphx_glr_auto_examples_cluster_plot_dict_face_patches.py`
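
The user-guide passages touched above refer to the ``init='k-means++'`` parameter,
:func:`sklearn.cluster.kmeans_plusplus`, sample weights, and the ``transform`` method.
A minimal sketch of those APIs on toy data (the data and parameter values below are
illustrative, not taken from the linked examples)::

    import numpy as np
    from sklearn.cluster import KMeans, kmeans_plusplus

    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

    # k-means++ (the default init) picks initial centroids that are spread out.
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

    # kmeans_plusplus can be called on its own to pick seeds for other algorithms.
    centers, indices = kmeans_plusplus(X, n_clusters=3, random_state=0)

    # A sample weight of 2 is equivalent to duplicating that sample in X.
    km_w = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
        X, sample_weight=np.full(len(X), 2.0)
    )

    # Vector quantization: transform() gives each sample's distance to every
    # centroid; the index of the closest centroid is the sample's code.
    distances = km.transform(X)
    codes = km.predict(X)
    print(distances.shape, codes[:10])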

examples/cluster/plot_cluster_iris.py

Lines changed: 4 additions & 5 deletions
@@ -7,13 +7,13 @@

 - top left: What a K-means algorithm would yield using 8 clusters.

-- top right: What the effect of a bad initialization is
+- top right: What using three clusters would deliver.
+
+- bottom left: What the effect of a bad initialization is
   on the classification process: By setting n_init to only 1
   (default is 10), the amount of times that the algorithm will
   be run with different centroid seeds is reduced.

-- bottom left: What using eight clusters would deliver.
-
 - bottom right: The ground truth.

 """

@@ -73,8 +73,7 @@
         horizontalalignment="center",
         bbox=dict(alpha=0.2, edgecolor="w", facecolor="w"),
     )
-# Reorder the labels to have colors matching the cluster results
-y = np.choose(y, [1, 2, 0]).astype(float)
+
 ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor="k")

 ax.xaxis.set_ticklabels([])
examples/cluster/plot_color_quantization.py

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@
 china = load_sample_image("china.jpg")

 # Convert to floats instead of the default 8 bits integer coding. Dividing by
-# 255 is important so that plt.imshow behaves works well on float data (need to
+# 255 is important so that plt.imshow works well on float data (need to
 # be in the range [0-1])
 china = np.array(china, dtype=np.float64) / 255
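
A condensed sketch of the color-quantization technique this example demonstrates
(``n_colors`` and the subsample size below are illustrative choices, not necessarily
the example's exact values)::

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_sample_image
    from sklearn.utils import shuffle

    china = load_sample_image("china.jpg")
    # Scale to floats in [0, 1] so that plt.imshow handles the array correctly.
    china = np.array(china, dtype=np.float64) / 255

    h, w, d = china.shape
    pixels = china.reshape(-1, d)

    # Fit the codebook on a random subsample of pixels, then quantize every pixel.
    n_colors = 64
    sample = shuffle(pixels, random_state=0, n_samples=1_000)
    kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(sample)
    labels = kmeans.predict(pixels)
    quantized = kmeans.cluster_centers_[labels].reshape(h, w, d)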

examples/text/plot_document_clustering.py

Lines changed: 3 additions & 2 deletions
@@ -99,8 +99,9 @@
 # assignment have an ARI of 0.0 in expectation.
 #
 # If the ground truth labels are not known, evaluation can only be performed
-# using the model results itself. In that case, the Silhouette Coefficient comes
-# in handy.
+# using the model results itself. In that case, the Silhouette Coefficient comes in
+# handy. See :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`
+# for an example of how to do it.
 #
 # For more reference, see :ref:`clustering_evaluation`.
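
The two evaluation settings discussed in this comment block can be sketched on toy
blobs rather than the text data the example uses; ``adjusted_rand_score`` needs
ground-truth labels, while ``silhouette_score`` does not::

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # With ground-truth labels available, a supervised metric such as ARI applies.
    print("ARI:", adjusted_rand_score(y_true, labels))

    # Without ground truth, internal metrics such as the Silhouette Coefficient apply.
    print("Silhouette Coefficient:", silhouette_score(X, labels))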

sklearn/cluster/_kmeans.py

Lines changed: 18 additions & 0 deletions
@@ -1208,6 +1208,9 @@ class KMeans(_BaseKMeans):
         The number of clusters to form as well as the number of
         centroids to generate.

+        For an example of how to choose an optimal value for `n_clusters`, refer to
+        :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.
+
     init : {'k-means++', 'random'}, callable or array-like of shape \
             (n_clusters, n_features), default='k-means++'
         Method for initialization:

@@ -1364,6 +1367,21 @@ class KMeans(_BaseKMeans):
     >>> kmeans.cluster_centers_
     array([[10., 2.],
            [ 1., 2.]])
+
+    For a more detailed example of K-Means using the iris dataset, see
+    :ref:`sphx_glr_auto_examples_cluster_plot_cluster_iris.py`.
+
+    For examples of common problems with K-Means and how to address them, see
+    :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_assumptions.py`.
+
+    For an example of how to use K-Means to perform color quantization, see
+    :ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`.
+
+    For a demonstration of how K-Means can be used to cluster text documents, see
+    :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`.
+
+    For a comparison between K-Means and MiniBatchKMeans, refer to the example
+    :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`.
     """

     _parameter_constraints: dict = {
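
The docstring now points to the :class:`KMeans` vs. :class:`MiniBatchKMeans`
comparison example; a rough sketch of that kind of comparison on synthetic blobs
(sizes and parameters below are illustrative)::

    from time import time

    from sklearn.cluster import KMeans, MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=30_000, centers=8, random_state=0)

    t0 = time()
    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
    t_km = time() - t0

    t0 = time()
    mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=10, random_state=0).fit(X)
    t_mbk = time() - t0

    # MiniBatchKMeans usually trades slightly higher inertia for a much shorter fit.
    print(f"KMeans:          {t_km:.2f}s  inertia={km.inertia_:.1f}")
    print(f"MiniBatchKMeans: {t_mbk:.2f}s  inertia={mbk.inertia_:.1f}")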
