DOC Add links to KMeans examples in docstrings and the user guide #27799

Merged (14 commits, Jan 6, 2024)
28 changes: 22 additions & 6 deletions doc/modules/clustering.rst
@@ -182,6 +182,10 @@ It suffers from various drawbacks:
:align: center
:scale: 50

For more detailed descriptions of the issues shown above and how to address them,
refer to the examples :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_assumptions.py`
and :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.

K-means is often referred to as Lloyd's algorithm. In basic terms, the
algorithm has three steps. The first step chooses the initial centroids, with
the most basic method being to choose :math:`k` samples from the dataset
@@ -218,7 +222,9 @@ initializations of the centroids. One method to help address this issue is the
k-means++ initialization scheme, which has been implemented in scikit-learn
(use the ``init='k-means++'`` parameter). This initializes the centroids to be
(generally) distant from each other, leading to probably better results than
random initialization, as shown in the reference.
random initialization, as shown in the reference. For a detailed example
of comparing different initialization schemes, refer to
:ref:`sphx_glr_auto_examples_cluster_plot_kmeans_digits.py`.
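The `init` parameter mentioned above can be exercised directly. A minimal sketch (not part of the linked example; the blob data is synthetic) comparing the two built-in initialization schemes:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-defined groups.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Fit once with k-means++ seeding and once with purely random seeding;
# inertia_ is the within-cluster sum of squared distances (lower is better).
for init in ("k-means++", "random"):
    km = KMeans(n_clusters=4, init=init, n_init=10, random_state=0).fit(X)
    print(init, km.inertia_)
```

With `n_init=10` both schemes usually land on similar final inertia here; the difference shows up with fewer restarts or harder data.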

K-means++ can also be called independently to select seeds for other
clustering algorithms, see :func:`sklearn.cluster.kmeans_plusplus` for details
@@ -231,7 +237,17 @@ weight of 2 to a sample is equivalent to adding a duplicate of that sample
to the dataset :math:`X`.
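The weight/duplication equivalence stated above can be checked directly. A small sketch (illustrative toy data, not from the PR):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.0, 2.0], [10.0, 10.0]])

# Give the first sample a weight of 2...
km_w = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    X, sample_weight=[2, 1, 1]
)

# ...which should yield the same centroids as duplicating that sample.
X_dup = np.vstack([X[:1], X])
km_d = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_dup)

# Sort rows so cluster-label permutations do not affect the comparison.
print(np.allclose(np.sort(km_w.cluster_centers_, axis=0),
                  np.sort(km_d.cluster_centers_, axis=0)))
```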

K-means can be used for vector quantization. This is achieved using the
transform method of a trained model of :class:`KMeans`.
``transform`` method of a trained model of :class:`KMeans`. For an example of
performing vector quantization on an image, refer to
:ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`.
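A minimal sketch of vector quantization with a fitted :class:`KMeans` (synthetic data, not the linked example): `predict` maps each sample to a codebook index, and `transform` gives the distance to every centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 2)

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Replace every sample with its nearest centroid: an 8-vector codebook.
codes = km.predict(X)
X_quantized = km.cluster_centers_[codes]

# transform returns the distance matrix to all centroids
# (one column per cluster).
distances = km.transform(X)
print(X_quantized.shape, distances.shape)
```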

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_cluster_plot_cluster_iris.py`: Example usage of
:class:`KMeans` using the iris dataset

* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering
using :class:`KMeans` and :class:`MiniBatchKMeans` based on sparse data

Low-level parallelism
---------------------
@@ -291,11 +307,11 @@ small, as shown in the example and cited reference.

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of KMeans and
MiniBatchKMeans
* :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of
:class:`KMeans` and :class:`MiniBatchKMeans`

* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering using sparse
MiniBatchKMeans
* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering
using :class:`KMeans` and :class:`MiniBatchKMeans` based on sparse data

* :ref:`sphx_glr_auto_examples_cluster_plot_dict_face_patches.py`

9 changes: 4 additions & 5 deletions examples/cluster/plot_cluster_iris.py
@@ -7,13 +7,13 @@

- top left: What a K-means algorithm would yield using 8 clusters.

- top right: What the effect of a bad initialization is
- top right: What using three clusters would deliver.

- bottom left: What the effect of a bad initialization is
on the classification process: By setting n_init to only 1
(default is 10), the amount of times that the algorithm will
be run with different centroid seeds is reduced.

Reviewer comment (Member): Maybe this can be done in another PR, but currently it seems that the initialization is good. I would rather pass a fixed random_state to KMeans instead of setting a global np.random.seed.

- bottom left: What using eight clusters would deliver.

- bottom right: The ground truth.

"""
@@ -73,8 +73,7 @@
horizontalalignment="center",
bbox=dict(alpha=0.2, edgecolor="w", facecolor="w"),
)
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)

ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor="k")

ax.xaxis.set_ticklabels([])
2 changes: 1 addition & 1 deletion examples/cluster/plot_color_quantization.py
@@ -41,7 +41,7 @@
china = load_sample_image("china.jpg")

# Convert to floats instead of the default 8 bits integer coding. Dividing by
# 255 is important so that plt.imshow behaves works well on float data (need to
# 255 is important so that plt.imshow works well on float data (need to
# be in the range [0-1])
china = np.array(china, dtype=np.float64) / 255

Reviewer comment (Member): Nice catch!
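The quantization step that follows this rescaling can be sketched in isolation (a standalone toy sketch, not the example's code; a random array stands in for the RGB image already scaled to [0, 1]):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for an RGB image; real code would divide a uint8 image by 255.
rng = np.random.RandomState(0)
image = rng.rand(40, 30, 3)

# Flatten to one row per pixel and fit a small codebook of 16 colors.
w, h, d = image.shape
pixels = image.reshape(-1, d)
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)

# Recode every pixel with its nearest codebook color.
recoded = km.cluster_centers_[km.predict(pixels)].reshape(w, h, d)
print(recoded.shape)
```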

5 changes: 3 additions & 2 deletions examples/text/plot_document_clustering.py
@@ -99,8 +99,9 @@
# assignment have an ARI of 0.0 in expectation.
#
# If the ground truth labels are not known, evaluation can only be performed
# using the model results itself. In that case, the Silhouette Coefficient comes
# in handy.
# using the model results itself. In that case, the Silhouette Coefficient comes in
# handy. See :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`
# for an example on how to do it.
#
# For more reference, see :ref:`clustering_evaluation`.
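Computing the Silhouette Coefficient mentioned above takes one call (a minimal sketch on synthetic blobs, not the document-clustering pipeline itself):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Score is in [-1, 1]; higher means denser, better-separated clusters.
# No ground-truth labels are needed, only the data and the assignment.
score = silhouette_score(X, labels)
print(score)
```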

18 changes: 18 additions & 0 deletions sklearn/cluster/_kmeans.py
@@ -1208,6 +1208,9 @@ class KMeans(_BaseKMeans):
The number of clusters to form as well as the number of
centroids to generate.

For an example of how to choose an optimal value for `n_clusters`, refer to
:ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py`.
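A condensed version of the silhouette-based selection that the linked example performs, on hypothetical well-separated data (a sketch, not the example's code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four clearly separated synthetic clusters.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.5,
    random_state=0,
)

# Score each candidate n_clusters and keep the best one.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On data this cleanly separated the silhouette peak coincides with the true number of clusters; the full example also inspects per-sample silhouette plots, which this sketch omits.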

init : {'k-means++', 'random'}, callable or array-like of shape \
(n_clusters, n_features), default='k-means++'
Method for initialization:
@@ -1364,6 +1367,21 @@ class KMeans(_BaseKMeans):
>>> kmeans.cluster_centers_
array([[10., 2.],
[ 1., 2.]])

For a more detailed example of K-Means using the iris dataset, see
:ref:`sphx_glr_auto_examples_cluster_plot_cluster_iris.py`.

For examples of common problems with K-Means and how to address them, see
:ref:`sphx_glr_auto_examples_cluster_plot_kmeans_assumptions.py`.

For an example of how to use K-Means to perform color quantization, see
:ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`.

For a demonstration of how K-Means can be used to cluster text documents, see
:ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`.

For a comparison between K-Means and MiniBatchKMeans, refer to the example
:ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`.
"""

_parameter_constraints: dict = {