[WIP] DOC Description of X missing in KMeans.fit (#7772) #7775

Closed
wants to merge 3 commits into from
6 changes: 6 additions & 0 deletions doc/whats_new.rst
@@ -22,6 +22,12 @@ New features
Enhancements
............

- :class:`cluster.MiniBatchKMeans` and :class:`cluster.KMeans`
  now use significantly less memory when assigning data points to their
  nearest cluster center.
  (`#7721 <https://github.com/scikit-learn/scikit-learn/pull/7721>`_)
  By `Jon Crall`_.

- Added ``classes_`` attribute to :class:`model_selection.GridSearchCV`
  that matches the ``classes_`` attribute of ``best_estimator_``. (`#7661
  <https://github.com/scikit-learn/scikit-learn/pull/7661>`_) by `Alyssa
25 changes: 12 additions & 13 deletions sklearn/cluster/k_means_.py
@@ -18,6 +18,7 @@

from ..base import BaseEstimator, ClusterMixin, TransformerMixin
from ..metrics.pairwise import euclidean_distances
from ..metrics.pairwise import pairwise_distances_argmin_min
from ..utils.extmath import row_norms, squared_norm, stable_cumsum
from ..utils.sparsefuncs_fast import assign_rows_csr
from ..utils.sparsefuncs import mean_variance_axis
@@ -552,17 +553,14 @@ def _labels_inertia_precompute_dense(X, x_squared_norms, centers, distances):

"""
n_samples = X.shape[0]
k = centers.shape[0]
all_distances = euclidean_distances(centers, X, x_squared_norms,
squared=True)
labels = np.empty(n_samples, dtype=np.int32)
labels.fill(-1)
mindist = np.empty(n_samples)
mindist.fill(np.infty)
for center_id in range(k):
dist = all_distances[center_id]
labels[dist < mindist] = center_id
mindist = np.minimum(dist, mindist)

# Breakup nearest neighbor distance computation into batches to prevent
# memory blowup in the case of a large number of samples and clusters.
# TODO: Once PR #7383 is merged use check_inputs=False in metric_kwargs.
labels, mindist = pairwise_distances_argmin_min(
X=X, Y=centers, metric='euclidean', metric_kwargs={'squared': True})
# cython k-means code assumes int32 inputs
labels = labels.astype(np.int32)
if n_samples == distances.shape[0]:
# distances will be changed in-place
distances[:] = mindist
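
For context (not part of the diff), here is a minimal sketch of the call this hunk switches to: ``pairwise_distances_argmin_min`` returns, for each row of ``X``, the index of the nearest row of ``Y`` and the distance to it, so the full (n_clusters, n_samples) distance matrix never has to be held in memory at once. The toy ``X`` and ``centers`` arrays are made up for illustration, and the ``metric_kwargs={'squared': True}`` argument simply mirrors the call in the diff for the scikit-learn version this PR targets.

    # Illustrative only; mirrors the new call in _labels_inertia_precompute_dense.
    import numpy as np
    from sklearn.metrics import pairwise_distances_argmin_min

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 5)       # 1000 samples, 5 features (made up)
    centers = rng.rand(8, 5)    # 8 cluster centers (made up)

    # For each sample: index of nearest center and distance to it,
    # computed in batches internally instead of one big distance matrix.
    labels, mindist = pairwise_distances_argmin_min(
        X, centers, metric='euclidean', metric_kwargs={'squared': True})
    labels = labels.astype(np.int32)   # the Cython k-means helpers expect int32

    print(labels.shape, mindist.shape)  # (1000,) (1000,)
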
@@ -876,6 +874,7 @@ def fit(self, X, y=None):
        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Training instances to cluster.
        """
        random_state = check_random_state(self.random_state)
        X = self._check_fit_data(X)
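
As a quick illustration of the documented input contract (not part of the diff), ``fit`` accepts either a dense array-like or a sparse matrix of shape (n_samples, n_features). The toy data below is made up.

    # Illustrative only; shows dense and sparse training instances.
    import numpy as np
    import scipy.sparse as sp
    from sklearn.cluster import KMeans, MiniBatchKMeans

    X_dense = np.random.RandomState(0).rand(100, 4)
    X_sparse = sp.csr_matrix(X_dense)       # sparse CSR training instances

    km = KMeans(n_clusters=3, random_state=0).fit(X_dense)
    mbkm = MiniBatchKMeans(n_clusters=3, random_state=0).fit(X_sparse)
    print(km.labels_.shape, mbkm.labels_.shape)  # (100,) (100,)
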
@@ -1310,8 +1309,8 @@ def fit(self, X, y=None):

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Coordinates of the data points to cluster
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Training instances to cluster.
Reviewer comment (Member): For now we can have it as

    Training data to cluster.

A more general docstring, standardized across the codebase, will be addressed later by fixing #3791.

"""
random_state = check_random_state(self.random_state)
X = check_array(X, accept_sparse="csr", order='C',