Fix check_array dtype in MinibatchKMeans.partial_fit #14323

Merged
merged 4 commits on Jul 15, 2019
7 changes: 4 additions & 3 deletions sklearn/cluster/k_means_.py
@@ -1422,8 +1422,8 @@ class MiniBatchKMeans(KMeans):
>>> kmeans = kmeans.partial_fit(X[0:6,:])
>>> kmeans = kmeans.partial_fit(X[6:12,:])
>>> kmeans.cluster_centers_
-array([[1, 1],
-       [3, 4]])
+array([[2. , 1. ],
+       [3.5, 4.5]])
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> # fit on the whole data
@@ -1667,7 +1667,8 @@ def partial_fit(self, X, y=None, sample_weight=None):

"""

-        X = check_array(X, accept_sparse="csr", order="C")
+        X = check_array(X, accept_sparse="csr", order="C",

Member:

Do we need to make a copy in order to compute the centers in float? It's a bit weird but I guess it makes sense?
We could do sum on the ints and then divide to get float centers but that might be excessive?
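
That integer-sum alternative could look something like this (a hypothetical sketch of the reviewer's idea, not what the PR implements; all names here are illustrative):

```python
import numpy as np

# Accumulate per-cluster sums on the raw int data, then divide by the
# cluster counts to obtain float centers without copying X to float first.
X = np.array([[1, 2], [1, 0], [4, 4], [4, 5]])   # integer input
labels = np.array([0, 0, 1, 1])                  # assumed cluster assignments
n_clusters = 2

sums = np.zeros((n_clusters, X.shape[1]), dtype=np.int64)
np.add.at(sums, labels, X)                       # integer accumulation
counts = np.bincount(labels, minlength=n_clusters)
centers = sums / counts[:, None]                 # true division -> float64

print(centers)  # [[1.  1. ]
                #  [4.  4.5]]
```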


Member Author:

> Do we need to make a copy in order to compute the centers in float? It's a bit weird but I guess it makes sense?

That and the inertia I think.

> We could do sum on the ints and then divide to get float centers but that might be excessive?

The larger problem is that a number of intermediary arrays are created in KMeans/MiniBatchKMeans with dtype=X.dtype (presumably to keep float32 / float64 unchanged):

$ rg "dtype=X.dtype" sklearn/cluster/k_means_.py
79:    centers = np.empty((n_clusters, n_features), dtype=X.dtype)
171:        return np.ones(n_samples, dtype=X.dtype)
331:        init = check_array(init, dtype=X.dtype.type, copy=True)
533:    distances = np.zeros(shape=(X.shape[0],), dtype=X.dtype)
670:        distances = np.zeros(shape=(0,), dtype=X.dtype)
751:        centers = np.array(init, dtype=X.dtype)
754:        centers = np.asarray(centers, dtype=X.dtype)
1499:            self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
1516:            old_center_buffer = np.zeros(n_features, dtype=X.dtype)
1521:            old_center_buffer = np.zeros(0, dtype=X.dtype)
1523:        distances = np.zeros(self.batch_size, dtype=X.dtype)
1673:            self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
1702:            distances = np.zeros(X.shape[0], dtype=X.dtype)
1706:                         np.zeros(0, dtype=X.dtype), 0,

Sure, in each of those we could do dtype conversions as necessary, depending on X.dtype, but it would add complexity for what is, I think, not that common a use case. I have not benchmarked it, but my guess is that a copy to float would be negligible with respect to the total run time, and that's what MiniBatchKMeans.fit does already.
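
For reference, a dtype list passed to check_array leaves input whose dtype is already in the list untouched and converts everything else to the first entry, so integer input gets copied to float64 while float32/float64 pass through unchanged. A quick sketch (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.utils import check_array

# Integer input: not in the dtype list, so it is copied to float64
# (the first entry), exactly as the patched partial_fit now does.
X_int = np.array([[0, 0], [4, 4]])
X_checked = check_array(X_int, accept_sparse="csr", order="C",
                        dtype=[np.float64, np.float32])
print(X_checked.dtype)  # float64

# float32 input is already in the list, so it stays float32 (no copy to float64).
X32 = np.ones((2, 2), dtype=np.float32)
print(check_array(X32, dtype=[np.float64, np.float32]).dtype)  # float32
```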


Member:

ok, seems reasonable.

+                        dtype=[np.float64, np.float32])
n_samples, n_features = X.shape
if hasattr(self.init, '__array__'):
self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
8 changes: 8 additions & 0 deletions sklearn/cluster/tests/test_k_means.py
@@ -942,3 +942,11 @@ def test_k_means_empty_cluster_relocated():

assert len(set(km.labels_)) == 2
assert_allclose(km.cluster_centers_, [[-1], [1]])


+def test_minibatch_kmeans_partial_fit_int_data():
+    # Issue GH #14314
+    X = np.array([[-1], [1]], dtype=np.int)
+    km = MiniBatchKMeans(n_clusters=2)
+    km.partial_fit(X)
+    assert km.cluster_centers_.dtype.kind == "f"
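
With the fix in place, partial_fit on integer data yields float centers. A minimal end-to-end check (the exact center values depend on the minibatch sampling, so only the dtype is asserted; assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Integer input: before this fix, cluster_centers_ kept the int dtype
# and the incremental center updates were silently truncated.
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
km = MiniBatchKMeans(n_clusters=2, random_state=0)
km.partial_fit(X[:3])
km.partial_fit(X[3:])
print(km.cluster_centers_.dtype)  # float64
```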