Fix check_array dtype in MinibatchKMeans.partial_fit #14323


Merged — 4 commits merged into scikit-learn:master on Jul 15, 2019

Conversation

@rth (Member) commented on Jul 13, 2019

The `check_array` call in `MiniBatchKMeans.partial_fit` didn't convert `X` to a float dtype, so when `X` was an int array all calculations were done in int dtype.

Partially addresses #14314

Necessary to make CI pass in #14307
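The failure mode is easy to reproduce outside scikit-learn; a minimal NumPy sketch (hypothetical data, not code from the PR):

```python
import numpy as np

# Two 1-D points whose true center is 1.5.
X = np.array([[1], [2]], dtype=np.int64)

# A center buffer allocated in X's dtype truncates the mean on assignment...
center_int = np.zeros(1, dtype=X.dtype)
center_int[:] = X.mean(axis=0)          # 1.5 is truncated to 1

# ...while converting X to float first preserves it.
center_float = X.astype(np.float64).mean(axis=0)

print(center_int[0], center_float[0])   # 1 1.5
```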

```diff
@@ -1667,7 +1667,8 @@ def partial_fit(self, X, y=None, sample_weight=None):
         """
-        X = check_array(X, accept_sparse="csr", order="C")
+        X = check_array(X, accept_sparse="csr", order="C",
```
A reviewer (Member) commented:

Do we need to make a copy in order to compute the centers in float? It's a bit weird but I guess it makes sense?
We could do sum on the ints and then divide to get float centers but that might be excessive?

@rth (Member Author) replied:

> Do we need to make a copy in order to compute the centers in float? It's a bit weird but I guess it makes sense?

That and the inertia I think.

> We could do sum on the ints and then divide to get float centers but that might be excessive?

The larger problem is that a number of intermediate arrays in KMeans/MiniBatchKMeans are created with `dtype=X.dtype` (presumably to keep float32 / float64 unchanged):

```
$ rg "dtype=X.dtype" sklearn/cluster/k_means_.py
79:    centers = np.empty((n_clusters, n_features), dtype=X.dtype)
171:        return np.ones(n_samples, dtype=X.dtype)
331:        init = check_array(init, dtype=X.dtype.type, copy=True)
533:    distances = np.zeros(shape=(X.shape[0],), dtype=X.dtype)
670:        distances = np.zeros(shape=(0,), dtype=X.dtype)
751:        centers = np.array(init, dtype=X.dtype)
754:        centers = np.asarray(centers, dtype=X.dtype)
1499:            self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
1516:            old_center_buffer = np.zeros(n_features, dtype=X.dtype)
1521:            old_center_buffer = np.zeros(0, dtype=X.dtype)
1523:        distances = np.zeros(self.batch_size, dtype=X.dtype)
1673:            self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
1702:            distances = np.zeros(X.shape[0], dtype=X.dtype)
1706:                         np.zeros(0, dtype=X.dtype), 0,
```

Sure, in each of those we could do dtype conversions as necessary depending on `X.dtype`, but it would add complexity for what is, I think, not that common a use case. I have not benchmarked it, but my guess is that a copy to float would be negligible relative to the total run time, and that's what `MiniBatchKMeans.fit` already does.
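The `dtype=X.dtype` pattern above is what makes integer input dangerous; a small NumPy sketch (hypothetical values, not code from the PR) of how an integer buffer silently corrupts distances:

```python
import numpy as np

X = np.array([[0], [1]], dtype=np.int32)

# A distances buffer allocated in X's dtype, like the lines grep found...
distances = np.zeros(X.shape[0], dtype=X.dtype)

# ...silently truncates fractional squared distances on assignment.
center = np.array([0.5])
distances[:] = ((X - center) ** 2).sum(axis=1)   # true values: 0.25, 0.25
print(distances)                                  # [0 0]
```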

The reviewer (Member) replied:

OK, seems reasonable.

@jeremiedbb jeremiedbb merged commit bf8eff3 into scikit-learn:master Jul 15, 2019
@jeremiedbb (Member) commented:

Thanks @rth !

@rth rth deleted the minibatch-kmeans-int-dtype branch July 15, 2019 11:44