Fix check_array dtype in MinibatchKMeans.partial_fit #14323


Merged — 4 commits merged into scikit-learn:master on Jul 15, 2019

Conversation

@rth (Member) commented on Jul 13, 2019

The `check_array` call in `MiniBatchKMeans.partial_fit` didn't convert `X` to a float dtype, so when `X` was an int array all calculations were done in int dtype.

Partially addresses #14314

Necessary to make CI pass in #14307
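The failure mode is easy to reproduce outside scikit-learn; a minimal NumPy sketch (hypothetical data, not code from the PR):

```python
import numpy as np

# Two 1-D points whose true center is 1.5.
X = np.array([[1], [2]], dtype=np.int64)

# A center buffer allocated in X's dtype truncates the mean on assignment...
center_int = np.zeros(1, dtype=X.dtype)
center_int[:] = X.mean(axis=0)          # 1.5 is truncated to 1

# ...while converting X to float first preserves it.
center_float = X.astype(np.float64).mean(axis=0)

print(center_int[0], center_float[0])   # 1 1.5
```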

```diff
@@ -1667,7 +1667,8 @@ def partial_fit(self, X, y=None, sample_weight=None):
         """
-        X = check_array(X, accept_sparse="csr", order="C")
+        X = check_array(X, accept_sparse="csr", order="C",
```
A reviewer (Member) commented:

Do we need to make a copy in order to compute the centers in float? It's a bit weird but I guess it makes sense?
We could do sum on the ints and then divide to get float centers but that might be excessive?

@rth (Member Author) replied:

> Do we need to make a copy in order to compute the centers in float? It's a bit weird but I guess it makes sense?

That and the inertia I think.

> We could do sum on the ints and then divide to get float centers but that might be excessive?

The larger problem is that a number of intermediate arrays in KMeans/MiniBatchKMeans are created with `dtype=X.dtype` (presumably to keep float32 / float64 unchanged):

```
$ rg "dtype=X.dtype" sklearn/cluster/k_means_.py
79:    centers = np.empty((n_clusters, n_features), dtype=X.dtype)
171:        return np.ones(n_samples, dtype=X.dtype)
331:        init = check_array(init, dtype=X.dtype.type, copy=True)
533:    distances = np.zeros(shape=(X.shape[0],), dtype=X.dtype)
670:        distances = np.zeros(shape=(0,), dtype=X.dtype)
751:        centers = np.array(init, dtype=X.dtype)
754:        centers = np.asarray(centers, dtype=X.dtype)
1499:            self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
1516:            old_center_buffer = np.zeros(n_features, dtype=X.dtype)
1521:            old_center_buffer = np.zeros(0, dtype=X.dtype)
1523:        distances = np.zeros(self.batch_size, dtype=X.dtype)
1673:            self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
1702:            distances = np.zeros(X.shape[0], dtype=X.dtype)
1706:                         np.zeros(0, dtype=X.dtype), 0,
```

Sure, in each of those we could do dtype conversions as necessary depending on `X.dtype`, but it would add complexity for what is, I think, not that common a use case. I have not benchmarked it, but my guess is that a copy to float would be negligible relative to the total run time, and that's what `MiniBatchKMeans.fit` already does.
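The `dtype=X.dtype` pattern above is what makes integer input dangerous; a small NumPy sketch (hypothetical values, not code from the PR) of how an integer buffer silently corrupts distances:

```python
import numpy as np

X = np.array([[0], [1]], dtype=np.int32)

# A distances buffer allocated in X's dtype, like the lines grep found...
distances = np.zeros(X.shape[0], dtype=X.dtype)

# ...silently truncates fractional squared distances on assignment.
center = np.array([0.5])
distances[:] = ((X - center) ** 2).sum(axis=1)   # true values: 0.25, 0.25
print(distances)                                  # [0 0]
```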

The reviewer (Member) replied:

OK, seems reasonable.

@jeremiedbb jeremiedbb merged commit bf8eff3 into scikit-learn:master Jul 15, 2019
@jeremiedbb (Member) commented:

Thanks @rth !

@rth rth deleted the minibatch-kmeans-int-dtype branch July 15, 2019 11:44