KMeans processing n_init sequentially!! #23366

@ShivKJ

Hi,

I was looking into the KMeans code and noticed that the loop below could be parallelized: each iteration of the for loop (one per n_init run) is independent of the others, so the runs could be processed concurrently. I expect this would reduce the runtime. Please check.

for i in range(self._n_init):
    # Initialize centers
    centers_init = self._init_centroids(
        X, x_squared_norms=x_squared_norms, init=init, random_state=random_state
    )
    if self.verbose:
        print("Initialization complete")
    # run a k-means once
    labels, inertia, centers, n_iter_ = kmeans_single(
        X,
        sample_weight,
        centers_init,
        max_iter=self.max_iter,
        verbose=self.verbose,
        tol=self._tol,
        x_squared_norms=x_squared_norms,
        n_threads=self._n_threads,
    )
    # determine if these results are the best so far
    # we chose a new run if it has a better inertia and the clustering is
    # different from the best so far (it's possible that the inertia is
    # slightly better even if the clustering is the same with potentially
    # permuted labels, due to rounding errors)
    if best_inertia is None or (
        inertia < best_inertia
        and not _is_same_clustering(labels, best_labels, self.n_clusters)
    ):
        best_labels = labels
        best_centers = centers
        best_inertia = inertia
        best_n_iter = n_iter_
