Closed
Labels: Needs Triage
Description
Hi,
I was looking into the KMeans code and found that the loop below could be parallelized: each iteration of the for loop (one k-means run per initialization) can be processed independently of the others. I expect this to reduce the runtime. Please check.
scikit-learn/sklearn/cluster/_kmeans.py
Lines 1406 to 1438 in 84f8409
for i in range(self._n_init):
    # Initialize centers
    centers_init = self._init_centroids(
        X, x_squared_norms=x_squared_norms, init=init, random_state=random_state
    )
    if self.verbose:
        print("Initialization complete")
    # run a k-means once
    labels, inertia, centers, n_iter_ = kmeans_single(
        X,
        sample_weight,
        centers_init,
        max_iter=self.max_iter,
        verbose=self.verbose,
        tol=self._tol,
        x_squared_norms=x_squared_norms,
        n_threads=self._n_threads,
    )
    # determine if these results are the best so far
    # we chose a new run if it has a better inertia and the clustering is
    # different from the best so far (it's possible that the inertia is
    # slightly better even if the clustering is the same with potentially
    # permuted labels, due to rounding errors)
    if best_inertia is None or (
        inertia < best_inertia
        and not _is_same_clustering(labels, best_labels, self.n_clusters)
    ):
        best_labels = labels
        best_centers = centers
        best_inertia = inertia
        best_n_iter = n_iter_
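To make the proposal concrete, here is a rough sketch of what "one task per initialization" could look like. It is not scikit-learn code: `kmeans_single_run` and `kmeans_best_of` are hypothetical helpers written in pure Python, the thread pool is only a stand-in for whatever backend would actually be used, and the selection step keeps the lowest inertia without the `_is_same_clustering` check from the snippet above. Seeds are drawn up front so the result does not depend on which run finishes first.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def _sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_single_run(X, n_clusters, seed, max_iter=100):
    """Hypothetical helper: one Lloyd run from a random initialization.

    Returns (inertia, labels, centers).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(X, n_clusters)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(n_clusters), key=lambda k: _sq_dist(p, centers[k]))
            for p in X
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(X, labels) if lab == k]
            if members:  # keep the old center if the cluster is empty
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    inertia = sum(_sq_dist(p, centers[lab]) for p, lab in zip(X, labels))
    return inertia, labels, centers


def kmeans_best_of(X, n_clusters, n_init=10, seed=0, max_workers=4):
    """Hypothetical helper: run n_init independent restarts, keep the best."""
    # Draw one seed per run before dispatching, so results are
    # reproducible regardless of scheduling order.
    master = random.Random(seed)
    seeds = [master.randrange(2**32) for _ in range(n_init)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(
            pool.map(lambda s: kmeans_single_run(X, n_clusters, s), seeds)
        )
    # Simplified selection: lowest inertia wins (first one on ties).
    return min(results, key=lambda r: r[0])


X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_inertia, best_labels, best_centers = kmeans_best_of(X, n_clusters=2)
```

Two caveats about this sketch. In the original loop every run draws from the same `random_state` in sequence, so a parallel version has to pre-generate per-run seeds (as above) to stay reproducible. Also, `kmeans_single` already receives `n_threads=self._n_threads`, so each run may be multi-threaded internally; running several such runs concurrently could oversubscribe the cores rather than reduce the runtime, and that would need to be measured.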