[MRG] Choose number of clusters #4301
Changes from all commits: 30feff4, 42e1bec, f8d8114, 60edec2, 55bf872, ba980ce, 6320126, c7237d3
@@ -0,0 +1,102 @@
"""
============================================
Selecting number of clusters on toy datasets
============================================

This example compares several algorithms for choosing the number of
clusters, for a particular clustering algorithm on a particular dataset.
It mainly illustrates that some algorithms are faster, that some only
handle convex clusters (first dataset), and that some handle non-convex
clusters (second and third datasets).

The running times only give an intuition of which algorithm is faster;
running time depends heavily on a dataset's number of samples and number
of features.
"""
print(__doc__)

import time
from operator import itemgetter

import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster.spectral import SpectralClustering
from sklearn.datasets import make_blobs, make_moons, make_circles
from sklearn.metrics.cluster.calinski_harabaz_index import max_CH_index
from sklearn.metrics.cluster.stability import stability
from sklearn.metrics.cluster.distortion_jump import distortion_jump
from sklearn.metrics.cluster.gap_statistic import gap_statistic
from sklearn.metrics.cluster.unsupervised import silhouette_score
from sklearn.preprocessing import StandardScaler

n_samples = 1500
seed = 1
datasets = [
    make_blobs(n_samples=n_samples, random_state=seed),
    make_circles(n_samples=n_samples, factor=.5, noise=.05, shuffle=True,
                 random_state=seed),
    make_moons(n_samples=n_samples, noise=.05, shuffle=True,
               random_state=seed),
]

cluster_estimator = SpectralClustering(eigen_solver='arpack',
                                       affinity="nearest_neighbors")


def max_silhouette(X, cluster_estimator, k_max=None):
    # pick the k in [2, k_max] that maximizes the silhouette score
    if not k_max:
        k_max = int(X.shape[0] / 2)
    silhouettes = []
    for k in range(2, k_max + 1):
        cluster_estimator.set_params(n_clusters=k)
        labels = cluster_estimator.fit_predict(X)
        silhouettes.append((k, silhouette_score(X, labels)))
    return max(silhouettes, key=itemgetter(1))[0]


colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
nb_colors = len(colors)

plt.figure(figsize=(13, 9.5))
plt.subplots_adjust(left=.001, right=.999, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

plot_num = 1
printed_header = False

for dataset in datasets:
    X, true_labels = dataset
    # normalize dataset for nicer plotting
    X = StandardScaler().fit_transform(X)

    for name, func_choose_nb_cluster in {
            'Silhouette': max_silhouette,
            'Stability': stability,
            'Gap statistic': gap_statistic,
            'Calinski-Harabasz index': max_CH_index,
            'Distortion jump': distortion_jump,
    }.items():
        # predict cluster memberships
        t0 = time.time()
        nb_cluster = func_choose_nb_cluster(X, cluster_estimator, k_max=10)
        t1 = time.time()

        # retrieve the clustering done for the chosen number of clusters
        cluster_estimator.set_params(n_clusters=nb_cluster)
        y_pred = cluster_estimator.fit_predict(X)

        # plot
        plt.subplot(3, 5, plot_num)
Reviewer: Why create [comment truncated]

Author: Actually, it is not i_dataset + 1. It's i_dataset + i_algorithm + 1. I'd rather have an explicit variable than two counters and a formula. Besides, following your previous comment, I removed i_dataset.
        if not printed_header:
            plt.title(name, size=18)
        points_color = [colors[y % nb_colors] for y in y_pred]
        plt.scatter(X[:, 0], X[:, 1], color=points_color, s=10)

        plt.xlim(-2, 2)
        plt.ylim(-2, 2)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1
Reviewer: Useless.

Author: Another copy from plot_cluster_comparison. Which file should I look at for better plotting code?
    printed_header = True

plt.show()
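For reference, the silhouette criterion used above can be reproduced with stock scikit-learn alone. This is a minimal sketch, not the PR's code: it swaps the example's SpectralClustering for KMeans to stay self-contained, and `scores` and `best_k` are names introduced here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# score each candidate number of clusters and keep the best one
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)
```

On well-separated blobs like these, the silhouette criterion typically recovers the generating number of clusters.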
@@ -0,0 +1,22 @@
import numpy as np


def adjacency_matrix(cluster_assignement):
Reviewer: this is already implemented in [comment truncated]

Author: I did not find the adjacency matrix in metrics.cluster.supervised, in the contingency matrix or in other functions. adjusted_rand_score makes some calculations related to the Fowlkes-Mallows index, for which we use the adjacency matrix, but I did not find a way to remove the "adjacency matrix" code.

Reviewer: Yes, these are different things.
    """
    Parameters
    ----------
    cluster_assignement : vector (n_samples,) of int i, 0 <= i < k

    Returns
    -------
    adj_matrix : matrix (n_samples, n_samples)
        adj_matrix[i, j] = cluster_assignement[i] == cluster_assignement[j]
    """
    n_samples = len(cluster_assignement)
    adj_matrix = np.zeros((n_samples, n_samples))
Reviewer: It looks like this will require O(n^2) memory, O(n^2) time. This could be improved significantly if a sparse matrix were used.

Reviewer: As long as we materialise the matrix in memory (even sparsely), this needs in worst-case [comment truncated]

    # get an array of member samples for each cluster:
    # [might be just as good to do this with a defaultdict(list)]
    cluster_sizes = np.bincount(y)
    cluster_members = np.split(np.argsort(y), np.cumsum(cluster_sizes))
    lil = np.take(cluster_members, y)
    indices = np.hstack(lil)
    indptr = np.cumsum(np.hstack([0, np.take(cluster_sizes, y)]))
    out = csr_matrix((np.ones_like(indices), indices, indptr))

Another sparse construction with the same condition on y: [comment truncated]
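A runnable variant of the sparse construction sketched in the review. This is an assumption-laden sketch, not code from the PR: it assumes labels are integers 0..k-1, and `sparse_adjacency` is a name made up here.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_adjacency(y):
    # hypothetical helper: same-cluster adjacency as a CSR matrix
    y = np.asarray(y)
    sizes = np.bincount(y)
    order = np.argsort(y, kind="stable")
    # members[c] lists the samples belonging to cluster c
    members = np.split(order, np.cumsum(sizes)[:-1])
    # row i holds ones at the members of i's own cluster
    indices = np.concatenate([members[c] for c in y])
    indptr = np.concatenate(([0], np.cumsum(sizes[y])))
    data = np.ones(len(indices))
    return csr_matrix((data, indices, indptr), shape=(len(y), len(y)))
```

Memory is O(sum of squared cluster sizes), which only beats the dense version when clusters are numerous and small, as the reviewer notes.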
    for i, val in enumerate(cluster_assignement):
        for j in range(i, n_samples):
            linked = val == cluster_assignement[j]
            adj_matrix[i, j] = linked
            adj_matrix[j, i] = linked
    return adj_matrix
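The double loop above can also be written as a single NumPy broadcast. A sketch of an equivalent function (the name `adjacency_matrix_fast` is made up here; behavior matches the loop version):

```python
import numpy as np

def adjacency_matrix_fast(cluster_assignement):
    # adj[i, j] == 1 iff samples i and j share a cluster label
    y = np.asarray(cluster_assignement)
    return (y[:, None] == y[None, :]).astype(float)

print(adjacency_matrix_fast([0, 1, 0]))
```

Same O(n^2) memory as the loop version, but the comparison runs in C rather than Python.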
Reviewer: Useless.

Author: Copied from plot_cluster_comparison. Which file should I look at to have clear figure-with-subplots code?