[MRG+1] Add Davies-Bouldin index #10827
Changes from all commits: 76facb5, ed1325d, a588458, cd52612, 975944e, 3dcf3cb, 7848d87
@@ -9,6 +9,7 @@
from ...utils import check_random_state
from ...utils import check_X_y
from ...utils import safe_indexing
from ..pairwise import pairwise_distances
from ...preprocessing import LabelEncoder

@@ -258,3 +259,57 @@ def calinski_harabaz_score(X, labels):
    return (1. if intra_disp == 0. else
            extra_disp * (n_samples - n_labels) /
            (intra_disp * (n_labels - 1.)))


def davies_bouldin_score(X, labels):
    """Computes the Davies-Bouldin score.

    The score is defined as the ratio of within-cluster distances to
    between-cluster distances.

    Read more in the :ref:`User Guide <davies-bouldin_index>`.

    Parameters
    ----------
    X : array-like, shape (``n_samples``, ``n_features``)
        List of ``n_features``-dimensional data points. Each row corresponds
        to a single data point.

    labels : array-like, shape (``n_samples``,)
        Predicted labels for each sample.

    Returns
    -------
    score : float
        The resulting Davies-Bouldin score.

    References
    ----------
    .. [1] `Davies, David L.; Bouldin, Donald W. (1979).
       "A Cluster Separation Measure". IEEE Transactions on
       Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
Review thread:
Reviewer: Please add an Examples section.
Author: I'm sorry, I can see there are sections like this in other parts of the doc, but I don't know how to generate the example contents (?)
Reviewer: It should just be a couple of lines showing how you would use this function in a simple case.
Author: Isn't the doctest example in lines 1625-1637 doing that?
Reviewer: Perhaps. I believe it belongs more here, in the API documentation, than in the narrative user guide.
    """
    X, labels = check_X_y(X, labels)
Review thread:
Reviewer: I would factorize the part which is the same in all the different metrics.
Author: This seems a bit out of scope for this PR. Also, there are small differences in each metric that make the refactor non-trivial. I could take it up in a later PR if you do not mind.
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    n_samples, _ = X.shape
    n_labels = len(le.classes_)
    check_number_of_labels(n_labels, n_samples)

    intra_dists = np.zeros(n_labels)
    centroids = np.zeros((n_labels, len(X[0])), dtype=np.float)
    for k in range(n_labels):
        cluster_k = safe_indexing(X, labels == k)
        centroid = cluster_k.mean(axis=0)
        centroids[k] = centroid
        intra_dists[k] = np.average(pairwise_distances(
            cluster_k, [centroid]))

    centroid_distances = pairwise_distances(centroids)

    if np.allclose(intra_dists, 0) or np.allclose(centroid_distances, 0):
        return 0.0

    score = (intra_dists[:, None] + intra_dists) / centroid_distances
    score[score == np.inf] = np.nan
    return np.mean(np.nanmax(score, axis=1))
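For reference, the quantity computed above is the standard Davies-Bouldin index from the 1979 paper: the average, over clusters, of the worst-case similarity between a cluster and any other cluster. In the sketch below (not part of the diff), s_i is the mean distance of the samples in cluster i to its centroid (`intra_dists[i]`) and d_ij is the distance between the centroids of clusters i and j (`centroid_distances[i, j]`):

```latex
% Davies-Bouldin index over k clusters
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}
```

Lower values indicate more compact, better-separated clusters. The `nanmax` over rows implements the maximum over j != i, after the zero diagonal of the centroid distance matrix has been masked out via inf -> nan.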
Review thread:
Reviewer: DB index
Author: I am sorry, I do not understand what is requested here (?)
Reviewer: I guess we can leave it explicitly as "Davies-Bouldin". "DB" might be confused with database, or DBSCAN.
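To give concrete shape to the Examples section requested in the review thread above, here is a minimal, hypothetical usage sketch. It assumes the function ends up importable as `sklearn.metrics.davies_bouldin_score` once the PR is merged; the import path and the choice of `make_blobs`/`KMeans` are illustrative assumptions, not part of the diff.

```python
# Hypothetical usage sketch (assumed import path: sklearn.metrics.davies_bouldin_score
# after this PR is merged).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Two compact, well-separated blobs: a low Davies-Bouldin score is expected,
# since lower values indicate better-separated, more compact clusters.
X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))
```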