[STALLED] Davies bouldin index #7942
Conversation
1. sklearn/metrics/cluster/unsupervised.py - the calculation itself
2. sklearn/metrics/cluster/tests/test_unsupervised.py - tests
3. sklearn/metrics/cluster/__init__.py - exposing the function
def davies_bouldin_index(X, labels):
    """Compute the Davies Bouldin index.

    The index is defiend as the ratio of within-cluster
defined
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)
    clusters_data = {}
I think it's better to allocate mean_k and d_k with a fixed size and a numpy.array
Can you please point me to an example somewhere so I can modify it accordingly?
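(For reference, a sketch of the kind of preallocation being suggested; the helper name is hypothetical, and the arrays mirror the ones the PR later adopts. `labels` is assumed to already be a numpy array of encoded labels 0..n_labels-1.)

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def cluster_statistics(X, labels, n_labels):
        # Fixed-size numpy arrays instead of a dict keyed by cluster label.
        intra_dists = np.zeros(n_labels)
        centroids = np.zeros((n_labels, X.shape[1]))
        for k in range(n_labels):
            cluster_k = X[labels == k]
            centroids[k] = cluster_k.mean(axis=0)
            intra_dists[k] = np.average(pairwise_distances(cluster_k, [centroids[k]]))
        return intra_dists, centroids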
@tomron Thanks for the PR. I have some questions for you. I'm not an expert, so maybe the questions are stupid: there are already a lot of clustering metrics in sklearn, so why add a new one? Can you give more details and explain the pros/cons in comparison with Calinski-Harabasz, and why we should add DBI? Thanks in advance.
@tguillemot
    >>> from sklearn import metrics
    >>> from sklearn.metrics import pairwise_distances
    >>> from sklearn import datasets
    >>> dataset = datasets.load_iris()
you could use load_Xy
This is consistent with the rest of the documentation of this module
It might be nice to add this to the comparison in #6948. Can you maybe run that and see how the metric fares?
This has 3543 cites, btw.
separation between clusters.

For :math:`k` clusters, the Davies–Bouldin index :math:`DB` is given as the
ratio of within cluster-mean distance to the between means distance.
Something here is not right.
doc/modules/clustering.rst
.. math::
   DB(k) = \frac{1}{k} \sum_{i=1}^k \max_{i \neq j} D_{ij}

Where :math:`D_ij` is the ratio between the within distances in clusters
ij -> {ij}
done
doc/modules/clustering.rst
:math:`i` and :math:`j`.

.. math::
   D_ij = \frac{\bar{d_i}+\bar{d_j}}{d_ij}
ij -> {ij} (x2)
done
doc/modules/clustering.rst
.. math::
   D_ij = \frac{\bar{d_i}+\bar{d_j}}{d_ij}

:math:`\bar{d_i}` is the average distance between each point cluster
"point cluster" -> "point in"
done
doc/modules/clustering.rst
:math:`i` and the centroid of cluster :math:`i`.
:math:`\bar{d_i}` is the diameter of cluster :math:`i`.

:math:`\bar{d_j}` is the average distance between each point cluster
"point cluster" -> "point in"
done
    assert_equal(0., davies_bouldin_index([[-1, -1], [1, 1]] * 10,
                                          [0] * 10 + [1] * 10))

    # General case (with non numpy arrays)
I've not checked the paper, but if the paper has examples you can copy in here, it's best to make the test suite as complete as is reasonable with respect to those examples.
I didn't find a numeric, self-contained example in the article.
                 [[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
    labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
    assert_almost_equal(davies_bouldin_index(X, labels),
                        2*np.sqrt(0.5)/3)
Please make sure you test the case where a cluster has a single sample.
done
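(A sketch of the kind of test being asked for; the values are hypothetical, and it assumes the function is importable from sklearn.metrics as this PR proposes.)

    import numpy as np
    from sklearn.metrics import davies_bouldin_index

    # Three samples in cluster 0 and a single sample in cluster 1: the
    # singleton cluster has an intra-cluster distance of 0, and the index
    # should still be a finite, non-negative number.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.]])
    labels = np.array([0, 0, 0, 1])
    score = davies_bouldin_index(X, labels)
    assert np.isfinite(score) and score >= 0.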
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)
    clusters_data = {}
Please average intracluster distances and centroids in separate numpy arrays.
done
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = np.mean(cluster_k, axis=0)
        d_k = np.average(pairwise_distances(cluster_k, mean_k))
mean_k -> [mean_k]
I improved the code
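(The reason for the [mean_k] wrapping: pairwise_distances expects 2-D inputs, while np.mean(cluster_k, axis=0) is 1-D. A small illustration on toy data, not taken from the PR:)

    import numpy as np
    from sklearn.metrics import pairwise_distances

    cluster_k = np.array([[0., 0.], [1., 1.], [0., 1.]])
    mean_k = cluster_k.mean(axis=0)          # shape (2,), i.e. 1-D

    # Wrapping the centroid makes it a (1, n_features) array:
    d_k = np.average(pairwise_distances(cluster_k, [mean_k]))
    # Equivalent: pairwise_distances(cluster_k, mean_k.reshape(1, -1))
    print(d_k)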
        clusters_data[k] = (mean_k, d_k)

    score = 0
    for i in range(n_labels):
This can be performed with something like:

    scores = (intra_dists[:, None] + intra_dists) / pairwise_distances(centroids)
    # TODO: fix diagonal infs
    return np.mean(np.max(scores, axis=1))
I improved the code
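(A sketch of the vectorised form with the diagonal handled; the helper name is hypothetical, and it assumes intra_dists and centroids have already been filled per cluster.)

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def _db_from_cluster_stats(intra_dists, centroids):
        centroid_distances = pairwise_distances(centroids)
        with np.errstate(divide='ignore', invalid='ignore'):
            scores = (intra_dists[:, None] + intra_dists) / centroid_distances
        # The diagonal of centroid_distances is 0, so the i == j entries come
        # out as inf (or nan); mask them and take the per-row maximum over the
        # remaining entries.
        scores[~np.isfinite(scores)] = np.nan
        return np.mean(np.nanmax(scores, axis=1))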
Adding a test to validate that the code runs correctly when there is one sample in a cluster (not in all clusters; that case is already validated).
1. Average intracluster distances and centroids in separate numpy arrays.
2. Adjust the code for the case where a cluster has a single sample.
~~~~~~~~~~

- The computation of the Davies-Bouldin index is simpler than the computation
  of the Silhouette index.
Can you add a word that, contrary to Calinski-Harabasz, DBI is bounded (0-1)?
    check_number_of_labels(n_labels, n_samples)
    intra_dists = np.zeros(n_labels)
    centroids = np.zeros((n_labels, len(X[0])), np.float32)
I don't understand why you use np.float32. Maybe you can adapt to the type of X, as is done in l174?
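(A sketch of that dtype adaptation; the helper name is hypothetical, and the point is just to avoid hard-coding np.float32.)

    import numpy as np

    def centroid_buffer(X, n_labels):
        # Follow X's floating dtype when it has one, otherwise fall back to float64.
        dtype = X.dtype if np.issubdtype(X.dtype, np.floating) else np.float64
        return np.zeros((n_labels, X.shape[1]), dtype=dtype)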
This more-or-less LGTM!
:math:`i` and :math:`j`.

.. math::
   D_{ij} = \frac{\bar{d_i}+\bar{d_j}}{d_ij}
still need d_{ij}
----------------------

If the ground truth labels are not known, the Davies–Bouldin index
(:func:`sklearn.metrics.davies_bouldin_index`) can be used to evaluate the
You need an entry in classes.rst for this to work. You should have an entry there anyway!
   D_{ij} = \frac{\bar{d_i}+\bar{d_j}}{d_ij}

:math:`\bar{d_i}` is the average distance between each point in cluster
:math:`i` and the centroid of cluster :math:`i`.
I think, "known as its diameter." Then you may add a comment that dbar_j is similarly the diameter for cluster j, or you can just leave it out, because that's clear from the notation.
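(Pulling the notation fixes together, the corrected block would read roughly as below; this is a sketch in the same rst style, with :math:`d_{ij}` taken to be the distance between the centroids of clusters :math:`i` and :math:`j`, matching the code.)

.. math::
   DB(k) = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} D_{ij},
   \qquad
   D_{ij} = \frac{\bar{d_i} + \bar{d_j}}{d_{ij}}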
    The index is defined as the ratio of within-cluster
    and between-cluster distances.
We usually have a link to the user guide here
    centroids = np.zeros((n_labels, len(X[0])), np.float32)
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = np.mean(cluster_k, axis=0)
Use cluster_k.mean(axis=0), and use safe_indexing on the line before, and then I think your code will automatically work for sparse matrices...? If there is no reason not to support a sparse matrix input, please add a test and update the docstring.
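(A sketch of what that could look like; safe_indexing lives in sklearn.utils, the helper name is hypothetical, and indexing by integer positions keeps it working for dense and sparse X alike.)

    import numpy as np
    from sklearn.utils import safe_indexing

    def cluster_members(X, labels, k):
        # Select the rows belonging to cluster k without assuming boolean
        # mask indexing support on X.
        mask = np.asarray(labels) == k
        return safe_indexing(X, np.flatnonzero(mask))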
        intra_dists[k] = np.average(pairwise_distances(cluster_k, [mean_k]))
    centroid_distances = pairwise_distances(centroids)
    with np.errstate(divide='ignore', invalid='ignore'):
        if np.all((intra_dists[:, None] + intra_dists) == 0.0) or \
Why don't you just test if not np.any(intra_dists) (or if np.allclose(intra_dists, 0) if numerical stability is a concern)?
        if np.all((intra_dists[:, None] + intra_dists) == 0.0) or \
                np.all(centroid_distances == 0.0):
            return 0.0
        scores = (intra_dists[:, None] + intra_dists)/centroid_distances
Space around /, please.
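(To illustrate why the matrix form of the check is unnecessary, a small demonstration, not from the PR: for non-negative intra-cluster distances, the summed matrix is all zero exactly when the vector itself is.)

    import numpy as np

    for intra_dists in (np.zeros(3), np.array([0.0, 0.5, 1.0])):
        matrix_check = np.all((intra_dists[:, None] + intra_dists) == 0.0)
        vector_check = not np.any(intra_dists)
        assert bool(matrix_check) == vector_check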
@tomron Are you with us? :)
Reference Issue
What does this implement/fix? Explain your changes.
Added an implementation of the Davies-Bouldin index (https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index).
Implementation includes changes to the unsupervised.py, test_unsupervised.py and __init__.py files.
Any other comments?
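(A minimal usage sketch for the proposed metric; it assumes the function is exposed as sklearn.metrics.davies_bouldin_index, as in this PR. Lower values indicate better-separated, more compact clusters.)

    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_index

    X, _ = datasets.load_iris(return_X_y=True)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
    print(davies_bouldin_index(X, labels))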