[STALLED] Davies bouldin index #7942
Conversation
1. sklearn/metrics/cluster/unsupervised.py - the calculation itself
2. sklearn/metrics/cluster/tests/test_unsupervised.py - tests
3. sklearn/metrics/cluster/__init__.py - exposing the function
def davies_bouldin_index(X, labels):
    """Compute the Davies Bouldin index.

    The index is defiend as the ratio of within-cluster
defined
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)
    clusters_data = {}
I think it's better to allocate mean_k and d_k with a fixed size and a numpy.array
Can you please point me to an example somewhere so I can modify it accordingly?
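(For reference, a sketch of the kind of preallocation being suggested; the helper name is hypothetical, and the arrays mirror the ones the PR later adopts. `labels` is assumed to already be a numpy array of encoded labels 0..n_labels-1.)

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def cluster_statistics(X, labels, n_labels):
        # Fixed-size numpy arrays instead of a dict keyed by cluster label.
        intra_dists = np.zeros(n_labels)
        centroids = np.zeros((n_labels, X.shape[1]))
        for k in range(n_labels):
            cluster_k = X[labels == k]
            centroids[k] = cluster_k.mean(axis=0)
            intra_dists[k] = np.average(pairwise_distances(cluster_k, [centroids[k]]))
        return intra_dists, centroids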
@tomron Thanks for the PR. I have some questions for you. I'm not an expert, so maybe the questions are stupid: there are already a lot of clustering metrics in sklearn, so why add a new one? Can you give more details and explain the pros/cons in comparison with Calinski-Harabasz, and why we should add DBI? Thanks in advance.
@tguillemot
    >>> from sklearn import metrics
    >>> from sklearn.metrics import pairwise_distances
    >>> from sklearn import datasets
    >>> dataset = datasets.load_iris()
you could use load_Xy
This is consistent with the rest of the documentation of this module
It might be nice to add this to the comparison in #6948. Can you maybe run that and see how the metric fares?
This has 3543 cites, btw.
separation between clusters.

For :math:`k` clusters, the Davies–Bouldin index :math:`DB` is given as the
ratio of within cluster-mean distance to the between means distance.
Something here is not right.
doc/modules/clustering.rst
.. math::
   DB(k) = \frac{1}{k} \sum_{i=1}^k \max_{i \neq j} D_{ij}

Where :math:`D_ij` is the ratio between the within distances in clusters
ij -> {ij}
done
doc/modules/clustering.rst
:math:`i` and :math:`j`.

.. math::
   D_ij = \frac{\bar{d_i}+\bar{d_j}}{d_ij}
ij -> {ij} (x2)
done
doc/modules/clustering.rst
.. math::
   D_ij = \frac{\bar{d_i}+\bar{d_j}}{d_ij}

:math:`\bar{d_i}` is the average distance between each point cluster
"point cluster" -> "point in"
done
doc/modules/clustering.rst
:math:`i` and the centroid of cluster :math:`i`.
:math:`\bar{d_i}` is the diameter of cluster :math:`i`.

:math:`\bar{d_j}` is the average distance between each point cluster
"point cluster" -> "point in"
done
    assert_equal(0., davies_bouldin_index([[-1, -1], [1, 1]] * 10,
                                          [0] * 10 + [1] * 10))

    # General case (with non numpy arrays)
I've not checked the paper, but if the paper has examples you can copy in here, it's best to make the test suite as complete as is reasonable with respect to those examples.
I didn't find a numeric, self-contained example in the article.
                 [[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
    labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
    assert_almost_equal(davies_bouldin_index(X, labels),
                        2*np.sqrt(0.5)/3)
Please make sure you test the case where a cluster has a single sample.
done
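(A sketch of the kind of test being asked for; the values are hypothetical, and it assumes the function is importable from sklearn.metrics as this PR proposes.)

    import numpy as np
    from sklearn.metrics import davies_bouldin_index

    # Three samples in cluster 0 and a single sample in cluster 1: the
    # singleton cluster has an intra-cluster distance of 0, and the index
    # should still be a finite, non-negative number.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.]])
    labels = np.array([0, 0, 0, 1])
    score = davies_bouldin_index(X, labels)
    assert np.isfinite(score) and score >= 0.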
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)
    clusters_data = {}
Please average intracluster distances and centroids in separate numpy arrays.
done
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = np.mean(cluster_k, axis=0)
        d_k = np.average(pairwise_distances(cluster_k, mean_k))
mean_k -> [mean_k]
I improved the code
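(The reason for the [mean_k] wrapping: pairwise_distances expects 2-D inputs, while np.mean(cluster_k, axis=0) is 1-D. A small illustration on toy data, not taken from the PR:)

    import numpy as np
    from sklearn.metrics import pairwise_distances

    cluster_k = np.array([[0., 0.], [1., 1.], [0., 1.]])
    mean_k = cluster_k.mean(axis=0)          # shape (2,), i.e. 1-D

    # Wrapping the centroid makes it a (1, n_features) array:
    d_k = np.average(pairwise_distances(cluster_k, [mean_k]))
    # Equivalent: pairwise_distances(cluster_k, mean_k.reshape(1, -1))
    print(d_k)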
        clusters_data[k] = (mean_k, d_k)

    score = 0
    for i in range(n_labels):
This can be performed with something like:

    scores = (intra_dists[:, None] + intra_dists) / pairwise_distances(centroids)
    # TODO: fix diagonal infs
    return np.mean(np.max(scores, axis=1))
I improved the code
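(A sketch of the vectorised form with the diagonal handled; the helper name is hypothetical, and it assumes intra_dists and centroids have already been filled per cluster.)

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def _db_from_cluster_stats(intra_dists, centroids):
        centroid_distances = pairwise_distances(centroids)
        with np.errstate(divide='ignore', invalid='ignore'):
            scores = (intra_dists[:, None] + intra_dists) / centroid_distances
        # The diagonal of centroid_distances is 0, so the i == j entries come
        # out as inf (or nan); mask them and take the per-row maximum over the
        # remaining entries.
        scores[~np.isfinite(scores)] = np.nan
        return np.mean(np.nanmax(scores, axis=1))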
Adding a test to validate that the code runs correctly when there is one sample in a cluster (not in all clusters; that case is already validated).
1. Average intracluster distances and centroids in separate numpy arrays.
2. Adjust the code for the case where a cluster has a single sample.
~~~~~~~~~~

- The computation of the Davies-Bouldin index is simpler than the computation
  of the Silhouette index.
Can you add a word that, contrary to Calinski-Harabasz, DBI is bounded (0-1)?
    check_number_of_labels(n_labels, n_samples)
    intra_dists = np.zeros(n_labels)
    centroids = np.zeros((n_labels, len(X[0])), np.float32)
I don't understand why you use np.float32. Maybe you can adapt to the type of X, as is done in l174?
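(A sketch of that dtype adaptation; the helper name is hypothetical, and the point is just to avoid hard-coding np.float32.)

    import numpy as np

    def centroid_buffer(X, n_labels):
        # Follow X's floating dtype when it has one, otherwise fall back to float64.
        dtype = X.dtype if np.issubdtype(X.dtype, np.floating) else np.float64
        return np.zeros((n_labels, X.shape[1]), dtype=dtype)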
This more-or-less LGTM!
:math:`i` and :math:`j`.

.. math::
   D_{ij} = \frac{\bar{d_i}+\bar{d_j}}{d_ij}
still need d_{ij}
----------------------

If the ground truth labels are not known, the Davies–Bouldin index
(:func:`sklearn.metrics.davies_bouldin_index`) can be used to evaluate the
You need an entry in classes.rst for this to work. You should have an entry there anyway!
   D_{ij} = \frac{\bar{d_i}+\bar{d_j}}{d_ij}

:math:`\bar{d_i}` is the average distance between each point in cluster
:math:`i` and the centroid of cluster :math:`i`.
I think, "known as its diameter." Then you may add a comment that dbar_j is similarly the diameter for cluster j, or you can just leave it out, because that's clear from the notation.
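(Pulling the notation fixes together, the corrected block would read roughly as below; this is a sketch in the same rst style, with :math:`d_{ij}` taken to be the distance between the centroids of clusters :math:`i` and :math:`j`, matching the code.)

.. math::
   DB(k) = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} D_{ij},
   \qquad
   D_{ij} = \frac{\bar{d_i} + \bar{d_j}}{d_{ij}}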
    The index is defined as the ratio of within-cluster
    and between-cluster distances.
We usually have a link to the user guide here
    centroids = np.zeros((n_labels, len(X[0])), np.float32)
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = np.mean(cluster_k, axis=0)
Use cluster_k.mean(axis=0), and use safe_indexing on the line before, and then I think your code will automatically work for sparse matrices...? If there is no reason not to support a sparse matrix input, please add a test and update the docstring.
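(A sketch of what that could look like; safe_indexing lives in sklearn.utils, the helper name is hypothetical, and indexing by integer positions keeps it working for dense and sparse X alike.)

    import numpy as np
    from sklearn.utils import safe_indexing

    def cluster_members(X, labels, k):
        # Select the rows belonging to cluster k without assuming boolean
        # mask indexing support on X.
        mask = np.asarray(labels) == k
        return safe_indexing(X, np.flatnonzero(mask))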
        intra_dists[k] = np.average(pairwise_distances(cluster_k, [mean_k]))
    centroid_distances = pairwise_distances(centroids)
    with np.errstate(divide='ignore', invalid='ignore'):
        if np.all((intra_dists[:, None] + intra_dists) == 0.0) or \
Why don't you just test if not np.any(intra_dists) (or if np.allclose(intra_dists, 0) if numerical stability is a concern)?
        if np.all((intra_dists[:, None] + intra_dists) == 0.0) or \
                np.all(centroid_distances == 0.0):
            return 0.0
        scores = (intra_dists[:, None] + intra_dists)/centroid_distances
Space around /, please.
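(To illustrate why the matrix form of the check is unnecessary, a small demonstration, not from the PR: for non-negative intra-cluster distances, the summed matrix is all zero exactly when the vector itself is.)

    import numpy as np

    for intra_dists in (np.zeros(3), np.array([0.0, 0.5, 1.0])):
        matrix_check = np.all((intra_dists[:, None] + intra_dists) == 0.0)
        vector_check = not np.any(intra_dists)
        assert bool(matrix_check) == vector_check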
@tomron Are you with us? :)
Reference Issue
What does this implement/fix? Explain your changes.
Added an implementation of the Davies-Bouldin index (https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index).
Implementation includes changes to the unsupervised.py, test_unsupervised.py and __init__.py files.
Any other comments?
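(A minimal usage sketch for the proposed metric; it assumes the function is exposed as sklearn.metrics.davies_bouldin_index, as in this PR. Lower values indicate better-separated, more compact clusters.)

    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_index

    X, _ = datasets.load_iris(return_X_y=True)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
    print(davies_bouldin_index(X, labels))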