[STALLED] Davies bouldin index #7942


Closed
wants to merge 12 commits into from

Conversation

tomron

@tomron tomron commented Nov 26, 2016

Reference Issue

What does this implement/fix? Explain your changes.

Added implementation of Davies Bouldin index (https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index)

Implementation includes:

  • Function code
  • Exposing function in the __init__.py files
  • Testing
  • Adding documentation and usage example

Any other comments?
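For readers skimming the thread, here is a minimal self-contained sketch of the index this PR adds. This is my reconstruction from the textbook/Wikipedia definition, not the PR's actual code; only the function name is taken from the PR.

```python
import numpy as np

def davies_bouldin_index(X, labels):
    """Minimal sketch of the Davies-Bouldin index (lower is better).

    For each pair of clusters i != j, D_ij = (d_i + d_j) / d(c_i, c_j),
    where d_i is the mean distance of cluster i's points to its centroid
    and d(c_i, c_j) the distance between the two centroids; the index
    averages max over j of D_ij across all clusters i.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    k = len(uniq)
    centroids = np.array([X[labels == u].mean(axis=0) for u in uniq])
    intra = np.array([
        np.linalg.norm(X[labels == u] - c, axis=1).mean()
        for u, c in zip(uniq, centroids)
    ])
    score = 0.0
    for i in range(k):
        d_ij = [
            (intra[i] + intra[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        score += max(d_ij)
    return score / k

# Two tight, well-separated blobs: intra-cluster spread is 0, so DBI is 0.
X = [[-1, -1]] * 10 + [[1, 1]] * 10
labels = [0] * 10 + [1] * 10
print(davies_bouldin_index(X, labels))  # 0.0

# Two spread-out clusters: intra distances 1 and 1, centroid distance 4.
print(davies_bouldin_index([[0, 0], [0, 2], [4, 0], [4, 2]],
                           [0, 0, 1, 1]))  # 0.5
```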

tommagic and others added 4 commits November 25, 2016 17:27
1. sklearn/metrics/cluster/unsupervised.py - calculation itself
2. sklearn/metrics/cluster/tests/test_unsupervised.py - tests
3. sklearn/metrics/cluster/__init__.py - exposing the function
def davies_bouldin_index(X, labels):
    """Compute the Davies Bouldin index.

    The index is defiend as the ratio of within-cluster
Contributor

defined

n_labels = len(le.classes_)

check_number_of_labels(n_labels, n_samples)
clusters_data = {}
Contributor

I think it's better to allocate mean_k and d_k with a fixed size and a numpy.array

Author

Can you please point me to an example somewhere so I can modify it accordingly?

@tguillemot
Contributor

@tomron Thanks for the PR. I have some questions for you.

I'm not an expert, so maybe these questions are stupid: there are already a lot of clustering metrics in sklearn, so why add a new one?
I've read your doc, and indeed the computation is simpler than silhouette, but I am not sure that the DBI is something used in practice (maybe I'm wrong). Moreover, for a fast computation we already have the Calinsky Harabaz metric, which seems to have the same advantages and drawbacks.

Can you give more details and explain the pros/cons in comparison with Calinsky Harabaz, and why we should add DBI?

Thanks in advance

@tomron
Author

tomron commented Nov 28, 2016

@tguillemot
Simple answer - I needed it and implemented it, so I decided to contribute it.
Compared to Calinsky Harabaz, DBI is bounded (0-1).
There are use cases where DBI performs better or is more relevant than Calinsky Harabaz or silhouette, or is simply the relevant measure for the person using it.

>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> dataset = datasets.load_iris()
Member

you could use load_Xy

Author

This is consistent with the rest of the documentation of this module

@amueller
Member

It might be nice to add this to the comparison in #6948. Can you maybe run that and see how the metric fares?

@amueller
Member

This has 3543 cites btw.

separation between clusters.

For :math:`k` clusters, the Davies–Bouldin index :math:`DB` is given as the
ratio of within cluster-mean distance to the between means distance.
Member

Something here is not right.

.. math::
    DB(k) = \frac{1}{k} \sum_{i=1}^k \max_{i \neq j} D_{ij}

Where :math:`D_ij` is the ratio between the within distances in clusters
Member

ij -> {ij}

Author

done

:math:`i` and :math:`j`.

.. math::
    D_ij = \frac{\bar{d_i}+\bar{d_j}}{d_ij}
Member

ij -> {ij} x2

Author

done

.. math::
    D_ij = \frac{\bar{d_i}+\bar{d_j}}{d_ij}

:math:`\bar{d_i}` is the average distance between each point cluster
Member

"point cluster" -> "point in"

Author

done

:math:`i` and the centroid of cluster :math:`i`.
:math:`\bar{d_i}` is the diameter of cluster :math:`i`.

:math:`\bar{d_j}` is the average distance between each point cluster
Member

"point cluster" -> "point in"

Author

done
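A quick hand-worked check of these formulas (a toy example, not from the PR docs): take clusters :math:`1 = \{(0,0), (0,2)\}` and :math:`2 = \{(4,0), (4,2)\}`. Their centroids are :math:`(0,1)` and :math:`(4,1)`, so :math:`\bar{d_1} = \bar{d_2} = 1` and the centroid distance is :math:`d_{12} = 4`, giving

.. math::
    D_{12} = \frac{\bar{d_1} + \bar{d_2}}{d_{12}} = \frac{1 + 1}{4} = 0.5,
    \qquad
    DB(2) = \frac{1}{2}\left(D_{12} + D_{21}\right) = 0.5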

assert_equal(0., davies_bouldin_index([[-1, -1], [1, 1]] * 10,
                                      [0] * 10 + [1] * 10))

# General case (with non numpy arrays)
Member

I've not checked the paper, but if it has examples you can copy in here, it's best to make the test suite as complete as is reasonable with respect to those examples.

Author

I didn't find a self-contained numeric example in the article.

             [[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
                    2 * np.sqrt(0.5) / 3)
Member

Please make sure you test the case where a cluster has a single sample.

Author

done
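A tiny sketch of why this edge case matters (toy values of my own, not the PR's test): a single-sample cluster's centroid is the sample itself, so its average intra-cluster distance must come out as exactly zero rather than raising or producing NaN.

```python
import numpy as np

# A cluster containing a single point: the centroid is the point itself,
# so the mean point-to-centroid distance is exactly 0.0.
cluster = np.array([[2.0, 3.0]])
centroid = cluster.mean(axis=0)                       # array([2., 3.])
intra = np.linalg.norm(cluster - centroid, axis=1).mean()
print(intra)  # 0.0
```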

n_labels = len(le.classes_)

check_number_of_labels(n_labels, n_samples)
clusters_data = {}
Member

Please average intracluster distances and centroids in separate numpy arrays.

Author

done

for k in range(n_labels):
    cluster_k = X[labels == k]
    mean_k = np.mean(cluster_k, axis=0)
    d_k = np.average(pairwise_distances(cluster_k, mean_k))
Member

mean_k -> [mean_k]

Author

I improved the code

clusters_data[k] = (mean_k, d_k)

score = 0
for i in range(n_labels):
Member

this can be performed with something like:

scores = (intra_dists[:, None] + intra_dists) / pairwise_distances(centroids)
# TODO: fix diagonal infs
return np.mean(np.max(scores, axis=1))
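The reviewer's outline above can be fleshed out into a runnable sketch. The diagonal fix and the toy inputs are my assumptions, not the PR's final code: the diagonal of the score matrix divides by zero (a cluster's distance to its own centroid), so those entries are zeroed before taking the row-wise max.

```python
import numpy as np

# Toy precomputed inputs (assumed): two clusters with average
# intra-cluster distance 1, centroids at (0, 1) and (4, 1).
intra_dists = np.array([1.0, 1.0])
centroids = np.array([[0.0, 1.0], [4.0, 1.0]])

# Pairwise centroid distances; the diagonal is 0, so dividing yields inf.
centroid_distances = np.linalg.norm(
    centroids[:, None] - centroids[None, :], axis=2)
with np.errstate(divide='ignore', invalid='ignore'):
    scores = (intra_dists[:, None] + intra_dists) / centroid_distances
scores[np.isinf(scores) | np.isnan(scores)] = 0.0  # fix diagonal infs
print(np.mean(np.max(scores, axis=1)))  # 0.5
```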

Author

I improved the code

Adding a test to validate that the code run correctly when
there is one sample in a cluster (not in all clusters, this is
already validated).
1. average intracluster distances and centroids in separate numpy arrays
2. Adjust code for the case where a cluster has a single sample
@raghavrv raghavrv changed the title Davies bouldin index [MRG] Davies bouldin index Dec 2, 2016
@raghavrv raghavrv added this to the 0.19 milestone Dec 2, 2016
~~~~~~~~~~

- The computation of the Davies-Bouldin index is simpler than the computation
of the Silhouette index.
Contributor

Can you add a word saying that, contrary to Calinsky Harabaz, DBI is bounded (0-1)?


check_number_of_labels(n_labels, n_samples)
intra_dists = np.zeros(n_labels)
centroids = np.zeros((n_labels, len(X[0])), np.float32)
Contributor

I don't understand why you use np.float32. Maybe you can adapt to the type of X, as is done in l174?

Member

@jnothman jnothman left a comment

This more-or-less LGTM!

:math:`i` and :math:`j`.

.. math::
    D_{ij} = \frac{\bar{d_i}+\bar{d_j}}{d_ij}
Member

still need d_{ij}

----------------------

If the ground truth labels are not known, the Davies–Bouldin index
(:func:`sklearn.metrics.davies_bouldin_index`) can be used to evaluate the
Member

You need an entry in classes.rst for this to work. You should have an entry there anyway!

D_{ij} = \frac{\bar{d_i}+\bar{d_j}}{d_ij}

:math:`\bar{d_i}` is the average distance between each point in cluster
:math:`i` and the centroid of cluster :math:`i`.
Member

I think, "known as its diameter." Then you may add a comment that dbar_j is similarly the diameter for cluster j, or you can just leave it out, because that's clear from the notation.


The index is defined as the ratio of within-cluster
and between-cluster distances.

Member

We usually have a link to the user guide here

centroids = np.zeros((n_labels, len(X[0])), np.float32)
for k in range(n_labels):
    cluster_k = X[labels == k]
    mean_k = np.mean(cluster_k, axis=0)
Member

use cluster_k.mean(axis=0) and use safe_indexing on the line before, and then I think your code will automatically work for sparse matrices...? If there is no reason not to support a sparse matrix input, please add a test and update the docstring.
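A small sketch of the idea behind that suggestion (the data and names are illustrative, and whether the final PR supports sparse input is exactly what is being asked): scipy's CSR matrices already support boolean row masks and per-column means, so the loop body could plausibly work unchanged for sparse X.

```python
import numpy as np
from scipy import sparse

# Boolean-mask row slicing keeps the matrix sparse, and .mean(axis=0)
# yields a dense per-column centroid (as an np.matrix, hence the asarray).
X = sparse.csr_matrix(np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]))
labels = np.array([0, 0, 1])

cluster_0 = X[labels == 0]                           # still sparse, shape (2, 2)
mean_0 = np.asarray(cluster_0.mean(axis=0)).ravel()  # dense 1-D centroid
print(mean_0)  # [2. 0.]
```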

    intra_dists[k] = np.average(pairwise_distances(cluster_k, [mean_k]))
centroid_distances = pairwise_distances(centroids)
with np.errstate(divide='ignore', invalid='ignore'):
    if np.all((intra_dists[:, None] + intra_dists) == 0.0) or \
Member

why don't you just test if not np.any(intra_dists) (or if np.allclose(intra_dists, 0) if numerical stability is a concern)?

if np.all((intra_dists[:, None] + intra_dists) == 0.0) or \
        np.all(centroid_distances == 0.0):
    return 0.0
scores = (intra_dists[:, None] + intra_dists)/centroid_distances
Member

space around /, please

@raghavrv
Member

raghavrv commented Jan 3, 2017

@tomron Are you with us? :)

@raghavrv raghavrv changed the title [MRG] Davies bouldin index [STALLED] Davies bouldin index Mar 13, 2017
@amueller amueller removed this from the 0.19 milestone Jun 12, 2017
@amueller amueller removed this from the 0.19 milestone Jun 12, 2017
8 participants