[MRG+1] Add Davies-Bouldin index #10827
Conversation
Thanks for this. Also, flake8 is failing
----------
.. [1] `Davies, David L.; Bouldin, Donald W. (1979).
   "A Cluster Separation Measure". IEEE Transactions on
   Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
Please add an Examples section
I'm sorry, I can see there are sections like this in other parts of the doc, but I don't know how to generate the example contents (?)
It should just be a couple of lines showing how you would use this function in a simple case.
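For illustration, such a couple of lines could look like the sketch below (it uses `davies_bouldin_score`, the name the function eventually shipped under in scikit-learn 0.20; the `KMeans` parameters are only an example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

# Cluster the iris data and score the resulting partition.
# Lower values indicate more compact, better-separated clusters.
X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = davies_bouldin_score(X, labels)
print(round(score, 3))
```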
Isn't the doctest example in lines 1625-1637 doing that?
Perhaps. I believe it belongs more here, in the API documentation, than in the narrative user guide
doc/modules/clustering.rst
Outdated
cluster analysis.

>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davis_bouldin_index
Typo here
Fixed
rng.rand(10, 2), np.arange(10))

# Assert the value is 0. when all samples are equals
assert_equal(0., davies_bouldin_index(np.ones((10, 2)),
These days we would prefer a bare assert, rather than assert_equal
Fixed
Feel like wrapping up #8135 for us also so we don't need to add tests here?
@logc, I don't think you've pushed a second commit.
Please merge in the latest master, where we have just added common tests for clustering metrics. Please add this to the common tests in sklearn/metrics/cluster/tests/test_common.py
@jnothman sorry, commented before pushing the second commit. The tests run really long, locally (!)
Do you mean the full test suite? The clustering metrics tests?
@jnothman yes, the full test suite. Currently, I am running it with
If I were you, I'd run
(force-pushed from 881c3a6 to a53c91f)
@jnothman added In order to pass This was one of the open questions in the previous PR, and now I know the answer: if you do not make that assumption, then the resulting index is not the same for an array of ints as for the same array cast as float.
@jnothman can I have another review here? 😄 Also, @tguillemot had comments on the previously existing PR; maybe I could have a review here, too? Thanks!
Is the difference between int and float X substantial?
LGTM, thanks
[[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
                    2*np.sqrt(0.5)/3)
Please include spaces around * and /
Done
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
                    (5./4)/3)
Spaces around /, please
Done
My ping to @tguillemot was not very successful 😄; maybe @glemaitre could help here by giving a second opinion? Thanks to any and all reviewers.
I put it in my list of revisions :) You should receive my review soon.
Sorry for the delay.
Couple of comments and this is missing a what's new entry.
doc/modules/clustering.rst
Outdated
DB = \frac{1}{k} \sum_{i=1}^k \max_{i \neq j} R_{ij}
>>> from sklearn import datasets
This is not recognized as a code block: https://21311-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html
Done!
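For reference, the index discussed above can be sketched as a standalone NumPy implementation (a hypothetical illustration, not the PR's code; it assumes Euclidean distances and at least two distinct clusters with distinct centroids):

```python
import numpy as np

def db_index(X, labels):
    """Davies-Bouldin index computed directly from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    k = len(clusters)
    # Per-cluster centroids and dispersions s_i (mean distance to centroid)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    s = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                  for i, c in enumerate(clusters)])
    total = 0.0
    for i in range(k):
        # R_ij = (s_i + s_j) / d_ij over all j != i; keep the worst case
        d = np.linalg.norm(centroids - centroids[i], axis=1)
        total += max((s[i] + s[j]) / d[j] for j in range(k) if j != i)
    return total / k

# The single-sample-clusters case from the PR's tests gives (5/4)/3
print(db_index([[0, 0], [2, 2], [3, 3], [5, 5]], [0, 0, 1, 2]))
```

The fixture and expected value match the general case already used in the PR's own tests.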
doc/modules/clustering.rst
Outdated
.. topic:: References

* Davies, David L.; Bouldin, Donald W. (1979).
You are missing an indent in the references to properly render them:
https://21311-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#id27
Done! My RST skills are rusty, I only write Markdown these days ...
sklearn/metrics/__init__.py
Outdated
@@ -80,6 +81,7 @@
    'confusion_matrix',
    'consensus_score',
    'coverage_error',
    'davies_bouldin_index',
@jnothman do we have a convention of ending the name with score, since this is bounded between 0 and 1?
rng = np.random.RandomState(seed=0)

# Assert message when there is only one label
assert_raise_message(ValueError, "Number of labels is",
I think it is time to factorize this error test, which is the same across the different metrics.
Done!
"A Cluster Separation Measure". IEEE Transactions on
Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
"""
X, labels = check_X_y(X, labels)
I would factorize the part which is the same in all the different metrics.
I think that this is redundant and stands there as a kind of check/validation.
This seems a bit out of scope for this PR. Also, there are small differences in each metric that make the refactor non-trivial. I could take it up in a later PR if you do not mind.
rng.rand(10, 2), np.zeros(10))

# Assert message when all point are in different clusters
assert_raise_message(ValueError, "Number of labels is",
Same as above
Done!
rng.rand(10, 2), np.arange(10))

# Assert the value is 0. when all samples are equals
assert 0. == davies_bouldin_index(np.ones((10, 2)),
You should write:
assert computed == expected
With floats, use pytest.approx:
assert davies_bouldin_index(...) == pytest.approx(0.0)
Done!
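The assertion style the review converged on can be shown in isolation (a sketch with a hypothetical value standing in for a metric result; assumes pytest is available for `approx`):

```python
from pytest import approx

# Hypothetical float result, e.g. what a clustering metric might return
computed = 2 * (0.5 ** 0.5) / 3

# A bare assert is preferred over assert_equal/assert_almost_equal:
# pytest rewrites it to show both operands on failure, and approx
# takes care of floating-point tolerance.
assert computed == approx(0.4714045207910317)
```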
[0] * 5 + [1] * 5)

# Assert the value is 0. when all the mean cluster are equal
assert 0. == davies_bouldin_index([[-1, -1], [1, 1]] * 10,
Same as above
Done!
X = ([[0, 0], [1, 1]] * 5 + [[3, 3], [4, 4]] * 5 +
     [[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
Use pytest approx
Done! Also refactored other usages of assert_almost_equal to pytest.approx in this module, for consistency.
# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
pytest approx
Done!
Thanks for your PR, but could you please provide supporting evidence that the Davies-Bouldin index is bounded on the interval [0, 1]? The original paper by Davies and Bouldin
Reference:
@lzfelix Reading the references, it seems that there is nothing mentioning an upper bound equal to 1. My intuition for finding an upper bound of 1 would be the below configuration: But honestly, this is just a drawing and I would appreciate it if somebody could provide a formal proof, either way.
@glemaitre it seems that:

X, y = sklearn.datasets.make_blobs(n_samples=400, centers=1, center_box=(0, 1))
y_hat = sklearn.cluster.KMeans(n_clusters=2).fit_predict(X)
plt.scatter(*X.T, c=y_hat, alpha=0.4)
print('Davies-Bouldin index: {:4.4}'.format(davies_bouldin_index(X, y_hat)))
Actually this is not really an overlap but more of an ellipse-like shape. In this case, it will be higher than 1. However, I am not sure what the upper bound would be, formally.
@glemaitre you are right. Maybe the upper bound helps in deciding whether the generated clusters are well defined, in the opposite sense. Anyways, I just wanted to try to contribute to the docs of this PR to avoid later confusion :)
Thanks, this is a useful review. We don't like to publish erroneous docs :)
@lzfelix Thanks for pointing out that inconsistency! I am having a close look at the original reference, to see where this "upper bound is 1" idea is rooted ... I am not sure whether it is the documentation or the implementation that is wrong. By the way, if I understand correctly, DBI is "warning" (via the RuntimeWarning) that the clustering is artificial because it results in centroid distances that are 0 ... ?
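A small deterministic configuration also shows the index exceeding 1, without relying on random blobs (a sketch using `davies_bouldin_score`, the name the function was eventually released under in scikit-learn 0.20):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Two overlapping "clusters" whose centroids sit close together relative
# to the within-cluster spread: R_ij = (s_i + s_j) / d_ij grows without
# bound as the centroid distance d_ij shrinks.
X = np.array([[0, 0], [2, 2], [0, 2], [2, 1]], dtype=float)
labels = [0, 0, 1, 1]
score = davies_bouldin_score(X, labels)
print(score)  # well above 1
```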
(force-pushed from e52f7cf to 513e45f)
@glemaitre What is the correct way to create a "What's new" entry for this change?
@logc great! Maybe "index values closer to zero indicate a better clustering partition"? This way you make sure to convey the message that there is a lower bound only.
Or just say "Zero is the minimum score and a greater score is better.".
Actually, the smaller the better.
Of course, forgetting the context; just trying to simplify the message.
I went for a mix of both formulations 😄
(force-pushed from 3402a77 to 6712379)
@glemaitre I am sure this PR is on your list, but let me ask: is there something else that needs fixing here? thanks! 😄
Mainly, I would rename index to score to be similar to the other metrics. Other indices follow this terminology: https://22132-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#calinski-harabaz-index @jnothman Do you think it is right to do so?
Yes, it should be _score for consistency.
@jnothman @glemaitre renamed to
Add another unsupervised quality metric for clustering results, the Davies-Bouldin Index.
- Add Davies-Bouldin to clustering test_common - Fix flake8 issue - Use bare `assert` in test - Fix typo
- Fix incorrectly parsed documentation block - Fix references indentation - Refactor test assertions
(force-pushed from db3c8ae to 3dcf3cb)
Rebased on top of the current
The failing check does not seem related to these changes, but I don't know how to deal with the error:
# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
pytest.approx(davies_bouldin_index(X, labels), (5. / 4) / 3)
- -> 5
doc/modules/clustering.rst
Outdated
.. math::

   R_{ij} = \frac{s_i + s_j}{d_{ij}}

Then the DB index is defined as:
We should probably define DB -> Davies-Bouldin (DB) index
Changed to "Davies-Bouldin"
Zero is the lowest possible score. Values closer to zero indicate a better
partition.

In normal usage, the Davies-Bouldin index is applied to the results of a
DB index
I am sorry, I do not understand what is requested here (?)
I guess we can leave it explicitly as "Davies-Bouldin". "DB" might be confused with database, or DBSCAN.
doc/modules/clustering.rst
Outdated
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data

remove this line
doc/modules/clustering.rst
Outdated
~~~~~~~~~~

- The computation of Davies-Bouldin is simpler than that of Silhouette scores.

remove this line
doc/modules/clustering.rst
Outdated
DBSCAN. | ||
|
||
- The usage of centroid distance limits the distance metric to Euclidean space. | ||
|
remove this line
Weird failure. It should not be related.
Also, please add a what's new entry.
About the what's new entry, I added one in commit cd52612, for release 0.20 ...
Sorry, I missed the file.
I'll merge when it is (almost) green.
@glemaitre the checks passed this time 😄 can we merge then?
Thanks for the recall and the PR
Add another unsupervised quality metric for clustering results, the Davies-Bouldin Index.
Reference Issues/PRs
closes #7942
What does this implement/fix? Explain your changes.
This implements an unsupervised quality metric for clustering results, the Davies-Bouldin Index, based on an already existing PR that was stalled. The differences between this commit and the changes proposed there are minimal. In particular, the tests are copied verbatim, to ensure that this implementation does still conform to what was expected.
Any other comments?
I noticed while working on a toy problem that there are not many alternatives in sklearn for unsupervised metrics of clustering quality. In particular, I missed Davies-Bouldin from working with other ML packages (Weka, IIRC).

Looking through sklearn, I found the above mentioned PR, noticed the author @tomron does not seem to answer anymore, and decided to push for this change to get accepted by making a similar proposal. I fixed all remaining open comments.

If there is a better way to get the DB index into sklearn, please tell me. If there are other comments that can still be improved in this implementation, I will do my best to correct them, too.