
[MRG+1] Add Davies-Bouldin index #10827


Merged
merged 7 commits into scikit-learn:master on May 18, 2018

Conversation

logc
Contributor

@logc logc commented Mar 17, 2018

Add another unsupervised quality metric for clustering results, the Davies-Bouldin Index.

Reference Issues/PRs

closes #7942

What does this implement/fix? Explain your changes.

This implements an unsupervised quality metric for clustering results, the Davies-Bouldin Index, based on an already existing PR that had stalled. The differences between this commit and the changes proposed there are minimal. In particular, the tests are copied verbatim, to ensure that this implementation still conforms to what was expected.

Any other comments?

I noticed while working on a toy problem that there are not many alternatives in sklearn for unsupervised metrics of clustering quality. In particular, I missed Davies-Bouldin, which I knew from working with other ML packages (Weka, IIRC).

Looking through sklearn, I found the above-mentioned PR, noticed that its author @tomron no longer seems to respond, and decided to push for this change to get accepted by making a similar proposal. I fixed all remaining open comments.

If there is a better way to get the DB index into sklearn, please tell me. If there are other comments that can still be improved in this implementation, I will do my best to correct them, too.

Member

@jnothman jnothman left a comment

Thanks for this. Also, flake8 is failing

----------
.. [1] `Davies, David L.; Bouldin, Donald W. (1979).
"A Cluster Separation Measure". IEEE Transactions on
Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
Member

Please add an Examples section

Contributor Author

I'm sorry, I can see there are sections like this in other parts of the doc, but I don't know how to generate the example contents (?)

Member

It should just be a couple of lines showing how you would use this function in a simple case.
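Something like the following would do (a sketch only: it uses the PR-era name davies_bouldin_index, which was later renamed to davies_bouldin_score, with a hand-computed expected value):

>>> import numpy as np
>>> from sklearn.metrics import davies_bouldin_index
>>> X = np.array([[0, 1], [1, 1], [3, 4]])
>>> labels = [0, 0, 1]
>>> davies_bouldin_index(X, labels)  # doctest: +ELLIPSIS
0.128...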

Contributor Author

Isn't the doctest example in lines 1625-1637 doing that?

Member

Perhaps. I believe it belongs more here, in the API documentation, than in the narrative user guide

cluster analysis.

>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davis_bouldin_index
Member

Typo here

Contributor Author

Fixed

rng.rand(10, 2), np.arange(10))

# Assert the value is 0. when all samples are equals
assert_equal(0., davies_bouldin_index(np.ones((10, 2)),
Member

These days we would prefer a bare assert, rather than assert_equal
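Concretely, the requested change looks like this (a sketch of the style change only; davies_bouldin_score is the eventual public name, so the snippet runs against released scikit-learn):

import numpy as np
from numpy.testing import assert_equal
from sklearn.metrics import davies_bouldin_score

X = np.ones((10, 2))
labels = [0] * 5 + [1] * 5

# legacy helper style:
assert_equal(0., davies_bouldin_score(X, labels))

# preferred style, a bare assert:
assert davies_bouldin_score(X, labels) == 0.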

Contributor Author

Fixed

@jnothman
Member

Feel like wrapping up #8135 for us also so we don't need to add tests here?

@glemaitre
Member

#10828 is taking over #8135 and could almost be merged if we don't consider adding the two tests which I mentioned in the summary.

@jnothman
Member

@logc, I don't think you've pushed a second commit.

@jnothman
Member

Please merge in the latest master, where we have just added common tests for clustering metrics. Please add this to the common tests in sklearn/metrics/cluster/tests/test_common.py

@logc
Contributor Author

logc commented Mar 18, 2018

@jnothman sorry, I commented before pushing the second commit. The tests take really long to run locally (!)

@jnothman
Member

The tests take really long to run locally (!)

Do you mean the full test suite? The clustering metrics tests?

@logc
Contributor Author

logc commented Mar 18, 2018

@jnothman yes, the full test suite. Currently, I run it with make from the project root before creating a commit. If I only have changes to the test_unsupervised.py file, then I only run that. If there is a better strategy, please tell me.

@jnothman
Member

If I were you, I'd run pytest sklearn/metrics/cluster and let Travis do the rest.

@logc logc force-pushed the davies-bouldin-index branch from 881c3a6 to a53c91f Compare March 18, 2018 11:52
@logc
Contributor Author

logc commented Mar 18, 2018

@jnothman added davies_bouldin_index to test_common.

In order to pass test_format_invariance, I had to change the dtype of the centroids variable -- from matching the dtype of X to hardcoded float.

This was one of the open questions in the previous PR, and now I know the answer: if you do not make that assumption, the resulting index is not the same for an array of ints as for the same array cast to float.
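A minimal illustration of the mechanism (assuming, as in the earlier code, that the centroid buffer was allocated with the dtype of X):

import numpy as np

X = np.array([[0, 0], [1, 2], [2, 3]])

# A centroid buffer inheriting an integer dtype truncates the mean ...
centroids_int = np.zeros((1, 2), dtype=X.dtype)
centroids_int[0] = X.mean(axis=0)    # [1.0, 1.666...] is stored as [1, 1]

# ... while a hardcoded float dtype keeps it exact:
centroids_float = np.zeros((1, 2), dtype=float)
centroids_float[0] = X.mean(axis=0)

print(centroids_int)    # [[1 1]]
print(centroids_float)  # [[1.         1.66666667]]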

@logc
Contributor Author

logc commented Apr 5, 2018

@jnothman can I have another review here? 😄

Also, @tguillemot had comments on the previously existing PR; maybe I could have a review here, too?

Thanks!

@jnothman
Member

jnothman commented Apr 9, 2018 via email

Member

@jnothman jnothman left a comment

LGTM, thanks

----------
.. [1] `Davies, David L.; Bouldin, Donald W. (1979).
"A Cluster Separation Measure". IEEE Transactions on
Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
Member

Perhaps. I believe it belongs more here, in the API documentation, than in the narrative user guide

[[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
2*np.sqrt(0.5)/3)
Member

Please include spaces around * and /

Contributor Author

Done

X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
(5./4)/3)
Member

Spaces around /, please

Contributor Author

Done

@jnothman jnothman changed the title Add Davies-Bouldin index [MRG+1] Add Davies-Bouldin index Apr 9, 2018
@logc
Contributor Author

logc commented Apr 11, 2018

My ping to @tguillemot was not very successful 😄; maybe @glemaitre could help here with a second opinion? Thanks to any and all reviewers.

@glemaitre
Member

I put it on my review list :) You should receive my review soon

Member

@glemaitre glemaitre left a comment

Sorry for the delay.

A couple of comments; also, this is missing a what's new entry.

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}
>>> from sklearn import datasets
Member

Contributor Author

Done!


.. topic:: References

* Davies, David L.; Bouldin, Donald W. (1979).
Member

You are missing an indent in the references to properly render them:

https://21311-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#id27

Contributor Author

Done! My RST skills are rusty, I only write Markdown these days ...

@@ -80,6 +81,7 @@
'confusion_matrix',
'consensus_score',
'coverage_error',
'davies_bouldin_index',
Member

@jnothman do we have a convention of ending the name with score, given that this is bounded between 0 and 1?

rng = np.random.RandomState(seed=0)

# Assert message when there is only one label
assert_raise_message(ValueError, "Number of labels is",
Member

I think this is the time to factorize the error test, which is the same across the different metrics.
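For instance, a parametrized sketch (using the final public names from sklearn.metrics; the match string is the message asserted above):

import numpy as np
import pytest
from sklearn.metrics import davies_bouldin_score, silhouette_score

@pytest.mark.parametrize('metric', [silhouette_score, davies_bouldin_score])
def test_single_label_raises(metric):
    # every sample in one cluster: no between-cluster structure to score
    rng = np.random.RandomState(seed=0)
    with pytest.raises(ValueError, match='Number of labels is'):
        metric(rng.rand(10, 2), np.zeros(10))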

Contributor Author

Done!

"A Cluster Separation Measure". IEEE Transactions on
Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
"""
X, labels = check_X_y(X, labels)
Member

I would factorize the part which is the same across all the different metrics.
I think that this is redundant and stands there as a kind of check/validation.

Contributor Author

@logc logc Apr 22, 2018

This seems a bit out of scope for this PR. Also, there are small differences in each metric that make the refactor non-trivial. I could take it up in a later PR if you do not mind.

rng.rand(10, 2), np.zeros(10))

# Assert message when all point are in different clusters
assert_raise_message(ValueError, "Number of labels is",
Member

Same as above

Contributor Author

Done!

rng.rand(10, 2), np.arange(10))

# Assert the value is 0. when all samples are equals
assert 0. == davies_bouldin_index(np.ones((10, 2)),
Member

You should write assert computed == expected. With floats, use pytest.approx:

assert davies_bouldin_index(...) == pytest.approx(0.0)

Contributor Author

Done!

[0] * 5 + [1] * 5)

# Assert the value is 0. when all the mean cluster are equal
assert 0. == davies_bouldin_index([[-1, -1], [1, 1]] * 10,
Member

Same as above

Contributor Author

Done!

X = ([[0, 0], [1, 1]] * 5 + [[3, 3], [4, 4]] * 5 +
[[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
Member

Use pytest.approx

Contributor Author

Done! Also, refactored other usages of assert_almost_equal to pytest.approx in this module, for consistency.

# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
Member

pytest.approx

Contributor Author

Done!

@lzfelix

lzfelix commented Apr 22, 2018

Thanks for your PR, but could you please provide supporting evidence that the Davies-Bouldin index is bounded on the interval [0, 1]? The original paper by Davies and Bouldin [1] only explicitly states that this measure is greater than or equal to zero. On the other hand, the MATLAB documentation shows a scenario where the index is larger than one under K-means clustering:

NumObservations: 600
InspectedK: [1 2 3 4 5 6]
CriterionValues: [NaN 0.4663 0.4454 0.8316 1.0444 0.9236]
OptimalK: 3

Reference: [1] Davies, David L., and Donald W. Bouldin. "A cluster separation measure." IEEE transactions on pattern analysis and machine intelligence 2 (1979): 224-227.

@glemaitre
Member

@lzfelix Reading the references, it seems that there is nothing mentioning an upper bound equal to 1.
I am not sure exactly how to construct the formal proof.

My intuition for finding an upper bound of 1 would be the configuration below:

[figure: circle_cluster]

But honestly, this is just a drawing, and I would appreciate it if somebody could provide a formal proof, either way.

@jnothman @GaelVaroquaux @lesteve @agramfort
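For what it's worth, nothing in the definition itself caps the index at 1. Using the s_i / d_ij notation from the docstring (a sketch, not a formal proof):

R_{ij} = \frac{s_i + s_j}{d_{ij}}, \qquad
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}

As soon as two clusters overlap enough that s_i + s_j > d_{ij}, already R_{ij} > 1; and R_{ij} \to \infty as d_{ij} \to 0, so DB is unbounded above.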

@lzfelix

lzfelix commented Apr 22, 2018

@glemaitre it seems that DBI > 1 if there's some degree of overlap between two clusters, at least in the bidimensional case. It's harder to figure out what happens in higher-dimensional spaces without a better analysis. Still, it is possible to obtain DBI ~ 1.19 using the code below:

import matplotlib.pyplot as plt
import sklearn.cluster
import sklearn.datasets

X, y = sklearn.datasets.make_blobs(n_samples=400, centers=1, center_box=(0, 1))
y_hat = sklearn.cluster.KMeans(n_clusters=2).fit_predict(X)

plt.scatter(*X.T, c=y_hat, alpha=0.4)
# davies_bouldin_index is the function added by this PR
print('Davies-Bouldin index: {:4.4}'.format(davies_bouldin_index(X, y_hat)))

which produces:
[figure: clustering_output]

@glemaitre
Member

glemaitre commented Apr 22, 2018 via email

@lzfelix

lzfelix commented Apr 22, 2018

@glemaitre you are right. Maybe the upper bound helps in deciding whether the generated clusters are not well defined, in the opposite sense to DBI ~ 0 corresponding to a good partitioning of the data? Still, it's just speculation...

Anyway, I just wanted to try to contribute to the docs of this PR to avoid later confusion :)

@glemaitre
Member

glemaitre commented Apr 22, 2018 via email

@logc
Contributor Author

logc commented Apr 22, 2018

@lzfelix Thanks for pointing out that inconsistency! I am having a close look at the original reference, to see where this "upper bound is 1" idea is rooted ... I am not sure whether it is the documentation or the implementation that is wrong.

By the way, if I understand make_blobs correctly, in your example you are forcing the data to be scattered around a single center, and then fitting a KMeans model with n=2. One could argue that this is a fundamental shortcoming of the algorithm -- it always tries to partition the dataset into as many clusters as it is given.

In fact, DBI is "warning" (via the RuntimeWarning) that the clustering is artificial, because it results in centroid distances that are 0 ... ?

@logc logc force-pushed the davies-bouldin-index branch from e52f7cf to 513e45f Compare April 22, 2018 19:28
@logc
Contributor Author

logc commented Apr 22, 2018

@glemaitre What is the correct way to create a "What's new" entry for this change?

@lzfelix

lzfelix commented Apr 22, 2018

@logc great! Maybe "index values closer to zero indicate a better clustering partition"? This way you make sure to convey the message that there is only a lower bound.

@jnothman
Member

jnothman commented Apr 22, 2018 via email

@lzfelix

lzfelix commented Apr 22, 2018

Actually, the smaller the better.

@jnothman
Member

jnothman commented Apr 23, 2018 via email

@logc
Contributor Author

logc commented Apr 23, 2018

I went for a mix of both formulations 😄

Zero is the lowest possible score. Values closer to zero indicate a better partition.

@logc logc force-pushed the davies-bouldin-index branch from 3402a77 to 6712379 Compare April 23, 2018 06:29
@logc
Contributor Author

logc commented May 14, 2018

@glemaitre I am sure this PR is on your list, but let me ask: is there something else that needs fixing here? thanks! 😄

@glemaitre
Member

Mainly, I would rename index to score to be consistent with the other metrics. Other indices follow this terminology:
https://22132-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#calinski-harabaz-index

@jnothman Do you think it is right to do so?

@jnothman
Member

jnothman commented May 14, 2018 via email

@logc
Contributor Author

logc commented May 14, 2018

@jnothman @glemaitre renamed to score in the implementation. Following the example of the linked Calinski-Harabaz index, the doc still refers to it as an "index".
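With the rename, usage against the released API looks like this (davies_bouldin_score shipped with scikit-learn 0.20):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X = load_iris().data
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# zero is the lowest possible score; values closer to zero indicate a better partition
print(davies_bouldin_score(X, labels))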

logc added 6 commits May 14, 2018 20:01
Add another unsupervised quality metric for clustering results, the
Davies-Bouldin Index.
- Add Davies-Bouldin to clustering test_common
- Fix flake8 issue
- Use bare `assert` in test
- Fix typo
- Fix incorrectly parsed documentation block
- Fix references indentation
- Refactor test assertions
@logc logc force-pushed the davies-bouldin-index branch from db3c8ae to 3dcf3cb Compare May 14, 2018 18:16
@logc
Contributor Author

logc commented May 14, 2018

Rebased on top of the current master

@logc
Contributor Author

logc commented May 14, 2018

The failing check does not seem related to these changes, but I don't know how to deal with the error:

Complete output from command python setup.py egg_info:
This backport is meant only for Python 2.
It does not work on Python 3, and Python 3 users do not need it as the concurrent.futures package is available in the standard library.
For projects that work on both Python 2 and 3, the dependency needs to be conditional on the Python version, like so:
extras_require={':python_version == "2.7"': ['futures']}

----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\appveyor\AppData\Local\Temp\1\pip-install-dzi69ulw\futures\

futures is a transitive dependency pulled in from wheelhouse_uploader. Any ideas?

# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
pytest.approx(davies_bouldin_index(X, labels), (5. / 4) / 3)
Member

  1. -> 5

.. math::
R_{ij} = \frac{s_i + s_j}{d_{ij}}
Then the DB index is defined as:
Member

We should probably define it: DB -> Davies-Bouldin (DB) index

Contributor Author

Changed to "Davies-Bouldin"

Zero is the lowest possible score. Values closer to zero indicate a better
partition.

In normal usage, the Davies-Bouldin index is applied to the results of a
Member

DB index

Contributor Author

I am sorry, I do not understand what is requested here (?)

Contributor Author

I guess we can leave it explicitly as "Davies-Bouldin". "DB" might be confused with database, or DBSCAN.

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data

Member

remove this line

~~~~~~~~~~

- The computation of Davies-Bouldin is simpler than that of Silhouette scores.

Member

remove this line

DBSCAN.

- The usage of centroid distance limits the distance metric to Euclidean space.

Member

remove this line

@glemaitre
Member

Weird failure. It should not be related.

@glemaitre
Member

Also, please add a what's new entry.
Once that is done, I think we are good to merge!!!

@logc
Contributor Author

logc commented May 16, 2018

About the what's new entry, I added one in commit cd52612, for release 0.20 ...

@glemaitre
Member

About the what's new entry, I added one in commit cd52612, for release 0.20 ...

Sorry I missed the file.

@glemaitre
Member

I'll merge when it is (almost) green

@logc
Contributor Author

logc commented May 18, 2018

@glemaitre the checks passed this time 😄 can we merge then?

@glemaitre glemaitre merged commit 680c36b into scikit-learn:master May 18, 2018
@glemaitre
Member

glemaitre commented May 18, 2018

Thanks for the reminder and the PR

@logc logc deleted the davies-bouldin-index branch May 18, 2018 12:02