[MRG+1] Add Davies-Bouldin index #10827
Conversation
Thanks for this. Also, flake8 is failing
----------
.. [1] `Davies, David L.; Bouldin, Donald W. (1979).
   "A Cluster Separation Measure". IEEE Transactions on
   Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
Please add an Examples section
I'm sorry, I can see there are sections like this in other parts of the doc, but I don't know how to generate the example contents (?)
It should just be a couple of lines showing how you would use this function in a simple case.
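For illustration, such a couple of lines could look like the sketch below (it uses `davies_bouldin_score`, the name the function eventually shipped under in scikit-learn 0.20; the `KMeans` parameters are only an example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

# Cluster the iris data and score the resulting partition.
# Lower values indicate more compact, better-separated clusters.
X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = davies_bouldin_score(X, labels)
print(round(score, 3))
```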
Isn't the doctest example in lines 1625-1637 doing that?
Perhaps. I believe it belongs more here, in the API documentation, than in the narrative user guide
doc/modules/clustering.rst
Outdated
cluster analysis.

>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davis_bouldin_index
Typo here
Fixed
rng.rand(10, 2), np.arange(10))

# Assert the value is 0. when all samples are equals
assert_equal(0., davies_bouldin_index(np.ones((10, 2)),
These days we would prefer a bare assert, rather than assert_equal
Fixed
Feel like wrapping up #8135 for us also so we don't need to add tests here?
@logc, I don't think you've pushed a second commit.
Please merge in the latest master, where we have just added common tests for clustering metrics. Please add this to the common tests in sklearn/metrics/cluster/tests/test_common.py
@jnothman sorry, commented before pushing the second commit. The tests run really long, locally (!)
Do you mean the full test suite? The clustering metrics tests?
@jnothman yes, the full test suite. Currently, I am running it with
If I were you, I'd run
(force-pushed from 881c3a6 to a53c91f)
@jnothman added In order to pass This was one of the open questions in the previous PR, and now I know the answer: if you do not make that assumption, then the resulting index is not the same for an array of ints as for the same array cast as float.
@jnothman can I have another review here? 😄 Also, @tguillemot had comments on the previously existing PR; maybe I could have a review here, too? Thanks!
Is the difference between int and float X substantial?
LGTM, thanks
[[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
                    2*np.sqrt(0.5)/3)
Please include spaces around * and /
Done
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
                    (5./4)/3)
Spaces around /, please
Done
My ping to @tguillemot was not very successful 😄; maybe @glemaitre could help here by giving a second opinion? Thanks to any and all reviewers.
I put it in my list of revisions :) You should receive my review soon.
Sorry for the delay.
Couple of comments and this is missing a what's new entry.
doc/modules/clustering.rst
Outdated
DB = \frac{1}{k} \sum_{i=1}^k \max_{i \neq j} R_{ij}
>>> from sklearn import datasets
This is not recognized as a code block: https://21311-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html
Done!
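For reference, the index discussed above can be sketched as a standalone NumPy implementation (a hypothetical illustration, not the PR's code; it assumes Euclidean distances and at least two distinct clusters with distinct centroids):

```python
import numpy as np

def db_index(X, labels):
    """Davies-Bouldin index computed directly from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    k = len(clusters)
    # Per-cluster centroids and dispersions s_i (mean distance to centroid)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    s = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                  for i, c in enumerate(clusters)])
    total = 0.0
    for i in range(k):
        # R_ij = (s_i + s_j) / d_ij over all j != i; keep the worst case
        d = np.linalg.norm(centroids - centroids[i], axis=1)
        total += max((s[i] + s[j]) / d[j] for j in range(k) if j != i)
    return total / k

# The single-sample-clusters case from the PR's tests gives (5/4)/3
print(db_index([[0, 0], [2, 2], [3, 3], [5, 5]], [0, 0, 1, 2]))
```

The fixture and expected value match the general case already used in the PR's own tests.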
doc/modules/clustering.rst
Outdated
.. topic:: References

* Davies, David L.; Bouldin, Donald W. (1979).
You are missing an indent in the references to properly render them:
https://21311-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#id27
Done! My RST skills are rusty, I only write Markdown these days ...
sklearn/metrics/__init__.py
Outdated
@@ -80,6 +81,7 @@
    'confusion_matrix',
    'consensus_score',
    'coverage_error',
    'davies_bouldin_index',
@jnothman do we have a convention of ending the name with score, since this is bounded between 0 and 1?
rng = np.random.RandomState(seed=0)

# Assert message when there is only one label
assert_raise_message(ValueError, "Number of labels is",
I think it is time to factorize this error test, which is the same across the different metrics.
Done!
"A Cluster Separation Measure". IEEE Transactions on
Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
"""
X, labels = check_X_y(X, labels)
I would factorize the part which is the same in all the different metrics.
I think that this is redundant and stands there as a kind of check/validation.
This seems a bit out of scope for this PR. Also, there are small differences in each metric that make the refactor non-trivial. I could take it up in a later PR if you do not mind.
rng.rand(10, 2), np.zeros(10))

# Assert message when all point are in different clusters
assert_raise_message(ValueError, "Number of labels is",
Same as above
Done!
rng.rand(10, 2), np.arange(10))

# Assert the value is 0. when all samples are equals
assert 0. == davies_bouldin_index(np.ones((10, 2)),
You should write:
assert computed == expected
With floats, use pytest.approx:
assert davies_bouldin_index(...) == pytest.approx(0.0)
Done!
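The assertion style the review converged on can be shown in isolation (a sketch with a hypothetical value standing in for a metric result; assumes pytest is available for `approx`):

```python
from pytest import approx

# Hypothetical float result, e.g. what a clustering metric might return
computed = 2 * (0.5 ** 0.5) / 3

# A bare assert is preferred over assert_equal/assert_almost_equal:
# pytest rewrites it to show both operands on failure, and approx
# takes care of floating-point tolerance.
assert computed == approx(0.4714045207910317)
```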
[0] * 5 + [1] * 5)

# Assert the value is 0. when all the mean cluster are equal
assert 0. == davies_bouldin_index([[-1, -1], [1, 1]] * 10,
Same as above
Done!
X = ([[0, 0], [1, 1]] * 5 + [[3, 3], [4, 4]] * 5 +
     [[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
Use pytest approx
Done! Also refactored other usages of assert_almost_equal to pytest.approx in this module, for consistency.
# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
pytest approx
Done!
Thanks for your PR, but could you please provide supporting evidence that the Davies-Bouldin index is bounded on the interval [0, 1]? The original paper by Davies and Bouldin
Reference:
@lzfelix Reading the references, it seems that there is nothing mentioning an upper bound equal to 1. My intuition for finding an upper bound of 1 would be the below configuration: But honestly, this is just a drawing and I would appreciate it if somebody could provide a formal proof, either way.
@glemaitre it seems that:

X, y = sklearn.datasets.make_blobs(n_samples=400, centers=1, center_box=(0, 1))
y_hat = sklearn.cluster.KMeans(n_clusters=2).fit_predict(X)
plt.scatter(*X.T, c=y_hat, alpha=0.4)
print('Davies-Bouldin index: {:4.4}'.format(davies_bouldin_index(X, y_hat)))
Actually this is not really an overlap but more of an ellipse-like shape. In this case, it will be higher than 1. However, I am not sure what the upper bound would be, formally.
@glemaitre you are right. Maybe the upper bound helps in deciding whether the generated clusters are well defined, in the opposite sense. Anyways, I just wanted to try to contribute to the docs of this PR to avoid later confusion :)
Thanks, this is a useful review. We don't like to publish erroneous docs :)
@lzfelix Thanks for pointing out that inconsistency! I am having a close look at the original reference, to see where this "upper bound is 1" idea is rooted ... I am not sure whether it is the documentation or the implementation that is wrong. By the way, if I understand correctly, DBI is "warning" (via the RuntimeWarning) that the clustering is artificial because it results in centroid distances that are 0 ... ?
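A small deterministic configuration also shows the index exceeding 1, without relying on random blobs (a sketch using `davies_bouldin_score`, the name the function was eventually released under in scikit-learn 0.20):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Two overlapping "clusters" whose centroids sit close together relative
# to the within-cluster spread: R_ij = (s_i + s_j) / d_ij grows without
# bound as the centroid distance d_ij shrinks.
X = np.array([[0, 0], [2, 2], [0, 2], [2, 1]], dtype=float)
labels = [0, 0, 1, 1]
score = davies_bouldin_score(X, labels)
print(score)  # well above 1
```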
(force-pushed from e52f7cf to 513e45f)
@glemaitre What is the correct way to create a "What's new" entry for this change?
@logc great! Maybe "index values closer to zero indicate a better clustering partition"? This way you make sure to convey the message that there is a lower bound only.
Or just say "Zero is the minimum score and a greater score is better.".
Actually, the smaller the better.
Of course, forgetting the context; just trying to simplify the message.
I went for a mix of both formulations 😄
(force-pushed from 3402a77 to 6712379)
@glemaitre I am sure this PR is on your list, but let me ask: is there something else that needs fixing here? thanks! 😄
Mainly, I would rename index to score to be similar to the other metrics. Other indices follow this terminology: https://22132-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#calinski-harabaz-index @jnothman Do you think it is right to do so?
Yes, it should be _score for consistency.
@jnothman @glemaitre renamed to
Add another unsupervised quality metric for clustering results, the Davies-Bouldin Index.
- Add Davies-Bouldin to clustering test_common - Fix flake8 issue - Use bare `assert` in test - Fix typo
- Fix incorrectly parsed documentation block - Fix references indentation - Refactor test assertions
(force-pushed from db3c8ae to 3dcf3cb)
Rebased on top of the current
The failing check does not seem related to these changes, but I don't know how to deal with the error:
# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
pytest.approx(davies_bouldin_index(X, labels), (5. / 4) / 3)
- -> 5
doc/modules/clustering.rst
Outdated
.. math::

   R_{ij} = \frac{s_i + s_j}{d_{ij}}

Then the DB index is defined as:
We should probably define DB -> Davies-Bouldin (DB) index
Changed to "Davies-Bouldin"
Zero is the lowest possible score. Values closer to zero indicate a better
partition.

In normal usage, the Davies-Bouldin index is applied to the results of a
DB index
I am sorry, I do not understand what is requested here (?)
I guess we can leave it explicitly as "Davies-Bouldin". "DB" might be confused with database, or DBSCAN.
doc/modules/clustering.rst
Outdated
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data

remove this line
doc/modules/clustering.rst
Outdated
~~~~~~~~~~

- The computation of Davies-Bouldin is simpler than that of Silhouette scores.

remove this line
doc/modules/clustering.rst
Outdated
DBSCAN. | ||
|
||
- The usage of centroid distance limits the distance metric to Euclidean space. | ||
|
remove this line
Weird failure. It should not be related.
Also, please add a what's new entry.
About the what's new entry, I added one in commit cd52612, for release 0.20 ...
Sorry, I missed the file.
I'll merge when it is (almost) green.
@glemaitre the checks passed this time 😄 can we merge then?
Thanks for the recall and the PR
Add another unsupervised quality metric for clustering results, the Davies-Bouldin Index.
Reference Issues/PRs
closes #7942
What does this implement/fix? Explain your changes.
This implements an unsupervised quality metric for clustering results, the Davies-Bouldin Index, based on an already existing PR that was stalled. The differences between this commit and the changes proposed there are minimal. In particular, the tests are copied verbatim, to ensure that this implementation does still conform to what was expected.
Any other comments?
I noticed while working on a toy problem that there are not many alternatives in sklearn for unsupervised metrics of clustering quality. In particular, I missed Davies-Bouldin from working with other ML packages (Weka, IIRC).

Looking through sklearn, I found the above mentioned PR, noticed the author @tomron does not seem to answer anymore, and decided to push for this change to get accepted by making a similar proposal. I fixed all remaining open comments.

If there is a better way to get the DB index into sklearn, please tell me. If there are other comments that can still be improved in this implementation, I will do my best to correct them, too.