FEAT Introduce DBCV as new cluster metric #28244
Conversation
@OmarManzoor thank you so much for your review! I started off by directly applying/accepting some of your simpler suggestions regarding the minor renaming + docstrings.
Left some comments as well; looking forward to hearing your response.
Thanks for addressing the comments. Here are a few more.
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Minor comments, mostly related to adding some full stops for consistency. Also, I think we might need a follow-up PR to add this to the clustering section of model_evaluation.rst (Model evaluation docs) and also to the Clustering performance evaluation section of clustering.rst (Clustering evaluation).
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Applied your new suggestions. Regarding the follow-up PR: good point, I'm already keeping track of that at #29365. (I think it would be out of place in the Model evaluation docs, but that doc links to Clustering evaluation anyway, which is where I'll be adding it in the implementation of #29365. I'll start working on the actual text once this PR is merged, but I already have the LaTeX definitions prepared.)
LGTM. Thank you for adding this @luis261
Appreciate it! Thank you for your input @OmarManzoor
Making one last pass. I also want to run a few tests locally to sanity check with some plotting examples.
@luis261 Please sanity check me on this. In the plot below (clustering provided via HDBSCAN), switching to ground truth labels gives a score of -0.112. Using MR distances only does not change this; it just spits out wildly different scores, still including negative ones.
Generating code:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import dbcv_score
from sklearn.cluster import HDBSCAN
def plot(X, labels, probabilities=None, parameters=None, ground_truth=False, ax=None):
if ax is None:
_, ax = plt.subplots(figsize=(10, 4))
labels = labels if labels is not None else np.ones(X.shape[0])
probabilities = probabilities if probabilities is not None else np.ones(X.shape[0])
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
# The probability of a point belonging to its labeled cluster determines
# the size of its marker
proba_map = {idx: probabilities[idx] for idx in range(len(labels))}
for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_index = np.where(labels == k)[0]
for ci in class_index:
ax.plot(
X[ci, 0],
X[ci, 1],
"x" if k == -1 else "o",
markerfacecolor=tuple(col),
markeredgecolor="k",
markersize=4 if k == -1 else 1 + 5 * proba_map[ci],
)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
preamble = "True" if ground_truth else "Estimated"
title = f"{preamble} number of clusters: {n_clusters_}"
if parameters is not None:
parameters_str = ", ".join(f"{k}={v}" for k, v in parameters.items())
title += f" | {parameters_str}"
ax.set_title(title)
plt.tight_layout()
centers = [[1, 1], [-1, -1], [1.5, -1.5]]
X, labels_true = make_blobs(
n_samples=750, centers=centers, cluster_std=[0.4, 0.1, 0.75], random_state=0
)
plot(X, labels=labels_true, ground_truth=True)
print(dbcv_score(X, labels=labels_true))
hdb = HDBSCAN()
hdb.fit(X)
y = hdb.labels_
plot(X, y, hdb.probabilities_)
print(dbcv_score(X, labels=y))
plt.show()  # display both figures
First of all, I commend your ongoing diligence in sanity checking the implementation and corroborating the results of the systematic tests and benchmarks. I do have to say that I don't find the results that surprising, at least not the fact that in this case the ground truth scores worse than the estimation assigned by HDBSCAN, nor the fact that the scores don't come close to either of the bounds. The more "external" points of the underlying distributions on the right side lead to smaller density separation distances, thus decreasing the overall score. In the HDBSCAN clustering, the points that are further away from the centers of these distributions are noise points, which improves the density separation and thus the overall score. Overall, the clusters aren't separated that clearly, at least not the two clusters with the less tight distributions (other algorithms might combine them into a single cluster based on their weak delineation). For the HDBSCAN result (and even more so for the ground truth), the sparseness of the MST distances of those clusters on the right doesn't seem like it would easily be surpassed by their separation. However, I do agree that the scores might be closer to zero than expected.
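For reference, the per-cluster validity underlying this discussion, as defined in the original DBCV paper, is roughly the following (the notation here is mine and may not match the exact helper names in this PR):

$$V(C_i) = \frac{\min_{j \neq i} \mathrm{DSPC}(C_i, C_j) - \mathrm{DSC}(C_i)}{\max\bigl(\min_{j \neq i} \mathrm{DSPC}(C_i, C_j),\; \mathrm{DSC}(C_i)\bigr)}, \qquad \mathrm{DBCV} = \sum_i \frac{|C_i|}{n} \, V(C_i)$$

where DSPC(C_i, C_j) is the density separation between two clusters (minimum mutual reachability distance between their internal nodes) and DSC(C_i) is the density sparseness of a cluster (maximum edge weight of its internal MST). Whenever the separation term barely exceeds the sparseness term, as for the two loose blobs on the right, the per-cluster contribution hovers just above (or below) zero.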
This is not the case for the non-MRD version of the score, however, which I assume is what you meant to express with "wildly different scores". I replicated your experiment on a local machine with the verbose option, which emits the subcomponents of the overall score, broken down into the underlying separation and sparseness distances; this helps illustrate the underlying calculation results. For the default, MRD-based version, the verbose option also outputs the maximum ratio of the raw (in this case Euclidean) distance to the MRD (per MST). As for the discrepancies between the distance modes, I think they can generally be explained by the MRD inflation phenomenon again, which I do have to concede is a bit surprising here since the clusters are still somewhat spherical. I have another suspicion which I am currently investigating: it has to do with the edge selection, which could also be a contributing factor here, since the locations of points "at the edges of the cluster" are what determine the density separations, and the edge selection determines which of those points get filtered out as non-internal edges. While running your example code, I noticed that your vectorization for the maximum ratio calculations unfortunately seems to be faulty. It always produced
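To make the ratio I mean more concrete, here is a rough standalone sketch of the mutual reachability distance and the kind of raw-distance-vs-MRD inflation ratio involved. This is not the PR's code; the min_samples value, the helper name, and the way core distances are computed (via NearestNeighbors, with the query point included among its own neighbors) are just assumptions for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise_distances

def mutual_reachability(X, min_samples=5):
    # core distance of each point: distance to its min_samples-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    core = nn.kneighbors(X)[0][:, -1]
    dist = pairwise_distances(X)
    # MRD(a, b) = max(core(a), core(b), d(a, b))
    mrd = np.maximum(np.maximum(core[:, None], core[None, :]), dist)
    return mrd, dist

# maximum inflation of the raw Euclidean distances caused by switching to MRD
X = np.random.RandomState(0).normal(size=(100, 2))
mrd, dist = mutual_reachability(X)
off_diag = ~np.eye(len(dist), dtype=bool)
print((mrd[off_diag] / np.maximum(dist[off_diag], 1e-12)).max())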
Maybe try utilizing the verbose mode, if you haven't already. Here is the output of my local runs:
[verbose output of the local runs]
BTW, I think a helpful basis for formally explaining the MRD inflation and the spherical bias (as separate problems, despite their obvious entanglement) is the notion of concentration of measure and isoperimetry, e.g. for the Euclidean metric / Lebesgue measure.
I appreciate your explanation btw, I think it does alleviate the concern I had. Given the exact axiom it's trying to enforce, the ground-truth labels do indeed indicate a worse clustering, which is probably reasonable given the interpretation of outliers as noise, or at least low-confidence points.
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
I think the edge selection is still off, btw.
For theoretical details / high-level considerations, see #28243.
The version of DBCV I want to integrate is the one shipped with the validity module (which I contributed to in the past) of this version of HDBSCAN: github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/validity.py (the HDBSCAN implementation of that repo was recently integrated into scikit-learn).
Steps before merging:
- validate_params decorator
- check_X_y
- check_number_of_labels
- LabelEncoder
- list input for the labels param instead of assuming numpy.ndarray
- tests: /tests/test_public_functions.py, /metrics/cluster/tests/test_common.py, /metrics/cluster/tests/test_unsupervised.py
(a rough sketch of what these validation steps typically look like follows below)
Resolves #28243
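As a rough sketch of the validation steps from the list above, applied to a scikit-learn clustering metric (the exact dbcv_score signature and the private helper imports shown here are assumptions for illustration, not the final API):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import check_X_y
from sklearn.utils._param_validation import validate_params
from sklearn.metrics.cluster._unsupervised import check_number_of_labels

@validate_params(
    {"X": ["array-like"], "labels": ["array-like"]},
    prefer_skip_nested_validation=True,
)
def dbcv_score(X, labels):
    # coerce list inputs to arrays and run basic consistency checks
    X, labels = check_X_y(X, labels)
    # normalize arbitrary label values to 0..n_labels-1
    labels = LabelEncoder().fit_transform(labels)
    # require 1 < n_labels < n_samples
    check_number_of_labels(len(np.unique(labels)), X.shape[0])
    ...  # actual DBCV computation would go here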