
FEAT Introduce DBCV as new cluster metric #28244


Closed · wants to merge 170 commits

Conversation

@luis261 commented Jan 24, 2024

For theoretical details / high-level considerations, see #28243.

The version of DBCV I want to integrate is the one shipped with the validity module (which I have contributed to in the past) of this version of HDBSCAN. (The HDBSCAN implementation of that repo was recently integrated into scikit-learn.)

Steps before merging:

  • integrate code from HDBSCAN into the existing module structure:
    • add code from github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/validity.py
    • sort out imports
    • renaming
  • adopt validation constructs specific/internal to sklearn (see the sketch after this list):
    • utilize validate_params decorator
    • run input through check_X_y
    • check_number_of_labels
  • make the new code conform to the usual cluster metric "API" input assumptions/interface:
    • make validation calculations agnostic of label input type via usage of LabelEncoder
    • allow list input for labels param instead of assuming numpy.ndarray
  • add tests:
    • register new public DBCV function for param validation checking at /tests/test_public_functions.py
    • register new logic for generic, unsupervised cluster metric testing at /metrics/cluster/tests/test_common.py
    • write DBCV-specific test checking score output at /metrics/cluster/tests/test_unsupervised.py
  • documentation:
    • information about params and output in the docstring (already given by the HDBSCAN implementation)
    • add a release note
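
A rough sketch of how the validation constructs above might fit together — illustrative only, not the actual diff; the final signature and parameter constraints are still up for review, and check_number_of_labels is a private helper living in sklearn.metrics.cluster._unsupervised:

import numpy as np

from sklearn.metrics.cluster._unsupervised import check_number_of_labels
from sklearn.preprocessing import LabelEncoder
from sklearn.utils._param_validation import validate_params
from sklearn.utils.validation import check_X_y


@validate_params(
    {"X": ["array-like"], "labels": ["array-like"]},
    prefer_skip_nested_validation=True,
)
def dbcv_score(X, labels):
    # Accept lists as well as ndarrays and validate shapes/dtypes.
    X, labels = check_X_y(X, labels)
    # Make the calculation agnostic of the label input type/values.
    labels = LabelEncoder().fit_transform(labels)
    # Require at least 2 and at most n_samples - 1 clusters.
    check_number_of_labels(len(np.unique(labels)), X.shape[0])
    ...  # density-based validity computation, adapted from hdbscan's validity module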

Resolves #28243


github-actions bot commented Jan 24, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9ea7d00.

@luis261's earlier comment was marked as outdated.

@luis261 (Author) left a comment

@OmarManzoor thank you so much for your review! I started off by directly applying/accepting some of your simpler suggestions regarding the minor renaming + docstrings.

I left some comments as well; looking forward to hearing your response.

@OmarManzoor (Contributor) left a comment

Thanks for addressing the comments. Here are a few more.

luis261 and others added 2 commits August 9, 2024 08:58
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
@OmarManzoor (Contributor) left a comment

Minor comments, mostly related to adding some full stops for consistency. Also, I think we might need a follow-up PR to add this to the docs: the clustering section of model_evaluation.rst (Model evaluation docs) and the Clustering performance evaluation section of clustering.rst (Clustering evaluation).

luis261 and others added 2 commits August 9, 2024 13:38
@luis261 (Author) commented Aug 9, 2024

Applied your new suggestions. Regarding the follow-up PR: good point, I'm already keeping track of that at #29365.

(I think it would be out of place in the Model evaluation docs, but that doc links to Clustering evaluation anyway, which is where I'll be adding it in the implementation of #29365. I'll start working on the actual text once this PR is merged, but I already have the LaTeX definitions prepared.)
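
(For reference, and not necessarily the exact wording planned for the docs: DBCV as introduced by Moulavi et al. (2014) defines, with $DSC(C_i)$ the density sparseness of cluster $C_i$ and $DSPC(C_i, C_j)$ the density separation between clusters $C_i$ and $C_j$,

$$V_C(C_i) = \frac{\min_{j \neq i} DSPC(C_i, C_j) - DSC(C_i)}{\max\big(\min_{j \neq i} DSPC(C_i, C_j),\; DSC(C_i)\big)}, \qquad \mathrm{DBCV} = \sum_{i=1}^{l} \frac{|C_i|}{n}\, V_C(C_i),$$

so each cluster contributes a value in $[-1, 1]$, weighted by its relative size.)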

@OmarManzoor (Contributor) left a comment

LGTM. Thank you for adding this @luis261

@OmarManzoor removed the "Waiting for Second Reviewer" label Aug 9, 2024
@luis261 (Author) commented Aug 9, 2024

Appreciate it! Thank you for your input @OmarManzoor

@Micky774 (Contributor) commented

Making one last pass. I also want to run a few tests locally to sanity-check with some plotting examples.

@Micky774 (Contributor) commented Sep 3, 2024

@luis261 Please sanity check me on this.

In the plot below (clustering provided via HDBSCAN), the DBCV score is ~0.097.

[plot: HDBSCAN clustering of the blobs]

Switching to ground truth labels, the score is -0.112. Using MR distances only does not change this; it just spits out wildly different scores, still including negative scores.

[plot: ground-truth labels]

Generating code:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import dbcv_score
from sklearn.cluster import HDBSCAN


def plot(X, labels, probabilities=None, parameters=None, ground_truth=False, ax=None):
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 4))
    labels = labels if labels is not None else np.ones(X.shape[0])
    probabilities = probabilities if probabilities is not None else np.ones(X.shape[0])
    # Black removed and is used for noise instead.
    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
    # The probability of a point belonging to its labeled cluster determines
    # the size of its marker
    proba_map = {idx: probabilities[idx] for idx in range(len(labels))}
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]

        class_index = np.where(labels == k)[0]
        for ci in class_index:
            ax.plot(
                X[ci, 0],
                X[ci, 1],
                "x" if k == -1 else "o",
                markerfacecolor=tuple(col),
                markeredgecolor="k",
                markersize=4 if k == -1 else 1 + 5 * proba_map[ci],
            )
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    preamble = "True" if ground_truth else "Estimated"
    title = f"{preamble} number of clusters: {n_clusters_}"
    if parameters is not None:
        parameters_str = ", ".join(f"{k}={v}" for k, v in parameters.items())
        title += f" | {parameters_str}"
    ax.set_title(title)
    plt.tight_layout()


centers = [[1, 1], [-1, -1], [1.5, -1.5]]
X, labels_true = make_blobs(
    n_samples=750, centers=centers, cluster_std=[0.4, 0.1, 0.75], random_state=0
)
plot(X, labels=labels_true, ground_truth=True)
print(dbcv_score(X, labels=labels_true))


hdb = HDBSCAN()
hdb.fit(X)
y = hdb.labels_
plot(X, y, hdb.probabilities_)
print(dbcv_score(X, labels=y))
plt.show()

@luis261 (Author) commented Sep 6, 2024

@Micky774

First of all, I commend your ongoing diligence in sanity-checking the implementation and corroborating the results of the systematic tests and benchmarks.

I do have to say that I don't find the results that surprising, at least not the fact that in this case the ground truth scores worse than the estimate assigned by HDBSCAN, nor the fact that the scores don't come close to either of the bounds.

Of course, the more "external" points of the underlying distributions on the right side lead to smaller density separation distances, thus decreasing the overall score. In the HDBSCAN clustering, the points that are further away from the centers of these distributions are noise points, which improves the density separation, thus improving the overall score.

Overall, the clusters aren't separated that clearly, at least the two clusters with the less tight distributions (other algorithms might combine them into a single cluster based on their weak delineation). For the HDBSCAN result (and even more so for the ground truth), the sparseness of the MST distances of those clusters on the right doesn't seem like it'd be easily surpassed by their separation.



However, I do agree that the scores might be closer to zero than expected.

As an aside, this of course doesn't have to be interpreted as the clustering being "bad" from a practical POV: as long as the score surpasses other candidates for the given dataset, maybe the global optimum for this dataset is just low.

However, this is not the case for the non-MRD version of the score, which I assume is what you meant to express with "wildly different scores".

I replicated your experiment on a local machine with the verbose option. This results in the subcomponents of the overall score being emitted, broken down into the underlying separation and sparseness distances, which helps illustrate the underlying calculation. For the default, MRD-based version, the verbose option also outputs the maximum ratio of the raw (in this case Euclidean) distance to the MRD (per MST).

As for the discrepancies between the distance modes, I think they can generally be explained by the MRD inflation phenomenon again, which I do have to concede is a bit surprising here since the clusters are still somewhat spherical.
To be more specific, I measured 14- to 16-fold increases of the MRDs over the raw distances. This impairs the score in that configuration (as I have discovered and described on many occasions in the past).
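
For anyone following along, here is a minimal sketch of the mutual reachability distance and the kind of inflation ratio quoted above; for simplicity it uses the plain k-nearest-neighbor core distance, whereas the validity module (as far as I recall) uses an all-points variant of the core distance, so the exact numbers differ:

import numpy as np
from scipy.spatial.distance import cdist


def mutual_reachability(X, k=5):
    dist = cdist(X, X)                  # raw (Euclidean) pairwise distances
    core = np.sort(dist, axis=1)[:, k]  # distance to the k-th nearest neighbor
    # mrd(a, b) = max(core(a), core(b), d(a, b))
    return dist, np.maximum(dist, np.maximum(core[:, None], core[None, :]))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
dist, mrd = mutual_reachability(X)
mask = ~np.eye(len(X), dtype=bool)
# Maximum inflation of the MRDs relative to the raw distances.
print((mrd[mask] / dist[mask]).max())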



I have another suspicion which I am currently investigating. It has to do with the edge selection, which could also be a contributing factor here, since the locations of points "at the edges of the cluster" are what determine the density separations, and the edge selection determines which of those points get filtered out as non-internal edges.



While running your example code, I noticed that your vectorization of the maximum ratio calculations unfortunately seems to be faulty: it always produced None output. I ended up substituting my original implementation in its place. Could you please look into this issue? I do prefer your more elegant and performant implementation, as long as it gets fixed.

@luis261 (Author) commented Sep 6, 2024

Maybe try utilizing the verbose mode, if you haven't already.

Here is the output of my local runs:

  • default:
Max raw distance to coredistance ratio: 15.690077884292547
Max raw distance to coredistance ratio: 15.162453791655066
Max raw distance to coredistance ratio: 15.342044341686552
Minimum density separation: 0.585
Density sparseness: 0.918
Minimum density separation: 0.946
Density sparseness: 0.242
Minimum density separation: 0.585
Density sparseness: 2.070
-0.11221391414583601
Max raw distance to coredistance ratio: 15.22309823686122
Max raw distance to coredistance ratio: 15.436202177179643
Max raw distance to coredistance ratio: 14.228306760859649
Minimum density separation: 1.313
Density sparseness: 0.301
Minimum density separation: 0.743
Density sparseness: 0.928
Minimum density separation: 0.743
Density sparseness: 1.123
0.09726864192738956
  • no MRDs:
Minimum density separation: 0.048
Density sparseness: 0.345
Minimum density separation: 0.140
Density sparseness: 0.096
Minimum density separation: 0.048
Density sparseness: 0.817
-0.495895881471357
Minimum density separation: 1.313
Density sparseness: 0.140
Minimum density separation: 0.416
Density sparseness: 0.299
Minimum density separation: 0.416
Density sparseness: 0.297
0.4723529237264573
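
To connect these numbers back to the reported scores: per the DBCV definition, each cluster contributes (minimum density separation - density sparseness) / max(minimum density separation, density sparseness), weighted by its relative size. A quick sanity check with the ground-truth "no MRDs" values above (assuming the three equally sized blobs from the generating code):

import numpy as np

# (minimum density separation, density sparseness) per cluster,
# copied from the ground-truth "no MRDs" output above
components = [(0.048, 0.345), (0.140, 0.096), (0.048, 0.817)]
weights = np.full(3, 1 / 3)  # three blobs of 250 points each

validities = [(sep - sparse) / max(sep, sparse) for sep, sparse in components]
print(np.dot(weights, validities))  # ~ -0.496, matching -0.4959 up to rounding of the printed values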

@Micky774 (Contributor) commented Sep 6, 2024

BTW, I think a helpful basis for formally explaining the MRD inflation and the spherical bias (as separate problems, despite their obvious entanglement) is the notion of concentration of measure and isoperimetry for, e.g., the Euclidean metric / Lebesgue measure in $\mathbb{R}^n$. The use of a k-nearest-neighbors-based MRD would greatly alleviate this, as it would alter the effective metric. This comment is mainly an aside, and a note for myself for the future as we discuss that feature.

@Micky774 (Contributor) commented Sep 6, 2024

I appreciate your explanation, btw; I think it does alleviate the concern I had. Given the exact axiom it's trying to enforce, the ground-truth labels do indeed indicate a worse clustering, which is probably reasonable given the interpretation of the outliers as noise, or at least low-confidence points.

@luis261 (Author) commented Sep 16, 2024

I think the edge selection is still off btw.

@luis261 marked this pull request as draft September 16, 2024 13:00
@luis261 closed this by deleting the head repository Sep 16, 2024
@lorentzenchr (Member) commented Jan 26, 2025

@luis261 Why did you close? It seems it was quite close to being finished?
Wanna reopen? Especially given the green light in #27259.


Successfully merging this pull request may close these issues.

RFC: Introduce DBCV cluster validity metric
6 participants