FEAT Introduce DBCV as new cluster metric #28244
Conversation
@OmarManzoor thank you so much for your review! I started off by directly applying/accepting some of your simpler suggestions regarding the minor renaming + docstrings.
Left some comments as well; looking forward to hearing your response.
Thanks for addressing the comments. Here are a few more.
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Minor comments, mostly related to adding some full stops for consistency. Also, I think we might need a follow-up PR to add this to the clustering section of model_evaluation.rst (Model evaluation docs) and also to the Clustering performance evaluation section of clustering.rst (Clustering evaluation).
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Applied your new suggestions. Regarding the follow-up PR: good point, I'm already keeping track of that at #29365. (I think it would be out of place in the Model evaluation docs, but that doc links to Clustering evaluation anyway, which is where I'll be adding it in the implementation of #29365. I'll start working on the actual text once this PR is merged, but I already have the LaTeX definitions prepared.)
LGTM. Thank you for adding this @luis261
Appreciate it! Thank you for your input @OmarManzoor
Making one last pass. I also want to run a few tests locally to sanity check with some plotting examples.
@luis261 Please sanity check me on this. In the plot below (clustering provided via HDBSCAN), switching to ground truth labels gives a score of -0.112. Using MR distances only does not change this; it just spits out wildly different scores, still including negative ones.
Generating code:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import dbcv_score
from sklearn.cluster import HDBSCAN
def plot(X, labels, probabilities=None, parameters=None, ground_truth=False, ax=None):
if ax is None:
_, ax = plt.subplots(figsize=(10, 4))
labels = labels if labels is not None else np.ones(X.shape[0])
probabilities = probabilities if probabilities is not None else np.ones(X.shape[0])
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
# The probability of a point belonging to its labeled cluster determines
# the size of its marker
proba_map = {idx: probabilities[idx] for idx in range(len(labels))}
for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_index = np.where(labels == k)[0]
for ci in class_index:
ax.plot(
X[ci, 0],
X[ci, 1],
"x" if k == -1 else "o",
markerfacecolor=tuple(col),
markeredgecolor="k",
markersize=4 if k == -1 else 1 + 5 * proba_map[ci],
)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
preamble = "True" if ground_truth else "Estimated"
title = f"{preamble} number of clusters: {n_clusters_}"
if parameters is not None:
parameters_str = ", ".join(f"{k}={v}" for k, v in parameters.items())
title += f" | {parameters_str}"
ax.set_title(title)
plt.tight_layout()
centers = [[1, 1], [-1, -1], [1.5, -1.5]]
X, labels_true = make_blobs(
n_samples=750, centers=centers, cluster_std=[0.4, 0.1, 0.75], random_state=0
)
plot(X, labels=labels_true, ground_truth=True)
print(dbcv_score(X, labels=labels_true))
hdb = HDBSCAN()
hdb.fit(X)
y = hdb.labels_
plot(X, y, hdb.probabilities_)
print(dbcv_score(X, labels=y))
plt.show()  # display both figures
First of all, I commend your ongoing diligence in sanity checking the implementation and corroborating the results of the systematic tests and benchmarks. I do have to say that I don't find the results that surprising, at least not the fact that in this case the ground truth scores worse than the estimation assigned by HDBSCAN, nor the fact that the scores don't come close to either of the bounds. The more "external" points of the underlying distributions on the right side lead to smaller density separation distances, thus decreasing the overall score. In the HDBSCAN clustering, the points that are further away from the centers of these distributions are noise points, which improves the density separation and thus the overall score. Overall, the clusters aren't separated that clearly, at least not the two clusters with the less tight distributions (other algorithms might combine them into a single cluster based on their weak delineation). For the HDBSCAN result (and even more so for the ground truth), the sparseness of the MST distances of those clusters on the right doesn't seem like it would easily be surpassed by their separation. However, I do agree that the scores might be closer to zero than expected.
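For reference, the per-cluster validity underlying this discussion, as defined in the original DBCV paper, is roughly the following (the notation here is mine and may not match the exact helper names in this PR):

$$V(C_i) = \frac{\min_{j \neq i} \mathrm{DSPC}(C_i, C_j) - \mathrm{DSC}(C_i)}{\max\bigl(\min_{j \neq i} \mathrm{DSPC}(C_i, C_j),\; \mathrm{DSC}(C_i)\bigr)}, \qquad \mathrm{DBCV} = \sum_i \frac{|C_i|}{n} \, V(C_i)$$

where DSPC(C_i, C_j) is the density separation between two clusters (minimum mutual reachability distance between their internal nodes) and DSC(C_i) is the density sparseness of a cluster (maximum edge weight of its internal MST). Whenever the separation term barely exceeds the sparseness term, as for the two loose blobs on the right, the per-cluster contribution hovers just above (or below) zero.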
This is not the case for the non-MRD version of the score, however, which I assume is what you meant to express with "wildly different scores". I replicated your experiment on a local machine with the verbose option, which emits the subcomponents of the overall score, broken down into the underlying separation and sparseness distances; this helps illustrate the underlying calculation results. For the default, MRD-based version, the verbose option also outputs the maximum ratio of the raw (in this case Euclidean) distance to the MRD (per MST). As for the discrepancies between the distance modes, I think they can generally be explained by the MRD inflation phenomenon again, which I do have to concede is a bit surprising here since the clusters are still somewhat spherical. I have another suspicion which I am currently investigating: it has to do with the edge selection, which could also be a contributing factor here, since the locations of points "at the edges of the cluster" are what determine the density separations, and the edge selection determines which of those points get filtered out as non-internal edges. While running your example code, I noticed that your vectorization for the maximum ratio calculations unfortunately seems to be faulty. It always produced
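To make the ratio I mean more concrete, here is a rough standalone sketch of the mutual reachability distance and the kind of raw-distance-vs-MRD inflation ratio involved. This is not the PR's code; the min_samples value, the helper name, and the way core distances are computed (via NearestNeighbors, with the query point included among its own neighbors) are just assumptions for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise_distances

def mutual_reachability(X, min_samples=5):
    # core distance of each point: distance to its min_samples-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    core = nn.kneighbors(X)[0][:, -1]
    dist = pairwise_distances(X)
    # MRD(a, b) = max(core(a), core(b), d(a, b))
    mrd = np.maximum(np.maximum(core[:, None], core[None, :]), dist)
    return mrd, dist

# maximum inflation of the raw Euclidean distances caused by switching to MRD
X = np.random.RandomState(0).normal(size=(100, 2))
mrd, dist = mutual_reachability(X)
off_diag = ~np.eye(len(dist), dtype=bool)
print((mrd[off_diag] / np.maximum(dist[off_diag], 1e-12)).max())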
Maybe try utilizing the verbose mode, if you haven't already. Here is the output of my local runs:
[verbose output of the local runs]
BTW, I think a helpful basis for formally explaining the MRD inflation and the spherical bias (as separate problems, despite their obvious entanglement) is the notion of concentration of measure and isoperimetry, e.g. for the Euclidean metric / Lebesgue measure.
I appreciate your explanation btw, I think it does alleviate the concern I had. Given the exact axiom it's trying to enforce, the ground-truth labels do indeed indicate a worse clustering, which is probably reasonable given the interpretation of outliers as noise, or at least low-confidence points.
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
I think the edge selection is still off, btw.
For theoretical details / high-level considerations, see #28243.
The version of DBCV I want to integrate is the one shipped with the validity module (which I contributed to in the past) of this version of HDBSCAN: github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/validity.py (the HDBSCAN implementation of that repo was recently integrated into scikit-learn).
Steps before merging:
- validate_params decorator
- check_X_y
- check_number_of_labels
- LabelEncoder
- list input for the labels param instead of assuming numpy.ndarray
- tests: /tests/test_public_functions.py, /metrics/cluster/tests/test_common.py, /metrics/cluster/tests/test_unsupervised.py
(a rough sketch of what these validation steps typically look like follows below)
Resolves #28243
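As a rough sketch of the validation steps from the list above, applied to a scikit-learn clustering metric (the exact dbcv_score signature and the private helper imports shown here are assumptions for illustration, not the final API):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import check_X_y
from sklearn.utils._param_validation import validate_params
from sklearn.metrics.cluster._unsupervised import check_number_of_labels

@validate_params(
    {"X": ["array-like"], "labels": ["array-like"]},
    prefer_skip_nested_validation=True,
)
def dbcv_score(X, labels):
    # coerce list inputs to arrays and run basic consistency checks
    X, labels = check_X_y(X, labels)
    # normalize arbitrary label values to 0..n_labels-1
    labels = LabelEncoder().fit_transform(labels)
    # require 1 < n_labels < n_samples
    check_number_of_labels(len(np.unique(labels)), X.shape[0])
    ...  # actual DBCV computation would go here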