
Include HDBSCAN as a sub-module for sklearn.cluster #22616


Merged: 169 commits, Oct 12, 2022

Conversation

Contributor @Micky774 commented Feb 25, 2022

Reference Issues/PRs

Closes #14331

What does this implement/fix? Explain your changes.

Integrates the excellent work done at https://github.com/scikit-learn-contrib/hdbscan and includes it as a sub-module of sklearn.cluster.

For Reviewers:

The diff between this branch and the original HDBSCAN implementation (link kindly provided by @thomasjpfan): https://thomasjpfan.github.io/hdbscan_pr_diff/

Potential follow-up functionality

  • Add tests for correctness of various components of the estimator
  • Add experimental API content, _prediction.py (removed in 6f20a08)
  • Add extended "flat" (fixed number of clusters) API content, _flat.py (removed in 6f20a08)
  • Add plotting API content, plot.py (removed in 8aa297a and fe362b5)
  • Add robust_single_linkage to sklearn.cluster.AgglomerativeClustering
  • Reintroduce Boruvka algorithm (removed in b7736ef) (maintaining this PR for convenience later)
  • Add support for float32 fit data.
  • Add support for np.inf values when metric=='precomputed' and X is sparse.
  • Benchmark KD vs Ball Tree efficiency
  • Implement weighted argkmin backend for medoid calculation
  • Support np.nan in Cython implementation for sparse matrices
  • Investigate PWD backend for mst_from_* functions in _linkage.pyx

Any other comments?

This borrows inspiration from our OPTICS implementation in that it uses the NearestNeighbors estimator to compute core distances, instead of directly querying an underlying {KD, Ball}Tree for Prim's algorithm. In particular, this decreases maintenance overhead since the usage is very straightforward: as long as NearestNeighbors isn't failing any of its tests, we can be confident that this portion of the code is fine too (it literally just takes the k-th smallest distance returned by NearestNeighbors as the core distance). The rest of the OPTICS implementation is, from what I saw, pretty orthogonal to the HDBSCAN algorithm, so this was all I could directly repurpose. Open to ideas if there are any, though.
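The core-distance computation described above can be sketched as follows. This is a minimal illustration using the public NearestNeighbors API, not the PR's actual code; the choice of `min_samples` and the toy data are assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

min_samples = 5
# Core distance of each point = distance to its min_samples-th
# nearest neighbor. Querying the training set itself means each
# point counts as its own first neighbor (distance 0).
nbrs = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nbrs.kneighbors(X)
core_distances = distances[:, -1]  # k-th smallest distance per point
```

Because NearestNeighbors is exercised by its own test suite, the only HDBSCAN-specific logic here is slicing out the last column.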

To Do

  • Refactor MST format to a structure containing arrays
  • Include dbscan_clustering in plotting example.

- Added support for `n_features_in_`
- Improved validation and added support for `feature_names_in_`
- Renamed `kwargs` to `metric_params` and added safety check
  for an empty dict
- Removed attributes set in init and deferred to properties
- Raised error if tree query is performed with too few samples
- Cleaned up some list/dict comprehension logic
- Removed internal minkowski metric parameter validation in favor
  of `sklearn.metrics` built-in handling
- Removed default argument and presence of `p` in hdbscan functions
- Now users must pass `p` in through `metric_params`, consistent w/
  other metrics such as `wminkowski` and `mahalanobis`
- Removed vestigial estimator check -- now supported via common tests
- Fixed bug where `boruvka_kdtree` algorithm's accepted metrics were
  based on `BallTree` rather than `KDTree`
- Cleaned up lines with unused returns by indexing output of `hdbscan`
- Greatly expanded scope of algorithm/metric compatibility tests
- Streamlined some other tests
- Deleted commented-out tests
Comment on lines +1 to +9
from ..utils._typedefs cimport ITYPE_t

cdef class UnionFind:
    cdef ITYPE_t next_label
    cdef ITYPE_t[:] parent
    cdef ITYPE_t[:] size

    cdef void union(self, ITYPE_t m, ITYPE_t n)
    cdef ITYPE_t fast_find(self, ITYPE_t n)
Contributor Author @Micky774:
This change, along with that of _hierarchical_fast.pxd, just allows _linkage.pyx to reuse the UnionFind class they share in common, rather than duplicating code.
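For readers unfamiliar with the declarations above: they follow the standard disjoint-set pattern used when building a single-linkage tree, where every merge creates a brand-new cluster label. A hypothetical pure-Python sketch of the same operations (for illustration only; the real implementation is Cython) might look like:

```python
import numpy as np

class UnionFind:
    """Disjoint-set over 2 * n - 1 nodes: labels 0..n-1 are the
    original points, and each union() mints a fresh cluster label,
    as in single-linkage tree construction."""

    def __init__(self, n):
        self.parent = np.full(2 * n - 1, -1, dtype=np.intp)
        self.size = np.concatenate(
            [np.ones(n, dtype=np.intp), np.zeros(n - 1, dtype=np.intp)]
        )
        self.next_label = n  # first label assigned to a merged cluster

    def fast_find(self, n):
        # Locate the root, then compress the path for later queries.
        root = n
        while self.parent[root] != -1:
            root = self.parent[root]
        while self.parent[n] != -1:
            self.parent[n], n = root, self.parent[n]
        return root

    def union(self, m, n):
        # m and n are assumed to be roots; both join a fresh label.
        self.size[self.next_label] = self.size[m] + self.size[n]
        self.parent[m] = self.parent[n] = self.next_label
        self.next_label += 1

uf = UnionFind(4)
uf.union(0, 1)  # nodes 0 and 1 merge into new cluster label 4
```

Sharing one such class between _linkage.pyx and the hierarchical-clustering code avoids maintaining two copies of this logic.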

Member @jjerphan left a comment:

I think this is good for integration into the hdbscan feature branch.

I think we had better have follow-up PRs improve aspects of the implementation, as listed by @Micky774 in this PR's description.

Regarding subsequent work, I would create an issue to add the following tasks to the ones listed in the description:

  • Inspect if sklearn/cluster/_hdbscan/_linkage.pyx implementations can make use of PairwiseDistancesReductions
  • Inspect if sklearn/cluster/_hdbscan/_reachability.pyx implementations can make use of PairwiseDistancesReductions
  • Tidy/simplify implementations in sklearn/cluster/_hdbscan/_tree.pyx

Here are a few final remarks, suggestions, and nitpicks for this PR.

Once again, thanks @Micky774!

Micky774 and others added 5 commits September 26, 2022 15:38
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Member @glemaitre commented Sep 27, 2022

@Micky774 Could you ping me once you have addressed @jjerphan's comments, so that I can make a last review before the merge?

Contributor Author @Micky774 commented Sep 27, 2022

@glemaitre Should be ready for you now.

There are several stylistic considerations to address before merging into main, such as cleaning up Cython code, but let's reserve those for the follow-up PR(s). I think for this one we just need to limit attention to the public API. The only potentially controversial aspect right now is the use of negative labels to distinguish types of outliers.

Currently we do the following:

  1. Natively accept np.nan as a valid value
  2. Label points -1 if they are noisy outliers
  3. Label points -1 if they are infinite outliers, indicated by the presence of +/- np.inf
  4. Label points -2 if they are disconnected/missing, indicated by the presence of np.nan

While np.nan usually means missing data, @lmcinnes taught me that there is a meaningful use-case of treating np.nan as an indicator of manifold non-membership (e.g. from UMAP output). See quote from our emails:

Hi Meekail,

The motivation for it was the fact that UMAP added a feature to specify points that are completely disconnected as np.nan values. This is helpful as this integrates with all the plotting libraries people use. The catch was that people like to feed UMAP results directly in hdbscan, and that was resulting in errors which was breaking people's pipelines. From the point of view of what UMAP outputs the np.nan values are simply ultimate outliers, like np.inf really. I do see that data from other sources may present differently, and np.inf doesn't play as nicely with plotting libraries.

Given all of that my opinion on handling would be:

  • strictly disallowing nan values is bad, because it will break a lot of common workflows.
  • differentiating between nan and inf makes a lot of sense, but wasn't something we had given any thought to.
  • Having different classes of noise starts to make a lot of sense really, so I like your -1 and -2 label proposal.
  • Could we go even further and use -1 for regular noise, -2 for inf (because they are different) and -3 for nan? I'm not sure how scikit-learn maintainers will feel about that, but it makes a lot of sense to me actually.

Leland.

With this insight in mind, I'm in favor of:

  1. Natively accepting np.nan as a valid value (we already do)
  2. Label points -1 if they are noisy outliers
  3. Label points -2 if they are infinite outliers, indicated by the presence of +/- np.inf
  4. Label points -3 if they are disconnected/missing, indicated by the presence of np.nan

For users that care about the semantics distinguishing these, providing them as separate labels would make things smoother to feed into downstream processes. For those that do not care to distinguish them, a simple labels < 0 check suffices.
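From a user's perspective, the proposed scheme could be consumed like this (a hypothetical sketch; the labels array is made up to show all three outlier kinds):

```python
import numpy as np

# Hypothetical labels_ output under the proposed scheme:
# -1 = noise, -2 = infinite outlier, -3 = missing/disconnected.
labels = np.array([0, 0, 1, -1, 1, -2, 0, -3])

noise = labels == -1     # ordinary noise points
infinite = labels == -2  # points containing +/- np.inf
missing = labels == -3   # points containing np.nan

# Users who don't need the distinction lump all outliers together:
any_outlier = labels < 0
```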

Curious what everyone's opinions are.

CC: @thomasjpfan

References
----------

.. [1] Campello, R. J., Moulavi, D., & Sander, J. (2013, April).
Member:

I am unable to find open-access versions of all of these.

@Micky774 Here are the DOIs and public links to the reference papers:

Micky774 and others added 3 commits October 6, 2022 13:42
Member @jjerphan commented Oct 7, 2022

With this insight in mind, I'm in favor of:

  1. Natively accepting `np.nan` as a valid value (we already do)
  2. Label points `-1` if they are noisy outliers
  3. Label points `-2` if they are infinite outliers, indicated by the presence of `+/- np.inf`
  4. Label points `-3` if they are disconnected/missing, indicated by the presence of `np.nan`

For users that care about the semantics distinguishing these, providing them as separate labels would make things smoother to feed into downstream processes. For those that do not care to distinguish them, a simple labels < 0 check suffices.

I am OK with this as long as there are named constants encoding these labels and a comment motivating this choice.

For users that care about the semantics distinguishing these, providing them as separate labels would make things smoother to feed into downstream processes. For those that do not care to distinguish them, a simple labels < 0 check suffices.

I am OK with users being able to rely on this labelling as long as it is documented.
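A sketch of what such constants could look like (the names are assumptions for illustration, not the merged scikit-learn API):

```python
# Hypothetical module-level constants encoding the outlier labels.
# All values are negative so the simple `labels < 0` check still
# catches every kind of outlier; a comment like this one would
# motivate the scheme in the source.
NOISE = -1     # ordinary noise points
INFINITE = -2  # points containing +/- np.inf
MISSING = -3   # points containing np.nan (disconnected)

OUTLIER_LABELS = (NOISE, INFINITE, MISSING)
```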

@glemaitre glemaitre self-requested a review October 12, 2022 09:19
Member @glemaitre left a comment:

OK, let's iterate in other PRs from here.

@glemaitre glemaitre merged commit a76b39e into scikit-learn:hdbscan Oct 12, 2022
Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023
…#22616)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Successfully merging this pull request may close these issues.

add HDBSCAN
7 participants