
Include HDBSCAN as a sub-module for sklearn.cluster #22616


Merged: 169 commits, Oct 12, 2022

Conversation

Contributor @Micky774 commented Feb 25, 2022

Reference Issues/PRs

Closes #14331

What does this implement/fix? Explain your changes.

Integrates the excellent work done at https://github.com/scikit-learn-contrib/hdbscan and includes it as a sub-module of sklearn.cluster.

For Reviewers:

The diff between this branch and the original HDBSCAN implementation (link kindly provided by @thomasjpfan): https://thomasjpfan.github.io/hdbscan_pr_diff/

Potential follow-up functionality

  • Add tests for correctness of various components of the estimator
  • Add experimental API content, _prediction.py (removed in 6f20a08)
  • Add extended "flat" (fixed number of clusters) API content, _flat.py (removed in 6f20a08)
  • Add plotting API content, plot.py (removed in 8aa297a and fe362b5)
  • Add robust_single_linkage to sklearn.cluster.AgglomerativeClustering
  • Reintroduce Boruvka algorithm (removed in b7736ef) (maintaining this PR for convenience later)
  • Add support for float32 fit data.
  • Add support for np.inf values when metric=='precomputed' and X is sparse.
  • Benchmark KD vs Ball Tree efficiency
  • Implement weighted argkmin backend for medoid calculation
  • Support np.nan in Cython implementation for sparse matrices
  • Investigate PWD backend for mst_from_* functions in _linkage.pyx

Any other comments?

This borrows inspiration from our OPTICS implementation in that it uses the NearestNeighbors estimator to compute core distances, instead of directly querying an underlying {KD, Ball}Tree for Prim's algorithm. In particular, this decreases maintenance overhead since the usage is very straightforward: as long as NearestNeighbors isn't failing any of its tests, we can be confident that this portion of the code is fine too (it literally just takes the k-th smallest distance returned by NearestNeighbors as the core distance). The rest of the OPTICS implementation is, from what I saw, pretty orthogonal to the HDBSCAN algorithm, so this was all I could directly repurpose. Open to ideas if there are any, though.
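The core-distance computation described above can be sketched as follows. This is a minimal illustration using the public NearestNeighbors API, not the PR's actual code; the choice of `min_samples` and the toy data are assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

min_samples = 5
# Core distance of each point = distance to its min_samples-th
# nearest neighbor. Querying the training set itself means each
# point counts as its own first neighbor (distance 0).
nbrs = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nbrs.kneighbors(X)
core_distances = distances[:, -1]  # k-th smallest distance per point
```

Because NearestNeighbors is exercised by its own test suite, the only HDBSCAN-specific logic here is slicing out the last column.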

To Do

  • Refactor MST format to a structure containing arrays
  • Include dbscan_clustering in plotting example.

- Added support for `n_features_in_`
- Improved validation and added support for `feature_names_in_`
- Renamed `kwargs` to `metric_params` and added safety check
  for an empty dict
- Removed attributes set in init and deferred to properties
- Raised error if tree query is performed with too few samples
- Cleaned up some list/dict comprehension logic
- Removed internal minkowski metric parameter validation in favor
  of `sklearn.metrics` built-in handling
- Removed default argument and presence of `p` in hdbscan functions
- Now users must pass `p` in through `metric_params`, consistent w/
  other metrics such as `wminkowski` and `mahalanobis`
- Removed vestigial estimator check -- now supported via common tests
- Fixed bug where `boruvka_kdtree` algorithm's accepted metrics were
  based on `BallTree` rather than `KDTree`
- Cleaned up lines with unused returns by indexing output of `hdbscan`
- Greatly expanded scope of algorithm/metric compatibility tests
- Streamlined some other tests
- Deleted commented-out tests
Comment on lines +1 to +9
from ..utils._typedefs cimport ITYPE_t

cdef class UnionFind:
    cdef ITYPE_t next_label
    cdef ITYPE_t[:] parent
    cdef ITYPE_t[:] size

    cdef void union(self, ITYPE_t m, ITYPE_t n)
    cdef ITYPE_t fast_find(self, ITYPE_t n)
Contributor Author @Micky774:
This change, along with that of _hierarchical_fast.pxd, just allows _linkage.pyx to reuse the UnionFind class they share in common, rather than duplicating code.
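For readers unfamiliar with the declarations above: they follow the standard disjoint-set pattern used when building a single-linkage tree, where every merge creates a brand-new cluster label. A hypothetical pure-Python sketch of the same operations (for illustration only; the real implementation is Cython) might look like:

```python
import numpy as np

class UnionFind:
    """Disjoint-set over 2 * n - 1 nodes: labels 0..n-1 are the
    original points, and each union() mints a fresh cluster label,
    as in single-linkage tree construction."""

    def __init__(self, n):
        self.parent = np.full(2 * n - 1, -1, dtype=np.intp)
        self.size = np.concatenate(
            [np.ones(n, dtype=np.intp), np.zeros(n - 1, dtype=np.intp)]
        )
        self.next_label = n  # first label assigned to a merged cluster

    def fast_find(self, n):
        # Locate the root, then compress the path for later queries.
        root = n
        while self.parent[root] != -1:
            root = self.parent[root]
        while self.parent[n] != -1:
            self.parent[n], n = root, self.parent[n]
        return root

    def union(self, m, n):
        # m and n are assumed to be roots; both join a fresh label.
        self.size[self.next_label] = self.size[m] + self.size[n]
        self.parent[m] = self.parent[n] = self.next_label
        self.next_label += 1

uf = UnionFind(4)
uf.union(0, 1)  # nodes 0 and 1 merge into new cluster label 4
```

Sharing one such class between _linkage.pyx and the hierarchical-clustering code avoids maintaining two copies of this logic.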

Member @jjerphan left a comment:

I think this is good for integration into the hdbscan feature branch.

I think we had better have follow-up PRs improve aspects of the implementation, as listed by @Micky774 in this PR's description.

Regarding subsequent work, I would create an issue to add the following tasks to the ones listed in the description:

  • Inspect if sklearn/cluster/_hdbscan/_linkage.pyx implementations can make use of PairwiseDistancesReductions
  • Inspect if sklearn/cluster/_hdbscan/_reachability.pyx implementations can make use of PairwiseDistancesReductions
  • Tidy/simplify implementations in sklearn/cluster/_hdbscan/_tree.pyx

Here are a few final remarks, suggestions, and nitpicks for this PR.

Once again, thanks @Micky774!

Micky774 and others added 5 commits September 26, 2022 15:38
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Member @glemaitre commented Sep 27, 2022

@Micky774 Could you ping me once you have addressed @jjerphan's comments, so that I can make a last review before the merge?

Contributor Author @Micky774 commented Sep 27, 2022

@glemaitre Should be ready for you now.

There are several stylistic considerations to address before merging into main, such as cleaning up Cython code, but let's reserve those for the follow-up PR(s). I think for this one we just need to limit attention to the public API. The only potentially controversial aspect right now is the use of negative labels to distinguish types of outliers.

Currently we do the following:

  1. Natively accept np.nan as a valid value
  2. Label points -1 if they are noisy outliers
  3. Label points -1 if they are infinite outliers, indicated by the presence of +/- np.inf
  4. Label points -2 if they are disconnected/missing, indicated by the presence of np.nan

While np.nan usually means missing data, @lmcinnes taught me that there is a meaningful use-case of treating np.nan as an indicator of manifold non-membership (e.g. from UMAP output). See quote from our emails:

Hi Meekail,

The motivation for it was the fact that UMAP added a feature to specify points that are completely disconnected as np.nan values. This is helpful as this integrates with all the plotting libraries people use. The catch was that people like to feed UMAP results directly in hdbscan, and that was resulting in errors which was breaking people's pipelines. From the point of view of what UMAP outputs the np.nan values are simply ultimate outliers, like np.inf really. I do see that data from other sources may present differently, and np.inf doesn't play as nicely with plotting libraries.

Given all of that my opinion on handling would be:

  • strictly disallowing nan values is bad, because it will break a lot of common workflows.
  • differentiating between nan and inf makes a lot of sense, but wasn't something we had given any thought to.
  • Having different classes of noise starts to make a lot of sense really, so I like your -1 and -2 label proposal.
  • Could we go even further and use -1 for regular noise, -2 for inf (because they are different) and -3 for nan? I'm not sure how scikit-learn maintainers will feel about that, but it makes a lot of sense to me actually.

Leland.

With this insight in mind, I'm in favor of:

  1. Natively accepting np.nan as a valid value (we already do)
  2. Label points -1 if they are noisy outliers
  3. Label points -2 if they are infinite outliers, indicated by the presence of +/- np.inf
  4. Label points -3 if they are disconnected/missing, indicated by the presence of np.nan

For users that care about the semantics distinguishing these, providing them as separate labels would make things smoother to feed into downstream processes. For those that do not care to distinguish them, a simple labels < 0 check suffices.
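From a user's perspective, the proposed scheme could be consumed like this (a hypothetical sketch; the labels array is made up to show all three outlier kinds):

```python
import numpy as np

# Hypothetical labels_ output under the proposed scheme:
# -1 = noise, -2 = infinite outlier, -3 = missing/disconnected.
labels = np.array([0, 0, 1, -1, 1, -2, 0, -3])

noise = labels == -1     # ordinary noise points
infinite = labels == -2  # points containing +/- np.inf
missing = labels == -3   # points containing np.nan

# Users who don't need the distinction lump all outliers together:
any_outlier = labels < 0
```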

Curious what everyone's opinions are.

CC: @thomasjpfan

References
----------

.. [1] Campello, R. J., Moulavi, D., & Sander, J. (2013, April).
Member:

I am unable to find open-access versions of all of these.

@Micky774 Here are the DOIs and public links to the reference papers:

Micky774 and others added 3 commits October 6, 2022 13:42
Member @jjerphan commented Oct 7, 2022

With this insight in mind, I'm in favor of:

  1. Natively accepting `np.nan` as a valid value (we already do)
  2. Label points `-1` if they are noisy outliers
  3. Label points `-2` if they are infinite outliers, indicated by the presence of `+/- np.inf`
  4. Label points `-3` if they are disconnected/missing, indicated by the presence of `np.nan`

For users that care about the semantics distinguishing these, providing them as separate labels would make things smoother to feed into downstream processes. For those that do not care to distinguish them, a simple labels < 0 check suffices.

I am OK with this as long as there are named constants encoding these labels and a comment motivating this choice.

For users that care about the semantics distinguishing these, providing them as separate labels would make things smoother to feed into downstream processes. For those that do not care to distinguish them, a simple labels < 0 check suffices.

I am OK with users being able to rely on this labelling as long as it is documented.
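A sketch of what such constants could look like (the names are assumptions for illustration, not the merged scikit-learn API):

```python
# Hypothetical module-level constants encoding the outlier labels.
# All values are negative so the simple `labels < 0` check still
# catches every kind of outlier; a comment like this one would
# motivate the scheme in the source.
NOISE = -1     # ordinary noise points
INFINITE = -2  # points containing +/- np.inf
MISSING = -3   # points containing np.nan (disconnected)

OUTLIER_LABELS = (NOISE, INFINITE, MISSING)
```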

@glemaitre glemaitre self-requested a review October 12, 2022 09:19
Member @glemaitre left a comment:

OK, let's iterate in other PRs from here.

@glemaitre glemaitre merged commit a76b39e into scikit-learn:hdbscan Oct 12, 2022
Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023
…#22616)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Successfully merging this pull request may close these issues.

add HDBSCAN
7 participants