-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Include HDBSCAN
as a sub-module for sklearn.cluster
#22616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Added support for `n_features_in_` - Improved validation and added support for `feature_names_in_` - Renamed `kwargs` to `metric_params` and added safety check for an empty dict - Removed attributes set in init and deferred to properties - Raised error if tree query is performed with too few samples - Cleaned up some list/dict comprehension logic
…trics`" This reverts commit cd1edc4.
- Removed internal minkowski metric parameter validation in favor of `sklearn.metrics` built-in handling - Removed default argument and presence of `p` in hdbscan functions - Now users must pass `p` in through `metric_params`, consistent w/ other metrics such as `wminkowski` and `mahalanobis` - Removed vestigial estimator check -- now supported via common tests - Fixed bug where `boruvka_kdtree` algorithm's accepted metrics were based off of `BallTree` not `KDTree` - Cleaned up lines with unused returns by indexing output of `hdbscan` - Greatly expanded scope of algorithm/metric compatability tests - Streamlined some other tests - Delted commented out tests
from ..utils._typedefs cimport ITYPE_t | ||
|
||
cdef class UnionFind: | ||
cdef ITYPE_t next_label | ||
cdef ITYPE_t[:] parent | ||
cdef ITYPE_t[:] size | ||
|
||
cdef void union(self, ITYPE_t m, ITYPE_t n) | ||
cdef ITYPE_t fast_find(self, ITYPE_t n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change, along with that of _hierarchical_fast.pxd
are just to allow _linkage.pyx
to reuse the UnionFind
class that they share in common, rather than duplicating code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good for integrations in the hdbscan
feature branch.
I think we better have follow-up PRs improve aspects of the implementation, as listed by @Micky774 on this PR description.
Regarding subsequent work, I would create an issue add the following tasks to the ones listed in the description:
- Inspect if
sklearn/cluster/_hdbscan/_linkage.pyx
implementations can make use ofPairwiseDistancesReductions
- Inspect if
sklearn/cluster/_hdbscan/_reachability.pyx
implementations can make use ofPairwiseDistancesReductions
- Tidy/simplify implementations in
sklearn/cluster/_hdbscan/_tree.pyx
Here are a final remarks, suggestions and nitpicks for this PR.
Once again, thanks @Micky774!
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@glemaitre Should be ready for you now. There are several stylistic considerations that need to be made before merging into Currently we do the following:
While
With this insight in mind, I'm in favor of:
For users that care about the semantics distinguishing these, providing them as separate labels would make things smoother to feed into downstream processes. For those that do not care to distinguish them, a simple Curious what everyone's opinions are. CC: @thomasjpfan |
sklearn/cluster/_hdbscan/hdbscan.py
Outdated
References | ||
---------- | ||
|
||
.. [1] Campello, R. J., Moulavi, D., & Sander, J. (2013, April). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unable to find an open-access to all of these.
@Micky774 Here are the doi's and public links to the reference papers:
- doi:10.1007/978-3-642-37456-2_14
- doi:10.1007/978-3-642-37456-2_14
- https://papers.nips.cc/paper/2010/hash/b534ba68236ba543ae44b22bd110a1d6-Abstract.html
- https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
I am OK with this as long as there are constants encoding labels of this labeling and a comment motivating this choice.
I am OK with having users be able to use this labelling if it is documented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's iterate on other PR from here.
…#22616) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Body
Reference Issues/PRs
Closes #14331
What does this implement/fix? Explain your changes.
Integrates the excellent work done at https://github.com/scikit-learn-contrib/hdbscan and includes it as a sub-module of
sklearn.cluster
For Reviewers:
The diff between this branch and the original HDBSCAN implementation (link kindly provided by @thomasjpfan): https://thomasjpfan.github.io/hdbscan_pr_diff/
Potential follow-up functionality
_prediction.py
(removed in 6f20a08)_flat.py
(removed in 6f20a08)plot.py
(removed in 8aa297a and fe362b5)robust_single_linkage
tosklearn.cluster.AgglomerativeClustering
float32
fit data.np.inf
values whenmetric=='precomputed'
andX
is sparse.np.nan
in Cython implementation for sparse matricesmst_from_*
functions in_linkage.pyx
Any other comments?
This borrows inspiration from our
OPTICS
implementation in that it uses theNearestNeighbors
estimator to compute core distances instead of directly querying directly an underlying{KD, Ball}Tree
for theprims
algorithm. In particular this decreases maintainability overhead since its usage is very straight forwards, and so long asNearestNeighbors
isn't failing any of its tests, we can be confident this portion of the code is fine too (it literally just computes the k-th smallest distance viaNearestNeighbors
to calculatecore_distances
). The rest of theOPTICS
implementation is, from what I saw, pretty orthogonal to the rest of theHDBSCAN
algorithm, so this was all I could directly repurpose. Open to ideas if there are any though.To Do
dbscan_clustering
in plotting example.