ENH `_hdbscan` `HIERARCHY` data type introduction #25826

Micky774 · 2023-03-12T14:48:22Z

Reference Issues/PRs

Towards #24686

What does this implement/fix? Explain your changes.

Creates a new HIERARCHY dtype and a matching C type
Replaces the current 2D float64_t arrays with HIERARCHY arrays where appropriate
Minor code refactor reflecting the new dtype usage
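
For readers unfamiliar with structured dtypes, moving from a 2D float64 array to a record-like dtype can be sketched in NumPy roughly as follows. This is an illustration only: the field names (left_node, right_node, value, cluster_size), their types, and the helper to_hierarchy are assumptions and are not taken from this PR's diff.

```python
import numpy as np

# Hypothetical field layout for one row of a linkage hierarchy.
# The names and types are assumptions made for illustration.
HIERARCHY_dtype = np.dtype(
    [
        ("left_node", np.intp),
        ("right_node", np.intp),
        ("value", np.float64),
        ("cluster_size", np.intp),
    ]
)


def to_hierarchy(linkage_2d):
    """Convert an (n, 4) float64 linkage array into a structured array."""
    linkage_2d = np.asarray(linkage_2d, dtype=np.float64)
    out = np.empty(linkage_2d.shape[0], dtype=HIERARCHY_dtype)
    # Node ids and sizes become true integers instead of floats.
    out["left_node"] = linkage_2d[:, 0].astype(np.intp)
    out["right_node"] = linkage_2d[:, 1].astype(np.intp)
    out["value"] = linkage_2d[:, 2]
    out["cluster_size"] = linkage_2d[:, 3].astype(np.intp)
    return out


# Two merges stored the "old" way, as rows of a 2D float array.
old_style = np.array([[0.0, 1.0, 0.5, 2.0], [2.0, 3.0, 1.25, 3.0]])
hierarchy = to_hierarchy(old_style)
```

With such a layout, code can refer to hierarchy["value"] instead of a positional column such as old_style[:, 2], and integer fields no longer need to round-trip through floats.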

Any other comments?

This is a blocker for more sweeping algorithm refactors and simplifications that rely on the structure of the dtype.

jjerphan

Thanks @Micky774!

Here is a first pass. I guess converting cnp.ndarray (and const-qualifying) might be done in another PR, so as not to pollute the diff of this one.

sklearn/cluster/_hdbscan/_tree.pxd

sklearn/cluster/_hdbscan/_tree.pyx

sklearn/cluster/_hdbscan/_linkage.pyx

sklearn/cluster/_hdbscan/hdbscan.py

sklearn/cluster/_hdbscan/_linkage.pyx

sklearn/cluster/_hdbscan/_tree.pyx

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Micky774 · 2023-03-24T23:13:32Z

@thomasjpfan @jjerphan wondering what you two think of the PR now

jjerphan

Thank you for the heads-up, @Micky774.

sklearn/cluster/_hdbscan/_linkage.pyx

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

thomasjpfan · 2023-03-28T13:45:26Z

sklearn/cluster/_hdbscan/_tree.pyx

@@ -468,7 +518,7 @@ cdef get_probabilities(cnp.ndarray hierarchy, dict cluster_map, cnp.ndarray labe

        cluster = cluster_map[cluster_num]
        max_lambda = deaths[cluster]
-        if max_lambda == 0.0 or not np.isfinite(lambda_array[n]):
+        if max_lambda == 0.0 or isinf(lambda_array[n]):


Should this be?

Suggested change

- if max_lambda == 0.0 or isinf(lambda_array[n]):
+ if max_lambda == 0.0 or not isinf(lambda_array[n]):

Micky774

The original checks that it is not finite rather than not infinite, so I think the current isinf is correct -- unless I have it further mixed up somehow?

thomasjpfan

Ah, I'm thinking about:

Suggested change

- if max_lambda == 0.0 or isinf(lambda_array[n]):
+ if max_lambda == 0.0 or not isfinite(lambda_array[n]):

Because of NaN behavior:

not isfinite(NaN) == 1
isinf(NaN) == 0

Is lambda_array[n] never NaN by construction?

Micky774

Right, lambda_array stems from the computed minimum reachability distance, which is never expected to be NaN since we separate those points before processing.
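
The two candidate conditions in this thread differ only when the value is NaN. A minimal Python sketch using the standard math module shows this; the function names are hypothetical, and math.isinf / math.isfinite are stand-ins for the isinf and isfinite calls discussed above.

```python
import math


def skip_with_isinf(max_lambda, lam):
    # Condition as written in the PR diff.
    return max_lambda == 0.0 or math.isinf(lam)


def skip_with_not_isfinite(max_lambda, lam):
    # Condition equivalent to the original `not np.isfinite(...)` check.
    return max_lambda == 0.0 or not math.isfinite(lam)


# Compare both conditions on a finite value, an infinity, and NaN.
results = {
    name: (skip_with_isinf(2.0, lam), skip_with_not_isfinite(2.0, lam))
    for name, lam in [("finite", 1.5), ("inf", math.inf), ("nan", math.nan)]
}

for name, (a, b) in results.items():
    print(f"{name}: isinf-based={a}, not-isfinite-based={b}")
```

The two versions agree for finite and infinite inputs and disagree only for NaN, so they are interchangeable as long as lambda_array cannot contain NaN, which is the invariant stated in the reply above.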

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Updated hdbscan submodule with new HIERARCHY types

e8a56a2

github-actions bot added module:cluster cython labels Mar 12, 2023

Micky774 mentioned this pull request Mar 12, 2023

Path to HDBSCAN Inclusion #24686

Closed

13 tasks

Micky774 added the No Changelog Needed label Mar 12, 2023

Reverted change to decorator

8d53894

jjerphan reviewed Mar 13, 2023

thomasjpfan reviewed Mar 16, 2023

sklearn/cluster/_hdbscan/_linkage.pyx (outdated, resolved)

sklearn/cluster/_hdbscan/_tree.pyx (resolved)

Micky774 and others added 5 commits March 22, 2023 19:38

Refactor based on feedback

728f9dd

Created second dtype for clarification of tree semantics

f1c0d3e

Update sklearn/cluster/_hdbscan/_tree.pxd

5cac761

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Merge branch 'hdbscan' into hdbscan_tree_dtype

00b0530

Improved documentation for condensed dtype

83d9e48

jjerphan approved these changes Mar 27, 2023

sklearn/cluster/_hdbscan/_linkage.pyx (outdated, resolved)

sklearn/cluster/_hdbscan/_linkage.pyx (outdated, resolved)

Micky774 and others added 2 commits March 27, 2023 09:23

Update sklearn/cluster/_hdbscan/_linkage.pyx

5e05644

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Update sklearn/cluster/_hdbscan/_linkage.pyx

4fdf2be

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

thomasjpfan reviewed Mar 28, 2023

thomasjpfan approved these changes Mar 28, 2023

thomasjpfan merged commit a827589 into scikit-learn:hdbscan Mar 28, 2023

Micky774 deleted the hdbscan_tree_dtype branch March 28, 2023 14:53

Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023

ENH _hdbscan HIERARCHY data type introduction (scikit-learn#25826)

d013d38

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
