CLN Cleaned `cluster/_hdbscan/_linkage.pyx` #24857

Micky774 · 2022-11-08T00:51:39Z

Reference Issues/PRs

Towards #24686

What does this implement/fix? Explain your changes.

Cleans up and revises _hdbscan/_linkage.pyx.

Any other comments?

Micky774 · 2022-11-08T21:41:52Z

@thomasjpfan @glemaitre @jjerphan Wondering if any of you would be interested in reviewing this.

sklearn/cluster/_hdbscan/_linkage.pyx

jjerphan

Thanks for making it clearer, @Micky774!

Here is a first comprehensive review with questions, especially regarding the structure defined and used.

jjerphan · 2022-11-09T15:11:16Z

sklearn/cluster/_hdbscan/_linkage.pyx

+cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability(
+    cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability
 ):


We should only use cnp.ndarray when memoryviews can't be used.

Is the use of MST_edge_t preventing us from using (const-qualified) memoryviews here and in other functions and methods?

Generally, most cnp.ndarray seem replaceable by memoryviews here.

The current algorithm uses binary masks for indexing into subsets of cnp.ndarrays, simplifying the innermost scope of the algorithm. Hence I think it would be preferable to maintain the use of cnp.ndarray instead. In particular, it is necessary for:

scikit-learn/sklearn/cluster/_hdbscan/_linkage.pyx

Lines 48 to 52 in 39e7d7e

label_filter = current_labels != current_node

current_labels = current_labels[label_filter]

left = min_reachability[label_filter]

right = mutual_reachability[current_node][current_labels]

min_reachability = np.minimum(left, right)

I could attempt to change the algorithm to avoid using this, but it is a pretty elegant solution. Ultimately I don't know how costly the Python interactions here are, or whether they outweigh the usefulness of the algorithm as it is now.
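For readers following the thread, the masking approach quoted above can be sketched in plain NumPy. This is a minimal illustration of Prim-style selection with boolean masks, not the code in _linkage.pyx: the function name prim_join_order, its return format, and the choice of node 0 as the starting point are assumptions made for the example.

```python
import numpy as np


def prim_join_order(mutual_reachability):
    """Hypothetical helper: return (node, distance) pairs in the order nodes join the tree."""
    n_samples = mutual_reachability.shape[0]
    # Labels of the nodes that are not yet part of the tree.
    current_labels = np.arange(n_samples)
    # Best known distance from the tree to each remaining node.
    min_reachability = np.full(n_samples, np.inf)
    current_node = 0  # assumed starting node
    joined = []
    for _ in range(n_samples - 1):
        # Drop the node that was just added, using a boolean mask.
        label_filter = current_labels != current_node
        current_labels = current_labels[label_filter]
        left = min_reachability[label_filter]
        # Distances from the newly added node to every remaining node.
        right = mutual_reachability[current_node, current_labels]
        min_reachability = np.minimum(left, right)
        # The closest remaining node joins the tree next.
        best = int(np.argmin(min_reachability))
        current_node = int(current_labels[best])
        joined.append((current_node, float(min_reachability[best])))
    return joined
```

Each iteration builds new, shorter arrays through the mask, which keeps the inner loop short but goes through NumPy at the Python level; that is the trade-off being discussed here.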

There is some overhead in merely typing and referencing variables as cnp.ndarray instead of using memoryviews (because cnp.ndarray objects are PyObjects and memoryviews aren't, if I remember correctly).

We can leave this for now. If this turns out to be a hotspot, we can open a PR to rewrite this part. What do you think?

In this case, is it possible to add a comment to indicate that numpy arrays are used for binary masks and convenient indexing?

Sounds good, I will add such a comment.

Was this comment added somewhere?

Micky774 · 2022-11-09T23:08:30Z

sklearn/cluster/_hdbscan/hdbscan.py

@@ -125,7 +131,6 @@ def _hdbscan_brute(
        distance_matrix = pairwise_distances(
            X, metric=metric, n_jobs=n_jobs, **metric_params
        )
-    distance_matrix /= alpha


This is technically a bit out of scope for this PR since it lives in _hdbscan_brute; however, upon investigation I noticed the following about the alpha parameter:

We do not have any tests for the behavior of the alpha parameter.

It is unlikely to be used.

Its use is discouraged.

It currently has no effect in _hdbscan_brute (carryover bug?).

Hence I opted to remove it from _linkage.pyx. Since it was essentially only used in mst_from_data_matrix and the ineffective code above, I opted to remove it from HDBSCAN entirely. The above change does not affect any HDBSCAN behavior in a meaningful way.

Thanks for spotting this. Do you think it is possible to extract it in another PR? This way, we can discuss the problem independently from this PR which can be merged in the meantime.

I agree removing alpha should be in its own PR. This changes the public API and could become a blocker for including HDBSCAN.

glemaitre · 2022-11-10T08:39:08Z

I'll have a go soonish on the HDBSCAN PRs.

jjerphan

Just a comment which is somewhat-orthogonal to this PR's goal.

thomasjpfan · 2022-11-22T17:53:06Z

sklearn/cluster/_hdbscan/_linkage.pyx

+cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability(
+    cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability
 ):


Was this comment added somewhere?

thomasjpfan · 2022-11-22T18:52:23Z

sklearn/cluster/_hdbscan/hdbscan.py

@@ -125,7 +131,6 @@ def _hdbscan_brute(
        distance_matrix = pairwise_distances(
            X, metric=metric, n_jobs=n_jobs, **metric_params
        )
-    distance_matrix /= alpha


I agree removing alpha should be in its own PR. This changes the public API and could become a blocker for including HDBSCAN.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

jjerphan · 2022-11-30T07:55:11Z

sklearn/cluster/_hdbscan/hdbscan.py

@@ -131,6 +134,7 @@ def _hdbscan_brute(
        distance_matrix = pairwise_distances(
            X, metric=metric, n_jobs=n_jobs, **metric_params
        )
+    distance_matrix /= alpha


What if alpha=None in this case?

Suggested change

-    distance_matrix /= alpha
+    if alpha is not None:
+        distance_matrix /= alpha

I left that as it is on the hdbscan branch, since we'll be removing or revisiting alpha in the follow-up PR and this portion lives in _hdbscan_brute, which is a bit out of scope for this PR.
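The suggested guard can be illustrated in isolation. The helper below is hypothetical (scale_distances is not a function in hdbscan.py) and only shows the None check around the in-place division. It assumes a floating-point distance matrix, such as the one returned by pairwise_distances.

```python
import numpy as np


def scale_distances(distance_matrix, alpha=None):
    """Hypothetical helper: divide in place by alpha unless alpha is None."""
    if alpha is not None:
        # In-place division, mirroring `distance_matrix /= alpha` in the diff.
        distance_matrix /= alpha
    return distance_matrix
```

With alpha=None the matrix is returned untouched; without the check, the in-place division would be attempted with None as the divisor.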

jjerphan · 2022-11-30T07:55:56Z

sklearn/cluster/_hdbscan/_linkage.pyx

+        # Note: we utilize ndarray's over memory-views to make use of numpy
+        # binary indexing and sub-selection below.


Thank you for this comment, this answers one of @Vincent-Maladiere's questions. 👍

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

thomasjpfan

LGTM

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Micky774 added 2 commits November 5, 2022 15:52

Initial cleanup

cc31a2a

WIP partial implementation of custom struct for MST

4dcfe8e

Micky774 marked this pull request as draft November 8, 2022 00:51

github-actions bot added cython module:cluster labels Nov 8, 2022

Micky774 mentioned this pull request Nov 8, 2022

Path to HDBSCAN Inclusion #24686

Closed

13 tasks

Micky774 added 3 commits November 8, 2022 16:35

Refactor including new struct for simplification

7a07548

Added contiguous specification where applicable

9f4fbdf

Updated authorship

0182bc9

Micky774 marked this pull request as ready for review November 8, 2022 21:41

Micky774 added the No Changelog Needed label Nov 8, 2022

Micky774 commented Nov 8, 2022

jjerphan reviewed Nov 9, 2022

Micky774 added 2 commits November 9, 2022 15:32

Feedback from review

39e7d7e

Refactor and remove alpha

9c38bad

Micky774 commented Nov 9, 2022

Added documentation

e8ad933

jjerphan reviewed Nov 17, 2022

thomasjpfan reviewed Nov 22, 2022

Micky774 and others added 2 commits November 29, 2022 17:49

Apply suggestions from code review

50847ec

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Review feedback and revert alpha changes

e533f0c

jjerphan approved these changes Nov 30, 2022

Micky774 and others added 2 commits December 6, 2022 17:39

Update sklearn/cluster/_hdbscan/_linkage.pyx

f154164

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

Merge branch 'hdbscan' into hdbscan_cleanup_linkage

6db58bd

thomasjpfan approved these changes Dec 19, 2022

thomasjpfan merged commit c359ec1 into scikit-learn:hdbscan Dec 19, 2022

Micky774 deleted the hdbscan_cleanup_linkage branch January 3, 2023 16:00

Micky774 mentioned this pull request Feb 21, 2023

DOC Update _hdbscan/_linkage.pyx with new inline comments #25656

Closed

Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023

CLN Cleaned cluster/_hdbscan/_linkage.pyx (scikit-learn#24857)

488fb1f

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
