Skip to content

CLN Cleaned cluster/_hdbscan/_linkage.pyx #24857

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Dec 19, 2022

Conversation

Micky774
Copy link
Contributor

@Micky774 Micky774 commented Nov 8, 2022

Reference Issues/PRs

Towards #24686

What does this implement/fix? Explain your changes.

Cleans up and revises _hdbscan/_linkage.pyx.

Any other comments?

@Micky774 Micky774 marked this pull request as draft November 8, 2022 00:51
@Micky774 Micky774 mentioned this pull request Nov 8, 2022
13 tasks
@Micky774 Micky774 marked this pull request as ready for review November 8, 2022 21:41
@Micky774
Copy link
Contributor Author

Micky774 commented Nov 8, 2022

@thomasjpfan @glemaitre @jjerphan Wondering if any of you would be interested in reviewing this.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making it clearer, @Micky774!

Here is a first comprehensive review with questions, especially regarding the structure defined and used.

Comment on lines +33 to 35
cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability(
cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only use cnp.ndarray when memoryviews can't be used.

Is the usage of MST_edge_t preventing us to use (const-qualified) memoryview here and in other functions and methods?

Generally, most cnp.ndarray seem replaceable by memoryviews here.

Copy link
Contributor Author

@Micky774 Micky774 Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current algorithm uses binary masks for indexing into subsets of cnp.ndarrays, simplifying the innermost scope of the algorithm. Hence I think it would be preferable to maintain the use of cnp.ndarray instead. In particular, it is necessary for:

label_filter = current_labels != current_node
current_labels = current_labels[label_filter]
left = min_reachability[label_filter]
right = mutual_reachability[current_node][current_labels]
min_reachability = np.minimum(left, right)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could attempt to change the algorithm to avoid using this, but it is a pretty eloquent solution. Ultimately I don't know how costly the python interactions here are, and whether they outweigh the usefulness of the algorithm as it is now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some overhead of just typing and referencing variable with cnp.ndarray instead of using memoryviews (because cnp.ndarray are PyObjects and memoryviews aren't if I remember correctly).

We can leave this for now. If this shows to be a hotspot, we can have a PR to rewrite this part. What do you think?

In this case, is it possible to add a comment to indicate that numpy arrays are used for binary masks and convenient indexing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will add in such a comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this comment added somewhere?

@@ -125,7 +131,6 @@ def _hdbscan_brute(
distance_matrix = pairwise_distances(
X, metric=metric, n_jobs=n_jobs, **metric_params
)
distance_matrix /= alpha
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically a bit out of scope for this PR since it lives in _hdbscan_brute, however upon investigation I noticed:

  1. We do not have any tests for the behavior of the alpha parameter.
  2. It is unlikely to be used.
  3. It is unadvised to use.
  4. It currently has no effect in _hdbscan_brute (carryover bug?).

Hence I opted to remove it from _linkage.pyx. Since it was essentially only used in mst_from_data_matrix and the ineffective code above, I opted to remove it from HDBSCAN entirely. The above change does not affect any HDBSCAN behavior in a meaningful way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this. Do you think it is possible to extract it in another PR? This way, we can discuss the problem independently from this PR which can be merged in the meantime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree removing alpha should be in it's own PR. This changes the public API and could become a blocker for including HDBSCAN.

@glemaitre
Copy link
Member

I'll have a go soonish on the HDBSCAN PRs.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment which is somewhat-orthogonal to this PR's goal.

Comment on lines +33 to 35
cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability(
cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this comment added somewhere?

@@ -125,7 +131,6 @@ def _hdbscan_brute(
distance_matrix = pairwise_distances(
X, metric=metric, n_jobs=n_jobs, **metric_params
)
distance_matrix /= alpha
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree removing alpha should be in it's own PR. This changes the public API and could become a blocker for including HDBSCAN.

Micky774 and others added 2 commits November 29, 2022 17:49
@@ -131,6 +134,7 @@ def _hdbscan_brute(
distance_matrix = pairwise_distances(
X, metric=metric, n_jobs=n_jobs, **metric_params
)
distance_matrix /= alpha
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if alpha=None in this case?

Suggested change
distance_matrix /= alpha
if alpha is not None:
distance_matrix /= alpha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left that as it is on hdbscan since we'll be removing/revisiting alpha in the follow-up PR and this portion is in _hdbscan_brute and a bit out of scope for this PR.

Comment on lines +48 to +49
# Note: we utilize ndarray's over memory-views to make use of numpy
# binary indexing and sub-selection below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this comment, this answers one of @Vincent-Maladiere's questions. 👍

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan merged commit c359ec1 into scikit-learn:hdbscan Dec 19, 2022
@Micky774 Micky774 deleted the hdbscan_cleanup_linkage branch January 3, 2023 16:00
Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants