-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
CLN Cleaned cluster/_hdbscan/_linkage.pyx
#24857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN Cleaned cluster/_hdbscan/_linkage.pyx
#24857
Conversation
@thomasjpfan @glemaitre @jjerphan Wondering if any of you would be interested in reviewing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making it clearer, @Micky774!
Here is a first comprehensive review with questions, especially regarding the structure defined and used.
cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability( | ||
cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only use cnp.ndarray
when memoryviews can't be used.
Is the usage of MST_edge_t
preventing us to use (const
-qualified) memoryview here and in other functions and methods?
Generally, most cnp.ndarray
seem replaceable by memoryviews here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current algorithm uses binary masks for indexing into subsets of cnp.ndarray
s, simplifying the innermost scope of the algorithm. Hence I think it would be preferable to maintain the use of cnp.ndarray
instead. In particular, it is necessary for:
scikit-learn/sklearn/cluster/_hdbscan/_linkage.pyx
Lines 48 to 52 in 39e7d7e
label_filter = current_labels != current_node | |
current_labels = current_labels[label_filter] | |
left = min_reachability[label_filter] | |
right = mutual_reachability[current_node][current_labels] | |
min_reachability = np.minimum(left, right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could attempt to change the algorithm to avoid using this, but it is a pretty eloquent solution. Ultimately I don't know how costly the python interactions here are, and whether they outweigh the usefulness of the algorithm as it is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some overhead of just typing and referencing variable with cnp.ndarray
instead of using memoryviews (because cnp.ndarray
are PyObjects
and memoryviews aren't if I remember correctly).
We can leave this for now. If this shows to be a hotspot, we can have a PR to rewrite this part. What do you think?
In this case, is it possible to add a comment to indicate that numpy arrays are used for binary masks and convenient indexing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I will add in such a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this comment added somewhere?
sklearn/cluster/_hdbscan/hdbscan.py
Outdated
@@ -125,7 +131,6 @@ def _hdbscan_brute( | |||
distance_matrix = pairwise_distances( | |||
X, metric=metric, n_jobs=n_jobs, **metric_params | |||
) | |||
distance_matrix /= alpha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically a bit out of scope for this PR since it lives in _hdbscan_brute
, however upon investigation I noticed:
- We do not have any tests for the behavior of the
alpha
parameter. - It is unlikely to be used.
- It is unadvised to use.
- It currently has no effect in
_hdbscan_brute
(carryover bug?).
Hence I opted to remove it from _linkage.pyx
. Since it was essentially only used in mst_from_data_matrix
and the ineffective code above, I opted to remove it from HDBSCAN
entirely. The above change does not affect any HDBSCAN behavior in a meaningful way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting this. Do you think it is possible to extract it in another PR? This way, we can discuss the problem independently from this PR which can be merged in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree removing alpha
should be in it's own PR. This changes the public API and could become a blocker for including HDBSCAN.
I'll have a go soonish on the HDBSCAN PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a comment which is somewhat-orthogonal to this PR's goal.
cpdef cnp.ndarray[MST_edge_t, ndim=1, mode='c'] mst_from_mutual_reachability( | ||
cnp.ndarray[cnp.float64_t, ndim=2] mutual_reachability | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this comment added somewhere?
sklearn/cluster/_hdbscan/hdbscan.py
Outdated
@@ -125,7 +131,6 @@ def _hdbscan_brute( | |||
distance_matrix = pairwise_distances( | |||
X, metric=metric, n_jobs=n_jobs, **metric_params | |||
) | |||
distance_matrix /= alpha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree removing alpha
should be in it's own PR. This changes the public API and could become a blocker for including HDBSCAN.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
sklearn/cluster/_hdbscan/hdbscan.py
Outdated
@@ -131,6 +134,7 @@ def _hdbscan_brute( | |||
distance_matrix = pairwise_distances( | |||
X, metric=metric, n_jobs=n_jobs, **metric_params | |||
) | |||
distance_matrix /= alpha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if alpha=None
in this case?
distance_matrix /= alpha | |
if alpha is not None: | |
distance_matrix /= alpha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just left that as it is on hdbscan
since we'll be removing/revisiting alpha
in the follow-up PR and this portion is in _hdbscan_brute
and a bit out of scope for this PR.
# Note: we utilize ndarray's over memory-views to make use of numpy | ||
# binary indexing and sub-selection below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this comment, this answers one of @Vincent-Maladiere's questions. 👍
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Towards #24686
What does this implement/fix? Explain your changes.
Cleans up and revises
_hdbscan/_linkage.pyx
.Any other comments?