Skip to content

ENH _hdbscan HIERARCHY data type introduction #25826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Mar 28, 2023

Conversation

Micky774
Copy link
Contributor

@Micky774 Micky774 commented Mar 12, 2023

Reference Issues/PRs

Towards #24686

What does this implement/fix? Explain your changes.

  1. Creates new HIERARCHY dtype and c-type
  2. Replaces current 2D float64_t arrays with HIERARCHY where appropriate
  3. Minor code refactor reflecting new dtype usage

Any other comments?

This is a blocker for more sweeping algorithm refactors and simplifications which rely on the structure of the dtype

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Micky774!

Here is a first pass. I guess conversion of cnp.ndarray to (const-qualify) might be done in another PRs not to pollute the diff of this PR.

@Micky774
Copy link
Contributor Author

@thomasjpfan @jjerphan wondering what you two think of the PR now

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the heads-up, @Micky774.

Micky774 and others added 2 commits March 27, 2023 09:23
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@@ -468,7 +518,7 @@ cdef get_probabilities(cnp.ndarray hierarchy, dict cluster_map, cnp.ndarray labe

cluster = cluster_map[cluster_num]
max_lambda = deaths[cluster]
if max_lambda == 0.0 or not np.isfinite(lambda_array[n]):
if max_lambda == 0.0 or isinf(lambda_array[n]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be?

Suggested change
if max_lambda == 0.0 or isinf(lambda_array[n]):
if max_lambda == 0.0 or not isinf(lambda_array[n]):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original checks that it is not finite rather than not infinite, so I think the current isinf is correct -- unless I have it further mixed up somehow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I'm thinking about:

Suggested change
if max_lambda == 0.0 or isinf(lambda_array[n]):
if max_lambda == 0.0 or not isfinite(lambda_array[n]):

Because of nan behavior:

not isfinite(NaN) == 1
isinf(NaN) == 0

Is lambda_array[n] never nan by construction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, lambda_array stems from the computed minimum reachability distance, which is never expected to be nan since we separate those points before processing

@thomasjpfan thomasjpfan merged commit a827589 into scikit-learn:hdbscan Mar 28, 2023
@Micky774 Micky774 deleted the hdbscan_tree_dtype branch March 28, 2023 14:53
Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants