-
-
Notifications
You must be signed in to change notification settings - Fork 26k
FIX attribute error is BIRCH #23395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX attribute error is BIRCH #23395
Conversation
For better visibility: I managed to reduce the original problem to a simple test in #23269 (comment) |
+1 for using this as a non-regression test. |
Thanks for the simple reproducer ! |
I added a test and checked that it fails on main but not in this branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I assumed there will probably be a 1.1.2)
Is that volunteering for being a release manager :) ?
@@ -92,7 +92,7 @@ def _split_node(node, threshold, branching_factor): | |||
|
|||
node1_closer = node1_dist < node2_dist | |||
for idx, subcluster in enumerate(node.subclusters_): | |||
if node1_closer[idx]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment explaining the edge case? I remember trying to understand last week how this was related to duplicates and not really managing to ... full disclosure, I am not super familiar with BIRCH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the idx check directly close to where we define which subcluster is closest to which node so that I could add a clear comment on what's happening.
What happens is:
- you want to split a node. The node has several centroids.
- for that you compute all distances bewteen all centroids. The new nodes will be defined from the 2 centroids that are the most far apart.
- if it happens that the node only contains duplicates, then all centroids are the same point and all distances will be zero. Then the mask
node1_closer = node1_dist < node2_dist
considers that the centroid defining node1 is closer to the one defining node2 than to itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, great thanks a lot for the explanation and the additional comment, merging!
Fixes #17966
Fixes #23269
I would like to add a test but I can't manage to make a reproducible example simpler than the one from the original issue.
I found that it was due to the dataset having multiple duplicates which can at some point lead to all subclusters being the same point, leading to subcluster1 never being updated, even by its own starting centroid.