Skip to content

FIX attribute error is BIRCH #23395

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 25, 2022
Merged

Conversation

jeremiedbb
Copy link
Member

@jeremiedbb jeremiedbb commented May 17, 2022

Fixes #17966
Fixes #23269

I would like to add a test but I can't manage to make a reproducible example simpler than the one from the original issue.
I found that it was due to the dataset having multiple duplicates which can at some point lead to all subclusters being the same point, leading to subcluster1 never being updated, even by its own starting centroid.

@lesteve
Copy link
Member

lesteve commented May 19, 2022

For better visibility: I managed to reduce the original problem to a simple test in #23269 (comment)

@ogrisel
Copy link
Member

ogrisel commented May 24, 2022

For better visibility: I managed to reduce the original problem to a simple test in #23269 (comment)

+1 for using this as a non-regression test.

@jeremiedbb
Copy link
Member Author

For better visibility: I managed to reduce the original problem to a simple test

Thanks for the simple reproducer !

@jeremiedbb
Copy link
Member Author

jeremiedbb commented May 25, 2022

I added a test and checked that it fails on main but not in this branch.
I added a what's new entry targeting 1.1.2 (I assumed there will probably be a 1.1.2)

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I assumed there will probably be a 1.1.2)

Is that volunteering for being a release manager :) ?

@@ -92,7 +92,7 @@ def _split_node(node, threshold, branching_factor):

node1_closer = node1_dist < node2_dist
for idx, subcluster in enumerate(node.subclusters_):
if node1_closer[idx]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment explaining the edge case? I remember trying to understand last week how this was related to duplicates and not really managing to ... full disclosure, I am not super familiar with BIRCH.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the idx check directly close to where we define which subcluster is closest to which node so that I could add a clear comment on what's happening.

What happens is:

  • you want to split a node. The node has several centroids.
  • for that you compute all distances bewteen all centroids. The new nodes will be defined from the 2 centroids that are the most far apart.
  • if it happens that the node only contains duplicates, then all centroids are the same point and all distances will be zero. Then the mask node1_closer = node1_dist < node2_dist considers that the centroid defining node1 is closer to the one defining node2 than to itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, great thanks a lot for the explanation and the additional comment, merging!

@lesteve lesteve merged commit ea5a749 into scikit-learn:main May 25, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
glemaitre pushed a commit that referenced this pull request Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AttributeError in Birch for StandardScaled values Counterintuitive AttributeError in Birch for very large numbers
3 participants