FIX attribute error is BIRCH #23395

jeremiedbb · 2022-05-17T13:25:03Z

I would like to add a test, but I can't come up with a reproducible example simpler than the one from the original issue.
I found that the error comes from the dataset containing many duplicates, which can at some point lead to all subclusters of a node being the same point. In that case subcluster1 is never updated, not even by its own starting centroid.

lesteve · 2022-05-19T14:50:46Z

For better visibility: I managed to reduce the original problem to a simple test in #23269 (comment)

ogrisel · 2022-05-24T17:10:07Z

> For better visibility: I managed to reduce the original problem to a simple test in #23269 (comment)

+1 for using this as a non-regression test.

jeremiedbb · 2022-05-25T08:40:24Z

> For better visibility: I managed to reduce the original problem to a simple test

Thanks for the simple reproducer!

jeremiedbb · 2022-05-25T08:41:56Z

I added a test and checked that it fails on main but not in this branch.
I added a what's new entry targeting 1.1.2 (I assumed there will probably be a 1.1.2)

ogrisel

> (I assumed there will probably be a 1.1.2)

Are you volunteering to be the release manager? :)

lesteve · 2022-05-25T09:28:05Z

sklearn/cluster/_birch.py

@@ -92,7 +92,7 @@ def _split_node(node, threshold, branching_factor):

    node1_closer = node1_dist < node2_dist
    for idx, subcluster in enumerate(node.subclusters_):
-        if node1_closer[idx]:


Maybe add a comment explaining the edge case? I remember trying to understand last week how this was related to duplicates and not really managing to ... full disclosure, I am not super familiar with BIRCH.

jeremiedbb · 2022-05-25

I moved the idx check right next to where we define which subcluster is closest to which node, so that I could add a clear comment on what's happening.

What happens is:

- You want to split a node. The node has several centroids.
- For that, you compute the distances between all pairs of centroids. The new nodes are defined from the 2 centroids that are farthest apart.
- If the node only contains duplicates, then all centroids are the same point and all distances are zero. The mask node1_closer = node1_dist < node2_dist is then False everywhere, so it considers that the centroid defining node1 is closer to the one defining node2 than to itself.
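To make the edge case concrete, here is a minimal pure-Python sketch of the splitting logic described above. It is not the code from sklearn/cluster/_birch.py: the function name `split_assignment` and the `force_seed` flag are hypothetical, and plain lists stand in for the NumPy arrays used by the real implementation. The `force_seed` branch only illustrates the idea from the commit message ("subcluster is always closest to itself"); the exact form of the fix in the PR may differ.

```python
from math import dist


def split_assignment(centroids, force_seed=False):
    """Return (seed1, seed2, node1_closer) for a list of centroid tuples.

    Hypothetical helper for illustration only, not scikit-learn's API.
    """
    n = len(centroids)
    # Pick the pair of centroids that are farthest apart as the two seeds.
    # With ties (e.g. all distances zero), max() keeps the first pair, (0, 0).
    seed1, seed2 = max(
        ((i, j) for i in range(n) for j in range(n)),
        key=lambda pair: dist(centroids[pair[0]], centroids[pair[1]]),
    )
    node1_dist = [dist(c, centroids[seed1]) for c in centroids]
    node2_dist = [dist(c, centroids[seed2]) for c in centroids]
    # Strict comparison: a tie sends the subcluster to node2.
    node1_closer = [d1 < d2 for d1, d2 in zip(node1_dist, node2_dist)]
    if force_seed:
        # Assumed form of the fix: the seed of node1 always stays in node1.
        node1_closer[seed1] = True
    return seed1, seed2, node1_closer


duplicates = [(1.0, 2.0)] * 3

# All distances are zero, so the mask is False everywhere:
# node1 receives no subcluster at all.
_, _, broken_mask = split_assignment(duplicates)
print(broken_mask)  # [False, False, False]

# Forcing the seed into its own node keeps node1 non-empty.
_, _, fixed_mask = split_assignment(duplicates, force_seed=True)
print(fixed_mask)  # [True, False, False]
```

With distinct centroids the two variants agree, because the seed of node1 is at distance zero from itself and strictly farther from the other seed; only the all-duplicates case changes.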

lesteve · 2022-05-25

OK, great, thanks a lot for the explanation and the additional comment. Merging!

subcluster is always closest to itself

20faa28

github-actions bot added the module:cluster label May 17, 2022

jeremiedbb added 2 commits May 25, 2022 10:19

Merge remote-tracking branch 'upstream/main' into fix-birch-attr-error

3e6e40d

add test and whats new

7b5c521

ogrisel approved these changes May 25, 2022

lesteve reviewed May 25, 2022

clarify the fix

c5cb381

lesteve merged commit ea5a749 into scikit-learn:main May 25, 2022

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022

FIX attribute error is BIRCH (scikit-learn#23395)

95a8216

glemaitre pushed a commit that referenced this pull request Aug 5, 2022

FIX attribute error is BIRCH (#23395)

ecbe2d7
