AttributeError in Birch for StandardScaled values #23269

Description

@nlahaye

I have run into an issue that also shows up in a few older reports, but the currently open one is slightly different.

Using this data and running this code:

import dask.array as da
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

data = da.from_zarr("bug_data.zarr")
print("Data Shape: ", data.shape)
print("Min, Max, Mean, StDev.: ", data.min().compute(), data.max().compute(), data.mean().compute(), data.std().compute())
scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
print("Post-Scale - Min, Max, Mean, StDev.: ", data.min(), data.max(), data.mean(), data.std())
clustering = Birch(branching_factor=5, threshold=1e-5, n_clusters=None)
clustering.fit(data)

I run into this error:

Data Shape:  (150000, 2000)
Min, Max, Mean, StDev.:  -1.7028557 5.1015463 0.020574544 0.32617828
/home/nlahaye/.local/lib/python3.8/site-packages/dask/array/core.py:1650: FutureWarning: The `numpy.may_share_memory` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  warnings.warn(
Post-Scale - Min, Max, Mean, StDev.:  -5.8093686 7.8372993 4.7429404e-11 1.0000027
Traceback (most recent call last):
  File "clustering_bug.py", line 25, in <module>
    clustering.fit(data)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 517, in fit
    return self._fit(X, partial=False)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 562, in _fit
    split = self.root_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  [Previous line repeated 3 more times]
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 221, in insert_cf_subcluster
    self.update_split_subclusters(
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 179, in update_split_subclusters
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'

For simplicity, I extracted this code from the clustering software I use and stripped away its dask-ml wrappers; the same code has completed successfully on other datasets. The data attached here is also a reduced subset of a dataset with many more samples.
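In case it helps narrow things down: the same Birch parameters run cleanly for me on a plain float64 NumPy array. The sketch below uses small random data as a hypothetical stand-in (the real `bug_data.zarr` is not reproduced here), and notes where one could materialize the dask array with `.compute()` before `fit` so that sklearn sees a NumPy array rather than a dask array. I have not verified that this avoids the `AttributeError` on the actual data.

```python
import numpy as np
from sklearn.cluster import Birch

# Hypothetical stand-in for bug_data.zarr (illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))  # float64 by default

# If starting from a dask array, it could be materialized first, e.g.:
#   X = data.compute().astype(np.float64)
# so Birch operates on an in-memory NumPy array.

clustering = Birch(branching_factor=5, threshold=1e-5, n_clusters=None)
clustering.fit(X)
print(clustering.subcluster_centers_.shape)
```

On this random data the fit completes and `subcluster_centers_` is populated, which is why I suspect something specific to the scaled zarr data (or the float32 dtype) is involved.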

Environment:
OS - CentOS-7
python - v3.8.2
dask - v2022.04.1
sklearn - v1.0.2

Please let me know if there is any other info you would like, etc.

Thanks!
Nick

Originally posted by @nlahaye in #17966 (comment)
