I have run into an issue that shows up in a few older reports as well, but the current one is a bit different.
Using this data, and running this code:
```python
import dask.array as da
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

data = da.from_zarr("bug_data.zarr")
print("Data Shape: ", data.shape)
print("Min, Max, Mean, StDev.: ", data.min().compute(), data.max().compute(), data.mean().compute(), data.std().compute())

scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
print("Post-Scale - Min, Max, Mean, StDev.: ", data.min(), data.max(), data.mean(), data.std())

clustering = Birch(branching_factor=5, threshold=1e-5, n_clusters=None)
clustering.fit(data)
```
I run into this error:
```
Data Shape:  (150000, 2000)
Min, Max, Mean, StDev.:  -1.7028557 5.1015463 0.020574544 0.32617828
/home/nlahaye/.local/lib/python3.8/site-packages/dask/array/core.py:1650: FutureWarning: The `numpy.may_share_memory` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  warnings.warn(
Post-Scale - Min, Max, Mean, StDev.:  -5.8093686 7.8372993 4.7429404e-11 1.0000027
Traceback (most recent call last):
  File "clustering_bug.py", line 25, in <module>
    clustering.fit(data)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 517, in fit
    return self._fit(X, partial=False)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 562, in _fit
    split = self.root_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  [Previous line repeated 3 more times]
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 221, in insert_cf_subcluster
    self.update_split_subclusters(
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 179, in update_split_subclusters
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'
```
For simplicity, I extracted this code and stripped away the dask-ml wrappers from the software I use for clustering; with other datasets I have been able to complete jobs successfully. The data here is also a reduced subset of a dataset with many more samples.
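For anyone without the Zarr file, here is a fully self-contained variant of the reproduction above that swaps the real data for small synthetic NumPy data (the array size, seed, and distribution are my own placeholders, not the original data). Since the failure looks data-dependent, this sketch may well run cleanly rather than trigger the `AttributeError`, but it shows the exact pipeline and parameters being used:

```python
# Self-contained sketch of the same pipeline, with synthetic data standing in
# for bug_data.zarr. NOTE: the shape, seed, and normal distribution are
# assumptions for illustration; the real data is (150000, 2000) from Zarr.
import numpy as np
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)).astype(np.float32)  # much smaller than the real data

# Same preprocessing as the report: standardize to zero mean, unit variance.
X = StandardScaler().fit_transform(X)

# Same Birch parameters as the report: a tiny threshold forces many splits,
# which is the code path where the traceback ends up.
clustering = Birch(branching_factor=5, threshold=1e-5, n_clusters=None)
clustering.fit(X)  # on the real data this is where the AttributeError is raised
```

On random data like this the fit usually completes, so reproducing the crash likely still requires the attached Zarr data (or an affected scikit-learn version).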
Environment:
- OS: CentOS 7
- Python: 3.8.2
- dask: 2022.04.1
- scikit-learn: 1.0.2
Please let me know if there is any other info you would like, etc.
Thanks!
Nick
Originally posted by @nlahaye in #17966 (comment)