Description
Describe the bug
Input data containing very large numbers causes overflows in the Birch algorithm, which manifest as different errors depending on the branching_factor parameter. If the number of data points is smaller than or equal to the branching factor, a ValueError is raised in AgglomerativeClustering; if it exceeds the branching factor, an AttributeError is raised instead. Since both errors are caused by the input data, I would expect a ValueError in both cases.
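The overflow itself can be illustrated without scikit-learn. The following is a minimal sketch using only the standard library, relying on Python floats being IEEE 754 doubles (the same representation as float64). The two rows are taken from the reproduction data in this report. That Birch accumulates linear sums and squared norms is inferred from the `linear_sum_` and `squared_norm_` names in the warnings, not from a reading of the implementation.

```python
import math
import sys

# Two rows from the reproduction data: every entry is finite and
# below the float64 maximum, so a plain finiteness check passes.
row_a = [1.54166067e+308, 1.75812744e+308]
row_b = [1.36302835e+308, 1.07968131e+308]

float_max = sys.float_info.max  # about 1.797e308
all_finite = all(
    math.isfinite(v) and abs(v) <= float_max for v in row_a + row_b
)

# Adding two rows element-wise (as a linear sum would) overflows to inf.
linear_sum = [a + b for a, b in zip(row_a, row_b)]

# Squaring the entries (as a squared norm would) also overflows to inf.
squared_norm = sum(v * v for v in row_a)

# inf - inf is NaN, which is one way "invalid value" warnings can arise.
difference = squared_norm - squared_norm

print(all_finite, linear_sum, squared_norm, difference)
```

Each value passes input validation on its own, yet the first arithmetic step on the values already leaves the representable range.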
Steps/Code to Reproduce
The example below has 11 data points and branching_factor=10, so it raises an AttributeError. Running the same code with 10 or fewer data points raises a ValueError instead.
Example:
from sklearn.cluster import Birch

X = [[1.30830774e+307, 6.02217328e+307],
     [1.54166067e+308, 1.75812744e+308],
     [5.57938866e+307, 4.13840113e+307],
     [1.36302835e+308, 1.07968131e+308],
     [1.58772669e+308, 1.19380571e+307],
     [2.20362426e+307, 1.58814671e+308],
     [1.06216028e+308, 1.14258583e+308],
     [7.18031911e+307, 1.69661213e+308],
     [7.91182553e+307, 5.12892426e+307],
     [5.58470885e+307, 9.13566765e+306],
     [1.22366243e+308, 8.29427922e+307]]

clusterer = Birch(branching_factor=10)
clusterer.fit(X)
Expected Results
A ValueError that specifies the range of allowed values, as in other clustering algorithms:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Or an error similar to the ValueError raised when the number of data points is smaller than or equal to the branching factor:
ValueError: The condensed distance matrix must contain only finite values.
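One way such a check could look is sketched below. This is a hypothetical helper, not part of scikit-learn: the name `check_no_overflow` and the two conditions it tests (each sample's squared norm and the column-wise sum of all samples must stay finite) are assumptions made for illustration. It works on nested lists so that it needs only the standard library.

```python
import math


def check_no_overflow(X):
    """Hypothetical pre-check: raise ValueError if sums over X would overflow.

    Not a scikit-learn function. X is a list of equal-length rows of floats.
    """
    for row in X:
        # Squared norm of one sample; float multiplication overflows to inf.
        if not math.isfinite(sum(v * v for v in row)):
            raise ValueError(
                "Input contains values too large: squared norm is not finite."
            )
    # Column-wise sum of all samples, analogous to a cluster's linear sum.
    for column in zip(*X):
        if not math.isfinite(sum(column)):
            raise ValueError(
                "Input contains values too large: linear sum is not finite."
            )


X_small = [[1.0, 2.0], [3.0, 4.0]]
check_no_overflow(X_small)  # moderate values pass silently

# First two rows of the reproduction data.
X_large = [[1.30830774e+307, 6.02217328e+307],
           [1.54166067e+308, 1.75812744e+308]]
try:
    check_no_overflow(X_large)
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
```

Running a check of this kind before fitting would give the same ValueError regardless of how the number of data points compares to the branching factor.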
Actual Results
C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:189: RuntimeWarning: invalid value encountered in add
  dist_matrix += self.squared_norm_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:304: RuntimeWarning: overflow encountered in add
  new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:309: RuntimeWarning: invalid value encountered in double_scalars
  sq_radius = (new_ss + dot_product) / new_n + new_norm
C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\utils\extmath.py:153: RuntimeWarning: overflow encountered in matmul
  ret = a @ b
C:\Program Files\Python37\lib\site-packages\sklearn\metrics\pairwise.py:310: RuntimeWarning: invalid value encountered in add
  distances += XX
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:81: RuntimeWarning: invalid value encountered in less
  node1_closer = node1_dist < node2_dist
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:294: RuntimeWarning: overflow encountered in add
  self.linear_sum_ += subcluster.linear_sum_
Traceback (most recent call last):
  File "C:\Users\thaar\PycharmProjects\sklearn-dev\birch_test.py", line 61, in <module>
    clusterer.fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 463, in fit
    return self._fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 510, in _fit
    self.root_.append_subcluster(new_subcluster1)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 158, in append_subcluster
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'
Versions
System:
python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
machine: Windows-10-10.0.18362-SP0
Python dependencies:
pip: 20.1.1
setuptools: 49.2.0
sklearn: 0.23.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.5
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.0.0
Built with OpenMP: True