Counterintuitive AttributeError in Birch for very large numbers #17966

Closed
@THaar50

Describe the bug

Input data containing very large numbers causes overflows in the Birch algorithm, which manifest as different errors depending on the branching_factor parameter. If the number of data points is smaller than or equal to the branching factor, a ValueError is raised in AgglomerativeClustering; if it exceeds the branching factor, an AttributeError is raised instead. Since both errors are caused by the input data, I would expect a ValueError in both cases.
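The overflow itself can be reproduced without Birch. The following is a minimal sketch, assuming only NumPy, that takes one row from the reproduction data and shows that the row is finite while its squared norm is not. The variable names are illustrative and do not come from the scikit-learn source.

```python
import numpy as np

# One row from the reproduction data; both entries are finite float64 values.
row = np.array([1.54166067e+308, 1.75812744e+308])

with np.errstate(over="ignore", invalid="ignore"):
    sq_norm = np.dot(row, row)      # each square exceeds the float64 maximum -> inf
    difference = sq_norm - sq_norm  # inf - inf is undefined -> nan

print(np.isfinite(row).all())  # the raw input passes a finiteness check
print(sq_norm, difference)
```

This is why a finiteness check on the raw input alone does not catch the problem: the non-finite values only appear once squared norms and distances are computed.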

Steps/Code to Reproduce

Running the code below as-is raises an AttributeError; running it with fewer data points (at most branching_factor) raises a ValueError instead.
Example:

from sklearn.cluster import Birch

X = [[1.30830774e+307, 6.02217328e+307],
     [1.54166067e+308, 1.75812744e+308],
     [5.57938866e+307, 4.13840113e+307],
     [1.36302835e+308, 1.07968131e+308],
     [1.58772669e+308, 1.19380571e+307],
     [2.20362426e+307, 1.58814671e+308],
     [1.06216028e+308, 1.14258583e+308],
     [7.18031911e+307, 1.69661213e+308],
     [7.91182553e+307, 5.12892426e+307],
     [5.58470885e+307, 9.13566765e+306],
     [1.22366243e+308, 8.29427922e+307]]

clusterer = Birch(branching_factor=10)
clusterer.fit(X)

Expected Results

A ValueError that specifies the range of allowed values, as in other clustering algorithms:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Or an error similar to the ValueError raised when the number of data points is smaller than or equal to the branching factor:

ValueError: The condensed distance matrix must contain only finite values.
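Until such a check exists in the estimator, a caller-side guard is one possible workaround. The sketch below is hypothetical: `check_squared_norms_finite` is not part of scikit-learn, and the error message is made up for illustration. It rejects data whose squared row norms are not finite in float64, which is the condition that the reproduction data violates.

```python
import numpy as np


def check_squared_norms_finite(X):
    """Hypothetical helper (not part of scikit-learn).

    Raises ValueError if any squared row norm of X is not finite in float64.
    """
    X = np.asarray(X, dtype=np.float64)
    with np.errstate(over="ignore", invalid="ignore"):
        # Row-wise sum of squares; overflows to inf for very large entries.
        sq_norms = np.einsum("ij,ij->i", X, X)
    if not np.isfinite(sq_norms).all():
        raise ValueError(
            "Squared norms of the input rows are not finite in float64; "
            "the values are too large for squared-distance computations."
        )
    return X
```

Calling `check_squared_norms_finite(X)` before `clusterer.fit(X)` would then fail early with a ValueError for the data above, regardless of the branching factor. Rescaling the data before fitting is another option, if the application allows it.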

Actual Results

C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:189: RuntimeWarning: invalid value encountered in add
  dist_matrix += self.squared_norm_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:304: RuntimeWarning: overflow encountered in add
  new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:309: RuntimeWarning: invalid value encountered in double_scalars
  sq_radius = (new_ss + dot_product) / new_n + new_norm
C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\utils\extmath.py:153: RuntimeWarning: overflow encountered in matmul
  ret = a @ b
C:\Program Files\Python37\lib\site-packages\sklearn\metrics\pairwise.py:310: RuntimeWarning: invalid value encountered in add
  distances += XX
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:81: RuntimeWarning: invalid value encountered in less
  node1_closer = node1_dist < node2_dist
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:294: RuntimeWarning: overflow encountered in add
  self.linear_sum_ += subcluster.linear_sum_
Traceback (most recent call last):
  File "C:\Users\thaar\PycharmProjects\sklearn-dev\birch_test.py", line 61, in <module>
    clusterer.fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 463, in fit
    return self._fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 510, in _fit
    self.root_.append_subcluster(new_subcluster1)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 158, in append_subcluster
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'
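The warning "invalid value encountered in less" at `node1_closer = node1_dist < node2_dist` hints at how the overflow could turn into an AttributeError: any comparison involving NaN evaluates to False. The sketch below uses made-up distance arrays, not the actual values inside Birch, to show that a NaN-filled comparison assigns every item to the same side. Whether this is exactly the path that leaves one subcluster without `sq_norm_` is an assumption and has not been verified against the scikit-learn source.

```python
import numpy as np

# Made-up distances for illustration; in the failing run they would result
# from the overflowed computations reported in the warnings above.
node1_dist = np.array([np.nan, np.nan, np.nan])
node2_dist = np.array([np.nan, np.nan, np.nan])

# Comparisons with NaN are always False, so the mask is all False.
node1_closer = node1_dist < node2_dist

n_to_node1 = int(node1_closer.sum())     # items assigned to the first side
n_to_node2 = int((~node1_closer).sum())  # items assigned to the second side
print(n_to_node1, n_to_node2)  # 0 3
```

If one side of a split receives no items, any attribute that is only set when an item is added to it would never be created, which would be consistent with the traceback above.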

Versions

System:
python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
machine: Windows-10-10.0.18362-SP0

Python dependencies:
pip: 20.1.1
setuptools: 49.2.0
sklearn: 0.23.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.5
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True
