
Array API regression in homogeneity_completeness_v_measure? #29452

Closed
StefanieSenger opened this issue Jul 10, 2024 · 5 comments · Fixed by #29476


@StefanieSenger (Contributor) commented Jul 10, 2024

Describe the bug

When I enable array API dispatch and run homogeneity_completeness_v_measure, an error is raised, because internally a sparse matrix is passed into mutual_info_score, which already supports the array API (lines 530-531):

contingency = contingency_matrix(labels_true, labels_pred, sparse=True) # <--- we return sparse data
MI = mutual_info_score(None, None, contingency=contingency) #<--- we pass sparse data 

I can't wrap my head around it: is this a regression or is this something I need to fix while implementing array API for homogeneity_completeness_v_measure?

If I understand correctly: if this is a regression, we would need to fix mutual_info_score to convert sparse into dense if array API is enabled; if it is not a regression but expected behaviour, we would have to handle it in homogeneity_completeness_v_measure (with a condition there), correct?
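
For concreteness, the second option could look roughly like the sketch below (illustrative only: the branching on get_config is my assumption about how this might be wired, not the actual code in _supervised.py):

import numpy as np
from sklearn import get_config
from sklearn.metrics.cluster import contingency_matrix, mutual_info_score

labels_true = np.asarray([0, 0, 0, 1, 1, 1])
labels_pred = np.asarray([0, 1, 0, 1, 2, 2])

# Hypothetical branching: only hand mutual_info_score a dense contingency
# table when array API dispatch is enabled; keep the sparse path otherwise.
use_dense = get_config()["array_api_dispatch"]
contingency = contingency_matrix(labels_true, labels_pred, sparse=not use_dense)
MI = mutual_info_score(None, None, contingency=contingency)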

Ping @adrinjalali, since we have already talked about this.

Steps/Code to Reproduce

from sklearn.utils._testing import _array_api_for_tests
from sklearn.base import config_context
from sklearn.metrics.cluster import homogeneity_completeness_v_measure

xp = _array_api_for_tests("numpy", device="cpu")
labels_true = xp.asarray([0, 0, 0, 1, 1, 1])
labels_pred = xp.asarray([0, 1, 0, 1, 2, 2])

with config_context(array_api_dispatch=True):
    homogeneity_completeness_v_measure(labels_true, labels_pred)

Expected Results

no error

Actual Results

Traceback (most recent call last):
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/zzzzz_homogenity_array_api.py", line 10, in <module>
    homogeneity_completeness_v_measure(labels_true, labels_pred)
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/metrics/cluster/_supervised.py", line 532, in homogeneity_completeness_v_measure
    MI = mutual_info_score(None, None, contingency=contingency)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_param_validation.py", line 186, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/metrics/cluster/_supervised.py", line 884, in mutual_info_score
    contingency = check_array(
                  ^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/validation.py", line 841, in check_array
    xp, is_array_api_compliant = get_namespace(array)
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_array_api.py", line 553, in get_namespace
    namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/array_api_compat/common/_helpers.py", line 366, in array_namespace
    raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: csr_matrix is not a supported array type

Versions

System:
    python: 3.12.2 (main, Apr 18 2024, 11:14:27) [GCC 13.2.1 20230801]
executable: /home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/bin/python
   machine: Linux-6.9.5-arch1-1-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.6.dev0
          pip: 24.0
   setuptools: 69.5.1
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: 3.0.10
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 14
         prefix: libopenblas
       filepath: /home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 14
         prefix: libopenblas
       filepath: /home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/scipy.libs/libopenblasp-r0-24bff013.3.26.dev.so
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 14
         prefix: libgomp
       filepath: /usr/lib/libgomp.so.1.0.0
        version: None
StefanieSenger added the Bug and Needs Triage labels on Jul 10, 2024
@adrinjalali (Member) commented:

Here is a smaller reproducer, doing basically the same thing via the public API only:

import numpy as np
from sklearn import config_context
from sklearn.metrics.cluster import homogeneity_completeness_v_measure


labels_true = np.asarray([0, 0, 0, 1, 1, 1])
labels_pred = np.asarray([0, 1, 0, 1, 2, 2])

with config_context(array_api_dispatch=True):
    homogeneity_completeness_v_measure(labels_true, labels_pred)

I think it's a regression, because I don't think we want users to encounter errors when they enable array API dispatch and pass NumPy arrays.

cc @ogrisel @betatim @OmarManzoor

@ogrisel (Member) commented Jul 11, 2024

I agree this is a bug.

we would need to fix mutual_info_score to convert sparse into dense if array API is enabled

In general, we never want to convert sparse arrays/matrices into dense arrays implicitly, because it can blow up memory by allocating storage for a very large number of zeros. For instance:

>>> import scipy.sparse as sp
>>> A_sparse = sp.random_array((10_000, 10_000), density=0.0001)
>>> (A_sparse.data.nbytes + A_sparse.coords[0].nbytes + A_sparse.coords[1].nbytes) / 1e6
0.24  # MB
>>> A_dense = A_sparse.toarray()
>>> A_dense.nbytes / 1e6
800.0  # MB

We should only call .toarray() upon a user's explicit request, or in cases where we know that the resulting dense array will not be arbitrarily large in memory.
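
For example, an explicit and bounded densification could look like the following sketch (purely illustrative; maybe_densify and the max_cells threshold are made up for this example):

import scipy.sparse as sp

def maybe_densify(contingency, max_cells=1_000_000):
    # Illustrative helper: only densify when the full table is known to stay
    # small, e.g. a contingency matrix of shape (n_classes_true, n_classes_pred).
    if sp.issparse(contingency) and contingency.shape[0] * contingency.shape[1] <= max_cells:
        return contingency.toarray()
    return contingency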

Let me look at the code to understand what's going on.

ogrisel removed the Needs Triage label on Jul 11, 2024
@ogrisel (Member) commented Jul 11, 2024

Here is an even more minimal reproducer:

>>> from sklearn.utils.validation import check_array
>>> import scipy.sparse as sp
>>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> check_array(sp.random_array((10, 10)), accept_sparse=True)
Traceback (most recent call last):
  Cell In[13], line 1
    check_array(sp.random_array((10, 10)), accept_sparse=True)
  File ~/code/scikit-learn/sklearn/utils/validation.py:841 in check_array
    xp, is_array_api_compliant = get_namespace(array)
  File ~/code/scikit-learn/sklearn/utils/_array_api.py:553 in get_namespace
    namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
  File ~/miniforge3/envs/dev/lib/python3.11/site-packages/array_api_compat/common/_helpers.py:366 in array_namespace
    raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: coo_array is not a supported array type

Or we could even argue that the following is a bug:

>>> import scipy.sparse as sp
>>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> from sklearn.utils._array_api import get_namespace
>>> get_namespace(sp.random_array((10, 10)))
Traceback (most recent call last):
  Cell In[15], line 1
    get_namespace(sp.random_array((10, 10)))
  File ~/code/scikit-learn/sklearn/utils/_array_api.py:553 in get_namespace
    namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
  File ~/miniforge3/envs/dev/lib/python3.11/site-packages/array_api_compat/common/_helpers.py:366 in array_namespace
    raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: coo_array is not a supported array type

The way we should handle namespace inspection in the presence of sparse inputs is subtle. When we only have sparse inputs, we should probably return the numpy (or the numpy wrapper) namespace.

However, when we have mixed inputs (dense and sparse), I think we should return the inspected namespace from the non-sparse inputs.
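
For illustration, namespace inspection that tolerates sparse inputs could be sketched like this (an assumption about the approach, not the actual get_namespace implementation):

import numpy as np
import scipy.sparse as sp
import array_api_compat

def get_namespace_ignoring_sparse(*arrays):
    # Inspect the namespace only from the non-sparse inputs; when every
    # input is sparse, fall back to plain NumPy.
    dense = [a for a in arrays if not sp.issparse(a)]
    if not dense:
        return np, False
    return array_api_compat.get_namespace(*dense), True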

In either case, our code should have if sp.issparse(data) conditions to branch the sparse computation away from the dedicated NumPy / array API code paths.
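
A downstream helper would then branch along these lines (again a minimal sketch; total_count is a hypothetical function, not something from scikit-learn):

import scipy.sparse as sp
from sklearn.utils._array_api import get_namespace

def total_count(contingency):
    if sp.issparse(contingency):
        # sparse code path: stay within scipy.sparse / NumPy
        return contingency.sum()
    # dense code path: dispatch to the input array's own namespace
    xp, _ = get_namespace(contingency)
    return xp.sum(contingency)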

The above holds for the time being, that is, as long as SciPy sparse arrays do not support the array API. I am not sure whether they ever will.

I will open a PR to precisely state what I mean above in a series of tests.

@OmarManzoor (Contributor) commented:

@ogrisel (Member) commented Jul 11, 2024

Yes, it's similar, but your solution might hide problems if the input arrays are mixed. I will open a dedicated PR with extra tests to discuss how we want to handle those edge cases without coupling this with the review of #27369.
