
Array API regression in homogeneity_completeness_v_measure? #29452

Closed
StefanieSenger opened this issue Jul 10, 2024 · 5 comments · Fixed by #29476


@StefanieSenger (Contributor) commented Jul 10, 2024

Describe the bug

When I enable array API dispatch and run homogeneity_completeness_v_measure, an error is raised, because internally a sparse matrix is passed into mutual_info_score, which already supports the array API (lines 530-531):

contingency = contingency_matrix(labels_true, labels_pred, sparse=True) # <--- we return sparse data
MI = mutual_info_score(None, None, contingency=contingency) #<--- we pass sparse data 

I can't wrap my head around it: is this a regression or is this something I need to fix while implementing array API for homogeneity_completeness_v_measure?

If I understand correctly: if this is a regression, we would need to fix mutual_info_score to convert sparse into dense if array API is enabled; if it is not a regression but expected behaviour, we would have to handle it in homogeneity_completeness_v_measure (with a condition there), correct?
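
For concreteness, the second option could look roughly like the sketch below (illustrative only: the branching on get_config is my assumption about how this might be wired, not the actual code in _supervised.py):

import numpy as np
from sklearn import get_config
from sklearn.metrics.cluster import contingency_matrix, mutual_info_score

labels_true = np.asarray([0, 0, 0, 1, 1, 1])
labels_pred = np.asarray([0, 1, 0, 1, 2, 2])

# Hypothetical branching: only hand mutual_info_score a dense contingency
# table when array API dispatch is enabled; keep the sparse path otherwise.
use_dense = get_config()["array_api_dispatch"]
contingency = contingency_matrix(labels_true, labels_pred, sparse=not use_dense)
MI = mutual_info_score(None, None, contingency=contingency)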

Ping @adrinjalali, since we have already talked about this.

Steps/Code to Reproduce

from sklearn.utils._testing import _array_api_for_tests
from sklearn.base import config_context
from sklearn.metrics.cluster import homogeneity_completeness_v_measure

xp = _array_api_for_tests("numpy", device="cpu")
labels_true = xp.asarray([0, 0, 0, 1, 1, 1])
labels_pred = xp.asarray([0, 1, 0, 1, 2, 2])

with config_context(array_api_dispatch=True):
    homogeneity_completeness_v_measure(labels_true, labels_pred)

Expected Results

no error

Actual Results

Traceback (most recent call last):
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/zzzzz_homogenity_array_api.py", line 10, in <module>
    homogeneity_completeness_v_measure(labels_true, labels_pred)
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/metrics/cluster/_supervised.py", line 532, in homogeneity_completeness_v_measure
    MI = mutual_info_score(None, None, contingency=contingency)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_param_validation.py", line 186, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/metrics/cluster/_supervised.py", line 884, in mutual_info_score
    contingency = check_array(
                  ^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/validation.py", line 841, in check_array
    xp, is_array_api_compliant = get_namespace(array)
                                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_array_api.py", line 553, in get_namespace
    namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/array_api_compat/common/_helpers.py", line 366, in array_namespace
    raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: csr_matrix is not a supported array type

Versions

System:
    python: 3.12.2 (main, Apr 18 2024, 11:14:27) [GCC 13.2.1 20230801]
executable: /home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/bin/python
   machine: Linux-6.9.5-arch1-1-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.6.dev0
          pip: 24.0
   setuptools: 69.5.1
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: 3.0.10
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 14
         prefix: libopenblas
       filepath: /home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 14
         prefix: libopenblas
       filepath: /home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/scipy.libs/libopenblasp-r0-24bff013.3.26.dev.so
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 14
         prefix: libgomp
       filepath: /usr/lib/libgomp.so.1.0.0
        version: None
StefanieSenger added the Bug and Needs Triage labels on Jul 10, 2024
@adrinjalali (Member) commented:

Here is a smaller reproducer, doing basically the same thing via the public API only:

import numpy as np
from sklearn import config_context
from sklearn.metrics.cluster import homogeneity_completeness_v_measure


labels_true = np.asarray([0, 0, 0, 1, 1, 1])
labels_pred = np.asarray([0, 1, 0, 1, 2, 2])

with config_context(array_api_dispatch=True):
    homogeneity_completeness_v_measure(labels_true, labels_pred)

I think it's a regression, because I don't think we want users to encounter errors when they enable array API dispatch and pass NumPy arrays.

cc @ogrisel @betatim @OmarManzoor

@ogrisel (Member) commented Jul 11, 2024

I agree this is a bug.

we would need to fix mutual_info_score to convert sparse into dense if array API is enabled

In general, we never want to convert sparse arrays/matrices into dense arrays implicitly, because it can blow up memory by allocating storage for a very large number of zeros. For instance:

>>> import scipy.sparse as sp
>>> A_sparse = sp.random_array((10_000, 10_000), density=0.0001)
>>> (A_sparse.data.nbytes + A_sparse.coords[0].nbytes + A_sparse.coords[1].nbytes) / 1e6
0.24  # MB
>>> A_dense = A_sparse.toarray()
>>> A_dense.nbytes / 1e6
800.0  # MB

We should only call .toarray() upon a user's explicit request, or in cases where we know that the resulting dense array will not be arbitrarily large in memory.
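
For example, an explicit and bounded densification could look like the following sketch (purely illustrative; maybe_densify and the max_cells threshold are made up for this example):

import scipy.sparse as sp

def maybe_densify(contingency, max_cells=1_000_000):
    # Illustrative helper: only densify when the full table is known to stay
    # small, e.g. a contingency matrix of shape (n_classes_true, n_classes_pred).
    if sp.issparse(contingency) and contingency.shape[0] * contingency.shape[1] <= max_cells:
        return contingency.toarray()
    return contingency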

Let me look at the code to understand what's going on.

ogrisel removed the Needs Triage label on Jul 11, 2024
@ogrisel (Member) commented Jul 11, 2024

Here is an even more minimal reproducer:

>>> from sklearn.utils.validation import check_array
>>> import scipy.sparse as sp
>>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> check_array(sp.random_array((10, 10)), accept_sparse=True)
Traceback (most recent call last):
  Cell In[13], line 1
    check_array(sp.random_array((10, 10)), accept_sparse=True)
  File ~/code/scikit-learn/sklearn/utils/validation.py:841 in check_array
    xp, is_array_api_compliant = get_namespace(array)
  File ~/code/scikit-learn/sklearn/utils/_array_api.py:553 in get_namespace
    namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
  File ~/miniforge3/envs/dev/lib/python3.11/site-packages/array_api_compat/common/_helpers.py:366 in array_namespace
    raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: coo_array is not a supported array type

Or we could even argue that the following is a bug:

>>> import scipy.sparse as sp
>>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> from sklearn.utils._array_api import get_namespace
>>> get_namespace(sp.random_array((10, 10)))
Traceback (most recent call last):
  Cell In[15], line 1
    get_namespace(sp.random_array((10, 10)))
  File ~/code/scikit-learn/sklearn/utils/_array_api.py:553 in get_namespace
    namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
  File ~/miniforge3/envs/dev/lib/python3.11/site-packages/array_api_compat/common/_helpers.py:366 in array_namespace
    raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: coo_array is not a supported array type

The way we should handle namespace inspection in the presence of sparse inputs is subtle. When we only have sparse inputs, we should probably return the numpy (or the numpy wrapper) namespace.

However, when we have mixed inputs (dense and sparse), I think we should return the inspected namespace from the non-sparse inputs.
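
For illustration, namespace inspection that tolerates sparse inputs could be sketched like this (an assumption about the approach, not the actual get_namespace implementation):

import numpy as np
import scipy.sparse as sp
import array_api_compat

def get_namespace_ignoring_sparse(*arrays):
    # Inspect the namespace only from the non-sparse inputs; when every
    # input is sparse, fall back to plain NumPy.
    dense = [a for a in arrays if not sp.issparse(a)]
    if not dense:
        return np, False
    return array_api_compat.get_namespace(*dense), True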

In either case, our code should have if sp.issparse(data) conditions to branch the sparse computation away from the dedicated NumPy / array API code paths.
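
A downstream helper would then branch along these lines (again a minimal sketch; total_count is a hypothetical function, not something from scikit-learn):

import scipy.sparse as sp
from sklearn.utils._array_api import get_namespace

def total_count(contingency):
    if sp.issparse(contingency):
        # sparse code path: stay within scipy.sparse / NumPy
        return contingency.sum()
    # dense code path: dispatch to the input array's own namespace
    xp, _ = get_namespace(contingency)
    return xp.sum(contingency)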

The above holds for the time being, that is, as long as SciPy sparse arrays do not support the array API. I am not sure whether they ever will.

I will open a PR to precisely state what I mean above in a series of tests.

@OmarManzoor (Contributor) commented:

@ogrisel (Member) commented Jul 11, 2024

Yes, it's similar, but your solution might hide problems if the input arrays are mixed. I will open a dedicated PR with extra tests to discuss how we want to handle those edge cases without coupling this with the review of #27369.
