-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Array API regression in homogeneity_completeness_v_measure
?
#29452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This is a smaller minimal reproducer, basically doing the same thing via public API only: import numpy as np
from sklearn import config_context
from sklearn.metrics.cluster import homogeneity_completeness_v_measure
labels_true = np.asarray([0, 0, 0, 1, 1, 1])
labels_pred = np.asarray([0, 1, 0, 1, 2, 2])
with config_context(array_api_dispatch=True):
homogeneity_completeness_v_measure(labels_true, labels_pred) I think it's a regression, cause I don't think we want users to encounter errors when enabling array API dispatch and passing numpy array. |
I agree this is a bug.
In general we never want to convert sparse arrays/matrices into dense arrays implicitly because it might blow-up memory by allocating memory to store a very large number of zeros. For instance: >>> import scipy.sparse as sp
>>> A_sparse = sp.random_array((10_000, 10_000), density=0.0001)
>>> (A_sparse.data.nbytes + A_sparse.coords[0].nbytes + A_sparse.coords[1].nbytes) / 1e6
0.24 # MB
>>> A_dense = A_sparse.toarray()
>>> A_dense.nbytes / 1e6
800.0 # MB We should only call Let me look at the code to understand what's going on. |
Here is an even more minimal reproducer: >>> from sklearn.utils.validation import check_array
>>> import scipy.sparse as sp
>>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> check_array(sp.random_array((10, 10)), accept_sparse=True)
Traceback (most recent call last):
Cell In[13], line 1
check_array(sp.random_array((10, 10)), accept_sparse=True)
File ~/code/scikit-learn/sklearn/utils/validation.py:841 in check_array
xp, is_array_api_compliant = get_namespace(array)
File ~/code/scikit-learn/sklearn/utils/_array_api.py:553 in get_namespace
namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
File ~/miniforge3/envs/dev/lib/python3.11/site-packages/array_api_compat/common/_helpers.py:366 in array_namespace
raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: coo_array is not a supported array type or we even could argue that the following is a bug: >>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> from sklearn.utils._array_api import get_namespace
>>> get_namespace(sp.random_array((10, 10)))
Traceback (most recent call last):
Cell In[15], line 1
get_namespace(sp.random_array((10, 10)))
File ~/code/scikit-learn/sklearn/utils/_array_api.py:553 in get_namespace
namespace, is_array_api_compliant = array_api_compat.get_namespace(*arrays), True
File ~/miniforge3/envs/dev/lib/python3.11/site-packages/array_api_compat/common/_helpers.py:366 in array_namespace
raise TypeError(f"{type(x).__name__} is not a supported array type")
TypeError: coo_array is not a supported array type The way we should handle namespace inspection in the presence of sparse inputs is subtle. When we only have sparse inputs, we should probably return the numpy (or the numpy wrapper) namespace. However, when we have mixed inputs (dense and sparse) I think we should return the inspected namespace from the non-sparse inputs. In either case, our code should have The above hold for the time being, that is, as long as scipy sparse arrays do not support the array API. I am not sure if they ever will. I will open a PR to precisely state what I mean above in a series of tests. |
Hi @ogrisel I think we added this check in the PR for f1_score, https://github.com/scikit-learn/scikit-learn/pull/27369/files#diff-6264adab84df6fafdb2d950141783b52b718da1e1be773daae7b5921b2556f94R546 |
Yes it's similar but your solution might hide problems if input arrays are mixed. I will open a dedicated PRs with extra tests to have a discussion on how we want to handle those edge cases while not coupling this with the review of #27369. |
Describe the bug
When I enable Array API and run
homogeneity_completeness_v_measure
, an error is raised, since within it, a sparse matrix is passed intomutual_info_score
, which already supports array API (li. 530-531):I can't wrap my head around it: is this a regression or is this something I need to fix while implementing array API for
homogeneity_completeness_v_measure
?If I understand correctly: in case of a regression we would need to fix mutual_info_score to convert sparse into dense if array API is enabled, in case it is not a regression, but expected, then we would have to deal with it in
homogeneity_completeness_v_measure
(having a condition here), correct?ping @adrinjalali, since we have already talked about it.
Steps/Code to Reproduce
Expected Results
no error
Actual Results
Versions
The text was updated successfully, but these errors were encountered: