ENH Array API support for confusion_matrix converting to numpy array #30562
Conversation
sklearn/utils/_array_api.py
```python
def _nan_to_num(array, xp=None):
    """Substitutes NaN values of an array with 0 and inf values with the maximum or
    minimum numbers available for the dtype respectively; like np.nan_to_num."""
    xp, _ = get_namespace(array, xp=xp)
    try:
        array = xp.nan_to_num(array)
    except AttributeError:  # currently catching exceptions from array_api_strict
        array[xp.isnan(array)] = 0
        if xp.isdtype(array.dtype, "real floating"):
            array[xp.isinf(array) & (array > 0)] = xp.finfo(array.dtype).max
            array[xp.isinf(array) & (array < 0)] = xp.finfo(array.dtype).min
        else:  # xp.isdtype(array.dtype, "integral")
            array[xp.isinf(array) & (array > 0)] = xp.iinfo(array.dtype).max
            array[xp.isinf(array) & (array < 0)] = xp.iinfo(array.dtype).min
    return array
```
`_nan_to_num` is a leftover from the original branch and is not needed for this PR, since here we are converting to numpy arrays, and `_nan_to_num` is only needed to work with `array_api_strict` arrays.
But we're also using `np.nan_to_num` in `silhouette_samples`, in `BaseSearchCV._format_results._store()` and in two tests. So it might be needed later anyway (?), in which case it could stay.
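For reference, this is the numpy behaviour the helper mirrors (a minimal illustration, not part of the PR code):

```python
import numpy as np

a = np.array([np.nan, np.inf, -np.inf, 1.0])
b = np.nan_to_num(a)

# NaN becomes 0.0; +inf / -inf become the largest / smallest
# finite values representable by the dtype (here float64)
print(b)
```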
Oh, I just saw this, including the link to this issue. Maybe I have to remove it then, depending on what will be discussed in the issue.
I can also make a separate PR with only `_nan_to_num()`, which can stay open for months or years until we know whether the issue will be fruitful. People can then find it more easily if it is needed sometime in the future.
I think this needs more discussion anyway, so it would be nicer to move it out of this PR. I'm personally not convinced by what's happening in this function, TBH.
Okay, I will delete it then. I don't think it makes sense to keep it somewhere anymore.
```python
result = confusion_matrix(y_true, y_pred)
xp_result, _ = get_namespace(result)
assert _is_numpy_namespace(xp_result)

# Since the computation always happens with NumPy / SciPy on the CPU, this
# function is expected to return an array allocated on the CPU even when it does
# not match the input array's device.
assert result.device == "cpu"
```
I have adjusted this test to your suggestions from this comment, @ogrisel. But here, the test is narrower because we made the return value of `confusion_matrix` always be a numpy array on CPU.

Regarding the question of how to document the return type of …
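The narrowed expectation can also be exercised end to end with plain numpy inputs (a hypothetical standalone check, separate from the test in the diff):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
cm = confusion_matrix(y_true, y_pred)

# regardless of the input namespace, the result is a plain numpy array
assert isinstance(cm, np.ndarray)
print(cm)
```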
sklearn/utils/_array_api.py
```python
# TODO: remove when minimum pandas version is pandas==1.2.0, when
# `numpy.asarray(pd.Series)` with nullable dtypes no longer returns nd.arrays with
# `object` dtypes:
def convert_pandas_nullable_dtypes(pandas_series):
    """Convert from pandas nullable extension dtypes to numpy dtypes. Without this
    conversion, numpy.asarray(array) creates a numpy array with dtype `object` for
    older pandas versions.
    """
    dtype_mapping = {
        **{f"pd.Int{x}Dtype()": f"int{x}" for x in [8, 16, 32, 64]},
        **{f"pd.Float{x}Dtype()": f"float{x}" for x in [32, 64]},
        "pd.BooleanDtype()": "bool",
    }
    return pandas_series.astype(dtype_mapping.get(pandas_series.dtype, None))
```
This fixes the CI failure with pandas==1.1.5 that appeared because we now use `_convert_to_numpy()`, which has a `numpy.asarray(array)` line that returns a numpy array with dtype `object`, which then causes `_check_targets()` to raise.
Mapping to the desired numpy dtype prevents this.
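The underlying pitfall can be shown in isolation (a hedged sketch; on pandas >= 1.2, `np.asarray` already yields the proper numeric dtype, so the workaround only matters for older versions):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")  # nullable extension dtype

# Going through the plain numpy dtype first sidesteps the object-dtype
# array that np.asarray(s) produced on pandas < 1.2:
arr = np.asarray(s.astype("int64"))
print(arr.dtype)  # int64
```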
I suspect the CI failure (about the Test Library) is a false positive, as I couldn't reproduce the same error after running the tests multiple times on my local machine. Instead, I encountered a different set of errors related to CUDA. I believe the issue might be linked to how we convert `sample_weight` into a NumPy array.

I've updated the PR. Since #30613 is merged, we don't need the … Edit: Oh my, we still get the same CI error:

So, should we: … I think the second option is the more secure one, because the helper function fixes only this specific test. But @ogrisel, you surely have an intuition here?

Thanks for checking this @virchan. I have now re-tried, running pytest in an environment created from the lockfile for the min dependencies. I now get the failures locally too. When I tried before, the test passed locally with pandas==1.2.0. However, @lesteve pointed me to the possibility that there might be some interplay between dependencies that causes this. I'm actually not sure anymore how I ran the test. Maybe I wasn't in my …
To keep the ball rolling and make the CIs green again, I have added calls to `check_array()` in `confusion_matrix()`.
An alternative would be to re-introduce the helper function `_convert_pandas_nullable_dtypes()` as it was before removing it (which also leads to green CIs).
Please let me know if there is anything else I can or should do here.
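As a rough illustration of why `check_array()` helps here (assuming a pandas nullable-dtype input, which is the case the failing test exercises):

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

y = pd.Series([0, 1, 1], dtype="Int64")  # nullable integer dtype

# check_array materialises the Series as a plain numpy ndarray before
# any namespace-based dispatching sees it
arr = check_array(y, dtype=None, ensure_2d=False)
assert isinstance(arr, np.ndarray)
```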
sklearn/metrics/_classification.py
```python
y_true = check_array(y_true, dtype=None, ensure_2d=False, ensure_min_samples=0)
y_pred = check_array(y_pred, dtype=None, ensure_2d=False, ensure_min_samples=0)
y_true = _convert_to_numpy(y_true, xp)
y_pred = _convert_to_numpy(y_pred, xp)
```
Comment for reviewers:
Only using `check_array()` without `_convert_to_numpy()` afterwards results in a test failure for array_api_strict arrays: `FAILED sklearn/metrics/tests/test_classification.py::test_confusion_matrix_array_api[array_api_strict-None-None] - TypeError: unhashable type: 'Array'`
This is why I left it there. However, I did not investigate the reason.
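The `unhashable type` error is plausibly triggered by an array being used where a hashable label is expected (e.g. as a dict key or set element). numpy arrays have the same restriction, which this snippet demonstrates as an analogue (array_api_strict arrays behave alike, per the error message above):

```python
import numpy as np

a = np.asarray([1, 2])
try:
    hash(a)  # ndarray sets __hash__ = None, so this raises
except TypeError as e:
    msg = str(e)

print(msg)  # unhashable type: 'numpy.ndarray'
```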
I think this looks good overall. Thanks @StefanieSenger!
We just need to finalize whether we want to keep the current flow, where we return a numpy array on the CPU, or whether we should convert to the initial namespace and device.
CC: @ogrisel
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
```python
if not _is_numpy_namespace(get_namespace(labels)[0]):
    labels = _convert_to_numpy(labels, xp)
else:
    labels = np.asarray(labels)
```
I think we can replace these with a single line:

```python
labels = _convert_to_numpy(labels, xp)
```

About the namespace, I would say for now the least controversial thing to do (from the point of view of array API support) would be to convert to the initial namespace, for example see #30440 (comment). About the device, I am not quite sure ... the SciPy array API support blog post's compiled-code section (from October 2023) says:

Maybe @lucascolley has some opinion on what to do with the device, based on his SciPy experience?
Thanks for the ping @lesteve, I left my initial thoughts at #30440 (comment), then discussed this with Olivier and Tim in a call. The summary is that I think you'll want to think carefully before setting a precedent for different defaults across functions as to whether they return the input namespace or NumPy. That seems like it could be confusing for users and difficult to document. In this case, if (some of) the computation has to be done with NumPy, then I would suggest attempting to convert back to the input namespace. In SciPy we have chosen to let this error for now and skip the tests for device transfers, but you could choose to let the device transfers happen (as I think you do in your …)
I ran some benchmarks on a Kaggle kernel with CUDA using this branch, where in one case the current flow (returning a numpy array) is used, while in the other we convert to the input namespace and device when returning: …

Looking at this, I think we might as well return the confusion matrix in the original namespace and device.
Thanks, everyone, for your thoughts on the return type! To summarize: the original decision (based on @ogrisel's suggestion) was to default to returning numpy arrays. So, should we finalise the PR by converting back to the input namespace and device, despite the minor performance difference? Or should we keep the numpy default? I believe the advantages of converting back to the input namespace and device outweigh the costs, and in case we later judge differently for some reason, array API support is experimental and we could still change its behaviour without annoying users too much. @ogrisel, since your suggestion was to convert to numpy, do you have any thoughts on whether we should stick with that or return types from the input namespace, based on the discussion?
+1 from me! For clarity, what is the NumPy code (i.e. which lines) that we have been unable to write in a performant way with the standard (which caused the initial performance discussion)?
Basically where we create the confusion matrix using a `coo_matrix`: see `sklearn/metrics/_classification.py`, lines 413 to 417 at 869f568.
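The construction being referred to boils down to this pattern (a simplified sketch, not the exact scikit-learn code; the real version also handles `sample_weight` and dtype selection):

```python
import numpy as np
from scipy.sparse import coo_matrix

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
n_labels = 2

# coo_matrix sums duplicate (row, col) entries, which is exactly the
# per-cell counting a confusion matrix needs
cm = coo_matrix(
    (np.ones(y_true.shape[0], dtype=np.int64), (y_true, y_pred)),
    shape=(n_labels, n_labels),
).toarray()
print(cm)
```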
This seems the least surprising option to me for now, and we can revisit it later if the need arises. I would think metrics computation is unlikely to be the bottleneck in most real-life use cases.
@ogrisel Could you kindly provide your feedback so that we can finalize this PR?
Reference Issues/PRs
towards #26024
closes #30440 (supersedes)
This PR is an alternative, discussed in #30440, that converts the input arrays to numpy arrays right away and returns the confusion matrix as a numpy array. For more details, see the discussion there.