
ENH Array API support for confusion_matrix converting to numpy array #30562

Open
wants to merge 38 commits into main

Conversation

@StefanieSenger (Contributor) commented Dec 30, 2024

Reference Issues/PRs

towards #26024
closes #30440 (supersedes)

This PR is an alternative, discussed in #30440, that converts the input arrays to numpy arrays right away and returns the confusion_matrix as a numpy array. For more details see the discussion there.
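
Roughly, the new flow looks like this (a minimal sketch using scikit-learn's array API helpers get_namespace and _convert_to_numpy; the exact call sites in the PR may differ):

xp, _ = get_namespace(y_true, y_pred)
# Convert the inputs to plain numpy arrays right away ...
y_true = _convert_to_numpy(y_true, xp)
y_pred = _convert_to_numpy(y_pred, xp)
# ... so that all further computation happens with NumPy/SciPy and the
# resulting confusion matrix is returned as a numpy.ndarray.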

github-actions bot commented Dec 30, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 869f568.

Comment on lines 1113 to 1127
def _nan_to_num(array, xp=None):
    """Substitutes NaN values of an array with 0 and inf values with the maximum or
    minimum numbers available for the dtype respectively; like np.nan_to_num."""
    xp, _ = get_namespace(array, xp=xp)
    try:
        array = xp.nan_to_num(array)
    except AttributeError:  # currently catching exceptions from array_api_strict
        array[xp.isnan(array)] = 0
        if xp.isdtype(array.dtype, "real floating"):
            array[xp.isinf(array) & (array > 0)] = xp.finfo(array.dtype).max
            array[xp.isinf(array) & (array < 0)] = xp.finfo(array.dtype).min
        else:  # xp.isdtype(array.dtype, "integral")
            array[xp.isinf(array) & (array > 0)] = xp.iinfo(array.dtype).max
            array[xp.isinf(array) & (array < 0)] = xp.iinfo(array.dtype).min
    return array
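
For reference, with a NumPy input the try branch succeeds (assuming the wrapped numpy namespace exposes nan_to_num, as the try/except above implies; the fallback is only hit by namespaces without it, such as array_api_strict), and the behaviour matches np.nan_to_num:

import numpy as np

a = np.array([np.nan, np.inf, -np.inf, 1.0])
_nan_to_num(a)
# -> array([0., 1.7976931348623157e+308, -1.7976931348623157e+308, 1.])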
@StefanieSenger (Contributor, Author) commented Dec 30, 2024

_nan_to_num is a leftover from the original branch and not needed for this PR, since here we are converting to numpy arrays, and _nan_to_num is only needed to work with array_api_strict arrays.

But we're also using np.nan_to_num in silhouette_samples, in BaseSearchCV._format_results._store(), and in two tests. So it might be needed later anyway, and in that case it could stay.

@StefanieSenger (Contributor, Author) commented Dec 30, 2024

Oh, I just saw this, including the link to this issue. Maybe I have to remove it then, depending on what gets decided in that issue.

@StefanieSenger (Contributor, Author) commented Jan 3, 2025

I can also make a separate PR with only _nan_to_num(), which can stay open for months or years until we know whether the issue will be fruitful. People can then find it more easily if it is needed sometime in the future.

Member

I think this needs more discussion anyway, so it would be nicer to move it out of this PR. I'm personally not convinced by what's happening in this function, TBH.

@StefanieSenger (Contributor, Author)

Okay, I will delete it then. I don't think it makes sense to keep it anywhere anymore.

Comment on lines 3120 to 3127
result = confusion_matrix(y_true, y_pred)
xp_result, _ = get_namespace(result)
assert _is_numpy_namespace(xp_result)

# Since the computation always happens with NumPy / SciPy on the CPU, this
# function is expected to return an array allocated on the CPU even when it does
# not match the input array's device.
assert result.device == "cpu"
@StefanieSenger (Contributor, Author)

I have adjusted this test according to your suggestions from this comment, @ogrisel. But here the test is narrower, because we made the return value of confusion_matrix always be a numpy array on CPU.

@StefanieSenger (Contributor, Author)

Regarding the question of how to document the return type of confusion_matrix() as a numpy array: I think keeping
C : ndarray of shape (n_classes, n_classes) in the docstring should be enough, assuming that all the other functions and methods where we have added array API support document the return type correctly, which is currently not the case.

Comment on lines 879 to 892
# TODO: remove when the minimum pandas version is pandas==1.2.0, where
# `numpy.asarray(pd.Series)` with nullable dtypes no longer returns ndarrays with
# `object` dtype:
def convert_pandas_nullable_dtypes(pandas_series):
    """Convert from pandas nullable extension dtypes to numpy dtypes.

    Without this conversion, numpy.asarray(array) creates a numpy array with
    dtype `object` for older pandas versions.
    """
    # str(dtype) is e.g. "Int64" for pd.Int64Dtype(), "Float32" for
    # pd.Float32Dtype() and "boolean" for pd.BooleanDtype().
    dtype_mapping = {
        **{f"Int{x}": f"int{x}" for x in [8, 16, 32, 64]},
        **{f"Float{x}": f"float{x}" for x in [32, 64]},
        "boolean": "bool",
    }
    # Fall back to the original dtype for series that don't use a nullable
    # extension dtype.
    return pandas_series.astype(
        dtype_mapping.get(str(pandas_series.dtype), pandas_series.dtype)
    )
@StefanieSenger (Contributor, Author) commented Jan 2, 2025

This fixes the CI failure with pandas==1.1.5 that appeared because we now use _convert_to_numpy(), which contains a numpy.asarray(array) call; for these inputs that returns a numpy array with dtype object, which in turn causes _check_targets() to raise.

Mapping to the desired numpy dtype prevents this.
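
For illustration, a minimal reproduction of the underlying issue (the pandas < 1.2 behaviour is taken from the TODO comment above; on newer pandas the conversion yields the proper numeric dtype):

import numpy as np
import pandas as pd

s = pd.Series([1, 0, 1], dtype="Int64")  # pandas nullable extension dtype
print(np.asarray(s).dtype)  # object under pandas < 1.2, int64 on newer versions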

@virchan (Member) left a comment

I suspect the CI failure (about the Test Library) is a false positive, as I couldn't reproduce the same error after running the tests multiple times on my local machine. Instead, I encountered a different set of errors related to CUDA. I believe the issue might be linked to how we convert sample_weight into a NumPy array.

@StefanieSenger (Contributor, Author) commented Jan 13, 2025

I've updated the PR.

Since #30613 is merged, we don't need the _convert_pandas_nullable_dtypes() helper function anymore.

Edit: Oh my, we still get the same CI error:

FAILED metrics/tests/test_classification.py::test_confusion_matrix_pandas_nullable[Int64] - ValueError: Classification metrics can't handle a mix of unknown and binary...
FAILED metrics/tests/test_classification.py::test_confusion_matrix_pandas_nullable[Float64] - ValueError: Classification metrics can't handle a mix of unknown and binary...
FAILED metrics/tests/test_classification.py::test_confusion_matrix_pandas_nullable[boolean] - ValueError: Classification metrics can't handle a mix of unknown and binary...
= 3 failed, 32952 passed, 3142 skipped, 146 xfailed, 70 xpassed, 4114 warnings in 1138.31s (0:18:58) =

So, should we:
a) re-introduce the _convert_pandas_nullable_dtypes() helper function?
b) call check_array() from within _convert_to_numpy() to fix this? Edit: A few weeks later, I have come to the conclusion that the way check_array() needs to be called is very specific to the data we expect, so it may be better not to call it from within a function that is used in so many different places, but rather right after user input. So I don't think that b) is a good idea anymore.

In check_array() these issues were dealt with in #25814 and in #25147 in a more general way.

I think the second option is the more robust one, because the helper function fixes only this specific test, whereas check_array() can deal with all the pandas extension dtypes, which cover more cases if I understand correctly.

@ogrisel: you surely have an intuition here?

@virchan (Member) commented Jan 13, 2025

Edit: Oh my, we still get the same CI error:

FAILED metrics/tests/test_classification.py::test_confusion_matrix_pandas_nullable[Int64] - ValueError: Classification metrics can't handle a mix of unknown and binary...
FAILED metrics/tests/test_classification.py::test_confusion_matrix_pandas_nullable[Float64] - ValueError: Classification metrics can't handle a mix of unknown and binary...
FAILED metrics/tests/test_classification.py::test_confusion_matrix_pandas_nullable[boolean] - ValueError: Classification metrics can't handle a mix of unknown and binary...
= 3 failed, 32952 passed, 3142 skipped, 146 xfailed, 70 xpassed, 4114 warnings in 1138.31s (0:18:58) =

I couldn't reproduce the CI error mentioned above on my local machine. In fact, test_confusion_matrix_pandas_nullable passed in all three cases:

[Screenshot: all three test_confusion_matrix_pandas_nullable cases passing locally]

Could this be a false positive?

@StefanieSenger (Contributor, Author)

Thanks for checking this @virchan.

I have now retried, running pytest in an environment created from the lockfile for the min dependencies
(conda create --name min_dep_env --file ./build_tools/azure/pymin_conda_forge_openblas_min_dependencies_linux-64_conda.lock).

I now get the failures locally too.

When I tried before, the test passed locally with pandas==1.2.0. However, @lesteve pointed out that there might be some interplay between dependencies that causes this. I'm actually not sure anymore how I ran the test. Maybe I wasn't in my min_dep_env and had only lazily re-installed pandas with numpy unaltered. That might be why I had seen these tests pass locally.

@StefanieSenger (Contributor, Author) left a comment

To keep the ball rolling and make the CIs green again, I have added calls to check_array() into confusion_matrix().

An alternative would be to re-introduce the helper function _convert_pandas_nullable_dtypes() as it was before removing it (which also leads to green CIs).

Please let me know if there is anything else I can or should do here.

Comment on lines 342 to 345
y_true = check_array(y_true, dtype=None, ensure_2d=False, ensure_min_samples=0)
y_pred = check_array(y_pred, dtype=None, ensure_2d=False, ensure_min_samples=0)
y_true = _convert_to_numpy(y_true, xp)
y_pred = _convert_to_numpy(y_pred, xp)
@StefanieSenger (Contributor, Author)

Comment for reviewers:

Only using check_array() without _convert_to_numpy() afterwards results in a test failure for array_api_strict arrays: FAILED sklearn/metrics/tests/test_classification.py::test_confusion_matrix_array_api[array_api_strict-None-None] - TypeError: unhashable type: 'Array'

This is why I left it there. However, I did not research the reason.

@OmarManzoor (Contributor) left a comment

I think this looks good overall. Thanks @StefanieSenger.
We just need to finalize whether we want to keep the current flow, where we return a numpy array on the CPU, or convert back to the initial namespace and device.
CC: @ogrisel

Comment on lines +366 to +369
if not _is_numpy_namespace(get_namespace(labels)[0]):
labels = _convert_to_numpy(labels, xp)
else:
labels = np.asarray(labels)
Member

I think we can replace these with a single line:

labels = _convert_to_numpy(labels, xp)

@lesteve (Member) commented Feb 11, 2025

We just need to finalize whether we want to keep the current flow where we return a numpy array on the cpu or should we convert to the initial namespace and device.

About namespace, I would say for now the least controversial thing to do (from the point of view of array API support) would be to convert to the initial namespace, for example see #30440 (comment).

About device, I am not quite sure ... the SciPy array API support blog post's compiled code section (from October 2023) says:

For now, we attempt to convert to NumPy using np.asarray before the compiled code, then convert back to our array's namespace using xp.asarray after it. This will raise exceptions for arrays on a different device to the CPU, as justified in the RFC (tldr: we want to avoid silent device transfers).

Maybe @lucascolley has some opinion on what to do with device based on his scipy experience?

@lucascolley (Contributor)

Thanks for the ping @lesteve. I left my initial thoughts at #30440 (comment), then discussed this with Olivier and Tim in a call. The summary is that I think you'll want to think carefully before setting a precedent for different defaults across functions as to whether they return the input namespace or NumPy. That seems like it could be confusing for users and difficult to document.

In this case, if (some of) the computation has to be done with NumPy, then I would suggest attempting to convert back to the input namespace. In SciPy we have chosen to let this error for now and skip the tests for device transfers, but you could choose to let the device transfers happen (as I think you do in your _convert_to_numpy, right?).
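
For concreteness, a minimal sketch of that convert-back pattern (get_namespace_and_device and _convert_to_numpy follow scikit-learn's array API utilities; _numpy_confusion_matrix is a hypothetical stand-in for the NumPy/SciPy computation):

xp, _, device_ = get_namespace_and_device(y_true, y_pred)
# Compute with NumPy/SciPy on the CPU (this may imply a device transfer) ...
cm = _numpy_confusion_matrix(
    _convert_to_numpy(y_true, xp), _convert_to_numpy(y_pred, xp)
)
# ... then convert the result back to the input namespace and device.
cm = xp.asarray(cm, device=device_)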

@OmarManzoor (Contributor) commented Feb 16, 2025

I ran some benchmarks on a Kaggle kernel with CUDA using this branch, where in one case the current flow (returning a numpy array) is used, while in the other we convert back to the namespace and device when returning: return xp.asarray(cm, device=device_)

| Array Size (CUDA) | Return NumPy Timing | Return Namespace and Device Timing |
| --- | --- | --- |
| 1e3 | 0.002571168 | 0.002800589 |
| 1e4 | 0.005218287 | 0.005717518 |
| 1e5 | 0.010124249 | 0.011052353 |
| 1e6 | 0.041873491 | 0.044407668 |
| 1e7 | 0.571530447 | 0.586297677 |

By looking at this, I think we might as well return the confusion matrix in the original namespace and device.

@StefanieSenger (Contributor, Author)

Thanks, everyone, for your thoughts on the return type!

To summarize:

- The original decision (based on @ogrisel's suggestion) was to default to returning numpy arrays.
- @OmarManzoor's benchmarks show that converting back to the original namespace and device is a little less performant than keeping the results as numpy arrays, but the difference is small.
- @lesteve and @lucascolley brought up user experience and consistency across functions, so that users get the return types they expect.

So, should we finalise the PR by converting back to the input namespace and device, despite the minor performance difference? Or should we keep the numpy default?

I believe the advantages of converting back to the input namespace and device outweigh the costs, and in case we later judge differently for some reason, array API support is experimental and we could still change its behaviour without annoying users too much.

@ogrisel, since your suggestion was to convert to numpy, do you have any thoughts on whether we should stick with that or instead return arrays in the input namespace, based on the discussion?

@lucascolley (Contributor) commented Feb 16, 2025

So, should we finalise the PR by converting back to the input namespace and device, despite the minor performance difference?

+1 from me!

For clarity, what is the NumPy code (i.e. which lines) that we have been unable to write in a performant way with the standard (which caused the initial performance discussion)?

@OmarManzoor (Contributor)

For clarity, what is the NumPy code (i.e. which lines) that we have been unable to write in a performant way with the standard (which caused the initial performance discussion)?

Basically where we create the confusion matrix using a coo_matrix

cm = coo_matrix(
    (sample_weight, (y_true, y_pred)),
    shape=(n_labels, n_labels),
    dtype=dtype,
).toarray()
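
As a side note, coo_matrix works here because SciPy sums duplicate (row, col) coordinates when converting to a dense array, which is exactly the per-cell accumulation a confusion matrix needs. A small self-contained illustration (the toy data is made up):

import numpy as np
from scipy.sparse import coo_matrix

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
sample_weight = np.ones(4, dtype=np.int64)

# Duplicate coordinates, e.g. the two (1, 1) pairs, are summed on conversion.
cm = coo_matrix(
    (sample_weight, (y_true, y_pred)), shape=(2, 2), dtype=np.int64
).toarray()
print(cm)  # [[1 1]
           #  [0 2]]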

@lesteve (Member) commented Feb 18, 2025

So, should we finalise the PR by converting back to the input namespace and device, despite the minor performance difference?

This seems the least surprising option to me for now, and we can revisit it later if the need arises. I would think metrics computation is unlikely to be the bottleneck in most real-life use cases.

@OmarManzoor (Contributor)

@ogrisel Could you kindly provide your feedback so that we can finalize this PR?
