ENH Add array api for log_loss #30439
Conversation
The CUDA tests seem to be green 🟢
Quick feedback about the new `_allclose` helper. Otherwise LGTM.
```diff
-    y_pred_sum = y_pred.sum(axis=1)
-    if not np.allclose(y_pred_sum, 1, rtol=np.sqrt(eps)):
+    y_pred_sum = xp.sum(y_pred, axis=1)
+    if not _allclose(y_pred_sum, 1, rtol=np.sqrt(eps), xp=xp):
```
Note: since we are internally converting to numpy in all cases in the `_allclose` helper, we can use `np.sqrt` instead of `xp.sqrt` here, as this is just a scalar value and using `xp.sqrt` would require us to unnecessarily convert `eps` into an array to satisfy array-api-strict.
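For reference, a minimal sketch of what such an `_allclose` helper could look like; the name, signature, and defaults are illustrative rather than the PR's actual implementation, and it assumes the existing private `_convert_to_numpy` utility from `sklearn.utils._array_api`:

```python
import numbers

import numpy as np

from sklearn.utils._array_api import _convert_to_numpy  # private helper, path may change


def _allclose(a, b, rtol=1e-5, atol=1e-8, xp=None):
    # Convert array inputs to NumPy so a single np.allclose call works for any
    # array API namespace and device, at the cost of a device-to-host transfer.
    def to_numpy(value):
        if isinstance(value, numbers.Number):
            return value  # plain Python scalars (like the `1` above) need no conversion
        return _convert_to_numpy(value, xp=xp)

    return np.allclose(to_numpy(a), to_numpy(b), rtol=rtol, atol=atol)
```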
```diff
     check_consistent_length(y_pred, y_true, sample_weight)
     lb = LabelBinarizer()

     if labels is not None:
-        lb.fit(labels)
+        lb.fit(_convert_to_numpy(labels, xp=xp))
```
This somewhat misses the point of "supporting array API" here. I'd say we support the array API if we don't convert to NumPy, and here we do. So in effect, there's not much of an improvement with this PR. I think in order to get this merged, `LabelBinarizer` should support the array API.
I agree that would be better, but we still perform computations after the `LabelBinarizer` part. In particular the sums, clipping and `xlogy` might still bring some improvement, since scipy's `xlogy` supports the array API.
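To make that concrete, here is a hedged sketch of what the post-binarization part could look like against a generic array API namespace `xp`. The helper name is made up for illustration, and it assumes a SciPy version whose `scipy.special.xlogy` dispatches on array API inputs (which may require enabling SciPy's experimental array API support):

```python
from scipy.special import xlogy  # assumed to dispatch on array API inputs, per the discussion


def _per_sample_log_loss(y_true_onehot, y_pred, xp):
    # y_true_onehot: one-hot encoded labels (as floats), y_pred: predicted probabilities,
    # both arrays from the same namespace / device.
    eps = xp.finfo(y_pred.dtype).eps
    y_pred = xp.clip(y_pred, eps, 1 - eps)
    y_pred = y_pred / xp.sum(y_pred, axis=1, keepdims=True)  # renormalize rows
    return -xp.sum(xlogy(y_true_onehot, y_pred), axis=1)
```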
It might not be worth moving the data back and forth between devices.
Yes I think you are right.
I updated the description to reflect this.
I think that means we need to shelve this PR until we fix `LabelBinarizer`.
Closing for now, as this depends on `LabelBinarizer` supporting the array API.
The array API does not and will never support str objects in arrays, but we want to have (some) classifiers that accept str labels. So, to me, it would make sense to make log loss and other classification metrics accept array API inputs, even if it's just converting back to numpy internally. The computationally intensive part is in the processing of `y_pred` anyway.

For instance, we would like to be able to do something like:

```python
pipeline = make_pipeline(
    TableVectorizer(),  # feature engineering with heterogeneous types in pandas
    FunctionTransformer(func=lambda x: torch.tensor(x, dtype=torch.float32, device="cuda")),
    RidgeClassifier(),
)
cross_validate(
    pipeline,
    X_pandas_dataframe,
    y_pandas_series,
    scoring="neg_log_loss",
)
```

Note that the function transformer is in charge of moving the numerical features (extracted from the pandas dataframe) to the GPU.
But in the above example, the data passed to
If the
The
Similarly to the "y-follows-X" policy we want to implement in the estimators: for soft-classification metrics like log-loss or ROC AUC, `y_true` typically holds (string) class labels while `y_pred` holds floating-point scores:

```python
>>> from sklearn.metrics import log_loss
>>> log_loss(y_true=["a", "b", "a"], y_pred=[[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]], labels=["a", "b"])
0.18388253942874858
```

For log-loss we don't care much if we do the computation in NumPy. But for ROC-AUC, it might be interesting to do most of the computation using the namespace and device of `y_pred`. The result is a float scalar in most cases; I am not sure what the output namespace / device should be in case we output an array.
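To make the "y_pred drives the namespace/device" idea a bit more concrete, here is a hedged sketch on toy data; the conversion step and names are illustrative only, not this PR's implementation:

```python
import numpy as np
import torch
from sklearn.preprocessing import LabelBinarizer

y_true = ["a", "b", "c", "a"]
y_pred = torch.tensor(
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
)  # would typically live on a GPU

# Binarize the string labels with NumPy, then move the one-hot result over to
# y_pred's namespace and device.
y_onehot_np = LabelBinarizer().fit(["a", "b", "c"]).transform(y_true)
y_onehot = torch.asarray(y_onehot_np.astype(np.float32), device=y_pred.device)

# The remaining arithmetic can then stay in y_pred's namespace:
loss = -(y_onehot * torch.log(y_pred)).sum(dim=1).mean()
```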
Sorry, @OmarManzoor, can you explain that? Is it that moving data to the GPU is less costly than the other way around? Or do you mean that moving twice is worse than moving once? I'm asking to get some more information on data transfer between devices. I believe that with my CPU-only PC and Colab (where I have to start a dedicated runtime), I cannot try this out myself?
I don't think that moving to the GPU or from the GPU should be too different. But yes, moving back and forth is heavier than simply moving once and doing all the computations there.
Is it possible to maybe fix the device to be "cpu" inside the metric?
The impact of moving back and forth across devices really depends on how many times we do it (twice vs, say, hundreds of millions of times) and how much computation we do on the device (arithmetic intensity) compared to data transfers. It's best to do measurements with timeit in case of doubt.
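For illustration, a rough sketch of such a measurement; it assumes a CUDA-capable PyTorch install, and the shapes and repetition counts are arbitrary:

```python
import timeit

import numpy as np
import torch

data_np = np.random.default_rng(0).uniform(0, 1, size=(1_000_000, 5)).astype(np.float32)
data_gpu = torch.tensor(data_np, device="cuda")


def host_to_device():
    # Cost of one host -> device transfer.
    torch.tensor(data_np, device="cuda")
    torch.cuda.synchronize()


def on_device_sum():
    # Cost of a typical reduction once the data already lives on the device.
    torch.sum(data_gpu, dim=1)
    torch.cuda.synchronize()  # CUDA kernels are asynchronous; wait before stopping the clock


print("host -> device transfer:", timeit.timeit(host_to_device, number=20))
print("on-device sum:", timeit.timeit(on_device_sum, number=20))
```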
Yes. Currently, we don't have a place to put pre-defined data to use for these kinds of timeit tests, correct? Currently, we decide case by case what we are interested in performance-wise: whether it is the edge cases where the data is very big or has specific characteristics, or the normal use cases with what users would typically provide. And we decide case by case how we evaluate the trade-off between those different benchmarks, if I see this correctly. I believe that having a defined set of data to use for testing would be very helpful to compare results between different array libraries that implement the array API, and it would make discussing the results more efficient than the current status. What do you think about defining together the data that we want to test on and writing it down somewhere? People could then use it if they find it helpful (I certainly would).
In most cases, a quick CPU vs GPU perf check can be conducted with random data compatible with whatever the function to benchmark expects. Since functions have different expectations (as defined in their docstrings), it's hard to come up with generic test data.

For instance, if you want a 1D array with 1 million random floating-point values, both positive and negative:

```python
import numpy as np
import torch

data_np = np.random.default_rng(0).normal(size=int(1e6)).astype(np.float32)
data_torch = torch.tensor(data_np)
```

If you want 1 million random integers between 0 and 9 included:

```python
import numpy as np
import torch

data_np = np.random.default_rng(0).integers(0, 10, size=int(1e6)).astype(np.int32)
data_torch = torch.tensor(data_np)
```

If you need 2D data with rows that sum to 1 (e.g. to mimic predicted probabilities):

```python
import numpy as np
import torch

data_np = np.random.default_rng(0).uniform(0, 1, size=(int(1e7), 5)).astype(np.float32)
data_np /= data_np.sum(axis=1, keepdims=True)
data_torch = torch.tensor(data_np)
```

Depending on the functions we are benchmarking, we have to think about the typical data shapes that our users with access to GPUs will care about, and we need data that is large enough for the difference to be meaningfully measurable. Some functions are very scalable (linear complexity) and should therefore be benchmarked with datasets that are quite large (but still fit in host or device memory). Other functions are less scalable, e.g. n log(n) or even quadratic in the number of data points or features, so we need to adjust the test data size to keep the benchmark tractable.

In general, my rule of thumb is to feed input data that is large enough for the function to take at least a few seconds to run with the slower of the two alternatives being compared (e.g. when comparing execution on CPU with numpy vs a CUDA GPU with torch or cupy).
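Putting the pieces together, a benchmark for this PR might look roughly like the following. This is only a hedged sketch: it assumes a CUDA-capable PyTorch build, scikit-learn's array API dispatch being enabled, and this PR's array API support in `log_loss` (or an equivalent) being in place; whether the torch path is actually faster is exactly what the measurement is meant to establish:

```python
from time import perf_counter

import numpy as np
import torch

import sklearn
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y_pred_np = rng.uniform(0, 1, size=(int(1e7), 5)).astype(np.float32)
y_pred_np /= y_pred_np.sum(axis=1, keepdims=True)  # rows sum to 1, like predict_proba output
y_true_np = rng.integers(0, 5, size=int(1e7)).astype(np.int32)

y_pred_torch = torch.tensor(y_pred_np, device="cuda")
y_true_torch = torch.tensor(y_true_np, device="cuda")

with sklearn.config_context(array_api_dispatch=True):
    for name, (y_true, y_pred) in {
        "numpy (CPU)": (y_true_np, y_pred_np),
        "torch (CUDA)": (y_true_torch, y_pred_torch),
    }.items():
        tic = perf_counter()
        log_loss(y_true, y_pred)
        print(f"{name}: {perf_counter() - tic:.3f}s")
```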
Thank you, @ogrisel, that is very helpful. Edit: a day later I'm wondering: if we test array libraries against each other, should we then also use their random number generation instead of numpy's? Sorry for all the questions, I'm trying to make sense of all this, but it's getting better.
Random number generation is not part of the array API spec at the time of writing, so we should not naively assume that array API-compliant libraries implement the same RNG API as NumPy. Actually, they differ quite significantly. It might be possible to write a library-aware compatibility wrapper that tries to delegate to the underlying library-specific RNG implementations as much as possible, but this would be a lot of extra maintenance, so I would rather avoid it. So in the context of scikit-learn code, we should stick to using NumPy whenever we need pseudo-random numbers, and convert the result to the array namespace of the other data inputs when needed.

In the context of the benchmark code I shared above, we are not interested in benchmarking the random data generating process but only the function call. So we don't really care if the data is generated with NumPy and converted to PyTorch or directly generated with PyTorch. We could also do the opposite: generate the test data with the PyTorch API on the GPU and then convert a copy to NumPy on the CPU to benchmark the function with NumPy inputs. This does not matter much as long as we compare the function calls with the same input values.
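A minimal illustration of the "generate with NumPy, then convert" pattern, using `array_api_strict` as an arbitrary example namespace (any array API namespace such as torch or cupy would do):

```python
import numpy as np
import array_api_strict as xp  # stand-in for whichever namespace the other inputs use

rng = np.random.default_rng(0)
data_np = rng.normal(size=1000).astype(np.float32)  # pseudo-random numbers stay in NumPy land
data_xp = xp.asarray(data_np)  # convert to the namespace (and device) of the other inputs
```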
That both makes sense to me, thanks.
Here are some benchmarks, run on a P100 GPU (16 GB) with 29 GB of host RAM.
@OmarManzoor I don't understand what "array api cpu" and "using numpy cuda" mean. Could you please link to your notebook?
Since it's been a while, I might not have the exact specifications right now. However, I think that "using numpy" means that we simply convert to numpy at the start of the function and then use the code as it is, while "array api" means that we convert whatever we can to the array API (aside from the `LabelBinarizer` part, where we need to convert to numpy), which is, I think, the code in this PR. So the comparison is just to see whether we should even bother with the array API in `log_loss` or not.
Reference Issues/PRs
Towards #26024
What does this implement/fix? Explain your changes.
Note: As discussed in #28626, we handle conversion to numpy when using the `LabelBinarizer`, as it does not support the array API currently. However, this might not be feasible when we have to move data between devices. Therefore, this PR depends on `LabelBinarizer` supporting the array API in order to provide the expected performance gains.

Any other comments?
CC: @ogrisel @adrinjalali @betatim