ENH Add array api for log_loss #30439
Conversation
The CUDA tests seem to be green 🟢
Quick feedback about the new `_allclose` helper. Otherwise LGTM.
```diff
-    y_pred_sum = y_pred.sum(axis=1)
-    if not np.allclose(y_pred_sum, 1, rtol=np.sqrt(eps)):
+    y_pred_sum = xp.sum(y_pred, axis=1)
+    if not _allclose(y_pred_sum, 1, rtol=np.sqrt(eps), xp=xp):
```
Note: since we are internally converting to numpy in all cases in the `_allclose` helper, we can use `np.sqrt` instead of `xp.sqrt` here, as this is just a scalar value and using `xp.sqrt` would require us to unnecessarily convert `eps` into an array to satisfy array-api-strict.
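For reference, a minimal sketch of what such an `_allclose` helper could look like; the name, signature, and defaults are illustrative rather than the PR's actual implementation, and it assumes the existing private `_convert_to_numpy` utility from `sklearn.utils._array_api`:

```python
import numbers

import numpy as np

from sklearn.utils._array_api import _convert_to_numpy  # private helper, path may change


def _allclose(a, b, rtol=1e-5, atol=1e-8, xp=None):
    # Convert array inputs to NumPy so a single np.allclose call works for any
    # array API namespace and device, at the cost of a device-to-host transfer.
    def to_numpy(value):
        if isinstance(value, numbers.Number):
            return value  # plain Python scalars (like the `1` above) need no conversion
        return _convert_to_numpy(value, xp=xp)

    return np.allclose(to_numpy(a), to_numpy(b), rtol=rtol, atol=atol)
```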
```diff
     check_consistent_length(y_pred, y_true, sample_weight)
     lb = LabelBinarizer()

     if labels is not None:
-        lb.fit(labels)
+        lb.fit(_convert_to_numpy(labels, xp=xp))
```
This somewhat misses the point of "supporting array API" here. I'd say we support the array API if we don't convert to NumPy, and here we do. So in effect, there's not much of an improvement with this PR. I think in order to get this merged, `LabelBinarizer` should support the array API.
I agree that would be better, but we still perform computations after the `LabelBinarizer` part. In particular the sums, clipping and `xlogy` might still bring some improvement, since scipy's `xlogy` supports the array API.
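To make that concrete, here is a hedged sketch of what the post-binarization part could look like against a generic array API namespace `xp`. The helper name is made up for illustration, and it assumes a SciPy version whose `scipy.special.xlogy` dispatches on array API inputs (which may require enabling SciPy's experimental array API support):

```python
from scipy.special import xlogy  # assumed to dispatch on array API inputs, per the discussion


def _per_sample_log_loss(y_true_onehot, y_pred, xp):
    # y_true_onehot: one-hot encoded labels (as floats), y_pred: predicted probabilities,
    # both arrays from the same namespace / device.
    eps = xp.finfo(y_pred.dtype).eps
    y_pred = xp.clip(y_pred, eps, 1 - eps)
    y_pred = y_pred / xp.sum(y_pred, axis=1, keepdims=True)  # renormalize rows
    return -xp.sum(xlogy(y_true_onehot, y_pred), axis=1)
```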
It might not be worth moving the data back and forth between devices.
Yes I think you are right.
I updated the description to reflect this.
I think that means we need to shelve this PR until we fix `LabelBinarizer`.
Closing for now, as this depends on `LabelBinarizer` supporting the array API.
The array API does not and will never support str objects in arrays, but we want to have (some) classifiers that accept str labels. So, to me, it would make sense to make log loss and other classification metrics accept array API inputs, even if it's just converting back to numpy internally. The computationally intensive part is in the processing of `y_pred` anyway.

For instance, we would like to be able to do something like:

```python
pipeline = make_pipeline(
    TableVectorizer(),  # feature engineering with heterogeneous types in pandas
    FunctionTransformer(func=lambda x: torch.tensor(x, dtype=torch.float32, device="cuda")),
    RidgeClassifier(),
)
cross_validate(
    pipeline,
    X_pandas_dataframe,
    y_pandas_series,
    scoring="neg_log_loss",
)
```

Note that the function transformer is in charge of moving the numerical features (extracted from the pandas dataframe) to the GPU.
But in the above example, the data passed to
If the
The
Similarly to the "y-follows-X" policy we want to implement in the estimators: for soft-classification metrics like log-loss or ROC AUC, `y_true` typically holds (string) class labels while `y_pred` holds floating-point scores:

```python
>>> from sklearn.metrics import log_loss
>>> log_loss(y_true=["a", "b", "a"], y_pred=[[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]], labels=["a", "b"])
0.18388253942874858
```

For log-loss we don't care much if we do the computation in NumPy. But for ROC-AUC, it might be interesting to do most of the computation using the namespace and device of `y_pred`. The result is a float scalar in most cases; I am not sure what the output namespace / device should be in case we output an array.
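To make the "y_pred drives the namespace/device" idea a bit more concrete, here is a hedged sketch on toy data; the conversion step and names are illustrative only, not this PR's implementation:

```python
import numpy as np
import torch
from sklearn.preprocessing import LabelBinarizer

y_true = ["a", "b", "c", "a"]
y_pred = torch.tensor(
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
)  # would typically live on a GPU

# Binarize the string labels with NumPy, then move the one-hot result over to
# y_pred's namespace and device.
y_onehot_np = LabelBinarizer().fit(["a", "b", "c"]).transform(y_true)
y_onehot = torch.asarray(y_onehot_np.astype(np.float32), device=y_pred.device)

# The remaining arithmetic can then stay in y_pred's namespace:
loss = -(y_onehot * torch.log(y_pred)).sum(dim=1).mean()
```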
Sorry, @OmarManzoor, can you explain that? Is it that moving data to the GPU is less costly than the other way around? Or do you mean that moving twice is worse than moving once? I'm asking to get some more information on data transfer between devices. I believe that with my CPU-only PC and Colab (where I have to start a dedicated runtime), I cannot try this out myself?
I don't think that moving to the GPU or from the GPU should be too different. But yes, moving back and forth is heavier than simply moving once and doing all the computations there.
Is it possible to maybe fix the device to be "cpu" inside the metric?
The impact of moving back and forth across devices really depends on how many times we do it (twice vs, say, hundreds of millions of times) and how much computation we do on the device (arithmetic intensity) compared to data transfers. It's best to do measurements with timeit in case of doubt.
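For illustration, a rough sketch of such a measurement; it assumes a CUDA-capable PyTorch install, and the shapes and repetition counts are arbitrary:

```python
import timeit

import numpy as np
import torch

data_np = np.random.default_rng(0).uniform(0, 1, size=(1_000_000, 5)).astype(np.float32)
data_gpu = torch.tensor(data_np, device="cuda")


def host_to_device():
    # Cost of one host -> device transfer.
    torch.tensor(data_np, device="cuda")
    torch.cuda.synchronize()


def on_device_sum():
    # Cost of a typical reduction once the data already lives on the device.
    torch.sum(data_gpu, dim=1)
    torch.cuda.synchronize()  # CUDA kernels are asynchronous; wait before stopping the clock


print("host -> device transfer:", timeit.timeit(host_to_device, number=20))
print("on-device sum:", timeit.timeit(on_device_sum, number=20))
```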
Yes. Currently, we don't have a place to put pre-defined data to use for these kinds of timeit tests, correct? Currently, we decide case by case what we are interested in performance-wise: whether it is the edge cases where the data is very big or has specific characteristics, or the normal use cases with what users would typically provide. And we decide case by case how we evaluate the trade-off between those different benchmarks, if I see this correctly. I believe that having a defined set of data to use for testing would be very helpful to compare results between different array libraries that implement the array API, and it would make discussing the results more efficient than the current status. What do you think about defining together the data that we want to test on and writing it down somewhere? People could then use it if they find it helpful (I certainly would).
In most cases, a quick CPU vs GPU perf check can be conducted with random data compatible with whatever the function to benchmark expects. Since functions have different expectations (as defined in their docstrings), it's hard to come up with generic test data.

For instance, if you want a 1D array with 1 million random floating-point values, both positive and negative:

```python
import numpy as np
import torch

data_np = np.random.default_rng(0).normal(size=int(1e6)).astype(np.float32)
data_torch = torch.tensor(data_np)
```

If you want 1 million random integers between 0 and 9 included:

```python
import numpy as np
import torch

data_np = np.random.default_rng(0).integers(0, 10, size=int(1e6)).astype(np.int32)
data_torch = torch.tensor(data_np)
```

If you need 2D data with rows that sum to 1 (e.g. to mimic predicted probabilities):

```python
import numpy as np
import torch

data_np = np.random.default_rng(0).uniform(0, 1, size=(int(1e7), 5)).astype(np.float32)
data_np /= data_np.sum(axis=1, keepdims=True)
data_torch = torch.tensor(data_np)
```

Depending on the functions we are benchmarking, we have to think about the typical data shapes that our users with access to GPUs will care about, and we need data that is large enough for the difference to be meaningfully measurable. Some functions are very scalable (linear complexity) and should therefore be benchmarked with datasets that are quite large (but still fit in host or device memory). Other functions are less scalable, e.g. n log(n) or even quadratic in the number of data points or features, so we need to adjust the test data size to keep the benchmark tractable.

In general, my rule of thumb is to feed input data that is large enough for the function to take at least a few seconds to run with the slower of the two alternatives being compared (e.g. when comparing execution on CPU with numpy vs a CUDA GPU with torch or cupy).
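Putting the pieces together, a benchmark for this PR might look roughly like the following. This is only a hedged sketch: it assumes a CUDA-capable PyTorch build, scikit-learn's array API dispatch being enabled, and this PR's array API support in `log_loss` (or an equivalent) being in place; whether the torch path is actually faster is exactly what the measurement is meant to establish:

```python
from time import perf_counter

import numpy as np
import torch

import sklearn
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y_pred_np = rng.uniform(0, 1, size=(int(1e7), 5)).astype(np.float32)
y_pred_np /= y_pred_np.sum(axis=1, keepdims=True)  # rows sum to 1, like predict_proba output
y_true_np = rng.integers(0, 5, size=int(1e7)).astype(np.int32)

y_pred_torch = torch.tensor(y_pred_np, device="cuda")
y_true_torch = torch.tensor(y_true_np, device="cuda")

with sklearn.config_context(array_api_dispatch=True):
    for name, (y_true, y_pred) in {
        "numpy (CPU)": (y_true_np, y_pred_np),
        "torch (CUDA)": (y_true_torch, y_pred_torch),
    }.items():
        tic = perf_counter()
        log_loss(y_true, y_pred)
        print(f"{name}: {perf_counter() - tic:.3f}s")
```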
Thank you, @ogrisel, that is very helpful. Edit: a day later I'm wondering: if we test array libraries against each other, should we then also use their random number generation instead of numpy's? Sorry for all the questions, I'm trying to make sense of all this, but it's getting better.
Random number generation is not part of the array API spec at the time of writing, so we should not naively assume that array API-compliant libraries implement the same RNG API as NumPy. Actually, they differ quite significantly. It might be possible to write a library-aware compatibility wrapper that tries to delegate to the underlying library-specific RNG implementations as much as possible, but this would be a lot of extra maintenance, so I would rather avoid it. So in the context of scikit-learn code, we should stick to using NumPy whenever we need pseudo-random numbers, and convert the result to the array namespace of the other data inputs when needed.

In the context of the benchmark code I shared above, we are not interested in benchmarking the random data generating process but only the function call. So we don't really care if the data is generated with NumPy and converted to PyTorch or directly generated with PyTorch. We could also do the opposite: generate the test data with the PyTorch API on the GPU and then convert a copy to NumPy on the CPU to benchmark the function with NumPy inputs. This does not matter much as long as we compare the function calls with the same input values.
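A minimal illustration of the "generate with NumPy, then convert" pattern, using `array_api_strict` as an arbitrary example namespace (any array API namespace such as torch or cupy would do):

```python
import numpy as np
import array_api_strict as xp  # stand-in for whichever namespace the other inputs use

rng = np.random.default_rng(0)
data_np = rng.normal(size=1000).astype(np.float32)  # pseudo-random numbers stay in NumPy land
data_xp = xp.asarray(data_np)  # convert to the namespace (and device) of the other inputs
```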
That both makes sense to me, thanks.
Here are some benchmarks, run on a P100 GPU (16 GB) with 29 GB of host RAM.
@OmarManzoor I don't understand what "array api cpu" and "using numpy cuda" mean. Could you please link to your notebook?
Since it's been a while, I might not have the exact specifications right now. However, I think that "using numpy" means that we simply convert to numpy at the start of the function and then use the code as it is, while "array api" means that we convert whatever we can to the array API (aside from the `LabelBinarizer` part, where we need to convert to numpy), which is, I think, the code in this PR. So the comparison is just to see whether we should even bother with the array API in `log_loss` or not.
Reference Issues/PRs
Towards #26024
What does this implement/fix? Explain your changes.
Note: As discussed in #28626, we handle conversion to numpy when using the `LabelBinarizer`, as it does not support the array API currently. However, this might not be feasible when we have to move data between devices. Therefore, this PR depends on `LabelBinarizer` supporting the array API in order to provide the expected performance gains.

Any other comments?
CC: @ogrisel @adrinjalali @betatim