
Array API support for pairwise kernels #29822


Open · wants to merge 43 commits into main from array_API_pairwise_kernels

Conversation

EmilyXinyi
Contributor

Reference Issues/PRs

(hopefully) unblocks progress with #29661

What does this implement/fix? Explain your changes.

Adding array API support for pairwise_kernels

To do

  • _parallel_pairwise
  • _pairwise_callable
  • Add tests in test_pairwise.py (pairwise_distances and pairwise_kernels in particular)
  • Changelog


github-actions bot commented Sep 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7225996.

@ogrisel changed the title from "Array API supprt for pairwise kernels" to "Array API support for pairwise kernels" on Sep 13, 2024
@ogrisel
Member

ogrisel commented Sep 13, 2024

Note that I wouldn't mind specializing this PR to pairwise kernels only and dealing with distances in a follow-up PR if it can help simplify things.

@lucyleeow
Member

@EmilyXinyi are you still interested in working on this? Thanks!

@ogrisel
Member

ogrisel commented Feb 24, 2025

@EmilyXinyi: to be more explicit, @lucyleeow can allocate bandwidth to work on stalled array API PRs so she could help on this one.

There is also #29431 that is possibly blocking many other estimators so it would be great to share the work on those two PRs to unblock progress on array API support.

@EmilyXinyi
Contributor Author

@lucyleeow please take this one over if it's blocking other estimators! Thank you for your help!

@ogrisel Thank you for the clarification!

@lucyleeow
Member

@EmilyXinyi I'm happy to continue working on your branch, and you can always contribute as well if you wish?

@EmilyXinyi
Contributor Author

@lucyleeow that would be awesome, thank you!

@lucyleeow
Member

Maybe @betatim would be interested in taking a look?

Comment on lines +371 to +372
# Why 5 and not more? this seems to still result in a lot of 0 values?
X_np = np.array(5 * rng.random_sample((5, 4)), dtype=dtype_name)
Member

Not directly related to this PR, I think the 5 * is to prevent lots of 0 values when dtype is int in the test_pairwise_parallel test above. While fiddling I noticed that even 5 * results in a lot of 0 values, and wondered why not just increase it to e.g., 10?

@lucyleeow lucyleeow force-pushed the array_API_pairwise_kernels branch from 8460188 to 7225996 Compare May 26, 2025 05:50
@betatim
Member

betatim commented May 27, 2025

I was looking at the code touched by this PR and tried to remember "stuff" about it. One thought I had was "aren't matrix products one of the most massively optimised operations ever (on a GPU)?". For example in the euclidean distance computation we run a big matrix product, which I assume is where all the time is spent. Which then made me wonder if we should or shouldn't chunk the work - my assumption being that torch's @ will already do chunking or other magic to make it fast. Which made me realise that I don't fully remember why we chunk: is it computational performance or to save memory?
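
(For context, the trick in euclidean_distances is roughly the following; a minimal numpy sketch of the identity, not the actual scikit-learn implementation:)

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 36), dtype=np.float32)
Y = rng.random((2000, 36), dtype=np.float32)

# ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, so the pairwise matrix reduces
# to one big matrix product plus two vectors of row norms.
XX = np.einsum("ij,ij->i", X, X)[:, None]  # (1000, 1)
YY = np.einsum("ij,ij->i", Y, Y)[None, :]  # (1, 2000)
D2 = XX - 2.0 * (X @ Y.T) + YY             # squared distances, (1000, 2000)
D = np.sqrt(np.maximum(D2, 0.0))           # clip tiny negatives from rounding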

Does someone remember/can give some context?

@ogrisel
Member

ogrisel commented Jun 5, 2025

Which made me realise that I don't fully remember why we chunk: is it computational performance or to save memory?

I think the main objective was to save memory and the secondary objective could be to optimize the use of CPU caches. I agree that for GPUs, using larger chunk sizes might make sense. It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUDA or CuPy vs numpy CPU.
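
A rough sketch of what such a benchmark could look like (hypothetical sizes; assumes a CUDA build of torch and this branch's array API support; in _parallel_pairwise the number of chunks is driven by n_jobs via gen_even_slices, so n_jobs acts as a proxy for chunk size):

import time

import numpy as np
import torch

from sklearn import config_context
from sklearn.metrics.pairwise import pairwise_kernels

X_np = np.random.rand(20_000, 36).astype(np.float32)
X_cuda = torch.asarray(X_np, device="cuda")

for X, label in [(X_np, "numpy/CPU"), (X_cuda, "torch/CUDA")]:
    for n_jobs in [1, 2, 4, 8]:
        with config_context(array_api_dispatch=True):
            tic = time.perf_counter()
            pairwise_kernels(X, metric="linear", n_jobs=n_jobs)
            if label == "torch/CUDA":
                torch.cuda.synchronize()  # wait for GPU work before stopping the clock
            print(f"{label}, n_jobs={n_jobs}: {time.perf_counter() - tic:.3f}s")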

@lucyleeow
Member

I agree that for GPUs, using larger chunk sizes might make sense.

But would it be easy to determine whether an input array is on GPU? Because for CPU, we'd still want to chunk as we currently do, right?

Parallel(backend="threading", n_jobs=n_jobs)(
-    fd(func, ret, s, X, Y[s], **kwds)
+    fd(func, ret, s, X, Y[s, ...], **kwds)
     for s in gen_even_slices(_num_samples(Y), effective_n_jobs(n_jobs))
Member

It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUDA or CuPy vs numpy CPU.

Just to confirm, we're talking about these chunk sizes right? 😬

@betatim
Member

betatim commented Jun 6, 2025

But would it be easy to determine whether an input array is on GPU? Because for CPU, we'd still want to chunk as we currently do, right?

I'd leave it as an "exercise for someone from the future". At least my immediate thought about it is that it isn't "hard" but ugly. You could have a hardcoded list of (namespace, device) combinations that you know are on a GPU ([(torch, "cuda"), (cupy, None), (torch, "mps"), ...]) but that seems like a hack. So for now I'd leave it as is.
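
Something like this, just to make the hack concrete (a hypothetical helper, names made up, not something I'd propose to merge):

def _looks_like_gpu_array(xp, device):
    # Hardcoded knowledge of which namespace/device combinations sit on a GPU.
    name = getattr(xp, "__name__", "")
    if "cupy" in name:
        return True
    if "torch" in name and str(device).startswith(("cuda", "mps")):
        return True
    return False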

It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUCDA or CuPy vs numpy CPU.

I might attempt this

@betatim
Member

betatim commented Jun 6, 2025

Tried some benchmarking but with interesting(?) sizes of X and pairwise_distances(X) I quickly ran out of memory on my GPU. I used X = rng.rand(50_000, 36) and moved that to a torch array with device="cuda". torch.cdist works with this size of X. Not really sure what all this means for this PR/benchmarking :-/
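
For concreteness, roughly this setup (a sketch; the exact seed and the config_context usage are filled-in assumptions):

import numpy as np
import torch

from sklearn import config_context
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.RandomState(0)
X_cuda = torch.asarray(rng.rand(50_000, 36), device="cuda")

torch.cdist(X_cuda, X_cuda)  # works at this size

with config_context(array_api_dispatch=True):
    pairwise_distances(X_cuda)  # this is where the OOM happens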

@OmarManzoor
Contributor

@betatim Do you run out of memory when you consider the current code or when you remove the chunking from the current code?

@betatim
Member

betatim commented Jun 10, 2025

I'm using this PR. If I use the CPU I don't run out of memory, on the GPU I run out of memory.

@OmarManzoor
Contributor

I'm using this PR. If I use the CPU I don't run out of memory, on the GPU I run out of memory.

How much memory do you have on the GPU?

@betatim
Member

betatim commented Jun 10, 2025

I have 49140MiB, which I think is a "large amount" of memory for an X with shape (50_000, 36). Especially because the CPU version of pairwise_distances takes a few seconds, which means that sizes like this or bigger are where it gets interesting to use a GPU. At least that is my intuition. Which is why I am a bit lost about what all this means, as it seems we'd have to change more of this code to make that use-case possible?? What do you think?

edit: maybe the thing to agree on is, if the use-case I have in mind is realistic?

@OmarManzoor
Contributor

OmarManzoor commented Jun 10, 2025

I have 49140MiB, which I think is a "large amount" of memory for an X with shape (50_000, 36)

This is a lot! I agree with you that this code needs to be debugged further on the GPU to see precisely where this is running out of memory. Could it be that we are using Parallel threading with a certain number of jobs and that is not playing well with the GPU?

In any case I don't think there is any point removing the chunking in this PR for now.

Edit: The use case you are trying is realistic in my opinion. I think it should be able to handle at least this much: >100,000 samples and 100 features.

@betatim
Member

betatim commented Jun 10, 2025

I agree that we shouldn't remove/change/do anything to the chunking for now.

I think by default (if you don't pass n_jobs) you get "one job" - so no parallelism.
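
(If I read the docs right, n_jobs=None means 1 outside of a joblib.parallel_backend context, which is what the effective_n_jobs check in _parallel_pairwise picks up:)

from joblib import effective_n_jobs

effective_n_jobs(None)  # -> 1, so the unchunked single-job path is taken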

Do you think that my use-case makes sense? And the idea that 50_000 samples is somehow "where maybe it starts getting interesting"?

@lucyleeow
Member

FYI there is #31445, which avoids use of reshape (thus avoiding a copy during reshape as the array is non-C-contiguous), though I am not confident that would be the cause of the difference in memory usage.

@OmarManzoor
Contributor

@betatim I don't have a GPU with the size that you have available in order to debug this but I did try with an mps device.

This is the code I am using

from sklearn.metrics.pairwise import pairwise_kernels
import torch as xp

from sklearn._config import config_context


X = xp.rand((50000, 30), device="mps")
Y = xp.rand(50000, 30, device="mps")
with config_context(array_api_dispatch=True):
    pairwise_kernels(X, Y, n_jobs=1)

If we use n_jobs=1 we don't do any sort of chunking and work with the original arrays. In this case the code breaks in the safe_sparse_dot function where we directly multiply the arrays

    ret = a @ b
          ~~^~~
RuntimeError: Invalid buffer size: 9.31 GB

If we use n_jobs >= 2 the code breaks when we try to allocate an empty array of zeros of size 50000 x 50000

ret = xp.empty((X.shape[0], Y.shape[0]), device=device, dtype=dtype_float).T

    return torch.empty(shape, dtype=dtype, device=device, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid buffer size: 9.31 GB

@ogrisel
Member

ogrisel commented Jun 10, 2025

@betatim you might want to try the approach documented in: https://pytorch.org/blog/understanding-gpu-memory-1/
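
For reference, the recording part of that post boils down to something like this (a sketch; these are private torch APIs and need a fairly recent torch release):

import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the pairwise_kernels / pairwise_distances call you want to profile ...

torch.cuda.memory._dump_snapshot("pairwise_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording

# then drop the .pickle file onto https://pytorch.org/memory_viz to inspect it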

@lucyleeow
Member

Asking the stupid question first: for an array of 50_000 by 50_000 float32 (4 bytes), 9.31GB seems to be the correct size?
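
(Checking the arithmetic for a float32 result matrix:)

50_000 * 50_000 * 4 / 2**30  # bytes of a (50_000, 50_000) float32 array, in GiB
# = 9.313225746154785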

Assuming you have enough GPU memory, what could the problem be? That it needs to be contiguous memory?

Thinking about the parallel case, even if we chunk the ret = xp.empty part, we still have to combine them in the end...

I used memory_profiler to check memory usage on a numpy array of the same size and got:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5    129.1 MiB    129.1 MiB           1   @profile
     6                                         def compute_pairwise():
     7                                             # Create two random arrays
     8    140.5 MiB     11.4 MiB           1       X = np.random.rand(50000, 30)
     9    152.0 MiB     11.5 MiB           1       Y = np.random.rand(50000, 30)
    10                                             
    11                                             # Compute pairwise kernels
    12  19220.4 MiB  19068.4 MiB           1       result = pairwise_kernels(X, Y, n_jobs=1)
    13  19220.4 MiB     -0.0 MiB           1       return result

(note that these would be float64 vs torch's float32 above so size checks out)

I wonder if this is out of scope for this particular PR, and raises the more general question of how to handle larger computations/computations on GPU (e.g., write output to disk?), where we have previously mostly focused on running models on a standard laptop on data of size <10GB.

@OmarManzoor
Contributor

OmarManzoor commented Jun 11, 2025

@lucyleeow I agree that this is more related to how to manage memory better to accommodate large computations on GPUs more generally. For my particular case I don't have enough memory so that makes sense.

However @betatim has reported a GPU memory size of ~ 49 GB which is quite a lot. It would be nice to get some feedback in that particular case on where the code runs out of memory.

In any case I think we can merge this PR and investigate the large computations later. What do you think @betatim?

@betatim
Member

betatim commented Jun 11, 2025

Yes, I think this is a bigger can of worms than this PR.

I also feel like I might be missing something or otherwise getting things mixed up. For example, I'm not sure we are using "too much" memory - in the sense of being wasteful. It is more that I am surprised that I get OOM so easily, with such a small problem - makes me wonder "what is the point of all this if the GPU can't handle the cases that are actually interesting??"

@OmarManzoor
Contributor

OmarManzoor commented Jun 11, 2025

Yes, I think this is a bigger can of worms than this PR.

I also feel like I might be missing something or otherwise getting things mixed up. For example, I'm not sure we are using "too much" memory - in the sense of being wasteful. It is more that I am surprised that I get OOM so easily, with such a small problem - makes me wonder "what is the point of all this if the GPU can't handle the cases that are actually interesting??"

Well I guess batching is required in all such cases to manage the memory efficiently. Do you have any idea at what point your 49 GB GPU runs out of memory? I think you are using n_jobs = 1 so we don't even enter the code where we create batches and use multi threading. So we are working with the complete arrays. But I feel that if the matrix multiplication of the complete arrays ran out of memory on an M1 GPU, it should not really crash at the same point when we have 49 GB available.

@lucyleeow
Member

Especially because the CPU version of pairwise_distances takes a few seconds, which means that sizes like this or bigger is where it gets interesting to use a GPU.

Totally agree with your sentiment. Seems reasonable.

I spent too much time on this but out of interest I tried to check memory usage of torch.cdist vs our pairwise_kernel, with torch's memory snapshot as described in the blog Olivier linked above.

Using:

    X_torch = torch.rand(1000, 1000, device="cuda")
    Y_torch = torch.rand(1000, 1000, device="cuda")

(Note: I used _ = torch.cdist(X_torch[:1], Y_torch[:1]) to warm up CUDA, as suggested by Claude 🤷)

This is pairwise_kernel(X_torch, Y_torch):

[memory snapshot screenshot]

Note all of the blocks are 3.8MB (which is the size of a 1000 by 1000 array of float32) except one that is 8.1MB. Only the green block was allocated after the function starts (i.e. after generating the random arrays). Max memory was 50MB.

This is torch.cdist(X_torch, Y_torch)

[memory snapshot screenshot]

Again all blocks are 3.8MB except one of 8.1MB. Max memory was 57.7MB. All blocks above the long blue one were allocated after the function starts.

Would be interested in your OOM case @betatim. Don't quote me on doing all this correctly (lots of trial and error), but it seems that we don't use more memory than torch.cdist - so would be interested in your case to see why pairwise_kernel gave OOM but torch.cdist didn't...

For reference here is the same sized array in numpy

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    127.6 MiB    127.6 MiB           1   @profile
     7                                         def compute_pairwise():
     8                                             # Create two random arrays
     9    131.7 MiB      4.1 MiB           1       X = np.random.rand(1000, 1000).astype(np.float32)
    10    142.6 MiB     10.9 MiB           1       Y = np.random.rand(1000, 1000).astype(np.float32)
    11                                             
    12                                             # Compute pairwise kernels
    13    147.3 MiB      4.7 MiB           1       result = pairwise_kernels(X, Y, n_jobs=1)
    14    147.3 MiB      0.0 MiB           1       return result

Labels: Array API, module:metrics, Waiting for Second Reviewer