
Array API support for pairwise kernels #29822


Open · wants to merge 43 commits into main from array_API_pairwise_kernels

Conversation

EmilyXinyi
Contributor

Reference Issues/PRs

(hopefully) unblocks progress with #29661

What does this implement/fix? Explain your changes.

Adding array API support for pairwise_kernels

To do

  • _parallel_pairwise
  • _pairwise_callable
  • Add tests in test_pairwise.py (pairwise_distances and pairwise_kernels in particular)
  • Changelog


github-actions bot commented Sep 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7225996.

@ogrisel changed the title from "Array API supprt for pairwise kernels" to "Array API support for pairwise kernels" on Sep 13, 2024
@ogrisel
Member

ogrisel commented Sep 13, 2024

Note that I wouldn't mind specializing this PR to pairwise kernels only and dealing with distances in a follow-up PR if it can help simplify things.

@lucyleeow
Member

@EmilyXinyi are you still interested in working on this? Thanks!

@ogrisel
Member

ogrisel commented Feb 24, 2025

@EmilyXinyi: to be more explicit, @lucyleeow can allocate bandwidth to work on stalled array API PRs so she could help on this one.

There is also #29431 that is possibly blocking many other estimators so it would be great to share the work on those two PRs to unblock progress on array API support.

@EmilyXinyi
Contributor Author

@lucyleeow please take this one over if it's blocking other estimators! Thank you for your help!

@ogrisel Thank you for the clarification!

@lucyleeow
Member

@EmilyXinyi I'm happy to continue working on your branch, and you can always contribute as well if you wish?

@EmilyXinyi
Contributor Author

@lucyleeow that would be awesome, thank you!

@lucyleeow
Member

Maybe @betatim would be interested in taking a look?

Comment on lines +371 to +372
# Why 5 and not more? this seems to still result in a lot of 0 values?
X_np = np.array(5 * rng.random_sample((5, 4)), dtype=dtype_name)
Member

Not directly related to this PR, I think the 5 * is to prevent lots of 0 values when dtype is int in the test_pairwise_parallel test above. While fiddling I noticed that even 5 * results in a lot of 0 values, and wondered why not just increase it to e.g., 10?

@lucyleeow lucyleeow force-pushed the array_API_pairwise_kernels branch from 8460188 to 7225996 Compare May 26, 2025 05:50
@betatim
Member

betatim commented May 27, 2025

I was looking at the code touched by this PR and tried to remember "stuff" about it. One thought I had was "aren't matrix products one of the most massively optimised operations ever (on a GPU)?". For example in the euclidean distance computation we run a big matrix product, which I assume is where all the time is spent. Which then made me wonder if we should or shouldn't chunk the work - my assumption being that torch's @ will already do chunking or other magic to make it fast. Which made me realise that I don't fully remember why we chunk: is it computational performance or to save memory?
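
(For context, the trick in euclidean_distances is roughly the following; a minimal numpy sketch of the identity, not the actual scikit-learn implementation:)

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 36), dtype=np.float32)
Y = rng.random((2000, 36), dtype=np.float32)

# ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, so the pairwise matrix reduces
# to one big matrix product plus two vectors of row norms.
XX = np.einsum("ij,ij->i", X, X)[:, None]  # (1000, 1)
YY = np.einsum("ij,ij->i", Y, Y)[None, :]  # (1, 2000)
D2 = XX - 2.0 * (X @ Y.T) + YY             # squared distances, (1000, 2000)
D = np.sqrt(np.maximum(D2, 0.0))           # clip tiny negatives from rounding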

Does someone remember/can give some context?

@ogrisel
Member

ogrisel commented Jun 5, 2025

Which made me realise that I don't fully remember why we chunk: is it computational performance or to save memory?

I think the main objective was to save memory and the secondary objective could be to optimize the use of CPU caches. I agree that for GPUs, using larger chunk sizes might make sense. It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUDA or CuPy vs numpy CPU.
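
A rough sketch of what such a benchmark could look like (hypothetical sizes; assumes a CUDA build of torch and this branch's array API support; in _parallel_pairwise the number of chunks is driven by n_jobs via gen_even_slices, so n_jobs acts as a proxy for chunk size):

import time

import numpy as np
import torch

from sklearn import config_context
from sklearn.metrics.pairwise import pairwise_kernels

X_np = np.random.rand(20_000, 36).astype(np.float32)
X_cuda = torch.asarray(X_np, device="cuda")

for X, label in [(X_np, "numpy/CPU"), (X_cuda, "torch/CUDA")]:
    for n_jobs in [1, 2, 4, 8]:
        with config_context(array_api_dispatch=True):
            tic = time.perf_counter()
            pairwise_kernels(X, metric="linear", n_jobs=n_jobs)
            if label == "torch/CUDA":
                torch.cuda.synchronize()  # wait for GPU work before stopping the clock
            print(f"{label}, n_jobs={n_jobs}: {time.perf_counter() - tic:.3f}s")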

@lucyleeow
Member

I agree that for GPUs, using larger chunk sizes might make sense.

But would it be easy to determine whether an input array is on GPU? Because for CPU, we'd still want to chunk as we currently do, right?

Parallel(backend="threading", n_jobs=n_jobs)(
-    fd(func, ret, s, X, Y[s], **kwds)
+    fd(func, ret, s, X, Y[s, ...], **kwds)
     for s in gen_even_slices(_num_samples(Y), effective_n_jobs(n_jobs))
Member

It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUDA or CuPy vs numpy CPU.

Just to confirm, we're talking about these chunk sizes right? 😬

@betatim
Member

betatim commented Jun 6, 2025

But would it be easy to determine whether an input array is on GPU? Because for CPU, we'd still want to chunk as we currently do, right?

I'd leave it as an "exercise for someone from the future". At least my immediate thought about it is that it isn't "hard" but ugly. You could have a hardcoded list of (namespace, device) combinations that you know are on a GPU ([(torch, "cuda"), (cupy, None), (torch, "mps"), ...]) but that seems like a hack. So for now I'd leave it as is.
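
Something like this, just to make the hack concrete (a hypothetical helper, names made up, not something I'd propose to merge):

def _looks_like_gpu_array(xp, device):
    # Hardcoded knowledge of which namespace/device combinations sit on a GPU.
    name = getattr(xp, "__name__", "")
    if "cupy" in name:
        return True
    if "torch" in name and str(device).startswith(("cuda", "mps")):
        return True
    return False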

It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUCDA or CuPy vs numpy CPU.

I might attempt this

@betatim
Member

betatim commented Jun 6, 2025

Tried some benchmarking but with interesting(?) sizes of X and pairwise_distances(X) I quickly ran out of memory on my GPU. I used X = rng.rand(50_000, 36) and moved that to a torch array with device="cuda". torch.cdist works with this size of X. Not really sure what all this means for this PR/benchmarking :-/
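
For concreteness, roughly this setup (a sketch; the exact seed and the config_context usage are filled-in assumptions):

import numpy as np
import torch

from sklearn import config_context
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.RandomState(0)
X_cuda = torch.asarray(rng.rand(50_000, 36), device="cuda")

torch.cdist(X_cuda, X_cuda)  # works at this size

with config_context(array_api_dispatch=True):
    pairwise_distances(X_cuda)  # this is where the OOM happens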

@OmarManzoor
Contributor

@betatim Do you run out of memory when you consider the current code or when you remove the chunking from the current code?

@betatim
Member

betatim commented Jun 10, 2025

I'm using this PR. If I use the CPU I don't run out of memory, on the GPU I run out of memory.

@OmarManzoor
Contributor

I'm using this PR. If I use the CPU I don't run out of memory, on the GPU I run out of memory.

How much memory do you have on the GPU?

@betatim
Member

betatim commented Jun 10, 2025

I have 49140MiB, which I think is a "large amount" of memory for an X with shape (50_000, 36). Especially because the CPU version of pairwise_distances takes a few seconds, which means that sizes like this or bigger are where it gets interesting to use a GPU. At least that is my intuition. Which is why I am a bit lost about what all this means, as it seems we'd have to change more of this code to make that use-case possible?? What do you think?

edit: maybe the thing to agree on is, if the use-case I have in mind is realistic?

@OmarManzoor
Contributor

OmarManzoor commented Jun 10, 2025

I have 49140MiB, which I think is a "large amount" of memory for an X with shape (50_000, 36)

This is a lot! I agree with you that this code needs to be debugged further on the GPU to see precisely where this is running out of memory. Could it be that we are using Parallel threading with a certain number of jobs and that is not playing well with the GPU?

In any case I don't think there is any point removing the chunking in this PR for now.

Edit: The use case you are trying is realistic in my opinion. I think it should be able to handle at least this much: >100,000 samples and 100 features.

@betatim
Member

betatim commented Jun 10, 2025

I agree that we shouldn't remove/change/do anything to the chunking for now.

I think by default (if you don't pass n_jobs) you get "one job" - so no parallelism.
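
(If I read the docs right, n_jobs=None means 1 outside of a joblib.parallel_backend context, which is what the effective_n_jobs check in _parallel_pairwise picks up:)

from joblib import effective_n_jobs

effective_n_jobs(None)  # -> 1, so the unchunked single-job path is taken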

Do you think that my use-case makes sense? And the idea that 50_000 samples is somehow "where maybe it starts getting interesting"?

@lucyleeow
Member

FYI there is #31445, which avoids use of reshape (thus avoiding a copy during reshape as the array is non-C-contiguous), though I am not confident that would be the cause of the difference in memory usage.

@OmarManzoor
Contributor

@betatim I don't have a GPU with the size that you have available in order to debug this but I did try with an mps device.

This is the code I am using

from sklearn.metrics.pairwise import pairwise_kernels
import torch as xp

from sklearn._config import config_context


X = xp.rand((50000, 30), device="mps")
Y = xp.rand(50000, 30, device="mps")
with config_context(array_api_dispatch=True):
    pairwise_kernels(X, Y, n_jobs=1)

If we use n_jobs=1 we don't do any sort of chunking and work with the original arrays. In this case the code breaks in the safe_sparse_dot function where we directly multiply the arrays

    ret = a @ b
          ~~^~~
RuntimeError: Invalid buffer size: 9.31 GB

If we use n_jobs >= 2 the code breaks when we try to allocate an empty array of zeros of size 50000 x 50000

ret = xp.empty((X.shape[0], Y.shape[0]), device=device, dtype=dtype_float).T

    return torch.empty(shape, dtype=dtype, device=device, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid buffer size: 9.31 GB

@ogrisel
Member

ogrisel commented Jun 10, 2025

@betatim you might want to try the approach documented in: https://pytorch.org/blog/understanding-gpu-memory-1/
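
For reference, the recording part of that post boils down to something like this (a sketch; these are private torch APIs and need a fairly recent torch release):

import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the pairwise_kernels / pairwise_distances call you want to profile ...

torch.cuda.memory._dump_snapshot("pairwise_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording

# then drop the .pickle file onto https://pytorch.org/memory_viz to inspect it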

@lucyleeow
Member

Asking the stupid question first: for an array of 50_000 by 50_000 float32 (4 bytes), 9.31GB seems to be the correct size?
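
(Checking the arithmetic for a float32 result matrix:)

50_000 * 50_000 * 4 / 2**30  # bytes of a (50_000, 50_000) float32 array, in GiB
# = 9.313225746154785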

Assuming you have enough GPU memory, what could the problem be? That it needs to be contiguous memory?

Thinking about the parallel case, even if we chunk the ret = xp.empty part, we still have to combine them in the end...

I used memory_profiler to check memory usage on a numpy array of the same size and got:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5    129.1 MiB    129.1 MiB           1   @profile
     6                                         def compute_pairwise():
     7                                             # Create two random arrays
     8    140.5 MiB     11.4 MiB           1       X = np.random.rand(50000, 30)
     9    152.0 MiB     11.5 MiB           1       Y = np.random.rand(50000, 30)
    10                                             
    11                                             # Compute pairwise kernels
    12  19220.4 MiB  19068.4 MiB           1       result = pairwise_kernels(X, Y, n_jobs=1)
    13  19220.4 MiB     -0.0 MiB           1       return result

(note that these would be float64 vs torch's float32 above so size checks out)

I wonder if this is out of scope for this particular PR, and raises the more general question of how to handle larger computations/computations on GPU (e.g., write output to disk?), where we have previously mostly focused on running models on a standard laptop on data of size <10GB.

@OmarManzoor
Contributor

OmarManzoor commented Jun 11, 2025

@lucyleeow I agree that this is more related to how to manage memory better to accommodate large computations on GPUs more generally. For my particular case I don't have enough memory so that makes sense.

However @betatim has reported a GPU memory size of ~ 49 GB which is quite a lot. It would be nice to get some feedback in that particular case on where the code runs out of memory.

In any case I think we can merge this PR and investigate the large computations later. What do you think @betatim?

@betatim
Member

betatim commented Jun 11, 2025

Yes, I think this is a bigger can of worms than this PR.

I also feel like I might be missing something or otherwise getting things mixed up. For example, I'm not sure we are using "too much" memory - in the sense of being wasteful. It is more that I am surprised that I get OOM so easily, with such a small problem - makes me wonder "what is the point of all this if the GPU can't handle the cases that are actually interesting??"

@OmarManzoor
Contributor

OmarManzoor commented Jun 11, 2025

Yes, I think this is a bigger can of worms than this PR.

I also feel like I might be missing something or otherwise getting things mixed up. For example, I'm not sure we are using "too much" memory - in the sense of being wasteful. It is more that I am surprised that I get OOM so easily, with such a small problem - makes me wonder "what is the point of all this if the GPU can't handle the cases that are actually interesting??"

Well I guess batching is required in all such cases to manage the memory efficiently. Do you have any idea at what point your 49 GB GPU runs out of memory? I think you are using n_jobs = 1 so we don't even enter the code where we create batches and use multi threading. So we are working with the complete arrays. But I feel that if the matrix multiplication of the complete arrays ran out of memory on an M1 GPU, it should not really crash at the same point when we have 49 GB available.

@lucyleeow
Member

Especially because the CPU version of pairwise_distances takes a few seconds, which means that sizes like this or bigger is where it gets interesting to use a GPU.

Totally agree with your sentiment. Seems reasonable.

I spent too much time on this but out of interest I tried to check memory usage of torch.cdist vs our pairwise_kernel, with torch's memory snapshot as described in the blog Olivier linked above.

Using:

    X_torch = torch.rand(1000, 1000, device="cuda")
    Y_torch = torch.rand(1000, 1000, device="cuda")

(Note: I used _ = torch.cdist(X_torch[:1], Y_torch[:1]) to warm up CUDA, as suggested by Claude 🤷)

This is pairwise_kernel(X_torch, Y_torch):

[memory snapshot screenshot]

Note all of the blocks are 3.8MB (which is the size of a 1000 by 1000 array of float32) except one that is 8.1MB. Only the green block was allocated after the function starts (i.e. after generating the random arrays). Max memory was 50MB.

This is torch.cdist(X_torch, Y_torch)

[memory snapshot screenshot]

Again all blocks are 3.8MB except one of 8.1MB. Max memory was 57.7MB. All blocks above the long blue one were allocated after the function starts.

Would be interested in your OOM case @betatim. Don't quote me on doing all this correctly (lots of trial and error), but it seems that we don't use more memory than torch.cdist - so would be interested in your case to see why pairwise_kernel gave OOM but torch.cdist didn't...

For reference here is the same sized array in numpy

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    127.6 MiB    127.6 MiB           1   @profile
     7                                         def compute_pairwise():
     8                                             # Create two random arrays
     9    131.7 MiB      4.1 MiB           1       X = np.random.rand(1000, 1000).astype(np.float32)
    10    142.6 MiB     10.9 MiB           1       Y = np.random.rand(1000, 1000).astype(np.float32)
    11                                             
    12                                             # Compute pairwise kernels
    13    147.3 MiB      4.7 MiB           1       result = pairwise_kernels(X, Y, n_jobs=1)
    14    147.3 MiB      0.0 MiB           1       return result

Labels: Array API, module:metrics, Waiting for Second Reviewer