Array API support for pairwise kernels #29822
Conversation
Note that I wouldn't mind specializing this PR to pairwise kernels only and dealing with distances in a follow-up PR, if that helps simplify things.
@EmilyXinyi are you still interested in working on this? Thanks!
@EmilyXinyi: to be more explicit, @lucyleeow can allocate bandwidth to work on stalled array API PRs, so she could help on this one. There is also #29431, which is possibly blocking many other estimators, so it would be great to share the work on those two PRs to unblock progress on array API support.
@lucyleeow please take this one over if it's blocking other estimators! Thank you for your help! @ogrisel Thank you for the clarification!
@EmilyXinyi I'm happy to continue working on your branch, so you can always contribute as well if you wish.
@lucyleeow that would be awesome, thank you!
Maybe @betatim would be interested to take a look?
```python
# Why 5 and not more? This seems to still result in a lot of 0 values?
X_np = np.array(5 * rng.random_sample((5, 4)), dtype=dtype_name)
```
Not directly related to this PR: I think the `5 *` is there to prevent lots of 0 values when dtype is int in the `test_pairwise_parallel` test above. While fiddling I noticed that even with `5 *` there are still a lot of 0 values, and wondered why not just increase it to, e.g., 10?
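For context, a quick standalone check (not taken from the PR) of how the scale factor affects the fraction of zeros once the uniform samples are cast to an integer dtype:

```python
# With values drawn uniformly from [0, 1), `scale * u` truncates to 0 whenever
# u < 1/scale, so `5 *` leaves roughly 20% zeros and `10 *` roughly 10%.
import numpy as np

rng = np.random.RandomState(0)
for scale in (5, 10):
    X_int = np.array(scale * rng.random_sample((1000, 4)), dtype=np.int32)
    print(scale, (X_int == 0).mean())
```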
I was looking at the code touched by this PR and tried to remember "stuff" about it. One thought I had was "aren't matrix products one of the most massively optimised operations ever (on a GPU)?". For example, in the euclidean distance computation we run a big matrix product, which I assume is where all the time is spent. Which then made me wonder whether we should or shouldn't chunk the work, my assumption being that torch's matrix product is already about as optimised as it gets.

Does someone remember/can give some context?
I think the main objective was to save memory, and a secondary objective could be to optimize the use of CPU caches. I agree that for GPUs, using larger chunk sizes might make sense. It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUDA or CuPy vs numpy on CPU.
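A minimal sketch of the kind of benchmark being suggested; this is not the PR's chunking code (which slices via `gen_even_slices` and joblib), just a standalone timing of chunked vs. unchunked matrix products for a few chunk sizes:

```python
# Time a chunked X @ Y.T for different chunk sizes; swap the device or the
# array library to compare torch CUDA / CuPy / numpy CPU.
import time
import torch

def chunked_matmul(X, Y, chunk_size):
    out = torch.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, device=X.device)
    for start in range(0, Y.shape[0], chunk_size):
        stop = min(start + chunk_size, Y.shape[0])
        out[:, start:stop] = X @ Y[start:stop].T
    return out

device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.rand(20_000, 100, device=device)
Y = torch.rand(20_000, 100, device=device)

for chunk_size in (500, 5_000, 20_000):  # 20_000 means a single chunk here
    if device == "cuda":
        torch.cuda.synchronize()
    tic = time.perf_counter()
    chunked_matmul(X, Y, chunk_size)
    if device == "cuda":
        torch.cuda.synchronize()
    print(chunk_size, f"{time.perf_counter() - tic:.3f}s")
```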
But would it be easy to determine whether an input array is on a GPU? Because for CPU we'd still want to chunk as we currently do, right?
```diff
 Parallel(backend="threading", n_jobs=n_jobs)(
-    fd(func, ret, s, X, Y[s], **kwds)
+    fd(func, ret, s, X, Y[s, ...], **kwds)
     for s in gen_even_slices(_num_samples(Y), effective_n_jobs(n_jobs))
```
> It would be interesting to do some benchmarking to assess the impact of the chunk size when using torch CUDA or CuPy vs numpy CPU.

Just to confirm, we're talking about these chunk sizes right? 😬

I'd leave it as an "exercise for someone from the future". At least my immediate thought about it is that it isn't "hard" but ugly. You could have a hardcoded list of chunk sizes per device, for example.
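For what it's worth, a rough sketch of what that "ugly but not hard" idea could look like; the helper name and the device-to-chunk-size mapping below are hypothetical, not anything that exists in scikit-learn:

```python
# Hypothetical: pick a chunk size based on where the array lives. The array
# API standard exposes `x.device`, and torch devices report a `type` such as
# "cuda", "mps" or "cpu", so a lookup table like this is possible.
def _guess_chunk_size(X, default=1_000):
    device_type = getattr(getattr(X, "device", None), "type", "cpu")
    hardcoded = {
        "cuda": 50_000,  # made-up value: chunk much less aggressively on GPU
        "mps": 10_000,   # made-up value
        "cpu": default,  # keep the current CPU behaviour
    }
    return hardcoded.get(device_type, default)
```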
I might attempt this.
Tried some benchmarking, but with interesting(?) sizes of `X` and `Y` I run out of memory.
@betatim Do you run out of memory with the current code or when you remove the chunking from the current code?
I'm using this PR. If I use the CPU I don't run out of memory; on the GPU I do.
How much memory do you have on the GPU?
I have 49140MiB, which I think is a "large amount" of memory for a GPU.

edit: maybe the thing to agree on is whether the use-case I have in mind is realistic?
This is a lot! I agree with you that this code needs to be debugged further on the GPU to see precisely where it is running out of memory. Could it be that we are using Parallel threading with a certain number of jobs and that does not play well with the GPU? In any case I don't think there is any point removing the chunking in this PR for now.

Edit: The use case you are trying is realistic in my opinion. I think it should be able to handle at least this much: > 100000 samples and 100 features.
I agree that we shouldn't remove/change/do anything to the chunking for now. I think by default (if you don't pass `n_jobs`) no chunking happens anyway.

Do you think that my use-case makes sense? And the idea that 50_000 samples is somehow "where maybe it starts getting interesting"?
FYI there is #31445, which avoids the use of reshape (thus avoiding a copy during reshape when the array is non-C-contiguous), though I am not confident that would be the cause of the difference in memory usage.
@betatim I don't have a GPU with the size that you have available in order to debug this, but I did try with an mps device. This is the code I am using:

```python
from sklearn.metrics.pairwise import pairwise_kernels
import torch as xp
from sklearn._config import config_context

X = xp.rand(50000, 30, device="mps")
Y = xp.rand(50000, 30, device="mps")

with config_context(array_api_dispatch=True):
    pairwise_kernels(X, Y, n_jobs=1)
```

If we use n_jobs=1 we don't do any sort of chunking and work with the original arrays. In this case the code breaks in the safe_sparse_dot function, where we directly multiply the arrays.

If we use n_jobs >= 2 the code breaks when we try to allocate an empty array of zeros of size 50000 x 50000.
@betatim you might want to try the approach documented in: https://pytorch.org/blog/understanding-gpu-memory-1/
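For reference, a minimal sketch of the snapshot approach from that blog post, wrapped around the reproducer above (the `_record_memory_history` / `_dump_snapshot` calls are the ones documented there; everything else is just scaffolding):

```python
# Record a CUDA memory history while running pairwise_kernels, then dump a
# snapshot that can be inspected at https://pytorch.org/memory_viz.
import torch
from sklearn import config_context
from sklearn.metrics.pairwise import pairwise_kernels

torch.cuda.memory._record_memory_history(max_entries=100_000)

X = torch.rand(50_000, 30, device="cuda")
Y = torch.rand(50_000, 30, device="cuda")
try:
    with config_context(array_api_dispatch=True):
        pairwise_kernels(X, Y, n_jobs=1)
finally:
    torch.cuda.memory._dump_snapshot("pairwise_kernels_snapshot.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)
```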
Asking the stupid question first: for an array of 50_000 by 50_000 float32 (4 bytes), 9.31GB seems to be the correct size? Assuming you have enough GPU memory, what could the problem be? That it needs to be contiguous memory?

Thinking about the parallel case: even if we chunk the work over `Y`, the full output array still has to be allocated up front.

I used memory_profiler to check memory usage on a numpy array of the same size, and the reported usage was in line with that (note it would be float64 vs torch's float32 above, so the size checks out).

I wonder if this is out of scope for this particular PR, and raises the more general question of how to handle larger computations/computations on GPU (e.g. write output to disc..?), where we have previously mostly focused on running models on a standard laptop on data of size <10GB.
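A sketch of how such a measurement could be reproduced with memory_profiler (the shape mirrors the discussion above, so this needs roughly 19 GB of RAM to actually run; exact numbers will differ by machine):

```python
# Measure peak resident memory while building a 50_000 x 50_000 float64
# array with numpy, as a CPU reference point for the GPU numbers above.
import numpy as np
from memory_profiler import memory_usage

def build_kernel_sized_array(n=50_000):
    # np.ones touches every page, so the allocation shows up in RSS
    return np.ones((n, n), dtype=np.float64)  # ~18.6 GiB

peak_mib = max(memory_usage((build_kernel_sized_array, (), {})))
print(f"peak memory: {peak_mib / 1024:.1f} GiB")
```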
@lucyleeow I agree that this is more related to how to manage memory better to accommodate large computations on GPUs more generally. For my particular case I don't have enough memory, so that makes sense. However, @betatim has reported a GPU memory size of ~49 GB, which is quite a lot. It would be nice to get some feedback in that particular case on where the code runs out of memory. In any case I think we can merge this PR and investigate the large computations later. What do you think @betatim?
Yes, I think this is a bigger can of worms than this PR. I also feel like I might be missing something or otherwise getting things mixed up. For example, I'm not sure we are using "too much" memory, in the sense of being wasteful. It is more that I am surprised that I get OOM so easily, with such a small problem, which makes me wonder "what is the point of all this if the GPU can't handle the cases that are actually interesting??"
Well, I guess batching is required in all such cases to manage the memory efficiently. Do you have any idea at what point your 49 GB GPU runs out of memory? I think you are using n_jobs=1, so we don't even enter the code where we create batches and use multi-threading, which means we are working with the complete arrays. But I feel that if the matrix multiplication of the complete arrays ran out of memory on an M1 GPU, it should not really crash at the same point when we have 49 GB available.
Totally agree with your sentiment. Seems reasonable.

I spent too much time on this, but out of interest I tried to check the memory usage of the computation itself (note I used smaller, 1000 by 1000 arrays; the code and snapshot images are not reproduced here).

First snapshot: all of the blocks are 3.8MB (which is the size of a 1000 by 1000 array of float32) except one that is 8.1MB. Only the green block was allocated after the function starts (i.e. after generating the random arrays). Max memory was 50MB.

Second snapshot: again all blocks are 3.8MB except one of 8.1MB. Max memory was 57.7MB. All blocks above the long blue one are after the function starts.

Would be interested in your OOM case @betatim. Don't quote me on doing all this correctly (lots of trial and error), but it seems that we don't use more memory than expected. For reference, here is the same sized array in numpy (output not reproduced here).
Reference Issues/PRs
(hopefully) unblocks progress with #29661
What does this implement/fix? Explain your changes.
Adding array API support for `pairwise_kernels`.
To do
- `_parallel_pairwise`
- `_pairwise_callable`
- `test_pairwise.py` (`pairwise_distances` and `pairwise_kernels` in particular)
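A sketch of the usage this PR is meant to enable, following the same `config_context(array_api_dispatch=True)` pattern used in the discussion above (the exact set of supported metrics and parameters is whatever the PR ends up covering):

```python
# pairwise_kernels on torch tensors with array API dispatch enabled.
import torch
from sklearn import config_context
from sklearn.metrics.pairwise import pairwise_kernels

X = torch.rand(1_000, 30)
Y = torch.rand(500, 30)

with config_context(array_api_dispatch=True):
    K = pairwise_kernels(X, Y, metric="linear")

print(type(K), K.shape)  # expected: a torch.Tensor of shape (1000, 500)
```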