🚀 The feature, motivation and pitch
PyTorch relies on CUDA streams and events to overlap computation and communication. However, these overlapped kernels can interfere with one another, causing throughput degradation and unstable performance. We propose introducing CUDA Green Contexts [1] in FSDP2 to provide contexts that isolate SM resources for computation and communication. For example, on an H100 we can split the 132 SMs into two partitions, 104 SMs for compute and 24 SMs for communication (this wastes 4 SMs, since Green Contexts require each partition's SM count to be a multiple of 8).
FlashInfer [2] has already integrated this experimental feature into their framework using the API provided by cuda-python.
I think a naive implementation would be:
- Split the SMs into two green contexts, one for computation and one for communication (a single context shared by all-gather / reduce-scatter / all-reduce).
- Create streams from each context: one stream for overlapped compute, and for communication simply replace the normal CUDA streams with streams created from the communication green context (a minimal sketch follows this list).
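A rough sketch of that split, loosely following FlashInfer's integration via the cuda-python driver bindings. The helper names (`_check`, `make_green_ctx_streams`) are hypothetical, and the exact cuda-python calling convention (error code first, out-params returned) is an assumption that should be checked against the cuda-python docs and FlashInfer's `green_ctx.py`:

```python
# Rough sketch only. The driver entry points (cuDeviceGetDevResource,
# cuDevSmResourceSplitByCount, cuDevResourceGenerateDesc, cuGreenCtxCreate,
# cuGreenCtxStreamCreate) are documented in [1]; the Python calling
# convention below is assumed from FlashInfer's usage.
import torch
from cuda.bindings import driver


def _check(ret):
    # cuda-python returns (CUresult, *out_params)
    err, *out = ret
    assert err == driver.CUresult.CUDA_SUCCESS, err
    return out[0] if len(out) == 1 else tuple(out)


def make_green_ctx_streams(device_id: int = 0, compute_sms: int = 104):
    """Carve out `compute_sms` SMs for compute and use the remaining SMs
    for communication; return one stream per partition."""
    torch.cuda.init()  # ensure the driver and primary context are initialized
    cu_dev = _check(driver.cuDeviceGet(device_id))

    sm_resource = _check(driver.cuDeviceGetDevResource(
        cu_dev, driver.CUdevResourceType.CU_DEV_RESOURCE_TYPE_SM))
    # One group of `compute_sms` SMs (rounded to the 8-SM granularity on H100);
    # `remaining` holds the leftover SMs used for communication.
    groups, _, remaining = _check(driver.cuDevSmResourceSplitByCount(
        1, sm_resource, 0, compute_sms))

    streams = []
    for res in (groups[0], remaining):
        desc = _check(driver.cuDevResourceGenerateDesc([res], 1))
        gctx = _check(driver.cuGreenCtxCreate(
            desc, cu_dev,
            driver.CUgreenCtxCreate_flags.CU_GREEN_CTX_DEFAULT_STREAM.value))
        raw = _check(driver.cuGreenCtxStreamCreate(
            gctx, driver.CUstream_flags.CU_STREAM_NON_BLOCKING.value, 0))
        # Wrap the raw CUstream so PyTorch ops can be scheduled onto it.
        streams.append(torch.cuda.ExternalStream(int(raw), device=device_id))

    compute_stream, comm_stream = streams
    return compute_stream, comm_stream
```

The returned `comm_stream` could then be handed to FSDP2 wherever it currently creates its all-gather / reduce-scatter streams.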
Since we want to use the green-context stream for overlapped computation and the default stream for non-overlapped computation, we need a hook-like mechanism that switches the current CUDA stream to the green-context stream just before an all-gather or reduce-scatter call begins, and switches back to the default stream when the communication finishes, so that we can fully utilize all GPU resources. However, I am not sure whether such frequent stream switching would introduce significant overhead. Perhaps, instead of swapping streams back and forth, we could dispatch the different kernels onto two separate streams from the start; that way we overlap communication on the green-context stream with computation on the default stream without incurring the cost of repeated stream switches.
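On the stream-switching question, here is a minimal sketch of the swap-around-collectives idea using only stable PyTorch stream APIs. `comm_stream` is assumed to be the `torch.cuda.ExternalStream` from the sketch above, and how FSDP2 / ProcessGroupNCCL would actually pick this stream up internally is left open:

```python
import torch
from contextlib import contextmanager


@contextmanager
def comm_on_green_ctx(comm_stream: torch.cuda.Stream):
    """Run the enclosed collective on the communication green-context stream.

    The waits mirror the ordering FSDP2 already establishes when hopping
    between its default and communication streams; in practice the second
    wait would be deferred until the gathered data is actually consumed.
    """
    main_stream = torch.cuda.current_stream()
    comm_stream.wait_stream(main_stream)   # comm waits for pending compute
    with torch.cuda.stream(comm_stream):
        yield
    main_stream.wait_stream(comm_stream)   # compute waits for the collective


# Hypothetical usage around a collective (pg, out, inp not defined here):
# with comm_on_green_ctx(comm_stream):
#     torch.distributed.all_gather_into_tensor(out, inp, group=pg)
```

Since both waits are GPU-side event waits, the "switch" itself is only a host-side change of the current stream, which suggests the per-call overhead should be small, but that is exactly what would need to be measured.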
[1] https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html
[2] flashinfer-ai/flashinfer#1163
Alternatives
No response
Additional context
No response
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta