Device-Initiated Communication¶
Starting with version 2.28, NCCL provides a device-side communication API, making it possible to use communication primitives directly from user CUDA kernels.
Device API¶
The device API consists of the following modules:
- LSA (Load/Store Accessible) – for communication between devices accessible via memory load/store operations, using CUDA P2P. This includes devices connected over NVLink and some devices connected over PCIe (the latter are currently limited to devices with P2P connectivity, as indicated by nvidia-smi topo -p2p p, subject to the NCCL_P2P_LEVEL distance check).
- Multimem – for communication between devices using the hardware multicast feature provided by NVLink SHARP (available on some datacenter GPUs since the Hopper generation).
- GIN (GPU-Initiated Networking) – for communication over the network. This module is under active development and will not be covered here at this time.
The device API relies on symmetric memory (see Window Registration), which in turn depends on GPU virtual memory management (see NCCL_CUMEM_ENABLE) and optionally – for multimem support – on NVLink SHARP (see NCCL_NVLS_ENABLE).
Host-Side Setup¶
To perform communication from the device, a device communicator needs to be created using ncclDevCommCreate().
Data transfer operations on buffers require symmetric memory windows (see Window Registration). A custom
communication kernel can then be launched using the standard CUDA syntax. The code excerpt below demonstrates
these steps:
int main() {
  [...]
  NCCLCHECK(ncclCommInitRank(&comm, nranks, id, rank));

  /* Buffer initialization and window creation */
  char* buffer;
  size_t size = 256*1048576;
  NCCLCHECK(ncclMemAlloc((void**)&buffer, size));
  ncclWindow_t win;
  NCCLCHECK(ncclCommWindowRegister(comm, buffer, size, &win, NCCL_WIN_COLL_SYMMETRIC));

  /* Get device communicator */
  ncclDevComm devComm;
  ncclDevCommRequirements reqs;
  memset(&reqs, 0, sizeof(ncclDevCommRequirements));
  int nCTAs = 16;
  reqs.lsaBarrierCount = nCTAs;
  NCCLCHECK(ncclDevCommCreate(comm, &reqs, &devComm));

  /* Launch user kernel */
  customKernel<<<nCTAs, 256>>>(devComm, win);
  [...]
}
Depending on the kernel and application requirements, the same window can be used for input and output, or multiple
windows may be needed. When creating a device communicator, the resources that the kernel will need should be specified
via the requirements list (see ncclDevCommRequirements). In the above example we specify just the number of barriers
that the kernel will need, in this case one for each CTA the kernel is to be launched on (16, each CTA running
256 threads).
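Once the kernel has completed, the host-side resources should be released. The snippet below is a minimal teardown sketch, assuming the variables from the example above; CUDACHECK is assumed to be an error-checking macro analogous to NCCLCHECK and is not shown here.

/* Wait for the user kernel to finish before releasing its resources */
CUDACHECK(cudaDeviceSynchronize());
/* Deregister the window, then free the buffer and destroy the communicator */
NCCLCHECK(ncclCommWindowDeregister(comm, win));
NCCLCHECK(ncclMemFree(buffer));
NCCLCHECK(ncclCommDestroy(comm));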
Simple Device Kernel¶
template <typename T>
__global__ void inPlaceAllReduceKernel(ncclDevComm devComm, ncclWindow_t win, size_t offset, size_t count) {
  ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamTagLsa(), blockIdx.x };
  bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
  const int rank = devComm.lsaRank, nRanks = devComm.lsaSize;
  const int globalTid = threadIdx.x + blockDim.x * (rank + blockIdx.x * nRanks);
  const int globalNthreads = blockDim.x * gridDim.x * nRanks;
  for (size_t o = globalTid; o < count; o += globalNthreads) {
    T v = 0;
    for (int peer = 0; peer < nRanks; peer++) {
      T* inputPtr = (T*)ncclGetLsaPointer(win, offset, peer);
      v += inputPtr[o];
    }
    for (int peer = 0; peer < nRanks; peer++) {
      T* outputPtr = (T*)ncclGetLsaPointer(win, offset, peer);
      outputPtr[o] = v;
    }
  }
  bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
The above code excerpt shows a simple device kernel – an in-place variant (the input buffer is reused for the output) of AllReduce, utilizing LSA support (data is transferred via memory load/store instructions).
The start of the buffer is specified as a (byte-based) offset within the previously registered window win (see Window Registration); the buffer consists of count elements of type T.
Before the kernel can start processing data, it needs to ensure that all participants are ready. It creates a memory
barrier session bar (see ncclLsaBarrierSession) and uses it to synchronize across all the threads of the CTA
(ncclCoopCta) and the ranks of the communicator (devComm). ncclTeamTagLsa indicates the subset of ranks the
barrier will apply to and blockIdx.x is the CTA’s local index, used to select the barrier.
The kernel then calculates a globally unique index for each thread as well as the overall thread count, and can finally
start processing data using an all-to-all communication pattern. In each iteration, every participating thread loads a
single input element from each communicator rank. ncclGetLsaPointer() is used to calculate the locally accessible
address of the start of the buffer within each rank (remote device memory was previously mapped into the local address
space – see Window Registration). The loaded input data is accumulated and then stored back at each rank. Before the
kernel terminates, another memory synchronization needs to take place to ensure that all the threads have finished
processing their data.
Note that this simple implementation would likely fall short of peak bandwidth, as it uses neither vectorization nor loop unrolling.
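For reference, a hypothetical launch of the above kernel from the host code shown earlier could look as follows; it assumes float elements and an offset of 0, i.e., the buffer starts at the beginning of the registered window.

size_t count = size / sizeof(float);  /* number of float elements in the registered buffer */
inPlaceAllReduceKernel<float><<<nCTAs, 256>>>(devComm, win, /*offset=*/0, count);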
Multimem Device Kernel¶
int main() {
  [...]
  memset(&reqs, 0, sizeof(ncclDevCommRequirements));
  int nCTAs = 16;
  reqs.lsaBarrierCount = nCTAs;
  reqs.lsaMultimem = true;
  NCCLCHECK(ncclDevCommCreate(comm, &reqs, &devComm));
  [...]
}
template <typename T>
__global__ void inPlaceAllReduceKernel(ncclDevComm devComm, ncclWindow_t win, size_t offset, size_t count) {
  ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamTagLsa(), blockIdx.x, /*multimem*/true };
  [...]
  T* mmPtr = (T*)ncclGetLsaMultimemPointer(win, offset, devComm);
  for (size_t o = globalTid; o < count; o += globalNthreads) {
    T v = multimem_sum(mmPtr+o);
    multimem_st(mmPtr+o, v);
  }
  [...]
}
The above code excerpts demonstrate the modifications needed to the earlier code segments to enable multimem support.
On the host side, lsaMultimem needs to be set in the requirements prior to creating the device communicator
(ncclDevCommCreate() will fail if the necessary hardware support is unavailable).
Within the device kernel, we can switch the memory barrier to a multimem-optimized variant by adding an extra argument
to the constructor. The processing loop is actually simpler with multimem: ncclGetLsaMultimemPointer() needs to be
invoked just once per kernel. The returned multicast memory pointer enables access to the device memory of all the
ranks of the communicator without having to iterate over them, and the data can be reduced in hardware. To keep this
example simple, the implementations of multimem_sum and multimem_st are not included. They need to be implemented
using PTX, e.g., multimem.ld_reduce.global.add and multimem.st.global.
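As an illustration, a minimal sketch of such helpers for float elements might look as follows; the names multimem_sum and multimem_st mirror the placeholders above and are not part of the NCCL API, and the PTX instructions require a GPU with multimem support (sm_90 or newer).

__device__ __forceinline__ float multimem_sum(const float* mmPtr) {
  /* Load-reduce across all ranks in hardware via the multicast address */
  float v;
  asm volatile("multimem.ld_reduce.global.add.f32 %0, [%1];" : "=f"(v) : "l"(mmPtr) : "memory");
  return v;
}

__device__ __forceinline__ void multimem_st(float* mmPtr, float v) {
  /* Broadcast the store to all ranks via the multicast address */
  asm volatile("multimem.st.global.f32 [%0], %1;" :: "l"(mmPtr), "f"(v) : "memory");
}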
Thread Groups¶
Many functions in the device API take a thread cooperative group as input to indicate which threads within the CTA will take part in the operation. NCCL provides three predefined ones: ncclCoopThread(), ncclCoopWarp(), and ncclCoopCta().
Users may also pass CUDA cooperative groups, or any class which provides thread_rank(), size(), and sync() functions.
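As an illustration, a minimal custom group covering all threads of a one-dimensional CTA could be sketched as follows; the struct name is hypothetical, and such a type behaves equivalently to ncclCoopCta.

struct MyCtaGroup {
  /* Rank of the calling thread within the group */
  __device__ int thread_rank() const { return threadIdx.x; }
  /* Total number of threads in the group */
  __device__ int size() const { return blockDim.x; }
  /* Synchronize all threads of the group */
  __device__ void sync() const { __syncthreads(); }
};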
Teams¶
To address remote ranks or perform barriers, NCCL refers to subsets of ranks within the global communicator as “teams”.
NCCL provides three predefined ones: ncclTeamWorld(), ncclTeamLsa(), and ncclTeamRail().