Close

Presentation

Optimizing the GPU All-Reduce Using Multiple Processes Per GPU
DescriptionLarge inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing four GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This poster presents novel optimizations to large GPU-aware all-reduce operations, extending lane-aware reductions to the GPUs, and notably using multiple CPU cores per GPU to accelerate these operations. These multi-CPU-accelerated GPU-aware lane all-reduces using an intermediate host buffer yield speedup of up to 2.45x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA's Delta supercomputer. Finally, the approach is extended to GPUDirect RDMA communication, yielding speedup of 1.17x for large all-reduces.