Presentation

CPU- and GPU-Initiated Communication Strategies for Conjugate Gradient Methods on Large GPU Clusters
Description: Strong scaling of conjugate gradient (CG) algorithms on GPU-based supercomputers is notoriously challenging. These linear system solvers have low computational intensity, making inter-GPU communication and synchronization primary bottlenecks. In light of recent developments in multi-GPU communication, we revisit CG parallelization for large-scale GPU clusters. We implement standard and pipelined CG solvers using three flavors of multi-GPU communication: GPU-aware MPI, NVIDIA's NCCL/AMD's RCCL, and NVIDIA's NVSHMEM.

Our monolithic NVSHMEM-based implementation with GPU-initiated communication enables CPU-free execution and thus lower overhead. However, the lack of vendor-supported device-side computational kernels means that CPU-controlled CG implementations based on GPU-aware MPI or NCCL/RCCL are still favored for small GPU counts. Relative to state-of-the-art CG implementations, our solvers also eliminate unnecessary CPU-GPU data transfers and synchronization points. We benchmark our CG implementations on NVIDIA- and AMD-based supercomputers using SuiteSparse matrices and real-world finite element applications, achieving strong scaling on over 1,000 GPUs and outperforming existing approaches.
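
To illustrate why communication and synchronization dominate, the following is a minimal single-process sketch of the standard (non-pipelined) CG iteration in plain Python. It is not the presenters' implementation: the function names, the dense list-of-lists matrix, and the `reductions` counter are assumptions made for illustration. The comments mark where a distributed version would need a neighbor exchange or a global reduction across GPUs.

```python
# Illustrative sketch only: standard CG for a symmetric positive definite
# system A x = b, written for one process with dense Python lists.

def dot(x, y):
    # In a distributed solver, each dot product ends in a global reduction
    # (an all-reduce across every GPU), which acts as a synchronization point.
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    # In a distributed solver, the sparse matrix-vector product needs a
    # halo exchange with neighboring GPUs before or during the local work.
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)       # global reduction
    reductions = 1           # hypothetical counter of reduction points
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)                     # neighbor communication
        alpha = rs_old / dot(p, Ap)           # global reduction 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)                    # global reduction 2
        reductions += 2
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, reductions
```

Calling `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` returns a solution close to `[1/11, 7/11]`. Each iteration does little arithmetic per vector entry but contains two dependent global reductions. Pipelined CG variants generally restructure the recurrences so that the reductions can be combined and overlapped with the matrix-vector product.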