Presentation

Redesigning GROMACS Halo Exchange: Improving Strong Scaling with GPU-initiated NVSHMEM
Description: Improving time-to-solution in molecular dynamics simulations often requires strong scaling because problem sizes are fixed.
GROMACS is highly latency-sensitive, with peak iteration times in the sub-millisecond range, which makes scaling on heterogeneous supercomputers challenging.
MPI's CPU-centric nature introduces additional latencies on GPU-resident applications' critical path, hindering GPU utilization and scalability.
To address these limitations, we present an NVSHMEM-based, GPU kernel-initiated redesign of the GROMACS domain decomposition halo-exchange algorithm.
Highly tuned GPU kernels fuse data packing and communication, leveraging hardware latency-hiding for fine-grained overlap.
We employ kernel fusion across overlapped data forwarding communication phases and utilize the asynchronous copy engine over NVLink to optimize latency and bandwidth. Our GPU-resident formulation greatly increases communication-computation overlap, improving GROMACS strong scaling performance across NVLink by up to 1.5x (intra-node) and 2x (multi-node), and up to 1.3x multi-node over NVLink+InfiniBand.
This demonstrates the profound benefits of GPU-initiated communication for strong-scaling a broad range of latency-sensitive applications.