Plexus: Taming Billion-Edge Graphs with 3D Parallel Full-Graph GNN Training
Description
Graph neural networks leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a single GPU due to their sheer size, and distributed full-graph training suffers from high communication overheads and load imbalance caused by the irregular structure of graphs. We propose a three-dimensional parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation, Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2,048 GPUs of Perlmutter and 1,024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5X over prior state of the art, and a reduction in time-to-solution of 5.2-8.7X on Perlmutter and 7.0-54.2X on Frontier.
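To illustrate why vertex permutation helps with load balancing, here is a minimal NumPy sketch. It applies a single random relabeling (a simplification; the abstract's double permutation scheme is not detailed here) and measures how evenly edges fall across a 2D tiling of the adjacency matrix. All function names and the synthetic skewed graph are hypothetical, not taken from Plexus.

```python
import numpy as np

def permute_vertices(edge_index: np.ndarray, num_nodes: int, seed: int = 0) -> np.ndarray:
    """Relabel all vertices with one random permutation so that high-degree
    vertices (dense adjacency rows) scatter across every partition.

    edge_index: integer array of shape (2, num_edges) holding (src, dst) pairs.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)   # old vertex ID -> new vertex ID
    return perm[edge_index]             # relabel both endpoints at once

def block_edge_counts(edge_index: np.ndarray, num_nodes: int,
                      grid: tuple[int, int] = (4, 4)) -> np.ndarray:
    """Count edges landing in each (row, col) block when the adjacency
    matrix is tiled over a 2D slice of a process grid."""
    rows, cols = grid
    rb = edge_index[0] * rows // num_nodes   # block row of each edge
    cb = edge_index[1] * cols // num_nodes   # block column of each edge
    counts = np.zeros(grid, dtype=np.int64)
    np.add.at(counts, (rb, cb), 1)
    return counts

if __name__ == "__main__":
    # Synthetic skewed graph: a few hub vertices attract most edges.
    rng = np.random.default_rng(1)
    n, m = 10_000, 200_000
    srcs = rng.integers(0, n, size=m)
    dsts = rng.integers(0, 100, size=m)      # destinations concentrated in [0, 100)
    edges = np.stack([srcs, dsts])

    before = block_edge_counts(edges, n)
    after = block_edge_counts(permute_vertices(edges, n), n)
    print("max/mean block load before:", before.max() / before.mean())
    print("max/mean block load after: ", after.max() / after.mean())
```

A max/mean ratio near 1.0 after relabeling indicates the per-block work (edges, hence nonzeros in the sparse aggregation) is balanced; Plexus's 3D decomposition additionally splits the feature dimension, which this 2D sketch omits.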
Event Type
Paper
Time
Tuesday, 18 November 2025, 10:30am - 10:52am CST
Location
260-267
HPC for Machine Learning