Close

Presentation

Towards Efficient Load Balancing BFS on GPUs: One Code for AMD, Intel & Nvidia
DescriptionEfficient graph processing is essential for a wide range of applications.
Scalability and memory access patterns are still a challenge, especially with the Breadth-First Search algorithm. This work focuses on leveraging multi-GPU HPC nodes with peer-to-peer support of the Intel oneAPI implementation of SYCL.
We propose three GPU-based load-balancing methods: work-group localisation for efficient data access, even workload distribution for higher GPU occupancy, and a hybrid strided-access approach for heuristic balancing. These methods ensure performance, portability, and productivity with a unified codebase.
Our proposed methodologies outperform state-of-the-art single-GPU implementations based on CUDA on synthetic RMAT graphs. We analysed BFS performance across NVIDIA A100, Intel Max 1550, and AMD MI300X GPUs, achieving a peak performance of 153.27 GTEPS on an RMAT25-64 graph using 8 GPUs on the NVIDIA A100. Furthermore, our work handles RMAT graphs up to scale 29, achieving superior performance on synthetic graphs and competitive results on real-world datasets.