Presentation

Heterogeneity-Aware Task Allocation for Modern HPC Systems
Description: Modern supercomputing systems have heterogeneous node configurations, where seemingly identical hardware shows significant performance variations due to memory capacity differences, manufacturing tolerances, and deployment conditions. This heterogeneity reduces the efficiency of scientific applications built on frameworks like AMReX, leading to substantial computational waste on leadership-class systems. We present performance-aware and relation-aware load balancing algorithms designed specifically for scientific applications such as AMReX running on heterogeneous HPC clusters. Our approach uses empirically measured node performance characteristics and a relative performance matrix to optimize task distribution across diverse computational resources.

Evaluation on NERSC Perlmutter with 14 representative AMReX computational kernels demonstrates 99.9% scheduling efficiency, with performance improvements of 4.4%-11.5% over traditional methods in moderate heterogeneity scenarios (A100 40GB vs. 80GB) and up to 300x improvements in extreme CPU-GPU mixed configurations, where homogeneous methods fail to utilize CPU resources effectively. The algorithms handle million-task workloads with O(n log n + nm) complexity while remaining practical to deploy.
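
To make the idea of performance-aware allocation concrete, the following is a minimal sketch and not the presented algorithms themselves. It assumes that n is the number of tasks and m the number of nodes (the abstract does not define them), that each task is a hypothetical `(kernel_id, work)` pair, and that `rel_perf[kernel_id][node]` is a hypothetical relative performance matrix holding each node's measured speed relative to a reference node. Under those assumptions, a greedy longest-task-first heuristic sorts the tasks once (O(n log n)) and scans all nodes for each task (O(nm)).

```python
def allocate(tasks, rel_perf):
    """Greedy longest-task-first allocation on heterogeneous nodes.

    tasks:    list of (kernel_id, work) pairs (hypothetical format).
    rel_perf: rel_perf[kernel_id][node] is the node's speed for that kernel
              relative to a reference node; values must be > 0 (assumed layout).
    Returns (assignment, makespan), where assignment[i] is the node of task i.
    """
    num_nodes = len(rel_perf[0])
    finish = [0.0] * num_nodes          # projected finish time per node
    assignment = [None] * len(tasks)

    # Sort task indices by descending work: O(n log n).
    order = sorted(range(len(tasks)), key=lambda i: tasks[i][1], reverse=True)

    # For each task, scan every node for the earliest projected finish: O(nm).
    for i in order:
        kernel, work = tasks[i]
        best = min(
            range(num_nodes),
            key=lambda j: finish[j] + work / rel_perf[kernel][j],
        )
        finish[best] += work / rel_perf[kernel][best]
        assignment[i] = best

    return assignment, max(finish)


if __name__ == "__main__":
    # One kernel type, two nodes; node 1 is twice as fast as node 0.
    tasks = [(0, 8.0), (0, 4.0), (0, 4.0)]
    rel_perf = [[1.0, 2.0]]
    print(allocate(tasks, rel_perf))
```

In this toy case the faster node receives the largest task plus one small task, while a speed-blind allocator would have no basis for preferring it. Real deployments would also need to account for factors this sketch ignores, such as memory capacity limits and communication cost.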