Presentation

Managing Heterogeneous Topologies and Understanding Their Impact on Performance
Description

To solve increasingly complex problems more efficiently, modern HPC systems feature highly heterogeneous components: CPUs, GPUs, and recently QPUs (quantum processing units), each with a unique, complex compute topology. The massive parallelism of GPUs, combined with emerging memory technologies on CPUs and GPUs, makes the memory topologies increasingly heterogeneous, complex, and dynamically configurable. Understanding these topological details, especially regarding available memory and its usage, is essential to operating the systems and applications efficiently.

This thesis presents a framework that targets several fundamental gaps in currently available research and tooling. The framework consists of sys-sage, MT4G, GPUscout, and a modeling extension of Mitos. At the core, the sys-sage library offers a unified approach to maintaining static and dynamic topological information from different sources and APIs. Its universal architecture handles CPUs, GPUs, and QPUs alike. MT4G provides an otherwise unavailable, vendor-agnostic, and complete report on GPU memory topologies, which can be integrated with sys-sage. Because the massive parallelism of GPUs amplifies the potential performance penalties of improper cache and memory usage, GPUscout identifies root causes of frequently occurring memory-related bottlenecks, helping users efficiently utilize the complex memory subsystem of GPUs. Finally, to address emerging memory technologies, such as CXL.mem, this thesis presents a novel data access modeling workflow as an extension of Mitos. The model predicts the performance impact of CXL.mem-based cross-node shared-buffer data exchange as an alternative to point-to-point MPI communication. Altogether, these tools capture topologies of HPC systems and provide missing insights into application data transfer behavior.