Presentation
Unmasking Performance Variability in GPU Codes on Supercomputers
Description
Performance variability is often a critical issue on GPU-accelerated systems, undermining efficiency and reproducibility. Since large-scale investigations of performance variability on GPU clusters are lacking, we set up a longitudinal experiment on Perlmutter and Frontier. We benchmark representative HPC and AI applications and collect detailed performance data to assess the impact of compute variability, allocated node topology, and network conditions on overall runtime. We also use an ML-based approach to identify potential correlations between these factors and to forecast the execution time. Our analysis identifies network performance as the dominant source of runtime variability. These findings provide crucial insights that can inform the development of future mitigation strategies.

Event Type
Research and ACM SRC Posters
Time
Thursday, 20 November 2025, 8:00am - 5:00pm CST
Location
Second Floor Atrium