Presentation
Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs
Description
This study characterizes GPU resilience in Delta, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings include:
1) H100 GPU memory resilience is worse than A100 GPU memory, with 3.2x lower per-GPU MTBE for memory errors.
2) The GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity.
3) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components.
4) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level.
5) We project the impact of GPU node availability on larger scales and find that significant overprovisioning of 5% is necessary to handle GPU failures.
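As a rough illustration of the per-GPU MTBE comparison in finding 1, the sketch below assumes that per-GPU MTBE (mean time between errors) is computed as accumulated GPU hours divided by the number of observed errors. That definition, the function name `per_gpu_mtbe`, and all input counts are assumptions made for illustration; the counts are hypothetical and are not data from the study.

```python
def per_gpu_mtbe(gpu_hours: float, error_count: int) -> float:
    """Return mean time between errors per GPU, in hours.

    Assumed definition: accumulated GPU hours divided by observed errors.
    """
    if error_count <= 0:
        raise ValueError("error_count must be positive")
    return gpu_hours / error_count


# Hypothetical inputs chosen only to illustrate the arithmetic.
a100_mtbe = per_gpu_mtbe(gpu_hours=8_000_000, error_count=100)
h100_mtbe = per_gpu_mtbe(gpu_hours=2_000_000, error_count=80)

# A ratio above 1 means the H100 figure is lower, i.e. errors are more frequent.
ratio = a100_mtbe / h100_mtbe
print(f"A100 per-GPU MTBE: {a100_mtbe:,.0f} h")
print(f"H100 per-GPU MTBE: {h100_mtbe:,.0f} h")
print(f"H100 MTBE is {ratio:.1f}x lower than A100")
```

Normalizing by GPU hours rather than wall-clock time lets two GPU populations of different sizes and deployment durations be compared on equal footing.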
Event Type
Paper
Time
Wednesday, 19 November 2025, 3:52pm - 4:15pm CST
Location
263-264
State of the Practice


