Close

Presentation

Fine-Grained Automated Failure Management for Extreme-Scale GPU-Accelerated Systems
DescriptionAs high performance computing (HPC) systems scale in size, system-wide hardware failure rates increase. Historical data from previous large-scale HPC installations illustrate this trend, with the mean time between failures (MTBF) decreasing steadily over the past decade. Recent studies from artificial intelligence and machine learning (AI/ML) training extrapolate MTBF declining even further for future GPU-accelerated systems. As MTBF decreases, mean time to repair (MTTR) becomes more pronounced, highlighting the need for efficient recovery strategies.

This paper presents an automated failure management system that addresses this issue by minimizing MTTR through real-time decision-making based on failure statistics. Our key contributions include a centralized meta-database for event history analysis including correlated events, fine-grained multi-strike repair policies, and an automated recovery framework. Deployed on the Aurora supercomputer, the proposed system has reduced MTTR by up to 84X compared to manual servicing, leading to significant cost savings and decreased system downtime.