Presentation

Improving Supercomputer Usage with Aging Awareness
Description
The lifetime of electronic devices has a critical impact on their environmental footprint. In addition, high demand for GPUs from AI companies has sharply reduced their availability to supercomputing centers. Consequently, extending the lifetime of CPUs and GPUs is becoming a major issue in the High Performance Computing (HPC) domain.

This paper investigates how to optimize the usage of a machine before it suffers a fatal failure, and the resulting trade-offs with performance. The lifetime of computing devices is strongly tied to temperature, and thus to operating frequency. We investigate reconfiguring node frequencies to optimize HPC usage, and we estimate the benefit of a dedicated scheduling algorithm compared with running at a constant frequency.
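
As a rough illustration of the frequency/lifetime trade-off described above, the following sketch compares the total work a node completes before failure at several constant frequencies. It is not the model or the scheduling algorithm used in the paper: the cubic thermal model, the Arrhenius-style lifetime law, and every constant and function name here (`temperature_k`, `lifetime_years`, `lifetime_work`, the 0.7 eV activation energy, the 5-year reference lifetime) are assumptions chosen only to make the idea concrete.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def temperature_k(freq_ghz, ambient_k=300.0, coeff=2.0):
    """Hypothetical thermal model: temperature grows with the cube of frequency."""
    return ambient_k + coeff * freq_ghz ** 3


def lifetime_years(freq_ghz, ref_freq_ghz=2.0, ref_years=5.0, activation_ev=0.7):
    """Arrhenius-style lifetime, scaled so that ref_freq_ghz lasts ref_years.

    All default values are illustrative assumptions, not measured data.
    """
    t = temperature_k(freq_ghz)
    t_ref = temperature_k(ref_freq_ghz)
    # Acceleration factor relative to the reference temperature.
    accel = math.exp(activation_ev / BOLTZMANN_EV * (1.0 / t - 1.0 / t_ref))
    return ref_years * accel


def lifetime_work(freq_ghz):
    """Total work before failure in arbitrary units (throughput x lifetime).

    Throughput is assumed to be proportional to frequency.
    """
    return freq_ghz * lifetime_years(freq_ghz)


candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
best = max(candidates, key=lifetime_work)

for f in candidates:
    print(f"{f:.1f} GHz: lifetime {lifetime_years(f):6.2f} y, "
          f"work {lifetime_work(f):6.2f}")
print("frequency maximizing lifetime work:", best)
```

Under these assumptions, running faster shortens the lifetime more than it raises throughput, so the frequency that maximizes total work is lower than the one that maximizes instantaneous performance. A different aging model would move that optimum.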

We show that well-chosen frequency decisions can considerably increase the number of FLOP a machine delivers before failure, at a cost in performance. Because current aging models are inaccurate, we consider several models and discuss how robust our algorithms are to this inaccuracy.