Presentation

Energy-Aware HPC Scheduling with LLM-Based Power Prediction
Description

As the increasing energy consumption of High-Performance Computing (HPC) systems places greater strain on electric grid infrastructure, operational strategies for load balancing become critically important. Energy-aware scheduling offers a promising solution by enabling HPC systems to function as actively managed loads within the energy grid. Despite extensive theoretical research on this strategy, practical implementations and real-system evaluations remain scarce. To bridge this gap, we introduce a systematic approach to developing, evaluating, and implementing energy-aware scheduling without modifications to Slurm's core scheduler. Our method includes a novel mechanism for per-job power prediction based on Large Language Model embeddings of enriched job scripts, coupled with a lightweight, deployable scheduling strategy. Our predictor reduces per-job power MAE by 15% compared to the current state-of-the-art, and our simulated scheduler shifts 4.0 MWh onto on-site solar without throughput loss. These results demonstrate a clear and practical pathway to production deployment of energy-aware scheduling in HPC.
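
For readers unfamiliar with the metric, the following minimal sketch shows how a per-job power MAE (mean absolute error) and a relative reduction between two predictors can be computed. It is not the authors' evaluation code: the function names and all wattage values are hypothetical, chosen only so that the arithmetic yields a 15% reduction.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all jobs."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline_mae, new_mae):
    """Fraction by which new_mae improves on baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae


# Hypothetical per-job mean power draw in watts (illustrative values only).
actual_watts = [400.0, 250.0, 900.0, 600.0]
baseline_pred = [440.0, 210.0, 860.0, 640.0]  # each job is off by 40 W
new_pred = [434.0, 216.0, 866.0, 634.0]       # each job is off by 34 W

baseline_mae = mean_absolute_error(actual_watts, baseline_pred)
new_mae = mean_absolute_error(actual_watts, new_pred)
reduction = relative_reduction(baseline_mae, new_mae)

print(f"baseline MAE: {baseline_mae:.1f} W")
print(f"new MAE:      {new_mae:.1f} W")
print(f"reduction:    {reduction:.0%}")
```

Note that a relative reduction in MAE depends on the baseline: the same percentage corresponds to different absolute improvements in watts on different workloads.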