Presentation

Predicting Resources for AI Workloads in HPC: Methods, Challenges, and Opportunities
Description

Artificial Intelligence (AI) is rapidly transforming scientific discovery and industrial applications, but its growth has escalated demands on high-performance computing (HPC) resources. A central challenge is predicting resource requirements for deep neural network (DNN) workloads, where inefficient provisioning leads to underutilized GPUs, wasted CPUs, and higher costs. This work explores AI resource prediction in HPC using two complementary approaches. Black-Box models use tabular features and regressors such as XGBoost for fast, workload-specific predictions. White-Box models, in contrast, extract graph-based features from High-Level Optimized (HLO) graphs to generalize across architectures. Results show that hybrid methods significantly improve accuracy, reducing fit-time estimation error from 75.48% to 10.55%. The estimators are being integrated with AI-driven job schedulers to improve workload allocation and utilization, paving the way for agents for Machine Learning Workflow (MLOps) systems across the computing continuum.