Presentation

WAGES: Workload-Aware GPU Sharing System for Energy-Efficient Serverless LLM Serving
Description

Serverless LLM serving lowers costs by elastically provisioning GPUs and charging only for usage. However, current systems mostly target cold-start latency and overlook two inefficiencies: (i) static, exclusive GPU allocation, which wastes compute resources and increases costs, and (ii) fixed, hardware-controlled clock speeds, which waste energy. Our analysis shows that many LLM workloads can meet their SLOs with partial SM (streaming multiprocessor) allocations and reduced clock speeds, which enables GPU multiplexing and dynamic clock scaling. We present WAGES, a workload-aware GPU sharing system that uses NVIDIA MPS to co-locate LLMs and dynamically adjusts SM partitions and clock speeds to workload needs while meeting SLOs. A two-tier scheduler coordinates global GPU consolidation with local SLO-aware tuning and overlaps model/KV migration with execution to reduce reconfiguration overhead. On real LLM traces, WAGES improves SLO attainment by up to 4% over prior GPU sharing approaches and reduces energy use by up to 26%.
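
The local, SLO-aware tuning step described above can be pictured with a small sketch. This is not the WAGES algorithm, which the abstract does not detail. It assumes a hypothetical offline profile that records latency and power for each (SM share, clock) pair, and it picks the feasible pair with the lowest energy per request. The names `Config` and `pick_config` and all numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """One profiled operating point (all values are hypothetical)."""
    sm_percent: int    # share of SMs granted to the model
    clock_mhz: int     # SM clock the GPU is held at
    latency_ms: float  # measured request latency at this point
    power_w: float     # measured average power at this point


def pick_config(profiles: Sequence[Config], slo_ms: float) -> Optional[Config]:
    """Return the lowest-energy config whose latency meets the SLO.

    Energy per request is approximated as power * latency.
    Ties are broken in favor of the smaller SM share, which leaves
    more of the GPU free for co-located models.
    """
    feasible = [c for c in profiles if c.latency_ms <= slo_ms]
    if not feasible:
        return None  # no profiled point can meet this SLO
    return min(feasible, key=lambda c: (c.power_w * c.latency_ms, c.sm_percent))


# Made-up profile for a single model, for illustration only.
profiles = [
    Config(sm_percent=100, clock_mhz=1980, latency_ms=40.0, power_w=300.0),
    Config(sm_percent=50, clock_mhz=1980, latency_ms=70.0, power_w=180.0),
    Config(sm_percent=50, clock_mhz=1200, latency_ms=95.0, power_w=110.0),
    Config(sm_percent=25, clock_mhz=1200, latency_ms=160.0, power_w=70.0),
]

choice = pick_config(profiles, slo_ms=100.0)
```

With a 100 ms SLO, the sketch selects the half-GPU, reduced-clock point, since the full-GPU point meets the SLO with slack it does not need. On real hardware, one possible way to apply such a choice is the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable for MPS clients together with `nvidia-smi --lock-gpu-clocks`. Whether WAGES uses these exact mechanisms is an assumption here.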