Presentation
Engine-Agnostic Model Hot-Swapping for Cost-Effective LLM Inference
Description
The widespread adoption of Large Language Models (LLMs) has led to an increased demand for large-scale inference services, presenting a unique set of challenges for the HPC community. These services are characterized by moderate-scale models that require dedicating expensive GPUs to handle bursty inference requests, leading to high costs and resource underutilization. In this paper, we propose SwapServeLLM — a novel engine-agnostic hot-swapping method for cost-effective inference. This model hot-swapping approach is enabled by recent driver capabilities for transparent GPU checkpointing. SwapServeLLM optimizes resource utilization by dynamically allocating GPU resources with two key mechanisms: (1) demand-aware preemption that leverages information about concurrent requests, and (2) efficient request routing with memory reservation that minimizes inference latency. Our evaluation demonstrates that SwapServeLLM improves model loading for state-of-the-art inference engines by 31× compared to vLLM and by up to 29% compared to Ollama, enabling cost-effective inference.
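To make the abstract's two mechanisms concrete, the sketch below is a minimal illustration of how checkpoint-based hot-swapping with demand-aware preemption and memory-reserved routing could be wired together; it is not the authors' implementation. It assumes one engine process per model (identified by PID), approximates per-model demand by queued request counts, and suspends/restores CUDA state with NVIDIA's cuda-checkpoint utility; the class and method names (HotSwapScheduler, ModelSlot, route) and the memory bookkeeping are hypothetical.

# Illustrative sketch only; assumes NVIDIA's cuda-checkpoint utility
# is on PATH and each model is served by a separate engine process.
import subprocess
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ModelSlot:
    name: str
    pid: int            # engine process serving this model
    resident: bool      # True if its CUDA state is on the GPU
    mem_gib: float      # footprint including reserved KV-cache memory

class HotSwapScheduler:
    def __init__(self, gpu_mem_gib: float):
        self.gpu_mem_gib = gpu_mem_gib
        self.slots: dict[str, ModelSlot] = {}            # registered models
        self.queues: dict[str, int] = defaultdict(int)   # pending requests

    def _toggle(self, slot: ModelSlot) -> None:
        # cuda-checkpoint toggles a process's CUDA state between
        # running (on-GPU) and checkpointed (in host memory).
        subprocess.run(
            ["cuda-checkpoint", "--toggle", "--pid", str(slot.pid)],
            check=True,
        )
        slot.resident = not slot.resident

    def route(self, model: str) -> None:
        """Route one request to `model`, swapping it in if needed."""
        self.queues[model] += 1
        target = self.slots[model]   # assumes model was registered
        if target.resident:
            return
        # Demand-aware preemption: checkpoint resident models with the
        # least queued demand until the target's reservation fits.
        free = self.gpu_mem_gib - sum(
            s.mem_gib for s in self.slots.values() if s.resident)
        victims = sorted(
            (s for s in self.slots.values() if s.resident),
            key=lambda s: self.queues[s.name])
        for victim in victims:
            if free >= target.mem_gib:
                break
            self._toggle(victim)     # checkpoint victim off the GPU
            free += victim.mem_gib
        if free >= target.mem_gib:
            self._toggle(target)     # restore the target's CUDA state

The checkpoint toggle is what would keep such a design engine-agnostic: the scheduler manipulates whole processes rather than any engine's internals.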
Event Type
Workshop
Time
Monday, 17 November 2025, 11:00am - 11:30am CST
Location
275


