
Closing Invited Talk: TBA
Description

As large language models move into production at unprecedented scale, the requirements for efficient, reliable, and cost-effective inference have diverged from those of training. Modern deployments must meet diverse SLAs, support rapidly growing GPU fleets, and handle workloads with widely varying performance characteristics. NVIDIA Dynamo is a production-grade framework for distributed inference at scale that addresses these challenges through modular disaggregation, topology-aware scheduling, and intelligent memory and KV-cache management. This presentation covers Dynamo's design for high-performance inference at scale, detailing how disaggregating inference across prefill and decode phases increases utilization. We highlight advancements such as KV-cache-aware routing and offloading strategies that leverage the full memory hierarchy, from HBM to networked storage. Together, these strategies form a cohesive platform for efficient and scalable LLM inference in real-world production environments.
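As a rough illustration of the KV-cache-aware routing idea mentioned above (this is a hypothetical sketch, not Dynamo's actual interface or data structures), a router can score each worker by how many leading token blocks of an incoming prompt are already resident in that worker's KV cache, and send the request where cache reuse is highest:

```python
# Hypothetical sketch of KV-cache-aware routing: route each request to the
# worker whose KV cache already holds the longest matching block prefix.
# BLOCK, Worker, and route() are illustrative names, not Dynamo APIs.
from dataclasses import dataclass, field
from typing import List, Set, Tuple

BLOCK = 4  # tokens per KV-cache block (assumed value for the sketch)

@dataclass
class Worker:
    name: str
    cached_blocks: Set[Tuple[int, ...]] = field(default_factory=set)
    load: int = 0  # outstanding requests, used as a tiebreaker

def to_blocks(tokens: List[int]) -> List[Tuple[int, ...]]:
    # Split a token sequence into fixed-size, hashable blocks.
    return [tuple(tokens[i:i + BLOCK]) for i in range(0, len(tokens), BLOCK)]

def prefix_hits(worker: Worker, tokens: List[int]) -> int:
    # Count contiguous leading blocks already cached on this worker;
    # only a prefix match lets the worker skip prefill for those tokens.
    hits = 0
    for blk in to_blocks(tokens):
        if blk not in worker.cached_blocks:
            break
        hits += 1
    return hits

def route(workers: List[Worker], tokens: List[int]) -> Worker:
    # Most reusable prefix blocks wins; lower load breaks ties.
    return max(workers, key=lambda w: (prefix_hits(w, tokens), -w.load))
```

In this toy model, a worker that previously served a shared system prompt keeps its blocks cached and attracts follow-up requests with the same prefix, which is the intuition behind routing for cache reuse rather than pure load balancing.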