Presentation

DiffMoE: Efficient Batched MoE Inference with Priority-Driven Differential Expert Caching
Description

Mixture-of-Experts (MoE) models reduce the computation of large language models (LLMs) by sparsely activating experts, but their massive parameter storage creates severe GPU memory bottlenecks. Existing solutions offload experts to host memory and prefetch them with sophisticated policies, yet they target single-batch inference and suffer from communication bottlenecks at larger batch sizes. We identify two forms of locality in expert activation: a small set of experts is invoked frequently throughout inference (global locality), while others recur within short decoding bursts (temporal locality). To exploit both, we propose DiffMoE, which introduces a differential cache hierarchy in GPU memory. Globally hot experts reside in per-layer high-priority caches, locally hot experts are dynamically managed in per-layer medium-priority caches under a priority-driven replacement policy, and cold experts are cached temporarily and evicted on demand. Moreover, a lightweight predictor overlaps expert migration with computation to reduce latency. Our evaluation shows that DiffMoE significantly outperforms state-of-the-art systems.
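The three-tier hierarchy described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not DiffMoE's actual implementation: the class name `LayerExpertCache`, the halving decay applied to medium-tier priorities, the promotion of a cold expert on its second access, and the FIFO eviction of cold slots are all hypothetical choices made for illustration. The abstract does not specify the priority function, and the predictor that overlaps migration with computation is not modeled here.

```python
from collections import OrderedDict


class LayerExpertCache:
    """Hypothetical three-tier expert cache for a single MoE layer."""

    def __init__(self, pinned, medium_slots, cold_slots):
        # Assumes medium_slots >= 1 and cold_slots >= 1.
        self.high = set(pinned)        # globally hot experts, never evicted
        self.medium = {}               # expert id -> priority score
        self.cold = OrderedDict()      # temporary slots, oldest evicted first
        self.medium_slots = medium_slots
        self.cold_slots = cold_slots
        self.transfers = 0             # simulated host-to-GPU expert loads

    def access(self, expert):
        """Return the tier that served the request, or 'miss'."""
        if expert in self.high:
            return "high"
        # Assumed policy: halve every medium priority on each non-pinned
        # access, so experts from a finished burst fade out.
        for e in self.medium:
            self.medium[e] *= 0.5
        if expert in self.medium:
            self.medium[expert] += 1.0
            return "medium"
        if expert in self.cold:
            # A second access while still resident suggests a burst:
            # promote the expert to the medium tier.
            del self.cold[expert]
            self._insert_medium(expert, 1.0)
            return "cold"
        # Miss: fetch from host memory into a temporary cold slot.
        self.transfers += 1
        self.cold[expert] = True
        if len(self.cold) > self.cold_slots:
            self.cold.popitem(last=False)
        return "miss"

    def _insert_medium(self, expert, priority):
        # Priority-driven replacement: evict the lowest-priority entry.
        if len(self.medium) >= self.medium_slots:
            victim = min(self.medium, key=self.medium.get)
            del self.medium[victim]
        self.medium[expert] = priority


cache = LayerExpertCache(pinned=[0, 1], medium_slots=1, cold_slots=2)
print([cache.access(e) for e in [0, 5, 5, 5, 7, 7, 1]])
# ['high', 'miss', 'cold', 'medium', 'miss', 'cold', 'high']
print(cache.transfers)  # 2
```

In this trace, experts 0 and 1 are pinned and never trigger a transfer, expert 5 is loaded once and then served from GPU memory for the rest of its burst, and expert 7 displaces it from the single medium slot once the burst ends. A real system would size the tiers per layer and derive the pinned set from profiling, which this sketch leaves out.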