Presentation

Understanding Communication Bottlenecks in Multi-Node LLM Inference
Description

As large language models (LLMs) grow in parameter count, efficient generation requires inference to scale beyond a single node. Current approaches use tensor parallelism (TP) or pipeline parallelism (PP), but TP incurs high communication volume, while PP suffers from pipeline bubbles and is unsuitable for latency-critical scenarios. We present Yalis (Yet Another LLM Inference System), a lightweight and modular distributed inference framework that performs comparably to existing state-of-the-art systems for offline inference while enabling rapid prototyping. Using Yalis, we study strong scaling of LLM inference on the Alps and Perlmutter supercomputers, showing that existing parallelism strategies scale poorly due to high communication overheads. We further compare the all-reduce performance of NCCL and MPI in the small-message regime, finding that while NCCL is efficient intra-node, MPI can outperform it across nodes for messages between 256 KB and 1024 KB. These results motivate the need for communication-efficient parallelism strategies for multi-node LLM inference.