Presentation

Towards Efficient LLM Inference via Collective and Adaptive Speculative Decoding
Description

Efficient LLM inference remains challenging because autoregressive decoding generates only one token at a time. Speculative decoding addresses this limitation by using small speculative models (SSMs) to speed up LLM inference. However, the low acceptance rate of SSMs and the high verification cost of LLMs limit further performance improvement. In this paper, we present Smurfs, an LLM inference system that accelerates inference through collective and adaptive speculative decoding. Smurfs adopts a majority-voting mechanism that harnesses multiple SSMs to collaboratively predict LLM outputs in multi-task scenarios. It also decouples SSM speculation from LLM verification and uses pipelined execution to reduce the latency of SSM speculation. Additionally, Smurfs introduces a mechanism that dynamically determines the optimal SSM speculation length at runtime. Experimental results show that Smurfs achieves higher inference throughput and lower latency than state-of-the-art systems.
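
To make the idea of collective speculation concrete, here is a minimal sketch of how drafts from several SSMs might be merged by majority vote before LLM verification. It is an illustration only, not the Smurfs implementation: the function name `majority_vote_draft`, the strict-majority rule, and the choice to stop at the first position without a majority are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote_draft(proposals: Sequence[Sequence[int]]) -> List[int]:
    """Merge token drafts from several SSMs position by position.

    Hypothetical rule (an assumption, not taken from the paper): a token is
    kept only if a strict majority of all SSMs propose it at that position,
    and only SSMs that agreed with the merged draft so far keep voting.
    """
    total = len(proposals)
    active = [list(p) for p in proposals]  # drafts still consistent with the merged prefix
    draft: List[int] = []
    pos = 0
    while True:
        # Collect the votes of every still-consistent draft that has a token here.
        candidates = [p[pos] for p in active if len(p) > pos]
        if not candidates:
            break
        token, votes = Counter(candidates).most_common(1)[0]
        if 2 * votes <= total:  # no strict majority: stop speculating
            break
        draft.append(token)
        # Drop drafts that diverged from the chosen token.
        active = [p for p in active if len(p) > pos and p[pos] == token]
        pos += 1
    return draft


if __name__ == "__main__":
    # Three hypothetical SSM drafts expressed as token ids.
    print(majority_vote_draft([[1, 2, 3], [1, 2, 4], [1, 5, 3]]))
```

In this sketch the merged draft is shorter than any single SSM's draft whenever the models disagree, which trades speculation length for a higher chance that the LLM accepts every drafted token.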