HydraCache: LLM Inference Prefill Parallelization Through Distributed Cache Blending
Description

The prefill phase of large language model (LLM) inference, where the input prompt is processed to generate a key-value (KV) cache, is a critical latency bottleneck for long input sequences. Existing serving architectures face a trade-off: data parallelism (DP) offers flexibility but cannot accelerate a single long prompt, while tensor parallelism (TP) parallelizes prefill at the cost of rigid resource allocation and constant per-layer communication overhead. We introduce HydraCache, a system that resolves this trade-off by enabling a cluster of independent, data-parallel model replicas to collaborate on demand to parallelize the prefill of a single long prompt. Our core contribution is DistBlendAttention, a lightweight mechanism that fuses distributed KV caches with minimal communication, avoiding the prohibitive overheads of both TP and traditional sequence parallelism. Our evaluation shows that HydraCache reduces Time-to-First-Token (TTFT) by up to 7x and enables flexible, SLO-aware serving.
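The abstract does not spell out how DistBlendAttention fuses distributed KV caches. As background, the sketch below shows one standard way to merge attention results computed over disjoint KV shards: each replica returns an unnormalized output, a softmax normalizer, and its max logit, and a coordinator combines them exactly via the log-sum-exp trick from online softmax. All names here are illustrative; this is not the paper's implementation.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention of query q over one KV shard; returns the unnormalized
    output, the shard's softmax normalizer, and its max logit."""
    logits = K @ q / np.sqrt(q.shape[0])
    m = logits.max()
    w = np.exp(logits - m)           # numerically stabilized weights
    return w @ V, w.sum(), m

def blend(partials):
    """Merge per-shard partials into the exact global attention output."""
    m_glob = max(m for _, _, m in partials)
    num = sum(o * np.exp(m - m_glob) for o, _, m in partials)
    den = sum(s * np.exp(m - m_glob) for _, s, m in partials)
    return num / den

rng = np.random.default_rng(0)
d, n = 8, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Reference: attention over the full KV cache on a single device.
logits = K @ q / np.sqrt(d)
w = np.exp(logits - logits.max())
ref = w @ V / w.sum()

# "Distributed": two replicas each hold half the KV cache.
blended = blend([partial_attention(q, K[:n // 2], V[:n // 2]),
                 partial_attention(q, K[n // 2:], V[n // 2:])])
assert np.allclose(ref, blended)
```

Only two scalars plus one vector per shard cross the network, which illustrates why such blending schemes can avoid the per-layer all-reduce traffic of tensor parallelism.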