Presentation
SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training
Description
Pipeline parallelism serves as a crucial technique for training large language models, owing to its capability to alleviate memory pressure from model states with low communication overhead. However, in long-context scenarios, existing pipeline parallelism methods fail to address the substantial activation memory pressure, due to the peak memory consumption resulting from the accumulation of activations across multiple microbatches. Moreover, these approaches inevitably introduce considerable pipeline bubbles, further hindering efficiency.
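To illustrate how activations accumulate across microbatches, here is a minimal sketch that assumes the classic non-interleaved one-forward-one-backward (1F1B) schedule as the baseline. The function name and parameters are hypothetical and not taken from the paper; the sketch only counts how many microbatches each stage holds in flight at its peak.

```python
def in_flight_microbatches(num_stages: int, num_microbatches: int) -> list[int]:
    """Peak count of microbatches whose activations each stage must keep.

    Assumption: classic non-interleaved 1F1B, where stage s (0-indexed) runs
    num_stages - s - 1 warm-up forwards before its first backward pass.
    """
    return [min(num_stages - stage, num_microbatches) for stage in range(num_stages)]


if __name__ == "__main__":
    # With 4 stages and 8 microbatches, the first stage holds the most activations.
    print(in_flight_microbatches(num_stages=4, num_microbatches=8))
```

Under this assumption the first stage keeps as many microbatches alive as there are pipeline stages, which is the accumulation the abstract refers to; in long-context training each of those microbatches is a full long sequence.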
To tackle these challenges, we propose SlimPipe, a novel approach to fine-grained pipeline parallelism that employs uniform sequence slicing coupled with one-forward-one-backward scheduling. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. Although the slices are evenly partitioned, the computation cost is not equal across slices due to causal self-attention. We develop a sophisticated workload redistribution technique to address this load imbalance. SlimPipe achieves near-zero memory overhead and minimal pipeline bubbles simultaneously.
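The load imbalance caused by causal self-attention can be made concrete with a rough count of query-key pairs per slice. This is only an illustrative sketch: the function name is hypothetical, it counts attention score pairs alone (ignoring the linear layers, whose cost is equal across uniform slices), and it does not implement SlimPipe's workload redistribution technique.

```python
def causal_pairs_per_slice(seq_len: int, num_slices: int) -> list[int]:
    """Count query-key pairs each uniform slice computes under causal attention.

    Queries in slice i attend to every key in slices 0..i-1, plus the
    lower-triangular part (including the diagonal) of their own slice.
    """
    if seq_len % num_slices != 0:
        raise ValueError("seq_len must be divisible by num_slices")
    slice_len = seq_len // num_slices
    own_slice = slice_len * (slice_len + 1) // 2  # causal pairs inside the slice
    return [slice_len * (i * slice_len) + own_slice for i in range(num_slices)]


if __name__ == "__main__":
    pairs = causal_pairs_per_slice(seq_len=8, num_slices=4)
    print(pairs)
    # The per-slice counts add up to the full causal triangle of the sequence.
    print(sum(pairs) == 8 * 9 // 2)
```

In this toy setting the last slice computes five times as many attention pairs as the first, even though every slice contains the same number of tokens. The gap widens as the number of slices grows, which is why evenly partitioned slices still need some form of rebalancing.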
Event Type
Paper
Time
Wednesday, 19 November 2025, 4:15pm - 4:37pm CST
Location
261-262-265-266
HPC for Machine Learning
