Presentation
X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms
Description
Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems, primarily optimized for NVIDIA GPUs, perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped.
In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered by AMD MI250X GPUs, shows that X-MoE scales DeepSeek-style MoEs up to 545 billion parameters across 1,024 GPUs—10x larger than the largest trainable model with existing methods under the same hardware budget, while maintaining high training throughput.
Event Type
Paper
Time
Wednesday, 19 November 2025, 1:30pm - 1:52pm CST
Location
261-262-265-266
BSP
HPC for Machine Learning

