Presentation
Scaling LLM Training Using RDMA over Converged Ethernet
Session: The 12th Annual International Workshop on Innovating the Network for Data-Intensive Science (INDIS)
Description: We present a comprehensive benchmarking study that evaluates the scaling performance of RDMA over Converged Ethernet (RoCE) and compares it with InfiniBand for large-scale LLM training workloads. InfiniBand is traditionally favored for its low latency and high bandwidth, but it imposes significant infrastructure and operational costs. RoCE, which runs RDMA over commodity Ethernet, offers a cost-effective alternative. Through extensive experiments on production clusters, we demonstrate that a properly configured RoCE deployment can achieve near-linear scaling comparable to InfiniBand. Our analysis spans data sharding strategies, quantization and activation recomputation techniques, batch size tuning, and system-level optimizations, providing practical guidance for designing scalable and efficient AI infrastructure.
