Presentation
LLM training in practice: insights from 85,000 checkpoints
Description: Training large language models (LLMs) at scale generates significant I/O, and vendor guidance typically recommends provisioning storage performance from the supply side: the peak bandwidth required to keep GPUs busy. These recommendations often overstate requirements, however, because they assume ideal GPU utilization. The demand side, the I/O performance that training jobs actually drive, is not as well characterized. Drawing on telemetry from production VAST systems underpinning some of the world’s largest AI training supercomputers, we analyzed over 85,000 checkpoints from 40 production LLM training jobs and found that even trillion-parameter models require only a few hundred GB/s for efficient checkpointing. From these observations, we derive a simple demand-side model that relates LLM size and checkpoint interval to the global bandwidth needed. This model offers a way to avoid overprovisioning I/O and to maximize the resources (power, cooling) that can go toward compute.
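The abstract does not publish the model's form. A minimal back-of-envelope sketch of such a demand-side estimate might look like the following, assuming a mixed-precision checkpoint of roughly 14 bytes per parameter (bf16 weights plus fp32 optimizer state) and a fixed fraction of each checkpoint interval budgeted for writes; both constants are assumptions for illustration, not figures from the talk.

```python
# Hedged sketch of a demand-side checkpoint bandwidth model.
# The constants below are assumed rules of thumb, not values from the talk.

def checkpoint_size_bytes(n_params: float, bytes_per_param: float = 14.0) -> float:
    """Approximate checkpoint size.

    14 bytes/param is an assumed mixed-precision figure: 2 B bf16 weights
    plus 12 B of fp32 master weights and Adam optimizer moments.
    """
    return n_params * bytes_per_param

def required_bandwidth_gbs(n_params: float,
                           interval_s: float,
                           stall_fraction: float = 0.01) -> float:
    """Global write bandwidth (GB/s) needed so that synchronous checkpoint
    writes consume at most `stall_fraction` of each checkpoint interval."""
    write_budget_s = interval_s * stall_fraction
    return checkpoint_size_bytes(n_params) / write_budget_s / 1e9

# Example: a 1-trillion-parameter model checkpointed hourly, budgeting
# 1% of wall-clock time for checkpoint I/O.
print(f"{required_bandwidth_gbs(1e12, 3600):.0f} GB/s")  # ~389 GB/s
```

Under these assumed constants the estimate lands in the few-hundred-GB/s range the abstract reports for trillion-parameter models; longer checkpoint intervals or a looser stall budget lower the requirement proportionally.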
Event Type: Workshop
Time: Monday, 17 November 2025, 4:05pm - 4:10pm CST
Location: 230
Tags: Data Analytics; High Performance I/O, Storage, Archive, & File Systems; Storage; Livestreamed; Recorded
Registration Categories: TP, W