LLM training in practice: insights from 85,000 checkpoints
Description
Training large language models (LLMs) at scale generates significant I/O, and vendor guidance typically recommends provisioning performance from the supply side: the peak bandwidth required to keep GPUs busy. These recommendations often overstate requirements, however, because they assume ideal GPU utilization. The demand side, the I/O performance that training jobs actually drive, is far less well characterized. Drawing on telemetry from production VAST systems underpinning some of the world’s largest AI training supercomputers, we analyzed over 85,000 checkpoints from 40 production LLM training jobs and found that even trillion-parameter models require only a few hundred GB/s for efficient checkpointing. From these observations, we derive a simple, demand-side model that relates LLM size and checkpoint interval to the global bandwidth needed. This model offers a way to avoid overprovisioning I/O and to maximize the resources (power, cooling) that can go toward compute.
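The abstract does not give the model's exact form, but a demand-side estimate of this kind can be sketched as follows. The bytes-per-parameter figure (14 B for fp16 weights plus fp32 master weights and Adam moments) and the assumption that a checkpoint should occupy only a small fraction of each checkpoint interval are illustrative choices, not values from the talk:

```python
def checkpoint_bandwidth_gbps(params_billion: float,
                              interval_s: float,
                              bytes_per_param: float = 14.0,
                              write_fraction: float = 0.05) -> float:
    """Estimate the global write bandwidth (GB/s) a training job demands
    for checkpointing.

    Assumptions (illustrative, not from the presentation):
      - bytes_per_param = 14: fp16 weights (2 B) + fp32 master weights (4 B)
        + two fp32 Adam moments (8 B) per parameter.
      - write_fraction = 0.05: the checkpoint write may consume at most 5%
        of the checkpoint interval without meaningfully stalling GPUs.
    """
    checkpoint_gb = params_billion * bytes_per_param      # total checkpoint size in GB
    write_window_s = interval_s * write_fraction          # time budget for the write
    return checkpoint_gb / write_window_s


# A 1-trillion-parameter model checkpointed every 30 minutes:
bw = checkpoint_bandwidth_gbps(params_billion=1000, interval_s=1800)
print(f"{bw:.0f} GB/s")
```

Under these assumed numbers, a trillion-parameter model needs on the order of 150 GB/s, consistent with the abstract's "a few hundred GB/s" finding; shorter intervals or tighter write windows scale the requirement up proportionally.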
Event Type
Workshop
Time
Monday, 17 November 2025, 4:05pm - 4:10pm CST
Location
230
Tags
Data Analytics
High Performance I/O, Storage, Archive, & File Systems
Storage
Recordings
Livestreamed
Recorded
Registration Categories
TP
W