Distributed Deep Learning on GPU-Based Clusters
Description
Deep learning (DL) is rapidly becoming pervasive in almost all areas of computer science, and is even being used to assist computational science simulations and data analysis. A key behavior of these deep neural networks (DNNs) is that they reliably scale, i.e., they continuously improve in performance as the number of model parameters and the amount of training data grow. As the demand for larger, more sophisticated, and more accurate DL models increases, the need for large-scale parallel model training, fine-tuning, and inference has become increasingly pressing. Consequently, in the past few years, several parallel algorithms and frameworks have been developed to parallelize model training and inference on GPU-based platforms. This tutorial will introduce the fundamentals and the state of the art in distributed deep learning. We will use large language models (LLMs) as a running example, and teach the audience the three essential steps of working with LLMs: (1) training an LLM from scratch, (2) continued training/fine-tuning of an LLM from a checkpoint, and (3) inference on a trained LLM. We will cover algorithms and frameworks falling under the purview of data parallelism (PyTorch DDP and DeepSpeed) and tensor parallelism (AxoNN).
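To give a flavor of the data-parallel approach covered in the tutorial, the sketch below illustrates the core idea behind frameworks like PyTorch DDP and DeepSpeed in plain Python, without any framework API: each worker computes gradients on its own shard of the batch, the gradients are averaged (the role of an all-reduce), and every replica applies the identical update. The toy model (one-parameter linear regression) and all function names are hypothetical, chosen only to keep the example self-contained.

```python
# Conceptual sketch of data parallelism (the idea behind PyTorch DDP and
# DeepSpeed), NOT their actual APIs. Toy model: y = w * x with squared error.

def local_gradient(w, shard):
    # Gradient of the mean squared error over this worker's shard of data.
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def data_parallel_step(w, batch, num_workers, lr=0.1):
    # Split the batch evenly across workers (assumes it divides evenly).
    size = len(batch) // num_workers
    shards = [batch[i * size:(i + 1) * size] for i in range(num_workers)]
    # Each worker computes its local gradient independently (in parallel
    # on a real cluster; sequentially here for illustration).
    grads = [local_gradient(w, s) for s in shards]
    # "All-reduce": average the gradients so all replicas stay in sync.
    g = sum(grads) / num_workers
    return w - lr * g

def serial_step(w, batch, lr=0.1):
    # Reference: a single-worker, full-batch gradient step.
    return w - lr * local_gradient(w, batch)
```

Because the per-shard gradients are averaged, one data-parallel step produces the same update as a single-worker step on the full batch, which is why DDP-style training preserves the serial training semantics while scaling across GPUs.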
Event Type: Tutorial
Time: Monday, 17 November 2025, 1:30pm - 5:00pm CST
Location: 123
Livestreamed
Recorded
