Exploring Efficient Deep Learning Training on AI Accelerators
Description

The computational and memory demands of DNN training have grown with the size of AI models in recent years. To meet these demands, popular accelerators (e.g., GPUs) must find novel ways to reduce memory utilization, since their memory capacity is on the scale of tens of GB. Other companies have unveiled novel AI accelerators, generally with high on-chip memory capacity and varied architectures; on these accelerators, frequent on-chip/off-chip memory transactions can bottleneck performance. Lossy compression is a promising tool for reducing the data footprint of DNN training. Our work studies lossy compressors targeting training data and activation data, and how to efficiently run compression and GNN training on novel AI accelerators.
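As a minimal illustration of the lossy activation-compression idea (this is not one of the compressors studied in this work, and all names are hypothetical), the sketch below quantizes activations saved for the backward pass to 8-bit integers, cutting per-element storage from 4 bytes to roughly 1 byte, and dequantizes them when the backward pass needs them:

```python
def quantize_fp32_to_int8(act):
    """Lossy 8-bit quantization of an activation tensor (here, a flat list).

    Returns the quantized values plus the scale needed to reconstruct them;
    storage drops from 4 bytes to ~1 byte per element (about 4X).
    """
    scale = max((abs(a) for a in act), default=0.0) / 127 or 1.0
    return [round(a / scale) for a in act], scale

def dequantize_int8_to_fp32(q, scale):
    """Approximate reconstruction; per-element error is at most scale/2."""
    return [v * scale for v in q]

# Hypothetical training-loop usage: compress when an activation is saved
# for the backward pass, decompress when the backward pass consumes it.
saved = {}

def save_for_backward(name, act):
    saved[name] = quantize_fp32_to_int8(act)

def load_for_backward(name):
    q, scale = saved[name]
    return dequantize_int8_to_fp32(q, scale)
```

For example, `save_for_backward("relu1", acts)` followed later by `load_for_backward("relu1")` returns values within `scale/2` of the originals, trading a small, bounded error for the memory savings.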
Our contributions are: 1) a novel, portable training-data compressor, called DCT+Chop, for emerging AI accelerators; 2) an activation compression framework tailored to the Graphcore Intelligence Processing Unit (IPU); 3) a GPU-based design for a compressor/optimizer-agnostic lossy activation compression framework, called LAT-ACT; and 4) an exploration of training graph neural networks (GNNs) on the Cerebras CS-2. DCT+Chop and IPU activation compression have yielded strong results: DCT+Chop can compress training data by up to 16X with throughput on the scale of tens of GB/s, while IPU activation compression can speed up single-IPU training by up to 3.5X and multi-IPU training by several orders of magnitude. Preliminary results suggest LAT-ACT yields compression ratios of 4-12X with limited accuracy degradation. GNN training on the CS-2 can be implemented with PyTorch APIs, but further exploration is needed to support the sparse operators common to GNNs.
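The abstract does not detail DCT+Chop's algorithm, so the following is only a hedged sketch of what its name suggests: apply a discrete cosine transform to a block of training data and "chop" all but the leading low-frequency coefficients, giving a compression ratio of block_size/kept (a 16X ratio would correspond to keeping 1/16 of the coefficients). The function names and the 1-D, pure-Python formulation are illustrative, not the actual implementation:

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a block of samples."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

def idct2(X):
    """Inverse transform (DCT-III with 2/N scaling); exact if nothing was chopped."""
    N = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi * (n + 0.5) * k / N)
                            for k in range(1, N))) * 2 / N
            for n in range(N)]

def compress(block, keep):
    """Keep only the `keep` lowest-frequency coefficients: ratio = len(block)/keep."""
    return dct2(block)[:keep]

def decompress(coeffs, block_size):
    """Zero-pad the chopped high-frequency coefficients, then invert."""
    return idct2(coeffs + [0.0] * (block_size - len(coeffs)))
```

Since natural training data (e.g., images) concentrates energy in low frequencies, chopping the high-frequency coefficients discards little information, which is what makes a DCT-based scheme attractive at high ratios.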

