KAMI: Communication-Avoiding General Matrix Multiplication Within a Single GPU
Description

Efficient general matrix multiplication (GEMM) has attracted significant research attention in HPC and AI workloads. While large-scale GEMM has nearly reached the peak floating-point performance of GPUs, substantial optimization opportunities remain for small and batched GEMM operations.

In this paper, we propose KAMI, a set of 1D, 2D, and 3D GEMM algorithms that extend the theory of communication-avoiding (CA) techniques to operate within a single GPU. KAMI optimizes thread block-level GEMM by using tensor cores as computational units, low-latency thread registers as local memory, and higher-latency on-chip shared memory as the communication medium. We provide a theoretical analysis of CA performance in terms of GPU clock cycles rather than traditional execution time. We also implement SpMM and SpGEMM using this compute-communication pattern. Experimental results for general, low-rank, batched, and sparse multiplication operations on the latest NVIDIA, AMD, and Intel GPUs show significant performance improvements over existing libraries.
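To give intuition for why 1D, 2D, and 3D partitionings differ, the sketch below tabulates the classic per-worker communication volumes from CA-GEMM theory for an n x n multiplication over p workers. This is a simplified illustration of the general theory, not KAMI's actual clock-cycle model; the function name and constants are ours.

```python
# Illustrative communication-avoiding GEMM cost comparison (assumption:
# textbook asymptotic volumes, not KAMI's model). For C = A * B with
# n x n matrices over p workers, the words moved per worker are roughly:
#   1D (row-partitioned):  O(n^2)         -- each worker streams all of B
#   2D (SUMMA-style grid): O(n^2/sqrt(p)) -- panels of A and B per step
#   3D (replicated cube):  O(n^2/p^(2/3)) -- trades extra memory for less traffic

def words_per_worker(n: int, p: int, scheme: str) -> float:
    """Approximate words communicated per worker under each partitioning."""
    if scheme == "1d":
        return float(n * n)
    if scheme == "2d":
        return 2.0 * n * n / p ** 0.5
    if scheme == "3d":
        return 3.0 * n * n / p ** (2.0 / 3.0)
    raise ValueError(f"unknown scheme: {scheme}")

if __name__ == "__main__":
    n, p = 4096, 64
    for scheme in ("1d", "2d", "3d"):
        print(f"{scheme}: {words_per_worker(n, p, scheme):.3e} words/worker")
```

For n = 4096 and p = 64, the 2D scheme moves 4x fewer words per worker than 1D, and 3D moves roughly 5x fewer, which is the basic motivation for mapping higher-dimensional decompositions onto a GPU's register and shared-memory hierarchy.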