Presentation

An Efficient GEMM Acceleration Method for LLM Inference with Variable-Length Sequences
Description

Transformer-based large language models (LLMs) have demonstrated remarkable capabilities in natural language processing (NLP) tasks. The transformer layers in LLMs involve substantial general matrix multiplication (GEMM). However, when variable-length sequences are padded to a uniform size, the GEMM incurs redundant computation and hardware resource overhead, which reduces inference speed.

This work proposes an efficient GEMM acceleration method for LLM inference with variable-length sequences. First, a fused parallel prefix scan design is developed to capture the distribution of matrix dimensions. Second, an efficient variable-size tile kernel is implemented on Matrix Core, with an analysis of the hardware resources required during computation. Third, a hardware-aware tiling algorithm is designed to select the optimal tiling scheme based on thread parallelism and hardware resources. Experimental results show that the proposed approach achieves speedups of 3.10x and 2.99x (up to 4.44x and 4.27x) over hipBLAS and rocBLAS, respectively.
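As a rough illustration of the first step, the host-side effect of a prefix scan over sequence lengths can be sketched in plain Python (the function names and batch values here are hypothetical; the actual design is a fused GPU kernel). An exclusive scan turns variable lengths into packed row offsets, so GEMM tiles can be launched over the packed rows instead of padding every sequence to the batch maximum:

```python
from itertools import accumulate

def exclusive_scan(lengths):
    """Exclusive prefix scan: offsets[i] = sum(lengths[:i])."""
    return [0] + list(accumulate(lengths))[:-1]

lengths = [7, 3, 12, 5]            # variable sequence lengths in a batch
offsets = exclusive_scan(lengths)  # row offset of each sequence: [0, 7, 10, 22]
total_rows = offsets[-1] + lengths[-1]       # 27 packed rows in total

# With uniform-size padding, every sequence is stretched to max(lengths):
padded_rows = len(lengths) * max(lengths)    # 48 rows
# Packing avoids computing on the padded_rows - total_rows redundant rows.
```

The scanned offsets give each kernel workgroup the starting row of its sequence, which is why the paper fuses the scan with the dimension-distribution capture rather than running it as a separate pass.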
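The third step, hardware-aware tile selection, can likewise be sketched as a toy cost model (the criteria below are assumptions for illustration: launch at least enough tiles to occupy the compute units, then minimize the work wasted on in-tile padding; the actual algorithm also weighs thread parallelism and on-chip resource budgets):

```python
import math

def wasted_fraction(rows, cols, tm, tn):
    """Fraction of the tiled GEMM that falls outside the real matrix."""
    covered = math.ceil(rows / tm) * tm * math.ceil(cols / tn) * tn
    return (covered - rows * cols) / covered

def pick_tile(rows, cols, candidates, num_cus):
    """Pick (tm, tn): prefer tile shapes that yield >= num_cus tiles
    (enough parallelism to fill the device), then the least wasted work."""
    def score(tile):
        tm, tn = tile
        tiles = math.ceil(rows / tm) * math.ceil(cols / tn)
        return (tiles < num_cus, wasted_fraction(rows, cols, tm, tn))
    return min(candidates, key=score)

# Hypothetical candidate tile shapes and compute-unit count:
best = pick_tile(100, 100, [(16, 16), (32, 32), (64, 64)], num_cus=60)
```

For a small 100x100 problem this model favors the smallest tile, since larger tiles both pad more and launch too few workgroups to keep the compute units busy.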