Presentation
The MALL is Open: Exploring Shared Caches and Latency in AMD CDNA™ 3 GPUs
Description
This paper presents an analysis of memory hierarchy latency across AMD Instinct™ MI300A, MI300X, and MI250X GPUs using a fine-grained pointer-chasing microbenchmark. We characterize the scalar L1 (sL1), the L2, the AMD Infinity Cache™, referred to as the MALL (Memory Attached Last Level), and HBM (High Bandwidth Memory), revealing distinct latency levels and architectural trade-offs. MI300A and MI300X, both based on the CDNA3 architecture, exhibit nearly identical latency profiles, while MI250X lacks a MALL and therefore shows different performance characteristics. Memory latency remains consistent across compute partitioning modes, but the NUMA Partitioning per Socket (NPS) mode significantly impacts performance. In NPS4 mode, partitioning improves locality, reducing latency by up to 1.42× in the MALL and 1.31× in HBM. We further analyze MALL contention and Translation Lookaside Buffer (TLB) behavior under varying levels of parallelism, identifying conditions under which MALL performance degrades. These findings provide actionable insights for optimizing memory access patterns and improving performance on AMD’s latest GPU architectures.
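
To illustrate the pointer-chasing idea mentioned in the description, the following is a minimal host-side sketch in Python, not the authors' GPU benchmark. It builds a random single-cycle chain and follows it with dependent loads, so each access must wait for the previous one. The function names (`build_chain`, `chase`, `time_per_hop_ns`) and all parameters are hypothetical, and timings taken in Python are dominated by interpreter overhead, so only the access pattern carries over.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list nxt where i -> nxt[i] forms one cycle over all n slots.

    Uses Sattolo's algorithm, which yields a random cyclic permutation,
    so the chase touches every slot before repeating and is hard to prefetch.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i (never i itself)
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the final index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]  # the next address depends on the value just loaded
    return idx


def time_per_hop_ns(n, steps, seed=0):
    """Average wall-clock time per hop for a chain of n slots (illustrative only)."""
    nxt = build_chain(n, seed)
    chase(nxt, n)  # warm-up pass over the whole working set
    t0 = time.perf_counter_ns()
    end = chase(nxt, steps)
    t1 = time.perf_counter_ns()
    return (t1 - t0) / steps, end


if __name__ == "__main__":
    # Sweep the working-set size; in a real benchmark each plateau in
    # time-per-hop would correspond to one level of the memory hierarchy.
    for n in (1 << 10, 1 << 14, 1 << 18):
        ns, _ = time_per_hop_ns(n, steps=200_000)
        print(f"slots={n:>8}  approx {ns:.1f} ns/hop")
```

In a GPU version of this pattern, a single thread would typically issue the dependent loads while the buffer size is swept past each cache capacity, so that the per-hop time steps up from one level to the next.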




