Presentation
HPC-R1: Characterizing R1-Like Large Reasoning Models on HPC
Description
Large reasoning models (LRMs) are becoming increasingly popular as they offer advanced capabilities in logical inference, mathematical reasoning, and knowledge synthesis, even beyond those of standard language models. However, their complex training workflows present significant challenges in reproducibility, efficiency, and system-level optimization. This paper introduces HPC-R1, a comprehensive characterization of LRM training on a modern HPC cluster. We analyze all major stages, including supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO)-based reinforcement learning (RL), autoregressive generation, and distillation using customized state-of-the-art frameworks. Our detailed performance analysis reveals key system scaling behaviors. We find that GRPO-based reinforcement learning training is heavily communication-bound, with over 90% of GPU time spent in non-compute operations, and that SFT achieves stable GPU throughput near 9.8 TFLOPs. We also observe inference pipeline imbalance, where the performance gap between ranks can reach 64%. Based on these findings, we present recommendations to guide future AI-HPC system design.
Event Type
Paper
Time
Wednesday, 19 November 2025, 2:37pm - 3:00pm CST
Location
261-262-265-266
Performance Measurement, Modeling, & Tools