Presentation

ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage
Description

Network simulators play a crucial role in evaluating the performance of large-scale systems. However, most existing simulators rely heavily on synthetic microbenchmarks or narrowly focus on a specific domain. In this paper, we introduce ATLAHS, a flexible, extensible, and open-source toolchain designed to trace real-world applications and accurately simulate their network behavior. ATLAHS leverages the GOAL format to efficiently model communication and computation patterns in AI, HPC, and distributed storage applications. It supports multiple network simulation backends and natively handles multi-job and multi-tenant scenarios. Through extensive validation, we demonstrate that ATLAHS achieves high accuracy in simulating real application workloads (consistently less than 5% error), while significantly outperforming AstraSim, the current state-of-the-art AI systems simulator. We further illustrate ATLAHS's utility via case studies, highlighting the impact of congestion control algorithms on distributed storage performance, as well as the influence of job-placement strategies on application performance within computing clusters.
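
As a rough illustration of what an accuracy bound such as "less than 5% error" can mean, the sketch below assumes the error is the relative difference between a simulated and a measured runtime. That definition is an assumption made here for illustration, not a metric taken from the paper. The job names and runtimes are hypothetical and not results from ATLAHS.

```python
def relative_error(simulated: float, measured: float) -> float:
    """Return |simulated - measured| / measured as a fraction (assumed metric)."""
    if measured <= 0:
        raise ValueError("measured runtime must be positive")
    return abs(simulated - measured) / measured


# Hypothetical (simulated, measured) runtimes in seconds, invented for illustration.
runs = {"job_a": (10.3, 10.0), "job_b": (47.0, 48.0)}

# Per-job relative error, then a check against an assumed 5% bound.
errors = {name: relative_error(sim, meas) for name, (sim, meas) in runs.items()}
within_bound = all(err < 0.05 for err in errors.values())
```

Normalizing by the measured runtime keeps the metric comparable across jobs of very different lengths, which is one common reason to report a relative error rather than an absolute one.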