Presentation
Novel Graph Alignment Algorithms for Identifying Non-Determinism in Large-Scale Simulations
Description
The increasing complexity of HPC simulations poses several challenges to their reproducibility and reliability. One critical issue is the non-determinism (ND) induced by asynchronous MPI communication. Locating the sources of ND in large codes is difficult. This problem can be addressed by comparing event graphs (graphs mapping MPI communication) across multiple runs of the application, using tools like ANACIN-X[2] to trace the event graphs and network alignment to locate areas of ND. We expand ANACIN-X's point-to-point tracing capabilities by adding collective communication tracing, and propose a novel network alignment algorithm to effectively compare event graphs.
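To make the idea of comparing event graphs across runs concrete, here is a minimal Python sketch. It is not the alignment algorithm proposed in the poster and does not use ANACIN-X's trace format. It assumes a hypothetical trace of `(rank, kind, peer)` tuples in which every send appears before its matching receive and no rank sends to itself, and it aligns nodes naively by `(rank, per-rank event index)`. The function names `build_event_graph` and `divergent_receives` are illustrative only.

```python
from collections import defaultdict, deque


def build_event_graph(trace):
    """Build the edge set of an event graph from a hypothetical trace.

    trace: list of (rank, kind, peer) tuples, kind is "send" or "recv".
    For a recv, peer is the rank whose message was actually matched.
    Assumes each send appears in the trace before its matching recv.
    Nodes are (rank, per-rank event index).
    """
    next_idx = defaultdict(int)      # next event index on each rank
    last_node = {}                   # most recent node seen on each rank
    pending = defaultdict(deque)     # (src, dst) -> unmatched send nodes
    edges = set()
    for rank, kind, peer in trace:
        node = (rank, next_idx[rank])
        next_idx[rank] += 1
        if rank in last_node:
            # Program-order edge between consecutive events on one rank.
            edges.add((last_node[rank], node))
        last_node[rank] = node
        if kind == "send":
            pending[(rank, peer)].append(node)
        elif kind == "recv":
            # Message edge from the oldest unmatched send on this channel.
            edges.add((pending[(peer, rank)].popleft(), node))
    return edges


def divergent_receives(edges_a, edges_b):
    """Return receive nodes whose matched sender differs between two runs."""
    def senders(edges):
        # Message edges are the ones that cross ranks (no self-sends assumed).
        return {dst: src for src, dst in edges if src[0] != dst[0]}

    a, b = senders(edges_a), senders(edges_b)
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))


# Ranks 1 and 2 both send to rank 0, which receives from any source.
# The two runs differ only in the order in which the messages are matched.
run_a = [(1, "send", 0), (2, "send", 0), (0, "recv", 1), (0, "recv", 2)]
run_b = [(1, "send", 0), (2, "send", 0), (0, "recv", 2), (0, "recv", 1)]

nd_nodes = divergent_receives(build_event_graph(run_a), build_event_graph(run_b))
print(nd_nodes)
```

In this toy case both receives on rank 0 are flagged, because each is matched to a different sender in the two runs. Matching nodes by index breaks down once runs diverge in their number of events, which is one reason a dedicated network alignment method is of interest for real event graphs.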

Event Type
Research and ACM SRC Posters
Time
Tuesday, 18 November 2025, 8:00am - 5:00pm CST
Location
Second Floor Atrium
