Presentation

Towards Application Agnostic HPC Profiling
Description

Modern HPC systems generate large amounts of GPU and network telemetry, which is typically used only for system health monitoring. At NERSC, we are developing a Performance API/UI that generates a job report card from this telemetry, giving users an overview of their job's performance characteristics. Using NVIDIA Data Center GPU Manager (DCGM) counters, we report GPU memory, compute, and power usage, and we present preliminary investigations of job-level network activity. Because it does not rely on traditional profiling tools, this application-agnostic approach helps identify resource utilization imbalances, detect anomalies such as memory leaks, and assess overall job performance, all without additional effort from the user.
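To make the idea of a job report card concrete, the following is a minimal sketch of how per-GPU telemetry samples might be summarized. It is not the NERSC Performance API: the function names, the sample keys (`gpu_util`, `fb_used_mib`, `power_w`), and both thresholds are hypothetical choices made for illustration, and the memory-leak check is a simple linear trend rather than whatever method the presenters use.

```python
from statistics import mean


def linear_slope(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def job_report_card(samples, imbalance_threshold=0.25, leak_slope_mib=1.0):
    """Summarize telemetry for one job.

    samples maps a GPU id to a time-ordered list of dicts with the
    hypothetical keys "gpu_util" (%), "fb_used_mib" (MiB), "power_w" (W).
    Both thresholds are illustrative assumptions, not tuned values.
    """
    per_gpu = {}
    for gpu_id, rows in samples.items():
        mem = [r["fb_used_mib"] for r in rows]
        per_gpu[gpu_id] = {
            "mean_util": mean(r["gpu_util"] for r in rows),
            "peak_mem_mib": max(mem),
            "mean_power_w": mean(r["power_w"] for r in rows),
            "mem_slope_mib_per_sample": linear_slope(mem),
        }

    # Relative spread in mean utilization across the job's GPUs.
    utils = [g["mean_util"] for g in per_gpu.values()]
    spread = (max(utils) - min(utils)) / max(max(utils), 1e-9)

    return {
        "per_gpu": per_gpu,
        "util_imbalance": spread,
        "imbalanced": spread > imbalance_threshold,
        # Steadily growing memory use is only a hint of a leak, not proof.
        "possible_leak_gpus": sorted(
            g for g, s in per_gpu.items()
            if s["mem_slope_mib_per_sample"] > leak_slope_mib
        ),
    }


if __name__ == "__main__":
    demo = {
        0: [{"gpu_util": 90, "fb_used_mib": 1000, "power_w": 300}] * 4,
        1: [{"gpu_util": 30, "fb_used_mib": 1000 + 100 * i, "power_w": 150}
            for i in range(4)],
    }
    print(job_report_card(demo))
```

In practice the samples would come from the site's telemetry store rather than an in-memory dictionary, and a steady rise in memory use would need to be checked against the application's expected allocation pattern before being reported as a leak.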