Presentation
Extending THAPI with CXI Hardware Counter Sampling for High Resolution NIC Telemetry
Description
High-performance computing (HPC) applications are sensitive to network variability, yet existing tracing tools lack insight into low-level network behavior.
Modern network interface controllers (NICs), such as HPE's Slingshot-11 Cassini, provide detailed hardware counters that can reveal conditions like congestion and retries, but remain underused due to limited integration with tracing frameworks.
We extend the THAPI framework with a sampling plugin for Cassini's CXI interface, periodically collecting NIC counters and integrating them into HPC trace timelines via the iprof tool.
Data is visualized in Perfetto, enabling correlation between network telemetry and application events.
Our approach imposes negligible overhead at typical sampling rates and exposes previously hidden performance factors, such as congestion delays and load imbalances.
Case studies on point-to-point and collective patterns demonstrate new diagnostic capabilities.
Contributions include the plugin's design, integration into a state-of-the-art tracing toolchain, and evaluation highlighting opportunities for improved HPC communication performance.
Event Type
Workshop
Time
Monday, 17 November 2025
4:00pm - 4:30pm CST
Location
241
