Presentation

Tools To Detect and Diagnose Floating-Point Errors in Heterogeneous Computing Hardware and Software
Description: High-performance computing and machine learning applications increasingly rely on mixed-precision arithmetic on CPUs and GPUs for superior performance. However, this shift introduces several challenging numerical issues, such as increased round-off errors and INF and NaN exceptions, that can render the computed solutions useless. At present, this places a heavy burden on developers, who must interrupt their work to diagnose these problems manually. This tutorial presents three tools that target specific issues leading to floating-point bugs. First, we present FPChecker, which not only detects and reports INF/NaN exceptions in parallel and distributed CPU codes, but also reports the exponent value ranges that programmers can use to avoid exceptions while minimizing rounding errors. Second, we present GPU-FPX, which detects floating-point exceptions generated by NVIDIA GPUs, including their Tensor Cores via a "nixnan" extension to GPU-FPX. Third, we present FloatGuard, a unique tool that detects exceptions in AMD GPUs. The tutorial aims to help programmers avoid exception bugs; to this end, we will demonstrate our tools on simple examples with seeded bugs. Attendees may optionally install and run our tools. The tutorial also allocates question-and-answer time to address real situations faced by the attendees.
Note for Attendees: As advertised earlier, our half-day tutorial "Tools to Detect and Diagnose Floating-Point Errors in Heterogeneous Computing Hardware and Software (tut122)" covers three tools: FPChecker, GPU-FPX, and FloatGuard (video overview at https://lnkd.in/g6tU8FEV).

The first part of the tutorial, FPChecker, is about LLVM-based Floating-Point Exception Tracing. Those who want to follow along with the FPChecker exercises on macOS or Linux should install the Conda environment (instructions for Mac are provided at https://youtu.be/DNu8pQOYRGg).

The second part of the tutorial, GPU-FPX (and its extension NixNan), will be a demo-only presentation of NVIDIA SIMT-Core and Tensor-Core Binary Instrumentation-based Floating-Point Exception Tracing.

The third and final part of the tutorial will be a demo-only presentation of AMD GPU Binary Instrumentation-based Floating-Point Exception Tracing (FloatGuard).
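
For readers unfamiliar with the exceptions named in the description, the following is a minimal sketch of a seeded bug of the kind a tutorial exercise might use. It is not taken from the tutorial material and does not use FPChecker, GPU-FPX, or FloatGuard; those tools instrument compiled code or GPU binaries, whereas this sketch only inspects values in plain Python double precision. The function names `classify` and `trace` and the `scale` parameter are hypothetical, chosen for illustration.

```python
import math
import sys


def classify(x):
    """Label a double as 'nan', 'inf', 'subnormal', or 'ok'."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    # sys.float_info.min is the smallest positive *normalized* double;
    # nonzero values below it have lost precision (subnormal range).
    if x != 0.0 and abs(x) < sys.float_info.min:
        return "subnormal"
    return "ok"


def trace(scale):
    """Run a tiny computation with a seeded bug and classify each step.

    `scale` is a hypothetical input: values above 1.0 push the
    intermediate results outside the normal double-precision range.
    """
    big = sys.float_info.max * scale   # overflows to inf when scale > 1
    diff = big - big                   # inf - inf yields nan
    tiny = sys.float_info.min / scale  # underflows into the subnormal range
    steps = [("big", big), ("diff", diff), ("tiny", tiny)]
    return [(name, classify(value)) for name, value in steps]


if __name__ == "__main__":
    for scale in (1.0, 10.0):
        print(f"scale = {scale}")
        for name, label in trace(scale):
            print(f"  {name}: {label}")
```

With `scale = 1.0` every step stays in range; with `scale = 10.0` the overflow silently becomes INF and then propagates as NaN, which is the kind of failure that is tedious to locate by hand in a large parallel code.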