Presentation

Forward Error Bounds and Efficient Algorithms for Computing a Tensor Times Matrix Chain in Low Precision on GPUs
Description

Many tensor processing algorithms require computing a tensor times matrix chain (TTMc) operation, and this operation is frequently the bottleneck in such algorithms. This work develops strategies for accelerating a TTMc using low-precision hardware.
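The abstract does not include an implementation; as a minimal illustration of what a TTMc computes, the sketch below applies a chain of mode-n tensor-times-matrix (TTM) products with NumPy. The function names `ttm` and `ttmc` are our own, not identifiers from the paper.

```python
import numpy as np

def ttm(X, U, mode):
    """Mode-n product: contract mode `mode` of tensor X with the columns of U."""
    Xm = np.moveaxis(X, mode, 0)            # bring the contracted mode to the front
    rest = Xm.shape[1:]
    Y = U @ Xm.reshape(Xm.shape[0], -1)     # matricized multiply
    return np.moveaxis(Y.reshape((U.shape[0],) + rest), 0, mode)

def ttmc(X, mats, modes):
    """Tensor times matrix chain: apply one TTM per (matrix, mode) pair in order."""
    for U, n in zip(mats, modes):
        X = ttm(X, U, n)
    return X
```

For example, multiplying a 2×3×4 tensor by a 5×3 matrix along mode 1 and a 6×4 matrix along mode 2 yields a 2×5×6 tensor; since the TTMs contract different modes, the chain can be applied in any order, which is what makes an ordering heuristic meaningful.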

We present a novel scheme for scaling the TTMc operands to prevent overflow. Our scheme exploits the Kronecker product structure of a TTMc so that the scaling can be applied efficiently. Additionally, we present the first forward error bound for TTMc, and we develop a heuristic for ordering the individual TTM operations within a TTMc to reduce the forward error.
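The abstract does not detail the Kronecker-structured scaling scheme itself; the sketch below is only a generic illustration of the underlying idea of operand scaling, not the authors' method. It scales each matrix operand by a power of two (so the scaling is exact and cheap to undo), multiplies in fp16, and unscales the result in fp64.

```python
import numpy as np

def pow2_scale(A):
    """A power-of-two factor s such that s*A has entries of magnitude at most 1."""
    m = np.max(np.abs(A))
    return 1.0 if m == 0 else 2.0 ** -np.ceil(np.log2(m))

def scaled_fp16_matmul(A, B):
    """Illustrative only: scale operands into fp16 range, multiply, unscale in fp64."""
    sa, sb = pow2_scale(A), pow2_scale(B)
    C16 = (A * sa).astype(np.float16) @ (B * sb).astype(np.float16)
    return C16.astype(np.float64) / (sa * sb)   # power-of-two unscaling is exact
```

With operands whose product exceeds the fp16 maximum of 65504, the unscaled fp16 product overflows to infinity while the scaled version stays finite; a real GPU implementation would additionally accumulate in fp32 on tensor cores.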

Our scaling scheme allows a TTMc on the Miranda tensor to be computed without overflow on an NVIDIA A100 GPU using FP16 arithmetic, achieving a speedup of up to 2× over FP64 arithmetic, even when accounting for the overhead of applying the scaling. We also show that our TTM ordering heuristic reduces the forward error for certain tensors.