Presentation
Forward Error Bounds and Efficient Algorithms for Computing a Tensor Times Matrix Chain in Low Precision on GPUs
Description
Many tensor processing algorithms require computing a tensor times matrix chain (TTMc) operation, and this operation is frequently the bottleneck in such algorithms. This work develops strategies for accelerating a TTMc using low-precision hardware.
We present a novel scheme for scaling the TTMc operands to prevent overflow. Our scheme exploits the Kronecker product structure of a TTMc to allow for efficient application. Additionally, we present the first forward error bound for TTMc, and we develop a heuristic for ordering the individual TTM operations within a TTMc to reduce the forward error.
Our scaling scheme allows a TTMc on the Miranda tensor to be computed without overflow on an NVIDIA A100 GPU using FP16 arithmetic, exhibiting a speedup of up to 2× over FP64 arithmetic, even when accounting for the overhead of applying scaling. We show that our TTM ordering heuristic reduces the forward error for certain tensors.
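To make the TTMc operation concrete: it applies a matrix to each of several modes of a tensor, one tensor-times-matrix (TTM) product at a time, and the order of those TTMs is exactly what the paper's heuristic chooses. A minimal NumPy sketch (function names and the mode-ordering interface are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def ttm(X, A, mode):
    # Mode-n (TTM) product: contract mode `mode` of X with the columns of A.
    # The result replaces X.shape[mode] with A.shape[0].
    Xm = np.moveaxis(X, mode, 0)              # bring the contracted mode to the front
    Y = A @ Xm.reshape(Xm.shape[0], -1)       # multiply the matricized tensor
    Y = Y.reshape((A.shape[0],) + Xm.shape[1:])
    return np.moveaxis(Y, 0, mode)            # restore the original mode order

def ttmc(X, matrices, modes):
    # TTMc: apply a chain of TTMs; `modes` fixes the evaluation order,
    # which is the degree of freedom the ordering heuristic exploits.
    Y = X
    for A, m in zip(matrices, modes):
        Y = ttm(Y, A, m)
    return Y
```

In exact arithmetic the result is independent of the TTM order; in low precision the order changes the rounding-error accumulation, which is what motivates an ordering heuristic.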
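The basic idea behind overflow-avoiding scaling can be illustrated with a single low-precision TTM: scale each operand by a power of two so its entries fit comfortably in FP16, multiply in FP16, then undo the scaling in FP64. This is only a per-operand sketch under assumed conventions; the paper's scheme exploits the Kronecker product structure of the full chain to apply scaling more efficiently.

```python
import numpy as np

def scaled_ttm_fp16(X, A, mode):
    # Hedged sketch: power-of-two scaling so |entries| <= 1 before the
    # FP16 multiply, with the scales undone in FP64 afterwards.
    sX = 2.0 ** np.ceil(np.log2(max(np.abs(X).max(), np.finfo(np.float64).tiny)))
    sA = 2.0 ** np.ceil(np.log2(max(np.abs(A).max(), np.finfo(np.float64).tiny)))
    Xs = (X / sX).astype(np.float16)
    As = (A / sA).astype(np.float16)
    Xm = np.moveaxis(Xs, mode, 0)
    Y16 = As @ Xm.reshape(Xm.shape[0], -1)    # FP16 multiply
    Y = Y16.astype(np.float64).reshape((As.shape[0],) + Xm.shape[1:])
    return np.moveaxis(Y, 0, mode) * (sX * sA)  # undo both scales in FP64
```

Power-of-two scales are used so that the scaling itself is exact in binary floating point and introduces no additional rounding error.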

Event Type
Research and ACM SRC Posters
Time
Thursday, 20 November 2025, 8:00am - 5:00pm CST
Location
Second Floor Atrium