Presentation
Optimizing Task-Driven Offloading in LLVM
Description
We investigate an inefficiency in the LLVM OpenMP runtime related to accelerator offloading. The current implementation manages asynchronous GPU tasks by polling async handles, which introduces CPU overhead. We propose replacing this polling model with an event-driven approach that detaches target tasks by default. In our design, each asynchronous task is associated with an event that is fulfilled once the GPU kernel completes, allowing the task to yield execution. This eliminates repeated polling and reduces scheduling overhead. We implemented this mechanism using existing features in the LLVM OpenMP runtime, relying on a host callback function provided by CUDA. Experiments on NVIDIA H100 GPUs show runtime improvements of up to 75% for independent tasks once matrix sizes exceed 128×128, with benefits appearing at even smaller sizes when task dependencies are present. For large kernels, the effect diminishes as execution time dominates.
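
The contrast between the polling model and the event-driven model described above can be sketched in plain C++. This is an illustrative simulation only, not the poster's implementation and not LLVM runtime code: `FakeStream`, `CompletionEvent`, `run_polling`, and `run_event_driven` are hypothetical names, a worker thread stands in for the GPU stream, and the callback stands in for a CUDA host callback (for example one registered with `cudaLaunchHostFunc`, which is an assumption about which callback is meant). In OpenMP terms, `CompletionEvent::fulfill` plays roughly the role of `omp_fulfill_event` on a task created with a `detach` clause.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

using ms = std::chrono::milliseconds;

// Hypothetical stand-in for a GPU stream: "runs a kernel" by sleeping on a
// worker thread, then invokes a host callback when the kernel is done.
struct FakeStream {
    std::thread worker;

    void launch(ms kernel_time, std::function<void()> on_done) {
        assert(on_done && "a completion callback is required");
        worker = std::thread([kernel_time, on_done] {
            std::this_thread::sleep_for(kernel_time);  // simulated kernel
            on_done();                                 // simulated host callback
        });
    }

    ~FakeStream() {
        if (worker.joinable()) worker.join();
    }
};

// Hypothetical stand-in for the event attached to a detached task.
struct CompletionEvent {
    std::mutex m;
    std::condition_variable cv;
    bool fulfilled = false;

    void fulfill() {
        {
            std::lock_guard<std::mutex> lock(m);
            fulfilled = true;
        }
        cv.notify_all();
    }

    // Blocks without spinning; here this stands in for the runtime
    // suspending the task until its event is fulfilled.
    bool wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return fulfilled; });
        return fulfilled;
    }
};

// Polling model: the host repeatedly checks a completion flag.
// Returns how many unsuccessful checks were needed.
long run_polling(ms kernel_time) {
    std::atomic<bool> done{false};
    long polls = 0;
    {
        FakeStream stream;
        stream.launch(kernel_time, [&done] {
            done.store(true, std::memory_order_release);
        });
        while (!done.load(std::memory_order_acquire)) {
            ++polls;                    // CPU work spent only on checking
            std::this_thread::yield();
        }
    }  // stream joins its worker here, before `done` goes out of scope
    return polls;
}

// Event-driven model: the callback fulfills the event, so the host never
// has to ask whether the kernel has finished.
bool run_event_driven(ms kernel_time) {
    CompletionEvent event;  // declared first so it outlives the stream
    FakeStream stream;
    stream.launch(kernel_time, [&event] { event.fulfill(); });
    return event.wait();
}
```

Calling `run_polling(ms(20))` returns the number of status checks the host burned while the simulated kernel ran, whereas `run_event_driven(ms(20))` completes with no status checks at all. In a real runtime the thread would pick up other ready tasks instead of blocking, which this sketch does not model.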

Event Type
Research and ACM SRC Posters
Time
Thursday, 20 November 2025, 8:00am - 5:00pm CST
Location
Second Floor Atrium


