BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260202T201804Z
LOCATION:230
DTSTART;TZID=America/Chicago:20251120T103000
DTEND;TZID=America/Chicago:20251120T104500
UID:submissions.supercomputing.org_SC25_sess534_drs111@linklings.com
SUMMARY:Designing GPU-Aware Collective Communication for Heterogeneous Clu
 sters with Diverse GPUs and Interconnects
DESCRIPTION:Chen-Chun Chen (The Ohio State University)\n\nGPU-accelerated 
 HPC and deep learning workloads now operate at scales of tens to thousands
  of GPUs, making collective communication a dominant cost. Applications su
 ch as Amber, heFFTe, and distributed LLM training require frequent synchro
 nization and exchange of large data partitions. At the same time, systems 
 are increasingly heterogeneous: clusters combine NVIDIA, AMD, and Intel GP
 Us with interconnects such as NVLink, Infinity Fabric, InfiniBand, and Sli
 ngshot. Many MPI runtimes remain tuned for CPU-centric designs, performing
  unnecessary host staging, adding extra copies, and underutilizing high-ba
 ndwidth device paths or multi-rail topologies. Support for newer stacks,
  particularly SYCL and Level Zero on Intel GPUs, is also uneven,
  hindering performance portability.\n\nWe present a unified, GPU-aware
  collective framework that targets portability and efficiency across
  vendors and networks. For Alltoall, we design IPC-based intra-node
  paths that avoid host staging
  and introduce push and pull variants that overlap intra- and inter-node t
 ransfers. For Allreduce, we implement on-device reduction kernels with nat
 ive inter-node GPU support and computation-communication overlap; for medi
 um messages at large scale, we add a direct sendrecv algorithm with thrott
 ling to balance bandwidth and latency. The framework extends to Intel GPUs
  via SYCL and Level Zero, alongside CUDA and ROCm back ends. To mitigate
  inter-node bandwidth limits for very large messages, we integrate a
  lightweight casting-based compression scheme that downcasts data in
  flight with negligible accuracy loss. Together, these designs provide
  efficient Alltoall and Allreduce across NVIDIA, AMD, and Intel
  platforms, improving end-to-end performance while reducing CPU
  involvement and data movement overhead.\n\nTag: Research & ACM SRC
  Posters\n\nRecording: Livestreamed, Recorded\n\nRegistration Category:
  Technical Program Reg Pass\n\nSession Chairs: Yanfei Guo (Argonne
  National Laboratory (ANL)); Shirley Moore (University of Texas at El
  Paso); Kento Sato (RIKEN Center for Computational Science (R-CCS));
  Chris Schlipalius (Pawsey Supercomputing Research Centre; Commonwealth
  Scientific and Industrial Research Organisation (CSIRO), Australia);
  and Anja Gerbes (Georg-August-Universität Göttingen)\n\n
END:VEVENT
END:VCALENDAR
