Presentation

Implementing Network-level QoS at HPC Datacenters to Enable Distributed Scientific Workflows
Description
High-performance computing (HPC) datacenters must simultaneously support real-time data streams with sub-millisecond latency and bulk transfers requiring sustained multi-gigabit throughput, two demands that compete for the same network resources. End-to-end performance guarantees are therefore essential, typically delivered through Quality of Service (QoS) mechanisms that classify traffic, reserve bandwidth, and enforce priorities across all network hops. While backbone and wide-area network providers already implement QoS, the local Ethernet ingress “last-mile” inside HPC facilities generally remains best-effort, creating a critical blind spot where latency builds and time-sensitive workflows can suffer. We address this gap with a standards-based Differentiated Services Code Point (DSCP) QoS configuration on existing leaf–spine switches: packets are marked at the host, queued per traffic class, and shaped on every hop through to the high-speed network (HSN) gateway NIC. Experiments on both intra-domain and inter-domain traffic show up to 60 percent more stable throughput and 30 percent fewer retransmissions, without hardware upgrades.
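
As an illustration of the host-side marking step only, the following is a minimal Python sketch of how an application might set a DSCP value on its outgoing packets through the standard `IP_TOS` socket option. It is not the presenters' configuration: the assignment of Expedited Forwarding to latency-sensitive streams and AF11 to bulk transfers, and the helper names `dscp_to_tos`, `tos_to_dscp`, and `open_marked_udp_socket`, are assumptions made for this example. Switch-side queueing and shaping are vendor-specific and are not shown.

```python
import socket

# Standard DSCP code points (RFC 2474, RFC 3246, RFC 2597).
# Which workflow uses which class is an assumption for this sketch.
DSCP_EF = 46           # Expedited Forwarding: assumed for real-time streams
DSCP_AF11 = 10         # Assured Forwarding 11: assumed for bulk transfers
DSCP_BEST_EFFORT = 0   # Default forwarding


def dscp_to_tos(dscp: int) -> int:
    """Return the IPv4 TOS byte for a DSCP value.

    DSCP occupies the upper six bits of the byte; the lower two bits
    (ECN) are left at zero here.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def tos_to_dscp(tos: int) -> int:
    """Extract the six-bit DSCP value from a TOS byte."""
    return (tos >> 2) & 0x3F


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Create an IPv4 UDP socket whose outgoing packets request this DSCP.

    Hypothetical helper: the operating system and every switch on the
    path must be configured to honor the marking for it to take effect.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

For example, `open_marked_udp_socket(DSCP_EF)` would request a TOS byte of `0xB8` on a latency-sensitive stream. Marking in the application is only one option; hosts can also mark traffic centrally with firewall or traffic-control rules, and the abstract does not say which approach was used.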