Presentation
CATIOS: Time-Resolved I/O-Aware Job Scheduling for HPC Systems
Description
HPC workloads are increasingly data-intensive, with contention on shared storage emerging as a primary bottleneck. Existing I/O-aware job schedulers rely on static bandwidth assumptions that overlook time-varying I/O behavior, leading to inefficient utilization and unpredictability.
This work introduces the Contention-Avoiding Temporal I/O-aware job Scheduling (CATIOS) framework, which incorporates temporal I/O behavior into scheduling decisions to address these issues. CATIOS first matches each incoming job with historical profiles that are similar both semantically and in terms of resources; these profiles serve as proxies for the incoming jobs. After configurable scheduling policies reprioritize the jobs, the proxies are evaluated sequentially in a time-resolved, contention-aware simulation. Together, these steps enable CATIOS to avoid overlapping I/O bursts in line with the configured objectives.
Evaluations with Blue Waters workloads on a SimGrid-based platform show that CATIOS reduces makespan while maintaining controlled average wait times, achieving a balanced trade-off of approximately 1:1.3 between makespan reduction and average wait time increase. These results demonstrate CATIOS’s capability to improve data-intensive HPC systems with mixed workloads.
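The idea of placing jobs sequentially so that their I/O bursts do not overlap can be illustrated with a minimal sketch. It is not the CATIOS implementation: it assumes each proxy profile is simply a list of bandwidth demands per time slot, that the shared storage has one fixed capacity, and that compute-node availability can be ignored. The names `earliest_start`, `schedule`, and the sample jobs are hypothetical.

```python
from typing import Dict, List


def earliest_start(profile: List[float], load: List[float], capacity: float) -> int:
    """Return the first start slot at which adding `profile` to the current
    aggregate `load` never exceeds `capacity` in any time slot."""
    if max(profile, default=0.0) > capacity:
        raise ValueError("a single burst exceeds the storage capacity")
    start = 0
    while True:
        fits = all(
            (load[start + i] if start + i < len(load) else 0.0) + bw <= capacity
            for i, bw in enumerate(profile)
        )
        if fits:
            return start
        start += 1  # slide the job one slot later and try again


def schedule(jobs: Dict[str, List[float]], order: List[str], capacity: float) -> Dict[str, int]:
    """Place jobs one by one, in the given priority order, on a shared
    bandwidth timeline (assumed model: one storage system, fixed capacity)."""
    load: List[float] = []      # aggregate bandwidth demand per time slot
    starts: Dict[str, int] = {}
    for name in order:
        profile = jobs[name]
        s = earliest_start(profile, load, capacity)
        # Grow the timeline if the job runs past its current end.
        if len(load) < s + len(profile):
            load.extend([0.0] * (s + len(profile) - len(load)))
        for i, bw in enumerate(profile):
            load[s + i] += bw
        starts[name] = s
    return starts


# Hypothetical proxy profiles: bandwidth demand (arbitrary units) per slot.
jobs = {
    "A": [8.0, 0.0, 0.0, 8.0],
    "B": [8.0, 0.0, 0.0],
    "C": [0.0, 4.0, 4.0],
}
print(schedule(jobs, order=["A", "B", "C"], capacity=10.0))
# {'A': 0, 'B': 1, 'C': 3}
```

In this toy run, job B is shifted by one slot so that its burst falls between the two bursts of job A, which a scheduler assuming a static bandwidth per job could not express. Changing `order` stands in for applying a different prioritization policy.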

Event Type
Research and ACM SRC Posters
Time
Tuesday, 18 November 2025
8:00am - 5:00pm CST
Location
Second Floor Atrium



