Presentation
An Integrated Deep Reinforcement Learning Agent for Sunfish and HPC Workload Manager Composable Disaggregated Resource Scheduling
Description
The Sunfish Composable Disaggregated Infrastructure framework, combined with a deep reinforcement learning agent for scheduling, integrates with both HPC workload managers and container orchestrators to reduce application run-time latency, increase data center batch run efficiency, dynamically create ephemeral IO burst buffers, and mitigate problems from degraded hardware. Managing disaggregated resource pools with Sunfish minimizes idle resources and allows burst buffer allocations that create optimized execution environments for modern workloads, such as MOD/SIM and AI/ML. We will disclose our work integrating Sunfish with the Flux workload manager on a national lab testbed and discuss additional use cases within the industry.





