Presentation
Taming the Beast of Dynamic Resource Management in HPC
Description
Dynamic resource management (DRM) enables the resources assigned to a job to be adjusted during execution. From a system perspective, DRM adds flexibility to resource allocation and job scheduling, with the potential to improve utilization, throughput, energy efficiency, and responsiveness. From an application perspective, it allows users to match resource requests to evolving needs, potentially reducing queue times and costs.
Despite these benefits and a decade of research, DRM remains largely an academic concept in HPC rather than a production feature. This is due to the need for coordinated changes across the entire software stack—applications, programming models, process managers, and resource managers—along with a holistic co-design effort to develop new scheduling and optimization policies.
We present a novel, end-to-end approach to DRM in HPC, introducing generic design principles for parallel programming models that integrate applications’ dynamic process management with the resource managers’ optimization capabilities. We apply these principles across the HPC stack, incorporating standards such as MPI and PMIx, to create a fully dynamic environment supporting diverse applications. This is paired with a performance-aware scheduling strategy based on steepest-ascent optimization.
Experiments on up to 100 nodes show moderate overheads for application process reconfiguration while delivering substantial gains in system throughput and average job turnaround time compared to static scheduling under high-load conditions.
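As a rough illustration of what steepest-ascent optimization over node allocations can look like, the sketch below repeatedly applies the single-node change that most increases an objective and stops when no change helps. It is not the presented scheduler: the Amdahl's-law performance model, the sum-of-speedups objective, the one-node move set, and every name here (`amdahl_speedup`, `objective`, `neighbors`, `steepest_ascent`) are assumptions made purely for illustration.

```python
from typing import Iterator, List, Optional


def amdahl_speedup(nodes: int, parallel_fraction: float) -> float:
    """Assumed performance model: Amdahl's-law speedup on `nodes` nodes."""
    if nodes <= 0:
        return 0.0
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nodes)


def objective(alloc: List[int], fractions: List[float]) -> float:
    """Assumed system objective: sum of per-job speedups."""
    return sum(amdahl_speedup(n, f) for n, f in zip(alloc, fractions))


def neighbors(alloc: List[int], total_nodes: int, min_nodes: int) -> Iterator[List[int]]:
    """Yield every allocation reachable by changing one node assignment."""
    free = total_nodes - sum(alloc)
    for i in range(len(alloc)):
        if free > 0:
            # Grow job i with a currently idle node.
            cand = list(alloc)
            cand[i] += 1
            yield cand
        for j in range(len(alloc)):
            if i != j and alloc[j] > min_nodes:
                # Shrink job j by one node and give that node to job i.
                cand = list(alloc)
                cand[i] += 1
                cand[j] -= 1
                yield cand


def steepest_ascent(alloc: List[int], fractions: List[float], total_nodes: int,
                    min_nodes: int = 1, eps: float = 1e-12) -> List[int]:
    """Take the best improving neighbor until none improves the objective."""
    alloc = list(alloc)
    while True:
        current = objective(alloc, fractions)
        best_gain = eps
        best: Optional[List[int]] = None
        for cand in neighbors(alloc, total_nodes, min_nodes):
            gain = objective(cand, fractions) - current
            if gain > best_gain:
                best_gain, best = gain, cand
        if best is None:
            return alloc  # local optimum: no single-node change helps
        alloc = best
```

For example, `steepest_ascent([1, 1], [0.5, 0.99], total_nodes=8)` hands the idle nodes to the job with the larger parallel fraction, since each extra node yields a larger gain there under the assumed model.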

Event Type
Doctoral Showcase
Time
Thursday, 20 November 2025, 11:15am - 11:30am CST
Location
230

