Presentation

An Elastic Job Scheduler for HPC Applications on the Cloud
Description

The pay-as-you-go cost model of the cloud calls for specialized programming models and schedulers that let HPC jobs use cloud resources efficiently. A key requirement is the ability to rescale applications on the fly so that allocated resources stay busy. The most commonly used parallel programming models, such as MPI, have traditionally not supported autoscaling, either in cloud environments or on supercomputers. Charm++ is a parallel programming model that natively supports dynamic rescaling through its migratable-objects paradigm. We present a Kubernetes operator for running Charm++ applications on a Kubernetes cluster. We also present a priority-based elastic job scheduler that dynamically rescales jobs based on the state of the cluster, maximizing cluster utilization while minimizing response time for high-priority jobs. We show that our elastic scheduler delivers significant performance improvements over traditional static schedulers.
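To make the idea of priority-based elastic rescaling concrete, here is a minimal sketch of one possible allocation policy. It is an illustration only and is not the scheduler presented in this work: the `Job` fields, the `allocate` function, the two-pass policy, and the convention that a larger `priority` value means a more important job are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are hypothetical; they are not taken from the presented system.
    name: str
    priority: int  # larger value = more important (assumed convention)
    min_pes: int   # fewest processing elements the job can run on
    max_pes: int   # most processing elements the job can use (assumed >= min_pes)


def allocate(jobs, capacity):
    """Return {job name: PEs} for the jobs that fit within `capacity`."""
    # sorted() is stable, so jobs with equal priority keep submission order.
    ordered = sorted(jobs, key=lambda j: -j.priority)
    alloc = {}
    free = capacity

    # Pass 1: admit jobs at their minimum size, highest priority first.
    for job in ordered:
        if job.min_pes <= free:
            alloc[job.name] = job.min_pes
            free -= job.min_pes

    # Pass 2: grow admitted jobs toward their maximum with leftover PEs.
    for job in ordered:
        if job.name in alloc and free > 0:
            extra = min(job.max_pes - alloc[job.name], free)
            alloc[job.name] += extra
            free -= extra

    return alloc


low = Job("low", priority=1, min_pes=4, max_pes=16)
high = Job("high", priority=5, min_pes=8, max_pes=8)

alone = allocate([low], capacity=16)         # low-priority job fills the cluster
shared = allocate([low, high], capacity=16)  # it shrinks when "high" arrives
```

In this sketch, the low-priority job expands to all 16 processing elements while it runs alone, then shrinks to 8 when the high-priority job is submitted, so the new job starts immediately instead of waiting in a queue. A static scheduler would either leave resources idle or make the high-priority job wait.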