Presentation

The Cost of Teaching Operational ML
Description: Operational machine learning (ML) requires skills beyond model development, including infrastructure provisioning, large-scale training across clusters, model deployment with consideration of operational performance, monitoring, and automation. These capabilities are grounded in high-performance computing and distributed systems. This paper presents the design and infrastructure requirements of a graduate-level course on ML Systems Engineering and Operations, aimed at equipping students with these skills. Using 186,692 total compute instance hours on the Chameleon Cloud testbed, students built end-to-end ML pipelines incorporating distributed training, reproducible experiment tracking, automated re-training and re-deployment, and continuous monitoring. We analyze compute usage across assignments, compare expected versus actual resource consumption, and estimate that replicating the course on commercial cloud platforms would cost approximately $250 per student (almost $50,000 for our course with an enrollment of 191 students).
All course materials are publicly available for reuse.
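
As a rough sanity check on the figures quoted above, the following minimal sketch recomputes the course-wide total from the per-student estimate and the enrollment. The function name `course_totals` is hypothetical, and the "blended rate" it returns is an illustrative derived quantity that assumes every instance hour is priced uniformly; it is not a number reported in the paper, whose estimate may price instance types differently.

```python
# Figures quoted in the abstract (the per-student cost is approximate).
TOTAL_INSTANCE_HOURS = 186_692
ENROLLMENT = 191
COST_PER_STUDENT_USD = 250


def course_totals(instance_hours, students, cost_per_student):
    """Return (total cost, instance hours per student, implied blended rate).

    The blended rate is an assumption-laden illustration: it spreads the
    estimated total cost evenly over all instance hours.
    """
    total_cost = students * cost_per_student
    hours_per_student = instance_hours / students
    blended_rate = total_cost / instance_hours  # USD per instance hour
    return total_cost, hours_per_student, blended_rate


total_cost, hours_per_student, blended_rate = course_totals(
    TOTAL_INSTANCE_HOURS, ENROLLMENT, COST_PER_STUDENT_USD
)
print(f"Estimated total cost: ${total_cost:,}")
print(f"Instance hours per student: {hours_per_student:.1f}")
print(f"Implied blended rate: ${blended_rate:.3f} per instance hour")
```

With these inputs the total comes to $47,750, consistent with the "almost $50,000" stated above, and usage averages roughly 977 instance hours per student.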