Presentation
The Cost of Teaching Operational ML
Description
Operational machine learning (ML) requires skills beyond model development: infrastructure provisioning, large-scale training across clusters, model deployment with attention to operational performance, monitoring, and automation. These capabilities are grounded in high-performance computing and distributed systems. This paper presents the design and infrastructure requirements of a graduate-level course on ML Systems Engineering and Operations that aims to equip students with these skills. Using 186,692 total compute instance hours on the Chameleon Cloud testbed, students built end-to-end ML pipelines incorporating distributed training, reproducible experiment tracking, automated re-training and re-deployment, and continuous monitoring. We analyze compute usage across assignments, compare expected versus actual resource consumption, and estimate that replicating the course on commercial cloud platforms would cost approximately $250 per student (almost $50,000 for our course, which enrolled 191 students).
All course materials are publicly available for reuse.
Event Type
Workshop
Time
Sunday, 16 November 2025, 10:45am - 11:00am CST
Location
261