Presentation

From Exploration to Explanation: ML-Driven Causal Discovery for Datacenter Reliability at Scale
Description
Modern datacenters operate at unprecedented scale, supporting HPC and AI workloads while consuming hundreds of megawatts of power. Their reliability is challenged by complex interdependencies across cooling, power, and network subsystems, where failures can cascade into downtime and degraded performance. Existing monitoring approaches, largely threshold-based or purely correlation-based, struggle to isolate root causes within high-dimensional, evolving telemetry. We present PACE (Pattern and Causal Exploration), an ML-based framework that combines unsupervised correlation clustering with supervised, lag-aware Granger causality to uncover subsystem structure and directed causal pathways from multivariate telemetry. PACE yields interpretable causal graphs and subsystem heatmaps that align with physical processes and control logic, providing actionable insights for operations. Finally, we discuss how embedding PACE into digital twin architectures enables causally informed what-if reasoning, advancing datacenter reliability and efficiency.
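As a rough illustration of the lag-aware Granger causality idea mentioned in the description, the sketch below compares two least-squares models of a target signal: one using only the target's own past values, and one that also uses past values of a candidate driver. This is not PACE's implementation, which the abstract does not detail. The signal names `inlet_temp` and `fan_power`, the synthetic data, and the helper functions are all hypothetical, and the correlation clustering step and any significance testing are left out.

```python
import numpy as np


def lagged_design(series, max_lag):
    """Columns are series[t-1], ..., series[t-max_lag] for t = max_lag..n-1."""
    n = len(series)
    return np.column_stack(
        [series[max_lag - k : n - k] for k in range(1, max_lag + 1)]
    )


def granger_f_stat(cause, effect, max_lag=2):
    """F statistic for the hypothesis that `cause` Granger-causes `effect`.

    Restricted model: effect[t] regressed on its own lags.
    Unrestricted model: the same, plus lags of `cause`.
    A large value means the lags of `cause` reduce the residual error.
    """
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    y = effect[max_lag:]
    intercept = np.ones((len(y), 1))
    restricted = np.hstack([intercept, lagged_design(effect, max_lag)])
    unrestricted = np.hstack([restricted, lagged_design(cause, max_lag)])

    def rss(design):
        # Residual sum of squares of an ordinary least-squares fit.
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(y) - unrestricted.shape[1]
    return ((rss_r - rss_u) / max_lag) / (rss_u / dof)


# Synthetic telemetry (assumed for illustration): fan power follows
# inlet temperature with a two-sample delay plus small noise.
rng = np.random.default_rng(0)
n = 500
inlet_temp = rng.normal(size=n)
noise = rng.normal(size=n)
fan_power = np.zeros(n)
fan_power[2:] = 0.8 * inlet_temp[:-2] + 0.1 * noise[2:]

f_forward = granger_f_stat(inlet_temp, fan_power, max_lag=2)
f_reverse = granger_f_stat(fan_power, inlet_temp, max_lag=2)
print(f"inlet_temp -> fan_power: F = {f_forward:.1f}")
print(f"fan_power -> inlet_temp: F = {f_reverse:.1f}")
```

In this toy setup the forward statistic should be far larger than the reverse one, which is the kind of asymmetry a directed edge in a causal graph would encode. A production pipeline would more likely rely on an established routine such as `grangercausalitytests` in statsmodels, which also reports p-values.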
Event Type
Workshop
Time
Sunday, 16 November 2025
10:45am - 11:10am CST
Location
242
Tags
AI, Machine Learning, & Deep Learning
Clouds & Distributed Computing
Performance Evaluation, Scalability, & Portability
Scientific & Information Visualization