Presentation

Detecting Silent Data Corruption in Sparse Matrices Using Hardware Performance Counters
Description

High-performance computing (HPC) systems frequently execute large-scale sparse matrix computations in scientific and engineering domains. These workloads are susceptible to silent data corruptions (SDCs), undetected faults that can alter results without triggering errors, which puts computational integrity at significant risk. In this work, we show how errors injected into sparse matrices propagate during repeated sparse matrix-vector multiplication (SpMV) executions, and we evaluate whether patterns in hardware performance counter (PMC) values can be used to detect such corruptions. We conduct controlled experiments with Gaussian noise injection at varying magnitudes and injection rates, record hardware counter values using the Linux perf tool, and train a decision tree classifier to distinguish corrupted runs from clean runs. Experiments on four real-world matrices from the SuiteSparse Matrix Collection yield detection accuracies of roughly 90% to 99% with under 2% runtime overhead. The results indicate that PMC-based classification is a viable approach for lightweight SDC detection.
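The injection and propagation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the small tridiagonal test matrix (a stand-in for the SuiteSparse matrices), the per-iteration normalization, and the choice to corrupt stored nonzero values are all assumptions made for illustration. It uses only the Python standard library and does not cover the perf measurement or the classifier.

```python
import random


def spmv(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def inject_gaussian_noise(data, rate, sigma, rng):
    """Return a copy of `data` with N(0, sigma^2) noise added to a
    fraction `rate` of the stored nonzeros (at least one entry).
    Corrupting nonzero values is an assumption of this sketch."""
    corrupted = list(data)
    n_faults = max(1, round(rate * len(data)))
    for k in rng.sample(range(len(data)), n_faults):
        corrupted[k] += rng.gauss(0.0, sigma)
    return corrupted


def repeated_spmv(indptr, indices, data, x0, iterations):
    """Apply SpMV repeatedly, rescaling by the max-norm each step so the
    vector stays bounded (the rescaling is an assumption of this sketch)."""
    x = list(x0)
    for _ in range(iterations):
        x = spmv(indptr, indices, data, x)
        norm = max(abs(v) for v in x) or 1.0
        x = [v / norm for v in x]
    return x


def build_tridiagonal(n):
    """Hypothetical test matrix: 2 on the diagonal, -1 off the diagonal."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


def max_abs_diff(a, b):
    """Largest elementwise deviation between two vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    indptr, indices, data = build_tridiagonal(8)
    x0 = [1.0] * 8
    clean = repeated_spmv(indptr, indices, data, x0, iterations=20)
    for sigma in (1e-6, 1e-3, 1.0):
        noisy = inject_gaussian_noise(data, rate=0.1, sigma=sigma, rng=rng)
        dirty = repeated_spmv(indptr, indices, noisy, x0, iterations=20)
        print(f"sigma={sigma:g}  deviation={max_abs_diff(clean, dirty):.3e}")
```

Running the script prints how far the corrupted result drifts from the clean one for each noise magnitude. In the study, this kind of labeled clean/corrupted run would additionally be profiled with perf to obtain the counter values that the decision tree classifier is trained on.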