SC16 Salt Lake City, UT

DS4. From Detection to Optimization: Understanding Silent Error's Impact on Scientific Applications


Student: Jon Calhoun (University of Illinois)
Advisor: Luke Olson (University of Illinois)
Abstract: As HPC systems become more complex, they become more vulnerable to failures. Silent errors that lead to silent data corruption (SDC) are of particular concern. SDC refers to a change in application state without any indication that a failure occurred, and it can lead to longer simulation times or perturbations in results. Understanding how SDC impacts applications and how to create low-cost SDC detectors is critical for large-scale applications. Full detection is cost-prohibitive; therefore, we seek to detect only the SDC that impacts results. Accepting small perturbations in state allows for new optimizations, such as lossy compression, to mitigate memory bottlenecks. In this talk, I discuss the impact of SDC on HPC applications and detail a customized SDC detection and recovery scheme for algebraic multigrid linear solvers. I conclude by showing how lossy compression can improve checkpoint-restart performance by adding small errors guided by a compression error tolerance selection methodology.


