SC16 Salt Lake City, UT

DS10. Improving Fault Tolerance for Extreme Scale Systems

Student: Eduardo Berrocal (Illinois Institute of Technology)
Advisor: Zhiling Lan (Illinois Institute of Technology)
Abstract: Mean time between failures (MTBF) is expected to drop on exascale systems. It has been shown that combining checkpointing with failure prediction allows longer checkpoint intervals, which in turn means fewer checkpoints. We present a new density-based approach to failure prediction built on the Void Search (VS) algorithm and evaluate it using environmental logs from the Mira Blue Gene/Q supercomputer at Argonne National Laboratory. The move to exascale raises other problems as well: the transistor size and energy consumption of future systems must be significantly reduced, and both steps might dramatically affect the soft error rate (SER). When soft errors are not detected and corrected properly, by either hardware or software mechanisms, they can corrupt applications’ memory state. In our previous work, we leveraged the datasets produced by HPC applications to design a general corruption detection scheme with relatively low overhead.
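
The link between failure prediction and longer checkpoint intervals can be illustrated with a minimal sketch. It is not the method presented in this work. It assumes Young's first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), and further assumes that every correctly predicted failure is handled proactively, so that only the unpredicted fraction (1 - recall) of failures drives periodic checkpointing. The cost of false alarms is ignored. The function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def interval_with_prediction(checkpoint_cost_s: float, mtbf_s: float,
                             recall: float) -> float:
    """Interval when a predictor catches a fraction `recall` of failures.

    Assumption: predicted failures are avoided by proactive action, so the
    MTBF seen by periodic checkpointing stretches to MTBF / (1 - recall).
    """
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return young_interval(checkpoint_cost_s, mtbf_s / (1.0 - recall))


if __name__ == "__main__":
    cost, mtbf = 100.0, 20000.0  # illustrative values in seconds, not measured
    print(young_interval(cost, mtbf))
    print(interval_with_prediction(cost, mtbf, recall=0.75))
```

Under these assumptions, a predictor with recall 0.75 quadruples the effective MTBF and therefore doubles the checkpoint interval.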

