FlipBack: Automatic Targeted Protection Against Silent Data Corruption
Session: Resilience
Session Chair: Christian Engelmann
Event Type: Paper
Intermediate
Resiliency
System Software
Location: 355-BC
Description: The decreasing size of transistors has been critical to the increase in capacity of supercomputers. Transistors are predicted to shrink to one third of their current size by the time exascale computers are available. The smaller the transistors are, the less energy is required to flip a bit, and thus silent data corruptions (SDCs) become more common. Traditional approaches to protecting applications from SDCs come at the cost of either doubling the hardware resources or doubling application execution time. In this paper, we present FlipBack, an automatic software-based approach that protects applications from SDCs. FlipBack provides targeted protection for different types of data and calculations based on their characteristics. We evaluate FlipBack with various HPC mini-applications that capture the behavior of real scientific applications and show that FlipBack is able to fully protect applications from SDCs with only 10% performance degradation.








