56. Software-Level Fault Tolerant Framework for Task-Based Applications
Event Type
Poster
LocationLower Lobby Concourse
DescriptionFault tolerance has been identified as one of the major challenges for exascale computing. In addition to fail-stop errors, silent data corruptions (SDCs) can perturb applications and produce incorrect results. Software-based fault tolerance mechanisms have the advantage of being capable of leveraging some of the properties of the applications to improve their reliability. In this poster, we present a fault tolerance framework that implements multiple resiliency schemes to cope with both fail-stop errors and data corruption. Our techniques are tested with two real scientific applications: BUDE, a molecular docking engine, and TeaLeaf, a heat conduction code. Using this framework we have successfully detected and recovered from real data corruptions. We have also performed error injection experiments, which clearly demonstrated the efficacy of our framework.
Archive
Authors












