SC16 Salt Lake City, UT

56. Software-Level Fault Tolerant Framework for Task-Based Applications

Authors: Joy Yeh (University of Bristol), Grzegorz Pawelczak (University of Bristol), James Sewart (University of Bristol), James Price (University of Bristol), Ferad Zyulkyarov (Barcelona Supercomputing Center), Leonardo Bautista-Gomez (Barcelona Supercomputing Center), Osman Unsal (Barcelona Supercomputing Center), Simon McIntosh-Smith (University of Bristol), Amaurys Avila Ibarra (University of Bristol)

Abstract: Fault tolerance has been identified as one of the major challenges for exascale computing. In addition to fail-stop errors, silent data corruptions (SDCs) can perturb applications and produce incorrect results. Software-based fault tolerance mechanisms have the advantage that they can exploit properties of the application to improve its reliability. In this poster, we present a fault tolerance framework that implements multiple resiliency schemes to cope with both fail-stop errors and data corruption. Our techniques are tested with two real scientific applications: BUDE, a molecular docking engine, and TeaLeaf, a heat conduction code. Using this framework, we have successfully detected and recovered from real data corruptions. We have also performed error injection experiments, which demonstrated the efficacy of our framework.
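
Illustrative sketch (not taken from the poster): the abstract does not say which resiliency schemes the framework implements. As one generic example of software-level SDC handling for a task, the Python sketch below assumes a replication scheme: the task runs twice from a checkpointed copy of its input, the two results are compared, and the task is re-executed from the checkpoint if they disagree. A single bit flip is injected into the first execution to mimic an error injection experiment. All names here (`run_task_with_replication`, `FaultyOnce`, `double_all`) are hypothetical and are not part of the authors' framework.

```python
import copy


def run_task_with_replication(task, state, max_retries=3):
    """Run `task` twice from the same checkpoint; accept the result only if both runs agree.

    Assumption: the task is deterministic, so any disagreement signals a corruption.
    """
    checkpoint = copy.deepcopy(state)  # in-memory checkpoint of the task input
    for _ in range(max_retries):
        first = task(copy.deepcopy(checkpoint))
        second = task(copy.deepcopy(checkpoint))
        if first == second:
            return first  # results agree: treat as correct
        # Mismatch: discard both results and re-execute from the checkpoint.
    raise RuntimeError("replicated results never agreed")


def flip_bit(value, bit):
    """Flip one bit of an integer, emulating a silent data corruption."""
    return value ^ (1 << bit)


class FaultyOnce:
    """Wrap a task and corrupt the first element of its first result only."""

    def __init__(self, task, bit=3):
        self.task = task
        self.bit = bit
        self.injected = False

    def __call__(self, data):
        result = self.task(data)
        if not self.injected:
            self.injected = True
            result[0] = flip_bit(result[0], self.bit)
        return result


def double_all(data):
    """Stand-in for a real compute task."""
    return [2 * x for x in data]


faulty_task = FaultyOnce(double_all)
recovered = run_task_with_replication(faulty_task, [1, 2, 3])
```

Replication roughly doubles the compute cost of each protected task. Schemes that exploit application properties, as the abstract suggests, may detect corruption more cheaply, but the abstract gives no details on them.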

Poster: pdf
Two-page extended abstract: pdf

