Fault-Tolerance for HPC: Theory and Practice
Presenters
Event Type
Tutorial
Introductory
Performance
Resiliency
Location
250-F
Description
Resilience is a critical issue for large-scale platforms, and this tutorial provides a comprehensive survey of fault-tolerant techniques for HPC, with a fair balance between practice and theory.
This tutorial is organized along four main topics:
(i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction, and silent error detection;
(iii) Application-specific techniques, such as algorithm-based fault tolerance (ABFT) for grid-based algorithms, fixed-point convergence for iterative applications, and user-level in-memory checkpointing; and
(iv) Practical deployment of fault-tolerant techniques with User Level Failure Mitigation (ULFM, a proposed MPI standard extension).
Relevant examples based on ubiquitous computational solver routines will be protected with a mix of checkpoint-restart and advanced recovery techniques in a hands-on session.
The tutorial is open to all SC16 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.