34. Sparse Grid Algorithms to Recover from Hard and Soft Faults
Authors: Alfredo Parra Hinojosa (Technical University Munich), Hans-Joachim Bungartz (Technical University Munich), Mario Heene (University of Stuttgart), Dirk Pflüger (University of Stuttgart)
Abstract: High-dimensional PDEs are challenging to solve numerically because the number of discretization points grows exponentially with the dimension. One algorithm for tackling such problems is the Sparse Grid Combination Technique (SGCT), an extrapolation scheme. The SGCT has inherent data redundancy that can be exploited to make it tolerant to both hard and soft faults: it can recover from process failures as well as from data corruption without checkpointing, process replication, or other typical system-level approaches. We describe two main results: first, our parallel implementation of the SGCT scales well with simulated hard faults on a large parallel system (Hazel Hen); second, the SGCT can be extended to deal with Silent Data Corruption, a type of soft fault that is becoming more common as supercomputers grow in size. This makes the SGCT a promising algorithm for future exascale systems.
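To illustrate the extrapolation idea behind the SGCT, the following minimal sketch lists the coarse anisotropic grids and their weights in the classical combination technique, where the combined solution is a weighted sum of solutions on grids with level vectors on a few neighbouring diagonals. It assumes the textbook formula with level indices starting at 1; the function name `combination_coefficients` is hypothetical, and this is not the authors' fault-tolerant implementation, which would additionally adapt the weights after a grid is lost.

```python
from itertools import product
from math import comb


def combination_coefficients(dim, n):
    """Map each level vector to its weight in the classical combination technique.

    Assumption: textbook formula with levels l_i >= 1,
        u_n^c = sum_{q=0}^{dim-1} (-1)^q * C(dim-1, q) * sum_{|l|_1 = n + dim - 1 - q} u_l
    """
    coeffs = {}
    for q in range(dim):
        level_sum = n + (dim - 1) - q          # diagonal |l|_1 for this q
        weight = (-1) ** q * comb(dim - 1, q)  # alternating binomial weight
        # Enumerate all level vectors with entries >= 1 on this diagonal.
        for levels in product(range(1, level_sum + 1), repeat=dim):
            if sum(levels) == level_sum:
                coeffs[levels] = weight
    return coeffs


if __name__ == "__main__":
    for levels, weight in sorted(combination_coefficients(2, 3).items()):
        print(levels, weight)
```

In two dimensions with n = 3, this gives three grids with weight +1 and two coarser grids with weight -1. The weights always sum to 1, and the overlap between neighbouring grids is the kind of redundancy the abstract refers to.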
Two-page extended abstract: pdf