Granularity and the Cost of Error Recovery in Resilient AMR Scientific Applications
Session: Fluid Dynamics
Session Chair: Carol Woodward
Event Type: Paper
Applications
Effective Application of HPC
Intermediate
Resiliency
Scientific Computing
Location: 255-EF
Description: Supercomputing platforms are expected to have larger failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built with tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.