SC16 Salt Lake City, UT

DS7. Low Design-Risk Checkpointing Storage Solution for Exascale Supercomputers

Student: Nilmini Abeyratne (University of Michigan)
Advisor: Trevor Mudge (University of Michigan)
Abstract: This work presents a checkpointing solution for exascale supercomputers that employs commodity DRAM and SSD devices, which pose a lower design risk than solutions based on emerging non-volatile memories. The proposed local checkpointing solution uses DRAM and SSD in tandem to provide both speed and reliability. A Checkpoint Location Controller (CLC) monitors the endurance of the SSD and the performance loss of the application, and decides dynamically whether to checkpoint to the DRAM or the SSD. The CLC improves SSD endurance and reduces application slowdown, but the checkpoints in DRAM are exposed to device failures. To make the exascale memory reliable, a low-latency ECC is added that corrects all errors due to bit/pin/column/word faults and detects errors due to chip failures, and a second, Chipkill-Correct-level ECC is added to protect the checkpoints residing in DRAM.
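
The abstract does not spell out the CLC's decision rule, so the following Python sketch is only one plausible reading of it, not the author's implementation. It assumes the controller prefers the SSD (the more reliable target) whenever both an SSD write budget and a slowdown bound would still be respected, and falls back to DRAM otherwise. The names `CLCState` and `choose_location`, and the parameters `ssd_budget_bytes_per_s` and `max_slowdown`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CLCState:
    """Running totals the (hypothetical) controller tracks."""
    ssd_bytes_written: int = 0     # cumulative bytes written to the SSD
    elapsed_s: float = 0.0         # application wall-clock time so far
    ckpt_overhead_s: float = 0.0   # time spent checkpointing so far


def choose_location(state, ckpt_bytes, ssd_write_s, *,
                    ssd_budget_bytes_per_s, max_slowdown):
    """Return "SSD" or "DRAM" for the next checkpoint.

    Assumed policy: use the SSD only if doing so keeps both the average
    SSD write rate and the checkpointing overhead within their limits.
    """
    # Endurance check: total bytes written must stay within the budget
    # accumulated over the elapsed run time.
    allowed_bytes = ssd_budget_bytes_per_s * state.elapsed_s
    endurance_ok = state.ssd_bytes_written + ckpt_bytes <= allowed_bytes

    # Performance check: projected fraction of run time spent checkpointing.
    projected = (state.ckpt_overhead_s + ssd_write_s) / max(state.elapsed_s, 1e-9)
    performance_ok = projected <= max_slowdown

    return "SSD" if endurance_ok and performance_ok else "DRAM"


state = CLCState(ssd_bytes_written=0, elapsed_s=100.0, ckpt_overhead_s=1.0)
location = choose_location(state, 50, 2.0,
                           ssd_budget_bytes_per_s=1.0, max_slowdown=0.05)
```

In this sketch a checkpoint that would exceed either limit is redirected to DRAM, which is where the second-level ECC described above would matter.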
