DS7. Low Design-Risk Checkpointing Storage Solution for Exascale Supercomputers
Session: Doctoral Showcase 2
Session Chair: Boyana Norris
Presenter
Event Type: Doctoral Showcase
Location: 155-C
Description: This work presents a checkpointing solution for exascale supercomputers that employs commodity DRAM and SSD devices, which pose a lower design risk than solutions built on emerging non-volatile memories.
The proposed local checkpointing solution uses DRAM and SSD in tandem to provide both speed and reliability. A Checkpoint Location Controller (CLC) monitors the endurance of the SSD and the performance loss of the application, and decides dynamically whether to checkpoint to the DRAM or to the SSD. The CLC improves SSD endurance and reduces application slowdown, but the checkpoints held in DRAM are exposed to device failures. To make the exascale memory reliable, two levels of ECC are used: a low-latency ECC that corrects all errors due to bit, pin, column, and word faults and also detects errors due to chip failures, and a second, Chipkill-Correct level ECC that protects the checkpoints residing in DRAM.
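The CLC decision described above can be illustrated with a minimal sketch. It is not the presented design: the class fields, the pro-rated write budget, the slowdown definition (checkpoint time divided by elapsed time), and all thresholds are assumptions chosen only to show how endurance and slowdown might jointly select a target.

```python
from dataclasses import dataclass


@dataclass
class CheckpointLocationController:
    """Hypothetical CLC sketch; every field and policy here is an assumption."""
    ssd_write_budget_bytes: float  # bytes the SSD may absorb over its target lifetime
    lifetime_seconds: float        # target SSD lifetime
    max_slowdown: float            # tolerated fraction of time spent checkpointing
    elapsed_seconds: float = 0.0
    ssd_bytes_written: float = 0.0
    checkpoint_seconds: float = 0.0

    def advance(self, compute_seconds: float) -> None:
        """Account for application compute time between checkpoints."""
        self.elapsed_seconds += compute_seconds

    def choose(self, size_bytes: float, ssd_write_seconds: float) -> str:
        """Return "ssd" only if both the wear pace and the slowdown stay in bounds."""
        if self.elapsed_seconds <= 0:
            return "dram"
        # Endurance check: pro-rate the write budget by the elapsed lifetime share.
        share = min(1.0, self.elapsed_seconds / self.lifetime_seconds)
        allowed = self.ssd_write_budget_bytes * share
        wear_ok = self.ssd_bytes_written + size_bytes <= allowed
        # Slowdown check: checkpoint time, including this write, over elapsed time.
        slowdown = (self.checkpoint_seconds + ssd_write_seconds) / self.elapsed_seconds
        return "ssd" if wear_ok and slowdown <= self.max_slowdown else "dram"

    def record(self, target: str, size_bytes: float, write_seconds: float) -> None:
        """Update the counters after a checkpoint has been written to `target`."""
        self.checkpoint_seconds += write_seconds
        self.elapsed_seconds += write_seconds
        if target == "ssd":
            self.ssd_bytes_written += size_bytes
```

In this sketch, DRAM is the fallback whenever an SSD write would exceed either bound, which is why the DRAM-resident checkpoints need the additional ECC protection described above.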