61. A Tool for Semi-Automatic Application-Level Checkpointing
Authors: Trung Nguyen Ba (University of Texas at Austin), Ritu Arora (University of Texas at Austin)
Abstract: Computational jobs running on supercomputing resources at open-science data centers are often limited to a maximum number of compute nodes and a maximum wall-clock time. However, many jobs need more than the maximum allowed wall-clock time to complete. To overcome this limitation, applications can be checkpointed so that their execution state is saved before they time out of the job queue. Using the saved state, the applications can resume their computation from the point where they stopped in the previous run. When the checkpointing-and-restart mechanism is built into the application itself, it is called Application-Level Checkpointing (ALC). We are developing a tool for semi-automatic ALC of existing applications that does not require any manual reengineering of the applications. The memory footprint of the checkpoints written using our tool is small. Applications written in C/C++/MPI/OpenMP will be supported in the upcoming release of our tool, and in the future, the tool will support Fortran and Python applications too.
Two-page extended abstract: pdf
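
To make the checkpoint-and-restart idea in the abstract concrete, the following is a minimal hand-written sketch of application-level checkpointing. It is not the authors' tool and does not reflect its interface or output. The function names (`load_checkpoint`, `save_checkpoint`, `run`), the JSON file format, and the `max_steps_per_job` parameter (a stand-in for the wall-clock limit) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return (next_step, partial_sum); start from scratch if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, 0
    with open(path) as f:
        state = json.load(f)
    return state["next_step"], state["partial_sum"]


def save_checkpoint(path, next_step, partial_sum):
    """Write the state to a temporary file, then rename it over the old checkpoint.

    The rename step means a job killed mid-write leaves the previous
    checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_step": next_step, "partial_sum": partial_sum}, f)
    os.replace(tmp_path, path)


def run(path, total_steps, max_steps_per_job):
    """Sum the integers 0..total_steps-1 across one or more job submissions.

    max_steps_per_job is a hypothetical stand-in for the wall-clock limit.
    Returns the final sum, or None if the job stopped early and saved a
    checkpoint for the next submission.
    """
    step, partial_sum = load_checkpoint(path)
    steps_done_this_job = 0
    while step < total_steps:
        if steps_done_this_job == max_steps_per_job:
            # Out of budget: save only the state needed to resume.
            save_checkpoint(path, step, partial_sum)
            return None
        partial_sum += step
        step += 1
        steps_done_this_job += 1
    # Finished: remove any stale checkpoint so a later run starts fresh.
    if os.path.exists(path):
        os.remove(path)
    return partial_sum


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint_path = os.path.join(workdir, "state.json")
        result = None
        while result is None:  # each iteration models one job submission
            result = run(checkpoint_path, total_steps=10, max_steps_per_job=4)
        print(result)
```

Only the loop index and the running sum are saved, which is why a checkpoint written at the application level can be much smaller than a full process image. A real application would typically base the stop decision on elapsed time or a scheduler signal rather than on a step count.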