58. Pin-Pointing Node Failures in HPC Systems
Authors: Anwesha Das (North Carolina State University), Frank Mueller (North Carolina State University), Paul Hargrove (Lawrence Berkeley National Laboratory), Eric Roman (Lawrence Berkeley National Laboratory)
Abstract: Automated fault prediction and diagnosis in HPC systems must be efficient to improve system resilience. As systems scale toward exascale, accurate fault prediction that enables quick remedies becomes harder. As supercomputer architectures change, distilling fault data from noisy raw logs requires substantial effort, and predicting node failures from such voluminous system logs is challenging. To this end, we investigate a way to pin-point node failures in supercomputing systems. Our study of Cray system data with automated machine learning tools suggests that specific patterns of event messages about node unavailability can be indicators of node failures. This data extraction, coupled with correlation of system and job data, helps in devising a methodology to predict node failures and their locations over a specific time frame. This work aims to enable broader applicability of a generic fault prediction framework.
Two-page extended abstract: pdf