SC16 Salt Lake City, UT

64. On the Path to the Holy Grail: Predicting Onset of System Failure with Log Files

Authors: Robert E. Settlage (Virginia Polytechnic Institute and State University); Michael B. Marshall (Virginia Polytechnic Institute and State University); Karthik R. Senthilvel (Virginia Polytechnic Institute and State University); Vijay K. Agarwala (Virginia Polytechnic Institute and State University); Joshua D. Akers (Virginia Polytechnic Institute and State University); Rajiv D. Bendale (Engility Corporation); Kimberly Robertson (Engility Corporation)

Abstract: HPC environments are challenging to maintain and optimize. Indeed, in large systems, mean time to the next failure can be on the order of milliseconds. In production environments, detecting imminent failure or suboptimal performance is critical. Here, we report first steps in predicting failures: anomaly detection within system log events. Further, we demonstrate the utility of anomaly detection as a diagnostic tool for identifying misconfigurations or performance issues, using stale file handle messages as our target. We then propose a system that monitors system logs and alerts on anomalies in real time to improve HPC operational efficiency.
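
To make the idea of anomaly detection on a target log message concrete, here is a minimal sketch in Python. It is not the authors' method, which the abstract does not describe. It assumes a syslog-style timestamp in the first 15 characters of each line, counts "stale file handle" messages per minute, and flags minutes whose count lies more than a chosen number of standard deviations above the mean. The function names, the log format, the year, and the threshold are all hypothetical choices made for illustration.

```python
import re
from collections import Counter
from datetime import datetime
from statistics import mean, pstdev

# Target message from the abstract; the exact wording in real logs is an assumption.
STALE = re.compile(r"stale file handle", re.IGNORECASE)


def count_per_minute(lines, year=2016):
    """Count target messages per one-minute bucket.

    Assumes each line starts with a syslog-style timestamp such as
    "Nov 14 10:03:27", which carries no year, so one is supplied.
    """
    counts = Counter()
    for line in lines:
        if not STALE.search(line):
            continue
        ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        counts[ts.replace(second=0)] += 1
    return counts


def flag_anomalies(counts, threshold=3.0):
    """Return buckets whose z-score exceeds `threshold` (a simple baseline).

    Minutes with no target messages are absent from `counts`, so the
    baseline here is computed over active minutes only.
    """
    values = list(counts.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return sorted(b for b, c in counts.items() if (c - mu) / sigma > threshold)
```

A real-time variant of this sketch would keep a rolling window of recent counts and raise an alert whenever the newest bucket is flagged. A z-score over fixed windows is only one of many possible detectors and is used here because it is easy to reason about.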

