64. On the Path to the Holy Grail: Predicting Onset of System Failure with Log Files
Event Type
Poster
Location
Lower Lobby Concourse
Description
HPC environments are challenging to maintain and optimize. Indeed, in large systems, mean time to the next failure can be on the order of milliseconds. In production environments, detecting imminent failure or suboptimal performance is critical. Here, we report first steps toward predicting failures: anomaly detection within system log events. Further, we demonstrate the utility of anomaly detection as a diagnostic tool for identifying misconfigurations or performance issues, using stale file handle messages as our target. We then propose a system that monitors system logs and alerts on anomalies in real time to improve HPC operational efficiency.
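To make the idea of anomaly detection on a target log message concrete, here is a minimal sketch. It is not the authors' method, which the abstract does not specify. It assumes each log line starts with an ISO-8601 timestamp followed by a space, counts stale file handle messages per minute, and flags minutes whose count is a robust outlier using a modified z-score (median and median absolute deviation). The regular expression, the function names, and the 3.5 threshold are all illustrative assumptions.

```python
import re
from collections import Counter
from datetime import datetime
from statistics import median

# Hypothetical pattern: the exact message wording varies by kernel and filesystem.
STALE = re.compile(r"stale (nfs )?file handle", re.IGNORECASE)


def count_per_minute(lines):
    """Count target messages per minute.

    Assumes (for illustration) that each line is '<ISO-8601 timestamp> <message>'.
    Every minute seen in the log gets an entry, even if its count is zero,
    so quiet minutes contribute to the baseline.
    """
    counts = Counter()
    for line in lines:
        stamp, _, message = line.partition(" ")
        minute = datetime.fromisoformat(stamp).replace(second=0, microsecond=0)
        counts[minute] += 0  # register the minute without changing its count
        if STALE.search(message):
            counts[minute] += 1
    return counts


def anomalous_minutes(counts, threshold=3.5):
    """Return the sorted keys whose count is unusually high.

    Uses the modified z-score 0.6745 * (x - median) / MAD. If the MAD is zero
    (the baseline is constant), any count above the median is flagged.
    """
    values = list(counts.values())
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return sorted(k for k, c in counts.items() if c > med)
    return sorted(
        k for k, c in counts.items() if 0.6745 * (c - med) / mad > threshold
    )
```

A real-time monitor along the lines the abstract proposes could call `count_per_minute` on a sliding window of recent lines and raise an alert whenever `anomalous_minutes` returns the current minute. The median-based score is chosen here only because a single burst does not inflate the baseline the way it would with a mean and standard deviation.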