SRC04. Leveraging Neural Networks to Predict Job I/O in HPC Systems
Student: Michael R. Wyatt (University of Delaware)
Supervisor: Michela Taufer (University of Delaware)
Abstract: The Lustre File System is a shared resource on many HPC systems. Its metadata server can become unresponsive and crash when the connected cluster issues too many metadata operations. To avoid this and maintain high cluster throughput, jobs must be scheduled with their I/O-related metadata operations in mind. Predicting job runtime and metadata operations is therefore an essential step toward an I/O-aware scheduler. We use a neural network to accurately predict runtime and I/O for jobs on clusters located at Lawrence Livermore National Laboratory. Our method is novel in that the neural network analyzes entire user-submitted job scripts. In our poster, we present the methods we use to achieve high predictive accuracy for runtime and I/O.
Two-page extended abstract: pdf
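
The abstract says that entire job scripts are fed to the neural network but does not describe the input encoding. As a rough illustration only, the sketch below shows one common way raw script text could be turned into a fixed-length numeric input: a character-level encoding with padding. The function name `encode_script`, the id layout, the `max_len` value, and the sample script are all assumptions made for this example, not the authors' method.

```python
from typing import List

# Hypothetical id layout (an assumption for this sketch, not from the abstract):
PAD_ID = 0         # fills unused positions at the end of short scripts
UNKNOWN_ID = 1     # any character outside printable ASCII and newline
FIRST_CHAR_ID = 2  # printable ASCII (codes 32..126) maps to ids 2..96
NEWLINE_ID = 97    # newline gets its own id so line structure is preserved


def encode_script(script: str, max_len: int = 64) -> List[int]:
    """Map a job script to a fixed-length list of integer character ids.

    Scripts longer than max_len are truncated; shorter ones are padded.
    """
    ids = []
    for ch in script[:max_len]:
        code = ord(ch)
        if 32 <= code < 127:
            ids.append(code - 32 + FIRST_CHAR_ID)
        elif ch == "\n":
            ids.append(NEWLINE_ID)
        else:
            ids.append(UNKNOWN_ID)
    # Pad to a fixed length so every script yields the same input shape.
    ids.extend([PAD_ID] * (max_len - len(ids)))
    return ids


# Hypothetical three-line batch script, used only to exercise the encoder.
example_script = "#!/bin/bash\n#SBATCH -N 4\nsrun ./app input.dat\n"
encoded = encode_script(example_script, max_len=64)
```

A vector like `encoded` could then be passed to an embedding layer or one-hot encoded before reaching the network. Working at the character level avoids committing to a tokenizer for shell syntax, at the cost of longer input sequences.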