Data Analytics Support for HPC System Management
Moderator
Event Type
Panel
Data Analytics
Performance
State of the Practice
Location
255-BC
Description
HPC systems are a complex combination of hardware and software. The personnel running them need tools both to continuously ensure that the infrastructure is running at optimal efficiency and to proactively identify underperforming hardware and software. Furthermore, given that most HPC centers are oversubscribed, center personnel need appropriate technologies for monitoring all jobs that run on the cluster, both to determine their efficiency and resource consumption and to plan for future upgrades and acquisitions. In this panel, the current state of the art in HPC system monitoring and management at various HPC centers, including ANL, NCSA, NNSA, ORNL, and TACC, is discussed. The discussion will focus on two use cases common to all HPC centers: (1) diagnosis of system errors (e.g., parallel file system crashes, network errors), and (2) diagnosis of poor performance or unexpected problems with user jobs.















