Presentation


Panel: Data Analytics Support for HPC System Management


Moderator

Panelists

Event Type

Panel

Event Tags

Data Analytics

Performance

State of the Practice

Time

Friday, November 18th, 10:30am - 12pm

Location

255-BC

Description

HPC systems are a complex combination of hardware and software. The personnel running them need tools both to continuously ensure that the infrastructure is running at optimal efficiency and to proactively identify underperforming hardware and software. Furthermore, given that most HPC centers are oversubscribed, center personnel need access to appropriate technologies for monitoring all jobs that run on the cluster, both to determine their efficiency and resource consumption and to plan appropriately for future upgrades and acquisitions. In this panel, the current state of the art in HPC system monitoring and management at various HPC centers, including ANL, NCSA, NNSA, ORNL, and TACC, is discussed. The discussion will focus on two use cases common to all HPC centers: (1) diagnosis of system errors (e.g., parallel file system crashes, network errors), and (2) diagnosis of poor performance or unexpected problems with user jobs.

Moderator