SC16 Salt Lake City, UT

Monitoring Large Scale HPC Systems: Understanding, Diagnosis, and Attribution of Performance Variation and Issues


Authors: Ann Gentile (Sandia National Laboratories)

BP Abstract: This BoF addresses critical issues in large-scale monitoring from the perspectives of HPC center system administrators, users, and vendors worldwide. This year's session will consist entirely of facilitated, interactive audience discussion of tools, techniques, experiences, and gaps in understanding, diagnosing, and attributing the causes of performance variation and poor performance. Such causes include contention for shared network and I/O resources and problems with system components. Our goal is to enhance community monitoring and analysis capabilities by identifying useful tools and techniques and by encouraging the development of quick-start guides for these tools, to be posted at the community web site: https://sites.google.com/site/monitoringlargescalehpcsystems/

Long Description: This BoF addresses critical issues in large-scale monitoring from the perspectives of HPC center system administrators, users, and vendors worldwide. This year's session will consist entirely of facilitated, interactive audience discussion of tools, techniques, experiences, and gaps in understanding, diagnosing, and attributing the causes of performance variation and poor performance. Our goal is to enhance community monitoring and analysis capabilities by identifying useful tools and techniques and by encouraging the development of quick-start guides for these tools, to be posted at the community web site: https://sites.google.com/site/monitoringlargescalehpcsystems/

BoF organizers will familiarize attendees with our “Monitoring Large Scale HPC Systems” wiki resources (https://sites.google.com/site/monitoringlargescalehpcsystems/), which serve both for posting and for learning about available HPC monitoring and analysis tools. During the BoF, we will use group discussion to extend the wiki’s “HPC Monitoring Tools Quick Start” section with tools, areas of interest, and areas of expertise. These entries will be fleshed out after the meeting to provide a valuable resource for operations staff who want to try new tools and techniques on the systems they support.

In recent HPC Monitoring BoFs, panelists and audience members alike have agreed that diagnosing and understanding performance variation is the highest-priority need for HPC operations staff. It is also recognized that monitoring and analysis tools for HPC platforms are becoming increasingly important to efficient operation and will be required as platforms continue to scale. The number of seemingly applicable tools is daunting and is growing rapidly due to the success of “Cloud” buildout and use. What is not at all obvious is the correct choice of tools for a particular site, platform, and workflow, or how “cloud”-related applications of these tools map onto HPC needs. This uncertainty applies across the gamut, from low-level data gathering tools to data transport (e.g., in band, out of band, socket, RDMA), storage technologies (e.g., flat files, databases, cloud constructs), and analysis techniques (e.g., machine learning, statistics, filters, visualization).

At a recent HPC Monitoring BoF (CUG 2016), it was suggested that one of the main impediments to experimenting with most of the available tools is the time someone must invest to figure out how to build, configure, use, and evaluate any particular tool. This BoF will address that concern by focusing on the tools and approaches that audience members and an international group of facilitating panelists have used, have found useful or useless, and are interested in investigating. Tangible outcomes of this BoF will be a list of tools and approaches, commitments to flesh out the guides, and the guides themselves, posted at the community web site.

Previous Monitoring Large-Scale HPC Systems BoFs (attendance of 75-149): SC15: Data Analytics and Insights; SC14: Issues and Approaches. Organizers have also run BoFs at ISC and CUG.

Conference Presentation: pdf

