Evaluating HPC Networks via Simulation of Parallel Workloads
Session: Systems and Networks I
Session Chair: Scott Pakin
Event Type: Paper
Architectures
Energy
Intermediate
Networks
Performance
Location: 355-BC
Description: This paper evaluates and compares three topologies that are popular for building interconnection networks in large-scale supercomputers: torus, fat-tree, and dragonfly. We propose a comprehensive methodology and present a scalable packet-level network simulator, TraceR, which is used to perform this evaluation. The methodology includes designing the prototype systems to be evaluated, using proxy applications to determine computation and communication load, simulating individual proxy applications and multi-job workloads, and computing aggregated performance metrics. Using this methodology, we compare torus, fat-tree, and dragonfly based prototype systems with up to 730K endpoints (MPI ranks) executing on 46K nodes in the context of multi-job workloads from capability and capacity systems. For the 180 Petaflop/s prototype systems simulated in the paper, we show that different topologies are superior in different scenarios.