Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Apache Hadoop, Spark, and Memcached
Event Type
Tutorial
Accelerators
Intermediate
Networks
Location
355-C
Description
Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Similarly, Memcached is becoming important for large-scale query processing in Web 2.0 environments. Recent studies have shown that default Hadoop, Spark, and Memcached cannot efficiently leverage the features of modern high-performance computing clusters, such as Remote Direct Memory Access (RDMA)-enabled high-performance interconnects and high-throughput, large-capacity parallel storage systems (e.g., Lustre). These middleware systems are traditionally written with sockets and do not deliver the best performance on modern networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, RPC, HBase, etc.), Spark, and Memcached. We will examine the challenges in redesigning the networking and I/O components of these middleware systems for modern RDMA-capable interconnects and protocols (such as InfiniBand, iWARP, RoCE, and RSocket) and for modern storage architectures. Using the publicly available software packages in the High-Performance Big Data project (HiBD, http://hibd.cse.ohio-state.edu), we will provide case studies of the new designs for several Hadoop/Spark/Memcached components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, storage (HDD, NVM, and SSD), and multi-core platforms to achieve the best solutions for these components and Big Data applications on modern HPC clusters.