97. Developing A Scalable Platform For Next-Generation Sequencing Data Analytics Over Heterogeneous Clouds and HPCs: A Case for Transcriptomes and Metagenomes
Authors: Shayan Shams (Louisiana State University), Nayong Kim (Louisiana State University), Ming-Tai Ha (Rutgers University), Shantenu Jha (Rutgers University), Jian Tao (Louisiana State University), Ramesh Subramanian (Louisiana State University), Vladmir Chouljenko (Louisiana State University), K. Gus Kousoulas (Louisiana State University), Ram J. Ramanujam (Louisiana State University), Seung-Jong Park (Louisiana State University), Joohyun Kim (Louisiana State University)
Abstract: A novel scalable pipeline for metagenome and transcriptome analysis is presented. Its underlying distributed computing platform can effectively address a significant roadblock in Next-Generation Sequencing data analytics: ever-growing and noisy data sets.
The platform's core feature is access to and use of heterogeneous distributed computing resources, including HPC systems and clouds (EC2, OpenStack-based clouds, and IBM Bluemix). On top of this feature, a distributed application runtime environment is built to manage massive workloads and data processing tasks efficiently by leveraging high-end HPC technologies, emerging Hadoop-based software models, and Docker. The resulting range of flexible and scalable runtime scenarios allows the pipeline to handle data sets of any size. To maximize the benefits of the scalable platform, a novel method for de novo genome sequence reconstruction, Multiple Assembly Multiple Parameter (MAMP), was developed and is available with the pipeline. Preliminary results indicate that MAMP has great potential.
Two-page extended abstract: pdf