A Parallel Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings
SessionCombinatorial and Multigrid Algorithms
Session ChairJed Brown
Authors
Event Type
Paper
Algorithms
Intermediate
Introductory
Location355-E
DescriptionWe present an efficient parallel algorithm for the following problem: Given an input collection D of n strings of total length N, a length threshold H and a mismatch threshold k, report all k-mismatch maximal common substrings of length at least H over all pairs of strings in D. This problem is motivated by various applications in computational biology, where D is a collection of millions of short DNA sequences. Sequencing errors and massive size of these datasets necessitate efficient parallel approximate sequence matching algorithms. We present a novel distributed memory parallel algorithm that solves this approximate sequence matching problem in O((N/p + occ) log^{k} N) expected time and takes O(log^{k+1} N) expected rounds of global communications, under some realistic assumptions, where p is the number of processors and occ is the output size. We demonstrate the performance and scalability of our algorithm using large high throughput sequencing data sets.








