22. Optimizing Search in Un-Sharded Large-Scale Distributed Systems
Authors: Suraj Chafle (Illinois Institute of Technology), Jonathan Wu (Washington University in St. Louis), Kyle Chard (University of Chicago), Ioan Raicu (Illinois Institute of Technology)
Abstract: Distributed file systems and storage networks are used to store large volumes of unstructured data. While these systems support large-scale storage, they create new challenges in efficiently discovering, accessing, managing, and analyzing distributed data. At the core of these challenges is the need to support scalable discovery of unstructured data. Traditional search methods rely on centralized and globally sharded indexes. We present a distributed search framework that does not rely on sharding and can be applied to a range of distributed storage models. Our approach is built on top of Lucene and uses search trees to distribute and parallelize queries. To further optimize query performance, we explore methods to prioritize indexes based on size. We evaluate our search framework against two alternatives, Grep and Solr, and compare our hierarchical query distribution with a centralized model. In this evaluation, our implementation was faster and scaled better than these alternatives.
Two-page extended abstract: pdf
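The hierarchical query distribution described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the `search` function, and the in-memory substring match standing in for a Lucene index are all hypothetical. The abstract also does not say in which direction indexes are prioritized by size; visiting smaller indexes first is an assumption made here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical search tree, holding its own local index."""
    name: str
    docs: dict[str, str] = field(default_factory=dict)  # doc id -> text (stand-in for a Lucene index)
    children: list["Node"] = field(default_factory=list)

    def index_size(self) -> int:
        # Total characters indexed locally; used only to order child queries.
        return sum(len(text) for text in self.docs.values())

    def search_local(self, term: str) -> list[str]:
        # Case-insensitive substring match over the local documents.
        term = term.lower()
        return sorted(d for d, text in self.docs.items() if term in text.lower())


def search(node: Node, term: str) -> list[str]:
    """Query this node and, in parallel, every subtree below it."""
    if not node.children:
        return node.search_local(term)
    # Assumption: smaller indexes are dispatched (and merged) first.
    ordered = sorted(node.children, key=Node.index_size)
    with ThreadPoolExecutor(max_workers=len(ordered)) as pool:
        futures = [pool.submit(search, child, term) for child in ordered]
        hits = node.search_local(term)  # runs while the children are searching
        for future in futures:
            hits.extend(future.result())
    return hits
```

Each node answers from its own index and forwards the query to its children, so no global or sharded index is needed; the root only merges the partial result lists.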