Organizations and companies often archive large volumes of versioned digital datasets. This creates research challenges and opportunities in developing the integrated archival and search support needed for data preservation, electronic discovery, and regulatory compliance. Since versioned datasets contain highly repetitive content, deduplication can reduce the storage demand by an order of magnitude or more; however, such an optimization is resource-intensive. After deduplication, the structure of the inverted index for versioned data becomes complex, and searching for relevant results becomes expensive.
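As a rough illustration of why deduplication pays off on versioned data, the following minimal sketch stores each distinct chunk once and keeps a per-version list of chunk fingerprints. The fixed-size chunking, the SHA-256 fingerprint, and the names `dedup_store` and `restore` are assumptions made for this example only; they are not taken from the system described in the talk.

```python
import hashlib
from typing import Dict, List, Tuple


def dedup_store(versions: List[bytes], chunk_size: int = 8) -> Tuple[Dict[str, bytes], List[List[str]]]:
    """Split each version into fixed-size chunks and store every distinct chunk once.

    Returns the chunk store (fingerprint -> chunk bytes) and one "recipe" per
    version, i.e. the ordered list of fingerprints needed to rebuild it.
    Fixed-size chunking is an assumption of this sketch.
    """
    store: Dict[str, bytes] = {}
    recipes: List[List[str]] = []
    for data in versions:
        recipe: List[str] = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Keep only the first copy of each chunk; later copies are duplicates.
            store.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        recipes.append(recipe)
    return store, recipes


def restore(store: Dict[str, bytes], recipe: List[str]) -> bytes:
    """Rebuild one version by concatenating its chunks in recipe order."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

Two versions that differ in only one chunk share all the others in the store, which is where the storage saving comes from. The same sharing is what complicates indexing, because one stored chunk may belong to many versions.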
In this talk, I will first present a fast two-phase search scheme with hybrid indexing for versioned datasets that strikes a tradeoff between performance and relevance. Phase 1 uses representatives to narrow the scope of the search and minimize the number of clusters that Phase 2 must examine. Phase 2 exploits a hybrid per-cluster index structure that combines the advantages of forward and inverted indexes, chosen according to term characteristics. Experimental results show that the proposed scheme can be up to 4.12x as fast as previous work on solid-state drives while retaining good relevance. In the second part, I will discuss our investigation of data traversal methods for fast score calculation with a large ranking ensemble. We propose a 2D blocking scheme that achieves better cache utilization with a simpler code structure than previous work. I will also describe a framework that uses cache analysis to help quickly select the best blocking method and parameters. Experiments with several benchmarks show significant acceleration in score calculation without loss of ranking accuracy. Finally, I will present an open-source search system for versioned datasets.
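The 2D blocking idea for ensemble scoring mentioned above can be illustrated with a small sketch. It is not the implementation from the talk: the ensemble members here are hypothetical one-split "stumps" standing in for full ranking trees, and the function names and block-size parameters are assumptions of this example. The point is only the traversal order, which works on one block of documents against one block of ensemble members at a time so that both can stay in cache.

```python
from typing import List, Tuple

# Hypothetical ensemble member: (feature index, threshold, score if <=, score if >).
Stump = Tuple[int, float, float, float]


def stump_score(stump: Stump, features: List[float]) -> float:
    """Evaluate one stump on one document's feature vector."""
    index, threshold, low, high = stump
    return low if features[index] <= threshold else high


def score_naive(docs: List[List[float]], ensemble: List[Stump]) -> List[float]:
    """Baseline traversal: each document walks the whole ensemble."""
    scores = [0.0] * len(docs)
    for d, features in enumerate(docs):
        for stump in ensemble:
            scores[d] += stump_score(stump, features)
    return scores


def score_2d_blocked(docs: List[List[float]], ensemble: List[Stump],
                     doc_block: int, tree_block: int) -> List[float]:
    """Blocked traversal: tile the (document, ensemble member) grid in 2D."""
    scores = [0.0] * len(docs)
    for d0 in range(0, len(docs), doc_block):
        d1 = min(d0 + doc_block, len(docs))
        for t0 in range(0, len(ensemble), tree_block):
            t1 = min(t0 + tree_block, len(ensemble))
            # Inside one tile, a small set of documents meets a small set of
            # ensemble members, so both are reused while still cache-resident.
            for d in range(d0, d1):
                features = docs[d]
                for t in range(t0, t1):
                    scores[d] += stump_score(ensemble[t], features)
    return scores
```

For every document, both traversals add the member scores in the same order, so the results are identical and ranking accuracy is unaffected; only the memory access pattern changes. In pure Python the cache effect itself is not observable, so the sketch shows the loop structure rather than the speedup. Suitable values for `doc_block` and `tree_block` depend on cache sizes, which is the kind of choice the cache-analysis framework in the talk is meant to guide.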