Organizations and companies often archive high volumes of versioned digital datasets. This creates research challenges and opportunities in developing the integrated archival and search support needed for data preservation, electronic discovery, and regulatory compliance. Because versioned datasets contain highly repetitive content, deduplication can reduce storage demand by an order of magnitude or more. However, this optimization is resource-intensive. After deduplication, the structure of the inverted index for versioned data becomes complex, and searching for relevant results becomes expensive.
This project develops a fast two-phase search scheme with hybrid indexing for versioned datasets that strikes a tradeoff between performance and relevance. Phase 1 uses representatives to narrow the scope of the search and minimize the number of clusters that Phase 2 must examine. Phase 2 exploits a hybrid per-cluster index structure that combines the advantages of forward and inverted indexes, chosen according to term characteristics. Optimizations of the online search component are also explored. The project concludes with an open-source search system for versioned datasets.
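The two-phase idea can be illustrated with a minimal Python sketch. It is not the project's implementation, and several details are assumptions made for illustration only: the representative of a cluster is taken to be the set of all terms seen in its versions, a term counts as "common" when it appears in more than `common_ratio` of a cluster's versions, common terms are served from a forward index while rarer terms get inverted postings, and queries are conjunctive (every term must match). The names `ClusterIndex`, `two_phase_search`, `common_ratio`, and `max_clusters` are hypothetical.

```python
from collections import defaultdict


class ClusterIndex:
    """Hybrid index for one cluster of similar versions (illustrative only)."""

    def __init__(self, versions, common_ratio=0.5):
        # versions: dict mapping version id -> text of that version.
        self.tokens = {vid: set(text.lower().split())
                       for vid, text in versions.items()}
        # Count in how many versions each term occurs.
        doc_freq = defaultdict(int)
        for terms in self.tokens.values():
            for term in terms:
                doc_freq[term] += 1
        n = len(self.tokens)
        # Assumed rule: a term is "common" if it occurs in more than
        # common_ratio of the versions in this cluster.
        self.common = {t for t, c in doc_freq.items() if c / n > common_ratio}
        # Forward index: version id -> the common terms it contains.
        self.forward = {vid: terms & self.common
                        for vid, terms in self.tokens.items()}
        # Inverted index: rare term -> set of version ids containing it.
        self.inverted = defaultdict(set)
        for vid, terms in self.tokens.items():
            for term in terms - self.common:
                self.inverted[term].add(vid)
        # Assumed representative: every term seen anywhere in the cluster.
        self.representative = set(doc_freq)

    def search(self, query_terms):
        """Return ids of versions that contain every query term."""
        candidates = set(self.tokens)
        # Rare terms first: short posting lists shrink the candidate set fast.
        for term in query_terms - self.common:
            candidates &= self.inverted.get(term, set())
        # Common terms: check the forward index of the surviving candidates.
        for term in query_terms & self.common:
            candidates = {v for v in candidates if term in self.forward[v]}
        return candidates


def two_phase_search(clusters, query, max_clusters=2):
    """clusters: dict mapping cluster id -> ClusterIndex."""
    terms = set(query.lower().split())
    # Phase 1: keep only clusters whose representative covers the query,
    # capped at max_clusters to bound the work done in Phase 2.
    selected = sorted(cid for cid, c in clusters.items()
                      if terms <= c.representative)[:max_clusters]
    # Phase 2: search the hybrid index of each selected cluster.
    return [(cid, vid)
            for cid in selected
            for vid in sorted(clusters[cid].search(terms))]


def build_demo_clusters():
    """Tiny made-up dataset used only to exercise the sketch."""
    return {
        "a": ClusterIndex({
            "a1": "deduplication reduces storage",
            "a2": "deduplication reduces storage cost",
            "a3": "deduplication reduces archive storage cost",
        }),
        "b": ClusterIndex({
            "b1": "inverted index search",
            "b2": "inverted index search latency",
        }),
    }
```

With the demo data, `two_phase_search(build_demo_clusters(), "archive cost")` skips cluster `b` in Phase 1 because its representative lacks both terms, then resolves `archive` through the inverted postings and `cost` through the forward index of cluster `a`. The cap `max_clusters` is where such a design would trade relevance for speed: a lower cap means less Phase 2 work but may drop matching clusters.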