Low-Cost Deduplication and Search for Versioned Datasets



Project Overview


Organizations and companies often archive high volumes of versioned digital datasets. There are research challenges and opportunities in developing the integrated archival and search support needed for data preservation, electronic discovery, and regulatory compliance. Since versioned datasets contain highly repetitive content, deduplication can reduce storage demand by an order of magnitude or more; however, such an optimization is resource-intensive. After deduplication, the structure of the inverted index for versioned data becomes complex, and searching for relevant results is expensive. This project will study low-cost solutions for compact archiving and indexing and develop efficient algorithms and system techniques for searching versioned datasets. It will also consider that the archived data may be stored in an untrusted server environment and investigate tradeoffs between efficiency and privacy preservation in search.
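This overview does not specify a particular deduplication scheme. As a purely illustrative sketch, the short Python example below shows the general idea of chunk-level deduplication that such archives rely on: each version is split into chunks, and a chunk is stored only once, keyed by its fingerprint. The fixed chunk size, class names, and use of SHA-256 are hypothetical simplifications; production systems typically use content-defined (variable-size) chunking.

    # Minimal, hypothetical sketch of chunk-level deduplication for versioned data.
    import hashlib

    CHUNK_SIZE = 4096  # fixed-size chunks for simplicity; real systems often
                       # use content-defined chunking to tolerate insertions

    def chunk(data: bytes, size: int = CHUNK_SIZE):
        """Split one file version into fixed-size chunks."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    class DedupStore:
        """Stores each unique chunk once, keyed by its SHA-256 fingerprint."""
        def __init__(self):
            self.chunks = {}    # fingerprint -> chunk bytes (unique content)
            self.versions = {}  # version id -> ordered list of fingerprints

        def add_version(self, version_id: str, data: bytes):
            recipe = []
            for c in chunk(data):
                fp = hashlib.sha256(c).hexdigest()
                self.chunks.setdefault(fp, c)   # store chunk only if unseen
                recipe.append(fp)
            self.versions[version_id] = recipe  # version kept as a chunk recipe

        def restore(self, version_id: str) -> bytes:
            return b"".join(self.chunks[fp] for fp in self.versions[version_id])

    # Two versions that differ only in a small region share most chunks,
    # so the bytes actually stored grow far more slowly than the logical size.
    store = DedupStore()
    v1 = b"A" * 20000
    v2 = b"A" * 12000 + b"B" * 100 + b"A" * 7900
    store.add_version("v1", v1)
    store.add_version("v2", v2)
    assert store.restore("v1") == v1 and store.restore("v2") == v2

In this toy example the two 20 KB versions share all but one chunk, so only about 12 KB of unique chunk data is kept, which is the kind of saving the paragraph above refers to.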

This project will focus on key challenges and cost-sensitive technical aspects of integrated archival and search support for managing large versioned datasets. The main tasks include an efficient software architecture and optimizations for detecting duplicate content on a cloud cluster, fast multi-phase search with a hybrid index structure that exploits content similarity and query characteristics, and an efficient privacy-preserving search framework with top-result ranking.
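The hybrid index structure itself is not detailed in this overview. As one illustrative interpretation, the hypothetical sketch below builds an inverted index over deduplicated chunks rather than over every version, so postings for content shared across versions are created only once, and query hits on a chunk are mapped back to the versions that reference it. All class and variable names are assumptions for illustration, not the project's actual design.

    # Hypothetical sketch: inverted index over deduplicated chunks.
    from collections import defaultdict

    class ChunkInvertedIndex:
        """Toy index: postings point to shared chunks, not to each version."""
        def __init__(self):
            self.postings = defaultdict(set)           # term -> chunk fingerprints
            self.chunk_to_versions = defaultdict(set)  # fingerprint -> version ids
            self._indexed = set()                      # chunks already tokenized

        def index_version(self, version_id, recipe, chunk_text):
            # recipe: ordered chunk fingerprints for this version
            # chunk_text: fingerprint -> decoded text of that chunk
            for fp in recipe:
                self.chunk_to_versions[fp].add(version_id)
                if fp not in self._indexed:            # shared chunks indexed once
                    for term in chunk_text[fp].lower().split():
                        self.postings[term].add(fp)
                    self._indexed.add(fp)

        def search(self, query):
            """Return version ids containing every query term (in any chunk)."""
            versions_per_term = []
            for term in query.lower().split():
                hits = set()
                for fp in self.postings.get(term, ()):
                    hits |= self.chunk_to_versions[fp]
                versions_per_term.append(hits)
            return set.intersection(*versions_per_term) if versions_per_term else set()

    # Two versions sharing chunk "c1": its postings are built only once.
    idx = ChunkInvertedIndex()
    idx.index_version("v1", ["c1", "c2"],
                      {"c1": "budget report 2019", "c2": "draft notes"})
    idx.index_version("v2", ["c1", "c3"],
                      {"c1": "budget report 2019", "c3": "final notes"})
    print(idx.search("budget notes"))   # {'v1', 'v2'}

This sketch only illustrates why indexing deduplicated content is attractive; it does not capture the project's multi-phase query processing, ranking, or privacy-preserving aspects.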

People