Organizations and companies archive many versions of their digital documents and multimedia data for preservation, electronic discovery, and regulatory compliance. Developing the integrated archival and search support that such collections need poses research challenges and opportunities. Because versioned datasets contain highly repetitive content, deduplication can reduce storage demand by an order of magnitude or more; however, this optimization is resource-intensive. The two-phase search method seeks a cost tradeoff: Phase 1 uses clustering to search only cluster representatives, which quickly narrows the search scope, and Phase 2 then re-ranks the top document versions using a fragment-based index for each cluster. This project will study a low-cost method for deduplication and indexing, and will deliver a ready-to-use software package for versioned data search.
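As a rough illustration of the two-phase idea, the Python sketch below searches cluster representatives first and then re-ranks versions inside the selected clusters using a per-cluster fragment index. It is a minimal sketch under stated assumptions, not the project's implementation: the names `build_cluster` and `two_phase_search`, the choice of a representative as the union of a cluster's terms, the term-count scoring, and the sample data are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_cluster(versions):
    """Index one cluster. `versions` maps version id -> list of fragment strings.

    Each distinct fragment is indexed once, however many versions share it,
    which is where the deduplication saving comes from in this sketch.
    """
    fragments = {}  # fragment text -> term counts, stored once per cluster
    for frags in versions.values():
        for frag in frags:
            if frag not in fragments:
                fragments[frag] = Counter(tokenize(frag))
    # Assumed representative: the set of all terms seen in the cluster.
    representative = set()
    for counts in fragments.values():
        representative.update(counts)
    return {"fragments": fragments, "versions": dict(versions), "rep": representative}


def two_phase_search(clusters, query, top_clusters=1):
    terms = tokenize(query)
    # Phase 1: rank clusters by how many query terms their representative holds.
    ranked = sorted(
        clusters,
        key=lambda cid: sum(t in clusters[cid]["rep"] for t in terms),
        reverse=True,
    )
    results = []
    # Phase 2: score each fragment once, then sum fragment scores per version.
    for cid in ranked[:top_clusters]:
        cluster = clusters[cid]
        frag_score = {
            frag: sum(counts[t] for t in terms)
            for frag, counts in cluster["fragments"].items()
        }
        for vid, frags in cluster["versions"].items():
            results.append((sum(frag_score[f] for f in frags), vid))
    results.sort(key=lambda r: (-r[0], r[1]))
    return [vid for score, vid in results if score > 0]


# Hypothetical sample data: two clusters, each holding two versions.
clusters = {
    "report": build_cluster({
        "report_v1": ["budget summary", "travel costs"],
        "report_v2": ["budget summary", "travel costs", "revised budget totals"],
    }),
    "manual": build_cluster({
        "manual_v1": ["install guide"],
        "manual_v2": ["install guide", "upgrade notes"],
    }),
}

print(two_phase_search(clusters, "budget"))  # → ['report_v2', 'report_v1']
```

In this sketch, Phase 1 touches one small representative per cluster, and the per-fragment scoring in Phase 2 runs only for the clusters that survive it, which is the cost tradeoff the method aims for.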