Title: Efficient Keyword Search for Versioned Web Data
Many organizations archive multi-versioned data sets on a regular basis. For example, the Internet Archive has collected and preserved more than 376 billion versioned web pages. Scaling keyword search to a large archived collection with many versions is challenging. In this project, we develop and evaluate an efficient two-phase search framework. In the first phase, we identify a representative for each group of document versions that share the same ID, and then retrieve the top-k results by searching the version representatives instead of all versions of the data. In the second phase, we search the individual versions of the selected representatives. Compared with related work, our two-phase approach offers a good trade-off between response time and result accuracy. This talk discusses the data crawling effort and the system implementation, and presents current evaluation results for a web data collection with about 20 versions.
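The two-phase idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the project's actual implementation: the representative of a document is taken to be the concatenation of all its version texts, relevance is a toy count of query-term occurrences, and the names `score`, `build_representatives`, `two_phase_search`, and `SAMPLE_VERSIONS` are hypothetical.

```python
from collections import Counter, defaultdict
import heapq


def score(query_terms, text):
    """Toy relevance score (assumption): total occurrences of the query terms."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query_terms)


def build_representatives(versions):
    """Group versions by document ID and build one representative per group.

    The representative here is the concatenation of all version texts,
    which is an illustrative choice, not the method described in the talk.
    """
    groups = defaultdict(list)
    for doc_id, version_no, text in versions:
        groups[doc_id].append((version_no, text))
    reps = {
        doc_id: " ".join(text for _, text in doc_versions)
        for doc_id, doc_versions in groups.items()
    }
    return groups, reps


def two_phase_search(query, versions, k=2):
    """Return (score, doc_id, version_no) tuples, best score first."""
    query_terms = query.lower().split()
    groups, reps = build_representatives(versions)

    # Phase 1: rank only the representatives and keep the top-k document IDs.
    scored = [(score(query_terms, rep), doc_id) for doc_id, rep in reps.items()]
    top_docs = [doc_id for s, doc_id in heapq.nlargest(k, scored) if s > 0]

    # Phase 2: search the individual versions of the selected documents only.
    results = []
    for doc_id in top_docs:
        for version_no, text in groups[doc_id]:
            s = score(query_terms, text)
            if s > 0:
                results.append((s, doc_id, version_no))
    results.sort(key=lambda r: (-r[0], r[1], r[2]))
    return results


# Hypothetical sample data: (doc_id, version_no, text)
SAMPLE_VERSIONS = [
    ("a", 1, "web archive search"),
    ("a", 2, "web archive keyword search search"),
    ("b", 1, "cooking recipes"),
    ("c", 1, "keyword index"),
    ("c", 2, "index structures"),
]

if __name__ == "__main__":
    for hit in two_phase_search("keyword search", SAMPLE_VERSIONS, k=2):
        print(hit)
```

In this sketch the first phase scores one representative per document ID, so its cost depends on the number of documents rather than the number of versions; only the versions of the k selected documents are scored in the second phase. A version of a document whose representative falls outside the top k is never examined, which is the source of the trade-off between response time and accuracy.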