Title: Efficient Indexing and Search of Archived GitHub Data
In this project, we implement and evaluate a multi-stage framework for indexing and searching archived data. First, because GitHub data is highly repetitive, with many versions of the same content, we develop and implement techniques to compress and index the archived data. Second, we develop a two-phase search framework for processing queries at runtime. The first phase indexes a small subset of the collection as representatives and searches only those representatives to identify the top-k documents. The second phase expands the scope of the search to identify other related documents across different versions. Our approach achieves a good trade-off between time cost and result accuracy. This talk discusses the data retrieval effort and the framework implementation, and presents evaluation results for the Linux kernel code collection with about 450 versions.
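To make the two-phase idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the framework presented in the talk: the toy corpus, the choice of the newest version as each file's representative, the term-frequency scoring, and all function names (`pick_representatives`, `build_index`, `phase_one`, `phase_two`, `search`) are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (path, version) -> text. Not data from the talk.
CORPUS = {
    ("sched.c", 1): "init scheduler queue",
    ("sched.c", 2): "init scheduler queue lock",
    ("sched.c", 3): "init scheduler runqueue lock",
    ("mm.c", 1): "alloc page",
    ("mm.c", 2): "alloc page lock",
}


def pick_representatives(corpus):
    """Assumption: the newest version of each path represents all its versions."""
    latest = {}
    for path, version in corpus:
        if path not in latest or version > latest[path]:
            latest[path] = version
    return latest


def build_index(corpus, representatives):
    """Inverted index over representatives only: term -> {path: term frequency}."""
    index = defaultdict(dict)
    for path, version in representatives.items():
        for term, tf in Counter(corpus[(path, version)].split()).items():
            index[term][path] = tf
    return index


def phase_one(index, query, k):
    """Phase 1: rank representatives by summed term frequency, keep the top k."""
    scores = Counter()
    for term in query.split():
        for path, tf in index.get(term, {}).items():
            scores[path] += tf
    # Sort by score (descending), then by path so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [path for path, _ in ranked[:k]]


def phase_two(corpus, top_paths, query):
    """Phase 2: expand each top path to every version containing a query term."""
    terms = set(query.split())
    hits = []
    for path in top_paths:
        for version in sorted(v for (p, v) in corpus if p == path):
            if terms & set(corpus[(path, version)].split()):
                hits.append((path, version))
    return hits


def search(corpus, query, k=1):
    representatives = pick_representatives(corpus)
    index = build_index(corpus, representatives)
    return phase_two(corpus, phase_one(index, query, k), query)


print(search(CORPUS, "lock", k=2))
```

In this sketch the index covers only two representative documents rather than all five versions, which is where the time saving would come from. The accuracy cost is also visible: a term that appears only in older versions (here, "queue") is missing from the representative index, so a query on that term alone would return nothing.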