Report ID
2002-03
Report Authors
Tamer Kahveci, Ambuj Singh
Report Date
Abstract
The exponential increase in the size of string databases makes substring search a challenging problem. Current techniques suffer from both disk I/O and computational cost because of extensive memory requirements and large candidate sets.We accelerate string search tools and reduce their memory requirements by precomputing the associations between the database strings and the query string.Our technique integrates the MRS index structure, proposed earlier for whole matching, and a memory resident substring searching tool such as BLAST. The associations computed between the query string and MBRs of the index structure are used to prune the database-query substring pairs that do not contain similar regions. The user can tune a cutoff parameter to decide on the amount of pruning and the quality of the result. The experimental results show that our technique prunes a significant amount of the database (typically 20-80$\\%$), thus reducingthe disk I/O cost and the CPU cost significantly. The performance of BLAST increases by one order of magnitude without much of an impact on the output quality.In addition, the amount of memory need is 30-40 times less. Thus, this technique has the potential of making substring searching over large databases viable on ordinary desktop PCs.
Document
2002-03.ps403.77 KB