Report ID
2001-10
Report Authors
Tamer Kahveci, Ambuj K. Singh
Report Date
Abstract
We consider the problem of substring searching in very large string databases. Typical applications of this problem are genetic data, web data, and event sequences.Since the size of such databases grows exponentially, it becomes impractical to use in-memory algorithms for these problems.In this paper, we propose to map the substrings of the data into an integer space with the help of wavelet coefficients.Later, we index these coefficients using Minimum Bounding Rectangles. We define a distance function which is a lower bound to the actual edit distance between strings.We experiment with both nearest neighbor queries and range queries.The results show that our technique prunes significant amount of the database (typically 50-95\\%), thus reducingboth the disk I/O cost and the CPU cost significantly.
Document
2001-10.ps574.53 KB