Report ID
2003-10
Report Authors
Tamer Kahveci, Ambuj K. Singh
Report Date
Abstract
The explosive growth of string databases makes similarity search achallenging problem. Current search tools are non-interactive inthe sense that the user has to wait a long time until the entiredatabase is inspected. We consider the problem of interactivesearching, and propose a set of innovative k-NN (k-NearestNeighbor) search algorithms. We propose a new model for thedistance distribution of a query to a set of strings. Using thisdistribution, our first technique orders the MBRs (MinimumBounding Rectangles) of an index structure based on their orderstatistics. We also propose an early pruning strategy to reducethe total search time for this technique. Our second techniqueexploits existing statistical models to define an order on theindex structure MBRs. We also propose a method to compute theconfidence levels for the partial results. Our experiments showthat our technique can achieve 75% accuracy within the first2.5-35% of the iterations and 90% accuracy within the first12-45% of the iterations. Furthermore, the reported confidencelevels reflect the quality of the partial results accurately.
Document
2003-10.pdf364.92 KB