A Framework for Accelerating High-dimensional NN-queries

Report ID: 
Christian A. Lang, Ambuj K. Singh
2001-02-01 04:00:00


The performance of nearest neighbor (NN) queries degradesnoticeably with increasing dimensionality of the data. This stems notonly from reduced selectivity of high-dimensional data but also froman increased number of seek operations during query execution.We propose a new framework to transform NN queries into at most tworange queries. This is achieved by first estimating the NN-radius,performing a range query with this radius, and (if not enough NNswere found) estimating the NN-radius again and performing a secondrange query. Since range queries know the set of pages to be read inadvance, a page read scheduler can be employed to minimize read costs.This scheme guarantees that the query cost is bounded from above bythe cost of 2 linear scans over a subset of the data pages, whiletypically resulting in costs well below a single scan.Our framework can be instantiated with different range query indexstructures, radius estimators, and page read schedulers. We presentone possible implementation of the radius estimator based on samplingand the fractal dimensionality of data. Since the overall cost dependson the quality of this radius estimate, we examine it analytically andexperimentally. Our analysis for uniform data indicates that theestimation error is below 14% and that errors in the fractaldimensionality estimation have only minor impact on the overall cost.Experiments on synthetic and real datasets show that our new techniquereduces the I/O cost during NN queries by a factor of up to 15 for anR*-tree based index structure and up to 5 for an X-tree-like bulkloadedindex structure.


File 2001-04.ps