Report ID
2002-04
Report Authors
Christian A. Lang, Ambuj K. Singh
Report Date
Abstract
The performance of nearest neighbor (NN) queries degrades noticeablywith increasing dimensionality of the data. This stems not only from reducedselectivity of high-dimensional data but also from an increased number of seekoperations during NN-query execution.If the NN-radii would be known in advance, the disk accesses could be reorderedsuch that seek operations are minimized. We therefore propose a new way ofestimating the NN-radius based on the fractal dimensionality and sampling. We show that the estimation error is considerably lower than for previous approaches. Since it is possible that the radius is underestimated, we proposea way of finding a close upper bound on the radius based on a given radius estimate. This technique is applicable to any page-based index structure. Weshow that this upper bound is close to the correct radius.In the second part of the paper, we present two applications of these findings.We show how the radius estimations can be used to transform k-NN queriesinto at most two range queries, and how it can be used to reduce the number of page reads during all-NN queries. In both cases, we observe significant speedups over traditional techniques for synthetic and real-world data.
Document
2002-04.ps536.17 KB