HashFile : An Efficient Index Structure For Multimedia Data

TitleHashFile : An Efficient Index Structure For Multimedia Data
Publication TypeConference Paper
Year of Publication2011
AuthorsZhang, D, Agrawal D, Chen G, Tung AKH
Conference NameICDE

Nearest neighbor (NN) search in the high dimensional
vector space is a fundamental but essential query in
many multimedia retrieval applications. Due to the curse of
dimensionality, it is difficult to build indexing support to efficiently
execute NN queries. As the number of feature dimensions
increases, the performance of existing index structures degrades
rapidly and even becomes worse than a simple sequential scan
of data to answer queries. Therefore, an alternative solution
attempts to answer the approximate nearest neighbor query by
adopting the random projection, typically the locality sensitive
hashing (LSH). The hash function can preserve the distance
in the Euclidean space so that similar objects have a high
probability of colliding in the same bucket. The efficiency is
attained by marginally sacrificing the quality of result sets.
Recently, locality sensitive B-tree(LSB-tree) has been proposed
to ensure both quality and efficiency simultaneously. However, in
order to achieve an approximation that really works in practice,
multiple trees have to be built. The duplicate entries of the same
object among the trees incur additional storage and random
access overhead.
In this paper, we propose a novel index structure, named Hash-
File for efficient retrieval of multimedia objects. The HashFile
can support both the exact and approximate NN query at the
same time. The query processing algorithm takes advantage of
random projection and sequential scan. The data set is recursively
partitioned in the hash space and organized as a tree structure.
Given a query point q, the search algorithm only explores
the buckets near the query object in a top-down manner. The
candidate buckets are stored sequentially in the disk file and can
be efficiently loaded into memory for linear scan. Our experiment
results on the image data sets show that the HashFile not only
answers the exact NN query faster than existing indexes, but
also demonstrates significant performance improvements over
the LSB tree, which is a state-of-the-art index structure for
processing approximate NN queries.

Refereed DesignationRefereed