Report ID
2003-20
Report Authors
S. Alireza Aghili, Divyakant Agrawal and Amr El Abbadi
Report Date
Abstract
Joining massive tables in relational databases have received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole genome comparison into an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation and furthermore conduct the application of DFT(Discrete Fourier Transformation) and DWT(Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step to effectively reduce the search space. BFT reduces the search space and the running time of the join operation drastically. Our empirical results on a number of Prokaryote and Eukaryote DNA contig databases, demonstrate up to 99.9% filtration ratio to efficiently prune non-relevant portions of the database, incurring no false negatives, with up to 50 times faster running time compared with traditional dynamic programming, and q-gram extraction approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics as BLAST, QUASAR and FastA, for the purpose of pairwise whole genome comparison. Additionally, we discuss the integration of our proposed techniques for more efficient approximate join in the text databases, data integration, and data cleansing. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.
Document
2003-20.pdf295.06 KB