Report ID
2003-19
Report Authors
S. Alireza Aghili, Ozgur D. Sahin, Divyakant Agrawal, and Amr El Abbadi
Report Date
Abstract
Similarity search in textual databases and bioinformatics has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of whole-genome sequence homology search into an approximate vector comparison in the well-established multidimensional vector space. We propose the application of Singular Value Decomposition(SVD) dimensionality reduction technique as a pre-processing filtration step to effectively reduce the search space and the running time of the search operation. Our empirical results on a Prokaryote and a Eukaryote DNA contig dataset, demonstrate effective filtration to prune non-relevant portions of the database with up to 2.4 times faster running time compared with q-gram approach. SVD filtration may easily be integrated as a pre-processing step for any of the well-known sequence search heuristics as BLAST, QUASAR and FastA. We analyze the precision of applying SVD filtration as a transformation-based dimensionality reduction technique, and finally discuss the imposed trade-offs.
Document
2003-19.pdf218.48 KB