Bio-image Database
A
Distributed Database for BioMolecular Images, Ambuj
K. Singh, B. S. Manjunath, and Robert F. Murphy, SIGMOD Record, 33(2), pp.
65—71.
|
Index Structure for Whole Query ComparisonWe consider the problem of finding data substrings similar to a given query in large genetic databases. Since the size of such databases grows exponentially, it becomes impractical to use in-memory algorithms for these problems. We propose to map the substrings of the data into an integer space with the help of wavelet coefficients. Later, we index these coefficients using MBRs (Minimum Bounding Rectangles). We define a distance function which is a lower bound to the actual edit distance between strings. We experiment with both nearest neighbor queries and range queries. The results show that our technique prunes significant amount of the database (typically 50-95%), thus reducing both the disk I/O cost and the CPU cost significantly. An Efficient Index Structure for String Databases, Tamer Kahveci and Ambuj K. Singh, VLDB 2001, September, Roma, Italy, September 2001. |
Aligning large sequences for comparative genomicsMany important biological applications require the comparison of large genome strings. Current techniques suffer from both disk I/O and computational cost because of extensive memory requirements and large candidate sets. We propose an efficient technique for alignment of large genome strings. Our technique pre-computes the associations between the database strings and the query string. These associations are used to prune the database-query substring pairs that do not contain similar regions. We use a hash table to compare the unpruned regions of the query and database strings. The cost of the ensuing search is determined by how the hash table is constructed. We present a dynamic strategy that optimizes the random disk I/O needed for accessing the hash table. Our technique gives the user the freedom to select specific regions within the search space and assign different accuracy levels to these regions prior to search. It also provides the user a coarse grain visualization of the similarity pattern quickly before the actual search. The experimental results show that our technique reduces the disk I/O cost and the CPU cost significantly. MAP: Searching Large
Genome Databases, Tamer Kahveci and Ambuj K. Singh, PSB 2003.
Speeding up
whole-genome alignment by indexing frequency vectors, T. Kahveci, V. Ljosa,
and Ambuj K. Singh, Bioinformatics 20(13), Sept. 2004. |
Interactive Search on Genome Databases
The explosive growth of string databases makes substring
search a challenging problem.
Current search tools are non-interactive in the sense that the user
has to wait a long period of time until the entire database is
inspected. We consider the
problem of interactive string searching, and propose the first set of interactive
k-NN (k-Nearest Neighbor) search
algorithms in the literature. We
embody the MRS index structure and non-interactive substring search tools. Our first technique orders the MBRs
(Minimum Bounding Rectangles) based on their order statistics of the distance
distributions to the query substrings. Our second technique exploits BLAST's
statistical model to define an order to index structure MBRs. The database
substrings of an MBR are then iteratively searched, and intermediate results
are reported to the user along with confidence intervals. The user can decide on the
interactivity level of the search tool.
As the interactivity level increases, the technique updates the
reported results more frequently.
This technique is very general: it can be used to make any existing
string search tool interactive.
We experimentally show that our technique can find highly accurate
results to k-NN queries quickly.
In the second part of the paper, we show that our techniques can
easily be extended to search string databases incrementally in distributed
environments.
Progressive Searching of Biological Sequences, Tamer
Kahveci and Ambuj K. Singh, IEEE Data Engineering Bulletin, 27(3), September
2004. |
Protein Structure Comparison and
Classification
Comparing substructures of proteins is an expensive
task. We are designing new
algorithms for speeding up substructure comparison between proteins. We extract a set of feature vectors
from proteins based on their secondary structure, and insert these into an
index structure. Given a query
structure, we extract its feature vectors, and compare them with the feature
vectors in the index structure.
The most promising matches are then extended to find the best
substructure matches. These
algorithms have been used along with sequence comparison to classify
proteins.
PSI: Indexing
Protein Structures for Fast Similarity Search, Orhan Camoglu, Tamer
Kahveci, and Ambuj K. Singh, ISMB 2003.
Towards Index-based
Similarity Search for Protein Structure Databases, Orhan Camoglu, Tamer
Kahveci, and Ambuj K. Singh,
IEEE Computer Society Bioinformatics Conference, August 2003.
ProGreSS: Searching Protein
Databases by Sequence and Structure Simultaneously, A. Bhattacharya, T. Can,
T. Kahveci, Ambuj K. Singh, and Y. F. Wang, Pacific Symposium on
Biocomputing, 2004.
Index-based Similarity Search
for Protein Structure Databases, Orhan Camoglu, Tamer Kahveci, and Ambuj K.
Singh, Journal of Bioinformatics
and Computational Biology, 2(1),
March 2004, pp. 99-126.
Automated Protein Classification
Using Consensus Decision, T. Can, O. Camoglu, Ambuj K. Singh, and Y.F. Wang,
IEEE Computer Society Bioinformatics Conference, August 2004.
|
Similarity analysis of metabolic pathways
Comparative analysis of metabolic pathways in different
genomes can give insights into the understanding of evolutionary and
organizational relationships among species. This type of analysis allows one
to measure the evolution of complete processes (with different functional
roles) rather than the individual elements of a conventional analysis. We present a new technique for the
phylogenetic analysis of metabolic pathways based on the topology of the
underlying graphs. A distance measure between graphs is defined using the
similarity between nodes of the graphs and the structural relationship
between them. This distance measure is applied to the enzyme-enzyme
relational graphs derived from metabolic pathways. Using this approach,
pathways and group of pathways of different organisms are compared to each
other and the resulting distance matrix is used to obtain a phylogenetic
tree.
Deriving
phylogenetic trees from the similarity analysis of metabolic pathways,
Maureen Heymans and Ambuj K. Singh, ISMB 2003.
|
Computational Biology Analysis Tools
NSF BDI-0213903
NSF CISE Infrastructure Award
EIA-0080134
NSF ITR program
Center for Bio-image Informatics