Bioinformatics:

Bio-image Database

A Distributed Database for BioMolecular Images, Ambuj K. Singh, B. S. Manjunath, and Robert F. Murphy, SIGMOD Record, 33(2), pp. 65—71.

Index Structure for Whole Query Comparison

We consider the problem of finding data substrings similar to a given query in large genetic databases. Since the size of such databases grows exponentially, it becomes impractical to use in-memory algorithms for these problems.  We propose to map the substrings of the data into an integer space with the help of wavelet coefficients. Later, we index these coefficients using MBRs (Minimum Bounding Rectangles). We define a distance function which is a lower bound to the actual edit distance between strings. We experiment with both nearest neighbor queries and range queries. The results show that our technique prunes significant amount of the database (typically 50-95%), thus reducing both the disk I/O cost and the CPU cost significantly. 

An Efficient Index Structure for String Databases, Tamer Kahveci and Ambuj K. Singh, VLDB 2001, September, Roma, Italy, September 2001. 

Aligning large sequences for comparative genomics

Many important biological applications require the comparison of large genome strings.  Current techniques suffer from both disk I/O and computational cost because of extensive memory requirements and large candidate sets. We propose an efficient technique for alignment of large genome strings. Our technique pre-computes the associations between the database strings and the query string.  These associations are used to prune the database-query substring pairs that do not contain similar regions.  We use a hash table to compare the unpruned regions of the query and database strings.  The cost of the ensuing search is determined by how the hash table is constructed.  We present a dynamic strategy that optimizes the random disk I/O needed for accessing the hash table.  Our technique gives the user the freedom to select specific regions within the search space and assign different accuracy levels to these regions prior to search.  It also provides the user a coarse grain visualization of the similarity pattern quickly before the actual search.  The experimental results show that our technique reduces the disk I/O cost and the CPU cost significantly. 

 

          MAP: Searching Large Genome Databases, Tamer Kahveci and Ambuj K. Singh, PSB 2003. 

          Speeding up whole-genome alignment by indexing frequency vectors, T. Kahveci, V. Ljosa, and Ambuj K. Singh, Bioinformatics 20(13), Sept. 2004.

 

Interactive Search on Genome Databases

The explosive growth of string databases makes substring search a challenging problem.  Current search tools are non-interactive in the sense that the user has to wait a long period of time until the entire database is inspected.  We consider the problem of interactive string searching, and propose the first set of interactive k-NN (k-Nearest Neighbor)  search algorithms in the literature.  We embody the MRS index structure and non-interactive substring search tools.  Our first technique orders the MBRs (Minimum Bounding Rectangles) based on their order statistics of the distance distributions to the query substrings. Our second technique exploits BLAST's statistical model to define an order to index structure MBRs. The database substrings of an MBR are then iteratively searched, and intermediate results are reported to the user along with confidence intervals.  The user can decide on the interactivity level of the search tool.  As the interactivity level increases, the technique updates the reported results more frequently.  This technique is very general: it can be used to make any existing string search tool interactive.  We experimentally show that our technique can find highly accurate results to k-NN queries quickly.  In the second part of the paper, we show that our techniques can easily be extended to search string databases incrementally in distributed environments.

         Progressive Searching of Biological Sequences, Tamer Kahveci and Ambuj K. Singh, IEEE Data Engineering Bulletin, 27(3), September 2004.

 

Protein Structure Comparison and Classification

Comparing substructures of proteins is an expensive task.  We are designing new algorithms for speeding up substructure comparison between proteins.  We extract a set of feature vectors from proteins based on their secondary structure, and insert these into an index structure.  Given a query structure, we extract its feature vectors, and compare them with the feature vectors in the index structure.  The most promising matches are then extended to find the best substructure matches.  These algorithms have been used along with sequence comparison to classify proteins.

PSI: Indexing Protein Structures for Fast Similarity Search, Orhan Camoglu, Tamer Kahveci, and Ambuj K. Singh, ISMB 2003.

Towards Index-based Similarity Search for Protein Structure Databases, Orhan Camoglu, Tamer Kahveci, and Ambuj K. Singh,  IEEE Computer Society Bioinformatics Conference, August 2003.

ProGreSS: Searching Protein Databases by Sequence and Structure Simultaneously, A. Bhattacharya, T. Can, T. Kahveci, Ambuj K. Singh, and Y. F. Wang, Pacific Symposium on Biocomputing, 2004.

Index-based Similarity Search for Protein Structure Databases, Orhan Camoglu, Tamer Kahveci, and Ambuj K. Singh,  Journal of Bioinformatics and Computational Biology, 2(1),  March 2004, pp. 99-126.

Automated Protein Classification Using Consensus Decision, T. Can, O. Camoglu, Ambuj K. Singh, and Y.F. Wang, IEEE Computer Society Bioinformatics Conference, August 2004.

 

Similarity analysis of metabolic pathways

Comparative analysis of metabolic pathways in different genomes can give insights into the understanding of evolutionary and organizational relationships among species. This type of analysis allows one to measure the evolution of complete processes (with different functional roles) rather than the individual elements of a conventional analysis.  We present a new technique for the phylogenetic analysis of metabolic pathways based on the topology of the underlying graphs. A distance measure between graphs is defined using the similarity between nodes of the graphs and the structural relationship between them. This distance measure is applied to the enzyme-enzyme relational graphs derived from metabolic pathways. Using this approach, pathways and group of pathways of different organisms are compared to each other and the resulting distance matrix is used to obtain a phylogenetic tree.

Deriving phylogenetic trees from the similarity analysis of metabolic pathways, Maureen Heymans and Ambuj K. Singh, ISMB 2003.

Projects, Software, and Demos:

Computational Biology Analysis Tools

Students:

Orhan Camoglu

Vebjorn Ljosa

Arnab Bhattacharya

Tolga Can

Sponsors:

NSF BDI-0213903

NSF CISE Infrastructure Award EIA-0080134

NSF ITR program

Other Links:

Center for Bio-image Informatics