Background and Motivation:

The clear understanding of biological processes is the key to solving the mystery of life. To solve this mystery we need to first understand the complex biological functions in higher organisms' like humans. Recent studies have shown that interaction between protein sets in higher organisms' like humans drives the our biological functions. Thus, if we can identify pairs of proteins which are likely to interact and separate them from those which don't, then we have a mechanism to construct a protein interaction network.

Some work have already been done by Patty Lu, a student in University of California, Santa Barbara. She has devised three scoring algorithms, namely “random walk”, “hamming distance”, and tf-idf” to construct the protein networks. A higher score means that pair of protein are likely to be more interactive than a pair with a lower score. Scientists can then use these scores to obtain pairs of interacting proteins to form the protein interaction network. However, we believe that the accuracy of these algorithms can be improved, and thus we undertake this project to improve the scoring mechanism by creating a better algorithm.

For this project, we will use the same dataset used by Patty Lu, which is the GO database available at http://www.geneontology.org/. “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene. The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions .” This vocabulary enables us to form uniform queries across the database. Each protein is associated with a set of terms which are a part of several hierarchies. All the algorithms use “is_a” and “part_of” relationship tags to construct GO annotation hierarchy. “The hierarchy is a single rooted, gene_ontology, structure such that each node can have multiple parents and children.” Function of a protein is defined by its strucure. SO, structure comparison is used to find simlarity between proteins. Detection of homologs in the sequence is also used to find the similarity between proteins. Other most important task is finding proteins those form a network and interact directly or participate in the same pathway. The study of protein-protein interactions is essential to define the molecular networks that contribute to maintain homeostasis of an organism's body functions. Currently all the proteins are also annotated with GO terms based on their molecular function, participating biological process and cellular location. Since these terms have been arrranged in a heirarchical structure and try to maintain the property of moving from less specialized to more specialized case(not always true), it can be exploited to find proximity between them and thus predict the chance of them interacting directly or participating in a pathway. proteins located in same cellular components or having close molecular function or close biological process have higher chance of participating in a common network though not totally true. This project is aimed at exploiting this very idea.

References:

[1] Edward M. Marcotte, Matteo Pallegrini, Ho-Leung Ng, Danny W. Rice, Todd O. Yeates and David Eisenberg.Detecting protein functions and prtotein-protein interactions from genome sequences. www.sciencemag.org, vol 285, 30 July, 1999.
[2] Joel R. Bock, and David A. Gough. Prdicting protein protein interactions from primary structure. Bioinformatics, Oxford Journals, August 22, 2000.
[3] Jones,S. and Thornton,J.M. Principles of protein-protein interactions. Proc. Natl Acad. Sci. USA, 93, 1320, 1996.
[4]. http://www.geneontology.org/ (Read)
[5]. MS Thesis Report, Patty Lu, Department of Computer Science, UCSB, June 2006. (Read)
[6] M.A. Huynene J.O. Korbel, B. Snel and P. Brok. Shot: a web server for the construction of genome phylogenies. Trends Genet, pages 158–162, 2002.
[7] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management: an International Journal, 24(5):513–523, 1988.
[8] W. Zhong and P.W. Sternberg. An efficient indirect branch predictor. Genome-wide prediction of C. elegans genetic interactions, pages 1481–4, March 2006.

Methods:

1. We will be taking all the three ontologies into consideration to get an overall picture. We will be defining a scoring method for each ontology. Final proximity score will be weighted sum of them.
2. For a given ontology, we will find the nearest parent of the given annotations from two proteins. We will take a pair of annotation at a time since the same protein can be annotated with more than one term in the same ontology.
3. Once common parent is found for a given pair, we will calculate depth of each annotation from parent node. We will use a weighted cost for traversing an edge. Sum of the depth will define distnace between terms of two proteins.
4. We will be required to discuss a strategy to deal with the multiple terms for the same protein for the given ontology.
5. A node having both is_a and part_of linkage, part_of will be given priority.
6. Same term can exist in more than one branch. This case should also be taken into account.
7. Same protein can be annotated with child as well as parent term.
8. Distance calculated in step 4 will be used to define proximity between given proteins in a given ontology.
9. Weighted sum from all the three ontologies will give final score.

Datasets:

Annotation definitions. This flat file defines GO annotations with a set of tags. Annotation flat file was downloaded from geneontology.org on January 13, 2006. There are 20403 terms for all three categories, biological process, molecular functions, and cellular component. We only process biological process for C. elegans annotations and there are 10693 of them. Out of these biological process annotations, only 985 terms are actually being used for annotation. Gene hierarchy definitions. This gene hierarchy file defines each gene with a number of GO annotations. Each gene only has one protein, so we interchange the terms ”protein” and ”gene”. There are 9820 genes defined in this file, so we have 48211290 pairs of proteins to analyze for all three algorithms. Random dataset. This dataset contains 100,000 pairs of proteins randomly selected from the 48211290 pairs of proteins we have available. Random dataset 2. This dataset contains 17475 pairs of proteins randomly selected from the 48211290 pairs of proteins we have available. This file is only used in quality performance graphs. Positive dataset. This dataset contains pairs of proteins that we believe are of a higher likelihood to be interacting than not interacting. This file was download from wormbase.org. There are 11590 pairs of proteins. Negative dataset. This dataset contains pairs of proteins that we believe are of a higher likelihood of being non interacting than interacting. This file is obtained from localization of files downloaded from geneontology.org and wormbase.org. There are 6711106 pairs of proteins. Positive training dataset. This file contains protein pairs that are of yeast-two-hybrid and genetic interaction from wormbase. There are 4775 pairs of proteins and our gene hierarchy file only supports 473 pairs of them. This was provided by Zhong and Sternberg [3]. Negative training dataset. This file contains proteins pairs that are of cis markers from wormbase. There are 3296 pairs of proteins and our gene hierarchy file only supports 598 pairs of them. This was provided by Zhong and Sternberg [3]. Kegg positive dataset. This file contains Kegg positive pathway protein pairs from Kegg pathway database. There are 23653 pairs of proteins and our gene hierarchy file only supports 17473 of them.

Experiments:

We plan to conduct the same experiments that Patty Lu did in her thesis report. She has conducted 4 experiments: Performance graphs, ROC curves, Correlation curves, Quality performance graphs. The reason for doing the same experiments is it will allow us to compare our results to that of Patty Lu. This way we will have a better idea of how good our algorithm is compared to her.