INTRODUCTION:

One of the main goals of evolutionary biology is to trace the origins of the biological features that make up living creatures.  Of particular interest is how the complex organs and features of eukaryotes have evolved.  To trace this path, it is helpful to analyze the evolution of the components of these complex features.  The animal eye, being composed of many smaller components, is an ideal subject for such an analysis.

Opsins are a family of proteins that are vital to animal vision.  Opsins bind to photoreactive chemicals called chromophores, which react when struck by a photon.  This reaction causes the opsin to change shape, which in turn triggers a signal to be sent to the brain.  Opsins are part of a larger protein family called G-protein-coupled receptors (GPCRs), which transduce extracellular signals into a cell.  Like all other GPCRs, opsins are seven-transmembrane (TM) proteins; they cross the cell membrane a total of seven times.  The seven regions that cross the membrane must share certain biophysical properties; most importantly, the string of amino acids must be highly hydrophobic.  The evolutionary relationship between these seven TM regions is the focus of our project.


Figure: The seven transmembrane segments of a GPCR (4)

Taylor and Agarwal (3) proposed that commonalities between the prokaryotic TM sequences in bacteriorhodopsin originated from a duplication event.  Specifically, they proposed that the seven TM sequences arose by duplication from an original two or three sequences.  To extend this idea, Shimizu et al. (2) analyzed TM sequences from 87 different prokaryotes for evidence of gene duplication.  Their results show that 377 TM proteins of varying TM sequence length exhibit commonalities within their TM regions.  In particular, they found that several TM proteins with seven TM sequences show strong evidence of having arisen from a duplication event.

We propose to extend these duplication ideas from prokaryotes to animal opsins.  By analyzing different opsin proteins, we hope to find sufficient evidence to determine whether or not they originated from a duplication event.  Using various bioinformatics tools, we can align and compare the seven TM sequences in different combinations (TM-1 with TM-5, TM-1-2 with TM-5-6, etc.) and score the results.  To assess statistical significance, we will compare these TM alignments with a null set to test whether the observed similarity is greater than would be expected by chance, as it should be if the TM sequences arose by gene duplication.


METHODS:

To investigate this problem, we need to develop a method that allows us to accurately gauge the similarities (or lack thereof) of different opsin proteins.  We have split the procedure into several parts.

  1. Protein Retrieval

To perform any similarity tests, we need a way to retrieve, organize, and parse many different proteins.  GenBank, a large sequence repository that includes thousands of different eukaryotic opsins, will provide the data samples that we will compare.  We will retrieve them using BioPerl’s built-in GenBank query functions, which return the set of sequences on which we wish to run similarity tests.
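As a rough illustration of the retrieval step, the sketch below builds an NCBI E-utilities search URL for rhodopsin protein records. It is written in Python with only the standard library rather than BioPerl, and the query term, the `retmax` value, and the helper names are assumptions made for illustration, not our final settings.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, retmax=20):
    """Return an esearch URL that looks up protein records matching `term`."""
    params = {"db": "protein", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def fetch_search_result(term, retmax=20):
    """Run the search and return the raw XML reply (requires network access)."""
    with urlopen(build_search_url(term, retmax)) as reply:
        return reply.read().decode("utf-8")


# Hypothetical query; the exact search term is an assumption.
url = build_search_url("rhodopsin[Protein Name] AND Metazoa[Organism]")
```

Only the URL is built here; `fetch_search_result` would need network access, and its reply would still have to be parsed for record identifiers before the sequences themselves could be downloaded.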

  2. Transmembrane Sequence Determination

The locations of the transmembrane (TM) sequences within most opsin proteins are unknown, and these sequences must be isolated from the rest of the protein in order to perform similarity tests.  Several algorithms have already been developed to predict the TM sequences in a protein.  When Shimizu et al. performed their TM duplication experiments, they used a tool called SOSUI to find TM sequences.

We plan to use a two-tiered system to extract the TM sequences from each protein.  Bovine rhodopsin is the only GPCR whose crystal structure has been solved; thus its TM sequences have been accurately identified (4).  Aligning each TM sequence of bovine rhodopsin against a target opsin (using local alignment) should locate the corresponding TM sequences in the target opsin with high accuracy.  To confirm that these matches are indeed TM sequences, we will also run SOSUI or TMpred (another TM prediction tool) on the candidate TM sequences identified using bovine rhodopsin.  Between these two methods, we feel that an accurate prediction of the TM sequences in the opsin can be obtained.
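The first tier can be illustrated with a minimal Smith-Waterman local alignment that finds where a reference TM segment best matches a target protein. This is only a sketch: it uses a simple match/mismatch score and a linear gap penalty instead of a substitution matrix, the parameter values are assumptions, and the two sequences in the example are made up rather than real rhodopsin data.

```python
def local_align(query, target, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a simple linear gap penalty.

    Returns (best_score, start, end), where target[start:end] is the
    region of `target` covered by the best local alignment.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(0, score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell to find where the alignment starts.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


# Toy sequences, invented for illustration only.
reference_tm = "LLVIAFLG"
target_protein = "MNGTEGPLLVIAFLGKKR"
best_score, start, end = local_align(reference_tm, target_protein)
candidate_tm = target_protein[start:end]
```

The returned coordinates mark the candidate TM region in the target, which the second tier would then check with a TM prediction tool.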

Because we will be comparing short sequences (a bovine TM sequence) against a larger protein (the target opsin), we can use BLAST to score the similarity.  All of the BLAST executables are available from the National Center for Biotechnology Information (NCBI) website.  Unfortunately, neither of the TM prediction tools comes with an up-to-date standalone implementation.  We are therefore using Perl to query each website with the protein we retrieved from GenBank and parsing the HTML response to obtain the likely TM sequences.  This portion of the process may become a bottleneck, as these websites are not designed to handle automated queries.  However, SOSUI does have a batch query interface, which should alleviate the speed limitations of our process.

There are more than two TM prediction algorithms available on the internet, and further investigation may warrant switching to a new TM prediction tool or adding an additional one (TMHMM is one such site that needs further investigation). 

  3. Alignment and Comparison

Having obtained the predicted TM sequences from a given opsin (and there will be many different opsins), we need to align and score those sequences with respect to each other.  To perform the alignment and scoring, we will use two well-known comparison tools, BLAST and T-Coffee.  Since we are testing the hypothesis that the TM sequences originated from duplication, we have to test all valid combinations of TM duplication: TM-1 with TM-3, TM-1 with TM-5, etc.  Since each TM sequence has a polarity, not every combination of duplication is possible, only approximately 14 combinations.  The results of these comparisons will show how closely related these TM sequences are to one another (the thresholds and statistics are discussed later).  Additionally, to help determine the relationship between TM sequences, we will use BLAST and T-Coffee to build a null set of scores from random sequences, which will serve as a baseline against which to compare the opsin TM sequence scores.

Both of these tools have standalone versions that can be integrated into BioPerl, and we intend to run them locally since we are simply comparing two sequences (as opposed to comparing against a database).  Again, as our investigation continues, we may find more suitable alignment and scoring alternatives.

  4. Integration

As mentioned throughout, we plan to use BioPerl as much as possible.  For anything that is not available in the BioPerl modules (such as TMpred), we will develop our own interfaces using Perl.  Each of the three preceding parts of the process is being developed as a standalone script, and we will integrate them into one tool using BioPerl functions, modules, and objects.  This enables us to automate the entire process of downloading a protein, determining the TM sequences, and testing similarity.  As we envision running this experiment on many different types of opsins, this automation is vital to achieving the throughput needed to obtain a representative sample.  Additionally, given this modular design, we would like to release our tool via the internet to other researchers interested in identifying gene duplications.



DATASETS:


For this project we will focus on rhodopsins, a family of proteins that plays a critical role in low-light (night) vision.  Rhodopsins are found in a wide variety of animals, which is exactly the breadth we wish to survey in this project.  If we find significant evidence of a duplication event in rhodopsins across many species, then we may choose to expand our search to other proteins in the opsin family and possibly even other GPCRs.


Because we must rely on prediction tools to determine the number and locations of TM sequences (TMSs) within a protein, we will not consider any protein that the majority of our prediction tools predict to contain fewer than seven TMSs.  This will reduce our dataset and allow us to make useful assumptions about each protein on which we perform comparisons.
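The filtering rule above might be expressed as in the following sketch. The function name and the predicted counts in the example are hypothetical; a protein is dropped only when more than half of the tools predict fewer than seven TMSs.

```python
def keep_protein(predicted_counts, required=7):
    """Keep a protein unless most tools predict fewer than `required` TMSs.

    `predicted_counts` maps a tool name to the number of TMSs it predicted.
    """
    too_few = sum(1 for count in predicted_counts.values() if count < required)
    # A majority means strictly more than half of the tools.
    return too_few * 2 <= len(predicted_counts)


# Hypothetical predictions for two proteins (values invented for illustration).
kept = keep_protein({"SOSUI": 7, "TMpred": 7, "TMHMM": 6})
dropped = keep_protein({"SOSUI": 6, "TMpred": 5, "TMHMM": 7})
```

With an even number of tools, a tie does not count as a majority under this reading, so the protein is kept.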

For our initial trials, we will limit our data to GenBank.  If we find that we need more data, we can expand to other existing biological databases.


EXPERIMENTS:


To determine whether rhodopsins arose by duplication, we must find a significant amount of similarity between transmembrane sequences.  There are many theories as to how seven-TM proteins could have arisen by duplication.  Shimizu et al. (2) propose three different evolution schemes that may have occurred.  They are as follows:


  1. {1-2-3} → {1-2-3-(new TMS)-1-2-3}

  2. {1-2-3-4} → {1-2-3-1-2-3-4}

  3. {1-2-3-4-5} → {1-2-3-4-3-4-5} or {1-2-3-4-5-4-5}


In this notation, each number represents a TM segment, and the numbers that are repeated on the right-hand side are the segments proposed to have duplicated.  Because there is a great deal of variance among these schemes, we propose a comparison of all pairs of TM sequences.  This means 7C2 (seven choose two), or 21, different comparisons for each protein, but since the comparisons are over short segments (approximately 23 amino acids long), they are very quick.  Once we have analyzed this information, we should be able to detect any existing correlation between TM segments within the rhodopsin.
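The count of 21 can be checked by enumerating the pairs directly. The segment labels in this short sketch are placeholders for the actual TM sequences.

```python
from itertools import combinations

# Placeholder labels standing in for the seven TM sequences of one protein.
tm_segments = [f"TM-{n}" for n in range(1, 8)]

# Every unordered pair of distinct segments: 7 choose 2.
pairs = list(combinations(tm_segments, 2))
pair_count = len(pairs)
```

Each tuple in `pairs` names one comparison to run, beginning with TM-1 against TM-2 and ending with TM-6 against TM-7.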

We still have the problem of determining whether or not a pair of TM segments is significantly similar.  Even if a duplication event has occurred, the TM segments have had plenty of time to diverge since then.  This means that even a relatively low similarity score may indicate a significant relationship between the two sequences.

To address this issue, we propose using a statistical technique called bootstrapping, in which we will compare 100 additional random shuffles of the second sequence to the first.  From these scores, we can compute a confidence interval to determine the significance of the original similarity score.  The result of this process will be an estimate of how likely it is that the similarity score between a pair of TM segments is meaningful, from which we can judge whether or not a duplication event was likely.
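A minimal sketch of this shuffling procedure is given below. The position-by-position identity score is a stand-in for the BLAST or T-Coffee score, the function names and the fixed seed are assumptions, and the sketch reports the fraction of shuffles that score at least as high as the original pair rather than a confidence interval.

```python
import random


def identity_score(a, b):
    """Toy similarity: count of identical residues at matching positions."""
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffle_test(first, second, score=identity_score, shuffles=100, seed=0):
    """Compare the observed score with scores of shuffled copies of `second`.

    Returns (observed, fraction of shuffles scoring at least as high).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = score(first, second)
    residues = list(second)
    at_least = 0
    for _ in range(shuffles):
        rng.shuffle(residues)
        if score(first, "".join(residues)) >= observed:
            at_least += 1
    return observed, at_least / shuffles


# Toy input: a sequence compared with itself, for illustration only.
observed, fraction = shuffle_test("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWY")
```

A small fraction suggests that the original score is unlikely to arise from composition alone. Shuffling preserves the amino-acid composition of the second sequence, which matters here because all TM segments are strongly hydrophobic.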



REFERENCES:


  1. Saier MH Jr. (2003) Tracing pathways of transport protein evolution. Mol Microbiol. 48(5): 1145-56.

  2. Shimizu T, Mitsuke H, Noto K, Arai M. (2004) Internal gene duplication in the evolution of prokaryotic transmembrane proteins. J Mol Biol. 339(1): 1-15.

  3. Taylor EW, Agarwal A. (1993) Sequence homology between bacteriorhodopsin and G-protein coupled receptors: exon shuffling or evolution by duplication? FEBS Lett. 325(3): 161-6.

  4. G protein-coupled receptor. (n.d.). Wikipedia. Retrieved October 30, 2006, from Answers.com Web site: http://www.answers.com/topic/g-protein-coupled-receptor

  5. Notredame C, Higgins DG, Heringa J. (2000) T-COFFEE: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 302: 205-217.

  6. Ewens WJ, Grant GR. (2001) Statistical Methods in Bioinformatics: An Introduction. New York: Springer-Verlag.





Authors:

Nick Larusso and Brian Ruttenberg

Last updated: 12/12/06