1. Theoretical exercise
To illustrate the backtracking algorithm, a simplified DNA example is used because the scoring system is very simple: a match scores 1 and a mismatch scores –1. Gaps (insertions and deletions) are also penalized with a score of –1.
Suppose two very short DNA sequences need to be aligned: seq i = ATT and seq j = TTC.
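Solution
Seq i (ATT) runs along the columns and seq j (TTC) along the rows. Each inner cell lists the three candidate scores and the score that is retained, in the order: diagonal / from above / from the left -> maximum.
      | GAP | A                   | T                   | T                   |
GAP   |  0  | -1                  | -2                  | -3                  |
T     | -1  | -1 / -2 / -2 -> -1  |  0 / -3 / -2 ->  0  | -1 / -4 / -1 -> -1  |
T     | -2  | -2 / -2 / -3 -> -2  |  0 / -1 / -3 ->  0  |  1 / -2 / -1 ->  1  |
C     | -3  | -3 / -3 / -4 -> -3  | -3 / -1 / -4 -> -1  | -1 /  0 / -2 ->  0  |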
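The matrix fill and the backtracking step of this exercise can be sketched in Python as follows. This is a minimal illustration under the scoring system stated above (match 1, mismatch –1, gap –1); the function name needleman_wunsch and its parameter names are chosen here for illustration and do not come from the course material. When several paths give the same score, the sketch prefers the diagonal, then the cell above, then the cell to the left, so it returns only one of the possible optimal alignments.

```python
def needleman_wunsch(seq_i, seq_j, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap cost.

    seq_i runs along the columns and seq_j along the rows.
    Returns (score_matrix, aligned_i, aligned_j).
    """
    rows, cols = len(seq_j) + 1, len(seq_i) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and first column: a prefix aligned against gaps only.
    for c in range(1, cols):
        score[0][c] = c * gap
    for r in range(1, rows):
        score[r][0] = r * gap

    # Fill: every cell keeps the best of its three candidate scores.
    for r in range(1, rows):
        for c in range(1, cols):
            same = seq_j[r - 1] == seq_i[c - 1]
            diagonal = score[r - 1][c - 1] + (match if same else mismatch)
            above = score[r - 1][c] + gap
            left = score[r][c - 1] + gap
            score[r][c] = max(diagonal, above, left)

    # Backtracking: walk from the bottom-right cell to the top-left cell,
    # each time moving to a neighbour that explains the current score.
    aligned_i, aligned_j = [], []
    r, c = rows - 1, cols - 1
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            same = seq_j[r - 1] == seq_i[c - 1]
            step = match if same else mismatch
        if r > 0 and c > 0 and score[r][c] == score[r - 1][c - 1] + step:
            aligned_i.append(seq_i[c - 1])
            aligned_j.append(seq_j[r - 1])
            r, c = r - 1, c - 1
        elif r > 0 and score[r][c] == score[r - 1][c] + gap:
            aligned_i.append("-")
            aligned_j.append(seq_j[r - 1])
            r -= 1
        else:
            aligned_i.append(seq_i[c - 1])
            aligned_j.append("-")
            c -= 1
    return score, "".join(reversed(aligned_i)), "".join(reversed(aligned_j))


matrix, top, bottom = needleman_wunsch("ATT", "TTC")
print(top)     # seq i with gaps
print(bottom)  # seq j with gaps
print(matrix[-1][-1])  # score of the global alignment
```

For ATT and TTC the bottom-right cell holds the global alignment score, and the two printed strings give one alignment that reaches it.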
2. Practical exercises
The following programs are available:
DOTLET
http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Alignment based on a dot-plot representation (use Netscape)
GLOBAL ALIGNMENT
http://genome.cs.mtu.edu/align/align.html
LOCAL ALIGNMENT
http://fasta.bioch.virginia.edu/fasta/lalign.htm
These two programs are based on dynamic programming; one performs a global alignment and the other a local alignment.
Use the following sequences in FastA format:
DATASET 0 Two bacterial terminal oxidases from different families
>gi|1071819|pir||B54759 ba-type ubiquinol oxidase (EC 1.10.3.-) chain I - Paracoccus denitrificans
MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVRGFADAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVGGAVITMASLFLGEFAQTGWLAFPPLSGIGYSPWVGVDYYIWGLQVAGVGTTLSGINLLVTILKMRAPGMTMMRMPIFTWTSFCANILIVASFPVLTMTLILLTLDRYVGTNFFTNDLGGNPMMYINLIWIWGHPEVYILILPLFGVFSEVTSTFSGKRLFGYSSMVYATVCITVLSYLVWLHHFFTMGSGASVNSFFGITTMIISIPTGAKLFNWLFTMYRGRIRYELPMMWTIAFMLTFVIGGMTGVLLAVPPADFVLHNSLFLIAHFHNVIIGGVLFGLFAAINFWWPKAFGFKLDVFWGKVSFWFWVVGFWAAFMPLYILGLMGVTRRLRVFDDPDLRIWFAIAAFGAVLIACGIAAMFVQFGVSILRRDRPEYRDVSGDPWDGRTLEWATSSPPPAYNFAFNPISHGLDTWWEMKQQGATRPTGGYMPIHMPKNTGTGVILAALATVCGMALVWYVWWLAALSFLGIIAVSIAHTFNYNRDYYIPVSEIEATEDARTRQLAQGV
>gi|461786|sp|P33517|COX1_RHOSH Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)
MADAAIHGHEHDRRGFFTRWFMSTNHKDIGVLYLFTGGLVGLISVAFTVYMRMELMAPGVQFMCAEHLESGLVKGFFQSLWPSAVENCTPNGHLWNVMITGHGILMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRMNNLSYWLYVAGTSLAVASLFAPGGNGQLGSGIGWVLYPPLSTSESGYSTDLAIFAVHLSGASSILGAINMITTFLNMRAPGMTMHKVPLFAWSIFVTAWLILLALPVLAGAITMLLTDRNFGTTFFQPSGGGDPVLYQHILWFFGHPEVYIIVLPAFGIVSHVIATFAKKPIFGYLPMVYAMVAIGVLGFVVWAHHMYTAGLSLTQQSYFMMATMVIAVPTGIKIFSWIATMWGGSIELKTPMLWALGFLFLFTVGGVTGIVLSQASVDRYYHDTYYVVAHFHYVMSLGAVFGIFAGSTSGIGKMSGRQYPEWAGKLHFWMMFVGANLTFFPQHFLGRQGMPRRYIDYPEAFATWNFVSSLGAFLSFASFLFFLGVIFYSLSGARVTANNYWNEHADTLEWTLTSPPPEHTFEQLPKREDERAPAH
DATASET 1 Two terminal oxidases from the same family
>gi|13449404|ref|NP_085587.1| cytochrome c oxidase subunit 1 [Arabidopsis thaliana]
MKNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIFFMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITSHSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSGKPVFGYLGMVYAMISIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLFTIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITFFGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGICCFFVVVTITLSSGNNKRCAPSPWALELNSTTLEWMVQSPPAFHTFGELPAIKETKSYVK
>gi|461786|sp|P33517|COX1_RHOSH Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)
MADAAIHGHEHDRRGFFTRWFMSTNHKDIGVLYLFTGGLVGLISVAFTVYMRMELMAPGVQFMCAEHLESGLVKGFFQSLWPSAVENCTPNGHLWNVMITGHGILMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRMNNLSYWLYVAGTSLAVASLFAPGGNGQLGSGIGWVLYPPLSTSESGYSTDLAIFAVHLSGASSILGAINMITTFLNMRAPGMTMHKVPLFAWSIFVTAWLILLALPVLAGAITMLLTDRNFGTTFFQPSGGGDPVLYQHILWFFGHPEVYIIVLPAFGIVSHVIATFAKKPIFGYLPMVYAMVAIGVLGFVVWAHHMYTAGLSLTQQSYFMMATMVIAVPTGIKIFSWIATMWGGSIELKTPMLWALGFLFLFTVGGVTGIVLSQASVDRYYHDTYYVVAHFHYVMSLGAVFGIFAGSTSGIGKMSGRQYPEWAGKLHFWMMFVGANLTFFPQHFLGRQGMPRRYIDYPEAFATWNFVSSLGAFLSFASFLFFLGVIFYSLSGARVTANNYWNEHADTLEWTLTSPPPEHTFEQLPKREDERAPAH
DATASET 2 Two terminal oxidases from different families
>gi|13449404|ref|NP_085587.1| cytochrome c oxidase subunit 1 [Arabidopsis thaliana]
MKNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIFFMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITSHSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSGKPVFGYLGMVYAMISIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLFTIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITFFGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGICCFFVVVTITLSSGNNKRCAPSPWALELNSTTLEWMVQSPPAFHTFGELPAIKETKSYVK
>gi|2114418|gb|AAB58264.1| cbb3-type cytochrome oxidase component FixN [Rhizobium leguminosarum bv. viciae]
MNYTTETMVIAVAAFLALLVAAFAHDHLFAVHMGILCLCLVMGAVLMVRKVDFSPAGQQRNVDRSGYFDEVIRYGLIATVFWGVVGFLVGVIIALQLAFPDLNIAPYLNFGRLRPVHTSAVIFAFGGNALIMTSFYVVQRTCRARLFGGNLAWFVFWGYQLFIVMAATGYVLGITQGREYAEPEWYVDLWLTIVWVAYLAVYLGTILKRKEPHIYVANWFYLSFIVTIAMLHVVNNLAVPASFLGSKSYSVSSGVQDALTQWWYGHNAVGFFLTAGFLGMMYYFVPKQANRPVYSYRLSIIHFWALIFMYIWAGPHHLHYTALPDWAQTLGMVFSIMLWMPSWGGMINGLMTLSGAWDKIRTDPIIRMMIVAIAFYGMSTFEGPMMSVKTVNSLSHYTEWTIGHVHSGALGWVGMITFGAIYYLTPKLWGRERLYSLRMVNWHFWLATFGIVVYAAVLWVAGIQQGLMWREYNSQGFLVYSFAETVAAMFPYYVLRAVGGTLYLAGGLVMAWNVFMTIRGHLRDEAAIPTTFVPQAQPAE
DATASET 3 Random sequences
>gi|13449404|ref|NP_085587.1| cytochrome c oxidase subunit 1 [Arabidopsis thaliana]
MKNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIFFMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITSHSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSGKPVFGYLGMVYAMISIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLFTIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITFFGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGICCFFVVVTITLSSGNNKRCAPSPWALELNSTTLEWMVQSPPAFHTFGELPAIKETKSYVK
>gi|16121653|ref|NP_404966.1| transport ATP-binding protein [Yersinia pestis]
MQTSHLMNKTRQYELIRWLKKQSAPAQRWLRLSMLLGLLSGLLIIAQAWLLATLLQSLIIDKLPRATLTTEFSLLAGAFALRAVISWLRERVGFICGMRVRQQIRKVVLDRLEQLGPSWVKGKPAGSWATIILEQIEDMQEYYSRYLPQMYLAVFIPVLILIAVFPINWAAGLILFVTAPLIPIFMILVGMGAADANRRNFVALARLSGNFLDRLRGLDTLRLFNRAKAETDQIRDSSEDFRSRTMEVLRMAFLSSAVLEFFAAISIAVVAVYFGFSYLGELNFGSYGLGVTLFAGFLVLILAPEFFQPLRDLGTFYHAKAQAVGAAESLVTFLSSEGEAIGQGEKQLDGKEAIALEANELEILAPNGTRLAGPLNFSLPAGKRVAIVGQSGAGKSSLLNLLLGFLPYRGSLKVNGIELRELEPQVWRSQLSWVGQNPHLPEQTLATNILLRQPDASEHQLQQAVERAYINEFLKDLPQGLNTEIGDHSARLSVGQAQRIAVARALLNPCRLLLLDEPTASLDAHSEQLVMKALEEASRAQSTLLVTHQLEDTLGYDQIWVMDNGRLIQQGDYSTLSQSAGSFANLLSQRNEEL
Import the sequences
1) Compare a sequence to itself e.g. Arabidopsis to Arabidopsis
2) Align the sequence of Rhodobacter (COX1_RHOSH) with the one of Arabidopsis (dataset 1)
3) Align the sequence of Arabidopsis to the one of Rhizobium (dataset 2)
4) Align the sequence of Arabidopsis to the sequence from dataset 3
Try out the effect of different substitution matrices and of the window size (the length of the sequence stretch that is compared to place a dot).
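To see why the window size matters, the following Python sketch builds a very simple dot plot. It is only an illustration: it counts identical residues inside each window instead of scoring them with a substitution matrix, and the names dot_plot, render, window and threshold are assumptions made here, not Dotlet settings.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Return the set of (i, j) start positions at which the window of
    seq_a starting at i and the window of seq_b starting at j share at
    least `threshold` identical residues."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            pairs = zip(seq_a[i:i + window], seq_b[j:j + window])
            identical = sum(a == b for a, b in pairs)
            if identical >= threshold:
                dots.add((i, j))
    return dots


def render(dots, width, height):
    """Draw the dots as text: seq_a along the columns, seq_b along the rows."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(width))
        for j in range(height)
    )


seq = "MKNLVRW"  # first residues of the Arabidopsis sequence
dots = dot_plot(seq, seq, window=3, threshold=3)
print(render(dots, len(seq) - 2, len(seq) - 2))
```

Comparing a sequence to itself gives an unbroken main diagonal. A smaller window or a lower threshold adds scattered background dots, while a larger window suppresses them at the cost of missing short conserved stretches.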
Use the default settings (EMBOSS global alignment).
Use dataset 1:
Use dataset 2:
Use dataset 3: what do you observe?
What is the score? Is there much difference between the scores of the different datasets?
Can you compare the scores between the different datasets?
Try different parameter settings on the first dataset.
Change the gap opening penalty: what is the effect on the score and on how the alignment looks?
Change the gap extension penalty:
Explain why each parameter setting gives a different score. Compare the alignment obtained with the segments found by Dotlet. Is there a resemblance?
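The dependence of the score on the gap parameters can be made concrete with a small Python sketch that scores an alignment that is already given. It assumes one common convention, in which the first position of a gap costs the opening penalty and every further position costs the extension penalty; individual programs may define the two penalties slightly differently. The function name score_alignment, the identity scoring (match 1, mismatch –1) and the penalty values are illustrative assumptions, not the settings of any particular program.

```python
def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1,
                    gap_open=-10, gap_extend=-1):
    """Score a gapped pairwise alignment with an affine gap penalty.

    The first '-' of a gap run costs gap_open, each following '-' in
    the same run (and the same sequence) costs gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gap_in = None  # sequence that carries the gap run currently open
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            where = "a" if a == "-" else "b"
            total += gap_extend if where == gap_in else gap_open
            gap_in = where
        else:
            total += match if a == b else mismatch
            gap_in = None
    return total


# Same alignment, two different extension penalties.
print(score_alignment("AC---GT", "ACTTTGT", gap_open=-10, gap_extend=-1))
print(score_alignment("AC---GT", "ACTTTGT", gap_open=-10, gap_extend=-4))
```

The alignment itself is identical in both calls, yet the score changes. In a real run the program also re-optimises the alignment for each setting, so both the score and the placement of the gaps can change.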
Now use the following program as the local alignment procedure, because this implementation allows you to alter the scoring matrix:
http://fasta.bioch.virginia.edu/fasta/lalign.htm
Note that when you change the scoring matrix, the default values of the gap penalties change too. Can you explain this?
PAM250: gap –16, extension –4
PAM120: gap –22, extension –4
Use BLOSUM62:
1) Use dataset 1: compare the result with a global alignment of the same dataset.
Which is most appropriate? Why?
2) Do the same for dataset 2 (compare its global alignment to its local alignment). What do you observe? How would you explain it?
3) Align the sequence of Rhodobacter with the sequence in dataset 3.
Can you compare the E-values between alignments?
Use dataset 2:
Perform the local alignment with PAM250 and PAM120 (use the same gap penalty for both matrices).
What do you observe?
Especially for two sequences that are no longer very similar, it is difficult to assess whether the alignment is still biologically meaningful. Introducing more sequences and using a multiple sequence alignment can increase the information (see exercise 2).
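The difference between a global and a local alignment comes down to one change in the dynamic programming recurrence: a local alignment never lets a cell drop below zero and reports the best cell anywhere in the matrix. The Python sketch below shows this for a linear gap cost and identity scoring. It is a simplified illustration and not the implementation behind the lalign server; the function name smith_waterman_score and the default scores are assumptions.

```python
def smith_waterman_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score under a linear gap cost.

    Only the previous row of the matrix is kept, because the score
    (not the alignment itself) is all that is returned.
    """
    best = 0
    previous = [0] * (len(seq_b) + 1)
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            above = previous[j] + gap
            left = current[j - 1] + gap
            # The extra 0 lets a local alignment restart at any cell.
            cell = max(0, diagonal, above, left)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(smith_waterman_score("ATT", "TTC"))
```

For the toy sequences ATT and TTC of the theoretical exercise, the local score is that of the shared stretch TT alone, because the unmatched ends are not penalized. For two proteins that share only one conserved region, this is why a local alignment can be more appropriate than a global one.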
Suppose in the lab you sequenced the following sequence.
To learn more about the function of this gene, search for a homologue in the protein database.
ATGACATCAGCGACTCTGACGCCAGGGGCCGCCCTGGGCAGCCAGCGGGTGTCGGAAAATGTGCGTTACTACGAAGACGCCGTCCGACTCTTCGTCATCGCTGCAGTGTTCTGGGGCGTCGTCGGCTTCCTCGCCGGCGTCTTCATCGCGCTGCAGCTGGCTTTTCCGGCGCTGAATCTCGGCCTTGAGTGGACGAGCTTCGGGCGCCTGCGGCCGGTCCACACCTCGGCCGTGATCTTCGCGTTTGGCGGCAACGTCCTGTTCGCCACCTCGCTCTACTCCGTGCAGCGCACCAGCCGCCAGTTCCTGTTCGGCGGCGAGGGCCTCGCGAAGTTCGTCTTCTGGAACTACAACATCTTCATCGTCCTGGCGGCGCTCAGCTACGTGCTCGGCTACACCCAGGGCAAGGAGTATGCAGAGCCGGAGTGGATCCTCGACCTCTACCTGACGGTCATCTGGGTCCTCTACGCCATCCAGTTCGTCGGCACGGTGATGACCCGCAAGGAGTCGCACATCTACGTCGCCAACTGGTTCTTCATGGCGTTCATCCTGACCGTCGCGATCCTCCACATCGGCAACAACGTCAACGTCCCGGTGTCGCTGACCGGGATGAAGTCCTACCCGTTCGTCTCGGGCGTGCAGAGCGCCATGGTGCAGTGGTGGTACGGCCACAACGCGGTCGGCTTCTTCCTGACCGCCGGCTTCCTCGGCATCATGTCTACTTCGTTCCGAAGCGCGCGGAGCGGCCGGTCTATTCGTACCGCCTGTCGATCGTGCACTTCTGGACGCTGATCTTCCTCTACATCTGGGCCGGCCCGCACCACCTGCACTACACGGCCCTGCCGGATTGGGCGCAGACGCTGGGCATGACCTTCTCGGTCATGCTGTGGATGCCGTCCTGGGGCGGCATGATCAACGGCATCATGACCCTGTCGGGTGCCTGGGACAAGCTGCGCACCGACCCGGTCCTGCGCTTCCTCGTGACGTCGGTGGCCTTCTACGGCATGTCGACCTTCGAGGGCCCGCTGATGTCGGTGAAGCCGGTCAACGCCCTGTCGCACTACACCGACTGGACGATCGGCCACGTGCACTCCGGTGCGCTCGGCTGGGTGGCCTTCATCTCCTTCGGCGCGATCTACTATCTGGTCCCGGTCCTGTGGAAGCGCTCGCAGCTCTACAGCCTGCGTCTGGTCAGCTACCACTTCTGGACCGCCACCATCGGCATCGTGCTCTACATCACCGCCATGTGGGTGTCGGGCATCATGCAGGGCCTGATGTGGCGCGCCTACGACAACCTCGGCTTCCTCCAGTACTCGTTCGTCGAGACGGTCGCGGCCATGCATCCCTTCTACGTGATCCGTGCGCTGGGCGGCGTCCTGTTCCTGGCTGGTGCCCTGATCATGGTCTACAACCTGTGGCGCACGGCCAAGGGTGACGTCCGCATCGAGAAGCCCTATGCCTCCGCCCCGCACAAGGCGGCGGTCGGTGCGGCCTGA
Blast the sequence above against the non-redundant protein database.
Which sequence has the highest score?
How do you explain the different HSPs (high-scoring segment pairs) observed for the same sequence?
What does the E-value tell you?
Take a protein sequence from dataset 1
Use the FASTA program: http://fasta.bioch.virginia.edu/fasta/cgi/searchx.cgi?pgm=fa
To start this exercise, download and locally install the BioEdit program.
Suppose you have cloned and sequenced the following sequence in the lab. During the exercise session we will try to study the function of this gene based on in silico searches. The only thing we know is that we isolated this sequence from the bacterial species Paracoccus denitrificans and that it is involved in respiration.
MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGARLIADASAECTPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVCGVALGVASLLAPGGNDQMGSGVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAPGMTLFKVPLFAWSVFITAWLILLSLPVLAGAITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVYIIILPGFGIISHVISTFAKKPIFGYLPMVLAMAAIGILGFVVWAHHMYTAGMSLTQQAYFMLATMTIAVPTGIKVFSWIATMWGGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDRVYHDTYYVVAHFHYVMSLGAVFGIFAGVYYWIGKMSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPVEFAYWNNISSIGAYISFASFLFFIGIVFYTLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH
The FastA format is the standard format used by most sequence based programs (clustalW, …)
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNLLAAVEAQQQMLKLTIWGVK
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
The nucleic acid codes supported are:
A --> adenosine M --> A C (amino)
C --> cytidine S --> G C (strong)
G --> guanine W --> A T (weak)
T --> thymidine B --> G T C
U --> uridine D --> G A T
R --> G A (purine) H --> A C T
Y --> T C (pyrimidine) V --> G C A
K --> G T (keto) N --> A G C T (any)
- gap of indeterminate length
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:
A alanine P proline
B aspartate or asparagine Q glutamine
C cysteine R arginine
D aspartate S serine
E glutamate T threonine
F phenylalanine U selenocysteine
G glycine V valine
H histidine W tryptophan
I isoleucine Y tyrosine
K lysine Z glutamate or glutamine
L leucine X any
M methionine * translation stop
N asparagine - gap of indeterminate length
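The format rules and code tables above can be turned into a small reader and checker. The Python sketch below is a minimal illustration: the names parse_fasta, invalid_characters, NUCLEOTIDE_CODES and AMINO_ACID_CODES are introduced here, and the two alphabets simply collect the letters listed in the tables above. Lower-case letters are mapped to upper case, as described in the text.

```python
NUCLEOTIDE_CODES = set("ACGTURYKMSWBDHVN-")
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line.upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def invalid_characters(sequence, alphabet):
    """Return the characters of the sequence that are not in the alphabet."""
    return sorted(set(sequence) - alphabet)


example = ">toy record (hypothetical)\nacgtnn\nACGT12\n"
for description, sequence in parse_fasta(example):
    print(description, sequence, invalid_characters(sequence, NUCLEOTIDE_CODES))
```

Digits in the toy record are reported as invalid, which corresponds to the advice above to remove or replace numerical digits before submitting a query.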
This is the sequence of the unknown gene
> unknown gene
MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGARLIADASAECTPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVCGVALGVASLLAPGGNDQMGSGVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAPGMTLFKVPLFAWSVFITAWLILLSLPVLAGAITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVYIIILPGFGIISHVISTFAKKPIFGYLPMVLAMAAIGILGFVVWAHHMYTAGMSLTQQAYFMLATMTIAVPTGIKVFSWIATMWGGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDRVYHDTYYVVAHFHYVMSLGAVFGIFAGVYYWIGKMSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPVEFAYWNNISSIGAYISFASFLFFIGIVFYTLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH
Try to find some clues on the function of this gene by using a homology search.
We will use the heuristic program BLAST and search for homologs in the non-redundant NCBI database.
Entrez is a search and retrieval system that integrates information from databases at NCBI. These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, and MEDLINE, through PubMed.
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank (at NCBI), together with the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) comprise the International Nucleotide Sequence Database Collaboration. These three organizations exchange data on a daily basis.
GenBank grows at an exponential rate, with the number of nucleotide bases doubling approximately every 14 months. Currently, GenBank contains more than 17 billion bases from over 100,000 species.
Go to
http://www.ncbi.nlm.nih.gov/Database/index.html
http://www.sanger.ac.uk/Software/Pfam/
http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.c.h.b.d.html
What is the difference between blastn, blastp, and blastx?
What does nr (non-redundant) mean?
Blast the unknown sequence.
Interpret the result:
Explain E-value, bit score, Identities, Positives, Query, and Subject.
What does E-value = 0 mean?
What is the best match in the database?
Do you have a clue about the function of the protein?
Go to the GenBank file of the best hit (click on the link).
What information can you find in the GenBank file?
Do you find matches with eukaryotic sequences? Are these significant? What does that mean?
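The relation between bit score and E-value can be explored with a few lines of Python. The sketch uses the standard Karlin-Altschul relations: the bit score is S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S'), where m is the query length and n the total length of the database. BLAST itself works with corrected (effective) lengths, so the numbers will not reproduce a real report exactly; the function names and the example values are illustrative assumptions.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score into a bit score.

    lam and k are the statistical parameters of the scoring system;
    they must be supplied by the caller.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance alignments with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bits)


# Hypothetical query of 100 residues against a database of 1000 residues.
print(expect_value(10, 100, 1000))
print(expect_value(11, 100, 1000))  # one extra bit halves the E-value
```

Each additional bit halves the E-value, and a larger database raises it for the same score, which is why E-values from searches against different databases are not directly comparable. For very high bit scores the E-value becomes too small to display and is reported as 0.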
This system is interesting because it is very ancient and has been conserved throughout all phylogenetic branches.
To find out which sequence residues are involved in the catalytic function, we will construct an alignment of sequences from distinct species, so that we can have a representative alignment of the family of terminal oxidases. From this alignment we will derive the residues that are essential for the function because they have been conserved.
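Once a multiple alignment is available, picking out the fully conserved positions is a simple column-wise check. The Python sketch below assumes the aligned sequences are given as equal-length strings with "-" for gaps; the function name conserved_columns and the three short toy sequences are hypothetical and only serve to show the idea.

```python
def conserved_columns(aligned_sequences):
    """Return (column_index, residue) for every alignment column in which
    all sequences carry the same residue and none has a gap."""
    lengths = {len(sequence) for sequence in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for index, column in enumerate(zip(*aligned_sequences)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append((index, column[0]))
    return conserved


# Hypothetical toy alignment, not taken from the real oxidase sequences.
toy_alignment = ["HFH-YV", "HWHAYL", "HFHGYV"]
print(conserved_columns(toy_alignment))
```

Strict identity is a deliberately simple criterion. With many distantly related sequences it may be more useful to accept columns in which only conservative substitutions occur.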
Select from this file the following sequences in FastA format and add them to the clipboard
Bacteria:
P98002
Cytochrome c oxidase polypeptide I-beta (Cytochrome AA3 subunit 1-beta)
gi|1169145|sp|P98002|CX1B_PARDE[1169145]
P33517
Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)
gi|461786|sp|P33517|COX1_RHOSH[461786]
Q08855
Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)
gi|1352141|sp|Q08855|COX1_RHILE[1352141]
Because the protein is so widely distributed and we found some hits in eukaryotes as well, we will search for more eukaryotic hits. However, because the “unknown protein” is of bacterial origin, an unrestricted search retrieves the prokaryotic sequences first because they are the most similar. To focus on the eukaryotic hits, we will perform an advanced BLAST search and blast the “unknown sequence” against the non-redundant database from which the prokaryotic sequences have been excluded.
Redo the blast, but now search only for eukaryotic hits.
In what species do you find most hits?
Add from this blast output to the clipboard:
Viridiplantae
1: P08742
Cytochrome c oxidase polypeptide I
gi|1169027|sp|P08742|COX1_MAIZE[1169027]
2: P14578
Cytochrome c oxidase polypeptide I
gi|1169030|sp|P14578|COX1_ORYSA[1169030]
3: NP_085587
cytochrome c oxidase subunit 1 [Arabidopsis thaliana]
gi|13449404|ref|NP_085587.1|[13449404]
Redo the blast, but now search only for mammalian hits.
Select from this file:
15: AAL54607
cytochrome c oxidase subunit I [Homo sapiens]
gi|17985616|gb|AAL54607.1|[17985616]
and a sequence from one more species.
Add the following sequences by searching for their accession numbers in GenBank, and add the FastA files to the clipboard.
AAC72071.1
AAB58264.1
ZP_00008147
We will use the sequences selected above as input in a multiple sequence alignment program (ClustalW).
Download the program from
http://inn-prot.weizmann.ac.il/software/ClustalX.html
Information on the program can be found at:
http://www.molbiol.ox.ac.uk/documentation/clustalx/clustalx.html
Use the web interface:
http://www.ebi.ac.uk/clustalw/
Globally align the sequences that you have selected. Test the influence of different parameters. Once you have obtained a reliable alignment:
Look at the alignment and the phylogenetic tree. What do you observe?
Compare the multiple alignment with the local pairwise alignment of two members of the family (e.g. dataset 2). Which is more informative, the pairwise or the multiple alignment?
Consider the sequences that you added and that were not retrieved by the BLAST search. Do they belong in the multiple alignment? Why were they not detected by the BLAST search?
Change the parameters of the multiple alignment (e.g. a different, lower gap cost). Do you still find the right alignment? How sensitive is finding the true alignment to the gap cost?
The system that will be studied is the terminal oxidase, the enzyme that catalyses the final reduction of O2 to H2O in the respiratory chain to generate energy. For more information see http://www.sanger.ac.uk/Software/Pfam/
The genes whose sequences were aligned encode subunit 1 of the terminal oxidase complex. This subunit contains the catalytic site where O2 is reduced to H2O. For this purpose it contains a heme-copper center. A conserved high-spin and low-spin heme are involved in ligating the Cu center. Conserved H residues have been shown to be involved in binding the heme. Three major families of terminal oxidases can be distinguished, two of which occur in prokaryotes only (cbb3-type cytochrome oxidases and quinol oxidases). The terminal oxidases are all part of the respiratory chain: they receive electrons from an electron donor and use these electrons to reduce O2. Some terminal oxidases receive the electrons from quinols, others from cytochrome c. This explains the differences in sequence between the distinct classes of oxidases.
Eukaryotes have only one type of oxidase: the cytochrome c type oxidase.
Prokaryotes often have branched respiratory chains with different types of terminal oxidases. Each of these oxidases has different properties (some are produced at very low O2 concentrations and have a high affinity for O2). This allows bacteria to live in very different environments, whereas eukaryotes require a fixed O2 concentration.
The cytochrome c type oxidase is the one that is also present in eukaryotes.