EXERCISE 1 PAIRWISE ALIGNMENTS 1 THEORETICAL EXERCISE TO


Exercise 1: Pairwise alignments


1. Theoretical exercise


The backtracking algorithm is illustrated with a simplified DNA example because the scoring system is very simple: a match scores 1 and a mismatch scores –1. Gaps (insertions and deletions) are also penalized, with a score of –1.

Suppose two very short DNA sequences need to be aligned: seq i = ATT and seq j = TTC.





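Solution

Columns correspond to seq i (ATT), rows to seq j (TTC). Each inner cell lists the three candidate scores as diagonal / up / left, followed by the maximum, which is the value kept.

         GAP    A                  T                  T
GAP       0     -1                 -2                 -3
T        -1     -1/-2/-2 -> -1     0/-3/-2 -> 0       -1/-4/-1 -> -1
T        -2     -2/-2/-3 -> -2     0/-1/-3 -> 0       1/-2/-1 -> 1
C        -3     -3/-3/-4 -> -3     -3/-1/-4 -> -1     -1/0/-2 -> 0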
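The scores for ATT and TTC can be recomputed with a few lines of Python. This is a minimal sketch of global alignment by dynamic programming with the scoring stated above (match 1, mismatch -1, gap -1); the function names `nw_matrix` and `backtrack` are chosen here for illustration and do not come from any of the programs used later.

```python
def nw_matrix(seq_i, seq_j, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment score matrix (columns: seq_i, rows: seq_j)."""
    rows, cols = len(seq_j) + 1, len(seq_i) + 1
    score = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):          # first row: only gaps
        score[0][c] = c * gap
    for r in range(1, rows):          # first column: only gaps
        score[r][0] = r * gap
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq_j[r - 1] == seq_i[c - 1] else mismatch
            score[r][c] = max(score[r - 1][c - 1] + s,   # diagonal
                              score[r - 1][c] + gap,     # up
                              score[r][c - 1] + gap)     # left
    return score


def backtrack(seq_i, seq_j, score, match=1, mismatch=-1, gap=-1):
    """Recover one optimal alignment, preferring diagonal, then up, then left."""
    r, c = len(seq_j), len(seq_i)
    top, bottom = [], []
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            s = match if seq_j[r - 1] == seq_i[c - 1] else mismatch
            if score[r][c] == score[r - 1][c - 1] + s:
                top.append(seq_i[c - 1])
                bottom.append(seq_j[r - 1])
                r, c = r - 1, c - 1
                continue
        if r > 0 and score[r][c] == score[r - 1][c] + gap:
            top.append("-")
            bottom.append(seq_j[r - 1])
            r -= 1
        else:
            top.append(seq_i[c - 1])
            bottom.append("-")
            c -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


matrix = nw_matrix("ATT", "TTC")
print(matrix[-1][-1])                      # final alignment score
print(backtrack("ATT", "TTC", matrix))     # one optimal alignment
```

Other optimal alignments with the same score may exist; which one is returned depends only on the order in which the three moves are tried.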





2. Practical exercises


The following programs are available:


DOTLET

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Alignment based on dot representation (use NetScape)


GLOBAL ALIGNMENT

http://genome.cs.mtu.edu/align/align.html

LOCAL ALIGNMENT

http://fasta.bioch.virginia.edu/fasta/lalign.htm

These two programs are based on dynamic programming and provide a global and a local alignment, respectively.


Use the following sequences in FastA format:


DATASET 0 Two bacterial terminal oxidases from different families

>gi|1071819|pir||B54759 ba-type ubiquinol oxidase (EC 1.10.3.-) chain I - Paracoccus denitrificans

MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVRGFADAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVGGAVITMASLFLGEFAQTGWLAFPPLSGIGYSPWVGVDYYIWGLQVAGVGTTLSGINLLVTILKMRAPGMTMMRMPIFTWTSFCANILIVASFPVLTMTLILLTLDRYVGTNFFTNDLGGNPMMYINLIWIWGHPEVYILILPLFGVFSEVTSTFSGKRLFGYSSMVYATVCITVLSYLVWLHHFFTMGSGASVNSFFGITTMIISIPTGAKLFNWLFTMYRGRIRYELPMMWTIAFMLTFVIGGMTGVLLAVPPADFVLHNSLFLIAHFHNVIIGGVLFGLFAAINFWWPKAFGFKLDVFWGKVSFWFWVVGFWAAFMPLYILGLMGVTRRLRVFDDPDLRIWFAIAAFGAVLIACGIAAMFVQFGVSILRRDRPEYRDVSGDPWDGRTLEWATSSPPPAYNFAFNPISHGLDTWWEMKQQGATRPTGGYMPIHMPKNTGTGVILAALATVCGMALVWYVWWLAALSFLGIIAVSIAHTFNYNRDYYIPVSEIEATEDARTRQLAQGV


>gi|461786|sp|P33517|COX1_RHOSH Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)

MADAAIHGHEHDRRGFFTRWFMSTNHKDIGVLYLFTGGLVGLISVAFTVYMRMELMAPGVQFMCAEHLESGLVKGFFQSLWPSAVENCTPNGHLWNVMITGHGILMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRMNNLSYWLYVAGTSLAVASLFAPGGNGQLGSGIGWVLYPPLSTSESGYSTDLAIFAVHLSGASSILGAINMITTFLNMRAPGMTMHKVPLFAWSIFVTAWLILLALPVLAGAITMLLTDRNFGTTFFQPSGGGDPVLYQHILWFFGHPEVYIIVLPAFGIVSHVIATFAKKPIFGYLPMVYAMVAIGVLGFVVWAHHMYTAGLSLTQQSYFMMATMVIAVPTGIKIFSWIATMWGGSIELKTPMLWALGFLFLFTVGGVTGIVLSQASVDRYYHDTYYVVAHFHYVMSLGAVFGIFAGSTSGIGKMSGRQYPEWAGKLHFWMMFVGANLTFFPQHFLGRQGMPRRYIDYPEAFATWNFVSSLGAFLSFASFLFFLGVIFYSLSGARVTANNYWNEHADTLEWTLTSPPPEHTFEQLPKREDERAPAH


DATASET 1 Two terminal oxidases from the same family

>gi|13449404|ref|NP_085587.1| cytochrome c oxidase subunit 1 [Arabidopsis thaliana]

MKNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIFFMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITSHSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSGKPVFGYLGMVYAMISIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLFTIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITFFGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGICCFFVVVTITLSSGNNKRCAPSPWALELNSTTLEWMVQSPPAFHTFGELPAIKETKSYVK


>gi|461786|sp|P33517|COX1_RHOSH Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)

MADAAIHGHEHDRRGFFTRWFMSTNHKDIGVLYLFTGGLVGLISVAFTVYMRMELMAPGVQFMCAEHLESGLVKGFFQSLWPSAVENCTPNGHLWNVMITGHGILMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRMNNLSYWLYVAGTSLAVASLFAPGGNGQLGSGIGWVLYPPLSTSESGYSTDLAIFAVHLSGASSILGAINMITTFLNMRAPGMTMHKVPLFAWSIFVTAWLILLALPVLAGAITMLLTDRNFGTTFFQPSGGGDPVLYQHILWFFGHPEVYIIVLPAFGIVSHVIATFAKKPIFGYLPMVYAMVAIGVLGFVVWAHHMYTAGLSLTQQSYFMMATMVIAVPTGIKIFSWIATMWGGSIELKTPMLWALGFLFLFTVGGVTGIVLSQASVDRYYHDTYYVVAHFHYVMSLGAVFGIFAGSTSGIGKMSGRQYPEWAGKLHFWMMFVGANLTFFPQHFLGRQGMPRRYIDYPEAFATWNFVSSLGAFLSFASFLFFLGVIFYSLSGARVTANNYWNEHADTLEWTLTSPPPEHTFEQLPKREDERAPAH



DATASET 2 Two terminal oxidases from different families

>gi|13449404|ref|NP_085587.1| cytochrome c oxidase subunit 1 [Arabidopsis thaliana]

MKNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIFFMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITSHSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSGKPVFGYLGMVYAMISIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLFTIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITFFGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGICCFFVVVTITLSSGNNKRCAPSPWALELNSTTLEWMVQSPPAFHTFGELPAIKETKSYVK


>gi|2114418|gb|AAB58264.1| cbb3-type cytochrome oxidase component FixN [Rhizobium leguminosarum bv. viciae]

MNYTTETMVIAVAAFLALLVAAFAHDHLFAVHMGILCLCLVMGAVLMVRKVDFSPAGQQRNVDRSGYFDEVIRYGLIATVFWGVVGFLVGVIIALQLAFPDLNIAPYLNFGRLRPVHTSAVIFAFGGNALIMTSFYVVQRTCRARLFGGNLAWFVFWGYQLFIVMAATGYVLGITQGREYAEPEWYVDLWLTIVWVAYLAVYLGTILKRKEPHIYVANWFYLSFIVTIAMLHVVNNLAVPASFLGSKSYSVSSGVQDALTQWWYGHNAVGFFLTAGFLGMMYYFVPKQANRPVYSYRLSIIHFWALIFMYIWAGPHHLHYTALPDWAQTLGMVFSIMLWMPSWGGMINGLMTLSGAWDKIRTDPIIRMMIVAIAFYGMSTFEGPMMSVKTVNSLSHYTEWTIGHVHSGALGWVGMITFGAIYYLTPKLWGRERLYSLRMVNWHFWLATFGIVVYAAVLWVAGIQQGLMWREYNSQGFLVYSFAETVAAMFPYYVLRAVGGTLYLAGGLVMAWNVFMTIRGHLRDEAAIPTTFVPQAQPAE


DATASET 3 Random sequences

>gi|13449404|ref|NP_085587.1| cytochrome c oxidase subunit 1 [Arabidopsis thaliana]

MKNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIFFMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITSHSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSGKPVFGYLGMVYAMISIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLFTIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITFFGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGICCFFVVVTITLSSGNNKRCAPSPWALELNSTTLEWMVQSPPAFHTFGELPAIKETKSYVK


>gi|16121653|ref|NP_404966.1| transport ATP-binding protein [Yersinia pestis]

MQTSHLMNKTRQYELIRWLKKQSAPAQRWLRLSMLLGLLSGLLIIAQAWLLATLLQSLIIDKLPRATLTTEFSLLAGAFALRAVISWLRERVGFICGMRVRQQIRKVVLDRLEQLGPSWVKGKPAGSWATIILEQIEDMQEYYSRYLPQMYLAVFIPVLILIAVFPINWAAGLILFVTAPLIPIFMILVGMGAADANRRNFVALARLSGNFLDRLRGLDTLRLFNRAKAETDQIRDSSEDFRSRTMEVLRMAFLSSAVLEFFAAISIAVVAVYFGFSYLGELNFGSYGLGVTLFAGFLVLILAPEFFQPLRDLGTFYHAKAQAVGAAESLVTFLSSEGEAIGQGEKQLDGKEAIALEANELEILAPNGTRLAGPLNFSLPAGKRVAIVGQSGAGKSSLLNLLLGFLPYRGSLKVNGIELRELEPQVWRSQLSWVGQNPHLPEQTLATNILLRQPDASEHQLQQAVERAYINEFLKDLPQGLNTEIGDHSARLSVGQAQRIAVARALLNPCRLLLLDEPTASLDAHSEQLVMKALEEASRAQSTLLVTHQLEDTLGYDQIWVMDNGRLIQQGDYSTLSQSAGSFANLLSQRNEEL

1.1 Make an alignment using Dotlet

Import the sequences

1) Compare a sequence to itself, e.g. Arabidopsis to Arabidopsis

2) Align the sequence of Rhodobacter (COX1_RHOSH) with the one of Arabidopsis (dataset 1)

3) Align the sequence of Arabidopsis to the one of Rhizobium (dataset 2)

4) Align the sequence of Arabidopsis to the other sequence from dataset 3


Try out the effect of different substitution matrices and of the window length used to place a dot.
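To see why the window length matters, the dot-plot idea can be sketched in Python. This is a simplified illustration, not Dotlet's method: it counts identical residues in each window instead of summing substitution-matrix scores, and the function name `dotplot` and its `threshold` parameter are assumptions made for this example.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return the (i, j) start positions whose windows share at least
    `threshold` identical residues (identity scoring, an assumption)."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window)
                          if seq_a[i + k] == seq_b[j + k])
            if matches >= threshold:
                dots.add((i, j))
    return dots


# Comparing a sequence to itself gives the main diagonal.
print(sorted(dotplot("MKNLV", "MKNLV")))
```

A longer window with a stricter threshold removes isolated chance matches but can also hide short conserved segments.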



1.2 Perform a global alignment

Use the default settings (EMBOSS global alignment)


Use dataset 1:

Use dataset 2:

Use dataset 3: what do you observe?

What is the score? Is there much difference between the scores of the different datasets?

Can you compare the score between the different datasets?


Try different parameter settings on the first dataset

Change the gap penalty term: what is the effect on the score and on how the alignment looks?

Change the gap extension parameter:

Explain why you find a different score each time you use a different parameter setting. Compare the obtained alignment with the segments obtained by Dotlet. Is there a resemblance?
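The effect of the two gap parameters can be made concrete with a small calculation. The sketch below assumes one common affine convention, cost = open + (length - 1) * extend; some programs instead charge open + length * extend, so check the documentation of the tool you use. The values 10 and 0.5 are illustrative only.

```python
def gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` residues under the assumed
    convention: open for the first position, extend for each further one."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# One gap of length 2 is much cheaper than two separate gaps of length 1.
print(gap_cost(2))                 # a single, extended gap
print(gap_cost(1) + gap_cost(1))   # two independently opened gaps
```

Raising the opening penalty therefore favours fewer gaps, while lowering the extension penalty favours long gaps over many short ones.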

1.3 Perform a local alignment

For the local alignment, use the program at

http://fasta.bioch.virginia.edu/fasta/lalign.htm

because this implementation allows you to alter the scoring matrix.

Note that when you change the scoring matrix, the default values of the gap penalties change too. Can you explain this?


PAM250: gap -16, ext -4

PAM120: gap -22, ext -4


Use the BLOSUM62 matrix:


1) Use dataset 1: Compare the result with a global alignment on the same dataset

What is most appropriate? Why?

2) Do the same for dataset 2 (compare its global alignment to its local alignment). What do you observe? How would you explain this?

3) Align the sequence of Rhodobacter with the sequence in dataset 3.
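The difference between a global and a local alignment comes down to one change in the recurrence: local scores are never allowed to drop below zero, and the best cell anywhere in the matrix is reported. The following is a minimal sketch using the toy DNA scoring from the theoretical exercise (match 1, mismatch -1, gap -1) rather than BLOSUM62 or PAM; the function name `sw_best_score` is hypothetical.

```python
def sw_best_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (best local score, (row, col) of the cell where it ends)."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq_b[r - 1] == seq_a[c - 1] else mismatch
            score[r][c] = max(0,                          # restart the alignment
                              score[r - 1][c - 1] + s,    # diagonal
                              score[r - 1][c] + gap,      # up
                              score[r][c - 1] + gap)      # left
            if score[r][c] > best:
                best, best_pos = score[r][c], (r, c)
    return best, best_pos


# ATT and TTC share the local segment TT.
print(sw_best_score("ATT", "TTC"))
```

For ATT and TTC the global score is 0, while the local alignment keeps only the matching segment TT and scores 2.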


Can you compare the E-values between alignments?


Use dataset 2:

Perform the local alignment with PAM250 and PAM120 (use the same gap penalty for both matrices).

What do you observe?




Especially for two sequences that are no longer very similar, it is difficult to assess whether the alignment is still biologically meaningful. Introducing more sequences and making use of a multiple sequence alignment can increase the information (see Exercise 3).





Exercise 2A: BLAST

Suppose in the lab you sequenced the following sequence.

To learn more about the function of this gene, search for a homologue in the protein database.

>gene 1

ATGACATCAGCGACTCTGACGCCAGGGGCCGCCCTGGGCAGCCAGCGGGTGTCGGAAAATGTGCGTTACTACGAAGACGCCGTCCGACTCTTCGTCATCGCTGCAGTGTTCTGGGGCGTCGTCGGCTTCCTCGCCGGCGTCTTCATCGCGCTGCAGCTGGCTTTTCCGGCGCTGAATCTCGGCCTTGAGTGGACGAGCTTCGGGCGCCTGCGGCCGGTCCACACCTCGGCCGTGATCTTCGCGTTTGGCGGCAACGTCCTGTTCGCCACCTCGCTCTACTCCGTGCAGCGCACCAGCCGCCAGTTCCTGTTCGGCGGCGAGGGCCTCGCGAAGTTCGTCTTCTGGAACTACAACATCTTCATCGTCCTGGCGGCGCTCAGCTACGTGCTCGGCTACACCCAGGGCAAGGAGTATGCAGAGCCGGAGTGGATCCTCGACCTCTACCTGACGGTCATCTGGGTCCTCTACGCCATCCAGTTCGTCGGCACGGTGATGACCCGCAAGGAGTCGCACATCTACGTCGCCAACTGGTTCTTCATGGCGTTCATCCTGACCGTCGCGATCCTCCACATCGGCAACAACGTCAACGTCCCGGTGTCGCTGACCGGGATGAAGTCCTACCCGTTCGTCTCGGGCGTGCAGAGCGCCATGGTGCAGTGGTGGTACGGCCACAACGCGGTCGGCTTCTTCCTGACCGCCGGCTTCCTCGGCATCATGTCTACTTCGTTCCGAAGCGCGCGGAGCGGCCGGTCTATTCGTACCGCCTGTCGATCGTGCACTTCTGGACGCTGATCTTCCTCTACATCTGGGCCGGCCCGCACCACCTGCACTACACGGCCCTGCCGGATTGGGCGCAGACGCTGGGCATGACCTTCTCGGTCATGCTGTGGATGCCGTCCTGGGGCGGCATGATCAACGGCATCATGACCCTGTCGGGTGCCTGGGACAAGCTGCGCACCGACCCGGTCCTGCGCTTCCTCGTGACGTCGGTGGCCTTCTACGGCATGTCGACCTTCGAGGGCCCGCTGATGTCGGTGAAGCCGGTCAACGCCCTGTCGCACTACACCGACTGGACGATCGGCCACGTGCACTCCGGTGCGCTCGGCTGGGTGGCCTTCATCTCCTTCGGCGCGATCTACTATCTGGTCCCGGTCCTGTGGAAGCGCTCGCAGCTCTACAGCCTGCGTCTGGTCAGCTACCACTTCTGGACCGCCACCATCGGCATCGTGCTCTACATCACCGCCATGTGGGTGTCGGGCATCATGCAGGGCCTGATGTGGCGCGCCTACGACAACCTCGGCTTCCTCCAGTACTCGTTCGTCGAGACGGTCGCGGCCATGCATCCCTTCTACGTGATCCGTGCGCTGGGCGGCGTCCTGTTCCTGGCTGGTGCCCTGATCATGGTCTACAACCTGTGGCGCACGGCCAAGGGTGACGTCCGCATCGAGAAGCCCTATGCCTCCGCCCCGCACAAGGCGGCGGTCGGTGCGGCCTGA


Blast the sequence above against the non-redundant protein database.
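Searching a protein database with a nucleotide query requires translating the DNA in all six reading frames. A minimal sketch of that translation step, assuming the standard genetic code; the helper names `translate`, `reverse_complement` and `six_frames` are chosen for this example and are not part of BLAST.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate from offset `frame` (0, 1 or 2); unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[p:p + 3], "X")
                   for p in range(frame, len(dna) - 2, 3))


def reverse_complement(dna):
    """Reverse complement of a DNA string containing A, C, G, T."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(dna, f) for f in range(3)] + \
           [translate(rc, f) for f in range(3)]


# First 21 bases of gene 1.
print(translate("ATGACATCAGCGACTCTGACG"))
```

Only one of the six frames normally gives a long stretch without stop codons (`*`), which is the frame that produces the significant hits.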


Which sequence has the highest score?

How do you explain the different HSPs (high-scoring segment pairs) observed for the same sequence?

What does the E-value tell you?
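The relation between the bit score and the E-value can be written as E = m * n * 2^(-S'), where S' is the bit score and m and n are the (effective) lengths of the query and the database. The sketch below only evaluates this formula; the lengths used are made-up numbers, and real BLAST output applies length corrections that are not modelled here.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each additional bit halves the E-value.
print(expect_value(10, 32, 32))
print(expect_value(11, 32, 32))
```

Because E scales with the size of the search space, the same alignment receives a larger E-value when the database grows.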



Exercise 2B: FASTA


Take a protein sequence from dataset 1

Use the FASTA program: http://fasta.bioch.virginia.edu/fasta/cgi/searchx.cgi?pgm=fa






































Exercise 3: Multiple alignment and BLAST



To start this exercise, download and locally install the BioEdit program.


Suppose you have cloned and sequenced the following sequence in the lab. During the exercise session we will try to study the function of this gene based on in silico searches. The only thing we know is that we isolated this sequence from the bacterial species Paracoccus denitrificans and that it is involved in respiration.


MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGARLIADASAECTPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVCGVALGVASLLAPGGNDQMGSGVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAPGMTLFKVPLFAWSVFITAWLILLSLPVLAGAITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVYIIILPGFGIISHVISTFAKKPIFGYLPMVLAMAAIGILGFVVWAHHMYTAGMSLTQQAYFMLATMTIAVPTGIKVFSWIATMWGGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDRVYHDTYYVVAHFHYVMSLGAVFGIFAGVYYWIGKMSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPVEFAYWNNISSIGAYISFASFLFFIGIVFYTLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH

2.1 Convert file to FastA format

The FastA format is the standard format used by most sequence-based programs (ClustalW, ...).

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:


>gi|532319|pir|TVFV2E|TVFV2E envelope protein

ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNLLAAVEAQQQMLKLTIWGVK
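Converting a raw sequence to this format can also be done with a short script. This is a minimal sketch, assuming the raw input may contain whitespace and position numbers that should be dropped; the function name `to_fasta` and the line width of 70 are choices made for this example.

```python
def to_fasta(description, sequence, width=70):
    """Return a FASTA record: '>' description line, then wrapped sequence."""
    # Drop whitespace and digits (e.g. position numbers), upper-case the rest.
    cleaned = "".join(ch for ch in sequence
                      if not ch.isspace() and not ch.isdigit()).upper()
    lines = [cleaned[p:p + width] for p in range(0, len(cleaned), width)]
    return ">" + description + "\n" + "\n".join(lines) + "\n"


print(to_fasta("unknown gene", "1 madaavhghg 11 dhhdtrgfft", width=10))
```

Keeping the width below 80 follows the recommendation above that all lines be shorter than 80 characters.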


Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

The nucleic acid codes supported are:

A --> adenosine            M --> A C (amino)
C --> cytidine             S --> G C (strong)
G --> guanine              W --> A T (weak)
T --> thymidine            B --> G T C
U --> uridine              D --> G A T
R --> G A (purine)         H --> A C T
Y --> T C (pyrimidine)     V --> G C A
K --> G T (keto)           N --> A G C T (any)
-     gap of indeterminate length


For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

A  alanine                    P  proline
B  aspartate or asparagine    Q  glutamine
C  cysteine                   R  arginine
D  aspartate                  S  serine
E  glutamate                  T  threonine
F  phenylalanine              U  selenocysteine
G  glycine                    V  valine
H  histidine                  W  tryptophan
I  isoleucine                 Y  tyrosine
K  lysine                     Z  glutamate or glutamine
L  leucine                    X  any
M  methionine                 *  translation stop
N  asparagine                 -  gap of indeterminate length
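Before submitting a query it can be useful to check it against these code tables. A minimal sketch, with the two alphabets copied from the tables above; the names `NUCLEIC_CODES`, `AMINO_ACID_CODES` and `invalid_characters` are hypothetical.

```python
# Accepted one-letter codes, taken from the two tables above.
NUCLEIC_CODES = set("ACGTURYKMSWBDHVN-")
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def invalid_characters(sequence, alphabet):
    """Return the sorted characters of `sequence` that are not accepted.
    Lower-case letters are mapped to upper case first."""
    return sorted(set(sequence.upper()) - alphabet)


print(invalid_characters("madaavhghg", AMINO_ACID_CODES))   # nothing to report
print(invalid_characters("ATGJ1", NUCLEIC_CODES))           # J and 1 are rejected
```

An empty result means every character is an accepted code; digits and letters outside the table should be removed or replaced by N or X as described above.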


This is the sequence of the unknown gene

> unknown gene

MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGARLIADASAECTPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVCGVALGVASLLAPGGNDQMGSGVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAPGMTLFKVPLFAWSVFITAWLILLSLPVLAGAITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVYIIILPGFGIISHVISTFAKKPIFGYLPMVLAMAAIGILGFVVWAHHMYTAGMSLTQQAYFMLATMTIAVPTGIKVFSWIATMWGGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDRVYHDTYYVVAHFHYVMSLGAVFGIFAGVYYWIGKMSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPVEFAYWNNISSIGAYISFASFLFFIGIVFYTLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH

3 NCBI database

Try to find some clues on the function of this gene by using a homology search.

We will use the heuristic program BLAST and search for homologs in the non-redundant NCBI database.

http://www.ncbi.nlm.nih.gov/



Entrez is a search and retrieval system that integrates information from databases at NCBI. These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, and MEDLINE, through PubMed.


GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank (at NCBI), together with the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) comprise the International Nucleotide Sequence Database Collaboration. These three organizations exchange data on a daily basis.

GenBank grows at an exponential rate, with the number of nucleotide bases doubling approximately every 14 months. Currently, GenBank contains more than 17 billion bases from over 100,000 species.



Go to

http://www.ncbi.nlm.nih.gov/Database/index.html


http://www.sanger.ac.uk/Software/Pfam/

http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.c.h.b.d.html

4 Homology search using BLAST


What is the difference between blastn, blastp and blastx?

What does nr (non-redundant) mean?


Blast the unknown sequence


Interpret the result:

Explain E-value, bit score, Identities, Positives, Query and Subject.

What does E-value = 0 mean?

What is the best match in the database?
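The Identities line of a hit can be reproduced from the aligned query and subject strings. The sketch below counts identical positions and gap positions for a pair of already aligned sequences; the function name `alignment_stats` is hypothetical, and Positives are left out because they require a substitution matrix.

```python
def alignment_stats(query_aln, subject_aln):
    """Return (identities, gaps, alignment length) for two aligned strings
    of equal length in which '-' marks a gap."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query_aln, subject_aln)
                     if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query_aln, subject_aln)
               if q == "-" or s == "-")
    return identities, gaps, len(query_aln)


ident, gaps, length = alignment_stats("ATT-", "-TTC")
print(f"Identities = {ident}/{length}, Gaps = {gaps}/{length}")
```

The percentage is taken over the alignment length, not over the length of the query, so a short but perfect local hit can show 100% identity.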


Do you have a clue on the function of the protein?

Go to the GenBank file of the best hit: (click on the link)

What information can you find in the GenBank file?

Do you find matches with eukaryotic sequences? Are these significant? What does that mean?


This system is interesting because it is very ancient and has been conserved throughout all phylogenetic branches.


To find out which sequence residues are involved in the catalytic function, we will construct an alignment of sequences from distinct species, so that we can have a representative alignment of the family of terminal oxidases. From this alignment we will derive the residues that are essential for the function because they have been conserved.
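Deriving conserved residues from an alignment amounts to scanning it column by column. A minimal sketch, assuming the aligned sequences are available as equal-length strings with '-' for gaps; the function name `conserved_columns` and the three short sequences are invented for illustration and are not real oxidase fragments.

```python
def conserved_columns(aligned):
    """Return the 0-based columns where every sequence has the same residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [col for col in range(length)
            if aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)]


toy_alignment = ["HAH-W", "HGHQW", "HSH-W"]   # made-up example
print(conserved_columns(toy_alignment))
```

With only a few closely related sequences almost every column looks conserved, which is why the alignment should include sequences from distinct species.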

Select the following sequences from this file in FastA format and add them to the clipboard:


Bacteria:

P98002

Cytochrome c oxidase polypeptide I-beta (Cytochrome AA3 subunit 1-beta)

gi|1169145|sp|P98002|CX1B_PARDE[1169145]


P33517

Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)

gi|461786|sp|P33517|COX1_RHOSH[461786]


Q08855

Cytochrome c oxidase polypeptide I (Cytochrome AA3 subunit 1)

gi|1352141|sp|Q08855|COX1_RHILE[1352141]


Because the protein is so widely distributed and we found some hits in eukaryotes as well, we will search for more eukaryotic hits. However, because the “unknown protein” is of bacterial origin, a standard search retrieves all prokaryotic sequences first because they are most similar. To focus on the eukaryotic hits, we will perform an advanced BLAST search and blast the “unknown sequence” against the non-redundant database from which the prokaryotic sequences were excluded.



Redo the BLAST search, but now only search for eukaryotic hits.

In which species do you find most hits?

From this BLAST output, add to the clipboard:

Viridiplantae


1: P08742

Cytochrome c oxidase polypeptide I

gi|1169027|sp|P08742|COX1_MAIZE[1169027]


2: P14578

Cytochrome c oxidase polypeptide I

gi|1169030|sp|P14578|COX1_ORYSA[1169030]


3: NP_085587

cytochrome c oxidase subunit 1 [Arabidopsis thaliana]

gi|13449404|ref|NP_085587.1|[13449404]


Redo the BLAST search, but now only search for mammalian hits.

Select from this file

15: AAL54607

cytochrome c oxidase subunit I [Homo sapiens]

gi|17985616|gb|AAL54607.1|[17985616]


and a sequence from one more species.


Add the following sequences by searching for their accession numbers in GenBank. Add the FastA files to the clipboard.

AAC72071.1

AAB58264.1

ZP_00008147

5 Multiple sequence alignment

We will use the sequences selected above as input in a multiple sequence alignment program (ClustalW).

http://inn-prot.weizmann.ac.il/software/ClustalX.html

Information on the program can be found at:

http://www.molbiol.ox.ac.uk/documentation/clustalx/clustalx.html


http://www.ebi.ac.uk/clustalw/


Globally align the sequences that you have selected. Test the influence of different parameters. Once you have obtained a reliable alignment:

  1. Look at the alignment and the phylogenetic tree. What do you observe?

  2. Compare the multiple alignment with the local pairwise alignment of two members of the family (e.g. dataset 2). Which is more informative, the pairwise or the multiple alignment?

  3. Consider the sequences that you added and that were not retrieved by the BLAST search. Do they belong in the multiple alignment? Why were they not detected by the BLAST search?

  4. Change the parameters of the multiple alignment (e.g. a different, lower gap cost). Do you still find the right alignment? How sensitive is finding the true alignment to the gap cost?

6 Results

The system studied here is the terminal oxidase, the enzyme that catalyses the final reduction of O2 to H2O in the respiratory chain to generate energy. For more information see http://www.sanger.ac.uk/Software/Pfam/

The genes whose sequences were aligned encode subunit 1 of the terminal oxidase complex. This subunit contains the catalytic site where O2 is reduced to H2O. For this purpose it contains a heme-copper center. A conserved high-spin heme and a conserved low-spin heme are involved in ligating the Cu center. Conserved H residues have been shown to be involved in binding the heme. Three major families of terminal oxidases can be distinguished, two of which occur in prokaryotes only (cbb3-type oxidases and quinol oxidases). All terminal oxidases are part of the respiratory chain: they receive electrons from an electron donor and use these electrons to reduce O2. Some terminal oxidases receive the electrons from quinols, others from cytochrome c. This explains the differences in sequence between the distinct classes of oxidases.

Eukaryotes have only one type of oxidase: the cytochrome c type oxidase.

Prokaryotes often have branched respiratory chains with different types of terminal oxidases. Each of these oxidases has different properties (some are produced at very low O2 concentrations and have a high affinity for O2). This allows bacteria to live in very different environments, while eukaryotes require a fixed O2 concentration.

The cytochrome c type oxidase is the one that is also present in eukaryotes.

