SUPPLEMENTARY MATERIAL RNA SECONDARY STRUCTURE MEDIATES ALTERNATIVE 3’SS SELECTION


SUPPLEMENTARY MATERIAL



RNA secondary structure mediates alternative 3’ss selection in Saccharomyces cerevisiae

Mireya Plass1,2, Carles Codony-Servat3, Pedro Gabriel Ferreira1,4, Josep Vilardell3,5, Eduardo Eyras1,5

1 Universitat Pompeu Fabra (UPF), Dr. Aiguader 88, 08003, Barcelona, Spain

2 Present address: The Bioinformatics Centre, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200 Copenhagen, Denmark

3 Molecular Biology Institute of Barcelona (IBMB), Baldiri Reixac 10-12, 08028 Barcelona, Spain.

4 Centre for Genomic Regulation (CRG), Dr. Aiguader 88, 08003 Barcelona, Spain

5 Catalan Institution of Research and Advanced Studies (ICREA), Passeig Lluís Companys 23, 08010 Barcelona, Spain

Corresponding author: [email protected]



SVM comparison

To further understand the importance of secondary structure for the classification of 3’ss, we built the SVM classifier without using the accessibility as a feature (SVM1), without using the effective distance as a cut-off (SVM2), or without using either of them (SVM3), and compared them with the classifier using all features (see Supplementary Table 2). For all comparisons, we observe an improvement in the performance of the SVM classifier when secondary structure information is included (Supplementary Table 2). Furthermore, we analyzed the predictive ability of each feature by building SVM classifiers that use a single feature. The performance of these classifiers is shown as ROC curves in Supplementary Figure 1B. The performance of the single-feature SVMs agrees with the ranking of the features by information gain. The accessibility has an overall performance of 0.66, as measured by the area under the ROC curve (AUC). Although this is the smallest AUC value, accessibility can by itself explain a number of real 3’ss.
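The AUC values quoted above can be read as the probability that a randomly chosen real 3’ss receives a higher score than a randomly chosen decoy. The following is a minimal sketch of that rank-based definition, not the procedure used in the study; the function name and the toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 1 = real 3'ss, 0 = decoy.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
print(round(toy_auc, 4))
```

An AUC of 0.5 corresponds to random ranking, so a single-feature classifier with an AUC of 0.66 still ranks real sites above decoys more often than chance.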

Function of the predicted alternative 3’ss

We checked whether the usage of the predicted alternative 3’ss would introduce premature termination codons (PTCs) that could trigger the degradation of the resulting transcripts by nonsense-mediated decay (NMD). In fact, 32 of the 35 candidate alternative 3’ss in coding regions introduce a PTC (Table 1). In these cases, the 3’UTR is extended on average by 578 nucleotides (nt), which is much longer than the average 3’UTR length of 144 nt in yeast genes (Graber et al. 1999). Hence, the PTCs will probably trigger NMD. This hypothesis is also supported by the fact that we could not find homologs of the mRNAs or translated products resulting from the usage of the candidate alternative 3’ss (see below). In the 3 cases in which no PTC is introduced (Tables 1 and 4), we tested whether the resulting proteins could be functional. We looked for all possible structural homologs in UniProt (UniProt Consortium 2011) and kept all matches with a FPR < 0.01 (see below and Supplementary Figure 7). In all three cases the alternative 3’ss introduces a deletion in the resulting protein that does not appear in the homologous proteins and that corresponds to a region conserved in other yeast species (Supplementary Figures 8, 9 and 10), suggesting an alteration or loss of function in the proteins containing the deletion.
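The PTC check described above amounts to translating the spliced coding sequence in frame and asking whether a stop codon appears before the annotated one. A minimal sketch follows, assuming the input runs from the start codon through the annotated stop codon; the function names and the short sequences are invented for illustration and are not taken from the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_index(cds):
    """Return the 0-based position of the first in-frame stop codon, or None."""
    cds = cds.upper().replace("U", "T")
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None


def has_ptc(spliced_cds):
    """True if an in-frame stop occurs before the final codon position."""
    stop = first_stop_index(spliced_cds)
    return stop is not None and stop < len(spliced_cds) - 3


# Made-up exons: exon 1 ends at the 5'ss, exon 2 starts at the annotated 3'ss.
exon1 = "ATGGCT"
exon2 = "CTGACCGGATAA"

annotated = exon1 + exon2          # ATG GCT CTG ACC GGA TAA
alt_frameshift = exon1 + exon2[1:]  # alternative 3'ss 1 nt downstream
alt_in_frame = exon1 + exon2[3:]    # alternative 3'ss 3 nt downstream

print(has_ptc(annotated), has_ptc(alt_frameshift), has_ptc(alt_in_frame))
```

In this toy case the 1-nt shift changes the reading frame and exposes an early TGA, whereas the 3-nt shift only deletes one codon, which mirrors the distinction between the 32 PTC-introducing candidates and the 3 in-frame candidates.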

Blast search of alternative splicing products

We obtained the mRNA sequences and, when applicable, the corresponding protein sequences resulting from the usage of the predicted alternative 3’ss. For the predicted proteins we looked for homologous sequences in the non-redundant protein database (nr) (Pruitt et al. 2009) using blastp (Altschul et al. 1997), requiring that the variable region be part of the alignment. In the cases in which we predicted alternative 3’ss in non-coding genes (snRN17A and snRN17B), we looked for homologous sequences in the nr database using blastn (Altschul et al. 1997).

Search of homologous proteins

We wanted to check whether the protein products resulting from the usage of the predicted alternative 3’ss that did not introduce a PTC could be functional, i.e. chrIV:1103808-1103890:+:YDR318W_31, chrIII:107034-107110:+:YCL005W-A_54, and chrXII:564457-564515:-:YLR211C_40. To do so, we looked for homologs of the translated products of the alternative 3’ss events that were inside coding genes and did not introduce a PTC. First, we identified putative homologous proteins using blastp against the UniProt database (UniProt Consortium 2011), and kept all matches obtained. We then selected the possible homologous proteins based on the percentage of identity and the length of the alignment, using the curve calculated for protein pairs with known structure (Rost 1999). The threshold applied is described by the formula:

\[ p^{I}(L) = n + 480 \cdot L^{-0.32 \left(1 + e^{-L/1000}\right)} \]

where $L$ is the number of residues aligned between two proteins; $p^{I}$ is the cut-off percentage of identical residues over the $L$ aligned residues; and $n$ describes the distance in percentage points from the curve. For 99% true positives, $n = 5$ (Rost 1999). Applying this method, we are able to retrieve 31 homologous proteins for YCL005W-A (Supplementary Figure 7A), 13 for YDR318W (Supplementary Figure 7B), and 16 for YLR211C (Supplementary Figure 7C). For each set of homologous proteins, we performed multiple sequence alignments using T-Coffee with default parameters (Notredame et al. 2000). In all three cases we find that the deletion produced by the usage of the alternative 3’ss does not occur in the homologous proteins and in fact corresponds to a region highly conserved in the other species, suggesting that the deleted part is important for protein function. The conservation of the predicted proteins with the closest homologs is shown in Supplementary Figures 8, 9 and 10.
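The length-dependent identity threshold can be sketched in a few lines. This example assumes the curve takes the form published by Rost (1999), p^I(L) = n + 480 · L^(-0.32 · (1 + e^(-L/1000))), with n = 5; the function names and the example hits are hypothetical, and the saturation that Rost applies to very short and very long alignments is deliberately left out.

```python
import math


def rost_identity_cutoff(length, n=5.0):
    """Cut-off percentage identity for an alignment of `length` residues.

    Assumed form: n + 480 * L ** (-0.32 * (1 + exp(-L / 1000))).
    """
    if length <= 0:
        raise ValueError("alignment length must be positive")
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return n + 480.0 * length ** exponent


def is_structural_homolog(percent_identity, length, n=5.0):
    """Keep a hit only if its identity lies on or above the curve."""
    return percent_identity >= rost_identity_cutoff(length, n)


# Hypothetical blastp hits: (percent identity, alignment length in residues).
hits = [(40.0, 100), (30.0, 100), (35.0, 300)]
kept = [h for h in hits if is_structural_homolog(*h)]
print(kept)
```

Because the cut-off falls as the alignment gets longer, a 35% identity match over 300 residues passes while a 30% match over 100 residues does not.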

Additionally, we performed protein secondary structure predictions with PSIPRED (McGuffin et al. 2000) for the 3 predicted protein candidates. In all three cases the deletion falls inside a predicted alpha-helix, which suggests that the deletion could affect protein function. Taken together, these pieces of evidence do not allow us to claim that the protein products resulting from the usage of the predicted candidate 3’ss will be functional.

The protein alignments shown in Supplementary Figures 8, 9 and 10 were edited with Jalview (Waterhouse et al. 2009).

REFERENCES

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.

Graber, J.H., Cantor, C.R., Mohr, S.C. and Smith, T.F. 1999. Genomic detection of new yeast pre-mRNA 3'-end-processing signals. Nucleic Acids Res. 27: 888-894.

McGuffin, L.J., Bryson, K. and Jones, D.T. 2000. The PSIPRED protein structure prediction server. Bioinformatics 16: 404-405.

Notredame, C., Higgins, D.G. and Heringa, J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302: 205-217.

Pruitt, K.D., Tatusova, T., Klimke, W. and Maglott, D.R. 2009. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 37: D32-6.

Rost, B. 1999. Twilight zone of protein sequence alignments. Protein Eng. 12: 85-94.

UniProt Consortium. 2011. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 39: D214-9.

Waterhouse, A.M., Procter, J.B., Martin, D.M., Clamp, M. and Barton, G.J. 2009. Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189-1191.

FIGURE LEGENDS

Supplementary Figure 1. (A) Information Gain of the features for the classification at 22ºC and 37ºC. For each feature considered for the classification and for each condition (22ºC and 37ºC), we show the box plots of the distributions of the Information Gain values (see Methods) of the corresponding 10,000 SVMs. (B) Receiver Operating Characteristic (ROC) curves of single-feature SVMs. For each threshold of the score, the TPR and FPR values are represented in the x and y axes, respectively. Lines with different colors represent the performance of each of the classifiers. The SVM built using all the features is added for comparison (all, light blue). The areas under the ROC curves are: all, AUC = 0.9809; BS-3’ss distance, AUC = 0.9519; Py content, AUC = 0.7902; PPT distance, AUC = 0.7715; 3’ss score, AUC = 0.7549; accessibility, AUC = 0.6603.

Supplementary Figure 2. Validation of 3’ss using RNA-Seq. The upper panel shows the percentage of 3’ss (real, alternative and cryptic) that can be validated with RNA-Seq reads that also validate the annotated 5’ss of the intron (left) or any 5’ss (right) at normal and heat-shock conditions. Alternative 3’ss are cryptic 3’ss predicted as positives using score2. The lower panel shows the boxplots of the distributions of the number of reads that validate each 3’ss (log10 scale).

Supplementary Figure 3. Experimental validation of the alternative 3’ss in the RPL26B gene. (A) Predicted optimal secondary structure between the BS and the alternative 3’ss predicted in the RPL26B gene, discarding the first 8 nucleotides after the BS A. The 3’ss and the alternative 3’ss (alt. 3’ss) tested are boxed in the picture. The color of the nucleotides represents the pair probability of the bases in the secondary structure. For nucleotides outside the secondary structure, the color represents the accessibility in the same scale. (B) RT-PCR validation of the splicing pattern in the RPL26B gene in different yeast strains. Lanes 1-8 show the result of the analyses using primers hybridizing on the upstream and downstream exon. Using these primers, only the usage of the annotated 3’ss can be detected (lanes 1-4). Lanes 5-8 show the corresponding negative controls without AMV reverse transcriptase. When using a specific primer designed to hybridize on the alternative splicing junction (lanes 10-13), the alternative splicing isoform can be detected in the same strains. Lanes 9 and 14 contain the markers, with the corresponding lengths indicated on the right.

Supplementary Figure 4. Experimental validation of the alternative 3’ss in the RPS22B gene. (A) Predicted optimal secondary structure between the BS and the annotated 3’ss in the RPS22B gene. The 3’ss and the alternative 3’ss (alt. 3’ss) tested are boxed in the picture. The color of the nucleotides represents the pair probability of the bases in the secondary structure. For nucleotides outside the secondary structure, the color represents the accessibility in the same scale. (B) RT-PCR validation of the alternative 3’ss in the RPS22B gene using primers specifically designed for the exon junction resulting from the usage of the alternative 3’ss. Lanes 1-4 show analyses using RNA from strains UPF1∆ (lane 1), W303 (lane 2) and SK1 at time zero of meiosis (lane 3) and after 5 hours (lane 4) (Materials and Methods). Lanes 5-8 show the corresponding negative controls without AMV reverse transcriptase. Lane 9 contains the markers, with the corresponding lengths indicated on the right. Bands corresponding to the alternative 3’ss are indicated. The alternative isoform can be detected in all the yeast strains used (lanes 1-4).

Supplementary Figure 5. Evaluation of the SVM classifier at 37ºC. (A) Receiver Operating Characteristic (ROC) curve of the SVM classifier using score1. For each threshold of the score, the TPR and FPR values are represented in the x and y axes, respectively. The distribution of values for positive cases (dark grey) and negative cases (light grey), together with the color scale for the different thresholds used, can be seen at the bottom of the figure. (B) Precision-recall curves of the SVM classifier using score1. In the precision-recall curve the TPR (recall) is represented on the x-axis, whereas the y-axis shows the precision, i.e. the number of true positive cases for a given threshold over the total number of cases predicted as positive. (C) ROC curve and (D) precision-recall curve for the SVM classifier using score2. (E) and (F) Cumulative distribution of predicted 3’ss that are validated by RNA-Seq reads obtained at 37ºC, for annotated (E) and cryptic (F) 3’ss. The left y-axis represents the percentage of HAGs that have a score2 higher than or equal to that given on the x-axis (grey bars). The right y-axis represents the percentage of cases with a score2 higher than or equal to that given on the x-axis that can be validated using RNA-Seq reads that also validate the annotated 5’ss (black line) or any 5’ss (grey line).

Supplementary Figure 6. Experimental validation of the alternative 3’ss in RPL23B (A) Predicted optimal secondary structure between the BS and the alternative 3’ss predicted at 22ºC (upper structure) and at 37ºC (lower structure) for the RPL23B gene, discarding the first 8nt after the BS A. The annotated 3’ss and the alternative 3’ss (alt 3’ss) tested are boxed in the picture. The color of the nucleotides represents the pair probability of the bases in the secondary structure. For nucleotides outside the secondary structure, the color represents the accessibility of the nucleotide (1 - pair probability) in the same scale. (B) RT-PCR validation of the annotated (ann. 3’ss) and the alternative 3’ss (alt. 3’ss) in two different yeast strains and at different temperature conditions using primers designed for the exon junction.

Supplementary Figure 7. Identity vs. alignment length curves for detecting homologs. Representation of percentage identity (y-axis) versus alignment length (x-axis) of the potential homologous proteins identified by database searches for YCL005W-A (A), YDR318W (B), and YLR211C (C). The red line shows the curve defined by Burkhard Rost (Rost 1999) to define structural homologs with a true positive rate of 99%. Grey dots represent discarded proteins (false structural homologs) whereas black circles represent the selected proteins (true structural homologs).

Supplementary Figure 8. Homologs of the YCL005W-A variant. (A) Extract of the alignment of the YCL005W-A homologous proteins identified, including the predicted alternative protein. The alignment has been edited with Jalview (Waterhouse et al. 2009). The sequences are colored according to the Neighbor-Joining (NJ) tree based on the percentage identity of the proteins (B). For clarity, proteins are divided into two groups according to the NJ tree (grey and black groups), and the percentage identity used for coloring takes into account the identity within a given group. The orange box shows the deletion introduced by the predicted alternative 3’ss.

Supplementary Figure 9. Homologs of the YDR318W variant. (A) Extract of the alignment of the YDR318W homologous proteins identified, including the predicted alternative protein. The alignment has been edited with Jalview (Waterhouse et al. 2009). The sequences are colored according to the Neighbor-Joining (NJ) tree based on the percentage identity of the proteins (B). The orange box shows the deletion introduced by the predicted alternative 3’ss.

Supplementary Figure 10. Homologs of the YLR211C variant. (A) Extract of the alignment of the YLR211C homologous proteins identified, including the predicted alternative protein. The alignment has been edited with Jalview (Waterhouse et al. 2009). The sequences are colored according to the Neighbor-Joining (NJ) tree based on the percentage identity of the proteins (B). For clarity, the homologs that were not aligned in the region of interest were deleted from the alignment and from the tree. The orange box shows the deletion introduced by the predicted alternative 3’ss.

TABLES

Supplementary Table 2. Comparison of SVM classifiers

SVM classifier   AUC      Threshold   TPR      FPR      Precision   Accuracy
ALL              0.9809   1           0.9255   0.0265   0.4531      0.9724
SVM1             0.9776   1           0.9326   0.0296   0.4276      0.9695
SVM2             0.9804   1           0.9362   0.0281   0.4415      0.9711
SVM3             0.9765   1           0.9326   0.0286   0.4354      0.9704

Comparison of performances for the SVM classifiers using all features (ALL), without using the accessibility as feature (SVM1), without using the effective distance as a cut-off (SVM2), or without using any of them (SVM3).
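The columns of Supplementary Table 2 follow the usual confusion-matrix definitions. The sketch below computes them from counts; the counts are hypothetical and chosen only to show how a low FPR can coexist with a modest precision when negatives greatly outnumber positives. They are not the counts behind the table.

```python
def classifier_metrics(tp, fp, tn, fn):
    """TPR, FPR, precision and accuracy from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),                      # recall over real 3'ss
        "FPR": fp / (fp + tn),                      # false alarms over decoys
        "precision": tp / (tp + fp),                # correct among predicted positives
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts: 100 real sites versus 4000 decoys.
m = classifier_metrics(tp=90, fp=120, tn=3880, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these invented counts the FPR is only 0.03, yet fewer than half of the predicted positives are real sites, because 3% of a large negative set still outnumbers the true positives.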

