```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
IsoformSwitchAnalyzeR
Enabling Identification and Analysis of Isoform Switches with Functional Consequences and the Associated Alternative Splicing
Kristoffer Vitting-Seerup
`r Sys.Date()`
## Abstract
Recent breakthroughs in bioinformatics now allow us to accurately reconstruct and quantify full-length gene isoforms from RNA-sequencing data via tools such as Cufflinks, StringTie(2), Kallisto and Salmon. Alternatively long-read RNASeq (third generation sequencing) now rutinly provide us with full length transcripts. These full lenght transcripts makes it possible to analyze isoform usage, but unfortunately this is rarely done meaning RNA-seq data is typically not used to its full potential.
To solve this problem, we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to use-R package that enables statistical identification of isoform switching from RNA-seq derived quantification of novel and/or annotated full-length isoforms. IsoformSwitchAnalyzeR facilitates integration of many sources of (predicted) annotation such as Open Reading Frame (ORF/CDS), protein domains (via Pfam), signal peptides (via SignalP), Intrinsically Disordered Regions (IDR, via NetSurfP-2 or IUPred2A), coding potential (via CPAT or CPC2) and sensitivity to Non-sense Mediated Decay (NMD) and more. The combination of identified isoform switches and their annotation enables IsoformSwitchAnalyzeR to predict potential functional consequences of the identified isoform switches --- such as loss of protein domains --- thereby identifying isoform switches of particular interest. Lastly, IsoformSwitchAnalyzeR provides article-ready visualization methods for isoform switches for individual genes as well as both summary statistics and visualization of the genome-wide changes/consequences of isoform switches, their consequences and the associated alternative splicing.
In summary, IsoformSwitchAnalyzeR enables analysis of RNA-seq data with isoform resolution with a focus on isoform switching (with predicted consequences) and its associated alternative splicing, thereby expanding the usability of RNA-seq data.
## Table of Content
[Preliminaries]
- [Background and Package Description]
- [Installation]
- [How To Get Help]
[What To Cite] (please remember)
[Quick Start]
- [Overview of Isoform Switch Workflow]
- [Example Workflow] (a.k.a. the "Too long; didn't read" section)
- [Examples Visualizations]
[Detailed Workflow]
- [Overview of Detailed Workflow]
+ [IsoformSwitchAnalyzeR Background Information]
- [Importing Data Into R]
+ [Importing Data from Kallisto, Salmon, RSEM or StringTie]
+ [Importing Data from Cufflinks/Cuffdiff]
+ [Importing long-read Data from the TALON Pipeline]
+ [Importing Data from other Full-length Transcript Assemblers]
- [Filtering]
- [Identifying Isoform Switches]
+ [Testing for Isoform Switches via DEXSeq] (recommended)
+ [Testing for Isoform Switches via DRIMSeq]
+ [Testing for Isoform Switches with other Tools]
- [Analyzing Open Reading Frames]
- [Extracting Nucleotide and Amino Acid Sequences]
- [Advice for Running External Sequence Analysis Tools and Downloading Results]
- [Importing External Sequence Analysis]
- [Predicting Alternative Splicing]
- [Predicting Switch Consequences]
- [Post Analysis of Isoform Switches with Consequences]
+ [Analysis of Individual Isoform Switching]
+ [Genome-Wide Analysis of Isoform Switching]
[Analyzing Alternative Splicing]
- [Overview of Alternative Splicing Workflow]
- [Genome-Wide Analysis of Alternative Splicing]
[Other workflows]
- [Augmenting ORF Predictions with Pfam Results]
- [Analyze Small Upstream ORFs]
- [Remove Sequences Stored in SwitchAnalyzeRlist]
- [Adding Uncertain Category to Coding Potential Predictions]
- [Quality control of ORF of known annotation]
- [Analyzing the Biological Mechanisms Behind Isoform Switching]
- [Analyzing experiments without replicates]
[Frequently Asked Questions, Problems and Errors]
- [What Quantification Tool(s) Should I Use?]
- [What Transcript Database Should I Use?]
- [How to handle confounding effects (including batches)]
- [What constitute an independent biological replicate?]
- And more...
[Final Remarks]
[Sessioninfo]
## Preliminaries
### Background and Package Description
The usage of Alternative Transcription Start sites (aTSS), Alternative Splicing (AS) and alternative Transcription Termination Sites (aTTS) are collectively collectively results in the production of different isoforms. Alternative isoforms are widely used as recently demonstrated by The ENCODE Consortium, which found that on average, 6.3 different transcripts are generated per gene; a number which may vary considerably per gene.
The importance of analyzing isoforms instead of genes has been highlighted by many examples showing functionally important changes. One of these examples is the pyruvate kinase. In normal adult homeostasis, cells use the adult isoform (M1), which supports oxidative phosphorylation. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis, one of the hallmarks of cancer. Such shifts in isoform usage are termed 'isoform switching' and cannot be detected at when only analyzing data on gene level. For an introduction to isoform switches please refere to [Vitting-Seerup et al 2017](http://mcr.aacrjournals.org/content/15/9/1206.long).
On a more systematic level several recent studies suggest that isoform switches are quite common since they often identify hundreds of switch events in both cancer and non-cancer studies.
Tools such as Cufflinks, Salmon and Kallisto allows for (reconstruction and) quantification of full-length transcripts from RNA-seq data. Such data has the potential to facilitate genome-wide analysis of alternative isoform usage and identification of isoform switching --- but unfortunately these types of analyses are still only rarely done; most analyses are on gene level only.
We hypothesize that there are multiple reasons why RNA-seq data is not used to its full potential:
1) There is still a lack of tools that can identify isoform switches with isoform resolution.
2) Although there are many excellent tools to perform sequence analysis, there is no common framework which allows for integration of the analysis provided by these tools.
3) There is a lack of tools facilitating easy and article-ready visualization of isoform switches.
To solve all these problems, we developed IsoformSwitchAnalyzeR.
IsoformSwitchAnalyzeR is an easy-to-use R package that enables the user to import the result of the (reconstructed) full-length isoforms quantification an RNA-seq experiment into R. If annotated transcripts are analyzed, IsoformSwitchAnalyzeR offers integration with the multi-layer information stored in a GTF/GFF file including the annotated coding sequences (CDS). If transcript structures are predicted (either de-novo or guided) IsoformSwitchAnalyzeR offers an accurate tool for identifying the dominant ORF of the isoforms. The knowledge of isoform positions for the CDS/ORF allows for prediction of sensitivity to Nonsense Mediated Decay (NMD) --- the mRNA quality control machinery that degrades isoforms with pre-mature termination codons (PTC).
IsoformSwitchAnalyzeR facilitates identification of isoform switches via newly developed statistical methods that tests each individual isoform for differential usage and thereby identifies the exact isoforms involved in an isoform switch.
Since we know the exon structure of the full-length isoform, IsoformSwitchAnalyzeR can extract the underlying nucleotide sequence from a reference genome. This enables integration with the tools for assessing coding potential of an isoform (both the CPAT and CPC2 tools are supported) which can be used to increase accuracy of ORF predictions. By combining the CDS/ORF isoform positions with the nucleotide sequence, we can extract the most likely amine acid sequence of the CDS/ORF. The amine acid sequence enables integration of analysis of protein domains (via Pfam) and signal peptides (via SignalP) --- both supported by IsoformSwitchAnalyzeR. Additionally, since the structures of all expressed isoforms from a given gene are known, one can also annotate alternative splicing - including aTSS and aTTS sites - a functionality also implemented in IsoformSwitchAnalyzeR.
In summary, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retention, ORF, NMD sensitivity, coding potential, protein domains and signal peptides (and many more), resulting in the ability to predict important functional consequences of isoform switches in both individual genes and on a genome wide level.
IsoformSwitchAnalyzeR contains tools that allow the user to create article-ready visualizations of:
1) Individual isoform switches
2) Genome-wide analysis of isoform switches and their predicted consequences
3) Genome-wide analysis of alternative splicing and changes thereof.
These visualizations are easy to understand and integrate all information gathered throughout the workflow. Example of visualizations can be found in the [Examples Visualizations] section.
Lastly IsoformSwitchAnalyzeR, although originally developed for model organisms now also directly support analysis of all organisms.
Back to [Table of Content].
### Installation
IsoformSwitchAnalyzeR is part of the Bioconductor repository and community which means it is distributed with, and dependent on, Bioconductor. Installation of IsoformSwitchAnalyzeR is easy and can be done from within the R terminal. If it is the first time you use Bioconductor (or don't know if you have used it), simply copy-paste the following into an R session to install the basic Bioconductor packages (will only done if you don't already have them):
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
BiocManager::install()
}
If you already have installed Bioconductor, running these two commands will check whether updates for installed packages are available.
After you have installed the basic Bioconductor packages you can install IsoformSwitchAnalyzeR by copy-pasting the following into an R session:
BiocManager::install("IsoformSwitchAnalyzeR")
This will install the IsoformSwitchAnalyzeR package as well as other R packages that are needed for IsoformSwitchAnalyzeR to work.
If you need to install the developmental version of IsoformSwitchAnalyzeR this can be done from the GitHub. Please note that this is for advanced uses and should not be done unless you have good reason to. Installation from the GitHub can be done by copy/pasting the following into R:
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
devtools::install_github("kvittingseerup/IsoformSwitchAnalyzeR", build_vignettes = TRUE)
### How To Get Help
This R package comes with _plenty_ of documentation. Much information can be found in the R help files (which can easily be accessed by running the following command in R "?functionName", for example "?importRdata") - here a lot of information can be found in the individual argument description as well as in the "details" section of the individual function documentation. This vignette also contains a lot of information and will be continuously updated, so make sure to read both sources carefully as it contains the answers to the most questions (including [Frequently Asked Questions, Problems and Errors]).
If you want to report a bug/error (found in the newest version of the R package!) please make an issue with a reproducible example at [GitHub](https://github.com/kvittingseerup/IsoformSwitchAnalyzeR/issues) and ember to post i) the code that gave the error, ii) the error message as well as iii) the output of running sessionInfo().
If you have unanswered questions or comments regarding IsoformSwitchAnalyzeR or how to run it, please post them on the associated [google group](https://groups.google.com/forum/#!forum/isoformswitchanalyzer) (after make sure the question was not already answered). If code or error messages are included please post i) the code that gave the error, ii) the error message as well as iii) the output of running sessionInfo().
If you have suggestions for improvements, please put them on [GitHub](https://github.com/kvittingseerup/IsoformSwitchAnalyzeR). This will allow other people to up vote your idea, thereby showing us there is wide support of implementing your idea.
Back to [Table of Content].
## What To Cite
The analysis performed by IsoformSwitchAnalyzeR is only possible due to a string of other tools and scientific discoveries --- please read this section thoroughly and cite the appropriate articles to give credit where credit is due (also citations are what allow people to continue both maintaining and developing bioinformatic tools).
If you are using the:
- **Import of data from Salmon/Kallisto/RSEM/StringTie**: Please cite reference _10_.
- **Inter-library normalization of abundance values**: Please cite reference _10_ and _11_.
- **Isoform switch test implemented utilizing DEXSeq via IsoformSwitchAnalyzeR (Default)** : Please cite reference _1_, _12_ and _13_.
- **Isoform switch test implemented in the DRIMSeq package**: Please cite reference _1_ and _3_.
- **Prediction of open reading frames (ORF) analysis**: Please cite reference _1_ and _4_.
- **Prediction of pre-mature termination codons (PTC) and thereby NMD-sensitivity**: Please cite reference _1_, _4_, _5_ and _6_.
- **CPAT**: Please cite reference _7_.
- **CPC2**: Please cite reference _14_.
- **Pfam**: Please cite reference _8_.
- **SignalP**: Please cite reference _9_.
- **NetSurf2-P**: Please cite reference _15_.
- **IUPred2A**: Please cite reference _16_.
- **Prediction of consequences**: Please cite reference _1_.
- **Visualizations** (plots) implemented in the IsoformSwitchAnalyzeR package: Please cite reference _1_.
- **Alternative splicing analysis**: Please cite both reference _1_ and _4_.
- **Genome-wide enrichment analysis**: Please cite both reference _1_ and _2_.
**Refrences**:
1. _Vitting-Seerup et al. **The Landscape of Isoform Switches in Human Cancers.** Cancer Res. (2017)_ [link](http://mcr.aacrjournals.org/content/15/9/1206.long).
2. _Vitting-Seerup et al. **IsoformSwitchAnalyzeR: Analysis of changes in genome-wide patterns of alternative splicing and its functional consequences.** Bioinformatics (2019)_ [link](http://dx.doi.org/10.1093/bioinformatics/btz247).
3. _Nowicka et al. **DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics.** F1000Research, 5(0), 1356._ [link](https://f1000research.com/articles/5-1356/v2).
4. _Vitting-Seerup et al. **spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data**. BMC Bioinformatics 2014, 15:81._ [link](http://www.biomedcentral.com/1471-2105/15/81).
5. _Weischenfeldt et al. **Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns**. Genome Biol 2012, 13:R35_ [link](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-5-r35).
6. _Huber et al. **Orchestrating high-throughput genomic analysis with Bioconductor**. Nat. Methods, 2015, 12:115-121._ [link](https://www.nature.com/articles/nmeth.3252).
7. _Wang et al. **CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model**. Nucleic Acids Res. 2013, 41:e74._ [link](https://academic.oup.com/nar/article/41/6/e74/2902455).
8. _Finn et al. **The Pfam protein families database**. Nucleic Acids Research (2012)_ [link](https://academic.oup.com/nar/article/40/D1/D290/2903007).
9. _Almagro et al. SignalP 5.0 improves signal peptide predictions using deep neural networks.**. Nat. Biotechnol (2019)_ [link](https://www.nature.com/articles/s41587-019-0036-z)
10. _Soneson et al. **Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.** F1000Research 4, 1521 (2015)._ [link](https://f1000research.com/articles/4-1521/v2).
11. _Robinson et al. **A scaling normalization method for differential expression analysis of RNA-seq data**. Genome Biology (2010)_ [link](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25).
12. _Ritchie et al. **limma powers differential expression analyses for RNA-sequencing and microarray studies**. Nucleic Acids Research (2015)_ [link](https://academic.oup.com/nar/article/43/7/e47/2414268).
13. _Anders et al. **Detecting differential usage of exons from RNA-seq data**. Genome Research (2012)_ [link](https://genome.cshlp.org/content/22/10/2008.long).
14. _Kang et al. **CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features**. Nucleic Acids Res (2017)_ [link](https://academic.oup.com/nar/article/45/W1/W12/3831091).
15. _Klausen et al. **NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning**. BioRxiv (2018)_ [link](https://www.biorxiv.org/content/10.1101/311209v3)
16. _Meszaros et al. IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res (2018)_ [link](https://doi.org/10.1093/nar/gky384)
## Quick Start
### Overview of Isoform Switch Workflow
In this section, we will outline the IsoformSwitchAnalyzeR workflow - in the following sections we will show you how to do it including example data and code which can be copy/pasted to do your own analysis.
The idea behind IsoformSwitchAnalyzeR is to make it easy to do advanced post-analysis of full-length RNA-seq derived isoform/transcript level quantification of RNASeq data with a focus on finding, annotating and visualizing isoform switches with functional consequences as well as its associated alternative splicing. If you want to know more about considerations and recommendations for bioinformatic tools which can perform the initial isoform quantification, please refer to [What Quantification Tool(s) Should I Use?] and note that IsoformSwitchAnalyzeR both supports analysis of short- and long-read sequencing data.
From the isoform quantification IsoformSwitchAnalyzeR performs four high-level tasks:
- Statistical identification of isoform switches.
- Integration of a wide range of (predicted) annotations for the isoforms involved in the identified switches.
- Identification of which isoforms have a predicited functional consequnce (e.g. loss/gain of protein domain).
- Visualization of predicted consequences of the isoform switches, for individual genes as well as for analyzing the genome wide patterns.
Please note that just like any other statistical tool IsoformSwitchAnalyzeR requires independent biological replicates (see [What constitute an independent biological replicate?]) and that [recent benchmarks](https://f1000research.com/articles/7-952/v3) highlights that at least three (if not more) independent replicates are needed for good statistical performance - most tools have a hard time controlling the False Discovery Rate (FDR) with only two replicates so extra caution is needed for interpreting results based on three or fewer replicates.
The first step of a IsoformSwitchAnalyzeR workflow is to import and integrate the isoform quantification with its basic annotation. Once you have all the relevant data imported into R (IsoformSwitchAnalyzeR will also help you with that), the workflow for identification and analysis of isoform switches with functional consequences can be divided into two parts (also illustrated below in Figure 1).
**Part 1) Extract Isoform Switches and Their Sequences.** This part includes filtering out lowly expressed genes/isoforms, statistical analysis to identify isoform switches, annotating those switches with open reading frames (ORF) and writing the nucleotide and amino acid (peptide) sequences to fasta files. All of these analysis are performed by the high-level function:
isoformSwitchAnalysisPart1()
See below for example of usage, and [Detailed Workflow] for details on the individual analysis step (and what function actually perform them).
The resulting fasta files enables the usage of external sequence analysis tools. Currently the external sequence tools directly supported by IsoformSwitchAnalyzeR are:
- Pfam: Prediction of protein domains, which can be run either locally or via their [webserver](https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan).
- Tools for Coding Potential Analysis:
+ CPC2: The Coding Calculator 2, which can be run either locally or via their [webserver](http://cpc2.cbi.pku.edu.cn/batch.php).
+ CPAT: The Coding-Potential Assessment Tool, which can be run either locally or via their [webserver](http://lilab.research.bcm.edu/cpat/).
- SignalP: Prediction of Signal Peptides, which can be run either locally or via their [webserver](http://www.cbs.dtu.dk/services/SignalP/) (V5 is supported).
- Tools Prediction of Intrinsically Disordered Regions (IDR):
+ IUPred2A: Which both predict IDRs and Intrinsically Disordered Binding Regions (IDBR) It can be run either locally or via their [webserver](https://iupred2a.elte.hu)
+ NetSurfP-2: Which as parts of its predicitons also predict IDR. It can be run either locally or via their [webserver](http://www.cbs.dtu.dk/services/NetSurfP-2.0/).
**Part 2) Plot All Isoform Switches and Their annotation.** This part involves importing and incorporating the results of all the external sequence analysis, analyzing alternative splicing, predicting functional consequences of isoform switches and plotting i) individual genes with isoform switches and ii) genome wide patterns in the consequences of isoform switching.
All of this can be done using the function:
isoformSwitchAnalysisPart2()
See below for example of usage, and [Detailed Workflow] for details on the individual analysis step (and what function actually perform them).
**Alternatively**, if one does not plan to incorporate external sequence analysis (or is only interested in splicing), it is possible to run the full workflow using:
isoformSwitchAnalysisCombined()
This corresponds to running _isoformSwitchAnalysisPart1()_ and _isoformSwitchAnalysisPart2()_ without adding the external results.