---
title: "RNASeqR : RNA-Seq analysis based on one independent variable"
author: "Author: Kuan-Hao Chao (ntueeb05howard@gmail.com)"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
BiocStyle::html_document:
highlight: pygments
toc: true
toc_depth: 3
fig_caption: yes
#bibliography: bibliography.bib
fontsize: 14pt
vignette: >
%\VignetteIndexEntry{RNA-Seq analysis based on one independent variable}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteDepends{}
%\VignetteEncoding{UTF-8}
%\usepackage[utf8]{inputenc}
%\VignettePackage{RNASeqR}
---
```{r style, echo=FALSE, results="asis", message=FALSE}
knitr::opts_chunk$set(tidy = FALSE,
warning = FALSE,
message = FALSE)
```
# Introduction
## RNA-Seq Overview
RNA-Seq is a revolutionary approach for discovering and investigating the transcriptome using next-generation sequencing technologies (Wang et al. 2009).
Typically, this transcriptome analysis aims to identify genes differentially expressed across different conditions, treatments or tissues, resulting in an understanding of the important pathways that are associated with the conditions (Wang et al. 2009). Following are the overview steps of RNA-Seq technique.
1. **Set up experiment**:
* First, scientists make their hypotheses and choose their target samples for
RNA extraction.
2. **RNA isolation & purify**:
* RNA is extracted and isolated from sample tissue. Then quality and total amount
of RNA are checked including the amount of RNA degradation in order to assess
the samples.
3. **RNA filtration**:
* Ribosomal RNA (rRNA) which accounts for around 90% of transcriptome from total
RNA-Seq sample needs to be removed. The interest areas of RNA-Seq analysis
are mRNAs, small RNAs as well as lincRNAs.
4. **Reverse-transcribed to cDNA**:
* cDNA is more stable than RNA, thus would be a better polymer to do PCR
amplification and storage.
5. **Sequencing**:
* Samples are sequenced by next generation sequencing machine. Raw fastq reads
files are generated.
6. **Reads alignment & transcript assembly**:
* Sequenced reads are algined to reference genome (traditional mapping) and then
assembled into potential transcripts by assembler. A different approach
is to directly compare reads with target transcripts (pseudo alignment
method) to quantify estimated expression quickly.
7. **Differential expression gene analysis**:
* Statistical methods are used in this step in order to get potential differential
expressed genes.
8. **Functional Analysis**:
* Differential expressed genes are selected as inputs of GO functional analysis as
well as KEGG pathway analysis in order to get further biology information.
## RNASeqR Overview
RNASeqR is an user-friendly R-based tool for running a six-step automation RNA-Seq analysis pipeline including quality assessment, read alignment and transcript quantification, differential expression analysis, and functional analysis. The main features of this package are an automated workflow and comprehensive reports with data visualization. In this package, the new tuxedo pipeline published in Nature Protocols in 2016 can be fully implemented in the R environment, including extra functions such as reads quality assessment and functional analysis. RNASeqR is beneficial for clinical researchers without significant computational background. There are only six lines of code that users need to execute in order to finish an end-to-end RNA-Seq analysis.
The central concept behind this package is that **each step involved in RNA-Seq data analysis is a function call in R**. For the subsequent parts of this documentation, inputs, outputs as well as detail implementation for these six steps will be elaborated upon in order. Following are the six steps and the each corresponding function that users need to execute.
1. **`RNASeqRParam` S4 Object Creation**
* **Function: `RNASeqRParam()`**
* In the beginning, users will create an `RNASeqRParam` S4 object by running the `RNASeqRParam()` constructor function for all variables being checked. This object will be used as input for the following steps.
2. **Environment Setup**
* **Function: `RNASeqEnvironmenenvironmenttSet_CMD()` or `RNASeqEnvironmentSet()`**
* In the second step, the environment is set up automatically with necessary tools installed for the subsequent steps in RNASeqR.
3. **Quality Assessment (optional)**
* **Function: `RNASeqQualityAssessment_CMD()` or `RNASeqQualityAssessment()`**
* In the third step (optional), users check the quality of raw fastq files.
4. **Reads alignment and transcripts quantification**
* **Function: `RNASeqReadProcess_CMD()` or `RNASeqReadProcess()`**
* In the fourth step, raw reads are processed and all intermediate files are stored properly.
5. **Differential Expression Gene Analysis**
* **Function: `RNASeqDifferentialAnalysis_CMD()` or `RNASeqDifferentialAnalysis()`**
* In the fifth step, differential analysis is done by edgeR, DESeq2 and ballgown R / Bioconductor packages.
6. **Functional Analysis**
* **Function: `RNASeqGoKegg_CMD()` or `RNASeqGoKegg()`**
* In the sixth step, differential expressed genes are used as the input for the GO functional analysis and KEGG pathway analysis. Enriched pathways are displayed visually.
Functions with the `CMD` suffix create an R script and run `nohup R CMD BATCH script.R` in the background. Functions with no `CMD` suffix process in the R shell. After running the above functions, the whole RNA-Seq analysis is done and the files generated in each step are stored in an organized file directory. The `RNASeqR` package makes two-group comparative RNA-Seq analysis **more efficient** and **easier** for users.
The following are the main tools that are used in this package: '[HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml)' (Kim et al. 2015) and '[STAR](https://github.com/alexdobin/STAR)' (Dobin et al. 2013) for read alignment; '[StringTie](https://ccb.jhu.edu/software/stringtie/)' (Pertea et al. 2015) for alignment assembly and transcripts quantification; '[Rsamtools](https://bioconductor.org/packages/release/bioc/html/Rsamtools.html)' (Morgan et al. 2018) for converting `SAM` files to `BAM` files; '[Gffcompare](https://ccb.jhu.edu/software/stringtie/gffcompare.shtml)' for comparing merged `GTF` files with reference `GTF` files; '[systemPipeR](https://bioconductor.org/packages/release/bioc/html/systemPipeR.html)' (Backman et al. 2016) for quality assessment; '[ballgown](https://bioconductor.org/packages/release/bioc/html/ballgown.html)' (Fu et al. 2018), '[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)' (Love et al. 2014) and '[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html)' (Robinson et al. 2010;McCarthy et al. 2012) for finding potential differentially expressed genes; and '[clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html)' (Yu et al. 2012) for Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis.
## Sample Definition
Sample data used in this vignette can be downloaded from the `RNASeqRData` experiment package. It originated from [NCBI's Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra) for the entries SRR3396381, SRR3396382, SRR3396384, SRR3396385, SRR3396386, and SRR3396387. These samples are from Saccharomyces cerevisiae. Suitable reference genome and gene annotation files for this species can be further downloaded from iGenomes, Ensembl, R64-1-1. To create a mini dataset for demonstration purposes, reads aligned to the region from 0 to 100000 on chromosome XV were extracted. The analysis of this mini dataset will be shown in this vignette. The experimental data package is located [here](https://github.com/HowardChao/RNASeqRData).
For more real case-control data and RNA-Seq analysis results from this package, please go to this website: https://github.com/HowardChao/RNASeqR_analysis_result.
## System Requirements
Necessary:
1. R version >= 3.5.0
2. Operating System: Linux and macOS are supported in the `RNASeqR` package. Windows is not supported. (because StringTie and HISAT2 are not available for Windows).
3. Third-party software used in this package includes HISAT2, StringTie and Gffcompare. The availability of these commands will be checked by `system2()` through the R shell at the end of the 'Environment Setup' step. The environment must successfully built before running the following RNA-Seq analysis. By default, binaries will be installed based on the operating system of the workstation; therefore there is no additional compiling required. Alternatively, users can still decide to skip certain software binary installation. For more details, please refer to the 'Environment Setup' chapter.
Recommended:
1. Python: Python2 or Python3.
2. `2to3`: If the Python version on the workstation is 3, this command will be used. Generally, `2to3` is available if Python3 is available.
* Python and `2to3` are used for creating raw read counts for [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) and [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html).
* The following are two conditions that will create a raw read count from StringTie output.
1. Python2
2. Python3 with `2to3` command available.
* If one of these conditions is met, the raw read count will be created and DESeq2 and edgeR will be run automatically in the 'Gene-level Differential Analyses' step. If not, DESeq2 and edgeR will be skipped during 'Gene-level Differential Analysis' step. Checking the Python version and `2to3` command on the workstation beforehand is highly recommended but not necessary.
3. HISAT2 indexex: Users are advised to provide an *'indices/'* directory in *'inputfiles/'*. HISAT2 requires at least 160 GB of RAM and several hours to index the entire human genome.
## Installation
```{r install, eval=FALSE, warning=FALSE}
if (!require("BiocManager")) {
install.packages("BiocManager")
}
BiocManager::install("RNASeqRData")
```
# "RNASeqRParam" S4 Object Creation
This is the **first step** of the RNA-Seq analysis workflow in this package. Prior to conducting RNA-Seq analysis, it is necessary to implement a constructor function, called `RNASeqRParam()` and to create a `RNASeqRParam` S4 object which stores parameters not only for pre-checking but also for utilizing as input the parameters in the subsequent steps.
## `RNASeqRParam` Slots Explanation
There are 11 slots in `RNASeqRParam`:
1. os.type : The operating system type. Value is `linux` or `osx`. This package only supports 'Linux' and 'macOS' (not 'Windows'). If another operating system is detected, ERROR will be reported.
2. `python.variable` : A Python-related variable. The value is a list of whether Python is available and the Python version (`TRUE` or `FALSE`, `2` or `3`).
3. `python.2to3` : Availability of the `2to3` command. THe value is `TRUE` or `FALSE`.
4. `path.prefix` : Path prefix of the *'gene_data/'*, *'RNASeq_bin/'*, *'RNASeq_results/'*, *'Rscript/'* and *'Rscript_out/'* directories. It is recommended that you create a new, empty directory in which all the subsequent RNA-Seq results can be saved.
```{r fig.width=10, echo=FALSE}
library(png)
library(grid)
img <- readPNG("figure/whole_file_structure.png")
grid.raster(img, just = "center")
```
5. `input.path.prefix` : Path prefix of the *'input_files/'* directory. Users should prepare an *'input_file/'* directory with the following rules:
* **`genome.name`.fa**: Reference genome in FASTA file formation.
* **`genome.name`.gtf**: Gene annotation in GTF file formation.
* **raw_fastq.gz/**: Directory storing `FASTQ` files.
* Supports both paired-end and single-end reads files.
* Names of `FASTQ` files : '`sample.pattern`_1.fastq.gz' and '`sample.pattern`_2.fastq.gz'. `sample.pattern` must be distinct for each sample.
* **phenodata.csv**: Information about the RNA-Seq experiment design.
* First column : Distinct ids for each sample. The value of each sample of this column must match `sample.pattern` in `FASTQ` files in *'raw_fastq.gz/'*. The column name must be **ids**.
* Second column : An independent variable for the RNA-Seq experiment. Values of each sample of this column can only be parameter `case.group` or `control.group`. The column name is the parameter `independent.variable` which can be any string defined by users.
* Third column : The color coding for `case.group` and `control.group`. Samples in the same group must have same color coding and their values are HEX color code. The column name must be **color**.
* **indices/** : The directory storing `HT2` index files for the HISAT2 alignment tool.
* This directory is **optional**. `HT2` index files corresponding to the reference genome can be installed at [HISAT2 official website](https://ccb.jhu.edu/software/hisat2/index.shtml). Providing `HT2` files can accelerate the subsequent analysis steps. It is highly advised that you install `HT2` files.
* If `HT2` index files are not provided, the *'input_files/indices/'* directory should be deleted.
```{r fig.width=10, echo=FALSE}
img <- readPNG("figure/input_files_structure.png")
grid.raster(img, just = "center")
```
An example '*phenodata.csv*' file. File is stored in 'CSV' format.
```{r fig.width=10, echo=FALSE}
img <- readPNG("figure/phenodata_csv.png")
grid.raster(img, just = "center")
```
6. `genome.name` : The genome name defined in this RNA-Seq workflow (ex. '`genome.name`.fa', '`genome.name`.gtf')
7. `sample.pattern` : A regular expression of paired-end or single-end fastq.gz files under *'input_files/raw_fastq.gz/'*. **IMPORTANT!!** The expression shouldn't have `_[1,2].fastq.gz` at the end.
8. `independent.variable`: Independent variable for the biological experiment design of a two-group RNA-Seq analysis workflow.
9. `case.group` : Name of the case group.
10. `control.group` : Name of the control group.
11. `indices.optional` : A logical value indicating whether *'input_files/indices/'* exists. Value is `TRUE` or `FALSE`
## `RNASeqRParam` Constructor Checking
1. Create a new directory for the RNA-Seq analysis. It is highly recommended to create a new, completely empty directory. The parameter `path.prefix` of `RNASeqRParam()` constructor should be the absolute path of this new directory. All the RNA-SeqR-related files are generated in the subsequent steps will be stored inside of this directory.
2. Create a valid *'input_files/'* directory. You should create a file directory named *'input_files/'* with the neccessary files inside. It should follow the rules mentioned above.
3. Run the constructor of the `RNASeqRParam` S4 object. This constructor will check the validity of the input parameters before creating the S4 objects.
* Operating system
* Python version
* 2to3 command
* Structure, content, and rules of *'inputfiles/'*
* Validity of input parameters
## Example
```{r, warning=FALSE}
library(RNASeqR)
library(RNASeqRData)
```
```{r}
input_files.path <- system.file("extdata/", package = "RNASeqRData")
#rnaseq_result.path <- "/Users/chaokuan-hao/Documents/ANU_2019_Semester_2/Lanfear_Lab/HI"
rnaseq_result.path <- "/tmp/RNASeqR/"
dir.create(rnaseq_result.path, recursive = TRUE)
list.files(input_files.path, recursive = TRUE)
```
Check the reads files in *'input_files/'* directory. Users can set *fastq.gz.type* as "PE" (paired-end) or "SE" (single-end) to specify which kinds of fastq.gz types they are.
* For paired-end, set fastq.gz.type as "PE"
```{r, warning=FALSE}
exp <- RNASeqRParam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control",
fastq.gz.type = "PE")
```
* For single-end, set fastq.gz.type as "SE"
```{r, warning=FALSE, eval=FALSE}
exp <- RNASeqRParam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control",
fastq.gz.type = "SE")
```
```{r, warning=FALSE}
show(exp)
```
In this example, the `RNASeqRParam` S4 object is stored in `exp` for subsequent RNA-Seq analysis steps. Any ERROR occurring in the checking steps will terminate the program.
# Environment Setup
This is the **second step** of the RNA-Seq analysis workflow in this package. To set up the environment, run `RNASeqEnvironmentSet_CMD()` to execute the process in the background, or run `RNASeqEnvironmentSet()` to execute the process in the R shell.
## File Setup
1. Create Base Directories *'gene_data/'*, *'RNASeq_bin/'*, *'RNASeq_results'*, *'Rscript'*, and *'Rscript_out'* will be created in the `path.prefix` directory. Here is the usage of these five main directories:
* *'gene_data/'*: Symbolic links of *'input_files/'* and files that are created in each step of RNA-Seq analysis will be stored in this directory.
* *'RNASeq_bin/'*: The binaries of necessary tools, HISAT2, SAMtools, StringTie and Gffcompare, are installed in this directory.
* *'RNASeq_results'*: The RNA-Seq results, for example, alignment results, quality assessment results, differential analysis results etc., will be stored in this directory.
* *'Rscript'*: If your run the `XXX_CMD()` function, the corresponding R script(`XXX.R`) for certain steps will be created in this directory.
* *'Rscript_out'*: The corresponding output report for R scripts (`XXX.Rout`) will be stored in this directory.
2. Symbolic links will be created from files in *'input_files/'* to the `path.prefix` directory.
## Installation of Necessary Tools
The operating system of your workstation will be detected. If the operating system is not `Linux` or `macOS`, ERROR will be reported. Users can decide whether the installation of essential programs(HISAT2, StringTie and Gffcompare) is going be automatically processed.
Third-party softwares used in this package includes HISAT2, StringTie and Gffcompare. Binaries are available for these three programs, and by default, they will be installed automatically based on the operating system of the workstation. Zipped binaries will be unpacked and exported to the R environment PATH. **No compilation is needed**.
To specify, there are three parameters(`install.hisat2`, `install.stringtie` and `install.gffcompare`) in both the `RNASeqEnvironmentSet_CMD()` and `RNASeqEnvironmentSet()` functions for users to determine which software is going to be installed automatically or skipped. The default settings of these parameters are `TRUE` so that these three programs will be installed directly. Otherwise, users can skip certain software installation processes by turning the values to `FALSE`. Please make sure to check that the skipped programs are available using `system2()` through the R shell. Unavailability of any of the programs will cause failure in the ‘Environment Setup’ step.
Here is the version information of each software binary.
* HISAT2
* Based on your operating system, `hisat2-2.1.0-Linux_x86_64.zip` or `hisat2-2.1.0-OSX_x86_64.zip` zipped file will be installed.
* The downloaded file will be unzipped and all binary files will be copied to and run from *'RNASeq_bin/'*
* StringTie
* Based on your operating system, `stringtie-1.3.4d.Linux_x86_64.tar.gz` or `stringtie-1.3.4d.Linux_x86_64` will be installed.
* The downloaded file will be unzipped and all binary files will be copied to and run from *'RNASeq_bin/'*.
* Gffcompare
* Based on your operating system, `gffcompare-0.10.4.Linux_x86_64.tar.gz` or `gffcompare-0.10.4.Linux_x86_64.tar.gz` will be installed.
* The downloaded file will be unzipped and all binary files will be copied to and run from *'RNASeq_bin/'*.
## Export Path
*'RNASeq_bin/'* will be added to the R environment PATH so that these binaries can be found in the R environment in the R shell through `system2()`. In the last step of environment setup, the `hisat2 --version`,`stringtie --version`,`gffcompare --version`, and `samtools --version` commands will be checked in order to make sure the environment is correctly constructed. The environment must be set up successfully before the subsequent analyses.
## Example
Run `RNASeqEnvironmentSet_CMD()` or `RNASeqEnvironmentSet()`.
1. For running in background, the results will be reported in *'Rscript_out/Environment_Set.Rout'*. Make sure the environment is successfully set up before running the subsequent steps.
```{r, eval=FALSE}
RNASeqEnvironmentSet_CMD(exp)
```
2. For running in R shell, the results will be reported in the R shell. Make sure the environment is successfully set up before running the subsequent steps.
```{r, warning=FALSE}
RNASeqEnvironmentSet(exp)
```
# Quality Assessment of `FASTQ` Sequence Data
This is the **third step** of the RNA-Seq analysis workflow in this package. Different from other necessary steps, it is optional and can be run several times with each result stored separately. Although this step can be skipped, it is strongly recommended that it be performed before processing the alignment step. To evaluate the quality of the raw reads in the `FASTQ` files, run `RNASeqQualityAssessment_CMD()` in the background or run `RNASeqQualityAssessment()` to execute the process in the R shell.
## "systemPipeR" Quality Assessment
In this step, the [systemPipeR](http://bioconductor.org/packages/release/bioc/html/systemPipeR.html) package is used for evaluating sequencing reads and the details are as follows:
1. Check the number of times that the user has run the quality assessment process and create the corresponding files *'RNASeq_results/QA_results/QA_{times}'*.
2. RNA-Seq environment set up. The *'rnaseq/'* directory will be created by [systemPipeR](http://bioconductor.org/packages/release/bioc/html/systemPipeR.html) package.
3. Create the *'data.list.txt'* file.
4. Reading `FASTQ` files and create *'fastqReport.pdf'* containing the results of the quality assessment.
5. Remove the *'rnaseq/'* directory.
This quality assessment result (example below) is generated by [systemPipeR](http://bioconductor.org/packages/release/bioc/html/systemPipeR.html) package. It will be stored as a `PDF`.
```{r fig.width=20, echo=FALSE}
img <- readPNG("figure/fastqReport.png")
grid.raster(img, just = "center")
```
## Example
Run `RNASeqQualityAssessment_CMD()` or `RNASeqQualityAssessment()`.
1. For running in background, the results will be reported in *'Rscriptout/Quality_Assessment.Rout'*. Make sure the quality assessment is successfully done before running the subsequent steps.
```{r, eval=FALSE}
RNASeqQualityAssessment_CMD(exp)
```
2. Results fo runs in R shell will be reported in the R shell. Make sure the quality assessment is successfully done before running the subsequent steps.
```{r, warning=FALSE}
RNASeqQualityAssessment(exp)
```
## Update fastq.gz files
Sliding through quality score in each base would be computational-intensive. Moreover, due to R package limitation, it is not suitable to trim large RNA-Seq fastq.gz in R environment and is not supported in RNASeqR. Sample with an average quality score below Q30 is suggested to be omitted from the sub-sequent analyses.
In order to let users update trimmed fastq.gz files easily, an update function is provided. User need the `RNASeqRParam` S4 object created at the first step, the prepared new *'raw_fasrq.gz/'* directory with trimmed files and the samples stored in list as `Update_Fastq_gz` function inputs.
```{r, warning=FALSE, eval=FALSE}
Update_Fastq_gz (RNASeqRParam = exp,
prepared_fastq_gz = paste0(input_files.path,
"/input_files/raw_fastq.gz/"),
target_samples = "ALL")
```
# Read Alignment and Quantification
This is the **fourth step** of the RNA-Seq analysis workflow in this package. To process raw reads in `FASTQ` files, users can either run `RNASeqRawReadProcess_CMD()` to execute the process in background or run `RNASeqRawReadProcess()` to execute the process in the R shell. For further details about the commands and parameters that executed during each step, please check the reported *'RNASeq_results/COMMAND.txt'*.
## The *HISAT2* Indexer
The defalt indexer is Hisat2. In a preparation step (the `RNASeqRParam` creation step), the *'indices/'* directory is checked for whether `HT2` index files already exist. If not, the following commands will be executed:
* Input: '`genome.name`.gtf', '`genome.name`.fa'
* Output: '`genome.name`.ss', '`genome.name`.exon', '`genome.name`_tran.{number}.ht2'
1. `extract_splice_sites.py`, `extract_exons.py` execution
* These two commands are executed to extract splice-site and exon information from the gene annotation file.
2. `hisat2-build` index creation
* This command builds the HISAT2 index files *`genome.name`.ss* and *`genome.name`.exon* created in the previous step. Be aware that the index building step requires a larger amount of memory and longer time than other steps, and it might not be possible to run on some personal workstations. It is highly recommended that you check the availability of `HT2` index files at the [HISAT2 official website](https://ccb.jhu.edu/software/hisat2/index.shtml) for your target reference genome beforehands. Pre-installing `HT2` index files will greatly shorten the analysis time.
3. Write *'RNASeq_results/COMMAND.txt'*
* Shell commands that run in this step will be documented in *'RNASeq_results/COMMAND.txt'*.
## The *HISAT2* Aligner
* Input: '`genome.name`_tran.{number}.ht2', '`sample.pattern`.fastq.gz'
* Output: '`sample.pattern`.sam'
The defalt Aligner is Hisat2.
1. `hisat2` command is executed on paired-end or single-end `FASTQ` files. `SAM` files will be created.
* `SAM` files are stored in *'gene_data/raw_sam/'*.
2. Write *'RNASeq_results/COMMAND.txt'*
* The shell command that run in this step will be documented into *'RNASeq_results/COMMAND.txt'*.
3. The summary dataframe for alignment reads and mapping rates in terms of tabular(`CSV`) and picture (`PNG`) format is created and kept in the directory 'RNASeq_results/Alignment_Report'.
## The *STAR* Indexer and Aligner
* Input: '`genome.name`.gtf', '`genome.name`.fa', '`sample.pattern`.fastq.gz'
* Output: '``sample.pattern`.sam`'
Users could select *STAR* as their aligner in this pipeline by setting `STAR.Alignment.run` to `TRUE` and `Hisat2.Index.run`, `Hisat2.Alignment.run` to `FALSE`. Indexed results would be stored in *'indices/'* directory, and generated `SAM` files are stored in *'gene_data/raw_sam/'*.
```{r fig.width=6,fig.height=6, echo=FALSE}
img <- readPNG("figure/Alignment_Report/Alignment_Result_ggplot2.png")
grid.raster(img, just = "center")
```
```{r fig.width=6,fig.height=6, echo=FALSE}
img <- readPNG("figure/Alignment_Report/Overall_Mapping_rate_ggplot2.png")
grid.raster(img, just = "center")
```
## The *Rsamtools `SAM` to `BAM`* Converter
In this step, users can choose whether they want to use 'Rsamtools'(an R package) or 'SAMtools' ( a command-line-based tool) to do files conversion by setting the `SAMtools.or.Rsamtools` parameter to `Rsamtools` or `SAMtools`. By default, `Rsamtools` will be used. However, if the amount of RNA-Seq data is too large, 'Rsamtools' might not be able to finish this process due to the Rtmp file issue, and therefore 'SAMtools' is recommended. Users have to make sure the 'samtools' command is available on the workstation beforehands or ERROR will be reported.
* Input: '`sample.pattern`.sam'
* Output: '`sample.pattern`.bam'
1. The [Rsamtools](https://bioconductor.org/packages/release/bioc/html/Rsamtools.html) package provides an interface to `samtools` in the R environment. In this step, `SAM` files from HISAT2 will be converted to `BAM` files by running the `asBam()` function.
* Output `BAM` files are stored in *'gene_data/raw_sam/'*.
2. Write *'RNASeq_results/COMMAND.txt'*
## The *StringTie* Assembler
* Input: '`genome.name`.gtf', '`sample.pattern`.bam'
* Output: '`sample.pattern`.gtf'
1. `stringtie` command is executed.
* The tool assembles RNA-Seq alignments into potential transcripts.
* The assembled `GTF` files which are from each `FASTQ` files are stored in *'gene_data/raw_gtf/'*
2. Write *'RNASeq_results/COMMAND.txt'*
* Shell commands that run in this step will be documented in *'RNASeq_results/COMMAND.txt'*.
## The *StringTie `GTF`* Merger
* Input: '`sample.pattern`.gtf'
* Output: 'stringtiemerged.gtf', 'mergelist.txt'
1. Create the file 'mergelist.txt'.
* *gene_data/merged/mergelist.txt* is created for the merging step.
2. `stringtie` command is executed.
* The transcript merger merges each *`sample.pattern`.gtf* into *stringtiemerged.gtf*
* Output files are all stored in *'gene_data/merged/'*
3. Write *'RNASeq_results/COMMAND.txt'*
* Shell commands that run in this step will be documented in *'RNASeq_results/COMMAND.txt'*.
## The *Gffcompare* Comparer
* Input: '`genome.name`.gtf', 'stringtie_merged.gtf'
* Output: 'merged.annotated.gtf', 'merged.loci', 'merged.stats', 'merged.stringtie_merged.gtf.refmap', 'merged.stringtie_merged.gtf.tmap', 'merged.tracking'
1. `gffcompare` command is executed.
* The comparison result of the merged `GTF` file and reference annotation files is reported in the *'merged/'* directory.
2. Write *'RNASeq_results/COMMAND.txt'*
* Shell commands that run in this step will be documented in *'RNASeq_results/COMMAND.txt'*.
## The *Ballgown input* Creator
* Input: 'stringtie_merged.gtf'
* Output: 'ballgown/', 'gene_abundance/'
1. `stringtie` command is executed.
* StringTie will create the input directory for the ballgown package for subsequent differential expression analysis. Ballgown-related files will be stored under each sample name in *'gene_data/ballgown/'*.
* StringTie will store gene-related information in an `TSV` file under each sample name in *'gene_data/gene_abundance/'*.
2. Write *'RNASeq_results/COMMAND.txt'*
* Shell commands that run in this step will be documented in *'RNASeq_results/COMMAND.txt'*.
## The *Reads Count Table* Creator
Whether this step is executed depends on the availability of Python on your workstation.
* Input: 'samplelst.txt'
* Output: 'gene_count_matrix.csv', 'transcript_count_matrix.csv'
1. The reads count table converter Python script is downloaded as `prepDE.py`
2. Python checking
* When Python is not available, this step is skipped.
* When Python2 is available, `prepDE.py` is executed.
* When Python3 is available, the `2to3` command will be checked.(Usually, if Python3 is installed, `2to3` command will be installed too.)
* When Python3 is available but the `2to3` command is unavailable, the raw read count step will be skipped.
* When Python3 and the `2to3` command are available, `prepDE.py` is converted to a file that can be executed by Python2 and is automatically executed.
3. Raw reads count creation
* The raw reads count is created and the results are stored in *'gene_data/reads_count_matrix/'*
4. Write *'RNASeq_results/COMMAND.txt'*
* Shell commands that run in this step will be documented in *'RNASeq_results/COMMAND.txt'*.
## Example
Run `RNASeqReadProcess_CMD()` or `RNASeqReadProcess()`. Parameters for index, alignmet and assembly are set by default. Only S4 `RNASeqRParam` object is need. If users want to change their parameters, please check function inputs of `RNASeqReadProcess_CMD()` and `RNASeqReadProcess()`.
1. For running in background, the results will be reported in *'Rscriptout/Raw_Read_Process.Rout'*. Make sure the raw read process is successfully done before running the subsequent steps. Default indexer would be `HISAT2`.
```{r, eval=FALSE}
RNASeqReadProcess_CMD(exp)
```
* If users would like to change to `STAR` and run in the background, run this one
```{r, eval=FALSE}
RNASeqReadProcess_CMD(exp, STAR.Alignment.run=TRUE,
Hisat2.Index.run=FALSE,
Hisat2.Alignment.run = FALSE)
```
2. For running in R shell, the results will be reported in R shell. Make sure the raw read process is successfully done before running the subsequent steps. Default aligner would be `HISAT2`.
```{r, warning=FALSE, eval=FALSE}
RNASeqReadProcess(exp)
```
* If users would like to change to `STAR` and run in the R shell, then run this one
```{r, eval=FALSE}
RNASeqReadProcess(exp, STAR.Alignment.run=TRUE,
Hisat2.Index.run=FALSE,
Hisat2.Alignment.run = FALSE)
```
# Gene-level Differential Analysis
This is the **fifth step** of the RNA-Seq analysis workflow in this package. To identify differentially expressed genes, users can either run `RNASeqDifferentialAnalysis_CMD()` to execute the process in the background or run `RNASeqDifferentialAnalysis()` to execute the process in the R shell. In this package, we provide three normalized expression values: fragments per kilobase per million reads (FPKM) (Mortazavi et al. 2008), normalized counts by means of median of ratios normalization (MRN), or trimmed mean of m-values (TMM), each with proper statistical analyses using the R packages, ballgown, DESeq2, and edgeR. Gene IDs from StringTie and the ballgown R package will be mapped to 'gene_name' in the GTF file for further functional analysis.
## Gene Annotations
For each differential expression analysis tool, all statistical results would be stored properly under '*ballgown_analysis/*', '*DESeq2_analysis/*' and '*edgeR_analysis/*'. In the file, known gene annotations would be listed in the front rows while unknown gene annotations would be listed in the back rows. The following picture shows partial result of an example dataset where ‘.’ represents unknow gene id.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/DE/Statistical_Results.png")
grid.raster(img, just = "center")
```
## General Data Visualization
Here we illustrate general data visualization before and after differential expression analysis. The results based on each differential analysis tool(*ballgown*, *DESeq2*, *edgeR*) are kept in the directory 'RNASeq_results/'. The plots shown below are the statistical results visualization of example data in the `RNASeqRData` package based on MRN-normalized expression values from DESeq2.
For real data analysis results, please go to this website: https://howardchao.github.io/RNASeqR_analysis_result/.
### Before Differential Expression Analysis
#### Frequency Plots
When visualizing the frequency of expression value per sample using the [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) R package, the x-axis represents the range of normalized counts value by MRN or log2(MRN+1) value and the y-axis represents the frequency of each value on the x-axis
* **'Frequency_Plot_normalized_count_ggplot2.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Frequency/Frequency_Plot_normalized_count_ggplot2.png")
grid.raster(img, just = "center")
```
* **'Frequency_Plot_log_normalized_count_ggplot2.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Frequency/Frequency_Plot_log_normalized_count_ggplot2.png")
grid.raster(img, just = "center")
```
#### Distribution Plots
You can display the distribution of normalized expression values (e.g., log2(MRN+1) values) in a boxplot and a violin plot using the [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) R package. Samples are colored as follows: blue for the case group and yellow for the control group.
* **'Box_Plot_ggplot2.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Distribution/Box_Plot_ggplot2.png")
grid.raster(img, just = "center")
```
* **'Violin_Plot_ggplot2.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Distribution/Violin_Plot_ggplot2.png")
grid.raster(img, just = "center")
```
#### PCA Plots
To display how the biological samples compare in overall similarities and differences using principal component analysis(PCA); the principal component scores of the top five dimensions are calculated using the [FactoMineR](https://cran.r-project.org/web/packages/FactoMineR/index.html) package and the results are extracted and visualized using [factoextra](https://cran.r-project.org/web/packages/factoextra/index.html) or [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html).
* **'Dimension_PCA_Plot_factoextra.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/PCA/Dimension_PCA_Plot_factoextra.png")
grid.raster(img, just = "center")
```
* **'PCA_Plot_factoextra.png'**: Samples are colored as follows: blue for the case group and yellow for the control group. The small point represents each sample while the big one represents each comparison group. Ellipases can be further added for grouping samples.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/PCA/PCA_Plot_factoextra.png")
grid.raster(img, just = "center")
```
* **'PCA_Plot_ggplot2.png'**: Samples are colored as described above.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/PCA/PCA_Plot_ggplot2.png")
grid.raster(img, just = "center")
```
#### Correlation Plots
You can display Pearson correlation coefficient of a pairwise correlation analysis of changes in gene expression from all samples calculated by [stats](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html) using [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html)(correlation heat plot), [corrplot](https://cran.r-project.org/web/packages/corrplot/index.html)(correlation dot plot), and [PerformanceAnalytics](https://cran.r-project.org/web/packages/PerformanceAnalytics/index.html)(correlation bar plot). The colors from red to blue indicate the value of the correlation, from maximum value to minimum value, among all samples.
* **'Correlation_Heat_Plot_ggplot2.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Correlation/Correlation_Heat_Plot_ggplot2.png")
grid.raster(img, just = "center")
```
* **'Correlation_Dot_Plot_corrplot.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Correlation/Correlation_Dot_Plot_corrplot.png")
grid.raster(img, just = "center")
```
* **'Correlation_Bar_Plot_PerformanceAnalytics.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Correlation/Correlation_Bar_Plot_PerformanceAnalytics.png")
grid.raster(img, just = "center")
```
### After Differential Expression Analysis
#### Volcano Plots
* **'Volcano_Plot_graphics.png'**: You can display the criteria for selecting differentially expressed genes (DEGs) and identify these DEGs using [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html). The x-axis represents the log2-based fold change while the y-axis denotes the log10-based p-value. The up-regulated and down-regulated DEGs are hightlighted in red and green color, respectively.
```{r fig.width=8, fig.height=8, echo=FALSE}
img <- readPNG("figure/DE/Volcano_Plot_graphics.png")
grid.raster(img, just = "center")
```
#### PCA Plots
To display how the biological samples compare in similarities and differences based on the expression value of DEGs using PCA. [FactoMineR](https://cran.r-project.org/web/packages/FactoMineR/index.html), [factoextra](https://cran.r-project.org/web/packages/factoextra/index.html) and [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) packages are used in this step.
* **'Dimension_PCA_Plot_factoextra.png'**
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/DE/PCA/Dimension_PCA_Plot_factoextra.png")
grid.raster(img, just = "center")
```
* **'PCA_Plot_factoextra.png'**: Samples are colored as follows: blue for the case group and yellow for the control group. The small point represents each sample while the big one represents each comparison group. Ellipases can be further added for grouping samples.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/DE/PCA/PCA_Plot_factoextra.png")
grid.raster(img, just = "center")
```
* **'PCA_Plot_ggplot2.png'**: Samples are colored as described above.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/DE/PCA/PCA_Plot_ggplot2.png")
grid.raster(img, just = "center")
```
#### Heatmap Plots
* **'Heatmap_Plot_pheatmap.png'**: To visualize the expression value of DEGs in terms of the log2-based normalized value from all samples of the two groups using [pheatmap](https://cran.r-project.org/web/packages/pheatmap/index.html). Samples are colored by defined groups: blue for the case group and yellow for the control group. Gene names and sample names will be shown on the heatmap except for two conditions: if there are more than 60 DEGs, gene names will not be shown, and if there are more than 16 samples, sample names will not be shown.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/DE/Heatmap_Plot_pheatmap.png")
grid.raster(img, just = "center")
```
## "ballgown" Analysis Based on FPKM Value
[Ballgown](https://www.bioconductor.org/packages/release/bioc/html/ballgown.html) is an R package designed for differential expression analysis of RNA-Seq data. This package extracts FPKM values, i.e., read counts normalized by both library size and gene length, from StringTie software followed by applying a parametric F-test comparing the nested linear model as its default statistic model to identify DEGs. The basic steps are as follows:
* Input: *'gene_data/ballgown/'*
* Output: *'RNASeq_results/ballgown_analysis/'*
1. Create a ballgown object that will be stored in *'RNASeq_results/ballgown_analysis/ballgown.rda'*.
2. Filter out the genes for which the sum of FPKM values of all samples per gene equals 0.
3. Calculate log2-based fold change value in column `log2FC`.
4. Split a matrix of normalized counts into case and control groups based on phenotype data (*'gene_data/phenodata.csv'*) and assign relative information in column sample.pattern.FPKM.
5. Generate a `csv` file, *'RNASeq_results/ballgown_analysis/ballgown_normalized_result.csv'*, in which to store normalized FPKM values, mean expression values per group, and statistical results.
6. Select DEGs based on default criteria: **`pval < 0.05` and `log2FC > 1 | log2FC < 1`**. Store the result in *'RNASeq_results/ballgown_analysis/ballgown_normalized_DE_result.csv'*
7. Additional data visualization: aside from the general data visualization mentioned above, transcript-related plots and an MA plot are also provided.
* **'Distribution_Transcript_Count_per_Gene_Plot.png'**: This plots the distribution of transcript counts per gene.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/transcript_related/Distribution_Transcript_Count_per_Gene_Plot.png")
grid.raster(img, just = "center")
```
* **'Distribution_Transcript_Length_Plot.png'**: This plots the distribution of transcript length.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/transcript_related/Distribution_Transcript_Length_Plot.png")
grid.raster(img, just = "center")
```
* **'MA_Plot_ggplot2.png'**: This displays the difference of expression value between two groups by transforming the data onto log2-based ratio (x-axis) and log2-based mean (y-axis) scales by using [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html).
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/ballgown_MA_Plot_ggplot2.png")
grid.raster(img, just = "center")
```
## "DESeq2" Analysis Based on Read Count
[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) is an R package for count-based differential expression analysis using read counts to estimate variance-mean dependence. It takes sequence depth and gene composition into consideration and uses the MRN method to normalize read counts. The statistical model for differential expression is based on a negative binomial distribution. The basic steps are as follows:
* Input: *'gene_data/reads_count_matrix/gene_count_matrix.csv'*
* Output: *'RNASeq_results/ballgown_analysis/'*
1. Create the `DESeqDataSet` object based on count data from the matrix of read counts and phenotype data using the `DESeqDataSetFromMatrix()` function.
2. Filter out the genes for which the sum of read counts of all samples equals 0.
3. Run `DESeq2()` function to process the differential expression analysis.
4. Generate a `csv` file, *'RNASeq_results/DESeq2_analysis/DESeq2_normalized_result.csv'*, in which to store normalized MRN counts, mean expression values per group, and statistical results.
5. Select DEGs based on default criteria: **`pval < 0.05` and `log2FC > 1 | log2FC < 1`**. Store the result in *'RNASeq_results/DESeq2_analysis/DESeq2_normalized_DE_result.csv'*..
6. Additional data visualization: aside from the general data visualization mentioned above, a dispersion plot and an MA plot are also provided.
* **Dispersion_Plot_DESeq2.png**: This displays the dispersion estimates before and after normalization using `plotDispEsts()`. The x-axis denotes the mean of the normalized counts while the y-axis represents the dispersion estimates values using `plotDispEsts()` in [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/preDE/Dispersion_Plot_DESeq2.png")
grid.raster(img, just = "center")
```
* **MA_Plot_DESeq2.png**: This displays the difference in expression values between two groups by transforming the data onto log2-based ratio (x-axis) and log2-based mean (y-axis) scales using `plotMA()` in [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/DE/MA_Plot_DESeq2.png")
grid.raster(img, just = "center")
```
## "edgeR" Analysis Based on Read Count
[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) is another R package for count-based differential expression analysis. It implements the TMM method to normalize count data between samples, and several statistical strategies based on the negative binomial distribution, such as exact tests, are used to detect differential expression. The basic steps are as follows:
* Input: *'gene_data/reads_count_matrix/gene_count_matrix.csv'*
* Output: *'RNASeq_results/edgeR_analysis/'*
1. Create `DEGList` based on the count data from the matrix of read counts and phenotype data using the `DGEList()` function.
2. Normalize the `DEGList` object by running three functions in the following order: `calcNormFactors()`, `estimateCommonDisp()` and `estimateTagwiseDisp()`.
3. Conduct genewise exact tests using the `exactTest()` function.
4. Obtain a normalized count matrix using `cpm()` after TMM normalization. (cpm = counts per million)
5. Generate a `csv` file, 'RNASeq_results/edgeR_analysis/edgeR_normalized_DE_result.csv', in which to store normalized TMM-normalized counts, mean expression values per group, and statistical results.
6. Select DEGs based on default criteria: **`pval < 0.05` and `log2FC > 1 | log2FC < 1`**. Store the result in 'RNASeq_results/edgeR_analysis/edgeR_normalized_DE_result.csv'.
7. Additional data visualization: aside from the general data visualization mentioned above, a MeanVar plot, a biological coefficient of variance (BCV) plot, an MDS plot, and a Smear plot are also provided.
* **MeanVar_Plot_edgeR.png**: This visualizes the mean-variance relationship before and after TMM normalization using the `plotMeanVar()` function in [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html).
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/edgeR_MeanVar_Plot_edgeR.png")
grid.raster(img, just = "center")
```
* **BCV_Plot_edgeR.png**: This displays the genewise biological coefficient of variance (BCV) against gene abundance using the `plotBCV()` function in [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html).
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/edgeR_BCV_Plot_edgeR.png")
grid.raster(img, just = "center")
```
* **MDS_Plot_edgeR.png**: This presents the expression differences between the samples using the `plotMDS.DGEList()` function in [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html).
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/edgeR_MDS_Plot_edgeR.png")
grid.raster(img, just = "center")
```
* **Smear_Plot_edgeR.png**: This plots the log2-based fold change against the log10-based concentration using `plotSmear()` in [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html).
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/edgeR_Smear_Plot_edgeR.png")
grid.raster(img, just = "center")
```
## Example
Run `RNASeqDifferentialAnalysis_CMD()` or `RNASeqDifferentialAnalysis()`.
1. For running in background, the results will be reported in *'Rscriptout/Differential_Analysis.Rout'*. Make sure the differential expression analysis is successfully finished before running subsequent steps.
```{r, eval=FALSE}
RNASeqDifferentialAnalysis_CMD(exp)
```
2. For running in R shell, the results will be reported in the R shell. Make sure the differential expression analysis is successfully finished before running subsequent steps.
```{r, warning=FALSE, eval=FALSE}
RNASeqDifferentialAnalysis(exp)
```
# Functional Analysis
This is the **sixth step** of the RNA-Seq analysis workflow in this package. [clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) is used for GO functional analysis and KEGG pathway analysis based on the DEGs found in three different differential analyses. Users can either run `RNASeqGoKegg_CMD()` to execute the process in the background or run `RNASeqGoKegg()` to execute the process in the R shell.
In this step, users have to provide a gene name type, `input.TYPE.ID`, that is used in StringTie and ballgown and is supported in the `OrgDb.species` annotation package for the target species. In GO functional analysis and KEGG pathway analysis, the `input.TYPE.ID` ID type will be converted into an `ENTREZID` ID type by the `bitr()` function in clusterProfiler first. Those `input.TYPE.ID` with no corresponding `ENTREZID` will return `NA` and be filtered out. The genes with `Inf` or `-Inf` log2 fold change will be filtered out too. ID conversion is done in each differential analysis tools (ballgown, DESeq2 and edgeR).
In this example, the RNA-Seq analysis target species is *Saccharomyces cerevisiae*(yeast). The `OrgDb.species` is `org.Sc.sgd.db`; the `input.TYPE.ID` is `GENENAME`. IDs are converted from `GENENAME` to `ENTREZID`.
## GO Enrichment Analysis
GO is defined by three concepts relating to gene functions(GO terms) - molecular function(MF), cellular component(CC), and biological process(BP) - and by how these functions are related to each other. In this step, GO classification and GO over-representation tests are conducted. To classify significant GO terms, DEGs are analyzed by the `groupGO()` function. Similarly, the GO over-representation test of DEGs is conducted using `enrichGO()`. Both results are stored in a `csv` file and the top 15 GO terms are visualized by both a bar chart and a bubble plot.
In this example, the DESeq2 CC GO Classification bar chart is shown.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/GO_analysis/GO_CC_Classification_Bar_Plot_clusterProfiler.png")
grid.raster(img, just = "center")
```
In this example, the DESeq2 MF GO Over-representation bar chart and the DESeq2 MF GO Over-representation dot plot are shown.
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/GO_analysis/GO_MF_Overrepresentation_Bar_Plot_clusterProfiler.png")
grid.raster(img, just = "center")
```
```{r fig.width=6, height=6, echo=FALSE}
img <- readPNG("figure/GO_analysis/GO_MF_Overrepresentation_Dot_Plot_clusterProfiler.png")
grid.raster(img, just = "center")
```
## KEGG Pathway Analysis
KEGG is a database resource for understanding functions and utilities of a biological system from molecular-level information([KEGG website](https://www.genome.jp/kegg/)). In this step, KEGG over-representation tests can be conducted using the [clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package. A KEGG over-representation test of DEGs is conducted using `enrichKEGG()`. The result will be stored in a `csv` file. The pathway IDs that are found in the KEGG over-representation test will be visualized with the `pathview` package. KEGG pathway URL will also be stored.
In this example, due to the limited number of DEGs, no over-represented pathways are found.
## Example
Run `RNASeqGoKegg_CMD()` or `RNASeqGoKegg()`.
1. For running in background, the results will be reported in *'Rscriptout/GO_KEGG_Analysis.Rout'*.
```{r, eval = FALSE}
RNASeqGoKegg_CMD(exp,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")
```
2. Results from runs in the R shell will be reported in R shell.
```{r, warning=FALSE, eval=FALSE}
RNASeqGoKegg(exp,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")
```
# Other Starting Points
The main analysis workflow of RNASeqR starts from raw fastq.gz files. However, we also allow users to run the pipeline from `SAM` files or `BAM` files. Similar to the first step mentioned in section **"RNASeqRParam S4 Object Creation"**, users need to create a different S4 Object: `RNASeqRParam_Sam` for starting from `SAM` files or `RNASeqRParam_Bam` for starting from `BAM` files.
Following are the steps for running pipeline from SAM or BAM files:
1. **`RNASeqRParam_Sam` or `RNASeqRParam_Bam` S4 Object Creation**
* **Function: `RNASeqRParam_Sam()` or `RNASeqRParam_Bam()`**
* At the beginning, users create a `RNASeqRParam_Sam` or `RNASeqRParam_Bam` S4 object by running the `RNASeqRParam_Sam()` or `RNASeqRParam_Bam()` constructor function for all variable checking. The S4 object will be used as the input for the remaining analysis functions. Then, subsequent steps are similar to starting the pipeline from FASTQ.gz files, with the RNASeqRParam input replaced by `RNASeqRParam_sam` or `RNASeqRParam_Bam`.
* Starting from `SAM` files. Users first need to create a `RNASeqRParam_Sam` S4 object.
```{r, warning=FALSE, eval=FALSE}
exp_Sam <- RNASeqRParam_Sam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control")
```
* Starting from `BAM` files. Users first need to create a `RNASeqRParam_Bam` S4 object.
```{r, warning=FALSE, eval=FALSE}
exp_Bam <- RNASeqRParam_Bam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control")
```
Then, subsequent steps would be similar as starting pipeline from `FASTQ.gz` files just to replace `RNASeqRParam` by `RNASeqRParam_Sam` or `RNASeqRParam_Bam` as input.
2. **Environment Setup**
* **Function: `RNASeqEnvironmentSet_CMD()` or `RNASeqEnvironmentSet()`**
* In the second step, the environment is set up automatically with the necessary tools installed for the subsequent steps in RNASeqR.
3. **Quality Assessment (optional)**
* **Function: `RNASeqQualityAssessment_CMD()` or `RNASeqQualityAssessment()`**
* In the third step (optional), users check the quality of raw fastq files.
4. **Transcripts quantification**
* **Function: `RNASeqReadProcess_CMD()` or `RNASeqReadProcess()`**
* In the fourth step, raw reads are processed, and all intermediate files are stored properly.
5. **Differential Expression Gene Analysis**
* **Function: `RNASeqDifferentialAnalysis_CMD()` or `RNASeqDifferentialAnalysis()`**
* In the fifth step, differential analysis is done by edgeR, DESeq2 and ballgown R / Bioconductor packages.
6. **Functional Analysis**
* **Function: `RNASeqGoKegg_CMD()` or `RNASeqGoKegg()`**
* In the sixth step, differential expressed genes are used as input for the GO functional analysis and KEGG pathway analysis. Enriched pathways are displayed visually.
# Wrapper function to run RNASeqR
We also provide a funtion which wraps all the analyses into one function. Users also have to create `RNASeqRParam` S4 object and set up the environment first.
```{r, warning=FALSE, eval=FALSE}
exp <- RNASeqRParam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control",
fastq.gz.type = "PE")
```
```{r, warning=FALSE, eval=FALSE}
RNASeqEnvironmentSet(exp)
```
Then users can choose to execute `All_Steps_Interface_CMD` which runs in background or `All_Steps_Interface` which runs in R interface. These two functions wrap from quality assessment to functional analysis (step 3 to step 6 mentioned above). Default parameters will be applied if users do not specify. Although we provide this wrapper function, we still strongly recommend users to run RNASeqR step by step and check outputs correctness before running next function to prevent any potential errors.
```{r, warning=FALSE, eval=FALSE}
All_Steps_Interface_CMD(exp,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")
```
```{r, warning=FALSE, eval=FALSE}
All_Steps_Interface(exp,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")
```
# Conlusion
RNASeqR is an user-friendly R-based tool for running case-control study(two group) RNA-Seq analysis pipelines. The six main steps in this package are *RNASeqRParam S4 Object Creation*, *Environment Setup*, *Quality Assessment*, *Reads Alignment and Quantification*, *Gene-level Differential Expression Analysis* and, *Functional Analysis.*
The main features that RNASeqR provides are an automated workflow, an extendable file structure, comprehensive reports, and data visualization on widely-used differential analysis tools. With this R package, doing two-group RNA-Seq analysis will be much easier and faster.
# Session Information
```{r}
sessionInfo()
```
# References
Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics, 10(1), 57.
Pang CN et al., "Transcriptome and network analyses in Saccharomyces cerevisiae reveal that amphotericin B and lactoferrin synergy disrupt metal homeostasis and stress response.", Sci Rep, 2017 Jan 12;7:40232
Kim D, Langmead B and Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015
Dobin, Alexander, et al. "STAR: ultrafast universal RNA-seq aligner." Bioinformatics 29.1 (2013): 15-21.
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT & Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Nature Biotechnology 2015, doi:10.1038/nbt.3122
Morgan M, Pagès H, Obenchain V, Hayden N (2018). Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.32.0, http://bioconductor.org/packages/release/bioc/html/Rsamtools.html.
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621.
Backman TWH, Girke T (2016). “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics, 17(1). doi: 10.1186/s12859-016-1241-0, https://doi.org/10.1186/s12859-016-1241-0.
Morgan M, Anders S, Lawrence M, Aboyoun P, Pagès H, Gentleman R (2009). “ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data.” Bioinformatics, 25, 2607-2608. doi: 10.1093/bioinformatics/btp450, http://dx.doi.org10.1093/bioinformatics/btp450.
Fu J, Frazee AC, Collado-Torres L, Jaffe AE, Leek JT (2018). ballgown: Flexible, isoform-level differential expression analysis. R package version 2.12.0.
Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140.
McCarthy, J. D, Chen, Yunshun, Smyth, K. G (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297.
Yu G, Wang L, Han Y, He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), 284-287. doi: 10.1089/omi.2011.0118.
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature protocols, 11(9), 1650.ISO 690
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621.
Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A., & Dewey, C. N. (2009). RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4), 493-500.