---
title: "Tumor, tumor-adjacent normal, and healthy colorectal transcriptomes as a
  SummarizedExperiment on Bioconductor's ExperimentHub"
author:
- name: Christopher Dampier
  affiliation:
  - University of Virginia
  - Center for Public Health Genomics
  - Department of General Surgery
  email: chd5n@virginia.edu
package: "FieldEffectCrc"
date: "`r BiocStyle::doc_date()`"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{SummarizedExperiments of colorectal transcriptomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Data included in this package was generated by harmonizing all publicly
available bulk RNA-seq samples from human primary colorectal tissue available
through NCBI's [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) and
[database of Genotypes and Phenotypes](https://www.ncbi.nlm.nih.gov/gap/), NCI's
[Genomic Data Commons](https://gdc.cancer.gov/), and the NIH-funded
[BarcUVa-Seq](https://barcuvaseq.org/) project. FASTQ files were downloaded
directly or generated from BAM files using
[biobambam2](https://gitlab.com/german.tischler/biobambam2). Gene expression
estimates were generated from FASTQ files using
[Salmon](https://combine-lab.github.io/salmon/) and aggregated from `quant.sf`
files using
[tximport](https://bioconductor.org/packages/release/bioc/html/tximport.html).
The original processed data are stored as tab-delimited text on
[Synapse](https://www.synapse.org/) under Synapse ID
[syn22237139](https://www.synapse.org/#!Synapse:syn22237139/wiki/604471) and
packaged as `SummarizedExperiment` objects on `ExperimentHub` to facilitate
reproduction and extension of results published in Dampier et al. (PMCID:
PMC7386360, PMID: 32764205).

# Installation

The following example demonstrates how to download the dataset from
`ExperimentHub`.

First, make sure the Bioconductor packages `BiocManager`,
`SummarizedExperiment`, and `ExperimentHub` are installed:

```{r, install-packages, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install(c("SummarizedExperiment", "ExperimentHub", "FieldEffectCrc"))
```

Second, load the `SummarizedExperiment` and `ExperimentHub` packages:

```{r, load-hub}
library(SummarizedExperiment)
library(ExperimentHub)
```

Third, find the `FieldEffectCrc` data objects and look at their descriptions:

```{r, find-data}
hub = ExperimentHub()
## simply list the resource names
ns <- listResources(hub, "FieldEffectCrc")
ns
## query the hub for full metadata
crcData <- query(hub, "FieldEffectCrc")
crcData
## extract metadata
df <- mcols(crcData)
df
## see where the cache is stored
hc <- hubCache(crcData)
hc
```

Fourth, explore the metadata of a particular file. Let's look at the data for
cohort A:

```{r, explore-file}
md <- hub["EH3524"]
md
```

Fifth, download the data object itself as a `SummarizedExperiment`:

```{r, download-data}
## using resource ID
data1 <- hub[["EH3524"]]
data1
## using loadResources()
data2 <- loadResources(hub, "FieldEffectCrc", "cohort A")
data2
```

Since the data is a `SummarizedExperiment`, the different phenotypes can be
accessed using `colData()`. The phenotype of interest in the field effect
analysis is referred to as `sampType`:

```{r, phenotypes}
summary(colData(data1)$sampType)
```

# Quick Start

To immediately access the count data for the 834 samples in cohort A, load the
`SummarizedExperiment` object for cohort A as demonstrated in the
[Installation](#installation) section immediately above. Then use the `assays()`
accessor function to extract the `counts` assay, which is the second assay in
the `SummarizedExperiment`:

```{r, quick-start-1}
se <- data1
counts1 <- assays(se)[["counts"]]
```

Note that `assay()`, the singular as opposed to plural, will by default extract
the first assay element from a `SummarizedExperiment` but will extract the 'i'th
assay element with the `assay(se, i)` syntax:

```{r, quick-start-2}
counts2 <- assay(se, 2)
```

The two alternatives are equivalent:

```{r, quick-start-3}
all(counts1==counts2)
```

# Case Study

We are interested in conducting a differential expression analysis using
`DESeq2` to identify genes over- and under-expressed in different colorectal
tissue phenotypes. Due to substantial technical artifacts between single-end and
paired-end library formats observed in the full `FieldEffectCrc` data set,
samples are grouped according to library format, with paired-end samples in
cohorts A and B and single-end samples in cohort C. Cohort A is the primary
analysis cohort and the one we want to use to discover differentially expressed
genes. To begin, load cohort A as demonstrated in the
[Installation](#installation) section immediately above if not already loaded.

In the following steps, we will use `DESeq2` to perform differential expression
analysis on expression data stored in the `SummarizedExperiment` object for
cohort A. We will adjust for latent batch effects with surrogate variable
analysis as implemented in the `sva` package. This analysis would typically
consume excess memory and time, so for the purpose of demonstration, we perform
the analysis on a small subset of the full data.

## Prepare DESeqDataSet

In order to take advantage of the feature lengths preserved by `tximport`, we
will recreate a `tximport` object from the `SummarizedExperiment` object
downloaded from `ExperimentHub` and use the `DESeqDataSetFromTximport()`
function to create the `DESeqDataSet` object. Alternatively, the
`SummarizedExperiment` object itself could be used to create the `DESeqDataSet`
object using the `DESeqDataSet()` function, but the order of the assays would
need to be re-arranged and the counts would need to be rounded, as the
`DESeqDataSet()` function expects `counts` with integer values to be the first
assay in a `SummarizedExperiment` object, and `abundance` with floating-point
values is the first assay in the `FieldEffectCrc` objects to facilitate
conversion to `tximport` objects. Such a re-ordering is facilitated by the `FieldEffectCrc::reorder_assays()` function. Yet another option is to extract
the `counts` assay as a matrix and use the `DESeqDataSetFromMatrix()` function
to create the `DESeqDataSet` object. Since there are probably some benefits to
feature length normalization as implemented in `DESeq2` when such information is
available, we demonstrate creation of the `tximport` object.

First, make the `tximport` object, which is just an object of class `list` from
base R:

```{r, make-txi}
## manual construction
txi <- as.list(assays(se))
txi$countsFromAbundance <- "no"
## convenience function construction
txi <- make_txi(se)
```

Next, we could assign the clinical annotations from `colData(se)` to an
`S4Vectors::DataFrame` or a `base::data.frame` object, since that is what
`DESeqDataSetFromTximport()` expects for the `colData` argument. However, the
`colData` slot of a `SummarizedExperiment` object already stores the clinical
annotations as an `S4Vectors::DataFrame`, so we are ready to create the
`DESeqDataSet` object:

```{r, make-dds}
if (!requireNamespace("DESeq2", quietly=TRUE)) {
    BiocManager::install("DESeq2")
}
library(DESeq2)
dds <- DESeqDataSetFromTximport(txi, colData(se), ~ sampType)
```

Note that we are interested in expression differences between phenotypes, which
are coded in the `sampType` field, so we include `sampType` in our preliminary
design.

## Filter Genes

Normally, we would pre-filter genes to select genes whose quantities are likely
to be estimated accurately. We could use something like the following code,
which selects for genes with a count greater than 10 RNA molecules in at least a
third of all samples:

```{r, typical-filter}
countLimit <- 10
sampleLimit <- (1/3) * ncol(dds)
keep <- rowSums( counts(dds) > countLimit ) >= sampleLimit
dds <- dds[keep, ]
```

However, to expedite subsequent computational steps in this demonstration, we
will select a random subset of genes and samples for analysis and then filter
out any gene with a count less than 10 across all samples:

```{r, special-filter}
r <- sample(seq_len(nrow(dds)), 6000)
c <- sample(seq_len(ncol(dds)), 250)
dds <- dds[r, c]
r <- rowSums(counts(dds)) >= 10
dds <- dds[r, ]
```

## Estimate Size Factors and Dispersions, Fit Negative Binomial Models, and Perform Wald Test

Amazingly, `DESeq2` estimates size factors (i.e. normalization factors based on
library size and feature length) and dispersions for every gene, fits negative
binomial models for every gene, and performs a Wald test for every gene in a
single function:

```{r, size-factors}
dds <- DESeq(dds)
```

Pretty incredible. Unfortunately, we have to do some more work before we can fit
useful models. The size factor estimates are the key result of the `DESeq()`
function at this stage.

## Estimate Latent Factors

If we trusted our high-throughput assays to be perfect representations of the
underlying biology they were designed to measure, we would be ready to test for
differential expression. However, our dataset is composed of samples from
several studies collected and processed in various ways over several years, so
we have reason to suspect there are latent factors driving non-biological
variation in the expression data. Fortunately, we can use `sva` to estimate the
latent factors and include them in our `DESeq2` design.

For a full tutorial on how to use the `sva` package, see the package
[vignette](https://bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf).
For this demonstration, we will provide only cursory motivations for the code.

### Prepare Inputs

The `sva` functions will take the normalized count data as well as model
matrices as inputs. The full model matrix should include the variables of
interest (e.g. `sampType`) as well as adjustment variables (e.g. `sex`, `age`),
which we do not use here. The null model matrix should include adjustment
variables only (i.e. no variables of interest). Without any adjustment
variables, the null model includes an intercept term alone.

```{r, prep-input}
dat <- counts(dds, normalized=TRUE)  ## extract the normalized counts
mod <- model.matrix(~sampType, data=colData(dds))  ## set the full model
mod0 <- model.matrix(~1, data=colData(dds))  ## set the null model
```

### Identify Number of Factors to Estimate

The `sva` package offers two approaches for estimating the number of surrogate
variables that should be included in a differential expression model, the Buja
Eyuboglu and Leek approaches. Buja Eyuboglu is the default approach.

```{r, get-nsv}
if (!requireNamespace("sva", quietly=TRUE)) {
    BiocManager::install("sva")
}
library(sva)
nsv <- num.sv(dat, mod, method=c("be"), B=20, seed=1)  ## Buja Eyuboglu method
```

### Estimate Factors

The crucial function for estimating surrogate variables in RNA-seq data is
`svaseq()`:

```{r, get-svs}
svs <- svaseq(dat, mod, mod0, n.sv=nsv)
```

The output is a list of 4 elements, the first of which is the latent factor
estimate with a vector for each factor. The name of this element is `sv`.

### Append Latent Factors to colData

We want to include the latent factors as adjustment variables in our statistical
testing. Here, we add the surrogate variables to `colData` and update the
`DESeqDataSet` design formula:

```{r, add-svs}
for (i in seq_len(svs$n.sv)) {
    newvar <- paste0("sv", i)
    colData(dds)[ , newvar] <- svs$sv[, i]
}
nvidx <- (ncol(colData(dds)) - i + 1):ncol(colData(dds))
newvars <- colnames(colData(dds))[nvidx]
d <- formula(
    paste0("~", paste(paste(newvars, collapse="+"), "sampType", sep="+"))
)
design(dds) <- d
```

## Re-Fit Negative Binomial Models and Perform Wald Tests with Surrogate Variables

Now that we have the surrogate variables in the `DESeqDataSet`, we can perform
the differential expression analysis we set out to do:

```{r, test-de}
dds <- DESeq(dds)
```

## Extract Results

We want to extract results from each of the following 3 comparisons:

1. HLT vs NAT
2. NAT vs CRC
3. HLT vs CRC

With three categories under consideration, the `DESeq()` function call tests the
first level against the other two, but the second and third levels are not
tested against each other. (Note that the order of the levels is simply
alphabetical if they are not explicitly set.) Furthermore, the results reported
with the `results()` function are for the comparison of the first level against
the last by default. Therefore, in order to extract all the results of interest,
we set contrasts as follows:

```{r, results}
cons <- list()
m <- combn(levels(colData(dds)$sampType), 2)
for (i in seq_len(ncol(m))) {
    cons[[i]] <- c("sampType", rev(m[, i]))
    names(cons) <- c(
        names(cons)[seq_len(length(cons) - 1)], paste(rev(m[, i]), collapse="v")
    )
}
res <- list()
for (i in seq_len(length(cons))) {
    res[[i]] <- results(dds, contrast=cons[[i]], alpha=0.05)  ## default alpha is 0.1
}
names(res) <- names(cons)
```

## Display Results

Let's take a look at the results:

```{r, display}
lapply(res, head)
lapply(res, summary)
```

# sessionInfo()

```{r, sessionInfo}
sessionInfo()
```