---
title: "S.3 -- Introduction to _Bioconductor_"
author: Martin Morgan <martin.morgan@roswellpark.org>
date: "21 November, 2016"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{S.3 -- Introduction to Bioconductor}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
options(width=100)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

```{r setup, echo=FALSE}
suppressPackageStartupMessages({
    library(Biostrings)
    library(GenomicRanges)
    library(SummarizedExperiment)
    library(airway)
    library(rtracklayer)
    library(ShortRead)
    library(GenomicAlignments)
    library(RNAseqData.HNRNPC.bam.chr14)
    library(VariantAnnotation)
})
```

# Project Overview

## About

[Bioconductor][]: Analysis and comprehension of high-throughput
genomic data

- Statistical analysis: large data, technological artifacts, designed
  experiments; rigorous
- Comprehension: biological context, visualization, reproducibility
- High-throughput
    - Sequencing: RNASeq, ChIPSeq, variants, copy number, ...
    - Microarrays: expression, SNP, ...
    - Flow cytometry, proteomics, images, ...

Packages, vignettes, work flows

- 1296 software packages; also...
    - 'Annotation' packages -- static data bases of identifier maps,
      gene models, pathways, etc; e.g., [TxDb.Hsapiens.UCSC.hg19.knownGene][]
    - 'Experiment packages -- data sets used to illustrate software
      functionality, e.g., [airway][]
- Discover and navigate via [biocViews][]
- Package 'landing page'
    - Title, author / maintainer, short description, citation,
      installation instructions, ..., download statistics
- All user-visible functions have help pages, most with runnable
  examples
- 'Vignettes' an important feature in Bioconductor -- narrative
  documents illustrating how to use the package, with integrated code
- 'Release' (every six months) and 'devel' branches
- [Support site](https://support.bioconductor.org);
  [videos](https://www.youtube.com/user/bioconductor), [recent
  courses](https://bioconductor.org/help/course-materials/)

Package installation and use

- A package needs to be installed once, using the instructions on the
  package landing page (e.g., [DESeq2][]).

    ```{r install, eval=FALSE}
    source("https://bioconductor.org/biocLite.R")
    biocLite(c("DESeq2", "org.Hs.eg.db"))
    ```

- `biocLite()` installs _Bioconductor_, [CRAN][], and github packages.

- Once installed, the package can be loaded into an R session

    ```{r require}
    library(GenomicRanges)
    ```

    and the help system queried interactively, as outlined above:

    ```{r help-bioc, eval=FALSE}
    help(package="GenomicRanges")
    vignette(package="GenomicRanges")
    vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
    ?GRanges
    ```

## Key concepts

Goals

- Reproducibility
- Interoperability
- Use

What a few lines of _R_ has to say

```{r five-lines}
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X=x, Y=y)
plot(Y ~ X, df)
fit <- lm(Y ~ X, df)
anova(fit)
abline(fit)
```

Classes and methods -- "S3"

- `data.frame()`
  - Defines _class_ to coordinate data
  - Creates an _instance_ or _object_

- `plot()`, `lm()`, `anova()`, `abline()`: _methods_ defined on
  _generics_ to transform instances

- Discovery and help

    ```{r help-r, eval=FALSE}
    class(fit)
    methods(class=class(fit))
    methods(plot)
    ?"plot"
    ?"plot.formula"
    ```

- tab completion!

_Bioconductor_ classes and methods -- "S4"

- Example: working with DNA sequences

    ```{r classes-and-methods}
    library(Biostrings)
    dna <- DNAStringSet(c("AACAT", "GGCGCCT"))
    reverseComplement(dna)
    ```

- Discovery and help

    ```{r classes-and-methods-discovery, eval=FALSE}
    class(dna)
    ?"DNAStringSet-class"
    ?"reverseComplement,DNAStringSet-method"
    ```

## High-throughput sequence analysis work flows

1. Experimental design

2. Wet-lab sequence preparation (figure from http://rnaseq.uoregon.edu/)

    ![](our_figures/fig-rna-seq.png)

3. (Illumina) Sequencing (Bentley et al., 2008,
   doi:10.1038/nature07517)

    ![](http://www.nature.com/nature/journal/v456/n7218/images/nature07517-f1.2.jpg)

    - Primary output: FASTQ files of short reads and their [quality
      scores](http://en.wikipedia.org/wiki/FASTQ_format#Encoding)

4. Alignment
    - Choose to match task, e.g., [Rsubread][], Bowtie2 good for ChIPseq,
      some forms of RNAseq; BWA, GMAP better for variant calling
    - Primary output: BAM files of aligned reads
    - More recently: [kallisto][] and similar programs that produce
      tables of reads aligned to transcripts
5. Reduction
    - e.g., RNASeq 'count table' (simple spreadsheets), DNASeq called
      variants (VCF files), ChIPSeq peaks (BED, WIG files)
6. Analysis
    - Differential expression, peak identification, ...
7. Comprehension
    - Biological context

## _Bioconductor_ sequencing ecosystem

![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png)

# High-Throughput Sequence Data

## DNA (and other) sequences

[Biostrings][]

```{r}
library(Biostrings)
data(phiX174Phage)
phiX174Phage
letterFrequency(phiX174Phage, c("A", "C", "G", "T"))
letterFrequency(phiX174Phage, "GC", as.prob=TRUE)
```

## Genomic ranges

[GenomicRanges][]

- `GRanges()`: genomic coordinates to represent annotations (exons,
  genes, regulatory marks, ...) and data (called peaks, variants,
  aligned reads)

    ![Alt GRanges](our_figures/GRanges.png)

- `GRangesList()`: genomic coordinates grouped into list elements
  (e.g., paired-end reads; exons grouped by transcript)

    ![Alt GRangesList](our_figures/GRangesList.png)

Operations

  ![Alt GRanges operations](our_figures/RangeOperations.png)

- intra-range: act on each range independently
    - e.g., `shift()`
- inter-range: act on all ranges in a `GRanges` object or
  `GRangesList` element
      - e.g., `reduce()`; `disjoin()`
- between-range: act on two separate `GRanges` or `GRangesList`
  objects
      - e.g., `findOverlaps()`, `nearest()`

```{r ranges, message=FALSE}
library(GenomicRanges)
gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")
shift(gr, 1)                            # intra-range
range(gr)                               # inter-range
reduce(gr)                              # inter-range
snps <- GRanges("A", IRanges(c(11, 17, 24), width=1))
findOverlaps(snps, gr)                  # between-range
setdiff(range(gr), gr)                  # 'introns'
```

## Summarized experiments

[SummarizedExperiment][]

![Alt SummarizedExperiment](our_figures/SummarizedExperiment.png)

- Coordinate feature x sample 'assays' with row (feature) and column
  (sample) descriptions.
- 'assays' can be any matrix-like object, including very large on-disk
  representations such as [HDF5Array][]

### _SummarizedExperiment_ exercise

The [airway][] experiment data package summarizes an RNA-seq
experiment investigating human smooth-muscle airway cell lines treated
with dexamethasone. Load the library and data set.

```{r}
library(airway)
data(airway)
airway
```

`airway` is an example of the _SummarizedExperiment_ class. Explore
its `assay()` (the matrix of counts of reads overlapping genomic
regions of interest in each sample), `colData()` (a description of
each sample), and `rowRanges()` (a description of each region of
interest; here each region is an ENSEMBL gene).

```{r}
x <- assay(airway)
class(x)
dim(x)
head(x)
colData(airway)
rowRanges(airway)
```

It's easy to subset a _SummarizedExperiment_ on rows, columns and
assays, e.g., retaining just those samples in the `trt` level of the
`dex` factor. Accessing elements of the column data is common, so
there is a short-cut.

```{r}
cidx <- colData(airway)$dex %in% "trt"
airway[, cidx]
## shortcut
airway[, airway$dex %in% "trt"]
```

It's also easy to perform range-based operations on
`SummarizedExperiment` objects, e.g., querying for range of chromosome
14 and then subsetting to contain only genes on this chromosome. Range
operations on rows are very common, so there are shortcuts here, too.

```{r}
chr14 <- as(seqinfo(rowRanges(airway)), "GRanges")["14"]
ridx <- rowRanges(airway) %over% chr14
airway[ridx,]
## shortcut
chr14 <- as(seqinfo(airway), "GRanges")["14"]
airway[airway %over% chr14,]
```

Use the `assay()` and `rowSums()` function to remove all rows from the
`airway` object that have 0 reads overlapping all samples. Summarize
the library size (column sums of `assay()`) and plot a histogram of
the distribution of reads per feature of interest.

## BED, GFF, GTF, WIG import and export

Genome annotations: BED, WIG, GTF, etc. files. E.g., GTF:

- Component coordinates

        7   protein_coding  gene        27221129    27224842    .   -   . ...
        ...
        7   protein_coding  transcript  27221134    27224835    .   -   . ...
        7   protein_coding  exon        27224055    27224835    .   -   . ...
        7   protein_coding  CDS         27224055    27224763    .   -   0 ...
        7   protein_coding  start_codon 27224761    27224763    .   -   0 ...
        7   protein_coding  exon        27221134    27222647    .   -   . ...
        7   protein_coding  CDS         27222418    27222647    .   -   2 ...
        7   protein_coding  stop_codon  27222415    27222417    .   -   0 ...
        7   protein_coding  UTR         27224764    27224835    .   -   . ...
        7   protein_coding  UTR         27221134    27222414    .   -   . ...

- Annotations

        gene_id "ENSG00000005073"; gene_name "HOXA11"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
        ...
        ... transcript_id "ENST00000006015"; transcript_name "HOXA11-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS5411";
        ... exon_number "1"; exon_id "ENSE00001147062";
        ... exon_number "1"; protein_id "ENSP00000006015";
        ... exon_number "1";
        ... exon_number "2"; exon_id "ENSE00002099557";
        ... exon_number "2"; protein_id "ENSP00000006015";
        ... exon_number "2";
        ...

[rtracklayer][]

- `import()`: import various formats to `GRanges` and similar instances
- `export()`: transform from `GRanges` and similar types to BED, GTF, ...
- Also, functions to interactively drive UCSC genome browser with data
  from _R_ / _Bioconductor_

## FASTQ files

Sequenced reads: FASTQ files

    @ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
    CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
    +
    HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
    @ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
    GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
    +
    DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################

[ShortRead][]

- `readFastq()`: input
- `FastqStreamer()`: iterate through FASTQ files
- `FastqSampler()`: sample from FASTQ files, e.g., for quality assessment
- Functions for trimming and filters FASTQ files, QA assessment

## Aligned reads

Aligned reads: BAM files

- Header

        @HD     VN:1.0  SO:coordinate
        @SQ     SN:chr1 LN:249250621
        @SQ     SN:chr10        LN:135534747
        @SQ     SN:chr11        LN:135006516
        ...
        @SQ     SN:chrY LN:59373566
        @PG     ID:TopHat       VN:2.0.8b       CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq

- Alignments: ID, flag, alignment and mate

        ERR127306.7941162       403     chr14   19653689        3       72M             =       19652348        -1413  ...
        ERR127306.22648137      145     chr14   19653692        1       72M             =       19650044        -3720  ...
        ERR127306.933914        339     chr14   19653707        1       66M120N6M       =       19653686        -213   ...

- Alignments: sequence and quality

        ... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC        *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%))
        ... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG        '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)****
        ... TGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCT        '******&%)&)))&")')'')'*((******&)&'')'))$))'')&))$)**&&****************

- Alignments: Tags

        ... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:2  CC:Z:chr22      CP:i:16189276   HI:i:0
        ... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:3  CC:Z:=  CP:i:19921600   HI:i:0
        ... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:4  MD:Z:72 YT:Z:UU XS:A:+  NH:i:3  CC:Z:=  CP:i:19921465   HI:i:0
        ... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:4  MD:Z:72 YT:Z:UU XS:A:+  NH:i:2  CC:Z:chr22      CP:i:16189138   HI:i:0

[GenomicAligments][]

- `readGAlignments()`: Single-end reads
- `readGAlignmentPairs()`, `readGAlignmentsList()`: paired end reads

Working with large files

- `ScanBamParam()`: restrict input
- `BamFile(, yieldSize=)`: iteration
- [GenomicFiles][] provides useful helpers, e.g., `reduceByYield()`

### _GenomicAlignments_ exercise

The [RNAseqData.HNRNPC.bam.chr14][] package is an example of an
experiment data package. It contains a subset of BAM files used in a
gene knock-down experiment, as described in
`?RNAseqData.HNRNPC.bam.chr14`. Load the package and get the path to
the BAM files.

```{r}
library(RNAseqData.HNRNPC.bam.chr14)
fls = RNAseqData.HNRNPC.bam.chr14_BAMFILES
basename(fls)
```

Create `BamFileList()`, basically telling R that these are paths to
BAM files rather than, say, text files from a spreadsheet.

```{r}
library(GenomicAlignments)
bfls = BamFileList(fls)
bfl = bfls[[1]]
```

Input and explore the aligments. See `?readGAlignments` and
`?GAlignments` for details on how to manipulate these objects.

```{r}
ga = readGAlignments(bfl)
ga
table(strand(ga))
```

Many of the reads have cigar "72M". What does this mean? Can you
create a subset of reads that do not have this cigar? Interpret some
of the non-72M cigars. Any hint about what these cigars represent?

```{r}
tail(sort(table(cigar(ga))))
ga[cigar(ga) != "72M"]
```

Use the function `summarizeJunctions()` to identify genomic regions
that are spanned by reads with complicated cigars. Can you use the
argument `with.revmap=TRUE` to extract the reads supporting a
particular (e.g., first) junction?

```{r}
summarizeJunctions(ga)
junctions <- summarizeJunctions(ga, with.revmap=TRUE)
ga[ junctions$revmap[[1]] ]
```

It is possible to do other actions on BAM files, e.g., calculating the
'coverage' (reads overlapping each base).

```{r}
coverage(bfl)$chr14
```
    
## Called variants: VCF files

- Header

          ##fileformat=VCFv4.2
          ##fileDate=20090805
          ##source=myImputationProgramV3.1
          ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
          ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
          ##phasing=partial
          ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
          ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
          ...
          ##FILTER=<ID=q10,Description="Quality below 10">
          ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
          ...
          ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
          ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

- Location

          #CHROM POS     ID        REF    ALT     QUAL FILTER ...
          20     14370   rs6054257 G      A       29   PASS   ...
          20     17330   .         T      A       3    q10    ...
          20     1110696 rs6040355 A      G,T     67   PASS   ...

- Variant INFO

          #CHROM POS     ...	INFO                              ...
          20     14370   ...	NS=3;DP=14;AF=0.5;DB;H2           ...
          20     17330   ...	NS=3;DP=11;AF=0.017               ...
          20     1110696 ...	NS=2;DP=10;AF=0.333,0.667;AA=T;DB ...

- Genotype FORMAT and samples

          ... POS     ...  FORMAT      NA00001        NA00002        NA00003
          ... 14370   ...  GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
          ... 17330   ...  GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
          ... 1110696 ...  GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4

[VariantAnnotation][]

- `readVcf()`: VCF input
- `ScanVcfParam()`: restrict input to necessary fields / ranges
- `VcfFile()`: indexing and iterating through large VCF files
- `locateVariants()`: annotate in relation to genes, etc; see also
  [ensemblVEP][], [VariantFiltering][]
- `filterVcf()`: flexible filtering

[Bioconductor]: https://bioconductor.org
[CRAN]: https://cran.r-project.org
[biocViews]: https://bioconductor.org/packages/

[HDF5Array]: https://bioconductor.org/packages/HDF5Array
[AnnotationDbi]: https://bioconductor.org/packages/AnnotationDbi
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub
[BSgenome.Hsapiens.UCSC.hg19]: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19
[BSgenome]: https://bioconductor.org/packages/BSgenome
[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[Biostrings]: https://bioconductor.org/packages/Biostrings
[CNTools]: https://bioconductor.org/packages/CNTools
[ChIPQC]: https://bioconductor.org/packages/ChIPQC
[ChIPseeker]: https://bioconductor.org/packages/ChIPseeker
[DESeq2]: https://bioconductor.org/packages/DESeq2
[DiffBind]: https://bioconductor.org/packages/DiffBind
[GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments
[GenomicFeatures]: https://bioconductor.org/packages/GenomicFeatures
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[Gviz]: https://bioconductor.org/packages/Gviz
[Homo.sapiens]: https://bioconductor.org/packages/Homo.sapiens
[IRanges]: https://bioconductor.org/packages/IRanges
[KEGGREST]: https://bioconductor.org/packages/KEGGREST
[OmicCircos]: https://bioconductor.org/packages/OmicCircos
[PSICQUIC]: https://bioconductor.org/packages/PSICQUIC
[Rsamtools]: https://bioconductor.org/packages/Rsamtools
[Rsubread]: https://bioconductor.org/packages/Rsubread
[ShortRead]: https://bioconductor.org/packages/ShortRead
[SomaticSignatures]: https://bioconductor.org/packages/SomaticSignatures
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
[TxDb.Hsapiens.UCSC.hg19.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation
[VariantFiltering]: https://bioconductor.org/packages/VariantFiltering
[VariantTools]: https://bioconductor.org/packages/VariantTools
[airway]: https://bioconductor.org/packages/airway
[biomaRt]: https://bioconductor.org/packages/biomaRt
[cn.mops]: https://bioconductor.org/packages/cn.mops
[csaw]: https://bioconductor.org/packages/csaw
[edgeR]: https://bioconductor.org/packages/edgeR
[ensemblVEP]: https://bioconductor.org/packages/ensemblVEP
[epivizr]: https://bioconductor.org/packages/epivizr
[ggbio]: https://bioconductor.org/packages/ggbio
[h5vc]: https://bioconductor.org/packages/h5vc
[limma]: https://bioconductor.org/packages/limma
[metagenomeSeq]: https://bioconductor.org/packages/metagenomeSeq
[org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db
[org.Sc.sgd.db]: https://bioconductor.org/packages/org.Sc.sgd.db
[phyloseq]: https://bioconductor.org/packages/phyloseq
[rtracklayer]: https://bioconductor.org/packages/rtracklayer
[snpStats]: https://bioconductor.org/packages/snpStats

[dplyr]: https://cran.r-project.org/package=dplyr
[data.table]: https://cran.r-project.org/package=data.table
[Rcpp]: https://cran.r-project.org/package=Rcpp
[kallisto]: https://pachterlab.github.io/kallisto