---
title: "3. Adding Annotation To Your Analysis"
author: "Valerie Obenchain (valerie.obenchain@roswellpark.org)<br />
    Lori Shepherd (lori.shepherd@roswellpark.org)<br />
    Martin Morgan (martin.morgan@roswellpark.org)<br />
    Stanford University, Stanford, CA<br />
    25 - 26 June, 2016"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{3. Adding Annotation To Your Analysis}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
    library(AnnotationDbi)
    library(AnnotationHub)
    library(GenomicFeatures)
    library(biomaRt)
    library(org.Hs.eg.db)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})
```

The material in this course requires R version 3.3 and Bioconductor
version 3.4

```{r configure-test}
stopifnot(
    getRversion() >= '3.3' && getRversion() < '3.4',
    BiocInstaller::biocVersion() == "3.4"
)
```

# Annotation

## Model organisms

### Gene model annotation resources -- `TxDb` packages

e.g., `{r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene")}

```{r gene-model-discovery}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
methods(class=class(txdb))
```

`TxDb` objects

- Curatated annotation resources -- http://bioconductor.org/packages/biocViews
- Underlying sqlite database -- `dbfile(txdb)`
- Make your own: `GenomicFeatures::makeTxDbFrom*()`

Accessing gene models

- `exons()`, `transcripts()`, `genes()`, `cds()` (coding sequence)
- `promoters()` & friends
- `exonsBy()` & friends -- exons by gene, transcript, ...
- 'select' interface: `keytypes()`, `columns()`, `keys()`, `select()`,
  `mapIds()`

```{r txdb-exons}
exons(txdb)
exonsBy(txdb, "tx")
```


### Whole-genome sequence -- `BSgenome` packages

e.g., `{r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")}

```{r bsgenome}
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
getSeq(genome, exons(txdb)[1:100])
```

### Identifier mapping -- `OrgDb`

```{r org}
library(org.Hs.eg.db)
org.Hs.eg.db
```

`OrgDb` objects

- Curated resources, underlying sqlite data base, like `TxDb`
- make your own: [AnnotationForge][] (but see the AnnotationHub,
  below!)
- 'select' interface: `keytypes()`, `columns()`, `keys()`, `select()`,
  `mapIds()`

`select()`

- Vector of keys, desired columns
- Specification of key type

    ```{r select}
    select(org.Hs.eg.db, c("BRCA1", "PTEN"), c("ENTREZID", "GENENAME"), "SYMBOL")
    keytypes(org.Hs.eg.db)
    columns(org.Hs.eg.db)
    ```

Related functionality 

- `mapIds()` -- special case for mapping from 1 identifier to another
- `OrganismDb` objects: combined `org.*`, `TxDb.*`, and other
  annotation resources for easy access

    ```{r organismdb}
    library(Homo.sapiens)
    select(Homo.sapiens, c("BRCA1", "PTEN"), 
           c("TXNAME", "TXCHROM", "TXSTART", "TXEND"), 
           "SYMBOL")
    ```
    
## Other annotation resources -- `biomaRt`, `AnnotationHub`

### _biomaRt_ & friends

http://biomart.org; _Bioconductor_ package [biomaRt][]

```{r biomart, eval=FALSE}
## NEEDS INTERNET ACCESS !!
library(biomaRt)
head(listMarts(), 3)                      ## list marts
head(listDatasets(useMart("ensembl")), 3) ## mart datasets
ensembl <-                                ## fully specified mart
    useMart("ensembl", dataset = "hsapiens_gene_ensembl")

head(listFilters(ensembl), 3)             ## filters
myFilter <- "chromosome_name"
substr(filterOptions(myFilter, ensembl), 1, 50) ## return values
myValues <- c("21", "22")
head(listAttributes(ensembl), 3)          ## attributes
myAttributes <- c("ensembl_gene_id","chromosome_name")

## assemble and query the mart
res <- getBM(attributes =  myAttributes, filters =  myFilter,
             values =  myValues, mart = ensembl)
```

Other internet resources

- [biomaRt](http://biomart.org)                       Ensembl and other annotations
- [PSICQUIC](https://code.google.com/p/psicquic)      Protein interactions
- [uniprot.ws](http://uniprot.org)                    Protein annotations
- [KEGGREST](http://www.genome.jp/kegg)               KEGG pathways
- [SRAdb](http://www.ncbi.nlm.nih.gov/sra)            Sequencing experiments
- [rtracklayer](http://genome.ucsc.edu)               USCS genome tracks
- [GEOquery](http://www.ncbi.nlm.nih.gov/geo/)        Array and other data
- [ArrayExpress](http://www.ebi.ac.uk/arrayexpress/)  Array and other data
- ...

### _AnnotationHub_

- _Bioconductor_ package [AnnotationHub][]
- Meant to ease use of 'consortium' and other genome-scale resources
- Simplify discovery, retrieval, local management, and import to
  standard _Bioconductor_ representations

Example: Ensembl 'GTF' files to _R_ / _Bioconductor_ GRanges and TxDb

```{r annotationhub-gtf, eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
hub
query(hub, c("Ensembl", "80", "gtf"))
## ensgtf = display(hub)                   # visual choice
hub["AH47107"]
gtf <- hub[["AH47107"]]
gtf
txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf)
```

Example: non-model organism `OrgDb` packages

```{r annotationhub-orgdb, eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
query(hub, "OrgDb")
```

Example: Map Roadmap epigenomic marks to hg38

- Roadmap BED file as _GRanges_

    ```{r annotationhub-roadmap, eval=FALSE}
    library(AnnotationHub)
    hub <- AnnotationHub()
    query(hub , c("EpigenomeRoadMap", "E126", "H3K4ME2"))
    E126 <- hub[["AH29817"]]
    ```

- UCSC 'liftOver' file to map coordinates

    ```{r annotationhub-liftover, eval=FALSE}
    query(hub , c("hg19", "hg38", "chainfile"))
    chain <- hub[["AH14150"]]
    ```

- lift over -- possibly one-to-many mapping, so _GRanges_ to _GRangesList_

    ```{r liftover, eval=FALSE}
    library(rtracklayer)
    E126hg38 <- liftOver(E126, chain)
    E126hg38
    ```

# Annotating Variants

Example: read variants from a VCF file, and annotate with respect to a
known gene model
  
```{r vcf, message=FALSE}
## input variants
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf) <- "chr22"
## known gene model
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
coding <- locateVariants(rowRanges(vcf),
    TxDb.Hsapiens.UCSC.hg19.knownGene,
    CodingVariants())
head(coding)
```

# Resources

Acknowledgements

- The research reported in this presentation was supported by the
  National Cancer Institute and the National Human Genome Research
  Institute of the National Institutes of Health under Award numbers
  U24CA180996 and U41HG004059, and the National Science Foundation
  under Award number 1247813. The content is solely the responsibility
  of the authors and does not necessarily represent the official views
  of the National Institutes of Health or the National Science
  Foundation.

## `sessionInfo()`

```{r sessionInfo}
sessionInfo()
```

[AnnotationHub]: http://bioconductor.org/packages/AnnotationHub
[TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[ChIPseeker]: http://bioconductor.org/packages/ChIPseeker
[VariantFiltering]: http://bioconductor.org/packages/VariantFiltering
[AnnotationForge]: http://bioconductor.org/packages/AnnotationForge
[biomaRt]: http://bioconductor.org/packages/biomaRt