BiocIntro 0.0.10
Description: This workshop will introduce you to the Bioconductor collection of R packages for statistical analysis and comprehension of high-throughput genomic data. The emphasis is on data exploration, using RNA-sequence gene expression experiments as a motivating example. How can I access common sequence data formats from R? How can I use information about gene models or gene annotations in my analysis? How do the properties of my data influence the statistical analyses I should perform? What common workflows can I perform with R and Bioconductor? How do I deal with very large data sets in R? These are the sorts of questions that will be tackled in this workshop.
Requirements: You will need to bring your own laptop. The workshop will use cloud-based resources, so your laptop will need a web browser and WiFi capabilities. Participants should have used R and RStudio for tasks such as those covered in introductory workshops earlier in the week. Some knowledge of the biology of gene expression and of concepts learned in a first course in statistics will be helpful.
Relevance: This workshop is relevant to anyone eager to explore
genomic data in R. The workshop will help connect ‘core’ R concepts
for working with data (e.g., data management via data.frame()
,
statistical modelling with lm()
or t.test()
, visualization using
plot()
or ggplot()
) to the special challenges of working with large
genomic data sets. It will be especially helpful to those who have or
will have their own genomic data, and are interested in more fully
understanding how to work with it in R.
RNA-seq
cell dex
SRR1039508 N61311 untrt
SRR1039509 N61311 trt
SRR1039512 N052611 untrt
SRR1039513 N052611 trt
SRR1039516 N080611 untrt
SRR1039517 N080611 trt
SRR1039520 N061011 untrt
SRR1039521 N061011 trt
source: http://bio.lundberg.gu.se/courses/vt13/rnaseq.html
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003 679 448 873 408 1138
ENSG00000000005 0 0 0 0 0
ENSG00000000419 467 515 621 365 587
... ... ... ... ... ...
SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 1047 770 572
ENSG00000000005 0 0 0
ENSG00000000419 799 417 508
... ... ... ...
Research question
untrt
and trt
experimental treatments?Our goal
Sample information
read.table()
integer()
factor()
and NA
data.frame()
: coordinated management
$
[ , ]
samples_file <-
system.file(package="BiocIntro", "extdata", "samples.tsv")
samples <- read.table(samples_file)
samples
## cell dex avgLength
## SRR1039508 N61311 untrt 126
## SRR1039509 N61311 trt 126
## SRR1039512 N052611 untrt 126
## SRR1039513 N052611 trt 87
## SRR1039516 N080611 untrt 120
## SRR1039517 N080611 trt 126
## SRR1039520 N061011 untrt 101
## SRR1039521 N061011 trt 98
samples$dex <- relevel(samples$dex, "untrt")
Counts
head()
to view the first few.matrix()
rather than data.frame()
counts_file <-
system.file(package="BiocIntro", "extdata", "counts.tsv")
counts <- read.table(counts_file)
dim(counts)
## [1] 63677 8
head(counts)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
counts <- as.matrix(counts)
GRanges
)Row annotations.
url <- "ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz"
BiocFileCache
to download the resource once to a location
that persists across R sessions.library(BiocFileCache)
About Bioconductor packages
BiocFileCache
is available from https://bioconductor.orgBiocFileCache
from the BiocFileCache ‘landing
page’.BiocFileCache
package vignette (access the
vignette from within R: browseVignettes("BiocFileCache")
).BiocManager::install("BiocFileCache")
.?bfcrpath
.gtf_file <- bfcrpath(rnames = url)
## Using temporary cache /var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//RtmpLUzYG5/BiocFileCache
## adding rname 'ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz'
read.table()
or similar, but contain structured information that
we want to represent in R.Common sequence data formats
Use the rtracklayer package to import the file.
library(rtracklayer)
gtf <- import(gtf_file)
A GRanges
object
seqnames()
(e.g.,
chromosome), start()
/ end()
/ width()
, strand()
, etc.$
or mcols()$
to access annotations on ranges.gene_id
column as names()
.rowidx <- gtf$type == "gene"
colidx <- c("gene_id", "gene_name", "gene_biotype")
genes <- gtf[rowidx, colidx]
names(genes) <- genes$gene_id
genes$gene_id <- NULL
genes
## GRanges object with 63677 ranges and 2 metadata columns:
## seqnames ranges strand | gene_name gene_biotype
## <Rle> <IRanges> <Rle> | <character> <character>
## ENSG00000223972 1 11869-14412 + | DDX11L1 pseudogene
## ENSG00000227232 1 14363-29806 - | WASH7P pseudogene
## ENSG00000243485 1 29554-31109 + | MIR1302-10 lincRNA
## ENSG00000237613 1 34554-36081 - | FAM138A lincRNA
## ENSG00000268020 1 52473-54936 + | OR4G4P pseudogene
## ... ... ... ... . ... ...
## ENSG00000198695 MT 14149-14673 - | MT-ND6 protein_coding
## ENSG00000210194 MT 14674-14742 - | MT-TE Mt_tRNA
## ENSG00000198727 MT 14747-15887 + | MT-CYB protein_coding
## ENSG00000210195 MT 15888-15953 + | MT-TT Mt_tRNA
## ENSG00000210196 MT 15956-16023 - | MT-TP Mt_tRNA
## -------
## seqinfo: 265 sequences from an unspecified genome; no seqlengths
SummarizedExperiment
)Three different data sets
counts
: results of the RNAseq workflowsamples
: sample and experimental design informationgenes
: information about the genes that we’ve assayed.Coordinate our manipulation
Use the SummarizedExperiment package and data representation.
[,]
assay()
, rowData()
,
rowRanges()
, colData()
, etc.library(SummarizedExperiment)
samples
rows match the order of the
samples in the columns of the counts
matrix, and the order of the
genes
rows match the order of the rows of the counts
matrix.SummarizedExperiment
to coordinate our data manipulation.samples <- samples[colnames(counts),]
genes <- genes[rownames(counts),]
se <- SummarizedExperiment(
assays = list(counts = counts),
rowRanges = genes, colData = samples
)
se
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(0):
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(2): gene_name gene_biotype
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): cell dex avgLength
Gestalt
t.test()
for each row of the count matrix, asking
whether the trt
samples have on average counts that differ from
the untrt
samples.The DESeq2 pacakge
library(DESeq2)
dds <- DESeqDataSet(se, ~ cell + dex)
fit <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
destats <- results(fit)
destats
## log2 fold change (MLE): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 63677 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003 708.602169691234 -0.38125388742934 0.100654430181804
## ENSG00000000005 0 NA NA
## ENSG00000000419 520.297900552084 0.206812715390398 0.112218674568195
## ENSG00000000457 237.163036796015 0.0379205923946151 0.14344471633862
## ENSG00000000460 57.9326331250967 -0.0881676962628265 0.287141995236272
## ... ... ... ...
## ENSG00000273489 0.275899382507797 1.48372584344306 3.51394515550546
## ENSG00000273490 0 NA NA
## ENSG00000273491 0 NA NA
## ENSG00000273492 0.105978355992386 -0.463691271907546 3.52308373749196
## ENSG00000273493 0.106141666408122 -0.521381077922898 3.53139001322807
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003 -3.78775069056286 0.000152017272514002 0.00128292609656079
## ENSG00000000005 NA NA NA
## ENSG00000000419 1.84294384322566 0.0653372100662581 0.196469601297369
## ENSG00000000457 0.26435684326705 0.791504962999781 0.91141814384918
## ENSG00000000460 -0.307052600196215 0.75880333554496 0.895006448013164
## ... ... ... ...
## ENSG00000273489 0.422239328669782 0.672850337762336 NA
## ENSG00000273490 NA NA NA
## ENSG00000273491 NA NA NA
## ENSG00000273492 -0.131615171950935 0.895288684444562 NA
## ENSG00000273493 -0.147641884914972 0.88262539793309 NA
SummarizedExperiment
, so that we can
manipulate these in a coordianted fashion too.rowData(se) <- cbind(rowData(se), destats)
order()
and head()
to identify the row indexes of the top 20
most differentially expressed (based on adjusted P-value) genes.SummarizedExperiment
to contain just these rows.assay()
of our subset as a heatmaptop20idx <- head( order(rowData(se)$padj), 20)
top20 <- se[top20idx,]
heatmap(assay(top20))
Update row labels and adding information about treatment group.
m <- assay(top20)
rownames(m) <- rowData(top20)$gene_name
trtcolor <- hcl.colors(2, "Pastel 1")[ colData(top20)$dex ]
heatmap(m, ColSideColors = trtcolor)
plot()
to create the pointstext()
to add labels to the two most significant genes.plot(-log10(pvalue) ~ log2FoldChange, rowData(se))
label <- with(
rowData(se),
ifelse(-log10(pvalue) > 130, gene_name, "")
)
text(
-log10(pvalue) ~ log2FoldChange, rowData(se),
label = label, pos = 4, srt=-15
)
Goal
dplyr and ‘tidy’ data
tibble
: a data.frame
with better display properties%>%
, e.g., mtcars %>% count(cyl)
: a pipe that takes a tibble (or
data.frame) tbl
on the left and uses it as an argument to a small
number of functions like count()
on the right.tibble()
, and hence can be
chained together.library(dplyr)
Steps below:
as_tibble()
: create a tibble from rowData(se)
select()
: select specific columns.arrange()
: arrange all rows from smallest to largest padj
head()
: filter to the first 20 rowsrowData(se) %>%
as_tibble(rownames = "ensembl_id") %>%
select(ensembl_id, gene_name, baseMean, log2FoldChange, padj) %>%
arrange(padj) %>%
head(n = 20)
## # A tibble: 20 x 5
## ensembl_id gene_name baseMean log2FoldChange padj
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSG00000152583 SPARCL1 997. 4.57 4.00e-132
## 2 ENSG00000165995 CACNB2 495. 3.29 7.06e-131
## 3 ENSG00000120129 DUSP1 3409. 2.95 2.20e-126
## 4 ENSG00000101347 SAMHD1 12703. 3.77 4.32e-126
## 5 ENSG00000189221 MAOA 2342. 3.35 3.96e-120
## 6 ENSG00000211445 GPX3 12286. 3.73 1.39e-108
## 7 ENSG00000157214 STEAP2 3009. 1.98 1.48e-103
## 8 ENSG00000162614 NEXN 5393. 2.04 2.98e-100
## 9 ENSG00000125148 MT2A 3656. 2.21 5.81e- 94
## 10 ENSG00000154734 ADAMTS1 30315. 2.35 5.87e- 88
## 11 ENSG00000139132 FGD4 1223. 2.23 1.24e- 83
## 12 ENSG00000162493 PDPN 1100. 1.89 1.32e- 83
## 13 ENSG00000134243 SORT1 5511. 2.20 2.01e- 82
## 14 ENSG00000179094 PER1 777. 3.19 2.73e- 81
## 15 ENSG00000162692 VCAM1 508. -3.69 6.78e- 81
## 16 ENSG00000163884 KLF15 561. 4.46 2.51e- 77
## 17 ENSG00000178695 KCTD12 2650. -2.53 7.07e- 76
## 18 ENSG00000198624 CCDC69 2057. 2.92 3.58e- 70
## 19 ENSG00000107562 CXCL12 25136. -1.91 4.54e- 70
## 20 ENSG00000148848 ADAM12 1365. -1.81 6.14e- 70
Packages
BiocManager::install("BiocFileCache")
library(BiocFileCache)
.?bfcrpath
Data structures
data.frame()
, matrix()
.GRanges
for representing genomic ranges
seqnames()
, start()
, etc for core
components$
or mcols()$
for annotationsSummarizedExperiment
for coordinated manipulation of assay data
with row and column annotation.
[,]
to subset assay and annotaions in a coordinated fashion.assay()
, rowRanges()
, rowData()
, colData()
to access
components.Analysis
Single-cell RNA-seq: an amazing resource: Orchestarting Single Cell Analysis with Bioconductor, including the scran and scater packages.
Other prominent domains of analysis (check out biocViews)
Participate!
Research reported in this presentation was supported by the NCI and NHGRI of the National Institutes of Health under award numbers U24CA232979, U41HG004059, and U24CA180996. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
A portion of this work is supported by the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation.
## R version 3.6.1 Patched (2019-12-01 r77489)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_0.8.3 DESeq2_1.26.0
## [3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
## [5] BiocParallel_1.20.0 matrixStats_0.55.0
## [7] Biobase_2.46.0 rtracklayer_1.46.0
## [9] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
## [11] IRanges_2.20.1 S4Vectors_0.24.0
## [13] BiocGenerics_0.32.0 BiocFileCache_1.10.2
## [15] dbplyr_1.4.2 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
## [4] httr_1.4.1 tools_3.6.1 backports_1.1.5
## [7] utf8_1.1.4 R6_2.4.1 rpart_4.1-15
## [10] Hmisc_4.3-0 DBI_1.0.0 lazyeval_0.2.2
## [13] colorspace_1.4-1 nnet_7.3-12 tidyselect_0.2.5
## [16] gridExtra_2.3 bit_1.1-14 curl_4.2
## [19] compiler_3.6.1 cli_1.1.0 htmlTable_1.13.2
## [22] bookdown_0.16 scales_1.1.0 checkmate_1.9.4
## [25] genefilter_1.68.0 rappdirs_0.3.1 stringr_1.4.0
## [28] digest_0.6.23 Rsamtools_2.2.1 foreign_0.8-72
## [31] rmarkdown_1.18 XVector_0.26.0 base64enc_0.1-3
## [34] pkgconfig_2.0.3 htmltools_0.4.0 htmlwidgets_1.5.1
## [37] rlang_0.4.2 rstudioapi_0.10 RSQLite_2.1.2
## [40] acepack_1.4.1 RCurl_1.95-4.12 magrittr_1.5
## [43] GenomeInfoDbData_1.2.2 Formula_1.2-3 Matrix_1.2-18
## [46] Rcpp_1.0.3 munsell_0.5.0 fansi_0.4.0
## [49] lifecycle_0.1.0 stringi_1.4.3 yaml_2.2.0
## [52] zlibbioc_1.32.0 grid_3.6.1 blob_1.2.0
## [55] crayon_1.3.4 lattice_0.20-38 Biostrings_2.54.0
## [58] splines_3.6.1 annotate_1.64.0 locfit_1.5-9.1
## [61] zeallot_0.1.0 knitr_1.26 pillar_1.4.2
## [64] geneplotter_1.64.0 codetools_0.2-16 XML_3.98-1.20
## [67] glue_1.3.1 evaluate_0.14 latticeExtra_0.6-28
## [70] data.table_1.12.6 BiocManager_1.30.10 vctrs_0.2.0
## [73] gtable_0.3.0 purrr_0.3.3 assertthat_0.2.1
## [76] ggplot2_3.2.1 xfun_0.11 xtable_1.8-4
## [79] survival_3.1-7 tibble_2.1.3 GenomicAlignments_1.22.1
## [82] AnnotationDbi_1.48.0 memoise_1.1.0 cluster_2.1.0