Bioconductor Newsletter

posted by Valerie Obenchain, January 2015

Welcome to the January 2015 newsletter. This is a quarterly look behind the scenes at recent developments and exploratory work going on in the core group and Bioconductor community. This issue covers Docker containers, work on coordinate mapping, changes to the algorithm underlying overlap operations and an overview of csaw, the most recent package contribution from the Smyth group. We want this newsletter to be valuable to you so please share comments and feedback.

Contents

Infrastructure Developments

Docker

Anyone who has attended the annual BioC conference or participated in a class at the Hutchinson Center is familiar with the Bioconductor Amazon Machine Images (AMI). Dan configures these images with the current version of R and all necessary package dependencies and non-R software. Providing a pre-configured environment to participants has eliminated many “day of” problems such as internet overload due to concurrent downloads, installation of the wrong package version and the inability to access common sample data used in the course.

Recently Dan has been looking into the Docker software as another approach to providing pre-configured environments. The Docker containers operate using process level isolation and are therefore more lightweight than a virtual machine. Additionally the containers have access to the full physical machine.

These containers are isolated, reproducible environments useful for development or production. In contrast to the AMI, a Docker container could be used on a desktop or laptop instead of the cloud or other remote hardware. We envision them being useful for

  • reproducibility: providing identical, reproducible environments

  • convenience: deploying on any (Linux, Mac, or Windows) machine, or in the cloud via Amazon’s Elastic Container Service (ECS)

  • time saver; instead of waiting for packages to compile on Linux, you can download a container in which packages are already installed

This work is exploratory and not yet supported. There will be an announcement on the mailing list when containers are ready for use. Those interested in following the development can visit the bioc_docker GitHub repository.

NCList

One of Hervé’s recent projects was replacing the interval tree-based algorithm used in finding and counting overlaps. The change was motivated by decreased performance observed when comparing many different chromosomes or many overlapping ranges. He decided on an approach based on the Nested Containment List algorithm by Alexander V. Alekseyenko and Christopher J. Lee.

In BioC 3.1 (current devel) overlaps operations on GRanges and/or GRangesList objects are approximately 3x to 10x faster than in BioC 3.0 for a medium size data set (e.g. 25 million reads). Memory usage is also reduced by ~ 25% or more. Numbers will vary depending on the size of the data; the larger the data the greater the improvement.

The user-visible change is the ‘algorithm’ argument added to most overlap-based operations. This allows a choice between the new (algorithm=”nclist”) and the old (algorithm=”intervaltree”).

At the time of this writing, there are 3 situations where “nclist” and “intervaltree” produce different output:

  • With ‘nclist’, zero-width ranges are interpreted as insertion points and are considered to overlap with ranges that contain them.

  • When using ‘select=”arbitrary”’, the ‘nclist’ algorithm will generally not select the same hits as the old algorithm.

  • When using ‘intervaltree’, the ‘maxgap’ argument has a special meaning if ‘type’ is “start”, “end”, or “within”. This is not yet the case with the new ‘nclist’.

For a complete description of changes and future activity please see this post on the bioc-devel mailing list.

Coordinate Mapping

Translating (mapping) coordinates between genome, transcript and protein space is a common task in bioinformatics. Over the past quarter a group of us have been working to expand and harmonize mapping capabilities in the infrastructure.

Functions mapping between genome and the transcriptome will be added to GenomicFeatures and methods for mapping via a CIGAR alignment will be added to GenomicAlignments. The Pbase package has a mapping vignette that demonstrates the steps involved when mapping from protein to genomic position. Once the API for the transcriptome and alignment mapping methods is stable, similar functions can be added to Pbase to round out the suite of mappers.

Others involved in the project are Michael Lawrence, Hervé Pagès, Laurent Gatto, Robert Castelo and Martin Morgan. Discussions and progress can be followed at the biocCoordinateMapping Google Group.

A related mapping task is migrating data from one assembly to another either via alignment tool or by converting assembly coordinates. The implementation of the UCSC LiftOver tool in rtracklayer (thanks Michael) and the availability of the UCSC chain files in AnnotationHub (thanks Sonali) make this a straightforward operation.

library(AnnotationHub)
hub <- AnnotationHub()

The chain file format describes a pairwise alignment that allows gaps in both sequences simultaneously. AnnotationHub currently hosts 1113 chain files.

allChains <- query(hub, 'chain')

## > length(allChains)
## [1] 1113

Search for the conversion from hg38 to hg19:

query(hub, 'hg38ToHg19')

## > query(hub, 'hg38ToHg19')
## class: AnnotationHub 
## hub: https://annotationhub.bioconductor.org 
## cache: /home/vobencha/.AnnotationHub 
## display()ing 1 of 1 records on 6 mcols()
##                            title            dataprovider      species
## AH14108 hg38ToHg19.over.chain.gz hgdownload.cse.ucsc.edu Homo sapiens
##         taxonomyid genome                                description
## AH14108       9606   hg38 UCSC liftOver chain file from hg38 to hg19
##                                            tags
## AH14108 liftOver, chain, UCSC, genome, homology

Methods in AnnotationHub use import.chain() from rtracklayer to read data into R. The return object is a Chain class where data for each chromosome are parsed into a ChainBlock class.

chain <- query(hub, 'hg38ToHg19')[[1]] 
chain

## > chain
## Chain of length 25
## names(25): chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX 
## chrY chrM

class(chain$chr6)

## > class(chain$chr6)
## [1] "ChainBlock"
## attr(,"package")
## [1] "rtracklayer"

liftOver() translates coordinates and outputs a GRangesList .

gr <- GRanges(c("chr7", "chr2"), IRanges(c(75625897, 68010781), width=1)) 
res <- liftOver(gr, chain)

## > res
## GRangesList object of length 2:
## $1 
## GRanges object with 1 range and 0 metadata columns:
##       seqnames               ranges strand
##          <Rle>            <IRanges>  <Rle>
##   [1]     chr7 [75255215, 75255215]      *
## 
## $2 
## GRanges object with 1 range and 0 metadata columns:
##       seqnames               ranges strand
##   [1]     chr2 [68237913, 68237913]      *

Another example of using liftOver() is in the Changing genomic coordinate systems workflow. SNPS from the NHGRI GWAS catalogue are mapped from hg38 to hg19 resulting in a few lost loci.

Overview of the csaw package

Gordon Smyth is a professor at Walter and Eliza Hall Institute of Medical Research in Victoria and has been active member in the Bioconductor project since inception. Many know Gordon from his detailed responses on the support site where he provides statistical guidance and thoughtful discussion to the novice and advanced user alike. The Smyth group has authored a number of Bioconductor packages including cornerstone contributions such as limma and edgeR. limma is consistently in the top 10 and edgeR in the top 25 of the package download stats.

A trademark of a Smyth group package is a well-written vignette with detailed scientific and statistical background. These documents are an excellent starting point for anyone new to microarray or RNA-seq analysis. The most recent contribution from the group is the csaw package (ChIP-seq analysis with windows) by Aaron Lun. I asked Aaron and Gordon if they would answer a few questions about the package.

Aaron Lun and Gordon Smyth on the csaw package:

Q: What is differential binding (DB) in ChIP-seq and how does this differ from differential expression (DE) in RNA-seq?

As most readers will know, ChIP-seq sequences genomic DNA that is bound to a target protein whereas RNA-seq sequences RNA transcripts. From a scientific point of view, RNA-seq measures gene expression whereas ChIP-seq is used to examine the regulation of gene expression. ChIP-seq is often used for example to find the binding sites of a transcription factor (TF) or to examine the positioning of an epigenetic histone mark across the genome. Many people analyze ChIP-seq data by calling peaks in each individual ChIP-seq library. These peaks are then taken to identify where the TF binds or where epigenetic marking is active. In the csaw package, however, we envisage ChIP-seq experiments with multiple experimental conditions and with multiple biological replicates within each condition, in other words with a structure much like a typical gene expression experiment. We focus on testing for DB in a quantitative way between the experimental conditions rather than on making present/absent calls. ChIP-seq and RNA-seq experiments both produce short sequence reads that can be aligned to a reference genome. In principle, the two types of experiments can be analyzed in broadly similar ways. In each case, we can choose a set of genomic locations of interest, count the number of reads mapping to those regions, and then test for differential coverage between the experimental conditions relative to biological variability. The key difference between ChIP-seq and RNA-seq is how the genomic locations are chosen and combined.

Q: What scientific question does a DB ChIP-seq analysis help answer?

A conventional ChIP-seq analysis produces a list of peaks in each library, with the implication that the protein is bound to DNA within the peak regions and not bound elsewhere. We take an alternative view that binding coverage is quantitative, not just present or absent. DB analyses identify a list of genomic regions where the strength of binding changes between biological conditions. We believe that the regions where the binding is subject to change are the most likely to be biologically important, even if they are not necessarily the regions with highest coverage. DB regions are necessarily linked with the phenotype or treatment of interest being studied in the experiment. We can therefore start investigating mechanisms through which DB can lead to observed differences in biology. One of our favorite analyses is to relate DB to differential expression for the same set of genes. These insights are harder to obtain with the conventional analyses, as the identified regions are only associated with each condition in isolation.

Q: What was the motivation for creating the package? Has your group recently started doing DB ChIP-seq or have you been doing it for some time?

Our group has been doing DB analyses for a while. However, these analyses have mostly been gene-orientated. We would count reads into pre-defined intervals like the promoters or gene bodies and then test for whether these intervals were DB using the edgeR package. This type of analysis can be done very similarly for ChIP-seq as for RNA-seq. The csaw package is our first attempt at performing de novo DB analyses where the regions of interest are not known in advance. We wanted to be able to identify novel DNA elements like new enhancers or promoters and we wanted to avoid potential headaches regarding the regions of interest, for example misspecifying the promoter region of a gene. We were motivated to take a new approach to de novo DB analysis by our observation that some commonly used approaches do not control the error rates correctly. A simple method is to call peaks in each condition separately, then simply compare the regions to identify unique peaks in each condition. This ad hoc approach does not provide any statistical error rate control and tends to overstate the differences between the conditions. A more sophisticated version of this approach is to conduct statistical tests between conditions for each of the regions called as peaks in either condition. Again, this over-estimates the differences between the conditions. Our approach gives similar flexibility to peak calling but with rigorous error rate control. The idea is to slide windows across the genome, count reads into those windows, and then use those counts to test for significant differences between conditions. Adjacent regions are then merged and the p-values combined in a statistically rigorous way. This provides the same level of statistical rigor as for our previous gene-orientated analysis but without the need to specify regions of interest beforehand. Using small windows also provides excellent spatial resolution. There are a number of other Bioc packages that can do DB analyses, diffBind and DBChIP for example, but these require a set of peaks identified with external software like MACS. The motivation for the csaw package is to avoid having to specific the genomic regions externally . We wanted to generate and discover the DB completely de novo as an integral part of the analysis. csaw is the first Bioc package to take a windowing approach.

Q: How are the statistical methods from edgeR leveraged or extended in csaw? Please describe from a statistical perspective, why methods used in RNA-seq are appropriate for DB ChIP-seq?

Once a set of genomic regions has been chosen, a DB analysis is very similar to an RNA-seq experiment. So it was logical to leverage the statistical methods provided by edgeR. edgeR accounts for biological variability between replicates, which is critical for DB analyses where the counts for each region are typically over-dispersed. Failure to account for such variability will lead to spurious DB calls. The generalized linear model functionality of edgeR provides a rich framework with which to analyze complex experiments with multiple experimental factors or covariate and batch effects. We chose edgeR over limma and voom because the ChIP-seq counts for each window can be quite small. The reads can be sparsely distributed across the whole genome. limma and voom are very effective for moderate to large counts, but edgeR is able to more accurately model the distribution of very small integer counts. csaw uses the quasi-likelihood functionality of edgeR because of its rigorous error rate control and because it gives access to the adaptive empirical Bayes functionality of limma within the edgeR negative binomial framework. We made many improvements to edgeR as part of the csaw project. Our sliding window approach to ChIP-seq generates a much larger number of regions than is typical for an RNA-seq DE analysis, so it was important to be as computationally efficient as possible. One of us (AL) converted many of the edgeR functions into C++ for speed and memory efficiency. There are a number of statistical extensions specific to csaw. These include an implementation of Simes’ method for combining the p-values of adjacent windows within a region; a non-linear normalization procedure adapted to low counts; and a method to calculate the average abundance of a region scaled to its width.

Project Statistics

Publications

A number of core and community members have been working on a “perspective” article that provides a project overview for potential users and developers. Focus is on reproducibility, interoperability and data access. The collective work, “Orchestrating high-throughput genomic analysis with Bioconductor”, is scheduled to appear in Nature Methods in early 2015.

Other recent project-level (vs package-level) publications are “Scalable Genomics with R and Bioconductor” by Michael Lawrence and Martin Morgan which reviews strategies for processing, summarizing and visualizing large genomic data. “Programming tools: Adventures with R” by Sylvia Tippman highlights R / Bioconductor as the analysis tool of choice in genomics, oceanography and ecology applications. Links are available on the publications page.

It can be challenging to get an accurate count of articles that cite the Bioconductor project or the individual software packages. The Publications page of the web site lists results of a full text search for the single term bioconductor for each of PubMed, PubMedCentraland Google Scholar. The number of hits range from 591 (PubMed) to 27,000 (Google Scholar).

Taking a package-level approach, a PubMed query for titles or IDs found in CITATION files returned 22838 references. For packages with no CITATION file a PubMed search of the package title returned 1281 references.

Website traffic

The following tables compare traffic from the fourth quarter of 2014 with the fourth quarter of 2013 (October 1 - December 28).

In the fourth quarter we saw an increase in total sessions, new sessions and the number of total users.

Website traffic Q4 2014 vs Q4 2013
Sessions23.28% (311,731 vs 252,873)
% New Sessions0.62% (36.01% vs 35.79%)
Users24.32% (133,839 vs 107,655)

The greatest increase (percent change) in total sessions was seen in China followed by Spain then the United States and Italy.

Total Sessions by Location Q4 2014 vs Q4 2013
United States24.95% (101011 vs 80840)
China28.43% (27593 vs 21485)
United Kingdom11.19% (22627 vs 20350)
Germany20.75% (21138 vs 17506)
France20.47% (10446 vs 8671)
Canada21.96% (9808 vs 8042)
Japan19.21% (9406 vs 7890)
Spain27.30% (8444 vs 6633)
India4.15% (8048 vs 7727)
Italy24.68% (7494 vs 6010)

Statistics were generated with Google Analytics.

Package downloads and new submissions

There was a 9% increase in the number of (distinct IP) downloads of software packages in the fourth quarter of 2014 (106212 total) as compared to the fourth quarter of 2013 (97462 total). See the website for a full summary of package download stats.

A total of 63 software packages were added in the fourth quarter of 2014 bringing the count to 954 in devel (Bioconductor 3.1) and 934 in release (Bioconductor 3.0).

Resources and Upcoming Events

Videos and Webinars

On average, 20-30 software packages are submitted to Bioconductor each quarter resulting in 100+ new packages each year. New package guidelines are posted on the website but developers often have additional questions or special case situations. We thought a more interactive exchange would help avoid common errors and encourage questions before a package was submitted for review.

In mid December Marc and Sonail hosted a Google Hangout to share some key development tips and solicit questions from a wider audience. Topics covered were package organization, documentation and code reuse. Questions were posted via a YouTube chat window and answered after Marc finished the slide presentation. If you are considering submitting a package you may want to check out the Webinar posted on the Bioconductor video page.

Conferences and Courses

The events page is updated regularly with new courses and conference announcements. In January 2015, EMBL is hosting both the European Developer’s conference and an Advanced R Programming course.

European Bioconductor Developer’s Meeting European Molecular Biology Laboratory Heidelberg, Germany January 12-13, 2015

Advanced R Programming and Development European Molecular Biology Laboratory Heidelberg, Germany January 15-16, 2015

Send comments or questions to Valerie at [email protected].