Bioconductor Newsletter

posted by Valerie Obenchain, April 2014

Contents

Introduction

Welcome to the first Bioconductor newsletter! In this space we will provide project updates from the Seattle team and recent contributions from core developers. Topics will include infrastructure reorganizations, new packages and features, conferences, and other areas of interest.

The newsletter will be on a quarterly schedule with the intent of summarizing major developments of the past 3 months. Comments and suggestions are welcome, please send to Valerie at [email protected].

Infrastructure

GenomicAlignments package

Last fall Hervé did some housecleaning on the infrastructure packages. All GAlignments* classes and related functions were moved from GenomicRanges and Rsamtools to the new GenomicAlignments package. Classes and functions from GenomicRanges that built on functionality in Rsamtools were also moved. The reorg substantially cleaned up our dependency tree and will be reflected in the upcoming release, BioC 2.14. (ASCII art courtesy of Hervé).

                  IRanges
                  ^     ^
                  |     |
                  |    XVector
                  |     ^
                  |     |
       GenomicRanges   Biostrings
                  ^     ^
                  |     |
                 Rsamtools
                     ^
                     |
                 BSgenome
                     ^
                     |
             GenomicAlignments
                     ^
                     |
                 rtracklayer
                     ^
                     |
               GenomicFeatures

Junction tools in GenomicAlignments

New tools for extracting junctions from genomic alignments have been added to the GenomicAlignments package. The junctions() function, previously known as gaps(), extracts the portion of a read corresponding to the N operation in the CIGAR.

New junction-related operations:

  • summarize the number of reads that overlap each junction
  • retrieve junction motifs based on dinucleotides at junction boundaries (BSgenome package required)
  • import junction files generated by TopHat and STAR aligners
  • determine if the RNA-seq protocol was stranded

GenomeInfoDb package

Converting between chromosome names from different institutions presents a challenge when working with ranges and annotations. The new GenomeInfoDb package is a step towards resolving these inconsistencies by providing a mapping interface between styles (e.g., UCSC, NCBI, Ensembl, …) and organisms.

Low-level functions related to seqnames and SeqnamesStyle were moved from GenomicRanges and AnnotationDbi into GenomeInfoDb. Some functions were deprecated, some renamed. See the NEWS file for a complete listing of activities.

The package is intended to serve as the base for methods querying or transforming seqlevels. Future work will address the problem of auto-detecting incompatible styles between objects in operations that require matching seqlevels.

Sonali, Marc, Hervé and Martin have all contributed to this work.

Currently 9 organisms are supported:

library(GenomeInfoDb)
names(genomeStyles())

## [1] "Arabidopsis_thaliana"     "Caenorhabditis_elegans"  
## [3] "Cyanidioschyzon_merolae"  "Drosophila_melanogaster" 
## [5] "Homo_sapiens"             "Oryza_sativa"            
## [7] "Populus_trichocarpa"      "Saccharomyces_cerevisiae"
## [9] "Zea_mays"

There are methods for style discovery,

seqlevelsStyle("chr3")

## [1] "UCSC"

and for converting between styles. Here we map from UCSC to NCBI:

mapSeqlevels(c("chrII", "chrIII", "chrM"), "NCBI")

##  chrII chrIII   chrM 
##   "II"  "III"   "MT"

Networks / Annotations

AnnotationHub package

The AnnotationHub resource continues to grow steadily. As of March 2014 the hub supports 204 genomes and 98 species from 7 data providers (Ensembl, NCBI, UCSC, dbSNP, HAEMCODE, Encode, Inparanoid 8). The hub is a repository of annotation data parsed from original file format to R objects for convient computing. For example, BED, GTF, and track files from UCSC are available as GRanges. VCF files from dbSNP have been parsed into VCF objects and FASTA files are available as FaFile objects.

Marc, Dan and Paul have led this effort.

RefNet package

A new package to watch for in the Spring release is RefNet. This package offers access to molecular interaction data in computable form and is the first of its kind in Bioconductor. The package offers access to 2 transcription factor networks and Recon 2 (consensus metabolic reconstruction compiled from 5 resources). Queries are made using the PSICQUIC or RefNet.db packages. (Note: RefNet and RefNet.db will be available in BioC 2.14.)

These data can be used to curate or study experiment-specific pathways, explore the function and relationships of differentially expressed genes, and offer insight to a variety of explorations at the molecular level.

RefNet is extensible and we encourage contributions. Networks, known interactions, infection pathways, disease-host interactions, or pathways from manuscripts would be valuable additions. Guidance for contributing data are included in the package.

Credit for RefNet, RefNet.db and PSICQUIC goes to Paul with contributions from Mentored Projects mentees.

Web Site-ish

biocViews terms

The biocViews terms used to categorize packages have been been refreshed. New terms were added, old terms removed and packages with old or invalid terms were updated. The updated web-interface allows searching by biocViews term or subject. A check box option displays terms not associated with any package. Thanks to Sonali and Dan for the work on this.

BiocCheck package

BiocCheck is a new package that encapsulates Bioconductor package guidelines and best practices. It was designed to provide guidance to new package authors and is run by the SinglePackageBuilder during the package review process.

BiocCheck can also be used from the command line

R CMD BiocCheck package

and should be run after R CMD check successfully completes. A full-featured vignette explains the suggestions made by the package. Dan is the architect behind this one.

Programming bits

Hits class

One of my favorite objects is the Hits class. It is a simple, straightforward class that can be used for complicated indexing and combining. It stores a set of “hits” between two vector-like objects and is most familiar as the output of a findOverlaps operation.

Here we use Hits to identify and remove the “shared” regions in a set of ranges. The shared region in the original set is positions 5 to 10. Range 3 is completely within this shared range.

library(GenomicRanges)
gr1 <- GRanges("A", IRanges(c(1, 5, 5), c(10, 15, 10)))
gr1

## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A   [1, 10]      *
##   [2]        A   [5, 15]      *
##   [3]        A   [5, 10]      *
##   ---
##   seqlengths:
##     A
##    NA

First we disjoin the ranges and perform findOverlaps which will give us a Hits object:

gr2 <- disjoin(gr1)
ov <- findOverlaps(gr1, gr2)
ov

## Hits of length 5
## queryLength: 3
## subjectLength: 3
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           1 
##  2         1           2 
##  3         2           2 
##  4         2           3 
##  5         3           2

Convenience functions countQueryHits' and countSubjectHits’ summarize the number of times each record in the query or subject was hit.

countSubjectHits(ov)

## [1] 1 3 1

The regions that are not shared between the ranges will be counted only once (self-hit).

keep_idx <- which(countSubjectHits(ov) == 1L)

Accessors queryHits and subjectHits extract the corresponding columns from the Hits object:

subjectHits(ov)

## [1] 1 2 2 3 2

The extracted subjectHits can be used in a matching operation to subset the Hits object containing just the “unshared” regions:

ov <- ov[subjectHits(ov) %in% keep_idx]
ov

## Hits of length 2
## queryLength: 3
## subjectLength: 3
##   queryHits subjectHits 
##    <integer>   <integer> 
##  1         1           1 
##  2         2           3

Again subjectHits are extracted but this time from the subset Hits object. We use this index to subset the disjoined original ranges leaving only the “unshared” regions:

ans_regions <- gr2[subjectHits(ov)]
ans_regions

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [ 1,  4]      *
##   [2]        A  [11, 15]      *
##   ---
##   seqlengths:
##     A
##    NA

To finish this off we make a GRangesList of the unshared regions. The list is length 3 corresponding to the 3 ranges in `gr1’:

ans_eltlens <- countQueryHits(ov)
ans_skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
relist(ans_regions, ans_skeleton)

## GRangesList of length 3:
## [[1]] 
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A    [1, 4]      *
## 
## [[2]] 
## GRanges with 1 range and 0 metadata columns:
##       seqnames   ranges strand
##   [1]        A [11, 15]      *
## 
## [[3]] 
## GRanges with 0 ranges and 0 metadata columns:
##      seqnames ranges strand
## 
## ---
## seqlengths:
##   A
##  NA

Thanks to Michael and Hervé for the work on the Hits class.

Self-matching

Recently I had a question about splitting on groups/factors. I thought the solution Hervé came up with was quite elegant and worth sharing.

The problem boils down to converting the ‘group1’ vector to ‘group2’:

group1 <- c(3, 1, 2, 2)
group2 <- c(1, 2, 3, 3)

The motivating use case was to split a GRanges based on the groups in group1 but not the implied order.

When we split this GRanges

x <- GRanges(c(rep("chr2", 2), rep("chr19", 2)), IRanges(1:4, width = 1))
x

## GRanges with 4 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2    [1, 1]      *
##   [2]     chr2    [2, 2]      *
##   [3]    chr19    [3, 3]      *
##   [4]    chr19    [4, 4]      *
##   ---
##   seqlengths:
##    chr19  chr2
##       NA    NA

by group1 the object is re-ordered to match the priority in group1.

split(x, group1)

## GRangesList of length 3:
## $1 
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2    [2, 2]      *
## 
## $2 
## GRanges with 2 ranges and 0 metadata columns:
##       seqnames ranges strand
##   [1]    chr19 [3, 3]      *
##   [2]    chr19 [4, 4]      *
## 
## $3 
## GRanges with 1 range and 0 metadata columns:
##       seqnames ranges strand
##   [1]     chr2 [1, 1]      *
## 
## ---
## seqlengths:
##  chr19  chr2
##     NA    NA

This is expected behavior but wasn’t exactly the result I was after. I wanted to respect the grouping in group1 but not the order. Essentially I wanted to turn group1 in to group2 so I would get this result:

split(x, group2)

## GRangesList of length 3:
## $1 
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2    [1, 1]      *
## 
## $2 
## GRanges with 1 range and 0 metadata columns:
##       seqnames ranges strand
##   [1]     chr2 [2, 2]      *
## 
## $3 
## GRanges with 2 ranges and 0 metadata columns:
##       seqnames ranges strand
##   [1]    chr19 [3, 3]      *
##   [2]    chr19 [4, 4]      *
## 
## ---
## seqlengths:
##  chr19  chr2
##     NA    NA

Solution:

One straight-forward solution is to re-order the GRangesList object once created.

grl <- split(x, group1)
grl[as.character(unique(group1))]

However the data were large so I was hoping to solve the ordering problem before creating the list. The second solution was to transform group1 with a “self-match”.

match(group1, group1)

## [1] 1 2 3 3

A “self match” on any vector-like object ‘x’ is an easy and very efficient way to assign unique ids to unique values in ‘x’. Also the mapping from unique values in ‘x’ and ids is guaranteed to be strictly increasing, which is what I needed for my use case.

xx <- c("Z", "B", "C", "C", "A", "B")
ids <- match(xx, xx)
ids

## [1] 1 2 3 3 5 2

ids[!duplicated(xx)]  # always strictly increasing

## [1] 1 2 3 5

Hervé’s 2 line solution to my splitting problem:

grl2 <- split(x, match(group1, group1))
names(grl2) <- unique(group1)

More detail on the Hits class can be found on the man page:

?Hits

The “HOWTOs” vignette in GenomicRanges demonstrates other practical workflows that make use of the extensive GenomicRanges infrastructure.

library(GenomicRanges)
browseVignettes("GenomicRanges")

Courses / Conferences

In January Martin, Sean and Hervé headed south to Ribeirão Preto, Brazil to teach at the Summer Bioinformatics course hosted by the Ribeirão Preto Medical School. The course was organized by Benilton Carvalho and Houtan Noushmehr and focused on the analysis and comprehension of high throughput sequence data.

In Seattle we offer several short courses throughout the year. The last was in February given by Martin and Sonali. This was a 2 day introductory course covering the basics of working with large data, class structure, integrated workflows and the analysis of RNAseq differential expresssion data.

Upcoming events and past course materials can be found on the web site.

Looking ahead

The spring release is just weeks away. We plan to branch on April 11th and release on the 14th. The full schedule is posted online.

BioC 2014 is in Boston this year (July 30 - August 1). Watch the web site for updates.

The July newsletter will (probably) include copy number packages on the rise, the GenomicFileViews package, and a summary of tally/coverage functions. If there are topics you’d like to see covered let us know!