Bioconductor Newsletter
posted by Valerie Obenchain, April 2015
On April 17, the release of Bioconductor
3.1 will mark the 24th release of the
software. The project started in 2001 with the first svn commits made in May of
that year:
r3 rgentlem 2001-05-25 14:08:57 -0700 (Fri, 25 May 2001)
r2 rgentlem 2001-05-25 08:28:31 -0700 (Fri, 25 May 2001)
r1 (no author) 2001-05-25 08:28:31 -0700 (Fri, 25 May 2001)
At the time of the first official Bioconductor manuscript in 2004 the project consisted of
"... more than 80 software packages, hundreds of metadata
packages and a number of experimental data packages ..."
Eleven years later (and after more than 100000 svn commits) Bioconductor
hosts over 990 software, 900 annotation and 230 experimental data packages.
Another quote from the 2004 paper shows that, fortunately, not everything has changed,
"... The group dynamic has also been an important factor in the success of
Bioconductor. A willingness to work together, to see that cooperation and
coordination in software development yields substantial benefits for the
developers and the users and encouraging others to join and contribute to
the project are also major factors in our success. ..."
This issue looks at the growing role of proteomics in Bioconductor
and the use
of web sockets to bridge the gap between workspace data and interactive
visualization. We re-visit Docker with use cases in package development and
managing system administration tasks. We also have a section on new and notable
functions recently added to base R
and Bioconductor
.
Contents
- Proteomics in Bioconductor
- Web Sockets
- Infrastructure
- Reproducible Research
- New and Noteworthy
- Project Statistics
- Upcoming Events
- Acknowledgements
Proteomics in Bioconductor
The diversity of proteomics analysis available in Bioconductor
continues
to grow steadily and the devel branch now hosts 68 proteomic-based software
packages. Many individuals have contributed to this area in the form of
packages, web-based workflows and course offerings. One very active member is
Laurent Gatto, head of the Computational Proteomics Unit at the Cambridge Centre
for Proteomics. On a day to day basis he is responsible for developing robust
proteomics technologies applicable to a wide variety of biological questions.
Laurent is the author of many Bioconductor
packages, including the
new
ProtGenerics.
Similar in concept to the more general BiocGenerics
package,
ProtGenerics
provides a central location where proteomic-specific S4 generics
can be defined and reused. He has produced course materials and tools to
help newcomers get started including the detailed
proteomics workflow
on the web site and a two publications titled
Using R and Bioconductor for proteomics analysis
and
Visualisation of proteomics data using R and Bioconductor.
These publications have a companion experimental data package,
RforProteomics
which illustrates data input/output, data processing, quality control,
visualisation and quantitative proteomics analysis within the
Bioconductor
framework. Since the first release, RforProteomics
has benefited from contributions from additional developers.
The RforProteomics
data package has 4 vignettes:
-
The first vignette offers a current perspective on proteomics in
Bioconductor
and is in poster format. It gives an overview of theBioconductor
proteomics infrastructure and mass spectrometry analysis. Topics covered include raw data manipulation, identification, quantitation, MS data processing, visualization, statistics and machine learning. -
Also in poster format, the RforProteomics BioC2013 vignette is specific to the
Using R and Bioconductor for proteomics analysis
publication. Special attention is given to labelled vs label-free quantitation andBioconductor
packages that offer these methods. -
Using R and Bioconductor for Proteomics Data Analysis includes code executed in the
Using R and Bioconductor for proteomics analysis
publication. -
Visualisation of proteomics data using R and Bioconductor includes the code from the
Visualisation of proteomics data using R and Bioconductor
publication.
Many proteomic packages are worthy of mention in the Bioconductor
repository. Here we highlight a few that have played a primary role in
the growing infrastructure.
Packages mzR
and mzID
read and parse raw and identification MS
data. The former is an R interface to the popular C++
proteowizard toolkit.
Identification methods are offered in rTANDEM
and MSGFplus
(and
its shiny
interface MSGFgui
) and quantitation methods can be found
in MSnbase
and isobar
(isobaric tagging and spectral counting
methods), synapter
, xcms
and MALDIquant
packages (label-free).
Statistical modelling and machine learning are offered in MSnbase
,
isobar
, MSstats
and msmsTests
.
The rpx
package provides an interface to the
ProteomeXchange
infrastructure, which coordinates multiple data repositories of
MS-based proteomics data.
The pRoloc
package contains methods for spatial proteomics analysis,
e.g., machine learning and classification methods for assigning a
protein to an organelle.
The relatively new Pbase
package contains a Proteins
class for storing and
manipulating protein sequences and ranges of interest. The package has multiple
vignettes on coordinate mapping. One addresses mapping proteins between
different genome builds and the other mapping from protein to genomic
coordinates.
Web Sockets
Overview
As our ability to generate volumes of sequencing data grows so does the need for
effective visualization tools. Tools that can summarize large quantities of
information into digestible bits and quickly identify unique features or
outliers are important steps in any analysis pipeline. In the R
world we
have seen an increase in the use of web sockets to provide an interactive link
between data in the workspace and exploration in the browser.
The analysis capabilities of R
make it a good fit for the rich and interactive
graphics of HTML5 web browsers. The WebSocket protocol enables more interaction
between a browser and a web site, facilitating live content and the creation of
real-time graphics.
Web sockets are often described as “a standard for bi-directional, full duplex communication between servers and clients over a single TCP connection”. These characteristics offer several advantages over HTTP:
-
‘bi-directional’ means either client or server can send a message to the other party. HTTP is uni-directional and the request is always initiated by the client.
-
‘full duplex’ allows client and server to talk independently. In the case of HTTP, at any given time, either the client is talking or the server is talking.
-
Web sockets open a ‘single TCP connection’ over which the client and server communicate for the lifecycle of the web socket connection. In contrast, HTTP typically opens a new TCP connection for each round trip; a connection is initiated for a request and terminated after the response is received. A new TCP connection must be established for each request/response. The opening and closing creates overhead, especially in the case where rapid responses or real time interactions are needed.
For those interested, this blog post provides more in-depth details and benchmarking against REST.
Applications in Shiny
and epivizr
The Shiny package created by the RStudio team
pioneered the use of web sockets in R
. Shiny enables the building of
interactive web applications from within an R
session. Popular applications
are interactive plots and maps that allow real-time manipulation through
widgets. The workhorse behind Shiny is the
httpuv package, also authored by
RStudio. httpuv provides low-level socket and protocol support for handling
HTTP and WebSocket requests within R
.
The httpuv infrastructure is also used by the
epivizr
package. In this application, web sockets create a two-way communication
between the R
environment and the
Epiviz visualization tool.
Objects available in an R
session can be displayed as tracks or plots on
Epiviz.
BrowserViz
A slightly different approach is taken in the new
BrowserViz
package by Paul Shannon. This application provides access to both the browser
and an active R
prompt.
The BrowserViz
package contains the BrowserViz class whose main purpose is to
provide the necessary R
, Javascript websocket and JSON infrastructure for
communication. By loosely coupling R
and the browser the two environments are
linked but kept maximally ignorant of each other; only simple JSON messages pass
back and forth with no HTML, CSS or Javascript. The result is access to the
interactive graphics of a web browser in conjunction with an active R
session.
The companion Javascript library, BrowserViz.js, is also included in the
package. The combination of library and base class provides the infrastructure
necessary for any BrowserViz-style application. The BrowserVizDemo
and RCyjs
packages build on the BrowserViz
class and will be available in the
Bioconductor
3.2 release. BrowserVizDemo
is a minimal example of interactive
plotting and selection of xy points using the popular d3.js library. The more
full featured RCyjs provides interactive access to the full power of
Cytoscape.js, a richly featured browser-based network visualization library.
More details on the BrowserViz
class and applications can be found in the
package vignette.
Infrastructure
Changes in AnnotationHub
This quarter Marc and Sonali continued their work on AnnotationHub
. Several
new resources were added and the display method and search navigation were
substantially reworked.
New resources:
- TF-Target Gene Files from the PAZAR public database
(available as
GRanges
objects) - BioPAX files (Level1 and Level2) from the NCI Pathway Interaction Database
(available as
biopax
objects) - Background file for ChEA required for the command line version of ChEA
(available as
data.frame
object) - GTF files from Ensembl release 76 to 79
(available as
GRanges
object) - Expression Set of raw read counts for the GSE62944 dataset from GEO:
(available as
ExpressionSet
object)
An improved show method and more flexible data retrieval make interacting with the 18900+ files straightforward. Sonali has a new AnnotationHub video where she gives a tour of the resource with tips and tricks for data access.
Code below was generated with AnnotationHub
version 1.99.75. The show
method now list fields common for subsetting up front, e.g., providers,
species and class of R
object.
> library(AnnotationHub)
> hub <- AnnotationHub()
> hub
AnnotationHub with 18992 records
# snapshotDate(): 2015-03-12
# $dataprovider: UCSC, Ensembl, BroadInstitute, NCBI, Haemcode, dbSNP, Inpar...
# $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Danio r...
# $rdataclass: GRanges, FaFile, OrgDb, ChainFile, CollapsedVCF, Inparanoid8D...
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
# sourcetype
# retrieve records with, e.g., 'object[["AH169"]]'
title
AH169 | Meleagris_gallopavo.UMD2.69.cdna.all.fa
AH170 | Meleagris_gallopavo.UMD2.69.dna.toplevel.fa
AH171 | Meleagris_gallopavo.UMD2.69.dna_rm.toplevel.fa
AH172 | Meleagris_gallopavo.UMD2.69.dna_sm.toplevel.fa
AH173 | Meleagris_gallopavo.UMD2.69.ncrna.fa
... ...
AH28575 | A500002_Erg.csv
AH28576 | A500005_Erg.csv
AH28577 | A500001_IgG.csv
AH28578 | A500004_IgG.csv
AH28579 | GSM730632_Runx1.csv
Tab completion on a hub object lists all fields available for subsetting:
> hub$
hub$ah_id hub$dataprovider hub$taxonomyid
hub$description hub$rdataclass hub$sourcetype
hub$title hub$species hub$genome
hub$tags hub$sourceurl
Quick discovery of file type and provider:
> sort(table(hub$sourcetype), decreasing=TRUE)
BED FASTA UCSC track GTF NCBI/blast2GO
7855 3876 2208 1606 1145
Chain CSV VCF BigWig Inparanoid
1113 406 316 315 268
TwoBit BioPaxLevel2 RData BioPax GRASP
144 6 4 3 1
tar.gz Zip
1 1
> sort(table(hub$dataprovider), decreasing=TRUE)
UCSC Ensembl
8746 4590
BroadInstitute NCBI
3146 1145
Haemcode dbSNP
945 316
Inparanoid8 Pazar
268 91
NIH Pathway Interaction Database EncodeDCC
9 5
RefNet ChEA
4 1
GEO NHLBI
1 1
Given the volume and diversity of data available in the hub we encourage using these files as sample data before creating your own experimental data package.
For example, to get an idea of available GRCh37 FASTA from Ensembl:
> hub[hub$sourcetype=="FASTA" & hub$dataprovider=="Ensembl" & hub$genome=="GRCh37"]
AnnotationHub with 42 records
# snapshotDate(): 2015-03-26
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: FaFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
# sourcetype
# retrieve records with, e.g., 'object[["AH18924"]]'
title
AH18924 | Homo_sapiens.GRCh37.73.cdna.all.fa
AH18925 | Homo_sapiens.GRCh37.73.dna_rm.toplevel.fa
AH18926 | Homo_sapiens.GRCh37.73.dna_sm.toplevel.fa
AH18927 | Homo_sapiens.GRCh37.73.dna.toplevel.fa
AH18928 | Homo_sapiens.GRCh37.73.ncrna.fa
... ...
AH21181 | Homo_sapiens.GRCh37.72.dna_rm.toplevel.fa
AH21182 | Homo_sapiens.GRCh37.72.dna_sm.toplevel.fa
AH21183 | Homo_sapiens.GRCh37.72.dna.toplevel.fa
AH21184 | Homo_sapiens.GRCh37.72.ncrna.fa
AH21185 | Homo_sapiens.GRCh37.72.pep.all.fa
Advanced developers may be interested in writing a ‘recipe’ to add
additional online resources to AnnotationHub
. The process involves
writing functions to first parse file metadata and then create R
objects or
files from these metadata. Detailed HOWTO steps are in the
AnnotationHubRecipes vignette.
Rhtslib
package
Nate recently completed work on the
Rhtslib
package which wraps the htslib C library from
Samtools. The plan is for Rhtslib
to replace the Samtools code inside
Rsamtools
. Rhtslib
contains a clean branch of htslib directly from
Samtools, including all unit tests. This approach simplifies maintenance when
new versions or bug fixes become available. The clean API also promises to make
outsourcing to the package more straightforward for both Rsamtools
and others
wanting access to the native routines.
htslib was developed with a ‘linux-centric’ approach and getting the library to build across platforms (specifically Windows) was a challenge. To address this, Nate chose to use Gnulib, the GNU portability library. Briefly, Gnulib is a collection of modules that package portability code to enable POSIX-compliance in a transparent manner; the goal being to supply common infrastructure to enable GNU software to run on a variety of operating systems. Modules are incorporated into a project at the source level rather than as a library that is built, installed and linked against.
Incorporating Gnulib involves (at minimum) the following steps:
- adapt the project to use Autoconf and Automake
- identify and import relevant Gnulib modules using gnulib-tool
- add
#include "config.h"
to source files - remove (now unnecessary!) preprocessor complier/platform tests from source
For more on specific functions available in Rhtslib
see the
Samtools docs or the API-type headers in the
package, faidx.h, hfile.h, hts.h, sam.h, tbx.h and vcf.h. Headers are located
in Rhtslib/src/htslib/htslib or if the package is installed,
library(Rhtslib)
system.file(package="Rhtslib", "include")
Reproducible Research
AMI course links
The course materials web
page has links to several resources including slides, presentations and
packages. Recently Dan started adding an “AMI” link for courses that use them.
The AMI contains the packages, sample data and exact version of
R
/Bioconductor
used. This is a convenient, portable way to ensure
reproduciblity. One can imagine using an AMI or Docker container to
capture the state of a research project or publication which can be easily
shared with colleagues.
Developing with Docker
Elena Grassi is a Ph.D. student in Biomedical Sciences and Oncology in the Department of Genetics, Biology and Biochemistry at the University of Torino. Her research focuses on transcriptional and post transcriptional regulation with special interest in transcription factors and the alternative polyadenylation phenomenon.
With a background in computer science she is involved in developing
computational pipelines and tools and is the author of Bioconductor
packages
roar
(preferential usage of APA sites) and
MatrixRider
(propensity of binding protein to interact with a sequence). Elena was one of
the first to try out the Docker containers and found them useful for both
package development and system administration tasks. I asked a few questions
about her experience and got some interesting answers.
What motivated you to try Docker when developing MatrixRider
?
I heard about docker from some friends last year and I was eager to try it.
During the New Year’s Eve holidays I decided to start using it with R
/
Bioconductor
to run different versions on our computational server without
adding burden to the sysadmin work. I started with mere curiosity fiddling with
rocker and, eased by the holiday laziness, I stopped with the idea to begin
working on some ad hoc R
\ Bioconductor
containers in January. Imagine my
happiness when I read in the newsletter about the brand new Bioconductor docker
containers: they were ready for me :). I decided to use them to develop
MatrixRider as long as I needed to have working versions both for release and
devel. I work on different computers and using the Bioconductor
devel_sequencing container freed me completely from the procedure of getting the
source, building, and installing all needed packages. Besides this advantage
using docker made me sure that the package I was developing did not have any
dependencies on my local system libraries that would not be available in a
clean installation. This was my first package containing C code and it was nice
to be sure.
Which image did you use, base, core, sequencing, … ?
Mainly devel_sequencing to start with a fully-fledged working environment. I
had to install some other packages (TFBSTools and JASPAR2014) and it worked
flawlessly with biocLite()
.
Any unanticipated pros/cons of developing in these containers?
No. I think that I will continue using the devel containers to develop and maintain packages.
Describe how Docker was useful for managing multiple R
versions
on your computational server. Was this for multiple users or
just yourself?
Right now I’m the only one that needs devel so the version management was for myself. Eventually I would like to set it up on our server and have it working for multiple users but it will take a little work to integrate it with our “pipeline management system”.
In the past we have had up to three different R
versions, one from the package
management system of our distribution, Debian, and two compiled ad hoc. Teaching
new students how to reach them and the related library paths has been hard - I
am pretty sure docker will give a huge hand in these situations, helping
also in tracking which versions of packages were used to perform certain
analyses.
New and Noteworthy
A number of functions added to R
(3.2) and Bioconductor
(3.1) this
quarter have potential for wide-spread use. I thought they were worth a
mention.
-
base::lengths()
Computes the element lengths of a
list
object. InBioconductor
, S4Vectors::elementLengths performs the same operation onList
objects. (contributed by Michael Lawrence) -
base::trimws()
Removes leading or trailing whitespace from character strings. (contributed by Kurt Hornik)
-
utils::methods()
This function previously worked on S3 generics only and has been enhanced to also handle S4. (enhanced by Martin Morgan)
> library(Rsamtools) > methods("scanBam") [1] scanBam,BamFile-method scanBam,BamSampler-method [3] scanBam,BamViews-method scanBam,character-method see '?methods' for accessing help and source code Warning message: In findGeneric(generic.function, envir) : 'scanBam' is a formal generic function; S3 methods will not likely be found > methods(class = "BamFile") [1] $ $<- asMates [4] asMates<- close coerce [7] countBam filterBam indexBam [10] initialize isIncomplete isOpen [13] obeyQname obeyQname<- open [16] path pileup qnamePrefixEnd [19] qnamePrefixEnd<- qnameSuffixStart qnameSuffixStart<- [22] quickBamFlagSummary scanBam scanBamHeader [25] seqinfo show sortBam [28] testPairedEndBam updateObject yieldSize [31] yieldSize<- see '?methods' for accessing help and source code
-
GenomicFeatures::transcriptLengths()
Computes transcripts lengths in a TxDb object with the option to include / excluded coding and UTR regions. (contributed by Hervé Pagès)
-
BiocParallel::bpvalidate()
Flags undefined symbols in functions intended for parallel, distributed memory computations. (contributed by Martin Morgan, Valerie Obenchain)
-
BiocInstaller::biocLite()
Now capable of installing git repositories. When the ‘pkg’ argument contains a forward slash, e.g., “myRepo/myPkg”, it is assumed to be a repository and is installed with devtools::install_github. (contributed and enhanced by Martin Morgan)
Project Statistics
Website traffic
The following compares the number of sessions and new users from the first quarter of 2015 (January 1 - March 30) with the first quarter of 2014. Sessions are broken down by new and returning visitors. New visitors correspond to the total new users.
Sessions: Total | 24.03% | (339,283 vs 273,559) |
Sessions: Returning Visitor | 21.42% | (213,848 vs 176,128) |
Sessions: New Visitor | 28.74% | (125,435 vs 97,431) |
New Users | 28.74% | (125,435 vs 97,431) |
Statistics generated with Google Analytics.
Package downloads and new submissions
The number of unique IP downloads of software packages for January, February and March of 2015 were 31720, 31956, and 38379, respectively. For the same time period in 2014, numbers were 29690, 28993 and 34634. Numbers must be compared by month (vs sum) because some IPs are the same between months. See the web site for a full summary of download stats.
A total of 55 software packages were added in the first quarter of 2015 bringing counts to 991 in devel (Bioconductor
3.2) and 936 in release
(Bioconductor
3.1).
Upcoming Events
See the events page for a listing of all courses and conferences.
-
Use R / Bioconductor for Sequence Analysis: This intermediate level course is offered 06 - 07 April in Seattle, WA USA.
-
Advanced RNA-Seq and ChiP-Seq Data Analysis: Held in Hinxton, UK at EMBL-EBI, 11 - 14 May.
-
CSAMA 2015 - Statistics and Computing in Genome Data Science: Held the 14 - 19 of June in Bressanone-Brixen, Italy.
-
BioC2015: This year the 20 - 22 of July in Seattle, WA, USA.
Acknowledgements
Thanks to Laurent Gatto, Elena Grassi and Paul Shannon for contributing to
the Proteomics, Docker and Web Sockets sections. Also thanks to the
Bioconductor
team in Seattle for project updates and editorial review.
Send comments or questions to Valerie at [email protected].