
Package: ProteomicsAnnotationHubData
Authors: Gatto Laurent [aut, cre], Sonali Arora [aut]
Modified: 2016-04-14 16:20:36
Compiled: Thu Apr 14 21:21:38 2016

1 Introduction

About AnnotationHub:

This package provides a client for the Bioconductor AnnotationHub web resource. The AnnotationHub web resource provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard locations (e.g., UCSC, Ensembl) can be discovered. The resource includes metadata about each resource, e.g., a textual description, tags, and date of modification. The client creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.

The goal of ProteomicsAnnotationHubData is to expand this functionality to mass spectrometry and proteomics data.

See the AnnotationHub’s How-To and Access the AnnotationHub Web Service vignettes for a description of how to use it.

Accessing proteomics data

library("AnnotationHub")
ah <- AnnotationHub()
## snapshotDate(): 2016-03-09
ah
## AnnotationHub with 36170 records
## # snapshotDate(): 2016-03-09 
## # $dataprovider: BroadInstitute, UCSC, Ensembl, ftp://ftp.ncbi.nlm.nih....
## # $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Da...
## # $rdataclass: GRanges, BigWigFile, FaFile, ChainFile, OrgDb, TwoBitFil...
## # additional mcols(): taxonomyid, genome, description, tags,
## #   sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH2"]]' 
## 
##             title                                               
##   AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa   
##   AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
##   AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
##   AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa          
##   AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa        
##   ...       ...                                                 
##   AH50423 | common_no_known_medical_impact_20160203.vcf.gz      
##   AH50424 | clinvar_20160203.vcf.gz                             
##   AH50425 | clinvar_20160203_papu.vcf.gz                        
##   AH50426 | common_and_clinical_20160203.vcf.gz                 
##   AH50427 | common_no_known_medical_impact_20160203.vcf.gz

We can extract the entries that originate from the PRIDE database:

query(ah, "PRIDE")
## AnnotationHub with 4 records
## # snapshotDate(): 2016-03-09 
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description, tags,
## #   sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH49006"]]' 
## 
##             title                                                         
##   AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
##   AH49007 | PXD000001: Peptide-level quantitation data                    
##   AH49008 | PXD000001: raw mass spectrometry data                         
##   AH49009 | PXD000001: MS-GF+ identiciation data

Or those of a specific project:

query(ah, "PXD000001")
## AnnotationHub with 4 records
## # snapshotDate(): 2016-03-09 
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description, tags,
## #   sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH49006"]]' 
## 
##             title                                                         
##   AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
##   AH49007 | PXD000001: Peptide-level quantitation data                    
##   AH49008 | PXD000001: raw mass spectrometry data                         
##   AH49009 | PXD000001: MS-GF+ identiciation data

To see the metadata of a specific entry, we subset the hub with its AnnotationHub identifier using the single [ operator:

ah["AH49008"]
## AnnotationHub with 1 record
## # snapshotDate(): 2016-03-09 
## # names(): AH49008
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: mzRpwiz
## # $title: PXD000001: raw mass spectrometry data
## # $description: Four human TMT spliked-in proteins in an Erwinia caroto...
## # $taxonomyid: 554
## # $genome: NA
## # $sourcetype: mzML
## # $sourceurl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD0...
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960 
## # retrieve record with 'object[["AH49008"]]'

To access the actual data, raw mass spectrometry data in this case, we use the double [[ operator:

library("mzR")
rw <- ah[["AH49008"]]
## downloading from 'https://annotationhub.bioconductor.org/fetch/55314'
## retrieving 1 resource
## Warning: Failed to parse headers:
## 220- ftp.pride.ebi.ac.uk FTP server
## 220-
## 220 
## 331 Please specify the password.
## 230 Login successful.
## 257 "/"
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 250 Directory successfully changed.
## 213 20150116075122
## 229 Entering Extended Passive Mode (|||35838|).
## 200 Switching to Binary mode.
## 213 450032788
## 150 Opening BINARY mode data connection for TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML (450032788 bytes).
## 226 Transfer complete.
rw
## Mass Spectrometry file handle.
## Filename:  55314 
## Number of scans:  7534

In this case, we obtain an instance of class mzRpwiz, which can be processed as usual:

plot(peaks(rw, 1), type = "h", xlab = "M/Z", ylab = "Intensity")

In the short demonstration above, we had direct and standardised access to the raw data, without needing to open the raw file manually or worry about its format. The data was prepared and converted into standard Bioconductor data types for immediate consumption by the user. The same holds for other relevant data types such as identification results, fasta files, or protein or peptide quantitation data.

2 Available datasets

To list all available proteomics datasets, one can query AnnotationHub, as described above, or use the following variable defined in the ProteomicsAnnotationHubData package:

library("ProteomicsAnnotationHubData")
availableProteomicsAnnotationHubData
## [1] "PXD000001"

2.1 PXD000001

Description

Four human TMT spiked-in proteins in an Erwinia carotovora background. Expected reporter ion ratios: Erwinia peptides: 1:1:1:1:1:1; Enolase spike (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10; BSA spike (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.

Four data files from the PRIDE PXD000001 experiment are served through AnnotationHub.

  1. The raw mass spectrometry data from the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML file from the PRIDE ftp site, served as an mzRpwiz object, from the mzR package.

  2. The peptide-level quantitation data from the F063721.dat-mztab.txt file from the PRIDE ftp site, served as an MSnSet object, from the MSnbase package.

  3. The protein database, via the erwinia_carotovora.fasta file from the PRIDE ftp server, served as an AAStringSet object, from the Biostrings package.

  4. The identification results, produced using the MS-GF+ search engine, served as an mzRident object, from the mzR package.

3 Adding new datasets

To suggest updates or new mass spectrometry and proteomics data, please post your request on the Bioconductor support site or open a GitHub issue. Contributions can also be made using GitHub pull requests.

3.1 Input files

Starting with ProteomicsAnnotationHubData version 1.1.2, data can be prepared for submission by writing simple metadata files in Debian Control File (DCF) format. DCF is a simple format for storing key:value pairs in plain text files that humans can easily read and write. For example, package DESCRIPTION files follow the DCF format. See the Details section of ?read.dcf for more about the format.

Each DCF file can document one or more data files and, as opposed to the default R specification, comment lines starting with a # are supported (inline comments are not supported). The fields that must be documented in these ProteomicsAnnotationHubData (PAHD) files are detailed in the next section.
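As a rough illustration of the format (not the package's own parser), the following base R sketch drops full-line # comments before handing the remaining lines to read.dcf. The read_dcf_with_comments helper and the two-record example are hypothetical; the field values are abbreviated from the PXD000001 entries shown in this document.

```r
## Hypothetical helper, not part of ProteomicsAnnotationHubData:
## remove lines that start with '#', then parse the rest as DCF.
read_dcf_with_comments <- function(lines) {
    keep <- !grepl("^#", lines)
    con <- textConnection(lines[keep])
    on.exit(close(con))
    read.dcf(con)
}

## Two records, separated by an empty line, each with a comment line.
example <- c(
    "# The fasta file",
    "Title: PXD000001: Erwinia carotovora and spiked-in protein fasta file",
    "SourceType: FASTA",
    "RDataClass: AAStringSet",
    "",
    "# The raw data",
    "Title: PXD000001: raw mass spectrometry data",
    "SourceType: mzML",
    "RDataClass: mzRpwiz")

## A character matrix with one row per record and one column per field.
meta <- read_dcf_with_comments(example)
meta[, "SourceType"]
```

Each record becomes one row of the resulting character matrix, which is why a single DCF file can describe several data files.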

An example, taken from the extdata/PXD000001.dcf file of the installed ProteomicsAnnotationHubData package, is shown below:

## Title: PXD000001: Erwinia carotovora and spiked-in protein fasta file
## Description: Four human TMT spliked-in proteins in an Erwinia
##       carotovora background. Expected reporter ion ratios:
##       Erwinia peptides: 1:1:1:1:1:1; Enolase spike
##       (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10; BSA spike
##       (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike
##       (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
##       (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.     
## SourceType: FASTA
## Recipe: ProteomicsAnnotationHubData:::PXD000001FastaToAAStringSet
## RDataPath: pride/data/archive/2012/03/PXD000001/erwinia_carotovora.rda
## Location_Prefix: S3
## SourceUrl: PRIDE
## Species: Erwinia carotovora
## TaxonomyId: 554
## File: erwinia_carotovora.fasta
## DataProvider: PRIDE
## Maintainer: Laurent Gatto <[email protected]>
## RDataClass: AAStringSet
## DispatchClass: AAStringSet
## Tags: Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960

The writePahdTemplate function prepares a PAHD DCF template.

writePahdTemplate()
## Title: A short title (one line)
## Description: A longer description
## SourceType: FASTA, mzTab, mzid, mzML, ... (only one).
## Recipe: see ProteomicsAnnotationHubData() for details
## RDataPath: Path to the file (on destination resource).
## Location_Prefix: Location of final file. Either S3 or PRIDE.
## SourceUrl: Location of source file. Either S3 or PRIDE.
## Species: Genus species
## TaxonomyId: Search in http://www.ncbi.nlm.nih.gov/taxonomy
## File: Data file name
## DataProvider: Orignal data provider, such as PRIDE.
## Maintainer: Your name <[email protected]>
## RDataClass: Class of file served through AnnotationHub.
## DispatchClass: Dispatch class.
## Tags: Useful tags.
## See ProteomicsAnnotationHubData() for details.

3.2 Required data and metadata

This section describes how ProteomicsAnnotationHubData metadata objects are described and generated. See also the AnnotationHub package for additional documentation. Below is an excerpt of PXD000001.dcf:

## Title: PXD000001: Erwinia carotovora and spiked-in protein fasta file
## Description: Four human TMT spliked-in proteins in an Erwinia
##       carotovora background. Expected reporter ion ratios:
##       Erwinia peptides: 1:1:1:1:1:1; Enolase spike
##       (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10; BSA spike
##       (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike
##       (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
##       (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.     
## SourceType: FASTA
## Recipe: ProteomicsAnnotationHubData:::PXD000001FastaToAAStringSet
## RDataPath: pride/data/archive/2012/03/PXD000001/erwinia_carotovora.rda
## Location_Prefix: S3
## SourceUrl: PRIDE
## Species: Erwinia carotovora
## TaxonomyId: 554
## File: erwinia_carotovora.fasta
## DataProvider: PRIDE
## Maintainer: Laurent Gatto <[email protected]>
## RDataClass: AAStringSet
## DispatchClass: AAStringSet
## Tags: Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960

Title

The title of a file should always be prefixed with its experiment identifier, such as PXD000001 in PXD000001: raw mass spectrometry data.

Description

A short description of the experiment, generally a couple of lines.

Source types

These three fields document the type/format of the original file and the R data class the file will be converted to.

SourceType     mzML     mzTab   mzid      FASTA        MSnSet
DispatchClass  mzRpwiz  MSnSet  mzRident  AAStringSet  MSnSet
RDataClass     mzRpwiz  MSnSet  mzRident  AAStringSet  MSnSet
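The table can be expressed as a lookup to check that the three fields of a DCF record agree. This is only a sketch: source_classes and valid_source_fields are hypothetical names, not functions from the package, and the check assumes that DispatchClass and RDataClass are always identical, as they are in the table.

```r
## SourceType values and the class each one maps to, transcribed
## from the table in this section.
source_classes <- c(
    mzML   = "mzRpwiz",
    mzTab  = "MSnSet",
    mzid   = "mzRident",
    FASTA  = "AAStringSet",
    MSnSet = "MSnSet")

## Hypothetical check: TRUE when both class fields match the table,
## FALSE otherwise (including for unknown source types).
valid_source_fields <- function(source_type, rdata_class, dispatch_class) {
    expected <- unname(source_classes[source_type])  # NA if unknown
    !is.na(expected) &&
        rdata_class == expected &&
        dispatch_class == expected
}

valid_source_fields("FASTA", "AAStringSet", "AAStringSet")  # consistent
valid_source_fields("mzML", "MSnSet", "MSnSet")             # inconsistent
```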

Recipe

The function that converts the data into its R data class. See below for further details.

RDataPath

The path to the R data file (see the scenarios below for more details).

Location_Prefix

The path to the location of the file. Use S3 if the file will be stored on the Amazon S3 instance or PRIDE if the file is to be retrieved from the PRIDE resource.

SourceUrl

The URL of the original source file. Use S3 if the file will be stored on the Amazon S3 instance or PRIDE if the file is to be retrieved from the PRIDE resource.

Species

Scientific species name.

TaxonomyId

The NCBI taxonomy identifier. Can be found by searching the species name in http://www.ncbi.nlm.nih.gov/taxonomy.

File

The name of the source file.

DataProvider

The original provider of the data. The predefined/tested providers are listed below.

name   baseUrl
PRIDE  ftp://ftp.pride.ebi.ac.uk/
AHS3   http://s3.amazonaws.com/annotationhub/

Maintainer

Resource maintainer name and email address.

Tags

Free-form tags. A list of suggested tags is shown below. These suggestions will be updated and completed over time.

##  [1] "Proteomics"      "TMT6"            "TMT10"          
##  [4] "iTRAQ4"          "iTRAQ8"          "LFQ"            
##  [7] "SC"              "SILAC"           "PMID:1234567"   
## [10] "SWATH"           "MSE"             "MRM"            
## [13] "SRM"             "PRM"             "Instrument name"
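As an illustration only, the comma-separated Tags field of a DCF record can be split and compared against the suggested tags. The split_tags helper is hypothetical, and how the package itself parses this field is not shown in this document; the suggested vector omits the two placeholder entries ("PMID:1234567" and "Instrument name").

```r
## Suggested tags from the list above, without the placeholders.
suggested <- c("Proteomics", "TMT6", "TMT10", "iTRAQ4", "iTRAQ8",
               "LFQ", "SC", "SILAC", "SWATH", "MSE", "MRM", "SRM", "PRM")

## Hypothetical helper: split a Tags value on commas and trim spaces.
split_tags <- function(tags)
    trimws(strsplit(tags, ",", fixed = TRUE)[[1]])

tags <- split_tags("Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960")

## Tags that are neither a PubMed identifier nor a suggested tag;
## here this leaves the instrument name.
is_pmid <- grepl("^PMID:[0-9]+$", tags)
other <- setdiff(tags[!is_pmid], suggested)
other
```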

3.3 Data location and associated metadata

Overview

The data accessed through the AnnotationHub infrastructure exists in different forms and in different locations. These locations can be the user’s computer, the AnnotationHub Amazon S3 instance and the original data provider. Multiple scenarios can occur:

  1. The data originates from the provider’s public repository. It is directly served to the user, from that third-party server, with possible local processing/coercion and made accessible as a Bioconductor data object.

  2. The data originates from the provider’s public repository. However, conversion to a Bioconductor data object is time-consuming or it is anticipated that this would be repeated many times. The data is therefore copied, processed and stored on the AnnotationHub Amazon S3 instance and served from there upon request.

  3. The original file is not available from a data provider, and is stored, possibly pre-processed, on the AnnotationHub Amazon S3 instance. Upon request, it is served to the user.

Definitions

  • The Recipe is a short function, typically named NameOfDataOrigformatToFinalformat, that generally converts the original data into a format compatible with R/Bioconductor, or enables the data to be read directly using a special data accessor.

    For example, for some fasta files, the recipe function uses the Rsamtools::indexFa function to create an index file without converting the original file. Similarly, raw mass-spectrometry files are not converted into objects per se, but an accessor object is produced to extract data directly from the data file.

  • Location_Prefix is either S3, when the file to be loaded/read by the user exists on the AH Amazon S3 instance, or PRIDE when it lives on the PRIDE ftp server. (These will be replaced by .amazonBaseUrl and .prideBaseUrl respectively during data preparation.)

  • SourceUrl is the full location of the original file. This is generally the third-party server, but not necessarily.

  • RDataPath is the path and filename of the file to be read into R and provided to the user. This field does not contain the server address (.prideBaseUrl/PRIDE or .amazonBaseUrl/S3, see Location_Prefix).

  • The metadata list used to create the AnnotationHubResources also uses a SourceBaseUrl, which is the full URL of the original file minus the file name (which is in File). It is used to construct SourceUrl.
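The relationship between these fields can be sketched in base R. The base_urls vector stands in for the package's internal .prideBaseUrl and .amazonBaseUrl, with values taken from the DataProvider table; make_source_url and served_url are hypothetical helpers, not package functions.

```r
## Stand-ins for .prideBaseUrl and .amazonBaseUrl, keyed by the
## Location_Prefix keyword used in the DCF files.
base_urls <- c(PRIDE = "ftp://ftp.pride.ebi.ac.uk/",
               S3    = "http://s3.amazonaws.com/annotationhub/")

## SourceUrl is SourceBaseUrl followed by File.
make_source_url <- function(source_base_url, file)
    paste0(source_base_url, file)

## The file served to the user lives at the base URL selected by
## Location_Prefix, followed by RDataPath.
served_url <- function(location_prefix, rdata_path)
    paste0(base_urls[[location_prefix]], rdata_path)

## The PXD000001 fasta file: original on PRIDE, rda served from S3.
src <- make_source_url(
    "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/",
    "erwinia_carotovora.fasta")
srv <- served_url(
    "S3", "pride/data/archive/2012/03/PXD000001/erwinia_carotovora.rda")
```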

Examples

Referring back to the scenarios described above:

Scenario 1

Files that are downloaded from the third-party resource, in our case PRIDE, and loaded directly into R without any pre-processing:

  • the Recipe argument must be NA. Leave it empty in the DCF file.
  • the Location_Prefix should be PRIDE (.prideBaseUrl).
  • the RDataPath should be sub(.prideBaseUrl, "", SourceUrl).
  • the SourceUrl should be the actual full URL on the third-party server.

If the data is pre-processed, a Recipe must be provided.
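The RDataPath rule for this scenario can be tried in isolation. In this sketch, pride_base_url is a local stand-in for the package's internal .prideBaseUrl, with the value listed for PRIDE in the DataProvider table; fixed = TRUE is added so that the dots in the URL are matched literally rather than as regular expression wildcards.

```r
## Local stand-in for the internal .prideBaseUrl.
pride_base_url <- "ftp://ftp.pride.ebi.ac.uk/"

## Full URL of the raw mzML file on the PRIDE ftp server.
source_url <- paste0(
    pride_base_url,
    "pride/data/archive/2012/03/PXD000001/",
    "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML")

## RDataPath is the SourceUrl with the base URL removed.
rdata_path <- sub(pride_base_url, "", source_url, fixed = TRUE)
rdata_path
```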

An example from the PXD000001 data set is the raw mzML file, which is directly downloaded from the PRIDE server and read into R as an mzRpwiz object:

SourceType: mzML
RDataClass: mzRpwiz
Recipe: NA
Location_Prefix: ftp://ftp.pride.ebi.ac.uk/
RDataPath: pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
SourceUrl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML

Scenario 2

Files that need to be downloaded from a third-party provider such as the PRIDE server and pre-processed, with the pre-processed product stored on the AnnotationHub Amazon S3 instance. The user gets the object directly from the Amazon S3 instance:

  • the Recipe argument should not be NA.
  • the Location_Prefix should be the .amazonBaseUrl.
  • the RDataPath should correspond to the directory structure after .amazonBaseUrl on the Amazon S3 instance. Typically, the directory structure on the Amazon S3 instance mimics the directory structure on the original server.
  • the SourceUrl should be the actual URL on the third-party server.

An example from the PXD000001 data set is the fasta file. It originates from the PRIDE ftp server, but is processed into an AAStringSet and stored on/served from the AnnotationHub Amazon S3 instance.

SourceType: FASTA
RDataClass: AAStringSet
Recipe: ProteomicsAnnotationHubData:::PXD000001FastaToAAStringSet
Location_Prefix: http://s3.amazonaws.com/annotationhub/
RDataPath: pride/data/archive/2012/03/PXD000001/erwinia_carotovora.rda
SourceUrl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/erwinia_carotovora.fasta

Another example is the mzTab file with peptide-level quantitation data, which is served from the Amazon instance as an MSnSet object.

SourceType: mzTab
RDataClass: MSnSet
Recipe: ProteomicsAnnotationHubData:::PXD000001MzTabToMSnSet
Location_Prefix: http://s3.amazonaws.com/annotationhub/
RDataPath: pride/data/archive/2012/03/PXD000001/F063721.dat-MSnSet.rda
SourceUrl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt

Scenario 3

The original data file and the Bioconductor data object are stored on the AnnotationHub Amazon S3 instance and directly served to the user upon request.

An example from the PXD000001 data set is the mzid file, which is not available from the PRIDE ftp server (only a Mascot dat file is provided).

SourceType: mzid
RDataClass: mzRident
Recipe: NA
Location_Prefix: http://s3.amazonaws.com/annotationhub/
RDataPath: pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
SourceUrl: http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
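As a simplified summary of the three scenarios, the sketch below tells them apart from two fields only. It is an assumption drawn from the examples in this section, not a rule enforced by the package: the scenario function is hypothetical, it takes the Location_Prefix keyword as written in the DCF file (PRIDE or S3), and s3_base_url is a stand-in for the internal .amazonBaseUrl.

```r
## Stand-in for the internal .amazonBaseUrl.
s3_base_url <- "http://s3.amazonaws.com/annotationhub/"

## Hypothetical classifier:
##   1: file served from PRIDE;
##   2: file served from S3, original on a third-party server;
##   3: file served from S3, original also on S3.
scenario <- function(location_prefix, source_url) {
    if (location_prefix == "PRIDE")
        return(1L)
    if (startsWith(source_url, s3_base_url)) 3L else 2L
}

pxd <- "pride/data/archive/2012/03/PXD000001/"
mzml <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210"

scenario("PRIDE", paste0("ftp://ftp.pride.ebi.ac.uk/", pxd, mzml, ".mzML"))
scenario("S3", paste0("ftp://ftp.pride.ebi.ac.uk/", pxd,
                      "erwinia_carotovora.fasta"))
scenario("S3", paste0(s3_base_url, pxd, mzml, ".mzid"))
```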

3.4 Data preparation script

The fully completed DCF files are added to ProteomicsAnnotationHubData’s extdata directory and named according to the dataset’s identifier, using the dcf extension.

Once the above metadata is prepared in one or multiple DCF files, these can be read into R with PAHD. Data preparation scripts are added to ProteomicsAnnotationHubData’s scripts directory. Below are the first four lines of PXD000001.R:

## library("ProteomicsAnnotationHubData")
## library("AnnotationHubData")
## 
## PXD000001 <- PAHD("../extdata/PXD000001.dcf")

The rest of the preparation script calls various functions from the AnnotationHubData package to create valid AnnotationHubMetadata instances. At the end, it is important to serialise the metadata objects in the extdata directory, as these will be used in the unit tests described below.

Preparer functions

Preparer functions and recipes are only required if the rda file is prepared on the AnnotationHub Amazon S3 instance.

Below are the relevant functions for mzRpwiz, mzRident, MSnSet and AAStringSet resources. These are defined in the R/AnnotationHubProteomicsResource-class.R file of the AnnotationHub package.

## Raw MS data: open the cached file with mzR's pwiz backend.
setClass("mzRpwizResource", contains="AnnotationHubResource")
setMethod(".get1", "mzRpwizResource",
    function(x, ...) 
{
    .require("mzR")
    yy <- cache(.hub(x))
    mzR::openMSfile(yy, backend = "pwiz")
})

## Identification data: open the cached file with mzR::openIDfile.
setClass("mzRidentResource", contains="AnnotationHubResource")
setMethod(".get1", "mzRidentResource",
    function(x, ...) 
{
    .require("mzR")
    yy <- cache(.hub(x))
    mzR::openIDfile(yy)
})

## Quantitation data: require MSnbase, then use the RdaResource method.
setClass("MSnSetResource", contains="RdaResource")
setMethod(".get1", "MSnSetResource",
    function(x, ...) 
{
    .require("MSnbase")
    callNextMethod(x, ...) 
})

## Protein sequences: read the cached fasta file with Biostrings.
setClass("AAStringSetResource", contains="AnnotationHubResource")
setMethod(".get1", "AAStringSetResource",
    function(x, ...) 
{
    .require("Biostrings")
    yy <- cache(.hub(x))
    Biostrings::readAAStringSet(yy)
})

If you have new types of data, please contact the ProteomicsAnnotationHubData maintainer.

3.5 Testing

Experiment/data unit tests

When new data/experiments or even file types are added, the procedure to add new AnnotationHub items will be streamlined, revised, simplified and hopefully automated. To make sure that none of these updates alters the format/annotation, a set of experiment-specific unit tests is set up that compares the metadata created in this package with the metadata extracted from AnnotationHub.

See for example ./tests/testthat/test_PXD000001.R.