The AnnotationHubData
package provides tools to acquire, annotate, convert
and store data for use in Bioconductor’s AnnotationHub
. BED files from the
Encode project, gtf files from Ensembl, or annotation tracks from UCSC, are
examples of data that can be downloaded, described with metadata, transformed
to standard Bioconductor
data types, and stored so that they may be
conveniently served up on demand to users via the AnnotationHub client. While
data are often manipulated into a more R-friendly form, the data themselves
retain their raw content and are not filtered or curated like those in
ExperimentHub.
Each resource has associated metadata that can be searched through the
AnnotationHub
client interface.
Multiple, related resources are added to AnnotationHub
by creating a software
package similar to the existing annotation packages. The package itself does
not contain data but serves as a light weight wrapper around scripts that
generate metadata for the resources added to AnnotationHub
.
At a minimum the package should contain a man page describing the resources.
Vignettes and additional R
code for manipulating the objects are optional.
Creating the package involves the following steps:
Notify Bioconductor
team member.
Man page and vignette examples in the software package will not work until
the data are available in AnnotationHub
. Adding the data to AWS S3 and the
metadata to the production database involves assistance from a Bioconductor
team member. The metadata.csv file will have to be created before the data
can officially be added to the hub (See inst/extdata section below). Please
read the section of “Storage of Data Files”.
Building the software package: Below is an outline of package organization. The files listed are required unless otherwise stated.
inst/extdata/
AnnotationHub
database. The file should be generated
from the code in inst/scripts/make-metadata.R where the final data are
written out with write.csv(…, row.names=FALSE). The required column
names and data types are specified in
AnnotationHubData::makeAnnotationHubMetadata
.
See ?AnnotationHubData::makeAnnotationHubMetadata
for details. Ensure that
the above function runs without ERROR.If necessary, metadata can be broken up into multiple csv files instead having of all records in a single “metadata.csv”.
inst/scripts/
make-data.R:
A script describing the steps involved in making the data object(s). This
includes where the original data were downloaded from, pre-processing, and
how the final R object was made. Include a description of any steps
performed outside of R
with third party software. Output of the script
should be files on disk ready to be pushed to S3. If data are to be hosted
on a personal web site instead of S3, this file should explain any
manipulation of the data prior to hosting on the web site. For data hosted
on a public web site with no prior manipultaion this file is not needed.
make-metadata.R:
A script to make the metadata.csv file located in inst/extdata of the
package. See ?AnnotationHubData::makeAnnotationHubMetadata
for a
description of the metadata.csv file, expected fields and data types. The
AnnotationHubData::makeAnnotationHubMetadata()
function can be used to
validate the metadata.csv file before submitting the package.
vignettes/
OPTIONAL vignette(s) describing analysis workflows.
R/
OPTIONAL functions to enhance data exploration.
man/
package man page: OPTIONAL. The package man page serves as a landing point and should briefly describe all resources associated with the package. There should be an entry for each resource title either on the package man page or individual man pages.
resource man pages:
OPTIONAL. Man page(s) should describe the resource (raw data source,
processing, QC steps) and demonstrate how the data can be loaded through
the AnnotationHub
interface. For example, replace “SEARCHTERM*" below
with one or more search terms that uniquely identify resources in your
package.
library(AnnotationHub)
hub <- AnnotationHub()
myfiles <- query(hub, "SEARCHTERM1", "SEARCHTERM2")
myfiles[[1]] ## load the first resource in the list
DESCRIPTION / NAMESPACE
The scripts used to generate the metadata will likely use functions from
AnnotationHub or AnnotationHubData which should be listed in Depends/Imports as
necessary. The biocViews should contain terms from
AnnotationData
and should also contain the term AnnotationHub
.
Data objects: Data are not formally part of the software package and are stored separately in a publicly accessible hosted site or by Bioconductor in an AWS S3 buckets. The author should read the following section on “Storage of Data Files”.
Confirm valid metadata: Confirm the data in inst/exdata/metadata.csv are valid by running AnnotationHubData:::makeAnnotationHubMetadata() on your package. Please address and warnings or errors.
Package review: Submit the package to the tracker for review. The primary purpose of the package review is to validate the metadata in the csv file(s). It is ok if the package fails R CMD build and check because the data and metadata are not yet in place. Once the metadata.csv is approved, records are added to the production database. At that point the package man pages and vignette can be finalized and the package should pass R CMD build and check.
Metadata for new versions of the data can be added to the same package as they become available.
The titles for the new versions should be unique and not match the title of
any resource currently in AnnotationHub. Good practice would be to
include the version and / or genome build in the title. If the title is
not unique, the AnnotationHub
object will list multiple files with the
same title. The user will need to use ‘rdatadateadded’ to determine which
is the most current.
Make data available: either on publicly accessible site or see section on “Uploading Data to S3”
Update make-metadata.R with the new metadata information
Generate a new metadata.csv file. The package should contain metadata for all versions of the data in AnnotationHub so the old file should remain. When adding a new version it might be helpful to write a new csv file named by version, e.g., metadata_v84.csv, metadata_85.csv etc.
Bump package version and commit to git
Notify [email protected] that an update is ready and a team member will add the new metadata to the production database; new resources will not be visible in AnnotationHub until the metadata are added to the database.
Contact [email protected] or [email protected] with any questions.
The concepts and directory structure of the package would stay the same. The main steps involved would be
Restructure the inst/extdata and inst/scripts to include metadata.csv and
make-data.R as described in the section above for creating new packages. Ensure the
metadata.csv file is formatted correctly by running AnnotationHubData::makeAnnotationHubMetadata()
on your package.
Add biocViews term “AnnotationHub” to DESCRIPTION
Upload the data to S3 or place on a publicly accessible site and remove the data from the package. See the section on “Storage of Data Files” below.
Once the data is officially added to the hub, update any code to utilize AnnotationHub for retrieving data.
A bug fix may involve a change to the metadata, data resource or both.
The replacement resource must have the same name as the original and be at the same location (path).
Notify [email protected] that you want to replace the data and make the files available: see section “Uploading Data to S3”.
New metadata records can be added for new resources but modifying existing records is discouraged. Record modification will only be done in the case of bug fixes.
Notify [email protected] that you want to change the metadata
Update make-metadata.R and regenerate the metadata.csv file
Bump the package version and commit to git
When a resource is removed from AnnotationHub
two things happen:
the ‘rdatadateremoved’ field is populated with a date and the ‘status’
field is populated with a reason why the resource is no longer available. Once
these changes are made, the AnnotationHub()
constructor will not list the
resource among the available ids. An attempt to extract the resource with
‘[[’ and the AH id will return an error along with the status message. The
function getInfoOnIds
will display metadata information for any resource
including resources still in the database but no longer available.
In general, resources are only removed when they are no longer available (e.g., moved from web location, no longer provided etc.).
To remove a resource from AnnotationHub
contact [email protected]
or [email protected].
Versioning of resources is handled by the maintainer. If you plan to provide incremental updates to a file for the same organism / genome build, we recommend including a version in the title of the resource so it is easy to distinguish which is most current. We also would recommend when uploading the data to S3 or your publicly accessible site to have a directory structure accounting for versioning.
If you do not include a version, or make the title unique in some way,
multiple files with the same title will be listed in the AnnotationHub
object. The user will can use the ‘rdatadateadded’ metadata field
to determine which file is the most current.
Several metadata fields control which resources are visible when a user invokes AnnotationHub(). Records are filtered based on these criteria:
Once a record is added to AnnotationHub it is visable from that point forward until stamped with ‘rdatadateremoved’. For example, a record added on May 1, 2017 with ‘biocVersion’ 3.6 will be visible in all snapshots >= May1, 2017 and in all Bioconductor versions >= 3.6.
A special filter for OrgDb is utilized. Only one OrgDb is available per
release/devel cycle. Therefore contributed OrgDb added to a devel cycle are
masked until the following release. There are options for debugging these masked
resources. see ?setAnnotationHubOption
The data should not be included in the package. This keeps the package light weight and quick for a user to install. This allows the user to investigate functions and documentation without downloading large data files and only proceeding with the download when necessary. There are two options for storing data: Bioconductor AWS S3 buckets or hosting the data elsewhere on a publicly accessible site. See information below and choose the options that fits best for your situation.
Data can be accessed through the hubs from any publicly accessible site. The
metadata.csv file[s] created will need the column Location_Prefix
to indicate
the hosted site. See more in the description of the metadata columns/fields
below but as a quick example if the link to the data file is
ftp://mylocalserver/singlecellExperiments/dataSet1.Rds
an example breakdown of
the Location_Prefix
and RDataPath
for this entry in the metadata.csv file
would be ftp://mylocalserver/
for the Location_Prefix
and
singlecellExperiments/dataSet1.Rds
for the RDataPath
.
Instead of providing the data files via dropbox, ftp, etc. we will grant temporary access to an S3 bucket where you can upload your data. Please email [email protected] for access.
You will be given access to the ‘AnnotationContributor’ user. Ensure that the
AWS CLI
is installed on your machine. See instructions for installing AWS CLI
here. Once you have requested access you
will be emailed a set of keys. There are two options to set the profile up for
AnnotationContributor
.aws/config
file to include the following updating the keys
accordingly:[profile AnnotationContributor]
output = text
region = us-east-1
aws_access_key_id = ****
aws_secret_access_key = ****
.aws/config
file, Run the following command entering
appropriate information from aboveaws configure --profile AnnotationContributor
After the configuration is set you should be able to upload resources using
# To upload a full directory use recursive:
aws --profile AnnotationContributor s3 cp test_dir s3://annotation-contributor/teset_dir --recursive --acl public-read
# To upload a single file
aws --profile AnnotationContributor s3 cp test_file.txt s3://annotation-contributor/test_file.txt --acl public-read
Please upload the data with the appropriate directory structure, including
subdirectories as necessary (i.e. top directory must be software package name,
then if applicable, subdirectories of versions, …). Please also do not forget
to use the flag --acl public-read
; This allows read access to the data file.
Once the upload is complete, email [email protected] to continue the process. To add the data officially the data will need to be uploaded and the metadata.csv file will need to be created in the github repository.
As described above the metadata.csv file (or multiple metadata.csv files) will
need to be created before the data can be added to the database. To ensure
proper formatting one should run AnnotationHubData::makeAnnotationHubMetadata
on the package with any/all metadata files, and address any ERRORs that
occur. Each data object uploaded to S3 should have an entry in the metadata
file. Briefly, a description of the metadata columns required:
devel version
of Bioconductor.FilePath
that
instead of trying to load the file into R, will only return the path to the
locally downloaded file.Any additional columns in the metadata.csv file will be ignored but could be included for internal reference.
More on Location_Prefix and RDataPath. These two fields make up the complete
file path url for downloading the data file. If using the Bioconductor AWS S3 bucket the
Location_Prefix should not be included in the metadata file[s] as this field
will be populated automatically. The RDataPath will be the directory structure
you uploaded to S3. If you uploaded a directory MyAnnotation/
, and that
directory had a subdirectory v1/
that contained two files counts.rds
and
coldata.rds
, your metadata file will contain two rows and the RDataPaths would
be MyAnnotation/v1/counts.rds
and MyAnnotation/v1/coldata.rds
. If you
host your data on a publicly accessible site you must include a base url as the
Location_Prefix
. If your data file was at
ftp://myinstiututeserver/biostats/project2/counts.rds
, your metadata file will
have one row and the Location_Prefix
would be ftp://myinstiututeserver/
and
the RDataPath
would be biostats/project2/counts.rds
.
This is a bad example because these annotations are already in the hubs but it should give you an idea of the format. Let’s say I have a package myAnnotations and I upload two annotation files for dog and cow with information extracted from ensembl to S3. You would want the following saved as a csv (comma seperated output) but for easier view we show in a table:
Title | Description | BiocVersion | Genome | SourceType | SourceUrl | SourceVersion | Species | TaxonomyId | Coordinate_1_based | DataProvider | Maintainer | RDataClass | DispatchClass | RDataPath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Dog Annotation | Gene Annotation for Canis lupus from ensembl | 3.9 | Canis lupus | GTF | ftp://ftp.ensembl.org/pub/release-95/gtf/canis_lupus_dingo/Canis_lupus_dingo.ASM325472v1.95.gtf.gz | release-95 | Canis lupus | 9612 | true | ensembl | Bioconductor Maintainer [email protected] | character | FilePath | myAnnotations/canis_lupus_dingo.ASM325472v1.95.gtf.gz |
Cow Annotation | Gene Annotation for Bos taurus from ensemble | 3.9 | Bos taurus | GTF | ftp://ftp.ensembl.org/pub/release-74/gtf/bos_taurus/Bos_taurus.UMD3.1.74.gtf.gz | release-74 | Bos taurus | 9913 | true | ensembl | Bioconductor Maintainer [email protected] | character | FilePath | myAnnotations/Bos_taurus.UMD3.1.74.gtf.gz |