scRNAseq 2.18.0
The scRNAseq package provides convenient access to several publicly available single-cell datasets in the form of SingleCellExperiment objects.
We do all of the necessary data munging for each dataset beforehand, so that users can obtain a SingleCellExperiment for immediate use in further analyses.
To enable discovery, each dataset is decorated with metadata such as the study title/abstract, the species involved, the number of cells, etc.
Users can also contribute their own published datasets to enable re-use by the wider Bioconductor community.
The surveyDatasets() function will show all available datasets along with their metadata.
This can be used to discover interesting datasets for further analysis.
library(scRNAseq)
all.ds <- surveyDatasets()
all.ds
## DataFrame with 83 rows and 15 columns
## name version path object
## <character> <character> <character> <character>
## 1 romanov-brain-2017 2023-12-19 NA single_cell_experiment
## 2 campbell-brain-2017 2023-12-14 NA single_cell_experiment
## 3 zhong-prefrontal-2018 2023-12-22 NA single_cell_experiment
## 4 macosko-retina-2015 2023-12-19 NA single_cell_experiment
## 5 ledergor-myeloma-2018 2023-12-20 NA single_cell_experiment
## ... ... ... ... ...
## 79 kotliarov-pbmc-2020 2024-04-18 NA single_cell_experiment
## 80 nestorowa-hsc-2016 2024-04-18 NA single_cell_experiment
## 81 tasic-brain-2016 2024-04-18 NA single_cell_experiment
## 82 buettner-esc-2015 2024-04-18 NA single_cell_experiment
## 83 leng-esc-2015 2024-04-25 NA single_cell_experiment
## title description taxonomy_id genome rows
## <character> <character> <List> <List> <integer>
## 1 Molecular interrogat.. Molecular interrogat.. 10090 GRCm38 24341
## 2 A molecular census o.. A molecular census o.. 10090 GRCm38 26774
## 3 A single-cell RNA-se.. A single-cell RNA-se.. 9606 GRCh38 24153
## 4 Highly Parallel Geno.. Highly Parallel Geno.. 10090 GRCm38 24658
## 5 Single cell dissecti.. Single cell dissecti.. 9606 GRCh38 57874
## ... ... ... ... ... ...
## 79 Broad immune activat.. Broad immune activat.. 9606 GRCh37 32738
## 80 A single-cell resolu.. A single-cell resolu.. 10090 GRCm38 46078
## 81 Adult mouse cortical.. Adult mouse cortical.. 10090 GRCm38 24058
## 82 Computational analys.. Computational analys.. 10090 GRCm38 38293
## 83 Oscope identifies os.. Oscope identifies os.. 9606 GRCh37 19084
## columns assays
## <integer> <List>
## 1 2881 counts
## 2 21086 counts
## 3 2394 counts
## 4 49300 counts
## 5 51840 counts
## ... ... ...
## 79 58654 counts
## 80 1920 counts
## 81 1809 counts
## 82 288 counts
## 83 460 normalized
## column_annotations
## <List>
## 1 level1 class,level2 class (neuron..,level2 cluster numbe..,...
## 2 group,batches,sex,...
## 3 developmental_stage,gender,sample,...
## 4 cluster
## 5 well_coordinates,Amp_batch_ID,Cell_barcode,...
## ... ...
## 79 nGene,nUMI,orig.ident,...
## 80 gate,broad,broad.mpp,...
## 81 mouse_line,cre_driver_1,cre_driver_2,...
## 82 phase,metrics
## 83 CellLine,Experiment,Phase
## reduced_dimensions alternative_experiments
## <List> <List>
## 1
## 2
## 3
## 4
## 5
## ... ... ...
## 79 ADT
## 80 diffusion ERCC,FACS
## 81 ERCC
## 82 ERCC
## 83
## sources
## <SplitDataFrameList>
## 1 GEO:GSE74672:NA,PubMed:27991900:NA
## 2 GEO:GSE93374:NA,PubMed:28166221:NA
## 3 GEO:GSE104276:NA,PubMed:29539641:NA
## 4 GEO:GSE63472:NA,PubMed:26000488:NA,URL:http://mccarrolllab...:2024-02-23
## 5 GEO:GSE117156:NA,PubMed:30523328:NA
## ... ...
## 79 PubMed:32094927:NA,URL:https://nih.figshare..:2024-02-23
## 80 GEO:GSE81682:NA,PubMed:27365425:NA,URL:http://blood.stemcel..:2024-02-23
## 81 PubMed:26727548:NA,GEO:GSE71585:NA
## 82 ArrayExpress:E-MTAB-2805:NA,PubMed:25599176:NA
## 83 GEO:GSE64016:NA,PubMed:26301841:NA
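The survey table can be filtered like any other data frame. The sketch below uses a plain base R data.frame as a stand-in for the first five rows printed above (the values are copied from that output), so it runs without downloading anything. It assumes that the integer columns of the real DataFrame can be subset in the same way; list-like columns such as taxonomy_id and genome are simplified to plain character vectors here and would need different handling in the real object.

```r
# Stand-in for the first five rows of the surveyDatasets() output shown above.
# In the real DataFrame, 'genome' is a list-like column; here it is simplified.
survey <- data.frame(
    name = c("romanov-brain-2017", "campbell-brain-2017",
             "zhong-prefrontal-2018", "macosko-retina-2015",
             "ledergor-myeloma-2018"),
    genome = c("GRCm38", "GRCm38", "GRCh38", "GRCm38", "GRCh38"),
    rows = c(24341L, 26774L, 24153L, 24658L, 57874L),
    columns = c(2881L, 21086L, 2394L, 49300L, 51840L),
    stringsAsFactors = FALSE
)

# Keep datasets with more than 10,000 cells (the 'columns' field).
big <- survey[survey$columns > 10000L, c("name", "genome", "columns")]

# Order from the largest to the smallest dataset.
big <- big[order(big$columns, decreasing = TRUE), ]
print(big)
```

The same pattern applies to the rows field if the number of genes is the quantity of interest.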
Users can also search on the metadata text using the searchDatasets() function.
This accepts both simple text queries and more complicated expressions involving boolean operations.
# Find all datasets involving pancreas.
searchDatasets("pancreas")[,c("name", "title")]
## DataFrame with 5 rows and 2 columns
## name title
## <character> <character>
## 1 grun-bone_marrow-2016 De Novo Prediction o..
## 2 muraro-pancreas-2016 A Single-Cell Transc..
## 3 baron-pancreas-2016 A Single-Cell Transc..
## 4 baron-pancreas-2016 A Single-Cell Transc..
## 5 grun-pancreas-2016 De Novo Prediction o..
# Find all mouse (GRCm38, also known as mm10) datasets involving pancreas or neurons.
searchDatasets(
    defineTextQuery("GRCm38", field="genome") &
    (defineTextQuery("neuro%", partial=TRUE) |
        defineTextQuery("pancrea%", partial=TRUE))
)[,c("name", "title")]
## DataFrame with 14 rows and 2 columns
## name title
## <character> <character>
## 1 romanov-brain-2017 Molecular interrogat..
## 2 campbell-brain-2017 A molecular census o..
## 3 fletcher-olfactory-2.. Deconstructing Olfac..
## 4 hu-cortex-2017 Dissecting cell-type..
## 5 hu-cortex-2017 Dissecting cell-type..
## ... ... ...
## 10 zeisel-nervous-2018 Molecular Architectu..
## 11 zeisel-brain-2015 Brain structure. Cel..
## 12 shekhar-retina-2016 Comprehensive Classi..
## 13 baron-pancreas-2016 A Single-Cell Transc..
## 14 grun-pancreas-2016 De Novo Prediction o..
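The % in the partial queries above acts as a wildcard. As a rough illustration of how such a pattern matches individual words, the sketch below translates a wildcard pattern into a regular expression in base R. It assumes SQL-style semantics (% for any run of characters, _ for a single character) and patterns made up of letters and wildcards only; wildcard_to_regex() is a hypothetical helper written for this example, not part of scRNAseq or gypsum, and it does not reproduce how the search is actually implemented.

```r
# Hypothetical helper: turn an SQL-style wildcard pattern into an anchored regex.
# Assumes the pattern contains only letters, '%' and '_'.
wildcard_to_regex <- function(pattern) {
    rx <- gsub("%", ".*", pattern, fixed = TRUE)  # '%' matches any run of characters
    rx <- gsub("_", ".", rx, fixed = TRUE)        # '_' matches exactly one character
    paste0("^", rx, "$")
}

# Example words, as they might appear in a title or abstract.
tokens <- c("neuron", "neuronal", "pancreas", "pancreatic", "retina")

is_neuro <- grepl(wildcard_to_regex("neuro%"), tokens)
is_pancrea <- grepl(wildcard_to_regex("pancrea%"), tokens)

# Combine the two partial matches with a boolean OR, as in the query above.
matched <- tokens[is_neuro | is_pancrea]
print(matched)
```

A partial pattern therefore catches variants such as "neuron" and "neuronal" that an exact query for a single word would miss.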
Keep in mind that the search results are not guaranteed to be reproducible - more datasets may be added over time, and existing datasets may be updated with new versions. Once a dataset of interest is identified, users should explicitly list the name and version of the dataset in their scripts to ensure reproducibility.
The fetchDataset() function will download a particular dataset, returning it as a SingleCellExperiment:
sce <- fetchDataset("zeisel-brain-2015", "2023-12-14")
sce
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
For studies that generate multiple datasets, the dataset of interest must be explicitly requested via the path= argument:
sce <- fetchDataset("baron-pancreas-2016", "2023-12-14", path="human")
sce
## class: SingleCellExperiment
## dim: 20125 8569
## metadata(0):
## assays(1): counts
## rownames(20125): A1BG A1CF ... ZZZ3 pk
## rowData names(0):
## colnames(8569): human1_lib1.final_cell_0001 human1_lib1.final_cell_0002
## ... human4_lib3.final_cell_0700 human4_lib3.final_cell_0701
## colData names(2): donor label
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
By default, array data is loaded as a file-backed DelayedArray from the HDF5Array package.
Setting realize.assays=TRUE and/or realize.reduced.dims=TRUE will coerce these to more conventional in-memory representations like ordinary arrays or dgCMatrix objects.
assay(sce)
## <20125 x 8569> sparse DelayedMatrix object of type "integer":
## human1_lib1.final_cell_0001 ... human4_lib3.final_cell_0701
## A1BG 0 . 0
## A1CF 4 . 0
## A2M 0 . 0
## A2ML1 0 . 0
## A4GALT 0 . 0
## ... . . .
## ZYG11B 0 . 0
## ZYX 2 . 0
## ZZEF1 0 . 0
## ZZZ3 0 . 0
## pk 1 . 0
sce <- fetchDataset("baron-pancreas-2016", "2023-12-14", path="human", realize.assays=TRUE)
class(assay(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
Users can also fetch the metadata associated with each dataset:
str(fetchMetadata("zeisel-brain-2015", "2023-12-14"))
## List of 9
## $ title : chr "Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq"
## $ description : chr "The mammalian cerebral cortex supports cognitive functions such as sensorimotor integration, memory, and social"| __truncated__
## $ taxonomy_id :List of 1
## ..$ : chr "10090"
## $ genome :List of 1
## ..$ : chr "GRCm38"
## $ sources :List of 5
## ..$ :List of 3
## .. ..$ provider: chr "URL"
## .. ..$ id : chr "https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt"
## .. ..$ version : chr "2024-02-23"
## ..$ :List of 3
## .. ..$ provider: chr "URL"
## .. ..$ id : chr "https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_rep_17-Aug-2014.txt"
## .. ..$ version : chr "2024-02-23"
## ..$ :List of 3
## .. ..$ provider: chr "URL"
## .. ..$ id : chr "https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_spikes_17-Aug-2014.txt"
## .. ..$ version : chr "2024-02-23"
## ..$ :List of 3
## .. ..$ provider: chr "URL"
## .. ..$ id : chr "https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mito_17-Aug-2014.txt"
## .. ..$ version : chr "2024-02-23"
## ..$ :List of 2
## .. ..$ provider: chr "PubMed"
## .. ..$ id : chr "25700174"
## $ maintainer_name : chr "Aaron Lun"
## $ maintainer_email : chr "[email protected]"
## $ bioconductor_version: chr "3.19"
## $ applications :List of 1
## ..$ takane:List of 3
## .. ..$ type : chr "single_cell_experiment"
## .. ..$ summarized_experiment :List of 4
## .. .. ..$ rows : int 20006
## .. .. ..$ columns : int 3005
## .. .. ..$ assays :List of 1
## .. .. .. ..$ : chr "counts"
## .. .. ..$ column_annotations:List of 9
## .. .. .. ..$ : chr "tissue"
## .. .. .. ..$ : chr "group #"
## .. .. .. ..$ : chr "total mRNA mol"
## .. .. .. ..$ : chr "well"
## .. .. .. ..$ : chr "sex"
## .. .. .. ..$ : chr "age"
## .. .. .. ..$ : chr "diameter"
## .. .. .. ..$ : chr "level1class"
## .. .. .. ..$ : chr "level2class"
## .. ..$ single_cell_experiment:List of 2
## .. .. ..$ reduced_dimensions : list()
## .. .. ..$ alternative_experiments:List of 2
## .. .. .. ..$ : chr "repeat"
## .. .. .. ..$ : chr "ERCC"
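The metadata is an ordinary nested list, so individual fields can be pulled out with base R. The sketch below rebuilds a trimmed copy of the list printed above (only one of the four URL sources is kept, and the description is omitted) so that it runs offline; with a live session, the return value of fetchMetadata() could presumably be used in place of the hand-written list.

```r
# Trimmed, hand-written copy of the metadata structure shown above.
meta <- list(
    title = "Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq",
    taxonomy_id = list("10090"),
    genome = list("GRCm38"),
    sources = list(
        list(provider = "URL",
             id = "https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt",
             version = "2024-02-23"),
        list(provider = "PubMed", id = "25700174")
    )
)

# Provider of each source entry.
providers <- vapply(meta$sources, function(s) s$provider, character(1))

# PubMed identifiers only.
pubmed_ids <- vapply(meta$sources[providers == "PubMed"],
                     function(s) s$id, character(1))

# Flatten the list-valued fields into character vectors.
species <- unlist(meta$taxonomy_id)
genome <- unlist(meta$genome)

print(providers)
print(pubmed_ids)
```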
Want to contribute your own dataset to this package? It’s easy! Just follow these simple steps for instant fame and prestige.
Format your dataset as a SummarizedExperiment or SingleCellExperiment.
Let’s just make up something here.
library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=matrix(rpois(1000, lambda=1), 100, 10)))
rownames(sce) <- sprintf("GENE_%i", seq_len(nrow(sce)))
colnames(sce) <- head(LETTERS, 10)
Assemble the metadata for your dataset.
This should be a list structured as specified in the Bioconductor metadata schema.
Check out some examples from fetchMetadata(); note that the applications.takane property will be automatically added later, and so can be omitted from the list that you create.
meta <- list(
    title="My dataset",
    description="This is my dataset",
    taxonomy_id="10090",
    genome="GRCm38", # mouse assembly, to match taxonomy ID 10090
    sources=list(
        list(provider="GEO", id="GSE12345"),
        list(provider="PubMed", id="1234567")
    ),
    maintainer_name="Chihaya Kisaragi",
    maintainer_email="maintainer@example.com" # placeholder address; use your own
)
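Before saving, it can help to confirm that the list has the expected shape. The sketch below is a minimal pre-flight check, assuming that the fields used in the example above are the ones that matter; check_meta_fields() is a hypothetical helper written for this example and is not a substitute for validation against the actual Bioconductor metadata schema.

```r
# Hypothetical helper: check that a metadata list has the fields used in the
# example above. This is NOT the official schema validation.
check_meta_fields <- function(meta,
                              required = c("title", "description", "taxonomy_id",
                                           "genome", "sources", "maintainer_name",
                                           "maintainer_email")) {
    absent <- setdiff(required, names(meta))
    if (length(absent) > 0L) {
        stop("metadata is missing field(s): ", paste(absent, collapse = ", "))
    }
    # Every source entry should at least name a provider and an identifier.
    for (s in meta$sources) {
        if (!all(c("provider", "id") %in% names(s))) {
            stop("each source needs both 'provider' and 'id'")
        }
    }
    invisible(TRUE)
}

# Example metadata; the email address is a placeholder.
example_meta <- list(
    title = "My dataset",
    description = "This is my dataset",
    taxonomy_id = "10090",
    genome = "GRCm38",
    sources = list(
        list(provider = "GEO", id = "GSE12345"),
        list(provider = "PubMed", id = "1234567")
    ),
    maintainer_name = "Chihaya Kisaragi",
    maintainer_email = "maintainer@example.com"
)

ok <- check_meta_fields(example_meta)

# An incomplete list is rejected with an informative error.
incomplete <- try(check_meta_fields(list(title = "Only a title")), silent = TRUE)
print(inherits(incomplete, "try-error"))
```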
Save your SummarizedExperiment (or whatever object contains your dataset) to disk with saveDataset().
This saves the dataset into a “staging directory” using language-agnostic file formats - check out the alabaster framework for more details.
In more complex cases involving multiple datasets, users may save each dataset into a subdirectory of the staging directory.
# Simple case: you only have one dataset to upload.
staging <- tempfile()
saveDataset(sce, staging, meta)
list.files(staging, recursive=TRUE)
## [1] "OBJECT" "_bioconductor.json"
## [3] "assays/0/OBJECT" "assays/0/array.h5"
## [5] "assays/names.json" "column_data/OBJECT"
## [7] "column_data/basic_columns.h5" "row_data/OBJECT"
## [9] "row_data/basic_columns.h5"
# Complex case: you have multiple datasets to upload.
staging <- tempfile()
dir.create(staging)
saveDataset(sce, file.path(staging, "foo"), meta)
saveDataset(sce, file.path(staging, "bar"), meta) # etc.
You can check that everything was correctly saved by reloading the on-disk data into the R session for inspection:
alabaster.base::readObject(file.path(staging, "foo"))
## class: SingleCellExperiment
## dim: 100 10
## metadata(0):
## assays(1): counts
## rownames(100): GENE_1 GENE_2 ... GENE_99 GENE_100
## rowData names(0):
## colnames(10): A B ... I J
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Open a pull request (PR) for the addition of a new dataset. You will need to provide a few things here:
The name of your dataset, in the format {NAME}-{SYSTEM}-{YEAR}, where NAME is the last name of the first author of the study, SYSTEM is the biological system (e.g., tissue, cell types) being studied, and YEAR is the year of publication for the dataset.
A script in the scripts/ directory of this package, in order to provide some record of how the dataset was created.
Wait for us to grant temporary upload permissions to your GitHub account.
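The {NAME}-{SYSTEM}-{YEAR} naming convention can be assembled and checked programmatically. The sketch below is one way to do so in base R; make_dataset_name() and is_valid_dataset_name() are hypothetical helpers written for this example. The lower-casing and the use of underscores within a component are assumptions based on existing names such as grun-bone_marrow-2016, and the date-style version string is likewise an assumption based on the versions listed by surveyDatasets().

```r
# Hypothetical helper: assemble a dataset name in the {NAME}-{SYSTEM}-{YEAR} form.
make_dataset_name <- function(author, system, year) {
    # Lower-case each component and replace runs of other characters with '_'.
    clean <- function(x) gsub("[^a-z0-9]+", "_", tolower(trimws(x)))
    sprintf("%s-%s-%d", clean(author), clean(system), as.integer(year))
}

# Hypothetical helper: check that a name has three hyphen-separated parts,
# the last of which is a four-digit year.
is_valid_dataset_name <- function(x) {
    grepl("^[a-z0-9_]+-[a-z0-9_]+-[0-9]{4}$", x)
}

dataset_name <- make_dataset_name("Grun", "bone marrow", 2016)
print(dataset_name)

print(is_valid_dataset_name(c("zeisel-brain-2015", "My Dataset")))

# A date-style version string, mirroring versions such as "2023-12-14".
dataset_version <- format(Sys.Date(), "%Y-%m-%d")
```

Writing the name and version down once, and reusing the same two strings for upload and retrieval, avoids mismatches between the two steps.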
Upload your staging directory to the gypsum backend with gypsum::uploadDirectory().
On the first call to this function, it will automatically prompt you to log into GitHub so that the backend can authenticate you.
If you are on a system without browser access (e.g., most computing clusters), a token can be manually supplied via gypsum::setAccessToken().
gypsum::uploadDirectory(staging, "scRNAseq", "my_dataset_name", "my_version")
You can check that everything was successfully uploaded by calling fetchDataset() with the same name and version:
fetchDataset("my_dataset_name", "my_version")
If you realize you made a mistake, no worries. Use the following call to clear the erroneous dataset, and try again:
gypsum::rejectProbation("scRNAseq", "my_dataset_name", "my_version")
Comment on the PR to notify us that the dataset has finished uploading and you’re happy with it. We’ll review it and make sure everything’s in order. If some fixes are required, we’ll just clear the dataset so that you can upload a new version with the necessary changes. Otherwise, we’ll approve the dataset. Note that once a version of a dataset is approved, no further changes can be made to that version; you’ll have to upload a new version if you want to modify something.
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scRNAseq_2.18.0 SingleCellExperiment_1.26.0
## [3] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [5] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
## [7] IRanges_2.38.0 S4Vectors_0.42.0
## [9] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
## [11] matrixStats_1.3.0 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.2 bitops_1.0-7 httr2_1.0.1
## [4] rlang_1.1.3 magrittr_2.0.3 gypsum_1.0.0
## [7] compiler_4.4.0 RSQLite_2.3.6 GenomicFeatures_1.56.0
## [10] paws.storage_0.5.0 png_0.1-8 vctrs_0.6.5
## [13] ProtGenerics_1.36.0 pkgconfig_2.0.3 crayon_1.5.2
## [16] fastmap_1.1.1 dbplyr_2.5.0 XVector_0.44.0
## [19] paws.common_0.7.2 utf8_1.2.4 Rsamtools_2.20.0
## [22] rmarkdown_2.26 UCSC.utils_1.0.0 bit_4.0.5
## [25] xfun_0.43 zlibbioc_1.50.0 cachem_1.0.8
## [28] jsonlite_1.8.8 blob_1.2.4 rhdf5filters_1.16.0
## [31] DelayedArray_0.30.0 Rhdf5lib_1.26.0 BiocParallel_1.38.0
## [34] parallel_4.4.0 R6_2.5.1 bslib_0.7.0
## [37] rtracklayer_1.64.0 jquerylib_0.1.4 Rcpp_1.0.12
## [40] bookdown_0.39 knitr_1.46 Matrix_1.7-0
## [43] tidyselect_1.2.1 abind_1.4-5 yaml_2.3.8
## [46] codetools_0.2-20 curl_5.2.1 lattice_0.22-6
## [49] alabaster.sce_1.4.0 tibble_3.2.1 KEGGREST_1.44.0
## [52] evaluate_0.23 BiocFileCache_2.12.0 alabaster.schemas_1.4.0
## [55] ExperimentHub_2.12.0 Biostrings_2.72.0 pillar_1.9.0
## [58] BiocManager_1.30.22 filelock_1.0.3 generics_0.1.3
## [61] RCurl_1.98-1.14 BiocVersion_3.19.1 ensembldb_2.28.0
## [64] alabaster.base_1.4.0 alabaster.ranges_1.4.0 glue_1.7.0
## [67] alabaster.matrix_1.4.0 lazyeval_0.2.2 tools_4.4.0
## [70] AnnotationHub_3.12.0 BiocIO_1.14.0 GenomicAlignments_1.40.0
## [73] XML_3.99-0.16.1 rhdf5_2.48.0 grid_4.4.0
## [76] jsonvalidate_1.3.2 AnnotationDbi_1.66.0 GenomeInfoDbData_1.2.12
## [79] HDF5Array_1.32.0 restfulr_0.0.15 cli_3.6.2
## [82] rappdirs_0.3.3 fansi_1.0.6 S4Arrays_1.4.0
## [85] dplyr_1.1.4 V8_4.4.2 AnnotationFilter_1.28.0
## [88] alabaster.se_1.4.0 sass_0.4.9 digest_0.6.35
## [91] SparseArray_1.4.0 rjson_0.2.21 memoise_2.0.1
## [94] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
## [97] bit64_4.0.5