{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Single Cell RNA-Sequencing data and gene relevance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need of course destiny, scran for preprocessing, and some tidyverse niceties." ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "library(conflicted)\n", "library(destiny)\n", "suppressPackageStartupMessages(library(scran))\n", "library(purrr)\n", "library(ggplot2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s use data from the `scRNAseq`[1] package. If necessary, install it via `BiocManager::install('scRNAseq')`.\n", "\n", "[1] Risso D, Cole M (2019). [scRNAseq: A Collection of Public Single-Cell RNA-Seq Datasets](https://bioconductor.org/packages/scRNAseq/)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
The dataset fluidigm
contains 65 cells from Pollen et al. (2014), each sequenced at high and low coverage.\n",
"
The dataset th2
contains 96 T helper cells from Mahata et al. (2014).\n",
"
The dataset allen
contains 379 cells from the mouse visual cortex. This is a subset of the data published in Tasic et al. (2016).\n",
"
The dataset.*?
', dotall = TRUE)) %>% unlist() %>%\n", " paste(collapse = '') %>% IRdisplay::display_html()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "379 cells seems sufficient to see something!" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [], "source": [ "data('allen', package = 'scRNAseq')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We’ll mostly stick to the [scran vignette][] here. Let’s add basic information to the data and choose what to work with.\n", "\n", "As `scran` expects the raw counts in the `counts` assay, we rename the more accurate RSEM counts to `counts`:\n", "\n", "[scran vignette]: https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "'select()' returned 1:many mapping between keys and columns\n", "\n", "'select()' returned 1:many mapping between keys and columns\n", "\n" ] }, { "data": { "text/plain": [ "class: SingleCellExperiment \n", "dim: 20908 379 \n", "metadata(2): SuppInfo which_qc\n", "assays(4): tophat_counts cufflinks_fpkm counts rsem_tpm\n", "rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3\n", "rowData names(3): Symbol EntrezID Uniprot\n", "colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336\n", "colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s\n", "reducedDimNames(0):\n", "spikeNames(1): ERCC" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "allen <- as(allen, 'SingleCellExperiment')\n", "rowData(allen)$Symbol <- rownames(allen)\n", "rowData(allen)$EntrezID <- AnnotationDbi::mapIds(org.Mm.eg.db::org.Mm.eg.db, rownames(allen), 'ENTREZID', 'ALIAS')\n", "rowData(allen)$Uniprot <- AnnotationDbi::mapIds(org.Mm.eg.db::org.Mm.eg.db, rownames(allen), 'UNIPROT', 'ALIAS', multiVals = 'list')\n", "assayNames(allen)[assayNames(allen) == 'rsem_counts'] <- 'counts'\n", "isSpike(allen, 'ERCC') <- grepl('^ERCC-', rownames(allen))\n", "allen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use it to renormalize the data. We normalize the `counts` using the spike-in size factors and logarithmize them into `logcounts`." ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "allen <- computeSpikeFactors(allen)\n", "allen <- normalize(allen)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also use the spike-ins to detect highly variable genes more accurately:" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "decomp <- decomposeVar(allen, trendVar(allen, parametric = TRUE))\n", "rowData(allen)$hvg_order <- order(decomp$bio, decreasing = TRUE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a subset of the data containing only rasonably highly variable genes and no spike-ins:" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [], "source": [ "allen_hvg <- subset(allen, hvg_order <= 5000L & !isSpike(allen))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s create a Diffusion map. For rapid results, people often create a PCA first, which can be stored in your `SingleCellExperiment` before creating the Diffusion map or simply created implicitly using `DiffusionMap(..., n_pcs =Entry | Gene names | Tissue specificity |
---|---|---|
<chr> | <chr> | <chr> |
Q8K221 | Arfip2 | NA |
Q6GQU0 | NA | NA |
Q7TPG8 | Tafa1 Fam19a1 | TISSUE SPECIFICITY: Expressed in the hippocampus and detected also in the cortex (at protein level). {ECO:0000269|PubMed:29799787}. |
Q8K304 | Tmem129 | NA |
P63318 | Prkcg Pkcc Pkcg Prkcc | TISSUE SPECIFICITY: Expressed in the cerebellum, cerebral cortex and hippocampus (at protein level). Highly expressed in Purkinje cells. {ECO:0000269|PubMed:17904530, ECO:0000269|PubMed:23185022}. |
Q8K2G4 | Bbs7 Bbs2l1 | NA |
Q2NKI4 | Prkcg Prkcc | NA |
A0A1B0GSM3 | Arfip2 mCG_19713 | NA |
Q3UN66 | Prkcg Prkcc mCG_18472 | NA |