--- title: "Using ctgGEM" author: "Carrie Minette" date: "`r Sys.Date()`" output: html_document: BiocStyle::html_document vignette: > %\VignetteIndexEntry{ctgGEM} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction Single-cell gene expression profiling methods are well suited for determining transcriptional heterogeneity across a cell population. In many cases, this heterogeneity may reflect dynamic changes in the transcriptional landscape due to biological processes such as differentiation or tumorigenesis. The overall trajectory of this change may be unified or branched and is partially obscured by the noise of the transcriptional programs within each cell. To address this computational challenge, a number of packages have been created to calculate pseudotemporal ordering of cells to reflect cell-state hierarchies from gene expression data. Each of these packages have unique inputs, outputs, and visualization schemes, which makes comparisons across these methods time-consuming. Here we present Cell Tree Generator for Gene Expression Matrices (`r Rpackage("ctgGEM")`), a package to streamline the building of cell-state hierarchies from single-cell gene expression data across multiple existing tools for improved comparability and reproducibility. ctgGEM provides the user with simplified way to build trees with these packages using one function call with a single dataset and the desired visualization name as a parameter. Results are also stored in the SIF file format for use in downstream analysis workflows or input into Cytoscape. The packages currently supported by ctgGEM are: - `r Biocpkg("destiny")` - `r Biocpkg("monocle")` - `r Biocpkg("sincell")` - `r Biocpkg("TSCAN")` # Installing `r Rpackage("ctgGEM")` `r Rpackage("ctgGEM")` requires R version 4.0 or higher (available [here](http://www.r-project.org/)) and the most recent version of Bioconductor. For more information on using Bioconductor, please see their website at https:// bioconductor.org. The following code will install Bioconductor, `r Rpackage("ctgGEM")` and the following CRAN packages and their dependencies: `r CRANpkg(c("ggm", "ggplot2", "igraph", "irlba", "maptpx", "VGAM"))` ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ctgGEM") ``` After installing, attach the `r Rpackage("ctgGEM")` package with the following command: ```{r loadCTG, message=FALSE, warning=FALSE, results='hide'} library(ctgGEM) ``` # Preparing a `ctgGEMset` object The `r Rpackage("ctgGEM")` workflow is based around a single data class, `ctgGEMset`, that extends the `r Biocpkg("Biobase")` `SummarizedExperiment` class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. The `ctgGEMset` class requires the following three inputs: 1. `exprsData`, a numeric matrix of expression values, where rows are genes, and columns are cells/samples 2. `phenoData`, a data frame, where rows are cells/samples, and columns are cell/sample attributes (cell type, culture condition, day captured, etc.) 3. `featureData`, a data frame, where rows are features (e.g. genes), and columns are gene attributes (gene identifiers, gc content, etc.) The expression value matrix must have the same number of columns as the phenoData has rows, and it must have the same number of rows as the featureData data frame has rows. Row names of the phenoData object must match the column names of the expression matrix. Row names of the featureData object must matchrow names of the expression matrix. Details for loading the data for and constructing an example `ctgGEMset` object suitable for this vignette can be found in the following section. ## Loading the Data A `ctgGEMset` object that will support all of the tree types in `r Rpackage("ctgGEM")` requires a gene expression matrix, which for the `monocle` and `TSCAN` tree types must contain strictly non-negative values. For this vignette, we will construct a toy object, using the data provided in the `r Biocpkg("HSMMSingleCell")` package. For more information on this data set, please see the documentation available at https://doi.org/doi:10.18129/B9.bioc.HSMMSingleCell ```{r loadRaw, message=FALSE, warning=FALSE, results='hide'} # load HSMMSingleCell package library(HSMMSingleCell) # load the data data(HSMM_expr_matrix) data(HSMM_sample_sheet) data(HSMM_gene_annotation) ``` ## Constructing a `ctgGEMset` object Using the data loaded in the previous step, we can construct the `ctgGEMset` object for this vignette as follows: ```{r constructctgGEMset} toyGEMset <- ctgGEMset(exprsData = HSMM_expr_matrix, phenoData = HSMM_sample_sheet, featureData = HSMM_gene_annotation) ``` ### Additional Info for the "monocle" method Using the "monocle" method requires a pair of parameters prepared with a column name of gene identifiers in `featureData()` that corresponds to the gene short names as the first item to be set, followed by the data type. The data type must be one of "UMI", "TC", "FPKM", "TPM", "LTFPKM", or "LTTPM", where "UMI" is UMI counts, "TC" is transcript counts, "FPKM" is FPKM, and "TPM" is TPM. Monocle works best with untransformed data, but if you want to use log-transformed FPKM or TPM, use data type "LTFPKM" or "LTTPM", respectively. Here we will use the *gene_short_name* column and the data type "FPKM". ```{r monocleInfo} monocleInfo(toyGEMset, "gene_id") <- "gene_short_name" monocleInfo(toyGEMset, "ex_type") <- "FPKM" ``` The "monocle" method can be used in either semi-supervised or unsupervised mode. To run in semi-supervised mode, the names of two known classifying marker genes from the column specified as "gene_id" need to be set. If these two marker genes are not set, unsupervised mode will run. The two genes we will use for this demonstration are *MYF5*, a known marker for myoblasts, and *ANPEP*, a known marker for fibroblasts. ```{r monocleInfo2} # Set two marker genes to use semi-supervised mode # Alternatively, omit the next two lines to run in unsupervised mode monocleInfo(toyGEMset, "cell_id_1") <- "MYF5" # marks myoblasts monocleInfo(toyGEMset, "cell_id_2") <- "ANPEP" # marks fibroblasts ``` ### Additional Info for the "TSCAN" method In addition to its primary plot, the "TSCAN" method can also generate a single gene vs. pseudotime plot. To generate this plot, we need to supply the rowname of a single gene row in exprs(), and store it in TSCANinfo. Here we will use the "ENSG00000000003.10" gene. ```{r TSCANinfo} TSCANinfo(toyGEMset) <- "ENSG00000000003.10" ``` ### Additional Info for the "sincell" method The "sincell" method can be used with a variety of parameters to control the distance method used, which type (if any) dimensionality reduction to be used, and which clustering method to use. If no options are specified, PCA will be applied, with KNN clustering. Additional details concerning these parameters can be found in the sincell package documentation. To use a distance method with no dimensionality reduction, set the "method" parameter using the `sincellInfo()` function to one of the following: "euclidean", "L1" (Manhattan distance), "cosine", "pearson", "spearman", or "MI" (Mutual Information). ```{r sincelldistonly, eval=FALSE} sincellInfo(toyGEMset, "method") <- "pearson" ``` To use dimensionality reduction with the "sincell" method, set the "method" parameter to "PCA" for Principal Component Analysis, "ICA" for Independent Component Analysis, "tSNE" for t-Distributed Stochastic Neighbor Embedding, "classical-MDS" for classical Multidimensional Scaling, or "nonmetric-MDS" for non-metric Multidimensional Scaling. If using "classical-MDS" or "nonmetric- MDS", we can also select the distance method to use with the "MDS.distance" parameter. The options for "MDS.distance" are the same as those listed above for no dimensionality reduction, with the exception the that the "cosine" method cannot be selected for "MDS.distance". ```{r sincelldimred} sincellInfo(toyGEMset, "method") <- "classical-MDS" sincellInfo(toyGEMset, "MDS.distance") <- "spearman" ``` The final optional parameter for the "sincell" method is "clust.method". Acceptable values are "max-distance", "percent", "knn", "k-medoids", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid". In this example, we will set it to "k-medoids". ```{r sincellclust} sincellInfo(toyGEMset, "clust.method") <- "k-medoids" ``` ### Additional Info for the "destiny" method Unlike the other tree types, the "destiny" method does not have the option to provide further information. It should be noted, however, that the `r Biocpkg("destiny")` documentation recommends cleaned, referably normalized data, and suggests that single-cell RNA-seq count data should be transformed using a variance-stabilizing transformation (e.g. log or rlog). # Using `r Rpackage("ctgGEM")` To use `r Rpackage("ctgGEM")`, call the `generate_tree` function with the desired tree method and `ctgGEMset` object. The user can specify the desired output directory using the 'outputDir' parameter, or default to the temporary directory returned by `tempdir`. ## Using the "monocle" method To use our example `ctgGEMset` and the "monocle" method, we would type the following: ```{r genMon, echo=TRUE, results='hide', message=FALSE} toyGEMset <- generate_tree(dataSet = toyGEMset, treeType = "monocle") ``` This stores the final trees in the *originalTrees* list within the `ctgGEMset` object, a simplified `r CRANpkg("igraph")` version of the tree in the *treeList* within the `ctgGEMset` object, and if necessary creates a new folder, called "CTG-Output", that contains a folder called "SIFs" containing the .SIF text file for the final tree, and a folder called "Plots" containing a .png image of the following plot: ```{r monPlots, echo=FALSE, out.width="400px", message=FALSE} plotOriginalTree(toyGEMset, "monocle") ``` ## Using the "sincell" method To use our example `ctgGEMset` and the "sincell" method, we would type the following: ```{r genSin, message=FALSE} toyGEMset <- generate_tree(dataSet = toyGEMset, treeType = "sincell") ``` This stores the final trees in the *originalTrees* list within the `ctgGEMset` object, a simplified `r CRANpkg("igraph")` version of the tree in the *treeList* within the `ctgGEMset` object, and if necessary creates a new folder, called "CTG-Output", that contains a folder called "Plots" containing .png images of the following plots: ```{r sincellPlots, echo=FALSE, out.width="400px", message=FALSE} plotOriginalTree(toyGEMset, "sincellIMC") plotOriginalTree(toyGEMset, "sincellMST") plotOriginalTree(toyGEMset, "sincellSST") ``` ## Using the "TSCAN" method To use our example `ctgGEMset` and the "TSCAN"" method, we would type the following: ```{r genTSCAN, echo=TRUE, message=FALSE} toyGEMset <- generate_tree(dataSet = toyGEMset, treeType = "TSCAN") ``` This stores the final trees in the *originalTrees* list within the `ctgGEMset` object, a simplified `r CRANpkg("igraph")` version of the tree in the *treeList* within the `ctgGEMset` object, and if necessary creates a new folder, called "CTG-Output", that contains a folder called "SIFs" containing the .SIF text file for the final tree, and a folder called "Plots" containing .png images of the following plots: ```{r TSCANplots, echo=FALSE, out.width="400px", message=FALSE} plotOriginalTree(toyGEMset, "TSCANclustering") plotOriginalTree(toyGEMset, "TSCANsingleGene") ``` ## Using the "destiny" method Note: some users may encounter problems installing and using the destiny package on R 4.0.0. To use our example `ctgGEMset` and the "destiny" method, we would type the following: ```{r genDes, message=FALSE} toyGEMset <- generate_tree(dataSet = toyGEMset, treeType = "destiny") ``` This stores the final trees in the *originalTrees* list within the `ctgGEMset` object and if necessary creates a new folder, called "CTG-Output" that contains a folder called "SIFs" containing the .SIF text file for the final tree, and contains a folder called "Plots" containing .png images of the following plots: ```{r destinyPlots, echo=FALSE, out.width="400px", message=FALSE} plotOriginalTree(toyGEMset, "destinyDM") plotOriginalTree(toyGEMset, "destinyDPT") ``` # Replotting Trees If at some point we wish to view the plot of a tree generated after it's been created, but don't want to have to regenerate it and all its files, `r Rpackage("ctgGEM")` has a function named `plotOriginalTree()`, that will reproduce a plot stored in a ctgGEMset object. To use this function, we must know the name of the tree we wish to plot. We can view the names of the trees in our toyGEMset object using the `names()` function. ```{r getNames, echo=TRUE} names(originalTrees(toyGEMset)) ``` Once we have the names, we can choose a tree to plot. Let's plot the "destinyDM" tree again. ```{r plotTree, echo=TRUE, message=FALSE} plotOriginalTree(toyGEMset, "destinyDM") ``` Using this function eliminates the need to regenerate the tree to view a plot that was already created, thereby saving time for trees that require extensive computations to generate. To store your analysis session result for later use, you can use the .Rda format. ```{r save, eval = TRUE} save(toyGEMset, file = "toyGEMset.Rda") ``` # Session Information ```{r} sessionInfo() ``` # References ## destiny Philipp Angerer et al. (2015): destiny: diffusion maps for large-scale single-cell data in R. Helmholtz-Zentrum München. URL: http://bioinformatics.oxfordjournals.org/content/32/8/1241 destiny package URL: https://bioconductor.org/packages/release/bioc/html/destiny.html ## monocle Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS and Rinn JL (2014). “The dynamics and regulators of cell fate decisions are revealed by pseudo-temporal ordering of single cells.” Nature Biotechnology. monocle package URL: https://bioconductor.org/packages/release/bioc/html/monocle.html ## sincell Juliá M, Telenti A, Rausell A (2014): Sincell: R package for the statistical assessment of cell state hierarchies from single-cell RNA-seq data. bioRxiv preprint sincell package URL: https://bioconductor.org/packages/release/bioc/html/sincell.html ## TSCAN Zhicheng Ji and Hongkai Ji (2015). TSCAN: TSCAN: Tools for Single-Cell ANalysis. R package version 1.12.0. TSCAN package URL: https://bioconductor.org/packages/release/bioc/html/TSCAN.html