---
title: "curatedTBData"
author: 
- name: Xutao Wang
  affiliation: 
  - Department of Biostatistics, Boston University School of Public Health,
    Boston, MA, U.S.A.
  email: xutaow@bu.edu
- name: Prasad Patil
  affiliation: 
  - Department of Biostatistics, Boston University School of Public Health,
    Boston, MA, U.S.A.
  email: patil@bu.edu
- name: W. Evan Johnson
  affiliation: 
  - Section of Computational Biomedicine, Boston University School of Medicine, 
    Boston, MA, U.S.A.
  email: wej@bu.edu
package: curatedTBData
abstract: >
    The curatedTBData is an R package that provides standardized, curated 
    tuberculosis(TB) transcriptomic studies. The initial release of the package 
    contains 49 studies. The curatedTBData package allows users to access 
    tuberculosis trancriptomic efficiently and to make easy comparison for 
    different TB gene signatures across multiple datasets.
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{curatedTBData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Installation
```{r install curatedTBData, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("curatedTBData")
```

# R packages
```{r libraries, message=FALSE, warning=FALSE, results='hide'}
library(curatedTBData)
library(dplyr)
library(SummarizedExperiment)
library(sva)
```

# Access to MetaData
The `DataSummary` table summarized the list of available studies and 
their metadata information contained in the curatedTBData package. \
The table helps users to query avialable datasets quickly.
```{r accessimg datasummary}
# Remove GeographicalRegion, Age, DiagnosisMethod, Notes, Tissue, HIVStatus for concise display
data("DataSummary", package = "curatedTBData")
DataSummary |>
  dplyr::select(-c(`Country/Region`, Age, DiagnosisMethod, Notes,
                   Tissue, HIVStatus)) |>
  DT::datatable()
```

# Access curated tuberculosis gene expression profile
Users can use `curatedTBData()` function to access data. 
There are three arguments in the function. 
The first argument `study_name` represents the names of the data that are used 
to determine the resources of interests. 
Users can find all available resource names from `DataSummary$Study`. 
The second argument `dry.run` enables users to determine the resources's 
availability before actually downloading them. 
When `dry.run` is set to `TRUE`, the output includes names of the resources. 
The third argument `curated.only` allows the users to access 
the curated ready-to-use data. 
If `curated.only` is `TRUE`, the function only download the curated 
gene expression profile and the clinical annotation of the corresponding data. 
If `curated.only` is `FALSE`, the function downloads all available resources 
for input studies.
```{r access data with dry.run}
curatedTBData("GSE19439", dry.run = TRUE, curated.only = FALSE)
```
To download complete data for a Microarry study (e.g. [GSE19439](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19439)), 
we set `dry.run = FALSE` and `curated.only = FALSE`.
There are two experiments assay being included in the output Microarray studies. 
The first experiment is `assay_curated`, which is a `matrix` that represents 
normalized, curated version of the gene expression profile. 
The second experiment is `assay_raw`, which is a 
`r Biocpkg("SummarziedExperiment")` object that contains the raw gene expression profile and information about probe features.
```{r download full data for GSE19439, warning=FALSE}
GSE19439 <- curatedTBData("GSE19439", dry.run = FALSE, curated.only = FALSE)
GSE19439
```

When accessing data for RNA sequencing studies, another assay called 
`assay_reprocess` is included. This `matrix` represents the reprocessed version 
of gene expression profile from the raw .fastq files using 
`r Biocpkg("Rsubread")`.

```{r, warning=FALSE}
GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE)
GSE79362
```

A list of `r Biocpkg("MultiAssayExperiment")` objects is returned 
when `dry.run` is `FALSE`. \
To save running time, in the following example, we set the `curated.only = TRUE` 
and selected five studies that belong to [GSE19491](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19491) 
from the Gene Expression Omnibus.
```{r access selected curated datasets microarray, warning=FALSE, results='hide'}
myGEO <- c("GSE19435", "GSE19439", "GSE19442", "GSE19444", "GSE22098")
object_list <- curatedTBData(myGEO, dry.run = FALSE, curated.only = TRUE)
```

```{r}
object_list[1:2]
```

A full version example for RNA sequencing data:
[GSE79362](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE79362). 
```{r ACS RNA-seq, warning=FALSE, results='hide'}
GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE)
```

```{r}
GSE79362
```

# Subset/Combine MultiAssayExperiment objects

## Subset
The major advantage of using `r Biocpkg("MultiAssayExperiment")` 
is the coordination of the meta-data and assays when sub-setting.
The MultiAssayExperiment object has built-in function for subsetting samples 
based on column condition. \
The following code shows how to select samples with only active TB.
```{r}
GSE19439 <- object_list$GSE19439
GSE19439[, GSE19439$TBStatus == "PTB"]["assay_curated"] # 13 samples
```

The following example shows how to subset patients with active TB and LTBI
```{r}
GSE19439[, GSE19439$TBStatus %in% c("PTB", "LTBI")]["assay_curated"] # 30 samples
```

# Dataset integeration

## Merge Studies with common gene symbols
The `combine_objects()` function provides an easy implementation for 
combining different studies based on common gene symbols. 
The function returns a `r Biocpkg("SummarizedExperiment")` object that contains 
the merged assay and associated clinical annotation. Noticed that [GSE74092](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi) is usually 
removed from merging, because it used quantitative PCR,
which did not have enough coverage to capture all genes. \
There are two arguments in the `combine_objects()` function. 
The first one is `object_list`, which takes a list of 
`r Biocpkg("MultiAssayExperiment")` objects obtained from `curatedTBData()`. Notice that the `names(object_list)` should not be `NULL` 
and must be unique for each object within the list,
so that we can keep track the original study after merging. 
The second argument is `experiment_name`, which can be a string or vector of strings representing the name of the assay from the object. 
```{r}
GSE19491 <- combine_objects(object_list, experiment_name = "assay_curated",
                            update_genes = TRUE)
GSE19491
```
When the objects are merged, the original individual data tag can be found 
in the `Study` section from the metadata.
```{r}
unique(GSE19491$Study)
```

It is also possible to combine the given list of objects with 
different experiment names. In this case, the `experiment_name` is a vector of 
string that corresponds to each of object from the input list.
```{r commbine objects multiple experiment name}
exp <- combine_objects(c(GSE79362[1], object_list[1]),
                       experiment_name = c("assay_reprocess_hg19",
                                           "assay_curated"),
                       update_genes = TRUE)
exp
```
## Batch correction
If datasets are merged, it is typically recommended to 
remove a very likely batch effect. 
We use the `ComBat()` function from `r Biocpkg("sva")` to remove 
potential batch effect between studies. \
In the following example, each study is viewed as one batch. The batch corrected
assay will be stored in a `r Biocpkg("SummarizedExperiment")` object.
```{r batch correction, message=FALSE}
batch1 <- colData(GSE19491)$Study
combat_edata1 <- sva::ComBat(dat = assay(GSE19491), batch = batch1)
assays(GSE19491)[["Batch_corrected_assay"]] <- combat_edata1
GSE19491
```

# Dataset subset
The function `subset_curatedTBData()` allows the users to subset a list of 
`r Biocpkg("MultiAssayExperiment")` with the output contains 
the exact conditions given by the `annotationCondition`. \
With `subset_curatedTBData()`, users can quickly subset desired results 
from `r Biocpkg("curatedTBData")` database without checking individual object. \
There are four arguments in this function. 
The `theObject` represents a `r Biocpkg("MultiAssayExperiment")` or 
`r Biocpkg("SummarizedExperiment")` object. 
The `annotationColName` is a character that indicates the 
column name in the metadata. The `annotationCondition` is a character or 
vector of characters that the users intend to select.

## Select patients with Active TB and LTBI
In the following example, we call `subset_curatedTBData()` function to 
subset samples with active TB (`PTB`) and latent TB infection (`LTBI`) for 
binary classification. 
```{r subset active TB and LTBI, message=FALSE, warning=FALSE}
multi_set_PTB_LTBI <- lapply(object_list, function(x)
  subset_curatedTBData(x, annotationColName = "TBStatus",
                       annotationCondition = c("LTBI", "PTB"), 
                       assayName = "assay_curated"))
# Remove NULL from the list
multi_set_PTB_LTBI <- multi_set_PTB_LTBI[!sapply(multi_set_PTB_LTBI, is.null)]
multi_set_PTB_LTBI[1:3]
```

## Select other outcome of interests
The HIV status (`HIVStatus`) and diabetes status (`DiabetesStatus`) for each 
subject were also recorded for each study in the `r Biocpkg("curatedTBData")`. \
In the following example, we select subjects with HIV positive from the input. \
Users can also find HIV status information for each study by 
looking at the column: `HIVStatus` from `DataSummary`. \
When the the length of the `annotationCondition` equals to 1, we can subset using 
either `MultiAssayExperiment` built-in procedure or `subset_curatedTBData`.
```{r subset patients with HIV, message=FALSE, warning=FALSE}
multi_set_HIV <- lapply(object_list, function(x)
  subset_curatedTBData(x, annotationColName = "HIVStatus",
                       annotationCondition = "Negative",
                       assayName = "assay_curated"))
# Remove NULL from the list
multi_set_HIV <- multi_set_HIV[!vapply(multi_set_HIV, is.null, TRUE)]
multi_set_HIV[1:3]
```

# Session Info
```{r session info}
sessionInfo()
```