---
title: "The depmap data"
author:
- name: Theo Killian
- name: Laurent Gatto
  affiliation: Computational Biology, UCLouvain
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{depmap}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r, echo = FALSE}
suppressPackageStartupMessages(library("tidyverse"))
```

# Introduction

The `depmap` package aims to provide a reproducible research framework
for the cancer dependency data described by [Tsherniak, Aviad, et
al. "Defining a cancer dependency map." Cell 170.3 (2017):
564-576.](https://www.ncbi.nlm.nih.gov/pubmed/28753430). The data
found in the
[depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html)
package has been formatted to facilitate the use of common R packages
such as `dplyr` and `ggplot2`. We hope that this package will allow
researchers to more easily mine, explore and visualise dependency
data taken from the DepMap cancer genomic dependency study.

# Installation instructions

To install
[depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html),
the
[BiocManager](https://cran.r-project.org/web/packages/BiocManager/index.html)
package, the Bioconductor project's package manager, is required. If
[BiocManager](https://cran.r-project.org/web/packages/BiocManager/index.html)
is not already installed, install it first by typing
`install.packages("BiocManager")` within R (this needs to be done only
once).

```{r install, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("depmap")
```

The `depmap` package fully depends on the `ExperimentHub` Bioconductor
package, which allows the data accessed in this package to be stored
and retrieved from the cloud.

```{r import_EH, message = FALSE}
library("depmap")
library("ExperimentHub")
```

# Tidy depmap data

The
[depmap](https://bioconductor.org/packages/devel/data/experiment/html/depmap.html)
package currently contains nine datasets, described in the sections
below, all available through `ExperimentHub`.

The data found in this R package has been converted from "wide"
format `.csv` files to "long" format `.rda` files. None of the values
taken from the original datasets have been changed, although the
columns have been re-arranged. The changes made are described in the
`Details` section available after querying the relevant dataset.
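
As a rough illustration of what a wide-to-long conversion looks like,
the sketch below reshapes a small made-up table with
`tidyr::pivot_longer()`. The cell line names, gene names, values and
column names are invented for the example and are not taken from the
package.

```{r, message = FALSE}
library("tibble")
library("tidyr")

## toy "wide" table: one row per cell line, one column per gene
## (all names and values are made up for illustration)
wide <- tibble(cell_line = c("LINE_A", "LINE_B"),
               GENE1 = c(-0.5, 0.1),
               GENE2 = c(0.3, -1.2))

## "long" format: one row per (cell line, gene) pair
long <- pivot_longer(wide,
                     cols = -cell_line,
                     names_to = "gene_name",
                     values_to = "dependency")
long
```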

```{r ehquery}
## create ExperimentHub query object
eh <- ExperimentHub()
query(eh, "depmap")
```

Each dataset has an `ExperimentHub` accession number (e.g. *EH2260*
refers to the `rnai` dataset from the 19Q1 release).
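
The query can be narrowed by passing several search terms, and a
record's metadata can be inspected before downloading anything. The
chunk below is a minimal sketch and is not evaluated here; it assumes
that the term `"rnai"` appears in the metadata of the relevant
records.

```{r, eval = FALSE}
library("ExperimentHub")
eh <- ExperimentHub()

## keep only records matching both "depmap" and "rnai"
## (assumes "rnai" appears in the record metadata, e.g. in the title)
rnai_records <- query(eh, c("depmap", "rnai"))
rnai_records

## single brackets subset the hub and display the record's metadata;
## double brackets (eh[["EH2260"]]) retrieve the data itself
eh["EH2260"]
```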

## RNA interference knockdown data


The `rnai` dataset contains the combined genetic dependency data for
RNAi-induced gene knockdown for select genes and cancer cell
lines. This data corresponds to the
`D2_combined_genetic_dependency_scores.csv` file.

Specific `rnai` datasets, such as `rnai_19Q1`, can be accessed by EH number.

```{r, eval = FALSE}
eh[["EH2260"]]
```

The most recent `rnai` dataset can be automatically loaded into R by
using the `depmap_rnai` function.

```{r}
depmap::depmap_rnai()
```
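
Because the data is in long format, it can be queried directly with
`dplyr` verbs. The sketch below uses a small made-up table in place of
the real `rnai` data; the column names `gene_name`, `cell_line` and
`dependency` are assumptions for illustration and should be checked
against the object returned by `depmap_rnai()`.

```{r, message = FALSE}
library("dplyr")
library("tibble")

## made-up stand-in for a long-format dependency table
toy_rnai <- tibble(
    gene_name = c("GENE1", "GENE1", "GENE2", "GENE2"),
    cell_line = c("LINE_A", "LINE_B", "LINE_A", "LINE_B"),
    dependency = c(-1.4, 0.2, -0.1, -0.6))

## scores for GENE1 only, sorted by increasing dependency score
gene1 <- toy_rnai |>
    filter(gene_name == "GENE1") |>
    arrange(dependency)
gene1
```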

## CRISPR-Cas9 knockout data

The `crispr` dataset contains the (batch corrected CERES inferred gene
effect) CRISPR-Cas9 knockout data of select genes and cancer cell
lines. This data corresponds to the `gene_effect_corrected.csv` file.

Specific `crispr` datasets, such as `crispr_19Q1`, can be accessed by
EH number.

```{r, eval = FALSE}
eh[["EH2261"]]
```

The most recent `crispr` dataset can be automatically loaded into R by
using the `depmap_crispr` function.

```{r}
depmap::depmap_crispr()
```

## WES copy number data

The `copyNumber` dataset contains the WES copy number data, relating
to the numerical log-fold copy number change measured against the
baseline copy number of select genes and cell lines. This dataset
corresponds to the `public_19Q1_gene_cn.csv` file.

Specific `copyNumber` datasets, such as `copyNumber_19Q1`, can be
accessed by EH number.

```{r, eval = FALSE}
eh[["EH2262"]]
```

The most recent `copyNumber` dataset can be automatically loaded into
R by using the `depmap_copyNumber` function.

```{r}
depmap::depmap_copyNumber()
```

## CCLE Reverse Phase Protein Array data


The `RPPA` dataset contains the CCLE Reverse Phase Protein Array
(RPPA) data which corresponds to the `CCLE_RPPA_20180123.csv` file.

Specific `RPPA` datasets, such as `RPPA_19Q1`, can be accessed by EH
number.

```{r, eval = FALSE}
eh[["EH2263"]]
```

The most recent `RPPA` dataset can be automatically loaded into R by
using the `depmap_RPPA` function.

```{r}
depmap::depmap_RPPA()
```

## CCLE RNAseq gene expression data

The `TPM` dataset contains the CCLE RNAseq gene expression data. This
shows expression data only for protein coding genes (using scale
log2(TPM+1)). This data corresponds to the `CCLE_depMap_19Q1_TPM.csv`
file.

Specific `TPM` datasets, such as `TPM_19Q1`, can be accessed by EH number.

```{r, eval = FALSE}
eh[["EH2264"]]
```

The `TPM` dataset can also be accessed by using the `depmap_TPM` function.

```{r}
depmap::depmap_TPM()
```
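
Since expression is reported on the log2(TPM+1) scale, values can be
converted back to TPM by inverting the transformation. The short
example below uses made-up numbers rather than values from the
dataset.

```{r}
## made-up expression values on the log2(TPM + 1) scale
log_expr <- c(0, 1, 3.5)

## invert the transformation: TPM = 2^x - 1
tpm <- 2^log_expr - 1
tpm

## round trip back to the log2(TPM + 1) scale
all.equal(log2(tpm + 1), log_expr)
```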

## Cancer cell lines

The `metadata` dataset contains the metadata about all of the cancer
cell lines.  It corresponds to the `depmap_19Q1_cell_lines.csv` file.

Specific `metadata` datasets, such as `metadata_19Q1`, can be accessed
by EH number.

```{r, eval = FALSE}
eh[["EH2266"]]
```

The most recent `metadata` dataset can be automatically loaded into R by using
the `depmap_metadata` function.

```{r}
depmap::depmap_metadata()
```
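
Cell line metadata can be combined with any of the other datasets
through a shared cell line identifier. The sketch below joins two
small made-up tables; the column names `depmap_id`, `lineage`,
`gene_name` and `dependency` are assumed for illustration and should
be checked against the real datasets.

```{r, message = FALSE}
library("dplyr")
library("tibble")

## made-up stand-ins for the metadata and a dependency dataset
toy_metadata <- tibble(depmap_id = c("ID-1", "ID-2"),
                       lineage = c("lung", "skin"))
toy_scores <- tibble(depmap_id = c("ID-1", "ID-2", "ID-2"),
                     gene_name = c("GENE1", "GENE1", "GENE2"),
                     dependency = c(-0.9, 0.1, -0.4))

## add the lineage of each cell line to the dependency scores
annotated <- left_join(toy_scores, toy_metadata, by = "depmap_id")
annotated
```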


## Mutation calls


The `mutationCalls` dataset contains all merged mutation calls (coding
region, germline filtered) found in the depmap dependency study. This
dataset corresponds to the `depmap_19Q1_mutation_calls.csv` file.

Specific `mutationCalls` datasets, such as `mutationCalls_19Q1`, can
be accessed by EH number.

```{r, eval = FALSE}
eh[["EH2265"]]
```

The most recent `mutationCalls` dataset can be automatically loaded into R by
using the `depmap_mutationCalls` function.

```{r}
depmap::depmap_mutationCalls()
```


## Drug Sensitivity

The `drug_sensitivity` dataset contains dependency data for cancer
cell lines treated with various compounds. This dataset corresponds
to the `primary_replicate_collapsed_logfold_change.csv` file.

Specific `drug_sensitivity` datasets, such as `drug_sensitivity_19Q3`,
can be accessed by EH number.

```{r, eval = FALSE}
eh[["EH3087"]]
```

The most recent `drug_sensitivity` dataset can be automatically loaded
into R by using the `depmap_drug_sensitivity` function.

```{r}
depmap::depmap_drug_sensitivity()
```

## Proteomic

The `proteomic` dataset contains normalized quantitative profiling of
proteins of cancer cell lines by mass spectrometry. This dataset
corresponds to the
`https://gygi.med.harvard.edu/sites/gygi.med.harvard.edu/files/documents/protein_quant_current_normalized.csv.gz`
file.

Specific `proteomic` datasets, such as `proteomic_20Q2`, can be
accessed by EH number.

```{r, eval = FALSE}
eh[["EH3459"]]
```

The most recent `proteomic` dataset can be automatically loaded into R by
using the `depmap_proteomic` function.

```{r}
depmap::depmap_proteomic()
```


## Repackaged data source

If desired, the original data from which the
[depmap](https://bioconductor.org/packages/depmap) package was
derived can be downloaded from the [Broad
Institute](https://depmap.org/portal/download/) website. The
instructions on how to download these files, and how the data was
transformed and loaded into the
[depmap](https://bioconductor.org/packages/depmap) package, can be
found in the `make_data.R` file in `./inst/scripts`. (Note that the
original uncompressed *.csv* files are > 1.5GB in total and take a
moderate amount of time to download.)

# Original depmap data

In addition to the re-packaged files, the package also allows users to
download any of the original files provided by the [DepMap project on
Figshare](https://figshare.com/authors/Broad_DepMap/5514062).

A list of all the datasets is available with the `dmsets()` function:

```{r}
dmsets()
```

We can check which datasets from any quarter of 2020 are available by
searching for `"20Q"` in the dataset titles:

```{r, message=FALSE}
library(tidyverse)
dmsets() |>
    filter(grepl("20Q", title))
```

Let's focus on the *PRISM Repurposing 20Q2 Dataset*, with
identifier `20564034`.

A list of all the files is available with the `dmfiles()` function:

```{r}
dmfiles()
```

If we want to find all files from the *PRISM Repurposing 20Q2 Dataset*
identified above, we could filter all files with its `dataset_id`:

```{r}
dmfiles() |>
    filter(dataset_id == 20564034)
```

Let's now focus on the
`prism-repurposing-20q2-primary-screen-cell-line-info.csv` file. We
can filter it by its name and download it with `dmget()`:

```{r}
dmfiles() |>
    filter(name == "prism-repurposing-20q2-primary-screen-cell-line-info.csv") |>
    dmget()
```

The `dmget()` function first checks whether the file has already been
downloaded and cached in the depmap cache directory (see
`?dmCache()`). If so, it retrieves it from there. Otherwise, it
downloads the file and stores it in the package cache directory. In
both cases, it returns the location of the cached file.
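
The check-then-download behaviour described above can be sketched
generically as follows. This is not the implementation used by
`dmget()`: `get_cached()` and `fake_fetch()` are hypothetical helpers,
the cache is a temporary directory, and the download step is replaced
by a function that writes a small file so that the example runs
offline.

```{r}
## hypothetical helper illustrating a check-then-fetch cache
get_cached <- function(name, cache_dir, fetch) {
    path <- file.path(cache_dir, name)
    if (!file.exists(path)) {
        ## not cached yet: fetch the file into the cache
        fetch(path)
    }
    ## in both cases, return the location of the cached file
    path
}

## stand-in for a download: writes a tiny csv and counts its calls
n_fetches <- 0
fake_fetch <- function(path) {
    n_fetches <<- n_fetches + 1
    writeLines(c("id,value", "1,a"), path)
}

## fresh, empty cache directory for the example
cache_dir <- tempfile("toy_cache")
dir.create(cache_dir)

p1 <- get_cached("toy.csv", cache_dir, fake_fetch) ## fetched
p2 <- get_cached("toy.csv", cache_dir, fake_fetch) ## found in cache
n_fetches
```

The second call finds the file already in place, so the fetch function
is only invoked once and both calls return the same path.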

Given that the file is in csv format, we can directly open it with
`read_csv()`:

```{r}
dmfiles() |>
    filter(name == "prism-repurposing-20q2-primary-screen-cell-line-info.csv") |>
    dmget() |>
    read_csv()
```

It is also possible to pass multiple rows of the `dmfiles()` table to
`dmget()` to retrieve multiple file paths. Below, let's get all the
README.txt files from 2020:

```{r}
ids_2020 <- filter(dmsets(), grepl("20Q", title)) |>
    pull(dataset_id)

dmfiles() |>
    filter(dataset_id %in% ids_2020) |>
    filter(grepl("README", name)) |>
    dmget()
```
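
Once several file paths have been retrieved, the files can be read in
one go. The sketch below assumes the paths are available as a
character vector and uses two made-up text files in a temporary
directory in place of the downloaded README files.

```{r}
## two made-up text files standing in for downloaded README files
paths <- file.path(tempdir(), c("README_a.txt", "README_b.txt"))
writeLines("first toy file", paths[1])
writeLines(c("second toy file", "with two lines"), paths[2])

## read each file and name the results after the file they came from
readmes <- lapply(paths, readLines)
names(readmes) <- basename(paths)

## number of lines read from each file
lengths(readmes)
```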

# Session information

```{r echo = FALSE}
sessionInfo()