---
title: "DTE and DGE with inferential replicates"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
author: "Anqi Zhu, Avi Srivastava, Joseph Ibrahim, Rob Patro, Michael Love"
output:
  rmarkdown::html_document:
    highlight: tango
    toc: true
    toc_float: true
bibliography: library.bib
vignette: |
  %\VignetteIndexEntry{DTE and DGE with inferential replicates}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<!-- run this document with rmarkdown::render("swish.Rmd") -->

```{r setup, echo=FALSE, results="hide"}
knitr::opts_chunk$set(tidy=FALSE, cache=FALSE, dev="png",
                      message=FALSE, error=FALSE, warning=FALSE)
```

# The Swish method

The `swish` method for differential expression analysis of RNA-seq
data using inferential replicate counts is described in the following
reference: @swish - [doi: 10.1093/nar/gkz622](https://doi.org/10.1093/nar/gkz622).

We note that `swish` extends and builds on another method, *SAMseq*
[@samseq], implemented in the *samr* package,
by taking into account inferential uncertainty, and allowing to
control for batch effects and matched samples. Additionally, `swish`
has methods for testing changes in effect size across secondary
covariates, which we refer to as "interactions".
`swish` calls functions from the *qvalue* [@qvalue] or *samr* package
for calculation of local FDR and q-value. This vignette gives an
example of differential analysis of matched samples, and an
interaction test for matched samples, to see if a condition effect
changes in magnitude across two groups of samples.

**Acknowledgments:** We have benefited in the development of *Swish*
from the feedback of Hirak Sarkar.

# Quick start

The following lines of code will perform a basic transcript-level
`swish` two group analysis. For more details, read on.

```{r eval=FALSE}
# 'coldata.csv': sample information table
coldata <- read.csv("coldata.csv")
library(tximeta)
y <- tximeta(coldata)
library(swish)
y <- scaleInfReps(y)
y <- labelKeep(y)
set.seed(1)
y <- swish(y, x="condition")
```

The results can be found in `mcols(y)`. For example, one can calculate
the number of genes passing a 5% FDR threshold:

```{r eval=FALSE}
table(mcols(y)$qvalue < .05)
```

One can at any point remove the genes that didn't pass the expression
filter with the following line of code (can be run before or after
`swish`). These genes are ignored by `swish`, and so will have `NA` in
the results columns in `mcols(y)`.

```{r eval=FALSE}
y <- y[mcols(y)$keep,]
```

A gene-level analysis looks identical to a transcript-level analysis,
only the input data changes. Examples follow.

Lastly, what is the structure of the output of `tximeta` [@tximeta], which is
used in `swish`? See the section below, *Structure of tximeta
output / swish input*.

## Macrophage stimulation experiment

We begin the *fishpond* vignette by loading data from a Bioconductor
Experiment Data package, *macrophage*. The package contains RNA-seq
quantification from 24 RNA-seq samples, which are a subset of the
RNA-seq samples generated and analyzed by @alasoo - 
[doi: 10.1038/s41588-018-0046-7](https://doi.org/10.1038/s41588-018-0046-7).

The experiment involved treatment of macrophage cell lines from a number
of human donors with IFN gamma, *Salmonella* infection, or both
treatments combined. In the beginning of this vignette, we will focus
on comparing the IFN gamma stimulated cell lines with the control cell
lines, accounting for the paired nature of the data (cells from the
same donor). Later in the vignette we will analyze differences in the
*Salmonella* infection response by IFN gamma treatment status --
whether the cells are primed for immune response.

We load the package, and point to the `extdata` directory. For a
typical analysis, the user would just point `dir` to the location on
the machine or cluster where the transcript quantifications are stored
(e.g. the `quant.sf` files).

```{r}
library(macrophage)
dir <- system.file("extdata", package="macrophage")
```

The data was quantified using *Salmon* [@salmon] 0.12.0 against the
Gencode v29 human reference transcripts [@gencode]. For more details
and all code used for quantification, refer to the
[macrophage](https://bioconductor.org/packages/macrophage) 
package vignette. 

Importantly, `--numGibbsSamples 20` was used to generate 20
inferential replicates with *Salmon*'s Gibbs sampling procedure.
Inferential replicates, either from Gibbs sampling or bootstrapping of
reads, are required for the *swish* method shown below. We also
recommend to use `--gcBias` when running *Salmon* to protect against
common sample-specific biases present in RNA-seq data.

# Data import

## Read in the column data from CSV

We start by reading in a CSV with the *column data*, that is,
information about the samples, which are represented as columns of
the *SummarizedExperiment* object we will construct containing the
counts of reads per gene or transcript.

```{r}
coldata <- read.csv(file.path(dir, "coldata.csv"))
head(coldata)
```

We will subset to certain columns of interest, and re-name them for
later.

```{r}
coldata <- coldata[,c(1,2,3,5)]
names(coldata) <- c("names","id","line","condition")
```

## Add a column pointing to your files

`coldata` needs to have a column `files` which specifies the path to
the quantification files. In this case, we've gzipped the
quantification files, so we point to the `quant.sf.gz` file. We make
sure that all the files exist in the location we specified.

```{r}
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
all(file.exists(coldata$files))
```

## Read in quants with tximeta

We will read in quantification data for some of the samples. First we
load the *SummarizedExperiment* package. We will store out data and
the output of the statistical method in a *SummarizedExperiment*
object. We use the *tximeta* [@tximeta] package to read in the data:

```{r}
suppressPackageStartupMessages(library(SummarizedExperiment))
```

```{r include=FALSE}
# This hidden code chunk is only needed for Bioc build machines,
# so that 'fishpond' will build regardless of whether
# the machine can connect to ftp.ebi.ac.uk.
# Using linkedTxomes to point to a GTF that lives in the macrophage pkg.
# The chunk can be skipped if you have internet connection,
# as tximeta will automatically ID the transcriptome and DL the GTF.
library(tximeta)
makeLinkedTxome(
  indexDir=file.path(dir, "gencode.v29_salmon_0.12.0"),
  source="Gencode",
  organism="Homo sapiens",
  release="29",
  genome="GRCh38",
  fasta="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz",
  gtf=file.path(dir, "gencode.v29.annotation.gtf.gz"), # local version
  write=FALSE
)
```

We load in the quantification data with `tximeta`:

```{r}
library(tximeta)
se <- tximeta(coldata)
```

We can see that all the assays have been loaded:

```{r}
assayNames(se)
```

`tximeta` loads transcript-level data, although it can later be
summarized to the gene levels:

```{r}
head(rownames(se))
```

We will rename our *SummarizedExperiment* `y` for the statistical
analysis. For speed of the vignette, we subset to the transcripts on
chromosome 1.

```{r}
y <- se
y <- y[seqnames(y) == "chr1",]
```

Two demonstrate a two group comparison, we subset to the "naive" and
"IFNg" groups. 

```{r}
y <- y[,y$condition %in% c("naive","IFNg")]
y$condition <- factor(y$condition, c("naive","IFNg"))
```

# Differential transcript expression

## Running Swish at the transcript level

Running `swish` has three steps: scaling the inferential replicates,
labeling the rows with sufficient counts for running differential
expression, and then calculating the statistics. As `swish` makes use
of pseudo-random number generation in breaking ties and in calculating
permutations, to obtain identical results, one needs to set a random
seed before running `swish()`, as we do below.

The default number of permutations in `swish` is
`nperms=100`. However, for paired datasets as this one, you may have
fewer maximum permutations. In this case, there are 64 possible ways
to switch the condition labels for six pairs of samples. We can set
the `nperms` manually (or if we had just used the default value,
`swish` would set `nperms` to the maximum value possible and notify
the user that it had done so).

```{r results="hide", message=FALSE}
library(fishpond)
y <- scaleInfReps(y)
y <- labelKeep(y)
y <- y[mcols(y)$keep,]
set.seed(1)
y <- swish(y, x="condition", pair="line", nperms=64)
```

A note about `labelKeep`: by default we keep features with `minN=3`
samples with a minimal count of 10. For scRNA-seq data with
de-duplicated UMI counts, we recommend to lower the count, e.g. a
count of 3, across a higher number of `minN` cells, depending on the
number of cells being compared. You can also set `x="condition"` when
running `labelKeep` which will use the condition variable to set
`minN`.

The results are stored in `mcols(y)`. We will show below how to pull
out the top up- and down-regulated transcripts.

We can see how many transcripts are in a 5% FDR set:

```{r}
table(mcols(y)$qvalue < .05)
```

## Plotting results

We can check the distribution of p-values. This looks as expected for
a comparison where we expect many transcripts will be affected by the
treatment (IFNg stimulation of macrophage cells). There is a flat
component and then an enrichment of transcripts with p-values near 0.

```{r}
hist(mcols(y)$pvalue, col="grey")
```

Of the transcripts in this set, which have the most extreme log2
fold change? Note that often many transcripts will share the same
q-value, so it's valuable to look at the log2 fold change as well (see
further note below on q-value computation). The log2 fold
change computed by `swish` is the median over inferential replicates,
and uses a pseudo-count of 5 on the scaled counts, to stabilize
the variance on the fold change from division by small counts. Here we
make two vectors that give the significant genes with the lowest (most
negative) and highest (most positive) log fold changes.

```{r}
with(mcols(y),
     table(sig=qvalue < .05, sign.lfc=sign(log2FC))
     )
sig <- mcols(y)$qvalue < .05
lo <- order(mcols(y)$log2FC * sig)
hi <- order(-mcols(y)$log2FC * sig)
```

Here we print a small table with just the calculated statistics for
the large positive log fold change transcripts (up-regulation):

```{r}
top.up <- mcols(y)[head(hi),]
names(top.up)
cols <- c("log10mean","log2FC","pvalue","qvalue")
print(as.data.frame(top.up)[,cols], digits=3)
```

Likewise for the largest negative log fold change transcripts
(down-regulation): 

```{r}
top.down <- mcols(y)[head(lo),]
print(as.data.frame(top.down)[,cols], digits=3)
```

We can plot the scaled counts for the inferential replicates, and also
group the samples by a covariate, in this case the cell line. The
analysis was paired, so the statistic assessed if the change within
pairs was consistent. Here we plot the 100th top up-regulated
transcript: 

```{r}
plotInfReps(y, idx=hi[100], x="condition", cov="line")
```

We can make an MA plot, where the transcripts in our FDR set are
colored:

```{r}
plotMASwish(y, alpha=.05)
```

Using the `addIds` function from *tximeta*, we can easily add gene
symbols. By specifying `gene=TRUE`, this will use the gene ID to match
to gene symbols for all of the transcripts.

```{r}
library(org.Hs.eg.db)
y <- addIds(y, "SYMBOL", gene=TRUE)
```

We can then add gene symbols to our MA plot:

```{r}
plotMASwish(y, alpha=.05, xlim=c(.5,5.5))
with(
  subset(mcols(y), qvalue < .05 & abs(log2FC) > 4),
     text(log10mean, log2FC, SYMBOL,
          col="blue", pos=4, cex=.7)
)
```

# Differential gene expression

## Running Swish at the gene level

We can also run swish at the gene level. First we summarize all of the
data to the gene level, using the `summarizeToGene` function from
*tximeta*. Again, we rename the object for statistical analysis, and
then we subset to the genes on chromosome 1 for the demonstration.

```{r}
gse <- summarizeToGene(se)
gy <- gse
gy <- gy[seqnames(gy) == "chr1",]
```

Two demonstrate a two group comparison, we subset to the "naive" and
"IFNg" groups, as before.

```{r}
gy <- gy[,gy$condition %in% c("naive","IFNg")]
gy$condition <- factor(gy$condition, c("naive","IFNg"))
```

Next we can run the same steps as before. Again we set a random seed
in order to be able to reproduce exact results in the future:

```{r results="hide", message=FALSE}
gy <- scaleInfReps(gy)
gy <- labelKeep(gy)
gy <- gy[mcols(gy)$keep,]
set.seed(1)
gy <- swish(gy, x="condition", pair="line", nperms=64)
```

As before, the number of genes in a 1% FDR set:

```{r}
table(mcols(gy)$qvalue < .05)
```

## Plotting gene results

The histogram of p-values:

```{r}
hist(mcols(y)$pvalue, col="grey")
```

As before, finding the genes with the most extreme log2 fold change:

```{r}
with(mcols(gy),
     table(sig=qvalue < .05, sign.lfc=sign(log2FC))
     )     
sig <- mcols(gy)$qvalue < .05
glo <- order(mcols(gy)$log2FC * sig)
ghi <- order(-mcols(gy)$log2FC * sig)
```

```{r}
gtop.up <- mcols(gy)[head(ghi),]
print(as.data.frame(gtop.up)[,cols], digits=3)
gtop.down <- mcols(gy)[head(glo),]
print(as.data.frame(gtop.down)[,cols], digits=3)
```

We can plot a particular one of these genes:

```{r}
plotInfReps(gy, idx=ghi[100], x="condition", cov="line")
```

As expected, the highly up-regulated genes are involved in immune
response. Many genes encoding guanylate-binding proteins (GBP) are
up-regulated, and these proteins are induced by interferon,
produced in response to infection by pathogenic microbes.

We can make an MA plot, where the genes in our FDR set are colored:

```{r}
plotMASwish(gy, alpha=.05)
```

Again, using the `addIds` function from *tximeta*, we can easily add
gene symbols to our gene-level expression analysis:

```{r}
library(org.Hs.eg.db)
gy <- addIds(gy, "SYMBOL", gene=TRUE)
```

We can then add gene symbols to our MA plot:

```{r}
plotMASwish(gy, alpha=.05, xlim=c(.5,5.5))
with(
  subset(mcols(gy), qvalue < .05 & abs(log2FC) > 3),
     text(log10mean, log2FC, SYMBOL,
          col="blue", pos=4, cex=.7)
)
```

# Differential transcript usage

We have added a new function `isoformProportions` which can be run
after `scaleInfReps` (and optionally after removing genes via 
`labelKeep` and subsetting the SummarizedExperiment). This function,
`isoformProportions` will create a new assay `isoProp` from the
scaledTPM counts, containing isoform proportions per gene. The same
procedure will also be applied to all the inferential replicates. Note
that after `isoformProportions` the transcripts from single isoform
genes will be removed, and the transcripts will be re-ordered by gene
(alphabetically by gene).

Following this function, running `swish` will be equivalent to a test
of differential transcript usage, taking account of the uncertainty in
transcript abundances, as it will look for transcripts where the
isoform proportions change across condition.

```{r}
# run on the transcript-level dataset
iso <- isoformProportions(y)
iso <- swish(iso, x="condition", pair="line", nperms=64)
```

```{r eval=FALSE, echo=FALSE}
# some unevaluated code for looking into DTE within non-DGE gene
# (DTE vs DGE plot)
fisherP <- function(p) {
  pchisq(-2 * sum(log(p)), 2*length(p), lower.tail=FALSE)
}
stopifnot(all(lengths(mcols(y)$gene_id) == 1))
dat <- as.data.frame(mcols(y)[,c("gene_id","pvalue")])
dat$gene_id <- unlist(dat$gene_id)
pvals <- tapply(dat$pvalue, dat$gene_id, fisherP)
dte <- data.frame(gene_id=names(pvals), pvalue=pvals)
dte <- dte[rownames(gy),]
plot(-log10(mcols(gy)$pvalue), -log10(dte$pvalue))
#identify(-log10(mcols(gy)$pvalue), -log10(dte$pvalue))
idx <- 193
idx2 <- which(unlist(mcols(y)$gene_id) == rownames(gy)[idx])
plotInfReps(gy, idx, x="condition", cov="line", xaxis=FALSE)
par(mfrow=c(1,3))
for (i in 1:3) {
  plotInfReps(y, idx2[i], x="condition", cov="line", xaxis=FALSE)
}
```

# Interaction designs

We also provide in `swish` methods for testing if a condition effect
varies *across a secondary covariate*, using matched samples for
condition, or un-matched samples, which we refer to as "interactions"
in the software.

If matched samples are available, we compute the log2 fold change for
each pair of samples across condition in the same covariate group, and
then we use a Wilcoxon rank sum statistic for comparing the log2 fold
changes across the secondary covariate. For permutation significance,
the secondary covariate labels of the pairs are permuted. For
unmatched samples, multiple random "pseudo-pairs" of samples across
condition within the two covariate groups are chosen, and the
statistic computed as above, averaging over the random
pseudo-pairings. The motivation for the above permutation schemes is
to ensure the following condition, that "under the null hypothesis,
the likelihood of the data is invariant under these permutations"
[@anderson], where our null hypothesis specifically involves the
interaction between condition and the secondary covariate.

For the macrophage dataset we have been working with [@alasoo], we
have a 2x2 experimental design, with IFN gamma stimulation,
*Salmonella* infection, and both treatments, as well as control
samples. We have these four conditions across 6 cell lines from 6
donors (a subset of all the RNA-seq samples available). So we can use
the first method described above, where the cell line is used to match
samples across condition. Our implementation does not make use of the
pairing information across the secondary covariate, but we will still
be well powered to detect differences in the log2 fold change.

## Condition and secondary covariates

We begin the interaction analysis by re-loading the
*SummarizedExperiment* with all the samples, and defining two new
factors indicating IFNg status and *Salmonella* status:

```{r}
se$ifng <- factor(ifelse(
  grepl("IFNg",se$condition),
  "treated","control"))
se$salmonella <- factor(ifelse(
  grepl("SL1344",se$condition),
  "infected","control"))
with(colData(se),
     table(ifng, salmonella)
     )
```

We will work with the chromosome 1 transcripts for demonstration:

```{r}
y2 <- se
y2 <- y2[seqnames(y2) == "chr1",]
```

## Create and check paired samples

Our implementation of the interaction design for matched samples takes
into account matched samples within the `x` condition, which we will
specify to be the *Salmonella* infection status. We will specify the
secondary covariate `cov` to be the IFN gamma treatment. We will look
for transcripts where the infection response changes based on IFN
gamma treatment.

We actually have matched samples across both IFN gamma treatment and
*Salmonella* infection, but the extra pairing is not used by our
current implementation of interactions (it is common that there would
not be pairing across the secondary covariate).

To perform the analysis, we create a new variable `pair` which will
record which samples are related within a group based on IFN gamma
treatment status.

```{r}
y2$pair <- as.numeric(factor(y2$line))
y2$pair[y2$ifng == "control"]
y2$pair[y2$ifng == "treated"]
y2$pair[y2$ifng == "treated"] <- rep(7:12,each=2)
y2$pair <- factor(y2$pair)
table(y2$pair, y2$salmonella)
```

## Swish for interaction effects

We now perform `swish` analysis, specifying the *Salmonella* infection
as our main condition, the IFN gamma treatment as the secondary
covariate, and providing the pairing within IFN gamma treatment
groups. We specify `interaction=TRUE` to test for differences in
infection response across IFN gamma treatment group.

```{r results="hide", message=FALSE}
y2 <- scaleInfReps(y2)
y2 <- labelKeep(y2)
y2 <- y2[mcols(y2)$keep,]
set.seed(1)
y2 <- swish(y2, x="salmonella", cov="ifng", pair="pair", interaction=TRUE)
```

## Plotting interaction results

In this case, we appear to have fewer non-null p-values from first
impression of the p-value histogram:

```{r}
hist(mcols(y2)$pvalue, col="grey")
```

The MA plot shows significant transcripts on either side of
`log2FC=0`. Note that the log2 fold change reported is the
*difference* between the log2 fold change in the IFN gamma treated and
IFN gamma control group. So positive `log2FC` in this plot indicates
that the effect is higher with IGN gamma treatment than in absence of
the treatment.

```{r}
plotMASwish(y2, alpha=.05)
```

We can plot some of the transcripts with high log2 fold change
*difference* across IFN gamma treatment group, and which belong to the
less than 5% nominal FDR group:

```{r}
idx <- with(mcols(y2), which(qvalue < .05 & log2FC > 5))
plotInfReps(y2, idx[1], x="ifng", cov="salmonella")
plotInfReps(y2, idx[2], x="ifng", cov="salmonella")
```

# Further details

## Analysis types supported by Swish

There are currently five types of analysis supported by `swish`:

* Two group analysis
* Two groups with two or more batches
* Two group paired or matched samples
* Two condition x two group paired samples, interaction test
* Two condition x two group samples, not paired, interaction test

This vignette demonstrated the third in this list, but the others
can be run by either not specifying any additional covariates, or by
specifying a batch variable with the argument `cov` instead of `pair`.
The two interaction tests can be run by specifying `interaction=TRUE`
and providing `x`, `cov`, and optionally `pair`.

## Structure of `tximeta` output / `swish` input

While `tximeta` is the safest way to provide the correct input to
`swish`, all that `swish` requires for running is a
*SummarizedExperiment* object with the following assays: `counts`,
`length`, and `infRep1`, `infRep2`, ..., `infRepN`, where `N` is
simply the number of Gibbs samples or boostraps samples, e.g. 20 in
the examples above. The counts and inferential replicates are
estimated counts from a quantification method, either at the
transcript level or summed to the gene level (simple sum). These
counts sum up to the (mapped) library size for each sample. It is
assumed that the `length` matrix gives the effective lengths for each
transcript, or average transcript length for each gene as summarized
by the functions in `tximeta`/`tximport`. If the counts should not be
corrected for effective length (e.g. 3' tagged RNA-seq), then
`lengthCorrect=FALSE` should be specified when running
`scaleInfReps`. 

Note on simulation: it is difficult to simulate inferential uncertainty
in a realistic manner without construction of reads from transcripts,
using a method like *polyester*. Constructing reads from the reference
transcriptome or a sample-specific transcriptome naturally produces
the structure of read-assignment inferential uncertainty that `swish`
and other methods control for in real RNA-seq data.

## Plotting q-values over statistics

As with *SAMseq* and *SAM*, `swish` makes use of the permutation
plug-in approach for q-value calculation. `swish` calls the `empPvals`
and `qvalue` functions from the *qvalue* package to calculate the
q-values (or optionally similar functions from the *samr* package).
If we plot the q-values against the statistic, or against the log2
fold change, one can see clusters of genes with the same q-value
(because they have the same or similar statistic). One consequence of
this is that, in order to rank the genes, rather than ranking directly
by q-value, it makes more sense to pick a q-value threshold and then
within that set of genes, to rank by the log2 fold change, as shown
above when the code chunk has `log2FC * sig`.

```{r}
gres <- mcols(gy)[mcols(gy)$keep,]
min(gres$qvalue, na.rm=TRUE) # min nominal FDR is not 0
with(gres, plot(stat, -log10(qvalue)))
with(gres, plot(log2FC, -log10(qvalue)))
abline(v=0, col="red")
with(gres, plot(log2FC, -log10(qvalue),
                xlim=c(-1.5,1.5), ylim=c(0,1.5)))
abline(v=0, col="red")
```

## Plotting InfRV

In the Swish paper, we describe a statistic, InfRV, which is useful
for categorizing groups of features by their inferential uncertainty.
Note that InfRV is not used in the `swish` method, but only for
visualization in the paper. Here we show how to compute and plot the
InfRV:

```{r}
y3 <- se
y3 <- y3[seqnames(y3) == "chr1",]
y3 <- y3[,y3$condition %in% c("naive","IFNg")]
y3 <- labelKeep(y3)
y3 <- y3[mcols(y3)$keep,]
y3 <- computeInfRV(y3)
mcols(y3)$meanCts <- rowMeans(assays(y3)[["counts"]])
with(mcols(y3), plot(meanCts, meanInfRV, log="xy"))
hist(log10(mcols(y3)$meanInfRV),
     col="grey50", border="white", breaks=20,
     xlab="mean InfRV", main="Txp-level inferential uncertainty")
```

## *alevin* inferential replicates

The *alevin* [@alevin] and *tximport* / *tximeta* maintainers have
created an efficient format for storing and importing the sparse 
scRNA-seq estimated gene counts, and optionally inferential variance
and inferential replicate counts. `tximeta` will automatically 
import these matrices if *alevin* was run using
`--numCellBootstraps` (in order to generate inferential variance) and
additionally `--dumpFeatures` (in order to dump the inferential
replicates). The storage format for counts, and for inferential
replicates, involves writing one cell at a time, storing the locations
of the non-zero counts, and then the non-zero counts. The matrices are
imported sparely using the *Matrix* package. The storage format is
efficient, for example, the estimated counts for the 900 mouse neuron
dataset from 10x Genomics takes up 4.2 Mb, the variance matrix
takes up 8.6 Mb, and the inferential replicates takes up 72 Mb (20
bootstrap inferential replicates).

`swish` can be run on *alevin* counts imported with `tximeta`, but
there are a few extra steps required. First, we recommend to filter
genes as the first step, to reduce the size of the data before losing
sparsity on the count matrices (conversion of data to ranks loses data
sparsity inside the `swish()` function). One can run `labelKeep`
therefore before `scaleInfReps`. E.g., to remove genes for which there
are not 10 cells with a count of 3 or more:

```{r eval=FALSE}
y <- labelKeep(y, minCount=3, minN=10)
y <- y[mcols(y)$keep,]
```

One can also subset to cells of interest in order to take up the least
amount of memory when the sparse matrices in the
*SummarizedExperiment* are converted to dense matrices.

After one has filtered both genes and cells down to the set that are
of interest for differential expression, one can run the following
commands, to (1) make the sparse matrices into dense ones, (2) scale
the cells, and (3) perform *Swish* differential expression. 

```{r eval=FALSE}
assays(y) <- lapply(assays(y), as.matrix)
y <- scaleInfReps(y, lengthCorrect=FALSE)
y <- swish(y, x="condition")
```

Note that `scaleInfReps` has an argument `sfFun` which allows the user
to provide their own size factor calculation function. One could use
`computeSumFactors` in the *scran* package for example. 

## Permutation schemes for interactions

The following diagrams describe the permutation schemes used for the
interaction designs implemented in `swish`. The case with matched
samples (pair indicated by number, primary condition indicated by
color, the vertical line separating the pairs by secondary covariate):

```{r echo=FALSE}
n <- 8
condition <- rep(1:2,length=2*n)
group <- rep(1:2,each=n)
pair <- rep(c(1:n),each=2)
cols <- c("dodgerblue","goldenrod4")
plot(1:(2*n), rep(0,2*n), ylim=c(-.5,3.5),
     type="n", xaxt="n", yaxt="n",
     xlab="samples", ylab="permutation")
abline(v=8.5, lty=2)
axis(2, 0:3, c("orig",1:3), las=2)
text(1:(2*n), rep(0,2*n), pair, col=cols[condition], cex=2)
set.seed(1)
for (i in 1:3) {
  perms <- rep(2*sample(n,n),each=2) - rep(1:0,length=2*n)
  text(1:(2*n), rep(i,2*n), pair[perms], col=cols[condition[perms]], cex=2)
}
```

The case without matched samples (sample indicated by letter, primary
condition indicated by color, the vertical line separating the samples
by secondary covariate). Here multiple random pseudo-pairs are chosen
across condition. The permutation scheme ensures that LFCs are always 
calculated between samples from the same covariate group.

```{r echo=FALSE}
n <- 8
condition <- rep(c(1:2,1:2),each=n/2)
group <- rep(1:2,each=n)
id <- LETTERS[1:(2*n)]
cols <- c("dodgerblue","goldenrod4")
plot(1:(2*n), rep(0,2*n), ylim=c(-.5,3.5),
     type="n", xaxt="n", yaxt="n",
     xlab="samples", ylab="permutation")
abline(v=8.5, lty=2)
axis(2, 0:3, c("orig",1:3), las=2)
text(1:(2*n), rep(0,2*n), id, col=cols[condition], cex=2)
set.seed(3)
for (i in 1:3) {
  id.perms <- character(2*n)
  grp1 <- id[group==1]
  grp2 <- id[group==2]
  id.perms[c(1:4,9:12)] <- sample(id[condition==1],n)
  idx1 <- id.perms[c(1:4,9:12)] %in% grp1
  id.perms[c(5:8,13:16)][idx1] <- sample(id[condition==2 & group==1],sum(idx1))
  idx2 <- id.perms[c(1:4,9:12)] %in% grp2
  id.perms[c(5:8,13:16)][idx2] <- sample(id[condition==2 & group==2],sum(idx2))
  text(1:(2*n), rep(i,2*n), id.perms, col=cols[condition], cex=2)
}
arrows(3,1.5,1.3,1.15,,length=.1)
arrows(3,1.5,4.7,1.15,length=.1)
```

## Session information

```{r}
sessionInfo()
```

# References