%\VignetteIndexEntry{EnrichmentBrowser Manual}
%\VignetteDepends{EnrichmentBrowser}
%\VignettePackage{EnrichmentBrowser}
%\VignetteEngine{knitr::knitr}

\documentclass[a4paper]{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
knitr::opts_chunk$set(message = FALSE)
@

\usepackage{graphicx, subfig}
\usepackage[utf8]{inputenc} 

\bioctitle[EnrichmentBrowser]{Seamless navigation through combined
        results of set- \& network-based enrichment analysis}
\author{Ludwig Geistlinger} 
\affil{School of Public Health, City University of New York}

\begin{document}

\maketitle

\begin{abstract}
The \Biocpkg{EnrichmentBrowser} package implements an analysis pipeline
for high-throughput gene expression data as measured with microarrays and 
RNA-seq. In a workflow-like manner, the package brings together a selection of
established Bioconductor packages for gene expression data analysis. It
integrates a wide range of gene set and network enrichment analysis methods 
and facilitates combination and exploration of results across methods.
\end{abstract}

\packageVersion{\Sexpr{BiocStyle::pkg_ver("EnrichmentBrowser")}}

Report issues on \url{https://github.com/lgeistlinger/EnrichmentBrowser/issues}

\newpage

\tableofcontents

\newpage

\section{Introduction}
The \Biocpkg{EnrichmentBrowser} package implements essential functionality for
the enrichment analysis of gene expression data.
The analysis combines the advantages of set-based and network-based enrichment 
analysis to derive high-confidence gene sets and biological pathways 
that are differentially regulated in the expression data under investigation.
Besides, the package facilitates the visualization and exploration of such sets 
and pathways.\\
The following instructions will guide you through an end-to-end expression data
analysis workflow including:

\begin{enumerate}
\item Preparing the data
\item Preprocessing of the data
\item Differential expression (DE) analysis
\item Defining gene sets of interest
\item Executing individual enrichment methods
\item Combining the results of different methods
\item Visualize and explore the results
\end{enumerate}

All of these steps are modular, i.e.~each step can be executed individually and 
fine-tuned with several parameters. In case you are interested in a 
particular step, you can directly move on to the respective section.
For example, if you have differential expression already calculated for each gene, 
and your are now interested whether certain gene functions are enriched for differential expression, 
section \emph{Set-based enrichment analysis} would be the one you should go for. 
The last section \emph{Putting it all together} also demonstrates how to wrap 
the whole workflow into a single function, making use of suitably chosen 
defaults.

<<setup, echo = FALSE>>=
library(EnrichmentBrowser)
library(ALL)
library(airway)
@

\section{Reading expression data from file}
Typically, the expression data is not already available in \R{} but 
rather has to be read in from file. 
This can be done using the function \Rfunction{readSE}, which reads 
the expression data (\Rfunction{exprs}) along with the phenotype data 
(\Rfunction{colData}) and feature data (\Rfunction{rowData}) into a
\Biocpkg{SummarizedExperiment}. 
<<readSE>>=
library(EnrichmentBrowser)
data.dir <- system.file("extdata", package = "EnrichmentBrowser")
exprs.file <- file.path(data.dir, "exprs.tab")
cdat.file <- file.path(data.dir, "colData.tab")
rdat.file <- file.path(data.dir, "rowData.tab")
se <- readSE(exprs.file, cdat.file, rdat.file)
@

The man pages provide details on file format and the
\Rclass{SummarizedExperiment} data structure.

<<help, eval=FALSE>>=
?readSE
?SummarizedExperiment
@

\emph{Note:} Previous versions of the \Biocpkg{EnrichmentBrowser} used the
\Rclass{ExpressionSet} data structure.
The migration to \Rclass{SummarizedExperiment} in the current release of the
\Biocpkg{EnrichmentBrowser} is done to reflect recent developments in 
\Bioconductor{}, which discourage use of \Rclass{ExpressionSet} in favor of 
\Rclass{SummarizedExperiment}.
Major reasons are the compatibility of \Rclass{SummarizedExperiment} with 
operations on genomic regions as well as efficient dealing with big data.

To enable a smooth transition, all functions of the \Biocpkg{EnrichmentBrowser} 
are still accepting also an \Rclass{ExpressionSet} as input, but are 
consistently returning a \Rclass{SummarizedExperiment} as output.

Furthermore, users can always coerce from \Rclass{SummarizedExperiment} to 
 \Rclass{ExpressionSet} via

<<sexp2eset>>=
eset <- as(se, "ExpressionSet")
@

and vice versa

<<eset2sexp>>=
se <- as(eset, "SummarizedExperiment")
@

\section{Types of expression data}

The two major data types processed by the \Biocpkg{EnrichmentBrowser} are 
microarray (intensity measurements) and RNA-seq (read counts) data.

Although RNA-seq has become the \emph{de facto} standard for 
transcriptomic profiling, it is important to know that many methods for 
differential expression and gene set enrichment analysis have been 
originally developed for microarray data.

However, differences in data distribution assumptions (microarray: quasi-normal,
RNA-seq: negative binomial) made adaptations in differential expression analysis
and, to some extent, also in gene set enrichment analysis necessary.

Thus, we consider two example datasets -- a microarray and a RNA-seq dataset, 
and discuss similarities and differences of the respective analysis steps.

\subsection{Microarray data}

To demonstrate the functionality of the package for microarray data, we 
consider expression measurements of patients with acute lymphoblastic
 leukemia \cite{Chiaretti04}. A frequent chromosomal defect found among these 
patients is a translocation, in which parts of chromosome 9 and 22 swap places. 
This results in the oncogenic fusion gene BCR/ABL created by positioning the 
ABL1 gene on chromosome 9 to a part of the BCR gene on chromosome 22.

We load the \Biocpkg{ALL} dataset
<<load-ALL>>=
library(ALL)
data(ALL)
@

and select B-cell ALL patients with and without the BCR/ABL fusion as 
described previously \cite{Scholtens05}.

<<subset-ALL>>=
ind.bs <- grep("^B", ALL$BT)
ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
sset <- intersect(ind.bs, ind.mut)
all.eset <- ALL[, sset]
@

We can now access the expression values, which are intensity measurements 
on a log-scale for 12,625 probes (rows) across 79 patients (columns).

<<show-ALL>>=
dim(all.eset)
exprs(all.eset)[1:4,1:4]
@

As we often have more than one probe per gene, we summarize gene expression values
as the average of the corresponding probe values.

<<probe2gene>>=
allSE <- probe2gene(all.eset)
head(rownames(allSE))
@

Note, that the mapping from probe to gene is done automatically as long as 
as you have the corresponding annotation package, here the 
\Biocpkg{hgu95av2.db} package, installed. Otherwise, the mapping can be 
manually defined in the \Rfunction{rowData} slot.

<<show-probe2gene>>=
rowData(se)
@

\subsection{RNA-seq data}

To demonstrate the functionality of the package for RNA-seq data, we 
consider transcriptome profiles of four primary human airway smooth muscle cell 
lines in two conditions: control and treatment with dexamethasone \cite{Himes14}.

We load the \Biocpkg{airway} dataset

<<load-airway>>=
library(airway)
data(airway)
@

For further analysis, we remove genes with very low read counts and 
measurements that are not mapped to an ENSEMBL gene ID.

<<preproc-airway>>=
airSE <- airway[grep("^ENSG", rownames(airway)),]
airSE <- airSE[rowSums(assay(airSE)) > 4,]
dim(airSE)
assay(airSE)[1:4,1:4]
@

\section{Normalization}

Normalization of high-throughput expression data is essential to make results 
within and between experiments comparable.
Microarray (intensity measurements) and RNA-seq (read counts) data typically 
show distinct features that need to be normalized for.
The function \Rfunction{normalize} wraps commonly used functionality from 
\Biocpkg{limma} for microarray normalization and from \Biocpkg{EDASeq} for 
RNA-seq normalization.
For specific needs that deviate from these standard normalizations, the user
should always refer to more specific functions/packages.

Microarray data is expected to be single-channel. For two-color arrays, it is 
expected that normalization within arrays has been already carried out, 
e.g.~using \Rfunction{normalizeWithinArrays} from \Biocpkg{limma}.

A default quantile normalization based on
\Rfunction{normalizeBetweenArrays} from \Biocpkg{limma} can be carried out via

<<norm-ma>>=
allSE <- normalize(allSE, norm.method = "quantile")
@

\begin{center}
<<plot-norm, fig.width=12, fig.height=6>>=
par(mfrow=c(1,2))
boxplot(assay(allSE, "raw"))
boxplot(assay(allSE, "norm"))
@
\end{center}

Note that this is only done for demonstration, as the ALL data has 
been already RMA-normalized by the authors of the ALL dataset.
 
RNA-seq data is expected to be raw read counts. 
Note that normalization for downstream DE analysis, e.g.~with 
\Biocpkg{edgeR} and \Biocpkg{DESeq2}, is not ultimately necessary 
(and in some cases even discouraged) as many of these tools implement specific 
normalization approaches themselves.
See the vignette of \Biocpkg{EDASeq}, \Biocpkg{edgeR}, and \Biocpkg{DESeq2} for
details.

In case normalization is desired, between-lane normalization to adjust for 
sequencing depth can be carried out as demonstrated for microarray data.

<<norm-rseq>>=
airSE <- normalize(airSE, norm.method = "quantile")
@

Within-lane normalization to adjust for gene-specific effects such as gene 
length and GC-content requires to retrieve this information first, 
e.g.~from \software{BioMart} or specific \Bioconductor annotation packages.
Both modes are implemented in the \Biocpkg{EDASeq} function 
\Rfunction{getGeneLengthAndGCContent}.


\section{Differential expression}

The \Biocpkg{EnrichmentBrowser} incorporates established functionality from the
\Biocpkg{limma} package for differential expression analysis between sample 
groups.
This involves the \Rfunction{voom}-transformation when applied to RNA-seq data.
Alternatively, differential expression analysis for RNA-seq data can also be
carried out based on the negative binomial distribution with \Biocpkg{edgeR}
and \Biocpkg{DESeq2}.

This can be performed using the function \Rfunction{deAna}
and assumes some standardized variable names:

\begin{itemize}
\item \textbf{GROUP} defines the sample groups being contrasted,
\item \textbf{BLOCK} defines paired samples or sample blocks, as e.g.~for batch
effects.
\end{itemize}

For more information on experimental design, see the
\href{http://bioconductor.org/packages/limma}{limma user's guide},
chapter 9.

For the ALL dataset, the \textbf{GROUP} variable indicates whether the BCR-ABL 
gene fusion is present (1) or not (0). 

<<sample-groups-ALL>>=
allSE$GROUP <- ifelse(allSE$mol.biol == "BCR/ABL", 1, 0)
table(allSE$GROUP)
@

For the airway dataset, it indicates whether the cell lines have been treated 
with dexamethasone (1) or not (0).

<<sample-groups-airway>>=
airSE$GROUP <- ifelse(airway$dex == "trt", 1, 0)
table(airSE$GROUP)
@

Paired samples, or in general sample batches/blocks, can be defined via a 
\textbf{BLOCK} column in the \Rfunction{colData} slot.
For the airway dataset, the sample blocks correspond to the four different cell 
lines.

<<sample-blocks>>=
airSE$BLOCK <- airway$cell
table(airSE$BLOCK)
@

For microarray expression data, the \Rfunction{deAna} function carries 
out a differential expression analysis between the two groups based on 
functionality from the \Biocpkg{limma} package.
Resulting fold changes and $t$-test derived $p$-values for each gene are 
appended to the \Robject{rowData} slot.

<<DE-ana-ALL>>=
allSE <- deAna(allSE, padj.method = "BH")
rowData(allSE)
@
 
Nominal $p$-values (\Robject{PVAL}) are corrected for multiple testing 
(\Robject{ADJ.PVAL}) using the method from Benjamini and Hochberg implemented 
in the function \Rfunction{p.adjust} from the \CRANpkg{stats} package.

To get a first overview, we inspect the $p$-value distribution and the volcano 
plot (fold change against $p$-value).
 
\begin{center}
<<plot-DE, fig.width=12, fig.height=6>>=
par(mfrow = c(1,2))
pdistr(rowData(allSE)$PVAL)
volcano(rowData(allSE)$FC, rowData(allSE)$ADJ.PVAL)
@
\end{center}

The expression change of highest statistical significance is observed for the 
ENTREZ gene \verb=7525=. 

<<DE-exmpl>>=
ind.min <- which.min(rowData(allSE)$ADJ.PVAL)
rowData(allSE)[ind.min,]
@

This turns out to be the YES proto-oncogene 1 
(\href{http://www.genome.jp/dbget-bin/www_bget?hsa:7525}{hsa:7525@KEGG}).
  
For RNA-seq data, the \Rfunction{deAna} function carries out a differential 
expression analysis between the two groups either based on functionality from
\Biocpkg{limma} (that includes the \Rfunction{voom} transformation), or 
alternatively, the popular \Biocpkg{edgeR} or \Biocpkg{DESeq2} package. 

Here, we use the analysis based on \Biocpkg{edgeR} for demonstration.

<<DE-ana-airway>>=
airSE <- deAna(airSE, de.method = "edgeR")
rowData(airSE)
@

\section{ID mapping}\label{sec:idmap}

Using genomic information from different resources often requires mapping 
between different types of gene identifiers.
Although primary analysis steps such as normalization and differential expression 
analysis can be carried out independent of the gene ID type, downstream exploration 
functionality of the \Biocpkg{EnrichmentBrowser} is consistently based on NCBI 
Entrez Gene IDs.
It is thus, in this regard, beneficial to initially map gene IDs of a different
type to NCBI Entrez IDs.

The function \Rfunction{idTypes} lists the available ID types for the mapping 
depending on the organism under investigation.

<<idmap-idtypes>>=
idTypes("hsa")
@

ID mapping for the airway dataset (from ENSEMBL to ENTREZ gene ids) can then be
carried out using the function \Rfunction{idMap}.

<<idmap-airway>>=
head(rownames(airSE))
airSE <- idMap(airSE, org = "hsa", from = "ENSEMBL", to = "ENTREZID")
head(rownames(airSE))
@

Now, we subject the ALL and the airway gene expression data to the enrichment 
analysis. 

\newpage

\section{Enrichment analysis}

In the following, we introduce how the \Biocpkg{EnrichmentBrowser} package 
can be used to perform state-of-the-art enrichment analysis of gene sets. 
We consider the ALL and the airway gene expression data as processed in the 
previous sections. We are now interested in whether pre-defined sets of genes 
that are known to work together, e.g. as defined in the Gene Ontology (GO)
or the KEGG pathway annotation, are coordinately differentially expressed.


\subsection{Obtaining gene sets}

The function \Rfunction{getGenesets} can be used to download gene sets from
databases such as GO and KEGG. 
We can use the function to download all KEGG pathways for a chosen organism 
(here: \emph{Homo sapiens}) as gene sets.

<<get-kegg-gs, eval=FALSE>>=
kegg.gs <- getGenesets(org = "hsa", db = "kegg")
@

Analogously, the function \Rfunction{getGenesets} can be used to retrieve GO 
terms of a selected ontology (here: biological process, BP) as defined in the
\Biocpkg{GO.db} annotation package.

<<get-go-gs, eval=FALSE>>=
go.gs <- getGenesets(org = "hsa", db = "go", onto = "BP", mode = "GO.db")
@

If provided a file, the function parses user-defined gene sets from GMT file format.
Here, we use this functionality for reading a list of already downloaded
KEGG gene sets for \emph{Homo sapiens} containing NCBI Entrez Gene IDs.
 
<<parseGMT>>=
gmt.file <- file.path(data.dir, "hsa_kegg_gs.gmt")
hsa.gs <- getGenesets(gmt.file)
length(hsa.gs)
hsa.gs[1:2]
@

Note \#1: Use \Rfunction{getGenesets} with \Rcode{db = "msigdb"} to obtain gene 
set collections for 11 different species from the 
\href{http://software.broadinstitute.org/gsea/msigdb/collections.jsp}{Molecular Signatures Database (MSigDB)}.
Analogously, \Rfunction{getGenesets} with \Rcode{db = "enrichr"} allows to obtain
gene set libraries from the comprehensive \href{https://amp.pharm.mssm.edu/Enrichr/#stats}{Enrichr collection} for 5 different species.

Note \#2: The \Rfunction{idMap} function can be used to map gene sets from 
NCBI Entrez Gene IDs to other common gene ID types such as ENSEMBL gene IDs or 
HGNC symbols as described in Section~\ref{sec:idmap}. 

\subsection{Set-based enrichment analysis}

Currently, the following set-based enrichment analysis methods are supported
<<sbeaMethods>>=
sbeaMethods()
@

\begin{itemize}
\item ORA: Overrepresentation Analysis (simple and frequently used test based 
        on the hypergeometric distribution, see \cite{Goeman07} for a critical 
        review),
\item SAFE: Significance Analysis of Function and Expression 
       (resampling version of ORA, implements additional test statistics, 
        e.g.~Wilcoxon's rank sum, and allows to estimate the significance of 
        gene sets by sample permutation; implemented in the \Biocpkg{safe} 
        package), 
\item GSEA: Gene Set Enrichment Analysis (frequently used and widely accepted, 
        uses a Kolmogorov–Smirnov statistic to test whether the ranks of the 
        $p$-values of genes in a gene set resemble a uniform distribution 
        \cite{Subramanian05}),
\item PADOG: Pathway Analysis with Down-weighting of Overlapping Genes 
	(incorporates gene weights to favor genes appearing in few pathways versus 
    genes that appear in many pathways; implemented in the \Biocpkg{PADOG} 
    package),
\item ROAST: ROtAtion gene Set Test 
    (uses rotation instead of permutation for assessment of gene set significance;
	implemented in the \Biocpkg{limma} and \Biocpkg{edgeR} packages for 
    microarray and RNA-seq data, respectively),
\item CAMERA: Correlation Adjusted MEan RAnk gene set test
	(accounts for inter-gene correlations as implemented in the \Biocpkg{limma} 
	and \Biocpkg{edgeR} packages for microarray and RNA-seq data, respectively),
\item GSA: Gene Set Analysis
	(differs from GSEA by using the maxmean statistic, i.e.~the mean of the 
    positive or negative part of gene scores in the gene set; 
    implemented in the \CRANpkg{GSA} package),
\item GSVA: Gene Set Variation Analysis
	(transforms the data from a gene by sample matrix to a gene set by sample 
    matrix, thereby allowing the evaluation of gene set enrichment for each 
    sample; implemented in the \Biocpkg{GSVA} package)
\item GLOBALTEST: Global testing of groups of genes
	(general test of groups of genes for association with a response variable;
	implemented in the \Biocpkg{globaltest} package),
\item SAMGS: Significance Analysis of Microarrays on Gene Sets 
        (extending the SAM method for single genes to gene set analysis 
        \cite{Dinu07}),
\item EBM: Empirical Brown's Method (combines $p$-values of genes in a 
gene set using Brown's method to combine $p$-values from dependent tests;
implemented in \Biocpkg{EmpiricalBrownsMethod}),
\item MGSA: Model-based Gene Set Analysis
	(Bayesian modeling approach taking set overlap into account by working on 
    all sets simultaneously, thereby reducing the number of redundant sets; 
    implemented in \Biocpkg{mgsa}).
\end{itemize}

See also Appendix~\ref{sec:primer} for a comprehensive introduction on underlying
statistical concepts.

\subsection{Guidelines for input and method selection}

We recently performed a comprehensive assessment of the availabe set-based enrichment 
methods, and identified significant differences in runtime and applicability to RNA-seq data, 
fraction of enriched gene sets depending on the null hypothesis tested, and detection 
of relevant processes~\cite{Geistlinger20}. 
Based on these results, we make practical recommendations on how methods originally 
developed for microarray 
data can efficiently be applied to RNA-seq data, how to interpret results depending 
on the type of gene set test conducted and which methods are best suited to effectively 
prioritize gene sets with high phenotype relevance:

\begin{itemize}

\item for the exploratory analysis of \textbf{simple gene lists}, we recommend ORA given its 
ease of applicability, fast runtime and evident relevance of resulting gene set 
rankings, provided that input gene list and reference gene list are chosen carefully 
and remembering ORA’s propensity for type I error rate inflation when genes tend
to be co-expressed within sets.

\item for the analysis of \textbf{pre-ranked gene lists} accompanied by gene scores such as 
fold changes, alternatives to ORA such as pre-ranked GSEA or pre-ranked CAMERA
exist.

\item for expression-based enrichment analysis on the \textbf{full expression matrix}, 
we recommend providing
normalized log2 intensities for microarray data, and logTPMs (or logRPKMs / logFPKMs)
for RNA-seq data. 
When given raw read counts, we recommend to apply a variance-stabilizing transformation
such as \Rfunction{voom} to arrive at library-size normalized logCPMs. 

\item if the question of interest is to test for association of any gene in the set with
the phenotype (\textbf{self-contained} null hypothesis), we recommend ROAST or GSVA that 
both test a \textbf{directional} hypothesis (genes in the set tend to be either
predominantly up- or down-regulated). 
Both methods can be applied for simple or extended experimental designs, where 
ROAST is the more natural choice for the comparison of sample groups and also 
allows one to test a \textbf{mixed} hypothesis (genes in the set tend to be 
differentially expressed, regardless of the direction). 
The main strength of GSVA lies in its capabilities for analyzing single samples.

\item if the question of interest is to test for excess of differential expression in
a gene set relative to genes outside the set (\textbf{competitive} null hypothesis), which
we believe comes closest to the expectations and intuition of most end users when
performing GSEA, we recommend PADOG, which is slower to run but resolves major 
shortcomings of ORA, and has desirable properties for the analyzed criteria and 
when compared to other competitive methods. 
However, PADOG is limited to testing a mixed hypothesis in a comparison of two 
sample groups, optionally including paired samples or sample batches.
Therefore, we recommend the highly customizable SAFE for testing a directional
hypothesis or in situations of more complex experimental designs such as
comparisons between multiple groups, continuous phenotypes or the presence of
covariates.

\end{itemize}

See also Reference~\cite{Geistlinger20} for the results of the benchmarking study 
and the \Biocpkg{GSEABenchmarkeR} package for a general framework for reproducible 
benchmarking of gene set enrichment methods.


\subsubsection{Microarray data}

Given normalized log2 intensities for the ALL microarray dataset, a basic ORA can
be carried out via 

<<sbea>>=
sbea.res <- sbea(method = "ora", se = allSE, gs = hsa.gs, perm = 0, alpha = 0.1)
gsRanking(sbea.res)
@

Note that we set \Rcode{perm = 0} to invoke the classical hypergeometric test 
without sample permutation, and that we chose a significance level $\alpha$ of 0.1 
for demonstration purposes.

\subsubsection{RNA-seq data}

When analyzing RNA-seq datasets with expression values given as logTPMs 
(or logRPKMs / logFPKMs), the available set-based enrichment methods can be 
applied as for microarray data. 
However, when given raw read counts as for the airway dataset, we recommend to 
first apply a variance-stabilizing transformation such as \Rfunction{voom} to 
arrive at library-size normalized logCPMs. 

<<vst>>=
airSE <- normalize(airSE, norm.method = "vst")
@

The mean-variance relationship of the transformed data is similar to what is 
observed for microarray data, simplifying the application of legacy enrichment 
methods such as GSEA and PADOG to RNA-seq data, and enable the use of fast and 
established methods.

<<gsea-rseq, eval=FALSE>>=
air.res <- sbea(method = "gsea", se = airSE, gs = hsa.gs)
gsRanking(sbea.res)
@

\subsection{Result exploration}

The result of every enrichment analysis is a ranking of gene sets by the 
corresponding $p$-value. The \Rfunction{gsRanking} function displays by default
only those gene sets satisfying the chosen significance level $\alpha$, but we
can use

<<fullrank>>=
gsRanking(sbea.res, signif.only = FALSE)
@

to obtain the full ranking.

While such a ranked list is the standard output of existing enrichment tools, 
the \Biocpkg{EnrichmentBrowser} package provides visualization 
and interactive exploration of resulting gene sets far beyond that point.
Using the \Rfunction{eaBrowse} function creates a HTML summary from which each
gene set can be inspected in detail (this builds on functionality from the
\Biocpkg{ReportingTools} package). 

The various options are described in Figure \ref{fig:oraRes}. 

<<eaBrowse, eval=FALSE>>=
eaBrowse(sbea.res)
@

\begin{figure*}
\centering
\includegraphics[width=14cm, height=8.75cm]{ora_ebrowse.png}
\caption{ORA result view. For each significant gene set in the ranking, the user
    can select to view (1) a gene report, that lists all genes of a set along
    with fold change and DE $p$-value, (2) interactive overview plots
    such as heatmap, $p$-value distribution, and volcano plot, (3) the pathway 
    in KEGG with differentially expressed genes highlighted in red.}\label{fig:oraRes}
\end{figure*}

\subsection{Network-based enrichment analysis}

Having found sets of genes that are differentially regulated in the ALL data, 
we are now interested whether these findings can be supported by known 
regulatory interactions. 

For example, we want to know whether transcription factors and their target 
genes are expressed in accordance to the connecting regulations 
(activation/inhibition). 
Such information is usually given in a gene regulatory network derived from 
specific experiments or compiled from the literature (\cite{Geistlinger13} 
for an example). 

There are well-studied processes and organisms for which comprehensive and 
well-annotated regulatory networks are available, e.g.~the \verb=RegulonDB= for
\emph{E.~coli} and \verb=Yeastract= for \emph{S.~cerevisiae}. 
However, there are also cases where such a network is missing. 
A basic workaround is to compile a network from regulations in 
pathway databases such as KEGG.

<<compile-grn>>=
hsa.grn <- compileGRN(org="hsa", db="kegg")
head(hsa.grn)
@

Now, we are able to perform enrichment analysis using the compiled network.
Currently, the following network-based enrichment analysis methods are supported

<<nbeaMethods>>=
nbeaMethods()
@
 
\begin{itemize}
\item GGEA: Gene Graph Enrichment Analysis 
    (evaluates consistency of known regulatory interactions with the observed
    expression data \cite{Geistlinger11}),
\item SPIA: Signaling Pathway Impact Analysis 
    (combines ORA with the probability that expression changes
	are propagated across the pathway topology;
	implemented in the \Biocpkg{SPIA} package),
\item PathNet: Pathway Analysis using Network Information
    (applies ORA on combined evidence for the observed signal for gene nodes 
    and the signal implied by connected neighbors in the network; 
	implemented in the \Biocpkg{PathNet} package),
\item DEGraph: Differential expression testing for gene graphs
	(multivariate testing of differences in mean incorporating underlying 
	graph structure; implemented in the \Biocpkg{DEGraph} package),
\item TopologyGSA: Topology-based Gene Set Analysis
	 (uses Gaussian graphical models to incorporate the dependence structure 
	among genes as implied by pathway topology;
	implemented in the \CRANpkg{topologyGSA} package),
\item GANPA: Gene Association Network-based Pathway Analysis
	(incorporates network-derived gene weights in the enrichment analysis;
	implemented in the \CRANpkg{GANPA} package),
\item CePa: Centrality-based Pathway enrichment
	(incorporates network centralities as node weights mapped from 
	differentially expressed genes in pathways;
		implemented in the \CRANpkg{CePa} package),
\item NetGSA: Network-based Gene Set Analysis
	(incorporates external information about interactions among genes
	as well as novel interactions learned from data;
	implemented in the \CRANpkg{NetGSA} package),
\end{itemize}

For demonstration, we perform GGEA using the compiled KEGG regulatory network.

<<nbea>>=
nbea.res <- nbea(method="ggea", se=allSE, gs=hsa.gs, grn=hsa.grn)
gsRanking(nbea.res)
@

The resulting ranking lists, for each statistically significant gene set, the 
number of relations of the network involving a member of the gene set under study 
(\texttt{NR.RELS}), the sum of consistencies over the relations of the set (\texttt{RAW.SCORE}), 
the score normalized by induced network size 
(\texttt{NORM.SCORE} = \texttt{RAW.SCORE} / \texttt{NR.RELS}), 
and the statistical significance of each gene set based on a permutation 
approach.\\
A GGEA graph for a gene set depicts the consistency of each interaction in the set. 
Nodes (genes) are colored according to expression (up-/down-regulated) and 
edges (interactions) are colored according to consistency, i.e.~how well the 
interaction type (activation/inhibition) is reflected in the correlation of the 
observed expression of both interaction partners.

\begin{center}
<<ggea-graph, fig.width=12, fig.height=6>>=
par(mfrow=c(1,2))
ggeaGraph(
    gs=hsa.gs[["hsa05217_Basal_cell_carcinoma"]], 
    grn=hsa.grn, se=allSE)
ggeaGraphLegend()
@
\end{center}

\subsection{User-defined enrichment methods}

The goal of the \Biocpkg{EnrichmentBrowser} package is to provide
frequently used enrichment methods. However, it is also possible to exploit
its visualization capabilities with user-defined set-based enrichment methods.

This requires to implement a function that takes the characteristic arguments
\Robject{se} (expression data) and \Robject{gs} (gene sets). 

In addition, it must return a numeric vector \Robject{ps} storing the resulting
$p$-value for each gene set in \Robject{gs}. The $p$-value vector must also be
named accordingly (i.e.~\Rcode{names(ps) == names(gs)}).

Let us consider the following dummy enrichment method, which randomly renders 
five gene sets significant and the remaining insignificant.

<<dummySBEA>>=
dummySBEA <- function(se, gs)
{
        sig.ps <- sample(seq(0, 0.05, length = 1000), 5)
        insig.ps <- sample(seq(0.1, 1, length = 1000), length(gs) - 5)
        ps <- sample(c(sig.ps, insig.ps), length(gs))
        names(ps) <- names(gs)
        return(ps)
}
@

We can plug this method into \Rfunction{sbea} as before.

<<sbea2>>=
sbea.res2 <- sbea(method = dummySBEA, se = allSE, gs = hsa.gs)
gsRanking(sbea.res2)
@ 

As described in the previous section, it is also possible to analogously plug in
user-defined network-based enrichment methods into \Rfunction{nbea}.

\newpage

\section{Combining results}
Different enrichment analysis methods usually result in different gene set 
rankings for the same dataset.
To compare results and detect gene sets that are supported by different methods,
the \Biocpkg{EnrichmentBrowser} package allows to combine results from the 
different set-based and network-based enrichment analysis methods.
The combination of results yields a new ranking of the gene sets under 
investigation by specified ranking criteria, e.g.~the average rank across methods.
We consider the ORA result and the GGEA result from the previous sections and 
use the function \Rfunction{combResults}.

<<combine>>=
res.list <- list(sbea.res, nbea.res)
comb.res <- combResults(res.list)
@
 
The combined result can be detailedly inspected as before and interactively 
ranked as depicted in Figure \ref{fig:combRes}. 

<<browse-comb, eval=FALSE>>=
eaBrowse(comb.res, graph.view=hsa.grn, nr.show=5)
@

\begin{figure*}[!h]
\centering
\includegraphics[width=17cm, height=6.8cm]{comb_ebrowse.png}
\caption{Combined result view. By clicking on one of the columns 
    (ORA.RANK, ..., GGEA.PVAL) the result can be interactively ranked according
    to the selected criterion.}\label{fig:combRes}
\end{figure*}

\newpage

\section{Putting it all together}
There are cases where it is necessary to perform certain steps of the demonstrated
enrichment analysis pipeline individually.
However, it is often more convenient to run the complete standardized pipeline.
This can be done using the all-in-one wrapper function \Rfunction{ebrowser}.
For example, the result page displayed in Figure \ref{fig:combRes} can also be 
produced from scratch via 

<<all-in-one, eval=FALSE>>=
ebrowser(   meth=c("ora", "ggea"), 
        exprs=exprs.file, cdat=cdat.file, rdat=rdat.file, 
        org="hsa", gs=hsa.gs, grn=hsa.grn, comb=TRUE, nr.show=5)
@

\section{Advanced: configuration parameters}\label{sec:config}

Similar to \R's options settings, the \Biocpkg{EnrichmentBrowser} uses certain
package-wide configuration parameters, which affect the way in which analysis is
carried out and how results are displayed.
The settings of these parameters can be examined and, to some extent, also 
changed using the function \Rfunction{configEBrowser}.
For instance, the default directory where the \Biocpkg{EnrichmentBrowser} writes
results to can be updated via

<<config-set>>=
configEBrowser(key="OUTDIR.DEFAULT", value="/my/out/dir")
@

and examined via

<<config-get>>=
configEBrowser("OUTDIR.DEFAULT")
@

Note that changing these defaults should be done with care, as inappropriate 
settings might impair the package's functionality.
The complete list of incorporated configuration parameters along with their 
default settings can be inspected via

<<config-man, eval=FALSE>>=
?configEBrowser
@

\section{For non-R users: command line invocation}\label{sec:cmd}

The package source contains two scripts in \Rcode{inst/scripts} to invoke the 
EnrichmentBrowser from the command line using \software{Rscript}. 

The \Rcode{de\_rseq.R} script is a lightweight wrapper script to carry out 
differential expression analysis of RNA-seq data either based on \Biocpkg{limma} 
(using the \Rfunction{voom}-transformation), \Biocpkg{edgeR}, or \Biocpkg{DESeq2}.

The \Rcode{eBrowserCMD.R} implements the full functionality and allows to carry 
out the various enrichment methods and to produce HTML reports for interactive
exploration of results.

The \Rcode{inst/scripts} folder also contains a \Rcode{README} file that 
comprehensively documents the usage of both scripts.

\newpage

\appendix

\section{A primer on terminology and statistical theory}\label{sec:primer}

\subsection{Where does it all come from?}\label{sec:ora}

Test whether known biological functions or processes are over-represented
(= enriched) in an experimentally-derived gene list, e.g.~a list of
differentially expressed (DE) genes. See~\cite{Goeman07} for
a critical review.

Example: Transcriptomic study, in which 12,671 genes have been tested for
differential expression between two sample conditions and 529 genes were found
DE.

Among the DE genes, 28 are annotated to a specific functional gene set, which
contains in total 170 genes. This setup corresponds to a $2\times2$ contingency table,

<<deTbl>>=
deTable <-
     matrix(c(28, 142, 501, 12000),
            nrow = 2,
            dimnames = list(c("DE", "Not.DE"),
                            c("In.gene.set", "Not.in.gene.set")))
deTable
@

where the overlap of 28 genes can be assessed based on the hypergeometric 
distribution.                         
This corresponds to a one-sided version of Fisher's exact test, yielding here a
significant enrichment.

<<fisher>>= 
fisher.test(deTable, alternative = "greater")
@

This basic principle is at the foundation of major public and commercial 
enrichment tools such as \software{DAVID}
(\href{https://david.ncifcrf.gov}{https://david.ncifcrf.gov}) and
\software{Pathway Studio} (\href{https://www.pathwaystudio.com}{https://www.pathwaystudio.com}).

Although gene set enrichment methods have been primarily developed and applied
on transcriptomic data, they have recently been modified, extended and applied
also in other fields of genomic and biomedical research. 
This includes novel approaches for functional enrichment analysis of proteomic 
and metabolomic data as well as genomic regions and disease 
phenotypes~\cite{Lavallee16,Chagoyen16,McLean10,Ried12}.

\subsection{Gene sets, pathways \& regulatory networks}

\emph{Gene sets} are simple lists of usually functionally related genes without further
specification of relationships between genes.

\emph{Pathways} can be interpreted as specific gene sets, typically representing a
group of genes that
work together in a biological process. Pathways are commonly divided in
metabolic and signaling pathways.
Metabolic pathways such as glycolysis represent biochemical substrate conversions
by specific enzymes. Signaling pathways such as the MAPK signaling pathway 
describe signal transduction cascades from receptor proteins to transcription 
factors, resulting in activation or inhibition of specific target genes.

\emph{Gene regulatory networks} describe the interplay and effects of regulatory
factors (such as transcription factors and microRNAs) on the expression of their
target genes.

\subsection{Resources}

\software{GO} (\href{http://www.geneontology.org}{http://www.geneontology.org}) 
and 
\software{KEGG} (\href{http://www.genome.jp/kegg}{http://www.genome.jp/kegg})
annotations are most frequently used for the enrichment analysis of
functional gene sets. Despite an increasing number of gene set and pathway
databases, they are typically the first choice due to their long-standing
curation and availability for a wide range of species.

The Gene Ontology (GO) consists of three major sub-ontologies 
that classify gene products according to molecular function (MF), biological 
process (BP) and cellular component (CC). Each ontology consists of GO terms 
that define MFs, BPs or CCs to which specific genes are annotated. The terms 
are organized in a directed acyclic graph, where edges between the terms 
represent relationships of different types. They relate the terms according to 
a parent-child scheme, i.e.~parent terms denote more general entities, whereas 
child terms represent more specific entities.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of
manually drawn pathway maps representing molecular interaction and reaction
networks.
These pathways cover a wide range of biochemical processes that can be divided in
7 broad categories: metabolism, genetic and environmental information processing,
cellular processes, organismal systems, human diseases, and drug development.
Metabolism and drug development pathways differ from pathways of the other 5
categories by illustrating reactions between chemical compounds.
Pathways of the other 5 categories illustrate molecular interactions between
genes and gene products.

\subsection{Gene set analysis vs.~gene set enrichment analysis}

The two predominantly used enrichment methods are:
\begin{itemize}
\item Overrepresentation analysis (ORA), testing whether a gene set contains
disproportional many genes of significant expression change, based on the
procedure outlined in section~\ref{sec:ora},
\item Gene set enrichment analysis (GSEA), testing whether genes of a gene set
accumulate at the top or bottom of the full gene vector ordered by direction
and magnitude of expression change~\cite{Subramanian05}.
\end{itemize}

However, the term \emph{gene set enrichment analysis} nowadays subsumes a general
strategy implemented by a wide range of methods~\cite{Huang09}.
Those methods have in common the same goal, although approach and statistical
model can vary substantially~\cite{Goeman07,Khatri12}.

To better distinguish from the specific method, some authors use the term
\emph{gene set analysis} to denote the general strategy.
However, there is also a specific method of this name~\cite{Efron07}.

\subsection{Underlying null: competitive vs.~self-contained}

Goeman and Buehlmann, 2007, classified existing enrichment methods into 
\emph{competitive} and \emph{self-contained} based on the underlying null 
hypothesis~\cite{Goeman07}.

\begin{itemize}
\item \emph{Competitive} null hypothesis: the genes in the set of interest are 
at most as often DE as the genes not in the set,
\item \emph{Self-contained} null hypothesis: no genes in the set of interest are DE.
\end{itemize}

Although the authors argue that a self-contained null is closer to the actual
question of interest, the vast majority of enrichment methods is competitive.

Goeman and Buehlmann further raise several critical issues concerning the 
$2\times2$ ORA:

\begin{itemize}
\item rather arbitrary classification of genes in DE / not DE,
\item based on gene sampling, although sampling of subjects is appropriate,
\item unrealistic independence assumption between genes, resulting in highly
anti-conservative \emph{p}-values.
\end{itemize}

With regard to these statistical concerns, GSEA is considered superior:

\begin{itemize}
\item takes all measured genes into account,
\item subject sampling via permutation of class labels,
\item the incorporated permutation procedure implicitly accounts for correlations
between genes.
\end{itemize}

However, the simplicity and general applicability of ORA is unmet by subsequent
methods improving on these issues. 
For instance, GSEA requires the expression data as input, which is not available
for gene lists derived from other experiment types.
On the other hand, the involved sample permutation procedure has been proven
inaccurate and time-consuming~\cite{Efron07,Phipson10,Larson15}.

\subsection{Generations: ora, fcs \& topology-based}

Khatri \textit{et al}., 2012, have taken a
slightly different approach by classifying methods along the timeline of  
development into three generations~\cite{Khatri12}:

\begin{enumerate}
\item Generation: ORA methods based on the $2\times2$ contingency table test,
\item Generation: functional class scoring (FCS) methods such as GSEA, which 
compute gene set (= functional class) scores by summarizing per-gene DE statistics,
\item Generation: topology-based methods, explicitly taking into account interactions
between genes as defined in signaling pathways and gene regulatory networks
(\cite{Geistlinger11} for an example).
\end{enumerate}

Although topology-based (also: network-based) methods appear to be most realistic,
their straightforward application can be impaired by features that are 
not detectable on the transcriptional level (such as protein-protein interactions)
and insufficient network knowledge~\cite{Geistlinger13,Bayerlova15}.

Given the individual benefits and limitations of existing methods,
cautious interpretation of results is required to derive valid conclusions.
Whereas no single method is best suited for all application scenarios, applying
multiple methods can be beneficial.
This has been shown to filter out spurious hits of individual methods, thereby
reducing the outcome to gene sets accumulating evidence from different 
methods~\cite{Geistlinger16,Alhamdoosh17}.

\section{Frequently asked questions}
\begin{enumerate}
\item \textbf{How to cite the \Biocpkg{EnrichmentBrowser}?}\\[0.25cm]
Geistlinger L, Csaba G and Zimmer R. Bioconductor's EnrichmentBrowser: 
seamless navigation through combined results of set- \& network-based enrichment
analysis. \textit{BMC Bioinformatics}, \textbf{17}:45, 2016.\\[0.25cm]
\item \textbf{Is it possible to apply the \Biocpkg{EnrichmentBrowser} to simple gene lists?}\\[0.25cm]
Enrichment methods implemented in the \Biocpkg{EnrichmentBrowser} are, except for ORA, 
expression-based (and also draw their strength from that).
The set-based methods GSEA, SAFE, and SAMGS use sample permutation,
involving recomputation of differential expression, for gene set
significance estimation, i.e.~they require the complete expression
matrix.
The network-based methods require measures of differential expression 
such as fold change and $p$-value to score interactions of the network.
In addition, visualization of enriched gene sets is explicitly designed for 
expression data.
Thus, for simple gene list enrichment, tools like \software{DAVID}
(\href{https://david.ncifcrf.gov}{https://david.ncifcrf.gov}) and
\software{GeneAnalytics} (\href{http://geneanalytics.genecards.org}{http://geneanalytics.genecards.org})
are more suitable, and it is recommended to use them for this purpose.

\end{enumerate}

\bibliography{refs}

\end{document}