---
title: "GOfuncR: Gene Ontology Enrichment Using FUNC"
author: "Steffi Grote"
date: "October 22, 2019"
output:
BiocStyle::html_document:
toc: true
vignette: >
%\VignetteIndexEntry{Introduction to GOfuncR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE)
.libPaths(c(.libPaths(), "/r1/people/steffi_grote/R/x86_64-pc-linux-gnu-library/3.3"))
options(width=110)
set.seed(123)
```
# Overview
`r Biocpkg('GOfuncR')` performs a gene ontology enrichment analysis based on the ontology enrichment software FUNC [1,2].
It provides the standard candidate vs. background enrichment analysis using the **hypergeometric test**, as well as three additional tests: (i) the **Wilcoxon rank-sum test** that is used when genes are ranked, (ii) a **binomial test** that can be used when genes are associated with two counts, e.g. amino acid changes since a common ancestor in two different species, and (iii) a **2x2 contingency table test** that is used in cases when genes are associated with four counts, e.g. non-synonymous or synonymous variants that are fixed between or variable within species.
To correct for multiple testing and interdependency of the tests, family-wise error rates (FWER) are computed based on random permutations of the gene-associated variables (see [Schematic 1](#hyper_scheme) below).
`r Biocpkg('GOfuncR')` also provides tools for exploring the ontology graph and the annotations, and options to take gene-length or spatial clustering of genes into account during testing.
GO-annotations and gene-coordinates are obtained from *OrganismDb* packages (`r Biocpkg('Homo.sapiens')` by default) or *OrgDb* and *TxDb* packages.
The gene ontology graph (obtained from [geneontology](http://current.geneontology.org/ontology/), release date 23-Mar-2020), is integrated in the package.
It is also possible to provide custom gene coordinates, annotations and ontologies.
![input data and test selection](Input_data_test_selection.png 'overview tests')
## Functions included in `GOfuncR`
function | description
-------- | -----------------------------------------------------
[go_enrich](#go_enrich) | core function for performing enrichment analyses given a candidate gene set
[plot_anno_scores](#plot_anno) | plots distribution of scores of genes annotated to GO-categories
[get_parent_nodes](#graph) | returns all parent-nodes of input GO-categories
[get_child_nodes](#graph) | returns all child-nodes of input GO-categories
[get_names](#get_names) | returns the full names of input GO-categories
[get_ids](#get_ids) | returns all GO-categories that contain the input string
[get_anno_genes](#get_anno_g) | returns genes that are annotated to input GO-categories
[get_anno_categories](#get_anno_c) | returns GO-categories that input genes are annotated to
[refine](#refine) | restrict results to most specific GO-categories
## Core function `go_enrich`
The function `go_enrich` performs all enrichment analyses given input genes and has the following parameters:
parameter | default | description
------------- |:-------------:| -----------------------------------------------------------------------|
`genes` | - | a dataframe with gene-symbols or genomic regions and gene-associated variables
`test` | 'hyper' | statistical test to use ('hyper', 'wilcoxon', 'binomial' or 'contingency')
`n_randsets` | 1000 | number of randomsets for computing the family-wise error rate
`organismDb`| 'Homo.sapiens' | *OrganismDb* package for GO-annotations and gene coordinates
`gene_len` | FALSE | correct for gene length (only for `test='hyper'`)
`regions` | FALSE | chromosomal regions as input instead of independent genes (only for `test='hyper'`)
`circ_chrom` | FALSE | use background on circularized chromosome (only for `test='hyper'` and `regions=TRUE`)
`silent` | FALSE | suppress output to screen
`domains` | NULL | optional vector of GO-domains (if NULL all 3 domains are analyzed)
`orgDb`| NULL | optional *OrgDb* package for GO-annotations (overrides `organismDb`)
`txDb`| NULL | optional *TxDb* package for gene-coordinates (overrides `organismDb`)
`annotations`| NULL | optional dataframe with GO-annotations (overrides `organismDb` and `orgDb`)
`gene_coords`| NULL | optional dataframe with gene-coordinates (overrides `organismDb` and `txDb`)
`godir`| NULL | optional directory with ontology graph tables to use instead of the integrated GO-graph
# Examples of GO-enrichment analyses for human genes
## Install annotation package
`GOfuncR` uses external packages to obtain the GO-annotations and gene-coordinates.
In the examples we will use the default `r Biocpkg('Homo.sapiens')` package.
See below for examples [how to use other packages](#other_db) or [how to provide custom annotations](#cu_anno).
```{r, eval=FALSE}
## install annotation package 'Homo.sapiens' from bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install('Homo.sapiens')
```
## Test for gene set enrichment using the hypergeometric test
The hypergeometric test evaluates the over- or under-representation of a set of candidate genes in GO-categories, compared to a set of background genes (see [Schematic 1](#hyper_scheme) below).
The input for the hypergeometric test is a dataframe with two columns: (1) a column with gene-symbols and (2) a binary column with `1` for a candidate gene and `0` for a background gene.
### Hypergeometric test using the default background gene set
The declaration of background genes is optional.
If only candidate genes are defined, then all remaining genes from the annotation package are used as default background.
In this example GO-enrichment of 13 human genes will be tested:
```{r}
## load GOfuncR package
library(GOfuncR)
## create input dataframe with candidate genes
gene_ids = c('NCAPG', 'APOL4', 'NGFR', 'NXPH4', 'C21orf59', 'CACNG2', 'AGTR1',
'ANO1', 'BTBD3', 'MTUS1', 'CALB1', 'GYG1', 'PAX2')
input_hyper = data.frame(gene_ids, is_candidate=1)
input_hyper
```
This dataframe is the only mandatory input for `go_enrich`, however to lower computation time for the examples, we also lower the number of randomsets that are generated to compute the FWER:
```{r, results='hide'}
## run enrichment analysis (n_randets=100 lowers compuation time
## compared to default 1000)
res_hyper = go_enrich(input_hyper, n_randset=100)
```
The output of `go_enrich` is a list of 4 elements:
The most important is the first element which contains the results from the enrichment analysis (ordered by FWER for over-representation of candidate genes):
```{r}
## first element of go_enrich result has the stats
stats = res_hyper[[1]]
## top-GO categories
head(stats)
## top GO-categories per domain
by(stats, stats$ontology, head, n=3)
```
The second element is a dataframe with all valid input genes:
```{r}
## all valid input genes
head(res_hyper[[2]])
```
The third element states the reference genome for the annotations and the version of the GO-graph:
```{r}
## annotation package used (default='Homo.sapiens') and GO-graph version
res_hyper[[3]]
```
The fourth element is a dataframe with the minimum p-values from the permutations, which are used to compute the FWER:
```{r}
## minimum p-values from randomsets
head(res_hyper[[4]])
```
### Hypergeometric test using a defined background gene set
Instead of using the default background gene set, it will often be more accurate to just include those genes in the background gene set, that were studied in the experiment that led to the discovery of the candidate genes.
For example, if the candidate genes are based on microarray expression data, than the background gene set should consist of all genes on the array.
To define a background gene set, just add lines to the input dataframe where the gene-associated variable in the second column is a `0`.
Note that all candidate genes are implicitly part of the background gene set and do not need to be defined as background.
```{r}
## create input dataframe with candidate and background genes
candi_gene_ids = c('NCAPG', 'APOL4', 'NGFR', 'NXPH4', 'C21orf59', 'CACNG2',
'AGTR1', 'ANO1', 'BTBD3', 'MTUS1', 'CALB1', 'GYG1', 'PAX2')
bg_gene_ids = c('FGR', 'NPHP1', 'DRD2', 'ABCC10', 'PTBP2', 'JPH4', 'SMARCC2',
'FN1', 'NODAL', 'CYP1A2', 'ACSS1', 'CDHR1', 'SLC25A36', 'LEPR', 'PRPS2',
'TNFAIP3', 'NKX3-1', 'LPAR2', 'PGAM2')
is_candidate = c(rep(1,length(candi_gene_ids)), rep(0,length(bg_gene_ids)))
input_hyper_bg = data.frame(gene_ids = c(candi_gene_ids, bg_gene_ids),
is_candidate)
head(input_hyper_bg)
tail(input_hyper_bg)
```
The enrichment analysis is performed like before, again with only 100 randomsets to lower computation time.
```{r, results='hide'}
res_hyper_bg = go_enrich(input_hyper_bg, n_randsets=100)
```
```{r}
head(res_hyper_bg[[1]])
```
### Hypergeometric test with correction for gene length
If the chance of a gene to be discovered as a candidate gene is higher for longer genes (e.g. the chance to have an amino-acid change compared to another species), it can be helpful to also correct for this length-bias in the calculation of the family-wise error rate (FWER).
`go_enrich` therefore offers the `gene_len` option: while with the default `gene_len=FALSE` candidate and background genes are permuted randomly in the randomsets (see [Schematic 1](#hyper_scheme)), `gene_len=TRUE` makes the chance of a gene to be chosen as a candidate gene in a randomset dependent on its gene length.
```{r, eval=FALSE}
## test input genes again with correction for gene length
res_hyper_len = go_enrich(input_hyper, gene_len=TRUE)
```
Note that the default annotation package `r Biocpkg('Homo.sapiens')` uses the *hg19* gene-coordinates.
See below for examples how to use [other packages](#other_db) or [custom gene-coordinates](#cu_coord).
### Hypergeometric test with genomic regions as input
Instead of defining candidate and background genes explicitly in the input dataframe, it is also possible to define entire chromosomal regions as candidate and background regions.
The GO-enrichment is then tested for all genes located in, or overlapping the candidate regions on the plus or the minus strand.
In comparison to defining candidate and background genes explicitly, this option has the advantage that the FWER accounts for spatial clustering of genes.
For the random permutations used to compute the FWER, blocks as long as candidate regions are chosen from the merged candidate and background regions and genes contained in these blocks are considered candidate genes.
The option `circ_chrom` defines whether random candidate blocks are chosen from the same chromosome or not ([Schematic 2](#block_scheme)).
To define chromosomal regions in the input dataframe, the entries in the first column have to be of the form `chr:start-stop`, where `start` always has to be smaller than `stop`.
Note that this option requires the input of background regions.
If multiple candidate regions are provided, in the randomsets they are placed randomly (but without overlap) into the merged candidate and background regions.
```{r}
## create input vector with a candidate region on chromosome 8
## and background regions on chromosome 7, 8 and 9
regions = c('8:81000000-83000000', '7:1300000-56800000', '7:74900000-148700000',
'8:7400000-44300000', '8:47600000-146300000', '9:0-39200000',
'9:69700000-140200000')
is_candidate = c(1, rep(0,6))
input_regions = data.frame(regions, is_candidate)
input_regions
```
```{r,results='hide'}
## run GO-enrichment analysis for genes in the candidate region
res_region = go_enrich(input_regions, n_randsets=100, regions=TRUE)
```
The output of `go_enrich` for genomic regions is identical to the one that is produced for single genes.
The first element of the output list contains the results of the enrichment analysis and the second element contains the candidate and background genes located in the user-defined regions:
```{r}
stats_region = res_region[[1]]
head(stats_region)
## see which genes are located in the candidate region
input_genes = res_region[[2]]
candidate_genes = input_genes[input_genes[,2]==1, 1]
candidate_genes
```
Note that the default annotation package `r Biocpkg('Homo.sapiens')` uses the *hg19* gene-coordinates.
See below for examples how to use [other packages](#other_db) or [custom gene-coordinates](#cu_coord).
## Test for enrichment of high scored genes using the Wilcoxon rank-sum test
When genes are not divided into candidate and background genes, but are ranked by some kind of score, e.g. a p-value for differential expression, a Wilcoxon rank-sum test can be performed to find GO-categories where genes with high (or low) scores are over-represented.
This example uses genes ranked by random scores:
```{r}
## create input dataframe with scores in second column
high_score_genes = c('GCK', 'CALB1', 'PAX2', 'GYS1','SLC2A8', 'UGP2', 'BTBD3',
'MTUS1', 'SYP', 'PSEN1')
low_score_genes = c('CACNG2', 'ANO1', 'ZWINT', 'ENGASE', 'HK2', 'PYGL', 'GYG1')
gene_scores = c(runif(length(high_score_genes), 0.5, 1),
runif(length(low_score_genes), 0, 0.5))
input_willi = data.frame(gene_ids = c(high_score_genes, low_score_genes),
gene_scores)
head(input_willi)
```
```{r, results='hide'}
res_willi = go_enrich(input_willi, test='wilcoxon', n_randsets=100)
```
The output is analogous to the one for the hypergeometric test:
```{r}
head(res_willi[[1]])
```
Note that when p-values are used as scores, often one would want to look for enrichment of low ranks, i.e. low p-values (or alternatively use (1 - p-value) as score and check for enrichment of high ranks).
## Test for enrichment using the binomial test
When genes are associated with two counts *A* and *B*, e.g. amino-acid changes since a common ancestor in two species, a binomial test can be used to identify GO-categories with an enrichment of genes with a high fraction of one of the counts compared to the fraction in the root node.
To perform the binomial test the input dataframe needs a column with the gene symbols and two additional columns with the corresponding counts:
```{r}
## create a toy example dataset with two counts per gene
high_A_genes = c('G6PD', 'GCK', 'GYS1', 'HK2', 'PYGL', 'SLC2A8', 'UGP2',
'ZWINT', 'ENGASE')
low_A_genes = c('CACNG2', 'AGTR1', 'ANO1', 'BTBD3', 'MTUS1', 'CALB1', 'GYG1',
'PAX2')
A_counts = c(sample(20:30, length(high_A_genes)),
sample(5:15, length(low_A_genes)))
B_counts = c(sample(5:15, length(high_A_genes)),
sample(20:30, length(low_A_genes)))
input_binom = data.frame(gene_ids=c(high_A_genes, low_A_genes), A_counts,
B_counts)
head(input_binom)
```
In this example we also use the `domains` option to reduce the analysis to `molecular_function` and `cellular_component`.
Also the `silent` option is used, which suppresses all output that would be written to the screen (except for warnings and errors):
```{r}
## run binomial test, excluding the 'biological_process' domain,
## suppress output to screen
res_binom = go_enrich(input_binom, test='binomial', n_randsets=100,
silent=TRUE, domains=c('molecular_function', 'cellular_component'))
head(res_binom[[1]])
```
## Test for enrichment using the 2x2 contingency table test
When genes are associated with four counts (*A*-*D*), e.g. non-synonymous or synonymous variants that are fixed between or variable within species, like for a McDonald-Kreitman test [3], the 2x2 contingency table test can be used.
It can identify GO-categories which have a high ratio of *A/B* compared to *C/D*, which in this example would correspond to a high ratio of *non-synonymous substitutions / synonymous substitutions* compared to *non-synonymous variable / synonymous variable*:
```{r}
## create a toy example with four counts per gene
high_substi_genes = c('G6PD', 'GCK', 'GYS1', 'HK2', 'PYGL', 'SLC2A8', 'UGP2',
'ZWINT', 'ENGASE')
low_substi_genes = c('CACNG2', 'AGTR1', 'ANO1', 'BTBD3', 'MTUS1', 'CALB1',
'GYG1', 'PAX2', 'C21orf59')
subs_non_syn = c(sample(5:15, length(high_substi_genes), replace=TRUE),
sample(0:5, length(low_substi_genes), replace=TRUE))
subs_syn = sample(5:15, length(c(high_substi_genes, low_substi_genes)),
replace=TRUE)
vari_non_syn = c(sample(0:5, length(high_substi_genes), replace=TRUE),
sample(0:10, length(low_substi_genes), replace=TRUE))
vari_syn = sample(5:15, length(c(high_substi_genes, low_substi_genes)),
replace=TRUE)
input_conti = data.frame(gene_ids=c(high_substi_genes, low_substi_genes),
subs_non_syn, subs_syn, vari_non_syn, vari_syn)
head(input_conti)
## the corresponding contingency table for the first gene would be:
matrix(input_conti[1, 2:5], ncol=2,
dimnames=list(c('non-synonymous', 'synonymous'),
c('substitution','variable')))
```
```{r, results='hide'}
res_conti = go_enrich(input_conti, test='contingency', n_randset=100)
```
The output is analogous to that of the other tests:
```{r}
head(res_conti[[1]])
```
Depending on the counts for each GO-category a Chi-square or Fisher's exact test is performed.
Note that this is the only test that is not dependent on the distribution of the gene-associated variables in the root nodes.
# Enrichment analyses with different annotations or ontologies
## Other annotation packages
Annotation package types suggested for `GOfuncR`:
annotation package | information used in `GOfuncR`
------------------ | ----------------------------
[*OrganismDb*](http://www.bioconductor.org/packages/release/BiocViews.html#___OrganismDb) | GO-annotations + gene-coordinates
[*OrgDb*](http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb) | GO-annotations
[*TxDb*](http://www.bioconductor.org/packages/release/BiocViews.html#___TxDb) | gene-coordinates
The default annotation package used by `GOfuncR` is bioconductor's *OrganismDb* package `r Biocpkg('Homo.sapiens')`, which contains GO-annotations as well as gene-coordinates.
There are currently also *OrganismDb* packages available for mouse (`r Biocpkg('Mus.musculus')`) and rat (`r Biocpkg('Rattus.norvegicus')`).
After installation those packages can be used in `go_enrich`:
```{r, eval=FALSE}
## perform enrichment analysis for mouse genes
## ('Mus.musculus' has to be installed)
mouse_gene_ids = c('Gck', 'Gys1', 'Hk2', 'Pygl', 'Slc2a8', 'Ugp2', 'Zwint',
'Engase')
input_hyper_mouse = data.frame(mouse_gene_ids, is_candidate=1)
res_hyper_mouse = go_enrich(input_hyper_mouse, organismDb='Mus.musculus')
```
Besides *OrganismDb* packages also *OrgDb* packages can be used to get GO-annotations.
These packages have the advantage that they are available for a broader range of species (e.g. `r Biocpkg('org.Pt.eg.db')` for chimp or `r Biocpkg('org.Gg.eg.db')` for chicken).
*OrgDb* packages are specified by the `orgDb` parameter of `go_enrich`:
```{r, eval=FALSE}
## perform enrichment analysis for chimp genes
## ('org.Pt.eg.db' has to be installed)
chimp_gene_ids = c('SIAH1', 'MIIP', 'ELP3', 'CFB', 'ADARB1', 'TRNT1',
'DEFB124', 'OR1A1', 'TYR', 'HOXA7')
input_hyper_chimp = data.frame(chimp_gene_ids, is_candidate=1)
res_hyper_chimp = go_enrich(input_hyper_chimp, orgDb='org.Pt.eg.db')
```
When an *OrgDb* package is used for annotations and the `go_enrich` analysis relies on gene-coordinates (i.e. `gene_len=TRUE` or `regions=TRUE`), then an additional *TxDb* package has to be provided for the gene-coordinates:
```{r, eval=FALSE}
## perform enrichment analysis for chimp genes
## and account for gene-length in FWER
## (needs 'org.Pt.eg.db' and 'TxDb.Ptroglodytes.UCSC.panTro4.refGene' installed)
res_hyper_chimp_genelen = go_enrich(input_hyper_chimp, gene_len=TRUE,
orgDb='org.Pt.eg.db', txDb='TxDb.Ptroglodytes.UCSC.panTro4.refGene')
```
*OrgDb* + *TxDb* packages can also be useful even if there is an *OrganismDb* package available, for example to use a different reference genome. Here we use the *hg38* gene-coordinates from `r Biocpkg('TxDb.Hsapiens.UCSC.hg38.knownGene')` instead of the default *hg19* from the *OrganismDb* package `r Biocpkg('Homo.sapiens')`.
```{r,eval=FALSE}
## run GO-enrichment analysis for genes in the candidate region
## using hg38 gene-coordinates
## (needs 'org.Hs.eg.db' and 'TxDb.Hsapiens.UCSC.hg38.knownGene' installed)
res_region_hg38 = go_enrich(input_regions, regions=TRUE,
orgDb='org.Hs.eg.db', txDb='TxDb.Hsapiens.UCSC.hg38.knownGene')
```
Note that using *TxDb* packages always requires defining an *OrgDb* package for the annotations.
## Custom annotations
Besides using bioconductor's annotation packages for the mapping of genes to GO-categories, it is also possible to provide the annotations directly as a dataframe with two columns: (1) genes and (2) GO-IDs (parameter `annotations`).
```{r,echo=FALSE}
custom_anno = get_anno_categories(input_hyper_bg[,1])
# to have several genes in head
custom_anno = custom_anno[6:nrow(custom_anno),1:2]
rownames(custom_anno) = 1:nrow(custom_anno)
```
```{r}
## example for a dataframe with custom annotations
head(custom_anno)
```
```{r,eval=FALSE}
## run enrichment analysis with custom annotations
res_hyper_anno = go_enrich(input_hyper, annotations=custom_anno)
```
## Custom gene-coordinates
Gene-coordinates are used when the FWER is corrected for gene length (`gene_len=TRUE`) or for spatial clustering of genes (`regions=TRUE`).
Instead of using gene-coordinates from bioconductor packages, one can also provide custom gene-coordinates directly as a dataframe with four columns: gene, chromosome, start, end (parameter `gene_coords`).
```{r,echo=FALSE}
gene = c('NCAPG','APOL4','NGFR','NXPH4','C21orf59','CACNG2')
chr = c('chr4', 'chr22', 'chr17', 'chr12', 'chr21', 'chr22')
start = c(17812436, 36585176, 47572655, 57610578, 33954510, 36956916)
end = c(17846487, 36600879, 47592382, 57620232, 33984918, 37098690)
custom_coords = data.frame(gene, chr, start, end, stringsAsFactors=FALSE)
```
```{r}
## example for a dataframe with custom gene-coordinates
head(custom_coords)
```
```{r,eval=FALSE}
## use correction for gene-length based on custom gene-coordinates
res_hyper_cc = go_enrich(input_hyper, gene_len=TRUE, gene_coords=custom_coords)
```
Note that this allows to use `gene_len=TRUE` to correct the FWER for any user-defined gene 'weight', since the correction for gene length just weights each gene with its length (`end - start`).
A gene with a higher weight has a bigger chance of becoming a candidate gene in the randomsets.
## Custom GO-graph
A default GO-graph (obtained from [geneontology](http://current.geneontology.org/ontology/), release date 23-Mar-2020), is integrated in the package.
However, also a custom GO-graph, e.g. a specific version or a different ontology can be provided. `go_enrich` needs a directory which contains three tab-separated files in the GO MySQL Database Schema:
*term.txt*, *term2term.txt* and *graph_path.txt*.
The full path to this directory needs to be defined in the parameter `godir`.
Specific versions of the GO-graph can be downloaded from http://archive.geneontology.org/lite/.
For example, to use the GO-graph from 2018-11-24, download and unpack the files from
http://archive.geneontology.org/lite/2018-11-24/go_weekly-termdb-tables.tar.gz.
Assume the files were saved in `/home/user/go_graphs/2018-11-24/`. This directory now contains the needed files `term.txt`, `term2term.txt` and `graph_path.txt` and can be used in `go_enrich`:
```{r,eval=FALSE}
## run enrichment with custom GO-graph
go_path = '/home/user/go_graphs/2018-11-24/'
res_hyper = go_enrich(input_hyper, godir=go_path)
```
### Conversion from _.obo_ format
At some point [Gene Ontology](http://geneontology.org/) may no longer provide the ontology in the GO MySQL Database Schema; and other ontologies may not be provided in that format at all.
Therefore, custom ontologies might need to be converted to the right format before using them in `GOfuncR`.
On https://github.com/sgrote/OboToTerm you can find a python script that converts the widely used `.obo` format to the tables needed (*term.txt*, *term2term.txt* and *graph_path.txt*).
# Additional functionalities
## Plot distribution of gene-associated variables from an enrichment analysis
The function `plot_anno_scores` can be used to get a quick visual overview of the gene-associated variables in GO-categories, that were used in an enrichment analysis.
`plot_anno_scores` takes a result from `go_enrich` as input together with a vector of GO-IDs.
It then plots the combined scores of all input genes for the `go_enrich` analysis in each of the defined GO-categories.
The type of the plot depends on the test that was used in `go_enrich`.
Note that if custom `annotations` were used in `go_enrich`, then they also have to be provided to `plot_anno_scores` (whereas ontology and annotation databases are inferred from the input and loaded in `plot_anno_scores`).
For the **hypergeometric test** pie charts show the amounts of candidate and background genes that are annotated to the GO-categories and the root nodes (candidate genes in the colour of the corresponding root node).
The top panel shows the odds-ratio and 95%-CI from Fisher's exact test (two-sided) comparing the GO-categories with their root nodes.
```{r}
## hypergeometric test
top_gos_hyper = res_hyper[[1]][1:5, 'node_id']
# GO-categories with a high proportion of candidate genes
top_gos_hyper
plot_anno_scores(res_hyper, top_gos_hyper)
```
`plot_anno_scores` returns an invisible dataframe that contains the stats from Fisher's exact test shown in the plot:
```{r}
## hypergeometric test with defined background
top_gos_hyper_bg = res_hyper_bg[[1]][1:5, 'node_id']
top_gos_hyper_bg
plot_stats = plot_anno_scores(res_hyper_bg, top_gos_hyper_bg)
plot_stats
```
Note that `go_enrich` reports the hypergeometric tests for over- and under-representation of candidate genes which correspond to the one-sided Fisher's exact tests.
Also keep in mind that the p-values from this table are not corrected for multiple testing.
For the **Wilcoxon rank-sum test** violin plots show the distribution of the scores of genes that are annotated to each GO-category and the root nodes.
Horizontal lines in the left panel indicate the median of the scores that are annotated to the root nodes.
The Wilcoxon rank-sum test reported in the `go_enrich` result compares the scores annotated to a GO-category with the scores annotated to the corresponding root node.
```{r}
## scores used for wilcoxon rank-sum test
top_gos_willi = res_willi[[1]][1:5, 'node_id']
# GO-categories with high scores
top_gos_willi
plot_anno_scores(res_willi, top_gos_willi)
```
For the **binomial test** pie charts show the amounts of *A* and *B* counts associated with each GO-category and root node, (*A* in the colour of the corresponding root node).
The top-panel shows point estimates and the 95%-CI of *p(A)* in the nodes, as well as horizontal lines that correspond to *p(A)* in the root nodes.
The p-value in the returned object is based on the null hypothesis that *p(A)* in a node equals *p(A)* in the corresponding root node.
Note that `go_enrich` reports that value for one-sided binomial tests.
```{r}
## counts used for the binomial test
top_gos_binom = res_binom[[1]][1:5, 'node_id']
# GO-categories with high proportion of A
top_gos_binom
plot_anno_scores(res_binom, top_gos_binom)
```
Note that domain `biological_process` is missing in that plot because it was excluded from the GO-enrichment analysis in the first place (`res_binom` was created using the `domains` option of `go_enrich`).
For the **2x2 contingency table test** pie charts show the proportions of *A* and *B*, as well as *C* and *D* counts associated with a GO-category.
Root nodes are not shown, because this test is independent of the root category.
The top panel shows the odds ratio and 95%-CI from Fisher's exact test (two-sided) comparing *A/B* and *C/D* inside one node.
Note that in `go_enrich`, if all four values are >=10, a chi-square test is performed instead of Fisher's exact test.
```{r}
## counts used for the 2x2 contingency table test
top_gos_conti = res_conti[[1]][1:5, 'node_id']
# GO-categories with high A/B compared to C/D
top_gos_conti
plot_anno_scores(res_conti, top_gos_conti)
```
## Explore the GO-graph
The functions `get_parent_nodes` and `get_child_nodes` can be used to explore the ontology-graph.
They list all higher-level GO-categories and sub-GO-categories of input nodes, respectively, together with the distance between them:
```{r}
## get the parent nodes (higher level GO-categories) of two GO-IDs
get_parent_nodes(c('GO:0051082', 'GO:0042254'))
## get the child nodes (sub-categories) of two GO-IDs
get_child_nodes(c('GO:0090070', 'GO:0000112'))
```
Note that a GO-category per definition is also its own parent and child with distance 0.
The function `get_names` can be used to retrieve the names and root nodes of GO-IDs:
```{r}
## get the full names and domains of two GO-IDs
get_names(c('GO:0090070', 'GO:0000112'))
```
It is also possible to go the other way round and search for GO-categories given part of their name using the function `get_ids`:
```{r}
## get GO-IDs of categories that contain 'blood-brain barrier' in their names
bbb = get_ids('blood-brain barrier')
head(bbb)
```
Note that this is just a `grep(..., ignore.case=TRUE)` on the node names of the ontology.
More sophisticated searches, e.g. with regular expressions, could be performed on the table returned by `get_ids('')` which lists all non-obsolete GO-categories.
Like for `go_enrich` also custom ontologies can be used (see the help pages of the functions).
## Retrieve associations between genes and GO-categories
`GOfuncR` also offers the functions `get_anno_genes` and `get_anno_categories` to get annotated genes given input GO-categories, and annotated GO-categories given input genes, respectively.
`get_anno_genes` takes a vector of GO-IDs as input, and returns all genes that are annotated to those categories (using the default annotation package `r Biocpkg('Homo.sapiens')`).
The optional arguments `database` and `genes` can be used to define a [different annotation package](#other_db) and the set of genes which is searched for annotations, respectively.
This function implicitly includes annotations to child nodes.
```{r}
## find all genes that are annotated to GO:0000109
## using default database 'Homo.sapiens'
head(get_anno_genes(go_ids='GO:0000109'))
## find out wich genes from a set of genes
## are annotated to some GO-categories
genes = c('AGTR1', 'ANO1', 'CALB1', 'GYG1', 'PAX2')
gos = c('GO:0001558', 'GO:0005536', 'GO:0072205', 'GO:0006821')
anno_genes = get_anno_genes(go_ids=gos, genes=genes)
# add the names and domains of the GO-categories
cbind(anno_genes, get_names(anno_genes$go_id)[,2:3])
```
```{r, eval=FALSE}
## find all mouse-gene annotations to two GO-categories
## ('Mus.musculus' has to be installed)
gos = c('GO:0072205', 'GO:0000109')
get_anno_genes(go_ids=gos, database='Mus.musculus')
```
`get_anno_categories` on the other hand uses gene-symbols as input and returns associated GO-categories:
```{r}
## get the GO-annotations for two random genes
anno = get_anno_categories(c('BTC', 'SPAG5'))
head(anno)
```
```{r, eval=FALSE}
## get the GO-annotations for two mouse genes
## ('Mus.musculus' has to be installed)
anno_mus = get_anno_categories(c('Mus81', 'Papola'), database='Mus.musculus')
```
This function only returns direct annotations.
To get also the parent nodes of the GO-categories a gene is annotated to, the function `get_parent_nodes` can be used:
```{r}
# get all direct annotations of NXPH4
direct_anno = get_anno_categories('NXPH4')
direct_anno
# get parent nodes of directly annotated GO-categories
parent_ids = unique(get_parent_nodes(direct_anno$go_id)[,2])
# add GO-domain
full_anno = get_names(parent_ids)
head(full_anno)
```
Like for `go_enrich` also custom annotations and ontologies can be used (see the help pages of the functions).
## Refine results from go_enrich
When there are many significant GO-categories given a FWER-threshold, it may be useful to restrict the results to the most specific categories.
The `refine` function implements the _elim_ algorithm from [4], which removes genes from significant child categories and repeats the test to check whether a category would still be significant.
```{r, results='hide'}
## perform enrichment analysis for some genes
gene_ids = c('NCAPG', 'APOL4', 'NGFR', 'NXPH4', 'C21orf59', 'CACNG2', 'AGTR1',
'ANO1', 'BTBD3', 'MTUS1', 'CALB1', 'GYG1', 'PAX2')
input_hyper = data.frame(gene_ids, is_candidate=1)
res_hyper = go_enrich(input_hyper, n_randset=100)
## perform refinement for categories with FWER < 0.1
refined = refine(res_hyper, fwer=0.1)
```
```{r}
## the result shows p-values and significance before and after refinement
refined
```
By default `refine` performs the test for over-representation of candidate genes, see `?refine` for how to check for under-representation.
# Schematics
## Schematic 1: Hypergeometric test and FWER calculation
![FWER calculation](./Skizze_Fig1.png 'hypergeometric test and FWER')
The FWER for the other tests is computed in the same way: the gene-associated variables (scores or counts) are permuted while the annotations of genes to GO-categories stay fixed.
Then the statistical tests are evaluated again for every GO-category.
## Schematic 2: circ_chrom option for genomic regions input
![options for genomic regions input](./Skizze_Fig2.png 'options for genomic regions input')
To use genomic regions as input, the first column of the `genes` input dataframe has to be of the form `chr:start-stop` and `regions=TRUE` has to be set.
The option `circ_chrom` defines how candidate regions are randomly moved inside the background regions for computing the FWER.
When `circ_chrom=FALSE` (default), candidate regions can be moved to any background region on any chromosome, but are not allowed to overlap multiple background regions.
When `circ_chrom=TRUE`, candidate regions are only moved on the same chromosome and are allowed to overlap multiple background regions.
The chromosome is 'circularized' which means that a randomly placed candidate region may start at the end of the chromosome and continue at the beginning.
# Session Info
```{r}
sessionInfo()
```
# References
[1] Ashburner, M. et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29. [https://doi.org/10.1038/75556]
[2] Pruefer, K. et al. (2007). FUNC: A package for detecting significant associations between gene
sets and ontological annotations, BMC Bioinformatics 8: 41. [https://doi.org/10.1186/1471-2105-8-41]
[3] McDonald, J. H. Kreitman, M. (1991). Adaptive protein evolution at the Adh locus in Drosophila, Nature 351: 652-654. [https://doi.org/10.1038/351652a0]
[4] Alexa, A. et al. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600–1607. [https://doi.org/10.1093/bioinformatics/btl140]