--- title: "Functional analysis of mouse mammary gland RNA-Seq" author: - name: Aurelien Brionne affiliation: BOA, Institut National de la Recherche Agronomique (INRA), Universitée de Tours, 37380 Nouzilly, France - name: Amelie Juanchich affiliation: BOA, Institut National de la Recherche Agronomique (INRA), Universitée de Tours, 37380 Nouzilly, France - name: Christelle Hennequet-Antier affiliation: BOA, Institut National de la Recherche Agronomique (INRA), Universitée de Tours, 37380 Nouzilly, France date: "`r format(Sys.time(), '%d %B, %Y')`" output: BiocStyle::html_document: highlight: tango vignette: > %\VignetteIndexEntry{2: mouse_bionconductor} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: "`r system.file('extdata/','bibliography.bib',package='ViSEAGO')`" csl: "`r system.file('extdata/','bmc-genomics.csl',package='ViSEAGO')`" --- ```{r setup,include=FALSE} # load ViSEAGO and mouse db package library(ViSEAGO) # knitr document options knitr::opts_chunk$set( eval=FALSE,echo=TRUE,fig.pos = 'H', fig.width=6,message=FALSE,comment=NA,warning=FALSE ) ``` # Introduction{-} This study explores expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) of the mammary gland of virgin, pregnant and lactating mice @pmid25730472. Six groups are compared corresponding to a combination of cell type and mouse status. Each group contains two biological replicates. Sequences and counts datasets are publicly available from the Gene Expression Omnibus (GEO) with the serie accession number GSE60450. The RNA-Seq data is analysed by `r BiocStyle::Biocpkg("edgeR")` development team @edgeR. As described in edgeR user guide, we tested for significant differential expression in gene, using the QL F-test @pmid23104842. Here, we focused in gene expression analysis in only one type of cell, i.e. luminal cells for the three developmental status (virgin, pregnant and lactating mice). Among the 15,804 expressed genes, we obtained 7,302 significantly differentially expressed (DE) genes for the comparison virgin versus pregnant, 7,699 for the comparison pregnant versus lactate, and 9,583 for the comparison virgin versus lactate, with Benjamini-Hochberg correction to control the false discovery rate at 5%. # Data{-} Vignette build convenience (for less build time and package size) need that data were pre-calculated (provided by the package), and that illustrations were not interactive. ```{r vignette_data_used,eval=TRUE} # load vignette data data( myGOs, package="ViSEAGO" ) ``` # Genes of interest We load examples files from `r BiocStyle::Biocpkg("ViSEAGO")` package using `system.file` from the locally installed package. We read the gene identifiers for the background (= expressed genes) file and the three lists of DE genes of the study. Here, gene identifiers are GeneID from EntrezGene database. ```{r geneList_input} # load genes identifiants (GeneID,ENS...) background (expressed genes) background<-scan( system.file( "extdata/data/input", "background_L.txt", package = "ViSEAGO" ), quiet=TRUE, what="" ) # load Differentialy Expressed (DE) gene identifiants from lists PregnantvsLactateDE<-scan( system.file( "extdata/data/input", "pregnantvslactateDE.txt", package = "ViSEAGO" ), quiet=TRUE, what="" ) VirginvsLactateDE<-scan( system.file( "extdata/data/input", "virginvslactateDE.txt", package = "ViSEAGO" ), quiet=TRUE, what="" ) VirginvsPregnantDE<-scan( system.file( "extdata/data/input", "virginvspregnantDE.txt", package = "ViSEAGO" ), quiet=TRUE, what="" ) ``` Here, we display the first 6 GeneID from the *L_pregnantvslactateDE* list. ```{r geneList_input-head,echo=FALSE} # show the ten first lines of genes_DE (same as genes_ref) head(PregnantvsLactateDE) ``` # GO annotation of genes In this study, we build a `myGENE2GO` object using the Bioconductor `r BiocStyle::Biocpkg("org.Mm.eg.db")` database package for the mouse species. This object contains all available GO annotations for categories Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). NB: Don't forget to check if the last current annotation database version is installed in your R session! See `ViSEAGO::available_organisms(Bioconductor)`. ```{r Genomic-ressources} # connect to Bioconductor Bioconductor<-ViSEAGO::Bioconductor2GO() # load GO annotations from Bioconductor myGENE2GO<-ViSEAGO::annotate( "org.Mm.eg.db", Bioconductor ) ```
```{r Genomic-ressources_show} # display summary myGENE2GO ```
```{r Genomic-ressources_display,echo=FALSE,eval=TRUE} cat( "- object class: gene2GO - database: Bioconductor - stamp/version: 2018-Oct11 - organism id: org.Mm.eg.db GO annotations: - Molecular Function (MF): 22744 annotated genes with 90922 terms (4133 unique terms) - Biological Process (BP): 23239 annotated genes with 160094 terms (12087 unique terms) - Cellular Component (CC): 23416 annotated genes with 105430 terms (1727 unique terms)" ) ``` # Functional GO enrichment ## GO enrichment tests We perform a functional Gene Ontology (GO) enrichment analysis from differentially expressed (DE) genes of luminal cells in the mammary gland. The enriched **Biological process** (BP) are obtained using a Fisher's exact test with `elim` algorithm developped in `r BiocStyle::Biocpkg("topGO")` package. First, we create three `topGOdata` objects, using `ViSEAGO::create_topGOdata` method, corresponding to the three DE genes lists for the comparison virgin versus pregnant, pregnant versus lactate, and virgin versus lactate. The gene background corresponding to expressed genes in luminal cells of the mammary gland and GO annotations are also provided. ```{r Enrichment_data} # create topGOdata for BP for each list of DE genes BP_PregnantvsLactate<-ViSEAGO::create_topGOdata( geneSel=PregnantvsLactateDE, allGenes=background, gene2GO=myGENE2GO, ont="BP", nodeSize=5 ) BP_VirginvsLactate<-ViSEAGO::create_topGOdata( geneSel=VirginvsLactateDE, allGenes=background, gene2GO=myGENE2GO, ont="BP", nodeSize=5 ) BP_VirginvsPregnant<-ViSEAGO::create_topGOdata( geneSel=VirginvsPregnantDE, allGenes=background, gene2GO=myGENE2GO, ont="BP", nodeSize=5 ) ``` Now, we perform the GO enrichment tests for BP category with Fisher's exact test and *elim* algorithm using `topGO::runTest` method. NB: p-values of enriched GO terms are not adjusted and considered significant if below 0.01. ```{r Enrichment_data_tests} # perform TopGO tests elim_BP_PregnantvsLactate<-topGO::runTest( BP_PregnantvsLactate, algorithm ="elim", statistic = "fisher" ) elim_BP_VirginvsLactate<-topGO::runTest( BP_VirginvsLactate, algorithm ="elim", statistic = "fisher" ) elim_BP_VirginvsPregnant<-topGO::runTest( BP_VirginvsPregnant, algorithm ="elim", statistic = "fisher" ) ``` ## Combine enriched GO terms We combine the results of the three enrichment tests into an object using `ViSEAGO::merge_enrich_terms` method. A table of enriched GO terms in at least one comparison is displayed in interactive mode, or printed in a file using `ViSEAGO::show_table` method. The printed table contains for each enriched GO terms, additional columns including the list of significant genes and frequency (ratio between the number of significant genes and number of background genes in a specific GO tag) evaluated by comparison. ```{r Enrichment_merge} # merge topGO results BP_sResults<-ViSEAGO::merge_enrich_terms( Input=list( PregnantvsLactate=c( "BP_PregnantvsLactate", "elim_BP_PregnantvsLactate" ), VirginvsLactate=c( "BP_VirginvsLactate", "elim_BP_VirginvsLactate" ), VirginvsPregnant=c( "BP_VirginvsPregnant", "elim_BP_VirginvsPregnant" ) ) ) ```
```{r Enrichment_merge_show} # display a summary BP_sResults ```
```{r Enrichment_merge_display,echo=FALSE,eval=TRUE} cat( "- object class: gene2GO - database: Bioconductor - stamp/version: 2018-Apr4 - organism id: org.Mm.eg.db GO annotations: - Molecular Function (MF): 23049 annotated genes with 92018 terms (4118 unique terms) - Biological Process (BP): 23843 annotated genes with 162583 terms (11881 unique terms) - Cellular Component (CC): 23583 annotated genes with 102801 terms (1662 unique terms) > BP_sResults - object class: enrich_GO_terms - ontology: BP - input: PregnantvsLactate: elim VirginvsLactate: elim VirginvsPregnant: elim - topGO summary: PregnantvsLactate BP_PregnantvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 7699 feasible_genes: 14402 feasible_genes_significant: 7185 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_PregnantvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 198 feasible_genes: 14402 feasible_genes_significant: 7185 genes_nodeSize: 5 Nontrivial_nodes: 8155 VirginvsLactate BP_VirginvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 9583 feasible_genes: 14402 feasible_genes_significant: 8898 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_VirginvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 151 feasible_genes: 14402 feasible_genes_significant: 8898 genes_nodeSize: 5 Nontrivial_nodes: 8180 VirginvsPregnant BP_VirginvsPregnant description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 7302 feasible_genes: 14402 feasible_genes_significant: 6875 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_VirginvsPregnant description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 232 feasible_genes: 14402 feasible_genes_significant: 6875 genes_nodeSize: 5 Nontrivial_nodes: 8143 - enrich GOs data.table (p<0.01 in at least one list): 509 GO terms of 3 conditions. PregnantvsLactate : 198 terms VirginvsLactate : 151 terms VirginvsPregnant : 232 terms" ) ```
```{r Enrichment_merge_table} # show table in interactive mode ViSEAGO::show_table(BP_sResults) ``` bioconductor_table1 ## Graphs of GO enrichment tests Several graphs summarize important features. An interactive barchart showing the number of GO terms enriched or not in each comparison, using `ViSEAGO::GOcount` method. Intersections of lists of enriched GO terms between comparisons are displayed in an Upset plot using `ViSEAGO::Upset` method. ```{r Enrichment_merge_count} # barchart of significant (or not) GO terms by comparison ViSEAGO::GOcount(BP_sResults) ``` gocount ```{r Enrichment_merge_interactions} # display intersections ViSEAGO::Upset( BP_sResults, file="upset.xls" ) ``` upset # GO terms Semantic Similarity GO annotations of genes created at a previous step and enriched GO terms are combined using `ViSEAGO::build_GO_SS` method. The Semantic Similarity (SS) between enriched GO terms are calculated using `ViSEAGO::compute_SS_distances` method which is a wrapper of functions implemented in the `r BiocStyle::Biocpkg("GOSemSim")` package @pmid20179076. Here, we choose *Wang* method based on the topology of GO graph structure. The built object `myGOs` contains all informations of enriched GO terms and the SS distances between them. ```{r SS_build} # create GO_SS-class object myGOs<-ViSEAGO::build_GO_SS( gene2GO=myGENE2GO, enrich_GO_terms=BP_sResults ) ```
```{r SS_compute} # compute Semantic Similarity (SS) myGOs<-ViSEAGO::compute_SS_distances( myGOs, distance="Wang" ) ```
```{r SS_build_compute_show} # display a summary myGOs ```
```{r SS_build_compute_display,echo=FALSE,eval=TRUE} cat( "- object class: GO_SS - database: Bioconductor - stamp/version: 2018-Apr4 - organism id: org.Mm.eg.db - ontology: BP - input: PregnantvsLactate: elim VirginvsLactate: elim VirginvsPregnant: elim - topGO summary: PregnantvsLactate BP_PregnantvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 7699 feasible_genes: 14402 feasible_genes_significant: 7185 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_PregnantvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 198 feasible_genes: 14402 feasible_genes_significant: 7185 genes_nodeSize: 5 Nontrivial_nodes: 8155 VirginvsLactate BP_VirginvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 9583 feasible_genes: 14402 feasible_genes_significant: 8898 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_VirginvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 151 feasible_genes: 14402 feasible_genes_significant: 8898 genes_nodeSize: 5 Nontrivial_nodes: 8180 VirginvsPregnant BP_VirginvsPregnant description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 7302 feasible_genes: 14402 feasible_genes_significant: 6875 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_VirginvsPregnant description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 232 feasible_genes: 14402 feasible_genes_significant: 6875 genes_nodeSize: 5 Nontrivial_nodes: 8143 - enrich GOs data.table: 509 GO terms of 3 conditions. PregnantvsLactate : 198 terms VirginvsLactate : 151 terms VirginvsPregnant : 232 terms - terms distances: Wang" ) ``` # Visualization and interpretation of enriched GO terms ## Multi Dimensional Scaling of GO terms - A preview A Multi Dimensional Scale (MDS) plot with `ViSEAGO::MDSplot` method provides a representation of distances among a set of enriched GO terms on the two first dimensions. Some patterns could appear at this time and could be investigated in the interactive mode. The plot could also be printed in a png file. ```{r SS_terms_mdsplot} # MDSplot ViSEAGO::MDSplot(myGOs) ``` mds1 ## Clustering heatmap of GO terms To fully explore the results of this functional analysis, a hierarchical clustering using `ViSEAGO::GOterms_heatmap` is performed based on `Wang` SS distance between enriched GO terms and `ward.D2` aggregation criteria. Clusters of enriched GO terms are obtained by cutting branches off the dendrogram. Here, we choose a dynamic branch cutting method based on the shape of clusters using `r BiocStyle::CRANpkg("dynamicTreeCut")` [@pmid18024473; @dynamicTreeCut]. Here, we use parameters to detect small clusters in agreement with the dendrogram structure. Enriched GO terms are ranked in a dendrogram and colored depending on their cluster assignation. Additional illustrations are displayed: the GO description of terms (trimmed if more than 50 characters), a heatmap of -log10(pvalue) of enrichment test for each comparison, and optionally the Information Content (IC). The IC of a GO term is computed by the negative log probability of the term occurring in the GO annotations database of the studied species. A rarely used term contains a greater amount of IC. The illustration is displayed with `ViSEAGO::show_heatmap` method. ```{r SS_Wang-wardD2} # Create GOterms heatmap Wang_clusters_wardD2<-ViSEAGO::GOterms_heatmap( myGOs, showIC=TRUE, showGOlabels =FALSE, GO.tree=list( tree=list( distance="Wang", aggreg.method="ward.D2" ), cut=list( dynamic=list( pamStage=TRUE, pamRespectsDendro=TRUE, deepSplit=2, minClusterSize =2 ) ) ), samples.tree=NULL ) ```
```{r SS_Wang-wardD2_heatmap_display} # display the heatmap ViSEAGO::show_heatmap( Wang_clusters_wardD2, "GOterms" ) ``` GOtermsheatmap A table of enriched GO terms in at least one of the comparison is displayed in interactive mode, or printed in a file using `ViSEAGO::show_table` method. The columns GO cluster number and IC value are added to the previous table of enriched terms table. The printed table contains for each enriched GO terms, additional columns including the list of significant genes and frequency (ratio between the number of significant genes and number of background genes in a specific GO tag) evaluated by comparison. ```{r SS_Wang-ward.D2_table} # display table ViSEAGO::show_table(Wang_clusters_wardD2) ``` bioconductor_table2 NB: For this vignette, this illustration is not interactive. ## Multi Dimensional Scaling of GO terms We also display a colored Multi Dimensional Scale (MDS) plot showing the overlay of GO terms clusters from `Wang_clusters_wardD2` object with `ViSEAGO::MDSplot` method. It is a way to check the coherence of GO terms clusters on the MDS plot. ```{r SS_Wang-ward.D2_mdsplot} # colored MDSplot ViSEAGO::MDSplot(Wang_clusters_wardD2) ``` mds2 # Visualization and interpretation of GO clusters This part is dedicated to explore relationships between clusters of GO terms defined by the last hierarchical clustering with `ViSEAGO::GOterms_heatmap` method. This analysis should be conducted if the number of clusters of GO terms is large enough and therefore difficult to investigate. ## Compute semantic similarity between GO clusters A Semantic Similarity (SS) between clusters of GO terms, previously defined, is calculated using `ViSEAGO::compute_SS_distances` method. Here, we choose the Best-Match Average, also known as *BMA* method implemented in the `r BiocStyle::Biocpkg("GOSemSim")` package @pmid20179076. It calculates the average of all maximum similarities between two clusters of GO terms. A colored Multi Dimensional Scale (MDS) plot with `ViSEAGO::MDSplot` method provides a representation of distances between the clusters of GO terms. Each circle represents a cluster of GO terms and its size depends on the number of GO terms that it contains. Clusters of GO terms that are close should share a functional coherence. ```{r SS_Wang-wardD2_groups} # calculate semantic similarites between clusters of GO terms Wang_clusters_wardD2<-ViSEAGO::compute_SS_distances( Wang_clusters_wardD2, distance="BMA" ) ```
```{r SS_Wang-ward.D2_groups_mdsplot} # MDSplot ViSEAGO::MDSplot( Wang_clusters_wardD2, "GOclusters" ) ``` mds3 ## GO clusters semantic similarities heatmap A new hierarchical clustering using `ViSEAGO::GOclusters_heatmap` is performed based on the `BMA` SS distance between the clusters of GO terms, previously computed, and the `ward.D2` aggregation criteria. Clusters of GO terms are ranked in a dendrogram and colored depending on their cluster assignation. Additional illustrations are displayed: the definition of the cluster of GO terms corresponds to the first common GO term ancestor followed by the cluster label in brackets, and the heatmap of GO terms count in the corresponding cluster. The illustration is displayed with `ViSEAGO::show_heatmap` method. ```{r SS_Wang-wardD2_groups_heatmap} # GOclusters heatmap Wang_clusters_wardD2<-ViSEAGO::GOclusters_heatmap( Wang_clusters_wardD2, tree=list( distance="BMA", aggreg.method="ward.D2" ) ) ```
```{r SS_Wang-ward.D2_groups_heatmap_display} # display the heatmap ViSEAGO::show_heatmap( Wang_clusters_wardD2, "GOclusters" ) ``` BMA An history of the enrichment functional analysis is reported. ```{r SS_Wang-wardD2_groups_show} # display a summary Wang_clusters_wardD2 ```
```{r SS_Wang-wardD2_groups_display,echo=FALSE,eval=TRUE} cat( "- object class: GO_clusters - database: Bioconductor - stamp/version: 2018-Apr4 - organism id: org.Mm.eg.db - ontology: BP - input: PregnantvsLactate: elim VirginvsLactate: elim VirginvsPregnant: elim - topGO summary: PregnantvsLactate BP_PregnantvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 7699 feasible_genes: 14402 feasible_genes_significant: 7185 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_PregnantvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 198 feasible_genes: 14402 feasible_genes_significant: 7185 genes_nodeSize: 5 Nontrivial_nodes: 8155 VirginvsLactate BP_VirginvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 9583 feasible_genes: 14402 feasible_genes_significant: 8898 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_VirginvsLactate description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 151 feasible_genes: 14402 feasible_genes_significant: 8898 genes_nodeSize: 5 Nontrivial_nodes: 8180 VirginvsPregnant BP_VirginvsPregnant description: Bioconductor org.Mm.eg.db 2018-Apr4 available_genes: 15804 available_genes_significant: 7302 feasible_genes: 14402 feasible_genes_significant: 6875 genes_nodeSize: 5 nodes_number: 8189 edges_number: 18887 elim_BP_VirginvsPregnant description: Bioconductor org.Mm.eg.db 2018-Apr4 test_name: fisher p<0.01 algorithm_name: elim GO_scored: 8189 GO_significant: 232 feasible_genes: 14402 feasible_genes_significant: 6875 genes_nodeSize: 5 Nontrivial_nodes: 8143 - enrich GOs data.table: 509 GO terms of 3 conditions. PregnantvsLactate : 198 terms VirginvsLactate : 151 terms VirginvsPregnant : 232 terms - clusters distances: BMA - Heatmap: * GOterms: TRUE - GO.tree: tree.distance: Wang tree.aggreg.method: ward.D2 cut.dynamic.pamStage: TRUE cut.dynamic.pamRespectsDendro: TRUE cut.dynamic.deepSplit: 2 cut.dynamic.minClusterSize: 2 number of clusters: 56 clusters min size: 1 clusters mean size: 30 clusters max size: 56 - sample.tree: FALSE * GOclusters: TRUE - tree: distance: BMA aggreg.method: ward.D2" ) ``` # Conclusion A functional Gene Ontology (GO) enrichment analysis was performed using `r BiocStyle::Biocpkg("ViSEAGO")` package, from differentially expressed (DE) genes in luminal cells of the mouse mammary gland. It facilitates the interpretation of biological processes involved between these three comparisons of virgin, pregnant and lactating mice. It provides both a synthetic and detailed view using interactive functionalities respecting the GO graph structure and ensuring functional coherence. # Information session{-} ```{r session,eval=TRUE,echo=FALSE} sessionInfo() ``` # References{-}