An Introduction to the ImmuneSpaceR Package

Renan Sauteraud

2020-04-27


This package provides a thin wrapper around Rlabkey and connects to the ImmuneSpace database, making it easier to fetch datasets, including gene expression data, HAI, and so forth, from specific studies.

1 Configuration

In order to connect to ImmuneSpace, you will need a netrc file in your home directory.

Set up your netrc file now!

If you are not familiar with the command line, the interactive_netrc() function can set up your netrc file for you. See the interactive_netrc vignette.

Alternatively, create the netrc file manually on the computer running R.

The following three lines must be included in the .netrc or _netrc file, with the fields separated by white space (spaces, tabs, or newlines) or commas.

machine www.immunespace.org
login your_email@example.com
password superSecretPassword

Multiple such blocks can exist in one file. Make sure the machine name in the netrc file includes the “www” prefix, because that is how the package connects to ImmuneSpace by default; a mismatch will cause connection failures.

See the official LabKey documentation for more information.
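For readers who prefer to create the file from within R, the following is a minimal sketch using only base R. The helper name write_netrc and its arguments are hypothetical (they are not part of ImmuneSpaceR), and the login and password are placeholders. The sketch writes to a temporary path so that nothing in your home directory is touched; for real use the content belongs in the .netrc or _netrc file in your home directory.

```r
# Hypothetical helper (not part of ImmuneSpaceR): write a one-entry netrc file.
write_netrc <- function(login, password,
                        machine = "www.immunespace.org",
                        path = tempfile("netrc_")) {
  lines <- c(
    paste("machine", machine),
    paste("login", login),
    paste("password", password)
  )
  writeLines(lines, path)
  # Restrict permissions to the owner; this has little effect on Windows.
  Sys.chmod(path, mode = "0600")
  invisible(path)
}

# Placeholder credentials, written to a temporary file for inspection.
netrc_path <- write_netrc("your_email@example.com", "superSecretPassword")
readLines(netrc_path)
```

Once the content looks right, the same three lines can be copied into the real netrc file.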

2 Instantiate a connection

We’ll be looking at study SDY269. If you want to use a different study, change that string. The connections have state, so you can instantiate multiple connections to different studies simultaneously.

library(ImmuneSpaceR)
sdy269 <- CreateConnection(study = "SDY269")
## Warning in matrix(data = unlist(curfld[resultCols]), nrow = 1, ncol =
## length(resultCols), : data length [4] is not a sub-multiple or multiple of the
## number of columns [7]
sdy269
## <ImmuneSpaceConnection>
##   Study: SDY269
##   URL: https://www.immunespace.org/Studies/SDY269
##   User: unknown_user at not_a_domain.com
##   9 Available Datasets
##     - cohort_membership
##     - demographics
##     - elisa
##     - elispot
##     - fcs_analyzed_result
##     - fcs_sample_files
##     - gene_expression_files
##     - hai
##     - pcr
##   2 Available Expression Matrices

The call to CreateConnection instantiates the connection. Printing the object shows the URL it is connected to, the study, and the available datasets and gene expression matrices.

Note that when a script runs on ImmuneSpace itself, variables set in the global environment automatically indicate which study to use, so the study argument can be omitted.

3 Fetching datasets

We can grab any of the datasets listed in the connection.

sdy269$getDataset("hai")
##      participant_id age_reported gender  race          cohort
##   1:  SUB112829.269           26   Male White LAIV group 2008
##   2:  SUB112829.269           26   Male White LAIV group 2008
##   3:  SUB112829.269           26   Male White LAIV group 2008
##   4:  SUB112829.269           26   Male White LAIV group 2008
##   5:  SUB112829.269           26   Male White LAIV group 2008
##  ---                                                         
## 332:  SUB112888.269           34 Female White  TIV Group 2008
## 333:  SUB112888.269           34 Female White  TIV Group 2008
## 334:  SUB112888.269           34 Female White  TIV Group 2008
## 335:  SUB112888.269           34 Female White  TIV Group 2008
## 336:  SUB112888.269           34 Female White  TIV Group 2008
##      study_time_collected study_time_collected_unit                  virus
##   1:                    0                      Days A/South Dakota/06/2007
##   2:                    0                      Days     A/Uruguay/716/2007
##   3:                    0                      Days       B/Florida/4/2006
##   4:                   28                      Days A/South Dakota/06/2007
##   5:                   28                      Days     A/Uruguay/716/2007
##  ---                                                                      
## 332:                    0                      Days     A/Uruguay/716/2007
## 333:                    0                      Days     B/Brisbane/03/2007
## 334:                   28                      Days     A/Brisbane/59/2007
## 335:                   28                      Days     A/Uruguay/716/2007
## 336:                   28                      Days     B/Brisbane/03/2007
##      value_preferred
##   1:              40
##   2:              40
##   3:              20
##   4:              40
##   5:              40
##  ---                
## 332:               5
## 333:             320
## 334:              80
## 335:              40
## 336:              40

The sdy269 object is an instance of an R6 class, so it behaves like a true object. Methods (like getDataset) are members of the object, hence the $ syntax used to call them.

The first time you retrieve a dataset, the connection contacts the database. The data is then cached in the object, so the next call to getDataset for the same dataset returns the cached local copy, which is much faster.
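To make the object-plus-cache idea concrete, here is a toy R6 class that mimics the behaviour described above. It is an illustration only and not the ImmuneSpaceR implementation: the class name ToyConnection, the fetch_remote method, and the n_fetches counter are all hypothetical, and the "remote fetch" is simulated locally.

```r
library(R6)

# Toy class for illustration only; NOT the ImmuneSpaceR implementation.
ToyConnection <- R6Class("ToyConnection",
  public = list(
    cache = NULL,
    n_fetches = 0,
    initialize = function() {
      self$cache <- list()
    },
    # Stand-in for a slow database query.
    fetch_remote = function(name) {
      self$n_fetches <- self$n_fetches + 1
      data.frame(dataset = name, fetched_at = format(Sys.time()))
    },
    # Return the cached copy if present, otherwise fetch and store it.
    getDataset = function(name) {
      if (is.null(self$cache[[name]])) {
        self$cache[[name]] <- self$fetch_remote(name)
      }
      self$cache[[name]]
    }
  )
)

toy <- ToyConnection$new()
first <- toy$getDataset("hai")   # triggers the simulated remote fetch
second <- toy$getDataset("hai")  # served from the in-object cache
toy$n_fetches                    # 1: only the first call reached the backend
```

Because the cache lives inside the object, two separate connection objects would each fetch their own copy.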

To get only a subset of the data and speed up the download, filters can be passed to getDataset. The filters are created using the makeFilter function of the Rlabkey package.

library(Rlabkey)
myFilter <- makeFilter(c("gender", "EQUAL", "Female"))
hai <- sdy269$getDataset("hai", colFilter = myFilter)

See ?Rlabkey::makeFilter for more information on the syntax.
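Several conditions can be passed to makeFilter at once, in which case they are combined with a logical AND. The sketch below reuses column names visible in the hai output above; the variable names femaleDay28 and hai_f28 are made up for this example, and the download step only runs if the sdy269 connection from the previous section exists (it needs network access and valid credentials).

```r
library(Rlabkey)

# Each condition is a (column, operator, value) triple.
femaleDay28 <- makeFilter(
  c("gender", "EQUAL", "Female"),
  c("study_time_collected", "EQUAL", "28")
)

# Only attempt the download when the connection object is available.
if (exists("sdy269")) {
  hai_f28 <- sdy269$getDataset("hai", colFilter = femaleDay28)
}
```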

For more information about getDataset’s options, refer to the dedicated vignette.

4 Fetching expression matrices

We can also grab a gene expression matrix:

sdy269$getGEMatrix("SDY269_PBMC_LAIV_Geo")
## Downloading matrix..
## Constructing ExpressionSet
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 16442 features, 83 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: BS586100 BS586156 ... BS586239 (83 total)
##   varLabels: participant_id study_time_collected ...
##     exposure_process_preferred (8 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: DDR1 RFC2 ... NUS1P3 (16442 total)
##   fvarLabels: FeatureId gene_symbol
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

The object contacts the database and downloads the matrix file. This is stored and cached locally as a data.table. The next time you access it, it will be much faster since it won’t need to contact the database again.
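As the printed output indicates, getGEMatrix returns an ExpressionSet. A minimal sketch of inspecting one with the standard Biobase accessors (exprs, pData, featureNames) follows. The helper name describe_eset is hypothetical, and the call at the end assumes the sdy269 connection created earlier and network access.

```r
library(Biobase)

# Hypothetical helper (not part of ImmuneSpaceR): summarise an ExpressionSet.
describe_eset <- function(es) {
  list(
    dims = dim(exprs(es)),                 # features x samples
    pheno_columns = colnames(pData(es)),   # sample annotation columns
    first_features = head(featureNames(es))
  )
}

# Only runs if the `sdy269` connection from the earlier section exists.
if (exists("sdy269")) {
  es <- sdy269$getGEMatrix("SDY269_PBMC_LAIV_Geo")
  str(describe_eset(es))
}
```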

It is also possible to call this function using multiple matrix names. In this case, all the matrices are downloaded and combined into a single ExpressionSet.

sdy269$getGEMatrix(c("SDY269_PBMC_TIV_Geo", "SDY269_PBMC_LAIV_Geo"))
## Downloading matrix..
## Constructing ExpressionSet
## Returning SDY269_PBMC_LAIV_Geo_summary_latest_eset from cache
## Combining ExpressionSets
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 16442 features, 163 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: BS586128 BS586240 ... BS586239 (163 total)
##   varLabels: participant_id study_time_collected ...
##     exposure_process_preferred (8 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 16442 (16442 total)
##   fvarLabels: FeatureId gene_symbol
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Finally, setting outputType = "summary" lets you download the matrix with gene symbols in place of probe IDs.

gs <- sdy269$getGEMatrix("SDY269_PBMC_TIV_Geo", outputType = "summary", annotation = "latest")
## Returning SDY269_PBMC_TIV_Geo_summary_latest_eset from cache

If the connection was created with verbose = TRUE, some methods will display additional information, such as the valid dataset names.

5 Plotting

A plot of a dataset can be generated using the plot method, which automatically chooses the type of plot depending on the selected dataset.

sdy269$plot("hai")

sdy269$plot("elisa")

However, the type argument can be used to manually select from “boxplot”, “heatmap”, “violin” and “line”.
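For instance, a boxplot can be requested explicitly. This is a small sketch that assumes the sdy269 connection created earlier and network access; the variable plot_types is introduced only for this example and simply lists the values named above.

```r
# Plot types named in the text above.
plot_types <- c("boxplot", "heatmap", "violin", "line")

# Assumes the `sdy269` connection created earlier; requires network access.
if (exists("sdy269")) {
  sdy269$plot("hai", type = plot_types[1])  # force a boxplot
}
```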

6 Cross study connections

To fetch data from multiple studies, simply create a connection at the project level.

con <- CreateConnection("")
## Warning in matrix(data = unlist(curfld[resultCols]), nrow = 1, ncol =
## length(resultCols), : data length [4] is not a sub-multiple or multiple of the
## number of columns [7]

This will instantiate a connection at the Studies level. Most functions work on cross-study connections just as they do on single studies.

You can get a list of datasets and gene expression matrices available across all studies.

con
## <ImmuneSpaceConnection>
##   Study: Studies
##   URL: https://www.immunespace.org/Studies/
##   User: unknown_user at not_a_domain.com
##   13 Available Datasets
##     - cohort_membership
##     - demographics
##     - elisa
##     - elispot
##     - fcs_analyzed_result
##     - fcs_control_files
##     - fcs_sample_files
##     - gene_expression_files
##     - hai
##     - hla_typing
##     - mbaa
##     - neut_ab_titer
##     - pcr
##   110 Available Expression Matrices

In cross-study connections, getDataset and getGEMatrix will combine the requested datasets or expression matrices. See the dedicated vignettes for more information.

Likewise, plot will visualize across studies. Note that in most cases the datasets will have too many cohorts or subjects, making it necessary to filter the data. The colFilter argument can be used here, as described in the getDataset section.

plotFilter <- makeFilter(
  c("cohort", "IN", "TIV 2010;TIV Group 2008"),
  c("study_time_collected", "EQUALS", "7")
)
con$plot("elispot", filter = plotFilter)

The figure above shows the ELISPOT results for two different years of TIV vaccine cohorts from two different studies.

7 Session info

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Rlabkey_2.4.0       jsonlite_1.6.1      httr_1.4.1         
## [4] ImmuneSpaceR_1.16.0 rmarkdown_2.1       knitr_1.28         
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.48.0        viridis_0.5.1         tidyr_1.0.2          
##  [4] viridisLite_0.3.0     foreach_1.5.0         gtools_3.8.2         
##  [7] RcppParallel_5.0.0    assertthat_0.2.1      stats4_4.0.0         
## [10] latticeExtra_0.6-29   flowWorkspace_4.0.0   yaml_2.2.1           
## [13] pillar_1.4.3          lattice_0.20-41       glue_1.4.0           
## [16] digest_0.6.25         RColorBrewer_1.1-2    colorspace_1.4-1     
## [19] preprocessCore_1.50.0 htmltools_0.4.0       XML_3.99-0.3         
## [22] pkgconfig_2.0.3       pheatmap_1.0.12       zlibbioc_1.34.0      
## [25] purrr_0.3.4           flowCore_2.0.0        scales_1.1.0         
## [28] webshot_0.5.2         gdata_2.18.0          jpeg_0.1-8.1         
## [31] tibble_3.0.1          farver_2.0.3          ggplot2_3.3.0        
## [34] ellipsis_0.3.0        BiocGenerics_0.34.0   lazyeval_0.2.2       
## [37] magrittr_1.5          crayon_1.3.4          heatmaply_1.1.0      
## [40] evaluate_0.14         MASS_7.3-51.6         gplots_3.0.3         
## [43] graph_1.66.0          registry_0.5-1        tools_4.0.0          
## [46] data.table_1.12.8     ncdfFlow_2.34.0       lifecycle_0.2.0      
## [49] matrixStats_0.56.0    stringr_1.4.0         plotly_4.9.2.1       
## [52] munsell_0.5.0         cluster_2.1.0         compiler_4.0.0       
## [55] caTools_1.18.0        rlang_0.4.5           grid_4.0.0           
## [58] iterators_1.0.12      htmlwidgets_1.5.1     labeling_0.3         
## [61] bitops_1.0-6          codetools_0.2-16      cytolib_2.0.0        
## [64] gtable_0.3.0          curl_4.3              TSP_1.1-10           
## [67] R6_2.4.1              RProtoBufLib_2.0.0    seriation_1.2-8      
## [70] gridExtra_2.3         dplyr_0.8.5           KernSmooth_2.23-17   
## [73] dendextend_1.13.4     Rgraphviz_2.32.0      stringi_1.4.6        
## [76] parallel_4.0.0        Rcpp_1.0.4.6          vctrs_0.2.4          
## [79] png_0.1-7             gclus_1.3.2           tidyselect_1.0.0     
## [82] xfun_0.13