---
title: "A.3 -- Statistics and Graphics"
author:
  Martin Morgan <Martin.Morgan@RoswellPark.org><br/>
date: "11 - 12 September 2017"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{A.3 -- Statistics and Graphics}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
suppressPackageStartupMessages({
    library(tidyverse)
})
```

# Exploration, univariate, and bivariate statistics and visualization

Input clean data, with `Sex` and `Year` as factors.

<!--
```{r echo=FALSE}
path <- file.path("extdata", "BRFSS-subset.csv")
```
-->
```{r ALL-choose, eval=FALSE}
path <- file.choose()    # look for BRFSS-subset.csv
```

```{r ALL-input}
stopifnot(file.exists(path))
library(tidyverse)
col_types <- cols(
    Age = col_integer(),
    Weight = col_double(),
    Sex = col_factor(c("Female", "Male")),
    Height = col_double(),
    Year = col_factor(c("1990", "2010"))
)
brfss <- read_csv(path, col_types = col_types)
brfss
```

## Univariate: `t.test()` for Weight in 1990 vs. 2010 Females

Filter the data to include females only, and use base `plot()` function
and the formula interface to visualize the relationship between
`Weight` and `Year`.

```{r brfss-female-plot}
brfss %>% filter(Sex %in% "Female") %>% plot(Weight ~ Year, data = .)
```

Use a `t.test()` to test the hypothesis that female weight is the same
in 2010 as in 1990

```{r brfss-female-t-test}
brfss %>% filter(Sex %in% "Female") %>% t.test(Weight ~ Year, data = .)
```



## Bivariate: Weight and height in 2010 Males

Filter the data to contain 2010 Males. Use `plot()` to visualize the
relationship, and `lm()` to model it.

```{r brfss-male}
male2010 <- brfss %>% filter(Year %in% "2010", Sex %in% "Male")
male2010 %>% plot( Weight ~ Height, data = .)
fit <- male2010 %>% lm( Weight ~ Height, data = .)
fit
summary(fit)
```

Multiple regression: Weight and Height, accounting for difference between years

```{r brfss-male-year-and-height}
male <- brfss %>% filter(Sex %in% "Male")
male %>% lm(Weight ~ Year + Height, data = .) %>% summary()
```

Is there an interaction between `Year` and `Height`?


```{r brfss-male-interaction}
male %>% lm(Weight ~ Year * Height, data = .) %>% summary()
```

Check out other things to do with fitted model:

- `broom::tidy()`: P-value, etc., as data.frame
- `broom::augment()`: fitted values, residuals, etc

    ```{r brfss-male-augment, warning=FALSE}
    library(broom)
    male %>% lm(Weight ~ Year + Height, data = .) %>% 
        augment() %>% as.tibble()
    ```

## Visualization: [ggplot2][]

*gg*plot: "Grammar of Graphics"

- data: `ggplot2()`
- *aes*thetics: `aes()`, 'x' and 'y' values, point colors, etc.
- *geom*metric summaries, layered
    - `geom_point()`: points
    - `geom_smooth()`: fitted line
    - `geom_*`: ...
- *facet* plots (e.g., `facet_grid()`) to create 'panels' based on
  factor levels, with shared axes.

Create a plot with data points

```{r male-geom_point, warning = FALSE}
ggplot(male, aes(x=Height, y = Weight)) + geom_point()
```

Capture the base plot and points, and explore different smoothed
relationships, e.g., linear model, non-parameteric smoother

```{r male-ggplot, warning = FALSE}
plt <- ggplot(male, aes(x=Height, y = Weight)) + geom_point()
plt + geom_smooth(method = "lm")
plt + geom_smooth()       # default: generalized additive model
```

Use an `aes()`thetic to color smoothed lines based on `Year`, or
`facet_grid()` to separate years.

```{r male-facet, warning = FALSE}
ggplot(male, aes(x = Weight)) + geom_density(aes(fill = Year), alpha = .2)
plt + geom_smooth(method = "lm", aes(color = Year))
plt + facet_grid( ~ Year ) + geom_smooth(method = "lm")
```

[ggplot2]: https://cran.r-project.org/package=ggplot2

# Multivariate analysis

This is a classic microarray experiment. Microarrays
consist of 'probesets' that interogate genes for their level of
expression. In the experiment we're looking at, there are 12625
probesets measured on each of the 128 samples. The raw expression
levels estimated by microarray assays require considerable
pre-processing, the data we'll work with has been pre-processed.

## Input and setup

Start by finding the expression data file on disk.

<!--
```{r echo=FALSE}
path <- file.path("extdata", "ALL-expression.csv")
stopifnot(file.exists(path))
```
-->

```{r ALL-choose-again, eval=FALSE}
path <- file.choose()          # look for ALL-expression.csv
stopifnot(file.exists(path))
```

The data is stored in 'comma-separate value' format, with each
probeset occupying a line, and the expression value for each sample in
that probeset separated by a comma. Input the data using
`read_csv()`. The sample identifiers are present in the first column.

```{r ALL-input-exprs}
exprs <- read_csv(path)
```

<!--
```{r echo=FALSE}
path <- file.path("extdata", "ALL-phenoData.csv")
stopifnot(file.exists(path))
```
-->

We'll also input the data that describes each column

```{r ALL-phenoData.csv-clustering-student, eval=FALSE}
path <- file.choose()         # look for ALL-phenoData.csv
stopifnot(file.exists(path))
```

```{r}
pdata <- read_csv(path)
pdata
```

## Cleaning and Exploration

The expression data is presented in what is sometimes called 'wide'
format; a different format is 'tall', where Sample and Gene group the
single observation Expression. Use `tidyr::gather()` to gather the
columns of the wide format into two columns representing the tall
format, excluding the `Gene` column from the gather operation.

```{r ALL-gather}
exprs <- exprs %>% gather("Sample", "Expression", -Gene)
```

Explore the data a little, e.g., a summary and histogram of the
expression values, and a histogram of average expression values of
each gene.

```{r}
exprs %>% select(Expression) %>% summary()
exprs $ Expression %>% hist()
exprs %>% group_by(Gene) %>% 
    summarize(AveExprs = mean(Expression)) %$% AveExprs %>% 
    hist(breaks=50)
```

For subsequent analysis, we also want to simplify the 'B or T' cell
type classification

```{r B_or_T}
pdata <- pdata %>% mutate(B_or_T = factor(substr(BT, 1, 1)))
```

## Unsupervised machine learning -- multi-dimensional scaling

We'd like to reduce high-dimensional data to lower dimension for
visualization. To do so, we need the `dist()`ance between
samples. From `?dist`, the input can be a data.frame where rows
represent `Sample` and columns represent `Expression` values. Use
`spread()` to create appropriate data from `exprs`, and pipe the
result to `dist()`ance.x

```{r spread}
input <- exprs %>% spread(Gene, Expression)
samples <- input $ Sample
input <- input %>% select(-Sample) %>% as.matrix
rownames(input) <- samples
```

Calculate distance between samples, and use that for MDS scaling

```{r cmdscale}
mds <- dist(input) %>% cmdscale()
```

The result is a matrix; make it 'tidy' by coercing to a tibble; add
the Sample identifiers as a distinct column.

```{r mds-to-tibble}
mds <- mds %>% as.tibble() %>% mutate(Sample = rownames(mds))
```

Visualize the result

```{r}
ggplot(mds, aes(x=V1, y = V2)) + geom_point()
```

With the 'eye of faith', it seems like there are two groups of
points. To explore this, join the MDS scaling with the phenotypic data

```{r join}
joined <- inner_join(mds, pdata)
```

and use the `B_or_T` column as an aesthetic to color points

```{r mds-color}
ggplot(joined, aes(x = V1, y = V2)) + geom_point(aes(color = B_or_T))
```