\documentclass[11]{article}
%\usepackage{geometry}
\usepackage{graphicx}
\usepackage[pdftex,
bookmarks,
bookmarksopen,
pdfauthor={Mathieu Emily},
pdftitle={GenePair}]
{hyperref}

\oddsidemargin -2 mm
\evensidemargin -2 mm
\textwidth 170 mm
\topmargin -2 cm
\textheight 23.5 cm
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}

\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{amssymb}

\SweaveOpts{echo=TRUE, eval=FALSE,concordance=TRUE}

\let\proglang=\textsf
\newcommand{\E}{\mathsf{E}}
\newcommand{\VAR}{\mathsf{VAR}}
\newcommand{\COV}{\mathsf{COV}}
\newcommand{\Prob}{\mathsf{P}}
%\newcommand{\Plaintitle}[1]{\def\@Plaintitle{#1}}

\begin{document}
\setkeys{Gin}{width=1.0\textwidth}

%\VignetteIndexEntry{Pairwise interaction tests} 
%\VignettePackage{GeneGeneInteR}

\title{GeneGeneInteR vignette \\
Statistical analysis of the interaction between a pair of genes.}

\author{Mathieu Emily and Magalie Hou\'ee-Bigot}

%\date{}



\maketitle



%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%
%\section[Methods]{Statistical methods for testing one pair of gene}
%\label{Sec:GenePair}
%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%

\noindent
This vignette presents the technical details of the statistical procedure implemented in the package. Readers that would like to have a global overview of the main functions and tools proposed in the package are encouraged to read the vignette {\bf VignetteGeneGeneInteR\_Introduction}. 

\section{Introduction}

\noindent
In this vignette we consider statistical procedures to test for the interaction between two genes in susceptibility with a binary phenotype, typically a case/control disease status. Let $Y \in \{0,1\}$ be the phenotype, where $Y=0$ stands for a control and $Y=1$ a case, and $X_1$ and $X_2$ be the two genes for which the interaction is tested. 

\vskip1em
\noindent
Let consider a sample of $n$ individuals with $n_c$ controls and $n_d$ cases ($n_c+n_d=n$) and $\mathbf{Y}=[y_1,\dots,y_n]^{\prime}$ the vector of the observed binary phenotypes. Each gene is a collection of respectively $m_1$ and $m_2$ SNPs. The observed genotypes for gene $X_1$ can be represented by a $n \times m_1$ matrix:~$\mathbf{X_1}=[x^1_{ij}]_{i \in 1\dots n; j \in 1\dots m_1}$ where $x^1_{ij} \in \{0;1;2\}$ is the number of copies of the minor allele for SNP $j$ carried by individual $i$. A similar representation is used for gene $X_2$ where $\mathbf{X_2}$ is a $n \times m_2$ matrix. Let us further introduce $\mathbf{X_1^c}$ and $\mathbf{X_2^c}$ the matrices of observed genotypes among controls for gene 1 and 2 and $\mathbf{X_1^d}$ and $\mathbf{X_2^d}$ among cases for both genes. Thus $\mathbf{X_1^c}$ is a $n_c \times m_1$ matrix, $\mathbf{X_1^d}$ a $n_d \times m_1$ matrix, $\mathbf{X_2^c}$ a $n_c \times m_2$ matrix and $\mathbf{X_2^d}$ a $n_d \times m_2$ matrix. A general setup of the observed values can be presented as follows:

$$
\mathbf{Y}
=
\begin{bmatrix}
y_1 \\
\vdots \\
y_{n_c} \\
y_{n_c+1} \\
\vdots \\
y_{n_c+n_d} \\
\end{bmatrix}
\mathbf{X_1}=
\begin{bmatrix}
\\
\mathbf{X_1^c} \\
 \\
 \\
\mathbf{X_1^d} \\
 \\
\end{bmatrix}
=
\begin{bmatrix}
x^1_{11} & \dots & x^1_{1m_1} \\
\vdots & \ddots & \vdots \\
x^1_{n_c 1} & \dots & x^1_{n_c m_1} \\
x^1_{(n_c+1)1} & \dots & x^1_{(n_c+1)m_1} \\
\vdots & \ddots & \vdots \\
x^1_{(n_c+n_d) 1} & \dots & x^1_{(n_c+n_d) m_1} \\
\end{bmatrix}
%$$
%$$
\mathbf{X_2}=
\begin{bmatrix}
\\
\mathbf{X_2^c} \\
 \\
 \\
\mathbf{X_2^d} \\
 \\
\end{bmatrix}
=
\begin{bmatrix}
x^2_{11} & \dots & x^2_{1m_2} \\
\vdots & \ddots & \vdots \\
x^2_{n_c 1} & \dots & x^2_{n_c m_1} \\
x^2_{(n_c+1)1} & \dots & x^2_{(n_c+1)m_1} \\
\vdots & \ddots & \vdots \\
x^2_{(n_c+n_d) 1} & \dots & x^2_{(n_c+n_d) m_1} \\
\end{bmatrix}
$$

\vskip1em
\noindent
In our package we proposed 10 methods for testing interaction at the gene level. These 10 methods are all based on three main parameters: \code{Y}, a \code{numeric} or \code{factor} vector with exactly two distinct values, \code{G1} and \code{G2} two \code{SnpMatrix} objects as proposed by the \proglang{R} \pkg{Bioconductor} package \pkg{snpStats}. Our implementation is illustrated by the dataset \code{gene.pair} provided with the \pkg{GeneGeneInteR} package and summarized in the following command lines:

\vskip0.5em
%\code{R> library(GeneGeneInteR)}
<<>>=
library("GeneGeneInteR")
data("gene.pair")
head(gene.pair$Y)
@

\begin{Soutput}
[1] HealthControl HealthControl HealthControl HealthControl HealthControl 
Levels: HealthControl RheumatoidArthritis
\end{Soutput}

%\code{[1] HealthControl HealthControl HealthControl HealthControl HealthControl HealthControl}

%\code{Levels: HealthControl RheumatoidArthritis}

<<>>==
gene.pair$G1
@

\begin{Soutput}
A SnpMatrix with  453 rows and  8 columns
Row names:  Id1 ... Id453
Col names:  rs1491710 ... rs2298849
\end{Soutput}

%\code{A SnpMatrix with  453 rows and  8 columns}
%
%\code{Row names:  Id1 ... Id453}
%
%\code{Col names:  rs1491710 ... rs2298849}

<<>>==
gene.pair$G2
@

\begin{Soutput}
A SnpMatrix with  453 rows and  4 columns
Row names:  Id1 ... Id453
Col names:  rs2057094 ... rs1005753
\end{Soutput}

%\code{A SnpMatrix with  453 rows and  4 columns}
%
%\code{Row names:  Id1 ... Id453}
%
%\code{Col names:  rs2057094 ... rs1005753}

\vskip1em
\noindent
The 10 methods implemented in our package can be divided into two main families: 6 methods based on a multidimensional modeling of the interaction at the gene level and 4 methods that combine interaction tests at the SNP level into a single test at the gene level.

%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%
\section{Multidimensional methods at the gene level}
%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%

In the \pkg{GeneGeneInteR} package, 6 multidimensional methods have been implemented that are based on:
\begin{itemize}
\item Principal Components Analysis - \code{PCA.test} function,
\item Canonical Correlation Analysis - \code{CCA.test} function,
\item  Kernel Canonical Correlation Analysis - \code{KCCA.test} function,
\item Composite Linkage Disequilibrium - \code{CLD.test} function,
\item Partial Least Square Path Modeling - \code{PLSPM.test} function,
\item Gene-Based Information Gain Method - \code{GBIGM.test} function.
\end{itemize}

In the remainder of this section, technical and practical details are given regarding these 6 methods.

%%%%%%%%%%%%%%%
\subsection{PCA-based}
%%%%%%%%%%%%%%%

In the PCA-based method, a likelihood ratio test is performed to compare the model $\mathcal{M}_{Inter}$ to the model $\mathcal{M}_{No}$, where $\mathcal{M}_{Inter}$ is defined by:
$$
\mathrm{logit}\left(\mathbb{P}\left[Y=1|PC_{X_1}^1\dots PC_{X_1}^{n_1},PC_{X_2}^1\dots PC_{X_2}^{n_2} \right]\right)=\beta_0+\sum_{i=1}^{n_1} PC_{X_1}^i+\sum_{j=1}^{n_2} PC_{X_2}^j+\sum_{i=1}^{n_1}\sum_{i=2}^{n_2}PC_{X_1}^iPC_{X_2}^j
$$
and $\mathcal{M}_{No}$ by:
$$
\mathrm{logit}\left(\mathbb{P}\left[Y=1|PC_{X_1}^1\dots PC_{X_1}^{n_1},PC_{X_2}^1\dots PC_{X_2}^{n_2} \right]\right)=\beta_0+\sum_{i=1}^{n_1} PC_{X_1}^i+\sum_{j=1}^{n_2} PC_{X_2}^j
$$

\vskip1em
\noindent
In models $\mathcal{M}_{Inter}$ and $\mathcal{M}_{No}$, $PC_{X_1}^i$ and  $PC_{X_2}^j$ are the $i^{th}$ principal component of $\mathbf{X_1}$ and the $j^{th}$ principal component of $\mathbf{X_2}$. The number of principal components, $n_1$ and $n_2$, kept in the interaction test is determined by the percentage of inertia retrieved by the PCA. Such a percentage is defined by the user and corresponds to the \code{threshold} parameter.

\vskip1em
\noindent
In our package, two distinct Principal Component decomposition are provided by the functions \code{PCA.test} via the argument \code{method}. With \code{method="Std"}, dataset is standardized using variables' standard deviation while with \code{method="GenFreq"}, dataset is standardized using standard deviation under Hardy-Weinberg equilibrium, as proposed in the \pkg{snpStats} package.


%In the PCA-based method, introduced by \cite{Li:09}, a Principal Component Analysis is first performed on each gene, {\it i.e.} on the two \code{SnpMatrix} objects \code{G1} and \code{G2}. Principal components are retrieved to describe each dataset with user-set inertia percentage. Such a percentage corresponds to the \code{threshold} parameter used in the \code{PCA.Std} and \code{PCA.GeneFreq} functions implemented if the \pkg{GeneGeneInteR} package. In \code{PCA.Std}, dataset is standardized using variables' standard deviation while in \code{PCA.GenFreq}, dataset is standardized using standard deviation under Hardy-Weinberg equilibrium, as proposed in the \pkg{snpStats} package. Principal components are then used in a logit model fitting process in which interaction is tested. 

\vskip1em
\noindent
When the percentage of inertia asked by the user is high, the number of PCs can be important and fitting logistic models $\mathcal{M}_{Inter}$ and $\mathcal{M}_{No}$ is likely to fail. In that case, the number of PCs in each gene is iteratively reduced until convergence of the \code{glm} function for fitting models $\mathcal{M}_{Inter}$ and $\mathcal{M}_{No}$.

\newpage
%\vskip1em
\noindent
The following lines provide an example of the \code{PCA.test} function:

<<>>==
PCA.test(Y=gene.pair$Y, G1=gene.pair$G1, G2=gene.pair$G2,threshold=0.7,
method="GenFreq")
@

\begin{Soutput}

	Gene-based interaction based on Principal Component Analysis - GenFreq

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
Deviance = 8.2157, df = 6.0, threshold = 0.7, p-value = 0.2227
alternative hypothesis: true deviance is greater than 0
sample estimates:
Deviance without interaction    Deviance with interaction 
                    615.2977                     607.0821 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Principal Component Analysis}
%
%\code{Deviance = 8.215662, df = 6, p-value = 0.2227253}

<<>>==
PCA.test(Y=gene.pair$Y, G1=gene.pair$G1, G2=gene.pair$G2,threshold=0.7, 
method="Std")
@

\begin{Soutput}

	Gene-based interaction based on Principal Component Analysis - Std

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
Deviance = 8.5074, df = 6.0, threshold = 0.7, p-value = 0.2032
alternative hypothesis: true deviance is greater than 0
sample estimates:
Deviance without interaction    Deviance with interaction 
                    615.0911                     606.5837 

\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Principal Component Analysis}
%
%\code{Deviance = 8.507404, df = 6, p-value = 0.2032347}

%%%%%%%%%%%%%%%
\subsection{Canonical Correlation Analysis (CCA)}
%%%%%%%%%%%%%%%

The CCA test is based on a Wald-type statistic defined as follows (see \cite{Peng:10} for details):

$$
U_{CCA}=\frac{z_d-z_c}{\sqrt{\mathbb{V}(z_d)+\mathbb{V}(z_c)}}
$$

\noindent
where $z_d=\frac{1}{2}\left(\log(1+r_d)-\log(1-r_d) \right)$ and $z_c=\frac{1}{2}\left(\log(1+r_c)-\log(1-r_c) \right)$ with $r_d$ the maximum canonical correlation coefficient between $\mathbf{X_1^d}$ and $\mathbf{X_2^d}$ and $r_c$ the maximum canonical correlation coefficient between $\mathbf{X_1^c}$ and $\mathbf{X_2^c}$ computed for controls ($Y=0$). As suggested by \cite{Peng:10}, the sampled variances $\mathbb{V}(z_d)$ and $\mathbb{V}(z_c)$ were evaluated by applying a bootstrapping method. The number of bootstrap sample used to estimate $\mathbb{V}(z_d)$ and $\mathbb{V}(z_c)$ is determined by the \code{n.boot} argument.
P-value is then obtained by noting that under the null hypothesis $U_{CCA} \sim \mathcal{N}(0,1)$. 

\vskip1em
\noindent
CCA based gene-gene interaction is implemented in the \code{CCA.test} function and mainly depends on the \code{cancor} function from the \pkg{Stats} package \cite{R:16}.

\vskip0.5em
\code{R> set.seed(1234)}

<<>>==
CCA.test(Y=gene.pair$Y, G1=gene.pair$G1, G2=gene.pair$G2,n.boot=500)
@

\begin{Soutput}

	Gene-based interaction based on Canonical Correspondance Analysis

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
CCU = 0.60304, n.boot = 500, p-value = 0.5465
alternative hypothesis: true CCU is not equal to 0
sample estimates:
       z0        z1 
0.2940799 0.2414700 

\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null\hskip2cm 	 Canonical Correlation Analysis }
%
%\code{CCU = 0.6030414, p-value = 0.5464811}

%%%%%%%%%%%%%%%
\subsection{Kernel Canonical Correlation Analysis (KCCA)}
%%%%%%%%%%%%%%%

The KCCA based test provides a generalization of CCA test to detect non-linear co-association between $\mathbf{X_1}$ and $\mathbf{X_2}$ \cite{Yuan:12,Larson:13} and is based on the following Wald-type statistic:
$$
U_{KCCA}=\frac{kz_d-kz_c}{\sqrt{\mathbb{V}(kz_d)+\mathbb{V}(kz_c)}}
$$


\noindent
where $kz_d=\frac{1}{2}\left(\log(1+kr_d)-\log(1-kr_d) \right)$ and $kz_c=\frac{1}{2}\left(\log(1+kr_c)-\log(1-kr_c) \right)$ with $kr_d$ the maximum kernel canonical correlation coefficient between $\mathbf{X_1^d}$ and $\mathbf{X_2^d}$ and $kr_c$ the maximum kernel canonical correlation coefficient between $\mathbf{X_1^c}$ and $\mathbf{X_2^c}$. 

\vskip1em
\noindent
Similar to the CCA test, $\mathbb{V}(kz_d)$ and $\mathbb{V}(kz_c)$ are estimated using bootstrap techniques \cite{Yuan:12,Larson:13} and the p-value is obtained using the standard gaussian distribution of $U_{KCCA}$ under the null hypothesis. Since the performance of kernel methods strongly relates to the choice of kernel functions, the default is the Radial Basis kernel Function (RBF) owing to its flexibility in parameter specification. However, other kernel functions, such as linear, polynomial or spline kernels, can be used. Thus, in addition to the three arguments \code{Y}, \code{G1} and \code{G2}, our implementation of the KCCA test proposes two optional arguments: \code{n.boot} that determines the number of bootstrap samples and \code{kernel} that provides the kernel function to be used. This \code{kernel} parameter is character string matching one of the kernel name provided by the \pkg{kernlab} package \cite{Karatzoglou:04} such as "rbfdot", "polydot", "tanhdot", "vanilladot", "laplacedot", "besseldot", "anovadot", "splinedot". Specific arguments, \code{sigma}, \code{degree}, \code{scale}, \code{offset}and \code{order}, can also be passed to the \code{kcca.test} function in order to parameterized the kernel used in the analysis.

\vskip1em
\noindent
KCCA based gene-gene interaction test is implemented in the \code{KCCA.test} function and mainly depends on the \code{kcca} function from the \pkg{kernlab} package \cite{Karatzoglou:04}.

<<>>==
set.seed(1234)
KCCA.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,
kernel="rbfdot",sigma = 0.05,n.boot=500)
@

\begin{Soutput}

	Gene-based interaction based on Kernel Canonical Correspondance
	Analysis

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
KCCU = 1.4055, n.boot = 500, p-value = 0.1599
alternative hypothesis: true KCCU is not equal to 0
sample estimates:
       z0        z1 
 3.717346 -3.759154 

\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Kernel Canonical Correlation Analysis}
%
%\code{KCCU = 1.407369, p-value = 0.1593179
%}

<<>>==
set.seed(1234)
KCCA.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,
kernel="polydot",degree = 1, scale = 1, offset = 1)
@

\begin{Soutput}

	Gene-based interaction based on Kernel Canonical Correspondance
	Analysis

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
KCCU = 1.4106, n.boot = 100, p-value = 0.1584
alternative hypothesis: true KCCU is not equal to 0
sample estimates:
       z0        z1 
 4.161048 -4.251702 

\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Kernel Canonical Correlation Analysis}
%
%\code{KCCU = -1.413508, p-value = 0.1575063}

%%%%%%%%%%%%%%%
\subsection{Partial Least Square Path Modeling (PLSPM)}
%%%%%%%%%%%%%%%

The PLSPM testing has been introduced by \cite{Zhang:13} and is based on the Wald-like statistic:
$$
U_{PLSPM}=\frac{\beta_d-\beta_c}{\sqrt{\mathbb{V}(\beta_d-\beta_c)}}
$$

\noindent
where $\beta_d$ (resp. $\beta_c$) is the path coefficient between $\mathbf{X_1^d}$ and $\mathbf{X_2^d}$ (resp. $\mathbf{X_1^c}$ and $\mathbf{X_2^c}$). As quoted by \cite{Zhang:13}, the distribution of $U_{PLSPM}$ is unknown and significance can be tested with bootstrapping method.

\vskip1em
\noindent
PLSPM based gene-gene interaction test is implemented in the \code{PLSPM.test} function and mainly depends on the \code{plspm} function from the \pkg{plspm} package \cite{Sanchez:15}.

<<>>==
set.seed(1234)
PLSPM.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,n.perm=1000)
@

\begin{Soutput}

	Gene-based interaction based on Partial Least Squares Path Modeling

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
U = 4.0938, n.perm = 1000, p-value = 0.18
alternative hypothesis: true U is not equal to 0
sample estimates:
     beta0      beta1 
-0.2125869  0.2434624 
\end{Soutput}

%\vskip0.5em
%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Partial Least Squares Path Modeling}
%
%\code{U = 4.093781, p-value = 0.18}

%%%%%%%%%%%%%%%
\subsection{Composite Linkage Disequilibrium (CLD)}
%%%%%%%%%%%%%%%

The CLD method, proposed in \cite{Rajapakse:12} is based on the normalized quadratic distance (NQD) and is defined as 
$$
\delta^2=\mathrm{tr.}\Big((\tilde{D}-\tilde{C})W^{-1}(\tilde{D}-\tilde{C})W^{-1} \Big)
$$

\noindent
where $\tilde{D}$, $\tilde{C}$ and $W$ are three $(m_1+m_2) \times (m_1+m_2)$ matrices of the covariance between the whole set of SNPs that combines SNPs from both genes. More precisely, $\tilde{D}$ and $\tilde{C}$ are defined as follows:
$$
\tilde{D}=
\begin{bmatrix}
W_{11} & D_{12} \\
D_{21} & W_{22} \\
\end{bmatrix}
\;\;\;\;\;\;\;\;\;
\tilde{C}=
\begin{bmatrix}
W_{11} & C_{12} \\
C_{21} & W_{22} \\
\end{bmatrix}
$$

\noindent
where $W_{11}$ (resp. $W_{22}$) is the pooled estimate of the
covariance matrix for $\mathbf{X_1}$ (resp. $\mathbf{X_2}$, $D_{12}(=D_{21}^{\prime})$ and $C_{12}(=C_{21}^{\prime})$ are the sample covariance matrix between the two genes estimated from $\left(\mathbf{X_1^d},\mathbf{X_2^d}\right)$ and $\left(\mathbf{X_1^c},\mathbf{X_2^c}\right)$ respectively. In more details, the sample covariance matrices in cases, denoted by $D$, and in controls, denoted by $C$, can be partitioned in 4 blocks as follows:
$$
D=\mathrm{Cov}\left(\mathbf{X_1^d},\mathbf{X_2^d}\right)=
\begin{bmatrix}
D_{11} & D_{12} \\
D_{21} & D_{22} \\
\end{bmatrix}
\;\;\;\;\;\;\;\;
C=\mathrm{Cov}\left(\mathbf{X_1^c},\mathbf{X_2^c}\right)=
\begin{bmatrix}
C_{11} & C_{12} \\
C_{21} & C_{22} \\
\end{bmatrix}
$$

\noindent
The pooled estimate of the covariance matrix, $W$, can thus been obtained by:
$$
W=\frac{n_c C +n_d D}{n_c+n_d}=
\begin{bmatrix}
W_{11} & W_{12} \\
W_{21} & W_{22} \\
\end{bmatrix}
$$

\noindent
Since the distribution of $\delta^2$ is not known under the null hypothesis, significance testing is performed using permutation tests, as proposed by \cite{Rajapakse:12}. Such a test has been implemented in our package in the \code{CLD.test} function where the number of permutations is determined by the argument \code{n.perm}.

<<>>==
set.seed(1234)
CLD.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,n.perm=2000)
@

\begin{Soutput}

	Gene-based interaction based on Composite Linkage Disequilibrium

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
CLD = 0.49257, n.perm = 2000, p-value = 0.8865
alternative hypothesis: true CLD is not equal to 0
sample estimates:
      CLD 
0.4925654 
\end{Soutput}

%\vskip0.5em
%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Composite Linkage Disequilibrium}
%
%\code{CLD = 0.4925654, p-value = 0.8865}


%%%%%%%%%%%%%%%
\subsection{Gene-Based Information Gain Method (GBIGM)}
%%%%%%%%%%%%%%%

Introduced by \cite{Li:15}, the GBIGM method is based on the information gain rate $\Delta R_{1,2}$. $\Delta R_{1,2}$ is defined as follows:
$$
\Delta R_{1,2}=\frac{\min(H_1,H_2)-H_{1,2}}{\min(H_1,H_2)}
$$
where $H_1$, $H_2$, $H_{1,2}$ are the conditional entropies, given the $\mathbf{Y}$, of $\mathbf{X_1}$, $\mathbf{X_2}$ and the pooled SNP set $(\mathbf{X_1},\mathbf{X_2})$ respectively. Assuming that $H(.)$ is the classical entropy function, we have:
\begin{eqnarray*}
H_1&=&H(\mathbf{Y},\mathbf{X_1})-H(\mathbf{X_1})\\
H_2&=&H(\mathbf{Y},\mathbf{X_2})-H(\mathbf{X_2})\\
H_{1,2}&=&H(\mathbf{Y},\mathbf{X_1},\mathbf{X_2})-H(\mathbf{X_1},\mathbf{X_2})\\
\end{eqnarray*}

\noindent
Since the distribution of $\Delta R_{1,2}$ is unknown, the significance testing is performed by permutations as suggested by \cite{Li:15}. The GBIGM method has been implemented in the \code{GBIGM.test} function and the number of permutations is defined by the argument \code{n.perm}.

<<>>==
set.seed(1234)
GBIGM.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,n.perm=2000)
@

\begin{Soutput}
	Gene-based interaction based on Gene-based Information Gain Method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
DeltaR1,2 = 0.46441, n.perm = 2000, p-value = 0.441
alternative hypothesis: two.sided
sample estimates:
DeltaR1,2 
0.4644093 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Gene-based Information Gain Method}
%
%\code{DeltaR1,2 = 0.4644093, p-value = 0.441}

%%%%%%%%%%%%%%%
\section{From SNP-SNP interaction to Gene-Gene interaction testing}
%%%%%%%%%%%%%%%

This section provides details of the four statistical methods that proposes a gene-based test from SNP-based tests \cite{Emily:16}. Rather than considering multiple SNPs in both gene as part of a joint model, these methods aim at aggregating p-values obtained at the SNP level into a single p-value at a gene level.

\paragraph{Interaction testing at the SNP level}\


\noindent
Let consider a pair of SNPs, $(X_{1,j},X_{2,k})$ where $X_{1,j}$ is the $j^{\textmd{th}}$ SNP  of gene $X_1$ and $X_{2,k}$ the $k^{\textmd{th}}$ SNP  of gene $X_2$ ($1 \leq j \leq m_1$ and $1 \leq k \leq m_2$). To test for interaction at the SNP level, we used the following Wald statistic:
$$
W_{jk}=\frac{\widehat{\beta_3^{j,k}}}{\widehat{\sigma \left(\widehat{\beta_3^{j,k}}\right)}}
$$
where $\widehat{\beta_3^{j,k}}$ is an estimate of the interaction coefficient  $\beta_3^{j,k}$ of the following logistic model:
$$
\log\left( \frac{\mathbb{P}[Y=1|X_{1,j}=x_1,X_{2,k}=x_2]}{1-\mathbb{P}[Y=1|X_{1,j}=x_1,X_{2,k}=x_2]}\right) = \beta_0^{j,k}+\beta_1^{j,k}x_1+\beta_2^{j,k} x_2 +\beta_3^{j,k} x_1x_2
$$

\noindent
$\widehat{\beta_3^{j,k}}$ is obtained  by maximizing the likelihood function on the observed data $\mathbf{Y}$, $\mathbf{X_1}$ and $\mathbf{X_2}$ while $\widehat{\sigma \left(\widehat{\beta_3^{j,k}} \right)}$ is calculating by inverting the Hessian of the likelihood. Since the solution of the maximization of the likelihood function does not have a closed form, we compute $W_{jk}$ according to the iteratively reweighted least squares algorithm proposed in the \code{glm} function of the \pkg{stats} package \cite{R:16} .

\vskip1em
\noindent
To combine the statistics $W_{jk}$ into a single test, \cite{Ma:13} proposed four methods that all account for covariance matrix $\Sigma=[\sigma_{(j,k),(j^{\prime},k^{\prime})}]_{\substack{j=1\dots m_1; k=1\dots m_2 \\j^{\prime}=1\dots m_1;k^{\prime}=1\dots m_2}}$, a $(m_1 \times m_2) \times (m_1 \times m_2)$ symmetric matrix where $\sigma_{(j,k),(j^{\prime},k^{\prime})}=Cov(W_{jk},W_{j^{\prime},k^{\prime}})$. As proposed by \cite{Emily:16}, the covariance between $W_{jk}$ and $W_{j^{\prime},k^{\prime}}$ is estimated by:
$$
\widehat{\sigma_{(j,k),(j^{\prime},k^{\prime})}}= r_{j,j^{\prime}} r_{k,k^{\prime}}
$$
where $r_{j,j^{\prime}}=\frac{p_{j j^{\prime}}-p_j p_{j^{\prime}}}{\sqrt{p_j(1-p_j)p_{j^{\prime}}(1-p_{j^{\prime}})}}$ is the widely used correlation measure between SNP $j$ and SNP $j^{\prime}$, given that $p_j$ and $p_{j^\prime}$ are the respective allelic frequencies and $p_{jj^{\prime}}$ is the joint allelic frequency \cite{Hill:68}.


\vskip2em
\noindent
In the remainder of this section, the four methods: minP (function \code{minP.test}, GATES (function \code{gates.test}), tTS (function \code{tTS.test}) and tProd (function \code{tProd.test}) are detailed.

%%%%%%%%%%%%%%%
\subsection{minP}\
%%%%%%%%%%%%%%%

\noindent
The minP test is based on the minimum p-value that is often used to combine p-values of association (see \cite{Conneely:07}). Let $W_{\max}=\max{|W_{11}|,\dots,|W_{m_1,m_2}|}$ be the maximum of the absolute observed statistics. The minP is then defined by:
\begin{equation}
\textmd{minP}=1-\mathbb{P}\Big[\max(|Z_1|,|Z_2|,\dots,|Z_{m_1m_2}|) < W_{\max}\Big].
\label{Eq:pval}
\end{equation}
where $\mathbb{Z}=(Z_1,Z_2,\dots,Z_{m_1m_2})$ is a random vector that follows a multivariate normal distribution $\mathbb{Z} \sim \mathcal{N}(\mathbf{0},\Sigma)$.

\vskip1em
\noindent
The computation of Equation~\eqref{Eq:pval} requires the calculation of the probability distribution of a multivariate normal random variable. For that purpose, we used the {\tt pmvnorm} function from the R package {\tt mvtnorm} \cite{Genz:09}. 

<<>>==
set.seed(1234)
minP.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2)
@

\begin{Soutput}

	Gene-based interaction based on minP method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
Wmax = 0.0099241, p-value = 0.1796
alternative hypothesis: true Wmax is greater than 0
sample estimates:
       Wmax 
0.009924148 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 minP}
%
%\code{minP = 0.009924148, p-value = 0.1795672}


%%%%%%%%%%%%%%%
\subsection{GATES}\
%%%%%%%%%%%%%%%

\noindent
The GATES procedure, proposed by \cite{Li:11}, is an extension of the Simes procedure used to assess the gene level association significance. Let $p_{(1)}, \dots, p_{(m_1m_2)}$ be the ascending SNP-SNP interaction $m_1 \times m_2$ p-values, GATES p-value is then defined by
$$
\textmd{p}_{GATES}=\min\left( \frac{me p_{(1)}}{me_{(1)}},\frac{me p_{(2)}}{me_{(2)}},\dots,\frac{me p_{(m_1m_2)}}{me_{(m_1m_2)}} \right)
$$
where $m_e$ is the number of effective tests among the $m_1 \times m_2$ tests and $me_(i)$ the number of effective tests among the $i$ most significative tests associated with the lowest order p-values $p_{(1)}, \dots, p_{(i)}$. The number of effective tests ought to characterize the number of independent tests equivalent to the correlated tests that are really performed and is often used to account for dependence in a multiple testing correction.

\vskip1em
\noindent
Although no formal definition of the number of effective tests has been formulated in the literature, several procedures have been proposed to estimate such number. All methods are based on a transformation of the set of eigenvalues of the SNP covariance matrix assuming that (1) if the SNPs are independent, the number of effective tests is the number of performed, (2) if the absolute value of the correlation between any pair of SNPs is equal to $1$, the number of effective tests is $1$. In the \pkg{GeneGeneInteR} package, four main methods have been implemented and can be chosen by the user with the argument \code{merest}: Cheverud-Nyholt method - \code{me.est="ChevNy"} \cite{Cheverud:01,Nyholt:04}, Keff method -\code{me.est="Keff"} \cite{Moskvina:08}, Li and Ji method - \code{me.est="LiJi"} \cite{Li:05} and Galwey - \code{me.est="Galwey"} \cite{Galwey:09}.

<<>>==
set.seed(1234)
gates.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,me.est="ChevNy")
@

\begin{Soutput}

	Gene-based interaction based on GATES method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
GATES = 0.0099241, p-value = 0.2939
alternative hypothesis: less
sample estimates:
      GATES 
0.009924148 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 GATES}
%
%\code{GATES = 0.009924148, p-value = 0.293932
%}

<<>>==
set.seed(1234)
gates.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,alpha=0.05,me.est="Keff")
@

\begin{Soutput}

	Gene-based interaction based on GATES method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
GATES = 0.013945, p-value = 0.1899
alternative hypothesis: less
sample estimates:
     GATES 
0.01394543 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 GATES }
%
%\code{GATES = 0.01394543, p-value = 0.1899414
%}

<<>>==
set.seed(1234)
gates.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,me.est="LiJi")
@

\begin{Soutput}

	Gene-based interaction based on GATES method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
GATES = 0.013945, p-value = 0.1255
alternative hypothesis: less
sample estimates:
     GATES 
0.01394543 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 GATES}
%
%\code{GATES = 0.01394543, p-value = 0.1255088
%}

<<>>==
set.seed(1234)
gates.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,me.est="Galwey")
@

\begin{Soutput}

	Gene-based interaction based on GATES method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
GATES = 0.013945, p-value = 0.1596
alternative hypothesis: less
sample estimates:
     GATES 
0.01394543 

\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 GATES}
%
%\code{GATES = 0.01394543, p-value = 0.1596024
%}

\newpage
\subsection{tTS and tProd}\

\noindent
tTS and tProd procedures are two truncated tail strength methods that aim at combining signals from all single-SNP p-values less than a predefined cutoff value \cite{Jiang:11}. Denoting by $\tau$ the cutoff value, the two truncated p-values are defined as follows \cite{Zaykin:02}:
\begin{eqnarray*}
tTS&=&\frac{1}{m_1m_2} \sum_{i=1}^{m_1m_2} \mathbb{I}(p_{(i)} < \tau) \left(1-p_{(i)}\frac{m_1m_2+1}{i} \right)\\
tProd&=&\prod_{i=1}^{m_1m_2} p_{i}^{\mathbb{I}(p_i < \tau)}
\end{eqnarray*}

\noindent where $\mathbb{I}$ is the indicator function.

\vskip1em
\noindent
When p-values are correlated, the null distribution of $tTS$ and $tProd$ are unknown. Following the approach proposed by \cite{Zaykin:02}, a p-value is obtained in the \pkg{GeneGeneInteR} package by computing an empirical null distribution using Monte-Carlo (MC) simulations. For  each MC iteration, an empirical value for $tTS$ (or $tProd$) is obtained by simulating a vector of $W_{jk}$ with respect to a multivariate normal distribution with a vector of $0$ means and $\widehat{\Sigma}$ as covariance matrix. The empirical p-value is calculated as the proportion of simulated statistics larger than the observed statistic on the ``true" set of $W_{jk}$. 

\vskip1em
\noindent
tTS and tProd methods have been implemented in the functions \code{tTS.test} and \code{tProd.test} of the \pkg{GeneGeneInteR} package. Additional to the mandatory \code{Y}, \code{$G_1$} and \code{$G_2$} arguments, these two functions have two optional arguments: \code{tau} and \code{n.sim} that control the cutoff value and the number of simulations used to estimate the empirical value respectively. The following coding lines give an example of the \code{tTS.test} and \code{tProd.test}:

<<>>==
set.seed(1234)
tTS.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,tau=0.5,n.sim=10000)
@

\begin{Soutput}

	Gene-based interaction based on the Truncated Tail Strength method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
tTS = -0.0099127, tau = 0.5, p-value = 0.5104
alternative hypothesis: less
sample estimates:
         tTS 
-0.009912706 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2cm 	 Truncated Tail Strength}
%
%\code{tTS = -0.009912706, p-value = 0.5104
%}

<<>>==
set.seed(1234)
tProd.test(Y=gene.pair$Y, G1=gene.pair$G1,G2=gene.pair$G2,tau=0.05,n.sim=1000)
@

\begin{Soutput}

	Gene-based interaction based on the Truncated Product method

data:  gene.pair$Y and  (gene.pair$G1 , gene.pair$G2)
tProd = 0.0001384, tau = 0.05, p-value = 0.265
alternative hypothesis: less
sample estimates:
       tProd 
0.0001383965 
\end{Soutput}

%\code{Gene-Gene Interaction method performed with:}
%
%\code{\null \hskip2em 	 Truncated Product}
%
%\code{tProd = 3.69877e-12, p-value = 0.485}


\bibliographystyle{apalike}
\bibliography{Biblio}

\end{document}