Lab2.Rnw

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{Seattle Lab 2}
%\VignetteDepends{Biobase, genefilter, annotate, edd}
%\VignetteKeywords{Microarray, genefilter}
\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\bibliographystyle{plainnat}

\begin{document}

\section*{Using Bioconductor Packages}

In this laboratory we will show you how to use different Bioconductor packages to perform analyses of microarray data. Specifically, we will consider gene filtering, the use of the \textit{edd} package and some visualization techniques.

Before starting we would like to emphasize that there are very many other resources available. Every Bioconductor package has at least one \textit{vignette} associated with it. A vignette is a document, similar to this one, that mixes code and textual explanations. These vignettes are intended to provide you with a more comprehensive view of each package than we can give here.

<<>>=
library(Seattle)

@

In Lab 1 we loaded the data from \cite{Golub99} and examined some of the basic properties of the \texttt{exprSet} class. Now we turn our attention to one of the most commonly performed tasks in microarray data analysis: filtering genes. How do we find those genes that are most associated with some phenotypic characteristic?

The only characteristic of this data set that is interesting to analyse is the AML/ALL distinction.

The philosophy in the \textit{genefilter} package is to separate the notion of the filter itself from the process of filtering. So, the first step is to create a filter. There are a handful of built-in filter functions.

The output of filtering is a logical index vector where the value is \texttt{TRUE} if the row (gene) passed all filters. These will be the genes that we want and we can then use the vector to subset the data.

The built-in filters are listed below.
\begin{description}

\item[allNA] If all entries are NA then this will return FALSE.
\item[Anova] Perform a simple one-way ANOVA and return TRUE if the resultant p-value is less than a specified value.
\item[anyNA] Return TRUE if all values are not NA.
\item[coxfilter] Apply a simple Cox model, using the gene expression values as the covariate; return TRUE if the p-value for the covariate is less than a pre-specified value.
\item[cv] Filter on the coefficient of variation. Return TRUE if the coefficient of variation is between two specified bounds (a and b).
\item[gapFilter] Return TRUE if there is a \textit{gap} in the middle part of the data.
\item[kOverA] Return TRUE if at least k samples have values larger than A.
\item[maxA] Return TRUE if at least one sample has a value larger than A.
\item[pOverA] Return TRUE if a proportion p of the values are larger than A.
\item[ttest] Return TRUE if a t-test, using a specified covariate for grouping, has a p-value less than a pre-specified value.
\end{description}
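
For example, here is a minimal sketch of how one of these built-in filters is used (the matrix \texttt{mat} is a made-up example, not the Golub data; the chunk is not evaluated):

<<eval=FALSE>>=
## A made-up expression matrix: 100 genes, 10 samples
mat <- matrix(2^rnorm(1000, mean=8), nrow=100)
## Keep genes for which at least 5 samples have values above 200
f1 <- kOverA(5, 200)
keep <- genefilter(mat, filterfun(f1))
sum(keep)
@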

It is straightforward to write your own filter-functions (we'll do that in a little while).

First we process the Golub data in the same manner that they did: any values less than 100 were set to 100, and any values larger than 16,000 were set to 16,000. We can see that there are some problems with the estimated expression values by looking at a histogram of the raw data; notice the number of negative values and other values up around 60,000.

<<>>=

hist(exprs(golubTrain))

X <- exprs(golubTrain)
X[X < 100] <- 100
X[X > 16000] <- 16000

@

Now that we have massaged the expression data into the right format we are ready to filter the data. The filter that \citet{Golub99} used selected a probe set if the ratio of the maximum value to the minimum value was larger than 5 and the difference between the maximum and minimum values was larger than 500.

<<>>=

mmfilt <- function(r=5, d=500, na.rm=TRUE) {
    function(x) {
        minval <- min(x, na.rm=na.rm)
        maxval <- max(x, na.rm=na.rm)
        (maxval/minval > r) && (maxval - minval > d)
    }
}

mmfun <- mmfilt()

ffun <- filterfun(mmfun)
sub <- genefilter(X, ffun)
sum(sub)   ## Should get 3051
@

Notice the set of steps that you need to carry out. First we create the filter function; this is \verb+mmfun+. This is then assembled into a filter function with \verb+filterfun+. This is simple here, but we could put many filters into a single filter function; they will be evaluated in order, so choose the order wisely. Then we apply \verb+ffun+ to the expression data and capture the result in \verb+sub+.
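
For instance, a sketch (not evaluated here, so \verb+sub+ remains the result we use below) of combining two filters into one filter function:

<<eval=FALSE>>=
## Filters are applied in the order given, so put the cheaper one first
combined <- filterfun(kOverA(5, 100), mmfun)
sub2 <- genefilter(X, combined)
sum(sub2)
@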

So we now have a variable that can be used to select the appropriate genes. And we will now use that to subset the genes.

<<>>=
X <- X[sub, ]
X <- log2(X)
golubTrainSub <- golubTrain[sub, ]
golubTrainSub@exprs <- X
@

Now, we have created a new data set, \texttt{golubTrainSub} that contains the appropriate data.

We will briefly explore those data to see what the transformation has achieved.

<<>>=
hist(exprs(golubTrainSub))
@

We see that there is a big spike that represents those probes that had their expression values set to 100 (the base 2 logarithm of 100 is about 6.64). This could have a substantial effect on our analysis but we will ignore it for now.
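
As a quick check (a sketch, not evaluated here), we can count how many of the log-scale values sit exactly at this floor:

<<eval=FALSE>>=
## Number of values that were truncated at the floor of 100
sum(exprs(golubTrainSub) == log2(100))
@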

For convenience we will now create a few variables that we will use later.

<<>>=
Y <- golubTrainSub$ALL.AML

Y <- paste(golubTrain$ALL.AML, golubTrain$T.B.cell)
Y <- sub("NA", "", Y)
table(Y)

gnames <- dimnames(X)[[1]]

@

As we can see, the variable \verb+Y+ provides further information beyond whether a sample is ALL or AML. It further breaks the samples down into those that are B-cell derived and those that are T-cell derived.

We set up \verb+gnames+ for easy access to the Affymetrix identifiers for each probe set. These will be used in conjunction with the \textit{annotate} package to provide meaningful interactions with the annotation data.
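
For example, a sketch of how \verb+gnames+ might be used with \textit{annotate} (assuming the \texttt{hu6800} annotation data are installed; not evaluated here):

<<eval=FALSE>>=
library(annotate)
## Map the first few Affymetrix identifiers to gene symbols
getSYMBOL(gnames[1:5], "hu6800")
@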

\subsection*{Filtering genes by factors}

We can filter genes to select those that are most associated with a particular factor. We first will use a t-test filter to select on the basis of the ALL/AML distinction and then use an ANOVA filter (using \verb+Y+ defined above) to filter on the three level factor.

Since these are both quite computationally intensive we recommend for this exercise that you perform only \textbf{one}. Both indexing variables will be available and can be loaded directly (we'll tell you how during the lab).

<<>>=

tf <- ttest(golubTrainSub$ALL.AML, 0.0001)
ff <- filterfun(tf)

# gftt <- genefilter(golubTrainSub, ff)

data(gftt)

sum(gftt)

gTrsubtt<- golubTrainSub[gftt,]

gTrsubtt
@

This filtering selected 124 genes. We can use these for other purposes. They can be used in machine learning or they could be reported directly to the researcher.

Since we have performed so many tests the p-values involved are nominal. In a later laboratory we will explore how to get the p-values and how to use the Bioconductor package \textit{multtest} to perform various multiple comparison adjustments.
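
As a small preview (a sketch only; not evaluated here), nominal t-test p-values could be computed and adjusted along the following lines:

<<eval=FALSE>>=
library(multtest)
## Nominal two-sample t-test p-value for each gene
rawp <- apply(exprs(golubTrainSub), 1,
              function(x) t.test(x ~ golubTrainSub$ALL.AML)$p.value)
## Adjust, e.g. with the Bonferroni and Benjamini-Hochberg procedures
adj <- mt.rawp2adjp(rawp, proc = c("Bonferroni", "BH"))
head(adj$adjp)
@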

Performing an ANOVA selection is also straightforward. In the same way, we define the filter, put it into a filter function and finally apply it.

<<>>=

af <- Anova(factor(Y), 0.0001)

aFF <- filterfun(af)

# gfaF <- genefilter(golubTrainSub, aFF)

data(gfaF)

sum(gfaF)

gTrsubAv <- golubTrainSub[gfaF,]

@

We can also ask which probe sets were selected by both methods.

<<>>=

sum(gfaF & gftt)

gnames[gfaF & gftt]

@

We see that 66 of the genes were selected by both methods. We now have the problem of deciding which of the gene lists we have produced are interesting or relevant.

In this section we will explore some of the functionality in the \textit{geneplotter} package. This package relies on a data structure that specifies certain statistics about the genome of interest: specifically, the number of chromosomes and their respective lengths in nucleotides.

We can render all chromosomes the same length.

<<>>=

data(hu6800ChrClass)
cPlot(hu6800ChrClass)
@

Or we can scale them by their relative lengths.

<<>>=

cPlot(hu6800ChrClass, scale="rel")
@

Now we just jam a new definition of \texttt{cColor} in here, because the one in Bioconductor is broken.

<<>>=
if (compareVersion(package.description("geneplotter", fields="Version"), "1.0.1") > 0) {
  cColor <- function(genes, color, plotChroms,
                     scale=c("max","relative"), glen=0.4) {
    ## Passed a vector of gene names, a color and an instance of a
    ## chromLocation class.  Will recolor the specific genes in the
    ## cPlot created plot to match the specified color.  Scale should
    ## be the same as the scale from cPlot
    scale <- match.arg(scale)
    xPoints <- 1000

    ## if (!exists("hgu95Achrom", mode="environment"))
    ##     data(hgu95Achrom)
    gcO <- multiget(genes, env=geneToChrom(plotChroms))
    gc <- sapply(gcO, function(x) {
        if (class(x) == "chromLoc")
            return(x@chrom)
        return(NA)
    })
    gchr <- split(names(gc), gc)
    gchr[["NA"]] <- NULL

    ## Look up the locations of these genes in each chromosome,
    ## plotting any results.
    locList <- chromLocs(plotChroms)

    if (!exists("hgCLengths", mode="environment"))
        data(hgCLengths)

    lens <- hgCLengths

    for (cName in names(gchr)) {
        locs <- locList[[cName]][gchr[[cName]]]
        locs <- as.numeric(locs[!is.na(locs)])
        if (length(locs) > 0) {
            .plotData(cName, locs, xPoints, lens, color, scale, glen)
        }
    }
  }
}

@

Then we can color different genes according to whatever color scheme we want. For example, those that were selected by the t-test can be colored red.

<<>>=

cColor(geneNames(gTrsubtt), "red", hu6800ChrClass, scale="rel")

@

And for the Anova filter we will color them blue.

<<>>=

cColor(geneNames(gTrsubAv), "blue", hu6800ChrClass, scale="rel")

@

\section*{Expression Density Diagnostics}

One of the ideas that we have been working on for examining microarray data is to look for genes that have similar patterns of expression. There are many ways in which this can be interpreted; in the \textit{edd} package we interpret it as looking for genes whose expression densities have similar shapes.

Four options are available for classifying expression densities. Data on each gene are shifted and scaled to have median zero and MAD one. They are then compared to the shapes of a set of reference distributions (standard Gaussian, $\chi^2_1$, lognormal$(0,1)$, $t_3$, $0.75\,N(0,1)+0.25\,N(4,1)$, $0.25\,N(0,1)+0.75\,N(4,1)$, $\mathrm{Beta}(2,8)$, $\mathrm{Beta}(8,2)$, $U(0,1)$), after each of these has been transformed to have median 0 and MAD 1. Classification proceeds by one of four methods, selected by setting the \verb+ref+ argument.
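
The shifting and scaling step itself is simple; here is a minimal sketch of the transformation for a single gene (the internal details of \textit{edd} may differ slightly; not evaluated here):

<<eval=FALSE>>=
## Centre and scale one gene's expression values to median 0, MAD 1
x <- exprs(golubTrainSub)[1, ]
xs <- (x - median(x)) / mad(x)
c(median(xs), mad(xs))
@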

In the next example we use neural networks to perform the classification. In the unsupervised setting we generate 100 samples from each reference distribution (there are other options) and construct a classifier. We then classify the given samples using the constructed classifier.

Since we are using neural networks there is a stochastic component. We set an arbitrary seed here to ensure repeatability (and so we all get the same answers).

<<>>=

set.seed(12345)
edd2 <- edd.unsupervised(golubTrainSub, ref="nnet")
print(table(edd2))
print(sum(table(edd2)))

@

Looking at the table we see that there are a number of genes whose distributions are classified as being close to mixture distributions. From the perspective of studying cancer this is interesting, since bimodality may be anticipated: we do not expect all patients to have all defects, and there is substantial evidence of great diversity at the genomic level within cancers. We therefore think that this methodology has some promise.

\subsection*{Group Comparisons}

Among the more interesting uses of the \textit{edd} technology is the comparison of two or more groups.

Since \textit{edd} requires substantial numbers of observations to perform well, we will use the \verb+golubMerge+ data set. For expediency we will simply subset it to select the same genes chosen for \verb+golubTrainSub+.

We then divide \verb+golubMergeSub+ into two parts on the basis of the ALL/AML classification.

<<>>=

golubMergeSub <- golubMerge[sub,]

golubMergeSub

table(golubMergeSub$ALL)

gMSALL <- golubMergeSub[, golubMergeSub$ALL=="ALL"]
gMSAML <- golubMergeSub[, golubMergeSub$ALL=="AML"]

gMSALL

gMSAML
@

Now that we have our subsets we can apply edd to each one separately.

<<>>=
set.seed(12345)
ed1 <- edd.unsupervised(gMSALL, ref="nnet")
ed2 <- edd.unsupervised(gMSAML, ref="nnet")
print(table(ed1, ed2))
@

Entries on the diagonal of the table are genes that were classified into the same shape in both samples. Those in off-diagonal positions were classified as having one shape in one of the subsets and a different shape in the other. Once again, we might want to focus on those that appear to be mixtures in one of the two samples.
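
For instance, assuming \verb+ed1+ and \verb+ed2+ are vectors of shape labels with one entry per gene, a sketch of pulling out the discordantly classified genes might look like this (not evaluated here):

<<eval=FALSE>>=
## Genes classified into different shapes in the ALL and AML subsets
discordant <- ed1 != ed2
sum(discordant)
gnames[discordant]
@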

\section*{Graphics}

In this section we document how to use some of the functions available in the \textit{geneplotter} package.

These functions let users associate their expression data with chromosomal location and various biological constants.

In order to use the functions you must first create an object of class \verb+chromLocation+. This object provides the interface between the plotting functions and the specific data being considered. The construction of these objects is explained in the

%%FIXME: need to get this done!!!

\end{document}
