--- title: "parody: parametric and resistant outlier dytection" shorttitle: "outlier dytection methods" author: "Vincent J. Carey, stvjc at channing.harvard.edu" date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{parody: parametric and resistant outlier dytection} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes --- ```{r setuppp,echo=FALSE} suppressPackageStartupMessages({ library(parody) library(BiocStyle) }) ``` # Introduction Outliers are data points that seem anomalous. Precise quantitative measurement of anomalousness is easiest when a parametric probability model is adopted. Significant statistical and computational research has been devoted to devising and rationalizing criteria for outlyingness that do not require adoption of a parametric probability model. This package provides interfaces to various methods presented in the literature of statistical methods for measuring outlyingness in univariate and multivariate samples. It is offered as a resource through bioconductor.org because formal testing for outliers is a common concern of genome scale data analysis, in many different contexts. # Univariate samples Davies and Gather (1992) present a detailed study of criteria for outlyingness in univariate samples. Inlier boundaries are defined, with form $(\hat{m}-c(n)\hat{s}, \hat{m} + c(n) \hat{s})$, where $\hat{m}$ is a location parameter estimate, $\hat{s}$ is a scale parameter estimate, and $c(n)$ are multipliers depending on sample size. Candidates for $m$ are trimmed mean, median, midpoint of shorth; candidates for $s$ include trimmed standard deviation, median absolute deviation (MAD), length of shorth. The familiar boxplot outlier labeling rules do not fall immediately into this framework, as the left and right inlier boundaries may not be equidistant from the location parameter value. This package provides functions implementing components of the various labeling rules; users may mix components to define their own procedures. ```{r do1} library(parody) ``` We define a dataset presented in Rosner (1983). The data concern tapping frequencies achieved by children with various exposures to lead. ```{r dun21} lead <- c(83, 70, 62, 55, 56, 57, 57, 58, 59, 50, 51, 52, 52, 52, 54, 54, 45, 46, 48, 48, 49, 40, 40, 41, 42, 42, 44, 44, 35, 37, 38, 38, 34, 13, 14) ``` We can use a boxplot to visualize the distribution. This includes an outlier labeling criterion, and the dots beyond the whiskers are declared to be outlying. ```{r dobox1} boxplot(lead) ``` The first formal assessment uses the familiar boxplot rules. For this, we need to supply a scaling function as described in the manual page. ```{r dobox2} calout.detect(lead,alpha=.05,method="boxplot", scaling= function(n,alpha)1.5) ``` We can see that this agrees with the R visualization. Theoretical work on the calibration of the boxplot rule is deployed when \texttt{ftype} is set to \texttt{"ideal"}. See the manual page for references. ```{r dobox3} calout.detect(lead,alpha=.05,method="boxplot",ftype="ideal") ``` Another procedure that does not fit in the Davies and Gather framework is the generalized extreme studentized deviate procedure due to B Rosner. In this procedure, we prespecify the number of possibly outlying points $k$ (which may be any number less than half the sample size), and the $k$ most extreme studentized deviates are obtained and recorded from the highest to lowest value. Then repeated `outward testing' is conducted using analytic critical values that bound the rate of false outlier labeling for the entire testing procedure, whether or not outliers are present. Here we apply Rosner's procedure: ```{r dor} calout.detect(lead,alpha=.05,method="GESD",k=5) ``` It gives results that agree with the standard boxplot. Davies and Gather characterize a rule based on median and MAD using scaling functions $c(n)$ that they associate with Hampel. This procedure uses the scaling function defined in their expression (3). ```{r domemd} calout.detect(lead,alpha=.05,method="medmad",scaling=hamp.scale.3) ``` Finally we use the shorth-based detector. ```{r dosh} calout.detect(lead,alpha=.05,method="shorth") ``` # Multivariate samples A sample dataset called \texttt{tcost} is supplied with the package. Before visualizing, we compute a default multivariate outlier criterion, due to Caroni and Prescott, that generalizes Rosner's GESD: ```{r dot} data(tcost) ostr = mv.calout.detect(tcost) ostr ``` We see that indices 9 and 21 are flagged. Let's create a pairs plot with coloring. ```{r doparis} thecol = rep("black", nrow(tcost)) thecol[ostr$ind] = "red" pairs(tcost, col=thecol, pch=19) ``` Another display of interest employs principal components: ```{r dobipi} pc = prcomp(tcost) biplot(pc) ``` To get a sense of the robustness (or lack thereof) of the biplot, let us replot after removing the datapoints that were marked as outliers by the Caroni and Prescott procedure. ```{r mvmv} ftcost = tcost[-c(9,21),] fpc = prcomp(ftcost) biplot(fpc) ```