Multivariate Outlier Detection with R

For multivariate outlier detection, R has a package called "mvoutlier", available from CRAN.
The package contains a number of multivariate outlier detection methods based on robust statistics.

The package implements several algorithms for identifying multivariate outliers in high-dimensional and large datasets, including pcout [1], uni.plot [2], sign2 [1], and symbol.plot [2]. All of these work on multi-dimensional data except symbol.plot, which is restricted to two dimensions.
Detailed information for the package can be found in its manual [3].

A sample R script using these methods is given below. It takes the name of the CSV file containing the dataset as its parameter.
# parameter file: CSV file that contains the dataset
# separator: ;  quote: "  decimal symbol: ,  first column holds row names
# plots outliers to PNG images
library(mvoutlier)

mvOutliers = function(file)
{
  data <- read.csv(file, header = TRUE, sep = ";", quote = "\"", dec = ",", row.names = 1);
 
  fn = paste("Outlier-", substr(file, 1, 5), "-PCOut.png", sep = "");
  png(fn, width=1183, height=664, units="px", res = 100);
  resPC <- pcout(data, makeplot = TRUE, outbound = 0.425);
  resPCOut <- which(resPC$wfinal01 == 0);
  dev.off();
 
  fn = paste("Outlier-", substr(file, 1, 5), "-UniPlot.png", sep = "");
  png(fn, width=1183, height=664, units="px", res = 100);
  resUni <- uni.plot(data, quan = 0.975);
  resUniOut <- which(resUni$outliers == TRUE);
  dev.off();
   
  fn = paste("Outlier-", substr(file, 1, 5), "-Symbol.png", sep = "");
  png(fn, width=1183, height=664, units="px", res = 100);
  resSym <- symbol.plot(data[,c("CP","XU100")], quan = 0.975);
  resSymOut <- which(resSym$outliers == TRUE);
  dev.off();

  lengths <- numeric();
  lengths[1] <- length(resPCOut);
  lengths[2] <- length(resUniOut);
  lengths[3] <- length(resSymOut);
  ratios <- lengths / length(data[,1]);
  print(lengths);
  print(ratios);
}
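The script expects a semicolon-separated file with comma decimals, row names in the first column, and (for the symbol plot) columns named CP and XU100. A minimal sketch of producing and reading back such a file with base R follows. The values, the file name, and the third column USD are made up for illustration.

```r
# Synthetic data for illustration only; USD is a hypothetical extra column.
set.seed(1)
demo <- data.frame(
  CP    = rnorm(20, mean = 50, sd = 5),
  XU100 = rnorm(20, mean = 100, sd = 10),
  USD   = rnorm(20, mean = 1.5, sd = 0.1),
  row.names = paste0("obs", 1:20)
)

# write.csv2 uses ";" as the separator and "," as the decimal mark.
csv_file <- file.path(tempdir(), "demo-data.csv")
write.csv2(demo, csv_file, row.names = TRUE)

# Read it back with the same options the script uses.
check <- read.csv(csv_file, header = TRUE, sep = ";", quote = "\"",
                  dec = ",", row.names = 1)
str(check)
```

Because the script derives the PNG names from the first five characters of its argument, it is probably simplest to call it with a bare file name located in the working directory rather than with a full path.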

Sample Plots
[Figures: PCOut plot, Uni.Plot plot, Symbol plot]

1. pcout

Semi-robust principal components are computed from the robustly sphered data and used to determine a distance for each observation. Separate weights for location and scatter outliers are computed from these distances, and the combined weights are used to identify outliers.

Usage in R:
pcout(x, makeplot = FALSE, explvar = 0.99, crit.M1 = 1/3, crit.c1 = 2.5, 
   crit.M2 = 1/4, crit.c2 = 0.99, cs = 0.25, outbound = 0.25)
x: a numeric matrix or data frame which provides the data for outlier detection
makeplot: a logical value indicating whether a diagnostic plot should be generated (default: FALSE)
explvar: a numeric value between 0 and 1 indicating how much variance should be covered by the robust PCs (default: 0.99)
crit.M1: a numeric value between 0 and 1 indicating the quantile to be used as lower boundary for location outlier detection (default: 1/3)
crit.c1: a positive numeric value used for determining the upper boundary for location outlier detection (default: 2.5)
crit.M2: a numeric value between 0 and 1 indicating the quantile to be used as lower boundary for scatter outlier detection (default: 1/4)
crit.c2: a numeric value between 0 and 1 indicating the quantile to be used as upper boundary for scatter outlier detection (default: 0.99)
cs: a numeric value indicating the scaling constant for combined location and scatter weights (default: 0.25)
outbound: a numeric value between 0 and 1 indicating the outlier boundary for defining values as final outliers (default: 0.25)
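To make the first steps of the description more concrete, here is a simplified base R sketch: robust scaling by median and MAD, principal components covering explvar of the variance, and one distance per observation. It leaves out the weighting steps of the actual method [1], so it only illustrates the idea and is not a reimplementation of pcout. The helper name robust_pc_distance and the data are hypothetical.

```r
# Hypothetical helper: simplified illustration, not the pcout algorithm.
robust_pc_distance <- function(x, explvar = 0.99) {
  x <- as.matrix(x)
  # Robustly sphere each variable with median and MAD.
  xs <- scale(x, center = apply(x, 2, median), scale = apply(x, 2, mad))
  # Principal components of the robustly scaled data.
  pc <- prcomp(xs)
  cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
  k <- which(cumvar >= explvar)[1]
  scores <- pc$x[, seq_len(k), drop = FALSE]
  # Rescale the component scores robustly, then take the Euclidean norm.
  z <- scale(scores, center = apply(scores, 2, median),
             scale = apply(scores, 2, mad))
  sqrt(rowSums(z^2))
}

set.seed(42)
clean <- matrix(rnorm(100 * 5), ncol = 5)
shifted <- matrix(rnorm(5 * 5, mean = 10), ncol = 5)  # planted outliers
d <- robust_pc_distance(rbind(clean, shifted))

# The planted rows (101 to 105) should have the largest distances.
head(order(d, decreasing = TRUE), 5)
```

In the package itself, the 0/1 outlier decision is available in the wfinal01 element of the pcout result, as used in the script above.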

2. uni.plot
It shows the multivariate outliers in each individual variable using one-dimensional scatter plots.

Usage in R:
uni.plot(x, symb=FALSE, quan=1/2, alpha=0.025)
x: a matrix or data frame containing the data
symb: a logical value. If FALSE, only two colors and no special symbols are used, and outliers are marked red. If TRUE, different symbols (a cross means a large value, a circle a small value) are used according to the robust Mahalanobis distance based on the MCD estimator, and different colors (red means a large value, blue a small value) according to the Euclidean distances of the observations (default: FALSE)
quan: the proportion of observations used for the MCD estimation; must be between 0.5 and 1 (default: 0.5)
alpha: the proportion of observations used for calculating the adjusted quantile (default: 0.025)
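The cutoff idea behind the plot can be sketched in base R: compute a squared Mahalanobis distance for each observation and compare it with a chi-square quantile. Note the assumptions: this sketch uses the classical mean and covariance and a fixed 0.975 quantile to stay within base R, whereas the package relies on MCD estimates and an adjusted quantile. The data are synthetic.

```r
set.seed(7)
x <- rbind(matrix(rnorm(100 * 2), ncol = 2),
           c(8, -8))                       # one planted outlier (row 101)

# Classical estimates, used here in place of the robust MCD estimates.
center <- colMeans(x)
covmat <- cov(x)

# mahalanobis() returns squared distances.
md2 <- mahalanobis(x, center, covmat)

# Under normality, squared distances roughly follow a chi-square
# distribution with ncol(x) degrees of freedom.
cutoff <- qchisq(0.975, df = ncol(x))
flagged <- which(md2 > cutoff)
flagged
```

With several outliers, classical estimates can be distorted by the outliers themselves, which is the motivation for robust estimators such as MCD.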

3. sign2
The sign2 algorithm uses spatial signs, and the computation of the distances is based on principal components.

Usage in R:
sign2(x, makeplot = FALSE, explvar = 0.99, qcrit = 0.975)
x: a numeric matrix or data frame which provides the data for outlier detection
makeplot: a logical value indicating whether a diagnostic plot should be generated (default: FALSE)
explvar: a numeric value between 0 and 1 indicating how much variance should be covered by the robust PCs (default: 0.99)
qcrit: a numeric value between 0 and 1 indicating the quantile to be used as critical value for outlier detection (default: 0.975)
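A spatial sign is an observation projected onto the unit sphere after centering, which limits the influence of extreme points on the principal components. A minimal base R sketch follows, assuming median centering and MAD scaling before the projection. The helper name spatial_signs and the data are hypothetical, and the snippet does not reproduce the full sign2 procedure.

```r
# Hypothetical helper: center and scale robustly, then normalize each row.
spatial_signs <- function(x) {
  x <- as.matrix(x)
  xs <- scale(x, center = apply(x, 2, median), scale = apply(x, 2, mad))
  norms <- sqrt(rowSums(xs^2))
  xs / norms   # each row is divided by its own Euclidean norm
}

set.seed(3)
x <- rbind(matrix(rnorm(50 * 3), ncol = 3),
           c(20, 20, 20))                  # an extreme observation

s <- spatial_signs(x)

# Every row now has unit length, including the extreme one.
range(sqrt(rowSums(s^2)))
```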


References:
[1] Filzmoser, P., Maronna, R., & Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics & Data Analysis, 52(3), 1694-1711.
[2] Filzmoser, P., Garrett, R. G., & Reimann, C. (2005). Multivariate outlier detection in exploration geochemistry. Computers & Geosciences, 31(5), 579-587.
[3] http://cran.r-project.org/web/packages/mvoutlier/mvoutlier.pdf
