Academia.eduAcademia.edu

Collate ’clstab.R ’ ’easystab-package.R’

2013

R topics documented: easystab-package...................................... 2 from.hclust......................................... 2 from.kmeans........................................ 4 getOptTheta......................................... 6 make2dStabilityImage................................... 7 perturbationStability.................................... 9 plot.StabilityCollection................................... 11 1 2 from.hclust plot.StabilityReport..................................... 12 print.StabilityCollection................................... 14 print.StabilityReport.................................... 14 summary.StabilityCollection................................ 15 summary.StabilityReport.................................. 15 Index 16 easystab-package Clustering stability analysis using Bayesian perturbations. Description

Package ‘easystab’ March 7, 2013 Version 1.0 Date 2013-03-07 Title Clustering Perturbation Stability Analysis Description Clustering perturbation stability analysis toolkit Author Hoyt Koepke <[email protected]>, Zongjun Hu <[email protected]>, Bertrand Clarke <[email protected]> Maintainer Zongjun Hu <[email protected]> Depends R(>= 2.11.1), graphics, fields, plotrix, grDevices,RColorBrewer Suggests rJava, mlbench LazyData yes License GPL (>= 2) URL https://github.com/zongjunhu/easystab Collate ’clstab.R’ ’easystab-package.R’ NeedsCompilation yes Repository CRAN Date/Publication 2013-03-07 19:51:12 R topics documented: easystab-package . . . from.hclust . . . . . . from.kmeans . . . . . getOptTheta . . . . . . make2dStabilityImage perturbationStability . plot.StabilityCollection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . 2 . 4 . 6 . 7 . 9 . 11 2 from.hclust plot.StabilityReport . . . . . print.StabilityCollection . . . print.StabilityReport . . . . summary.StabilityCollection summary.StabilityReport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Index easystab-package 12 14 14 15 15 16 Clustering stability analysis using Bayesian perturbations. Description A collection of functions for analyzing the stability of one or more clusterings. The technique used is given in [[PAPER]]. Functions are given to: assess the behavior of a clustering under perturbations, estimate the correct number of clusters in a dataset, assess the relative stability of individual clusters within a dataset, and interface with the hclust and kmeans clustering packages. Details The exact method used is to perturb the cluster-to-point distances by scaling them with a shifted exponential random variable, then computing the probabilities of membership for each of the points under this perturbation. This is compared against a set of randomly sampled bootstrapped baselines to determine the final stability score. Package: Type: Version: Date: License: LazyLoad: easystab Package 1.0 2012-11-28 GPL (>= 2) yes References Hoyt Koepke, Bertrand Clarke. to appear: Stastical Analysis and Data Mining. See Also perturbationStability, make2dStabilityImage, plot.StabilityCollection, print.StabilityCollection, summary.StabilityCollection, plot.StabilityReport, print.StabilityReport, summary.StabilityReport, from.hclust, from.kmeans, getOptTheta from.hclust Adapts the output of hclust for input into perturbationStability. 3 from.hclust Description Adapts the output of hclust for use with perturbationStability to give more information about the behavior of the hierarchical clustering tree. Usage from.hclust(dx, hc, k = 1:10, method = "average") Arguments dx Distance matrix as produced by dists, giving the point-to-point distances. hc Hierarchical clustering as produced by hclust. k A list giving the numbers of clusters to cut the tree at; this is passed to cutree. Defaults to 1:10. method Method used to calculate the point-to-cluster distances from the point-to-point distance matrix dx given. Currently, the two supported methods are "average", which takes the average of the distances between the given point and all the points in the cluster (similar to average linkage), and "median", which uses the median distance to the points in the cluster. Value A list of clusterings suitable for use with perturbationStability. See Also easystab, perturbationStability, from.kmeans Examples ############################################################ ## Interfacing with the hierarchical clustering method library(easystab) ## Generate a fake dataset with 3 clusters cen <- matrix(c(0,-2,1,3,-3,1), ncol=2, byrow=TRUE) cl.size <- 100 X <- t(cbind(rbind(rnorm(cl.size,mean=cen[[1,1]]), rnorm(cl.size,mean=cen[[1,2]])), rbind(rnorm(cl.size,mean=cen[[2,1]]), rnorm(cl.size,mean=cen[[2,2]])), rbind(rnorm(cl.size,mean=cen[[3,1]]), rnorm(cl.size,mean=cen[[3,2]])))) dx <- dist(X) hc <- hclust(dx) cl_list <- from.hclust(dx,hc) stability_collection <- perturbationStability(cl_list) ## Information about the stability sequence print(stability_collection) summary(stability_collection) 4 from.kmeans ## Plot the stability sequence plot(stability_collection) ############################################################ ## A more detailed example using the UCI Wisconsin breast cancer dataset. library(mlbench) # Load and cluster the Breast Cancer dataset using correlation distance. data(BreastCancer) bcdata <- na.omit(BreastCancer) ## Use 1 - (x %*% y) / (|x|_2 |y|_2) to compute divergence X <- data.matrix(bcdata[,-c(1,11)]) Y <- X %*% t(X) Ynorm <- diag(diag(Y)^(-1/2)) dx <- as.dist(1 - Ynorm %*% Y %*% Ynorm) hc <- hclust(dx, method="complete") cl_list <- from.hclust(dx, hc, method = "median") stability_collection <- perturbationStability(cl_list) # Information about the stability sequence print(stability_collection) summary(stability_collection) layout(matrix(1:2, nrow=1, ncol=2)) plot(stability_collection) plot(stability_collection$best, classes = bcdata[,11]) from.kmeans Adapts a single clustering, or list of clusterings, from kmeans to one usable by perturbationStability. Description Given a clustering or list of clusterings, each from kmeans, returns a corresponding list of clusterings suitable for input to perturbationStability. Usage from.kmeans(X, kmeans_output) Arguments X Matrix or data frame object containing the clustered data. This is needed to compute the cluster to centroid distances. kmeans_output An output of kmeans objects, or list of such objects, each being the output of the kmeans function. from.kmeans 5 Value A clustering or list of clusterings that can be used as input to the perturbationStability function. See Also easystab, perturbationStability, from.hclust Examples library(easystab) X <- scale(iris[,c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]) km_list <- lapply(1:12, function(k) { kmeans(X, k, iter.max=20, nstart=30)}) stability_collection <- perturbationStability(from.kmeans(X, km_list)) ## plots the sequence and stability map of the 3 component case layout(matrix(1:2, nrow=1, ncol=2)) plot(stability_collection) plot(stability_collection[[3]], classes = iris[,"Species"]) ############################################################ ## Example with kmeans clustering on yeast data set yeast <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data") X <- scale(data.matrix(yeast[,-c(1,10)])) ## To replicate results in paper, please comment out the following lines ## and increase the number of clusters considered to 12. rowmask <- yeast[,10] %in% c("MIT", "ME1", "ME2", "ME3") yeast <- yeast[rowmask,] X <- X[rowmask,] km_list <- lapply(1:6, function(k) { kmeans(X, k, iter.max=20, nstart=100)}) stability_collection <- perturbationStability(from.kmeans(X, km_list)) print(stability_collection) layout(matrix(1:2, nrow=1, ncol=2)) ## Plot the whole stability collection and stability map of the best one plot(stability_collection) plot(stability_collection$best, classes = yeast[,10]) ############################################################ ## Example using from.kmeans on a single clustering ## Use X from previous yeast example ## Works on a single clustering 6 getOptTheta km_cl <- kmeans(X, 8, iter.max = 20, nstart=30) stability <- perturbationStability(from.kmeans(X, km_cl)) ## Plot the stability -- a single clustering, so displays it as a ## stability map plot. plot(stability, classes=yeast[,10]) getOptTheta Calculates the optimal prior parameter Description Calculate the optimal prior parameter theta by maximizing the difference in overall stability aganst the baseline distributions. The theta parameter indexes the strength of the perturbations, with smaller values translating into stronger perturbations. Usage getOptTheta(clusterings, seed = 0, n_baselines = 25) Arguments clusterings A single clustering or a list of clusterings. Each clustering of n data points into K clusters is specified primarily by matrix giving point to cluster distances. Specifically, clustering must contain an of n by K distance matrix giving the point to cluster distance (K can be different across clusterings). Optionally, an array of n integer labels labels in 1,...,K is expected; if not present, a warning is given and the labels are computed according to the minimum point-to-cluster distance. seed Random seed used for generating the baseline stability matrices. n_baselines The number of random baseline matrices to use in computing the stability scores. Increase this number to get more accuracy at the expense of speed. See Also easystab, perturbationStability Examples ################################################## ## These examples produce exactly the same results as those in ## perturbationStability. library(easystab) ## Generate a fake dataset with 3 clusters cen <- matrix(c(0,-2,1,2,-2,1), ncol=2, byrow=TRUE) 7 make2dStabilityImage cl.size <- 100 X <- t(cbind(rbind(rnorm(cl.size,mean=cen[[1,1]]), rnorm(cl.size,mean=cen[[1,2]])), rbind(rnorm(cl.size,mean=cen[[2,1]]), rnorm(cl.size,mean=cen[[2,2]])), rbind(rnorm(cl.size,mean=cen[[3,1]]), rnorm(cl.size,mean=cen[[3,2]])))) dists <- t(apply(X, 1, function(mu) {sqrt(rowSums((cen - mu)^2))})) labels <- c(rep(1,100), rep(2,100), rep(3,100)) ## Takes same input as theta <- getOptTheta(dists) ## Apply to just the distance matrix stability1 <- perturbationStability(dists, theta = theta) ## Ways to display information print(stability1) summary(stability1) plot(stability1, classes=labels) ## Add in our labels cl <- list(dists = dists, labels = labels) stability2 <- perturbationStability(cl) print(stability2) summary(stability2) plot(stability2, classes=labels) ## Now try several numbers of clusters using kmeans km_list <- lapply(1:8, function(k) { kmeans(X, k, iter.max=20, nstart=30)}) cl_list <- from.kmeans(X, km_list) stability_collection <- perturbationStability(cl_list) print(stability_collection) summary(stability_collection) plot(stability_collection) make2dStabilityImage Creates an image of the relative regions of stability for a 2d clustering. Description For a set of 2d centroids, shows the regions of stability and regions of instability for a given value of the perturbation hyperparameter. The values in this plot indicate the contribution to the overall stability or instability from a point located at that value. This function is provided to demonstrate the intuitive behavior of the method and to help analyze 2d datasets. 8 make2dStabilityImage Usage make2dStabilityImage(centroids, theta = 1, bounds = NULL, size = c(500, 500), buffer = 0.25) Arguments centroids Array of 2D centroid points, given as a K by 2 array or matrix. theta The rate parameter passed to the shifted exponential prior on the perturbations. theta must be non-negative; a warning is issued if theta < 0. The parameter indexes the strength of the perturbations, with smaller values translating into stronger perturbations. If NULL, theta is chosen by optimizing the overall stability against the baseline distributions as in getOptTheta. bounds The bounds of the image, given as a four element array of c(x_min, x_max, y_min, y_max). If bounds is NULL, it is calculated automatically from the centroids by giving a buffer region of buffer times the absolute spread of centroids. size Specify the x and y resolution of the image. Given as c(nx, ny); defaults to c(500,500). buffer If bounds is NULL, then gives the height or width of the margins of the image containing the centroids. For each x and y coordinates, this margin is equal to buffer times the difference between the minimum and maximum values present in the list of centroids. Value A list with elements stability, x, y, bounds, centroids, and theta. stability is the 2d image of size size to be plotted as the map of stable and unstable regions in the 2d space, x and y give the x and y positions in stability, theta gives the original theta passed to the image, and bounds contains the c(x_min, x_max, y_min, y_max) bounds of the image. See Also easystab Examples ## Display the behavior of a set of centroids library(easystab) cen <- matrix(c(0,-2,1,2,-2,1), ncol=2, byrow=TRUE) #to generate image with higher resolution, use larger size in the following line Z <- make2dStabilityImage(cen, buffer=2, size=c(200,200)) image(Z$x, Z$y, Z$stability) points(Z$centroids) ## Something more detailed; display how things change by theta layout(matrix(1:4, ncol = 2, byrow=TRUE)) for(i in 1:4) { 9 perturbationStability t <- (i - 1) * 0.5 Z <- make2dStabilityImage(cen, theta=t, buffer=2, size=c(200,200)) image(Z$x, Z$y, Z$stability, main = sprintf("Theta = %1.2f.", t), xlab = "x", ylab="y") } perturbationStability Calculate clustering perturbation stability. Description Calculates the stability of clusterings under a non-parametric Bayesian perturbation as described in in [[PAPER]]. The exact method used is to perturb the cluster-to-point distances by scaling them with a shifted exponential random variable, then computing the probabilities of membership for each of the points under this perturbation. This is compared against a set of randomly sampled bootstrapped baselines to determine the final stability score. Usage perturbationStability(clusterings, n_baselines = 25, seed = 0, theta = NULL, test_pvalue = 0.05) Arguments clusterings A point-to-cluster distance matrix, a single clustering, or a list of clusterings. Each clustering of n data points into K clusters is specified primarily by matrix giving point to cluster distances. Specifically, clustering must contain an of n by K distance matrix giving the point to cluster distance (K can be different across clusterings). Optionally, an array of n integer labels labels in 1,...,K is expected; if not present, a warning is given and the labels are computed according to the minimum point-to-cluster distance. If only the distance matrix is given, then points are assigned to their minimum distance cluster with no warning. seed Random seed used for generating the baseline stability matrices. n_baselines The number of random baseline matrices to use in computing the stability scores. Increase this number to get more accuracy at the expense of speed. theta The rate parameter passed to the shifted exponential prior on the perturbations. theta must be non-negative; a warning is issued if theta < 0. The parameter indexes the strength of the perturbations, with smaller values translating into stronger perturbations. If NULL, theta is chosen by optimizing the overall stability against the baseline distributions as in getOptTheta. test_pvalue When selecting the best clustering among candidates with a differing number of clusters, a one-sided t-test is performed to choose the clustering having the smallest number of clusters and statistically indistinguishable from the clustering with the highest score. This is the level at which this t-test indicates that two stability scores are statistically indistinguishable. 10 perturbationStability Value Returns an object of type StabilityCollection if a list of clusterings is supplied, otherwise returns an object of type StabilityReport. A StabilityCollection is essentially a list of StabilityReport objects corresponding to the original list of clusterings. A StabilityReport object contains the original dists, labels (possibly calculated), the scalar stability score stability, the empirical collection of stability scores scores, the theta parameter used or found theta, the individual membership probabilities of the points under perturbation, the stability_matrix the sorted stability matrix used for plotting the behavior of the clustering. print, summary, and plot methods are provided. See Also easystab, from.hclust, from.kmeans, getOptTheta, make2dStabilityImage Examples ## Generate a fake dataset with 3 clusters cen <- matrix(c(0,-2,1,2,-2,1), ncol=2, byrow=TRUE) cl.size <- 100 X <- t(cbind(rbind(rnorm(cl.size,mean=cen[[1,1]]), rnorm(cl.size,mean=cen[[1,2]])), rbind(rnorm(cl.size,mean=cen[[2,1]]), rnorm(cl.size,mean=cen[[2,2]])), rbind(rnorm(cl.size,mean=cen[[3,1]]), rnorm(cl.size,mean=cen[[3,2]])))) dists <- t(apply(X, 1, function(mu) {sqrt(rowSums((cen - mu)^2))})) labels <- c(rep(1,100), rep(2,100), rep(3,100)) ## Apply to just the distance matrix stability1 <- perturbationStability(dists) ## Ways to display information print(stability1) summary(stability1) plot(stability1, classes=labels) ## Add in our labels cl <- list(dists = dists, labels = labels) stability2 <- perturbationStability(cl) print(stability2) summary(stability2) plot(stability2, classes=labels) ## Now try several numbers of clusters using kmeans km_list <- lapply(1:8, function(k) { kmeans(X, k, iter.max=20, nstart=30)}) cl_list <- from.kmeans(X, km_list) stability_collection <- perturbationStability(cl_list) print(stability_collection) summary(stability_collection) plot(stability_collection) 11 plot.StabilityCollection plot.StabilityCollection Plot the stability scores produced by perturbationStability as a sequence of box plots. Description Summary display of the output of the perturbationStability function. Plot the stability scores produced by perturbationStability as a sequence of box plots. Usage ## S3 method for class ’StabilityCollection’ plot(x, sort = TRUE, prune = FALSE, label.indices = NULL, ...) Arguments x The output of perturbationStablity – a list of clusters with perturbation stability analyses. Additionally: Set the name attribute of a specific clustering in order to change the corresponding label on the box plots. For example, clusterings[[5]]$label <- "Clust5" sets the displayed label of that clustering, overriding the generated labels. Set the color attribute of a specific clustering in order to change the color of boxplot. The default is to color the "best" one red, and the rest black (see color.best below). This overrides this behavior. sort Whether to sort the results in ascending order by the number of clusters in the data, then by stability scores within the clusters. prune If sort is TRUE, and multiple clusterings are given for a specific number of clusters, then show only the most stable one from each group. For example, if there were three clusterings in the collection that had 5 clusters, only the most stable of those three would be displayed. label.indices If label.indices is TRUE, then the original indices from clusterings is included in the label for each box plot; if FALSE, they are not included. If label.indices is NULL (default), then they are included only if items in the graph are reordered. Note that setting the label attribute on the clusterings input overrides this. ... Additional parameters passed to the boxplot function. See boxplot for more information. See Also easystab 12 plot.StabilityReport Examples ## Generate a fake dataset with 3 clusters cen <- matrix(c(0,-2,1,2,-2,1), ncol=2, byrow=TRUE) cl.size <- 100 X <- t(cbind(rbind(rnorm(cl.size,mean=cen[[1,1]]), rnorm(cl.size,mean=cen[[1,2]])), rbind(rnorm(cl.size,mean=cen[[2,1]]), rnorm(cl.size,mean=cen[[2,2]])), rbind(rnorm(cl.size,mean=cen[[3,1]]), rnorm(cl.size,mean=cen[[3,2]])))) ## Now try a range of numbers of clusters using kmeans km_list1 <- lapply(1:6, function(k) { kmeans(X, k, iter.max=20, nstart=30)}) stabilities1 <- perturbationStability(from.kmeans(X, km_list1)) plot(stabilities1) ## Now plot each K with multiple runs of the clustering function. ## Now try several numbers of clusters using kmeans km_list2 <- lapply(0:17, function(k) { kmeans(X, 1 + (k %% 6))}) stabilities2 <- perturbationStability(from.kmeans(X, km_list2)) plot(stabilities2) ## Plot the same thing, except without grouping by number of clusters plot(stabilities2, sort=FALSE) ## If two clusterings have the same number of clusters, plot only the ## most stable one. plot(stabilities2, prune=TRUE, sort=FALSE) ## Name the best one stabilities2[[stabilities2$best.index]]$name <- "BEST!!!" plot(stabilities2) plot.StabilityReport Display the stability of a clustering as a heat map plot. Description Plots the stability of a clustering as a heat map plot, showing the relative stability of the different clusters, the data points, and the overall behavior of the clustering. The input is taken as a single clustering analysis as given by perturbationStability. Usage ## S3 method for class ’StabilityReport’ plot(x, classes = NULL, class_colors = NULL, sort.clusters = 0, ...) 13 plot.StabilityReport Arguments x A StabilityReport object, as given by an output of perturbationStability. classes Auxiliary class labels for the data points, possibly from known classes or other clusterings. The classes must be integers in 1,...,L. If NULL, this column is not plotted. class_colors Colors to use when plotting the auxiliary class labels. If the given classes are in 1,...,L, it must be a list of at least L colors. If NULL, RColorBrewer is used to choose representative colors. Ignored if classes is NULL. sort.clusters Whether to sort the clusters in the stability map image for aesthetic reasons. 0 (default) means to not reorder them, 1 orders them by cluster size, and 2 orders them by average stability. ... optional arguments passed to internal functions Details If classes are supplied (possibly from known classes or from another clustering) version, they are plotted alongside the heatmap plot, with class membership indexed by color. See Also easystab Examples ## Generate a fake dataset with 3 clusters cen <- matrix(c(0,-2,1,2,-2,1), ncol=2, byrow=TRUE) cl.size <- 100 X <- t(cbind(rbind(rnorm(cl.size,mean=cen[[1,1]]), rnorm(cl.size,mean=cen[[1,2]])), rbind(rnorm(cl.size,mean=cen[[2,1]]), rnorm(cl.size,mean=cen[[2,2]])), rbind(rnorm(cl.size,mean=cen[[3,1]]), rnorm(cl.size,mean=cen[[3,2]])))) dists <- t(apply(X, 1, function(mu) {sqrt(rowSums((cen - mu)^2))})) labels <- c(rep(1,100), rep(2,100), rep(3,100)) ## Apply to just the distance matrix stability1 <- perturbationStability(dists) ## Ways to display information print(stability1) summary(stability1) plot(stability1, classes=labels) ## Add in our labels cl <- list(dists = dists, labels = labels) stability2 <- perturbationStability(cl) print(stability2) summary(stability2) plot(stability2, classes=labels) 14 print.StabilityReport print.StabilityCollection Print a brief summary of the stability of a clustering collection. Description Print a brief summary of the stability of a clustering collection. Usage ## S3 method for class ’StabilityCollection’ print(x, ...) Arguments x The output of perturbationStablity – a list of clusters with perturbation stability analyses. ... optional arguments passed to internal functions See Also easystab print.StabilityReport Print a brief summary of the stability of an undividual clustering under perturbation. Description Print a brief summary of the stability of an undividual clustering under perturbation. Usage ## S3 method for class ’StabilityReport’ print(x, ...) Arguments x A StabilityReport object, as given by an output of perturbationStability. ... optional arguments passed to internal functions See Also easystab summary.StabilityCollection 15 summary.StabilityCollection Print a detaild summary of the stability of a clustering collection. Description Print a detaild summary of the stability of a clustering collection. Usage ## S3 method for class ’StabilityCollection’ summary(object, ...) Arguments object The output of perturbationStablity – a list of clusters with perturbation stability analyses. ... optional argements passed to internal functions See Also easystab summary.StabilityReport Print a summary of the stability of an undividual clustering under perturbation. Description Print a summary of the stability of an undividual clustering under perturbation. Summary includes individual cluster stabilites. Usage ## S3 method for class ’StabilityReport’ summary(object, ...) Arguments object A StabilityReport object, as given by an output of perturbationStability. ... optional arguments passed to internal functions See Also easystab Index boxplot, 11 cutree, 3 easystab, 3, 5, 6, 8, 10, 11, 13–15 easystab (easystab-package), 2 easystab-package, 2 from.hclust, 2, 2, 5, 10 from.kmeans, 2, 3, 4, 10 getOptTheta, 2, 6, 8–10 make2dStabilityImage, 2, 7, 10 perturbationStability, 2, 3, 5, 6, 9 plot.StabilityCollection, 2, 11 plot.StabilityReport, 2, 12 print.StabilityCollection, 2, 14 print.StabilityReport, 2, 14 summary.StabilityCollection, 2, 15 summary.StabilityReport, 2, 15 16