2003, Journal of Statistical Software
A variant of the boxplot is proposed in which the sides contain the information of a percentile plot (which is equivalent to the empirical cumulative distribution function). Unlike boxplots, there is no question about how long to draw the whiskers, nor is there loss of information due to grouping. Side-by-side comparisons of distributions are especially effective. In spite of including more detail, the impact on statistically-untrained readers remains similar to that of traditional boxplots.
Introduction
Consider the question of how to graphically communicate the information in several related sets of continuously-distributed data. This article introduces the box-percentile plot, a modified version of the boxplot that has two distinct advantages over previous versions. One is that there are no questions about how it should be drawn: the plot configuration is justified by the properties of empirical distributions and does not require arbitrary choices about box configuration or whisker length. The other is that it includes details which are useful to statistically-trained readers without weakening the impact on statistically-untrained readers, so it is not necessary to choose between showing general behavior or detail (Tukey, 1977). The box-percentile plot can be used as both an exploratory tool and a space-saving method of publishing information in non-statistical scholarly articles which otherwise might take two or more graphs to convey.
Unlike the boxplot, which uses width only to emphasize the middle 50 per cent of the data, the box-percentile plot uses width to encode information about the distribution of the data over the entire range of data values. Figure 1 plots the highest points in 50 states and the heights of 219 volcanos (the same data used by Tukey (1977) for exhibit 5 of his Chapter 2) using box-percentile plots and standard boxplots. It is clear that the box-percentile plots convey the same graphical impression as the boxplots and contain additional information about the shape of the distributions. We will look at these plots in more detail in Section 4.2.
Figure 1: A comparison of boxplots and box-percentile plots for the highest points in the fifty states and the heights of the 219 highest volcanos (Tukey (1977), Exhibit 2.5). The box-percentile plots provide more detailed information about the distribution of the data.
Figure 4: This plot shows the effect of a few outliers on the box-percentile plot for normal data. Compare this plot to the box-percentile plot for normal data shown in Figure 3.
The idea behind constructing a box-percentile plot is simple. At any height the width of the irregular "box" is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th, and 75th percentiles are marked with line segments across the box. Other percentiles may be emphasized if desired.
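The width rule just described can be restated compactly. The notation here is introduced only for illustration: $P(y)$ denotes the percentile (on a 0 to 100 scale) of the height $y$.

```latex
\[
\text{width}(y) \;\propto\; \min\bigl(P(y),\; 100 - P(y)\bigr)
\]
```

The right-hand side is largest at the median, where $P(y) = 50$, and shrinks to zero toward either extreme.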
To illustrate the effectiveness of the additional information contained in the box-percentile plot (compared to the boxplot), consider the three artificial data sets shown as histograms in Figure 2. The distributions are clearly different. The first data set is from a standard normal distribution, the second data set is from a uniform distribution with an extreme outlier on each end, and the third data set comes from a tri-modal distribution. As shown in Figure 3, the boxplots of these data sets are indistinguishable while the box-percentile plots allow the observer to distinguish and identify the three distributions. We have obviously constructed this data to emphasize a strength of the box-percentile plot. However, in Section 4 we will use real data sets to further illustrate how the information provided by box-percentile plots can be used to solve problems that boxplots are unable to address.
Figure 2
Figure 3
Background
Tukey (1990) reminds us of the significance of impact, the distinction between prospecting and transfer of information, and the importance of recognizing the purpose to be served. Side-by-side boxplots indicate both the general magnitude of the observations in each set and permit rough comparisons between the sets. Even statistically-untrained readers can quickly grasp the basic distributional information displayed by the plot. The typical values and the general differences in the distribution of the values from set to set are easy to see. Then, only a small amount of statistical training is needed to be able to properly interpret the information that the quartiles and extreme values provide about the dispersion of the data. Thus, for communicating data, especially to scholarly but statistically-untrained readers, boxplots are often very effective and have become a standard graphical tool in many sciences.
Given the effectiveness of boxplots at communicating a few salient distributional details, the question naturally arises whether boxplots can be modified to convey more information to the statistically-trained reader without confusing the picture for the uninitiated. Width has been used to encode information about sample size or confidence intervals (McGill, Tukey, and Larson, 1978). Shading has also been used to indicate confidence intervals (Benjamini, 1988). Another modification is to use width to indicate an estimated probability density. Unfortunately, the estimated probability density depends strongly on the method used to estimate it, so one set of data could yield widely varying plots (Benjamini, 1988). Variants of the boxplot without an actual box have also been proposed (Tufte, 1983; Tukey, 1977; Tukey, 1990). So far, none of these variants has found its way into common usage.
If the underlying principle of the basic, unmodified, boxplot method of visual presentation is accepted, there remains the question of precisely how to construct the plot (Frigge, Hoaglin, and Iglewicz, 1989). Boxplots emphasize a few key percentiles and mark them on the vertical scale. Horizontal marks are used primarily to emphasize the location of the middle 50 per cent of the data. They group data in the middle half into quartiles and group much of the rest into whiskers. Problems with boxplot construction are that the length of the whiskers and the treatment of extreme values are somewhat arbitrarily determined and no one definition has been accepted by all users. One way is to draw whiskers out to the 10th and 90th percentiles and mark each of the more extreme observations individually (Cleveland, 1985). In other versions the whiskers extend to the most extreme values (Moore and McCabe, 2003, p. 46; Tukey, 1977, p. 40; Tufte, 1983), or you may use discretion to choose how many extreme values to indicate separately, in which case the whiskers extend out to them (Moore and McCabe, 2003, p. 47; Tukey, 1977, p. 41). Still another version (Tukey's "schematic plot", 1977, pp. 44, 47; Chambers, Cleveland, Kleiner, and Tukey, 1983) uses a far more complicated method to determine the lengths of the whiskers. The definitions of "steps" and "adjacent values" in this context are arbitrary. Lee and Tu (1997) developed a general graphical tool, the BLiP plot, which can produce many variations (34 options) on most standard distributional graphs, including boxplots. In the examples in their paper they use the .025 and .975 quantiles as the boundary for the whiskers in their boxplots. There is no general agreement as to what constitutes the basic boxplot. Thus when you see a boxplot you must scrutinize the accompanying text to determine how the plot has been constructed. The box-percentile plot eliminates the need to make these arbitrary choices.
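To make the arbitrariness concrete, the following sketch applies four of the whisker conventions mentioned above to one sample. The sample, the seed, and the row labels are assumptions made for illustration; `boxplot.stats` with `coef = 1.5` is used here as a stand-in for the "adjacent values" rule of the schematic plot.

```r
# Four whisker conventions applied to the same (hypothetical) skewed sample.
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rexp(200)

whiskers <- rbind(
  p10_p90      = quantile(x, c(0.10, 0.90), names = FALSE),    # 10th/90th percentiles
  min_max      = range(x),                                      # most extreme values
  tukey_fences = boxplot.stats(x, coef = 1.5)$stats[c(1, 5)],   # adjacent values
  p025_p975    = quantile(x, c(0.025, 0.975), names = FALSE)    # .025/.975 quantiles
)
colnames(whiskers) <- c("lower", "upper")
print(whiskers)
```

Each row is a defensible choice of whisker ends, yet the rows generally differ, which is why the accompanying text of a boxplot has to be consulted.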
Neither the boxplot nor the box-percentile plot is intended as a substitute for plots of estimated probability density functions. They do not plot densities and cannot be interpreted with the sophisticated (integral calculus) idea that area represents observations. The variable width BLiP plot (Lee and Tu, 1997) uses a simple central difference estimator of the density function to produce graphs that look similar to box-percentile plots, but the encoded information and the interpretation are very different. As with all density estimates, the variable width BLiP plot requires a smoothing parameter and, depending on the choice of the parameter, can have several different representations of the same data. All distributional graphs that are based on density estimates entail arbitrary choices, and perhaps the most appealing feature of box-percentile plots is that they do not require such choices.
There are other types of diagrams which effectively display individual data sets without grouping and without questions about how they should be defined. In particular, percentile plots and empirical cumulative distribution functions display all the data and have straightforward definitions. For comparing several data sets, however, neither plot is suitable. If the graphs for several data sets are side-by-side, the critical horizontal comparison is difficult to make, although summary lines can help (Cleveland, 1985, Fig. 3.19). Plotting several data sets on the same percentile plot or empirical cumulative distribution function plot may lead to problems in detection and evaluation of differences between curves (Cleveland, 1985, Figs. 3.75 and 1.5).
The Box-Percentile Plot
The box-percentile plot combines the virtues of boxplots (ease of interpretation, ability to compare several data sets simultaneously) with those of percentile plots (display all the data, no arbitrary choices in construction). The idea is to use the width (as in the boxplot) to emphasize the middle of the data and to continue to use width (but not in an arbitrary fashion) to give less emphasis to the more extreme data (as whiskers and outliers are given less emphasis in boxplots). Thus the box-percentile plot "boxes" are wide in the middle (like boxplot boxes), narrow away from the middle (as are whiskers) and very narrow at the extremes. Unlike boxplots, the width contains precise information about the distribution of the data. They contain all the information of percentile plots and permit an easy and accurate assessment of symmetry. Also, since the data are not grouped, grouping can never conceal significant information as it would in some examples (Tukey's (1977) weight of nitrogen example, p. 50).
Let the number of observations be $n$ and the observed values be ordered lowest-to-highest as $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$. Each $y$-value is plotted as a distinct point, so no information is lost. Let the desired maximum width of the box be $w$. If the data is a random sample, under rather general conditions it can be proved that the expected probability between the $i$th and $(i+1)$st order statistic is $1/(n+1)$, so in percentile plots the data-point $y_{(k)}$ is marked at height $y_{(k)}$ above the horizontal coordinate $k/(n+1)$. In box-percentile plots the data-point $y_{(k)}$ is marked at height $y_{(k)}$ at distance $kw/(n+1)$ on either side of a vertical axis of symmetry, if $y_{(k)}$ is less than or equal to the median. If $y_{(k)}$ is greater than the median, it is plotted at height $y_{(k)}$ at distance $(n+1-k)w/(n+1)$ on either side of the vertical axis of symmetry. If the data is a population rather than a sample, then division by $n+1$ is inappropriate. In that case $y_{(k)}$ could be plotted at distance $(k-1)w/(n-1)$ from the axis for $y_{(k)}$ less than or equal to the median and at distance $(n-k)w/(n-1)$ if $y_{(k)}$ is above the median.
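The plotting rule above can be sketched in a few lines of R. The helper name `bp_halfwidth` and its argument defaults are hypothetical, chosen for illustration; this is not the authors' `bpplot` code, only a direct transcription of the distances given in the paragraph (the population branch assumes at least two observations).

```r
# Distance from the axis of symmetry for each ordered observation.
# bp_halfwidth is a hypothetical helper, not part of bpplot.
bp_halfwidth <- function(y, w = 1, population = FALSE) {
  y <- sort(y)
  n <- length(y)
  k <- seq_len(n)
  lower <- y <= median(y)                  # at or below the median
  if (population) {
    d <- ifelse(lower, (k - 1) * w / (n - 1), (n - k) * w / (n - 1))
  } else {
    d <- ifelse(lower, k * w / (n + 1), (n + 1 - k) * w / (n + 1))
  }
  data.frame(height = y, halfwidth = d)
}

# Five observations, treated as a sample:
# distances are 1/6, 1/3, 1/2, 1/3, 1/6 (with w = 1)
bp_halfwidth(c(3, 1, 4, 1.5, 9))
```

With `population = TRUE` the same five values give distances 0, 0.25, 0.5, 0.25, 0, so the outline closes to a point at the smallest and largest observations.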
Examples
Simulated Data
Box-percentile plots not only facilitate comparison of a few key percentiles, as is the case with boxplots, they also permit comparison of complete distributions. However, if one wishes to ignore the additional information contained in box-percentile plots they can be used in the same manner as boxplots with no additional effort on the part of the reader. On the other hand, if one wishes to use the additional information provided by box-percentile plots then a small amount of training is needed. This subsection uses simulated data to help familiarize the reader with some of the common patterns that emerge in box-percentile plots. The following subsections will illustrate the use of box-percentile plots with real data sets. Figure 2 shows the histograms of three simulated datasets and Figure 3 shows the corresponding boxplots and box-percentile plots. There are 300 observations in each sample so the plots have settled down to a fairly stable shape.
The first data set in Figure 2 is from a normal distribution. The corresponding box-percentile plot in Figure 3 shows a typical box-percentile plot for normal data. There is a single mode (no flat vertical lines) that occurs at the median. The plot is vertically symmetric about the median and the sides of the box are concave.
The second data set in Figure 2 is from a uniform distribution with two single outliers, one in each direction. The corresponding box-percentile plot in Figure 3 indicates there are outliers and the main body of the data is uniformly distributed because the sides are straight.
Outliers cause a long, thin line leading from the main body of the plot to the outliers. If there are several outliers in one direction, the "arm" of the box-percentile plot may have some width, but the outliers are still easily identifiable as unusual values that may not belong to the main body of data. Figure 4 shows a normal data set with several outliers in one direction. Notice how the main body of the box-percentile plot has the typical shape of a normal distribution while the narrow arm leading to the outliers easily identifies that set of points as outliers, without an arbitrary definition of "outlier." With regular boxplots, if some of the observations are regarded as outliers, extending the whiskers out to the extreme values may give a misleading impression of where the data lie, so outliers are generally marked individually. This is not a problem with box-percentile plots. However, if one wishes to define outliers and then emphasize them, the box-percentile plot can be modified to do this.
Figure 4
Regardless of the outliers, the box-percentile plot of the uniform data with outliers in Figure 3 still has the characteristic "diamond" shape of the uniform distribution (the percentile plot of a uniform distribution is linear). Compare the box-percentile plot of the uniform data with outliers from Figure 3 with the box-percentile plot of normal data with outliers in Figure 4. It is easy to distinguish between the distributions of the main body of the data, even in the presence of outliers.
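The straight sides of the "diamond" follow from the linearity of the percentile plot. As a small check, evenly spaced values can stand in for an idealized uniform sample (an assumption made here for illustration; real uniform samples are only approximately linear):

```r
# Evenly spaced values as an idealized stand-in for uniform data.
y <- sort((1:100) / 21)
n <- length(y)
p <- seq_len(n) / (n + 1)          # horizontal coordinate k/(n + 1)

# Slope of the percentile plot between consecutive points.
slope <- diff(y) / diff(p)
range(slope)                       # equal endpoints mean a straight line
```

Because the slope is the same between every pair of neighbouring points, the outline rises in a straight line to the median and falls in a straight line after it.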
The tri-modal box-percentile plot in Figure 3 illustrates the typical feature of a multimodal distribution: vertical lines in the outline of the box. The "valleys" between modes have few observations relative to the "peaks", so there is little change in the percentiles in those regions, which translates into flat, near-vertical lines. Figure 5 shows the box-percentile plot of a $\chi^2$ data set. This illustrates the typical pattern of a box-percentile plot for skewed data. Compare Figure 5 to Figure 4 and it is easy to see the difference between a data set that is skewed and one that has outliers.
Figure 5
Volcanos
Figure 1 shows box-percentile plots and boxplots for two datasets, the highest point in each of the 50 states and the heights of 219 volcanos. In the boxplots it appears that the states data is skewed toward the higher values and there are a few outliers in the volcano data. The boxplots give no detailed information, other than location, about how these two datasets are related.
The box-percentile plots provide a far more informative view of the data. We can see from the box-percentile plot that the states data is bimodal rather than skewed. The shorter heights appear somewhat uniformly distributed between 0 and about 8000 feet. There is another group of states with heights between 13000 and 15000 feet and a single outlier (Alaska) at 21000 feet. Note that the boxplot for the states data gives the impression that there may be several states with maximum heights between 8000 and 11000 feet (the upper part of the box) while the box-percentile plot clearly shows that this is not the case.
Tree Invasions
Box-percentile plots provide a means to easily compare distributions and, as any good exploratory tool should, they can lead researchers to ask new questions about the data. A tree invasion is an encroachment of trees into a region where they have not traditionally grown. This is a serious problem in many parts of the world due to the loss of farming and grazing land. Figure 6 shows several data sets from a study of tree invasions in the upper Madison valley of Montana (Hansen, Wyckoff, and Banfield, 1995). A sample of trees was cored (a small part of the trunk is removed so the age of the tree can be determined by counting the tree rings) at several different invasion sites to determine when invasions occurred. Invasions show up as modes in the age distribution of the trees.
Figure 6
One of the more noticeable features of Figure 6 is the difference between the shapes of the plots, which indicates different invasion histories. The trees at site 1 have a fairly uniform age distribution, indicating a fairly constant rate of germination. In contrast, site 2 had a small invasion (sudden increase in trees) in the early 1900s while site 3 had a strong invasion during the mid to late 1930s. Sites 5 and 6 have similarly shaped distributions that start at different times. Could there be something about the morphology of the land that could cause similar patterns? Sites 3 and 4 have similar distributions until the strong invasion at site 3 in the late 1930s, during which time site 4 had virtually no increase in trees germinating. Is there a difference between the fire or grazing histories of these two sites during that period? Also, the shape of the distributions for sites 3 and 4 after 1950 is different: the rate of germination at site 3 is smaller than that at site 4 (the width of the box is not changing as fast for site 3). Could this indicate that the sudden influx of trees at site 3 during the 1930s is suppressing the germination of new trees in the years that follow?
Box-percentile plots are, of course, not the only graphical tool that should be used in the exploration and analysis of this data. They are, however, a powerful tool for an initial view of what is occurring, in terms of tree invasions, over a large geographic area. In the hands of geographers familiar with the potential causes of tree invasions, climate data, and grazing histories of the sites, the differences in the age distributions that can be seen and compared across sites using box-percentile plots provide insights and comparisons that are simply not available with other methods.
The Box-Percentile Plot Code
An R function has been written to implement box-percentile plots. The code, which is short and relatively easy to understand, is available as an ASCII file at the Journal of Statistical Software (www.jstatsoft.org). A version of bpplot (based on an earlier version of our code and modified by Frank Harrell) is also available from the R archive (lib.stat.cmu.edu/R/CRAN/) in the package Hmisc. Box-percentile plots may also be created in Rweb (www.math.montana.edu/Rweb), a Web-based interface to R (Banfield, 1999). To do so, attach the box-percentile plot function with the command:
    attach("/export/faculty/umsfjban/bpplot.rda")
The box-percentile R function, bpplot, has behavior similar to that of the boxplot function. It will accept an arbitrary number of variables (or a single list of variables) and it allows you to title the plot and label the axes. Besides the list of variables to be plotted, there are five named arguments that may be supplied:
• names . . . a character vector to label the individual plots
• main . . . a character string to title the plot (default is "Box-Percentile Plot")
• xlab . . . a character string to label the x-axis (default is no label)
• ylab . . . a character string to label the y-axis (default is "Percentiles")
• population . . . a logical variable indicating whether or not the data represent a population (default is F). The last paragraph of Section 3 discusses how a box-percentile plot is calculated for a population.
The following code snippet is from the help page for the R function boxplot, modified to compare boxplots with box-percentile plots. If you are using Rweb, uncomment the first line of code (remove the # symbol) to load the box-percentile plot function.
    #attach("/export/faculty/umsfjban/bpplot.rda")
    mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
                 T5 = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
    boxplot(data.frame(mat), main = "Boxplots")
    bpplot(data.frame(mat), main = "Box-Percentile Plots")
The following R code illustrates some of the options that may be used with bpplot.
    x1 <- rnorm(100)
    x2 <- runif(200, -1, 3)
    x3 <- rchisq(150, 3)
    alist <- list(x1, x2, x3)
    bpplot(x1, x2, x3)
    bpplot(alist, names=c("Normal", "Uniform", "Chi Squared"))
    bpplot(Normal=x1, Uniform=x2, ChiSquared=x3)
    bpplot(x1, x2, x3, xlab="Distributions", ylab="Quantiles", population=T)
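For readers who want to see roughly what happens inside such a function, here is a minimal base-graphics sketch that draws a single box-percentile outline for a sample. The name `draw_bp`, its arguments, and the use of `quantile(..., type = 6)` (whose plotting positions are k/(n + 1)) are assumptions made for illustration; this is not the bpplot source and omits its options.

```r
# Sketch: draw one box-percentile outline for a sample with base graphics.
# draw_bp is a hypothetical name; this is not the bpplot source.
draw_bp <- function(y, w = 1, main = "Box-Percentile Plot (sketch)") {
  y <- sort(y)
  n <- length(y)
  k <- seq_len(n)
  # Distance from the axis: k*w/(n+1) up to the median, (n+1-k)*w/(n+1) above.
  half <- ifelse(y <= median(y), k, n + 1 - k) * w / (n + 1)
  plot(c(-w / 2, w / 2), range(y), type = "n", xaxt = "n",
       xlab = "", ylab = "Percentiles", main = main)
  lines(-half, y)                           # left side of the outline
  lines(half, y)                            # right side of the outline
  # Mark the 25th, 50th and 75th percentiles with segments across the box.
  q <- quantile(y, c(0.25, 0.50, 0.75), names = FALSE, type = 6)
  hq <- approx(y, half, xout = q, ties = "ordered")$y
  segments(-hq, q, hq, q)
  invisible(data.frame(height = y, halfwidth = half))
}

set.seed(2)                                 # arbitrary seed
draw_bp(rnorm(300))
```

The function returns the plotted coordinates invisibly, so the outline can be inspected or reused without redrawing.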
Conclusion
The box-percentile plot is not an all-purpose graph, but it does everything the boxplot does, and more, without being more difficult to interpret. And, fortunately, its construction is based on principles of mathematical statistics, not on arbitrary rules. There is no question about how long to draw the whiskers or how to plot the outliers. Thus this new type of graph not only has a visual impact which provides incisive comparisons, it presents the information in a statistically justified manner.
Figure 5: A box-percentile plot (panel title: Chi Squared Data) showing the typical pattern for skewed distributions. Compare this plot to the box-percentile plot for normal data with outliers shown in Figure 4.
Figure 6: Box-percentile plots for tree invasions at six different sites in the Madison Valley of Montana. A tree invasion is encroachment of trees into a region where they have not traditionally grown. One of the more noticeable features of these box-percentile plots is the difference between the shapes of the plots which indicate different invasion histories.