Chapter 2 - Representing Sample Data: Graphical Displays
A set of data on its own can be hard to interpret. Therefore, once we have a
sample of data, it is very useful to look at graphical and numerical summaries
of it in order to understand its important features. This is an
important first step in any statistical analysis.
Graphical displays
(i) Bar Charts. If the variables we are observing are discrete and the data
we are collecting are counts corresponding to the numbers of sample
members in each discrete category then an appropriate plot is a bar
chart. A rectangular bar is constructed for each category, the length
of which is proportional to the count. Bars are drawn separated from
each other and, where order in the categories does not matter, in order
of decreasing size from left to right. Using our simulated voting data
we can construct the following bar chart:
[Figure: bar chart of the opinion poll data; x-axis: Party
(1: Conservative; 2: Labour; 3: Liberal_democrats; 4: Other), y-axis: Count.]
R code:
plot(voters, ylim=c(0, 400), xlab="Party", ylab="Count",
     main="Bar chart of opinion poll data",
     sub="1: Conservative; 2: Labour; 3: Liberal_democrats; 4: Other")
The chart clearly shows the numbers of voters supporting each party
and the differences between them. Note that we could have divided
each of the four counts by n = 1000 and plotted them as proportions
on the vertical axis. If our sample is representative of the population of
voters as a whole then it seems sensible to regard the observed sample
proportions of voters supporting each party as estimates of the true
proportions supporting each party in the population. The merits of
this will be discussed in more detail later in the module.
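The conversion from counts to sample proportions described above can be
sketched in R as follows. The counts used here are hypothetical placeholders,
chosen only so that they sum to n = 1000; they are not the simulated opinion
poll data, and the object names are illustrative.

```r
# Hypothetical counts for the four parties (placeholders, NOT the poll data)
counts <- c(Conservative = 380, Labour = 350, Liberal_democrats = 170, Other = 100)

n <- sum(counts)        # sample size, 1000 for these placeholder counts
props <- counts / n     # sample proportions; these sum to 1

# Separated bars drawn in decreasing order of size, on the proportion scale
barplot(sort(props, decreasing = TRUE), ylim = c(0, 0.5),
        xlab = "Party", ylab = "Proportion",
        main = "Bar chart of sample proportions (hypothetical counts)")
```

The shape of the chart is unchanged by the rescaling; only the vertical axis
differs.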
(ii) Histograms. This is an appropriate display when the data are mea-
sured on a continuous scale. To construct a histogram for a sample of
size n we go through the following steps:
(1) Choose an origin $t_0$ and a bin width $h$ and use them to define a
mesh of equally-spaced points which cover the range of the data,
i.e. $t_j = t_0 \pm jh$ for $j = 1, 2, 3, \ldots$
(2) The successive intervals we have defined are called bins, i.e. the
k’th bin is the interval $B_k = (t_{k-1}, t_k]$ which is open at the
left-hand end-point and closed at the right-hand end-point. We have thus
divided the line up into a series of non-overlapping bins (or
intervals) each of width $t_k - t_{k-1} = h$.
(3) The height of the block defining the density histogram for an
x-value in bin $B_k$ is denoted by $\mathrm{Hist}(x)$ and is defined by:
$$\mathrm{Hist}(x) = \frac{1}{nh}\,\nu_k$$
where $x \in B_k$ and $\nu_k$ is the number of sample
observations which have values in $B_k$.
(Note that the total area enclosed by a density histogram is 1.
We can define a frequency histogram by multiplying by n so that
$\mathrm{Hist}(x) = \frac{1}{h}\,\nu_k$ and the total area now enclosed by
the histogram is n.)
(i) As an example consider the component lifetime data intro-
duced earlier.
[Figure: Histogram of comp_lifetime$lifetime; x-axis:
comp_lifetime$lifetime, y-axis: Density.]
The first interval starts at 323.75 and the bin width, h, is 2.5.
Essentially we are plotting the summary of the data given in
the frequency table presented in Chapter 1. For example,
$$\mathrm{Hist}(330) = \frac{\nu_3}{nh} = \frac{9}{50 \times 2.5} = 0.072$$
is the height of the histogram at the point x = 330. In
fact, the histogram is at the constant height of 0.072 for all
x ∈ (328.75, 331.25].
It is clear that the distribution is fairly symmetric and is cen-
tered near to 335 hours.
(ii) The second example we will look at is a histogram of the gross
incomes of the sample of n = 500.
[Figure: Histogram of income$income; x-axis: income$income,
y-axis: Density.]
In this example the first interval starts at 5 and the bin width
is h = 10. Again, we are graphically representing the
information in the above frequency table. The distribution has a
peak (or mode) in the interval 15 − 25 and a long right-hand
tail, i.e. the distribution is skewed to the right.
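The three construction steps for a density histogram can be checked by hand
in R. The sketch below uses a simulated sample as a stand-in for real data
(an assumption; it is not the component lifetime data), and the particular
origin and bin width are illustrative choices. It counts the observations in
each bin (a, b], divides by nh, and compares the result with the heights that
hist() computes for the same mesh.

```r
set.seed(1)                           # reproducible simulated sample
x <- rnorm(50, mean = 335, sd = 4)    # assumption: stand-in for real data
n <- length(x)

h  <- 2.5                             # bin width (illustrative choice)
t0 <- min(x) - h / 2                  # origin just below the smallest value
breaks <- seq(t0, max(x) + h, by = h) # mesh t0, t0 + h, ... covering the data

# nu[k] = number of observations in the k'th bin (t_{k-1}, t_k]
nu <- as.vector(table(cut(x, breaks = breaks, right = TRUE)))

heights <- nu / (n * h)               # density histogram heights
total_area <- sum(heights * h)        # equals 1 for a density histogram

# hist() with the same mesh gives the same heights in its $density component
hh <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)
same_heights <- isTRUE(all.equal(heights, hh$density))
```

Multiplying heights by n gives the frequency histogram, whose total area is n.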
Numerical summaries
(i) The five-number summary. To start with we need to define the
order statistics of a random sample x1, x2, ..., xn. The order statistics are
denoted x(1), x(2), ..., x(n) where x(1) is the smallest of x1, ..., xn, x(2)
is the second smallest of x1, ..., xn and x(n) is the largest of x1, ..., xn.
In general, x(r) is the r’th ordered sample value.
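A very simple and convenient characterization of a set of data is
provided by the five-number summary. Such a summary lists, in order,
the sample minimum x(1), the lower quartile qL, the median m, the upper
quartile qU and the sample maximum x(n).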
Note that, in the literature, you may see some slightly different sug-
gestions for how to calculate sample quantiles. The default method in
R is slightly different to that above and this is discussed more fully on
the chapter 2 lecture slides.
A simple measure of spread for the data is given by the interquartile
range (iqr), which is defined as:
iqr = qU − qL
and gives the range of the middle 50% of the ordered sample values.
We can regard qL, m, qU and the iqr as sample estimates of the
corresponding quantities in the population or underlying distribution which
are defined by:
Our sample estimates are random in that their values will vary for
different samples of the same size from the population. Consequently,
each of these statistics will have its own probability or sampling
distribution.
(i) Example 1: For the component lifetime data we have the following
five number summary:
(x(1) , qL , m, qU , x(n) ) = (324.07, 332.28, 335.10, 336.95, 344.45)
R code:
summary(comp_lifetime$lifetime)
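(ii) Example 2: For the income data we have the following five number
summary:
(x(1) , qL , m, qU , x(n) ) = (10.03, 17.61, 28.25, 43.94, 192.49)
R code:
summary(income$income)
Here we have iqr = 43.94 − 17.61 = 26.33. The fact that the data
are strongly skewed to the right is indicated by m − qL = 10.64 being
smaller than qU − m = 15.69, and by m − x(1) = 18.22 being far smaller
than x(n) − m = 164.24.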
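As a small illustration of order statistics, the five-number summary and the
iqr, the sketch below uses a hypothetical sample of seven values. The rule
used for the quartiles is an assumption: quantile() with type = 6 places the
p-quantile at position (n + 1)p in the ordered sample, which is one of several
conventions, and R's default (type = 7, also used by summary()) can give
slightly different values.

```r
x <- c(7, 3, 9, 4, 12, 5, 8)       # hypothetical sample, n = 7
x_ord <- sort(x)                   # order statistics x(1), ..., x(n)

# Assumption: quartiles at positions (n + 1) * p in the ordered sample
q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 6, names = FALSE)

five_num <- c(min = x_ord[1], qL = q[1], m = q[2], qU = q[3],
              max = x_ord[length(x_ord)])
iqr <- q[3] - q[1]                 # range of the middle 50% of ordered values

# Distances used to judge symmetry or skewness
dist_lower <- q[2] - q[1]          # m - qL
dist_upper <- q[3] - q[2]          # qU - m
```

Comparing dist_lower with dist_upper, and the distances from the median to the
two extremes, mirrors the skewness check made for the income data.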
(ii) The sample mean. This is the sum of the observed sample values
divided by the number of observations. If we denote the n values in a
data set by x1 , . . . , xn then the sample mean is given by:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
(i) Example 1: For the lifetime data the sample mean is 334.59. This
is very similar to the median value of 335.10, which is again consistent
with the data being roughly symmetrically distributed.
R code:
mean(comp_lifetime$lifetime)
(ii) Example 2: The mean value of the income data is 33.27. This
is larger than the median value of 28.25 which is indicative of
positive skewness.
R code:
mean(income$income)
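The relationship between the mean, the median and skewness can be seen on a
small hypothetical sample with one large value in its right-hand tail. The
data below are made up for illustration only.

```r
x <- c(12, 15, 18, 22, 28, 35, 50, 120)   # hypothetical right-skewed sample
n <- length(x)

xbar <- sum(x) / n     # sample mean from its definition; same as mean(x)
m    <- median(x)      # average of the two middle ordered values here

mean_exceeds_median <- xbar > m   # TRUE, as expected under positive skewness
```

The single large observation pulls the mean well above the median, while the
median depends only on the middle of the ordered sample.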
Like the other sample statistics discussed above, the sample variance
and sample standard deviation are also random with their own sampling
distributions.
Examples:
(i) For the lifetime data we have s2 = 15.288 so that s = 3.91
R code:
sd(comp_lifetime$lifetime)
var(comp_lifetime$lifetime)
(ii) For the income data we have s2 = 503.554 and so s = 22.44.
R code:
sd(income$income)
var(income$income)
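To see what var() and sd() compute, the following sketch works through a
small hypothetical sample by hand. R's var() uses the divisor n − 1, and sd()
is its square root; the data are made up for illustration.

```r
x <- c(2, 4, 4, 5, 7, 8)             # hypothetical sample
n <- length(x)
xbar <- mean(x)                      # sample mean, 5 here

# Sum of squared deviations from the mean, divided by n - 1 (as in var())
s2 <- sum((x - xbar)^2) / (n - 1)
s  <- sqrt(s2)                       # sample standard deviation

matches_var <- isTRUE(all.equal(s2, var(x)))
matches_sd  <- isTRUE(all.equal(s, sd(x)))
```

The same relationship, s equal to the square root of s2, links the two values
quoted for each data set above.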
The boxplot
A boxplot (also called a box-and-whisker plot) is a graphical display of certain
numerical summaries of a sample of data. It depicts the median, the quartiles,
the range of the data and any outliers which may be present, thus showing
the distributional characteristics of the data. Boxplots are also useful for
making comparisons between comparable data sets.
A boxplot consists of a box, whiskers and outliers. A rectangular box
extends from the lower quartile to the upper quartile, thus spanning the range of
the middle 50% of sample values, and a line is drawn across the box depicting
the median value. Next we calculate the interquartile range as iqr = qU − qL.
The whiskers are the lines that extend from the top and bottom of the box
to what are called adjacent values, which are the furthest observations within
1.5 ∗ iqr either side of the box. Data points which lie more than 1.5 ∗ iqr
from the box are labeled as outliers.
It is useful to look at the construction of boxplots and their interpretation
through some examples.
(i) Example 1: We have the following boxplot for the lifetime data.
[Figure: Boxplot of lifetime data; y-axis: lifetime.]
R code:
boxplot(comp_lifetime$lifetime, ylab="lifetime",
main="Boxplot of lifetime data")
It confirms graphically the symmetry in the data and also indicates two
possible outliers corresponding to the minimum and maximum sample
values.
(ii) Example 2: For the income data we have the following boxplot:
[Figure: Boxplot of income data; y-axis: income.]
R code:
boxplot(income$income, ylab="income",
main="Boxplot of income data")
This indicates that the middle 50% of observations are fairly symmetric
about the median but the very short lower whisker and long upper
whisker together with a dozen outliers clearly show the skewness present
in the data.
(iii) Boxplots are also very useful for graphically comparing two or more
commensurate samples of data. The data we will use to illustrate
this are the blood plasma β endorphin concentrations (pmol/l) for 22
runners who had taken part in the Tyneside Great North Run one year.
11 of the measurements were from runners who successfully completed
the race while the other 11 were from runners who collapsed near the
end of the race. The data collected (ordered from smallest to largest
within group) was as follows.
Successful runners:
14.2, 15.5, 20.2, 21.9, 24.1, 25.1, 29.6, 29.6, 34.6, 37.8, 46.2
Collapsed runners:
66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414
R code:
runners <- read.table(file="runners.txt", header=TRUE)
names(runners)
summary(runners$successful)
Our box will thus extend from 20.2 to 34.6 with a line across at 25.1
depicting the median. The iqr = 34.6 − 20.2 = 14.4, giving the
following lower and upper limits for determining outliers: the lower limit
is qL − 1.5 ∗ iqr = 20.2 − 1.5 ∗ 14.4 = −1.4 and the upper limit is
qU + 1.5 ∗ iqr = 34.6 + 1.5 ∗ 14.4 = 56.2. The whiskers thus extend to
the smallest observation which is bigger than −1.4 (i.e. 14.2) and the
largest observation which does not exceed 56.2 (i.e. 46.2). No outliers
are indicated for these data. The resulting boxplot is given by:
[Figure: Boxplot of successful runners beta endorphin concentration;
y-axis: pmol/l.]
R code:
boxplot(runners$successful, ylab="pmol/l",
        main="Boxplot of successful runners beta endorphin concentration")
We will now construct a boxplot for the collapsed runners data. The
five-number summary is:
(x(1) , qL , m, qU , x(n) ) = (66.0, 79.0, 110.0, 162.0, 414.0)
R code:
summary(runners$collapsed)
The iqr = 162.0 − 79.0 = 83.0 so the lower limit is 79.0 − 1.5 ∗ 83.0 =
−45.5 and the upper limit is 162.0 + 1.5 ∗ 83.0 = 286.5. The whiskers thus
extend to 66.0 on the lower side and 169.0 on the upper side, with the
maximum value of 414.0 being designated an outlier since it lies more
than 1.5 ∗ iqr above qU. It is useful to plot the boxplot for the collapsed
runners on the same scale as that for the successful runners in order to
compare the distributions of the two samples of data.
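The construction just described (box, limits, whiskers, outliers) can be
sketched as a small R function. The function name box_summary is
hypothetical, and the quartile rule (quantile() with type = 6) is an
assumption chosen because it reproduces the quartiles 79.0 and 162.0 quoted
above. R's own boxplot() draws the box from Tukey's hinges (see fivenum()),
so the box it draws may differ slightly from these values.

```r
# Hypothetical helper: quantities needed to draw a boxplot by hand
box_summary <- function(x, coef = 1.5) {
  # Assumption: quartiles taken at positions (n + 1) * p (type = 6)
  q   <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 6, names = FALSE)
  iqr <- q[3] - q[1]
  lower <- q[1] - coef * iqr         # lower limit for outliers
  upper <- q[3] + coef * iqr         # upper limit for outliers
  inside <- x[x >= lower & x <= upper]
  list(quartiles = q,
       iqr       = iqr,
       limits    = c(lower, upper),
       whiskers  = range(inside),    # adjacent values
       outliers  = x[x < lower | x > upper])
}

# Collapsed runners data as listed above
collapsed <- c(66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414)
res <- box_summary(collapsed)
```

For these data res$limits, res$whiskers and res$outliers agree with the hand
calculation: limits −45.5 and 286.5, whiskers at 66 and 169, and 414 flagged
as the only outlier.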
[Figure: Boxplots for successful and collapsed runners beta endorphin
concentrations; x-axis: group (1: successful; 2: collapsed), y-axis: pmol/l.]
R code:
runners_group <- read.table(file="runners_group.txt", header=TRUE)
boxplot(pmol~group, data=runners_group, ylab="pmol/l",
        main="Boxplots for successful and collapsed runners beta endorphin concentrations",
        sub="1: successful; 2: collapsed")
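If the file runners_group.txt is not to hand, a data frame of the same shape
can be built directly from the values listed earlier. This is a sketch: the
column names pmol and group are taken from the boxplot() call above, but the
layout of the original file is an assumption, and the eighth and ninth
successful values are read here as 29.6 and 34.6.

```r
# Values as listed in the text (ordered within group)
successful <- c(14.2, 15.5, 20.2, 21.9, 24.1, 25.1, 29.6, 29.6, 34.6, 37.8, 46.2)
collapsed  <- c(66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414)

# Long format: one row per runner, with a group code (1: successful; 2: collapsed)
runners_group <- data.frame(
  pmol  = c(successful, collapsed),
  group = rep(c(1, 2), times = c(length(successful), length(collapsed)))
)

# Side-by-side boxplots on a common vertical scale
boxplot(pmol ~ group, data = runners_group, ylab = "pmol/l",
        main = "Boxplots for successful and collapsed runners",
        sub = "1: successful; 2: collapsed")
```

Plotting both groups from one data frame puts them on the same scale
automatically, which is what makes the comparison between groups immediate.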
The boxplot is essentially a graphical display of the 5-number summary, so
you should relate what you can see graphically in the boxplots to the
information and conclusions from your 5-number summaries, including comments
on the shape of the distributions (symmetric or skewed?).
(iv) For a particular group, comparing the lengths of the two whiskers can
be helpful in identifying symmetry or skewness. Also, is the median in
the centre of the box?
The other numerical summaries are the group means and standard
deviations, which can be compared and contrasted. The difference
between a group's mean and median is indicative of any skewness of the data in
that group.
The other graphical summaries are the histograms which again can be
compared and contrasted. Does the information in the histograms tie in
with what you’ve seen on the boxplots? We will see later that superimposing
a Normal pdf (with the mean and standard deviation estimated from the
data) onto the histogram can be helpful in subjectively deciding whether a
Normal distribution would make a reasonable probability model for the data.
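A minimal sketch of superimposing a Normal pdf on a density histogram is given
below. The sample is simulated as a stand-in for real data (an assumption),
and the object names are illustrative; the histogram must be drawn on the
density scale (freq = FALSE) for the curve and the bars to be comparable.

```r
set.seed(2)                                  # reproducible simulated sample
lifetimes <- rnorm(50, mean = 335, sd = 4)   # assumption: stand-in for real data

m <- mean(lifetimes)                         # estimated mean
s <- sd(lifetimes)                           # estimated standard deviation

# Density histogram, then the fitted Normal pdf drawn over the same axes
hist(lifetimes, freq = FALSE, xlab = "lifetime",
     main = "Histogram with fitted Normal pdf (simulated data)")
curve(dnorm(x, mean = m, sd = s), add = TRUE)
```

The comparison is subjective: a reasonable fit shows the bars following the
curve without systematic departures in the tails or around the peak.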
Later in the module we will discuss how we can formally make inferences
about a single population or difference between two populations based on
the values of summary statistics obtained from random samples from the
populations.