Chapter 2 - Representing Sample Data

A set of data on its own can be hard to interpret. Therefore, once we have a
sample of data, it is very useful to look at graphical and numerical summaries
of it in order to try and understand the important features. This is an
important first step in any statistical analysis.

Graphical displays
(i) Bar Charts. If the variables we are observing are discrete and the data
we are collecting are counts corresponding to the numbers of sample
members in each discrete category then an appropriate plot is a bar
chart. A rectangular bar is constructed for each category, the length
of which is proportional to the count. Bars are drawn separated from
each other and, where order in the categories does not matter, in order
of decreasing size from left to right. Using our simulated voting data
we can construct the following bar chart:

Figure 1: Bar chart of the opinion poll data. [x-axis: Party (1: Conservative; 2: Labour; 3: Liberal_democrats; 4: Other); y-axis: Count, 0-400.]

The R code used to create this plot was as follows:

voters<-factor(rep(1:4, c(366, 344, 212, 78)))
plot(voters, ylim=c(0, 400), xlab="Party", ylab="Count",
     main="Bar chart of opinion poll data",
     sub="1: Conservative; 2: Labour; 3: Liberal_democrats; 4: Other")

The chart clearly shows the numbers of voters supporting each party
and the differences between them. Note that we could have divided
each of the four counts by n = 1000 and plotted them as proportions
on the vertical axis. If our sample is representative of the population of
voters as a whole then it seems sensible to regard the observed sample
proportions of voters supporting each party as estimates of the true
proportions supporting each party in the population. The merits of
this will be discussed in more detail later in the module.
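Converting the counts to proportions is a one-line change. A minimal sketch using the same simulated counts (variable names here are illustrative, not from the module code):

```r
# Sample proportions from the simulated opinion poll counts (n = 1000)
counts <- c(366, 344, 212, 78)
props <- counts / sum(counts)      # e.g. 0.366 for party 1
barplot(props, names.arg = 1:4, ylim = c(0, 0.4),
        xlab = "Party", ylab = "Proportion",
        main = "Bar chart of opinion poll data (proportions)")
```

The shape of the chart is unchanged; only the vertical scale differs.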
(ii) Histograms. This is an appropriate display when the data are mea-
sured on a continuous scale. To construct a histogram for a sample of
size n we go through the following steps:
(1) Choose an origin t0 and a bin width h and use them to define a
mesh of equally-spaced points which cover the range of the data.
ie. tj = t0 ± jh for j = 1, 2, 3, . . ..
(2) The successive intervals we have defined are called bins. ie. the
k’th bin is the interval (tk−1 , tk ] which is open at the left-hand
end-point and closed at the right-hand end-point. We have thus
divided the line up into a series of non-overlapping bins (or inter-
vals) each of width tk − tk−1 = h.
(3) The height of the block defining the density histogram for an x-
value in bin Bk is denoted by Hist(x) and is defined by:

        Hist(x) = νk / (nh)

where x ∈ Bk and νk is the count of the number of sample
observations which have values in Bk .
(Note that the total area enclosed by a density histogram is 1.
We can define a frequency histogram by multiplying by n so that
Hist(x) = νk / h and the total area enclosed by the histogram is
then n).
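Steps (1)-(3) can be carried out directly in R. A small sketch with made-up data (the sample and mesh below are hypothetical, chosen only to illustrate the definition):

```r
# Density histogram heights computed from the definition Hist(x) = nu_k / (n h)
x  <- c(1.2, 1.9, 2.3, 2.8, 3.7, 4.1)        # hypothetical sample, n = 6
t0 <- 1; h <- 1                              # origin and bin width
breaks <- seq(from = t0, to = 5, by = h)     # mesh t0, t0 + h, ...
nu <- table(cut(x, breaks, right = TRUE))    # bin counts, bins (t_{k-1}, t_k]
heights <- as.numeric(nu) / (length(x) * h)  # one height per bin
# The total area sum(heights * h) is 1, as required for a density histogram
```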
(i) As an example consider the component lifetime data introduced earlier.

Figure 2: Density histogram of the component lifetime data. [x-axis: lifetime, 325-345; y-axis: Density, 0.00-0.08.]

The R code used was:


comp_lifetime<-read.table(file="comp_lifetime.txt", header=T)
names(comp_lifetime)
hist(comp_lifetime$lifetime, freq=F,
breaks=seq(from=323.75, to= 346.25, by=2.5))

The first interval starts at 323.75 and the bin width, h, is 2.5.
Essentially we are plotting the summary of the data given in
the frequency table presented in Chapter 1. For example,
Hist(330) = ν3 / (nh) = 9 / (50 ∗ 2.5) = 0.072
is the height of the histogram at the point x = 330. In
fact, the histogram is at the constant height of 0.072 ∀x ∈
(328.75, 331.25].
It is clear that the distribution is fairly symmetric and is cen-
tered near to 335 hours.
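This calculation is easy to check directly (the values are those quoted above):

```r
# Height of the density histogram over the bin containing x = 330
nu3 <- 9; n <- 50; h <- 2.5    # bin count, sample size, bin width from the text
height <- nu3 / (n * h)        # 0.072
```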

(ii) The second example we will look at is a histogram of the gross
incomes of the sample of n = 500.

Figure 3: Density histogram of the income data. [x-axis: income, 0-200; y-axis: Density, 0.000-0.025.]

The R code used was:


income<-read.table(file="income.txt", header=T)
names(income)
hist(income$income, freq=F,
breaks=seq(from=5, to=195, by=10))

In this example the first interval starts at 5 and the bin width
is h = 10. Again, we are graphically representing the infor-
mation in the above frequency table. The distribution has a
peak (or mode) in the interval 15 − 25 and has a long tail in
that the distribution is skewed to the right.

Numerical summaries
(i) The five-number summary. To start with we need to define the or-
der statistics of a random sample x1 , x2 , . . . , xn . The order statistics are
denoted x(1) , x(2) , . . . , x(n) , where x(1) is the smallest of x1 , . . . , xn , x(2)
is the second smallest of x1 , . . . , xn and x(n) is the largest of x1 , . . . , xn .
In general, x(r) is the r’th ordered sample value.

A very simple and convenient characterization of a set of data is pro-
vided by the five-number summary. Such a summary lists, in order

• the sample minimum, x(1)


• the sample lower quartile, qL . This value divides the sample such
that 25% of sample values are < qL while 75% of sample values
are > qL .
• the sample median, m. This is such that 50% of sample values
are < m and 50% of sample values are > m.
• the sample upper quartile, qU . This value is such that 75% of
sample values are < qU while 25% of sample values are > qU .
• the sample maximum, x(n)

To calculate their values from a sample of data we go through the
following steps:

(i) qL . Calculate r = 0.25 ∗ (n + 1). If r is an integer then qL = x(r) .
If it is not then interpolate between x([r]) and x([r]+1) , where
[y] denotes the largest integer not exceeding y.
(ii) m. Calculate r = 0.5 ∗ (n + 1). If r is an integer then m = x(r) .
If it is not then interpolate between x([r]) and x([r]+1) . If n
is odd then m will correspond to the actual middle ordered value
whereas if n is even then m will be the average of the middle two
ordered values.
(iii) qU . Calculate r = 0.75 ∗ (n + 1). If r is an integer then qU = x(r) .
If it is not then interpolate between x([r]) and x([r]+1) .

Note that, in the literature, you may see some slightly different sug-
gestions for how to calculate sample quantiles. The default method in
R is slightly different to that above and this is discussed more fully on
the chapter 2 lecture slides.
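As a sketch, the r = p ∗ (n + 1) rule above can be coded directly. The sample and the helper function pquant below are hypothetical, not part of base R (base R's quantile() implements this rule when type = 6):

```r
# Sample quantiles via r = p * (n + 1), interpolating when r is not an integer
pquant <- function(xs, p) {
  xs <- sort(xs)                 # order statistics x(1), ..., x(n)
  r  <- p * (length(xs) + 1)
  lo <- floor(r)                 # [r], the largest integer not exceeding r
  if (r == lo) xs[lo] else xs[lo] + (r - lo) * (xs[lo + 1] - xs[lo])
}
x  <- c(3, 1, 4, 1, 5, 9, 2)     # n = 7, sorted: 1 1 2 3 4 5 9
qL <- pquant(x, 0.25)            # r = 2, an integer, so qL = x(2) = 1
m  <- pquant(x, 0.50)            # r = 4, so m = x(4) = 3
qU <- pquant(x, 0.75)            # r = 6, so qU = x(6) = 5
```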
A simple measure of spread for the data is given by the interquartile-
range (iqr) which is defined as:

iqr = qU − qL

and gives the range of the middle 50% of the ordered sample values.

We can regard qL , m, qU and the iqr as sample estimates of the corresponding quantities in the population or underlying distribution which
are defined by:

(i) QL is the solution to FX (QL ) = 0.25


(ii) M is the solution to FX (M ) = 0.50
(iii) QU is the solution to FX (QU ) = 0.75
(iv) IQR = QU − QL

Our sample estimates are random in that their values will vary for
different samples of the same size from the population. Consequently,
each of these statistics will have its own probability or sampling
distribution.
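For a concrete case, if the population is standard Normal then these solutions are available directly from the quantile function qnorm:

```r
# Population quartiles of the standard Normal, solving F_X(q) = p for q
QL <- qnorm(0.25)    # about -0.674
M  <- qnorm(0.50)    # 0, by symmetry
QU <- qnorm(0.75)    # about  0.674
IQR_pop <- QU - QL   # about 1.349
```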

(i) Example 1: For the component lifetime data we have the following
five number summary:
(x(1) , qL , m, qU , x(n) ) = (324.07, 332.28, 335.10, 336.95, 344.45)
R code:
summary(comp_lifetime$lifetime)

The interquartile range is thus iqr = 336.95 − 332.28 = 4.67.


This is the range covered by the middle 50% of data values. The
range of the whole data though is 344.45 − 324.07 = 20.38. These
two ranges indicate that the data are quite concentrated around
the median but, as we move into the two tails of the distribution,
the data are much more spread out. The following two pairs of
differences indicate that the data are fairly symmetric about the
median: we have m − qL = 2.82 while qU − m = 1.85 and also
m − x(1) = 11.03 while x(n) − m = 9.35.
(ii) Example 2: The following five number summary was obtained for
the income data:

(x(1) , qL , m, qU , x(n) ) = (10.03, 17.61, 28.25, 43.94, 192.49)


R code:

summary(income$income)
Here we have iqr = 43.94 − 17.61 = 26.33. The fact that the data
are strongly skewed to the right is indicated by m − qL = 10.64
while qU − m = 15.69 is larger, and by m − x(1) = 18.22 compared
with the much larger x(n) − m = 164.24.

(ii) The sample mean. This is the sum of the observed sample values
divided by the number of observations. If we denote the n values in a
data set by x1 , . . . , xn then the sample mean is given by:
        x̄ = (1/n) Σ_{i=1}^{n} xi

If the distribution of the data is fairly regular and concentrated in the
middle of their range then the sample mean is a good summary measure
in that it describes a typical or representative value for the sample and
is also a measure of where the center of the empirical distribution of
the sample is located. If we have different groups of commensurate
measurements then we can summarize each group by its sample mean
and compare the groups by comparing the values of the group means.
Again, note that the sample mean is random in that its value will
change for different samples of size n. Under certain assumptions, such
as Normality, we can determine the form of its sampling distribution
analytically.
If, on the other hand, the sample data are irregularly distributed (the
empirical distribution might be very skewed in one direction, for
example) then the sample median is probably a better summary measure
for the sample.
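A small sketch (made-up numbers) of how a single unusual value pulls the mean but not the median:

```r
# Sample mean from the definition versus the median on a skewed sample
x    <- c(2, 3, 3, 4, 5, 25)     # the value 25 drags the mean upwards
xbar <- sum(x) / length(x)       # 42 / 6 = 7
med  <- median(x)                # 3.5, arguably more 'typical' here
```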
If a histogram of our data indicates that the empirical distribution may
have more than one mode (or peak) then it is better practice in this case
to report the locations (x-values) of each mode as summary measures
of the sample.

(i) Example 1: For the lifetime data the sample mean is 334.59. This
is very similar to the median value of 335.10 which again indicates
that the data are symmetrically distributed.
R code:

mean(comp_lifetime$lifetime)
(ii) Example 2: The mean value of the income data is 33.27. This
is larger than the median value of 28.25 which is indicative of
positive skewness.
R code:
mean(income$income)

(iii) The sample variance and standard deviation. We have already
seen above that the interquartile range is a useful measure of
spread or dispersion in a data set and it has the good property that it
is not too sensitive to outliers. It has disadvantages though in that its
computation requires sorting the data, which can be time consuming
when n is large, and also its statistical and mathematical properties are
not straightforward. An alternative is to calculate the average of the
squared deviations of each data value about the sample mean. This is
the sample variance and is defined as:
        s2 = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)2 = (1/(n − 1)) ( Σ_{i=1}^{n} xi2 − n x̄2 )
Clearly, s2 ≥ 0 and it will only equal zero if all the observations have
the same value (ie. xi = x̄ ∀i). As the data become more spread out
about the mean so that the absolute residuals | xi − x̄ | become larger
then the value of s2 will become bigger. In this sense then it makes
for a good measure of dispersion and, together with the fact that it
is mathematically tractable, it has become very commonly used. We
average using (n − 1) rather than n for technical reasons which will be
discussed later.
Note though that s2 is measured in squared units. The positive square
root of s2 , denoted by s, is called the sample standard deviation and is
in the same units as the original data. ie.

        s = +√(s2)

Like the other sample statistics discussed above, the sample variance
and sample standard deviation are also random with their own sampling
distributions.
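The two algebraic forms of s2 given above agree, as a quick sketch with made-up data confirms:

```r
# Sample variance via deviations and via the computational short-cut
x    <- c(2, 4, 4, 4, 5, 5, 7, 9)
n    <- length(x)
xbar <- mean(x)                                  # 5
s2_dev   <- sum((x - xbar)^2) / (n - 1)          # 32 / 7
s2_short <- (sum(x^2) - n * xbar^2) / (n - 1)    # same value
s <- sqrt(s2_dev)                                # sample standard deviation
```

Both match R's built-in var() and sd(), which also divide by (n − 1).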
Examples:

(i) For the lifetime data we have s2 = 15.288 so that s = 3.91.
R code:
sd(comp_lifetime$lifetime)
var(comp_lifetime$lifetime)
(ii) For the income data we have s2 = 503.554 and so s = 22.44.
R code:
sd(income$income)
var(income$income)

The boxplot
A boxplot (also called a box-and-whisker plot) is a graphical display of certain
numerical summaries of a sample of data. It depicts the median, the quartiles,
the range of the data and any outliers which may be present thus showing
the distributional characteristics of the data. They are also useful for making
comparisons between comparable data sets.
A boxplot consists of a box, whiskers and outliers. A rectangular box ex-
tends from the lower quartile to the upper quartile thus spanning the range of
the middle 50% of sample values and a line is drawn across the box depicting
the median value. Next we calculate the interquartile range as iqr = qU − qL .
The whiskers are the lines that extend from the top and bottom of the box
to what are called adjacent values which are the furthest observations within
1.5iqr either side of the box. Data points which lie beyond 1.5iqr from the
box are labeled as outliers.
It is useful to look at the method of construction and their interpretation
through some examples.
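The whisker and outlier construction just described can be sketched on made-up data (using quantile() with type = 6, which follows the (n + 1)p rule used above):

```r
# Adjacent values (whisker ends) and outliers for a boxplot, by hand
x   <- c(1, 2, 3, 4, 5, 6, 7, 30)                 # 30 is a likely outlier
qs  <- quantile(x, probs = c(0.25, 0.75), type = 6)
iqr <- unname(qs[2] - qs[1])
lower_fence <- unname(qs[1]) - 1.5 * iqr
upper_fence <- unname(qs[2]) + 1.5 * iqr
inside   <- x[x >= lower_fence & x <= upper_fence]
adjacent <- range(inside)                         # whisker end-points
outliers <- x[x < lower_fence | x > upper_fence]  # plotted individually
```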

(i) Example 1: We have the following boxplot for the lifetime data.

Figure 4: Boxplot of the component lifetime data. [y-axis: lifetime, 325-345.]

R code:

boxplot(comp_lifetime$lifetime, ylab="lifetime",
main="Boxplot of lifetime data")

It confirms graphically the symmetry in the data and also indicates two
possible outliers corresponding to the minimum and maximum sample
values.

(ii) Example 2: For the income data we have the following boxplot:

Figure 5: Boxplot of the income data. [y-axis: income, 50-150.]

R code:

boxplot(income$income, ylab="income",
main="Boxplot of income data")

This indicates that the middle 50% of observations are fairly symmetric
about the median but the very short lower whisker and long upper
whisker together with a dozen outliers clearly show the skewness present
in the data.
(iii) Boxplots are also very useful for graphically comparing two or more
commensurate samples of data. The data we will use to illustrate
this are the blood plasma β endorphin concentrations (pmol/l) for 22
runners who had taken part in the Tyneside Great North Run one year.
11 of the measurements were from runners who successfully completed
the race while the other 11 were from runners who collapsed near the
end of the race. The data collected (ordered from smallest to largest
within group) was as follows.
Successful runners:
14.2, 15.5, 20.2, 21.9, 24.1, 25.1, 29.6, 29.6, 34.6, 37.8, 46.2

Collapsed runners:

66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414

The five-number summary of the successful runners data is given by:

(x(1) , qL , m, qU , x(n) ) = (14.2, 20.2, 25.1, 34.6, 46.2)

R code:

runners<-read.table(file="runners.txt", header=T)
names(runners)
summary(runners$successful)

Our box will thus extend from 20.2 to 34.6 with a line across at 25.1
depicting the median. The iqr = 34.6 − 20.2 = 14.4 giving the fol-
lowing lower and upper limits for determining outliers: the lower limit
is qL − 1.5 ∗ iqr = 20.2 − 1.5 ∗ 14.4 = −1.4 and the upper limit is
qU + 1.5 ∗ iqr = 34.6 + 1.5 ∗ 14.4 = 56.2. The whiskers thus extend to
the smallest observation which is bigger than −1.4 (ie. 14.2) and the
largest observation which does not exceed 56.2 (ie. 46.2). No outliers
are indicated for these data. The resulting boxplot is given by:

Figure 6: Boxplot of the successful runners data. [y-axis: pmol/l, 15-45.]

boxplot(runners$successful, ylab="pmol/l",
main="Boxplot of successful runners beta endorphin
concentration")
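The fence arithmetic used above can be checked directly (values as quoted in the text):

```r
# Outlier fences for the successful runners' five-number summary
qL <- 20.2; qU <- 34.6
iqr   <- qU - qL          # 14.4
lower <- qL - 1.5 * iqr   # -1.4
upper <- qU + 1.5 * iqr   # 56.2
```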

We will now construct a boxplot for the collapsed runners data. The
five-number summary is:

(x(1) , qL , m, qU , x(n) ) = (66.0, 79.0, 110.0, 162.0, 414.0)

R code:

summary(runners$collapsed)

The iqr = 162.0 − 79.0 = 83.0 so the lower limit is 79.0 − 1.5 ∗ 83.0 =
−45.5 and the upper limit is 162+1.5∗83.0 = 286.5. The whiskers thus
extend to 66.0 on the lower side and 169.0 on the upper side with the
maximum value of 414.0 being designated an outlier since it is greater
than 1.5 ∗ iqr from qU . It is useful to plot the boxplot for the collapsed
runners on the same scale as that for the successful runners in order to
compare the distributions of the two samples of data.

Figure 7: Boxplots of the runners data by group. [x-axis: 1: successful; 2: collapsed; y-axis: pmol/l, 0-400.]

runners_group<-read.table(file="runners_group.txt", header=T)
boxplot(pmol~group, data=runners_group, ylab="pmol/l",
main="Boxplots for successful and collapsed runners
beta endorphin concentrations", sub="1: successful;
2: collapsed")

Summary and discussion


Suppose that we are comparing k samples of data. Then, presenting the
summary statistics in tables is neat and informative - one table for the group
5-number summaries and one table for the group means and standard devi-
ations.
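With data in "long" format (one column of values, one of group labels, as in the grouped runners example), such tables can be sketched with tapply; the data frame below is made up purely for illustration:

```r
# Per-group five-number summaries, means and standard deviations
dat <- data.frame(value = c(14.2, 25.1, 46.2, 66, 110, 414),
                  group = factor(c(1, 1, 1, 2, 2, 2)))
five_num <- tapply(dat$value, dat$group, summary)  # min, quartiles, median, max
means    <- tapply(dat$value, dat$group, mean)
sds      <- tapply(dat$value, dat$group, sd)
```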
Regarding the comments that can be made, with the 5-number summaries
for example, you can compare and contrast subjectively:

(i) the k medians as measures of the centers of the k distributions,

(ii) the k interquartile ranges as measures of spread,

(iii) the mins, maxes and full ranges.

The boxplot is essentially a graphical display of the 5-number summary so
you should relate what you can see graphically in the boxplots with the
information and conclusions from your 5-number summaries, including
comments on the shape of the distributions (symmetric or skewed?). In
particular, note:

(iv) for a particular group, comparing the lengths of the two whiskers can
be helpful in identifying symmetry or skewness. Also, is the median in
the centre of the box?

(v) any values highlighted as possible outliers.

The other numerical summaries are the group means and standard devi-
ations, which can be compared and contrasted. The difference between a
group mean and median is indicative of any skewness of the data in that
group.
The other graphical summaries are the histograms which again can be
compared and contrasted. Does the information in the histograms tie in
with what you’ve seen on the boxplots? We will see later that superimposing
a Normal pdf (with the mean and standard deviation estimated from the
data) onto the histogram can be helpful in subjectively deciding whether a
Normal distribution would make a reasonable probability model for the data.
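A sketch of such an overlay, with simulated data standing in for a real sample:

```r
# Superimpose a fitted Normal pdf on a density histogram
set.seed(1)
x <- rnorm(100, mean = 335, sd = 4)   # stand-in sample
m <- mean(x); s <- sd(x)              # estimates plugged into the Normal pdf
hist(x, freq = FALSE, main = "Histogram with fitted Normal pdf")
curve(dnorm(x, mean = m, sd = s), add = TRUE)
```

Note that m and s must be computed before the call to curve(), since inside curve() the symbol x refers to the plotting grid rather than the data.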

In this chapter, each of the graphical and numerical summaries is applied
to a single sample of size n from the population under study. As already
mentioned, if we then draw a second random sample of size n from the same
population and calculate the same summary, such as the sample mean, then
we would obtain a different numerical value for it from that calculated with
our first sample. This is because the samples are randomly selected and so the
second one will comprise a different collection of items from the population
and hence different sample values from the first. If we then took a third
random sample of size n and calculated the summary measure from these
data then we would obtain a third different numerical value for it. If we
continue to draw samples of size n from the population then we will obtain
a different value for the summary measure each time and we can see that
the summary measure is itself random and its sampling distribution can be
built up under repeated sampling from the population. In a later section we
will look at how to determine the sampling distributions of certain sample
statistics theoretically.
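The idea of building up a sampling distribution by repeated sampling can be sketched by simulation from a made-up Normal population:

```r
# Approximate the sampling distribution of the mean by repeated sampling
set.seed(42)
sample_means <- replicate(1000, mean(rnorm(20, mean = 10, sd = 2)))
# The spread of these 1000 means should be close to 2 / sqrt(20)
sd_of_means <- sd(sample_means)
```

Each call to rnorm() plays the role of drawing a fresh sample of size n = 20, and the 1000 resulting means trace out the sampling distribution of x̄.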

Later in the module we will discuss how we can formally make inferences
about a single population or difference between two populations based on
the values of summary statistics obtained from random samples from the
populations.
