Statistics - Methods of Describing Sets of Data
Introduction:
Suppose you want to evaluate the analytical capabilities of a class of 1,000
first-year college students, based on their quantitative Scholastic Aptitude
Test (SAT) scores. How would you describe these 1,000 measurements?
Characteristics of interest include the typical, or most frequent, SAT score;
the variability in the scores; the highest and lowest scores; the "shape" of
the data; and whether the data set contains any unusual scores. Extracting this
information is not easy. The 1,000 scores provide too many pieces of
information for our minds to comprehend. Clearly, we need some method to
summarize and describe the information in such a data set. Methods for
describing data sets are also essential for statistical inference.
One way to summarize a data set is to compute the class relative frequency, the
proportion of the total number of observations falling into each class.
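As a minimal sketch of that definition, the class relative frequency can be computed by counting the observations in each class and dividing by the total number of observations. The function name and the class labels below are hypothetical and chosen only for illustration.

```python
from collections import Counter

def class_relative_frequencies(observations):
    """Return {class label: proportion of all observations in that class}."""
    counts = Counter(observations)
    n = len(observations)
    return {label: count / n for label, count in counts.items()}

# Hypothetical class labels, invented for illustration only.
labels = ["A", "B", "A", "C", "A", "B", "A", "A", "C", "A"]
rel_freq = class_relative_frequencies(labels)
for label, proportion in sorted(rel_freq.items()):
    print(f"{label}: {proportion:.2f}")
```

Because every observation falls into exactly one class, the relative frequencies add up to 1.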
Refer to the Evolutionary Ecology Research (July 2003) study of the patterns of
extinction in the New Zealand bird population, Exercise 1.20 (p. 49). Data on flight
capability (volant or flightless), habitat (aquatic, ground terrestrial, or aerial
terrestrial), nesting site (ground, cavity within ground, tree, or cavity above
ground), nest density (high or low), diet (fish, vertebrates, vegetables, or
invertebrates), body mass (grams), egg length (millimeters), and extinct status
(extinct, absent from island, or present) for 132 bird species that existed at the
time of the Maori colonization of New Zealand are saved in the NZBIRDS file.
Use a graphical method to investigate the theory that extinct status is related to
flight capability, habitat, and nest density.
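One possible way to approach this exercise is to compare the relative frequencies of extinct status within each level of another variable, for example with side-by-side bars. The sketch below does this in text form for flight capability. The rows are hypothetical and are NOT taken from the NZBIRDS file; the function name and the dictionary keys are also assumptions made for illustration.

```python
from collections import Counter, defaultdict

def conditional_relative_frequencies(rows, group_key, outcome_key):
    """For each group, return the proportion of its rows in each outcome class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][row[outcome_key]] += 1
    return {
        group: {outcome: c / sum(outcomes.values())
                for outcome, c in outcomes.items()}
        for group, outcomes in counts.items()
    }

# Hypothetical rows for illustration; not the real NZBIRDS data.
birds = [
    {"flight": "volant", "status": "present"},
    {"flight": "volant", "status": "present"},
    {"flight": "volant", "status": "extinct"},
    {"flight": "volant", "status": "absent"},
    {"flight": "flightless", "status": "extinct"},
    {"flight": "flightless", "status": "extinct"},
    {"flight": "flightless", "status": "present"},
    {"flight": "flightless", "status": "absent"},
]

table = conditional_relative_frequencies(birds, "flight", "status")
for group, outcomes in table.items():
    for outcome, proportion in sorted(outcomes.items()):
        bar = "#" * round(proportion * 20)  # crude text bar, 20 chars = 100%
        print(f"{group:<11}{outcome:<8}{bar} {proportion:.2f}")
```

If the bars for the extinct class differ noticeably between groups, that would be consistent with a relationship between the two variables. The same function could be reused with habitat or nest density as the grouping key.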
A visual inspection of the data indicates some obvious facts. For example, most
of the mileages are in the 30s, with a smaller fraction in the 40s. But it is
difficult to provide much additional information on the 100 mileage ratings
without resorting to some method of summarizing the data. One such method is
a dot plot.
Dot Plots
A MINITAB dot plot for the 100 EPA mileage ratings is shown in
Figure 2.8. The horizontal axis of the figure is a scale for the quantitative
variable in miles per gallon (mpg).
The rounded (to the nearest half gallon) numerical value of each measurement
in the data set is located on the horizontal scale by a dot. When data values
repeat, the dots are placed above one another, forming a pile at that particular
numerical location. As you can see, this dot plot verifies that almost all of the
mileage ratings are in the 30s, with most falling between 35 and 40 miles per
gallon.
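The construction of a dot plot can be sketched in a few lines. This is a rough text version, not MINITAB output: it is rotated so that the scale runs down the page and repeated values pile up to the right. The values 36.3, 32.7, and 40.5 are measurements mentioned in the text; 36.4 and 36.9 are made up for illustration, and the function name is an assumption.

```python
from collections import Counter

def text_dot_plot(values, step=0.5):
    """Return text rows of a dot plot, one row per rounded value."""
    # Round each measurement to the nearest multiple of `step`.
    rounded = [round(v / step) * step for v in values]
    counts = Counter(rounded)
    # Repeated values pile up as extra dots at the same location.
    return [f"{position:5.1f} | " + "." * counts[position]
            for position in sorted(counts)]

# 36.3, 32.7 and 40.5 appear in the text; 36.4 and 36.9 are made up.
mileages = [36.3, 32.7, 40.5, 36.4, 36.9]
rows = text_dot_plot(mileages)
print("\n".join(rows))
```

Here 36.3 and 36.4 both round to 36.5, so that location shows a pile of two dots.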
Stem-and-Leaf Display
Another graphical representation of these same data, a
MINITAB stem-and-leaf display, is shown in Figure 2.9. In this display, the
stem is the portion of the measurement (mpg) to the left of the decimal point,
while the remaining portion, to the right of the decimal point, is the leaf. In
Figure 2.9, the stems for the data set are listed in the second column, from the
smallest (30) to the largest (44). Then the leaf for each observation is listed to
the right, in the row of the display corresponding to the observation’s stem.
For example, the leaf 3 of the first observation (36.3) in Table 2.2 appears in the
row corresponding to the stem 36. Similarly, the leaf 7 for the second
observation (32.7) in Table 2.2 appears in the row corresponding to the stem 32,
while the leaf 5 for the third observation (40.5) appears in the row
corresponding to the stem 40. (The stems and leaves for these first three
observations are highlighted in Figure 2.9.) Typically, the leaves in each row
are ordered as shown in the MINITAB stem-and-leaf display.
The stem-and-leaf display presents another
compact picture of the data set. You can see at a
glance that the 100 mileage readings were
distributed between 30.0 and 44.9, with most of
them falling in stem rows 35 to 39. The 6 leaves in
stem row 34 indicate that 6 of the 100 readings
were at least 34.0, but less than 35.0. Similarly,
the 11 leaves in stem row 35 indicate that 11 of
the 100 readings were at least 35.0, but less than
36.0. Only five cars had readings equal to 41 or
larger, and only one was as low as 30.
The definitions of the stem and leaf for a data set can be modified to alter the
graphical description. For example, suppose we had defined the stem as the tens
digit for the gas mileage data, rather than the ones and tens digits. With this
definition, the stems and leaves corresponding to the measurements 36.3 and
32.7 would be as follows:

Measurement   Stem | Leaf
36.3             3 | 6
32.7             3 | 2
Note that the decimal portion of the numbers has been dropped. Generally, only
one digit is displayed in the leaf. If you look at the data, you’ll see why we
didn’t define the stem this way. All the mileage measurements fall into the 30s
and 40s, so all the leaves would fall into just two stem rows in this display. The
resulting picture would not be nearly as informative as Figure 2.9.
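The two stem definitions discussed above can be compared with a short sketch. It assumes the measurements are recorded to one decimal place. The function name and the `stem_is_tens_digit` flag are hypothetical, and the value 36.9 is made up; the other values are measurements mentioned in the text (36.3 occurs twice in the data).

```python
from collections import defaultdict

def stem_and_leaf(values, stem_is_tens_digit=False):
    """Return {stem: sorted list of one-digit leaves}."""
    rows = defaultdict(list)
    for v in values:
        digits = round(v * 10)       # 36.3 -> 363, avoids float noise
        if stem_is_tens_digit:
            digits //= 10            # drop the decimal portion: 363 -> 36
        stem, leaf = divmod(digits, 10)
        rows[stem].append(leaf)
    return {stem: sorted(rows[stem]) for stem in sorted(rows)}

# 36.9 is a made-up value; the rest are mentioned in the text.
mileages = [36.3, 32.7, 40.5, 36.9, 36.3]

default_display = stem_and_leaf(mileages)
coarse_display = stem_and_leaf(mileages, stem_is_tens_digit=True)

for stem, leaves in default_display.items():
    print(f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}")
print()
for stem, leaves in coarse_display.items():
    print(f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}")
```

With the tens digit as the stem, every value lands in stem row 3 or 4, which illustrates why that definition gives a less informative picture for these data. Unlike the display described in the text, this sketch prints only the stems that actually occur.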
Histograms
An SPSS histogram for the 100 EPA mileage readings given in
Table 2.2 is shown in Figure 2.10. The horizontal axis of the figure, which gives
the miles per gallon for a given automobile, is divided into class intervals,
commencing with the interval from 30–31 and proceeding in intervals of equal
size to 44–45 mpg. The vertical axis gives the number (or frequency) of the 100
readings that fall into each interval. It appears that about 21 of the 100 cars, or
21%, attained a mileage between 37 and 38 mpg. This class interval contains
the highest frequency, and the intervals tend to contain a smaller number of the
measurements as the mileages get smaller or larger. Histograms can be used to
display either the frequency or relative frequency of the measurements falling
into the class intervals. The class intervals, frequencies, and relative frequencies
for the EPA car mileage data are shown in the summary table, Table 2.3.
Note that the sum of all class frequencies will always equal the sample size n. In
interpreting a histogram, consider two important facts. First, the proportion of
the total area under the histogram that falls above a particular interval on the
x-axis is equal to the relative frequency of measurements falling into that interval.
For example, the relative frequency for the class interval 37–38 in Figure 2.10 is
.20. Consequently, the rectangle above the interval contains .20 of the total area
under the histogram. Second, imagine the appearance of the relative frequency
histogram for a very large set of data (representing, say, a population). As the
number of measurements in a data set is increased, you can obtain a better
description of the data by decreasing the width of the
class intervals.
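A frequency table of the kind that underlies a histogram can be sketched as follows. The boundary convention (each class includes its lower endpoint and excludes its upper endpoint) is an assumption, since the text does not state how boundary values are handled. The function name is hypothetical, and the sample values are a small illustrative set, not the 100 EPA readings.

```python
import math

def frequency_table(values, start, width, n_classes):
    """Return (lower, upper, frequency, relative frequency) per class interval.

    Assumes each class is [lower, upper) and that the classes cover all values.
    """
    freqs = [0] * n_classes
    for v in values:
        k = math.floor((v - start) / width)
        if 0 <= k < n_classes:
            freqs[k] += 1
    n = len(values)
    return [(start + k * width, start + (k + 1) * width, f, f / n)
            for k, f in enumerate(freqs)]

# Small illustrative sample; most of these values are made up.
mileages = [36.3, 32.7, 40.5, 36.9, 36.3, 37.2, 37.8, 35.1]

# Fifteen classes of width 1, from 30-31 up to 44-45 mpg.
table = frequency_table(mileages, start=30, width=1, n_classes=15)
for lower, upper, freq, rel in table:
    print(f"{lower}-{upper}: {freq:2d}  {rel:.3f}")
```

When the classes cover every measurement, the frequencies sum to the sample size n and the relative frequencies sum to 1. With equal-width classes, each relative frequency is also the share of the histogram's total area above that interval.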
While histograms provide good visual descriptions of data sets—particularly
very large ones—they do not let us identify individual measurements. In
contrast, each of the original measurements is visible to some extent in a dot
plot and is clearly visible in a stem-and-leaf display. The stem-and-leaf display
arranges the data in ascending order, so it’s easy to locate the individual
measurements. For example, in Figure 2.9 we can easily see that two of the gas
mileage measurements are equal to 36.3, but we can’t see that fact by inspecting
the histogram in Figure 2.10. However, stem-and-leaf displays can become
unwieldy for very large data sets. A very large number of stems and leaves
causes the vertical and horizontal dimensions of the display to become
cumbersome, diminishing the usefulness of the visual display.
Numerical Measures of Central Tendency
When we speak of a data set, we refer to either a sample or a population. If
statistical inference is our goal, we’ll ultimately wish to use sample numerical
descriptive measures to make inferences about the corresponding measures for a
population. As you’ll see, a large number of numerical methods are available to
describe quantitative data sets. Most of these methods measure one of two data
characteristics:
1. The central tendency of the set of measurements, that is, the tendency of the
data to cluster, or center, about certain numerical values. (See Figure 2.14a.)
2. The variability of the set of measurements, that is, the spread of the data.
(See Figure 2.14b.)
In this section, we concentrate on measures of central tendency. In the next
section, we discuss measures of variability. The most popular and best
understood measure of central tendency for a quantitative data set is the
arithmetic mean (or simply the mean) of the data set.
The mean of a set of quantitative data is the sum of the measurements, divided
by the number of measurements contained in the data set. In everyday terms, the
mean is the average value of the data set and is often used to represent a
“typical” value. We denote the mean of a sample of measurements by x̄ (read
“x-bar”) and represent the formula for its calculation as follows:

x̄ = (x₁ + x₂ + ... + xₙ) / n

where n is the number of measurements in the sample.
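The definition of the sample mean (the sum of the measurements divided by the number of measurements) translates directly into code. The function name and the five sample values are hypothetical and used only for illustration.

```python
def sample_mean(measurements):
    """Sum of the measurements divided by the number of measurements."""
    return sum(measurements) / len(measurements)

# Hypothetical sample of five measurements.
sample = [5, 3, 8, 5, 6]
x_bar = sample_mean(sample)
print(x_bar)  # 27 / 5
```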
The sample mean x̄ will play an important role in accomplishing our objective
of making inferences about populations on the basis of information about the
sample. For this reason, we need to use a different symbol for the mean of a
population, that is, the mean of the set of measurements on every unit in the
population. We use the Greek letter μ (mu) for the population mean.
We’ll often use the sample mean x̄ to estimate (make an inference about) the
population mean μ. For example, the EPA mileages for the population
consisting of all cars have a mean equal to some value μ. Our sample of 100 cars
yielded mileages with a mean of x̄ = 36.9940. If, as is usually the case, we don’t
have access to the measurements for the entire population, we could use x̄ as an
estimator or approximator for μ. Then we’d need to know something about the
reliability of our inference. That is, we’d need to know how accurately we
might expect x̄ to estimate μ. In Chapter 7, we’ll find that this accuracy
depends on two factors:
1. The size of the sample. The larger the sample, the more accurate the estimate
will tend to be.
2. The variability, or spread, of the data. All other factors remaining constant,
the more variable the data, the less accurate is the estimate.
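Both factors can be seen in the standard error of the sample mean, which for a random sample of n independent measurements equals the standard deviation divided by the square root of n. This formula is a standard result and is not quoted from this text, which defers the details to Chapter 7. The function name and the standard deviation values below are hypothetical.

```python
import math

def standard_error_of_mean(std_dev, n):
    """Standard error of the sample mean for a random sample of size n."""
    return std_dev / math.sqrt(n)

# Hypothetical standard deviation of 2.4; larger samples shrink the error.
for n in (25, 100, 400):
    print(n, standard_error_of_mean(2.4, n))

# Doubling the variability doubles the error at the same sample size.
print(standard_error_of_mean(4.8, 100))
```

A smaller standard error means the sample mean tends to fall closer to the population mean.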
Another important measure of central tendency is the median.
The median of a quantitative data set is the middle number when the
measurements are arranged in ascending (or descending) order.
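A minimal sketch of finding the median follows. When the number of measurements is even there is no single middle number, so the sketch uses the usual convention of averaging the two middle values; that convention is not stated in the definition above. The function name and sample values are hypothetical.

```python
def sample_median(measurements):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(measurements)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical samples for illustration.
odd_sample = [5, 7, 4, 5, 20, 6, 2]   # sorted: 2 4 5 5 6 7 20
even_sample = [1, 2, 3, 4]
print(sample_median(odd_sample))
print(sample_median(even_sample))
```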
The median is of most value in describing large data sets. If a data set is
characterized by a relative frequency histogram (Figure 2.16), the median is the
point on the x-axis such that half the area under the histogram lies above the
median and half lies below. [Note: In Section 2.2, we observed that the relative
frequency associated with a particular interval on the x-axis is proportional to
the amount of area under the histogram that lies above the interval.] We denote
the median of a sample by M. As with the population mean, we use a Greek
letter (η) to represent the population median.
Consider, for example, the household incomes of a community
being studied by a sociologist. The presence of just a few households with very
high incomes will affect the mean more than the median. Thus, the median will
provide a more accurate picture of the typical income for the community. The
mean could exceed the vast majority of the sample measurements (household
incomes), making it a misleading measure of central tendency.
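The effect of a few very high incomes can be illustrated with Python's standard `statistics` module. The income figures below are hypothetical and chosen only to make the contrast visible.

```python
import statistics

# Hypothetical household incomes; one household has a very high income.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
below_mean = sum(1 for income in incomes if income < mean_income)

print("mean:  ", mean_income)
print("median:", median_income)
print("households below the mean:", below_mean, "of", len(incomes))
```

In this made-up sample the mean exceeds five of the six incomes, while the median sits among the typical households.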
A data set is said to be skewed if one tail of the distribution has more extreme
observations than the other tail.
Using the Mean and Standard Deviation to Describe Data
Conclusion:
Most populations are large data sets. Therefore, we need methods to
characterize a collection of data that allow us to make inferences about a
population based on the information contained in a sample. This chapter
introduces two kinds of methods for describing data, one graphical and the other
numerical. Both play an important role in statistics.
References: