Statistics - Methods of Describing Sets of Data


Methods of Describing Sets of Data

Introduction:
Suppose you want to assess the analytical capabilities of a class of 1,000
first-year college students based on their quantitative Scholastic Aptitude Test
(SAT) scores. How would you describe these 1,000 measurements? Characteristics
of interest include the average, or most typical, SAT score; the variability in
the scores; the highest and lowest scores; the "shape" of the data; and whether
the data set contains any unusual scores. Extracting this information is not
easy. The 1,000 scores provide too many pieces of information for our minds to
comprehend. Clearly, we need some method for summarizing and describing the
information in such a collection of data. Methods for describing data sets are
also essential for statistical inference.

Describing Qualitative Data


Consider a study of aphasia published in the Journal of Communication
Disorders. Aphasia is the “impairment or loss of the faculty of using or
understanding spoken or written language.” Three types of aphasia have been
identified by researchers: Broca’s, conduction, and anomic. The researchers
wanted to determine whether one type of aphasia occurs more often than any
other and, if so, how often. Consequently, they measured the type of aphasia for
a sample of 22 adult aphasics. Table 2.1 gives the type of aphasia diagnosed for
each aphasic in the sample. For this study, the variable of interest, type of
aphasia, is qualitative in nature. Qualitative data are nonnumerical in nature;
thus, the value of a qualitative variable can only be classified into categories
called classes. The possible types of aphasia (Broca’s, conduction, and anomic)
represent the classes for this qualitative variable. We can summarize such
data numerically in two ways: (1) by computing the class frequency, the
number of observations in the data set that fall into each class, or (2) by
computing the class relative frequency, the proportion of the total number of
observations falling into each class.
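The two numerical summaries just described can be sketched in a few lines of Python. The diagnoses below are hypothetical values made up for illustration (they are not the data in Table 2.1), and the function name `summarize_classes` is likewise an assumption of this sketch.

```python
from collections import Counter

# Hypothetical diagnoses for illustration only; NOT the data in Table 2.1.
diagnoses = [
    "Broca's", "anomic", "anomic", "conduction", "Broca's",
    "anomic", "conduction", "anomic", "Broca's", "anomic",
]

def summarize_classes(observations):
    """Return {class: (frequency, relative frequency)} for qualitative data."""
    n = len(observations)
    counts = Counter(observations)  # class frequency: observations per class
    # Relative frequency: the class frequency divided by the total count n.
    return {label: (count, count / n) for label, count in counts.items()}

summary = summarize_classes(diagnoses)
for label, (freq, rel_freq) in sorted(summary.items()):
    print(f"{label:<12} frequency={freq:2d}  relative frequency={rel_freq:.2f}")
```

The class frequencies always add up to the number of observations, and the relative frequencies add up to 1.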

Graphical Methods for Describing Quantitative Data


Quantitative data sets consist of data that are recorded on a meaningful
numerical scale. To describe, summarize, and detect patterns in such data, we
can use three graphical methods: dot plots, stem-and-leaf displays, and
histograms. Since most statistical software packages can be used to construct
these displays, we’ll focus here on their interpretation rather than their
construction. For example, the Environmental Protection Agency (EPA)
performs extensive tests on all new car models to determine their mileage
ratings. Suppose that the 100 measurements in Table 2.2 represent the results of
such tests on a certain new car model. How can we summarize the information
in this rather large sample?
a. Use graphical methods to describe each of the three qualitative variables for
all 223 wells.
b. Use side-by-side bar charts to compare the proportions of contaminated wells
for private and public well classes.
c. Use side-by-side bar charts to compare the proportions of contaminated wells
for bedrock and unconsolidated aquifers.
d. What inferences can be made from the bar charts of parts a–c?
2.24 Extinct New Zealand birds. Refer to the
Evolutionary Ecology Research (July 2003) study of the patterns of extinction
in the New Zealand bird population, Exercise 1.20 (p. 49). Data on flight
capability (volant or flightless), habitat (aquatic, ground terrestrial, or aerial
terrestrial), nesting site (ground, cavity within ground, tree, or cavity above
ground), nest density (high or low), diet (fish, vertebrates, vegetables, or
invertebrates), body mass (grams), egg length (millimeters), and extinct status
(extinct, absent from island, or present) for 132 bird species that existed at the
time of the Maori colonization of New Zealand are saved in the NZBIRDS file.
Use a graphical method to investigate the theory that extinct status is related to
flight capability, habitat, and nest density.
A visual inspection of the data indicates some obvious facts. For example, most
of the mileages are in the 30s, with a smaller fraction in the 40s. But it is
difficult to provide much additional information on the 100 mileage ratings
without resorting to some method of summarizing the data. One such method is
a dot plot.
Dot Plots A MINITAB dot plot for the 100 EPA mileage ratings is shown in
Figure 2.8. The horizontal axis of the figure is a scale for the quantitative
variable in miles per gallon (mpg).

The rounded (to the nearest half gallon) numerical value of each measurement
in the data set is located on the horizontal scale by a dot. When data values
repeat, the dots are placed above one another, forming a pile at that particular
numerical location. As you can see, this dot plot verifies that almost all of the
mileage ratings are in the 30s, with most falling between 35 and 40 miles per
gallon.
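As a rough illustration of how a dot plot piles up repeated values, the sketch below rounds each measurement to the nearest half unit and prints one dot per observation. It is a text-only approximation with the piles drawn sideways, not a reproduction of the MINITAB output. The function name `dot_plot` and the sample readings are assumptions for illustration and are not the values in Table 2.2.

```python
from collections import Counter

def dot_plot(values, step=0.5):
    """Return text rows of a dot plot, rounding values to the nearest `step`."""
    # Round each measurement to the nearest multiple of `step`.
    rounded = [round(v / step) * step for v in values]
    counts = Counter(rounded)
    # One row per occupied location; repeated values pile up as extra dots.
    return [f"{loc:5.1f} | {'.' * counts[loc]}" for loc in sorted(counts)]

# Hypothetical mileage readings (mpg), not the values in Table 2.2.
sample_mpg = [36.3, 36.4, 36.6, 37.0, 37.1, 35.2, 40.5, 32.7]
for row in dot_plot(sample_mpg):
    print(row)
```

Locations with longer runs of dots correspond to the taller piles in a conventional dot plot.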
Stem-and-Leaf Display Another graphical representation of these same data, a
MINITAB stem-and-leaf display, is shown in Figure 2.9. In this display, the
stem is the portion of the measurement (mpg) to the left of the decimal point,
while the remaining portion, to the right of the decimal point, is the leaf. In
Figure 2.9, the stems for the data set are listed in the second column, from the
smallest (30) to the largest (44). Then the leaf for each observation is listed to
the right, in the row of the display corresponding to the observation’s stem.
For example, the leaf 3 of the first observation (36.3) in Table 2.2 appears in the
row corresponding to the stem 36. Similarly, the leaf 7 for the second
observation (32.7) in Table 2.2 appears in the row corresponding to the stem 32,
while the leaf 5 for the third observation (40.5) appears in the row
corresponding to the stem 40. (The stems and leaves for these first three
observations are highlighted in Figure 2.9.) Typically, the leaves in each row
are ordered as shown in the MINITAB stem-and-leaf display.
The stem-and-leaf display presents another
compact picture of the data set. You can see at a
glance that the 100 mileage readings were
distributed between 30.0 and 44.9, with most of
them falling in stem rows 35 to 39. The 6 leaves in
stem row 34 indicate that 6 of the 100 readings
were at least 34.0, but less than 35.0. Similarly,
the 11 leaves in stem row 35 indicate that 11 of
the 100 readings were at least 35.0, but less than
36.0. Only five cars had readings equal to 41 or
larger, and only one was as low as 30.
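The stem and leaf splitting rule described above can be sketched as follows, assuming positive measurements recorded to one decimal place. The function name `stem_and_leaf` is an assumption of this sketch. The first three values and the repeated 36.3 come from the text; 36.9 is a hypothetical extra reading.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (digits left of the decimal point) to its sorted leaves."""
    rows = defaultdict(list)
    for v in values:
        tenths = round(v * 10)           # e.g. 36.3 -> 363; avoids float drift
        stem, leaf = divmod(tenths, 10)  # 363 -> stem 36, leaf 3
        rows[stem].append(leaf)
    # Stems in ascending order, leaves ordered within each row.
    return {stem: sorted(leaves) for stem, leaves in sorted(rows.items())}

display = stem_and_leaf([36.3, 32.7, 40.5, 36.3, 36.9])
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Because every original digit is kept, individual measurements can be read back from the display, which is not possible with a histogram.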
The definitions of the stem and leaf for a data set can be modified to alter the
graphical description. For example, suppose we had defined the stem as the tens
digit for the gas mileage data, rather than the ones and tens digits. With this
definition, the stems and leaves corresponding to the measurements 36.3 and
32.7 would be as follows:
Stem | Leaf
   3 | 6   (for 36.3)
   3 | 2   (for 32.7)
Note that the decimal portion of the numbers has been dropped. Generally, only
one digit is displayed in the leaf. If you look at the data, you’ll see why we
didn’t define the stem this way. All the mileage measurements fall into the 30s
and 40s, so all the leaves would fall into just two stem rows in this display. The
resulting picture would not be nearly as informative as Figure 2.9.
Histograms An SPSS histogram for the 100 EPA mileage readings given in
Table 2.2 is shown in Figure 2.10. The horizontal axis of the figure, which gives
the miles per gallon for a given automobile, is divided into class intervals,
commencing with the interval from 30–31 and proceeding in intervals of equal
size to 44–45 mpg. The vertical axis gives the number (or frequency) of the 100
readings that fall into each interval. It appears that about 21 of the 100 cars, or
21%, attained a mileage between 37 and 38 mpg. This class interval contains
the highest frequency, and the intervals tend to contain a smaller number of the
measurements as the mileages get smaller or larger. Histograms can be used to
display either the frequency or relative frequency of the measurements falling
into the class intervals. The class intervals, frequencies, and relative frequencies
for the EPA car mileage data are shown in the summary table, Table 2.3.

By summing the relative frequencies in the intervals 35–36, 36–37, 37–38, and
38–39, you find that 65% of the mileages are between 35 and 39. Similarly, only
2% of the cars obtained a mileage rating over 42.0. Many other summary
statements can be made by further examining the histogram and accompanying
summary table.
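A summary table of class intervals, frequencies, and relative frequencies can be built along the following lines. This is a minimal sketch: the function name `class_table`, its parameters, and the ten readings are assumptions for illustration (they are not the data behind Table 2.3). Each interval is assumed to include its lower endpoint but not its upper one.

```python
import math

def class_table(values, start, width, n_classes):
    """Return (lower, upper, frequency, relative frequency) per class interval."""
    freqs = [0] * n_classes
    for v in values:
        # Index of the interval [start + k*width, start + (k+1)*width).
        k = math.floor((v - start) / width)
        if 0 <= k < n_classes:  # values outside the covered range are skipped
            freqs[k] += 1
    n = len(values)
    return [(start + k * width, start + (k + 1) * width, f, f / n)
            for k, f in enumerate(freqs)]

# Hypothetical mileage readings (mpg), for illustration only.
readings = [35.2, 35.9, 36.3, 36.3, 36.8, 37.1, 37.4, 37.9, 38.6, 40.5]
table = class_table(readings, start=35, width=1, n_classes=6)
for lower, upper, freq, rel in table:
    print(f"{lower}-{upper}: frequency={freq}, relative frequency={rel:.2f}")
```

Summing the relative frequencies of adjacent rows gives the proportion of measurements in the combined interval, which is how statements such as the 65% figure above are obtained.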

Note that the sum of all class frequencies will always equal the sample size n. In
interpreting a histogram, consider two important facts. First, the proportion of
the total area under the histogram that falls above a particular interval on the
x-axis is equal to the relative frequency of measurements falling into that interval.
For example, the relative frequency for the class interval 37–38 in Figure 2.10 is
.20. Consequently, the rectangle above the interval contains .20 of the total area
under the histogram. Second, imagine the appearance of the relative frequency
histogram for a very large set of data (representing, say, a population). As the
number of measurements in a data set is increased, you can obtain a better
description of the data by decreasing the width of the class intervals. When the
class intervals become small enough, a relative frequency histogram will (for all
practical purposes) appear as a smooth curve. (See Figure 2.11.)

Some recommendations for selecting the number of intervals in a histogram for
smaller data sets are given in the following box:
While histograms provide good visual descriptions of data sets—particularly
very large ones—they do not let us identify individual measurements. In
contrast, each of the original measurements is visible to some extent in a dot
plot and is clearly visible in a stem-and-leaf display. The stem-and-leaf display
arranges the data in ascending order, so it’s easy to locate the individual
measurements. For example, in Figure 2.9 we can easily see that two of the gas
mileage measurements are equal to 36.3, but we can’t see that fact by inspecting
the histogram in Figure 2.10. However, stem-and-leaf displays can become
unwieldy for very large data sets. A very large number of stems and leaves
causes the vertical and horizontal dimensions of the display to become
cumbersome, diminishing the usefulness of the visual display.
Numerical Measures of Central Tendency
When we speak of a data set, we refer to either a sample or a population. If
statistical inference is our goal, we’ll ultimately wish to use sample numerical
descriptive measures to make inferences about the corresponding measures for a
population. As you’ll see, a large number of numerical methods are available to
describe quantitative data sets. Most of these methods measure one of two data
characteristics:
1. The central tendency of the set of measurements, that is, the tendency of the
data to cluster, or center, about certain numerical values. (See Figure 2.14a.)
2. The variability of the set of measurements, that is, the spread of the data.
(See Figure 2.14b.)
In this section, we concentrate on measures of central
tendency. In the next section, we discuss measures of variability. The most
popular and best understood measure of central tendency for a quantitative data
set is the arithmetic mean (or simply the mean) of the data set.

The mean of a set of quantitative data is the sum of the measurements divided
by the number of measurements contained in the data set. In everyday terms, the
mean is the average value of the data set and is often used to represent a
“typical” value. We denote the mean of a sample of measurements by x̄ (read
“x-bar”) and represent the formula for its calculation as follows:

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

The sample mean x̄ will play an important role in accomplishing our objective
of making inferences about populations on the basis of information about the
sample. For this reason, we need to use a different symbol for the mean of a
population, that is, the mean of the set of measurements on every unit in the
population. We use the Greek letter μ (mu) for the population mean.

We’ll often use the sample mean x̄ to estimate (make an inference about) the
population mean μ. For example, the EPA mileages for the population
consisting of all cars have a mean equal to some value μ. Our sample of 100 cars
yielded mileages with a mean of x̄ = 36.9940. If, as is usually the case, we don’t
have access to the measurements for the entire population, we could use x̄ as an
estimator or approximator for μ. Then we’d need to know something about the
reliability of our inference. That is, we’d need to know how accurately we
might expect x̄ to estimate μ. In Chapter 7, we’ll find that this accuracy
depends on two factors:
1. The size of the sample. The larger the sample, the more accurate the estimate
will tend to be.
2. The variability, or spread, of the data. All other factors remaining constant,
the more variable the data, the less accurate is the estimate.
Another important measure of central tendency is the median.
The median of a quantitative data set is the middle number when the
measurements are arranged in ascending (or descending) order.
The median is of most value in describing large data sets. If a data set is
characterized by a relative frequency histogram (Figure 2.16), the median is the
point on the x-axis such that half the area under the histogram lies above the
median and half lies below. [Note: In Section 2.2, we observed that the relative
frequency associated with a particular interval on the x-axis is proportional to
the amount of area under the histogram that lies above the interval.] We denote
the median of a sample by M. As with the population mean, we use a Greek
letter (η) to represent the population median.
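The "middle number" definition can be sketched as below. The function name `sample_median` is an assumption of this sketch. For an even number of measurements there is no single middle value; the sketch assumes the usual convention of averaging the two middle values, which the text above does not spell out.

```python
def sample_median(values):
    """Return the middle value of the measurements once they are sorted."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: assumed convention

print(sample_median([5, 7, 4, 5, 20, 6, 2]))  # odd number of measurements
print(sample_median([1, 2, 3, 4]))            # even number of measurements
```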

In certain situations, the median may be a better measure of central tendency
than the mean. In particular, the median is less sensitive than the mean to
extremely large or small measurements. Note, for instance, that all but one of
the measurements in part a of Example 2.5 are close to x = 5. The single
relatively large measurement, x = 20, does not affect the value of the median, 5,
but it causes the mean, x̄ = 7, to lie to the right of most of the measurements. As
another example of data for which the central tendency is better described by
the median than the mean, consider the household incomes of a community
being studied by a sociologist. The presence of just a few households with very
high incomes will affect the mean more than the median. Thus, the median will
provide a more accurate picture of the typical income for the community. The
mean could exceed the vast majority of the sample measurements (household
incomes), making it a misleading measure of central tendency.
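A short check makes the sensitivity argument concrete. The seven values below are illustrative: they were chosen to be consistent with the figures quoted above (median 5, mean 7, one large measurement of 20) and are not claimed to be the actual data of Example 2.5.

```python
from statistics import mean, median

# Illustrative values consistent with the quoted mean, median, and extreme value.
data = [5, 7, 4, 5, 20, 6, 2]
without_extreme = [x for x in data if x != 20]  # drop the single large value

print("with 20:   ", "mean =", mean(data), "median =", median(data))
print("without 20:", "mean =", round(mean(without_extreme), 2),
      "median =", median(without_extreme))
```

Dropping the extreme value pulls the mean down noticeably while the median stays at 5, which is the pattern the household income example describes.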
A data set is said to be skewed if one tail of the distribution has more extreme
observations than the other tail.

A third measure of central tendency is the mode of a set of measurements.


The mode is the measurement that occurs most frequently in the data set.
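The mode can be found by counting how often each value occurs. The sketch below uses `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest count. The data are illustrative only.

```python
from statistics import multimode

# Illustrative measurements; 5 occurs twice, every other value once.
measurements = [5, 7, 4, 5, 20, 6, 2]
modes = multimode(measurements)
print(modes)

# When two values tie for the highest frequency, both are reported.
tied = multimode([1, 1, 2, 2, 3])
print(tied)
```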
Numerical Measures of Variability
Measures of central tendency provide only a partial description of a quantitative
data set. The description is incomplete without a measure of the variability, or
spread, of the data set. Knowledge of the data set’s variability, along with
knowledge of its center, can help us visualize the shape of the data set as well as
its extreme values.
The range of a quantitative data set is equal to the largest measurement minus
the smallest measurement. The sample variance for a sample of n
measurements is equal to the sum of the squared deviations from the mean,
divided by (n – 1). The symbol s² is used to represent the sample variance.
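The definitions of the range and the sample variance translate directly into code. The sketch below also computes the sample standard deviation, taken here as the positive square root of the sample variance. The function names and the five data values are assumptions for illustration.

```python
import math

def sample_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two measurements")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [1, 2, 3, 4, 5]  # illustrative values: mean 3, squared deviations sum to 10
print(sample_range(data), sample_variance(data), round(sample_std(data), 4))
```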

Using the Mean and Standard Deviation to Describe Data

Numerical Measures of Relative Standing


We’ve seen that numerical measures of central tendency and variability describe
the general nature of a quantitative data set (either a sample or a population). In
addition, we may be interested in describing the relative quantitative location of
a particular measurement within a data set. Descriptive measures of the
relationship of a measurement to the rest of the data are called measures of
relative standing. One measure of the relative standing of a measurement is its
percentile ranking, or percentile score.
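A percentile ranking can be sketched as the percentage of measurements that fall below a given value. Conventions differ on whether ties count as "below"; the strict comparison used here is an assumption of this sketch, as are the function name `percentile_rank` and the hypothetical scores.

```python
def percentile_rank(values, x):
    """Percentage of measurements strictly below x (one possible convention)."""
    if not values:
        raise ValueError("percentile rank of an empty data set is undefined")
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical test scores, for illustration only.
scores = [410, 450, 480, 500, 520, 550, 580, 600, 640, 700]
rank = percentile_rank(scores, 600)
print(f"A score of 600 is at about the {rank:.0f}th percentile of this sample")
```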

Methods for Detecting Outliers: Box Plots and z-Scores


Sometimes it is important to identify inconsistent or unusual measurements in a
data set. An observation that is unusually large or small relative to the data
values we want to describe is called an outlier. Outliers are often attributable to
one of several causes. First, the measurement associated with the outlier may be
invalid. For example, the experimental procedure used to generate the
measurement may have malfunctioned, the experimenter may have misrecorded
the measurement, or the data might have been coded incorrectly in the
computer. Second, the outlier may be the result of a misclassified measurement.
That is, the measurement belongs to a population different from that from which
the rest of the sample was drawn. Finally, the measurement associated with the
outlier may be recorded correctly and from the same population as the rest of
the sample but represent a rare (chance) event. Such outliers occur most often
when the relative frequency distribution of the sample data is extremely skewed
because a skewed distribution has a tendency to include extremely large or
small observations relative to the others in the data set.
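The two approaches named in this section's heading can be sketched as follows. A z-score is taken here as the distance of a measurement from the sample mean in units of the sample standard deviation. The box plot rule is sketched with fences placed 1.5 interquartile ranges beyond the quartiles. The cutoffs (such as |z| > 3 or the 1.5 multiplier) are common conventions assumed for this sketch and are not stated in the text above, and the quartiles follow the default method of `statistics.quantiles`, which may differ slightly from other software. The data are illustrative.

```python
from statistics import mean, quantiles, stdev

def z_scores(values):
    """Distance of each measurement from the mean, in standard deviations."""
    x_bar, s = mean(values), stdev(values)
    return [(x - x_bar) / s for x in values]

def fence_outliers(values, k=1.5):
    """Measurements beyond k interquartile ranges from the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # default quartile method; conventions vary
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

data = [5, 7, 4, 5, 20, 6, 2]  # illustrative values with one large measurement
largest_z = max(z_scores(data))
flagged = fence_outliers(data)
print(f"largest z-score: {largest_z:.2f}")
print("beyond the fences:", flagged)
```

In a sample this small the two rules can disagree: the largest z-score stays well under 3 while the fence rule still flags 20, so the choice of cutoff deserves some care.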

Conclusion:
Most populations are large data sets. Therefore, we need methods to
characterize a collection of data that allow us to make inferences about a
population based on the information contained in a sample. This chapter
introduces two kinds of methods for describing data, one graphical and the other
numerical. Both play an important role in statistics.

References:

Statistics, James T. McClave, Terry Sincich
