Sullivan Section 3.4 Measures of Position and Outliers 1


154 CHAPTER 3 Numerically Summarizing Data

Age      Male Resident Pop    Female Resident Pop
0–9      20,700,000           19,826,000
10–19    21,369,000           20,475,000
20–29    21,417,000           21,355,000
30–39    19,455,000           20,011,000
40–49    20,839,000           21,532,000
50–59    20,785,000           22,058,000
60–69    14,739,000           16,362,000
70–79    7,641,000            9,474,000
≥ 80     4,230,000            6,561,000
Source: U.S. Census Bureau
(a) Approximate the population mean and standard deviation of age for males.
(b) Approximate the population mean and standard deviation of age for females.
(c) Which gender has the higher mean age?
(d) Which gender has more dispersion in age?
14. Age of Mother The following data represent the age of the mother at childbirth
for 1980 and 2013.

Age      1980 Births (thousands)    2013 Births (thousands)
10–14    9.8                        3.1
15–19    551.9                      274.6
20–24    1226.4                     902.1
25–29    1108.2                     1127.6
30–34    549.9                      1044.0
35–39    140.7                      487.5
40–44    23.2                       110.3
45–49    1.1                        8.2
Source: National Vital Statistics Reports
(a) Approximate the population mean and standard deviation of age for mothers in 1980.
(b) Approximate the population mean and standard deviation of age for mothers in 2013.
(c) Which year has the higher mean age?
(d) Which year has more dispersion in age?

Problems 15 and 16 use the following steps to approximate the median from grouped data.

Approximating the Median from Grouped Data
Step 1 Construct a cumulative frequency distribution.
Step 2 Identify the class in which the median lies. Remember, the median can be
obtained by determining the observation that lies in the middle.
Step 3 Interpolate the median using the formula

\[ \text{Median} = M = L + \frac{\frac{n}{2} - CF}{f}\,(i) \]

where L is the lower class limit of the class containing the median
n is the number of data values in the frequency distribution
CF is the cumulative frequency of the class immediately preceding the class
containing the median
f is the frequency of the median class
i is the class width of the class containing the median

15. Approximate the median of the frequency distribution in Problem 2.
16. Approximate the median of the frequency distribution in Problem 4.

Problems 17 and 18 use the following definition of the modal class. The modal class
of a variable can be obtained from data in a frequency distribution by determining
the class that has the largest frequency.
17. Determine the modal class of the frequency distribution in Problem 1.
18. Determine the modal class of the frequency distribution in Problem 2.
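
The three steps for approximating the median from grouped data can be sketched in Python. This is only an illustration under stated assumptions, not a worked solution from the text: the function name `grouped_median` and its argument layout are hypothetical, and it is applied to the 1980 births column of Problem 14 (in thousands), taking the lower class limits 10, 15, ..., 45 and the class width 5 from that table.

```python
def grouped_median(lower_limits, frequencies, width):
    """Approximate the median of grouped data by interpolation.

    lower_limits: lower class limit L of each class, in ascending order
    frequencies:  frequency f of each class
    width:        class width i (assumed equal for all classes)
    """
    n = sum(frequencies)
    cumulative = 0  # CF: cumulative frequency of the classes already passed
    for lower, f in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative
        # frequency reaches n/2.
        if f > 0 and cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("no class with positive frequency was supplied")


# Problem 14, 1980 births (thousands); hypothetical usage of the sketch.
lower_limits = [10, 15, 20, 25, 30, 35, 40, 45]
births_1980 = [9.8, 551.9, 1226.4, 1108.2, 549.9, 140.7, 23.2, 1.1]

median_age_1980 = grouped_median(lower_limits, births_1980, width=5)
print(round(median_age_1980, 2))
```

Here the median class is 25–29, because the cumulative frequency first reaches n/2 = 1805.6 in that class, so the interpolation starts from L = 25 with CF = 1788.1 and f = 1108.2.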

3.4 Measures of Position and Outliers

Objectives ❶ Determine and interpret z-scores


➋ Interpret percentiles
➌ Determine and interpret quartiles
➍ Determine and interpret the interquartile range
❺ Check a set of data for outliers

In Section 3.1, we determined measures of central tendency, which describe the typical
data value. Section 3.2 discussed measures of dispersion, which describe the amount of
spread in a set of data. In this section, we discuss measures of position, which describe
the relative position of a certain data value within the entire set of data.
SECTION 3.4 Measures of Position and Outliers 155

➊ Determine and Interpret z-Scores


At the end of the 2014 season, the Los Angeles Angels led the American League with
773 runs scored, while the Colorado Rockies led the National League with 755 runs
scored. It appears that the Angels are the better run-producing team. However, this
comparison is unfair because the two teams play in different leagues. The Angels play
in the American League, where the designated hitter bats for the pitcher, whereas the
Rockies play in the National League, where the pitcher must bat (pitchers are typically
poor hitters). To compare the two teams’ scoring of runs, we need to determine their
relative standings in their respective leagues. We can do this using a z-score.

Definition The z-score represents the distance that a data value is from the mean in terms of
the number of standard deviations. We find it by subtracting the mean from the data
value and dividing this result by the standard deviation. There is both a population
z-score and a sample z-score:
Population z-Score: \( z = \dfrac{x - \mu}{\sigma} \)
Sample z-Score: \( z = \dfrac{x - \bar{x}}{s} \)    (1)
The z-score is unitless. It has mean 0 and standard deviation 1.

In Other Words
The z-score provides a way to compare apples to oranges by converting variables with
different centers or spreads to variables with the same center (0) and spread (1).

If a data value is larger than the mean, the z-score is positive. If a data value is
smaller than the mean, the z-score is negative. If the data value equals the mean, the
z-score is zero. A z-score measures the number of standard deviations an observation
is above or below the mean. For example, a z-score of 1.24 means the data value is 1.24
standard deviations above the mean. A z-score of -2.31 means the data value is 2.31
standard deviations below the mean.
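
The claim that z-scores have mean 0 and standard deviation 1 can be checked numerically. The sketch below uses a small hypothetical population (the eight values are invented for illustration and do not come from the text) and the population formula, so it relies on `statistics.pstdev` rather than the sample standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical population, chosen so that mu = 5 and sigma = 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)       # population mean
sigma = pstdev(data)  # population standard deviation

# Population z-score: subtract the mean, divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in data]

print(z_scores)
print(mean(z_scores), pstdev(z_scores))
```

Whatever the original center and spread, the converted values are centered at 0 with spread 1, which is what makes z-scores from different data sets comparable.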

EXAMPLE 1 Comparing z-Scores


Problem Determine whether the Los Angeles Angels or the Colorado Rockies had
a relatively better run-producing season. The Angels scored 773 runs and play in
the American League, where the mean number of runs scored was μ = 677.4 and
the standard deviation was σ = 51.7 runs. The Rockies scored 755 runs and play
in the National League, where the mean number of runs scored was μ = 640.0 and the
standard deviation was σ = 55.9 runs.

Approach To determine which team had the relatively better run-producing season,
compute each team’s z-score. The team with the higher z-score had the better season.
Because we know the values of the population parameters, compute the population
z-score.

Solution We compute each team’s z-score, rounded to two decimal places.


Angels: \( z\text{-score} = \dfrac{x - \mu}{\sigma} = \dfrac{773 - 677.4}{51.7} = 1.85 \)

Rockies: \( z\text{-score} = \dfrac{x - \mu}{\sigma} = \dfrac{755 - 640.0}{55.9} = 2.06 \)

So the Angels had run production 1.85 standard deviations above the mean, while the
Rockies had run production 2.06 standard deviations above the mean. Therefore, the
Rockies had a relatively better year at scoring runs than the Angels.
Now Work Problem 5

In Example 1, the team with the higher z-score was said to have a relatively better
season in producing runs. With negative z-scores, we need to be careful when deciding
the better outcome. For example, suppose Bob and Mary run a marathon. If Bob
finished the marathon in 213 minutes, where the mean finishing time among all men was
242 minutes with a standard deviation of 57 minutes, and Mary finished the marathon
in 241 minutes, where the mean finishing time among all women was 273 minutes with
a standard deviation of 52 minutes, who did better in the race? Since Bob’s z-score is
\( z_{\text{Bob}} = \dfrac{213 - 242}{57} = -0.51 \) and Mary’s z-score is
\( z_{\text{Mary}} = \dfrac{241 - 273}{52} = -0.62 \), Mary
did better. Even though Bob’s z-score is larger, Mary did better because she is more
standard deviations below the mean.
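
The two comparisons above, runs scored and marathon times, can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `z_score` is an assumption, and the numbers are the ones quoted in Example 1 and in the marathon illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Example 1: more runs is better, so the larger z-score wins.
angels = z_score(773, 677.4, 51.7)
rockies = z_score(755, 640.0, 55.9)
print(f"Angels: {angels:.2f}  Rockies: {rockies:.2f}")

# Marathon: a shorter time is better, so the smaller (more negative) z-score wins.
bob = z_score(213, 242, 57)
mary = z_score(241, 273, 52)
print(f"Bob: {bob:.2f}  Mary: {mary:.2f}")

better_team = "Rockies" if rockies > angels else "Angels"
better_runner = "Mary" if mary < bob else "Bob"
print(better_team, better_runner)
```

The direction of the comparison is a modeling decision, not something the z-score decides: the code flips the inequality for the race because lower finishing times are better.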

➋ Interpret Percentiles
Recall that the median divides the lower 50% of a set of data from the upper 50%. The
median is a special case of a general concept called the percentile.

Definition The kth percentile, denoted Pk, of a set of data is a value such that k percent of the
observations are less than or equal to the value.

So percentiles divide a set of data that is written in ascending order into 100 parts;
thus 99 percentiles can be determined. For example, P1 divides the bottom 1% of the
observations from the top 99%, P2 divides the bottom 2% of the observations from the
top 98%, and so on. Figure 17 displays the 99 possible percentiles.

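Figure 17 [Number line from the smallest data value to the largest data value, marked
with the percentiles P1, P2, ..., P98, P99. P1 separates the bottom 1% and P2 the
bottom 2%; P99 separates the top 1% and P98 the top 2%.]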

Percentiles are used to give the relative standing of an observation. Many
standardized exams, such as the SAT college entrance exam, use percentiles to let
students know how they scored on the exam in relation to all other students who took
the exam.

EXAMPLE 2 Interpret a Percentile


Problem Jennifer just received the results of her SAT exam. Her SAT Mathematics
score of 600 is in the 74th percentile. What does this mean?

Approach The kth percentile of an observation means that k percent of the
observations are less than or equal to the observation.

Interpretation A percentile rank of 74% means that 74% of SAT Mathematics scores
are less than or equal to 600 and 26% of the scores are greater. So 26% of the students
who took the exam scored better than Jennifer.
Now Work Problem 15
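
The interpretation in Example 2 amounts to counting the observations at or below a given value. The sketch below illustrates that count with a hypothetical list of twelve scores (invented for illustration, not SAT data); the function name `percent_at_or_below` is likewise an assumption.

```python
def percent_at_or_below(data, value):
    """Percentage of observations that are less than or equal to value."""
    count = sum(1 for x in data if x <= value)
    return 100 * count / len(data)


# Hypothetical exam scores, for illustration only.
scores = [410, 450, 480, 500, 520, 540, 560, 580, 600, 620, 650, 700]

pct = percent_at_or_below(scores, 600)
print(f"{pct:.0f}% of scores are at or below 600; {100 - pct:.0f}% are higher")
```

In this made-up list, 9 of the 12 scores are at or below 600, so a score of 600 sits at the 75th percentile in the sense of the definition above.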

➌ Determine and Interpret Quartiles


The most common percentiles are quartiles. Quartiles divide data sets into fourths, or
four equal parts.

• The first quartile, denoted Q1, divides the bottom 25% of the data from the top 75%.
  Therefore, the first quartile is equivalent to the 25th percentile.
• The second quartile, Q2, divides the bottom 50% of the data from the top 50%; it is
  equivalent to the 50th percentile or the median.
• The third quartile, Q3, divides the bottom 75% of the data from the top 25%; it is
  equivalent to the 75th percentile.

Figure 18 illustrates the concept of quartiles.

In Other Words
The first quartile, Q1, is equivalent to the 25th percentile, P25. The 2nd quartile, Q2,
is equivalent to the 50th percentile, P50, which is equivalent to the median, M. Finally,
the third quartile, Q3, is equivalent to the 75th percentile, P75.

Figure 18 [Number line from the smallest data value to the largest data value, marked
with Q1, Q2 (the median), and Q3. Each of the four resulting parts contains 25% of
the data.]
Finding Quartiles
Step 1 Arrange the data in ascending order.
Step 2 Determine the median, M, or second quartile, Q2.
Step 3 Divide the data set into halves: the observations below (to the left of) M and
the observations above M. The first quartile, Q1, is the median of the bottom half of
the data and the third quartile, Q3, is the median of the top half of the data.

In Other Words
To find Q2, determine the median of the data set. To find Q1, determine the median of
the “lower half” of the data set. To find Q3, determine the median of the “upper half” of
the data set.

EXAMPLE 3 Finding and Interpreting Quartiles


Problem The Highway Loss Data Institute routinely collects data on collision coverage
claims. Collision coverage insures against physical damage to an insured individual’s
vehicle. The data in Table 16 represent a random sample of 18 collision coverage claims
based on data obtained from the Highway Loss Data Institute. Find and interpret the
first, second, and third quartiles for collision coverage claims.

Table 16  
$6751 $9908 $3461 $2336 $21,147 $2332
$189 $1185 $370 $1414 $4668 $1953
$10,034 $735 $802 $618 $180 $1657

Approach Follow the steps given above.

Solution
Step 1 The data written in ascending order are given as follows:
$180 $189 $370 $618 $735 $802 $1185 $1414 $1657
$1953 $2332 $2336 $3461 $4668 $6751 $9908 $10,034 $21,147

Step 2 There are n = 18 observations, so the median, or second quartile, Q2, is the
mean of the 9th and 10th observations. Therefore, M = Q2 = ($1657 + $1953)/2 = $1805.
Step 3 The median of the bottom half of the data is the first quartile, Q1. As shown
next, the median of these data is the 5th observation, so Q1 = $735.

$180 $189 $370 $618 $735 $802 $1185 $1414 $1657
                     ↑
                     Q1

NOTE
If the number of observations is odd, do not include the median when determining Q1
and Q3 by hand.

The median of the top half of the data is the third quartile, Q3. As shown next, the
median of these data is the 5th observation, so Q3 = $4668.

$1953 $2332 $2336 $3461 $4668 $6751 $9908 $10,034 $21,147
                          ↑
                          Q3

Interpretation Interpret the quartiles as percentiles. For example, 25% of the collision
claims are less than or equal to the first quartile, $735, and 75% of the collision claims
are greater than $735. Also, 50% of the collision claims are less than or equal to $1805,
the second quartile, and 50% of the collision claims are greater than $1805. Finally,
75% of the collision claims are less than or equal to $4668, the third quartile, and 25%
of the collision claims are greater than $4668.
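
The by-hand procedure of Example 3 translates directly into code. The following is a sketch of that procedure, not the output of any statistical package: the function name `quartiles_by_hand` is an assumption, and the data are the 18 claims from Table 16 with the dollar signs dropped.

```python
from statistics import median


def quartiles_by_hand(data):
    """Return (Q1, Q2, Q3) using the median-of-halves steps.

    When the number of observations is odd, the median itself is left out
    of both halves, as the NOTE in the text instructs.
    """
    ordered = sorted(data)              # Step 1: ascending order
    n = len(ordered)
    q2 = median(ordered)                # Step 2: the median, Q2
    half = n // 2
    lower = ordered[:half]              # Step 3: observations below M
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), q2, median(upper)


# Table 16: collision coverage claims, in dollars.
claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]

q1, q2, q3 = quartiles_by_hand(claims)
print(q1, q2, q3)
```

For these data the function returns Q1 = 735, Q2 = 1805, and Q3 = 4668, matching the values found by hand.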

EXAMPLE 4 Finding Quartiles Using Technology


Problem Find the quartiles of the collision coverage claims data in Table 16.

Approach Use both StatCrunch and Minitab to obtain the quartiles. The steps for
obtaining quartiles using a TI-83/84 Plus graphing calculator, Minitab, Excel, and
StatCrunch are given in the Technology Step-by-Step on pages 160–161.

Using Technology
Statistical packages may use different formulas for obtaining the quartiles, so results
may differ slightly.

Solution The results obtained from StatCrunch [Figure 19(a)] agree with our “by
hand” solution. In Figure 19(b), notice that the first quartile, 706, and the third quartile,
5189, reported by Minitab disagree with our “by hand” and StatCrunch result. This
difference is due to the fact that StatCrunch and Minitab use different algorithms for
obtaining quartiles.
Figure 19

(a) [StatCrunch output; not recoverable from the source]

(b) Descriptive statistics: Claim

Variable  N   N*  Mean  SE Mean  StDev  Minimum  Q1   Median  Q3    Maximum
Claim     18  0   3874  1250     5302   180      706  1805    5189  21147

Now Work Problem 21(b)
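
The disagreement between software packages can be reproduced with Python's standard library, whose `statistics.quantiles` function offers an `exclusive` and an `inclusive` interpolation method. The sketch below only illustrates that different conventions give different quartiles for the Table 16 data; it makes no claim about which algorithm StatCrunch or Minitab implements.

```python
from statistics import quantiles

# Table 16: collision coverage claims, in dollars.
claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]

# n=4 asks for the three cut points that split the data into quarters.
exclusive = quantiles(claims, n=4, method="exclusive")
inclusive = quantiles(claims, n=4, method="inclusive")

print("exclusive:", exclusive)  # Q1 = 705.75, Q3 = 5188.75
print("inclusive:", inclusive)  # Q1 = 751.75, Q3 = 4366.25
print("by hand:  ", [735, 1805, 4668])
```

All three conventions agree on the median, $1805, but not on Q1 and Q3. The `exclusive` results, 705.75 and 5188.75, round to the 706 and 5189 shown in Figure 19(b).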

➍ Determine and Interpret the Interquartile Range


So far we have discussed three measures of dispersion: range, standard deviation, and
variance, none of which is resistant. Quartiles, however, are resistant. For this reason,
quartiles are used to define a fourth measure of dispersion.

Definition The interquartile range, IQR, is the range of the middle 50% of the observations in a
data set. That is, the IQR is the difference between the third and first quartiles and is
found using the formula
IQR = Q3 - Q1

The interpretation of the interquartile range is similar to that of the range and standard
deviation. That is, the more spread a set of data has, the higher the interquartile range
will be.

EXAMPLE 5 Determining and Interpreting the Interquartile Range


Problem Determine and interpret the interquartile range of the collision claim data
from Example 3.

Approach Use the quartiles found by hand in Example 3. The interquartile range, IQR,
is found by computing the difference between the third and first quartiles. It represents
the range of the middle 50% of the observations.

Solution The interquartile range is


IQR = Q3 - Q1
= $4668 - $735
= $3933

Interpretation The IQR, that is, the range of the middle 50% of the observations, in
the collision claim data is $3933.
Now Work Problem 21(c)

Let’s compare the measures of central tendency and dispersion discussed thus
far for the collision claim data. The mean collision claim is $3874.4 and the median is
$1805. The median is more representative of the “center” because the data are skewed
to the right (only 5 of the 18 observations are greater than the mean). The range is
$21,147 - $180 = $20,967. The standard deviation is $5301.6 and the interquartile
range is $3933. The values of the range and standard deviation are affected by the
extreme claim of $21,147. In fact, if this claim had been $120,000 (let’s say the claim
was for a totaled Mercedes S-class AMG), then the range and standard deviation would
increase to $119,820 and $27,782.5, respectively. The interquartile range would not be
affected. Therefore, when the distribution of data is highly skewed or contains extreme
observations, it is best to use the interquartile range as the measure of dispersion
because it is resistant.
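
The resistance of the interquartile range can be checked directly by replacing the $21,147 claim with $120,000 and recomputing each measure. This is a sketch under stated assumptions: the helper names `quartiles` and `spread` are hypothetical, the quartiles follow the by-hand median-of-halves steps, and `statistics.stdev` computes the sample standard deviation.

```python
from statistics import median, stdev


def quartiles(data):
    """(Q1, Q2, Q3) by the median-of-halves procedure used in the text."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def spread(data):
    """Range, sample standard deviation, and interquartile range."""
    q1, _, q3 = quartiles(data)
    return {
        "range": max(data) - min(data),
        "std_dev": stdev(data),
        "iqr": q3 - q1,
    }


claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]

# Replace the largest claim with the hypothetical $120,000 claim.
inflated = [120000 if x == 21147 else x for x in claims]

original = spread(claims)
changed = spread(inflated)
for name in ("range", "std_dev", "iqr"):
    print(f"{name:8s} {original[name]:10.1f} -> {changed[name]:10.1f}")
```

Only the range and the standard deviation move; the interquartile range stays at $3933 because the change affects an observation outside the middle 50% of the data.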

Summary: Which Measures to Report


Shape of Distribution Measure of Central Tendency Measure of Dispersion
Symmetric Mean Standard deviation
Skewed left or skewed right Median Interquartile range

For the remainder of this text, the direction “describe the distribution” will mean to
describe its shape (skewed left, skewed right, symmetric), its center (mean or median),
and its spread (standard deviation or interquartile range).

❺ Check a Set of Data for Outliers


When performing any type of data analysis, we should always check for extreme
observations in the data set. Extreme observations are referred to as outliers. Outliers
can occur by chance, because of error in the measurement of a variable, during data
entry, or from errors in sampling. For example, in the 2000 presidential election, a
precinct in New Mexico accidentally recorded 610 absentee ballots for Al Gore as 110.
Workers in the Gore camp discovered the data-entry error through an analysis of vote
totals.
Outliers do not always occur because of error. Sometimes extreme observations
are common within a population. For example, suppose we wanted to estimate the
mean price of a European car. We might take a random sample of size 5 from the
population of all European automobiles. If our sample included a Ferrari F430 Spider
(approximately $175,000), it probably would be an outlier, because this car costs much
more than the typical European automobile. The value of this car would be considered
unusual because it is not a typical value from the data set.

CAUTION!
Outliers distort both the mean and the standard deviation, because neither is resistant.
Because these measures often form the basis for most statistical inference, any
conclusions drawn from a set of data that contains outliers can be flawed.
Use the following steps to check for outliers using quartiles.

Checking for Outliers by Using Quartiles


Step 1 Determine the first and third quartiles of the data.
Step 2 Compute the interquartile range.
Step 3 Determine the fences. Fences serve as cutoff points for determining outliers.
Lower fence = Q1 - 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
Step 4 If a data value is less than the lower fence or greater than the upper fence,
it is considered an outlier.

EXAMPLE 6 Checking for Outliers


Problem Check the collision coverage claims data in Table 16 for outliers.

Approach Follow the preceding steps. Any data value that is less than the lower fence
or greater than the upper fence will be considered an outlier.

Solution
Step 1 The quartiles found in Example 3 are Q1 = $735 and Q3 = $4668.
Step 2 The interquartile range, IQR, is
IQR = Q3 - Q1
= $4668 - $735
= $3933

Step 3 The lower fence, LF, is
LF = Q1 - 1.5(IQR)
= $735 - 1.5($3933)
= -$5164.5
The upper fence, UF, is
UF = Q3 + 1.5(IQR)
= $4668 + 1.5($3933)
= $10,567.5
Step 4 There are no observations below the lower fence. However, there is an
observation above the upper fence. The claim of $21,147 is an outlier.
Now Work Problem 21(d)
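
The four steps for checking outliers can be collected into one function. The sketch below is an illustration rather than a package routine: the names `quartiles` and `find_outliers` are assumptions, and the quartiles are computed by the median-of-halves steps so that the fences match Example 6.

```python
from statistics import median


def quartiles(data):
    """(Q1, Q2, Q3) by the median-of-halves procedure used in the text."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def find_outliers(data):
    """Return (lower fence, upper fence, list of outliers)."""
    q1, _, q3 = quartiles(data)                 # Step 1
    iqr = q3 - q1                               # Step 2
    lower_fence = q1 - 1.5 * iqr                # Step 3
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data                 # Step 4
                if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


# Table 16: collision coverage claims, in dollars.
claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]

lower_fence, upper_fence, outliers = find_outliers(claims)
print(lower_fence, upper_fence, outliers)
```

With the Table 16 data the fences are -5164.5 and 10567.5, and the only value outside them is 21147, in agreement with Step 4 of Example 6.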

Technology Step-by-Step Determining Quartiles

TI-83/84 Plus
Follow the same steps given to compute the mean and median from raw data. (Section 3.1)

Minitab
Follow the same steps given to compute the mean and median from raw data. (Section 3.1)

Excel
1. Enter the raw data into column A.
2. With the Data Analysis ToolPak enabled, select the Data tab and click on Data Analysis.
3. Select Rank and Percentile from the Data Analysis window. Press OK.
4. With the cursor in the Input Range cell, highlight the data. Press OK.

StatCrunch
Follow the same steps given to compute the mean and median from raw data. (Section 3.1)

3.4 Assess Your Understanding

Vocabulary

1. The ________ represents the number of standard deviations an observation is from
the mean.
2. The ________ ________ of a data set is a value such that k percent of the
observations are less than or equal to the value.
3. ________ divide data sets into fourths.
4. The ________ ________ is the range of the middle 50% of the observations in a
data set.

Applying the Concepts

5. Birth Weights Babies born after a gestation period of 32–35 weeks have a mean
weight of 2600 grams and a standard deviation of 660 grams. Babies born after a
gestation period of 40 weeks have a mean weight of 3500 grams and a standard
deviation of 470 grams. Suppose a 34-week gestation period baby weighs 2400 grams
and a 40-week gestation period baby weighs 3300 grams. What is the z-score for the
34-week gestation period baby? What is the z-score for the 40-week gestation period
baby? Which baby weighs less relative to the gestation period?

6. Birth Weights Babies born after a gestation period of 32–35 weeks have a mean
weight of 2600 grams and a standard deviation of 660 grams. Babies born after a
gestation period of 40 weeks have a mean weight of 3500 grams and a standard
deviation of 470 grams. Suppose a 34-week gestation period baby weighs 3000 grams
and a 40-week gestation period baby weighs 3900 grams. What is the z-score for the
34-week gestation period baby? What is the z-score for the 40-week gestation period
baby? Which baby weighs less relative to the gestation period?

7. Men versus Women The average 20- to 29-year-old man is 69.6 inches tall, with a
standard deviation of 3.0 inches, while the average 20- to 29-year-old woman is
64.1 inches tall, with a standard deviation of 3.8 inches. Who is relatively taller, a
75-inch man or a 70-inch woman?
Source: CDC Vital and Health Statistics, Advance Data, Number 361, July 5, 2005

8. Men versus Women The average 20- to 29-year-old man is 69.6 inches tall, with a
standard deviation of 3.0 inches, while the average 20- to 29-year-old woman is
64.1 inches tall, with a standard deviation of 3.8 inches. Who is relatively taller, a
67-inch man or a 62-inch woman?
Source: CDC Vital and Health Statistics, Advance Data, Number 361, July 5, 2005

9. ERA Champions In 2014, Clayton Kershaw of the Los Angeles Dodgers had the lowest
earned-run average (ERA is the mean number of runs yielded per nine innings pitched)
of any starting pitcher in the National League, with an ERA of 1.77. Also in 2014,
Felix Hernandez of the Seattle Mariners had the lowest ERA of any starting pitcher in
the American League with an ERA of 2.14. In the National League, the mean ERA in
2014 was 3.430 and the standard deviation was 0.721. In the American League, the
mean ERA in 2014 was 3.598 and the standard deviation was 0.762. Which player had
the better year relative to his peers? Why?

10. Batting Champions The highest batting average ever recorded in Major League
Baseball was by Ted Williams in 1941 when he hit 0.406. That year, the mean and
standard deviation for batting average were 0.2806 and 0.0328. In 2014, Jose Altuve
was the American League batting champion, with a batting average of 0.341. In 2014,
the mean and standard deviation for batting average were 0.2679 and 0.0282. Who
had the better year relative to his peers, Williams or Altuve? Why?

11. Swim Ryan Murphy, nephew of the author, swims for the University of California at
Berkeley. Ryan’s best time in the 100-meter backstroke is 45.3 seconds. The mean of
all NCAA swimmers in this event is 48.62 seconds with a standard deviation of
0.98 second. Ryan’s best time in the 200-meter backstroke is 99.32 seconds. The mean
of all NCAA swimmers in this event is 106.58 seconds with a standard deviation of
2.38 seconds. In which race is Ryan better?

12. Triathlon Roberto finishes a triathlon (750-meter swim, 5-kilometer run, and
20-kilometer bicycle) in 63.2 minutes. Among all men in the race, the mean finishing
time was 69.4 minutes with a standard deviation of 8.9 minutes. Zandra finishes the
same triathlon in 79.3 minutes. Among all women in the race, the mean finishing time
was 84.7 minutes with a standard deviation of 7.4 minutes. Who did better in relation
to their gender?

13. School Admissions A highly selective boarding school will only admit students who
place at least 1.5 standard deviations above the mean on a standardized test that has
a mean of 200 and a standard deviation of 26. What is the minimum score that an
applicant must make on the test to be accepted?

14. Quality Control A manufacturer of bolts has a quality-control policy that requires
it to destroy any bolts that are more than 2 standard deviations from the mean. The
quality-control engineer knows that the bolts coming off the assembly line have
a mean length of 8 cm with a standard deviation of 0.05 cm. For what lengths will a
bolt be destroyed?

15. You Explain It! Percentiles Explain the meaning of the following percentiles.
Source: Advance Data from Vital and Health Statistics
(a) The 15th percentile of the head circumference of males 3 to 5 months of age is
41.0 cm.
(b) The 90th percentile of the waist circumference of females 2 years of age is 52.7 cm.
(c) Anthropometry involves the measurement of the human body. One goal of these
measurements is to assess how body measurements may be changing over time. The
following table represents the standing height of males aged 20 years or older for
various age groups. Based on the percentile measurements of the different age groups,
what might you conclude?

                  Percentile
Age            10th    25th    50th    75th    90th
20–29          166.8   171.5   176.7   181.4   186.8
30–39          166.9   171.3   176.0   181.9   186.2
40–49          167.9   172.1   176.9   182.1   186.0
50–59          166.0   170.8   176.0   181.2   185.4
60–69          165.3   170.1   175.1   179.5   183.7
70–79          163.2   167.5   172.9   178.1   181.7
80 or older    161.7   166.1   170.5   175.3   179.4

16. You Explain It! Percentiles Explain the meaning of the following percentiles.
Source: National Center for Health Statistics.
(a) The 5th percentile of the weight of males 36 months of age is 12.0 kg.
(b) The 95th percentile of the length of newborn females is 53.8 cm.

17. You Explain It! Quartiles Violent crimes include rape, robbery, assault, and
homicide. The following is a summary of the violent-crime rate (violent crimes per
100,000 population) for all 50 states in the United States plus Washington, D.C.,
in 2012.

Q1 = 252.4   Q2 = 333.8   Q3 = 454.5

(a) Provide an interpretation of these results.
(b) Determine and interpret the interquartile range.
(c) The violent-crime rate in Washington, D.C., in 2012 was 1243.7. Would this be an
outlier?
(d) Do you believe that the distribution of violent-crime rates is skewed or symmetric?
Why?

18. You Explain It! Quartiles One variable that is measured by online homework
systems is the amount of time a student spends on homework for each section of the
text. The following is a summary of the number of minutes a student spends for each
section of the text for the fall 2014 semester in a College Algebra class at Joliet
Junior College.

Q1 = 42   Q2 = 51.5   Q3 = 72.5

(a) Provide an interpretation of these results.
(b) Determine and interpret the interquartile range.
(c) Suppose a student spent 2 hours doing homework for a section. Is this an outlier?
(d) Do you believe that the distribution of time spent doing homework is skewed or
symmetric? Why?

19. Ogives and Percentiles The following graph is an ogive of IQ scores. The vertical
axis in an ogive is the cumulative relative frequency and can also be interpreted as a
percentile.

[Graph: Percentile Ranks of IQ Scores. Horizontal axis: IQ, 40 to 180. Vertical axis:
Percentile, 0 to 100.]

(a) Find and interpret the percentile rank of an individual whose IQ is 100.
(b) Find and interpret the percentile rank of an individual whose IQ is 120.
(c) What score corresponds to the 60th percentile for IQ?

20. Ogives and Percentiles The following graph is an ogive of the mathematics scores
on the SAT. The vertical axis in an ogive is the cumulative relative frequency and can
also be interpreted as a percentile.

[Graph: SAT Mathematics Scores. Horizontal axis: Score, 0 to 900. Vertical axis:
Cumulative Relative Frequency, 0 to 1.]

(a) Find and interpret the percentile rank of a student who scored 450 on the SAT
mathematics exam.
(b) Find and interpret the percentile rank of a student who scored 750 on the SAT
mathematics exam.
(c) If Jane scored at the 44th percentile, what was her score?

21. SMART Car The following data represent the miles per gallon of a random sample
of SMART cars with a three-cylinder, 1.0-liter engine.

31.5  36.0  37.8  38.4  40.1  42.3
34.3  36.3  37.9  38.8  40.6  42.7
34.5  37.4  38.0  39.3  41.4  43.5
35.5  37.5  38.3  39.5  41.5  47.5
Source: www.fueleconomy.gov

(a) Compute the z-score corresponding to the individual who obtained 36.3 miles per
gallon. Interpret this result.
(b) Determine the quartiles.
(c) Compute and interpret the interquartile range, IQR.
(d) Determine the lower and upper fences. Are there any outliers?

22. Hemoglobin in Cats The following data represent the hemoglobin (in g/dL) for
20 randomly selected cats.

5.7   8.9   9.6   10.6  11.7
7.7   9.4   9.9   10.7  12.9
7.8   9.5   10.0  11.0  13.0
8.7   9.6   10.3  11.2  13.4
Source: Joliet Junior College Veterinarian Technology Program

(a) Compute the z-score corresponding to the hemoglobin of Blackie, 7.8 g/dL.
Interpret this result.
(b) Determine the quartiles.
(c) Compute and interpret the interquartile range, IQR.
(d) Determine the lower and upper fences. Are there any outliers?

23. Rate of Return of Google The following data represent the monthly rate of return
of Google common stock from its inception in January 2007 through November 2014.

-0.10  -0.02   0.00   0.02  -0.10   0.03   0.04  -0.15  -0.08
 0.02   0.01  -0.18  -0.10  -0.18   0.14   0.07  -0.01   0.09
 0.03   0.10  -0.17  -0.10   0.05   0.05   0.08   0.08  -0.07
 0.06   0.25  -0.07  -0.02   0.10   0.01   0.09  -0.07   0.17
 0.05  -0.02   0.30  -0.14   0.00   0.05   0.06  -0.08   0.17
Source: Yahoo!Finance

(a) Determine and interpret the quartiles.
(b) Check the data set for outliers.

24. CO2 Emissions The following data represent the carbon dioxide emissions from the
consumption of energy per capita (total carbon dioxide emissions, in tons, divided by
total population) for the countries of Europe.

1.31   5.38   10.36   5.73   3.57   5.40   6.24
8.59   9.46   6.48    11.06  7.94   4.63   6.12
14.87  9.94   10.06   10.71  15.86  6.93   3.58
4.09   9.91   161.57  7.82   8.70   8.33   9.38
7.31   16.75  9.95    23.87  7.76   8.86
Source: Carbon Dioxide Information Analysis Center

(a) Determine and interpret the quartiles.
(b) Is the observation corresponding to Albania, 1.31, an outlier?

25. Fraud Detection As part of its “Customers First” program, a cellular phone company
monitors monthly phone usage. The program identifies unusual use and alerts the
customer that their phone may have been used by another person. The data below
represent the monthly phone use in minutes of a customer enrolled in this program for
the past 20 months. The phone company decides to use the upper fence as the cutoff
point for the number of minutes at which the customer should be contacted. What is
the cutoff point?

346  345  489  358  471
442  466  505  466  372
442  461  515  549  437
480  490  429  470  516

26. Stolen Credit Card A credit card company has a fraud-detection service that
determines if a card has any unusual activity. The company maintains a database of
daily charges on a customer’s credit card. Days when the card was inactive are
excluded from the database. If a day’s worth of charges appears unusual, the customer
is contacted to make sure that the credit card has not been compromised. Use the
following daily charges (rounded to the nearest dollar) to determine the amount the
daily charges must exceed before the customer is contacted.

143  166  113  188  133
90   89   98   95   112
111  79   46   20   112
70   174  68   101  212

27. Student Survey of Income A survey of 50 randomly selected full-time Joliet Junior
College students was conducted during the Fall 2015 semester. In the survey, the
students were asked to disclose their weekly income from employment. If the student
did not work, $0 was entered.

0       262  0    635  0    0    671
244     521  476  100  650  454  95
12,777  567  310  527  0    67   736
83      159  0    547  188  389  300
719     0    367  316  0    0    181
479     0    82   579  289
375     347  331  281  628
0       203  149  0    403

(a) Check the data set for outliers.
(b) Draw a histogram of the data and label the outliers on the histogram.
(c) Provide an explanation for the outliers.

28. Student Survey of Entertainment Spending A survey of 40 randomly selected
full-time Joliet Junior College students was conducted in the Fall 2015 semester. In
the survey, the students were asked to disclose their weekly spending on entertainment.
The results of the survey are as follows:

21   54  64  33   65    32  21  16
22   39  67  54   22    51  26  14
115  7   80  59   20    33  13  36
36   10  12  101  1000  26  38  8
28   28  75  50   27    35  9   48
(a) Check the data set for outliers.
(b) Draw a histogram of the data and label the outliers on the
histogram.
(c) Provide an explanation for the outliers.
29. Pulse Rate Use the results of Problem 21 in Section 3.1
and Problem 19 in Section 3.2 to compute the z-scores for all
the students. Compute the mean and standard deviation of
these z-scores.
30. Travel Time Use the results of Problem 22 in Section 3.1
and Problem 20 in Section 3.2 to compute the z-scores for
all the students. Compute the mean and standard deviation
of these z-scores.
31. Fraud Detection Revisited Use the fraud-detection data
from Problem 25 to do the following.
(a) Determine the standard deviation and interquartile range of
the data.
(b) Suppose the month in which the customer used 346 minutes
was not actually that customer’s phone. That particular
month, the customer did not use her phone at all, so
0 minutes were used. How does changing the observation
from 346 to 0 affect the standard deviation and interquartile
range? What property does this illustrate?

Explaining the Concepts

32. Write a paragraph that explains the meaning of percentiles.
33. Suppose you received the highest score on an exam. Your
friend scored the second-highest score, yet you both were in the
99th percentile. How can this be?
34. Morningstar is a mutual fund rating agency. It ranks a fund’s
performance by using one to five stars. A one-star mutual fund
is in the bottom 10% of its investment class; a five-star mutual
fund is at the 90th percentile of its investment class. Interpret the
meaning of a five-star mutual fund.
35. When outliers are discovered, should they always be
removed from the data set before further analysis?
36. Mensa is an organization designed for people of high
intelligence. One qualifies for Mensa if one’s intelligence is
measured at or above the 98th percentile. Explain what this means.
37. Explain the advantage of using z-scores to compare
observations from two different data sets.
38. Explain the circumstances for which the interquartile range
is the preferred measure of dispersion. What is an advantage that
the standard deviation has over the interquartile range?
39. Explain what each quartile represents.

3.5 The Five-Number Summary and Boxplots

Objectives
❶ Compute the five-number summary
❷ Draw and interpret boxplots

Let’s consider what we have learned so far. In Chapter 2, we discussed techniques
for graphically representing data. These summaries included bar graphs, pie charts,
histograms, stem-and-leaf plots, and time series graphs. In Sections 3.1 to 3.4, we
presented techniques for measuring the center of a distribution, spread in a distribution,
and relative position of observations in a distribution of data. Why do we want these
summaries? What purpose do they serve?

Well, we want these summaries to see what the data can tell us. We explore the
data to see if they contain interesting information that may be useful in our research.
The summaries make this exploration much easier. In fact, because these summaries
represent an exploration, a famous statistician named John Tukey called this material
exploratory data analysis.

Tukey defined exploratory data analysis as “detective work—numerical detective
work—or graphical detective work.” He believed exploration of data is best carried out
the way a detective searches for evidence when investigating a crime. Our goal is only to
collect and present evidence. Drawing conclusions (or inference) is like the deliberations of
the jury. What we have done so far falls under the category of exploratory data analysis. We
have only collected information and presented summaries, not reached any conclusions.

We have already seen one of Tukey’s graphical summaries, the stem-and-leaf plot. In
this section, we look at two more summaries: the five-number summary and the boxplot.

Historical Note
John Tukey was born on July 16, 1915, in New Bedford, Massachusetts. His
parents graduated numbers 1 and 2 from Bates College and were voted
“the couple most likely to give birth to a genius.” Tukey earned
his undergraduate and master’s degrees in chemistry from Brown
University. In 1939, he earned his doctorate in mathematics from
Princeton University. He remained at Princeton and, in 1965, became
the founding chair of the Department of Statistics. Among his
many accomplishments, Tukey is credited with coining the terms
software and bit. In the early 1970s, he discussed the negative effects
of aerosol cans on the ozone layer. In December 1976, he published
Exploratory Data Analysis, from which the following quote appears:
“Exploratory data analysis can never be the whole story, but
nothing else can serve as the foundation stone—as the first
step” (p. 3). Tukey also recommended that the 1990
Census be adjusted by means of statistical formulas. John Tukey
died in New Brunswick, New Jersey, on July 26, 2000.

❶ Compute the Five-Number Summary

Remember that the median is a measure of central tendency that divides the lower 50%
of the data from the upper 50%. It is resistant to extreme values and is the preferred
measure of central tendency when data are skewed right or left.
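To see what it means for the median to be resistant to extreme values, consider the following short Python sketch. The data values and the function name center_measures are hypothetical, chosen only for illustration; they are not taken from the exercises. The sketch computes the mean and the median of a small data set, then repeats the computation after one extreme observation is added.

```python
from statistics import mean, median


def center_measures(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)


# Hypothetical weekly spending amounts, in dollars (illustration only).
typical = [12, 20, 21, 26, 28, 33, 36, 39, 48]

# The same data with a single extreme observation added.
with_extreme = typical + [1000]

mean_before, median_before = center_measures(typical)
mean_after, median_after = center_measures(with_extreme)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

With these hypothetical values, the single extreme observation moves the median only from 28 to 30.5, while the mean rises from about 29.2 to 126.3. This is the sense in which the median is resistant and the mean is not.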
