Sullivan Section 3.4 Measures of Position and Outliers 1
Age      Male Resident Pop   Female Resident Pop
0–9      20,700,000          19,826,000
10–19    21,369,000          20,475,000
20–29    21,417,000          21,355,000
30–39    19,455,000          20,011,000
40–49    20,839,000          21,532,000
50–59    20,785,000          22,058,000
60–69    14,739,000          16,362,000
70–79     7,641,000           9,474,000
≥ 80      4,230,000           6,561,000
Source: U.S. Census Bureau
(a) Approximate the population mean and standard deviation of age for males.
(b) Approximate the population mean and standard deviation of age for females.
(c) Which gender has the higher mean age?
(d) Which gender has more dispersion in age?
14. Age of Mother The following data represent the age of the mother at childbirth for 1980 and 2013.
Age      1980 Births (thousands)   2013 Births (thousands)
10–14       9.8                       3.1
15–19     551.9                     274.6
20–24    1226.4                     902.1
25–29    1108.2                    1127.6
30–34     549.9                    1044.0
35–39     140.7                     487.5
40–44      23.2                     110.3
45–49       1.1                       8.2
Source: National Vital Statistics Reports
(a) Approximate the population mean and standard deviation of age for mothers in 1980.
(b) Approximate the population mean and standard deviation of age for mothers in 2013.
(c) Which year has the higher mean age?
(d) Which year has more dispersion in age?
Problems 15 and 16 use the following steps to approximate the median from grouped data.
Approximating the Median from Grouped Data
Step 1 Construct a cumulative frequency distribution.
Step 2 Identify the class in which the median lies. Remember, the median can be obtained by determining the observation that lies in the middle.
Step 3 Interpolate the median using the formula
$$\text{Median} = M = L + \frac{\frac{n}{2} - CF}{f}\,(i)$$
where L is the lower class limit of the class containing the median
n is the number of data values in the frequency distribution
CF is the cumulative frequency of the class immediately preceding the class containing the median
f is the frequency of the median class
i is the class width of the class containing the median
15. Approximate the median of the frequency distribution in Problem 2.
16. Approximate the median of the frequency distribution in Problem 4.
Problems 17 and 18 use the following definition of the modal class. The modal class of a variable can be obtained from data in a frequency distribution by determining the class that has the largest frequency.
17. Determine the modal class of the frequency distribution in Problem 1.
18. Determine the modal class of the frequency distribution in Problem 2.
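The three steps for approximating the median from grouped data (Problems 15 and 16) can be sketched in Python. This is only an illustration: the function name `grouped_median` and the small frequency distribution are hypothetical and do not come from the problems in the text.

```python
from itertools import accumulate

def grouped_median(lower_limits, frequencies, class_width):
    """Approximate the median of grouped data by interpolation."""
    n = sum(frequencies)
    # Step 1: cumulative frequency distribution
    cumulative = list(accumulate(frequencies))
    # Step 2: the median class is the first class whose cumulative
    # frequency reaches n/2
    k = next(idx for idx, cf in enumerate(cumulative) if cf >= n / 2)
    cf_before = cumulative[k - 1] if k > 0 else 0
    # Step 3: M = L + ((n/2 - CF) / f) * i
    return lower_limits[k] + (n / 2 - cf_before) / frequencies[k] * class_width

# Hypothetical classes 0-9, 10-19, 20-29, 30-39 (class width 10)
lower_limits = [0, 10, 20, 30]
frequencies = [2, 3, 5, 2]
approx_median = grouped_median(lower_limits, frequencies, class_width=10)
print(approx_median)
```

Here n = 12, so the median lies in the 20–29 class (CF = 5, f = 5), and the interpolation gives 20 + ((6 - 5)/5)(10) = 22.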
In Section 3.1, we determined measures of central tendency, which describe the typical
data value. Section 3.2 discussed measures of dispersion, which describe the amount of
spread in a set of data. In this section, we discuss measures of position, which describe
the relative position of a certain data value within the entire set of data.
Definition The z-score represents the distance that a data value is from the mean in terms of
the number of standard deviations. We find it by subtracting the mean from the data
value and dividing this result by the standard deviation. There is both a population
z-score and a sample z-score:
Population z-Score
$$z = \frac{x - \mu}{\sigma}$$
Sample z-Score
$$z = \frac{x - \bar{x}}{s} \qquad (1)$$
The z-score is unitless. It has mean 0 and standard deviation 1.
If a data value is larger than the mean, the z-score is positive. If a data value is smaller than the mean, the z-score is negative. If the data value equals the mean, the z-score is zero. A z-score measures the number of standard deviations an observation is above or below the mean. For example, a z-score of 1.24 means the data value is 1.24 standard deviations above the mean. A z-score of -2.31 means the data value is 2.31 standard deviations below the mean.
In Other Words
The z-score provides a way to compare apples to oranges by converting variables with different centers or spreads to variables with the same center (0) and spread (1).
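As a minimal sketch of the population z-score formula, the following Python snippet converts a small, hypothetical data set (not from the text) to z-scores and checks that the resulting z-scores have mean 0 and standard deviation 1. The helper name `z_scores` is an assumption made for illustration.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Population z-score of each value: (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

# Hypothetical data with mean 5 and population standard deviation 2
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(data)
print(z)
print(mean(z), pstdev(z))
```

Values below the mean of 5 get negative z-scores, values above it get positive z-scores, and the two values equal to the mean get a z-score of zero.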
Approach To determine which team had the relatively better run-producing season,
compute each team’s z-score. The team with the higher z-score had the better season.
Because we know the values of the population parameters, compute the population
z-score.
In Example 1, the team with the higher z-score was said to have a relatively better
season in producing runs. With negative z-scores, we need to be careful when deciding
the better outcome. For example, suppose Bob and Mary run a marathon. If Bob
finished the marathon in 213 minutes, where the mean finishing time among all men was
242 minutes with a standard deviation of 57 minutes, and Mary finished the marathon
in 241 minutes, where the mean finishing time among all women was 273 minutes with
a standard deviation of 52 minutes, who did better in the race? Since Bob’s z-score is
$$z_{\text{Bob}} = \frac{213 - 242}{57} = -0.51$$
and Mary’s z-score is
$$z_{\text{Mary}} = \frac{241 - 273}{52} = -0.62,$$
Mary did better. Even though Bob’s z-score is larger, Mary did better because she is more
standard deviations below the mean.
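The marathon comparison can be checked with a few lines of Python. The function and variable names are illustrative. The numbers are the finishing times, means, and standard deviations given above.

```python
def z_score(x, mu, sigma):
    """Population z-score: distance from the mean in standard deviations."""
    return (x - mu) / sigma

z_bob = z_score(213, mu=242, sigma=57)
z_mary = z_score(241, mu=273, sigma=52)

# For finishing times, lower is better, so the more negative z-score wins.
better = "Mary" if z_mary < z_bob else "Bob"
print(round(z_bob, 2), round(z_mary, 2), better)
```

The comparison is written with `<` rather than `>` because a smaller finishing time is the better outcome.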
➋ Interpret Percentiles
Recall that the median divides the lower 50% of a set of data from the upper 50%. The
median is a special case of a general concept called the percentile.
Definition The kth percentile, denoted Pk, of a set of data is a value such that k percent of the
observations are less than or equal to the value.
So percentiles divide a set of data that is written in ascending order into 100 parts;
thus 99 percentiles can be determined. For example, P1 divides the bottom 1% of the
observations from the top 99%, P2 divides the bottom 2% of the observations from the
top 98%, and so on. Figure 17 displays the 99 possible percentiles.
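To make the definition concrete, here is a small Python sketch that computes the percent of observations less than or equal to a given value. The scores are hypothetical and the helper name `percentile_rank` is an assumption for illustration.

```python
def percentile_rank(values, x):
    """Percent of observations that are less than or equal to x."""
    count = sum(1 for v in values if v <= x)
    return 100 * count / len(values)

# Hypothetical exam scores
scores = [450, 500, 520, 600, 600, 640, 700, 720]
rank = percentile_rank(scores, 600)
print(f"{rank:.1f}% of the scores are less than or equal to 600")
```

Five of the eight scores are at or below 600, so 62.5% of the observations are less than or equal to that value.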
Interpretation A percentile rank of 74% means that 74% of SAT Mathematics scores
are less than or equal to 600 and 26% of the scores are greater. So 26% of the students
who took the exam scored better than Jennifer.
Now Work Problem 15
• The first quartile, denoted Q1, divides the bottom 25% of the data from the top 75%. Therefore, the first quartile is equivalent to the 25th percentile.
• The second quartile, Q2, divides the bottom 50% of the data from the top 50%; it is equivalent to the 50th percentile or the median.
• The third quartile, Q3, divides the bottom 75% of the data from the top 25%; it is equivalent to the 75th percentile.
Figure 18 illustrates the concept of quartiles.
In Other Words
The first quartile, Q1, is equivalent to the 25th percentile, P25. The 2nd quartile, Q2, is equivalent to the 50th percentile, P50, which is equivalent to the median, M. Finally, the third quartile, Q3, is equivalent to the 75th percentile, P75.
[Figure 18: number line marking, from left to right, the smallest data value, Q1, Q2 (the median), Q3, and the largest data value]
Finding Quartiles
Step 1 Arrange the data in ascending order.
Step 2 Determine the median, M, or second quartile, Q2.
Step 3 Divide the data set into halves: the observations below (to the left of) M and the observations above M. The first quartile, Q1, is the median of the bottom half of the data and the third quartile, Q3, is the median of the top half of the data.
In Other Words
To find Q2, determine the median of the data set. To find Q1, determine the median of the “lower half” of the data set. To find Q3, determine the median of the “upper half” of the data set.
Table 16
$6751 $9908 $3461 $2336 $21,147 $2332
$189 $1185 $370 $1414 $4668 $1953
$10,034 $735 $802 $618 $180 $1657
Solution
Step 1 The data written in ascending order are given as follows:
$180 $189 $370 $618 $735 $802 $1185 $1414 $1657
$1953 $2332 $2336 $3461 $4668 $6751 $9908 $10,034 $21,147
Step 2 There are n = 18 observations, so the median, or second quartile, Q2, is the
mean of the 9th and 10th observations. Therefore, M = Q2 = ($1657 + $1953)/2 = $1805.
Step 3 The median of the bottom half of the data is the first quartile, Q1. As shown
next, the median of these data is the 5th observation, so Q1 = $735.
$180 $189 $370 $618 $735 $802 $1185 $1414 $1657
                     ↑
                     Q1
NOTE
If the number of observations is odd, do not include the median when determining Q1 and Q3 by hand.
158 CHAPTER 3 Numerically Summarizing Data
The median of the top half of the data is the third quartile, Q3. As shown next, the
median of these data is the 5th observation, so Q3 = $4668.
$1953 $2332 $2336 $3461 $4668 $6751 $9908 $10,034 $21,147
                          ↑
                          Q3
Interpretation Interpret the quartiles as percentiles. For example, 25% of the collision
claims are less than or equal to the first quartile, $735, and 75% of the collision claims
are greater than $735. Also, 50% of the collision claims are less than or equal to $1805,
the second quartile, and 50% of the collision claims are greater than $1805. Finally,
75% of the collision claims are less than or equal to $4668, the third quartile, and 25%
of the collision claims are greater than $4668.
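The by-hand quartile procedure can be sketched in Python using the collision claim data from Table 16. The function name `quartiles_by_hand` is illustrative. The sketch follows the text's convention of excluding the median from both halves when the number of observations is odd.

```python
from statistics import median

def quartiles_by_hand(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)          # Step 1: ascending order
    n = len(data)
    half = n // 2
    q2 = median(data)              # Step 2: the median is Q2
    # Step 3: split into halves; skip the middle value when n is odd
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), q2, median(upper)

# Collision claims from Table 16, in dollars
claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]
q1, q2, q3 = quartiles_by_hand(claims)
print(q1, q2, q3)
```

With these data the function reproduces the values found by hand: Q1 = $735, Q2 = $1805, and Q3 = $4668.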
Using Technology
Approach Use both StatCrunch and Minitab to obtain the quartiles. The steps for
obtaining quartiles using a TI-83/84 Plus graphing calculator, Minitab, Excel, and
StatCrunch are given in the Technology Step-by-Step on pages 160–161.
Solution The results obtained from StatCrunch [Figure 19(a)] agree with our “by
hand” solution. In Figure 19(b), notice that the first quartile, 706, and the third quartile,
5189, reported by Minitab disagree with our “by hand” and StatCrunch result. This
difference is due to the fact that StatCrunch and Minitab use different algorithms for
obtaining quartiles.
NOTE
Statistical packages may use different formulas for obtaining the quartiles, so results may differ slightly.
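The same effect can be seen inside a single tool. As an illustration that is not part of the text, Python's standard-library `statistics.quantiles` function (Python 3.8 or later) offers two interpolation methods, and they give different first and third quartiles for the collision claim data. No claim is made here about which formula any particular package uses internally.

```python
from statistics import quantiles

# Collision claims from Table 16, in dollars
claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]

# n=4 asks for the three cut points that split the data into quarters
exclusive = quantiles(claims, n=4, method="exclusive")  # the default method
inclusive = quantiles(claims, n=4, method="inclusive")

print(exclusive)
print(inclusive)
```

Both methods agree on the median, $1805, but neither reproduces the by-hand Q1 = $735 and Q3 = $4668. The default "exclusive" results, 705.75 and 5188.75, happen to round to the 706 and 5189 that the text reports for Minitab.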
[Figure 19: (a) StatCrunch output; (b) Minitab output]
Definition The interquartile range, IQR, is the range of the middle 50% of the observations in a
data set. That is, the IQR is the difference between the third and first quartiles and is
found using the formula
IQR = Q3 - Q1
The interpretation of the interquartile range is similar to that of the range and standard
deviation. That is, the more spread a set of data has, the higher the interquartile range
will be.
Approach Use the quartiles found by hand in Example 3. The interquartile range, IQR,
is found by computing the difference between the third and first quartiles. It represents
the range of the middle 50% of the observations.
Interpretation The IQR, that is, the range of the middle 50% of the observations, in
the collision claim data is $3933.
Now Work Problem 21(c)
Let’s compare the measures of central tendency and dispersion discussed thus
far for the collision claim data. The mean collision claim is $3874.4 and the median is
$1805. The median is more representative of the “center” because the data are skewed
to the right (only 5 of the 18 observations are greater than the mean). The range is
$21,147 - $180 = $20,967. The standard deviation is $5301.6 and the interquartile
range is $3933. The values of the range and standard deviation are affected by the
extreme claim of $21,147. In fact, if this claim had been $120,000 (let’s say the claim
was for a totaled Mercedes S-class AMG), then the range and standard deviation would
increase to $119,820 and $27,782.5, respectively. The interquartile range would not be
affected. Therefore, when the distribution of data is highly skewed or contains extreme
observations, it is best to use the interquartile range as the measure of dispersion
because it is resistant.
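The resistance of the interquartile range can be checked numerically. The sketch below recomputes the range, sample standard deviation, and IQR for the collision claim data, then repeats the calculation with the $21,147 claim replaced by $120,000. The helper names are illustrative, and the IQR uses the by-hand median-of-halves method.

```python
from statistics import median, stdev

def iqr_by_hand(values):
    """IQR = Q3 - Q1, with quartiles as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(upper) - median(lower)

def summarize(values):
    return {
        "range": max(values) - min(values),
        "stdev": stdev(values),      # sample standard deviation
        "iqr": iqr_by_hand(values),
    }

claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]
# Replace the extreme claim with the hypothetical $120,000 claim
extreme = [120000 if c == 21147 else c for c in claims]

original = summarize(claims)
modified = summarize(extreme)
print(original)
print(modified)
```

The range and standard deviation jump when the extreme claim grows, while the IQR stays at $3933 in both cases.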
For the remainder of this text, the direction “describe the distribution” will mean to
describe its shape (skewed left, skewed right, symmetric), its center (mean or median),
and its spread (standard deviation or interquartile range).
(approximately $175,000), it probably would be an outlier, because this car costs much
more than the typical European automobile. The value of this car would be considered
unusual because it is not a typical value from the data set.
Use the following steps to check for outliers using quartiles.
Approach Follow the preceding steps. Any data value that is less than the lower fence
or greater than the upper fence will be considered an outlier.
Solution
Step 1 The quartiles found in Example 3 are Q1 = $735 and Q3 = $4668.
Step 2 The interquartile range, IQR, is
IQR = Q3 - Q1
= $4668 - $735
= $3933
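Steps 1 and 2 give the quartiles and the IQR. The fence formulas themselves do not appear in this excerpt, so the sketch below assumes the conventional rule: lower fence = Q1 - 1.5(IQR) and upper fence = Q3 + 1.5(IQR). The variable names are illustrative.

```python
# Collision claims from Table 16, in dollars
claims = [6751, 9908, 3461, 2336, 21147, 2332,
          189, 1185, 370, 1414, 4668, 1953,
          10034, 735, 802, 618, 180, 1657]

q1, q3 = 735, 4668           # quartiles found by hand in Example 3
iqr = q3 - q1                # 3933

# Assumed conventional fences: 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [c for c in claims if c < lower_fence or c > upper_fence]
print(lower_fence, upper_fence, outliers)
```

Under this assumption the fences are -$5164.50 and $10,567.50, so the only claim flagged as an outlier is $21,147.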
Excel
1. Enter the raw data into column A.
2. With the Data Analysis ToolPak enabled, select the Data tab and click on Data Analysis.
3. Select Rank and Percentile from the Data Analysis window. Press OK.
4. With the cursor in the Input Range cell, highlight the data. Press OK.
StatCrunch
Follow the same steps given to compute the mean and median from raw data. (Section 3.1)
a mean length of 8 cm with a standard deviation of 0.05 cm. For what lengths will a bolt be destroyed?
15. You Explain It! Percentiles Explain the meaning of the following percentiles.
Source: Advance Data from Vital and Health Statistics
(a) The 15th percentile of the head circumference of males 3 to 5 months of age is 41.0 cm.
(b) The 90th percentile of the waist circumference of females 2 years of age is 52.7 cm.
(c) Anthropometry involves the measurement of the human body. One goal of these measurements is to assess how body measurements may be changing over time. The following table represents the standing height of males aged 20 years or older for various age groups. Based on the percentile measurements of the different age groups, what might you conclude?
         Percentile
Age      10th   25th   50th   75th   90th

(d) Do you believe that the distribution of time spent doing homework is skewed or symmetric? Why?
19. Ogives and Percentiles The following graph is an ogive of IQ scores. The vertical axis in an ogive is the cumulative relative frequency and can also be interpreted as a percentile.
[Figure: ogive titled “Percentile Ranks of IQ Scores”; vertical axis: Percentile]
31.5   36.0   37.8   38.4   40.1   42.3
34.3   36.3   37.9   38.8   40.6   42.7
34.5   37.4   38.0   39.3   41.4   43.5
35.5   37.5   38.3   39.5   41.5   47.5
Source: www.fueleconomy.gov
(a) Compute the z-score corresponding to the individual who obtained 36.3 miles per gallon. Interpret this result.
(b) Determine the quartiles.
(c) Compute and interpret the interquartile range, IQR.
(d) Determine the lower and upper fences. Are there any outliers?
22. Hemoglobin in Cats The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.
5.7    8.9    9.6   10.6   11.7
7.7    9.4    9.9   10.7   12.9
7.8    9.5   10.0   11.0   13.0
8.7    9.6   10.3   11.2   13.4
Source: Joliet Junior College Veterinarian Technology Program
(a) Compute the z-score corresponding to the hemoglobin of Blackie, 7.8 g/dL. Interpret this result.
(b) Determine the quartiles.
(c) Compute and interpret the interquartile range, IQR.
(d) Determine the lower and upper fences. Are there any outliers?
23. Rate of Return of Google The following data represent the monthly rate of return of Google common stock from its inception in January 2007 through November 2014.
-0.10  -0.02   0.00   0.02  -0.10   0.03   0.04  -0.15  -0.08
 0.02   0.01  -0.18  -0.10  -0.18   0.14   0.07  -0.01   0.09
 0.03   0.10  -0.17  -0.10   0.05   0.05   0.08   0.08  -0.07
 0.06   0.25  -0.07  -0.02   0.10   0.01   0.09  -0.07   0.17
 0.05  -0.02   0.30  -0.14   0.00   0.05   0.06  -0.08   0.17
Source: Yahoo!Finance
(a) Determine and interpret the quartiles.
(b) Check the data set for outliers.
24. CO2 Emissions The following data represent the carbon dioxide emissions from the consumption of energy per capita (total carbon dioxide emissions, in tons, divided by total population) for the countries of Europe.
 1.31    5.38   10.36    5.73    3.57    5.40    6.24
 8.59    9.46    6.48   11.06    7.94    4.63    6.12
14.87    9.94   10.06   10.71   15.86    6.93    3.58
 4.09    9.91  161.57    7.82    8.70    8.33    9.38
 7.31   16.75    9.95   23.87    7.76    8.86
Source: Carbon Dioxide Information Analysis Center
(a) Determine and interpret the quartiles.
(b) Is the observation corresponding to Albania, 1.31, an outlier?
25. Fraud Detection As part of its “Customers First” program, a cellular phone company monitors monthly phone usage. The program identifies unusual use and alerts the customer that their phone may have been used by another person. The data below represent the monthly phone use in minutes of a customer enrolled in this program for the past 20 months. The phone company decides to use the upper fence as the cutoff point for the number of minutes at which the customer should be contacted. What is the cutoff point?
346   345   489   358   471
442   466   505   466   372
442   461   515   549   437
480   490   429   470   516
26. Stolen Credit Card A credit card company has a fraud-detection service that determines if a card has any unusual activity. The company maintains a database of daily charges on a customer’s credit card. Days when the card was inactive are excluded from the database. If a day’s worth of charges appears unusual, the customer is contacted to make sure that the credit card has not been compromised. Use the following daily charges (rounded to the nearest dollar) to determine the amount the daily charges must exceed before the customer is contacted.
143   166   113   188   133
 90    89    98    95   112
111    79    46    20   112
 70   174    68   101   212
27. Student Survey of Income A survey of 50 randomly selected full-time Joliet Junior College students was conducted during the Fall 2015 semester. In the survey, the students were asked to disclose their weekly income from employment. If the student did not work, $0 was entered.
0        262   0     635   0     0     671
244      521   476   100   650   454   95
12,777   567   310   527   0     67    736
83       159   0     547   188   389   300
719      0     367   316   0     0     181
479      0     82    579   289
375      347   331   281   628
0        203   149   0     403
(a) Check the data set for outliers.
(b) Draw a histogram of the data and label the outliers on the histogram.
(c) Provide an explanation for the outliers.
28. Student Survey of Entertainment Spending A survey of 40 randomly selected full-time Joliet Junior College students was conducted in the Fall 2015 semester. In the survey, the students were asked to disclose their weekly spending on entertainment. The results of the survey are as follows:
21    54   64   33    65     32   21   16
22    39   67   54    22     51   26   14
115   7    80   59    20     33   13   36
36    10   12   101   1000   26   38   8
28    28   75   50    27     35   9    48
(a) Check the data set for outliers.
(b) Draw a histogram of the data and label the outliers on the histogram.
(c) Provide an explanation for the outliers.
29. Pulse Rate Use the results of Problem 21 in Section 3.1 and Problem 19 in Section 3.2 to compute the z-scores for all the students. Compute the mean and standard deviation of these z-scores.
30. Travel Time Use the results of Problem 22 in Section 3.1 and Problem 20 in Section 3.2 to compute the z-scores for all the students. Compute the mean and standard deviation of these z-scores.
31. Fraud Detection Revisited Use the fraud-detection data from Problem 25 to do the following.
(a) Determine the standard deviation and interquartile range of the data.
(b) Suppose the month in which the customer used 346 minutes was not actually that customer’s phone. That particular month, the customer did not use her phone at all, so 0 minutes were used. How does changing the observation from 346 to 0 affect the standard deviation and interquartile range? What property does this illustrate?
Explaining the Concepts
32. Write a paragraph that explains the meaning of percentiles.
33. Suppose you received the highest score on an exam. Your friend scored the second-highest score, yet you both were in the 99th percentile. How can this be?
34. Morningstar is a mutual fund rating agency. It ranks a fund’s performance by using one to five stars. A one-star mutual fund is in the bottom 10% of its investment class; a five-star mutual fund is at the 90th percentile of its investment class. Interpret the meaning of a five-star mutual fund.
35. When outliers are discovered, should they always be removed from the data set before further analysis?
36. Mensa is an organization designed for people of high intelligence. One qualifies for Mensa if one’s intelligence is measured at or above the 98th percentile. Explain what this means.
37. Explain the advantage of using z-scores to compare observations from two different data sets.
38. Explain the circumstances for which the interquartile range is the preferred measure of dispersion. What is an advantage that the standard deviation has over the interquartile range?
39. Explain what each quartile represents.