2010 Midterm Solutions
Answer: 1-5: CABBA 6-10: BCADB 11-12: CC

1. (3 points) The median of a sample with data values of 10, 20, 12, 17, 16 is
(a) 12 (b) 15 (c) 16 (d) 17
C. You need to order the sequence first: 10, 12, 16, 17, 20. The middle value is 16.

2. (3 points) The units of the residuals are the same as
(a) Units of the response variable (b) Units of the independent variable (c) Residuals have no units (d) Residuals have standard units
A. Notice that \(e = y - \hat{y}\), so a residual is measured in the units of the response.

3. (3 points) (T/F) If all of the bars in a bar chart have the same length, then the categorical variable shown in the bar chart has no variability.
(a) True (b) False
B. Only when all values of a variable lie in the same category can we say that there is no variability.

4. (3 points) To determine association between two numerical variables, we can
(a) compare their marginal distributions (b) look for patterns in the scatterplot (c) check if the cells of a mosaic plot align (d) all of the above
B. A mosaic plot applies to categorical variables, and marginal distributions say nothing about association.
5. (3 points) (T/F) The least squares line always goes through the point \((\bar{x}, \bar{y})\)
(a) True (b) False
A. Notice that \((\bar{x}, \bar{y})\) satisfies the equation \(\hat{y} = r \frac{s_y}{s_x}(x - \bar{x}) + \bar{y}\).

6. (3 points) (T/F) If a distribution is bell shaped, then about 5% of the z-scores are larger than 1 or less than -1.
(a) True (b) False
B. About 2/3 of the z-scores lie between -1 and 1, so about 1/3 (roughly 32%) are larger than 1 or less than -1.

7. (3 points) Suppose X and Y are two categorical variables. The best way to determine if there is a relationship between X and Y is to
(a) find the correlation between X and Y (b) make a scatterplot (c) make a contingency table (d) do all of the above
C. Correlation and scatterplots apply to numerical variables.

8. (3 points) Birth weights at a local hospital have a mean of 110 oz and an SD of 15 oz. The distribution is bell-shaped. Approximately what proportion of infants have a birthweight under 80 oz?
(a) .025 (b) .05 (c) .16 (d) .34 (e) .5
A. 80 oz is 2 SDs below the mean. About 5% of the data lie more than 2 SDs from the mean (below or above), so the proportion is 5%/2 = 0.025.

9. (3 points) A chain of supermarkets observes that weekly sales of beer is positively correlated with weekly sales of ice cream. Based on this information, which of these conclusions would be valid?
(a) People who drink beer tend to eat more ice cream
(b) Eating ice cream causes a thirst for beer
(c) A scatterplot of weekly sales of beer versus weekly sales of ice cream would show that a straight line fits the data very well
(d) None of the above.
D. The first choice is not valid because the data describe the sales of beer and ice cream, not the consumption of beer and ice cream by individuals. The second choice is not valid because correlation does not imply causation. The third choice is not valid because a positive correlation need not be a strong one: each point on the scatterplot could be far from the line (i.e. the RMSE is large and \(r^2\) is low). It is quite possible that r = 0.2, which is still positive, and there may also be non-linear associations between the two variables. Therefore, the correct choice is D.

10. (3 points) Alice and Bob compute the correlation between daily maximum temperatures for Philadelphia and Boston. Alice does it for June 2009; Bob does it for all of 2009.
(a) Alice gets larger correlation than Bob (b) Bob gets larger correlation than Alice (c) Bob's correlation will be negative (d) None of the above
B. This is almost the identical question from midterm 2007, question 17. Bob considers all the data, including the winter months, while Alice considers only one summer month. Bob gets the larger correlation because the line of best fit on his scatterplot is anchored by both the winter and the summer temperatures (i.e. the lower and upper values in the scatterplot), whereas Alice's data lie only in the upper range of the scatterplot. Therefore, B is the correct answer.

11. (3 points) If the distribution of data is bell-shaped, the interquartile range will contain
(a) about the same amount of data as in \([\bar{x} - s, \bar{x} + s]\)
(b) more data than the range \([\bar{x} - s, \bar{x} + s]\)
(c) less data than the range \([\bar{x} - s, \bar{x} + s]\)
(d) cannot determine from the information provided
C. The interquartile range contains the middle 50% of the data. By the empirical rule, a bell-shaped distribution has about 68% of the data within \([\bar{x} - s, \bar{x} + s]\), which is the middle 68% of the data since \(\bar{x}\) is the center of the distribution. Hence C is the correct answer.

12. (4 points) Consider a data table containing CEO salaries. On the log scale (base 10), the interquartile range of these salaries is 2 log-dollars. On the dollar scale, the 25th percentile is $200,000. What is the 75th percentile of these salaries on the dollar scale?
(a) $20,000 (b) $1,000,000 (c) $20,000,000 (d) $10^2 (e) cannot be determined from the given information
C. The interquartile range on the log scale is 2 log-dollars. Mathematically, this means that if X is the 25th percentile on the original scale (X = 200,000) and Y is the 75th percentile, then \(\log_{10}(Y) - \log_{10}(X) = 2\). Solving for Y gives \(Y = 10^2 X = 20{,}000{,}000\).

13. (6 points) In his novel Bomber, Len Deighton argues that a World War II pilot had a 2% chance of being shot down on each mission. So in 50 missions he is "mathematically certain" to be shot down: 50 × 2% = 100%. Do you agree? Explain using mathematics.
I don't agree (1 point). Let \(A_i\) = {he was shot down on the i-th mission}. Because the \(A_i\) are not disjoint, you can't simply add their probabilities to get the probability of the event A = {he was shot down at least once}. The right formula, treating the missions as independent, is P(shot down) = 1 - P(not shot down) = \(1 - 0.98^{50} = 0.6358303\) (5 points).
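The complement calculation above can be checked numerically. The sketch below is only an illustration: it assumes, as the solution does, that the missions are independent with a 2% chance of being shot down on each, and the function name is hypothetical.

```python
def prob_shot_down_at_least_once(p_per_mission: float, n_missions: int) -> float:
    """P(shot down at least once) = 1 - P(survives every mission).

    Assumes the missions are independent with the same per-mission probability.
    """
    return 1.0 - (1.0 - p_per_mission) ** n_missions


# The naive argument adds the per-mission probabilities as if the events were disjoint.
naive_sum = 50 * 0.02

# The complement rule gives the actual probability under independence.
correct = prob_shot_down_at_least_once(0.02, 50)

print(f"naive sum:       {naive_sum:.2f}")
print(f"complement rule: {correct:.7f}")
```

Increasing `n_missions` pushes the result toward 1 without ever reaching it, which is why no finite number of missions makes being shot down certain.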
Common mistakes. The most common mistake was not computing the actual probability. The second most common mistake was computing the probability that the pilot gets shot down on the 50th mission, which is \(0.98^{49} \times 0.02\). However, this is the probability of surviving the first 49 missions and being shot down on the 50th. The question asks for the probability of getting shot down at some point during the entire 50 missions. That is, students had to take into account that the pilot can be shot down on the 39th mission or the 27th mission and so forth. The third most common mistake was using the LLN as evidence. The LLN only guarantees convergence of long-run frequencies. It says nothing about whether the pilot is "mathematically certain" to be shot down within 50 missions.

Point distribution
1. Lost six points if the student agreed with the statement.
2. Lost five points if students disagreed with the statement, but no mathematical attempt was made to explain their reasoning. Students also lost five points if they did not compute \(1 - 0.98^{50}\) or \(0.98^{50}\), computed the wrong probabilities, or did not state that the events were not disjoint.
3. Lost three points if students disagreed with the statement, stated explicitly that the events were not disjoint, or wrote the desired probability expressions (e.g. \(1 - 0.98^{50}\)), but confused \(0.98^{50}\) with the probability of getting shot down or \(1 - 0.98^{50}\) with the probability of not getting shot down, or made computational errors.
4. Lost one to two points if students disagreed with the statement, stated explicitly that the events were not disjoint, and computed the desired probabilities (e.g. \(1 - 0.98^{50}\)) correctly, but explained with some error how this contradicts the statement.
5. Lost no points if students disagreed with the statement and computed the correct probabilities (e.g. \(1 - 0.98^{50}\)).

14. (8 points) A club has 100 members. Among them there are 40 lawyers and 50 liars. The number of members who are neither lawyers nor liars is 20.
A club president is chosen from the 100 members at random.
(a) (2 points) What is the probability that the club president is a lawyer?
Define the events: La = {president is a lawyer}, Li = {president is a liar}.
\(P(La) = \frac{\#\text{lawyers}}{\#\text{total people}} = \frac{40}{100} = 0.4\)
Common mistakes. The common mistake was computing the probability of picking a club president who is only a lawyer, which is \(\frac{30}{100}\).
Point distribution
1. Lost two points if students computed the wrong probability.
2. Lost no points if students computed the correct probability.

(b) (3 points) What is the probability that the club president is a lawyer and a liar?
According to the given numbers, we know: \(P(La) = 0.4\), \(P(Li) = 0.5\), \(P((La \cup Li)^C) = 0.2\).
So \(P(La \cup Li) = 1 - P((La \cup Li)^C) = 0.8\), and
\(P(La \cap Li) = P(La) + P(Li) - P(La \cup Li) = 0.1\)
Common mistakes. The most common mistake was assuming \(P(La \cap Li) = P(La)P(Li)\). This is not true because some of the members could be both lawyers and liars, and the events La and Li do not have to be independent.
Point distribution
(a) Lost three points if students computed the wrong probability or assumed independence, either by multiplying 0.4 × 0.5 to get 0.2 or by indicating in some other manner that the events are independent. Three points were also deducted if no attempt was made.
(b) Lost two points if students did not assume independence, but got \(\frac{10}{80}\) instead of \(\frac{10}{100}\).
(c) Lost no points if students computed the correct probability.

(c) (3 points) If you know that the president is a lawyer, what is the probability that he is also a liar?
\(P(Li \mid La) = \frac{P(Li, La)}{P(La)} = \frac{0.1}{0.4} = 0.25\)
Common mistakes. The most common mistake was using the wrong probabilities for the numerator.
Point distribution
1. Lost three points if students computed the wrong probability.
2. Lost one point if students had the right probability formula, but carried over an incorrect value from (b).
3. Lost no points if students computed the correct probability.
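The three club-president probabilities can be reproduced from the raw counts. The following is a minimal sketch using exact fractions; the variable names are illustrative and not part of the exam.

```python
from fractions import Fraction

total = 100
lawyers = 40
liars = 50
neither = 20

# Members who are a lawyer or a liar (or both): the complement of "neither".
either = total - neither

# Inclusion-exclusion: |La and Li| = |La| + |Li| - |La or Li|.
both = lawyers + liars - either

p_lawyer = Fraction(lawyers, total)       # part (a)
p_both = Fraction(both, total)            # part (b)
p_liar_given_lawyer = p_both / p_lawyer   # part (c): P(Li | La) = P(Li, La) / P(La)

print(p_lawyer, p_both, p_liar_given_lawyer)
```

Working with counts makes it easy to see that the independence shortcut, 0.4 × 0.5 = 0.2, would correspond to 20 members in both groups rather than the 10 implied by the data.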
15. (5 points) A doctor is in the habit of measuring blood pressures twice. He notices that patients who are unusually high on the first reading tend to have somewhat lower second readings. He concludes that patients are more relaxed on the second reading. A colleague disagrees, pointing out that the patients who are unusually low on the first reading tend to have somewhat higher second readings, suggesting they get more nervous. Which doctor is right or wrong? Explain briefly.

Both of them are wrong (1 point). This is regression to the mean: a reading that is unusually high (or unusually low) is partly due to chance variation, so the next reading will probably be closer to the mean, whether or not the patient relaxes or gets nervous (4 points).

Common mistakes. The most common mistake was justifying the statements using lurking variables, Simpson's paradox, or dependence/independence. We don't know whether any of these phenomena are present. The only thing we can be certain of from the question is that the readings are moving toward the mean (i.e. regressing toward the mean).

Point distribution
(a) Lost five points if students did not indicate that both doctors were wrong and the explanations made no sense or were not reasonable.
(b) Lost four points if students did not indicate that both doctors were wrong and used worse reasoning than the -3 point explanations.
(c) Lost three points if students indicated that both doctors were wrong, but used lurking variables, dependence/independence, Simpson's paradox, or the like as their explanations.
(d) Lost two points if the explanations were better than the -3 point explanations, but not as good as the -1 point explanations.
(e) Lost one point if students did not indicate that both doctors were wrong, but referenced regression to the mean.
(f) Lost no points if students indicated that both doctors were wrong, clearly stated regression to the mean (i.e. literally wrote out "regression to the mean"), and briefly explained why this phenomenon was the case.

16. (8 points) Your STAT101 professor is an avid forest mushroom picker. One test of whether a mushroom is edible or poisonous is whether it changes color when you cut it. Suppose a mushroom is picked at random. About 1/5 of all edible mushrooms will change their color. Out of non-edible mushrooms, only 1 in 33 will change color. Furthermore, 99 out of 100 mushrooms are not edible. If a randomly chosen mushroom changes color, what is the probability that it is edible?

Define the events: C = {a mushroom changes color}, E = {a mushroom is edible}. The question tells us: \(P(E^C) = 0.99\), \(P(C \mid E) = 0.2\), \(P(C \mid E^C) = 1/33\).
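\(P(E \mid C) = \frac{P(E, C)}{P(C)} = \frac{P(C \mid E)P(E)}{P(C \mid E)P(E) + P(C \mid E^C)P(E^C)} = \frac{(0.2)(0.01)}{(0.2)(0.01) + (1/33)(0.99)} = \frac{0.002}{0.032} = 0.0625 = \frac{1}{16}\)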
Common mistakes. There were a variety of mistakes, ranging from simple numerical mistakes to not using the law of total probability correctly. Also, a few students assumed \(P(E, C) = P(E)P(C)\), which is not true because E and C are dependent according to what is given in the question.

Point distribution
(a) Lost eight points if students wrote nonsensical statistical formulas or made no attempt to solve the problem.
(b) Lost seven points if students wrote something statistics-related, even if it bore no relevance to the problem.
(c) Lost six points if the student's initial setup had little chance of solving the problem or was not even close.
(d) Lost four points if the setup was attempted, but more than 1 or 2 numbers/expressions were wrong.
(e) Lost two to three points if the setup was attempted, but contained 1-2 numbers/expressions that were wrong. The most common was using \(\frac{32}{33}\) instead of the correct \(\frac{1}{33}\).
(f) Lost one point if the setup was correct, but a computational error prevented the correct answer.
(g) Lost no points if the answer was 0.0625, 0.0631, or \(\frac{1}{16}\).
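The mushroom calculation is an application of Bayes' rule with the law of total probability in the denominator. A minimal sketch with exact fractions follows; the variable names are illustrative.

```python
from fractions import Fraction

# Quantities given in the question.
p_edible = Fraction(1, 100)                   # 99 out of 100 mushrooms are not edible
p_change_given_edible = Fraction(1, 5)        # P(C | E)
p_change_given_not_edible = Fraction(1, 33)   # P(C | E^C)

# Law of total probability: P(C) = P(C|E)P(E) + P(C|E^C)P(E^C).
p_change = (p_change_given_edible * p_edible
            + p_change_given_not_edible * (1 - p_edible))

# Bayes' rule: P(E | C) = P(C|E)P(E) / P(C).
p_edible_given_change = p_change_given_edible * p_edible / p_change

print(p_change, p_edible_given_change, float(p_edible_given_change))
```

Swapping in 32/33 for `p_change_given_not_edible`, the most common slip noted in the rubric, changes the denominator drastically, which is a quick way to see why that error cost points.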
17. (16 points) A casino expects to win 2/38 of a dollar (approximately 5 cents) for every dollar you bet on a roulette. This is true for both bet-on-Red and bet-on-a-number schemes. Consider a "Wharton Special" deal, where you win if any of 00, 0, 1, 2, 3, 4 comes out. You pay $1 to play the roulette and you lose this dollar if you do not win the round. There are 38 numbers on the roulette.

(a) (5 points) If the casino wants to keep its expected winnings at 2/38 of a dollar, how much should it pay you for winning the "Wharton Special"?
Let X represent the payment from the casino. Then E(Winning for Casino) = \(-X \cdot P(\text{Casino Loses}) + 1 \cdot P(\text{Casino Wins}) = -X \frac{6}{38} + \frac{32}{38} = \frac{2}{38}\) (2 points). Solving the equation for X gives 5. An answer of 6 is also accepted, since 5 + 1 = 6 assumes that the casino still retains the dollar the player gives to it (3 points).
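The payout equation can be solved in closed form for any target house edge. The sketch below assumes the same setup as the solution (a $1 stake that the casino keeps on a loss, and a payout X on a win); the function name is hypothetical.

```python
from fractions import Fraction


def payout_for_house_edge(p_player_wins: Fraction, edge: Fraction) -> Fraction:
    """Solve -X * p + 1 * (1 - p) = edge for the payout X on a $1 bet."""
    return ((1 - p_player_wins) - edge) / p_player_wins


# Six winning pockets (00, 0, 1, 2, 3, 4) out of 38.
p_win = Fraction(6, 38)

# Payout that keeps the casino's expected winnings at 2/38 of a dollar.
payout = payout_for_house_edge(p_win, Fraction(2, 38))

# Check by plugging the payout back into the casino's expected winnings.
casino_expected = -payout * p_win + 1 * (1 - p_win)

print(payout, casino_expected)
```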
Common mistakes. Many students switched the signs of the above expression: instead of \(-X \frac{6}{38} + \frac{32}{38}\), they wrote \(X \frac{6}{38} - \frac{32}{38}\). Following this logic, \(X = \frac{17}{3}\). It seemed that these students were attempting to compute the expected winnings of the gambler.
Point distribution
(a) Lost five points if no attempt was made or nonsensical answers were written.
(b) Lost two points if the setup was \(X \frac{6}{38} - \frac{32}{38}\).
(c) Lost no points if the setup was correct and they got the correct answer(s).

(b) (3 points) What is the standard deviation of your net winnings?
Using the formula for standard deviation on the casino's winnings (-5 with probability 6/38, +1 with probability 32/38, mean 2/38): \(\sqrt{\left(-5 - \frac{2}{38}\right)^2 \frac{6}{38} + \left(1 - \frac{2}{38}\right)^2 \frac{32}{38}} = 2.187854\). The gambler's net winnings (+5 or -1, mean -2/38) have the same standard deviation. Answers of 2.19 and 2.18 were accepted.

Common mistakes. The most common mistake was using the wrong values from (a) or making computational errors.
Point distribution
(a) Lost three points if no attempt was made or nonsensical answers were written.
(b) Lost two points if the answer was completely wrong, but the correct base formula was used.
(c) Lost one point if the correct formula was used, but computational errors were made. Lost one point if the student used the correct formula, but used the wrong value from (a). The answer should have been 2.43 in that case.
(d) Lost no points if the formula was correct and they got the correct answer.

(c) (3 points) Suppose you play "Wharton Special" 2 times. What is the variance of your total winnings?
Suppose \(X_1\) and \(X_2\) represent the winnings. Since each game is independent and each game has the same set of probabilities (i.e. i.i.d.), \(Var(X_1 + X_2) = Var(X_1) + Var(X_2) = 2\,Var(X_1) = 2(4.786704) = 9.573407\).
Common mistakes. The most common mistake was not recognizing that the variables \(X_1\) and \(X_2\) are i.i.d. Being i.i.d. does not imply \(X_1 = X_2\); that would be equivalent to saying that you win the exact same amount on your second shot at the "Wharton Special".
Point distribution
(a) Lost three points if no attempt was made or nonsensical answers were written. Students also lost three points if they used \(Var(2X) = 4\,Var(X)\).
(b) Lost one point if the correct formula was used, but computational errors were made. Lost one point if the student used the correct formula, but used the wrong value from (a).
(c) Lost no points if the formula was correct and students obtained the correct answer.

(d) (5 points) How much should the casino pay for your win in order for the game to be fair?
Let X represent the payment from the casino. Then E(Winning for Casino) = \(-X \cdot P(\text{Casino Loses}) + 1 \cdot P(\text{Casino Wins}) = -X \frac{6}{38} + \frac{32}{38} = 0\), since it is a fair game. Solving the equation for X gives \(\frac{16}{3}\). An answer of \(\frac{19}{3}\) is also accepted, since \(\frac{16}{3} + 1 = \frac{19}{3}\) assumes that the casino still retains the dollar the player gives to it.
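The mean, standard deviation, and two-play variance of the net winnings can be computed directly from the definition of expectation. This is a sketch from the gambler's perspective, assuming net winnings of +payout on a win and -1 on a loss; the helper name is hypothetical.

```python
from fractions import Fraction
import math


def mean_and_variance(outcomes):
    """Return (mean, variance) for a list of (value, probability) pairs."""
    mean = sum(value * prob for value, prob in outcomes)
    variance = sum((value - mean) ** 2 * prob for value, prob in outcomes)
    return mean, variance


p_win = Fraction(6, 38)
p_lose = 1 - p_win

# One play with a $5 payout: +5 on a win, -1 on a loss.
mean_5, var_5 = mean_and_variance([(Fraction(5), p_win), (Fraction(-1), p_lose)])
sd_5 = math.sqrt(var_5)

# Two independent plays: variances add, so Var(X1 + X2) = 2 * Var(X1).
var_two_plays = 2 * var_5

# With a payout of 16/3 the expected net winnings are zero, i.e. the game is fair.
mean_fair, _ = mean_and_variance([(Fraction(16, 3), p_win), (Fraction(-1), p_lose)])

print(mean_5, float(var_5), sd_5, float(var_two_plays), mean_fair)
```

Negating every outcome (the casino's perspective) flips the sign of the mean but leaves the variance unchanged, so the standard deviation is the same from either side of the table.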
Common mistakes. The most common mistakes were the same sign error as in (a) and not recognizing that, in order for the game to be fair, the expectation had to be zero.
Point distribution
(a) Lost five points if students made no attempt or provided nonsensical answers.
(b) Lost four points if students used the correct expectation formula, but got everything else incorrect.
(c) Lost three points if students used the correct expectation formula and recognized that, in order to be a fair game, the expectation had to be zero, but got everything else wrong.
(d) Lost two points if students used the correct formula and recognized what it means to be fair (i.e. E(Winning) = 0), but made computational errors.
(e) Lost one point if students used the correct formula, recognized what it means to be fair (i.e. E(Winning) = 0), and had the proper algebraic formula, but used wrong values carried over from previous questions (a, b, or c).
(f) Lost no points if the formula was correct, the student recognized what fair meant in terms of expectation, and obtained the correct answer.

18. (20 points) Wisconsin is an important milk-producing state. Some people might argue that because of transportation costs, the cost of milk increases with the distance of markets from Madison (Wisconsin). Suppose the milk prices in 15 randomly chosen places are recorded and the data are shown below. The correlation is found to be 0.7688.
[Figure: Bivariate Fit of Cost of Milk (per gallon) By Distance from Madison (in 1000 miles). Scatterplot with Distance from Madison (in 1000 miles) on the horizontal axis, from 0.2 to 1.4, and Cost of Milk (per gallon) on the vertical axis, from 2.2 to 2.6.]
Cost of Milk (per gallon)

Quantiles                      Moments
100.0%  maximum   2.64         Mean             2.436
 99.5%            2.64         Std Dev          0.1238548
 97.5%            2.64         Std Err Mean     0.0319792
 90.0%            2.616        Upper 95% Mean   2.5045885
 75.0%  quartile  2.55         Lower 95% Mean   2.3674115
 50.0%  median    2.4          N                15
 25.0%  quartile  2.36

Distance from Madison (in 1000 miles)

Quantiles                      Moments
100.0%  maximum   1.346        Mean             0.8123333
 99.5%            1.346        Std Dev          0.3730262
 97.5%            1.346        Std Err Mean     0.0963149
 90.0%            1.3184       Upper 95% Mean   1.0189083
 75.0%  quartile  1.2          Lower 95% Mean   0.6057583
 50.0%  median    0.865        N                15
 25.0%  quartile  0.5
(a) (2 points) Does the high correlation of 0.7688 imply that the increasing cost of milk is due to transportation costs? Explain briefly.
No. Correlation does not imply causation: the correlation may be spurious or due to a lurking variable.
Common mistakes. Few people made any errors on this question.
Point distribution
(a) Lost two points if no attempt to solve the question was made or if the explanation made no sense.
(b) Lost one point if the explanation was statistical, but had no relation to spurious correlation or a lurking variable.
(c) Lost no points if the explanation was correct.

(b) (4 points) Calculate the regression line for predicting the cost of milk by distance from Madison. Show all work.
Let X = distance from Madison, Y = cost of milk. Then \(b = r \frac{sd_y}{sd_x} = 0.7688 \cdot \frac{0.1238548}{0.3730262} = 0.2552624\) $ per gallon per 1000 miles. And \(a = \bar{y} - b\bar{x} = 2.436 - b(0.8123333) = 2.228642\) $ per gallon. Therefore, \(\hat{y} = 0.2552624(X) + 2.228642\).
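The slope and intercept depend only on the correlation and the summary statistics in the tables above. A minimal sketch, with a hypothetical helper name, is:

```python
def regression_line(r, mean_x, sd_x, mean_y, sd_y):
    """Least squares line for predicting y from x, from summary statistics.

    slope = r * sd_y / sd_x, intercept = mean_y - slope * mean_x.
    """
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# x = distance from Madison (in 1000 miles), y = cost of milk ($ per gallon).
slope, intercept = regression_line(
    r=0.7688,
    mean_x=0.8123333, sd_x=0.3730262,
    mean_y=2.436, sd_y=0.1238548,
)

print(f"slope = {slope:.7f} $ per gallon per 1000 miles")
print(f"intercept = {intercept:.6f} $ per gallon")
```

By construction the line passes through the point of means: `slope * mean_x + intercept` equals `mean_y`.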
Common mistakes. Almost all students got this one correct. A few made mistakes with the values of \(s_x\), \(s_y\), r, ...
Point distribution
(a) Lost 4 points if the regression equation was not provided.
(b) Lost 2 points if \(s_x\), \(s_y\), r, ... were wrong.
(c) Lost 1 point if the answer was wrong.

(c) (4 points) Histogram of the residuals for the above regression is bell-shaped. We expect roughly 10 out of the 15 milk prices in the dataset to be within . . . . . . of the values predicted by this regression line (fill in the blank and show all work).
\(RMSE = sd_y \sqrt{1 - r^2} = 0.1238548 \sqrt{1 - 0.7688^2} = 0.07920382\). Because \(\frac{10}{15} = 2/3 \approx 68\%\), the empirical rule tells us that about 10 of the 15 prices are within 0.07920382 dollars (one RMSE) of the regression line.
Common mistakes. Many people wrote \(RMSE = s_y \sqrt{1 + r^2}\) or \(RMSE = s_x \sqrt{1 - r^2}\). Another typical mistake was filling in \(s_y\), or just writing "RMSE" in the blank.
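The RMSE formula used in part (c) can be evaluated directly from the summary statistics. This short sketch uses a hypothetical function name and the values reported above.

```python
import math


def rmse_from_summary(sd_y: float, r: float) -> float:
    """RMSE of the regression of y on x: sd_y * sqrt(1 - r^2)."""
    return sd_y * math.sqrt(1 - r ** 2)


rmse = rmse_from_summary(sd_y=0.1238548, r=0.7688)

# Empirical rule: about 68% of residuals fall within one RMSE of the line,
# and 10 out of 15 prices is about two thirds of the data.
print(f"RMSE = {rmse:.8f} dollars per gallon")
```

Note that the formula uses the SD of the response (cost), not the SD of the predictor, and that a larger |r| always shrinks the RMSE.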
(d) (4 points) A person walks into a market 500 miles away from Madison. On average, what should she expect to pay for a 2-gallon carton of milk? Show all work.
\(\hat{y} = 0.2552624(X) + 2.228642\). 500 miles is the same as 0.5 of 1000 miles. Thus \(\hat{y} = 0.2552624(0.5) + 2.228642 = 2.356273\) $ per gallon. Since we want a 2-gallon carton of milk, we multiply 2.356273 by 2 to get 4.712546 $ for a 2-gallon carton of milk.
Common mistakes. Some students confused the units or ignored the fact that there are TWO gallons.
Point distribution
(a) Lost 2 points if miles were not converted to units of 1000 miles.
(b) Lost another 2 points if the 2-gallon fact was ignored.

(e) (4 points) A person walks into a market and observes that a gallon of milk costs $2.3. On average, how far do we expect him to be from Madison? Show all work.
Y = distance from Madison. X = cost of milk. Then \(b = r \frac{sd_y}{sd_x} = 0.7688 \cdot \frac{0.3730262}{0.1238548} = 2.315474\) 1000-miles per dollar. And \(a = \bar{y} - b\bar{x} = 0.8123333 - b(2.436) = -4.828161\) 1000-miles. Therefore, \(\hat{y} = 2.315474(X) - 4.828161\). Plugging in $2.30 as X, we get 0.4974289 of 1000 miles, or 497.4289 miles.
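To see why a new regression is needed, the sketch below compares the prediction from the distance-on-cost regression with the value obtained by algebraically inverting the cost-on-distance line. The helper name is hypothetical, and the inputs are the summary statistics reported above.

```python
def regression_line(r, mean_x, sd_x, mean_y, sd_y):
    """Slope and intercept for predicting y from x, from summary statistics."""
    slope = r * sd_y / sd_x
    return slope, mean_y - slope * mean_x


r = 0.7688
mean_cost, sd_cost = 2.436, 0.1238548    # $ per gallon
mean_dist, sd_dist = 0.8123333, 0.3730262  # 1000 miles

# Regression of distance on cost (the appropriate line for this question).
b_dist, a_dist = regression_line(r, mean_cost, sd_cost, mean_dist, sd_dist)
predicted_distance = b_dist * 2.30 + a_dist

# Common mistake: solving the cost-on-distance line for distance instead.
b_cost, a_cost = regression_line(r, mean_dist, sd_dist, mean_cost, sd_cost)
inverted_distance = (2.30 - a_cost) / b_cost

print(f"distance-on-cost regression: {predicted_distance * 1000:.1f} miles")
print(f"inverting the original line: {inverted_distance * 1000:.1f} miles")
```

The two numbers differ because the regression of Y on X and the regression of X on Y are different lines unless |r| = 1; inverting the original line overstates how far the cost deviation carries over to distance.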
Common mistakes. Some students just used the original regression equation to solve for the value instead of constructing a new regression equation.
Point distribution
(a) Lost 4 points if you tried to solve the original regression equation for distance.
(b) Lost 2 points if you reconstructed the regression but got a wrong value.

(f) (2 points) True or False: We cannot trust the regression method for this problem because the distributions of Cost of Milk and Miles from Madison are not bell-shaped.
False. We can run a regression on any two numerical variables, regardless of whether they are bell-shaped or not. Also, non-bell-shaped distributions can give well-fitting regression lines.
Common mistakes. Some students answered True. You did not need to give any explanation for your answer.
Point distribution
(a) Lost 2 points if you answered True.