BES - R Lab 9
Regression models
1. Objectives
- Develop an estimated simple regression equation
- Check the model assumptions
- Verify the significance of the model
- Calculate and interpret the coefficient of determination
- Construct confidence and prediction intervals for simple linear regression
- Develop an estimated multiple linear regression equation
2. Exercises
Firstly, remember to set your own working directory.
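As a minimal sketch of that step, the working directory can be checked with getwd() and changed with setwd(). The folder used here is a hypothetical placeholder created under tempdir() so the snippet runs anywhere; replace it with the folder that holds your lab files.

```r
# Hypothetical placeholder folder; substitute the folder containing your CSV files.
lab_dir <- file.path(tempdir(), "lab9")
dir.create(lab_dir, showWarnings = FALSE)

old_dir <- setwd(lab_dir)   # setwd() returns the previous directory invisibly
current <- getwd()          # confirm where R will now look for files
print(current)
print(list.files())         # files visible to read.table() in this directory

setwd(old_dir)              # restore the earlier working directory
```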
2.a Simple linear regression
Exercise 1. Health experts recommend that runners drink 4 ounces of water every 15 minutes they
run. Although handheld bottles work well for many types of runs, all-day cross-country runs require
hip-mounted or over-the-shoulder hydration systems. In addition to carrying more water,
hip-mounted or over-the-shoulder hydration systems offer more storage space for food and extra
clothing. As the capacity increases, however, the weight and cost of these larger-capacity systems
also increase. The data showing the weight (ounces) and the price for 26 hip-mounted or
over-the-shoulder hydration systems are stored in Hydration.csv.
First, import the Hydration data into R and assign it to a data frame named Hydration.
Run the following code:
> Hydration <- read.table("Hydration.csv", header = TRUE, sep = ",",
+                         quote = "\"", stringsAsFactors = FALSE)
> head(Hydration)
> str(Hydration)
You can develop a scatter diagram for this set of data with weight as the independent variable to
visualize the relationship between the two variables. The code to produce a scatter diagram is as
follows.
> plot(Hydration$Weight, Hydration$Price, xlab = "Weight (ounces)",
+      ylab = "Price ($)",
+      main = "Relationship between Price and Weight of hydration systems")
a. What does the scatter diagram indicate about the relationship between the two
variables?
The estimated regression equation can be obtained with the following code, which also redraws the
scatter diagram and adds the fitted line.
> reg.ex1 <- lm(Price ~ Weight, data = Hydration)
> summary(reg.ex1)
> with(Hydration, plot(Weight, Price))
> abline(reg.ex1)
Residuals:
    Min      1Q  Median      3Q     Max
-21.656  -5.335  -0.884   3.588  16.840

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.9785     3.3800   1.473    0.154
Weight        2.9370     0.2934  10.011 4.81e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
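The quantities printed by summary() can also be pulled out of the fitted object directly, which helps when answering the questions that follow. The sketch below uses simulated stand-in data (the data frame sim and its values are assumptions, not the contents of Hydration.csv) and checks the coefficient of determination against its definition, 1 - SSE/SST.

```r
# Simulated stand-in data (NOT the Hydration.csv values); names mirror the lab.
set.seed(1)
sim <- data.frame(Weight = runif(26, min = 2, max = 20))
sim$Price <- 5 + 3 * sim$Weight + rnorm(26, sd = 8)

fit <- lm(Price ~ Weight, data = sim)

b          <- coef(fit)               # b0 = (Intercept), b1 = Weight
coef_table <- coef(summary(fit))      # estimates, std. errors, t values, p-values
r2         <- summary(fit)$r.squared  # coefficient of determination

# R-squared from its definition: 1 - SSE / SST
sse <- sum(residuals(fit)^2)
sst <- sum((sim$Price - mean(sim$Price))^2)
r2_manual <- 1 - sse / sst

print(b)
print(coef_table)
print(c(r2 = r2, r2_manual = r2_manual))
```

With the real data, replacing fit by reg.ex1 gives the same quantities for the Hydration model.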
b. Develop an estimated regression equation that could be used to predict the price of a
hydration system given its weight.
c. What are the values of b0 and b1? How can we interpret these values?
d. Is there a significant relationship between price and weight? Use α = 0.05.
e. How well does the straight line approximate the relationship between weight and price?
What is the coefficient of determination? What does it say about the explanatory power of the
model?
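As a check on how the significance test in part d reads off the output, the t value reported for Weight is the slope estimate divided by its standard error, with n - 2 degrees of freedom (n = 26 systems). The figures are taken from the Coefficients table; the small difference from the printed 10.011 comes from rounding in the displayed values.

```latex
t = \frac{b_1}{s_{b_1}} = \frac{2.9370}{0.2934} \approx 10.01,
\qquad \mathrm{df} = n - 2 = 26 - 2 = 24,
\qquad p\text{-value} = 4.81 \times 10^{-10} < \alpha = 0.05
```

Because the p-value is below α, the null hypothesis that the slope is zero is rejected at the 5% level.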
Assume that the estimated regression equation developed above will also apply to hydration systems
produced by other companies.
f. Develop a 95% prediction interval estimate of the price for the Back Draft system
produced by Eastern Mountain Sports that weighs 10 ounces.
To answer this question, it is necessary to produce a prediction interval with the following code.
> predict(reg.ex1, data.frame(Weight = 10), interval = "prediction")
       fit      lwr      upr
1 34.34858 16.56123 52.13593
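The fit column can be verified by hand from the coefficients reported earlier, and the interval is symmetric about that fitted value. The arithmetic below uses only numbers printed in this exercise; the tiny gap between 34.3485 and 34.34858 is due to rounding of the displayed coefficients.

```latex
\hat{y} = b_0 + b_1 x = 4.9785 + 2.9370 \times 10 = 34.3485 \approx 34.34858

\text{midpoint} = \frac{16.56123 + 52.13593}{2} = 34.34858,
\qquad
\text{half-width} = \frac{52.13593 - 16.56123}{2} = 17.78735
```

In words, the predicted price of a single 10-ounce system is about $34.35, give or take roughly $17.79 at the 95% level.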
If you instead want a 95% confidence interval estimate of the mean price for all hydration
systems that weigh 10 ounces, use the following code:
> predict(reg.ex1, data.frame(Weight = 10), interval = "confidence")
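To see how the two interval types differ, the sketch below computes both at the same weight on simulated stand-in data (the data frame sim is an assumption, not the Hydration.csv values). Both intervals share the same fitted value, but the prediction interval is wider because it also accounts for the scatter of an individual price around the mean.

```r
# Simulated stand-in data (NOT the Hydration.csv values).
set.seed(2)
sim <- data.frame(Weight = runif(26, min = 2, max = 20))
sim$Price <- 5 + 3 * sim$Weight + rnorm(26, sd = 8)
fit <- lm(Price ~ Weight, data = sim)

new_obs <- data.frame(Weight = 10)

# Interval for the price of ONE system weighing 10 ounces
pi95 <- predict(fit, new_obs, interval = "prediction", level = 0.95)
# Interval for the MEAN price of all systems weighing 10 ounces
ci95 <- predict(fit, new_obs, interval = "confidence", level = 0.95)

pi_width <- pi95[, "upr"] - pi95[, "lwr"]
ci_width <- ci95[, "upr"] - ci95[, "lwr"]

print(rbind(prediction = pi95[1, ], confidence = ci95[1, ]))
print(c(pi_width = pi_width, ci_width = ci_width))
```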
The coefficient of correlation is used to measure the strength of a linear association between two
variables. We can test the coefficient of correlation to determine if a linear relationship exists with
the following code.
> cor.test(Hydration$Weight, Hydration$Price, alternative = "two.sided",
+          method = "pearson", conf.level = 0.95)
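In simple linear regression this correlation test and the slope test are two views of the same result: the squared correlation equals the coefficient of determination, and the t statistic from cor.test() equals the t value of the slope. The sketch below illustrates this on simulated stand-in data (sim is an assumption, not the Hydration.csv values).

```r
# Simulated stand-in data (NOT the Hydration.csv values).
set.seed(3)
sim <- data.frame(Weight = runif(26, min = 2, max = 20))
sim$Price <- 5 + 3 * sim$Weight + rnorm(26, sd = 8)

ct  <- cor.test(sim$Weight, sim$Price, alternative = "two.sided",
                method = "pearson", conf.level = 0.95)
fit <- lm(Price ~ Weight, data = sim)

r       <- unname(ct$estimate)                      # sample correlation coefficient
t_cor   <- unname(ct$statistic)                     # t statistic of the correlation test
t_slope <- coef(summary(fit))["Weight", "t value"]  # t value of the slope
r2      <- summary(fit)$r.squared                   # coefficient of determination

print(c(r_squared = r^2, R2 = r2))
print(c(t_cor = t_cor, t_slope = t_slope))
```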
Exercise 2. Consumer Reports provided extensive testing and ratings for more than 100 HDTVs.
An overall score, based primarily on picture quality, was developed for each model. In general, a
higher overall score indicates better performance. The data showing the price and overall score for
the ten 42-inch plasma televisions are stored in Plasma.csv.
a) Use the data to develop an estimated regression equation that could be used to estimate the
overall score for a 42-inch plasma television given the price.
b) What is the value of R²? Did the estimated regression equation provide a good fit?
c) Test for a significant relationship between price and overall score. What is your conclusion?
Use α = 0.05.
d) Predict the overall score for an individual 42-inch plasma television with a price of $3,200.
e) Estimate the mean overall score for all 42-inch plasma televisions with a price of $3,200.
The output is given below.
Call:
lm(formula = Score ~ Price, data = Plasma)

Residuals:
    Min      1Q  Median      3Q     Max
-11.108  -5.417   0.836   4.468  14.431

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.01749   14.90958   0.806   0.4435
Price        0.01270    0.00496   2.560   0.0337 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Prediction interval
       fit      lwr      upr
1 52.64723 32.58724 72.70722
Confidence interval
       fit      lwr      upr
1 52.64723 46.05691 59.23755
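The two intervals can be sanity-checked against the coefficient table. The arithmetic below uses only the printed values; the hand-computed fit differs slightly from 52.64723 because the displayed slope is rounded to 0.01270.

```latex
\hat{y} = 12.01749 + 0.01270 \times 3200 = 52.65749 \approx 52.64723

\text{prediction half-width} = \frac{72.70722 - 32.58724}{2} = 20.05999,
\qquad
\text{confidence half-width} = \frac{59.23755 - 46.05691}{2} = 6.59032
```

Both intervals are centred on the same fitted score, but the prediction interval for one television is about three times as wide as the confidence interval for the mean score.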
Residuals:
    Min      1Q  Median      3Q     Max
-2.1789 -0.4463  0.0362  0.4422  1.9331

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.984658   0.080137  186.99   <2e-16 ***
biking      -0.200133   0.001366 -146.53   <2e-16 ***
smoking      0.178334   0.003539   50.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
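Reading the estimates off this table gives the estimated multiple regression equation. The response is written here as y because its name does not appear in the printed output; from the summary sentence in this section it is the frequency of heart disease.

```latex
\hat{y} = 14.984658 - 0.200133\,\text{biking} + 0.178334\,\text{smoking}
```

Each slope is interpreted with the other predictor held fixed: a one-unit increase in biking is associated with a decrease of about 0.200 in the predicted response, and a one-unit increase in smoking with an increase of about 0.178.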
In our survey of 500 towns, we found significant relationships between the frequency of biking to
work and the frequency of heart disease, and between the frequency of smoking and the frequency
of heart disease (p < 0.001 for each).
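A multiple regression of this form is fitted by adding predictors to the right-hand side of the lm() formula. The sketch below is a hedged illustration on simulated data: the data frame sim, the column name heart.disease, the value ranges, and the generating coefficients are all assumptions chosen to resemble the output above, not the survey data.

```r
# Simulated stand-in data for 500 towns (NOT the survey values).
set.seed(4)
n <- 500
sim <- data.frame(biking  = runif(n, min = 1, max = 75),
                  smoking = runif(n, min = 0.5, max = 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking +
  rnorm(n, sd = 0.65)

# Two predictors are combined with "+" in the model formula
fit <- lm(heart.disease ~ biking + smoking, data = sim)
print(summary(fit))

# Confidence interval for the mean response at chosen predictor values
new_town <- data.frame(biking = 30, smoking = 10)
est <- predict(fit, new_town, interval = "confidence")
print(est)
```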