BES - R Lab 9


BES – LAB 9

Regression models
1. Objectives
- Develop an estimated simple regression equation
- Check the model assumptions
- Verify the significance of the model
- Calculate and interpret the coefficient of determination
- Construct confidence and prediction intervals for simple linear regression
- Develop an estimated multiple linear regression equation
2. Exercises
Firstly, remember to set your own working directory.
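A minimal sketch of checking and changing the working directory with getwd() and setwd(). The temporary directory is only a stand-in so that the example runs anywhere; in the lab you would pass the path of the folder that holds your .csv files instead.

```r
# Show the directory R currently reads from and writes to.
getwd()

# tempdir() is a stand-in here; replace it with the path of your own lab folder.
old_wd <- setwd(tempdir())   # setwd() returns the previous directory invisibly

# Files listed here can be opened by bare name, e.g. read.table("Hydration.csv", ...).
list.files()

# Restore the previous working directory.
setwd(old_wd)
```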
2.a Simple linear regression
Exercise 1. Health experts recommend that runners drink 4 ounces of water every 15 minutes they
run. Although handheld bottles work well for many types of runs, all-day cross-country runs require
hip-mounted or over-the-shoulder hydration systems. In addition to carrying more water, hip-
mounted or over-the-shoulder hydration systems offer more storage space for food and extra
clothing. As the capacity increases, however, the weight and cost of these larger-capacity systems
also increase. The data showing the weight (ounces) and the price for 26 hip-mounted or over-the-
shoulder hydration systems are stored in Hydration.csv.
First, import the Hydration data into R and assign it to Hydration by running the following code:
> Hydration <- read.table("Hydration.csv", header = TRUE, sep = ",",
+                         quote = "\"", stringsAsFactors = FALSE)
> head(Hydration)
> str(Hydration)

You can develop a scatter diagram for this set of data with weight as the independent variable to
visualize the relationship between the two variables. The code to produce a scatter diagram is as
follows.
> plot(Hydration$Weight, Hydration$Price,
+      xlab = "Weight (ounces)", ylab = "Price ($)",
+      main = "Relationship between Price and Weight of hydration systems")


a. What does the scatter diagram indicate about the relationship between the two
variables?
The estimated regression equation can be formulated by the following code.
> reg.ex1 <- lm(Price ~ Weight, data = Hydration)
> summary(reg.ex1)
> with(Hydration, plot(Weight, Price))
> abline(reg.ex1)

The output is given below.

Call:
lm(formula = Price ~ Weight, data = Hydration)

Residuals:
    Min      1Q  Median      3Q     Max
-21.656  -5.335  -0.884   3.588  16.840

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.9785     3.3800   1.473    0.154
Weight        2.9370     0.2934  10.011 4.81e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.457 on 24 degrees of freedom
Multiple R-squared:  0.8068,  Adjusted R-squared:  0.7987
F-statistic: 100.2 on 1 and 24 DF,  p-value: 4.813e-10
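
The columns of the summary table are linked: each t value is the estimate divided by its standard error, and in simple linear regression the F-statistic equals the square of the slope's t value. For the output above, 2.9370 / 0.2934 ≈ 10.01 and 10.011² ≈ 100.2. The sketch below checks the same identities in R. Because Hydration.csv is not reproduced here, it uses R's built-in cars data set as a stand-in, and the object names are illustrative.

```r
# Stand-in data: built-in 'cars' (stopping distance vs speed), not the lab data.
fit <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t value = estimate / standard error
t_slope <- coef_table["speed", "Estimate"] / coef_table["speed", "Std. Error"]
all.equal(t_slope, coef_table["speed", "t value"])

# With a single predictor, the F-statistic is the squared t value of the slope
f_value <- unname(summary(fit)$fstatistic["value"])
all.equal(f_value, t_slope^2)

# The coefficient of determination is stored in the summary object
summary(fit)$r.squared
```

The same calls work on reg.ex1 once the Hydration data have been loaded.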

b. Develop an estimated regression equation that could be used to predict the price of a
hydration system given its weight.
c. What are the values of b0 and b1? How can we interpret these values?
d. Is there a significant relationship between price and weight? Use α = .05.
e. How well does the straight line approximate the relationship between weight and price?
What is the coefficient of determination? What does it say about the explanatory power
of the model?


Assume that the estimated regression equation developed above will also apply to hydration systems
produced by other companies.
f. Develop a 95% prediction interval estimate of the price for the Back Draft system
produced by Eastern Mountain Sports that weighs 10 ounces.
To answer this question, it is necessary to produce a prediction interval with the following code.
> predict(reg.ex1, data.frame(Weight = 10), interval = "prediction")
       fit      lwr      upr
1 34.34858 16.56123 52.13593

If you instead want a 95% confidence interval estimate of the mean price of all hydration
systems that weigh 10 ounces, use the following code:
> predict(reg.ex1, data.frame(Weight = 10), interval = "confidence")
       fit      lwr      upr
1 34.34858 30.92532 37.77183
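
The two intervals share the same centre but use different standard errors. As a sketch in standard textbook notation (these symbols are not defined in the handout: x* is the given weight, s the residual standard error, n the sample size, and t_{α/2} the critical value with n − 2 degrees of freedom):

```latex
\hat{y}^* = b_0 + b_1 x^* = 4.9785 + 2.9370(10) \approx 34.35

\text{Confidence interval (mean price): } \quad
\hat{y}^* \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}

\text{Prediction interval (one system): } \quad
\hat{y}^* \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
```

The extra 1 under the square root accounts for the variation of an individual price around the mean, which is why the prediction interval (16.56 to 52.14) is wider than the confidence interval (30.93 to 37.77).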

The coefficient of correlation is used to measure the strength of a linear association between two
variables. We can test the coefficient of correlation to determine if a linear relationship exists with
the following code.
> cor.test(Hydration$Weight, Hydration$Price,
+          alternative = "two.sided", method = "pearson", conf.level = 0.95)

The R output is provided below.

        Pearson's product-moment correlation

data:  Hydration$Weight and Hydration$Price
t = 10.011, df = 24, p-value = 4.813e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7834398 0.9537369
sample estimates:
      cor
0.8982136
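
In simple linear regression, the correlation test and the regression output are two views of the same fit. A worked check using the values printed above, with n = 26 systems:

```latex
r^2 = (0.8982136)^2 \approx 0.8068 = R^2

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
  = \frac{0.8982136\,\sqrt{24}}{\sqrt{1 - 0.8068}} \approx 10.01
```

Both values agree with the Multiple R-squared and the t value for Weight reported by summary(reg.ex1).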

How strong is the correlation between the two variables?


Some important conditions required for the validity of the regression analysis are:
- The relationship between X and Y is linear.
- The error variable is normally distributed.
- The error variance is constant for all values of X.
- The errors are independent of each other.
We can draw residual plots with the following code to check the first three assumptions.
> hpar <- par(mfrow = c(2, 2))
> plot(reg.ex1)
> par(hpar)
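
plot() applied to an lm object draws several diagnostic panels, and its which argument selects them one at a time, which makes it easier to tie each plot to an assumption. A minimal sketch, using the built-in cars data as a stand-in for Hydration (object names are illustrative):

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in model, not the lab data

old_par <- par(mfrow = c(1, 3))
plot(fit, which = 1)   # Residuals vs Fitted: curvature suggests non-linearity
plot(fit, which = 2)   # Normal Q-Q: points far from the line suggest non-normal errors
plot(fit, which = 3)   # Scale-Location: a trend suggests non-constant variance
par(old_par)

# The residuals themselves are available for further checks
res <- resid(fit)
summary(res)
```

Independence of the errors cannot be judged from these panels alone; it depends mainly on how the data were collected.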


Problem 2. Consumer Reports provided extensive testing and ratings for more than 100 HDTVs.
An overall score, based primarily on picture quality, was developed for each model. In general, a
higher overall score indicates better performance. The data showing the price and overall score for
the ten 42-inch plasma televisions are stored in Plasma.csv.
a) Use the data to develop an estimated regression equation that could be used to estimate the
overall score for a 42-inch plasma television given the price.
b) What is the value of R²? Did the estimated regression equation provide a good fit?
c) Test for a significant relationship between price and overall score. What is your conclusion?
Use α = 0.05.
d) Develop a 95% prediction interval for the overall score of a particular 42-inch plasma television with a price of $3,200.
e) Develop a 95% confidence interval for the mean overall score of all 42-inch plasma televisions with a price of $3,200.
The output is given below.
Call:
lm(formula = Score ~ Price, data = Plasma)

Residuals:
    Min      1Q  Median      3Q     Max
-11.108  -5.417   0.836   4.468  14.431

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.01749   14.90958   0.806   0.4435
Price        0.01270    0.00496   2.560   0.0337 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.216 on 8 degrees of freedom
Multiple R-squared:  0.4503,  Adjusted R-squared:  0.3816
F-statistic: 6.553 on 1 and 8 DF,  p-value: 0.03365


Prediction interval
       fit      lwr      upr
1 52.64723 32.58724 72.70722

Confidence interval
       fit      lwr      upr
1 52.64723 46.05691 59.23755

        Pearson's product-moment correlation

data:  Plasma$Price and Plasma$Score
t = 2.5599, df = 8, p-value = 0.03365
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.07169443 0.91434657
sample estimates:
    cor
0.67103
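
The handout lists only the output for Problem 2. The commands below are one way such output could be produced, following the pattern of Exercise 1. This is a sketch, assuming Plasma.csv has columns named Price and Score (as the printed formula suggests); the helper name plasma_analysis is hypothetical.

```r
# Hypothetical helper: fits Score ~ Price and returns the pieces shown above.
plasma_analysis <- function(plasma, price = 3200) {
  fit <- lm(Score ~ Price, data = plasma)
  new_tv <- data.frame(Price = price)
  list(
    fit        = fit,
    prediction = predict(fit, new_tv, interval = "prediction"),  # one television
    confidence = predict(fit, new_tv, interval = "confidence"),  # mean of all such televisions
    cor_test   = cor.test(plasma$Price, plasma$Score)
  )
}

# Runs only if the data file is present in the working directory.
if (file.exists("Plasma.csv")) {
  Plasma <- read.csv("Plasma.csv")
  out <- plasma_analysis(Plasma)
  print(summary(out$fit))
  print(out$prediction)
  print(out$confidence)
  print(out$cor_test)
}
```

Both predict() calls use the default level of 0.95, so the intervals are 95% intervals.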


2.b Multiple linear regression


The formula for a multiple linear regression is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$
Problem 3. You are a public health researcher interested in social factors that influence heart
disease. You survey 500 towns and gather data on the percentage of people in each town who
smoke, the percentage of people in each town who bike to work, and the percentage of people in
each town who have heart disease.
Import the heart.data data frame into R and assign it to Heart.
The estimated multiple linear regression equation can be formulated by the following code.
multiple.regression <- lm(heart.disease ~ biking + smoking, data = Heart)
Use summary() to see the following output:
Call:
lm(formula = heart.disease ~ biking + smoking, data = Heart)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1789 -0.4463  0.0362  0.4422  1.9331

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.984658   0.080137  186.99   <2e-16 ***
biking      -0.200133   0.001366 -146.53   <2e-16 ***
smoking      0.178334   0.003539   50.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.654 on 495 degrees of freedom
Multiple R-squared:  0.9796,  Adjusted R-squared:  0.9795
F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16
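
Reading the Estimate column gives the estimated multiple regression equation. The second line plugs in a hypothetical town (30% biking to work, 10% smoking; values chosen purely for illustration) to show how a prediction is computed from rounded coefficients:

```latex
\widehat{\text{heart.disease}} = 14.985 - 0.200\,\text{biking} + 0.178\,\text{smoking}

14.985 - 0.200(30) + 0.178(10) = 14.985 - 6.000 + 1.780 = 10.765
```

Each slope is the estimated change in the percentage with heart disease for a one-point increase in that predictor while the other predictor is held fixed.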


Use avPlots() from the car package to visualise the fitted model with added-variable plots.

model3 <- lm(heart.disease ~ biking + smoking, data = Heart)

install.packages("car")  # only needed once per R installation
library(car)
avPlots(model3)

In our survey of 500 towns, we found significant relationships between the frequency of biking to
work and the frequency of heart disease, and between the frequency of smoking and the frequency
of heart disease (p < 0.001 for each).
