Lecture 6 Correlation and Regression
Linear Regression
Figure 1: Relationship between food expenditure and income, with income on the horizontal axis. (a) Linear relationship. (b) Nonlinear relationship.
\[ y = \beta_0 + \beta_1 x + \epsilon \qquad (1) \]
\[ \hat{y} = a + bx \qquad (3) \]
where 𝑦̂ (read as y hat) is the estimated or predicted value of y for a given value of
x.
• Equation (3) is called the estimated regression model; it gives the regression
of y on x.
\[ b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \qquad \text{and} \qquad a = \bar{y} - b\bar{x} \]
EXAMPLE
Find the least squares regression line for the data on incomes and food expenditures
for the seven households given in the following table. Use income as the independent
variable and food expenditure as the dependent variable.
Income, x              55   83   38   61   33   49   67
Food expenditure, y    14   24   13   16    9   15   17
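Solution: We are to find the values of a and b for the regression model. The table
below shows the calculations required for the computation of a and b.
  x      y      xy      x²
 55     14     770    3025
 83     24    1992    6889
 38     13     494    1444
 61     16     976    3721
 33      9     297    1089
 49     15     735    2401
 67     17    1139    4489
Σx = 386   Σy = 108   Σxy = 6403   Σx² = 23058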
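Thus,
\[ b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{6403 - \frac{386 \times 108}{7}}{23058 - \frac{(386)^2}{7}} = \frac{447.5714}{1772.8571} = 0.2525 \]
\[ \bar{x} = \frac{386}{7} = 55.1429 \qquad \bar{y} = \frac{108}{7} = 15.4286 \]
\[ a = \bar{y} - b\bar{x} = 15.4286 - (0.2525)(55.1429) = 1.5050 \]
The estimated regression model is therefore
\[ \hat{y} = 1.5050 + 0.2525x \]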
This regression line is called the least squares regression line. It gives the regression
of food expenditure on income.
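The computation above can be checked with a short script. This is a minimal sketch, assuming plain Python with no external libraries; the function name `least_squares_line` is a hypothetical helper introduced for illustration, and it applies the same computational formulas for b and a to the seven households.

```python
# Least squares line via the computational formulas for b and a.
# Data: the seven households from the example (x = income, y = food expenditure).
x = [55, 83, 38, 61, 33, 49, 67]
y = [14, 24, 13, 16, 9, 15, 17]


def least_squares_line(xs, ys):
    """Return (a, b) for the estimated line y_hat = a + b*x (hypothetical helper)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    # Slope: (sum xy - (sum x)(sum y)/n) / (sum x^2 - (sum x)^2/n)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: y_bar - b * x_bar
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares_line(x, y)
print(f"b = {b:.4f}")
print(f"a = {a:.4f}")
```

Carrying full precision gives a of about 1.507. Rounding b to 0.2525 before computing a gives 1.5050, which is the value behind the $150.50 figure quoted in these notes.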
Interpretation of a and b
Interpretation of a
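Consider a household with zero income. Using the estimated regression line obtained
in the example, we get the predicted value of y for x = 0 as
\[ \hat{y} = 1.5050 + 0.2525(0) = 1.5050 \]
Since 1.5050 corresponds to $150.50 when y is measured in hundreds of dollars, we can
state that a household with no income is expected to spend $150.50 per month on food.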
Interpretation of b
The value of b in a regression model gives the change in y (dependent variable) due
to a change of one unit in x (independent variable).
Note that when b is positive, an increase in x will lead to an increase in y, and a
decrease in x will lead to a decrease in y. In other words, when b is positive, the
movements in x and y are in the same direction. Such a relationship between x and y
is called a positive linear relationship.
The regression line in this case slopes upward from left to right. On the other hand,
if the value of b is negative, an increase in x will lead to a decrease in y, and a
decrease in x will cause an increase in y. The changes in x and y in this case are in
opposite directions. Such a relationship between x and y is called a negative linear
relationship. The regression line in this case slopes downward from left to right.
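The reading of b as the change in the predicted value per unit change in x follows directly from the estimated model. A short derivation, using only the symbols a, b, x, and ŷ defined in these notes:

```latex
\begin{align*}
\hat{y}(x + 1) - \hat{y}(x) &= \bigl[a + b(x + 1)\bigr] - \bigl[a + bx\bigr] \\
                            &= b
\end{align*}
```

So the sign of b alone decides whether the predicted value rises or falls as x increases, which is why it also decides the direction in which the line slopes.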
The standard error of estimate is
\[ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} \]
where
• y = values of the dependent variable
• ŷ = estimated values from the estimating equation that correspond
to each y value
• n = number of data points used to fit the regression line.
  x     y     ŷ = −13.02 + 2.54x    (y − ŷ)²
 12    20          17.46              6.4516
 30    60          63.18             10.1124
 15    27          25.08              3.6864
 24    50          47.94              4.2436
 14    21          22.54              2.3716
 18    30          32.70              7.2900
 28    61          58.10              8.4100
 26    54          53.02              0.9604
 19    32          35.24             10.4976
 27    57          55.56              2.0736
                      Σ(y − ŷ)² =    56.0972
Thus,
\[ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{56.097}{10 - 2}} = \sqrt{\frac{56.097}{8}} \approx 2.65 \]
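The standard error calculation can be reproduced in a few lines. This is a minimal sketch, assuming the fitted line ŷ = −13.02 + 2.54x from the table is taken as given; the function name `standard_error_of_estimate` is a hypothetical helper, not part of any library.

```python
import math

# The ten (x, y) pairs from the table above.
x = [12, 30, 15, 24, 14, 18, 28, 26, 19, 27]
y = [20, 60, 27, 50, 21, 30, 61, 54, 32, 57]


def standard_error_of_estimate(xs, ys, a, b):
    """s_e = sqrt( sum (y - y_hat)^2 / (n - 2) ) for the line y_hat = a + b*x."""
    n = len(xs)
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))
    return math.sqrt(sse / (n - 2))


# Coefficients are the rounded values used in the table (assumed given).
s_e = standard_error_of_estimate(x, y, a=-13.02, b=2.54)
print(f"s_e = {s_e:.3f}")
```

The divisor is n − 2 rather than n because two quantities, a and b, were estimated from the same data.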
The mean of y is
\[ \bar{y} = \frac{\sum y}{n} = \frac{412}{10} = 41.2 \]
so that
\[ r^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{56.097}{2505.6} = 1 - 0.0224 = 0.9776 \]
Thus, we can conclude that the variation in the number of workers (the independent
variable X) explains 97.76 percent of the variation in the production of the Redwood
Falls plant (the dependent variable Y).
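The value of r² can be verified the same way. This is a minimal sketch, assuming the same ten data points and the fitted line ŷ = −13.02 + 2.54x; the function name `r_squared` is a hypothetical helper introduced for illustration.

```python
# The ten (x, y) pairs used for the standard error calculation.
x = [12, 30, 15, 24, 14, 18, 28, 26, 19, 27]
y = [20, 60, 27, 50, 21, 30, 61, 54, 32, 57]


def r_squared(xs, ys, a, b):
    """r^2 = 1 - sum (y - y_hat)^2 / sum (y - y_bar)^2 for the line y_hat = a + b*x."""
    n = len(ys)
    y_bar = sum(ys) / n
    # Unexplained variation: squared residuals around the fitted line.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))
    # Total variation: squared deviations of y around its mean.
    sst = sum((yi - y_bar) ** 2 for yi in ys)
    return 1 - sse / sst


# Coefficients are the rounded values from the notes (assumed given).
r2 = r_squared(x, y, a=-13.02, b=2.54)
print(f"r^2 = {r2:.4f}")
```

The ratio being subtracted is the share of the total variation in y that the line leaves unexplained, so a value of r² close to 1 means the residuals are small relative to the spread of y.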
Correlation
• Correlation analysis is used to measure the strength of the association (linear
relationship) between two variables
• Only concerned with strength of the relationship
• No causal effect is implied
Correlation Coefficient
• The population correlation coefficient ρ (rho) measures the strength of the
association between the variables
• The sample correlation coefficient r is an estimate of ρ and is used to measure
the strength of the linear relationship in the sample observations
Features of ρ and r
• Unit free
• Range between -1 and 1
• The closer to -1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
Example
Compute r for the following pairs of values and draw a scatter plot.
x 1 2 3 4 5
y 10 8 6 4 2
Solution
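The computing formula for Karl Pearson's correlation coefficient is
\[ r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left\{\sum x^2 - \frac{(\sum x)^2}{n}\right\}\left\{\sum y^2 - \frac{(\sum y)^2}{n}\right\}}} \]
With n = 5, Σx = 15, Σy = 30, Σxy = 70, Σx² = 55, and Σy² = 220,
\[ r = \frac{70 - \frac{15 \times 30}{5}}{\sqrt{\left\{55 - \frac{(15)^2}{5}\right\}\left\{220 - \frac{(30)^2}{5}\right\}}} = \frac{-20}{\sqrt{10 \times 40}} = \frac{-20}{20} = -1 \]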
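The same result can be obtained programmatically. This is a minimal sketch, assuming plain Python and the computing formula for r given in this example; the function name `pearson_r` is a hypothetical helper, not a library call.

```python
import math

# The five (x, y) pairs from the example.
x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]


def pearson_r(xs, ys):
    """Sample correlation coefficient r from the computing formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    sum_y2 = sum(yi * yi for yi in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    )
    return numerator / denominator


r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Every point in this data set lies exactly on the line y = 12 − 2x, which is why r reaches the lower limit of −1: a perfect negative linear relationship.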