Lecture 6 Correlation and Regression


Lecture 6

Linear Regression and Correlation


Simple Regression
• A regression model is a mathematical equation that describes the relationship
between two or more variables.
• A simple regression model includes only two variables: one independent and
one dependent.
• The dependent variable is the one being explained, and the independent
variable is the one used to explain the variation in the dependent variable.

Linear Regression

• The relationship between two variables in a regression analysis is expressed
by a mathematical equation called a regression equation or model.
• A regression equation, when plotted, may assume one of many possible
shapes, including a straight line.
• A regression equation that gives a straight-line relationship between two
variables is called a linear regression model; otherwise, the model is called
a nonlinear regression model.

• The two diagrams in Figure 1 show a linear and a nonlinear relationship
between the dependent variable food expenditure and the independent
variable income.
• A linear relationship between income and food expenditure, shown in Figure
1a, indicates that as income increases, the food expenditure always increases
at a constant rate.
• A nonlinear relationship between income and food expenditure, as depicted
in Figure 1b, shows that as income increases, the food expenditure increases,
although, after a point, the rate of increase in food expenditure is lower for
every subsequent increase in income.

Figure 1: Relationship between food expenditure (vertical axis) and income
(horizontal axis). (a) Linear relationship. (b) Nonlinear relationship.

Simple Linear Regression Analysis


In a regression model, the independent variable is usually denoted by x, and the
dependent variable is usually denoted by y. The simple linear regression model is
written as

$$y = \beta_0 + \beta_1 x + \epsilon \tag{1}$$

where y is the dependent variable, x is the independent variable, β0 is the
constant term or y-intercept, β1 is the slope, and ε is the random error term.

• In model (1), 𝛽0 and 𝛽1 are the population parameters.


• The regression line obtained for model (1) by using the population data is
called the population regression line.
• The values of 𝛽0 and 𝛽1 in the population regression line are called the true
values of the y-intercept and slope, respectively.

• However, population data are difficult to obtain.


• As a result, we almost always use sample data to estimate model (1).
• The values of the y-intercept and slope calculated from sample data on x and
y are called the estimated values of 𝜷𝟎 and 𝜷𝟏 and are denoted by a and b,
respectively.

Using a and b, we write the estimated regression model as

$$\hat{y} = a + bx \tag{3}$$

where 𝑦̂ (read as y hat) is the estimated or predicted value of y for a given value of
x.
• Equation (3) is called the estimated regression model; it gives the regression
of y on x.

Least Squares Line

For the least squares regression line ŷ = a + bx,

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}$$
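
As a reminder of where the formulas for a and b come from: the least squares line is the line that makes the sum of squared vertical deviations of the observed y values from the line as small as possible. The sketch below states that criterion and the two equations it leads to. This is the standard textbook result, not a derivation given in these notes, and the label SSE is introduced here only for illustration.

```latex
\[
\text{SSE}(a, b) = \sum (y - \hat{y})^2 = \sum (y - a - bx)^2
\]
% Setting the partial derivatives of SSE with respect to a and b to zero
% gives the two normal equations:
\[
\sum y = na + b \sum x ,
\qquad
\sum xy = a \sum x + b \sum x^2
\]
% Solving this pair for b and then a yields
\[
b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}},
\qquad
a = \bar{y} - b\bar{x}
\]
```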

EXAMPLE
Find the least squares regression line for the data on incomes and food expenditures
of the seven households given in the following table. Use income as the independent
variable and food expenditure as the dependent variable.

Income, x              55   83   38   61   33   49   67
Food expenditure, y    14   24   13   16    9   15   17

Solution: We are to find the values of a and b for the regression model. The table
below shows the calculations required for the computation of a and b.

Income, x    Food expenditure, y    xy            x²
55           14                     770           3025
83           24                     1992          6889
38           13                     494           1444
61           16                     976           3721
33           9                      297           1089
49           15                     735           2401
67           17                     1139          4489
Σx = 386     Σy = 108               Σxy = 6403    Σx² = 23,058

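Thus,

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{6403 - \dfrac{386 \times 108}{7}}{23058 - \dfrac{(386)^2}{7}} = \frac{447.5714}{1772.8571} = 0.2525$$

$$\bar{x} = \frac{386}{7} = 55.1429 \qquad \bar{y} = \frac{108}{7} = 15.4286$$

$$a = \bar{y} - b\bar{x} = 15.4286 - (0.2525)(55.1429) = 1.5050$$

Thus, our estimated regression model is ŷ = a + bx = 1.5050 + 0.2525x.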

This regression line is called the least squares regression line. It gives the regression
of food expenditure on income.
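
The hand calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `least_squares_line` and the variable names are chosen here for illustration and do not come from the lecture.

```python
# Household data from the example: income (x) and food expenditure (y).
income = [55, 83, 38, 61, 33, 49, 67]
food_expenditure = [14, 24, 13, 16, 9, 15, 17]


def least_squares_line(x, y):
    """Return (a, b) for y_hat = a + b*x using the computing formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: (sum xy - (sum x)(sum y)/n) / (sum x^2 - (sum x)^2/n)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: y_bar - b * x_bar
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares_line(income, food_expenditure)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Because the code carries b at full precision instead of rounding it to 0.2525 before computing a, the intercept it reports is about 1.507 instead of 1.5050. The difference is purely a rounding effect.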
Interpretation of a and b
Interpretation of a
Consider a household with zero income. Using the estimated regression line obtained
in the example, we get the predicted value of y for x = 0 as

$$\hat{y} = 1.5050 + 0.2525(0) = 1.5050 \text{ hundred dollars} = \$150.50$$

Thus, we can state that a household with no income is expected to spend $150.50
per month on food.

Interpretation of b
The value of b in a regression model gives the change in y (dependent variable) due
to a change of one unit in x (independent variable).
Note that when b is positive, an increase in x will lead to an increase in y, and a
decrease in x will lead to a decrease in y. In other words, when b is positive, the
movements in x and y are in the same direction. Such a relationship between x and y
is called a positive linear relationship.
The regression line in this case slopes upward from left to right. On the other hand,
if the value of b is negative, an increase in x will lead to a decrease in y, and a
decrease in x will cause an increase in y. The changes in x and y in this case are in
opposite directions. Such a relationship between x and y is called a negative linear
relationship. The regression line in this case slopes downward from left to right.
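
To make the two interpretations concrete, the sketch below evaluates the estimated line from the food expenditure example at x = 0 and at two incomes one unit apart. The helper `predict` and the constants `A` and `B` are names assumed here for illustration; the values 1.5050 and 0.2525 are the rounded estimates from the example.

```python
# Rounded estimates from the food expenditure example.
A = 1.5050   # intercept a
B = 0.2525   # slope b


def predict(x):
    """Predicted food expenditure y_hat = a + b*x."""
    return A + B * x


# At x = 0 the prediction is just the intercept a.
at_zero = predict(0)

# Raising x by one unit changes the prediction by b, whatever the starting x.
per_unit = predict(56) - predict(55)

# b > 0 here, so x and y move in the same direction (positive linear relationship).
direction = "positive" if B > 0 else "negative"

print(round(at_zero, 4), round(per_unit, 4), direction)
```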

Standard Error of Estimate


To measure the reliability of the estimating equation, statisticians have developed
the standard error of estimate. This standard error is symbolized se and is similar to
the standard deviation, in that both are measures of dispersion. The standard
deviation is used to measure the dispersion of a set of observations about the
mean. The standard error of estimate, on the other hand, measures the variability,
or scatter, of the observed values around the regression line.

The standard error of estimate may be defined as follows:

$$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$

where
• y = values of the dependent variable
• ŷ = estimated values from the estimating equation that correspond
to each y value
• n = number of data points used to fit the regression line.

Example: Suppose the estimated regression equation is

$$\hat{y} = b_0 + b_1 x = -13.02 + 2.545x$$

To calculate se for this problem, we must determine the value of Σ(y − ŷ)². We
have done this in the following table:

x     y     ŷ = −13.02 + 2.54x    (y − ŷ)²
12    20    17.46                 6.4516
30    60    63.18                 10.1124
15    27    25.08                 3.6864
24    50    47.94                 4.2436
14    21    22.54                 2.3716
18    30    32.70                 7.2900
28    61    58.10                 8.4100
26    54    53.02                 0.9604
19    32    35.24                 10.4976
27    57    55.56                 2.0736
Total                             56.0972

Thus,

$$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{56.097}{10 - 2}} = \sqrt{\frac{56.097}{8}} \approx 2.65$$
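
The table and the final division can be reproduced programmatically. The following is a sketch under the same assumption the table makes, namely that fitted values use the rounded line ŷ = −13.02 + 2.54x. The function name `standard_error_of_estimate` is hypothetical.

```python
import math

# Data from the example table.
x = [12, 30, 15, 24, 14, 18, 28, 26, 19, 27]
y = [20, 60, 27, 50, 21, 30, 61, 54, 32, 57]


def standard_error_of_estimate(x, y, a, b):
    """Return (s_e, sse) for the line y_hat = a + b*x."""
    n = len(x)
    # Sum of squared deviations of observed y from the fitted values.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Divide by n - 2 because two coefficients were estimated.
    return math.sqrt(sse / (n - 2)), sse


s_e, sse = standard_error_of_estimate(x, y, a=-13.02, b=2.54)
print(f"sum of squared errors = {sse:.4f}, s_e = {s_e:.3f}")
```

Passing different values of `a` and `b` shows how the scatter around the line, and hence s_e, changes when the line fits the points less well.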

Interpreting the Standard Error of Estimate


As was true of the standard deviation, the larger the standard error of estimate,
the greater the scattering (or dispersion) of points around the regression line.
Conversely, if se = 0, we expect the estimating equation to be a “perfect” estimator
of the dependent variable. In that case, all the data points lie directly on the
regression line, and no points would be scattered around it.
Coefficient of Determination
The coefficient of determination is the primary way we can measure the extent, or
strength, of the association that exists between two variables, X and Y. Statisticians
interpret the coefficient of determination by looking at the amount of the variation
in Y that is explained by the regression line. The coefficient of determination is
defined by
$$r^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$

We can use the following table to calculate the coefficient of determination:

x     y     ŷ = −13.02 + 2.54x    (y − ŷ)²    (y − ȳ)²
12    20    17.46                 6.4516      449.44
30    60    63.18                 10.1124     353.44
15    27    25.08                 3.6864      201.64
24    50    47.94                 4.2436      77.44
14    21    22.54                 2.3716      408.04
18    30    32.70                 7.2900      125.44
28    61    58.10                 8.4100      392.04
26    54    53.02                 0.9604      163.84
19    32    35.24                 10.4976     84.64
27    57    55.56                 2.0736      249.64
Total 412                         56.0972     2505.6

$$\bar{y} = \frac{\sum y}{n} = \frac{412}{10} = 41.2$$

$$r^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{56.097}{2505.6} = 1 - 0.0224 = 0.9776$$

Thus, we can conclude that the variation in the number of workers (the independent
variable X) explains 97.76 percent of the variation in production at the Redwood
Falls plant (the dependent variable Y).
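
The same calculation can be written as a short Python sketch. It assumes, as the table does, that fitted values come from the rounded line ŷ = −13.02 + 2.54x. The names `sse` (unexplained variation) and `sst` (total variation about the mean) are labels introduced here, not notation from the lecture.

```python
# Data from the example table.
x = [12, 30, 15, 24, 14, 18, 28, 26, 19, 27]
y = [20, 60, 27, 50, 21, 30, 61, 54, 32, 57]

a, b = -13.02, 2.54             # rounded coefficients used in the table
y_bar = sum(y) / len(y)         # mean of the observed y values

# Variation left unexplained by the line: sum of (y - y_hat)^2.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Total variation of y about its mean: sum of (y - y_bar)^2.
sst = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - sse / sst
print(f"y_bar = {y_bar}, r^2 = {r_squared:.4f}")
```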

Correlation
• Correlation analysis is used to measure the strength of the association (linear
relationship) between two variables.
• It is concerned only with the strength of the relationship.
• No causal effect is implied.
Correlation Coefficient
• The population correlation coefficient ρ (rho) measures the strength of the
association between the variables
• The sample correlation coefficient r is an estimate of ρ and is used to measure
the strength of the linear relationship in the sample observations
Features of ρ and r
• Unit free
• Range between -1 and 1
• The closer to -1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
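
The "unit free" property can be illustrated numerically. The sketch below reuses the household income and food expenditure data from the regression example, computes r, and then recomputes it after multiplying every value by 100 (a hypothetical change of units). The helper `pearson_r` is a name assumed here; it applies the usual computing formula for the sample correlation coefficient.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient via the computing formula."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)


income = [55, 83, 38, 61, 33, 49, 67]
food = [14, 24, 13, 16, 9, 15, 17]

r_original = pearson_r(income, food)

# Hypothetical change of units: multiply every value by 100.
r_rescaled = pearson_r([100 * v for v in income], [100 * v for v in food])

# Reversing the sign of y flips the sign of r but not its magnitude.
r_flipped = pearson_r(income, [-v for v in food])

print(f"{r_original:.4f} {r_rescaled:.4f} {r_flipped:.4f}")
```

The first two values agree up to floating-point error, and the third has the same magnitude with the opposite sign, which matches the range and sign features listed above.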

Calculating the Correlation Coefficient


Sample correlation coefficient:

$$r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\left\{\sum x^2 - \dfrac{(\sum x)^2}{n}\right\}\left\{\sum y^2 - \dfrac{(\sum y)^2}{n}\right\}}}$$

The value of the correlation coefficient r lies between −1 and +1.

Interpretation of Correlation Coefficient
+r values       Positive       −r values          Negative
1.0             Perfect        −1.0               Perfect
0.8 to 0.99     Very Strong    −0.8 to −0.99      Very Strong
0.6 to 0.79     Strong         −0.6 to −0.79      Strong
0.4 to 0.59     Moderate       −0.4 to −0.59      Moderate
0.2 to 0.39     Weak           −0.2 to −0.39      Weak
0.01 to 0.19    Very Weak      −0.01 to −0.19     Very Weak
0               No Linear Relationship

Example
Compute r for the following paired data set and draw a scatter plot.

x     1    2    3    4    5
y    10    8    6    4    2

Solution
The computing formula for Karl Pearson's correlation coefficient is

$$r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\left\{\sum x^2 - \dfrac{(\sum x)^2}{n}\right\}\left\{\sum y^2 - \dfrac{(\sum y)^2}{n}\right\}}}$$

Let us make a table to calculate the correlation coefficient:

x        y     x²    y²     xy
1        10    1     100    10
2        8     4     64     16
3        6     9     36     18
4        4     16    16     16
5        2     25    4      10
Total 15 30    55    220    70
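
$$r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\left\{\sum x^2 - \dfrac{(\sum x)^2}{n}\right\}\left\{\sum y^2 - \dfrac{(\sum y)^2}{n}\right\}}} = \frac{70 - \dfrac{15 \times 30}{5}}{\sqrt{\left\{55 - \dfrac{(15)^2}{5}\right\}\left\{220 - \dfrac{(30)^2}{5}\right\}}} = \frac{-20}{\sqrt{10 \times 40}} = \frac{-20}{20} = -1$$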

Conclusion: There exists a perfect negative linear relationship between x and y.
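
The column totals and the value of r in this example can be verified with a short script. This is a minimal sketch using only the Python standard library; the variable names are chosen here for illustration.

```python
import math

x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]
n = len(x)

# Column totals, matching the last row of the table.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))
print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)   # 15 30 55 220 70

# Computing formula for the sample correlation coefficient.
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator
print(r)   # -1.0
```

Every point here lies exactly on the line y = 12 − 2x, which is why r comes out as exactly −1.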


Scatter plot
A scatter diagram, or scatter plot, is the simplest device for showing the
relationship between two variables: each pair of values is plotted as a dot on
graph paper.
