Bio-L8- Correlation and Regression Analysis


Linear Regression and Correlation

Correlation and Regression:


It frequently happens that statisticians want to describe with a single number the relationship between two sets of scores. A number that measures the relationship between two sets of scores is called a correlation coefficient.

Scattergram:
Consider a list of pairs of numerical values representing variables 𝑥 and 𝑦. The scattergram of the data is simply a picture of the pairs of values as points in the coordinate plane 𝑅². The picture sometimes indicates a relationship between the points, as illustrated in the following examples:
Correlation Coefficient:
Pearson defined 𝑟 so that it has a minimum possible value of −1 and a maximum possible value of +1. When the sample points lie exactly on a line sloping down to the right, we say there is perfect negative correlation: 𝑟 = −1. When the sample points lie exactly on a line sloping up to the right, we say there is perfect positive correlation: 𝑟 = +1. When there is no tendency of the points to lie on a straight line, we say there is no correlation: 𝑟 = 0.
If 𝑟 is near +1 or −1, we say we have high correlation. If 𝑟 is near zero, we say we have low correlation.
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}\;\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}$$

or

$$r = \frac{\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)/n}{\sqrt{\left(\sum_{i=1}^{n} x_i^{2} - \frac{\left(\sum_{i=1}^{n} x_i\right)^{2}}{n}\right)\left(\sum_{i=1}^{n} y_i^{2} - \frac{\left(\sum_{i=1}^{n} y_i\right)^{2}}{n}\right)}}$$
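The computational form of 𝑟 can be checked with a short script. A minimal sketch (the `pearson_r` helper and the sample lists are illustrative, not from the notes):

```python
# Pearson correlation coefficient, using the computational formula
# r = [sum(xy) - sum(x)sum(y)/n] / sqrt([sum(x^2) - (sum x)^2/n][sum(y^2) - (sum y)^2/n]).
def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = sxy - sx * sy / n
    den = ((sxx - sx**2 / n) * (syy - sy**2 / n)) ** 0.5
    return num / den

# Perfectly linear increasing data gives perfect positive correlation.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Data lying exactly on a downward-sloping line would give −1.0, matching the perfect-negative-correlation case described above.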
Regression Line :
A regression line is a straight line that describes how a response variable
y changes as an explanatory variable x changes. We often use a
regression line to predict the value of y for a given value of x.

𝑦 = 𝑎 + 𝑏𝑥

$$b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^{2} - n\bar{x}^{2}}, \qquad a = \bar{y} - b\bar{x}$$
Example: for a data set of 𝑛 = 6 pairs with

$$\sum_{i=1}^{n} x_i y_i = 711, \quad \sum_{i=1}^{n} x_i = 47\ (\bar{x} = 47/6), \quad \sum_{i=1}^{n} y_i = 79\ (\bar{y} = 79/6), \quad \sum_{i=1}^{n} x_i^{2} = 423,$$

a.) $b = 1.68$, $a = \bar{y} - b\bar{x} = 0.005$, so the fitted line is $y = 0.005 + 1.68x$.

b.) When $x = 4$: $y = 0.005 + 1.68(4) = 6.725$

When $x = 1$: $y = 0.005 + 1.68(1) = 1.685$

When $x = 15$: $y = 0.005 + 1.68(15) = 25.205$
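The slope and intercept can be recomputed directly from the summary statistics. A quick sketch (note: at full precision the intercept comes out essentially 0; the notes' 0.005 reflects rounding 𝑏 to 1.68 before computing 𝑎):

```python
# Recompute b and a from the summary statistics given above:
# n = 6, sum(xy) = 711, sum(x) = 47, sum(y) = 79, sum(x^2) = 423.
n = 6
sum_xy, sum_x, sum_y, sum_x2 = 711, 47, 79, 423
x_bar, y_bar = sum_x / n, sum_y / n

b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar**2)
a = y_bar - b * x_bar

print(round(b, 3), round(a, 3))  # b ~ 1.681, a ~ 0
print(a + b * 4)                 # prediction at x = 4, ~ 6.72
```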

Example: suppose we have the following data:


𝑥 6 6 7 8 9 9
𝑦 5 6 6 7 7 8

Then: 1) construct a scattergram and draw the calculated regression line; 2) find the correlation coefficient and interpret it.


Simple and Multiple Linear Regression Models
Definition: A multiple linear regression model relating a random response 𝑌 to a set of predictor variables 𝑥1, . . . , 𝑥𝑘 is an equation of the form

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon,$$

where 𝛽0, . . . , 𝛽𝑘 are unknown parameters, 𝑥1, . . . , 𝑥𝑘 are the independent non-random variables, and 𝜀 is a random variable representing an error term. We assume that 𝐸(𝜀) = 0, or equivalently, $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$.

Definition: If $Y = \beta_0 + \beta_1 x + \varepsilon$, this is called a simple linear regression model. Here, 𝛽0 is the y-intercept of the line and 𝛽1 is the slope of the line. The term 𝜀 is the error component.
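The coefficients of a multiple linear regression model are typically estimated by least squares on a design matrix. A hedged sketch assuming NumPy is available; the data are synthetic, generated from known coefficients so the recovered estimates can be compared against them:

```python
import numpy as np

# Fit a multiple linear regression Y = b0 + b1*x1 + b2*x2 + e by least squares.
# Synthetic data from known coefficients (2.0, 1.5, -0.5) with small noise.
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 0.1, n)

# Design matrix: a leading column of ones carries the intercept b0.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(2))  # close to [2.0, 1.5, -0.5]
```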
The Method of Least Squares
Let (𝑥1, 𝑦1), (𝑥2, 𝑦2), . . . , (𝑥𝑛, 𝑦𝑛) be the 𝑛 observed data points, with corresponding errors 𝜀𝑖, 𝑖 = 1, . . . , 𝑛. That is, $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, 𝑖 = 1, 2, . . . , 𝑛.
We assume that the errors 𝜀𝑖 are independent and identically distributed with $E(\varepsilon_i) = 0$ and $Var(\varepsilon_i) = \sigma^2$, 𝑖 = 1, 2, . . . , 𝑛.
One way to decide how well a straight line fits the set of data is to determine the extent to which the data points deviate from the line. The straight-line model for the response 𝑌 for a given 𝑥 is $Y = \beta_0 + \beta_1 x + \varepsilon$. Because we assumed that 𝐸(𝜀) = 0, the expected value of 𝑌 is given by $E(Y) = \beta_0 + \beta_1 x$.
The estimator of 𝐸(𝑌), denoted by $\hat{Y}$, can be obtained by using the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ of the parameters 𝛽0 and 𝛽1, respectively. Then the fitted regression line we are looking for is given by $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
For observed values (𝑥𝑖, 𝑦𝑖), we obtain the estimated value of 𝑦𝑖 as $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
The deviation of the observed 𝑦𝑖 from its predicted value $\hat{y}_i$, called the 𝑖th residual, is defined by $e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$.
The residuals, or errors $e_i$, are the vertical distances between the observed and predicted values of the 𝑦𝑖's.
Definition: The sum of squares for errors (SSE), or sum of squares of the residuals, for all 𝑛 data points is $SSE = \sum_{i=1}^{n} e_i^{2} = \sum_{i=1}^{n}\left[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]^{2}$.
The least-squares approach to estimation is to find $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals, SSE.
Derivation of $\hat{\beta}_0$ and $\hat{\beta}_1$: To simplify the formula for $\hat{\beta}_1$, set

$$S_{xx} = \sum_{i=1}^{n} x_i^{2} - \frac{\left(\sum_{i=1}^{n} x_i\right)^{2}}{n}, \qquad S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

Then

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$

Example 1: Use the method of least squares to fit a straight line to the accompanying data points. Give the estimates of 𝛽0 and 𝛽1. Plot the points and sketch the fitted least-squares line. The observed data values are given in the following table.

𝑥: −1  0  2  −2  5  6  8  11  12  −3
𝑦: −5  −4  2  −7  6  9  13  21  20  −9

Solution: Form a table to compute various terms

$$S_{xx} = \sum_{i=1}^{n} x_i^{2} - \frac{\left(\sum_{i=1}^{n} x_i\right)^{2}}{n} = 408 - \frac{(38)^{2}}{10} = 263.6$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 709 - \frac{(38)(46)}{10} = 534.2$$

$$\bar{x} = 3.8, \qquad \bar{y} = 4.6$$

Therefore, $\hat{\beta}_1 = S_{xy}/S_{xx} = 534.2/263.6 = 2.0266$
and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 4.6 - (2.0266)(3.8) = -3.1011$.
Hence, the least-squares line for these data is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = -3.1011 + 2.0266\,x.$$
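Example 1 can be verified numerically. A short sketch (at full precision the intercept is −3.1009; the notes' −3.1011 comes from rounding $\hat{\beta}_1$ to 2.0266 first):

```python
# Verify Example 1: least-squares fit of the ten (x, y) pairs.
x = [-1, 0, 2, -2, 5, 6, 8, 11, 12, -3]
y = [-5, -4, 2, -7, 6, 9, 13, 21, 20, -9]
n = len(x)

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = s_xy / s_xx                   # estimated slope
b0 = sum(y) / n - b1 * sum(x) / n  # estimated intercept

print(s_xx, s_xy)                  # 263.6 and 534.2, as in the notes
print(round(b1, 4), round(b0, 4))  # ~2.0266 and ~ -3.101
```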

EXAMPLE: Fit a least-squares line to the following data. Also find the trend values $\hat{y}$ and show that $\sum(y - \hat{y}) = 0$.

𝑥 1 2 3 4 5
𝑦 2 5 3 8 7
H.W
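The homework itself is left to the reader, but the stated property holds for any least-squares fit: the residuals always sum to zero. A quick sketch confirming this with the data above:

```python
# For any least-squares line, the residuals y - y_hat sum to zero.
x = [1, 2, 3, 4, 5]
y = [2, 5, 3, 8, 7]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

trend = [a + b * xi for xi in x]                 # trend values y-hat
print(sum(yi - ti for yi, ti in zip(y, trend)))  # ~0 (floating-point rounding)
```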
