Linear Regression Analysis: Module - IV
MODULE – IV
Lecture - 15
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur
The fitting of the linear regression model, the estimation of parameters, the testing of hypotheses and the properties of the estimators are based on the following major assumptions:
1. The relationship between the study variable and the explanatory variables is linear, at least approximately.
2. The error term has zero mean.
3. The error term has constant variance.
4. The errors are uncorrelated.
5. The errors are normally distributed.
One important point to keep in mind is that these assumptions pertain to the population, whereas we work only with a sample.
So the main issue is to draw a conclusion about the population on the basis of a sample of data.
Several diagnostic methods for checking violations of the regression assumptions are based on the study of the model residuals with the help of various types of graphics.
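As an illustration of such a graphic, the following is a minimal Python sketch, assuming NumPy and Matplotlib and using simulated data (so all values here are hypothetical), of a residual-versus-fitted-values plot, one of the most common diagnostic displays:

import numpy as np
import matplotlib.pyplot as plt

# Simulated data (hypothetical): a linear relationship plus noise
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)

# Fit a simple linear regression by least squares
b1, b0 = np.polyfit(x, y, deg=1)   # slope, intercept
fitted = b0 + b1 * x
residuals = y - fitted

# A patternless horizontal band around zero supports the assumptions;
# curvature or a funnel shape suggests nonlinearity or nonconstant variance
plt.scatter(fitted, residuals)
plt.axhline(0.0)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()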
Checking the linear relationship between the study and explanatory variables
1. Case of one explanatory variable
If the scatter diagram of y against X shows a linear trend, it indicates that the relationship between y and X is linear. If the trend is not linear, it indicates that the relationship between y and X is nonlinear. For example, the following figures illustrate a linear trend and a nonlinear trend.
[Figure: scatter diagrams illustrating a linear trend and a nonlinear trend]
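A minimal sketch of such scatter diagrams, assuming Matplotlib and using simulated (hypothetical) data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
x = rng.uniform(0, 10, size=60)
y_lin = 1.0 + 0.8 * x + rng.normal(0, 1, size=60)        # linear trend
y_nonlin = 1.0 + 0.3 * x**2 + rng.normal(0, 1, size=60)  # nonlinear trend

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(x, y_lin)
ax1.set_title("linear trend")
ax2.scatter(x, y_nonlin)
ax2.set_title("nonlinear trend")
plt.show()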
2. Case of more than one explanatory variable
To check the assumption of linearity between study variable and explanatory variables, the scatter plot matrix of the data
can be used.
A scatter plot matrix is a two-dimensional array of two-dimensional plots in which each frame contains a scatter diagram, except for the diagonal.
Thus, each plot sheds some light on the relationship between a pair of variables.
It gives more information than the correlation coefficient between each pair of variables because it gives a sense of
linearity or nonlinearity of the relationship and some awareness of how the individual data points are arranged over the
region.
Suppose there are only two explanatory variables and the model is y = X1β1 + X2β2 + ε; then the scatter plot matrix looks as follows:
[Figure: scatter plot matrix for y, X1 and X2 (axes scaled roughly from 20 to 100); the diagonal frames carry the variable labels and the off-diagonal frames show the pairwise scatter diagrams, annotated with pairwise correlation coefficients such as +0.3, −0.7 and +0.8]
Such an arrangement helps in examining each plot and the corresponding correlation coefficient together.
The pairwise correlation coefficient should always be interpreted in conjunction with the corresponding scatter plots
because
the correlation coefficient measures only the linear relationship and
the correlation coefficient is non-robust, i.e., its value can be substantially influenced by one or two observations
in the data.
The presence of linear patterns is reassuring, but the absence of such patterns does not imply that the linear model is incorrect.
Most statistical software provides an option for creating the scatter plot matrix. The view of all the plots together provides an indication of whether a multiple linear regression model may provide a reasonable fit to the data.
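For instance, a minimal Python sketch, assuming pandas and Matplotlib and using simulated data (so the coefficients below are hypothetical):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(seed=3)
X1 = rng.normal(0, 1, size=100)
X2 = rng.normal(0, 1, size=100)
y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(0, 1, size=100)

df = pd.DataFrame({"y": y, "X1": X1, "X2": X2})
print(df.corr())    # pairwise correlation coefficients
scatter_matrix(df)  # pairwise scatter diagrams, variable names on the diagonal
plt.show()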
It is to be kept in mind that the scatter plots of (y versus X1), (y versus X2), ..., (y versus Xk) give information only on pairs of variables, whereas the assumption of linearity is between y and (X1, X2, ..., Xk) jointly.
If some of the explanatory variables are themselves interrelated, then these scatter diagrams can be misleading. Some
other methods of sorting out the relationships between several explanatory variables and a study variable are used.
Residual analysis
The residual is defined as the difference between the observed and the fitted value of the study variable. The ith residual is defined as

ei = yi − ŷi, i = 1, 2, ..., n

where yi is an observation and ŷi is the corresponding fitted value.
A residual can be viewed as the deviation between the data and the fit.
So it is also a measure of the variability in the response variable that is not explained by the regression model.
The approximate average variance of the residuals is estimated by

Σᵢ (ei − ē)² / (n − k) = Σᵢ ei² / (n − k) = SSres / (n − k) = MSres,

where the sums run over i = 1, 2, ..., n and ē = 0 because the residuals sum to zero when the model contains an intercept.
Residuals are not independent as the n residuals have only n – k degrees of freedom.
The nonindependence of the residuals has little effect on their use for model adequacy checking as long as n is not
small relative to k.
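As a concrete illustration, a minimal NumPy sketch with simulated data (here k counts all regression coefficients, including the intercept) of computing the residuals and MSres:

import numpy as np

rng = np.random.default_rng(seed=4)
n, k = 50, 3                          # n observations, k parameters
X = np.column_stack([np.ones(n),      # intercept column
                     rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -1.5])     # hypothetical true coefficients
y = X @ beta + rng.normal(0, 1, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)  # OLSE b = (X'X)⁻¹X'y
e = y - X @ b                          # residuals ei = yi − ŷi
ms_res = (e @ e) / (n - k)             # MSres = SSres / (n − k)
print(ms_res)                          # close to the true σ² = 1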
1. Standardized residuals
The residuals are standardized based on the concept of a residual minus its mean, divided by its standard deviation.
Since E(ei) = 0 and MSres estimates the approximate average variance of the residuals, a logical scaling of the residual is

di = ei / √MSres , i = 1, 2, ..., n

which is called the standardized residual, and for which

E(di) = 0
Var(di) ≈ 1.
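Continuing the same kind of sketch (again with simulated, hypothetical data), the standardized residuals can be computed as:

import numpy as np

rng = np.random.default_rng(seed=5)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(0, 1, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
ms_res = (e @ e) / (n - k)

d = e / np.sqrt(ms_res)   # di = ei / √MSres
print(d.mean())           # essentially 0
print(d.var())            # approximately 1, namely (n − k)/n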
2. Studentized residuals
The standardized residuals use the approximate variance of ei, namely MSres. The studentized residuals use the exact variance of ei.
In the model y = Xβ + ε, the OLSE of β is b = (X'X)⁻¹X'y and the residual vector is

e = y − ŷ
  = y − Xb
  = y − Hy
  = (I − H)y,  where H = X(X'X)⁻¹X'
  = (I − H)(Xβ + ε)
  = Xβ − HXβ + (I − H)ε
  = Xβ − Xβ + (I − H)ε  (since HX = X)
  = (I − H)ε.

Thus e = (I − H)y = (I − H)ε, so the residuals are the same linear transformation of y and of ε.
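This identity is easy to verify numerically; a small sketch with simulated (hypothetical) data checking that e = (I − H)y = (I − H)ε:

import numpy as np

rng = np.random.default_rng(seed=6)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, -0.2])
eps = rng.normal(0, 1, size=n)
y = X @ beta + eps

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)⁻¹X'
e = y - H @ y                          # residual vector e = (I − H)y

I_H = np.eye(n) - H
print(np.allclose(I_H @ I_H, I_H))     # I − H is idempotent
print(np.allclose(e, I_H @ eps))       # e = (I − H)ε, since HX = X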
Using V(ε) = σ²I,

V(e) = V((I − H)ε)
     = (I − H) V(ε) (I − H)'
     = σ² (I − H)(I − H)'
     = σ² (I − H).
The matrix (I - H) is symmetric and idempotent but generally not diagonal. So residuals have different variances and they
are correlated.
If hii is the ith diagonal element of the hat matrix H and hij is the (i, j)th element of H, then

Var(ei) = σ²(1 − hii)
Cov(ei, ej) = −σ²hij.

Substituting σ̂² = MSres, the estimated variance of the ith residual is

MSres(1 − hii)

⇒ MSres overestimates Var(ei), since 0 ≤ hii ≤ 1.
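A short sketch of these quantities in NumPy (simulated data, so the exact values are hypothetical) that exhibits the overestimation:

import numpy as np

rng = np.random.default_rng(seed=7)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(0, 1, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                    # diagonal elements hii
e = y - H @ y
ms_res = (e @ e) / (n - k)

var_e = ms_res * (1 - h)          # estimated Var(ei) = MSres(1 − hii)
print((var_e < ms_res).all())     # True: MSres exceeds every estimated Var(ei)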
Next we discuss how hii is a measure of the location of the ith point in the x-space.