Unit - 1
Linear models are a fundamental class of statistical models that assume a linear
relationship between the dependent variable (outcome) and the independent variables
(predictors). They are widely used in various fields including statistics, econometrics,
machine learning, and social sciences due to their simplicity, interpretability, and
effectiveness in modelling a wide range of phenomena.
1. Linear Relationship: Linear models assume that the relationship between the dependent variable Y and the independent variables X1, X2, ⋯, Xp can be expressed as a linear combination:
Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + ε
2. Assumptions:
o Linearity: The relationship between Y and X is linear.
o Independence: The error term ε is independent of the predictors X.
o Constant Variance (Homoscedasticity): The variance of ε is constant across all values of X.
o Normality: The error term ε follows a normal distribution.
Violations of these assumptions can affect the validity of the model and may require
adjustments or alternative modelling approaches.
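To make the model and its assumptions concrete, the following minimal Python sketch fits Y = β0 + β1X1 + β2X2 + ε by ordinary least squares on simulated data and inspects the residuals that the assumptions refer to; the sample size, true coefficients, and variable names are illustrative assumptions, not part of the notes above.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X1, X2 = rng.normal(size=n), rng.normal(size=n)
eps = rng.normal(scale=1.0, size=n)            # i.i.d. normal errors (assumptions on the error term)
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + eps            # linear in the predictors (linearity assumption)

X = np.column_stack([np.ones(n), X1, X2])      # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ beta_hat

print("estimated coefficients:", beta_hat.round(3))
print("residual mean (approx. 0):", residuals.mean().round(4))
print("residual variance (approx. sigma^2 = 1):", residuals.var(ddof=X.shape[1]).round(3))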
Classification of linear models
Linear models can be classified based on several criteria, including the nature of the
dependent variable, the type of relationship between variables, and the assumptions made
about the error term. Here are some common classifications of linear models:
1. Based on the Number of Predictors and the Nature of the Response Variable:
Simple Linear Regression: The simplest form, with a single predictor variable.
Y = β0 + β1X + ε
Multiple Linear Regression: Includes two or more predictor variables.
Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + ε
Generalized Linear Model (GLM): An extension in which the response variable does not have to be normally distributed, as in Poisson regression or logistic regression.
2. Based on the Type of Relationship:
Linear Regression: When the relationship between the dependent and independent variables is assumed to be linear.
Non-linear Regression: When the relationship is not linear and may involve terms such as quadratic (X²), exponential (e^X), or other non-linear functions.
3. Based on the Assumptions about the Error Term:
Ordinary Least Squares (OLS): Assumes that the errors are normally distributed with constant variance (homoscedasticity) and are independent.
Weighted Least Squares (WLS): Used when the variance of the errors is not constant across observations.
Generalized Least Squares (GLS): Allows for more flexible modelling of the covariance structure of the errors, such as correlated or heteroscedastic errors.
4. Based on the Application and Structure:
Time Series Models: Includes autoregressive (AR), moving average (MA), and
autoregressive integrated moving average (ARIMA) models.
Panel Data Models: Account for both time series and cross-sectional dimensions,
with fixed effects and random effects models.
Hierarchical Linear Models (Multilevel Models): Handle nested data structures
where observations are clustered within higher-level units.
5. Based on Regularization:
Ridge Regression: Adds an L2 penalty term to the OLS objective function to shrink the coefficients and reduce multicollinearity (a numerical sketch follows this list).
Lasso Regression: Performs variable selection and regularization by adding an L1 penalty term, encouraging sparsity in the coefficient estimates.
Elastic Net: Combines L1 and L2 penalties to balance between ridge and lasso regression, providing both variable selection and regularization.
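As a small numerical illustration of ridge regression, the sketch below computes the closed-form ridge solution β_ridge = (X'X + λI)⁻¹X'y and compares it with OLS; the simulated data, the true coefficients, and the penalty value λ = 1 are illustrative assumptions (in practice λ is chosen by cross-validation, and the intercept is usually left unpenalized).

import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))                              # mean-zero predictors, so no intercept term
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])         # illustrative coefficients
y = X @ beta_true + rng.normal(size=n)

lam = 1.0                                                # illustrative penalty value
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS:  ", beta_ols.round(3))
print("ridge:", beta_ridge.round(3))                     # shrunk toward zero relative to OLS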
Estimable Functions
Consider the linear model
Y = Xβ + ε
with E(Y) = Xβ. A primary analytical goal is to estimate or test for the significance of certain linear combinations of the elements of β. This is accomplished by computing linear combinations of the observed Y. An unbiased linear estimate of a specific linear function of the individual βs, say Lβ, is a linear combination of the Y that has an expected value of Lβ. Hence, the following definition:
Any linear combination of the Y, for instance KY, will have expectation E(KY) = KXβ. Thus, the expected value of any linear combination of the Y is equal to that same linear combination of the rows of X multiplied by β. Therefore,
Lβ is estimable if and only if there is a linear combination of the rows of X that is equal to L, that is, if and only if there is a K such that L = KX.
Thus, the rows of X form a generating set from which any estimable L can be constructed. Since the row space of X is the same as the row space of X'X, the rows of X'X also form a generating set from which all estimable Ls can be constructed. Similarly, the rows of (X'X)⁻X'X, where (X'X)⁻ denotes a generalized inverse of X'X, also form a generating set for L.
Once an estimable L has been formed, Lβ can be estimated by computing Lb, where b = (X'X)⁻X'Y is a solution of the normal equations. From the general theory of linear models, the unbiased estimator Lb is, in fact, the best linear unbiased estimator of Lβ in the sense of having minimum variance, and it is also the maximum-likelihood estimator when the residuals are normal.
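A minimal Python sketch of these ideas, using a deliberately rank-deficient design so that some linear functions Lβ are estimable and others are not; the particular design matrix, response values, and the use of the Moore-Penrose pseudoinverse as the generalized inverse (X'X)⁻ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
# rank-deficient design: an intercept plus two group indicators (the intercept
# column equals the sum of the two indicator columns)
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
Y = np.array([5.0, 5.0, 7.0, 7.0]) + rng.normal(scale=0.1, size=4)

XtX = X.T @ X
XtX_ginv = np.linalg.pinv(XtX)                 # one choice of generalized inverse (X'X)^-
b = XtX_ginv @ X.T @ Y                         # b = (X'X)^- X'Y, a solution of the normal equations

def is_estimable(L, XtX, XtX_ginv, tol=1e-8):
    # L (a row vector) is estimable iff L (X'X)^- (X'X) = L, i.e. L lies in the row space of X
    return np.allclose(L @ XtX_ginv @ XtX, L, atol=tol)

print(is_estimable(np.array([[0.0, 1.0, -1.0]]), XtX, XtX_ginv))   # group difference: True
print(is_estimable(np.array([[0.0, 1.0, 0.0]]), XtX, XtX_ginv))    # a single group effect: False
print("Lb for the estimable contrast:", (np.array([[0.0, 1.0, -1.0]]) @ b).round(3))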
Hypothesis Testing
The partial F test, and its special cases the ANOVA F test and the Wald t test, use c = 0 in the general hypothesis H0: Lβ = c. Let the full model use Y, x1 ≡ 1, x2, ⋯, xp and let the reduced model use Y, x1 = x_{j1} ≡ 1, x_{j2}, ⋯, x_{jk}, where {j1, ⋯, jk} ⊂ {1, ⋯, p} and j1 = 1. Here 1 ≤ k < p, and if k = 1, then the reduced model is Yi = β1 + ei. Hence the full model is Yi = β1 + β2 x_{i,2} + ⋯ + βp x_{i,p} + ei, while the reduced model is Yi = β1 + β_{j2} x_{i,j2} + ⋯ + β_{jk} x_{i,jk} + ei. In matrix form, the full model is Y = Xβ + e and the reduced model is Y = X_R β_R + e_R, where the columns of X_R are a proper subset of the columns of X.
(i) The partial F test has H0: β_{j_{k+1}} = ⋯ = β_{j_p} = 0, or H0: the reduced model is good, or H0: Lβ = 0, where L is a (p − k) × p matrix whose ith row has a 1 in the j_{k+i}th position and zeroes elsewhere. In particular, if β1, ⋯, βk are the only βi in the reduced model, then L = [0  I_{p−k}] and 0 is a (p − k) × k matrix. Hence r = p − k = the number of predictors in the full model but not in the reduced model.
(ii) The ANOVA F test is the special case of the partial F test where the reduced model is Yi = β1 + ei. Hence H0: β2 = ⋯ = βp = 0, or H0: none of the nontrivial predictors x2, ⋯, xp are needed in the linear model, or H0: Lβ = 0, where L = [0  I_{p−1}] and 0 is a (p − 1) × 1 vector. Hence r = p − 1.
(iii) The Wald t test uses the reduced model that deletes the jth predictor from the full model. Hence H0: βj = 0, or H0: the jth predictor xj is not needed in the linear model given that the other predictors are in the model, or H0: Lj β = 0, where Lj = [0, ⋯, 0, 1, 0, ⋯, 0] is a 1 × p row vector with a 1 in the jth position, for j = 1, ⋯, p. Hence r = 1.
A way to get the test statistic F_R for the partial F test is to fit the full model and the reduced model. Let RSS be the RSS of the full model, and let RSS(R) be the RSS of the reduced model. Similarly, let MSE and MSE(R) be the MSE of the full and reduced models. Let dfR = n − k and dfF = n − p be the degrees of freedom for the reduced and full models. Then
F_R = [RSS(R) − RSS] / (r MSE),
where r = dfR − dfF = p − k = the number of predictors in the full model but not in the reduced model.
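The following Python sketch computes F_R = [RSS(R) − RSS]/(r MSE) by fitting a full model (intercept, x2, x3) and a reduced model (intercept, x2) to simulated data; the data-generating coefficients and the use of scipy for the F p-value are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 60
x2, x3 = rng.normal(size=n), rng.normal(size=n)
Y = 1.0 + 0.8 * x2 + 0.0 * x3 + rng.normal(size=n)     # x3 is truly unneeded here

def rss(X, Y):
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    return resid @ resid

X_full = np.column_stack([np.ones(n), x2, x3])         # p = 3 columns
X_red = np.column_stack([np.ones(n), x2])              # k = 2 columns

RSS_full, RSS_red = rss(X_full, Y), rss(X_red, Y)
p, k = X_full.shape[1], X_red.shape[1]
df_full = n - p
r = p - k                                              # predictors in the full model only
MSE = RSS_full / df_full

F_R = (RSS_red - RSS_full) / (r * MSE)
p_value = stats.f.sf(F_R, r, df_full)                  # P(F(r, n - p) > F_R)
print(f"F_R = {F_R:.3f}, p-value = {p_value:.3f}")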
The Gauss-Markov theorem establishes that the ordinary least squares estimator of β, β̂ = (X'X)⁻¹X'y, is BLUE (best linear unbiased estimator). By best we mean that β̂ has the smallest variance, in some meaningful sense, among the class of all unbiased estimators that are linear combinations of the data. One problem is that β̂ is a vector; hence, its variance is actually a matrix. Consequently, we seek to show that β̂ minimises the variance for any linear combination of the estimated coefficients, l'β̂. We note that
Var(l'β̂) = l' Var(β̂) l = l' [σ²(X'X)⁻¹] l = σ² l'(X'X)⁻¹ l,
which is a scalar. Let β̃ be another unbiased estimator of β that is a linear combination of the data. Our goal, then, is to show that Var(l'β̃) ≥ σ² l'(X'X)⁻¹ l for every l, with at least one l such that Var(l'β̃) > σ² l'(X'X)⁻¹ l.
We first note that any other estimator of β that is a linear combination of the data can be written as
β̃ = [(X'X)⁻¹X' + B] y + b0,
where B is a p × n matrix and b0 is a p × 1 vector of constants that appropriately adjust the OLS estimator to form the alternative estimator. We next note that if the model is correct, then
E(β̃) = E([(X'X)⁻¹X' + B] y + b0)
= [(X'X)⁻¹X' + B] E(y) + b0
= [(X'X)⁻¹X' + B] Xβ + b0
= (X'X)⁻¹X'Xβ + BXβ + b0
= β + BXβ + b0.
Consequently, β̃ is unbiased if and only if both b0 = 0 and BX = 0. The variance of β̃ is
Var(β̃) = Var([(X'X)⁻¹X' + B] y)
= [(X'X)⁻¹X' + B] Var(y) [(X'X)⁻¹X' + B]'
= [(X'X)⁻¹X' + B] σ²I [(X'X)⁻¹X' + B]'
= σ² [(X'X)⁻¹X' + B][(X'X)⁻¹X' + B]'
= σ² [(X'X)⁻¹ + BB'],
since the cross terms vanish when BX = 0 (and hence X'B' = 0). Therefore, for any l,
Var(l'β̃) = l' (σ² [(X'X)⁻¹ + BB']) l
= σ² l'(X'X)⁻¹ l + σ² l'BB'l
= Var(l'β̂) + σ² l'BB'l.
We first note that BB' is at least a positive semidefinite matrix. Hence, σ² l'BB'l ≥ 0. Next we note that we can define l* = B'l, an n × 1 vector. As a result,
l'BB'l = l*'l* = Σ_{i=1}^{n} (l*i)²,
which must be strictly greater than 0 for some l ≠ 0 unless B = 0. Thus, the OLS estimator of β is the best linear unbiased estimator.
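A simulation sketch of the argument above: it constructs an alternative linear unbiased estimator [(X'X)⁻¹X' + B]y with BX = 0 and compares the simulated variance of l'β̃ with that of l'β̂; the design matrix, the choices of B and l, σ = 1, and the number of replications are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
l = np.array([0.0, 1.0, 0.0])                          # examine the second coefficient

# build a nonzero B with BX = 0 by projecting random rows onto the orthogonal
# complement of the column space of X
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
B = rng.normal(size=(p, n)) @ M
assert np.allclose(B @ X, 0)

A_ols = np.linalg.inv(X.T @ X) @ X.T                   # beta_hat   = A_ols @ y
A_alt = A_ols + B                                      # beta_tilde = A_alt @ y (still unbiased)

reps = 20000
E = sigma * rng.normal(size=(reps, n))                 # simulated error vectors
Ys = X @ beta + E                                      # reps simulated response vectors
est_ols = Ys @ A_ols.T @ l                             # l' beta_hat   for each replication
est_alt = Ys @ A_alt.T @ l                             # l' beta_tilde for each replication

print("theoretical Var(l'beta_hat): ", sigma**2 * l @ np.linalg.inv(X.T @ X) @ l)
print("simulated Var(l'beta_hat):   ", est_ols.var())
print("simulated Var(l'beta_tilde): ", est_alt.var())  # larger by roughly sigma^2 * l'BB'l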
The Wald t test for an individual regression coefficient can be carried out with the following steps.
1. Write down the null hypothesis that there is no relationship between the dependent variable y and the independent variable xi:
H0: βi = 0
2. Write down the alternative hypothesis that there is a relationship between the dependent variable y and the independent variable xi:
H1: βi ≠ 0
3. Collect the sample information for the test and identify the significance level α.
4. The p-value is the sum of the areas in the tails of the t-distribution. The t-score and degrees of freedom are
t = (bi − βi) / s_{bi},   df = n − k − 1,
where bi is the estimated coefficient, s_{bi} is its standard error, βi is the value of the coefficient under the null hypothesis (here 0), n is the sample size, and k is the number of independent variables. (A computational sketch of this test follows these steps.)
5. Compare the p-value to the significance level and state the outcome of the test:
o If p-value ≤ α, reject H0 in favour of H1.
The results of the sample data are significant. There is sufficient evidence to conclude that the null hypothesis H0 is an incorrect belief and that the alternative hypothesis H1 is most likely correct.
o If p-value > α, do not reject H0.
The results of the sample data are not significant. There is not sufficient evidence to conclude that the alternative hypothesis H1 is correct.
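A minimal Python sketch of steps 1 to 5 for a single coefficient, computing the t-score with df = n − k − 1 and the two-tailed p-value via scipy; the simulated data, the coefficient being tested, and α = 0.05 are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 80, 2                                   # k = number of independent variables
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 1.2 * x1 + 0.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)          # OLS estimates b0, b1, b2
resid = y - X @ b
df = n - k - 1
s2 = resid @ resid / df                        # estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

i = 2                                          # test H0: beta_2 = 0 (the x2 coefficient)
t_score = (b[i] - 0.0) / se[i]
p_value = 2 * stats.t.sf(abs(t_score), df)     # two-tailed area

alpha = 0.05
print(f"b_{i} = {b[i]:.3f}, t = {t_score:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "do not reject H0")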