Panel Data Analysis Using STATA 13


Panel Data Modeling and Estimation Using Statistics and Data (STATA) Software Package


Lagos State University Workshop in FMS
Dr. Arewa Ajibola

February 16, 2015


1. Introduction
Let us begin with the meaning of data. Data represent information
about a variable or an entity such as an individual, a firm or a
country. There are basically three types of data: time series,
cross-sectional and panel data. This workshop is restricted to panel
data analysis, but a working understanding of time series and
cross-sectional data is necessary at the outset.
2. Time Series Data
These are data on a variable or group of variables collected over
successive periods of time, or at different points in time.
Time series data are often macro in nature, for example interest
rates, exchange rates, employment rates and inflation. There are also
micro time series data, which relate to a particular household or
firm. Time series data show temporal variation, that is, variation
over intervals such as centuries, decades, years, half-years,
quarters, months, weeks, days, hours, minutes and seconds. Every time
series is associated with a time interval (its frequency). A typical
example:
Table 1: Time Series Data Set
year   GDP (N billion)   inflation (%)
2001   30                8
2002   23                9
2003   40                12
2004   67                13
2005   32                9
2006   35                8
2007   87                10
2008   98                11
2009   69                13
Source: Hypothetical
3. Cross-sectional Data

These are data collected at a point in time on different units such as
countries, companies, religions, regions or individuals, without
reference to variation over time. Each unit is observed once, and the
units are often widely spread across micro entities. Thus,
cross-sectional data are mostly micro data such as per capita income,
company profit, dividends and earnings per share. They can also be
macro data such as countries' unemployment, GDP and inflation. Unlike
time series data, all the variation in cross-sectional data is across
units. This is known as spatial variation. Now, what is panel data?

4. Panel Data
Panel data go by several names: pooled, longitudinal,
multi-dimensional and cross-sectional time-series data. They combine
time series and cross-sectional data. Panel data can therefore be
described as data on several distinct units observed over a period of
time. Different units and different time periods are the basic
elements of panel data. For example, data on the profits of UBA and
Zenith Bank over a period of 40 years are panel data. Panel data may
be micro or macro; each is discussed separately below.

5. Micro Panel Data and Short Panel Data

In panel data there is a time dimension, denoted by T, and an
individual unit dimension, denoted by N. If N is large (N→∞) while T
is relatively small, the data are referred to as micro panel data or
short panel data. The per capita income of 1,000 individuals over 40
years is a good example.

6. Macro Panel Data and Long Panel Data

If the time dimension T is large (T→∞) while N is relatively small,
the data are termed macro panel data. They are also referred to as
long panel data.
7. Balanced vs Unbalanced Panel Data

Panel data that have values for all observations are termed balanced.
In essence, every unit or cross-section covers the same time span, so
the number of time periods (t) is the same for each unit. A typical
example is shown below:

Table 2: Balanced Panel Data Set
company   year   GDP (N billion)   inflation (%)
1         2001   30                8
1         2002   23                9
1         2003   40                12
2         2001   67                13
2         2002   32                9
2         2003   35                8
3         2001   87                10
3         2002   98                11
3         2003   69                13
Source: Hypothetical
Note that the basic feature of these data is that the three companies
(1, 2 and 3) all have data for the same three years (2001-2003). That
is, the three cross-sectional units have the same time dimension (t).

Conversely, unbalanced panel data have missing values for some
observations. The time dimension (t) is specific to each
cross-sectional unit; in other words, t is not the same for every
unit. Hence in an unbalanced panel the units have different time
dimensions, while in a balanced panel the units differ but share the
same time dimension. An example of unbalanced panel data is given
below:

Table 3: Unbalanced Panel Data Set
company   year   GDP (N billion)   inflation (%)
1         2001   30                8
1         2002   23                9
2         2001   67                13
2         2002   32                9
2         2003   35                8
3         2001   87                10
Source: Hypothetical

Table 3 makes the distinction between balanced and unbalanced panel
data concrete: company 1 has data for two years, company 2 for three
years and company 3 for only one year.
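
The balanced/unbalanced distinction can also be checked mechanically. The following is a minimal Python sketch, not part of the STATA workflow, that applies the definition to the (company, year) pairs of Tables 2 and 3. The function names are illustrative assumptions.

```python
from collections import defaultdict

# (company, year) pairs copied from Table 2 and Table 3
table2 = [(1, 2001), (1, 2002), (1, 2003),
          (2, 2001), (2, 2002), (2, 2003),
          (3, 2001), (3, 2002), (3, 2003)]
table3 = [(1, 2001), (1, 2002),
          (2, 2001), (2, 2002), (2, 2003),
          (3, 2001)]

def periods_per_unit(rows):
    """Map each unit to the sorted list of periods in which it is observed."""
    periods = defaultdict(set)
    for unit, year in rows:
        periods[unit].add(year)
    return {unit: sorted(years) for unit, years in periods.items()}

def is_balanced(rows):
    """A panel is balanced when every unit is observed in the same periods."""
    spans = list(periods_per_unit(rows).values())
    return all(span == spans[0] for span in spans)

print(is_balanced(table2))  # True
print(is_balanced(table3))  # False
# Number of periods observed for each company in Table 3
print({u: len(y) for u, y in periods_per_unit(table3).items()})  # {1: 2, 2: 3, 3: 1}
```

In STATA itself, the output of xtset reports whether the declared panel is balanced.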

8. Advantages of Panel Data over Time Series Data

There are three main advantages of panel data:

(1) A panel data set gives researchers access to a large number of
observations or data points (N×T for a balanced panel of N units
observed over T periods). This, of course, increases the degrees of
freedom (df) considerably and at the same time reduces the likelihood
of multicollinearity among the independent variables. The gain should
not be overstated, however, because more data points do not by
themselves guarantee more efficient regression estimates.
(2) The use of panel data allows researchers to answer research
questions that are beyond the scope of time series and cross-sectional
data sets.
(3) Panel data allow researchers to control for omitted or unobserved
variables that could be correlated with either the included variables
or the error term.

That concludes the nature of the data we shall be dealing with in this
workshop. Let us proceed to modeling.

9. Classical Regression Model and its Inherent Assumptions

Panel data models take their lead from the classical linear regression
model (CLRM), so the modeling part of this discussion starts there.
The model states that the response variable is a function of the
regressor variables plus a residual term that cannot be observed.
Thus:

$$ y_t = f(x_{1t}, x_{2t}, \dots, x_{kt}) + \varepsilon_t, \qquad t = 1, 2, \dots, n \tag{1} $$

where $y_t$ is the response (dependent) variable, the $x_{jt}$ are the
regressors or explanatory variables, $\varepsilon_t$ is the error term
and $n$ is the number of observations. The general functional form in
equation 1 can be restricted to a linear regression equation:

$$ y_t = a_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + \mu_t \tag{2} $$

Equation 2 can be stated in compact form as:

$$ y_t = a_0 + x_{t\cdot}\,\beta + \mu_t \tag{3} $$

where $x_{t\cdot} = (x_{1t}, x_{2t}, \dots, x_{kt})$ and
$\beta = (\beta_1, \beta_2, \dots, \beta_k)'$ are the coefficients of
the linear combination of the explanatory variables in $x_{t\cdot}$.
10. Assumptions of CLRM

There are five assumptions underlying the regression model stated
above in equation 2.
(1) $E(\mu_t) = 0$; that is, the mean value of the error term is zero.
(2) $\mathrm{Var}(\mu_t) = \sigma^2 < \infty$; that is, the error term
has a constant and finite variance over all values of $x_t$.

(3) $\mathrm{Cov}(\mu_t, \mu_j) = 0$ for $t \neq j$; that is, the error
term $\mu_t$ in equation 2 must not be correlated with any other error
term $\mu_j$.

(4) $\mathrm{Cov}(\mu_t, x_{t\cdot}) = 0$; that is, the error term must
not be statistically related to the explanatory variables.

(5) $\mu_t \sim N(0, \sigma^2)$; that is, the error term is
independently and identically distributed (iid) as normal with zero
mean and constant variance.
Note that if one or a combination of these assumptions is violated,
the model may suffer from one or all of the following problems.

(1) $E(\hat{\beta}) \neq \beta$, meaning that the coefficient estimates
$\hat{\beta}$ are biased and as such do not estimate the true β's.
(2) The standard errors are biased, so hypothesis tests based on the
t-test are invalid.
(3) The distribution assumed for the test statistics is inappropriate.

These are the common problems encountered in linear regression models.
Thus, there is a need to subject the model to residual diagnostic
tests.
10.1 Testing the zero mean assumption ($E(\mu_t) = 0$). So long as a
constant term is included in the equation, this assumption cannot be
violated, so it is not really a problem.
10.2 Testing the homoscedasticity assumption
($\mathrm{Var}(\mu_t) = \sigma^2 < \infty$). Suppose there are only two
explanatory variables in equation 2, $x_1$ and $x_2$. Obtain the
residuals $\mu_t$, square them ($\mu_t^2$), and form an auxiliary
regression as follows (this is White's general test):

$$ \mu_t^2 = a_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{1t}^2 + \beta_4 x_{2t}^2 + \beta_5 x_{1t} x_{2t} + v_t \tag{4} $$

Obtain the $R^2$ of this auxiliary regression.
The hypotheses here are:
H0: The error term is homoskedastic (same variance)
Ha: The error term is heteroskedastic (different variance)
Under H0 the test statistic is $N R^2 \sim \chi^2(m)$, where N is the
number of observations and m is the number of explanatory variables in
the auxiliary regression, excluding the constant (m = 5 in equation 4).
If $N R^2$ exceeds the $\chi^2(m)$ critical value, reject the null
hypothesis.
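
To make the arithmetic of this test concrete, here is a minimal Python sketch of the auxiliary regression in equation 4. It is an illustration only and not STATA output: the data are hypothetical, and the residuals are constructed so that their squares equal 1 + x1² exactly, which forces R² = 1 and a statistic equal to N. The small least squares helper is an assumption of the sketch, written out to keep it self-contained.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(y, columns):
    """R-squared from an OLS regression of y on a constant plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[t] for col in columns] for t in range(n)]
    k = len(X[0])
    xtx = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)] for i in range(k)]
    xty = [sum(X[t][i] * y[t] for t in range(n)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, X[t])) for t in range(n)]
    mean_y = sum(y) / n
    rss = sum((y[t] - fitted[t]) ** 2 for t in range(n))
    tss = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - rss / tss

def white_statistic(resid, x1, x2):
    """N * R^2 from the auxiliary regression in equation 4."""
    u2 = [u * u for u in resid]
    aux = [x1, x2,
           [v * v for v in x1],
           [v * v for v in x2],
           [a * b for a, b in zip(x1, x2)]]
    return len(resid) * r_squared(u2, aux)

# Hypothetical regressors for N = 12 observations
x1 = [0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 2.0, 3.0, 3.0, 0.0, 2.0, 3.0]
x2 = [0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 0.0, 3.0]
# Residuals whose squares equal 1 + x1^2: strongly heteroskedastic by construction
resid = [(-1) ** t * math.sqrt(1.0 + v * v) for t, v in enumerate(x1)]

stat = white_statistic(resid, x1, x2)
critical_5pct = 11.07  # chi-square critical value, 5 degrees of freedom, 5% level
print(round(stat, 4), stat > critical_5pct)
```

Because the statistic can never exceed N, a sample of fewer than 12 observations could not reject homoskedasticity at the 5% level with five auxiliary regressors, whatever the data.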
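10.3 Testing the no-autocorrelation assumption
($\mathrm{Cov}(\mu_t, \mu_j) = 0$). Obtain the residuals $\mu_t$ and
form an auxiliary regression model as follows (this is the
Breusch-Godfrey test):
$$ \mu_t = c_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 \mu_{t-1} + \beta_4 \mu_{t-2} + v_t \tag{5} $$
Obtain the $R^2$ of this auxiliary regression.
The hypotheses here are:
H0: There is no autocorrelation in the residuals
Ha: There is autocorrelation in the residuals
The test statistic is $(N - r) R^2 \sim \chi^2(r)$, where r is the
number of lagged residuals in the auxiliary regression (r = 2 in
equation 5).
If $(N - r) R^2$ exceeds the $\chi^2(r)$ critical value, reject the
null hypothesis.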
10.4 The non-stochastic assumption
($\mathrm{Cov}(\mu_t, x_{t\cdot}) = 0$). This assumption is fulfilled
as long as x and µ are not dependent.

10.5 The normality assumption ($\mu_t \sim N(0, \sigma^2)$). This
assumption can be tested by calculating the Bera-Jarque statistic (W).
To do this, first calculate the skewness and kurtosis of the residuals
as follows:
$$ sk = \frac{E(u^3)}{(\sigma^2)^{3/2}} \tag{6} $$
$$ kt = \frac{E(u^4)}{(\sigma^2)^2} \tag{7} $$
$$ W = N\left[\frac{sk^2}{6} + \frac{(kt - 3)^2}{24}\right] \sim \chi^2(2) \tag{8} $$
The hypotheses are:
H0: The residuals are normally distributed
Ha: The residuals are not normally distributed
Reject H0 if W exceeds the $\chi^2(2)$ critical value.
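
The Bera-Jarque calculation is easy to verify by hand. A minimal Python sketch follows, assuming the usual form W = N[sk²/6 + (kt − 3)²/24] and using sample moments about the mean in place of the expectations. The five residuals are hypothetical.

```python
def jarque_bera(resid):
    """Return (skewness, kurtosis, W) from sample moments of the residuals."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [u - mean for u in resid]
    m2 = sum(d ** 2 for d in dev) / n  # sample variance (sigma^2)
    m3 = sum(d ** 3 for d in dev) / n  # third central moment
    m4 = sum(d ** 4 for d in dev) / n  # fourth central moment
    sk = m3 / m2 ** 1.5
    kt = m4 / m2 ** 2
    w = n * (sk ** 2 / 6 + (kt - 3) ** 2 / 24)
    return sk, kt, w

resid = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical, symmetric residuals
sk, kt, w = jarque_bera(resid)
critical_5pct = 5.99  # chi-square critical value, 2 degrees of freedom, 5% level
print(sk, round(kt, 4), round(w, 4), w > critical_5pct)
```

For these residuals the skewness is zero and the kurtosis is 1.7, so W is about 0.35 and normality is not rejected at the 5% level. With only five observations the test has very little power, so the numbers serve only to illustrate the formula.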
We can now estimate equation 2 and run the corresponding diagnostic
tests using STATA. The following steps are involved:

(1) Arrange your data in a worksheet laid out like Table 1.

(2) Copy the data, including the year column and the variable names.

(3) Right click on the STATA programme icon and open it.

(4) Click on Data in the top menu, move down to Data Editor and
navigate right to Data Editor (Edit).

(5) In the editor, right click and choose Paste; a dialog box will
appear. In the dialog box click on "treat first row as variable
names". The data will be pasted immediately.
(6) Declare the data as time series. There are two ways to do this,
manual (typed command) and automatic (menus); here we follow the
manual method. Type tsset year in the Command window and press Enter.
(7) Estimate the model by typing regress followed by the dependent
variable and the independent variable(s). Then press Enter.
(8) For the diagnostic tests and descriptive statistics we shall use
the automatic (menu) method.
(9) Click on Statistics in the top menu, move to Linear models and
related, navigate right to Regression diagnostics, then right again to
Specification tests, etc., and click.
(10) Select the diagnostic tests you want to perform and click on
either OK or Submit.
FOR THE DESCRIPTIVE STATISTICS
(11) Under Statistics, move down to Summaries, tables, and tests,
navigate to Summary and descriptive statistics, then to Summary
statistics.
(12) Click, then enter the variables and click on OK.
Having completed that routine task, we can now proceed to panel
modeling and estimation. Equation 2 has a panel data counterpart,
generally referred to as the pooled regression model. The classical
form of this model can be specified as:

$$ y_{it} = z_i'\alpha + \beta_1 x_{1it} + \beta_2 x_{2it} + \dots + \beta_k x_{kit} + \mu_{it} \tag{9} $$

Equations 2 and 9 are essentially similar; the difference lies in
their subscripts. The panel data model carries the subscript "it":
the "i" stands for the individual unit dimension while the "t" stands
for the time dimension. To simplify the discussion, equation 9 can be
reduced to a compact form:
$$ y_{it} = z_i'\alpha + x_{it}'\beta + \mu_{it}, \qquad i = 1, 2, \dots, N,\; t = 1, 2, \dots, T \tag{10} $$
where:
$y_{it}$ is the dependent variable;
$z_i'\alpha$ represents heterogeneity or individual effects, in which
$z_i$ contains a constant term and observed and unobserved
unit-specific terms;
$x_{it}$ contains the k regressors, that is, the set of explanatory
variables;
$\mu_{it}$ is the unobserved disturbance term;
$\beta = (\beta_1, \beta_2, \dots, \beta_k)'$ are the corresponding
parameters of the k regressors.
In analyzing a panel data set, the heterogeneity ($z_i'\alpha$) across
units is the central focus because it is an integral part of the
specification. However, if $z_i$ contains only a constant term, the
pooled regression estimated by ordinary least squares provides
consistent and efficient estimates of the common alpha (α) and the
slope coefficients (β). In that case the pooled regression model can
be estimated in STATA by typing regress followed by the dependent and
independent variables. But before accomplishing this task, I would
like to stress the need for a panel data unit root test.

11. Panel Data Unit Root Test
A series has a unit root when its autoregressive parameter is equal to
1. A unit root indicates that the series follows a random walk and is
therefore not stationary. Regression results based on such series may
be spurious or nonsensical. To avoid this, in panel as in time series
analysis, it is important to subject each data series to a
stationarity or unit root test. Unit root testing in panel data is a
relatively recent development; see for example Levin, Lin and Chu
(2002), Im, Pesaran and Shin (2003), Harris and Tzavalis (1999), Choi
(2001) and Hadri (2000). The Levin, Lin and Chu (LLC) test
specification can be expressed as:

$$ y_{it} = \rho_i\, y_{i,t-1} + z_{it}'\alpha + w_{it} \tag{11} $$

where $z_{it}$ is the deterministic component and $w_{it}$ is the
stochastic error process. $z_{it}$ can be empty (no deterministic
term), a constant (one), unit fixed effects ($\mu_i$), or fixed
effects together with a time trend. LLC assume a common autoregressive
parameter ($\rho_i = \rho$ for all i) and an error term that is iid
compliant. The hypotheses are:
H0: ρ = 1; the series contains a unit root.
Ha: ρ < 1; the series does not contain a unit root, that is, it is
stationary.
We can now perform this test using STATA. The following steps are
involved:
(1) Arrange the data in a worksheet (following Table 2).
(2) Declare the data as a panel by typing xtset country year (use the
name of your own unit variable, for example company in Table 2; STATA
commands are case-sensitive and must be typed in lower case).
(3) Type xtunitroot llc followed by the variable name and the lags()
option, for example xtunitroot llc varname, lags(aic 10), where
varname stands for your own variable; then press Enter. An if
condition on country/company may be placed before the comma to
restrict the test to selected units. You can change the lag length.

Note that the majority of the unit root tests assume that you have a
balanced panel data set, but the Im-Pesaran-Shin and Fisher-type tests
(i.e. Choi 2001) allow for unbalanced panels. The syntax for the
Fisher test is: xtunitroot fisher varname, dfuller trend demean
lags(1). The syntax for the Harris-Tzavalis test is: xtunitroot ht
varname. The next aspect of this discussion is the descriptive
statistics. The panel commands for these are xtsum, xtline and
histogram.
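12. Estimation of the Pooled Regression

This is very simple to accomplish in STATA. Pooled ordinary least
squares is obtained by typing regress followed by the dependent and
independent variables. Note that xtreg typed without an option fits
the random effects model by default, not the pooled regression.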

13. Fixed Effect vs Random Effect Model

The assumptions of the CLRM also hold in panel data analysis. In the
panel data context, the pooled equation 10 is alternatively known as
the population average model, which rests on the assumption that the
unobservable heterogeneity has been averaged out. If all of the
assumptions stated earlier are met, no further analysis is needed
beyond equation 10. Unfortunately, these assumptions are rarely all
met, especially with respect to heterogeneity across units. Hence, the
focus turns to the fixed effects and random effects models.
We have already discussed the situation where $z_i$ in equation 10
contains only a constant. But if $z_i$ contains an unobserved term
that is correlated with the included variables ($x_{it}$), then the
least squares estimate of β is not only biased but also inconsistent
because of the omitted variable. The fixed effects model arises from
this assumption about the omitted effects, denoted by $c_i$. In
general form, the fixed effects model can be specified as:
$$ y_{it} = c_i + x_{it}'\beta + v_{it} \tag{12} $$
$$ c_i = z_i'\alpha \tag{13} $$
Equation 13 embodies all the observable effects and specifies an
estimable conditional mean. The fixed effects specification takes
$c_i$ to be a group-specific constant term and allows $c_i$ to be
correlated with the included variables ($x_{it}$). That is:
$$ E(c_i \mid x_i) = h(x_i) \tag{14} $$
Since the conditional mean $h(x_i)$ is the same in every period,
equation 12 can be rewritten by adding and subtracting $h(x_i)$:
$$ y_{it} = c_i - h(x_i) + h(x_i) + x_{it}'\beta + v_{it} \tag{15} $$
By rearranging, we have:
$$ y_{it} = x_{it}'\beta + h(x_i) + v_{it} + [c_i - h(x_i)] \tag{16} $$
The bracketed term on the right-hand side is uncorrelated with $x_i$,
because $h(x_i)$ is the conditional mean of $c_i$; therefore it can be
absorbed into the disturbance term as:
$$ w_{it} = v_{it} + [c_i - h(x_i)] \tag{17} $$
Therefore, equation 16 becomes
$$ y_{it} = x_{it}'\beta + h(x_i) + w_{it} \tag{18} $$

To state equation 18 in the usual form, let $\alpha_i$ represent
$h(x_i)$; then we have
$$ y_{it} = \alpha_i + x_{it}'\beta + w_{it} \tag{19} $$
Note that each $\alpha_i$ is treated as an unknown parameter to be
estimated.

How do you estimate the fixed effects model in STATA? The following
syntax is applicable: type xtreg followed by the dependent variable
and the independent variable(s), then a comma and the option fe (that
is, xtreg depvar indepvars, fe), and press Enter.
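
To show what a fixed effects estimate does with the data, here is a minimal Python sketch of the within (demeaning) transformation, which is the estimator behind the fe option. It is an illustration under simplifying assumptions: a single regressor, a balanced hypothetical panel, and data generated without noise so that the true slope (2) and unit effects (10, 20, 30) are recovered exactly.

```python
def unit_means(units, values):
    """Average of the values within each unit."""
    totals, counts = {}, {}
    for u, v in zip(units, values):
        totals[u] = totals.get(u, 0.0) + v
        counts[u] = counts.get(u, 0) + 1
    return {u: totals[u] / counts[u] for u in totals}

def demean(units, values):
    """Subtract each unit's own time average from its observations."""
    means = unit_means(units, values)
    return [v - means[u] for u, v in zip(units, values)]

def within_estimator(units, y, x):
    """Fixed effects estimate of beta (one regressor) and of each alpha_i."""
    y_dm = demean(units, y)
    x_dm = demean(units, x)
    beta = sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)
    # alpha_i is the unit average of y minus beta times the unit average of x
    alpha = unit_means(units, [yi - beta * xi for yi, xi in zip(y, x)])
    return beta, alpha

# Hypothetical balanced panel: 3 units observed over 3 periods
units = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0]
true_effects = {1: 10.0, 2: 20.0, 3: 30.0}
y = [true_effects[u] + 2.0 * xi for u, xi in zip(units, x)]

beta, alpha = within_estimator(units, y, x)
print(beta, alpha)

# A variable that never changes within a unit is wiped out by demeaning
z = [5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 9.0, 9.0, 9.0]
print(demean(units, z))
```

The last line prints only zeros: once the unit averages are removed, a variable that is constant within each unit has no variation left, so its coefficient cannot be estimated by fixed effects.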
It is important to know that the explanatory variables in equation 19
fall into two classes: time-invariant and time-varying variables. The
time-invariant variables mimic the individual-specific constant term
$c_i$. Their coefficients cannot be estimated using the fixed effects
model, because the model absorbs them into $\alpha_i$ in equation 19.
This is the main limitation of the fixed effects model. The random
effects model, by contrast, can provide separate estimates of the
coefficients of the time-invariant variables, which is why it is of
great interest in econometrics. The assumption underlying the random
effects model is that the unobserved individual heterogeneity is
uncorrelated with the explanatory variables ($x_{it}$); being
uncorrelated with them, it can be included in the disturbance term.
Thus, the random effects model can be specified as:
$$ y_{it} = z_i'\alpha - E(z_i'\alpha) + E(z_i'\alpha) + x_{it}'\beta + \varepsilon_{it} \tag{20} $$
Rearranging the terms in equation 20:
$$ y_{it} = E(z_i'\alpha) + x_{it}'\beta + [z_i'\alpha - E(z_i'\alpha)] + \varepsilon_{it} \tag{21} $$
Again let:
$$ \alpha = E(z_i'\alpha) \tag{22} $$
$$ \mu_i = z_i'\alpha - E(z_i'\alpha) \tag{23} $$
Because the expectation in equation 22 is the same for every unit, α is
a single constant term. Equation 21 then reduces to:
$$ y_{it} = \alpha + x_{it}'\beta + \mu_i + \varepsilon_{it} \tag{24} $$
where $\mu_i$ is the group-specific error term; like
$\varepsilon_{it}$ it is random, but it is constant over time for each
unit.
We can now state the strict exogeneity assumptions, where X denotes the
full set of regressors:
$$ E[\varepsilon_{it} \mid X] = 0 $$
$$ E[\mu_i \mid X] = 0 $$
$$ E[\varepsilon_{it}^2 \mid X] = \sigma_\varepsilon^2 $$
$$ E[\mu_i^2 \mid X] = \sigma_\mu^2 $$
$$ E[\varepsilon_{it}\,\mu_j \mid X] = 0 \quad \text{for all } i, t \text{ and } j $$
$$ E[\varepsilon_{it}\,\varepsilon_{js} \mid X] = 0 \quad \text{if } i \neq j \text{ or } t \neq s $$
$$ E[\mu_i\,\mu_j \mid X] = 0 \quad \text{if } i \neq j $$
If these assumptions hold, we can develop the random effects model.
We can estimate the random effects model in STATA using the following
syntax: type xtreg followed by the dependent variable and the
independent variable(s), then a comma and the option re (that is,
xtreg depvar indepvars, re), and press Enter.
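
One way to see how random effects sits between pooled OLS and fixed effects is the quasi-demeaning weight θ. Under the assumptions above and a balanced panel, the random effects (GLS) estimator is equivalent to OLS on data transformed as y_it − θ·ȳ_i, with θ = 1 − sqrt(σ_ε² / (σ_ε² + T·σ_µ²)). This formula is a standard textbook result brought in for illustration; it is not derived in these notes. The minimal Python sketch below evaluates it for hypothetical variance values.

```python
import math

def re_theta(sigma2_eps, sigma2_mu, T):
    """Quasi-demeaning weight for a balanced panel with T periods per unit."""
    return 1.0 - math.sqrt(sigma2_eps / (sigma2_eps + T * sigma2_mu))

# No unit-specific variance: theta = 0, random effects reduces to pooled OLS
print(re_theta(1.0, 0.0, 3))    # 0.0
# Equal variances with T = 3
print(re_theta(1.0, 1.0, 3))    # 0.5
# Dominant unit-specific variance: theta approaches 1 (the within transformation)
print(round(re_theta(1.0, 100.0, 3), 4))
```

When θ = 0 nothing is subtracted and the estimator is pooled OLS; as θ approaches 1 the full unit mean is subtracted, which is the fixed effects (within) transformation.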
14. Selection of Adequate Model for a Panel Data Study
After estimating the pooled regression model, there is a need to
subject it to a diagnostic test. The test adopted here is the Breusch
and Pagan Lagrange Multiplier (LM) test, in which the pooled
regression model is tested against the random effects model. The null
hypothesis is that the variance of the unit-specific effect across
entities is zero, that is, there is no significant difference across
units (no panel effect). If the null is not rejected, the pooled
regression model is adequate and there is no need to proceed to the
random or fixed effects model. If, however, the null is rejected, the
focus shifts to testing the random effects model against the fixed
effects model using the Hausman test. In STATA, the LM test is run by
typing xttest0 after estimating the random effects model.
14.1 Breusch and Pagan Lagrange Multiplier (LM) test
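
As a preview of the computation, here is a minimal Python sketch of the LM statistic. It assumes the standard textbook form for a balanced panel, LM = [NT / (2(T − 1))] · [Σ_i (Σ_t e_it)² / Σ_i Σ_t e_it² − 1]², where e_it are the pooled OLS residuals and LM follows a chi-square distribution with one degree of freedom under the null. The formula is supplied here for illustration, and the residuals are hypothetical.

```python
def breusch_pagan_lm(resid_by_unit):
    """LM statistic from pooled OLS residuals, one list per unit (balanced panel)."""
    n = len(resid_by_unit)
    T = len(resid_by_unit[0])
    if any(len(r) != T for r in resid_by_unit):
        raise ValueError("this sketch assumes a balanced panel")
    # Sum over units of the squared within-unit residual totals
    sum_sq_unit_totals = sum(sum(r) ** 2 for r in resid_by_unit)
    # Sum of all squared residuals
    sum_sq = sum(e ** 2 for r in resid_by_unit for e in r)
    return n * T / (2.0 * (T - 1)) * (sum_sq_unit_totals / sum_sq - 1.0) ** 2

# Hypothetical pooled OLS residuals for 3 units over 3 periods; units 1 and 2
# have residuals that stay on one side of zero, a sign of unit effects
resid_by_unit = [[1.0, 0.8, 1.2],
                 [-1.1, -0.9, -1.0],
                 [0.1, -0.2, 0.1]]
lm = breusch_pagan_lm(resid_by_unit)
critical_5pct = 3.84  # chi-square critical value, 1 degree of freedom, 5% level
print(round(lm, 3), lm > critical_5pct)
```

Here the statistic exceeds the 5% critical value, so the null of no panel effect would be rejected and the analysis would move on to the random and fixed effects models.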
TO BE CONTINUED
References
Choi, I. (2001). Unit root tests for panel data. Journal of
International Money and Finance 20: 249–272.

Hadri, K. (2000). Testing for stationarity in heterogeneous panel
data. Econometrics Journal 3: 148–161.

Harris, R. D. F., & Tzavalis, E. (1999). Inference for unit roots in
dynamic panels where the time dimension is fixed. Journal of
Econometrics 91: 201–226.

Im, K. S., Pesaran, M. H., & Shin, Y. (2003). Testing for unit roots
in heterogeneous panels. Journal of Econometrics 115: 53–74.

Levin, A., Lin, C. F., & Chu, C. S. J. (2002). Unit root tests in
panel data: Asymptotic and finite-sample properties. Journal of
Econometrics 108: 1–24.
