
AR(1) process with missing data

2003

Abstract

This paper reviews implementations of ML and EM procedures in the presence of missing data when the data are correlated. The motivation comes from a geostatistical analysis of spatial data in which some observations were missing. A version of the EM algorithm was suggested, but its theoretical properties were not established. This paper analyzes the standard and the new procedure for the simplest correlation structure, that of an AR(1) process, and shows that they differ by terms of order O_p(n^{−1}).

1 Motivation

In a recent paper, Smith, Kolenikov & Cox (forthcoming) analyze a spatiotemporal data set featuring repeated measurements of spatially correlated data. For each moment of time t = 1, . . . , T, there are up to K measurements available at certain fixed monitor locations. After certain transformations, the log likelihood of such a model (assuming normality) looks like

ln L(θ, β; y, X) ∼ −(1/2) Σ_{t=1}^{T} [ ln det Σ_t(θ) + (y_t − X_t β)^T Σ_t(θ)^{−1} (y_t − X_t β) ],    (1.1)

where X is the design matrix, the vector β represents the trends in time, space, and other covariates, and θ is the set of parameters of the geostatistical model for the spatial covariance. The subindex t denotes a possible dependence of the dimensionality of the vectors and matrices on time t, since some data are missing. The likelihood (1.1) can be maximized directly with a nonlinear optimization routine, but that would involve inverting T matrices of rather large size (in Smith et al. (forthcoming), K = 74) and computing their determinants. An appealing way to reduce those computational costs is the EM algorithm (Dempster, Laird & Rubin 1977, McLachlan & Krishnan 1997). This way, only one matrix inversion and one determinant computation are required, for the largest K × K covariance matrix Σ(θ) of all observation sites. (The question would still remain which of the ML and EM procedures is faster in computer time, as EM takes more iterations to converge.)

The maximization procedure, which also involves splitting the maximization step of the EM algorithm into maximization over the trend parameter β subspace and the covariance parameter θ subspace (a version of the EM known as expectation-conditional maximization), then proceeds as follows (a schematic code sketch is given after the list):

1. Initialize the trend parameters β by OLS over the available cases;

2. Initialize the covariance parameters θ by some reasonable guess;

3. (E-step, h-th iteration) Compute y^{(h)}, the conditional expectations of the missing data given the observed values and the current parameter estimates;

5. (E-step, h-th iteration) Compute the conditional expectation of the sufficient statistic (1.2);

6. (M-step, h-th iteration, part 1) Maximize the log likelihood (1.1) with respect to the covariance parameters θ;

7. (M-step, h-th iteration, part 2) Run GLS regression of y^{(h)} with the covariance matrix Σ(θ^{(h)}).

8. Declare convergence, or return to step 3.
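To make the flow of steps 1-8 concrete, here is a minimal Python sketch of the ECM loop. The callables e_step, m_step_theta and gls are hypothetical placeholders for steps 5-7 (they are not specified in the paper); only the control structure and a simple convergence check are illustrated.

```python
import numpy as np

def ecm_loop(y, X, beta0, theta0, e_step, m_step_theta, gls,
             max_iter=100, tol=1e-6):
    """Schematic ECM iteration for the model (1.1).

    e_step(y, X, beta, theta)  -> imputed sufficient statistic (step 5)
    m_step_theta(stat, X)      -> updated covariance parameters (step 6)
    gls(y, X, theta, stat)     -> updated trend parameters (step 7)
    All three callables are hypothetical placeholders.
    """
    beta, theta = np.asarray(beta0, float), np.asarray(theta0, float)
    for h in range(max_iter):
        stat = e_step(y, X, beta, theta)                          # E-step
        theta_new = np.asarray(m_step_theta(stat, X), float)      # M-step, part 1
        beta_new = np.asarray(gls(y, X, theta_new, stat), float)  # M-step, part 2
        converged = (np.max(np.abs(beta_new - beta)) < tol and
                     np.max(np.abs(theta_new - theta)) < tol)
        beta, theta = beta_new, theta_new
        if converged:
            break                                 # step 8: declare convergence
    return beta, theta
```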

The most subtle point in the above procedure is step 5, the computation of the sufficient statistic of the data given the observed values and the current parameter estimates. The suggestion in Smith et al. (forthcoming) was to use e_{it} e_{jt} in the (i, j)-th position of the t-th term of (1.2) whenever both residuals were available, and to use σ_{ij}(θ^{(h)}) otherwise. An alternative that seems more in line with the EM methodology would be to perform universal kriging at each time point at the locations of the missing data, and to use the predicted residuals instead. This, however, brings us back to inverting all the covariance matrices of different sizes. Thus the question is by how much the above approximation to the conditional expectation is biased, and what other problems it may lead to. This question is probably more pertinent for the covariance parameter estimates, since the trend estimates will be unbiased and consistent for any weighting matrix in the GLS estimator.
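A minimal sketch of this substitution rule, assuming the T × K matrix of residuals E (with NaN marking missing observations) and the current K × K covariance Σ(θ^{(h)}) have already been computed (the function name and data layout are mine, not from the paper):

```python
import numpy as np

def approx_sufficient_statistic(E, Sigma):
    """Approximate the sum over t of the t-th term in (1.2): use e_it * e_jt
    whenever both residuals are observed at time t, and the model covariance
    sigma_ij(theta) otherwise, as suggested in the text."""
    T, K = E.shape
    S = np.zeros((K, K))
    for t in range(T):
        observed = ~np.isnan(E[t])
        both = np.outer(observed, observed)   # True where e_it and e_jt exist
        e = np.nan_to_num(E[t])               # zeros at missing positions
        S += np.where(both, np.outer(e, e), Sigma)
    return S
```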

The remainder of the paper considers the simplest correlation structure, that of an AR(1) process, and analyzes the performance of the procedure suggested above in that case.

2 AR(1) process

Consider an AR(1) process

I shall assume |ρ| < 1, so the process is stationary. The mean of the process is

so the process can also be written in deviations form

y_t − µ = ρ(y_{t−1} − µ) + ε_t.    (2.3)

The variance of the process is

If a sample y_t, t = 1, . . . , T, is observed, then the log likelihood of the data is

where 1 = (1, . . . , 1)^T is the column vector of ones,

From (2.9), the sufficient statistic of the data is

This fact is crucial for the theory of the EM algorithm. Note that the variance estimator is

so the log likelihood concentrated with respect to σ² is then

Differentiating with respect to µ gives

Differentiating with respect to ρ gives

ML estimates of the parameters are

The subindex indicates that the estimates are obtained from the complete data set by maximum likelihood. These can be computed with an iterative procedure: first, the OLS estimates

are obtained, which are consistent estimates of the corresponding parameters; then α(ρ̂_OLS, T) and Q(y, µ̂_OLS, ρ̂_OLS) are computed to improve the estimates up to first-order terms by (2.17) and (2.18).
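As a hedged illustration of this two-step idea for the complete-data case, the sketch below computes the OLS starting values (sample mean and lag-one regression slope) and then maximizes the log likelihood concentrated with respect to σ² numerically, standing in for the explicit first-order corrections (2.17)-(2.18); the function names and the use of a numerical optimizer are my choices.

```python
import numpy as np
from scipy.optimize import minimize

def ar1_profile_negloglik(params, y):
    """Negative AR(1) Gaussian log likelihood, concentrated w.r.t. sigma^2
    (up to additive constants), for a stationary process with |rho| < 1."""
    mu, rho = params
    if abs(rho) >= 1:
        return np.inf
    T = len(y)
    d = y - mu
    # generalized sum of squares Q(y, mu, rho)
    Q = (1 - rho**2) * d[0]**2 + np.sum((d[1:] - rho * d[:-1])**2)
    return 0.5 * T * np.log(Q / T) - 0.5 * np.log(1 - rho**2)

def ar1_ml(y):
    """Two-step estimation: OLS starting values, then numerical ML."""
    y = np.asarray(y, dtype=float)
    mu0 = y.mean()
    d = y - mu0
    rho0 = np.sum(d[1:] * d[:-1]) / np.sum(d[:-1] ** 2)   # lag-one OLS slope
    res = minimize(ar1_profile_negloglik, x0=[mu0, rho0], args=(y,),
                   method="Nelder-Mead")
    mu_hat, rho_hat = res.x
    dd = y - mu_hat
    Q = (1 - rho_hat**2) * dd[0]**2 + np.sum((dd[1:] - rho_hat * dd[:-1])**2)
    return mu_hat, rho_hat, Q / len(y)                    # sigma^2 = Q / T
```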

3 AR(1) with missing data: ML approach

Now, suppose a portion of data that came from the AR(1) process is missing, so we first have n observed data points (y_1, . . . , y_n), then l missing observations (y_{n+1}, . . . , y_{n+l}), and then again m observed points (y_{n+l+1}, . . . , y_T),

where Σ_o has a block structure:

which is an s × s matrix, s = n, m, and its inverse is given by an expression similar to (2.7);

is a rank 1 matrix of dimensions n × m. By using the formulae for block inverses,

the component blocks of Σ_o^{−1} can be obtained as follows.

To proceed with (T_n − R T_m^{−1} R^T)^{−1}, the following matrix identity can be used:

In using this formula, we shall take V = T_n, with known inverse, and W equal to minus the result of (3.7). Then

which is the upper left block of Σ_o^{−1}. The next block of Σ_o^{−1} is

Computations identical to (3.7)-(3.10) yield

and finally Σ_o^{12} = (Σ_o^{21})^T. The changes due to the missing data are concentrated near the place where they are missing, as should be expected from the Markovian character of the AR(1) process. Let us consider the two limiting cases.

If l = 0, no data are missing, and one can verify that φ(ρ, 0) = 1 + ρ², ψ(ρ, 0) = ρ. Then Σ_o^{−1} has the form of (2.7).

If l = ∞, there are two independent samples, possibly of different lengths, from the AR(1) process, and both the covariance matrix and its inverse are block matrices with off-diagonal blocks equal to zero.

The determinant of Σ_o^{−1} can be shown to be

which can be verified to give the appropriate limits 1 − ρ² and (1 − ρ²)² when l = 0 and l = ∞, respectively.

Let us now derive the normal equations. The generalized sum of squares is

The likelihood can again be concentrated with respect to σ²:

where the first term is O_p(1), and all others are O_p((n + m)^{−1}). The second term is the correction due to the terminal points, similar to the complete data case, and the other two are corrections for the missing data.

The derivatives with respect to ρ are as follows:

Then the normal equation for ρ is

where the terms on the LHS are of order O_p(1), and the terms on the RHS are of order O_p((n + m)^{−1}). Just as in the case of the ML estimates for the complete data, an explicit analytic solution cannot be obtained. In practice, one would need to use two-step or iterative procedures. The starting values can be based on the sample mean and the sample correlation between y_t and y_{t−1}, as before. It should be noted, however, that the corrections due to the missing data are of order O_p((n + m)^{−1}).
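As a brute-force cross-check of this ML approach (inverting Σ_o numerically rather than using the block formulas above), the following sketch maximizes the exact Gaussian likelihood of the observed points, built from the stationary AR(1) covariance Cov(y_s, y_t) = σ² ρ^{|s−t|}/(1 − ρ²); the names and the parameterization through log σ² are my choices.

```python
import numpy as np
from scipy.optimize import minimize

def observed_negloglik(params, y_obs, t_obs):
    """Negative log likelihood (up to constants) of the observed part of a
    stationary AR(1) series, built directly from the covariance function."""
    mu, rho, log_sigma2 = params
    if abs(rho) >= 1:
        return np.inf
    sigma2 = np.exp(log_sigma2)
    lags = np.abs(np.subtract.outer(t_obs, t_obs)).astype(int)
    Sigma_o = sigma2 * rho**lags / (1 - rho**2)
    d = y_obs - mu
    _, logdet = np.linalg.slogdet(Sigma_o)
    return 0.5 * (logdet + d @ np.linalg.solve(Sigma_o, d))

# usage sketch: t_obs holds the time indices of the non-missing observations
# res = minimize(observed_negloglik, x0=[y_obs.mean(), 0.5, 0.0],
#                args=(y_obs, t_obs), method="Nelder-Mead")
```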

4 AR(1) with missing data: the EM algorithm

The EM algorithm is an iterative procedure for finding the critical points of the likelihood function. In this procedure, one alternates between estimating the conditional expectation of the missing data, or of the sufficient statistic of the complete data set if one is available, given the current parameter estimates and the observed data (E-step), and maximizing the likelihood over the parameters using the imputed data or sufficient statistic (M-step). For details and a multitude of applications, see McLachlan & Krishnan (1997).

The version that is most convenient in this example of the AR(1) process is the expectation-conditional maximization (ECM) version of the EM algorithm, in which the maximization can be performed separately over subspaces of the parameter space that span the whole space. We would like to split the maximization into maximization over the mean parameter subspace (µ in our case, or a regression mean if covariates are allowed for) and over the variance subspace, which may be split further into the overall constant σ², which lends itself to a simple estimate such as (2.11), and the covariance structure parameter, which in this case is ρ.

As was shown above in (2.10), the sufficient statistic of the complete data set is

Thus, to get an implementation of the EM algorithm, we need to be able to predict the three sums given the parameter estimates and the available observations, which boils down to prediction of y_t, y_t², and y_t y_{t−1} when at least one of y_t, y_{t−1} is missing.

To derive the aforementioned conditional expectation, consider the four-variate normal distribution

By standard multivariate normal theory, the distribution of the unobserved y_{n+r−1}, y_{n+r} conditional on y_n, y_{n+l+1} is bivariate normal with mean

(4.3) and covariance matrix

This is also the kriging estimator. Hence, the conditional expectations of the relevant sums are as follows:

This can now be substituted into (2.14):

which after a number of simplifications coincides with (3.18).

Summary: It appears that substituting the missing values through kriging produces sums of geometric progressions that reduce to the ML estimating equations obtained before.
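For completeness, here is a small sketch of the conditional (kriging) moments that feed the E-step for a block of l consecutive missing values bordered by the observed y_n and y_{n+l+1}; it is plain multivariate normal conditioning under the stationary AR(1) covariance, with function and variable names of my choosing.

```python
import numpy as np

def ar1_estep_block(y_left, y_right, l, mu, rho, sigma2):
    """Conditional mean and covariance of l consecutive missing AR(1) values,
    given the bordering observed values y_left (= y_n) and y_right (= y_{n+l+1}).
    Uses Cov(y_s, y_t) = sigma2 * rho**|s-t| / (1 - rho**2)."""
    gamma = sigma2 / (1 - rho**2)
    idx_miss = np.arange(1, l + 1)           # relative positions n+1, ..., n+l
    idx_obs = np.array([0, l + 1])           # relative positions n and n+l+1
    C_mm = gamma * rho**np.abs(np.subtract.outer(idx_miss, idx_miss))
    C_mo = gamma * rho**np.abs(np.subtract.outer(idx_miss, idx_obs))
    C_oo = gamma * rho**np.abs(np.subtract.outer(idx_obs, idx_obs))
    W = C_mo @ np.linalg.inv(C_oo)
    cond_mean = mu + W @ (np.array([y_left, y_right]) - mu)
    cond_cov = C_mm - W @ C_mo.T
    # The E-step sums then use E[y_t] = cond_mean, E[y_t^2] = cond_mean^2 + Var,
    # and E[y_t y_{t-1}] = product of conditional means + conditional covariance.
    return cond_mean, cond_cov
```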

5 AR(1) with missing data: the "lazy" EM algorithm

Now, the proposed "lazy" version of the EM algorithm suggests that the predictions of the missing data statistics be simply

which is obviously quite different from what was derived in the previous section, and does not account for the observed data at all. These are the values that can be substituted in place of the missing components of (y − 1µ)(y − 1µ)^T in (1.1).
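Read this way, the "lazy" E-step for the AR(1) case keeps the observed products and fills every entry of (y − 1µ)(y − 1µ)^T that involves a missing value with the unconditional model covariance; a minimal sketch under that reading (my notation):

```python
import numpy as np

def lazy_crossproduct(y, miss, mu, rho, sigma2):
    """Approximate (y - mu*1)(y - mu*1)^T with the 'lazy' substitution:
    keep observed products, and replace any entry involving a missing value
    by the unconditional covariance sigma2 * rho**|s-t| / (1 - rho**2).
    `y` is the full series (placeholders at missing positions), `miss` is a
    boolean mask marking the missing entries."""
    y = np.asarray(y, dtype=float)
    miss = np.asarray(miss, dtype=bool)
    T = len(y)
    d = np.where(miss, 0.0, y - mu)
    observed_products = np.outer(d, d)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    Sigma = sigma2 * rho**lags / (1 - rho**2)
    either_missing = np.logical_or.outer(miss, miss)
    return np.where(either_missing, Sigma, observed_products)
```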

Let us see what kind of estimating equations are implied by this version of the EM algorithm. The approximate conditional expectation of the generalized sum of squares is

(5.4) and the approximate, or pseudo-, likelihood being maximized is

Let us concentrate the pseudo-likelihood with respect to σ²:

which differs from the ML estimate by O_p((n + m)^{−1}). The concentrated pseudo-likelihood is now

Differentiating with respect to µ gives

This differs from the ML estimate by O_p((n + m)^{−1}). The derivatives with respect to ρ are as follows.

(5.14)

The terms on the LHS are of order O_p(1), and the term on the RHS is of order O_p((n + m)^{−1}). Note that the principal term is the same as in the ML case except for the term (y_{n+l+1} − µ)(y_n − µ)/(n + m), which is O_p((n + m)^{−1}), so the two expressions differ by O_p((n + m)^{−1}).

Summary: As noted at the end of Section 3, the corrections for the missing data are of order (sample size)^{−1}. The suggested procedure essentially affects only the weights of the bordering observations y_n and y_{n+l+1}, with differences in the resulting estimating equations of order (sample size)^{−1}, and thus is asymptotically equivalent to the EM algorithm (and leads to estimates asymptotically equivalent to the maximum likelihood estimates).