Harville 1977

Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

This article was downloaded by: [Indiana Universities]

On: 08 April 2013, At: 16:30


Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41
Mortimer Street, London W1T 3JH, UK

Journal of the American Statistical Association


Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/uasa20

Maximum Likelihood Approaches to Variance Component


Estimation and to Related Problems
a
David A. Harville
a
Department of Statistics, Iowa State University, Ames, IA, 50011, USA
Version of record first published: 05 Apr 2012.

To cite this article: David A. Harville (1977): Maximum Likelihood Approaches to Variance Component Estimation and to Related
Problems, Journal of the American Statistical Association, 72:358, 320-338

To link to this article: http://dx.doi.org/10.1080/01621459.1977.10480998

PLEASE SCROLL DOWN FOR ARTICLE

Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions

This article may be used for research, teaching, and private study purposes. Any substantial or systematic
reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is
expressly forbidden.

The publisher does not give any warranty express or implied or make any representation that the contents will
be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be
independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings,
demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or
arising out of the use of this material.
Maximum Likelihood Approaches to Variance
Component Estimation and to Related Problems
DAVID A. HARVlLLE*

Recent developments promise to increase greatly the popularity of tivity constraints on the variance components or other
maximum likelihood (ML) as a technique for estimating variance
components. Patterson and Thompson (1971) proposed a restricted constraints On the parameter 'pace no
maximum likelihood (REML)approach which takes into account the difficulties. Moreover, the maximum likelihood estimates
1 0 s ~in degrees of freedom resulting from estimating fixed effects. and the information matrix for a given parameterization
Mil1er (1973) a m p p t o t i c theory for M L
estimators of variance components. There are many iterative algo- of the model can be obtained readily fronl those for any
rithms that can be considered for computing the ML or REML =ti- other parameterization. Interest in the maximum likeli-
mates. The computations on each iteration of these algorithms are hood estimators should be enhanced by the recent dis-
those associated with computing estimates of fixed and random effects
for given values of the variance components. covery by Olsen, Seely, and Birkes (1976) that, for a t
least some unbalanced designs, there exist estimators in
KEY WORDS : Variance component estimation ; Maximum likeli-
the class of locally best translation-invariant quadratic
Downloaded by [Indiana Universities] at 16:30 08 April 2013

hood; Restricted maximum likelihood; Mixed linear models ; Maxi-


mum likelihood computations. unbiased estimators that have uniformly smaller variance
than the Henderson estimators. These locally best estima-
1. INTRODUCTION tors are related closely to maximum likelihood estimators
The testing and estimation procedures associated with (Hocking and Kutner 1975).
the of variance (ANOVA) and the underlying I n spite of these properties, maximum likelihood esti-
fixed, mixed, and random linear models have been widely mators of variance components have not been used much
used. A long-standing problem associated with the use of in Practice. There are several reasons for this neglect ;
the mixed and random models has been the estimation of the most important of which is, except in relatively
the variances of the random effects, ime.,the estimation simple settings, the computation of the maximum likeli-
of the variance components. For balanced data, it has hood estimates requires the numerical solution of a con-
been COIIIIIIOn practice to estimate these parameters by strained nonlinear optimization problem. Prior to the
the mean squares in the ANOVA table to their advent of the electronic computer, this requirement pre-
expectations. Henderson (1953) developed analogous sented a virtually insurmountable barrier to their wide-
techniques for unbalanced data which, a t least in terms spread use. Even after computers became cornrnonP1ace~
of actual usage, have proved to be very popular. Recently, likelihood was not much used to estimate
a bewildering variety of r(new,, approaches has been pro- variance components because effective computational
posedm S ~ ~ t a n e o u s l ythere
, has been renewed interest algorithms were not readily available to practitioners.
in maximum likelihood techniques for estimating vari- Recently, a number of results have come to light that
ance components. promise to make the computation of maximum likelihood
A maximumlikelihood approach to the estimation of estimates of variance components practical in many
variance components has attractive features. ~h~ settings where it was unfeasible before. Even in cases
maximum likelihood estimators are functions of every where their computation is still unfeasible, it may be
sufficient statistic and are consistent and asymptotically Possible to implement an approach s i d a r to the one
normal and efficient (in the SeIISe described by Miller outlined in Section 7 which mimics maximum likelihood
(1973)). Certain deficiencies of various other methods are but is simp1er computationally-
not shared by maximum likelihood. I n particular, the Two other Problem that have kept maximum likeli-
maximum likelihood approach is "always" well-defined, hood from becoming a more Popular technique for
even for the many useful generalizations of the ordinary estimating variance components are the following : (1)
ANOVA models, and, with maximum likelihood, nonnega- The maximum likelihood estimators of the variance com-
ponents take no account of the loss in degrees of freedom
* David A. Harville is Professor, Department of Statistics, Iowa resulting from the estimation of the model,s fixed effects.
State University, Ames, IA 50011. This work was supported in part
by the Air Force Officeof Scientific Research, under Grant NO. (2) The maximum likelihood estimators are derived under
AF'OSR76-3037- During the early stages of the Work, which took the assumption of a particular parametric form, generally
place while he waa a visitor a t the Biometries Unit of Cornell Uni-
versity, the author benefited substantially from informal conversa-
tions with S.R.Searle and C.R. Henderson. Material supplementary Q Journal of the American Statistical Association
t o that in the present paper can be found in Harville (1975) and June 1977, Volume 72, Number 358
Searle (1976). Theory and Methods Section
320
Variance Component Estimation 321

normal, for the distribution of the data vector. The first levels of the other factors, and the residual effects. As-
of these problems has in effect been eliminated by sociated with the ith random factor is a parameter ui2
Patterson and Thompson (1971) through their “restricted which represents the common variance of its levels. The
maximum likelihood” approach. With regard to the residual effects are taken to have common variance
second problem, it will be’ argued in Section 8.1 th a t the ~ . variances u12, . . . , u,+12 are called variance com-
u , + ~ The
maximum likelihood estimators derived on the basis of ponents. In terms of (2.1), b’ = (bl‘, . . ., b=’), where bi
normality may be suitable even when the form of the is a qi X 1 vector whose elements are the levels of the
distribution is not specified. ith random factor, m = c + 1,
I n the following, an attempt is made to present a
Oi = ui2, (i = 1, . . ., m) , (2.2)
unified review of the maximum likelihood approach to
variance component estimation. Computational aspects
are emphasized. The topics covered include : the current
R = OmI , D = diag [OJ, . . ., Om-J] ,
m-1
state of maximum likelihood theory as applied to the
estimation of variance components, the relationship
v = e m I + c oizizi’,
i=l
(shown to be intimate) between the maximum likelihood
estimation of the variance components and the estimation where Zi is a n X pi matrix defined by the partitioning
of the model’s fixed and random effects, the exploitation Z = (ZI, . . ., Zm-l), and
of that relationship for purposes of computation and = {&Om > O,Oi 2 0 (i = 1, ..., m - l ) } .
approximation, the use of numerical algorithms for com-
puting maximum likelihood estimates of variance com- Generally, each row of Zi has a single element equal to
ponents, the use of maximum likelihood as a vehicle for one, and its remaining elements are equal to zero ; so that
Downloaded by [Indiana Universities] at 16:30 08 April 2013

relating the various methods that have been proposed for its j t h row serves to indicate which level of the ith ran-
estimating variance components, and a discussion of dom factor enters the equation for t h e j t h data point. A t
directions for future research. least some columns of X will usually also have only Cb1
The problem of estimating variance components can entries. I n particular, in the ordinary random ANOVA
be regarded as a special case of a general linear model models, X = 1. (1 denotes a column vector each of whose
problem in which the elements of the covariance matrix elements is one.)
are known functions of a parameter vector to be esti- The ANOVA models are sometimes parameterized in
terms of = Q ~ and ~ = ui2/uc+12 (i = 1, . . ., c )
~ yi
mated. Throughout this article, an effort is made to +

promote this point of view. Many of the ideas that are rather than in terms of u12, . . ., u , + ~ ~If. we had taken
discussed are applicable to the more general problem. 0, = yi (i = 1, . . ., m), instead of taking 8 to be as
specified by (2.2), we would have had
2. THE MODEL AND ITS APPLICABILITY
D = em diag [el& . . . , e,,A]
The models that underlie the analysis of variance can and
all be viewed as special cases of the general linear model m-1

v = em(I + c eizizi’) .
y=Xa+Zb+e. (2.1) i=l

I n this model, y is a n X 1 vector of random variables A useful variation on the ordinary ANOVA models is ob-
whose observed values comprise the data points ; X and Z tained by taking the error variance to be heteroscedastic.
are matrices of “regressors” with dimensions n X p and Descriptions of specific ANOVA models can be found in
n X q, respectively; a is a p X 1 vector of unobservable Searle (1971a). These models have been applied widely
parameters, which are called fixed effects; b is a q X 1 in the biological, agricultural, behavioral, and physical
vector of unobservable random effects; and e is a ?z X 1 and engineering sciences. Still, it is a mistake to think of
vector of unobservable random errors. Moreover, E (b) linear models only in terms of the ordinary regression
= 0, E ( e ) = 0, and cov (b,e) = 0. Pu t D = var (b), and ANOVA models. To do so is to miss many potential
+
R = var (e), and V = R ZDZ’, so that var (y) = V. applications. I n particular, the observations may not all
The matrix X is assumed to be known, but the elements have been taken a t the same time so that J or various sub-
of D , R, and possibly even Z may be functions of an un- vectors of J are best regarded as time series, i.e., as having
observable parameter vector 8 = (el, . . ., Om)‘. The been generated by stochastic processes. Such data are
parameter space of a, 8, is taken to be { (a, 8) : 8 E a } , common in many fields, e.g., in economics. Time-series
where Q is some given subset of Euclidean m space such data are often analyzed on the basis of linear models that
th a t R (and thus V) is nonsingular for 8 E a. P u t can be viewed as special cases of (2.1) in which q = 0,
p* = rank (X),and take X* to be a n X p* matrix whose and e or subvectors of e are generated by stochastic
columns are any p* linearly independent columns of X. processes like autoregressive processes, moving average
I n the ordinary mixed and random ANOVA models, processes, or mixed autoregressive moving average proc-
there is some number c of random factors, with the ith esses (see Box and Jenkins (1970)). Useful extensions of
factor having qi levels. The levels of each factor are these special cases can be obtained by supposing that the
generally taken to be uncorrelated with each other, the elements of b or of various of its subvectors are also
322 Journal of the American Statistical Association, June 1977

ordered by time and may have been generated by com- with in many articles. I n contrast, the problem of
e
parable stochastic processes. Models of this kind may estimating or linear combinations of the components of
be suitable for a wide variety of growth curve data. a and @ has not received much consideration (at least not
Also, many types of data ordinarily analyzed by the from statisticians). Nevertheless, the latter problem
usual ANOVA models may be better fitted by these arises in many applications as described, e.g., by Searle
extended time-series models. The specification of 2,m, (1974), Henderson ( 1 9 7 3 ~ )and ~ Harville (1975). In par-
R, D, and Q and the interpretation of b and e for these ticular, the problem of estimating or predicting a future
special cases will depend, of course, on what is assumed data point from data to which (2.1) applies can generally
about the underlying stochastic processes. be formulated as a problem of estimating a linear com-
Note that multivariate linear models as well as uni- bination of the components of a and e.
variate linear models are included in (2.1). While the par- I n the present section, we review some results on the
ticular models described above are essentially univariate estimation of linear combinations of the components of
models, i.e., models in which each component of y repre- a and e. As will become evident in subsequent sections,
sents the same type of measurement, there is nothing in these results are very relevant to the problem of estimat-
the general formulation (2.1) that excludes situations ing e by maximum likelihood techniques.
where different types of measurements are included Subsequently, we take to be any solution to the
among the components of y, e.g., its first component normal equations
might represent a height measurement and its second
component a weight measurement. I n fact, for each of (X’V-1X)e = X’V-’y , (3.1)
e
Downloaded by [Indiana Universities] at 16:30 08 April 2013

our univariate examples, there is a corresponding multi- and put = Ds, where v = Z’V-’(y - Xe) = Z’Py,
variate example that is likewise a special case of (2.1). with P = V-l - V-lX(X’V-lX)-X’V-l (for any matrix B,
The ordinary ANOVA models generalize to the models that B- will denote an arbitrary generalized inverse of B, i.e.,
underlie the multivariate analysis of variance (MANOVA).any solution to BB-B = B).
The multivariate analogs of the extended time-series Since the elements of D, R, and possibly Z are functions
models form the basis for Kalman filtering techniques, of the parameter vector e, so are the elements of V, e, e,
which are much used in engineering applications (see, and s. When we wish to emphasize that the elements of a
e.g., Duncan and Horn (1972)), and they could be applied particular vector or matrix are functions of parameter
to multivariate growth curve data. vectors, we append the appropriate arguments. This nota-
Those special cases of the general linear model (2.1) tion facilitates the identification of the value of the vector
discussed above are ones in which Z is a known matrix. or matrix for particular values of, the parameter vectors.
There are also useful special cases in which a t least some Thus, e.g., e(e) is used interchangeably with ii, and, if 8*
elements of Z are nontrivial functions of unobservable is a particular value of 8, e (e*) is the value of e for 8 = 8*.
parameters. I n particular, such formulations can be ap- Subsequently, we denote the true values of 8 and a by
plied t o factor analysis models with nonnull elements of e+ and a+.
Z representing the factor loadings. By the expectation of a random variable, we shall
Specialized results are available in the literature for mean its unconditional expectation with respect to the
t ha t class of linear models characterized by V being joint distribution of b and e, as opposed say to its condi-
e.
linear in the parameters, i.e., for those linear models tional expectation given b = We refer to an estimator
where
m
t ( y ) of a linear combination Xl’a + A2’e of the elements
of a and @ as (unconditionally) unbiased if E [t(y)] = k ’ a ,
V = C BiGi (2.3) and call E[t(y) - Xl’a - A2’bI2its (unconditional) mean
i-1
squared error.
for n X n symmetric matrices G1, . . ., G, whose elements For the case where ef is known, an answer to the
are known. As Anderson (1970) indicated, this class in- question of how to estimate Il‘a +
Az‘@ is given below in
cludes many useful special cases. In particular, V has the Theorem 1. This estimator differs from that obtained by
form (2.3) for the usual ANOVA models when we take proceeding as though all the model’s effects were fixed.
Bi = ci2 (i = 1, . . ., c+ 1) though not when we take The latter estimator is less efficient in a mean squared
ei = error sense.
3. ESTIMATION OF FIXED A N D RANDOM EFFECTS Theorem 1: For the case where the true value B+ of 8 is
known, the best (uniformly smallest mean squared error)
Corresponding to an actual data vector, i.e., a n ob- linear unbiased estimator (BLUE) of a linear combination
served value of y, is a realized or sample value of the
vector b of random effects. This value will subsequently
&‘a + &’@ of the elements of a and e,
where 3cl‘a is
estimable, is
e
be denoted by and can be thought of as a parameter
Allti(e+) +x2’Q(e+) . (3.2)
vector just as a is a parameter vector. The only distinc-
tion is that something is assumed to be known about the (See Henderson (1973~)and (1975) ; Harville (1976).)
origin of 0. Estimating estimable functions of a is a When p = 0 (so that the model contains no fixed
problem of great practical importance and has been dealt effects) or equivalently when a+ is known, Theorem 1
Variance Component Estimation 323

essentially reduces to the result described by Rao (1965, numerical techniques that solve the linear system
Sect. 4a.11). When A 2 = 0 , it reduces to the ordinary FB = C without explicitly forming F-I.
Gauss-Markov theorem and, when ;hl = 0, to a result Theorem 2 was presented by Harville (1976) as one in
derived by Henderson (1963). a class of modified versions of a result due to Henderson
The following theorem is relevant to the computation (1963). Henderson’s result applied for e E D such that D
of ii and 8 (and e). is nonsingular and states that, if ii and are the com-
ponents of any solution to
Theorem I: If 6 and o are the p X 1 and q X 1 com-
ponents of any solution to the linear system X’R-’X X’R-’Z X‘R- ‘y
X’R-’X X’R-’ZD Z’R-’X D-I + Z’R-IZ
Z’R-’X I + Z’R-’ZD then & is a solution to (3.1) and @ = e. We choose to
then 6 is a solution to (3.1) and o = 8 . Conversely, for work with the system (3.3) rather than (3.8) because it
any-solution ii to (3.1), the system (3.3) has a solution is applicable for all 8 E D and it has 8 imbedded in its
whose first component is ii. solution instead of @.Both of these features are useful in
relating the maximum likelihood estimation of 8 to the
We have, as a corollary to Theorem 2, that estimation of linear combinations of the elements of a
8 E (I+ Z’R-’ZD)-‘Z’R-l(y - Xti) , (3.4) and 0. Another attractive feature of the system (3.3) is
that its use does not require the inversion of D. However,
and that the coefficient matrix of (3.8) is symmetric positive
i = (I+ Z’SZD)-’Z’Sy , (3.5) definite or semidefinite, which can be a useful property
Downloaded by [Indiana Universities] at 16:30 08 April 2013

with S I R-I - R-’X(X’R-’X)-X’R-’. These two repre- from a computational standpoint (see, e.g., Westlake
sentations can also be obtained directly from the matrix 1968).
identities The elements of @ belong to the class of estimators
known as LLshrinkers.’JFor p = 0 and B+ known, 6 is
V-1 = R-1 - R-’ZD(I + Z’R-’ZD)-’Z’R-’ J
(3.6) formally the same as the Bayes estimator of @ provided
2’V-I _= (I+ Z’R-lZD)-lZ’R-l , the distributions of b and e are multivariate normal, and,
+
P E S - SZD(1 Z’SZD)-’Z’S , even in the absence of normality, is linear Bayes in the
sense described by Hartigan (1969). For p > 0 but B+
and known, the above approach coincides with what might be
+
Z’P = (I Z’SZD)-’Z‘S , (3.7) characterized as a partially Bayes approach (Harville
1976). If B+ is unknown as is being assumed here and is
which are derivable from results given and cited by
usually the case in practice, then in general (3.2) can no
Harville (1976).
I n our formulation of the ordinary ANOVA models as
+
longer be regarded as an estimator of ;hl’a A;@. One
way to proceed when 6+ is unknown is to use (3.2) as an
special cases of the general linear model (2.1), the matrices
R, D, and possibly Z and X exhibit relatively simple
+
estimator of X1’a a2/@ with e+ replaced by some value
of e. Depending on how this value is chosen, there is a
structures. Some or all of these matrices also have simple
formal resemblance to ridge regression or to the Stein-like
structures in other useful special cases of (2.1). The
significance of Theorem 2 is that it provides us with the or empirical Bayes estimators considered, e.g., by Efron
and Morris (1973). I n particular, a maximum likelihood
means for exploiting these structures, so as to simplify
the computation of 8 (and thus and a solution to (3.1). estimate of e can be substituted for B+.
e)
.
If we compute i and a solution to (3.1) by directly solving
the system (3.3), these structures can be used to obvious 4. THE MAXIMUM LIKELIHOOD APPROACH TO
advantage in computing the entries in the coefficient THE ESTIMATION OF e
matrix and right side of (3.3) and again in the actual 4.1 Definition
solution of the system. Likewise, we could exploit these I n discussing the maximum likelihood estimation of 8,
structures by first solving the normal equations (3.1) , we take the distribution of y to be of the multivariate
using (3.6) in their formation, and then computing 8 on normal form, so that the logarithm of the likelihood func-
the basis of (3.4), which is equivalent to employing the tion differs by only a n additive constant from the function
system (3.3) after absorbing the o equations into the 6
equations. Alternatively, we could start by computing 8 L(8, a ;y) = - ( t )log [det (V)]
from (3.5) and then compute a solution to (3.1) from - (+)(y - Xa)’V-’, - Xa) ,
the first p equations in the system (3.3), which corre-
sponds t o absorbing the ii equations of (3.3) into the o defined for 8 and (Y such that B E D. By definition, maxi-
equations. I n carrying, out the computations, advantage mum likelihood (ML) estimates of 8 and a are values
should be taken of the well-known fact (see, e.g., Westlake satisfying 8 E D and L( e , a ;y) = Laup(y),where
1968) that F ’ C , where F and C are arbitrary except
for obvious restrictions, is computed most efficiently by L..,(Y) = supremum L(8, a;19 ,
I (0, a) :8 E 0 I
324 Journal of the American Statistical Association, June 1977

i.e., values a t which L assumes a maximum for those 8 that the above asymptotic formulation is inappropriate.
and a such t h at 8 E 0. It is well-known that, for fixed 8, I n particular, it is inappropriate for the ordinary ANOVA
L is maximized with respect to a by taking a = ~ ( 0 ) . models (except for relatively simple cases like the bal-
Thus, putting Ll*(e;y) = L[e, a(0) ; y], 6 is a ML esti- anced random one-way classification).
mate of B if and only if 6 E 0 and L1*(6;y) = L,,,(y) Hartley and Rao (1967) were the first to attempt an
(i.e., if and only if Ll* assumes a maximum at 6 for asymptotic theory th a t would be truly appropriate for
8 E 0,in which case a ML estimate of a is 6 (6)).Similarly, the more complicated of the ordinary ANOVA models.
for fixed values of some number a of components of e, They derived the limiting properties of the ML estimators
which without loss of generality we take to be the first a of a, 71, . . ., yc+l as n -+ ~0 and p i + 03 (i = 1, . . ., c)
components, it may be possible to determine analytically simulataneously in such a way that the number of ob-
.
values dn+l(Ol, . . , On), . . ., Om(&, . . ., 0), that maximize servations falling into any particular level of any random
L1* for . . ., em such that 8 E 0. Then, putting factor stays below some universal constant. However,
Lz*(el, . . ., 8,; y) = L1*[(el, . . ., On, 8,+1(&, . . ., en), . . ., Miller (1973) pointed out th a t the latter restriction
l,,,(el, . . ., en)} ;y], dl, . . . , 0, are ML estimates of el, . . .,en greatly limits the applicability of the Hartley-Rao results.
if and only if they maximize Lz* for those el, . . . , 8, that For example, it rules out any sequence of increasingly
satisfy e E for some . . . , Om) value, in which case larger balanced random two-way cross-classifications.
ML estimates of . . ., e, are 8,+1(81, . . ., f i n ) , . . ., Miller developed a n asymptotic theory for the ordinary
Om(81, . . ., S,). I n particular, in our alternate formulation ANOVA models which, while it is similar to that presented
of the ordinary ANOVA models as special cases of (2.1), by Hartley and Rao, does not exclude any cases of real
ei = yi (i = 1, . . ., m ) , and, for fixed values of el, . . ., interest. Miller (like Hartley and Rao) required that
Downloaded by [Indiana Universities] at 16:30 08 April 2013

L1*is maximized for e,, > 0 by taking p* = p (which causes no real loss of generality) and that
the matrix Zi consist only of zeroes and ones with exactly
em = (l/n)[y - xa(e)]’ one 1 in each row and a t least one 1 in each column
m- 1
(i = 1, . . ., c). He introduced a quantity qi that can be
+c
-p
i=l
eizizi’]-l[y - xa(e)] , (4.1) regarded ‘as the effective number of levels for the ith
.
random factor (i = 1, . . , c), defined another quantity
unless y lies in the column space of X (an event of prob- v.+~ b y qcfl = n - rank (Z), and assumed the existence
ability zero when n > p * ) . (The right side of (4.1) does of a function q o of n such that the matrix
not depend on 0, because a(0) does not depend on 8, in
this setting.) Except in certain fairly simple situations, lim qO-’X’[V(e+)]-lx (4.2)
it will be the case a 2: 1 ; i.e., while analytical techniques n -10

can often be used to reduce the dimensions of the prob- exists and is positive definite. (Our notation differs from
lem, numerical techniques will ordinarily have to be Miller’s.) Miller showed, under fairly unrestrictive addi-
employed a t some point in order to effect a final solution. tional assumptions, that, for sequences of designs for
Under what conditions does a ML estimate of 8 exist; which n + 03 and q i + m (i = 0, . . ., c +
1) simultane-
i.e., under what conditions is there a value of 6 satisfying ously in an ((orderly way,” the likelihood equations for a
8 E Q and LI*(B;y) = Lsup(y)?Hartley and Rao (1967, and e = (u12,. . ., u,+?) have a root with probability one
Sect. 2) gave conditions which were claimed to insure the (provided the true value (uiZ)+of ui2 is greater than zero
existence of ML estimates for the variance components (i = 1, . . . , c)), and such a root is consistent and asymp-
associated with the ordinary ANOVA models. Miller (1973, totically efficient. Furthermore, denoting the ui2 com-
Appendix D) found a deficiency in the Hartley-Rao con- ponent of this root by di2 (i = 1, . . ., c +1) (implying
ditions and showed how to ((fix them up.” Their condi- that the a component is a(6) where 6’ = (a?, . ., d,+?)), ’
tions are quite unrestrictive. the limiting distribution of dq0[a(6) - a+],

4.2 Asymptotic Properties l/rllCBl2 - (.la)+], .. l//rlc+l[~c+? - (fJc+12)+1

Anderson (1971) considered the special case of (2.1) is normal with mean vector 0 and covariance matrix

-
where y is made up of s (= n/r) r-variate vectors that are
independently and identically distributed. He showed
that, for s 03 with T fixed, the usual asymptotic prop-

erties of the maximum likelihood method can be extended.


Asymptotic properties are of value in a particular ap-
diag (Jl-l, Jz-l), where J1 is the matrix given by (4.2)
and Jz is the (c
(3) lim (q,qj)-i
+
1) x (c +
1) matrix with ijth element
]-‘Z;] .
tr [zi’{v(e+)}-lZjZj’{V(e+)

4.3 Restricted Maximum Likelihood


plication only if there is reason to believe that the data
are extensive enough that the properties hold. Anderson’s One criticism of the ML approach to the estimation of
asymptotic results can be applied with confidence if s is e is th a t the ML estimator of th a t vector takes no account
sufficiently large. However, for many useful models of of the loss in degrees of freedom that results from estimat-
the form (2.1), y cannot be partitioned into independently ing a. For the ordinary ANOVA models, the variance-com-
and identically distributed subvectors (except trivially ponent estimators obtained by solving the likelihood
by taking 8 = 1) even though n may be very large; so equations do not in general coincide with those obtained
Variance Component Estimation 325

by ANOVA methods (not even in the case of balanced contrasts, and that Az is any p* x n matrix of constants
data), and, unlike the latter estimators, they are gen- such that A’ = (All, Az’) is nonsingular. The likelihood
erally biased (in a downward direction (Patterson and function for the transformed data vector Ay is propor-
Thompson 1974)), sometimes severely so (Corbeil and tional to that for y so that we can just as-well base our
Searle 1976b; Patterson and Thompson 1974). I n par- inferences on Ay as on y. Since E(A2y) consists of linearly
ticular, for the ordinary fixed ANOVA or regression models, independent estimable functions of a, Patterson and
which collectively comprise the special case of (2.1) where Thompson maintained that, in the absence of outside
Q = 0, m = 1 , V = elI , n = {&:el > 0 } , (4.3) knowledge of a, Azy contains no information about 8 and
therefore inferences for 8 should be based only on Aly.
the ML estimator (4.1) of the single “variance component” More precisely, Aly would seem to be marginally sufficient
81 has expectation Bl(n - p * ) / n , so that it is biased for 8 in the sense described by Sprott (1975). A similar
downward by an amount &p*/n, which can be significant argument, pertaining specifically to estimation, is that
if the number of degrees of freedom n - p * is sufficiently the full ML estimator of e necessarily depends on y only
small. through a set of n - p* linearly independent error con-
These “deficiencies” are eliminated in the restricted trasts, i.e., as a function of Ay it depends on Aly but not
maximum likelihood (REML) approach which was de- on Azy, so that the REML estimator does not ignore any
veloped for specific balanced ANOVA models by several information that is actually used by the full approach.
scholars including Russell and Bradley (1958) and (The fact that the full ML estimator of 8 depends only on
Anderson and Bancroft (1952). It was extended to “all” error contrasts follows upon observing that it depends on
balanced ANOVA models by W.A, Thompson (1962), and y only through the function L1* which, like L1, depends
was set forth in general form by Patterson and R. on y only through Aly.) A related observation has to do
Downloaded by [Indiana Universities] at 16:30 08 April 2013

Thompson (1971 and 1974). For balanced ANOVA models, with translation invariance. (An estimator T(y) of a
the equations that are the likelihood equations in the scalar- or vector-valued function of 8 will be called trans-
REML approach have as their solution the standard
ANOVA estimates.
+
lation invariant if T(y Xa) = T(y) for all y and all
p X 1 vectors a.)It is well-known that the REML estima-
By an error contrast, we shall mean a linear combina- tor of 8 is translation invariant. However, the translation
tion u’y of the observations such that E(u‘y) = 0, i.e., invariance of this estimator is not a valid reason for pre-
such that u’X = 0 (where u does not depend on 8 or a). ferring it to the full ML estimator (as has sometimes been
The maximum possible number of linearly independent maintained), because the full ML estimator is also transla-
error contrasts in any set of error contrasts is n - p*. A tion invariant.
particular set of n - p* linearly independent error con- How does the REML estimator of 6 compare with the
trasts is given by Aly where A1 is a (n - p * ) X n matrix ML estimator with regard to mean squared error (MSE)?
whose rows are any n - p * linearly independent rows of I n general, the answer depends on the specifics of the
the matrix I - X(X’X)-X’.I n the REML approach, infer- underlying model and possibly on B+. For models satisfy-
ences for e are based on the likelihood function associated ing the conditions (4.3) (i.e., ordinary fixed ANOVA or
with n - p* linearly independent error contrasts rather regression models) the ML estimator of the variance 81
than on that associated with the full data vector y. It has uniformly smaller MSE than the REML estimator when
makes no difference which n - p* linearly independent p* 5 4 ; however, the REML estimator has the smaller
contrasts are used because the likelihood function for any MSE when p* 2 5 and n - p* is sufficiently large
such set differs by no more than a n additive constant (n - p* > 2 suffices if p* 2 13). MSE comparisons be-
(which varies with which error constrasts are included tween variance-component estimators obtained by solv-
but does not depend on 8 or a) from the function ing the likelihood equations for ML and variance-com-
L1@; y) = - (3) log [det (V)] ponent estimators obtained by solving the likelihood
- (4) log [det (X*’V-lX*)] equations for REML were made by Corbeil and Searle
(1976b) and by Hocking and Kutner (1975) for several
- ($) (y - xE)’v-’(y - XE) , mixed and random ANOVA models.
defined for 0 E (Harville 1974). (Here, X* is as defined
in Section 2.) A REML estimator is a value of 8 that maxi- 5. DERIVATION AND COMPUTATION OF
mizes L, for 8 E a. As in full ML, numerical techniques DERIVATIVES AND OTHER RELEVANT
must ordinarily be employed to determine the estimate, ITEMS
though sometimes analytical techniques can be used to
reduce the numerical problem to that of maximizing a Various procedures for computing ML or REML estimates
function Le (defined analogously to Lz*) involving only of 8 will be discussed in Section 6. These procedures are
a components of e. iterative, requiring the repeated evaluation of L, L1*, or
It might seem that some information is lost by basing L1; their first- or second-order partial derivatives ; the
inferences for 8 on Ll instead of L. Patterson and expected values of their second-order partial derivatives ;
Thompson (1971) argued to the contrary. Suppose that and/or related quantities. I n deciding on a procedure for
A g is any vector of n - p* linearly independent error a given application and in implementing that procedure,
326 Journal of the American Statistical Association, June 1977

it is imperative to know how to evaluate the required representations,


items efficiently.
Using well-known results on matrix differentiation L1 = - (3) log [det (R)] - (+) log [det (C*)]
(Graybill 1969, Sect. 10.8; Nering 1970, Chap. 6, Sect. 7), - (i)y'R-'(y - x e - 2s)
we find , = - (4) log [det (R)] - (3) log [det (X*'R-'X*)]
- (3) log [det (I + Z'SZD)] - (+)y'S(y - 2e) ,
follow immediately from (5.1) and (5.2). Next, consider
aLl/aO;. If D depends on O i but R and 2 do not (as is the
case for i = 1, . . ., m - 1 in our formulations of the
ordinary ANOVA models), then, using (3.7), we have
aLl/aOi = - (3) tr [(I + Z'SZD)-lZ'sZ(aD/ae,)]
+ (+>i'(aD/aO;)i ,
where, if we wish, we could substitute the right side of
E(d2Ll/dOiaOk) = - (3) t r [P(aV/ae,)~(av/ae~)]. +
(5.3) for (I Z'SZD)-'Z'S. Finally, consider dLl/dO;,
E (a2Ll/aOiaOk), and a2Ll/aeiaBkspecifically for the case
Expressions for the first- and second-order partial deriva- of the ordinary ANOVA models, taking the parameteriza-
tives of L with respect to the compofients of 8 and for the tion to be 0; = y, (i = 1, . . ., m ) . Upon defining the par-
expected values of the latter partials can be obtained titioned matrix
t"]
Downloaded by [Indiana Universities] at 16:30 08 April 2013

from the above expressions by first putting X = 0 and


then substituting y - XCYfor y. Expressions for all of the T = [Ti1 .-. 1

first- and second-order partials and expected second- Tic ... TCc
order partials of L and Ll* are given by Harville (1975).
As observed by Searle (1970), the information matrix where Ti, is q; X qj, to be the lower right corner of any
associated with the full likelihood function is the matrix generalized inverse of C (which is necessarily equal to
diag p*,(X'V-lX)], where B* is the m X m matrix with +
(I Z'SZD)-', see Harville (1976, p. 392)) and observing
ikth element (3) tr P-l(aV/aO,)V-'(av/d&)]. The in- that (for the ordinary ANOVA models)
formation matrix associated with L1 is the m X m matrix C

B whose ikth element is (3) tr [P(av/aB;)P(av/aek)]. T;k + ')'c+l'yk TijZj'SZk = 1I if k =i I


In practice, it is generally inefficient and possibly un- j-1
= O , if k # i ,
feasible computationally to evaluate L, L1*, or L1; their we find
partial derivatives ; their expected second-order partials ; aLl/ay; = - (+)yi-'[pi tr (Ti!)]
-
or related quantities directly from expressions like those
given previously. As noted earlier, the matrices R, D, and
+ (3)Yc+lf%'ii, (5.4)
possibly Z and X often have relatively simple structures. aLl/+rM = - (4)rc+r1C(n - P*)
Formulas were given in Section 3 that made it clear how - Y'S(Y - zm1 , (5.5)
to exploit these structures for purposes of computing (-22)E(d2Ll/i3ydy;) = y,-' t r [(I - T;J2] ,
BLUE'S of linear combinations of the dements of CY and p.
We now demonstrate how, by making use of the results (-2)E(a'Ll/dyiayk) = yi-'Yk-l tr [TikTki] j

of Section 3 and several related identities, comparable (- 2)E (a'Ll/ayiayc+l) = Y c+i-'[qi - tr (Ti;)]
formulas can be obtained for the preceding items. (- 2)E(a2Li/a~c+ia~c+i) = ~c+i-'(n - P*)
Taking C t o be the coefficient matrix of the linear sys- ( -2)d2Ll/d7;dyi = -y;-' t r [(I - T,;)']
tem (3.3) and C* to be the matrix th at results from sub-
stituting X* for X in C, we find
+
2yc+1y,-'i,(I - T..)" VI I I ,

(- 2)d2L1/ay;ayk = - Yi-"yk-' tr [TikTk;]


det (V) = det (R) det (I + Z'R-'ZD) , - 2Y~+iYk-~t~Tik~k
(-2)a'L1/ay;aye+l = tili; ,
det (V) -det (X*'V-lX*) = det (R) mdet (C*)
(-2)a~Ll/ayC+layc+l = -yYG+l-e[(n - P*)
+
= det (R).det (X*'R-'X*)-det (I Z'SZD) , (5.1)
- 2Y'S(Y - ze11 9
V-'b - Xi) R-'(y - X6 - Ze) S(y - Ze) , (5.2) for k .
# i = 1, . ., c, where i , is the qi x 1vector defined

P R-' - R-l(X, ZD)C-(X, 2)'R-' ,


..
by i' = (tl', ., ic').(For those of the expressions that
involve partial differentiation with respect to yi1 we
Z'P = (0, Iqxq)C-(X, 2)'R-l . - (5.3) require y i > 0, i = 1, . . ., c 1.) +
The above representations are closely related to
Proofs of these identities are outlined by Harville (1975). various of the representations given by Patterson and
T o illustrate how the various results can be combined Thompson (1971), Henderson (1973a and 1973b),
t o produce formulas of the desired kind, we note that the. Hemmerle and Hartley (1973), Thompson (1975), and
Variance Component Estimation 327

Corbeil and Searle (1976a). Depending upon the par- equations such as that provided by Westlake (1968) can
ticulars of the model being considered, it may be possible be invaluable.
to further simplify these representations by algebraic
means. For example, “complete” simplification is possible 6. NUMERICAL PROCEDURES FOR MAXIMUM
in the case of the random two-way nested ANOVA model LIKELIHOOD ESTIMATION
as evidenced by Searle’s results (1970).
Ordinarily, we must resort to an iterative numerical
Note that first- and second-order partial derivatives
procedure t o obtain a ML or REML estimate of 8. How-
and expected second-order partial derivatives for L, L1*,
ever, there are simple cases where the estimate can be
and Ll for one parameterization of (2.1) can be obtained
found by analytical means. Herbach (1959) derived
from those for a second parameterization by making use
explicit expressions for the ML estimators of the parame-
of the chain rule of calculus (see, e.g., Nering (1970,
ters (the mean and two variance components) of the
Chap. 6, Sect. 5)). I n particular, in the case of the
balanced one-way random-eff ects model. The results of
ordinary ANOVA models, expressions for partial deriva-
W.A. Thompson (1962) can be used to obtain explicit
tives taken with respect to u12, . . ., uC+? can be obtained
expressions for the REML estimators of the variance com-
from those taken with respect to yl, . . ., yc+l and vice
ponents of any balanced ANOVA model. Thompson worked
versa. General formulas for going from one information
these out himself for several models including the bal-
matrix to the other are given by Zacks (1971, p. 227).
anced two-way crossed random-eff ects ANOVA model
The chain rule can also be used to obtain the partial
(both with and without interaction). While the standard
derivatives of Lz* or Lz from the partial derivatives of
ANOVA estimators of variance components comprise a
L1* or L1.
solution to the likelihood equations for REML in the case
The above results completely link the problem de-
of the balanced ANOVA models, the likelihood equations
Downloaded by [Indiana Universities] at 16:30 08 April 2013

scribed in Section 3 of estimating linear combinations of


for full ML do not admit an explicit solution for all such
the elements of (Y and p when 8+ is known to the problem
models. Explicit solutions to the latter equations exist
of evaluating L, L1*, and L1,their first- and second-order
for the balanced two-way nested ANOVA models, though
partial derivatives, and their expected second-order
partial derivatives. For each of the approaches given in not for the balanced two-way crossed random-effects
ANOVA model with interaction (Hartley and Rao 1967 ;
Section 3 to the first problem, they point the way to a
Miller 1973 ; Herbach 1959).
corresponding approach to the second problem. Note,
There are many iterative numerical algorithms that
however, there is a consideration in the second problem
can be regarded as candidates for computing ML or REML
not ordinarily present in the first. I n the estimation of
linear combinations of the elements of (Y and p when 8+ is estimates of 8. Some were developed specifically for
special cases; e.g., for computing ML estimates of variance
known, there is only a single set of computations, while,
components. Others are general procedures for the
in the iterative procedures for the ML or REML estimation
numerical solution of broad classes of constrained non-
of 8, similar computations must be performed for each of
linear optimization problems. There is no real hope for
a sequence of 8 values. When the computations must be
finding a single iterative numerical algorithm for the M L
carried out for more than one 8 value, they should be
or REML estimation of 8 that will be best, or perhaps even
accomplished such that, to the greatest extent possible,
satisfactory, for every application. An algorithm that
those operations that depend on 8 are segregated from
requires relatively few computations to converge to a
those that do not, so the latter operations need be per-
ML or REML estimate in one setting may converge slowly
formed only once. Hemmerle and Hartley (1973) discuss
or even fail to converge in another. I n deciding which
this point in the context of the ML estimation of variance
among available algorithms to try in a particular applica-
components, and Corbeil and Searle (1976a) describe the
tion, we must make some judgments about their computa-
analogous considerations for REML estimation.
tional requirements and their other properties as applied
I n general, the evaluation of first-order partial deriva-
to a given setting. This section is devoted to describing
tives can require considerable computations beyond those
the various algorithms and their characteristics. The
necessary to evaluate L, L1*, or L1; the evaluation of the initial descriptions, given in Subsections 6.1 and 6.2,
expected values of the second-order partial derivatives ignore any complications brought about by constraints
can require many computations additional to those on the parameter space, i.e., by 8 being confined to it
needed to evaluate the first-order partial derivatives ; and when it is a proper subset of Euclidean m space. I n Sub-
the evaluation of the second-order partial derivatives section 6.3, several techniques are considered for modify-
themselves can require still more extensive computations. ing the various algorithms to cope with constraints.
However, judging from the preceding representations,
it would appear, in the case of the ordinary ANOVA 6.1 Specialized Algorithms
models, first- and even second-order derivative informa- On the kth iteration of an iterative algorithm for pro-
tion can be had rather cheaply. I n assessing the relative ducing a ML or REML estimate of 8, the current value for
difficulty of the computations for any particular applica- the estimate is converted into a new value. I n the follow-
tion, information on the numerical solution of linear ing, we denote by the value produced by the algo-
328 Journal .of the American Statistical Association, June 1977

rithm on its kth iteration, and, for any quantity f which (5.4) and (5.5), the equations aL1*/ayi = 0 and aL1*/
is a function of 8, we use f c k ) to represent the value of f ayc+l = 0 can be put in the form
a t 8 = 8(k)le.g., V(k)= V { W ) . The value 8 ( O ) used to
start the algorithm must be supplied by the user.
ui2 = [&’pi + ui2 tr (Tii*)]/qi , i = 1, . . ., c , (6.1)
Anderson (1973) and Henderson (1973a) proposed uc+12 = y’(y - X& - ZQ)/n , (6.2)
iterative algorithms designed specifically for handling where
certain special cases of the problem of computing ML
estimates of 8. Their approaches are in effect based on
manipulating the equation aL1*/ae = 0 into the form
8 = g(8; y) for some m X 1 vector g of functions of 8.
Q = I:] and (I + Z’R-lZD)-l
T11* . . . Tic*
Nonlinear equations of this form can be solved by the
method of successive approximations, which consists of =[: 1 1 .
taking 8 ( k + ’ ) = g { B ( k ) ; y } (see, e.g., Beltrami (1970, Tcl* . . . Tcc*
Sect. 1.2)). Henderson’s algorithm consists of applying the method of
Anderson’s iterative algorithm for computing a ML successive approximations to (6.1) and (6.2). An analo-
estimate of 8 is for the special case where V has the repre- gous algorithm for computing REML estimates of ulZ1. . . ,
sentation (2.3). Anderson found in effect that dLl*/atI = 0 u , + ~is~obtained by applying the method of successive ap-
can be rewritten as B*8 = d, where B* is as defined in proximations to
Section 5 and d is the m X 1 vector whose ith element is
(+)(y - xti)’v-l(av/aei)v-l(y - xti) .
ui2 = [Qi’ei + ui2 t r (Tii)]/qi , i = 1, . . ., c , (6.3)
u,+? = ~ ’ ( y
- X6 - ZQ)/(n - p * ) . (6.4)
Downloaded by [Indiana Universities] at 16:30 08 April 2013

For fixed 8 such that V is nonsingular, B* is necessarily


positive semidefinite and the linear system B*8 = d is Note that (6.1) and (6.3) can be rewritten as
consistent for 8 (see LaMotte 1973). When B* is non- ui2 = [@,’&]/[q; - t r (T,i*)] , i = 1, . . ., c , (6.5)
singular (and thus positive definite), which is the case if
and only if G1,. . ., G, are linearly independent matrices, ui2 = [Q;’@;]/[qi - t r (Ti;)] , i= 1, . . ., c . (6.6)
the equation B*8 = d is equivalent to the equation Possibly interesting modifications of Henderson’s pro-
8 = B*-’d. The method of successive approximations as cedure and its REML analog are obtained by applying the
applied to the latter equation is to take the (k +
1)st method of successive approximations to (6.5) and (6.2),
iterate to be 6(k+1)= p*(k)]-ld(k).I n ‘the event that rather than (6.1) and (6.2), and to (6.6) and (6.4),
sufficient conditions given by Anderson (1969) for the rather than (6.3) and (6.4).
existence of an explicit solution to the likelihood equa- The following lemma was proven‘by Harville (1975).
tions are met, the iterative procedure converges in one
Lemma 1: For the ordinary ANOVA models (with R, D,
iteration from any starting value (Miller 1973).
A similar iterative algorithm can be constructed for and D as specified in Section 2), (i) t r (Tii*) > 0 ; (ii)
computing a REML estimate of 8 for the case where V has tr (Ti;) > 0 ; (iii) q; 2 tr (Ti;*) provided uiZ > 0, with
strict inequality holding if Zi# 0 ; and (iv) qi 2 t r (Ti;)
the representation (2.3). The likelihood equations for
REML can be put in the form B8 = d. For fixed 8 with V
provided ui2 > 0, with strict inequality holding if
rank (X, Zi) > p*.
nonsingular, B is positive semidefinite and the linear
system Be = d is consistent for 8 (LaMotte 1973, pp. 316 The lemma implies that Henderson’s iterative pro-
and 327-8). The matrix B is nonsingular if and only if 8, cedure for computing ML estimates of variance com-
is estimable in the class of quadratic translation-invariant ponents and its REML analog have an apparently pleasing
estimators for i = 1, . . ., m (again see LaMotte 1973), property (which is not shared by Anderson’s algorithm).
in which case B8 = d is equivalent to 8 = B--’d. Applying Suppose that y does not lie in the column space of X,
the method of successive approximations to the latter which is the case with probability one when, e.g., y is
equation yields an iterative algorithm for computing a normally distributed. If the algorithms are started with
REML estimate of 8 analogous to Anderson’s procedure for strictly positive values for the variance components, then
computing a ML estimate. a t no point can the values for the variance components
Anderson’s iterative algorithm and its REML analog can ever become negative. In fact, starting from strictly
of course be used to compute ML and REML estimates of positive values, they can never reach zero values either,
.
the variance components ulz, . ., u , + ~associated
~ with though it is possible for them to attain values arbitrarily
the ordinary ANOVA models. However, Anderson’s algo- close to zero. Note that these algorithms ordinarily should
rithm differs from the iterative algorithm proposed by not be started with a zero value for any variance com-
Henderson (1973a). Henderson’s algorithm, which is the ponent, since the value for that component would then
same in principle as a procedure discussed by Hartley continue to be zero throughout the iterative procedure.
and Rao (1967, Sect. 5), is designed specifically for com- A further implication of the lemma is that the modified
puting ML estimates of variance components. By using Henderson procedure, which is based on (6.5) or (6.6)
representations for dL1*/dy, and aLI*/ayc+,analogous to rather than on (6.1) or (6.3), is well defined, unless a t
Variance Component Estimation 329

some point a zero value is attained for some variance fk(p) > f k ( o ) , i.e., we merely require that the (k 1)st+
component. Th at is, the denominators of (6.5) and (6.6) iteration produce some progress in the indicated search
are strictly positive unless u I 2= 0, in which case the direction. I n (iii), some effort (short of that required to
denominators are zero. The latter phenomenon causes no approximate p k ) may be expended to find a value of p for
difficulty if we agree to take the value ( E , ~ () k + l ) for u I 2on which f k ( p ) is “large.” The determination of the P k
the ( k + 1)st iteration to be zero whenever ( ~ 7( k~) = ~ )0. specified by (ii) or (iii) is a one-dimensional search prob-
The modified algorithms, like the originals, can never lem. Suitable algorithms for one-dimensional searches are
reach negative values, nor should they ordinarily be discussed by Murray (1972). They require at a minimum
started with zero values for any of the variance com- that f k ( p ) be evaluated a t various trial values of p , and
ponents since, once a zero value is inserted, it is never thus can be very time consuming in instances where the
changed. The iterates derived from (6.5) or (6.6) have evaluation of L , involves extensive computations.
an intuitively appealing form. On each iteration, ui2 is Two of the oldest and best known of the gradient
"estimated" by computing the sum of squares of the components of the $q_i \times 1$ vector $\tilde b_i$, where $\tilde b' = (\tilde b_1', \ldots, \tilde b_c')$, and by then dividing by a number between $q_i$ and zero.

6.2 General Algorithms

To locate a ML or REML estimate of $\theta$, we can, in the special cases where they apply, try one of the iterative numerical algorithms described in Section 6.1. We can also consider iterative numerical algorithms developed for the general problem of maximizing an arbitrary function. Moreover, when confronted with a situation for which there is no specialized algorithm, we are forced to use one of the general procedures. In this subsection, several general algorithms and their properties are described, and references are indicated where more complete information can be found. The discussion will be in terms of the problem of computing a REML estimate of $\theta$, i.e., the problem of computing a value of $\theta$ that maximizes $L_1$. This will cause no real loss of generality, since the extensions to the problems of maximizing $L$ with respect to $\theta$ and $\alpha$, maximizing $L_1^*$ with respect to $\theta$, and, more generally, maximizing an arbitrary function will be obvious.

The $(k+1)$st iterate of an iterative maximization algorithm has the representation $\tilde\theta^{(k+1)} = \tilde\theta^{(k)} + \rho_k w_k$, where $w_k$ is a vector that serves to identify the direction of search and $\rho_k$ is a positive scalar that determines the distance to be traversed in the indicated search direction. Many of the proposed algorithms are gradient algorithms. A gradient algorithm, as applied to the maximization of $L_1$, is one where $w_k$ has a convenient representation of the form $w_k = N_k\{\partial L_1/\partial\theta\}^{(k)}$ for some $m \times m$ matrix $N_k$. The various gradient methods are characterized by different choices for $N_k$ and $\rho_k$. If $N_k$ is chosen to be positive definite, then necessarily there exists a $\rho_k$ ($\rho_k > 0$) such that $L_1\{\tilde\theta^{(k+1)}; y\} > L_1\{\tilde\theta^{(k)}; y\}$, unless of course $\{\partial L_1/\partial\theta\}^{(k)} = 0$. In fact, this inequality holds for positive definite $N_k$ if $\rho_k$ is taken to be sufficiently close to zero (see, e.g., Beltrami 1970). With regard to the choice of $\rho_k$, the methods fall into three categories: (i) $\rho_k = 1$; (ii) $\rho_k \doteq \bar\rho_k$ (to some degree of approximation), where $\bar\rho_k$ is the value of the scalar $\rho$ that maximizes $f_k(\rho) = L_1[\tilde\theta^{(k)} + \rho w_k;\, y]$ for $\rho > 0$, i.e., $\rho_k$ is chosen so as to maximize progress in the indicated search direction; and (iii) $\rho_k$ is taken to be any positive value of $\rho$ for which $f_k(\rho) > f_k(0)$.

The best known of the gradient algorithms are the steepest ascent algorithm and the Newton-Raphson algorithm. In the method of steepest ascent, $N_k = I$ for all $k$ and customarily $\rho_k \doteq \bar\rho_k$. The steepest ascent algorithm is one of the few that is supported by convergence theorems (see, e.g., Beltrami (1970) and Powell (1970)). Unfortunately, its rate of convergence is often found to be intolerably slow (Powell 1970). Bard (1974, p. 88) states, "the method is not recommended for practical applications." Hartley and Vaughn (1972) describe, in the context of the ML estimation of variance components, a variation of the steepest ascent algorithm. Their approach requires that a system of $c$ differential equations be solved numerically on each iteration. In the Newton-Raphson procedure, $N_k = \{J^{(k)}\}^{-1}$ and $\rho_k = 1$, where $J$ is the $m \times m$ matrix whose $ij$th element is $-\partial^2 L_1/\partial\theta_i\partial\theta_j$. (It is assumed here that $J^{(k)}$ is invertible.) Unlike the steepest ascent method, the Newton-Raphson procedure utilizes second-order partial derivatives. When applied to a quadratic function that has a negative definite Hessian matrix, the Newton-Raphson procedure will converge to the maximizing value in a single iteration from any starting point. Even when it is applied to a function like $L_1$, which is not quadratic, the Newton-Raphson algorithm can be expected to locate a maximizing value in relatively few iterations, provided it is started within a sufficiently small neighborhood of that value (see, e.g., Bard 1974). However, if the starting value is poor, it may converge to a stationary point which is not a local or global maximum, and often it does not converge at all (Powell 1970). This difficulty is overcome in the extended or modified Newton-Raphson procedure. The extended procedure uses the same search direction as the original, but $\rho_k$ is determined so that $L_1\{\tilde\theta^{(k+1)}; y\}$ is at least somewhat larger than $L_1\{\tilde\theta^{(k)}; y\}$. (If the directional derivative is negative in the direction $\{J^{(k)}\}^{-1}\{\partial L_1/\partial\theta\}^{(k)}$, the search direction $-\{J^{(k)}\}^{-1}\{\partial L_1/\partial\theta\}^{(k)}$ can be used instead.)
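To fix ideas, the generic gradient iteration might be sketched as follows (hypothetical code, with a simple concave quadratic as a stand-in for $L_1$; all names are illustrative). Supplying the inverse of the negative Hessian as $N_k$ yields the Newton-Raphson direction; omitting it yields steepest ascent; and $\rho_k$ is chosen by simple step-halving, a choice of type (iii).

```python
import numpy as np

def gradient_ascent(objective, grad, theta0, direction_matrix=None,
                    max_iter=200, tol=1e-8):
    """Generic gradient iteration: theta_{k+1} = theta_k + rho_k * N_k g_k.

    direction_matrix(theta) returns N_k; if None, N_k = I (steepest ascent).
    rho_k is found by step-halving until the objective increases, which is
    always possible for positive definite N_k unless the gradient is zero.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:          # stationary point located
            break
        N = np.eye(theta.size) if direction_matrix is None else direction_matrix(theta)
        w = N @ g                            # search direction w_k
        rho, f_old = 1.0, objective(theta)
        while objective(theta + rho * w) <= f_old and rho > 1e-12:
            rho *= 0.5                       # shrink the step until the objective increases
        theta = theta + rho * w
    return theta

# Example: a concave quadratic.  The Newton-Raphson direction N_k = J^{-1}
# (inverse negative Hessian) reaches the maximizer in a single step.
J = np.array([[2.0, 0.5], [0.5, 1.0]])       # negative Hessian, positive definite
c = np.array([1.0, -2.0])
f = lambda t: c @ t - 0.5 * t @ J @ t
g = lambda t: c - J @ t
print(gradient_ascent(f, g, [5.0, 5.0], direction_matrix=lambda t: np.linalg.inv(J)))
print(gradient_ascent(f, g, [5.0, 5.0]))     # steepest ascent: same limit, many more steps
```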

The method of scoring is a gradient procedure that applies when the function to be maximized depends on data points (observed values of random variables). It is identical to the Newton-Raphson procedure except that the role of the second-order partial derivatives is played instead by their expected values. As applied to the maximization of $L_1$, the $(k+1)$st iterate of the method of scoring is defined by putting $N_k = \{B^{(k)}\}^{-1}$ and $\rho_k = 1$. Note that $N_k$ in this method coincides with the large-sample covariance matrix of the REML estimator of $\theta$ evaluated at $\theta = \tilde\theta^{(k)}$, which illustrates a general property of the method when it is applied directly to the maximization of a likelihood function. The method of scoring as defined above can also be applied to the problem of maximizing "reduced" likelihood functions like $L_1^*$, $L_2^*$, or $L_2$. It is to be expected that this will produce iterates for the remaining parameters different from those produced by applying the method directly to the relevant likelihood function. The advantage of the method of scoring over the Newton-Raphson method is that, since the expected values of second-order partial derivatives are ordinarily easier to compute than the second-order partial derivatives themselves (refer to Section 5), it will generally require less computer time per iteration, though possibly at the expense of an increased number of iterations to convergence. In the case of the ML or REML estimation of variance components, this advantage may, however, be fairly insignificant (again refer to Section 5). The method of scoring can be extended or modified in the same way as the Newton-Raphson procedure by considering values for $\rho_k$ different from one. Note that, when the method of scoring is applied to the maximization of $L_1$, it defines a search direction in which at least some increase in $L_1$ can be achieved (provided $\{\partial L_1/\partial\theta\}^{(k)} \neq 0$), since ordinarily $B^{(k)}$ will be positive definite (see Section 6.1). This again illustrates a general property of the method when applied directly to the maximization of a likelihood function. The $(k+1)$st iterate generated by applying the method of scoring to the maximization of $L$ with respect to $\theta$ and by then substituting $\tilde\alpha\{\tilde\theta^{(k)}\}$ for $\alpha$ is, in the case where $V$ has the linear representation (2.3), the same as the $(k+1)$st iterate defined by T.W. Anderson's iterative ML algorithm. This observation was first made by J.N.K. Rao (see Miller 1973). Moreover, the iterates produced by the REML analog of Anderson's algorithm are identical to those defined by applying the method of scoring to the maximization of $L_1$ with respect to $\theta$ (Hocking and Kutner 1975). Thus this procedure can be viewed as a special case of the method of scoring.

The extended or modified Newton-Raphson procedure represents an attempt at retaining the good performance of the Newton-Raphson procedure when it is started close to a maximizing value while improving on its performance when it is started with a poor estimate. A similar philosophy underlies the gradient method described by Bard (1974, Sect. 5-8), which is based on the work of Levenberg (1944), Marquardt (1963), and Goldfeld, Quandt, and Trotter (1966). As applied to the maximization of $L_1$, the $(k+1)$st iterate of the latter method is obtained by taking $\rho_k = 1$ and $N_k = [A_k + \lambda_k M_k]^{-1}$. Here, $M_k$ is a positive definite matrix, $\lambda_k$ is a scalar that ordinarily is taken to be positive, and $A_k$ either equals $J^{(k)}$ (the negative of the Hessian matrix at $\theta = \tilde\theta^{(k)}$) or is some approximation to it. The matrix $A_k + \lambda_k M_k$ will be positive definite provided $\lambda_k$ is taken to be sufficiently large (even if $A_k$ is indefinite). When $A_k = J^{(k)}$ and $M_k = I$, the search direction employed in this method can be regarded as a compromise between the steepest ascent direction and the Newton-Raphson direction. Based on scaling considerations, a good choice for $M_k$ is to take it to be the diagonal matrix whose diagonal elements are the absolute values of the diagonal elements of $A_k$, except that zeros are replaced by ones (Bard 1974, Sect. 5-8). The scalar $\lambda_k$ should be chosen so that $L_1\{\tilde\theta^{(k+1)}; y\} > L_1\{\tilde\theta^{(k)}; y\}$. Algorithms for determining a suitable $\lambda_k$ are discussed by Bard (1974); Goldfeld, Quandt, and Trotter (1966); Marquardt (1963); and Powell (1970). The reason for taking $\rho_k = 1$ in this approach is that the step size, as well as the search direction, is taken into account in choosing $\lambda_k$. Just as the computations per iteration can ordinarily be decreased by going from the Newton-Raphson algorithm to the method of scoring, we can expect the computations per iteration in the above method to be reduced by taking $A_k = B^{(k)}$ rather than $A_k = J^{(k)}$. When $A_k = B^{(k)}$, this method can be regarded as one natural extension of Marquardt's highly successful algorithm for solving nonlinear least-squares problems.
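A hedged sketch of one iteration of this Levenberg-Marquardt-type method, with Bard's diagonal choice of $M_k$ and with $\lambda_k$ simply increased by a constant factor until the objective improves (one simple possibility among the $\lambda_k$-selection algorithms cited above; all names here are hypothetical):

```python
import numpy as np

def marquardt_step(objective, grad_k, A_k, theta, lam=1e-3, factor=10.0):
    """One iteration with N_k = (A_k + lam * M_k)^{-1} and rho_k = 1.

    M_k is diagonal with entries |diag(A_k)|, zeros replaced by ones (the
    scaling choice attributed to Bard above).  lam is increased until the
    objective improves, so the full step is acceptable on exit; as lam
    grows, the direction bends from Newton-Raphson toward (scaled)
    steepest ascent.
    """
    d = np.abs(np.diag(A_k))
    M_k = np.diag(np.where(d == 0.0, 1.0, d))
    f_old = objective(theta)
    while lam < 1e12:                  # safeguard against a (numerically) zero gradient
        step = np.linalg.solve(A_k + lam * M_k, grad_k)   # the direction N_k g_k
        if objective(theta + step) > f_old:
            return theta + step, lam
        lam *= factor
    return theta, lam
```

Taking $A_k = B^{(k)}$ in place of $J^{(k)}$ in this sketch gives the less expensive scoring-based variant discussed above.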

The search for improved optimization algorithms is an ongoing process. Recent progress was reviewed by Powell (1970) and Murray (1972). One relatively new class of methods consists of the gradient methods known as variable-metric methods. These methods use an $N_k$ whose construction does not require second-order partial derivatives (or their expected values) but which nevertheless approximates $\{J^{(k)}\}^{-1}$ for sufficiently large $k$; a valuable feature in many instances is that second-order partial derivatives need not be computed. Perhaps the best known of the variable-metric algorithms is the Davidon-Fletcher-Powell algorithm described by Powell (1970). It has been widely used and has been very successful.

6.3 Modifications to Accommodate Constraints on $\theta$

Ordinarily, the space to which $\theta$ is constrained has a representation of the form

$$\Omega = \{\theta : r_1(\theta)\ [>,\,\geq]\ 0, \ldots, r_d(\theta)\ [>,\,\geq]\ 0\}\ ,\qquad (6.7)$$

for some functions $r_1, \ldots, r_d$. Here, $[>,\,\geq]$ is used to indicate that the inequality can be a strict inequality or not. For example, in our formulations of the ordinary ANOVA models, we can let $d = c + 1$ and $r_i(\theta) = \theta_i$ $(i = 1, \ldots, c+1)$, and take the last inequality in (6.7) to be a strict inequality.

As noted in Section 6.2, Henderson's iterative algorithm for computing ML estimates of variance components and its REML analog are not affected by the constraints on the parameter space, i.e., by the nonnegativity constraints on the variance components. With these algorithms, negative components are simply never encountered. Unfortunately, none of the gradient algorithms discussed in Section 6.2 has this kind of property. When they are applied, e.g., to the maximization of $L$, $L_1^*$, or $L_1$, any of them can in general produce an iterate that lies outside the constraint space. In particular, in the case of the ordinary mixed ANOVA models, they can produce an iterate with negative values for one or more of the variance components. (This is also true of Anderson's iterative procedure; see Miller (1973).) Hemmerle and Hartley (1973) encountered this difficulty in applying the Newton-Raphson method to the problem of computing ML estimates for $\sigma_{c+1}^2$ and for the positive square roots of the ratios $\gamma_1, \ldots, \gamma_c$. When an iterate was obtained with negative or nearly negative values for one or more elements, they set those elements equal to zero and in effect constrained them to be zero on subsequent iterations. This approach to the problem is not satisfactory, because it can cause the procedure to converge to a point that is not even a constrained local maximum of $L_1^*$, let alone a constrained global maximum, i.e., a ML estimate. (See the discussion by Bard 1974, Sect. 6-3.) The same criticism applies, though to a lesser extent, to the approach taken by Miller (1973) in using Anderson's procedure to compute ML estimates of variance components. He disregarded the nonnegativity constraints, unless the procedure converged to a vector having one or more negative components, in which case he restarted the algorithm with those components subsequently constrained to remain at zero. He continued to fix any zero components and to restart the algorithm until no negative values were obtained. Because iterates are permitted that can lie outside the parameter space, there is an additional difficulty with Miller's approach: the procedure may on occasion call for evaluating items that depend on $\theta$ at points at which they are ill-conditioned or even undefined.

Satisfactory techniques for modifying unconstrained maximization algorithms so as to take into account inequality constraints are discussed by Bard (1974), Beltrami (1970), and Gill and Murray (1974). At least three of these techniques more or less meet our needs: (i) the penalty technique, (ii) the gradient projection technique, and (iii) the transformation technique.

There are actually many penalty techniques. Among these, the interior techniques, which cause each iterate to lie in the interior of the parameter space, are the best suited for the ML or REML estimation of $\theta$. One interior technique is that proposed by Carroll (1961). In terms of the maximization of $L_1$, Carroll's technique is to apply the unconstrained maximization algorithm to the function

$$\phi(\theta) = L_1(\theta; y) - \sum_{j=1}^{d} \phi_j / r_j(\theta)\ ,$$

where $\phi_1, \ldots, \phi_d$ are small positive constants, rather than to $L_1$ itself. The algorithm is started in the interior of the constraint space and, at each iteration, the distance traversed in the indicated search direction is limited (if necessary) so that the resulting iterate is again interior to the constraint space. The underlying philosophy is that the function $\phi$ is close to $L_1$ except in the neighborhood of boundaries, where it assumes very large negative values, which serve as barriers that deflect the algorithm. Ordinarily, the maximization of $\phi$ should be carried out for more than one set of values of $\phi_1, \ldots, \phi_d$. When convergence is obtained for one set, the values of $\phi_1, \ldots, \phi_d$ are reduced, and the algorithm is applied to $\phi$ again, starting at the point of convergence in the previous application. The process is terminated when reductions in $\phi_1, \ldots, \phi_d$ no longer produce significant changes in the point of convergence.
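As a minimal illustration of Carroll's interior technique, consider the following sketch, in which $L_1$ and the constraint functions $r_j$ are simple stand-ins and exterior points are rejected outright (a crude surrogate for limiting the distance traversed so that each iterate remains interior):

```python
import numpy as np
from scipy.optimize import minimize

# Stand-ins: a concave objective with maximum at (1.0, -0.5), interior to
# the region r_1(theta) = theta_1 >= 0, r_2(theta) = theta_2 + 2 >= 0.
def L1(theta):
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 0.5) ** 2)

constraints = [lambda t: t[0], lambda t: t[1] + 2.0]

def neg_phi(t, c):
    r_vals = [r(t) for r in constraints]
    if min(r_vals) <= 0.0:
        return np.inf               # reject exterior points: iterates stay interior
    return -(L1(t) - sum(c / rv for rv in r_vals))   # phi = L1 - sum_j c/r_j

theta, c = np.array([0.5, 0.0]), 1.0                 # interior starting value
for _ in range(6):
    theta = minimize(neg_phi, theta, args=(c,), method="Nelder-Mead").x
    c *= 0.1                        # reduce the barrier constants, restart at last solution
print(theta)                        # approaches (1.0, -0.5)
```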

The gradient projection technique can be used whenever all of the inequality constraints are linear constraints (as in computing ML or REML estimates of variance components), i.e., whenever $r_i(\theta) = u_i'\theta - c_i$ for some $m \times 1$ vector $u_i$ and some scalar $c_i$ $(i = 1, \ldots, d)$. In conjunction with the gradient projection technique, we suppose that none of the inequality constraints are strict inequalities, i.e., the constraints are $u_i'\theta \geq c_i$ $(i = 1, \ldots, d)$, and that $u_i$ has been normalized so that $u_i'u_i = 1$. (A strict inequality $u_i'\theta > c_i$ can be approximated by the constraint $u_i'\theta \geq c_i + \epsilon_i$, where $\epsilon_i$ is some small positive constant.) Any of the unconstrained gradient algorithms can be modified by the gradient projection technique. At the completion of the $k$th iteration of the modified algorithm, we have $u_i'\tilde\theta^{(k)} > c_i$ for some values of $i$ and $u_i'\tilde\theta^{(k)} = c_i$ for the remainder. On the $(k+1)$st iteration, certain of the latter constraints are treated as active constraints. The active constraints are selected in accordance with algorithms like those discussed by Gill and Murray (1974). Put $U_k = (u_{j(1)}, \ldots, u_{j(t)})$, where constraints $j(1), \ldots, j(t)$ are the active constraints, and denote by $\tilde A_k$ the choice for $N_k$ in the unconstrained gradient algorithm. The gradient projection technique is to use the gradient procedure which has

$$N_k = [I - \tilde A_k U_k (U_k' \tilde A_k U_k)^{-1} U_k'] \tilde A_k\ ,$$

and which restricts $\rho_k$ so that no constraint is violated but which otherwise determines $\rho_k$ as in the unconstrained case. The gradient projection technique thus modifies the original search direction $\tilde A_k \{\partial L_1/\partial\theta\}^{(k)}$ by projecting it into the space determined by those vectors $\theta$ satisfying $U_k'\theta = 0$. This technique is considered to be superior to the penalty technique for handling linear constraints, especially when it is suspected that the maximizing value may be located on a boundary.
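A minimal sketch of the projected search direction, assuming the active-constraint vectors $u_{j(1)}, \ldots, u_{j(t)}$ have already been selected and normalized (the names are illustrative):

```python
import numpy as np

def projected_direction(A_tilde, grad, U_active):
    """Gradient projection direction for linear constraints u_i' theta >= c_i.

    A_tilde is the matrix the unconstrained gradient algorithm would use
    for N_k; U_active has the (normalized) u_i of the active constraints
    as its columns.  The returned direction w satisfies U_active' w = 0,
    so movement stays within the active constraint boundaries.
    """
    if U_active.shape[1] == 0:                   # no active constraints
        return A_tilde @ grad
    AU = A_tilde @ U_active
    P = np.eye(A_tilde.shape[0]) - AU @ np.linalg.solve(U_active.T @ AU, U_active.T)
    return (P @ A_tilde) @ grad

# Check: with one active constraint u = e1 (theta_1 sitting at its bound),
# the projected direction has zero first component.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([-1.0, 2.0])
u = np.array([[1.0], [0.0]])
w = projected_direction(A, g, u)
print(w, u.T @ w)    # u'w is ~0: the active constraint is not violated
```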
Sometimes a constrained maximization problem can be transformed into an unconstrained maximization problem by a change of variables (Bard 1974). For example, the ordinary ANOVA models can be parameterized so that $R = \theta_m^2 I$, $D = \mathrm{diag}[\theta_1^2 I, \ldots, \theta_{m-1}^2 I]$, and $\Omega = \{\theta : \theta_m \neq 0\}$. Here, $\sigma_1^2 = \theta_1^2, \ldots, \sigma_m^2 = \theta_m^2$. To obtain a ML or REML estimate of $\sigma_1^2, \ldots, \sigma_m^2$, we can now maximize $L_1^*$ or $L_1$ with respect to $\theta$, subject only to the constraint $\theta_m \neq 0$, and then transform the maximizing vector by squaring its elements. This maximization problem is, for all practical purposes, unconstrained, because the constraint $\theta_m \neq 0$ ordinarily never comes into play. This kind of approach was used by Hartley and Vaughn (1972). One possible drawback in using this technique, to compute REML estimates of variance components for example, is that additional stationary points of $L_1$ are introduced, i.e., for $\theta_i = 0$, $\partial L_1/\partial\theta_i = 0$ even though $\partial L_1/\partial\sigma_i^2 \neq 0$. Thus, we should use this technique only in conjunction with algorithms that guarantee at least some increase in the value of the objective function on each iteration.
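The squaring transformation, and the extra-stationary-point caveat, can be illustrated with a toy stand-in for $L_1$ (all names here are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative stand-in for the restricted log-likelihood of two variance
# components (sigma_1^2, sigma_2^2): a simple concave function.
def L1(sigma2):
    target = np.array([0.5, 2.0])
    return -np.sum((sigma2 - target) ** 2)

# Transformation technique: parameterize sigma_i^2 = theta_i^2 and maximize
# over theta unconstrained; squaring the maximizer then gives nonnegative
# variance-component estimates automatically.
neg_L1_theta = lambda theta: -L1(theta ** 2)
theta_hat = minimize(neg_L1_theta, x0=np.array([1.0, 1.0])).x
print(theta_hat ** 2)          # approximately [0.5, 2.0]

# Caveat noted in the text: theta_i = 0 is a stationary point of L1(theta^2)
# in theta_i even when dL1/dsigma_i^2 != 0 there, since
# dL1/dtheta_i = 2 * theta_i * dL1/dsigma_i^2; hence start away from zero
# and use an algorithm that increases the objective on every iteration.
```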
6.4 Discussion

In a given application, it may be possible to improve the performance of the various iterative optimization algorithms by first transforming the variables. Most of these algorithms are at their best when applied to functions that are at least approximately quadratic. Thus any transformation that makes the function more closely resemble a quadratic function over the relevant region should be helpful. In particular, in using the Newton-Raphson algorithm to compute ML estimates for the ordinary ANOVA models, Hemmerle and Hartley (1973) found that the behavior of the algorithm could be improved significantly by parameterizing in terms of $\sqrt{\gamma_1}, \ldots, \sqrt{\gamma_c}, \gamma_{c+1}$ rather than in terms of $\gamma_1, \ldots, \gamma_{c+1}$.

In general, it will be more efficient to compute ML estimates of $\alpha$ and $\theta$ by applying the various iterative optimization algorithms to the "reduced" function $L_1^*$ rather than to $L$ itself. Similarly, when the problem of maximizing $L_1^*$ or $L_1$ can be reduced in dimension by analytical means to the problem of maximizing $L_2^*$ or $L_2$ (refer to Sections 4.1 and 4.3), it is to be expected that it will be more efficient to compute ML or REML estimates of $\theta$ by applying the iterative algorithms to the latter functions. Analytical reductions have proved to be useful in nonlinear least-squares problems (see, e.g., Lawton and Sylvestre 1971).

There is ordinarily no assurance that a value of $\theta$ obtained by applying an iterative maximization algorithm to $L$, $L_1^*$, or $L_2^*$ or to $L_1$ or $L_2$ is a ML or REML estimate of $\theta$. Even if such a $\theta$ value is obtained by starting the algorithm with what is thought to be an excellent guess or estimate, it is good practice to apply the algorithm several more times, using a different starting point on each occasion. If these repetitions all yield the same $\theta$ value, we can be more confident that we have located a ML or REML estimate.

Actual numerical experience in using the various iterative algorithms to compute ML or REML estimates of variance components seems to be very limited and is largely confined to a variation of the method of steepest ascent (Hartley and Vaughn 1972), the Newton-Raphson procedure (Hemmerle and Hartley 1973; Corbeil and Searle 1976a; Jennrich and Sampson 1976), the method of scoring (Jennrich and Sampson 1976), and Anderson's method (Miller 1973).

7. APPROXIMATING THE RESTRICTED MAXIMUM LIKELIHOOD APPROACH

Efficient computational algorithms for producing ML or REML estimates of $\theta$ can be devised by making use of the results outlined in Sections 3 through 6. As the speed of electronic computers increases, the number of settings in which it is feasible to compute ML or REML estimates also increases. Nevertheless, there remain numerous situations where their computation is unthinkable. The latter situations are essentially those where the computations necessary to form and solve the linear system (3.3) are too extensive. In this section, we outline an approach to the estimation of $\theta$ that can be viewed as an approximation to the REML approach. This approximate approach can be used when the computation of the exact ML or REML estimate is too demanding.

In cases where the function $L_1$ is known to have a unique stationary point which lies in the constraint space $\Omega$ and which corresponds to a global maximum, the problem of computing a REML estimate of $\theta$ is essentially that of forming the system of nonlinear equations $\partial L_1/\partial\theta = 0$ and solving it for $\theta$. If the computations required to evaluate $\partial L_1/\partial\theta$ are too extensive, REML estimation cannot be undertaken. The equations $\partial L_1/\partial\theta = 0$ consist in effect of $m$ translation-invariant quadratic forms set equal to their expectations. This observation suggests that, when the REML approach is computationally unfeasible, we take our estimate of $\theta$ to be the solution $\tilde\theta$ to $G(\theta; y) = 0$, where $G(\theta; y) = Q(\theta; y) - E(Q)$, and where $Q(\theta; y)$ is a vector whose elements $Q_i = y'\Gamma_i y$ $(i = 1, \ldots, m)$ consist of $m$ translation-invariant quadratic forms which resemble those used in REML estimation but which are easier to evaluate. The quadratic forms used in REML are

$$(1/2)(y - X\tilde\alpha)'V^{-1}(\partial V/\partial\theta_i)V^{-1}(y - X\tilde\alpha)$$
$$= (1/2)(y - X\tilde\alpha - Z\tilde b)'R^{-1}(\partial V/\partial\theta_i)R^{-1}(y - X\tilde\alpha - Z\tilde b) \qquad (7.1)$$
$$= (1/2)(y - Z\tilde b)'S(\partial V/\partial\theta_i)S(y - Z\tilde b)\ , \qquad (7.2)$$

$i = 1, \ldots, m$. For $i$ such that $D$ depends on $\theta_i$ but $R$ and $Z$ do not, these quadratic forms have the additional representation

$$-(1/2)\,\tilde b'(\partial D^{-1}/\partial\theta_i)\tilde b\ , \qquad (7.3)$$

provided $\theta$ is such that $D$ is nonsingular. One technique for "approximating" the quadratic forms used in REML is to replace $\tilde\alpha$ and/or $\tilde b$ in (7.1), (7.2), or (7.3) with $\bar\alpha = Hy$ and $\bar b = \Lambda(y - X\bar\alpha)$, where $H$ is a $p \times n$ matrix such that $E(X\bar\alpha) = X\alpha$ and $\Lambda$ is a $q \times n$ matrix, both of which must be specified. The elements of $H$ and $\Lambda$ may be functions of $\theta$. The matrices $H$ and $\Lambda$ should be chosen so that, for the case where $\theta^+$ is known, $X\bar\alpha$ and $\bar b$ with $\theta = \theta^+$ are good estimators of $X\alpha$ and $b$, but, at the same time, they must be such that $\bar\alpha$ and $\bar b$ are computable for any given $\theta$ value. In cases where $R$ is hard to invert, we could replace $R^{-1}$ in (7.1) or (7.2) with some "approximation," as well as substituting $\bar\alpha$ and $\bar b$ for $\tilde\alpha$ and $\tilde b$.

The expression $[E(K)]^{-1}[\mathrm{var}(Q)][E(K')]^{-1}$, where $K(\theta; y)$ is the $m \times m$ matrix whose $j$th column is $-\partial G/\partial\theta_j$, may furnish a useful approximation to $\mathrm{var}(\tilde\theta)$ for "large" samples (Harville 1975). When $Q_1, \ldots, Q_m$ are the quadratic forms actually used in REML rather than approximations, this expression simplifies to $B^{-1}$.

There are several hazards in estimating $\theta$ by the approximate REML approach described above, i.e., by solving the equations $G(\theta; y) = 0$, where $Q_i$ is given by (7.1),
(7.2), or (7.3) with $\bar\alpha$ and $\bar b$ substituted for $\tilde\alpha$ and $\tilde b$. In REML, the likelihood equations may not have a solution that lies in the constraint space $\Omega$, and, even if there are solutions in $\Omega$, some or all of them may not correspond to maximizing values of $L_1$, i.e., to REML estimates. Similarly, in the approximate REML approach, there may not exist a solution to $G(\theta; y) = 0$ that lies in $\Omega$ and, even if such a solution does exist, it may not be a desirable estimate. In implementing the REML approach, we were able to circumvent these difficulties, at least to some extent, by using "hill-climbing" techniques, which force increases in $L_1$ at each iteration, preventing convergence to undesirable stationary points, and which can be modified to accommodate constraints. This observation points the way to what may be a useful modification of the approximate REML approach. Instead of merely solving the equations $G(\theta; y) = 0$, we could proceed just as though we were maximizing a function whose gradient vector is $G(\theta; y)$. We could use various of the gradient algorithms described in Section 6.2 to maximize this function, with appropriate modification for constraints as described in Section 6.3. (See Harville 1975 for further information.) The final iterate would comprise our estimate of $\theta$.

In applying various of the gradient algorithms in our approximate REML approach, we must, on each iteration, evaluate $Q_i$, $E(Q_i)$, $\partial Q_i/\partial\theta_j$, $\partial[E(Q_i)]/\partial\theta_j$, and/or $E(\partial Q_i/\partial\theta_j)$ $(i, j = 1, \ldots, m)$. We have

$$E(Q_i) = \mathrm{tr}(\Gamma_i V) = \mathrm{tr}(R^{*\prime}\Gamma_i R^*) + \mathrm{tr}(D^{1/2}Z'\Gamma_i Z D^{1/2})\ ,$$

so that $E(Q_i)$ can be evaluated by performing the same operations on each column of $R^*$ and each column of $ZD^{1/2}$ as are performed on $y$ in computing $Q_i$. This technique is essentially Hartley's method of synthesis (see Rao 1968). The computation of the other required items can be approached in a similar manner.
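A sketch of the synthesis computation of $E(Q_i)$, under the assumption that $R = R^*R^{*\prime}$ and that $D^{1/2}$ is a square root of $D$; the routine q_form stands for whatever specialized code evaluates the quadratic form from a vector, and all names are hypothetical:

```python
import numpy as np

def expected_Q_by_synthesis(q_form, R_star, Z, D_half):
    """Hartley-type synthesis: E(Q_i) = tr(Gamma_i V) with V = R + Z D Z'.

    q_form(v) evaluates v' Gamma_i v (the same routine used on the data
    vector y).  Summing q_form over the columns of R* and of Z D^{1/2}
    gives tr(Gamma_i R) + tr(Gamma_i Z D Z') = E(Q_i), without ever
    forming Gamma_i explicitly.
    """
    ZD = Z @ D_half
    return sum(q_form(col) for col in R_star.T) + sum(q_form(col) for col in ZD.T)

# Toy check against an explicit Gamma_i.
rng = np.random.default_rng(0)
n, q = 5, 2
Gamma = rng.standard_normal((n, n)); Gamma = Gamma + Gamma.T
Z = rng.standard_normal((n, q))
R_star = np.eye(n) * 1.3                 # R = 1.69 I
D_half = np.diag([0.7, 2.0])             # D = diag(0.49, 4.0)
q_form = lambda v: v @ Gamma @ v
V = R_star @ R_star.T + Z @ D_half @ D_half.T @ Z.T
print(np.isclose(expected_Q_by_synthesis(q_form, R_star, Z, D_half),
                 np.trace(Gamma @ V)))   # True
```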
8. RELATIONSHIPS OF MAXIMUM LIKELIHOOD AND RESTRICTED MAXIMUM LIKELIHOOD TO OTHER METHODS

8.1 MIVQUE's and MINQUE's

Much of the recent literature on the problem of estimating variance components, and more generally on the problem of estimating $\theta$ when $V$ has the representation (2.3), has centered on the derivation of estimators that have minimum MSE at some point in the parameter space, i.e., that are locally best when attention is restricted to estimators satisfying various conditions. The initial work was done by Townsend (1968) and by Townsend and Searle (1971). They derived exact expressions for the locally best quadratic unbiased estimators of the two variance components associated with the unbalanced one-way random ANOVA model, under the assumptions that $y$ is normal and the mean vector is 0. Harville (1969a) considered the same setting but dropped the assumption that the mean vector is null. Harville gave some results on estimators that are locally best in the class of quadratic unbiased estimators and in the class of translation-invariant quadratic unbiased estimators, though his results were left in very inconvenient form. These early efforts were generalized and greatly improved upon by LaMotte (1970, 1971, and 1973). LaMotte's results apply to all linear models for which $V$ has the representation (2.3), though he did assume normality. He considered several classes of estimators for a linear function $\lambda'\theta$ and, for each class, produced convenient representations for the locally best estimators. In particular, he showed that, when attention is restricted to translation-invariant quadratic unbiased estimators, the estimator that is locally best at $\theta = \theta^*$ is $\lambda'\hat\theta$, where $\hat\theta$ is any solution to the linear system

$$[B(\theta^*)]\hat\theta = d(\theta^*) \qquad (8.1)$$

(provided that $\lambda'\theta$ is estimable in the class of translation-invariant quadratic estimators, which is the case if and only if the equations $[B(\theta^*)]c = \lambda$ have a solution for $c$). Rao (1971b and 1972) independently obtained similar results and, in addition, indicated extensions to nonnormal cases. Following Rao, we use MIVQUE as an abbreviation for locally best (minimum variance) translation-invariant quadratic unbiased estimator. (The E in MIVQUE can also stand for estimation.)

In general, quadratic unbiased estimators of $\lambda'\theta$ (including MIVQUE's) can yield estimates that violate the constraints on the parameter space, so that, strictly speaking, they are not estimators at all. Nevertheless, as observed by Kempthorne (Searle 1968, p. 783), they can be regarded as useful condensations of the data, just as true estimators are. What is questionable is the practice of comparing these pseudo-estimators on the basis of their MSE's. For reasons discussed by Harville (1969b, Sect. 3.3.5), such comparisons are potentially misleading.

The traditional approach to the estimation of $\theta$, when $V$ has the representation (2.3) as in the case of the ordinary ANOVA models, is to equate $m$ translation-invariant quadratic forms (that are not functionally dependent on $\theta$) to their expectations and solve the resulting linear system for $\theta$. The $i$th of the likelihood equations $\partial L/\partial\theta = 0$ depends on the data only through the quadratic form $(1/2)(y - X\alpha)'V^{-1}G_iV^{-1}(y - X\alpha)$. Suppose that, in this quadratic form, we substitute $\tilde\alpha(\theta)$ for $\alpha$ and then replace $\theta$ with a fixed value $\theta^*$. The result is a translation-invariant quadratic form that is functionally independent of $\theta$. LaMotte (1970) considered estimating $\theta$ by equating the $m$ translation-invariant quadratic forms generated in this way to their expectations. He found that the resulting linear system is the same as the linear system (8.1) associated with MIVQUE. Thus, in the case of assumed normality, this approach is completely equivalent to the MIVQUE approach.

Rao (1970, 1971a, and 1972) proposed an intuitive estimation procedure that can be used in particular to estimate linear functions of the variance components associated with the ordinary ANOVA models. He observed in effect that, if $b$ and $e$ (the realized or sample values)
were known, a natural estimator for

$$\lambda'\theta = \sum_{i=1}^{c+1} \lambda_i\sigma_i^2$$

would be

$$(\lambda_{c+1}/n)\,e'e + \sum_{i=1}^{c} (\lambda_i/q_i)\,b_i'b_i = \omega'A\omega\ , \qquad (8.2)$$

where $\omega' = (b', e')$ and where $A$ is a suitably defined matrix. Since $b$ and $e$ are in fact unknown, Rao suggested estimating $\sum_i \lambda_i\sigma_i^2$ by the translation-invariant quadratic unbiased estimator that most closely resembles (8.2). More precisely, observing that $y'\Gamma y = \omega'U'\Gamma U\omega$, with $U = (Z, I)$, for any translation-invariant quadratic estimator $y'\Gamma y$, he proposed the estimator $y'\Gamma^* y$, where $\Gamma^*$ minimizes $\|U'\Gamma U - A\|$ for $\Gamma$ such that $y'\Gamma y$ is a translation-invariant quadratic unbiased estimator of $\sum_i \lambda_i\sigma_i^2$. Here, $\|\cdot\|$ denotes a matrix norm. It can be shown that, when the Euclidean norm is used, $y'\Gamma^* y = \lambda'\hat\theta$, where, with $\theta^* = 1$, $\hat\theta$ is a solution to the linear system (8.1). Rao went on to observe that the difference between a translation-invariant quadratic estimator $y'\Gamma y$ and $\omega'A\omega$ can be expressed as $\eta'\Delta(U'\Gamma U - A)\Delta\eta$, where $\Delta = \mathrm{diag}(\sigma_1 I, \ldots, \sigma_{c+1} I)$ and $\eta$ represents the standardized vector $\Delta^{-1}\omega$. Taking $\Delta^*$ to be the value of $\Delta$ at $\theta = \theta^*$, where the value of $\theta^*$ can be based on prior information, we could also consider estimating $\sum_i \lambda_i\sigma_i^2$ by $y'\Gamma^* y$, where now $\Gamma^*$ minimizes $\|\Delta^*(U'\Gamma U - A)\Delta^*\|$ for translation-invariant quadratic unbiased estimators $y'\Gamma y$. Again, when the Euclidean norm is employed, it can be shown that $y'\Gamma^* y = \lambda'\hat\theta$, where $\hat\theta$ is any solution to the linear system (8.1). Rao called these estimators MINQUE's (minimum norm quadratic unbiased estimators). It is clear that a MINQUE of $\sum_i \lambda_i\sigma_i^2$ (based on a Euclidean norm) is the same as a MIVQUE (derived on the basis of the normality assumption).

Several observers (Harville 1969b; LaMotte 1970; and Rao 1972) have suggested an iterative MIVQUE procedure. The iterates could be defined in terms of the linear system (8.1). If the procedure converges to some point in the parameter space, that point is necessarily a stationary point of $L_1$ (see Section 6.1). Thus, if we disregard any complications that might be caused by constraints on the parameter space or by nonconvergence or convergence to a point that does not correspond to a maximum of $L_1$, then iterative MIVQUE is identical to REML. A similar observation was made by Hocking and Kutner (1975). Note that the iterates produced by the iterative MIVQUE procedure are the same as those defined by the REML analog of Anderson's iterative algorithm (again refer to Section 6.1), implying in particular that the initial iterate of the REML version of Anderson's procedure is a MIVQUE.
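A minimal sketch of the iterative MIVQUE procedure, assuming routines for $B(\theta)$ and $d(\theta)$ are available (their construction is model specific and is not shown here):

```python
import numpy as np

def iterative_mivque(B, d, theta0, max_iter=50, tol=1e-10):
    """Iterative MIVQUE: repeatedly solve the linear system (8.1),
    B(theta_k) theta_{k+1} = d(theta_k), re-evaluating the system at each
    new iterate.  A convergent fixed point is a stationary point of the
    restricted log-likelihood L1, so, constraints and multiple stationary
    points aside, the limit is a REML estimate; the very first iterate is
    itself a MIVQUE, locally best at theta0.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = np.linalg.solve(B(theta), d(theta))
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```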
Suppose that, assuming normality, there exists a UMIVQUE of $\lambda'\theta$, i.e., an estimator which, among all translation-invariant quadratic unbiased estimators of $\lambda'\theta$, has uniformly (for all $\theta \in \Omega$) minimum variance. Then, every MIVQUE of $\lambda'\theta$ is a UMIVQUE, implying that $\lambda'B^{-1}d$ is functionally independent of $\theta$. Taking $\hat\theta$ to be any REML estimate of $\theta$, it follows that $\lambda'\hat\theta$ agrees with the UMIVQUE of $\lambda'\theta$, provided that $\hat\theta$ satisfies $\partial L_1/\partial\theta = 0$, as would necessarily be the case if $\hat\theta$ were an interior point of $\Omega$. Moreover, if there is a UMIVQUE of every component of $\theta$, then $B^{-1}d$ is functionally independent of $\theta$, so that there is a $\hat\theta \in \Omega$ satisfying the REML equations $\partial L_1/\partial\theta = 0$ if and only if $B^{-1}d \in \Omega$, in which case the REML equations admit the explicit solution $\hat\theta = B^{-1}d$, and the REML version of Anderson's iterative procedure converges (to the UMIVQUE of $\theta$) in a single iteration. (See Gautschi (1959) and Graybill and Hultquist (1961) for some discussion of the existence of uniformly minimum variance quadratic unbiased estimators of variance components.)

Rao's MINQUE approach does not require any normality assumptions, nor is its intuitive appeal diminished by nonnormality. The fact that MIVQUE's derived on the basis of normality turn out to be MINQUE's in important instances may, because of the relationships between MIVQUE and REML noted above, indicate that ML or REML estimators of $\theta$ derived under normality assumptions are reasonable estimators even when the form of the distribution of $b$ and $e$ is unspecified.

8.2 ANOVA-Like Methods

The most commonly used methods for estimating variance components are the Methods 1, 2, and 3 set forth by Henderson (1953). In these methods, mean squares associated with various ANOVA tables are set equal to their expectations, and estimates are obtained by solving the resulting linear equations. (In Method 2, the data vector is corrected for fixed effects before forming the ANOVA table.) Searle (1968, 1971a, and 1971b) gave excellent descriptions of Henderson's methods and indicated various generalizations. Henderson's methods yield translation-invariant quadratic unbiased estimators. In balanced-data cases, these estimators coincide with the normality-derived REML estimators, provided the nonnegativity constraints on the variance components do not come into play (Patterson and Thompson 1974). In general, however, the only parallel between Henderson's methods and REML would seem to be that both are based on equating translation-invariant quadratic forms to their expectations. In REML, the quadratic forms are functions of the variance components, the expectations are nonlinear, and modifications are incorporated to account for the nonnegativity constraints; while, in Henderson's methods, the quadratic forms are functionally independent of the variance components, the expectations are linear, and negative estimates of variance components can be realized. Cunningham and Henderson (1968) proposed a modified version (subsequently corrected by R. Thompson (1969)) of Henderson's Method 3 which seems more akin to REML. It uses equations of the form (3.8) in place of the normal equations ordinarily used in Method 3 to form reductions in sums of squares, with the consequence that the quadratic forms are no longer free of the variance components and an iterative process is necessary.

For those ANOVA models that can be parameterized so that $V$ has the form (2.3) for some mutually orthogonal idempotent matrices $G_1, \ldots, G_{c+1}$, Nelder (1968) proposed an iterative ANOVA-like method for estimating variance components that is essentially equivalent to REML (see Patterson and Thompson 1974).

One problem with Henderson's methods for estimating variance and covariance components is that the methods are not necessarily well defined. That is, it is not always clear which mean squares from what ANOVA tables should be used (see Searle 1971a or 1971b). How these methods should be extended to the general problem of estimating $\theta$ is even less clear. In contrast, ML and REML estimators are always well defined (at least conceptually). Moreover, except for balanced-data cases, little is known about the goodness of the Henderson estimators, other than that they are unbiased and translation invariant. It is well known that, at least in particular cases, there are biased estimators that have uniformly (assuming normality) smaller MSE's than the Henderson estimators (see Klotz, Milton, and Zacks 1969). What is more surprising is the recent discovery by Seely (1975) and by Olsen, Seely, and Birkes (1976) that, at least in the case of most unbalanced mixed- or random-effects models having one random factor (i.e., $c = 1$), there are translation-invariant quadratic unbiased estimators of $\sigma_1^2$ that have uniformly smaller variance than the Henderson Method-3 estimator. In contrast, MIVQUE estimators, which (as noted in Section 8.1) are closely related to REML estimators, are admissible in the class of translation-invariant quadratic unbiased estimators. Moreover, Olsen, Seely, and Birkes (1976) constructed, for a particular case, a MIVQUE estimator that is uniformly better than the corresponding Henderson Method-3 estimator. These revelations would seem to constitute a strong argument for using REML in preference to Henderson's methods when REML is feasible computationally, and possibly for using an approximate REML approach similar to the one outlined in Section 7 when it is not.

8.3 Bayesian Methods

A review of some pre-1970 Bayesian results on inference for variance components was given by Harville (1969b). When loss is proportional to squared error, the estimator of a variance component (or of any other parameter) that minimizes Bayes risk is the parameter's posterior mean. However, in all but fairly simple cases, the computation of the posterior mean of a variance component or, more generally, of $\theta$ is found to be unfeasible even when numerical integration techniques are used. Moreover, if an improper prior is employed in place of the "true" prior, the posterior mean may, because of its sensitivity to the tails of the posterior density, represent a rather unsatisfactory condensation of the data. Because of these difficulties with the posterior mean, posterior modes are sometimes proposed as estimators. We can use either the mode of the marginal posterior density of a parameter or the relevant component of the mode of the joint posterior density of that parameter and various other parameters. It would seem preferable to use the posterior density that has the maximum possible number of "nuisance" parameters integrated out (at least among those that have proper priors). A posterior mode can be computed numerically by techniques like those outlined in Sections 6.2 and 6.3. Moreover, a posterior mode is insensitive to the tails of the posterior density.

Suppose that $y$ has the representation (2.1). If we wish to analyze the data by Bayesian techniques, we need to specify a prior distribution for $\alpha$ and $\theta$. Lindley and Smith (1972) suggested in effect that, in many cases, it is possible to redefine the terms of the model (2.1) so as to arrive at a second model of the form (2.1) in which it is reasonable, a priori, to take the components of $\alpha$ to be independently and uniformly distributed over the real line and to be independent of $\theta$, even though this assumption might not be realistic for the original model. The Lindley-Smith technique amounts to expressing various of the fixed effects in the original model as deviations from hyperparameters, expressing the hyperparameters as deviations from hyper-hyperparameters or "second-order" hyperparameters, expressing second-order hyperparameters as deviations from third-order hyperparameters, etc. In the redefined model, the highest-order hyperparameters comprise the components of $\alpha$; the components of $b$ include the deviations of various orders or, possibly, appropriate linear combinations of those deviations, together with the components of the original $b$ vector; and additional "parameters" may be inserted into the original $\theta$ vector to accommodate the new entries in the vector $b$. Of course, in the new model, some components of $b$ are random variables only in a subjective sense, but this is not objectionable if a Bayesian analysis is anticipated.

Lindley and Smith proposed, as an estimate for $\theta$, the $\theta$ component of the mode of the joint posterior density of $\alpha$, $b$, and $\theta$. They acknowledged that this estimate may be unsatisfactory if vague priors are assumed for certain components of $\theta$. In fact, their approach can lead to estimators of variance components that are identically equal to zero when used with vague priors. Lacking evidence to the contrary, it must be assumed that their approach can also lead to unsatisfactory estimators when used with informative priors, though they may be less obviously unsatisfactory. The problem with their approach may stem from the severe "dependencies" that undoubtedly exist between components of $b$ and components of $\theta$ in the joint posterior density of $\alpha$, $b$, and $\theta$, which may lead to the $\theta$ component of the mode of the joint posterior density being far removed from, say, $E(\theta \mid y)$.

A seemingly superior approach would be to take the estimate of $\theta$ to be the $\theta$ component of the mode of the marginal posterior density of $\alpha$ and $\theta$ or, better yet, the mode of the marginal posterior density of $\theta$. Suppose the distribution of $b$, $e$ is multivariate normal, $p^* = p$, and
a priori the components of $\alpha$ are independently and uniformly distributed over the real line and are independent of $\theta$, so that the joint prior density of $\alpha$ and $\theta$ is proportional to $h(\theta)$ for some function $h$. For purposes of estimating $\theta$ alone, it can, for reasons noted by Harville (1974), make sense to adopt such a prior density even if the model is such that prior information on $\alpha$ is actually available. It follows from Harville (1974) that the marginal posterior density of $\theta$ (the density obtained by formally integrating out $\alpha$) is proportional to the product of $h(\theta)$ and the likelihood function of an arbitrary set of $(n - p^*)$ linearly independent error contrasts. For $h(\theta) = 1$, the $\theta$ component of the mode of the marginal posterior density of $\alpha$ and $\theta$ is simply the ML estimate, and the mode of the marginal posterior density of $\theta$ is the REML estimate. For $h(\theta) = [\det(B(\theta))]^{1/2}$, which is the Jeffreys prior for $\theta$ derived from $L_1$ alone, we have, in the case of the ordinary fixed ANOVA or regression models as defined by (4.3), that the $\theta$ component of the mode of the marginal posterior density of $\alpha$ and $\theta$ is the estimator $[1/(n + 2)](y - X\tilde\alpha)'(y - X\tilde\alpha)$, and the mode of the marginal posterior density of $\theta$ is the estimator $[1/(n - p^* + 2)](y - X\tilde\alpha)'(y - X\tilde\alpha)$. The latter estimator has a downward bias of "only" $2\theta_1/(n - p^* + 2)$ and has uniformly smaller MSE than both the ML and REML estimators of $\theta_1$ (except in the case $p^* = 2$, where it coincides with the ML estimator); in fact, it has uniformly smaller MSE than any other estimator of the form $(1/k)(y - X\tilde\alpha)'(y - X\tilde\alpha)$, so that it may have appeal for frequentists who care about MSE but not about small biases. It is an intriguing possibility that the pseudo-Bayesian procedure that estimates $\theta$ by maximizing

$$L_1(\theta; y) + (1/2)\log[\det(B(\theta))] \qquad (8.3)$$

for $\theta \in \Omega$ might be an equally satisfying procedure in more complicated settings.
more complicated settings. I n cases where the estimate of 8 is not of interest in
itself, but rather is to be used indirectly in estimating
9. FURTHER RESEARCH Al’a + a2’@(by substituting it for 8+ in (3.2)),the possible
There are still many aspects of the problem of esti- estimators of 8 should be compared on the basis of a loss
mating variance components, and more generally the function reflecting those intentions. Efron and Morris
problem of estimating 8, that need to be investigated. I n (1976) made some comparisons of this kind for a par-
this, the final section, an attempt is made to identify ticular case. Further work along these lines is needed.
some of these areas.
The “realistic” asymptotic results developed by Miller [Received-December 1976.1
(1973) for ML estimators of variance components for the
ordinary ANOVA models need to be extended to various REFERENCES
other models of the form (2.1) and to REML estimators.
Anderson, R.L., and Bancroft, T.A. (1952), Sfatistieal T h q in
The results of Weiss (1971 and 1973) should prove useful Research, New York: McGraw-Hill Book CO.
here. Also, for particular models, such as the ordinary Anderson, T.W. (1969), “Statistical Inference for Covariance
ANOVA models, it would be nice to know what parameteri- Matrices with Linear Structure,” in Multivarzizte Analysis-II, ed.
P.R. Krishnaiah, New York: Academic Press, 55-66.
zations produce the “fastest convergence” to asymptotic (1970), “Estimation of Covariance Matrices Which Are
normality. Linear Combinations or Whose Inverses Are Linear Combinations
I n Sections 3 and 5, results were described which can of Given Matrices,” in Essays in Probability and StutiStieS, eds.
R.C. Bose, I.M. Chakravarti, ‘P.C. Mahdanobis, C.R. Rao, and
be used t o exploit structure in the R,D, Z, and X matrices K.J.C. Smith, Chapel Hill, North Carolina: University of North
for purposes of computing L, L1*, or L1,their first- and Carolina Press,1-24.
second-order partial derivatives, and expected values of (1971), “Estimation of Covariance Matrices with Linear
Structure and Moving Average Processes of Finite Order,” Tech-
their second-order derivatives. Some explicit simplifica- nical Report No. 6, Department of Statistics, Stanford Univenity,
tions were given for the ordinary ANOVA models. It may Stanford, California.
(1973), "Asymptotically Efficient Estimation of Covariance Matrices with Linear Structure," Annals of Statistics, 1, 135-41.
Bard, Yonathan (1970), "Comparison of Gradient Methods for the Solution of Nonlinear Parameter Estimation Problems," SIAM Journal on Numerical Analysis, 7, 157-86.
(1974), Nonlinear Parameter Estimation, New York: Academic Press.
Beltrami, Edward J. (1970), An Algorithmic Approach to Nonlinear Analysis and Optimization, New York: Academic Press.
Box, George E.P., and Jenkins, Gwilym M. (1970), Time Series Analysis, San Francisco: Holden-Day.
Carroll, C.W. (1961), "The Created Response Surface Technique for Optimizing Nonlinear, Restrained Systems," Operations Research, 9, 169-84.
Corbeil, Robert R., and Searle, Shayle R. (1976a), "Restricted Maximum Likelihood (REML) Estimation of Variance Components in the Mixed Model," Technometrics, 18, 31-38.
, and Searle, Shayle R. (1976b), "A Comparison of Variance Component Estimators," Biometrics, 32, 779-91.
Cunningham, E.P., and Henderson, Charles R. (1968), "An Iterative Procedure for Estimating Fixed Effects and Variance Components in Mixed Model Situations," Biometrics, 24, 13-25.
Duncan, David B., and Horn, Susan D. (1972), "Linear Dynamic Recursive Estimation from the Viewpoint of Regression Analysis," Journal of the American Statistical Association, 67, 815-21.
Efron, Bradley, and Morris, Carl (1973), "Stein's Estimation Rule and Its Competitors-An Empirical Bayes Approach," Journal of the American Statistical Association, 68, 117-30.
, and Morris, Carl (1976), "Multivariate Empirical Bayes and Estimation of Covariance Matrices," Annals of Statistics, 4, 22-32.
Gautschi, Werner (1959), "Some Remarks on Herbach's Paper, 'Optimum Nature of the F-Test for Model II in the Balanced Case,'" Annals of Mathematical Statistics, 30, 960-3.
Gill, P.E., and Murray, William A., eds. (1974), Numerical Methods for Constrained Optimization, New York: Academic Press.
Goldfeld, S.M., Quandt, R.E., and Trotter, H.F. (1966), "Maximization by Quadratic Hill Climbing," Econometrica, 34, 541-51.
Graybill, Franklin A. (1969), Introduction to Matrices with Applications in Statistics, Belmont, California: Wadsworth Publishing Company, Inc.
, and Hultquist, Robert A. (1961), "Theorems Concerning Eisenhart's Model II," Annals of Mathematical Statistics, 32, 261-9.
Hartigan, J.A. (1969), "Linear Bayesian Methods," Journal of the Royal Statistical Society, Series B, 31, 446-54.
Hartley, H.O., and Rao, J.N.K. (1967), "Maximum-Likelihood Estimation for the Mixed Analysis of Variance Model," Biometrika, 54, 93-108.
, and Vaughn, William K. (1972), "A Computer Program for the Mixed Analysis of Variance Model Based on Maximum Likelihood," in Statistical Papers in Honor of George W. Snedecor, ed. T.A. Bancroft, Ames, Iowa: Iowa State University Press.
Harville, David A. (1969a), "Quadratic Unbiased Estimation of Variance Components for the One-way Classification," Biometrika, 56, 313-26. (Correction, (1970), Biometrika, 57, 226.)
(1969b), "Variance-Component Estimation for the Unbalanced One-way Random Classification-A Critique," Technical Report No. 69-0180, Aerospace Research Laboratories, Wright-Patterson AFB, Ohio.
(1974), "Bayesian Inference for Variance Components Using Only Error Contrasts," Biometrika, 61, 383-5.
(1975), "Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems," Technical Report No. 75-0175, Aerospace Research Laboratories, Wright-Patterson AFB, Ohio.
(1976), "Extension of the Gauss-Markov Theorem to Include the Estimation of Random Effects," Annals of Statistics, 4, 384-95.
Hemmerle, William J., and Hartley, H.O. (1973), "Computing Maximum Likelihood Estimates for the Mixed A.O.V. Model Using the W Transformation," Technometrics, 15, 819-31.
Henderson, Charles R. (1953), "Estimation of Variance and Covariance Components," Biometrics, 9, 226-52.
(1963), "Selection Index and Expected Genetic Advance," in Statistical Genetics and Plant Breeding, National Academy of Sciences-National Research Council Publication No. 982, 141-63.
(1973a), "Maximum Likelihood Estimation of Variance Components," unpublished manuscript.
(1973b), "MINQUE of Variance Components," unpublished manuscript.
(1973c), "Sire Evaluation and Genetic Trends," in Proceedings of the Animal Breeding and Genetics Symposium in Honor of Dr. Jay L. Lush, Champaign, Illinois: American Society of Animal Science, 10-41.
(1975), "Best Linear Unbiased Estimation and Prediction Under a Selection Model," Biometrics, 31, 423-47.
Herbach, Leon H. (1959), "Properties of Model II-Type Analysis of Variance Tests, A: Optimum Nature of the F-Test for Model II in the Balanced Case," Annals of Mathematical Statistics, 30, 939-59.
Hocking, Ronald R., and Kutner, Michael H. (1975), "Some Analytical and Numerical Comparisons of Estimators for the Mixed A.O.V. Model," Biometrics, 31, 19-28.
Jennrich, R.I., and Sampson, P.F. (1976), "Newton-Raphson and Related Algorithms for Maximum Likelihood Variance Component Estimation," Technometrics, 18, 11-17.
Klotz, Jerome H., Milton, Roy C., and Zacks, S. (1969), "Mean Square Efficiency of Estimators of Variance Components," Journal of the American Statistical Association, 64, 1383-402.
LaMotte, Lynn R. (1970), "A Class of Estimators of Variance Components," Technical Report No. 10, Department of Statistics, University of Kentucky, Lexington, Kentucky.
(1971), "Locally Best Quadratic Estimators of Variance Components," Technical Report No. 22, Department of Statistics, University of Kentucky, Lexington, Kentucky.
(1973), "Quadratic Estimation of Variance Components," Biometrics, 29, 311-30.
Lawton, William H., and Sylvestre, Edward A. (1971), "Elimination of Linear Parameters in Nonlinear Regression," Technometrics, 13, 461-7.
Levenberg, K. (1944), "A Method for the Solution of Certain Non-Linear Problems in Least Squares," Quarterly of Applied Mathematics, 2, 164-8.
Lindley, Dennis V., and Smith, A.F.M. (1972), "Bayes Estimates for the Linear Model," Journal of the Royal Statistical Society, Series B, 34, 1-18.
Marquardt, Donald W. (1963), "An Algorithm for Least Squares Estimation of Nonlinear Parameters," SIAM Journal, 11, 431-41.
Miller, John J. (1973), "Asymptotic Properties and Computation of Maximum Likelihood Estimates in the Mixed Model of the Analysis of Variance," Technical Report No. 12, Department of Statistics, Stanford University, Stanford, California.
Murray, William A., ed. (1972), Numerical Methods for Unconstrained Optimization, New York: Academic Press.
Nelder, J.A. (1968), "The Combination of Information in Generally Balanced Designs," Journal of the Royal Statistical Society, Series B, 30, 303-11.
Nering, Evar D. (1970), Linear Algebra and Matrix Theory, 2nd ed., New York: John Wiley & Sons.
Olsen, Anthony, Seely, Justus, and Birkes, David (1976), "Invariant Quadratic Unbiased Estimation for Two Variance Components," Annals of Statistics, 4, 878-90.
Patterson, H.D., and Thompson, Robin (1971), "Recovery of Inter-Block Information when Block Sizes Are Unequal," Biometrika, 58, 545-54.
, and Thompson, Robin (1974), "Maximum Likelihood Estimation of Components of Variance," Proceedings of the 8th International Biometric Conference, 197-207.
Powell, M.J.D. (1970), "A Survey of Numerical Methods for Unconstrained Optimization," SIAM Review, 12, 79-97.
Rao, C.R. (1965), Linear Statistical Inference and Its Applications, New York: John Wiley & Sons.
(1970), "Estimation of Heteroscedastic Variances in Linear Models," Journal of the American Statistical Association, 65, 161-72.
(1971a), "Estimation of Variance and Covariance Components-MINQUE Theory," Journal of Multivariate Analysis, 1, 257-75.
(1971b), "Minimum Variance Quadratic Unbiased Estimation of Variance Components," Journal of Multivariate Analysis, 1, 445-56.
(1972), "Estimation of Variance and Covariance Components in Linear Models," Journal of the American Statistical Association, 67, 112-15.
Rao, J.N.K. (1968), "On Expectations, Variances, and Covariances of ANOVA Mean Squares by 'Synthesis'," Biometrics, 24, 963-78.
Russell, Thomas S., and Bradley, Ralph A. (1958), "One-way Variances in a Two-way Classification," Biometrika, 45, 111-29.
Searle, Shayle R. (1968), "Another Look at Henderson's Methods of Estimating Variance Components" (with discussion), Biometrics, 24, 749-87.
(1970), "Large Sample Variances of Maximum Likelihood Estimators of Variance Components Using Unbalanced Data," Biometrics, 26, 505-24.
(1971a), Linear Models, New York: John Wiley & Sons.
(1971b), "Topics in Variance Component Estimation," Biometrics, 27, 1-76.
(1974), "Prediction, Mixed Models, and Variance Components," in Reliability and Biometry, eds. F. Proschan and R.J. Serfling, Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics, 229-66.
(1976), "Detailed Derivation of Results and Relationships in the ML, REML and MINQUE Methods of Estimating Variance Components," Paper No. BU-586-M, Biometrics Unit, Cornell University, Ithaca, New York.
Seely, Justus (1975), "An Example of an Inadmissible Analysis of Variance Estimator for a Variance Component," Biometrika, 62, 689-90.
Sprott, D.A. (1975), "Marginal and Conditional Sufficiency," Biometrika, 62, 599-605.
Thompson, Robin (1969), "Iterative Estimation of Variance Components for Nonorthogonal Data," Biometrics, 25, 767-73.
(1973), "The Estimation of Variance and Covariance Components with an Application when Records Are Subject to Culling," Biometrics, 29, 527-50.
(1975), "A Note on the W Transformation," Technometrics, 17, 511-2.
Thompson, W.A., Jr. (1962), "The Problem of Negative Estimates of Variance Components," Annals of Mathematical Statistics, 33, 273-89.
Townsend, Edwin C. (1968), "Unbiased Estimators of Variance Components in Simple Unbalanced Designs," Ph.D. dissertation, Cornell University, Ithaca, New York.
, and Searle, Shayle R. (1971), "Best Quadratic Unbiased Estimation of Variance Components from Unbalanced Data in the 1-Way Classification," Biometrics, 27, 643-57.
Weiss, Lionel (1971), "Asymptotic Properties of Maximum Likelihood Estimators in Some Nonstandard Cases," Journal of the American Statistical Association, 66, 345-50.
(1973), "Asymptotic Properties of Maximum Likelihood Estimators in Some Nonstandard Cases, II," Journal of the American Statistical Association, 68, 428-30.
Westlake, Joan R. (1968), A Handbook of Numerical Matrix Inversion and Solution of Linear Equations, New York: John Wiley & Sons.
Zacks, S. (1971), The Theory of Statistical Inference, New York: John Wiley & Sons.
Comment

J.N.K. RAO*

* J.N.K. Rao is Professor, Department of Mathematics, Carleton University, Ottawa K1S 5B6, Canada.

1. Harville has covered a lot of ground in this excellent review paper. It contains several important features: (1) a treatment of the variance component models as special cases of the general linear model (2.1), which creates a unified presentation; (2) a review of the recent work of Henderson and Harville on the estimation of random effects in the model and its relationship to variance components estimation, especially from the viewpoint of computations; (3) a thorough discussion of specialized methods as well as general algorithms for computing maximum likelihood (ML) estimates; and (4) an examination of the relationships of ML estimates to ANOVA-type estimates of Henderson, MINQUE of C.R. Rao and others, and Bayesian estimates. However, the scope of this paper is somewhat narrow, since it is mainly concerned with point estimation of the variance components.

2. A brief account of Miller's (1973) recent work on asymptotic properties of ML estimates is provided in Section 4.2. Miller removed a restriction of Hartley and Rao (1967), viz., that the number of observations falling into any particular level of any random factor stays below a universal constant. However, he was able to establish only a Cramér-type consistency result (viz., some root of the likelihood equation is consistent), which is not so gratifying as the Wald-type result: an estimate of the parameter vector which makes the likelihood function an absolute maximum is consistent, i.e., a ML estimate is consistent. Using sufficient conditions similar to those of Wald, a corrected version of Hartley-Rao's proof establishes the Wald-type consistency result under the above-mentioned restriction, but it will not work if this restriction is removed (see Miller (1973), p. 257). Perhaps an alternative set of sufficient conditions might give the desired result without the restriction of Hartley-Rao.

3. Recent work on the computational aspects of ML estimates (Section 6) is certainly impressive, and we need further research in this direction. However, none of the proposed algorithms guarantees a solution which indeed is a ML estimate: the behavior of the likelihood (as a function of the variance components) appears complex; even for the simple unbalanced one-way layout, the likelihood equation may have multiple roots, or the ML estimate is a boundary value rather than any of these roots (see Klotz and Putter (1970)). We therefore need to exercise more caution when resorting to numerical algorithms, especially in Monte Carlo studies involving the computing of ML estimates for several thousand samples.

Henderson's iterative algorithm (which is a simplification of equation (51) of Hartley and Rao (1967)) appears