PSYCHOMETRIKA -- VOL. 52, NO. 3, 317-332
SEPTEMBER 1987
SPECIAL SECTION

FACTOR ANALYSIS AND AIC

HIROTUGU AKAIKE
THE INSTITUTE OF STATISTICAL MATHEMATICS

The information criterion AIC was introduced to extend the method of maximum likelihood to the multimodel situation. It was obtained by relating the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solutions by introducing a natural prior distribution of factor loadings.
Key words: factor analysis, maximum likelihood, information criterion AIC, improper solution, Bayesian modeling.

1. Introduction

The factor analysis model has been producing thought-provoking statistical problems. The model is typically represented by

y(n) = Ax(n) + u(n),  n = 1, 2, ..., N,

where y(n) denotes a p-dimensional vector of observations, x(n) a k-dimensional vector of factor scores, and u(n) a p-dimensional vector of specific variations. It is assumed that the variables with different n's are mutually independent and that x(n) and u(n) are mutually independently distributed as Gaussian random variables with variance covariance matrices I_{k×k} and Ψ, respectively, where Ψ is a diagonal matrix. The covariance matrix Σ of y(n) is then given by

Σ = AA' + Ψ.

This model is characterized by the use of a large number of unknown parameters, much larger than the number of unknown parameters of a model used in conventional multivariate analysis. The empirical principle of parsimony in statistical model building dictates that the increase of the number of parameters should be stopped as soon as it is observed that a further increase does not produce significant improvement of fit of the model to the data. Thus the control of the number of parameters has usually been realized by applying a test of significance.

The author would like to express his thanks to Jim Ramsay, Yoshio Takane, Donald Ramirez and
Hamparsum Bozdogan for helpful comments on the original version of the paper. Thanks are also due to
Emiko Arahata for her help in computing.
Requests for reprints should be sent to Hirotugu Akaike, The Institute of Statistical Mathematics, 4-6-7
Minami-Azabu, Minato-Ku, Tokyo 106, Japan.

© 1987 The Psychometric Society

In the case of the maximum likelihood factor analysis this is done by adopting the
likelihood ratio test. However, in this test procedure, the unstructured saturated model is
always used as the reference and the significance is judged by referring to a chi-square
distribution with a large number of degrees of freedom equal to the difference between the
number of parameters of the saturated model and that of the model being tested. As will
be seen in section 3, an example discussed by Jöreskog (1978) shows that direct application of such a test to the selection of a factor analysis model is not quite appropriate.
There the expert's view clearly contradicts the conventional use of the likelihood ratio
test.
In 1969 the present author introduced the final prediction error (FPE) criterion for the
choice of the order of an autoregressive model of a time series (Akaike, 1969, 1970). The
criterion was defined by an estimate of the expected mean square one-step ahead prediction error of the model with parameters estimated by the method of least squares. The successful experience of application of the FPE criterion to real data suggested the possibility of developing a similar criterion for the choice of the number of factors in the factor analysis. The choice of the order of an autoregression controlled the number of unknown parameters in the model, which in turn controlled the expected mean square one-step ahead prediction error. By analogy it was easily observed that the control of the number of factors was required for the control of the expected prediction error by the fitted model. However, it
was not easy to identify what the prediction error meant in the case of the factor analysis.
In the case of the autoregressive model an estimate of the expected predictive performance was adopted as the criterion; in the case of the maximum likelihood factor analysis it was the fitted distribution that was evaluated by the likelihood. The realization of this fact quickly led to the understanding that our prediction was represented by the fitted model in the case of the factor analysis, which then led to the understanding that the expectation of the log likelihood with respect to the "true" distribution was related to the Kullback-Leibler information that defined the amount of deviation of the "true" distribution from the assumed model.
The analogy with the FPE criterion then led to the introduction of the criterion
AIC = (-2) log (maximum likelihood) + 2 (number of parameters),
as the measure of the badness of fit of a model defined with parameters estimated by the
method of maximum likelihood, where log denotes a natural logarithm (Akaike, 1973,
1974). We will present a simple explanation of AIC in the next section and illustrate its
use by applying it to an example in section 3.
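To make the definition concrete, here is a minimal Python sketch (ours, not from the paper) comparing AIC for two nested Gaussian models of simulated data; the data and models are purely illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)   # simulated data

# Model 1: N(0, sigma^2); one free parameter.
sigma1 = np.sqrt(np.mean(x ** 2))              # MLE of sigma when the mean is fixed at 0
aic1 = -2 * norm.logpdf(x, 0.0, sigma1).sum() + 2 * 1

# Model 2: N(mu, sigma^2); two free parameters.
mu2, sigma2 = x.mean(), x.std()                # MLEs (std divides by N)
aic2 = -2 * norm.logpdf(x, mu2, sigma2).sum() + 2 * 2

print(f"AIC(model 1) = {aic1:.1f}")
print(f"AIC(model 2) = {aic2:.1f}")            # the smaller AIC indicates the better fit
```

With data actually generated from a nonzero-mean Gaussian, the two-parameter model typically attains the smaller AIC despite its extra parameter.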
Although AIC produces a satisfactory solution to the problem of the choice of the
number of factors, the application of AIC is hampered by the frequent appearance of
improper solutions. This shows that a successive increase of the number of factors quickly leads to models that are not quite appropriate for the direct application of the method of
maximum likelihood.
In section 4 it will be shown that the factor analysis model may be viewed as a Bayesian model and that the choice of a factor analysis model by minimizing the AIC criterion is essentially concerned with the choice of a Bayesian model. This recognition encourages the use of further Bayesian modeling for the elimination of improper solutions. In section 5 a natural prior distribution for the factor loadings is introduced through the analysis of the likelihood function. Numerical examples will be given in section 6 to show that the
introduction of the prior distribution suppresses the appearance of improper solutions
and that the indefinite increase of a communality caused by the conventional maximum
likelihood procedure may be viewed as of little practical significance.

The paper concludes with brief remarks on the contribution of factor analysis to the
development of general statistical methodology.

2. Brief Review of AIC


The fundamental ideas underlying the introduction of AIC are:
1. The predictive use of the fitted model.
2. The adoption of the expected log likelihood as the basic criterion.
Here the concept of parameter estimation is replaced by the estimation of a distribution
and the accuracy is measured by a universal criterion, the expected log likelihood of the
fitted model.
The relation between the expected log likelihood and the Kullback-Leibler information number is given by

I(f; g) = E log f(x) - E log g(x),

where I(f; g) denotes the Kullback-Leibler information of the distribution f relative to the distribution g, and E denotes the expectation with respect to the "true" distribution f(x) of x. The second term on the right-hand side represents the expected log likelihood of an assumed model g(x) with respect to the "true" distribution f(x). Since I(f; g) provides a measure of the deviation of f from g, and since log g(x) provides an unbiased estimate of E log g(x), the above equation provides a justification for the use of log likelihoods for the purpose of comparison of statistical models.
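As a numerical illustration (our sketch, not part of the original paper), both expectations can be approximated by sample means over draws from f, giving a Monte Carlo estimate of I(f; g):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
f = norm(0.0, 1.0)     # "true" distribution f
g = norm(0.5, 1.5)     # assumed model g

x = f.rvs(size=200_000, random_state=rng)
E_log_f = f.logpdf(x).mean()    # Monte Carlo estimate of E log f(x)
E_log_g = g.logpdf(x).mean()    # ... and of the expected log likelihood E log g(x)

# I(f; g) = E log f(x) - E log g(x); the closed form for these two
# Gaussians is log(1.5) + (1 + 0.25)/(2 * 2.25) - 1/2 = 0.1832...
print("I(f; g) estimate:", E_log_f - E_log_g)
```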
Consider the situation where the model g(x) contains an unknown parameter θ, that is, g(x) = g(x | θ). When the data x are observed and the maximum likelihood estimate θ̂(x) of θ is obtained, the predictive point of view suggests the evaluation of θ̂(x) by the goodness of g(· | θ̂(x)) as an estimate of the true distribution f(·). By adopting the information I(f; g) as the basic criterion we are led to the use of E_y log g(y | θ̂(x)) as the measure of the goodness of θ̂(x), where E_y denotes the expectation with respect to the true distribution f(y) of y. To relate this criterion to the familiar log likelihood ratio test statistic we adopt 2E_y log g(y | θ̂(x)) as our measure of the goodness of g(· | θ̂(x)) as an estimate of f(y).
Here we consider the conventional setting where the true distribution f(y) is given by g(y | θ_0), that is, θ_0 is the true value of the unknown parameter, the data x are a realization of the vector of i.i.d. random variables x_1, x_2, ..., x_N, and the log likelihood ratio test statistic asymptotically satisfies the relation

2 log g(x | θ̂(x)) - 2 log g(x | θ_0) = χ²_m,

where χ²_m denotes a chi-squared variable with degrees of freedom m equal to the dimension of the parameter vector θ. Under this setting it is expected that the curvature of the log likelihood surface provides a good approximation to that of the expected log likelihood surface. This observation leads to another asymptotic equality

2E_y log g(y | θ̂(x)) - 2E_y log g(y | θ_0) = -χ²_m,

where it is assumed that y is another independent observation from the same distribution as that of x and the chi-squared variable is identical to that defined by the log likelihood ratio test statistic.
The above equations show that the amount of increase of 2 log g(x | θ̂(x)) over 2 log g(x | θ_0) obtained by adjusting the parameter value by the method of maximum likelihood is asymptotically equal to the amount of decrease of 2E_y log g(y | θ̂(x)) from 2E_y log g(y | θ_0). Thus, to measure the deviation of θ̂(x) from θ_0 in terms of the basic criterion of twice the expected log likelihood, χ²_m must be subtracted twice from 2 log g(x | θ̂(x)) to make the difference of twice the log likelihoods an unbiased estimate of that of twice the expected log likelihoods.
Since χ²_m is unobservable, as we do not know θ_0, we consider the use of its expected value m. The negative of the quantity thus obtained defines

AIC = (-2) log g(x | θ̂(x)) + 2m.

When several different models are compared, the one that gives the minimum of AIC represents the best fit. Such an estimate is denoted as MAICE (minimum AIC estimate). For a more detailed discussion of the predictive point of view of statistics and the use of the information criterion readers are referred to Akaike (1985).
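The bias argument above is easy to check by simulation. The following sketch (our construction, using a Gaussian model with m = 2) compares 2 log g(x | θ̂(x)) on the fitting sample with the same quantity on an independent replicate; the average gap should be close to 2m:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N, m, reps = 200, 2, 2000      # sample size, dim(theta), replications
gaps = []
for _ in range(reps):
    x = rng.normal(0.0, 1.0, size=N)
    mu, sd = x.mean(), x.std()                     # MLEs; m = 2 parameters
    ll_fit = 2 * norm.logpdf(x, mu, sd).sum()      # 2 log g(x | theta_hat(x))
    y = rng.normal(0.0, 1.0, size=N)               # fresh sample from the truth
    ll_new = 2 * norm.logpdf(y, mu, sd).sum()      # estimates 2N E_y log g(y | theta_hat)
    gaps.append(ll_fit - ll_new)
print("average optimism:", np.mean(gaps), " versus 2m =", 2 * m)
```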

3. How AIC Works With The Factor Analysis Model


Given a set of observations y = (y(n); n = 1, 2, ..., N), the maximum likelihood factor analysis starts with the definition of the log likelihood function given by

log L(k) = -(N/2)[log |Σ_k| + tr (Σ_k^{-1}S)],

where S denotes the sample covariance matrix of y, k the number of factors, and Σ_k is given by

Σ_k = A_k A_k' + Ψ,

where A_k denotes the matrix of factor loadings and Ψ the uniqueness variance matrix. The diagonal elements of A_k A_k' define the communalities. The AIC statistic for the k-factor model is then defined by

AIC(k) = (-2) log L(k) + [2p(k + 1) - k(k - 1)].
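In code, AIC(k) can be evaluated directly from a fitted k-factor model. The sketch below uses our own names and assumes the estimates A_k and Ψ come from some ML factor analysis routine; the additive constant -(Np/2) log 2π, common to all k, is dropped, so it does not affect the comparison:

```python
import numpy as np

def factor_aic(S, A, psi, N):
    """AIC(k) for a fitted k-factor model, up to a constant common to all k.

    S   : p x p sample covariance matrix
    A   : p x k matrix of estimated factor loadings (A_k)
    psi : length-p vector of estimated specific variances (diagonal of Psi)
    N   : sample size
    """
    p, k = A.shape
    Sigma = A @ A.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)
    loglik = -0.5 * N * (logdet + np.trace(np.linalg.solve(Sigma, S)))
    penalty = 2 * p * (k + 1) - k * (k - 1)   # twice the number of free parameters
    return -2 * loglik + penalty
```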
To show the use of AIC in the maximum likelihood factor analysis, and in particular to illustrate the difference between the AIC and the conventional test approach, we will here discuss an example treated by Jöreskog (1978, p. 457). This example is concerned with the analysis of Harman's example of twenty-four psychological variables. The unrestricted four factor model was first fitted, which produced

χ²_186 = 246.36.

This model was considered to be "representing a reasonably good fit" but a further restriction of parameters produced a simple structure model with

χ²_231 = 301.42.

This model was accepted as the best fitting simple structure.
Now we have

Prob {χ²_186 ≥ 246.36 | H_0} ≈ 0.0009,

and

Prob {χ²_231 ≥ 301.42 | H_1} ≈ 0.0005,

where H_0 and H_1 denote the hypotheses of the four factor model and the simple structure, respectively, and the chi-squared variables stand for the random variables with the respective degrees of freedom. By the standard of conventional tests these figures show that the results are extremely significant and both H_0 and H_1 should be rejected.

In spite of this, the expert judgment of Jöreskog was to accept the four factor model as a reasonable fit and to prefer the simple structure model to the unrestricted one. This conclusion suggests that
the large values of the degrees of freedom appearing in the chi-squared statistics preclude
the application of conventional levels of significance, such as 0.05 or 0.01, in making the
final judgment of models in this situation.
The chi-squared statistic is defined by

χ² = (-2) max log L(H) - (-2) max log L(H_∞),

where max log L(H) denotes the maximum log likelihood under the hypothesis H and H_∞ denotes the saturated or completely unconstrained model. Since AIC for a hypothesis H is defined by

AIC(H) = (-2) max log L(H) + 2 dim θ,

where dim θ denotes the dimension of the vector of unknown parameters θ, we have

AIC(H) - AIC(H_∞) = χ²_d.f. - 2(d.f.),

where d.f. denotes the difference between the number of unknown parameters of H_∞ and that of H. By neglecting the common additive constant AIC(H_∞) we may define AIC(H) simply by

AIC(H) = χ²_d.f. - 2(d.f.).

For the models discussed by Jöreskog we get

AIC(H_0) = 246.36 - 2 × 186 = -125.64,

and

AIC(H_1) = 301.42 - 2 × 231 = -160.58.

Since AIC(H_∞) = 0, these AIC's show that both H_0 and H_1 are by far better than H_∞ and that the simple structure model H_1 shows a better fit than the unrestricted four factor model H_0.
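For the record, the same arithmetic in Python:

```python
# AIC relative to the saturated model: AIC(H) = chi-squared - 2 * (d.f.)
aic_h0 = 246.36 - 2 * 186   # four factor model      -> -125.64
aic_h1 = 301.42 - 2 * 231   # simple structure model -> -160.58
print(aic_h0, aic_h1)       # both below AIC(H_inf) = 0, and AIC(H_1) is smallest
```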
This result by AIC is in complete agreement with Jöreskog's conclusion. The conventional theory of statistics does not tell how to evaluate the significance of a test in each particular application, and there is no hope of arriving at a similar conclusion. Obviously the objective procedure of model selection by an information criterion can be fully implemented to define an automatic factor analysis procedure. Such a possibility is discussed by Bozdogan and Ramirez (1987).

4. Factor Analysis Model Viewed as a Bayesian Model


As was demonstrated by the application to Jöreskog's example, the AIC approach produced a satisfactory solution to the model selection problem in factor analysis. In spite of this success the use of AIC in the maximum likelihood factor analysis has been severely limited by the frequent occurrence of improper solutions, that is, by the appearance of zero estimates of specific variances. Apparently this is caused by the overparametrization of the model.
The introduction of AIC is motivated by the desire to control the effect of overparametrization, and the minimum AIC procedure for model selection is considered to be a realization of the well-known empirical principle of parsimony in statistical modeling. However, the application of the minimum AIC procedure assumes the existence of proper maximum likelihood estimates of the models considered. The frequent occurrence of improper solutions in the maximum likelihood factor analysis means that the models are often too much overparametrized for the application of the method of maximum likelihood. This suggests the necessity of further control of the likelihood function. This can be realized by the use of some proper Bayesian modeling.
Before going into the discussion of this Bayesian modeling we will first note the essentially Bayesian character of the factor analysis model and point out that the minimum AIC procedure is concerned with the problem of the selection of a Bayesian model. In the basic factor analysis model y = Ax + u the vector of observations y is assumed to be distributed following a Gaussian distribution with mean Ax and unique variance Ψ. The vector of factor scores x is unobservable but is assumed to be distributed following a Gaussian distribution with zero mean and variance I_{k×k}. Since x is never observed this distribution is simply a psychological construction for the explanation of the behavior of y. Under the assumption that A is fixed the distribution of x specifies the prior distribution of the mean of the observation y. Thus we can see that the choice of k, the number of factors, is essentially concerned with the choice of a Bayesian model. Incidentally, the recognition of the Bayesian character of the factor analysis model also suggests the use of the posterior distribution of x for the estimation of the factor scores, as is discussed by Bartholomew (1981).
The basic problem in the use of a Bayesian model is how to justify the use of a subjectively constructed model. Our belief is that it is possible only by considering various possibilities as alternative models and comparing them with an objectively defined criterion. In particular we propose the use of the log likelihood, or the AIC when some parameters are estimated by the method of maximum likelihood, as the criterion of fit.
Let us consider the likelihood of a factor analysis model as a Bayesian model. For a Bayesian model specified by the data distribution p(· | θ) and prior distribution p(θ), its likelihood with respect to the observed data y is given by

∫ p(y | θ)p(θ) dθ.


From the representation y(n) = Ax(n) + u(n), n = 1, 2, ..., N, and the assumption of the mutual independence among the variables, the likelihood of the Bayesian model defined with θ = (x(1), x(2), ..., x(N)) is given by

L = ∏_{n=1}^{N} (2π)^{-p/2} |Σ|^{-1/2} exp[-(1/2) tr Σ^{-1} y(n)y(n)']
  = (2π)^{-Np/2} |Σ|^{-N/2} exp[-(N/2) tr Σ^{-1}S],

where Σ = AA' + Ψ, |Σ| denotes the determinant, and

S = (1/N) ∑_{n=1}^{N} y(n)y(n)'.

For simplicity the mean of y(n) is assumed to be zero. Thus we get

log L = -(N/2)[log |Σ| + tr SΣ^{-1}] + const.

This is exactly the likelihood function used in the conventional maximum likelihood factor analysis. Thus the maximum likelihood estimates of A and Ψ in the classical sense are the maximum likelihood estimates of the unknown parameters of a Bayesian model.
The above result shows that the AIC criterion defined for the factor analysis model is actually the ABIC criterion for the evaluation of a Bayesian model with parameters estimated by the method of maximum likelihood, where ABIC is defined by (Akaike, 1980)

ABIC = (-2) maximum log likelihood of a Bayesian model + 2 (number of estimated parameters).

In the case of the factor analysis model we have

ABIC = AIC.

This identity clearly shows that there is no essential distinction between the classical and Bayesian models when they are viewed from the point of view of the information criterion.
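A quick numerical check of this identity (our sketch, with arbitrarily chosen A and Ψ): simulating y(n) = Ax(n) + u(n) with Gaussian factor scores and specific variations, and comparing the empirical covariance of y with AA' + Ψ:

```python
import numpy as np

rng = np.random.default_rng(3)
p, k, N = 5, 2, 200_000
A = rng.normal(size=(p, k))                    # arbitrary loadings
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))   # arbitrary specific variances

x = rng.normal(size=(N, k))                    # factor scores ~ N(0, I_k)
u = rng.normal(size=(N, p)) @ np.sqrt(Psi)     # specific variation ~ N(0, Psi)
y = x @ A.T + u                                # y(n) = A x(n) + u(n)

emp_cov = y.T @ y / N                          # empirical covariance of y
print(np.max(np.abs(emp_cov - (A @ A.T + Psi))))   # close to 0
```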

5. Control of Improper Solutions by a Bayesian Modeling


The appearance of improper solutions suggests the necessity of a reduction of the number of parameters to be estimated by the method of maximum likelihood. The recognition of the Bayesian structure of the factor analysis model suggests that further modeling of the prior distribution of the unknown parameters in A and Ψ is possible. The use of the Bayesian approach for the control of improper solutions was already discussed in an earlier paper by Martin and McDonald (1975). These authors point out the importance of choosing a prior distribution that does not have the appearance of arbitrariness and discuss the use of a reasonably defined prior distribution of specific variances.
The informational approach to statistics puts very much faith in the information supplied by the log likelihood. Hence in the present paper we try to develop a prior distribution without using outside information except for the knowledge of the likelihood function of the data distribution. In the present situation this is particularly appropriate as the prior distribution is considered only for the purpose of tempering the likelihood function to clarify the nature of improper solutions.
By this approach we need a detailed analysis of the likelihood function. For the convenience of the analysis let us consider

q = (-2/N) log L,

where the log likelihood log L is defined in the preceding section. By ignoring the additive constant we have

q = -log |Σ^{-1}S| + tr Σ^{-1}S.

By putting Ψ = D², where D is a diagonal matrix with positive diagonal elements, we get

Σ = AA' + D² = D(I + CC')D,

where A is p × k, I is a p × p identity matrix, and C = D^{-1}A, the matrix of standardized factor loadings. We have

tr Σ^{-1}S = tr (I + CC')^{-1}D^{-1}SD^{-1},

and

|Σ^{-1}S| = |(I + CC')^{-1}| |D^{-1}SD^{-1}|.
The modified negative log likelihood q can conveniently be expressed by using the eigenvectors z_i and eigenvalues ζ_i of D^{-1}SD^{-1}, the standardized sample covariance matrix. Define the matrix Z by

Z = [z_1, z_2, ..., z_p].

It is assumed that Z is normalized so that Z'Z = I holds. Represent C by Z in the form

C = ZF.

Adopt the representation

FF' = ∑_{i=1}^{p} μ_i m_i m_i',

where μ_i > 0, for i = 1, 2, ..., k, μ_i = 0, otherwise, and m_i'm_j = δ_ij, where δ_ij = 1, for i = j, and 0, otherwise. Then we get

CC' = ZFF'Z' = ∑_{i=1}^{p} μ_i l_i l_i',

where l_i = Zm_i with l_i'l_j = δ_ij, and

I + CC' = ∑_{i=1}^{p} λ_i l_i l_i',

where λ_i = 1 + μ_i. From this representation we get

(I + CC')^{-1} = ∑_{i=1}^{p} λ_i^{-1} l_i l_i',

and

tr (I + CC')^{-1}D^{-1}SD^{-1} = tr (∑_i λ_i^{-1} l_i l_i')(∑_j ζ_j z_j z_j')
                               = ∑_i ∑_j λ_i^{-1} ζ_j m_i²(j),

where m_i(j) denotes the j-th element of m_i. The last relation is obtained from the equation z_j'l_i = m_i(j). We also have

|(I + CC')^{-1}| |D^{-1}SD^{-1}| = ∏_{i=1}^{p} λ_i^{-1} ∏_{j=1}^{p} ζ_j.

Thus we get the following representation of the modified negative log likelihood function as a function of λ = (λ_1, λ_2, ..., λ_p) and m = (m_1, m_2, ..., m_p):

q(λ, m) = ∑_{i=1}^{p} ∑_{j=1}^{p} λ_i^{-1} ζ_j m_i²(j) + ∑_{i=1}^{p} log λ_i - ∑_{j=1}^{p} log ζ_j.

Assume that the ζ_i and λ_i are arranged in descending order, that is, ζ_1 ≥ ζ_2 ≥ ... ≥ ζ_p and λ_1 ≥ λ_2 ≥ ... ≥ λ_p, where λ_{k+1} = ... = λ_p = 1. Then the successive minimization of q(λ, m) with respect to m_p, m_{p-1}, ..., m_1 leads to

q(λ) = ∑_{i=1}^{k} [λ_i^{-1}ζ_i - log (λ_i^{-1}ζ_i)] + ∑_{i=k+1}^{p} (ζ_i - log ζ_i).

As a function of λ, λ^{-1}ζ - log (λ^{-1}ζ) attains its minimum at λ = ζ, for ζ > 1, and at λ = 1, otherwise. Thus we get

Min q(λ) = k* + ∑_{i=k*+1}^{p} (ζ_i - log ζ_i),

where ζ_i > 1, for i ≤ k*, and ζ_i ≤ 1, otherwise. This last quantity is equal to the quantity given by Equation (18) of Jöreskog (1967, p. 448) and is (-2/N) times the maximum log likelihood of the factor analysis model when D is given.
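This closed form is straightforward to evaluate. The following sketch (ours) computes Min q(λ) from the eigenvalues of D^{-1}SD^{-1} for a given diagonal D and number of factors k:

```python
import numpy as np

def profile_q(S, d, k):
    """Min over the loadings of q = -log|Sigma^{-1}S| + tr(Sigma^{-1}S), D fixed.

    S : p x p sample covariance matrix
    d : length-p vector of specific standard deviations (diagonal of D)
    k : number of factors
    Returns k* + sum_{i > k*} (zeta_i - log zeta_i).
    """
    S1 = S / np.outer(d, d)                        # D^{-1} S D^{-1}
    zeta = np.sort(np.linalg.eigvalsh(S1))[::-1]   # eigenvalues, descending
    k_star = min(k, int(np.sum(zeta > 1.0)))       # factors actually used
    tail = zeta[k_star:]
    return k_star + np.sum(tail - np.log(tail))
```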
In maximizing the likelihood we would normally hope that too small a value of some of the diagonal elements of D will reduce the maximum likelihood of the corresponding model. However, that this is not the case is shown by the above result, which explains that the value of the maximum likelihood is sensitive only to the behavior of the smaller eigenvalues of D^{-1}SD^{-1}. A very small diagonal element of D will only produce a very large eigenvalue. Thus the process of maximizing the likelihood with respect to the elements of D does not eliminate the possibility of some of these elements going down to zero.
The form of q(λ) shows that if we introduce an additive term ρ ∑ μ_i with ρ > 0 then the minimization of

q_ρ(λ) = ∑_{i=1}^{p} [λ_i^{-1}ζ_i - log (λ_i^{-1}ζ_i)] + ρ ∑_{i=1}^{p} μ_i,

with respect to λ does not allow any of the λ_i ( = 1 + μ_i) to go to infinity. Taking into account the relations C = ZF and FF' = ∑ μ_i m_i m_i' we get

∑_{i=1}^{p} μ_i = tr FF' = tr CC'.

Since C = D^{-1}A the minimization of q_ρ(λ) produces an estimate that is given as the posterior mode under the assumption of the prior distribution given by

K exp [-(N/2)ρ tr D^{-1}AA'D^{-1}],

where K denotes the normalizing constant and N the sample size. This prior distribution is defined by a spherical normal distribution of the standardized factor loadings and will be referred to as the standard spherical prior distribution of the factor loadings.
For the complete specification of the Bayesian model it is necessary to define the prior distribution of D. However, an arbitrarily defined prior distribution of the elements of D can easily eliminate improper solutions provided it penalizes smaller values sufficiently. Since our interest here is mainly in the clarification of the nature of improper solutions obtained by the conventional maximum likelihood procedure, we will not proceed to the modeling of the prior distribution of D and simply adopt the uniform prior.

TABLE 1
Communality estimates*
Harman: eight physical

ρ = 0 (MLE)
k\i    1    2    3    4    5    6    7    8
1    842  865  810  813  240  171  123  199
2    830  893  834  801  911  636  584  463
3    872 1000  806  844  909  641  589  509

ρ = 0.1
k\i    1    2    3    4    5    6    7    8
1    837  858  804  810  241  172  124  200
2    828  881  828  800  855  647  591  476
3    858  910  830  832  859  650  590  523
4    865  910  832  843  851  689  649  521
5    same as above

ρ = 1.0
k\i    1    2    3    4    5    6    7    8
1    763  768  725  739  252  181  134  204
2    766  781  742  743  590  486  440  409
3-5  same as above

* In this and the following tables the maximum possible communality is normalized to 1000.

6. Numerical Examples

The Bayesian model defined with the standard spherical prior distribution of the factor loadings was applied to six published examples of improper solutions. These examples are Harman's eight physical variables data (Harman, 1960, p. 82), with p = 8 and improper at k = 3, the Davis data (Rao, 1955, p. 110), with p = 9 and improper at k = 2, Maxwell's normal children data (Maxwell, 1961, p. 55), with p = 10 and improper at k = 4, the Emmett data (Lawley & Maxwell, 1971, p. 43), with p = 9 and improper at k = 5, Maxwell's neurotic children data (p. 53), with p = 10 and improper at k = 5, and Harman's twenty-four psychological variables data (Harman, 1960, p. 137), with p = 24 and improper at k = 6.
The informational point of view suggests that the hyperparameter ρ of the prior distribution may be "estimated" by maximizing the likelihood of the Bayesian model with respect to ρ. However, for this purpose integration in a high-dimensional space is required. In this paper we will limit our attention to the analysis of solutions with some fixed values of ρ.

TABLE 2
Communality estimates
Davis data

ρ = 0 (MLE)
k\i    1    2    3    4    5    6    7    8    9
1    658  661  228  168  454  800  705  434  703
2    652 1000  243  168  464  816  704  435  701
3   1000  661  220  204  451 1000  701  488  696

ρ = 0.1
k\i    1    2    3    4    5    6    7    8    9
1    653  656  226  167  451  790  700  431  697
2    694  689  227  171  470  800  698  434  696
3    701  695  251  197  470  801  698  444  696
4    same as above

ρ = 1.0
k\i    1    2    3    4    5    6    7    8    9
1    596  598  210  156  415  702  633  400  631
2-4  same as above

The estimation of specific variances under the present Bayesian model was realized by the following procedure. Given an initial estimate D_1² of D², the sample covariance matrix S is replaced by S_1 = D_1^{-1}SD_1^{-1} and the next estimate D_2² of D² is obtained by the relation D_2² = diag (S - D_1B_1B_1'D_1), where B_1 is a p × k matrix such that B_1B_1' provides the least squares fit

∑_{i=1}^{p-1} ∑_{j=i+1}^{p} [S_1(i, j) - ∑_{l=1}^{k} B_1(i, l)B_1(j, l)]² + ρ ∑_{i=1}^{p} ∑_{j=1}^{k} B_1²(i, j) = Min.,

where B_1(i, j) denotes the (i, j)th element of B_1. The estimates of the communalities are defined by diag (D_1B_1B_1'D_1). The process is repeated until convergence is established.
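The following Python sketch is our reconstruction of this iteration, not the author's program. The paper does not specify how the penalized least squares subproblem was solved, so a generic quasi-Newton optimizer is used, and the initial D² and the tolerances are arbitrary choices:

```python
import numpy as np
from scipy.optimize import minimize

def bayes_communalities(S, k, rho, n_iter=100, tol=1e-9):
    """Iterative communality estimation sketched above (our reconstruction).

    Alternates (a) a penalized least squares fit B1 B1' to the off-diagonal
    part of S1 = D1^{-1} S D1^{-1} and (b) the update
    D2^2 = diag(S - D1 B1 B1' D1).
    """
    p = S.shape[0]
    d2 = 0.5 * np.diag(S)                      # arbitrary initial D^2
    off = ~np.eye(p, dtype=bool)               # off-diagonal positions
    B = None
    for _ in range(n_iter):
        d = np.sqrt(d2)
        S1 = S / np.outer(d, d)

        def obj(b):
            BB = b.reshape(p, k) @ b.reshape(p, k).T
            # 0.5 * sum over both triangles = sum over i < j
            return 0.5 * np.sum((S1 - BB)[off] ** 2) + rho * np.sum(b ** 2)

        b0 = B.ravel() if B is not None else \
            np.linalg.svd(S1 - np.eye(p))[0][:, :k].ravel()
        B = minimize(obj, b0, method="L-BFGS-B").x.reshape(p, k)

        DB = d[:, None] * B                    # D1 B1
        d2_new = np.clip(np.diag(S) - np.sum(DB ** 2, axis=1), 1e-8, None)
        if np.max(np.abs(d2_new - d2)) < tol:
            d2 = d2_new
            break
        d2 = d2_new
    return np.sum(DB ** 2, axis=1)             # communalities: diag(D1 B1 B1' D1)
```

Passing a published correlation matrix as S with ρ = 0 should approximately reproduce the maximum likelihood communalities; ρ = 0.1 and ρ = 1.0 correspond to the tempered estimates reported below.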
When ρ = 0 the above procedure produced maximum likelihood solutions that were confirmed by a procedure based on the result of Jennrich and Robinson (1969). When ρ > 0 the solution may only be considered as an arbitrary approximation to the posterior mode. Nevertheless it will be sufficient for the purpose of confirming the effect of the tempering of the likelihood function. For convenience we will call the solution the Bayesian estimate.

TABLE 3
Three factor maximum likelihood solution of Emmett data

i     A(·1)    A(·2)    A(·3)     Ψ
1     .664     .321     .074     .450
2     .689     .247    -.193     .427
3     .493     .302    -.222     .617
4     .837    -.292    -.035     .212
5     .705    -.315    -.153     .381
6     .819    -.377     .105     .177
7     .611     .396    -.078     .400
8     .458     .296     .491*    .462
9     .766     .427    -.012     .231

* The value suggests the singular increase of the 8th communality.

In the case of the above six examples the choice of ρ = 1.0 produced solutions with a significant overall reduction of the communalities, or increase of the specific variances. With the choice of ρ = 0.1 the solutions were usually close to the conventional maximum likelihood estimates but with the improper estimates of communalities suppressed. Improper estimates disappeared completely, unless ρ was made extremely small. For a fixed ρ the estimates of the communalities usually stabilized as k, the number of factors, was increased.
It was generally observed that when the maximum likelihood method produced an improper solution first at k = k_0, the corresponding Bayesian estimate with ρ = 0.1 was proper but with only one communality estimate inflated compared with the estimate at k = k_0 - 1. Such a singular increase of the communality means the reinterpretation of a part of the specific variation as an independent factor. This fact and the result of our analysis of the likelihood function suggest that the singular increase of the communality is usually caused by the overparametrization that makes the estimate sensitive to the sampling variability of the data, rather than by a structural change of the best fitting model at k = k_0. This is in agreement with the earlier observation of Tsumura and Sato (1981) on the nature of improper solutions.
Tables 1 and 2 provide estimates of the communalities of Harman's eight physical variables data and of the Davis data, respectively, for various choices of the order k and of ρ.

TABLE 4
Communality estimates by various procedures
Emmett data

ρ = 0 (MLE)
k\i    1    2    3    4    5    6    7    8    9
1    510  537  300  548  390  481  525  224  665
2    538  536  332  809  592  778  597  256  782
3    550  573  384  788  619  823  600  538  769
4    554  666  379  772  663  856  648  480  759
5    556  868 1000  780  664  836  666  464  743

ρ = 0.1
k\i    1    2    3    4    5    6    7    8    9
1    502  529  296  545  388  478  516  221  652
2    535  531  330  790  588  762  590  252  753
3    549  561  378  783  611  786  590  399  750
4-5  same as above

ρ = 1.0
k\i    1    2    3    4    5    6    7    8    9
1    425  448  254  478  344  422  434  189  540
2    433  450  261  522  391  472  445  196  551
3-5  same as above

In the case of the Harman data the result in Table 1 shows that the improper value 1000 at i = 2 with k = 3, obtained with ρ = 0, disappeared for the positive values of ρ. In particular, with ρ = 0.1, the solutions with k = 2, 3 and 4 are all mutually very close, and they are close to the solutions with ρ = 0 and k = 2 and 3, except for the improper component at k = 3. This suggests that the two-factor model is an appropriate choice, which is in agreement with Harman's original observation. The solution with ρ = 1.0 conforms with this observation.
For the Davis data with k_0 = 2 the non-uniqueness of the convergence of iterative procedures for the maximum likelihood was first reported by Tsumura, Fukutomi, and Asoo (1968). With k = 2, Jöreskog (1967, p. 474) reported an improper estimate of the specific variance for the 1st component and Tsumura et al. (p. 57) found one for the 8th component. As is shown in Table 2 our procedure found one at the 2nd component.

TABLE 5
Suggested choices of dimensionalities*

Harman: eight physical    p = 8     k_0 = 3    k_s = 2    MAICE = ∞**
Davis                     p = 9     k_0 = 2    k_s = 1    MAICE = ∞**
Maxwell: normal           p = 10    k_0 = 4    k_s = 3    MAICE = ∞**
Emmett                    p = 9     k_0 = 5    k_s = 2    MAICE = 3
Maxwell: neurotic         p = 10    k_0 = 5    k_s = 2    MAICE = 3
Harman: 24 variables      p = 24    k_0 = 6    k_s = 5    MAICE = 5

* p: dimension of observation; k_0: lowest order with improper solution; k_s: suggested order by the Bayesian analysis.
** ∞ denotes the saturated model.

The result given in Table 4 of Martin and McDonald (1975, p. 515) also suggests the existence of an improper solution with zero unique variance for the 2nd component. These results suggest the existence of local maxima of the likelihood function. Table 2 also gives improper estimates for the 1st and 6th components with k = 3, which is in agreement with the result reported by Jöreskog.
The estimates obtained with ρ = 0.1 may be viewed as practically identical and are close to the solution with ρ = 0, the maximum likelihood estimate, for k = 1.

This result strongly suggests that the improper solutions are spurious in the sense that they can be suppressed by mild tempering of the likelihood function. The one-factor model seems a reasonable choice in this case. The solution with ρ = 1.0 conforms with the present observation.
The phenomenon of the singular increase of a communality estimate is observed even with k < k_0. Such an example is given by the three-factor maximum likelihood solution of the Emmett data. The maximum likelihood solution by Lawley and Maxwell (1971, p. 43) is reproduced in Table 3, which suggests the singular increase of the communality of the 8th component at k = 3. In Table 4 the estimate with ρ = 0.1 shows a substantial increase of the communality at only the 8th component at k = 3, compared with the estimate at k = 2. The increase is completely suppressed with ρ = 1.0. This result suggests that the high value of the communality estimate of the 8th component at k = 3 obtained with ρ = 0 is spurious. A similar phenomenon was observed with Maxwell's data of neurotic children for the 2nd component at k = 3.
Tsumura and Sato (1981, p. 163) report that, in their experience, improper solutions always came with "quasi-specific factors" that showed singular contributions to some specific variances. The above example shows that our present Bayesian approach can detect the appearance of such a factor even before one gets a definitely improper solution. Thus we can expect that the present approach will realize a reasonable control of improper solutions.
Table 5 summarizes the suggested choices of the number of factors for the six examples, where the choices by the minimum AIC procedure, MAICE, are also included. The suggested choices are based on subjective judgments of the numerical results. It is quite desirable to develop a numerical procedure for the evaluation of the likelihood of each Bayesian model to arrive at an objective judgment.
It is interesting to note here that by a proper choice of ρ the Bayesian approach can produce an estimate of A even with k = p. This explains the drastic change of emphasis between the modelings by the conventional and the Bayesian approach. By the Bayesian approach there is no particular meaning in trying to reduce the number of factors. To avoid unnecessary distortion of the model it is even advisable to adopt a large value of k and control the estimation procedure by a proper choice of ρ.

7. Concluding Remarks
It is remarkable that the idea of factor analysis has produced so much stimulus for the development of statistical modeling. In terms of the structure of the model it is essentially Bayesian. Nevertheless, the practical use of the model was realized by the application of the method of maximum likelihood, and this eventually led to the introduction of AIC.
The concept of the information measure underlying the introduction of AIC leads our attention from parameters to the distribution. This then provides a conceptual framework for the handling of Bayesian modeling as a natural extension of conventional statistical modeling. The occurrence of improper solutions in the maximum likelihood factor analysis is a typical example that explains the limitation of the conventional modeling. The introduction of the standard spherical prior distribution of factor loadings provided an example of overcoming the limitation by a proper Bayesian modeling.
This series of experiences clearly explains the close relation between factor analysis and AIC, or the informational point of view of statistics, and illustrates their contribution to the development of general statistical methodology. It is hoped that this

close contact between psychometrics and statistics will be maintained in the future and
contribute to the advancement of both fields.

References

Akaike, H. (1969). Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21, 243-247.
Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22, 203-217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), 2nd International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723.
Akaike, H. (1980). Likelihood and the Bayes procedure. In J. M. Bernardo, M. H. De Groot, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian Statistics (pp. 143-166). Valencia: University Press.
Akaike, H. (1985). Prediction and entropy. In A. C. Atkinson & S. E. Fienberg (Eds.), A Celebration of Statistics (pp. 1-24). New York: Springer-Verlag.
Bartholomew, D. J. (1981). Posterior analysis of the factor model. British Journal of Mathematical and Statistical Psychology, 34, 93-99.
Bozdogan, H., & Ramirez, D. E. (1987). An expert model selection approach to determine the "best" pattern structure in factor analysis models. Unpublished manuscript.
Harman, H. H. (1960). Modern Factor Analysis. Chicago: University of Chicago Press.
Jennrich, R. I., & Robinson, S. M. (1969). A Newton-Raphson algorithm for maximum likelihood factor analysis. Psychometrika, 34, 111-123.
Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32, 443-482.
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43, 443-477.
Lawley, D. N., & Maxwell, A. E. (1971). Factor Analysis as a Statistical Method (2nd ed.). London: Butterworths.
Martin, J. K., & McDonald, R. P. (1975). Bayesian estimation in unrestricted factor analysis: A treatment for Heywood cases. Psychometrika, 40, 505-517.
Maxwell, A. E. (1961). Recent trends in factor analysis. Journal of the Royal Statistical Society, Series A, 124, 49-59.
Rao, C. R. (1955). Estimation and tests of significance in factor analysis. Psychometrika, 20, 93-111.
Tsumura, Y., Fukutomi, K., & Asoo, Y. (1968). On the unique convergence of iterative procedures in factor analysis. TRU Mathematics, 4, 52-59. (Science University of Tokyo)
Tsumura, Y., & Sato, M. (1981). On the convergence of iterative procedures in factor analysis. TRU Mathematics, 17, 159-168. (Science University of Tokyo)
