
Reduction of regression models under symmetry

2001, Contemporary Mathematics


Reduction of regression models under symmetry.*

Inge S. Helland†

Abstract

For collinear data nearly all regression methods that have been proposed are equivariant under the rotation group in the x-space. It is argued that the regression parameter along orbits of the rotation group can in principle always be estimated in an optimal way as a Pitman type estimator. On the other hand it is argued that it may pay in general to reduce the parameter space of a statistical model when this space is high-dimensional. It follows that any reduction of the regression model must then take place via the orbit index of the rotation group. Further information can be found from the form of the loss function. This is used to discuss the choice of regression model and thereby the choice of regression method. The solution which seems to emerge from this is closely related to the population version of the chemometricians' partial least squares regression. Estimation under the reduced model is briefly discussed, as is model reduction in the corresponding classification problem.

KEY WORDS: Collinearity; Model reduction; Multiple regression; Orbit; Partial Least Squares Regression; Pitman estimator; Rotation; Symmetry.

*Invited paper to be published in Viana, M. and D. Richards [Eds.] Algebraic Methods in Statistics. Contemporary Mathematics Series of the American Mathematical Society.
†Department of Mathematics, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo, Norway. E-mail: [email protected]

1 Introduction.

Collinear data in regression and in multivariate fields like classification are a major practical problem, and also a problem area where there exists a large number of seemingly unrelated methods, many of them derived by ad hoc arguments. The present paper is an attempt to introduce some general theory into this area.

Our point of departure is a rather heuristic statement, which will be made more concrete in the next sections: When the number of parameters in a statistical model is relatively large and the data set is not that big, sound advice is often to reduce the model. This can of course be done in very many ways, but in the present paper we consider a situation where we are given some help in the model reduction process: Assume in general that there is a natural symmetry group attached to the model. Then the model parameter can be divided into an orbit index (a maximal invariant under the group) and a parameter along the orbit. This last parameter will be invariantly measurable, and it can be estimated in a rather satisfying optimal way, at least in principle, by finding the minimum risk equivariant estimator along the orbit. Hence the potential for gaining anything via model reduction is limited to the orbit index. The reduced model should also be invariant under the chosen group. This can again be achieved by making the model reduction through reducing the orbit index parameters. Additional information is found from the specific form of the loss function.

A natural group to look at in many multivariate situations is the rotation group. In a regression context it is often natural to consider estimators that are equivariant under the group of rotations in the x-space. After a brief example to motivate the use of model reduction in regression as a general procedure, we discuss the characterization of the orbits of the rotation group in the parameter space and the degree of non-exactness of the same group.
Thereafter this is used to discuss model reduction and its implications for the construction of regression methods.

2 Model reduction in simple regression.

Look at the elementary linear regression model y_i = α + βx_i + ε_i, where the error terms ε_i are iid N(0, σ²), and where we for simplicity assume that the values x_1, ..., x_n of the explanatory variable satisfy Σ x_i = 0. A straightforward calculation then shows that the mean squared prediction error at the fixed point x_0 under the nonrandom explanatory variable model equals

σ²(1 + 1/n + x_0²/Σ x_i²).   (1)

We will start by illustrating on this simple example our first main point: In many situations it will pay to reduce the statistical model. In fact, this is very often done in an intuitive way by applied statisticians and by users of statistics. In my view, one of the most important open areas of theoretical statistics is to develop a sufficiently general theory of model reduction. Roughly, model reduction may be appropriate if the number of parameters in the original model is relatively large and the number n of data points is limited. This should be contrasted with standard statistical theory, which concentrates on the ideal situation where n tends to infinity and the number of parameters is fixed. Like all tools, model reduction can be misused, but this should not prevent us from trying to investigate when such a reduction may be beneficial.

In the present simple situation we have a small number of parameters, but even here model reduction can be considered. The strength of the example is that an explicit solution is easy to find. The most natural reduced model here will be one without slope: β = 0, leading to the prediction error implied by using the reduced model when data are generated by the full model:

σ²(1 + 1/n) + β²x_0².   (2)

Comparison with (1) shows that in terms of prediction error, model reduction pays (for all non-zero x_0) iff

β² Σ x_i² < σ².   (3)

This condition can also be written |t| < 1, where t is the 'theoretical t-value' β/std(β̂) with std(β̂) = σ/√(Σ x_i²). In this form it can be shown that the condition can also be generalized to the question of deleting a single variable from a multiple regression model, a fact that has also been referred to in applied statistics books like Snedecor and Cochran (1989).

In this paper we shall mainly concentrate on the random x regression, where the assumption on the error terms is that the ε_i, given all the x-variables, are iid N(0, σ²), and where we typically may assume that x_1, ..., x_n are iid N(0, σ_x²). Thus the assumption Σ x_i = 0 is replaced by E(x_i) = 0. Then the new prediction error is found by taking the expectation over the x-variables in (1) and (2), leading to the following criterion for model reduction:

β²σ_x² < σ²/(n − 2).   (4)

This illustrates explicitly the statement made earlier that model reduction may be beneficial when the number of data points is small or moderate. Unfortunately, the criterion depends upon unknown parameters, a general problem in this area.

In this paper we will concentrate on the act of reducing a given model; the estimation process under the reduced model will not be fully discussed. In Bickel (1984) a general discussion can be found on the problem of estimating under a simple model when one wants to guard against deviations towards a larger model. A very general large sample discussion of model reduction using likelihood theory can be found in Hjort (1991). A number of specific examples of model reduction can also be found in these two papers and in references given there, while other specific cases are discussed elsewhere.
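As a quick numerical illustration of the criterion above, one can simulate the random-x model and compare realized prediction errors of the full and the slope-free models. This is a sketch only: the parameter values, the sample size and the number of replications are arbitrary illustrative choices, and the cutoff σ²/(n − 2) follows the reconstruction of (4) given above.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, sigma, sigma_x, n = 0.15, 1.0, 1.0, 10     # weak slope, few observations
reps = 50000

err_full, err_red = [], []
for _ in range(reps):
    x = rng.normal(0.0, sigma_x, n)
    y = 2.0 + beta * x + rng.normal(0.0, sigma, n)
    x0 = rng.normal(0.0, sigma_x)                   # a future x
    y0 = 2.0 + beta * x0 + rng.normal(0.0, sigma)   # the corresponding future y
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a1 = y.mean() - b1 * x.mean()
    err_full.append((y0 - (a1 + b1 * x0)) ** 2)     # full model prediction error
    err_red.append((y0 - y.mean()) ** 2)            # reduced model: slope fixed at 0

print("full model MSE   :", np.mean(err_full))
print("reduced model MSE:", np.mean(err_red))
print("beta^2 sigma_x^2 =", beta ** 2 * sigma_x ** 2,
      " versus sigma^2/(n-2) =", sigma ** 2 / (n - 2))
```

With these values β²σ_x² is well below σ²/(n − 2), and in such runs the reduced model indeed shows the smaller average prediction error, in line with the criterion.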
For instance, a treatment of model reduction in Kalman filter models is given in Desai et al (1985). The idea of using rotation symmetry to find the best reduction of linear models in the way advocated in the present paper seems to be new.

3 The multiple random x regression model under rotational symmetry.

Consider now in general a p-dimensional x-distribution and the corresponding (p+1)-dimensional joint (x, y)-distribution with expectation (μ_x^t, μ_y)^t and covariance matrix

⎛ Σ_x      σ_xy ⎞
⎝ σ_xy^t   σ_y² ⎠.   (5)

From now on we will usually replace these covariance parameters by the equivalent parameter set θ = (Σ, β, σ²), where Σ = Σ_x and β = Σ_x^{-1} σ_xy. The parameter of interest is β, and we will be interested in prediction problems, assuming a linear conditional expectation E(y|x) = α + β^t x (this includes the multinormal case and, more generally, the case of elliptical distributions; see references in Helland, 2000a). We also assume a linear predictor ŷ_x = α̂ + β̂^t x. From this, considering the expectation of (ŷ_x − y)² over future x's and using α̂ = ȳ − β̂^t x̄, one can see that a natural loss function (cf. also Theorem 1 in Helland and Almøy, 1994) is

L(θ; β̂) = (β̂ − β)^t Σ (β̂ − β).   (6)

The data set from which estimation shall be made consists of n independent observations from this model, summarized in the usual way as (X, y). When n is large compared to p, nobody would try to challenge the ordinary least squares estimator β̂_LS = (X̃^t X̃)^{-1} X̃^t y, where X̃ is the centered X-matrix. But the difficult situations arise when p is relatively large, even of the same order as n, or, more generally, when X^t X has some extremely small eigenvalues.

Most of the solutions that have been proposed for prediction in regression models (principal component regression, latent root regression, ridge regression, partial least squares regression, continuum regression and so on) are equivariant under rotation in the x-space of the above model. Therefore, a natural task might be to try to find the best or nearly best estimator among these equivariant ones. For completeness we repeat the necessary definitions; see for instance Lehmann and Casella (1998) for further discussion.

The rotation group in question has group elements g which can be identified with the orthogonal p × p matrices Q with determinant +1. In the sample space, the group G is given by the transformations (X, y) → (XQ, y), which induces the group Ḡ acting on the parameter space through the transformations

(Σ, β, σ²) → (Q^t Σ Q, Q^t β, σ²).   (7)

The parametric function β = β(θ) is trivially seen to be invariantly estimable (also called permissible; see Helland, 2000b): β_1 = β_2 implies Q^t β_1 = Q^t β_2 for all Q. An estimator β̂ is equivariant if we also have β̂ → Q^t β̂ when (X, y) → (XQ, y). For equivariant estimators the loss (6) will be invariant. Since the parameter space is closed under the transformations in Ḡ, we have what Lehmann and Casella (1998) call an invariant estimation problem.

A difficulty, however, is that the group here is not transitive on the parameter space. For transitive groups the risk (expected loss) will be a constant function of the parameter, which is a strong indication that the problem of finding an equivariant estimator that minimizes the risk uniformly has a unique solution. Such best equivariant estimators are indeed found quite generally as the Pitman estimator and its generalizations to non-location groups. In the present case, however, such uniqueness can only be expected when estimating parameters on the orbits of Ḡ.
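The equivariance property referred to here is easy to check numerically. The following sketch (illustrative only; the simulated data, the ridge penalty and the helper names are not from the paper) rotates the x-variables by an orthogonal Q with determinant +1 and verifies that ordinary least squares and ridge regression both map β̂ to Q^t β̂:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam=1.0):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# A random rotation: orthogonal with determinant +1.
Q, R = np.linalg.qr(rng.standard_normal((p, p)))
Q = Q * np.sign(np.diag(R))
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]

for fit in (ols, ridge):
    b, b_rot = fit(X, y), fit(X @ Q, y)
    # Equivariance: fitting on XQ should give Q^t times the original estimate.
    print(fit.__name__, np.allclose(b_rot, Q.T @ b))
```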
When estimating orbit indices (maximal invariants under Ḡ), other methods must be used. It is only with respect to this last part of the estimation that we can expect any gain from trying to reduce the parameter space when p is large.

4 Non-exactness, orbits and maximal invariants in the parameter space.

We start by decomposing the covariance matrix Σ, that is, giving the general form that such a positive definite matrix might have, i.e.,

Σ = Σ_{k=1}^q λ_k P_k,   (8)

where the P_k are projection matrices upon orthogonal eigenvector spaces V_k (which we will call strata in analogy with the use of this term in Nelder, 1965). Here all the λ_k are positive and different (otherwise strata could have been combined). Without loss of generality we can take the λ_k's in decreasing order. Obviously, q ≤ p. We will assume that Σ has full rank p, which is equivalent to

Σ_{k=1}^q dim(V_k) = Σ_{k=1}^q tr(P_k) = p,

or again to the requirement that the direct sum of the spaces V_k equals the full p-dimensional Euclidean space R^p.

The first question we ask is which transformations in the rotation group conserve the matrix Σ.

Proposition 1. The following are equivalent:
(a) Q^t Σ Q = Σ.
(b) Q P_j = P_j Q for j = 1, ..., q.
(c) Q can be written as the commuting product of q rotations Q_k, where Q_k is a rotation only within one stratum V_k.

The proof of this is given in Appendix 1. The next question is simpler: Which transformations conserve the regression vector β? From Q^t β = β it follows that β is an eigenvector of Q with eigenvalue 1, or, in a more useful characterization: Q is a rotation around the axis determined by β. Combining these two results, we get:

Corollary 1. The transformation Q conserves (Σ, β) if and only if Q is the commuting product of q rotations Q_k, each acting on a single stratum V_k, such that for each stratum V_k upon which β has a non-zero component P_k β we have that Q_k is a rotation around the axis given by P_k β.

Since σ² is unaffected by the transformations, this corollary characterizes the transformations that conserve θ = (Σ, β, σ²), i.e. it shows the degree of non-exactness of the rotation group in the parameter space. The following theorem gives one way of characterizing the orbits of the same group in the parameter space. We omit the proof, since it uses the same techniques and is very similar to the proof of Proposition 1, now looking at the equation Q^t Σ_1 Q = Σ_2 instead of the previous Q^t Σ Q = Σ.

Theorem 1. Two parameter values θ_1 = (Σ_1, β_1, σ²_1) and θ_2 = (Σ_2, β_2, σ²_2) with Σ_r = Σ_{k=1}^{q_r} λ_k^{(r)} P_k^{(r)} are on the same orbit of the rotation group if and only if:
(1) q_1 = q_2 = q.
(2) λ_k^{(1)} = λ_k^{(2)}, k = 1, 2, ..., q.
(3) There is a rotation matrix Q such that P_k^{(2)} = Q^t P_k^{(1)} Q, k = 1, ..., q. In particular, the set of dimensions of {V_k^{(2)}} must match the dimensions of {V_k^{(1)}}.
(4) The same Q gives β_2 = Q^t β_1, or equivalently P_k^{(2)} β_2 = Q^t (P_k^{(1)} β_1), k = 1, ..., q.
(5) σ²_2 = σ²_1.

From this result we also get the maximal invariant of the parameter group, since this always equals the index of the orbits.

Corollary 2. The orbit index of the parameter group is given by
(i) For the ordered set of strata: their relative orientation, their dimensions and the corresponding eigenvalues λ_k.
(ii) The norms γ_k = ||P_k β|| of the projected regression coefficients.
(iii) σ².

Proof. From Theorem 1 it is only left to prove that it is enough to characterize the β-dependence of the orbits by the norms γ_k of each stratum component. This can be seen from Proposition 1.
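To make the orbit index of Corollary 2 concrete, the following sketch (an illustration under assumed parameter values, not taken from the paper) builds a Σ with q = 3 strata, extracts the projections P_k by grouping equal eigenvalues, and computes the invariants λ_k, dim(V_k) and γ_k = ||P_k β||:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
# Sigma with three strata: eigenvalues 5 (dim 2), 2 (dim 3), 1 (dim 1).
eigvals = np.array([5.0, 5.0, 2.0, 2.0, 2.0, 1.0])
U, _ = np.linalg.qr(rng.standard_normal((p, p)))      # a random eigenvector basis
Sigma = U @ np.diag(eigvals) @ U.T
beta = U @ np.array([1.0, -2.0, 0.0, 0.0, 0.0, 0.5])   # no component in stratum 2

lam, vecs = np.linalg.eigh(Sigma)
# Group (numerically) equal eigenvalues into strata and form the projections P_k.
strata = []
for value in sorted(set(np.round(lam, 8)), reverse=True):
    idx = np.where(np.isclose(lam, value))[0]
    P_k = vecs[:, idx] @ vecs[:, idx].T
    strata.append((value, P_k))

for k, (value, P_k) in enumerate(strata, start=1):
    gamma_k = np.linalg.norm(P_k @ beta)
    print(f"stratum {k}: lambda = {value:.3f}, dim = {int(round(np.trace(P_k)))}, "
          f"gamma = {gamma_k:.3f}")
# The orbit index consists of these (lambda_k, dim V_k, gamma_k) together with sigma^2
# and the relative orientation of the strata; the directions of P_k beta within each
# stratum are not part of it.
```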
By that result, the matrix Q of Theorem 1 (3) has a non-uniqueness corresponding to any set of rotations within some or all of the strata. Using such a rotation within a stratum, we see that the direction of the stratum component of β within each stratum is arbitrary; only its norm is fixed.

5 Optimizing on orbits.

Assume now data (X, y) from n independent units. In this section we will also assume multinormality, at least as a useful approximation. The estimation of the expectations is then trivial, and by sufficiency the estimation of the covariance structure can be assumed to depend only on

S = S_xx = n^{-1} Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^t,   s = s_xy = n^{-1} Σ_{i=1}^n (x_i − x̄)(y_i − ȳ),   s_y² = n^{-1} Σ_{i=1}^n (y_i − ȳ)².

Now concentrate on a fixed orbit of the rotation group in the parameter space. This group is compact, and thus has a unique Haar measure, which can be taken as a prior probability. Then, using well known results generalizing the Pitman estimator (see for instance Helland, 2000b; the essential result is also given in Berger, 1980) and using the loss function (6), the minimum risk estimator, given the orbit, will be

β̂_0 = (E_post Σ)^{-1} E_post(Σβ),   (9)

where E_post is the corresponding a posteriori expectation. The minimal risk (given the orbit) will then be

R_min = E_post(β^t Σ β) − E_post(Σβ)^t (E_post Σ)^{-1} E_post(Σβ),   (10)

because E_post(β^t Σ β) = β^t Σ β, since β^t Σ β = Σ_k λ_k γ_k² is a constant of the orbit. As usual, the posterior density is proportional to the product of the likelihood and the prior. Using a known form for the multinormal likelihood, found by first taking the likelihood of X and then multiplying by the conditional likelihood of y given X, we find:

Theorem 2. Fix an orbit of the rotation group in the above situation. Then the minimum risk equivariant estimator for β, given the orbit, is β̂_0 = D^{-1} N, with

D = ∫ Q^t Σ Q exp[ (n/(2σ²))(2β^t Q s − β^t Q S Q^t β) − (n/2) tr(Q^t Σ^{-1} Q S) ] dν(Q),
N = ∫ Q^t Σ β exp[ (n/(2σ²))(2β^t Q s − β^t Q S Q^t β) − (n/2) tr(Q^t Σ^{-1} Q S) ] dν(Q),   (11)

where (Σ, β) correspond to a fixed point on the orbit and dν(Q) is Haar measure for the rotation group, with group elements identified with orthogonal matrices Q. Also, σ² = σ_y² − β^t Σ β = σ_y² − Σ_k λ_k γ_k² is the residual variance, a constant of the orbit.

To calculate the integrals of (11) explicitly seems to be impossible except in some special cases. For instance, when Σ = S = I, it follows from Gradshteyn and Ryzhik (1994), formula 4.641.2, together with some simple derivations (cf. also Example 4 in Helland, 2000b), that β̂_0 is proportional to s, with a proportionality factor given by a ratio of modified Bessel functions of consecutive orders evaluated at an argument proportional to |β||s|. This is unfortunate, for if it were possible to give (11) as an explicit formula, this formula could be of definite practical importance: For a fixed orbit, the solution above is the best equivariant estimator. Practically all known regression methods for collinear data are equivariant under the rotation group. This means that, given that some definite established method can be put into the form of first estimating the orbit index of the rotation group, and given that we estimate this orbit index in the same way as in the method in question, the estimator derived from (11) will dominate this method.

In fact, more than this can be said. The estimator (11) does not depend on the whole orbit index as given in Corollary 2.

Proposition 2. The parameter dependence of (11) is given by
(i) σ².
(ii) The number m of strata such that γ_k ≠ 0.
(iii) For given σ² and m, a vector function β̂_0((λ_1, γ_1), ..., (λ_m, γ_m); data) which for all data values is symmetric under any permutation of the m arguments.

Proof.
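Although (11) has no closed form, it can be approximated by straightforward Monte Carlo over the rotation group, in the spirit of the MCMC-type computations suggested later in the paper. The sketch below is illustrative only: the parameter values, the sample size and the number of draws are arbitrary choices, and the weight expression follows the reconstruction of (11) given above.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 30

# Assumed fixed point (Sigma, beta, sigma^2) on the chosen orbit (illustrative values).
Sigma = np.diag([4.0, 4.0, 1.0, 1.0])          # two strata, each of dimension 2
beta = np.array([1.0, 0.0, 0.5, 0.0])
sigma2 = 0.5
Sigma_inv = np.linalg.inv(Sigma)

# Simulated data reduced to the sufficient statistics S = S_xx and s = s_xy.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()
S, s = Xc.T @ Xc / n, Xc.T @ yc / n

def haar_rotation(p, rng):
    """Draw Q approximately Haar-distributed on the rotation group."""
    Q, R = np.linalg.qr(rng.standard_normal((p, p)))
    Q = Q * np.sign(np.diag(R))                 # Haar on the orthogonal group
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]                      # force determinant +1
    return Q

logw, dens, nums = [], [], []
for _ in range(20000):
    Q = haar_rotation(p, rng)
    b = Q.T @ beta                              # the point Q^t beta on the orbit
    logw.append(n / (2 * sigma2) * (2 * b @ s - b @ S @ b)
                - 0.5 * n * np.trace(Q.T @ Sigma_inv @ Q @ S))
    dens.append(Q.T @ Sigma @ Q)
    nums.append(Q.T @ Sigma @ beta)

w = np.exp(np.array(logw) - np.max(logw))       # stabilized posterior weights
D = np.tensordot(w, np.array(dens), axes=1)     # approximates the matrix integral D
N = w @ np.array(nums)                          # approximates the vector integral N
print("beta_hat given the orbit:", np.linalg.solve(D, N))
```

With a modest number of uniform draws this is crude, since a few rotations dominate the weights; an importance or Markov chain sampler concentrated where the weights are large would be needed for larger p.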
Fix σ², and look first at the 'denominator' D in β̂_0. It depends upon the following quantities:

Q^t Σ Q = Σ_i λ_i (Q^t e_i)(Q^t e_i)^t,   Q^t β = Σ_i β_i (Q^t e_i),   tr(Q^t Σ^{-1} Q S) = Σ_i λ_i^{-1} (Q^t e_i)^t S (Q^t e_i),   (12)

where {e_i} is a set of eigenvectors for Σ and β_i = β^t e_i. When the set {Q^t e_i, i = 1, ..., p} is given a uniform distribution under rotation, the expectation of any such function will depend upon {(λ_i, β_i), i = 1, ..., p} in a symmetric way. One particular such symmetric function is the number q of strata. So assume that V_1 = span(e_1, ..., e_l) is a stratum with constant eigenvalue λ_1. Then, in particular, the function D should be symmetric under any rotation within this stratum. This means that it can be regarded as a function of ((λ_1, γ_1), (λ_1, 0), ..., (λ_1, 0)) together with the rest of the parameters, or equivalently a function of (λ_1, γ_1) and these parameters. Hence an alternative, and more convenient, way to state the result above is that, given σ² and q, D is a symmetric function of {(λ_k, γ_k), k = 1, ..., q}.

A similar analysis can be made for the 'numerator' N, but looking at this part we are also led to a refinement. In the same way as the stratum number q, the function m = (number of strata such that γ_k ≠ 0) is also a symmetric function of {(λ_i, β_i), i = 1, ..., p} (obviously, m ≤ q). Now take a closer look at (11). Without loss of generality, we can let Σ in this formula be diagonal with the eigenvalues on the diagonal. Then we see that any contribution from the strata with γ_k = 0 will vanish. Thus we can take β̂_0 as the symmetric function of the m bivariate orbit parameters as stated.

Corollary 3. The minimal risk given the orbit, R_min, depends upon σ² and m, and given these parameters it is a function of ((λ_1, γ_1), ..., (λ_m, γ_m)) which is symmetric under any permutation of the m arguments.

6 Model reduction.

Consider in general a statistical model with parameter θ ∈ Θ, an open set in some topological space, and let Ḡ be a non-transitive group on this space. Let L(θ, θ̂) be the loss function of an invariant estimation problem in this setting. Fix θ_0 ∈ Θ. We are interested in the effect of reducing the statistical model by restricting θ to some set Γ ⊂ Θ such that θ_0 ∈ Γ. From what has been said earlier, an optimal equivariant estimator can be found quite generally on the orbit of Ḡ containing θ_0. Thus we can without loss of generality let Γ contain this orbit. This gives the minimal useful choice of Γ. In practice, Γ should be chosen as a set Φ in the orbit index space of Ḡ taken together with the corresponding orbits. Our task is to say something about the choice of Φ.

Let us specialize again to the regression setting.

Proposition 3. Assume the multiple regression model, and let Ḡ be the group of rotations in the parameter space. Let β be estimated as in Theorem 2 for each fixed orbit. Then the only model reductions that can affect the estimate are given by combinations of some or all of the following possibilities:
- fix the residual variance σ²,
- fix the number m,
- fix some specific symmetric function of ((λ_1, γ_1), ..., (λ_m, γ_m)).

This follows immediately from Proposition 2. Looking at these three possibilities, the first one seems not very attractive from an applied point of view; it will hardly lead to any decrease in the prediction error.
Also, the last type of model reduction may be an option when one has particular model information, but it seems very difficult to specify any such function which can serve as an input to a general regression method. Furthermore, this reduction method only makes sense in combination with the second one. This leaves one in general with the option of fixing the number m at some value m < p as definitely the most natural approach to model reduction from the present point of view.

It is interesting, then, that a practical regression method closely related to this particular choice has been used by chemometricians for decades now, in a large spectrum of applications. The method is claimed to have great empirical success, and it has obtained quite an amount of attention. A special issue of the journal Chemometrics and Intelligent Laboratory Systems devoted to this particular method is planned to appear shortly.

7 Partial least squares regression in population form.

Partial least squares regression (PLS) was proposed as a method to solve collinearity problems in multiple regression/calibration by Wold et al. (1983), and has gained an increasing popularity, especially among chemometricians; see Martens and Næs (1989), and also the forthcoming special journal issue mentioned above. The method was proposed as an algorithm; in fact, several algorithms were available, and two of them are shown to be equivalent in Helland (1988). In the present context the corresponding population algorithm, treated in Helland (1990), is of special interest. It can be formulated as follows, taking the (p+1)-dimensional variable (x, y) as the point of departure. Replacing variables by data vectors and (co)variances by their sample counterparts here leads to the ordinary PLS algorithm.

1. Initialization: e_0 = x − μ_x, f_0 = y − μ_y.
Next, do for a = 1, 2, ...:
2. Weights and scores: w_a = Cov(e_{a−1}, f_{a−1}), t_a = e_{a−1}^t w_a.
3. Loadings by 'least squares': p_a = Cov(e_{a−1}, t_a)/Var(t_a), q_a = Cov(f_{a−1}, t_a)/Var(t_a).
4. New residuals: e_a = e_{a−1} − p_a t_a, f_a = f_{a−1} − q_a t_a.

Stopping at step a = k, this leads to the bilinear representation

x = μ_x + p_1 t_1 + ... + p_k t_k + e_k,   y = μ_y + q_1 t_1 + ... + q_k t_k + f_k,   (13)

where loadings and scores are determined by the algorithm. The corresponding sample equation is much used in the chemometric literature in the same way as statisticians use factor analysis, and it is often claimed, and in fact also demonstrated, that the corresponding plots are useful for interpreting complex data sets.

Rather than interpretation, our main theme in the present paper is prediction at some set of x-values x_0. Then the natural procedure is to use the partial least squares algorithm to construct new scores t_1^0, ..., t_k^0 from x_0 and, in analogy with the last equation in (13), take

ŷ_{k,PLS} = μ_y + q_1 t_1^0 + ... + q_k t_k^0.   (14)

Proposition 4. The above predictor can be written

ŷ_{k,PLS} = μ_y + β_{k,PLS}^t (x_0 − μ_x),   (15)

with

β_{k,PLS} = W_k (W_k^t Σ W_k)^{-1} W_k^t σ,   (16)

where σ here is σ_xy and W_k = [w_1, ..., w_k].

Proof. The corresponding sample case is proved in Helland (1988), Theorem 3.1.

The formula (16) indicates that the space S_k spanned by the columns of W_k is of some interest. Further progress can be made if we note that this space can also be spanned by a so-called Krylov sequence.

Proposition 5. As long as w_k is nonzero, an alternative basis for S_k is given by the vectors σ, Σσ, ..., Σ^{k−1}σ.

Proof. Compare the proof of Proposition 3.1 in Helland (1988).
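A direct transcription of the population algorithm into code may clarify the bookkeeping. The sketch below is illustrative only: the covariance values are arbitrary, and the updates for the residual (co)variances are derived from steps 2 to 4 above rather than copied from the paper. It also checks formula (16) for the resulting regression vector.

```python
import numpy as np

def population_pls(Sigma, sigma_xy, k):
    """Population PLS in terms of (Sigma, sigma_xy): weights w_1..w_k and beta_{k,PLS}."""
    C = Sigma.copy()            # Cov(e_{a-1}); Cov(e_0) = Sigma
    c = sigma_xy.copy()         # Cov(e_{a-1}, f_{a-1}); starts at sigma_xy
    W = []
    for _ in range(k):
        w = c.copy()                              # step 2: w_a = Cov(e_{a-1}, f_{a-1})
        if np.allclose(w, 0.0):                   # the algorithm terminates when a weight vanishes
            break
        W.append(w)
        Cw = C @ w
        var_t = w @ Cw                            # Var(t_a) with t_a = e_{a-1}^t w_a
        p_a = Cw / var_t                          # step 3: x-loading p_a
        q_a = (w @ c) / var_t                     # step 3: y-loading q_a
        C = C - var_t * np.outer(p_a, p_a)        # step 4: Cov(e_a)
        c = c - var_t * q_a * p_a                 # step 4: Cov(e_a, f_a)
    W = np.column_stack(W)
    beta = W @ np.linalg.solve(W.T @ Sigma @ W, W.T @ sigma_xy)   # formula (16)
    return W, beta

# Illustrative parameters (not from the paper).
Sigma = np.diag([4.0, 4.0, 2.0, 1.0])
sigma_xy = np.array([1.0, 0.0, 0.0, 0.5])          # non-zero projection on two strata only
W, beta2 = population_pls(Sigma, sigma_xy, k=2)
print(beta2, np.linalg.solve(Sigma, sigma_xy))     # the two vectors agree after 2 steps here
```

Because σ_xy here has a non-zero projection on only two of the three strata of Σ, two steps already reproduce Σ^{-1}σ_xy; this is the content of Theorem 3 below.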
Now we can start to make some algebraic conclusions from these results (for more details, see Helland, 1990). First, if there should be some value of k such that the corresponding weight vanishes, i.e., w_k = 0, then the PLS algorithm will be terminated at this value (and only at such values). Secondly, by Proposition 5, the first value k = m such that w_k = 0 for k > m is characterized by the property that the k vectors of the Krylov sequence span the same m-dimensional space S_m also when k > m. If this happens, we will say that the population PLS algorithm stops at step m. Of course, the algorithm will always stop at m = p; the interesting case is when it stops before that. Note that by Proposition 4, the regression vector β_{k,PLS} will always belong to the space S_m.

From these observations there is in fact only a short step to the main result of this section. This result gives a fairly precise interpretation of the result of the population PLS algorithm in the light of model reduction and rotational symmetry.

Theorem 3. The population PLS algorithm stops at step m if and only if the number of strata in the population model with non-zero regression coefficient is equal to m. The resulting PLS regression vector β_{m,PLS} will then equal β = Σ^{-1}σ.

Main idea of proof. From (8), a linear dependency c_1 σ + c_2 Σσ + ... + c_k Σ^{k−1}σ = 0 in the Krylov sequence can be written

Σ_h (c_1 + c_2 λ_h + ... + c_k λ_h^{k−1}) P_h σ = 0.

This can happen (with not all c_j equal to zero) only if k > m, where now m is the number of strata with P_h σ nonzero. Note that for k = m there is a Vandermonde determinant among the different λ_h involved in this argument.

Thus with this natural specification of the population PLS algorithm, there is a close connection to the optimality results discussed in Section 5 and Section 6 above. Briefly, the termination of this algorithm at m (< p) steps is equivalent to a reduced model of the type which results in a natural way from the orbit behavior of the optimal equivariant regressor under rotation invariance. And the regression vector that results from the corresponding population PLS algorithm is what it should be.

Now, of course, what is used in practice by chemometricians is the sample PLS algorithm, and nothing can be said about this from the above discussion, except a vague indication that it will be sensible if n is not too small and a good stopping rule is used. The stopping rule used in practice for PLS regression is based on cross-validation using prediction performance, and has only an indirect link to the population stopping criterion mentioned above: If w_k = 0, then one should expect that the prediction at step k − 1 is so good that this will affect the cross-validation.

The opinions on the sample PLS algorithm are still rather varied. Many practitioners are satisfied with the fact that it gives an automatic procedure which empirically seems to give good results in many cases, and where they also have a possibility to combine prediction with some latent variable analysis. Chemometricians have generalized the algorithm to cases with several y-variables and even to cases with many blocks of variables. On the other hand, sceptical remarks on the procedure have been raised by statisticians, for instance in Frank and Friedman (1993). Butler and Denham (2000) have just shown that the sample PLS procedure, while shrinking globally, has such poor shrinkage properties at each factor that it seems impossible that it can satisfy any reasonable optimality property. This is where the situation stands today.
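Theorem 3 is easy to probe numerically through the Krylov characterization of Proposition 5. The sketch below (illustrative parameter values, not from the paper) builds a Σ with three strata, lets σ = σ_xy have a non-zero projection on only two of them, and checks that the Krylov space stops growing at dimension m = 2 while the corresponding regression vector already equals Σ^{-1}σ:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
U, _ = np.linalg.qr(rng.standard_normal((p, p)))             # random eigenvector basis
lam = np.array([6.0, 6.0, 3.0, 3.0, 3.0, 1.0])                # three strata
Sigma = U @ np.diag(lam) @ U.T
sigma = U @ np.array([1.0, 0.5, 0.0, 0.0, 0.0, -1.0])         # no projection on the middle stratum

beta_true = np.linalg.solve(Sigma, sigma)                     # beta = Sigma^{-1} sigma
for k in (1, 2, 3):
    # Krylov basis sigma, Sigma sigma, ..., Sigma^{k-1} sigma (Proposition 5).
    K = np.column_stack([np.linalg.matrix_power(Sigma, j) @ sigma for j in range(k)])
    d, *_ = np.linalg.lstsq(K.T @ Sigma @ K, K.T @ sigma, rcond=None)
    beta_k = K @ d                                            # formula (16) in the Krylov basis
    print(f"k = {k}: Krylov rank = {np.linalg.matrix_rank(K)}, "
          f"error = {np.linalg.norm(beta_k - beta_true):.1e}")
# Expected: ranks 1, 2, 2; the error drops to roughly machine precision at k = 2,
# the number of strata carrying a non-zero component of sigma (Theorem 3).
```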
The PLS algorithm will continue to be used by those who like it, even though it has no optimality properties. What mathematical statisticians can hope for is first to make better evaluations of the method. But a more interesting challenge will be whether we can improve the method by using our common way of thinking, namely in terms of optimality. The purpose of the present paper has been to try to go some way in that direction. By what has been discussed up to now, we have a population procedure which is equivalent to a model reduction with reasonably well defined good properties related to rotation invariance. We have integral formulae for the optimal estimated regression vector given the orbit indices. In addition to finding practical ways of calculating that integral, where MCMC and related techniques probably may help, the remaining challenge is to find good estimates of the remaining orbit index parameters.

8 Estimation of orbit parameters.

In the formula (11), the point (Σ, β, σ²) is a fixed point on some orbit of the rotation group in the parameter space. But by Proposition 3 we may without loss of generality let Σ be diagonal with first diagonal elements λ_1, ..., λ_m, and then take β = (γ_1, ..., γ_m, 0, 0, ..., 0)^t. In addition to its data dependence, the solution will then depend upon m, these parameters and σ². This means that we are left with 2m + 1 parameters to estimate. This is in general a very small number compared to the (p+1)(p+2)/2 parameters in the original covariance structure. Note, however, that any value of the parameter (Σ, β, σ²) which belongs to the same orbit (or, more generally, to orbits which can be linked to that one through the parameter dependence described in Proposition 2) will give the same estimate (11).

Taking maximum likelihood under multinormality as a natural approach to the estimation of these parameters, the last remark is a great aid together with the following simple remark: When θ̂ is a unique maximum likelihood estimator for θ, then, for any f which is constant on a subset Γ of the parameter space Θ and one-to-one on Θ\Γ, we have that f(θ̂) is the unique maximum likelihood estimator for f(θ). Thus to find maximum likelihood estimators of the 2m + 1 parameters described above, it is enough to give maximum likelihood estimators of the covariance parameters under the hypothesis that the number of strata in Σ with non-zero regression component ('number of relevant components', cf. Næs and Helland, 1993) is equal to m. This latter problem was discussed in Helland (1992). The maximum likelihood solution found there can be formulated as follows:

(a) With A = S_xx − s_y^{-2} s_xy s_xy^t and B = S_xx, find a p × m matrix R which minimizes the criterion G(R) derived in Helland (1992).

(b) With R as in (a), and taking P_R = R(R^t R)^{-1} R^t, the maximum likelihood estimates under the hypothesis of m relevant components are given in Helland (1992) as explicit functions of P_R and the sample covariances.

Note that G(R) will be the same here if R is replaced by another matrix spanning the same space, and similarly for the estimators. Hence what in effect is estimated is an m-dimensional space, the space of relevant components. In Helland (1992) an approximate algorithm for doing the minimization in (a) was proposed, but the prediction method based upon this algorithm performed rather poorly in simulations done by Almøy (1996).
A reasonable conjecture is that performance will be better when this is combined with the integral (11), since the effective number of parameters estimated by the maximum likelihood part will then be reduced considerably, as discussed above. The computation needed in order to find these predictions will be rather heavy, however, a disadvantage which is particularly important if one wants to use cross-validation to determine the number of relevant components, the most common procedure. An alternative emerging from the likelihood approach here is to use a likelihood ratio test.

In conclusion, there are many questions concerning estimation and prediction under the reduced regression model which need further investigation. In addition to questions related to performance under the reduced model itself, robustness questions in the spirit of the paper by Bickel (1984) will need to be addressed. In addition to addressing optimality problems of the kind discussed above, it is also of interest to do further investigations of the performance of the ordinary sample partial least squares algorithm, since this is the algorithm used by chemometricians today. Asymptotic expansions, as in Helland and Almøy (1994), are one of several ways of approaching such questions.

9 Classification.

The sample partial least squares algorithm is also used by chemometricians outside the regression context, for instance when doing classification. A large number of applications of this can be found in the chemometrical literature, and the technique has also spread outside the field of chemometrics, for instance to psychiatry (Gottfries et al, 1995). It is obvious that techniques of this kind meet a latent need among researchers with nearly collinear data, both in regression problems and in classification problems.

In this section we will discuss model reduction under rotational symmetry in the classification problem, using ordinary discriminant analysis as the classification method, and point at similarities with and differences from the regression situation. We will concentrate on the simplest possible case: two-class discriminant analysis with equal priors for the two classes, equal covariance matrix for the observations, equal costs of misclassification and an equal number n of observations in the two classes. We have in mind a situation where the number p of classification variables is rather large, also compared to n.

We also assume a multinormal model, the standard point of departure in discriminant analysis: In sample 1 we have n independent observations which are N(μ_1, Σ) and have mean x̄_1; in sample 2 the n observations are N(μ_2, Σ) and have mean x̄_2. The standard way of classifying a new observation x is to allocate it to class 1 iff

[x − ½(x̄_1 + x̄_2)]^t S^{-1}(x̄_1 − x̄_2) > 0,

where S is the pooled covariance matrix from the 2n observations. When p is large, or if we have other sources of collinearity, we may want to replace S^{-1}(x̄_1 − x̄_2) by some other vector b computed from the observations. Thus x is classified to class 1 iff

[x − ½(x̄_1 + x̄_2)]^t b > 0,

and the probability of misclassification, given the samples, is

½ Φ(−[μ_1 − ½(x̄_1 + x̄_2)]^t b / √(b^t Σ b)) + ½ Φ([μ_2 − ½(x̄_1 + x̄_2)]^t b / √(b^t Σ b)),

where Φ is the standard normal cumulative distribution function. We will now simplify notation by assuming μ_1 = −μ_2 = μ. Also, it is natural to assume that ½(x̄_1 + x̄_2) is independent of b, which is true for the original choice of b.
Integrating over the probability distribution of this variable, and using the fact that E(Φ(v − u)) = Φ(v/√(1 + τ²)) when u ~ N(0, τ²), we arrive at a natural loss function for the classification problem:

L(θ, b) = Φ( −μ^t b / √(b^t Σ b (1 + (2n)^{-1})) ),   (17)

where θ = (Σ, μ).

Nearly everything that was done in Section 4 can now be repeated here. In particular, the decomposition (8) of Σ will hold and will have the same interpretation. Under a rotation determined by the orthogonal matrix Q we have (Σ, μ) → (Q^t Σ Q, Q^t μ) and (x, b) → (Q^t x, Q^t b). The loss function L is invariant under the rotation group, and the group is non-transitive in the parameter space {(Σ, μ)}, with orbit index given by the ordered set of strata (their relative orientation, their dimensions and the corresponding eigenvalues λ_k) and by the norms of the projected components γ_k = ||P_k μ||. This is proved in the same way as Corollary 2.

As a function of b, the loss function (17) takes its minimal value at all vectors of the form κΣ^{-1}μ with κ > 0. A general feature is that L(θ, b) is independent of the norm of b. Similarly, there is an invariance of L as a function of θ under scale transformations (Σ, μ) → (a²Σ, aμ). This implies that the estimation problem as such, regarding b as some estimator, is invariant under a larger group than the rotation group. A scale change in b does not change the classification rule, however. In this section we will therefore stick to the rotation group, which also will help illuminate the connections to the regression problem.

In general, consider a loss function L of the form f(b^t μ μ^t b / b^t Σ b), where f is a decreasing function. With b as an estimator, this gives formally an estimation problem which is invariant under the rotation group. In this general problem we cannot give an explicit formula for the Pitman type estimator in the way we did in Theorem 2, but the estimator can be found in principle by minimizing, over b,

∫ f(b^t Q^t μ μ^t Q b / b^t Q^t Σ Q b) l(Q) dν(Q),

where

l(Q) = exp{ −½ tr[ (Q^t Σ Q)^{-1} ( Σ_i (x_{1i} − Q^t μ)(x_{1i} − Q^t μ)^t + Σ_i (x_{2i} + Q^t μ)(x_{2i} + Q^t μ)^t ) ] },

and where again dν(·) is Haar measure for the rotation group, the elements of this group being identified with orthogonal matrices Q. Assuming that f is differentiable (which it is in the concrete application we have in mind), we find that the equation from which (the direction of) b can be determined takes the form

∫ f'(r(Q)) (b^t Q^t Σ Q b)^{-1} [ Q^t μ μ^t Q − r(Q) Q^t Σ Q ] b · l(Q) dν(Q) = 0,   (19)

with

r(Q) = b^t Q^t μ μ^t Q b / (b^t Q^t Σ Q b).

Similarly to the proof of Proposition 2, we have that every parameter dependence is through Q^t Σ Q = Σ_i λ_i (Q^t e_i)(Q^t e_i)^t and Q^t μ = Σ_i μ_i (Q^t e_i), where the e_i are the eigenvectors of Σ and the μ_i are the corresponding components of μ. Also, when μ has a vanishing component for some stratum, it follows from equation (19) that the corresponding component of b must vanish, and, by further inspection, the solution cannot depend upon the eigenvalue λ_i of that stratum.
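The claim that the loss is minimized in the direction Σ^{-1}μ can be checked directly. The sketch below uses illustrative values only, and omits the constant factor (1 + (2n)^{-1}) since it does not affect the comparison of directions; it evaluates the misclassification-type loss for the direction Σ^{-1}μ and for a large number of random directions.

```python
import math
import numpy as np

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loss(Sigma, mu, b):
    """Phi(-mu^t b / sqrt(b^t Sigma b)); the scale of b is irrelevant."""
    return Phi(-float(mu @ b) / math.sqrt(float(b @ Sigma @ b)))

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)                 # an arbitrary positive definite Sigma
mu = rng.standard_normal(p)

b_opt = np.linalg.solve(Sigma, mu)          # the claimed minimizer, kappa * Sigma^{-1} mu
best_random = min(loss(Sigma, mu, rng.standard_normal(p)) for _ in range(100000))
print("loss at Sigma^{-1} mu:", loss(Sigma, mu, b_opt))
print("best of 10^5 random b:", best_random)   # never smaller than the value above
```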
Hence the argument of the proof of Proposition 2 and the model reduction argument of Section 6 can be repeated verbatim for the present situation, and we conclude: For the standard discriminant analysis situation, if we decide to do a Pitman type estimation on orbits of the rotation group, which is the optimal choice there, the only model reductions that can affect the discrimination rule are given by:
(1) Fix the number m of relevant components, i.e., the number of strata of Σ with nonzero component γ_j of μ.
(2) Possibly also: Fix some symmetric function of ((λ_1, γ_1), ..., (λ_m, γ_m)), where λ_j now is the eigenvalue corresponding to stratum j.

The most natural general procedure, having no special information about the problem, is to skip step (2) here. In a similar way as in Section 7 above, this leads to the population version of the partial least squares algorithm for the discriminant analysis problem. Again, of course, the sample algorithm implied by the above argument is different from the sample PLS algorithm. A necessary part of this procedure will be to solve equation (19) for each dimension. It is an open question whether such a procedure can be made practical.

10 Discussion.

One way to look upon the present paper is as a companion paper to Helland (2000b), which is a survey paper on the application of group theory to statistics. This author is convinced that theoretical statistics needs more legs to stand on than just probability theory. Symmetry is a very simple and natural concept which is illuminating for many different aspects of statistics, both theoretical and applied. Group theory gives the mathematics needed to make symmetry considerations precise. In the present paper we have concentrated on the rotation group and on the orbit aspect of the group. Several other applications of group theory to statistics are discussed in Helland (2000b).

Another way to look at the paper is as a companion paper to Helland (2000c), an invited survey paper written for chemometricians on what is known from a mathematical statistical point of view about partial least squares regression. Since the present paper is written for mathematical statisticians, it gives an opportunity to be more precise and to prove some definite results on the population version of the algorithm. And not least: it gives an opportunity to state some open problems on which more work is needed, work that probably can only be done by the community of mathematical statisticians. This is a field where very much (in fact useful) work has been done using a lower level of precision, and it is a field full of applications, as can be seen by looking at the two chemometrical journals and several related journals.

The solution proposed in this paper relies heavily on the concept of model reduction. This is a concept for which a proper theoretical statistical theory is lacking. I have a strong feeling that more can be done towards creating at least elements of such a theory. At present model reduction is just one of several tools that are used nearly daily by applied statisticians and other people doing statistical modelling, a tool that has intuitive appeal to many people, and a tool which is completely avoided by theoretical statisticians.
In general, taking such concepts into use, both in theory and in practice, may be one way to escape the danger that statistics, in principle an art that should include every aspect of inductive inference from empirical data, degenerates into a purely deductive science.

The main result of this paper is that model reduction under rotational symmetry, under reasonably simple conditions, comes very close to implying the population partial least squares model, both for the regression and for the classification case. My main hope is that this result will inspire the start of closer contact between scientific communities which so far have had nearly completely separate developments.

References.

Almøy, T. (1996) A simulation study on comparison of prediction methods when only a few components are relevant. Computational Statistics & Data Analysis 21, 87-107.

Berger, J.O. (1980) Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Bickel, P.J. (1984) Parametric robustness: Small biases can be worthwhile. Ann. Statist. 12, 864-879.

Butler, N.A. and M.C. Denham (2000) The peculiar shrinkage properties of partial least squares regression. J. Royal Statist. Soc. B 62, 585-593.

Desai, U.B., P. Debajyoti and R.D. Kirkpatrick (1985) Int. J. Control 42, 821-838.

Frank, I.E. and J.H. Friedman (1993) A statistical view of some chemometrics regression tools. Technometrics 35, 109-135.

Gottfries, J., K. Blennow, A. Wallin and C.G. Gottfries (1995) Diagnosis of dementias using partial least squares discriminant analysis. Dementia 6, 83-88.

Gradshteyn, I.S. and I.M. Ryzhik (1994) Tables of Integrals, Series, and Products. 5th ed. Academic Press, Boston.

Helland, I.S. (1988) On the structure of partial least squares regression. Commun. Statist. - Simula. 17, 581-607.

Helland, I.S. (1990) Partial least squares regression and statistical models. Scand. J. Statist. 17, 97-114.

Helland, I.S. (1992) Maximum likelihood regression on relevant components. J. R. Statist. Soc. B 54, 637-647.

Helland, I.S. (2000a) Model reduction for prediction in regression models. Scand. J. Statist. 27, 1-20.

Helland, I.S. (2000b) Statistical inference under a fixed symmetry group. Submitted.

Helland, I.S. (2000c) Some theoretical aspects of partial least squares regression. Submitted to Chemometrics and Intelligent Laboratory Systems.

Helland, I.S. and T. Almøy (1994) Comparison of prediction methods when only a few components are relevant. J. Amer. Statist. Ass. 89, 583-591.

Hjort, N.L. (1991) Estimation in moderately misspecified models. Statistical Research Report No. 8/1991, Department of Mathematics, University of Oslo.

Lehmann, E.L. and G. Casella (1998) Theory of Point Estimation. Springer, New York.

Martens, H. and T. Næs (1989) Multivariate Calibration. Wiley, New York.

Nelder, J.A. (1965) The analysis of randomized experiments with orthogonal block structure. I. Block structure and the null analysis of variance. Proc. Royal Soc. A 283, 147-162.

Næs, T. and I.S. Helland (1993) Relevant components in regression. Scand. J. Statist. 20, 239-250.

Snedecor, G.W. and W.G. Cochran (1989) Statistical Methods. 8th ed. Iowa State University Press, Ames.

Wold, S., H. Martens and H. Wold (1983) The multivariate calibration problem in chemistry solved by the PLS method. In: Ruhe, A. and B. Kågström (eds.) Proc. Conf. Matrix Pencils. Lecture Notes in Mathematics, Springer-Verlag, Heidelberg, 286-293.

Appendix 1: Proof of Proposition 1.
Assume for a fixed Σ = Σ_{k=1}^q λ_k P_k that Q^t Σ Q = Σ, or equivalently Σ Q = Q Σ, for an orthogonal matrix Q. Thus

Σ_{i=1}^q λ_i (Q P_i − P_i Q) = 0.   (20)

Multiplying (20) from the right by P_j and from the left by P_k, respectively, gives

λ_j Q P_j = Σ Q P_j,   (21)
λ_k P_k Q = P_k Q Σ.   (22)

Next, assume j ≠ k and multiply (21) from the left by P_k and (22) from the right by P_j, giving λ_j P_k Q P_j = P_k Σ Q P_j and λ_k P_k Q P_j = P_k Q Σ P_j, respectively. Since the right-hand sides of the latter equations are equal by assumption, and since λ_j ≠ λ_k, it follows that

P_k Q P_j = 0   (j ≠ k).   (23)

Inserting this into (21) and (22) then gives λ_j Q P_j = λ_j P_j Q P_j and λ_j P_j Q = λ_j P_j Q P_j. Since λ_j ≠ 0, it follows that Q P_j = P_j Q, completing the proof that (a) implies (b). The implication from (b) to (a) is trivial.

Applying (b) to v ∈ V_j shows that Qv satisfies P_j Qv = Qv, so Qv ∈ V_j. Thus the transformation given by Q must conserve all the spaces V_j; hence it can only consist of rotations within each single V_j. Again the opposite implication is trivial.
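As a small sanity check of Proposition 1 (purely illustrative; the dimensions and eigenvalues are arbitrary choices), one can build a rotation as a commuting product of within-stratum rotations and verify that it commutes with every P_j and conserves Σ, while a generic rotation does not:

```python
import numpy as np

rng = np.random.default_rng(2)

def rotation(dim, rng):
    """A random rotation matrix of the given dimension (determinant +1)."""
    Q, R = np.linalg.qr(rng.standard_normal((dim, dim)))
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

# Sigma with two strata: V_1 of dimension 2 (lambda = 3) and V_2 of dimension 3 (lambda = 1).
dims, lams = [2, 3], [3.0, 1.0]
p = sum(dims)
Sigma = np.diag(np.repeat(lams, dims))
P = [np.diag(np.repeat([1.0 if j == k else 0.0 for j in range(len(dims))], dims))
     for k in range(len(dims))]

# A commuting product of rotations acting within each stratum (block diagonal).
Q_strata = np.zeros((p, p))
Q_strata[:2, :2], Q_strata[2:, 2:] = rotation(2, rng), rotation(3, rng)
Q_generic = rotation(p, rng)

for name, Q in [("within-strata Q", Q_strata), ("generic Q", Q_generic)]:
    conserves = np.allclose(Q.T @ Sigma @ Q, Sigma)
    commutes = all(np.allclose(Q @ Pk, Pk @ Q) for Pk in P)
    print(f"{name}: Q^t Sigma Q = Sigma? {conserves}; Q P_j = P_j Q for all j? {commutes}")
```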