Reduction of regression models under symmetry.*
Inge S. Helland†
Abstract
For collinear data, nearly all regression methods that have been proposed are
equivariant under the rotation group in the x-space. It is argued that the
regression parameter along orbits of the rotation group can in principle always
be estimated in an optimal way as a Pitman type estimator. On the other hand,
it is argued that it may pay in general to reduce the parameter space of a
statistical model when this space is high-dimensional. It follows that any
reduction in the regression model must then take place via the orbit index of
the rotation group. Further information can be found using the form of the loss
function. This is used to discuss the choice of regression model and thereby the
choice of regression method. The solution which seems to emerge from this is
closely related to the population version of the chemometricians' partial
least squares regression. Estimation under the reduced model is briefly
discussed, as is model reduction in the corresponding classification problem.
KEY WORDS: Collinearity; Model reduction; Multiple regression;
Orbit; Partial Least Squares Regression; Pitman estimator; Rotation;
Symmetry.
*Invited paper to be published in Viana, M. and D. Richards [Eds.] Algebraic Methods
in Statistics. Contemporary Mathematics Series of the American Mathematical Society.
†Department of Mathematics, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo,
Norway. E-mail: [email protected]
1 Introduction.
Collinear data in regression and in multivariate fields like classification constitute a
major practical problem, and also a problem area where there exists a large
number of seemingly unrelated methods, many of them derived by ad hoc
arguments. The present paper is an attempt to introduce some general theory
into this area.
Our point of departure is a rather heuristic statement, which will be made
more concrete in the next sections: When the number of parameters in a
statistical model is relatively large and the data set not that big, it is often sound
advice to reduce the model. This can of course be done in very many
ways, but in the present paper we consider a situation where we are given some
help in the model reduction process:
Assume in general that there is a natural symmetry group attached to
the model. Then the model parameter can be divided into an orbit index
(maximal invariant under the group) and a parameter along the orbit. This
last parameter will be invariantly measurable, and it can be estimated in a
rather satisfying optimal way, at least in principle, by finding the minimum
risk equivariant estimator along the orbit.
Hence the potential for gaining
anything via model reduction is limited to the orbit index. The reduced model
should also be invariant under the chosen group. This can again be achieved
by making the model reduction through reducing the orbit index parameters.
Additional information is found from the specific form of the loss function.
A natural group to look at in many multivariate situations is the rotation
group. In a regression context it is often natural to consider estimators that are
equivariant under the group of rotations in the x-space. After a brief example
to motivate the use of model reduction in regression as a general procedure, we
discuss the characterization of the orbits of the rotation group in the parameter
space and the degree of non-exactness of the same group. Thereafter this is used
to discuss model reduction and its implications for the construction of regression
methods.
2 Model reduction in simple regression.
Look at the elementary linear regression model $Y_i = \alpha + \beta x_i + \epsilon_i$, where the error
terms $\epsilon_i$ are iid $N(0, \sigma^2)$, and where we for simplicity assume that the values
$x_1, \ldots, x_n$ of the explanatory variable satisfy $\sum x_i = 0$. A straightforward
calculation then shows that the mean squared prediction error at the fixed
point $x_0$ under the nonrandom explanatory variable model equals
$$E(y_0 - \hat{y}_0)^2 = \sigma^2\Big(1 + \frac{1}{n} + \frac{x_0^2}{\sum x_i^2}\Big). \qquad (1)$$
We will start by illustrating on this simple example our first main point: In
many situations it will pay to reduce the statistical model. In fact, this is very
often done in an intuitive way by applied statisticians and by users of statistics.
In my view, one of the most important open areas of theoretical statistics is
to develop a sufficiently general theory of model reduction. Roughly, model
reduction may be appropriate if the number of parameters in the original model
is relatively large and the number n of data points is limited. This should
be contrasted to standard statistical theory which concentrates on the ideal
situation where n tends to infinity and the number of parameters is fixed. Like
all tools, model reduction can be misused, but this should not prevent us from
trying to investigate when such a reduction may be beneficial.
In the present simple situation, we have a small number of parameters, but
even here model reduction can be considered. The strength of the example is
that an explicit solution is easy to find. The most natural reduced model here
will be one without slope: $\beta = 0$, leading to the prediction error implied by
using the reduced model when data are generated by the full model:
$$E(y_0 - \bar{y})^2 = \sigma^2\Big(1 + \frac{1}{n}\Big) + \beta^2 x_0^2. \qquad (2)$$
Comparison with (1) shows that in terms of prediction error, model reduction
pays (for all non-zero $x_0$) iff
$$\beta^2 \sum x_i^2 < \sigma^2. \qquad (3)$$
This condition can also be written as $|t| < 1$, where $t$ is the 'theoretical
t-value' $\beta/\mathrm{std}(\hat{\beta})$ with $\mathrm{std}(\hat{\beta}) = \sigma/\sqrt{\sum x_i^2}$. In this form it can be shown that the
condition can also be generalized to the question of deleting a single variable
from a multiple regression model, a fact that has also been referred to in applied
statistics books like Snedecor and Cochran (1989).
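As a concrete check of this criterion, the following small simulation compares the empirical prediction errors of the full and the reduced model with the expressions (1) and (2) above (a minimal sketch in Python; the design points and parameter values are illustrative choices, not taken from the paper):

```python
import numpy as np

# Simple regression: compare full-model and reduced-model prediction at x0.
rng = np.random.default_rng(0)
n, sigma, alpha, beta, x0 = 10, 1.0, 2.0, 0.2, 1.5
x = np.linspace(-2.0, 2.0, n)                  # fixed design with sum(x) = 0
reps = 100_000
eps = rng.normal(0.0, sigma, (reps, n))
y = alpha + beta * x + eps                     # reps independent data sets
ybar = y.mean(axis=1)
bhat = (y - ybar[:, None]) @ x / np.sum(x**2)  # LS slope (x is already centered)
y0 = alpha + beta * x0 + rng.normal(0.0, sigma, reps)  # future observation at x0
mse_full = np.mean((y0 - (ybar + bhat * x0))**2)       # full-model prediction
mse_red = np.mean((y0 - ybar)**2)                      # reduced model, beta = 0
print(mse_full, sigma**2 * (1 + 1/n + x0**2 / np.sum(x**2)))  # compare with (1)
print(mse_red, sigma**2 * (1 + 1/n) + (beta * x0)**2)         # compare with (2)
print("reduction pays:", beta**2 * np.sum(x**2) < sigma**2)   # criterion (3)
```

With these values the theoretical t-value satisfies $|t| < 1$, so the intercept-only model predicts better at $x_0$ even though the data are generated with a non-zero slope.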
In this paper we shall mainly concentrate on the random x regression, where
the assumption on the error terms is that the $\epsilon_i$, given all the x-variables, are iid
$N(0, \sigma^2)$, and we typically may assume that $x_1, \ldots, x_n$ are iid $N(0, \sigma_x^2)$. Thus
the assumption $\sum x_i = 0$ is replaced by $E(x_i) = 0$. Then the new prediction
error is found by taking the expectation over the x-variables in (1) and (2),
leading to the following criterion for model reduction:
(4)
This illustrates explicitly the statement made earlier that model reduction may
be beneficial when the number of data points is small or moderate. Unfortunately, the criterion depends upon unknown parameters - a general problem in
this area.
In this paper we will concentrate on the act of reducing a given model; the
estimation process under the reduced model will not be fully discussed. In
Bickel (1984) a general discussion can be found on the problem of estimating
under a simple model when wanting to guard against deviations towards a
larger model. A very general large sample discussion of model reduction using
likelihood theory can be found in Hjort (1991). A number of specific examples
of model reduction can also be found in these two papers and in references
given there, while other specific cases are discussed elsewhere. For instance,
a treatment of model reduction in Kalman filter models is given in Desai et
al (1985). The idea of using rotation symmetry to find the best reduction of
linear models in the way advocated in the present paper seems to be new.
3 The multiple random x regression model under rotational symmetry.
Consider now in general a p-dimensional x-distribution and the corresponding (p+1)-dimensional joint (x, y)-distribution with expectation $(\mu_x^t, \mu_y)^t$ and
covariance matrix
$$\begin{pmatrix} \Sigma_{xx} & \sigma_{xy} \\ \sigma_{xy}^t & \sigma_y^2 \end{pmatrix}. \qquad (5)$$
From now on we will usually replace these covariance parameters by the equivalent parameter set $\theta = (\Sigma, \beta, \sigma^2)$, where $\Sigma = \Sigma_{xx}$ and $\beta = \Sigma_{xx}^{-1}\sigma_{xy}$. The
parameter of interest is $\beta$, and we will be interested in prediction problems,
assuming a linear conditional expectation $E(y|x) = \alpha + \beta^t x$ (this includes the
multinormal case, and, more generally, the case of elliptical distributions; see
references in Helland, 2000a). We also assume a linear predictor $\hat{y}_x = \hat{\alpha} + \hat{\beta}^t x$.
From this, considering the expectation of $(\hat{y}_x - y)^2$ over future x's and using
$\hat{\alpha} = \bar{y} - \hat{\beta}^t\bar{x}$, one can see that a natural loss function (cp. also Theorem 1 in
Helland and Almøy, 1994) is
$$L(\theta; \hat{\beta}) = (\hat{\beta} - \beta)^t \Sigma (\hat{\beta} - \beta). \qquad (6)$$
The data set from which estimation shall be made consists of n independent
observations from this model, summarized in the usual way as (X, y). When
n is large compared to p, nobody would try to challenge the ordinary least
squares estimate $\hat{\beta}_{LS} = (X^tX)^{-1}X^ty$, where X is the centered X-matrix. But
the difficult situations arise when p is relatively large, even of the same order
as n, or more generally, if $X^tX$ has some extremely small eigenvalues.
Most of the solutions that have been proposed for prediction in regression models: principal component regression, latent root regression, ridge regression, partial least squares regression, continuum regression and so on, are
equivariant under rotation in the x-space of the above model. Therefore, a
natural task might be to try to find the best or nearly best estimator among
these equivariant ones. For completeness we repeat the necessary definitions;
see for instance Lehmann and Casella (1998) for further discussion.
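As a quick numerical illustration of this equivariance, the sketch below checks that ridge regression, one of the methods listed above, satisfies $\hat{\beta}(XQ, y) = Q^t\hat{\beta}(X, y)$ (a minimal Python sketch; the penalty value and the simulated data are arbitrary illustrative choices):

```python
import numpy as np

def ridge(X, y, lam):
    """Centered ridge regression estimate of the regression vector."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(1)
n, p = 30, 6
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
Q, R = np.linalg.qr(rng.normal(size=(p, p)))  # orthogonal matrix
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]                        # force determinant +1 (a rotation)
b = ridge(X, y, lam=2.0)
b_rot = ridge(X @ Q, y, lam=2.0)
print(np.allclose(b_rot, Q.T @ b))            # equivariance: prints True
```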
The rotation group in question has group elements g which can be identified
with the orthogonal p x p matrices Q with determinant +1. In the sample space,
the group G is given by the transformations $(X, y) \to (XQ, y)$, which induces
the group $\bar{G}$ acting on the parameter space through the transformations
$$(\Sigma, \beta, \sigma^2) \to (Q^t\Sigma Q, Q^t\beta, \sigma^2). \qquad (7)$$
The parametric function $\beta = \beta(\theta)$ is trivially seen to be invariantly estimable
(also called permissible; see Helland, 2000b): $\beta_1 = \beta_2$ implies $Q^t\beta_1 = Q^t\beta_2$
for all Q.
An estimator $\hat{\beta}$ is equivariant if we also have $\hat{\beta} \to Q^t\hat{\beta}$ when
$(X, y) \to (XQ, y)$. For equivariant estimators the loss (6) will be invariant.
Since the parameter space is closed under the transformations in $\bar{G}$, we have
what Lehmann and Casella (1998) call an invariant estimation problem.
A difficulty, however, is that the group here is not transitive on the parameter space. For transitive groups the risk (expected loss) will be a constant
function of the parameter, which is a strong indication that the problem of finding an equivariant estimator that minimizes the risk uniformly has a unique
solution. Such best equivariant estimators are indeed found quite generally as
the Pitman estimator and its generalizations to non-location groups.
In the present case, however, such uniqueness can only be expected when
estimating parameters on the orbits of $\bar{G}$. When estimating orbit indices (maximally invariant parameters under $\bar{G}$), other methods must be used. It is only
with respect to this last part of the estimation that we can expect any gain
from trying to reduce the parameter space when p is large.
4 Non-exactness, orbits and maximal invariants in the parameter space.
We start by decomposing the covariance matrix $\Sigma$, that is, giving the general
form that such a positive definite matrix might have, i.e.,
$$\Sigma = \sum_{k=1}^{q} \lambda_k P_k, \qquad (8)$$
where the $P_k$ are projection matrices upon orthogonal eigenvector spaces $V_k$
(which we will call strata in analogy with the use of this term in Nelder, 1965).
Here all the $\lambda_k$ are positive and different (otherwise strata could have been
combined). Without loss of generality we can take the $\lambda_k$'s in decreasing order.
Obviously, $q \le p$.
We will assume that $\Sigma$ has full rank p, which is equivalent to
$$\sum_{k=1}^{q} \dim(V_k) = \sum_{k=1}^{q} \mathrm{tr}(P_k) = p,$$
or again to the requirement that the direct sum of the spaces $V_k$ equals the full
p-dimensional Euclidean space $R^p$.
The first question we ask is which transformations in the rotation group
conserve the matrix $\Sigma$.
Proposition 1.
The following are equivalent:
(a) $Q^t\Sigma Q = \Sigma$.
(b) $QP_j = P_jQ$ for $j = 1, \ldots, q$.
(c) Q can be written as the commuting product of q rotations $Q_k$, where $Q_k$ is a rotation only within one stratum $V_k$.
The proof of this is given in Appendix 1.
The next question is simpler: Which transformations conserve the regression vector $\beta$? From $Q^t\beta = \beta$ it follows that $\beta$ is an eigenvector of Q with
eigenvalue 1, or, as a more useful characterization: Q is a rotation around the axis
determined by $\beta$. Combining these two results, we get:
Corollary 1.
The transformation Q conserves $(\Sigma, \beta)$ if and only if Q is the commuting
product of q rotations $Q_k$, each acting on a single stratum $V_k$, such that for
each stratum $V_k$ upon which $\beta$ has a non-zero component $P_k\beta$ we have that $Q_k$
is a rotation around the axis given by $P_k\beta$.
Since $\sigma^2$ is unaffected by the transformations, this corollary characterizes
the transformations that conserve $\theta = (\Sigma, \beta, \sigma^2)$, i.e. it shows the degree of
non-exactness of the rotation group in the parameter space.
The following theorem gives one way of characterizing the orbits of the
same group in the parameter space. We omit the proof, since it uses the same
techniques and is very similar to the proof of Proposition 1, now looking at the
equation $Q^t\Sigma_1 Q = \Sigma_2$ instead of the previous $Q^t\Sigma Q = \Sigma$.
Theorem 1.
Two parameter values $\theta_1 = (\Sigma_1, \beta_1, \sigma^2_1)$ and $\theta_2 = (\Sigma_2, \beta_2, \sigma^2_2)$ with $\Sigma_r = \sum_{k=1}^{q_r} \lambda_k^{(r)} P_k^{(r)}$
are on the same orbit of the rotation group if and only if:
(1) $q_1 = q_2 = q$.
(2) $\lambda_k^{(1)} = \lambda_k^{(2)}$, $(k = 1, 2, \ldots, q)$.
(3) There is a rotation matrix Q such that $P_k^{(2)} = Q^t P_k^{(1)} Q$, $(k = 1, \ldots, q)$. In particular, the set of dimensions of $\{V_k^{(2)}\}$ must match the dimensions of $\{V_k^{(1)}\}$.
(4) The same Q gives $\beta_2 = Q^t\beta_1$, or equivalently $P_k^{(2)}\beta_2 = Q^t(P_k^{(1)}\beta_1)$, $(k = 1, \ldots, q)$.
(5) $\sigma^2_2 = \sigma^2_1$.
From this result we also get the maximal invariant of the parameter group,
since this always equals the index of the orbits.
Corollary 2.
The orbit index of the parameter group is given by
(i) For the ordered set of strata: their relative orientation, their dimensions and the corresponding eigenvalues $\lambda_k$.
(ii) The norms $\gamma_k = \|P_k\beta\|$ of the projected regression coefficients.
(iii) $\sigma^2$.
Proof.
From Theorem 1 it is only left to prove that it is enough to characterize the
$\beta$-dependence of the orbits by the norms $\gamma_i$ of each stratum component. This
can be seen from Proposition 1. By that result, the matrix Q of Theorem 1 (3)
has a non-uniqueness corresponding to any set of rotations within some or all
of the strata. Using such a rotation within stratum i, we see that the direction
of the stratum component of $\beta$ within each stratum is arbitrary, and only its
norm is fixed.
5 Optimizing on orbits.
Assume now data (X, y) from n independent units. In this section we will also
assume multinormality, at least as a useful approximation. The estimation
of the expectations is then trivial, and by sufficiency the estimation of the
covariance structure can be assumed to depend only on
$$S = S_{xx} = n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^t, \quad s = s_{xy} = n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \quad s_y^2 = n^{-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$
Now concentrate on a fixed orbit of the rotation group in the parameter
space. This group is compact, and thus has a unique Haar measure, which can
be taken as a prior probability. Then using well known results generalizing
the Pitman estimator (see, for instance, Helland, 2000b; the essential result is
also given in Berger, 1980) and using the loss function (6), the minimum risk
estimator, given the orbit, will be
$$\hat{\beta}^0 = (E_{post}\Sigma)^{-1}E_{post}(\Sigma\beta), \qquad (9)$$
where $E_{post}$ is the corresponding a posteriori expectation. The minimal risk
(given the orbit) will then be
$$R_{min} = \beta^t\Sigma\beta - E\big[E_{post}(\Sigma\beta)^t(E_{post}\Sigma)^{-1}E_{post}(\Sigma\beta)\big], \qquad (10)$$
because $E_{post}(\beta^t\Sigma\beta) = \beta^t\Sigma\beta$, since $\beta^t\Sigma\beta = \sum_k \lambda_k\gamma_k^2$ is a constant of the orbit.
As usual, the posterior density is proportional to the product of the likelihood and the prior. Using a known form for the multinormal likelihood, found
by first taking the likelihood of X and then multiplying by the conditional
likelihood of y given X, we find:
Theorem 2.
Fix an orbit of the rotation group in the above situation. Then the minimum
risk equivariant estimator for $\beta$, given the orbit, is given by $\hat{\beta}^0 = D^{-1}N$, where
$$N = \int Q^t\Sigma\beta \, \exp\Big[\frac{n}{2\sigma^2}\big(2\beta^tQs - \beta^tQSQ^t\beta\big) - \frac{n}{2}\mathrm{tr}(Q^t\Sigma^{-1}QS)\Big]\, d\gamma(Q),$$
$$D = \int Q^t\Sigma Q \, \exp\Big[\frac{n}{2\sigma^2}\big(2\beta^tQs - \beta^tQSQ^t\beta\big) - \frac{n}{2}\mathrm{tr}(Q^t\Sigma^{-1}QS)\Big]\, d\gamma(Q), \qquad (11)$$
where $(\Sigma, \beta)$ correspond to a fixed point on the orbit and $d\gamma(Q)$ is Haar measure
for the rotation group with group elements identified with orthogonal matrices
Q. Also, $\sigma^2 = \sigma_y^2 - \beta^t\Sigma\beta = \sigma_y^2 - \sum_k \lambda_k\gamma_k^2$ is the residual variance, a constant
of the orbit.
To calculate explicitly the integrals of (11) seems to be impossible except in
some special cases. For instance, when $\Sigma = S = I$, it follows from Gradshteyn
and Ryzhik (1994), formula 4.641.2, together with some simple derivations, cp.
also Example 4 in Helland (2000b), that
$$\hat{\beta}^0 = \frac{I_{p/2}\big(\frac{n}{\sigma^2}|\beta|\,|s|\big)}{I_{p/2-1}\big(\frac{n}{\sigma^2}|\beta|\,|s|\big)}\,\frac{|\beta|}{|s|}\, s,$$
where $I_q(\cdot)$ is the modified Bessel function of the second kind.
This is unfortunate, for if it was possible to give (11) as an explicit formula,
this formula could be of definite practical importance: For a fixed orbit, the
solution above is the best equivariant estimator. Practically all known regression methods for collinear data are equivariant under the rotation group. This
means that, given that some definite established method can be put into the
form of first estimating the orbit index of the rotation group, and given that
we estimate this orbit index in the same way as in the method in question, the
estimator derived from (11) will dominate this method.
In fact, more than this can be said. The estimator (11) does not depend
on the whole orbit index as given in Corollary 2.
Proposition 2.
The parameter dependence of (11) is given by
(i) $\sigma^2$.
(ii) The number m of strata such that $\gamma_k \ne 0$.
(iii) For given $\sigma^2$ and m, a vector function $\hat{\beta}^0((\lambda_1, \gamma_1), \ldots, (\lambda_m, \gamma_m);\ \mathrm{data})$ which for all data-values is symmetric under any permutation of the m arguments.
Proof.
Fix $\sigma^2$, and look first at the 'denominator' D in $\hat{\beta}^0$. It depends upon the
following quantities:
$$Q^t\Sigma Q = \sum_i \lambda_i (Q^te_i)(Q^te_i)^t, \quad Q^t\beta = \sum_i \beta_i (Q^te_i), \quad \mathrm{tr}(Q^t\Sigma^{-1}QS) = \sum_i \lambda_i^{-1}(Q^te_i)^t S (Q^te_i), \qquad (12)$$
where $\{e_i\}$ is a set of eigenvectors for $\Sigma$ and $\beta_i = \beta^te_i$. When the set $\{Q^te_i, i = 1, \ldots, p\}$ is given a uniform distribution under rotation, the expectation of
any such function will depend upon $\{(\lambda_i, \beta_i), i = 1, \ldots, p\}$ in a symmetric
way. One particular such symmetric function is the number q of strata. So
assume, say, that $V_1 = \mathrm{span}(e_1, \ldots, e_l)$ is a stratum with constant $\lambda_1$.
Then, in particular, the function D should be symmetric under any rotation
within this stratum. This means that it can be regarded as a function of
$((\lambda_1, \gamma_1), (\lambda_1, 0), \ldots, (\lambda_1, 0))$ together with the rest of the parameters, or equivalently a function of $(\lambda_1, \gamma_1)$ and these parameters. Hence an alternative, and
more convenient, way to state the result above is that, given $\sigma^2$ and q, we have
that D is a symmetric function of $\{(\lambda_k, \gamma_k), k = 1, \ldots, q\}$.
A similar analysis can be made for the 'numerator' N, but looking at this
part, we are also led to a refinement. In the same way as the stratum number
q, the function m = (number of strata such that $\gamma_k \ne 0$) is also a symmetric
function of $\{(\lambda_i, \beta_i), i = 1, \ldots, p\}$ (obviously, $m \le q$). Now take a closer look
at (11). Without loss of generality, we can let $\Sigma$ in this formula be diagonal
with the eigenvalues on the diagonal. Then we see that any contribution from the
strata with $\gamma_k = 0$ will vanish. Thus we can take $\hat{\beta}^0$ as a symmetric function
of the m bivariate orbit parameters, as stated.
Corollary 3.
The minimal risk given the orbit, $R_{min}$, depends upon $\sigma^2$ and m, and given
these parameters it is a function of $((\lambda_1, \gamma_1), \ldots, (\lambda_m, \gamma_m))$ which is symmetric
under any permutation of the m arguments.
6 Model reduction.
Consider in general a statistical model with parameter $\theta \in \Theta$, an open set in
some topological space, and let G be a non-transitive group on this space. Let
$L(\theta, \hat{\theta})$ be the loss function of an invariant estimation problem in this setting.
Fix $\theta_0 \in \Theta$. We are interested in the effect of reducing the statistical model
by restricting $\theta$ to some set $\Gamma \subset \Theta$ such that $\theta_0 \in \Gamma$. From what has been said
earlier, an optimal equivariant estimator can be found quite generally on the
orbit of G containing $\theta_0$. Thus we can without loss of generality let $\Gamma$ contain
this orbit. This gives the minimal useful choice of $\Gamma$. In practice, $\Gamma$ should
be chosen as a set $\Phi$ in the orbit index space of G taken together with the
corresponding orbits. Our task is to say something about the choice of $\Phi$. Let
us specialize again to the regression setting.
Proposition 3.
Assume the multiple regression model, and let $\bar{G}$ be the group of rotations
in the parameter space. Let $\beta$ be estimated as in Theorem 2 for each fixed
orbit. Then the only model reductions that can affect the estimate are given by
combinations of some or all of the following possibilities:
- fix the residual variance $\sigma^2$,
- fix the number m,
- fix some specific symmetric function of $((\lambda_1, \gamma_1), \ldots, (\lambda_m, \gamma_m))$.
This follows immediately from Proposition 2.
Looking at these three possibilities, the first one seems not very attractive
from an applied point of view; it will hardly lead to any decrease in the prediction error. Also, the last type of model reduction may be an option when one
has particular model information, but it seems very difficult to specify any such
function which can serve as an input to a general regression method. Furthermore, this reduction method only makes sense in combination with the second
method. This leaves one in general with the option of fixing the number m at
some value m < p as definitely the most natural approach to model reduction
from the present point of view.
It is interesting, then, that a practical regression method closely related to
this particular choice has been used by chemometricians for decades now in a
large spectrum of applications. The method is claimed to have great success,
empirically, and it has received a considerable amount of attention. A special issue
of the journal Chemometrics and Intelligent Laboratory Systems devoted to
this particular method is planned to appear shortly.
7 Partial least squares regression in population form.
Partial least squares regression (PLS) was proposed as a method to solve
collinearity problems in multiple regression/calibration by Wold et al. (1983),
and has gained an increasing popularity, especially among chemometricians;
see Martens and Næs (1989), and also the forthcoming special journal issue
mentioned above. The method was proposed as an algorithm; in fact, several
algorithms were available, and two of them are shown to be equivalent in Helland
(1988). In the present context the corresponding population algorithm, treated
in Helland (1990), is of special interest. It can be formulated as follows, taking the (p+1)-dimensional variable (x, y) as the point of departure. Replacing
variables by data vectors and (co)variances by their sample counterparts here
leads to the ordinary PLS algorithm.
1. Initialization: $e_0 = x - \mu_x$, $f_0 = y - \mu_y$.
Next, do for $a = 1, 2, \ldots$:
2. Weights and scores: $w_a = \mathrm{Cov}(e_{a-1}, f_{a-1})$, $t_a = e_{a-1}^t w_a$.
3. Loadings by 'least squares': $p_a = \mathrm{Cov}(e_{a-1}, t_a)/\mathrm{Var}(t_a)$, $q_a = \mathrm{Cov}(f_{a-1}, t_a)/\mathrm{Var}(t_a)$.
4. New residuals: $e_a = e_{a-1} - p_a t_a$, $f_a = f_{a-1} - q_a t_a$.
Stopping at step $a = k$, this leads to the bilinear representation
$$x = \mu_x + p_1 t_1 + \ldots + p_k t_k + e_k, \qquad y = \mu_y + q_1 t_1 + \ldots + q_k t_k + f_k, \qquad (13)$$
where loadings and scores are determined by the algorithm. The corresponding
sample equation is much used in the chemometric literature in the same way
as statisticians use factor analysis, and it is often claimed, and in fact also
demonstrated, that the corresponding plots are useful for interpreting complex
data sets.
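For reference, the sample version of steps 1-4 can be sketched as follows (a minimal Python sketch; proportionality constants in the sample covariances are dropped, which does not change the fitted predictor, and the names are illustrative):

```python
import numpy as np

def pls(X, y, k):
    """Sample analogue of steps 1-4: weights, scores, loadings and deflation."""
    E = X - X.mean(axis=0)                # e_0
    f = y - y.mean()                      # f_0
    W, P, Q, T = [], [], [], []
    for _ in range(k):
        w = E.T @ f                       # weight  ~ Cov(e_{a-1}, f_{a-1})
        t = E @ w                         # score   t_a = e_{a-1}^t w_a
        p = E.T @ t / (t @ t)             # x-loading ~ Cov(e_{a-1}, t_a)/Var(t_a)
        q = f @ t / (t @ t)               # y-loading ~ Cov(f_{a-1}, t_a)/Var(t_a)
        E = E - np.outer(t, p)            # new residuals e_a
        f = f - q * t                     # new residuals f_a
        W.append(w); P.append(p); Q.append(q); T.append(t)
    return np.array(W), np.array(P), np.array(Q), np.array(T)
```

The prediction (14) is then obtained by computing scores for a new x-value with the same weights and deflation steps.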
Rather than interpretation, our main theme in the present paper is prediction at some set of x-values $x_0$. Then the natural procedure is to use the partial
least squares algorithm to construct new scores $t_a^0$ from $x_0$ and, in analogy with
the last equation in (13), take
$$\hat{y}_{k,PLS} = \mu_y + q_1 t_1^0 + \ldots + q_k t_k^0. \qquad (14)$$
Proposition 4.
The above predictor can be written
$$\hat{y}_{k,PLS} = \mu_y + \beta_{k,PLS}^t (x_0 - \mu_x), \qquad (15)$$
with
$$\beta_{k,PLS} = W_k (W_k^t \Sigma W_k)^{-1} W_k^t \sigma, \qquad (16)$$
where $\sigma$ here is $\sigma_{xy}$ and $W_k = [w_1, \ldots, w_k]$.
Proof.
The corresponding sample case is proved in Helland (1988), Theorem 3.1.
The formula (16) indicates that the space $S_k$ spanned by the columns of
$W_k$ is of some interest. Further progress can be made if we note that this space
can also be spanned by a so-called Krylov sequence.
Proposition 5.
As long as $w_k$ is nonzero, an alternative basis for $S_k$ is given by the vectors
$\sigma, \Sigma\sigma, \ldots, \Sigma^{k-1}\sigma$.
Proof.
Compare the proof of Proposition 3.1 in Helland (1988).
Now we can start to make some algebraic conclusions from these results
(for more details, see Helland, 1990): First, if there should be some value of
k such that the corresponding weight vanishes, i.e., $w_k = 0$, then the PLS
algorithm will be terminated at this value (and only at such values). Secondly,
by Proposition 5, the first value $k = m$ such that $w_k = 0$ for $k > m$ is
characterized by the property that the k vectors of the Krylov sequence will
span the same m-dimensional space $S_m$ also when $k > m$. If this happens,
we will say that the population PLS algorithm stops at step m. Of course,
the algorithm will always stop at $m = p$; the interesting case is when it stops
before that. Note that by Proposition 4, the regression vector $\beta_{k,PLS}$ will always
belong to the space $S_m$.
From these observations there is in fact a short step to the main result of
this section. This result gives a fairly precise interpretation of the result of
the population PLS algorithm in the light of model reduction and rotational
symmetry.
Theorem 3.
The population PLS algorithm stops at step m if and only if the number of
strata in the population model with non-zero regression coefficient is equal to
m. The resulting PLS regression vector $\beta_{m,PLS}$ will then equal $\beta = \Sigma^{-1}\sigma$.
Main idea of proof.
From (8) a linear dependency in the Krylov sequence is given by
$$\sum_{j=0}^{k} c_j \Sigma^j \sigma = \sum_h \Big(\sum_{j=0}^{k} c_j \lambda_h^j\Big) P_h \sigma = 0.$$
This can happen only if $k \ge m$, where now m is the number of strata with $P_h\sigma$
nonzero. Note that for $k = m$ there is a Vandermonde determinant among the
different $\lambda_h$ involved in this argument.
Thus with this natural specification of the population PLS algorithm, there
is a close connection to the optimality results discussed in Section 5 and Section
6 above. Briefly, the termination of this algorithm at m( < p) steps is equivalent to a reduced model of the type which results in a natural way from the
orbit behavior of the optimal equivariant regressor under rotation invariance.
And the regression vector that results from the corresponding population PLS
algorithm is what it should be.
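A small numerical check of Proposition 5 and Theorem 3 can be made by computing $\beta_{k,PLS}$ from the Krylov basis, for an assumed diagonal $\Sigma$ in which only two strata carry a nonzero component of $\sigma$ (an illustrative sketch, not an example from the paper):

```python
import numpy as np

def beta_pls_population(Sigma, sigma, k):
    """beta_{k,PLS} of (16), computed from the Krylov basis of Proposition 5."""
    W = np.column_stack([np.linalg.matrix_power(Sigma, j) @ sigma for j in range(k)])
    return W @ np.linalg.solve(W.T @ Sigma @ W, W.T @ sigma)

# Three strata (eigenvalues 3, 1 and 5); sigma has a nonzero component in two of them.
Sigma = np.diag([3.0, 3.0, 1.0, 5.0])
sigma = np.array([1.0, 0.5, 2.0, 0.0])
for k in (1, 2):
    print(k, beta_pls_population(Sigma, sigma, k))
print("target", np.linalg.solve(Sigma, sigma))   # beta = Sigma^{-1} sigma
```

Here $m = 2$, and $\beta_{2,PLS}$ already coincides with $\beta = \Sigma^{-1}\sigma$, in agreement with Theorem 3.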
Now, of course, what is used in practice by chemometricians is the sample
PLS algorithm, and nothing can be said about this from the above discussion,
except a vague indication that it will be sensible if n is not too small and a good
stopping rule is used. The stopping rule used in practice for PLS regression
is based on cross validation using prediction performance, and has only an
indirect link to the population stopping criterion mentioned above: If $w_k = 0$,
then one should expect that the prediction at step k - 1 is so good that this
will affect the cross validation.
The opinions on the sample PLS algorithm are still rather varied. Many
practitioners are satisfied with the fact that it gives some automatic procedure
which empirically seems to give good results in many cases, and where they
also have a possibility to combine prediction with some latent variable analysis. Chemometricians have generalized the algorithm to cases with several
y-variables and even to cases with many blocks of variables. On the other
hand, sceptical remarks about the procedure have been raised by statisticians, for
instance in Frank and Friedman (1993). Butler and Denham (2000) have just
shown that the sample PLS procedure, while shrinking globally, has so poor
shrinking properties at each factor that it seems impossible that it can satisfy
any reasonable optimality property.
Here is where the situation stands today. The PLS algorithm will continue
to be used by those who like it, even though it has no optimality properties.
What mathematical statisticians can hope for is first to make better evaluations
of the method. But a more interesting challenge will be if we can improve the
method by using our common way of thinking, namely in terms of optimality.
The purpose of the present paper has been to try to go some way in that
direction.
By what has been discussed up to now, we have a population procedure
which is equivalent to a model reduction with reasonably well defined good
properties related to rotation invariance. We have integral formulae for the
optimal estimated regression vector given the orbit indices. In addition to
finding practical ways of calculating that integral, where MCMC and related
techniques probably may help, the remaining challenge is to find good estimates of the remaining orbit index parameters.
8 Estimation of orbit parameters.
In the formula (11), the point $(\Sigma, \beta, \sigma^2)$ is a fixed point on some orbit of the
rotation group in the parameter space. But by Proposition 3 we may without
loss of generality let $\Sigma$ be diagonal with first diagonal elements $\lambda_1, \ldots, \lambda_m$ here,
and then take $\beta = (\gamma_1, \ldots, \gamma_m, 0, 0, \ldots, 0)^t$. In addition to its data dependence,
the solution will then depend upon m, these parameters and $\sigma^2$. This means
that we are left with $2m + 1$ parameters to estimate. This is in general a
very small number compared to the $(p+1)(p+2)/2$ parameters in the original
covariance structure.
Note, however, that any value of the parameter $(\Sigma, \beta, \sigma^2)$ which belongs
to the same orbit (or more generally, to orbits which can be linked to that
one through the parameter dependence corresponding to the one described in
Proposition 2) will give the same estimate (11).
Taking maximum likelihood under multinormality as a natural approach to
the estimation of these parameters, the last remark is a great aid together with
the following simple remark: When $\hat{\theta}$ is a unique maximum likelihood estimator
for $\theta$, then, for any $f$ which is constant on a subset $\Gamma$ of the parameter space $\Theta$
and one-to-one on $\Theta\setminus\Gamma$, we have that $f(\hat{\theta})$ is the unique maximum likelihood
estimator for $f(\theta)$.
Thus to find maximum likelihood estimators of the 2m + 1 parameters
described above, it is enough to give maximum likelihood estimators of the
covariance parameters under the hypothesis that the number of strata in $\Sigma$ with
non-zero regression component ('number of relevant components', cp. Næs and
Helland, 1993) is equal to m. This latter problem was discussed in Helland
(1992).
The maximum likelihood solution found there can be formulated as follows:
(a) With $A = S_{xx} - s_y^{-2} s_{xy} s_{xy}^t$ and $B = S_{xx}$, find a $p \times m$ matrix $R$ which minimizes the criterion $G(R)$ derived in Helland (1992).
(b) With $R$ as in (a), and taking $P_R = R(R^tR)^{-1}R^t$, the maximum likelihood estimates under the hypothesis of m relevant components are given by explicit expressions in $P_R$ and the sample moments (see Helland, 1992).
Note that G(R) will be the same here if R is replaced by another matrix
spanning the same space, and similarly for the estimators. Hence what in effect is
estimated is an m-dimensional space, the space of relevant components.
In Helland (1992), an approximate algorithm for doing the minimization
in (a) was proposed, but the prediction method based upon this algorithm
performed rather poorly in simulations done by Almøy (1996). A reasonable
conjecture is that performance will be better when combined with the integral
(11), since the effective number of parameters estimated by the maximum
likelihood part then will be reduced considerably, as discussed above. The
computation needed in order to find these predictions will be rather heavy,
however, a disadvantage which is particularly important if one wants to use
cross validation to determine the number of relevant components, the most
common procedure. An alternative emerging from the likelihood approach
here is to use a likelihood ratio test.
In conclusion, there are many questions concerning estimation and prediction under the reduced regression model which need further investigation.
In addition to questions related to performance under the reduced model itself,
the robustness questions in the spirit of the paper by Bickel (1984) will need
to be addressed.
In addition to addressing optimality problems of the kind discussed above,
it is also of interest to do further investigations on the performance of the
ordinary sample partial least squares algorithm, since this is the algorithm
used by chemometricians today. Asymptotic expansions, as in Helland and
Almøy (1994), are one of several ways of approaching such questions.
9 Classification.
The sample partial least squares algorithm is also used by chemometricians
outside the regression context, for instance when doing classification. A large
number of applications of this can be found in the chemometrical literature,
and the technique has also spread outside the field of chemometrics, for instance
to psychiatry (Gottfries et al, 1995).
It is obvious that techniques of this
kind meet a latent need among researchers with near collinear data, both in
regression problems and in classification problems.
In this Section we will discuss model reduction under rotational symmetry in the classification problem using ordinary discriminant analysis as the
classification method, and point at similarities with and differences from the
regression situation. We will concentrate on the simplest possible case: two-class discriminant analysis with equal priors for the two classes, equal covariance
matrix for the observations, equal cost of misclassification and equal number
n of observations in the two classes. We have in mind a situation where the
number p of classification variables is rather large, also compared to n.
We also assume a multinormal model, the standard point of departure in
discriminant analysis: In sample 1 we have n independent observations which
are $N(\mu_1, \Sigma)$ and have mean $\bar{x}_1$; in sample 2 the n observations are $N(\mu_2, \Sigma)$
and have mean $\bar{x}_2$. The standard way of classifying a new observation x is to
allocate it to class 1 iff
$$[x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)]^t S^{-1}(\bar{x}_1 - \bar{x}_2) > 0,$$
where S is the pooled covariance matrix from the 2n observations.
When p is large or if we have other sources of collinearity, we may want to
replace $S^{-1}(\bar{x}_1 - \bar{x}_2)$ by some other vector b computed from the observations. Thus x is classified to
be of class 1 iff
$$[x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)]^t b > 0,$$
and the probability of misclassification, given the samples, is given by
$$\tfrac{1}{2}\Phi\Big(-\frac{[\mu_1 - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)]^t b}{\sqrt{b^t\Sigma b}}\Big) + \tfrac{1}{2}\Phi\Big(-\frac{[\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) - \mu_2]^t b}{\sqrt{b^t\Sigma b}}\Big),$$
where $\Phi$ is the standard normal cumulative distribution function.
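For concreteness, the classification rule and the standard choice of b can be sketched as follows (a minimal Python sketch with illustrative helper names; the pooled covariance is the usual unbiased version):

```python
import numpy as np

def classify(x, xbar1, xbar2, b):
    """Allocate x to class 1 iff [x - (xbar1 + xbar2)/2]^t b > 0."""
    return 1 if (x - 0.5 * (xbar1 + xbar2)) @ b > 0 else 2

def standard_b(X1, X2):
    """The standard choice b = S^{-1}(xbar1 - xbar2), S the pooled covariance."""
    n1, n2 = len(X1), len(X2)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    return np.linalg.solve(S, X1.mean(axis=0) - X2.mean(axis=0))
```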
We will now simplify notation by assuming $\mu_1 = -\mu_2 = \mu$. Also, it is natural
to assume that $\frac{1}{2}(\bar{x}_1 + \bar{x}_2)$ is independent of b, which is true for the original
choice of b. Integrating over the probability distribution of this variable, using
the fact that $E(\Phi(v - u)) = \Phi(v/\sqrt{1 + \tau^2})$ when $u \sim N(0, \tau^2)$, we arrive at a
natural loss function for the classification problem:
$$L(\theta, b) = \Phi\Big(-\frac{\mu^t b}{\sqrt{b^t\Sigma b\,(1 + \frac{1}{2n})}}\Big), \qquad (17)$$
where $\theta = (\Sigma, \mu)$.
Nearly everything that was done in Section 4 can now be repeated here.
In particular, the decomposition (8) of $\Sigma$ will hold and will have the same
interpretation. Under a rotation determined by the orthogonal matrix Q we
have
$$(\Sigma, \mu) \to (Q^t\Sigma Q, Q^t\mu), \qquad (x, b) \to (Q^tx, Q^tb).$$
The loss function L is invariant under the rotation group, and the group is non-transitive in the parameter space $\{(\Sigma, \mu)\}$ with orbit index given by the ordered
set of strata (their relative orientation, their dimensions and the corresponding
eigenvalues $\lambda_k$) and by the norms of the projected components $\gamma_k = \|P_k\mu\|$.
This is proved in the same way as Corollary 2.
As a function of b, the loss function (17) takes its minimal value at all
vectors of the form $\kappa\Sigma^{-1}\mu$ where $\kappa > 0$. A general feature is that $L(\theta, b)$
is independent of the norm of b. Similarly, there is an invariance of L as a
function of $\theta$ under scale transformations $(\Sigma, \mu) \to (a^2\Sigma, a\mu)$. This implies that
the estimation problem as such, regarding b as some estimator, is invariant
under a larger group than the rotation group. A scale change in b does not
change the classification rule, however. In this Section we will therefore stick
to the rotation group, which also will help illuminate the connections to the
regression problem.
In general, consider a loss function L of the form $f(b^t\mu\mu^tb/b^t\Sigma b)$, where
$f$ is an increasing function. With b as an estimator, this gives formally an
estimation problem which is invariant under the rotation group. In this general
problem we cannot give an explicit formula for the Pitman type estimator in
the way we did in Theorem 2, but the estimator can be found in principle by
minimising
$$\int f\Big(\frac{b^tQ^t\mu\mu^tQb}{b^tQ^t\Sigma Qb}\Big)\, l \; d\gamma(Q), \qquad (18)$$
where
$$l = \exp\Big\{-\frac{1}{2}\mathrm{tr}\Big((Q^t\Sigma Q)^{-1}\Big[\sum_i (x_{1i} - Q^t\mu)(x_{1i} - Q^t\mu)^t + \sum_i (x_{2i} + Q^t\mu)(x_{2i} + Q^t\mu)^t\Big]\Big)\Big\},$$
and where again $d\gamma(\cdot)$ is Haar measure for the rotation group, the elements of
this group being identified with orthogonal matrices Q.
Assuming that $f$ is differentiable (which it is in the concrete application we
have in mind), we find that the equation from which (the direction of) b can
be determined takes the form
$$\int f'(r(Q))\,\frac{Q^t\mu\mu^tQ\,b - r(Q)\,Q^t\Sigma Q\,b}{b^tQ^t\Sigma Qb}\; l \; d\gamma(Q) = 0, \qquad (19)$$
with
$$r(Q) = \frac{b^tQ^t\mu\mu^tQb}{b^tQ^t\Sigma Qb}.$$
Similarly to the proof of Proposition 2 we have that every parameter dependence is through $Q^t\Sigma Q = \sum_i \lambda_i (Q^te_i)(Q^te_i)^t$ and $Q^t\mu = \sum_i \mu_i (Q^te_i)$, where $e_i$
are the eigenvectors of $\Sigma$ and $\mu_i$ are the corresponding components of $\mu$. Also,
when $\mu$ has a vanishing component for some stratum, it follows from equation
(19) that the corresponding component of b must vanish, and by further inspection: the solution cannot depend upon the eigenvalue $\lambda_i$ of that stratum.
Hence the argument of the proof of Proposition 2 and the model reduction argument of Section 6 can be repeated verbatim for the present situation, and
we conclude:
For the standard discriminant analysis situation, if we decide to do a Pitman type estimation on orbits of the rotation group, which is the optimal choice
there, the only model reduction that can affect the discrimination rule is given
by:
(1) Fix the number m of relevant components, i.e., the number of strata of $\Sigma$ with nonzero component $\gamma_i$ of $\mu$.
(2) Possibly also: Fix some symmetric function of $((\lambda_1, \gamma_1), \ldots, (\lambda_m, \gamma_m))$, where $\lambda_j$ now is the eigenvalue corresponding to stratum j.
The most natural general procedure, having no special information about
the problem, is to skip step (2) here. In a similar way as in Section 7 above, this
leads to the population version of the partial least squares algorithm for the
discriminant analysis problem. Again, of course, the sample algorithm implied
by the above argument is different from the sample PLS algorithm. A necessary
part of this procedure will be to solve equation (19) for each dimension. It is
an open question if such a procedure can be made practical.
10 Discussion.
One way to look upon the present paper is that it is a companion paper to
Helland (2000b), which is a survey paper on the application of group theory to
statistics. This author is convinced that theoretical statistics needs more legs to
stand on than just probability theory. Symmetry is a very simple and natural
concept which is illuminating for many different aspects of statistics, both
theoretical and applied. Group theory gives the mathematics needed to make
symmetry considerations precise. In the present paper we have concentrated
on the rotation group and on the orbit aspect of the group. Several other
applications of group theory to statistics are discussed in Helland (2000b).
Another way to look at the paper is that it is a companion paper to Helland (2000c), an invited survey paper written for chemometricians on what is
known from a mathematical statistical point of view on partial least squares
regression. Since the present paper is written for mathematical statisticians, it
gives an opportunity to be more precise and prove some definite results on the
population version of the algorithm. And not least: It gives an opportunity to
state some open problems on which more work is needed, work that probably
can only be done by the community of mathematical statisticians. This is a
field where very much, and in fact useful, work has been done using a lower
level of precision, and it is a field full of applications, which can be seen by
looking at the two chemometrical journals and several related journals.
The solution proposed in this paper relies heavily on the concept of model
reduction. This is a concept for which a proper theoretical statistical theory is
23
lacking. I have a strong feeling that more can be done towards creating at least
elements of such a theory. At present model reduction is just one of several
tools that are used nearly daily by applied statisticians and other people doing
statistical modelling, a tool that has intuitive appeal to many people, and a
tool which is completely avoided by theoretical statisticians. In general, taking
such concepts into use, both in theory and in practice, may be one way
to escape the danger that statistics, in principle an art that should include
every aspect of inductive inference from empirical data, shall degenerate into
a purely deductive science.
The main result of this paper is that model reduction under rotational
symmetry under reasonably simple conditions is very close to implying the
population partial least squares model both for the regression and for the classification case. My main hope is that this result will inspire some start of
a closer contact between scientificcommunities which so far have had nearly
completely separate developments.
References.
Almøy, T. (1996) A simulation study on comparison of prediction methods
when only a few components are relevant. Computational Statistics & Data
Analysis 21, 87-107.
Berger, J.O. (1980) Statistical Decision Theory and Bayesian Analysis,
Springer-Verlag, New York.
Bickel, P.J. (1984) Parametric robustness: Small biases can be worthwhile.
Ann. Statistics 12, 864-879.
Butler, N.A. and M.C. Denham (2000) The peculiar shrinkage properties
of partial least squares regression. J. Royal Statist. Soc. B 62, 585-593.
Desai, U.B., P. Debajyoti and R.D. Kirkpatrick (1985) Int. J. Control 42,
821-838.
Frank, I.E. and J.H. Friedman (1993) A statistical view of some chemometric tools. Technometrics 35, 109-135.
Gottfries, J., K. Blennow, A. Wallin and C.G. Gottfries (1995) Diagnosis
of dementias using partial least squares discriminant analysis. Dementia 6,
83-88.
Gradshteyn, I.S. and I.M. Ryzhik (1994) Tables of Integrals, Series, and
Products. 5th ed. Academic Press, Boston.
Helland, I.S. (1988) On the structure of partial least squares regression.
Commun. Statist.-Simula. 17, 581-607.
Helland, I.S. (1990) Partial least squares regression and statistical models.
Scand. J. Statist. 17, 97-114.
Helland, I.S. (1992) Maximum likelihood regression on relevant components. J. R. Statist. Soc. B 54, 637-647.
Helland, I.S. (2000a) Model reduction for prediction in regression models.
Scand. J. Statist. 27, 1-20.
Helland, I.S. (2000b) Statistical inference under a fixed symmetry group.
Submitted.
Helland, I.S. (2000c) Some theoretical aspects of partial least squares regression. Submitted to Chemometrics and Intelligent Laboratory Systems.
Helland, I.S. and T. Almøy (1994) Comparison of prediction methods when
only a few components are relevant. J. Amer. Statist. Ass. 89, 583-591.
Hjort, N.L. (1991). Estimation in moderately misspecified models. Statistical Research Report No. 8/1991, Department of Mathematics, University of
Oslo.
Lehmann, E.L. and G. Casella (1998) Theory of Point Estimation. Springer,
New York.
Martens, H. and T. Næs (1989) Multivariate Calibration. Wiley, New York.
Nelder, J.A. (1965) The analysis of randomized experiments with orthogonal block structure. I. Block structure and null analysis of variance. Proc.
Royal Soc. A 283, 147-162.
Næs, T. and I.S. Helland (1993) Relevant components in regression. Scand.
J. Statist. 20, 239-250.
Snedecor, G.W. and W.G. Cochran (1989) Statistical Methods. 8th ed.
Iowa State University Press, Ames.
Wold, S., H. Martens and H. Wold (1983). The multivariate calibration
problem in chemistry solved by the PLS method. Proc. Conf. Matrix Pencils
(A. Ruhe, B. Kågström, eds.) Lecture Notes in Mathematics, Springer Verlag,
Heidelberg, 286-293.
Appendix 1: Proof of Proposition 1.
Assume for a fixed $\Sigma = \sum_{k=1}^{q} \lambda_k P_k$ that $Q^t\Sigma Q = \Sigma$, or equivalently $\Sigma Q = Q\Sigma$, for an orthogonal matrix Q. Thus
$$\sum_{i=1}^{q} \lambda_i (QP_i - P_iQ) = 0. \qquad (20)$$
Multiplying (20) from the right by $P_j$ and from the left by $P_k$, respectively,
gives
$$\lambda_j QP_j = \Sigma QP_j, \qquad (21)$$
$$\lambda_k P_kQ = P_kQ\Sigma. \qquad (22)$$
Next, assume $j \ne k$ and multiply (21) from the left by $P_k$ and (22) from the
right by $P_j$, giving $\lambda_j P_kQP_j = P_k\Sigma QP_j$ and $\lambda_k P_kQP_j = P_kQ\Sigma P_j$, respectively.
Since the right-hand sides of the latter equations are equal by assumption, and
since $\lambda_j \ne \lambda_k$, it follows that
$$P_kQP_j = 0 \quad (j \ne k). \qquad (23)$$
Inserting this into (21) and (22) then gives
$$\lambda_j QP_j = \lambda_j P_jQP_j \quad \text{and} \quad \lambda_j P_jQ = \lambda_j P_jQP_j.$$
Since $\lambda_j \ne 0$, it follows that $QP_j = P_jQ$, completing the proof that (a) implies
(b). The implication from (b) to (a) is trivial.
Applying (b) to $v \in V_j$ shows that Qv satisfies $P_jQv = Qv$, so $Qv \in V_j$.
Thus the transformation given by Q must conserve all the spaces Vj; hence
it can only consist of rotations within each single Vj. Again the opposite
implication is trivial.