Local Influence - by Manuel Galea Et - Al 1997
By MANUEL GALEA,
Universidad de Valparaíso, Chile
SUMMARY
Influence diagnostic methods are extended in this paper to elliptical linear models. These include several symmetric
multivariate distributions such as the normal, Student t-, Cauchy and logistic distributions, among others. For a
particular perturbation scheme and for the likelihood displacement the diagnostics agree with those developed for the
normal linear regression model by Cook when the coefficients and the scale parameter are treated separately. This
result shows the invariance of the diagnostics with respect to the induced model in the elliptical linear family.
However, if the coefficients and the scale parameter are treated jointly we have a different diagnostic for each induced
model, which makes this approach helpful for selecting the less sensitive model in the elliptical linear family. An
example on the salinity of water is given for illustration.
Keywords: Diagnostic; Influence; Likelihood displacement; Multivariate symmetric distributions
1. Introduction
Diagnostic techniques for normal linear regression models have been extensively studied in the
statistical literature. See, for example, Belsley et al. (1980), Cook and Weisberg (1982) and
Chatterjee and Hadi (1988). Several of the diagnostic techniques evaluate the effect of deleting
observations on parameter estimates. An alternative approach that assesses the influence of small
(local) perturbations from the assumed model on key results is considered in Cook (1986).
Additional results on local influence and applications can be found in Beckman et al. (1987),
Lawrance (1988), Thomas and Cook (1990), Tsai and Wu (1992), Paula (1993) and Kim (1995).
The method of local influence was proposed by Cook (1986, 1987) as a general tool for
assessing the effect of local departures from model assumptions. In this paper, the local influence
approach is applied to elliptical linear regression models, i.e. when the error vector has
an elliptical distribution. The perturbation scheme considered here is the scheme in which the
scale parameter is modified to allow convenient perturbations in the model.
In Section 2, along with the notation, the elliptical linear models are defined. The local
influence method is reviewed in Section 3. Section 4 deals with the derivation of the
diagnostic procedures for the elliptical linear models. An illustrative example is given in the
last section.
† Address for correspondence: Instituto de Matemática e Estatística, Universidade de São Paulo, Caixa Postal 66281
(Agência Cidade de São Paulo), 05315-970 São Paulo, Brazil.
E-mail: [email protected]
literature. See, for example, Fang et al. (1990), Fang and Zhang (1990) and Fang and
Anderson (1990). An $n \times 1$ random vector $Y$ has an elliptical distribution with location vector
$\mu$ and positive definite scale matrix $\Lambda$ if its density takes the form
$$f_Y(y) = |\Lambda|^{-1/2}\, g\{(y - \mu)^T \Lambda^{-1} (y - \mu)\}, \qquad y \in \mathbb{R}^n, \tag{1}$$
where the function $g: \mathbb{R} \to [0, \infty)$ is such that
$$\int_0^\infty u^{n-1} g(u^2)\, du < \infty.$$
The function g is typically known as the density generator. For a vector Y distributed according
to density (1), we use the notation $Y \sim El_n(\mu, \Lambda, g)$ or, simply, $Y \sim El_n(\mu, \Lambda)$. When $\mu = 0$ and
$\Lambda = I$, we obtain the spherical family of densities. This class of symmetric distributions
includes the normal, Student t-, contaminated normal and logistic (both, univariate and
multivariate) distributions, among others, as considered, for example, by Fang et al. (1990).
Table 1, taken from Fang et al. (1990), reports examples of distributions in the elliptical family.
The notation c1 , c2 , c3 and c4 is used to denote normalizing constants.
Consider now the linear regression model
$$Y = X\beta + \epsilon, \tag{2}$$
where $Y$ is an $n \times 1$ vector of responses, $X$ is a known $n \times p$ matrix of rank $p$, $\beta$ is a
$p$-dimensional vector of parameters and $\epsilon$ is an $n$-dimensional error vector with distribution
$El_n(0, \phi I)$, where $\phi$ is the scale parameter. Thus, it follows that $Y \sim El_n(X\beta, \phi I)$. This is
typically called the elliptical linear regression model. If g is a continuous and decreasing
function then the maximum likelihood estimators of β and φ are given by (see Fang and
Anderson (1990))
$$\hat\beta = (X^T X)^{-1} X^T Y, \qquad \hat\phi = \frac{Q(\hat\beta)}{u_g}, \tag{3}$$
TABLE 1
Multivariate elliptical distributions
Contaminated normal, $CN_n(\mu, \Lambda, \delta, \tau)$: $g(u) = c_1\{(1 - \delta)\exp(-u/2) + \delta\tau^{-n/2}\exp(-u/(2\tau))\}$
where
$$W_g(u) = \frac{d\{\log g(u)\}}{du} = \frac{g'(u)}{g(u)}.$$
It is easy to see that, for the normal and t-distributions, $u_g = n$. However, for the contaminated
normal and logistic distributions, $u_g$ must be obtained numerically. For the logistic distribution,
for example, equation (5) becomes
$$\frac{n}{2u} = \tanh\left(\frac{u}{2}\right),$$
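As a numerical illustration of how $u_g$ might be obtained, the following sketch solves $n/(2u) + W_g(u) = 0$ by bisection. This general form of the equation is an assumption made here, chosen because it reproduces both cases quoted in the text: $u_g = n$ for the normal distribution (taking $W_g(u) = -1/2$) and $n/(2u) = \tanh(u/2)$ for the logistic distribution (taking $W_g(u) = -\tanh(u/2)$). The function names and the sample size $n = 28$ are hypothetical.

```python
import math

def solve_u_g(W_g, n, lo=1e-8, hi=1.0, tol=1e-12):
    """Solve n/(2u) + W_g(u) = 0 for u > 0 by bisection.

    Assumes the left-hand side is positive near zero and eventually
    negative, which holds for the two choices of W_g used below.
    """
    def h(u):
        return n / (2.0 * u) + W_g(u)

    # Expand the upper end of the bracket until the sign changes.
    while h(hi) > 0.0:
        hi *= 2.0
    # Standard bisection on [lo, hi].
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def W_normal(u):
    # Normal case: W_g(u) = -1/2, so the root should be u = n.
    return -0.5

def W_logistic(u):
    # Assumed logistic case, matching n/(2u) = tanh(u/2) in the text.
    return -math.tanh(u / 2.0)

n = 28  # hypothetical sample size
u_normal = solve_u_g(W_normal, n)
u_logistic = solve_u_g(W_logistic, n)
print(u_normal, u_logistic)
```

Bisection is used only because the left-hand side is monotone for these two cases; any scalar root finder would serve equally well.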
placement
$$LD(\omega) = 2\{L(\hat\theta) - L(\hat\theta_\omega)\},$$
Small perturbations to the model may be important, especially when assessing whether the
sample is robust with respect to the induced model. To assess this kind of robustness, Cook
(1986) suggested studying the local influence around $\omega_0$. The idea consists of studying the
normal curvature of the surface $\alpha(\omega) = (\omega^T, LD(\omega))^T$ and then taking the direction around $\omega_0$
corresponding to the largest normal curvature.
Cook (1986) showed that the normal curvature in the direction l takes the form
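$$C_l(\theta) = 2\,|l^T \Delta^T (\ddot{L})^{-1} \Delta\, l|, \tag{6}$$
where $\|l\| = 1$, $-\ddot{L}$ is the observed information matrix for the postulated model ($\omega = \omega_0$) and $\Delta$
is the $(p + 1) \times q$ matrix with elements
$$\Delta_{ij} = \frac{\partial^2 L(\theta \mid \omega)}{\partial\theta_i\, \partial\omega_j},$$
evaluated at $\theta = \hat\theta$ and $\omega = \omega_0$, $i = 1, \ldots, p + 1$ and $j = 1, \ldots, q$.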
Therefore, the maximization of equation (6) is equivalent to finding the largest eigenvalue
$C_{\max}$ of the matrix $B = \Delta^T (\ddot{L})^{-1} \Delta$, and the direction of largest curvature around $\omega_0$, denoted by $l_{\max}$, is the
corresponding eigenvector. If $C_{\max}$ is much greater than the remaining eigenvalues of $B$, the
index plot of $l_{\max}$ may be helpful in assessing the influence of small perturbations on $LD(\omega)$.
Otherwise, it may be more informative to examine the index plots of the eigenvectors
corresponding to the largest eigenvalues.
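The eigenvalue step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses plain power iteration, which finds the eigenvalue of $B$ that is largest in absolute value (assumed here to be strictly dominant), and it reports $2|\lambda|$ as the curvature in that direction, following the form of equation (6). The function name and the small rank-one test matrix are hypothetical.

```python
import math

def dominant_eigenpair(B, iters=1000, tol=1e-14):
    """Power iteration for a symmetric matrix B given as a list of rows.

    Returns (eigenvalue of largest absolute value, unit eigenvector).
    Assumes that eigenvalue is strictly dominant in absolute value.
    """
    n = len(B)
    # Deterministic, non-constant start to avoid accidental orthogonality.
    l = [1.0 / (i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in l))
    l = [x / norm for x in l]
    lam = 0.0
    for _ in range(iters):
        v = [sum(B[i][j] * l[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        new_lam = sum(l[i] * v[i] for i in range(n))  # Rayleigh quotient l^T B l
        l = [x / norm for x in v]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, l

# Hypothetical example: B = -v v^T with v = (1, 2, 2), whose only
# non-zero eigenvalue is -9 with eigenvector v / 3.
v = [1.0, 2.0, 2.0]
B = [[-vi * vj for vj in v] for vi in v]
lam, l_max = dominant_eigenpair(B)
curvature_max = 2.0 * abs(lam)  # curvature 2 |l^T B l| in direction l_max
print(curvature_max, [abs(x) for x in l_max])
```

An index plot would then display the absolute components of `l_max` against the case number. For larger problems a library eigen-solver would normally be preferred to power iteration.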
and
$$\frac{\partial L(\theta)}{\partial\phi} = -\frac{n}{2\phi} - \frac{1}{\phi^2} W_g(u)\, Q(\beta). \tag{8}$$
From equations (7) and (8) it follows, after some algebraic manipulation, that
$$\frac{\partial^2 L(\theta)}{\partial\beta\,\partial\beta^T} = \frac{2}{\phi} W_g(u)(X^T X) + \frac{4}{\phi^2} W'_g(u)\, X^T (y - X\beta)(y - X\beta)^T X,$$
where
$$W'_g(u) = \frac{dW_g(u)}{du}.$$
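Evaluating these derivatives at $\theta = \hat\theta$, given in equations (3), and by noting that $(y - X\hat\beta)^T X = 0$
and $Q(\hat\beta)/\hat\phi = u_g$, we have
$$\ddot{L} = \begin{pmatrix} \dfrac{2}{\hat\phi} W_g(\hat u)(X^T X) & 0 \\ 0 & \dfrac{n}{2\hat\phi^2} + \dfrac{u_g}{\hat\phi^2}\{2W_g(\hat u) + u_g W'_g(\hat u)\} \end{pmatrix},$$
where $\hat u = Q(\hat\beta)/\hat\phi = u_g$.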
Table 2 shows $W_g(u)$ and $W'_g(u)$ for the distributions given in Table 1, where
$$f_i(u) = (1 - \delta)\exp(-u/2) + \delta\tau^{-(n/2) - i}\exp(-u/(2\tau)), \qquad i = 0, 1, 2.$$
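For the contaminated normal generator, $g(u)$ is proportional to $f_0(u)$, and differentiating $\log f_0(u)$ suggests the closed forms $W_g(u) = -\tfrac{1}{2} f_1(u)/f_0(u)$ and $W'_g(u) = \tfrac{1}{4}[f_2(u)/f_0(u) - \{f_1(u)/f_0(u)\}^2]$. These expressions are derived here rather than copied from Table 2, so the sketch below checks them against finite differences. The parameter values are hypothetical.

```python
import math

def f(i, u, delta, tau, n):
    """f_i(u) = (1 - delta) exp(-u/2) + delta tau^{-(n/2) - i} exp(-u/(2 tau))."""
    return ((1.0 - delta) * math.exp(-u / 2.0)
            + delta * tau ** (-(n / 2.0) - i) * math.exp(-u / (2.0 * tau)))

def W_g(u, delta, tau, n):
    """d log g(u) / du when g(u) is proportional to f_0(u)."""
    return -0.5 * f(1, u, delta, tau, n) / f(0, u, delta, tau, n)

def W_g_prime(u, delta, tau, n):
    """Derivative of W_g(u) with respect to u."""
    r1 = f(1, u, delta, tau, n) / f(0, u, delta, tau, n)
    r2 = f(2, u, delta, tau, n) / f(0, u, delta, tau, n)
    return 0.25 * (r2 - r1 ** 2)

# Hypothetical parameter values for the check.
delta, tau, n, u, h = 0.1, 4.0, 5, 3.0, 1e-5

# Central finite differences of log f_0 and of W_g.
num_W = (math.log(f(0, u + h, delta, tau, n))
         - math.log(f(0, u - h, delta, tau, n))) / (2.0 * h)
num_Wp = (W_g(u + h, delta, tau, n) - W_g(u - h, delta, tau, n)) / (2.0 * h)

print(W_g(u, delta, tau, n), num_W)
print(W_g_prime(u, delta, tau, n), num_Wp)
```

Setting `delta = 0` recovers the normal case, with $W_g(u) = -1/2$ and $W'_g(u) = 0$.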
Note that for the normal case ($u_g = n$, $W_g(u) = -\tfrac{1}{2}$ and $W'_g(u) = 0$) the matrix $\ddot{L}$ reduces to
$$\ddot{L} = \begin{pmatrix} -(1/\hat\phi)(X^T X) & 0 \\ 0 & -n/(2\hat\phi^2) \end{pmatrix}.$$
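The reduction to the normal case can be checked numerically. The sketch below assumes that the two diagonal blocks of $\ddot{L}$ in the general elliptical case are $(2/\hat\phi) W_g(\hat u)\, X^T X$ and $n/(2\hat\phi^2) + (u_g/\hat\phi^2)\{2W_g(\hat u) + u_g W'_g(\hat u)\}$, as read from the display for $\ddot{L}$ given before Table 2. The function name and the numerical values are hypothetical.

```python
def hessian_blocks(n, phi_hat, u_g, W, W_prime):
    """Return (scalar multiplying X^T X in the beta block, phi-phi entry).

    W and W_prime stand for W_g and W'_g evaluated at u_g.
    """
    beta_scale = 2.0 * W / phi_hat
    phi_entry = (n / (2.0 * phi_hat ** 2)
                 + (u_g / phi_hat ** 2) * (2.0 * W + u_g * W_prime))
    return beta_scale, phi_entry

# Normal case: u_g = n, W_g = -1/2, W'_g = 0 (hypothetical n and phi_hat).
n, phi_hat = 28, 1.7
beta_scale, phi_entry = hessian_blocks(n, phi_hat, u_g=n, W=-0.5, W_prime=0.0)

print(beta_scale, -1.0 / phi_hat)                # both equal -(1 / phi_hat)
print(phi_entry, -n / (2.0 * phi_hat ** 2))      # both equal -n / (2 phi_hat^2)
```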
Consider now model (2), with the assumption that $\epsilon \sim El_n\{0, \phi D^{-1}(\omega)\}$, where $D(\omega) =
\mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $D^{-1}(\omega)$ denotes the inverse of $D(\omega)$. Here $q = n$ and $\omega_i$ is the weight
corresponding to the $i$th case, $i = 1, \ldots, n$. When $\omega = \omega_0 = 1$, the perturbed model reduces to
the postulated model. Under the perturbed model we write $Y \sim El_n\{X\beta, \phi D^{-1}(\omega)\}$.
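Thus, the log-likelihood function is given by
Table 2, contaminated normal: $W_g(u) = -\dfrac{1}{2}\dfrac{f_1(u)}{f_0(u)}$, $W'_g(u) = \dfrac{1}{4}\left[\dfrac{f_2(u)}{f_0(u)} - \left\{\dfrac{f_1(u)}{f_0(u)}\right\}^2\right]$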
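$$\frac{\partial L(\theta \mid \omega)}{\partial\beta} = -\frac{2}{\phi} W_g(u_\omega)\{X^T D(\omega) y - X^T D(\omega) X\beta\},$$
$$\frac{\partial L(\theta \mid \omega)}{\partial\phi} = -\frac{n}{2\phi} - \frac{1}{\phi^2} W_g(u_\omega)\, Q_\omega(\beta).$$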
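$$\frac{\partial^2 L(\theta \mid \omega)}{\partial\beta\,\partial\omega^T} = -\frac{2}{\phi}\left\{W_g(u_\omega)\, X^T D(\epsilon) + \frac{1}{\phi} W'_g(u_\omega)\, X^T D(\omega)\,\epsilon\,\epsilon^T D(\epsilon)\right\},$$
$$\frac{\partial^2 L(\theta \mid \omega)}{\partial\phi\,\partial\omega^T} = -\frac{1}{\phi^2}\left\{W_g(u_\omega)\,\epsilon^T D(\epsilon) + \frac{1}{\phi} W'_g(u_\omega)\, Q_\omega(\beta)\,\epsilon^T D(\epsilon)\right\},$$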
which is in agreement with the expression obtained by Cook (1986). Therefore, we may write
$$B = \Delta^T \ddot{L}^{-1} \Delta = B_1 + B_2,$$
where
and
$$\frac{4}{\hat\phi}\,|W_g(\hat u)\, l^T D(e) P D(e)\, l|$$
where
$$r = (I - P_2) X_1, \qquad P_2 = X_2 (X_2^T X_2)^{-1} X_2^T$$
and $\|a\|$ denotes the norm of the vector $a$. Thus, the maximum curvature occurs in the direction
$$l_{\max} \propto D(e)\, r.$$
Accordingly, the cases with $|r_i e_i|$ large are locally most influential on the estimate $\hat\beta_1$.
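A minimal sketch of how this direction might be turned into a ranking of cases is given below. It takes the residual vector $e$ and the vector $r$ as given, normalizes the vector with components $e_i r_i$ and lists the cases with the largest absolute components. The function names and the numerical values of $e$ and $r$ are hypothetical and are not taken from any data set in the paper.

```python
import math

def lmax_direction(e, r):
    """Unit vector proportional to D(e) r, i.e. with components e_i * r_i."""
    v = [ei * ri for ei, ri in zip(e, r)]
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("D(e) r is the zero vector; no direction is defined")
    return [x / norm for x in v]

def most_influential(e, r, k=3):
    """Return the k cases (1-based) with the largest |l_max| components."""
    l = lmax_direction(e, r)
    order = sorted(range(len(l)), key=lambda i: abs(l[i]), reverse=True)
    return [(i + 1, abs(l[i])) for i in order[:k]]

e = [0.5, -2.0, 0.1, 1.0]    # hypothetical residuals
r = [1.0, 1.5, -0.3, -0.2]   # hypothetical components of r
top = most_influential(e, r)
print(top)
```

Plotting the absolute components of the normalized vector against the case number gives the index plot of $|l_{\max}|$.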
Similarly, the normal curvature for the scale parameter φ in the direction l is given by
$$C_l(\phi) = 2\,|l^T B_2\, l| = \frac{2}{\hat\phi^2}\,|C_\omega|\,|l^T D(e)\, e e^T D(e)\, l|,$$
where
Therefore, at least for the perturbation scheme defined in Section 4 and for the likelihood
displacement, we may conclude that the diagnostics for the elliptical linear models are
equivalent to those deduced by Cook (1986) for the normal linear model when β and φ are
treated separately, i.e. the index plots do not change with the induced model in the elliptical
linear family. However, if β and φ are treated jointly, the $l_{\max}$-vector may change from one
model to another, which suggests a helpful way of discovering those observations that are most
locally influential under each model.
5. Water salinity
To illustrate the methodology described in this paper we consider the data set reported by
Ruppert and Carroll (1980) on the salinity of water during the spring in Pamlico Sound, North
Carolina. The response Y is biweekly salinity, and the explanatory variables are salinity lagged
2 weeks, $x_1$, a dummy variable $x_2$ for the time period and river discharge, $x_3$. The value of $x_{1i}$
may differ from $y_{i-1}$, since the data are not a contiguous sequence. This data set has been
analysed, for instance, by Atkinson (1985), Carroll and Ruppert (1985) and Davison and Tsai
(1992). Atkinson (1985) assumed a normal distribution for the response whereas Davison and
Tsai (1992) considered a Student t-distribution with 3 degrees of freedom to allow for the
possibility that the data have tails that are longer than for the normal distribution. In both
cases, the linear model
$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \epsilon_i$$
normally distributed errors, found cases 16 and 5 to be the most influential. Case 5 was shown
to be influential after a correction was made for case 16. In Davison and Tsai’s analysis, where
a Student t-distribution with 3 degrees of freedom was used, cases 16, 5 and 3 appear the most
influential. Fig. 1 presents the index plot of $|l_{\max}|$ for $\hat\beta$ and $\hat\phi$ separately. We see in Fig. 1(a)
outstanding local influence for case 16, whereas in Fig. 1(b) it follows that cases 9, 15, 16
and 17 present the highest local influences. In contrast, when we use the global log-likelihood
L(θ) in Cook’s approach instead of the profiles $L(\beta \mid \phi)$ or $L(\phi \mid \beta)$, the local influence of the
observations on $\hat\theta$ is no longer invariant in the elliptical linear family. Fig. 2 illustrates this
behaviour. Moreover, Fig. 2 shows that case 16 is the most locally influential for the normal,
Cauchy and Student t-models. However, for the logistic model, cases 9, 15 and 17 also appear
with a high local influence. Therefore, we may conclude from this example that the index plot
of $|l_{\max}|$ for $\hat\theta$ may be helpful in selecting the less sensitive model with respect to local
perturbations in the elliptical linear family, especially when we are interested in both $\hat\beta$ and $\hat\phi$.
Acknowledgements
The authors acknowledge partial financial support from the Conselho Nacional de Desenvolvimento
Científico e Tecnológico, Brazil.
References
Atkinson, A. C. (1985) Plots, Transformations, and Regression. Oxford: Clarendon.
Beckman, R. J., Nachtsheim, C. J. and Cook, R. D. (1987) Diagnostics for mixed-model analysis of variance.
Technometrics, 29, 413–426.
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: Wiley.
Carroll, R. J. and Ruppert, D. (1985) Transformations in regression: a robust analysis. Technometrics, 27, 1–12.
Chatterjee, S. and Hadi, A. S. (1988) Sensitivity Analysis in Linear Regression. New York: Wiley.
Cook, R. D. (1986) Assessment of local influence (with discussion). J. R. Statist. Soc. B, 48, 133–169.
Cook, R. D. (1987) Influence assessment. J. Appl. Statist., 14, 117–131.
Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression. London: Chapman and Hall.
Davison, A. C. and Tsai, C.-L. (1992) Regression model diagnostics. Int. Statist. Rev., 60, 337–353.
Fang, K. T. and Anderson, T. W. (1990) Statistical Inference in Elliptically Contoured and Related Distributions. New
York: Allerton.
Fang, K. T., Kotz, S. and Ng, K. W. (1990) Symmetric Multivariate and Related Distributions. London: Chapman and
Hall.
Fang, K. T. and Zhang, Y. T. (1990) Generalized Multivariate Analysis. London: Springer.
Kim, M. G. (1995) Local influence in multivariate regression. Communs Statist. Theory Meth., 24, 1271–1278.
Lawrance, A. J. (1988) Regression transformation diagnostics using local influence. J. Am. Statist. Ass., 84, 125–141.
Paula, G. A. (1993) Assessing local influence in restricted regression models. Comput. Statist. Data Anal., 16, 63–79.
Ruppert, D. and Carroll, R. J. (1980) Trimmed least squares estimation in the linear model. J. Am. Statist. Ass., 75,
828–838.
Thomas, W. and Cook, R. D. (1990) Assessing influence on predictions from generalized linear models. Technometrics,
32, 59–65.
Tsai, C.-H. and Wu, X. (1992) Assessing local influence in linear regression models with first-order autoregressive or
heteroscedastic error structure. Statist. Probab. Lett., 14, 247–252.