Generalized Linear Model: Badr Missaoui


Generalized Linear Model

Badr Missaoui
Logistic Regression

Outline
- Generalized linear models
- Deviance
- Logistic regression
Generalized Linear Model

- All models we have seen so far deal with continuous outcome variables with no restriction on their expectations, and (most) have assumed that mean and variance are unrelated (i.e. the variance is constant).
- Many outcomes of interest do not satisfy this.
- Examples: binary outcomes, Poisson count outcomes.
- A Generalized Linear Model (GLM) is a model with two ingredients: a link function and a variance function.
- The link function relates the means of the observations to the predictors: linearization.
- The variance function relates the means to the variances.
Generalized Linear Model

- The data involve 462 males between the ages of 15 and 64. The outcome Y is the presence (Y = 1) or absence (Y = 0) of heart disease.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9207616 1.3265724 -4.463 8.07e-06 ***
sbp 0.0076602 0.0058574 1.308 0.190942
tobacco 0.0777962 0.0266602 2.918 0.003522 **
ldl 0.1701708 0.0597998 2.846 0.004432 **
adiposity 0.0209609 0.0294496 0.712 0.476617
famhistPresent 0.9385467 0.2287202 4.103 4.07e-05 ***
typea 0.0376529 0.0124706 3.019 0.002533 **
obesity -0.0661926 0.0443180 -1.494 0.135285
alcohol 0.0004222 0.0045053 0.094 0.925346
age 0.0441808 0.0121784 3.628 0.000286 ***
Generalized Linear Model

Motivation
- Classical linear model:

  Y = Xβ + ε

  where ε ∼ N(0, σ²). That means

  Y ∼ N(Xβ, σ²)

- In the GLM, we specify instead that

  Y ∼ P(Xβ)

  where P is an exponential-family distribution whose mean is a function of Xβ (as made precise below).
Generalized Linear Model

We write the GLM as

  E(Yi) = µi

and

  ηi = g(µi) = Xi β

where g is called the link function and the distribution of Yi belongs to an exponential family.
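
In R, both ingredients are supplied through the family argument of glm: the family fixes the variance function, and its link argument fixes g. A minimal sketch on simulated Poisson data (variable names are illustrative):

set.seed(1)
x <- runif(100)
y <- rpois(100, exp(1 + 2 * x))    # true mean: mu = exp(eta), eta = 1 + 2x
# log link g(mu) = log(mu); the Poisson family implies Var(y) = mu
fit <- glm(y ~ x, family = poisson(link = "log"))
summary(fit)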
Generalized Linear Model

- The exponential-family density is specified by two components: the canonical parameter θ and the dispersion parameter φ.
- Let Y = (Yi)i=1,...,n be a sequence of random variables. Yi has an exponential-family density if

  fYi(yi; θi, φ) = exp{ (yi θi − b(θi)) / ai(φ) + c(yi, φ) }

  where the functions b and c are specific to each distribution and ai(φ) = φ/wi.
Generalized Linear Model

  Law        Density                                     µ     σ²
  B(m, p)    C(m, y) p^y (1 − p)^(m−y)                   mp    mp(1 − p)
  P(µ)       µ^y e^(−µ) / y!                             µ     µ
  N(µ, σ²)   exp{ −(y − µ)²/(2σ²) } / √(2πσ²)            µ     σ²
  IG(µ, λ)   √(λ/(2πy³)) exp{ −λ(y − µ)²/(2µ²y) }        µ     µ³/λ

where C(m, y) denotes the binomial coefficient.
Generalized Linear Model

- We write
  ℓ(y; θ, φ) = log f(y; θ, φ)
  for the log-likelihood function of Y.
- Using the facts that
  E[∂ℓ/∂θ] = 0
  and
  Var(∂ℓ/∂θ) = −E[∂²ℓ/∂θ²],
- we have
  E(y) = b′(θ)
  and
  Var(y) = b″(θ) a(φ)
  (these identities are easy to check by simulation; see the sketch below).
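
A quick R sketch for the Poisson case, where θ = log µ, b(θ) = e^θ and a(φ) = 1, so that E(y) = Var(y) = µ:

set.seed(42)
mu <- 3
y <- rpois(1e5, mu)   # theta = log(mu), b(theta) = exp(theta)
mean(y)               # approximately b'(theta) = exp(theta) = mu = 3
var(y)                # approximately b''(theta) * a(phi) = mu = 3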
Generalized Linear Model
- Gaussian case:

  f(y; θ, φ) = (1/(σ√(2π))) exp{ −(y − µ)²/(2σ²) }
             = exp{ (yµ − µ²/2)/σ² − (1/2)( y²/σ² + log(2πσ²) ) }

  We can write θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2 and
  c(y, φ) = −(1/2)( y²/σ² + log(2πσ²) ).

- Binomial case:

  f(y; θ, φ) = C(n, y) µ^y (1 − µ)^(n−y)
             = exp{ y log(µ/(1 − µ)) + n log(1 − µ) + log C(n, y) }

  We can write θ = log(µ/(1 − µ)), b(θ) = −n log(1 − µ) = n log(1 + e^θ) and
  c(y, φ) = log C(n, y).
Generalized Linear Model

Recall that in ordinary linear models, the MLE of β satisfies

  β̂ = (XᵀX)⁻¹ XᵀY

if X has full rank.

In a GLM, the MLE β̂ does not exist in closed form and is computed approximately via iteratively reweighted least squares.
Generalized Linear Model
- For n observations, the log-likelihood function is
  L(β) = Σ_{i=1}^n ℓ(yi; θi, φ)
- By the chain rule,
  ∂ℓi/∂βj = (∂ℓi/∂θi)(∂θi/∂µi)(∂µi/∂ηi)(∂ηi/∂βj)
          = ((yi − µi)/(φ/wi)) · (1/b″(θi)) · (1/g′(µi)) · xij
- The likelihood equations are
  ∂L/∂βj = Σ_{i=1}^n xij ((yi − µi)/Var(yi)) (∂µi/∂ηi) = 0,  j = 1, …, p
- Put
  W = diag{ g′(µi)² Var(yi) }_{i=1,…,n}
  and
  ∂η/∂µ = diag{ ∂ηi/∂µi }_{i=1,…,n} = diag{ g′(µi) }_{i=1,…,n}
Generalized Linear Model

- In matrix form, these likelihood equations read
  Xᵀ W⁻¹ (∂η/∂µ)(y − µ) = 0
- These equations are non-linear in β and require an iterative method (e.g. Newton–Raphson).
- The Fisher information matrix is
  ℑ = Xᵀ W⁻¹ X
  and, in general terms,
  [ℑ]jk = −E[ ∂²L(β)/∂βj ∂βk ] = Σ_{i=1}^n ( xij xik / Var(yi) ) (∂µi/∂ηi)²
Generalized Linear Model
Let µ̂^0 = Y be the initial estimate. Then set η̂^0 = g(µ̂^0) and form the adjusted variable

  Z^0 = η̂^0 + (Y − µ̂^0) (∂η/∂µ)|_{µ=µ̂^0}

Calculate β̂^1 by the weighted least squares regression of Z^0 on X, that is,

  β̂^1 = argmin_β (Z^0 − Xβ)ᵀ W0⁻¹ (Z^0 − Xβ)

So,

  β̂^1 = (Xᵀ W0⁻¹ X)⁻¹ Xᵀ W0⁻¹ Z^0

Set

  η̂^1 = X β̂^1,  µ̂^1 = g⁻¹(η̂^1)

Repeat until the changes in β̂^m are sufficiently small (a sketch in R follows).
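
A minimal R sketch of this iteration for a Poisson GLM with log link, where g′(µ) = 1/µ and Var(y) = µ, so that W = diag{g′(µi)² Var(yi)} = diag{1/µi} (toy code under these assumptions, not a production fitter):

set.seed(1)
n <- 200
X <- cbind(1, runif(n))                     # design matrix with intercept
y <- rpois(n, exp(X %*% c(0.5, 1.5)))       # simulated Poisson responses

mu  <- pmax(y, 0.5)                         # initial estimate mu^0 (avoid log(0))
eta <- log(mu)                              # eta^0 = g(mu^0)
for (m in 1:25) {
  Z    <- eta + (y - mu) / mu               # adjusted variable; d eta / d mu = 1/mu
  Winv <- diag(as.vector(mu))               # W^{-1} = diag{mu} here
  beta <- solve(t(X) %*% Winv %*% X, t(X) %*% Winv %*% Z)
  eta  <- X %*% beta
  mu   <- exp(eta)                          # mu = g^{-1}(eta)
}
cbind(beta, coef(glm(y ~ X[, 2], family = poisson)))   # the two columns agree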
Generalized Linear Model
Estimation
- In theory, β̂^m → β̂ as m → ∞, but in practice the algorithm may fail to converge.
- Under some conditions, β̂ is asymptotically normal:
  β̂ ≈ N(β, ℑ⁻¹(β))
- In practice, the asymptotic covariance matrix of β̂ is estimated by
  φ (Xᵀ Wm⁻¹ X)⁻¹
  where Wm is the weight matrix from the m-th iteration.
- If φ is unknown, it is estimated by
  φ̂ = (1/(n − p)) Σ_{i=1}^n wi (yi − µ̂i)² / V(µ̂i)
  where V(µ̂i) = Var(yi)/ai(φ) = wi Var(yi)/φ.


Generalized Linear Model
- Confidence interval:
  CIα(βj) = [ β̂j − u_{1−α/2} σ̂βj/√n ,  β̂j + u_{1−α/2} σ̂βj/√n ]
  where u_{1−α/2} is the 1 − α/2 quantile of N(0, 1) and σ̂βj² = [ (1/n) ℑ(β̂) ]⁻¹jj.
- To test the hypothesis H0: βj = 0 against H1: βj ≠ 0, use
  |β̂j| / √( φ (Xᵀ Wm⁻¹ X)⁻¹(j, j) ) ∼ N(0, 1) under H0,
  and, if φ is unknown,
  |β̂j| / √( φ̂ (Xᵀ Wm⁻¹ X)⁻¹(j, j) ) ∼ t_{n−p}
  (see the R sketch below).
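
In R, these Wald quantities come directly from a fitted glm; a sketch assuming fit is a fitted model object (such as the heart-disease fit shown earlier):

summary(fit)$coefficients           # estimates, standard errors, z values, p-values
confint.default(fit, level = 0.95)  # beta_j -/+ u_{1-alpha/2} * se(beta_j)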
Generalized Linear Model

Goodness-of-Fit

H0: the true model is M   versus   H1: the true model is Msat

- The likelihood ratio statistic for this hypothesis is called the deviance.
- For any submodel M,
  dev(M) = 2( ℓ̂sat − ℓ̂M )
- Under H0, dev(M) → χ²_{psat − p}.


Generalized Linear Model

Goodness-of-Fit
- The scaled deviance for a GLM is
  D(y, µ̂) = 2[ ℓ(µ̂sat, φ; y) − ℓ(µ̂, φ; y) ]
          = Σ_{i=1}^n 2 wi [ yi( θ(µ̂i^sat) − θ(µ̂i) ) − b(θ(µ̂i^sat)) + b(θ(µ̂i)) ] / φ
          = Σ_{i=1}^n D*(yi; µ̂i) / φ
          = D*(y; µ̂) / φ
  where D*(y; µ̂) is the (unscaled) deviance.
Generalized Linear Model

Tests
- We use the deviance to compare two nested models having p1 and p2 parameters respectively, where p1 < p2. Let µ̂1 and µ̂2 denote the corresponding MLEs.
- Approximately,
  D(y, µ̂1) − D(y, µ̂2) ∼ χ²_{p2−p1}
- If φ is unknown, approximately
  ( D*(y, µ̂1) − D*(y, µ̂2) ) / ( (p2 − p1) φ̂ ) ∼ F_{p2−p1, n−p2}
  (see the anova() sketch below).
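
In R this comparison is carried out with anova() on two nested fits; a sketch assuming fit1 is nested in fit2:

anova(fit1, fit2, test = "Chisq")  # deviance difference against chi-square(p2 - p1)
anova(fit1, fit2, test = "F")      # F test when phi has to be estimated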
Generalized Linear Model

Goodness-of-Fit
- The deviance residuals for a given model are
  di = sign(yi − µ̂i) √( D*(yi; µ̂i) )
- A poorly fitting point makes a large contribution to the deviance, so |di| will be large.
Generalized Linear Model

Diagnostics
- The Pearson residuals are defined by
  ri = (yi − µ̂i) / √( (1 − hii) V(µ̂i) )
  where hii is the i-th diagonal element of
  H = X (Xᵀ Wm⁻¹ X)⁻¹ Xᵀ Wm⁻¹
- The (standardized) deviance residuals are
  ε̂i = sign(yi − µ̂i) √( D*(yi; µ̂i) / (1 − hii) )
  (both are easy to compute in R; see below).
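
A sketch in R, assuming fit is a fitted glm:

r_pearson  <- residuals(fit, type = "pearson")
r_deviance <- residuals(fit, type = "deviance")
h <- hatvalues(fit)                  # diagonal elements h_ii of H
r_std <- r_deviance / sqrt(1 - h)    # standardized deviance residuals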
Generalized Linear Model

Diagnostics
- The Anscombe residuals are defined as a transformation of the Pearson residuals:
  ri^A = ( t(yi) − t(µ̂i) ) / ( t′(µ̂i) √( φ V(µ̂i)(1 − hii) ) )
- The aim in introducing the function t is to make the residuals as Gaussian as possible. We take
  t(x) = ∫0^x V(µ)^(−1/3) dµ
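
For the Poisson case V(µ) = µ, so t(x) = ∫0^x µ^(−1/3) dµ = (3/2) x^(2/3) and t′(µ) = µ^(−1/3); a sketch assuming fit is a fitted Poisson glm with response vector y (and φ = 1):

mu <- fitted(fit)
h  <- hatvalues(fit)
tfun <- function(x) 1.5 * x^(2/3)    # t(x) when V(mu) = mu
rA <- (tfun(y) - tfun(mu)) / (mu^(-1/3) * sqrt(mu * (1 - h)))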
Generalized Linear Model

Diagnostics
- Influential points can be flagged with Cook's distance:
  Ci = (1/p) (β̂(i) − β̂)ᵀ Xᵀ Wm X (β̂(i) − β̂) ≈ ri² hii / ( p (1 − hii)² )
- High-leverage points: if hii > 2p/n (or, more conservatively, hii > 3p/n), the i-th point has unusually high leverage and deserves scrutiny (see the sketch below).
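
Cook's distances and the leverage rule of thumb in R, assuming fit is a fitted glm:

cd <- cooks.distance(fit)
h  <- hatvalues(fit)
p  <- length(coef(fit)); n <- nobs(fit)
which(h > 2 * p / n)   # candidate high-leverage points
which(cd > 4 / n)      # a common heuristic flag for influential points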
Generalized Linear Model

Model Selection
- Model selection can be done using the AIC and BIC.
- Forward, backward and stepwise approaches can be used (see the step() sketch below).
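
In R, AIC-based selection is done with step(); a sketch using the SAheart model from these slides (assuming the SAheart data frame is loaded):

fit_full <- glm(chd ~ ., family = binomial, data = SAheart)
fit_step <- step(fit_full)   # backward elimination by default; output as shown later
summary(fit_step)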
Generalized Linear Model

Logistic regression
- Logistic regression is a generalization of regression that is used when the outcome Y is binary (0 or 1).
- As an example, we assume that
  P(Yi = 1 | Xi) = e^(β0 + β1 Xi) / ( 1 + e^(β0 + β1 Xi) )
- Note that
  E(Yi | Xi) = P(Yi = 1 | Xi)
Generalized Linear Model

Logistic regression
- Define the logit function
  logit(z) = log( z / (1 − z) )
- We can write
  logit(πi) = β0 + β1 Xi
  where πi = P(Yi = 1 | Xi)
- The extension to several covariates is
  logit(πi) = β0 + Σ_{j=1}^p βj xij
Generalized Linear Model

How do we estimate the parameters?

- The model can be fit by maximum likelihood.
- The likelihood function is
  L(β) = Π_{i=1}^n f(yi | Xi; β) = Π_{i=1}^n πi^yi (1 − πi)^(1−yi)
- The estimator β̂ has to be found numerically, as sketched below.
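
A sketch of this numerical maximization with optim() on simulated data (glm obtains the same estimates via iteratively reweighted least squares):

set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # simulated binary outcomes

negloglik <- function(b) {                  # -log L(beta) for the logistic model
  eta <- b[1] + b[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}
opt <- optim(c(0, 0), negloglik)
rbind(optim = opt$par, glm = coef(glm(y ~ x, family = binomial)))  # nearly identical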


Generalized Linear Model

Usually, we use iteratively reweighted least squares:

- First, set a starting value β^(0).
- Compute
  π̂i = e^(Xi β^(k)) / ( 1 + e^(Xi β^(k)) )
- Define the weight matrix W whose i-th diagonal element is π̂i(1 − π̂i).
- Define the adjusted response vector
  Z = X β^(k) + W⁻¹(Y − π̂)
- Take
  β̂^(k+1) = (Xᵀ W X)⁻¹ Xᵀ W Z
  which is the weighted linear regression of Z on X (a direct transcription in R follows).
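
A direct transcription of these steps in R, on simulated data (a toy sketch; stats::glm is the practical tool):

set.seed(2)
n <- 300
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(X %*% c(-1, 2)))   # simulated binary outcomes

beta <- c(0, 0)                             # starting value beta^(0)
for (k in 1:25) {
  pi_hat <- as.vector(plogis(X %*% beta))   # e^{X beta} / (1 + e^{X beta})
  W <- diag(pi_hat * (1 - pi_hat))          # i-th diagonal: pi_i (1 - pi_i)
  Z <- X %*% beta + (y - pi_hat) / (pi_hat * (1 - pi_hat))   # Z = X beta + W^{-1}(Y - pi)
  beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% Z)
}
cbind(beta, coef(glm(y ~ X[, 2], family = binomial)))   # the two columns agree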
Generalized Linear Model

Model selection and diagnostics


- Diagnostics: the Pearson residuals
  ( Yi − π̂i ) / √( π̂i(1 − π̂i) )
- The deviance residuals
  sign(Yi − π̂i) √( 2[ Yi log(Yi/π̂i) + (1 − Yi) log( (1 − Yi)/(1 − π̂i) ) ] )
Generalized Linear Model
- To fit this model, we use the glm command.
Call:
glm(formula = chd ~ ., family = binomial, data = SAheart)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.8320 -0.8250 -0.4354 0.8747 2.5503

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9207616 1.3265724 -4.463 8.07e-06 ***
row.names -0.0008844 0.0008950 -0.988 0.323042
sbp 0.0076602 0.0058574 1.308 0.190942
tobacco 0.0777962 0.0266602 2.918 0.003522 **
ldl 0.1701708 0.0597998 2.846 0.004432 **
adiposity 0.0209609 0.0294496 0.712 0.476617
famhistPresent 0.9385467 0.2287202 4.103 4.07e-05 ***
typea 0.0376529 0.0124706 3.019 0.002533 **
obesity -0.0661926 0.0443180 -1.494 0.135285
alcohol 0.0004222 0.0045053 0.094 0.925346
age 0.0441808 0.0121784 3.628 0.000286 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 596.11 on 461 degrees of freedom


Residual deviance: 471.16 on 451 degrees of freedom
AIC: 493.16

Number of Fisher Scoring iterations: 5


Generalized Linear Model
- To select among the predictors, we use the step command.
Start: AIC=493.16
chd ~ row.names + sbp + tobacco + ldl + adiposity + famhist +
typea + obesity + alcohol + age

Df Deviance AIC
- alcohol 1 471.17 491.17
- adiposity 1 471.67 491.67
- row.names 1 472.14 492.14
- sbp 1 472.88 492.88
<none> 471.16 493.16
- obesity 1 473.47 493.47
- ldl 1 479.65 499.65
- tobacco 1 480.27 500.27
- typea 1 480.75 500.75
- age 1 484.76 504.76
- famhist 1 488.29 508.29

etc...

Step: AIC=487.69
chd ~ tobacco + ldl + famhist + typea + age

Df Deviance AIC
<none> 475.69 487.69
- ldl 1 484.71 494.71
- typea 1 485.44 495.44
- tobacco 1 486.03 496.03
- famhist 1 492.09 502.09
- age 1 502.38 512.38
Generalized Linear Model

- Suppose Yi ∼ Binomial(ni, πi).
- We can fit the logistic model as before:
  logit(πi) = Xi β
- Pearson residuals:
  ri = ( Yi − ni π̂i ) / √( ni π̂i(1 − π̂i) )
- Deviance residuals:
  di = sign(Yi − Ŷi) √( 2[ Yi log(Yi/µ̂i) + (ni − Yi) log( (ni − Yi)/(ni − µ̂i) ) ] )
  where µ̂i = ni π̂i.
Generalized Linear Model

Goodness-of-Fit test
- The Pearson statistic
  χ² = Σi ri²
- and the deviance
  D = Σi di²
- both have an approximate χ²_{n−p} distribution if the model is correct (a short R check follows).
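
A short R check, assuming fit is a grouped binomial fit such as the one on the next slide:

X2 <- sum(residuals(fit, type = "pearson")^2)   # Pearson chi-square
D  <- sum(residuals(fit, type = "deviance")^2)  # equals deviance(fit)
1 - pchisq(c(X2, D), df.residual(fit))          # large p-values: no lack of fit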


Generalized Linear Model

- To fit this model, we use the glm command.


Call:
glm(formula = cbind(y, n - y) ~ x, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-0.70832 -0.29814 0.02996 0.64070 0.91132

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.73119 1.83018 -8.049 8.35e-16 ***
x 0.24785 0.03031 8.178 2.89e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 137.7204 on 7 degrees of freedom


Residual deviance: 2.6558 on 6 degrees of freedom
AIC: 28.233

Number of Fisher Scoring iterations: 4


Generalized Linear Model

To test the correctness of the model (here out is the fitted glm object from the previous slide):


> pvalue = 1-pchisq(out$dev,out$df.residual)
> print(pvalue)
[1] 0.8506433
> r=resid(out,type="deviance")
> p=out$linear.predictors
> plot(p,r,pch=19,xlab="linear predictor", ylab="deviance residuals")
> print(sum(r^2))
[1] 2.655771
> cooks.distance(out)
1 2 3 4 5
0.0004817501 0.3596628502 0.0248918197 0.1034462077 0.0242941942
6 7 8
0.0688081629 0.0014847981 0.0309767612

Note that the sum of squared residuals gives back the deviance statistic, and the p-value is large, indicating no evidence of lack of fit.
