Theory Generalized Linear Model

Download as pdf or txt
Download as pdf or txt
You are on page 1of 16

Generalized Linear Models

Author(s): J. A. Nelder and R. W. M. Wedderburn


Source: Journal of the Royal Statistical Society. Series A (General), Vol. 135, No. 3 (1972),
pp. 370-384
Published by: Wiley for the Royal Statistical Society
Stable URL: https://www.jstor.org/stable/2344614
Accessed: 01-04-2019 18:20 UTC

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms

Royal Statistical Society, Wiley are collaborating with JSTOR to digitize, preserve and
extend access to Journal of the Royal Statistical Society. Series A (General)

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
J. R. Statist. Soc. A, 370
(1972), 135, Part 3, p. 370

Generalized Linear Models

By J. A. NELDER and R. W. M. WEDDERBURN


Rothamsted Experimental Station, Harpenden, Herts

SUMMARY
The technique of iterative weighted linear regression can be used to obtain
maximum likelihood estimates of the parameters with observations distri-
buted according to some exponential family and systematic effects that can
be made linear by a suitable transformation. A generalization of the analysis
of variance is given for these models using log-likelihoods. These generalized
linear models are illustrated by examples relating to four distributions; the
Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and
gamma (variance components).
The implications of the approach in designing statistics courses are
discussed.

Keywords: ANALYSIS OF VARIANCE; CONTINGENCY TABLES; EXPONENTIAL FAMILIES;


INVERSE POLYNOMIALS; LINEAR MODELS; MAXIMUM LIKELIHOOD:
QUANTAL RESPONSE; REGRESSION; VARIANCE COMPONENTS; WEIGHTED
LEAST SQUARES

INTRODUCTION
LINEAR models customarily embody both systematic and random (error) components,
with the errors usually assumed to have normal distributions. The associated analytic
technique is least-squares theory, which in its classical form assumed just one error
component; extensions for multiple errors have been developed primarily for analysis
of designed experiments and survey data. Techniques developed for non-normal
data include probit analysis, where a binomial variate has a parameter related to an
assumed underlying tolerance distribution, and contingency tables, where the distri-
bution is multinomial and the systematic part of the model usually multiplicative.
In both these examples there is a linear aspect to the model; thus in probit analysis
the parameter p is a function of tolerance Y which is itself linear on the dose (or some
function thereof), and in a contingency table with a multiplicative model the logarithm
of the expected probability is assumed linear on classifying factors defining the table.
Thus for both, the systematic part of the model has a linear basis. In another extension
(Nelder, 1968) a certain transformation is used to produce normal errors, and a
different transformation of the expected values is used to produce linearity.
So far we have mentioned models associated with the normal, binomial and
multinomial distributions (this last can be thought of as a set of Poisson distributions
with constraints). A further class is based on the x2 or gamma distribution and
arises in the estimation of variance components from independent quadratic forms
derived from the original observations. Again the systematic component of the model
has a linear structure.
In this paper we develop a class of generalized linear models, which includes all
the above examples, and we give a unified procedure for fitting them based on

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 371

likelihood. This procedure is a generalization of the well-known one described by


Finney (1952) for maximum likelihood estimation in probit analysis. Section 1
defines the models, and Section 2 develops the fitting process and generalizes the
analysis of variance. Section 3 gives examples with four special distributions for
the random components. In Section 4 we consider the usefulness of the models for
courses of instruction in statistics.

1.1. The Random Component


Suppose our observations z come from a distribution with density function

-g(z; 6, b) = exp [x(o {z6-g(O) +h(z)}+13(#, z)],


where ()(O >0 so that for fixed b we have an exponential family. The parameter
b could stand for a certain type of nuisance parameter such as the variance a2 of a
normal distribution or the parameter p of a gamma distribution (see Section 3.4).
We denote the mean of z by ft.
We require expressions for the first and second derivatives of the log-likelihood
in terms of the mean and variance of z and the scale factor o6(4). We use the results
(see, for example, Kendall and Stuart, 1967, p. 9)

E(aL/a0) = 0 (1)
and

E?2la62) -EQ?L/a )2 (2)


where the differentiation under the sign of integration used in their derivation can
be justified by Theorem 9 of Lehmann (1959).
We have

AL/@ 0 = Z o - {zg (O)}.


Then (1) implies that
u= E(z) = g'(0);
hence

aLl a = 1o() (z- ). (3)


From (2) we obtain a(O)g"(0) = [(O)]2 va
g"(0) = a(#)var(z) = V, say, (4)

so that V is the variance of z when the scale factor is unity. Then

a2 LI-a -24q V. (5)


We note also that

V = dp]/dO. (6)

For a one-parameter exponential family c(A) = 1, we can write

-(z; 0) = exp {z6-g(6) + h(z)},


so that

and

_a2LIa-2 = V = var (z).

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
372 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,

1.2. The Linear Model for Systematic Effects


The term "linear model" usually encompasses both systematic and random
components in a statistical model, but we shall restrict the term to include only the
systematic components. We write
m

Y= E/3X2
i=1

when the xi are independent variates whose values are supposed known and Piq are
parameters. The Piq may have fixed (known) values or be unknown and require
estimation. An independent variate may be quantitative and produce a single x-variate
in the model, or qualitative and produce a set of x-variates whose values are 0 and 1,
or mixed. Consider the model

Yjj=ajf+Pujj+jvq,j (i= 1,...,n,j=, 1...,p),


where the data are indexed by factors whose levels are denoted by i and j. The term
(x includes n parameters associated with a qualitative variate represented by n dummy
x-variate components taking values 1 for one level and 0 for the rest; luij represents
a quantitative variate, namely u with single parameter P,. and yj vi shows p parameters
yj associated with a mixed independent variate whose p components take the values
of vij for one level ofj and zero for the rest. A notation suitable for computer use has
been developed by C. E. Rogers and G. N. Wilkinson and is to be published in
Applied Statistics.

1.3. The Generalized Linear Model


We now combine the systematic and random components in our model to produce
the generalized linear model. This is characterized by
(i) A dependent variable z whose distribution with parameter 0 is one of the
class in Section 1.1.
(ii) A set of independent variables xl, ..., x. and predicted Y = ,/3 xi as in
Section 1.2.
(iii) A linking function 0 =f(Y) connecting the parameter 0 of the distribution
of z with the Y's of the linear model.
When z is normally distributed with mean 0 and variance u2 and when 0 = Y, we
have ordinary linear models with Normal errors. Other examples of these models
will be described in Section 3 under the various distributions of the exponential type.
We now consider the solution of the maximum likelihood equations for the parameters
of the generalized linear models and show its equivalence to a procedure of iterative
weighted least squares.

2. FITTING THE MODELS


2.1. The Maximum Likelihood Equations
The solution of the maximum likelihood equations is equivalent to an iterative
weighted least-squares procedure with a weight function
w = (d14dY)2/V

and a modified dependendent variable (the working probit of probit analysis)

y = y+ (Z-)/(d/dY),

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 373

where ,u, Y and V are based on current estimates. This generalizes


Nelder (1968).
Proof. Writing L for the log-likelihood from one observation we have, from (3)
and (6),

aL f (_ d 8 dlbyx

VdYd

=o0Z - V {t dZX. (7)


and
02L 02L
O pj- ay2Xi Xi
where

a2Z b2L (dO\2 aL d2 O


Oya= s dY_7+ To dy2

=1 _f+ lV(-(dY + (z__) dY2


(-d-) (y)
i s?-d-yU /2 V+ (z o) d2 02 (8)
The expected second derivative with negative sign is thus

(0) {(d )2/ V}xi xj from (8).

Writing w for the weight function (dp/dy)2/ V, (7) gives

OL/Ij3, = o(k) wx,(z- )/(dl/d Y).


Thus the Newton-Raphson process with expected second derivatives (equivalent to
Fisher's scoring technique) for a sample of n gives

ASP = C, (9)
where A is a m x m matrix with
n

Ai= Wk Xik Xjk


k=1

and C is a m x 1 vector with

Ci = IWkXik(ZIJu)/(dp4dY).
Finally we have

(AAi = 1A4jfPJ = Wk Xik Yk


so that (9) may be written in the form

AP* =r
where rj= IWkXikYk, Yk= Yk+(Zk-,uk)/(dtkl/dYk) and * = +Sp

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
374 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,

Starting method
In practice we can obtain a good starting procedure for iteration as follows: take
as a first approximation j = z and calculate Y from it; then calculate w as before
and set y = Y. Then obtain the first approximation to the 3's by regression. The
method may need slight modification to deal with extreme values of z. For instance,
with the binomial distribution it will probably be adequate to replace instances of
z =O0 or z = n with z = 2 and z = n - -, where, e.g. with the probit and logit trans-
formations, tu = 0 or p = n would lead to infinite values for Y.

2.2. Sufficient Statistics


An important special case occurs when 0, the parameter of the distribution of
the random element, and Y the predicted value of the linear model, coincide. Then

L -zY-g(Y)+h(z),

and, using (3),

aL/ ai = CO) (z -) Xi.


The maximum likelihood equations are then of the form k )(Z-) Xik = 0, the
summation being over the observations. Hence we have
E Zk Xik = E&k Xik. (10)

For a qualitative independent variate, this implies that the fitted marginal totals with
respect to that variate will be equal to the observed ones.
From the expression for L we see that the quantities Ek Zk Xik are a set of sufficient
statistics. Also, in (8) d2 O/dy2 = 0 and so

_ 62p_ - Ex=(q) (dv / V) xi (11)

When 0 is also the mean of the distribution, i.e. t = 0 = Y, we have the usual
linear model with normal errors, for g'(0) = 0 gives

g(O) = 1 02 +const

which uniquely determines the distribution as Normal with variance 1/x(0b) (using
Theorem 1 of Patil and Shorrock, 1965). The sub-class of models for which there
are sufficient statistics was noted by Cox (1968), and Dempster (1971) has extended
it to include many dependent variates.

2.3. The Analysis of Deviance


A linear model is said to be ordered if the fitting of the P's is to be done in the same
sequence as their declaration in the model. Ordering (or partial ordering) may be
implied by the structure of the model; for instance it makes no sense to fit an inter-
action term (ab)ij before fitting the corresponding main effects a, and b3. It may also
be implied by the objectives of the fitting, i.e. if a trend must be removed first before
the fitting of further effects. More commonly, however, the ordering is to some
extent arbitrary, and this gives rise to difficult problems of inference which we shall
not try to tackle here. For ease of exposition of the basic ideas we shall assume that

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 375

the model under consideration is ordered, and will be fitted sequentially a term at a
time. The objectives of the fitting will be to assess how many terms are required
for an adequate description of the data, and to derive the associated estimates of the
parameters and their information matrix.
Two extreme models are conceivable for any set of data, the minimal model
which contains the smallest set of terms that the problem allows, and the complete
model in which all the Ys are different and match the data completely so that {i = z.
An extreme case of the minimal model is the null model, which is equivalent to
fitting the grand mean only and effectively consigns all the variation in the data to the
random component of the model, while the complete model fits exactly and so
consigns all the variation in the data to the systematic part. The model-fitting process
with an ordered model thus consists of proceeding a suitable distance from the minimal
model towards the complete model. At each stage we trade increasing goodness-of-fit
to the current set of data against increasing complexity of the model. The fitting of
the parameters at each stage is done by maximizing the likelihood for the current
model and the matching of the model to the data will be measured quantitatively by
the quantity -2Lmax which we propose to call the deviance. For the four special
distributions the deviance takes the form:

Normal I(z - A)2/o2,


Poisson 2{Ez In (z/j2) - (z - u},

Binomial 2[Ez ln (z/4) + (n-z) ln {(n-z)/(n-A)}],


Gamma 2p{-E ln (z/) + E(z-MI).
Note that the deviance is measured from that of the complete model, so that terms
involving constants, the data alone, or the scale factor alone are omitted. The second
term in the expressions for the Poisson and gamma distribution is commonly
identically zero (see Appendix for conditions and proof).
Associated with each model is a quantity r termed the degrees of freedom which
is given by the rank of the Xmatrix, or equivalently the number of linearly independent
parameters to be estimated. For a sample of n independent observations, the deviance
for the model has residual degrees of freedom (n-r). The degrees of freedom,
multiplied where necessary by a scale factor, form a scale for a set of sequential models
with which deviances can be compared; when (residual degrees of freedom x scale
factor) is approximately equal to the deviance of the current model then it is unlikely
that further fitting of systematic components is worth while. The scale factor may be
known (e.g. unity for the Poisson distribution) or unknown (e.g. for the normal
distribution with unknown variance) If unknown it may be estimable directly, e.g.
by replicate observations, or indirectly from the deviance after an adequate model
has been fitted. The adequacy of the model may be determined by plotting successive
deviances against their degrees of freedom, and accepting as a measure of the scale
factor the linear portion through the origin determined by those points with fewest
degrees of freedom.

2.4. The Generalization of Analysis of Variance


The first differences of the deviances for the normal distribution are (apart from
a scale factor) the sums of squares in the analysis of variance for a sequential fit as
shown for a three-term model in Table 1.

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
376 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,

TABLE 1

Deviances and their differences

Model term Deviance Difference Component

Minimal dm dm-dA A

B dA dA- dAB B eliminating A


C dAB dAB-dABO C eliminating A and B
Complete do dABC-do Residual

The generalized analysis of variance for a sequential model is now defined to have
components given by the first differences of the deviance, with degrees of freedom
defined as above. These components have distributions proportional to X2, exactly
for normal errors, approximately for others. Such a generalization of the analysis of
variance was suggested by Good (1967).

3. SPECIAL DISTRLJUTIONS
3.1. The Normal Distribution
Here, we have

L = La -2(Zlk
=~~2 _ tj2 _-1Z2) -i ln 62
and in the notation of Section 1.2

p =G, V=1.
Inverse polynomials provide an example where we assume that the observations
u are normal on the log scale and the systematic effects additive on the inverse scale.
Then

z = ln u and Y = ell.

Nelder (1966) gives examples of inverse polynomials calculated using the first
approximation of the method in this paper.
More generally, as shown in Nelder (1968), we can consider models in which
there is a linearizing transformation f and a normalizing transformation g. This
means that if the observations are denoted by u, then g(u) is normally distributed
with mean ,u and constant variance u2 and f{g-1(Ik)} = ,S, xi.
Then we have V = 1 and Y =f{g-'(Ik)} so that
W = [(d/dY)g{f-1(y)}]2
and

y = Y+{g(u)-p}/[(d/dY)g{f-1(Y)}].
Example: Fisher's tuberculin-test data
Fisher (1949) published 16 measurements of tuberculin response which were
classified by three four-level factors in a Latin-square type of arrangement as in
Table 2.

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 377

TABLE 2

Cow class
Sites Treatments
I III II IV

3+6 454 249 349 249 A B C D


4+5 408 322 312 347 B A D C
1+8 523 268 411 285 C D A B
2+7 364 283 266 290 D C B A

Fisher gave reasons for believing that the variances of the observations were
proportional to their expectation, and that the systematic part of the model was
linear on the log-scale. The treatments were:

B Standard single
A Standard double
D Weybridge half
C Weybridge single

and these were treated as a 2 x 2 factorial arrangement, no interaction being fitted.


If the data had been Poisson observations the maximum likelihood estimates
would have had the property that the marginal totals of the fitted values (on the
untransformed scale) would be equal to the marginal totals of the observations.
Although the observations were not, in fact, counts but measurements in millimetres,
Fisher decided to estimate the effects as if they were Poisson observations. He
produced approximations to the effects by a method which made use of features
of the particular Latin square, and then verified that these gave fitted values with
margins approximately equal to the observed ones.
Another approach would be to treat the square roots of the observations as
normally distributed with variance a2/4. Then we have z = lu where u is an obser-
vation and
Y= 21n,u.

Fisher gave estimates of effects on the log scale relative to B. We produced


estimates by the square-root/logarithmic method just described, and the two sets of
estimates are given in Table 3.

TABLE 3

Fisher's result Our result

B 0.0000 0.0000
A 0.2089 0.2092
D 0.0019 0.0023
C 0.2108 0.2115

The method of this paper can also be used to analyse the data as if they were
Poisson observations. The estimates of effects obtained by this method agree with
our other estimates to about four decimal places.

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
378 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,

3.2. The Poisson Distribution


Here L = zln,u-,u so we have 0 = lnp. and V== pt.
When Y= In,u there are sufficient statistics and a unique maximum likelihood
estimate of /, provided it is finite. It will always be finite if there are no zero observation
If Y-=1p (O < A < 1), L -oo as IPIoo and hence L must have a maximum
for finite p. Also

a2L a2 L

:api = E ay2a
It is easily verified that a2L/ y2 < 0 and hence L is negative definite. It follows that
P is uniquely determined. When Y = ,u the same result holds provided that the
x's are linearly independent when units with z = 0 are excluded.
The main application of generalized linear models with Poisson errors is to
contingency tables. These arise from data on counts classified by two or more factors,
and the literature on them is enormous (see, for example, Simpson, 1951; Ireland and
Kullback, 1968; Chapter 8 of Kullback 1968; Ku et al., 1971).
Probabilistic models for contingency tables are built on assumptions of a multi-
nomial distribution or a set of multinomial distributions. As Birch (1963) has shown,
the estimation of a set of independent multinomial distributions is equivalent to the
estimation of a set of independent Poisson distributions, and in what follows we shall
regard a contingency table as a set of independent Poisson distributions.
The systematic part of models of contingency tables is usually multiplicative, and
thus gives sufficient statistics with Poisson errors. The model terms usually correspond
to qualitative x's, the equivalent of constant fitting, but quantitative terms occur
naturally when the classifying factors have an underlying quantitative basis.

Example: A contingency table


Maxwell (1961, pp. 70-72) discusses the analysis of a 5 x 4 contingency table
giving the number of boys with four different ratings for disturbed dreams in five
different age groups. The data are given in Table 4. The higher the rating the
more the boy suffers from disturbed dreams.

TABLE 4

Age Rating
in
years 4 3 2 1 Total

5- 7 7 3 4 7 21
8- 9 13 11 15 10 49
10-11 7 11 9 23 50
12-13 10 12 9 28 59
14-15 3 4 5 32 44

Total 40 41 42 100 223

Here we can fit main effects and a linear x linear interaction using an x-variate
of the form uv where u -2, - 1, 0, 1 or 2 according to the age group and v is the
rating for disturbed dreams.

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 379

The estimated linear x linear interaction is - 0-205.


We can form an analysis of deviance thus (regarding main effects as fixed):

Degrees
Model term(s) Deviance Difference of
freedom

Minimal (i.e. main effects only) 32f46 18-38 1


Linear x linear 14-08 14f08 11
Complete 0

Treating 18f38 and 14 0


we find that 18X38 is large while 14*08 is close to expectation. We conclude that the
data are adequately described by a negative linear x linear interaction (indicating
that the dream rating tends to decrease with age).
Maxwell, using the method of Yates (1948), obtained a decomposition of a Pearson
X2 as follows:

Degrees
Source of variation Of X2
freedom

Due to linear regression 1 17X94


Due to departure from linear regression 11 13X73

Total 12 31X67

Maxwell's values of x2 are clearly quite close to ours and his conclusions are
essentially the same.

3.3. The Binomial Distribution


We re-write the usual form

L = rlnp+(n-r)lnq

as

L = z ln (k/n) + (n-z) ln {(-(/n)}

= z ln {4/(n-)} + n ln (n-) + terms in n,

i.e. we put z for r, and pu = E(z) = np. Thus

6 =ln{/n)

and

V = t(n-p,)/n.
Sufficient statistics are provided by the logit transformation giving

= ney/(1 + ey).

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
380 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,

Probit analysis
We put

p = n(D(Y),
where (D is the cumulative normal distribution function. There are no sufficient
statistics here; the analysis via iterative weighted least-squares is well known (Finney,
1952).

Fitting constants on a logit scale


This technique was introduced by Dyke and Patterson (1952) and is applied to
multiway tables of proportions. It is a special case of the logistic transformation when
all the x's are qualitative, and yields as sufficient statistics the total responding
(Zz) for each relevant margin.
The logit analogue of probit analysis is, of course, formally identical, with quanti-
tative rather than qualitative x's. Again arbitrary mixtures of x types do not introduce
anything new. Models based on the logistic transform have been extensively developed
by Cox (1970).
From the results of Birch (1963) mentioned above, it follows that models with
independent binomial data are equivalent to models with independent Poisson data.
Bishop (1969) showed that a binomial model that is additive on the logit scale can be
treated as a Poisson model additive on the log scale.

3.4. The Gamma Distribution


For the gamma distribution

L = -p(z/,u+ln,u-lnz)-lnz.

We have 0 =- l/p and V = dpud6 = ,U2. There are sufficient statistics when Y
Thus the inverse transformation to linearity is related to the gamma distribution,
as the logarithmic to the Poisson, or the identity transformation to the normal. The
corresponding models seem not to have been explored.
One application of linear models involving the gamma distribution is the esti-
mation of variance components. Here we have sums of squares which are proportional
to x2 variates and the expectation of each is a linear combination of several variances
which are to be estimated. It is better to write

L = - (zv/2p) - (v/2) In t +{(v/2) + 1} ln z,

where v is the degrees of freedom of z; then putting 0=- v/2p we have


V= dudO = 2u2/v and y = z, Y= =, and w = v/2u2.
The deviance takes the form

Iv Iln (z/,u) + {(z- A)I}]

and the result of the Appendix gives Z{v(z-,u)/1A} = 0; hence the deviance simplifies
to

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 381

Example
In a balanced incomplete block design in which b > v, we can produce an analysis
of variance as follows (Yates, 1940 or paper VIII of Yates, 1970).

Degrees of
freedom

Blocks (eliminating varieties):


Varietal component v-1
Remainder b-v

Total b-1

Varieties (ignoring blocks) v-1


Intra-block error rv-v-b +1

Total rv-1

The expectations of some of the mean squares are as follows:

Expected
mean square

Blocks (eliminating varieties):


Varietal component a2 + Ekcb
Remainder a2+ ka2b
Intra-block error a2

Here u2 iS the intra-block variance and U2 iS the in


On p. 322 of Yates (1940) (or p. 207 of Yates, 1970) an example is given with
v = 9, r = 8, k = 4, b = 18, A = 3 and E= - . The three mean squares mentioned
above, their degrees of freedom and their expectations are as follows:

Degrees
Mean square of Expectation
freedom

Blocks (eliminating varieties):


Varietal component 4-6329 8 a + 87Ob
Remainder 15-3557 9 a2 +4ab
Intra-block error 2-5968 46 a2

Each iteration then takes the form of a weighted fitting of a straight line with
x-values 2, 4 and 0.
The estimates obtained were:
o= 2-5870, b- = 20314.
Yates equated the intra-block error mean square and the blocks (eliminating
varieties) mean square to their expectations. This gives

a = 2 5968, N2 = 2-0813.

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
382 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,

This was one example where the first approximation was not very good. Our
first approximation was

2 = 2-5874, 'bA = 09313


Subsequent iterations gave the following values:

Iteration No. a2 orb

1 2-5695 2.0737
2 2-5874 2.0302
3 2-5870 2.0315
4 2-5870 2.0314

4. THE MODELS IN THE TEACHING OF STATISTICS


We believe that the generalized linear models here developed could form a useful
basis for courses in statistics. They give a consistent way of linking together the
systematic elements in a model with the random elements. Too often the student
meets complex systematic linear models only in connection with normal errors, and if
he encounters probit analysis this may seem to have little to do with the linear
regression theory he has learnt. By isolating the systematic linear component
the student can be introduced to multiway tables and their margins, additivity,
weighting, quantitative and qualitative independent variates, and transformations,
quite independently of the added complications of errors and associated probability
distributions. The essential unity of the linear model, encompassing qualitative,
quantitative and mixed independent variates can be brought out and the introduction
of qualitative variates brings in naturally the ideas of singularity of matrices and of
constraints.
The complementary set of probability distributions would be introduced in the
usual way, including the use of transformations of data to attain desirable properties
of the errors. The difficult problem of discussing how far transformations can produce
both linearity and normality simultaneously now disappears because the models
allow two different transformations to be used, one to induce the linearity of the
systematic component and one to induce the desired distribution in the error
component. (Note that this distribution need not necessarily be equal-variance
normal.)
The systematic use of log-likelihood-ratios (or, equivalently, differences in
deviance) extends the ideas of analysis of variance to other distributions and produces
an additive decomposition for the sequential fit of the model. To appreciate the
simplicity that this can produce it is only necessary to look at the algebraic complexities
arising from the attempts to analyse contingency tables by extensions of the Pearson
x2 approach. Irwin (1949), Lancaster (1949, 1950) and Kimball (1954) have given
modifications of the usual formulae for the Pearson x2 in contingency tables, to
produce an additive decomposition of x2 when the table is partitioned. These formulae
are much more complicated than the usual one for Pearson x2. However, the equiva-
lent likelihood statistic, namely

2fn 1lnnnj- i~ ln n ni-YnlnnjI +n lnn }

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 383

does not need any modification to make it additive and is easy to compute as tables
of n Inn are available (Kullback, 1968).
Finally, the fact that a single algorithm can be used to fit any of the models
implies that quite a small set of routines can provide the basic computing facility to
allow students to fit models to a wide range of data. They can get experience of
programming by writing special routines to deal with special forms of output required
by particular models, such as the LD5O of probit analysis. In this way the distinction
can be made between the model-fitting part of an analysis and the subsequent derived
quantities (with estimates of their uncertainties) which particular problems require.
We hope that the approach developed in this paper will prove to be a useful way of
unifying what are often presented as unrelated statistical procedures, and that this unifi-
cation will simplify the teaching of the subject to both specialists and non-specialists.

REFERENCES
BIRCH, M. W. (1963). Maximum likelihood in three-way contingency tables. J. R. Statist. Soc.
B, 25, 220-233.
BISHOP, Y. M. M. (1969). Full contingency tables, logits, and split contingency tables. Biometrics,
25, 383-399.
Cox, D. R. (1968). Notes on some aspects of regression analysis. J. R. Statist. Soc. A, 131, 265-279.
- (1970). Analysis of Binary Data. London: Methuen.
DEMPSTER, A. P. (1971). An overview of multivariate data analysis. J. Multivar. Anal., 1, 316-346.
DYKE, G. V. and PATTERSON, H. D. (1952). Analysis of factorial arrangements when the data
are proportions. Biometrics, 8, 1-12.
FINNEY, D. J. (1952). Probit Analysis. 2nd ed. Cambridge: University Press.
FISHER, R. A. (1949). A biological assay of tuberculosis. Biometrics, 5, 300-316.
GOOD, I. J. (1967). Analysis of log-likelihood ratios, "ANOA". (A contribution to the discussion
of a paper on least squares by F. J. Anscombe.) J. R. Statist. Soc. B, 29, 39-42.
IRELAND, C. T. and KULLBACK, S. (1968). Contingency tables with given marginals. Biometr-ika,
55, 179-188.
IRWIN, J. 0. (1949). A note on the subdivision of X2 into components. Biometrika, 36, 130-134.
KENDALL, M. G. and STUART, A. (1967). The Advanced Theory of Statistics, 2nd ed, Vol. II.
London: Griffin.
KIMBALL, A. W. (1954). Short-cut formulas for the exact partition of x2 in contingency tables.
Biometrics, 10, 452-458.
Ku, H. H., VARNER, R. N. and KULLBACK, S. (1971). On the analysis of multidimensional
contingency tables. J. Amer. Statist. Ass., 66, 55-64.
KULLBACK, S. (1968). Information Theory and Statistics. 2nd ed. New York: Dover.
LANCASTER, H. 0. (1949). The derivation and partition of x2 in certain discrete distributions.
Biometrika, 36, 117-128.
(1950). The exact partition of x2 and its application to the problem of the pooling of small
expectations. Biometrika, 37, 267-270.
LEHMANN, E. L. (1959). Testing Statistical Hypotheses. London: Wiley.
MAXWELL, A. E. (1961). Analysing Qualitative Data. London: Methuen.
NELDER, J. A. (1966). Inverse polynomials, a useful group of multi-factor response functions.
Biometrics, 22, 128-141.
(1968). Weighted regression, quantal response data and inverse polynomials. Biometrics,
24, 979-985.
PATIL, G. P. and SHORROCK, R. (1965). On certain properties of the exponential-type families.
J. R. Statist. Soc. B, 27, 94-99.
SIMPSON, C. H. (1951). The interpretation of interaction in contingency tables. J. R. Statist. Soc.
B, 13, 238-241.
YATES, F. (1940). The recovery of inter-block information in balanced incomplete block designs.
Ann. Eugen., 10, 317-325.
(1948). The analysis of contingency tables with groupings based on quantitative characters.
Biometrika, 35, 176-181.
(1970). Experimental Design: Selected Papers. London: Griffin.

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
384 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,

APPENDIX
We can sometimes simplify the expression for deviance. In what follows, sum-
mation is to be taken to be over the observations unless otherwise stated.
Theorem. If either (a) Y = j1 or (b) Y = log p- and a constant term is being
fitted in the model (say Y = E/i xi where xo takes the constant value 1) then
~{(z- 2)i/J2} = 0, where [i and Vj are the maximum likelihood estimates of , and V.
Proof. For case (b)

AL =0_
~Xo V
In case (a), we have

AL z-,udp, -Z __
api V dYXi V 0 yi
Hence

(z V2) j Xi =

Then

- ~ ~ ~ ~ ~ ~ ~ ~ -

(Z- H) Z Y (Z- &)_ A

= zRiXiVY = i V Ij VY Xi) =
This completes the proof. When the conditions of this theorem are satisfied, we have,
for the Poisson distribution

z-_2=O
and for the gamma distribution

- A 0.
IJL

Then the deviances for these distributions simplify as follows:

Poisson 2 i z ln (zlj),
Gamma -2 ln (z/p).

This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms

You might also like