Theory Generalized Linear Model
Theory Generalized Linear Model
Theory Generalized Linear Model
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms
Royal Statistical Society, Wiley are collaborating with JSTOR to digitize, preserve and
extend access to Journal of the Royal Statistical Society. Series A (General)
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
J. R. Statist. Soc. A, 370
(1972), 135, Part 3, p. 370
SUMMARY
The technique of iterative weighted linear regression can be used to obtain
maximum likelihood estimates of the parameters with observations distri-
buted according to some exponential family and systematic effects that can
be made linear by a suitable transformation. A generalization of the analysis
of variance is given for these models using log-likelihoods. These generalized
linear models are illustrated by examples relating to four distributions; the
Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and
gamma (variance components).
The implications of the approach in designing statistics courses are
discussed.
INTRODUCTION
LINEAR models customarily embody both systematic and random (error) components,
with the errors usually assumed to have normal distributions. The associated analytic
technique is least-squares theory, which in its classical form assumed just one error
component; extensions for multiple errors have been developed primarily for analysis
of designed experiments and survey data. Techniques developed for non-normal
data include probit analysis, where a binomial variate has a parameter related to an
assumed underlying tolerance distribution, and contingency tables, where the distri-
bution is multinomial and the systematic part of the model usually multiplicative.
In both these examples there is a linear aspect to the model; thus in probit analysis
the parameter p is a function of tolerance Y which is itself linear on the dose (or some
function thereof), and in a contingency table with a multiplicative model the logarithm
of the expected probability is assumed linear on classifying factors defining the table.
Thus for both, the systematic part of the model has a linear basis. In another extension
(Nelder, 1968) a certain transformation is used to produce normal errors, and a
different transformation of the expected values is used to produce linearity.
So far we have mentioned models associated with the normal, binomial and
multinomial distributions (this last can be thought of as a set of Poisson distributions
with constraints). A further class is based on the x2 or gamma distribution and
arises in the estimation of variance components from independent quadratic forms
derived from the original observations. Again the systematic component of the model
has a linear structure.
In this paper we develop a class of generalized linear models, which includes all
the above examples, and we give a unified procedure for fitting them based on
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 371
E(aL/a0) = 0 (1)
and
V = dp]/dO. (6)
and
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
372 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,
Y= E/3X2
i=1
when the xi are independent variates whose values are supposed known and Piq are
parameters. The Piq may have fixed (known) values or be unknown and require
estimation. An independent variate may be quantitative and produce a single x-variate
in the model, or qualitative and produce a set of x-variates whose values are 0 and 1,
or mixed. Consider the model
y = y+ (Z-)/(d/dY),
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 373
aL f (_ d 8 dlbyx
VdYd
ASP = C, (9)
where A is a m x m matrix with
n
Ci = IWkXik(ZIJu)/(dp4dY).
Finally we have
AP* =r
where rj= IWkXikYk, Yk= Yk+(Zk-,uk)/(dtkl/dYk) and * = +Sp
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
374 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,
Starting method
In practice we can obtain a good starting procedure for iteration as follows: take
as a first approximation j = z and calculate Y from it; then calculate w as before
and set y = Y. Then obtain the first approximation to the 3's by regression. The
method may need slight modification to deal with extreme values of z. For instance,
with the binomial distribution it will probably be adequate to replace instances of
z =O0 or z = n with z = 2 and z = n - -, where, e.g. with the probit and logit trans-
formations, tu = 0 or p = n would lead to infinite values for Y.
L -zY-g(Y)+h(z),
For a qualitative independent variate, this implies that the fitted marginal totals with
respect to that variate will be equal to the observed ones.
From the expression for L we see that the quantities Ek Zk Xik are a set of sufficient
statistics. Also, in (8) d2 O/dy2 = 0 and so
When 0 is also the mean of the distribution, i.e. t = 0 = Y, we have the usual
linear model with normal errors, for g'(0) = 0 gives
g(O) = 1 02 +const
which uniquely determines the distribution as Normal with variance 1/x(0b) (using
Theorem 1 of Patil and Shorrock, 1965). The sub-class of models for which there
are sufficient statistics was noted by Cox (1968), and Dempster (1971) has extended
it to include many dependent variates.
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 375
the model under consideration is ordered, and will be fitted sequentially a term at a
time. The objectives of the fitting will be to assess how many terms are required
for an adequate description of the data, and to derive the associated estimates of the
parameters and their information matrix.
Two extreme models are conceivable for any set of data, the minimal model
which contains the smallest set of terms that the problem allows, and the complete
model in which all the Ys are different and match the data completely so that {i = z.
An extreme case of the minimal model is the null model, which is equivalent to
fitting the grand mean only and effectively consigns all the variation in the data to the
random component of the model, while the complete model fits exactly and so
consigns all the variation in the data to the systematic part. The model-fitting process
with an ordered model thus consists of proceeding a suitable distance from the minimal
model towards the complete model. At each stage we trade increasing goodness-of-fit
to the current set of data against increasing complexity of the model. The fitting of
the parameters at each stage is done by maximizing the likelihood for the current
model and the matching of the model to the data will be measured quantitatively by
the quantity -2Lmax which we propose to call the deviance. For the four special
distributions the deviance takes the form:
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
376 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,
TABLE 1
Minimal dm dm-dA A
The generalized analysis of variance for a sequential model is now defined to have
components given by the first differences of the deviance, with degrees of freedom
defined as above. These components have distributions proportional to X2, exactly
for normal errors, approximately for others. Such a generalization of the analysis of
variance was suggested by Good (1967).
3. SPECIAL DISTRLJUTIONS
3.1. The Normal Distribution
Here, we have
L = La -2(Zlk
=~~2 _ tj2 _-1Z2) -i ln 62
and in the notation of Section 1.2
p =G, V=1.
Inverse polynomials provide an example where we assume that the observations
u are normal on the log scale and the systematic effects additive on the inverse scale.
Then
z = ln u and Y = ell.
Nelder (1966) gives examples of inverse polynomials calculated using the first
approximation of the method in this paper.
More generally, as shown in Nelder (1968), we can consider models in which
there is a linearizing transformation f and a normalizing transformation g. This
means that if the observations are denoted by u, then g(u) is normally distributed
with mean ,u and constant variance u2 and f{g-1(Ik)} = ,S, xi.
Then we have V = 1 and Y =f{g-'(Ik)} so that
W = [(d/dY)g{f-1(y)}]2
and
y = Y+{g(u)-p}/[(d/dY)g{f-1(Y)}].
Example: Fisher's tuberculin-test data
Fisher (1949) published 16 measurements of tuberculin response which were
classified by three four-level factors in a Latin-square type of arrangement as in
Table 2.
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 377
TABLE 2
Cow class
Sites Treatments
I III II IV
Fisher gave reasons for believing that the variances of the observations were
proportional to their expectation, and that the systematic part of the model was
linear on the log-scale. The treatments were:
B Standard single
A Standard double
D Weybridge half
C Weybridge single
TABLE 3
B 0.0000 0.0000
A 0.2089 0.2092
D 0.0019 0.0023
C 0.2108 0.2115
The method of this paper can also be used to analyse the data as if they were
Poisson observations. The estimates of effects obtained by this method agree with
our other estimates to about four decimal places.
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
378 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,
a2L a2 L
:api = E ay2a
It is easily verified that a2L/ y2 < 0 and hence L is negative definite. It follows that
P is uniquely determined. When Y = ,u the same result holds provided that the
x's are linearly independent when units with z = 0 are excluded.
The main application of generalized linear models with Poisson errors is to
contingency tables. These arise from data on counts classified by two or more factors,
and the literature on them is enormous (see, for example, Simpson, 1951; Ireland and
Kullback, 1968; Chapter 8 of Kullback 1968; Ku et al., 1971).
Probabilistic models for contingency tables are built on assumptions of a multi-
nomial distribution or a set of multinomial distributions. As Birch (1963) has shown,
the estimation of a set of independent multinomial distributions is equivalent to the
estimation of a set of independent Poisson distributions, and in what follows we shall
regard a contingency table as a set of independent Poisson distributions.
The systematic part of models of contingency tables is usually multiplicative, and
thus gives sufficient statistics with Poisson errors. The model terms usually correspond
to qualitative x's, the equivalent of constant fitting, but quantitative terms occur
naturally when the classifying factors have an underlying quantitative basis.
TABLE 4
Age Rating
in
years 4 3 2 1 Total
5- 7 7 3 4 7 21
8- 9 13 11 15 10 49
10-11 7 11 9 23 50
12-13 10 12 9 28 59
14-15 3 4 5 32 44
Here we can fit main effects and a linear x linear interaction using an x-variate
of the form uv where u -2, - 1, 0, 1 or 2 according to the age group and v is the
rating for disturbed dreams.
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 379
Degrees
Model term(s) Deviance Difference of
freedom
Degrees
Source of variation Of X2
freedom
Total 12 31X67
Maxwell's values of x2 are clearly quite close to ours and his conclusions are
essentially the same.
L = rlnp+(n-r)lnq
as
6 =ln{/n)
and
V = t(n-p,)/n.
Sufficient statistics are provided by the logit transformation giving
= ney/(1 + ey).
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
380 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,
Probit analysis
We put
p = n(D(Y),
where (D is the cumulative normal distribution function. There are no sufficient
statistics here; the analysis via iterative weighted least-squares is well known (Finney,
1952).
L = -p(z/,u+ln,u-lnz)-lnz.
We have 0 =- l/p and V = dpud6 = ,U2. There are sufficient statistics when Y
Thus the inverse transformation to linearity is related to the gamma distribution,
as the logarithmic to the Poisson, or the identity transformation to the normal. The
corresponding models seem not to have been explored.
One application of linear models involving the gamma distribution is the esti-
mation of variance components. Here we have sums of squares which are proportional
to x2 variates and the expectation of each is a linear combination of several variances
which are to be estimated. It is better to write
and the result of the Appendix gives Z{v(z-,u)/1A} = 0; hence the deviance simplifies
to
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 381
Example
In a balanced incomplete block design in which b > v, we can produce an analysis
of variance as follows (Yates, 1940 or paper VIII of Yates, 1970).
Degrees of
freedom
Total b-1
Total rv-1
Expected
mean square
Degrees
Mean square of Expectation
freedom
Each iteration then takes the form of a weighted fitting of a straight line with
x-values 2, 4 and 0.
The estimates obtained were:
o= 2-5870, b- = 20314.
Yates equated the intra-block error mean square and the blocks (eliminating
varieties) mean square to their expectations. This gives
a = 2 5968, N2 = 2-0813.
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
382 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,
This was one example where the first approximation was not very good. Our
first approximation was
1 2-5695 2.0737
2 2-5874 2.0302
3 2-5870 2.0315
4 2-5870 2.0314
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
1972] NELDER AND WEDDERBURN - Generalized Linear Models 383
does not need any modification to make it additive and is easy to compute as tables
of n Inn are available (Kullback, 1968).
Finally, the fact that a single algorithm can be used to fit any of the models
implies that quite a small set of routines can provide the basic computing facility to
allow students to fit models to a wide range of data. They can get experience of
programming by writing special routines to deal with special forms of output required
by particular models, such as the LD5O of probit analysis. In this way the distinction
can be made between the model-fitting part of an analysis and the subsequent derived
quantities (with estimates of their uncertainties) which particular problems require.
We hope that the approach developed in this paper will prove to be a useful way of
unifying what are often presented as unrelated statistical procedures, and that this unifi-
cation will simplify the teaching of the subject to both specialists and non-specialists.
REFERENCES
BIRCH, M. W. (1963). Maximum likelihood in three-way contingency tables. J. R. Statist. Soc.
B, 25, 220-233.
BISHOP, Y. M. M. (1969). Full contingency tables, logits, and split contingency tables. Biometrics,
25, 383-399.
Cox, D. R. (1968). Notes on some aspects of regression analysis. J. R. Statist. Soc. A, 131, 265-279.
- (1970). Analysis of Binary Data. London: Methuen.
DEMPSTER, A. P. (1971). An overview of multivariate data analysis. J. Multivar. Anal., 1, 316-346.
DYKE, G. V. and PATTERSON, H. D. (1952). Analysis of factorial arrangements when the data
are proportions. Biometrics, 8, 1-12.
FINNEY, D. J. (1952). Probit Analysis. 2nd ed. Cambridge: University Press.
FISHER, R. A. (1949). A biological assay of tuberculosis. Biometrics, 5, 300-316.
GOOD, I. J. (1967). Analysis of log-likelihood ratios, "ANOA". (A contribution to the discussion
of a paper on least squares by F. J. Anscombe.) J. R. Statist. Soc. B, 29, 39-42.
IRELAND, C. T. and KULLBACK, S. (1968). Contingency tables with given marginals. Biometr-ika,
55, 179-188.
IRWIN, J. 0. (1949). A note on the subdivision of X2 into components. Biometrika, 36, 130-134.
KENDALL, M. G. and STUART, A. (1967). The Advanced Theory of Statistics, 2nd ed, Vol. II.
London: Griffin.
KIMBALL, A. W. (1954). Short-cut formulas for the exact partition of x2 in contingency tables.
Biometrics, 10, 452-458.
Ku, H. H., VARNER, R. N. and KULLBACK, S. (1971). On the analysis of multidimensional
contingency tables. J. Amer. Statist. Ass., 66, 55-64.
KULLBACK, S. (1968). Information Theory and Statistics. 2nd ed. New York: Dover.
LANCASTER, H. 0. (1949). The derivation and partition of x2 in certain discrete distributions.
Biometrika, 36, 117-128.
(1950). The exact partition of x2 and its application to the problem of the pooling of small
expectations. Biometrika, 37, 267-270.
LEHMANN, E. L. (1959). Testing Statistical Hypotheses. London: Wiley.
MAXWELL, A. E. (1961). Analysing Qualitative Data. London: Methuen.
NELDER, J. A. (1966). Inverse polynomials, a useful group of multi-factor response functions.
Biometrics, 22, 128-141.
(1968). Weighted regression, quantal response data and inverse polynomials. Biometrics,
24, 979-985.
PATIL, G. P. and SHORROCK, R. (1965). On certain properties of the exponential-type families.
J. R. Statist. Soc. B, 27, 94-99.
SIMPSON, C. H. (1951). The interpretation of interaction in contingency tables. J. R. Statist. Soc.
B, 13, 238-241.
YATES, F. (1940). The recovery of inter-block information in balanced incomplete block designs.
Ann. Eugen., 10, 317-325.
(1948). The analysis of contingency tables with groupings based on quantitative characters.
Biometrika, 35, 176-181.
(1970). Experimental Design: Selected Papers. London: Griffin.
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms
384 NELDER AND WEDDERBURN - Generalized Linear Models [Part 3,
APPENDIX
We can sometimes simplify the expression for deviance. In what follows, sum-
mation is to be taken to be over the observations unless otherwise stated.
Theorem. If either (a) Y = j1 or (b) Y = log p- and a constant term is being
fitted in the model (say Y = E/i xi where xo takes the constant value 1) then
~{(z- 2)i/J2} = 0, where [i and Vj are the maximum likelihood estimates of , and V.
Proof. For case (b)
AL =0_
~Xo V
In case (a), we have
AL z-,udp, -Z __
api V dYXi V 0 yi
Hence
(z V2) j Xi =
Then
- ~ ~ ~ ~ ~ ~ ~ ~ -
= zRiXiVY = i V Ij VY Xi) =
This completes the proof. When the conditions of this theorem are satisfied, we have,
for the Poisson distribution
z-_2=O
and for the gamma distribution
- A 0.
IJL
Poisson 2 i z ln (zlj),
Gamma -2 ln (z/p).
This content downloaded from 129.25.184.137 on Mon, 01 Apr 2019 18:20:04 UTC
All use subject to https://about.jstor.org/terms