Statistical Methods
for
Rater Agreement
Recep ÖZCAN
http://recepozcan06.blogcu.com/
2009
INDEX

1. Statistical Methods for Rater Agreement
  1.0 Basic Considerations
  1.1 Know the goals
  1.2 Consider theory
  1.3 Reliability vs. validity
  1.4 Modeling vs. description
  1.5 Components of disagreement
  1.6 Keep it simple
    1.6.1 An example
  1.7 Recommended Methods
    1.7.1 Dichotomous data
    1.7.2 Ordered-category data
    1.7.3 Nominal data
    1.7.4 Likert-type items
2. Raw Agreement Indices
  2.0 Introduction
  2.1 Two Raters, Dichotomous Ratings
  2.2 Proportion of overall agreement
  2.3 Positive agreement and negative agreement
  2.4 Significance, standard errors, interval estimation
    2.4.1 Proportion of overall agreement
    2.4.2 Positive agreement and negative agreement
  2.5 Two Raters, Polytomous Ratings
  2.6 Overall Agreement
  2.7 Specific agreement
  2.8 Generalized Case
  2.9 Specific agreement
  2.10 Overall agreement
  2.11 Standard errors, interval estimation, significance
3. Intraclass Correlation and Related Methods
  3.0 Introduction
  3.1 Different Types of ICC
1. Statistical Methods for Rater Agreement

1.0 Basic Considerations
In many fields it is common to study agreement among ratings of multiple judges, experts,
diagnostic tests, etc. We are concerned here with categorical ratings: dichotomous (Yes/No,
Present/Absent, etc.), ordered categorical (Low, Medium, High, etc.), and nominal
(Schizophrenic, Bi-Polar, Major Depression, etc.) ratings. Likert-type ratings--intermediate
between ordered-categorical and interval-level ratings--are also considered.
There is little consensus about what statistical methods are best to analyze rater agreement
(we will use the generic words "raters" and "ratings" here to include observers, judges,
diagnostic tests, etc. and their ratings/results.) To the non-statistician, the number of
alternatives and lack of consistency in the literature is no doubt cause for concern. This
document aims to reduce confusion and help researchers select appropriate methods for their
applications.
Despite the many apparent options for analyzing agreement data, the basic issues are very
simple. Usually there are one or two methods best for a particular application. But it is
necessary to clearly identify the purpose of analysis and the substantive questions to be
answered.
1.1 Know the goals

The most common mistake made when analyzing agreement data is not having an explicit goal.
It is not enough for the goal to be "measuring agreement" or "finding out if raters agree."
There is presumably some reason why one wants to measure agreement. Which statistical
method is best depends on this reason.
For example, rating agreement studies are often used to evaluate a new rating system or
instrument. If such a study is being conducted during the development phase of the
instrument, one may wish to analyze the data using methods that identify how the instrument
could be changed to improve agreement. However if an instrument is already in a final
format, the same methods might not be helpful.
Very often agreement studies are an indirect attempt to validate a new rating system or
instrument. That is, lacking a definitive criterion variable or "gold standard," the accuracy of a
scale or instrument is assessed by comparing its results when used by different raters. Here
one may wish to use methods that address the issue of real concern--how well do ratings
reflect the true trait one wants to measure?
In other situations one may be considering combining the ratings of two or more raters to
obtain evaluations of suitable accuracy. If so, again, specific methods suitable for this purpose
should be used.
1.2 Consider theory

A second common problem in analyzing agreement is the failure to think about the data from
the standpoint of theory. Nearly all statistical methods for analyzing agreement make
assumptions. If one has not thought about the data from a theoretical point of view it will be
hard to select an appropriate method. The theoretical questions one asks do not need to be
complicated. Even simple questions help: is the trait being measured really discrete, like
presence/absence of a pathogen, or is it really continuous and merely divided into discrete
levels (e.g., "low," "medium," "high") for convenience? If the latter, is it reasonable to
assume that the trait is normally distributed? Or is some other distribution plausible?
Sometimes one will not know the answers to these questions. That is fine, too, because there
are methods suitable for that case also. The main point is to be inclined to think about data in
this way, and to be attuned to the issue of matching method and data on this basis.
These two issues--knowing one's goals and considering theory--are the main keys to successful
analysis of agreement data. Following are some other, more specific issues that pertain to the
selection of methods appropriate to a given study.
1.3 Reliability vs. validity

One can broadly distinguish two reasons for studying rating agreement. Sometimes the goal is to
estimate the validity (accuracy) of ratings in the absence of a "gold standard." This is a
reasonable use of agreement data: if two ratings disagree, then at least one of them must be
incorrect. Proper analysis of agreement data therefore permits certain inferences about how
likely a given rating is to be correct.
Other times one merely wants to know the consistency of ratings made by different raters. In
some cases, the issue of accuracy may even have no meaning--for example ratings may
concern opinions, attitudes, or values.
1.4 Modeling vs. description

One should also distinguish between modeling vs. describing agreement. Ultimately, there are
only a few simple ways to describe the amount of agreement: for example, the proportion of
times two ratings of the same case agree, the proportion of times raters agree on specific
categories, the proportions of times different raters use the various rating levels, etc.
The quantification of agreement in any other way inevitably involves a model about how
ratings are made and why raters agree or disagree. This model is either explicit, as with latent
structure models, or implicit, as with the kappa coefficient. With this in mind, two basic
principles are evident:
• It is better to have a model that is explicitly understood than one which is only implicit
and potentially not understood.
• The model should be testable.
Methods vary with respect to how well they meet these two criteria.
1.5 Components of disagreement

Raters may disagree for two broad reasons: because they define the trait itself differently, or
because they divide the trait into rating categories differently. Category definitions differ when raters divide the trait into different
intervals. For example, by "low skill" one rater may mean subjects from the 1st to the 20th
percentile. Another rater, though, may take it to mean subjects from the 1st to the 10th
percentile. When this occurs, rater thresholds can usually be adjusted to improve agreement.
Similarity of category definitions is reflected as marginal homogeneity between raters.
Marginal homogeneity means that the frequencies (or, equivalently, the "base rates") with
which two raters use various rating categories are the same.
Because disagreement on trait definition and disagreement on rating category widths are
distinct components of disagreement, with different practical implications, a statistical
approach to the data should ideally quantify each separately.
1.6 Keep it simple

All other things being equal, a simpler statistical method is preferable to a more complicated
one. Very basic methods can reveal far more about agreement data than is commonly realized.
For the most part, advanced methods are complements to, not substitutes for, simple methods.
1.6.1 An example:
To illustrate these principles, consider the example of rater agreement on screening
mammograms, a diagnostic imaging method for detecting possible breast cancer. Radiologists
often score mammograms on a scale such as "no cancer," "benign cancer," "possible
malignancy," or "malignancy." Many studies have examined rater agreement on applying
these categories to the same set of images.
In choosing a suitable statistical approach, one would first consider theoretical aspects of the
data. The trait being measured, degree of evidence for cancer, is continuous. So the actual
rating levels would be viewed as somewhat arbitrary discretizations of the underlying trait. A
reasonable view is that, in the mind of a rater, the overall weight of evidence for cancer is an
aggregate composed of various physical image features and weights attached to each feature.
Raters may vary in terms of which features they notice and the weights they associate with
each.
One would also consider the purpose of analyzing the data. In this application, the purpose of
studying rater agreement is not usually to estimate the accuracy of ratings by a single rater.
That can be done directly in a validity study, which compares ratings to a definitive diagnosis
made from a biopsy.
Instead, the aim is more to understand the factors that cause raters to disagree, with an
ultimate goal of improving their consistency and accuracy. For this, one should separately
assess whether raters have the same definition of the basic trait (i.e., whether different raters
weight various image features similarly) and whether they have similar widths for the various
rating levels. The former can be accomplished with, for example, latent trait models. Moreover,
latent trait models are consistent with the theoretical assumptions about the data noted above.
Raters' rating category widths can be studied by visually representing raters' rates of use for
the different rating levels and/or their thresholds for the various levels, and statistically
comparing them with tests of marginal homogeneity.
Another possibility would be to examine if some raters are biased such that they make
generally higher or lower ratings than other raters. One might also note which images are the
subject of the most disagreement and then try to identify the specific image features that are
the cause of the disagreement.
Such steps can help one identify specific ways to improve ratings. For example, raters who
seem to define the trait much differently than other raters, or use a particular category too
often, can have this pointed out to them, and this feedback may promote their making ratings
in a way more consistent with other raters.
1.7 Recommended Methods

This section suggests statistical methods suitable for various levels of measurement based on
the principles outlined above. These are general guidelines only--it follows from the
discussion that no one method is best for all applications. But these suggestions will at least
give the reader an idea of where to start.
1.7.1 Dichotomous data

Two raters

Multiple raters
Calculate the appropriate intraclass correlation for the data. If different raters are used for
each subject, an alternative is the Fleiss kappa.
If the trait being rated is assumed to be latently discrete, consider use of latent class
models.
If the trait being rated can be interpreted as latently continuous, latent trait models can be
used to assess association among raters and to estimate the correlation of ratings with the
true trait; these models can also be used to assess marginal homogeneity.
In some cases latent class and latent trait models can be used to estimate the accuracy
(e.g., Sensitivity and Specificity) of diagnostic ratings even when a 'gold standard' is
lacking.
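To make the Fleiss kappa mentioned above concrete, here is a minimal sketch in Python (an illustrative aid with hypothetical data, not part of the original text) of the standard calculation for subjects each rated the same number of times.

import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories table of rating counts.

    counts[i, j] = number of raters who assigned subject i to category j.
    Assumes every subject receives the same number of ratings.
    """
    counts = np.asarray(counts, dtype=float)
    n_ratings = counts.sum(axis=1)[0]          # ratings per subject (constant)
    p_j = counts.sum(axis=0) / counts.sum()    # overall category proportions
    # Per-subject agreement: proportion of agreeing rater pairs.
    P_i = (np.sum(counts**2, axis=1) - n_ratings) / (n_ratings * (n_ratings - 1))
    P_bar = P_i.mean()                         # observed agreement
    P_e = np.sum(p_j**2)                       # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Example: 4 subjects, 3 raters each, 2 categories (counts per category).
table = [[3, 0], [2, 1], [1, 2], [3, 0]]
print(round(fleiss_kappa(table), 3))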
1.7.2 Ordered-category data

Two raters
Use weighted kappa with Fleiss-Cohen (quadratic) weights; note that quadratic weights
are not the default with SAS and you must specify (WT=FC) with the AGREE option in
PROC FREQ.
Ordered rating levels often imply a latently continuous trait; if so, measure association
between the raters with the polychoric correlation or one of its generalizations.
Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.
Test (a) for differences in rater thresholds associated with each rating category and (b) for
a difference between the raters' overall bias using the respectively applicable McNemar
tests.
Optionally, use graphical displays to visually compare the proportion of times raters use
each category (base rates).
Consider association models and related methods for ordered category data. (See Agresti
A., Categorical Data Analysis, New York: Wiley, 2002).
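As a concrete illustration of the first recommendation above, the following sketch (hypothetical Python example, not from the original text) computes Cohen's weighted kappa with Fleiss-Cohen (quadratic) weights for a C x C cross-classification of two raters.

import numpy as np

def weighted_kappa(table):
    """Cohen's weighted kappa with quadratic (Fleiss-Cohen) weights.

    table[i, j] = number of cases rated category i by Rater 1 and j by Rater 2.
    """
    f = np.asarray(table, dtype=float)
    n = f.sum()
    C = f.shape[0]
    i, j = np.indices((C, C))
    w = 1.0 - ((i - j) ** 2) / (C - 1) ** 2                   # quadratic agreement weights
    p_obs = f / n                                             # observed proportions
    p_exp = np.outer(f.sum(axis=1), f.sum(axis=0)) / n**2     # chance proportions
    return (np.sum(w * p_obs) - np.sum(w * p_exp)) / (1.0 - np.sum(w * p_exp))

# Example: 3-category ratings by two raters.
t = [[20, 5, 1],
     [4, 15, 6],
     [2, 3, 19]]
print(round(weighted_kappa(t), 3))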
Multiple raters
1.7.3 Nominal data

Two raters
Assess raw agreement, overall and specific to each category.
Use the p-value of Cohen's unweighted kappa to verify that raters agree more than
chance alone would predict.
Often (perhaps usually), disregard the actual magnitude of kappa here; it is
problematic with nominal data because ordinarily one can neither assume that all types
of disagreement are equally serious (unweighted kappa) nor choose an objective set of
differential disagreement weights (weighted kappa). If, however, it is genuinely true
that all pairs of rating categories are equally "disparate", then the magnitude of
Cohen's unweighted kappa can be interpreted as a form of intraclass correlation.
Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.
Test marginal homogeneity relative to individual categories using McNemar tests.
Consider use of latent class models.
Another possibility is use of loglinear, association, or quasi-symmetry models.
Multiple raters
Assess raw agreement, overall and specific to each category.
If different raters are used for different subjects, use the Fleiss kappa statistic; again, as
with nominal data/two raters, attend only to the p-value of the test unless one has a
genuine basis for regarding all pairs of rating categories as equally "disparate".
Use latent class modeling. Conditional tests of marginal homogeneity can be made
within the context of latent class modeling.
Use graphical displays to visually compare the proportion of times raters use each
category (base rates).
Alternatively, consider each pair of raters individually and proceed as described for
two raters.
1.7.4 Likert-type items

Very often, Likert-type items can be assumed to produce interval-level data. (By a "Likert-
type item" here we mean one where the format clearly implies to the rater that rating levels
are evenly-spaced, such as
lowest highest
|-------|-------|-------|-------|-------|-------|
1 2 3 4 5 6 7
(circle level that applies)
Two raters
Assess association among raters using the regular Pearson correlation coefficient.
Test for differences in rater bias using the t-test for dependent samples.
Possibly estimate the intraclass correlation.
Assess marginal homogeneity as with ordered-category data.
See also methods listed in the section Methods for Likert-type or interval-level data.
Multiple raters
Perform a one-factor common factor analysis; examine/report the correlation of each
rater with the common factor (for details, see the section Methods for Likert-type or
interval-level data).
2. Raw Agreement Indices
2.0 Introduction
Much neglected, raw agreement indices are important descriptive statistics. They have unique
common-sense value. A study that reports only simple agreement rates can be very useful; a
study that omits them but reports complex statistics may fail to inform readers at a practical
level.
Raw agreement measures and their calculation are explained below. We examine first the case
of agreement between two raters on dichotomous ratings.
2.1 Two Raters, Dichotomous Ratings

Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized
by Table 1:
Table 1. Summary of dichotomous ratings by two raters

                     Rater 2
    Rater 1       +       -     total
       +          a       b      a+b
       -          c       d      c+d
The values a, b, c and d here denote the observed frequencies for each possible combination
of ratings by Rater 1 and Rater 2.
2.2 Proportion of overall agreement

The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2
agree. That is:
    po = (a + d) / (a + b + c + d) = (a + d) / N.        (1)
This proportion is informative and useful but, taken by itself, has possible limitations.
One is that it does not distinguish between agreement on positive ratings and agreement on
negative ratings.
Further, one may consider Cohen's (1960) criticism of po: that it can be high even with
hypothetical raters who randomly guess on each case according to probabilities equal to the
observed base rates. For example, if the base rates are extreme and both raters simply guessed
"positive" the large majority of times, they would usually agree on the diagnosis. Cohen
proposed to remedy this by comparing po to a corresponding quantity, pc, the proportion of
agreement expected from raters who randomly guess. As discussed in the section on kappa
coefficients, this logic is questionable; in particular, it is not clear what advantage there is in
comparing an actual level of agreement, po, with a hypothetical value, pc, which would occur
under an obviously unrealistic model.
2.3 Positive agreement and negative agreement

We may also compute observed agreement relative to each rating category individually.
Generically the resulting indices are called the proportions of specific agreement (Spitzer &
Fleiss, 1974). With binary ratings, there are two such indices, positive agreement (PA) and
negative agreement (NA). They are calculated as follows:
    PA = 2a / (2a + b + c);        NA = 2d / (2d + b + c).        (2)
PA, for example, estimates the conditional probability that, given that one of the raters
(randomly selected) makes a positive rating, the other rater will also do so.
A joint consideration of PA and NA addresses the potential concern that, when base rates are
extreme, po is liable to chance-related inflation or bias. Such inflation, if it exists at all, would
affect only the more frequent category. Thus if both PA and NA are satisfactorily large, there
is arguably less need or purpose in comparing actual to chance-predicted agreement using a
kappa statistic. But in any case, PA and NA provide more information relevant to
understanding and improving ratings than a single omnibus index (see Cicchetti and Feinstein,
1990).
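As a concrete illustration of Eqs. (1) and (2), here is a minimal Python sketch (hypothetical counts, an illustrative aid rather than part of the original text) that computes po, PA, and NA from the four cell counts of Table 1.

def raw_agreement(a, b, c, d):
    """Overall, positive, and negative agreement for a 2x2 table (Eqs. 1-2)."""
    n = a + b + c + d
    po = (a + d) / n                 # proportion of overall agreement, Eq. (1)
    pa = 2 * a / (2 * a + b + c)     # positive agreement, Eq. (2)
    na = 2 * d / (2 * d + b + c)     # negative agreement, Eq. (2)
    return po, pa, na

# Example: a = 40 (both +), b = 9, c = 6, d = 45 (both -).
po, pa, na = raw_agreement(40, 9, 6, 45)
print(f"po={po:.3f}  PA={pa:.3f}  NA={na:.3f}")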
2.4 Significance, standard errors, interval estimation

2.4.1 Proportion of overall agreement

Statistical significance. In testing the significance of po, the null hypothesis is that raters are
independent, with their marginal assignment probabilities equal to the observed marginal
proportions. For a 2×2 table, the test is the same as a usual test of statistical independence in a
contingency table. Any of the following could potentially be used:
A potential advantage of a kappa significance test is that the magnitude of kappa can be
interpreted as approximately an intra-class correlation coefficient. All of these tests, except
the last, can be done with SAS PROC FREQ.
Standard error. One can use standard methods applicable to proportions to estimate the
standard error and confidence limits of po. For a sample size N, the standard error of po is:

    SE(po) = sqrt[ po (1 - po) / N ].        (3.1)

One can alternatively estimate SE(po) using resampling methods, e.g., the nonparametric
bootstrap or the jackknife, as described in the next section.

Wald-type confidence limits are then

    CL = po - SE × zcrit        (3.2)
    CU = po + SE × zcrit        (3.3)

where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and upper
confidence limits, and zcrit is the z-value associated with a confidence range with coverage
probability crit. For a 95% confidence range, zcrit = 1.96; for a 90% confidence range, zcrit =
1.645.
When po is either very large or very small (and especially with small sample sizes) the Wald
method may produce confidence limits less than 0 or greater than 1; in this case better
approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below)
can be used instead.
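A short sketch of Eqs. (3.1)-(3.3) in Python follows (illustrative only; the numbers are hypothetical).

import math

def wald_ci(po, n, z_crit=1.96):
    """Wald standard error and confidence limits for po (Eqs. 3.1-3.3)."""
    se = math.sqrt(po * (1 - po) / n)   # Eq. (3.1)
    return se, po - z_crit * se, po + z_crit * se

se, cl, cu = wald_ci(po=0.85, n=100)
print(f"SE={se:.3f}  95% CI=({cl:.3f}, {cu:.3f})")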
2.4.2 Positive agreement and negative agreement

Statistical significance. Logically, there is only one test of independence in a 2×2 table;
therefore if PA significantly differs from chance, so too would NA, and vice versa. Spitzer and
Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 there are two such
"specific kappas", but both have the same value and statistical significance as the overall
kappa.
Standard errors.
• With the nonparametric bootstrap (Efron & Tibshirani, 1993), one constructs a large
number of simulated data sets of size N by sampling with replacement from the
observed data. For a 2×2 table, this can be done simply by using random numbers to
assign simulated cases to cells with probabilities of a/N, b/N, c/N and d/N (however,
with large N, more efficient algorithms are preferable). One then computes the
proportion of positive agreement for each simulated data set -- which we denote PA*.
The standard deviation of PA* across all simulated data sets estimates the standard
error SE(PA).
• The delete-1 (Efron, 1982) jackknife works by calculating PA for four alternative
tables where 1 is subtracted from each of the four cells of the original 2 × 2 table. A
few simple calculations then provide an estimate of the standard error SE(PA). The
delete-1 jackknife requires less computation, but the nonparametric bootstrap is
usually considered more accurate.
Confidence intervals.
• Asymptotic confidence limits for PA and NA can be obtained as in Eqs. 3.2 and 3.3.,
substituting PA and NA for po and using the asymptotic standard errors given by Eqs.
3.4 and 3.5.
• Alternatively, the bootstrap can be used. Again, we describe the method for PA. As
with bootstrap standard error estimation, one generates a large number (e.g., 100,000)
of simulated data sets, computing an estimate PA* for each one. Results are then
sorted by increasing value of PA*. Confidence limits of PA are obtained with
reference to the percentiles of this ranking. For example, the 95% confidence range of
PA is estimated by the values of PA* that correspond to the 2.5 and 97.5 percentiles of
this distribution.
An advantage of bootstrapping is that one can use the same simulated data sets to estimate
not only the standard errors and confidence limits of PA and NA, but also those of po or
any other statistic defined for the 2×2 table.
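The following sketch (hypothetical Python, assuming a 2×2 table of counts a, b, c, d) illustrates the nonparametric bootstrap just described: resample the cell counts, recompute PA for each pseudo-sample, and read off the standard deviation and percentile limits.

import numpy as np

rng = np.random.default_rng(seed=1)

def positive_agreement(a, b, c, d):
    return 2 * a / (2 * a + b + c)

def bootstrap_pa(a, b, c, d, n_boot=10_000):
    """Bootstrap SE and 95% percentile CI for positive agreement (PA)."""
    n = a + b + c + d
    probs = np.array([a, b, c, d]) / n
    pa_star = np.empty(n_boot)
    for i in range(n_boot):
        # Resample N cases into the four cells with the observed probabilities.
        aa, bb, cc, dd = rng.multinomial(n, probs)
        pa_star[i] = positive_agreement(aa, bb, cc, dd)
    se = pa_star.std(ddof=1)
    lo, hi = np.percentile(pa_star, [2.5, 97.5])
    return se, lo, hi

se, lo, hi = bootstrap_pa(40, 9, 6, 45)
print(f"SE(PA)={se:.3f}  95% CI=({lo:.3f}, {hi:.3f})")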
A SAS program to estimate the asymptotic standard errors and asymptotic confidence limits
of PA and NA has been written, and a free standalone program that supplies both bootstrap
and asymptotic standard errors and confidence limits is also available from the original author.
Readers are referred to Graham and Bull (1998) for fuller coverage of this topic, including a
comparison of different methods for estimating confidence intervals for PA and NA.
2.5 Two Raters, Polytomous Ratings

We now consider results for two raters making polytomous (either ordered category or purely
nominal) ratings. Let C denote the number of rating categories or levels. Results for the two
raters may be summarized as a C × C table such as Table 2.
Table 2. Summary of polytomous ratings by two raters

                            Rater 2
    Rater 1       1       2      ...      C      total
       1         n11     n12     ...     n1C      n1.
       2         n21     n22     ...     n2C      n2.
      ...        ...     ...     ...     ...      ...
       C         nC1     nC2     ...     nCC      nC.
     total       n.1     n.2     ...     n.C    n.. = N
Here nij denotes the number of cases assigned rating category i by Rater 1 and category j by
Rater 2, with i, j = 1, ..., C. When a "." appears in a subscript, it denotes a marginal sum over
the corresponding index; e.g., ni. is the sum of nij for j = 1, ..., C, or the row marginal sum for
category i; n.. = N denotes the total number of cases.
2.6 Overall Agreement

For this design, po is the sum of frequencies on the main diagonal of table {nij} divided by the
sample size, or

    po = (1/N) SUM(i=1..C) nii.        (4)
Statistical significance
• One may test the statistical significance of po with Cohen's kappa. If kappa is
significant/nonsignificant, then po may be assumed significant/nonsignificant, and vice
versa. Note that the numerator of kappa is the difference between po and the level of
agreement expected under the null hypothesis of statistical independence.
• The parametric bootstrap can also be used to test statistical significance. This is like
the nonparametric bootstrap already described, except that samples are generated from
the null hypothesis distribution. Specifically, one constructs many -- say 5000 --
simulated samples of size N from the probability distribution {πij}, where
    πij = (ni./N)(n.j/N) = ni. n.j / N²,        (5)
and then tabulates the overall agreement, denoted p*o, for each simulated sample. The po for
the actual data is considered statistically significant at, say, the .05 level if it exceeds 95%
of the p*o values.
If one already has a computer program for nonparametric bootstrapping, only slight
modifications are needed to adapt it to perform a parametric bootstrap significance
test.
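A minimal sketch of this parametric bootstrap test in Python (illustrative only; the table below is hypothetical):

import numpy as np

rng = np.random.default_rng(seed=2)

def overall_agreement(table):
    return np.trace(table) / table.sum()

def parametric_bootstrap_p(table, n_sim=5000):
    """One-tailed parametric bootstrap p-value for po under independence (Eq. 5)."""
    table = np.asarray(table, dtype=float)
    n = int(table.sum())
    # Cell probabilities under the null hypothesis of independence: (ni./N)(n.j/N).
    pi = np.outer(table.sum(axis=1), table.sum(axis=0)) / n**2
    po_obs = overall_agreement(table)
    po_star = np.empty(n_sim)
    for s in range(n_sim):
        sim = rng.multinomial(n, pi.ravel()).reshape(table.shape)
        po_star[s] = overall_agreement(sim)
    return np.mean(po_star >= po_obs)   # proportion of null samples at least as large

t = np.array([[20, 5, 1],
              [4, 15, 6],
              [2, 3, 19]])
print(f"po={overall_agreement(t):.3f}  p-value={parametric_bootstrap_p(t):.4f}")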
Standard error and confidence limits. Here the standard error and confidence intervals of po
can again be calculated with the methods described for 2×2 tables.
2.7 Specific agreement

The proportion of agreement specific to category i is

    ps(i) = 2nii / (ni. + n.i).        (6)
Statistical significance
Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i,
considering this category a 'positive' rating, and then computing the positive agreement (PA)
index of Eq. (2). This is done for each category i successively. In each reduced table one may
perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared,
or use a Fisher exact test.
• Again, for each category i, we may collapse the original C × C table into a 2×2 table,
taking i as the 'positive' rating level. The asymptotic standard error formula Eq. (3.4)
for PA may then be used, and the Wald method confidence limits given by Eqs. (3.2)
and (3.3) may be computed.
• Alternatively, one can use the nonparametric bootstrap to estimate standard errors
and/or confidence limits. Note that this does not require a successive collapsing of the
original table.
• The delete-1 jackknife can be used to estimate standard errors, but this does require
successive collapsings of the C × C table.
2.8 Generalized Case

We now consider generalized formulas for the proportions of overall and specific agreement.
They apply to binary, ordered category, or nominal ratings and permit any number of raters,
with potentially different numbers of raters or different raters for each case.
Let there be K rated cases indexed by k = 1, ..., K. The ratings made on case k are
summarized by the vector of category counts

    {njk},  j = 1, ..., C,

where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a
case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and
{njk} = {3, 2}.
Let nk denote the total number of ratings made on case k; that is,
    nk = SUM(j=1..C) njk.        (7)
2.9 Specific agreement

The total number of agreements specifically on rating level j, across all cases, is

    S(j) = SUM(k=1..K) njk (njk - 1),        (9)
and the number of possible agreements on category j across all cases is:
    Sposs(j) = SUM(k=1..K) njk (nk - 1).        (11)
The proportion of agreement specific to category j is equal to the total number of agreements
on category j divided by the total number of opportunities for agreement on category j, or
    ps(j) = S(j) / Sposs(j).        (12)
2.10 Overall agreement

The total number of actual agreements, regardless of category, is equal to the sum of Eq. (9)
across all categories, or

    O = SUM(j=1..C) S(j).        (13)
The total number of possible agreements is

    Oposs = SUM(k=1..K) nk (nk - 1).        (14)
Dividing Eq. (13) by Eq. (14) gives the overall proportion of observed agreement, or
    po = O / Oposs.        (15)
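The sketch below (hypothetical Python, not from the original text) implements Eqs. (7)-(15) for a K x C matrix of per-case category counts njk.

import numpy as np

def generalized_agreement(njk):
    """Specific agreement per category and overall agreement (Eqs. 7-15).

    njk[k, j] = number of times category j was assigned to case k.
    """
    njk = np.asarray(njk, dtype=float)
    nk = njk.sum(axis=1)                               # Eq. (7): ratings per case
    S_j = np.sum(njk * (njk - 1), axis=0)              # Eq. (9): agreements on category j
    Sposs_j = np.sum(njk * (nk - 1)[:, None], axis=0)  # Eq. (11): possible agreements on j
    ps = S_j / Sposs_j                                 # Eq. (12)
    O = S_j.sum()                                      # Eq. (13)
    Oposs = np.sum(nk * (nk - 1))                      # Eq. (14)
    po = O / Oposs                                     # Eq. (15)
    return ps, po

# Example: 4 cases, 3 categories, varying numbers of ratings per case.
counts = [[3, 2, 0],
          [0, 4, 1],
          [2, 0, 2],
          [5, 0, 0]]
ps, po = generalized_agreement(counts)
print("ps(j) =", np.round(ps, 3), " po =", round(po, 3))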
2.11 Standard errors, interval estimation, significance

The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard
errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one assumes
cases are independent and identically distributed (iid). In general, this assumption is
reasonable when:
• the same raters rate each case, and either there are no missing ratings or ratings are
missing completely at random.
• the raters for each case are randomly sampled and the number of ratings per case is
constant or random.
• in a replicate rating (reproducibility) study, each case is rated by the procedure the
same number of times or else the number of replications for any case is completely
random.
In these cases, one may construct each simulated sample by repeated random sampling with
replacement from the set of K cases.
If cases cannot be assumed iid (for example, if ratings are not missing at random, or, say, a
study systematically rotates raters), simple modifications of the bootstrap method--such as
two-stage sampling--can be made.
The parametric bootstrap can be used for significance testing. A variation of this method,
patterned after the Monte Carlo approach described by Uebersax (1982), is as follows: one
generates many simulated data sets under the null hypothesis of rater independence and
computes the statistics of interest for each.

The significance of po, ps(j), or any other statistic calculated is determined with reference to
the distribution of corresponding values in the simulated data sets. For example, po is
significant at the .05 level (1-tailed) if it exceeds 95% of the p*o values obtained for the
simulated data sets.
References
Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the
paradoxes. Journal of Clinical Epidemiology, 1990, 43, 551-558.
Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal
agreement: two families of agreement measures. Canadian Journal of Statistics,
1995, 23, 333-344.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia:
Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and
Hall, 1993.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological
Bulletin, 1971, 76, 378-381.
Fleiss JL. Statistical methods for rates and proportions, 2nd Ed. New York: John
Wiley, 1981.
Graham P, Bull B. Approximate standard errors and confidence intervals for indices of
positive and negative agreement. J Clin Epidemiol, 1998, 51(9), 763-771.
3. Intraclass Correlation and Related Methods
3.0 Introduction
The Intraclass Correlation (ICC) assesses rating reliability by comparing the variability of
different ratings of the same subject to the total variation across all ratings and all subjects:

    ICC = s²(b) / [s²(b) + s²(w)],        [1]

where s²(w) is the pooled variance within subjects, and s²(b) is the variance of the trait
between subjects.

It is easily shown that s²(b) + s²(w) equals the total variance of ratings--i.e., the variance of all
ratings, regardless of whether they are for the same subject or not. Hence the interpretation of
the ICC as the proportion of total variance accounted for by between-subject variation.
Equation [1] would apply if we knew the true values, s²(w) and s²(b). But we rarely do, and
must instead estimate them from sample data. For this we wish to use all available
information; this adds terms to Equation [1].
For example, s²(b) is the variance of true trait levels between subjects. Since we do not know
a subject's true trait level, we estimate it from the subject's mean rating across the raters who
rate the subject. Each mean rating is subject to sampling variation--deviation from the
subject's true trait level, or its surrogate, the mean rating that would be obtained from a very
large number of raters. Since the actual mean ratings are often based on two or a few ratings,
these deviations are appreciable and inflate the estimate of between-subject variance.
We can estimate the amount of this extra, error variation and correct for it. If all subjects have
k ratings, then for the Case 1 ICC (see definition below) the extra variation is estimated as
(1/k) s²(w), where s²(w) is the pooled estimate of within-subject variance. When all subjects
have k ratings, s²(w) equals the average variance of the k ratings of each subject (each
calculated using k - 1 as denominator). To get the ICC we then subtract this correction from
the estimated between-subject variance and apply Equation [1].
For the various other types of ICC, different corrections are used, each producing its own
equation. Unfortunately, these formulas are usually expressed in their computational form--
with terms arranged in a way that facilitates calculation--rather than their derivational form,
which would make clear the nature and rationale of the correction terms.
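As an illustration of the Case 1 calculation just described, here is a minimal Python sketch (hypothetical data; one of several equivalent computational forms) based on the one-way ANOVA mean squares.

import numpy as np

def icc_case1(ratings):
    """Case 1 ICC (single rating) from a subjects-by-ratings matrix.

    ratings[i, :] holds the k ratings of subject i; raters may differ by subject.
    Uses the one-way ANOVA form ICC(1) = (BMS - WMS) / (BMS + (k - 1) * WMS),
    which is equivalent to correcting the between-subject variance by (1/k) s2(w).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    subject_means = x.mean(axis=1)
    grand_mean = x.mean()
    bms = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)    # between-subject MS
    wms = np.sum((x - subject_means[:, None]) ** 2) / (n * (k - 1))  # within-subject MS
    return (bms - wms) / (bms + (k - 1) * wms)

# Example: 5 subjects, each rated by 3 (possibly different) raters.
data = [[4, 5, 4],
        [2, 2, 3],
        [5, 5, 5],
        [3, 2, 2],
        [1, 2, 1]]
print(round(icc_case1(data), 3))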
3.1 Different Types of ICC
In their important paper, Shrout and Fleiss (1979) describe three classes of ICC for reliability,
which they term Case 1, Case 2 and Case 3. Each Case applies to a different rater agreement
study design.
Case 1. One has a pool of raters. For each subject, one randomly samples from the rater pool
k different raters to rate this subject. Therefore the raters who rate one subject are not
necessarily the same as those who rate another. This design corresponds to a 1-way Analysis
of Variance (ANOVA) in which Subject is a random effect, and Rater is viewed as
measurement error.
Case 2. The same set of k raters rate each subject. This corresponds to a fully-crossed (Rater
× Subject), 2-way ANOVA design in which both Subject and Rater are separate effects. In
Case 2, Rater is considered a random effect; this means the k raters in the study are
considered a random sample from a population of potential raters. The Case 2 ICC estimates
the reliability of the larger population of raters.
Case 3. This is like Case 2--a fully-crossed, 2-way ANOVA design. But here one estimates
the ICC that applies only to the k raters in the study. Since this does not permit generalization
to other raters, the Case 3 ICC is not often used.
Shrout and Fleiss (1979) also show that for each of the three Cases above, one can use the
ICC in two ways: to estimate the reliability of a single rating, or to estimate the reliability of
the mean of the k ratings.

For each of the Cases, then, there are two forms, producing a total of 6 different versions of
the ICC.
3.2.1 Pros
• Flexible
The ICC, and more broadly, ANOVA analysis of ratings, is very flexible. Besides the
six ICCs discussed above, one can consider more complex designs, such as a grouping
factor among raters (e.g., experts vs. nonexperts), or covariates. See Landis and Koch
(1977a,b) for examples.
• Software
Software to estimate the ICC is readily available (e.g., SPSS and SAS). Output from
most any ANOVA software will contain the values needed to calculate the ICC.
The ICC allows estimation of the reliability of both single and mean ratings.
"Prophecy" formulas let one predict the reliability of mean ratings based on any
number of raters.
An alternative to the ICC for Cases 2 and 3 is to calculate the Pearson correlation
between all pairs of raters. The Pearson correlation measures association between
raters but is insensitive to rater mean differences (bias). The ICC decreases in
response to both lower correlation between raters and larger rater mean differences.
Some may see this as an advantage, but others (see Cons) may see it as a limitation.
• Number of categories
The ICC can be used to compare the reliability of different instruments. For example,
the reliability of a 3-level rating scale can be compared to the reliability of a 5-level
scale (provided they are assessed relative to the same sample or population; see Cons).
3.2.2 Cons
The ICC is strongly influenced by the variance of the trait in the sample/population in
which it is assessed. ICCs measured for different populations might not be
comparable.
For example, suppose one has a depression rating scale. When applied to a random
sample of the adult population the scale might have a high ICC. However, if the scale
is applied to a very homogeneous population--such as patients hospitalized for acute
depression--it might have a low ICC.
This is evident from the definition of the ICC as s²(b) / [s²(b) + s²(w)]. In both
populations above, s²(w), the variance of different raters' opinions of the same subject,
may be the same. But between-subject variance, s²(b), may be much smaller in the
clinical population than in the general population. Therefore the ICC would be smaller
in the clinical population.
This issue is similar to, and just as much a concern as, the "base rate" problem of the
kappa coefficient. It means that:
1. One cannot compare ICCs for samples or populations with different between-
subject variance; and
2. The often-reproduced table which shows specific ranges for "acceptable" and
"unacceptable" ICC values should not be used.
For more discussion of the implications of this topic, see The Comparability Issue
below.
To use the ICC with ordered-category ratings, one must assign the rating categories
numeric values. Usually categories are assigned values 1, 2, ..., C, where C is the
number of rating categories; this assumes all categories are equally wide, which may
not be true. An alternative is to assign ordered categories numeric values from their
cumulative frequencies via probit (for a normally distributed trait) or ridit (for a
rectangularly distributed trait) scoring; see Fleiss (1981).
The ICC combines, or some might say, confounds, two ways in which raters differ: (1)
association, which concerns whether the raters understand the meaning of the trait in
the same way, and (2) bias, which concerns whether some raters' mean ratings are
higher or lower than others. If a goal is to give feedback to raters to improve future
ratings, one should distinguish between these two sources of disagreement. For
discussion of alternatives that separate these components, see the discussion of
Likert-type data.
With ordered-category or Likert-type data, the ICC discounts the fact that we have a
natural unit to evaluate rating consistency: the number or percent of agreements on
each rating category. Raw agreement is simple, intuitive, and clinically meaningful.
With ordered category data, it is not clear why one would prefer the ICC to raw
agreement rates, especially in light of the comparability issue discussed below. A good
idea is to report reliability using both the ICC and raw agreement rates.
Above it was noted that the ICC is strongly dependent on the trait variance within the
population for which it is measured. This can complicate comparisons of ICCs measured in
different populations, or in generalizing results from a single population.
Some suggest avoiding this problem by eliminating or holding constant the "problematic"
term, s²(b).
Holding the term constant would mean choosing some fixed value for s²(b), and using this in
place of the different value estimated in each population. For example, one might pick as
s²(b) the trait variance in the general adult population--regardless of what population the ICC
is measured in.
However, if one is going to hold s²(b) constant, one may well question using it at all. Why not
simply report as the index of unreliability the value of s²(w) for a study? Indeed, this has been
suggested, though not used much in practice.

But if one is going to disregard s²(b) because it complicates comparisons, why not go a step
further and express reliability simply as raw agreement rates--for example, the percent of
times two raters agree on the exact same category, and the percent of times they are within
one level of one another?
An advantage of including s²(b) is that it automatically controls for the scaling factor of an
instrument. Thus (at least within the same population), ICCs for instruments with different
numbers of categories can be meaningfully compared. Such is not the case with raw
agreement measures or with s²(w) alone. Therefore, someone reporting the reliability of a new
scale may wish to include the ICC along with other measures if they expect later researchers
might compare their results to those of a different instrument with fewer or more
categories.
4. Kappa Coefficients
4.0 Summary
There is wide disagreement about the usefulness of kappa statistics to assess rater agreement.
At the least, it can be said that (1) kappa statistics should not be viewed as the unequivocal
standard or default way to quantify agreement; (2) one should be concerned about using a
statistic that is the source of so much controversy; and (3) one should consider alternatives and
make an informed choice.
One can distinguish between two possible uses of kappa: as a way to test rater independence
(i.e. as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size
measure). The first use involves testing the null hypothesis that there is no more agreement
than might occur by chance given random guessing; that is, one makes a qualitative, "yes or
no" decision about whether raters are independent or not. Kappa is appropriate for this
purpose (although to know that raters are not independent is not very informative; raters are
dependent by definition, inasmuch as they are rating the same cases).
A better case for using kappa to quantify rater agreement is that, under certain conditions, it
approximates the intra-class correlation. But this too is problematic in that (1) these
conditions are not always met, and (2) one could instead directly calculate the intraclass
correlation.
5.0 Introduction
Consider symptom ratings (1 = low, 2 = moderate, 3 = high) by two raters on the same sample
of subjects, summarized by a 3×3 table as follows:
Table 1. Summarization of ratings by Rater 1 (rows) and Rater 2 (columns)

                     Rater 2
    Rater 1      1      2      3     total
       1        p11    p12    p13     p1.
       2        p21    p22    p23     p2.
       3        p31    p32    p33     p3.
     total      p.1    p.2    p.3      1

Here pij denotes the proportion of all cases assigned to category i by Rater 1 and category j by
Rater 2. (The table elements could as easily be frequencies.) The terms p1., p2., and p3. denote
the marginal proportions for Rater 1--i.e., the total proportion of times Rater 1 uses categories
1, 2 and 3, respectively. Similarly, p.1, p.2, and p.3 are the marginal proportions for Rater 2.
Marginal homogeneity refers to equality (lack of significant difference) between one or more
of the row marginal proportions and the corresponding column proportion(s). Testing
marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree
is because of different propensities to use each rating category. When such differences are
observed, it may be possible to provide feedback or improve instructions to make raters'
marginal proportions more similar and improve agreement.
Differences in raters' marginal rates can be formally assessed with statistical tests of marginal
homogeneity (Barlow, 1998; Bishop, Fienberg & Holland, 1975; Ch. 8). If each rater rates
different cases, testing marginal homogeneity is straightforward: one can compare the
marginal frequencies of different raters with a simple chi-squared test. However this cannot be
done when different raters rate the same cases--the usual situation with rater agreement
studies; then the ratings of different raters are not statistically independent and this must be
accounted for.
Approaches that properly account for this dependence include:

• Nonparametric tests
• Bootstrap methods
• Loglinear, association, and quasi-symmetry models
Before discussing formal statistical methods, non-statistical methods for comparing raters'
marginal distributions should be briefly mentioned. Simple descriptive methods can be very
useful. For example, a table might report each rater's rate of use for each category. Graphical
methods are especially helpful. A histogram can show the distribution of each rater's ratings
across categories. The following example is from the output of the MH program:
0.304 + **
| ** ==
| ** == ==
| ** == ** == ** ==
| ** == ** == ** ==
| ** == ** == ** ==
| ** == ** == ** == ** ==
| ** == ** == ** == ** == ** ==
| ** == ** == ** == ** == ** == ** ==
| ** == ** == ** == ** == ** == ** ==
0 +----+-------+-------+-------+-------+-------+----
1 2 3 4 5 6
Vertical or horizontal stacked-bar histograms are good ways to summarize the data. With
ordered-category ratings, a related type of figure shows the cumulative proportion of cases
below each rating level for each rater. An example, again from the MH program, is as follows:
1 234 5 6
*---*-*-*-----*-------------------*-------------------------- Rater 1
*---*-*-*--------*------------*------------------------------ Rater 2
1 234 5 6
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ Scale
0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .1
These are merely examples. Many other ways to graphically compare marginal distributions
are possible.
The main nonparametric test for assessing marginal homogeneity is the McNemar test. The
McNemar test assesses marginal homogeneity in a 2×2 table. Suppose, however, that one has
an N×N crossclassification frequency table that summarizes ratings by two raters for an N-
category rating system. By collapsing the N×N table into various 2×2 tables, one can use the
McNemar test to assess marginal homogeneity of each rating category. With ordered-category
data one can also collapse the N×N table in other ways to test rater equality of category
thresholds, or to test raters for overall bias (i.e., a tendency to make higher or lower ratings
than other raters).
The Stuart-Maxwell test can be used to test marginal homogeneity between two raters across
all categories simultaneously. It thus complements McNemar tests of individual categories by
providing an overall significance value.
MH, a computer program for testing marginal homogeneity with these methods, is available
online.
These tests are remarkably easy to use and are usually just as effective as more complex
methods. Because the tests are nonparametric, they make few or no assumptions about the
data. While some of the methods described below are potentially more powerful, this comes
at the price of making assumptions which may or may not be true. The simplicity of the
nonparametric tests lends persuasiveness to their results.
A mild limitation is that these tests apply only for comparisons of two raters. With more than
two raters, of course, one can apply the tests for each pair of raters.
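To make the nonparametric approach concrete, here is a hedged Python sketch (hypothetical data; standard textbook formulas, not code from the MH program) of the per-category McNemar test and the overall Stuart-Maxwell test for a C x C table.

import numpy as np
from scipy.stats import chi2

def mcnemar_category(table, i):
    """McNemar chi-squared for marginal homogeneity of category i (table collapsed to 2x2)."""
    t = np.asarray(table, dtype=float)
    b = t[i, :].sum() - t[i, i]      # rated i by Rater 1 but not by Rater 2
    c = t[:, i].sum() - t[i, i]      # rated i by Rater 2 but not by Rater 1
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

def stuart_maxwell(table):
    """Stuart-Maxwell test of overall marginal homogeneity for a CxC table."""
    t = np.asarray(table, dtype=float)
    C = t.shape[0]
    d = (t.sum(axis=1) - t.sum(axis=0))[:-1]   # marginal differences, first C-1 categories
    S = np.zeros((C - 1, C - 1))
    for i in range(C - 1):
        for j in range(C - 1):
            if i == j:
                S[i, j] = t[i, :].sum() + t[:, i].sum() - 2 * t[i, i]
            else:
                S[i, j] = -(t[i, j] + t[j, i])
    stat = d @ np.linalg.inv(S) @ d
    return stat, chi2.sf(stat, df=C - 1)

t = np.array([[20, 10, 5],
              [3, 30, 15],
              [2, 4, 25]])
for i in range(3):
    stat, p = mcnemar_category(t, i)
    print(f"McNemar, category {i + 1}: chi2={stat:.2f}, p={p:.3f}")
sm, p = stuart_maxwell(t)
print(f"Stuart-Maxwell: chi2={sm:.2f}, p={p:.4f}")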
5.3 Bootstrapping
Bootstrap and related jackknife methods (Efron, 1982; Efron & Tibshirani, 1993) provide a
very general and flexible framework for testing marginal homogeneity. Again, suppose one
has an N×N crossclassification frequency table summarizing agreement between two raters on
an N-category rating. Using what is termed the nonparametric bootstrap, one would
repeatedly sample from this table to produce a large number (e.g., 500) of pseudo-tables, each
with the same total frequency as the original table.
Various measures of marginal homogeneity would be calculated for each pseudo-table; for
example, one might calculate the difference between the row marginal proportion and the
column marginal proportion for each category, or construct an overall measure of row vs.
column marginal differences.
Let d* denote such a measure calculated for a given pseudo-table, and let d denote the same
measure calculated for the original table. From the pseudo-tables, one can empirically
estimate the standard deviation of d*, denoted s(d*). Let d' denote the true population value of d.
Assuming that d' = 0 corresponds to the null hypothesis of marginal homogeneity, one can test
this null hypothesis by calculating the z value

    z = d / s(d*)

and determining the significance of the standard normal deviate z by the usual methods (e.g., a
table of z-value probabilities).
The method above is merely an example. Many variations are possible within the framework
of bootstrap and jackknife methods.
An advantage of bootstrap and jackknife methods is their flexibility. For example, one could
potentially adapt them for simultaneous comparisons among more than two raters.
A potential disadvantage of these methods is that the user may need to write a computer
program to apply them. However, such a program could also be used for other purposes, such
as providing bootstrap significance tests and/or confidence intervals for various raw
agreement indices.
For each type of model the basic approach is the same. First one estimates a general form of
the model--that is, one without assuming marginal homogeneity; let this be termed the
"unrestricted model." Next one adds the assumption of marginal homogeneity to the model.
This is done by applying equality restrictions to some model parameters so as to require
homogeneity of one or more marginal probabilities (Barlow, 1998). Let this be termed the
"restricted model."
Marginal homogeneity can then be tested using the difference G² statistic, calculated as

    G²(difference) = G²(restricted) - G²(unrestricted),

where G²(restricted) and G²(unrestricted) are the likelihood-ratio chi-squared model fit
statistics (Bishop, Fienberg & Holland, 1975) calculated for the restricted and unrestricted
models.
The difference G² can be interpreted as a chi-squared value and its significance determined
from a table of chi-squared probabilities. The df are equal to the difference in df for the
unrestricted and restricted models. A significant value implies that the rater marginal
probabilities are not homogeneous.
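A tiny sketch of the difference test follows (hypothetical Python; the deviances and degrees of freedom would come from whatever loglinear modeling software is used).

from scipy.stats import chi2

def g2_difference_test(g2_restricted, df_restricted, g2_unrestricted, df_unrestricted):
    """Likelihood-ratio (difference G^2) test of marginal homogeneity."""
    g2_diff = g2_restricted - g2_unrestricted
    df_diff = df_restricted - df_unrestricted
    return g2_diff, df_diff, chi2.sf(g2_diff, df_diff)

# Example with made-up fit statistics from restricted and unrestricted models.
g2, df, p = g2_difference_test(g2_restricted=12.4, df_restricted=5,
                               g2_unrestricted=3.1, df_unrestricted=3)
print(f"difference G2 = {g2:.2f} on {df} df, p = {p:.4f}")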
An advantage of this approach is that one can test marginal homogeneity for one category,
several categories, or all categories using a unified approach. Another is that, if one is already
analyzing the data with a loglinear, association, or quasi-symmetry model, the addition of
marginal homogeneity tests may require relatively little extra work.
A possible limitation is that loglinear, association, and quasi-symmetry models are only well-
developed for the analysis of two-way tables. Another is that use of the difference G² test
typically requires that the unrestricted model fit the data, which sometimes might not be the
case.
For an excellent discussion of these and related models (including linear-by-linear models),
see Agresti (2002).
Latent trait models and related methods such as the tetrachoric and polychoric correlation
coefficients can be used to test marginal homogeneity for dichotomous or ordered-category
ratings. The general strategy using these methods is similar to that described for loglinear and
related models. That is, one estimates both an unrestricted version of the model and a
restricted version that assumes marginal homogeneity, and compares the two models with a
difference G2 test.
With latent trait and related models, the restricted models are usually constructed by assuming
that the thresholds for one or more rating levels are equal across raters.
A variation of this method tests overall rater bias. That is done by estimating a restricted
model in which the thresholds of one rater are equal to those of another plus a fixed constant.
A comparison of this restricted model with the corresponding unrestricted model tests the
hypothesis that the fixed constant, which corresponds to bias of a rater, is 0.
Another way to test marginal homogeneity using latent trait models is with the asymptotic
standard errors of estimated category thresholds. These can be used to estimate the standard
error of the difference between the thresholds of two raters for a given category, and this
standard error used to test the significance of the observed difference.
An advantage of the latent trait approach is that it can be used to assess marginal homogeneity
among any number of raters simultaneously. A disadvantage is that these methods require
more computation than nonparametric tests. If one is only interested in testing marginal
homogeneity, the nonparametric methods might be a better choice. However, if one is already
using latent trait models for other reasons, such as to estimate accuracy of individual raters or
to estimate the correlation of their ratings, one might also use them to examine marginal
homogeneity; however, even in this case, it might be simpler to use the nonparametric tests of
marginal homogeneity.
If there are many raters and categories, data may be sparse (i.e., many possible patterns of
ratings across raters with 0 observed frequencies). With very sparse data, the difference G²
statistic is no longer distributed as chi-squared, so that standard methods cannot be used to
determine its statistical significance.
References
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and
practice. Cambridge, Massachusetts: MIT Press, 1975.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia:
Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and
Hall, 1993.
6.0 Introduction
This section describes the tetrachoric and polychoric correlation coefficients, explains their
meaning and uses, gives examples and references, and discusses available software. While
the discussion is primarily oriented to rater agreement problems, it is general enough to apply to
most other uses of these statistics.
What distinguishes the present discussion is the view that the tetrachoric and polychoric
correlation models are special cases of latent trait modeling. (This is not a new observation,
but it is sometimes overlooked). Recognizing this opens up important new possibilities. In
particular, it allows one to relax the distributional assumptions which are the most limiting
feature of the "classical" tetrachoric and polychoric correlation models.
6.0.1 Summary
The tetrachoric correlation (Pearson, 1901), for binary data, and the polychoric correlation,
for ordered-category data, are excellent ways to measure rater agreement. They estimate what
the correlation between raters would be if ratings were made on a continuous scale; they are,
theoretically, invariant over changes in the number or "width" of rating categories. The
tetrachoric and polychoric correlations also provide a framework that allows testing of
marginal homogeneity between raters. Thus, these statistics let one separately assess both
components of rater agreement: agreement on trait definition and agreement on definitions of
specific categories.
These statistics make certain assumptions, however. With the polychoric correlation, the
assumptions can be tested. The assumptions cannot be tested with the tetrachoric correlation if
there are only two raters; in some applications, though, theoretical considerations may justify
the use of the tetrachoric correlation without a test of model fit.
6.1.1 Pros:
6.1.2 Cons:
• Model assumptions not always appropriate--for example, if the latent trait is truly
discrete.
• For only two raters, there is no way to test the assumptions of the tetrachoric
correlation.
Consider the example of two psychiatrists (Raters 1 and 2) making a diagnosis for
presence/absence of Major Depression. Though the diagnosis is dichotomous, we allow that
depression as a trait is continuously distributed in the population.
+---------------------------------------------------------------+
| |
| |
| | * |
| | * * |
| | * * |
| | * |* |
| | * | * |
| | ** | ** |
| | *** | *** |
| | *** | *** |
| | ***** | ***** |
| +--------------------------------+----------------> Y |
| not depressed t depressed |
| |
+---------------------------------------------------------------+
In diagnosing a given case, a rater considers the case's level of depression, Y, relative to some
threshold, t: if the judged level is above the threshold, a positive diagnosis is made; otherwise
the diagnosis is negative.
Figure 2 portrays the situation for two raters. It shows the distribution of cases in terms of
depression level as judged by Rater 1 and Rater 2.
a, b, c and d denote the proportion of cases that fall in each region defined by the two raters'
thresholds. For example, a is the proportion below both raters' thresholds and therefore
diagnosed negative by both.
+------------------------------------------------+
| |
| Rater 1 |
| - + |
| +-------+-------+ |
| -| a | b |a+b |
| Rater 2 +-------+-------+ |
| +| c | d |c+d |
| +-------+-------+ |
| a+c b+d 1 |
| |
+------------------------------------------------+
The polychoric correlation, used when there are more than two ordered rating levels, is a
straightforward extension of the model above. The difference is that there are more
thresholds, more regions in Figure 2, and more cells in Figure 3. But again the idea is to find
the values for the thresholds and r* that maximize the similarity between model-expected and
observed cross-classification proportions.
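To make the idea of model-expected proportions concrete, the short SAS data step below
computes the four cell proportions of Figure 3 from an assumed pair of thresholds and an
assumed value of r*, using the built-in bivariate normal CDF. This is only a minimal sketch;
the threshold and correlation values are hypothetical and chosen purely for illustration.

    /* Model-implied cell proportions for a 2x2 table, given         */
    /* hypothetical thresholds t1, t2 and latent correlation r       */
    data model_cells;
       t1 = 0.25;  t2 = 0.00;  r = 0.60;   /* hypothetical values    */
       a = probbnrm(t1, t2, r);            /* both ratings negative  */
       b = probnorm(t2) - a;               /* Rater 1 +, Rater 2 -   */
       c = probnorm(t1) - a;               /* Rater 1 -, Rater 2 +   */
       d = 1 - a - b - c;                  /* both ratings positive  */
       put a= b= c= d=;
    run;

Estimation reverses this calculation: the thresholds and r are adjusted until proportions
computed this way are as close as possible to the observed ones.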
7. Detailed Description
7.0 Introduction
In many situations, even though a trait may be continuous, it may be convenient to divide it
into ordered levels. For example, for research purposes, one may classify levels of headache
pain into the categories none, mild, moderate and severe. Even for a trait usually viewed as
discrete, one might still consider continuous gradations--for example, people infected with the
flu virus exhibit varying levels of symptom intensity.
The tetrachoric correlation and polychoric correlation coefficients are appropriate when the
latent trait that forms the basis of ratings can be viewed as continuous. We will outline here
the measurement model and assumptions for the tetrachoric correlation. The model and
assumptions for the polychoric correlation are the same--the only difference is that there are
more threshold parameters for the polychoric correlation, corresponding to the greater
number of ordered rating levels.
Let Y1 and Y2 be the latent continuous variables associated with the observed ratings X1 and
X2; these are the pre-discretized, continuous "impressions" of the trait level, as judged by
Raters 1 and 2.
A rating or diagnosis of a case begins with the case's true trait level, T. This information,
along with "noise" (random error) and perhaps other information unrelated to the true trait
which a given rater may consider (unique variation), leads to each rater's impression of the
case's trait level (Y1 and Y2). Each rater applies discretizing thresholds to this judged trait
level to yield a dichotomous or ordered-category rating (X1 and X2).
Y1 = bT + u1 + e1,
Y2 = bT + u2 + e2,
where b is a regression coefficient, u1 and u2 are the unique components of the raters'
impressions, and e1 and e2 represent random error or noise. It turns out that unique variation
and error variation behave more or less the same in the model, and the former can be
subsumed under the latter. Thus we may consider the simpler model:
Y1 = b1T + e1,
Y2 = b2T + e2.
The tetrachoric correlation assumes that the latent trait T is normally distributed. As scaling is
arbitrary, we specify that T ~ N(0, 1). Error is similarly assumed to be normally distributed
(and independent both between raters and across cases). For reasons we need not pursue here,
the model loses no generality by assuming that var(e1) = var(e2). We therefore stipulate that
e1, e2 ~ N(0, sigma_e). A consequence of these assumptions is that Y1 and Y2 must also be
normally distributed. To fix the scale, we specify that var(Y1) = var(Y2) = 1. It follows that
b1 = b2 = b = the correlation of both Y1 and Y2 with the latent trait.
r* = b²
+-------------------------------------+
| |
| |
| b b |
| Y1 <--- T ---> Y2 |
| |
| |
+-------------------------------------+
Here b is the path coefficient that reflects the influence of T on both Y1 and Y2. Those
familiar with the rules of path analysis will see that the correlation of Y1 and Y2 is simply the
product of their degree of dependence on T--that is, b².
As an aside, one might consider that the value of b is interesting in its own right, inasmuch as
it offers a measure of the association of ratings with the true latent trait--i.e., a measure of
rating validity or accuracy.
Assumptions 1--4 can be alternatively expressed as the assumption that Y1 and Y2 follow a
bivariate normal distribution.
We will assume that one has sufficient theoretical understanding of the application to accept
the assumption of latent continuity.
The second assumption--that of a normal distribution for T--is potentially more questionable.
Absolute normality, however, is probably not necessary; a unimodal, roughly symmetrical
distribution may be close enough. Also, the model implicitly allows for a monotonic
transformation of the latent continuous variables. That is, a more exact way to express
Assumptions 1-4 is that one can obtain a bivariate normal distribution by some monotonic
transformation of Y1 and Y2.
The model assumptions can be tested for the polychoric correlation. This is done by
comparing the observed numbers of cases for each combination of rating levels with those
predicted by the model. This is done with the likelihood ratio chi-squared test, G² (Bishop,
Fienberg & Holland, 1975), which is similar to the usual Pearson chi-squared test (the Pearson
chi-squared test can also be used; for more information on these tests, see the FAQ on testing
model fit on the Latent Class Analysis web site).
The G² test is assessed by considering the associated p value, with the appropriate degrees of
freedom (df). The df are given by:
df = RC - R - C,
where R is the number of levels used by the first rater and C is the number of levels used by
the second rater. As this is a "goodness-of-fit" test, it is standard practice to set the alpha level
fairly high (e.g., .10). A p value above the alpha level is consistent with acceptable model fit;
a p value below the alpha level is evidence against the model.
For the tetrachoric correlation R = C = 2, and there are no df with which to test the model. It
is possible to test the model, though, when there are more than two raters.
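As a small illustration of how the test might be evaluated once G² is in hand, the SAS lines
below compute df from the formula above and the corresponding upper-tail p value; the G², R,
and C values are arbitrary and used only for illustration.

    /* Evaluate a goodness-of-fit G2 against chi-squared with        */
    /* df = RC - R - C (all input values are hypothetical)           */
    data gof;
       g2 = 8.4;  r = 4;  c = 4;
       df = r*c - r - c;
       p  = 1 - probchi(g2, df);           /* upper-tail p value     */
       put df= p=;
    run;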
Here are the steps one might follow to use the tetrachoric or polychoric correlation to assess
agreement in a study. For convenience, we will mainly refer to the polychoric correlation,
which includes the tetrachoric correlation as a special case.
The first step is to estimate the polychoric correlation; for this a computer program, such as
those described in the software section, is required.
The next step is to determine if the assumptions of the polychoric correlation are empirically
valid. This is done with the goodness-of-fit test, described previously, that compares observed
cross-classification frequencies to model-predicted frequencies. As noted, this test cannot be
done for the tetrachoric correlation.
PRELIS includes a test of model fit when estimating the polychoric correlation. It is unknown
whether SAS PROC FREQ includes such a test.
Assuming that model fit is acceptable, the next step is to note the magnitude of the
polychoric correlation. Its value is interpreted in the same way as a Pearson correlation. As the
value approaches 1.0, more agreement on the trait definition is indicated. Values near 0
indicate little agreement on the trait definition.
One may wish to test the null hypothesis of no correlation between raters. There are at least
two ways to do this. The first makes use of the estimated standard error of the polychoric
correlation under the null hypothesis of r* = 0. At least for the tetrachoric correlation, there is
a simple closed-form expression for this standard error (Brown, 1977). Knowing this value,
one may calculate a z value as:
             r*
   z = --------------
        sigma_r*(0)
where the denominator is the standard error of r* where r* = 0. One may then assess statistical
significance by evaluating the z value in terms of the associated tail probabilities of the
standard normal curve.
The second method is via a chi-squared test. If r* = 0, the polychoric correlation model is the
same as the model of statistical independence. It therefore seems reasonable to test the null
hypothesis of r* = 0 by testing the statistical independence model. Either the Pearson (X²) or
likelihood-ratio (G²) chi-squared statistic can be used to test the independence model. The df
for either test is (R - 1)(C - 1). A significant chi-squared value implies that r* is not equal to 0.
[I now question whether the above is correct. For the polychoric correlation, data may fail
the test of independence even when r* = 0 (i.e., there may be some other kind of
'structure' to the data). If so, a better alternative would be to calculate a difference G² statistic
as:
G²(H0) - G²(H1),
where G²(H0) is the likelihood-ratio chi-squared for the independence model and G²(H1) is the
likelihood-ratio chi-squared for the polychoric correlation model. The difference G² can be
evaluated as a chi-squared value with 1 df. -- JSU, 27 Jul 00]
Equality of thresholds between raters can be tested by estimating what may be termed a
threshold-constrained polychoric correlation. That is, one estimates the polychoric
correlation with the added constraint(s) that the threshold(s) of Rater 1 is/are the same as
Rater 2's threshold(s). A difference G² test is then made comparing the G² statistic for this
constrained model with the G² for the unconstrained polychoric correlation model. The
difference G² statistic is evaluated as a chi-squared value with df = R - 1, where R is the
number of rating levels (this test applies only when both raters use the same number of rating
levels).
• Modifying measurement error assumptions. One can easily relax the assumptions
concerning measurement error. Hutchinson (2000) described models where the
variance of measurement error differs according to the latent trait level of the object
being rated. In theory, one could also consider non-Gaussian distributions for
measurement error.
• More than two raters. When there are more than two raters, the
tetrachoric/polychoric correlation model generalizes to a latent trait model with
normal-ogive (Gaussian cdf) response functions. Latent trait models can be used to (a)
estimate the tetrachoric/polychoric correlation among all rater pairs; (b)
simultaneously test whether all raters have the same definition of the latent trait; and
(c) simultaneously test for equivalence of thresholds among all raters.
7.3.1 Examples
                     Rater 1
      Rater 2      -      +     Total
      --------------------------------
         -        40     10       50
         +        20     30       50
      --------------------------------
      Total       60     40      100
Table 1 (draft)
For these data, the tetrachoric correlation (std. error) is:
rho 0.6071 (0.1152)
which is much larger than the Pearson correlation of 0.4082 calculated for the same data.
Thresholds for the two raters are also estimated as part of the model, essentially from each
rater's marginal rating proportions.
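As a computational sketch, the following SAS data step applies the two-step approach to the
counts in Table 1: the thresholds are taken from the marginal proportions, and r is then found
by a simple grid search over the likelihood. The layout of a, b, c, and d follows Figure 3
(columns = Rater 1, rows = Rater 2); the step size and variable names are arbitrary. Run this
way, the sketch should reproduce, to grid accuracy, approximately the tetrachoric correlation
reported above (about .61).

    /* Two-step tetrachoric correlation for Table 1 by grid search   */
    data tetra;
       a = 40;  b = 10;  c = 20;  d = 30;   /* cell counts, Table 1  */
       n  = a + b + c + d;
       t1 = probit((a + c) / n);            /* Rater 1 threshold     */
       t2 = probit((a + b) / n);            /* Rater 2 threshold     */
       best = -1e15;
       do r = -0.995 to 0.995 by 0.0005;
          pa = probbnrm(t1, t2, r);         /* expected proportions  */
          pb = probnorm(t2) - pa;
          pc = probnorm(t1) - pa;
          pd = 1 - pa - pb - pc;
          ll = a*log(pa) + b*log(pb) + c*log(pc) + d*log(pd);
          if ll > best then do;  best = ll;  rstar = r;  end;
       end;
       put t1= t2= rstar=;
    run;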
A second example uses data reported by Tallis on the number of lambs born to a sample of
ewes in two consecutive years (Table 2). Tallis suggested that the number of lambs born is a
manifestation of the ewe's fertility--a continuous and potentially normally distributed variable.
Clearly the situation is more complex than the simple "continuous normal variable plus
discretizing thresholds" assumptions allow for. We consider the data simply for the sake of a
computational example.
-----------------------------------
Lambs Lambs born in 1952
born in ------------------
1953 None 1 2 Total
-----------------------------------
None 58 52 1 111
1 26 58 3 87
2 8 12 9 29
-----------------------------------
Total 92 122 13 227
-----------------------------------
Table 2 (draft)
Drasgow (1988; see also Olsson, 1979) described two different ways to calculate the
polychoric correlation. The first method, the joint maximum likelihood (ML) approach,
estimates all model parameters--i.e., rho and the thresholds--at the same time.
The second method, two-step ML estimation, first estimates the thresholds from the one-way
marginal frequencies, then estimates rho, conditional on these thresholds, via maximum
likelihood. For the tetrachoric correlation, both methods produce the same results; for the
polychoric correlation, they may produce slightly different results.
The data in Table 2 are analyzed with the POLYCORR program (Uebersax, 2000).
Application of the joint ML approach produces the following estimates (standard errors):
The polychoric correlation (std. error) for these data is .954 using joint estimation. However,
there is reason to doubt the assumptions of the standard polychoric correlation model; the G²
model fit statistic is 57.33 on 24 df (p < .001).
Hutchinson (2000) showed that the data can be fit by allowing measurement error variance to
differ from low to high levels of the latent trait. Here, instead, we relax the assumption of a
normally distributed latent trait. Using the LLCA program (Uebersax, 1993a), a latent trait
model with a nonparametric latent trait distribution was fit to the data. The distribution was
represented as six equally-spaced locations (located latent classes) along a unidimensional
continuum, the density at each location (latent class prevalence) being estimated.
Model fit, assessed by the G² statistic, was 15.65 on 19 df (p = .660). The LLCA program gave
the correlation of each variable with the latent trait as .963. This value squared, .927,
estimates what the correlation of the raters would be if they made their ratings on a
continuous scale. This is a generalization of the polychoric correlation, though perhaps we
should reserve that term for the latent bivariate normal case. Instead, we simply term this the
latent correlation between the raters.
The estimated latent trait distribution (the density at each of the six locations) is shown below:
.5 + *
D | *
e .4 + *
n | *
s .3 + *
i | * *
t .2 + * *
y | * *
.1 + * * *
| * * * * *
+----*------*------*------*------*------*----
-2.5 -1.5 -0.5 0.5 1.5 2.5
The shape suggests that the latent trait could be economically modeled with an asymmetric
parametric distribution, such as a beta or exponential distribution.
A new, separate webpage has been added on the topic of factor analysis and SEM with
tetrachoric and polychoric correlations.
Tcorr is a simple utility for estimating a single tetrachoric correlation coefficient and
its standard error. Just enter the frequencies of a fourfold table and get the answer.
Also supplies threshold estimates.
Jim Fleming also has a program to estimate a matrix of tetrachoric correlations and
optionally smooth a poorly conditioned matrix.
TESTFACT is a very sophisticated program for item analysis using both classical and
modern psychometric (IRT) methods. It includes provisions for calculating tetrachoric
correlations.
POLYCORR is a program I've written to estimate the polychoric correlation and its
standard error using either joint ML or two-step estimation. Goodness-of-fit and a lot
of other information are also provided. Note: this program is just for a single pair of
variables, or a few considered two at a time. It does not estimate a matrix of tetra- or
polychoric correlations.
o Basic version. This handles square tables only (i.e., models where both items
have the same number of levels).
o Advanced version. This allows non-square tables and has other advanced
technical features, such as the ability to combine cells during estimation.
Mplus can estimate a matrix of polychoric and tetrachoric correlations and estimate
their standard errors. Two-step estimation is used. Features similar to
PRELIS/LISREL.
MicroFACT will estimate polychoric and tetrachoric correlations and standard errors.
Provisions for smoothing an improper correlation matrix are supplied. No goodness-
of-fit tests. A free student version that handles up to thirty variables can be
downloaded. Also does factor analysis.
SAS
A single polychoric or tetrachoric correlation can be calculated with the PLCORR
option of SAS PROC FREQ. Example:
proc freq;
tables var1*var2 / plcorr maxiter=100;
run;
Joint estimation is used. The standard error is supplied, but not thresholds. No
goodness-of-fit test is performed.
significantly increase run times). In any case, it's a good idea to check your SAS log,
which will contain a message if estimation did not converge for any pair of variables.
The macro is relatively slow (e.g., on a PC, a 50 x 50 matrix can take 5 minutes to
estimate; a 100 x 100 matrix four times as long).
SPSS
SPSS has no intrinsic procedure to estimate polychoric correlations. As noted above,
Dirk Enzmann has written an SPSS macro to estimate a matrix of tetrachoric
correlations.
R
John Fox has written an R program to estimate the polychoric correlation and its
standard error. A goodness-of-fit test is performed. Another R program for
polychorics has been written by David Duffy.
Stata
Stata's internal function for tetrachoric correlations is a very rough approximation
(e.g., actual tetrachoric correlation = .5172, Stata reports .6169!) based on Edwards
and Edwards (1984) and is unsuitable for many or most applications. A more accurate
external module has been written by Stas Kolenikov to estimate a matrix of
polychoric or tetrachoric correlations and their standard errors.
References
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and
practice. Cambridge, Massachusetts: MIT Press, 1975.
Brown MB. Algorithm AS 116: the tetrachoric correlation and its standard error.
Applied Statistics, 1977, 26, 343-351.
Loehlin JC. Latent variable models, 3rd ed. Lawrence Erlbaum, 1999.
Uebersax JS. LLCA: Located latent class analysis. Computer program documentation,
1993a.
Uebersax JS. POLYCORR: A program for estimation of the standard and extended
polychoric correlation coefficient. Computer program documentation, 2000.
Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating
agreement. Biometrics, 1993, 49, 823-835.
8.0 Introduction
Of all the methods discussed here for analyzing rater agreement, latent trait modeling is
arguably the best method for handling ordered category ratings. The latent trait model is
intrinsically plausible. More than most other approaches, it reflects a natural view of rater
decision-making. If one were interested only in developing a good model of how experts make
ratings--without concern for the subject of agreement--one could easily arrive at the latent
trait model. The latent trait model is closely related to signal detection theory, modern
psychometric theory, and factor analysis. The latent trait agreement model is also very flexible
and can be adapted to specific needs of a given study.
Given its advantages, it is surprising that this method is not used more often. The likely
explanation is its relative unfamiliarity and a mistaken perception that it is difficult or
esoteric. In truth, it is no more complex than many standard methods for categorical data
analysis.
The basic principles of latent trait models for rater agreement are sketched here (this will be
expanded as time permits). For more details, the reader may consult Uebersax (1992; oriented
to non-statisticians) and Uebersax (1993; a technical exposition), or other references listed in
the bibliography below.
If there are only two raters, the latent trait model is the same as the polychoric correlation
coefficient model.
The essence of the latent trait agreement model is contained in the measurement model,
Y = bT + e (1)
where:
“T” is the latent trait level of a given case;
“Y” is the perception or impression of a given rater of the case's trait level;
“b” is a regression coefficient; and
“e” is measurement error.
The latent trait is what the ratings intend to measure--for example, disease severity, subject
ability, or treatment effectiveness; this corresponds to the "signal" emitted by the case being
rated.
The term e corresponds to random measurement error or noise. The combined effect of T and
e is to produce a continuous variable, Y, which is the rater's impression of the signal. These
continuous impressions are converted to ordered category ratings as the rater applies
thresholds associated with the rating categories.
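To make the measurement model concrete, here is a minimal SAS simulation sketch for two
hypothetical raters: a latent trait T is drawn for each case, each rater forms a noisy impression
Y = bT + e, and thresholds convert the impressions to three ordered categories. Every
numerical value (the loading b, the thresholds, the sample size, the seed) is invented purely
for illustration.

    /* Simulate ordered-category ratings from the latent trait model */
    data simratings;
       call streaminit(20090101);           /* arbitrary seed         */
       b  = 0.8;                            /* hypothetical loading   */
       se = sqrt(1 - b*b);                  /* error sd so var(Y) = 1 */
       do case = 1 to 500;
          t  = rand('NORMAL');              /* latent trait, N(0,1)   */
          y1 = b*t + se*rand('NORMAL');     /* Rater 1 impression     */
          y2 = b*t + se*rand('NORMAL');     /* Rater 2 impression     */
          /* hypothetical thresholds at -0.5 and 0.7 for both raters  */
          x1 = 1 + (y1 > -0.5) + (y1 > 0.7);
          x2 = 1 + (y2 > -0.5) + (y2 > 0.7);
          output;
       end;
       keep case x1 x2;
    run;

    proc freq data=simratings;
       tables x1*x2 / plcorr;               /* should recover roughly b*b */
    run;

Since, with two raters, the latent trait model coincides with the polychoric correlation model,
the PLCORR estimate from the simulated table should come out near b² (here about .64).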
Model parameters are estimated from observed data. The basic estimated parameters are: (1)
parameters that describe the distribution of the latent trait in the sample or population; (2) the
regression coefficient, b, for each rater; and (3) the threshold locations for each rater. Model
variations may have more or fewer parameters.
Parameters are estimated by a computer algorithm that iteratively tests and revises parameter
values to find those which best fit the observed data; usually "best" means the maximum
likelihood estimates. Many different algorithms can be used for this.
The assumptions of the latent trait model are very mild. (Moreover, it should be noted that the
assumptions can be tested by evaluating the fit of the model to the observed data.)
The existence of a continuous latent trait, a simple additive model of "signal plus noise," and
thresholds that map a rater's continuous impressions into discrete rating categories are very
plausible.
One has latitude in choosing the form of the latent trait distribution. A normal (Gaussian)
distribution is most often assumed. If, as in many medical applications, this is considered
unsuitable, one can consider an asymmetric distribution; this is readily modeled as, say, a beta
distribution, which can be substituted for the normal distribution with no difficulty.
Still more flexible are versions that use a nonparametric latent trait distribution. This approach
models the latent trait distribution in a way analogous to a histogram, where the user controls
the number of bars, and each bar's height is optimized to best fit the data. In this way nearly
any latent trait distribution can be well approximated.
The usual latent trait agreement model makes two assumptions about measurement error. The
first is that it is normally distributed. The second is that, for any rater, measurement error
variance is constant.
Similar assumptions are made in many statistical models. Still one might wish to relax them.
Hutchinson (2000) showed how non-constant measurement error variance can be easily
included in latent trait agreement models. For example, measurement error variance can be
lower for cases with very high or low latent trait levels, or may increase from low to high
levels of the latent trait.
The latent trait agreement model supplies parameters that separately evaluate the degree of
association between raters, and differences in their category definitions. The separation of
these components of agreement and disagreement enables one to precisely target interventions
to improve rater consistency.
Association is expressed as a correlation between each rater's impressions (Y) and the latent
trait. A higher correlation means that a rater's impressions are more strongly associated with
the "average" impression of all other raters. A simple statistical test permits assessment of the
significance of rater differences in their correlation with the latent trait. One can also use the
model to express association between a pair of raters as a correlation between one rater's
impressions and those of the other; this measure is related to the polychoric correlation
coefficient.
Estimated rater thresholds can be displayed graphically. Their inspection, with particular
attention given to the distance between successive thresholds of a rater, shows how raters may
differ in the definition and widths of the rating categories. Again, these differences can be
statistically tested.
Finally, the model can be used to measure the extent to which one rater's impressions may be
systematically higher or lower than those of other raters--that is, for the existence of rater bias.
References
Heinen T. Latent class and discrete latent trait models: Similarities and differences.
Thousand Oaks, California: Sage, 1996.
Johnson VE, Albert JH. Modeling ordinal data. New York: Springer, 1999.
Uebersax JS. A review of modeling approaches for the analysis of observer agreement.
Investigative Radiology, 1992, 27, 738-743.
Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating
agreement. Biometrics, 1993, 49, 823-835.
9.0 Introduction
The odds ratio is an important option for testing and quantifying the association between two
raters making dichotomous ratings. It should probably be used more often with agreement
data than it currently is.
The odds ratio can be understood with reference to a 2×2 cross-classification table:

                     Rater 2
        Rater 1      +      -
           +         a      b      a+b
           -         c      d      c+d
[a/(a+b)] / [b/(a+b)]
OR = -----------------------, (1)
[c/(c+d)] / [d/(c+d)]
but this reduces to
a/b
OR = -----, (2)
c/d
or, as OR is usually calculated,
ad
OR = ----. (3)
bc
The last equation shows that OR is equal to the simple crossproduct ratio of a 2×2 table.
The concept of "odds" is familiar from gambling. For instance, one might say the odds of a
particular horse winning a race are "3 to 1"; this means the probability of the horse winning is
3 times the probability of not winning.
In Equation (2), both the numerator and denominator are odds. The numerator, a/b, gives the
odds of a positive versus negative rating by Rater 2 given that Rater 1's rating is positive. The
denominator, c/d, gives the odds of a positive versus negative rating by Rater 2 given that
Rater 1's rating is negative.
OR is the ratio of these two odds--hence its name, the odds ratio. It indicates how much the
odds of Rater 2 making a positive rating increase for cases where Rater 1 makes a positive
rating.
This alone would make the odds ratio a potentially useful way to assess association between
the ratings of two raters. However, it has some other appealing features as well. Note that:

             a/b     a/c     d/c     d/b
       OR = ----- = ----- = ----- = ----- .
             c/d     b/d     b/a     c/a
From this we see that the odds ratio can be interpreted in various ways. Generally, it shows
the relative increase in the odds of one rater making a given rating, given that the other rater
made the same rating--the value is invariant regardless of whether one is concerned with a
positive or negative rating, or which rater is the reference and which the comparison.
The odds ratio can be interpreted as a measure of the magnitude of association between the
two raters. The concept of an odds ratio is also familiar from other statistical methods (e.g.,
logistic regression).
9.2 Yule's Q
The odds ratio ranges from 0 to infinity, with OR = 1 indicating no association. It can be
rescaled to the familiar -1 to +1 range by Yule's Q:

             OR - 1
       Q = --------.
             OR + 1
It is often more convenient to work with the log of the odds ratio than with the odds ratio
itself. The formula for the standard error of log(OR) is very simple:

       SE[log(OR)] = sqrt(1/a + 1/b + 1/c + 1/d).

The significance of the association can be tested by calculating

       z = log(OR) / SE[log(OR)]

and referring to a table of the cumulative distribution of the standard normal curve to
determine the p-value associated with z. A confidence interval for log(OR) is given by

       log(OR) ± zL × SE[log(OR)],
where zL is the z value defining the appropriate confidence limits, e.g., zL = 1.645 or 1.96 for a
two-sided 90% or 95% confidence interval, respectively.
Exponentiating the endpoints gives a confidence interval for OR itself:

       exp[log(OR) ± zL × SE[log(OR)]].
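A minimal SAS data step implementing these formulas, with hypothetical cell counts and
zL = 1.96 for a 95% interval, might look as follows.

    /* Odds ratio, Yule's Q, z test, and 95% CI for a 2x2 table      */
    data oddsratio;
       a = 35;  b = 15;  c = 10;  d = 40;   /* hypothetical counts   */
       oddsr = (a*d) / (b*c);               /* Equation (3)          */
       logor = log(oddsr);
       selog = sqrt(1/a + 1/b + 1/c + 1/d); /* SE of log(OR)         */
       z     = logor / selog;
       lo95  = exp(logor - 1.96*selog);     /* lower 95% limit, OR   */
       hi95  = exp(logor + 1.96*selog);     /* upper 95% limit, OR   */
       q     = (oddsr - 1) / (oddsr + 1);   /* Yule's Q              */
       put oddsr= q= z= lo95= hi95=;
    run;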
Once one has used log OR or OR to assess association between raters, one may then also
perform a test of marginal homogeneity, such as the McNemar test.
9.4.1 Pros
9.4.2 Cons
9.5.1 Extensions
• More than two categories. In an N×N table (where N > 2), one might collapse the
table into various 2×2 tables and calculate log(OR) or OR for each. That is, for each
rating category k = 1, ..., N, one would construct the 2×2 table for the
crossclassification of Level k vs. all other levels for Raters 1 and 2, and calculate log
OR or OR. This assesses the association between raters with respect to the Level k vs.
not-Level k distinction.
This method is probably more appropriate for nominal ratings than for ordered-
category ratings. In either case, one might consider instead using loglinear, association, or
quasi-symmetry models.
• Multiple raters. For more than two raters, a possibility is to calculate log(OR) or OR
for all pairs of raters. One might then report, say, the average value and range of
values across all rater pairs.
9.5.2 Alternatives
Given data by two raters, the following alternatives to the odds ratio may be considered.
• In a 2×2 table, there is a close relationship between the odds ratio and loglinear
modeling. The latter can be used to assess both association and marginal homogeneity.
• Cook and Farewell (1995) presented a model that considers formal decomposition of a
2×2 table into independent components which reflect (1) the odds ratio and (2)
marginal homogeneity.
• The tetrachoric and polychoric correlations are alternatives when one may assume that
ratings are based on a latent continuous trait which is normally distributed. With more
than two rating categories, extensions of the polychoric correlation are available with
more flexible distributional assumptions.
• Association and quasi-symmetry models can be used for N×N tables, where ratings
are nominal or ordered-categorical. These methods are related to the odds ratio.
• When there are more than two raters, latent trait and latent class models can be used.
A particular type of latent trait model called the Rasch model is related to the odds
ratio.
References
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and
practice. Cambridge, Massachusetts: MIT Press, 1975.
Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal
agreement: two families of agreement measures. Canadian Journal of Statistics, 1995,
23, 333-344.
Fleiss JL. Statistical methods for rates and proportions, 2nd Ed. New York: John
Wiley, 1981.
Somes GW, O'Brien, KF. Odds ratio estimators. In Kotz L, Johnson NL (eds.),
Encyclopedia of statistical sciences, Vol. 6, pp. 407-410. New York: Wiley, 1988.
Sprott DA, Vogel-Sprott MD. The use of the log-odds ratio to assess the reliability of
dichotomous questionnaire data. Applied Psychological Measurement, 1987, 11, 307-
316.
10.0 Introduction
Here we depart from the main subject of our inquiry--agreement on categorical ratings--to
consider interval-level ratings.
Methods for analysis of interval-level rating agreement are better established than is true with
categorical data. Still, there is far from complete agreement about which methods are best,
and many, if not most published studies use less than ideal methods.
Our view is that the basic premise outlined for analyzing categorical data, that there are
different components of agreement and that these should be separately measured, applies
equally to interval-level data. It is only by separating the different components of agreement
that we can tell what steps are needed to improve agreement.
1. If the rating levels have integer anchors on a Likert-type scale, treat the ratings as
interval data. By a Likert-type scale, it is meant that the actual form on which raters
record ratings contains an explicit scale such as the following example:

          lowest                                        highest
           level                                         level
             1      2      3      4      5      6      7        (circle one)
2. If rating categories have been chosen by the researcher to represent at least
approximately equally-spaced levels, strongly consider treating the data as interval
level. For example, for rating the level of cigarette consumption, one has latitude in
defining categories such as "1-2 cigarettes per day," "1/2 pack per day," "1 pack per
day," etc. It is my observation, at least, that researchers instinctively choose categories
that represent more or less equal increments in a construct of "degree of smoking
involvement," justifying treatment of the data as interval level.
3. If the categories are few in number and the rating level anchors are
chosen/worded/formatted in such a way that does not imply any kind of equal spacing
of rating levels, treat the data as ordered categorical. This may apply even when
response levels are labeled with integers--for example, response levels of "1. None,"
"2. Mild," "3. Moderate," and "4. Severe." Note that here one could as easily substitute
the letters A, B, C and D for the integers 1, 2, 3 and 4.
4. If the ratings will, in subsequent research, be statistically analyzed as interval-level
data, then treat them as interval-level data for the reliability study. Conversely, if they
will be analyzed as ordered-categorical in subsequent research, treat them as ordered-
categorical in the reliability study.
Some who are statistically sophisticated may insist that nearly all ratings of this type should
be treated as ordered-categorical and analyzed with nonparametric methods. However, this
view fails to consider that one may also err by applying nonparametric methods when ratings
do meet the assumptions of interval-level data; specifically, by using nonparametric methods,
significant statistical power may be lost.
In this section we consider two general issues. The first is an explanation of three different
components of disagreement on interval-level ratings. The second issue concerns the general
strategy for examining rater differences.
Different causes may result in rater disagreement on a given case. With interval-level data,
these various causes have effects that can be broadly grouped into three categories: effects on
the correlation or association of raters' ratings, rater bias, and rater differences in the
distribution of ratings.
In making a rating, raters typically consider many factors. For example, in rating life quality, a
rater may consider separate factors of satisfaction with social relationships, job satisfaction,
economic security, health, etc. Judgments on these separate factors are combined by the rater
to produce a single overall rating.
Raters may vary in what factors they consider. Moreover, different raters may weight the
same factors differently, or they may use different "algorithms" to combine information on
each factor to produce a final rating.
Finally, a certain amount of random error affects a rating process. A patient's symptoms may
vary over time, raters may be subject to distractions, or the focus of a rater may vary. Because
of such random error, we would not even expect two ratings by a single rater of the same case
to always agree.
The combined effect of these issues is to reduce the correlation of ratings by different raters.
(This can be easily shown with formulas and a formal measurement model.) Said another
way, to the extent that raters' ratings correlate less than 1, we have evidence that the raters are
considering or weighting different factors and/or of random error and noise in the rating
process. When rater association is low, it implies that the study coordinator needs to apply
training methods to improve the consistency of raters' criteria. Group discussion conferences
may also be useful to clarify rater differences in their criteria, definitions, and interpretation of
the basic construct.
Rater bias refers to the tendency of a rater to make ratings generally higher or lower than
those of other raters. Bias may occur for several reasons. For example, in clinical situations,
some raters may tend to "overpathologize" or "underpathologize." Some raters may also
simply interpret the calibration of the rating scale differently so as to make generally higher or
lower ratings.
With interval-level ratings, rater bias can be assessed by calculating the mean rating of a rater
across all cases that they rate. High or low means, relative to the mean of all raters, indicate
positive or negative rater bias, respectively.
Sometimes an individual rater will have, when one examines all ratings made by the rater, a
noticeably different distribution than the distribution of ratings for all raters combined. The
reasons for this are somewhat more complex than is true for differences in rater association
and bias. Partly it may relate to rater differences in what they believe is the distribution of the
trait in the sample or population considered. At present, we mainly regard this as an empirical
issue: examination of the distribution of ratings by each rater may sometimes reveal important
differences. When such differences exist, some attempt should be made to reduce them, as
they are associated with decreased rater agreement.
In analyzing and interpreting results from a rater agreement study, and when more than two
raters are involved, one often thinks in terms of a comparison of each rater with every other
rater. This is relatively inefficient and, it turns out, often unnecessary. Most of the important
information to be gained can be more easily obtained by comparing each rater to some
measure of overall group performance. We term the former approach the Rater vs. Rater
strategy, and the latter the Rater vs. Group strategy.
Though it is the more common, the Rater vs. Rater approach requires more effort. For
example, with 10 raters, one needs to consider a 10 x 10 correlation matrix (actually, 45
correlations between distinct rater pairs). In contrast, a Rater vs. Group approach, which
might, for example, instead correlate each rater's ratings with the average rating across all
raters, would summarize results with only 10 correlations.
The prevalence of the Rater vs. Rater view is perhaps historical and accidental. Originally,
most rater agreement studies used only two raters--so methods naturally developed for the
analysis of rater pairs. As studies grew to include more raters, the same basic methods (e.g.,
kappa coefficients) were applied by considering all pairs of raters. What did not happen (as
seldom happens when paradigms evolve gradually) is a basic re-examination of, and new
selection of, methods.
This is not to say that the Rater vs. Rater approach is always bad, or that the Rater vs. Group
is always better. There is a place for both. Sometimes one wants to know the extent to which
different rater pairs vary in their level of agreement; then the Rater vs. Rater approach is
better. Other times one will wish merely to obtain information on the performance of each
rater in order to provide feedback and improve rating consistency; then the Rater vs. Group
approach may be better. (Of course, there is nothing to prevent the researcher from using both
approaches.) It is important mainly that the researcher realize that they have a choice, and
make an informed selection of methods.
We now direct attention to the question of which statistical methods to use to assess
association, bias, and rater distribution differences in an agreement study.
As already mentioned, from the Rater vs. Rater perspective, association can be summarized
by calculating a Pearson correlation (r) of the ratings for each distinct pair of raters.
Sometimes one may wish to report the entire matrix of such correlations. Other times it will
make sense to summarize the data as a mean, standard deviation, minimum and maximum
across all pairwise correlations.
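For a hypothetical file with one record per case and one rating variable per rater (say r1-r10),
the full matrix of pairwise correlations is a one-line request in SAS; the dataset and variable
names are placeholders.

    /* Pairwise Pearson correlations among ten raters */
    proc corr data=ratings;
       var r1-r10;
    run;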
From a Rater vs. Group perspective, there are two relatively simple ways to summarize rater
association. The first, already mentioned, is to calculate the correlation of each rater's ratings
with the average of all raters' ratings (this generally presupposes that all raters rate the same
set of cases or, at least, that each case is rated by the same number of raters.) The alternative is
to calculate the average correlation of a rater with every other rater--that is, to consider row or
column averages of the rater x rater correlation matrix. It should be noted that correlations
produced by the former method will be, on average, higher than those produced by the latter.
This is because average ratings are more reliable than individual ratings. However, the main
interest will be to compare different raters in terms of their correlation with the mean rating,
which is still possible; that is, the raters with the highest/lowest correlations with one method
will also be those with the highest/lowest correlations with the other.
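Continuing with the same hypothetical layout, the first option can be sketched in SAS by
adding the mean rating to each record and correlating every rater with it; again, all names are
placeholders.

    /* Correlate each rater with the average of all raters' ratings  */
    data withmean;
       set ratings;                         /* hypothetical dataset   */
       meanrate = mean(of r1-r10);          /* average of ten raters  */
    run;

    proc corr data=withmean;
       var r1-r10;                          /* each rater             */
       with meanrate;                       /* vs. the group average  */
    run;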
A much better method, however, is factor analysis. With this method, one estimates the
association of each rater with a latent factor. The factor is understood as a "proto-rating," or
the latent trait of which each rater's opinions are an imperfect representation. (If one wanted to
take an even stronger view, the factor could be viewed as representing the actual trait which is
being rated.)
• Using any standard statistical software such as SAS or SPSS, one uses the appropriate
routine to request a factor analysis of the data. In SAS, for example, one would use
PROC FACTOR.
• A common factor model is requested (not principal components analysis).
• A one-factor solution is specified; note that factor rotation does not apply with a one-
factor solution, so do not request this.
• One has some latitude in choice of estimation methods, but iterated principal factor
analysis is recommended. In SAS, this is called the PRINIT method.
• Do not request communalities fixed at 1.0. Instead, let the program estimate
communalities. If the program requests that you specify starting communality values,
request that squared multiple correlations (SMC) be used. (A minimal SAS example
implementing these choices is sketched below.)
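Putting these choices together, a minimal SAS call might look like the following; the dataset
ratings and the variables r1-r10 (one record per case, one variable per rater) are placeholders.

    /* One-factor common factor analysis of the raters' ratings:     */
    /* iterated principal factor (PRINIT) with SMC starting values   */
    proc factor data=ratings method=prinit priors=smc nfactors=1;
       var r1-r10;                          /* one variable per rater */
    run;

The output of interest is the column of factor loadings, discussed next.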
In examining the results, two parts of the output should be considered. First are the loadings
of each rater on the common factor. These are the same as the correlations of each rater's
ratings with the common factor. They can be interpreted as the degree to which a rater's
ratings are associated with the latent trait. The latent trait or factor is not, logically speaking,
the same as the construct being measured. For example, a patient's level of depression (the
construct) is a real entity. On the other hand, a factor or latent trait inferred from raters' ratings
is a surrogate--it is the shared perception or interpretation of this construct. It may be very
close to the true construct, or it may represent a shared misinterpretation. Still, lacking a "gold
standard," and if we are to judge only on the basis of raters' ratings, the factor represents our
best information about the level of the construct. And the correlation of raters with the factor
represents our best guess as to the correlation of raters' ratings with the true construct.
Within certain limitations, therefore, one can regard the factor loadings as upper-bound
estimates for the correlation of ratings with the true construct--that is, upper-bound estimates
of the validity of the ratings. If a loading is very high, then we only know that the validity of this
rater is below this number--not very useful information. However, if the loading is low, then
we know that the validity of the rater, which must be lower, is also low. Thus, in pursuing this
method, we are potentially able to draw certain inferences about rating validity--or, at least,
lack thereof, from agreement data (Uebersax, 1989).
While on this subject, it should be mentioned that there has been recent controversy about
using the Pearson correlation vs. using the intraclass correlation vs. using a new coefficient of
concordance. (Again, I will try to supply references.) I believe this controversy is misguided.
Critics are correct in saying that, for example, the Pearson correlation only assesses certain
types of disagreement. For example, if, for two raters, one rater's ratings are consistently X
units higher than another rater's ratings, the two raters will have a perfect Pearson correlation,
even though they disagree on every case.
However, our perspective is that this is really a strength of the Pearson correlation. The goal
should be to assess each component of rater agreement (association, bias, and distributional
differences) separately. The problem with these other measures is precisely that they attempt
to serve as omnibus indices that summarize all types of disagreement into a single number. In
so doing, they provide information of relatively little practical value; as they do not
distinguish among different components of disagreement, they do not enable one to identify
steps necessary to improve agreement.
Here is a "generic" statement that one can adapt to answer any criticisms of this nature:
"There is growing awareness that rater agreement should be viewed as having distinct
components, and that these components should be assessed distinctly, rather than
combined into a single omnibus index. To this end, a statistical modeling approach to
such data has been advocated (Agresti, 1992; Uebersax, 1992)."
The simplest way to express rater bias is to calculate the mean rating level made by each rater.
To compare rater differences (Rater vs. Rater approach), the simplest method is to perform a
paired t-test between each pair of raters. One may wish to perform a Bonferroni adjustment to
control the overall (across all comparisons) alpha level. However, this is not strictly necessary,
especially if one's aims are more exploratory or oriented toward informing "remedial"
intervention.
Another possibility is a one-way Analysis of Variance (ANOVA), in which raters are viewed
as the independent variable and ratings are the dependent variable. An ANOVA can assess
whether there are bias differences among raters considering all raters simultaneously (i.e., this
is related to the Rater vs. Group approach). If the ANOVA approach is used, however, one
will still want to list the mean ratings for each rater, and potentially perform "post hoc"
comparisons of each rater-pair's means--this is more rigorous, but will likely produce results
comparable to the t-test methods described above.
If the paper will be sent to, say, a psychology journal, it might be advisable to report results of
a one-way ANOVA along with results of formal "post hoc" comparisons of each rater pair.
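For the same hypothetical layout used earlier (one record per case, rating variables r1-r10),
the bias analyses above might be sketched in SAS as follows; a long-format copy of the data
is built for the one-way ANOVA, and all names are placeholders.

    /* Mean rating per rater (rater bias) */
    proc means data=ratings mean std;
       var r1-r10;
    run;

    /* Paired t-test comparing one pair of raters */
    proc ttest data=ratings;
       paired r1*r2;
    run;

    /* One-way ANOVA with rater as the independent variable */
    data long;
       set ratings;
       array rr{10} r1-r10;
       do rater = 1 to 10;
          rating = rr{rater};
          output;
       end;
       keep rater rating;
    run;

    proc glm data=long;
       class rater;
       model rating = rater;
       means rater / tukey;                 /* post hoc rater-pair comparisons */
    run;
    quit;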
It is possible to calculate statistical indices that reflect the similarity of one rater's ratings
distribution with that of another, or between each rater's distribution and the distribution for
all ratings. However such indices usually do not characterize precisely how two distributions
differ--merely whether or not they do differ. Therefore, if this is of interest, it is probably
more useful to rely on graphical methods. That is, one can graphically display the distribution
of each rater's ratings, and the overall distribution, and base comparisons on these displays.
Often rater agreement data are collected during a specific training phase of a project. In other
cases, there is not a formal training phase, but it is nonetheless expected that results can be
used to increase the consistency of future ratings.
Several formal and informal methods can be used to assist these tasks. Two are described
here.
The Delphi Method is a technique developed at the RAND Corporation to aid group
decision making. The essential feature is the use of quantitative feedback given to each
participant. The feedback consists of a numerical summary of that participant's decisions,
opinions, or, as applies here, ratings, along with a summary of the average decisions, opinions
or ratings across all participants. The assumption is that, provided with this feedback, a rater
will begin to make decisions or ratings more consistent with the group norm.
The method is easily adapted to multi-rater, interval-level data paradigms. It can be used in
conjunction with each of the three components of rater agreement already described.
To apply the method to rater bias, one would first calculate the mean rating for each rater in
the training phase. One would then prepare a figure showing the distribution of averages.
Figure 1 is a hypothetical example for 10 raters using a 5-point rating scale.
* **
* * * ** * *<---you
|----+----|----+----|----+----|----+----|
1 2 3 4 5
A copy of the figure is given to each rater. Each is annotated to show the average for that
rater, as shown in Figure 1.
A similar figure is used to give quantitative feedback on the association of each rater's ratings
with those of the other raters.
If one has performed a factor analysis of ratings, then the figure would show the distribution
of factor loadings across raters. If not, simpler alternatives are to display the distribution of
the average correlation of each rater with the other raters, or the correlation of each rater's
ratings with the average of all raters (or, alternatively, with the average of all raters other than
that particular rater).
Once again, a specifically annotated copy of the distribution is given to each rater. Figure 2 is
a hypothetical example for 10 raters.
you-->* * * * * * * * * *
|----+----|----+----|----+----|----+----|----+----|
  .5     .6     .7     .8     .9    1.0
Finally, one might consider giving raters feedback on the distribution of their ratings and how
this compares with raters overall. For this, each rater would receive a figure such as Figure 3,
showing the distribution of ratings for all raters and for the particular rater.
| |
| ** | **
% of | ** | **
ratings | ** ** ** | ** **
| ** ** ** ** | ** ** ** **
| ** ** ** ** ** | ** ** ** ** **
0 +---+----+----+----+----+ +---+----+----+----+----+
1 2 3 4 5 1 2 3 4 5
Rating Level Rating Level
Use of figures such as Figure 3 might be considered optional, as, to some extent, this overlaps
with the information provided in the Rater Bias figures. On the other hand, it may make a
rater's differences from the group norm more clear.
The second technique consists of having all raters or pairs of raters meet to discuss disagreed-
on cases.
The first step is to identify the cases that are the subject of most disagreement. If all raters rate
the same cases, one can simply calculate, for each case, the standard deviation of the different
ratings for that case. Cases with the largest standard deviations--say the top 10%--may be
regarded as ambiguous cases. These cases may then be re-presented to the set of raters who
meet as a group, discuss features of these cases, and identify sources of disagreement.
Alternatively, or if all raters do not rate the same cases, a similar method can be applied at the
level of pairs of raters. That is, for each pair of raters I and J, the cases that are most disagreed
on (cases with the greatest absolute difference between the rating by Rater I and the rating by
Rater J) are reviewed by the two raters who meet to discuss these and iron out differences.
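Assuming once more a file with one record per case and rating variables r1-r10, the most
disagreed-on cases can be flagged with a few SAS steps; the 10% cut-off follows the
suggestion above, and the dataset and variable names are placeholders.

    /* Flag the roughly 10% of cases with the largest rating spread  */
    data spread;
       set ratings;
       case_sd = std(of r1-r10);            /* SD of the ten ratings  */
    run;

    proc rank data=spread groups=10 descending out=ranked;
       var case_sd;
       ranks sd_decile;                     /* 0 = most disagreement  */
    run;

    proc print data=ranked;
       where sd_decile = 0;                 /* top decile of disagreement */
    run;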
References
http://recepozcan06.blogcu.com/