COMMUNICATIONS IN STATISTICS - SIMULATION AND COMPUTATION
https://doi.org/10.1080/03610918.2018.1490428
Weighted inter-rater agreement measures for
ordinal outcomes
Duyet Tran*, Anil Dolgun, and Haydar Demirhan
School of Science, Mathematical Sciences, RMIT University, Melbourne, Australia
ABSTRACT
Estimation of the degree of agreement between different raters is of crucial importance in the medical and social sciences. Many different approaches have been proposed in the literature for this purpose. In this article, we focus on inter-rater agreement measures for ordinal variables. The ordinal nature of the variable makes this estimation task more complicated. Although there are modified versions of inter-rater agreement measures for ordinal tables, there is no clear agreement on the use of a particular approach. We conduct an extensive Monte Carlo simulation study to evaluate and compare the accuracy of mainstream inter-rater agreement measures for ordinal tables and to determine the effect of different table structures on the accuracy of these measures. Our results provide detailed information about which measure to use with different table structures to obtain the most reliable inferences about the degree of agreement between two raters. Based on our simulation study, we recommend the use of Gwet's AC2 and Brennan-Prediger's κ when there is high agreement among raters. However, it should be noted that these coefficients overstate the extent of agreement among raters when there is no agreement and the data are unbalanced.

ARTICLE HISTORY
Received 8 January 2018
Accepted 13 June 2018

KEYWORDS
Inter-rater agreement; Kappa; Ordinal data; Weighted kappa; Weighting schemes; Monte Carlo simulation

MATHEMATICS SUBJECT CLASSIFICATION
62H17; 62H20
1. Introduction
In biomedical and behavioral sciences, the reliability of a rating system is usually evaluated by analyzing inter-rater agreement data. The inter-rater agreement coefficient (a.k.a. inter-rater reproducibility or concordance coefficient) is a statistical measure that quantifies the extent of agreement among observers. It gives a score that measures the degree of homogeneity or consensus in the ratings given by the observers. There are a number of statistics used to determine inter-rater agreement, and different statistics are appropriate for different types of measurements. Banerjee et al. (1999) and Yilmaz and Saracbasi (2017) provided a comprehensive overview of agreement measures, as shown in Table 1, not only for ordinal classification data but also for nominal, interval, and ratio scales. Ordinal data can arise when ratings have naturally ordered categories (e.g., disease severity: severe, moderate, mild). Assessing inter-rater agreement is also common for ordinal ratings, where disagreement between raters becomes more informative due to the hierarchy between the ordinal levels. For such settings, weighted inter-rater agreement measures allow the use of weights and take into account the importance of the disagreements between the ordinal categories (Warrens 2017).
* Faculty of Education, Physics Department, An Giang University, Long Xuyen City, An Giang Province, Vietnam.
Table 1. Previous agreement studies.

Number of raters | Coefficient | Author(s) | Type of ratings
Two raters | Lambda coefficient | Goodman and Kruskal (1954) | Nominal data
Two raters | Pi coefficient | Scott (1955) | Nominal and ordinal data
Two raters | Kappa coefficient | Cohen (1960) | Nominal data
Two raters | Intraclass correlation coefficient | Bloch and Kraemer (1989); Dunn (1989) | Interval and ratio data
Two raters | Weighted kappa coefficient | Cohen (1968) | Ordinal data
Two raters | AC1 coefficient | Gwet (2008) | Nominal data
Two raters | AC2 coefficient | Gwet (2008) | All types of data
Two raters | Aickin's alpha | Aickin (1990) | Interval and ratio data
Two raters | Bangdiwala's BN statistic | Bangdiwala (1988) | Ordinal data
Two raters | S coefficient | Brennan and Prediger (1981) | Nominal, ordinal and interval data
Multi-raters | Alpha coefficient | Krippendorff (2004) | All types of data
Multi-raters | Van Eerdewegh's V | Spitznagel and Helzer (1985) | Binary data
Multi-raters | Yule's Y or Q | Spitznagel and Helzer (1985) | Binary data
Multi-raters | Kappa coefficient | Light (1971) | Nominal and ordinal data
Multi-raters | Agreement coefficient | Fleiss (1971) | Nominal and ordinal data
Multi-raters | Coefficient of concordance (W) | Kendall (1955) | Ranked data / Ordinal data
Multi-raters | Hubert's kappa coefficient | Gautam (2014) | Ordinal and interval data
Multi-raters | S coefficient | Berry, Mielke, and Johnston (2016) | Nominal, ordinal, and interval data
Multi-raters | Kappa coefficient | Conger (1980) | Nominal and ordinal data
Multi-raters | Kappa coefficient | Randolph (2005) | Ordinal data
Existing methods for assessing agreement between two raters when ordinal classifications are being examined include Cohen's weighted kappa, Scott's π, Gwet's AC2 coefficient, the Brennan-Prediger kappa-like coefficient, and Krippendorff's α coefficient. Many of these measures are either extensions of Cohen's kappa and weighted kappa or are formulated as Cohen's kappa-like statistics (Fleiss 1971; Light 1971; Conger 1980). Therefore, they are prone to the same issues as the original Cohen's kappa, including sensitivity to the marginal distributions of the raters and to disease prevalence effects (Maclure and Willett 1987). Moreover, the choice of weighting scheme for weighted kappa and other weighted agreement measures has always been controversial. These measures have been criticized because the value of the weighted agreement depends on the choice of weights, and the choice of weights is subjective (Maclure and Willett 1987). There have been some attempts to support the use of quadratic weights, since the quadratically weighted kappa can be interpreted as an intraclass correlation coefficient (Fleiss and Cohen 1973; Schuster 2004). However, Warrens (2012) showed that for agreement tables with an odd number of categories, the value of the quadratically weighted kappa does not depend on the value of the center cell of the agreement table; hence, the quadratically weighted kappa fails as a measure of agreement. Vanbelle and Albert (2009) and Warrens (2011) showed that the linearly weighted kappa can be interpreted as a weighted average of the 2 × 2 tables' kappas. Recently, Moradzadeh, Ganjali, and Baghfalaki (2017) showed that the linear and quadratic weighted kappas can be computed as functions of unweighted kappas.
To the best of our knowledge, there is no generic guide for the selection of the weights, and current approaches to defining weights are limited in the sense that they rarely provide concrete evidence on which weighting scheme suits the data best or has less bias in terms of assessing the true inter-rater agreement.
Table 2. Inter-rater agreement classification for two raters.

Rater A \ Rater B | 1 | 2 | … | q | Total
1 | $n_{11}$ | $n_{12}$ | … | $n_{1q}$ | $n_{1+}$
2 | $n_{21}$ | $n_{22}$ | … | $n_{2q}$ | $n_{2+}$
… | … | … | … | … | …
q | $n_{q1}$ | $n_{q2}$ | … | $n_{qq}$ | $n_{q+}$
Total | $n_{+1}$ | $n_{+2}$ | … | $n_{+q}$ | $n$
The aim of this paper is to compare various weighting schemes and weighted inter-rater agreement coefficients and to provide useful information on the selection of the most appropriate method to apply in different settings of R × R tables. Our most important contribution is to identify which measure and weighting scheme combination has less bias and how their bias is affected by the degree of true inter-rater agreement, the structure of the R × R table, the number of ordinal ratings, and the total sample size. We conducted a Monte Carlo simulation study to evaluate and compare five weighted inter-rater agreement coefficients (i.e., Cohen's weighted kappa, Scott's π, Gwet's AC2 coefficient, the Brennan-Prediger coefficient, and Krippendorff's α coefficient) and six weighting schemes (i.e., unweighted, linear, quadratic, ordinal, radical, and ratio weights) in the context of two raters for ordinal data.
This paper is organized as follows. In Sec. 2, general information on weighted inter-rater agreement measures and weighting schemes is presented. In Sec. 3, the design of the Monte Carlo simulation study is given, and the results are presented and interpreted through visualizations before general conclusions are drawn in Sec. 4.
2. General information
In this section, after giving a quick overview of Cohen's unweighted kappa coefficient for nominal ratings, we present the weighted kappa coefficient and other commonly used weighted inter-rater agreement coefficients, as well as the weighting schemes applied for ordinal ratings.
2.1. Weighted inter-rater agreement coefficients
Suppose two raters independently classify the same group of $n$ subjects into one of $q \ge 3$ categories. The categories are defined in advance and the raters use the same $q$ categories. In the classification in Table 2, $n_{kl}$ and $p_{kl}$ denote the frequency and the joint probability corresponding to categories $k, l = 1, 2, \ldots, q$, respectively. The marginal totals are denoted by $n_{k+}$ for the first rater and $n_{+l}$ for the second rater, and the corresponding marginal probabilities $p_{k+}$ and $p_{+l}$ reflect how often the raters have used the categories.
Cohen’s kappa is a chance-corrected measure of agreement and used for nominal
categories (Cohen 1960). It is defined as
j¼
pa
1
pe
;
pe
(1)
where $p_a = \sum_{k=1}^{q} p_{kk}$ and $p_e = \sum_{k=1}^{q} p_{k+}\, p_{+k}$ are the proportion of observed agreement and the proportion of agreement expected by chance, respectively. The assumption of $\kappa$ is that the ratings of the raters are statistically independent, and it allows the marginal probabilities of success associated with the raters to differ (Banerjee et al. 1999). The advantages of Cohen's kappa are that it is always applicable, easy to calculate, available in general-purpose statistical software packages, and it condenses the relevant information into one coefficient. This is a reason why Cohen's kappa is often used to calculate inter-rater agreement among raters.
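As a minimal illustration, the sketch below (our own hypothetical helper, not part of the paper or of Gwet's (2017) functions) computes Cohen's unweighted kappa of Equation (1) from a q × q table of counts; the example table is made up for illustration.

```r
# Minimal sketch (assumed helper, not the authors' code): Cohen's unweighted kappa
# for a q x q contingency table of counts, following Equation (1).
cohen_kappa <- function(tab) {
  p  <- tab / sum(tab)                 # joint probabilities p_kl
  pa <- sum(diag(p))                   # observed agreement, sum of p_kk
  pe <- sum(rowSums(p) * colSums(p))   # chance agreement, sum of p_k+ * p_+k
  (pa - pe) / (1 - pe)
}

# Hypothetical 3 x 3 table of ratings from two raters
tab <- matrix(c(20,  5,  2,
                 4, 30,  6,
                 1,  7, 25), nrow = 3, byrow = TRUE)
cohen_kappa(tab)
```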
The value of kappa reflects the agreement among raters beyond chance. Therefore, negative values indicate that the observed agreement is less than that expected by chance alone, a value of 0 indicates exactly chance agreement, and positive values imply that the observed agreement is greater than that expected by chance. Mathematically, a value of 1 is hard to achieve, and the lower limit of kappa depends on the number of categories, so it can be undefined. Landis and Koch (1977) suggested that, for most purposes, values greater than 0.75 and below 0.40 represent excellent and poor agreement beyond chance, respectively, and values between 0.40 and 0.75 may be considered to represent fair to good agreement beyond chance. However, these recommendations are not based on scientific evidence, and it is still an open question how the magnitude of kappa should be judged.
When some disagreements between the two raters are more serious than others, kappa makes no such distinction and implicitly treats all disagreements equally. The weighted kappa ($\kappa_w$), which allows the use of weights and takes into account the extent of the disagreements between the categories, was introduced by Cohen (1968). $\kappa_w$ is defined as
$$\kappa_w = \frac{p_{a(w)} - p_{e(w)}}{1 - p_{e(w)}}, \qquad (2)$$
where the weighted proportion of observed agreement, $p_{a(w)}$, is
$$p_{a(w)} = \sum_{k=1}^{q} \sum_{l=1}^{q} w_{kl}\, p_{kl}, \qquad (3)$$
and the weighted proportion of agreement expected by chance, $p_{e(w)}$, is
$$p_{e(w)} = \sum_{k,l=1}^{q} w_{kl}\, p_{k+}\, p_{+l}. \qquad (4)$$
In Equation (2), the weights satisfy $0 \le w_{kl} \le 1$ for $k, l = 1, 2, \ldots, q$, with $w_{kl} = 1$ if $k = l$. Hence, the elements on the main diagonal of the contingency table $\{p_{kl}\}$ get the maximum weight of 1. Commonly used weighting schemes are given in Sec. 2.2. Note that Equation (3) gives the weighted proportion of observed agreement for all coefficients except Krippendorff's α.
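The following sketch (again a hypothetical helper, not the authors' implementation) evaluates Equations (2)-(4) for a q × q table of counts and a weight matrix w with ones on the diagonal.

```r
# Minimal sketch (assumed helper): Cohen's weighted kappa for a q x q table of
# counts and a q x q weight matrix w with w[k, k] = 1, following Eqs. (2)-(4).
weighted_kappa <- function(tab, w) {
  p   <- tab / sum(tab)                          # joint probabilities p_kl
  paw <- sum(w * p)                              # weighted observed agreement, Eq. (3)
  pew <- sum(w * outer(rowSums(p), colSums(p)))  # weighted chance agreement, Eq. (4)
  (paw - pew) / (1 - pew)
}

# With the identity weight matrix, this reduces to the unweighted kappa of Eq. (1).
tab <- matrix(c(20, 5, 2, 4, 30, 6, 1, 7, 25), nrow = 3, byrow = TRUE)
weighted_kappa(tab, diag(3))
```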
$\kappa_w$ has been criticized in the sense that the selection of the weights has a great influence on the magnitude of the agreement and the choice of weights is subjective (Maclure and Willett 1987). Moreover, because this measure is an extension of Cohen's kappa, it is sensitive to the marginal distributions of the raters and to disease prevalence effects (Maclure and Willett 1987).
Scott’s p is used to calculate chance corrected agreement probability based on marginal probabilities and it assumes that each rater may be characterized by the same
underlying success rate (Scott 1955). This measure has the same structure as
the weighted kappa statistic in Equation (2) but has a different definition of pe(w) given
in
peðwÞ ¼
q
X
xkl pk pl ;
(5)
k;l¼1
where
pk ¼
pkþ þ pþk
:
2
(6)
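A corresponding sketch (hypothetical helper) replaces the chance-agreement term of Equation (4) with Equations (5)-(6):

```r
# Minimal sketch (assumed helper): weighted Scott's pi, using the averaged
# marginals of Eq. (6) in the chance-agreement term of Eq. (5).
scott_pi <- function(tab, w) {
  p   <- tab / sum(tab)
  pk  <- (rowSums(p) + colSums(p)) / 2   # averaged marginals, Eq. (6)
  paw <- sum(w * p)                      # weighted observed agreement, Eq. (3)
  pew <- sum(w * outer(pk, pk))          # weighted chance agreement, Eq. (5)
  (paw - pew) / (1 - pew)
}
```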
Gwet’s AC2 coefficient is similar to jw in its formulation and its simplicity and provides
a reasonable chance-corrected agreement coefficient, in line with the percentage level of
agreement (Gwet 2002). It is calculated as
peðwÞ ¼
q
X
1
q
1 k¼1
pk ð1 pk Þ;
(7)
where q is the number of categories and
pk ¼
pkþ þ pþk
:
2
(8)
Wongpakaran et al. (2013) showed that Gwet’s AC2 is a more stable inter-rater agreement coefficient than the jw. It is also found to be less affected by prevalence and marginal probability than that of jw; and hence, it is recommended for assessing inter-rater
agreement with ordinal ratings (Wongpakaran et al. 2013; Gwet 2014).
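The sketch below (hypothetical helper) combines the weighted observed agreement of Equation (3) with the chance term exactly as printed in Equations (7)-(8). It is meant only to illustrate those formulas, not as a reference implementation of Gwet's AC2; for that, see Gwet (2014) or his R functions (Gwet 2017).

```r
# Minimal sketch (assumed helper): an AC2-style coefficient built from the
# chance-agreement term as printed in Eqs. (7)-(8).
gwet_ac2 <- function(tab, w) {
  p   <- tab / sum(tab)
  q   <- nrow(tab)
  pk  <- (rowSums(p) + colSums(p)) / 2   # averaged marginals, Eq. (8)
  paw <- sum(w * p)                      # weighted observed agreement, Eq. (3)
  pew <- sum(pk * (1 - pk)) / (q - 1)    # chance agreement, Eq. (7)
  (paw - pew) / (1 - pew)
}
```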
Brennan and Prediger (1981) proposed a kappa-like agreement coefficient in which the overall percent agreement remains as in Cohen's weighted kappa, but the percent chance agreement is taken as
$$p_{e(w)} = \frac{1}{q^2} \sum_{k,l=1}^{q} w_{kl}, \qquad (9)$$
where $q$ is the number of categories. Other authors, for example Bennett, Alpert, and Goldstein (1954), independently developed the same coefficient under different names, and it is often referred to in the literature as the Brennan-Prediger coefficient (Gwet 2014). This coefficient is recommended for use with two raters and an arbitrary number of ordinal ratings, although most authors suggest it for the case of two raters with binary ratings.
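A matching sketch (hypothetical helper) uses the chance term of Equation (9):

```r
# Minimal sketch (assumed helper): Brennan-Prediger-type coefficient, keeping the
# weighted observed agreement of Eq. (3) and using the chance term of Eq. (9).
brennan_prediger <- function(tab, w) {
  p   <- tab / sum(tab)
  q   <- nrow(tab)
  paw <- sum(w * p)     # weighted observed agreement, Eq. (3)
  pew <- sum(w) / q^2   # chance agreement, Eq. (9)
  (paw - pew) / (1 - pew)
}
```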
The Krippendorff’s a coefficient is a statistical measure of the extent of agreement
among raters and is regularly used by researchers in the area of content analysis
(Krippendorff 2004). The weighted proportion of observed agreement for Krippendoff’s
Alpha is
1
1
paðwÞ ¼ 1
(10)
pa0 þ ;
nr
nr
where
$$p_a' = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{q} \frac{r_{ik}\,(r^{+}_{ik} - 1)}{\bar{r}\,(r_i - 1)}, \qquad (11)$$
$$\bar{r} = \frac{1}{n} \sum_{i=1}^{n} r_i, \qquad (12)$$
and
$$r^{+}_{ik} = \sum_{l=1}^{q} w_{kl}\, r_{il}. \qquad (13)$$
The weighted proportion of expected agreement for the α coefficient is defined as
$$p_{e(w)} = \sum_{k,l=1}^{q} w_{kl}\, p_k\, p_l, \qquad (14)$$
where
$$p_k = \frac{1}{m} \sum_{i=1}^{m} \frac{r_{ik}}{\bar{r}}. \qquad (15)$$
In Equation (15), $m$ is the number of subjects rated by the two raters, $r_{ik}$ is the number of raters who assigned the particular score $x_k$ to subject $i$, $r_i$ is the number of raters who rated subject $i$, and $\bar{r}$ is the average number of raters per subject.
In this study, we only consider the methods of Cohen, Scott, Gwet, Brennan-Prediger, and Krippendorff, which are used for tables composed of ordinal variables. Other weighted agreement coefficients presented by Yule (1912), Scott (1955), Holley and Guilford (1964), Spitznagel and Helzer (1985), Feinstein and Cicchetti (1990), and Gwet (2002) are based on different assumptions related to the definition of $p_{e(w)}$ (Warrens 2010), but they are not appropriate for assessing agreement for ordinal categories. In the next section, we introduce the commonly used weighting schemes, which can be used with any inter-rater agreement coefficient given in this section.
2.2. Weighting schemes
A review of the different kinds of weighting schemes was given by Gwet (2014). In this article, we focus only on the following weighting schemes.
Unweighted:
$$w_{kl} = \begin{cases} 0, & k \neq l \\ 1, & k = l. \end{cases} \qquad (16)$$
Note that when these weights are used, the resulting inter-rater agreement measure is equal to its unweighted version.
Linear weights:
$$w_{kl} = \begin{cases} 1 - \dfrac{|k - l|}{q - 1}, & k \neq l \\ 1, & k = l, \end{cases} \qquad (17)$$
where $q$ is the number of categories and $k, l = 1, 2, \ldots, q$ are the categories for the first and the second rater, respectively.
Quadratic weights:
$$w_{kl} = \begin{cases} 1 - \dfrac{(k - l)^2}{(q - 1)^2}, & k \neq l \\ 1, & k = l. \end{cases} \qquad (18)$$
The quadratic weights are generally greater than the linear weights (Cohen 1968).
Ordinal weights:
$$w_{kl} = \begin{cases} 1 - \dfrac{M_{kl}}{M_{\max}}, & k \neq l \\ 1, & k = l, \end{cases} \qquad (19)$$
where
$$M_{kl} = \binom{\max(k, l) - \min(k, l) + 1}{2} \qquad (20)$$
and $M_{\max}$ is the largest of the $M_{kl}$ values. This set of weights only uses the order structure of the ratings. Gwet (2014) stated that the actual values of the ratings do not affect the magnitude of the ordinal weights because only their ranks do.
Radical weights:
$$w_{kl} = \begin{cases} 1 - \dfrac{\sqrt{|k - l|}}{\sqrt{q - 1}}, & k \neq l \\ 1, & k = l. \end{cases} \qquad (21)$$
Ratio weights:
$$w_{kl} = \begin{cases} 1 - \dfrac{\big((k - l)/(k + l)\big)^2}{\big((q - 1)/(q + 1)\big)^2}, & k \neq l \\ 1, & k = l. \end{cases} \qquad (22)$$
Note that one can also use arbitrary scores $x_k$ and $x_l$ to create the weights instead of the sequential numbering $k, l = 1, \ldots, q$. In such cases, the magnitudes of the linear, quadratic, radical, and ratio weights rely on the scores ($x_k$ and $x_l$) attached to the ordinal categories (Gwet 2014). Among these weights, only the ordinal weights are insensitive to the selection of scores, as they use the rankings of the ratings instead of their actual values.
Table 3. Abbreviations related to the simulation design.

R × R table | Abbr.
q = 3 | 3 × 3
q = 4 | 4 × 4
q = 5 | 5 × 5

Degree of true agreement | Abbr.
Low | L
Medium | M
High | H

Structure of the table | Abbr.
Balanced | B
Slightly unbalanced | U1
Heavily unbalanced | U2

Sample size | Abbr.
50 | n = 50
100 | n = 100
200 | n = 200
500 | n = 500
1000 | n = 1,000

Inter-rater agreement measure | Abbr.
Cohen's kappa | κ
Scott's pi | π
Gwet's AC2 | AC2
Brennan-Prediger | BP
Krippendorff's alpha | α

Weighting scheme | Abbr.
Unweighted | unweighted
Linear weights | linear
Quadratic weights | quadratic
Ordinal weights | ordinal
Radical weights | radical
Ratio weights | ratio
3. Simulation study
3.1. The scope
In this article, an extensive Monte Carlo simulation study is conducted to evaluate and compare five weighted inter-rater agreement coefficients and six weighting schemes in the context of two raters for ordinal data. The simulation space of the Monte Carlo study is composed of 4,050 different combinations of balanced and unbalanced R × R table structures, sample sizes, numbers of categories, and degrees of true inter-rater agreement, which are given in Table 3. With this large simulation space, we present a detailed analysis of the effects of those factors on the bias of combinations of inter-rater agreement measures and weights, using the mean absolute error and the mean squared error.
3.2. The true (population) inter-rater agreement coefficients
We used the Pearson correlation coefficient (ρ) under a bivariate normal distribution setting to set the true inter-rater agreement among the raters. Using this predetermined correlation structure, we were able to adequately quantify the true agreement (not chance-corrected) when both variables are evaluated on the same scale. The values for the true inter-rater agreement were fixed at ρ = 0.1, 0.6, and 0.9 for low, medium, and high inter-rater agreement, respectively.
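As a small sketch of this step (our own illustration, not the authors' code), correlated latent rater scores with a given true agreement level ρ can be drawn from a bivariate standard normal distribution:

```r
# Minimal sketch (assumed setup): latent scores for two raters with a
# pre-specified true agreement level rho under a bivariate standard normal model.
library(MASS)

rho   <- 0.9                                            # high true agreement
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)            # correlation matrix
Y     <- mvrnorm(n = 200, mu = c(0, 0), Sigma = Sigma)  # columns are Y1 and Y2
cor(Y[, 1], Y[, 2])                                     # sample correlation, close to rho
```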
3.3. The Monte Carlo simulation
In order to generate R × R tables with ordinal categories, we utilized the underlying variable approach (UVA) (Muthen 1984), which assumes that the observed ordinal variables are generated by underlying normally distributed continuous variables. Accordingly, we first generated $\{Y_{1i}, Y_{2i}\}$ for $i = 1, \ldots, n$ from a bivariate standard normal distribution with a pre-specified correlation structure using the mentioned ρ values. The true correlation among $\{Y_{1i}, Y_{2i}\}$ is taken as ρ = 0.1, 0.6, and 0.9 for low, medium, and high inter-rater agreement, respectively. Next, R × R contingency tables and the corresponding joint probabilities ($p_{kl}$'s) are constructed by discretization of $\{Y_{1i}, Y_{2i}\}$ using standard normal distribution quantiles. As the quantiles possess the ordered information of $Y_{1i}$ and $Y_{2i}$, discretization of $Y_{1i}$ and $Y_{2i}$ using quantiles ensures the ordinality of the generated variables. For example, in order to generate a 3 × 3 contingency table with balanced marginals, we used the standard normal quantiles $\Phi^{-1}(0.33)$ and $\Phi^{-1}(0.66)$ as cutoffs for $\{Y_{1i}, Y_{2i}\}$.
Figure 1. The MAE results for the 3 × 3 table where n = 50.
Similarly, we took $\Phi^{-1}(0.10)$ and $\Phi^{-1}(0.40)$ to generate the slightly unbalanced table structure, and $\Phi^{-1}(0.05)$ and $\Phi^{-1}(0.25)$ for the heavily unbalanced one. Therefore, the marginal distributions of the ordinal variables are specified using different quantiles, and these two marginal distributions are linked together through the correlation structure. The 4 × 4 and 5 × 5 tables are also generated using the same approach. Then, 10,000 samples from the multinomial distribution with parameters $p_{kl}$ are generated using n = 50, 100, 200, 500, and 1,000. For each of these 10,000 samples, the sample inter-rater agreement coefficient is calculated under the different combinations of balanced and unbalanced R × R table structures, sample sizes, numbers of categories, and degrees of true inter-rater agreement. The accuracy of the inter-rater agreement measure and weight combinations is assessed using the Monte Carlo estimators of the mean absolute error (MAE),
$$\mathrm{MAE} = \frac{1}{r} \sum_{i=1}^{r} \left| \kappa - \hat{\kappa}_i \right|,$$
and the mean squared error (MSE),
$$\mathrm{MSE} = \frac{1}{r} \sum_{i=1}^{r} \left( \kappa - \hat{\kappa}_i \right)^2,$$
where $r$ is the number of replications, $\kappa$ is the true inter-rater agreement coefficient, and $\hat{\kappa}_i$ is the inter-rater agreement estimate in the $i$th replication. The simulation code was written in the R language by the authors. To calculate the inter-rater agreement measures and weights, we utilized the R functions given by Gwet (2017). The results are visualized using the lattice package in R (Sarkar 2008) and interpreted in the next section.
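To make the design concrete, the sketch below (an assumed reimplementation, not the authors' code and not Gwet's (2017) functions) runs one cell of the simulation — a balanced 3 × 3 table with high true agreement and n = 50 — using the hypothetical weighted_kappa() and weight_matrix() helpers sketched in Sec. 2, and taking the fixed ρ value of Sec. 3.2 as the true coefficient κ in the MAE, which is how we read the design above.

```r
# Minimal sketch (assumed reimplementation) of one simulation cell: balanced
# 3 x 3 table, high true agreement (rho = 0.9), n = 50, quadratically weighted
# kappa, MAE criterion.
library(MASS)

set.seed(1)
q    <- 3
rho  <- 0.9                                 # true agreement level (Sec. 3.2)
n    <- 50
reps <- 1000                                # 10,000 replications in the paper
cuts <- c(-Inf, qnorm(c(0.33, 0.66)), Inf)  # balanced cutoffs Phi^-1(0.33), Phi^-1(0.66)
w    <- weight_matrix(q, "quadratic")       # hypothetical helper from Sec. 2.2 sketch

# Approximate the population joint probabilities p_kl by discretizing a large
# bivariate normal sample (the UVA step); exact cell probabilities could be
# obtained from the bivariate normal distribution instead.
Y   <- mvrnorm(1e6, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
pkl <- prop.table(table(cut(Y[, 1], cuts), cut(Y[, 2], cuts)))

# Monte Carlo replications: multinomial samples of size n from p_kl
est <- replicate(reps, {
  tab <- matrix(rmultinom(1, n, as.vector(pkl)), q, q)
  weighted_kappa(tab, w)                    # hypothetical helper from Sec. 2.1 sketch
})
mean(abs(rho - est))                        # MAE against the true agreement level rho
```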
Figure 2. The MAE results for the 3 × 3 table where n = 1,000.
3.4. Results
Not all of the numerical results are given here, in order to save space, but they are available upon request from the authors. Only the MAE results for n = 50 and n = 1,000 are given in Figures 1–6, as the MSE results are very similar to those obtained for the MAE. The results are interpreted with respect to the inter-rater agreement methods, the weighting schemes, the true level of inter-rater agreement, and the structure of the R × R table.
In terms of agreement measures, all measures perform similarly for the balanced R × R table structures, whereas they differ more with increasing unbalancedness. For a low degree of inter-rater agreement, Cohen's kappa, Scott's π, and Krippendorff's α perform better than Gwet's AC2 and Brennan-Prediger's kappa. For medium and high degrees of inter-rater agreement, Gwet's AC2 and Brennan-Prediger's kappa perform better than Cohen's kappa, Scott's π, and Krippendorff's α, not only for the slightly unbalanced case but also for the heavily unbalanced table structures. When the table size is increased, all the measures give smaller MAE and MSE values.
In terms of weighting schemes, we get similar results for the MAE and MSE measures. The quadratic weight has the best accuracy in most situations regardless of the measure chosen. However, Gwet's AC2 and Brennan-Prediger's kappa perform well with other weights for the unbalanced structures. Specifically, Gwet's AC2 performs well in combination with the linear weight for slightly unbalanced structures. When used with radical weights, Gwet's AC2 performs well in heavily unbalanced 4 × 4 and 5 × 5 tables. For Brennan-Prediger's kappa, the linear weight yields the best results for heavily unbalanced table structures, while for slightly unbalanced cases, the ordinal weight yields the most accurate results for 4 × 4 and 5 × 5 tables, and the quadratic weights for 3 × 3 tables. Overall, the accuracy of the measures is sensitive to the weights used if the table of interest is clearly unbalanced and the true agreement is not that low.
In terms of the level of agreement, all inter-rater agreement measures perform well when the true inter-rater agreement is low, except for Gwet's AC2 and Brennan-Prediger's kappa in unbalanced tables.
Figure 3. The MAE results for the 4 × 4 table where n = 50.
Figure 4. The MAE results for the 4 × 4 table where n = 1,000.
For the majority of the scenarios, the MAEs are smaller than 0.4. The MAE and MSE values of the measures are lowest when the true inter-rater agreement is low, and they are close to each other for medium and high inter-rater agreement.
In terms of table structures, balanced table structures usually have smaller mean error values than the slightly unbalanced and heavily unbalanced tables. In unbalanced tables, Cohen's κ, Scott's π, and Krippendorff's α always have lower MAE than Gwet's AC2 and Brennan-Prediger's κ when the true agreement is low. The situation is the opposite for medium and high degrees of agreement. Specifically, with slightly unbalanced structures, Gwet's AC2 performs best for high agreement. For a medium degree of agreement, Gwet's AC2 and Brennan-Prediger's kappa perform similarly, whereas for the heavily unbalanced structure with a medium degree of agreement, Brennan-Prediger's kappa performs best.
Figure 5. The MAE results for the 5 × 5 table where n = 50.
Figure 6. The MAE results for the 5 × 5 table where n = 1,000.
4. Conclusion
We compare the accuracy of Cohen's κ, Scott's π, Gwet's AC2 coefficient, the Brennan-Prediger coefficient, and Krippendorff's α coefficient in combination with six weighting schemes (unweighted, linear, quadratic, ordinal, radical, and ratio weights) in the context of two raters for ordinal data using a Monte Carlo simulation approach. Using the results of this simulation, we identify which inter-rater measure and weighting scheme combination has less bias, and how their bias is affected by the degree
of true inter-rater agreement, the structure of the R × R table, the number of ordinal ratings, and the total sample size. The main findings of our study are summarized as follows:
• All measures perform similarly for balanced table structures. However, for a low degree of inter-rater agreement, Cohen's kappa, Scott's π, and Krippendorff's α perform better than Gwet's AC2 and Brennan-Prediger's kappa. Conversely, for medium and high degrees of inter-rater agreement, Gwet's AC2 and Brennan-Prediger's kappa perform better than Cohen's kappa, Scott's π, and Krippendorff's α.
• Unbalancedness in the cell counts of the considered table is the most influential factor on the accuracy of the inter-rater agreement measures; it negatively impacts the accuracy of the mentioned measures.
• The accuracy of the measures is also sensitive to the weights used if the table of interest is highly unbalanced.
• In the majority of the scenarios, the values of the error measures are small for low agreement, high for medium agreement, and in between for high agreement, except in the situations with Gwet's AC2 and Brennan-Prediger's kappa.
• When the underlying inter-rater agreement is low and the table is unbalanced, Gwet's AC2 and Brennan-Prediger's κ in combination with linear, quadratic, ordinal, radical, and ratio weights should be avoided. For such cases, Cohen's kappa, Scott's π, and Krippendorff's α perform well with any type of weights.
• Overall, the accuracy of the measures is sensitive to the weights used if the table of interest is clearly unbalanced and the true agreement is not that low. For the unbalanced table structures, provided that the inter-rater agreement is high, Gwet's AC2 and Brennan-Prediger's κ can be used with any type of weights.
• Gwet's AC2 performs well in combination with the linear weight for slightly unbalanced structures. When used with radical weights, Gwet's AC2 performs well in heavily unbalanced 4 × 4 and 5 × 5 tables. For Brennan-Prediger's kappa, the linear weight yields the best results for heavily unbalanced table structures, while for slightly unbalanced cases, the ordinal weight yields the most accurate results.
In terms of agreement measures, all measures perform similarly for the balanced R × R table structures, whereas they differ more with increasing unbalancedness. Specifically, we recommend the use of Gwet's AC2 and Brennan-Prediger's κ for unbalanced tables with medium and high agreement levels. However, it should be noted that these coefficients overstate the extent of agreement among raters when there is no agreement and the data are unbalanced.
All the inferences given in this article should be considered within the limits of our simulation space, which is nevertheless large enough to generalize the inferences on the accuracy of inter-rater agreement measures for ordinal data.
Acknowledgment
The authors would like to acknowledge the valuable comments and suggestions of the anonymous reviewer, which have improved the quality of this paper. Also, the authors gratefully acknowledge the generous financial support of VIED and RMIT. Finally, the authors’ thanks are due to
Dr. Gwet for kindly granting permission to include his functions to calculate various kappa-like measures.
Disclosure statement
No potential conflict of interest was reported by the authors.
Funding
Duyet Tran received financial support from VIED and RMIT.
ORCID
Duyet Tran
http://orcid.org/0000-0003-1720-9591
Anil Dolgun
http://orcid.org/0000-0002-2693-0666
Haydar Demirhan
http://orcid.org/0000-0002-8565-4710
References
Aickin, M. 1990. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics 46 (2):293–302.
Banerjee, M., M. Capozzoli, L. McSweeney, and D. Sinha. 1999. Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics 27 (1):3–23.
Bangdiwala, S. I. 1988. The agreement chart. Chapel Hill, NC: Department of Biostatistics, University of North Carolina at Chapel Hill.
Bennett, E. M., R. Alpert, and A. C. Goldstein. 1954. Communications through limited-response
questioning. Public Opinion Quarterly 18 (3):303–8.
Berry, K. J., P. W. Mielke, and J. E. Johnston. 2016. Permutation statistical methods: An integrated
approach. Cham: Springer.
Bloch, D. A., and H. C. Kraemer. 1989. 2 × 2 kappa coefficients: Measures of agreement or association. Biometrics 45 (1):269–87.
Brennan, R. L., and D. J. Prediger. 1981. Coefficient kappa: Some uses, misuses, and alternatives.
Educational and Psychological Measurement 41 (3):687–99.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20 (1):37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or
partial credit. Psychological Bulletin 70 (4):213–20. doi:10.1037/h0026256.
Conger, A. J. 1980. Integration and generalization of kappas for multiple raters. Psychological
Bulletin 88 (2):322–8.
Dunn, G. 1989. Design and analysis of reliability studies: The statistical evaluation of measurement
errors. Oxford, UK: Edward Arnold Publishers.
Feinstein, A. R., and D. V. Cicchetti. 1990. High agreement but low kappa: I. the problems of
two paradoxes. Journal of Clinical Epidemiology 43 (6):543–9.
Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin
76 (5):378–82.
Fleiss, J. L., and J. Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation
coefficient as measures of reliability. Educational and Psychological Measurement 33 (3):613–9.
Gautam, S. 2014. A-kappa: a measure of agreement among multiple raters. Journal of Data
Science 12:697–716.
Goodman, L. A., and W. H. Kruskal. 1954. Measures of association for cross classifications.
Journal of the American Statistical Association 49 (268):732–64.
Gwet, K. 2002. Kappa statistic is not satisfactory for assessing the extent of agreement between
raters. Statistical Methods for Inter-Rater Reliability Assessment 1 (6):1–6.
Gwet, K. L. 2008. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61 (1):29–48.
Gwet, K. L. 2014. Handbook of inter-rater reliability: The definitive guide to measuring the extent
of agreement among raters. Gaithersburg, MD: Advanced Analytics, LLC.
Gwet, K. L. 2017. R functions for calculating agreement coefficients. Accessed August 2, 2018.
http://www.agreestat.com/r_functions.html.
Holley, J. W., and J. P. Guilford. 1964. A note on the g index of agreement. Educational and
Psychological Measurement 24 (4):749–53.
Kendall, M. G. 1955. Rank correlation methods. New York: Hafner.
Krippendorff, K. 2004. Measuring the reliability of qualitative text analysis data. Quality &
Quantity 38 (6):787–800.
Landis, J. R., and G. G. Koch. 1977. The measurement of observer agreement for categorical
data. Biometrics 33 (1):159–74.
Light, R. J. 1971. Measures of response agreement for qualitative data: Some generalizations and
alternatives. Psychological Bulletin 76 (5):365–77.
Maclure, M., and W. C. Willett. 1987. Misinterpretation and misuse of the kappa statistic.
American Journal of Epidemiology 126 (2):161–9.
Moradzadeh, N., M. Ganjali, and T. Baghfalaki. 2017. Weighted kappa as a function of
unweighted kappas. Communications in Statistics-Simulation and Computation 46:1–12.
Muthen, B. 1984. A general structural equation model with dichotomous, ordered categorical,
and continuous latent variable indicators. Psychometrika 49 (1):115–32.
Randolph, J. J. 2005. Free-marginal multirater kappa (multirater κ [free]): An alternative to Fleiss'
fixed-marginal multirater kappa. Presented at the Joensuu Learning and Instruction
Symposium, vol. 2005.
Sarkar, D. 2008. Lattice: Multivariate data visualization with R. New York: Springer. http://lmdvr.
r-forge.r-project.org.
Schuster, C. 2004. A note on the interpretation of weighted kappa and its relations to other rater
agreement statistics for metric scales. Educational and Psychological Measurement 64
(2):243–53.
Scott, W. A. 1955. Reliability of content analysis: The case of nominal scale coding. Public
Opinion Quarterly 19 (3):321–5.
Spitznagel, E. L., and J. E. Helzer. 1985. A proposed solution to the base rate problem in the
kappa statistic. Archives of General Psychiatry 42 (7):725–8.
Vanbelle, S., and A. Albert. 2009. A note on the linearly weighted kappa coefficient for ordinal
scales. Statistical Methodology 6 (2):157–63.
Warrens, M. J. 2010. Inequalities between kappa and kappa-like statistics for k × k tables.
Psychometrika 75 (1):176–85.
Warrens, M. J. 2011. Cohen's linearly weighted kappa is a weighted average of 2 × 2 kappas.
Psychometrika 76 (3):471–86.
Warrens, M. J. 2012. Some paradoxical results for the quadratically weighted kappa.
Psychometrika 77 (2):315–23.
Warrens, M. J. 2017. Symmetric kappa as a function of unweighted kappas. Communications in
Statistics-Simulation and Computation 46:1–6.
Wongpakaran, N., T. Wongpakaran, D. Wedding, and K. L. Gwet. 2013. A comparison of
Cohen’s kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: A study
conducted with personality disorder samples. BMC Medical Research Methodology 13 (1):61.
doi:10.1186/1471-2288-13-61.
Yilmaz, A. E., and T. Saracbasi. 2017. Assessing agreement between raters from the point of coefficients and log-linear models. Journal of Data Science 15 (1):1–24.
Yule, G. U. 1912. On the methods of measuring association between two attributes. Journal of
the Royal Statistical Society 75 (6):579–652.