
Weighted inter-rater agreement measures for ordinal outcomes

2018, Communications in Statistics - Simulation and Computation

Estimation of the degree of agreement between different raters is of crucial importance in medical and social sciences, and many different approaches have been proposed in the literature for this purpose. In this article, we focus on inter-rater agreement measures for ordinal variables, where the ordinal nature of the ratings makes the estimation task more complicated. Although there are modified versions of inter-rater agreement measures for ordinal tables, there is no clear consensus on the use of a particular approach. We conduct an extensive Monte Carlo simulation study to evaluate and compare the accuracy of mainstream inter-rater agreement measures for ordinal tables and to determine the effect of different table structures on their accuracy. Our results provide detailed information about which measure to use with different table structures in order to obtain the most reliable inferences about the degree of agreement between two raters. Based on our simulation study, we recommend the use of Gwet's AC2 and Brennan-Prediger's κ when there is high agreement among raters. However, it should be noted that these coefficients overstate the extent of agreement among raters when there is no agreement and the data are unbalanced.

Duyet Tran, Anil Dolgun, and Haydar Demirhan
School of Science, Mathematical Sciences, RMIT University, Melbourne, Australia
DOI: 10.1080/03610918.2018.1490428. Published online: 27 October 2018.

ARTICLE HISTORY: Received 8 January 2018; accepted 13 June 2018.
KEYWORDS: Inter-rater agreement; Kappa; Ordinal data; Weighted kappa; Weighting schemes; Monte Carlo simulation.
MATHEMATICS SUBJECT CLASSIFICATION: 62H17; 62H20.

1. Introduction

In biomedical and behavioral sciences, the reliability of a rating system is usually evaluated by analyzing inter-rater agreement data. The inter-rater agreement coefficient (also known as the inter-rater reproducibility or concordance coefficient) is a statistical measure that quantifies the extent of agreement among observers. It gives a score that measures the degree of homogeneity or consensus in the ratings given by observers. There are a number of statistics used to determine inter-rater agreement, and different statistics are appropriate for different types of measurements. Banerjee et al. (1999) and Yilmaz and Saracbasi (2017) provided comprehensive overviews of agreement measures, summarized in Table 1, not only for ordinal classification data but also for nominal, interval, and ratio scales.
Table 1. Previous agreement studies.

Two raters
Coefficient | Author(s) | Type of ratings
Lambda coefficient | Goodman and Kruskal (1954) | Nominal data
Pi coefficient | Scott (1955) | Nominal and ordinal data
Kappa coefficient | Cohen (1960) | Nominal data
Intraclass correlation coefficient | Bloch and Kraemer (1989); Dunn (1989) | Interval and ratio data
Weighted kappa coefficient | Cohen (1968) | Ordinal data
AC1 coefficient | Gwet (2008) | Nominal data
AC2 coefficient | Gwet (2008) | All types of data
Aickin's alpha | Aickin (1990) | Interval and ratio data
Bangdiwala's BN statistic | Bangdiwala (1988) | Ordinal data
S coefficient | Brennan and Prediger (1981) | Nominal, ordinal, and interval data

Multi-raters
Coefficient | Author(s) | Type of ratings
Alpha coefficient | Krippendorff (2004) | All types of data
Van Eerdewegh's V | Spitznagel and Helzer (1985) | Binary data
Yule's Y or Q | Spitznagel and Helzer (1985) | Binary data
Kappa coefficient | Light (1971) | Nominal and ordinal data
Agreement coefficient | Fleiss (1971) | Nominal and ordinal data
Coefficient of concordance (W) | Kendall (1955) | Ranked data / ordinal data
Kappa coefficient | Gautam (2014) | Ordinal and interval data
Kappa coefficient | Berry, Mielke, and Johnston (2016) | Nominal, ordinal, and interval data
Hubert's kappa coefficient | Conger (1980) | Nominal and ordinal data
S coefficient | Randolph (2005) | Ordinal data

Ordinal data can arise when ratings have natural ordered categories (e.g., disease severity: severe, moderate, mild). Assessing inter-rater agreement is also common for ordinal ratings, where disagreement between raters becomes more informative due to the hierarchy between the ordinal levels. For such settings, the weighted inter-rater agreement measures allow the use of weights and take into account the importance of the disagreements between the ordinal categories (Warrens 2017). Existing methods for assessing agreement between two raters when ordinal classifications are being examined include Cohen's weighted kappa, Scott's π, Gwet's AC2 coefficient, the Brennan-Prediger kappa-like coefficient, and Krippendorff's α coefficient. Many of these measures are either extensions of Cohen's kappa and weighted kappa or are formulated as Cohen's kappa-like statistics (Fleiss 1971; Light 1971; Conger 1980). Therefore, they are prone to the same issues as the original Cohen's kappa, including sensitivity to the marginal distributions of the raters and to disease prevalence effects (Maclure and Willett 1987).

Moreover, the choice of weighting scheme for weighted kappa and other weighted agreement measures has always been controversial. These measures have been criticized because the value of a weighted agreement coefficient depends on the choice of weights, and that choice is subjective (Maclure and Willett 1987). There have been some attempts to support the use of quadratic weights, since the quadratically weighted kappa can be interpreted as an intraclass correlation coefficient (Fleiss and Cohen 1973; Schuster 2004). However, Warrens (2012) showed that for agreement tables with an odd number of categories, the value of the quadratically weighted kappa does not depend on the value of the center cell of the agreement table; hence, the quadratically weighted kappa fails as a measure of agreement.
Vanbelle and Albert (2009) and Warrens (2011) showed that the linearly weighted kappa can be interpreted as a weighted average of the kappas of the 2 × 2 subtables. Recently, Moradzadeh, Ganjali, and Baghfalaki (2017) showed that the linearly and quadratically weighted kappas can be computed as functions of unweighted kappas. To the best of our knowledge, there is no generic guide for the selection of the weights, and current approaches to defining weights are limited in the sense that they rarely provide concrete evidence on which weighting scheme would suit the data best or would have less bias in terms of assessing the true inter-rater agreement.

The aim of this paper is to compare various weighting schemes and weighted inter-rater agreement coefficients and to provide useful information on the selection of the most appropriate method to apply in different settings of R × R tables. Our most important contribution is to identify which measure and weighting scheme combination has less bias and how their bias is affected by the degree of true inter-rater agreement, the structure of the R × R table, the number of ordinal ratings, and the total sample size. We conducted a Monte Carlo simulation study to evaluate and compare 5 different weighted inter-rater agreement coefficients (Cohen's weighted kappa, Scott's π, Gwet's AC2 coefficient, the Brennan-Prediger coefficient, and Krippendorff's α coefficient) and 6 weighting schemes (unweighted, linear, quadratic, ordinal, radical, and ratio weights) in the context of two raters for ordinal data.

This paper is organized as follows. In Sec. 2, general information on weighted inter-rater agreement measures and weighting schemes is presented. In Sec. 3, the simulation design of the Monte Carlo study is given, and the results are presented and interpreted through visualizations before general conclusions are drawn in Sec. 4.

2. General information

In this section, after giving a quick overview of Cohen's unweighted kappa coefficient for nominal ratings, we present the weighted kappa coefficient and other commonly used weighted inter-rater agreement coefficients, as well as the weighting schemes applied for ordinal ratings.

2.1. Weighted inter-rater agreement coefficients

Suppose two raters independently classify the same group of n subjects into one of q ≥ 3 categories. The categories are defined in advance and the raters use the same q categories.

Table 2. Inter-rater agreement classification for two raters.

            Rater B
Rater A     1      2      ...    q      Total
1           n_11   n_12   ...    n_1q   n_1+
2           n_21   n_22   ...    n_2q   n_2+
...         ...    ...    ...    ...    ...
q           n_q1   n_q2   ...    n_qq   n_q+
Total       n_+1   n_+2   ...    n_+q   n

In the classification in Table 2, n_kl and p_kl denote the frequency and the joint probability corresponding to categories k, l = 1, 2, ..., q, respectively. The marginal totals are denoted by n_k+ for the first rater and n_+l for the second rater, and the corresponding marginal probabilities p_k+ and p_+l reflect how often the raters have used the categories.

Cohen's kappa is a chance-corrected measure of agreement used for nominal categories (Cohen 1960). It is defined as

$$\kappa = \frac{p_a - p_e}{1 - p_e}, \qquad (1)$$

where p_a = Σ_{k=1}^{q} p_kk and p_e = Σ_{k=1}^{q} p_k+ p_+k are the proportion of observed agreement and the proportion of agreement expected by chance, respectively. The assumption of κ is that the ratings of the raters are statistically independent, and it allows the marginal probabilities of success associated with the raters to differ (Banerjee et al. 1999).
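As a concrete illustration of Equation (1), the following short base R sketch computes the unweighted kappa from a q × q table of counts. The function name cohen_kappa and the example table are our own illustrative choices and are not taken from the authors' code.

```r
# Cohen's unweighted kappa (Equation 1) from a q x q table of counts;
# rows index the first rater's categories, columns the second rater's.
cohen_kappa <- function(tab) {
  p  <- tab / sum(tab)                  # joint probabilities p_kl
  pa <- sum(diag(p))                    # observed agreement, sum of p_kk
  pe <- sum(rowSums(p) * colSums(p))    # chance agreement, sum of p_k+ * p_+k
  (pa - pe) / (1 - pe)
}

# Hypothetical 3 x 3 example
tab <- matrix(c(20,  5,  2,
                 4, 25,  6,
                 1,  3, 34), nrow = 3, byrow = TRUE)
cohen_kappa(tab)
```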
The advantages of Cohen's kappa are that it is always applicable, easy to calculate, available in general-purpose statistical software packages, and that it condenses the relevant information into one coefficient. This is a reason why Cohen's kappa is so often used to quantify inter-rater agreement. The value of kappa reflects the agreement among raters beyond chance: negative values indicate that the observed agreement is less than that expected by chance alone, a value of 0 indicates exactly chance agreement, and positive values imply that the observed agreement is greater than that expected by chance. Mathematically, a value of 1 is hard to achieve, and the lower limit of kappa depends on the number of categories, so it can be undefined. Landis and Koch (1977) suggested that, for most purposes, values greater than 0.75 and below 0.40 represent excellent and poor agreement beyond chance, respectively, and that values between 0.40 and 0.75 may be considered to represent fair to good agreement beyond chance. However, these recommendations are not based on scientific evidence, and it is still an open question how the magnitude of kappa should be judged. When some disagreements between two raters are more serious than others, kappa makes no distinction between them, implicitly treating all disagreements equally.

The weighted kappa (κ_w), which allows the use of weights and takes into account the extent of the disagreements between the categories, was introduced by Cohen (1968). κ_w is defined as

$$\kappa_w = \frac{p_{a(w)} - p_{e(w)}}{1 - p_{e(w)}}, \qquad (2)$$

where the weighted proportion of observed agreement, p_a(w), is

$$p_{a(w)} = \sum_{k=1}^{q}\sum_{l=1}^{q} \omega_{kl}\, p_{kl}, \qquad (3)$$

and the weighted proportion of agreement expected by chance, p_e(w), is

$$p_{e(w)} = \sum_{k,l=1}^{q} \omega_{kl}\, p_{k+}\, p_{+l}. \qquad (4)$$

In Equation (2), the weights satisfy 0 ≤ ω_kl ≤ 1 for k, l = 1, 2, ..., q, and ω_kl = 1 if k = l. Hence, the elements on the main diagonal of the contingency table {p_kl} get the maximum weight of 1. Commonly used weighting schemes are given in Sec. 2.2. Note that Equation (3) is the weighted proportion of observed agreement for all coefficients except Krippendorff's α. κ_w has been criticized in the sense that the selection of the weights has a great influence on the magnitude of agreement and the choice of weights is subjective (Maclure and Willett 1987). Moreover, because this measure is an extension of Cohen's kappa, it is sensitive to the marginal distributions of the raters and to disease prevalence effects (Maclure and Willett 1987).

Scott's π calculates a chance-corrected agreement probability based on marginal probabilities and assumes that each rater may be characterized by the same underlying success rate (Scott 1955). This measure has the same structure as the weighted kappa statistic in Equation (2) but uses a different definition of p_e(w), given by

$$p_{e(w)} = \sum_{k,l=1}^{q} \omega_{kl}\, \pi_k\, \pi_l, \qquad (5)$$

where

$$\pi_k = \frac{p_{k+} + p_{+k}}{2}. \qquad (6)$$

Gwet's AC2 coefficient is similar to κ_w in its formulation and its simplicity, and it provides a reasonable chance-corrected agreement coefficient that is in line with the percentage level of agreement (Gwet 2002). Its chance-agreement term is calculated as

$$p_{e(w)} = \frac{1}{q-1}\sum_{k=1}^{q} \pi_k\,(1 - \pi_k), \qquad (7)$$

where q is the number of categories and

$$\pi_k = \frac{p_{k+} + p_{+k}}{2}. \qquad (8)$$

Wongpakaran et al. (2013) showed that Gwet's AC2 is a more stable inter-rater agreement coefficient than κ_w. It is also found to be less affected by prevalence and marginal probabilities than κ_w; hence, it is recommended for assessing inter-rater agreement with ordinal ratings (Wongpakaran et al. 2013; Gwet 2014).
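The coefficients above differ only in how the chance-agreement term p_e(w) is defined. The base R sketch below evaluates Equations (2)-(8) exactly as printed in this section for a user-supplied weight matrix; the function name weighted_coef, the decision to return Cohen's, Scott's, and Gwet's versions together, and the omission of Krippendorff's α (whose observed-agreement term differs) are our own illustrative choices rather than the authors' implementation.

```r
# Weighted agreement coefficients of the form (pa_w - pe_w) / (1 - pe_w),
# with the chance-agreement terms of Equations (4), (5), and (7).
# 'tab' is a q x q matrix of counts, 'w' a q x q weight matrix (diagonal = 1).
weighted_coef <- function(tab, w) {
  p     <- tab / sum(tab)               # joint probabilities p_kl
  p_row <- rowSums(p)                   # p_k+ (first rater's marginals)
  p_col <- colSums(p)                   # p_+l (second rater's marginals)
  pi_k  <- (p_row + p_col) / 2          # average marginals, Eqs. (6) and (8)
  q     <- nrow(tab)

  pa_w <- sum(w * p)                    # Eq. (3): weighted observed agreement

  pe <- c(cohen = sum(w * outer(p_row, p_col)),      # Eq. (4)
          scott = sum(w * outer(pi_k, pi_k)),        # Eq. (5)
          gwet  = sum(pi_k * (1 - pi_k)) / (q - 1))  # Eq. (7)

  (pa_w - pe) / (1 - pe)
}

# With identity weights, each coefficient reduces to its unweighted version,
# e.g. weighted_coef(tab, diag(nrow(tab))) for the example table above.
```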
Brennan and Prediger (1981) proposed a kappa-like agreement coefficient in which the overall percent agreement remains as in Cohen's weighted kappa, but the percent chance agreement is taken as

$$p_{e(w)} = \frac{1}{q^2}\sum_{k,l=1}^{q} \omega_{kl}, \qquad (9)$$

where q is the number of categories. Some other authors, for example Bennett, Alpert, and Goldstein (1954), independently developed the same coefficient under different names, and it is often referred to in the literature as the Brennan-Prediger coefficient (Gwet 2014). This coefficient is recommended for use with two raters and an arbitrary number of ordinal ratings, while most authors suggest it for the case of two raters with binary ratings.

Krippendorff's α coefficient is a statistical measure of the extent of agreement among raters and is regularly used by researchers in the area of content analysis (Krippendorff 2004). The weighted proportion of observed agreement for Krippendorff's α is

$$p_{a(w)} = \left(1 - \frac{1}{n\bar{r}}\right) p_{a0} + \frac{1}{n\bar{r}}, \qquad (10)$$

where

$$p_{a0} = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q} \frac{r_{ik}\,\left(r^{*}_{ik} - 1\right)}{\bar{r}\,(r_i - 1)}, \qquad (11)$$

$$\bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i, \qquad (12)$$

and

$$r^{*}_{ik} = \sum_{l=1}^{q} \omega_{kl}\, r_{il}. \qquad (13)$$

The weighted proportion of expected agreement for the α coefficient is defined as

$$p_{e(w)} = \sum_{k,l=1}^{q} \omega_{kl}\, \pi_k\, \pi_l, \qquad (14)$$

where

$$\pi_k = \frac{1}{m}\sum_{i=1}^{m} \frac{r_{ik}}{\bar{r}}. \qquad (15)$$

In Equation (15), m is the number of subjects rated by the two raters, r_ik is the number of raters who assigned the particular score x_k to subject i, and r̄ is the average number of raters per subject.

In this study, we only consider the methods of Cohen, Scott, Gwet, Brennan-Prediger, and Krippendorff, which are used for tables composed of ordinal variables. Other weighted agreement coefficients presented by Yule (1912), Scott (1955), Holley and Guilford (1964), Spitznagel and Helzer (1985), Feinstein and Cicchetti (1990), and Gwet (2002) are based on different assumptions related to the definition of p_e(w) (Warrens 2010), but they are not appropriate for assessing agreement for ordinal categories. In the next section, we introduce the commonly used weighting schemes, which can be used with any inter-rater agreement coefficient given in this section.

2.2. Weighting schemes

A review of the different kinds of weighting schemes is given by Gwet (2014). In this article, we focus on the following weighting schemes.

Unweighted:

$$\omega_{kl} = \begin{cases} 0, & k \neq l \\ 1, & k = l. \end{cases} \qquad (16)$$

Note that when these weights are used, the resulting inter-rater agreement measure is equal to its unweighted version.

Linear weights:

$$\omega_{kl} = \begin{cases} 1 - \dfrac{|k - l|}{q - 1}, & k \neq l \\ 1, & k = l, \end{cases} \qquad (17)$$

where q is the number of categories and k, l = 1, 2, ..., q index the categories used by the first and the second rater, respectively.

Quadratic weights:

$$\omega_{kl} = \begin{cases} 1 - \dfrac{(k - l)^2}{(q - 1)^2}, & k \neq l \\ 1, & k = l. \end{cases} \qquad (18)$$

The quadratic weights are generally greater than the linear weights (Cohen 1968).

Ordinal weights:

$$\omega_{kl} = \begin{cases} 1 - \dfrac{M_{kl}}{M_{\max}}, & k \neq l \\ 1, & k = l, \end{cases} \qquad (19)$$

where

$$M_{kl} = \binom{\max(k, l) - \min(k, l) + 1}{2} \qquad (20)$$

and M_max is the largest of the M_kl values. This set of weights only uses the order structure of the ratings. Gwet (2014) stated that the actual values of the ratings do not affect the magnitude of the ordinal weights because only their ranks do.

Radical weights:

$$\omega_{kl} = \begin{cases} 1 - \dfrac{\sqrt{|k - l|}}{\sqrt{q - 1}}, & k \neq l \\ 1, & k = l. \end{cases} \qquad (21)$$

Ratio weights:

$$\omega_{kl} = \begin{cases} 1 - \dfrac{\big((k - l)/(k + l)\big)^2}{\big((q - 1)/(q + 1)\big)^2}, & k \neq l \\ 1, & k = l. \end{cases} \qquad (22)$$

Note that one can also use arbitrary scores x_k and x_l to create the weights instead of the sequential numbers k, l = 1, ..., q. In such cases, the magnitudes of the linear, quadratic, radical, and ratio weights depend on the scores (x_k and x_l) attached to the ordinal categories (Gwet 2014). Among these weights, only the ordinal weights are insensitive to the selection of scores, as they use the rankings of the ratings instead of their actual values.
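To make the weighting schemes concrete, the sketch below constructs the q × q weight matrices of Equations (16)-(22) in base R, using the category indices 1, ..., q as scores, as in the equations above. The function name make_weights and the scheme labels are our own; this is an illustrative sketch, not the authors' code. The resulting matrix can be passed to a weighted coefficient routine such as the weighted_coef sketch in Sec. 2.1.

```r
# Weight matrices for Equations (16)-(22), using category scores 1..q.
make_weights <- function(q, scheme = c("unweighted", "linear", "quadratic",
                                       "ordinal", "radical", "ratio")) {
  scheme <- match.arg(scheme)
  k <- matrix(1:q, q, q)        # row index (first rater's category)
  l <- t(k)                     # column index (second rater's category)
  w <- switch(scheme,
    unweighted = (k == l) * 1,
    linear     = 1 - abs(k - l) / (q - 1),
    quadratic  = 1 - (k - l)^2 / (q - 1)^2,
    ordinal    = {              # Eqs. (19)-(20): M_kl = choose(|k - l| + 1, 2)
      M <- choose(pmax(k, l) - pmin(k, l) + 1, 2)
      1 - M / max(M)
    },
    radical    = 1 - sqrt(abs(k - l)) / sqrt(q - 1),
    ratio      = 1 - ((k - l) / (k + l))^2 / ((q - 1) / (q + 1))^2
  )
  diag(w) <- 1                  # diagonal weights are always 1
  w
}

# Example: quadratic weights for a 4-category ordinal scale
make_weights(4, "quadratic")
```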
3. Simulation study

3.1. The scope

In this article, an extensive Monte Carlo simulation study is conducted to evaluate and compare 5 different weighted inter-rater agreement coefficients and 6 weighting schemes in the context of two raters for ordinal data. The simulation space of the Monte Carlo study is composed of 4,050 different combinations of balanced and unbalanced R × R table structures, sample sizes, numbers of categories, and degrees of true inter-rater agreement, which are given in Table 3. With this large simulation space, we present a detailed analysis of the effects of these factors on the bias of the combinations of inter-rater agreement measures and weights, using the mean absolute error and the mean squared error.

Table 3. Abbreviations related to the simulation design.

R × R table (q = 3, 4, 5): 3 × 3, 4 × 4, 5 × 5
Degree of true agreement: Low (L), Medium (M), High (H)
Structure of the table: Balanced (B), Slightly unbalanced (U1), Heavily unbalanced (U2)
Sample size: n = 50, 100, 200, 500, 1,000
Inter-rater agreement measure: Cohen's kappa (κ), Scott's pi (π), Gwet's AC2 (AC2), Brennan-Prediger (BP), Krippendorff's alpha (α)
Weighting scheme: unweighted, linear, quadratic, ordinal, radical, ratio

3.2. The true (population) inter-rater agreement coefficients

We used the Pearson correlation coefficient (ρ) under the bivariate normal distribution setting to set the true inter-rater agreement among raters. Using this predetermined correlation structure, we were able to adequately quantify the true agreement (not chance-corrected) when both variables are evaluated on the same scale. The values of the true inter-rater agreement were fixed at ρ = 0.1, 0.6, and 0.9 for low, medium, and high inter-rater agreement, respectively.

3.3. The Monte Carlo simulation

In order to generate R × R tables with ordinal categories, we utilized the underlying variable approach (UVA) (Muthen 1984), which assumes that the observed ordinal variables are generated by underlying normally distributed continuous variables. Accordingly, we first generated {Y_1i, Y_2i} for i = 1, ..., n from the bivariate standard normal distribution with a pre-specified correlation structure using the mentioned ρ values. The true correlation among {Y_1i, Y_2i} is taken as ρ = 0.1, 0.6, and 0.9 for low, medium, and high inter-rater agreement, respectively. Next, the R × R contingency tables and the corresponding joint probabilities (the p_kl's) are constructed by discretizing {Y_1i, Y_2i} using standard normal distribution quantiles. As the quantiles preserve the ordering of Y_1i and Y_2i, discretization using quantiles ensures the ordinality of the generated variables. For example, in order to generate a 3 × 3 contingency table with balanced marginals, we used the standard normal quantiles Φ⁻¹(0.33) and Φ⁻¹(0.66) as cutoffs for {Y_1i, Y_2i}. Similarly, we took Φ⁻¹(0.10) and Φ⁻¹(0.40) to generate the slightly unbalanced table structure, and Φ⁻¹(0.05) and Φ⁻¹(0.25) for the heavily unbalanced one. Therefore, the marginal distributions of the ordinal variables are specified using different quantiles, and these two marginal distributions are linked together through the correlation structure. The 4 × 4 and 5 × 5 tables are generated using the same approach.
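The generation scheme described above can be sketched in a few lines of R: draw correlated bivariate standard normal scores, cut them at standard normal quantiles, and tabulate the resulting ordinal ratings. This is a simplified illustrative variant that discretizes the simulated latent scores directly rather than first deriving the cell probabilities and sampling from the multinomial distribution as in the study; the function generate_table and its arguments are our own assumptions, not the authors' code.

```r
library(MASS)  # for mvrnorm()

# One simulated R x R agreement table via the underlying variable approach:
# correlated bivariate standard normal scores, discretized at the standard
# normal quantiles given by 'cut_probs' (a vector of length q - 1).
generate_table <- function(n, rho, cut_probs) {
  q     <- length(cut_probs) + 1
  Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
  Y     <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
  cuts  <- c(-Inf, qnorm(cut_probs), Inf)
  r1 <- cut(Y[, 1], breaks = cuts, labels = FALSE)   # first rater's ratings
  r2 <- cut(Y[, 2], breaks = cuts, labels = FALSE)   # second rater's ratings
  table(factor(r1, levels = 1:q), factor(r2, levels = 1:q))
}

set.seed(1)
# 3 x 3 tables with n = 100 and high true agreement (rho = 0.9)
generate_table(100, 0.9, c(0.33, 0.66))   # balanced marginals
generate_table(100, 0.9, c(0.10, 0.40))   # slightly unbalanced
generate_table(100, 0.9, c(0.05, 0.25))   # heavily unbalanced
```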
Then, 10,000 samples from the multinomial distribution with the cell probabilities p_kl are generated for each of n = 50, 100, 200, 500, and 1,000. For each of these 10,000 samples, the sample inter-rater agreement coefficients are calculated for the different combinations of balanced and unbalanced R × R table structures, sample sizes, numbers of categories, and degrees of true inter-rater agreement. The accuracy of each inter-rater agreement measure and weight combination is assessed using the Monte Carlo estimators of the mean absolute error (MAE),

$$\mathrm{MAE} = \frac{1}{r}\sum_{i=1}^{r} \left| \kappa - \hat{\kappa}_i \right|,$$

and the mean squared error (MSE),

$$\mathrm{MSE} = \frac{1}{r}\sum_{i=1}^{r} \left( \kappa - \hat{\kappa}_i \right)^2,$$

where r is the number of replications, κ is the true inter-rater agreement coefficient, and κ̂_i is the inter-rater agreement estimate in the ith replication. The simulation codes are written in the R language by the authors. In order to calculate the inter-rater agreement measures and weights, we utilized the R functions given by Gwet (2017). The results are visualized using the lattice package in R (Sarkar 2008) and interpreted in the next section.
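Putting the pieces together, a minimal version of the accuracy assessment for one cell of the simulation space might look as follows. It reuses the illustrative helpers sketched earlier (generate_table, make_weights, weighted_coef), uses far fewer replications than the 10,000 used in the study, and takes the latent correlation ρ as the true agreement value when computing the MAE, as in Sec. 3.2; none of this is the authors' actual simulation code.

```r
# Monte Carlo estimate of the MAE for one design cell, reusing the helpers
# sketched above (generate_table, make_weights, weighted_coef).
mae_for_cell <- function(reps = 500, n = 100, rho = 0.9,
                         cut_probs = c(0.10, 0.40), scheme = "quadratic") {
  q <- length(cut_probs) + 1
  w <- make_weights(q, scheme)
  est <- replicate(reps, weighted_coef(generate_table(n, rho, cut_probs), w))
  rowMeans(abs(est - rho))   # MAE of each coefficient against the true rho
}

set.seed(42)
# Slightly unbalanced 3 x 3 tables, high agreement, quadratic weights
mae_for_cell(reps = 500, n = 100, rho = 0.9)
```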
3.4. Results

All of the numerical results are not given here to save space, but they are available upon request from the authors. Only the results related to the MAE are given, for n = 50 and n = 1,000, in Figures 1-6, as the MSE results are very similar to those obtained for the MAE. The results are interpreted with respect to the inter-rater agreement methods, the weighting schemes, the true level of inter-rater agreement, and the structure of the R × R table.

[Figure 1. The MAE results for the 3 × 3 table where n = 50.]
[Figure 2. The MAE results for the 3 × 3 table where n = 1,000.]
[Figure 3. The MAE results for the 4 × 4 table where n = 50.]
[Figure 4. The MAE results for the 4 × 4 table where n = 1,000.]
[Figure 5. The MAE results for the 5 × 5 table where n = 50.]
[Figure 6. The MAE results for the 5 × 5 table where n = 1,000.]

In terms of agreement measures, all measures perform similarly for the balanced R × R table structures, whereas they differ more with increasing unbalancedness. For the low degree of inter-rater agreement, Cohen's kappa, Scott's π, and Krippendorff's α perform better than Gwet's AC2 and Brennan-Prediger's kappa. For medium and high degrees of inter-rater agreement, Gwet's AC2 and Brennan-Prediger's kappa perform better than Cohen's kappa, Scott's π, and Krippendorff's α, not only for the slightly unbalanced case but also for the heavily unbalanced table structures. When the table size is increased, all the measures give smaller MAE and MSE values.

In terms of weighting schemes, we obtain similar results for the MAE and the MSE. The quadratic weights have the best accuracy in most situations, regardless of the measure chosen. However, Gwet's AC2 and Brennan-Prediger's kappa perform well with other weights for the unbalanced structures. In particular, Gwet's AC2 performs well in combination with the linear weights for slightly unbalanced structures. When used with radical weights, Gwet's AC2 performs well in heavily unbalanced 4 × 4 and 5 × 5 tables. For Brennan-Prediger's kappa, the linear weights yield the best results for heavily unbalanced table structures, while for slightly unbalanced cases the ordinal weights yield the most accurate results for 4 × 4 and 5 × 5 tables, and the quadratic weights for 3 × 3 tables. Overall, the accuracy of the measures is sensitive to the weights used if the table of interest is clearly unbalanced and the true agreement is not that low.

In terms of the level of agreement, all inter-rater agreement measures perform well when the true inter-rater agreement is low, except for Gwet's AC2 and Brennan-Prediger's kappa in unbalanced tables. For the majority of the scenarios, the MAEs are smaller than 0.4. The MAE and MSE of the measures are lowest when the true inter-rater agreement is low, and they are close to each other for the medium and high inter-rater agreement levels.

In terms of table structures, balanced table structures usually have smaller mean error values than the slightly unbalanced and heavily unbalanced tables. In unbalanced tables, Cohen's κ, Scott's π, and Krippendorff's α always have lower MAE than Gwet's AC2 and Brennan-Prediger's κ when the true agreement is low. The situation is the opposite for medium and high degrees of agreement. Specifically, with slightly unbalanced structures, Gwet's AC2 performs best for high agreement. For the medium degree of agreement, Gwet's AC2 and Brennan-Prediger's kappa perform similarly, whereas for the heavily unbalanced structure with a medium degree of agreement, Brennan-Prediger's kappa performs best.

4. Conclusion

We compare the accuracy of Cohen's κ, Scott's π, Gwet's AC2 coefficient, the Brennan-Prediger coefficient, and Krippendorff's α coefficient in combination with 6 weighting schemes (unweighted, linear, quadratic, ordinal, radical, and ratio weights) in the context of two raters for ordinal data using a Monte Carlo simulation approach. Using the results of this simulation, we identify which inter-rater measure and weighting scheme combination has less bias, and how their bias is affected by the degree of true inter-rater agreement, the structure of the R × R table, the number of ordinal ratings, and the total sample size. The main findings of our study are summarized as follows:

- All measures perform similarly for balanced table structures. However, for the low degree of inter-rater agreement, Cohen's kappa, Scott's π, and Krippendorff's α perform better than Gwet's AC2 and Brennan-Prediger's kappa. Conversely, for medium and high degrees of inter-rater agreement, Gwet's AC2 and Brennan-Prediger's kappa perform better than Cohen's kappa, Scott's π, and Krippendorff's α.

- Unbalancedness in the cell counts of the considered table is the most influential factor for the accuracy of the inter-rater agreement measures; it negatively impacts their accuracy. The accuracy of the measures is also sensitive to the weights used if the table of interest is highly unbalanced.
- For the majority of the scenarios, the values of the error measures are small for low agreement, high for medium agreement, and in between for high agreement, except in the situations with Gwet's AC2 and Brennan-Prediger's kappa. When the underlying inter-rater agreement is low and the table is unbalanced, Gwet's AC2 and Brennan-Prediger's κ in combination with linear, quadratic, ordinal, radical, and ratio weights should be avoided. For such cases, Cohen's kappa, Scott's π, and Krippendorff's α perform well with any type of weights.

- Overall, the accuracy of the measures is sensitive to the weights used if the table of interest is clearly unbalanced and the true agreement is not that low. For the unbalanced table structures, provided that the inter-rater agreement is high, Gwet's AC2 and Brennan-Prediger's κ can be used with any type of weights.

- Gwet's AC2 performs well in combination with the linear weights for slightly unbalanced structures. When used with radical weights, Gwet's AC2 performs well in heavily unbalanced 4 × 4 and 5 × 5 tables. For Brennan-Prediger's kappa, the linear weights yield the best results for heavily unbalanced table structures, while for slightly unbalanced cases the ordinal weights yield the most accurate results.

- In terms of agreement measures, all measures perform similarly for the balanced R × R table structures, whereas they differ more with increasing unbalancedness. Specifically, we recommend the use of Gwet's AC2 and Brennan-Prediger's κ for unbalanced tables with medium and high agreement levels. However, it should be noted that these coefficients overstate the extent of agreement among raters when there is no agreement and the data are unbalanced.

All the inferences given in this article should be considered within the limits of our simulation space, which is large enough to generalize the inferences on the accuracy of inter-rater agreement measures for ordinal data.

Acknowledgment

The authors would like to acknowledge the valuable comments and suggestions of the anonymous reviewer, which have improved the quality of this paper. Also, the authors gratefully acknowledge the generous financial support of VIED and RMIT. Finally, the authors' thanks are due to Dr. Gwet for kindly granting permission to include his functions to calculate various kappa-like measures.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

Duyet Tran received financial support from VIED and RMIT.

ORCID

Duyet Tran: http://orcid.org/0000-0003-1720-9591
Anil Dolgun: http://orcid.org/0000-0002-2693-0666
Haydar Demirhan: http://orcid.org/0000-0002-8565-4710

References

Aickin, M. 1990. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen's kappa. Biometrics 46 (2):293–302.
Banerjee, M., M. Capozzoli, L. McSweeney, and D. Sinha. 1999. Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics 27 (1):3–23.
Bangdiwala, S. I. 1988. The agreement chart. Chapel Hill, NC: Department of Biostatistics, University of North Carolina at Chapel Hill.
Bennett, E. M., R. Alpert, and A. C. Goldstein. 1954. Communications through limited-response questioning. Public Opinion Quarterly 18 (3):303–8.
Berry, K. J., P. W. Mielke, and J. E. Johnston. 2016. Permutation statistical methods: An integrated approach. Cham: Springer.
Bloch, D. A., and H. C. Kraemer. 1989. 2 × 2 kappa coefficients: Measures of agreement or association. Biometrics 45 (1):269–87.
Brennan, R. L., and D. J. Prediger. 1981. Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement 41 (3):687–99.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1):37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70 (4):213–20. doi:10.1037/h0026256.
Conger, A. J. 1980. Integration and generalization of kappas for multiple raters. Psychological Bulletin 88 (2):322–8.
Dunn, G. 1989. Design and analysis of reliability studies: The statistical evaluation of measurement errors. Oxford, UK: Edward Arnold Publishers.
Feinstein, A. R., and D. V. Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43 (6):543–9.
Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (5):378–82.
Fleiss, J. L., and J. Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33 (3):613–9.
Gautam, S. 2014. A-kappa: A measure of agreement among multiple raters. Journal of Data Science 12:697–716.
Goodman, L. A., and W. H. Kruskal. 1954. Measures of association for cross classifications. Journal of the American Statistical Association 49 (268):732–64.
Gwet, K. 2002. Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Statistical Methods for Inter-Rater Reliability Assessment 1 (6):1–6.
Gwet, K. L. 2008. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61 (1):29–48.
Gwet, K. L. 2014. Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Gaithersburg, MD: Advanced Analytics, LLC.
Gwet, K. L. 2017. R functions for calculating agreement coefficients. Accessed August 2, 2018. http://www.agreestat.com/r_functions.html.
Holley, J. W., and J. P. Guilford. 1964. A note on the G index of agreement. Educational and Psychological Measurement 24 (4):749–53.
Kendall, M. G. 1955. Rank correlation methods. New York: Hafner.
Krippendorff, K. 2004. Measuring the reliability of qualitative text analysis data. Quality & Quantity 38 (6):787–800.
Landis, J. R., and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33 (1):159–74.
Light, R. J. 1971. Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin 76 (5):365–77.
Maclure, M., and W. C. Willett. 1987. Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology 126 (2):161–9.
Moradzadeh, N., M. Ganjali, and T. Baghfalaki. 2017. Weighted kappa as a function of unweighted kappas. Communications in Statistics - Simulation and Computation 46:1–12.
Muthen, B. 1984. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika 49 (1):115–32.
Randolph, J. J. 2005. Free-marginal multirater kappa (multirater κ_free): An alternative to Fleiss' fixed-marginal multirater kappa. Presented at the Joensuu Learning and Instruction Symposium, vol. 2005.
Sarkar, D. 2008. Lattice: Multivariate data visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.
Schuster, C. 2004. A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and Psychological Measurement 64 (2):243–53.
Scott, W. A. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19 (3):321–5.
Spitznagel, E. L., and J. E. Helzer. 1985. A proposed solution to the base rate problem in the kappa statistic. Archives of General Psychiatry 42 (7):725–8.
Vanbelle, S., and A. Albert. 2009. A note on the linearly weighted kappa coefficient for ordinal scales. Statistical Methodology 6 (2):157–63.
Warrens, M. J. 2010. Inequalities between kappa and kappa-like statistics for k × k tables. Psychometrika 75 (1):176–85.
Warrens, M. J. 2011. Cohen's linearly weighted kappa is a weighted average of 2 × 2 kappas. Psychometrika 76 (3):471–86.
Warrens, M. J. 2012. Some paradoxical results for the quadratically weighted kappa. Psychometrika 77 (2):315–23.
Warrens, M. J. 2017. Symmetric kappa as a function of unweighted kappas. Communications in Statistics - Simulation and Computation 46:1–6.
Wongpakaran, N., T. Wongpakaran, D. Wedding, and K. L. Gwet. 2013. A comparison of Cohen's kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Medical Research Methodology 13 (1):61. doi:10.1186/1471-2288-13-61.
Yilmaz, A. E., and T. Saracbasi. 2017. Assessing agreement between raters from the point of coefficients and log-linear models. Journal of Data Science 15 (1):1–24.
Yule, G. U. 1912. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society 75 (6):579–652.