
Detection of grey zones in inter-rater agreement studies

BMC Medical Research Methodology (2023) 23:3. https://doi.org/10.1186/s12874-022-01759-7


Haydar Demirhan (1, *, †) and Ayfer Ezgi Yilmaz (2, †)

† Haydar Demirhan and Ezgi Yilmaz contributed equally to this work.
* Correspondence: Haydar Demirhan, [email protected]
1 Mathematical Sciences Discipline, School of Science, RMIT University, Melbourne 3000, Victoria, Australia
2 Department of Statistics, Hacettepe University, Beytepe, Ankara 06000, Turkey

Abstract

Background: In inter-rater agreement studies, the assessment behaviour of raters can be influenced by their experience, training levels, the degree of willingness to take risks, and the availability of clear guidelines for the assessment. When the assessment behaviour of raters differs for some levels of an ordinal classification, a grey zone occurs between the cells adjacent to these levels around the main diagonal of the table. A grey zone introduces a negative bias to the estimate of the agreement level between the raters. In that sense, it is crucial to detect the existence of a grey zone in an agreement table.

Methods: In this study, a framework composed of a metric and the corresponding threshold is developed to identify grey zones in an agreement table. The symmetry model and Cohen's kappa are used to define the metric, and the threshold is based on a nonlinear regression model. A numerical study is conducted to assess the accuracy of the developed framework. Real data examples are provided to illustrate the use of the metric and the impact of identifying a grey zone.

Results: The sensitivity and specificity of the proposed framework are shown to be very high under moderate, substantial, and near-perfect agreement levels for 3 × 3 and 4 × 4 tables and sample sizes greater than or equal to 100 and 50, respectively. Real data examples demonstrate that when a grey zone is detected in the table, it is possible to report a notably higher level of agreement in the studies.

Conclusions: The accuracy of the proposed framework is sufficiently high; hence, it provides practitioners with a precise way to detect grey zones in agreement tables.

Keywords: Cohen's kappa, Gray zone, Inter-rater reliability, Ordinal levels, Transition zone, Weighted kappa

Background

The level of agreement between two or more raters is considered a crucial indicator for assessing the validity of measurements that can stem from treatment responses, diagnostic scans and tests, the use of new therapeutic or diagnostic technologies, or any other quantitative procedure. Agreement studies are conducted either for discriminating between the patients (reliability) or for evaluating the effects or changes through repeated measurements (agreement) [1]. In this context, the level of agreement indicates the degree of similarity or dissimilarity between diagnoses, scores, or judgments of raters [1, 2]. Diagnostic imaging is one of the important areas where a gold standard decision criterion is not available, and agreement studies are employed to evaluate the objectivity of imaging results [3]. In pathology, the development of grading schemes is informed by agreement studies [4]. Pathologists' reproducibility in grading tumors is evaluated using the level of agreement between raters [5].
In cardiology, inter-rater agreement studies are employed in distinguishing type 1 and type 2 myocardial infarction due to the lack of solid clinical criteria for this classification [6, 7]. In clinical psychology, agreement studies are used to evaluate the replicability of diagnostic distinctions obtained with a diagnostic interview for mental disorders [8, 9]. In forensic medicine, the degree of agreement between two raters is utilized in the assessment of the credibility of physical torture allegations [10]. Agreement studies provide a wide variety of medical fields with essential information for critical decision-making and evaluation. Therefore, it is crucial to estimate the level of agreement between the raters with substantial accuracy.

While conducting an agreement study, one of the main concerns is the measurement scale of the outcome, which can be a nominal, ordinal, interval, or ratio scale. Gwet [11] outlines the selection of agreement measures to be used for different scales. In this study, we focus on cross tables (agreement tables) composed of the ratings of two raters into ordinal levels. When the outcome is ordinal, the raters classify subjects into categories considering their hierarchy. Due to the impact of the hierarchy, weighted agreement coefficients are used for ordinal outcomes [11]. The impact of different table structures on five weighted agreement coefficients is explored by Tran et al. [12]. Warrens [13-15] presents theoretical and numerical results on the relationship between different weighted agreement coefficients and their usage in agreement studies. The accuracy of the weighted agreement coefficients is affected by the characteristics of the agreement table, such as the imbalance of the counts' distribution across the cells, the degree of true agreement, or rater-related issues such as the existence of a grey zone [12, 16].

The assessment of raters is prone to biases due to external factors that can be related to their personal background. Rater (examiner or observer) bias increases the variation in the raters' assessments. This issue is explored by Schleicher et al. [17] in clinical exams in medical schools. Substantial variation due to the lack of clear procedures, leading to rater bias, is reported in the literature [4, 18].
Personal characteristics of the raters, such as their level of expertise, previous training, or willingness to take risks, are also sources of variation behind rater bias. For example, in grading a tumor into the "Normal," "Benign disease," "Suspected cancer," and "Cancer" categories, one of the raters may take a cautious approach and, to avoid risk, tend to grade toward the "Suspected cancer" and "Cancer" categories, while the other rater rates lower, towards the "Benign disease" and "Suspected cancer" categories. This difference in the willingness of raters to take risks can create a rater bias leading to grey zones, as discussed by Tran et al. [16] using data from Boyd et al. [19]. Zbären [20] reports increased accuracy in the assessment of frozen section samples with increasing experience of pathologists. In histologic grading, the distribution of grades varies by up to 27% between studies [21]. Although some portion of this variation is attributed to patient characteristics, inter- and intra-rater variations have an extensive share in it. Strategies such as the use of e-learning modules are proposed to mitigate the impact of the grey zones caused by rater variation in grading lesions [22].

Since the grey zone is a concept that occurs for ordinal outcomes, we focus solely on the agreement for ordinal outcomes. The issue of having a grey zone in an agreement table is studied by Tran et al. [16, 23]. We get misleading estimates of the level of agreement when there is a grey zone in the agreement table, especially if the level of true agreement is not high and the number of classes is not large [16]. When the sample size increases, the negative impact of a grey zone on the accuracy of agreement coefficients increases [16]. Tran et al. [23] propose a Bayesian approach for accurate estimation of the agreement level between raters when there is a grey zone in the table, for ordinal tables with and without order restrictions on the levels of the classification. While the existence of grey zones in agreement tables and their negative effects are considered in the literature, the question of how we can decide whether there is any grey zone in an agreement table remains unanswered.

Motivating example

In a study on the assessment of torture allegations, 202 cases were assessed for the consistency between the history of ill-treatment, the symptoms, and the physical and psychological indications [10]. In a semi-quantitative assessment, two raters independently assessed the level of detail in describing physical symptoms related to ill-treatment. The ordinal levels of the assessment that constitute a 4 × 4 agreement table were "0" for "descriptions with no relevant information about physical abuse," "1" for "descriptions with few details about physical abuse and symptoms," "3" for "very detailed descriptions," and "2" for "descriptions between 1 and 3." The resulting agreement table is given by Petersen and Morentin [10] as in Table 1 (only a relevant section of this agreement table is presented here). Full details of the assessment, including the marginals of the table and the proportions of agreement, are given by Petersen and Morentin [10].

Table 1 The agreement table for the level of details in the description of physical symptoms [10]

              Rater II
Rater I    0     1     2     3
0          36    0     0     0
1          7     57    11    0
2          0     23    34    4
3          0     1     19    10

In the published table, boldfaced cells mark the possible locations of grey zones.

For Table 1, the linearly weighted Cohen's kappa coefficient is 0.674, which indicates a good or substantial level of agreement ([24], see Table 5 therein).
In this agreement table, Rater I tends to rate one level higher than Rater II for the mid-range of the scale 0-3. While Rater II considers 23 cases as describing a few details about physical abuse and symptoms, Rater I conceives that the same cases describe more details. Consistently, Rater I thinks 19 cases provided very detailed descriptions, while Rater II does not find those descriptions very detailed. Only in one cell of the agreement table does Rater II perceive more details (in 11 cases) about physical abuse and symptoms than Rater I. In this example, the perception of Rater II of the details of physical abuse and symptoms differs from that of Rater I, who shows more sensitivity to the details of physical abuse and symptoms and accepts the descriptions as detailed more easily than Rater II. Overall, Rater I tends to rate one level higher than Rater II. This difference in the raters' perception creates two grey zones in this agreement table: the first one is between levels 1 and 2, and the second one is between levels 2 and 3. Petersen and Morentin [10] mention the existence of a grey zone between two levels of scoring without formally referring to any criteria or defining the grey zone. Identification of grey zones in such a critical area of assessment is extremely important when an assessment of allegations of torture and ill-treatment, based on the Istanbul Protocol, is required by the juridical system.

Tran et al. [16] suggest using Gwet's AC2 or Brennan-Prediger's S coefficients with quadratic weights when there is a grey zone in the agreement table. Cohen's kappa, Gwet's AC2, and Brennan-Prediger's S coefficients are calculated using linear and quadratic weights in Table 2.

Table 2 Weighted agreement coefficients for the agreement table given in Table 1

Weight      Cohen's kappa   Gwet's AC2   Brennan-Prediger's S
Linear      0.674           0.759        0.739
Quadratic   0.799           0.884        0.865

Gwet's AC2 and Brennan-Prediger's S coefficients show a higher level of agreement with both sets of weights. Thus, if we can detect the existence of a grey zone in this agreement table in a quantitative way, we gather evidence to rely on Gwet's AC2 and Brennan-Prediger's S coefficients and, hence, can report a higher level of agreement that can be qualified as a very good magnitude of agreement instead of good.
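To make the figures in Table 2 concrete, the following R sketch computes Gwet's AC2 and Brennan-Prediger's S for Table 1 with linear or quadratic weights. It follows the standard weighted definitions in Gwet [11] and is an illustrative reimplementation rather than the authors' supplementary code; the function and object names are ours.

```r
# Gwet's AC2 and Brennan-Prediger's S for an R x R agreement table,
# following the standard weighted definitions in Gwet [11].
robust_coefficients <- function(tab, weights = c("linear", "quadratic")) {
  weights <- match.arg(weights)
  R <- nrow(tab)
  p <- tab / sum(tab)                        # cell proportions
  d <- abs(outer(1:R, 1:R, "-"))             # |i - j| distances between levels
  w <- if (weights == "linear") 1 - d / (R - 1) else 1 - (d / (R - 1))^2
  Po <- sum(w * p)                           # weighted observed agreement
  Tw <- sum(w)                               # total weight
  pi_k <- (rowSums(p) + colSums(p)) / 2      # average marginal proportions
  Pe_ac2 <- Tw / (R * (R - 1)) * sum(pi_k * (1 - pi_k))
  Pe_bp  <- Tw / R^2
  c(AC2 = (Po - Pe_ac2) / (1 - Pe_ac2), S = (Po - Pe_bp) / (1 - Pe_bp))
}

# Table 1 of the motivating example (rows: Rater I, columns: Rater II)
tab1 <- matrix(c(36, 0, 0, 0,
                 7, 57, 11, 0,
                 0, 23, 34, 4,
                 0, 1, 19, 10), nrow = 4, byrow = TRUE)
robust_coefficients(tab1, "linear")     # AC2 = 0.759, S = 0.739, as in Table 2
robust_coefficients(tab1, "quadratic")  # AC2 = 0.884, S = 0.865
```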
Such examples can be multiplied with agreement tables reported in the literature where a lower level of agreement is reported because the impact of a grey zone went unnoticed [25, 26]. In this sense, a method to detect the existence of grey zones in agreement tables helps practitioners judge the reliability of the magnitude of agreement revealed by straightforwardly using Cohen's kappa coefficient and leads them to use coefficients that are robust against grey zones.

In this study, we develop a framework to assess the existence of grey zones in ordinal agreement tables. The proposed framework is easy for practitioners to implement. It detects grey zones with high accuracy, and it has a low error rate for falsely detecting grey zones when no grey zone is present in the table. We demonstrate by real data applications that a practitioner can report a higher degree of agreement between the raters with confidence when the existence of a grey zone is ascertained by the proposed framework. This leads to a better judgment of the objectivity of results or the reproducibility of assessors in grading samples.

The main contribution of this study is to introduce a straightforwardly applicable framework for assessing the existence of a grey zone in an ordinal agreement table. The required software codes for the calculations are presented in this article (see Supplementary Material). In this framework, a metric and a threshold are developed to detect grey zones. The sensitivity, specificity, false-positive, and false-negative rates of the developed metric are investigated by a numerical study. Real data applications are presented to demonstrate the usefulness of the proposed framework in practice. Using this approach, practitioners will be able to assess the existence of a grey zone in their agreement table of interest and report more accurate agreement levels by using agreement coefficients that are robust against grey zones.

Methods

Agreement table and grey zone

When two raters assign n objects into R classes, we get an agreement table as shown in Table 3, where n_ij denotes the number of objects that are assigned to class i by the first rater and to class j by the second rater, with i, j = 1, 2, ..., R. The corresponding cell probability is p_ij = n_ij/n. The row and column totals are shown as row and column margins, respectively. The marginal row and column probabilities are p_i. = n_i./n and p_.j = n_.j/n, respectively.

Table 3 The agreement table for two raters

                Rater II
Rater I         1      2      ...    R      Row margin
1               n11    n12    ...    n1R    n1.
2               n21    n22    ...    n2R    n2.
...             ...    ...    ...    ...    ...
R               nR1    nR2    ...    nRR    nR.
Column margin   n.1    n.2    ...    n.R    n

In this study, we assume that the raters are assessing ordinal levels. When there is complete agreement between the raters, n_ij = 0 for i ≠ j. Any deviation from this is considered disagreement. The general form of a weighted agreement coefficient (AC_w) for ordinal tables is defined in Eq. (1):

$$AC_w = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \sum_{i,j=1}^{R} w_{ij}\, p_{ij}, \tag{1}$$

where P_o is the observed agreement, P_e is the proportion of agreement expected by chance, and w_ij is the weight assigned to cell (i, j) of the agreement table. There are many different versions of weighted agreement coefficients and of the weights used to define them (see Tran et al. [16] for details). In this study, since each weighted agreement coefficient has its advantages and disadvantages under the existence of grey zones, we straightforwardly use the kappa coefficient with P_e defined in Eq. (2) and identity weights (w_ij = 1 for i = j and 0 otherwise), that is, the unweighted Cohen's kappa:

$$P_e = \sum_{i,j=1}^{R} w_{ij}\, p_{i.}\, p_{.j}. \tag{2}$$
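As a minimal sketch, Eqs. (1)-(2) translate into a few lines of R (the language of the paper's supplementary code); the weight schemes below are the common identity, linear, and quadratic choices, and tab1 is the Table 1 matrix defined in the earlier sketch.

```r
# Weighted agreement coefficient of Eqs. (1)-(2); identity weights give
# the unweighted Cohen's kappa used later in the grey-zone criterion.
weighted_kappa <- function(tab, weights = c("identity", "linear", "quadratic")) {
  weights <- match.arg(weights)
  R <- nrow(tab)
  p <- tab / sum(tab)                    # cell probabilities p_ij
  d <- abs(outer(1:R, 1:R, "-"))         # |i - j| for each cell
  w <- switch(weights,
              identity  = (d == 0) * 1,
              linear    = 1 - d / (R - 1),
              quadratic = 1 - (d / (R - 1))^2)
  Po <- sum(w * p)                                 # Eq. (1)
  Pe <- sum(w * outer(rowSums(p), colSums(p)))     # Eq. (2)
  (Po - Pe) / (1 - Pe)
}

weighted_kappa(tab1, "identity")  # 0.545, the kappa used for Table 1 below
weighted_kappa(tab1, "linear")    # 0.674, as reported in the motivating example
```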
An alternative way of assessing the level of agreement between raters is to use the ordinal quasi-symmetry model [27], represented by Eq. (3):

$$\log(p_{ij}/p_{ji}) = \beta(u_i - u_j) \quad \text{for all } i \text{ and } j, \tag{3}$$

where u_1 ≤ u_2 ≤ ... ≤ u_R are ordered scores assigned to the levels of the assessment scale for both the rows and columns of the agreement table. For this model, as the value of |β| increases, the difference between p_ij and p_ji and the difference between the marginal distributions of the raters become greater. When β = 0, we have the symmetry model [27], which implies that the lower and upper triangles of the agreement table match perfectly and there is a perfect fit on the main diagonal. The maximum likelihood fit of the symmetry model yields the expected cell frequencies in Eq. (4):

$$\hat{\mu}_{ij} = (n_{ij} + n_{ji})/2, \tag{4}$$

and the corresponding standardised residuals are

$$r_{ij} = (n_{ij} - \hat{\mu}_{ij})/\sqrt{\hat{\mu}_{ij}} \tag{5}$$

for μ̂_ij > 0, and r_ij = 0 for μ̂_ij = 0. Moreover, we have r_ij = -r_ji and r_ij = 0 for i = j.
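The symmetry-model quantities of Eqs. (4)-(5) are straightforward to compute directly; a sketch, again reusing tab1:

```r
# Standardized residuals of the symmetry model, Eqs. (4)-(5).
symmetry_residuals <- function(tab) {
  mu <- (tab + t(tab)) / 2           # ML expected counts under symmetry, Eq. (4)
  r <- matrix(0, nrow(tab), ncol(tab))
  ok <- mu > 0
  r[ok] <- (tab[ok] - mu[ok]) / sqrt(mu[ok])   # Eq. (5)
  r                                  # r_ij = -r_ji, zero on the main diagonal
}

round(symmetry_residuals(tab1), 3)   # reproduces the right panel of Table 5
```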
A grey zone is defined as a human-contrived disagreement between two assessors occurring locally in adjacent categories of an agreement table due to the subjective evaluation of raters [23]. A lack of uniform guidelines for classification, the level of experience of the raters, low variability among the levels, or other biases impacting the classification behaviour of raters are potential causes of not clearly distinguishing two adjacent categories. Therefore, for the grey zones considered in this study, the personal judgements of the assessors are influential on the existence of a grey zone, rather than the characteristics of subjects related to a diagnosis. A grey zone is an attribute of the raters rather than an attribute of the given scale. It is assumed that a grey zone occurs without human error, and there are no biases or misclassifications in the agreement table causing the existence of a grey zone. The causes of different grading behaviours include the raters having different perceptions of the distance between adjacent levels, due to using different guidelines or having different experience levels. Northrup et al. [4] and van Dooijeweert et al. [18] report cases where pathologists refer to different references to grade the films, leading to increased variation.

Detection of a grey zone

To detect a grey zone in an agreement table, we need to consider the impact of having a grey zone and how it arises, simultaneously. The main impact of a grey zone is increased variation and uncertainty. Grey zones cause researchers to estimate the level of agreement as lower than its actual level, since inflation occurs in the off-diagonal cells of the agreement table.

When the ratings of two observers are taken as matched pairs, due to the dependency created by the fact that the same subjects (diagnostic results, scans, etc.) are being rated by two raters using the same scale, we can utilise the symmetry or ordinal quasi-symmetry models for square tables to assess the existence of grey zones. The ordinal quasi-symmetry model fits the agreement table well if the ratings of one rater tend to be consistently higher than those of the other. Since grey zones do not occur around all the diagonal cells, the quasi-symmetry model is not expected to give a satisfactory fit for detecting grey zones. The symmetry model, however, represents the case where there is no grey zone in the table. Therefore, if the symmetry model fits an agreement table well, it is a strong indication of not having any grey zones in the table. Following this logic, deviations from the symmetry model for adjacent cells, relative to the corresponding cells on the main diagonal, lead us to detect the existence of a grey zone in an agreement table.

The cell counts on the main diagonal stem from the agreement of the two raters. For a grey zone to exist, some cell counts should move from the main diagonal to the adjacent cell to the right (or below) of the main diagonal cell. This is a deviation from the symmetry model, and it penalises any agreement coefficient. Therefore, we need to consider the level of agreement along with the deviations from the symmetry model to detect a grey zone. The standardised residuals of the symmetry model represent the deviations, while a kappa coefficient shows the level of agreement between the raters. There are many different forms of weighted agreement coefficients that have pros and cons depending on the formation of the agreement table and the choice of weights. In fact, all the agreement coefficients are impacted by a grey zone if it is present in the table; they all underestimate the level of agreement when there is a grey zone. Here, we only aim to represent the level of agreement instead of precisely measuring it. Since we aim to point out the difference between agreement and disagreement on the main diagonal of the table, the use of the unweighted kappa coefficient with identity weights is a suitable and straightforward choice [28].

The basic element of our criterion to detect whether there is a grey zone in the agreement table, namely δ_ij, is defined as the deviation from the symmetry model relative to the level of agreement measured by Cohen's kappa coefficient, as given in Eq. (6):

$$\delta_{ij} = r_{ij}/\kappa, \tag{6}$$

where r_ij is the standardised residual defined in Eq. (5) and κ is the Cohen's kappa coefficient. When there is a grey zone, say in cell (i, j), the corresponding cell count, n_ij, gets inflated while n_ji remains the same. This results in a large deviance from the symmetry model and, hence, a large standardised residual, r_ij. However, the magnitude of the inflation is not always due to a grey zone; it can also be related to the disagreement between the raters. Therefore, we scale the magnitude of the deviance from symmetry by the level of agreement. Thus, the statistic δ_ij measures the magnitude of the deviance from perfect agreement relative to the level of agreement for cell (i, j). Then, we focus on the maximum of the δ_ij values to detect the existence of a grey zone, and the corresponding i and j lead us to the location of the grey zone in the table. Thus, the proposed criterion to detect a grey zone is

$$\Delta = \max_{i,j}(\delta_{ij}). \tag{7}$$
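A sketch of Eqs. (6)-(7), combining the two helper functions defined above:

```r
# Grey-zone criterion: delta_ij = r_ij / kappa and Delta = max(delta_ij).
grey_zone_Delta <- function(tab) {
  kap <- weighted_kappa(tab, "identity")     # unweighted Cohen's kappa
  delta <- symmetry_residuals(tab) / kap     # Eq. (6)
  list(kappa = kap, delta = delta, Delta = max(delta))   # Eq. (7)
}

grey_zone_Delta(tab1)$Delta   # 4.058 for Table 1, as computed below
```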
In order to give numerical insight into this approach, we focus on the agreement table given in Table 1. We arbitrarily move the frequencies of the cells that are potentially contributing to the grey zones to the main diagonal, to create an agreement table that does not have grey zones, as in Table 4.

Table 4 The modified version of the agreement table to remove the grey zone for the level of details in the description of physical symptoms [10]

              Rater II
Rater I    0         1           2           3
0          36        0+4=4       0           0
1          7-4=3     57+13=70    11          0
2          0         23-13=10    34+14=48    4
3          0         1           19-14=5     10

The cells written as sums and differences (boldface and italic in the published table) show the modifications made to the cell counts.

Cohen's kappa is calculated as κ = 0.725 and 0.545 for Tables 4 and 1, respectively. The corresponding standardised residuals for Tables 4 and 1 are shown in Table 5. The magnitudes of the standardized residuals are considerably higher in the table that has grey zones (Table 1) than in the one without grey zones (Table 4).

Table 5 The standardized residuals of the symmetry model for the agreement tables in Tables 4 and 1

No grey zone (Table 4):

              Rater II
Rater I    0         1         2         3
0          0         0.267     0         0
1          -0.267    0         0.154     -0.707
2          0         -0.154    0         -0.236
3          0         0.707     0.236     0

With grey zone (Table 1):

              Rater II
Rater I    0         1         2         3
0          0         -1.871    0         0
1          1.871     0         -1.455    -0.707
2          0         1.455     0         -2.212
3          0         0.707     2.212     0

The corresponding δ_ij values are given in Table 6.

Table 6 The δ_ij values for Tables 4 and 1

No grey zone (Table 4):

              Rater II
Rater I    0         1         2         3
0          0         0.369     0         0
1          -0.369    0         0.213     -0.975
2          0         -0.213    0         -0.325
3          0         0.975     0.325     0

With grey zone (Table 1):

              Rater II
Rater I    0         1         2         3
0          0         -3.433    0         0
1          3.433     0         -2.670    -1.297
2          0         2.670     0         -4.058
3          0         1.297     4.058     0

The values of the criterion Δ are 0.975 and 4.058 for Tables 4 and 1, respectively. For the assessment of torture allegations data (Table 1), we observe a very large Δ value, suggesting the existence of a grey zone, as also noted by Petersen and Morentin [10]. For the arbitrarily created no-grey-zone version of the table (Table 4), we observe a very low value of Δ, suggesting the absence of a grey zone. These results are consistent with the logic behind the proposed criterion. However, the question we need to clarify is how large Δ should be to suggest the presence of a grey zone in the table. In order to answer this question, we develop a threshold for Δ via numerical experiments in the next section.

Derivation of a threshold for Δ

The numerical experiments to develop a threshold for the Δ of Eq. (7) involve creating random agreement tables without a grey zone. For the random generation of agreement tables, we follow the algorithm given by Tran et al. ([16], see Algorithm 1 therein), which creates bivariate normally distributed latent variables for a given Pearson correlation coefficient (ρ) to set the level of not-chance-corrected true agreement between two raters. We set ρ = 0.45, 0.50, ..., 0.85, 0.90 to cover true agreement levels from low to very high and consider the sample sizes n = 50, 100, 200, 300, 400, 500, and 1000. Then, for each combination of (ρ, n), we generate 1000 random agreement tables (replications) that do not have any grey zone, record Cohen's κ and Δ, and calculate the minimum, maximum, and median of κ, and the minimum, maximum, median, and 90th and 95th percentiles of Δ, over the 1000 replications. The calculated values for n = 100 are presented in Table 7, and the results for all (ρ, n) pairs are tabulated in Table S1 of the Supplementary file. This data generation aims to reveal the relationship between the level of agreement, the sample size, κ, and Δ when there is no grey zone in the table.
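A sketch of the table-generation step is given below, assuming the latent-variable scheme of Tran et al. [16]: bivariate normal scores with correlation ρ are discretized into R ordinal levels at common cutpoints. The equal-probability cutpoints are our simplifying assumption, not necessarily the exact choice in Algorithm 1 of [16].

```r
# Generate a random agreement table without a grey zone from bivariate normal
# latent scores (sketch; equal-probability cutpoints are an assumption here).
library(MASS)
simulate_table <- function(n, rho, R = 4) {
  z <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
  cuts <- qnorm(seq(0, 1, length.out = R + 1))  # cutpoints on the latent scale
  r1 <- cut(z[, 1], cuts, labels = FALSE)
  r2 <- cut(z[, 2], cuts, labels = FALSE)
  table(factor(r1, levels = 1:R), factor(r2, levels = 1:R))
}

set.seed(1)
tab_sim <- simulate_table(n = 100, rho = 0.90)
grey_zone_Delta(tab_sim)$Delta   # Delta for a table generated with no grey zone
```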
Table 7 Descriptive statistics of κ and Δ calculated for n = 100 against the values of ρ

             κ                          Δ
ρ       Min      Med     Max       Min     Med      90th      95th      Max
0.45    -0.002   0.213   0.442     4.564   16.053   246.335   413.388   413.388
0.50    0.008    0.231   0.444     4.585   13.777   376.726   376.726   376.726
0.55    0.053    0.263   0.497     3.998   10.750   158.032   158.032   249.900
0.60    0.068    0.292   0.525     3.792   9.013    43.027    104.885   181.672
0.65    0.082    0.331   0.591     3.702   7.416    16.878    16.878    146.471
0.70    0.142    0.371   0.621     2.852   6.363    10.605    14.711    46.144
0.75    0.157    0.418   0.635     2.288   5.438    6.938     6.938     29.757
0.80    0.253    0.471   0.771     1.456   4.690    5.238     5.298     12.260
0.85    0.332    0.532   0.771     1.385   3.878    4.630     4.630     8.992
0.90    0.410    0.607   0.831     1.593   3.203    4.349     4.349     7.542

Min: minimum; Med: median; Max: maximum; 90th: 90th percentile; 95th: 95th percentile

From Table 7, the value and the range of Δ decrease as the level of agreement increases for n = 100. As expected, there is a clear negative correlation between the level of agreement and Δ. We observe the same relationship for larger sample sizes in Table S1 of the Supplementary file. As the sample size gets larger, the maximum and the range of Δ decrease. Therefore, a sensitive threshold for Δ needs to be a function of both the sample size and the level of agreement.

Scatter plots of the pairs of ρ, n, the median of κ, and the median of Δ are displayed in Fig. 1. The relationship patterns between the median Δ and both ρ and the median κ are very similar. There is a negative nonlinear relationship between the level of agreement and Δ. The range of the median Δ increases nonlinearly for smaller samples. Therefore, a functional threshold needs to reflect these nonlinear relationship patterns.

Fig. 1 Scatter plots of the pairs of ρ (rho), sample size (n), the median of κ (kappaMed), and the median of Δ (DeltaMed)

In order to develop a threshold that is a function of both the level of agreement and the sample size and that incorporates the nonlinear relationships, we utilize the nonlinear regression technique. We build a model for the median Δ given the median κ and the sample size. Although the mean is more representative, a small number of large-valued outlier observations can impact the value of the mean considerably, while the median stays unaffected. From Table S1 of the Supplementary Material, we observe a large range of Δ values for each ρ among the values of the sample size, n. Similarly, the range of Δ values for each n is considerably large among the considered ρ values. This implies that the likelihood of getting outlier Δ values for a given agreement table is not negligible. Therefore, we used the median instead of the mean to build the nonlinear regression model, to get results that are robust against the outliers.

In the scatter plots of both the median Δ against the median κ and the median Δ against n (Fig. 1), the variation increases as the median Δ increases and the median κ and n decrease. So, we apply the Box-Cox transformation [29] to stabilise this variation before moving into the modelling. The optimal value of the power parameter λ of the Box-Cox transformation is found to be -1.59596 by using the boxcox() function from the MASS R package [30]. Then, we fit the model in Eq. (8) with the Box-Cox transformed Δ values, Δ_BC:

$$\Delta_{BC} = \beta_0 + \beta_1 \kappa^2 + \beta_2 n + \beta_3 n^2 + \epsilon, \tag{8}$$

where ε ~ N(0, σ_ε²). This specific model form is found by optimizing the adjusted R-squared over a model space that contains the models with linear and quadratic terms of κ and n. The fitted model is obtained as

$$\hat{\Delta}_{BC} = 0.6319 - 0.2563\kappa^2 - 2.087 \cdot 10^{-5}\, n + 1.546 \cdot 10^{-8}\, n^2, \tag{9}$$

with all coefficients statistically significant at the 5% level of significance (P < 0.001 for all). For this model, the adjusted R-squared is 0.989, which implies an almost perfect fit.

Fig. 2 Scatter plots of the observed and fitted Box-Cox transformed median Δ

Figure 2 shows the observed values and the values fitted by Eq. (9). In the top-left section of Fig. 2, six observations are notably underestimated by the model in Eq. (8). These observations come from the replications with the small sample size of n = 50. For the rest of the sample sizes, we have an almost perfect fit, as also identified by the adjusted R-squared of 0.989.
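The derivation can be sketched end to end in R under the assumptions stated earlier (equal-probability cutpoints in simulate_table(), and 100 rather than 1000 replications for speed); the column names kappaMed and DeltaMed follow the labels of Fig. 1, but the data frame itself is ours.

```r
# Sketch of the threshold derivation: median kappa and Delta per (rho, n),
# Box-Cox transformation, and the regression model of Eq. (8).
library(MASS)
grid <- expand.grid(rho = seq(0.45, 0.90, 0.05),
                    n = c(50, 100, 200, 300, 400, 500, 1000))
sims <- do.call(rbind, lapply(seq_len(nrow(grid)), function(g) {
  reps <- replicate(100, {
    res <- grey_zone_Delta(simulate_table(grid$n[g], grid$rho[g]))
    c(res$kappa, res$Delta)
  })
  data.frame(n = grid$n[g],
             kappaMed = median(reps[1, ]),
             DeltaMed = median(reps[2, ]))
}))
bc <- boxcox(DeltaMed ~ I(kappaMed^2) + n + I(n^2), data = sims,
             lambda = seq(-3, 0, 0.001), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]            # power parameter; -1.59596 in the paper
sims$Delta_bc <- (sims$DeltaMed^lambda - 1) / lambda   # Box-Cox transform
fit <- lm(Delta_bc ~ I(kappaMed^2) + n + I(n^2), data = sims)
summary(fit)$adj.r.squared                 # 0.989 is reported for the paper's fit
```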
Thus, when we take the Box-Cox transformation back, we obtain the desired threshold, τ_Δ, for our criterion Δ in Eq. (10):

$$\tau_\Delta = (\hat{\Delta}_{BC}\,\lambda + 1)^{1/\lambda} = \left[(0.6319 - 0.2563\kappa^2 - 2.087 \cdot 10^{-5}\, n + 1.546 \cdot 10^{-8}\, n^2) \times (-1.59596) + 1\right]^{1/(-1.59596)} = (-0.0080 + 0.4090\kappa^2 + 3.331 \cdot 10^{-5}\, n - 2.467 \cdot 10^{-8}\, n^2)^{-0.6266}. \tag{10}$$

As seen in Eq. (10), τ_Δ reflects the nonlinear relationship patterns between the median Δ and both κ and n. Since τ_Δ is an estimate of Δ when there is no grey zone in the agreement table, and Δ tends to increase when there is a grey zone in the table, if Δ > τ_Δ, then it is decided that there is a grey zone in the agreement table; otherwise, there is no grey zone in the table. Once it is decided that there is a grey zone in the table by Δ > τ_Δ, it is possible to compare the other δ_ij values with τ_Δ to identify further grey zones in the table.
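Putting the pieces together, the decision rule is a one-line comparison; a sketch using the functions defined above:

```r
# Fitted threshold of Eq. (10) and the grey-zone decision rule Delta > tau.
tau_Delta <- function(kappa, n) {
  (-0.0080 + 0.4090 * kappa^2 + 3.331e-5 * n - 2.467e-8 * n^2)^(-0.6266)
}

detect_grey_zone <- function(tab) {
  res <- grey_zone_Delta(tab)
  tau <- tau_Delta(res$kappa, sum(tab))
  list(Delta = res$Delta, tau = tau, grey_zone = res$Delta > tau)
}

detect_grey_zone(tab1)  # Table 1: Delta = 4.058 > tau = 3.791, grey zone detected
```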
Results

Numerical experiments

We conducted an experimental study to assess the performance of the proposed metric in detecting the existence of a grey zone in an agreement table. The approach in this validation effort is to i) generate an agreement table without any grey zone, ii) introduce a grey zone into the same table without notably affecting the level of agreement, and iii) record the values of Δ for each case and compare them with the corresponding threshold calculated by Eq. (10). In this way, we reveal the true-positive (sensitivity), true-negative (specificity), false-negative, and false-positive rates of the approach proposed in this study.

Data generation

The approach of Muthén [31] is used along with the algorithm given by Tran et al. [16] to generate agreement tables without a grey zone. Moderate, substantial, and near-perfect levels of true agreement are generated by using the correlation coefficient ρ. These agreement levels respectively correspond to Cohen's kappa values around 0.63, 0.75, and 0.83. Note that it is not possible to get exact kappa values as desired in the Monte Carlo data generation environment. Kappa values lower than 0.6 and higher than 0.85 are not feasible due to the nature of grey zones: for low true agreements, the off-diagonal cells of the table get inflated by disagreement, and hence a grey zone does not occur; for perfect true agreements, the cell counts get highly concentrated on the main diagonal of the table and do not allow the formation of a grey zone. The sample size is taken as n = 50, 100, 250, 500, and 1000. For each sample size, a different value of ρ gives a desired value of κ. The table size is considered as R = 3 and 4. For larger table sizes, the ordinal scale starts to approach the continuous scale; hence, it does not inform us about the pure impact of the ordinal outcome. Johnson and Creech [32] observe that when R > 4, the bias due to the categorisation of continuous measurements does not have a substantial impact on the interpretations. Considering these points, including larger table sizes would not be informative for our aim in this study. The values of ρ and the corresponding κ are tabulated for each sample size in Table 8.

Table 8 The values of ρ and the corresponding κ values for each sample size, n, for 3 × 3 tables

n      ρ       κ
50     0.960   0.639
       0.980   0.756
       0.986   0.817
100    0.930   0.639
       0.965   0.744
       0.985   0.835
250    0.925   0.633
       0.963   0.753
       0.977   0.838
500    0.910   0.630
       0.960   0.754
       0.984   0.838
1000   0.900   0.632
       0.960   0.767
       0.980   0.832

In order to inject a grey zone into an agreement table, the search approach of Tran et al. [16] is utilized on the cell probabilities for each combination of ρ and n. We searched for the set of cell probabilities that produces a κ value that is almost equal to that of the corresponding table without a grey zone. This way, we make sure that the generated tables with and without a grey zone have the same level of agreement, for comparability. For each table size, we consider the position of the generated grey zone. For R = 3, the grey zone is created at cells (1, 2), (2, 1), (2, 3), and (3, 2), and for R = 4, it is created at cells (1, 2), (2, 1), (2, 3), (3, 2), (3, 4), and (4, 3). In total, we consider 150 different scenarios composed of ρ, n, R, and the location of the grey zone. For each scenario, 10,000 random agreement tables with and without a grey zone are generated.
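For illustration only, the sketch below injects a grey zone by moving a share of a diagonal count into an adjacent cell. The paper instead searches over cell probabilities (Tran et al. [16]) so that κ is preserved between the paired tables; this simplified version does not guarantee that.

```r
# Simplified grey-zone injection at cell (i, j): move a share of the diagonal
# count n_ii into the adjacent cell (i, j). Unlike the search approach of
# Tran et al. [16], this does not hold kappa fixed; it only mimics the mechanism.
inject_grey_zone <- function(tab, i, j, share = 0.5) {
  moved <- round(share * tab[i, i])
  tab[i, i] <- tab[i, i] - moved
  tab[i, j] <- tab[i, j] + moved
  tab
}

tab_gz <- inject_grey_zone(simulate_table(100, 0.93, R = 3), i = 1, j = 2)
detect_grey_zone(tab_gz)   # Delta typically exceeds tau after the injection
```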
Accuracy of Δ

We focus on sensitivity, specificity, and the Matthews correlation coefficient (MCC) to describe the accuracy of the proposed criterion. While sensitivity and specificity reflect the true-positive and true-negative classifications about having a grey zone in the table, MCC considers false-positive and false-negative decisions along with the true-positive and true-negative classifications. There are other performance measures, such as precision and the F1 score; however, since we create 10,000 tables without a grey zone and 10,000 tables with a grey zone, sensitivity, recall, and the F1 score are all equal to each other. Suppose TP, TN, FP, and FN respectively show the number of true-positive, true-negative, false-positive, and false-negative decisions on the existence of a grey zone in the generated tables. Then, sensitivity, specificity, and the Matthews correlation coefficient [33] are calculated as in Eqs. (11) and (12):

$$\text{Sensitivity} = \frac{TP}{10{,}000}, \qquad \text{Specificity} = \frac{TN}{10{,}000}, \tag{11}$$

$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\left[(TP + FP)(TP + FN)(TN + FP)(TN + FN)\right]^{0.5}}. \tag{12}$$

The proposed criterion, Δ, and the threshold, τ_Δ, are computed for each generated table. Then, we create a classification table composed of the true and estimated statuses of a grey zone in the table over the 10,000 replications and compute the accuracy measures in Eqs. (11) and (12). The results when the grey zone is in cell (1, 2) of the table for R = 3 and 4 are given in Table 9. The MCC, sensitivity, and specificity results are plotted for low, moderate, and high agreement under the 3 × 3 and 4 × 4 table settings in Fig. 3. The results for all scenarios, the cell probabilities used to inject the grey zone, and the corresponding Cohen's kappa after introducing a grey zone into the table are given for the 3 × 3 and 4 × 4 agreement tables in Tables S2 and S3 of the Supplementary file, respectively.

Table 9 Sample size, ρ, true κ, TP, FP, FN, TN, sensitivity, specificity, and MCC when the grey zone is in cell (1, 2) of 3 × 3 and 4 × 4 agreement tables

R = 3, grey zone at cell (1, 2):
n      ρ       True κ   TP      FP     FN     TN     Sens    Spec    MCC
50     0.960   0.639    9126    874    385    9615   0.913   0.962   0.875
50     0.980   0.756    7352    2648   58     9942   0.735   0.994   0.755
50     0.986   0.817    4017    5983   80     9920   0.402   0.992   0.488
100    0.930   0.639    9906    94     168    9832   0.991   0.983   0.974
100    0.965   0.744    8432    1568   98     9902   0.843   0.990   0.843
100    0.985   0.835    9338    662    749    9251   0.934   0.925   0.859
250    0.925   0.633    10000   0      777    9223   1.000   0.922   0.925
250    0.963   0.753    9973    27     2141   7859   0.997   0.786   0.801
250    0.977   0.838    10000   0      3288   6712   1.000   0.671   0.711
500    0.910   0.630    10000   0      846    9154   1.000   0.915   0.919
500    0.960   0.754    10000   0      733    9267   1.000   0.927   0.929
500    0.984   0.838    10000   0      2225   7775   1.000   0.778   0.797
1000   0.900   0.632    10000   0      1424   8576   1.000   0.858   0.866
1000   0.960   0.767    10000   0      1430   8570   1.000   0.857   0.866
1000   0.980   0.832    10000   0      212    9788   1.000   0.979   0.979

R = 4, grey zone at cell (1, 2):
n      ρ       True κ   TP      FP     FN     TN     Sens    Spec    MCC
50     0.911   0.624    3958    6042   72     9928   0.396   0.993   0.484
50     0.969   0.731    5224    4776   12     9988   0.522   0.999   0.593
50     0.982   0.839    4839    5161   13     9987   0.484   0.999   0.563
100    0.935   0.612    9694    306    993    9007   0.969   0.901   0.872
100    0.982   0.746    9804    196    727    9273   0.980   0.927   0.909
100    0.992   0.840    9959    41     229    9771   0.996   0.977   0.973
250    0.945   0.616    10000   0      933    9067   1.000   0.907   0.911
250    0.975   0.755    10000   0      623    9377   1.000   0.938   0.940
250    0.987   0.824    10000   0      1065   8935   1.000   0.894   0.899
500    0.945   0.613    10000   0      1381   8619   1.000   0.862   0.870
500    0.977   0.747    10000   0      562    9438   1.000   0.944   0.945
500    0.987   0.824    10000   0      522    9478   1.000   0.948   0.949
1000   0.940   0.617    10000   0      757    9243   1.000   0.924   0.927
1000   0.975   0.740    10000   0      1450   8550   1.000   0.855   0.864
1000   0.985   0.828    10000   0      563    9437   1.000   0.944   0.945

TP: GZ+ and TGZ+; FP: GZ+ and TGZ-; FN: GZ- and TGZ+; TN: GZ- and TGZ-. TGZ+: there is a grey zone in the table; TGZ-: there is no grey zone in the table; GZ+: a grey zone is identified in the table; GZ-: no grey zone is identified in the table. Sens: sensitivity; Spec: specificity; MCC: Matthews correlation coefficient

Fig. 3 Accuracy metrics of Δ for 3 × 3 and 4 × 4 agreement tables under different true agreement levels

For small sample sizes and a near-perfect level of true agreement, it is highly challenging to detect the existence of a grey zone, since the cell counts moving to off-diagonal cells as the result of a grey zone are not large enough to be easily separated from disagreement. Therefore, the sensitivity of Δ, namely the accuracy of correctly detecting the existence of a grey zone, is not as high as desired for n = 50 in 3 × 3 tables. In 4 × 4 tables, it is low for n = 50 at all true agreement levels, since a sample of 50 is distributed across 16 cells instead of 9, making it harder to detect the movement of counts. However, the sensitivity of Δ rapidly increases to over 0.9 when n ≥ 100 for both table sizes; hence, Δ's ability to detect a grey zone is very high for n ≥ 100 for both table structures and all levels of true agreement. The same inferences follow for MCC as well.

The specificity of Δ, namely the accuracy of correctly concluding the absence of a grey zone, is very high for small samples and reduces slightly to around 0.9 for higher sample sizes, for all true agreement levels under the 4 × 4 table size and for a moderate level of true agreement under 3 × 3 tables. There is a drop in the specificity of Δ for moderate sample sizes under 3 × 3 tables and high true agreement. The reason for this is having a near-perfect agreement: when the agreement is near-perfect and the sample size is not large, not many cell counts move off the diagonal to create a notable grey zone, which makes a grey zone harder for Δ to detect. Having near-perfect agreement and a low sample size are the two extreme ends of the conditions under which a grey zone can occur.

From the rates of false-positive decisions in Table 9, there is almost no case where the proposed framework indicates the existence of a grey zone when there is no grey zone in the table for moderate and large sample sizes (n ≥ 250). However, there is an acceptable level of false-negative decisions, where the framework indicates that there is no grey zone in the table while a grey zone is present. From Tables S2 and S3 of the Supplementary file, we draw similar inferences for the accuracy of Δ when the location of the grey zone moves from cell (1, 2) to the other possible cells. Therefore, the location of a grey zone in the agreement table does not have an impact on the accuracy of Δ.
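Eqs. (11)-(12) translate directly into code; a small sketch, checked against the first row of Table 9:

```r
# Accuracy measures of Eqs. (11)-(12), for 10,000 replications with and
# 10,000 without a grey zone (counts taken as laid out in Table 9).
accuracy_measures <- function(TP, FP, FN, TN) {
  c(sensitivity = TP / 10000,
    specificity = TN / 10000,
    MCC = (TP * TN - FP * FN) /
      sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
}

# First row of Table 9 (R = 3, n = 50, rho = 0.960):
accuracy_measures(TP = 9126, FP = 874, FN = 385, TN = 9615)
# sensitivity 0.913, specificity 0.962, MCC 0.875
```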
Overall, the accuracy of Δ, along with the threshold τ_Δ, in detecting the absence and presence of a grey zone is substantially high for sample sizes between 100 and 1,000 under all the considered table sizes and levels of agreement. This makes the proposed framework for detecting a grey zone a useful and reliable approach.

Applications with real data

In order to demonstrate the use of the proposed framework for the detection of grey zones in practice, we focus on previously published agreement tables from studies in the medical field. R codes for the software implementation of the framework are given in the Supplementary Material, along with the calculations for the following applications.

Assessment of torture allegations

We revisit the agreement table given in the motivating example. The agreement table in Table 1 shows the classification resulting from two raters' assessment of the level of details in the description of physical symptoms related to ill-treatment in n = 202 cases [10]. Petersen and Morentin [10] mention that there is a grey zone in this table based on their conceptual assessment, without using any metric. In order to calculate Δ by Eqs. (6) and (7), we use the standardized residuals of the symmetry model given on the right side of Table 5, the corresponding κ = 0.545, and the δ_ij values in Table 6. From Eq. (7), we get Δ = 4.058. Then, we calculate the threshold, τ_Δ, from Eq. (10) as follows:

$$\tau_\Delta = (-0.0080 + 0.4090 \cdot 0.545^2 + 3.331 \cdot 10^{-5} \cdot 202 - 2.467 \cdot 10^{-8} \cdot 202^2)^{-0.6266} = 3.791. \tag{13}$$

Since we have Δ = 4.058 > 3.791 = τ_Δ, it is decided that there is a grey zone in this agreement table. When we check the δ_ij values, δ_43 = 4.058. So, the grey zone of the highest magnitude is between levels 3 and 2, where Rater I tends to rate towards level 3 while Rater II tends to assign the cases to level 2 (note that the levels start from 0 in these data). Looking at Table 6, we observe that there is no other δ_ij greater than 3.791. There is only one other cell that has a δ_ij value close to τ_Δ, δ_21 = 3.433, where Rater I classifies only 7 cases to level 1 while Rater II assigns them to level 0. This is consistent with Rater I's tendency of rating one level higher than Rater II. However, since the raters' level of agreement on level 0 is high, it does not create enough deviation to be identified as a grey zone. Overall, the grey zone identified in this agreement table is in accordance with the conclusions of Petersen and Morentin [10] about the existence of grey zones in these data. Due to the existence of a grey zone in this study, it is possible to report Gwet's AC2 or Brennan-Prediger's S with quadratic weights (Table 2) to conclude a higher level of agreement.

Assessment of PI-RADS v2.1 scores

In a recent study, Wei et al. [26] focused on developing a graphical representation to predict significant prostate cancer in the transition zone based on the scores from the Prostate Imaging Reporting and Data System version 2.1 (PI-RADS v2.1). Wei et al. ([26], Table 2 therein) report the classification of n = 511 cases into five levels of PI-RADS v2.1 scores by two radiologists. In this classification, Radiologist 1 tends to rate one level higher than Radiologist 2 for levels 2, 3, and 4 of PI-RADS v2.1. The Cohen's kappa is κ = 0.461. To decide if there is a grey zone in this table, the δ_ij, i, j = 1, ..., 5, are calculated by Eq. (6) as in Table 10.

Table 10 The δ_ij values for the PI-RADS v2.1 agreement table

                  Radiologist 1
Radiologist 2    1         2         3         4         5
1                0         4.625     3.579     1.534     0
2                -4.625    0         0.166     3.254     0
3                -3.579    -0.166    0         3.889     0
4                -1.534    -3.254    -3.889    0         -1.252
5                0         0         0         1.252     0

For Table 10, Δ = 4.625 by Eq. (7). Then, τ_Δ is calculated by Eq. (10) as follows:

$$\tau_\Delta = (-0.0080 + 0.4090 \cdot 0.461^2 + 3.331 \cdot 10^{-5} \cdot 511 - 2.467 \cdot 10^{-8} \cdot 511^2)^{-0.6266} = 4.537. \tag{14}$$

Since we get Δ = 4.625 > 4.537 = τ_Δ, it is concluded that there is a grey zone in the agreement table of the two radiologists for PI-RADS v2.1 scores. From Table 10, δ_12 = 4.625; hence, the grey zone is between levels 1 and 2, where Radiologist 1 tends to rate one level higher than Radiologist 2 for level 1 of PI-RADS v2.1 scores.
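Both threshold checks can be reproduced with tau_Delta() from the earlier sketch, given only the reported κ and n:

```r
tau_Delta(0.545, 202)  # 3.791 for the torture-allegations table, Eq. (13)
tau_Delta(0.461, 511)  # 4.537 for the PI-RADS v2.1 table, Eq. (14)
```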
The practical implication of identifying this grey zone is related to the reported level of agreement. Wei et al. [26] report a weighted version of Cohen's kappa of 0.648, which corresponds to the linearly weighted kappa. However, Tran et al. [16] find that Gwet's AC2 and Brennan-Prediger's S measures with quadratic and linear weights are the most robust against grey zones. These weighted agreement coefficients are reported in Table 11 for the PI-RADS v2.1 scores data.

Table 11 Weighted agreement coefficients for the assessment of two radiologists for PI-RADS v2.1 scores

Weight      Cohen's kappa   Gwet's AC2   Brennan-Prediger's S
Linear      0.651           0.793        0.747
Quadratic   0.805           0.916        0.879

According to the interpretation of the kappa coefficient by Fleiss et al. [34], a kappa value of 0.651 corresponds to "Fair to Good" agreement. However, all the other values in Table 11 indicate an agreement one level higher, "Very Good," between the two radiologists. Therefore, detecting the grey zone leads to reporting a higher level of agreement between the radiologists.

Discussion

Grey zones arise in agreement studies in various fields, from medicine to education, that involve assigning subjects to ordinal levels. When the raters or assessors tend to assign subjects into classification categories in different manners, an unbalanced structure occurs in adjacent cells around the main diagonal of the agreement table. This imbalance creates grey zones and causes the reported level of agreement to be lower than its actual value. The negative impact of grey zones is demonstrated by Tran et al. [16, 23]. Since a grey zone is an attribute of the rating attitudes of raters, it originates from their willingness to take risks, expertise, training, or a lack of uniform guidelines for the assessment. If the raters take more training after the first round of scoring or are given clearer guidelines for rating, it would be expected that they will not produce the same grey zone(s). To avoid the impact of grey zones, or to test the hypothesis that more training would mitigate the occurrence of grey zones, we need an objective framework to decide if there is a grey zone in an agreement table.

This study proposes a framework that includes two statistics: a criterion and a threshold. The criterion, Δ, captures the deviations from the symmetry model relative to the level of agreement. The threshold, τ_Δ, is a nonlinear function of the level of agreement and the sample size, obtained by nonlinear regression modelling. The comparison of Δ to τ_Δ provides us with a decision criterion for the identification of a grey zone in the agreement table.
The accuracy of the proposed framework is tested by a numerical study through the metrics sensitivity, specificity, and the Matthews correlation coefficient. Small, moderate, large, and very large sample sizes; moderate, substantial, and near-perfect agreement levels; and 3 × 3 and 4 × 4 table sizes are considered in the numerical study. Low and perfect agreement levels are not feasible settings for the existence of a grey zone, since they respectively represent the cases where the raters totally disagree or totally agree. For tables greater than 4 × 4, the impact of the ordinal scale reduces, and the scale starts to approach the continuous scale [32]. Therefore, the results of our numerical study are generalizable to the other cases relevant to the grey zone concept. The proposed framework has satisfactorily high sensitivity for samples larger than 50 observations. Its specificity is very high for all sample sizes. When false-positive and false-negative rates are also considered by the use of the Matthews correlation coefficient, we get satisfactorily high correlations for samples with more than 50 observations.

Although the grey zone concept is defined as an attribute of the raters due to their background and assessment approach, this concept can be extended to the comparison of two diagnostic methods in the grading of diseases. Zavanone et al. [35] consider grading carotid artery stenosis using the noninvasive imaging methods Doppler ultrasound (DUS) and computed tomography angiography (CTA). They compare the classifications by DUS against CTA in grading carotid artery stenosis in 431 arteries into the levels "Mild", "Moderate", "Severe", and "Occlusion". The expectation is to see some degree of agreement between the methods in the grading of the same arteries on the same scale. Although these imaging methods cannot have any biases, they have some differences due to the ways they work, and this can raise artificial inflation in adjacent cells around the main diagonal of the agreement table. Zavanone et al. ([35], Table 1 therein) report the agreement table of DUS and CTA in grading carotid artery stenosis. DUS tends to rate more towards "Severe" in the raw table, while CTA rates those cases as "Moderate". For this table, n = 431 and κ = 0.674. When we implement the proposed framework, we get Δ = 3.028 and τ_Δ = 2.853. Since Δ = 3.028 > 2.853 = τ_Δ, we conclude that there is a grey zone between the levels "Moderate" and "Severe" in the grading of DUS and CTA for carotid artery stenosis. Zavanone et al. [35] report Cohen's quadratically weighted kappa as 0.85, which is more robust against grey zones [16]. However, due to the identification of the grey zone, we can rely on Gwet's AC2 and Brennan-Prediger's S with quadratic weights, which are 0.908 and 0.887, and report an even higher agreement between DUS and CTA.
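As a quick check of the threshold in this example, using the earlier sketch:

```r
tau_Delta(0.674, 431)  # about 2.85; the paper reports 2.853 for the DUS-CTA table
```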
The main limitation of this study concerns the nonlinear regression model used to develop the threshold for Δ. The accuracy of Δ is directly related to the goodness-of-fit of the nonlinear regression model. We obtained an adjusted R-squared of 0.989 for this model, which shows a near-perfect fit for interpolation, occurring for sample sizes between 50 and 1,000 and true kappa values between -0.002 and 0.831. Therefore, the proposed framework should be used cautiously for samples with fewer than 50 or more than 1,000 observations or for agreement tables with a very high true agreement. As discussed, the likelihood of having a grey zone in these cases is extremely low.

Conclusions

In this study, a framework is proposed to detect the existence of grey zones in an agreement table. The main conclusions from the real-data examples and the experimental study conducted with 3 × 3 and 4 × 4 agreement tables under small, moderate, and large samples and moderate, substantial, and near-perfect agreement levels are summarized as follows:

• The proposed framework has a sufficiently high capability to detect the existence of a grey zone for samples of size greater than 50 under all the considered table sizes and true agreement levels.
• The proposed framework's accuracy in correctly determining the absence of a grey zone is very high in all the considered scenarios of sample size, table size, and true agreement level.
• When there is no grey zone in the agreement table, the framework seldom returns a positive result for tables with a sample size greater than or equal to 250, under all the considered table sizes and true agreement levels.
• The rate of false decisions of the framework in detecting a grey zone when there is a grey zone in the table is at an acceptable level.
• The location of a grey zone in the agreement table does not impact the accuracy of the proposed framework.
• The real-data examples demonstrate that if a grey zone is detected in the agreement table, it is possible to report a higher magnitude of agreement with high confidence. In that sense, if a practitioner suspects a grey zone, such as in the first example, the use of the proposed framework leads to more accurate conclusions.
• Overall, the proposed metric Δ and its threshold τ_Δ provide researchers with an easy-to-implement, reliable, and accurate way of testing for the existence of a grey zone in an agreement table.

A future direction for this research is to extend the definition of grey zones to include attributes of rating mechanisms other than human assessors, as mentioned in the Discussion section.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1186/s12874-022-01759-7.

Additional file 1. Electronic Supplementary Material for 'Detection of Grey Zones in Inter-rater Agreement Studies'.

Acknowledgements

The authors would like to acknowledge the comments of two reviewers that helped improve the clarity and quality of the article.

Authors' contributions

HD: design of the work; implementation of simulations; analysis and interpretation of results; creation of the software used in the work; drafted and revised the work. AEY: design of the work; implementation of simulations; analysis and interpretation of results; creation of the software used in the work; drafted and revised the work. All authors read and approved the manuscript.

Funding

Not applicable.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Received: 9 June 2022. Accepted: 18 October 2022.

References

1. Hernaez R. Reliability and agreement studies: a guide for clinical investigators. Gut. 2015;64(7):1018–27. https://doi.org/10.1136/gutjnl-2014-308619.
2. Kottner J, Streiner DL. The difference between reliability and agreement. J Clin Epidemiol. 2011;64(6):701–2. https://doi.org/10.1016/j.jclinepi.2010.12.001.
3. Farzin B, Gentric JC, Pham M, Tremblay-Paquet S, Brosseau L, Roy C, et al. Agreement studies in radiology research. Diagn Interv Imaging. 2017;98(3):227–33. https://doi.org/10.1016/j.diii.2016.05.014.
4. Northrup N, Howerth W, Harmon B, et al. Variation among pathologists in the histologic grading of canine cutaneous mast cell tumors with uniform use of a single grading reference. J Vet Diagn Investig. 2005;17:561–4.
5. Barnard ME, Pyden A, Rice MS, Linares M, Tworoger SS, Howitt BE, et al. Inter-pathologist and pathology report agreement for ovarian tumor characteristics in the Nurses' Health Studies. Gynecol Oncol. 2018;150(3):521–6.
6. Shah AS, McAllister DA, Mills R, Lee KK, Churchhouse AM, Fleming KM, et al. Sensitive troponin assay and the classification of myocardial infarction. Am J Med. 2015;128(5):493–501.
7. Gard A, Lindahl B, Batra G, Hadziosmanovic N, Hjort M, Szummer KE, et al. Interphysician agreement on subclassification of myocardial infarction. Heart. 2018;104(15):1284–91. https://doi.org/10.1136/heartjnl-2017-312409.
8. Summerfeldt LJ, Ovanessian MM, Antony MM. Structured and semistructured diagnostic interviews. In: Antony MM, Barlow DH, editors. Handbook of assessment and treatment planning for psychological disorders. New York: The Guilford Press; 2020. p. 74–115.
9. Blanchard JJ, Brown SB. 4.05 - Structured diagnostic interview schedules. In: Bellack AS, Hersen M, editors. Comprehensive Clinical Psychology. Oxford: Pergamon; 1998. p. 97–130. https://doi.org/10.1016/B0080-4270(73)00003-1.
10. Petersen HD, Morentin B. Assessing the level of credibility of allegations of physical torture. Forensic Sci Int. 2019;301:263–70.
11. Gwet KL. Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. USA: Advanced Analytics, LLC; 2014.
12. Tran D, Dolgun A, Demirhan H. Weighted inter-rater agreement measures for ordinal outcomes. Commun Stat-Simul Comput. 2018;49:1–15.
13. Warrens MJ. Some paradoxical results for the quadratically weighted kappa. Psychometrika. 2012;77(2):315–23.
14. Warrens MJ. Cohen's weighted kappa with additive weights. Adv Data Anal Classif. 2013;7(1):41–55.
15. Warrens MJ. Weighted kappas for 3 × 3 tables. J Probab Stat. 2013; Article ID 325831.
16. Tran QD, Dolgun A, Demirhan H. The impact of grey zones on the accuracy of agreement measures for ordinal tables. BMC Med Res Methodol. 2021;21:70. https://doi.org/10.1186/s12874-021-01248-3.
17. Schleicher I, Leitner K, Juenger J, Moeltner A, Ruesseler M, Bender B, et al. Examiner effect on the objective structured clinical exam - a study at five medical schools. BMC Med Educ. 2017;17(1):71.
18. van Dooijeweert C, van Diest PJ, Baas IO, van der Wall E, Deckers IA. Grading variation in 2,934 patients with ductal carcinoma in situ of the breast: the effect of laboratory- and pathologist-specific feedback reports. Diagn Pathol. 2020;15:1–9.
19. Boyd NF, Wolfson C, Moskowitz M, Carlile T, Petitclerc M, Ferri HA, et al. Observer variation in the interpretation of xeromammograms. J Natl Cancer Inst. 1982;68(3):357–63.
20. Zbären P. Fine needle aspiration cytology, core needle biopsy, and frozen section. Surg Salivary Glands E-book. 2019:32.
21. van Dooijeweert C, van Diest P, Ellis I. Grading of invasive breast carcinoma: the way forward. Virchows Archiv. 2021;1–11. https://doi.org/10.1007/s00428-021-03141-2.
22. van Dooijeweert C, Deckers IA, de Ruiter EJ, Ter Hoeve ND, Vreuls CP, van der Wall E, et al. The effect of an e-learning module on grading variation of (pre)malignant breast lesions. Mod Pathol. 2020;33(10):1961–7.
23. Tran QD, Demirhan H, Dolgun A. Bayesian approaches to the weighted kappa-like inter-rater agreement measures. Stat Methods Med Res. 2021;30(10):2329–51. https://doi.org/10.1177/09622802211037068.
24. Yilmaz AE, Saracbasi T. Assessing agreement between raters from the point of coefficients and log-linear models. J Data Sci. 2017;15(1):1–24.
25. Wei GC, Chen T, Zhang YY, Pan P, Dai GC, Yu HC, et al. Biparametric prostate MRI and clinical indicators predict clinically significant prostate cancer in men with "gray zone" PSA levels. Eur J Radiol. 2020;127:108977.
26. Wei C, Pan P, Chen T, Zhang Y, Dai G, Tu J, et al. A nomogram based on PI-RADS v2.1 and clinical indicators for predicting clinically significant prostate cancer in the transition zone. Transl Androl Urol. 2021;10(6):2435.
27. Agresti A. An introduction to categorical data analysis. New York: Wiley; 2018.
28. de Raadt A, Warrens MJ, Bosker RJ, Kiers HA. A comparison of reliability coefficients for ordinal rating scales. J Classif. 2021;38(3):519–43.
29. Box GEP, Cox DR. An analysis of transformations (with discussion). J R Stat Soc Ser B. 1964;26:211–52.
30. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002. https://www.stats.ox.ac.uk/pub/MASS4/.
31. Muthén B. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika. 1984;49(1):115–32.
32. Johnson DR, Creech JC. Ordinal measures in multiple indicator models: a simulation study of categorization error. Am Sociol Rev. 1983;398–407.
33. Boughorbel S, Jarray F, El-Anbari M. Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS ONE. 2017;12(6):e0177678.
34. Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions. New York: Wiley; 2013.
35. Zavanone C, Ragone E, Samson Y. Concordance rates of Doppler ultrasound and CT angiography in the grading of carotid artery stenosis: a systematic literature review. J Neurol. 2012;259(6):1015–8.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.