REVIEW ARTICLE
Concordance Analysis
Part 16 of a Series on Evaluation of Scientific Publications
Deutsches Ärzteblatt International | Dtsch Arztebl Int 2011; 108(30): 515–21
MEDICINE
FIGURE 2
Comparison of two raters with a Bland-Altman diagram; diagrams are shown for Example a (above) and Example b (below)
FIGURE 3
Point cloud diagrams for comparing two functionally related measuring techniques; Measurement 1 vs Measurement 2 for Example c (above) and Example d (below)
below the mean of all differences (2). The factor 2 is often used, for simplicity, instead of 1.96; the latter, however, corresponds more precisely to the 97.5% quantile of the normal distribution. In summary, the Bland-Altman diagram is a useful aid that enables a visual comparison of measuring techniques.
In Figure 2a, the Bland-Altman diagram for Example a confirms that the two measuring techniques are in close agreement. The mean-of-all-differences line is very near 0; thus, there seems to be no systematic deviation between the measured values of the two techniques. In this example, the standard deviation of all differences is roughly 0.05. Assuming that the quantity being measured is normally distributed, we can conclude that the difference between the two measurements will be less than 0.1 in 95% of cases; this difference is small in relation to the measured quantities themselves. The distance between the two limits of agreement (in other words, the width of the region of agreement) is 0.2 in this example.
When Bland-Altman diagrams are used in real-life situations to see how well two measuring techniques agree, the question whether the observed degree of agreement is good enough can only be answered in relation to the particular application for which the techniques are to be used (i.e., good enough for what?). Prospective users must decide how closely the measurements must agree (otherwise stated: how narrow the band between the limits of agreement must be) to be acceptable for clinical purposes. Tetzlaff et al. (1), for instance, compared magnetic
Equivalently stated, the 95% confidence interval for Cohen's kappa of the overall population is 0.042 to 0.418.

KEY MESSAGES
• The mere demonstration that a correlation coefficient differs significantly from 0 is totally unsuitable for concordance analysis. Such tests are often wrongly used.
• The appropriate method for concordance analysis depends on the type of scale used by the measuring or rating techniques that are to be compared.
• The point-cloud diagram, the Bland-Altman diagram, and Cohen's kappa are suitable methods for concordance analysis.
• Concordance analysis cannot be used to judge the correctness of measuring or rating techniques; rather, it shows the degree to which different measuring or rating techniques agree with each other.

Discussion
Statistical methods of assessing the degree of agreement between two raters or two measuring techniques are used in two different situations:
• ratings on a continuous scale, and
• categorical (nominal) ratings.
In the first situation, it is advisable to use descriptive and graphical methods, such as point-cloud plots around the line of agreement and Bland-Altman diagrams. Although point clouds are more intuitive and perspicuous, Bland-Altman diagrams enable a more detailed analysis in which the differences between the two raters are assessed not just qualitatively, but also quantitatively. The limits of agreement in a Bland-Altman diagram may be unsuitable for assessing the agreement between two measuring techniques if the differences between measured values are not normally distributed. In such cases, empirical quantiles can be used instead.
The distribution of the differences between two measured values can be studied in greater detail if, as a first step, these differences are plotted on a histogram (3). In many cases, when the two measuring techniques are linked by a good linear (or other functional) relationship, it will be possible to predict one of the measurements from the other one, even if the two techniques yield very different results at first glance. The Pearson correlation coefficient is a further type of descriptive statistic; it indicates the presence of a linear relationship. A significantly nonzero correlation coefficient, however, cannot be interpreted as implying that two raters are concordant, as their ratings may still deviate from each other very strongly even when a significant correlation is present.
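The distinction between correlation and concordance can be illustrated with a small numerical sketch. The Python code below is not taken from the article: the simulated data, the random seed, and the helper names `pearson_r` and `bland_altman` are assumptions made purely for illustration. For two hypothetical pairs of measuring techniques it computes the Pearson correlation coefficient, the mean of the differences, the limits of agreement (mean ± 1.96 standard deviations of the differences), and the empirical 2.5% and 97.5% quantiles of the differences, which may serve as an alternative when the differences are not normally distributed.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman(x, y, factor=1.96):
    """Mean difference, limits of agreement, and empirical quantiles."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    cuts = statistics.quantiles(diffs, n=40)  # cut points in steps of 2.5%
    return {
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "lower": mean_diff - factor * sd_diff,
        "upper": mean_diff + factor * sd_diff,
        "q025": cuts[0],   # empirical 2.5% quantile
        "q975": cuts[-1],  # empirical 97.5% quantile
    }


# Simulated (hypothetical) data; the fixed seed makes the sketch reproducible.
rng = random.Random(0)
truth = [rng.uniform(1.0, 5.0) for _ in range(200)]

# Pair 1: both techniques measure on the same scale with small random error.
m1 = [t + rng.gauss(0.0, 0.035) for t in truth]
m2 = [t + rng.gauss(0.0, 0.035) for t in truth]

# Pair 2: the second technique reads systematically higher (scale and offset).
b1 = m1
b2 = [1.2 * t + 0.5 + rng.gauss(0.0, 0.035) for t in truth]

for name, (x, y) in {"pair 1": (m1, m2), "pair 2": (b1, b2)}.items():
    s = bland_altman(x, y)
    print(
        f"{name}: r = {pearson_r(x, y):.3f}, "
        f"mean difference = {s['mean_diff']:.3f}, "
        f"limits of agreement = {s['lower']:.3f} to {s['upper']:.3f}, "
        f"empirical quantiles = {s['q025']:.3f} to {s['q975']:.3f}"
    )
```

By construction, both simulated pairs should show a correlation coefficient close to 1, but only the first pair should have a mean difference near 0 and narrow limits of agreement. In the second pair the systematic deviation appears in the differences, not in the correlation coefficient.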
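For categorical ratings, Cohen's kappa and an approximate 95% confidence interval of the kind reported above can be computed from the cross-table of the two raters. The following Python sketch is an illustration only: the function name `cohens_kappa`, the example table, and the choice of a simple large-sample standard error, sqrt(p_obs (1 − p_obs) / (n (1 − p_exp)²)), are assumptions and are not necessarily the exact formula or data used in the article.

```python
import math


def cohens_kappa(table, z=1.96):
    """Cohen's kappa with an approximate confidence interval.

    table[i][j] is the number of subjects placed in category i by
    rater 1 and in category j by rater 2 (square table of counts).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Observed agreement: proportion of subjects on the main diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the marginal totals.
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2

    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Simple large-sample standard error (an assumed approximation).
    se = math.sqrt(p_obs * (1 - p_obs) / (n * (1 - p_exp) ** 2))
    return kappa, se, (kappa - z * se, kappa + z * se)


# Hypothetical 2x2 cross-table: rows = rater 1, columns = rater 2.
example = [[20, 5],
           [10, 15]]
kappa, se, (low, high) = cohens_kappa(example)
print(f"kappa = {kappa:.3f}, 95% CI = {low:.3f} to {high:.3f}")
```

For this hypothetical table, the observed agreement is 0.7 and the chance agreement is 0.5, which gives a kappa of 0.40 with an interval of roughly 0.15 to 0.65. As with the interval quoted in the text, the width of the interval shows how uncertain the estimate is for a sample of this size.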
Cohen's kappa is a suitable tool for assessing the degree of agreement between two raters for categorical (nominal) ratings. A confidence interval for Cohen's kappa can be calculated as well.

Conflict of interest statement
The authors declare that no conflict of interest exists.

Manuscript submitted on 22 November 2010; revised version accepted on 11 May 2011.

Translated from the original German by Ethan Taub, M.D.

REFERENCES
1. Tetzlaff R, Schwarz T, Kauczor HU, Meinzer HP, Puderbach M, Eichinger M: Lung function measurement of single lungs by lung area segmentation on 2D dynamic MRI. Acad Radiol 2010; 17: 496–503.
2. Altman DG: Practical statistics for medical research. 1st edition. Oxford: Chapman and Hall 1991; 1–611.
3. Altman DG, Bland JM: Measurement in medicine: the analysis of method comparison studies. The Statistician 1983; 32: 307–17.
4. Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 1: 307–10.
5. Song JW, Oh YM, Shim TS, Kim WS, Ryu JS, Choi CM: Efficacy comparison between (18)F-FDG PET/CT and bone scintigraphy in detecting bony metastases of non-small-cell lung cancer. Lung Cancer 2009; 65: 333–8.
6. du Prel JB, Hommel G, Röhrig B, Blettner M: Confidence interval or p-value? Part 4 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2009; 106(19): 335–9.
7. Bortz J, Lienert GA, Boehnke K: Verteilungsfreie Methoden in der Biostatistik. 3rd edition. Heidelberg: Springer 2008; 1–929.
8. Hilgers RD, Bauer P, Scheiber V: Einführung in die Medizinische Statistik. 2nd edition. Heidelberg: Springer 2007.
9. Altman DG, Machin D, Bryant TN, Gardner MJ: Statistics with confidence. 2nd edition. London: BMJ Books 2000.

Corresponding author
Dr. rer. nat. Robert Kwiecien
Institut für Biometrie und Klinische Forschung (IBKF)
Westfälische Wilhelms-Universität Münster
Albert-Schweitzer-Campus 1 – Gebäude A11
D-48149 Münster, Germany
[email protected]