RELIABILITY IN SOFTWARE ENGINEERING QUALITATIVE RESEARCH THROUGH INTER-CODER AGREEMENT: A GUIDE USING KRIPPENDORFF’S α & ATLAS.TI
arXiv:2008.00977v2 [cs.SE] 10 Jan 2021
ABSTRACT. In recent years, research in empirical software engineering that uses qualitative data analysis (e.g., case studies, interview surveys, and grounded theory studies) has been increasing. However, most of this research does not delve into the reliability and validity of its findings, and specifically into the reliability of the coding on which these methodologies rely, despite the existence of a variety of statistical techniques known as Inter-Coder Agreement (ICA) for analyzing consensus in team coding.
This paper aims to establish a novel theoretical framework that enables a methodological approach
for conducting this validity analysis. This framework is based on a set of coefficients for measuring
the degree of agreement that different coders achieve when judging a common matter. We analyze
different reliability coefficients and provide detailed examples of calculation, with special attention
to Krippendorff’s α coefficients. We systematically review several variants of Krippendorff’s α re-
ported in the literature and provide a novel common mathematical framework in which all of them
are unified through a universal α coefficient. Finally, this paper provides a detailed guide on the use of this theoretical framework in a large case study on DevOps culture. We explain how α coefficients are computed and interpreted using Atlas.ti, a widely used software tool for qualitative analysis.
We expect that this work will help empirical researchers, particularly in software engineering, to
improve the quality and trustworthiness of their studies.
1. INTRODUCTION
In recent years, the research on empirical software engineering that uses qualitative research
methods is on the rise [16, 38, 41, 35, 39]. Coding plays a key role in the qualitative data analysis
process of case studies, interview surveys, and grounded theory studies. Content analysis, thematic
analysis and grounded theory’s coding methods (e.g. in vivo, process, initial, focused, axial, and
theoretical coding) have been established as leading procedures for conducting qualitative data
analysis as they provide methods for examining and interpreting qualitative data to understand
what it represents [8, 34].
Reliability in coding is particularly crucial to identify mistakes before the codes are used in
developing and testing a theory or model. In this way, it assesses the soundness and correctness of
the drawn conclusions, with a view towards creating well-posed and long-lasting knowledge. Weak confidence in the data only leads to uncertainty in the subsequent analysis and generates doubts about the findings and conclusions. In Krippendorff’s own words: “If the results of reliability testing are
compelling, researchers may proceed with the analysis of their data. If not, doubts prevail as to
what these data mean, and their analysis is hard to justify” [23].
This problem can be addressed by means of well-established statistical techniques known as
Inter-Coder Agreement (ICA) analysis. These are a collection of coefficients that measure the extent of the agreement/disagreement between several judges when they subjectively interpret a common reality. In this way, these coefficients allow researchers to establish a value of reliability
of the coding that will be analyzed later to infer relations and to lead to conclusions. Coding is
reliable if coders can be shown to agree on the categories assigned to units to an extent determined
by the purposes of the study [5].
The ICA techniques are used in many different research contexts, both in the social sciences and in engineering. For instance, these methods have been applied for evaluating the selection criteria of primary studies or data extraction in systematic literature reviews, and the ratings when coding qualitative data. This has led to a variety of related terms used interchangeably, such as inter-coder, inter-rater or inter-judge agreement, according to the conventions of each field.
However, although more and more researchers apply inter-rater agreement to assess the validity
of their results in systematic literature reviews and mapping studies as we analyzed in [32], it is
a fact that few studies in software engineering analyze and test reliability and trustworthiness of
coding, and thus, the validity of their findings. A systematic search in the main scientific reposito-
ries (namely, ACM Digital Library, Science Direct and Springer1) returns no more than 25 results.
Similar results were obtained in a systematic literature review reported by Nili et al. [30] in in-
formation management research. Nevertheless, the number of publications that test the reliability of their coding, and consequently of their findings, is notably higher in other areas, especially in health sciences, social psychology, education, and business [30]. In this paper we focus on testing
reliability of qualitative data analysis, specifically reliability of coding, and in this context, judges
or raters are referred to as coders.
In this paper, we propose to introduce the ICA analysis techniques in software engineering em-
pirical research in order to enhance the reliability of coding in qualitative data analysis and the
soundness of the results. For that purpose, in Section 2 we review some of the coefficients that,
historically, have been reported in the literature to measure the ICA. We start by discussing some general-purpose statistics, like Cronbach’s α [7], Pearson’s r [31] and Spearman’s ρ [37], that are often mistakenly assumed to be suitable for ICA analysis, since they measure correlation, a weaker concept than agreement, rather than the degree of agreement among coders. Section
2.1 reviews some coefficients that evaluate agreement between coders, which are not suitable for
measuring reliability because they do not take into account the agreement by chance, such as the
percent agreement [12, 13] and the Holsti Index [20]. We also present a third group, with repre-
sentants like Scott’s π [36] (Section 2.2), Cohen’s κ [3] (Section 2.3) and Fleiss’ κ [14] (Section
2.4), which have been intensively used in the literature for measuring reliability, especially in the social sciences. However, as pointed out by Krippendorff in [19], all of these coefficients suffer from some kind of weakness that makes them suboptimal for measuring reliability. Finally, in Section 2.5 we briefly sketch Krippendorff’s proposal to overcome these flaws, the so-called Krippendorff’s α coefficient [23], on which we will focus throughout this paper.
Despite the success and wide adoption of Krippendorff’s α, the literature contains plenty of variants of this coefficient formulated quite ad hoc for very precise and particular situations, like [21, 22, 42, 18, 9, 24, 23]. The lack of uniform treatment of these measures makes their use confusing, and the co-existence of different formulations and diffuse interpretations makes their comparison a hard task. To address this problem, Sections 3 and 4 describe a novel theoretical
framework that reduces the existing variants to a unique universal α coefficient by means of labels
that play the role of meta-codes. With this idea in mind, we focus on four of the most outstanding
and widely used α coefficients and show how their computation can be reduced to the universal α
by means of a simple re-labelling. This framework provides new and more precise interpretations
1 The search string used is ‘‘inter-coder agreement AND software engineering AND qualitative research’’.
of these coefficients that will help to detect flaws in the coding process and to correct them easily on the fly. Moreover, this interpretation in terms of labels sheds light on some awkward behaviors of the α coefficients that are very hard to understand otherwise.
Section 5 includes a tutorial on the use and interpretation of Krippendorff’s α coefficients for
providing reliability in software engineering case studies through the software tool Atlas.ti. This
tool provides support for the different tasks that take place during qualitative data analysis, as well
as the calculation of the ICA measures. There exist in the market a variety of tools oriented towards qualitative research, like NVIVO, MaxQDA, Qualcoder, Qcoder, etc., but, for its simplicity and its ability to compute Krippendorff’s α coefficients, throughout this tutorial we will focus on Atlas.ti. The tutorial is driven by a running example based on a real case study developed by the authors about a qualitative inquiry in the DevOps domain [10]. Additionally, we highlight several peculiarities of Atlas.ti when dealing with a large corpus and sparse relevant matter.
Finally, in Section 6 we summarize the main conclusions of this paper and we provide some
guidelines on how to apply this tutorial to case studies in qualitative research. We expect that
the theoretical framework, the methodology, and the subsequent tutorial introduced in this paper
may help empirical researchers, particularly in software engineering, to improve the quality and
soundness of their studies.
2. BACKGROUND
In a variety of situations, researchers have to deal with the problem of judging some data. While
the observed data is objective, the perception of each researcher is deeply subjective. In many
cases, the solution is to introduce several judges, typically referred to as coders, to reduce the
amount of subjectivity by comparing their judgements. However, in this context, a method for
measuring the degree of agreement that the coders achieve in their evaluations, i.e., the coding of
the raw data, is required. Thus, researchers need to measure the reliability of coding. Only after establishing that reliability is sufficiently high does it make sense to proceed with the analysis of the data.
It is worth mentioning that, although often used interchangeably, there is a technical distinction
between the terms agreement and reliability. Inter-Coder Agreement (ICA) coefficients assess the
extent to which the responses of two or more independent raters are concordant; on the other hand,
inter-coder reliability evaluates the extent to which these raters consistently distinguish between
different responses [17]. In other words, the measurable quantity is the ICA, and using this value
we can infer reliability. In the same vein, we should not confuse reliability with validity. Reliability deals with the extent to which the research is deterministic and independent of the coders, and in this sense it is strongly tied to reproducibility; validity deals with truthfulness, i.e., with the extent to which the claims assert the truth. Reliability is necessary for validity, but does not guarantee it. Several coders may share a common interpretation of the reality, so that we have a high level of reliability, but this interpretation might be wrong or biased, so that validity is low.
There co-exist in the literature several statistical coefficients that have been applied for evaluating ICA, such as Cronbach’s α, Pearson’s r and Spearman’s ρ. However, these coefficients should not be confused with inter-coder agreement tests, as none of these three methods measures the degree of agreement among coders. For instance, Cronbach’s α [7] is a statistic for interval or ratio level data that focuses on the consistency of coders when numerical judgements are required for a set of units. As stated in [29]: “It calculates the consistency by which people
judge units without any aim to consider how much they agree on the units in their judgments”, and
as written in [19] “[it is] unsuitable to assess reliability of judgments”.
On the other hand, correlation coefficients, such as Pearson’s r [31] or Spearman’s rank ρ [37], measure the extent to which two logically separate interval variables, say X and Y, covary in a linear relationship of the form Y = a + bX. They indicate the degree to which the values of one variable predict the values of the other. Agreement coefficients, in contrast, must measure the extent to which Y = X. High correlation means that the data approximate some regression line, whereas high agreement means that they approximate the 45-degree line [23].
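To make the distinction concrete, the following Python sketch (with purely hypothetical ratings) shows two coders whose numerical judgements are perfectly correlated and yet never agree:

    # Hypothetical ratings: perfect correlation, zero agreement.
    import statistics

    x = [1, 2, 3, 4, 5]   # ratings by coder 1
    y = [3, 4, 5, 6, 7]   # ratings by coder 2: always x + 2

    # Pearson's r via the standard formula.
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = cov / (statistics.pstdev(x) * statistics.pstdev(y) * len(x))

    # Agreement: fraction of items receiving exactly the same value.
    agreement = sum(a == b for a, b in zip(x, y)) / len(x)

    print(round(r, 6))   # 1.0 -> the data lie exactly on the line Y = 2 + X
    print(agreement)     # 0.0 -> the coders never assign the same value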
For these reasons, more advanced coefficients for measuring agreement are needed to assess
reliability. In this section we analyze several proposals that have been reported in the literature for
quantifying the Inter-Coder Agreement (ICA) and for inferring, from this value, the reliability of the
coding.
2.1. Percent agreement. This measure is computed as the ratio between the number of times the coders agreed when classifying an item (a datum to be analyzed) and the total number of items (multiplied by 100 if we want to express it as a percentage). It has been widely used in the literature due to its simplicity and straightforward calculation [12, 13]. However, it is not a valid measure for inferring reliability in case studies that require a high degree of accuracy, since it does not take into account the agreement by chance [29] and, according to [19], there is no clear interpretation for values other than 100%. In addition, it can be used only with two coders and only for nominal data [43].
Table 1 shows an illustrative example, which is part of a Systematic Literature Review (SLR)
performed by some authors of this paper [32]. This table shows the selection of primary studies;
each item corresponds to one of these studies and each coder (denoted by J1 and J2 ) determines,
according to a pre-established criterion, if the studies should be promoted to an analysis phase (Y)
or not (N). From these data, the percent agreement attained is (10/15) · 100 = 66.7%. At first sight, this seems to be a high value that would lead to high reliability of the data. However, we are missing the fact that the coders may achieve agreement purely by chance. As pointed out
by [23]: “Percent-agreement is often used to talk about the reliability between two observers, but
it has no valid reliability interpretations, not even when it measures 100%. Without reference to
the variance in the data and to chance, percent agreement is simply uninterpretable as a measure
of reliability—regardless of its popularity in the literature”. Indeed, following sections show the
values of other ICA measures such as Cohen’s κ and Krippendorff’s α, which are very low: 0.39
(39%) and 0.34 (34%), respectively.
A closely related measure is the Holsti Index [20], a variation of the percent agreement that can be applied when there are no pre-defined quotations and each coder selects the matter they consider relevant. Nevertheless, for the same reasons as the simple percent agreement, this index is not a valid measure for analyzing ICA.
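As a quick illustration, the percent agreement of Table 1 can be reproduced with a few lines of Python; the item-level ratings used below are reconstructed from Tables 1 and 6 (J1 votes Y only on items #07, #10, #11 and #15, while J2 also votes Y on items #01, #02, #06, #08 and #14):

    # Percent agreement for the two coders of Table 1.
    J1 = list("NNNNNNYNNYYNNNY")
    J2 = list("YYNNNYYYNYYNNYY")

    agreements = sum(a == b for a, b in zip(J1, J2))
    percent_agreement = 100 * agreements / len(J1)
    print(round(percent_agreement, 1))   # 66.7 -> 10 matching items out of 15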
2.2. Scott’s π. This index, introduced in [36], is an agreement coefficient for nominal data and
two coders. The method corrects percent agreement by taking into account the agreement that can
occur between the coders by chance. The index of Inter-Coder Agreement for Scott’s π coefficient
is computed as
π = (P_o − P_e) / (1 − P_e).
Here, Po (observed percent agreement) represents the percentage of judgments on which the two
analysts agree when coding the same data independently; and Pe is the percent agreement to be
expected on the basis of chance. This latter value can be computed as

P_e = ∑_{i=1}^{k} p_i^2,
where k is the total number of categories and pi is the proportion of the entire sample which falls
in the i-th category.
For instance, for our example of Table 1, we have that Po = 0.667, as computed in Section 2.1.
For P_e, it is given by P_e = (13/30)^2 + (17/30)^2 = 0.509. Observe that, while the number of items
is 15, in the previous computation we divided by 30. This is due to the fact that there are 15 items,
but 30 pairs of evaluations, see also Table 2.
           J1    J2    p_i                      p_i^2
Y           4     9    4/30 + 9/30 = 0.433      0.188
N          11     6    11/30 + 6/30 = 0.567     0.321
Total                                           P_e = 0.509

Therefore, Scott’s π for this example is π = (0.667 − 0.509)/(1 − 0.509) ≈ 0.32.
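The same computation can be scripted; the following Python sketch reproduces P_o, P_e and π for the reconstructed ratings of Table 1:

    # Scott's pi for the data of Table 1, following the formulas above.
    J1 = list("NNNNNNYNNYYNNNY")
    J2 = list("YYNNNYYYNYYNNYY")

    m = len(J1)
    Po = sum(a == b for a, b in zip(J1, J2)) / m

    # Joint proportions: each category is counted over the 2*m judgements.
    categories = set(J1) | set(J2)
    Pe = sum(((J1 + J2).count(c) / (2 * m)) ** 2 for c in categories)

    pi = (Po - Pe) / (1 - Pe)
    print(round(Po, 3), round(Pe, 3), round(pi, 3))   # 0.667 0.509 0.321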
2.3. Cohen’s κ. Cohen’s κ [3] is another chance-corrected agreement coefficient for nominal data and two coders; in contrast to Scott’s π, the expected agreement is estimated from each coder’s individual distribution of judgements. For two coders J1 and J2 and k categories, the judgements can be arranged in a contingency matrix whose entry c_{i,j} counts the items to which J1 assigned the i-th category and J2 assigned the j-th category, as shown below.

                  J1: Category 1   J1: Category 2   ...   J1: Category k
J2: Category 1       c_{1,1}          c_{2,1}       ...      c_{k,1}
J2: Category 2       c_{1,2}          c_{2,2}       ...      c_{k,2}
...
J2: Category k       c_{1,k}          c_{2,k}       ...      c_{k,k}

From this contingency matrix, the observed agreement, P_o, and the agreement by chance, P_c, are defined as

P_o = (1/m) ∑_{i=1}^{k} c_{i,i},        P_c = ∑_{i=1}^{k} p_i,

where m is the number of items and the probability of the i-th category, p_i, is given by

p_i = ( (1/m) ∑_{j=1}^{k} c_{i,j} ) ( (1/m) ∑_{j=1}^{k} c_{j,i} ) = (1/m^2) ∑_{j=1}^{k} c_{i,j} ∑_{j=1}^{k} c_{j,i}.

The κ coefficient is then computed, analogously to Scott’s π, as κ = (P_o − P_c)/(1 − P_c).
For the example of Table 1, the contingency matrix is the following.

            J1: Y   J1: N   Total
J2: Y         4       5       9
J2: N         0       6       6
Total         4      11      15

Hence, P_o = 10/15 = 0.667 and P_c = (4/15)(9/15) + (11/15)(6/15) ≈ 0.453, so that κ = (0.667 − 0.453)/(1 − 0.453) ≈ 0.39, the value quoted in Section 2.1.
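The computation above can be reproduced with the following Python sketch, again using the item-level ratings reconstructed from Table 1; note that, unlike Scott’s π, the expected agreement is built from each coder’s own marginal distribution:

    # Cohen's kappa for the contingency table above.
    J1 = list("NNNNNNYNNYYNNNY")
    J2 = list("YYNNNYYYNYYNNYY")

    m = len(J1)
    Po = sum(a == b for a, b in zip(J1, J2)) / m

    # Chance agreement from the product of the two coders' marginals.
    categories = set(J1) | set(J2)
    Pc = sum((J1.count(c) / m) * (J2.count(c) / m) for c in categories)

    kappa = (Po - Pc) / (1 - Pc)
    print(round(Po, 3), round(Pc, 3), round(kappa, 3))   # 0.667 0.453 0.39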
It is worth mentioning that, despite its simplicity, this coefficient has some intrinsic problems. On the one hand, it is limited to nominal data and two coders. On the other hand, it is difficult to
interpret the result. Under various conditions, the κ statistic is affected by two paradoxes that
return biased estimates of the statistic itself: (1) high levels of observer agreement with low κ
values; (2) lack of predictability of changes in κ with changing marginals [26]. Some proposals
for overcoming these paradoxes are described in [11] and in [2]. According to [19]: “κ is simply
incommensurate with situations in which the reliability of data is the issue”.
2.4. Fleiss’ κ. Fleiss’ κ [14] extends the previous ideas to an arbitrary number n ≥ 2 of coders. For this coefficient, we no longer focus on the contingency matrix, but on the number of ratings. Hence, given a category 1 ≤ i ≤ k and an item 1 ≤ β ≤ m, we will denote by n_{i,β} the number of raters that assigned the i-th category to the β-th item.
In this case, for each item β and for each category i, we can compute the corresponding proportion of observations as

p_i = (1/(nm)) ∑_{β=1}^{m} n_{i,β},        p̂_β = (1/(n(n−1))) ∑_{i=1}^{k} n_{i,β}(n_{i,β} − 1) = (1/(n(n−1))) ( ∑_{i=1}^{k} n_{i,β}^2 − n ).
Recall that p_i is very similar to the one considered in Cohen’s κ but, now, p̂_β counts the rate of coder–coder pairs that are in agreement (relative to the number of all possible coder–coder pairs). In this way, the observed agreement, P_o, and the expected agreement, P_e, are the averages of these quantities relative to the total number of possibilities, that is

P_o = (1/m) ∑_{β=1}^{m} p̂_β,        P_e = ∑_{i=1}^{k} p_i^2.

The Fleiss’ κ coefficient is then given, as before, by κ = (P_o − P_e)/(1 − P_e).
As an example of application, consider again Table 1. Recall that, in our notation, the parameters of this example are m = 15 items (primary studies in our case), n = 2 coders and k = 2 nominal categories (Y and N, which are categories 1 and 2, respectively).
From these data, we form Table 6 with the computation of the counting values n_{i,β}. In the
second and the third row of this table we indicate, for each item, the number of ratings it received
for each of the possible categories (Y and N). For instance, for item #01, J1 voted it for the category
N, while J2 assigned it to the category Y and, thus, we have n1,1 = n2,1 = 1. On the other hand, for
item #03 both coders assigned it to the category N, so we have n_{2,3} = 2 and n_{1,3} = 0. In the last
column of the table we compute the observed percentages of each category, which give the results
p_1 = (1/(2 · 15)) (1 + 1 + 0 + 0 + 0 + 1 + 2 + 1 + 0 + 2 + 2 + 0 + 1 + 0 + 2) = 13/30 = 0.433,

p_2 = (1/(2 · 15)) (1 + 1 + 2 + 2 + 2 + 1 + 0 + 1 + 2 + 0 + 0 + 2 + 1 + 2 + 0) = 17/30 = 0.567.
Observe that, as expected, p_1 + p_2 = 1. Therefore, P_e = 0.433^2 + 0.567^2 = 0.508.
n_{i,β}   #01  #02  #03  #04  #05  #06  #07  #08  #09
Y           1    1    0    0    0    1    2    1    0
N           1    1    2    2    2    1    0    1    2
p̂_β        0    0    1    1    1    0    1    0    1

n_{i,β}   #10  #11  #12  #13  #14  #15   Total   p_i
Y           2    2    0    1    0    2     13    13/30 = 0.433
N           0    0    2    1    2    0     17    17/30 = 0.567
p̂_β        1    1    1    0    1    1     10

TABLE 6. Count of the proportion of observations for Fleiss’ κ for the example of Table 1
On the other hand, for the observed percentages per item, we have two types of results. First, if for the β-th item the two coders disagreed in their ratings, we have that p̂_β = (1/2)(1 · 0 + 1 · 0) = 0. However, if both coders agreed in their ratings (regardless of whether it was a Y or an N), we have that p̂_β = (1/2)(2 · 1 + 0 · (−1)) = 1. Their average is the observed agreement P_o = (1/15)(5 · 0 + 10 · 1) = 0.667.
Therefore, the value of the Fleiss’ κ coefficient is

κ = (0.667 − 0.508)/(1 − 0.508) = 0.322.
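The whole computation can be condensed into the following Python sketch, which works directly with the counts n_{i,β} of Table 6 (the third decimal differs slightly from the 0.322 above because the text rounds p_i to three decimals first):

    # Fleiss' kappa from the counts of Table 6 (n = 2 coders, k = 2 categories).
    n_Y = [1, 1, 0, 0, 0, 1, 2, 1, 0, 2, 2, 0, 1, 0, 2]
    n_N = [2 - y for y in n_Y]          # every item received exactly n = 2 ratings
    n, m = 2, len(n_Y)

    p_Y = sum(n_Y) / (n * m)            # 13/30
    p_N = sum(n_N) / (n * m)            # 17/30
    Pe = p_Y ** 2 + p_N ** 2

    # Per-item rate of agreeing coder-coder pairs.
    p_hat = [(y * (y - 1) + x * (x - 1)) / (n * (n - 1)) for y, x in zip(n_Y, n_N)]
    Po = sum(p_hat) / m

    kappa = (Po - Pe) / (1 - Pe)
    print(round(Po, 3), round(Pe, 3), round(kappa, 3))   # 0.667 0.509 0.321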
2.5. Krippendorff’s α. The last coefficient that we will consider for measuring ICA is Krippen-
dorff’s α coefficient. Sections 3 and 4 are entirely devoted to the mathematical formulation of
Krippendorff’s α and its variants for content analysis. However, we believe that, for the conve-
nience of the reader, it is worth introducing this coefficient here through a working example. The
version that we will discuss here corresponds to the universal α coefficient introduced in [21] (see
also Section 3), that only deals with simple codings as the examples above. This is sometimes
called binary α in the literature, but we reserve this name for a more involved version (see Section
4.2).
Again, we consider the data of Table 1, which corresponds to the simplest reliability data gener-
ated by two observers who assign one of two available values to each of a common set of units of
analysis (two observers, binary data). In this context, this table is called the reliability data matrix.
From this table, we construct the so-called matrix of observed coincidences, as shown in Table 7.
This is a square matrix of order the number of possible categories (hence, a 2 × 2 matrix in our
case since we only deal with the categories Y and N).
The way in which this table is built is the following. First, you need to count in Table 1 the
number of pairs (Y, Y). In this case, 4 items received two Y from the coders (#07, #10, #11 and
#15). However, the observed coincidences matrix counts ordered pairs of judgements, and in the
previous count J1 always shows up first and J2 appears second. Hence, we need to multiply this
result by 2, obtaining a total count of 8 that is written down in the (Y, Y) entry of the observed
RELIABILITY IN SOFTWARE ENGINEERING QUALITATIVE RESEARCH THROUGH ICA 9
coincidences matrix, denoted o1,1 . In the same spirit, the entry o2,2 of the matrix corresponds to
the 2 · 6 = 12 ordered pairs of responses (N, N). In addition, the anti-diagonal entries of the matrix,
o1,2 and o2,1 , correspond to responses of the form (Y, N) and (N, Y). There are 5 items in which
we got a disagreement (#01, #02, #06, #08 and #14), so, as ordered pairs of responses, there are 5
pairs of responses (Y, N) and 5 pairs of responses (N, Y), which are written down in the observed
coincidences matrix. Finally, the marginal data, t1 and t2 , are the sums of the values of the rows
and the columns and t = t1 + t2 is twice the number of items. Observe that, by construction, the
observed coincidences matrix is symmetric.
Matrix of observed coincidences (Table 7):

      Y             N
Y   o_{1,1} = 8   o_{1,2} = 5    t_1 = 13
N   o_{2,1} = 5   o_{2,2} = 12   t_2 = 17
    t_1 = 13      t_2 = 17       t = 30

The corresponding matrix of expected coincidences is:

      Y                N
Y   e_{1,1} = 5.38   e_{1,2} = 7.62   t_1 = 13
N   e_{2,1} = 7.62   e_{2,2} = 9.38   t_2 = 17
    t_1 = 13         t_2 = 17         t = 30

The expected coincidences estimate how often each pair of categories would co-occur by chance, given the marginal totals t_1 and t_2 (the general construction is detailed in Section 3). From these two matrices, the observed and expected disagreements are obtained as the sums of the entries corresponding to non-matching pairs, D_o = 5 + 5 = 10 and D_e = 7.62 + 7.62 = 15.24, and Krippendorff’s α is computed as α = 1 − D_o/D_e ≈ 0.34, the value quoted in Section 2.1.
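The following Python sketch reproduces this example numerically; it anticipates the formula for the expected coincidences and the definition α = 1 − D_o/D_e given in Section 3, and uses the discrete metric, so that only the off-diagonal entries contribute to the disagreements:

    # Krippendorff's alpha for the binary example of Table 1.
    o = {("Y", "Y"): 8, ("Y", "N"): 5, ("N", "Y"): 5, ("N", "N"): 12}
    t_Y, t_N = 13, 17
    t = t_Y + t_N

    # Expected coincidences from the marginals (see Section 3).
    e = {
        ("Y", "Y"): t_Y * (t_Y - 1) / (t - 1),   # 5.38
        ("Y", "N"): t_Y * t_N / (t - 1),         # 7.62
        ("N", "Y"): t_N * t_Y / (t - 1),         # 7.62
        ("N", "N"): t_N * (t_N - 1) / (t - 1),   # 9.38
    }

    Do = sum(v for (i, j), v in o.items() if i != j)   # 10
    De = sum(v for (i, j), v in e.items() if i != j)   # 15.24
    print(round(1 - Do / De, 3))                       # 0.344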
3. THE UNIVERSAL α COEFFICIENT

Krippendorff’s α coefficient is one of the most widely used coefficients for measuring Inter-Coder Agreement in content analysis. As we mentioned in Section 2.5, one of the reasons is that this coefficient solves many of the flaws that Cohen’s κ and Fleiss’ κ suffer from. For a more detailed exposition comparing these coefficients see [19], and for a historical description of this coefficient check [23].
In this section, we explain the probabilistic framework that underlies Krippendorff’s α coeffi-
cient. For this purpose, we introduce a novel interpretation that unifies the different variants of the
α coefficient presented in the literature (see for instance [21, 22, 19, 42, 18, 9, 24, 23]). These
coefficients are usually presented as unrelated and through a kind of ad hoc formulation for each
problem. This makes the use of Krippendorff’s α confusing and unmotivated for the unfamiliar researcher. For this reason, we consider it worthwhile to provide a common framework in which precise interpretations and comparisons can be conducted. Subsequently, in Section 4 we will provide descriptions of these variants in terms of this universal α coefficient. The present formulation
is an extension of the work of the authors in [10] towards a uniform formulation.
Suppose that we are dealing with n ≥ 2 different judges, also referred to as coders, denoted by
J1 , . . . , Jn ; as well as with a collection of m ≥ 1 items to be judged, also referred to as quotations,
denoted I1 , . . . , Im . We fix a set of k ≥ 1 admissible ‘meta-codes’, called labels, say Λ = {l1 , . . . , lk }.
The task of each of the coders Jα is to assign, to each item Iβ , a collection (maybe empty)
of labels from Λ. Hence, as a byproduct of the evaluation process, we get a set Ω = {ω_{α,β}}, for 1 ≤ α ≤ n and 1 ≤ β ≤ m, where ω_{α,β} ⊆ Λ is the set of labels that the coder J_α assigned to the item I_β. Recall that ω_{α,β} is not a multiset, so every label appears in ω_{α,β} at most once. Moreover, notice that multi-evaluations are now allowed, that is, a coder may associate more than one label with an item. This translates into the fact that ω_{α,β} may be empty (meaning that J_α did not assign any label to I_β), it may have a single element (meaning that J_α assigned only one label) or it may have more than one element (meaning that J_α chose several labels for I_β).
From the collection of responses Ω, we can count the number of observed pairs of responses.
For that, fix 1 ≤ i, j ≤ k and set

o_{i,j} = ∑_{β=1}^{m} # { (α, α′) : α ≠ α′, l_i ∈ ω_{α,β} and l_j ∈ ω_{α′,β} }.

In other words, o_{i,j} counts the number of (ordered) pairs of responses of the form (ω_{α,β}, ω_{α′,β}) ∈ Ω × Ω that two different coders J_α and J_α′ gave to the same item I_β and such that J_α included l_i in their response and J_α′ included l_j in their response. In the notation of Section 2.3, in the case that n = 2 (two coders) we have that o_{i,j} = c_{i,j} + c_{j,i}.
Remark 3.1. Suppose that there exists an item I_β that was judged by a single coder, say J_α. The other coders, J_α′ for α′ ≠ α, did not vote on it, so ω_{α′,β} = ∅. Then, this item I_β makes no contribution to the calculation of o_{i,j}, since there is no other judgement to which ω_{α,β} can be paired. Hence, from the point of view of Krippendorff’s α, I_β is not taken into account. This causes some strange behaviours in the coefficients of Section 4 that may seem counterintuitive.
From these counts, we construct the matrix of observed coincidences as M_o = (o_{i,j})_{i,j=1}^{k}. By its very construction, M_o is a symmetric matrix. From this matrix, we set t_i = ∑_{j=1}^{k} o_{i,j}, which is the total number of times that the label l_i ∈ Λ was assigned by any coder. Observe that t = ∑_{i=1}^{k} t_i is the total number of judgments. In the case that each coder evaluates each item with a single non-empty label, we have t = nm.
On the other hand, we can construct the matrix of expected coincidences, M_e = (e_{i,j})_{i,j=1}^{k}, where

e_{i,j} = t (t_i/t)(t_j/(t−1)) = t_i t_j/(t−1)             if i ≠ j,
e_{i,j} = t (t_i/t)((t_i−1)/(t−1)) = t_i (t_i−1)/(t−1)     if i = j.
The value of e_{i,j} might be thought of as the average number of times that we expect to find a pair (l_i, l_j) when the frequency of the label l_i is estimated from the sample as t_i/t. It is analogous to the value of the proportion p̂_β in Section 2.4. Again, M_e is a symmetric matrix.
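For instance, in the binary example of Section 2.5 we have t_1 = 13, t_2 = 17 and t = 30, so that

e_{1,1} = 13 · 12 / 29 = 5.38,    e_{1,2} = e_{2,1} = 13 · 17 / 29 = 7.62,    e_{2,2} = 17 · 16 / 29 = 9.38,

which recovers the matrix of expected coincidences computed there.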
Finally, let us fix a pseudo-metric δ : Λ × Λ → [0, ∞) ⊆ R, i.e. a symmetric function satisfying
the triangle inequality and with δ (li , li ) = 0 for any li ∈ Λ (recall that this is only a pseudo-metric
since different labels at distance zero are allowed). This metric is given by the semantics of the analyzed problem and, thus, it is part of the data used for quantifying the agreement. The value δ(l_i, l_j) should be seen as a measure of how similar the labels l_i and l_j are. The most common choice is the discrete metric, δ(l_i, l_j) = 1 if l_i ≠ l_j and δ(l_i, l_i) = 0, which treats all distinct labels as equally dissimilar (this is the metric used in the examples of this paper). For subtler metrics that may be used for extracting more semantic information from the data, see
[24].
From these computations, we define the observed disagreement, D_o, and the expected disagreement, D_e, as

(1)    D_o = ∑_{i=1}^{k} ∑_{j=1}^{k} o_{i,j} δ(l_i, l_j),        D_e = ∑_{i=1}^{k} ∑_{j=1}^{k} e_{i,j} δ(l_i, l_j).
These quantities measure the degree of disagreement that is observed from Ω and the degree of disagreement that might be expected by judging randomly (i.e. by chance), respectively. From them, the universal Krippendorff’s α coefficient of the evaluation Ω is defined as

α(Ω) = 1 − D_o/D_e,

so that α = 1 corresponds to perfect agreement and α = 0 to the agreement expected purely by chance.
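The following Python sketch (function and variable names are our own, not part of any library) implements this universal α directly from a set of evaluations Ω and a pseudo-metric δ; applied to the binary example of Section 2.5 it recovers α ≈ 0.344:

    # A minimal sketch of the universal alpha: omega maps each (coder, item)
    # pair to the (possibly empty) set of labels assigned to it, and delta is
    # the pseudo-metric on labels.
    from itertools import permutations
    from collections import defaultdict

    def universal_alpha(omega, coders, items, labels, delta):
        # Matrix of observed coincidences o_{i,j}.
        o = defaultdict(float)
        for beta in items:
            for a, b in permutations(coders, 2):      # ordered pairs of coders
                for li in omega.get((a, beta), set()):
                    for lj in omega.get((b, beta), set()):
                        o[(li, lj)] += 1
        # Marginals t_i and matrix of expected coincidences e_{i,j}.
        t = {li: sum(o[(li, lj)] for lj in labels) for li in labels}
        total = sum(t.values())
        e = {(li, lj): t[li] * (t[lj] - (li == lj)) / (total - 1)
             for li in labels for lj in labels}
        # Observed and expected disagreements, and alpha itself.
        Do = sum(o[(li, lj)] * delta(li, lj) for li in labels for lj in labels)
        De = sum(e[(li, lj)] * delta(li, lj) for li in labels for lj in labels)
        return 1 - Do / De

    # Binary example of Section 2.5 (two coders, labels Y/N, discrete metric).
    J1, J2 = "NNNNNNYNNYYNNNY", "YYNNNYYYNYYNNYY"
    omega = {("J1", b): {J1[b]} for b in range(15)}
    omega.update({("J2", b): {J2[b]} for b in range(15)})
    discrete = lambda x, y: 0.0 if x == y else 1.0
    print(round(universal_alpha(omega, ["J1", "J2"], range(15), {"Y", "N"}, discrete), 3))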
Remark 3.2. In the case of taking δ as the discrete metric, we have another interpretation of the disagreement. Observe that, in this case, since δ(l_i, l_i) = 0 we can write the disagreements as

D_o = ∑_{i ≠ j} o_{i,j} = t − ∑_{i=1}^{k} o_{i,i},        D_e = ∑_{i ≠ j} e_{i,j} = t − ∑_{i=1}^{k} e_{i,i}.

The quantity P_o = ∑_{i=1}^{k} o_{i,i} (resp. P_e = ∑_{i=1}^{k} e_{i,i}) can be understood as the observed (resp. expected) agreement between the coders. In the same vein, t = ∑_{i,j=1}^{k} o_{i,j} = ∑_{i,j=1}^{k} e_{i,j} may be seen as the maximum achievable agreement. Hence, in this context, the disagreement D_o (resp. D_e) is actually the difference between the maximum possible agreement and the observed (resp. expected) agreement.
4. VARIANTS OF KRIPPENDORFF’S α FOR QUALITATIVE CODING

The setting described in Sections 2 and 3 only addresses the problem of evaluating the agreement when the judges use labels to code data. This might be too restrictive for the purposes of qualitative data analysis, where the codes typically form a two-layer structure: semantic domains that encompass broad interrelated concepts, and codes within the domains that capture subtler details.
This is, for instance, the setting considered by qualitative analysis tools like the Atlas.ti software
[1].
In this section, we describe a more general framework that enables evaluating reliability in this
two-layer structure. However, this more involved setting also leads to more aspects of reliability
that must be measured. Evaluating reliability in the choice of the semantic domain to be applied is not the same as evaluating it in the chosen codes within a particular domain, or in the agreement achieved when distinguishing between relevant and irrelevant matter. To assess the reliability of these aspects, several variants of Krippendorff’s α have been proposed in the literature (up to 10 are mentioned in [23, Section 12.2.3]). In this vein, we explain some of these variants and how they can be reduced to the universal α coefficient of Section 3 after an algorithmic translation by re-labeling codes. As in Section 3, this framework is an extension of the authors’ work [10].
To be precise, in coding we usually need to consider a two-layer setting as follows. First, we have a collection of s > 1 semantic domains, S_1, ..., S_s. A semantic domain defines a space of distinct concepts that share a common meaning (say, S_i might be colors, brands, feelings...). Subsequently, each semantic domain embraces mutually exclusive concepts, each indicated by a code. Hence, for 1 ≤ i ≤ s, the domain S_i decomposes into r_i ≥ 1 codes, which we denote by C^i_1, ..., C^i_{r_i}. For design consistency, these semantic domains must be logically and conceptually independent. This principle translates into the fact that there are no shared codes between different semantic domains, and two codes within the same semantic domain cannot be applied at the same time by a coder.
Now, the data under analysis (e.g. scientific literature, newspapers, videos, interviews) is chopped into items, which in this context are known as quotations, that represent meaningful parts of the data on their own. The decomposition may be decided by each of the coders (so different coders may have different quotations) or it may be pre-established (for instance, by the codebook creator or the designer of the ICA study). In the latter case, all the coders share the same quotations, so they cannot modify their limits and they must evaluate each quotation as a block. To lighten the notation, we will suppose that we are dealing with this case of pre-established quotations. Indeed, from a mathematical point of view, the former case can be reduced to this version by refining the data subdivision of each coder to get a common decomposition into the same pieces.
Therefore, we will suppose that the data is previously decomposed into m ≥ 1 items or quotations, I_1, ..., I_m. Observe that the union of all the quotations must be the whole matter so, in particular, irrelevant matter is also included as quotations. Now, each of the coders J_α, 1 ≤ α ≤ n, evaluates the quotations I_β, 1 ≤ β ≤ m, assigning to I_β any number of semantic domains and, for each chosen semantic domain, one and only one code. No semantic domain is assigned in the case that the coder considers that I_β is irrelevant matter, and several domains can be applied to I_β by the same coder.
Hence, as a byproduct of the evaluation process, we obtain a collection of sets Σ = {σ_{α,β}}, for 1 ≤ α ≤ n and 1 ≤ β ≤ m. Here, σ_{α,β} = {C^{i_1}_{j_1}, ..., C^{i_p}_{j_p}} is the collection of codes that the coder J_α assigned to the quotation I_β. The exclusion principle of the codes within a semantic domain means that the collection of chosen semantic domains i_1, ..., i_p contains no repetitions.
Remark 4.1. To be precise, as proposed in [22], when dealing with a continuum of matter each of
the quotations must be weighted by its length in the observed and expected coincidences matrices.
This length is defined as the number of atomic units the quotation contains (say, characters in a text or seconds in a video). In this way, (dis)agreements in long quotations are more significant than (dis)agreements in short quotations. This can be easily incorporated into our setting just by refining
the data decomposition to the level of units. In this way, we create new quotations having the length
of an atomic unit. Each new atomic quotation is judged with the same evaluations as the old bigger
quotation. In the coefficients introduced below, this idea has the mathematical effect that, in the
sums of Equation (1), each old quotation appears as many times as atomic units it contains, which
is the length of such quotation. Therefore, in this manner, the version explained here computes the
same coefficient as in [22].
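The refinement described in Remark 4.1 can be sketched in a few lines of Python (the data and names below are purely hypothetical): each quotation of length L is replaced by L atomic quotations inheriting its evaluations, so that longer quotations weigh more in the coincidence matrices:

    # Refine quotations into unit-length atomic quotations (Remark 4.1).
    def refine_by_length(omega, items, lengths, coders):
        # omega maps (coder, item) to a set of labels; lengths maps item -> length.
        refined, new_items = {}, []
        for beta in items:
            for u in range(lengths[beta]):
                atom = (beta, u)                  # one atomic unit of quotation beta
                new_items.append(atom)
                for c in coders:
                    refined[(c, atom)] = omega.get((c, beta), set())
        return refined, new_items

    # Hypothetical example: quotation 0 spans 3 units, quotation 1 spans 1 unit.
    omega = {("J1", 0): {"S1"}, ("J2", 0): {"S1"}, ("J1", 1): {"S2"}, ("J2", 1): set()}
    refined, atoms = refine_by_length(omega, [0, 1], {0: 3, 1: 1}, ["J1", "J2"])
    print(len(atoms))   # 4 atomic quotations: the agreement on quotation 0 now counts 3 times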
In order to quantify the degree of agreement achieved by the coders in the evaluations Σ, several
variants of Krippendorff’s α are proposed in the literature [24, 23]. Some of the most useful for
case studies, and the ones implemented in Atlas.ti, are the following variants.
• The coefficient α^gl_binary: This is a global measure. It quantifies the agreement of the coders
when identifying relevant matter (quotations that deserve to be coded) and irrelevant matter
(part of the corpus that is not coded).
• The coefficient αbinary : This coefficient is computed on a specific semantic domain Si . It
is a measure of the degree of agreement that the coders achieve when choosing to apply a
semantic domain Si or not.
• The coefficient cu-α: This coefficient is computed on a semantic domain Si . It indicates
the degree of agreement to which coders identify codes within Si .
• The coefficient Cu-α: This is a global measure of the goodness of the partition into se-
mantic domains. Cu-α measures the degree of reliability in the decision of applying the
different semantic domains, independently of the chosen code.
Before diving into the detailed formulation, let us work out an illustrative example. Figure 1
shows an example of the use of these coefficients. Let us consider three semantic domains, S1, S2 and S3, each of them comprising two codes, C^i_1 and C^i_2 for the domain S_i.
The two coders, J1 and J2 , assign codes to four quotations as shown in Figure 1(a). We created
a graphical metaphor so that each coder, each semantic domain, and each code are represented as
shown in Figure 1(b). Each coder is represented by a shape, so that J1 is represented by triangles
and J2 by circles. Each domain is represented by a colour, so that S1 is red, S2 is blue and S3
is green. Each code within the same semantic domain is represented by a fill, so that C^i_1 codes are represented by a solid fill and C^i_2 codes are represented by a dashed fill.
The coefficient αbinary is calculated per domain (i.e. S1 red, S2 blue, S3 green) and analyzes whether or not the coders assigned a domain, independently of the code, to the quotations (see
Figure 1(c)). Notice that we only focus on the presence or absence of a semantic domain by
quotation, so Figure 1(c) only takes into account the color. Now, the αbinary coefficient measures
the agreement that the coders achieved in assigning the same color to the same quotation. The
bigger the coefficient, the better the agreement. In this way, we get total agreement (αbinary = 1) for S2, as both coders assigned this domain (blue) to the second quotation and agreed on its absence in the rest of the quotations. On the other hand, αbinary < 1 for S1, as J1 assigned this domain
(red) to quotations 1 and 3 while J2 assigned it to quotations 1, 2 and 3, leading to a disagreement
in quotation 2.
The coefficient cu-α is also calculated per domain (i.e. S1 red, S2 blue, S3 green), but it measures
the agreement attained when applying the codes of that domain. In other words, given a domain Si ,
this coefficient analyzes whether the coders assigned the same codes of Si (i.e. the same fills) to the
quotations or not. In this way, as shown in Figure 1(d), it only focuses on the applied fills to each
quotation. In particular, observe that cu-α = 1 for S2 since both coders assigned the same code to
the second quotation and no code from this domain to the rest of quotations, i.e. total agreement.
Also notice that cu-α < 1 for S3, as the coders assigned the same code of S3 to the third quotation but they did not assign the same codes of S3 to the rest of the quotations. Finally, observe that cu-α for S1 is very small (near zero) since the coders achieve no agreement on the chosen codes.
With respect to the global coefficients, the coefficient Cu-α analyzes all the domains as a whole, but it does not take into account the codes within each domain. In this way, in Figure 1(e), we colour each segment with the colors corresponding to the applied semantic domains (regardless of the particular code used). From this chromatic representation, Cu-α measures the agreement in applying these colours globally between the coders. In particular, notice that Cu-α < 1, as both coders assigned the same domain S1 to the first quotation and the domains S1 and S3 to the third quotation, but they did not assign the same domains in the second and fourth quotations.
Finally, the α^gl_binary coefficient measures the agreement in the selection of relevant matter, as shown in Figure 1(f). In this case, both coders recognized the first three segments as relevant (they were coded), as highlighted in gray in the figure. However, J2 considered that the fourth quotation was irrelevant (it was not coded), as marked in white, while J1 marked it as relevant, in gray. In this way, we have that α^gl_binary < 1.
4.1. The coefficient α^gl_binary. The first variation of Krippendorff’s α coefficient that we consider is the α^gl_binary coefficient. It is a global measure that summarizes the agreement of the coders when recognizing relevant parts of the matter. For computing it, we consider a set with only two labels, which semantically represent ‘recognized as relevant’ (1) and ‘not recognized as relevant’ (0). Hence, we take

Λ = {1, 0}.
Now, using the whole set of evaluations Σ, we create a new labelling Ω^gl_bin = {ω_{α,β}} as follows. Let 1 ≤ α ≤ n and 1 ≤ β ≤ m. We set ω_{α,β} = {1} if the coder J_α assigned some code to the quotation I_β (i.e. if σ_{α,β} ≠ ∅), and ω_{α,β} = {0} otherwise (i.e. if J_α did not code I_β, that is, σ_{α,β} = ∅). From this set of evaluations, Ω^gl_bin = {ω_{α,β}}, the coefficient α^gl_binary is given as

α^gl_binary = α(Ω^gl_bin).
Therefore, α^gl_binary measures the degree of agreement that the coders achieved when recognizing relevant parts, that is, coded parts, versus irrelevant matter. A high value of α^gl_binary may be interpreted as an indication that the matter is well structured and that it is relatively easy to detect and isolate the relevant pieces of information.
Remark 4.2. In many studies (for instance, in case studies in software engineering), it is customary that a researcher pre-processes the raw data to be analyzed, say by transcribing it or by loading it into an ICA software tool like Atlas.ti. In that case, this pre-processor usually selects the parts that must be analyzed and chops the matter into quotations before the judgement process starts. In this way, the coders are required to code these pre-selected parts, so they no longer chop the matter by themselves and they code all the quotations. Hence, we always get that α^gl_binary = 1, since the evaluation protocol forces the coders to consider as relevant matter the parts selected by the pre-processor. Therefore, in these scenarios, the α^gl_binary coefficient is not useful for providing reliability on the evaluations, and other coefficients of the α family are required.
4.2. The coefficient αbinary. The second variation of Krippendorff’s α coefficient is the so-called αbinary coefficient. This is a coefficient that must be computed on a specific semantic domain. Hence, let us fix a semantic domain S_i for some fixed i with 1 ≤ i ≤ s. As above, the set of labels has only two elements, which semantically represent ‘voted S_i’ (1) and ‘did not vote S_i’ (0). Hence, we take

Λ = {1, 0}.
For the assignment of labels to items, the rule is as follows. For 1 ≤ α ≤ n and 1 ≤ β ≤ m, we set ω_{α,β} = {1} if the coder J_α assigned some code of S_i to the quotation I_β (i.e. if C^i_j ∈ σ_{α,β} for some 1 ≤ j ≤ r_i) and ω_{α,β} = {0} otherwise. Observe that, in particular, ω_{α,β} = {0} if J_α considered that I_β was irrelevant matter. From this set of evaluations, Ω^{S_i}_binary = {ω_{α,β}}, the coefficient α^{S_i}_binary is given as

α^{S_i}_binary = α(Ω^{S_i}_binary).
In this way, the coefficient α^{S_i}_binary can be seen as a measure of the degree of agreement that the coders achieved when choosing whether to apply the semantic domain S_i or not. A high value of α^{S_i}_binary is interpreted as evidence that the domain S_i is clearly stated, its boundaries are well-defined and, thus, the decision of applying it or not is close to deterministic. However, observe that it does not measure the degree of agreement in the application of the different codes within the domain S_i. Hence, it may occur that the boundaries of the domain S_i are clearly defined but the inner codes are not well chosen. This is not the task of the α^{S_i}_binary coefficient, but of the cu-α coefficient on S_i explained below.
Remark 4.3. By the definition of αbinary , in line with the implementation in Atlas.ti [1], the ir-
relevant matter plays a role in the computation. As we mentioned above, all the matter that was
evaluated as irrelevant (i.e. was not coded) is labelled with {0}. In particular, a large corpus with
only a few sparse short quotations may distort the value of αbinary .
4.3. The coefficient cu-α. Another variation of Krippendorff’s α coefficient is the so-called cu-α coefficient. Like the previous variation, this coefficient is computed per semantic domain, say S_i for some 1 ≤ i ≤ s. Suppose that this semantic domain contains the codes C^i_1, ..., C^i_{r_i}. The collection of labels is now the set

Λ = {C_1, ..., C_{r_i}}.

Semantically, they are labels that represent the codes of the chosen domain S_i.
For the assignment of labels to items, the rule is as follows. For 1 ≤ α ≤ n and 1 ≤ β ≤ m, we set ω_{α,β} = {C_j} if the coder J_α assigned the code C^i_j of S_i to the item (quotation) I_β. Recall that, from the exclusion principle for codes within a semantic domain, the coder J_α applied at most one code from S_i to I_β. If the coder J_α did not apply any code of S_i to I_β, we set ω_{α,β} = ∅. From this set of judgements, Ω^{S_i}_cu = {ω_{α,β}}, the coefficient cu-α on S_i is given as

cu-α^{S_i} = α(Ω^{S_i}_cu).
Remark 4.4. As explained in Remark 3.1, for the computation of the observed and expected coincidence matrices, only items that received at least two evaluations with codes of S_i from two different coders count. In particular, if a quotation was not evaluated by any coder (irrelevant matter), received evaluations for other domains but not for S_i (matter that does not correspond to the chosen domain), or only one coder assigned to it a code from S_i (single-voted), then the quotation plays no role in cu-α. This limitation might seem a bit cumbersome, but it can be explained by arguing that the presence/absence of S_i is already measured by α^{S_i}_binary, so it would be redundant to take it into account for cu-α too.
4.4. The coefficient Cu-α. The last variation of Krippendorff’s α coefficient that we consider in
this study is the so-called Cu-α coefficient. In contrast with the previous coefficients, this is a
global measure of the goodness of the partition into semantic domains. Suppose that our codebook
determines semantic domains S1 , . . . , Ss . In this case, the collection of labels is the set
Λ = {S1 , . . . , Ss } .
Semantically, they are labels representing the semantic domains of our codebook.
We assign labels to items as follows. Let 1 ≤ α ≤ n and 1 ≤ β ≤ m. Then, if σ_{α,β} = {C^{i_1}_{j_1}, ..., C^{i_p}_{j_p}}, we set ω_{α,β} = {S_{i_1}, ..., S_{i_p}}. In other words, we label I_β with the labels corresponding to the semantic domains chosen by coder J_α for this item, independently of the particular code. Observe that this is the first case in which the final evaluation Ω might be multi-valued. From this set of judgements, Ω_Cu = {ω_{α,β}}, the coefficient Cu-α is given as

Cu-α = α(Ω_Cu).
In this way, Cu-α measures the degree of reliability in the decision of applying the different semantic domains, independently of the particular chosen code. Therefore, it is a global measure that quantifies the logical independence of the semantic domains and the ability of the coders to look at the big picture of the matter, only from the point of view of semantic domains.
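The re-labellings of Sections 4.1–4.4 are straightforward to implement. The following Python sketch (with purely hypothetical data; Sigma maps each (coder, quotation) pair to a set of (domain, code) tuples) builds the four label sets Ω, which can then be fed to any implementation of the universal α of Section 3, such as the sketch given there:

    # Re-labelling maps for the four variants of Krippendorff's alpha.
    def omega_binary_global(sigma):
        # alpha^gl_binary: 1 if the coder applied some code, 0 otherwise.
        return {key: {1 if codes else 0} for key, codes in sigma.items()}

    def omega_binary_domain(sigma, domain):
        # alpha^{S_i}_binary: 1 if some code of the given domain was applied.
        return {key: {1 if any(d == domain for d, _ in codes) else 0}
                for key, codes in sigma.items()}

    def omega_cu(sigma, domain):
        # cu-alpha on S_i: the (unique) code of the domain, if applied.
        return {key: {c for d, c in codes if d == domain}
                for key, codes in sigma.items()}

    def omega_Cu(sigma):
        # Cu-alpha: the set of applied domains, regardless of the chosen codes.
        return {key: {d for d, _ in codes} for key, codes in sigma.items()}

    # Two coders and two quotations (hypothetical evaluations).
    sigma = {
        ("J1", 0): {("S1", "C11")}, ("J2", 0): {("S1", "C12"), ("S2", "C21")},
        ("J1", 1): set(),           ("J2", 1): {("S2", "C21")},
    }
    print(omega_binary_global(sigma))   # J1 left quotation 1 uncoded -> label 0
    print(omega_cu(sigma, "S1"))        # the coders disagree on the code of S1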
5. A TUTORIAL ON ICA ANALYSIS WITH ATLAS.TI
In this section, we describe how to use the ICA utilities provided by Atlas.ti v8.4² [1] (from now on shortened as Atlas), as well as a guide for interpreting the obtained results. We will assume that the reader is familiar with the general operation of Atlas and focus on the computation and evaluation of the different ICA coefficients calculated by Atlas.
This section is structured as follows. First, in Section 5.1 we describe the different operation
methods provided by Atlas for the computation of ICA. In Section 5.2, we describe briefly the pro-
tocol for analyzing case study research involving qualitative data analysis in software engineering
and we introduce a running example on the topic that will serve as a guide throughout the tutorial.
Finally, in Section 5.3, we discuss the calculation, interpretation and validity conclusions that can
be drawn from the ICA coefficients provided by Atlas.
5.1. Coefficients in Atlas for the computation of ICA. Atlas provides three different methods
for computing the agreement between coders, namely simple percent agreement, Holsti Index (both
can be checked in Section 2.1), and Krippendorff’s α coefficients (see Sections 2.5 and 4).
5.1.1. Simple percent agreement and Holsti Index. As we pointed out in Section 2.1, it is well reported in the literature that simple percent agreement is not a valid ICA measure, since it does not take into account the agreement that the coders can attain by chance.
On the other hand, the Holsti Index [20], as mentioned in Section 2.1, is a variation of the percent agreement that can be applied when there are no pre-defined quotations and each coder selects the matter that they consider relevant. Nevertheless, as in the case of the percent agreement, it ignores the agreement by chance, so it is not suitable for a rigorous analysis.
In any case, Atlas provides these measures, which allow us to glance at the results and to get an idea of the distribution of the codes. However, they should not be used for drawing any conclusions about the validity of the coding. For this reason, in this tutorial we focus on the application and interpretation of Krippendorff’s α coefficients.
5.1.2. Krippendorff’s α in Atlas. Atlas also provides an integrated method for computing Krippendorff’s α coefficients. However, it may be difficult at first sight to identify the reported results, since the notation is not fully consistent between Atlas and some reports in the literature. The development of the different versions of the α coefficient has taken around fifty years and, during this time, the notation for its several variants has changed. As a result, a variety of notations co-exist in the literature, which may confuse the reader unfamiliar with this ICA measure.
In order to clarify these relations, in this paper we always use the notation introduced in Section 4. These notations are based on the ones provided by Atlas, but some slight differences exist. For the convenience of the reader, Table 9 includes a comparison between the original Krippendorff notation, the Atlas notation for the α coefficients and the notation used in this paper.
2 During the final phase of this work, Atlas.ti v9 has been released. The screenshots and commands mentioned in this paper correspond to v8.4. However, regarding the calculation of the ICA coefficients, the new version does not provide new features. In this way, the tutorial shown in this section is also valid for v9, with only minimal changes in the interface.
Remark 5.1. Empirically, we have discovered that the semantics that Atlas applies for computing the coefficients α^gl_binary/αbinary and cu-α/Cu-α are the ones explained in this paper, as described in Section 4.
5.2. Case study: instilling DevOps culture in software companies. This tutorial uses as its guiding example an excerpt of research conducted by the authors in the domain of DevOps [10]. The considered example is an exploratory study to characterize the reasons why companies move to DevOps and what results they expect to obtain when adopting the DevOps culture [27]. This exploratory case study is based on interviews with software practitioners from 30 multinational software-intensive companies. The study has been conducted according to the guidelines for performing qualitative research in software engineering proposed by Wohlin et al. [41].
In Figure 2 we show, through a UML activity diagram [33], the different stages that comprise
the above-mentioned study. For the sake of completeness, in the following exposition each step of
the analysis is accompanied with a brief explanation of the underlying qualitative research method-
ology that the authors carried out. For a complete description of the methodology in qualitative
research and thematic analysis, please check the aforementioned references.
5.2.1. Set research objectives. The first step in conducting an exploratory study is to define the aim of the prospective work, the so-called research question (RQ). These objectives must be clearly stated and the boundaries of each research question should be unambiguously demarcated.
In the case study of the running example presented in this paper, we propose one research ques-
tion related to the implications of instilling a DevOps culture in a company, which is the main
concern of the analysis.
RQ: What problems do companies try to solve by implementing DevOps?
5.2.2. Collect data. The next step in the research is to collect the empirical evidence needed for understanding the phenomenon under study. Data is the only window that the researchers have onto the object of research, so high quality data typically leads to good research. As a rule of thumb, the better the data, the more precise the conclusions that can be drawn.
There are two main methods for collecting information in qualitative analysis, and both are particularly useful in software engineering: questionnaires and interviews [41]. Usually, questionnaires are easier to issue, since they can be provided by email or through web pages that can be accessed whenever it is convenient for the respondent. On the other hand, interviews tend to gather a more complete picture of the phenomenon under study, since there exists an active interaction between interviewer and interviewee, typically face to face. In this way, interviews are usually better suited for case studies, since they allow the researcher to modify the questions to be asked on the fly, in order to emphasize the key points under analysis. As a drawback, typically the
FIGURE 2. Phases for conducting case study research involving qualitative data analysis in software engineering
number of answers that can be obtained through a questionnaire is much larger than the number of interviews that can be conducted, but the latter usually leads to higher quality data.
In the study considered in this paper, the data collection method was semi-structured interviews with software practitioners from 30 companies. The interviews were conducted face-to-face, in Spanish, and the audio was recorded with the permission of the participants, transcribed for the purpose of data analysis, and reviewed by the respondents. In the transcripts, the companies were anonymized by assigning them an individual identification number from ID01 to ID30. The full script of the interview is available at the project’s webpage3.
5.2.3. Analyze data. This is the most important phase of the study. In this step, the researchers turn the raw data into structured and logically interconnected conclusions. On the other hand, due to its creative component, it is the least straightforward phase of the cycle.
To help researchers analyze the data and draw conclusions, there exist several methods for qualitative data analysis that can be followed. In the DevOps exploratory study considered here, the authors followed a thematic analysis approach [8, 40]. Thematic analysis is a method for identifying, analyzing, and reporting patterns within the data. For that purpose, the data is chopped into small pieces of information, the quotations or segments, which are minimal units of data. Then, some individuals (typically some of the researchers) act as coders, coding the segments to highlight the relevant information and to assign it a condensed description, the code. In the literature, codes are defined as “descriptive labels that are applied to segments of text from each study” [8]. In order to ease the task of the coders, the codes can be grouped into bigger categories that share some higher-level characteristics, forming the semantic domains (also known as themes in this context). This introduces a multi-level coding that usually leads to richer analyses.
A very important point is that the splitting of the matter under study into quotations can be provided by a non-coder individual (typically, the thematic analysis designer), or it can be a task delegated to the coders. In the former case, all the coders work with the same segments, so it is easier to achieve a high level of consensus that leads to high reliability in the results of the analysis. In the latter case, the coders can decide by themselves how to cut the stream of data, so hidden phenomena can be uncovered. However, the cuts may vary from one coder to another, so there is a high risk of getting codings too diverse to be analyzed under a common framework.
Thematic analysis can be instrumented through Atlas [1, 15], which provides an integrated framework for defining the quotations, codes and semantic domains, as well as for gathering the codings and computing the attained ICA.
In the study considered in this section, the data analysis method followed the four phases described below (see also Figure 2).
(1) Define quotations & codebook. In the study under consideration, the coders used pre-defined quotations. In this way, once the interviews were transcribed, researcher R1 chopped the data into its unit segments, which remained unaltered during the subsequent phases. In parallel, R1 elaborated a codebook by collecting all the available codes and their aggregation into semantic domains. After completing the codebook, R1 also created a guide with detailed instructions about how to use the codebook and how to apply the codes.
3 https://blogs.upm.es/devopsinpractice.
The design of the codebook is accomplished through two different approaches: a deductive approach [28] for creating semantic domains and an inductive approach [4] for creating codes. In the first phase, the deductive approach, R1 created a list of semantic domains in which codes would be grouped inductively during the second phase. These initial domains integrate concepts known in the literature. Each domain is named P01, P02, P03, etc. Domains were written in uppercase letters (see Figure 3).
In the second phase, the inductive approach, R1 approached the data (i.e., the interview
transcriptions) with the research question RQ in mind. R1 reviewed the data line by line
and created the quotations. R1 also assigned each of them a code (new or previously defined)
in order to obtain a comprehensive list of all the needed codes. As more interviews were
analyzed, the resulting codebook was refined by using a constant comparison method that
forced R1 to go back and forth.
Additionally, the codes were complemented with a brief explanation of the concept they
describe. This allowed R1 to guarantee that the collection of created codes satisfies the re-
quirements imposed by thematic analysis, namely exhaustiveness and mutual exclusive-
ness. The exhaustiveness requirement means that the codes of the codebook must cover
all the aspects relevant for the research. Mutual exclusiveness means that there must be
no overlap in the semantics of the codes within a semantic domain. In this way, the
codes of a particular semantic domain must capture disjoint and complementary
aspects, which implies that the codes should have explicit boundaries so that they are not
interchangeable or redundant. This mutual exclusiveness translates into the fact that, dur-
ing the coding phase, a coder cannot apply several codes of the same semantic domain to
the same quotation. In other words, each coder can apply at most one code of each semantic
domain to each quotation.
(2) Code. In this phase, the chosen coders (usually researchers other than the codebook
designer) analyze the prescribed quotations created during phase (1). For that purpose, they
use the codebook as a statement of the available semantic domains and codes, together with
their definitions, scope of application, and boundaries. It is crucial for the process
that the coders apply the codes exactly as described in the codebook. No modifications on
the fly or alternative interpretations are acceptable.
Nevertheless, the coders are encouraged to annotate any problem, fuzzy boundary, or misdef-
inition they find during the coding process. After the coding process ends, if the coders
consider that the codebook was not clear enough, or the ICA measured in phase (3) does
not reach an acceptable level, the coders and the codebook designer can meet to discuss the
problems found. With this information, the codebook designer creates a new codebook and
coding instructions that can be used for a second round of coding. This iterative pro-
cess can be conducted as many times as needed, until the coders consider that the codebook
is precise enough and the ICA measures certify an acceptable level of reliability.
In the case study of ICA considered in this paper, the coding process involved two re-
searchers other than R1, who acted as coders C1 and C2. They coded the matter accord-
ing to the codebook created by R1.
(3) Calculate ICA. It is a common misconception in qualitative research that no numer-
ical calculations can be performed in the study. Qualitative research aims to understand
very complex and unstructured phenomena, for which a semantic analysis of the different
facets and their variations is required. However, this by no means implies that no mathe-
matical measures can be obtained for controlling the process. Due to its broad and flexible
nature, qualitative research is highly prone to introducing biases in the judgements of the
researchers, so it is mandatory to supervise the research through reliability measures
that are usually numerical [23]. In this way, the quantitative approach takes place at a
higher level, as a meta-analysis of the conducted process, in order to guarantee mathematical
reliability of the drawn conclusions. Only when this formal quality assurance process is
satisfactory can researchers trust the conclusions and consider the method sound and complete.
Therefore, to avoid biases and to be confident that the codes mean the same to anyone
who uses them, that confidence must be built explicitly. According to Krippendorff [23],
reliability grounds this confidence empirically and offers the certainty that research findings
can be reproduced.
In the presented example about a DevOps case study, we used Inter-Coder Agreement
(ICA) analysis techniques for testing the reliability of the obtained codebook. In this way,
after coding, another researcher, R4, calculated and interpreted the ICA between C1 and
C2. If the coders did not reach an acceptable level of reliability, R1 analyzed the disagreements
pointed out by R4 to find out why C1 and C2 had not understood a code in the same way.
Using this acquired knowledge, R1 delivered a refined new version of the codebook and
the accompanying use instructions. R1 also reviewed the coding of those quotations that led
to disagreement between C1 and C2, modifying it according to the new codebook when
necessary. Notice that, if a code disappears in the new version of the codebook, it must also
be removed from all the quotations to which it was assigned.
At this point, C1 and C2 can continue coding a new subset of interview transcrip-
tions. This process is repeated until the ICA reaches an acceptable level of reliability
(typically ≥ 0.8). Section 5.3 provides a detailed explanation of how to compute
and interpret ICA coefficients in Atlas.
(4) Synthesize. Once the loop (1)-(2)-(3) has been completed because the ICA measures
reached an acceptable threshold, we can rely on the output of the coding process and start
drawing conclusions. At this point, there exists a consensus about the meaning, applicabil-
ity, and limits of the codes and semantic domains of the codebook.
Using this processed information, this phase aims to provide a description of higher-
order themes, a taxonomy, a model, or a theory. The first action is to determine how many
times each domain appears in the data in order to estimate its relevance (its groundedness) and
to support the analysis with evidence through quotations from the interviews. After that,
the co-occurrence table between semantic domains should be computed, that is, the table that
collects the number of times a semantic domain appears jointly with each of the other domains.
With these data, semantic networks can be created in order to portray the relationships
between domains (association, causality, etc.) as well as the strength of each relationship based on
co-occurrence. These relations determine the density of the domains, i.e., the number of
domains related to each domain. If further information is needed, it is possible to
repeat these actions for each code within a domain (a minimal sketch of these counts is
given right after this list).
These synthesis actions are not the main focus of this paper, so we will not describe
them further. For more information and techniques, please refer to [41].
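To make the counting step concrete, the following Python sketch shows one way groundedness, co-occurrence, and density could be derived from a merged coding. It is only an illustration under a simplified, hypothetical in-memory representation (quotation identifier → set of semantic domains applied to it); none of the identifiers or numbers come from the actual study, and Atlas computes these counts natively.

    from itertools import combinations
    from collections import Counter

    # Hypothetical, simplified representation of a merged coding: for each quotation
    # (arbitrary identifier) we store the set of semantic domains applied to it.
    codings = {
        "ID01:q1": {"P01", "P07"},
        "ID01:q2": {"P07"},
        "ID02:q1": {"P01", "P08"},
        "ID02:q2": {"P07", "P08"},
    }

    # Groundedness: number of quotations in which each domain appears.
    grounded = Counter(domain for domains in codings.values() for domain in domains)

    # Co-occurrence table: number of quotations in which two domains appear together.
    cooccurrence = Counter()
    for domains in codings.values():
        for a, b in combinations(sorted(domains), 2):
            cooccurrence[(a, b)] += 1

    # Density: number of distinct domains co-occurring with each domain.
    density = Counter()
    for a, b in cooccurrence:
        density[a] += 1
        density[b] += 1

    print("grounded:", dict(grounded))
    print("co-occurrence:", dict(cooccurrence))
    print("density:", dict(density))

The same counters can be restricted to the codes of a single domain if a finer-grained analysis is needed.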
5.2.4. Perform validation analysis. As a final step, it is necessary to discuss to what extent the
obtained analysis and the drawn conclusions are valid, as well as the threats to validity that may
jeopardize the study. In the words of Wohlin [41], “The validity of a study denotes the trustwor-
thiness of the results, and to what extent the results are true and not biased by the researchers’
subjective point of view”.
There are several strategies for approaching the validity analysis of the procedure. In the
aforementioned case study, the methodology suggested by Creswell & Creswell [6] to improve
the validity of exploratory case studies was followed, namely data triangulation, member checking,
rich description, clarifying bias, and reporting discrepant information. Most of these methods are
out of the scope of this paper and are not described further (for more information, check [6, 41]). We
mainly focus on reducing author bias by evaluating, through ICA analysis, the reliability and
consistency of the codebook on which the study findings are based.
5.3. ICA calculation. This section describes how to perform, using Atlas, the ICA analysis required
to assess the validity of the exploratory study described in Section 5.2. For this purpose, we
use the theoretical framework developed in Section 4 regarding the different variants of Krippen-
dorff’s α coefficient. In this way, we monitor the evolution of the α coefficients along the
coding process in order to ensure that they reach an acceptable threshold of reliability, as mentioned in
Section 5.2.3.
Nevertheless, before starting the coding/evaluation protocol, it is worth considering two impor-
tant methodological aspects, described below.
(1) The number of coders. Undoubtedly, the higher the number of involved coders, the richer
the coding process. Krippendorff’s α coefficients can be applied to an arbitrary number
of coders, so there is no intrinsic limitation on this number. On the other hand, a high
number of coders may introduce too many different interpretations, which may make it difficult
to reach an agreement. In this way, it is important to find a fair balance between the number
of coders and the time needed to reach agreement. For that purpose, it may be useful to take into
account the number of interviews to be analyzed, their length, and the resulting total amount
of quotations. In the case study analyzed in this section, two coders, C1 and C2, were
considered for coding excerpts extracted from 30 interviews.
(2) The extent of the coding/evaluation loop. A first approach to the data analysis process
would be to let the coders codify the whole corpus of interviews and to obtain a single
ICA measure when the coding is completed. However, if the obtained ICA is below the
acceptable threshold (say 0.8), the only possible solution is to refine the codebook
and to re-codify the whole corpus again. This is a slow and repetitive protocol that can lead
to intrinsic deviations in the subsequent codings due to cognitive biases in the coders.
In this way, it is more convenient to follow an iterative approach that avoids these prob-
lems and speeds up the process. In this approach, the ICA coefficient is screened on several
partially completed codings. To be precise, the designer of the case study splits the inter-
views into several subsets. The coders process the first subset and, after that, the ICA
coefficients are computed. If this value is below the threshold of acceptance (0.8), there
exists a disagreement between the coders when applying the codebook. At this point, the
designer can use the partial coding to detect the problematic codes and to offer a refined
version of the codebook and the accompanying instructions. Of course, after this revision,
the previously coded matter should be updated with the new codes. With this new code-
book, the coders can face the next subset of interviews, in the expectation that the newer
version of the codebook will decrease the disagreement. This reduces the number
of complete codings needed to achieve an acceptable agreement (a schematic sketch of this
loop is given right after this list).
In the case study considered as an example, the first batch comprised the
first 19 interviews (ID01 to ID19). The attained ICA was unsatisfactory, so the codebook
designer R1 reviewed the codebook and released a new version. With the updated codebook,
the coders codified the remaining 11 interviews (ID20 to ID30); this time, the obtained ICA
passed the acceptance threshold, which evidences a high level of reliability in the evaluations.
As a final remark, it is not advisable to replace the coders during this iterative
process. Although Krippendorff’s α allows coders to be exchanged, the new coders may not
share the same vision of and expertise with the codebook, requiring a roll back to previous
versions (cf. Remark 3.4).
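The batch-wise protocol just described can be summarized by the following schematic Python sketch. It is not part of the original study: code_batch, compute_alpha, and refine_codebook are hypothetical placeholders for the manual coding in Atlas, the ICA computation, and the codebook revision performed by the designer, not real Atlas APIs.

    # Schematic sketch of the iterative coding/evaluation protocol described above.
    # `code_batch`, `compute_alpha`, and `refine_codebook` are hypothetical placeholders
    # for the manual coding in Atlas.ti, the ICA computation, and the codebook revision
    # performed by the designer.

    THRESHOLD = 0.8  # acceptance threshold for Krippendorff's alpha

    def iterative_coding(batches, codebook, code_batch, compute_alpha, refine_codebook):
        all_codings = []
        for batch in batches:
            coding = code_batch(batch, codebook)      # coders C1 and C2 code this batch
            all_codings.append(coding)
            if compute_alpha(coding, codebook) < THRESHOLD:
                # The designer refines the codebook and updates the already-coded matter.
                codebook, all_codings = refine_codebook(codebook, all_codings)
        return all_codings, codebook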
Now, we present the calculation and interpretation of each of the four coefficients mentioned in
Section 4. In order to emphasize the methodological aspects of the process, we focus only on some
illustrative instances for each of the two rounds of the coding/evaluation protocol.
The first step in the ICA analysis is to create an Atlas project and to introduce
the semantic domains and codes compiled in the codebook. In addition, all the interviews (and
their respective quotations, previously defined by R1) should be loaded as separate documents,
and all the codings performed by the coders should be integrated into the Atlas project. This is a
straightforward process that is detailed in the Atlas user manual [1].
To illustrate our case study, we have a project containing the codings of the two coders for the
first 19 interviews. The codebook has 10 semantic domains and 35 codes (Figure 4). Observe that
Atlas reports a total of 45 codes, since it treats semantic domains as codes (although each domain
works as an aggregation of codes for ICA purposes).
Now, we should select the semantic domains we wish to analyze, and the codes within them.
For example, in Figure 5, the three codes associated with semantic domain P07 have been added.
After adding a semantic domain and its codes, Atlas automatically plots a graphical representa-
tion. For each code, this graph consists of as many horizontal lines as coders (two, in our running
example), identified with a small icon on the left. Each line is divided into segments that
represent each of the documents added for analysis.
As can be checked in Figure 6, there are two coders (Daniel and Jorge, represented with blue
and brown icons, respectively), and the semantic domain P07 has three associated codes, so three
groups of pairs of horizontal lines are depicted. In addition, since we selected 19 documents for
this coding round, the lines are divided into 19 segments (notice that the last one is very short and
can barely be seen). Observe that the length of each segment is proportional to the total length
of the corresponding file.
Moreover, on the right of the horizontal lines we find a sequence of numbers organized into two
groups separated by a slash. For example, in Figure 7 we can see those numbers for the first code
of P07 (problems/lack of collaboration/sync). The left-most group shows the number of quotations
to which the corresponding code has been applied by the coder across all the documents, as well as
the total length (i.e., the number of characters) of the chosen quotations. In the example of Figure
7, the first coder (Daniel) used the first code twice, and the total length of the chosen quotations is
388, while the second coder (Jorge) used the first code only once, on a quotation of length 81.
On the other hand, the right-most group indicates the total length of the analyzed documents
(in particular, a constant independent of the chosen code, semantic domain, or coder). This
total length is accompanied by the fraction of the corpus covered by the coded quotations. In this
example, the total length of the documents to be analyzed is 504,384 characters, and the coded quotations
(with total lengths of 388 and 81, respectively) represent 0.076% and 0.016% of the corpus
(rounded to 0.1% and 0.0% in the Atlas representation). Recall that these lengths of the coded
quotations and of the total corpus play an important role in the computation of the α coefficient, as
mentioned in Remark 4.1.
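As a quick check of these figures, the coverage rates can be reproduced with a couple of lines of Python using the unit counts quoted above (this is just arithmetic, not an Atlas feature):

    # Coverage rates for the first code of P07 in the first round (values quoted above).
    corpus_length = 504384            # total number of characters in the 19 documents
    daniel_units, jorge_units = 388, 81

    # Fraction of the corpus covered by each coder's quotations for this code
    # (shown by Atlas rounded to 0.1% and 0.0%, respectively).
    print(f"{daniel_units / corpus_length:.4%}")
    print(f"{jorge_units / corpus_length:.4%}")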
Each time a coder uses a code, a small colored mark is placed at the position of the
quotation within the document. The color of the mark matches the color assigned to the coder,
and its length corresponds to the length of the coded quotation. Due to the short length of the
chosen quotations in Figure 6, they are barely visible, but we can zoom in by choosing Show
Documents Details in the Atlas interface. In Figure 8, we can check that, in document ID01,
both coders agreed to codify two quotations (of different lengths) with the second and third codes
of P07.
5.3.1. The αbinary coefficient. In order to compute this coefficient, click on the Agreement Measure
button and select the Krippendorff’s c-Alpha-binary option. As shown in Figure 9, the sys-
tem returns two values. The first one is the αbinary coefficient per semantic domain (P07 in this
case, with αbinary^P07 = 0.913) and another global coefficient for the domains as a whole, which corresponds
to what we called αbinary^gl as described in Sections 4.1 and 4.2. Since we selected a single semantic
domain (P07), both values αbinary^P07 and αbinary^gl agree.
In the case shown in Figure 9, the value of the coefficient is high (αbinary^P07 = 0.913 > 0.8), which
can be interpreted as evidence that the domain P07 is clearly stated and its boundaries are well-
defined, so that the decision of whether to apply it is nearly deterministic. However, observe
that this does not measure the degree of agreement in the application of the different codes within
the domain P07. It might happen that the boundaries of the domain P07 are clearly defined but the
inner codes are not well chosen. Detecting this is not the task of αbinary, but of the cu-α coefficient.
In order to illustrate how Atlas performed the previous computation, let us calculate αbinary by
hand. For this purpose, we export the information provided by Atlas about the coding process. In
order to do so, we click on the Excel Export button in the Intercoder Agreement panel. In
Figure 10 we show the part of the exported information that is relevant for our analysis. As we can
see, there are two coders (Jorge and Daniel) and three codes. The meaning of each column is as
follows:
• Applied*: Number of times the code has been applied.
• Units*: Number of units to which the code has been applied.
• Total Units*: Total number of units across all selected documents, voted or not.
• Total Coverage*: Percentage of coverage in the selected documents.
The length of a quotation (what Atlas calls units) is expressed in number of characters.
From this information, we see that coder Daniel voted 388 units (characters) with the first code of
the domain (problems/lack of collaboration/sync), while coder Jorge only voted 81 units with that
code. For the other two codes, both coders applied them to 1143 and 403 units, respectively. Indeed, as
we will check later, the quotations that Jorge chose for applying P07 are actually a subset of the
ones chosen by Daniel. Hence, Daniel and Jorge achieved perfect agreement when applying the
second and third codes of P07, while, for the first code, Jorge only considered eligible 81 of the
388 units chosen by Daniel.
From these data, we can construct the observed coincidence matrix, shown in Table 10, as
explained in Section 2.5 (see also Section 3). Recall from Section 4.2 that a label 1 means that the
coder voted the quotation with a code of the semantic domain (P07 in this case) and the label 0
means that no code of the domain was applied.
This matrix is computed as follows. The number of units to which the coders assigned any code
from domain P07 is 81 + 1143 + 403 = 1627 in the case of Jorge and 388 + 1143 + 403 = 1934
in the case of Daniel. Since the choices of Jorge are a subset of those of Daniel, they agreed on
1627 = min(1627, 1934) units. Recall that o1,1 counts ordered pairs of votes, so we need to double

                                    Coder 2 (Jorge)
                                    1                    0
  Coder 1 (Daniel)   1    o1,1 = 3254          o1,2 = 307           t1 = 3561
                     0    o2,1 = 307           o2,2 = 1004900       t2 = 1005207
                          t1 = 3561            t2 = 1005207         t = 1008768

TABLE 10. Observed coincidence matrix for αbinary^P07.

the contribution to obtain o1,1 = 2 · 1627 = 3254. On the other hand, Jorge did not apply any code
of P07 to 504384 − 1627 = 502757 units, while Daniel did not apply them to 504384 − 1934 =
502450 units, which means that they agreed on not choosing P07 in 502450 = min(502757, 502450)
units. Doubling the contribution, we get o2,2 = 1004900. Finally, for the disagreements, we find that
Daniel applied a code from P07 to 307 units that Jorge did not select, so o1,2 = o2,1 =
307. Observe that we do not have to double this value, since there is already an implicit order in
these votes (Daniel voted 1 and Jorge voted 0). From these data, it is straightforward to compute the
aggregated quantities t1 = o1,1 + o1,2 = 3561, t2 = o2,1 + o2,2 = 1005207, and t = t1 + t2 = 1008768.
In the same vein, we can construct the matrix of expected coincidences, as explained in Section
2.5 (see also Section 3). The value of the expected disagreement is

    e1,2 = e2,1 = t1 · t2 / (t − 1) = (3561 · 1005207) / 1008767 = 3548.43.

Analogously, we can compute e1,1 and e2,2. However, they are not actually needed for computing
the α coefficient, so we skip them. With these calculations, we finally get

    Do = o1,2 = 307,   De = e1,2 = 3548.43,   αbinary^P07 = 1 − Do/De = 1 − 307/3548.43 = 0.913.
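The same by-hand computation can be scripted. The following Python sketch is not an Atlas feature, only an independent check using the unit counts exported above; it reproduces the observed coincidences, the expected disagreement, and the final value αbinary^P07 = 0.913.

    # Independent check of the alpha_binary computation for domain P07 (first round),
    # using the unit (character) counts exported from Atlas above.
    corpus = 504384                       # total length of the 19 analyzed documents
    daniel = 388 + 1143 + 403             # units Daniel coded with some code of P07 -> 1934
    jorge = 81 + 1143 + 403               # units Jorge coded with some code of P07  -> 1627
    both = 1627                           # units coded by both; equals min(daniel, jorge)
                                          # here because Jorge's choices are a subset of Daniel's

    # Observed coincidence matrix (ordered pairs of votes), as in Table 10.
    o11 = 2 * both                               # both coders voted 1
    o22 = 2 * (corpus - (daniel + jorge - both)) # both coders voted 0
    o12 = (daniel - both) + (jorge - both)       # exactly one coder voted 1 -> 307
    t1, t2 = o11 + o12, o12 + o22
    t = t1 + t2                                  # 1008768 = 2 * corpus

    # Expected disagreement and Krippendorff's alpha.
    Do = o12
    De = t1 * t2 / (t - 1)
    print(round(1 - Do / De, 3))                 # 0.913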
Note again that the previous calculation is correct because the quotations that Jorge voted with a code
of P07 are a subset of the quotations that Daniel selected for domain P07. We can check this claim
using Atlas. For that purpose, click on Show Documents Details and review how the codes were
assigned per document. In the case considered here, Table 11 shows an excerpt of the displayed
information. To shorten the notation, the first code of P07 is denoted by 7a, the second one by 7b,
and the third one by 7c. In this table, we see that all the voted elements coincide except for the last
307 units corresponding to document ID17, which Daniel codified and Jorge did not.
TABLE 11. Codified units, per document, for the semantic domain P07.
Another example of the calculation of this coefficient is shown in Figure 11. It refers to the compu-
tation of the same coefficient, αbinary^P07, but in the second round of the coding process (see Section
5.2.3), in which 11 documents were analyzed. We focus on this case because it exhibits a counterintuitive
effect of the α coefficient.
FIGURE 11. Computation of the αbinary coefficient for the domain P07 in the second round
5.3.2. The αbinary^gl coefficient. As we mentioned in Section 4.1, the αbinary^gl coefficient allows re-
searchers to measure the degree of agreement that the coders reached when distinguishing rele-
vant from irrelevant matter. In this way, αbinary^gl is only useful if each coder chops the matter by
him/herself to select the relevant information to code. On the other hand, if the codebook designer
pre-defines the quotations to be evaluated, this coefficient is no longer informative, since it always attains
the value αbinary^gl = 1. This latter situation is the case in our running example.
Recall that, as we mentioned in Section 5.3.1, the αbinary^gl coefficient is automatically computed
when we select the c-Alpha-binary option in Atlas. It is displayed at the bottom of the results,
below all the αbinary coefficients, as a summary of the reliability of the semantic domains.
Nevertheless, as can be checked in Figure 9, in the calculation performed in the previous Section
5.3.1 we got a value αbinary^gl = 0.931 < 1, which apparently contradicts the aforementioned fact
that, with pre-defined quotations, perfect agreement is always achieved. Recall that this coefficient
evaluates the agreement of the coders when trying to discern which parts of the matter are relevant
(quotations) and which ones are not. In other words, this coefficient distinguishes coded matter (with
any code) from non-coded matter. If we introduced into Atlas all the domains, all the pre-defined
quotations would have at least one code assigned, and the rest of the matter would not receive any code.
In this way, we would get αbinary^gl = 1, since there would be perfect agreement on what constitutes relevant
matter (the quotations) and irrelevant matter.
The key point here is that, in the calculation of Section 5.3.1, we did not introduce into Atlas all
the semantic domains, but only P07. In this way, the αbinary^gl coefficient was not computed over
the whole corpus. In other words, since we only added to the ICA tool the codes belonging to P07,
the Atlas analyzer considered that these are all the codes that exist in the codebook, and anything
that did not receive a code of P07 is treated as irrelevant matter. In this way, the quotation (made of
307 units) to which Daniel applied the code 7a and Jorge did not is considered by the ICA
tool as a quotation that Daniel saw as relevant matter and Jorge as irrelevant matter, decreasing the
value of αbinary^gl.
If we conduct the analysis over all the codes of the codebook, it turns out that Jorge applied
to this quotation a code from a different semantic domain, so these 307 units are actually relevant
matter, restoring the expected value αbinary^gl = 1. The same result is obtained if we select a subset
of domains such that the coders used codes of the subset on the same quotations (even if they did
not agree on the particular codes), as shown in Figure 12 for domains P01 and P08.
FIGURE 12. Computation of the αbinary^gl coefficient with domains P01 and P08
For this reason, it is crucial to evaluate the global α coefficients (namely, αbinary^gl and Cu-α) only
when all the codes have been added to the ICA tool. Otherwise, the result might be wrong and
may lead to confusion.
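The effect just described can be illustrated numerically. The following Python sketch uses purely hypothetical numbers (not taken from the case study) to show how αbinary^gl behaves when all domains are loaded versus when only a subset is loaded, so that matter coded by one coder with an out-of-subset code is treated as irrelevant.

    # Hypothetical illustration of the effect of loading only a subset of the domains.
    # The numbers below are invented for the example and do not come from the case study.

    def binary_alpha(corpus, coded_a, coded_b, both):
        """Binary Krippendorff's alpha on 'relevant vs irrelevant' matter for two coders.
        coded_a/coded_b: units each coder marked as relevant; both: units marked by both."""
        o12 = (coded_a - both) + (coded_b - both)       # units marked by exactly one coder
        o11 = 2 * both
        o22 = 2 * (corpus - (coded_a + coded_b - both))
        t1, t2 = o11 + o12, o12 + o22
        return 1 - o12 / (t1 * t2 / (t1 + t2 - 1))

    corpus = 10000
    # All domains loaded: with pre-defined quotations, both coders mark exactly the same
    # units as relevant, so the coefficient is exactly 1.
    print(binary_alpha(corpus, coded_a=2000, coded_b=2000, both=2000))   # 1.0
    # Only a subset loaded: 200 units that coder B labelled with a code from an excluded
    # domain now look irrelevant for B, and the coefficient drops below 1.
    print(binary_alpha(corpus, coded_a=2000, coded_b=1800, both=1800))   # ≈0.935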
5.3.3. The cu-α and Cu-α coefficients. In some sense, the binary coefficients αbinary and αbinary^gl
evaluate whether the decision to apply a particular semantic domain or not is well defined in the codebook.
They are binary measures, in the sense that they test a binary question: whether to apply some domain or
not.
On the other hand, in this section we consider the cu-α and Cu-α coefficients that, roughly
speaking, zoom in to measure the limits of definition of the semantic domains themselves and of
the codes within them. Recall from Section 4.3 that the cu-α coefficient is computed per semantic
domain. Hence, for a fixed domain S, it evaluates the amount of agreement that the coders reached
when choosing to apply one code of S or another. It is, therefore, a measure of the reliability in
the application of the codes within S, not of the domains themselves. Analogously, as explained
in Section 4.4, Cu-α is a global measure that allows us to assess the limits of definition of the
semantic domains. In other words, it measures the goodness of the partition of the codebook into
semantic domains, independently of the chosen code.
These two coefficients can be easily computed with Atlas. For that purpose, click on the
option Cu-Alpha/cu-Alpha in the ICA panel, under the Agreement Measure button. Table
12 shows the obtained results of cu-α for each of the 10 semantic domains of the running example
in the first round.
Observe that cu-α attained its maximum value, cu-α = 1, in the domains P02, P05, P07, P08,
and P09. This may seem counterintuitive at first sight since, as can be checked in Figure 13
and as we mentioned in Section 5.3.1, there is no perfect agreement in P07. Indeed, as we know,
the P07 code “problems/lack of collaboration/sync” was chosen by Daniel for a quotation that
Jorge skipped. This is strongly related to Remarks 3.1 and 4.4: recall that, for a fixed domain
S, only quotations that were voted with codes of S by at least two different coders count in the
observed coincidence matrix. Otherwise, these quotations do not contribute a pair of disagreements,
so, through the eyes of cu-α, they do not compromise the reliability.
FIGURE 13. Computation of the cu-α coefficient for the domain P07
This fact is precisely what is taking place in this case. The quotation voted by Daniel with a
code of P07, but not by Jorge, does not appear in the observed coincidence matrix for cu-α^P07,
neither as an agreement nor as a disagreement. This allows cu-α^P07 = 1 even though there is no
perfect agreement in the evaluation. This might seem awkward, but it actually makes sense:
this disagreement was already detected via αbinary^P07 < 1, so also decreasing cu-α^P07 would count it
twice. The same scenario occurs in domains P02 and P09.
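The following Python sketch makes this explicit for the P07 example: units coded with a code of the domain by only one coder are simply left out of the coincidence matrix used by cu-α, so the observed disagreement within the domain is zero. The per-code unit counts are those of the first round discussed above; 7a, 7b, and 7c abbreviate the three codes of P07 as in Table 11.

    # Minimal sketch (two coders) of why cu-alpha for P07 equals 1 despite the 307-unit
    # disagreement detected by alpha_binary: units coded with some code of the domain by
    # only one coder are excluded from the coincidence matrix.
    from collections import Counter

    # (units, code applied by Daniel, code applied by Jorge); None = no code of P07 applied.
    segments = [
        (81,   "7a", "7a"),
        (1143, "7b", "7b"),
        (403,  "7c", "7c"),
        (307,  "7a", None),   # coded only by Daniel -> not pairable, ignored by cu-alpha
    ]

    coincidences = Counter()
    for units, code_d, code_j in segments:
        if code_d is None or code_j is None:
            continue                       # fewer than two coders used the domain here
        coincidences[(code_d, code_j)] += units
        coincidences[(code_j, code_d)] += units

    Do = sum(n for (a, b), n in coincidences.items() if a != b)
    print(Do)   # 0: no observed disagreement within P07, hence cu-alpha = 1
                # (as long as the expected disagreement is non-zero)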
The semantic domain P04 (Figure 14), which was evaluated as N/A*, deserves special mention.
Here, visual inspection shows that both coders did actually codify the same quotation with the
same code. However, since only a single quotation was judged with a code of this domain, Atlas
considers that there is not enough variability to provide statistical confidence (i.e., the p-value is
above 0.05) for drawing reliable conclusions. As we mentioned at the end of Section 5.3.1, a manual
verification of the agreement is required in this case in order to assess the reliability.
FIGURE 14. Computation of the cu-α coefficient for the domain P04
Finally, as can be checked in Figure 15, the Cu-α coefficient reached a value of 0.67, which
is only slightly above the lower threshold of applicability of 0.667. This suggests that the limits of defi-
nition of the semantic domains are somewhat diffuse and can be improved. This problem was addressed
in the second version of the codebook, in which a better definition of the domains allowed us to
increase Cu-α to 0.905, which is sound evidence of reliability.
6. CONCLUSIONS
Throughout this tutorial, we have applied a set of statistical measures proposed in the literature for
evaluating Inter-Coder Agreement in thematic analysis. Among them, we have paid special at-
tention to Krippendorff’s α coefficients as the most appropriate and best behaved for qualitative
analysis.
After a formal introduction to these coefficients, we have presented a theoretical framework in
which we provide a common formulation for all of them in terms of a universal α coefficient. This
analysis provides a clearer and more precise interpretation of four of the most important variants
of the α coefficient: the binary α coefficient (αbinary), the global binary α coefficient (αbinary^gl),
the cu coefficient (cu-α), and the Cu coefficient (Cu-α). This redefinition is particularly well suited
for use in case studies, interview surveys, and grounded theory, with a view towards providing
sound reliability of the coding.
Building on an exploratory study about the adoption of the DevOps culture in software companies,
and using the Atlas.ti software as a tool, we have presented in this paper a tutorial on how to
apply these coefficients to software engineering research to improve the reliability of the drawn
conclusions. With this idea in mind, we have described how to compute these coefficients using Atlas
and how to interpret them, leading to interpretations that complement the ones provided by the
Atlas user manual itself.
Furthermore, the interpretation provided in this paper has allowed us to detect some bizarre
behaviors of Krippendorff’s α coefficients that are not described in the Atlas manuals and
that may mislead researchers who are not familiar with the α measures. Shedding light on these
unexpected results, justifying why they appear, and explaining how to interpret them have been guiding lines
of this work. In particular, we addressed situations in which the insufficient statistical variability
of the coding prevents Atlas from emitting any measure of the attained agreement. We also explained
paradoxical results in which very small deviations from perfect agreement lead to very low values
of the α coefficient, and we clarified why cu-α may be maximal even though there is no
perfect agreement and how to detect this through αbinary.
Most of the qualitative works in software engineering suffer from the lack of formal measures of the
reliability of the drawn conclusions. This is a dangerous threat that must be addressed to
establish sound, well-posed and long-lasting knowledge. Otherwise, if the data are not reliable,
the drawn conclusions are not trustworthy. In this direction, we expect that this tutorial will help
researchers in qualitative analysis, in general, and in empirical software engineering, in particular,
to incorporate these techniques into their investigations, aiming to improve the quality and reliability
of the research.
REFERENCES
[1] ATLAS.ti Scientific Software Development GmbH. Inter-coder agreement analysis: Atlas.ti 8, 2020.
[2] D. V. Cicchetti and A. R. Feinstein. High agreement but low kappa: II. resolving the paradoxes. Journal of
clinical epidemiology, 43(6):551–558, 1990.
[3] J. Cohen. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–
46, 1960.
[4] J. Corbin and A. Strauss. Techniques and procedures for developing grounded theory. Basics of Qualitative
Research, 3rd ed.; Sage: Thousand Oaks, CA, USA, 2008.
[5] R. Craggs and M. M. Wood. Evaluating discourse and dialogue coding schemes. Computational Linguistics,
31(3):289–296, 2005.
[6] J. W. Creswell and J. D. Creswell. Research design: Qualitative, quantitative, and mixed methods approaches.
Sage publications, 2017.
[7] L. J. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16(3):297–334, 1951.
[8] D. S. Cruzes and T. Dyba. Recommended steps for thematic synthesis in software engineering. In 2011 interna-
tional symposium on empirical software engineering and measurement, pages 275–284. IEEE, 2011.
[9] K. De Swert. Calculating inter-coder reliability in media content analysis using Krippendorff’s alpha. Center for
Politics and Communication, pages 1–15, 2012.
[10] J. Diaz, D. López-Fernández, J. Perez, and Á. González-Prieto. Why are many businesses instilling a DevOps culture
into their organization? To appear in Empirical Software Engineering. Preprint arXiv:2005.10388, 2020.
[11] A. R. Feinstein and D. V. Cicchetti. High agreement but low kappa: I. the problems of two paradoxes. Journal of
clinical epidemiology, 43(6):543–549, 1990.
[12] G. C. Feng. Intercoder reliability indices: disuse, misuse, and abuse. Quality & Quantity, 48(3):1803–1815,
2014.
[13] G. C. Feng. Mistakes and how to avoid mistakes in using intercoder reliability indices. Methodology: European
Journal of Research Methods for the Behavioral and Social Sciences, 11(1):13, 2015.
[14] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378, 1971.
[15] S. Friese. Qualitative data analysis with ATLAS.ti. SAGE Publications Limited, 2019.
[16] H. Ghanbari, T. Vartiainen, and M. Siponen. Omission of quality software development practices: A systematic
literature review. ACM Comput. Surv., 51(2), 2018.
[17] N. Gisev, J. S. Bell, and T. F. Chen. Interrater agreement and interrater reliability: key concepts, approaches, and
applications. Research in Social and Administrative Pharmacy, 9(3):330–338, 2013.
[18] K. L. Gwet. On the Krippendorff’s alpha coefficient. Manuscript submitted for publication, retrieved October 2,
2011.
[19] A. F. Hayes and K. Krippendorff. Answering the call for a standard reliability measure for coding data. Commu-
nication methods and measures, 1(1):77–89, 2007.
[20] O. R. Holsti. Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley, 1969.
[21] K. Krippendorff. Estimating the reliability, systematic error and random error of interval data. Educational and
Psychological Measurement, 30(1):61–70, 1970.
[22] K. Krippendorff. On the reliability of unitizing continuous data. Sociological Methodology, pages 47–76, 1995.
[23] K. Krippendorff. Content analysis: An introduction to its methodology, 4th edition. Sage Publications, 2018.
[24] K. Krippendorff, Y. Mathet, S. Bouvry, and A. Widlöcher. On the reliability of unitizing textual continua: Further
developments. Quality & Quantity: International Journal of Methodology, 50(6):2347–2364, November 2016.
[25] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics,
33(1):159–174, 1977.
[26] C. A. Lantz and E. Nebenzahl. Behavior and interpretation of the κ statistic: Resolution of the two paradoxes.
Journal of clinical epidemiology, 49(4):431–434, 1996.
[27] L. Leite, C. Rocha, F. Kon, D. Milojicic, and P. Meirelles. A survey of devops concepts and challenges. ACM
Comput. Surv., 52(6), 2019.
[28] M. B. Miles and A. M. Huberman. Qualitative data analysis: An expanded sourcebook. Sage, 1994.
[29] A. Nili, M. Tate, and A. Barros. A critical analysis of inter-coder reliability methods in information systems
research. In Proceedings of the 28th Australasian Conference on Information Systems, pages 1–11. University of
Tasmania, 2017.
[30] A. Nili, M. Tate, A. Barros, and D. Johnstone. An approach for selecting and using a method of inter-coder
reliability in information management research. International Journal of Information Management, 54:102154,
2020.
[31] K. Pearson. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of
London, 58(347–352):240–242, 1895.
[32] J. Pérez, J. Díaz, J. Garcia-Martin, and B. Tabuenca. Systematic literature reviews in software engineering:
Enhancement of the study selection process using Cohen’s kappa statistic. Journal of Systems and Software, page
110657, 2020.
[33] J. Rumbaugh, I. Jacobson, and G. Booch. The unified modeling language. Reference manual, 1999.
[34] J. Saldaña. The coding manual for qualitative researchers. Sage publications, 2012.
[35] N. Salleh, R. Hoda, M. T. Su, T. Kanij, and J. Grundy. Recruitment, engagement and feedback in empirical
software engineering studies in industrial contexts. Information and Software Technology, 98:161 – 172, 2018.
[36] W. A. Scott. Reliability of content analysis: The case of nominal scale coding. Public opinion quarterly, pages
321–325, 1955.
[37] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychol-
ogy, 15(1):72–101, 1904.
[38] K.-J. Stol, P. Ralph, and B. Fitzgerald. Grounded theory in software engineering research: A critical review
and guidelines. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, page
120–131, New York, NY, USA, 2016. Association for Computing Machinery.
[39] T. Storer. Bridging the chasm: A survey of software engineering practice in scientific programming. ACM Com-
put. Surv., 50(4), 2017.
[40] J. Thomas and A. Harden. Methods for the thematic synthesis of qualitative research in systematic reviews. BMC
medical research methodology, 8(1):45, 2008.
[41] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in software
engineering. Springer Science & Business Media, 2012.
[42] Y. Yang and S. B. Green. Coefficient alpha: A reliability coefficient for the 21st century? Journal of Psychoedu-
cational Assessment, 29(4):377–392, 2011.
[43] X. Zhao, J. S. Liu, and K. Deng. Assumptions behind intercoder reliability indices. Annals of the International
Communication Association, 36(1):419–480, 2013.
ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid. C. Alan Turing s/n, 28031 Madrid, Spain.
Email address: [email protected]

ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid. C. Alan Turing s/n, 28031 Madrid, Spain.
Email address: [email protected]

ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid. C. Alan Turing s/n, 28031 Madrid, Spain.
Email address: [email protected]

ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid. C. Alan Turing s/n, 28031 Madrid, Spain.
Email address: [email protected]