What does reliability mean for building a grounded theory? What about when writing an auto-ethnography?
When is it appropriate to use measures like inter-rater reliability (IRR)? Reliability is a familiar concept in
traditional scientific practice, but how, and even whether to establish reliability in qualitative research is an
oft-debated question. For researchers in highly interdisciplinary fields like computer-supported cooperative
work (CSCW) and human-computer interaction (HCI), the question is particularly complex as collaborators
bring diverse epistemologies and training to their research. In this article, we use two approaches to
understand reliability in qualitative research. We first investigate and describe local norms in the CSCW and
HCI literature, then we combine examples from these findings with guidelines from methods literature to
help researchers answer questions like: “should I calculate IRR?” Drawing on a meta-analysis of a
representative sample of CSCW and HCI papers from 2016-2018, we find that authors use a variety of
approaches to communicate reliability; notably, IRR is rare, occurring in around 1/9 of qualitative papers.
We reflect on current practices and propose guidelines for reporting on reliability in qualitative research
using IRR as a central example of a form of agreement. The guidelines are designed to generate discussion
and orient new CSCW and HCI scholars and reviewers to reliability in qualitative research.
CCS Concepts: • Human-Centered Computing → HCI design and evaluation methods; HCI theory,
concepts, and models
KEYWORDS
Qualitative methods; interviews; content analysis; inter-rater reliability; IRR
ACM Reference format:
Nora McDonald, Sarita Schoenebeck, and Andrea Forte. 2019. Reliability and Inter-rater Reliability in
Qualitative Research: Norms and Guidelines for CSCW and HCI Practice. Proc. ACM Hum.-Comput. Interact.,
No. CSCW, Article 72 (November 2019), 23 pages. https://doi.org/10.1145/3359174
to reviewers who have been trained to expect it. The authors of this paper themselves have been
challenged repeatedly to explain alternate approaches to reliability and have discussed whether
IRR is appropriate in conversations online, in personal communications, in reviews, and in
dissertation defenses. These conversations are laborious, but they encapsulate tensions implicit
in scholarly communities like CSCW and HCI that unite diverse research traditions. We begin
our investigation of reliability in qualitative CSCW and HCI literature with this deceptively
simple question because answering it requires a thoughtful exploration of what reliability means
in different methodological traditions. We show that CSCW and HCI qualitative researchers use
the same terms and concepts in multiple, complex ways, and that readers and authors themselves
may have little consensus about what was done and why.
IRR is a statistical measure of agreement between two or more coders of data. IRR can be
confusing because it merges a quantitative method, which has roots in positivism and objective
discovery, with qualitative methods that favor an interpretivist view of knowledge. Further, while
there are many guidelines available across disciplines for how to do IRR, there are few guidelines
for deciding when and why to do IRR.
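As a concrete illustration, a minimal sketch of one such statistic—Cohen's kappa for two coders who have each assigned one nominal code per excerpt—might look as follows. The coders, codes, and labels are hypothetical, and the hand-rolled calculation is shown only to make visible what an IRR value summarizes: agreement corrected for what the coders would agree on by chance.

```python
# Minimal sketch: chance-corrected agreement (Cohen's kappa) between two coders.
# The excerpt labels below are hypothetical, for illustration only.
from collections import Counter

coder_a = ["support", "support", "conflict", "other", "support", "conflict"]
coder_b = ["support", "conflict", "conflict", "other", "support", "support"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # raw percent agreement

# Expected chance agreement, from each coder's marginal label frequencies.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))

kappa = (observed - expected) / (1 - expected)
print(f"percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```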
In our informal discussions with CSCW and HCI researchers about this work, we have
observed passionate and thoughtful perspectives on IRR and the concept of reliability more
generally, but these perspectives are diverse and sometimes contradictory. In general, diversity is
a celebrated hallmark of CSCW and HCI—Dourish noted that HCI is a “discipline that has often
proceeded with something of a mix-and-match approach, liberally and creatively borrowing ideas
and elements from different places” [28]. It is unsurprising, then, that researchers who write and
evaluate descriptions of qualitative methods often face quandaries, like whether or not IRR is
appropriate and how to communicate their choices to a diverse audience. Similar confusion has
been reported and addressed by Caine when it comes to choosing sample sizes in HCI [18].
In this article we first examine practices in recently published papers in the major CSCW and
HCI publishing venues, the CSCW and CHI conferences2, to create transparency around local
norms for communicating reliability and related concepts in qualitative scholarship. The same
diversity that makes these communities challenging to navigate methodologically also positions
them to provide guidance. Our research found that over 1/3 of CHI and almost 1/2 of CSCW
papers from 2016 to 2018 used qualitative analysis as a primary method, highlighting the need for
shared understandings of how to write about and evaluate qualitative research [74]. We found
that IRR is relatively rare, but that most papers used some form of agreement to signal reliability.
Based on foundational methods texts and illustrated by our findings, we propose guidelines for
deciding when IRR is appropriate. This work contributes:
Conceptual work: We situate reliability and IRR in a wide spectrum of related practices
and concepts, including rejection of the term reliability altogether, to widen our
discussion of methods and approaches beyond our starting point of “Should we use IRR?”
Descriptive norms: We characterize how qualitative researchers in the CSCW and HCI
community communicate reliability in their publications. Understanding local standards
can both reveal misperceptions about community expectations and enable reflection on
our practices as a community by newcomers and established researchers alike.
2We adopt the shorthand of these communities and refer to the Association for Computing Machinery (ACM) Special
Interest Group on Computer-Human Interaction (SIGCHI) Conference on Computer-Supported Cooperative Work and
Social Computing simply as CSCW and the ACM SIGCHI Conference on Human Factors in Computing Systems as CHI.
Note that the CSCW conference proceedings became part of the journal series Proceedings of the ACM: HCI in 2017.
Guidelines for deciding when agreement and/or IRR is not desirable (and may even be
harmful): The decision not to use agreement or IRR is associated with the use of methods
for which IRR does not make sense. Pragmatic examples include when developing codes
is part of the process, when there is a single researcher, when researchers are embedded
in the research context, when analysis is driven by participants’ own interpretations of
their data, or when coding requires little interpretation.
Guidelines for deciding when agreement or IRR is useful: The use of agreement or IRR
should be consistent with methodological choices and analytical goals, which might
include ensuring consistency across multiple coders, applying existing codebooks, or
reporting quantitative results.
2 RELATED LITERATURE
publication expectations [45,70]. In general, even researchers who use traditional forms of
qualitative methods and have a strong understanding of data sources and analysis may have a
less firm grasp of how and whether reliability and agreement fit in [3,19,30,53].
The next sections present the concept of reliability and its role in qualitative research, then
critique its limitations and potential harms.
whether these agreements or disagreements need to be measured (i.e., IRR). Indeed, the process
of reaching or failing to reach agreement may be more important than its measurement. These
tensions may be explained by the subtle differences between agreement (where two or more
coders reconcile differences through discussion) versus establishing reliability (where two or
more coders independently apply the same code to a unit of text).
Some scholars insist that both agreement and reliability are required [19]. However, other
techniques for communicating reliability include member checking (e.g., confirming/reviewing
results with participants or other members of the community), re-interviewing participants,
triangulation of data with secondary data sources (e.g., interviews plus field notes, photographs,
etc.), making research process transparent [3], and communicating positionality of the
researchers. Many qualitative researchers reject validity and reliability altogether in favor of
concepts like dependability, confirmability, credibility, and transferability that resonate with
interpretivist accounts [41,62].
times on Google Scholar, but they acknowledge that their scale is “clearly arbitrary” and provided
merely as “useful benchmarks” [57]. Krippendorff’s inspection of the tradeoffs between statistical
techniques establishes that “to assure that the data under consideration are at least similarly
interpretable by two or more scholars… it is customary to require α ≥ .800. Where tentative
conclusions are still acceptable, α ≥ .667 is the lowest conceivable limit” [55]. Although statistical
measures can help confirm that interpretations are consistent between coders, they are not a
substitute for interpretation and making meaning from the data.
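To show how such benchmarks might be applied in practice, the following is a minimal sketch of Krippendorff's alpha restricted to the simplest case—nominal data, two coders, no missing values—with the quoted α ≥ .800 and α ≥ .667 cutoffs as an interpretation aid. The coded units are hypothetical; analyses involving more coders, missing data, or other data types call for a full implementation or an established library.

```python
# Sketch: Krippendorff's alpha for nominal data, two coders, no missing values.
# Hypothetical labels; cutoffs follow the benchmarks quoted from Krippendorff [55].
from collections import Counter

coder_a = ["affirm", "deny", "affirm", "neutral", "affirm", "deny", "neutral", "affirm"]
coder_b = ["affirm", "deny", "deny",   "neutral", "affirm", "deny", "neutral", "affirm"]

# Coincidence matrix: each unit contributes both ordered pairs of its two codes.
pairs = Counter()
for a, b in zip(coder_a, coder_b):
    pairs[(a, b)] += 1
    pairs[(b, a)] += 1

n = sum(pairs.values())                      # total pairable values (2 x units)
marginals = Counter()
for (c, _k), count in pairs.items():
    marginals[c] += count

# Nominal metric: disagreement weight is 1 whenever the two categories differ.
d_observed = sum(count for (c, k), count in pairs.items() if c != k) / n
d_expected = sum(marginals[c] * marginals[k]
                 for c in marginals for k in marginals if c != k) / (n * (n - 1))

alpha = 1 - d_observed / d_expected
verdict = "reliable" if alpha >= 0.800 else "tentative" if alpha >= 0.667 else "below cutoff"
print(f"Krippendorff's alpha = {alpha:.3f} ({verdict})")
```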
In this paper, we argue that approaches to establishing and measuring reliability should be
aligned with epistemological traditions the researcher draws on. An epistemology is a theory of
knowledge describing a set of assumptions about what is possible to know and how we
communicate that knowledge. Wittingly or unwittingly, researchers invoke such assumptions
when they make and describe methodological choices. Burrell and Morgan organized assumptions
underlying social science research along two dimensions: assumptions about the nature of social
science and assumptions about the nature of societies [17]. They emphasized the need for
researchers to be consistent in their assumptions in order to do cogent social science. By
understanding their own assumptions, CSCW and HCI researchers can make sound and
justifiable decisions about how (and whether) to establish and report reliability.
The wide range of disciplinary traditions infused in CSCW and HCI research can introduce
uncertainty about when and how to analyze qualitative data, but also provides a rich dataset for
understanding expert practice. In the next section, we present findings from an analysis of three
years of published work in CSCW and CHI to describe current norms around communicating
reliability and methods in qualitative research.
3 STUDY DESIGN
This research draws from and triangulates multiple sources of data. We conducted informal
discussions throughout the data collection, analysis, and writing process with CSCW and HCI
researchers, ranging from senior scholars to new graduate students, and across the expanse of
disciplines typically found at CSCW and CHI. We reviewed scholarship on qualitative methods,
including foundational texts on grounded theory, ethnography, phenomenological research, and
feminism, as well as textbooks and recent methods-related works in CSCW, HCI, and adjacent
fields. The empirical data presented here comes from a systematic review of research papers from
the 2016-2018 CSCW (Computer-Supported Cooperative Work and Social Computing) and CHI
(Human Factors in Computing Systems) conferences. Our inclusion criterion for papers was the use of
qualitative methods for the primary analysis. Our dataset thus contained a wide range of methods,
including ethnographies, interview studies, diary studies, user studies, and design-based research.
Our research team has conducted numerous qualitative research studies using multiple sources
of data and a variety of approaches to reliability, and has reviewed hundreds of studies in CSCW,
CHI, and related venues. This project was born out of our desire to advance rigor, consistency,
and disciplinary sensitivity in our own work and the CSCW and HCI community more broadly.
3.1 Dataset
We conducted a qualitative analysis of full paper and note proceedings (hereafter referred to as
“papers”) from CSCW and CHI 2016-2018. We used the ACM DL to collect metadata for all 617
CSCW papers and 1811 CHI papers from 2016-2018 and selected a random subsample of 250
CSCW and 400 CHI papers for the first pass of coding to identify qualitative papers. We chose
three years to obtain a robust sample, while prioritizing recent practices in the community. We
used Cochran's sample size formula to calculate minimum sample sizes of 237 and 317,
respectively, based on a 95% confidence level and a 5% margin of error [90]. Some researchers
have published multiple papers that appear in our dataset; we did not adjust for these
dependencies, which may introduce a bias towards highly productive individuals and institutions
who may conduct research (and train large numbers of students) in unique ways.
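As a check on the arithmetic: Cochran's formula for a proportion combined with the standard finite-population correction, assuming maximum variability (p = 0.5), z = 1.96 for the 95% confidence level, and a 5% margin of error, reproduces the 237 and 317 minimums when rounded to the nearest whole paper. The sketch below illustrates that calculation rather than the exact procedure we ran.

```python
# Illustrative sketch: Cochran's sample-size formula with finite-population correction.
# Assumes maximum variability (p = 0.5); z = 1.96 corresponds to a 95% confidence level.
def cochran_sample_size(population: int, z: float = 1.96, p: float = 0.5, e: float = 0.05) -> float:
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # infinite-population estimate (~384.2)
    return n0 / (1 + (n0 - 1) / population)  # finite-population correction

print(round(cochran_sample_size(617)))    # 617 CSCW papers  -> 237
print(round(cochran_sample_size(1811)))   # 1811 CHI papers  -> 317
```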
We conducted a content analysis of methods sections in those 650 papers to determine
eligibility for inclusion in our dataset. We coded papers as include if they used qualitative methods
as a primary method of analysis. These were often exclusively qualitative papers (e.g., an
interview study, ethnographic work, or content analysis) and were sometimes mixed methods
studies with qualitative methods playing an integral role in the study (e.g., survey study +
interview study). Papers coded as partial included qualitative methods as a secondary form of
analysis, such as a systems paper that included a user study at the end that qualitatively asked
users about their reactions to the system or described qualitative research that shaped the design
of their system. Papers coded as exclude did not contain any qualitative methods. Note that we
were concerned with methods, not data; papers that used quantitative methods to analyze
qualitative data (for example using machine learning techniques to analyze texts) were coded as
exclude. Papers coded as partial or exclude were not included in the final dataset.
The research team reviewed and discussed criteria for inclusion before beginning the coding
process. Because the inclusion criteria could be interpreted differently by different coders, and
because we wished to divide the dataset between two different coders, we calculated IRR for the
application of inclusion criteria. Two of the co-authors coded the first 30 papers in our random
sample of CHI papers and had 100% agreement between them. This suggested that our discussion
had yielded a shared understanding of the criteria for inclusion, so both coauthors proceeded to
code an additional 70 papers, which produced minor disagreement. We calculated IRR between
the two coders for those 70 papers. Both Cohen’s kappa and Krippendorff’s alpha are appropriate
for use when there are two coders coding the same dataset and the data are nominal. Both of
these methods yielded an acceptable level of agreement (Cohen’s kappa unweighted=0.849, p<.05;
Krippendorff’s alpha=0.85). There were no disagreements between inclusion and exclusion; the
seven disagreements were primarily between partial and exclude. We discussed and resolved
those disagreements and divided the remaining dataset for coding between the same two coders.
Four additional papers were subsequently recoded from include to partial (3) or exclude (1) during
this pass. The final dataset comprised 140 CHI papers (35%) and 121 CSCW papers (48%) coded as
include for a total of 261 qualitative papers.
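An agreement check of this kind can be scripted in a few lines. The sketch below uses scikit-learn's unweighted Cohen's kappa on hypothetical include/partial/exclude labels—placeholders rather than our actual coding data—simply to show the shape of the computation.

```python
# Sketch of an inclusion-coding agreement check. The labels are hypothetical
# placeholders, not the coding data described above.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["include", "partial", "exclude", "include", "partial", "exclude", "include"]
coder_2 = ["include", "partial", "exclude", "include", "exclude", "exclude", "include"]

raw_agreement = sum(a == b for a, b in zip(coder_1, coder_2)) / len(coder_1)
kappa = cohen_kappa_score(coder_1, coder_2)   # unweighted kappa for nominal labels
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```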
met/meet, discuss, agree, refine, consensus, member (e.g., “member checking” “team members”),
revise, collaborate/collaboration (e.g., “collaboratively identified themes”), triangulate/
triangulation, and compare/comparison—and then examined texts with those key words to
confirm use of language of agreement.
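Such keyword flagging is easy to automate ahead of the manual confirmation step. The sketch below applies the stems listed above to an invented methods-section excerpt; the stems intentionally over-match, and every hit is still confirmed by hand.

```python
# Sketch: flag text containing agreement-language keyword stems for manual review.
# Stems deliberately over-match (e.g., "met" also hits "methods"); every hit is then
# confirmed by hand, mirroring the manual confirmation step described above.
import re

STEMS = ["met", "meet", "discuss", "agree", "refin", "consensus", "member",
         "revis", "collaborat", "triangulat", "compar"]
pattern = re.compile(r"\b(" + "|".join(STEMS) + r")\w*", re.IGNORECASE)

methods_excerpt = ("Two authors coded the transcripts independently and met weekly "
                   "to discuss and refine the codebook until they reached consensus.")

hits = [m.group(0) for m in pattern.finditer(methods_excerpt)]
print(hits)  # ['met', 'discuss', 'refine', 'consensus']
```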
We coded the following items: specific data collection activity or methods (e.g., interviews,
observations, diary study, etc.), number of researchers/coders involved in analysis, use of IRR,
specific IRR statistical method (e.g., Cohen’s kappa), mention of methodology (e.g., grounded
theory, ethnography), methods citations, use of language to signal agreement about codes when
IRR or multiple coders were not used, and tools used (e.g., post-its, NVivo, ATLAS.ti, etc.). We
coded these items only when they were unambiguously present in the text (i.e., we did not try to
infer whether researchers had done something or not, we only coded when the paper described a
practice). Many methods sections inevitably omitted details of the research process, and those
details therefore do not show up in our data. Because coding involved a simple binary identification—was
a measurement reported in the paper or not—we chose not to calculate IRR per our guidelines
below. Two researchers coded the full dataset of texts for presence of each measurement.
In the next section, we present our results as a descriptive quantitative analysis of reported
methodological choices across recent qualitative CSCW and CHI papers, including excerpts from
individual papers with citations to illustrate concepts. In cases where the analysis could be
perceived as critical, we lightly disguise the excerpt and do not include a citation to the paper.
The intent of this work is to identify areas of excellence in our collective approach to choosing
and reporting qualitative research methods and to support improvement of our collective
technique, but not to call out individual scholars and their work.
deemed appropriate as […] is an emerging and poorly understood topic” [66]. Most papers used a
process we would describe as inductive, even if not stated explicitly; however, some specified a
deductive approach. For example, Fiesler, Morrison, and Bruckman stated that they “began with
an inductive approach but then grouped themes deductively under the framework of feminist
HCI once we identified its relevance” [32]. Four other CHI papers and five CSCW papers used
deductive approaches, either as part of a deductive thematic approach [36,86] or a mixed
deductive and inductive approach [13,22,49,52,58,69,92].
Among those that discussed their approach to analysis, just under half (121 papers, 48.2%)
specified the number of people who analyzed qualitative datasets, with the majority of those
specifying two researchers or authors (65 papers, 53.7%). Those that did not specify a number
typically referred to “we” in their description of the analysis, or sometimes “the team” or “the
group”: “The entire team then reviewed and analyzed the data…” [14]. Some described a first or
single author doing an initial pass and then the “research team” going on to code.
In total, 139 (55.4%) papers provided language that indicated some form of agreement was
reached, primarily among multiple researchers. In addition to some of the examples given, Yarosh
et al. stated “more than 750 open codes were then discussed among all four investigators to resolve
any disagreements” [93].
In all, 10 papers described triangulation of some kind (data not shown). Finn and Oreglia
described “triangulation” of their notes with other researchers and also use of other documents
to “substantiate (or not)” some of their interpretations [33]. Another paper specified using various
types of triangulation, including “data triangulation” (use of multiple studies), “triangulation of
sources of evidence” (questionnaires, interviews, and other artifacts) as well as “analysis
triangulation” (between two researchers) to achieve “validity and trustworthiness” in place of
“generalisability and reliability” [67]. Seven of those 10 described triangulation with different
sources of data [26,80,95] such as user logs [39], data provided by observers/researchers [5], and
quantitative data gathered during their research [1,94]. One study that described their process of
agreement specified that triangulation with their multiple sources (in this case, doctors, patients,
NGOs, etc.) was not always appropriate because they maintained “contrary viewpoints” [46].

Used IRR | Specified no. of coders | Described seeking agreement | # of papers | % of papers
✓ | ✓ | ✓ | 20 | 8.0%
✓ | ✓ | x | 10 | 4.0%
✓ | x | ✓ | 1 | 0.4%
x | ✓ | ✓ | 75 | 29.9%
✓ | x | x | 1 | 0.4%
x | ✓ | x | 16 | 6.4%
x | x | ✓ | 43 | 17.1%
x | x | x | 85 | 33.9%
Total | | | 251 | 100%
A relatively small subset, 32 papers (12.7%), described using IRR. Of those 32, the majority (26
papers, 81.3%) specified a statistical method, and one paper reported using more than one method.
The six papers that did not specify a method reported IRR as a simple percentage. Cohen’s kappa
was the most commonly used measure of IRR (17 out of 32 papers, 53.1%), with Fleiss’ kappa (3
papers, 9%), Krippendorff’s alpha (3 papers, 9%), and Scott’s Pi or Kappa (2 papers, 6%) used less
often. Papers rarely described why a statistical method was chosen, although in one case, Fleiss’
kappa and Krippendorff’s alpha were each matched to a different type of data (Fleiss for nominal
or categorical data; Krippendorff for ordinal data).
Eleven papers described agreement as “substantial,” “very good,” “sufficient,” “satisfactory,”
“moderate,” and “fair to good.” Some papers conducted a second round of coding and IRR if
agreement was too low after the first round; others kept the low IRR and provided a rationale for
it, such as “the first pass inter-rater reliability test achieved a Kappa score of 0.61 as there was
some confusion about redundant codes” [89] and “The coding comparison revealed a Cohen’s
Kappa figure of 0.54… reflect a codebook which was meaningful (objective) but not too restrictive
(allowed subjectivity)” [60].
A few papers addressed non-use of IRR. For example, Paavilainen et al. stated that “As the
informants wrote clear, brief sentences making the data easy to interpret and no major
disagreements on coding arise, inter-rater reliability test was not needed” [75]. Jun et al. clarified
their decision not to use IRR saying that “Inter-rater reliability was not calculated as it is rarely
done with semi-structured interview data due to the possibility of applying the same code to
different sections of the interview” [48]. Yet another paper pointed out that they did not perform
IRR because while multiple researchers reviewed the codes, only one conducted the coding, but
also added that “multiple researchers reviewed the codes and themes and there were critical and
detailed discussions at all stages of the analysis” [91].
Though not the focus of our analysis, we observed some differences between the CHI and
CSCW datasets: 16.4% of CHI papers used IRR and 8.5% of CSCW papers did. At the same time,
more CSCW papers specified their method of agreement (i.e., describing how they “discussed,”
“disagreed,” “collaborate,” “refined,” etc. their codes or themes) (63.2%) compared with CHI
(48.5%). We report on these differences between the conferences in Table 1.
Table 4: How agreement was reported in the 249 papers that named specific methods

Method | Named a method (of 249 papers) | Used IRR* | Indicated agreement | Specified no. of coders | Cited methods text
Code for “themes” | 161 (64.7%) | 14 (8.7%) | 93 (57.8%) | 79 (49.1%) | 116 (72.0%)

*Percentages in columns to the right are out of the corresponding base “n” in this column.
**Examples of “Other”: coding, categorizing or classifying (24), memos (22), clustering/clusters
(19), deductive (13), constant comparative (12), content analysis (9), feminist HCI (3), etc.
least one measure of reliability. For example, Starbird et al. analyzed tweets for predetermined
categories which lends itself to multiple coders for scale, and measures of agreement. They stated
that “the first dimension, which we designed to identify crowd corrections, consists of five
mutually exclusive categories: Affirm, Deny, Neutral, Unrelated, and Uncodable” [84] and
subsequently calculated IRR to test agreement between coders for these categories.
methods as grounded theory. Similarly, open coding was mentioned in 85 papers (34.1%) but only
20 of those papers described their methods as grounded theory (data not shown).
highlighting, revising memos, and recording, the output of which is not always agreement, but
discovery of emergent themes as described in Klein et al. [54].
5.1.2. Expert researcher. Agreement (formal or informal) is rarely appropriate when a single
researcher with unique expertise and experience is conducting the research. Many techniques
such as memoing, reflection, and review are designed to support the single researcher process
without requiring agreement (e.g., “To answer these questions, I reviewed the data collected
during the four years of my fieldwork” [81]), though triangulation may be useful in some cases.
This is often the case in ethnographic work where researchers are embedded in a topic or field
for long periods. Barkhuus and Rossitto argue that “Subjective analysis is what makes
ethnographic methods relevant and powerful… it is impossible to take the preliminary
experiences out of the ethnographer” [7]. The ethnographer is often placed in a privileged
position by virtue of the acceptance of the ethnographer’s subjectivity, with access to an “ideal of
correspondence” between an event and its rendering [4].
participatory turn (the “insertion of the ethnographer into the scene”) suggests “the self” as “an
instrument of knowing” [28]. Dourish underscores the role of ethnographic participation not just
as an “unavoidable consequence of going somewhere, but as the fundamental point” and goes on
to say that “If we accept a view of ethnographic material as the product of occasions of
participative engagement, then we surely need to be able to inquire into the nature of that
engagement” [28]. Autoethnographic accounts are a product of often single-researcher
participation in a practice of interest and rigorous self-reflection [2] and have been advocated in
the HCI literature as a way of developing empathy with users [73].
5.1.3. Social context or social action. In some types of research where social context or social
change is integral, agreement among external researchers may not be meaningful. For research
that takes a social, ethical, or political stance, agreement may not be required as communities
themselves may be charged with producing their own insights that in turn allow them to enact
change. For example, participatory action research (PAR) engages participants in “self-
investigation” with the hope of producing action [79]. Other approaches include co-design or
interventions where the participants themselves took part in interpretation, for example, “with
this intervention ‘in the wild,’ we as investigators have participated in the study at the same time
as the participants have been investigators” [44]. One paper in our dataset embraced the
potentially idiosyncratic role of researchers interpreting the unfamiliar as follows: “abductive
reasoning can be used in situations of uncertainty… abduction is driven by astonishment,
mystery, and breakdowns” [71].
5.1.4. Agreement can be harmful. Some scholarly standpoints are philosophically at odds with
the concept of agreement. For example, feminist HCI [6,8] researchers seeking to challenge
hegemonic categories of available knowledge and to privilege marginal or subaltern perspectives
may not adopt an analytical stance a priori. Similarly, intersectional approaches challenge the
systems that reproduce inequality, including the power assumptions embedded into the very
concept of coding and labeling [12]. Research that is focused on social change, power structures,
and empowerment may consider the concept of agreement to perpetuate injustices the research
is looking to overcome [23].
5.1.5. Ease of coding. For some data and analyses, multiple coders and IRR are overkill because
the data is straightforward. If the coding is binary, then the coding likely only requires one coder.
Another example from our dataset states: “three researchers independently coded the responses
using constant comparison to iteratively arrive at themes… Nine remaining open-ended questions
were simpler elaborations, which were coded independently by one researcher” [51]. Though a
stigma may exist around having a single coder in qualitative research (which may foster the use
of “we” in single-authored papers), it is worth reflecting on when one researcher really is just as
good as two. We did not calculate IRR for our content analysis of methods sections because
flagging the presence of terms or citations is a simple form of coding that required little to no
interpretation.
5.1.6. Grounded theory. We pay special attention to the curious case of grounded theory
because although it does not require IRR, we encountered several papers that used grounded
theory methods (e.g., “grounded theory” or “grounded approach”) in conjunction with IRR.
Grounded theory often (though not always) engages in a standardized format of analysis that
requires multiple iterations of interactions with the data, followed by analysis, and then more
theoretical sampling and analysis, and finally the development of theory [20,24,72]. There are
many possible steps in the coding process: open coding to identify topics of interest, axial coding
to identify relationships among the codes so that they can be organized into clusters of more
complex themes, and selective coding where the researcher focuses more narrowly on topics
of interest [20,72].
Grounded theory rarely, if ever, requires IRR. Grinter has pointed out that IRR is not specified
in any of the foundational grounded theory texts and explained that, for the grounded theorist,
codes are “merely” an interim product that support the development of theory, not a final result
that requires testing [40]. From initial sampling to theoretical sampling, the researchers’ aim is to
refine the theory and test it “to broaden the range of situations and attributes over which the
theory makes good predictions or descriptions” [72]. The procedures used in grounded theory,
when done in this way, perform the critical, reflexive work that informal discussions of agreement
might achieve. That said, some papers in our dataset still readily report use of grounded theory
methods (e.g., open coding, axial coding, constant comparison) while also performing IRR. The
differences in language choice of using grounded theory versus grounded approach also indicate
variance in CSCW researchers’ alignments to grounded theory principles; future work could
explore whether those language choices are intentional or meaningful.
5.3.6. Confirmation bias. Most research methods are subject to confirmation bias. In qualitative
research, confirmation bias can occur in the data collection, analysis, or interpretation process
where the researcher(s) may have preconceived notions about what they will discover, or want to
discover, and those biases can inadvertently influence their scholarship. While we have noted
some cases where researchers’ perspectives are an important part of the scholarly process (which
is often acknowledged with a positionality statement in the methods section), in many cases it is
important to provide checkpoints to minimize the effects of confirmation bias. Using multiple
coders and agreement processes can help reduce individual biases of a particular researcher.
may also use these perceived standards to guide decisions about when to include justifications for
methodological choices. This may be to our collective detriment. Descriptions in our datasets such
as “in keeping with standard qualitative analysis techniques” and “followed standard guidelines
to code our themes” may be attempts to signal conformity to an accepted method; yet, our
research makes clear the diversity of approaches and epistemologies. There is no standard
qualitative method in CSCW or HCI. We recommend clear and detailed descriptions of analytical
procedures, even when researchers suspect that most people share the same practices. For Glaser
and Strauss, theory is process: “an ever-developing entity” rather than a “perfected product” [38].
When conducting and reporting qualitative research, process and rationale should be treated as
a critical part of the research.
6 CONCLUSIONS
Qualitative research is a rich and powerful way to understand the world around us. Yet, we have
demonstrated that in CSCW and HCI, there is little consensus about how to approach reliability
in qualitative research. As a result, authors often struggle to communicate methodological choices
with confidence and reviewers may communicate confusing expectations. As researchers, we
might expect that epistemological stances dictate methodological choices; however, in practice,
other forces like perceived norms and reviewer variance may play a role. Our descriptive analysis
of recent papers finds that most qualitative CSCW and HCI papers code data as part of their
research process and most provide some description of the method they use to do so. Far fewer
report the number of coders involved in the process and even fewer use IRR. Justifications for
these choices are sparse. We argue that diverse epistemological standpoints in CSCW and HCI
support these diverse uses, but that the community could be clearer about communicating the
rationale for methodological choices. Further, there is variability in how terms and concepts are
used: even when they are implemented in similar ways, they can be used differently and toward
different ends. It is precisely these differences that necessitate thorough descriptions of methods
and analytical process. Through our analysis and discussion, we examine a wide range of research
cases, more experiences and situations than perhaps any one researcher is likely to have
encountered in their own practice. Finally, we discuss considerations for best practices in writing
methods and provide a set of recommendations for when researchers should seek agreement in
qualitative research and when it is not needed, or even potentially harmful.
In sum: there is no one correct way to approach reliability in qualitative research. Different
epistemologies invite different frames; we hope that this work serves as a generative starting
point for researchers to reflect on their epistemological goals and how they produce knowledge.
ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation grants CNS-1703736 and
CHS-1552503. Thank you to Kentaro Toyama, Susan Wyche, and Denise Agosto for their valuable
feedback on early drafts of this paper.
REFERENCES
[1] Ali Abdolrahmani, William Easley, Michele Williams, Stacy Branham, and Amy Hurst. 2017. Embracing Errors:
Examining How Context of Use Impacts Blind Individuals’ Acceptance of Navigation Aid Errors. In Proceedings of
the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17), 4158–4169.
DOI:https://doi.org/10.1145/3025453.3025528
[2] Tony E Adams, Carolyn Ellis, and Stacy Holman Jones. 2017. Autoethnography. The international encyclopedia of
communication research methods (2017), 1–11.
[3] David Armstrong, Ann Gosling, John Weinman, and Theresa Marteau. 1997. The Place of Inter-Rater Reliability in
Qualitative Research: An Empirical Study. Sociology 31, 3 (August 1997), 597–606.
[4] Paul Anthony Atkinson, Amanda Coffey, Sara Delamont, John Lofland, and Lyn H. Lofland (Eds.). 2001. Handbook
of Ethnography (1st ed.). SAGE Publications Ltd, London; Thousand Oaks, Calif.
[5] Mara Balestrini, Paul Marshall, Raymundo Cornejo, Monica Tentori, Jon Bird, and Yvonne Rogers. 2016. Jokebox:
Coordinating Shared Encounters in Public Spaces. In Proceedings of the 19th ACM Conference on Computer-Supported
Cooperative Work & Social Computing (CSCW ’16), 38–49. DOI:https://doi.org/10.1145/2818048.2835203
[6] Shaowen Bardzell. 2010. Feminist HCI: Taking Stock and Outlining an Agenda for Design. In Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems (CHI ’10), 1301–1310.
DOI:https://doi.org/10.1145/1753326.1753521
[7] Louise Barkhuus and Chiara Rossitto. 2016. Acting with Technology: Rehearsing for Mixed-Media Live
Performances. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 864–875.
DOI:https://doi.org/10.1145/2858036.2858344
[8] Rosanna Bellini, Angelika Strohmayer, Ebtisam Alabdulqader, Alex A. Ahmed, Katta Spiel, Shaowen Bardzell, and
Madeline Balaam. 2018. Feminist HCI: Taking Stock, Moving Forward, and Engaging Community. In Extended
Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (CHI EA ’18), SIG02:1–SIG02:4.
DOI:https://doi.org/10.1145/3170427.3185370
[9] Kirsten Boehner and Carl DiSalvo. 2016. Data, Design and Civics: An Exploratory Study of Civic Tech. In Proceedings
of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 2970–2981.
DOI:https://doi.org/10.1145/2858036.2858326
[10] Gerd Bohner. 2001. Writing about rape: Use of the passive voice and other distancing text features as an expression
of perceived responsibility of the victim. British Journal of Social Psychology 40, 4 (December 2001), 515–529.
DOI:https://doi.org/10.1348/014466601164957
[11] Adrien Bousseau, Theophanis Tsandilas, Lora Oehlberg, and Wendy E. Mackay. 2016. How Novices Sketch and
Prototype Hand-Fabricated Objects. In Proceedings of the 2016 CHI Conference on Human Factors in Computing
Systems (CHI ’16), 397–408. DOI:https://doi.org/10.1145/2858036.2858159
[12] Geoffrey C. Bowker and Susan Leigh Star. 2000. Sorting Things Out: Classification and Its Consequences (Revised
edition ed.). The MIT Press, Cambridge, Massachusetts London, England.
[13] LouAnne E. Boyd, Alejandro Rangel, Helen Tomimbang, Andrea Conejo-Toledo, Kanika Patel, Monica Tentori, and
Gillian R. Hayes. 2016. SayWAT: Augmenting Face-to-Face Conversations for Adults with Autism. In Proceedings of
the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 4872–4883.
DOI:https://doi.org/10.1145/2858036.2858215
[14] LouAnne E. Boyd, Kyle Rector, Halley Profita, Abigale J. Stangl, Annuska Zolyomi, Shaun K. Kane, and Gillian R.
Hayes. 2017. Understanding the Role Fluidity of Stakeholders During Assistive Technology Research “In the Wild.”
In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17), 6147–6158.
DOI:https://doi.org/10.1145/3025453.3025493
[15] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology
3, 2 (2006), 77–101.
[16] Alan Bryman and Robert G. Burgess. 1994. Analyzing Qualitative Data. Routledge, London.
[17] G. Burrell and G. Morgan. 1979. Sociological Paradigms and Organizational Analysis: Elements of the Sociology of
Corporate Life. Ashgate Publishing.
[18] Kelly Caine. 2016. Local Standards for Sample Size at CHI. In Proceedings of the 2016 CHI Conference on Human Factors
in Computing Systems (CHI ’16), 981–992. DOI:https://doi.org/10.1145/2858036.2858498
[19] John L. Campbell, Charles Quincy, Jordan Osserman, and Ove K. Pedersen. 2013. Coding in-depth semistructured
interviews: Problems of unitization and intercoder reliability and agreement. Sociological Methods & Research 42, 3
(2013), 294–320.
[20] Kathy Charmaz. 2006. Constructing Grounded Theory: A Practical Guide through Qualitative Analysis (1 edition ed.).
SAGE Publications Ltd, London ; Thousand Oaks, Calif.
[21] Ana Paula Chaves and Marco Aurelio Gerosa. 2018. Single or Multiple Conversational Agents?: An Interactional
Coherence Comparison. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18),
191:1–191:13. DOI:https://doi.org/10.1145/3173574.3173765
[22] Chia-Fang Chung, Elena Agapie, Jessica Schroeder, Sonali Mishra, James Fogarty, and Sean A. Munson. 2017. When
Personal Tracking Becomes Social: Examining the Use of Instagram for Healthy Eating. In Proceedings of the 2017
CHI Conference on Human Factors in Computing Systems (CHI ’17), 1674–1687.
DOI:https://doi.org/10.1145/3025453.3025747
[23] Patricia Hill Collins. 2015. Intersectionality’s Definitional Dilemmas. Annual Review of Sociology 41, 1 (2015), 1–20.
[24] Juliet Corbin and Anselm Strauss. 2007. Basics of Qualitative Research: Techniques and Procedures for Developing
Grounded Theory (3rd ed.). SAGE Publications, Inc.
[25] John W. Creswell. 1998. Qualitative Inquiry and Research Design : Choosing Among Five Traditions. Sage Publications
Inc, Thousand Oaks, Calif.
[26] Dharma Dailey and Kate Starbird. 2017. Social Media Seamsters: Stitching Platforms & Audiences into Local Crisis
Infrastructure. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social
Computing (CSCW ’17), 1277–1289. DOI:https://doi.org/10.1145/2998181.2998290
[27] Paul Dourish. 2006. Implications for design. In Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems (CHI ’06), 541–550. DOI:https://doi.org/10.1145/1124772.1124855
[28] Paul Dourish. 2014. Reading and Interpreting Ethnography. In Ways of Knowing in HCI (Judith S Olson and Wendy
A. Kellogg (Eds)). Springer-Verlag, New York, 1–23.
[29] Ellen A. Drost. 2011. Validity and Reliability in Social Science Research. Education Research and Perspectives 38, 1
(June 2011), 105–123.
[30] Robert Elliott, Constance T. Fischer, and David L. Rennie. 1999. Evolving guidelines for publication of qualitative
research studies in psychology and related fields. British Journal of Clinical Psychology 38, 3 (September 1999), 215–
229.
[31] Robert M. Emerson, Rachel I. Fretz, and Linda L. Shaw. 2011. Writing Ethnographic Fieldnotes, Second Edition (Second
edition ed.). University of Chicago Press, Chicago.
[32] Casey Fiesler, Shannon Morrison, and Amy S Bruckman. 2016. An archive of their own: a case study of feminist HCI
and values in design. 2574–2585.
[33] Megan Finn and Elisa Oreglia. 2016. A Fundamentally Confused Document: Situation Reports and the Work of
Producing Humanitarian Information. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative
Work & Social Computing (CSCW ’16), 1349–1362. DOI:https://doi.org/10.1145/2818048.2820031
[34] Katie Z. Gach, Casey Fiesler, and Jed R. Brubaker. 2017. “Control Your Emotions, Potter”: An Analysis of Grief
Policing on Facebook in Response to Celebrity Death. Proc. ACM Hum.-Comput. Interact. 1, CSCW (December 2017),
47:1–47:18. DOI:https://doi.org/10.1145/3134682
[35] C. Geertz. 1983. Thick Description: toward an interpretive theory of culture. In The Interpretation of Cultures. Basic
Books, New York, 3–32.
[36] Kathrin Gerling, Kieran Hicks, Michael Kalyn, Adam Evans, and Conor Linehan. 2016. Designing Movement-based
Play With Young People Using Powered Wheelchairs. In Proceedings of the 2016 CHI Conference on Human Factors in
Computing Systems (CHI ’16), 4447–4458. DOI:https://doi.org/10.1145/2858036.2858070
[37] Lisa Given. 2008. Natural Setting. In The Sage encyclopedia of qualitative research methods. SAGE, London.
[38] Barney Glaser and Anselm Strauss. 1967. The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine
Transaction.
[39] Daniel Gooch, Asimina Vasalou, Laura Benton, and Rilla Khaled. 2016. Using Gamification to Motivate Students with
Dyslexia. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 969–980.
DOI:https://doi.org/10.1145/2858036.2858231
[40] Beki Grinter. 2010. Inter-Rater Reliability. Beki’s Blog. Retrieved January 29, 2019 from
https://beki70.wordpress.com/2010/09/09/inter-rater-reliability-apply-with-care/
[41] Egon G. Guba. 1981. Criteria for assessing the trustworthiness of naturalistic inquiries. Educational Communication
and Technology Journal 29, (1981), 75–91.
[42] Kevin A. Hallgren. 2012. Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor
Quant Methods Psychol 8, 1 (2012), 23–34.
[43] David Hammer and Leema K. Berland. 2014. Confusing Claims for Data: A Critique of Common Practices for
Presenting Qualitative Research on Learning. Journal of the Learning Sciences 23, 1 (January 2014), 37–46.
[44] Hanna Hasselqvist, Mia Hesselgren, and Cristian Bogdan. 2016. Challenging the Car Norm: Opportunities for ICT
to Support Sustainable Transportation Practices. In Proceedings of the 2016 CHI Conference on Human Factors in
Computing Systems (CHI ’16), 1300–1311. DOI:https://doi.org/10.1145/2858036.2858468
[45] John Hughes, Tom Rodden, and Hans Andersen. 1994. Moving out from the control room: ethnography in system
design. ACM Conference on Computer-Supported Cooperative Work (1994), 429–439.
[46] Azra Ismail, Naveena Karusala, and Neha Kumar. 2018. Bridging Disconnected Knowledges for Community Health.
Proc. ACM Hum.-Comput. Interact. 2, CSCW (November 2018), 75:1–75:27. DOI:https://doi.org/10.1145/3274344
[47] Jialun “Aaron” Jiang, Casey Fiesler, and Jed R. Brubaker. 2018. “The Perfect One”: Understanding Communication
Practices and Challenges with Animated GIFs. Proc. ACM Hum.-Comput. Interact. 2, CSCW (November 2018), 80:1–
80:20. DOI:https://doi.org/10.1145/3274349
[48] Eunice Jun, Blue A. Jo, Nigini Oliveira, and Katharina Reinecke. 2018. Digestif: Promoting Science Communication
in Online Experiments. Proc. ACM Hum.-Comput. Interact. 2, CSCW (November 2018), 84:1–84:26.
DOI:https://doi.org/10.1145/3274353
[49] Vaishnav Kameswaran, Jatin Gupta, Joyojeet Pal, Sile O’Modhrain, Tiffany C. Veinot, Robin Brewer, Aakanksha
Parameshwar, Vidhya Y, and Jacki O’Neill. 2018. “We Can Go Anywhere”: Understanding Independence Through a
Case Study of Ride-hailing Use by People with Visual Impairments in Metropolitan India. Proc. ACM Hum.-Comput.
Interact. 2, CSCW (November 2018), 85:1–85:24. DOI:https://doi.org/10.1145/3274354
[50] Matthew Kay, Steve Haroz, Shion Guha, Pierre Dragicevic, and Chat Wacharamanotham. 2017. Moving Transparent
Statistics Forward at CHI. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in
Computing Systems (CHI EA ’17), 534–541. DOI:https://doi.org/10.1145/3027063.3027084
[51] Christina Kelley, Bongshin Lee, and Lauren Wilcox. 2017. Self-tracking for Mental Wellness: Understanding Expert
Perspectives and Student Experiences. In Proceedings of the 2017 CHI Conference on Human Factors in Computing
Systems (CHI ’17), 629–641. DOI:https://doi.org/10.1145/3025453.3025750
[52] Ryan Kelly, Daniel Gooch, Bhagyashree Patil, and Leon Watts. 2017. Demanding by Design: Supporting Effortful
Communication Practices in Close Personal Relationships. In Proceedings of the 2017 ACM Conference on Computer
Supported Cooperative Work and Social Computing (CSCW ’17), 70–83. DOI:https://doi.org/10.1145/2998181.2998184
[53] Jerome Kirk and Marc L. Miller. 1986. Reliability and validity in qualitative research. Sage Publications, Beverly Hills.
[54] Maximilian Klein, Jinhao Zhao, Jiajun Ni, Isaac Johnson, Benjamin Mako Hill, and Haiyi Zhu. 2017. Quality
Standards, Service Orientation, and Power in Airbnb and Couchsurfing. Proc. ACM Hum.-Comput. Interact. 1, CSCW
(December 2017), 58:1–58:21. DOI:https://doi.org/10.1145/3134693
[55] Klaus H. Krippendorff. 2003. Content Analysis: An Introduction to Its Methodology (2nd ed.). Sage Publications, Inc.
[56] Karen S. Kurasaki. 2000. Intercoder Reliability for Validating Conclusions Drawn from Open-Ended Interview Data.
Field Methods 12, 3 (August 2000), 179–194.
[57] J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics
33, 1 (1977), 159–174. DOI:https://doi.org/10.2307/2529310
[58] Simone Lanette, Phoebe K. Chua, Gillian Hayes, and Melissa Mazmanian. 2018. How Much is “Too Much”?: The Role
of a Smartphone Addiction Narrative in Individuals’ Experience of Use. Proc. ACM Hum.-Comput. Interact. 2, CSCW
(November 2018), 101:1–101:22. DOI:https://doi.org/10.1145/3274370
[59] Amanda Lazar, Hilaire J. Thompson, Shih-Yin Lin, and George Demiris. 2018. Negotiating Relation Work with
Telehealth Home Care Companionship Technologies That Support Aging in Place. Proc. ACM Hum.-Comput.
Interact. 2, CSCW (November 2018), 103:1–103:19. DOI:https://doi.org/10.1145/3274372
[60] Pierre Le Bras, David A. Robb, Thomas S. Methven, Stefano Padilla, and Mike J. Chantler. 2018. Improving User
Confidence in Concept Maps: Exploring Data Driven Explanations. In Proceedings of the 2018 CHI Conference on
Human Factors in Computing Systems (CHI ’18), 404:1–404:13. DOI:https://doi.org/10.1145/3173574.3173978
[61] Margaret LeCompte and Judith Goetz. 1982. Problems of Reliability and Validity in Ethnographic Research. Review
of Educational Research 52, 1 (1982), 31–60.
[62] Yvonna S. Lincoln and Egon G. Guba. 1986. But is it rigorous? Trustworthiness and authenticity in naturalistic
evaluation. New Directions for Program Evaluation 1986, 30 (1986), 73–84. DOI:https://doi.org/10.1002/ev.1427
[63] Matthew Lombard, Jennifer Snyder‐Duch, and Cheryl Campanella Bracken. 2002. Content Analysis in Mass
Communication: Assessment and Reporting of Intercoder Reliability. Human Communication Research 28, 4 (2002),
587–604. DOI:https://doi.org/10.1111/j.1468-2958.2002.tb00826.x
[64] Catherine MacPhail, Nomhle Khoza, Laurie Abler, and Meghna Ranganathan. 2016. Process guidelines for
establishing Intercoder Reliability in qualitative studies. Qualitative Research 16, 2 (April 2016), 198–212.
[65] Megh Marathe and Kentaro Toyama. 2018. Semi-Automated Coding for Qualitative Research: A User-Centered
Inquiry and Initial Prototypes. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems
(CHI ’18), 348:1–348:12. DOI:https://doi.org/10.1145/3173574.3173922
[66] Joe Marshall, Conor Linehan, and Adrian Hazzard. 2016. Designing Brutal Multiplayer Video Games. In Proceedings
of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 2669–2680.
DOI:https://doi.org/10.1145/2858036.2858080
[67] Roberto Martinez-Maldonado, Lucila Carvalho, and Peter Goodyear. 2018. Collaborative Design-in-use: An
Instrumental Genesis Lens in Multi-device Environments. Proc. ACM Hum.-Comput. Interact. 2, CSCW (November
2018), 118:1–118:24. DOI:https://doi.org/10.1145/3274387
[68] Alice Marwick, Claire Fontaine, and danah boyd. 2017. “Nobody Sees It, Nobody Gets Mad”: Social Media, Privacy,
and Personal Responsibility Among Low-SES Youth. Social Media + Society 3, 2 (April 2017), 2056305117710455.
[69] Marius Mikalsen and Eric Monteiro. 2018. Data Handling in Knowledge Infrastructures: A Case Study from Oil
Exploration. Proc. ACM Hum.-Comput. Interact. 2, CSCW (November 2018), 123:1–123:16.
DOI:https://doi.org/10.1145/3274392
[70] D. R. Millen. 2000. Rapid ethnography: time deepening strategies for HCI field research. 280–286.
[71] Trine Møller. 2018. Presenting The Accessory Approach: A Start-up’s Journey Towards Designing An Engaging Fall
Detection Device. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18), 559:1–
559:10. DOI:https://doi.org/10.1145/3173574.3174133
[72] M. J. Muller and S. Kogan. 2010. Grounded theory method in hci and cscw. In Technical Report. IBM Watson Research
Center.
[73] Aisling Ann O’Kane, Yvonne Rogers, and Ann E Blandford. 2014. Gaining empathy for non-routine mobile device
use through autoethnography. 987–990.
[74] Judith S. Olson and Wendy A. Kellogg (Eds.). 2014. Ways of Knowing in HCI. Springer-Verlag, New York.
[75] Janne Paavilainen, Hannu Korhonen, Kati Alha, Jaakko Stenros, Elina Koskinen, and Frans Mayra. 2017. The
PokéMon GO Experience: A Location-Based Augmented Reality Mobile Game Goes Mainstream. In Proceedings of
the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17), 2493–2498.
DOI:https://doi.org/10.1145/3025453.3025871
[76] Chanda Phelan, Cliff Lampe, and Paul Resnick. 2016. It’s Creepy, But It Doesn’t Bother Me. In Proceedings of the
2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 5240–5251.
DOI:https://doi.org/10.1145/2858036.2858381
[77] Laura R. Pina, Carmen Gonzalez, Carolina Nieto, Wendy Roldan, Edgar Onofre, and Jason C. Yip. 2018. How Latino
Children in the U.S. Engage in Collaborative Online Information Problem Solving with Their Families. Proc. ACM
Hum.-Comput. Interact. 2, CSCW (November 2018), 140:1–140:26. DOI:https://doi.org/10.1145/3274409
[78] David Pinelle and Carl Gutwin. 2000. A Review of Groupware Evaluations. In Proceedings of the 9th IEEE International
Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE ’00), 86–91.
[79] MD A. Rahman. 2008. Some Trends in the Praxis of Participatory Action Research. In The SAGE Handbook of Action
Research (P. Reason and H. Bradbury (eds)). Sage, London, 49–62.
[80] Shriti Raj, Mark W. Newman, Joyce M. Lee, and Mark S. Ackerman. 2017. Understanding Individual and
Collaborative Problem-Solving with Patient-Generated Data: Challenges and Opportunities. Proc. ACM Hum.-
Comput. Interact. 1, CSCW (December 2017), 88:1–88:18. DOI:https://doi.org/10.1145/3134723
[81] Amon Rapp. 2018. Gamification for Self-Tracking: From World of Warcraft to the Design of Personal Informatics
Systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18), 80:1–80:15.
DOI:https://doi.org/10.1145/3173574.3173654
[82] Marén Schorch, Lin Wan, David William Randall, and Volker Wulf. 2016. Designing for Those Who Are Overlooked:
Insider Perspectives on Care Practices and Cooperative Work of Elderly Informal Caregivers. In Proceedings of the
19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW ’16), 787–799.
DOI:https://doi.org/10.1145/2818048.2819999
[83] Alfred Schutz. 1967. The Phenomenology of the Social World. Northwestern University Press.
[84] Kate Starbird, Emma Spiro, Isabelle Edwards, Kaitlyn Zhou, Jim Maddock, and Sindhuja Narasimhan. 2016. Could
This Be True?: I Think So! Expressed Uncertainty in Online Rumoring. In Proceedings of the 2016 CHI Conference on
Human Factors in Computing Systems (CHI ’16), 360–371. DOI:https://doi.org/10.1145/2858036.2858551
[85] Anselm Strauss. 1987. Qualitative Analysis for Social Scientists. Cambridge University Press, Cambridge.
[86] Aaron Tabor, Scott Bateman, Erik Scheme, David R. Flatla, and Kathrin Gerling. 2017. Designing Game-Based
Myoelectric Prosthesis Training. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems
(CHI ’17), 1352–1363. DOI:https://doi.org/10.1145/3025453.3025676
[87] Matthieu Tixier and Myriam Lewkowicz. 2016. “Counting on the Group”: Reconciling Online and Offline Social
Support Among Older Informal Caregivers. In Proceedings of the 2016 CHI Conference on Human Factors in Computing
Systems (CHI ’16), 3545–3558. DOI:https://doi.org/10.1145/2858036.2858477
[88] James R. Wallace, Saba Oji, and Craig Anslow. 2017. Technologies, Methods, and Values: Changes in Empirical
Research at CSCW 1990 - 2015. Proc. ACM Hum.-Comput. Interact. 1, CSCW (December 2017), 106:1–106:18.
DOI:https://doi.org/10.1145/3134741
[89] April Y. Wang, Ryan Mitts, Philip J. Guo, and Parmit K. Chilana. 2018. Mismatch of Expectations: How Modern
Learning Resources Fail Conversational Programmers. In Proceedings of the 2018 CHI Conference on Human Factors
in Computing Systems (CHI ’18), 511:1–511:13. DOI:https://doi.org/10.1145/3173574.3174085
[90] William G. Cochran. 1977. Sampling Techniques (3rd ed.). John Wiley & Sons, New York.
[91] Lillian Yang and Carman Neustaedter. 2018. Our House: Living Long Distance with a Telepresence Robot. Proc. ACM
Hum.-Comput. Interact. 2, CSCW (November 2018), 190:1–190:18. DOI:https://doi.org/10.1145/3274459
[92] Svetlana Yarosh, Elizabeth Bonsignore, Sarah McRoberts, and Tamara Peyton. 2016. YouthTube: Youth Video
Authorship on YouTube and Vine. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative
Work & Social Computing (CSCW ’16), 1423–1437. DOI:https://doi.org/10.1145/2818048.2819961
[93] Svetlana Yarosh, Sarita Schoenebeck, Shreya Kothaneth, and Elizabeth Bales. 2016. “Best of Both Worlds”:
Opportunities for Technology in Cross-Cultural Parenting. In Proceedings of the 2016 CHI Conference on Human
Factors in Computing Systems (CHI ’16), 635–647. DOI:https://doi.org/10.1145/2858036.2858210
[94] Svetlana Yarosh and Pamela Zave. 2017. Locked or Not?: Mental Models of IoT Feature Interaction. In Proceedings of
the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17), 2993–2997.
DOI:https://doi.org/10.1145/3025453.3025617
[95] Jason C. Yip, Tamara Clegg, June Ahn, Judith Odili Uchidiuno, Elizabeth Bonsignore, Austin Beck, Daniel Pauw, and
Kelly Mills. 2016. The Evolution of Engagements and Social Bonds During Child-Parent Co-design. In Proceedings of
the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 3607–3619.