American Antiquity (2023), 1–24
doi:10.1017/aaq.2023.4
ARTICLE
Replicability in Lithic Analysis
Justin Pargeter¹, Alison Brooks², Katja Douze³, Metin Eren⁴, Huw S. Groucutt⁵, Jessica McNeil⁶, Alex Mackay⁷, Kathryn Ranhorn⁸, Eleanor Scerri⁹, Matthew Shaw¹⁰, Christian Tryon¹¹, Manuel Will¹², and Alice Leplongeon¹³

¹Department of Anthropology, New York University, NY, USA; Palaeo-Research Institute, University of Johannesburg, Johannesburg, South Africa
²Center for the Advanced Study of Human Paleobiology, Department of Anthropology, George Washington University, Washington, DC, USA; Human Origins Program, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
³Laboratory Archaeology and Population in Africa, Section of Biology, Faculty of Science, University of Geneva, Geneva, Switzerland
⁴Department of Anthropology, Kent State University, Kent, OH, USA; Department of Archaeology, Cleveland Museum of Natural History, Cleveland, OH, USA
⁵Department of Classics and Archaeology, University of Malta, Msida, Malta; Extreme Events Research Group, Max Planck Institutes for the Science of Human History, Chemical Ecology, and Biogeochemistry, Jena, Germany
⁶Department of Anthropology, Harvard University, Cambridge, MA, USA
⁷Center for Archaeological Science, University of Wollongong, Wollongong, Australia; Department of Archaeology, University of Cape Town, Cape Town, South Africa
⁸School of Human Evolution and Social Change, Arizona State University, Tempe, AZ, USA; Institute of Human Origins, Arizona State University, Tempe, AZ, USA
⁹Pan-African Evolution Research Group, Max Planck Institute for the Science of Human History, Jena, Germany; Department of Prehistoric Archaeology, University of Cologne, Cologne, Germany
¹⁰Center for Archaeological Science, University of Wollongong, Wollongong, Australia
¹¹Department of Anthropology, University of Connecticut, Storrs, CT, USA; Department of Anthropology, Harvard University, Cambridge, MA, USA; Human Origins Program, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
¹²Department of Early Prehistory and Quaternary Ecology, University of Tübingen, Tübingen, Germany
¹³Department of Archaeology, KU Leuven, Leuven, Belgium; UMR Histoire naturelle de l'Homme Préhistorique, Muséum national d'Histoire naturelle – Centre National de la Recherche Scientifique – Université de Perpignan Via Domitia, Paris, France

Corresponding author: Justin Pargeter, Email: [email protected]
(Received 13 July 2022; revised 16 December 2022; accepted 9 January 2023)
Abstract
The ubiquity and durability of lithic artifacts inform archaeologists about important dimensions of human behavioral variability. Despite their importance, lithic artifacts can be problematic to study because lithic analysts differ widely in their theoretical approaches and the data they collect. The extent to which differences in lithic data relate to prehistoric behavioral variability or to differences between archaeologists today remains incompletely known. We address this issue with the most extensive lithic replicability study yet, involving 11 analysts, 100 unmodified flakes, and 38 ratio, discrete, and nominal attributes. We use mixture models to show strong inter-analyst replicability scores on several attributes, making them well suited to comparative lithic analyses. Based on our results, we highlight 17 attributes that we consider reliable for compiling datasets collected by different individuals for comparative studies. Demonstrating this replicability is a crucial first step in tackling more general problems of data comparability in lithic analysis and lithic analysts' ability to conduct large-scale meta-analyses.
Resumen

The ubiquity and durability of lithic artifacts provide archaeologists with important data on the dimensions of human behavioral variability. Despite their importance, lithic artifacts can be problematic to study because lithic specialists differ widely in their theoretical approaches and in the data they collect. Whether differences in lithic data reflect variability in prehistoric behavior or are instead tied to differences among the archaeologists who study them today remains a partially open question. We address this problem with the most extensive lithic replicability study conducted to date, involving 11 specialists, 100 flakes, and 38 continuous, discrete, and nominal attributes. Using mixture models, we show high replicability results among the participating specialists on some attributes, which makes them well suited to comparative lithic analyses. Based on our results, we identify 17 attributes that we consider reliable when compiling datasets collected by different individuals for comparative analyses. Demonstrating this replicability is a crucial first step toward addressing more general problems of data comparability in lithic analysis and the possibility of conducting large-scale meta-analyses combining multiple datasets.

© The Author(s), 2023. Published by Cambridge University Press on behalf of the Society for American Archaeology
Keywords: stone tools; attribute analysis; inter-analyst replicability
Palabras clave: stone tools; attribute analysis; inter-analyst replicability
Comparative research greatly benefits Paleolithic archaeologists because of the field’s vast temporal and
spatial remit. However, comparative research requires datasets in which researchers know that the individual data points are robust, repeatable, and comparable; that is, data are collected similarly and compared against similar standards. Notably, such studies must recognize the error and uncertainty
involved in specific types of data and the way they are collected. While archaeologists working with
large datasets of published radiometric dates acknowledge this issue (Carleton and Groucutt 2021;
Mauz et al. 2021; Scott et al. 2018; Stewart et al. 2021), lithic analysts have yet to deal with these issues
systematically.
Stone tools are durable and ubiquitous, and they tend to pattern in space and time.
Comparative lithic analysis therefore remains, and must remain, a cornerstone for understanding
human behavioral evolution and the evolution of technology. Yet, to achieve this potential, stone
tool analysts need to know the comparability of their units of analysis. Data incongruence is a significant problem given that stone tools make up most or all of the archaeological record for a span of approximately 3–2 million years, particularly in Africa (Harmand et al. 2015; Shea 2016).
In lithic studies, researchers from different backgrounds often take different conceptual approaches
to their analyses. They may record different kinds of data entirely, or even similar types of essential
information in different ways (e.g., Andrefsky 2005; Holdaway and Stern 2004; Shea 2013; Van Peer
1992). For example, proponents of the chaîne opératoire approach generally focus on qualitative
classification of tool production systems, whereas analysts in the Americanist attribute-based
system often pursue quantitative measures of reduction intensity. This difference is true even for
seemingly simple attributes such as “length,” for which multiple definitions exist. Andrefsky
(2005:100), for example, shows how analysts can measure flake length in at least two different ways:
as a line perpendicular to the striking platform width or as the maximum distance from the
proximal to the distal end along a line perpendicular to the striking platform width. Dogandžić
and colleagues (2015) show that calculations of flake edge length and surface area based on datasets where
analysts recorded length using different methods are prone to large variance and errors. If variance
exists in measuring even these basic lithic attributes, problems will clearly arise when researchers construct and compare large lithic datasets generated by multiple analysts and analytical approaches.
Comparative lithic analysis aims to achieve high consistency and low error rates when recording and
measuring attributes on lithic artifacts between observers. Increasing such inter-analyst replicability is
a goal common to all empirical sciences. A lack of clear standards for assessing data quality and replicability has led to the recent “reproducibility crisis” (Baker 2016). Even though the importance of
analytical standardization is undisputed, there are surprisingly few studies explicitly tackling
analyst-induced variance in lithic technological analyses (Conard et al. 2004; Tixier 1963).
Exceptions exist in the field of lithic use-wear (Crowther and Haslam 2007; Newcomer et al. 1986;
Rots et al. 2006; Wadley et al. 2004), but these studies focus on tool use rather than lithic
production strategies.
Classic inter-analyst replicability studies by Fish (1978), Wilmsen and Robert (1978), and Dibble and Bernard (1980) were followed by only a limited number of more recent assessments quantifying the effect of different observers on lithic data quality. Table 1 summarizes the most relevant
studies for lithic analyses focused on assessing the extent, source, and relevance of inter-analyst
replicability.

Table 1. Summary of Previous Lithic Inter-Analyst Replicability Studies. Columns: study | observers | lithics | number of attributes | attributes | attribute classes. All numerical values are counts.

Fish 1978 | 3 | 25 | 6 | Maximum length, technological length, width, thickness, striking platform angle, dorsal cortex | Ratio
Wilmsen and Robert 1978 | 4 | 4 | 8 | Length, width, thickness, platform thickness, flake angle, cutting edge angle, distal edge angle, left lateral edge angle, right lateral edge angle | Ratio
Dibble and Bernard 1980 | 6 | 29 | 1 | Edge angle (comparison of four different methods to measure edge angle) | Ratio
Perpére 1986 | 3 | 198 | 1 | Levallois (Y/N) | Nominal
Boyd 1987 | 3 | 246 | 3 | Working edge type, technological class, raw material type | Nominal
Calogero 1992 | 5 | 17 | 1 | Raw material type | Nominal
Gnaden and Holdaway 2000 | 15 | 211 | 9 | Length, width, thickness, platform thickness, platform width, termination type, cortex, platform type, artifact form | Ratio and nominal
Lycett et al. 2006 | 2 | 3 | 3 | Length, width, thickness (all at 10% increments) | Ratio
Mackay 2008 | 7 | 58 | 1 | Blade (Y/N) | Nominal
Driscoll 2011 | 47 | 20 | 5 | Flake type, core type, debitage type, fragmentation, retouch type | Nominal
Proffitt and de la Torre 2014 | 4 | 765 | 9 | Technological category, platform cortex, platform facets, knapping accidents, step scars, dorsal surface cortex, number of dorsal negatives, direction of dorsal negatives, Toth flake category | Ratio and nominal
Timbrell et al. 2022 | 7 | 6 | 5 | Maximum dimension, width, thickness, shape (22 GMM) | Ratio
Pargeter et al. (this study) | 11 | 100 | 38 | See Supplemental Table 2 | Ratio, discrete, and nominal

Principally, all studies in Table 1 found that replicability between analysts is an issue—
larger than generally anticipated and with important ramifications for subsequent interpretations.
Problems included measuring basic dimensions such as flake length, seemingly straightforward assessments such as counting the number of flake dorsal negative scars, and more complex inferences such
as identifying whether a flake belongs to a specific technological system. Previous studies have a generally low number of examined attributes (median = 4; range = 1–9), a low number of observers
(median = 5; range = 2–47, mostly from a close group of coworkers and students), and small lithic samples (Table 1). For the sake of this article, we exclude inter-analyst use-wear studies. As a result, we still
lack the following:
• Quantitative measures for inter-analyst variability in lithic studies
• Tests to better understand the causes of inter-analyst differences
• Recommendations for fixing issues in inter-analyst replicability
We assembled the “Comparative Analyses of Middle Stone Age Artifacts in Africa” (CoMSAfrica;
Will et al. 2019) group to address these issues with specific reference to the African Middle Stone Age
(MSA), but the group immediately recognized that the problems of lithic inter-analyst
replicability extend well beyond any time or place. In this article, we report on the group’s first of
three inter-analyst replicability studies, with this one focused on unretouched lithic flakes. We present
data showing possible reasons for poor inter-analyst reliability on some of our attributes, and we suggest ways to improve the replicability of future lithic attribute analyses.
The CoMSAfrica Project
The CoMSAfrica project started as a three-day workshop at Harvard University (USA) in 2018
(Will et al. 2019). The workshop brought together 12 international lithic analysts (see author list)
from seven countries working in different periods and regions of Africa, with varied methodological
backgrounds (e.g., chaîne opératoire and attribute analysis) and levels of seniority (full professor to
PhD student). The group aimed to compare African MSA lithic assemblages at the initial workshop.
The project’s long-term objective is to use African MSA lithic assemblages in comparative
continental-scale studies to unpack spatial and temporal variation among Pleistocene H. sapiens
populations.
Intense discussions at the 2018 meeting made it clear that these goals were too ambitious and that continental-scale comparisons were impossible until we understood differences in how group members recorded their lithic data. In 2018, we established a minimum number of attributes that each group member currently used or considered useful for reliable comparative analysis.
We initially focused on unretouched flakes because they form the dominant category of all lithic
assemblages and they carry important information about lithic production methods, techniques,
and reduction intensity. This study forms the basis for future work involving cores and retouched tools.
To maximize replicability, the group derived a set of definitions for each attribute from existing literature and lithic recording systems used by the project’s members and others (e.g., Scerri et al. 2014;
Shea 2013; Tostevin 2012; Wilkins et al. 2017). Again, although the group focuses on African MSA
lithic assemblages, our current protocols relate to issues faced by lithic analysts working at almost
any time or location.
In this study, we address the following seven research questions that arose from the process of data
exploration:
(1) Which lithic attributes are analysts able to code more reliably, and which are they able to code
less reliably?
(2) Does limiting the number of possible attribute states impact inter-analyst replicability?
(3) Do specific flake characteristics (i.e., differences in flake shape, etc.) impact inter-analyst
replicability?
(4) Does the inclusion of images in definitions impact inter-analyst replicability?
(5) What degree of measurement precision is realistic in lithic analysis?
(6) Do differences in lithic flaking systems impact inter-analyst replicability?
(7) Do the analyst's experience and training impact inter-analyst replicability?
Methods and Materials
This study included lithic analysts from diverse research traditions and with varying levels of experience. We intentionally included multiple backgrounds to provide a wide range of inputs into the
recording system and to avoid the false positives that might result when related analysts confront
diverse sets of attributes. The study used alpha-numeric analyst IDs to ensure each analyst’s anonymity. We asked analysts to list their years of experience conducting lithic analysis and to rate their training using chaîne opératoire and quantitative methods on a scale from 1 to 5 (1 = lowest, 5 = highest).
The group averaged 18.5 years of experience (median = 13, range = 9–58), 3.2 on the self-reported
quantitative training scale (median = 3, range = 1–5), and 3.5 on the self-reported chaîne opératoire
training scale (median = 3.5, range = 2–5; see Supplemental Table 1).
After presenting their existing recording systems, the participants selected a common subset of attributes for this study. In subsequent discussions, all participants jointly agreed on a definition for each of
the study’s attributes. This approach ensured that the recording system represented the group’s training
backgrounds and experience levels. All attributes had to satisfy one common criterion: participants
accepted them as potentially useful for studying African MSA human behavioral variability. The
group selected 38 attributes for this study (see Supplemental Table 1).
The Flake Attributes
We divide our attributes into three broad classes:
(1) Ratio-scale attributes (n = 17): analysts recorded measurements on the flakes (e.g., flake length,
width, thickness, and mass). Ratio-scales refer to data with a true zero and equal intervals
between neighboring points.
(2) Discrete-scale attributes (n = 5): analysts counted attribute expressions as whole numbers (e.g.,
dorsal scar counts).
(3) Nominal attributes (n = 16): analysts selected options from a predefined list of descriptive
characteristics (e.g., platform types). Analysts use nominal scales to label attributes with no
quantitative value.
During data cleaning, we edited certain text inputs to lump answers with slight variations that otherwise referred to the same technological system (e.g., “Levallois variant” and “Levallois”). Following
conventions in lithic analysis, we refer to subdivisions within each attribute as “states” (also called
“expressions” or “levels”; Andrefsky 2005:65; Holdaway and Stern 2004:98). Each nominal attribute
had between three and 10 attribute states. However, we left two (“Flake type” and “Reduction system”)
as free text and open to the analysts’ unconstrained input, although in both cases, we included a range
of suggested attribute states—16 for “Flake type” and 9 for “Reduction system” (see Supplemental
Table 2). Besides these two examples, we designed the attribute states to be exhaustive and typically
provided a range of prescribed states and one “other” state. For example, the attribute “Distal plan
form” includes the states “Flat,” “Pointed,” “Rounded,” and “Irregular,” the last of which captures
all nonconforming shapes (see Supplemental Table 2).
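To illustrate the kind of free-text lumping performed during data cleaning, here is a minimal sketch in R; the column name and variant spellings are hypothetical illustrations rather than the project's actual inputs (the project's published cleaning code is available via the OSF link cited below).

```r
# A sketch of the free-text lumping performed during data cleaning.
# The column name and the variant spellings are hypothetical.
flakes <- data.frame(
  reduction_system = c("Levallois", "Levallois variant", " levallois",
                       "Multiplatform", "multi-platform")
)

# Normalize case and whitespace, then lump variants that refer to
# the same technological system into a single attribute state.
x <- tolower(trimws(flakes$reduction_system))
x[grepl("levallois", x)] <- "Levallois"
x[grepl("multi", x)] <- "Multiplatform"
flakes$reduction_system_clean <- x

table(flakes$reduction_system_clean)
```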
Before the analysis, the group agreed on a textual definition for all attributes, with instructions for
their measurement. We added pictures to the definitions in some cases, particularly for size measurements. Wherever possible, we sourced attribute definitions and pictures from publications (see
Supplemental Table 2). We typically did not define attribute states. The lack of a textual or “logical”
description for attribute states was not a conscious part of our research design but reflected common
practice in lithic artifact research—one that we do not recommend for future studies (see below).
We compiled the study's attributes into two main data recording systems. The first uses the open-source E4 data entry program by Shannon McPherron and Harold Dibble (www.oldstoneage.com). We
used E4 to speed up data entry and reduce data entry errors. The E4 program creates conditional statements that allow certain variables to be skipped based on values entered for previous variables. The
benefits of E4 over other data logging methods, such as Microsoft Excel, are that E4 prevents users
from directly accessing a project database when entering data, and it prevents users from manually
entering text inputs, which can lead to transcription errors. The program helps reduce data entry errors
and increases data entry consistency. Several analysts were, however, more comfortable using Microsoft
Excel (with a series of predefined columns and drop-down menus). This variance in the data recording
method was a poor design choice that resulted in substantial time dedicated to data cleaning (see R
code for details: https://osf.io/seh2t/?view_only=9097ef58225b49e48f66afb220022fbf).
The Flake Assemblage
The flake assemblage comprised one raw material—chert—because its physical properties are analogous to many finer-grained raw materials found in African MSA and other lithic assemblages.
Chert is also relatively fine grained and homogeneous, and it fractures relatively reliably. This choice of
raw material meant that the group worked with a raw material likely to produce a high proportion
of flakes with “readable” technological characteristics.
One person (M. I. Eren) knapped all the flakes with a hard stone hammer and a direct freehand
percussion technique. He used two continuous, individual reduction strategies: recurrent unidirectional Levallois and a migrating multiplatform strategy in which no platform surface was preferred.
Admittedly, this is a limited framework, but with the study’s otherwise complex recording methods,
we decided to simplify the technological comparisons. These two reduction strategies cover a large
amount of variance in African MSA lithic assemblages (Shea 2020) and occur in other periods and
geographical areas. They also allowed us to test the attribute system on two different, but representative, flaking variants. During the study, analysts were unaware of these assemblage differences.
Eren reduced two cores until he had produced 100 flakes from each reduction strategy, and from each set of 100, we used a random number generator to select 50 flakes. The flakes, bagged separately,
were boxed for shipping to each of the study’s 11 participants. The team shared and shipped a set
of digital calipers for flake measurements and used their own scales for mass measurements.
Analysts examined the assemblages independently, without fixed protocols for lighting or the use of
magnifying lenses, among other things (cf. best practices listed in Blumenschine et al. 1996). The participants did not know which flake assemblage corresponded to which knapping strategy. Analysts did
not discuss observations until everyone had studied all the flakes, which took about two years.
Statistical Methods
Our primary research question is this: Which attributes are analysts able to code more reliably, and
which are they able to code less reliably? To answer this question, we used replicability coefficients.
Replicability describes the relative partitioning of variance in a measurement or other assessment
into within-group and between-group sources of variance. Researchers generally refer to this measure
as inter-rater repeatability (IRR; Hallgren 2012; Stoffel et al. 2017). We use inter-analyst replicability in
this article.
We used a mixed effects model framework to estimate IRR and its uncertainty on the study’s attributes using the rpt function in R version 4.0.3’s rptR package (R Core Team 2021; Stoffel et al. 2017).
Where analysts take repeated measures (e.g., quantifications of flake attributes) on the same objects
(i.e., stone flakes), replicability is estimated as the variance among group means (in our
case, each flake measured 11 times) relative to the sum of group-level and data-level (individual measurements) variance. We included each analyst’s anonymous ID and the two technological assemblage
codes as random effect components to estimate the replicability at the level of each flake and across
the two flaking systems. Higher replicability values show greater agreement between different analysts
(1 = perfect agreement, 0 = no agreement). We modeled ratio data (i.e., flake maximum length) as
approximating a normal distribution using rpt’s Gaussian parameter. We modeled discrete attributes
(i.e., flake scars) using rpt’s Poisson parameter. Dorsal cortex is a proportion (scored from 0 to 1) for
which the rpt function does not yet have an inbuilt error function. Therefore, we omitted the dorsal
cortex attribute from the IRR analysis, but we discuss qualitative observations on this attribute in this
article. We also used standard deviation as a percentage of the mean for each ratio-scale attribute on
each flake to track absolute error in our measurements. This measurement allowed us to determine if
there are differences in relative (IRR) versus absolute (standard deviation as a percentage of the mean
or the coefficient of variation [CV]) inter-analyst errors on the measurements.
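For readers who want to adapt this approach, the following is a minimal sketch of the estimation in R using the rptR package named above; the data frame and column names (flake_data, flake_id, analyst_id, assemblage, tech_length, scar_count) are hypothetical stand-ins, and the study's published analysis code is in the OSF repository cited in this article.

```r
# A minimal sketch of the IRR estimation described above, using the rptR
# package (Stoffel et al. 2017). All data frame and column names are
# hypothetical stand-ins for the study's long-format data.
library(rptR)

# Ratio-scale attribute (e.g., technological length): Gaussian model with
# flake identity as the grouping factor of interest, plus analyst and
# assemblage as additional random effects.
irr_length <- rpt(
  tech_length ~ (1 | flake_id) + (1 | analyst_id) + (1 | assemblage),
  grname = "flake_id", data = flake_data,
  datatype = "Gaussian", nboot = 1000
)

# Discrete attribute (e.g., total dorsal scar count): Poisson model.
irr_scars <- rpt(
  scar_count ~ (1 | flake_id) + (1 | analyst_id) + (1 | assemblage),
  grname = "flake_id", data = flake_data,
  datatype = "Poisson", nboot = 1000
)

# Absolute error check: coefficient of variation of the 11 analysts'
# measurements on each flake.
cv_by_flake <- tapply(
  flake_data$tech_length, flake_data$flake_id,
  function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
)
```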
For our nominal data, we used the first-order agreement coefficient (AC1), where analysts classified
flake attributes into one category among a limited number of possible categories (Gwet 2008). The AC1
coefficient accounts for chance agreement between analysts in the presence of high agreement and can
handle inputs from multiple raters. We implemented the analysis using the gwet.ac1.raw function in
R's irrCAC package. We omitted instances where analysts did not rate a specific flake or where, for whatever reason, fewer than four analysts classified a flake. For some attributes, this reduced the sample considerably; for "Reduction system," we ended up with 31 flakes.
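A minimal sketch of this chance-corrected agreement calculation follows, using the irrCAC function named above; the toy ratings table is hypothetical (one row per flake, one column per analyst, cells holding the selected attribute state, NA marking unrated flakes).

```r
# A minimal sketch of Gwet's AC1 for a nominal attribute, using the
# irrCAC package. The ratings table is a hypothetical toy example.
library(irrCAC)

ratings <- data.frame(
  analyst_A = c("Feather", "Hinge", "Feather", NA),
  analyst_B = c("Feather", "Hinge", "Step", "Feather"),
  analyst_C = c("Feather", "Hinge", "Feather", "Feather")
)

# AC1 coefficient with its standard error and 95% confidence interval.
gwet.ac1.raw(ratings)$est
```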
What constitutes a robust inter-analyst replicability estimate will depend on the nature of the study.
Cohen (1960) provides a general guide that we use to interpret this study’s IRR values: values ≤0 indicate no agreement, 0.01–0.20 indicate none to slight, 0.21–0.40 indicate fair, 0.41–0.60 indicate moderate, 0.61–0.80 indicate substantial, and 0.81–1.00 indicate strong agreement.
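For convenience, this verbal scale can be encoded as a small helper; the function below is purely illustrative.

```r
# A small helper encoding Cohen's (1960) verbal scale quoted above.
agreement_label <- function(irr) {
  cut(irr,
      breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1.00),
      labels = c("none", "none to slight", "fair", "moderate",
                 "substantial", "strong"))
}
agreement_label(c(0.15, 0.55, 0.85))  # "none to slight" "moderate" "strong"
```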
Our study also involved building several linear models to determine, for example, the impact of an analyst's prior experience on flake measurement performance. We built these models using R's base lm function, fitting different error functions (i.e., Gaussian and Poisson) to account for different response variable scales, or with two-way analysis of variance (ANOVA) using R's base aov function.
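Sketches of these model fits follow, under hypothetical names (analyst_data, attribute_data, and their columns are stand-ins, not the project's actual variables).

```r
# Years of experience versus measurement performance (Gaussian).
summary(lm(performance ~ years_experience, data = analyst_data))

# Discrete response modeled with a Poisson error function via glm().
summary(glm(state_count ~ ac1, family = poisson, data = attribute_data))

# Two-way ANOVA, as used for the image presence/absence comparisons.
summary(aov(irr ~ has_image * attribute_class, data = attribute_data))
```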
To evaluate potential causes for inter-analyst variance and whether some flakes led to a higher
inter-analyst variance, we identified flake outliers using the Interquartile Range (IQR) for each ratio
and discrete attribute. Here, a value is considered an outlier when it falls above the seventy-fifth or
below the twenty-fifth percentile by a factor of 1.5 times the IQR. Because we were only interested
in flakes with higher inter-analyst variance, we only considered outliers falling above the seventy-fifth
percentile.
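A minimal sketch of this one-sided IQR screen, again using the hypothetical column names from earlier:

```r
# Inter-analyst spread of one attribute on each flake.
spread <- tapply(flake_data$tech_length, flake_data$flake_id,
                 sd, na.rm = TRUE)

# Flag flakes whose spread exceeds Q3 + 1.5 * IQR (high side only).
q <- quantile(spread, c(0.25, 0.75), na.rm = TRUE)
high_fence <- q[2] + 1.5 * (q[2] - q[1])
outlier_flakes <- names(spread)[spread > high_fence]
```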
Results
Supplemental Tables 3–21 provide detailed summary data for each of the 38 attributes tested in this
study. Supplemental Tables 3 and 4 document average results by analyst and flake for the ratio, discrete, and nominal attributes, whereas Supplemental Tables 5–21 provide summaries of the discrete
attributes by flake. Here, we limit the results to our primary research questions.
Are Some Attributes More Replicable Than Others?
Figure 1 shows the IRR results for the study’s 17 ratio-scale attributes with 95% confidence intervals
(see Supplemental Table 3). Ten of these attributes show strong inter-analyst measurement agreement. Seven show less, but still substantial, agreement (IRR >0.6 and <0.8)
between the 11 analysts. These seven attributes include the four platform measurements (width and
three platform thickness measurement variants) and three technologically oriented size measurements
(thickness at the proximal end, thickness at the distal end, and width at the distal end). As expected,
flake mass showed the highest IRR values, with maximum flake dimension and technological length
showing very high IRR scores.
Figure 2 presents the CV values for each ratio-scale attribute on each flake. Our measurement CV
values show very low effective variance in the measurements (mean = 0.09, range = 0.009–0.18). The
ordering of attributes along this measure follows approximately the same pattern seen in the IRR
data (Figure 1). The results show that (a) our relative measure of error (IRR) tracks our absolute measure of error (CV), and (b) that simple measures such as CVs can trace some of the variance present in
our more complex IRR calculations. This result also reaffirms the overall strong performance of our
ratio-scale attributes.
Figure 1. Summary IRR data for the study’s 17 ratio-scale attributes. The dashed line indicates the cutoff for substantial
agreement among raters. All measurements fall well above the substantial agreement threshold. Error bars show 95% confidence intervals.
Figure 3 shows the IRR results for our five discrete-scale attributes tracking flake dorsal scar characteristics. Three of these attributes (left sector scars, distal sector scars, and right sector scars) fall
below the substantial agreement threshold. Proximal sector scar counts show an IRR value above
the substantial agreement threshold. Total dorsal scar counts showed the highest IRR value within
the substantial agreement threshold.
Figure 4 shows the overall IRR results for the study’s 16 nominal attributes (see Supplemental
Table 4). Five of these attributes show IRR values within the strong agreement range. The top-performing nominal attribute tracks analysts' ability to identify basic flake fracture mechanics features
(i.e., bending, wedging, or Hertzian initiations), with the “free-text” input attributes (“Reduction system”) also performing very well. Four nominal attributes show IRR values within the substantial agreement threshold (flake termination, form, completeness, and platform lipping). Seven of these attributes
show IRR values below the substantial agreement threshold. Five lower-performing attributes relate to
flake shape characteristics (ventral plan form, distal plan form, lateral edge shape, cross-section shape,
and platform morphology).
Does Limiting the Number of Potential Attribute States Impact Inter-Analyst Replicability?
Having observed the study’s wide-ranging (and generally lower) performance for inter-analyst replicability among our nominal data (Figure 4), we asked whether each attribute’s number of states among
which analysts could choose impacted some of this variability (see Supplemental Table 23).
Figure 2. Summary showing the CV for each ratio-scale attribute on each flake. Outlier values with CV >0.5 are excluded from this plot.

Figure 3. Summary IRR data for the study's discrete-scale attributes. The dashed line indicates the cutoff for substantial agreement among raters. Error bars show 95% confidence intervals.

A generalized linear model with a Poisson error parameter, accounting for the discrete scale of the response variable (attribute state counts), shows a significant relationship between attribute state counts and inter-analyst replicability (df = 15, p = 0.01). Nominal attributes with more states tend to show lower inter-analyst
replicability scores, whereas attributes with fewer states perform better. Three attributes (ventral
plan form, distal plan form, and flake completeness) are notable outliers.

Figure 4. Summary AC1 data for attributes. The dashed line indicates the cutoff for substantial agreement among raters. Error bars show 95% confidence intervals.

Ventral and distal plan forms have fewer states (n = 5 and n = 4), but analysts struggled to agree on their coding (see
Supplemental Table 22). This result is likely because they require analysts to make complex decisions
about flake shape and specific flake locations. Flake completeness has more states (n = 8), but analysts
had fewer issues coding it. This is likely because the flakes in our assemblages had fewer breakages than
the average archaeological assemblage.
Do Specific Flake Characteristics Impact Inter-Analyst Replicability?
One of the more complex issues our study faced is how different attributes might interact. For
example, flake form and technological characteristics have the potential to impact the way analysts
record different measurements. Flakes with more complex platform shapes or lateral edge types
could complicate where analysts take specific measurements. This variance could impact inter-analyst
replicability and increase "systematic errors" (i.e., errors that affect the central tendency of a size measurement [Gnaden and Holdaway 2000]). To examine this question more closely, we conducted ANOVAs with Bonferroni-corrected post hoc comparisons, testing each categorical attribute's states against the IRR values of measurements taken on the corresponding flake portions. For example, we compared the IRR values for platform measurements taken on different platform types.
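A hedged sketch of one such test follows, assuming a hypothetical per-flake summary table (flake_summary) holding each flake's inter-analyst spread for a platform measurement and its modal platform type.

```r
# ANOVA of inter-analyst spread across attribute states; names are
# hypothetical stand-ins for the study's per-flake summaries.
m <- aov(platform_width_sd ~ platform_type, data = flake_summary)
summary(m)

# Pairwise comparisons between attribute states, Bonferroni-adjusted.
pairwise.t.test(flake_summary$platform_width_sd,
                flake_summary$platform_type,
                p.adjust.method = "bonferroni")
```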
Table 2 presents a subset of these results focused on attribute states that show statistically significant
IRR results for each attribute. The data show that platform types, the presence/absence of platform
cortex, flake lateral edge types, ventral plan form, and flake termination differences significantly impact
measurements taken on these flake components. This result is particularly, but not exclusively, applicable to measurements of “technological” versus “maximum” dimensions. For example, lateral edge
type shows significant differences in inter-analyst replicability in technological width measurements
at the proximal and medial flake portions. Platform thickness measurements are complicated by different platform morphologies—especially those classed as “indeterminate”—and by the presence/
absence of platform cortex, likely caused by the gradation of the cortex into regions less clearly part
of the original striking platform. Flake thickness measurements show wider variability when flakes have larger bulbs, whereas distal thickness measurements are harder to record consistently on overshot flakes.

Table 2. Summary ANOVA Results Comparing Instances Where Categorical Attributes' IRR Values Differed Significantly Based on Comparisons with Specific Attribute States. Columns: attribute | comparison | measurement | est. | lower | upper | adj. p-value. All p-values are adjusted to account for multiple comparisons in the post hoc tests.

Platform morphology | Indeterminate-Dihedral | Platform thickness midpoint | 2.36 | 0.00 | 4.72 | 0.04
Platform morphology | Plain-Indeterminate | Platform thickness midpoint | −2.09 | −4.08 | −0.10 | 0.03
Platform morphology | Punctiform-Indeterminate | Platform thickness midpoint | −2.84 | −5.68 | −0.01 | 0.04
Platform cortex | Complete-Absent | Platform width | 7.03 | 2.44 | 11.61 | <0.01
Platform cortex | Partial-Absent | Platform thickness midpoint | 2.58 | 1.02 | 4.15 | <0.01
Platform cortex | Partial-Absent | Platform thickness maximum | 2.10 | 0.51 | 3.68 | <0.01
Lateral edge type | Parallel-Amorphous | Technological width medial | −2.98 | −5.25 | −0.71 | <0.01
Lateral edge type | Parallel-Divergent | Technological width medial | −2.28 | −4.17 | −0.39 | <0.01
Lateral edge type | Parallel-Amorphous | Technological width proximal | −2.64 | −5.21 | −0.07 | 0.04
Lateral edge type | Ovoid-Convergent | Technological width proximal | 2.96 | 0.07 | 5.85 | 0.04
Lateral edge type | Parallel-Ovoid | Technological width proximal | −3.47 | −6.04 | −0.90 | <0.01
Ventral plan form | Flat-Bulbar | Maximum thickness | −0.46 | −0.85 | −0.07 | 0.01
Ventral plan form | Concave-Bulbar | Technological maximum thickness | −0.81 | −1.49 | −0.12 | 0.01
Ventral plan form | Flat-Bulbar | Technological maximum thickness | −0.76 | −1.51 | −0.01 | 0.04
Flake termination | Overshot-Feather | Technological thickness distal | 2.24 | 0.23 | 4.24 | 0.02
Flake termination | Overshot-Hinge | Technological thickness distal | 2.76 | 0.69 | 4.84 | <0.01
Another way to address this question is to examine all flake outliers for each ratio and discrete attribute to identify higher inter-analyst variance scores on specific flakes (Supplemental Table 23).
Because we could not apply a systematic method to detect nominal attribute outliers, we do not consider these here (but see Supplemental Table 22 for a qualitative overview of [dis]agreement between
analysts per nominal variable). Inter-analyst replicability scores for ratio and discrete-scale attributes
were relatively high, but looking at outliers provides a means to explore potential ways of optimizing
data recording in future lithic analyses. Supplemental Table 23 summarizes the main flake
outlier characteristics for each attribute.
The qualitative assessment of flake outliers shows that some (n = 14/110 outliers) are due to high
inter-analyst variance driven by one analyst’s measurements, which may reflect human error when taking the measurement (e.g., typing error when entering the value). At least one flake consistently
appears as an outlier (ID = 58) for several variables due to the breakage of its distal part during transport. We note the highest number of outliers for technological thickness and all four platform measurement attributes, which were also the variables that had lower—albeit substantial—
agreement between analysts (IRR >0.6 and <0.8). Large inter-analyst variance in maximum dimension
and technological measurements may be due to specific flake shapes (see above). For example, high
inter-analyst variance in maximum size seems to occur when the flake maximum dimension is similar
to the maximum width (see Figure 8). Inter-analyst variance occurs when the flakes have shapes that
vary widely in width (e.g., flakes with expanding edges in the proximal part and convergent edges in
the distal portion). Variance in thickness and width measurements may also occur on flakes with large
and thick platforms and prominent bulbs, potentially inducing errors while measurements are taken
despite the definitions provided (i.e., thickness and width should be measured independently from
the platform).
Platform measurements seem more likely to vary between analysts when flakes have a cortical platform or no clear delimitation of the platform (e.g., Figure 9, ID9). In the case of débordant (core edge)
flakes, issues occur with a blurred line between the platform and the lateral side of the flake (e.g., a
relict platform unrelated to the removal of the flake). The large difference in recording dorsal cortex
and dorsal scar count appeared to be due to diverse definitions for cortex—in particular, whether there
should be a difference between cortex and naturally fractured or weathered surfaces—and what scar
types should be counted (see Figure 9). Flakes with a higher inter-analyst variance seem to have a specific set of characteristics (including irregular shape, offset of technological axis compared to maximum dimension, cortical platforms), and they are often débordant flakes. In assemblages in which
these categories of flakes are few, as in this experimental assemblage, the impact on comparative analyses will be negligible. For assemblages that include high proportions of such flakes, however, comparative studies should consider the issues raised here.
Does the Inclusion of Images in Definitions Impact Inter-Analyst Replicability?
In seeking to understand potential sources of variation within the group, we considered whether visual aids in our attribute definitions affected inter-analyst replicability. The group predicted that analysts would
code attributes with images in their definitions more reliably.
Comparisons between nominal attributes with and without images in their definitions show that
the presence/absence of images does not significantly impact inter-analyst replicability (F [1,14] =
0.4, p = 0.53). The same is true for our ratio-scale and discrete attributes (F [1,21] = 0.01, p = 0.9).
The best-performing discrete attribute (total dorsal scar counts) showed the highest IRR values for
this attribute class. Still, it lacked a visual aid, as did many of our high-performing attributes.
Although the nominal attributes’ group mean differences are not significant, the IRR score variability
around the mean seems different. It appears that including images in definitions reduces nominal attribute IRR variability. However, the sample size of nominal attributes with images is too small to support
statistical conclusions. We hesitate to generalize too much from our sample because our definitions
were either taken from existing texts or arrived at after considerable group discussion. We hypothesize
that our within-group consensus is higher than one would encounter among naive users of our recording system.
Do Differences in Lithic Flaking Systems Impact Inter-Analyst Replicability?
For a recording system like ours to achieve maximum comparability across sites and time, it needs to
be robust to differences in lithic reduction strategies. To test the hypothesis that our recording system is
insensitive to reduction strategy, we ran our mixed effects models separately on the two flaking systems
(recurrent unidirectional Levallois and migrating multiplatform). We then compared the IRR results
for each of our three attribute classes (ratio, discrete, and nominal). This comparison allowed us to
track differences in replicability between these two broad reduction strategies. If our recording system
is insensitive to reduction strategy differences, we should see minor differences in IRR values between
the two assemblages (IRR <0.2).
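A minimal sketch of this assemblage-wise comparison follows, reusing the hypothetical flake_data frame from the Statistical Methods sketch; the assemblage labels are illustrative.

```r
# Fit the same rptR model within each flaking system and difference
# the IRR estimates; names are hypothetical, as before.
library(rptR)

irr_by_assemblage <- lapply(
  split(flake_data, flake_data$assemblage),
  function(d) rpt(tech_length ~ (1 | flake_id), grname = "flake_id",
                  data = d, datatype = "Gaussian", nboot = 1000)
)

# Difference in replicability between the two systems; differences
# under 0.2 were treated as minor in this study.
irr_diff <- irr_by_assemblage$levallois$R -
  irr_by_assemblage$multiplatform$R
```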
Figure 5 presents the IRR assemblage difference variance contributions for our ratio-scale attributes.
Negative values show lower IRR values in the Levallois assemblage, whereas positive values show lower
IRR values in the multiplatform multidirectional assemblage. The data show minor inter-analyst replicability differences for all 17 attributes between the two lithic technological systems. About half (8/17)
of these differences come from the multiplatform multidirectional assemblage. Technological thickness
measurements at the flake proximal and distal ends show similarly high inter-analyst replicability differences in the two assemblages. The fact that knappers distribute mass across the flake differently in
these two flaking systems likely drives these thickness differences (Tostevin 2012): technologically driven variables can impact the recording of flake thickness at specific points along a flake's margin.
Figure 6 presents the IRR assemblage difference variance contributions for our discrete-scale attributes. These attributes concern flake scar patterns counted in different flake sectors. Again, we see
minimal inter-analyst replicability differences for these five attributes between the two lithic reduction
Figure 5. Comparisons of the ratio-scale inter-analyst replicability differences on our two assemblages. Levallois values are
arbitrarily converted to negative numbers for graphical reasons.
Figure 6. Comparisons of the discrete-scale inter-analyst replicability differences on our two assemblages. Levallois values are arbitrarily converted to negative numbers for graphical reasons.
Figure 7. Comparisons of the nominal attribute inter-analyst replicability differences on our two assemblages. Levallois values are arbitrarily converted to negative numbers for graphical reasons.
strategies. Surprisingly, most (4/5) of these differences come from the Levallois assemblage. It is important to note that in at least one commonly used lithic recording system (Van Peer 1992), counting flake
scar patterns according to flake sectors is an important component of diagnosing variability within the
Levallois approach.
Figure 8. Examples of flake outliers for the maximum dimension attribute. All flakes are oriented ventral face up according to
their technological axis, with their proximal part at the bottom. Values show different analysts’ range of values on each
flake. (Color online)
Figure 7 presents the IRR assemblage difference variance contributions for our nominal attributes.
Most of these attributes (14/16) show minor inter-analyst replicability differences between the two
lithic technological systems. The data show an even split of the differences (n = 8) between the two
reduction strategies. Two notably high differences are in the platform morphology (difference =
−0.22) and reduction system attributes (difference = 0.93). These results show that analysts agreed
less on platform morphologies in the Levallois assemblage than in the multiplatform multidirectional
assemblage. They also show that analysts tended to agree when identifying a flake as belonging to the
Levallois reduction system, but they struggled to identify flakes from the multiplatform multidirectional system.
Do an Analyst's Experience and Quantitative Training Impact Inter-Analyst Replicability?
A final question concerns the impact of individual differences on inter-analyst replicability. Our analyst survey data (Supplemental Table 1) allowed us to assess the effects of three individual-difference metrics
on the inter-analyst replicability outcomes: years of experience, training in quantitative methods, and
training in the chaîne opératoire approach. It seems reasonable to hypothesize that analysts with
greater quantitative training and expertise will more consistently record ratio and discrete-scale attributes.

Figure 9. Examples of flake outliers for the dorsal cortex and dorsal scar count attribute. All flakes are oriented according to their technological axis, with their proximal part at the bottom. Values show different analysts' range of values on each flake. (Color online)
Here, we focus on how these measures impact the recording of attributes as they provide the most
straightforward means of assessing individual measurement performance relative to the group. To do
this, we first compared each analyst’s distance from the group’s average measurements. We then averaged these values across an analyst’s suite of measures to derive a single performance metric. A surprising result is that overall years of experience showed a nonsignificant relationship with
measurement performance (F [1, 7] = 0.2, R2 = 0.1, p = 0.65). Our average measurement performance
metric is significantly and positively correlated (F [1, 7] = 5.6, R2 = 0.39, p = 0.04) with an analyst's self-reported ranking of quantitative training levels (1 = lowest, 5 = highest).
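A minimal sketch of this performance metric follows, under the same hypothetical variable names as before.

```r
# Each analyst's absolute distance from the per-flake group mean,
# averaged across that analyst's measurements; names are hypothetical.
flake_data$grp_mean <- ave(flake_data$tech_length, flake_data$flake_id,
                           FUN = function(x) mean(x, na.rm = TRUE))
flake_data$dist <- abs(flake_data$tech_length - flake_data$grp_mean)
performance <- tapply(flake_data$dist, flake_data$analyst_id,
                      mean, na.rm = TRUE)

# Regress the metric on self-reported quantitative training (1-5).
analyst_data$performance <- performance[as.character(analyst_data$analyst_id)]
summary(lm(performance ~ quant_training, data = analyst_data))
```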
Chaîne opératoire training levels show a negative but nonsignificant (F [1, 7] = 5, R2 = 0.33, p = 0.06)
relationship with the average measurement performance metric. There are at least two possible explanations for this. First, most of the data we collected in this study align with more quantitative
approaches to lithic analysis. Although it is an oversimplification to state that chaîne opératoire
approaches are opposed to quantitative research (Soressi and Geneste 2011), they tend to emphasize
qualitative readings of artifacts. At least within our group, the data collected for this study were
more often unfamiliar to those analysts who employ a chaîne opératoire approach extensively. Less
experience in collecting some of the data described here likely drives some of the lower inter-rater replicability. The second possibility is that individual responses to our basic survey overestimate or underestimate expertise in particular approaches among the group. Visual inspection of both these results
shows that the correlations are driven predominantly by individuals who report very low or high levels
of training in each field.
Discussion
We observe that ratio-scale attributes (e.g., maximum flake dimension and technological length)
show strong inter-analyst replicability scores, making them simple and immediately suitable for
comparative lithic analyses. The discrete-scale attributes, mostly concerned with dorsal negatives
on flakes, showed a comparatively low inter-analyst replicability score, likely due to the complexity
of identifying flake scar patterns. This result implies that inter-analyst variation stemmed from dividing flakes into "sectors" (left, right, proximal, and distal) rather than from the counting of dorsal scars itself, and from confusion between scars occurring in a given sector versus scars originating from that sector. However, proximal scars performed better than the other sectors. This difference likely arises because negatives originating from the proximal end often retain initiations, making their orientation easier to work out, whereas those from the lateral margins are more difficult to orient.
It is important for future work to address these results in more detail because these arbitrary sectors
are a central part of several systems used to describe unretouched flakes (e.g., Crew 1975; Tostevin
2012; Van Peer 1992). Finally, our nominal attributes showed more variation in inter-analyst replicability scores, suggesting that additional work is required to ensure that they are reliable for future
comparative research.
Why Do Certain Attributes Show Lower Inter-Analyst Replicability Scores and How Can
Comparability Be Increased?
Our study showed that increased attribute state counts significantly decrease inter-analyst replicability. More choices increase the chance that analysts code features in different ways. A simple fix
would be to collapse certain attribute states into more manageable and reliable classes. However,
for platform morphology and directionality, the analysts selected four or more different attribute
states for 30% to 60% of the flakes, respectively. This result suggests that collapsing certain attribute states is unlikely to provide a satisfying solution for the lower agreement values on these variables (see Supplemental Table 22). Future work could test whether reducing attribute state counts in more complex nominal attributes will allow analysts to track meaningful
variability.
Flake and platform shapes and certain technological features (e.g., thick bulbs of percussion) created several issues for the attributes' IRR scores. We cannot simply eliminate flake shape and technological variability, because lithic artifacts have widely variable shapes and technical features. However, our
results show that attribute states such as “amorphous” or “indeterminate” can create uncertainty for
analysts taking specific measurements. Still, they do not impact the overall agreement between analysts
(see Supplemental Figure 1). The “indeterminate” attribute state serves as a placeholder for times when
analysts cannot code a specific attribute but intend to note that there is clear uncertainty with that
decision. Future work should look to understand better the flake qualities “indeterminate” states
describe and to explain better how the word “indeterminate” is used (e.g., “it cannot be determined”
vs. “I cannot determine it”).
Our study did not observe any effect of years of experience on analyst performance. Instead, we
found significant training background effects. This result suggests that increasing replicability in lithic
analysis is more about changing training than increasing experience per se. We suggest that training programs mixing basic quantitative methods with chaîne opératoire–style instruction might help standardize similar lithic attribute analyses. Our group agreed on a unified set of definitions for all the study's attributes, and yet we still found significant inter-analyst differences in some attributes. This result suggests that programs need to provide training to analysts wishing to engage in
comparative lithic research, including beta-testing attribute definitions before engaging in primary
data collection.
We included illustrations in our attribute definitions, where possible, to increase inter-analyst replicability. Our results show that these images did not significantly impact inter-analyst replicability.
One reason is that static images provide only a snapshot of a more dynamic measurement process. They also represent one interpretation of an idealized, specific attribute that does not necessarily capture how analysts measure attributes on variable artifacts. Regardless, images provide a useful set of information that can help standardize recording systems and that might be more helpful for naive analysts learning to identify specific attributes.
Do Differences in Lithic Flaking Systems Impact Inter-Analyst Replicability?
Our two assemblages show minor differences in inter-analyst replicability. We can identify specific
areas where greater IRR differences arise between the two assemblages. These include analysts’ difficulties in classifying Levallois platforms, measuring technological thickness, and identifying flakes
from the multiplatform multidirectional cores (in the absence of those cores). This result shows biases
and weaknesses in existing datasets that researchers could address in future analyses. Notably, the
study’s recording system is robust to differences in technological strategies, and researchers can use
it to compare these technological variants. Given how representative these two flaking systems are of MSA (and Middle Paleolithic) technological variability, our findings are likely to be generalizable to flakes made from other core reduction strategies.
How Do Our Results Compare to Prior Inter-Analyst Replicability Lithic Studies?
The previous studies listed in Table 1 agree that inter-analyst replicability is an issue that researchers
should address more thoroughly in lithic analysis. Unfortunately, most of these studies suffer from having few analysts, few tested attributes, and small lithic samples, and they generally serve only as starting points for further comparative lithic analyses. With some notable exceptions (e.g., Gnaden
and Holdaway 2000; Proffitt and de la Torre 2014), these studies also lack quantitative assessment
of inter-analyst replicability. As a further caveat, we did not examine accuracy as some experiments
have done (e.g., Gnaden and Holdaway 2000; Proffitt and de la Torre 2014)—testing measurements
against a true standard or “correct answer.”
In common with this study, Fish (1978) and Gnaden and Holdaway (2000) found metric attributes such as length or thickness highly reliable. Maximum width and platform thickness performed
well in our study, as they did for Fish (1978) and Wilmsen and Robert (1978), but they showed high
inter-analyst variability in another experiment (Gnaden and Holdaway 2000). These latter authors
attributed some of the systematic errors to differences in definitions that we ruled out in our
study, underlining their general importance. The high variance in recording dorsal cortex aligns
our study with others (Fish 1978; Gnaden and Holdaway 2000), demonstrating the need for more
precise definitions of what researchers should consider “true” cortical faces. Interestingly, cortex
identification on dorsal surfaces was among the most reliable attributes in another experiment on
an Oldowan assemblage (Proffitt and de la Torre 2014), which may speak more to raw material
differences between these studies. As our study did, Proffitt and de la Torre (2014) found that the
direction of dorsal negatives compared poorly among researchers. Although our study found substantial to strong IRR values for many attributes (>0.6), Proffitt and de la Torre (2014) report mostly
moderate levels of agreement (0.4–0.6), which likely stemmed from their use of three different raw
materials, including quartzite, which performed the worst. Chert, which we used, had the highest
agreement values among their analysts. Driscoll's (2011) quartz-based study also found low replicability between observers for several discrete variables. Timbrell and colleagues' (2022) recent study examined shape variable replicability using two-dimensional outline geometric morphometrics (GMM) and linear measurements; they found inter-analyst error to be low enough for accurate analyses with both methods. Unfortunately, no previous study included ratio, discrete, and nominal variables in its design, precluding comparisons with this experiment across different measurement scales.
Limitations and Recommendations
We hope to overcome several limitations in future studies, including testing the impact of different raw materials and incorporating a greater range of flake production strategies. Unretouched flakes comprise only one (albeit the most abundant) component of the Paleolithic record that lithic analysts study. The extent to which our broad measures of inter-analyst replicability extend to other artifact classes (e.g., cores and retouched tools) is unknown.
Table 3. Recommended Attributes Showing Strong Inter-Analyst Replicability Scores (>0.8) in the Current Study.

Attribute | Units | Data type
General metric descriptors
  Mass | g | Ratio
  Maximum dimension | mm | Ratio
  Maximum width | mm | Ratio
  Maximum thickness | mm | Ratio
Technological measures
  Technological length | mm | Ratio
  Technological maximum width | mm | Ratio
  Technological maximum thickness | mm | Ratio
Reduction intensity measures
  Total dorsal scar count | n/a | Nominal
  Platform cortex presence/absence | n/a | Categorical
Reduction strategy indicators
  Initiation type | n/a | Categorical
  Kombewa | n/a | Categorical
  Reduction system | n/a | Categorical
  Platform lipping | n/a | Categorical
  Bulb (?) | n/a | Categorical
  Flake termination | n/a | Categorical
  Flake form | n/a | Categorical
Completeness measures
  Flake completeness | n/a | Categorical
Another area for future research is the need to understand different sources of variation in lithic
data. Random variation about the mean can arise due to minor differences in data collection. More
worrisome are differences that occur because of systematic errors, which stem from unclear definitions
that affect a measurement's central tendency (Gnaden and Holdaway 2000). Figure 2 shows the range of CV values for our ratio-scale variables, which, despite ranging from roughly 0.01 to 0.15, are still very low compared to any non-machine-based method of data collection (cf. Eerkens and Lipo 2005).
This result suggests that random variation, although present in our study, is relatively modest.
Systematic errors are likely more difficult to detect but were undoubtedly present in our dataset, especially in our efforts to measure exterior platform angle using the modified caliper method initially
described by Dibble and Bernard (1980).
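To make this distinction concrete, the CV for any ratio-scale attribute can be computed per flake across analysts. The following minimal R sketch illustrates the calculation; the data frame, column names, and values are hypothetical rather than drawn from our dataset:

```r
# Hypothetical inter-analyst measurements; not the study's actual data
measurements <- data.frame(
  flake   = rep(c("F001", "F002"), each = 3),
  analyst = rep(c("A", "B", "C"), times = 2),
  mass_g  = c(12.1, 12.3, 11.9, 8.4, 8.9, 8.6)
)

# CV = standard deviation / mean, computed within each flake across analysts
cv_by_flake <- tapply(measurements$mass_g, measurements$flake,
                      function(x) sd(x) / mean(x))

# The spread of per-flake CVs summarizes random inter-analyst variation
summary(cv_by_flake)
```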
One major limitation of our study is that we do not track variations in exterior platform angle and
flake curvature measurements. We currently have little basis other than two studies (Andrefsky 2005;
Dibble and Bernard 1980) from which to assess these attributes’ effectiveness. Our group could not
reliably use the modified caliper method as published by Dibble and Bernard (1980), and future
work should aim to retest and refine this method. There is also a general implicit assumption among many lithic analysts that platform angles and curvature values are difficult to measure accurately. Addressing this issue could require 3D scans and new software to record exterior platform angle and flake curvature accurately (Valletta et al. 2020; Yezzi-Woodley et al. 2021). It remains uncertain how these patterns of inter-analyst variation might impact assemblage-level comparisons.
Although costly and impractical on larger lithic assemblages with hundreds or thousands of specimens, such research would produce results of broad relevance to comparative lithic analysts.
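For readers unfamiliar with caliper-based angle estimation, the sketch below shows the generic wedge trigonometry on which such methods rest. It assumes a symmetric edge and illustrates only the underlying geometry; it is not a reconstruction of Dibble and Bernard's (1980) published protocol, and the function and variable names are ours:

```r
# Wedge approximation: given a caliper jaw opening w (mm) and the depth d (mm)
# to which a symmetric edge penetrates between the jaws, the included angle is
# 2 * atan((w / 2) / d), converted here to degrees
edge_angle_deg <- function(w, d) {
  2 * atan((w / 2) / d) * 180 / pi
}

edge_angle_deg(w = 10, d = 8)  # roughly 64 degrees
```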
Based on our current study, we identify several ways to advance replicability in stone tool analysis.
First, a standard set of transparent, clear, and agreed-upon attribute definitions is necessary for
any comparative study. Ratio-scale attributes fared by far the best, and key attributes such as mass,
maximum dimension, technological length, maximum width, and thickness are easy to record.
They can form a good comparative baseline, given that many researchers already use them. Many nominal attributes, such as flake form, showed higher replicability than we (and likely others) expected, and comparability can be further increased by decreasing the number of attribute states in some cases (see the sketch below).
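As a hypothetical illustration of collapsing attribute states, the base R snippet below reduces four flake termination types to a binary feather versus non-feather coding; the grouping is ours, not a recommended recoding scheme:

```r
# Hypothetical termination calls from a single analyst
terminations <- factor(c("feather", "hinge", "step", "feather", "overshot"))

# Assigning a named list to levels() collapses the original states
levels(terminations) <- list(
  feather     = "feather",
  non_feather = c("hinge", "step", "overshot")
)

table(terminations)  # feather: 2, non_feather: 3
```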
In contrast, many attributes associated with flake shape showed lower replicability in our study. As a
way forward, we recommend increasing sample sizes while using photogrammetry or
morphometric methods designed to capture shape quantitatively (e.g., Bretzke and Conard 2012;
Grosman et al. 2008; Iovita 2011; Magnani et al. 2020; Ranhorn et al. 2019). Ideally, we should explore these options using approaches that are increasingly accessible as costs decline and that allow data capture on widely available devices (e.g., Cerasoni et al. 2022; Porter et al. 2016). The same goes for
measuring angles (such as EPA) and curvature. For data recording, analysts should use relational
databases built using programs such as E4, instead of spreadsheets. Instruction should ideally work
with static images and dynamic visuals, such as short training videos showing how to measure in
three dimensions. Moreover, researchers might more reliably record some variables (e.g., flake scar sectors) on images rather than on actual implements.
Our results show that enhancing replicability in comparative studies in the MSA, or any other
period, is not dependent on experience but rather on basic training in quantitative methods. To be
clear, quantitative data are not “better” than more qualitative interpretations. They are simply more
replicable, and illustrations and technological readings using chaîne opératoire and allied approaches
remain an essential component of lithic analysis because they provide complementary information.
As stated at the outset, this project’s initial and long-term goal was to assemble large datasets to
explore patterning at the subcontinental and smaller scales across Africa, also making use of the enormous quantities of data already gathered by researchers over the last decades. Based on our experiences
thus far and the results presented above, Table 3 lists those variables (n = 17) that we consider reliable
when compiling datasets (published or otherwise) collected by different individuals for comparative
ends, using definitions consistent with those we used here. Note that Table 3 provides general guidelines for interpreting inter-analyst replicability, with values >0.6 considered “substantially” reliable
(Cohen 1960). However, our results stem from definitions and protocols extensively worked out
through hundreds of hours of collaborative conversation and writing. We cannot extend these measures uncritically to other research teams. Consequently, we take a more conservative view, favoring
those variables that have IRR scores of >0.8 and that showed minimal effects of interactions with
other variables. We group the attributes according to potential uses for exploring a range of lithic artifact research questions, including basic metric parameters, flake propagation measures, measures of
reduction intensity, core reduction strategies, and basic flake breakage indicators.
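To make these agreement thresholds concrete, the following from-scratch R sketch computes Cohen's (1960) kappa for two raters. Our study relied on Gwet's statistic and mixture models rather than kappa, and the ratings below are invented for illustration:

```r
# Invented nominal ratings for the same six flakes by two raters
rater1 <- c("feather", "hinge", "feather", "step", "feather", "hinge")
rater2 <- c("feather", "hinge", "feather", "feather", "feather", "step")

cohens_kappa <- function(r1, r2) {
  cats <- union(unique(r1), unique(r2))           # shared category set
  tab  <- table(factor(r1, levels = cats), factor(r2, levels = cats))
  n    <- sum(tab)
  po   <- sum(diag(tab)) / n                      # observed agreement
  pe   <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance-expected agreement
  (po - pe) / (1 - pe)
}

cohens_kappa(rater1, rater2)  # about 0.43: "moderate" on the 0.4-0.6 band
```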
Conclusions
The issue of data comparability is particularly acute in the analysis of the lithic (stone) artifacts that dominate the Paleolithic record. To address it, we presented the most extensive study yet on replicability in lithic analysis, based on 11 analysts, 100 lithics, 38 attributes, and hundreds of
hours of collaborative conversation and writing. Although initially geared toward the African Middle Stone Age record, this study, given the diverse experience of the participating researchers, has broad applicability to the analysis of stone tools across all regions and periods. The most salient finding is that the 11 international expert lithic analysts performed well across many of the tested attributes. Ratio-scale attributes fared the best, but several
nominal-scale attributes showed promise when used with the definitions provided in this study. We conclude that, at least under specific methodological designs, high replicability in lithic analysis is possible, providing the baseline for any comparative study. Apart from its general relevance for the field of lithic analysis, this finding is important given the project's original goal of comparing lithic assemblages across the MSA of Africa, using both datasets already collected by researchers and new ones.
Acknowledgments. This article is CoMSAfrica publication number 2.
Funding Statement. We thank the Radcliffe Institute for Advanced Study for funding the initial CoMSAfrica meeting held in 2018. We also thank the Swiss National Science Foundation (SNSF, grant #IZSEZ0_186545) for funding the second workshop in 2019 in Switzerland, and the Faculty of Sciences of the University of Geneva for supporting this second CoMSAfrica meeting both financially and logistically. Alice Leplongeon's research is funded by a grant from the Research Foundation – Flanders (FWO, grant Q36#12U9220N).
Data Availability Statement. Detailed results of all analyses and assessments of the data structure are available in this article’s
supplementary materials and through the Open Science Framework (https://osf.io/seh2t/?view_only=9097ef58225b49
e48f66afb220022fbf).
Competing Interests. The authors declare none.
Supplemental Material. For supplemental material accompanying this article, visit https://doi.org/10.1017/aaq.2023.4.
Supplemental Figure 1. Visual summary of the Gwet results with and without indeterminates.
Supplemental Table 1. Results of the prior experience survey for all study participants.
Supplemental Table 2. Overview and definitions for the study’s attributes.
Supplemental Table 3. Summary IRR statistics for the study’s ratio and discrete scale attributes.
Supplemental Table 4. Summary IRR statistics for the study’s nominal scale attributes.
Supplemental Table 5. Detailed summary statistics broken down by flake and subject for the study’s ratio-scale attributes.
Supplemental Table 6. Detailed summary statistics broken down by flake and subject for the study’s platform lipping nominal
scale attribute.
Supplemental Table 7. Detailed summary statistics broken down by flake and subject for the study’s bulb type nominal scale
attribute.
Supplemental Table 8. Detailed summary statistics broken down by flake and subject for the study’s platform morphology
nominal scale attribute.
Supplemental Table 9. Detailed summary statistics broken down by flake and subject for the study’s flake initiation nominal
scale attribute.
Supplemental Table 10. Detailed summary statistics broken down by flake and subject for the study’s flake scar directionality
nominal scale attribute.
Supplemental Table 11. Detailed summary statistics broken down by flake and subject for the study’s flake form nominal
scale attribute.
Supplemental Table 12. Detailed summary statistics broken down by flake and subject for the study’s reduction system nominal scale attribute.
Supplemental Table 13. Detailed summary statistics broken down by flake and subject for the study’s kombewa presence
nominal scale attribute.
Supplemental Table 14. Detailed summary statistics broken down by flake and subject for the study’s shattered bulb nominal
scale attribute.
Supplemental Table 15. Detailed summary statistics broken down by flake and subject for the study’s flake distal plan form
nominal scale attribute.
Supplemental Table 16. Detailed summary statistics broken down by flake and subject for the study’s flake initiation nominal
scale attribute.
Supplemental Table 17. Detailed summary statistics broken down by flake and subject for the study’s flake lateral edge type
nominal scale attribute.
Supplemental Table 18. Detailed summary statistics broken down by flake and subject for the study's flake platform cortex nominal scale attribute.
Supplemental Table 19. Detailed summary statistics broken down by flake and subject for the study’s flake section nominal
scale attribute.
Supplemental Table 20. Detailed summary statistics broken down by flake and subject for the study’s flake completeness
nominal scale attribute.
Supplemental Table 21. Detailed summary statistics broken down by flake and subject for the study’s flake ventral plan form
nominal scale attribute.
Supplemental Table 22. Detailed summary statistics for the study’s outlier flakes according to the nominal scale attributes.
Supplemental Table 23. Detailed overview of the number of outlier flakes for each of the study's ratio scale attributes and explanations for why they were classed as outliers.
Author Contributions. CT and MW started the CoMSAfrica project by organizing a workshop funded by the Radcliffe Institute
for Advanced Study in November 2018. All authors except ME designed the study during this workshop and subsequent online
meetings (2018–2019). Following the initial identification of the attributes to be included in the study during the workshop, AL,
JM, and KR led a working group on the definitions of the variables, which were refined and agreed upon by the whole group. JM
and KR implemented the database in E4 and Microsoft Excel format. JP, MS, and KD worked on developing a common communication platform and a shared digital space where researchers could collaborate and share data. AM, ES, AL, CT, and JP developed a
plan to create a test assemblage. ME made the test assemblage.
All authors analyzed the test assemblage (2019–2021). Analyses were agreed upon and refined by all authors during a
workshop organized at the University of Geneva and funded by the Swiss National Science Foundation (application for funding led by KD, HG, and MW). JP conducted the statistical analyses. AL performed the data cleaning and generated summary
statistics.
All authors contributed to the writing of the article: AM, MW, CT, and JP wrote the initial draft. JP and AL drafted the results and discussion sections. All authors contributed to the revisions of the article.
References Cited
Andrefsky, William. 2005. Lithics: Macroscopic Approaches to Analysis. Cambridge University Press, Cambridge.
Baker, Monya. 2016. Reproducibility Crisis. Nature 533:353–366.
Blumenschine, Robert J., Curtis W. Marean, and Salvatore D. Capaldo. 1996. Blind Tests of Inter-Analyst Correspondence and
Accuracy in the Identification of Cut Marks, Percussion Marks, and Carnivore Tooth Marks on Bone Surfaces. Journal of
Archaeological Science 23:493–507.
Boyd, Clifford C. 1987. Interobserver Error in the Analysis of Nominal Attribute States: A Case Study. Tennessee Anthropologist
12:88–95.
Bretzke, Knut, and Nicholas J. Conard. 2012. Evaluating Morphological Variability in Lithic Assemblages Using 3D Models of
Stone Artifacts. Journal of Archaeological Science 39:3741–3749.
Calogero, Barbara. 1992. Lithic Misidentification. Man in the Northeast 43:87–90.
Carleton, Christopher W., and Huw S. Groucutt. 2021. Sum Things Are Not What They Seem: Problems with Point-Wise
Interpretations and Quantitative Analyses of Proxies Based on Aggregated Radiocarbon Dates. Holocene 31:630–643.
Cerasoni, Jacopo Niccolò, Felipe do Nascimento Rodrigues, Yu Tang, and Emily Yuko Hallett. 2022. Do-It-Yourself Digital
Archaeology: Introduction and Practical Applications of Photography and Photogrammetry for the 2D and 3D
Representation of Small Objects and Artefacts. PLoS ONE 17(4):e0267168. https://doi.org/10.1371/journal.pone.0267168.
Cohen, Jacob. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20:37–46.
Conard, Nicholas J., Marie Soressi, John E. Parkington, Sarah Wurz, and Royden Yates. 2004. A Unified Lithic Taxonomy Based
on Patterns of Core Reduction. South African Archaeological Bulletin 59:13–17.
Crew, Harry L. 1975. An Examination of the Variability of the Levallois Method: Its Implications for the Internal and External Relationships of the Levantine Mousterian. PhD dissertation, Department of Anthropology, University of California, Davis. University Microfilms, Ann Arbor.
Crowther, Alison, and Michael Haslam. 2007. Blind Tests in Microscopic Residue Analysis: Comments on Wadley et al. (2004).
Journal of Archaeological Science 34:997–1000.
Dibble, Harold L., and M. C. Bernard. 1980. A Comparative Study of Edge Angle Measurement Techniques. American Antiquity
45:857–865.
Dogandžić, Tamara, David R. Braun, and Shannon P. McPherron. 2015. Edge Length and Surface Area of a Blank: Experimental
Assessment of Measures, Size Predictions and Utility. PLoS ONE 10(9):e0133984. https://doi.org/10.1371/journal.pone.
0133984.
Driscoll, Killian. 2011. Vein Quartz in Lithic Traditions: An Analysis Based on Experimental Archaeology. Journal of
Archaeological Science 38:734–745.
Eerkens, Jelmer W., and Carl P. Lipo. 2005. Cultural Transmission Theory and the Archaeological Record: Providing Context to
Understanding Variation and Temporal Changes in Material Culture. Journal of Archaeological Research 15:239–274.
Fish, Paul R. 1978. Consistency in Archaeological Measurement and Classification: A Pilot Study. American Antiquity 43(1):86–
89.
Gnaden, Denis, and Simon Holdaway. 2000. Understanding Observer Variation When Recording Stone Artifacts. American
Antiquity 65:739–748.
Grosman, Leore, Oded Smikt, and Uzy Smilansky. 2008. On the Application of 3-D Scanning Technology for the Documentation
and Typology of Lithic Artifacts. Journal of Archaeological Science 35:3101–3110.
Gwet, Kilem L. 2008. Computing Inter-Rater Reliability and Its Variance in the Presence of High Agreement. British Journal of
Mathematical and Statistical Psychology 61:29–48.
Hallgren, Kevin A. 2012. Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutorials in
Quantitative Methods for Psychology 8:23–34.
Harmand, Sonia, Jason E. Lewis, Craig S. Feibel, Christopher J. Lepre, Sandrine Prat, Arnaud Lenoble, Xavier Boes, et al. 2015.
3.3-Million-Year-Old Stone Tools from Lomekwi 3, West Turkana, Kenya. Nature 521:310–315.
Holdaway, Simon, and Nicola Stern. 2004. A Record in Stone: The Study of Australia’s Flaked Stone Artifacts. Aboriginal Studies
Press, Melbourne.
Iovita, Radu. 2011. Shape Variation in Aterian Tanged Tools and the Origins of Projectile Technology: A Morphometric
Perspective on Stone Tool Function. PLoS ONE 6(12):e29029. https://doi.org/10.1371/journal.pone.0029029.
Lycett, Stephen J., Noreen von Cramon-Taubadel, and Robert A. Foley. 2006. A Crossbeam Co-ordinate Caliper for the
Morphometric Analysis of Lithic Nuclei: A Description, Test and Empirical Examples of Application. Journal of
Archaeological Science 33:847–861.
Mackay, Alex. 2008. On the Production of Blades and Its Relationship to Backed Artefacts in the Howiesons Poort at Diepkloof,
South Africa. Lithic Technology 33:87–99.
Magnani, Matthew, Matthew Douglass, Whittaker Schroder, Jonathan Reeves, and David R. Braun. 2020. The Digital Revolution
to Come: Photogrammetry in Archaeological Practice. American Antiquity 85:737–760.
Mauz, Barbara, Loïc Martin, Michael Discher, Chantal Tribolo, Sebastian Kreutzer, Chiara Bahl, Andreas Lang, and
Norbert Mercier. 2021. On the Reliability of Laboratory Beta-Source Calibration for Luminescence Dating. Geochronology
3:371–381.
Newcomer, Mark, Roger Grace, and Romana Unger-Hamilton. 1986. Investigating Microwear Polishes with Blind Tests. Journal
of Archaeological Science 13:203–217.
Perpère, Marie. 1986. Apport de la typométrie à la définition des éclats Levallois: l’exemple d’Ault. Bulletin de la
Société Préhistorique Française 83:115–118.
Porter, Samantha Thi, Morgan Roussel, and Marie Soressi. 2016. A Simple Photogrammetry Rig for the Reliable Creation of 3D
Artifact Models in the Field: Lithic Examples from the Early Upper Paleolithic Sequence of Les Cottés (France). Advances in
Archaeological Practice 4:71–86.
Proffitt, Thomas, and Ignacio de la Torre. 2014. The Effect of Raw Material on Inter-Analyst Variation and Analyst Accuracy for
Lithic Analysis: A Case Study from Olduvai Gorge. Journal of Archaeological Science 45:270–283.
Ranhorn, Kathryn L., David R. Braun, Rebecca E. Biermann Gürbüz, Elliot Greiner, Daniel Wawrzyniak, and Alison S. Brooks.
2019. Evaluating Prepared Core Assemblages with Three-Dimensional Methods: A Case Study from the Middle Paleolithic at
Skhūl (Israel). Archaeological and Anthropological Sciences 11:3225–3238.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://
www.R-project.org/.
Rots, Veerle, Louis Pirnay, Ph Pirson, and Odette Baudoux. 2006. Blind Tests Shed Light on Possibilities and Limitations for
Identifying Stone Tool Prehension and Hafting. Journal of Archaeological Science 33:935–952.
Scerri, Eleanor M. L., Nick A. Drake, Richard Jennings, and Huw S. Groucutt. 2014. Earliest Evidence for the Structure of Homo
Sapiens Populations in Africa. Quaternary Science Reviews 101:207–216.
Scott, E. Marian, Philip Naysmith, and Gordon T. Cook. 2018. Why Do We Need 14C Inter-Comparisons?: The Glasgow-14C
Inter-Comparison Series, a Reflection over 30 Years. Quaternary Geochronology 43:72–82.
Shea, John J. 2013. Stone Tools in the Paleolithic and Neolithic Near East: A Guide. Cambridge University Press, Cambridge.
Shea, John J. 2016. Stone Tools in Human Evolution: Behavioral Differences among Technological Primates. Cambridge University
Press, Cambridge.
Shea, John J. 2020. Prehistoric Stone Tools of Eastern Africa: A Guide. Cambridge University Press, Cambridge.
Soressi, Marie, and Jean-Michel Geneste. 2011. The History and Efficacy of the Chaîne Opératoire Approach to Lithic
Analysis: Studying Techniques to Reveal Past Societies in an Evolutionary Perspective. PaleoAnthropology 2011:
334–350.
Stewart, Mathew, W. Christopher Carleton, and Huw S. Groucutt. 2021. Climate Change, Not Human Population Growth,
Correlates with Late Quaternary Megafauna Declines in North America. Nature Communications 12:1–15.
Stoffel, Martin A., Shinichi Nakagawa, and Holger Schielzeth. 2017. rptR: Repeatability Estimation and Variance Decomposition
by Generalized Linear Mixed-Effects Models. Methods in Ecology and Evolution 8:1639–1644.
Timbrell, Lucy, Christopher Scott, Behailu Habte, Yosef Tefera, Hélène Monod, Mouna Qazzih, Benjamin Marais, et al. 2022.
Testing Inter-Observer Error under a Collaborative Research Framework for Studying Lithic Shape Variability.
Archaeological and Anthropological Sciences 14:209. https://doi.org/10.1007/s12520-022-01676-2.
Tixier, Jacques. 1963. Typologie de l’épipaléolithique du Maghreb. Mémoires du Centre de recherches anthropologiques,
préhistoriques et ethnographiques 2. Arts et métiers graphiques, Paris.
Tostevin, Gilbert B. 2012. Seeing Lithics: A Middle-Range Theory for Testing for Cultural Transmission in the Pleistocene. Oxbow
Books, Oakville, California.
Valletta, Francesco, Uzy Smilansky, A. Nigel Goring-Morris, and Leore Grosman. 2020. On Measuring the Mean Edge Angle of
Lithic Tools Based on 3-D Models–A Case Study from the Southern Levantine Epipalaeolithic. Archaeological and
Anthropological Sciences 12:Article 49.
Van Peer, Philip. 1992. The Levallois Reduction Strategy. Prehistory Press, Madison, Wisconsin.
Wadley, Lyn, Marlize Lombard, and Bonnie Williamson. 2004. The First Residue Analysis Blind Tests: Results and Lessons
Learnt. Journal of Archaeological Science 31:1491–1501.
Wilkins, Jayne, Kyle S. Brown, Simen Oestmo, Telmo Pereira, Kathryn L. Ranhorn, Benjamin J. Schoville, and Curtis W. Marean.
2017. Lithic Technological Responses to Late Pleistocene Glacial Cycling at Pinnacle Point Site 5-6, South Africa. PLoS ONE
12(3):e0174051. https://doi.org/10.1371/journal.pone.0174051.
Will, Manuel, Christian Tryon, Matthew Shaw, Eleanor M. L. Scerri, Kathryn Ranhorn, Justin Pargeter, Jessica McNeil,
Alex Mackay, Alice Leplongeon, and Huw S. Groucutt. 2019. Comparative Analysis of Middle Stone Age Artifacts in
Africa (CoMSAfrica). Evolutionary Anthropology 28:57–59.
Wilmsen, Edwin N., and Frank H. H. Roberts Jr. 1978. Lindenmeier, 1934–1974: Concluding Report on Investigations. Smithsonian Contributions to Anthropology 24. Smithsonian Institution, Washington, DC.
Yezzi-Woodley, Katrina, Jeff Calder, Peter J. Olver, Paige Cody, Thomas Huffstutler, Alexander Terwilliger, J. Anne Melton,
Martha Tappen, Reed Coil, and Gilbert Tostevin. 2021. The Virtual Goniometer: Demonstrating a New Method for
Measuring Angles on Archaeological Materials Using Fragmentary Bone. Archaeological and Anthropological Sciences 13:
Article 106.
Cite this article: Pargeter, Justin, Alison Brooks, Katja Douze, Metin Eren, Huw S. Groucutt, Jessica McNeil, Alex Mackay, et al.
2023. Replicability in Lithic Analysis. American Antiquity. https://doi.org/10.1017/aaq.2023.4.