Monograph
Monograph
Monograph
Symbol Systems
Richard Sproat
Abstract
We report on the creation and analysis of a set of corpora of non-linguistic
symbol systems. The resource, the first of its kind, consists of data from
seven systems, both ancient and modern, with two further systems under
development, and several others planned. The systems represent a range
of types, including heraldic systems, formal systems, and systems that are
mostly or purely decorative. We also compare these systems statistically
with a large set of linguistic systems, which also range over both time and
type.
We show that none of the measures proposed in published work by Rao
and colleagues (Rao et al., 2009a; Rao, 2010) or Lee and colleagues (Lee
et al., 2010a) works. In particular, Rao’s entropic measures are evidently
useless when one considers a wider range of examples of real non-linguistic
symbol systems. And Lee’s measures, with the cutoff values they propose,
misclassify nearly all of our non-linguistic systems. However, we also show
that one of Lee’s measures, with different cutoff values, as well as another
measure we develop here, do seem useful. We further demonstrate that they
are useful largely because they are both highly correlated with a rather trivial
feature: mean text length.
⃝2012–2013,
c Richard Sproat
1
1 Introduction
Humans have been using symbols for many millenia to represent many different
kinds of information. In some cases a single symbol represents a single concept,
and is not part of any larger symbol system: an example is the red, blue and white
helical symbol for a barber shop. In other cases the symbols may be part of a more
complex combinatoric system that has its own syntax and semantics. Examples
are mathematical symbols, European heraldry (Slater, 2002), or the Mesopotamian
symbols representing deities (Seidl, 1989), all of which are systems consisting of
many symbols, which may be combined together into “texts” of varying degrees of
complexity. Writing systems are just one kind of complex symbol system, where
the system happens to represent language. In writing systems, the individual sym-
bols of the script represent a linguistic unit — often a phoneme or a syllable, but
in the case of many (especially ancient) writing systems, also other units such as
morphemes, words, or semantic information. Fully developed writing systems —
those capable of representing essentially anything that can be spoken — always
represent at least some phonological information (DeFrancis, 1984, 1989), so that
there are no true “logographic” or “semasiographic” writing systems.
Suppose an archaeologist discovers a clay tablet inscribed with heretofore un-
known symbols among the ruins of an ancient civilization. As long as the symbols
seem to be arranged more or less linearly (see below) and the inscription meets
certain other properties, one’s gut feeling is often to assume that the system must
have been writing.
Consider in this light the stone inscription in Figure 1. Here we see a set of
symbols, arranged more or less in lines from top to bottom. The symbols repeat
in some places. Was this some form of writing? In considering this question one
might note that the symbols are highly pictographic: perhaps, then, they are not
writing. On the other hand, we know of writing systems that were also highly
pictographic: three examples are Egyptian, Luwian and Mayan (Figure 2).
2
Figure 1: Kudurru of Gula-Eresh (King, 1912), Plate I.
3
Figure 2: Examples of Mayan (http://en.wikipedia.org/wiki/
File:Palenque_glyphs-edit1.jpg, Luwian (Hawkins, 2000), and
Egyptian (multiple sources on the Web) pictographic scripts.
4
Figure 3: Two adjacent instances of the “horned crown” and “symbol base”, two
highly associated symbols in the Mesopotamian deity symbol system.
5
So perhaps being pictographic is not particularly important in considering whether
a symbol system is writing. Furthermore, the symbols in Figure 1 have other
script-like properties. Note, for example, that some of the symbols are “ligatured”
together, as with the H ORNED C ROWN and S YMBOL BASE combinations in Fig-
ure 3.
In fact, in this case, we know that the symbol system was not writing. Rather,
the symbols in question were deity symbols used (among other things) to mark
kudurru boundary stones, legal documents that specified property rights. The
stones, as in this example, also included actual cuneiform Babylonian writing. The
deity symbols had an essentially heraldic function, to indicate favored deities of
the owner of the property. In this particular case the culture in question was literate
in a writing system that we can read, and it was through that written record that
we were able to understand the function of the non-linguistic deity symbols. But
suppose symbols like those in Figure 1 were the only symbols left behind by the
culture? How then would one know whether one is dealing with writing or not?
The question is an important one. The first impulse when dealing with an un-
known symbol system is to attempt to “decipher” it — i.e. decode the language
underlying it and the way in which the symbols encode that language. But if the
symbols do not encode language, attempts at decipherment are an exercise in futil-
ity. The most one can hope to do is to try to understand the symbols in some other
way than as an encoding of language, though a priori this is a much harder task.
Thus we arrive at the following question: Is there a way to determine, on the
basis of the properties of the symbols, and their distribution, whether the system is
likely to be writing (and thus worth the effort of attempting decipherment), or not?
Some recent work has proposed that the answer to this question is “yes”. One
paper (Rao et al., 2009a) used bigram conditional entropy to argue that the symbols
used by the Indus Valley civilization constituted a writing system and not a non-
linguistic symbol system as had been recently proposed in (Farmer et al., 2004).
Another paper (Lee et al., 2010a) used a more sophisticated method, but one also
based in part on conditional entropy, to argue that Pictish symbols, found on a
few hundred standing stones in Scotland, were part of a heretofore unrecognized
writing system. Both of these papers were widely reported in the popular science
press, where there seemed to be a general consensus that new computational tech-
niques had been discovered that can be used to unravel ancient mysteries. The
only problem, as we have argued elsewhere (Sproat, 2010a), and as we shall in any
case document much more extensively here, is that the techniques reported in those
papers do not work.
A large part of the problem, is that whereas there is now a large amount of
linguistic data around in the form of text corpora from languages both ancient and
modern, there is a dearth of data from non-linguistic symbol systems. While (Rao
6
et al., 2009a) did have some real data from biological sequences and Fortran code,
they compared the Indus and linguistic systems with no real ancient non-linguistic
data, but instead used simulated data based on (mis)conceptions about how a cou-
ple of different ancient symbol systems worked. Lee et al. (2010a) included exam-
ples from European Heraldry (as we had previously in (Farmer et al., 2004)), but
no other ancient or traditional non-linguistic systems. Both studies were limited
in part because a set of corpora of non-linguistic symbol systems was simply not
available. Yet without such corpora, one cannot hope to develop convincing statis-
tical methods that might distinguish between linguistic and non-linguistic systems.
And without such a basis for developing a rigorous approach to the problem, stud-
ies like those of Rao et al. (2009a) and Lee et al. (2010a), despite their obvious
appeal to the popular science press, are effectively meaningless.
This study addresses this situation in two ways. First, we have developed elec-
tronic corpora from published (mostly print) sources for a variety of unequivocally
non-linguistic symbol systems both from the ancient world, from later traditions,
and from modern times. Second, we have used these corpora, in combination with
a set of readily available linguistic corpora from a variety of ancient and modern
languages, to begin an investigation of whether statistical methods might be devel-
oped that can distinguish between the two kinds of systems.
The outline of this monograph is as follows. First, in Section 2 we discuss
general issues in the study of symbol systems, in particular the field of Semiotics,
that are relevant (or not) to our study of non-linguistic systems. Then, in Section 3
we report on an informal survey on factors that are relevant to the judgment that a
sign system “looks like” writing.
Section 4 proposes a first cut at a taxonomy of non-linguistic symbol systems,
something that will be useful for understanding the types of corpora we have col-
lected, and which will also be useful for future research.
In Section 5 we discuss non-linguistic corpora that we have developed, as well
as others that are in progress or are planned. Section 6 describes the XML markup
scheme used to encode the corpora. Then, in Section 7 we describe the linguistic
corpora that we have used for comparison in our statistical tests.
Section 8 outlines the statistical methods we have investigated, including sep-
arate features, and combinations of those features into a decision tree classifier.
Finally, in Section 9 we summarize our conclusions and discuss future direc-
tions for this kind of research.
7
2 Semiotics
In any discussion of symbol systems a natural starting point might seem to be the
field of semiotics (Peirce, 1934; Eco, 1976; Sebeok, 1977). Semiotics is, after all,
“concerned with everything that can be taken as a sign” (Eco, 1976), and is in
principle the theoretical discipline that deals with sign and symbol systems.
There are several reasons however for not adopting much if any semiotic the-
orizing in this work. First and foremost, the notion of sign as it has come to be
used in semiotics is far broader than we need. For example, the entry for sign in
Bouissac’s Encyclopia of Semiotics tells us:
In scholarly writing, the term sign might include, for example,
words, sentences, marks on paper that represent words or sentences,
computer programs . . . pictures, ideograms, graphs, chemical and phys-
ical formulas, fingerprints, ideas, concepts, mental images, sensations,
money, postures and gestures, manners and customs, costumes, rules
and values, the orienting dance of the honeybee, avian display, fishing
lures, DNA, objects made of other signs . . . and also nonrepresenta-
tional objects (perhaps in music or mathematics) that have types of
structure characteristic of other signs. (Bouissac, 1998, page 572)
Insofar as we are interested here in the statistical distribution and denotation of
sets of graphical signs, clearly we are discussing a far narrower concept than what
Bouissac lays out above.
Second, semiotics provides no formal theory of the combination of signs in
text, the key interest here. After all, what we are investigating is whether one
can tell merely by looking at the distribution of symbols, what kinds of things
they represent. Part of the problem, to put it bluntly, is that the field of semiotics
was early on hijacked by deconstructionists, who are not particularly interested in
formal mathematical models.1
Finally we have little use for distinctions common in the semiotics between
symbols, indices, icons, and so forth and just use the neutral term symbol.2 Whether,
for example, a given symbol is iconic for what it represents is largely orthogonal
to the issues that we will be addressing.
1
Even non-deconstructionist semioticians, such as Eco, are not typically well versed in mathemat-
ical theories of information. Cf. the following (incorrect) definition by Umberto Eco: “[according
to the mathematical theory of information] information is only the measure of the probability of an
event within an equi-probable system” (Eco, 1976, page 42).
2
Roughly, a symbol is an arbitrary sign denoting an object, an icon is similar to the object it
represents, and indices are somehow physically connected to the object — e.g. a symptom of an
underling disease. Cf. (Eco, 1976, page 178), though it should be noted that Eco largely rejects this
trichotomy.
8
In this work we will thus restrict ourselves to using the term symbol, where by
symbol we mean set of man-made graphical forms that is itself a member of a set,
termed a symbol system, that has a well-defined cultural function. Thus mathemat-
ical symbols are each a member of the mathematical symbol system. Boy scout
merit badges are individually members of a set of symbols whose function has a
well-defined function in scouting. Mesopotamian deity symbols were individually
members of a set of symbols that had a recognized function of representing favored
deities. And letters of the Roman script, as used for say English, are elements of
the set of symbols used in the English writing system.
For familiar reasons, a symbol cannot in general be defined purely as a graph-
ical form: “A” and “a” are both instances of the same letter <a>, as are multiple
font and handwritten variants of these forms. That is why we defined a symbol as
a set of forms above. This brings up the question of how we decide what is in each
set, and thus which graphical forms are just variants of the same symbol. This is of
course a very real problem when dealing with an unknown symbol system: which
variations in form distinguish among separate symbols, and which do not? In this
work we are fortunate to be dealing with symbol systems that in many cases are
known. So, in Mesopotamian deity symbols there may be multiple variants of the
G OAT F ISH symbol, but scholars of this system know that these are just variants of
the same symbol. Similarly, in known writing systems, we know which variants to
classify as being the same symbol. For symbol systems, such as the Vinča system
or Pictish symbols, whose meanings are not known, we can at least make use of
prior scholarship in determining which distinctions in graphical form are relevant
and which not.
As a practical matter, if one knows the meaning of the symbols the equivalence
classes can be determined by the denotation — the signifié in Saussure’s (1916)
terms, or the object in Peirce’s (1934) terms. Of course in many symbol systems,
including linguistic symbol systems (writing), symbols may be multivocal. Thus,
in the largely segmental writing system of English <a> represents far more than
just one vowel, and the digraph <th> represents at least two phonemes (/T/, /D/). In
systems where the meaning is not known — including many ancient non-linguistic
systems, and any undeciphered script — distributional evidence is often used to
make guesses about equivalence classes.
A special case is symbol systems where the symbols may not actually denote
anything: as we shall discuss in Section 5.5, this may be true of Pennsylvania
Dutch barn stars, where scholarship as well as local tradition weighs in on the side
of the symbols being purely decorative. In such cases, does it even make sense to
talk of a symbol system? We do not attempt to decide this issue here: rather, we
will adopt the conventions of previous scholarship in deciding what the variants of
a given symbol are, and we will simply sidestep the question of what the symbols
9
may or may not represented. Ultimately we are just interested in understanding
how the symbols, whatever they may be, are arranged in text.
10
3 What does Writing “Look Like”?
Can one tell just by looking at a text in an unknown symbol system if it was writing
or not? What does writing look like?
In a recent (January 5, 2012) discussion on the Ancient Near East (Yahoo!)
discussion forum, writing-systems scholar Peter Daniels, in discussing some con-
troversial instances of early writing supposedly excavated at Jiroft, proclaimed:
“Well, it looks like writing.” He then goes on to note that “four small samples of
this script are now known.”3 Daniels, apparently, is willing to countenance some-
thing as writing on the basis of just four small samples because it “looks like”
writing.
But what does it mean to say something “looks like” writing? What particular
features should a symbol system have to “look like” writing? With only four small
samples of a system, there is little hope the system could be deciphered — unless
of course many more samples are found.
In order to elucidate these questions I conducted a small informal survey using
Survey Monkey (http://www.surveymonkey.com). In the survey, partici-
pants were asked to rate, on a scale of 1 to 4 — very unimportant, to very important
– the following factors in making something “look like” a writing system:
11
Repetition rate Very important 18 3.29
Linearity Very important 16 3.24
Number of distinct texts Somewhat important 18 3.00
Text length Somewhat important 16 2.92
Abstractness Somewhat unimporant 17 2.21
The results for the main questions were for the most part unsurprising insofar
as they accorded with my own judgments. The modes, number of respondents (out
of 38) choosing the mode, and the means for each of these questions were as in
Table 1.
Thus it was considered to be very important that symbols be arranged largely
linearly, and that symbols should repeat in texts. The length and number of texts
was considered to be a somewhat important factor. And abstractness of symbols
in pictographic systems was considered to be somewhat unimportant. On the lat-
ter, presumably enough people are familiar with Egyptian and other highly pic-
tographic scripts to know that a system can look like pretty pictures, yet still be
writing.
There was no significant interaction between expertise level and the five main
questions in the survey. The largest difference between group 2 (expert) and groups
1 and 0 (minimal expertise), was on question 3, where members of group 2, perhaps
counterintuitively, considered length of texts a somewhat less important factor than
members of the other groups.
In addition, I asked respondents for other suggestions on other features that
might be relevant to deciding if a symbol system looks like a writing system. I
quote directly some of the suggestions below. First, there were a few suggestions
that relate roughly to the statistical distribution of symbols:
• How important is it that symbols are made on an object such that they could
not have been made randomly (i.e., as an expression of a thought rather than
meaningless scribble)?
12
• Statistical distribution of signs should accord with statistical distribution of
graphemes, syllables, or words in attested languages.
• The number of variations of symbols. How many are there? Does it corre-
spond at all to the number of sounds in a given language?
Finally, one suggestion had to do with the degree to which cursive styles have
developed, suggesting a long tradition of use:
• Degree of cursivity; a fully evolved system should show signs of use, and
therefore influence of the writing instruments, medium, and human dexterity.
The latter suggestion is of particular interest given that one of the arguments put
forward by Farmer, Sproat and Witzel (2004) for the non-linguistic status for the
Indus inscriptions was that over 700 years, there was no evidence of the develop-
ment of a cursive style.
The above survey and discussion is hardly scientific, nor is it intended to be.
No serious scholar believes that one can tell just by looking at a system whether
it is writing or not and misjudgments can go either way: Ignace Gelb, the “father”
of the study of writing systems famously misclassified Mayan writing as a “lim-
ited” preliterate system (Gelb, 1952). But it is nonetheless interesting to see what
people believe are necessary characteristics of a symbol system in order for it to be
considered writing. As we shall see in any case, all of the characteristics that were
considered to be at least somewhat important for writing systems to exhibit, also
show up in non-linguistic systems.
13
4 A Brief Taxonomy of Non-Linguistic Symbol Systems
What is writing then? The way most grammatologists define it, writing is a system
of symbols for representing linguistic information. That linguistic information is
usually some sort of sound information — either segmental phonemes, or syllables,
or perhaps some other unit. Some writing systems, notably many ancient systems
such as Sumerian, Egyptian, Mayan but also systems such as Chinese which are
still used today encode other information in addition to phonological information:
in Chinese, for example characters usually give some clue to the pronunciation
but also contain semantic information about the morpheme that the character rep-
resents. Still, there is a strong case to be made that all fully developed writing
systems represent sound information to a fairly sizable degree, and no full writing
system can get by by purely representing some more abstract linguistic represen-
tation such as meaning (DeFrancis, 1989). But no matter: what is clear is that all
writing systems represent some form of linguistic information. Simply put, they
are linguistic symbol systems.
What about the many non-linguistic systems? What kind of information do
they represent? Whereas writing is narrowly circumscribed by the fact that, what-
ever it represents, it must be some aspect of language, there is no such constraint on
non-linguistic systems. In principle the signs in a system could represent anything
that the creators of the system want them to represent. It seems that we need a way
of classifying non-linguistic systems, and in this section I present a preliminary
taxonomy.
To my knowledge, systematic taxonomies of non-linguistic systems do not ex-
ist. Certainly such systems are discussed. In (Daniels and Bright, 1996), a few
non-linguistic systems are presented at the end of the book: numerical notation,
music notation, movement notation systems. Such short shrift in that book is rea-
sonable, since the main theme is writing systems and, as Daniels puts it (page 785),
these other systems are on “the sidelines of the pageant of writing”. Harris 1995
also discusses a few cases of what he terms “non-glottic writing”: mathematics,
knitting patterns, dance notation and music. But, again, there is no attempt at a
systematic classification of non-linguistic systems.
The following discussion is intended to attempt to fill this void. I start out, how-
ever, with a discussion multivocality — the fact that symbols can often represent
more than one thing, sometimes simultaneously.
14
meanings.
Multivocality is rampant with non-linguistic symbols. For example, Turner
(1967) discusses the Ndembu “Mystery of the Three Rivers” thus:
This mystery (mpang’u) is exhibited at circumcision and funerary
cult association rites. Three trenches are dug in a consecrated site and
filled respectively with white, red, and black water. These “rivers” are
said to “flow from Nzambi,” the High God. The instructors tell the
neophytes, partly in riddling songs and partly in direct terms, what
each river signifies. Each “river” is a multivocal symbol with a fan
of referents ranging from life values, ethical ideas, and social norms,
to grossly physiological processes and phenomena. (Turner, 1967,
page 107)
Or as Farmer et al. (2004) point out, writing of possible interpretations of the Indus
Valley symbols:
But it would be naive to assume that the same signs were inter-
preted at all times in the same way, any more than this was true of
the Christian cross in medieval Europe — which in different contexts
could serve as a political-military symbol, a magical talisman, a sign
of a profession, a mystical aid, or a symbol of death. (Farmer et al.,
2004, page 43)
But it is important to realize that such multivocality is by no means unique
to non-linguistic systems. We already noted above (Section 2), for example, that
<a> in English is multivocal, representing several different vowels depending on
the context. For Sumerian, as we shall discuss again later on (Section 7.13), mul-
tivocality was the norm, with individual cuneiform symbols taking on a variety
of different phonological or logographic denotations. Thus the symbol could
represent the sounds (or morphemes) aya2 , duru5 , eš10 , among a variety of other
functions. Such systematic multivocality was typical for ancient writing systems,
but one sees extreme cases also in modern writing systems such as Japanese, where
a given Chinese character (kanji) often has upwards of six pronunciations (Smith,
1996).
And that’s just when one considers the linguistic functions of a symbol. Often,
though, linguistic symbols can take on other, non-linguistic, functions. Consider
just the letter <a>, and its capital form <A>. In addition to its linguistic functions
it can also denote, the first in a series, the top grade in an academic scoring sys-
tem, a musical note, an adulteress in the context of a particular Hawthorne novel,
the 8th Avenue Express in the New York subway system, or acceleration in me-
chanics. For many other uses see http://en.wikipedia.org/wiki/A_
15
(disambiguation). To be sure, many of these uses derive from the linguistic
use of <a> (the words acceleration and adulteress begin with <a>), and others
from the purely grammatological fact that <a> is first in the alphabet (thus ex-
plaining, the letter grade, or the musical note). But however the extension occurs,
once so extended the symbol has become largely separated from its linguistic func-
tion. The recipient of an “A” in a class is unlikely to think of the linguistic uses of
that symbol to reflect a range of English vowels.
Or consider the description of the Hebrew letter he in the fanciful account of
kabbalist Helmont (1667). Helmont’s theory is that the Hebrew letters represent
the shapes the tongue takes on when making the sounds associated with the letters,
but there are other meanings as well:
When, therefore, the mouth opens, it must close again when it
seeks rest; and then the tongue rises perceptibly, as is suitable for pro-
ducing of the next letter. The name He is a definite article, meaning
“this” or “that.” A certain mystical meaning concerning generation
seems to be hidden in this letter, for all animals produce this sound
when panting from the heat of lust. And for this reason it is probable
that a He, but no other letters, was added to the names of Abraham and
Sarah because many people were descended from them. (Translation
by Coudert and Corse, pages 115–117.)
Thus we have he as the symbol for a sound (/h/), but in addition as the symbol for
a morpheme (the definite article, written with he), as well as a symbol with the
deeper meaning related to generation.
The main difference between cases like <A> and cases like the Ndembu Three
Rivers mystery or (probably) Helmont’s he, is that in the former case, while the
symbol can have multiple meanings, it is unusual for the symbol to have those
meanings all at the same time. Thus again, the academic grade “A” is not contex-
tually ambiguousm with some other use, such as its use to represent the vowels
/eI/ or /@/; the “A” on a New York subway sign does not simultaneously carry the
meaning “adulteress”. But as Turner makes clear, in the Ndembu case, the rivers
simultaneously take on a variety of meanings — indeed that simultaneity of multi-
ple meanings is crucial for the function of those signs in the system. For <a>, not
only are the other uses of the symbol not crucial: they are completely irrelevant.
Thus when it comes to multivocality, one is tempted to classify symbols along
two dimensions. The first is how multivocal they are: a barber pole (see below)
is not notably multivocal in modern times, having essentially one basic function
(to advertise the presence of a barber shop); a Christian cross, as noted above,
is highly multivocal. Depending on the application domain of the symbol — the
type of thing(s) it denotes — there may be more or less ambiguity. Restricting
16
ourselves to its purely linguistic functions, <a> is far less multivocal in a highly
phonemic writing system like Finnish than it is in English: writing systems that by
design are highly regular impose strict constraints on how symbols are used and
thus the opportunity for multivocality is minimized. For non-linguistic systems,
highly formal systems will, again, often impose strict constraints. It is strictly
defined, for example, what the symbol “∀” denotes in logic or mathematics. On the
other hand, in religious symbology, a high degree of multivocality can be desirable.
The second dimension is how readily the symbol can take on its various mean-
ings simultaneously.4 Presumably this relates to the particular domain of appli-
cation of the symbol in question. Most of the functions of the letter <a> are
rather mundane, either denoting particular phonemes, or being used as labels or
symbols in mathematical equations. Religious or ritualistic symbols, on the other
hand, often are used in contexts where the mystery of the rite is enhanced by the
simultaneous appeal to multiple meanings.
Symbols of a variety of types can thus be both highly multivocal, or not; and
can allow for simultaneous interpretation of the several meanings, or not, depend-
ing upon the kinds of things they denote and their function.
With that (ultimately rather obvious) point established, we now turn to a taxon-
omy of symbol system types, starting with the simplest type where symbols usually
occur alone rather than in “texts” (or if they do, where there is no particular rela-
tion between the symbols in the “text”), and where the denotation of the system is
usually pretty straightforward and unambiguous.
17
concatenation.
Besides barber poles and weather icons, some further simple informative sys-
tems include: other symbols of guild such as three balls for a pawnbroker; institu-
tional logos; traffic information signs; ownership signs, such as brands, e.g. Mon-
golian horse brands (Waddington, 1974) or house marks (http://en.wikipedia.
org/wiki/House_mark).
Symbols in simple informative systems may be “etymologically” complex in
that a number of symbols were combined into what eventually became a single
symbol. Thus for the barber pole, the red helix represents blood (barbers used
to perform operations as well as cut hair), white represents bandages, the origin
of the blue helix being less clear (see http://en.wikipedia.org/wiki/
Barber’s_pole). Thus one might characterize the barber pole as being multi-
vocal, though it is a safe bet that most people who see this symbol today will not
be aware of the origin of the symbol and will merely see it as a sign for a barber
shop. In similar ways, symbols that have come to be used to mark ownership of
horses in Mongolia, have complex historical origins (Waddington, 1974).
18
4.4 Religious Iconography
Religious iconography could also be characterized as a simple informative system,
except that here though here the notion of “single piece of information” is much
less clear. The Christian crucifix, for example, represents the Christian faith, and
is used to mark Christian houses of worship as such, but it also retains its original
meaning (a symbol of Christ’s execution), as well as other meanings (e.g. as a
symbol of death). It is, by nature, highly multivocal, a necessary result of its central
function as a religious symbol, as we discussed above.5
Some other obvious symbols in this category: Star of David (Judaism), Star
and Crescent (Islam), Dharmachakra (Buddhism), Swastika (Buddhism), the “Om”
symbol (Hinduism, though this is itself derived from linguistic scripts).
Related to this category, though possibly deserving their own category, are
magic symbols, which hide hidden meanings (Goodman, 1989). Horoscopal signs,
insofar as they generally invoke a range of “powers”, and form part of a usually
complex system of lore that relates one symbol to another, can also be seen as
falling into this category.
19
simpler than the syntax of European heraldry, but there are some weak syntactic
tendencies, such as the tendency for symbols of higher-ranked gods to occur closer
to the beginning of the text.
Heraldic systems occur in many cultures. Apart from European heraldry and
kudurrus, other examples include some functions of Totem poles (Barbeau, 1950)
as well as well as Japanese Kamon (http://en.wikipedia.org/wiki/
Mon_(emblem)), though these are much simpler syntactically than the afore-
mentioned systems.
20
4.8 Narrative systems or “prompt” texts
Narrative systems are used to recount stories and as such are the most language-like
of the non-linguistic systems. In narrative systems, the symbols typically represent
actors or events in the story in an iconic way. For example, in Dakota winter counts
(see Section 5.11.3), each winter was notated by a picture that represented some
critical event that “defined” that particular year. The depiction thus evoked a story
relevant to the given year.
A more elaborate narrative system is the “dongba” symbols of the Naxi; see
(Li, 2001) and Section 5.11.1. In this system, a few thousand distinct symbols
are used to represent important characters, as well as important plants, animals,
topographical and astronomical entities, and so forth. While the system does have
some linguistic elements (for example, sound cues are often written as part of the
text using the Naxi syllabary), the system is basically non-linguistic in that it does
not specify the exact set of words to be read. Naxi texts can be quite long, allowing
for some quite elaborate stories to be told using this system.
Possibly also fitting into this category are some Mesoamerican systems such
as Aztec “writing”, though some scholarship seems to be leaning more towards
considering Aztec writing as true writing, with significant amounts of phoneticism
(Lacadena, 2008).6
Another possible example is rongorongo (Fischer, 1997), the “writing system”
of Easter Island, which has defied decipherment as a true writing system for over
a century despite the best possible conditions for its decipherment that one could
imagine.7
21
are quite irrelevant in their current use. A clear modern example is the use of Chi-
nese characters in body tattoos, worn by people who may be completely unaware
of the character’s original meaning, and use them solely because they look “cool”.
Other examples are Pennsylvania German barn stars (Graves (1984), and Sec-
tion 5.5), and Asian emoticons (Section 5.7), where in the latter case symbols from
various scripts are combined into a “text” that represents an image, usually a face.
The face itself may convey some sort of emotion (e.g. sadness, via depiction of
crying), but other than that has no real meaning and mostly functions to decorate
the surrounding text.
As we shall discuss in detail in Section 5.5, there has been a lot of confusion
in the literature that has equated the notion of “non-linguistic” symbols with dec-
orative systems, under the mistaken belief that something that does not encode
language must therefore be meaningless.
4.10 Summary
In this section we have presented an initial taxonomy of non-linguistic signs. While
the classification is certainly in need of further work and is undoubtedly incom-
plete, it is nonetheless useful in that it gives us a way of characterizing the corpora
that we will describe in the next section. As will be clear we cover some types
well, and are lacking in others, suggesting the need for further work.
One thing that should be clear after reading the foregoing section is what sets
non-linguistic systems apart from linguistic systems. For all of the systems we
have mentioned, it is clear that whereas the “messages” that can be conveyed by
the system can be quite varied, nonetheless the scope of the messages is limited
to the kinds of things for which the system was designed. Mathematical symbol-
ogy can only be used to express mathematical equations and formulae. Kudurrus
could express the deities favored by the owner and “invoke” their power. But none
of these systems can be used to express arbitrary messages. One could not, for
example, write a treatise on mining using the symbols of European heraldry.
A fully developed writing system is capable of expressing any information
that can be conveyed using speech, including any of the content expressed by the
more specialized systems we have been considering here. One can express the
meaning of any mathematical formula using written language: the expression will
certainly not be as compact or as clear, but it will nonetheless be possible to do
it. The reverse — the expression of anything that can be handled in writing using
purely the symbology of mathematics — is, pace Asimov’s psychohistorians in the
Foundation series, not possible.
This is at is base a rather simple and obvious point, but one that is often lost
in the discussion of whether some arbitrary symbol system constitutes a writing
22
I. The Graphemic Inventory 274.
.
38
Rahmani, Les Liturgies Orientales et Occidentales 368–69.
39
The text of the Words of the Institution according to the Anaph-
ora of St. James reads:
ch. 5
23
5 The Corpora
Corpora were selected using both the criteria of what kinds of systems were de-
sirable to represent, as well as what was readily available. Falling into the former
category are Mesopotamian deity symbols, Vinča symbols, Pictish symbols, totem
pole symbols and European heraldry. Mesopotamian deity symbols (kudurrus)
and Vinča systems were mentioned by Rao et al. (2009a) as models for low- and
high-entropy systems respectively. As we shall see, Rao and his colleagues were
completely wrong about this, a point that could only have been known for certain
by actually doing the work of collecting the corpora.
Pictish symbols were, of course, already discussed by Lee et al. (2010a), and
argued to be writing: their inclusion here as non-linguistic accords with what is
probably still the most widely accepted view, but in addition to that, as we shall see,
our tests, including Lee and colleagues’ own measure, suggest it is non-linguistic.
Pacific Northwest totem poles are included because there are a large number
of “texts” in this system, and many sources. A similar point can be made for
European heraldry, where it is possible to find tens of thousands of “texts” in the
form of blazon.
The function and meaning of Vinča symbols and Pictish symbols is unknown.
On the other hand, the meanings of totem poles, European heraldic symbols, and
Mesopotamian deity symbols are well understood. Their inclusion, in fact, is not
purely serendipitous: All of these systems had, at least in part, a heraldic function,
and might serve as a model for the kind of system that the Indus signs represented,
given that it was not writing.
One other traditional system included here is Pennsylvania Dutch barn stars.
This system is different from the rest since, as we noted above and as we shall
discuss more fully below, it is highly likely that the system was purely decorative.
Where to draw the boundary between decorative art and symbols is, of course, a
difficult one, and as we shall argue below, many critiques of Farmer et al. (2004)
have been confused on precisely this point.
Three further corpora were collected because they were easy to collect semi-
or fully automatically from the Web. Weather icon sequences8 can easily by har-
vested by a script that runs at regular intervals. Asian emoticons had been collected
as part of another project I was involved in. Mathematical formulae are available in
the hundreds of thousands from LATEX documents at archiv.org. Such corpora,
which are already in electronic format, require relatively little processing.
For the other corpora, it was necessary to transcribe them from print sources.
The only exception to this was the Pictish corpus, but even here reference to pho-
8
Suggested by a colleague, Bill Zaumen.
24
tographs was needed. In transcribing any symbol system the very first concern
is the inventory of signs, something that can be hard to ascertain for the non-
specialist. Fortunately, in all cases, we could make use of the standard inven-
tories compiled by experts on the systems. Thus, for totem poles, for example,
our sources all gave transcriptions of the depicted poles, naming each symbol; for
Mesopotamian deity symbols our source (Seidl, 1989), lists the symbols used on
each stone; and so forth.
After discussing the various corpora, we present in Section 5.8 statistics on
the size of the various corpora. The Appendix contains examples of the first five
symbol systems, along with the corresponding transcription in the XML markup
scheme we introduce in Section 6.
At the head of each section we propose a classification for the symbol system
in question in terms of the taxonomy developed in Section 4. The breakdown of the
types is as follows. Note that totem poles account for two of the types — heraldic
and narrative, since poles can have either of these functions:
25
Figure 4: An example of a Vinča “text”, from (Winn, 1981).
significance. As Winn states (p. 255): “In the final analysis, the religious system
remains the principle source of motivation for the use of signs.”
Developing Winn’s materials into an electronic corpus was relatively straight-
forward, since he very nicely lays out the texts in his catalog, giving both the form
and the type number for each sign. We followed his linearization of the texts, but
since he also provides line drawings of the artifacts themselves, it was possible in
some cases to indicate the true spatial relationship between the symbols.
26
Figure 5: The Aberlemno 1 stone from http://www.stams.strath.ac.
uk/research/pictish/database.php.
The Picts were an Iron Age people (or possibly several peoples) of Scotland
who, among other things, left a few hundred standing stones inscribed with sym-
bols, with “texts” ranging from one to a few symbols in length. The meaning of the
symbols is unknown, but until recently it would never have occurred to anyone to
assume that they are some form of writing. If nothing else, the Picts evidently had
at least some literacy in the (segmental) Ogham script (Rhys, 1892). An example
of Pictish symbols is shown in Figure 5.
As we noted above, a high profile paper by Rob Lee and colleagues (Lee et al.,
2010a) has argued on the basis of a statistical measure based on conditional entropy
(see Section 8.3), that the symbols comprised a heretofore unrecognized writing
system. While this paper generated a lot of press, it is less clear how widely Lee
and colleagues’ view is accepted among scholars of ancient Scotland. More to the
27
point Lee and colleagues in their reply to Sproat (2010a) — (Lee et al., 2010b),
revealed that they adhere to a much looser notion of what it means to be a writing
system, accepting Powell’s (2009) definition, namely that “writing is a system of
markings with a conventional reference that communicates information”. They
also state that they are comfortable with the result I reported in (Sproat, 2010a)
that, using their methods, the known non-linguistic system of Mesopotamian deity
symbols would be classified as logographic writing (like Pictish symbols): indeed,
they replicated my result. Clearly using this definition, any symbol system where
the symbols have a more or less fixed denotation, would be considered writing. If
that is the definition one wants then there is really nothing to argue about.
But if one wants a stricter definition (such as the one assumed by many students
of writing systems, including (DeFrancis, 1984, 1989; Sproat, 2000) , that writing
systems encode specifically linguistic (and usually phonological) information, then
there seems to be little reason to believe that Pictish symbols were writing.
Fortunately for our work, a corpus of images, comprising 340 stones, along
with information on the symbols on each stone is available from the University of
Strathclyde http://www.stams.strath.ac.uk/research/pictish/
database.php. The stones are cross-referenced to standard texts such as (Jack-
son, 1984; Royal Commission on the Ancient and Historical Monuments of Scot-
land, 1994; Jackson, 1990; Mack, 1997; Sutherland, 1997). The main work that
was needed was to correct the ordering of symbols in some cases since the order-
ing of the symbols in the Strathclyde transcription does not always correspond to
the symbols’ ordering on the stone.
28
Figure 6: The Grizzly Bear Pole of Yan, from (Drew, 1969, page 16).
1931; Garfield, 1940; Barbeau, 1950; Gunn, 1965, 1966, 1967; Drew, 1969; City
of Duncan, 1990; Stewart, 1990, 1993; Feldman, 2003).
There are many recurring themes on totem poles. For example, certain combi-
nations of symbols always allude to the same folk story. A whale with a man and
a woman always refers to the Nanasimget story (Stewart, 1993). In principle then
this could be represented by a single symbol (e.g. NANASIMGET), but instead we
elected to use the symbolUnit tag (Section 6) to represent the grouping:
<symbolUnit>
<symbol><title>Whale</title></symbol>
<symbol><title>Man</title></symbol>
<symbol><title>Woman</title></symbol>
</symbolUnit>
29
5.5 Pennsylvania Dutch Barn Stars
Classification: Decorative
Barn stars, commonly known as “hex signs”, are a traditional decorative art
among Pennsylvania German (“Dutch”) communities. They are found widely in
northeastern Pennsylvania, particularly Berks County, as well as in scattered com-
munities in Ohio and elsewhere. They are mostly used to decorate barns, but can
also be found on other structures, such as porches. Contrary to common belief,
they are not found among the Plain People (Amish and Mennonites), who eschew
decoration of any kind. Traditional barn symbols consist mostly of stars, rosettes,
wheels-of-fortune, and swastikas. Ironically, none of these forms are readily avail-
able from any of the purveyors of “hex signs” for tourists.
There is a common belief that “hex signs” had a magical function, such as to
ward off evil spirits, and indeed this belief was sometimes expressed by the German
farmers who used the symbols on their barns (Mahr, 1945). And the barn symbols
themselves can be traced back in many cases to symbols that probably had definite
sets of meanings at one time (Mahr, 1945; Graves, 1984; Yoder and Graves, 2000).
That said, the consensus of much of the small scholarly literature on this topic is
that barn stars probably had no particular meaning at all for the people who used
them and were used rather for decoration (Graves, 1984; Yoder and Graves, 2000).
If barn stars are purely decorative, why include them here? There are two
reasons. First of all, the symbols can occur in “texts” of several symbols in length.
While the texts look rather unlike written texts — for one thing, they are almost
always symmetric — they are nonetheless interesting from the point of view of
developing statistical techniques that might possibly distinguish linguistic from
non-linguistic systems.
Second, much of the critique of (Farmer et al., 2004) has depended on a funda-
mental confusion of what is meant by the term “non-linguistic symbol system,” the
most common misconception being that we were referring to randomly arranged
meaningless symbols. A good example of this misconception is expressed by Vi-
dale in his critique of our paper (Vidale, 2007). Using the obviously decorative
pottery designs of Shahr-e Sukhteh and elsewhere as his examples he states (page
344):
30
water-like patterns and interwoven snakes at Jiroft, and redundant spec-
ular doubling of most major symbols in the Dilmunite seals). While
positional regularities might be detected in part of the Jiroft figuration,
redundancy in all these systems dismiss one of the basic assumption
of Farmer & others, who take the rarity of repeating signs as a proof
of the non-linguistic character of the Indus script.
31
Figure 7: 6-pointed star (left) versus rosette.
32
Figure 9: A sample weather icon sequence: forecast for Portland, Oregon, April
29, 2011.
|ˆ_ˆ|
[o_-]
\(ˆvˆ)/
33
Corpus # Texts # Tokens # Types Mean text length
Asian emoticons 10,000 59,186 334 5.9
Barn stars 310 963 32 3.1
Mesopotamian deity symbols 69 939 64 13.6
Pictish stones 283 984 104 3.5
Totem poles 325 1,798 477 5.5
Vinča 591 804 185 1.4
Weather icons 10,142 50,710 16 5.0
Table 2: Number of texts, type and token counts, and mean text length for the
corpora collected so far.
34
words “integrate”, or “integral”, or “integration”. Similarly ex represents the math-
ematical exponentiation of the irrational number e to the power of variable x, not
the English expression “e to the x”.
Fortunately mathematical expressions are easy to harvest automatically from
online sources. For this project, 35,492 LATEX documents from the ArXiv database
(http://www.arxiv.org) have been downloaded from http://www.cs.
cornell.edu/projects/kddcup/. From these we extracted 414,492 equa-
tions delimited by \begin{equation} and \end{equation}.
35
Azure, a bend or Quarterly argent and gules Party per pale argent and vert,
a tree eradicated counterchanged
Figure 11: Examples of arms and their corresponding blazon from http://en.
wikipedia.org/wiki/Blazon.
blazon1,2 are full blazons of the specified quarters.12 Second, it will naturally fail
or produce an incorrect analysis if there are OCR or other errors in the text: this
feature actually proved useful in that it allowed us to filter out a large number of
examples with OCR errors that would have been impractical to clean up by hand,
and focus on a still reasonable sized subset that was mostly clean. Third, and most
importantly, the XML structures generated by pyBlazon are overly verbose, and
fail to mark features that are useful for the kind of analyses we want to perform
here. We discuss this issue further in Section 6.
36
Figure 12: The first page of NZA002 in the Library of Congress Collection.
37
Symbolics (Bliss, 1965), which do allow users to compose a limited, though still
large number of messages. Finally the Naxi culture is an interesting living example
of how linguistic and non-linguistic symbol systems can exist side by side, serving
different sociolinguistic functions.
38
Figure 13: Buffalo robe winter counts (http://wintercounts.si.edu/
html_version/images/a.jpg)
really a separate “text”, and thus the system is non-combinatoric. However winter-
counts were compiled into “texts” such as the famous buffalo robe depicted in
Figure 13, which might be mistaken for writing by someone who did not know
how the system worked.
Yet further systems. Other systems that various people have suggested include
chess notation, game notation (e.g. chess), dance notation (labanotation) and Wikipedia
barn stars (http://en.wikipedia.org/wiki/Wikipedia:Barnstars).
Many of these, including obviously Wikipedia barn stars can easily be mined from
online sources.
39
6 The Markup Scheme
The XML-based markup system, also described in (Wu et al., 2012) is intended to
be minimally verbose, and at the same time give enough information to allow for
the extraction of symbols from the texts according to different criteria of interest.
For example, if the text appears in multiple lines, then the line breaks might be
of interest. If the text appears in a circle, e.g. on a disk, then we want to tag the
text as circular since in such cases it it not generally clear where the text begins.
If symbols are grouped or ligatured together, we would like a way of representing
that. If a symbol is uncertain, that information should be marked. Bibliographic
information, the source of the particular text, and so forth also needs to be recorded.
An example (see also the Appendix for further examples) is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="mySchema.xsd">
<entry>
<bib>
<author>Drew, F. W. M.</author>
<year>1969</year>
<title>Totem poles of Prince Rupert.</title>
<pubInfo>Prince Rupert, B.C.: F.W.M. Drew.</pubInfo>
</bib>
40
one or more documents from a single source. The entry starts with a bib ele-
ment, whose contents should be self-explanatory. This is followed by a set of one
of more document entries, consisting of an description and page informa-
tion. Attributes are the type of document and the origin. Then follows the
actual docText, consisting minimally of a set of symbol elements containing
the title (the actual symbol). docText elements may also contain side el-
ements for multisided documents, and line elements for multiline texts, each of
these containing symbol elements. For circular texts we provide a circle tag.
Symbols that are grouped together in some way are collected under the
symbolUnit element, as in the above example. The symbolUnit may con-
tain a description, which explains the significance of the grouping. One of
the questions that arises in the analysis of a symbol system is whether to treat ele-
ments that seem to be composed of more atomic symbols as single symbols or as
a composite of the individual symbols. The symbolUnit preserves the grouping
structure so that one can make the choice of level of analysis later on.
What we have just described covers most of the texts we have encoded to date;
further technical details on the XML markup can be had by consulting the XML
schema that will be published along with the corpora.14
One corpus for which the above scheme is not adequate is European heraldry.
This is in part because unlike most of the other symbol systems under consider-
ation, heraldry makes meaningful use of two dimensions, and in part because the
blazon transcription we are working from includes not only symbols — charges,
colors and so forth that are significant elements of the system — but also English
words such as a, the or and, which are not part of the underlying symbol system,
and should usually be discarded in statistical analysis. We started with pyBlazon
(http://web.meson.org/pyBlazon/), a Python module for parsing bla-
zon descriptions, and as we discussed in Section 5.10, the 13,190 blazons in our
set are those blazons (from a much larger set) that could be parsed using pyBlazon.
The output of pyBlazon is an XML representation, but it is highly verbose
and has many levels of embedding that seem largely unnecessary. It also fails to
mark terms such as a or the as grammatical morphemes, which we need. Much of
our work on blazon has involved simplifying this output so as to represent all the
critical information with a minimum of structure, while still preserving necessary
information, and in some cases adding useful information. An example of this
modified markup can be seen below for the blazon azure on a bend ermine three
buckles gules:
<blazon>
14
Publication details to follow.
41
<treatment>
<color>azure</color>
</treatment>
<charges>
<group>
<spatial>on</spatial>
<grammar>a</grammar>
<ordinary>
<title>bend</title>
<coloring>
<fur>ermine</fur>
</coloring>
</ordinary>
<group>
<amount>3</amount>
<charge>
<title>buckles</title>
<coloring>
<color>gules</color>
</coloring>
</charge>
</group>
</group>
</charges>
</blazon>
42
Corpus # Texts # Tokens # Types Mean text length
Amharic 111 17,747 219 159.9
Arabic 10,000 429,463 62 42.9
Chinese 10,000 136,246 2,738 13.6
Ancient Chinese 22,359 91,010 701 4.1
Egyptian 1,259 38,804 691 30.8
English 10,000 503,309 76 50.3
Hindi 1,000 61,254 77 61.2
Korean 10,000 223,869 984 22.4
Korean (Jamo) 10,000 438,440 102 43.8
Linear B 439 5,465 220 12.4
Malayalam 937 46,497 80 49.6
Oriya 1,000 40,242 75 40.2
Sumerian 11,528 50,318 692 4.4
Tamil 1,000 38,455 61 38.5
43
7.1 Amharic
Collection of texts from http://www.waltainfo.com. The text was tok-
enized at the syllable level, with spaces represented as a separate token (“#”).
7.5 Egyptian
Collection of texts transcribed using the JSesh hieroglyph editor (http://jsesh.
qenherkhopeshef.org/) downloaded from http://webperso.iut.univ-paris8.
fr/˜rosmord/hieroglyphes. The text was tokenized at the glyph level.15
7.6 English
Collection of the first 10,000 headlines from the LDC English Gigaword corpus
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=
LDC2003T05. The text was tokenized at the letter level, with spaces represented
as a separate token (“#”).
15
The following texts were used, all in http://webperso.iut.univ-paris8.fr/
˜rosmord/hieroglyphes/: CT160 S2P.hie, DoomedPrince.hie, HAtra.hie, HetS.hie, L2.hie,
LC26.hie, Pacherereniset.hie, Prisse.hie, gurob.hie, amenemope/*.gly, ikhernofret.hie, ineni.hie,
kagemni.hie, lebensmuede.hie, mery.hie, naufrage.hie, sethnakht.hie, twobro.hie, year400.hie.
44
7.7 Hindi
Collection of the first 1,000 lines of text from the Hindi corpus developed at the
Central Institute of Indian Languages. The text was tokenized at the letter level,
with spaces represented as a separate token (“#”).
7.8 Korean
Collection of the first 10,000 headlines from the LDC Korean Newswire corpus
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=
LDC2000T45. Following the standard Unicode encoding of Korean, the text was
tokenized at the Hangul syllable level, with spaces represented as a separate token
(“#”).
7.10 Linear B
Transcription by the author of 300 tablets from (Ventris and Chadwick, 1956).
Omitted lines with uncertainties and/or omissions. Tokenized at the glyph level.
7.11 Malayalam
The Malayalam corpus (937 lines) developed at the Central Institute of Indian Lan-
guages. The text was tokenized at the letter level, with spaces represented as a
separate token (“#”).
7.12 Oriya
Collection of the first 1,000 lines of text from the Oriya corpus developed at the
Central Institute of Indian Languages. The text was tokenized at the letter level,
with spaces represented as a separate token (“#”).
7.13 Sumerian
Tablets from Gudea ruler of Lagah (c. 2100 BCE), extracted from the Cuneiform
Digital Library Initiative (http://cdli.ucla.edu/) via a transliteration search
for gu3-de2-a, with “rime 3” in the primary publication field. Omitting lines with
45
“#”, “<” or “[”, since these mark uncertainty/reconstructions.16 “Texts” consisted
of individual lines in the CDLI transcription, meaning that the Sumerian “texts”
are rather short.
One issue with Sumerian that we will need to resolve in future work is that the
symbols are transcribed, following standard Sumerological practice, with roman-
ized spellings of either the pronunciation of the symbol (if in lower case) or of the
morpheme denoted by the symbol (if in upper case), followed by a subscript. Thus
aya2 is a transcription of , which is one of the ways of writing the two-syllable se-
quence a-ya. The problem is that the system is not many-to-one, but many to many:
could also be duru5 , eš10 , and several others. Standard Sumerological transcrip-
tions thus give an overestimate of the actual number of glyph types appearing in a
document. This issue can only be resolved by knowing which transcribed elements
map back to a single glyph.17
7.14 Tamil
Collection of the first 1,000 lines of text from the Tamil corpus developed at the
Central Institute of Indian Languages. The text was tokenized at the letter level,
with spaces represented as a separate token (“#”).
16
Thanks to Chris Woods for suggesting appropriate search terms.
17
At least some of this information is available in the Pennsylvania Sumerian Dictionary (http:
//psd.museum.upenn.edu/epsd/index.html), though expert manual intervention will
almost certainly be needed beyond any automatic method that can be developed.
46
8 Statistical Models
In this section we explore the use of existing measures, as well as propose new mea-
sures and approaches for the task of distinguishing linguistic from non-linguistic
symbol systems. One of the byproducts of this investigation is the result that previ-
ously proposed measures are largely uninformative about a symbol system’s status.
In particular, Rao’s entropic measures (Rao et al., 2009a; Rao, 2010) are evidently
useless when one considers a wider range of examples of real non-linguistic sym-
bol systems. And Lee’s measures (Lee et al., 2010a), with the cutoff values they
propose, misclassify nearly all of our non-linguistic systems. However, we also
show that one of Lee’s measures, with different cutoff values, as well as another
measure we develop here, do seem useful. But we also demonstrate that they are
useful largely because they are both highly correlated with a rather trivial feature:
mean text length.
We note at the outset the way in which we extract the texts from the XML
for each of the corpora: every document is extracted as a single text, ignoring
line breaks and sides. We extract symbols rather than symbolUnits and thus
extract the finest grained breakdown of the texts. We also ignore uncertainty mark-
ing on symbols. These decisions could of course affect our results, though they
are unlikely to change the results enough to affect the overall conclusions of our
analysis.
Since not all readers will have a background in statistics or information theory,
we provide, at the introduction to each of these sections, a non-technical synop-
sis of the main issues discussed in the section as well as a summary of the basic
conclusions.
47
which is neither completely rigid nor completely free, and show plots
that seem on the face of it to be highly convincing.
But no symbol systems of any kind are completely rigid or com-
pletely free, so the entropic behavior of the Indus Valley symbols is un-
surprising and uninformative. Indeed analysis of our corpora shows
no clear separation of linguistic and non-linguistic systems using the
measure of conditional entropy.
One measure that has been used a lot to measure the information structure of
language is conditional entropy (Shannon, 1948), for example, bigram conditional
entropy, defined as follows:
∑
H(Y |X) = − p(x, y)logp(y|x) (1)
x∈X ,y∈Y
48
Figure 14: Bigram conditional entropy growth curves of various linguistic sym-
bol systems, the Indus Valley symbols, and two artificial models of non-linguistic
systems, from (Rao et al., 2009a). See the text for further explanation. Used with
permission of the American Association for the Advancement of Science.
49
Figure 15: Relative conditional entropy (conditional entropy relative to a uni-
formly random sequence with the same number of tokens) of various systems from
(Rao et al., 2009a). Used with permission of the American Association for the
Advancement of Science.
Figure 14 shows these curves, taking n = 20, for a variety of actual writ-
ing systems, Mahadevan’s corpus of Indus inscriptions, and two artificial systems
with maximum entropy, and minimum entropy, which Rao and colleagues label as
“Type 1” and “Type 2”, respectively. Figure 15 shows the relative conditional en-
tropy — “the conditional entropy for the full corpus relative to a uniformly random
sequence with the same number of tokens” — for the various systems represented
in Figure 14, as well as the real non-linguistic systems, DNA, protein (“prot”) and
Fortran (“prog lang”).
While the maximum (“Type 1”) and minimum (“Type 2”) curves are generated
from completely artificial systems, Rao and colleagues argue that they are in fact
reasonable models for actual symbol systems: Vinča symbols and Mesopotamian
deity symbols, respectively. Their beliefs are based, respectively, on Winn’s (Winn,
1990) description of a certain subset of the Vinča corpus as having no discernible
order; and on the fact that there is a certain hierarchy observed in the ordering of the
symbols on kudurru stones. In (Sproat, 2010a) and (Sproat, 2010b), I argued that
Rao’s interpretation of how these symbol systems work is a misinterpretation, and
indeed the results we present below support that conclusion. Rao and colleagues
50
also present results for DNA and protein in Figure 15 (though not in Figure 14): it
is clear from that figure that the curves of these biological symbol systems are very
close to the maximum entropy curves. At least in the case of DNA, though, this
is because Rao and colleagues take the “alphabet” of DNA to be the base pairs, an
alphabet of size 4. But these are surely not the right units of information: the base
pairs are effectively the “bits” of DNA. The equivalent for natural language would
be taking the basic units of English to be the ones and zeroes of the ASCII en-
coding — and throwing in large amounts of random-looking “non-coding regions”
to mimic the non-coding regions of DNA — surely a meaningless comparison for
these purposes.
We are not sure exactly how Rao and colleagues computed their results: while
Rao reports that they used Kneser-Ney smoothing (Kneser and Ney, 1995), a num-
ber of details are not clear: for example, it is not clear whether they included esti-
mates for unseen bigrams in their analysis. Discrepancies such as this and others
may help explain why the entropy estimates we present below are on bulk lower
than those presented in Rao and colleagues’ work. It is not possible to produce
plots that are directly comparable to Rao’s plots. However we can produce entropy
growth curves for a variety of systems using similar estimation techiques.
In Figure 16 we show conditional entropy growth curves for all of our non-
linguistic systems (in blue), all of our linguistic samples (in red), and for a corpus
of Indus bar seals from Harappa and Mohejo Daro (in green), which we developed
as part of the work reported in (Farmer et al., 2004). The details on this latter
corpus are given below:
Corpus # Texts # Tokens # Types Mean text length
Indus bar seals 206 1,265 209 6.14
Note that the bar seals are relatively late instances of Indus inscriptions, and that
the mean length of the texts, 6.14, is substantially longer than the mean length of
all Indus inscriptions taken together, which is 4.5 (Parpola, 1994; Farmer et al.,
2004).
Entropy statistics were computed using the OpenGrm NGram library (Roark
et al., 2012), available from http://www.opengrm.org. We used Kneser-
Ney smoothing, including the probability and thus entropy estimates for unseen
cases. Start and end symbols are implicit in the computation.
As can be seen in the plot, the results are pretty much across the map. The
kudurru inscriptions turn out, contrary to what Rao and colleagues claim — and
what Rao continued to insist in (Rao, 2010) — to have the highest entropy of all the
systems. Chinese shows the lowest. Between these two extremes the linguistic and
non-linguistic systems are mixed together. The Indus bar seal corpus, falls in the
middle of this distribution, but that is quite uninformative since the distribution is
51
Conditional Entropy
● Barn
4
Totem
Vinca
Weather
Kudurru
Pictish
Emoticons
Chinese
Egyptian
● English
Ethiopic
Hindi
3
● Korean jamo
Korean
Linear B
Conditional Entropy
● Malayalam
Oracle
Oriya
● Sumerian
● Tamil
● Arabic
Indus bar ● ● ● ●
● ●
2
● ● ● ● ●
●
●
●
● ● ● ●
● ●
● ●
●
● ● ●
●
●
● ● ●
● ●
●
1
●
●
● ●
●
●
●
●
0
Figure 16: Bigram conditional entropy growth curves computed for our linguistic
and non-linguistic corpora, and for Indus bar seals.
a mix of both types of system. One concern is that the corpora are of many differ-
ent sizes, which would definitely affect the probability estimates. How serious is
this effect? Could the (from Rao and colleagues’ point of view) unexpectedly high
entropy estimate for the kudurru corpus be due simply to undersampling? In Fig-
ure 17 we address that concern. This plots the bigram conditional entropy growth
curves for three samples from our Arabic newswire headline corpus containing 100
lines, 1,000 lines and 10,000 lines of text. Note that Arabic has roughly the same
number of symbol types as the kudurru corpus. The entropy does indeed change
as the corpus size changes, with the smaller corpora having higher overall entropy
than the larger corpora. This is not surprising, since with the smaller corpora, the
estimates for unseen cases are poorer and, one would expect, more uniform, lead-
52
4
3
Conditional Entropy
●
●
2
●
1
Number of Tokens
Figure 17: Bigram conditional entropy growth curves for three different sized
Arabic corpora.
ing in turn to higher entropy estimates. But the difference is not huge, suggesting
that one cannot attribute the relatively high entropy of the kudurru corpus wholly
to its small size. Given a larger corpus, the estimates would surely be lower. But
there is clearly no reason to presume, as Rao does, that a large kudurru corpus
would show a conditional entropy near zero.
Bigram conditional entropy thus seems not to be very useful as evidence for the
linguistic status of symbol systems. This is hardly surprising: an entropy measure
in the “middle” range merely tells us that the system is neither close to completely
unconstrained, nor close to completely rigid, something one might expect to hold
of any symbol system. But it does contradict Rao and colleagues’ claims, so a nat-
ural question is how to interpret these results vis-à-vis the results presented in Rao
et al’s work? In order to answer that question, one needs to understand how Rao
and colleagues interpret their own results. When Rao’s original paper came out in
Science in 2009, it was billed in the popular press as a demonstration that the In-
dus symbols constituted writing. The most natural interpretation of that would be
that the methods proposed by Rao and his colleagues constitute a way of discrim-
inating between linguistic and non-linguistic systems. The plot in Figure 14 goes
53
a long way towards reinforcing that interpretation. However, Rao never actually
says this: what the paper says is that the results “increase the probability” that the
Indus system was writing. Only when the objections I raised in (Sproat, 2010a)
forced a response (Rao et al., 2010) , did it become clear that Rao did not interpret
their results in a discriminative way. Rather they took a generative interpretation:
consider two hypotheses HL which states that the Indus system was linguistic and
HN L , which states that it was non-linguistic. Which hypothesis is more consistent
with the facts? Rao then goes on to discuss a range of evidence that they argue is
more consistent with HL . Included in this evidence are:
• Apparent evidence (presented in (Rao et al., 2009b)) for different uses of the
symbols in seals unearthed in Mesopotamia, suggesting a difference in the
underlying language.
If the Indus symbol system was linguistic, the argument goes, then one would
expect all of the above features, as well as the behavior evidenced by the entropic
measures that Rao and colleagues present.
The problem with this line of reasoning is that it is all too easy to come up with
a list of features that make a known non-linguistic system look linguistic. Consider
the hypothesis that Mesopotamian deity symbols constituted a form of writing.
Under that hypothesis, one would expect properties such as the following:
1. Deity symbols are frequently linearly written and in the cases where linearity
is less clear, the symbols are written around the top of a stone in an apparent
concentric circle pattern (see examples in (Seidl, 1989)). One sees such non-
linear arrangements with scripts too: the Etruscan Magliano disk (http://
it.wikipedia.org/wiki/Disco_di_Magliano, and many rune
stones have text wrapped around the border of the stone (see, e.g., http:
//en.wikipedia.org/wiki/Lingsberg_Runestone).
2. There is clear evidence for the directionality of deity symbols: to the extent
that “more important” gods were depicted first (see below), these occur at
the left/top of the text.
54
Figure 18: The linearly arranged symbols of the major deities of Aššurnas.irpal II.
From http://upload.wikimedia.org/wikipedia/commons/8/
87/Ashurnasirpal_II_stela_british_museam.jpg, released under
the GNU Free Documentation License, Version 1.2.
3. Deity symbols are often ligatured together: one symbol may be joined with
another. 18
6. Deity symbols are pictographic, like many real scripts — Egyptian, Luwian,
Mayan, Hittite hieroglyphs.
7. Deity symbols were used largely on standing stones, but were also used in
other contexts such as the “necklace” depicted in Figure 18, suggesting that
there were a variety of media in which one could create “texts”.
All of these properties hold of the deity symbols, yet we know that this symbol
system did not represent language.
Turning back to the Indus system, one can easily list features of the system that
seem more consistent with HN L . Consider features such as:
18
We note in passing that it is actually unclear for the Indus symbols that there are ligaturings
or modifications of symbols. A given symbol may look as if it is a modified version of another,
or ligatured, but unless we know what the symbols mean it is hard to be sure. In the case of the
Mesopotamian deity symbols, we actually know what the symbols denoted.
55
• All extant texts are very short.
• The system was used in a culture where there are no archaeological markers
of manuscript production (such as pens, styluses, ink pots . . . ).
• No evidence for the development, even over a 700 year period, of the kind
of cursive forms that one expects in a true writing system.
These features, and others, were discussed in (Farmer et al., 2004). To be sure,
that work has not gone without its critics — see, e.g., (Vidale, 2007) — and the
interested reader is invited to compare our original arguments with those of the
critics to see which seem more convincing.
Returning to the statistical evidence, one thing that is obvious is that if instead
of Figure 14 from (Rao et al., 2009a), it had been clear from the outset that the true
situation is more like that presented in Figure 16, it would have been significantly
less easy to argue that conditional entropy supports the linguistic status of the Indus
signs — under any interpretation of what “support” means.
Here N is the length of an ngram of tokens, say 3 for a trigram, so all we are doing
is computing the entropy of the set of ngrams of a given length. One can vary the
56
length being considered, say from 1 to 6, and thus get estimates for the entropy of
all the unigrams, bigrams, etc., all the way up to hexagrams. In order to compare
across symbol systems with different numbers of symbols, Rao uses the log to the
base of the number of symbols in each system (log|V | in the equation). Thus for
DNA the log is taken to base 4. The maximum entropy thus normalized is then just
N , the length of the ngrams being considered, for any system.
Figure 19 shows the block entropy estimates from (Rao, 2010) for a variety
of linguistic systems, some non-linguistic systems, the Indus corpus as well as the
maximum and minimum entropy systems from (Rao et al., 2009a).19 For this work,
Rao used an estimation method proposed in (Nemenman et al., 2002), which comes
with an associated MATLAB package. In this case, therefore, it is possible to
replicate the method used by Rao and produce results that are directly comparable
with his results in Figure 19.
These we present in Figure 20. As with the conditional entropy, and unlike
Rao’s clear-looking results, the systems are quite mixed up, with linguistic and
non-linguistic systems interspersed. The Indus system is right in the middle of the
range. Note that the curve for our Indus corpus is rather close to the curve for the
Mahadevan corpus seen in Figure 19, suggesting that even though the bar seals
constitute a smaller subset of the whole corpus with rather different average text
length, there is some similarity in the distributions of ngrams. Note on the other
hand that our sample of Sumerian is radically different from Rao’s in its behavior,
and in fact looks more like Fortran. The Ancient Chinese Oracle Bone texts are
also similar to Fortran, which is perhaps not surprising given that one finds quite a
few repeating formulae in this corpus.
Once again, if the true situation represented in Figure 20 had been clear rather
than the situation presented in Figure 19, one could not have made much of a case
that statistical measures “increase the probability” of the Indus symbols being a
writing system.
57
Figure 19: Block entropy estimates from (Rao, 2010) for a variety of linguistic
systems, some non-linguistic systems, and the Indus corpus. Used with permission
of the IEEE.
Unlike Rao, Lee and colleagues (2010a) are forthright in their claim that they
58
Block Entropy
● Barn
6
Totem
Vinca
Weather
Kudurru
5
Pictish
Emoticons
Chinese
Normalized Block Entropy
Egyptian
● English
4
Ethiopic
Hindi
● Korean jamo
Korean
Linear B
3
● Malayalam
Oracle ●
●
Oriya ●
● ●
●
● Sumerian ● ● ●
● ●
● ●
2
● Tamil ●
● ●
● Arabic ●
●
●
Indus bar
● ● ●
●
●
●
● ●
● ● ● ●
1
●
●
●
●
●
●
0
1 2 3 4 5 6
Figure 20: Block entropy values for our linguistic and non-linguistic corpora,
and the Indus bar seal corpus. As in Rao’s case, we used the method reported in
(Nemenman et al., 2002).
have developed a discriminative method. Lee et al.’s paper attempts to use a couple
of measures one of which is derived from bigram conditional entropy to argue that
Pictish symbols were a writing system. As with Rao et al.’s work, they compare
the symbols to a variety of known writing systems, as well as symbol systems like
Morse code, and European heraldry, and randomly generated texts — by which,
again, is meant random and equiprobable.
Lee and colleagues develop two measures, Ur and and Cr , defined as follows.
First, Ur is defined as:
F2
Ur = (3)
log2 (Nd /Nu )
59
where F2 is the bigram entropy, Nd is the number of bigram types and Nu is the
number of unigram types. Cr is defined as:
Nd Sd
Cr = +a (4)
Nu Td
where Nd and Nu are as above, a is a constant (for which, in their experiments,
they derive a value of 7, using cross-validation), Sd is the number of bigrams that
occur once (hapax legomena) and Td is the total number of bigram tokens; this
latter measure will be familiar as nN1 , the Good-Turing estimate of the probability
mass for unseen events. To illustrate the components of Cr , Lee et al. show a plot
(their Figure 5.5), reproduced here as Figure 21. According to their description this
shows:
Note that the non-linguistic system of heraldry seems to have a much lower num-
ber of singleton bigrams than would be expected given the corpus size, clearly
separating it from linguistic systems.
Lee et al. use Cr and Ur to train a decision tree to classify symbol systems.
If Cr ≥ 4.89, the system is linguistic. Subsequent refinements use values of Ur
to classify the system as segmental (Ur < 1.09), syllabic (Ur < 1.37) or else
logographic. Their decision tree is shown in Figure 22.
What happens If we apply Lee et al’s tree, “out of the box” to our data? Table 4
shows the results of doing this.20 Not surprisingly, the Pictish symbols are classi-
fied as (logographic) writing, consistent with Lee et al.’s results. But, with the ex-
ception of Vinča symbols, which are appropropriately classified as non-linguistic,
the remainder of the known non-linguistic systems in our set are misclassified as
some form of writing.21 I already presented a preliminary result for Mesopotamian
deity symbols in (Sproat, 2010a), where at that time I reported that Lee et al’s sys-
20
Note that like (Lee et al., 2010a), in performing these computations, we pad the texts with start
and end symbols.
21
Note that all of our linguistic corpora are correctly classified as linguistic, though not always as
the right type of linguistic system.
60
Figure 21: Reproduction of Figure 5.5, page 8, from: Lee, Rob, Philip Jonathan,
and Pauline Ziman. “Pictish symbols revealed as a written language through ap-
plication of Shannon entropy.” Proceedings of the Royal Society A: Mathematical,
Physical & Engineering Sciences, pages 1–16, March 31, 2010. Used with permis-
sion of the Royal Society. See text for explanation.
61
Figure 22: Reproduction of Figure 6, page 9, from: (Lee et al., 2010a).
is supposed to tell you whether or not you are looking at such a system, then it
should be somewhat disturbing that, excluding Pictish, it misclassifies five out of
six known non-linguistic systems as linguistic.
But, no doubt, Lee and his colleagues would be mostly satisfied with this result:
as already noted. Indeed, of the six systems listed in Table 4, four of them certainly
convey information of one kind or another, and Vinča (classified as non-linguistic)
may have been meaningless (though we do not know). Barn stars, though, are
almost surely purely decorative, so it is an embarrassment to their approach that
these are classified as a form of “writing”.
62
Corpus Classification
Asian emoticons linguistic: letters
Barn stars linguistic: letters
Mesopotamian deity symbols linguistic: syllables
Pictish symbols linguistic: words
Totem poles linguistic: words
Vinča symbols non-linguistic
Weather icon sequences linguistic: letters
Table 4: The results of applying Lee et al.’s (2010a) decision tree from Figure 22
to our data.
how common one would expect that pair to be if there were no partic-
ular association between them.
As with entropic measures, this association measure is not useful
at distinguishing between linguistic and non-linguistic systems.
So far we have looked at measures that, one way or another, involve entropy.
Another property of language is that some symbols tend to be strongly associated
with each other in the sense that they occur together more often than one would
expect by chance. There is by now a large literature on association measures,
but all of the approaches compare how often two symbols (e.g., words), occur to-
gether, compared with some measure of their expected cooccurrence. One popular
measure is Shannon’s (pointwise) mutual information (Shannon, 1948; Church and
Gale, 1991) for two terms ti and tj :
p(ti , tj )
PMI(ti tj ) = log (5)
p(ti )p(tj )
This is known to produce reasonable results for word association problems, though
there are also problems with sensitivity to sample size as pointed out by Manning
and Schütze (1999).
In this study we use pointwise mutual information between adjacent symbols
to estimate how strongly associated the most frequent symbols are with each other.
For words at least, it is often the case that the most frequent words — function
words such as articles or prepositions — are not strongly associated with one an-
other: consider that the the or the is are not very likely sequences in English.
We therefore look at the mean association of the n most frequent symbols,
63
12 PMI
● Barn
Emoticons
Totem
Vinca
Weather
11
Kudurru
Pictish
Chinese
Egyptian
10
● English
Amharic
Hindi
● Korean jamo
9
Association
Korean
● Linear B
● Malayalam
Oracle
8
Oriya
●
● Sumerian
●
Tamil
● ● Arabic
7
●
●
Indus bar
●
●
Amharic: small
Arabic: 1000
● Arabic: 100
6
●
●
●
●
●
5
0 10 20 30 40 50 60 70
# Types
Figure 23: PMI association computations for our corpora, computed over subsets
of each corpus starting with the 10 most frequent symbols, the 20 most frequent,
and so forth, up to 25% of the corpus. See the text for more detailed explanation.
64
8.5 Models of Repetition
All of the measures we have discussed so far share one crucial
property: they are local, in that they involve statistics computed over
adjacent symbols or symbol sequences. But language — and other
symbol systems, have many non-local properties too. For example, it
is typical that certain words will repeat in a text. If you have read this
far in this monograph, for example, you will have seen about 1,000
instances of the word the. With one metalinguistic exception, none of
these thes are adjacent to each other and few of them are even very
close to each other.
Could a measure of how often symbols repeat in text, whether or
not those repetitions are adjacent, be useful in distinguishing between
linguistic and non-linguistic symbol systems? We develop a simple
measure of repetition and show that of all the measures so far, this one
is by far the most informative. However, at least part of its value comes
from the fact that the measure correlates with another even simpler
feature: mean text length. While repetition rate is still a better predic-
tor than mean text length alone, it is of note that mean text length is so
informative. We return to this point in the conclusion.
To the extent that they involve the statistical properties of the relations between
symbols, the measures that we have discussed so far are all local, involving ad-
jacent pairs of symbols. But despite the fact that local entropic models have a
distinguished history in information-theoretic analysis of language dating back to
Shannon (1948), language also has many non-local properties.
One is that symbols tend to repeat in texts. Suppose you had a corpus of boy
scout merit badge sashes and you considered each badge to be a symbol. The
corpus would have many language-like properties. Since some merit badges are
awarded far more than others, one would find that the individual symbols follow a
roughly Zipfian distribution, as we see in Figure 24.
65
5e+06
2e+06
5e+05
Count
2e+05
5e+04
1e+04
0 20 40 60 80 100 120
Rank
Figure 24: Unigram distribution of awarded boyscout merit badges, from http:
//meritbadge.org/wiki/index.php/Merit_Badges_Earned,
showing the standard power-law inverse log-linear relationship between count and
rank.
66
Figure 25: Indus seal impression M-314, consisting of seventeen symbols with no
repeats.
Furthermore, some badges tend to be earned before others, and while there
is no requirement that merit badges be applied to the sash in any particular order
(http://www.scoutinsignia.com/sash.htm), nonetheless one would
expect that the earlier earned badges would tend to come earlier in the text. Thus
we would expect a hierarchy rather like what we find in the kudurru texts. All of
this implies that entropic measures of the kinds we have been considering would
find “structure” in these “texts”. Yet such measures would fail to capture what is
the most salient feature of merit badge “texts”, namely that a merit badge is never
earned twice, and therefore no symbol repeats. In this feature, more than any other,
a corpus of merit badge “texts” would differ from linguistic texts.
Symbols in writing systems, whether they represent segments, syllables, mor-
phemes or other linguistic information, repeat for the simple reason that the lin-
guistic items that they represent tend to repeat. It is hard to utter a sentence in any
natural language without at least some segments repeating. Obviously as the units
represented by the symbols in the writing system become larger, the probability
of repetition decreases, but we would still expect some repetition. We therefore
argued in (Farmer et al., 2004) that it was noteworthy that the Indus inscriptions
showed a remarkably low rate of repetition, and that even in “long” texts one typi-
cally finds no symbol repetitions. Indeed, the longest Indus inscription on a single
surface, consisting of seventeen glyphs, show not a single repetition of a symbol
(Figure 25).
However it is not only the presence or absence of repetition in a corpus, but
how the repetition manifests itself, that is important. In (Farmer et al., 2004) we
argued that the Indus inscriptions not only showed rather low repetition rates but
that when one does find repetition of individual symbols within an inscription, it is
often of the form found in Figure 26, with symmetric and other patterns.
67
Figure 26: Indus repetitions, from (Farmer et al., 2004), Figure 6. As described
there these are: “Examples of the most common types of Indus sign repetition.
. . . The most frequent repeating Indus symbol is the doubled sign illustrated in M-
382 A, which is sometimes claimed to represent a field or building, based on Near
Eastern parallels. The sign is often juxtaposed (as here) with a human or divine
figure carrying what appears to be one (or in several other cases) four sticks. M-
634 a illustrates a rare type of sign repetition that involves three duplications of
the so-called wheel symbol, which other evidence suggests in some cases served
as a sun/power symbol; the sign shows up no less than four times on the badly
deteriorated Dholavira signboard (not shown), which apparently once hung over
(or guarded?) the main gate to the citys inner citadel. The color photo of MD-1429
is reproduced from M. Kenoyer, Ancient Cities of the Indus Valley Civilization,
Oxford University Press, Oxford 1998), p. 85, exhibition catalog number MD 602.
The sign on either side of the oval symbols in the inscription is the most common
symbol in the Indus corpus, making up approximately 10% of all symbol cases;
despite its high general frequency, repetitions of the symbol in single inscriptions,
of the kind seen here, are relatively rare.”
68
In this study we use a measure of repetition that seeks to quantify the rate of
iterated repeats of the kind seen in some cases in Figure 26, relative to the total
number of repetitions found. Specifically, we count the number of tokens in each
text that are repeats of a token that has already occurred in that text, and sum that
number over the entire corpus. Call this number R. We then count the number
of tokens that are repeats in each text, and which are furthermore adjacent to the
token they repeat. Sum that over the entire corpus, and call this number r. So for
example for a single text:
AABACDB
R would be 3 (two As are repeats, and one B), and r would be 1 (since only one
of the repeated As is adjacent to a previous A). Thus the repetition rate Rr would be
0.33. Table 5 shows our corpora ranked by Rr . This measure is by far the cleanest
separator of our data into linguistic versus non-linguistic. If we set a value of 0.10
as the boundary, only Sumerian and Mesopotamian deity symbols are clearly mis-
classified (while Asian emoticons and Egyptian are ambiguously on the border).
Not surprisingly, this is the feature most often picked by the decision tree classifier
we will discuss in the next section.
But one important caveat is that the repetition measure is also negatively cor-
related with mean text length: the Pearson’s correlation for mean length and Rr is
−0.49. This makes sense given that the shorter the text, the less chance there is for
repetition, whereas at the same time, the more chance that if there is repetition, the
repetition will involve adjacent symbols. Of course the correlation is not perfect,
meaning that Rr probably also reflects intrinsic properties of the symbol systems.
How much is length a factor?
One-way analyses of variance with class (linguistic/non-linguistic) as the in-
dependent variable and mean text length or Rr as dependent variables yielded the
following results:
Mean text length is thus well predicted by class, with non-linguistic systems having
shorter mean lengths. But the repetition rate is even better predicted, suggesting
that our results above cannot be simply reduced to length differences.
To see this in another way, consider the results of computing the same repetition
measures over our corpora where we have artificially limited the texts to length no
69
r
Corpus R
Barn Stars 0.86
Weather Icons 0.79
Sumerian 0.67
Totem Poles 0.63
Vinča 0.59
Indus bar seals 0.58
Pictish 0.26
Asian emoticons 0.10
Egyptian 0.10
Mesopotamian deity symbols 0.099
Linear B 0.055
Oracle Bones 0.048
Chinese 0.048
English 0.035
Arabic 0.032
Korean jamo 0.022
Malayalam 0.022
Korean 0.020
Amharic 0.018
Oriya 0.0075
Tamil 0.0060
Hindi 0.0017
r
Table 5: Repetition rate R for the various corpora.
70
r
Corpus R
Barn Stars 0.85
Weather Icons 0.80
Indus bar seals 0.77
Totem Poles 0.71
Vinča 0.63
Egyptian 0.42
Sumerian 0.33
Chinese 0.31
Pictish 0.29
English 0.29
Mesopotamian diety symbols 0.287
Amharic 0.25
Asian emoticons 0.22
Linear B 0.21
Malayalam 0.19
Arabic 0.14
Korean jamo 0.08
Oriya 0.04
Korean 0.015
Hindi 0.015
Tamil 0.0040
Oracle Bones 0.0
Table 6: Repetition rate Rr for versions of the corpora with artificially shortened
texts of length 6 or less, and no more than 500 texts.
more than six (by simply trimming each text to the first six symbols). Also, in
order to make the corpora more comparable, we limit the shortened corpus to the
first 500 “texts” from the original corpus. As we can see in Table 6, the separation
is not as clean as in the case of Table 5. Nevertheless, the five corpora with the
highest Rr are non-linguistic (if one counts the Indus system as non-linguistic) and
the nine corpora with the lowest values are all linguistic.
71
In this section we combine these features into a decision tree clas-
sifier whose task is to make a two-way classification: linguistic or
non-linguistic. A classification decision tree is a simple representa-
tion of structured knowledge where each node in the tree represents
a question, branches represent answers to those questions, and the
leaves the ultimate classification answer. One starts at the root of the
tree, answers the question at the root, then depending on the answer
takes one or another branch. One then repeats the process until one
arrives at a leaf node.
Many algorithms exist for training such trees, given a set of fea-
tures and a set of training data. The particular algorithm we use here
is Classification and Regression Trees (CART) (Breiman et al., 1984).
We perform an exhaustive set of analyses of our data with a variety
of different settings for the training sets. One of the upshots of these
experiments is that one of Lee et al. (2010a)’s measures — Cr , when
appropriately retrained, does prove somewhat useful as a discrimi-
native feature. Sadly though, from the point of view of all analyses
performed, both Pictish and Indus symbols look non-linguistic.
We used the above features and the Classification and Regression Tree (CART)
algorithm (Breiman et al., 1984) to train (binary branching) classification trees for
a binary classifier. The features used were:
• The PMI association measure over the symbol set comprising 25% of the
corpus
• F2 , the log2 bigram conditional entropy for the whole corpus, from (Lee
et al., 2010a).
• Ur
• Cr
• Repetition measure
Since the set of corpora is small — only 21 in total, not counting the Indus bar
seals — we tested the system by randomly dividing the corpora into 16 training and
5 test corpora, building, pruning and testing the generated trees, and then iterating
this process 100 times. The mean accuracy of the trees on the held out data sets
72
was 0.8, which is above the baseline accuracy 0.66 of always predicting a system
to be linguistic (i.e., picking the most frequent choice).
It is interesting to see how these 100 trees classify the Indus bar seals, which
were held out from the training sets. 98 of the trees classified it as non-linguistic,
and only 2 of the trees classified it as linguistic. Furthermore, the 98 that classified
it as non-linguistic had a higher mean accuracy — 0.81 — than the 2 that classified
it as non-linguistic (0.3). Thus, more and better classifiers tend to classify the Indus
symbols as non-linguistic than those that classify it as linguistic.23
Since the various runs involve different divisions into training and test corpora,
an obvious question is whether particular corpora are associated with better perfor-
mance. If corpus X is in the training and thus not in the test set, does this result in
better performance, either because its features are more informative for a classifier
and thus helpful for training, or because it is easier to classify and thus helpful
for testing? Tables 7 and 8 provide some results on this. Asian emoticons and
Sumerian result in higher classification rates when they occur in the training and
not in the testing, perhaps because both are hard to classify in testing: Sumerian
is misclassified by the repetition measure, as we saw above, and emoticons on that
measure are close to the border with linguistic systems, as is Egyptian, which also
results in better performance if it is not in the test data. The next two on the list,
weather icons and Pictish symbols, have a repetition value that places them well
outside the range of the linguistic corpora, so in these cases it may be that they are
useful as part of training to provide better classifiers. Mesopotamian deity sym-
bols are also misclassified by the repetition measure, and thus could lead to better
performance when they are not part of the test data.
In what we just described, we included Pictish among the non-linguistic sys-
tems. Of course, with the publication of (Lee et al., 2010a), this classification has
become somewhat controversial. What if we do not include the Pictish data? In
this case we used training sets of 15 corpora, and again held out 5 corpora for
testing. Here the overall mean accuracy is 0.84. What do these classifiers make
of Pictish? The results are strongly in favor of the non-linguistic hypothesis: 97
classified Pictish as non-linguistic, with a mean accuracy of 0.81; 3 classified it as
linguistic, with a lower mean accuracy of 0.4.
It is also interesting to consider the features that are used by the trees. These
are presented in Table 9 for the two experimental conditions (with versus without
Pictish). In both cases, the most prominent feature was repetition, and the second
most common is Cr . The latter is interesting since if one compares Lee et al.’s
23
To show that this result is no artifact of the particular setup we are using, we ran the same
experiment, this time dropping the Indus bar seals entirely, and holding out Oriya from the training.
100% of the trees classified Oriya as linguistic, though see below for further discussion on this point.
73
Corpus Mean accuracy # training runs
Asian Emoticons 0.85 79
Mesopotamian deity symbols 0.85 71
Sumerian 0.84 76
Egyptian 0.83 72
Weather icons 0.82 80
Pictish 0.82 79
Malayala 0.81 79
Hindi 0.80 78
Oriya 0.80 82
Korean jamo 0.80 77
Arabic 0.80 74
Linear B 0.79 70
English 0.79 76
Tamil 0.79 76
Amharic 0.79 73
Totem poles 0.79 86
Chinese 0.79 75
Korean 0.79 72
Oracle bones 0.78 69
Vinča 0.78 76
Barn stars 0.78 80
Table 7: Mean accuracy of results for each of the corpora when occurring in the
training data.
tree in Figure 22, it is Cr which was involved in the the top-level linguistic/non-
linguistic decision.
In Table 10 we show the results of the Wilcoxon signed rank test for each fea-
ture comparing for the two populations of linguistic and non-linguistic corpora.
Only for our repetition measure, and Cr are the population means different accord-
ing to this test, suggesting that the other features are largely useless for determining
the linguistic status of a symbol system.
Since our repetition measure is well correlated with mean length of texts in the
corpus, and the non-linguistic corpora in general have shorter mean lengths than
the linguistic corpora, it is instructive to consider what happens when we remove
that feature and train models as before. As we can see in Table 11, a wider variety
of trees is produced, and Lee and colleagues’ Cr is the favored feature, being used
in 82 of the trees. The results for the held out Indus bar seal corpus are not as
74
Corpus Mean accuracy # training runs
Barn stars 0.91 20
Totem poles 0.89 14
Vinča 0.88 24
Oracle bones 0.85 31
Chinese 0.85 25
Korean 0.84 28
English 0.84 24
Tamil 0.84 24
Amharic 0.84 27
Linear B 0.83 30
Korean jamo 0.83 23
Arabic 0.82 26
Oriya 0.82 18
Hindi 0.81 22
Malayala 0.80 21
Pictish 0.75 21
Weather icons 0.74 20
Egyptian 0.74 28
Mesopotamian deity symbols 0.70 29
Sumerian 0.69 24
Asian Emoticons 0.64 21
Table 8: Mean accuracy of results for each of the corpora when occurring in the
test data.
dramatic as when we included the repetition feature, but they still highly favor the
non-linguistic analysis. 88 of the trees classified the system as non-linguistic (mean
accuracy 0.75), and 12 as linguistic (mean accuracy 0.37).
Similarly lower results were obtained when we replaced the repetition rate Rr
from Table 5 with that computed over the artificially shortened corpora seen above
in Table 6 . 85 of the trees classified the Indus symbols as non-linguistic (mean
accuracy 0.63), and 15 classified it as linguistic (mean accuracy 0.36). The features
used by these trees are shown in Table 12. Repetition is far less dominant than it is
in the original tree set shown in Table 9, but it is still the second most used feature,
after Cr , occurring in 35 out of 100 trees.
The most useful features thus seem to be our measure of repetition and Cr , and
this is further confirmed by the Wilcoxon signed rank test reported above for which
these were the only two features that showed a significant correlation with corpus
75
with Pictish without Pictish
# trees features # trees features
84 repetition 71 repetition
13 Cr 26 Cr
1 block 5 1 block 5
1 block 3 1 association
1 block 1 1
Table 9: Features used by trees under the two training conditions. The left column
in each case is the number of trees using the particular combination of features in
the right column. The final line for the trees trained without Pictish data, is for one
tree with a single node that predicts “linguistic” in all cases.
Measure W p
association 29 0.15
block 1 42 0.64
block 2 49 1
block 3 56 0.64
block 4 49 1
block 5 63.5 0.30
block 6 56 0.64
maximum conditional entropy 58 0.54
F2 63 0.32
Ur 40 0.54
Cr 84 0.0074**
repetition 6 0.00052**
Table 10: Wilcoxon signed rank test for each feature with the two populations
being linguistic and non-linguistic corpora.
76
# trees features
57 Cr
25 Cr , association
6 association
4 maximum conditional entropy
2 block 5
2
1 Ur
1 block 6
1 association, block 5
1 association, block 4
Table 11: Features used by trees when repetition is removed. The sixth row
consists of two trees where there is a single leaf node with the decision “linguistic”.
# trees features
36 Cr
30 repetition
11 Cr , association
8
4 maximum conditional entropy
3 association
2 Ur
2 maximum conditional entropy, repetition
2 block 4, repetition
1 block 5
1 block 1, repetition
Table 12: Features used by trees when repetition is replaced by the repetition rate
computed over artificially shortened corpora (Table 6. The fourth row consists of
twelve trees where there is a single leaf node with the decision “linguistic”.
77
type.
What do these two features have in common? We know that Rr is correlated
with text length: could it be that Cr is also correlated? As Figure 27 shows, this
is indeed the case. Pearson’s r for Cr and text length is 0.39, and this increases
dramatically to 0.71 when the one obvious outlier — Amharic — is not considered.
As we argued above, there is a plausible story for why Rr should correlate with
text length, but why should Cr correlate? Recall the formula for Cr , repeated here:
Nd Sd
Cr = +a (7)
Nu Td
where, again, Nd is the number of bigram types, Nu the number of unigram types,
Sd is the number of bigram hapax legomena, and Td is the total number of bigram
tokens. Now recall (footnote 20) that Lee et al. pad the texts with beginning and
end of text delimiters. For corpora consisting of shorter texts, this means that
bigrams that include either the beginning or the ending tag will make up a larger
portion of the types, and that the type richness will be reduced. The unigrams, on
the other hand, will be unaffected by this, though they will of course be affected
Nd
by the overall corpus size. This predicts that the term N u
alone should correlate
well with mean text length, and indeed it does, with r = 0.4 (r = 0.72 excluding
Amharic).
Ur , which is derived from N Nd
u
is also correlated, but negatively: r = −0.29.
Thus a higher mean text length corresponds to a lower value for Ur . Recall again
that it is this feature that is used in Lee and colleagues’ (2010a) decision tree for
classifying the type of linguistic system: the lowest values of Ur are segmental
systems, next syllabaries, next “logographic” systems. This makes perfect sense
in light of the correlation with text length: ceteris paribus, it takes more symbols
to convey the same message in a segmental system than in a syllabary, and more
symbols in a syllabary than in a logographic system. Thus segmental systems
should show a lower Ur , syllabic systems a higher Ur and logographic systems the
highest Ur values.
Returning to Cr , non-linguistic systems do tend to have shorter text lengths
than linguistic systems, so one reason why Cr seems to be such a good measure for
distinguishing the two types, apparently, is because it correlates with text length.
Of course where one draws the boundary between linguistic and non-linguistic on
this scale will determine which systems get classified in which ways. For Lee and
colleagues, Pictish comes out as linguistic only because they set the value of Cr
relatively low.
While there are obviously other factors at play— for one thing, something must
explain why Amharic is such an outlier — it does seem that the key insight at work
78
here comes down to a rather simple measure: how long on average are the extant
texts in the symbol system? If they are short, then they are probably non-linguistic;
if they are long they are probably linguistic. Of course such a trivial computation
would hardly get one’s paper published in Proceedings of the Royal Society (or
Science). But it seems that the fancier methods used in Lee and colleagues’ work
are in large part proxies for this far more basic measure. We will return to the
implications of this result in Section 9.
We have already seen that repetition on its own misclassifies Sumerian as non-
linguistic, due to the short “texts” — an artifact of the way we divided the corpus.
Given what we have just discussed, we would therefore expect that the decision
trees would also misclassify it. Indeed they do: using all features, 92 trees classified
it as non-linguistic (accuracy 0.72) and 8 trees as linguistic (0.23).
Our corpora of mathematical formulae and heraldry are still under develop-
ment. However, and putting aside the serious reservations about the methods that
we have just discussed, it is of interest to see how the statistical methods hold up
when samples of those corpora are added. To that end we added 1,000 “texts” from
each of these corpora.
For mathematical formulae, we selected equations that consisted of single lines
in the original LATEX. We processed the LATEX code so that commands (a string
of alphanumeric characters preceded by a backslash) were kept as single tokens;
square brackets, angle brackets and parentheses were split off as separate tokens;
subscript (underbar) and superscript (wedge) were also split off as separate tokens;
and all other alphanumeric sequences were split into sequences of character tokens.
For heraldry we selected 500 texts from each of Burke and the Mitchell rolls.
From each blazon’s XML markup, we extracted all terms, except those marked
as “grammatical” (usually articles such as a or the), or conjunction (and). Further-
more, whenever a count was specified (three roundels, four lions, etc.), we replaced
that phrase with the charge, etc., repeated the appropriate number of times. Thus
four lions is replaced with lion lion lion lion. This yields a more accurate repre-
sentation of the actual symbols used in the arms. Note that the repetition rate Rr
of the heraldry corpus is quite high (0.71) whereas for the mathematics corpus it is
79
Corpus # Texts # Tokens # Types Mean text length
Mathematical expressions 1,000 23,312 273 23.31
Heraldry 1,000 9,788 410 9.79
Table 13: Number of texts, type and token counts, and mean text length for subsets
of the mathematics and heraldry corpora.
low (0.027). For mathematics, it is relatively rare that the same symbol will follow
itself, hence the low rate of local repetitions. On the other hand, given our pro-
cessing of the heraldry corpus so that four lions becomes lion lion lion lion, a large
proportion of the repetitions involve local repetitions, so Rr is high.
Basic statistics for these corpora are given in Table 13. Figure 28 shows the
block entropy curves as in Figure 20, with these two new corpora added. As can
be seen, both mathematical formulae and heraldry fall somewhere in the middle of
the distribution.
80
English
25
Arabic
Hindi
Emoticons
20
Korean
Chinese
Oriya
Malayalam
KoreanJamo
C_r
15
OracleBones
Tamil Amharic
Egyptian
10
WeatherIcons
LinearB
Kudurrus
Sumerian
Barseals
Totem
Pictish
Barnstars
5
Vinca
0 50 100 150
Figure 27: Correlation between Cr and text length. Pearson’s r is 0.39, but when
the obvious outlier Amharic is removed, the correlation jumps to 0.71.
81
Block Entropy
● Barn
6
Totem
Vinca
Weather
Kudurru
5
Pictish
Emoticons
Math
Normalized Block Entropy
Heraldry
● Chinese
4
Egyptian
English
● Ethiopic
Hindi
Korean jamo
3
● Korean
Linear B
Malayalam ●
●
● Oracle
●
2
● Oriya
●
● Sumerian ● ● ●
●
Tamil ● ● ● ●
● ●
Arabic ● ● ● ●
●
● ●
Indus bar ● ●
● ● ● ●
1
●
● ●
●
●
●
●
0
1 2 3 4 5 6
Figure 28: Block entropy values for our linguistic and non-linguistic corpora,
including mathematical formulae and heraldry, and the Indus bar seal corpus.
82
When we replicated the classification experiments reported in Section 8.6 with
these additional corpora, and using all features, the results were much as before:
of the 100 trees trained, 94 classified the Indus bar seals as non-linguistic (mean
accuracy 0.73) and 6 classified it as linguistic (mean accuracy 0.33).
83
9 Summary and Future Work
In this monograph we have presented a set of corpora of non-linguistic symbol
systems that we have developed and compare them using a variety of statistical
measures to a large set of corpora of written language. The non-linguistic corpus
set was designed to include types of systems that are likely to be relevant to assess-
ing whether ancient systems such as the Indus Valley signs, or Pictish symbols, are
linguistic or not. We also included one system (barn stars) that is almost certainly
purely decorative (thus addressing a common misconception about the claims of
Farmer et al. (2004)); as well as some modern systems that can easily be harvested
from the Web. We also picked a wide variety of linguistic symbol systems, ranging
across both time and script type.
We have examined a variety of statistical measures and shown that two of them
— Lee and colleagues’ (2010a) Cr and our repetition measure Rr — are useful for
distinguishing linguistic and non-linguistic systems. However, the fact that they
are useful seems to be in large measure because both of them are also correlated
with a measure that is far more basic and trivial: mean text length.
In both depth and breadth of coverage, this work arguably goes far beyond
previous work in this area, but the apparent conclusion — that a core distinction
between linguistic and at least some non-linguistic systems — comes down to text
length, seems troublesome. Could the answer be that trivial?
I believe the answer to that question is yes, for a simple reason: if you have a
real writing system, it allows you to write anything you want to say, meaning that
you can generate very long texts. Non-linguistic systems can be expected to be
much more varied in this regard, depending upon what kind of information they
represent. Mathematical formulae can, indeed, be quite long, as can deity symbol
strings on kudurru stones. However Pictish “texts” are all very short, presumably
because whatever kind of information was represented by those “texts” was by
its nature concise. Totem pole texts are short for similar reasons (and probably
also because of the labor involved in carving whole tree trunks). These character-
izations are informal and somewhat circular to be sure, but the point is that with
non-linguistic systems, unlike linguistic systems, there is no a priori expectation
that the texts should be long.
One of the main problems all along for the theory that the Indus Valley symbols
were a written language has been the extremely short length of the “texts”. While
few Indus researchers would perhaps admit this, the brevity of the inscriptions is an
embarrassment. Why else would researchers since Marshall (1924) have appealed
to a “lost manuscript” hypothesis if it were not the sense that a symbol system
that was only used for very short inscriptions did not look much like a writing
84
system.24 The “lost manuscript” hypothesis and its ramifications was discussed
extensively in (Farmer et al., 2004), and those arguments will not be repeated here.
But one of the conclusions of that work was that one discovery that would be
most needed to support the script hypothesis would be a genuine long text in the
symbols. This claim is supported by the statistical results we have obtained here.
At its core this just seems like common sense, and there is a clear analogy in
child language development. The most basic measure of language development
is mean length of utterance (MLU), measured in words, or in morphemes. There
are of course other measures, but often even much more sophisticated measures
correlate very strongly with MLU: for example the Index of Productive Syntax (IP-
Syn) measure (Scarborough, 1990), a complex set of syntactic and morphological
features used in assessing language development for English-speaking children,
has an extremely high correlation with MLU in Scarborough’s original data. That
MLU should be so basic seems sensible: if an individual only utters four-word
sentences, we would not generally think of that individual as having mastered his
or her native language.
And so it is with symbol systems. If a symbol system is a true writing system,
then it is able to represent anything that can be said in the language of its users.
Language is in principle unbounded in that there are no limitations, apart from
obvious physical ones, on how long a linguistic message can go on. Linguistic
texts can range from short texts such as a couple of names to lengthy stories that
can extend into the hundreds of thousands of words or more. Texts in a true writing
system should thus show a wide range of lengths because of what it is they encode.
Indeed, in societies that make extensive use of writing, written texts can often be
much longer than any given spoken message, precisely because their creation is
not limited by the same factors, such as breathing or fatigue, that limit spoken
utterances. If the Indus Valley system was a full writing system as some have
claimed, it would be quite anomalous if the Indus Valley scribes had not figured
out, over a span of roughly 700 years, that they could do a bit more with the system
than write short cryptic texts (Farmer et al., 2004).
Needless to say, text length, while a plausible predictor of symbol-system sta-
tus, is nonetheless a crude feature. Non-linguistic systems can still have relatively
long texts, as in some of the corpora we have collected. And while our useful fea-
tures seem to reduce ultimately to text length, it is not out of the question that in
future work we will discover other features that are useful in classifying symbol
systems, and that are not correlated with length.
24
Indeed, while it ranked fourth out of five in our informal survey in Section 3, text length was
nonetheless considered to be “somewhat important” as a feature in determining if a system is writing
or not.
85
One possible avenue is automated grammar induction — see, e.g. (Klein, 2005;
Clark and Lappin, 2011; Cohen, 2011).25 When one claims that a symbol system
is linguistic, one is implicitly claiming that one will find evidence of linguistic
structure. It is not enough that one can show evidence for structure, as people have
argued for the Indus symbols at least since the work of computational linguists
like Koskenniemi (1981). After all mathematical equations have structure. What is
needed is a demonstration that the structure involves things like segments grouped
into syllables, syllables into morphemes or words, and words into noun-phrases,
verb phrases, and so forth. This is a tall order to be sure, but without such a
demonstration — or without, on the other hand, a convincing decipherment — it is
pretty hard to argue that one is dealing with a linguistic system.
Further work would extend the taxonomy of non-linguistic systems that we
started in Section 4. In terms of the kinds of things they can denote, there is a
much wider range of non-linguistic systems than linguistic systems. We have in-
cluded here a variety of systems covering several types in our taxonomy. But there
are certainly other kinds, and future work will need to develop corpora that are
classified into a wider set of groups than what we have developed here.
For contentious systems — the Indus Valley symbols, and to a lesser extent
Pictish symbols — we do not expect the debate to end here. For the Indus Valley
symbols in particular, too many people have a lot invested in the idea that the
symbols were writing to give that idea up. We do hope however that we have
provided an analysis of statistical approaches to the question of symbol system
type that is better informed than what has been prominently displayed in the past
few years.
25
Some preliminary experiments with a grammar induction system yielded unimpressive results,
but there is much further work to be done along those lines.
86
Acknowledgments
A number of people have given input of one kind or another to this work.
First and foremost, I would like to thank my research assistants Katherine Wu,
Jennifer Solman and Ruth Linehan, who worked on the development of the cor-
pora and markup scheme. Katherine Wu developed the corpora for Totem poles,
Mesopotamian deity symbols, Vinča and Pictish, and Ruth Linehan the Heraldry
corpus. Jennifer Solman developed the XML markup system used in most of our
corpora. An initial report on our joint work was presented in (Wu et al., 2012).
I also thank audiences at the Linguistic Society of America 2012 annual meet-
ing in Portland, at the Center for Spoken Language Understanding, and at the
Oriental Institute at the University of Chicago for feedback and suggestions on
presentations of this work.
Chris Woods gave me useful suggestions for Sumerian texts to use. William
Boltz and Ken Takashima were instrumental in pointing me to the corpus of tran-
scribed Oracle bones.
Nate Bodenstab helped out with training the BUBS parser on blazon.
I would like to acknowledge the staff of the Berks County Historical Society,
in particular Ms. Kimberly Richards-Brown, for their help in locating W. Farrell’s
slide collection of barn stars. I also thank the staff of the Asian Reading Room at
the Library of Congress for help with the Naxi pictograph collection.
I thank Brian Roark, Shalom Lappin and especially Steve Farmer for lots of
useful discussion.
This material is based upon work supported by the National Science Founda-
tion under grant number BCS-1049308. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the author and do not
necessarily reflect the views of the National Science Foundation.
87
References
Barbeau, M., 1950. Totem Poles. Vol. 1–2 of Anthropology Series 30, National
Museum of Canada Bulletin 119. National Museum of Canada, Ottawa.
Basso, K., Anderson, N., 8 June 1973. A Western Apache writing system: The
symbols of Silas John. Science 180 (4090).
Bedrick, S., Beckley, R., Roark, B., Sproat, R., June 2012. Robust kaomoji detec-
tion in Twitter. In: Workshop on Language and Social Media. Montreal, Canada.
Black, J., Green, A., Rickards, T., 1992. Gods, Demons and Symbols of Ancient
Mesopotamia. University of Texas Press, Austin, TX.
Bodenstab, N., Dunlop, A., Hall, K., Roark, B., 2011. Adaptive beam-width pre-
diction for efficient CYK parsing. In: ACL/HLT 2011. pp. 440–449.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regres-
sion Trees. Wadsworth and Brooks, Pacific Grove, CA.
British Columbia Provincial Museum, 1931. British Columbia totem poles. Printed
by Charles F. Banfield printer to the King’s most excellent majesty.
Burke, B., 1884. The General Armory of England, Scotland, Ireland, and Wales;
Comprising a Registry of Armorial Bearings from the Earliest to the Present
Time. Harrison & Sons, London.
Clark, A., Lappin, S., 2011. Linguistic Nativism and the Poverty of the Stimulus.
John Wiley & Sons, Malden, MA.
88
Daniels, P., Bright, W. (Eds.), 1996. The World’s Writing Systems. Oxford, New
York.
DeFrancis, J., 1984. The Chinese Language: Fact and Fantasy. University of
Hawaii Press, Honolulu, HI.
DeFrancis, J., 1989. Visible Speech: The Diverse Oneness of Writing Systems.
University of Hawaii Press, Honolulu, HI.
Drew, F., 1969. Totem Poles of Prince Rupert. F.W.M. Drew, Prince Rupert, BC.
Eco, U., 1976. A Theory of Semiotics. Harvard University Press, Cambridge, MA.
Farmer, S., Sproat, R., Witzel, M., 2004. The collapse of the Indus-script thesis:
The myth of a literate Harappan civilization. Electronic Journal of Vedic Studies
11 (2).
Farnell, B., 1996. Movement notation systems. In: Daniels, P., Bright, W. (Eds.),
The World’s Writing Systems. Oxford, New York, pp. 855–879.
Feldman, R. D., 2003. Home before the Raven Caws: the Mystery of the Totem
Pole. Cincinnati, OH. Emmis Books.
Fischer, S. R., 1997. Rongorongo: The Easter Island Script. Oxford University
Press, Oxford.
Friar, S., Ferguson, J., 1993. Basic Heraldry. Herbert Press, London.
Garfield, V. E., 1940. The Seattle Totem Pole. University of Washington Press,
Seattle.
Graves, T. E., 1984. The Pennsylvania German hex sign: A study in folk process.
Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.
Gunn, S. W., 1965. The Totem Poles in Stanley Park, Vancouver, B.C., 2nd Edition.
Macdonald, Vancouver, BC.
89
Gunn, S. W., 1966. Kwakiutl House and Totem Poles at Alert Bay. Whiterocks
Publications, Vancouver, BC.
Gunn, S. W., 1967. Haida Totems in Wood and Argillite. Whiterocks Publications,
Vancouver, BC.
Jackson, A., 1984. The Symbol Stones of Scotland: a Social Anthropological Res-
olution of the Problem of the Picts. Orkney Press.
Kaiser, D., 2005. Physics and Feynman’s diagrams. American Scientist 93, 156–
165.
King, L., 1912. Babylonian Boundary Stones and Memorial Tablets in the British
Museum. British Museum, London.
Kiraz, G., 2012. Tūrrās. Mamllā: A Grammar of the Syriac Language. Volume 1:
Orthography. Gorgias Press, Piscataway, NJ.
Klein, D., 2005. The unsupervised learning of natural language. Ph.D. thesis, Stan-
ford University, Stanford, CA.
Kneser, R., Ney, H., 1995. Improved backing-off for m-gram language modeling.
In: International Conference on Speech and Signal Processing. pp. 181–184.
Koskenniemi, K., 1981. Syntactic methods in the study of the Indus script. Studia
Orientalia 50, 125–136.
90
Le Novère, N., et al, 2009. The systems biology graphical notation. Nature
Biotechnology 27, 735–741.
URL http://www.nature.com/nbt/journal/v27/n8/full/
nbt.1558.html
Lee, R., Jonathan, P., Ziman, P., 2010a. Pictish symbols revealed as a written lan-
guage through application of Shannon entropy. Proceedings of the Royal Society
A: Mathematical, Physical & Engineering Sciences, 1–16.
Lee, R., Jonathan, P., Ziman, P., 2010b. A response to Richard Sproat on random
systems, writing, and entropy. Computational Linguistics 36 (4), 791–794.
Li, L., 2001. Naxizu xiangxing biaoyin wenzi zidian. Yunnan Minzu Chubanshe.
Mack, A., 1997. Field Guide to the Pictish Symbol Stones. The Pinkfoot Press,
Balgavies, Angus, updated 2006.
Mahr, A. C., 1945. Origin and significance of Pennsylvania Dutch barn symbols.
Ohio History: The Scholarly Journal of the Ohio Historical Society 54 (1),
1–32, http://publications.ohiohistory.org/ohstemplate.
cfm?action=toc&vol=54.
Malin, E., 1986. Totem Poles of the Pacific Northwest Coast. Timber Press, Port-
land, OR.
Mallery, G., 1883. Pictographs of the North American Indians: A Preliminary Pa-
per. No. 4 in Annual Report of the Bureau of Ethnology. Smithsonian Institution,
Washington, DC, available from Google Books.
Manning, C., Schütze, H., 1999. Foundations of Statistical Natural Language Pro-
cessing. MIT Press, Cambridge, MA.
Marshall, J., September 20 1924. First light on a long forgotten civilization. Illus-
trated London News, 528–32, 548.
McCawley, J., 1996. Music notation. In: Daniels, P., Bright, W. (Eds.), The World’s
Writing Systems. Oxford, New York, pp. 847–854.
Nemenman, I., Shafee, F., Bialek, W., 2002. Entropy and inference, revisited. In:
Advances in Neural Information Processing Systems. Vol. 14. MIT Press, Cam-
bridge, MA.
Parpola, A., 1994. Deciphering the Indus Script. Cambridge University Press, New
York.
91
Peirce, C., 1934. Collected Papers: Volume V. Pragmatism and Pragmaticism. Har-
vard University Press, Cambridge, MA.
Powell, B., 2009. Writing: Theory and History of the Technology of Civilization.
Wiley-Blackwell, Chichester.
Rao, R., 2010. Probabilistic analysis of an ancient undeciphered script. IEEE Com-
puter 43 (3), 76–80.
Rao, R., Yadav, N., Vahia, M., Joglekar, H., Adhikari, R., Mahadevan, I.,
2009a. Entropic Evidence for Linguistic Structure in the Indus Script. Science
324 (5931), 1165.
Rao, R., Yadav, N., Vahia, M., Joglekar, H., Adhikari, R., Mahadevan, I., Au-
gust 5 2009b. A Markov model of the Indus script. Proceedings of the National
Academy of Sciences 106 (33), 13685–13690.
Rao, R., Yadav, N., Vahia, M., Joglekar, H., Adhikari, R., Mahadevan, I., 2010.
Entropy, the indus script, and language: A reply to R. Sproat. Computational
Linguistics 36 (4), 795–805.
Rhys, J., 1892. The inscriptions and language of the northern Picts. Proceedings of
the Society of Antiquaries of Scotland 26, 263–351.
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., Tai, T., 2012. The
OpenGrm open-source finite-state grammar software libraries. In: Proceedings
of the Association for Computational Linguistics, System Demonstrations. As-
sociation for Computational Linguistics, Jeju Island, South Korea, pp. 61–66.
92
Slater, S., 2002. The Complete Book of Heraldry. Lorenz Books, London.
Smith, J., 1996. Japanese writing. In: Daniels, P., Bright, W. (Eds.), The World’s
Writing Systems. Oxford, New York, pp. 209–217.
Sproat, R., 2000. A Computational Theory of Writing Systems. Cambridge Univer-
sity Press, Cambridge.
Sproat, R., 2010a. Ancient symbols, computational linguistics, and the reviewing
practices of the general science journals. Computational Linguistics 36 (3).
Sproat, R., 2010b. Reply to Rao et al. and Lee et al. Computational Linguistics
36 (4), 807–816.
Stewart, H., 1990. Totem Poles. University of Washington, Seattle, WA.
Stewart, H., 1993. Looking at Totem Poles. University of Washington, Seattle, WA.
Sutherland, E., 1997. The Pictish Guide. Birlinn Ltd, Edinburgh.
Swinney, D., 1979. Lexical access during sentence comprehension:
(re)consideration of contextual effects. Journal of Verbal Learning and
Verbal Behavior (5), 219–227.
Turner, V., 1967. The Forest of Symbols: Aspects of Ndembu Ritual. Cornell Uni-
versity Press, New York.
Ventris, M., Chadwick, J., 1956. Documents in Mycenaean Greek. Cambridge Uni-
versity Press, Cambridge.
Vidale, M., 2007. The collapse melts down: a reply to Farmer, Sproat and Witzel.
East and West 57, 333–366.
Waddington, C., 1974. Horse brands of the Mongolians: A system of signs in a
nomadic culture. American Ethnologist 1 (3), 471–488.
Winn, S., 1990. A Neolithic sign system in southeastern Europe. In: Foster, M. L.
(Ed.), The Life of Symbols. Westview Press, Boulder, pp. 269–271.
Winn, S. M. M., 1981. Pre-Writing in Southeastern Europe: The Sign System of
the Vinča Culture, ca. 4000 B.C. Western Publishers.
Wu, K., Solman, J., Linehan, R., Sproat, R., January 2012. Corpora of non-
linguistic symbol systems. In: Linguistic Society of America. Portland, OR.
URL http://elanguage.net/journals/lsameeting/article/
view/2845/pdf
93
Yoder, D., Graves, T., 2000. Hex Signs: Pennsylvania Dutch Barn Symbols and
their Meaning. Stackpole, Mechanicsburg, PA.
94
Appendix: Sample Texts and Transcriptions
Tordos Spindle Whorl #20, (Winn, 1981, page 270)
95
Kudurru stone #67, from (Seidl, 1989), plate 23.
96
<symbol><title>Schlangendrache</title></symbol>
</symbolUnit>
<symbolUnit>
<symbol><title>Schreibgeraet</title></symbol>
<symbol><title>Symbolsockel</title></symbol>
</symbolUnit>
<symbolUnit>
<symbol><title>Band</title></symbol>
<symbol><title>Symbolsockel</title></symbol>
<symbol><title>Ziegenfisch</title></symbol>
</symbolUnit>
</line>
<line number="4">
<symbol><title>Adlerstab</title></symbol>
<symbolUnit>
<symbol><title>Widderstab</title></symbol>
<symbol><title>Loewenstab</title></symbol>
</symbolUnit>
<symbol><title>Pferdekopf</title></symbol>
<symbol><title>Vogel-auf-d.-Stange</title></symbol>
</line>
<line number="5">
<symbolUnit>
<symbol><title>Symbolsockel</title></symbol>
<symbol><title>Hund</title></symbol>
</symbolUnit>
</line>
<line number="6">
<symbol><title>Schildkroete</title></symbol>
<symbol><title>Lampe</title></symbol>
</line>
<line number="7">
<symbolUnit>
<symbol><title>Blitzbuendel</title></symbol>
<symbol><title>Rind</title></symbol>
</symbolUnit>
<symbol><title>Skorpion</title></symbol>
</line>
</docText>
</document>
97
The Aberlemno 1 stone from http:
//www.stams.strath.ac.uk/research/pictish/database.php
98
Grizzly Bear Pole of Yan, (Drew, 1969, page 16)
99
Farrell collection AR(2)F
100