S. Scott Schupbach
LING 257B; Winter 2014
Quantifying syntactic complexity
1. Introduction
Until recently, the long-held linguistic dogma that all languages are equally complex (e.g.
Hockett 1958) has gone largely unchallenged. Although the principle of linguistic equicomplexity has not
been conclusively proven, it is a reasonable hypothesis given that any thought, conceived of and
communicated in one language, can easily be expressed in any other. However, some linguists have
begun to challenge the idea that languages are uniform in their levels of complexity. McWhorter, for
example, claims that grammars of creole languages are less complex than grammars of older
languages, and that it is a natural process of human languages to acquire complexity over time (2001)
and that it is impossible for languages to lose complexity except through non-native acquisition (2008).
Dahl (2009) argues that languages acquire more complexity through the processes of language change
that result from robust contact situations. Nichols (2009) states that complexity is synonymous with a
high degree of structural and paradigmatic variability and proposes a measure of complexity based on
combining a number of such variations. Each of these claims is based on the assumption that language,
like technology, varies in complexity from one speech community to the next. However, before any of
these claims may be tested, what is meant by linguistic complexity must be clearly defined.
First and foremost, it must be recognized that complexity is not just the amount of variation in a
language's structure and paradigms. Many attempts to quantify grammatical complexity involve
identifying a number of structural domains where languages differ and taking some sort of count, such
as the number of possible syllable types, the size of the phonemic inventory, or the number of
inflectional categories marked on the maximally inflected verb form (all from Nichols 2009; see also Dahl
2009, McWhorter 2008, Kusters 2008, Miestamo 2008, Gil 2008). The main problem with this aggregate-method approach is that none of these measures alone comes anywhere near capturing the entire
picture, and regardless of how many individual factors are incorporated into a complexity measure of
this type, it is not possible to be certain that the resulting measure is a complete or reliable measure of
complexity. Additionally, it is difficult to determine what weight should be given to each component
since it is not clear that phonological complexity and syntactic complexity, for example, have the same
impact on overall grammatical complexity.
It is also important to recognize that a measure of grammatical complexity is different from
cognitive processing complexity. As Dahl (2009) points out, complexity as it pertains to cognitive
processing is not necessarily synonymous with structural complexity, and is better thought of as
processing cost, or agent-related complexity, in contrast to objective complexity. While the two may be
similar, and measures of cognitive costs may more reliably approximate grammatical complexity than
aggregate measures do, there are other factors involved in the cognitive processing of language that are
not a result of the structure of the language.
One proven approach to quantifying language complexity is calculating the entropy, or uncertainty,
of the linguistic signal (Shannon 1948). Shannon (1951) demonstrates that the predictable
aspects of English (such as phonotactic constraints, probabilistic word groupings, and semantic
associations) are reflected in entropy estimates of the characters of written English. Further, Shannon
argues that character entropy represents an upper bound of language complexity, but that the actual
entropy of a language is lower. This maximum entropy approach is used in numerous computational
approaches to natural language processing (Malouf 2013) and provides a reliable estimate of overall
language complexity.
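To make this concrete, the following is a minimal sketch of estimating unigram character entropy in the spirit of Shannon (1951); the function name and the sample string are illustrative, not part of the corpus pipeline described below.

```python
import math
from collections import Counter

def unigram_entropy(symbols):
    """Shannon entropy (bits per symbol) of the empirical unigram distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Illustrative usage on a lower-cased string with punctuation removed
sample = "in the beginning god created the heaven and the earth"
print(round(unigram_entropy(sample), 3))
```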
2. Methods
Dahl (2004:21) introduces (and then quickly dismisses) a measure of cross-linguistic grammatical
complexity that involves comparing the compressibility of the same content in different languages,
which he calls algorithmic information content. One benefit of this approach is that by comparing
translations of the same text in multiple languages, the complexity of the message is kept constant
(Juola 2008). For the present study, I use translations of the Old Testament portion of the Bible as the
texts for comparison. The texts all come from the multilingual parallel Bible corpus prepared by
Christodoulopoulos (n.d.). I chose the Old Testament rather than the New Testament because it is
substantially longer and stylistically more diverse; I excluded the New Testament because it would
introduce a large number of highly specialized words, an undesirable effect given the limited length of
each text. Using Bible portions has the added benefit that there is a widely
accepted system of reference involving chapter and verse numbers. These allow for a division of the text
into discrete units based on meaning rather than syntactic divisions such as clauses or sentences.
Although the translations used in this project do not contain identical numbers of verses, the standard
deviation in the number of verses is less than 0.2% of the mean. Verse divisions allow for a more reliable method
of measuring syntactic complexity because they control for the way a language combines multiple
clauses in discourse. Comparing the number of sentences in each text, the standard deviation is over
10% of the mean, indicating that variability in clause combining strategies is a salient part of syntactic
complexity (although this was not a measure in any of the previous aggregate-method studies).
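By way of illustration, the verse-count comparison amounts to computing the standard deviation as a percentage of the mean across translations; the counts below are hypothetical placeholders, not the corpus figures.

```python
import statistics

# Hypothetical verse counts per translation; the real values come from the parallel Bible corpus
verse_counts = {"bg": 23210, "en": 23145, "fi": 23198, "de": 23187, "sk": 23174, "es": 23160}

mean = statistics.mean(verse_counts.values())
sd = statistics.stdev(verse_counts.values())
print(f"std dev = {sd:.1f} verses, i.e. {100 * sd / mean:.2f}% of the mean")
```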
The six languages under investigation (Bulgarian, English, Finnish, German, Slovak, Spanish)
were not so much chosen as they were identified as the only viable options. They needed to have a
complete and uncorrupted Old Testament portion in the Bible corpus, use an alphabetic writing system
and be capable of POS tagging, meaning they had to be one of the languages for which TreeTagger
already had a tag library, or there had to exist a POS tagged corpus on which TreeTagger could be
trained. Initially I included Italian, Latin, and Estonian in the corpus; however, I soon found that each
of these was corrupted in a way that made it unusable. Future stages of this project will
incorporate those languages as well.
The only variable that is not controlled for by using this particular corpus is the variation in the
writing systems used to encode the written form of each language. Although only alphabetic systems
are used, there is still some inconsistency in the amount of phonological detail encoded in each
orthographic system. In English, for example, the nominal and verbal interpretations of 'record' vary
only in the placement of stress. However, stress is not marked in English. Spanish, on the other hand,
does mark stress (e.g. está 'she/he/it is' vs. esta 'this (fem.)'), even though the number of stress-
contrastive pairs is much lower in Spanish than it is in English. This lack of stress marking in English will
make the entropy lower relative to Spanish since both languages have contrastive stress, but only
Spanish marks stress overtly. Similarly, English uses a number of digraphs to represent specific
consonantal phonemes (<th> for [θ] and [ð], <sh> for [ʃ], <ch> for [tʃ], <(d)ge> for [dʒ], <ph> for [f], etc.)
as well as vocalic phonemes (<ea>, <ee> and <ie> for [i], <ei> and <ay> for [e], <oa> for [o], <oo> for [u]
and [ʊ], etc.), whereas Spanish uses digraphs only to represent certain phonemic consonants (<ll> for
[ʎ], <rr> for [r], <ch> for [tʃ], <qu> for [k], etc.). An ideal corpus would consist of phonemic
transcriptions of each text, marking all contrastive segmental and suprasegmental features. Because
time and energy are finite resources, this was not done. Ultimately, differences such as these do not
affect the overall picture drastically, but they do introduce noise that would need to be accounted for in
any model claiming to be comprehensive.
Since the entropy of the characters used to write a language simultaneously captures the
uncertainty of numerous domains of the language (phonology, morphology, syntax, semantics,
pragmatics), a measure of syntactic complexity must isolate the uncertainty based on word order and
word classes from other domains. For this, I extract the part-of-speech (POS) tags from tagged corpora
and calculate the entropy of the tags. Because POS-tag sets are not standardized cross-linguistically,
there is a large degree of variability in the number of POS tag types for each language ranging from 508
in Bulgarian to 54 in both English and German. In order to adjust for this, POS tag entropies are divided
by the logarithm of the number of POS tag types.
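A minimal sketch of that normalization, assuming the POS tags have already been extracted from the tagged text; the tag sequence shown is an invented example, and the entropy here is a plain unigram estimate rather than the Lempel-Ziv estimate described below.

```python
import math
from collections import Counter

def adjusted_tag_entropy(tags):
    """Unigram entropy of a POS-tag sequence, divided by log2 of the number of
    tag types so that differently sized tag sets (508 in Bulgarian vs. 54 in
    English and German) become more comparable."""
    counts = Counter(tags)
    total = sum(counts.values())
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return h / math.log2(len(counts))

# Invented tag sequence for illustration
tags = ["DT", "NN", "VBD", "DT", "NN", "IN", "DT", "NN", "CC", "DT", "NN", "VBD"]
print(round(adjusted_tag_entropy(tags), 3))
```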
For the entropy calculations, both Shannon (1951) and Lempel-Ziv (1976) entropies were calculated.
Due to the relatively small size of the texts, for which Lempel-Ziv estimates provide better
approximations (Lesne, Blanc, and Pezard 2009), only unigram values were used for the Shannon estimates.
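As a rough illustration of a Lempel-Ziv-style entropy estimate, the sketch below uses an LZ78-style incremental parsing as a stand-in for the Lempel-Ziv (1976) production complexity, and the C · log2(n) / n normalization is one standard way of turning a phrase count into a bits-per-symbol estimate; both choices are simplifications of the procedure used in this study.

```python
import math

def lz_phrase_count(seq):
    """Number of phrases in an LZ78-style incremental parsing of seq:
    each phrase is the shortest prefix of the remaining input that has
    not yet occurred as a phrase."""
    seen = set()
    phrase = []
    count = 0
    for sym in seq:
        phrase.append(sym)
        key = tuple(phrase)
        if key not in seen:
            seen.add(key)
            count += 1
            phrase = []
    if phrase:                 # trailing, possibly repeated, phrase
        count += 1
    return count

def lz_entropy_estimate(seq):
    """Bits-per-symbol estimate: phrase count times log2(n), divided by n."""
    n = len(seq)
    return lz_phrase_count(seq) * math.log2(n) / n

print(round(lz_entropy_estimate("in the beginning god created the heaven and the earth"), 3))
```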
The overall complexity measure is calculated by multiplying the adjusted character entropy by
the average number of characters per verse. Texts are converted to all lower-case letters and
extraneous punctuation is removed.
overall complexity = adjusted character entropy × mean number of characters per verse
Similarly, syntactic complexity is the product of the adjusted POS tag entropy and the average number
of POS tags per verse.
syntactic complexity = adjusted POS-tag entropy × mean number of POS tags per verse
Because of the noise added by orthographic and tagging conventions, these complexity measures are
not necessarily directly comparable across languages. However, the correlation between syntactic
complexity and the difference between overall complexity and syntactic complexity speaks to the
hypothesis that a lack of complexity in one part of the grammar is compensated for by increased
complexity in other parts of the grammar. If this hypothesis is true, then there should be a strong
correlation between the two. Thus the mean characters per verse and mean POS tags per verse serve to
provide a baseline for cross-linguistic comparison by normalizing the entropy values against the
information encoded in the entire text.
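A sketch of how the two complexity measures and the correlation test fit together; the per-language inputs below are invented placeholders (the real inputs are the adjusted entropies and per-verse means computed from the corpus), and scipy is assumed to be available.

```python
from scipy.stats import pearsonr, spearmanr

def complexity(adjusted_entropy, mean_units_per_verse):
    """Adjusted entropy (of characters or POS tags) times the mean number of
    those units per verse, as defined above."""
    return adjusted_entropy * mean_units_per_verse

# Invented placeholder inputs for illustration only: (adjusted entropy, mean units per verse)
char_inputs = {"en": (0.82, 120.0), "de": (0.84, 130.0), "fi": (0.86, 125.0),
               "es": (0.83, 128.0), "sk": (0.85, 122.0)}
pos_inputs  = {"en": (0.72, 26.0),  "de": (0.69, 24.0),  "fi": (0.66, 21.0),
               "es": (0.73, 27.0),  "sk": (0.67, 22.0)}

langs = sorted(char_inputs)
overall    = [complexity(*char_inputs[l]) for l in langs]
syntactic  = [complexity(*pos_inputs[l]) for l in langs]
difference = [o - s for o, s in zip(overall, syntactic)]   # overall minus syntactic complexity

r, p_r = pearsonr(syntactic, difference)
rho, p_s = spearmanr(syntactic, difference)
print(f"Pearson r = {r:.3f} (p = {p_r:.4f}); Spearman rho = {rho:.3f} (p = {p_s:.4f})")
```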
3. Results and Discussion
As predicted, there is a fairly strong correlation between the Lempel-Ziv entropy of the POS tags
and the overall complexity measure less the syntactic complexity measure.
[Figure: syntactic complexity vs. overall complexity minus syntactic complexity, all six languages. Pearson r = -0.587, p = 0.2206; Spearman rho = -0.8857, p = 0.0188]
The only outlier is Bulgarian, which (as mentioned above) has the most robust POS tagging system with
over 500 unique tags. Although an adjustment was applied to lessen this effect, the discrepancy in
tagging conventions is just too substantial to remove entirely. With this tagging convention, some of
Bulgarian's morphological complexity is necessarily conflated with syntactic complexity. Setting
Bulgarian aside for a moment, the picture becomes clearer:
[Figure: the same correlation with Bulgarian excluded. Pearson r = -0.9145, p = 0.0296; Spearman rho = -0.9, p = 0.0374]
But regardless of Bulgarian's outlier status, the pattern that emerges shows that case-based systems
tend toward lower syntactic complexity and higher differences between overall complexity and syntactic
complexity. This provides confirmation for Hockett's intuitive claim that simplicity in one part of the
grammar is balanced by complexity in another part.
As a preliminary attempt at quantifying syntactic complexity and a first examination of the
relationship between different types of language complexity, these results are encouraging and invite
further investigation. Future research will expand on the present study by including additional
languages—especially creoles and more diverse language types, such as polysynthetic and agglutinating
languages—and will likely also involve a corpus with larger texts. Additional measures of complexity are
also needed to further test the principle of equicomplexity where phonology, morphology, and
semantics are concerned. Methods for controlling for orthographic noise and methods for regularizing
POS tags will also be sought and employed where possible.
References
Christodoulopoulos, Christos. n.d. Multilingual Parallel Bible Corpus.
<http://homepages.inf.ed.ac.uk/s0787820/bible/> Accessed 7 Feb 2014.
Dahl, Östen. 2004. The Growth and Maintenance of Linguistic Complexity. Amsterdam: John Benjamins.
---. 2009. Testing the assumption of complexity invariance: the case of Elfdalian and Swedish. In
Sampson et al., eds.
Gil, David. 2008. How complex are isolating languages? In Miestamo et al., eds.
Hockett, Charles F. 1958. A Course in Modern Linguistics. New York: Macmillan.
Juola, Patrick. 2008. Assessing linguistic complexity. In Miestamo et al., eds.
Kusters, Wouter. 2008. Complexity in linguistic theory, language learning, and language change. In
Miestamo et al., eds.
Lempel, Abraham, and Jacob Ziv. 1976. On the complexity of finite sequences. IEEE Transactions on
Information Theory IT-22:75-81.
Lesne, Annick, Jean-Luc Blanc, and Laurent Pezard. 2009. Entropy estimation of very short symbolic
sequences. Physical Review E 79:46208.
McWhorter, John. 2001. The world's simplest grammars are creole grammars. Linguistic Typology
5:125-166.
---. 2008. Why does a language undress? Strange cases in Indonesia. In Miestamo et al., eds.
Miestamo, Matti, Kaius Sinnemäki, and Fred Karlsson, eds. 2008. Language Complexity: Typology,
contact, change. Amsterdam: John Benjamins.
Nichols, Johanna. 2009. Linguistic complexity: a comprehensive definition and survey. In Sampson et al.,
eds.
Sampson, Geoffrey, David Gil, and Peter Trudgill, eds. 2009. Language Complexity as an Evolving
Variable. Oxford: Oxford University Press.
Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal
27:379-423, 623-656.
---. 1951. Prediction and entropy in printed English. Bell System Technical Journal 30:50-64.