S. Scott Schupbach
LING 257B; Winter 2014
Quantifying syntactic complexity
1. Introduction
Until recently, the long-held linguistic dogma that all languages are equally complex (e.g.
Hockett 1958) has gone largely unchallenged. Although the principle of linguistic equicomplexity has not
been conclusively proven, it is a reasonable hypothesis given that any thought, conceived of and
communicated in one language, can easily be expressed in any other. However, some linguists have
begun to challenge the idea that languages are uniform in their levels of complexity. McWhorter, for
example, claims that grammars of creole languages are less complex than grammars of older
languages, and that it is a natural process of human languages to acquire complexity over time (2001)
and that it is impossible for languages to lose complexity except through non-native acquisition (2008).
Dahl (2009) argues that languages acquire more complexity through the processes of language change
that result from robust contact situations. Nichols (2009) states that complexity is synonymous with a
high degree of structural and paradigmatic variability and proposes a measure of complexity based on
combining a number of such variations. Each of these claims is based on the assumption that language,
like technology, varies in complexity from one speech community to the next. However, before any of
these claims may be tested, what is meant by linguistic complexity must be clearly defined.
First and foremost, it must be recognized that complexity is not just the amount of variation in a
language's structure and paradigms. Many attempts to quantify grammatical complexity involve
identifying a number of structural domains where languages differ and taking some sort of count, such
as the number of possible syllable types, the size of the phonemic inventory, or the number of
inflectional categories marked on the maximally inflected verb form (all from Nichols 2009; see also Dahl
2009, McWhorter 2008, Kusters 2008, Miestamo 2008, Gil 2008). The main problem with this aggregate-method approach is that none of these measures alone comes anywhere near capturing the entire
picture, and regardless of how many individual factors are incorporated into a complexity measure of
this type, it is not possible to be certain that the resulting measure is a complete or reliable measure of
complexity. Additionally, it is difficult to determine what weight should be given to each component
since it is not clear that phonological complexity and syntactic complexity, for example, have the same
impact on overall grammatical complexity.
It is also important to recognize that a measure of grammatical complexity is different from
cognitive processing complexity. As Dahl (2009) points out, complexity as it pertains to cognitive
processing is not necessarily synonymous with structural complexity, and is better thought of as
processing cost, or agent-related complexity, in contrast to objective complexity. While the two may be
similar, and measures of cognitive costs may more reliably approximate grammatical complexity than
aggregate measures do, there are other factors involved in the cognitive processing of language that are
not a result of the structure of the language.
One proven approach to quantifying language complexity is calculating the entropy, or uncertainty,
of the linguistic signal (Shannon 1948). Shannon (1951) demonstrates that the predictable
aspects of English (such as phonotactic constraints, probabilistic word groupings, and semantic
associations) are reflected in entropy estimates of the characters of written English. Further, Shannon
argues that character entropy represents an upper bound of language complexity, but that the actual
entropy of a language is lower. This maximum entropy approach is used in numerous computational
approaches to natural language processing (Malouf 2013) and provides a reliable estimate of overall
language complexity.
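To make this concrete, the following is a minimal sketch of estimating unigram character entropy in the spirit of Shannon (1951); the function name and the sample string are illustrative, not part of the corpus pipeline described below.

```python
import math
from collections import Counter

def unigram_entropy(symbols):
    """Shannon entropy (bits per symbol) of the empirical unigram distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Illustrative usage on a lower-cased string with punctuation removed
sample = "in the beginning god created the heaven and the earth"
print(round(unigram_entropy(sample), 3))
```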
2. Methods
Dahl (2004:21) introduces (and then quickly dismisses) a measure of cross-linguistic grammatical
complexity that involves comparing the compressibility of the same content in different languages,
which he calls algorithmic information content. One benefit of this approach is that by comparing
translations of the same text in multiple languages, the complexity of the message is kept constant
(Juola 2008). For the present study, I use translations of the Old Testament portion of the Bible as the
texts for comparison. The texts all come from the multilingual parallel Bible corpus prepared by
Christodoulopoulos (n.d.). I chose the Old Testament rather than the New Testament because it is
substantially longer and stylistically more diverse; I excluded the New Testament because it would
introduce a large number of highly specialized words, an undesirable effect given the limited length of
each text. Using Bible portions has the added benefit that there is a widely
accepted system of reference involving chapter and verse numbers. These allow for a division of the text
into discrete units based on meaning rather than syntactic divisions such as clauses or sentences.
Although the translations used in this project do not contain identical numbers of verses, the standard
deviation in the number of verses is less than 0.2% of the mean. Verse divisions allow for a more reliable method
of measuring syntactic complexity because they control for the way a language combines multiple
clauses in discourse. Comparing the number of sentences in each text, the standard deviation is over
10% of the mean, indicating that variability in clause combining strategies is a salient part of syntactic
complexity (although this was not a measure in any of the previous aggregate-method studies).
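By way of illustration, the verse-count comparison amounts to computing the standard deviation as a percentage of the mean across translations; the counts below are hypothetical placeholders, not the corpus figures.

```python
import statistics

# Hypothetical verse counts per translation; the real values come from the parallel Bible corpus
verse_counts = {"bg": 23210, "en": 23145, "fi": 23198, "de": 23187, "sk": 23174, "es": 23160}

mean = statistics.mean(verse_counts.values())
sd = statistics.stdev(verse_counts.values())
print(f"std dev = {sd:.1f} verses, i.e. {100 * sd / mean:.2f}% of the mean")
```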
The six languages under investigation (Bulgarian, English, Finnish, German, Slovak, Spanish)
were not so much chosen as they were identified as the only viable options. They needed to have a
complete and uncorrupted Old Testament portion in the Bible corpus, use an alphabetic writing system
and be capable of POS tagging, meaning they had to be one of the languages for which TreeTagger
already had a tag library, or there had to exist a POS tagged corpus on which TreeTagger could be
trained. Initially I included Italian, Latin, and Estonian in the corpus; however, I soon found that each
of these was corrupted in a way that made it unusable. Future stages of this project will
incorporate those languages as well.
The only variable that is not controlled for by using this particular corpus is the variation in the
writing systems used to encode the written form of each language. Although only alphabetic systems
are used, there is still some inconsistency in the amount of phonological detail encoded in each
orthographic system. In English, for example, the nominal and verbal interpretations of 'record' vary
only in the placement of stress. However, stress is not marked in English. Spanish, on the other hand,
does mark stress (e.g. está 'she/he/it is' vs. esta 'this (fem.)'), even though the number of stress-
contrastive pairs is much lower in Spanish than it is in English. This lack of stress marking in English will
make the entropy lower relative to Spanish since both languages have contrastive stress, but only
Spanish marks stress overtly. Similarly, English uses a number of digraphs to represent specific
consonantal phonemes (<th> for [θ] and [ð], <sh> for [ʃ], <ch> for [tʃ], <(d)ge> for [dʒ], <ph> for [f], etc.)
as well as vocalic phonemes (<ea>, <ee> and <ie> for [i], <ei> and <ay> for [e], <oa> for [o], <oo> for [u]
and [ʊ], etc.), whereas Spanish uses digraphs only to represent certain phonemic consonants (<ll> for
[ʎ], <rr> for [r], <ch> for [tʃ], <qu> for [k], etc.). An ideal corpus would consist of phonemic
transcriptions of each text, marking all contrastive segmental and suprasegmental features. Because
time and energy are finite resources, this was not done. Ultimately, differences such as these do not
affect the overall picture drastically, but they do introduce noise that would need to be accounted for in
any model claiming to be comprehensive.
Since the entropy of the characters used to write a language simultaneously captures the
uncertainty of numerous domains of the language (phonology, morphology, syntax, semantics,
pragmatics), a measure of syntactic complexity must isolate the uncertainty based on word order and
word classes from other domains. For this, I extract the part-of-speech (POS) tags from tagged corpora
and calculate the entropy of the tags. Because POS-tag sets are not standardized cross-linguistically,
there is a large degree of variability in the number of POS tag types for each language ranging from 508
in Bulgarian to 54 in both English and German. In order to adjust for this, POS tag entropies are divided
by the logarithm of the number of POS tag types.
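A minimal sketch of that normalization, assuming the POS tags have already been extracted from the tagged text; the tag sequence shown is an invented example, and the entropy here is a plain unigram estimate rather than the Lempel-Ziv estimate described below.

```python
import math
from collections import Counter

def adjusted_tag_entropy(tags):
    """Unigram entropy of a POS-tag sequence, divided by log2 of the number of
    tag types so that differently sized tag sets (508 in Bulgarian vs. 54 in
    English and German) become more comparable."""
    counts = Counter(tags)
    total = sum(counts.values())
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return h / math.log2(len(counts))

# Invented tag sequence for illustration
tags = ["DT", "NN", "VBD", "DT", "NN", "IN", "DT", "NN", "CC", "DT", "NN", "VBD"]
print(round(adjusted_tag_entropy(tags), 3))
```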
For the entropy calculations, both Shannon (1951) and Lempel-Ziv (1976) entropies were calculated.
Due to the relatively small size of the texts, for which Lempel-Ziv estimates provide better
approximations (Lesne, Blanc, and Pezard 2009), only unigram values were used for the Shannon estimates.
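As a rough illustration of a Lempel-Ziv-style entropy estimate, the sketch below uses an LZ78-style incremental parsing as a stand-in for the Lempel-Ziv (1976) production complexity, and the C · log2(n) / n normalization is one standard way of turning a phrase count into a bits-per-symbol estimate; both choices are simplifications of the procedure used in this study.

```python
import math

def lz_phrase_count(seq):
    """Number of phrases in an LZ78-style incremental parsing of seq:
    each phrase is the shortest prefix of the remaining input that has
    not yet occurred as a phrase."""
    seen = set()
    phrase = []
    count = 0
    for sym in seq:
        phrase.append(sym)
        key = tuple(phrase)
        if key not in seen:
            seen.add(key)
            count += 1
            phrase = []
    if phrase:                 # trailing, possibly repeated, phrase
        count += 1
    return count

def lz_entropy_estimate(seq):
    """Bits-per-symbol estimate: phrase count times log2(n), divided by n."""
    n = len(seq)
    return lz_phrase_count(seq) * math.log2(n) / n

print(round(lz_entropy_estimate("in the beginning god created the heaven and the earth"), 3))
```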
The overall complexity measure is calculated by multiplying the adjusted character entropy by
the average number of characters per verse. Texts are converted to all lower-case letters and
extraneous punctuation is removed.
overall complexity = adjusted character entropy × mean number of characters per verse
Similarly, syntactic complexity is the product of the adjusted POS tag entropy and the average number
of POS tags per verse.
syntactic complexity = adjusted POS-tag entropy × mean number of POS tags per verse
Because of the noise added by orthographic and tagging conventions, these complexity measures are
not necessarily directly comparable across languages. However, the correlation between syntactic
complexity and the difference between overall complexity and syntactic complexity speaks to the
hypothesis that a lack of complexity in one part of the grammar is compensated for by increased
complexity in other parts of the grammar. If this hypothesis is true, then there should be a strong
correlation between the two. Thus the mean characters per verse and mean POS tags per verse serve to
provide a baseline for cross-linguistic comparison by normalizing the entropy values against the
information encoded in the entire text.
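A sketch of how the two complexity measures and the correlation test fit together; the per-language inputs below are invented placeholders (the real inputs are the adjusted entropies and per-verse means computed from the corpus), and scipy is assumed to be available.

```python
from scipy.stats import pearsonr, spearmanr

def complexity(adjusted_entropy, mean_units_per_verse):
    """Adjusted entropy (of characters or POS tags) times the mean number of
    those units per verse, as defined above."""
    return adjusted_entropy * mean_units_per_verse

# Invented placeholder inputs for illustration only: (adjusted entropy, mean units per verse)
char_inputs = {"en": (0.82, 120.0), "de": (0.84, 130.0), "fi": (0.86, 125.0),
               "es": (0.83, 128.0), "sk": (0.85, 122.0)}
pos_inputs  = {"en": (0.72, 26.0),  "de": (0.69, 24.0),  "fi": (0.66, 21.0),
               "es": (0.73, 27.0),  "sk": (0.67, 22.0)}

langs = sorted(char_inputs)
overall    = [complexity(*char_inputs[l]) for l in langs]
syntactic  = [complexity(*pos_inputs[l]) for l in langs]
difference = [o - s for o, s in zip(overall, syntactic)]   # overall minus syntactic complexity

r, p_r = pearsonr(syntactic, difference)
rho, p_s = spearmanr(syntactic, difference)
print(f"Pearson r = {r:.3f} (p = {p_r:.4f}); Spearman rho = {rho:.3f} (p = {p_s:.4f})")
```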
3. Results and Discussion
As predicted, there is a fairly strong correlation between the Lempel-Ziv entropy of the POS tags
and the overall complexity measure less the syntactic complexity measure.
[Figure: syntactic complexity vs. overall complexity minus syntactic complexity, all six languages. Pearson r = -0.587, p = 0.2206; Spearman rho = -0.8857, p = 0.0188]
The only outlier is Bulgarian, which (as mentioned above) has the most robust POS tagging system with
over 500 unique tags. Although an adjustment was applied to lessen this effect, the discrepancy in
tagging conventions is just too substantial to remove entirely. With this tagging convention, some of
Bulgarian's morphological complexity is necessarily conflated with syntactic complexity. Setting
Bulgarian aside for a moment, the picture becomes clearer:
[Figure: the same correlation with Bulgarian excluded. Pearson r = -0.9145, p = 0.0296; Spearman rho = -0.9, p = 0.0374]
But regardless of Bulgarian's outlier status, the pattern that emerges shows that case-based systems
tend toward lower syntactic complexity and higher differences between overall complexity and syntactic
complexity. This provides confirmation for Hockett's intuitive claim that simplicity in one part of the
grammar is balanced by complexity in another part.
As a preliminary attempt at quantifying syntactic complexity and a first examination of the
relationship between different types of language complexity, these results are encouraging and invite
further investigation. Future research will expand on the present study by including additional
languages—especially creoles and more diverse language types, such as polysynthetic and agglutinating
languages—and will likely also involve a corpus with larger texts. Additional measures of complexity are
also needed to further test the principle of equicomplexity where phonology, morphology, and
semantics are concerned. Methods for controlling for orthographic noise and methods for regularizing
POS tags will also be sought and employed where possible.
References
Christodoulopoulos, Christos. n.d. Multilingual Parallel Bible Corpus.
<http://homepages.inf.ed.ac.uk/s0787820/bible/> Accessed 7 Feb 2014.
Dahl, Östen. 2004. The Growth and Maintenance of Linguistic Complexity. Amsterdam: John Benjamins.
---. 2009. Testing the assumption of complexity invariance: the case of Elfdalian and Swedish. In
Sampson et al., eds.
Gil, David. 2008. How complex are isolating languages? In Miestamo et al., eds.
Hockett, Charles F. 1958. A Course in Modern Linguistics. New York: Macmillan.
Juola, Patrick. 2008. Assessing linguistic complexity. In Miestamo et al., eds.
Kusters, Wouter. 2008. Complexity in linguistic theory, language learning, and language change. In
Miestamo et al., eds.
Lempel, Abraham, and Jacob Ziv. 1976. On the complexity of finite sequences. IEEE Transactions on
Information Theory IT-22:75-81.
Lesne, Annick, Jean-Luc Blanc, and Laurent Pezard. 2009. Entropy estimation of very short symbolic
sequences. Physical Review E 79:46208.
McWhorter, John. 2001. The world's simplest grammars are creole grammars. Linguistic Typology
5:125-166.
---. 2008. Why does a language undress? Strange cases in Indonesia. In Miestamo et al., eds.
Miestamo, Matti, Kaius Sinnemäki, and Fred Karlsson, eds. 2008. Language Complexity: Typology,
contact, change. Amsterdam: John Benjamins.
Nichols, Johanna. 2009. Linguistic complexity: a comprehensive definition and survey. In Sampson et al.,
eds.
Sampson, Geoffrey, David Gil, and Peter Trudgill, eds. 2009. Language Complexity as an Evolving
Variable. Oxford: Oxford University Press.
Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal
27:379-423, 623-656.
---. 1951. Prediction and entropy in printed English. Bell System Technical Journal 30:50-64.