
Sherds from an Arabic Treebanking Mosaic

2002, Prague Bulletin of Mathematical Linguistics, (78:63-76).

This paper introduces the reader to those aspects of the Arabic language which require special treatment compared with the languages Europeans are more familiar with. Although the authors draw on fresh experience from building the Prague Arabic Dependency Treebank, they try to take a broader view of the problems encountered along the way. The topics discussed include linguistic data retrieval, the modelling of morphology and morphotactics, and the description of the language on the analytical level.

Otakar Smrž and Petr Zemánek

1 Introduction

We assume background knowledge of the motivation and concepts of the Prague Arabic Dependency Treebank project (cf. Smrž, Šnaidauf, Zemánek 2002). Its idea is to follow the practice set up by the Prague Dependency Treebank for Czech, as far as the analogy between the two languages allows. There are points, however, to which we would like to draw attention, since they defy the usual considerations rather than yield to them, or simply have not been dealt with before. We shall focus on these without putting much stress on a complete overall picture of their application, which leaves us free to mention other approaches as well, not necessarily ones realized by our team.

2 Characteristics of the Arabic Language and Script

Arabic is, together with Northwest Semitic, a branch of the Central group of the West Semitic languages (for further details, cf. Faber 1997). The literary language should be the same throughout the Arabic-speaking countries, whereas the local dialects can differ considerably from each other. Arabic is the mother tongue of about 300 million people. Arabic is usually characterized as a highly introflective language, i.e.
a language where, apart from other standard instruments of flexion (mainly desinential in the case of Arabic), there is a system of inner flexion based on the different roles of the consonantal root (mostly tri-radical), its vocalization and various affixes. This system works mainly in word-building, where the root is considered the semantic base of the word, and the vocalization together with the root forms a stem (a lexical and partially also morphological “actualization” of the root), which is closer to the actual meaning of the word. The word is then “finalized” by the use of various affixes.1 In the non-concatenative approach (cf., e.g., McCarthy 1985), this scheme is represented in several tiers, where each of the respective morphemes (root, vocalization, affixes) occupies one tier, and the word is then built by joining these tiers. This has also been used in the NLP domain (cf. works by Kiraz, e.g. 1998, 1999 and 2000, and Beesley 1999).2 The system is also used in the organization of dictionaries of Arabic (esp. those produced in the West), which means that dictionary look-up requires the user to be capable of full morphological analysis of a given word form, or potentially of a string of word forms (see below).

1 E.g., for the root [ktb], the vocalization katab represents the stem of the verb to write in the perfect tense, kotub the stem of the same verb in the imperfect (both in the active voice). katabato then means she wrote, yakotubu is he writes, the prefixes/suffixes and the stems being independent of each other. The notation of Arabic is treated in Section 3.

2 In his works, Kiraz uses a consonantal skeleton and a vocalization as a template, from which the actual word form is generated. E.g., a template CVCVC with the root CCC=[ktb] and vocalization VV=[aa] gives katab as the result. Beesley, on the other hand, rather works with so-called patterns, i.e. CaCaC in this case.
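The two ways of combining a root with its vocalization mentioned in footnote 2 can be illustrated with a minimal sketch. The function names `apply_pattern` and `apply_template` are hypothetical and are not taken from the cited systems; the code only fills slots and deliberately ignores the morphonological transformations that weak or geminate roots undergo.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill each C slot of a pattern such as CaCaC with the next root consonant."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants do not match")
    radicals = iter(root)
    # Every character other than the slot marker C is copied unchanged.
    return "".join(next(radicals) if ch == "C" else ch for ch in pattern)


def apply_template(template: str, root: str, vocalism: str) -> str:
    """Fill C slots from the root and V slots from the vocalism (template CVCVC)."""
    if template.count("C") != len(root) or template.count("V") != len(vocalism):
        raise ValueError("template slots do not match root or vocalism")
    consonants, vowels = iter(root), iter(vocalism)
    out = []
    for ch in template:
        if ch == "C":
            out.append(next(consonants))
        elif ch == "V":
            out.append(next(vowels))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(apply_pattern("ktb", "CaCaC"))         # pattern-based stem
    print(apply_template("CVCVC", "ktb", "aa"))  # template-based stem
```

Both calls yield the stem katab of footnote 2. The pattern variant keeps vowels and slots in one string, while the template variant keeps the vocalization on a separate tier.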
For NLP, it is mainly the Arabic script that presents a challenge. It uses two types of graphemes: “real” graphemes, which represent mainly consonants (Huru~f, “letters” in Arabic), and additional marks used mainly for vocalization (generally referred to in English as “diacritics”, in Arabic as Haraka~t, “movements”). As a rule, the script contains mostly the marks from the first group, representing the consonantal skeleton of a word. Only rarely does it include the so-called vocalization marks, which comprise marks for vowels and other characters (e.g., a mark for gemination of a consonant). The virtual absence of “diacritics” certainly increases the ambiguity of Arabic texts.

The degree, density and quality of vocalization differ according to the kind of text, which is then described as fully vocalized, partially vocalized or non-vocalized. The distinction may sometimes be fuzzy, since diacritics can be omitted or enforced locally, in contrast to the global style. Table 1 offers a rough view of this classification based on the distribution of graphemes:

Vocalization          Text        Graphemes   Letters   Diacritics
Fully vocalized       Bible       3 743 329   55.73 %   44.27 %
Fully vocalized       Quran         610 644   55.49 %   44.51 %
Partially vocalized   Fiction    36 860 639   97.53 %    2.47 %
Non-vocalized         Other     151 840 850   99.94 %    0.06 %

Table 1: Percentage of letters vs. diacritics in chosen Arabic texts. The CLARA corpus (Zemánek 2001) provides canonized religious texts as the only reliable source of fully vocalized data. Fiction features partial vocalization, while the other subcorpora are non-vocalized, settling close to the average values shown above.

Another drawback of the notation of Arabic is that words can be clustered into one string of characters, within which the morphotactic borders have to be set for the sake of further analysis.
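The letter and diacritic shares reported in Table 1 can be approximated for any Unicode text with a short sketch like the following. The choice of code points is an assumption made here (short vowels, nunation, shadda, sukun and the dagger alif count as diacritics; the tatweel is ignored); the paper does not state exactly how its graphemes were counted, and `grapheme_shares` is a hypothetical helper name.

```python
# Assumption: U+064B..U+0652 plus the dagger alif U+0670 are the diacritics.
ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def is_arabic_letter(ch: str) -> bool:
    """Basic Arabic letters, excluding the tatweel (U+0640), plus alif wasla."""
    return ("\u0621" <= ch <= "\u064A" and ch != "\u0640") or ch == "\u0671"


def grapheme_shares(text: str) -> tuple[int, float, float]:
    """Return (graphemes, percent letters, percent diacritics) for a text."""
    letters = sum(1 for ch in text if is_arabic_letter(ch))
    diacritics = sum(1 for ch in text if ch in ARABIC_DIACRITICS)
    total = letters + diacritics
    if total == 0:
        return 0, 0.0, 0.0
    return total, 100.0 * letters / total, 100.0 * diacritics / total


if __name__ == "__main__":
    # Fully vocalized kataba: three letters, three short-vowel marks.
    print(grapheme_shares("\u0643\u064E\u062A\u064E\u0628\u064E"))
```

A fully vocalized word lands near the even split seen for the religious texts, whereas the same word without its vowel marks counts as letters only.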
These clusters can consist of one autosemantic word, the definite article and a number of function words, such as prepositions (especially uniliteral ones), conjunctions and pronouns (objective with verbs, possessive with nouns, mixed with prepositions). Some other markers (the future tense marker etc.) can appear as well. Thus, e.g., the string fsyktbwnhA so they will write it/them can be divided into at least four words (cf. Table 6). This means that for analyses of Arabic on the analytical and the tectogrammatical levels, it is necessary to provide a sequence of tokens resulting from morphologically disambiguated text. Most morphological analyzers of Arabic offer segmentation on the level of morphemes. For further treatment of their output, a decision has to be made as to how this information is to be represented in the morphological annotation and how the strings of characters are to be divided into the tokens needed on the analytical level. The preceding sentence implicitly gives the answer to this problem: whenever there is a syntactic type of government within the respective string, this government has to be reflected on the analytical level, and these “new” units have to be generated by splitting the original string.3 Otherwise, there are connections that do not have to be separated, although they do not belong to the original form of the autosemantic word.4

In syntax, Arabic can be characterized as a language with prevailing VSO word order. This, however, holds mainly for sentences where the role of the predicate is played by a verb. Besides, there are sentences with non-verbal predication, where the predicate is expressed by a noun, a prepositional phrase or by other means. In nominal sentences, the word order is mostly inverse, i.e. subject, predicate and no object.
In addition, Arabic has a number of topicalizers, which make the word order far from fixed and thus increase the number of instances differing from the VSO order.

3 In some cases, changes other than splitting are also necessary. E.g., the preposition li_ combines with the definite article {al_, merging into lil_ instead of li{al_. During tokenization, missing graphemes must be restored.

4 This applies e.g. to the definite article {al_, which certainly does not form a part of the lexeme. However, it does not have to be represented as a unit on the analytical level, and it is better to keep the article’s value in a morphological tag.

3 Representation of Arabic

Since Unicode came into general use, persistent encoding of the Arabic script has no longer been a problem worth discussing. Nevertheless, in recognized cases, adopting alternative transliteration or transcription systems may prove more convenient.

The Arabic script is suited for recording the individual phonemes of the language. Written from right to left, the strokes continuously cross the boundaries of letters, the shapes of which conform to the adjacent glyphs or letter forms (initial, medial, final, isolated). Irrespective of the presence or absence of short vowels and other optional marks (altogether referred to as diacritics), the algorithm determining the glyphs given the letters is well-defined. This regularity (provided that the original script is correct) makes it possible to encode just the letters and let the forms be computed at the very moment of script rendering.
While the Unicode charts Arabic Presentation Forms-A (0xFB50–0xFDFF) and Arabic Presentation Forms-B (0xFE70–0xFEFF) ensure fidelity by remembering every single ligature of a sequence of glyphs, the more common systems like Unicode Arabic (0x0600–0x06FF), Windows CP 1256, ISO 8859-6 or the lower ASCII Buckwalter transliteration introduce one-to-one mappings of distinct graphemes, i.e. letters and diacritics.

Unlike these graphemic transliteration concepts, the typesetting system ArabTeX (Lagally 1999) defines its own notation, which covers both contemporary and historical orthography in an excellent way. Moreover, the encoding is human-readable, and thus comes in handy wherever the script is too difficult to display or edit. The catch is that ArabTeX has to evaluate a larger context of each lower ASCII character to generate the corresponding Arabic representation, which makes real-time conversions less efficient.

We will use Buckwalter transliteration in examples emphasizing the actual manner of vocalization, e.g. in morphological analyses, whereas ArabTeX notation will restore the complete word forms. An approximate phonetic transcription will suffice for the dependency trees later on. Table 2 demonstrates these three encodings on an Arabic sentence asking you to “read this text carefully”.

Aiqora>o h`*aA {ln~aS~a bi{notibaAhK     Buckwalter graphemic transliteration
iqra’ h_a_dA an-na.s.sa bi-intibAhiN     ArabTeX transliteration encoding
iqra’ ha~Va~ an-naSSa bi- intiba~hin     phonetic transcription of tokenized text

Table 2: Comparison of lower ASCII transliterations and a phonetic transcription. The fully vocalized text implies the use of various diacritical marks, even those echoing an empty vowel (Buckwalter). Original orthography is preserved, though it is disguised for the sake of readability (ArabTeX). Understanding all the graphemes and the proper pronunciation of the symbols in our transcription, which is admittedly quite vague, is not essential for this paper.
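Because the Buckwalter transliteration is a one-to-one mapping of graphemes, conversion in either direction reduces to a table look-up per character, with no context needed. The sketch below uses the commonly published Buckwalter table for the basic Arabic block; the function names are hypothetical, and characters outside the table (spaces, punctuation) are passed through unchanged as a simplifying assumption.

```python
# Buckwalter symbol -> Unicode code point (letters first, then diacritics).
BUCKWALTER_TO_UNICODE = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624", "<": "\u0625",
    "}": "\u0626", "A": "\u0627", "b": "\u0628", "p": "\u0629", "t": "\u062A",
    "v": "\u062B", "j": "\u062C", "H": "\u062D", "x": "\u062E", "d": "\u062F",
    "*": "\u0630", "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638", "E": "\u0639",
    "g": "\u063A", "_": "\u0640", "f": "\u0641", "q": "\u0642", "k": "\u0643",
    "l": "\u0644", "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648",
    "Y": "\u0649", "y": "\u064A",
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E", "u": "\u064F",
    "i": "\u0650", "~": "\u0651", "o": "\u0652", "`": "\u0670", "{": "\u0671",
}
UNICODE_TO_BUCKWALTER = {v: k for k, v in BUCKWALTER_TO_UNICODE.items()}


def to_arabic(text: str) -> str:
    """Map each Buckwalter symbol to its Arabic grapheme; keep other characters."""
    return "".join(BUCKWALTER_TO_UNICODE.get(ch, ch) for ch in text)


def to_buckwalter(text: str) -> str:
    """Map each Arabic grapheme back to its Buckwalter symbol."""
    return "".join(UNICODE_TO_BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sentence = "Aiqora>o h`*aA {ln~aS~a bi{notibaAhK"  # first row of Table 2
    print(to_arabic(sentence))
    print(to_buckwalter(to_arabic(sentence)) == sentence)
```

The round trip on the Buckwalter row of Table 2 restores the input exactly, which is what the one-to-one property promises. An ArabTeX converter could not be written this way, since it has to inspect the context of each character.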
4 Linguistic Data Retrieval

The resources exploited in the treebanking project comprise the LDC Arabic Newswire A Corpus (ANAC), Corpus Linguae Arabicae (CLARA) and the Ummah Arabic-English Parallel News Corpus (UAEC). After characterizing each data set, we shall explain our method of document topic analysis, which helped retrieve a domain-specific language resource.

4.1 Resource Information

Arabic Newswire A Corpus was collected by the Linguistic Data Consortium (LDC), University of Pennsylvania, from the news which appeared on the Agence France Presse (AFP) Arabic Newswire in the period from May 1994 to December 2000. The corpus contains roughly 80 million words in about 384 thousand documents of a wide thematic scope. Although renowned information retrieval experiments have been performed on these data (cf. Sawaf et al. 2001, Brants et al. 2002, Oard et al. 2002), there is no reusable topic identification associated with the data yet.

Corpus Linguae Arabicae is, on the other hand, a topically classified corpus of Modern Standard Arabic which has been compiled at the Institute of Comparative Linguistics, Charles University in Prague (Zemánek 2001). Out of the total of 53 million words, 13 million constitute a subcorpus of the language of economics, business and finance (mostly from the Hayat newspaper of the years 1995–1997), the rest being news in general, expert materials, fiction, and scientific literature.

The last corpus to mention is the Ummah Arabic-English Parallel News Corpus, based on digests of various Arabic newspapers issued weekly by the Ummah Press Service in Cairo. The news stories in this collection, gathered by the LDC, date from January 2001 to March 2002 (reported by Xiaoyi Ma of the LDC, July 31st 2002).
There are 3,039 story pairs giving 13,027 sentence pairs, or 765,492 words altogether, 352,759 in Arabic and 412,733 in English.5

4.2 Document Topic Analysis

The topically distinguished subcorpora of CLARA can be used to build reference models of each particular language domain. For an arbitrary document, some measure of conformity or similarity to the given model may be studied to see whether the document and the subcorpus fall into the same thematic class. In this way, possible candidates for texts treating economics, business or finance (topics also found in legal and industrial texts) were extracted from the set of all ANAC articles. Still, before they were included in the new resource of the desired property, humans had to confirm their relevance.

4.2.1 The method and its application

While diverse techniques may be employed (Oard et al. 2002), we resorted to statistical modelling. The choice of the method was conditioned by the size of the data in question. The documents to classify comprised just 200 words on average, ranging roughly from 50 to 500 words. That is why the distribution of unigrams occurring in a text was taken as our modelling criterion; with such sparse testing data, it would have been hopeless to follow any larger elements.

According to the CLARA subdivision, reference models of these domains were established: economics and finance, law, industry, agriculture, traffic, politics, humanities, sports, medicine, science, arts, fiction, as well as a complementary non-economics model (fields from law to fiction). Furthermore, global models for both the CLARA and ANAC corpora were prepared.

A unigram model can be constructed even for every single ANAC document. Its resemblance to the reference models is quantified, for instance, by the value of the correlation coefficient between the respective distribution functions (normalized to integrate to one).
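As a rough illustration of the comparison just described, the sketch below builds a normalized unigram model from a token list and scores two models against each other. Two assumptions are made here: tokens are whatever the caller supplies (the paper's own word definition is not reproduced), and the coefficient is computed as the cosine of the two frequency vectors, which is the reading given in Section 4.2.2. A mean-centred Pearson coefficient would be a different choice. The function names are hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens: list[str]) -> dict[str, float]:
    """Relative frequencies of word forms, summing to one."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


def similarity(p: dict[str, float], q: dict[str, float]) -> float:
    """Cosine of the angle between two frequency vectors (0.0 if one is empty)."""
    dot = sum(freq * q.get(form, 0.0) for form, freq in p.items())
    norm_p = math.sqrt(sum(f * f for f in p.values()))
    norm_q = math.sqrt(sum(f * f for f in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


if __name__ == "__main__":
    reference = unigram_model("bank rate bank loan market".split())
    document = unigram_model("bank loan rate".split())
    print(round(similarity(document, reference), 3))
```

A document would be scored against every reference model in turn, and the ranking of those scores is what the selection step operates on.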
In order to enhance sensitivity to the linguistic nuances of the domains, unigram frequencies outside a certain interval were clipped to zero. Empirically, the reliability range of <0.002 %, 0.200 %> was set for the reference models, while <0.002 %, 100.0 %> imposed no upper bound on the models being tested. Those documents which identified best with one of the first three reference models, or which identified with them in the second position with a correlation coefficient scoring at least 90 % of the winning value, proceeded to manual verification. Such texts provided more than 1 million words in total. Humans themselves had difficulties in establishing sharp and unbiased criteria for topic assignment; in any case, their judgements disqualified one third of the pool. There are probably many ANAC articles which were never recognized and yet should have been, just as there are those which suited the method and were rejected by humans. Recalling our initial intention of building a domain-specific language resource while reducing, but not eliminating, the manual effort, we dare declare our solution successful.

5 Word counts for ANAC and CLARA share the definition of a word (strings delimited by boundaries between incompatible characters), which is different from that used with UAEC (splitting on whitespace only). In neither case are the words real linguistic units, as justified by the tokenization problem.

4.2.2 Discussion and remarks

The method of correlation coefficient evaluation may be interpreted equivalently in terms of vector calculus. Let us imagine that every distinct word form (or n-gram) generates one dimension of a vector space. A language model over these forms is then rendered as a vector whose co-ordinates correspond to the frequencies of the n-grams. In such a case, the correlation coefficient equals the cosine of the angle contained by the vectors of the models being compared.
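The clipping step, the selection rule and the angle interpretation can be put together in a small sketch. It rests on several assumptions: clipped frequencies are simply dropped and the model is not renormalized afterwards, the "first three reference models" are taken to be economics and finance, law and industry, and all function and key names are hypothetical.

```python
import math


def clip(model: dict[str, float], low: float = 0.002 / 100,
         high: float = 0.200 / 100) -> dict[str, float]:
    """Keep only frequencies inside the reliability range; the rest become zero."""
    return {form: f for form, f in model.items() if low <= f <= high}


def is_candidate(scores: dict[str, float], targets: set[str],
                 ratio: float = 0.9) -> bool:
    """Selection rule: a target model wins, or is second within `ratio` of the winner."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not ranked:
        return False
    if ranked[0] in targets:
        return True
    return (len(ranked) > 1 and ranked[1] in targets
            and scores[ranked[1]] >= ratio * scores[ranked[0]])


def deviation_angle(coefficient: float) -> float:
    """Angle in degrees whose cosine is the given coefficient."""
    return math.degrees(math.acos(max(-1.0, min(1.0, coefficient))))


if __name__ == "__main__":
    targets = {"economics and finance", "law", "industry"}
    scores = {"sports": 0.60, "economics and finance": 0.55, "arts": 0.20}
    print(is_candidate(scores, targets))
    print(round(deviation_angle(0.737), 1))
```

For the tested documents the upper bound would be lifted by passing `high=1.0`. The last call converts a coefficient of 0.737 into an angle of about 42.5 degrees, which is how a coefficient and a deviation angle relate under the cosine reading.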
Naturally, the classification process amounts to finding which of the reference vectors deviates least from the vector being studied. The notion of vectors makes it easy to consider relations among the reference models, too.6

Our experiment in Table 3 describes the orientation of the domain vectors relative to the vectors of CLARA and ANAC, and indicates some prevailing character of the topics in the corpora. The table also shows discrepancies in the quality of the two resources. While ANAC data are robust and uniform, CLARA models do not seem representative enough due to the low ratio of words to word forms. Its subcorpora may not be formatted consistently, and feature different typographic conventions. Certainly, ANAC transcribes all foreign proper names and abbreviations into Arabic, while the Hayat newspaper keeps Roman characters intact. It would therefore be necessary to improve the language models prior to tuning up the aspects of the method.

Topic Domain            Word Count   Form Count     W/F   Correlation Coef.   Deviation Angle
                                                          CLARA     ANAC      CLARA    ANAC
economics and finance   12 722 560      272 378    46.7   0.737     0.573      42.5    55.0
law                      1 121 202       84 097    13.3   0.709     0.492      44.8    60.5
industry                 2 507 161      111 648    22.5   0.646     0.372      49.8    68.2
agriculture                671 601       43 261    15.5   0.542     0.372      57.2    68.2
traffic                    532 440       59 651     8.9   0.679     0.389      47.2    67.1
politics                 9 928 893      284 848    34.9   0.820     0.695      34.9    46.0
humanities               9 053 453      481 989    18.8   0.754     0.392      41.1    66.9
sports                   1 240 809       91 823    13.5   0.662     0.566      48.5    55.5
medicine                 1 649 972      148 004    11.1   0.757     0.464      40.8    62.4
science                  1 710 542      144 332    11.9   0.846     0.506      32.2    59.6
arts                       713 117       80 621     8.8   0.748     0.434      41.6    64.3
fiction                 10 812 730      652 273    16.6   0.692     0.334      46.2    70.5
non-economics           39 948 703    1 159 126    34.5   0.898     0.539      26.1    57.4
CLARA                   52 671 263    1 227 361    42.9   1.000     0.609       0.0    52.5
ANAC                    79 872 381      555 973   143.7   0.609     1.000      52.5     0.0

Table 3: Reference models and their relation to CLARA and ANAC. Characteristics of the data sets prompt questions about their reliability.
The discussion above explains why the deviation angles (given in degrees) are better for CLARA (union of subcorpora) than for ANAC (uneven data type).

6 In fact, there is no need for vector discrimination in the space.

5 Morphological Analysis and Disambiguation

Given an unresolved string of Arabic characters, morphological analyzers commonly spell out a word stem and all underlying morphemes, clitics etc. with their appropriate labelling. The systems usually differ in the implementation of the parsing process and in the method of stem decomposition, if any, into the root and the pattern (cf. Beesley or Kiraz or Cavalli-Sforza et al.). Let us have a closer look at those whose performance has been tested during our project.

The Xerox Arabic Morphological Analyzer (XAMA) is clearly based on two-level morphology, reusing finite-state tools developed for language-independent processing by the Xerox Research Centre Europe (cf. Beesley 2001). Though its analyses offer valuable information on roots, patterns, and case and mood endings, some intricate derivational schemes are missing and the vocabulary cannot be easily extended by end-users.

Tim Buckwalter’s Arabic Morphology Analyzer (cf. Maamouri and Cieri 2002) does not go into detail about root and pattern, nor does it discover all imaginable readings. It works with a lexicon of stem entries to which input strings are reduced while obeying the rules of Arabic morphotactics. The system, used in the Penn Arabic Treebank as well as in the Prague Dependency Treebank projects, is iteratively refined according to real corpus evidence and comments from annotators.

5.1 Ambiguity and Tokenization Problems

The orthographical conventions of ignoring diacritics in writing and of tying words together increase the number of interpretations of a string in an extraordinary way.
There are, of course, also ambiguities caused by morphonological transformations applying widely to weak Arabic consonants, or by other systematic or incidental language tricks.

Examples shall support such claims. Table 4 summarizes all existing readings of the string fhm. The first column identifies the solutions for reference, the last two provide the full Arabic forms and their translations into English. Explicit linguistic information rests in the analysis strings, the format of which comes from the XAMA tool but has been modified to seem more intuitive. Besides, five solutions in the table were inferred from other sources (Wehr 1974), hence the XAMA+ heading.

ID  XAMA+ String ~ [Root&Pattern]+Morpheme+Label                      Full Form   Gloss
 1  [fhm&CaCiC]+Verb+FormI+Perf+Act+a+3P+Masc+Sg                      fahima      he understood
 2  [fhm&CuCiC]+Verb+FormI+Perf+Pass+a+3P+Masc+Sg                     fuhima      he was understood
 3  [fhm&CaC~aC]+Verb+FormII+Perf+Act+a+3P+Masc+Sg                    fahhama     he made understand
 4  [fhm&CuC~iC]+Verb+FormII+Perf+Pass+a+3P+Masc+Sg                   fuhhima     he was made to understand
 5  [fhm&CaC~iC]+Verb+FormII+Impv+o+Masc+Sg                           fahhim      make [sg.m.] understand
 6  [fhm&CaCoC]+Noun+N+Indef+Nom                                      fahmuN      understanding [1.indef.]
 7  [fhm&CaCoC]+Noun+K+Indef+Gen                                      fahmiN      understanding [2.indef.]
 8  [fhm&CaCoC]+Noun+u+Def+Nom                                        fahmu       understanding [1.] to/of
 9  [fhm&CaCoC]+Noun+i+Def+Gen                                        fahmi       understanding [2.] to/of
10  [fhm&CaCoC]+Noun+a+Def+Acc                                        fahma       understanding [4.] to/of
11  fa+Conj+[hmm&CaCaC]+Verb+FormI+Perf+Act+a+3P+Masc+Sg              fa-hamma    so he commenced
12  fa+Conj+[hmm&CuCiC]+Verb+FormI+Perf+Pass+a+3P+Masc+Sg             fa-humma    so he was commenced
13  fa+Conj+[hmm&{uCoCuC]+Verb+FormI+Impv+i+Masc+Sg                   fa-hummi    so commence [sg.m.]
14  fa+Conj+[hmm&CaCoC]+Noun+N+Indef+Nom                              fa-hammuN   and interest [1.indef.]
15  fa+Conj+[hmm&CaCoC]+Noun+K+Indef+Gen                              fa-hammiN   and interest [2.indef.]
16  fa+Conj+[hmm&CaCoC]+Noun+u+Def+Nom                                fa-hammu    and interest [1.] in/of
17  fa+Conj+[hmm&CaCoC]+Noun+i+Def+Gen                                fa-hammi    and interest [2.] in/of
18  fa+Conj+[hmm&CaCoC]+Noun+a+Def+Acc                                fa-hamma    and interest [4.] in/of
19  fa+Conj+[hym&{iCoCiC]+Verb+FormI+Impv+o+Masc+Sg                   fa-him      so be [sg.m.] in love
20  fa+Conj+[whm&{iCoCiC]+Verb+FormI+Impv+o+Masc+Sg                   fa-him      so imagine [sg.m.]
21  fa+Conj+hum+Funcwa                                                fa-hum      and they [pl.m.an.]
22  [wfy&{iCoCiC]+Verb+FormI+Impv+o+Masc+Sg+hum+Pron+DO+3P+Masc+Pl    fi-him      fulfil [sg.m.] them [pl.m.an.]

Table 4: Possible readings of the fhm string. Notice the clustering of solutions (1,2), (3–5), (6–10), (11–13), (14–18), (19), (20), (21), (22) which groups together words of the same lexical unit. Derivations of imperatives 13, 19, 20 and 22 from the canonical forms of the verbs are quite adventurous for an Arabic grammarian. So adventurous that the {uCoCuC or {iCoCiC patterns no longer appear on the surface. Nonetheless, all of the transformations are present and frequent in today’s Arabic.

The ordering of the solutions tries to reflect their lexical relationship, thus splitting the list into nine disparate clusters. While the top analyses treat the input string as one word, later on, two separable tokens, varied in nature, are identified. This is why morphological disambiguation is a prerequisite to operations on the analytical level.

Partly by chance, partly by the power of the word-merging phenomenon, even the notorious example of the ktb sequence can be tokenized in two ways, and read in seventeen. Table 5 gives the explanation.
ID  XAMA+ String ~ [Root&Pattern]+Morpheme+Label        Full Form   Gloss
 1  [ktb&CaCaC]+Verb+FormI+Perf+Act+a+3P+Masc+Sg        kataba      he wrote
 2  [ktb&CuCiC]+Verb+FormI+Perf+Pass+a+3P+Masc+Sg       kutiba      he was written
 3  [ktb&CaC~aC]+Verb+FormII+Perf+Act+a+3P+Masc+Sg      kattaba     he made write
 4  [ktb&CuC~iC]+Verb+FormII+Perf+Pass+a+3P+Masc+Sg     kuttiba     he was made to write
 5  [ktb&CaC~iC]+Verb+FormII+Impv+o+Masc+Sg             kattib      make [sg.m.] write
 6  [ktb&CuCuC]+Noun+N+Indef+Nom                        kutubuN     books [1.indef.]
 7  [ktb&CuCuC]+Noun+K+Indef+Gen                        kutubiN     books [2.indef.]
 8  [ktb&CuCuC]+Noun+u+Def+Nom                          kutubu      books [1.] of
 9  [ktb&CuCuC]+Noun+i+Def+Gen                          kutubi      books [2.] of
10  [ktb&CuCuC]+Noun+a+Def+Acc                          kutuba      books [4.] of
11  [ktb&CaCoC]+Noun+N+Indef+Nom                        katbuN      writing up [1.indef.]
12  [ktb&CaCoC]+Noun+K+Indef+Gen                        katbiN      writing up [2.indef.]
13  [ktb&CaCoC]+Noun+u+Def+Nom                          katbu       writing up [1.] of
14  [ktb&CaCoC]+Noun+i+Def+Gen                          katbi       writing up [2.] of
15  [ktb&CaCoC]+Noun+a+Def+Acc                          katba       writing up [4.] of
16  ka+Prep+[tbb&CaCoC]+Noun+K+Indef+Gen                ka-tabbiN   like destruction [2.indef.]
17  ka+Prep+[tbb&CaCoC]+Noun+i+Def+Gen                  ka-tabbi    like destruction [2.] of

Table 5: Possible readings of the ktb string. The lexical clustering gives (1,2), (3–5), (6–10), (11–15) as related units of the same root, while (16,17) resolve a prefixed preposition and a root of a totally different semantic content.

5.2 Lemma and Positional Tag in Arabic

Disambiguation of a set of morphological analyses applicable to a string in question does not only yield tokens for the upper levels of linguistic description, but in itself relates the tokens (word-forms) to their actual canonical forms (lemmas). If a word-form derives from its lemma by taking on certain morphological properties, then these are to be revealed by the analysis and expressed in some labelling (tag).
Token       XAMA+ Tokenized String                                            Positional Tag   Lemma
fa-         fa+Conj                                                           P--------        fa_
sa-         sa+Fut                                                            P--------        sa_
yaktubUna   ya+VPref+[ktb&CoCuC]+Verb+FormI+Imperf+Act+Una+Indic+3P+Masc+Pl   VI1-AMP--3       [ktb&CoCuC]
-hA         hA+Pron+DO+3P+Fem+Sg                                              NZ---FS4-3       _hA
.sabA.ha    [SbH&CaCAC]+Noun+a+Def+Acc                                        NO---MS4--       [SbH&CaCAC]
al-.gadi    {al+Art+gad+Stem+i+Def+Gen                                        NO---MS2D-       gad

Token       Interpretation
fa-         particle of consequence
sa-         particle of future tense
yaktubUna   verb of the 1st stem in indicative, active, masculine plural, 3rd person
-hA         pronoun, feminine singular, accusative, 3rd person
.sabA.ha    non-derived noun, masculine singular in accusative
al-.gadi    non-derived noun, masculine singular in genitive, prefixed definite article

Table 6: Review of the approach to morphological analysis. Three input strings fsyktbwnhA SbAH Algd are disambiguated into six tokens, literally so will write [pl.m.an.] it/them [sg.f.] morning [acc.] the-tomorrow [gen.]. In the table, tokens use ArabTeX notation, lemmas keep to Buckwalter’s style. Positional tags are shown and discussed next. Some dictionaries associate words like .gaduN tomorrow with the root&pattern lemma (in our case [gdw&CaCxX]) rather than just isolating the stem (i.e. gad).

Unfortunately, neither of our analyzers declares clearly what a lemma is. Having so far defined lemmas as equal to stems, we lack the means to unite singulars and broken plurals (mismatching patterns) under one lexical unit, unless we duplicate the work of an analyzer and build some coupling lexicons. The same kind of problem, though not so desperate, arises with the perfect and imperfect patterns of a verb, too. Tim Buckwalter’s Analyzer is, however, expected to undergo improvements in this regard.

As to the format of a tag, the XAMA-like output is intelligible, yet somewhat ineffective in terms of automated processing. The information may be recorded as a bit vector in which mutually exclusive values of a morphological category map into a fixed position.
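The idea of a fixed position per morphological category can be sketched as follows. The position names are hypothetical: they are guessed only from the 10-character tags and interpretations in Table 6, the fourth position could not be identified and is called `slot4`, and the actual tag set may assign the positions differently.

```python
# Hypothetical category names, one per tag position (guessed from Table 6).
POSITIONS = ("pos", "subpos", "form", "slot4", "voice",
             "gender", "number", "case", "definiteness", "person")


def decode(tag: str) -> dict[str, str]:
    """Turn a positional tag into a category -> value mapping, skipping '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


def encode(features: dict[str, str]) -> str:
    """Turn a category -> value mapping back into a positional tag."""
    unknown = set(features) - set(POSITIONS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return "".join(features.get(name, "-") for name in POSITIONS)


if __name__ == "__main__":
    print(decode("NO---MS2D-"))          # noun, masculine singular, genitive, definite
    print(encode(decode("VI1-AMP--3")))  # round trip restores the tag
```

Because each category has its own slot, a query such as "all genitive tokens" becomes a test on a single character instead of a search through a string of labels.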
The system of positional tags for Czech (cf. Hajič 2002) inspired a preliminary design of such a scheme for Arabic.

6 Selected Syntactic Structures

The principles which we follow on the analytical level are strongly influenced by the conclusions reached for the representation of Czech (cf., e.g., Böhmová et al. 2001). As both Czech and Arabic are languages with rich inflection and a relatively free word order, many of the solutions for Czech are also applicable to Arabic. However, it is obvious that in Arabic we will find phenomena that do not fit into the guidelines drawn up in the Czech annotation manual. Some of these will be treated here, namely:

• non-verbal predication in Arabic,
• co-reference matching,
• verbal characteristics of certain nominal formations,
• figura etymologica.

6.1 Non-verbal Predication

Besides the standard type of predication expressed by a verb, Arabic possesses a number of other types of predication. This set of predication types is traditionally grouped under the heading of “nominal sentence”. However, the set contains several types of predication that do not fit easily under such a heading. Therefore, we will distinguish between verbal predication and other types, which we label as non-verbal predication. As predication expressed by a verb presents no crucial problems for the dependency type of syntactic representation, we will not treat it here. The main focus here will be non-verbal predication, which can be divided as follows:

• pure nominal sentence,
• “clausal” (conjunctional) predication,
• impersonal predication with a prepositional phrase (locative and possessive constructions),
• existential predication.

From the point of view of the dependency approach, a verb is the governing node of the sentence. As no verb is used in these sentences, we transfer this role to the highest node of the predicate, which then becomes the highest node of the sentence.
Such a solution is quite smooth and expectable in case of a nominal sentence (Example 1 on the next page). The nominal predicate (labelled as Pnom) without a verbal conjecture can be found in several other languages (e.g., Russian), and can easily take over the role of the governing node. A slightly different picture occurs if the (nominal) predicate is represented by a clause (Example 2 therein), because in such a case, the sentence is governed by a conjunction as its highest node. Following the principle given above, it is the conjunction that has to receive the role of the predicate in the tree structure (labelled as PredC), while Pnom prepends to the particular function of the head of the clause. Otakar Smrž and Petr Zemánek: Sherds from an Arabic Treebanking Mosaic (1) al-baytu kabi~run. (2) al-mas’alatu anna ... the-house big. the-problem that … The house is big. The problem is that … kabi~run [Pnom] anna [PredC] al-baytu [Sb] al-mas’alatu [Sb] ??? [Pnom_???] The next two types of sentences already have their traditional solution on the surface syntax level (especially in the Arabic environment), where the prepositional phrase could be perceived as the subject. However, in these cases, there is no explicit preference about the predicate, and a choice has to be made which part of the sentence will assume that role. For such cases, we have decided to suggest a solution that is closer to the underlying, tectogrammatical level. The first type, impersonal predication with a prepositional phrase, can be illustrated by the two following sentences, expressing locative and possessive types of constructions: (3) fi~ al-bayti na~fiVatun. (4) la- -hu baytun. in the-house[gen.] window. for him a-house[nom.]. There is a window in the house. He has a house. 
Analytical functions for Examples 3 and 4:

(3) fi~ [PredP], al-bayti [Adv], na~fiVatun [Sb]
(4) la- [PredP], -hu [Obj], baytun [Sb]

As pointed out above, some hold that the first (prepositional) part of the sentence should be treated as subject and the second part (na~fiVatun, baytun) as predicate. However, other concepts have appeared in linguistic theory. In deciding on our own approach, we were inspired by the work of Freeze (1992), who argues that locative and possessive constructions have the same manifestation on deeper syntactic levels, and both these constructions are treated as predicative. When this point of view is adopted, we also have to change the manifestation on the surface level, where the roles of the parts of the sentence are exchanged and the role of the predicate is played by the prepositional phrase. Then, as the dependency government in Arabic is respected, the preposition becomes the head of the predicate. In order to distinguish it properly from other sentence structures, we label it with PredP. The difference between the locative and possessive constructions is expressed by the function right after the preposition (Adv in Example 3, Obj in Example 4). Although this solution may seem to contradict the traditional approach, evidence in favour of the prepositional phrase as predicate is given already in one of the most classical and respected grammars (cf. Wright 1875, esp. 271–276) and has also been found recently (Moutaouakil 1989:87).

The other type, which we call existential predication, is represented by Examples 5 and 6. For Freeze (1992), there are two types of existential sentences on the surface level: those with a locative-phrase subject, and the proform existential. Arabic (which he also treated in his study) would fall into the second group, the proform being locative, both lexically and syntactically, and thus nonsubjective.

(5) huna~ka baytun.
    there a-house[nom.]
    There is a house.
    Analytical functions: huna~ka [PredE], baytun [Sb]

(6) la~ $akka anna ...
    no doubt[acc.] that …
    There is no doubt that ...
    Analytical functions: la~ [PredE], $akka [Sb], anna [AuxC]

This led us to a solution which is in principle identical with the one outlined above, i.e. we ascribe the predicative role to the existential part of the sentence, which is huna~ka and la~ respectively. As this construction differs somewhat from the other types of predication, we label it with a distinct function, PredE. Note that in Example 6, the head of the clause coming after anna would be annotated in the manner of Example 2.

6.2 Co-reference Matching

As in other languages, pronouns are not trivial to associate with the entity they represent. Resolving these bonds is important for a true interpretation of the text on the tectogrammatical level. In annotation, the appropriate co-references are linked by means of the so-called “lines across the graph”, pointing from the pronouns to the expression being substituted. There are, however, pronouns whose match need not be marked explicitly since it follows clearly from the syntactic structure (i.e. cases of grammatical co-reference). In relative clauses, attributive pseudo-clauses or when anteposition takes place, it is enough to attach a suffix _Ref to the analytical function of such a pronoun, implying that a certain algorithm will be applied to determine the node to which the referential pronoun corresponds. For example, in Arabic relative clauses, we find a very explicit expression of the traces left after the movement of an element. A referential pronoun can be found at such places, the only exception being when the movement concerns the subject of the relative clause. Our approach is shown in Example 7. Other kinds of traces and links between nodes are treated below in Sections 6.3 and 6.4.

(7) Earaftu mar’atan aTrada -ha~ zawju -ha~.
    I-knew a-woman[acc.] chased-away her[acc.] husband[nom.] her[gen.].
    I knew a woman whose husband chased her away.
Analytical functions for Example 7: Earaftu [Pred], mar’atan [Obj], aTrada [Atr], -ha~ [Obj_Ref], zawju [Sb], -ha~ [Atr_Ref]

6.3 Verbal Characteristics of Certain Nominal Formations

The Arabic word-derivational system depends very heavily on the verbal system. The purely nominal part of the Arabic lexicon is relatively small, and most words in Arabic are generated by the verbal system (as verbal nouns and active/passive participles). These deverbatives can preserve both nominal and verbal syntactic attributes. This can sometimes lead to a special sort of construction, where words traditionally classified as nominal in nature exert a verbal type of government over some part of the sentence. This fact does not change the structure of the dependency tree on the analytical level, but it substantially changes the usual picture of relations between the analytical functions and the morphological attributes of nodes. Below, we give some examples that cover the following types:

• verbal noun with predicative function (Example 8),
• participle with predicative function (Example 9),
• sequence of such nominal forms (Example 10).

(8) da~ma iqtira~Hu -hu al-Eamali~yata Eala~ zumala~’i -hi sa~Eatayni.
    lasted proposal his the-operation[acc.] on colleagues his two-hours[acc.].
    It took two hours when he proposed the operation to his colleagues.
    Analytical functions: da~ma [Pred], iqtira~Hu [Sb], -hu [Atr], sa~Eatayni [Adv], al-Eamali~yata [Obj], Eala~ [AuxP], zumala~’i [Obj], -hi [Atr]

(9) al-mu’tamaru al-muqarraru Eaqdu -hu
    the-congress the-decided convening its
    congress whose convention is decided
    Analytical functions: al-mu’tamaru [???], al-muqarraru [Atr], Eaqdu [Sb], -hu [Atr_Ref]

(10) a~mili~na qubu~la -kum daEwata -na~
     hoping accepting[acc.] your invitation[acc.] our
     hoping that you will accept our invitation
     Analytical functions: a~mili~na [???], qubu~la [Obj], -kum [Atr], daEwata [Obj], -na~ [Atr]

In Example 9, we find a non-clausal construction resulting from a transformation of a relative clause into a nominal attributive phrase, which we call an attributive pseudo-clause. The referential pronouns may nevertheless be dealt with as if they occurred in a proper relative clause.

6.4 Figura Etymologica

As an intensifying construction, Arabic can use, and often does, the so-called accusative of the inner object, where the verbal noun of the same root as the verb is used. The verbal noun can also be part of an attributive phrase, and then replaces what might be an adverbial of manner in English.

(11) Daraba -hu Darban kabi~ran.
     he-hit him hitting[acc.] big.
     He gave him a big blow.
     Analytical functions: Daraba [Pred], -hu [Obj], Darban [Adv_Msd], kabi~ran [Atr]

(12) naxa~fu min -hu a$adda xawfin.
     we-fear from him/it strongest[acc.] fear.
     We are afraid of him/it the most.
     Analytical functions: naxa~fu [Pred], min [AuxP], -hu [Obj], a$adda [Adv], xawfin [Atr_Msd]

In such cases, there is a need to indicate the semantic adherence of a masdar (verbal noun) to its verb, in order to avoid possible mechanical translations (“He hit him big hitting.”). Therefore, all the words keep their usual positions in the tree, as do their analytical functions, and the masdar receives a suffix _Msd on its analytical function. The “line across the graph” goes from the verbal noun to the nearest verb above it. Unlike in co-reference matching, the relation between the two nodes is now semantic rather than syntactic.

7 Conclusion and Perspectives

This short overview cannot list all the problems and interesting cases which our team have encountered when working on the Prague Arabic Dependency Treebank. However, we hope that the points mentioned here give an idea of the issues involved in describing Arabic on the morphological and analytical levels.
Apart from the annotation procedure and the design of the guidelines for the tectogrammatical level, the team pursues problems concerning automation and pre-processing. These involve preparatory tree building and function/functor assignment based on a set of observed rules, transformation of phrase-structure trees into dependency trees (cf. Žabokrtský and Kučerová 2002), and automated assignment of case and mood endings (in analogy to Žabokrtský et al. 2002).

8 Acknowledgements

The results presented here would not have been achieved without the fruitful collaboration of the following colleagues: Jan Hajič, Ivona Kučerová, Jan Šnaidauf, Ondřej Beránek, Petr Pajas, Monika Kolbová, Martin Špáta, Pavel Ťupek, Jiří Hana and Daniel Zeman. The research reflected in this paper was supported by the Ministry of Education of the CR, projects LN00A063 and MSM113200006, and also by the Czech Grant Agency, GACR 405/02/0823.

References

Beesley, K. R. (1999): Arabic Stem Morphotactics via Finite-State Intersection. Benmamoun, E. (ed.): Perspectives on Arabic Linguistics XII. Papers from the Twelfth Annual Symposium on Arabic Linguistics. Benjamins, Amsterdam, pp. 85–99.

Beesley, K. R. (2001): Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001. Association for Computational Linguistics. 39th Annual Meeting and 10th Conference of the European Chapter. Workshop Proceedings on Arabic Language Processing: Status and Prospects, July 6th 2001. CNRS – Institut de Recherche en Informatique de Toulouse, and Université des Sciences Sociales, Toulouse, France, pp. 1–8.

Böhmová, A. – Hajič, J. – Hajičová, E. – Hladká, B. (2001): The Prague Dependency Treebank: A Three-Level Annotation Scenario. Abeille, A. (ed.): Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers. In press.

Brants, T. – Chen, F. – Farahat, A. (2002): Arabic Document Topic Analysis. Proceedings of the Post Workshop of LREC 2002 on Arabic Language Resources and Evaluation: Status and Prospects. ELRA, Las Palmas de Gran Canaria.

Cavalli-Sforza, V. – Soudi, A. – Mitamura, T. (2000): Arabic Morphology Generation Using a Concatenative Strategy. Proceedings of NAACL 2000, Seattle, WA, pp. 86–93.

Faber, A. (1997): Genealogical Subgrouping of the Semitic Languages. Hetzron, R. (ed.): The Semitic Languages, Routledge, London, pp. 1–15.

Freeze, R. (1992): Existential and Other Locatives. Language 68, No. 3, 1992, pp. 553–595.

Hajič, J. (2002): Disambiguation of Rich Inflection (Computational Morphology of Czech). Habilitation Thesis, Charles University in Prague, Faculty of Mathematics and Physics. Karolinum, Charles University Press, Prague, 334 p.

Kiraz, G. A. (1998): Arabic Computational Morphology in the West. Proceedings of the 6th International Conference and Exhibition on Multi-lingual Computing, Cambridge. http://www.belllabs.com/project/tts/icemco-98.ps

Kiraz, G. A. (1999): Computational Tool for Developing Morphophonological Models for Arabic. Benmamoun, E. (ed.): Perspectives on Arabic Linguistics XII. Papers from the Twelfth Annual Symposium on Arabic Linguistics, Benjamins, Amsterdam, pp. 101–110.

Kiraz, G. A. (2000): Multi-Tiered Nonlinear Morphology Using Multi-Tape Finite Automata: A Case Study on Syriac and Arabic. Computational Linguistics 26 (1), pp. 77–105.

Lagally, K. (1999): ArabTeX: A System for Typesetting Arabic. User Manual Version 3.09. Technical Report 1998/09. Institut für Informatik, Universität Stuttgart.

Maamouri, M. – Cieri, C. (2002): Resources for Natural Language Processing at the Linguistic Data Consortium. Proceedings of the International Symposium on Processing of Arabic, April 18th–20th 2002. University of Manouba, Tunisia, pp. 125–146.

McCarthy, J. (1985): Formal Problems in Semitic Phonology and Morphology, Outstanding Dissertations in Linguistics Series, Garland Publishing, New York, 430 p.

Moutaouakil, A. (1989): Pragmatic Functions in a Functional Grammar of Arabic. Dordrecht – Providence, Foris, 156 p.

Oard, D. W. – Gey, F. C. – Dorr, B. J. (2002): Evaluating Arabic Retrieval from English or French Queries: The TREC-2001 Cross-Language Information Retrieval Track. Proceedings of the Post Workshop of LREC 2002 on Arabic Language Resources and Evaluation: Status and Prospects. ELRA, Las Palmas de Gran Canaria.

Sawaf, H. – Zaplo, J. – Ney, H. (2001): Statistical Classification Methods for Arabic News Articles. Association for Computational Linguistics. 39th Annual Meeting and 10th Conference of the European Chapter. Workshop Proceedings on Arabic Language Processing: Status and Prospects, July 6th 2001. CNRS – Institut de Recherche en Informatique de Toulouse, and Université des Sciences Sociales, Toulouse, France, pp. 127–132.

Smrž, O. – Šnaidauf, J. – Zemánek, P. (2002): Prague Dependency Treebank for Arabic: Multi-Level Annotation of Arabic Corpus. Proceedings of the International Symposium on Processing of Arabic, April 18th–20th 2002. University of Manouba, Tunisia, pp. 147–155.

Soudi, A. – Cavalli-Sforza, V. – Jamari, A. (2001): A Computational Lexeme-Based Treatment of Arabic Morphology. Association for Computational Linguistics. 39th Annual Meeting and 10th Conference of the European Chapter. Workshop Proceedings on Arabic Language Processing: Status and Prospects, July 6th 2001. CNRS – Institut de Recherche en Informatique de Toulouse, and Université des Sciences Sociales, Toulouse, France, pp. 155–162.

Wehr, H. (1974): A Dictionary of Modern Written Arabic. Arabic-English. Wiesbaden, Harrassowitz, 1110 p.

Wright, W. (1875): A Grammar of the Arabic Language. Vol. II. London, F. Norgate, 483 p.

Zemánek, P. (2001): CLARA (Corpus Linguae Arabicae): An Overview. Association for Computational Linguistics. 39th Annual Meeting and 10th Conference of the European Chapter. Workshop Proceedings on Arabic Language Processing: Status and Prospects, July 6th 2001. CNRS – Institut de Recherche en Informatique de Toulouse, and Université des Sciences Sociales, Toulouse, France, pp. 111–112.

Žabokrtský, Z. – Kučerová, I. (2002): Transforming Penn Treebank Phrase Trees into (Praguian) Tectogrammatical Dependency Trees. PBML 78, Charles University, Prague.

Žabokrtský, Z. – Sgall, P. – Džeroski, S. (2002): A Machine Learning Approach to Automatic Functor Assignment in the Prague Dependency Treebank. Proceedings of LREC 2002, ELRA, Las Palmas de Gran Canaria, pp. 1513–1520.