The System of Register Labels in plWordNet

Stan Szpakowicz

2015, Cognitive Studies | Études cognitives

COGNITIVE STUDIES | ÉTUDES COGNITIVES, 15: 161–175
Warsaw 2015
DOI: 10.11649/cs.2015.013

MAREK MAZIARZ¹,A, MACIEJ PIASECKI¹,B, & STAN SZPAKOWICZ²,C

¹ Department of Computational Intelligence, Wrocław University of Technology, Poland
² School of Electrical Engineering & Computer Science, University of Ottawa, Canada
A [email protected]; B [email protected]; C [email protected]

THE SYSTEM OF REGISTER LABELS IN PLWORDNET

Abstract

Stylistic registers influence word usage. Both traditional dictionaries and wordnets assign lexical units to registers, and there is a wide range of solutions. A system of register labels can be flat or hierarchical, with few labels or many, homogeneous or decomposed into sets of elementary features. We review the register label systems in lexicography, and then discuss our model, designed for plWordNet, a large wordnet for Polish. There follows a detailed comparative analysis of several register systems in Polish lexical resources. We also present the practical effect of the adoption of our flat, small and homogeneous system: a relatively high consistency of register assignment in plWordNet, as measured by inter-annotator agreement on a manageable sample. Large-scale conclusions for the whole plWordNet remain to be made once the annotation has been completed, but the experience half-way through this labour-intensive exercise is very encouraging.

Keywords: wordnets; plWordNet; lexical register; large-scale wordnet expansion; inter-annotator agreement

1. Introduction

Like many other wordnets, plWordNet is a lexical-semantic network which describes lexical meaning, represented by lexical units,¹ in terms of such lexico-semantic relations as, e.g., hypernymy, hyponymy, meronymy, antonymy, cause and precedence. A wordnet implements the relational paradigm of lexical semantics. LUs are the nodes in a network, i.e., a graph, and the relations define the arcs between pairs of LUs.
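To make the node-and-arc view concrete, here is a minimal Python sketch. It stores a handful of LUs that this paper uses as examples (fura 4, gablota 2, samochód osobowy 1, samochód 1, bagażnik 1) and follows their links. The relation names `hyponym_of` and `meronym_of`, the `link` helper and the dictionary layout are illustrative assumptions, not plWordNet's actual data model.

```python
# Arcs of a toy wordnet: (source LU, relation) -> list of target LUs.
# Relation names are hypothetical labels chosen for this sketch.
arcs = {}

def link(source, relation, target):
    """Add one directed arc between two LUs."""
    arcs.setdefault((source, relation), []).append(target)

def hypernyms(lu):
    """All LUs reachable from `lu` along hyponym_of arcs, nearest first."""
    found, queue = [], [lu]
    while queue:
        node = queue.pop(0)
        for parent in arcs.get((node, "hyponym_of"), []):
            if parent not in found:
                found.append(parent)
                queue.append(parent)
    return found

def parts(lu):
    """Meronyms attached to `lu` itself or to any of its hypernyms."""
    wholes = [lu] + hypernyms(lu)
    return [source for (source, relation), targets in arcs.items()
            if relation == "meronym_of" and any(t in wholes for t in targets)]

# Links taken from the paper's own examples.
link("fura 4", "hyponym_of", "samochód osobowy 1")
link("gablota 2", "hyponym_of", "samochód osobowy 1")
link("samochód osobowy 1", "hyponym_of", "samochód 1")
link("bagażnik 1", "meronym_of", "samochód 1")

print(hypernyms("fura 4"))
print(parts("fura 4"))
```

The two queries mirror what the paper reads off the graph: an LU is described by what it is a kind of and by the parts it inherits from its hypernyms.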
The network structure is meant to be the principal means for the description of the LUs. Every LU u is characterised by its direct links to other LUs, which are in turn linked to further LUs (thus indirectly linked to u), and so on. The wordnet, therefore, describes u by a graph around it, part of the complete network, and this graph imposes restrictions on the meaning of u. For example:

• fura 4 ‘≈ (informal) a good car’ is a hyponym of samochód osobowy 1 ‘a car’, which is a hyponym of samochód 1 ‘a motor vehicle’;
• bagażnik 1 ‘a luggage compartment’ is a meronym of samochód 1;
• gablota 2 ‘≈ (informal) an expensive large car’ is a hyponym of samochód osobowy 1;
• and so on.

¹ The term lexical unit will be abbreviated to LU throughout this paper.

From the links for ‘≈ (informal) a good car’ we can learn that it is a kind of car (which is a kind of vehicle and so on) and that it can have parts such as a luggage compartment. We notice, however, that it is a partial description: it does not provide, e.g., a detailed description of situations in which a car can be used, who can drive it and so on. This is an intended effect, because a wordnet is a compromise between the formalisation and the coverage of the description. The wordnet is formalised enough for many applications in Natural Language Engineering, but at the same time its limited formalisation allows for relatively fast work on its construction. As a result, wordnets are among the largest lexical-semantic resources ever built. Their large size and wide coverage are important for their applications.

A hyponym, e.g., ‘≈ (informal) a good car’, is more specific than its hypernym ‘a car’, so the latter can be used in most contexts in which the former is used.
The semantic opposition expressed by hyponymy does not explain, however, why the hypernym can be used in all contexts, including formal documents, while the hyponym is more typical of private conversations or informal texts. This difference can be traced back to the different styles of writing, and cannot be described by lexico-semantic relations. That is because a style is not a lexical meaning, and cannot be an element of a wordnet, which is a lexical-semantic network. We need a different way of introducing limited pragmatic information into the description provided by plWordNet and by wordnets in general.

Our goal is to investigate the use of stylistic registers as a means for expressing selected pragmatic constraints on the lexical meaning described in a wordnet. We want to find the best way of introducing the registers into the wordnet structure, given that they are not relational by nature. We also want to develop a system of stylistic registers for Polish to assist the consistent construction of plWordNet and its future applications.

2. Register label systems in lexicography

Register is usually defined as a language variation stemming from the situational characteristics of a communication act. According to Biber (2006): “[Register is] any language variety defined by its situational characteristics, including the speaker’s purpose, the relationship between speaker and hearer, and the production circumstances”. Halliday (2002, p. 168) defines it thus: “Register [is a] functional (diatypic) variation in language.”

This language variety includes many aspects of communication, among them formality (e.g., formal style), text type (e.g., literary, poetic), medium (e.g., spoken), technicality (e.g., terminology, jargon), frequency (e.g., rare), time (e.g., old-use, archaic), attitude (e.g., vulgar, ironic), socio-cultural context (e.g., argot), normativity (e.g., non-standard) or place (e.g., dialect, American English (Hausmann, 1989)). People can shift between these and many other registers. This style-shifting is triggered by social pressure and requires a higher or lower “amount of attention paid to speech”, with spoken colloquial style demanding the least (Milroy & Gordon, 2003). Inability to shift styles in this way may be a sign of disorders such as autism (Lyons, 2013).

This extensive and multidimensional variability of language gathered under the umbrella term of register (and others, like style (Eckert & Rickford, 2001)) may evade precise definition. An example of such problems is the theoretical status of dialect. Dialects are often placed outside the register list, on the assumption that one cannot switch from one’s dialect to the general language or to another dialect in the same way as one moves from one register to another (Biber & Conrad, 2009, pp. 11–13; Gregory, 1967; Halliday, 2002, pp. 168–169). This common conviction appears to be contradicted by research on code-switching, which shows that a dialect can be switched in the same way as a style (DeBose, 1992; Trudgill, 1999), and leads to a different register list (Svensén, 2009).

Not only do lists of registers vary from one publication to another, but also the boundaries between register types are neither clear nor well established (Bowker, 2013, p. 48). Biber and Conrad (2009, pp. 32–33) claim that the situation is in a sense natural, since registers are organised hierarchically and form a continuum; in fact the granularity of register types depends on the researcher’s purpose, and on the scope of scientific analysis (Biber & Conrad, 2009). Halliday (2002, p. 169) describes it thus: “[Registers] are best thought of as spaces within which the speakers and writers are moving; spaces that may be defined with varying depth of focus (... the register of high school physics textbooks versus the register of natural science), and whose boundaries are in any case permeable, hence constantly changing and evolving.”

Multidimensional register systems are arranged into many scales with an unmarked/neutral central zone. For example, in the Routledge Dictionary of Lexicography we note the following scales (Hartmann & James, 2002, after Svensén, 2009):

• the emotiveness scale (“from ‘appreciative’ through neutral (the unmarked zone) to ‘derogatory’ and ‘offensive’ ”);
• the formality scale (“from ‘elevated’ and ‘formal’ through neutral (the unmarked zone) to ‘informal’ and ‘intimate’ ”);
• the frequency of occurrence scale (“ranging from ‘very frequent’ to frequent (the unmarked neutral zone) to ‘becoming rare’ and ‘very rare’ ”);
• the scale of indigenisation (“from ‘foreign’ and ‘borrowed’ through ‘assimilated’ to native (the unmarked neutral zone)”);
• the scale of textuality (“from ‘poetic’ to ‘conversational’, with the shared neutral items remaining unmarked”);
• the diatopic scale / continuum (“from ‘local’ or ‘provincial’ dialects to ‘metropolitan’ and even ‘international’ varieties”, “[t]he neutral zone of the ‘home’ variety (e.g., British English in a British dictionary or American English in an American dictionary) may be left unmarked”);
• the diastratic scale (“from neutral (the unmarked zone) to ‘demotic’ or ‘slang’ ”);
• the dianormative scale (“from ‘correct’ (the unmarked neutral zone) to ‘substandard’ or ‘illiterate’ ”).

The unmarked/neutral centre of all these scales is the general language (Atkins & Rundell, 2008, p. 498). All other registers are described as marked.

A specific register may be described with regard to some single feature (Biber, 1995, ch. 1.3.1), but real-world registers are fairly complex and must be decomposed in order to find out the underlying simple linguistic features (Maybin & Swann, 2009, pp. 64–65). Biber’s model, based on statistical analysis, includes five features (Biber & Conrad, 2009):

(i) ‘involved production’ ↔ ‘informational production’,
(ii) ‘narrative discourse’ ↔ ‘non-narrative discourse’,
(iii) ‘elaborated reference’ ↔ ‘situation-dependent reference’,
(iv) ‘overt expression of argumentation’,
(v) ‘impersonal style’ ↔ ‘non-impersonal style’.

Buttler and Markowski (1998) proposed an interesting three-dimensional model of lexical registers. Three scales were used: technicality (±t), formality (±f), and expressiveness/emotiveness (±e). Here is the structure of each of the six registers (Buttler & Markowski, 1998, p. 109):

• common [−t, −f, −e],
• literary [−t, +f, −e],
• colloquial [−t, −f, +e],
• terminological [+t, +f, −e],
• professional [+t, −f, −e],
• argot [+t, −f, +e].

Note that it is impossible in this model to combine the feature [+f] with [+e], so six rather than eight (2³) possibilities are realised.

Registers are “ways of saying different things” (Halliday, 2002, p. 169), and involve different vocabulary (Biber, 2006). To mark a register of a given word/sense, dictionaries use register labels (Svensén, 2009). Register label systems mirror register models, so the difficulties with precise register definitions become a problem for lexicography (Engelking, Markowski, & Weiss, 1989, p. 300). Indeed, not only is there no consensus on what register label system to adopt, but also the very same registers are marked inconsistently (Svensén, 2009, p. 316):

“Different dictionaries may use different labels, and the categories represented by the labels may have different ranges in different dictionaries. Moreover, there may be differences in labelling practice, so that, in one dictionary, fewer or more lexical items are regarded as formal or informal, correct or incorrect, etc., than in another one (Hausmann 1989: 650).”

It is not difficult to find such discrepancies in dictionaries. Let us compare the descriptions of the three most frequent senses of the word clone in Cambridge Dictionaries Online (CDO) (Heacock, 1995–2011) and in Oxford Dictionaries (OE) (Simpson, 2013):²

1. General register ‘a plant or animal that has the same genes as the original from which it was produced’ (CDO) / Biology ‘an organism or cell, or group of organisms or cells, produced asexually from one ancestor or stock, to which they are genetically identical’ (OE);
2. Informal ‘someone or something that is very similar to someone or something else’ (CDO) / General register ‘A person or thing regarded as an exact copy of another’ (OE);
3. Computing ‘a computer that operates in a very similar way to the one that it was copied from’ (CDO) / General register ‘a computer designed to simulate exactly the operation of another, typically more expensive, model’ (OE).

Clearly, the same state of affairs is present in Polish lexicography (Kurkiewicz, 2007, pp. 29–30; Engelking et al., 1989). In Dubisz (2006), for example, the register system includes over a hundred register labels organised hierarchically, while in Kurkiewicz (2007) the list is shorter. We prefer to keep the whole system simple. We agree with the editors of the Great Polish Dictionary that “it is better to give less information but base it on reasonably clear criteria” (Kurkiewicz, 2007, p. 30).
The next section presents a new system of register labels prepared for plWordNet: very small, well defined, non-hierarchical and with single labels rather than label sequences.

² Curiously, the dictionaries disagree on the register labels for all three senses, despite the proximity of Cambridge and Oxford...

3. A model of register labels in plWordNet

A higher number of stylistic registers allows for more fine-grained distinctions, but it makes assigning LUs to registers more difficult. Inconsistencies between the decisions of different linguists are likely. The similarities among registers are not apparent in a flat structure. A hierarchy of registers could be introduced in order to express generalisations over registers (e.g., specialist registers distinguished but grouped together), but such a solution would only be feasible if there were more registers. The question arises, then, whether a larger number of registers is really needed for plWordNet (or any wordnet, for that matter).

We aim to maintain high consistency in applying register labels to LUs, so we have decided to build our system on only 11 registers. In order to facilitate the process, the register labels have been arranged into a decision tree presented in Figure 1. A plWordNet editor, in a series of substitution tests, assesses the acceptability of the instances of test expressions. The tree guides her to the final choice of a lexical register label. We will show in Section 4 how this ascetic system of registers allows the editors to work with a fair degree of consistency.

Figure 1: The decision tree for register assignment. The tests for three emotive labels have been conflated.

LU incorrect? Y: non-standard; N: next test
obsolete? Y: obsolete; N: next test
regional? Y: regional; N: next test
used mainly by scientists? Y: terminological; N: next test
from a slang or argot? Y: slang/argot; N: next test
used only in literature? Y: literary; N: next test
used only in communication with state institutions? Y: official; N: next test
suitable for official situations?
    Y: unsuitable for common situations? Y: literary; N: general language
    N: emotionally marked? Y: vulgar / coarse / colloquial; N: general language

Registers can be represented, at least for the purpose of analysis, as bundles of primitive semantic features. We consider technicality (±t), formality (±f), three levels of expressiveness/emotiveness (according to Engelking et al. (1989): +++e, ++e, +e, −e), the status of the LU users’ community (is it open or closed, as in subcultures, ±s), an exclusively literary character of an LU (±l), the possibility of using an LU in everyday situations (±u), and a bureaucratic character of an LU (±b). The system we have designed includes the following registers:

• non-standard: we use this register label to mark incorrect but very frequent LUs;
• obsolete: this label marks LUs which are outdated, typically used only by the elderly or (rarely) middle-aged people, as well as in old literature;
• regional: LUs from a dialect, well known to (but not used by) almost all Poles;
• terminological [+t]: LUs used by specialists, scientists, engineers, and generally professionals;
• argot/slang [−t, +s]: LUs used by a particular closed social group or a small community;
• literary [−t, −s, +l], [−t, −s, −l, +f, −b, −u]: this label marks high-style vocabulary, especially LUs used only in literature or in speeches;
• official [−t, −s, −l, +f, +b]: LUs used on official and formal occasions, mainly in the communication between citizens and representatives of state institutions;³
• vulgar [−t, −s, −l, −f, +++e]: crude vocabulary, LUs with very restricted acceptable usage;
• coarse [−t, −s, −l, −f, ++e]: LUs which might be used in a familiar context, but are normally not acceptable in other situations;
• colloquial [−t, −s, −l, −f, +e]: vocabulary used informally, in a free style, but with low acceptability in official situations;
• general [−t, −s, −l, +f, −b, +u], [−t, −s, −l, −f, −e]: LUs which could be used in virtually every situation (are common within all styles).

³ Such language develops around any bureaucracy.

Registers in plWordNet have an important role in shaping the structure of the graph of lexico-semantic relations. In the case of LUs in different registers, we must consider the compatibility of their registers before linking them by a relation, e.g., hyponymy/hypernymy, and thus deciding how they are to be grouped into one synset. We follow three rules when we link LUs by hyponymy/hypernymy:

1. LU u1 in the register obsolete, regional, argot or non-standard may be a hypernym of LU u2 if and only if u2 is in exactly the same register.⁴
2. LU u1 in the register vulgar or coarse may be a hypernym of LU u2 if and only if u2 is either in vulgar or in coarse.⁵
3. The remaining registers may be linked by hyponymy without restrictions.

⁴ Each of these registers shows affinity only for itself.
⁵ There is affinity between vulgar and coarse.

Synonymy in plWordNet is captured as bidirectional hyponymy (Maziarz, Piasecki, & Szpakowicz, 2013), so very similar rules apply to synonymy as well; Table 1 shows the exceptions.

Table 1: Registers allowed in the same synset (+), and those not allowed (−).

                  coll.  gen.  lit.  off.  term.
colloquial          +     +     −     −     −
general             +     +     +     +     +
literary            −     +     +     +     +
official            −     +     +     +     +
terminological      −     +     +     +     +

4. Inter-annotator agreement and statistics

At the end of 2013, we constructed the first set of ten register labels. The set was tested and proven useful (Maziarz, Piasecki, Rudnicka, & Szpakowicz, 2014). We then added the 11th register, non-standard, for the LUs very frequent in Polish but assumed to be incorrect in normative dictionaries.

We also conducted a survey. Two of the plWordNet editors applied registers from our set to a random sample of 385 noun LUs taken from plWordNet.
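The procedure the two editors followed, and the linking rules of Section 3, can be sketched in a few lines of Python. This is one reading of Figure 1 and of the three hypernymy rules, not the authors' tooling: the test names, the `emotive_level` scale (0 to 3 for −e, +e, ++e, +++e) and both function names are hypothetical, and the last two branches of the tree are inferred from the feature bundles given for the literary and general registers.

```python
# Ordered yes/no tests from Figure 1; the first "yes" decides the label.
EARLY_TESTS = [
    ("incorrect", "non-standard"),
    ("obsolete", "obsolete"),
    ("regional", "regional"),
    ("used_mainly_by_scientists", "terminological"),
    ("from_slang_or_argot", "argot"),
    ("used_only_in_literature", "literary"),
    ("used_only_with_state_institutions", "official"),
]

def assign_register(answers):
    """Walk the tree; `answers` maps test names to values (missing = no)."""
    for test, label in EARLY_TESTS:
        if answers.get(test, False):
            return label
    if answers.get("suitable_for_official_situations", False):
        # Assumed branch: formal and unsuitable for common situations
        # gives literary, otherwise general language.
        if answers.get("unsuitable_for_common_situations", False):
            return "literary"
        return "general"
    # Not suitable for official situations: look at emotive marking.
    level = answers.get("emotive_level", 0)
    return {0: "general", 1: "colloquial", 2: "coarse", 3: "vulgar"}[level]

SELF_ONLY = {"obsolete", "regional", "argot", "non-standard"}  # rule 1
CRUDE = {"vulgar", "coarse"}                                   # rule 2

def may_be_hypernym(hypernym_register, hyponym_register):
    """Literal reading of the three hypernymy rules in Section 3."""
    if hypernym_register in SELF_ONLY:
        return hyponym_register == hypernym_register
    if hypernym_register in CRUDE:
        return hyponym_register in CRUDE
    return True  # rule 3: the remaining registers are unrestricted

print(assign_register({"regional": True}))
print(may_be_hypernym("general", "colloquial"))
```

Keeping the early tests in an ordered list mirrors the way the tree is meant to be used: an editor stops at the first test that succeeds, so every LU receives exactly one label.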
The editors were professional linguists, but they had not been trained in register label recognition; they took their guidelines from the decision tree. The distribution of their choices is presented in Table 2; it also shows the statistics of register usage in the newest version of plWordNet (the column ‘plWN 2015’).

The inter-annotator agreement was determined by Cohen’s kappa coefficient: the overall agreement was κ = 0.647 with the confidence interval 0.586–0.722.⁶ On the scale of Landis and Koch (1977, p. 165), this confidence interval spans two bands of agreement strength: moderate (0.41–0.60) and substantial (0.61–0.80). We also give kappa values for individual register labels. A generous rule of thumb in computational linguistics says that only κ ≥ 0.8 guarantees reliable results, and κ in 0.67–0.8 is tolerable.⁷ Our result sits at the lower border of the tolerable interval (in terms of the confidence interval).

⁶ The confidence interval was calculated by a simple percentile bootstrap method (DiCiccio & Efron, 1996; DiCiccio & Romano, 1988) suitable for Cohen’s κ (Artstein & Poesio, 2008), n = 10000 resamplings, α = 0.05.
⁷ Reidsma and Carletta (2007) show that this rule of thumb does not always work. Sometimes a lower κ makes the results reliable, sometimes even κ ≥ 0.8 does not suffice. That is why Maziarz et al. (2014) applied to the data a non-parametric test for independence. It showed that neither linguist had a bias. In this paper we also give κ for every category, as suggested in Reidsma and Carletta (2007), so as to inspect the behaviour of agreement across the registers.

Table 2: Inter-rater agreement of two annotators assigning register labels to nouns from plWordNet in 2013, and the frequencies of choices of linguists F #1 and F #2. The label non-standard was added in 2014. The column ‘plWN 2015’ contains data from the beginning of 2015.

marking label    Cohen’s κ   F #1    %      F #2    %      plWN 2015   %
terminology      0.78        162     42%    146     38%    52 164      59%
general          0.60        108     28%    113     29%    26 242      29%
literary         0.62        27      15%    33      16%    2 875       3%
colloquial       0.52        24      6%     44      11%    3 372       4%
obsolete         0.56        12      3%     9       2%     2 095       2%
coarse           0.49        9       2%     3       <1%    324         <1%
argot            0.60        5       1%     5       1%     520         <1%
official         −0.01       4       1%     1       <1%    494         <1%
regional         0.50        3       <1%    1       <1%    832         1%
vulgar           NA          0       0%     0       0%     65          <1%
non-standard     NA          0       0%     0       0%     57          <1%
overall          0.647       385     100%   385     100%   89 040      100%

As one can see, the agreement values between the two annotators were quite good for very frequent registers (terminology: κ = 0.78, general register: κ = 0.60, and literary: κ = 0.62), but lower for less frequent ones (colloquial: κ = 0.52, and obsolete: κ = 0.56).⁸

The confidence intervals would be narrower if we reduced the number of registers from 11 to 6, having gathered compatible registers into broader bins (see Table 3 and Maziarz et al. (2014)). By compatible we mean registers with similar definitions (Section 3) and close in the decision tree (Figure 1). After this reduction, the overall κ = 0.72 with a good confidence interval of κ ∈ (0.657, 0.785). Now all the most frequent registers have sufficiently good kappa values (terminology ∼ argot ∼ official: κ = 0.77, general ∼ literary ∼ colloquial: κ = 0.71).⁹

With this register labelling system, we began to annotate plWordNet systematically (the column ‘plWN 2015’ in Table 2). At the time of this writing, 55% of all noun LUs have been assigned registers. We were adding to plWordNet terminological multi-word LUs (mainly from the humanities, social sciences and biology), so the terminology register is overrepresented in the column ‘plWN 2015’.
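The agreement figures in this section rest on Cohen's κ with a percentile bootstrap interval, computed before and after merging compatible labels. The following Python sketch shows those three steps under stated assumptions: the function names, the bin names in `BINS` and the six toy annotations are invented for illustration and are not the paper's data or code.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n              # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)  # chance agreement
    if p_e == 1:  # degenerate case: both used one and the same label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

def bootstrap_ci(a, b, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    kappas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        kappas.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    kappas.sort()
    lower = kappas[int(alpha / 2 * n_resamples)]
    upper = kappas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Merging of compatible registers into broader bins, as in Table 3.
BINS = {
    "terminological": "term~argot~official", "argot": "term~argot~official",
    "official": "term~argot~official",
    "general": "gen~lit~coll", "literary": "gen~lit~coll",
    "colloquial": "gen~lit~coll",
    "vulgar": "vulgar~coarse", "coarse": "vulgar~coarse",
}

def merge(labels):
    """Map each label to its bin; labels without a bin stay unchanged."""
    return [BINS.get(label, label) for label in labels]

# Toy annotations, invented for illustration only.
f1 = ["general", "terminological", "literary", "colloquial", "general", "terminological"]
f2 = ["general", "terminological", "general", "colloquial", "literary", "terminological"]

print(cohen_kappa(f1, f2))
print(cohen_kappa(merge(f1), merge(f2)))
```

In the toy data the only disagreements are between general and literary, so merging them removes every disagreement. That is the effect the reduction from 11 to 6 labels relies on: disagreements between neighbouring registers disappear once the neighbours share a bin.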
Even such an unbalanced but very large sample, however, reproduces the main pattern visible in the smaller random sample (‘F #1’ and ‘F #2’): terminology is the most frequent register, followed by the general, literary, colloquial and obsolete. Other registers are very rare, summing at most to 2.6% in the ‘plWN 2015’ and ‘F #2’ samples and up to 5.5% in ‘F #1’.

⁸ Other registers were too rare to give meaningful values of κ (the confidence intervals were very broad), but we proved statistically that κ > 0 for all registers except official.
⁹ This result shows that disagreements are located in the close neighbourhood in our decision tree (since registers were combined according to their proximity in the tree).

Table 3: Inter-rater agreement of two annotators assigning register labels to nouns from plWordNet in 2013, and the frequencies of choices of linguists F #1 and F #2. The expanded five-label system equates compatible labels, as described in Maziarz et al. (2014). The label non-standard was added in 2014. The column ‘plWN 2015’ contains data from the beginning of 2015.

marking label                      Cohen’s κ   F #1    %      F #2    %      plWN 2015   %
terminology ∼ argot ∼ official     0.77        171     44%    152     40%    53 178      60%
general ∼ literary ∼ colloquial    0.71        190     49%    220     57%    32 489      36%
obsolete                           0.56        12      3%     9       2%     2 095       2%
vulgar ∼ coarse                    0.49        9       2%     3       <1%    324         <1%
regional                           0.50        3       <1%    1       <1%    832         1%
non-standard                       NA          0       0%     0       0%     57          <1%
overall                            0.72        385     100%   385     100%   89 040      100%

This high frequency of the terminology register is probably a common feature of large dictionaries. In the third volume of Doroszewski (1958–1962, letters H–K), terminology is the most frequent of all registers (Buttler & Markowski, 1998, pp. 110, 121):¹⁰ “First of all, scanty number of lexemes of all three types [i.e., general register – literary – colloquial] is striking as compared to the overall number of dictionary entries. It is settled by the huge amount of terminological and crypto-terminological units in lexical content of the dictionary.”

¹⁰ Note that Buttler & Markowski applied their own labels to many words from Doroszewski, according to their register model.

From Buttler & Markowski’s analysis of Doroszewski (1958–1962) we know that in the vocabulary recorded in this dictionary the second rank goes to the obsolete register (2460 occurrences, or 16%, in the 3rd volume). This is so because Doroszewski’s dictionary contains many words from the 19th century and the second half of the 18th century (Piotrowski, 2001, p. 86). (In comparison with this number, it is clear that plWordNet, with its 2% of old-use vocabulary, is a contemporary Polish dictionary par excellence.) Next in frequency are the general register (called common by Buttler & Markowski, only 546 occurrences, 371 nominal senses among them), colloquial (216 occurrences, 160 nominal senses) and literary (112, including 58 nominal senses). The proportions of the three lexical layers are shown in Figure 2.

Figure 2: Relative frequencies of three register labels (general, colloquial and literary) in plWordNet, and in Doroszewski (1958–1962) as analysed by Buttler and Markowski (1998). [Bar chart; vertical axis from 0% to 100%; one bar each for plWordNet and for Buttler & Markowski; series: general (common) register, colloquial, literary.]

Both from Buttler & Markowski and from plWordNet we get the same pattern: the most frequent of the three is the general register, followed by the colloquial and the literary. In Dubisz (2006), the terminology register is also the most common (Table 4, 50%), while the obsolete register is far less frequent (only 3%), as in plWordNet.¹¹ As we can see from Table 4, the literary, colloquial and general registers are the most frequent ones after terminology.
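The shares behind the comparison of the three registers can be recomputed from counts quoted in this section. The short Python sketch below does the arithmetic. It assumes that the plWordNet side uses the ‘plWN 2015’ counts from Table 2 and that the Buttler & Markowski side uses the occurrence counts quoted in the text; the variable and function names are illustrative.

```python
# Counts for the three registers; sources are assumptions stated above.
PLWORDNET = {"general": 26242, "colloquial": 3372, "literary": 2875}
BUTTLER_MARKOWSKI = {"general": 546, "colloquial": 216, "literary": 112}

def shares(counts):
    """Relative frequency of each register within the three-register total."""
    total = sum(counts.values())
    return {register: count / total for register, count in counts.items()}

def ranking(counts):
    """Registers ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)

for name, counts in [("plWordNet", PLWORDNET),
                     ("Buttler & Markowski", BUTTLER_MARKOWSKI)]:
    formatted = {r: f"{s:.1%}" for r, s in shares(counts).items()}
    print(name, ranking(counts), formatted)
```

Under these assumptions both resources give the same ranking, general before colloquial before literary, although the general register takes a larger share of the three in plWordNet than in Buttler & Markowski's counts.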
Table 4: Register frequencies in a small sample of 122 nouns from Dubisz (2006), 192 senses in total.

Register label    Frequency   %
terminological    95          50%
literary          28          14%
colloquial        26          14%
general           20          10%
argot             8           4%
obsolete          5           3%
official          5           3%
coarse            3           1%
regional          2           1%
sum               192         100%

¹¹ The sample was drawn randomly from the dictionary. In the sample of 122 nouns (192 senses) we found 74 unique labels, including 26 complex labels (25 twofold and 1 threefold). Of those 74 labels, 51 represent terminological subregisters, 7 colloquial, 6 argot, 4 literary, 2 the general register, 2 the regional register, and 1 each coarse and official. We have transformed the data into a simpler set, taking into account only the superordinate registers.

Putting aside the statistics of terminology and old-use vocabulary, we may focus on three registers which play an important role in the lexical system, i.e., the general (or common) register, the literary register and the colloquial register (Buttler & Markowski, 1998). The distribution of the registers is different in plWordNet and in Buttler & Markowski’s model, and that is due to the difference in definitions (Figure 3). Buttler and Markowski (1998) define the general register with the triple [−t, −f, −e] (Section 2), while in our decision tree (Section 3) the register gets the following feature configurations: [−t, −s, −l, +f, −b, +u], [−t, −s, −l, −f, −e]. Because of the semantic feature +f in the former set, the general register of plWordNet has a broader meaning than the common register of Buttler & Markowski. The authors estimate the total population of the common vocabulary at around 5000 LUs (ca. 500 LUs × 11 volumes) (Buttler & Markowski, 1998, p. 110). This is much less than in plWordNet: 26 000 in 55% of plWordNet’s vocabulary.

The colloquial registers also differ in Doroszewski (1958–1962) and plWordNet.
According to Buttler and Markowski (1998) the colloquial register receives the feature set [−t, −f, +e]. In plWordNet, the colloquial register is simply one of the three registers marked with emotiveness (together with vulgar and coarse). Since we single out three levels on the emotiveness scale, [+++e], [++e] and [+e], in this case the Buttler and Markowski register has a broader meaning than plWordNet’s colloquial.

The literary registers are defined as follows. Buttler and Markowski: [−t, +f, −e]; plWordNet: [−t, −s, +l], [−t, −s, −l, +f, −b, −u]. The definitions of the literary registers are also different (Figure 3), mainly because Buttler & Markowski’s model disallows the features [+e], [++e], [+++e] together with [+f].

Figure 3: Differences in the definitions of the general register and the colloquial and literary registers between Buttler and Markowski (1998) and plWordNet with regard to the register scales of formality {−f, +f} and emotiveness {−e, +e, ++e, +++e}. The plWordNet general register has a broader extension than the common register in Buttler & Markowski’s model, while their colloquial register is a superordinate term for colloquial, coarse and vulgar in plWordNet. Field 2 is a forbidden area in their model: that is why the literary registers have different definitions. All definitions from plWordNet were “translated” into the semantic description language of Buttler & Markowski; we had to project our multidimensional definitions onto a two-dimensional description in terms of formality and emotiveness.

        −e    +e    ++e   +++e
+f      1     2     2     2
−f      3     4a    4b    4c

Field   Buttler & Markowski (1998)   plWordNet
1       literary                     general & literary
2       null                         general & literary
3       general (common)             general
4a      colloquial                   colloquial
4b      colloquial                   coarse
4c      colloquial                   vulgar

5. Concluding remarks

We have proposed an innovative system of stylistic registers for plWordNet, a large Polish wordnet. The system has only 11 registers, is non-hierarchical and always assigns one label to an LU. We have designed a procedure which helps plWordNet editors assign a register label to a given LU. The procedure is summarised in a decision tree accompanied by substitution tests. The editors consult the complete guidelines online.¹² The register labels significantly affect the structure of plWordNet, because hyponymy/hypernymy and synonymy only link LUs whose registers show affinity for each other.

¹² http://tinyurl.com/plWN-registers

We have examined the consistency of the procedure and found it reasonable. We measure it as inter-annotator agreement, obtaining sufficiently high values of Cohen’s kappa. Bundling three groups of compatible labels gives a system with only six categories, and the kappa values for that system are even higher.

Finally, we have compared the statistics: plWordNet half-way through a complete annotation; the Universal Dictionary of Polish; and Buttler & Markowski’s model. The distribution of labels is fairly similar, but details differ due to the differences in the underlying register systems.

References

Artstein, R. & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34 (4), 555–596. http://dx.doi.org/10.1162/coli.07-034-R2
Atkins, B. T. S. & Rundell, M. (2008). The Oxford guide to practical lexicography. Oxford: Oxford University Press.
Biber, D. & Conrad, S. (2009). Register, genre, and style. Cambridge: Cambridge University Press. Retrieved from http://dx.doi.org/10.1017/CBO9780511814358
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins. Retrieved from http://dx.doi.org/10.1075/scl.23
Bowker, J. (2013).
Variation across spoken and written registers in internal corporate communication: Multimodality and blending in evolving genres. In J. Bamford, S. Cavalieri, & G. Diani (Eds.), Variation and change in spoken and written discourse (pp. 47–64). Amsterdam: John Benjamins. Retrieved from http://dx.doi.org/10.1075/ds.21.08bow
Buttler, D. & Markowski, A. (1998). Słownictwo wspólnoodmianowe, książkowe i potoczne współczesnej polszczyzny. Język a kultura, (1), 179–203.
DeBose, C. E. (1992). Codeswitching: Black English and standard English in the African-American linguistic repertoire. Journal of Multilingual and Multicultural Development, 13(1–2), 157–167. http://doi.org/10.1080/01434632.1992.9994489
DiCiccio, T. J. & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189–212. http://dx.doi.org/10.1214/ss/1032280214
DiCiccio, T. J. & Romano, J. P. (1988). A review of bootstrap confidence intervals. Journal of the Royal Statistical Society. Series B (Methodological), 50(3), 338–354.
Doroszewski, W. (Ed.). (1958–1962). Słownik języka polskiego (Vol. 3). Warszawa: PWN.
Dubisz, S. (2006). Wstęp. In S. Dubisz (Ed.), Uniwersalny słownik języka polskiego PWN. Wersja 3.0 [CD]. Warszawa: Wydawnictwo Naukowe PWN.
Eckert, P. & Rickford, J. (2001). Style and sociolinguistic variation. Cambridge: Cambridge University Press.
Engelking, A., Markowski, A., & Weiss, E. (1989). Kwalifikatory w słownikach – próba systematyzacji. Poradnik Językowy, (5), 300–309.
Gregory, M. (1967). Aspects of varieties differentiation. Journal of Linguistics, 3(02), 177–197. http://dx.doi.org/10.1017/S0022226700016601
Halliday, M. A. K. (2002). The construction of knowledge and value in the grammar of scientific discourse: With reference to Charles Darwin’s The origin of species (1990). In J. Webster (Ed.), Collected works of M.A.K. Halliday (Vol. 2: Linguistic studies of text and discourse, pp. 168–193). London: Continuum.
Hartmann, R. R. K. & James, G. (2002).
Dictionary of lexicography. London: Routledge.
Hausmann, F. J. (1989). Die Markierung im allgemeinen einsprachigen Wörterbuch: Eine Übersicht. In F. J. Hausmann, O. Reichmann, H. E. Wiegand, & L. Zgusta (Eds.), Wörterbücher: Ein internationales Handbuch zur Lexikographie (Vol. 5.1, pp. 649–657). New York: De Gruyter.
Heacock, P. (Ed.). (1995–2011). Cambridge dictionaries online. Cambridge: Cambridge University Press.
Kurkiewicz, J. (2007). Kwalifikatory w Wielkim słowniku języka polskiego. In P. Żmigrodzki & R. Przybylska (Eds.), Nowe studia leksykograficzne. Kraków: Wydawnictwo Lexis.
Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. http://dx.doi.org/10.2307/2529310
Lyons, M. (2013). Register variation. In F. R. Volkmar (Ed.), Encyclopedia of autism spectrum disorders (p. 2534). New York: Springer. http://www.springerlink.com/index/10.1007/978-1-4419-1698-3_983
Maybin, J. & Swann, J. (2009). The Routledge Companion to English Language Studies. New York: Routledge.
Maziarz, M., Piasecki, M., & Szpakowicz, S. (2013). The chicken-and-egg problem in wordnet design: Synonymy, synsets and constitutive relations. Language Resources and Evaluation, 47(3), 769–796. http://dx.doi.org/10.1007/s10579-012-9209-9
Maziarz, M., Piasecki, M., Rudnicka, E., & Szpakowicz, S. (2014). Registers in the system of semantic relations in plWordNet. In Proceedings of 7th International Global Wordnet Conference (pp. 330–337).
Milroy, L. & Gordon, J. (2003). Sociolinguistics: Method and interpretation. Cambridge, MA: Blackwell Publishing Ltd. Retrieved from http://dx.doi.org/10.1002/9780470758359
Piotrowski, T. (2001). Zrozumieć leksykografię. Warszawa: PWN.
Reidsma, D. & Carletta, J. (2007). Reliability measurement without limits. Computational Linguistics, 1(1), 1–8.
Simpson, J. (2013). Oxford English Dictionary. Oxford: Oxford University Press.
Retrieved from public.oed.com/
Svensén, B. (2009). A handbook of lexicography: The theory and practice of dictionary-making. New York: Cambridge University Press.
Trudgill, P. (1999). Standard English: What it isn’t. In T. Bex & R. J. Watts (Eds.), Standard English: The widening debate (pp. 117–128). London: Routledge.

Acknowledgment

This work was supported by a grant from the Polish Ministry of Science and Higher Education, a program in support of scientific units involved in the development of a European research infrastructure for the humanities and social sciences in the scope of the consortia CLARIN ERIC and ESS-ERIC, 2015–2016.

The authors declare that they have no competing interests.

The authors’ contribution was as follows: concept of the study: Marek Maziarz, Maciej Piasecki, Stan Szpakowicz; data analyses: Marek Maziarz, Maciej Piasecki, Stan Szpakowicz; the writing: Marek Maziarz, Maciej Piasecki, Stan Szpakowicz.

This is an Open Access article distributed under the terms of the Creative Commons Attribution 3.0 PL License (http://creativecommons.org/licenses/by/3.0/pl/), which permits redistribution, commercial and non-commercial, provided that the article is properly cited.

© The Authors 2015
Publisher: Institute of Slavic Studies, PAS, University of Silesia & The Slavic Foundation
