
Natural-Language Interpretation in Prolog

1994


Natural-Language Interpretation in Prolog

Björn Gambäck    Jussi Karlgren    Christer Samuelsson
[email protected]
Natural Language Processing Group
SICS — Swedish Institute of Computer Science, Box 1263, S–164 28 KISTA, Sweden

Abstract

This booklet introduces natural-language processing in general and the way it is presently carried out at SICS. The overall goal of any natural-language processing system is to translate an input utterance stated in a natural language (such as English or Swedish) to some type of computer-internal representation. Doing this requires theories for how to formalize the language and techniques for actually processing it on a machine. How this is done within the framework of the Prolog programming language is described in detail.

The SICS perspective on natural-language processing is that any theories regarding it are rather uninteresting unless put to practical use. Thus we are currently testing all our work within a particular large-scale NLP system called the Core Language Engine. In order to exemplify how natural language can be formalized in practice, that system is also fully described, together with the ways languages are formalized in it.

Contents

1 Introduction
2 Natural-Language Syntax
   2.1 Formal syntax
       2.1.1 Axiomatic systems and formal grammars
       2.1.2 Abstract machines
   2.2 Natural language grammars
       2.2.1 Context-free grammars
       2.2.2 Unification grammars
   2.3 Grammar formalisms
       2.3.1 Definite Clause Grammar
       2.3.2 Government and Binding
       2.3.3 Generalized Phrase Structure Grammar
       2.3.4 Lexical-Functional Grammar
   2.4 A feature-value based grammar
       2.4.1 Empty productions
       2.4.2 Lexicalized complements
3 Natural-Language Parsing
   3.1 Top-down parsing
   3.2 Well-formed-substring tables
   3.3 Bottom-up parsing
   3.4 Link tables
   3.5 Shift-reduce parsers
   3.6 Chart parsers
   3.7 Head parsers
   3.8 LR parsers
       3.8.1 Parsing
       3.8.2 Parsed example
       3.8.3 Compilation
       3.8.4 Relation to decision-tree indexing
       3.8.5 Extensions
4 The Core Language Engine
   4.1 Overview of the CLE
   4.2 The CLE grammar formalism
       4.2.1 A toy grammar for Swedish
       4.2.2 The "real" grammar
   4.3 Morphology
       4.3.1 Segmentation
       4.3.2 Inflectional morphology
   4.4 Compositional semantics
       4.4.1 The logical formalism
       4.4.2 The semantic rule formalism
       4.4.3 Semantic analysis
       4.4.4 Semantic rules
       4.4.5 Sense entries
       4.4.6 Higher order extensions
       4.4.7 Abstraction and application
       4.4.8 Generalized quantifiers
       4.4.9 Statements and questions
       4.4.10 "Quasi-logical" constructs
       4.4.11 Quantified terms and descriptions
       4.4.12 Anaphoric terms
       4.4.13 Event variables
   4.5 Later-stage processing
       4.5.1 Reference resolution
       4.5.2 Scoping
5 The Basics of Information Retrieval
   5.1 Manual methods
   5.2 Automatization
       5.2.1 Words as indicators of document topic
       5.2.2 Word frequencies - tf
       5.2.3 Normalizing inflected forms
       5.2.4 Uncommonly frequent words - idf
       5.2.5 Methodological problems with idf
       5.2.6 Combining tf and idf
       5.2.7 Document length effects
   5.3 Words as indicators of information need
   5.4 Query expansion
       5.4.1 Using retrieved documents
       5.4.2 Relevance Feedback
       5.4.3 Extracting terms from retrieved documents
   5.5 Beyond single words
       5.5.1 Collocations and Multi-word technical terms
       5.5.2 Concept Spotting
       5.5.3 Information Extraction
       5.5.4 Text is not just a bag of words
       5.5.5 Text is more than topic
       5.5.6 Merging several information streams
   5.6 Evaluating information retrieval
       5.6.1 How exhaustive is the search? - Recall
       5.6.2 How much garbage? - Precision
       5.6.3 Combining precision and recall
       5.6.4 What is wrong with the evaluation measures?
   5.7 References
References

List of Figures

2.1 Grammar G2
2.2 Grammar G3
2.3 Grammar G1
2.4 A sample parse tree
2.5 FSA for L3
2.6 PDA for L2
2.7 Grammar 1
2.8 Parse tree for "He (sees (a book about Paris))"
2.9 Parse tree for "He ((sees a book) about Paris)"
2.10 Inflection of the Icelandic adjective "langur"
2.11 Grammar G′1
2.12 A possible noun-phrase branching structure
2.13 The branching in X-theory
2.14 Swedish sentence structure
2.15 Swedish GB example
2.16 An example of government
2.17 The parse tree for a WH-question
2.18 The f-structure for "Kalle gillar Lena"
2.19 Grammar 2
2.20 Lexicon 2
2.21 An implicit parse tree
2.22 The parse tree for a WH-question
3.1 A simple top-down parser
3.2 A parser employing a well-formed-substring table
3.3 A simple bottom-up parser
3.4 A parser employing a link table
3.5 The chart-parser version of Grammar 2
3.6 The chart-parser version of Lexicon 2
3.7 Head Grammar version of Grammar 2
3.8 A Head Driven Parser/Generator
3.9 Generalized connect_others/3
3.10 Grammar 1
3.11 The internal states of Grammar 1
3.12 The LR parsing tables for Grammar 1
4.1 Broad overview of the CLE architecture
4.2 The analysis steps of the CLE
4.3 A Swedish grammar in the CLE formalism
4.4 A lexicon in the CLE formalism
4.5 Sentence classification by feature values
4.6 BNF definition of the predicate logic part
4.7 Generalized quantifiers
4.8 BNF definition of the "quasi-logical" part
4.9 The resolution steps of the CLE

Chapter 1

Introduction

The purpose of this booklet is to introduce natural-language processing in general and the way it is presently carried out at SICS. The term "natural language" marks in itself a distinction between the programming and formal languages normally used by computers ("artificial languages") and the languages used by people. Here, we will of course not discuss all the possible aspects of all possible human languages, but rather restrict ourselves to discussing some techniques and theories which can be used when aiming at processing a language on a computer. Most of these will be rather universal in the sense that they could apply to any language; however, we will restrict ourselves even further and illustrate the techniques mainly with examples from two languages, English and Swedish.

The overall goal of any natural-language processing (NLP) system is to translate an input utterance stated in a natural language to some type of representation internal to a computer, i.e., to interpret the utterance. The interpretation thus constructed will normally depend on the task for which the system is to be used. If the system is, for example, a front-end to a database query system, the internal representation could be the query language actually used by the underlying database (e.g., SQL). If the NLP system is used as a front-end to an operating system, the internal representation could be operating-system commands, etc. Commonly, however, the underlying system is abstracted away from the internal representation used, and the interpretation will be an expression in some kind of logical formalism. Such a formalism will be described at the end of the booklet (in Section 4.4), but before reaching that level, we will go through several preliminaries in the first chapters.

Chapter 2 will start out by describing how languages can be formalized and processed in the first place, first discussing abstract machines and formal languages and then moving on to natural languages. Most contemporary theories of natural-language description are based on the notion of unification. The relevance of this (and how it is naturally implemented in the programming language Prolog) runs as a main theme through the entire booklet, but the discussion of it is initiated in Section 2.3, which describes some common theories used for natural-language grammar formalization.
Giving a language a formal description (i.e., a grammar) is not enough. In order to make anything useful out of it, we need a way to process an input utterance given the grammar, that is, to parse the utterance. Doing this efficiently is a main research area within the NLP field and is thus discussed at length in Chapter 3. A number of different parsing strategies will be exemplified, but in general they have the common task of either rejecting an input word-string as ungrammatical, or accepting it as grammatical and producing some useful output structure (such as a parse tree).

The SICS perspective on natural-language processing is that any theories regarding it are rather uninteresting unless put to practical use. Thus we are currently testing all our work within a particular large-scale NLP system called the Core Language Engine (CLE, for short), a system originally developed for English by a group at SRI International in Cambridge, England, and further extended and adapted to Swedish by the NLP group at SICS. The CLE is described in Chapter 4, which also goes into detail in describing how some particular natural-language phenomena are handled within the CLE. The main part of the chapter is, however, devoted to the logical formalism used and how utterances actually are interpreted by the system.

Chapter 2

Natural-Language Syntax

Natural-language systems generally translate an input utterance stated in natural language to some type of internal representation, i.e., they construct a semantic interpretation. This can be an expression in a logical formalism or SQL for a database query system; an interlingua or a transfer language in a machine-translation task; or a switch-board or operating-system command in a command-language interface.

Normally the interpretation process is carried out in several steps. The utterance is usually analyzed syntactically in one of the first processing steps. In order to do this, one needs a formal way of describing the syntax of the natural language and a computing machine for analyzing the input utterance using this formal description. The former is usually a formal grammar and the latter a syntactic parser. Abstract machines take an intermediate position by fully specifying a language and offering a computational procedure for language analysis as well.

In Section 2.1 we will discuss two types of formal language descriptions, namely formal grammars and abstract machines. The reader interested in a more thorough treatment of this topic is referred to [Partee 1987]. In Section 2.2 we will discuss grammars that describe natural languages. We strongly recommend the book [Pereira & Shieber 1987] for the craft of natural-language description in Prolog, [Shieber 1986] for an introduction to unification grammars, and [Sells 1985] for an overview of contemporary syntactic theories of natural language.

2.1 Formal syntax

We will here give some necessary definitions for the further formal treatment of languages. The underlying definition is that a language is a set of strings. We will discuss the relevance of this assumption for the study and processing of natural languages in a section below.

Definition 1 (Alphabet) An alphabet A is a finite set of symbols.

The set {a, b}, which consists of the two symbols a and b, is a fairly typical example of a formal alphabet. Slightly more practical alphabets are the sets {1, 0} and {a, b, c, ..., z}.

Definition 2 (String) A string is a sequence of elements selected from an alphabet.
By sequence we mean a set where the elements are ordered. aabb is a string; bbaa is another. Strings are generally assumed to be finite but unbounded in length.

Definition 3 (String concatenation) Concatenating two strings is simply sticking the two strings after each other. It is usually denoted by juxtaposition of the symbols representing the strings. If X = aa and Y = bb, then XY = aabb.

The concatenation of a string with itself is usually denoted by raising it to the appropriate power. Thus X^2 = aaaa, X^3 = aaaaaa, X^1 = aa, and X^0 = ε, where ε denotes the empty string.

Definition 4 (String length) The length of a string A is written |A|.

Definition 5 (Language) A formal language is defined as a set of strings.

Definition 6 (Operations on languages) Languages can be operated on with any of the standard set-theoretical operations: union, intersection, complement, etc. In addition, languages can be concatenated.

The language LA = {a, aaa} can be concatenated with the language LB = {b, bb} to form the language LALB = {ab, aaab, abb, aaabb}. The same conventions as for string concatenation apply here, so LA^2 = {aa, aaaa, aaaaaa}, LA^0 = {ε} (the language containing only the empty string), and so forth. The closure of the concatenation operation is denoted with a star, the so-called Kleene star, named after S. C. Kleene of metamathematical fame.

Definition 7 (Kleene star) X* is the set of all strings formed by concatenations (including zero) of the string X, or in other words, the set of all X^i, where i is a natural number.

In an effort to cause confusion among mathematical linguists, the Kleene star is used for the closure of language concatenation as well as of string concatenation. Thus LA* is the union of all LA^i, where i is a natural number. The Kleene star is used with an alphabet as an operand as well: A* denotes the set of all strings that can be formed from the alphabet (in other words, a language). In practice, the different but related uses of the Kleene star do not usually cause the confusion they might be expected to.

Thus, using the above definitions, a language over an alphabet A is a subset of the set A* of all strings that can be formed from the alphabet. As an example, let the language L2 denote the set of strings in {a, b}* that consist of a number of a:s followed by an equal number of b:s. This language is often denoted a^n b^n since each string in it consists of n a:s followed by n b:s for some number n.

2.1.1 Axiomatic systems and formal grammars

Formal grammars are one popular type of device for describing languages. A grammar is a meta-language for the language it describes, and is usually defined as an axiomatic system.

Definition 8 (Grammar) A grammar is a quadruple ⟨σ, VT, VN, P⟩, where σ is the axiom or top symbol, VT and VN are the alphabets, and P is the actual content of the grammar. VT is the terminal alphabet that the resulting language is written in, and VN is the nonterminal alphabet used in the formulæ of the meta-language. σ is a member of VN. Often VT and VN are required to be disjoint.

The content of the grammar, P, is a set of string-rewriting rules of the form ψ → ω, where ψ is the left-hand side (LHS) of the rule and ω is the right-hand side (RHS); both are strings over VT ∪ VN. The left-hand side must contain at least one nonterminal symbol. The rules are interpreted as follows: in any string where ψ occurs as a substring, ψ may be replaced by ω to form a new string.
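Since Prolog is the vehicle of this booklet, this rewriting reading of grammar rules can be sketched directly in Prolog. The sketch below is illustrative only: strings are represented as lists of symbols, rule/2 holds the productions (here a two-rule grammar that reappears as G2 in Figure 2.1 below), and all predicate names are our own. It assumes the usual list predicates append/3 and member/2.

    % rule(LHS, RHS): one string-rewriting rule; both sides are lists of symbols.
    % The epsilon rule is listed first so that the generator below reaches
    % terminal strings before growing longer ones.
    rule([s], []).
    rule([s], [a, s, b]).

    nonterminal(s).

    % rewrite(+String, -New): replace one occurrence of some LHS by its RHS.
    rewrite(String, New) :-
        append(Front, Rest, String),
        rule(LHS, RHS),
        append(LHS, Back, Rest),
        append(Front, RHS, Front1),
        append(Front1, Back, New).

    % derivation(+String, -Terminal): rewrite until only terminal symbols remain.
    derivation(String, String) :-
        \+ ( member(Symbol, String), nonterminal(Symbol) ).
    derivation(String, Terminal) :-
        rewrite(String, Next),
        derivation(Next, Terminal).

A query such as derivation([s], String) then enumerates [], [a,b], [a,a,b,b], and so on, each answer being the result of a derivation in the sense defined next.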
Definition 9 (Derivation) A derivation of a string ω is a sequence of strings starting with the top symbol,

    σ = ψ1 ⇒ ψ2 ⇒ ... ⇒ ψn = ω

where each ψk can be rewritten as ψk+1 using some grammar rule from P.

Take as an example the following derivation of the string aaabbb from the top symbol σ using grammar G2:

    σ ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaabbb

Definition 10 (L(G)) The language L(G) is the set of strings that a grammar G describes, or in other words, all strings that have derivations in G.

Take as a simple example the grammar G2 in Figure 2.1, which generates the language L2 = a^n b^n.

    VT = {a, b}
    VN = {S}
    σ  = S
    P  = { S → aSb,  S → ε }

    Figure 2.1: Grammar G2

Types of grammars

Grammars according to the definition we have given so far are general string-rewriting systems. We can, by restricting the formal definition of grammar rules, constrain the expressiveness of the grammar. The expressiveness of a grammar formalism is measured in terms of the complexity of the classes of languages that the formalism can generate. Thus an extremely simple grammar formalism can only generate simple classes of languages, whereas a more complex formalism can generate complex classes.

As an example, if we were to define a class of grammars C11 where we restrict the rule set P to one single rule with |LHS| = |RHS| = 1, we can see that it can only produce languages consisting of a single one-symbol string. If we define another class of grammars C1n where we remove the restriction on the length of the RHS, i.e., grammars where the rule set P has one single rule with |LHS| = 1, we see that C1n includes C11.

We will now define a series of grammar types in order of decreasing complexity.

Unrestricted grammars. The grammars as we have defined them so far are called unrestricted grammars.

Context-sensitive grammars. If we restrict the grammar rules to be of the form αAβ → αψβ for some nonterminal symbol A, where ψ is not allowed to be the empty string ε, we get a context-sensitive grammar. Another way of expressing this restriction is that we disallow "shrinking rules": in context-sensitive grammars |LHS| is never greater than |RHS|. A rule in a context-sensitive grammar is understood as stating that the nonterminal A may be rewritten as the string ψ in the context of α and β. An alternative notation for context-sensitive grammar rules is A → ψ / α _ β.

Context-free grammars. If we limit the LHS to one single (by necessity nonterminal) symbol, we get a context-free grammar. This means that each grammar rule is of the form A → ψ for some nonterminal symbol A. The name context free is natural in view of the fact that A may always be rewritten as ψ regardless of context.

Regular grammars. If we in addition to this last requirement restrict the form of the RHS to be a single terminal symbol optionally followed by a single nonterminal symbol, we get a so-called regular grammar (there is a trivial variation of regular grammars where the nonterminal precedes the terminal symbol). This means that grammar rules in regular grammars are of the form A → xB or A → x, where x is in VT and B is an element of VN.

These four types of grammars are by no means the only possible grammar classification. In fact, for natural language there is a large interest in grammars that are only slightly more expressive than context-free ones. However, these four types are well studied and have well-understood properties, and are usually taken as a starting point for further study of the mathematical properties of formal grammars.
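The rule-format restrictions just listed can also be checked mechanically. The following is a small, hedged Prolog sketch (the nt/1 and t/1 symbol tagging and the predicate name are our own, not part of any grammar in this booklet) that classifies a single rule by the most restrictive format it fits; note that the context-sensitive clause tests the non-shrinking characterization rather than the αAβ → αψβ format.

    % rule_type(+LHS, +RHS, -Type): nonterminals are tagged nt(X), terminals t(X).
    % The cuts commit to the first (most restrictive) matching format.
    rule_type([nt(_)], [t(_)], regular) :- !.
    rule_type([nt(_)], [t(_), nt(_)], regular) :- !.
    rule_type([nt(_)], _, context_free) :- !.
    rule_type(LHS, RHS, context_sensitive) :-
        length(LHS, N),
        length(RHS, M),
        N =< M, !.
    rule_type(_, _, unrestricted).

For instance, rule_type([nt(s)], [t(a), nt(s), t(b)], T) yields T = context_free, while rule_type([nt(y), nt(x)], [nt(x), nt(y)], T) yields T = context_sensitive.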
Somewhat less imaginatively, Chomsky, who originally defined the hierarchy, chose to call the unrestricted grammars type 0 grammars, the context-sensitive grammars type 1 grammars, the context-free grammars type 2 grammars, and the regular grammars type 3 grammars. We will not use this terminology here.

The classes of grammars defined above are proper subsets of each other (apart from some technicalities involving context-sensitive grammars, the empty string ε, and the non-shrinking rule requirement). The language generated by a context-free grammar, for example the example grammar G2 above, is called a context-free language. Since a regular grammar is a special case of a context-free grammar, regular languages are also context-free languages. The set of regular languages is a subset of the set of context-free languages, which in turn is a subset of the set of context-sensitive languages (again modulo the empty string ε), which in turn is a subset of the set of unrestricted languages.

If we bound the string length, there can only be a finite number of strings in the language since the alphabet is finite. For this reason, this language class is called the finite languages, and it is a subset of the set of regular languages. Any of the grammar types above will only generate a finite language if the string length is bounded.

L3 = a^n b^m, denoting the set of strings in {a, b}* that consist of some (unspecified) number of a:s followed by some (possibly different) number of b:s, is an example of a regular language described by the regular grammar G3 of Figure 2.2, while L2 above, denoting a^n b^n, is a context-free language described by the context-free grammar G2 (see Figure 2.1). Note that L2 is a sublanguage of L3.

    VT = {a, b}
    VN = {S, B}
    σ  = S
    P  = { S → aS,  S → bB,  S → ε,
           B → bB,  B → ε }

    Figure 2.2: Grammar G3

Let L1 denote the language a^n b^n c^n over {a, b, c}. This is an example of a context-sensitive language that is not context free. It is generated by the context-sensitive grammar G1 shown in Figure 2.3.

    VT = {a, b, c}
    VN = {S, T, X, Y}
    σ  = S
    P  = { S → aSX,   S → aX,
           aX → abY,  bX → bbY,
           bY → bc,   cY → cc,
           YX → TX,   TX → TY,  TY → XY }

    Figure 2.3: Grammar G1

Examples of languages that are not context sensitive are quite complicated, and we will return to these when we discuss Turing machines in Section 2.1.2.

Parse trees

There are two different derivations of the string aaabbbccc from the top symbol σ using grammar G1. One is the following:

    S ⇒ aSX ⇒ aaSXX ⇒ aaaXXX ⇒ aaabYXX ⇒ aaabTXX ⇒ aaabTYX ⇒ aaabXYX ⇒
    aaabXTX ⇒ aaabXTY ⇒ aaabXXY ⇒ aaabbYXY ⇒ aaabbTXY ⇒ aaabbTYY ⇒
    aaabbXYY ⇒ aaabbbYYY ⇒ aaabbbcYY ⇒ aaabbbccY ⇒ aaabbbccc

This is called the right-most derivation since the right-most substring is always expanded. The left-most derivation is:

    S ⇒ aSX ⇒ aaSXX ⇒ aaaXXX ⇒ aaabYXX ⇒ aaabTXX ⇒ aaabTYX ⇒ aaabXYX ⇒
    aaabbYYX ⇒ aaabbYTX ⇒ aaabbYTY ⇒ aaabbYXY ⇒ aaabbTXY ⇒ aaabbTYY ⇒
    aaabbXYY ⇒ aaabbbYYY ⇒ aaabbbcYY ⇒ aaabbbccY ⇒ aaabbbccc

The only difference is that the rules were applied in a different order, but not in a structurally different way (one says that the two derivations are weakly equivalent). To eliminate this ambiguity, one often employs a parse tree to describe the derivation; the order in which the rule applications are carried out is then left unspecified.
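Parse trees have a natural rendering as Prolog terms, which is also how they will typically be built by the parsers of Chapter 3. The following is a hedged sketch using a term representation of our own (tree/2), shown for the much smaller derivation S ⇒ aSb ⇒ ab in grammar G2; the dominance relation discussed below then becomes a simple recursive predicate.

    % A parse tree as a term tree(Label, Daughters); a leaf has an empty list.
    % This tree records the derivation S => aSb => ab in G2, with the inner S
    % rewritten to the empty string.
    small_tree(tree(s, [tree(a, []),
                        tree(s, [tree(epsilon, [])]),
                        tree(b, [])])).

    % dominates(+Tree, ?Sub): the root of Tree (properly) dominates Sub.
    dominates(tree(_, Daughters), Sub) :-
        member(Daughter, Daughters),
        (   Sub = Daughter
        ;   dominates(Daughter, Sub)
        ).

Precedence can be read off in a similar way from the left-to-right order of the daughter lists. With that representation in mind, we return to the two derivations of aaabbbccc in grammar G1.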
In the example, both derivations are represented by the parse tree of Figure 2.4. A parse tree is a connected directed acyclic graph. The arcs in the tree denote the dominance relation D. This relation is asymmetric (and thus irreflexive), and transitive. In the example parse-tree above the “deepest” node labelled X dominates a node labelled Y and a node labelled c. The parse tree has a single root, i.e., there is a unique minimal element w.r.t. D. This is the top-most S node in the example parse-tree. There is also a partial order indicated by the horizontal position of each node — the precedence relation P . In the example the nodes labelled a precede the nodes labelled b. If one node dominates another, neither one of them precedes the other. Thus a pair of nodes hx, yi can be a member of P only if neither hx, yi nor hy, xi is a member of D. Arcs in the tree are not allowed to cross. This means that if some node precedes another node, all nodes that the former dominates precede the latter, and that the former node precedes all nodes the latter node dominates. This can be state formally as if hx, yi is in P , then ∀z : hx, zi ∈ Dhz, yi ∈ P and ∀z : hy, zi ∈ Dhx, zi ∈ P . There is also a function that maps the nodes of the tree into the set of symbols VT ∪ VN , the so-called labelling function. 2.1.2 Abstract machines Abstract machines are another type of popular device for describing languages. An abstract machine consists of a finite set of internal states, of which one is the 5 One says that the two derivations are weakly equivalent. 10 CHAPTER 2. NATURAL-LANGUAGE SYNTAX S ❅ ❅ ❅ ❅ a S X ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ a S X T ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ a X T Y ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ b Y Y T c c ❅ ❅ ❅ ❅ Y b T X ❅ ❅ ❅ ❅ b Y c Figure 2.4: A sample parse tree 2.1. FORMAL SYNTAX 11 distinguished initial state and where a set of states are designated final states. In addition to this, the machine may have some other internal memory as well, depending on what type of machine it is. The abstract machine is given an input string and its task is to determine whether or not the input string is a member of the language that the machine describes. There is a reader head to scan the input string and a transition relation between the internal states. This transition relation may take into account not only the current state and the current input symbol, but also the information present in the internal memory. The transitions may also change the content of the internal memory. A machine decides a language if it will halt in finite time on any input string and the state in which it halts determines whether or not the string is a member of the language. A machine accepts a language if it halts in a final state for any input string that is a member of the language, but not for any string that is not. We will discuss four types of abstract machines namely: Finite state automata (FSA) decide regular languages, i.e., languages generated by regular grammars. Thus for each regular language L there exists some finite-state automaton that accepts all strings in L and that rejects all strings not in L. Further more, if some finite-state automaton decides a language L, then L is by necessity regular. Pushdown automata (PDA) decide context-free languages, i.e., languages generated by context-free grammars. Thus for each context-free language L there exists some pushdown automaton that accepts all strings in L and that rejects all strings not in L. 
Also, if some pushdown automaton decides a language L, then L is by necessity context-free. Of course, it may be the case that L is in fact regular. Linear bounded automata (LBA) accept context-sensitive languages, i.e., languages generated by context-sensitive grammars. Linear bounded automata and context-sensitive languages are not totally equivalent since ǫ is not a member of any context-sensitive language. For each contextsensitive language L, though, there exists a linear bounded automaton that accepts L ∪ {ǫ}. Likewise if there exists a linear bounded automaton that accepts a language L, then L\{ǫ} is a context-sensitive language. Turing machines (TM) decide recursive sets and accept recursively enumerable sets, i.e., languages generated by unrestricted grammars. For each recursive set there is a Turing machine that halts6 on every input string and that halts in a final state only for input strings that are members of that set. Also, for each recursively enumerable set there is a Turing machine that halts on every input string that is a member of that set. 6 A Turing machine may compute forever. If it does not for some particular input string, it is said to “halt” on that input string. 12 CHAPTER 2. NATURAL-LANGUAGE SYNTAX These abstract machines differ in how much internal memory they have at their disposal, and in what ways they may manipulate it. In the following, we will give definitions of these abstract machines. Note that there are other ways of defining them that give them the same expressive power, i.e., allows them to decide and accept the same classes of languages. Finite-state automata A finite-state automaton has no internal memory apart from the set of internal states. The reader head scans the input string from left to right, one symbol at a time. Thus the transition relation is determined solely on the basis of the current state and the current input symbol. Definition 11 (Finite-state automaton) A finite-state automaton is a tuple hK, Σ, ∆, σ, F i where K is the set of states, Σ is the alphabet, ∆ is a relation from K × Σ to K, σ is the initial state and F ⊆ K is the set of final states. The transitions can be denoted (qi , a, qj ) This means that if the current state is qi , and the current input symbol is a, then transit to state qi and advance the reader head to the next input symbol. If the FSA is in a final state when the entire input string has been read, the input string is accepted by the FSA, otherwise the string is rejected. Take as an example the simple automaton of Figure 2.5 deciding the language L 3 = an b m : K = {q0 , q1 } Σ = {a, b} ∆ = ( σ = q0 F = {q1 } (q0 , a, q0 ) (q0 , b, q1 ) (q1 , b, q1 ) ) Figure 2.5: FSA for L3 Finite-state automata come in non-deterministic and deterministic varieties. For a deterministic FSA, the transition relation ∆ is a function. This means that in each state there is exactly one transition for each possible input symbol. 13 2.1. FORMAL SYNTAX For a non-deterministic FSA, there may be several different, one unique, or none possible transitions for each combination of state and input symbol. A deterministic FSA is trivially non-deterministic. It can be shown that for each non-deterministic FSA there exists an equivalent deterministic FSA in the sense that both automata decide the same language. Pushdown automata A pushdown automaton is essentially an FSA where the internal memory has been extended with a pushdown stack, i.e., a “last-in first-out” queue. 
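Before looking at how the stack is manipulated, it is worth seeing how small the Prolog rendering of a finite-state automaton is. The sketch below encodes the FSA of Figure 2.5 with predicate names of our own choosing (initial/1, final/1, delta/3, accepts/1); it is an illustration, not code from any particular system.

    % The FSA of Figure 2.5 for L3, as facts plus a driver predicate.
    initial(q0).
    final(q1).
    delta(q0, a, q0).
    delta(q0, b, q1).
    delta(q1, b, q1).

    % accepts(+String): the automaton reads the whole string and halts
    % in a final state.
    accepts(String) :-
        initial(Q0),
        run(Q0, String).

    run(Q, []) :-
        final(Q).
    run(Q, [Symbol|Rest]) :-
        delta(Q, Symbol, Q1),
        run(Q1, Rest).

A query such as accepts([a,a,b,b]) succeeds, while accepts([a,b,a]) fails, mirroring the transition table of the figure. We now return to the pushdown stack.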
The top elements of the stack may be read or removed, or other symbols may be added above them, or the stack may be left unchanged. The stack symbols are not restricted to the terminal alphabet, i.e., the alphabet of the input string, but may contain symbols from the nonterminal alphabet as well. The transitions can be represented as (qi , a, α) → (qj , β) This is to be interpreted as follows: In state qi , with input symbol a and with the symbols α on top of the stack, transit to state qj and replace α with the string β. We admit both α and β to be the empty string ǫ. In the former case the action is independent of the stack context, in the latter the symbols α are popped from the stack. Formally: Definition 12 (Non-deterministic pushdown automaton) A non-deterministic pushdown automaton is described by a tuple hK, Σ, Γ, ∆, σ, F i where K is the set of states, Σ is the input (i.e., terminal) alphabet, Γ ⊆ VT ∪VN is the stack alphabet, ∆ is a relation from K × Σ × Γ∗ to K × Γ∗ , σ is the initial state and F ⊆ K is the set of final states. Apart from the transitions in general being dependent on the symbols of the top portion of the stack, and the stack being manipulated by the transitions, a PDA works like an FSA. An input string is accepted if the PDA halts in a final state with an empty stack after reading the entire input string. Otherwise it is rejected. Take as an example the simple pushdown automaton shown in Figure 2.6 which is deciding the language L2 = an bn . There are deterministic and non-deterministic pushdown automata as well. The non-deterministic and the deterministic PDA are not equivalent, though. Intuitively this can be appreciated since there is no way for a deterministic PDA to know when the middle of a string is reached when attempting to recognize the language XX R , i.e., the set of strings where the second half of the string is the mirror image of the first. This is a simple task for a non-deterministic PDA. 14 CHAPTER 2. NATURAL-LANGUAGE SYNTAX K = {q0 , q1 } Σ = {a, b} Γ = {A} ∆ = ( σ = q0 F = {q0 , q1 } (q0 , a, ǫ) → (q0 , {A}) (q0 , b, {A}) → (q1 , ǫ) (q1 , b, {A}) → (q1 , ǫ) ) Figure 2.6: PDA for L2 Turing machines and linear bounded automata A Turing machine has an internal memory that consists of an infinite tape on which it may read or write. In fact, we will assume that the input string is already written on the internal tape and that the reader head actually reads from that tape. A TM can write a symbol on the tape or move the reader head one step to the right or to the left. The actions a TM can perform are determined by a finite set of quadruples (qi , a, qk , X) These should be interpreted as follows: In state qi reading the symbol a, transit to state qk and perform action X. If X is a symbol of the alphabet, write X on the tape (over-writing a). If X is the special symbol L then move the reader head one step to the left on the tape; if X is the symbol R move the reader head one step to the right. Formally, Definition 13 (Turing machine) A Turing machine is a tuple hK, Σ, ∆, s, F i where K is a finite set of states, Σ is a finite set (the alphabet), ∆ is a partial function from K ×Σ to K ×(Σ∪{L, R}), σ ∈ K is the initial state and F is the set of final states. Here we have assumed that the Turing machine is deterministic. Deterministic and non-deterministic Turing machines are equivalent in the sense that they determine the same classes of languages. In fact, no known “extension” to deterministic Turing machines extends their descriptive power. 
Thus a Turing machine is the most powerful abstract computing machine devised as of yet. Since the reader head can move both left and right, an option open to a Turing machine, but not to an FSA or a PDA as they have been formulated above, is to compute forever. Thus for Turing machines there is a substantial difference between accepting languages and deciding them: If we know that the 2.1. FORMAL SYNTAX 15 Turing machine halts on any input and the state in which it halts determines whether or not the input string is in the language, it decides the language. If it is guaranteed to halt only on strings in the language, but may compute forever given strings that are not in the language, it accepts the language. A linear bounded automaton is simply a Turing machine where the size of the available tape is limited and in fact bounded by some linear function of the length of the input string. Thereof the term “linear bounded”. Since a pushdown automaton must in each step consume an input symbol and can in each step only add a bounded number of items to the pushdown stack, the amount of memory available to a PDA is limited by a linear function of the length of the input string. Thus a pushdown automaton is a type of linear bounded automaton. Whether deterministic and non-deterministic linear bounded automata are equivalent is still an open research question. 16 2.2 CHAPTER 2. NATURAL-LANGUAGE SYNTAX Natural language grammars We will use the apparatus of formal language descriptions presented in the previous section for natural languages, but it is at this point important to discuss the appropriateness of that model. Firstly, the assumption is that a language consists of strings over a finite alphabet. Since the alphabet will consist of the vocabulary of a natural language, and since this vocabulary is most likely infinite, this assumption is invalid. Secondly, the strings are assumed to be unbounded in length, while the utterances made in a natural language most probably are not. This is an important observation, since any limit of the string length will eliminate the difference between the various language types, all of the languages will then be finite. Thirdly, it is assumed that language membership is well-defined. This is not true for natural languages, since some sentences may be dubious, but neither obviously ”grammatical” nor ”ungrammatical” (in a more intuitive sense of the word). 2.2.1 Context-free grammars We will take context-free grammars as a starting point for describing natural languages. Consider for example the small context-free grammar of Figure 2.7 describing a tiny fragment of English. VT = {a, about, book, gives, P aris, he, John, sees, sleeps} VN = {Det, N, N P, P P, P, P ron, S, V, V P } σ = S P =                            S VP VP VP VP NP NP NP PP Grammar → NP VP → V → V NP → V NP NP → VP PP → Det N → P ron → NP PP → P NP (1) (2) (3) (4) (5) (6) (7) (8) (9) Lexicon Det → a P → about NP → P aris NP → John P ron → he N → book V → sleeps V → gives V → sees                            Figure 2.7: Grammar 1 The first rule S → NP VP states that a sentence S (like He sleeps) can be constructed from a noun phrase NP (like “He” or “The man with the golden gun”) and a verb phrase VP (like “sleeps” or “is one of Roger Moore’s first 2.2. NATURAL LANGUAGE GRAMMARS 17 Bond movies”). Note that the structure of the noun phrase and the verb phrase is not specified by this rule. 
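Grammar 1 carries over almost verbatim to the DCG notation that Section 2.3.1 introduces. The transcription below is our own sketch and comes with one caveat: rules (5) and (8) are left-recursive, so a plain top-down Prolog execution of it only behaves well if the left-recursive clauses are ordered last, and it still loops if forced to backtrack exhaustively; this is one of the parsing problems taken up in Chapter 3.

    s    --> np, vp.          % (1)
    vp   --> v.               % (2)
    vp   --> v, np.           % (3)
    vp   --> v, np, np.       % (4)
    vp   --> vp, pp.          % (5)  left-recursive, ordered last among vp clauses
    np   --> det, n.          % (6)
    np   --> pron.            % (7)
    np   --> ['Paris'].       % lexicalized NPs
    np   --> ['John'].
    np   --> np, pp.          % (8)  left-recursive, ordered last among np clauses
    pp   --> p, np.           % (9)

    det  --> [a].
    p    --> [about].
    pron --> [he].
    n    --> [book].
    v    --> [sleeps].
    v    --> [gives].
    v    --> [sees].

With this ordering, phrase(s, [he, sees, a, book, about, 'Paris']) finds the analysis of Figure 2.8 below; the grammar also licenses the analysis of Figure 2.9, but a naive top-down execution falls into the left-recursive rules before delivering it when further solutions are requested.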
The division of the rules into a grammar and lexicon part is for convenience. The LHS symbols of the rules of the lexicon are often referred to as preterminals. One way of coping with the potentially infinite number of words in natural languages is to consider the set of preterminals as the effective terminal alphabet VT . Note however that for example “John” is a lexicalized NP and would in such a case violate the requirement that VT and VN be disjoint. Using lexicalized nonterminals is common practise in natural-language grammar design. Figures 2.8 and 2.9 show two parse trees for the sentence He sees a book about Paris representing two structurally different analyses of the sentence. These two analyses are distinguished by how the prepositional phrase (PP ) “about Paris” is attached; to the noun phrase “a book” in the former case (the natural reading) or to the verb phrase (VP ) “sees a book” in the latter (a reading similar to Pedro beats his donkey with a stick ). This and other types of ambiguity are very common in natural languages. The grammar in Figure 2.7 is a so-called phrase structure grammar (PSG) where each rule specifies a possible structure of a phrase. This is by no means the only possible way of describing a natural language. In fact, another very successful way of doing this is to instead specify well-formedness constraints on a sentence. For example, there might be constraints such as that any well-formed sentence contains at least one finite verb. This is the approach taken in the constraint-grammar scheme, see [Voutilainen et al 1992]. Let us examine the nature of the rules of a phrase-structure grammar. Firstly, each rule has a LHS and a RHS which specifies the dominance relation locally. For example, the S of the first rule dominates the NP and the VP . Secondly the order of the phrases of the RHS specifies the precedence relation locally. In the example rule the NP precedes the VP . In languages with a large degree of freedom regarding word order, such as Finnish or Japanese, it might be more sensible to separate these two constraints. We could re-formulate the phrase-structure rule S → NP VP as S NP → NP, VP ≺ VP The first part specifies the immediate-dominance (ID) relation and the second part specifies the linear-precedence (LP) relation, and such rules are often referred to as ID/LP rules. Note the somewhat subtle way of distinguishing a phrase-structure rule from an immediate-dominance rule by the use of a comma, rather than a blank character, to separate the phrases of the RHS of the rule. For example, in Japanese the two constructions Taro wa Hanako wo aishite iru and Hanako wo Tara wa aishite iru, both meaning John loves Mary, are captured by the ID/LP rule S NP → NP[+wa], NP[+wo], V ≺ V 18 CHAPTER 2. NATURAL-LANGUAGE SYNTAX S ❅ ❅ ❅ ❅ VP NP ❅ ❅ ❅ ❅ PRON NP V ❅ ❅ ❅ ❅ he sees PP NP ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ DET N P NP a book about P aris Figure 2.8: Parse tree for “He (sees (a book about Paris) )” 19 2.2. NATURAL LANGUAGE GRAMMARS S ❅ ❅ ❅ ❅ VP NP ❅ ❅ ❅ ❅ PRON PP VP ❅ ❅ ❅ ❅ he V NP P NP about P aris ❅ ❅ ❅ ❅ sees DET N a book Figure 2.9: Parse tree for “He ( (sees a book) about Paris )” 20 CHAPTER 2. NATURAL-LANGUAGE SYNTAX NP[+wa] means that the noun phrase is marked by the subject particle ”wa”. The above rule would for example reject the incorrect sentence Tara wo Hanako wo aishite iru. The separation of immediate dominance and linear precedence is an important ingredient in the generalized phrase-structure grammar scheme discussed in the next section. 
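The ID/LP idea is easy to prototype in Prolog. The sketch below uses a representation of our own (id_rule/2 facts for immediate dominance, lp/2 facts for linear precedence, and permutation/2 from the standard list library) and encodes the Japanese rule just given, with the particles +wa and +wo reduced to an argument on np.

    % Immediate dominance: a mother and an unordered list of daughters.
    id_rule(s, [np(wa), np(wo), v]).

    % Linear precedence: lp(X, Y) means X must precede Y whenever both occur.
    lp(np(_), v).

    % admissible_order(+Mother, -Daughters): an ordering of the daughters
    % that violates no LP constraint.
    admissible_order(Mother, Ordered) :-
        id_rule(Mother, Daughters),
        permutation(Daughters, Ordered),
        \+ ( append(_, [X|Later], Ordered),
             member(Y, Later),
             lp(Y, X) ).

admissible_order(s, Order) then yields both the NP[+wa] NP[+wo] V and the NP[+wo] NP[+wa] V orders, but never an order that places the verb before a noun phrase, just as the ID/LP rule in the text requires.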
We will now turn to unification grammars. 2.2.2 Unification grammars Experience has it that context-free grammars almost suffice for describing natural languages, and it has not been easy to find counter-examples. The most famous one is due to Stuart Shieber, see [Shieber 1985], and involves cross-serial dependencies in Swiss German. Studies by Chytil and Karlgren [Chytil & Karlgren 1988], however, indicate that the extra expressive power needed is very small, and in particular very much less than that of a context-sensitive grammar. In view of this, and the observed discrepancy between the basic assumptions of the underlying models and the nature of natural languages, we will refrain from pursuing this rather academic discussion any further. What is important, though, is the convenience of using various grammar types and the resulting parsing efficiency. The following example should convince the reader that natural languages are indeed context sensitive in the more intuitive sense that the form of a word is dependent on its context. Figure 2.10 tabulates the 48 forms of the adjective ”tall” (literally ”long”) used in Icelandic in the positive sense for ”tall man”, ”tall woman” and ”tall child” respectively. The comparative and superlative forms have been omitted out of space considerations. Here “Nom.” stands for nominative, “Gen.” for genitive, “Dat.” for dative and “Acc.” for accusative case respectively. It is obviously more convenient to write a single grammar rule for this than 48 different: NP → Adj : [case=Case,gen=Gen,num=Num,spec=Spec] Noun : [case=Case,gen=Gen,num=Num,spec=Spec] This rule states that a noun phrase consisting of an adjective and a noun is well-formed if the latter two phrases agree w.r.t. case (case), gender (gen), number (num) and species (spec). Here, the syntactic category has been augmented with a Prolog-type feature list and these are separated by a colon. The feature list consists of feature-value pairs of the type Name=Value where Name specifies the feature name and Value the feature value.7 Actually, the syntactic category could itself have been a feature-value pair such as cat=Cat, but by convention it generally is not. Normally only a finite number of different feature names are allowed. 7 “=” is simply an operator joining the feature name and feature value. 21 2.2. NATURAL LANGUAGE GRAMMARS Num. Sing. Plur. Sing. Plur. Case Nom. Gen. Dat. Acc. Nom. Gen. Dat. Acc. Nom. Gen. Dat. Acc. Nom. Gen. Dat. Acc. Indefinite species Masculine Feminine langur ma∂ur löng kona langs manns langrar konu löngum manni langri konu langan mann langa konu langir menn langar konur langra manna langra kvenna löngum mönnum löngum konum langa menn langar konur Definite species langi ma∂urinn langa konan langa mannsins löngu konunnar langa manninum löngu konunni langa manninn löngu konuna löngu mennirnir löngu konurnar löngu mannanna löngu kvennanna löngu mönnunum löngu konunum löngu mennina löngu konurnar Neuter langt barn langs barns löngu barni langt barn löng börn langra barna löngum börnum löng börn langa barni∂ langa barnsins langa barninu langa barni∂ löngu börnin löngu barnanna löngu börnunum löngu börnin Figure 2.10: Inflection of the Icelandic adjective “langur”. Two phrases can be unified if they are of the same syntactic category and if their feature lists can be merged without conflicting specifications on any feature value. 
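In Prolog, this kind of agreement falls out of ordinary term unification once the feature bundle is carried as an argument. The following hedged sketch (our own predicate names, a fixed positional feature order, and ASCII transliterations of the Icelandic forms) shows the noun-phrase rule above in DCG form; the precise notion of unification it relies on is defined next.

    % Feature bundles as terms agr(Case, Gender, Number, Species); sharing the
    % variables between adjective and noun enforces agreement by unification.
    np(agr(Case, Gen, Num, Spec)) -->
        adj(agr(Case, Gen, Num, Spec)),
        noun(agr(Case, Gen, Num, Spec)).

    % A few lexical entries (nominative singular indefinite, transliterated).
    adj(agr(nom, masc, sg, indef))  --> [langur].
    adj(agr(nom, fem,  sg, indef))  --> [loeng].
    noun(agr(nom, masc, sg, indef)) --> [madur].
    noun(agr(nom, fem,  sg, indef)) --> [kona].

Here phrase(np(A), [langur, madur]) succeeds with the agreement features instantiated, while phrase(np(A), [langur, kona]) fails because the gender values clash.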
This can be formalized as follows: Definition 14 (Unification) Two terms T1 and T2 can be unified iff there is a substitution θ such that T1 θ ≡ T2 θ. The result is that of applying the most general unifier (mgu), i.e., the minimal substitution θ′ for which T1 θ′ ≡ T2 θ′ holds, to either T1 or T2 . A substitution is an assignment of variables to terms or to other variables. A substitution may add extra feature-value pairs to feature lists if the feature name is not already present in the feature list. Equivalence (≡) is defined as follows: Definition 15 (Equivalence) 1. Two variables V1 and V2 are equivalent iff they are identical, e.g. X ≡ X. 2. Two atoms a1 and a2 are equivalent iff they are identical, e.g. a ≡ a. 3. Two terms F1 (A11 , ..., A1N ) and F2 (A21 , ..., A2M ) are equivalent iff they have the same functor, the same arity, and if the corresponding arguments are equivalent, i.e., iff F1 and F2 are identical, N = M and ∀i A1i ≡ A2i . 22 CHAPTER 2. NATURAL-LANGUAGE SYNTAX 4. Two feature lists L1 and L2 are equivalent iff every feature name N1i that occurs in L1 also occurs in L2 , and vice versa, and if the corresponding feature values V1i and V2i are equivalent, i.e., V1i ≡ V2i . A term T1 is said to subsume another term T2 if the former can be made equivalent to the latter by instantiating it further, i.e., if and only if T1 θ ≡ T2 for some substitution θ. Due to the use of feature lists, the described unification is not quite Prolog unification, but the latter can be used if all feature-value pairs are specified irrespectively of whether they are relevant in the context (and if they aren’t, the value can be assigned the “don’t care” variable “_”) and if the feature-value pairs are arranged in some specific order w.r.t. the feature name. This type of grammar is usually referred to as a unification grammar (UG). Note that this deviates from the definition of formal grammars above, where the grammar symbols are atomic, and not allowed any internal structure. So how does this influence the expressive power of the grammar? This depends on the values that the features are allowed to take. If each feature is only allowed a finite value range, the expressive power is no more than that of a context-free grammar. In this case the grammar (formalism) is said to be weakly context free, since it can only generates context-free languages. If the features are allowed a discrete range, we arrive at so-called indexed grammars, see [Aho 1968]. Such grammars are not as expressive as general context-sensitive grammars, but suffice for generating for example the language L1 = an bn cn , which is not context free. Consider grammar G′1 shown in Figure 2.11. VT = {a, b, c} VN = {S, A : [f = ], B : [f = ], C : [f = ]} σ = S P =  S    A : [f = 0]     A : [f = s(N )] B : [f = 0]   B : [f = s(N )]      C : [f = 0] C : [f = s(N )] → → → → → → → A : [f = N ] B : [f = N ] C : [f = N ] ǫ a A : [f = N ] ǫ b B : [f = N ] ǫ c C : [f = N ]                Figure 2.11: Grammar G′1 The range of the single feature f is the set of natural numbers, here represented using Peano arithmetic. This grammar is arguably more comprehensible than the grammar G1 . However, if the range of some feature value is unrestricted, the resulting 2.2. NATURAL LANGUAGE GRAMMARS 23 grammar is Turing complete, i.e. it has the same descriptive power as unrestricted grammars. Thus it can describe languages that are extremely more complicated than any observed natural language. 
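Grammar G′1 itself has a very direct DCG rendering in which the feature f becomes an argument threaded through the rules, again written in Peano notation. This is a hedged sketch with predicate names of our own rather than the booklet's grammar code.

    % L1 = a^n b^n c^n: the shared argument N plays the role of the feature f
    % in Figure 2.11.
    l1       --> as(N), bs(N), cs(N).

    as(0)    --> [].
    as(s(N)) --> [a], as(N).
    bs(0)    --> [].
    bs(s(N)) --> [b], bs(N).
    cs(0)    --> [].
    cs(s(N)) --> [c], cs(N).

phrase(l1, [a,a,b,b,c,c]) succeeds while phrase(l1, [a,b,c,c]) fails, so a few lines of feature passing buy exactly the extra expressive power that the plain context-free rules of Section 2.2.1 lack.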
Because of this unrestricted power, many theoretical systems have been proposed to limit the number, range, and co-occurrence of the features. Some of these theories are the topic of the next section.

2.3 Grammar formalisms

In this section we introduce some common grammatical theories. This brief overview of the state of the art in generative grammar is certainly not to be considered a complete record. Nor is it strictly necessary for the reader not interested in the particularities of grammatical analysis to delve into this section — it is rather just a background to the grammatical theory that will be introduced in Section 4.2, the framework actually used in the Core Language Engine.

The formalisms we will look at are: "Definite Clause Grammar" (DCG), "Government and Binding" (GB), "Lexical-Functional Grammar" (LFG) and "Generalized Phrase Structure Grammar" (GPSG). For the reader with a specific interest in any of these theories, each subsection will contain the most central references to the theory in question. General overviews and comparisons of these theories can also be found in, for example, [Sells 1985, Horrocks 1987].

All these theories build on "phrase structure rules" which define grammatical correctness. Such a rule can be viewed as a function from phrase-types to phrase-types, i.e., from a phrase to its daughter or daughters. For example, s -> np vp is a rule stating that a sentence consists of a noun-phrase followed by a verb-phrase. np and vp would be defined by other (possibly recursive) phrase structure rules.

2.3.1 Definite Clause Grammar

Definite clause grammars [Colmerauer 1978, Pereira & Warren 1980] were "invented" by Alain Colmerauer as an extension of the well-known context-free grammars. DCGs are of interest here mainly because they can be easily expressed in Prolog. In general, a DCG grammar rule in Prolog takes the form

    HEAD --> BODY.

meaning "a possible form for HEAD is BODY". Both HEAD and BODY are sequences of one or more items linked by the standard Prolog conjunction operator ','.

DCG grammar rules are merely convenient "syntactic sugar" for ordinary Prolog clauses. Each grammar rule takes an input string, analyses some initial portion, and produces the remaining portion (possibly enlarged) as output for further analysis. The arguments required for the input and output strings are not written explicitly in a grammar rule, but the syntax implicitly defines them. The following example shows how DCG grammar rules are translated into ordinary Prolog clauses by making the extra arguments explicit. A rule such as

    s --> np, vp.

translates into

    s(P0, P2) :- np(P0, P1), vp(P1, P2).

which is to be interpreted as "there exists a constituent s from the point P0 to the point P2 in the input string if there exists an np between point P0 and P1 and a vp between P1 and P2".

The way a Definite Clause Grammar extends context-free grammars is somewhat dependent on the Prolog dialect in which it is implemented. For example, the DCG extensions used in SICStus Prolog are defined as follows:

1. A non-terminal symbol may be any Prolog term (other than a variable or number).

2. A terminal symbol may be any Prolog term. To distinguish terminals from non-terminals, a sequence of one or more terminal symbols is written within a grammar rule as a Prolog list. If the terminal symbols are character codes, such lists can be written (as elsewhere) as strings.
An empty sequence is written as the empty list, [] or "". 3. Extra conditions, in the form of Prolog procedure calls, may be included in the right-hand side of a grammar rule. Such procedure calls are written enclosed in {} brackets. 4. The left-hand side of a grammar rule consists of a non-terminal, optionally followed by a sequence of terminals (again written as a Prolog list). 5. Alternatives may be stated explicitly in the right-hand side of a grammar rule, using the disjunction operator ; or | as in Prolog. 6. The cut symbol may be included in the right-hand side of a grammar rule, as in a Prolog clause. The cut symbol does not need to be enclosed in {} brackets. 2.3.2 Government and Binding The grammatical formalism that nowadays is known as “GB-theory” was introduced by Noam Chomsky in the fifties and has since been elaborated on both by Chomsky himself and by hundreds of his disciples [Chomsky 1957, Chomsky 1981, Chomsky 1986, van Riemsdijk & Williams 1986]. The formalisms GB, LFG and GPSG all have in common the notion of Universal Grammar , that is that the same type of grammar should be possible to use for all natural languages, and that indeed, all languages are rooted in the same grammar. In GB, cross-linguistic variation is specified by parameters. Other important concepts of that and the other theories will be described briefly. Starting off with GB, the central building-blocks of the theory are the X-convention, transformations, c-command, and as the name indicates, the notions of government and binding. 26 CHAPTER 2. NATURAL-LANGUAGE SYNTAX The X-Convention Usually the phrase of the right-hand side of a phrase-structure rule that is most similar to the left-hand side is called the head of the rule. For example, the noun is the head of the rule NP → DET N, the preposition is the head of the rule P → P NP and the two NP s of the RHS of a noun-phrase conjunction rule like NP → NP Conj NP are both heads. Although slightly controversial, the finite verb is often considered to be the head of a sentence. For each phrase we can follow the chain of heads down to some lexical item. The phrase is said to be a projection of this lexical item. The head of any phrase is termed X, the phrasal category containing X is termed X (which is read as “X-bar”), and the phrasal category containing X is termed X (“X double-bar”). The head is sometimes called X 0 and the node of completed phrases such as NP, VP, PP etc., that is the X with the maximal number of bars, normally phrases with a bar value of 2, is referred to as X max (for “maximal projection”) or XP . The semi-phrasal level (X) is in practice often left out if it does not branch. N → DET N N → NP Thus e.g. NP corresponds to N and if built by the above rules would give a branching structure as the one shown in Figure 2.12. N DET N ❅ ❅ ❅ ❅ N P Figure 2.12: A possible noun-phrase branching structure In general, the branching structure is as in Figure 2.13. For English (and Swedish) A is “specifiers” (determiner, degree, etc.), B is “modifiers” and D is “arguments” (C hardly exists). 27 2.3. GRAMMAR FORMALISMS X ❅ ❅ ❅ ❅ A B X ❅ ❅ ❅ ❅ C D X Figure 2.13: The branching in X-theory S ❅ ❅ ❅ ❅ TOP S ❅ ❅ ❅ ❅ S COMP ❅ ❅ ❅ ❅ NP INFL Figure 2.14: Swedish sentence structure VP 28 CHAPTER 2. NATURAL-LANGUAGE SYNTAX The sentence structure for Swedish is given in Figure 2.14. The TOP and COMP nodes are used for moved constituents, INFL (“inflection”) for e.g., tense, aspect, and agreement. 
The sentence “Kalle gillar Lena” could be interpreted as the tree of Figure 2.15. S ❅ ❅ ❅ ❅ TOP S ❅ ❅ ❅ ❅ S COMP ❅ ❅ ❅ ❅ NP VP INFL ❅ ❅ ❅ ❅ kalle PRES V gilla NP lena Figure 2.15: Swedish GB example Transformations The notion of transformations is based on a claim by Chomsky which basically says that “Phrase structures are not sufficient to describe human language”. According to this view the description of natural languages also must rely on functions that destructively rearrange the phrase-structure trees. Going from some kind of deeper-level representation of the sentence (“D-structure”), the transformations are used to produce the output format (somtimes referred to as the “surface structure”, even though most GB-theoreticians nowadays refrain from using the terms “deeper” and “surface”): 29 2.3. GRAMMAR FORMALISMS D-structure − transformation(s) → S-structure Thus the S- and D-structures for normal (straight) sentences are the same, while the S-structures for e.g. questions are derived from the corresponding D-structure through the application of certain transformations. Most transformations are of the nature “move a constituent from one place in the sentence structure to another”. The moved constituent will then leave a hole at its original position; this “hole” is referred to as a trace and its relation to the moved constituent is normally indicated by coindexing (a trace will be written as ǫ in the text following). Classical transformation grammar theory contains a multitude of such transformations, while modern GB-theory seeks to minimize the number of transformations needed. The applicability of a transformation is expressed in terms of constraints defined by the following notions: Definition 16 (C-command) X c-commands Y iff the first branching node dominating X also dominates Y , X itself does not dominate Y , and Y does not dominate X. Movement must always be to a c-commanding position and an anaphor must always be c-commanded by its antecedent (anaphors are reflexives, reciprocals, and obligatory control pronomina). Definition 17 (Binding) X binds Y if X and Y are coindexed and X c-commands Y . Definition 18 (Government) X governs Y iff Y is contained in the maximal X-projection of X, X max , and X max is the smallest maximal projection containing Y and X c-commands Y . VP ❅ ❅ ❅ ❅ S V ✡ ❅ ❅ ❅ ❅ ✲ NP INFL VP Figure 2.16: An example of government 30 CHAPTER 2. NATURAL-LANGUAGE SYNTAX The governors are normally X 0 (i.e., N , V , A, P ). An example of government is given by the tree in Figure 2.16 where the V governs the NP . Note that this would not be the case if the top VP subcategorized for an S (instead of an S), since this would introduce an X max between the V and the NP . This is sometimes expressed as “S is a barrier to government”, or that S (and the subtree it dominates) is an island isolated from the rest of the tree. In general the governors (X in the definition) can be: • X 0 (i.e., N , V , A, P ) • INFL([+tns],AGR) • NP i where the governee (the Y in the definition) is (another, but coindexed) NP i Move-α — “Move anything anywhere” The ultimate goal of modern GB-theory is to have just one transformation, “Move-α”, which takes care of all necessary movements of sentence constituents. An S-structure is assumed to have been reached through (possibly several) applications of the Move-α transformation. In Swedish the TOP (“topic”) node normally holds a topicalized NP or a WH-word,8 while the COMP holds a V after question-transformation. 
Thus the sentence “Vem gillar Kalle?” would be regarded as having the same D-structure as “Kalle gillar Lena” above, and would be derived after two applications of Move-α. The first application turns the straight sentence into a Yes-No question with a verb-trace (Gillari Kalle ǫi Lena? ), while the second creates the S-structure shown in Figure 2.17, leaving a WH-trace in the place where the proper name “Lena” was before the transformation. 2.3.3 Generalized Phrase Structure Grammar In the late seventies and early eighties Gerald Gazdar and others worked on the linguistic theory now known as GPSG [Gazdar et al 1985]. Their aim was to create a unification-based grammar formalism building on the idea that phrasestructure rules of PSG without transformations really are enough to describe natural languages. The key motivation behind the theory was actually simplicity, or rather, the use of a simplified framework : • one level of syntactic representation: surface structure. • one kind of syntactic object: the phrase structure rule. • one empty category: N U LL (≈ GB’s WH-trace). 8 WH-words are the ones introducing WH-questions, so named simply because many of the question-words in English (i.e., who, which, where, why, how, etc.) begin with the letters “WH”. 31 2.3. GRAMMAR FORMALISMS S ❅ ❅ ❅ ❅ TOP S ❅ ❅ ❅ ❅ vemj S COMP ❅ ❅ ❅ ❅ gillai NP INFL VP ❅ ❅ ❅ ❅ kalle PRES V NP ǫi ǫj Figure 2.17: The parse tree for a WH-question 32 CHAPTER 2. NATURAL-LANGUAGE SYNTAX This simplification of the grammar rules is mainly achieved by using an elaborate category-feature system and by introducing so called “meta-rules”, that is phrase-structure rules that act on other phrase-structure rules. The phrasestructure rules are separated into ID and LP-rules as described above (see Section 2.2.1). Categories and features The GPSG theory does (like most unification-based theories) distinguish between categories and features, allowing two types of categories: • major: N V A P • minor: DET COMP . . . The major categories are the ones that participate in the X scheme, while the minor do not. The features are distinguished by two properties, namely their • kind of value (e.g. atom-valued) • distributional regularities (shared with other features) To keep the grammar rules as simple as possible, the propagation of features is governed by several general principles and conventions, so that most feature instantiations in a rule need not be explicitly stated. One example of such a principle is the “head-feature convention” which will be described later on (see Page 42). Metarules GPSG has been a very influential theory in that it has introduced several concepts which now are central to most unification-based grammar theories. Thus we will not go into too much details of the theory as such here, but will introduce many GPSG concepts in other places in the text. One central notion, however, is that of metarules, which: • perform some of the duties of GB’s transformations. • derive (classes of) rules from (classes of) rules. An example is the following “Passive Metarule” for Swedish: VP → W NP ⇓ VP [PAS] → W (PP [av]) where W is a variable over categories. The rule says that a passive verb-phrase may be obtained by deleting a noun-phrase, possibly adding a prepositional phrase starting with the preposition “av” (of). This can be done regardless of whatever categories introduce the verb-phrase. 33 2.3. 
GRAMMAR FORMALISMS 2.3.4 Lexical-Functional Grammar LFG, like GPSG, is a unification-based theory based on categories and feature structures. The framework is described in [Kaplan & Bresnan 1982]. Of the grammar formalisms introduced in this section, LFG is the only one using the “school-book” grammatical functions: subject, object, etc.; however, this in hardly central to LFG as such, instead the main idea in the formalism is to assign two levels of syntactic description to a sentence: c- and f-structures. The c-structure expresses word order and phrasal structure (i.e., “ordinary” trees obtained by phrase structure rules), while the f-structure encodes grammatical relations as functions from names to values: SU BJ → kalle Information in the f-structure is represented as a set of ordered pairs each of which consists of an attribute and a specification of that attribute’s value. An attribute is a name of a grammatical function or feature (SUBJ, PRED, OBJ, NUM, CASE, etc.). Values are: • Simple symbols (constants, below in italics). • Semantic forms (constants, below in bold face). • Subsidiary f-structures. • Sets of symbols, semantic forms, or f-structures. Definition 19 (Uniqueness Condition) In a particular f-structure a particular attribute may have at the most one value. A functional structure is a set of ordered pairs satisfying the Uniqueness Condition, i.e., a standard mathematical function.              SU BJ T EN SE P RED OBJ " SP EC NUM P RED [] sing kalle P RED lena #       P RES   gillah(↑ SUBJ)(↑ OBJ)i   # "  SP EC [ ]   NUM sing Figure 2.18: The f-structure for “Kalle gillar Lena” 34 CHAPTER 2. NATURAL-LANGUAGE SYNTAX The f-structure for “Kalle gillar Lena” would be written as in Figure 2.18; showing a sentence in present tense (TENSE=PRES ), where neither the subject (SUBJ ) nor the object (OBJ ) contain any specifiers (SPEC is the empty list, here depicted as [ ]), i.e., no determiners, etc., and both have singular number agreement (NUM=sing). Natural-language semantics will be addressed further on in the text, but since we are not going to discuss LFG in more detail than done in this section, we note right away that the semantic interpretation of the sentence in LFG is obtained from the value of its predicate (PRED ) attribute: gillah(↑ SUBJ)(↑ OBJ)i, which is a predicate-argument expression containing the semantic predicate name (gilla) followed by an argument list specification enclosed in angle brackets. The argument list specification defines a mapping between the logical arguments of the two-place predicate gilla and the grammatical functions of the f-structure: the first argument position is to be filled by the formula that results from interpreting the SUBJ function of the sentence, while the second position is to be the interpretation of the OBJ . The formula from the embedded SUBJ f-structure is determined by its PRED value, the semantic form kalle, and so on. Arrows (↑ and ↓) are to be read as immediate domination. In the above example, the first argument slot of gilla is filled by the subject of the structure immediately dominating gilla, i.e., the sentence itself. 2.4. A FEATURE-VALUE BASED GRAMMAR 2.4 35 A feature-value based grammar A key idea in the grammar formalisms of the previous section was to (in different ways) limit the number, range and co-occurrence of the features. We are now ready to use the mechanism of feature-value pairs to improve the grammar of Figure 2.7. Consider the unification grammar of Figure 2.19. 
The anatomy of a rule LHS => Id-RHS is the following: LHS is simply the left-hand side of the grammar rule, Id is a mnemonic rule identifier and RHS is the right-hand side of the grammar rule (again in Prolog list notation). For example, the first rule, named s_np_vp, corresponds to the context-free grammar rule S → NP VP of Grammar 1. In the new rule, the feature agr is an agreement feature ensuring that the sentences He sleeps is grammatical while sentences like He sleep or They sleeps are not. The feature subcat is a subcategorization feature for verbs specifying their complement type, thus ensuring that He gives is not grammatical, but that He sleeps and He gives Mary a book are. The feature tree is the parse tree of the phrase, reflecting its internal structure. A unification grammar for a real natural-language system would employ a much larger set of features. A term used in the following is context-free backbone grammar . This is a grammar that consists entirely of atomic (or at least ground) symbols, but that has the same structure as the underlying unification grammar. Such a grammar could be constructed for Grammar 2 above by for example omitting the feature lists and only taking the atomic syntactic categories into account. This would give us Grammar 1. Another useful term is implicit parse tree, a tree where the nodes are the grammar rules resolved on at each point, rather than the syntactic categories. Implicit parse trees are convenient for describing UG derivations. Figure 2.21 shows such a tree for the sentence He sees a book about Paris. (The lexicon look-ups are commonly not shown as explicitly as in this tree.) 2.4.1 Empty productions Another creative use of features is associated with empty productions, or as they are often called in UG theory, “gaps”. The rule np_gap introduces an empty NP construction, i.e., an NP-trace:9 np:[agr=Agr,gaps=([np:[agr=Agr,wh=y]|G],G),wh=n] => []. np_gap- Gap rules like this one are used to model movement as in the sentence Whati does John seek ǫi ? , which is viewed as being derived from its declarative counterpart John seeks what? . The trace “ǫi ” marks the position from which the 9 For the sake of clarity, the tree feature has been omitted in the following. 36 CHAPTER 2. NATURAL-LANGUAGE SYNTAX top_symbol(s:[tree=_]). s:[tree=s(NP,VP)] => [np:[agr=Agr,tree=NP], vp:[agr=Agr,tree=VP]]. s_np_vp- vp:[agr=Agr,tree=vp(V)] => [v:[agr=Agr,subcat=intran,tree=V]]. vp_v- vp:[agr=Agr,tree=vp(V,NP)] => [v:[agr=Agr,subcat=tran,tree=V], np:[agr=_,tree=NP]]. vp_v_np- vp:[agr=Agr,tree=vp(V,NP1,NP2)] => [v:[agr=Agr,subcat=ditran,tree=V], np:[agr=_,tree=NP1], np:[agr=_,tree=NP2]]. vp_v_np_np- vp:[agr=Agr,tree=vp(VP,PP)] => [vp:[agr=Agr,tree=VP], pp:[tree=PP]]. vp_vp_pp- np:[agr=Agr,tree=np(DET,N)] => [det:[agr=Agr,tree=DET], n:[agr=Agr,tree=N]]. np_det_n- np:[agr=Agr,tree=np(PRON)] => [pron:[agr=Agr,tree=PRON]]. np_pron- np:[agr=Agr,tree=np(NP,PP)] => [np:[agr=Agr,tree=NP], pp:[tree=PP]]. np_np_pp- pp:[tree=pp(P,NP)] => [p:[tree=P], np:[agr=_,tree=NP]]. pp_p_np- Figure 2.19: Grammar 2 2.4. A FEATURE-VALUE BASED GRAMMAR lexicon(a,det:[agr=sg,tree=det(a)]). lexicon(the,det:[agr=_,tree=det(the)]). lexicon(several,det:[agr=pl,tree=det(several)]). lexicon(about,p:[tree=p(about)]). lexicon(in,p:[tree=p(in)]). lexicon(paris,np:[agr=sg,tree=np(paris)]). lexicon(john,np:[agr=sg,tree=np(john)]). lexicon(mary,np:[agr=sg,tree=np(mary)]). lexicon(he,pron:[agr=sg,tree=pron(he)]). lexicon(she,pron:[agr=sg,tree=pron(she)]). lexicon(they,pron:[agr=pl,tree=pron(they)]). 
lexicon(book,n:[agr=sg,tree=n(book)]). lexicon(books,n:[agr=pl,tree=n(books)]). lexicon(house,n:[agr=sg,tree=n(house)]). lexicon(houses,n:[agr=pl,tree=n(houses)]). lexicon(sleeps,v:[agr=sg,subcat=intran,tree=v(sleeps)]). lexicon(sleep,v:[agr=pl,subcat=intran,tree=v(sleep)]). lexicon(gives,v:[agr=sg,subcat=ditran,tree=v(gives)]). lexicon(give,v:[agr=pl,subcat=ditran,tree=v(give)]). lexicon(sees,v:[agr=sg,subcat=tran,tree=v(sees)]). lexicon(see,v:[agr=pl,subcat=tran,tree=v(see)]). Figure 2.20: Lexicon 2 37 38 CHAPTER 2. NATURAL-LANGUAGE SYNTAX S→NP VP ❅ ❅ ❅ ❅ VP→V NP NP→PRON ❅ ❅ ❅ ❅ PRON→lex NP→NP PP V→lex ❅ ❅ ❅ ❅ he sees PP→P NP NP→DET N ❅ ❅ ❅ ❅ DET→lex a ❅ ❅ ❅ ❅ N→lex P→lex NP→lex book about P aris Figure 2.21: An implicit parse tree 39 2.4. A FEATURE-VALUE BASED GRAMMAR word “whati ” has been moved (and co-indexing, here with subscript index i is used as before to associate the moved word with the trace). This is an example of left movement , since the word “what” has been moved to the left. Examples of right movement are rare in English, but frequent in other languages, the prime example being German subordinate clauses. The feature gaps is used for gap threading, i.e., passing around a list of moved phrases to ensure that an empty production is only applicable if there is a moved phrase elsewhere in the sentence to license its use. WH-questions We will now exemplify several aspects of the feature-value concept and the treatment of empty productions in unification grammars by trying to extend Grammar 2 to handle WH-questions. In order to do this, we will need gap-rules like the one above and might also introduce the binary-valued features fin and wh indicating whether the main verb is finite or not and whether a word (or, to be exact, an NP ) is a question-word or not, respectively. With a feature setting fin=y on the top-symbol we can impose a restriction on the sentences which will be accepted by the grammar, namely that they must contain a finite verb. Non-finite verbs would in contrast be the ones which could form the lower VP of a verb-phrase formation rule VP → V VP used for auxiliary verbs, i.e., verbs like “do” which can be viewed as subcategorizing for another (non-finite) verb-phrase. Whether a verb is finite or not can in this simplified framework be captured a setting the feature in the lexicon, but will in a real system be obtained from morphology rules (morphology will be discussed briefly later on). Whether an NP consists of a question-word or not would also be indicated in the lexicon. Words with wh=y would be the only ones for which a specific question-rule S → NP V S would apply. This rule would state that a sentence can consist of a (moved) WH-word followed by a (moved) verb and another sentence. The lower S should thus have gaps for both the verb and the WHword. In English the moved verb must be an auxiliary. In many other languages (e.g., Swedish), this restriction does not apply; however, we will impose it in the following example by giving the feature subcat the special value aux on auxiliaries. In addition to the np_gap rule above we add the rules s:[fin=Fin,gaps=([],[])] => s_np_v_s[np:[agr=Agr1,gaps=([],[]),wh=y], v:[agr=Agr2,fin=Fin,gaps=([],[]),subcat=aux], s:[fin=n,gaps=([v:[agr=Agr2,fin=Fin,subcat=aux], np:[agr=Agr1,wh=y]|G]),G]]. vp:[agr=Agr,fin=Fin,gaps=(G0,G)] => vp_v_vp- 40 CHAPTER 2. NATURAL-LANGUAGE SYNTAX [v:[agr=Agr,fin=Fin,gaps=(G0,G1),subcat=aux], vp:[agr=_,fin=n,gaps=(G1,G)]]. 
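% The vp_v_vp rule above threads its gap list (G0,G) first through the
% auxiliary verb and then on to the non-finite VP it subcategorizes for,
% while the v_gap rule below licenses an empty verb by consuming a moved
% verb from the front of the gaps list.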
v:[agr=Agr,fin=n, gaps=([v:[agr=Agr,fin=_,subcat=SubCat]|G],G), subcat=SubCat] => v_gap[]. and modify the rest of the grammar accordingly, e.g: top_symbol(s:[fin=y,gaps=([],[])]). s:[fin=Fin,gaps=(G0,G)] => [np:[agr=Agr,gaps=([],[]),wh=n], vp:[agr=Agr,fin=Fin,gaps=(G0,G)]]. s_np_vp- vp:[agr=Agr,fin=Fin,gaps=(G0,G)] => vp_v_np_np[v:[agr=Agr,fin=Fin,gaps=(G0,G1),subcat=ditran], np:[agr=_,gaps=(G1,G2),wh=n], np:[agr=_,gaps=(G2,G),wh=n]]. vp:[agr=Agr,fin=Fin,gaps=(G0,G)] => [vp:[agr=Agr,fin=Fin,gaps=(G0,G1)], pp:[gaps=(G1,G),lexform=_]]. vp_vp_pp- The point is that the missing NP can be consumed by either the first or second NP of the vp_v_np_np rule, or by the PP of the vp_vp_pp rule, but only by one of them. This will cover sentences like the following Whomi didj John ǫj give ǫi a book in Paris? Whati didj John ǫj give Mary ǫi in Paris? (What city)i didj John ǫj give Mary a book in ǫi ? The implicit parse tree for the first example sentence is shown in Figure 2.22. The feature value of the gaps feature is interesting from two points of view: Firstly, it is represented using a so called difference list (L1,L2) where the list L2 is the tail of the list L1 and the list represented is the list of elements preceding L2 in L1. To give a simple example, the list [a,b] can be represented as the difference between the two lists [a,b,c,d] and [c,d]. In the example above L1 is [np:[agr=Agr,gaps=_]|G] and L2 is the Prolog variable G, so the represented list is [np:[agr=Agr,gaps=_]]. Secondly, the feature value is complex in the sense that it may contain grammar phrases, the phrase np:[agr=Agr,gaps=_] in the example above. 41 2.4. A FEATURE-VALUE BASED GRAMMAR S→NP V S ❅ ❅ ❅ ❅ whomi didj S→NP VP ❅ ❅ ❅ ❅ john VP→V VP ❅ ❅ ❅ ❅ tj VP→VP PP ❅ ❅ ❅ ❅ VP→V NP NP PP→P NP ❅ ❅ ❅ ❅ give ti ❅ ❅ ❅ ❅ NP→DET N about ❅ ❅ ❅ ❅ a book Figure 2.22: The parse tree for a WH-question P aris 42 2.4.2 CHAPTER 2. NATURAL-LANGUAGE SYNTAX Lexicalized compliments Another case where grammar phrases figure in the feature values is that of lexicalized complements, which is commonly used to keep the grammar as simple as possible by incorporating into the lexicon large quantities of information that has traditionally resided in the grammar rules. In Grammar 2 we could replace the three rules vp_v, vp_v_n and vp_v_np_np with a single rule (or rather, rule schema) vp:[agr=Agr] => [v:[agr=Agr,subcat=Complements] | Complements]. vp_v_compl- and instead specify the verb’s complements in the lexicon, e.g: lexicon(sees,v:[agr=sg,subcat=[np:[agr=_,wh=_]]]). In this example, the VP inherits its structure from a favoured RHS phrase, namely the verb, that is the head of the verb-phrase. Quite a lot of features values tend to be shared between the head and LHS of a rule, and in GPSG there is a convention that says that unless otherwise specified, the values of the features that have been classified as head features should be passed on between the LHS and the head of the grammar rule without this necessarily being stated explicitly in the grammar rule. For example, let the lexform feature, specifying the lexical item that the phrase is a projection of, be a head feature. The rule pp_p_np will then ensure that the value of lexform is the same for the PP and the preposition. This allows us to specify that a particular PP must start with some specific preposition, for example that pp:[lexform=to] must be realized as “to NP ” for some NP. 
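To make the lexicalized-complement scheme more concrete, a few further lexicon entries could look as follows. This is only an illustrative sketch, not part of Grammar 2: the entries for "sleeps" and "gives" restate the old subcat values as complement lists, while the entry for "talks" is a hypothetical verb showing how the lexform feature can be used to demand a PP headed by "to".

lexicon(sleeps,v:[agr=sg,subcat=[]]).
lexicon(gives,v:[agr=sg,subcat=[np:[agr=_,wh=_],
                                np:[agr=_,wh=_]]]).
lexicon(talks,v:[agr=sg,subcat=[pp:[lexform=to]]]).

With entries like these, the single vp_v_compl rule schema covers intransitive, transitive, ditransitive and prepositional complementation alike.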
Chapter 3 Natural-Language Parsing The task of a syntactic parser is to use the grammar rules to either accept an input sentence as grammatical and produce some useful output structure (such as a parse tree), or to reject the input word-string as ungrammatical. It is important that this task is performed efficiently and in particular that the parsing process terminates. The strategies a parser can implement differ along several lines. The parsing strategy can be top-down — goal driven, or it can be bottom-up — data driven. The order in which the RHS phrases of a rule are processed differ, parsing leftto-right, so-called left-corner parsing, being the most common.1 Parsers also differ in the way hypothesized or previously parsed phrases are used to filter out spurious search branches, and what use they make of the internal memory, e.g., employing well-formed-substring tables and charts. Also the form of the transition relation, usually referred to as the parsing tables, varies. 3.1 Top-down parsing A top-down parser tries to construct a parse tree starting from the root (i.e., the top symbol of the grammar) and expand it using the rules of the grammar. Figure 3.1 shows a simple parser implementing the parsing strategy of Prolog — top-down and left-to-right. The predicate parse/4 parses a phrase either by the phrase being a preterminal matching the next word in the sentence, or by applying a grammar rule whose left-hand side matches the phrase. The latter is called top-down (rule) prediction and is the essence of top-down parsing. The predicate parse_rest/4 is a simple list recursion. The first argument is a phrase in parse/4 and a list of phrases in parse_rest/4, the second two arguments constitute a difference 1 Alternative strategies have been suggested by Kay [Kay 1989] and van Noord [van Noord 1991] (Head-corner parsing) and by Milne [Milne 1928] (Pooh-corner parsing). The former strategy is discussed in Section 3.7. 43 44 CHAPTER 3. NATURAL-LANGUAGE PARSING parse(Words,Tree) :top_symbol(S), parse(S,Words,[],Tree). parse(Phrase,[Word|Words],Words,lex-Word) :lexicon(Word,Phrase). parse(Phrase,Words0,Words,Id-Trees) :(Phrase => Id-Body), parse_rest(Body,Words0,Words,Trees). parse_rest([],Words,Words,[]). parse_rest([Phrase|Rest],Words0,Words,[Tree|Trees]) :parse(Phrase,Words0,Words1,Tree), parse_rest(Rest,Words1,Words,Trees). Figure 3.1: A simple top-down parser list denoting the portion of the input word string that the phrase spans, and the fourth argument is the output implicit parse tree(s) reflecting the structure of the analysis. 3.2 Well-formed-substring tables If we for example want the grammar on Page 36 to cover the sentence John gives a book to Mary, we might add the following rule2 vp:[agr=Agr,tree=vp(V,NP1,pp(p(to),NP2))] => [v:[agr=Agr,subcat=dative,tree=V], np:[agr=_,tree=NP1], pp:[tree=pp(p(to),NP2)]]. vp_v_np_pp- and the following lexicon entries lexicon(gives,v:[agr=sg,subcat=dative,tree=v(gives)]). lexicon(give,v:[agr=pl,subcat=dative,tree=v(give)]). lexicon(to,p:[tree=p(to)]). The problem now is that the parser first will analyze “gives” as a ditransitive verb using the rule 2 Here the parse tree of the PP is used in a way similar to the lexform feature to ensure that the preposition is realized as “to”. 3.2. WELL-FORMED-SUBSTRING TABLES 45 vp:[agr=Agr,tree=vp(V,NP1,NP2)] => vp_v_np_np[v:[agr=Agr,subcat=ditran,tree=V], np:[agr=_,tree=NP1], np:[agr=_,tree=NP2]]. 
When it fails to analyze “to Mary” as a second noun phrase it will resort to the other lexicon entry for “gives” and the new grammar rule. Despite already having analyzed “a book”as a noun phrase it will re-do this analysis, since that information was lost when back-tracking. To avoid this, all intermediate results are stored in a special table, a well-formed-substring table, which is consulted before parsing. If the ways of parsing a certain phrase starting in this position are already exhausted, the stored results are retrieved and no further parsing is attempted. This strategy is implemented in the parser of Figure 3.2. parse(Words,Tree) :reset_wfst, top_symbol(S), parse(S,Words,[],Tree). parse(Phrase,Words0,Words,Tree) :complete(Phrase,Words0),!, wfst(Phrase,Words0,Words,Tree). parse(Phrase,Words0,Words,Tree) :parse1(Phrase,Words0,Words,Tree), assert(wfst(Phrase,Words0,Words,Tree)). parse(Phrase,Words0,_Words,_Tree) :assert(complete(Phrase,Words0)), fail. parse1(Phrase,[Word|Words],Words,lex-Word) :lexicon(Word,Phrase). parse1(Phrase,Words0,Words,Id-Trees) :(Phrase => Id-Body), parse_rest(Body,Words0,Words,Trees). parse_rest([],Words,Words,[]). parse_rest([Phrase|Rest],Words0,Words,[Tree|Trees]) :parse(Phrase,Words0,Words1,Tree), parse_rest(Rest,Words1,Words,Trees). Figure 3.2: A parser employing a well-formed-substring table Here the predicate parse1/4 does the real parsing. As we can see, parse/4 asserts the derived results into the Prolog database and their re-use is regulated 46 CHAPTER 3. NATURAL-LANGUAGE PARSING by the predicate complete/2. The first clause is only applicable when an exhaustive search has been carried out. In this case the well-formed-substring table is consulted. Otherwise search is carried out by the second clause through the call to parse1 and the result stored in the well-formed-substring table. When no more solutions can be found by clause two, Prolog fails into clause three and the flag complete is set. If both wfst/4 and complete/2 were defined to be dynamic predicates, the predicate reset_wfst/0 could be defined simply as reset_wfst :retractall(wfst(_,_,_,_)), retractall(complete(_,_)). Now the parser will not re-do analyses unnecessarily. Whether or not using a well-formed-substring table pays off from an efficiency point-of-view is in general an empirical question. 3.3 Bottom-up parsing The vp_vp_pp and np_np_pp rules causing the ambiguity in the sentence He sees a book about Paris above are interesting from another point of view: They are recursive (since the LHS unifies with a phrase of the RHS), and are the simplest form of cyclic derivations. Cyclic derivations are a constant source to non-termination problems. The problem with rules like np_np_pp in conjunction with top-down parsing is that if they are applicable once, they are applicable an infinite number of times, building a left-branching parse tree where the righthand side NP of one incarnation of the rule is unified with the left-hand side of another. For this and other reasons, parsing is often performed bottom-up instead. A bottom-up parser tries to construct a parse tree starting from the leaves (i.e., from the words of the input sentence) and combine these using the rules of the grammar. The bottom-up parser described here is due to the one in [Pereira & Shieber 1987] which in turn is based on Matsumoto’s original parser, “BUP” presented in [Matsumoto et al 1983]. In Figure 3.3 we have modified the parse/4 predicate to instead parse bottom-up. 
This predicate now parses a phrase by first looking up a preterminal corresponding to the next word in the input string using leaf/4. It then tries to connect the preterminal with the predicted phrase using connect/6. The first clause of this predicate is the base case stating that each phrase is connected to itself. The second clause tries to connect the SubPhrase with the SuperPhrase by invoking a grammar rule where the first RHS phrase unifies with the SubPhrase. This is called bottom-up (rule) prediction and is the essence of bottom-up parsing. It then parses the remaining RHS phrases with parse_rest/4, which again is a simple list recursion, before attempting to connect the LHS of the rule with the SuperPhrase. 3.4. LINK TABLES 47 parse(Words,Tree) :top_symbol(S), parse(S,Words,[],Tree). parse(Phrase,Words0,Words,Tree) :leaf(SubPhrase,Words0,Words1,SubTree), connect(SubPhrase,Phrase,Words1,Words,SubTree,Tree). leaf(Phrase,[Word|Words],Words,lex-Word) :lexicon(Word,Phrase). connect(Phrase,Phrase,Words,Words,Tree,Tree). connect(SubPhrase,SuperPhrase,Words0,Words,SubTree,Root) :(Phrase => Id-[SubPhrase|Rest]), parse_rest(Rest,Words0,Words1,Trees), Tree = Id-[SubTree|Trees], connect(Phrase,SuperPhrase,Words1,Words,Tree,Root). parse_rest([],Words,Words,[]). parse_rest([Phrase|Phrases],Words0,Words,[Tree|Trees]) :parse(Phrase,Words0,Words1,Tree), parse_rest(Phrases,Words1,Words,Trees). Figure 3.3: A simple bottom-up parser 3.4 Link tables It is possible to save some search time by in advance working out what can potentially be a left-corner of what and maintain this in a table, which for historical reasons is called either a link table, viable prefix table or reachability table. The parser in Figure 3.4 uses the table to avoid attempting to construct phrases that cannot start the (super)phrase being sought for. This is called top-down filtering. Empty production rules constitute a notorious problem for parser developers since the LHSs of these grammar rules have no realization in the input wordstring, their RHSs being empty. This means that when parsing strictly bottomup, they are always applicable, and a bottom-up parser can easily get stuck hallucinating an infinite number of empty phrases at some point in the input word-string. This problem does not occur when parsing top-down. Using gap-threading to limit the applicability of empty productions, as described in Section 2.2.2, in conjunction with a link table is a type of top-down filtering. Top-down filtering is typically used in conjunction with bottom-up parsing strategies to prune the search space and avoid non-termination. 48 CHAPTER 3. NATURAL-LANGUAGE PARSING parse(Words,Tree) :top_symbol(S), parse(S,Words,[],Tree). parse(Phrase,Words0,Words,Tree) :leaf(SubPhrase,Words0,Words1,SubTree), link(SubPhrase,Phrase), connect(SubPhrase,Phrase,Words1,Words,SubTree,Tree). leaf(Phrase,[Word|Words],Words,lex-Word) :lexicon(Word,Phrase). connect(Phrase,Phrase,Words,Words,Tree,Tree). connect(SubPhrase,SuperPhrase,Words0,Words,SubTree,Root) :(Phrase => Id-[SubPhrase|Rest]), link(Phrase,SuperPhrase), parse_rest(Rest,Words0,Words1,Trees), Tree = Id-[SubTree|Trees], connect(Phrase,SuperPhrase,Words1,Words,Tree,Root). parse_rest([],Words,Words,[]). parse_rest([Phrase|Phrases],Words0,Words,[Tree|Trees]) :parse(Phrase,Words0,Words1,Tree), parse_rest(Phrases,Words1,Words,Trees). link(s:[tree=s(A,B)], s:[tree=s(A,B)]). link(np:[agr=_,tree=_], s:[tree=s(_,_)]). link(det:[agr=_,tree=_], s:[tree=s(_,_)]). link(pron:[agr=_,tree=_], s:[tree=s(_,_)]). 
link(v:[agr=B,subcat=C,tree=A], v:[agr=B,subcat=C,tree=A]). link(vp:[agr=A,tree=_], vp:[agr=A,tree=_]). link(v:[agr=A,subcat=_,tree=_], vp:[agr=A,tree=_]). link(det:[agr=B,tree=A], det:[agr=B,tree=A]). link(n:[agr=B,tree=A], n:[agr=B,tree=A]). link(pron:[agr=B,tree=A], pron:[agr=B,tree=A]). link(pp:[tree=A], pp:[tree=A]). link(p:[tree=A], pp:[tree=pp(A,_)]). link(p:[tree=A], p:[tree=A]). link(np:[agr=A,tree=_], np:[agr=A,tree=_]). link(det:[agr=A,tree=_], np:[agr=A,tree=_]). link(pron:[agr=A,tree=_], np:[agr=A,tree=_]). Figure 3.4: A parser employing a link table 3.5. SHIFT-REDUCE PARSERS 3.5 49 Shift-reduce parsers A shift-reduce parser is a parser that in each cycle performs one of the two actions shift and reduce. The shift actions consume a word from the input sentence and the reduce actions apply a grammar rule. LR parsers (discussed in Section 3.8) are a kind of shift-reduce parser. We will here describe a simple shift-reduce parser. The parser implements a left-to-right, bottom-up parsing strategy and employs a rule invocation as its current working hypothesis. This consists of a goal corresponding to the LHS of the grammar rule and a body corresponding to those phrases of the RHS that remain to be found. The rule invocations are represented as edge(Goal,Body,Tree) The reasons behind the choice of functor will become apparent later. Since the parser can temporarily switch working hypothesis, we will use a pushdown stack to store rule invocations. The current rule invocation will simply be the one on top of the stack. Thus the parser is a type of pushdown automaton, even though strictly speaking it is not one according to the formal definition of Section 2.1.2, since the reduce actions do not consume any input symbols. This is not important from a theoretical point of view, since it can be proved that this does not change the expressive power of the machine. Likewise the use of complex stack objects, namely the rule invocations, does not in this case affect the expressiveness of the automaton, but improves readability. A theoretically more important objection is that the symbols of the alphabets may be feature-based. The automaton only has a single internal state corresponding to a call to the predicate shift_or_reduce/3. parse(Sentence,Tree) :empty_stack(Stack), top_symbol(TopSymbol), (TopSymbol => Id-Body), Edge = edge(TopSymbol,Body,Id-[]), push(Stack,Edge,Stack1), shift_or_reduce(Sentence,Stack1,Tree). shift_or_reduce(Words,Stack,Tree) :shift(Words,Stack,Tree). shift_or_reduce(Words,Stack,Tree) :reduce(Words,Stack,Tree). The shift action consumes an input word and produces a corresponding preterminal as a new phrase. The reduce action pops off the current rule invocation from the stack, producing the LHS of the rule as a new phrase in the process. Reductions are only applicable to rule invocations where the body is empty. 50 CHAPTER 3. NATURAL-LANGUAGE PARSING shift([Word|Words],Stack,Tree) :lexicon(Word,NewPhrase), top_down_filter(NewPhrase,Stack), predict_or_match(NewPhrase,Words,Stack,lex-Word,Tree). reduce([],Stack,Tree) :pop(Stack,Edge,Stack1), Edge = edge(_,[],Tree0), empty_stack(Stack1),!, Tree0 = Tree. reduce(Words,Stack,Tree) :pop(Stack,Edge,Stack1), Edge = edge(Goal,[],Tree0), predict_or_match(Goal,Words,Stack1,Tree0,Tree). The new phrase is either matched against the first phrase in the body or used to predict a new rule invocation. In the latter case the first phrase of the body is used for top-down filtering and the new rule invocation is pushed onto the stack. 
predict_or_match(NewPhrase,Words,Stack,NewTree,Tree) :predict(NewPhrase,Words,Stack,NewTree,Tree). predict_or_match(NewPhrase,Words,Stack,NewTree,Tree) :match(NewPhrase,Words,Stack,NewTree,Tree). predict(NewPhrase,Words,Stack,NewTree,Tree) :(NewGoal => Id-[NewPhrase|Rest]), top_down_filter(NewGoal,Stack), push(Stack,edge(NewGoal,Rest,Id-[NewTree]),Stack1), shift_or_reduce(Words,Stack1,Tree). match(NewPhrase,Words,Stack,NewTree,Tree) :pop(Stack,Edge,Stack1), Edge = edge(Goal,[NewPhrase|Body],Tree0), fix_trees(Tree0,NewTree,Tree1), Edge1 = edge(Goal,Body,Tree1), push(Stack1,Edge1,Stack2), shift_or_reduce(Words,Stack2,Tree). Some auxiliary predicates: top_down_filter(NewPhrase,Stack) :top(Stack,Edge), Edge = edge(_,Body,_), Body = [Phrase|_], link(NewPhrase,Phrase). 3.6. CHART PARSERS 51 fix_trees(Tree,SubTree,Tree1) :Tree = Id-Trees, append(Trees,[SubTree],Trees1), Tree1 = Id-Trees1. %%% Stack Predicates empty_stack([]). top([Item|_Stack],Item). pop([Item|Stack],Item,Stack). push(Stack,Item,[Item|Stack]). By extending the shift_or_reduce/3 predicate with a clause for handling empty productions we would arrive at a parser implementing the basic parsing strategy of the Core Language Engine [Alshawi (ed.) 1992]. 3.6 Chart parsers The chart-parsing paradigm is due to Martin Kay [Kay 1980] and based on Earley deduction [Earley 1969]. A very nice overview of various chart-parsing strategies is given in Mats Wirén’s PhD dissertation [Wirén 1992]. A chart parser stores partial results in a table called the chart. This is an extension of the idea of storing completed intermediate results in a well-formedsubstring table, as discussed in Section 3.2. The chart entries are edges between various positions in the sentence marked with phrases. There are “passive” and “active” edges. Let us call a phrase spanning two positions in the input word string a goal: 1. Passive edges correspond to proven goals or facts; one knows that there is a phrase of the type indicated by the edge between its starting and end points. 2. Active edges correspond to unproven goals; one can find the type of phrase indicated by the edge if one can prove the remaining subgoals. These subgoals are specified by the edge. The parse is completed when one has constructed a passive edge between the starting point and the end point of the sentence marked with the top symbol of the grammar. The similarity with the rule invocations of the shift-reduce parser of the previous section is not coincidental, nor is the choice of “edge” as the name of the stack elements for that parser. The goal of an edge corresponds to the goal of a rule invocation and the remaining subgoals of an edge to the body of a rule invocation. 52 CHAPTER 3. NATURAL-LANGUAGE PARSING Now we describe a chart parser based on that of [Pereira & Shieber 1987]. We will represent the positions in the sentence explicitly. Apart from this, the grammar and lexicon of Figures 3.5 and 3.6 are the same as Grammar 2 and Lexicon 2 in Figures 2.19 and 2.20 of Section 2.2.2 respectively. We will use the same notation for edges here as for the shift-reduce parser. edge(Goal,[],Tree) are passive edges corresponding to proven goals. The active edges are edge(Goal,Body,Tree), where Body is a non-empty list of subgoals that remain to be proven. We have succeeded when we have added the edge edge(s(0,End):Feats,[],Tree) to the chart. parse(Sentence,Tree) :clear_chart, lexical_analysis(Sentence,End), top_symbol(S,End), prove(S,Tree). 
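% The predicates clear_chart/0 and lexical_analysis/2 called above are
% only described in the text, not defined; one possible definition (an
% illustrative sketch, with edge/3 declared dynamic) is the following.

:- dynamic edge/3.

clear_chart :-
    retractall(edge(_,_,_)).

% lexical_analysis(+Words,-End) numbers the string positions 0..End and
% asserts one passive edge for each lexicon entry of each input word.
lexical_analysis(Words,End) :-
    lexical_analysis(Words,0,End).

lexical_analysis([],End,End).
lexical_analysis([Word|Words],P0,End) :-
    P1 is P0 + 1,
    findall(edge(Cat:Feats,[],lex-Word),
            ( lexicon(Word,Cat:Feats),
              arg(1,Cat,P0),          % instantiate the start position
              arg(2,Cat,P1) ),        % instantiate the end position
            Edges),
    assert_edges(Edges),
    lexical_analysis(Words,P1,End).

assert_edges([]).
assert_edges([Edge|Edges]) :-
    assert(Edge),
    assert_edges(Edges).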
prove(Goal,Tree) :predict(Goal,Agenda), process(Agenda), edge(Goal,[],Tree). The parser first clears the chart (i.e., removes all occurences of the dynamic predicate edge/3), looks up the words of the input sentence in the lexicon, and adds the corresponding preterminals as passive edges to the chart. The sentence John sees a house, for example, would be added to the chart as the passive edges edge(np(0,1):[agr=sg,tree=np(john)], [], lex-john). edge(v(1,2):[agr=sg,subcat=tran,tree=v(sees)], [], lex-sees). edge(det(2,3):[agr=sg,tree=det(a)], [], lex-a). edge(n(3,4):[agr=sg,tree=n(house)], [], lex-house). The parsing process is driven by an agenda that keeps track of new edges added to the chart, which remain to be processed. When processing an edge in the agenda, different measures are taken depending on whether the edge is active or passive. process([]). process([edge(Goal,Body,Tree)|OldAgenda]) :process_one(Goal,Body,Tree,SubAgenda), append(SubAgenda,OldAgenda,Agenda), process(Agenda). process_one(Goal,[],Tree,Agenda) :resolve_passive(Goal,Tree,Agenda). process_one(Goal,[First|Body],Tree,Agenda) :predict(First,Front), resolve_active(edge(Goal,[First|Body],Tree),Back), append(Front,Back,Agenda). 53 3.6. CHART PARSERS top_symbol(s(0,End):[tree=_],End). s(P0,P):[tree=s(NP,VP)] => [np(P0,P1):[agr=Agr,tree=NP], vp(P1,P):[agr=Agr,tree=VP]]. s_np_vp- vp(P0,P):[agr=Agr,tree=vp(V)] => vp_v[v(P0,P):[agr=Agr,subcat=intran,tree=V]]. vp(P0,P):[agr=Agr,tree=vp(V,NP)] => vp_v_np[v(P0,P1):[agr=Agr,subcat=tran,tree=V], np(P1,P):[agr=_,tree=NP]]. vp(P0,P):[agr=Agr,tree=vp(V,NP1,NP2)] => vp_v_np_np[v(P0,P1):[agr=Agr,subcat=ditran,tree=V], np(P1,P2):[agr=_,tree=NP1], np(P2,P):[agr=_,tree=NP2]]. vp(P0,P):[agr=Agr,tree=vp(VP,PP)] => [vp(P0,P1):[agr=Agr,tree=VP], pp(P1,P):[tree=PP]]. vp_vp_pp- np(P0,P):[agr=Agr,tree=np(DET,N)] => [det(P0,P1):[agr=Agr,tree=DET], n(P1,P):[agr=Agr,tree=N]]. np_det_n- np(P0,P):[agr=Agr,tree=np(PRON)] => [pron(P0,P):[agr=Agr,tree=PRON]]. np_pron- np(P0,P):[agr=Agr,tree=np(NP,PP)] => [np(P0,P1):[agr=Agr,tree=NP], pp(P1,P):[tree=PP]]. np_np_pp- pp(P0,P):[tree=pp(P,NP)] => [p(P0,P1):[tree=P], np(P1,P):[agr=_,tree=NP]]. pp_p_np- Figure 3.5: The chart-parser version of Grammar 2 54 CHAPTER 3. NATURAL-LANGUAGE PARSING lexicon(a,det(_,_):[agr=sg,tree=det(a)]). lexicon(the,det(_,_):[agr=_,tree=det(the)]). lexicon(several,det(_,_):[agr=pl,tree=det(several)]). lexicon(about,p(_,_):[tree=p(about)]). lexicon(in,p(_,_):[tree=p(in)]). lexicon(paris,np(_,_):[agr=sg,tree=np(paris)]). lexicon(john,np(_,_):[agr=sg,tree=np(john)]). lexicon(mary,np(_,_):[agr=sg,tree=np(mary)]). lexicon(he,pron(_,_):[agr=sg,tree=pron(he)]). lexicon(she,pron(_,_):[agr=sg,tree=pron(she)]). lexicon(they,pron(_,_):[agr=pl,tree=pron(they)]). lexicon(book,n(_,_):[agr=sg,tree=n(book)]). lexicon(books,n(_,_):[agr=pl,tree=n(books)]). lexicon(house,n(_,_):[agr=sg,tree=n(house)]). lexicon(houses,n(_,_):[agr=pl,tree=n(houses)]). lexicon(sleeps,v(_,_):[agr=sg,subcat=intran,tree=v(sleeps)]). lexicon(sleep,v(_,_):[agr=pl,subcat=intran,tree=v(sleep)]). lexicon(gives,v(_,_):[agr=sg,subcat=ditran,tree=v(gives)]). lexicon(give,v(_,_):[agr=pl,subcat=ditran,tree=v(give)]). lexicon(sees,v(_,_):[agr=sg,subcat=tran,tree=v(sees)]). lexicon(see,v(_,_):[agr=pl,subcat=tran,tree=v(see)]). Figure 3.6: The chart parser version of Lexicon 2 3.6. CHART PARSERS 55 A passive edge in the agenda corresponds to a new fact, and thus the active edges in the chart are compared with this fact to see if it matches one of their remaining goals. 
resolve_passive(Fact,Tree,Agenda) :findall(Edge, Fact^p_resolution(Fact,Tree,Edge), Agenda). p_resolution(Fact,SubTree,Edge) :edge(Goal,[Fact|Body],Tree), fix_trees(Tree,SubTree,Tree1), Edge = edge(Goal,Body,Tree1), store(Edge). An active edge has goals that remain to be proven. One way of doing this is to invoke grammar rules that are potentially useful for proving them. predict(Goal,Agenda) :findall(Edge, Goal^prediction(Goal,Edge), Agenda). prediction(Goal,Edge) :(Goal => Id-Body), Edge = edge(Goal,Body,Id-[]), store(Edge). Here rules are chosen where the goal matches the left-hand side of the rule — a top-down strategy. Another way of proving the remaining goals of an active edge is to search the chart for passive edges, i. e. facts, that match them. resolve_active(Edge,Agenda) :findall(NewEdge, Edge^a_resolution(Edge,NewEdge), Agenda). a_resolution(edge(Goal,[First|Body],Tree),Edge) :edge(First,[],SubTree), fix_trees(Tree,SubTree,Tree1), Edge = edge(Goal,Body,Tree1), store(Edge). We only want to add truly new edges to the chart. For this reason we must check if the edge, or a more general one, has already been added to the chart. This is what subsumed/1 does. 56 CHAPTER 3. NATURAL-LANGUAGE PARSING store(Edge) :\+ subsumed(Edge), assert(Edge). subsumed(Edge) :edge(GenGoal,GenBody,_), subsumes(edge(GenGoal,GenBody,_),Edge). subsumes(General,Specific) :\+ \+ (make_ground(Specific), General = Specific). make_ground(Term) :numbervars(Term,0,_). Actually, the subsumes check is a built-in predicate in SICStus Prolog called subsumes_chk/2 that could be used instead. 3.7 Head parsers Most grammar formalisms, and thus most parsers, assume that string production is a question of concatenating elements. This assumption may be a consequence of the fact that most of the work in computational linguistics has focused on a small family of languages with a relatively constrained word order. There are excellent arguments for questioning this assumption. “Our Latin teachers were apparently right”, as Martin Kay puts it [Kay 1989]. Latin is a language with relatively free word order, and as most Latin students know, it is good practice when parsing a Latin sentence to search for the main verb first. It carries inflectional information which will aid in determining the other constituents of the sentence. In the general case, languages are not concatenative, constituents are not necessarily continuous, and dependencies in a clause are not necessarily neatly stacked. If we wish to generalize the representation we have been working with so far, we can start by examining a normal context free grammar, augmented with feature structures, such as the ones we have been making use of throughout this text. For each rule, a certain element of the right hand side is designated the head of the phrase. The parser should look for the head first, and then try to extend the left and right contexts of the head. The features of the head constituent percolate up to the root node, and are then a powerful top-down prediction tool when extending the context. A central question when writing a head grammar is how the head is chosen from the right-hand constituents available. It is of course appropriate to pick a head which distinguishes the right-hand side well. The problem of designating the head is usually not addressed in the theory — the grammar writer is assumed to have good intuitions. 3.7. HEAD PARSERS 57 An avenue of approach which has not been explored is constructing an algorithm whereby a head is picked automatically. 
This may be difficult: the element on the right-hand side with the highest informational content is the natural candidate, but this may be distorted through lop-sided application statistics on the grammar rules and can thus not be determined by looking at the grammar alone. In a feature structure augmented grammar it may even be a complex matter to determine which element has the highest informational value, depending on the feasibility of factoring out all the features and their value space. A head grammar version of Grammar 2 (from Page 36) is shown in Figure 3.7, with the heads of different types of phrases stated explicitly. Given a head grammar the generalization from previous parsing algorithms can be stated in terms of link tables: instead of a link table which indicates which root a certain leftmost element in a string can start, a head relation indicates which root a certain element in a string can motivate. This element is called the seed of a phrase. The parser, as most bottom-up parsers, proceeds from head to head, starting from the seed, until it has proven that a seed can indeed be the seed of a phrase. In the code for the parser below (see Figure 3.8), which is a simple reversible parser/generator by Martin Kay [Kay 1989], the basic procedure is to, given a string, assume a goal and a head for the goal. This is done in the syntax/3 predicate. Determining that the head is in the string is done using the range/3 predicate, which steps through the string until it finds an element in it which unifies with the description of a head. If no such element is found, another head is tried. If the head is found, its range will be in HeadRange. After finding the head, the goal is built around it. A goal is built using the connect/5 predicate. A rule which connects the head with the goal is identified, and the left and right contexts of the rule are built, so that the LHS of the rule extends from LL to RR in the string. connect/5 is invoked recursively until the goal is complete. This parser and grammar formalism does not allow for other than concatenative rules. The difference between this parser and the bottom-up parser discussed in Section 3.3 is only the more flexible linking. To generalize to nonconcatenative grammars we need to change the formalism somewhat. If we change the rule formalism from LHS => LeftContext,Head,RightContext to LHS => Head,OtherDaughters and generalize the two predicates connect_left/3 and connect_right/3 to one connect_others/3, as per below, we get a more general framework. The other daughters in the OtherDaughters list have information on in which direction they are to be sought, or on which side they are to be concatenated to the head. 58 CHAPTER 3. NATURAL-LANGUAGE PARSING top_symbol(s:[tree=_]). s:[tree=s(NP,VP)] => [np:[agr=Agr,tree=NP]], vp:[agr=Agr,tree=VP], []. vp:[agr=Agr,tree=vp(V)] => [],v:[agr=Agr,subcat=intran,tree=V], []. vp:[agr=Agr,tree=vp(V,NP)] => [],v:[agr=Agr,subcat=tran,tree=V], [np:[agr=_,tree=NP]]. vp:[agr=Agr,tree=vp(V,NP1,NP2)] => [],v:[agr=Agr,subcat=ditran,tree=V], [np:[agr=_,tree=NP1],np:[agr=_,tree=NP2]]. vp:[agr=Agr,tree=vp(VP,PP)] => [],vp:[agr=Agr,tree=VP], [pp:[tree=PP]]. np:[agr=Agr,tree=np(DET,N)] => [det:[agr=Agr,tree=DET]], n:[agr=Agr,tree=N], []. np:[agr=Agr,tree=np(PRON)] => [],pron:[agr=Agr,tree=PRON], []. np:[agr=Agr,tree=np(NP,PP)] => [],np:[agr=Agr,tree=NP], [pp:[tree=PP]]. pp:[tree=pp(P,NP)] => [],p:[tree=P], [np:[agr=_,tree=NP]]. head(s:_,vp:_). head(s:_,v:_). head(np:_,n:_). head(pp:_,p:_). head(K,K). 
Figure 3.7: Head Grammar Version of Grammar 2 3.7. HEAD PARSERS parse(String,Struct) :syntax(String/[],Struct,String/[]). syntax(GoalRange,Goal,MaxRange) :head(Goal,Head), range(HeadRange,Head,MaxRange), connect(GoalRange,Goal,HeadRange,Head,MaxRange). range(_,_,X/Y) :nonvar(X), X=Y, fail. range(L/R,Head,L1/_) :( var(L1), ! ; L = L1 ), dict(L/R, Head). range(HRange,Head,MaxL/MaxR) :nonvar(MaxL), MaxL = [_|T], range(HRange,Head,T/MaxR). connect(R,G,R,G,_). connect(GL/GR,Goal,HL/HR,Head,MaxL/MaxR) :(LHS => Left,Head,Right), head(Goal,LHS), connect_left(LL/HL,Left,MaxL/HL), connect_right(HR/RR,Right,HR/MaxR), connect(GL/GR,Goal,LL/RR,LHS,MaxL/MaxR). connect_left(X/X,[],_). connect_left(L/R,[Head|Heads],MaxL/MaxR) :syntax(HL/R,Head,MaxL/MaxR), connect_left(L/HL,Heads,MaxL/HL). connect_right(X/X,[],_). connect_right(L/R,[Head|Heads],MaxL/MaxR) :syntax(L/HR,Head,MaxL/MaxR), connect_right(HR/R,Heads,HR/MaxR). dict([K|X]/X,L) :- lexicon(K,L). Figure 3.8: A Head Driven Parser/Generator 59 60 CHAPTER 3. NATURAL-LANGUAGE PARSING The example in Figure 3.9 still only handles concatenative grammars, and is equivalent to the example above, but if we add new connect_others/3 clauses we will be able to build more complex constituents, and allow for interesting ways of coping with movement and discontinuous constituents. This is reminiscent of the ID/LP grammar rule paradigm referred to in Section 2.2.1. For examples of how this is done and how it can be put to use, we refer to [van Noord 1991]. connect(R,G,R,G,_). connect(GL/GR,Goal,HL/HR,Head,MaxL/MaxR) :(LHS => Head,Others), head(Goal,LHS), connect_others(Others,LL/HL,HR/RR), connect(GL/GR,Goal,LL/RR,LHS,MaxL/MaxR). connect_others([],L/L,R/R). connect_others([LeftDaughter:direction=l:Features|Others], MaxL/MaxR,RightContext) :syntax(NewL/MaxR,LeftDaughter:Features,MaxL/MaxR), connect_others(Others,MaxL/NewL,RightContext). connect_others([RightDaughter:direction=r:Features|Others], LeftContext,MaxL/MaxR) :syntax(MaxL/NewR,RightDaughter:Features,MaxL/MaxR), connect_others(Others,LeftContext,NewR/MaxR). Figure 3.9: Generalized connect others/3 The bottom line question in practical natural-language processing naturally will be: Are head grammars a good idea or not in terms of actually analyzing natural languages? There is no answer to this question as of yet. The jury is still out. It will probably depend heavily on the language that is being analyzed. 3.8 LR parsers An LR parser is a type of shift-reduce parser that was originally devised by Donald Knuth [Knuth 1965] for programming languages and is well described in e.g., [Aho et al 1986]. Since Tomita’s article [Tomita 1986] it has become increasingly popular for natural languages. The success of LR parsing lies in handling a number of grammar rules simultaneously by the use of prefix merging, rather than attempting one rule at a time as the shift-reduce parser of Section 3.5 does. 61 3.8. LR PARSERS 3.8.1 Parsing An LR parser is basically a pushdown automaton, i.e., it has a pushdown stack in addition to a finite set of internal states and a reader head for scanning the input string from left to right one symbol at a time. Thus an LR parser employs a left-corner parsing strategy. In fact, the “L” in “LR” stands for left-to-right scanning of the input. The “R” stands for constructing the rightmost derivation in reverse. The stack is used in a characteristic way: Every other item on the stack is a grammar symbol and every other is a state. The current state is simply the state on top of the stack. 
The most distinguishing feature of an LR parser is however the form of the transition relation — the action and goto tables: a (non-deterministic) LR parser can in each step perform one of four basic actions. In state S with lookahead symbol3 Sym it can: 1. accept(S,Sym): Halt and signal success. 2. error(S,Sym): Fail and backtrack. 3. shift(S,Sym,S2): Consume the input symbol Sym, place it on top of the stack, and transit to state S2 by placing it on top of the stack. 4. reduce(S,Sym,R): Pop off a number of items from the stack corresponding to the RHS of grammar rule R, inspect the stack for the old state S1, place the LHS of rule R on the stack, and transit to state S2 determined by goto(S1,LHS,S2) by placing S2 on the stack. Like the shift-reduce parser of Section 3.5, this is not a pushdown automaton according to our definition above, since the reduce actions do not consume any input symbols, and the same remarks apply here. This gives the parser the possibility of not halting by reducing empty productions and transiting back to the same state, and care must be taken to avoid this. Prefix merging is accomplished by each internal state corresponding to a set of partially processed grammar rules, so-called “dotted items” containing a dot (·) to mark the current position. For example, if the grammar contains the following two rules, NP → Det Noun NP → Det Adj Noun 3 The lookahead symbol is the next symbol in the input string i.e., the symbol under the reader head. 62 CHAPTER 3. NATURAL-LANGUAGE PARSING there will be a state containing the dotted items NP → Det · Noun NP → Det · Adj Noun This state corresponds to just having found a determiner (Det ). Which of the two rules to apply in the end will be determined by the rest of the input string; at this point no commitment has been made to either of the two rules. The concept of dotted items is closely related to that of rule invocations in Section 3.5 and to that of edges in Section 3.6. The LHS of a dotted item corresponds to the goal of an edge and the phrases following the dot correspond to the body. 3.8.2 Parsed example Grammar 1, reproduced in Figure 3.10 for reference, will generate the internal states of Figure 3.11. S VP VP VP VP NP NP NP PP → → → → → → → → → NP VP V V NP V NP NP VP PP Det N Pron NP PP P NP (1) (2) (3) (4) (5) (6) (7) (8) (9) Figure 3.10: Grammar 1 These in turn give rise to the parsing tables of Figure 3.12. The entry “s2” in the action table, for example, should be interpreted as “shift the lookahead symbol onto the stack and transit to State 2”. The action entry “r7” should be interpreted as “reduce by Rule 7”. The goto entries simply indicate what state to transit to once a phrase of that type has been constructed. Note the two possibilities in States 11, 12 and 13 for lookahead symbol “P”. We can either shift it onto the stack or perform a reduction. This is called a shift-reduce conflict and is the source to the ambiguity previously observed. 63 3.8. 
LR PARSERS S′ S NP NP NP S NP VP VP VP VP PP NP State 0 → ·S → · NP VP → · Det N → · Pron → · NP PP State 1 → NP · VP → NP · PP → ·V → · V NP → · V NP NP → · VP PP → · P NP State 2 → Det · N NP State 3 → Pron · S′ State 4 → S· S VP PP State 5 → NP VP · → VP · PP → · P NP VP VP VP NP NP NP State 6 → V· → V · NP → V · NP NP → · Det N → · Pron → · NP PP NP State 7 → NP PP · PP NP NP NP State 8 → P · NP → · Det N → · Pron → · NP PP VP State 9 → VP PP · NP State 10 → Det N · VP VP NP NP NP NP PP State 11 → V NP · → V NP · NP → NP · PP → · Det N → · Pron → · NP PP → · P NP VP NP PP State 12 → V NP NP · → NP · PP → · P NP PP NP PP State 13 → P NP · → NP · PP → · P NP Figure 3.11: The internal states of Grammar 1 64 CHAPTER 3. NATURAL-LANGUAGE PARSING State 0 1 2 3 4 5 6 7 8 9 10 11 12 13 Det s2 N NP s1 Action P Pron s3 s8 V eos NP 1 s6 Goto PP S 4 7 VP 5 s10 r7 r7 s2 r8 s2 s11 r8 s13 r6 s2 r6 s12 r9 r9 r7 s8 r2 r8 r5 r6 s8/r3 s8/r4 s8/r9 r7 s3 r8 s3 r7 r8 r7 acc r1 r2 r8 9 11 13 r6 s3 r6 r9 r9 r5 r6 r3 r4 r9 12 7 7 Figure 3.12: The LR parsing tables for Grammar 1 Using these tables we can parse the sentence John sees a book as follows: Action Stack String init s1 s6 s2 s10 r6 r3 r1 accept [0] [1, NP, 0] [6, V, 1, NP, 0] [2, Det, 6, V, 1, NP, 0] [10, N, 2, Det, 6, V, 1, NP, 0] [11, NP, 6, V, 1, NP, 0] [5, VP, 1, NP, 0] [4, S, 0] [4, S,0] John sees a book sees a book a book book ǫ ǫ ǫ ǫ ǫ Initially State 0 is added to the empty stack. The noun phrase (NP ) corresponding the word “John” is shifted onto the stack and the parser transits to State 1 by adding it to the stack. Next, the verb (V ) corresponding to the word “sees” is shifted onto the stack and the parser transits to State 6 by adding it to the stack. Next, the determiner (Det ) corresponding to the word “a” is shifted onto the stack and the parser transits to State 2 by adding it to the stack. Next, the noun (N ) corresponding to the word “book” is shifted onto the stack and the parser transits to State 10 by adding it to the stack. Next, the noun and the determiner on top of the stack are reduced to a noun phrase using Rule 6 (NP → Det N ), which is added to the stack, and the parser transits to State 11 by adding it to the stack. Next, the noun phrase and the verb on top of the stack are reduced to a verb phrase (VP ) using Rule 3 (VP → V NP ), which is 3.8. LR PARSERS 65 added to the stack, and the parser transits to State 5 by adding it to the stack. Next, the verb phrase and the noun phrase on top of the stack are reduced to a sentence (S) using Rule 1 (S → NP VP ), which is added to the stack, and the parser transits to State 4 by adding it to the stack. Finally, the input string is accepted. 3.8.3 Compilation Compiling LR parsing tables consists of constructing the internal states (i.e., sets of dotted items) and from these deriving the accept, shift, reduce and goto entries of the transition relation. New states can be induced from previous ones; given a state S1, another state S2 reachable from it by goto(S1,Sym,S2) (or shift(S1,Sym,S2) if Sym is a terminal symbol) can be constructed as follows: 1. Select all items in state S1 where a particular symbol Sym follows immediately after the dot and move the dot to after this symbol. This yields the kernel items of state S2. 2. Construct the non-kernel closure by repeatedly adding a so-called nonkernel item (with the dot at the beginning of the RHS) for each grammar rule whose LHS matches a symbol following the dot of some item in S2. 
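As an illustration, these two steps could be sketched in Prolog as follows. The representation is assumed only for this example: grammar rules are stored as rule(LHS,RHS) facts with RHS a list of symbols, and a state is a list of dotted items item(LHS,Before,After), with the dot standing between the Before and After lists.

goto_state(State,Sym,NewState) :-
    findall(item(LHS,Before1,After),
            ( member(item(LHS,Before,[Sym|After]),State),
              append(Before,[Sym],Before1) ),
            Kernel),                  % step 1: the kernel items
    Kernel \== [],
    closure(Kernel,NewState).         % step 2: the non-kernel closure

% closure(+Items,-Closed): repeatedly add an item with the dot at the
% beginning of the RHS for every rule whose LHS follows a dot.
closure(Items,Closed) :-
    findall(item(LHS,[],RHS),
            ( member(item(_,_,[LHS|_]),Items),
              rule(LHS,RHS) ),
            Predicted),
    add_new(Predicted,Items,Items1),
    (   Items1 == Items ->
        Closed = Items
    ;   closure(Items1,Closed)
    ).

add_new([],Items,Items).
add_new([Item|Rest],Items0,Items) :-
    (   member(Item,Items0) ->
        Items1 = Items0
    ;   append(Items0,[Item],Items1)
    ),
    add_new(Rest,Items1,Items).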
Using this method, the set of all parsing states can be induced from an initial state whose single kernel item has the top symbol of the grammar preceded by the dot as its RHS. In figure 3.11 this is the item S ′ → ·S. The accept, shift and goto entries fall out automatically from this procedure. Any dotted item where the dot is at the end of the RHS gives rise to a reduction by the corresponding grammar rule. Thus it remains to determine the lookahead symbols of the reduce entries. In Simple LR (SLR) the lookahead is any terminal symbol that can follow immediately after a symbol of the same type as the LHS of the rule. In LookAhead LR (LALR) it is any terminal symbol that can immediately follow the LHS given that it was constructed using this rule in this state. In general, LALR gives considerably fewer reduce entries than SLR, and thus results in faster parsing. 3.8.4 Relation to decision-tree indexing A simplified version of LR parsing is decision-tree indexing. The procedure for indexing a grammar in a decision tree is the following: First one tree is constructed for each rule in the grammar. The root of each tree is the LHS symbol of the rule, and from this there is an arc labelled with the first RHS symbol of the rule to another node, and from this node in turn there is an arc labelled with the second RHS symbol of the rule, etc. Then trees whose roots are labelled with the same symbol are merged, allowing rules with the same LHS 66 CHAPTER 3. NATURAL-LANGUAGE PARSING and the same initial RHS symbols to be processed together until they branch off. Each node in the decision tree corresponds to some state(s) in the LR parsing tables.4 The arcs correspond to the shift and goto entries where the end node of each arc corresponds to the new state. The arcs labelled with terminal symbols correspond to shift actions and the ones with nonterminal symbols correspond to goto entries. The leaves of the tree correspond to reductions. This parsing scheme also requires a stack to keep track of where to transit once a rule has been applied and a phrase corresponding to the label of the root has been found. At the other end, if a nonterminal arc is encountered, the tree with a root with that label must be searched. This does not give maximal prefix merging. Rules with different LHS are not processed together, nor are rule combinations yielding the same prefix. Also, some mechanism must be introduced for handling recursion to avoid the following scenario: If a tree with root label X has an arc labelled X, then the parser will return to the root node, find its way to the arc labelled X, return to the root node again and thus loop. 3.8.5 Extensions Local-ambiguity packing and employing a graph-structured stack are two important techniques for extending the basic LR-parsing scheme which are due to Tomita. These are described in more detail in [Tomita 1986]. Employing a non-deterministic LR-parsing scheme with these two extensions is often referred to as “generalized LR parsing”. Local-ambiguity packing can be visualized by imagining that one performs all possible derivations in parallel, maintaining a set of stacks, while synchronizing the shift actions. If two stacks are identical (apart from the internal structure of the symbols), their origin will differ in the derivation of one (or more) nonterminal symbol. For subsequent parsing they are treated as a single stack and only when recovering the parse trees are they distinguished. Thus this parsing is done only once. 
Using a graph-structured stack can then be viewed as merging the set of stacks in the parallel processing into a graph. Realizing these efficiently in Prolog requires using a well-formed-substring table as described in Section 3.2. 4 This correspondence is in general not one-to-one in either direction. Chapter 4 The Core Language Engine The name of this booklet contains the term “natural-language”, which normally is used by computer scientists as a contrast to formal and programming languages. Thus a field within the computer science subarea of Artifical Intelligence has traditionally been called Natural-Language Processing — a field whose practitioners have come from an AI background and have realised that in order to make “truly intelligent” systems some kind of language understanding must be included. Even though AI now is a more or less dead area in that very few scientists like to be referred to as working within it, NLP is very much alive. Quite a few of the researchers working with language-processing computers do, however, have a background in the humanities, being linguists which have realised that large-scale and applied linguistics can only be addressed if using computers. This subfield within linguistics is commonly referred to as Computational Linguistics, a term which sometimes is used in quite a wide sense, including all work in linguistics in which a computer is used for anywhich purpose. The boundaries between Natural-Language Processing on one hand and Computational Linguistics on the other have become more and more blurred in that computer scientists have realised the need for linguistic knowledge within their systems and linguists are starting to realise that much of the programming techniques and theories of computer science are useful for linguistic purposes, as well. Of course this development is beneficiary for all parties and will certainly continue, making the distinction between the two areas obsolete, which it mostly already is. Some aspects of the NLP–CL distinction still remain, however. The main difference nowadays is often on a empiricist versus theoretician basis, NLP being the area more concerned with building large-scale systems. This brings us back to the title of the booklet. The reference to NLP is there since the authors think the only way to prove their theories (sic!) is by building large, commercially feasible systems. The system for Swedish which we use is based on one for English originally developed by a group at SRI International, 67 68 CHAPTER 4. THE CORE LANGUAGE ENGINE Cambridge, England and is called the “Core Language Engine” (CLE). The purpose of this chapter is to introduce that system and its grammar formalism. The description will by necessity be quite brief, to get a complete picture of the CLE system the reader is referred to the CLE book [Alshawi (ed.) 1992]. The Swedish version of the CLE used at SICS is described in [Gambäck & Rayner 1992]. [sentence] ✻ ❄ syntactic & ✛ semantic analysis/synthesis ❅ ■ ❅ ❄ [QLFs] ❄ logical form transformations lexical entries lexical acquisition linguistic rules ✠ ✛ context ✻ [LF] ❄ application Figure 4.1: Broad overview of the CLE architecture 4.1 Overview of the CLE The name Core Language Engine reflects the purpose of the system, it is intended to be the “core” (the heart) of an NLP system of any kind, for example a query-interface to a database, a machine translation system, or as the back-end of a speech understanding system. As an “engine” it should be a well-functioning work-horse. 
The CLE is written purely in Prolog and based on unification as the main mechanism. The grammar formalism is a feature-category type with declarative bidirectional rules, which means that the same grammar can be used for both language analysis and generation. Other hall-marks of the CLE are already familiar from the previous chapters: as much information as possible is put in the lexicon (rather than in the 69 4.1. OVERVIEW OF THE CLE grammar rules) and the system employs a left-corner shift-reduce parser. A broad overview of the CLE architecture is shown in Figure 4.1. Apart from using unification as the underlying mechanism in all parts of the system, the main design decision behind the CLE was to build it as a number of modules or processing steps, each with well-defined input from and output to the other modules. This has resulted in the design of a number of interface formalisms, one of which now can be taken as one of the main concepts of the CLE: the notion of “Quasi-Logical Form” (QLF) which is the formalism used for representing natural-language (compositional) semantics. The QLF is a conservative representation of the meaning of an input sentence based on purely linguistic evidence. Semantics (mainly QLF-based) will be discussed later on, here we will only show the CLE’s first processing steps which produce that form from an input NL-sentence (see Figure 4.2). NL sentence ❄ morphological analysis Analysis ❄ syntactic parsing ❄ semantic analysis ❄ QLF Figure 4.2: The analysis steps of the CLE Other processing steps will then take application and situation dependent information into account when converting the QLF into a “pure” logical form, LF. The processing beyond QLF-level will, however, only be discussed in passing in this booklet. For a full description of this, the reader is again referred to [Alshawi (ed.) 1992]. The usefulness of any NLP system will depend largely on its ability to adjust (be customized) rapidly to new domains. This was reflected in the CLE design by equipping the system with a “core” grammar which should cover a large portion of English, but which should still be adjustable to a new register (sublanguage) without too large an effort. On the lexicon side, ease of extendability is even more important. Thus the CLE has a special lexicon acquisition component, shown on the right-hand side of Figure 4.1. The idea being that a potential user with little or no linguistic knowledge should be able to expand the lexicon easily by herself. 70 CHAPTER 4. THE CORE LANGUAGE ENGINE 4.2 The CLE grammar formalism We now turn to the rule-formalisms used in the CLE. Actually all the formalisms used for rules of different kinds in the various models have the same general appearance. This section will discuss the syntactic grammar and lexicon rules in particular, but as we shall see in the sections following the morphology and semantic rules are more or less equivalent. A syntactic rule is schematically syn(hsyntax-rule-idi, hrule-groupi, [hmother-categoryi, hdaughter-category1 i, . . . , hdaughter-categoryn i]). The functor name syn indicates that this is a syntactic grammar rule, while its first argument is the actual name of the rule. This could be almost anything, but for ease of readability and debugging, the convention for rule names is that the first part lists the mother (i.e, RHS) and daughter (LHS) categories, then the portion (if any) after the first upper case letter provides a brief comment. The second argument indicates the rule group to which the rule belongs. 
This is important for example when customizing the system to a new domain, but will not be discussed here. The third argument is the really important part of the rule giving the categories and feature-value pairs of both the left-hand and right-hand side of the grammar rule as such. In the rules, the mother node and the daughter nodes form a Prolog-type list, so the → of the previous chapters has now been “replaced” by the first comma, while the commas following separate the daughters from each other. Each constituent (both mother and daughters) in the schema above consists of a colon-separated pair Category:FVList where the Category of course is the category name, while FVList is a Prolog-type list of feature-value pairs given in the same fashion as introduced in Section 2.2.2, that is, each constituent is Category:[Feature1=Value1,. . .,Featuren=Valuen] It should be obvious that having the LHS and RHS separated by a comma rather than an arrow and giving the rule name as an argument rather than putting it at the far right is just a notational variant. The formalism of the CLE and the formalism used for the example “toy” grammars earlier in the text are in fact completely equivalent. 4.2.1 A toy grammar for Swedish A simple grammar for a fragment of Swedish in the CLE formalism is shown in Figure 4.3. Note that this grammar makes use of its features in two ways described in some detail in Section 2.4.2. Firstly, the rule vp_v_comp_Normal leaves the verb’s complements as a function of the value of the feature subcat, thus allowing the subcategorization scheme to be instantiated in the lexicon at run-time. This of course means that 4.2. THE CLE GRAMMAR FORMALISM 71 the parse-table compiler must be designed with care, but that is really of no concern of the grammar writer, who instead rejoices at the thought of being able to capture several grammar rules in only one rule schema. The Swedish CLE grammar distinguishes between some 50 different types of verbs, all of which can be captured by a single rule looking like the one in the toy grammar. Having to write specific rules for each of these verb types would of course be both tedious and error-prone. Apart from verb subcategorization, both adjectives and some nouns show a similar behaviour and can be treated by other rule schemas in a parallel fashion. Another use of features described before is given by vform, which is a feature which ensures that the top-level (σ) rule for declarative sentences only is applicable if the (main) verb is finite. This feature is no longer boolean (as it was when called just fin in Section 2.4.2), but can rather take on a range of values corresponding to all the inflectional forms a Swedish verb can take. The lexicon formalism A corresponding lexicon is shown in Figure 4.4. The lexicon formalism mainly follows the one of the syntax rules, but with the functor name lex rather than syn. The lexicon rules parallels the syntax in containing a Category:FVList pairs list as described above. Each lexicon entry can actually contain several different categories, thus in effect defining several words, but the entries we will discuss here will only introduce one word at the time. As shown in the lexicon, the feature agr carries gender agreement, which in Swedish manifests itself as either common or neuter gender (nowadays commonly referred to as “n-genus” and “t-genus”, respectively, reflecting their definite-form suffixes). 
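The effect of the agr feature can be seen in isolation with a few lines of self-contained Prolog. The DCG below is not the CLE machinery, merely a rendering of the np_det_nbar rule and four of the lexicon entries of Figures 4.3 and 4.4, with the gender value as an explicit argument:

% Gender agreement by unification: the determiner and the noun
% must carry the same agr value for the NP rule to apply.
np --> det(Agr), nbar(Agr).

det(common) --> [en].
det(neuter) --> [ett].

nbar(common) --> [bil].
nbar(neuter) --> [garage].

% ?- phrase(np, [ett, garage]).   succeeds
% ?- phrase(np, [en, garage]).    fails, since common and neuter
%                                 do not unify

Exactly the same unification step is what rules out the ungrammatical combinations when the full CLE rules and lexicon are used.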
Abstractly, an entry in the lexicon can be defined as lex(hterminal-rooti, [hterminal-categoryi: [hFeature1 =Value1 , ..., Featuren =Valuen i] ]). where the only really new acquaintance is terminal-root, which is the base form of the terminal, that is the non-inflected form. So far, the lexicon base form has been identical to the form in the input string. Considering the example from Icelandic given in Figure 2.10 this is obviously not an approach which recommends itself. In Section 4.3 below we will discuss how inflectional morphology helps to produce the lexicon base form of a word from all its possible surface forms. First, however, we shall look at some examples of how a real-sized unification grammar actually may look in practise. 72 CHAPTER 4. THE CORE LANGUAGE ENGINE syn(sigma_decl, core, [sigma:[], s:[vform=fin] ]). syn(s_np_vp_Normal, core, [s:[vform=Vform], np:[], vp:[vform=Vform] ]). syn(vp_v_comp_Normal, core, [vp:[vform=Vform], v:[vform=Vform, subcat=Complements] | Complements ]). syn(np_det_nbar, core, [np:[], det:[agr=Agr], nbar:[agr=Agr] ]). syn(np_name, core, [np:[], name:[] ]). Figure 4.3: A Swedish grammar in the CLE formalism 4.2. THE CLE GRAMMAR FORMALISM lex(kalle,[name:[]]). lex(lena,[name:[]]). lex(bil, [nbar:[agr=common] ]). lex(garage, [nbar:[agr=neuter] ]). lex(snarkar, [v:[vform=fin, subcat=[]] ]). lex(gillar, [v:[vform=fin, subcat=[np:[]]] ]). lex(ger, [v:[vform=fin, subcat=[np:[], np:[]]] ]). lex(en, [det:[agr=common] ]). lex(ett, [det:[agr=neuter] ]). Figure 4.4: A lexicon in the CLE formalism 73 74 CHAPTER 4. THE CORE LANGUAGE ENGINE 4.2.2 The “real” grammar In the real Swedish grammar used in the CLE, the rules are of course quite a lot more complex, even though the follow the general scheme outlined above. The main cause of complexity is the number of features. In the toy grammar not even a hand-full of features were used; in the real grammar some 1500 different features are passed around.1 Feature defaults and value spaces As shown by for example the feature agr above, the features need not be binary valued. Instead, we can specify the ranges (“syntactic feature value space”) and default values of some of the features: syn_feature_value_space(agr, [[plur,sing], [1,2,3], [common,neuter]]). feature_default(nullmorphn,n). A few features have specified “value spaces”. The feature agr thus has a value space reflecting its function of carrying number, person, and gender agreement. The sublists in the definition are to be interpreted as Cartesian products of the elements, giving the feature 2 ∗ 3 ∗ 2 = 12 possible values. If no feature value space is defined, the feature is assumed to be binary valued, or rather to have the value space [y,n,_], i.e., “yes”, “no” or “uninstantiated”. There is no requirement that a feature must have a declaration of any kind — just using it in the grammar with a specific category is enough; however, default declarations are used for some features, so e.g. nullmorphn is defined to be n per default. If no default value is present for a feature, it is assumed to be _, i.e., uninstantiated. Using default values for some features allows the grammar writer to leave them out from rules where their use simply follows from their default values, thus simplifying the grammar rules. 1 The actual number of distinct features of a grammar does in some respect depend on how they are counted. 
Is for example the feature agr on an NP the same as the feature agr on a VP , or are they two different features which just happen to have the same name (and also oftentimes unify with each other)? The number 1500 given here comes from counting all features on different categories as distinct. In passing, we should point out something which following the above example should be obvious, but actually is a common misconception even among skilled computational linguists: the number of features or the number of grammar rules are of no real importance when measuring the complexity or coverage of a grammar (a claim like “my grammar is better than yours, since it has more rules” or a question like “how many grammar rules do you have?” are nonsensical). What really counts when “bolstering” your own grammar is proven coverage percentage figures on unseen parts of standardized corpora and the inherent properties of the formalisms used. For a suggested measurement method of the latter, see e.g. [Gambäck et al 1991]. 75 4.2. THE CLE GRAMMAR FORMALISM Classification of sentence types As a typical example of the usage of features in real unification grammars, we will now discuss how the different types of sentences can be separated from each other by an elaborate use of features. The type of a sentence (WH-question, q or normal, norm) is defined by that of the subject, via passing of a feature type. The features type, whmoved, and inv (inverted) all default to n and classify sentences as shown in Figure 4.5. Sentence type type whmoved inv Declarative Yes-No question Non-subject WH-question Subject WH-question norm norm q q n n y n n y y n Example kalle lever lever kalle vad gör kalle vem hyrde bilen Figure 4.5: Sentence classification by feature values All the sentence types in the table must contain a finite main verb, i.e., they must have vform=fin, so simple declarative sentences have a feature setting looking like [type=norm, whmoved=n, inv=n, vform=fin] That is, they must contain a finite verb, but may not contain any movement. Yes-No questions look like declaratives, but are inverted (inv=y), i.e., the main verb comes before the subject.2 Wh-questions come in two types, depending on whether the WH-word is the subject of the sentence or not. Non-subject WH-questions are treated as could be expected, i.e., as having undergone two movements. One of the verb to obtain an inverted sentence and one WH-word movement. Subject WH-questions are however treated as having undergone no movement at all. A grammar rule As an (rather hairy) example of a rule from the real grammar consider the following, which is the non-simplified Swedish CLE grammar version of a rule which should by now almost be like an old friend, the “normal” main sentence rule S → NP VP : 2 As pointed out in Section 2.4.1, any verb (not just auxiliaries as in English) can be moved in Scandinavian languages like Swedish. Thus the Yes-No question structure, rather than the declarative, has sometimes been assumed to be the basic sentence structure for this language group. This assumption is the one underlying so called “Diedrichsen schemata”, which will not be used here, see however for example [Diderichsen 1966]. 76 CHAPTER 4. 
THE CORE LANGUAGE ENGINE syn(s_np_vp_Normal, core, doc("[Jag flyger]", [d1,d3,d4,e1,e3,q2,q4], "Covers most types of finite and subjunctive, uninverted clause"), [s:[hascomp=n, sententialsubj=SS, conjoined=n | Shared], np:[nform=Sfm, vform=(fin\/att), agr=Ag, reflexive=_, pron=_, sentential=SS, wh=_, whmoved=_, passive=P, temporal=_ | NPShared], vp:[vform=(fin\/inf\/presp\/att), subjform=Sfm, agr=Ag, modifiable=_, headfinal=_, passive=P | VPShared] ]) :- NPShared = [relagr=RelAgr, type=Type, case=Case], VPShared = [gaps=Gaps, vform=VForm, vpellipsis=VE, svi=FrontedV], append(NPShared,VPShared,Shared). Some parts of the rule are new, apart from the introduction of a number of new features which will not be discussed here. One new part is the third argument doc which is used for documentation. It is actually optional and was thus not mentioned above. A more important aspect of the rule is the usage of lists for features which are simply shared between the constituents. The sentence shares some features with the noun-phrase and some with the verb-phrase. These features are represented by the two lists NPShared and VPShared and appended together to form the list Shared. This is completely equivalent to writing out all the features explicitly in the feature-value lists, but the usage of the shared lists reflects one of the overall endeavours of unification grammar theories: to keep the grammar rules proper as “clean” as possible. Yet another new concept in the rule can be seen if the values of the feature vform are scrutinized. Here they suddenly contain the symbol \/ which is used as a type-writer variant of the normal logical disjunction (OR, ∪). The CLE formalism also allows features to be given values containing conjunctions (AND, ∩, written as /\) or negation (¬, written \). The feature vform is in the real system defined as syn_feature_value_space(vform, [[fin, impera, perfp, presp, supine, inf, att, stem, supine_stem, n], [present,imperf,fut]]). that is, as a Cartesian product of the second list (giving present, past, and future tense, respectively) and the first list giving a number of different verb-forms such 4.3. MORPHOLOGY 77 as finite, imperative, past- and present participle, supine (a Swedish-specific verb-form), etc. A grammar rule could thus contain the following feature-value setting for vform vform=(inf\/(fin/\(present\/imperf))) defining it to be either infinite or finite and either present or past tense (the latter part could thus in this case have been written as (fin/\(\(fut))), i.e., finite and not future, which following the declaration of vform of course would be equivalent). 4.3 Morphology The CLE contains a rather rudimentary treatment of Swedish morphology, mainly treating the most common types of pre- and suffixing. Given this, we will not go into much detail describing the CLE morphological processing; however, for sake of completeness of the system description, we will in the following give a short overview of it. The strategy outlined in this section does work quite nicely for English, but it should be noted that a full-fledge treatment of morphology for a “real” language3 most certainly would involve a methodology akin to that of Koskenniemi’s “twolevel morphology” [Koskenniemi 1983]. For suggestions on how to implement such a strategy in Prolog, see [Pulman 1991] or [Abramson 1992]. 
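Before turning to the CLE's own rule formats, the flavour of simple affix handling can be conveyed by a deliberately naive, purely illustrative sketch; this is not the CLE segmentation component, and the suffix inventory below is made up for the example.

% split_suffix(+Word, -Root, -Suffix): propose every way of
% peeling a known suffix off the end of Word.
split_suffix(Word, Root, Suffix) :-
    suffix(Suffix),
    atom_chars(Word, Chars),
    atom_chars(Suffix, SuffixChars),
    append(RootChars, SuffixChars, Chars),
    RootChars \== [],
    atom_chars(Root, RootChars).

% A purely illustrative suffix inventory.
suffix(et).
suffix(ar).
suffix(r).

% ?- split_suffix(bilar, Root, Suffix).
% Root = bil,  Suffix = ar ;
% Root = bila, Suffix = r ;
% false.

The spurious second split illustrates why real segmentation rules, such as the CLE rule shown in the next subsection, condition each affix on the surrounding letters.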
As implemented, the parts of the CLE which in Figure 4.2 were (following normal practise) lumped together as the morphology component really consists of several different modules, of which the most important are called Segmentation: splits words into bits (lexemes / morphemes) Morphology: assigns syntactic categories to words Derivation: assigns semantic categories to words Even though it forms one distinct module in the CLE, the field of morphology as such generally includes the others, but is also commonly divided into at least three subfields: inflectional and derivational morphology and the formation of compounds. We will only describe segmentation and inflectional morphology in this text, since the other parts would not add anything interesting to the discussion. 3 As far as inflectional morphology is concerned, English can be viewed as more or less a toy example, thus (at least partially!) motivating the rather hash judgement of it not being a “real” language in this sense. Swedish morphology is actually quite a simple case, too, albeit it is substantially much more rich than the English one. 78 CHAPTER 4. THE CORE LANGUAGE ENGINE 4.3.1 Segmentation The purpose of the segmentation component is to locate all possible root-form and affix combinations in the input sentence. This is accomplished with segmentation rules, which define suffixes, prefixes, and infixes in terms of the surrounding letters. Although the exact details of how this is done is not really that important, consider as an example the following segmentation rule for the suffix ‘-et’ forming the definite form of second, fourth or fifth declension nouns (as in bord → bordet ) and the perfect participle of fourth conjungation verbs (e.g., riv → rivet ) suffix_rule([’-et’], [pair(v1t,v1), % % pair(l0l2c5et,l0l2ec5)], % % pair(et,[])). % leendet = leende hjärtat = hjärta offret = offer + tecknet = tecken tåget = tåg + et + et + et et + et In the rule v1 and c5 are interpreted as variables over classes of letters, in this case all vowels resp. ‘r’, ‘l’, and ‘n’. l0 and l2 match all Swedish letters and all letters but ‘m’ (since ‘m’ follows some rather special spelling rules). The rule consists of three arguments, defining the letters of the suffix (et), non-default cases, and the default case, respectively. The third argument here simply says that the ‘-et’ ending by default is obtaine adding the two letters to the word without deleting anything from it (e.g., tåg → tåget). The second argument is the most interesting of this rule example. It defines non-default cases which may optionally be applied in addition to the default case. Each pair defines a range of letters to add and delete from the word when forming the ‘-et’ ending form. The non-defaults of this rule give different syncopation4 cases, so for example the first pair says that a word ending with a vowel could have the suffix ‘e’ removed (as in hjärta → hjärtat ), while the second pair basically says that a word ending with ‘er’, ‘el’ or ‘en’ could have the root-form ‘e’ removed. 4.3.2 Inflectional morphology The, for our purposes, most interesting part of the morphology component is the (syntactical) morphological inflection rules. These rules define how different affixes can be added to root-forms while taking into account restrictions introduced by the values of several features. The formalism used for the rules closely follows the one given before for the grammar, now with morph as the functor name. 
Abstractly, a morphology rule is: 4 syncope is the loss of one or more sounds or letters in the interior of a word. 4.4. COMPOSITIONAL SEMANTICS 79 morph(hmorph-rule-idi, [hmother-categoryi, hdaughter-category1 i, . . . , hdaughter-categoryn i]). Here, the daughters normally consist of a root-form followed by affixes; all the parts of the rule are given as categories with feature-value pairs, as usual. The root-form can be either a lexicon form or the output of another morphology rule. The affixes are treated as normal lexicon items and defined by rules like the following, which gives ‘-ar’ as the plural ending for second declension nouns: lex(’-ar’,[’PLURAL’:[synmorphn=2,lexform=’-ar’]]). The category of the affix is here simply given as PLURAL, while the feature synmorphn holds the “syntactical morphological category of a noun”, i.e., the noun declension. An example of a morphology rule is the noun plural one, nbar_nbar_plural which in essence says that adding the right ending to the singular stem gives the plural, changing only the value of the agreement feature agr. All the other features are the same on the mother and the daughter noun and passed as a list of shared (unified) features in the same fashion as in the grammar: morph(nbar_nbar_plural, [nbar:[agr=plur | Shared], nbar:[agr=sing | Shared], ’PLURAL’:[synmorphn=Morph] ]) :- Shared=[def=n, mass=n, measure=M, nn_infix=Inf, synmorphn=Morph, temporal=T, subcat=S, lexform=L, paradigm=Par, simple=y]. What the “right ending” for a specific noun is depends on its declension, i.e., on the value of synmorphn, which as we can see must unify between all the components of the rule. Its value on the root-form (daughter) nbar is defined in the noun’s lexicon entry.5 4.4 Compositional semantics Language theory is commonly divided into syntax, semantics and pragmatics. The distinctions between these fields are not at all clear-cut, and several different definitions of them have been given over time. Here we will adopt the following division: 5 As the morphology rules really can be viewed upon as being part of the grammar, the general philosophy of keeping the grammar clean from information which can be lexicalized applies to them as well. 80 CHAPTER 4. THE CORE LANGUAGE ENGINE Syntax: defines how signs are related to each other. Semantics: defines how signs are related to things. Pragmatics: defines how signs are related to people. Semantics in turn can be divided into compositional semantics, which defines the (in some sense) abstract “meaning” of a sentence from the meaning of the parts, and situational semantics which borders pragmatics in adding contextdependent information to the interpretation. As indicated already at the beginning of Chapter 1 the main purpose of natural-language systems is to translate an input utterance from natural language to some type of internal representation, i.e., to interpret the utterance. To accomplish this, the semantic processing of the CLE really consists of several steps, Semantics: assigns semantic categories to phrases. Reference resolution: gets references for pronouns, etc. Scoping: determines the scope of quantifiers and such. the first of which, i.e., the part which really is the compositional semantics, is the subject of this section. The others are context-dependent and will be briefly discussed later on (in Section 4.5). However, we will start out by describing what we are aiming for, that is, the internal representation, the logical forms. 
The discussion of logical forms for the CLE will by necessity be an abbreviated version of the one given in [Alshawi & van Eijck 1989] and [Alshawi (ed.) 1992]. The CLE formalism and its treatment of semantics in general in turn builds on the work originally done by Richard Montague described in [Thomason 1974]. For an introduction to Montague semantics see [Dowty et al 1981] or [Gamut 1991]. 4.4.1 The logical formalism The logical formalism of the CLE is called “Quasi-Logical Form” (QLF), indicating that it is not a “pure” logical form (LF). In particular, transforming the QLF expressions formed by the semantic processing into such “pure” (true) LF expressions requires: 1. Fixing the scopes of quantifiers and operators. 2. Resolving pronouns, definite descriptions, ellipsis, underspecified relations and vague quantifiers. 3. Extracting the truth conditional information. 4.4. COMPOSITIONAL SEMANTICS 81 After these steps, partially described in Section 4.5, we would get what Hiyan Alshawi likes to refer to as “fully instantiated QLF” [Alshawi & Crouch 1992]. In the following, however, we will somewhat sloppily call this form just LF (logical form) and discuss it first before moving on to QLF. A particular (uninstantiated) QLF expression may correspond to several, possibly infinitely many, LF expressions, i.e., instantiations. However, the LF language as it will be defined here is in fact just a sublanguage of the QLF language; there are additional “quasi logical” constructs for unscoped quantifiers, unscoped descriptions, unresolved references and unresolved relations, to be discussed shortly. Logical form requirements When using logical forms as the internal representation of natural-language utterances, we must make sure that they satisfy the following requirements: • LFs should be expressions in a disambiguated language, i.e., alternative readings of natural language expressions should give rise to different logical forms. • LFs should be suitable for representing the literal “meanings” of naturallanguage expressions, i.e., they should specify the truth conditions of (appropriate readings of) the original natural-language expressions. • LFs should provide a suitable medium for the representation of knowledge as expressed in natural language, and they should be a suitable vehicle for reasoning. The predicate logic part The formalism used in the CLE is a higher order logic, in which extensions to first order logic (FOL) have been motivated by trying to satisfy the above requirements with respect to the range of natural-language expressions covered. For now we will restrict ourselves to the predicate logic part (in BNF-like rules in Figure 4.6) and introduce the higher order extensions later on: In this notation, the logical form: [or, [anka1,kalle1], [struts1,kalle1] ] expresses the proposition that Kalle is a duck (anka) or an ostrich (struts), with anka1 and struts1 being one place predicates, kalle1 a constant, and or the usual disjunction operator. 82 CHAPTER 4. THE CORE LANGUAGE ENGINE hf ormulai ::= [hpredicatei,hargument1 i, . . . ,hargumentn i] | [not,hf ormulai] | [and,hf ormulai,hf ormulai] | [or,hf ormulai,hf ormulai] | [impl,hf ormulai,hf ormulai] | quant( hquantif ieri,hvariablei, hf ormulai,hf ormulai) hpredicatei ::= snarka1 | anka1 | struts1 | geq . . . hargumenti ::= htermi htermi ::= hvariablei | hconstanti hvariablei ::= X | Y... hconstanti ::= kalle1 | lena1 . . . 
hquantif ieri ::= forall | exists Figure 4.6: BNF definition of the predicate logic part The notation allows restricted first order quantifiers. For a simple Swedish sentence like Alla pojkar ser Lena, a logical form translation in the notation above contains quantified variables: quant(forall,P,[pojke1,P], quant(exists,H,[event,H], [se1,H,P,lena1])) Here P and H are variables bound by the familiar first order logic quantifiers (i.e., ∀P and ∃H). This logical form can be paraphrased as För varje pojke P, existerar det en händelse, H, sådan att P ser Lena. 4.4.2 The semantic rule formalism The semantic rules indicate how the meaning of a complex expression is composed of the meanings of its constituents. Every syntactic rule of the CLE has one or more corresponding semantic rule. Each semantic rule thus has a semantic rule identifier distinguishing it from other cases for the semantics of the same syntax rule: sem(hsyntax-rule-idi, hsemantic-rule-idi, [(hlogical-formi, hmother-categoryi), hdaughter-pair1 i, . . . , hdaughter-pairn i]). 4.4. COMPOSITIONAL SEMANTICS 83 To distinguish the semantics from the syntax, the semantic cases will in the following be referred to as hsyntax-rule-idi*hsemantic-rule-idi, thus including the semantic rule identifier in the rule name. The logical form paired with the mother category forms a QLF expression, normally not fully instantiated, corresponding to the semantic analysis of the constituent analysed by the syntax rule whose identifier is hsyntax-rule-idi. The format of the semantic rule mother category follows that of the syntax rule, i.e., it is Category:FVList. The mother logical form typically contains variables that are unified with the semantic analyses of the daughter constituents. Such variables appear as the left-hand elements of the daughter pairs: (hdaughter-qlf-variablei,hdaughter-categoryi) In the example semantic rule below, the variable Nbar stand for the semantic analysis of the daughter, and by unification its value appear in the logical form template associated with the mother: sem(np_nbar_Def, mass, [( term(_,ref(def,bare,mass,l(Ain)),V,Nbar), np:[handle=v(V), anaIn=l(Ain) | Shared]), (Nbar, nbar:[agr=sing, mass=y, anaIn=l([V-sing|Ain]) | Shared]) ]) :- Shared=[anaOut=Aout, semGaps=Gaps]. Categories appearing in semantic rules may include specifications for the values of syntactic features (i.e., features that have appeared in some syntax rule, for example mass in the rule above). This is commonly used when distinguishing different semantic cases of the same syntactic rule. The semantic rule above is actually a version of a rule NP → N [+def, +mass, +sing] for forming complete noun-phrases from simple definite form nouns. This semantic case of the rule can be used for singular agreement mass nouns, as indicated by the semantic-rule identifier mass and shown by the values of the features agr and mass. The setting of the feature def is, however, not shown in the semantic rule, since it is the same for all the cases of the syntactic rule (it does show up in the syntactic rule, of course). Forming noun-phrases from simple nouns is quite common and there are other semantic cases of the corresponding syntax rule treating singular nonmass nouns and plural nouns. There are also specific NP-formation rules for indefinite plural and mass nouns which also can form NPs on their own. 84 CHAPTER 4. THE CORE LANGUAGE ENGINE The syntactic features in the rule are complemented with specific semantic features. 
Semantic features are often used to hold logical form fragments passed between mother and daughter constituents. So for example handle in the rule above holds a logical-form variable V which appears both in the QLF for the mother NP and in the feature anaIn on the daughter N . In a parallel fashion to the corresponding syntactic rules, there are cases where the set of daughter items in a semantic rule is empty, or where a daughter position may be unified with part of the feature structure of another daughter. Empty constituents are treated by gap-threading (as in the syntax) and passed as the value of the feature SemGaps (which holds a difference list just like its corresponding syntactic feature gaps). The anaphora features anaIn and anaOut actually together form another difference list, giving the possible intra-sentential referents — the rule above adds a new one. Semantic features will be exemplified and more carefully described in the rest of this section, even though they of course are not substantially different from the syntactic ones in any sense. Quite importantly, for example, just like the syntactic features, the semantic features can have default declarations associated with them. These defaults have the general form semantic_feature_default(hfeaturei, hvaluei). 4.4.3 Semantic analysis The syntactic parsing phase generates all possible syntax trees of a given input sentence. These trees are fed to the semantic analysis phase, which turns the trees into appropriate quasi logical forms. If, for example, the sentence Flyget avgår (The flight leaves) was analysed, the CLE syntactic parser would return the only parse: [sigma_decl-1, [[s_np_vp_Normal-2, [[np_nbar_Def-3, [[lex-4,[flyg,-et]]]], [vp_v_comp_Normal-5, [[lex-6,[avgå,-r]]]] ]]]] The CLE parser output is just another form of the implicit parse tree described in Section 2.4, thus listing the syntactic rules which have been applied in the parse. The numbers in the tree simply identify rule-applications, while the splitting of the terminals show the lexemes and suffixes used by the morphology rules. 4.4. COMPOSITIONAL SEMANTICS 85 The semantic analysis would then apply semantic rules with the same names as the syntactic ones in a top-down fashion, giving the QLF: [dcl, form(l([flyget,avgår]),verb(pres,no,no,no,y),A, B^ [B, [avgå_2p,A, term(l([flyget]), ref(def,bare,sing,l([])),_, C^[flyg1,C],_,_)]], _)] The details of this form will be explained in due time; for now, we only note that avgå_2p and flyg1 are semantic sense names corresponding to the words in the input string. The 2p and 1 extensions simply distinguish the sense names from the lexemes and could have been chosen rather arbitrarily; however, e.g., the 2p actually indicates that the verb avgå is a two-place predicate, i.e., an intransitive verb (the first place of the predicate is for the event itself, as described in detail in Section 4.4.13 below). 4.4.4 Semantic rules The analysis in the previous section would have been obtained by the application of a number of semantic rules, all of which will be described in this section. Thus we will at the same time see a substantial portion of the most important rules of the real CLE grammar; however, to improve readability (and hopefully also intelligibility), the rules will in the following be stripped of most of their features. 
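To make the format of the implicit parse tree shown above concrete, the following small helper (not part of the CLE) walks a tree on the [RuleName-Id, Daughters] format and lists the rule applications in the top-down, left-to-right order in which the corresponding semantic rules will be applied; lexical leaves are returned together with their morphemes. It uses append/2 on a list of lists, as provided by for example SWI-Prolog and library(lists).

% rule_sequence(+Tree, -Rules): flatten the implicit parse tree
% into the top-down sequence of rule names.
rule_sequence([lex-_, Morphemes], [lex(Morphemes)]).
rule_sequence([Rule-_, Daughters], [Rule|Below]) :-
    Rule \== lex,
    maplist(rule_sequence, Daughters, Lists),
    append(Lists, Below).

% For the tree shown above:
% ?- rule_sequence([sigma_decl-1,
%        [[s_np_vp_Normal-2,
%          [[np_nbar_Def-3,      [[lex-4, [flyg, -et]]]],
%           [vp_v_comp_Normal-5, [[lex-6, [avgå, -r]]]]]]]],
%        Rules).
% Rules = [sigma_decl, s_np_vp_Normal, np_nbar_Def, lex([flyg, -et]),
%          vp_v_comp_Normal, lex([avgå, -r])].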
We will discuss the rules in the same order as in the implicit parse tree above, starting with the top-node: sem(sigma_decl, only, [([dcl,S],sigma:[]), (S,s:[type=norm, whmoved=n, inv=n, vform=fin, semGaps=([],[])]) ]). Many syntactic features show up again in the semantics. As discussed above, the only new feature here is semGaps — the semantic counter-part of the syntactic gap-list feature gaps. As in the syntax, the two arguments form a difference list specifying the gapsIn and gapsOut values, respectively. The variable S simply passes the logical-form value of the s-node up to the sigma-node. The extra operator dcl added to the QLF in the σ simply states that this is a declarative sentence (operators like this one are really higher-order extensions and will thus be further discussed in Section 4.4.9). 86 CHAPTER 4. THE CORE LANGUAGE ENGINE The semantic rule-identifier only indicates that this is the only semantic rule that can match the syntactic rule sigma_decl, or in other words, that there is a one-to-one correspondence between syntax and semantics in this case. sem(s_np_vp_Normal, only, [(Vp,s:[anaIn=Ain, anaOut=Aout | Shared]), (Np,np:[semGaps=([],[]), anaIn=Ain, anaOut=Anext]), (Vp,vp:[anaIn=Anext, anaOut=Aout | Shared]) ]) :- Shared=[subjval=Np, movedv=FrontedV, eventvar=E, semGaps=Gaps]. This rule states that the S-meaning is the VP-meaning with the meaning of the subject (subjval) plugged in. subjval is thus a feature carrying a QLFfragment. This is also true for eventvar, which carries a logical form variable matching the event (verb) itself. The values of the features anaIn and anaOut (which together form a difference list) shows that the possible anaphoric referents are “threaded” through the rule in the same fashion as gaps normally are. The semantic gaps, however, are not treaded here, but rather only unified between the VP and the S. No empty constituents are allowed in the noun-phrase. The feature movedv performs the same function as a corresponding syntactic feature svi (subject-verb inversion) and is used for the left-movement of verbs. Both movedv and svi keep the information necessary for getting the correct complements to the moved verb, and subject-verb agreement. The value of movedv is a pair of the same format as the semantic ruleconstituents: pair(hqlf-variablei, hcategoryi) The qlf-variable carries the meaning of the V down.6 A rule for empty verbconstituents, v_gap*only will plug it in where it was “moved” from: sem(v_gap, only, [(V,v:[svi=v:SynShared, movedv=pair(V,v:SemShared) | Shared])]) :- SemShared=[subjval=A, eventvar=E, arglist=Comp, semGaps=SemGaps, anaIn=Ain, anaOut=Aout, @shared_tense_aspect(TenseAspect)], SynShared=[agr(Agr), gaps=Gaps, subcat=S], append(SynShared,SemShared,Shared). 6 Yes, “down”. Since we are talking proper unification, the direction is hardly that important, but remember anyhow that semantic analysis in the CLE is a top-down process! 4.4. COMPOSITIONAL SEMANTICS 87 By using shared features for both syntax and semantics in the rule, we clearly see that the verb-gap has the same feature values as the moved verb. These values are held by svi and movedv, respectively. Of the other new features shown in the rule, @shared_tense_aspect passes any tense, aspect and mood information needed, while arglist performs the same function as the familiar syntactic feature subcat, i.e., holds the verbs subcategorization scheme (Comp) as instantiated in the lexicon. 
sem(np_nbar_Def, sing, [( term(_,ref(def,bare,sing,l(Ain)),V,Nbar), np:[anaIn=l(Ain) | Shared]), (Nbar, nbar:[agr=sing, mass=n, anaIn=l([V-sing|Ain]) | Shared]) ]) :- Shared=[anaOut=Aout, semGaps=Gaps]. This rule is just another case of the np_nbar_Def*mass rule which has already been discussed (in Section 4.4.2). The case above is used for forming an np from a singular definite nbar. The main reason for having three cases (the third one is for plurals) of this rule shows up in the new referent added to the anaphora list. This rule imposes the restriction on it that it must be singular. The logical form of the NP formed by a term (again, a higher-order expression which is discussed in more detail later) also indicates this: ref(def,bare,sing,...) shows that this is a singular definite form referential expression without a determiner (i.e., “bare”). sem(vp_v_comp_Normal, mainv, [(V,vp:Shared), (V,v:[semanticAux=n, arglist=Comp | Shared]) | Comp ]) :- Shared=[eventvar=E, subjval=A, movedv=FrontedV, @shared_tense_aspect(TenseAspect), semGaps=Gaps, anaIn=Ain, anaOut=Aout]. The final rule used in the analysis is the one for forming a verb-phrase from a verb and its complements. As can be seen, most information is unified between the v and the vp (e.g., the logical form passed in the variable V). The rule seen above is thus quite simple, but it is actually only the main verb case of the corresponding syntactic rule. The syntax rule treats almost any kind of verb and obligatory verb modification; the semantic rules, however, distinguish between whether the verb semantically behaves like an auxiliary or not. 88 CHAPTER 4. THE CORE LANGUAGE ENGINE Main verbs (and non-finite7 modals) are semanticAux=n, while tense auxiliaries and finite modals are semanticAux=y, meaning that they will not fully influence the semantics of the verb-phrase. For those verbs, the semantics of the verb-phrase will also need to take into account any information from the modified non-finite verb. 4.4.5 Sense entries The semantic rules of the previous section have to be complemented with a semantic lexicon in order to obtain the QLF shown on Page 85. The semantic lexicon entries are commonly referred to as “sense entries”, since they give the semantic senses of the syntactic lexicon items. A sense entry in the CLE looks as the following one for the count noun flyg, which as we can see has empty gap- and anaphora-lists. The logical form fragment of the entry contains the sense name (flyg1). sense(flyg, (A^[flyg1,A], nbar:[semGaps=(G,G), anaIn=Al, anaOut=Al])). For an intransitive verb like avgår, the sense entry is a bit more complicated: sense(avgår, (form(_,verb(T,Pf,Pg,M,A),Event, P^[P,[avgå_2p,Event,Subject]],_), v:[vform=impera, tense=no, modal=imp, perf=Pf, prog=Pg, active=A, eventvar=v(Event), subjval=Subject, semGaps=(G,G), anaIn=Al, anaOut=Al, arglist=[], subcat=[]])). Firstly, arglist and its syntactic counter-part syncat show the subcategorization to be empty. Secondly, a number of features are added to cope with tense and aspect; of these note only that the lexicon verb-form in the Swedish CLE is the imperative, since most other verb-forms can be formed easily from it. The imperative is special in more or less standing on the side of the tense-system, thus it has tense=no. Finally, the logical-form fragment of the verb contains a list with the sense name (avgå_2p) as well as variables for the event itself and for the subject. For non-intransitive verbs, this list would also contain the QLFs of the arguments. 
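The anaIn/anaOut and semGaps pairs used throughout these rules are ordinary in/out argument pairs threaded through the daughters. The following stand-alone toy fragment (none of these predicates exist in the CLE; the names and structures are purely illustrative) shows the mechanism in isolation: the subject NP contributes a referent, and a pronominal argument inside the VP may then pick it up.

% s_refs(+AnaIn, -AnaOut, -Tree): thread the referent list
% left to right through the daughters, as anaIn/anaOut do.
s_refs(Ain, Aout, s(NP, VP)) :-
    np_refs(Ain, Anext, NP),
    vp_refs(Anext, Aout, VP).

% A (hypothetical) referring NP adds its referent to the list.
np_refs(Ain, [Ref|Ain], np(Ref)).

% A (hypothetical) pronominal VP argument picks a referent
% from the list threaded in to it.
vp_refs(Ain, Ain, vp(pronoun(Ref))) :-
    member(Ref, Ain).

% ?- s_refs([], Refs, s(np(kalle), VP)).
% VP = vp(pronoun(kalle)), Refs = [kalle].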
Note that when forming a verb-phrase (with the rule vp_v_comp_Normal*mainv) we do not know what the subject of the verb is — but we do know that it is going to be unified with the value of the feature subjval (which it is for example in s_np_vp_Normal*only, see above)! 7 The non-finite verb-forms in Swedish are the infinites and the supine. 4.4. COMPOSITIONAL SEMANTICS 4.4.6 89 Higher order extensions After having acquinted ourselves with the first order logic part of the QLF language and how it is used, we now turn our attention to its higher order logic extensions which can be specified with the following additional BNF-like rules: hf ormulai ::= [hmood opi,hf ormulai] hmood opi ::= dcl | ynq | whq | imp hargumenti ::= hf ormulai hargumenti ::= habstracti hquantif ieri ::= en | wh . . . hquantif ieri ::= hvariablei^hvariablei^hf ormulai habstracti ::= hvariablei^hlambda bodyi hlambda bodyi ::= hf ormulai hlambda bodyi ::= habstracti The kinds of hf ormulai that are distinguished at top level are declaratives, yes/no questions, wh-questions, and imperatives, as marked by the mood operators dcl, ynq, whq, and imp respectively. However, the dcl operator is sometimes omitted. 4.4.7 Abstraction and application Lambda abstraction is used to construct functions of arbitrary complexity (properties of objects, relations between objects, and so on). In the LF notation X^[ful,X] corresponds to the more usual notation for lambda-abstraction, λx.f ul(x).8 The BNF rules show that terms, formulae and abstracts can all act as arguments to a functor-expression. For every functor expression the types of the arguments that it takes are fixed and the only way of forming higher order expressions is by means of the abstraction functor “^”. The logical counterpart of the abstraction functor “^” is the functor apply for lambda-application. This functor expresses the result of applying an abstract to an appropriate argument. Thus [apply,X^[kvinna1,X],lena1] reduces to [kvinna1,lena1] 8 The CLE variant is actually a type-writer version of the lambda-operator notation x̂ used by Montague. 90 CHAPTER 4. THE CORE LANGUAGE ENGINE In the LF representations produced by the CLE, the only uses of apply reduce properties to formulae by applying them to terms. This special case can be expressed as follows in a specific BNF rule: hargumenti ::= [apply,hvariablei^hf ormulai,hargumenti] 4.4.8 Generalized quantifiers Logical form quantifiers are not restricted to existentials and universals; these are simply special cases of generalized quantifiers. A generalized quantifier is a relation Q between two sets A and B (where A is called the restriction set and B the body set ) that satisfies some specific requirements, as shown in Figure 4.7. A restriction A∩B intersection B body Figure 4.7: Generalized quantifiers For present purposes it is enough to note that the requirements can be summarized as the condition that Q be insensitive to anything but the cardinalities of the sets A (the restriction set) and the set A ∩ B (the intersection set ). Thus a generalized quantifier with restriction set A and body set B is fully characterized by a predicate λnλm.Q(n, m) on n and m, where n = |A| and m = |A ∩ B|, as examplified by: • Alla pojkar snarkar is true if and only if the restriction set (the set of boys) equals the intersection set (the intersection of the set of boys and the set of people who snore). • Minst tre pojkar snarkar is true if and only if the intersection set contains at least three individuals. 
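Since the bound variable of an abstract is an ordinary Prolog variable, application can be simulated by plain unification. A one-clause sketch (this is not the CLE's own reduction machinery):

% beta_reduce(+Application, -Result): apply an abstract to its
% argument by unifying the bound variable with the argument.
beta_reduce([apply, X^Body, Arg], Body) :-
    X = Arg.

% ?- beta_reduce([apply, X^[kvinna1, X], lena1], Result).
% Result = [kvinna1, lena1]   (and X is bound to lena1)

A real implementation would first copy the abstract with copy_term/2, so that the same property can be applied to more than one argument without the bindings interfering.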
• Inte alla pojkar snarkar is true if and only if the the restriction set does not equal the intersection set. • De flesta pojkar snarkar is true (in neutral contexts) if and only if the size of the intersection set is greater than half the size of the restriction set. • Minst fyra, men som mest tio pojkar snarkar is true if and only if the intersection set contains at least four and at most ten individuals. 4.4. COMPOSITIONAL SEMANTICS 4.4.9 91 Statements and questions The logical forms that the CLE assigns to questions are similar to those for declarative statements. Logical forms for yes/no questions are distinguished from those for declarative statements by the top level (mood) operator ynq. For example, the logical form for the sentence 4.2 is the same as that for 4.1 except that the operator ynq has replaced the operator dcl. Flyget avgår. (4.1) [dcl, form(l([flyget,avgår]),verb(pres,no,no,no,y),A, B^ [B, [avgå_2p,A, term(l([flyget]),ref(def,bare,sing,l([])),_, C^[flyg1,C],_,_)]], _)] Avgår flyget? (4.2) [ynq, form(l([avgår,flyget]),verb(pres,no,no,no,y),A, B^ [B, [avgå_2p,A, term(l([flyget]),ref(def,bare,sing,l([])),_, C^[flyg1,C],_,_)]], _)] 4.4.10 “Quasi-logical” constructs The basic constructs by which the QLF language extends the LF language are the following: 1. Terms (term) for unscoped quantified expressions (ingen man, varje man), definite descriptions (mannen, de där männen), and unresolved references, such as pronouns, reflexives, etc. (han, det, sig). 2. Formulae for implicit relations (form). These are used for among other things genitives (Kalles bok ) and unresolved temporal and aspectual information (idag). 92 CHAPTER 4. THE CORE LANGUAGE ENGINE 3. “island” constructions (island) to block the raising of quantifiers outside “island” constituents, i.e., constituents which are isolated from the rest of the tree (see also Section 2.3.2). The BNF rules for the additional QLF term and formula constructs are given in Figure 4.8. hf ormulai ::= htermi ::= hstringi ::= hindexi ::= hvariablei | hconstanti | | form( hstringi,hcategoryi,hindexi, hrestrictioni,hf orm ref i) [island,hf ormulai] term( hstringi,hcategoryi,hindexi, hrestrictioni,hquantif ieri,hterm ref i) hindexi l(hwordsi) hrestrictioni ::= hvariablei^hlambda bodyi hf orm ref i ::= hvariablei | hf ormulai hterm ref i ::= hvariablei hcategoryi ::= q(hindexi) | ref(def,hdef initei,hnumberi,hantecedentsi) | ref(pro,hpronouni,hnumberi,hantecedentsi) hnumberi ::= plur | sing | mass hdef initei ::= den | bare . . . hpronouni ::= han | hon . . . hantecedentsi ::= l(hvariablesi) Figure 4.8: BNF definition of the “quasi-logical” part The hcategoryi arguments are categories in the sense of collections of linguistic attributes. They are used to pass linguistic information, including syntactic information, to the scoping and reference resolution phases. This can include information on number, reflexivity, the surface form of quantifiers, and so on. There are several types of categories identified by constants; the ones shown in the definition above are, for example, anaphoric expressions (e.g., ref for noun phrase reference) and phrase types (e.g., pro for pronoun, def for definite description). 4.4. COMPOSITIONAL SEMANTICS 93 The quantifier term notation above (together with the hquantif ieri definition on Page 89) thus replaces the simpler first-order quantification treatment (quant) given in Section 4.4.1. 
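The truth conditions in these examples depend only on the two cardinalities, so they can be written down directly. The predicates below are purely illustrative (they are not how the CLE represents quantifiers): q(Name, N, M) holds when the quantifier named Name is true for a restriction set of size N and an intersection set of size M.

% q(+Quantifier, +N, +M): N = |A| (restriction set),
% M = |A ∩ B| (intersection set).
q(alla,      N, M) :- N =:= M.          % all
q(minst_tre, _, M) :- M >= 3.           % at least three
q(inte_alla, N, M) :- N =\= M.          % not all
q(de_flesta, N, M) :- M * 2 > N.        % most (in neutral contexts)
q(minst_fyra_som_mest_tio, _, M) :-     % at least four, at most ten
    M >= 4,
    M =< 10.

% ?- q(de_flesta, 10, 6).   succeeds
% ?- q(alla, 10, 6).        fails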
4.4.11 Quantified terms and descriptions In quantified terms the category gives the lexical form of the determiner, and marks the singular/plural distinction. The QLF analysis of Några flyg avgår is: [dcl, form(l([några,flyg,avgår]),verb(pres,no,no,no,y),A, B^ [B, [avgå_2p,A, term(l([några,flyg]),q(tpc,någon,plur),_, C^[flyg1,C],_,_)]], _)] Information in term categories is used by scoping and reference resolution to decide the scope of a determiner, the quantifier it corresponds to, and whether a collective interpretation is possible. The island operator shown in the BNF rules serves to prevent unscoped quantifiers in its range from having wider scope than the operator. In particular, it is used to block the raising of terms out of relative clauses during the scoping procedure (Section 4.5.2). Definite descriptions are also represented as quantified terms in QLF. For example the compund noun phrase in Dallas-flyget avgår is translated as the term: term(l([dallas-flyget]),ref(def,bare,sing,l([])),_, C^ form(l([dallas-flyget]),nn,_, D^ [and,[flyg1,C], [D,C, term(_,proper_name(_),_, E^[name_of,E,Dallas_Stad],_,_ )]], _), _,_) The reference resolution phase (Section 4.5.1) determines whether to replace a definite description with a referent (giving a referential reading) or whether to convert it into a quantification (giving an attributive reading). 94 CHAPTER 4. THE CORE LANGUAGE ENGINE 4.4.12 Anaphoric terms Pronouns are represented in QLF as terms in which the restriction places constraints on a variable corresponding to the referent, and the category contains linguistic information which guides the search for possible referents. For example, in the QLF for Han avgår, the term for han is: term(l([han]),ref(pro,han,sing,l([])),_, C^[sex_GenderOf,Male,C],_,_) while the representation of honom in Kalle ser honom is: term(l([honom]),ref(pro,han,sing,l([C-sing])),C, E^[sex_GenderOf,Male,E],_,_)]], where C is the variable bound to Kalle. The value of hantecedentsi in the term category is a list of possible antecedents within the same sentence. It contains indices to the translations of noun phrases that precede or dominate the given pronoun in the sentence, allowing, for example, reference to bound variables. As can be seen in the example above, the antecedent list can impose restrictions on the bindings of the variables; here, the variable C is restricted to bind to a singular referent. 4.4.13 Event variables Verbs are not treated simply as relations between a subject and a number of VP complements. The event being described is introduced as an additional argument. The presence of an event variable allows optional verb phrase modifiers to be treated as predications on events, in first order fashion, which in turn permits a uniform interpretation to be given to prepositional phrases, whether they modify verb phrases or nouns. Take as an example the sentence Avboka biljetten i Boston. (4.3) which has two readings, depending on the attachment of the prepostional phrase i Boston. On one reading it expresses a property of the ticket and on the other a property of the event of cancelling the ticket. Several examples of this notorious PP-attachment problem has been seen before in the text and this is certainly no coincidence: this is one of the most common types of ambiguity found in natural languages, but one which is quite neatly represented in the logical form notation used in the CLE. 
The reading in which i Boston is taken to modify the verb phrase gives rise to the first logical form below, while the interpretation in which the prepositional phrase takes part in the noun phrase modification gives the second QLF:

[imp,
 form(l([avboka,biljetten,i,Boston]),
      verb(no,_,_,imp,_),A,
      B^[B,
         form(l([i,Boston]),prep(i),_,
              C^[C,v(A),term(l([Boston]),
                             proper_name(_),_,
                             D^[name_of,D,Boston_Stad],
                             _,_)],
              _),
         [avboka_Något,A,
          term(_,ref(pro,imp_subj,_,l([])),
               E,F^[personal,F],_,_),
          term(l([biljetten]),ref(def,bare,sing,l([E-_])),
               _,G^[biljett1,G],_,_)]],
      _)]

[imp,
 form(l([avboka,biljetten,i,Boston]),
      verb(no,_,_,imp,_),A,
      B^[B,
         [avboka_Något,A,
          term(_,ref(pro,imp_subj,_,l([])),
               C,D^[personal,D],_,_),
          term(l([biljetten,i,Boston]),
               ref(def,bare,sing,l([C-_])),
               E,F^[and,[biljett1,F],
                    form(l([i,Boston]),
                         prep(i),_,
                         G^[G,E,term(l([Boston]),
                                     proper_name(_),_,
                                     H^[name_of,H,Boston_Stad],
                                     _,_)],
                         _)],
               _,_)]],
      _)]

4.5 Later-stage processing

The final steps of the CLE processing can be thought of as forming the situational semantic part in the sense that they include information which is situation dependent. These steps are the ones shown in Figure 4.9. This section will very briefly discuss and exemplify the first two steps, reference resolution and scoping; plausibility checking simply involves checking the relevance of the logical form produced in the given context.

QLF
 ❄
reference resolution
 ❄
quantifier scoping
 ❄
plausibility checking
 ❄
LF

Figure 4.9: The resolution steps of the CLE (the three processing steps are collectively labelled “Resolution”)

4.5.1 Reference resolution

The main purpose of the reference resolution component is to determine the referents of underspecified “referring expressions” (terms and forms), such as names, pronouns, etc. We have seen some (partial) examples of this already; here, we will not go into any detail on how this is done, but simply give a full example. If the English CLE were fed the sentence Who works on CLAM-BAKE? the semantic processing would produce the following unresolved QLF:

[whq,
 form(l([who,works,on,CLAM-BAKE]),
      verb(pres,no,no,no,y), A,
      B^ [B,
          [work_on_BeEmployedOn, A,
           term(l([who]),
                q(tpc,wh,C),D,E^[personal,E],F,G),
           term(l([CLAM-BAKE]),
                proper_name(H),I,
                J^[name_of,J,CLAM-BAKE],K,L)]],
      M)]

Reference resolution would process the QLF above and replace the name “CLAM-BAKE” with its corresponding referent. Supposing the name in a given situation actually referred to a project, the unscoped, resolved logical form would be:

[whq,
 [work_on_BeEmployedOn,
  term(A,exists,B,C^[event,C],D,E),
  term(F,q(tpc,wh,G),H,I^[personal,I],J,K),
  SF(c(x^[project_Activity,x],CLAM-BAKE))]]

4.5.2 Scoping

The scoping component follows an algorithm which very simply can be said to take the following steps:

1. rewrite a formula containing a quantifier term into one in which the term has been given a scope as a quantifier

2. do this for all the quantifier terms in an unscoped logical form in all possible ways

3. apply local constraints to block certain alternatives (or to suggest a preference ranking among alternatives)

Again, we will not describe the algorithm in detail, but give a schematic sketch of the basic rewriting step below, followed by a full example of the functionality of the component.
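The sketch is deliberately simplified and is not the CLE's actual scoping machinery: here an unscoped formula is just an ordinary Prolog term that may contain subterms qterm(Quantifier,Variable,Restriction), and one application of scope_one/2 performs the rewriting in step 1 by lifting one such subterm out and giving its quantifier scope over the whole formula. Backtracking over repeated applications enumerates the alternative scopings of step 2.

% scope_one(+Formula0, -Scoped): lift one quantified term out of
% Formula0 and give it scope over the whole formula.
scope_one(Formula0, quant(Quant, X, Restr, Formula)) :-
    pull_qterm(Formula0, Formula, q(Quant, X, Restr)).

% pull_qterm(+F0, -F, -Q): F is F0 with one qterm/3 subterm replaced
% by its variable; Q records quantifier, variable and restriction.
pull_qterm(qterm(Quant, X, Restr), X, q(Quant, X, Restr)).
pull_qterm(F0, F, Q) :-
    compound(F0),
    F0 \= qterm(_,_,_),
    F0 =.. [Op|Args0],
    pull_args(Args0, Args, Q),
    F =.. [Op|Args].

pull_args([A0|As], [A|As], Q) :- pull_qterm(A0, A, Q).
pull_args([A|As0], [A|As], Q) :- pull_args(As0, As, Q).

% ?- scope_one(snarkar(qterm(forall,X,pojke(X))), Scoped).
% Scoped = quant(forall,X,pojke(X),snarkar(X))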
Continuing the example from the previous section, the scoped QLF produced would be

[whq,
 quant(wh, A, [personal,A],
       quant(exists, B, [event,B],
             [work_on_BeEmployedOn, B, A,
              SF(c(x^[project_Activity,x], CLAM-BAKE))]))]

where we can note that the quantifiers wh and exists have been determined to have scope over the entire sentence and over the verb phrase (event), respectively.

Chapter 5

The Basics of Information Retrieval

Sorting documents so that they can be found easily is difficult, especially if more than one reader is expected to be able to use the document collection.

5.1 Manual methods

The traditional way of organizing documents and books is to sort them physically on shelves according to predetermined categories. This generally works well, but finding the right balance between category generality and category specificity is difficult; the library client has to learn the categorization scheme; quite often it is difficult to determine what category a document belongs to; and quite often a document may rightly belong to several categories.

Some of these drawbacks can be remedied by installing an index to the document collection. Documents can be indexed in several ways and can be reached by any of several routes. Indexing means establishing correspondences between a set, possibly large and typically finite, of index terms or search terms and individual documents or sections thereof. Indexing is both difficult and often quite dull; it poses great demands on consistency from indexing session to indexing session and between different indexers. This is the sort of job which always seems to be a prime candidate for automatization.

5.2 Automatization

5.2.1 Words as indicators of document topic

The basic assumption of automatic indexing mechanisms is that the presence or absence of a word - or more generally, a term, which can be any word or combination of words - in a document is indicative of topic. The basic scenario of information retrieval is then finding the correct combination of query words, combined in a logical structure using Boolean operators such as AND, OR, and NOT, and then examining the set of documents that fit the query. The documents in the set match the query; the documents outside do not. If a document matches a query it is presented; if it does not, it will not be.

This set algebraic approach works reasonably well for document bases where the documents are short and concise: a list of literature abstracts or document titles, for instance. For full texts it is too imprecise: full documents contain too many spurious terms for Boolean matching to be practicable. Boolean systems are still in use, especially for trained documentalists; for untrained users, the Boolean approach has been largely abandoned in full-text retrieval systems - although the name retrieval has been retained for an activity which only in its extreme cases resembles retrieval - in favor of a more probabilistic approach, which ranks the retrieved documents by likelihood of providing relevant data for the resolution of the query. The basic improvement is the weighting of terms by assumed importance for a document.

5.2.2 Word frequencies - tf

The first step in finding index terms automatically is to build a list of the words in a text and sort them in order of frequency of occurrence. The more frequent terms are considered more valuable in proportion to their observed frequencies; a minimal sketch of such a frequency count is given below.
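The sketch (illustrative only; it is not taken from any of the systems discussed in this text) is in Prolog: given a list of word atoms it returns Count-Word pairs, most frequent word first.

:- use_module(library(lists), [select/3, reverse/2]).   % built in in some Prolog systems

% term_frequencies(+Words, -Ranked): Ranked lists Count-Word pairs
% sorted by descending frequency.
term_frequencies(Words, Ranked) :-
    count(Words, [], Counts),
    sort(Counts, Ascending),          % standard order sorts the pairs by count
    reverse(Ascending, Ranked).

% count(+Words, +Acc0, -Counts): accumulate one Count-Word pair per word.
count([], Counts, Counts).
count([W|Ws], Acc0, Counts) :-
    (   select(C0-W, Acc0, Acc1)
    ->  C is C0 + 1,
        count(Ws, [C-W|Acc1], Counts)
    ;   count(Ws, [1-W|Acc0], Counts)
    ).

% ?- term_frequencies([the,cat,sat,on,the,mat], Ranked).
% Ranked = [2-the,1-sat,1-on,1-mat,1-cat]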
The suggestion was first made by Hans Peter Luhn (1957, 1959), and the measure is commonly called term frequency or, imaginatively, tf for short. For this text, for instance, the list begins as shown in Table 1.

92 the
72 of
64 and
62 to
60 a
55 in
51 is
30 for
26 terms
25 documents
24 be
23 that
23 as
22 words
22 term
21 text
20 information
19 document
17 this
17 retrieval
16 are
...

Table 1. Frequency table of words in this text.

An obvious improvement in the enterprise to build a semantic representation from a list such as the one in Table 1 is to filter out certain words that seem to have little to do with topic. A list of such words, most often grammatical form words and other closed class words, is commonly called a stop list. Another improvement is to note - as Luhn does in his 1959 paper - that the most frequent words are seldom significant for this sort of enterprise, and that it thus might be possible to filter them out automatically.

Figure 1. Significance vs frequency. From Luhn (1959). (The curve itself is not reproduced here.)

the  a  and  that  one  it  two  may  could  such  next  just  half  both  of  to  in  for  ...

Table 2. Stoplist.

26 terms
25 documents
22 words
22 term
21 text
20 information
19 document
17 retrieval
14 idf
14 frequency
11 technical
10 word
10 indexing
10 collection
9 table
8 single
8 query
...

Table 3. Frequency table of words in this text, filtered with stoplist.

5.2.3 Normalizing inflected forms

As can be seen in Tables 1 and 3, the words "document" and "documents" both show up in the beginning of the list. The words "indexed" and "indexing" do not, and probably should - they show up further down in the list. Word form analysis, or morphological analysis, would conflate these forms and raise their combined weight.

It is unfortunate for the generality of the results in the field that the research and business language of the world currently is English, with an exceedingly spare morphology. Most information retrieval systems today make use of simple "stemming" - i.e. stripping suffixes from word forms without further analysis. This is sufficient, but of debatable utility, for English, but not for most other languages of the world. In comparison, languages with richer morphologies such as Finnish or French show much better gains from morphological analysis (Koskenniemi, 19xx; Jacquemin and Tzoukermann, 1997).

5.2.4 Uncommonly frequent words - idf

If we try to determine what terms in a document are significant for representing its content, we find that terms that are common in a document, but also common in all other documents, are less useful than others. The question is how specific a term is to a document. Collection frequency, inverse document frequency or idf is a measure of term specificity first proposed by Karen Sparck Jones (1972). It is defined as a function of N/d_i, where N is the total number of documents in the collection and d_i is the number of documents where term i occurs - the document frequency. This measure gives high value to terms which occur in only a few documents. Used alone, it gives about as useful results as term frequency used alone: idf is vectored towards high precision, while tf gives better recall, or indexing exhaustivity.
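The text above only fixes idf as "a function of N/d_i"; the usual instantiation, following Sparck Jones (1972), is logarithmic. A small Prolog sketch (anticipating the combination with tf described in Section 5.2.6 below):

% idf(+N, +Di, -IDF): inverse document frequency in its common
% logarithmic form; N is the collection size, Di the number of
% documents containing the term.
idf(N, Di, IDF) :-
    Di > 0,
    IDF is log(N / Di).

% tf_idf(+TF, +N, +Di, -Weight): term frequency multiplied by idf,
% the combined weight of Section 5.2.6.
tf_idf(TF, N, Di, Weight) :-
    idf(N, Di, IDF),
    Weight is TF * IDF.

% ?- tf_idf(14, 1000, 20, W).
% W = 14 * log(50), roughly 54.8.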
A modified idf measure - the weighted idf or widf - was suggested by Tokunaga and Iwayama (1994). Their measure is weighted for term frequency in the documents in which the term occurs: the widf is calculated as a function of df_i rather than d_i, where df_i refers to the frequencies of term i in the respective documents. Their experiments seem to indicate an improvement in performance.

5.2.5 Methodological problems with idf

The problem with idf as a measure is that it is unclear what universe the document frequency should be calculated over. The measures all depend on an N, a total number of documents, and establishing what the general usage of a term is may be difficult, if not impossible. In some cases a collection is so well defined that a collection internal idf is quite adequate; in others, where potential readers may not be aware of the collection setup or if the collection is very heterogeneous, it may not be. In the table below you will find idf scores for words in this document; the scores are calculated with respect to the top twenty documents retrieved by Altavista for the query "information, retrieval, algorithm".

0.019 the
0.020 and
0.020 retrieval
0.020 to
0.021 information
0.021 for
0.022 is
0.022 of
0.023 a
0.023 this
0.023 with
0.024 in
0.025 as
0.026 from
0.026 or
0.026 not
0.027 search
0.029 be
0.030 use
0.031 but
0.031 language
0.032 example
0.033 form
0.033 on
0.033 text
0.033 web
0.034 documents
0.034 first
0.034 queries
0.034 query
0.034 words
...
1 behavior
1 behaviour
...
1 karlgren
...
1 mathematical
1 mathematically
1 mathematics
...
1 meteorological
...
1 microwave
...
1 miscellaneous
...
1 morphology
...
1 nationwide
...
1 navigation
...
1 pulmonary
...
1 radio
...
1 redundancy
...

Table x. Inverted document frequencies for terms in this collection.

5.2.6 Combining tf and idf

There are various ways of combining term frequencies and inverse document frequencies, and from empirical studies (e.g. Salton and Yang, 1973) we find that the optimal combination may vary from collection to collection. Generally, tf is multiplied by idf to obtain a combined term weight. An alternative would be, for instance, to entirely discard terms with too low an idf - which seems to be slightly better for high-precision searches.

5.2.7 Document length effects

As the term weight is defined in the tf component of the combined formula, it is heavily influenced by document length. A long document about a topic is likely to have more hits for a relevant term than a short one will; this may not improve its likelihood of being relevant. Most algorithms in use introduce document length as a normalization factor of some sort. In Table y the formulae for OKAPI and the basic formulation of the SMART cosine formula are given (Robertson and Sparck Jones, 1996; Singhal et al, 1995). "k" in the OKAPI formula in Table y is a constant to be set after experimentation with the particular collection of texts. If it is set to zero, the OKAPI formula reduces to idf.

OKAPI:  tf(w,d) * idf(w) * (k + 1) / [k * dl(d)/dlbar + tf(w,d)]

Cosine: tf(d,w) / sqrt(sum_j tf(d,j)^2)

Table y. Document length normalization in SMART and OKAPI.
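Read directly off Table y, the two weightings can be sketched in Prolog as follows (illustrative only; the tf, idf, document length dl and average document length dlbar values are assumed to be supplied by the surrounding indexing code):

% okapi_weight(+TF, +IDF, +DL, +DLbar, +K, -W): the simplified OKAPI
% weight of Table y; with K = 0 it reduces to the idf value, as noted
% in the text.
okapi_weight(TF, IDF, DL, DLbar, K, W) :-
    W is TF * IDF * (K + 1) / (K * DL / DLbar + TF).

% cosine_weight(+TF, +AllTFs, -W): the basic SMART cosine normalization
% of Table y; the term frequency is divided by the Euclidean length of
% the document's tf vector.
cosine_weight(TF, AllTFs, W) :-
    sum_of_squares(AllTFs, S),
    W is TF / sqrt(S).

sum_of_squares([], 0).
sum_of_squares([T|Ts], S) :-
    sum_of_squares(Ts, S0),
    S is S0 + T * T.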
5.3 Words as indicators of information need

Given that we have a document representation in the form of a list of terms with attached weights - often called a term vector - however these weights are computed, and a similar representation of the query, the question is how to match the two term vectors to find documents that fit the query. One approach is to use the conventional scalar product of the two vectors, simply adding the pairwise products of the weights of each term under consideration.

Documents and queries are very different. Luhn's original model was for the searcher to compose an essay of approximately the same form as the sought-for documents, but in practice queries are of a different length and a different type than the target documents. The way the respective term vectors are to be treated can thus be expected to be very different. Empirical tests on small collections of texts of different types (Salton and Buckley, 1988) conclude, i.a., that length normalization for queries does not affect the result; that using an idf factor improves the result; and that when document lengths vary - as they usually do - length normalization is useful. However, lately it has seemed that if the length varies over a great range, the document length normalization should be damped somewhat (Singhal et al, 1996).

5.4 Query expansion

As has been established both by informal observation and several formal experiments, queries to information retrieval systems tend to be very short, the majority being three words or less (Rose and Stevens, 1996; Rose and Cutting, forthcoming). This gives very little purchase to most linguistically oriented methods, and one would wish to find methods which would encourage searchers to produce longer queries (Karlgren and Franzen, 1997).

5.4.1 Using retrieved documents

One method to get more textual material for fleshing out a short query is to submit the query, use the first few retrieved documents as a renewed query, and hope that the first few documents indeed are relevant.

5.4.2 Relevance Feedback

Alternatively, a retrieval system can present the retrieved set and have users note which documents seem useful at a glance. These relevant documents are then used as a query. Analogously, non-relevant documents can be discarded in the first iteration, and the terms in them weighted down in subsequent iterations. This technique - relevance feedback - can be extended by clustering the retrieved documents in similarity sets, if the system has an efficient clustering algorithm. Cutting et al have implemented the Scatter/Gather system, which does this. Users can select not only single documents for relevance feedback, but entire clusters of documents, represented by terms common to the entire cluster.

5.4.3 Extracting terms from retrieved documents

Terms can also be extracted from the retrieved documents for automatic query expansion (e.g. Robertson; Strzalkowski et al, 1997), possibly in combination with relevance feedback (e.g. Robertson).

5.5 Beyond single words

Counting solitary words is fine, but the idea that lone words by themselves carry the topic of the text is one of the more obvious over-simplifications in the model so far. Indexing texts about ice cream on "ice" and "cream" is intuitively less useful than looking at the combination "ice cream", just to take an obvious example. However, in experiments designed to test the usefulness of multi-word terms, any addition beyond single-word indexing has proven cumbersome and expensive in memory and processing, while adding comparatively little to performance.
On balance, the methods are useful, but single-word indexing is and will remain the most efficient way to index texts. The usefulness of searching a document base for multi-word terms rather than limiting the search to single terms is under debate. "It may be sensible, for some files, to index explicitly on complex or compound terms ... In general these elaborations are tricky to manage and are not recommended for beginners." (Robertson and Sparck Jones, 1996). In any case, the discriminatory power of single word terms is much stronger than that of any other information source (Strzalkowski et al, 1997) and, given the relatively low level of granularity of the text description, simple words are quite likely to be sufficient.

5.5.1 Collocations and Multi-word technical terms

One way of expanding the search to words beyond single terms is simply tabulating words that occur adjacently in the text - n-grams. For instance, Magnus Merkel has implemented a tool for retrieving recurrent word sequences in text (1994). Using more theoretical finesse, other types of arbitrary and recurrent combinations in the text - collocations - can be recognized and tabulated as well. Frank Smadja has implemented a set of tools (1992) for retrieving collocations of various types using both statistical and lexical information; he identifies three major types of collocations: predicative relations such as those that hold between verbs and their objects in recurrent constructions, set noun phrases, and phrasal templates, where only a certain slot varies from instance to instance.

To extract collocations of the second type, Justeson and Katz (1995) have added lexical knowledge to simple statistics, and use it to extract technical terms. Technical terms are a specific category of words which behave almost like names. They cannot easily be modified - their elements cannot be scrambled or replaced by more or less synonymous components, and they usually cannot be referred to with pronouns. Thus, the technical terms tend to stay invariant throughout a text, and between texts. Justeson and Katz's appealingly simple algorithm to spot multi-word technical terms tabulates all multi-word sequences with a noun head from a text, and retains those that appear more than once. This method gives a surprisingly characteristic picture of a text topic, given that the text is of a technical or at least non-fiction nature. Their major point is well worth noting: the fact that a complex noun phrase is used more than once identically is evidence enough for its special quality as a technical term. It is repetition, not frequency, that is notable for longer terms.

Strzalkowski has experimented with using head-modifier structures from fully parsed texts to extract index terms (1994): this normalizes phrases such as "information retrieval" and "retrieval of information" to the same index representation.

5.5.2 Concept Spotting

Names of people, organizations, places, and other entities can be spotted automatically with some degree of success (Strzalkowski and Wang, 1996).

5.5.3 Information Extraction

5.5.4 Text is not just a bag of words

All indexing approaches mentioned so far assume that words appear in a text somewhat randomly, in a Poisson-style distribution. This is naturally a gross simplification: words appear in a text not following a memory-less distribution but in a pattern governed by the textual topic progression (Karlgren, 1976).
This can conceivably be made use of in indexing: if text segments more likely to be topically pertinent are chosen, and terms within them weighted up as compared to terms from other sections, this weighting would reflect the topical make-up of the text better than a non-progressional model. Techniques potentially useful for this include summarization, text segmentation (e.g. Hearst) and clause-weighting approaches, for instance the foreground/background experiments performed by Karlgren.

5.5.5 Text is more than topic

Besides the topical analysis touched on in the preceding sections, it is worth noting that textual data are so much more than topic and information: texts belong to one or several genres, are of varying quality, are written for various purposes, and adhere to or transcend stylistic conventions.

5.5.6 Merging several information streams

As has been indicated above, the optimal way of putting index streams to use may vary from stream to stream. Justeson and Katz find that repetition, not frequency, is what makes a multi-word technical term an interesting index term; in experiments on single words, frequencies mostly tend to be more efficient as weightings. In experiments in recent TREC evaluations our group has used a multi-stream approach, where every stream is indexed and searched separately; the resulting searches are merged to present a unified result (Strzalkowski et al, 1996). A numerical optimization for weighting several knowledge sources has been done by Bartell et al. (19xx).

5.6 Evaluating information retrieval

Information retrieval has an unusually well defined set of evaluation tools.

5.6.1 How exhaustive is the search? - Recall

If one has a good picture of how many relevant documents a document base contains for some query, one can calculate the proportion of the total set of relevant documents found and retrieved by an algorithm. This proportion is called recall.

5.6.2 How much garbage? - Precision

The proportion of relevant documents in a retrieved set is called precision and is a complementary measure to recall.

5.6.3 Combining precision and recall

Trivially, if an algorithm always retrieves all documents in a document base, it has one hundred per cent recall. However, it presumably has low precision. Typically recall and precision are plotted against each other; in the TREC evaluations, e.g., an "11-point" average measure is used, with precision measured at every 10 percent of recall, and the average figure used as a total measure (e.g. Harman, 1996).

5.6.4 What is wrong with the evaluation measures?

• We do not have a good picture of how many documents are relevant in a document base.

• Indeed, often we cannot determine what the 'entire document base' is.

• A query is well defined experimentally, but its counterpart in real life is less well defined. Users often cannot pose their information needs in succinct search terms.

• The retrieved set is not delimited. A list of several thousand documents is presumably not very useful to a human user; should one set an arbitrary cutoff point at some figure?

• Averaging recall-precision trade-offs in e.g. 11-point averages masks algorithm differences: some algorithms may do very well in high-precision searches and less well in high-recall cases; some may do well in cases where there are very few documents to be found, others do better when the document base is saturated with material for the topic at hand.
• Most importantly: what is ”relevant”? 5.7 References Nicholas J. Belkin. 1994. “Design Principles for Electronic Texytual Resources: Investigating Users and Uses of Scholarly Information”, Studies in the memory of Donald Walker, Kluwer. Benedict du Boulay, Tim O’Shea, and John Monk. 1981. “The Black Box Inside the Glass Box: Presenting Computing Concepts to Novices”, International Journal of Man-Machine Studies, 14:237-249. Naoufel Ben Cheikh and Magnus Zackrisson. 1994. “Genrekategorisering av text för filtrering av elektroniska meddelanden” Stockholm University Bachelor’s thesis in Computer and Systems Sciences Stockholm University. Jussi Karlgren. 1990. “An Algebra for Recommendations”, Syslab Working Paper 179, Department of Computer and System Sciences, Stockholm University, Stockholm. Robert Kass. 1991. “Building a User Model Implicitly from a Cooperative Advisory Dialog”, UMUAI 1:3, pp.203-258 Ann Lantz. 1993. “How do experienced users of Usenet News select their information?”. IntFilter Working Paper No. 3, Department of Computer and Systems Sciences, University of Stockholm. Sadaako Miyamoto. 1989. Fuzzy Sets in Information Retrieval and Cluster Analysis. Dordrecht: Kluwer. Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergström, John Riedl. 1994. “GroupLens: An Open Architechture for Collaborative Filtering of Netnews”, Procs. CSCW 94, Chapel Hill. 5.7. REFERENCES 111 Elaine Rich. 1979. “User Modeling via Stereotypes” Cognitive Science, vol 3:pp 329-354. Gerald Salton and Michael McGill. (1983). Introduction to Modern Information Retrieval New York: McGraw-Hill. Gerard Salton and James Allan. 1994. “Automatic Text Decomposition and Structuring”, Procs. 4th RIAO – Intelligent Multimedia Information Retrieval Systems and Management, New York. Donald E. Walker. 1981. “The Organization and Use of Information: Contributions of Information Science, Computational Linguistics, and Artificial Intelligence”, Journal of the American Society for Information Science 32, (5), pp. 347-363. Hans Iwan Bratt, Hans Karlgren, Ulf Keijer, Tomas Ohlin, Gunnar Rodin. (1983). “En liberal datapolitik”, Tempus, 16-19/12, Stockholm. Sadaako Miyamoto. (1989). Fuzzy Sets in Information Retrieval and Cluster Analysis. Dordrecht: Kluwer. Erik Andersson. 1975. “Style, optional rules and contextual conditioning. In Style and Text - Studies presented to Nils Erik Enkvist. Håkan Ringbom. (ed.) Stockholm: Skriptor and Turku: Åbo Akademi. Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press. Douglas Biber. 1989. “A typology of English texts”, Linguistics, 27:3-43. Chris Buckley, Amit Singhal, Mandar Mitra, Gerard Salton. 1996. “New Retrieval Approches Using SMART: TREC 4”. Proceedings of TREC-4. John Dawkins. 1975. Syntax and Readability. Newark, Delaware: International Reading Association. Nils Erik Enkvist. 1973. Linguistic Stylistics. The Hague: Mouton. Donna Harman (ed.). 1995. The Third Text REtrieval Conference (TREC-3). National Institute of Standards Special Publication. Washington. Donna Harman (ed.). 1996. The Fourth Text REtrieval Conference (TREC-4). National Institute of Standards Special Publication 500-236. Washington. Donna Harman (ed.). forthcoming. The Fifth Text REtrieval Conference (TREC-5). National Institute of Standards Special Publication. Washington. Fahima Polly Hussain and Ioannis Tzikas. 1995. 
“Ordstatistisk kategorisering av text för filtrering av elektroniska meddelanden” (Genre Classification of Texts by Word Occurrence Statistics for Filtering of Electronic Messages) Stockholm University Bachelor’s thesis in Computer and Systems Sciences, Stockholm University. George R. Klare 1963. The Measurement of Readability. Iowa Univ press. Irving Lorge. 1959. The Lorge Formula for Estimating Difficulty of Reading Materials. New York: Teachers College Press, Columbia University. Robert M. Losee. forthcoming. “Text Windows and Phrases Differing by Discipline, Location in Document, and Syntactic Structure”. Information Processing and Management. (In the Computation and Language E-Print Archive: cmplg/9602003). 112 CHAPTER 5. THE BASICS OF INFORMATION RETRIEVAL Tomek Strzalkowski. 1994. “Robust Text Processing in Automated Information Retrieval”. Proceedings of the Fourth Conference on Applied Natural Language Processing in Stuttgart. ACL. Ellen Voorhees, Narendra K. Gupta, Ben Johnson-Laird. 1994. “The Collection Fusion Problem”. Proceedings of TREC-3. Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. 19xx. Automatic Combination of Multiple Ranked Retrieval Systems. xxx...xxx. Donna K. Harman (ed.) 1993. The First Text Retrieval Conference (TREC-1), NIST SP 500-207, Gaithersburg, MD: National Institute of Standards and Technology. Donna K. Harman (ed.) 1994. The Second Text Retrieval Conference (TREC2), NIST SP 500-215, Gaithersburg, MD: National Institute of Standards and Technology. Donna K. Harman (ed.) 1995. Overview of The Third Text Retrieval Conference (TREC-3), NIST SP 500-225, Gaithersburg, MD: National Institute of Standards and Technology. [http://www-nlpir.nist.gov/TREC/] Donna K. Harman (ed.) 1996. The Fourth Text Retrieval Conference (TREC-4), NIST SP 500-236, Gaithersburg, MD: National Institute of Standards and Technology. [http://www-nlpir.nist.gov/TREC/] Donna K. Harman (ed.) 1997. The Fifth Text Retrieval Conference (TREC-5), NIST SP 500-xxx, Gaithersburg, MD: National Institute of Standards and Technology. [http://www-nlpir.nist.gov/TREC/] John S. Justeson and Slava M. Katz. 1995. Technical Terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 1, 9-27. Jussi Karlgren and Kristofer Franzń. 1997. Verbosity and Interface Design. Reptile working papers No. 2. [http://www.sics.se/ jussi/irinterface.html]. Hans Peter Luhn. 1957. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development 1 (4) 309-317. (Reprinted in H.P.Luhn: Pioneer of Information Science, selected works. Claire K. Schultz (ed.) 1968. New York:Sparta. Hans Peter Luhn. 1959. Auto-Encoding of Documents for Information Retrieval Systems. In Modern Trends in Documentation, M. Boaz (ed) London: Pergamon Press. (Reprinted in H.P.Luhn: Pioneer of Information Science, selected works. Claire K. Schultz (ed.) 1968. New York:Sparta. Magnus Merkel, Bernt Nilsson and Lars Ahrenberg. 1994. A Phrase-Retrieval System Based on Recurrence. In Proceedings from the Second Annual Workshop on Very Large Corpora. Kyoto. S. E. Robertson and Karen Sparck Jones. 1996. Simple, proven approaches to textretrieval. Technical report 356, Computer Laboratory, University of Cambridge. [http://www.cl.cam.ac.uk/ftp/papers/reports/TR356-ksj-approaches-to-text-retrieval.ps.gz] Daniel E. Rose and Curt Stevens. 1996. V-Twin: A Lightweight Engine for Interactive Use. 
Proceedings of the fifth Text Retrieval Conference, TREC-5. Donna Harman (ed), NIST Special Publication, Gaithersburg: NIST. 5.7. REFERENCES 113 Gerard Salton and Christopher Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management. 24 (5) 513-523. Gerard Salton and C. S. Yang. 1973. On the Specification of Term Values in Automatic Indexing. The Journal of Documentation. 29 (4) 351 - 372. Amit Singhal, Gerard Salton, Mandar Mitra, Chris Buckley. 19xx. Document Length Normalization. Cornell CS TR000. Frank Smadja. 1992. Retrieving Collocations from Text: XTRACT. Journal of Computational Linguistics. Special issue on corpus based techniques. Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation. December 1972. 28:1:11-20. Tomek Strzalkowski. 1994. Building a Lexical Domain Map from Text Corpora. In Papers presented to the Fifteenth International Conference On Computational Linguistics (COLING-94), Kyoto. Tomek Strzalkowski, Louise Guthrie, Jussi Karlgren, Jim Leistensnider, Fang Lin, Jose Perez-Carballo, Troy Straszheim, Jin Wang, Jon Wilding. 1997. Natural Language Information Retrieval: TREC-5 Report Proceedings of the fifth Text Retrieval Conference, TREC-5. Donna Harman (ed.), NIST Special Publication, Gaithersburg: NIST. Tomek Strzalkowski and Jin Wang. 1996. A Self-Learning Universal Concept Spotter. In Papers presented to the Sixteenth International Conference On Computational Linguistics (COLING-96), Copenhagen. Takenobu Tokunaga and Makoto Iwayama. 1994. Text categorization based on weighted inverse document frequency. Technical Report 94 TR0001. Department of Computer Science. Tokyo Institute of Technology. 114 CHAPTER 5. THE BASICS OF INFORMATION RETRIEVAL Bibliography [Abramson 1992] Harvey Abramson. 1992. “A Logic Programming View of Relational Morphology”. In Proceedings of the 14th International Conference on Computational Linguistics, volume 3, pp. 850–859, Nantes, France, July. ACL. [Alshawi & Crouch 1992] Hiyan Alshawi and Richard Crouch. 1992. “Monotonic Semantic Interpretation”. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pp. 32–39, Newark, Delaware, June. ACL. Also available as SRI International Technical Report CRC-022, Cambridge, England. [Alshawi (ed.) 1992] Hiyan Alshawi, editor, David Carter, Jan van Eijck, Björn Gambäck, Robert C. Moore, Douglas B. Moran, Fernando C. N. Pereira, Stephen G. Pulman, Manny Rayner, and Arnold G. Smith. 1992. The Core Language Engine. The MIT Press, Cambridge, Massachusetts, March. [Aho 1968] Alfred V. Aho. 1968. “Indexed grammars — an extension to contextfree grammars”. Journal of the ACM, 4:647–671. [Aho et al 1986] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers, Principles, Techniques and Tools. Addison-Wesley, Reading, Massachusetts. [Alshawi & van Eijck 1989] Hiyan Alshawi and Jan van Eijck. 1989. “Logical Forms in the Core Language Engine”. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 25–32, Vancouver, British Columbia, June. ACL. 115 116 BIBLIOGRAPHY [Chomsky 1957] Noam Chomsky. 1957. Syntactic Structures. Mouton, Haag, Holland. [Chomsky 1981] Noam Chomsky. 1981. Lectures on Government and Binding. Foris, Dordrecht, Holland. [Chomsky 1986] Noam Chomsky. 1986. Barriers. The MIT Press, Cambridge, Massachusetts. [Chytil & Karlgren 1988] Michail B. Chytil and Hans Karlgren. 1988. 
“Categorial grammars and list automata for strata of non-CF-languages”. In W. Busszkowski, W. Marciszewski, and J. van Benthem, editors, Categorial Grammar. John Benjamins, Amsterdam, Holland. [Colmerauer 1978] A. Colmerauer. 1978. “Metamorphosis Grammars”. In L. Bolc, editor, Natural Language Communication with Computers. SpringerVerlag, Berlin, Germany. [Diderichsen 1966] Paul Diderichsen. 1966. Helhed og Struktur — udvalgte sprogvidenskabelige afhandlinger. København. (in Danish). [Dowty et al 1981] David R. Dowty, Robert E. Wall, and Stanley Peters. 1981. Introduction to Montague Semantics. D. Reidel, Dordrecht, Holland. [Earley 1969] Jay Earley. 1969. “An Efficient Context-Free Parsing Algorithm”. In Readings in Natural Language Processing. Morgan Kaufmann, San Mateo, California. Reprint. [Gambäck et al 1991] Björn Gambäck, Hiyan Alshawi, David M. Carter, and Manny Rayner. 1991. “Measuring Compositionality in Transfer-Based Machine Translation Systems”. In J. G. Neal and S. M. Walter, editors, Natural Language Processing Systems Evaluation Workshop, pp. 141–145, University of California, Berkeley, California, June. ACL. [Gamut 1991] L. T. F. Gamut. 1991. Logic, Language, and Meaning, volume 2. The University of Chicago Press, Chicago, Illinois. (Gamut is a pseudonym for Johan van Benthem, Jeroen Groenendijk, Dick de Jongh, Martin Stokhof, and Henk Verkuyl). BIBLIOGRAPHY 117 [Gazdar et al 1985] Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag. 1985. Generalized Phrase Structure Grammar. Harvard University Press, Cambridge, Massachusetts. [Gambäck & Rayner 1992] Björn Gambäck and Manny Rayner. 1992. “The Swedish Core Language Engine”. In L. Ahrenberg, editor, Papers from the 3rd Nordic Conference on Text Comprehension in Man and Machine, pp. 71– 85, Linköping University, Linköping, Sweden, April. Also available as SICS Research Report R92013, Stockholm, Sweden and as SRI International Technical Report CRC-025, Cambridge, England. [Horrocks 1987] Geoffrey Horrocks, editor. 1987. Generative Grammar. Longman, London, England. [Kay 1980] Martin Kay. 1980. “Algorithmic Schemata and Data Structures in Syntactic Processing”. In Readings in Natural Language Processing. Morgan Kaufmann, San Mateo, California. Reprint. [Kay 1989] Martin Kay. 1989. “Head-Driven Parsing”. In Proceedings of the 1st International Workshop on Parsing Technologies, Pittsburgh, Pennsylvania. [Kaplan & Bresnan 1982] Ronald M. Kaplan and Joan Bresnan. 1982. “Lexical-Functional Grammar: A Formal System for Grammar Representation”. In Joan Bresnan, editor, The Mental Representation of Grammatical Relations, pp. 173–281. The MIT Press, Cambridge, Massachusetts. [Knuth 1965] Donald E. Knuth. 1965. “On the Translation of Languages from Left to Right”. Information and Control, 8(6):607–639. [Koskenniemi 1983] Kimmo Koskenniemi. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Doctor of Philosophy Thesis, University of Helsinki, Dept. of General Linguistics, Helsinki, Finland. [Milne 1928] Alan A. Milne. 1928. The House at Pooh-Corner. Methuen & Co Ltd. 118 BIBLIOGRAPHY [Matsumoto et al 1983] Yuji Matsumoto, Hozumi Tananka, Hideki Hirakawa, Hideo Miyoshi, and Hideki Yasukawa. 1983. “BUP: A Bottom-Up Parser Embedded in Prolog”. New Generation Computing, 1(2):145–158. [Pereira & Shieber 1987] Fernando C. N. Pereira and Stuart M. Shieber. 1987. Prolog and Natural Language Analysis. Number 10 in Lecture Notes. CSLI, Stanford, California. [Partee 1987] Barbara H. 
Partee, Alice ter Meulen, and Robert E. Wall. 1987. Mathematical Models in Linguistics. Kluwer, Boston, Massachusetts. [Pulman 1991] Stephen G. Pulman. 1991. “Two Level Morphology”. In Stephen G. Pulman, editor, Eurotra ET6/1: Rule Formalism and Virtual Machine Design Study, chapter 5. Commission of the European Communities, Luxembourg. [Pereira & Warren 1980] Fernando C. N. Pereira and David H. D. Warren. 1980. “Definite Clause Grammars for Natural Language Analysis”. Artificial Intelligence, 13:231–278. [Sells 1985] Peter Sells. 1985. Lectures on Contemporary Syntactic Theories. Number 3 in Lecture Notes. CSLI, Stanford, California. [Shieber 1985] Stuart M. Shieber. 1985. “Evidence against the context-freeness of natural language”. Language and Philosophy, 8:333–343. [Shieber 1986] Stuart M. Shieber. 1986. An Introduction to Unification-Based Approaches to Grammar. Number 4 in Lecture Notes. CSLI, Stanford, California. [Thomason 1974] Richmond Thomason, editor. 1974. Formal Philosophy: Selected Papers of Richard Montague. Yale University Press, New Haven, Connecticut. [Tomita 1986] Masaru Tomita. 1986. Efficient Parsing of Natural Language. A Fast Algorithm for Practical Systems. Kluwer, Boston, Massachusetts. BIBLIOGRAPHY 119 [Voutilainen et al 1992] Atro Voutilainen, Juha Heikkilä, and Arto Anttila. 1992. “Constraint Grammar of English”. Publication 21, Dept. of General Linguistics, University of Helsinki, Helsinki, Finland. [van Noord 1991] Gertjan van Noord. 1991. “Head Corner Parsing for Discontinuous Constituency”. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 114–121, University of California, Berkeley, California, July. ACL. [van Riemsdijk & Williams 1986] Henk van Riemsdijk and Edwin Williams. 1986. Introduction to the Theory of Grammar. The MIT Press, Cambridge, Massachusetts. [Wirén 1992] Mats Wirén. 1992. Studies in Incremental Natural-Language Analysis. Doctor of Philosophy Thesis, Linköping University, Dept. of Computer and Information Science, Linköping, Sweden, December. 120 BIBLIOGRAPHY Easing the use of computers by allowing users to express themselves in their own (“natural”) language is a field which has been given much attention during the last years. This booklet introduces natural-language processing in general and the way it is presently carried out at SICS. The overall goal of any system for natural-language processing system is to translate an input utterance stated in a natural language (such as English or Swedish) to some type of computer internal representation. Doing this requires theories for how to formalize the language and techniques for actually processing it on a machine. How this is done within the framework of the Prolog programming langauge is described in detail. The booklet is directed to an audience interested in user-friendly computer interfaces in general and natural-language processing in particular. The reader is assumed to have some knowledge of Prolog and of basic (school-book) grammar. ISRN SICS/P--94/01--SE ISSN 1100-4665