NATURAL LANGUAGE INFERENCE
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Bill MacCartney
June 2009
© Copyright by Bill MacCartney 2009
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Dan Jurafsky)
(Stanley Peters)
Abstract
Inference has been a central topic in artificial intelligence from the start, but while
automatic methods for formal deduction have advanced tremendously, comparatively
little progress has been made on the problem of natural language inference (NLI),
that is, determining whether a natural language hypothesis h can justifiably be in-
ferred from a natural language premise p. The challenges of NLI are quite different
from those encountered in formal deduction: the emphasis is on informal reasoning,
lexical semantic knowledge, and variability of linguistic expression. This dissertation
explores a range of approaches to NLI, beginning with methods which are robust but
approximate, and proceeding to progressively more precise approaches.
We first develop a baseline system based on overlap between bags of words. Despite
its extreme simplicity, this model achieves surprisingly good results on a standard NLI
evaluation, the PASCAL RTE Challenge. However, its effectiveness is limited by its
failure to represent semantic structure.
To remedy this lack, we next introduce the Stanford RTE system, which uses typed
dependency trees as a proxy for semantic structure, and seeks a low-cost alignment
between trees for p and h, using a cost model which incorporates both lexical and
structural matching costs. This system is typical of a category of approaches to NLI
based on approximate graph matching. We argue, however, that such methods work
best when the entailment decision is based, not merely on the degree of alignment,
but also on global features of the aligned ⟨p, h⟩ pair motivated by semantic theory.
Seeking still greater precision, we devote the largest part of the dissertation to
developing an approach to NLI based on natural logic. We present a new model of
natural logic which extends the monotonicity calculus of van Benthem and Sánchez-
Valencia to incorporate semantic exclusion and implicativity. We define an expressive
set of entailment relations—including representations of both semantic containment
and semantic exclusion—and describe the algebra of their joins. We then describe
a model of compositional entailment capable of determining the entailment relation
between two sentences connected by a sequence of edits. We introduce the concept
of projectivity signatures, which generalizes the concept of monotonicity classes to
cover the exclusion relations. And, we show how our framework can be used to explain
inferences involving implicatives and non-factives.
Next, we describe the NatLog system, a computational implementation of our
model of natural logic. NatLog is the first robust, general-purpose system for natural
logic inference over real English sentences. The NatLog system decomposes an infer-
ence problem into a sequence of atomic edits which transforms p into h; predicts a
lexical entailment relation for each edit using a statistical classifier; propagates these
relations upward through a syntax tree according to semantic properties of interme-
diate nodes; and joins the resulting entailment relations across the edit sequence. We
demonstrate the practical value of the NatLog system in evaluations on the FraCaS
and RTE test suites.
Finally, we address the problem of alignment for NLI, by developing a model of
phrase-based alignment inspired by analogous work in machine translation, including
an alignment scoring function, inference algorithms for finding good alignments, and
training algorithms for choosing feature weights.
Acknowledgments
To write a dissertation is a mighty undertaking, and I could not have reached the
finish line without the influence, advice, and support of many colleagues, friends, and
family. My thanks are due first and foremost to my advisor, Chris Manning, for being
willing to take me on as a student, and for being unfailingly generous with his time
and his support throughout my graduate career. Chris provided consistently good
advice, helped me to situate my work in a broader context, and gave me tremendous
freedom in pursuing my ideas, while always ensuring that I was making progress
toward some fruitful outcome. I will be forever grateful for his help.
I am deeply obliged to the other members of my dissertation committee for their
advice and encouragement. From Dan Jurafsky, I received a steady stream of positive
feedback that helped to keep my motivation strong. Stanley Peters was of great help
in working out various theoretical issues, and in making connections to the literature
of formal semantics. I am indebted to Johan van Benthem not only for his pioneering
work on the role of monotonicity in natural language inference—which was the direct
inspiration for my model of natural logic—but also for pushing me to consider the
theoretical implications of my work. Perceptive questions from Mike Genesereth also
helped greatly to sharpen my ideas.
I owe special thanks to David Beaver, who first brought my attention to the topic
of natural logic and the concept of monotonicity, and thereby set me on a path which
culminated in this dissertation.
The faculty, students, and visitors involved with the Stanford NLP Group have
provided a congenial and stimulating environment in which to pursue my studies.
I am particularly indebted to my frequent collaborators—including Marie-Catherine
de Marneffe, Michel Galley, Teg Grenager, and many others—who shared my strug-
gles and helped to shape my thoughts. I am also grateful to past officemates Iddo
Lev, Roger Levy, Kristina Toutanova, Jeff Michels, and Dan Ramage for their warm
companionship and for countless fascinating conversations.
Finally, I offer my deepest thanks to my siblings—Leslie, Jim, and Doug—whose
love and support has sustained me; to my parents, Louis and Sharon, who instilled
in me the love of learning that has fueled all my efforts; and to Destiny, who believed
in me most when I believed in myself least.
Contents
Abstract iv
Acknowledgments vi
3 Alignment for natural language inference 28
3.1 NLI alignment vs. MT alignment . . . . . . . . . . . . . . . . . . . . 29
3.2 The MSR alignment data . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 The MANLI aligner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 A phrase-based alignment representation . . . . . . . . . . . . 32
3.3.2 A feature-based scoring function . . . . . . . . . . . . . . . . . 34
3.3.3 Decoding using simulated annealing . . . . . . . . . . . . . . . 35
3.3.4 Perceptron learning . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Evaluating aligners on MSR data . . . . . . . . . . . . . . . . . . . . 38
3.4.1 A robust baseline: the bag-of-words aligner . . . . . . . . . . . 39
3.4.2 MT aligners: GIZA++ and Cross-EM . . . . . . . . . . . . . 39
3.4.3 The Stanford RTE aligner . . . . . . . . . . . . . . . . . . . . 41
3.4.4 The MANLI aligner . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Using alignment to predict RTE answers . . . . . . . . . . . . . . . . 44
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Entailment relations 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Representations of entailment . . . . . . . . . . . . . . . . . . . . . . 64
5.2.1 Entailment as two-way classification . . . . . . . . . . . . . . . 64
5.2.2 Entailment as three-way classification . . . . . . . . . . . . . . 65
5.2.3 Entailment as a containment relation . . . . . . . . . . . . . . 66
5.2.4 The best of both worlds . . . . . . . . . . . . . . . . . . . . . 68
5.3 The 16 elementary set relations . . . . . . . . . . . . . . . . . . . . . 70
5.3.1 Properties of the elementary set relations . . . . . . . . . . . . 72
5.3.2 Cardinalities of the elementary set relations . . . . . . . . . . 75
5.4 From set relations to entailment relations . . . . . . . . . . . . . . . . 76
5.5 The seven basic entailment relations . . . . . . . . . . . . . . . . . . . 77
5.5.1 The assumption of non-vacuity . . . . . . . . . . . . . . . . . 78
5.5.2 Defining the relations in B . . . . . . . . . . . . . . . . . . . . 78
5.6 Joining entailment relations . . . . . . . . . . . . . . . . . . . . . . . 80
5.6.1 “Nondeterministic” joins . . . . . . . . . . . . . . . . . . . . . 81
5.6.2 Computing joins . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6.3 Unions of the basic entailment relations . . . . . . . . . . . . . 84
6 Compositional entailment 86
6.1 Lexical entailment relations . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.1 Substitutions of open-class terms . . . . . . . . . . . . . . . . 88
6.1.2 Substitutions of closed-class terms . . . . . . . . . . . . . . . . 90
6.1.3 Generic deletions and insertions . . . . . . . . . . . . . . . . . 91
6.1.4 Special deletions and insertions . . . . . . . . . . . . . . . . . 92
6.2 Entailments and semantic composition . . . . . . . . . . . . . . . . . 92
6.2.1 Semantic composition in the monotonicity calculus . . . . . . 93
6.2.2 Projectivity signatures . . . . . . . . . . . . . . . . . . . . . . 93
6.2.3 Projectivity of logical connectives . . . . . . . . . . . . . . . . 94
6.2.4 Projectivity of quantifiers . . . . . . . . . . . . . . . . . . . . 98
6.2.5 Projectivity of verbs . . . . . . . . . . . . . . . . . . . . . . . 99
6.3 Implicatives and factives . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1 Implication signatures . . . . . . . . . . . . . . . . . . . . . . 100
6.3.2 Deletions and insertions of implicatives . . . . . . . . . . . . . 101
6.3.3 Deletions and insertions of factives . . . . . . . . . . . . . . . 102
6.3.4 Projectivity signatures of implicatives . . . . . . . . . . . . . . 104
6.4 Putting it all together . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5.1 An example involving exclusion . . . . . . . . . . . . . . . . . 107
6.5.2 Examples involving implicatives . . . . . . . . . . . . . . . . . 108
6.5.3 Different edit orders . . . . . . . . . . . . . . . . . . . . . . . 111
6.5.4 Inability to handle de Morgan’s laws for quantifiers . . . . . . 111
6.5.5 A more complex example . . . . . . . . . . . . . . . . . . . . . 113
6.6 Is the inference method sound? . . . . . . . . . . . . . . . . . . . . . 117
8 Conclusions 152
8.1 Contributions of the dissertation . . . . . . . . . . . . . . . . . . . . . 152
8.2 The future of natural language inference . . . . . . . . . . . . . . . . 155
List of Tables
6.11 Analysis of an example of de Morgan’s laws for quantifiers, second try 113
6.12 Analysis of a more complex inference, first try . . . . . . . . . . . . . 114
6.13 Analysis of a more complex inference, second try . . . . . . . . . . . . 115
6.14 Analysis of a more complex inference, third try . . . . . . . . . . . . 116
7.1 Examples of training problems for the lexical entailment model . . . . 132
7.2 An example of the operation of the NatLog model . . . . . . . . . . . 138
7.3 Performance of various systems on the FraCaS test suite . . . . . . . 143
7.4 Performance of NatLog on the FraCaS test suite, by section . . . . . 144
7.5 Confusion matrix for NatLog on the FraCaS test suite . . . . . . . . . 145
7.6 Performance of various systems on the RTE3 test suite . . . . . . . . 150
List of Figures
1.1 Some single-premise inference problems from the FraCaS test suite . . 7
1.2 Some multiple-premise inference problems from the FraCaS test suite 8
3.1 The MSR gold standard alignment for RTE2 problem 116 . . . . . . 31
3.2 The MSR gold alignment for RTE2 problem 116, as phrase edits . . . 33
3.3 The manli-align algorithm . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 The manli-learn algorithm . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 1

The Problem of Natural Language Inference
(1) p Several airlines polled saw costs grow more than expected,
even after adjusting for inflation.
h Some of the companies in the poll reported cost increases.
In the NLI problem setting, (1) is considered a valid inference, for the simple
reason that an ordinary person, upon hearing p, would likely accept that h follows.
Note, however, that h is not a strict logical consequence of p: for one thing, seeing
cost increases does not necessarily entail reporting cost increases—it is conceivable
that every company in the poll kept mum about increasing costs, perhaps for reasons
of business strategy. That the inference is nevertheless considered valid in the NLI
setting is a reflection of the informality of the task definition.
Although NLI involves recognizing an asymmetric relation of inferability between p
and h, an important special case of NLI is the task of recognizing a symmetric relation
of approximate semantic equivalence (that is, paraphrase) between p and h. (It is a
special case because, if we have a system capable of determining whether h can be
inferred from p, then we can detect semantic equivalence simply by running the system
both “forwards” and “backwards”.) Recognizing approximate semantic equivalence
between words is comparatively straightforward, using manually constructed thesauri
such as WordNet (Fellbaum et al. 1998) or automatically constructed thesauri such as
that of Lin (1998). But the ability to recognize when two sentences are saying more
or less the same thing is far more challenging, and if possible, could be of enormous
benefit to many language processing tasks. We describe a few potential applications
in the next section.
An intrinsic property of the NLI task definition is that the problem inputs are
expressed in natural language. Research on methods for automated deduction, by
contrast, typically assumes that the problem inputs are already expressed in some
formal meaning representation, such as the language of first-order logic. This fact
alone reveals how different the problem of NLI is from earlier work on logical infer-
ence, and places NLI squarely within the field of natural language processing (NLP):
in developing approaches to NLI, we will be concerned with issues such as syntactic
parsing, morphological analysis, word sense disambiguation, lexical semantic related-
ness, and even linguistic pragmatics—topics which are the bread and butter of NLP,
but are quite foreign to logical AI.
Over the last few years, there has been a surge of interest in the problem of NLI,
centered around the PASCAL Recognizing Textual Entailment (RTE) Challenge (Da-
gan et al. 2005) and within the U.S. Government AQUAINT program. Researchers
working on NLI can build on the successes achieved during the last decade in areas
such as syntactic parsing and computational lexical semantics, and begin to tackle
For example, an NLI system should be able to recognize that “was Lincoln’s Secretary
of State” can be inferred from the candidate answer “William H. Seward served as
Secretary of State under President Abraham Lincoln.” A capacity for NLI is even
more directly applicable to the CLEF Answer Validation Exercise,1 in which each
problem consists of a question, a proposed answer, and a supporting text, and the
goal is simply to return a boolean value indicating whether the answer is correct for
the question, given the text.
Semantic search. The goal of semantic search is to provide the ability to retrieve
documents from a very large collection (such as the World Wide Web) based on the
semantic content (rather than simply the surface words) of the documents and the
search query. If a user searches for people demonstrating against free trade, most
existing keyword-based search engines will return only documents containing the
terms demonstrating, free, and trade. However, a document containing the sentence
Protestors chanted slogans opposing the agreement to drop trade barriers might very
well be responsive to the user’s query, even though it fails to contain most of the
search terms. If an NLI system were used to identify approximate semantic equiva-
lence between search queries and sentences in source documents, then one could offer a
form of semantic retrieval not available in current keyword-based search. (Of course,
applying NLI to every document on the Web at query time is infeasible, so using
NLI to enable semantic search will require clever approaches to “semantic indexing”
and/or pre-filtering of candidate documents.)
that can be inferred from the rest of the summary. An even more basic goal of sum-
marization is correctness, i.e., the summary must accurately reflect the content of the
source document(s). In extractive summarization (Das and Martins 2007), where the
summary is constructed from snippets extracted from the source document(s), this
is rarely an issue. But whether or not the extractive strategy is used, NLI can be
useful in ensuring correctness, by checking that the summary is implied by the
source document(s). The application of NLI to summarization has been explored by
Lacatusu et al. (2006).
definition: for a particular problem set, the objective is simply to answer as many
problems correctly as possible.
In this section we survey various formulations of the NLI task, and introduce NLI
problem sets to which we will return in later chapters.
§1: Quantifiers
§2: Plurals
§3: Anaphora
§4: Ellipsis
178 p John wrote a report, and Bill said Peter did too.
h Bill said Peter wrote a report. yes
§5: Adjectives
§6: Comparatives
§9: Attitudes
335 p Smith believed that ITEL had won the contract in 1992.
h ITEL won the contract in 1992. unk
Figure 1.1: Some single-premise inference problems from the FraCaS test suite.
§6: Comparatives
Figure 1.2: Some multiple-premise inference problems from the FraCaS test suite.
short (averaging 11 words in RTE1, 8 words in RTE2, and 7 words in RTE3 and
RTE4). Examples of RTE problems are shown in figures 4.1 and 7.4.
In the first three years, the RTE Challenge was presented as a binary classification
task: the goal was simply to determine whether p entailed h (answer yes) or not
(answer no). Problem sets were designed to be balanced, containing equal numbers
of yes and no answers. Beginning with RTE4, participants were encouraged (but
were not obliged) to make three-way predictions, distinguishing cases in which h
contradicts p from those in which h is compatible with, but not entailed by, p (as in
the FraCaS evaluation).
Several characteristics of the RTE problems should be emphasized. Examples are
derived from a broad variety of sources, including newswire; therefore systems must
be domain-independent. The inferences required are, from a human perspective, fairly
superficial: no long chains of reasoning are involved. However, there are “trick” ques-
tions expressly designed to foil simplistic techniques. (Problem 2081 of figure 4.1 is a
good example.) The definition of entailment is informal and approximate: whether a
competent speaker with basic knowledge of the world would typically infer h from p.
Entailments will certainly depend on linguistic knowledge, and may also depend on
world knowledge; however, the scope of required world knowledge is left unspecified.
Despite the informality of the problem definition, human judges exhibit very good
agreement on the RTE task, with an agreement rate of 91–96% (Dagan et al. 2005). In
principle, then, the upper bound for machine performance is quite high. In practice,
however, the RTE task is exceedingly difficult for computers. Participants in the
first PASCAL RTE workshop reported accuracy from 50% to 59% (Dagan et al.
2005). In later RTE competitions, higher accuracies were achieved, but this is partly
attributable to the RTE test sets having become intrinsically easier—see section 2.2
for discussion.
the inference is invalid, but the vanilla bag-of-words model cannot recognize this. Not
only does every word in h also appear in p, but in fact the whole of h is an exact
substring of p. A related problem with the bag-of-words model is that, because it
ignores predicate-argument structure, it can’t distinguish between Booth shot Lincoln
and Lincoln shot Booth.
Both of these shortcomings can be mitigated by including syntactic information
in the matching algorithm, an approach exemplified by the Stanford RTE system,
which we’ll describe more fully in chapter 4. Nevertheless, all shallow approaches
will struggle with such commonplace phenomena as antonymy, negation, non-factive
contexts, and verb-frame alternation. And, crucially, they all depend on an assump-
tion of upward monotonicity.4 To see how this can go wrong, consider changing the
quantifiers in our example from Several and Some to Every and All (or to Most, or
to No and None). The lexical similarity of h to p will scarcely be affected, but the
inference will no longer be valid, because the poll may have included automakers who
did not report cost increases. And constructions which are not upward monotone are
surprisingly widespread—they include not only negation (not) and many quantifiers
(e.g., less than three), but also conditionals (in the antecedent), superlatives (e.g.,
tallest), and countless others: prepositions (e.g., without), verbs (e.g., avoid), nouns
(e.g., denial), adjectives (e.g., unable), and adverbs (e.g., rarely). In order properly
to handle inferences involving these phenomena, an approach with greater precision
is required.
negation, quantifiers, conditionals, and so on, and it can succeed in restricted do-
mains, but it fails badly on open-domain NLI evaluations such as RTE. The difficulty
is plain: truly natural language is fiendishly complex, and the full and accurate trans-
lation of natural language into formal representations of meaning presents countless
thorny problems, including idioms, ellipsis, anaphora, paraphrase, ambiguity, vague-
ness, aspect, lexical semantics, the impact of pragmatics, and so on.
Consider for a moment the difficulty of fully and accurately translating the premise
of example (1) into first-order logic. We might choose to use a neo-Davidsonian event
representation (though this choice is by no means essential to our argument). Clearly,
there was a polling event, and some airlines were involved. But what is the precise
meaning of several? Would three airlines be considered several? Would one thousand?
Let’s agree to leave these comparatively minor issues aside. We also have some costs,
and a growing event of which the costs are the subject (a single growing event, or one
for each airline?). And the growth has some quantity, or magnitude, which exceeds
some expectation (one expectation, or many?), held by some agent (the airlines?),
which was itself relative to some notion (held by whom?) of inflation (of what, and
to what degree?). This is beginning to seem like a quagmire—yet this is absolutely
the norm when interpreting real English sentences, as opposed to toy examples.
The formal approach faces other problems as well. Many proofs can’t be completed
without axioms encoding background knowledge, but it’s not clear where to get these.
And let’s not forget that many inferences considered valid in the NLI setting (such as
(1)) are not actually strict logical consequences. Bos and Markert (2005b) provided
a useful reality check when they tried to apply a state-of-the-art semantic interpreter
to the RTE1 test set—they were able to find a formal proof for fewer than 4% of the
problems (and one-quarter of those proofs were incorrect). Thus, for the problem of
open-domain NLI, the formal approach is a nonstarter. To handle inferences requiring
greater precision than the shallow methods described in section 1.4.1, we’ll need a
new approach.
In chapter 6, we’ll present a new theory of natural logic which extends the mono-
tonicity calculus to account for negation and exclusion, and also incorporates elements
of Nairn’s model of implicatives.
Seeking still greater precision, we devote the largest part of the dissertation to de-
veloping an approach to NLI based on natural logic. In chapters 5 and 6, we present a
new model of natural logic which extends the monotonicity calculus of van Benthem
and Sánchez-Valencia to incorporate semantic exclusion and implicativity. In chap-
ter 5, we define an expressive set of entailment relations—including representations
of both semantic containment and semantic exclusion—and describe the algebra of
their joins. Then, in chapter 6, we describe a model of compositional entailment
capable of determining the entailment relation between two sentences connected by
a sequence of edits. We introduce the concept of projectivity signatures, which gen-
eralizes the concept of monotonicity classes to cover the exclusion relations. And, we
show how our framework can be used to explain inferences involving implicatives and
non-factives.
In chapter 7, we describe the NatLog system, a computational implementation of
our model of natural logic. NatLog is the first robust, general-purpose system for nat-
ural logic inference over real English sentences. The NatLog system decomposes an
inference problem into a sequence of atomic edits which transforms p into h; predicts
a lexical entailment relation for each edit using a statistical classifier; propagates these
relations upward through a syntax tree according to semantic properties of interme-
diate nodes; and joins the resulting entailment relations across the edit sequence. We
demonstrate the practical value of the NatLog system in evaluations on the FraCaS
and RTE test suites.
Finally, in chapter 8, we summarize the contributions of the dissertation, and offer
some thoughts on the future of natural language inference.
Chapter 2

The Bag-of-Words Approach
How can we begin to approach the task of building a working model of natural lan-
guage inference? For logicians and semanticists, the most obvious approach relies
on full semantic interpretation: translate the premise p and hypothesis h of an NLI
problem into a formal meaning representation, such as first-order logic, and then
apply automated reasoning tools. However, the recent surge in interest in the prob-
lem of NLI was initiated by researchers coming from the information retrieval (IR)
community, which has exploited very lightweight representations, such as the bag-
of-words representation, with great success. If we approach the NLI task from this
direction, then a reasonable starting point is to represent p and h simply by (sparse)
vectors encoding the counts of the words they contain, and then to predict inferential
validity using some measure of vector similarity. Although the bag-of-words repre-
sentation might seem unduly impoverished, bag-of-words models have been shown to
be surprisingly effective in addressing a broad range of NLP tasks, including word
sense disambiguation, text categorization, and sentiment analysis.
We first alluded to the bag-of-words approach in section 1.4, where we described it
as one end of a spectrum of approaches to the NLI problem. Indeed, the bag-of-words
approach exemplifies what is perhaps the most popular and well-explored category
of approaches to NLI: those which operate solely by measuring approximate lexical
similarity, without regard to syntactic or semantic structure. Our goal in this chapter
is not to propose any particularly novel concepts or techniques; rather, we aim merely
to flesh out the details of a specific way of implementing the bag-of-words approach,
to evaluate its effectiveness on standard NLI test suites, and thereby to establish a
baseline against which to compare later results, both for alignment (chapter 3) and
for entailment prediction (chapters 4 and 7).
2.1.1 Approach
Our approach is directly inspired by the probabilistic entailment model described in
Glickman et al. (2005). Let P (h|p) denote the probability that premise p supports an
inference to (roughly, entails) hypothesis h. We assume that h is supported by p only
to the degree that each individual word hj in h is supported by p. We also assume
that the probability that a given word in h is supported is independent of whether
any other word in h is supported. We can thus factor the probability of entailment
as follows:
\[
P(h \mid p) = \prod_j P(h_j \mid p)
\]
In addition, we assume that each word in h derives its support chiefly from a single
word in p. Consequently, the probability that a given word in h is supported by p
can be identified with the max over the probability of its support by the individual
words of p:
\[
P(h_j \mid p) = \max_i P(h_j \mid p_i)
\]
The expression P (hj |pi ) can be interpreted as a sort of lexical entailment score be-
tween words pi and hj . We make no attempt to explain what it might mean for words
(as opposed to declarative propositions) to stand in an entailing relationship. Rather,
we focus on the purely pragmatic goal of choosing a lexical scoring function which
does a good job of predicting entailment between p and h. Glickman et al. (2005)
used a lexical scoring function based on web co-occurrence statistics. The model
we’ll describe in the following pages uses a much more simple-minded lexical scoring
function. However, we’ll consider other ways of enhancing the model’s effectiveness,
including weighting the words in h by their frequency in a large corpus; generating
not only a “forward entailment” score P (h|p) but also a “reverse entailment” score
P (p|h); and combining these scores in a maximum entropy model.
Note that, by matching each word in h to the word in p which best supports it,
this approach can also be viewed as inducing an alignment between the words of h
and p, akin to word alignments in statistical machine translation (Brown et al. 1993).
In fact, this model is closely analogous to the “heuristic model” of word alignment
described by Och and Ney (2003).
A good lexical scoring function will assign a score close to 1 to a pair of similar
(or identical) words, and will assign a score close to 0 to a pair of dissimilar words.
The best candidates for lexical scoring functions will therefore be measures of
lexical similarity, or perhaps the somewhat weaker notion of lexical relatedness. Can-
didate scoring functions might include:
\[
\text{sim}(w_1, w_2) = 1 - \frac{\text{dist}(\text{lemma}(w_1), \text{lemma}(w_2))}{\max(|\text{lemma}(w_1)|, |\text{lemma}(w_2)|)}
\]
(Here lemma(w) denotes the lemma of word w; dist() denotes Levenshtein string edit
distance; and | · | denotes string length.)
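As an illustration, the following Python sketch implements the string-edit-distance similarity defined above. The lemmatizer is a placeholder (simple lowercasing); a real system would use a morphological analyzer.

```python
# Sketch of sim(w1, w2) = 1 - dist(lemma(w1), lemma(w2)) / max(|lemma(w1)|, |lemma(w2)|),
# where dist is Levenshtein edit distance. lemma() here is only a placeholder.

def levenshtein(a, b):
    """Standard dynamic-programming string edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def lemma(w):
    return w.lower()   # placeholder for a real lemmatizer

def sim(w1, w2):
    l1, l2 = lemma(w1), lemma(w2)
    if not l1 and not l2:
        return 1.0
    return 1.0 - levenshtein(l1, l2) / max(len(l1), len(l2))

# e.g. sim("grow", "growing") is fairly high, while sim("airline", "poll") is low.
```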
The cost function has range [0, ∞]: aligning equal words will have cost 0; aligning
very different words will have very high cost. And, of course, if sim is symmetric,
then cost will be symmetric as well.
Following the approach described in section 2.1.1, the cost of aligning a given
hypothesis word hj to the premise p will be the minimum of the costs of aligning hj
to each possible premise word pi:

\[
\text{cost}(h_j \mid p) = \min_i \text{cost}(h_j \mid p_i)
\]

We also impose an upper bound, maxcost, on these per-word alignment costs; otherwise, a single hard-to-align hypothesis word can completely block inference.
Lower values for maxcost make the model “looser”: since every word can be aligned
more cheaply, we’re more likely to predict equivalence between the premise and the
hypothesis. Conversely, higher values for maxcost make the model “stricter”. One
interpretation of maxcost is that it represents the cost of inserting a word into h
which doesn’t correspond to anything in p.
However, intuition suggests that we would like to be able to treat some words
in h as more important than others. It is of no great consequence if we are unable
cheaply to align small function words (such as the, for, or as) in h; it matters much
more whether we can find a suitable alignment for big, rare words (such as Ahmadine-
jad ). To capture this intuition, we can define a weight function which represents the
importance of each word in alignment as a non-negative real value. We then define
the total cost of aligning h to p to be the weighted sum of the costs of aligning the
individual words in h:
\[
\text{cost}(h \mid p) = \sum_j \text{weight}(h_j) \cdot \text{cost}(h_j \mid p)
\]
Just as there were many possible choices for the lexical scoring function, we have
many possible choices for this lexical weight function weight(hj ). Clearly, we want
common words to receive less weight, and rare words to receive higher weight. A
plausible weight function might also take account of the part of speech of each word.
However, to keep things simple, we’ll use a weight function based on the inverse
document frequency (IDF) of each word in a large corpus. If N is the number of
documents in our corpus and Nw is the number of documents containing word w,
then we define:
\[
\text{idf}(w) = -\log(N_w / N)
\]
Since the value of this function will be infinite for words which do not appear in the
corpus, we’ll define our weight function to place an upper bound on weights. We’ll
scale IDF scores into the range [0, 1] by dividing each score by the highest score
observed in the corpus (typically, this is the score of words observed exactly once),
and we’ll assign weight 1 to words not observed in the corpus. If W is the set of
words observed in the corpus, then we define the weight function by:
\[
\text{weight}(h_j) = \min\left(1, \frac{\text{idf}(h_j)}{\max_{w \in W} \text{idf}(w)}\right)
\]
For example, when based on a typical corpus (the English Gigaword corpus), this
function assigns weight 0.0004 to the, 0.1943 to once, 0.6027 to transparent, and
1.000 to Ahmadinejad.
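Putting the pieces together, the hedged Python sketch below computes the weighted alignment cost: each hypothesis word is aligned to its cheapest premise word (capped at maxcost), and the per-word costs are combined in an IDF-weighted sum. The corpus statistics, the word-pair cost function, and the particular value of maxcost are all placeholders.

```python
import math

# Sketch of cost(h|p) = sum_j weight(h_j) * cost(h_j|p), with per-word costs
# capped at maxcost and weights derived from IDF as described above.
# doc_freq maps word -> number of documents containing it; num_docs is the
# corpus size. All constants here are illustrative, not tuned values.

MAXCOST = 5.0   # illustrative ceiling on per-word alignment cost

def word_cost(p_word, h_word):
    """Placeholder word-pair cost: 0 for equal words, high otherwise."""
    return 0.0 if p_word.lower() == h_word.lower() else 4.0

def idf_weight(word, doc_freq, num_docs):
    """weight(w) = min(1, idf(w) / max observed idf); unseen words get weight 1."""
    if word not in doc_freq:
        return 1.0
    idf = -math.log(doc_freq[word] / num_docs)
    max_idf = -math.log(1.0 / num_docs)   # idf of a word seen exactly once
    return min(1.0, idf / max_idf)

def alignment_cost(premise, hypothesis, doc_freq, num_docs):
    total = 0.0
    for h_word in hypothesis:
        c = min(MAXCOST, min(word_cost(p_word, h_word) for p_word in premise))
        total += idf_weight(h_word, doc_freq, num_docs) * c
    return total
```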
There are two natural ways to use this cost function to predict entailment. The first is to convert the cost directly into a probability:

\[
P(h \mid p) = \exp(-\text{cost}(h \mid p))
\]
However, this method is rather inflexible: it utilizes only one measure of the overlap
between p and h, and provides no automatic way to tune the threshold above which
to predict entailment.
The second is to use features based on our cost function as input to a machine
learning classifier. Rather than using cost(h|p) directly to predict entailment, we
construct a feature representation of the entailment problem using quantities derived
from the cost function, and train a statistical classifier to predict entailment using
these feature representations as input. For this purpose, we used a simple maximum
entropy (logistic regression) classifier with Gaussian regularization (and regularization
parameter σ = 1).
One advantage of this approach is that it enables us automatically to determine the
appropriate threshold beyond which to predict entailment. Another is that it allows
us to combine different kinds of information about the similarity between p and h. For
example, our feature representation can include not only the cost of aligning h to p,
but also the reverse, that is, the cost of aligning p to h. (If p and h are nearly identical,
then both cost(h|p) and cost(p|h) should be low; whereas if p subsumes h but also
includes extraneous content—a common situation in NLI problems—then cost(h|p)
should be low, but cost(p|h) should be high.) We experimented with a number of
different features based on the cost function, shown in figure 2.1. Intuitively, feqScore
is intended to represent the likelihood that p entails h; reqScore, that h entails p;
eqvScore, that p and h are mutually entailing (i.e., equivalent); fwdScore, that p
entails h and not vice-versa; revScore, that h entails p and not vice-versa; and
indScore, that neither p nor h entails the other (i.e., they are independent). However,
these interpretations are merely suggestive—one of the virtues of the machine learning
approach is that a (suitably regularized) classifier can learn to exploit whichever
features carry signal, and ignore the rest, whether or not the features well capture
the intended interpretations.
Clearly, there is considerable redundancy between these features, but because
maximum entropy classifiers (unlike, say, Naïve Bayes classifiers) handle correlated
features gracefully, this is not a source of concern. However, in experiments, we found
little benefit to including all of these features; using just the feqScore and reqScore
features gave results roughly as good as including any additional features.
hCost = cost(h|p)
pCost = cost(p|h)
feqScore = exp(−hCost)
reqScore = exp(−pCost)
eqvScore = feqScore · reqScore
fwdScore = feqScore · (1 − reqScore)
revScore = (1 − feqScore) · reqScore
indScore = (1 − feqScore) · (1 − reqScore)
Figure 2.1: Possible features for a maximum-entropy model of entailment based on the
bag-of-words model.
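For concreteness, the sketch below computes the features of figure 2.1 from the forward and reverse costs and trains a logistic regression classifier on the two features that proved sufficient (feqScore and reqScore). scikit-learn stands in for the maximum entropy classifier used here; the correspondence between its C parameter and a Gaussian prior with σ = 1 is only approximate.

```python
import math
from sklearn.linear_model import LogisticRegression

# Sketch of the feature set in figure 2.1 and of training a logistic
# regression ("maximum entropy") classifier on it. scikit-learn is an
# illustrative stand-in, not the implementation used in this chapter.

def entailment_features(h_cost, p_cost):
    feq = math.exp(-h_cost)   # roughly: likelihood that p entails h
    req = math.exp(-p_cost)   # roughly: likelihood that h entails p
    return {
        "hCost": h_cost, "pCost": p_cost,
        "feqScore": feq, "reqScore": req,
        "eqvScore": feq * req,
        "fwdScore": feq * (1 - req),
        "revScore": (1 - feq) * req,
        "indScore": (1 - feq) * (1 - req),
    }

def train(cost_pairs, labels):
    """cost_pairs: list of (cost(h|p), cost(p|h)); labels: 1 = entailed, 0 = not."""
    X = [[f["feqScore"], f["reqScore"]]   # the two features that sufficed in practice
         for f in (entailment_features(hc, pc) for hc, pc in cost_pairs)]
    clf = LogisticRegression(C=1.0)       # L2 penalty, i.e. a Gaussian prior
    clf.fit(X, labels)
    return clf
```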
Table 2.1: Performance of the bag-of-words entailment model described in section 2.1
on various combinations of RTE training and test data (two-way classification). The
columns show the training data used, the test data used, the proportion of yes
predictions, precision and recall for the yes class, and accuracy.
Table 2.2: Results from teams participating in the first three RTE competitions. For
each RTE competition, we show the accuracy achieved by the best-performing team,
along with average and median accuracies over all participating teams. Each RTE
competition allowed teams to submit two runs; for those teams which did so, we used
the run with higher accuracy in computing all statistics.
to the average and median accuracies achieved by all RTE participants in each of
the RTE competitions. Indeed, when evaluated on the RTE3 test set (after training
on the RTE3 development set), the bag-of-words model achieved an accuracy (63%)
higher than that of more than half of the RTE3 participants. Many of those teams
had pursued approaches which were apparently far more sophisticated than the bag-
of-words model, yet our results show that they could have done better with a more
naïve approach.
Note that, while the system of Glickman et al. (2005) was essentially a pure bag-
of-words model, few or none of the systems submitted to RTE2 or RTE3 could be
described as such. Most of those which came closest to measuring pure lexical overlap
(e.g., Adams (2006), Malakasiotis and Androutsopoulos (2007)) exploited additional
information as well, such as the occurrence of negation in p or h. For comparison,
Zanzotto et al. (2006) describe experiments using a simple statistical lexical system,
similar to the one described in this chapter. They were able to achieve accuracy close
to 60% on both RTE1 and RTE2, and found that simple lexical models outperformed
more sophisticated lexical models, thus confirming our findings.
The second surprise in table 2.1 is the degree of variance in the results across
data sets. A long-standing criticism of the RTE test suites has been that, because
the task is loosely defined and the problems are not drawn from any “natural” data
distribution, there is no way to ensure consistency in the nature and difficulty of the
NLI problems from one RTE data set to the next. Many in the RTE community have
the subjective perception that the character of the problems has changed significantly
from one RTE test suite to the next; the results in table 2.1 show that there are
substantial objective differences as well. Using our simple bag-of-words model as a
yardstick, the RTE1 test suite is the hardest, while the RTE2 test suite is roughly 4%
easier, and the RTE3 test suite is roughly 9% easier. Moreover, in RTE3 there is a
marked disparity between the difficulty of the development and test sets. Whichever
data set we train on, the RTE3 development set appears to be roughly 6% easier than
the RTE3 test set—certainly, an undesirable state of affairs.
A final observation about the results in table 2.1 is that the model does not
appear to be especially prone to overfitting. Half of the rows in table 2.1 show
results for experiments where the test data is the same as the training data, yet these
results are not appreciably better than the results for experiments where the test
data is different from the training data. However, this result is not too surprising—
because the maximum entropy classifier uses just two features, it has only three free
parameters, and thus is unlikely to overfit the training data.
Chapter 3

Alignment for Natural Language Inference
In order to recognize that “Kennedy was killed” can be inferred from “JFK was
assassinated”, one must first recognize the correspondence between Kennedy and JFK, and
between killed and assassinated. Consequently, most current approaches to NLI rely,
implicitly or explicitly, on a facility for alignment—that is, establishing links between
corresponding entities and predicates in the premise p and the hypothesis h.
Recent entries in the annual Recognizing Textual Entailment (RTE) competition
(Dagan et al. 2005) have addressed the alignment problem in a variety of ways, though
often without distinguishing it as a separate subproblem. Glickman et al. (2005)
and Jijkoun and de Rijke (2005), among others, have explored approaches like that
presented in chapter 2, based on measuring the degree of lexical overlap between bags
of words. While ignoring structure, such methods depend on matching each word in h
to the word in p with which it is most similar—in effect, an alignment. At the other
extreme, Tatu and Moldovan (2007) and Bar-Haim et al. (2007) have formulated
the inference problem as analogous to proof search, using inferential rules which
encode (among other things) knowledge of lexical relatedness. In such approaches,
the correspondence between the words of p and h is implicit in the steps of the proof.
Increasingly, however, the most successful RTE systems have made the align-
ment problem explicit. Marsi and Krahmer (2005) and MacCartney et al. (2006)
Consequently, the tools and techniques of MT alignment may not transfer readily to
NLI alignment. We investigate the matter empirically in section 3.4.2.
[Alignment grid pairing the premise “In most Pacific countries there are very few women in parliament.” with the hypothesis “Women are poorly represented in parliament.”]
Figure 3.1: The MSR gold standard alignment for problem 116 from the RTE2 development set.
links, while at least two of three agreed on more than 99.7% of proposed links,3 at-
testing to the high quality of the annotation data. For this work, we merged the three
independent annotations, according to majority rule,4 to obtain a gold-standard an-
notation containing an average of 7.3 links per RTE problem.
del(In1 )
del(most2 )
del(Pacific3 )
del(countries4 )
del(there5 )
eq(are6 , are2 )
sub(very7 few8 , poorly3 represented4 )
eq(women9 , Women1 )
eq(in10 , in5 )
eq(parliament11 , parliament6 )
eq(.12 , .7 )
Figure 3.2: The MSR gold standard alignment for RTE2 problem 116 (the same
problem as in figure 3.1), represented as a set of phrase edits. Note that del and ins
edits of size > 1 are possible in principle, but are not used in our training data.
As an example, figure 3.2 shows the same alignment as in figure 3.1, but represented
as a set of phrase edits.
Alignments are constrained to be one-to-one at the phrase level: every token in p
and h belongs to exactly one phrase, which participates in exactly one edit (possibly
del or ins). However, the phrase representation permits alignments which are many-
to-many at the token level. In fact, this is the chief motivation for the phrase-based
representation: we can align very few and poorly represented as units, without being
forced to make an arbitrary choice as to which word goes with which word. Moreover,
our scoring function can make use of lexical resources which have information about
semantic relatedness of multi-word phrases, not merely individual words.
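One plausible encoding of this representation is sketched below: each edit records its type and the token spans it covers on each side. The class and field names are illustrative, not MANLI's own; the example reproduces the gold alignment of figure 3.2, using 0-based, half-open token spans.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative encoding of the phrase-edit alignment representation.
# DEL edits carry only a premise span, INS edits only a hypothesis span.

@dataclass
class PhraseEdit:
    kind: str                                   # "EQ", "SUB", "DEL", or "INS"
    p_span: Optional[Tuple[int, int]] = None    # token span in the premise
    h_span: Optional[Tuple[int, int]] = None    # token span in the hypothesis

# The gold alignment for RTE2 problem 116, as in figure 3.2:
alignment = [
    PhraseEdit("DEL", p_span=(0, 1)),                   # In
    PhraseEdit("DEL", p_span=(1, 2)),                   # most
    PhraseEdit("DEL", p_span=(2, 3)),                   # Pacific
    PhraseEdit("DEL", p_span=(3, 4)),                   # countries
    PhraseEdit("DEL", p_span=(4, 5)),                   # there
    PhraseEdit("EQ",  p_span=(5, 6),  h_span=(1, 2)),   # are / are
    PhraseEdit("SUB", p_span=(6, 8),  h_span=(2, 4)),   # very few / poorly represented
    PhraseEdit("EQ",  p_span=(8, 9),  h_span=(0, 1)),   # women / Women
    PhraseEdit("EQ",  p_span=(9, 10), h_span=(4, 5)),   # in / in
    PhraseEdit("EQ",  p_span=(10, 11), h_span=(5, 6)),  # parliament / parliament
    PhraseEdit("EQ",  p_span=(11, 12), h_span=(6, 7)),  # . / .
]
```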
About 23% of the MSR gold-standard alignments are not one-to-one (at the token
level), and are therefore technically unreachable for MANLI, which is constrained to
generate one-to-one alignments. However, by merging contiguous token links into
phrase edits of size > 1, most MSR alignments (about 92%) can be straightforwardly
converted into MANLI-reachable alignments. For the purpose of model training (but
not for the evaluation described in section 3.4.4), we generated a version of the MSR
3. The sure/possible distinction is taken as significant in computing all these figures.
4. The handful of three-way disagreements were treated as possible links, and thus were not used.
We’ll explain how the feature weights w are set in section 3.3.4. The features used
to characterize each edit are as follows:
Edit type features. We begin with boolean features encoding the type of each
edit. We expect eqs to score higher than subs, and (since p is commonly longer than
h) dels to score higher than inss.
Phrase features. Next, we have features which encode the sizes of the phrases
involved in the edit, and whether these phrases are non-constituents (in syntactic
parses of the sentences involved).
Semantic relatedness feature. For sub edits, a very important feature represents
the degree of semantic relatedness of the substituends, as a real value in [0, 1]. This
relatedness score is computed as a max over a number of component scoring functions,
some based on external lexical resources, including:
• various string similarity functions, of which most are applied to word lemmas
5. About 8% of the MSR alignments contain non-contiguous links, most commonly because p contains two references to an entity (e.g., Christian Democrats and CDU) which are both linked to a reference to the same entity in h (e.g., Christian Democratic Union). In such cases, one or more links must be eliminated to achieve a MANLI-reachable alignment. We used a string-similarity heuristic to break such conflicts, but were obliged to make an arbitrary choice in about 2% of cases.
Contextual features. Even when the lexical similarity for a sub edit is high,
it may not be a good match. If p or h contains multiple occurrences of the same
word—which happens frequently with function words, and occasionally with content
words—lexical similarity alone may not suffice to determine which is the right match.
To remedy this, we introduce contextual features for sub and eq edits. A real-
valued distortion feature measures the difference between the relative positions of
the substituends within their respective sentences, while boolean matching neighbors
features indicate whether the tokens before and after the substituends are equal or
similar.
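The following small sketch shows one way such contextual features could be computed for an edit pairing premise tokens p[i1:i2] with hypothesis tokens h[j1:j2]; the exact definitions and names are assumptions, not MANLI's own.

```python
# Sketch of the contextual features described above. Positions are 0-based
# token indices; the specific formulas are illustrative.

def distortion(i1, len_p, j1, len_h):
    """Difference between the relative positions of the substituends."""
    return abs(i1 / max(1, len_p - 1) - j1 / max(1, len_h - 1))

def matching_neighbors(p, i1, i2, h, j1, j2):
    """Do the tokens immediately before / after the aligned phrases match?"""
    before = i1 > 0 and j1 > 0 and p[i1 - 1].lower() == h[j1 - 1].lower()
    after = i2 < len(p) and j2 < len(h) and p[i2].lower() == h[j2].lower()
    return before, after
```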
Inputs
• an alignment problem ⟨p, h⟩
• a number of iterations N (e.g. 100)
• initial temperature T0 (e.g. 40) and multiplier r (e.g. 0.9)
• a bound on edit size max (e.g. 6)
• an alignment scoring function, score(E)
Initialize
• Let E be an “empty” alignment for ⟨p, h⟩
(i.e., E contains only del and ins edits, no eq or sub edits)
• Set Ê = E
Repeat for i = 1 to N
• Let {F1 , F2 , ...} be the set of possible successors of E. To generate this set:
– Consider every possible edit f up to size max
– Let C(E, f) be the set of edits in E which “conflict” with f (i.e., involve at least some of the same tokens as f)
– Let F = E ∪ {f } \ C(E, f )
• Let s(F ) be a map from successors of E to scores generated by score
• Set P(F) = exp s(F), and then normalize P(F), transforming the score map to a probability distribution
• Set Ti = r · Ti−1
• Set P(F) = P(F)^(1/Ti), smoothing or sharpening P(F)
• Renormalize P (F )
• Choose a new value for E by sampling from P (F )
• If score(E) > score(Ê), set Ê = E
Return Ê
Figure 3.3: The manli-align algorithm
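A compact Python rendering of this procedure is sketched below. The functions that enumerate the successor alignments of E and score an alignment are passed in as parameters, since their details depend on the edit representation and the feature model; the sharpening step of figure 3.3 is folded into a single temperature-scaled softmax.

```python
import math
import random

# Sketch of the manli-align decoder (figure 3.3): start from an "empty"
# alignment, repeatedly sample a successor with probability proportional to
# exp(score / temperature), cool the temperature, and keep the best alignment seen.

def manli_align(successors, score, empty_alignment, n_iters=100, t0=40.0, r=0.9):
    current = empty_alignment
    best, best_score = current, score(current)
    temperature = t0
    for _ in range(n_iters):
        temperature *= r                       # T_i = r * T_{i-1}
        candidates = successors(current)       # list of alignments reachable by one edit change
        scores = [score(a) for a in candidates]
        m = max(scores)                        # subtract max for numerical stability
        weights = [math.exp((s - m) / temperature) for s in scores]
        current = random.choices(candidates, weights=weights, k=1)[0]
        current_score = score(current)
        if current_score > best_score:
            best, best_score = current, current_score
    return best
```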
Inputs
• training problems ⟨pj, hj⟩, j = 1..n
• corresponding gold-standard alignments Ej
• a number of learning epochs N (e.g. 50)
• a “burn-in” period N0 < N (e.g. 10)
• initial learning rate R0 (e.g. 1) and multiplier r (e.g. 0.8)
• a vector of feature functions Φ(E)
• an alignment algorithm align(p, h; w) which finds a good alignment for ⟨p, h⟩ using weight
vector w
Initialize
• Set w = 0
Repeat for i = 1 to N
• Set Ri = r · Ri−1 , reducing the learning rate
• Randomly shuffle the training problems
• For j = 1 to n:
– Set Êj = align(pj , hj ; w)
– Set w = w + Ri · (Φ(Ej ) − Φ(Êj ))
• Set w = w/kwk2 (L2 normalization)
• Set w[i] = w, storing the weight vector for this epoch
Return an averaged weight vector:
• wavg = 1/(N − N0) · Σ_{i=N0+1..N} w[i]
Figure 3.4: The manli-learn algorithm
the difference between the features of the target alignment7 and the features of the
alignment produced by the decoder using the current weight vector. The size of the
update is controlled by a learning rate which decreases over time. At the end of each
epoch, the weight vector is normalized and stored. The final result is the average
of the stored weight vectors, omitting vectors from a fixed number of epochs at the
beginning of the run (which tend to be of poor quality). Using the parameter values
suggested in figure 3.4, training runs on the RTE2 development set required about
20 hours.
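In code, the training loop of figure 3.4 might look roughly as follows; features(·) maps an alignment to a feature vector and align(p, h, w) runs the decoder with weight vector w, both left abstract here.

```python
import random

# Sketch of the manli-learn averaged perceptron (figure 3.4). After each
# epoch the weight vector is L2-normalized and stored; the returned weights
# are the average over the post-burn-in epochs.

def manli_learn(problems, gold, features, align, n_epochs=50, burn_in=10, r0=1.0, r=0.8):
    dim = len(features(gold[0]))
    w = [0.0] * dim
    stored, rate = [], r0
    for _ in range(n_epochs):
        rate *= r                                       # decaying learning rate
        order = list(range(len(problems)))
        random.shuffle(order)
        for j in order:
            p, h = problems[j]
            guess = align(p, h, w)                      # decode with current weights
            gold_f, guess_f = features(gold[j]), features(guess)
            w = [wi + rate * (g - f) for wi, g, f in zip(w, gold_f, guess_f)]
        norm = sum(wi * wi for wi in w) ** 0.5 or 1.0   # guard against a zero vector
        w = [wi / norm for wi in w]                     # L2-normalize each epoch
        stored.append(list(w))
    kept = stored[burn_in:]
    return [sum(col) / len(kept) for col in zip(*kept)]
```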
\[
\text{score}(h \mid p) = \frac{\text{cost}(h \mid p)}{|h|}
\]
Despite the simplicity of this alignment model, its performance is fairly robust, with
good recall. Its precision, however, is mediocre—chiefly because, by design, it aligns
every hypothesis token with some premise token. The model could surely be improved
by allowing it to leave some h tokens unaligned, but this was not pursued.
System Data P % R% F1 % E%
Bag-of-words dev 57.8 81.2 67.5 3.5
(baseline) test 62.1 82.6 70.9 5.3
GIZA++ dev 83.0 66.4 72.1 —
(using lexicon, ∩ heuristic) test 85.1 69.1 74.8 —
Cross-EM dev 67.6 80.1 72.1 —
(using lexicon, ∩ heuristic) test 70.3 81.0 74.1 —
Stanford RTE dev 81.1 61.2 69.7 0.5
test 82.7 61.2 70.3 0.3
Stanford RTE dev 81.1 75.8 78.4 —
(punctuation correction) test 82.7 75.8 79.1 —
MANLI dev 83.4 85.5 84.4 21.7
test 85.4 85.3 85.3 21.3
Table 3.1: Performance of various aligners on the MSR RTE2 alignment data. The
columns show the data set used (800 problems each); average precision, recall, and
F-measure; and the exact match rate (see text).
(46%). Even equal words were usually not aligned—because GIZA++ is designed
for cross-linguistic use, it does not consider word equality between source and tar-
get sentences. To remedy this, we supplied GIZA++ with a lexicon, using a trick
common in the MT community: we supplemented the training data with synthetic
data consisting of matched pairs of equal words. This gives GIZA++ a better chance
of learning that, e.g., man should align with man. The result was a big boost in
recall (+23%), and a smaller gain in precision. The results for GIZA++ shown in
table 3.1 are based on using the lexicon and intersection. With these settings,
GIZA++ properly aligned most pairs of equal words, but continued to align other
words apparently at random.
Next, we compared the performance of intersection with other symmetriza-
tion heuristics defined in Moses—including union, grow, grow-diag, grow-diag-
final (the default), and grow-diag-final-and—and with asymmetric alignments
in both directions. While all these alternatives achieved better recall than inter-
section, all showed substantially worse precision and F1 . On the RTE2 test set, the
asymmetric alignment from h to p scored 68% in F1 ; grow scored 58%; and all other
alternatives scored below 52%.
As an additional experiment, we tested the Cross-EM aligner (Liang et al. 2006)
from the BerkeleyAligner package on the MSR data. While this aligner is in many
ways simpler than GIZA++ (it lacks any model of fertility, for example), its method
of jointly training two simple asymmetric HMM models has outperformed GIZA++
on standard evaluations of MT alignment. As with GIZA++, we experimented with
a variety of symmetrization heuristics, and ran trials with and without a supplemen-
tal lexicon. The results were broadly similar: intersection greatly outperformed
alternative heuristics, and using a lexicon provided a big boost (up to 12% in F1 ).
Under optimal settings, the Cross-EM aligner showed better recall and worse preci-
sion than GIZA++, with F1 just slightly lower. Like GIZA++, it did well at aligning
equal words, but aligned most other words at random.
The mediocre performance of MT aligners on NLI alignment comes as no surprise,
for reasons discussed in section 3.1. Above all, the quantity of training data is simply
too small for unsupervised learning to succeed. A successful NLI aligner will need
to exploit supervised training data, and will need access to additional sources of
knowledge about lexical relatedness.
collapses multi-token named entities and certain collocations into single tokens. The
features used for alignment scoring include not only measures of lexical similarity, but
also syntactic features encoding the similarity of paths through dependency parses
between pairs of aligned tokens, which are intended to promote the alignment of
similar predicate-argument structures.
Despite this sophistication, the out-of-the-box performance of the Stanford aligner
is mediocre, as shown in table 3.1. The low recall figures are particularly notewor-
thy. However, a partial explanation is readily available: by design, the Stanford
system ignores punctuation.13 Because punctuation tokens constitute about 15% of
the aligned pairs in the MSR data, this sharply reduces measured recall. However,
since punctuation matters little in inference, such recall errors arguably should be
forgiven. Thus, table 3.1 also shows adjusted statistics for the Stanford system in
which all punctuation alignments are (generously) counted as correct.
Even after this adjustment, the recall figures are unimpressive. Error analysis
reveals that the Stanford aligner does a poor job of aligning function words. About
13% of the aligned pairs in the MSR data are matching prepositions or articles; the
Stanford aligner misses about 67% of such pairs. (By contrast, MANLI misses only
10% of such pairs.) While function words matter less in inference than nouns and
verbs, they are not irrelevant, and because sentences often contain multiple instances
of a particular function word, matching them properly is by no means trivial. If
matching prepositions and articles were ignored (in addition to punctuation), the gap
in F1 between the MANLI and Stanford systems would narrow to about 2.8%.
Finally, the Stanford aligner is handicapped by its token-based alignment repre-
sentation, often failing (partly or completely) to align multi-word phrases such as
peace activists with protesters, or hackers with non-authorized personnel.
GIZA++ and 6.2% higher than the Stanford aligner (even with the punctuation
correction).14 MANLI achieved a good balance between precision and recall, and
matched more than 20% of the gold-standard alignments exactly.
Three factors seem to have contributed most to MANLI’s success. First, MANLI
is able to outperform the MT aligners principally because it is able to leverage lexical
resources to identify the similarity between pairs of words such as jail and prison,
prevent and stop, or injured and wounded. Second, MANLI’s contextual features en-
able it to do better than the Stanford aligner at matching function words, a weakness
of the Stanford aligner discussed in section 3.4.3. Third, MANLI gains a marginal
advantage because its phrase-based representation of alignment permits it to properly
align phrase pairs such as death penalty and capital punishment.
However, the phrase-based representation contributed far less than we had hoped.
Setting MANLI’s maximum phrase size to 1 (effectively, restricting it to token-based
alignments) caused F1 to fall by just 0.2%. We do not interpret this to mean that
phrase alignments are not useful—indeed, about 2.6% of the links in the gold-standard
data involve phrases of size > 1. Rather, we think it shows that we have failed to
fully exploit the advantages of the phrase-based representation, chiefly because we
lack lexical resources providing good information on similarity of multi-word phrases.
Even if we had such resources, however, the consequent gains might be modest.
Recent work by de Marneffe et al. (2009) suggests that the role played by multi-
word expressions in RTE is smaller than one might expect, and is largely restricted
to lexico-syntactic variations which can be captured by other means.
Error analysis suggests that there is ample room for improvement. A large pro-
portion of recall errors (perhaps 40%) occur because the lexical similarity function
assigns too low a value to pairs of words or phrases which are clearly similar, such as
conservation and protecting, server and computer networks, organization and agen-
cies, or bone fragility and osteoporosis. Better exploitation of lexical resources could
help to reduce such errors. Another important category of recall errors (about 12%)
result from the failure to identify one- and multi-word versions of the name of some
entity, such as Lennon and John Lennon, or Nike Inc. and Nike. A special-purpose
14 Reported results for MANLI are averages over 10 runs.
similarity function could help here. Note, however, that about 10% of the recall errors
are unavoidable, given our choice of alignment representation, since they involve cases
where the gold standard aligns one or more tokens on one side to a non-contiguous
set of tokens on the other side.
Precision errors may be harder to reduce. These errors are dominated by cases
where we mistakenly align two equal function words (49% of precision errors), two
forms of the verb to be (21%), two equal punctuation marks (7%), or two words or
phrases of other types having equal lemmas (18%). Because such errors often arise
when the aligner is forced to choose between nearly equivalent alternatives, they
may be difficult to eliminate. The remaining 5% of precision errors result mostly from
aligning words or phrases rightly judged to be highly similar, such as expanding and
increasing, labor and birth, figures and number, or 223,000 and 220,000.
Table 3.2: Performance of various aligners and complete RTE systems in predicting
RTE2 answers. The columns show the data set used, accuracy, and average precision
(the recommended metric for RTE2).
3.6 Conclusion
While MT aligners succeed by unsupervised learning of word correspondences from
massive amounts of bitext, NLI aligners are forced to rely on smaller quantities of
supervised training data. With the MANLI system, we have demonstrated how to
overcome this lack of data by utilizing external lexical resources, and how to gain
additional power from a phrase-based representation of alignment.
Chapter 4
The Stanford RTE system
In this chapter, we describe the Stanford RTE system1 for natural language infer-
ence, which represents an initial attempt to find a middle ground between, on the
one hand, approaches to NLI based on lexical similarity, which are robust, but im-
precise; and on the other, approaches based on full semantic interpretation, which
are precise, but brittle. Because full, accurate, open-domain natural language under-
standing lies far beyond current capabilities, most early efforts in natural language
inference sought to extract the maximum mileage from quite limited semantic rep-
resentations. Some (like the bag-of-words model described in chapter 2) used simple
measures of semantic overlap, but over time, the more interesting work largely con-
verged on a graph-alignment approach, operating on semantic graphs derived from
syntactic dependency parses, and using a locally-decomposable alignment score as a
proxy for strength of entailment. (Below, we argue that even approaches relying on
weighted abduction may be seen in this light.) In this chapter, we highlight the fun-
damental semantic limitations of this type of approach, and advocate a multi-stage
architecture that addresses these limitations. The three key limitations are an as-
sumption of monotonicity, an assumption of locality, and a confounding of alignment
1 The Stanford RTE system has been developed over several years by a large team, including (in
addition to this author) Rajat Raina, Trond Grenager, Marie-Catherine de Marneffe, Anna Rafferty,
Christopher D. Manning, and many others. The material in this chapter is derived in large part
from (MacCartney et al. 2006), and therefore reflects an earlier stage of development of the Stanford
RTE system. The more recent evolution of the system is described in (Chambers et al. 2007, Padó
et al. 2008).
152 p Twenty-five of the dead were members of the law enforcement agencies and
the rest of the 67 were civilians.
h 25 of the dead were civilians. no
231 p The memorandum noted the United Nations estimated that 2.5 million to
3.5 million people died of AIDS last year.
h Over 2 million people died of AIDS last year. yes
971 p Mitsubishi Motors Corp.’s new vehicle sales in the US fell 46 percent in June.
h Mitsubishi sales rose 46 percent. no
1806 p Vanunu, 49, was abducted by Israeli agents and convicted of treason in 1986
after discussing his work as a mid-level Dimona technician with Britain’s
Sunday Times newspaper.
h Vanunu’s disclosures in 1968 led experts to conclude that Israel has a stock-
pile of nuclear warheads. no
2081 p The main race track in Qatar is located in Shahaniya, on the Dukhan Road.
h Qatar is located in Shahaniya. no
Figure 4.1: Illustrative examples from the RTE1 development set. For each problem,
we show the ID number, the premise p and hypothesis h, and the correct answer.
Though most of the problems shown have answer no, the RTE data sets are actually
balanced between yes and no.
with factored scores, the problem of finding the best alignment of two graphs is NP-
complete, so exact computation is intractable. Authors have proposed a variety of
approximate search techniques. Haghighi et al. (2005) divide the search into two
steps: in the first step they consider node scores only, which relaxes the problem to a
weighted bipartite graph matching that can be solved in polynomial time, and in the
second step they add the edge scores and hillclimb the alignment via an approximate
local search.
A third approach, exemplified by Moldovan et al. (2003) and Raina et al. (2005),
is to turn the syntactic representation into a neo-Davidsonian-style quasi-logical form,
and to perform weighted abductive theorem proving in the tradition of Hobbs et al.
(1988). We argue, however, that this style of weighted resolution theorem proving is
actually isomorphic to the graph matching approach. For example, a neo-Davidsonian
representation of the graph in figure 4.2 might correspond to the quasi-LF
rose(e1) ∧ nsubj(e1, x1) ∧ sales(x1) ∧ nn(x1, x2) ∧ Mitsubishi(x2) ∧ dobj(e1, x3) ∧ percent(x3) ∧ num(x3, x4) ∧ 46(x4)
There is a term corresponding to each node and arc, and the unit resolution steps at
the core of resolution theorem proving consider matching an individual node or arc
of the hypothesis with something from the premise, just as in the graph-matching
approach.2
Finally, a few efforts (Akhmatova 2005, Fowler et al. 2005, Bos and Markert
2005a) have tried to translate sentences into formulas of first-order logic, in order to
test logical entailment with a theorem prover. While in principle this approach does
not suffer from the limitations we describe below, in practice it has not borne much
fruit. Because few problem sentences can be accurately translated to logical form,
and because logical entailment is a strict standard, recall tends to be poor.
The simple graph matching formulation of the problem belies three important is-
sues. First, the above systems assume monotonicity: if a good match is found with a
part of the premise, other material in the premise is assumed not to affect the validity
2 A possible difference is that with a good supply of additional linguistic and world knowledge
axioms—as is present in Moldovan et al. (2003) but not Raina et al. (2005)—the theorem prover
may generate intermediate forms in the proof, but, nevertheless, individual terms are resolved locally
without reference to global context.
of the match. But many entailment decisions are non-monotonic in the node and edge
scores. Consider variants on problem 98 in figure 4.1. Suppose the hypothesis were
Arafat targeted for assassination. This would allow a perfect graph match or zero-
cost weighted abductive proof, because the hypothesis is a subgraph of the premise.
However, this would be incorrect, because it ignores the modal operator could. Infor-
mation that changes the validity of a proof can also exist outside a matching clause.
Consider the alternate premise Sharon denies Arafat is targeted for assassination.3
The second issue is the assumption of locality. Locality is needed to allow practical
search, but many entailment decisions rely on global features of the alignment, and
thus do not naturally factor by nodes and edges. To take just one example, dropping
a restrictive modifier preserves entailment in a positive context, but not in a negative
one. For example, Dogs barked loudly entails Dogs barked, but No dogs barked loudly
does not entail No dogs barked. These more global phenomena cannot be modeled
with a factored alignment score.
The last issue arising in the graph-matching approaches is the confounding of
alignment and entailment determination. The way to show that one
graph element does not follow from another is to make the cost of aligning them high.
However, since we are embedded in a search for the lowest cost alignment, this will
just cause the system to choose an alternate alignment rather than recognizing a non-
entailment. In problem 152 of figure 4.1, we would like the hypothesis to align with
the first part of the premise, to be able to prove that civilians are not members of
law enforcement agencies and conclude that the hypothesis does not follow from the
premise. But a graph-matching system will try to get non-entailment by making
the matching cost between civilians and members of law enforcement agencies
very high. However, the likely result is that the final part of the hypothesis
will align with were civilians at the end of the premise, assuming that we allow the
alignment to “break” arcs.4 Under this candidate alignment, the lexical alignments are
perfect, and the only imperfection is that the subject arc of were is mismatched between
3 This is the same problem labeled and addressed as context in (Tatu and Moldovan 2005).
4 Robust systems need to allow matches with imperfect arc correspondence. For instance, given
Bill went to Lyons to study French farming practices, we would like to be able to conclude that Bill
studied French farming despite the small structural mismatch.
the two. A robust inference guesser will still likely conclude that there is entailment.
We propose that all three problems can be resolved in a multi-stage architecture,
where the alignment phase is followed by a separate phase of entailment determi-
nation. Finding aligned content is naturally useful, and can be done by any search
procedure. Compared to previous work, our approach to alignment emphasizes struc-
tural correspondence, and downplays issues like polarity and quantity, which can be
left to a subsequent entailment decision. For example, our alignment scoring function
is designed to encourage antonym matches, and ignore the negation of verb pred-
icates. The ideas clearly generalize to evaluating several alignments, but we have
found working with the one-best alignment to be adequate for the PASCAL RTE
data. Given a good alignment, the determination of entailment reduces to a simple
classification decision. The classifier can use hand-tuned weights, or it can be trained
to minimize a relevant loss function using standard techniques from machine learning.
The classifier is built over features designed to identify patterns of valid and invalid
inference. For example, they can detect that it is okay to add a restrictive modifier
if the passage has a universal quantifier (All the students got on the bus |= All the
male students got on the bus), whereas this form of inference is not supported in or-
dinary (upward-monotone) contexts. Because we already have a complete alignment,
the classifier’s decision can be conditioned on arbitrary global features of the aligned
graphs, and it can detect inversions of monotonicity.
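To make the proposed multi-stage flow concrete, the following sketch (in Python) shows its shape: compute an alignment, extract global features of the aligned pair, and make the entailment decision with a simple linear rule. The helper functions and the feature weights are hypothetical placeholders standing in for the components described in the next section, not the actual Stanford RTE implementation.

```python
# A minimal sketch of the multi-stage architecture: alignment first, then a
# separate entailment decision over global features of the aligned pair.
# `align` and `extract_global_features` are hypothetical placeholders.

def align(p_graph, h_graph):
    """Return the single best (partial) alignment from h nodes to p nodes."""
    raise NotImplementedError   # e.g., a beam search over candidate alignments

def extract_global_features(p_graph, h_graph, alignment):
    """Features of the aligned pair as a whole: polarity, factivity, quantifiers, ..."""
    raise NotImplementedError

def predict_entailment(p_graph, h_graph, weights, threshold=0.0):
    """Weights may be hand-tuned or learned; the decision is a simple linear rule."""
    alignment = align(p_graph, h_graph)
    features = extract_global_features(p_graph, h_graph, alignment)
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return score >= threshold
```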
4.2 System
Our system has three stages: linguistic analysis, alignment, and entailment determi-
nation. In the following pages, we describe each of these stages in turn.
Figure 4.2: A typed dependency graph for problem 971 of figure 4.1. The root rose has an nsubj dependent sales and a dobj dependent percent; sales has an nn dependent Mitsubishi, and percent has a num dependent 46.
and labeled edges representing the grammatical relations between words. An example
of a typed dependency graph for problem 971 of figure 4.1 is given in figure 4.2. The
advantage of this representation is that it contains much of the information about
the entities and relations between them described by the sentence, while being easy
to compute deterministically from a syntactic parse. The disadvantage is that it fails
properly to represent many semantic phenomena; particularly egregious is its inability
to represent quantification and modality.
Our general approach is first to parse the input sentences, and then to convert
each of the resulting phrase structure trees to a typed dependency graph. We use the
Stanford parser (Klein and Manning 2003), a statistical syntactic parser trained on the
Penn TreeBank. To ensure correct parsing, we preprocess the sentences to collapse
named entities and collocations into single tokens (by joining separate words with
underscores). Named entities are identified by the Stanford Named Entity Recognizer
(Finkel et al. 2005), a CRF-based NER system similar to that described by McCallum
and Li (2003), and collocations are derived from consecutive word pairs in WordNet
(Fellbaum et al. 1998).
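To illustrate the collapsing step, here is a small sketch that joins multi-word spans with underscores before parsing; the spans themselves would come from the NER tagger or the WordNet collocation lookup, and are simply assumed as input here.

```python
def collapse_spans(tokens, spans):
    """Join each (start, end) span into a single underscore-joined token.

    `spans` is assumed to come from a named-entity recognizer or a collocation
    lookup; overlapping spans are not handled in this sketch.
    """
    starts = {start: end for start, end in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            out.append("_".join(tokens[i:starts[i]]))
            i = starts[i]
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["Mitsubishi", "Motors", "Corp.", "sales", "fell"]
print(collapse_spans(tokens, [(0, 3)]))
# ['Mitsubishi_Motors_Corp.', 'sales', 'fell']
```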
We convert the phrase structure trees to typed dependency graphs using a set of
deterministic hand-coded rules, as described by de Marneffe et al. (2006). Heads of
the constituents are first identified using the Collins rules (Collins 2003), modified to
retrieve semantic heads. For each grammatical relation, we define patterns over the
phrase structure tree using a language similar to the tgrep utility. Each pattern is
rose → fell
sales → sales
Mitsubishi → Mitsubishi_Motors_Corp.
percent → percent
46 → 46
score: −0.8962
Figure 4.3: A sample alignment for problem 971 of figure 4.1.
matched against each dependency in the tree, adding the most specific relation found
as the typed dependency. The nodes in the final graph are then annotated with their
associated word, part-of-speech (given by the parser), lemma (given by a finite-state
transducer described by Minnen et al. (2001)) and named-entity tag (given by the
NER tagger).
4.2.2 Alignment
The purpose of the second phase of processing is to find a good partial alignment
between the typed dependency graphs representing the premise p and the hypothesis
h. An alignment consists of a mapping from each node (word) in the h graph to
a single node in the p graph, or to null.5 Not all words in the premise graph are
mapped to, since typically the premise contains much more information than needed
to support the hypothesis. Furthermore, we do not require that the alignment con-
tains a mapping for all words in the hypothesis. Figure 4.3 shows an example of an
alignment for problem 971 of figure 4.1.
The space of alignments is large: there are O((m + 1)^n) possible alignments for a p
graph with m nodes and an h graph with n nodes. We define a measure of alignment
quality, and a procedure for identifying high scoring alignments. We choose a locally
decomposable scoring function, such that the score of an alignment is the sum of the
local node and edge alignment scores. Unfortunately, there is no polynomial time
5 The limitations of using one-to-one alignments are mitigated by the fact that many multiword
expressions (e.g., named entities, noun compounds, multiword prepositions) have been collapsed into
single nodes during linguistic analysis.
algorithm for finding the exact best alignment. Instead we use an incremental beam
search, combined with a node ordering heuristic, to do approximate global search
in the space of possible alignments. We have experimented with several alternative
search techniques, and found that the solution quality is not very sensitive to the
specific search procedure used.
Our scoring measure is designed to favor alignments which align semantically
similar subgraphs, irrespective of polarity. For this reason, nodes receive high align-
ment scores when the words they represent are semantically similar. Synonyms and
antonyms receive the highest score, and unrelated words receive the lowest. Our
hand-crafted scoring metric takes into account the word, the lemma, and the part of
speech, and searches for word relatedness using a range of external resources, includ-
ing WordNet, precomputed latent semantic analysis matrices, and special-purpose
gazettes. Alignment scores also incorporate local edge scores, which are based on the
shape of the paths between nodes in the premise graph which correspond to adjacent
nodes in the hypothesis graph. Preserved edges receive the highest score, and longer
paths receive lower scores.
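The sketch below shows one way a locally decomposable score and an incremental beam search might be organized. The similarity function is a crude stand-in for the hand-crafted metric and lexical resources just described, edge scores are omitted, and no one-to-one constraint is enforced; it illustrates the search scheme, not the system's actual cost model.

```python
def node_score(h_word, p_word):
    """Stand-in for the lexical similarity measure (WordNet, LSA matrices, gazettes)."""
    if p_word is None:
        return -0.25                 # mild penalty for leaving an h word unaligned
    return 1.0 if h_word.lower() == p_word.lower() else -0.5

def alignment_score(h_words, p_words, mapping):
    """Locally decomposable score; edge (path-shape) scores are omitted here."""
    return sum(node_score(h, p_words[j] if j is not None else None)
               for h, j in zip(h_words, mapping))

def beam_align(h_words, p_words, beam_size=10):
    """Incremental beam search over partial mappings from h words to p words (or None)."""
    beam = [()]
    for _ in h_words:
        candidates = [state + (j,) for state in beam
                      for j in list(range(len(p_words))) + [None]]
        candidates.sort(key=lambda m: alignment_score(h_words[:len(m)], p_words, m),
                        reverse=True)
        beam = candidates[:beam_size]
    return beam[0]

p = "Mitsubishi Motors Corp. sales fell 46 percent".split()
h = "Mitsubishi sales rose 46 percent".split()
print(beam_align(h, p))   # (0, 3, None, 5, 6): indices into p, None = unaligned
```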
Polarity features. These features capture the presence (or absence) of linguistic
markers of negative polarity contexts in both p and h, such as simple negation (not),
downward-monotone quantifiers (no, few ), restricting prepositions (without, except)
and superlatives (tallest).
Factivity features. The context in which a verb phrase is embedded may carry
semantic presuppositions giving rise to (non-)entailments such as The gangster tried
to escape 6|= The gangster escaped. This pattern of entailment, like others, can be
reversed by negative polarity markers (The gangster managed to escape |= The gang-
ster escaped while The gangster didn’t manage to escape 6|= The gangster escaped ). To
capture these phenomena, we compiled small lists of “factive” and non-factive verbs,
clustered according to the kinds of entailments they create. We then determine to
which factivity class the parent of the p node aligned with the root of h belongs.
If the parent is not in the list, we only check whether the embedding context is an
affirmative context or a negative one.
features is generated: both ‘no’ if both p and h quantifiers are no; one ‘no’ if just
one is; expand if the h quantifier is “larger” (less restrictive) than the p quantifier;
and contract if the reverse. Intuitively, both ‘no’ and expand should favor
entailments; one ‘no’ and contract should disfavor them. Unfortunately, these features
are rarely useful on RTE data. Overwhelmingly, they are activated only for exact
matches between definite or indefinite determiners. Very few quantifiers per se are
identified.
Number, date, and time features. These are designed to recognize (mis-)matches
between numbers, dates, and times, as in problems 1806 and 231 of figure 4.1. We
do some normalization (e.g., of date representations) and have a limited ability to
do fuzzy matching. In problem 1806, the mismatched years are correctly identified.
Unfortunately, in problem 231, the significance of over is not grasped and a mismatch
is reported.7
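A small sketch of the kind of normalization and fuzzy matching these features rely on; the regular expression and tolerances are invented for the illustration, and, like the system at this stage, it does not interpret modifiers such as over.

```python
import re

def normalize_number(text):
    """Strip separators and scale words; return a float, or None if not numeric."""
    text = text.lower().replace(",", "")
    m = re.match(r"^(\d+(?:\.\d+)?)\s*(million|billion)?$", text)
    if not m:
        return None
    scale = {"million": 1e6, "billion": 1e9}.get(m.group(2), 1.0)
    return float(m.group(1)) * scale

def numbers_match(p_text, h_text, tolerance=0.0):
    """Fuzzy match: true if both values parse and agree within a relative tolerance."""
    a, b = normalize_number(p_text), normalize_number(h_text)
    if a is None or b is None:
        return False
    return abs(a - b) <= tolerance * max(a, b)

print(numbers_match("2.5 million", "2,500,000"))            # True after normalization
print(numbers_match("223,000", "220,000", tolerance=0.02))  # True within 2%
print(numbers_match("1986", "1968"))                        # False: mismatched years
```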
4.4 Evaluation
We present results based on the PASCAL RTE1 Challenge, which was introduced
in section 1.3.2. The RTE1 Challenge recommended two evaluation metrics: raw
accuracy and confidence weighted score (CWS).8 The CWS is computed as follows:
for each positive integer k up to the size of the test set, we compute accuracy over
the k most confident predictions. The CWS is then the average, over k, of these
7 Later versions of the Stanford RTE system handle over correctly.
8 In subsequent RTE competitions, the use of CWS was abandoned in favor of average precision.
Table 4.1: Performance of various systems on the RTE1 development and test sets.
The columns show accuracy and confidence weighted score (see text).
partial accuracies. Like raw accuracy, it lies in the interval [0, 1], but it will exceed
raw accuracy to the degree that predictions are well-calibrated.
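Concretely, the CWS can be computed directly from this definition; a minimal sketch:

```python
def confidence_weighted_score(predictions, gold):
    """CWS: the average, over k, of accuracy on the k most confident predictions.

    `predictions` is a list of (label, confidence) pairs; `gold` the true labels.
    """
    order = sorted(range(len(predictions)),
                   key=lambda i: predictions[i][1], reverse=True)
    correct, partial_accuracies = 0, []
    for k, i in enumerate(order, start=1):
        correct += predictions[i][0] == gold[i]
        partial_accuracies.append(correct / k)
    return sum(partial_accuracies) / len(partial_accuracies)

preds = [("yes", 0.9), ("no", 0.8), ("yes", 0.6), ("no", 0.3)]
gold  = ["yes", "yes", "yes", "no"]
print(confidence_weighted_score(preds, gold))   # (1/1 + 1/2 + 2/3 + 3/4) / 4 ≈ 0.729
```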
Table 4.1 shows results for a range of systems and testing conditions. We report
accuracy and CWS on each RTE1 data set. The baseline for all experiments is random
guessing, which always attains 50% accuracy. Table 4.1 also displays comparable
results from RTE1 submissions based on lexical similarity (Jijkoun and de Rijke 2005),
graph alignment (Haghighi et al. 2005), weighted abduction (Raina et al. 2005), and
theorem proving (Bos and Markert 2005a).
We then show results for the Stanford RTE system under several different training
regimes. The row labeled “alignment only” describes experiments in which all fea-
tures except the alignment score are turned off. We predict entailment just in case the
alignment score exceeds a threshold which is optimized on development data. “Hand-
tuning” describes experiments in which all features are on, but no training occurs;
rather, weights are set by hand, according to human intuition. Finally, “learning”
describes experiments in which all features are on, and feature weights are trained
on the development data. The figures reported for development data performance
therefore reflect overfitting; while such results are not a fair measure of overall perfor-
mance, they can help us assess the adequacy of our feature set: if our features have
failed to capture relevant aspects of the problem, we should expect poor performance
even when overfitting. It is therefore encouraging to see CWS above 70%. Finally,
the figures reported for test data performance are the fairest basis for comparison.
These are significantly better than our results for alignment only (Fisher’s exact test,
p < 0.05), indicating that we gain real value from our features. However, the gain
over comparable results from other teams is not significant at the p < 0.05 level.
A curious observation is that the results for hand-tuned weights are as good or
better than results for learned weights. A possible explanation runs as follows. Most
of the features represent high-level patterns which arise only occasionally. Because the
training data contains only a few hundred examples, many features are active in just a
handful of instances; their learned weights are therefore quite noisy. Indeed, a feature
which is expected to favor entailment may even wind up with a negative weight: the
modal feature weak yes is an example. As shown in table 4.2, the learned weight for
this feature was strongly negative—but this resulted from a single training example
in which the feature was active but the hypothesis was not entailed. In such cases,
we shouldn’t expect good generalization to test data, and human intuition about the
“value” of specific features may be more reliable.
Table 4.2 shows the values learned for selected feature weights. As expected, the
features added adjunct in ‘all’ context, modal yes, and p is factive were
all found to be strong indicators of entailment, while date insert, date modifier
insert, widening from p to h all indicate lack of entailment. Interestingly, p has
neg marker and p & h diff polarity were also found to disfavor entailment;
while this outcome is sensible, it was not anticipated or designed.
The real-valued alignment features good score and bad score proved to be
highly informative. However, the more basic score feature from which these are
derived got a weight near 0, suggesting that it carries little additional information.
4.5 Conclusion
Many researchers have explored approaches to the problem of natural language infer-
ence which work by aligning semantic graphs, using a locally-decomposable alignment
Table 4.2: Learned weights for selected features. Positive weights favor entailment.
Weights near 0 are omitted. Based on training on the PASCAL RTE1 development
set, with a Gaussian smoothing parameter of 20.
score as a proxy for strength of entailment. We have argued that such models suf-
fer from three crucial limitations: an assumption of monotonicity, an assumption of
locality, and a confounding of alignment and entailment determination.
We have described the Stanford RTE system, which extends alignment-based sys-
tems while attempting to address these limitations. After finding the best alignment
between premise and hypothesis, the Stanford RTE system extracts high-level se-
mantic features of the entailment problem, and inputs these features to a statistical
classifier to make an entailment decision. Using this multi-stage architecture, we re-
port results on the PASCAL RTE1 data which surpass previously-reported results for
alignment-based systems.
Chapter 5
Entailment relations
5.1 Introduction
This chapter marks a shift in direction.1 Whereas the bag-of-words model (chapter 2)
and the Stanford RTE system (chapter 4) (and, indeed, most existing RTE systems)
approach the NLI task via approximate measures of the lexical and syntactic similarity
of hypothesis h to premise p, in this chapter we take our first steps toward developing
a more precise model of natural language entailment2 based on natural logic. We begin
by considering the fundamental representations to be used. What kind of answer do
we want a model of entailment to give us? For that matter, what kinds of questions
should we be able to ask of it? More precisely: if we view our entailment model as a
function, what should be the types of its inputs and output?
The simplest kind of entailment model represents entailment as a binary relation
1 The material in this chapter is derived in large part from (MacCartney and Manning 2009).
2 The attentive reader may note a slight shift in terminology, from natural language inference
to natural language entailment. In formal semantics, entailment is conventionally defined more
narrowly than inferability: we say that p entails h just in case p cannot be true unless h is true.
While the term “textual entailment” has frequently been used to describe the problem of natural
language inference (NLI), we believe that “inference” is a more appropriate term for the general
problem, since NLI problems can hinge on presuppositions and implicatures, as well as entailments.
However, the model we develop in this chapter and the next focuses principally on entailments,
and has little to say about other forms of inference; therefore it will be convenient to use the term
“entailment”.
do so, in order to facilitate comparison with the more complex representations of en-
tailment to follow. Formally, the output labels entailment and non-entailment
may be interpreted as denoting relations between (that is, sets of ordered pairs of)
declarative expressions. If DomT denotes the domain of declarative expressions (se-
mantic type T ), and Dom2T denotes the Cartesian product of DomT with itself, then
we can define these relations as:
def
entailment = {hp, hi ∈ Dom2T : p |= h}
def
non-entailment = {hp, hi ∈ Dom2T : p 6|= h}
• unk denotes compatibility: neither h nor its negation can be inferred from p.
Condoravdi et al. (2003) argued for the importance of entailment and contradiction
detection (ECD) as a necessary condition for natural language understanding, and
explored theoretical approaches to the task, but did not report empirical results.
To our knowledge, Harabagiu et al. (2006) provide the first empirical results for
contradiction detection, but they evaluate their system on constructed corpora con-
taining two specific kinds of contradiction: those featuring negation and those formed
by paraphrases. A more general treatment is given by de Marneffe et al. (2008), who
3 A few FraCaS problems do not fit neatly into any of these three categories. See section 7.8.1.
entailment ≝ {⟨p, h⟩ ∈ Dom²T : p ⊨ h}
contradiction ≝ {⟨p, h⟩ ∈ Dom²T : p ⊨ ¬h}
compatibility ≝ {⟨p, h⟩ ∈ Dom²T : p ⊭ h ∧ p ⊭ ¬h}
As in the binary formulation, the output labels form a partition of the input space,
so that every ⟨p, h⟩ ∈ Dom²T can be assigned to exactly one of the three relations.
a no-containment relation which holds exactly when none of the other relations
hold. The output space for the monotonicity calculus then consists of four relations:
≡ ≝ {⟨p, h⟩ ∈ Dom²T : p ⊨ h ∧ h ⊨ p}
@ ≝ {⟨p, h⟩ ∈ Dom²T : p ⊨ h ∧ h ⊭ p}
A ≝ {⟨p, h⟩ ∈ Dom²T : p ⊭ h ∧ h ⊨ p}
no-containment ≝ {⟨p, h⟩ ∈ Dom²T : p ⊭ h ∧ h ⊭ p}
However, these definitions are incomplete. The monotonicity calculus differs from
the simpler formulations above not only with respect to the output space, but also
with respect to the input space: it defines the semantic containment relation ⊑ for
expressions of every semantic type, including individual words and phrases as well
as complete sentences. It achieves this using a recursive definition, beginning from
atomic types and building up to functional types:
4. Otherwise, x ⋢ y and y ⋢ x.
(Expressions having different semantic types are unrelated.)
[Figure: example sentence pairs illustrating the output spaces discussed above (X is a couch / X is a sofa: p ≡ h; X is a crow / X is a bird: p ⊏ h; X is a fish / X is a carp: p ⊐ h; X is a cat / X is a dog; X is a hippo / X is hungry), together with the labels ENTAILMENT, NON-ENTAILMENT, CONTRADICTION, COMPATIBILITY, and NO-CONTAINMENT assigned under the two-way, three-way, and containment-based schemes.]
the monotonicity calculus lets us conclude from dancing ⊒ waltzing that isn't danc-
ing ⊑ isn't waltzing and thus Matilda isn't dancing ⊑ Matilda isn't waltzing. (See
section 6.2.1 for a fuller exposition.)
(1) a. Garfield is a cat.
b. Garfield is a mammal.
c. Garfield is not a fish.
d. Garfield is not a carp.
Clearly, the first proposition (1a) entails the last (1d). However, the monotonicity
calculus lacks the machinery to recognize this. It can make the inference from (1a) to
(1b) (using the semantic containment cat @ mammal ), and from (1c) to (1d) (using
the semantic containment fish A carp). But it cannot make the simple inference from
(1b) to (1c), which hinges on the semantic exclusion between mammal and fish.
Of course, the first-order predicate calculus renders such inferences trivial, but
using formal logic requires full semantic interpretation, which is contrary to the nat-
ural logic approach. We’d like to preserve the spirit of the monotonicity calculus,
while enhancing its power. Ideally, then, we’d like to have the best of both worlds, by
defining an inventory of entailment relations which satisfies the following desiderata:
• Indeed, it should partition the space of ordered pairs of expressions, that is, the
relations it contains should be not only exhaustive but also mutually exclusive,
so that we can define a function which maps any ordered pair of expressions to
a single entailment relation.
We’ll approach these goals somewhat indirectly. Taking inspiration from Sánchez
Valencia, we’ll focus first on set relations. We’ll define an inventory of set relations
which satisfy our desiderata, and then return to entailment relations in section 5.4.
Every ordered pair of sets hx, yi can now be assigned to one of 16 equivalence
classes, according to whether each of these four partitions is empty or not. Let us label
each equivalence class with R subscripted by a 4-bit string, following the convention
that the nth bit of the subscript (starting from 0) denotes the non-emptiness of the
partition whose label is the bit string with binary value n. (Thus R1101 denotes the
4 x̄ denotes the set complement of set x, defined by the constraints x ∩ x̄ = ∅ and x ∪ x̄ = U.
Figure 5.2: The 16 elementary set relations, represented by Johnston diagrams. Each
box represents the universe U , and the two circles within the box represent the sets
x and y. A region is white if it is empty, and shaded if it is non-empty. Thus in the
diagram labeled R1101 , only the region x ∩ y is empty, indicating that ∅ ⊂ x ⊂ y ⊂ U .
equivalence class in which only partition 10 is empty.) These equivalence classes are
depicted graphically in figure 5.2.
In fact, each of these equivalence classes is a set relation, that is, a set of ordered
pairs of sets. We will refer to these 16 set relations as the elementary set relations,
and we will denote this set of 16 relations by R. By construction, the relations in R
are both mutually exhaustive (every ordered pair of sets belongs to some relation in
R) and mutually exclusive (no ordered pair of sets belongs to two different relations
in R). Thus, every ordered pair of sets can be assigned to exactly one relation in R.
Table 5.1: The 16 elementary set relations between sets x and y in universe U ,
described in terms of set-theoretic constraints. Constraints which appear in gray are
less salient, but still active.
Singleton relations. Relations R0001 , R0010 , R0100 , and R1000 cover the cases where
the universe is non-empty and both x and y are either empty or universal. Each of
these relations therefore contains exactly one pair of sets.
Other edge cases. Relations R0011 , R0101 , R1010 , and R1100 cover the cases where
the universe is non-empty and one (but not both) of x and y is either empty or
universal. Each of these relations therefore has cardinality 2^|U| − 2.
x ∩ y = ∅.
Duals under negation. We describe set relations R and S as duals under negation
iff ∀x, y : ⟨x, y⟩ ∈ R ⇔ ⟨x̄, ȳ⟩ ∈ S. Two relations in R are duals under negation if
and only if their bit-string labels are reverses. Thus R1011 and R1101 are duals, while
R1001 is self-dual. The significance of this duality will become apparent later.
We have defined the relations in R as relations between sets, but we can easily
recast them as relations between set-denoting linguistic expressions (e.g., simple
predicates). In this form, we refer to the elements of R as the elementary entailment
relations. (We consider in section 5.4 how to define entailment relations for expressions
of all semantic types.)
• Then, four relations in R (namely R0111, R1011, R1101, and R1110) contain 3^n −
(3 · 2^n) + 3 pairs each. These are the non-degenerate relations expressing non-
exclusive exhaustion, non-exhaustive exclusion, and containment. Note that
each has a label containing one 0 and three 1s.
• Next, six relations in R (namely R0011 , R0101 , R0110 , R1001 , R1010 , and R1100 )
contain 2^n − 2 pairs each. Two of these are the non-degenerate relations express-
ing equivalence and negation; the remainder are degenerate relations. Note that
each has a label containing two 0s and two 1s.
• Then, four relations in R (namely R0001 , R0010 , R0100 , and R1000 ) contain just
1 pair each. These are the degenerate singleton relations. Note that each has a
label containing three 0s and one 1.
• Finally, unless the universe is empty, the relation R0000 contains 0 pairs.
R ⋈ S ≝ {⟨x, z⟩ : ∃y (⟨x, y⟩ ∈ R ∧ ⟨y, z⟩ ∈ S)}          (5.1)
Some joins are quite intuitive. For example, it is immediately clear that:
@ ⋈ @ = @
A ⋈ A = A
∧ ⋈ ∧ = ≡
∀R : R ⋈ ≡ = R
∀R : ≡ ⋈ R = R
12 In Tarskian relation algebra, this operation is known as relation composition, and is often rep-
resented by a semi-colon: R ; S. To avoid confusion with semantic composition (to be discussed
in section 6.2), we prefer to use the term join for this operation, by analogy to the database JOIN
operation (also commonly represented by ⋈). (To be precise, this operation is a projected join—but
we prefer to keep the terminology simple.)
Other joins are less obvious, but still accessible to intuition. For example, the join of |
and ∧ is @. This can be seen with the aid of Venn diagrams, or by considering simple
examples: fish | human and human ∧ nonhuman, thus fish @ nonhuman. Other joins
which can be apprehended in similar fashion include:
| ⋈ ∧ = @        ∧ ⋈ | = A
∧ ⋈ ` = @        ` ⋈ ∧ = A
| ⋈ ` = @        ` ⋈ | = A
∧ ⋈ # = #        # ⋈ ∧ = #
(Note that the join operation is not always commutative: the order in which entail-
ment relations are joined can make a difference to the result.)
x | y                y | z                x ? z
gasoline | water     water | petrol       gasoline ≡ petrol
pistol | knife       knife | gun          pistol @ gun
dog | cat            cat | terrier        dog A terrier
rose | orchid        orchid | daisy       rose | daisy
woman | frog         frog | Eskimo        woman # Eskimo
The join of | and | is “nondeterministic”, in the sense that x | y and y | z does not
determine a unique relation in B which holds between x and z. And there are many
pairs of relations in B which share this property. In fact, we’ll show in section 5.6.2
that, of the 49 possible joins of two relations in B, 17 are “nondeterministic” in this
sense.
between x and z. We use πx,y (T ) to denote the projection of the ternary relation T
onto a binary relation between x and y. The three such projections may be defined
as follows:
πx,y(T) ≝ {⟨x, y⟩ : ∃z (⟨x, y, z⟩ ∈ T)}
πy,z(T) ≝ {⟨y, z⟩ : ∃x (⟨x, y, z⟩ ∈ T)}
πx,z(T) ≝ {⟨x, z⟩ : ∃y (⟨x, y, z⟩ ∈ T)}
By construction, each of πx,y (T ), πy,z (T ), and πx,z (T ) must be one of the 16 elementary
(binary) set relations in R introduced in section 5.3.
We can now determine the join of two relations R and S in R as follows:14
1. Find all ternary relations T in T having binary relation R between x and y and
binary relation S between y and z, that is, {T ∈ T : πx,y (T ) = R ∧ πy,z (T ) = S}.
2. For each such T, determine the binary relation between x and z, namely πx,z(T).
3. Take the union of the relations obtained in step 2; this union is the join R ⋈ S.
Each of these three steps is easily implemented in code, and because |T| is only 256, a
complete join table for R can be computed very quickly. Table 5.2 shows an example
of computing the join of R1110 (which is identical with |) with itself.
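Indeed, the computation is short; the sketch below enumerates the 256 ternary emptiness patterns and collects the possible x-z relations for a given pair of labels. The dictionary encoding of patterns is a choice of this sketch, and, as footnote 14 notes, the result is a tight upper bound on the true join rather than an exact one.

```python
from itertools import product

def project(cells, i, j):
    """Project a ternary emptiness pattern onto coordinates i and j.

    `cells[(a, b, c)]` is 1 if the region with membership bits (a, b, c) for
    (x, y, z) is non-empty; the result is the label of the induced binary relation,
    using the bit-string convention of section 5.3.
    """
    bits = ["0"] * 4
    for triple, nonempty in cells.items():
        if nonempty:
            bits[2 * triple[i] + triple[j]] = "1"
    return "R" + "".join(bits)

def join(r_label, s_label):
    """All binary relations that can hold between x and z, given R(x, y) and S(y, z)."""
    results = set()
    for assignment in product([0, 1], repeat=8):       # the 256 ternary patterns
        cells = dict(zip(list(product([0, 1], repeat=3)), assignment))
        if project(cells, 0, 1) == r_label and project(cells, 1, 2) == s_label:
            results.add(project(cells, 0, 2))
    return results

# The computation of table 5.2: joining R1110 (i.e., |) with itself.
print(sorted(join("R1110", "R1110")))
```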
Since B (the set of seven basic entailment relations) is a subset of R (the set of 16
elementary entailment relations), computing the joins of relations in B requires no
14 To be precise, the method described produces a (relatively tight) upper bound on the true join
of R and S, not an exact result. As we observed in footnote 13, the relation which results from this
method may include a few pairs ⟨x, z⟩ of expressions such that x cannot be an argument to R, or
z to S. The exact join can be computed by exhaustively enumerating the extensions of R and S.
However, the method described is far more efficient and intuitive, and the approximation involved
makes little practical difference.
additional work. The complete join table for the relations in B is shown in table 5.3.
Note that joining two relations in B always yields either a relation in B or a union
thereof; the “degenerate” relations in R do not appear.
⋈    ≡       @        A        ∧    |        `        #
≡    ≡       @        A        ∧    |        `        #
@    @       @        ≡@A|#    |    |        @∧|`#    @|#
A    A       ≡@A`#    A        `    A∧|`#    `        A`#
∧    ∧       `        |        ≡    A        @        #
|    |       @∧|`#    |        @    ≡@A|#    @        @|#
`    `       `        A∧|`#    A    A        ≡@A`#    A`#
#    #       @`#      A|#      #    A|#      @`#      ≡@A∧|`#
16 That is, the relations in B plus 9 union relations. Note that this closure fails to include most of
the 120 possible union relations. Perhaps surprisingly, the unions ∪{≡, @} and ∪{∧, |} mentioned
in footnote 15 do not appear.
17 In fact, computer experiments show that if relations are selected uniformly at random from B,
it requires on average just five joins to reach •.
Chapter 6
Compositional entailment
frigid @ cold, soar @ rise); and antonyms generally belong to the | relation (hot |
cold, rise | fall, advocate | opponent). (Note that most antonym pairs do not belong
to the ∧ relation, since they typically do not exclude the middle.) Two common
nouns which have neither a synonymy nor a hyponymy relation typically describe
exclusive categories, and thus belong to the | relation, whether they are coordinate
terms (cat | dog) or unrelated nouns (battle | chalk ). By contrast, two unrelated
adjectives usually describe qualities which are not incompatible, and therefore belong
to the # relation (weak # temporary). The appropriate entailment relation between
two arbitrary verbs may depend in part on their lexical aspects (Aktionsarten), and
may also depend on world knowledge not readily available to an automatic system.
For example, plausibly skiing | sleeping (because someone who is skiing is surely not
asleep), but skiing # talking (someone who is skiing may or may not be talking).
Proper nouns, which denote individual entities or events, will stand in the ≡
relation if they denote the same entity (USA ≡ United States), or the | relation
otherwise (JFK | FDR). A proper noun and a common noun should be assigned to
the @ relation if it can be established that the former is an instance of the latter
(Socrates @ man). There is an interesting question concerning geographic meronym-
holonym pairs: should they be assigned to the @ relation, as in Kyoto @ Japan?
While this may seem intuitive, it leads to nonsensical conclusions such as Kyoto is a
beautiful city @ Japan is a beautiful city; thus it is better to assign such pairs to the
| relation. On the other hand, locative phrases formed from meronym-holonym pairs
do belong to the @ relation: thus in Kyoto @ in Japan.
In this work, we have adopted the (perhaps dubious) assumption that tense and
aspect matter little in inference. Similarly, we neglect the distinction between singu-
lar and plural. Consequently, we routinely assign all edits involving auxiliaries and
inflection to the ≡ relation; thus we say that did sleep ≡ has slept and is sleeping
≡ sleeps. While a purist might quibble with calling such phrase pairs semantically
equivalent, we find that the flexibility afforded by this approximation yields a sub-
stantial dividend in the performance of our implemented system, and we are bolstered
by the fact that the official definition of the RTE task explicitly specifies that tense
be ignored. Similarly, we treat punctuation as semantically vacuous, and thus assign
all ≡ every
every @ some
some ∧ no
no | every
four or more @ two or more
exactly four | exactly two
at most four ` at least two
most # ten or more
Some of these results may be more intuitive when seen in context. For instance, from
some ∧ no, we can establish that some birds talk ∧ no birds talk, and from exactly
four | exactly two, we can establish that exactly four jurors smoke | exactly two jurors
smoke. Note also that some of these assertions assume the non-vacuity (section 5.5.1)
RTE task, in which the premise p often contains much extraneous content not found
in the hypothesis h. Most RTE systems try to determine whether p subsumes h: they
penalize new content inserted into h, but do not penalize content deleted from p. The
success of this strategy depends on the prevalence of upward-monotone contexts, and
thus can easily be derailed by the presence of negation, certain quantifiers, restrictive
verbs and adverbs, and other downward-monotone operators.
                                     projectivity
connective                           ≡   @   A   ∧   |   `   #
negation (not)                       ≡   A   @   ∧   `   |   #
conjunction (and) / intersection     ≡   @   A   |   |   #   #
disjunction (or)                     ≡   @   A   `   #   `   #
conditional (if) (antecedent)        ≡   A   @   #   #   #   #
conditional (if) (consequent)        ≡   @   A   |   |   #   #
biconditional (if and only if)       ≡   #   #   ∧   #   #   #
Table 6.1: Projectivity signatures for various constructions which correspond to log-
ical connectives. Certain approximations have been made; see text for details.
swaps | and `. It thus projects every relation as its dual under negation (in the sense
defined formally in section 5.3.1). For example:
Note that these results do not match the projectivity signatures obtained algorithmi-
cally for material implication (which include the partial signatures {∧:`, |:`, `:#}
for the antecedent, and {∧:`, |:#, `:`} for the consequent). Clearly, material impli-
cation is not a good model for the semantics of the conditional statement as typically
used in natural language; for a thorough discussion, see Stalnaker (1968; 1992).
Biconditionals. Biconditionals are rarely used in natural language, but where they
are, they block projection of all entailment relations except ≡ and ∧.
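The signatures in table 6.1 can be represented as simple lookup tables and applied by composing them outward through the context; a sketch (only three connectives are transcribed, and all but the negation signature carry the approximations discussed in the caveats below):

```python
# Projectivity signatures from table 6.1 (input relation order: ≡ @ A ∧ | ` #).
RELATIONS = ["≡", "@", "A", "∧", "|", "`", "#"]

PROJECTIVITY = {
    "negation":    dict(zip(RELATIONS, ["≡", "A", "@", "∧", "`", "|", "#"])),
    "conjunction": dict(zip(RELATIONS, ["≡", "@", "A", "|", "|", "#", "#"])),
    "disjunction": dict(zip(RELATIONS, ["≡", "@", "A", "`", "#", "`", "#"])),
}

def project(relation, contexts):
    """Project a lexical entailment relation outward through a list of contexts,
    innermost first."""
    for context in contexts:
        relation = PROJECTIVITY[context][relation]
    return relation

# dancing A waltzing, but under negation the containment is flipped:
print(project("A", ["negation"]))   # '@': isn't dancing @ isn't waltzing
# negation swaps | and `: hot | cold, but not hot ` not cold
print(project("|", ["negation"]))   # '`'
```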
Some caveats are in order. While the results shown in table 6.1 are broadly
correct, certain approximations have been made (except in the case of negation, for
which the projectivity signature shown is exact). The reason is that, for binary
functions, the projection of a given entailment relation can depend on the value
of the other argument to the function. That is, if we are given β(x, y), and we
are trying to determine its projection β(f (x, z), f (y, z)), the answer can depend not
only on the properties of f , but also on the properties of z. In particular, if z
has some “special” relation to x or y (say, ≡, @, or ∧), a different projection may
result. Therefore, the results shown in table 6.1 reflect the assumption that z is
as “generic” as possible: specifically, that z stands in the (minimally informative)
# relation to both x and y. Our motivation for this assumption is closely related
to our motivation for ignoring vacuous expressions (those having empty or universal
denotations): in natural language one rarely combines two expressions which already
have some “special” relation, since the result is typically either redundant (French
European, sofa or couch) or nonsensical (human nonhuman, hot cold food ). This
assumption causes the results shown in table 6.1 to be imprecise in the following
ways:
• Certain “special” values for z can cause conjunction, disjunction, and the con-
ditional (in both the antecedent and the consequent) to project @ or A as ≡.
As an example, consider the projection of @ by conjunction (and ). Suppose x
is French, y is European, and z is Parisian. Since the conjunctions French and
Parisian and European and Parisian are both equivalent to Parisian, the en-
tailment relation between them is ≡. In table 6.1, we neglect such possibilities,
following the argument that such redundant expressions rarely occur in natural
language.
• Certain “special” values for z can result in degenerate entailment relations (in
the sense defined in section 5.3.1) being projected. As an example, consider
the projection of | by conjunction (and ). Suppose x is male, y is female, and
z is asexual. Since the conjunctions male and asexual and female and asex-
ual are both empty, the entailment relation between them is R1000 (defined in
section 5.3.1). In table 6.1, we neglect such possibilities, in keeping with our
assumption that semantically vacuous expressions will rarely be encountered in
natural language.
• Except in the case of negation, the projections shown as # are in fact larger, less
informative union relations (defined in section 5.6.3): either ∪{≡, @, A, |, #},
∪{≡, @, A, `, #}, or ∪{@, A, |, `, #}. As in section 5.6.3, we argue that these
approximations make little practical difference, since the true projections con-
tain #, and # is already the least informative entailment relation.
• Some of the relations shown as projections in table 6.1 are not strictly identical
with the true projections, but rather are least upper bounds (in the space of
relations in B and unions thereof) on the true projections. In other words, the
relations shown may contain certain pairs of expressions which cannot occur as
a result of projection. The discrepancies in question are small, and are of no
practical consequence. (Note that a similar issue was encountered in connection
with computing the joins of relations in B in section 5.6.2.)
              1st argument (restrictor)        2nd argument (body)
quantifier     ≡   @   A   ∧    |    `    #     ≡   @   A   ∧    |    `   #
no             ≡   A   @   |†   #    |†   #     ≡   A   @   |†   #    |†  #
every          ≡   A   @   |‡   #    |‡   #     ≡   @   A   |†   |†   #   #
not every      ≡   @   A   `‡   #    `‡   #     ≡   A   @   `†   `†   #   #
at least two   ≡   @   A   #    #    #    #     ≡   @   A   #    #    #   #
most           ≡   #   #   #    #    #    #     ≡   @   A   |    |    #   #
exactly one    ≡   #   #   #    #    #    #     ≡   #   #   #    #    #   #
all but one    ≡   #   #   #    #    #    #     ≡   #   #   #    #    #   #
Table 6.2: Projectivity signatures for various binary generalized quantifiers for each
argument position. “1st argument” refers to the restrictor NP; “2nd argument” refers
to the body VP. Results marked with † or ‡ depend on the assumption of non-vacuity;
see text.
every and most in their second arguments, where | is projected without change: thus
everyone was early | everyone was late and most people were early | most people were
late. (Observe the similarity of the projectivity signatures of every and most in their
second arguments to that of intersective modification.)
Finally, a caveat: some of the results shown in table 6.2 depend on assuming the
non-vacuity of the other argument to the quantifier: those marked with † assume it
to be non-empty, while those marked with ‡ assume it to be non-universal. Without
these assumptions, # is projected.
signature    example
implicatives
+/–          manage to
+/◦          force to
◦/–          permit to
–/+          fail to
–/◦          refuse to
◦/+          hesitate to
factives
+/+          admit that
–/–          pretend that
◦/◦          believe that
Table 6.4: The lexical entailment relations generated by deletions and insertions of
implicatives (and nonfactives), by implication signature.
Table 6.4 depicts the lexical entailment relations generated by deletions and in-
sertions of implicative operators according to their implication signatures. This table
invites several observations. First, as the examples make clear, there is room for vari-
ation regarding passivization and morphology. An implemented model must tolerate
such diversity. Second, the entailment relations generated by deletion and insertion
are converses of each other, as we’d expect. Third, we get the right predictions
concerning the implications of implicatives occurring in negative polarity contexts.
Indeed, some of the examples may seem more intuitive when one considers their
negations. For example, deleting an operator with implication signature ◦/– gener-
ates lexical entailment relation A, but as we saw in section 6.2, under negation A
is projected as @ (he wasn’t permitted to live @ he didn’t live). Likewise, deleting
an operator with implication signature ◦/+ generates lexical entailment relation `;
under negation, this is projected as | (he didn’t hesitate to ask | he didn’t ask ).
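One way to encode these signatures is a simple lookup from implication signature to the lexical entailment relation generated by deletion, with insertion taking the converse. Only the ◦/–, ◦/+, +/+, –/–, and ◦/◦ entries below are taken from the surrounding discussion (and footnote 6); the remaining entries are inferred from the signature semantics and should be read as assumptions of this sketch rather than as quotations from table 6.4.

```python
# Lexical entailment relations generated by deleting an implicative or (non)factive,
# keyed by implication signature. Entries marked "assumed" are inferences of this
# sketch, not quoted from table 6.4.
DELETION_RELATION = {
    "+/–": "≡",   # assumed: he managed to escape vs. he escaped
    "+/◦": "@",   # assumed: he was forced to sell @ he sold
    "◦/–": "A",   # he was permitted to live A he lived
    "–/+": "∧",   # assumed: he failed to escape ∧ he escaped
    "–/◦": "|",   # assumed: he refused to go | he went
    "◦/+": "`",   # he hesitated to ask ` he asked
    "+/+": "@",   # factive, as treated by NatLog (footnote 6)
    "–/–": "|",   # counterfactive, as treated by NatLog (footnote 6)
    "◦/◦": "#",   # nonfactive: he believed he had won # he had won
}

CONVERSE = {"≡": "≡", "@": "A", "A": "@", "∧": "∧", "|": "|", "`": "`", "#": "#"}

def insertion_relation(signature):
    """Insertion generates the converse of the deletion relation."""
    return CONVERSE[DELETION_RELATION[signature]]
```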
knew implies he knew, but not vice-versa. Likewise, deleting implication signature
–/– seems to generate the lexical entailment relation |, since he pretended he was sick
excludes he was sick, but is not equivalent to its negation.
However, following this path does not lead to the right predictions in negative
polarity contexts. If deletions of signature +/+ yield @, then since negation projects
@ as A, we would expect to find that he didn’t admit that he knew A he didn’t know.
But this is clearly incorrect. In the first place, he didn’t know does not imply he didn’t
admit that he knew. Moreover, if x A y, then the truth of x must be compatible with
the truth of y, but one cannot consistently assert both he didn’t admit that he knew
and he didn’t know, since the former presupposes the negation of the latter. Similarly,
if deletions of signature –/– yield |, then since negation projects | as `, we would
expect to find that he didn’t pretend he was sick ` he wasn’t sick. But this too is
incorrect. If x ` y, then the truth of x must be compatible with the falsity of y, but
one cannot consistently assert both he didn’t pretend he was sick and he was sick,
since the former presupposes the negation of the latter.
What has gone wrong? Recall from section 6.3.1 that the implication between a
factive and its complement is not an entailment, but a presupposition. The problem
arises principally because, as is well known, the projection behavior of presuppositions
differs from that of entailments (van der Sandt 1992). Consequently, the model
of compositional entailment developed in this chapter does not suffice to explain
inferences involving presupposition.6 While the model could perhaps be elaborated
to handle presupposition, we have chosen not to pursue this path.
The problems described here do not affect the nonfactives (implication signature
◦/◦), since the nonfactives do not carry any presupposition (or, indeed, any entail-
ment) regarding their complements. Thus, we can accurately say that deleting a
nonfactive generates lexical entailment relation # (he believed he had won # he had
won). And, since negation projects # as #, we obtain the correct predictions in
negative contexts (he didn’t believe he had won # he hadn’t won).
6 Nevertheless, the NatLog system described in chapter 7 does handle factives (proper) and coun-
terfactives within this framework, assigning lexical entailment relations @ and | to deletions of
implication signatures +/+ and –/– (respectively). Indeed, NatLog benefits from doing so, since in
practice, this leads to correct predictions more often than not.
                                           projectivity
signature   example        monotonicity    ≡   @   A   ∧   |   `   #
+/–         manage to      up              ≡   @   A   ∧   |   `   #
+/◦         force to       up              ≡   @   A   |   |   #   #
◦/–         permit to      up              ≡   @   A   `   #   `   #
–/+         fail to        down            ≡   A   @   ∧   `   |   #
–/◦         refuse to      down            ≡   A   @   |   #   |   #
◦/+         hesitate to    down            ≡   A   @   `   `   #   #
+/+         admit that     up              ≡   @   A   ∧   ∧   #   #
–/–         pretend that   up              ≡   @   A   ∧   #   ∧   #
◦/◦         believe that   non             #   #   #   #   #   #   #
Table 6.5: The monotonicity and projectivity properties of implicatives and factives,
by implication signature. Some results may depend on whether one assumes a de
dicto or de re reading; see text.
entailment relation for edit ei , and sometimes refer to it using the alternate
notation β(xi−1 , ei ).
3. Join atomic entailment relations across the sequence of edits, as in section 5.6:
β(p, h) = β(x0, xn) = β(x0, e1) ⋈ · · · ⋈ β(xi−1, ei) ⋈ · · · ⋈ β(xn−1, en)
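To make the three steps concrete, here is a sketch that projects each edit's lexical entailment relation through (zero or more) enclosing negations and then joins the results across the sequence. The join table transcribes table 5.3; modeling only negation's projectivity and collapsing any union-valued intermediate result to # are simplifications of the sketch, not features of the full model. The example at the end reproduces the three-edit analysis worked through in section 6.5 (del(hesitate to), del(didn't), sub(Prozac, medication)).

```python
R = ["≡", "@", "A", "∧", "|", "`", "#"]

# Join table transcribed from table 5.3; set-valued entries are the
# nondeterministic (union-relation) cases.
JOIN = {
    "≡": dict(zip(R, ["≡", "@", "A", "∧", "|", "`", "#"])),
    "@": dict(zip(R, ["@", "@", {"≡", "@", "A", "|", "#"}, "|", "|",
                      {"@", "∧", "|", "`", "#"}, {"@", "|", "#"}])),
    "A": dict(zip(R, ["A", {"≡", "@", "A", "`", "#"}, "A", "`",
                      {"A", "∧", "|", "`", "#"}, "`", {"A", "`", "#"}])),
    "∧": dict(zip(R, ["∧", "`", "|", "≡", "A", "@", "#"])),
    "|": dict(zip(R, ["|", {"@", "∧", "|", "`", "#"}, "|", "@",
                      {"≡", "@", "A", "|", "#"}, "@", {"@", "|", "#"}])),
    "`": dict(zip(R, ["`", "`", {"A", "∧", "|", "`", "#"}, "A", "A",
                      {"≡", "@", "A", "`", "#"}, {"A", "`", "#"}])),
    "#": dict(zip(R, ["#", {"@", "`", "#"}, {"A", "|", "#"}, "#",
                      {"A", "|", "#"}, {"@", "`", "#"}, set(R)])),
}

NEGATION = dict(zip(R, ["≡", "A", "@", "∧", "`", "|", "#"]))   # from table 6.1

def infer(edits):
    """edits: a list of (lexical_relation, n_negations_in_scope) pairs, in edit order."""
    result = "≡"
    for lexical, n_neg in edits:
        atomic = lexical
        for _ in range(n_neg):                  # project through each enclosing negation
            atomic = NEGATION[atomic]
        joined = JOIN[result][atomic]
        result = joined if isinstance(joined, str) else "#"    # collapse unions to #
    return result

# del(hesitate to) under "didn't", then del(didn't), then sub(Prozac, medication):
print(infer([("`", 1), ("∧", 0), ("@", 0)]))    # '@': the premise entails the hypothesis
```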
While this inference method effectively explains many common patterns of infer-
ence (see section 6.5 for examples), it faces several important limitations:
• Third, the inference method provides no general mechanism for combining in-
formation from multiple premises. It thus fails to capture many inference rules
of classical logic, including modus ponens, modus tollens, disjunction elimina-
tion, and so on. (However, some inferences can be enabled by treating auxiliary
premises as encoding lexical entailment relations. For example, if we encode All
men are mortal as the lexical entailment men @ mortal, then we can enable the
classic syllogism Socrates is a man @ Socrates is mortal.)
Because of these limitations, the inference method we describe here has less de-
ductive power than first-order logic. Indeed, it fails to sanction some fairly simple
inferences, including de Morgan’s laws for quantifiers (see section 6.5.4 for details).
Yet despite its limitations, this inference method neatly explains a broad range of
inferences, including not only those which involve semantic containment (which are
also explained by the monotonicity calculus) but also those which involve semantic
exclusion and implicativity (which are not). The next section presents a variety of
examples.
6.5 Examples
In the following pages, we illustrate the operation of the inference method described in
section 6.4, and elucidate some related issues, by working through a series of concrete
examples of its application.
We also show (in gray) the intermediate forms through which p progresses as it is
transformed into h by the sequence of edits.
The analysis proceeds as follows. First, replacing cat with its coordinate term dog
generates the | relation. Next, inserting not generates the ∧ relation, and | joined with
∧ yields @. Finally, replacing dog with its hyponym poodle generates the A relation.
Because of the downward-monotone context created by not, this is projected as @,
and @ joined with @ yields @. Therefore, p entails h.
We can analyze these three edits as follows. First, the deletion of the implica-
tive hesitate to generates lexical entailment relation `, according to its implication
signature (namely ◦/+); but because of the downward-monotone context created by
didn’t, this is projected as |. Next, the deletion of didn’t generates lexical entailment
relation ∧, in an upward-monotone context, and | joined with ∧ yields @. Finally, the
substitution of Prozac with its hypernym medication generates lexical entailment re-
lation @, in an upward-monotone context, and @ joined with @ yields @. Therefore,
p entails h.
As another example, consider:
p: We were not permitted to smoke.
h: We smoked Cuban cigars.
This is not a valid inference: in fact, p and h are incompatible. But again, p can
be transformed into h by a sequence of three edits (neglecting auxiliaries and mor-
phology, as before), which we analyze as follows. First, deleting the implicative
  edit                       β(ei)   β(xi−1, ei)   β(x0, xi)

1 del(hesitate to)           `       |             |
2 del(didn’t)                ∧       ∧             @
3 sub(Prozac, medication)    @       @             @

1 del(hesitate to)           `       |             |
2 sub(Prozac, medication)    @       A             |
3 del(didn’t)                ∧       ∧             @

1 del(didn’t)                ∧       ∧             ∧
2 del(hesitate to)           `       `             @
3 sub(Prozac, medication)    @       @             @

1 del(didn’t)                ∧       ∧             ∧
2 sub(Prozac, medication)    @       A             |
3 del(hesitate to)           `       `             @

1 sub(Prozac, medication)    @       @             @
2 del(hesitate to)           `       |             |
3 del(didn’t)                ∧       ∧             @

1 sub(Prozac, medication)    @       @             @
2 del(didn’t)                ∧       ∧             |
3 del(hesitate to)           `       `             @
Table 6.9: Six analyses of the same inference, using different edit orders.
Table 6.10: Analysis of an example of de Morgan’s laws for quantifiers, first try.
Table 6.11: Analysis of an example of de Morgan’s laws for quantifiers, second try.
and |. Because this result contains the desired result (namely ≡), it is not incorrect,
but it is far less informative than we would hope.
Different orderings of the three required edits lead to the same outcome. For
example, if we insert the predicate negation before we perform the quantifier substi-
tution, then we obtain the analysis shown in table 6.11. As before, the deletion of the
sentential negation generates the ∧ relation. But this time, the insertion of the pred-
icate negation occurs in the second argument (the body) of the universal quantifier
all, so that the lexical entailment relation ∧ is projected as | (see section 6.2.4), and ∧
joined with | yields A. Finally, the quantifier substitution generates @, and A joined
with @ again yields the non-exclusion relation: not incorrect, but under-specific.
While we won’t bother to work through the details here, it can be shown that each
of the six possible orderings of the three required edits leads to the same result, albeit
via different intermediate steps. The problem could perhaps be solved by allowing
more complex edit operations to count as atomic, and by specifying the appropriate
lexical entailment relations generated by these richer edits. In our example, it would
suffice to treat the quantifier substitution and the insertion of predicate negation as
a single atomic edit which generates the ∧ relation. However, such an ad-hoc solution
has little to recommend it.
join: as explained in section 5.6, | joined with ∧ yields @. The fifth edit substitutes
move with its hyponym dance, generating the A relation. However, because the edit
occurs within the scope of the newly-introduced negation, A is projected as @, and
@ joined with @ yields @. The sixth edit deletes a generic modifier (blue), which
generates the @ relation by default. This time the edit occurs within the scope of two
downward-monotone operators (without and negation), so we have two inversions of
monotonicity, and @ is projected as @. Again, @ joined with @ yields @. Finally,
the seventh edit substitutes jeans with its hypernym pants, generating the @ relation.
Again, the edit occurs within the scope of two downward-monotone operators, so @
is projected as @, and @ joined with @ yields @. Thus p entails h.
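The cancellation of the two downward-monotone operators here can be pictured with another small sketch of our own (again, not the dissertation's code): for the containment relations, each enclosing downward-monotone operator contributes one inversion, so an even number of inversions leaves @ and A unchanged.

    def project_through(relation, num_downward_ops):
        """Project @ ('fwd') or A ('rev') through a chain of enclosing
        downward-monotone operators (such as negation and "without"): each
        operator swaps @ and A, so an even number of them cancels out."""
        assert relation in ('fwd', 'rev')
        if num_downward_ops % 2 == 1:
            return {'fwd': 'rev', 'rev': 'fwd'}[relation]
        return relation

    print(project_through('fwd', 2))   # 'fwd': del(blue) under "without" and negation
    print(project_through('rev', 1))   # 'fwd': a hyponym substitution under negation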
Of course, the edit sequence shown is not the only sequence which can transform p
into h. A different edit sequence might yield a different sequence of intermediate steps,
but the same final result. Consider, for example, the edit sequence shown in table 6.13.
Note that the lexical entailment relation β(ei ) generated by each edit is the same
as before. But because the edits involving downward-monotone operators (namely,
ins(n’t) and del(refused to)) now occur at different points in the edit sequence, many
of the atomic entailment relations β(xi−1 , ei ) have changed, and thus the sequence of
joins has changed as well. In particular, edits 3 and 4 occur within the scope of three
downward-monotone operators (negation, refuse, and without), with the consequence
that the @ relation generated by each of these lexical edits is projected as A. Likewise,
edit 5 occurs within the scope of two downward-monotone operators (negation and
refuse), and edit 6 occurs within the scope of one downward-monotone operator
(negation), so that | is projected as `. Nevertheless, the ultimate result is still @.
However, it turns out not to be the case that every edit sequence which trans-
forms p into h will yield equally satisfactory results. Consider the sequence shown in
table 6.14. The crucial difference in this edit sequence is that the insertion of not,
which generates lexical entailment relation ∧, occurs within the scope of refuse, so
that ∧ is projected as atomic entailment relation | (see section 6.3.4). But the dele-
tion of refuse to also produces atomic entailment relation | (see section 6.3.2), and |
joined with | yields a relatively uninformative union relation, namely ⋃{≡, @, A, |, #}
(which could also be described as the non-exhaustion relation). The damage has
been done: further joining leads directly to the “black hole” relation •, from which
there is no escape.
Note, however, that even for this infelicitous edit sequence, our inference method
has not produced an incorrect answer (because the • relation includes the @ relation),
only an uninformative answer (because it includes all other relations in B as well). In
fact, through all the examples we have considered, we have encountered no evidence
that our inference method is unsound, i.e., that it can lead to an incorrect (as opposed
to merely uninformative) result. We’ll consider whether this is true in general in the
final section of this chapter.
described in section 6.4 depends on many factors which defy formalization. First, as
noted above, we must find a suitable sequence of edits which transforms p into h (and
the inference method per se offers no guidance on how to find such a sequence, or
indeed, whether one exists). Second, we must predict the correct entailment relation
for each edit. Cases like sub(jeans, pants) are straightforward enough, but (as we
noted in section 6.1) other cases are not so clear-cut. Lexical ambiguity is one issue:
whether bank @ building depends on which sense of bank is intended. Vagueness is
another: does three @ several? Does method ≡ strategy? Finally, we must correctly
determine the projectivity properties of semantic functions in our sentences, and
properly identify the extent of their arguments. Classic scope ambiguities (as in Every
man loves some woman) may be one issue, but there are other potential sources of
difficulty. For example, we noted in section 6.2.5 that while most verbs project | as
#, those which encode functional relations project | as |. But whether a particular
verb encodes a functional relation may not be entirely clear-cut.
Many of the difficulties just described involve properly identifying the seman-
tic properties of lexical items involved in the inference. A further consideration—
though it is purely theoretical—is that lexical items exhibiting certain combinations
of semantic properties could, in principle, render our inference method unequivocally
unsound.8 Suppose there were an English adverb X-ly having the following (admit-
tedly bizarre) combination of properties: first, it is semantically vacuous (so that
β(del(X-ly)) = ≡); and second, it is downward monotone (so that its projectivity
signature includes {@:A}). Now consider the following inference:
p: John runs X-ly.
h: John moves.
Here, p is transformed into h by two edits, which can occur in either order:
sub(runs, moves) and del(X-ly). Of course, β(sub(runs, moves)) = @. Now, the
result produced by the inference method will depend upon the order in which the
edits are applied. If we first apply the del and then the sub, the result is @, because
the del is semantically vacuous, and after it has occurred, X-ly is no longer present
to affect the sub, which generates @. However, if we first apply the sub and then
the del, then the result is A, because the presence of the downward-monotone X-ly
causes the lexical entailment relation @ produced by the sub to be projected as A.
Since it is not possible that p @ h and p A h, at least one of these two results must
be incorrect. Therefore, if expressions like X-ly can occur in natural language, then
at least some edit orders can cause our inference method to produce incorrect results.
8 The following example is due to an anonymous reviewer for the Eighth International Conference on Computational Semantics, to whom we are indebted.
We are not aware of any actual natural language expression having the unusual
combination of properties exemplified by X-ly. Indeed, it is difficult to imagine that
such an expression could be coined—what, exactly, might it mean? It therefore seems
a plausible hypothesis that expressions like X-ly do not occur in natural languages,
and consequently that, in practice if not in principle, such expressions pose no threat
to the soundness of our inference method. This claim resembles typological claims
about which monotonicity patterns of quantifiers can exist in natural languages.
But because we can offer no concrete evidence to support this
claim, it remains a hypothesis only.
Whatever the theoretical limitations of our model of natural logic, it has been
shown in practice to address a wide range of inference problems successfully. In
section 6.5, we illustrated this using a variety of toy examples; in chapter 7, we will
describe a computational implementation of this model and a large-scale evaluation
of its effectiveness on real-world inference problems.
Chapter 7
The NatLog System
The model of natural logic developed in chapters 5 and 6 may be an elegant theory,
and we have seen that it successfully explains a handful of contrived examples. But
is the model just a toy, or can it be effectively implemented and applied to real-
world examples of natural language inference? This chapter aims to answer those
questions.1
The NatLog system is a Java implementation of our model of natural logic. In the
following pages, we present the high-level architecture of the system (section 7.1) and
provide detailed descriptions of each of its five key components (sections 7.2 through
7.6). Next, we illustrate the operation of the system (section 7.7) using a concrete
example first introduced in the previous chapter. We then discuss evaluations of
the effectiveness of the system on two well-known collections of NLI problems: the
FraCaS test suite (section 7.8) and the RTE3 test suite (section 7.9).
RTE systems, alignment is not explicitly formulated as a separate task, but is achieved
implicitly as part of entailment classification. In adopting a pipelined approach, we
follow a precedent established by Marsi and Krahmer (2005), and developed further in
the Stanford RTE system, as described in (MacCartney et al. 2006) and in chapter 4
of this dissertation.
As we saw in chapter 4, the Stanford RTE system uses a pipeline of three stages,
comprising linguistic analysis, alignment, and entailment classification. The NatLog
pipeline has a similar structure, but places far greater emphasis on entailment classi-
fication, which it divides into three separate stages. Altogether, then, the NatLog
pipeline includes five stages of processing:
predicted by the preceding stage across the chain of edits in order to produce
the entailment relation which holds between p and h.
7.3 Alignment
In the second stage of processing, we establish an alignment between the premise p
and hypothesis h, represented by a sequence of atomic edits, operating over phrases,
which transforms p into h. (Following the usage common in the machine translation
literature, we use the term “phrase” to refer to a contiguous span of word tokens,
whether or not the tokens constitute a syntactic phrase.) This alignment representa-
tion is symmetric and many-to-many, and is general enough to include various other
alignment representations as special cases. We define four edit types: del (deletion of
a phrase from p), ins (insertion of a phrase into h), sub (substitution of a phrase of
h for a phrase of p), and eq (equality between a phrase of p and a phrase of h).
Each edit is parameterized by the token indices at which it operates, and edit indices
may “cross”, permitting representation of movement. (This is the same representation
of alignment used by the MANLI system, described in chapter 3.)
The NatLog system does not itself include any facility for finding good alignments;
rather, its purview is the identification of entailment relations between aligned sen-
tence pairs. NatLog therefore relies on alignments derived from other sources: either
human annotators, or automatic alignment systems such as the MANLI system (chap-
ter 3) or the Stanford RTE aligner (chapter 4).
Although NatLog does not itself generate alignments, but merely reads them in
from an outside source, it does perform one important bit of processing as part of
the alignment stage: namely, heuristic ordering of edits. The outside sources from
which NatLog obtains alignments typically do not specify any particular ordering in
which the edits contained in an alignment should be applied. However, the inference
method implemented by NatLog requires that such an ordering be established (see
section 6.4), since this ordering defines a path from p to h through intermediate forms,
and thereby decomposes the inference problem into a sequence of atomic inference
problems, one for each atomic edit. Since NatLog is free to choose whatever edit
order it prefers, it uses heuristic rules to select an order which maximizes the benefit
of the monotonicity marking performed in the first stage of processing (but described
in section 7.5).
In particular, NatLog prefers to put del edits at the beginning of the edit se-
quence, sub edits in the middle, and ins edits at the end. There is one complication:
the relative ordering of edits involving downward-monotone (or non-monotone) op-
erators (e.g., the deletion of not) may influence the effective monotonicity applicable
for other edits. Therefore, in choosing an edit order, NatLog places any edits involv-
ing downward-monotone (or non-monotone) operators in a separate group, after all
other del and sub edits, and before all other ins edits. This has the benefit that
when NatLog needs to compute the effective monotonicity for a particular edit (in
the fourth stage), it is able to use the monotonicity markings on p for all ordinary
del and sub edits, and the monotonicity markings on h for all ordinary ins edits.
(By “ordinary”, we simply mean “not involving downward-monotone or non-monotone
operators”.)
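The ordering heuristic can be pictured with the following rough sketch (a simplification of ours; the edit representation and the monotonicity test are assumptions, not NatLog internals): ordinary del edits come first, then ordinary sub edits, then any edits involving downward-monotone or non-monotone operators, and ordinary ins edits last.

    def order_edits(edits, affects_monotonicity):
        """Order edits heuristically.  `edits` is a list of (edit_type, phrase)
        pairs with edit_type in {'del', 'sub', 'ins'}; `affects_monotonicity(e)`
        says whether edit e involves a downward-monotone or non-monotone
        operator (e.g. deletion of "refused to" or insertion of "n't")."""
        def bucket(edit):
            if affects_monotonicity(edit):
                return 2                     # after ordinary dels and subs
            return {'del': 0, 'sub': 1, 'ins': 3}[edit[0]]
        return sorted(edits, key=bucket)     # sorted() is stable within buckets

    edits = [('ins', "n't"), ('sub', ('move', 'dance')), ('del', 'refused to'),
             ('del', 'blue'), ('ins', 'did')]
    print(order_edits(edits, lambda e: e[1] in ("n't", 'refused to')))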
WordNet-derived features. Perhaps the most important features for sub edits
are those which exploit WordNet to measure the semantic relatedness of the sub-
stituends, using various relation types, including synonymy, antonymy, and hyper-
nymy. Each of these features takes values on the interval [0..1].
• The WNSyn feature indicates synonymy: it takes value 1 iff the substituends are
synonyms per WordNet (i.e., belong to the same synset), and 0 otherwise.
• The WNAnt feature indicates antonymy: it takes value 1 iff the substituends are
antonyms per WordNet, and 0 otherwise.
• The WNHypo feature indicates hyponymy, and is simply the inverse of the WNHyper
feature. Thus WNHypo(owl , bird ) = 0.0, while WNHypo(bird , owl ) = 0.75.
dJC(s1, s2) =def IC(s1) + IC(s2) − 2 · IC(lso(s1, s2))
where IC(s) is the information content of synset s (that is, the negative of the
logarithm of its frequency of occurrence in a large corpus), and lso(s1 , s2 ) is
the least superordinate (most specific hypernym) of synsets s1 and s2 . Possible
values for dJC range from a minimum of 0 (for synonyms) to a theoretical
maximum of 2 · maxs (IC(s)). Thus, to convert dJC to a measure of semantic
relatedness on [0..1], we use
This formula gives results that are intuitively pleasing, although it has the
consequence that comparatively few pairs of terms receive a score near 0.
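Since the normalization onto [0..1] is not reproduced above, the following self-contained sketch uses an assumed mapping (dividing dJC by its theoretical maximum 2 · maxs IC(s)); the toy synsets, frequencies, and least superordinates are likewise hypothetical, and only dJC itself follows the definition just given.

    import math

    # Hypothetical corpus frequencies for a few WordNet-like synsets.
    FREQ = {'entity': 1.0, 'animal': 0.10, 'bird': 0.02, 'owl': 0.002}
    # Hypothetical least superordinates (most specific common hypernyms).
    LSO = {('owl', 'bird'): 'bird', ('owl', 'animal'): 'animal',
           ('bird', 'animal'): 'animal', ('owl', 'entity'): 'entity'}

    def ic(synset):
        """Information content: negative log of relative corpus frequency."""
        return -math.log(FREQ[synset])

    def d_jc(s1, s2):
        """Jiang-Conrath distance, as defined above."""
        lso = s1 if s1 == s2 else LSO.get((s1, s2), LSO.get((s2, s1)))
        return ic(s1) + ic(s2) - 2 * ic(lso)

    def jc_relatedness(s1, s2):
        """Map d_JC onto [0..1]; this particular normalization is an assumption."""
        max_ic = max(ic(s) for s in FREQ)    # theoretical maximum d_JC is 2 * max_ic
        return max(0.0, 1.0 - d_jc(s1, s2) / (2 * max_ic))

    print(round(jc_relatedness('owl', 'owl'), 2))     # 1.0: synonyms have d_JC = 0
    print(round(jc_relatedness('owl', 'bird'), 2))    # closely related terms score high
    print(round(jc_relatedness('owl', 'entity'), 2))  # distant terms score lower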
• The NomB feature uses information from the NomBank lexical resource (Mey-
ers et al. 2004) to indicate derivationally related words having different parts
of speech, particularly nominalizations. We extracted from the NOMLEX-
PLUS.0.8 file about 4,000 related noun/verb pairs (such as accusation/accuse)
and about 2,500 related adjective/adverb pairs (such as angry/angrily). The
NomB feature is set to 0.75 if the substituends are among these pairs, or to 0
otherwise.
nouns, verbs, and adjectives, estimates the relatedness between a pair of words
or phrases by measuring their distributional similarity, that is, the similarity be-
tween the distributions, in a very large corpus, of the grammatical relationships
in which they participate. It is thus a useful complement to the information ob-
tained from manually-compiled resources like WordNet and NomBank. Because
the scores obtained from the thesaurus are not normalized, we map them onto
[0..1] by dividing every score by the maximum score observed in the thesaurus
(separately for each part of speech).
String similarity features. Since no static lexical resource can possibly contain
every pair of words which might be encountered, the robustness of the system can
be greatly enhanced by including features which measure only string similarity. Such
features are particularly helpful for identifying the equivalence between different ver-
sions of proper names, which are commonly missing from static lexical resources. For
example, string similarity can help us to identify alternate or erroneous spellings of a
name (Ahmadinejad, Ahmadi-Nejad, or Ahmadinejab), or shorter and longer versions
of the same name (Lincoln vs. Abraham Lincoln).
• The LemStrSim feature estimates the similarity between words w1 and w2 using
a function based on the string edit distance between the word lemmas:
LemStrSim(w1, w2) = max(0, 1 − dist(lemma(w1), lemma(w2)) / (max(|lemma(w1)|, |lemma(w2)|) − k))
Here lemma(w) denotes the lemma of word w, as generated by the finite state
transducer of Minnen et al. (2001); dist() denotes Levenshtein string edit dis-
tance (Levenshtein 1966); and |·| denotes string length. k is a penalty parameter
whose purpose is to prevent assigning high similarity to very short pairs of un-
related words with similar spelling, such as or and for, or she and the. For
example, if k = 0, then LemStrSim(she, the) = 0.67, but if k = 2 (the value we
used in practice), then LemStrSim(she, the) = 0.0. The similarity assigned to
longer word pairs is less affected by the penalty parameter. For example, with
k = 2, LemStrSim(Ahmadinejad, Ahmadinejab) = 0.89. (A small code sketch of this
feature appears after this list of features.)
• The LemSubSeq feature is designed to identify the case where one of the sub-
stituends is a multi-word phrase which contains the other as a sub-phrase, as in
sub(Lincoln, Abraham Lincoln) or sub(Dick and Jane, Jane). Unlike the other
features presented so far, the LemSubSeq feature takes values which are entail-
ment relations (and which are subsequently encoded by a bundle of boolean
features). In particular:
– LemSubSeq(p, h) = @ iff h is a subsequence of p
– LemSubSeq(p, h) = A iff p is a subsequence of h
– LemSubSeq(p, h) = ≡ iff p and h are equal
– LemSubSeq(p, h) = # otherwise
As the name suggests, the LemSubSeq feature works by comparing word lemmas,
not merely surface forms.
• The Light feature is a boolean feature which is activated just in case both
substituends contain only semantically “light” words, that is, words which have
a limited impact on the semantics of the sentences in which they appear. The
intent is that substitutions involving light words should be predicted to gener-
ate lexical entailment relation ≡. We define light words to include punctuation,
prepositions, possessive markers, articles, auxiliary verbs, and expletives. Of
course, we do not mean to make a strong claim that all such words are seman-
tically vacuous—rather, this definition reflects a pragmatic choice which seems
to work well in practice.
• The Preps feature is a boolean feature which is activated just in case both
substituends are prepositions. The presence of this feature encourages NatLog
to be somewhat liberal (i.e., biased toward predicting ≡) regarding substitutions
involving prepositions. As we argued in section 6.1.2, this is often—though not
always—an appropriate choice.
• The Pronoun feature is a boolean feature which is activated just in case both
substituends are pronouns. The presence of this feature encourages NatLog to
be somewhat liberal regarding substitutions involving pronouns.
• The NNNN feature is a boolean feature which is activated just in case either
both substituends are common nouns, or both substituends are proper nouns.
The motivation for this feature is to encode the background assumption that
most such pairs of nouns should (in the absence of evidence to the contrary) be
predicted to stand in the | relation, as described in section 6.1.1. For
example, we want apple | banana and Atlanta | Boston. Of course, for specific
pairs of nouns, the ≡, @, or A relation might be more appropriate, but we hope
that in such cases some other featurizer will provide the evidence we need to
make that prediction.
sub(ALL, SOME) → @
sub(ALL, NONE) → |
sub(ALL, NUM) → #
and so on. These definitions are driven both by theory and by pragmatism. If
either substituend fails to be recognized as a quantifier phrase, or if we have not
defined a predicted entailment relation for the given pair of quantifier categories,
then the feature takes value #.
• The MiscSub feature is used to encode knowledge about specific pairs of expres-
sions which do not fit neatly into any of the other features. Its output space is
the set B of basic entailment relations, and it includes hand-coded mappings
from specific pairs of expressions to lexical entailment relations which should be
predicted for sub edits involving those pairs. In particular, the MiscSub feature
encodes the knowledge that:
– sub(and, or ) should yield @.
– sub edits involving different ways of expressing the genitive (e.g., ’s, of,
his, her, its) should yield ≡.
– sub edits involving certain pairs of prepositions should yield | (e.g., after
| before, inside | outside), rather than the default ≡.
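Here is the small sketch of the LemStrSim feature promised above, applied directly to surface strings (lemmatization is omitted) with the same penalty k = 2 used in the text; it illustrates the formula rather than reproducing the system's code.

    def levenshtein(a, b):
        """Standard dynamic-programming string edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution or match
            prev = curr
        return prev[-1]

    def lem_str_sim(w1, w2, k=2):
        """String-similarity feature from the formula above (no lemmatization)."""
        denom = max(len(w1), len(w2)) - k
        if denom <= 0:
            return 0.0                       # very short words get no credit
        return max(0.0, 1.0 - levenshtein(w1, w2) / denom)

    print(round(lem_str_sim('she', 'the'), 2))                  # 0.0 with k = 2
    print(round(lem_str_sim('Ahmadinejad', 'Ahmadinejab'), 2))  # 0.89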
What about del and ins edits? These are handled as one: in fact, when asked
to predict a lexical entailment relation for an ins edit, NatLog first converts it to
a del edit (by swapping the roles of p and h), then applies the lexical entailment
model (which is designed for del edits), and then returns the converse of the resulting
prediction (for example, swapping @ and A).
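A minimal sketch of this reduction (the del_model stub below is a placeholder, not NatLog's actual classifier): an ins edit is scored by swapping the roles of p and h, applying the del model, and returning the converse of the predicted relation.

    # Converses of the basic relations: @ and A swap; the others are symmetric.
    CONVERSE = {'eq': 'eq', 'fwd': 'rev', 'rev': 'fwd',
                'neg': 'neg', 'alt': 'alt', 'cov': 'cov', 'ind': 'ind'}

    def predict_ins(del_model, phrase):
        """`del_model(phrase)` predicts a lexical entailment relation for
        deleting `phrase` from the premise; inserting the same phrase into the
        hypothesis yields the converse relation."""
        return CONVERSE[del_model(phrase)]

    # Deleting a generic modifier yields @, so inserting one yields A:
    print(predict_ins(lambda phrase: 'fwd', 'blue'))   # 'rev'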
The feature representation used for del edits is much smaller than the one used
for sub edits. (As noted in section 6.1.3, most del edits just have @ as the target
lexical entailment relation, so this is an easier prediction problem.) In fact, for del
edits, we use only the Light and Pronoun features, plus an additional feature which
applies only to dels:
Table 7.1: Examples of training problems for the lexical entailment model.
hinging on just one or two features. While linear models excel at combining evidence
from a large number of features, in our problem setting we anticipate comparatively
little need to combine evidence.
In order to train the classifier, we constructed a training set containing 2,449 lexical
entailment problems, derived from alignments of NLI problems from various sources.
Of these problems, 1,525 represented sub edits, and consisted of a pair of words or
phrases. The other 924 represented del edits, and consisted of just a single word or
phrase. Each problem was manually annotated with one of the seven relations in B,
indicating the lexical entailment relation generated by the given edit. As table 7.1
makes clear, the distribution of data was far from even. Because the problems were
derived from actual alignments, most sub problems consisted of pairs of equal (or
nearly equal) expressions, and thus were labeled with relation ≡. The second most
common label for sub problems was |; many of these problems contained pairs of un-
equal proper nouns or mutually exclusive common nouns. Among the del problems,
the overwhelming majority were labeled with relation @, reflecting the prevalence of
intersective modification, as described in section 6.1.3. The second most common
label for del problems was ≡; most such problems involved expressions treated as
semantically vacuous, such as auxiliary verbs, though a few involved implicatives.
During training, the decision tree was minimally pruned, and the resulting tree
contains about 180 leaves. When tested on the training data, the classifier achieves
>99% accuracy, indicating that our feature representation successfully captures nearly
all relevant distinctions between examples.
composition trees. Our solution is imperfect, but largely effective. We define a list of
monotonicity operator types (that is, categories of operators with monotonicity down
or non, such as implicatives like refuse, or prepositions like without), and for each
such operator type we specify its arity and a Tregex tree pattern (Levy and Andrew
2006) which permits us to identify its occurrences in our Treebank parses. We also
specify, for each argument position of each operator type, both the monotonicity class
(down or non) and another Tregex pattern which helps us to determine the likely
extent of the argument in a phrase-structure tree. (Figure 7.1 shows some example
definitions.)
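The following is an illustrative operator-type definition in the spirit of figure 7.1, written as a small Python record; the Tregex patterns shown are assumptions of ours, not the ones actually used by the system.

    # An assumed definition for the downward-monotone preposition "without".
    WITHOUT = {
        'operator': 'without',
        'arity': 1,
        # Locate occurrences: an IN node dominating the word "without".
        'pattern': 'IN < /^(?i:without)$/',
        'arguments': [
            {'monotonicity': 'down',
             # Delimit the argument: the constituent that is the immediate right
             # sister of that IN node inside the enclosing PP (assumed pattern).
             'pattern': '__ > PP $- (IN < /^(?i:without)$/)'},
        ],
    }
    print(WITHOUT['arguments'][0]['monotonicity'])   # 'down'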
Although we have presented monotonicity marking as belonging (logically) to
the fourth stage of NatLog processing, in fact it happens during the first stage, as
part of linguistic analysis. After parsing, we match the tree patterns in each of the
monotonicity operator type definitions against the phrase-structure parse trees for p
and h, and add annotations to parse tree nodes to indicate constituents which are
either monotonicity operators or arguments thereto.
Note that we make no attempt to resolve scope ambiguities, such as that exhibited
by the sentence Every man loves some woman, but we find that in practice this rarely
causes problems: the default behavior usually leads to the right predictions. There
are at least three reasons for this. First, inferences which hinge on scope ambiguities
are rare. Second, the assumption that the first quantifier takes wide scope turns out
to be a fairly reliable heuristic. Third, in many cases the effective monotonicity at
each point in the sentence will be same no matter how the scope ambiguity is resolved.
This is the case with Every man loves some woman: whether some is interpreted as
having wide or narrow scope, the effective monotonicity at man will be down, while
the effective monotonicity in the rest of the sentence will be up.
one by one, across the chain of atomic edits, to produce a final, overall prediction
for the inference problem. The combination is performed using the join operation o
n
defined in section 5.6. This is a deterministic operation, and is simple to compute. (In
fact, the 7 × 7 join table for relations in B is precomputed, so this stage of processing
involves only lookups.)
As we alluded to in section 5.6.3, the NatLog system has no need to represent
unions of relations in B which may result from joining two relations in B. Rather,
those joins which would result in a union relation are approximated instead by #.
Note that if a chain of joins ever reaches #, then it is “stuck” there: no further joining
can lead to any other result. Consequently, prediction errors earlier in the processing
pipeline are likely to cause the final prediction to be #, and this tendency becomes
stronger as the number of atomic edits grows. Because # is the least informative
relation in B (permitting neither the truth nor falsity of one of its arguments to be
inferred from the truth or falsity of the other), we might say that NatLog has an
innate tendency to be conservative in its predictions, in the sense that when it runs
into problems, it tends to predict neither entailment nor contradiction.
This chaining of entailments across edits can be compared to the method presented
in (Harmeling 2007); however, that approach assigns to each edit merely a probability
of preserving truth, not an entailment relation.
p:  (S
      (NP (NNP Jimmy) (NNP Dean))
      (VP (VBD refused)
        (S
          (VP (TO to)
            (VP (VB move)
              (PP (IN without)
                (NP (JJ blue) (NNS jeans)))))))
      (. .))

h:  (S
      (NP (NNP James) (NNP Dean))
      (VP (VBD did) (RB n’t)
        (VP (VB dance)
          (PP (IN without)
            (NP (NNS pants)))))
      (. .))
The best alignment for this example is fairly easy to identify, and the edits it
contains appear in the first column of table 7.2. Note, however, that these edits have
been ordered according to the heuristics described in section 7.3: dels first, then
subs, then edits which involve operators with non-default projectivity, and inss last.
The second column of table 7.2 shows features generated by the lexical entailment
model for each edit. For compactness, we display at most one (especially salient) fea-
ture per edit. Jimmy Dean and James Dean have comparatively high string similarity.
We use WordNet to identify the other two sub edits as hyponym and hypernym sub-
stitutions. The miscDeln feature identifies refuse to as an implicative with signature
+/◦, and the final ins edits are distinguished by their lexical categories.
The third column of table 7.2 shows the lexical entailment relation predicted for
each edit. A generic deletion (edit 1) generates the @ relation. A substitution with
high string similarity (edit 2) leads to the ≡ relation, as do a match (edit 4), and
the insertion of a semantically vacuous auxiliary (edit 8). The hyponym (edit 3)
and hypernym (edit 5) substitutions generate A and @, respectively. The implicative
deletion (edit 6) generates |, according to its implication signature, and the insertion
of negation (edit 7) generates the ∧ relation.
The fourth column of table 7.2 shows the effective monotonicity at the locus of
each edit. (We present monotonicity classes, rather than the more general projectivity
signatures, for simplicity and compactness.) The effective monotonicity for edits 1
through 6 is computed from the syntactic parse for p, while the effective monotonicity
for edits 7 and 8 is computed from the syntactic parse for h. In p, refuse to is identified
as a downward-monotone operator whose scope is its sentential complement, and
without is identified as a downward-monotone operator whose scope is its argument
noun phrase. Consequently, edits 1 and 5 occur in an upward-monotone context
(since they fall within the scope of two downward-monotone operators), while edits
3 and 4 occur in a downward-monotone context (since they fall within the scope of
just one downward-monotone operator), and edit 2 occurs in an upward-monotone
context (since it doesn’t fall within the scope of any downward-monotone operators).
Meanwhile, in h, not is identified as a downward-monotone operator whose scope
is the sentence in which it appears. Consequently, edit 8 occurs in a downward-
monotone context (since it falls within the scope of a downward-monotone operator),
while edit 7 occurs in an upward-monotone context (since it doesn’t fall within the
scope of any downward-monotone operators).
The fifth column of table 7.2 shows how the lexical entailment relations (from
the third column) are projected into atomic entailment relations, according to the
effective monotonicity shown in the fourth column. The only noteworthy case is edit
3, where the lexical entailment relation A is projected into atomic entailment relation
@ by a downward-monotone context.
The sixth and final column of table 7.2 shows the cumulative join of the atomic
entailment relations in the fifth column. Initially, these are unremarkable: @ joined
with either ≡ or @ yields @. But at edit 6, we find that @ joined with | yields |, and
at edit 7, we find that | joined with ∧ yields @ again. The final entailment relation in
this line, @, is NatLog’s final (and correct) answer for our example problem.
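The whole derivation can be compressed into a few illustrative lines (a sketch of ours, not NatLog's internals): the list gives, for each edit, the projected atomic entailment relation of the fifth column, and the loop reproduces the running join of the sixth column; only the join entries this example needs are included.

    # 'eq' is equivalence, 'fwd' is @, 'alt' is |, 'neg' is ^.
    ATOMIC = [
        ('del(blue)',                   'fwd'),  # generic deletion; two downward contexts cancel
        ('sub(Jimmy Dean, James Dean)', 'eq'),   # high string similarity
        ('sub(move, dance)',            'fwd'),  # hyponym sub: A projected as @ (downward context)
        ('eq(without, without)',        'eq'),   # match
        ('sub(jeans, pants)',           'fwd'),  # hypernym sub in an effectively upward context
        ('del(refused to)',             'alt'),  # implicative deletion, signature +/o
        ("ins(n't)",                    'neg'),  # insertion of negation
        ('ins(did)',                    'eq'),   # semantically vacuous auxiliary
    ]

    JOIN = {('fwd', 'eq'): 'fwd', ('fwd', 'fwd'): 'fwd',
            ('fwd', 'alt'): 'alt', ('alt', 'neg'): 'fwd'}

    result = 'eq'                                # equivalence is the identity for join
    for edit, relation in ATOMIC:
        result = relation if result == 'eq' else JOIN[(result, relation)]
        print(f'{edit:30s} {relation:4s} -> {result}')
    # The final result is 'fwd' (@): p entails h.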
In light of the view expressed elsewhere in this and other FraCaS de-
liverables ... that inferential ability is not only a central manifestation of
semantic competence but is in fact centrally constitutive of it, it shouldn’t
be a surprise that we regard inferencing tasks as the best way of testing
an NLP system’s semantic capacity.2
Despite having been designed expressly for the purpose of evaluating NLI systems,
the FraCaS test suite went largely unexploited for more than a decade. In fact, to
our knowledge, we are the first to have undertaken a quantitative system evaluation
using the complete FraCaS data set.3
2 Cooper et al. (1996), p. 63.
3 Sukkarieh (2003) uses a handful of FraCaS examples in a manual evaluation of her McLogic formalism for knowledge representation and inference, but does not undertake a complete, quantitative, or automatic evaluation.
Questions and hypotheses. While the original FraCaS test suite expresses the
“goal” of each problem as a question, the standard formulation of the NLI task involves
determining the entailment relation between a premise and a declarative hypothesis.
Thus, for the purpose of this work, we converted each FraCaS question into a declara-
tive hypothesis, using an automatic tool which first generates a syntactic parse of the
question and then applies a few simple tree-transformation rules. The results were
then manually reviewed for grammaticality. Very few corrections were needed, mostly
involving replacing negative-polarity items such as any. (For example, the question
Did any accountant attend the meeting? was automatically transformed into the hy-
pothesis Any accountant attended the meeting. Any was then manually changed to
Some.)
Answer types. Most FraCaS problems are labeled with one of three answers: yes,
no, or unk.4 This three-way formulation of the NLI task was introduced in sec-
tion 5.2.2, where we identified the yes label with the entailment relation, the no
label with the contradiction relation, and the unk label with the compatibil-
ity relation. We can also identify each of these three labels with a union of relations
in B, as described in section 5.6.3: yes corresponds to ⋃{≡, @}, no corresponds
to ⋃{∧, |}, and unk corresponds to ⋃{A, `, #}. The distribution of answers is not
balanced: about 59% of the problems have answer yes, while 28% have answer unk,
and 10% have answer no. Consequently, a most-common-class classifier can achieve
respectable accuracy simply by guessing yes on every problem.
4 Actually, the answers are not completely standardized, and quite a few contain qualifiers, such as, “Yes, on one possible reading”. However, in most cases, the given answer can be straightforwardly identified with one of the values yes, no, or unk. But 12 of the 346 problems (about 3%) do not fit neatly into this scheme. Four problems lack questions and answers altogether, and in the remaining eight problems, the answer is too ambiguous or complex to be clearly identified with one of the three labels. Examples of such answers include “Yes on one scoping; unknown on another scoping”, “Not many”, and “At most two”. These 12 problems were omitted from the evaluation described in section 7.8.2.
Multiple-premise problems. Of the 346 FraCaS problems, 154 (about 45%) con-
tain multiple premises: 122 problems contain two premises, 29 problems contain three
premises, two problems contain four premises, and one problem contains five premises.
These multiple-premise problems were excluded from the evaluation described in sec-
tion 7.8.2. As we noted in section 6.4, an important limitation of the inference method
implemented by NatLog is that, unlike first-order logic, it provides no general mech-
anism for combining information from multiple premises. However, this limitation is
not unique to NatLog: it is faced by all other NLI systems of which we are aware. To
be sure, most NLI systems—including NatLog—can accept a premise which contains
multiple sentences. Indeed, recent RTE data sets have included an ever-greater pro-
portion of problems with multi-sentence premises. But, for the most part, solving such
problems requires nothing more than extracting information from a single sentence.
By contrast, the inference problems shown in figure 1.2 cannot be solved without
combining information from multiple sentences. Because NatLog relies on analyzing
entailment relations across a single chain of atomic edits, it cannot effectively handle
such problems. But nor can other systems developed for the RTE challenge.
Sections. The FraCaS test suite is divided into nine sections, each focused on a
specific category of semantic phenomena: (1) quantifiers, (2) plurals, (3) anaphora,
(4) ellipsis, (5) adjectives, (6) comparatives, (7) temporal reference, (8) verbs, and
(9) attitudes. Of course, some of these sections will be more amenable than others to
the natural logic approach: we hope to do well with quantifiers, but expect to make
little headway with ellipsis, anaphora, or temporal reference.
System                        P %     R %     Acc %
baseline: most common class   55.7    100.0   55.7
bag of words                  59.7    87.2    57.4
NatLog 2007                   68.9    60.8    59.6
NatLog 2008                   89.3    65.7    70.5
Table 7.3: Performance of various systems on the FraCaS test suite. P and R are precision and recall for the yes class; Acc is overall accuracy.
section 7.8.1) and a handful of other problems which lack a well-defined answer (see
footnote 4).
The NatLog system depends on alignments (that is, edit sequences connecting
p and h) from an outside source. For these experiments, we generated alignments
automatically using a very simple dynamic programming algorithm similar to the
Levenshtein string edit distance algorithm (Levenshtein 1966), applied at the token
level. The results from this automatic alignment were then manually reviewed and
corrected.
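To illustrate the kind of token-level alignment used here, the following sketch (our own, with assumed unit costs; not the exact algorithm) runs a Levenshtein-style dynamic program over the token sequences of p and h and backtraces to recover eq / sub / del / ins edits.

    def align(p_tokens, h_tokens):
        """Token-level edit-distance alignment with unit costs (an assumption)."""
        n, m = len(p_tokens), len(h_tokens)
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i
        for j in range(1, m + 1):
            cost[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                same = p_tokens[i - 1] == h_tokens[j - 1]
                cost[i][j] = min(cost[i - 1][j] + 1,               # del from p
                                 cost[i][j - 1] + 1,               # ins into h
                                 cost[i - 1][j - 1] + (not same))  # eq or sub
        edits, i, j = [], n, m                  # backtrace to recover the edits
        while i > 0 or j > 0:
            same = i > 0 and j > 0 and p_tokens[i - 1] == h_tokens[j - 1]
            if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
                edits.append(('eq' if same else 'sub', p_tokens[i - 1], h_tokens[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                edits.append(('del', p_tokens[i - 1], None))
                i -= 1
            else:
                edits.append(('ins', None, h_tokens[j - 1]))
                j -= 1
        return list(reversed(edits))

    print(align('we were not permitted to smoke'.split(),
                'we smoked Cuban cigars'.split()))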
Table 7.3 shows the results of the evaluation, along with some comparisons. As
a baseline, we show results for a most-common-class classifier, which achieves perfect
recall for the yes class (because it always guesses yes), but mediocre precision and
accuracy (equal to the proportion of yes answers in the test set). A slightly stronger
comparison is provided by a bag-of-words model like the one described in chapter 2.
Relative to the most-common-class classifier, this model achieves lower recall, but
slightly better precision and accuracy. Table 7.3 also shows results from an evaluation
of an earlier version of the NatLog system, reported in (MacCartney and Manning
2007).
The current (2008) version of the NatLog system achieves overall accuracy of over
70%, representing a 27% reduction in error from the 2007 version and a 33% reduction
in error from the baseline. A particularly noteworthy and gratifying result is the high
figure for precision: over 89% for the current system. Errors in which NatLog wrongly
predicts yes have fallen 66% from the 2007 version and 76% from the baseline.
§   Section                   #     P %     R %     Acc %
1   Quantifiers               44    95.2    100.0   97.7
2   Plurals                   24    90.0    64.3    75.0
3   Anaphora                  6     100.0   60.0    50.0
4   Ellipsis                  25    100.0   5.3     24.0
5   Adjectives                15    71.4    83.3    80.0
6   Comparatives              16    88.9    88.9    81.3
7   Temporal reference        36    85.7    70.6    58.3
8   Verbs                     8     80.0    66.7    62.5
9   Attitudes                 9     100.0   83.3    88.9
    Sections 1, 2, 5, 6, 9    108   90.4    85.5    87.0
    All sections              183   89.3    65.7    70.5
Table 7.4: NatLog performance on the FraCaS test suite, by section. P and R are precision and recall for the yes class; Acc is overall accuracy.
Not surprisingly, NatLog’s performance varies considerably over the different sec-
tions of the FraCaS test set. Table 7.4 shows results broken down by section. In
the section concerning quantifiers, which is both the largest and the most amenable
to natural logic, all problems but one are answered correctly.5 We also answer all
problems but one correctly in the (admittedly small) section on attitudes, which in-
volves implicatives and factives. As we anticipated, performance is mediocre in four
sections concerning semantic phenomena (anaphora, ellipsis, temporal reference, and
verbs) not relevant to natural logic and not modeled by the system. But in the other
five sections (covering about 60% of the problems), we achieve accuracy of 87%, a
reduction in error of 61% from the 2007 version of the system. What’s more, precision
is high in nearly every section: even outside its areas of expertise, the system rarely
predicts entailment when none exists.
5 In fact, the sole exception is disputable, since it hinges on whether many refers to proportion (apparently, the view held by the FraCaS authors) or absolute quantity.
                      guess
               yes    no    unk    total
gold   yes      67     4     31      102
       no        1    16      4       21
       unk       7     7     46       60
       total    75    27     81      183
Table 7.5: Confusion matrix for NatLog on the FraCaS test suite (all sections).
7.8.3 Discussion
The confusion matrix shown in table 7.5 reveals an interesting property of the NatLog
system. By far the largest category of confusions are those where the answer is yes
but we guess unk. This reflects both the bias toward yes in the FraCaS data,
and the system’s tendency to predict unk (entailment relation #) when confused:
given the rules for joining entailment relations, the system can predict yes only if all
atomic-level predictions are either @ or ≡.
In fact, almost two-thirds of all errors made by NatLog are cases where NatLog
wrongly predicted unk. As we noted in section 5.6.3, joining a chain of predicted
atomic entailment relations will tend toward a less-informative result (namely, #) if
the chain contains any “noise” in the form of mistaken predictions. Consequently,
NatLog tends to predict # whenever it runs into trouble, and in this sense, unk is
NatLog’s default response.
It may be instructive to examine the remaining categories of errors more closely.
The examples referred to in the following text are shown in figure 7.3.
Gold unk, guess no. There were seven errors in this category, and all involved
cases where a lexical substitution generated the | relation, which was (wrongly) pro-
jected to the top level without change (and any other edits generated ≡ or @). Prob-
lem 277 is representative of these mistakes. (Three other cases are highly similar to
this one, while the remainder are broadly similar.) The edit sub(1991, 1992) is
correctly predicted to generate lexical entailment relation |; however, NatLog acts as
§4: Ellipsis
176  p  John said Mary wrote a report, and Bill did too.
     h  John said Bill wrote a report.                         yes
Figure 7.3: Examples of errors made by NatLog on the FraCaS test suite.
if the verb lived projects | (from a temporal modifier) without change, when in fact
it should project | as #. Note the discrepancy between implementation and theory:
in section 6.2.5, we noted explicitly that most verbs project | as #. True, we also
observed there that verbal constructions which encode functional relations, such as
was born in, do project | as |; however, lived is not such a verb. This type of error
could be corrected without much difficulty.
Gold unk, guess yes. Each of the seven errors in this category involved a deletion
edit for which NatLog rightly predicted lexical entailment relation @, but wrongly
assumed that this relation would be projected without change to the top level. The
sole error made in the section on quantifiers (problem 56) is disputable, since it hinges
on whether the quantifier many refers to proportion (and thus is non-monotone in
its first argument—apparently the view held by the FraCaS authors), or to absolute
quantity (and thus is upward-monotone in its first argument). In this author’s view,
the latter reading is more intuitive, and NatLog’s answer of yes is in fact correct.
The remaining errors in this category involved adjectives, comparatives, and temporal
reference, and are typified by problems 217 and 304.
Gold yes, guess no. All four errors in this category involve ellipsis, and all are
more-or-less hopeless cases for NatLog. Problem 176 is a typical example. The align-
ment used by NatLog in this problem includes the deletion of the second (elliptical)
clause in p, and straightforward linear alignment of the first clause in p to h, including
the edit sub(Mary, Bill ). For this edit, NatLog (rightly) predicts the lexical entail-
ment relation |, and consequently produces answer no for the problem. While there
is an alignment which could have led NatLog to the correct answer (namely, the one
which matches each token in h to the equal token in p, and deletes all other tokens in
p), it is unreasonable to expect an aligner to produce it, since that alignment appears
preferable only to one who grasps the phenomenon of ellipsis.
Gold no, guess yes. The sole error in this category was problem 109. NatLog
predicts yes because it (rightly) assigns lexical entailment relation @ to the edit
sub(one, Some), and because it does not model the restrictive semantics of Just.
Note that yes would be the correct answer if not for the s on accountants in h; thus,
a further reason for NatLog’s mistake is that it systematically (and intentionally)
ignores morphology.
Since the NatLog system was developed with FraCaS problems in mind, these
results do not constitute a proper evaluation on unseen test data. On the other hand,
the system does no training on FraCaS data, and has had no opportunity to learn
its biases. (Otherwise, accuracy on §4 could not fall so far below the baseline.) The
system not only answers most problems correctly, but usually does so for valid reasons,
particularly within its areas of expertise. All in all, the results fulfill our main goal in
testing on FraCaS: to demonstrate the representational and inferential adequacy of
our model of natural logic.
85 p Mr. Fitzgerald revealed he was one of several top officials who told Mr.
Libby in June 2003 that Valerie Plame, wife of the former ambassador
Joseph Wilson, worked for the CIA.
h Joseph Wilson worked for CIA. no
788 p Democrat members of the Ways and Means Committee, where tax bills
are written and advanced, do not have strong small business voting
records.
h Democrat members had strong small business voting records. no
Table 7.6: Performance of various systems on the RTE3 test suite (two-way classi-
fication). The columns show the data set used (development or test, 800 problems
each), the proportion of yes predictions, precision and recall for the yes class, and
accuracy.
Recall that in the Stanford system, an alignment is a map from h words to p words.
An h word can be aligned to any p word, regardless of position; multiple h words can
be aligned to the same p word; and there is no notion of phrase alignments. When
we translate such alignments into the NatLog representation described in section 7.3,
each pair of aligned words generates a sub edit (or, if the words are identical, an
eq edit). Unaligned p words yield del edits, while unaligned h words yield ins
edits. Where possible, contiguous sequences of word-level edits are then collected
into equivalent phrase edits. While the result of this translation method cannot be
interpreted as a conventional edit script (there is no well-defined ordering of edits,
and multiple edits can operate on the same input phrases), we find that this poses no
great impediment to subsequent processing by the entailment model.
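A minimal sketch of this translation (the alignment mapping in the example is hypothetical, and the collection of word-level edits into phrase edits is omitted): aligned h words become eq or sub edits, unaligned h words become ins edits, and unaligned p words become del edits.

    def to_edits(p_tokens, h_tokens, h_to_p):
        """`h_to_p` maps an h token index to the p token index it is aligned to."""
        edits, aligned_p = [], set()
        for j, h_tok in enumerate(h_tokens):
            i = h_to_p.get(j)
            if i is None:
                edits.append(('ins', None, h_tok))        # unaligned h word
            else:
                aligned_p.add(i)
                kind = 'eq' if p_tokens[i] == h_tok else 'sub'
                edits.append((kind, p_tokens[i], h_tok))
        for i, p_tok in enumerate(p_tokens):
            if i not in aligned_p:
                edits.append(('del', p_tok, None))        # unaligned p word
        return edits

    p = 'Jimmy Dean refused to move without blue jeans .'.split()
    h = "James Dean did n't dance without pants .".split()
    mapping = {0: 0, 1: 1, 4: 4, 5: 5, 6: 7, 7: 8}        # hypothetical alignment
    print(to_edits(p, h, mapping))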
The first rows of table 7.6 show the performance of NatLog on RTE3 development
and test sets, as well as the performance of the Stanford RTE system. The overall
accuracy of the NatLog system is not particularly impressive (below 60%), mainly
due to the fact that most RTE problems involve types of inference where natural
logic is of little help. However, relative to the Stanford system, NatLog achieves high
precision on its yes predictions—above 70%—suggesting that combining the NatLog
and Stanford systems using a hybridization strategy may be effective. For comparison,
the FOL-based system of Bos and Markert (2006) attained a similarly high precision
of 76% on RTE2 problems, but was able to make a positive prediction in only about
4% of cases. NatLog makes positive predictions far more often—in about 25% of
cases. Thus, NatLog achieves six times the coverage of the Bos & Markert system,
with nearly the same precision.
The Stanford system makes yes/no predictions by thresholding a real-valued in-
ference score. To construct a hybrid system, we adjust the Stanford inference scores
by +∆ or −∆, depending on whether or not NatLog predicts yes. We choose ∆
by optimizing development set accuracy, while adjusting the threshold to generate
balanced predictions (that is, equal numbers of yes and no predictions). As an ad-
ditional experiment, we fix ∆ at this value and then adjust the threshold to optimize
development set accuracy, resulting in an excess of yes predictions. (Since this opti-
mization is based solely on development data, its use on test data is fully legitimate.)
Results for these two cases are shown in the last rows of table 7.6. The parameters
tuned on development data were found to yield good performance on test data. The
optimized hybrid system attained an absolute accuracy gain of 4% over the Stanford
system, corresponding to an extra 32 problems answered correctly. This result is
statistically significant (p < 0.05, McNemar’s test, 2-tailed).
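The hybridization itself amounts to a score adjustment plus a threshold, as in the following sketch (the numerical values are hypothetical; in practice ∆ and the threshold are tuned on development data as described above).

    def hybrid_predict(stanford_score, natlog_says_yes, delta, threshold):
        """Shift the Stanford inference score by +delta or -delta according to
        NatLog's prediction, then threshold the adjusted score."""
        adjusted = stanford_score + (delta if natlog_says_yes else -delta)
        return 'yes' if adjusted >= threshold else 'no'

    # Hypothetical values, for illustration only:
    print(hybrid_predict(stanford_score=0.48, natlog_says_yes=True,
                         delta=0.10, threshold=0.50))   # 'yes'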
7.9.3 Discussion
The gains attributable to NatLog are exemplified by problem 788 (figure 7.4). NatLog
sanctions the deletion of a restrictive modifier and an appositive from the premise,
and recognizes that deleting a negation generates a contradiction; thus it correctly
answers no. On the other hand, there are many RTE problems where NatLog’s
precision works against it. For example, NatLog answers no to problem 71 because
it cannot account for the insertion of acts as in the hypothesis. Fortunately, both the
Stanford system and the hybrid system answer this problem correctly.
Chapter 8
Conclusions
was the wide variation in the intrinsic difficulty of the various RTE problem sets, as
measured by the yardstick of this very simple model.
In chapter 3, we undertook the first systematic investigation of the problem of
alignment for NLI. We examined the relation between alignment in NLI and in ma-
chine translation (MT), and presented a number of arguments for the unsuitability
of MT aligners to the NLI alignment task. We made the first comparative evalu-
ation of bag-of-words, MT, and NLI aligners on an NLI alignment task. And, we
proposed a new model of alignment for NLI—the MANLI system—and showed that
it significantly outperforms rival approaches, by exploiting external lexical resources
in a supervised learning paradigm, by including contextual features to aid in aligning
function words, and by allowing multi-word phrases to be aligned as units.
In chapter 4, we introduced the Stanford RTE system, which was among the first
NLI systems to make a clear separation between alignment and entailment determi-
nation. This crucial innovation enables the Stanford RTE system to make predictions
of inferential validity which are based, not merely on the degree of alignment, but
also on global features of the aligned ⟨p, h⟩ pair motivated by semantic theory. The
Stanford RTE system is thereby able to attain substantially greater precision than
simple overlap models such as the bag-of-words model presented in chapter 2.
Seeking still greater precision, we then presented the most important contribution
of the dissertation: a new model of natural logic which extends the monotonicity
calculus of van Benthem and Sánchez-Valencia to incorporate semantic exclusion
and implicativity. We began in chapter 5 by defining an expressive inventory of
entailment relations—the set B of seven basic entailment relations—which includes
representations of both semantic containment and semantic exclusion. We showed
that the relations in B are both mutually exclusive and (if vacuous expressions are
excluded) mutually exhaustive, so that every pair of (non-vacuous) expressions can
be assigned to some relation in B. We also described the algebra of joins for relations
in B, and explained how to compute such joins automatically.
In chapter 6, we introduced the concept of projectivity signatures, which gener-
alizes the concept of monotonicity classes to cover the exclusion relations. We then
described a general method for determining the entailment relation between two sen-
tences p and h, by (1) finding a sequence of atomic edits which transforms p into h;
(2) predicting a lexical entailment relation for each edit; (3) propagating the lexical
entailment relation generated by each edit upward through the semantic composition
tree of the sentence to which the edit is applied, according to the projectivity signa-
tures of intermediate nodes; and (4) joining the resulting entailment relations across
the edit sequence. And, we showed how to provide a unified account of inferences
involving implicatives and non-factives under the same framework.
Finally, in chapter 7, we described the NatLog system, the first robust, general-
purpose system for natural logic inference over real English sentences. NatLog con-
tains two key innovations: (1) it learns to predict lexical entailment relations by using
machine learning techniques and exploiting a variety of manually and automatically
constructed sources of information on lexical relations; and (2) it identifies expres-
sions with non-default projectivity and computes the likely extent of their arguments
in a syntactic parse using hand-crafted tree patterns. We demonstrated the practical
value of the NatLog system in evaluations on the FraCaS and RTE test suites. On
the FraCaS test suite, we showed NatLog attained high accuracy on most sections
(excluding sections not relevant to natural logic, such as ellipsis), and high precision
on all sections. On the RTE test suite, we showed that adding NatLog as a component
in the broad-coverage Stanford RTE system led to substantial accuracy gains.
In addition to the specific contributions just outlined, we hope to have achieved
something greater through this program of research: namely, to have shown that nat-
ural logic, far from being merely a formal plaything of logicians and semanticists, has
relevance and applicability to the real-world challenges of open-domain NLI, and con-
sequently constitutes a topic ripe for the attention of NLP researchers. And indeed,
there are indications that our work in this area has begun to catalyze related work
by other investigators. For example, Schoenmackers et al. (2008) built on our work
on inferences involving monotonicity in their exploration of scaling textual inference
to the web, while Danescu-Niculescu-Mizil et al. (2009) were motivated by our model
of natural logic to seek methods for unsupervised discovery of downward-entailing
operators.
While h is not a strict logical consequence of p, most people would not hesitate to
accept that h follows from p, and thus the inference is valid by the definition of the
NLI task. And the reasoning involved is straightforward—one need not possess the
deductive genius of Sherlock Holmes to arrive at h. And yet recognizing the validity
of the inference depends on several different kinds of expertise: first, the knowledge
that the First Lady is a member of the First Family, and that Paris is in France; next,
the capacity to handle inferences involving containment and negation; moreover, the
ability to perform simple clock arithmetic; and finally, an aptitude for basic spatial
reasoning. In short, there is no silver bullet.
So, going forward, we’ll need to look at ways to combine different kinds of rea-
soners, with different areas of expertise—including not only models based on lexical
overlap (such as the bag-of-words model presented in chapter 2) and models of natural
logic (like NatLog), but also reasoners with specialized expertise in temporal, spatial,
and mathematical reasoning; systems which excel in relation extraction; and systems
with a facility for commonsense reasoning. Now, some of these components are closer
to reality than others, but even if we were satisfied with each of the individual com-
ponents, the key question is, how can we combine them to best advantage? Should
we apply each one independently, and then aggregate their results? What sort of
aggregation function should we use? Or, can we devise an architecture in which these
heterogeneous reasoners collaborate in a more fine-grained way, each contributing in-
dividual steps to an overall proof? These difficult questions will undoubtedly keep
the NLI research community busy for many years.
Bibliography
Bar-Haim, Roy, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo
Magnini, and Idan Szpektor. 2006. The Second PASCAL Recognising Textual
Entailment Challenge. In Proceedings of the Second PASCAL Challenges Workshop
on Recognising Textual Entailment, pp. 1–9.
Bar-Haim, Roy, Ido Dagan, Iddo Greental, and Eyal Shnarch. 2007. Semantic Infer-
ence at the Lexical-Syntactic Level. In Proceedings of AAAI-07.
Bos, Johan, and Katja Markert. 2005a. Combining shallow and deep NLP meth-
ods for recognizing textual entailment. In Proceedings of the PASCAL Challenges
Workshop on Recognizing Textual Entailment, pp. 65–68.
Bos, Johan, and Katja Markert. 2005b. Recognising Textual Entailment with Logical
Inference. In Proceedings of EMNLP-05.
Bos, Johan, and Katja Markert. 2006. When logical inference helps determining
textual entailment (and when it doesn’t). In Proceedings of the Second PASCAL
Challenges Workshop on Recognizing Textual Entailment.
Brockett, Chris. 2007. Aligning the RTE 2006 Corpus. Technical Report MSR-TR-2007-77,
Microsoft Research. url: ftp://ftp.research.microsoft.com/pub/tr/TR-2007-77.pdf.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L.
Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter
Estimation. Computational Linguistics 19(2):263–311.
Callison-Burch, Chris, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the
Role of BLEU in Machine Translation Research. In Proceedings of EACL-06, pp.
249–256.
Chambers, Nathanael, Daniel Cer, Trond Grenager, David Hall, Chloe Kiddon, Bill
MacCartney, Marie-Catherine de Marneffe, Daniel Ramage, Eric Yeh, and Christo-
pher D. Manning. 2007. Learning Alignments and Leveraging Natural Logic. In
ACL-07 Workshop on Textual Entailment and Paraphrasing.
Collins, Michael. 2002. Discriminative training methods for hidden Markov models.
In Proceedings of EMNLP-02, pp. 1–8.
Collins, Michael. 2003. Head-Driven Statistical Models for Natural Language Parsing.
Computational Linguistics 29(4):589–637.
Condoravdi, Cleo, Dick Crouch, Valeria de Paiva, Reinhard Stolle, and Daniel G.
Bobrow. 2003. Entailment, Intensionality and Text Understanding. In Proceedings of
the HLT-NAACL 2003 Workshop on Text Meaning, pp. 38–45.
Cooper, Robin, et al. 1996. Using the framework. Technical Report LRE 62-051
D-16, The FraCaS Consortium. url: http://www.cogsci.ed.ac.uk/~fracas/.
Crouch, Dick, Roser Saurí, and Abraham Fowler. 2005. AQUAINT Pilot Knowledge-Based
Evaluation: Annotation Guidelines. Technical report, Palo Alto Research Center.
url: http://www2.parc.com/istl/groups/nltt/papers/aquaint_kb_pilot_evaluation_guide.pdf.
Dagan, Ido, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising
Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop
on Recognising Textual Entailment.
Dang, Hoa Trang, and Danilo Giampiccolo. 2008. The TAC 2008 Recognizing Textual
Entailment (RTE) Track. url: http://www.nist.gov/tac/tracks/2008/rte/.
Das, Dipanjan, and André F. T. Martins. 2007. A Survey on Automatic Text Summa-
rization. url: http://www.cs.cmu.edu/~nasmith/LS2/das-martins.07.pdf.
de Salvo Braz, Rodrigo, Roxana Girju, Vasin Punyakanok, Dan Roth, and Mark Sam-
mons. 2005. An inference model for semantic entailment and question-answering. In
Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI),
pp. 1678–1679.
DeNero, John, Dan Gillick, James Zhang, and Dan Klein. 2006. Why Generative
Phrase Models Underperform Surface Heuristics. In Proceedings of the ACL-06
Workshop on Statistical Machine Translation, pp. 31–38.
Dolan, Bill, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of
large paraphrase corpora. In Proceedings of COLING-04.
Finkel, Jenny Rose, Trond Grenager, and Christopher D. Manning. 2005. Incorporat-
ing Non-local Information into Information Extraction Systems by Gibbs Sampling.
In Proceedings of ACL-05, pp. 363–370.
Fowler, Abraham, Bob Hauser, Daniel Hodges, Ian Niles, Adrian Novischi, and Jens
Stephan. 2005. Applying COGEX to recognize textual entailment. In Proceedings
of the PASCAL Challenges Workshop on Recognising Textual Entailment.
Fraser, Alexander, and Daniel Marcu. 2007. Measuring Word Alignment Quality for
Statistical Machine Translation. Computational Linguistics 33(3):293–303.
Fyodorov, Yaroslav, Yoad Winter, and Nissim Francez. 2000. A Natural Logic Infer-
ence System. In Proceedings of the 2nd Workshop on Inference in Computational
Semantics (ICoS-2).
Giampiccolo, Danilo, Hoa Trang Dang, Bernardo Magnini, Ido Dagan, and Bill Dolan.
2008. The Fourth PASCAL Recognizing Textual Entailment Challenge. In Pro-
ceedings of the TAC-08 Text Analysis Conference.
Giampiccolo, Danilo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The Third
PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the ACL-07
Workshop on Textual Entailment and Paraphrasing.
Glickman, Oren, Ido Dagan, and Moshe Koppel. 2005. Web based probabilistic
textual entailment. In Proceedings of the PASCAL Challenges Workshop on Rec-
ognizing Textual Entailment.
Grice, H. Paul. 1975. Logic and conversation. Syntax and Semantics 3:41–58.
Haghighi, Aria, Andrew Ng, and Christopher D. Manning. 2005. Robust textual in-
ference via graph matching. In Proceedings of the Conference on Empirical Methods
in Natural Language Processing (EMNLP-05).
Harabagiu, Sanda, and Andrew Hickl. 2006. Methods for Using Textual Entailment
in Open-Domain Question Answering. In Proceedings of ACL-06, volume 44, p.
905.
Harabagiu, Sanda, Andrew Hickl, and Finley Lacatusu. 2006. Negation, Contrast,
and Contradiction in Text Processing. In Proceedings of AAAI-06.
Hickl, Andrew, John Williams, Jeremy Bensley, Kirk Roberts, Bryan Rink, and Ying
Shi. 2006. Recognizing Textual Entailment with LCC’s GROUNDHOG System. In
Proceedings of the Second PASCAL Challenges Workshop on Recognizing Textual
Entailment.
Hobbs, Jerry R., Mark Stickel, Paul Martin, and Douglas D. Edwards. 1988. Inter-
pretation as abduction. In Proceedings of ACL-88, pp. 95–103.
Jiang, Jay J., and David W. Conrath. 1997. Semantic similarity based on corpus
statistics and lexical taxonomy. In Proceedings of the International Conference on
Research in Computational Linguistics.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed-
erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens,
Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses:
Open source toolkit for statistical machine translation. In Proceedings of ACL-07,
demonstration session.
Lacatusu, Finley, Andrew Hickl, Kirk Roberts, Ying Shi, Jeremy Bensley, Bryan
Rink, Patrick Wang, and Lara Taylor. 2006. LCC’s GISTexter at DUC 2006:
Multi-Strategy Multi-Document Summarization. In Proceedings of DUC-06.
Landauer, Thomas K., and Susan T. Dumais. 1997. A solution to Plato’s problem:
The Latent Semantic Analysis theory of the acquisition, induction and representa-
tion of knowledge. Psychological Review 104(2):211–240.
Levy, Roger, and Galen Andrew. 2006. Tregex and Tsurgeon: tools for querying
and manipulating tree data structures. In Proceedings of LREC-06.
url: http://nlp.stanford.edu/software/tregex.shtml.
Liang, Percy, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In
Proceedings of NAACL-06.
url: http://www.cs.berkeley.edu/~pliang/papers/alignment-naacl2006.pdf.
Lin, Dekang. 1998. Automatic retrieval and clustering of similar words. In Proceedings
of COLING/ACL-98, pp. 768–774.
MacCartney, Bill, and Christopher D. Manning. 2007. Natural logic for textual
inference. In ACL-07 Workshop on Textual Entailment and Paraphrasing.
Marcu, Daniel, and William Wong. 2002. A phrase-based, joint probability model
for statistical machine translation. In Proceedings of EMNLP-02, pp. 133–139.
McCallum, Andrew, and Wei Li. 2003. Early results for named entity recognition
with conditional random fields, feature induction and web-enhanced lexicons. In
Proceedings of CoNLL-2003.
Meyers, Adam, et al. 2004. The NomBank project: An interim report. In HLT-
NAACL 2004 Workshop: Frontiers in Corpus Annotation, pp. 24–31.
Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological
processing of English. Natural Language Engineering 7(3):207–223.
Moldovan, Dan, Christine Clark, Sanda Harabagiu, and Steve Maiorano. 2003. CO-
GEX: A logic prover for question answering. In Proceedings of NAACL-2003.
Nairn, Rowan, Cleo Condoravdi, and Lauri Karttunen. 2006. Computing relative
polarity for textual inference. In Proceedings of ICoS-5 (Inference in Computational
Semantics).
Och, Franz Josef, and Hermann Ney. 2003. A Systematic Comparison of Various
Statistical Alignment Models. Computational Linguistics 29(1):19–51.
Padó, Sebastian, Michel Galley, Dan Jurafsky, and Christopher D. Manning. 2009.
Robust machine translation evaluation with entailment features. In Proceedings of
ACL-09.
Raina, Rajat, Andrew Ng, and Christopher D. Manning. 2005. Robust textual
inference via learning and abductive reasoning. In Proceedings of the Twentieth
National Conference on Artificial Intelligence (AAAI).
Ritter, Alan, Doug Downey, Stephen Soderland, and Oren Etzioni. 2008. It’s a
Contradiction—No, it’s Not: A Case Study using Functional Relations. In Pro-
ceedings of EMNLP-08.
Romano, Lorenza, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli.
2006. Investigating a generic paraphrase-based approach for relation extraction. In
Proceedings of EACL 2006.
Sánchez Valencia, Victor. 1991. Studies on Natural Logic and Categorial Grammar.
PhD thesis, University of Amsterdam.
Schoenmackers, Stefan, Oren Etzioni, and Daniel S. Weld. 2008. Scaling Textual
Inference to the Web. In Proceedings of the 2008 Conference on Empirical Methods
in Natural Language Processing (EMNLP-08), pp. 79–88.
Tatu, Marta, and Dan Moldovan. 2005. A semantic approach to recognizing textual
entailment. In Proceedings of HLT/EMNLP 2005, pp. 371–378.
Tatu, Marta, and Dan Moldovan. 2007. COGEX at RTE3. In Proceedings of ACL-07.
van Benthem, Johan. 1991. Language in action: categories, lambdas and dynamic
logic. Studies in Logic 130.
van Benthem, Johan. 2008. A brief history of natural logic. Technical Report
PP-2008-05, Institute for Logic, Language & Computation.
url: http://www.illc.uva.nl/Publications/ResearchReports/PP-2008-05.text.pdf.
van Eijck, Jan. 2005. Natural logic for natural language.
url: http://homepages.cwi.nl/~jve/papers/05/nlnl/NLNL.pdf.
Vogel, Stephan, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word
alignment in statistical translation. In Proceedings of COLING-96, pp. 836–841.
Witten, Ian H., and Eibe Frank. 2005. Data Mining: Practical machine learning tools
and techniques, 2nd edition. Morgan Kaufmann.