Knowledge-Based Systems 21 (2008) 617–628
References to graphical objects in interactive multimodal queries
Daqing He a,*, Graeme Ritchie b, John Lee c
a School of Information Sciences, University of Pittsburgh, 135 North Bellefield Avenue, Pittsburgh, PA 15260, USA
b Department of Computing Science, University of Aberdeen, Aberdeen AB24 3UE, Scotland, UK
c School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, UK
article info
Article history:
Received 22 May 2007
Accepted 24 March 2008
Available online 31 March 2008
Keywords:
Source ambiguities
Intelligent multimodal interfaces
Constraint satisfaction problems
Reference resolution
abstract
Traditionally, interactive natural language systems assume a semantic model in which the entities
referred to are in some abstract representation of a real or imagined world. In a system where graphical
objects such as diagrams may be on the screen during the language interaction, there is a possibility that
the user may want to allude to these visual entities. Since graphical objects have their own properties
(colour, shape, position on the screen, etc.) but may also represent items in a knowledge base which have
other associated properties (price, geographical location, technical specifications, etc.), some systematic
way is needed to enable such objects to be referred to in terms of either their screen properties or their
associated attributes from the domain under discussion. In this paper, we present a formalisation for
these arrangements, and show how our logical definitions can be used to generate constraints suitable
for reference resolution within a natural language interpreter.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
For many years, the typical natural language (NL) query interface would assume a simple environment in which a symbolic representation of information about some domain (i.e., a knowledge
base) was the sole source of responses to questions, and was the
only place in which referents might be sought for phrases (for
overviews of conventional NL query systems, see [41,12,5,4]). In
more recent years, attention has been paid to more complex systems, in which graphical displays are integrated into the system
handling the queries (see Section 3 below). Such an arrangement
raises interesting questions about the semantic model underlying
the interpretation of NL queries, since words and phrases may refer
to knowledge base entities directly (as in traditional systems) or
may denote graphical objects visually accessible on the screen. A
query may even contain a mixture of these two broad classes of
reference. To complicate matters further, the screen entities (icons,
etc.) may be in a denotational relationship to the symbolic entries
in the database, so that a user wishing to allude to a database object may have two routes for achieving this: a phrase which directly mentions the properties of the object as recorded in the
database, or a phrase which uses the visual properties of the corresponding icon in order to single out (indirectly) the database
object.
We focus on this basic situation: a database (or knowledge
base) is queried in English where there may be images (icons) on
the screen corresponding to database objects. We consider how
simple referring noun phrases could be interpreted when there
may be ambiguity about whether screen or database objects (or
properties) are being referred to, or when a single phrase may contain a mixture of screen and database information.
In the remainder of the paper, we will first explain this scenario in
more detail and propose our approach for handling it (Section 2). We
then outline briefly some of the related work in this area (Section 3).
After that, we will present our language-interpretation framework
(Section 4), discuss our formalisation in general set-theoretic terms
(Section 5) and in terms of constraint-resolution (Section 7), including a simple worked example (Section 9). In Section 10, we consider
limitations and possible developments of the work.
2. Motivation
2.1. Example scenarios
Graphical displays inherently have spatial relations, but the
applications that they represent may or may not. For simplicity,
we shall start by considering applications in which there is no real
spatial information in the subject matter (as opposed to the screen
display, which cannot avoid being spatial). We shall return to the
issue of domains with inherent spatial content later. The following
example is merely illustrative, but is closely based on the simple
scenarios which we have tested by implementation [22]. In considering this example, it should be borne in mind that we are not
proposing it as a model of excellence in HCI, but rather as an artificial context for provoking the phenomena we are interested in
while remaining simple enough for clarity. We imagine these kinds
of phenomena as arising more plausibly in a system that would
more closely approximate human–human dialogue, including the
use of speech input. This would include cases where, for example,
the user can introduce graphical items and state new interpretations for them, as perhaps a sketch system embedded in a natural
language dialogue. A system supporting such discourse would,
however, raise many issues that would obscure our focus in this
paper. In a simple case such as we discuss here, one might well
in practice use direct manipulation and forego natural language
altogether; but our point is that even in such a simple case it is possible to exhibit the ambiguities we are interested in, notwithstanding that it would be straightforward to avoid them.
Consider a system which allows the user to browse through a
database of used cars which are for sale, and which displays the
information about the available vehicles in a simple graphical form
(Fig. 1). Icons in the DISPLAY area represent individual cars, and
characteristics of these icons convey attributes of the corresponding
cars, as summarised in the table displayed in the KEY area. A user
who is browsing through the cars may use the POTENTIAL BUY area
to collect the icons of cars that s/he is interested in. During the interaction, the user can ask about the cars on the screen, or perform
screen actions (e.g., move, remove, add) on the icons of those cars.
We assume that the system maintains some form of table, referred to here as the mapping relation, which shows which database
entity corresponds to each of the icons on the screen, and we assume that this is a one-to-one mapping from screen into domain
(i.e., every screen item has a unique corresponding domain entity,
but some domain entities may have no associated screen item).
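As a concrete illustration (the identifiers icon1, car1 and so on are hypothetical), the mapping relation can be pictured as a small table that is one-to-one from screen items into domain entities:

# A minimal, illustrative mapping relation: one-to-one from screen items
# into domain entities.  Every icon has exactly one corresponding car,
# but a car may have no icon at all.
SCREEN_TO_DOMAIN = {
    "icon1": "car1",
    "icon3": "car2",
}

def domain_counterpart(screen_item):
    """The domain entity represented by a screen item (always defined)."""
    return SCREEN_TO_DOMAIN[screen_item]

def screen_counterpart(domain_entity):
    """The screen item representing a domain entity, or None if it is not displayed."""
    for icon, entity in SCREEN_TO_DOMAIN.items():
        if entity == domain_entity:
            return icon
    return None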
The system does not impose any explicit limits on how these
commands or queries may be phrased, allowing the user some
freedom in the choice of NL expressions.
Let us consider a case where the user asks about one particular
car, as in Example (1), with graphical displays as in Fig. 1, where
the bottom-right icon is displayed in green, and where the database contains no information about the colours of actual cars.
(1) (a) User: What is the insurance group of the green car?
    (b) System: It is group 5
    (c) User: Move it to the potential buy area
    (d) System: The green car has been moved
The adjective ‘‘green” in the phrase ‘‘the green car” indicates the
colour of the icon, not the colour of the car being represented. However, the head noun ‘‘car” refers to possible entities in the database
(or, depending on one’s point of view, in the domain of discussion).
So too does the entire NP ‘‘the green car”, since cars have insurance
groups, but icons do not. Hence when computationally calculating
the potential candidate entities for the reference of ‘‘the green car”,
we cannot simply consider only the entities of type ‘‘car” which
have the property ‘‘green”.
Suppose the user now makes the request indicated as (c), and
gets the response from the system shown as (d). Here, the object
being moved is clearly an icon – no cars are being driven through
the streets. Nevertheless, the user uses ‘‘it” to refer to the icon, presumably referring anaphorically to the earlier phrase ‘‘the green
car” which, as argued already, denotes not an icon but a car. Moreover, the system’s response uses the phrase ‘‘the green car” to refer
to the moved object, i.e., an icon.
Underlying the above comments is one of our central assumptions, namely that it makes sense to talk about an intended referent
for a noun phrase in a particular context of use. This is the object
which the phrase can be seen to denote when all possible linguistic
and contextual information (including the requirements of actions)
has been taken into account. (The term has been used in an analogous manner with respect to the generation of referring expressions [24]). In discussing our illustrative examples, we shall rely
on intuition about the meanings of the sample sentences to decide
the intended referents, always adopting the assumption that referents must meet all internal semantic constraints of the dialogues.
One way of looking at our proposed framework is to say that the
information which contributes to singling out an intended referent
may not be straightforward, particularly if a phrase is viewed in
isolation (e.g., during the reference resolution process), and we
must devise a way of overcoming this.
We can imagine a more convoluted case than that above, where
the system has been set up so that the user may ask about the colour of a car, while colour is still being used in the icons to code
non-colour information about the cars (e.g., year of manufacture,
as in the above example). It has to be admitted that this would
not be a well-designed representation, but it serves here to illustrate a logical possibility (Ben Amara et al. [6] discuss such an
arrangement). In this situation, a phrase such as "the green car" might well denote a car which is green in real life (but perhaps with a blue icon) in one context, and denote another car which is represented by a green icon (but is not actually green) in another context. This creates an ambiguity between referents during the resolution process.
We shall use the term source to refer to the two possible contexts
for interpretation: the screen (visual display) or the domain (the represented subject matter); that is, the system has to interpret linguistic elements relative to one or other of these sources. We shall refer
Fig. 1. Screen displays for (1).
to predicates or constants which belong to the domain as domain
predicates/constants, and those which are associated with screen
properties as screen predicates/constants. Where there is doubt about
the source relative to which a word, phrase or sentence should be
interpreted, we shall refer to this as source ambiguity.
Even where all the substantive terms in a phrase are derived from
the same source, the phrase may refer to an entity in the other
source. Consider the phrase ‘‘the expensive car”. This, in the conventional account, has descriptive content based on the domain predicates expensive and car, and hence would conventionally denote a
domain entity with these properties: something that is a car and is
expensive. The phrase would indeed refer to such an entity in the
sentence ‘‘buy the expensive car”. However, it has a different reference in the sentence ‘‘move the expensive car to the top of the screen”.
In order that the screen operation move can make sense, the phrase
must have as its intended referent the screen entity – the icon – that
corresponds to the expensive car in the domain.
Even with a single utterance, these dual possibilities may arise.
Consider the example ‘‘move the expensive car to the top of the
screen, and place an order for it”. The natural interpretation of the
pronoun "it" allocates "the expensive car" as its antecedent. However, the referent of the pronoun "it" must, to make sense in "place an order for it", be a car, not an icon, while the referent of "the expensive car" must, for compatibility with "move", be a screen object. The two referents are linked by the mapping relation, but this
is not the standard notion of coreference. We shall refer to such
arrangements, where anaphor and antecedent denote entities
linked by the mapping relation, as quasi-coreference.
The possibilities for source ambiguity in individual words may
combine to create even more possible options. Sentence (a) in
Example (2) is a relatively simple query. In a situation with source
ambiguities for ‘‘blue” and ‘‘small” (but with ‘‘car” unambiguously a
domain expression), its meaning could in theory be synonymous
with any one of the four sentences (b) to (e) (where the subscripts
attached to the words ‘‘blue”, ‘‘car” and ‘‘small” indicate the source
– screen or domain – in which the word is interpreted).1
1 We acknowledge that some combinations could be ruled out on pragmatic grounds – is the user likely to ask about the size of a screen icon? – but they do exist as logical possibilities.
(2) (a) User: Is the blue car small?
    (b) User: Is the blue_d car_d small_d?
    (c) User: Is the blue_d car_d small_s?
    (d) User: Is the blue_s car_d small_s?
    (e) User: Is the blue_s car_d small_d?
If the domain in question has a valid notion of space (unlike the
car-selling domain discussed above), there is even more scope for
source ambiguity. Expressions such as ‘‘above”, ‘‘next to”, ‘‘to the
right of” could refer to relationships on the screen or in the domain.
For example, imagine a system which allowed the user to plan the
layout of a room by viewing and altering a visual display of the position of items. If the display was a plan view, ‘‘above” on the screen
might mean ‘‘to the right of” in the actual room (cf. [7]).
To summarise, there are various interesting phenomena illustrated by examples like those discussed above:
(i) Predicates (such as the meaning of ‘‘green”) may refer to
visual screen properties, or to properties of objects in the
database denoted by the screen objects.
(ii) Phrases (such as ‘‘the green car”) may in certain situations
contain both words referring to screen properties/items
and words referring to properties/items in the domain under
discussion, thus not uniformly describing any entity in
either source.
(iii) A phrase which directly describes an entity in one source may,
in context, denote the corresponding entity in the other source.
(iv) A pronoun may refer back to an antecedent which strictly
does not denote the object that the pronoun refers to,
although the two referents may be systematically related.
(v) There may even be uncertainty as to whether a particular
word or phrase is to be interpreted with respect to screen
information or domain information.
We do not claim that source ambiguities would always persist when the utterance as a whole is considered in context. In fact, we
believe that source ambiguities can be resolved with the help of
context. However, as with most reference ambiguities, a program
trying to compute referents in a near-compositional fashion (i.e.,
figuring out the meaning from the words expressed in the text)
might encounter source ambiguities before the whole meaning of
the utterance is clarified. This is why we are working on a computational mechanism to resolve them.
2.2. Limitations of previous representations of ambiguities
Previous approaches to semantic ambiguity assume that the various possible meanings are all within the domain, and hence a choice
can sometimes be made on the basis of domain information. This approach is neatly typified by van Deemter who presents a formalisation in which a word may be associated with several different logical
expressions, with the correct expression being selectable on the basis of the sort (or type) of the variable(s) involved [48]. For example,
an ambiguous word American might map to one of the following
unambiguous expressions (notation from van Deemter):
American1: λx. Country(DeparturePlace(x)) = USA   (for a flight)
American2: λx. Country(Manufacture(x)) = USA      (for an aircraft)
American3: λx. Country(PlaceOfBirth(x)) = USA     (for a person)
Disambiguation then results from the sort of the value of x:
flight, aircraft, or person.
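A minimal sketch of this sort-driven selection, with entirely hypothetical data and attribute names, might look as follows in Python:

# Rough sketch of sort-driven disambiguation.  The data and attribute names
# are hypothetical; the point is only that the sort of the variable's value
# selects one unambiguous reading of the word "American".
ENTITIES = {
    "ba123":   {"sort": "flight",   "departure_country": "UK"},
    "boeing7": {"sort": "aircraft", "manufacture_country": "USA"},
    "alice":   {"sort": "person",   "birth_country": "USA"},
}

READINGS_OF_AMERICAN = {
    "flight":   lambda e: e["departure_country"] == "USA",     # American1
    "aircraft": lambda e: e["manufacture_country"] == "USA",   # American2
    "person":   lambda e: e["birth_country"] == "USA",         # American3
}

def american(entity_id):
    """Apply the reading of 'American' selected by the entity's sort."""
    entity = ENTITIES[entity_id]
    return READINGS_OF_AMERICAN[entity["sort"]](entity)

# e.g. american("boeing7") -> True; american("ba123") -> False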
This approach is not general enough for our two-source situation.
First, as stated in Section 1, the typical NL query interface contains
only a symbolic representation (constants and predicates) from
the application domain. No information about the screen is included.
It is not difficult to systematically augment such domain information
with screen objects and predicates (indeed, that forms part of our
own solution), and this might allow a disambiguation approach like
van Deemter’s. However, this would not handle the ‘‘mixed” phrases,
where some predicates within the NP contain domain information
and others embody screen information. More importantly, we need
not only a method of disambiguation, but a method of determining a
referent set for an NP. Given the types of example we have outlined
above, there needs to be some method for reference resolution
which takes account of the two-source subtleties.
2.3. Our approach
Our goal is to translate combinations of locally ambiguous
expressions into unambiguous representations so that the language interpreter can identify the right referents for input phrases.
We tackle this by representing source information explicitly (so
that it can be manipulated, and even reasoned about), and by partitioning the set of abstract entities (predicates and values to which
they might apply) into two disjoint classes: domain entities and
screen entities.
Using this, we define some declarative relations between predicates and their potential referents (Section 5), thereby specifying
the set of candidate referents for an NP where there might be
two sources involved. We also show how these definitions allow
a systematic expression of these relationships as constraints (Section 7), so that processing of reference information can proceed in a
general manner.
Although we implemented our ideas within a prototype NL
query system called IMIG [22], the details of that environment
are not of interest here. The central idea is a formal mechanism
for mapping an NP to a set of candidate referents; other issues of
parsing, dialogue, integrating pointing actions, generating answers,
etc., are outside the scope of the current article. All that we need to
assume here is an environment which supplies certain information
(see Section 4 for a summary of the knowledge bases needed).
To keep the problem to a manageable level, we concentrated on
English language phenomena involving singular NPs. As is often
the case in computational work on reference resolution (e.g.,
[52,36]) or on the generation of referring expressions (e.g.,
[13,24]), we have considered only NPs where the various modifiers
represent a conjunction of properties which the referent(s) must
meet, without allowing for relative properties (e.g., ‘‘small elephant” vs. ‘‘large flea”) or predicates which are not properties of
the referent (e.g., ‘‘fake gun”). In order to concentrate on the referential issues, we have not gone into any depth on other aspects of
sentence interpretation. Our test implementation processes whole
sentences, but the mechanisms are conventional and not of interest here.
3. Related work
3.1. General metonymy
The type of indirect reference which we are examining here is
a particular kind of metonymy, in the traditional sense: ‘something associated with an idea is made to serve for the expression
of that idea’ [43]. Such phenomena have been discussed by Nunberg [39], considering examples such as naming an author to refer to their books (‘‘Plato is on the top shelf”), mentioning a
publication in place of the publisher (‘‘The newspaper fired John”),
using the name of a bird/animal for its flesh (‘‘We ate chicken in
bean sauce") or having a restaurant dish denote the customer
ordering it (‘‘The ham sandwich is sitting at table 20”). Nunberg
argues that such usages are best captured by positing a pragmatic referring function which relates the explicitly mentioned
item to the one actually intended. He then discusses some of
the ways in which referring functions can be used, and the factors that allow language users to determine the appropriate
referring function. He also notes the phenomenon which we
have called quasi-coreference. Fauconnier [14] takes up Nunberg’s ideas and summarises the central point thus:
Identification principle: If two objects (in the most general
sense), a and b, are linked by a pragmatic function F (i.e.,
b = F(a)), a description of a, da, may be used to identify its counterpart b.
We have not attempted to formalise the full range of effects
which Nunberg and Fauconnier discuss. The main obstacle to a
computational implementation which can handle general metonymic references is that detection and use of the pragmatic
(referring) function may be difficult. This is because, as Koch [29]
points out, metonymy has many types, and their interpretation depends on various kinds of information, including speech rules, language
rules, discourse rules, and even trade knowledge. Our analysis,
and our small prototype implementation, are based on a particular
situation where that function (our ‘‘mapping relation”) is well-defined, and indeed directly provided by the multimedia setup. It is
to be hoped that our methods may be generalised in some way
to handle more complex and subtle forms of metonymy.
3.2. Intelligent multimodal studies
Intelligent multimodal systems have been studied since the
1980s. Some important research topics include theoretical and empirical studies of multimodal integration [11,2,40], multimodal
reference resolution models [27,25,26,42,28,10], interactive multimodal systems [37,38,1], and multimodal reference in NL generation [3,35]. However, none of these projects explores the
phenomena we are considering. For example, although researchers
like Johnston and Pineda have developed methods that are capable
of handling multimodal reference [27,25,26,42], none of them considered source ambiguities. We suggest that the key to handling
source ambiguity problems is viewing the on-screen icons as different from, but related to, their corresponding domain entities.
Without this, a model cannot resolve expressions involving source
ambiguities. However, we acknowledge that the source ambiguity
issues have been touched upon briefly by Ben Amara et al. [6], Binot et al. [7], and Chai [9], although none of them give a name to
the ambiguities.
Ben Amara et al. [6] discuss natural language expressions using
the graphical features of objects to refer to those objects. Their scenarios involve querying and displaying a local area network. One of
their examples relates to source ambiguities – the user may enter
one of the following sentences to remove a workstation shown as a
purple icon:
(i) ‘‘remove the stellar workstation”
(ii) ‘‘remove the purple workstation”
(iii) ‘‘remove the purple icon”.
They suggest that the second command is ambiguous because it
is not clear whether it is the workstation in the application domain
or the icon of the workstation that is purple.
The MMI2 system is a toolkit connected to a knowledge base
system [7]. It supports various forms of input from natural language, command language, graphical display, direct manipulation
and gesture. The system has different modules to handle input
from different modalities, and then combines the processed inputs
into a logical expression scheme developed for the task.
Binot et al. [7] enumerate some examples of natural language
references using graphical spatial relations in an application domain that is the same as that of Ben Amara et al. Two such examples are ‘‘add a PC to its left” and ‘‘remove the left workstation”. They
suggest that the spatial relations in both cases are ambiguous as to
whether the relations are in the domain or in the graphical representation. They do mention the possibility that the referent of a
phrase might be a graphical icon, but acknowledge that they do
not have a solution to, in our terminology, source ambiguity.
Chai [9] presents a semantic-based multimodal interpretation
framework called MIND (Multimodal Interpretation for Natural
Dialog). The framework combines semantic meanings of unimodal
inputs, and forms an overall understanding of the users' multimodal inputs. The interpretation of the inputs integrates contextual information from domain, temporal and anaphoric aspects of the
conversations, and visual and spatial aspects of the screen display.
In this framework, the visual properties of certain entities (e.g., colour, highlight) can be used in the referring expressions. For example, when the system highlights two houses with colours, the user
may use the phrase "the green one" to refer to one of them. Chai states
that the colour ‘‘green” has to be ‘‘further mapped to the internal
color encoding used by graphics generation” in order to resolve
the reference. However, she does not go further to examine the potential ambiguities and resolution mechanism if the colour ‘‘green”
exists in both the domain and the visual display.
Binot et al. [7] briefly suggest that the ambiguities can be resolved by using the context of the utterance, or by asking the user
to disambiguate explicitly. In addition, the resolution process must
have access to a representation of the spatial geometry of the domain. However, they do not explore the question of source ambiguity in depth, nor provide any systematic way to model and
resolve the problems.
4. Meaning representation with source annotation
We have adopted a fairly conventional meaning representation
language (MRL), with some extensions to allow for explicit processing of source information (and hence source ambiguity). Our
MRL is closely based on the procedural semantics stated in [53–
55] (or, more accurately, the reconstruction of Woods’ ideas in
[15,16]). That is, it uses logic-like constructs but is ultimately defined in terms of procedural operations upon the stored data, treating the knowledge base (for a particular domain) as providing a
fairly conventional collection of objects and relations, much as in
the standard Tarski semantics for first-order logic (although the
language itself is not first-order). What is novel about our language
is that it provides mechanisms for source information (in our
sense) to be explicitly marked on predicates and constants, and
for this information to be left unspecified where there is source
ambiguity. That is, each predicate and constant symbol has a subscript which can be d or s, or which is omitted where the source is 'unspecified'.
As well as constants and variables, we also allow (as a term in an expression) the special function C_W(Y), which maps an item Y to its corresponding entity under the mapping relation (its subscript W indicates the source of the argument it requires, not the source of its result). This C_W(Y) is a particular instance of the pragmatic referring functions presented by Nunberg and Fauconnier (Section 3.1 above).
Predicates or arguments with different sources cannot be mixed: a screen-predicate cannot be applied to a domain-constant. Although there may be a screen-predicate green_s and a domain-predicate green_d, and these may both (intuitively speaking) be representing the colour green, they are different predicates in our framework.
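The following sketch, which uses a simple tagged-symbol encoding rather than our actual MRL syntax, illustrates source annotation, unspecified sources, and the counterpart function C_W(Y); the mapping entries are hypothetical:

# A sketch (not our actual MRL syntax): each symbol carries an explicit source
# tag, which is "d" (domain), "s" (screen), or None when the source is still
# unspecified, i.e. source-ambiguous.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Sym:
    name: str
    source: Optional[str] = None    # "d", "s", or None = unspecified

green_s = Sym("green", "s")    # screen predicate
green_d = Sym("green", "d")    # a distinct domain predicate, despite the same name
green_unspec = Sym("green")    # source still to be resolved

# The referring function C_W(Y): maps an item to its counterpart under the
# mapping relation; W is the source of the argument.  Entries are hypothetical.
MAPPING = {
    ("icon3", "s"): ("car2", "d"),
    ("car2", "d"): ("icon3", "s"),
}

def C(item):
    name, source = MAPPING[(item.name, item.source)]
    return Sym(name, source)

# e.g. C(Sym("icon3", "s")) -> Sym(name="car2", source="d")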
The exact notation used for the MRL is not central to our discussion here. A knowledge representation scheme from a different tradition (e.g., semantic nets, deductive databases) might well be
equally effective. The central points are that source information
must be represented explicitly, that under-specification (ambiguity) of source must be possible, and that there is some notion of
terms having referents. Although our MRL does have quantifier
facilities, we shall not discuss issues involving quantifiers and their
scoping.
In the course of semantic processing, we assume the
presence of the following knowledge sources (an example of these
knowledge bases is in Fig. 2 below):
General model. This stores general knowledge that is independent of any particular domain.
World model. This stores specific facts about items in the
domain.
Display Model. This contains information about what is currently on the screen.
Context model. This acts as a basic discourse memory, so that recently mentioned entities can be referred back to (cf. [47,46,50]).
Mapping model. This is the table, mentioned earlier, which associates each icon with exactly one database (domain) entity.
As Ben Amara et al. [6] note, to handle source ambiguities the
graphical attributes on the screen should be available to referent
resolution, just as those in the domain knowledge base are. The
Display Model ensures that there is information about all the entities/attributes/relations in the visual display.
5. The source disambiguation problem
5.1. Sense and reference
Although source ambiguity can occur in various lexical categories (e.g., adjectives such as ‘‘green”, verbs such as ‘‘move”), we have
focussed our attention on its effect on, and interplay with, the resolution of references, conveyed by noun phrases.
Very broadly, the problem of reference resolution in the type of
system we are considering (computer programs for interpreting
natural language with respect to a finite knowledge base) can be
viewed as follows: for each referring noun phrase, the system must
compute a set of knowledge base identifiers, the referents of that
phrase. This perspective goes back at least as far as Woods [53]
and Winograd [52]. For a singular noun phrase, this set would be
a singleton.
In order to organise this mapping from language to sets of identifiers, it is customary to interpose intermediate stages of representation, with a corresponding decomposition of the processing into
stages. After some form of syntactic processing, there is normally
some symbolic representation of the ‘‘meaning” of the phrase. Usually, this semantic structure fulfils a descriptive role, and the referent(s) are exactly those knowledge base items which meet the
description [30]. For example, a phrase such as ‘‘the green car”
would, traditionally, have a semantic structure which specifies that
the referent should be of type car and should have the property
green. This arrangement can be seen as a computational version
of the traditional philosophical distinction between sense (a
descriptive unit) and reference (the actual item denoted) [17]. Philosophers have tended to treat the reference of a phrase as being an
object in the real world, but we, like most implementers of natural
language query systems, have to regard referents as belonging
within a symbolically represented world of knowledge-base
entities.
Fig. 2. Relevant content of knowledge bases.
The overall computation of a referent can be viewed as depending on three sorts of information (see [45] for a fuller discussion):
Descriptive information. This originates, as outlined above, in the
‘‘sense” or semantic structure of the phrase, and the referent must
conform to it.
Semantic compatibility. The enclosing linguistic unit may supply
further limits; for example, the referent of a noun phrase object
must meet any selection restrictions specified for the dominating
verb.
Miscellaneous inferrable constraints. These are derived from
other contextual information, including domain knowledge.
Any NL interpretation system which evaluates referents assumes the first of these. It is also quite normal for a relatively
simple NL system to implement a simple form of semantic compatibility. Dealing with miscellaneous contextual information is
more complicated. Whatever arrangement is chosen, it is commonplace for the last two types of information (semantic compatibility and domain inference) to be seen as narrowing down a
short-list of candidate referents, each of which already meets
the first criterion: being described by the sense of the phrase,
viewed in isolation. It is this which makes the phrases we have
been examining problematic. That first step, the defining of candidates based on descriptive information, is less straightforward
where source information (and source ambiguity) has to be
considered.
What we will do now is show how a systematic definition can
be given of available candidates for these multiple-source phrases,
thereby allowing conventional candidate-pruning methods to
proceed.
It is important to note that these are abstract set-theoretic
definitions of various semantic entities and relationships which
we are positing to capture the regularities cleanly; they are
not definitions of a computational sequence of stages whereby
referents are computed. The computation will be outlined later
(Section 7).
5.2. From descriptive content to intended referent
In the kind of system we are considering, where words may
correspond to different sources (domain or screen), the descriptive content of a referring expression in a sentence may appear
to describe an entity in one source, but when considered in a larger context, e.g., with other components of the same sentence, it
may actually refer to the corresponding entity in the other source.
All this indicates that an extra level of representation (and processing) may be needed. The new representation is more specific
than the descriptive content (sense), but is not necessarily the
ultimate referent object. We define a set of item(s) which are systematically related to the descriptive content (sense) of a phrase
(in the current context), taking into account the mapping relation.
This set then gives rise to a (very) shortlist of candidates which
includes the actual referent in context, the intended referent (cf.
[39]).
Although all these definitions would have to be generalised to
cover plural noun phrases, the concepts are made clearer by considering only singular noun phrases here.
The first building-block formalises the notion of ‘‘an entity and
possibly its counterpart under the mapping relation”:
Definition 1. A described entity pair DE is a set with one of the
following forms:
(i) a singleton set containing either a domain constant or a
screen constant;
(ii) a two-element set containing a domain constant and a
screen constant related by the mapping relation.
Here, we take ‘constant’ to be the class of entity which is a
potential symbolic referent. As stated in Section 2.3, we assume
that the semantic content of a singular, specific, referring noun
phrase can be stated as a collection (set) of predicates, which,
in the simpler cases, can be interpreted as a conjunction. We
also assume that there is a level of meaning representation at
which specific predicates (such as expensive, cheap) have
their meaning tied unambiguously to be either domain or screen
predicates (as in our MRL). For example, the meaning of a phrase
such as ‘‘the cheap car” can be approximated by a conjunction of
the properties cheap and car, with both these being domain
properties.
However, there could be phrases like "the green car" with "green" denoting a screen attribute instead. This makes the two predicates "green" and "car" coexist in a conjunction, but one describes the screen icon and the other denotes the domain object. It is for these types of phrase that we include the second clause in the definition of described entity pair, and it is this complex relationship that leads to the following definition:
Definition 2. The semantic content DC of a noun phrase descriptively covers a described entity pair DE iff
(i) every predicate in DC is true of at least one element in DE,
with source taken into account (i.e., domain-predicates can
be true only of domain-constants, and screen-predicates
can be true only of screen-constants); and
(ii) for every element in DE, there is a predicate in DC which is
true of the element, with source taken into account.
The second clause is to ensure that there are no extraneous elements in the described entity pair – simply enforcing the first clause would allow a collection of predicates to "descriptively cover" any set whatsoever as long as there was a subset of it which the predicates all described.
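A sketch of Definition 2, under the simplifying assumption that each source-tagged predicate is available extensionally as the set of constants it is true of (the extensions below are hypothetical, matching the running example), is:

# Sketch of Definition 2.  Because a screen predicate's extension contains only
# screen constants (and likewise for the domain), the source requirement is
# respected automatically.
EXTENSION = {
    ("green", "s"): {"icon3"},
    ("blue", "s"):  {"icon1"},
    ("icon", "s"):  {"icon1", "icon3"},
    ("car", "d"):   {"car1", "car2"},
    ("cheap", "d"): {"car1"},
}

def descriptively_covers(dc, de):
    """dc: a set of (name, source) predicates; de: a described entity pair (a set)."""
    every_predicate_fits = all(any(e in EXTENSION[p] for e in de) for p in dc)
    every_element_described = all(any(e in EXTENSION[p] for p in dc) for e in de)
    return every_predicate_fits and every_element_described

# e.g. descriptively_covers({("green", "s"), ("car", "d")}, {"icon3", "car2"}) -> True
# but  descriptively_covers({("green", "s")}, {"icon3", "car2"}) -> False (clause ii)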
We can now put in place the additional level of representation
(between semantic structure and referent) that we mentioned
earlier.
Definition 3. A described entity pair assignment DEA is a
mapping which allocates to each noun phrase NP a described
entity pair DE such that the semantic content of NP descriptively
covers DE.
For example, suppose that car2 is a domain constant that is classified (in the domain KB) as a car_d, that icon3 is a constant on the screen with the green_s screen colour attribute, that car2 and icon3 correspond via the mapping relation, that "green" is interpretable only as a screen predicate, and that "car" is interpretable only as a domain predicate (i.e., these lexical items have no source ambiguity in isolation). Then the semantic content of "the green car" consists of two predicates, green_s and car_d, and a described entity pair assignment is possible which allocates to this phrase the described entity pair {icon3, car2}. Notice that this relationship between phrases and entities is independent of the linguistic context, because it does not consider the surrounding sentence structures (if any).
The use of an ‘assignment’ in Definition 3 is intended to
capture the notion of a ‘‘reading” or an ‘‘interpretation”, so that
we can discuss the phenomena we are interested in without
complications from other forms of ambiguity. These other factors may give rise to alternative described entity pair
assignments.
Definition 4. Given a described entity pair DE, the potential referents of DE is the two-element set: DE ∪ {y | there is an x in DE which is related to y by the mapping relation}.
This is to capture the fact that a phrase like ‘‘the cheap car”
might be allocated a described entity pair containing only one (domain) constant but might (as in an earlier example) refer in context
to the (screen) entity which corresponds to that constant under the
mapping relation. That related constant would in such examples
not be part of the described entity pair, as it (being in the other
source) would not be described by the semantic content of the
noun phrase.
Table 1 shows some simple examples, where we assume that
the mapping relation links (car1, icon1) and (car2, icon3).
Definition 4 defines only the potential referents, because the actual choice of referent will depend upon various factors, such as the
surrounding linguistic context.
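Definition 4 can be sketched directly in terms of the mapping relation; the pairs below are the ones assumed for Table 1, and the comments reproduce its last column:

# Sketch of Definition 4: the potential referents of a described entity pair DE
# are its members together with their counterparts under the mapping relation.
MAPPING_PAIRS = {("car1", "icon1"), ("car2", "icon3")}

def potential_referents(de):
    counterparts = set()
    for x in de:
        for a, b in MAPPING_PAIRS:
            if a == x:
                counterparts.add(b)
            elif b == x:
                counterparts.add(a)
    return set(de) | counterparts

# Reproducing the last column of Table 1:
#   potential_referents({"car2", "icon3"}) -> {"car2", "icon3"}
#   potential_referents({"car1"})          -> {"car1", "icon1"}
#   potential_referents({"icon1"})         -> {"car1", "icon1"}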
Definition 5. Given a described entity pair assignment DEA, an
intended referent assignment based on DEA is a mapping IRA
which allocates to each noun phrase NP an element of the potential
referents of DEA(NP).
An intended referent assignment IRA is the mapping which finally allocates a referent IRA(NP) to a noun phrase NP. Again, where there
is ambiguity, there might be more than one IRA for a given phrase.
These definitions allow phrases with different described entity
pairs to have the same potential referents, and thus their intended
referents could be the same, as for example, the phrases "the cheap_d car_d" and "the blue_s icon_s" in Table 1.
Definitions 1–5 provide a chain from the descriptive content of
a phrase to the intended referent of the phrase. That is, we have a
formally defined shortlist of candidate referents, while taking into
account the dimension of ‘‘source”: the candidate intended referents IR for a phrase N are exactly the potential referents of a described entity pair DE which is descriptively covered by the
predicates in the semantic form of N.
There is actually more information extractable from the NP than
just a set of potential referents: there are also restrictions on these
referents, in terms of the predicates which must apply to them
(such as cheap_d and blue_s), and what their sources might be. We
shall outline below (Section 7) how all this locally extractable
information can be very simply encoded as constraints, thereby
allowing these definitions to interact with more general contextual
or linguistic information in a process which results in the selection
of the intended referent from the potential referents (i.e., the role
of the IRA).
6. Anaphora
The phenomenon that we called (Section 2.1) quasi-coreference
(a pronoun being related to an antecedent with a different source),
is captured by the following two definitions.
Definition 6. Noun Phrases N1 and N2 are coreferential iff they
have the same intended referents.
Definition 7. Noun Phrases N1 and N2 are quasi-coreferential iff
their intended referents correspond under the mapping relation.
We have not specifically investigated anaphoric phenomena
within our framework, but we have ensured that those mechanisms which we have developed do cater for the possibility of quasi-coreference.
Table 1
Representations for examples in the text

Phrase             Semantic predicates    DE              Potential referents
"the green car"    {green_s, car_d}       {car2, icon3}   {car2, icon3}
"the cheap car"    {cheap_d, car_d}       {car1}          {car1, icon1}
"the blue icon"    {blue_s, icon_s}       {icon1}         {car1, icon1}
6.1. Syntax salience information
Pronoun resolution often makes use of syntactic regularities
(e.g., [31,23]) and salience factors (e.g., [47,46,8,32,19]). In certain
applications, it may be helpful to impose some form of compatibility between the semantic content of the anaphor and antecedent
phrases (e.g., [49]). All these are generally used to find candidate
antecedent phrases for an anaphor.
Our simplified way of finding candidate antecedents is to allow
referring expressions (including anaphors) to take their referents
either from the Display Model (items on the screen) or the Context
model (items recently mentioned), or the counterparts (under the
mapping relation) of the items in these two models, but we have
not imposed any other salience or syntactic constraints on this process (although such restrictions could be added). The inclusion of
counterparts allows quasi-coreference. (It might seem that this
prevents items not already on the screen ever being mentioned
in the dialogue, but we do allow quantified commands to extract
items from the knowledge base. For example, "Display all expensive cars" is not treated as involving a referring expression, but would cause the corresponding icons of certain domain entities to be displayed on the screen, and hence to appear in the Display Model.)
6.2. Referential inference
Some form of inference (perhaps limited) is sometimes used to
assess whether the interpretation in which the denotations of the
phrases are the same is plausible or even possible. This is typically
the ultimate arbiter for evaluating or eliminating possibilities.
Although we have not looked at real inference (beyond intrasentential semantic constraints), our constraint-extraction process
automatically allows quasi-coreference to occur; see Section 7.2
below.
There is no explicit linking of anaphor to antecedent phrase; the
linkage is only implicit, in that both phrases are assigned the same
(or corresponding) referents.
7. Formulation as a constraint problem
7.1. Constraint satisfaction and reference resolution
Constraint satisfaction, as a way of sifting through candidates
when there may be various requirements to be met, has a long history in artificial intelligence [51,34]. Its use for reference resolution
was first proposed by Mellish [36]; see also [20]. In a computer system for understanding natural language within a small finite world
model (the situation we are considering), there is a limited set of
possible referents for any given referring phrase. However, there
may be interrelationships between these phrases (for example, if
they occur in the same sentence) which impose relationships between their possible referents. Hence the choice of referent for
each phrase cannot be made in isolation, but must take into account the choices being made for other related phrases. Constraint
satisfaction is a formal search procedure (or more accurately, a
family of search methods) designed for just this kind of situation.
It has the advantage of generality, so that inferred contextual information can be handled by the same uniform process as semantic
compatibility restrictions.
A constraint problem consists of a number of variables, each with an associated value-set to which its value must belong, and a number of constraints between the variables. (Value-sets are usually called "domains", but as we are already using the word "domain" for another technical concept, we have opted for this expression.)
In a situation where there is just one ‘‘source”, the candidate
set for each noun phrase could be straightforwardly narrowed
down to all the objects which were described by the meaning
of the noun phrase, in keeping with the Fregean tradition. This,
as argued above, has to be amended to cover the examples we
are concerned with, as this descriptive relationship does not always hold when there are separate sources. However, if we can
provide a suitably tractable definition of how to frame the content of an NP as restrictions on its referents, the constraint satisfaction method will once again be applicable. We have laid the
basis for that in Section 5: our constraint satisfaction system relies on our formal definitions in deciding the set of potential referents for each NP, and any inherent restrictions on those
objects.
7.2. Extracting the constraints
Once the semantic representation of an input sentence has been
constructed (in our MRL formalism), a process scans this formula
(and the lexicon) to determine which variables and constraints
should constitute the CSP. That is, it builds up (quite separately
from the semantic representation of the input text) a set of CSP
variables and a set of constraints (predicates) linking these variables. In our problem, there are just two types of CSP variables. A
variable ranging across entities (possible intended referents, for
example) has a value-set containing the entities in the Context
model and the Display model, together with any entities that correspond to them through the mapping relation (see Section 6). A
variable ranging across sources has the value-set {screen,
domain}.
A noun phrase such as ‘‘the blue car” will, in its semantic representation, involve two predicates, blued and card. In setting up
the CSP, the system creates one entity-variable for each predicate
in the phrase, and inserts a constraint that the predicate must be
true of that variable (more precisely, must apply to any binding
for that variable). In the worked example that we give in Section
9, these variables appear as DE11 and DE12, and the predicate
constraints as blue(DE11) and car(DE12). This ensures that, no
matter how many CSP variables are created in this way for a given
phrase, the set of entities that they are bound to will be ‘descriptively covered’ by the semantic predicates, in the sense of Definition 2 earlier. (These variables will be referred to here as
‘‘described entity” variables, as this captures their function fairly
well.) Also inserted are variables for the sources of the entities,
and the appropriate have_source relations – see the worked example for details.
The constraint-builder also inserts constraints between every
pair of variables in this set using the predicate related. This
(symmetric) predicate is true of two entities if either they are
the same entity or they correspond under the mapping relation
(in either direction). Stipulating that the entities bound to the
phrase’s variables meet this constraint, in all possible pairs, ensures that the set of entities bound to these variables forms a ‘described entity pair’ in the sense of Definition 1 earlier. (Formally,
related is an equivalence relation whose equivalence sets are
the maximal described entity pairs possible under the current
mapping relation.)
The related predicate is used to constrain the possible intended referents. For a referring noun phrase, a single variable is
set up for its intended referent, IR. A related constraint is established between this IR variable and at least one of the described entity variables. This means that the IR variable can be bound only to
an entity in the ‘potential referents’ (Definition 4). That is, the entity for the IR variable is either in the described entity pair or corresponds under the mapping relation to an element of that
described entity pair.
Since the set of values allowed for entity variables is restricted
(as noted earlier) to entities in the Display Model and the Context
Model (and their counterparts under the mapping relation), the
flexibility of the related relation automatically allows quasicoreference as well as conventional coreference. Hence our formal
model (Section 5) is enforced indirectly via these instances of the
related constraint.
It is a crucial part of the constraint expressions for an input sentence that the sources of each item are explicitly represented (with
constraints linking them to their items), so that constraints can be
stated on the sources, for example, that two sources must be the
same, or that a particular source must be screen.
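The following brute-force sketch (not the actual IMIG solver) shows the kind of constraint network built for a phrase such as "the blue car" as the object of "move". For brevity we check have_source directly on bindings rather than introducing separate source variables, we treat "blue" as potentially applying to either source, and we assume, as in the earlier scenario, that the database records no car colours:

from itertools import product

# Entities and extensions as in the earlier sketches; "blue" may in principle
# hold of screen items (icon colour) or domain items (car colour), but no car
# colours are recorded here, so only icon1 is blue.
SOURCE_OF = {"car1": "domain", "car2": "domain", "icon1": "screen", "icon3": "screen"}
MAPPING_PAIRS = {("car1", "icon1"), ("car2", "icon3")}
TRUE_OF = {"blue": {"icon1"}, "car": {"car1", "car2"}}

def related(x, y):
    """Same entity, or counterparts under the mapping relation (either direction)."""
    return x == y or (x, y) in MAPPING_PAIRS or (y, x) in MAPPING_PAIRS

def have_source(x, source):
    return SOURCE_OF[x] == source

# Variables: DE11, DE12 (described entities for "blue" and "car") and IR1 (the
# intended referent of the NP).  Their value-sets are all known entities here.
candidates = list(SOURCE_OF)
solutions = []
for de11, de12, ir1 in product(candidates, repeat=3):
    if (de11 in TRUE_OF["blue"] and de12 in TRUE_OF["car"]     # descriptive content
            and related(de11, de12)                            # DE forms an entity pair
            and related(ir1, de11) and related(ir1, de12)      # IR among potential referents
            and have_source(ir1, "screen")):                   # "move" wants a screen object
        solutions.append({"DE11": de11, "DE12": de12, "IR1": ir1})

print(solutions)    # [{'DE11': 'icon1', 'DE12': 'car1', 'IR1': 'icon1'}]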
There is a further small technical point which is necessary to
follow the example that we give in Section 9. It is central to our
way of expressing the various relations between referent variables
that the relation related is transitive. However, a basic constraint-solving algorithm of the type we have adopted does not
automatically compute transitive closures of relations. We have
therefore chosen to an algorithm for inserting all the related
links in initialising the constraints for the solver. In particular,
the link from an IR variable to the DE variables for that phrase need
(logically) be made only once, but the algorithm links the IR variable to all the DE variables for that phrase.
8. Heuristics for source disambiguation
8.1. Constraints and preferences
When designing constraint-satisfaction methods, it is commonplace to have two sorts of rules for computing semantic
properties or relationships: constraints which eliminate candidates, and preferences which express some form of ordering or
priority on possible solutions. If a value does not satisfy a constraint, it is rejected, whereas a preference does not lead to
rejection. The constraint propagation aims at getting rid of any
value in a candidate set that does not satisfy a constraint, but
may not narrow down the possibilities to unique values. In such
a situation, preferences can be used to select a particular value
from the available candidates, thus initiating a new constraint
propagation to narrow down other candidate sets accordingly.
If this still leaves some sets over-populated (having more candidates than is required), further preferences, if available, can be
applied. Note that preferences can be viewed as weak constraints. Their purpose is to steer the resolution process in a more suitable direction without actually committing to it. How often preferences are needed really depends on the complexity of the input sentences and the states of the knowledge bases.
The initial information extracted from a noun phrase NP (see Section 7.2) is all in the form of constraints. However, we have formulated a small number of generalisations about references within
sentences in a two-source setup, and these regularities can contribute further information for the reference resolution process, if
stated as constraints or preferences. There is space here to present
only a few representative heuristics; a more detailed discussion
can be found in [22]. All of these rules are relatively tentative,
and our central proposals about two-source references do not crucially depend on the correctness of their content, but they illustrate
the way in which further observations about source and references
can be used uniformly.
8.2. Constraints
Our illustrative example of a linguistic constraint relies on a
small amount of syntactic information about the NP. For this, we
adopt a fairly conventional analysis of NP structure in English
(e.g., [21]).
In a phrase like ‘‘the blue nissan car in the corner of the screen”,
‘‘car” is the head noun, and all the other substructures constitute
modifiers, either pre-modifiers (before the head noun) or post-modifiers (after the head noun). Where a noun such as ‘‘nissan” is used
as a modifier just before a head noun in an adjective-like role, it is a
classifier.
The mapping relation between screen entities and domain entities is not used symmetrically in linguistic expressions. For example, in our sample car-sales application, it would be natural to say
that ‘‘a square represents a car”, but not to say that ‘‘a car represents
a square”. This asymmetry restricts the use of head nouns. Intuitively, a screen entity, such as a square, can be referred to by a
head noun naming its screen source, such as "square_s", or by a word naming the source of the domain entity it represents, such as "car_d" (where subscripts s or d indicate the source we are postulating).
However, a domain entity such as a car cannot be referred to by
a head noun naming the corresponding screen source; that is, it
is very odd to refer to a car as a "square_s". We can summarise this
pattern by Rule 1.
RULE 1. If the head noun of a phrase unambiguously names a
screen predicate, then the intended referent of the phrase must be
a screen entity.
Sentences which violate this rule sound unacceptably
strange; for example, in our example application, "*buy_d the icon_s" and "*is the square_s cheap_d?". (Of course, "buy the icon" would be acceptable in an application where the domain of discussion was icons rather than cars, but that would be "buy_d the icon_d".)
Where the head noun unambiguously denotes a screen object
(as in Rule 1), it seems that the modifiers must denote only screen
attributes. For example, while phrases like "a red_s icon_s" and "a small_s circle_s" are meaningful expressions, phrases like "*the expensive_d icon_s" and "*every large_d square_s" (where non-screen
modifiers combine with a screen head noun) do not sound
felicitous.
In contrast, if the head noun is from the domain (so not covered
by Rule 1), there is no strong restriction on modifiers. So, phrases
like "the expensive_d car_d in_s the corner_s", "the red_s car_d", "a small_d saloon_d" are all acceptable.
However, there are exceptions. A domain word acting as the
classifier component can sometimes be so closely bonded to the
head noun denoting a fairly general screen class that the whole
phrase is still meaningful in that combination; for example, "the Nissan_d icon_s". The reason could be that the classifier component is
not functioning as a modifier in this case, but helping to form a
compound head noun. We have not found any clear generalisations
regarding classifier modifiers.
If, as suggested above, the modifiers are screen words/phrases,
it follows they can describe only screen entities; hence, Rule 2:
RULE 2. If the head noun of a phrase N unambiguously names a
screen predicate then, when computing a described entity pair
assignment for N, non-classifier modifiers of N can be true only
of screen entities.
Both our rules so far cover situations where the head noun
unambiguously names a screen predicate, so we can combine them
as the screen head noun rule:
RULE 3. (Screen head noun rule) If the head noun of a phrase unambiguously names a screen predicate, then:
– the intended referent of the phrase must be a screen entity,
– the non-classifier modifiers must be true of screen entities (in defining possible described entity pairs for the
phrase).
8.3. Preferences
As stated earlier, we have devised a small number of preferences relating to the allocation of source to words or phrases. We recognise that at present we have relatively little data to work from in creating these, and they are based mainly on intuitive analogies with other linguistic contexts. It is possible that studying, for example, dialogues in design contexts would provide further guidance; but it is also possible that the preferences depend on various factors that vary with context, which would make such a comparison misleading. Nor is it easy to say, if a system of the kind we propose were to gain widespread use, how far our preferences might have to be extended to cope with new kinds of phenomena, or indeed with the ways in which users of the system might adapt or change their existing linguistic preferences. The proposals here are intended merely to illustrate the kinds of preferences that we expect to be appropriate for this type of system, rather than to be empirically definitive.
8.3.1. Semantic types
We organise semantic entities into a hierarchy of types, where
near the top there are three major classes: those which are solely
domain types (e.g., car), those which are solely screen types (e.g.,
icon) and those which appear in both (e.g., colour). As well as being
useful in various ways during semantic processing, this hierarchy can be used to help determine the source of words that are ambiguous between screen and domain (for example, colours or sizes).
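The following minimal sketch shows one way such a three-way split could be consulted when assigning candidate sources to a word. The type names, word lists and the function possible_sources are illustrative assumptions for the car-sales example, not the hierarchy actually used.

    # A sketch of the three-way type split; word lists are invented.

    DOMAIN_ONLY_TYPES = {"car":   {"car", "saloon", "nissan"},
                         "price": {"cheap", "expensive"}}
    SCREEN_ONLY_TYPES = {"icon":  {"icon", "square", "circle"}}
    SHARED_TYPES      = {"colour": {"red", "blue", "green"},
                         "size":   {"small", "large"}}

    def possible_sources(word):
        """Return the candidate sources for a word, based on which part of
        the type hierarchy its semantic type falls under."""
        if any(word in words for words in DOMAIN_ONLY_TYPES.values()):
            return {"domain"}
        if any(word in words for words in SCREEN_ONLY_TYPES.values()):
            return {"screen"}
        if any(word in words for words in SHARED_TYPES.values()):
            return {"screen", "domain"}   # source-ambiguous, e.g. colours, sizes
        return set()

    print(possible_sources("square"))  # {'screen'}
    print(possible_sources("red"))     # {'screen', 'domain'} (set order may vary)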
Our first preference is based on our own intuitions, together with observations in the literature about the construction of referring expressions which in some way parallel a preceding phrase.
Ritchie [44] suggests that where an adjective pre-modifies an anaphoric ‘‘one”, it is more acceptable if it is directly comparable to the
corresponding adjective in the antecedent phrase:
– ‘‘That is a red bottle and this is a blue one.”
– ‘‘This is a wine bottle and that is a whisky one.”
– ?? ‘‘That is a red bottle and this is a whisky one.”
Levelt, in reviewing psycholinguistic studies, observes that ‘‘the
speaker apparently contrasted the new referent object with the previous one” [33], and Gatt [18] argues that when describing more than
one object, there is a natural tendency to use the same ‘‘perspective”, to increase coherence.
In our illustrative application, we suggest that in a phrase such
as ‘‘is the upholstery in the red car blue?” it is most natural to assume
that the two colour words ‘‘red” and ‘‘blue” are either both screen
attributes or both domain attributes; to mix sources here would
be slightly perverse (though not impossible – this is a preference,
not a constraint, and could in practice be overridden by pragmatic
considerations).
We have therefore formulated the following rule:
RULE 4. (Same Type Rule) If two words in the same sentence have the same semantic type (i.e., if there is a concept in the hierarchy which directly dominates both), then they probably have the same source.3
3 We have used the wording ‘‘probably” to emphasise that this is a preference rather than a constraint.
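As an illustration, the following minimal Python sketch generates the word pairs to which the Same Type Rule would assign a same-source preference. The flat type table and the function name same_type_preferences are hypothetical stand-ins for lookup in the hierarchy of Section 8.3.1.

    # A sketch of the Same Type Rule as a soft preference; the type table
    # is an invented simplification of the semantic type hierarchy.

    SEMANTIC_TYPE = {"red": "colour", "blue": "colour",
                     "small": "size", "large": "size",
                     "cheap": "price", "expensive": "price"}

    def same_type_preferences(words):
        """Return pairs of words that probably share a source because the
        same concept directly dominates both (a preference, not a constraint)."""
        prefs = []
        for i, w1 in enumerate(words):
            for w2 in words[i + 1:]:
                t1, t2 = SEMANTIC_TYPE.get(w1), SEMANTIC_TYPE.get(w2)
                if t1 is not None and t1 == t2:
                    prefs.append((w1, w2))
        return prefs

    # "is the upholstery in the red car blue?"
    print(same_type_preferences(["red", "car", "blue"]))   # [('red', 'blue')]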
8.3.2. Structural parallelism
A closely related proposal is that where two phrases in a sentence have analogous structure, corresponding elements in those structures will be from the same source. For example, in the sentence ‘‘Delete the small car to the right of the expensive car” the pre-modifiers ‘‘small” and ‘‘expensive” are in comparable positions, and we suggest that they therefore are likely to have the same source.
RULE 5. (Same Position Rule) If two words are at the same position
within two phrases, and the two phrases have the same structure,
then the two words probably have the same source.
The criteria for defining two phrases as having the same structure are: they have the same type of head noun, and either they both
have post-modifiers, or neither has.
In some cases, both the Same Type Rule and the Same Position Rule may apply, as in the worked example of Section 9.
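The following minimal Python sketch illustrates one possible encoding of the Same Position Rule. The phrase representation is a hypothetical simplification, and we assume here, in line with the treatment in Section 9, that the locative adjunct attaches above the core noun phrases being compared, so that neither compared phrase has a post-modifier.

    # A sketch of the Same Position Rule over a simplified, invented
    # phrase representation: a head noun plus pre- and post-modifier lists.

    def same_structure(p1, p2, head_type):
        """Two phrases count as structurally parallel if their head nouns have
        the same semantic type and either both or neither has post-modifiers."""
        return (head_type(p1["head"]) == head_type(p2["head"])
                and bool(p1["post"]) == bool(p2["post"]))

    def same_position_preferences(p1, p2, head_type):
        """Pair up pre-modifiers occupying the same position in two
        structurally parallel phrases; each pair probably shares a source."""
        if not same_structure(p1, p2, head_type):
            return []
        return list(zip(p1["pre"], p2["pre"]))

    # "Delete the small car to the right of the expensive car"; the adjunct
    # "to the right of ..." is assumed to attach above the two core NPs.
    head_type = lambda noun: "car_type" if noun == "car" else noun
    np1 = {"head": "car", "pre": ["small"], "post": []}       # "the small car"
    np2 = {"head": "car", "pre": ["expensive"], "post": []}   # "the expensive car"
    print(same_position_preferences(np1, np2, head_type))     # [('small', 'expensive')]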
9. A worked example
We will work through the input sentence (3) to demonstrate how constraint solving is used in our system.
(3) ‘‘delete the red car at the right of the blue car”
The MRL expression for this sentence has a structure whose nesting is analogous to the syntax of (3); that is, the representation of ‘‘the blue car” is a substructure inside a larger representation of ‘‘at the right of the blue car”, which in turn is a subpart within the representation of the whole phrase ‘‘the red car at the right of the blue car”. Finally, the representation of that phrase forms the argument to the operation denoted by the verb ‘‘delete”.
In the following account of the CSP variables generated from this, variables with names DE** are what we called ‘‘described entity variables”, ranging over entities, and variables called Sde** are to be bound to the sources of described entity variables. Variables with names of the form IR* and Sir* represent an intended referent and its source, respectively.
The portion of MRL corresponding to ‘‘the blue car” gives rise to two DE variables, DE11 (from blue) and DE12 (from car), and two corresponding source variables Sde11 and Sde12. Following the procedure sketched in Section 7.2, there are these constraints: blue(DE11), car(DE12), have_source(DE11,Sde11), have_source(DE12,Sde12), and related(DE11,DE12). Also for this phrase, there is a variable for the intended referent, IR1, and its source, Sir1, with constraints have_source(IR1,Sir1), related(DE11,IR1), related(DE12,IR1) (see Section 7.2 for the rationale for these).
The phrase ‘‘the red car” gives rise to a similar set of variables DE21, DE22, Sde21, Sde22, IR2 and Sir2, with an analogous set of constraints.
The prepositional phrase ‘‘at the right of” is regarded as using the intended referent of the embedded phrase ‘‘the blue car” to add descriptive information to the NP whose core is ‘‘the red car”; that is, it contributes to defining the described entities for that larger phrase. This arrangement is not obvious, but it seems to capture the idea that ‘‘at the right of” has to take a spatial location as its argument (at the referential level), while the whole adjunct ‘‘at the right of the blue car” is part of the semantic content (i.e., descriptive material) for the larger NP. This perspective leads to a constraint right_of(DE23,IR1), where DE23 is a further variable generated for the right_of predicate, standing for whatever entity is to the right of IR1 (in keeping with the procedure of Section 7.2). Similarly, the verb ‘‘delete” depicts a ‘‘removable” attribute of the intended referent (represented by the variable IR2 here) of the whole phrase ‘‘the red car at the right of the blue car”. This raises the constraint removable(IR2).
The set of constraints produced from the whole NP is shown in Table 2.
To solve this network, we have adopted Mackworth’s AC-3 consistency algorithm [34], which achieves node and arc consistency within the network of constraints. If we assume the relevant knowledge bases contain the information summarised in Fig. 2, then Table 3 shows the results of applying this algorithm. In this example, when network consistency is achieved, not every variable has the right number of candidates (i.e., one) in its candidate set (see the column ‘‘After AC-3” in Table 3). It is therefore appropriate to apply preferences. In the phrases ‘‘the blue car” and ‘‘the red car”, the modifiers ‘‘blue” and ‘‘red” meet the conditions of the Same Type Rule and of the Same Position Rule, both of which indicate a preference for these items to have the same source. The system therefore selects {screen} as the value for Sde11. After applying the preference, the consistency algorithm is reapplied to the network, which results in the solution shown in the last column of Table 3.
All of the above ideas have been tested in an implementation which, although primarily an exploratory testbed, embeds the constraint mechanism in a relatively realistic multi-modal environment [22]. In this setting, the mechanism achieves its task of identifying potential referents for NPs.
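To make the propagation step concrete, the following is a minimal, self-contained Python sketch of an AC-3-style arc-consistency loop over a toy fragment of the example’s network. The knowledge-base facts (which icons exist, which cars they represent, their colours and relative positions) are invented stand-ins for Fig. 2, only a few of the constraints in Table 2 are encoded, and the ‘‘blue” modifier is hard-wired to its screen reading; the sketch therefore illustrates the mechanism rather than reproducing Table 3.

    # A toy arc-consistency (AC-3 style) run over a fragment of the network;
    # all facts and initial candidate sets below are invented illustrations.
    from collections import deque

    REPRESENTS = {"icon1": "car1", "icon2": "car2", "icon3": "car3"}
    SCREEN_COLOUR = {"icon1": "red", "icon2": "green", "icon3": "blue"}
    RIGHT_OF_FACTS = {("icon1", "icon3")}        # icon1 is to the right of icon3
    ENTITIES = {"car1", "car2", "car3", "icon1", "icon2", "icon3"}

    def related(x, y):
        # holds when x and y are the same entity or one represents the other
        return x == y or REPRESENTS.get(x) == y or REPRESENTS.get(y) == x

    def right_of(x, y):
        return (x, y) in RIGHT_OF_FACTS

    # Candidate sets: 'blue' is given its screen reading here, and
    # removable(IR2) is taken to restrict IR2 to icons.
    domains = {
        "DE11": {e for e in ENTITIES if SCREEN_COLOUR.get(e) == "blue"},
        "IR1":  set(ENTITIES),
        "DE23": set(ENTITIES),
        "IR2":  {e for e in ENTITIES if e.startswith("icon")},
    }

    # A few of the binary constraints from Table 2 (argument order matters).
    constraints = {
        ("DE11", "IR1"): related,     # related(DE11, IR1)
        ("DE23", "IR1"): right_of,    # right_of(DE23, IR1)
        ("DE23", "IR2"): related,     # related(DE23, IR2)
    }

    def revise(xi, xj):
        """Remove values of xi that have no supporting value in xj."""
        key = (xi, xj) if (xi, xj) in constraints else (xj, xi)
        check = constraints[key]
        ordered = check if key == (xi, xj) else (lambda a, b: check(b, a))
        removed = False
        for vi in set(domains[xi]):
            if not any(ordered(vi, vj) for vj in domains[xj]):
                domains[xi].discard(vi)
                removed = True
        return removed

    def ac3():
        """Arc consistency over the toy network, in the manner of AC-3."""
        queue = deque(arc for (a, b) in constraints for arc in [(a, b), (b, a)])
        while queue:
            xi, xj = queue.popleft()
            if revise(xi, xj):
                for pair in constraints:
                    if xi in pair:                 # re-examine arcs into xi
                        queue.extend((v, xi) for v in pair if v != xi)
        return domains

    print(ac3())
    # e.g. {'DE11': {'icon3'}, 'IR1': {'icon3'}, 'DE23': {'icon1'}, 'IR2': {'icon1'}}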
10. Conclusions
We have provided a formalisation of a form of metonymic reference that is very basic and limited, but does occur naturally in
a certain type of application. The formal account centres on a
set of definitions which systematically relate the linguistic semantic content of a noun phrase to a set of potential symbolic referents,
and which can be framed succinctly in terms of constraints imposing equivalences between abstract referent candidates. Providing a
route to a set of potential referents allows existing computational
mechanisms (e.g., inference) to be applied. We have also proposed
some filtering heuristics which are specific to phrases that may refer in this indirect way.
An obvious extension is to see whether these ideas can be used in
other, more general, forms of metonymy, and if so, how. Even within
our limited application, there is scope for further improvement:
(i) There may be other generalisations about source-ambiguous
forms (or even more general metonymic phrases) which
could be added to the set of constraints.
(ii) A wider range of linguistic constructs could be covered. For
example, anaphora, including reflexive pronouns, could be
more thoroughly explored.
(iii) The heuristics used in the constraint-solving are mainly based on our intuition and on experiments with a very small number of test dialogues. It would be beneficial to have a multimodal dialogue corpus upon which heuristics could be tested and on which further generalisations could be based. Statistical methods or machine learning approaches might be useful for gathering relevant data.
(iv) Mapping relations are used during the resolution of reference, but they are assumed not to be mentioned by the user in the dialogues. For example, the sentence ‘‘which car is represented by a blue icon?” is not considered, although it could plausibly be used in an interaction.
(v) The problem addressed in this paper assumes that there are only two sources, the domain and the screen. The question arises of whether there could be more than two sources. We have not encountered any such situation, so it is hard to be sure what the correct generalisation would be in such a case.
Table 2
All the constraints raised from (3)

Con1   domain(Sde12)               Con2   domain(Sde22)
Con3   screen(Sde23)               Con4   same_source(Sde23, Sir1)
Con5   blue(DE11)                  Con6   car(DE12)
Con7   red(DE21)                   Con8   car(DE22)
Con9   right_of(DE23, IR1)         Con10  removable(IR2)
Con11  have_source(DE11, Sde11)    Con12  have_source(DE12, Sde12)
Con13  have_source(DE21, Sde21)    Con14  have_source(DE22, Sde22)
Con15  have_source(DE23, Sde23)    Con16  have_source(IR1, Sir1)
Con17  have_source(IR2, Sir2)      Con18  related(DE11, DE12)
Con19  related(DE11, IR1)          Con20  related(DE12, IR1)
Con21  related(DE21, DE22)         Con22  related(DE21, DE23)
Con23  related(DE21, IR2)          Con24  related(DE22, DE23)
Con25  related(DE22, IR2)          Con26  related(DE23, IR2)
Table 3
The candidate sets of variables in Table 2 during constraint satisfaction

Variable   Initial candidates    Candidates after NC      After AC-3            After preference
Sde11      {screen, domain}      {screen, domain}         {screen, domain}      {screen}
Sde12      {screen, domain}      {domain}                 {domain}              {domain}
Sde21      {screen, domain}      {screen}                 {screen}              {screen}
Sde22      {screen, domain}      {domain}                 {domain}              {domain}
Sde23      {screen, domain}      {screen, domain}         {screen, domain}      {screen}
Sir1       {screen, domain}      {screen, domain}         {screen, domain}      {screen}
Sir2       {screen, domain}      {screen}                 {screen}              {screen}
DE11       {*}                   {car1, icon3}            {car1, icon3}         {icon3}
DE12       {*}                   {car1, car2, car3}       {car1, car3}          {car3}
DE21       {*}                   {icon1, icon2}           {icon1, icon2}        {icon1}
DE22       {*}                   {car1, car2, car3}       {car1, car2}          {car1}
DE23       {*}                   {*}                      {icon1, car2}         {icon1}
IR1        {*}                   {*}                      {car1, icon3}         {icon3}
IR2        {*}                   {icon1, icon2, icon3}    {icon1, icon2}        {icon1}

{*} = {car1, car3, icon3, icon1, icon2, car2}; NC = node consistency.
Acknowledgements
The first author (Daqing He) was supported by a Colin & Ethel
Gordon Studentship from the University of Edinburgh. The work
reported here was carried out at the University of Edinburgh.
References
[1] J. Allgayer, K. Harbusch, A. Kobsa, C. Reddig, N. Reithinger, D. Schmauks, XTRA:
a natural language access system to expert systems, International Journal of
Man–Machine Studies 31 (1989) 161–195.
[2] E. André, T. Rist, The design of illustrated documents as a planning task, in: M.
Maybury (Ed.), Intelligent Multimedia Interfaces, AAAI/MIT Press, Menlo Park,
CA/Cambridge, MA, 1993, pp. 94–116.
[3] E. André, T. Rist, Referring to world objects with text and pictures, in:
Proceedings of the 15th International Conference on Computational
Linguistics (COLING’94), Kyoto, Japan, 1994, pp. 530–534.
[4] I. Androutsopoulos, G. Ritchie, Database interfaces, in: R. Dale, H. Moisl, H.
Somers (Eds.), Handbook of Natural Language Processing, Marcel Dekker, New
York, 2000, pp. 209–240. chapter 9.
[5] I. Androutsopoulos, G. Ritchie, P. Thanisch, Natural language interfaces to
databases – an introduction, Journal of Natural Language Engineering 1 (1)
(1995) 29–81.
[6] H. Ben Amara, B. Peroche, H. Chappel, M. Wilson, Graphical interaction in a
multi-modal interface, in: Proceedings of Esprit Conferences, Kluwer Academic
Publisher, Dordrecht, Netherlands, 1991, pp. 303–321.
[7] J. Binot, L. Debille, D. Sedlock, B. Vandecapelle, H. Chappel, M.D. Wilson,
Multimodal integration in MMI2: anaphora resolution and mode selection, in: H.
Luczak, A. Cakir, G. Cakir (Eds.), Work With Display Units-WWDU’92 (Abstract
Book of 3rd. International Scientific Conference), Berlin, Germany, 1992.
[8] S.E. Brennan, M.W. Friedman, C.J. Pollard, A centering approach to pronouns,
in: Proceedings of 25th Annual Meeting of the Association of Computational
Linguistics (ACL-87), Stanford, CA, 1987, pp. 155–162.
[9] J. Chai, Semantics-based representation for multimodal interpretation in
conversational systems, in: Proceedings of International Conference on
Computational Linguistics (COLING), 2002.
[10] J.Y. Chai, P. Hong, M.X. Zhou, A probabilistic approach to reference resolution
in multimodal user interfaces, in: Proceedings of Intelligent User Interfaces 04,
2004, pp. 70–77.
[11] P. Cohen, M. Dalrymple, D. Moran, F. Pereira, J. Sullivan, R. Gargan Jr., J.
Schlossberg, S. Tyler, Synergistic use of direct manipulation and natural
language, in: Proceedings of the 1989 Conference on Human Factors in
Computing Systems (CHI’89), 1989, pp. 227–233.
[12] A. Copestake, K. Sparck Jones, Natural language interfaces to databases, The
Knowledge Engineering Review 5 (4) (1990) 225–249.
[13] R. Dale, N. Haddock, Generating referring expressions involving relations, in:
Proceedings of European Chapter of Association of Computational Linguistics
(ECAL-91), 1991, pp. 161–166.
[14] G. Fauconnier, Mental Spaces: Aspects of Meaning Construction in Natural
Language, Cambridge University Press, Cambridge, UK, 1994. First published
by MIT Press, 1985.
[15] A.A. Fernandes, A Meaning Representation Language for Natural-Language
Interfaces, Master’s thesis, Department of Artificial Intelligence, University of
Edinburgh, 1990.
[16] A.A. Fernandes, G. Ritchie, D. Moffat, A Formal Reconstruction of Procedural
Semantics, Technical Report DAI-RP-678, Department of Artificial Intelligence,
University of Edinburgh, 1994.
[17] G. Frege, Sense and reference, in: D. Davidson, G. Harman (Eds.), The
Logic of Grammar, Dickenson, Encino, CA, 1892, pp. 116–128. Reprinted in
1975.
[18] A. Gatt, Structuring knowledge for reference generation: a clustering
algorithm, in: Proceedings of the 11th Conference of the European Chapter
of the Association for Computational Linguistics, Association for
Computational Linguistics, Trento, Italy, 2006, pp. 321–328.
[19] B.J. Grosz, A.K. Joshi, S. Weinstein, Centering: a framework for modelling the
local coherence of discourse, Computational Linguistics 21 (2) (1995) 203–
225.
[20] N.J. Haddock, Incremental Semantics and Interactive Syntactic Processing,
Ph.D. thesis, University of Edinburgh, Edinburgh, Scotland, 1988.
[21] M. Halliday, An Introduction to Functional Grammar, second ed., Edward
Arnold, London, UK, 1994.
[22] D. He, References to Graphical Objects in Interactive Multimodal Queries, Ph.D.
thesis, Division of Informatics, University of Edinburgh, Edinburgh, Scotland,
2001.
[23] J.R. Hobbs, Resolving pronoun references, Lingua 44 (1978) 311–338.
[24] H. Horacek, An algorithm for generating referential descriptions with flexible
interfaces, in: Proceeding of the 35th Annual Meeting of the Association for
Computational Linguistics/8th Conference of the European Chapter of the
Association for Computational Linguistics, 1997, pp. 206–213.
[25] M. Johnston, Unification-based multimodal parsing, in: Proceedings of the
Conference of COLING-ACL’98, Montreal, Canada, 1998, pp. 624–630.
[26] M. Johnston, S. Bangalore, Finite-state multimodal integration and
understanding, Natural Language Engineering 11 (2) (2005) 159–187.
[27] M. Johnston, P. Cohen, D. McGee, S. Oviatt, J. Pittman, I. Smith, Unification-based multimodal integration, in: Proceedings of 35th ACL conference ACL-97,
1997, pp. 281–288.
[28] H. Kim, J. Seo, Resolution of referring expressions in a Korean multimodal
dialogue system, ACM Transactions on Asian Language Information Processing
2 (4) (2003) 324–337.
[29] P. Koch, Metonymy between pragmatics, reference and diachrony,
metaphorik.de, Available from: <http://www.metaphorik.de/07/>, 2004.
[30] A. Kronfeld, Reference and Computation: An Essay in Applied Philosophy of
Language, Studies in Natural Language Processing, Cambridge University Press,
Cambridge, UK, 1990.
[31] R. Langacker, Pronominalization and the chain of command, in: D. Reibel, S.
Schane (Eds.), Modern Studies in English, Prentice-Hall, Englewood Cliffs, NJ,
1969.
[32] S. Lappin, H.J. Leass, An algorithm for pronominal anaphora resolution,
Computational Linguistics 20 (4) (1994) 535–561.
[33] W.J.M. Levelt, Speaking: From Intention to Articulation, A Bradford Book, MIT
Press, Cambridge, MA, 1989.
[34] A.K. Mackworth, Consistency in networks of relations, Artificial Intelligence 8
(1977) 99–118.
[35] K. McKeown, S. Feiner, J. Robin, D. Seligmann, M. Tanenblatt, Generating cross-references for multimedia explanation, in: Proceedings of 10th National Conference of the American Association for Artificial Intelligence (AAAI’92), San Jose, CA, 1992, pp. 9–16.
[36] C.S. Mellish, Computer Interpretation of Natural Language Descriptions, Ellis Horwood Series in Artificial Intelligence, Ellis Horwood, 1985.
[37] J. Neal, Z. Dobes, K. Bettinger, J. Byoun, Multimodal references in human–computer dialogue, in: Proceedings of 6th National Conference of the American Association for Artificial Intelligence (AAAI’88), St. Paul, MN, USA, 1988, pp. 819–823.
[38] J. Neal, S. Shapiro, Intelligent multi-media interface technology, in: J. Sullivan, S. Tyler (Eds.), Intelligent User Interfaces, ACM Press, New York, NY, 1991, pp. 11–44.
[39] G. Nunberg, The non-uniqueness of semantic solutions: polysemy, Linguistics and Philosophy 3 (2) (1979).
[40] S. Oviatt, A. DeAngeli, K. Kuhn, Integration and synchronization of input modes during multimodal human–computer interaction, in: Referring Phenomena in a Multimedia Context and Their Computational Treatment, A Meeting of the ACL Special Interest Group on Multimedia Language Processing, Madrid, Spain, 1997, pp. 1–13.
[41] C. Perrault, B. Grosz, Natural language interfaces, in: H. Shrobe (Ed.), Exploring Artificial Intelligence, Morgan Kaufmann Publishers Inc., San Mateo, California, 1988, pp. 133–172.
[42] L. Pineda, G. Garza, A model for multimodal reference resolution, Computational Linguistics 26 (2) (2000) 139–193.
[43] H. Read, English Prose Style, Bell and Sons, London, 1934.
[44] G. Ritchie, Computer Modelling of English Grammar, Ph.D. thesis, University of Edinburgh, Edinburgh, Scotland, CST-1-77, Department of Computer Science, 1977.
[45] G. Ritchie, Semantics in parsing, in: M. King (Ed.), Parsing Natural Language,
Academic Press, North Holland, 1983, pp. 199–217.
[46] C. Sidner, Focusing in the comprehension of definite anaphora, in: B. Grosz,
K.S. Jones, B.L. Webber (Eds.), Reading in Natural Language Processing,
Morgan Kaufmann, Los Altos, CA., 1987, pp. 363–394.
[47] C.L. Sidner, Focusing for interpretation of pronouns, American Journal of
Computational Linguistics 7 (4) (1981) 217–231.
[48] K. Van Deemter, Towards a logic of ambiguous expressions, in: K. Van
Deemter, S. Peters (Eds.), Semantic Ambiguity and Underspecification, CSLI
Publications, 1995, pp. 203–237.
[49] R. Vieira, Definite Description Processing in Unrestricted Text, Ph.D. thesis,
University of Edinburgh, Edinburgh, Scotland, 1997.
[50] M. Walker, Centering, anaphora resolution, and discourse structure, in: M.
Walker, A. Joshi, E. Prince (Eds.), Centering Theory in Discourse, Clarendon
Press, Oxford, England, 1997.
[51] D. Waltz, Understanding line drawings of scenes, in: P.H. Winston (Ed.), The
Psychology of Computer Vision, McGraw-Hill, New York, 1975.
[52] T. Winograd, Understanding Natural Language, Academic Press, New York,
1972.
[53] W. Woods, Semantics for a Question-Answering System, Ph.D. thesis, Harvard
University, 1967.
[54] W. Woods, Procedural semantics for a question-answering machine, in:
Proceedings for the Fall Joint Computer Conference, AFIPS, New York, NY,
1968, pp. 457–471.
[55] W. Woods, Semantics and quantification in natural language question
answering, in: M. Rubinoff, M. Yovits (Eds.), Advances in Computers,
Academic Press, New York, NY, 1978, pp. 2–64.