BARGE-IN EFFECTS IN BAYESIAN DIALOGUE ACT RECOGNITION AND SIMULATION
Heriberto Cuayáhuitl, Nina Dethlefs, Helen Hastie, Oliver Lemon
School of Mathematical and Computer Sciences
Heriot-Watt University, Edinburgh, United Kingdom
{h.cuayahuitl,n.s.dethlefs,h.hastie,o.lemon}@hw.ac.uk

This research was funded by the European Commission FP7 programme FP7/2011-14 under grant agreement no. 287615 (PARLANCE).
ABSTRACT
Dialogue act recognition and simulation are traditionally
considered separate processes. Here, we argue that both can
be fruitfully treated as interleaved processes within the same
probabilistic model, leading to a simultaneous improvement in the performance of both. To demonstrate this, we train multiple
Bayes Nets that predict the timing and content of the next
user utterance. A specific focus is on providing support for
barge-ins. We describe experiments using the Let’s Go data
that show an improvement in classification accuracy (+5%) in
Bayesian dialogue act recognition involving barge-ins using
partial context compared to using full context. Our results
also indicate that simulated dialogues with user barge-in are
more realistic than simulations without barge-in events.
Index Terms— spoken dialogue systems, dialogue act
recognition, dialogue simulation, Bayesian nets, barge-in
1. INTRODUCTION AND MOTIVATION
Modelling dialogue phenomena incrementally has been highlighted as one of the (remaining) challenges for spoken dialogue systems [1, 2]. Whereas non-incremental architectures
wait until the end of an incoming user utterance before starting to process it, incremental ones become active as soon as
the first input units are available. In addition, whereas non-incremental architectures assume communication based on
complete dialogue acts, incremental ones assume communication based on partial dialogue acts. This difference is illustrated in the figure below and has been shown to account for
shorter processing times and higher user acceptance [3].
In this paper, we focus on user dialogue act recognition
and user simulation for spoken dialogue systems at the semantic level. The former involves mapping a set of features
of the Automatic Speech Recognition (ASR) input and dialogue history onto a unique user dialogue act. The latter
is typically used for training system policies, which require
a large amount of dialogue data. While both problems are
well investigated for the non-incremental case, little work exists on incremental approaches; see [4, 5] for some first advances. More work in this direction is therefore needed to
enhance the efficiency and quality of spoken dialogue systems. In addition, previous work has so far neglected the
fact that user dialogue act recognition and simulation can often be treated fruitfully within the same probabilistic model.
Given a dialogue act recogniser which finds the most likely
user dialogue act based on the history of previous system and
user dialogue context, we can treat simulation as an equivalent problem: given a history of system and user dialogue
context, what action is the user most likely to perform next?
This double-function model is advantageous because an improved dialogue act recognition accuracy automatically leads
to improved realism of simulated dialogues (though the converse may not apply, especially if the simulations apply some constraints or distortions). Our solution
here consists of using multiple statistical classifiers that predict both when the user will start speaking next (which can
be at any point during a system utterance) and what
the user is most likely going to say. This paper describes and
analyses our proposed approach, advocating recognition and
simulation from the same set of statistical models. A particular focus will be on recognising and simulating barge-ins in
natural interaction.
[Figure: (a) complete dialogue acts (DAs) without barge-in — the system delivers its dialogue acts (SDAs) as one complete utterance and the user responds with complete user dialogue acts (UDAs) only afterwards; (b) partial dialogue acts with user barge-in — the system turn is delivered as partial acts SDA1, SDA2, ..., SDAn, and the user may respond (UDA1, UDA2, ..., UDAn) before the system turn is complete.]

2. RELATED WORK

2.1. Incremental Natural Language Understanding
Related work in incremental language understanding has focused on finding the intended semantics in a user utterance
as soon as possible while the user is still speaking. This has
been shown to lead to faster system responses and increased
human acceptability. [6] were among the first to demonstrate
that incremental understanding is not only substantially faster
than its non-incremental counterpart, but is also significantly
preferred by human judges. [7] use a classifier to map ASR
input features onto (full) semantic frames from partial inputs.
They show that better results can be achieved by training the
classifier from partial dialogue acts rather than from full dialogue acts. [8] present an incremental parser which finds the
best semantic interpretation based on syntactic and pragmatic
constraints, especially taking into account whether a candidate parse has a denotation in the current context. [9] perform
incremental language understanding taking visual, discourse
and linguistic context into account. They show that while employing an incremental analysis module provides some benefits, hypotheses are not always stable, especially at early processing states. Our results extend previous work in analysing
the performance of dialogue act recognition and simulation in
dialogue turns with and without barge-in events.
2.2. User Simulation for Incremental Dialogue
Despite the growing popularity of incremental architectures,
little work exists on incremental simulation. Some authors
have used non-incremental simulations to optimize incremental dialogue or language generation systems [10, 11]. Similarly, [12] discuss the option of integrating POMDP-based dialogue managers with incremental processing, but leave the
question of simulation unaddressed. [5] present a first model
of incremental simulation, but focus exclusively on the problem of turn taking. Given the increased interest in incremental processing, the absence of incremental phenomena in user
simulations represents an important limitation.
2.3. Statistical Dialogue Act Recognition and Simulation
Approaches to (non-incremental) dialogue act recognition
from spoken input have explored a wide range of different
methods and feature sets. [13] use Hidden Markov Models
(HMMs) in a joint segmentation and classification model.
[14] also use HMMs but explore decision trees and neural networks in addition. Several authors have explored
Bayesian methods, such as Naive Bayes classification [15]
and Bayesian Networks [16]. [17] use Bayesian networks to
re-rank dialogue acts from n-best lists. Other authors have
used Support Vector Machines (SVMs), such as [18] (who
use a combination of SVMs and HMMs) or [19] who show
that an active learning framework for dialogue act classification can outperform passive learning. [20] use max-ent
classification. A wide range of feature sets have also been
explored, including discourse and sentence features [14, 21],
multi-level information features [22], affective facial expression features [23], or speaker-specific features [24].
In terms of dialogue act simulation, a similarly wide
range of methods has been investigated. Several authors have
explored Bayesian methods, such as [25] who use Bayesian
Networks (BNs) to estimate user actions independently from
natural language understanding. Similarly, [26] use Bayesian
Networks that can deal with missing data in simulation and
[27] use a dynamic BN in an unsupervised learning setting.
Other graphical models for simulation have been used by
[28], who compare different types of HMMs and [29] who
use conditional random fields. [30] use an agenda-based
model for user simulation, in which the agenda is represented
by a stack of user goals which needs to be satisfied before
successfully achieving the next dialogue goal. The agenda
can be treated as hidden from the dialogue manager to represent the uncertainty that also exists with respect to real users.
This model has been extended to be trainable directly from
real users rather than corpora [31]. Other methods have been
based on “advanced” n-grams [32], clustering [33], or random selection of user utterances from a corpus [34]. Finally,
some authors model the user as an agent similar to the dialogue manager [35] or use inverse reinforcement learning to
simulate user actions [36]. For a detailed overview of the different user simulations see [37, 38] and [39] for an overview
of possible evaluation metrics.
3. A BARGE-IN BASED APPROACH FOR
DIALOGUE ACT RECOGNITION AND SIMULATION
The sort of dialogue act recognition (also referred to as
shallow semantic parsing in the literature) that we aim for
takes into account dialogue act types, attributes and values. An example dialogue act is confirm(to=Pittsburgh
Downtown). While dialogue act recognition and user simulation are typically treated as separate components, our
approach is based on the premise that user simulation can
make use of a statistical dialogue act recogniser to generate user responses. This is possible by sampling from the
estimated probability distributions induced by the dialogue
act recogniser. The user simulations, in this case, would mimic the user responses from some training data, but
with more variation for wider coverage in terms of conversational behaviours. To make simulations account for
unseen situations, the statistical models would have to include all possible combinations of dialogue act types and slot
value pairs with non-zero probabilities. Algorithm 1 shows
a fairly generic dialogue act recogniser assuming multiple
statistical classifiers λ = {λdat , λatt , λval(i) } with features
X = {x1 , ..., xn }, labels Y = {y1 , ..., yn }, and the evidence e = {x1 =val(x1 ), ..., xn =val(xn )} collected during
the system-user interaction. While model λdat is used to
predict system dialogue act types, model λatt is used to predict attributes, and the remaining models (λval(i) ) are used to
predict slot value pairs. In our case we use Bayes Nets (BNs)
to obtain the most likely label y (i.e. a dialogue act type,
attribute, or value) from domain values D(Y ) expressed as
y ∗ = arg max P (y|e).
y∈D(Y )
Algorithm 1 Statistical recogniser of user dialogue acts
 1: function DialogActRecognition(StatisticalModels λ, Evidence e)
 2:   λ_dat ← statistical model for predicting dialogue act types (DAT)
 3:   λ_att ← statistical model for predicting attributes (ATT)
 4:   λ_val(i) ← statistical models for predicting attribute values (VAL)
 5:   y ← output (class label) of the statistical classifier in turn
 6:   dat = arg max_{y ∈ D(DAT)} P(y|e; λ_dat)
 7:   att = arg max_{y ∈ D(ATT)} P(y|e; λ_att)
 8:   pairs ← []
 9:   for each attribute i in att do
10:     val = arg max_{y ∈ D(VAL(i))} P(y|e; λ_val(i))
11:     pairs ← APPEND(attribute i = val)
12:   end for
13:   return dat(pairs)
14: end function
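To make the procedure concrete, the short Python sketch below mirrors Algorithm 1. It is an illustration rather than the authors' implementation: each model λ is abstracted as a function mapping an evidence dictionary to a posterior {label: probability} (in the paper these are Bayes nets), the attribute model is assumed to return a tuple of attribute names, and all labels and probabilities in the usage example are invented.

def argmax(dist):
    """y* = arg max_{y in D(Y)} P(y|e) over a {label: probability} dictionary."""
    return max(dist, key=dist.get)

def recognise_user_act(model_dat, model_att, model_vals, evidence):
    """Algorithm 1 (sketch): most likely dialogue act type, attributes and slot values."""
    dat = argmax(model_dat(evidence))            # dialogue act type via lambda_dat
    attributes = argmax(model_att(evidence))     # attribute set via lambda_att
    pairs = []
    for attribute in attributes:                 # one value model lambda_val(i) per slot
        value = argmax(model_vals[attribute](evidence))
        pairs.append(f"{attribute}={value}")
    return f"{dat}({', '.join(pairs)})"

# Toy usage with hand-written posteriors (numbers and labels are invented):
model_dat = lambda e: {"inform": 0.7, "affirm": 0.2, "silence": 0.1}
model_att = lambda e: {("to",): 0.8, (): 0.2}
model_vals = {"to": lambda e: {"pittsburgh downtown": 0.9, "cmu": 0.1}}
evidence = {"heardRequest": 1, "hasTo": 1}
print(recognise_user_act(model_dat, model_att, model_vals, evidence))
# -> inform(to=pittsburgh downtown)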
For incremental simulators, it is important to model when the user speaks, assuming that system dialogue acts are received incrementally. For example, should the user speak after the first, second, ..., or last system dialogue act? This event
can be modelled probabilistically as
ut* = arg max_{ut ∈ {0,1}} P(ut|e; λ_ut),

where ut is a binary value and λ_ut in our case is an additional statistical model (Bayes net) in our set of classifiers λ. If the result ut* is true then the main body of Algorithm 2
is invoked, otherwise a null event is returned. Our approach
queries (with observed features) the set of Bayes nets incrementally after each partial system dialogue act.
In the rest of the paper, we analyse a corpus of dialogues
using Algorithms 1 and 2 for dialogue act recognition and
simulation. The simulations below focus on when the user
speaks and what they say. For goal-directed user dialogue
acts, the slot values can be derived from the user goal g with probability ε and from models λ_val(i) with probability 1−ε.
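Algorithm 2 (shown below) follows the same pattern but samples from the posteriors instead of taking the arg max, and first decides whether the user speaks at all after the current partial system dialogue act. The following sketch illustrates this under the same simplifications as before; the models are stand-in functions returning posteriors, turn-taking labels are assumed to be 0/1, and the value of ε is an arbitrary choice for illustration.

import random

def sample(dist):
    """Draw one label from a {label: probability} dictionary."""
    labels = list(dist)
    return random.choices(labels, weights=[dist[l] for l in labels], k=1)[0]

def simulate_user_act(model_ut, model_dat, model_att, model_vals,
                      evidence, user_goal, epsilon=0.8):
    """Algorithm 2 (sketch): decide whether the user speaks now and, if so, what they say."""
    # ut* = arg max_{ut in {0,1}} P(ut|e; lambda_ut): does the user take the turn here?
    ut_dist = model_ut(evidence)
    if max(ut_dist, key=ut_dist.get) != 1:
        return None                               # user stays silent after this partial act
    dat = sample(model_dat(evidence))             # sample a dialogue act type
    pairs = []
    for attribute in sample(model_att(evidence)): # sample a (possibly empty) attribute set
        if attribute in user_goal and random.random() < epsilon:
            value = user_goal[attribute]          # goal-directed value with probability epsilon
        else:
            value = sample(model_vals[attribute](evidence))  # else sample from lambda_val(i)
        pairs.append(f"{attribute}={value}")
    return f"{dat}({', '.join(pairs)})"

# The function would be called once after every partial system dialogue act,
# with the evidence dictionary updated incrementally as the system turn unfolds.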
4. EXPERIMENTS AND RESULTS
4.1. Data
Our experiments are based on the Let’s Go corpus [40]. Let’s
Go contains recorded interactions between a spoken dialogue
system and human users who make enquiries about the bus
schedule in Pittsburgh. Dialogues are system-initiative and
query the user sequentially for five slots: an optional bus
route, a departure place, a destination, a desired travel date,
and a desired travel time. Each slot needs to be explicitly (or
implicitly) confirmed by the user. Our analyses are based on
a subset of this data set containing 779 dialogues with 7275
turns, collected in Summer of 2010. From these dialogues, we
used 70% for training our classifiers and the rest for testing
(with 50 random splits). Brief statistics of this data set are as
follows. Table 1 shows all dialogue act types that occur in the data together with their frequency of occurrence; system dialogue acts are shown on top and user dialogue acts at the bottom.
Algorithm 2 Statistical simulator of user dialogue acts
 1: function DialogActSimulation(StatisticalModels λ, Evidence e)
 2:   λ_ut ← statistical model to predict when the user takes the turn (UT)
 3:   λ_dat ← statistical model for predicting dialogue act types (DAT)
 4:   λ_att ← statistical model for predicting attributes (ATT)
 5:   λ_val(i) ← statistical models for predicting attribute values (VAL)
 6:   y ← output (class label) of the statistical classifier in turn
 7:   ut = arg max_{y ∈ D(UT)} P(y|e; λ_ut)
 8:   if ut is true then
 9:     dat ← sample from P(Y|e; λ_dat)
10:     att ← sample from P(Y|e; λ_att)
11:     pairs ← []
12:     for each attribute i in att do
          val ← value from user goal g(i) with probability ε,
                or a sample from P(Y|e; λ_val(i)) with probability 1−ε
13:       pairs ← APPEND(attribute i = val)
14:     end for
15:     return dat(pairs)
16:   end if
17:   return null
18: end function
Table 2 shows the main attribute types for the dialogue
acts again paired with their frequency of use by system and
user. Notice that the combination of all possible dialogue acts,
attributes and values leads to a large number of triplets. While
a whole dialogue act is represented as a sequence of tuples
<dialogue act(attribute=value pairs)>, a partial dialogue act
is represented as <dialogue act(attribute=value pair)>.
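As a small illustration of this representation (the act type and slot values below are invented, not taken from the corpus), a whole dialogue act with two slots corresponds to two partial dialogue acts:

# Whole dialogue act: <dialogue act(attribute=value pairs)>
whole_act = ("expl-conf", {"from": "forbes avenue", "to": "pittsburgh downtown"})

# Partial dialogue acts: <dialogue act(attribute=value pair)>, emitted one at a time
partial_acts = [(whole_act[0], {slot: value}) for slot, value in whole_act[1].items()]
# -> [('expl-conf', {'from': 'forbes avenue'}), ('expl-conf', {'to': 'pittsburgh downtown'})]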
4.2. Statistical Classifiers
We trained our statistical classifiers in a supervised learning
manner, and used 43 discrete features plus a class label (also
discrete), see Table 3. The feature set is described by three
main subsets: 24 system-utterance-level binary features derived from the system dialogue act(s) in the last turn; 16 user-utterance-level binary features derived from (a) what the user
heard prior to the current turn, or (b) what keywords the system recognised in its list of speech recognition hypotheses;
and 4 non-binary features corresponding to the last system
dialogue act type, duration in seconds, previous and current
label. Predicted labels are restricted to those that occur in the
N-best parsing hypotheses from the Let’s Go data and an additional dialogue act “silence”. See [17] for details.
Figure 1 shows the Bayesian network corresponding to the classifier that predicts when the user speaks, queried incrementally after each partial system dialogue act. The structures of our Bayesian classifiers were derived from the K2 algorithm (http://www.cs.waikato.ac.nz/ml/weka/), their parameters were estimated by maximum likelihood, and probabilistic inference was performed with the junction tree algorithm (http://www.cs.cmu.edu/~javabayes/Home/). We trained a set of 14 Bayesian classifiers to predict (1) when the user speaks, (2) the dialogue act type, (3) the attributes (also called 'slots'), and (4) the slot values. The advantage of using multiple Bayes Nets over just one is that a multiple classifier system is a powerful solution to complex classification problems involving a large set of inputs and outputs. This approach not only decreases training time but has also been shown to increase the performance of classification [41].
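For readers who want to assemble a comparable pipeline, the sketch below shows one possible setup in Python with pgmpy. It is not the authors' setup (the URLs above point to Weka and JavaBayes), hill climbing with the K2 score only approximates the ordered K2 search, and the column name userSpeaks and the evidence values are placeholders.

import pandas as pd
from pgmpy.estimators import HillClimbSearch, K2Score, MaximumLikelihoodEstimator
from pgmpy.models import BayesianNetwork
from pgmpy.inference import BeliefPropagation

def train_bayes_net_classifier(df: pd.DataFrame):
    """df: one row per (partial) system dialogue act, with discrete feature columns
    in the spirit of Table 3 plus the class column (assumed here to be 'userSpeaks')."""
    # Structure learning: greedy search scored with the K2 metric
    # (an approximation of the ordered K2 algorithm used in the paper).
    dag = HillClimbSearch(df).estimate(scoring_method=K2Score(df))
    model = BayesianNetwork(dag.edges())
    model.add_nodes_from(df.columns)          # keep features the search left unconnected
    # Parameter learning: maximum likelihood estimation, as in the paper.
    model.fit(df, estimator=MaximumLikelihoodEstimator)
    # Inference: belief propagation over a junction (clique) tree.
    return BeliefPropagation(model)

# Hypothetical usage, with placeholder feature names and values:
# engine = train_bayes_net_classifier(df)
# posterior = engine.query(variables=["userSpeaks"],
#                          evidence={"heardRequest": 1, "hasFrom": 0, "duration": 3})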
Agent   Dialogue Act Type   Frequency (%)
Sys     ack                  0.52
Sys     canthelp             1.45
Sys     example             59.04
Sys     expl-conf            5.23
Sys     goback               0.37
Sys     hello                1.90
Sys     impl-conf            8.71
Sys     morebuses            0.46
Sys     request             11.37
Sys     restart              0.25
Sys     schedule             3.44
Sys     sorry                7.24
Usr     affirm              14.10
Usr     bye                  1.35
Usr     goback               2.36
Usr     inform              41.68
Usr     negate               5.02
Usr     nextbus             11.00
Usr     prevbus              1.63
Usr     repeat               3.83
Usr     restart              1.44
Usr     silence             17.36
Usr     tellchoices          0.22

Table 1. Frequencies of dialogue act types in our data set.

Attribute (slot)    System Freq. (%)   User Freq. (%)
date.absday              0.50               0.38
date.absmonth            0.50               0.38
date.day                 1.62               4.71
date.relweek             0.41               0.0
from                    26.26              24.44
route                   36.70              33.36
time.ampm                1.73               2.45
time.arriveleave         1.67               2.91
time.hour                2.19               3.60
time.minute              2.19               3.60
time.rel                 2.80               0.31
to                      23.41              23.86

Table 2. Frequencies of system and user slots in our data set.

Type     Features (b = binary, nb = non-binary)
System   heardAck^b, heardCantHelp^b, heardExample^b, heardExplConf^b, heardGoBackDAT^b, heardHello^b, heardImplConf^b, heardMoreBuses^b, heardRequest^b, heardRestartDAT^b, heardSchedule^b, heardSorry^b, heardDate^b, heardFrom^b, heardRoute^b, heardTime^b, heardTo^b, heardNext^b, heardPrevious^b, heardGoBack^b, heardChoices^b, heardRestart^b, heardRepeat^b, heardDontKnow^b, lastSystemDialActType^nb, duration^nb (in seconds: 0,1,2,3,4,>5), currentLabel (e.g. userSpeaks^b, dialActType^nb, slot_i^nb), prevLabel
User     hasRoute^b, hasFrom^b, hasTo^b, hasDate^b, hasTime^b, hasYes^b, hasNo^b, hasNext^b, hasPrevious^b, hasGoBack^b, hasChoices^b, hasRestart^b, hasRepeat^b, hasDontKnow^b, hasBye^b, hasNothing^b

Table 3. Features for dialogue act recognition & simulation.

Fig. 1. Bayesian network for predicting when the user speaks. [Network diagram not reproduced.]
4.3. Evaluation Metrics
The accuracy of dialogue act recognition is computed as the proportion of correct classifications among all classifications. The comparison is made against labelled gold standard data from human annotations.

We compute the quality of simulations with the Kullback-Leibler (KL) divergence [42], which measures how closely the distribution of a target data set matches that of a gold standard data set.
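For concreteness, minimal implementations of both metrics are sketched below. The KL divergence here is computed over relative frequencies of dialogue-act labels with a small smoothing constant of our own choosing; the paper compares distributions over tuples of dialogue act types and attributes.

import math
from collections import Counter

def accuracy(predicted, gold):
    """Proportion of correctly classified dialogue acts against the annotated gold standard."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def kl_divergence(gold_acts, simulated_acts, smoothing=1e-6):
    """KL divergence between the label distributions of a gold and a target data set.
    Additive smoothing avoids log(0) for labels unseen in one of the sets."""
    labels = set(gold_acts) | set(simulated_acts)
    p_counts, q_counts = Counter(gold_acts), Counter(simulated_acts)
    p_total = len(gold_acts) + smoothing * len(labels)
    q_total = len(simulated_acts) + smoothing * len(labels)
    kl = 0.0
    for label in labels:
        p = (p_counts[label] + smoothing) / p_total
        q = (q_counts[label] + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl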
4.4. Experimental Results in Dialogue Act Recognition

Our dialogue act recognition results compared 4 different
recognisers with and without barge-in events: (a) Semi-Random: a recogniser choosing a random dialogue act from
the Let’s Go N-best parsing hypotheses; (b) LetsGo: a recogniser choosing the most likely dialogue act from the Let’s
Go N-best parsing hypotheses; (c) Bayes Nets: a Bayesian
recogniser using Algorithm 1; and (d) Ceiling: a recogniser
choosing the correct dialogue act from the Let’s Go N-best
parsing hypotheses. The latter was used as a gold standard
from manual annotations, which reflects the proportion of
correct labels in the N-best parsing hypotheses. Figure 2
shows the dialogue act recognition results in this order, which
can be described as follows. First, we can observe that recognition accuracy in dialogue turns without user barge-in events
consistently performs better than its counterpart with barge-ins (significant at p<0.003, based on a two-tailed Wilcoxon signed-rank test). Second, it can be noted that
the Let’s Go baseline is substantially outperformed by the
Bayesian recogniser (also significant at p<0.004).

Fig. 2. Dialogue act recognition results with(out) barge-in. [Bar chart: classification accuracy (%) of the Semi-Random, LetsGo, BayesNets and Ceiling recognisers, each for turns with barge-in, turns without barge-in, and all turns.]

Fig. 3. Bayesian dialogue act recognition with partial and full context, based on turns with only barge-in. The bar with partial context is the 7th bar in Figure 2 from left to right. [Bar chart: classification accuracy (%) for partial vs. full context.]

This is
partially due to the fact that the Let’s Go system always tries
to recognise a dialogue act even when there was only noise
in the environment, which causes the ASR to produce incorrect recognition hypotheses. Our Bayesian classifiers model
probability distributions over dialogue acts that are closer to a
human gold standard than several baselines. Third, we compared the performance of Bayesian dialogue act recognition
of turns with barge-in based on partial and full context (see
Figure 3). Whilst partial context considers evidence until the
barge-in point, full context considers evidence based on all
system dialogue acts within a turn. This comparison revealed
that recognition using partial context outperforms its counterpart
with full context by 4.6% (significant at p<0.0024). This
suggests that the degradation in recognition of turns with
barge-in can be mitigated with incremental processing, i.e.
context from partial dialogue acts. These results are relevant
for spoken dialogue systems because they suggest how to
achieve more efficient interactions: in our data set the average duration of system turns with barge-in events is 8.6
seconds, and the average duration of system turns without
barge-in events is 10 seconds, in favour of incremental processing. (Since our data only had durations per system turn rather than per partial dialogue act, we estimated these durations with a linear regressor over TTS durations; the predictor variables were the numerical ID of the dialogue act type, the number of slots, the number of words, and the number of characters.) This could potentially improve the user experience
by making spoken dialogue systems more accurate in the face
of user barge-ins, leading to more timely relevant responses.
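As an aside, per-partial-act durations of the kind used above can be estimated with an ordinary linear regressor; the sketch below uses scikit-learn with the four predictor variables mentioned in the note, and all numbers are invented placeholders rather than values from the Let's Go logs.

import numpy as np
from sklearn.linear_model import LinearRegression

# Features per partial dialogue act: numerical ID of the dialogue act type,
# number of slots, number of words, number of characters (placeholder values).
X_train = np.array([
    [2, 1, 6, 31],   # e.g. expl-conf with one slot
    [8, 2, 12, 70],  # e.g. request with two slots
    [5, 0, 4, 22],
])
y_train = np.array([1.9, 3.8, 1.2])  # TTS durations in seconds (placeholders)

duration_model = LinearRegression().fit(X_train, y_train)
estimated = duration_model.predict(np.array([[2, 1, 7, 35]]))[0]
print(f"estimated partial-act duration: {estimated:.1f} s")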
4.5. Experimental Results in Dialogue Act Simulation
Our simulation results compared Let’s Go system-user dialogue act tuples (from correct labels) against simulated dialogues with and without barge-in events. The latter, a typical
type of simulations in the literature, represents our baseline.
We consider tuples of dialogue act type and attribute names
(but without attribute values to avoid the data sparsity problem). While tuples without barge-in considered evidence until
the very last system dialogue act in the turn, tuples with barge-in considered evidence until a barge-in. The exact point of a
barge-in over a system dialogue act is not logged in the data.
Barge-ins (11.5% in our data) were extracted from the data
based on the overlap time between system and user turns. The
partial system dialogue act with the overlap is marked as the
point of barge-in. These results are shown in Table 4. It can be
noted that the simulated dialogues with barge-in events (compared with real Let’s Go dialogues) obtain lower divergences
than their counterparts without barge-in events. This result suggests that dialogue simulators should incorporate user barge-ins based on partial system dialogue acts rather than complete
ones to achieve more realistic simulated interactions.
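To illustrate how barge-ins can be identified from turn overlaps, here is a hedged sketch: per-partial-act timings are assumed to be available (in our case estimated, as described earlier), the act labels and timings are invented, and the function simply returns the index of the partial system dialogue act that overlaps the start of the user turn.

def mark_barge_in(system_partial_acts, user_turn_start):
    """Return the index of the partial system dialogue act overlapping the user turn,
    or None if the user only speaks after the system turn has finished.
    system_partial_acts: list of (act_label, start_time, end_time) in seconds."""
    turn_end = system_partial_acts[-1][2]
    if user_turn_start >= turn_end:
        return None                      # no overlap: not a barge-in turn
    for index, (act, start, end) in enumerate(system_partial_acts):
        if start <= user_turn_start < end:
            return index                 # partial act overlapping the user turn = barge-in point
    return 0

# Example with invented timings: the user starts speaking during the second partial act.
acts = [("expl-conf(from=X)", 0.0, 1.8), ("expl-conf(to=Y)", 1.8, 3.5)]
print(mark_barge_in(acts, user_turn_start=2.4))   # -> 1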
Classifier (simulator)   Turns               Divergence   p-value
Bayesian Networks        with barge-in        5.1731      <0.0074
                         without barge-in     5.4123

Table 4. KL divergences (the smaller the better) between dialogue turns with and without barge-in events.
5. CONCLUSION AND FUTURE WORK
We have presented an approach to incremental user dialogue
act recognition and simulation which treats both problems
as interleaved processes within the same probabilistic model.
Multiple classifiers are used to (a) predict dialogue acts from
dialogue history features, and (b) predict when the user should
speak after each partial system dialogue act. Applying our
approach to the Let’s Go data we found the following. First,
we found an improvement in classification accuracy (+5%) in
Bayesian dialogue act recognition involving barge-ins using
partial context compared to using full context. Second, dialogue simulations that take user barge-in events into account produce more realistic interactions than their counterparts without barge-in events. This should be a feature of dialogue simulators used for training future dialogue systems.
Future work includes a comparison of our Bayesian classifiers with other statistical models and forms of training (for
example by using semi-supervised learning) [43], and investigating the effects of barge-in on dialogue act recognisers and
simulators in different (multi-modal) domains [44, 45].
6. REFERENCES
[1] Victor Zue and James Glass, “Conversational interfaces: Advances and challenges,” vol. 88, no. 8, pp. 1166–1180, 2000.
[2] Antoine Raux, Flexible Turn-Taking for Spoken Dialog Systems, Ph.D. thesis, School of Computer Science, Carnegie Mellon University, 2008.
[3] Gabriel Skantze and David Schlangen, “Incremental Dialogue Processing in a Micro-Domain,” in EACL, Athens, Greece, 2009.
[4] Volha Petukhova and Harry Bunt, “Incremental Dialogue Act Understanding,” in IWCS, Oxford, UK, 2011.
[5] Ethan Selfridge and Peter Heeman, “A Temporal Simulator for Developing Turn-Taking Methods for Spoken Dialogue Systems,” in SIGDIAL, Seoul, South Korea, 2012.
[6] Gregory Aist, James Allen, Ellen Campana, Carlos Gomez Gallo, Scott Stoness, Mary Swift, and Michael Tanenhaus, “Incremental Understanding in Human-Computer Dialogue and Experimental Evidence for Advantages over Nonincremental Methods,” in DECOLOG, Trento, Italy, 2007.
[7] David DeVault, Kenji Sagae, and David Traum, “Can I Finish? Learning When to Respond to Incremental Interpretation Results in Interactive Dialogue,” in SIGDIAL, London, UK, 2009.
[8] Andreas Peldszus, Okko Buss, Timo Baumann, and David Schlangen, “Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Understanding,” in EACL, Avignon, France, 2012.
[9] Casey Kennington and David Schlangen, “Markov Logic Networks for Situated Incremental Natural Language Understanding,” in SIGDIAL, Seoul, South Korea, 2012.
[10] Nina Dethlefs, Helen Hastie, Verena Rieser, and Oliver Lemon, “Optimising Incremental Generation for Spoken Dialogue Systems: Reducing the Need for Fillers,” in INLG, Chicago, IL, USA, 2012.
[11] Nina Dethlefs, Helen Hastie, Verena Rieser, and Oliver Lemon, “Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems,” in EMNLP-CoNLL, Jeju, South Korea, 2012.
[12] Ethan Selfridge, Iker Arizmendi, Peter Heeman, and Jason Williams, “Integrating Incremental Speech Recognition and POMDP-based Dialogue Systems,” in SIGDIAL, Seoul, South Korea, 2012.
[13] Matthias Zimmermann, Yang Liu, Elizabeth Shriberg, and Andreas Stolcke, “Toward Joint Segmentation and Classification of Dialog Acts in Multiparty Meetings,” in MLMI, 2005, pp. 187–193.
[14] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca A. Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer, “Dialog Act Modeling for Automatic Tagging and Recognition of Conversational Speech,” Comp. Linguistics, vol. 26, no. 3, pp. 339–373, 2000.
[15] Sergio Grau, Emilio Sanchis, Maria Jose Castro, and David Vilar, “Dialogue Act Classification Using a Bayesian Approach,” in SPECOM, 2004.
[16] Simon Keizer and Rieks op den Akker, “Dialogue Act Recognition Under Uncertainty Using Bayesian Networks,” Natural Language Engineering, vol. 13, no. 4, pp. 287–316, 2007.
[17] Heriberto Cuayáhuitl, Nina Dethlefs, Helen Hastie, and Oliver Lemon, “Impact of ASR N-Best Information on Bayesian Dialogue Act Recognition,” in SIGDIAL, 2013.
[18] Dinoj Surendran and Gina-Anne Levow, “Dialog Act Tagging with Support Vector Machines and Hidden Markov Models,” in INTERSPEECH, 2006.
[19] Björn Gambäck, Fredrik Olsson, and Oscar Täckström, “Active Learning for Dialogue Act Classification,” in INTERSPEECH, 2011, pp. 1329–1332.
[20] Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Shrikanth Narayanan, “Combining Lexical, Syntactic and Prosodic Cues for Improved Online Dialog Act Tagging,” Computer Speech & Language, vol. 23, no. 4, pp. 407–422, 2009.
[21] Keyan Zhou and Chengqing Zong, “Dialog-Act Recognition Using Discourse and Sentence Structure Information,” in IALP, 2009, pp. 11–16.
[22] Tina Klüwer, Hans Uszkoreit, and Feiyu Xu, “Using Syntactic and Semantic based Relations for Dialogue Act Recognition,” in COLING, 2010, pp. 570–578.
[23] Kristy Elizabeth Boyer, Joseph F. Grafsgaard, Eunyoung Ha, Robert Phillips, and James C. Lester, “An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue,” in ACL, 2011, pp. 1190–1199.
[24] Congkai Sun and Louis-Philippe Morency, “Dialogue Act Recognition Using Reweighted Speaker Adaptation,” in SIGDIAL, 2012, pp. 118–125.
[25] Olivier Pietquin and Thierry Dutoit, “A Probabilistic Framework for Dialog Simulation and Optimal Strategy Learning,” IEEE Transactions on Audio, Speech & Language Processing, vol. 14, no. 2, pp. 589–599, 2006.
[26] Stéphane Rossignol, Olivier Pietquin, and Michel Ianotto, “Training a BN-Based User Model for Dialogue Simulation with Missing Data,” in IJCNLP, Chiang Mai, Thailand, November 2011, pp. 598–604.
[27] Sungjin Lee and Maxine Eskenazi, “An Unsupervised Approach to User Simulation: Toward Self-Improving Dialog Systems,” in SIGDIAL, 2012, pp. 50–59.
[28] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira, “Human-Computer Dialogue Simulation Using Hidden Markov Models,” in ASRU, San Juan, Puerto Rico, Dec 2005, pp. 290–295.
[29] Sangkeun Jung, Cheongjae Lee, Kyungduk Kim, Minwoo Jeong, and Gary Geunbae Lee, “Data-Driven User Simulation for Automated Evaluation of Spoken Dialog Systems,” Computer Speech & Lang., vol. 23, no. 4, pp. 479–509, 2009.
[30] Jost Schatzmann and Steve Young, “The Hidden Agenda User Simulation Model,” IEEE Transactions on Speech, Audio and Language Processing, vol. 17, no. 4, pp. 733–747, 2009.
[31] Simon Keizer, Milica Gasic, Filip Jurčíček, François Mairesse, Blaise Thomson, Kai Yu, and Steve Young, “Parameter Estimation for Agenda-Based User Simulation,” in SIGDIAL, 2010, pp. 116–123.
[32] Kallirroi Georgila, James Henderson, and Oliver Lemon, “User Simulation for Spoken Dialogue Systems: Learning and Evaluation,” in INTERSPEECH, Pittsburgh, PA, USA, Sep 2006, pp. 267–659.
[33] V. Rieser and O. Lemon, “Cluster-Based User Simulations for Learning Dialogue Strategies,” in INTERSPEECH, Pittsburgh, PA, USA, Sep 2006, pp. 1766–1769.
[34] R. López-Cózar, Z. Callejas, and M. McTear, “Testing the Performance of Spoken Dialogue Systems by Means of an Artificial User,” Artificial Intelligence Review, vol. 26, no. 4, pp. 291–323, 2008.
[35] F. Torres, E. Sanchis, and E. Segarra, “User Simulation in a Stochastic Dialog System,” Computer Speech and Language, vol. 22, no. 3, pp. 230–255, 2008.
[36] Senthilkumar Chandramohan, Matthieu Geist, Fabrice Lefèvre, and Olivier Pietquin, “User Simulation in Dialogue Systems Using Inverse Reinforcement Learning,” in INTERSPEECH, 2011, pp. 1025–1028.
[37] J. Schatzmann, K. Weilhammer, M. Stuttle, and S. Young, “A Survey on Statistical User Simulation Techniques for Reinforcement Learning of Dialogue Management Strategies,” Knowledge Eng. Review, vol. 21, no. 2, pp. 97–126, 2006.
[38] Hua Ai and Diane J. Litman, “Comparing User Simulations for Dialogue Strategy Learning,” TSLP, vol. 7, no. 3, pp. 9, 2011.
[39] Olivier Pietquin and Helen Hastie, “A Survey on Metrics for the Evaluation of User Simulations,” Knowledge Engineering Review, vol. 28, no. 01, pp. 59–73, 2013.
[40] Antoine Raux, Brian Langner, Dan Bohus, Alan W. Black, and Maxine Eskenazi, “Let’s go public! Taking a Spoken Dialog System to the Real World,” in INTERSPEECH, 2005, pp. 885–888.
[41] David M. Tax, Martijn van Breukelen, Robert P. Duin, and Josef Kittler, “Combining multiple classifiers by averaging or by multiplying?,” Pattern Recognition, vol. 33, no. 9, pp. 1475–1485, Sept. 2000.
[42] Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira, “Evaluation of a hierarchical reinforcement learning spoken dialogue system,” Computer Speech and Language, vol. 24, no. 2, pp. 395–429, 2010.
[43] Heriberto Cuayáhuitl, Martijn van Otterlo, Nina Dethlefs, and Lutz Frommberger, “Machine learning for interactive systems and robots: a brief introduction,” in MLIS, 2013, pp. 19–28.
[44] Heriberto Cuayáhuitl and Nina Dethlefs, “Optimizing situated dialogue management in unknown environments,” in INTERSPEECH, 2011, pp. 1009–1012.
[45] Heriberto Cuayáhuitl and Ivana Kruijff-Korbayová, “An interactive humanoid robot exhibiting flexible sub-dialogues,” in HLT-NAACL, 2012, pp. 17–20.