TOD-BERT: Pre-Trained Natural Language Understanding For Task-Oriented Dialogue



Chien-Sheng Wu, Steven Hoi, Richard Socher, and Caiming Xiong


Salesforce Research
[cswu, shoi, rsocher, cxiong]@salesforce.com

arXiv:2004.06871v3 [cs.CL] 1 Oct 2020

Abstract

The underlying difference of linguistic patterns between general text and task-oriented dialogue makes existing pre-trained language models less useful in practice. In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling. To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling. We propose a contrastive objective function to simulate the response selection task. Our pre-trained task-oriented dialogue BERT (TOD-BERT) outperforms strong baselines like BERT on four downstream task-oriented dialogue applications, including intention recognition, dialogue state tracking, dialogue act prediction, and response selection. We also show that TOD-BERT has a stronger few-shot ability that can mitigate the data scarcity problem for task-oriented dialogue.

1 Introduction

Pre-trained models with self-attention encoder architectures (Devlin et al., 2018; Liu et al., 2019) have been commonly used in many NLP applications. Such models are self-supervised based on a massive scale of general text corpora, such as English Wikipedia or books (Zhu et al., 2015). By further fine-tuning these representations, breakthroughs have been continuously reported for various downstream tasks, especially natural language understanding.

However, previous work (Rashkin et al., 2018; Wolf et al., 2019) shows that there are some deficiencies in the performance to apply fine-tuning on conversational corpora directly. One possible reason could be the intrinsic difference of linguistic patterns between human conversations and writing text, resulting in a large gap of data distributions (Bao et al., 2019). Therefore, pre-training dialogue language models using chit-chat corpora from social media, such as Twitter or Reddit, has been recently investigated, especially for dialogue response generation (Zhang et al., 2019) and retrieval (Henderson et al., 2019b). Although these open-domain dialogues are diverse and easy-to-get, they are usually short, noisy, and without specific chatting goals.

On the other hand, a task-oriented dialogue has explicit goals (e.g. restaurant reservation or ticket booking) and many conversational interactions. But each dataset is usually small and scattered because obtaining and labeling such data is time-consuming. Moreover, a task-oriented dialogue has explicit user and system behaviors where a user has his/her goal, and a system has its belief and database information, which makes the language understanding component and dialogue policy learning more important than those chit-chat scenarios.

This paper aims to prove this hypothesis: self-supervised language model pre-training using task-oriented corpora can learn better representations than existing pre-trained models for task-oriented downstream tasks. We emphasize that what we care about the most is not whether our pre-trained model can achieve state-of-the-art results on each downstream task since most of the current best models are built on top of pre-trained models, and ours can easily replace them. We avoid adding too many additional components on top of the pre-training architecture when fine-tuning in our experiments.

We collect and combine nine human-human and multi-turn task-oriented dialogue corpora to train a task-oriented dialogue BERT (TOD-BERT). In total, there are around 100k dialogues with 1.4M utterances across over 60 different domains. Like BERT (Devlin et al., 2018), TOD-BERT is formulated as a masked language model and uses a deep bidirectional Transformer (Vaswani et al., 2017)
encoder as its model architecture. Unlike BERT, TOD-BERT incorporates two special tokens for user and system to model the corresponding dialogue behavior. A contrastive objective function of response selection task is combined during pre-training stage to capture response similarity. We select BERT because it is the most widely used model in NLP research recently, and our unified datasets can be easily applied to pre-train any existing language models.

We test TOD-BERT on task-oriented dialogue systems on four core downstream tasks, including intention recognition, dialogue state tracking, dialogue act prediction, and response selection. What we observe is: TOD-BERT outperforms BERT and other strong baselines such as GPT-2 (Radford et al., 2019) and DialoGPT (Zhang et al., 2019) on all the selected downstream tasks, which further confirms its effectiveness for improving dialogue language understanding. We find that response contrastive learning is beneficial, but it is currently overlooked and not well investigated in dialogue pre-training research. More importantly, TOD-BERT has a stronger few-shot ability than BERT on each task, suggesting that it can reduce the need for expensive human-annotated labels. TOD-BERT can be easily leveraged and adapted to a new task-oriented dialogue dataset. Our source code and data processing are released to facilitate future research on pre-training and fine-tuning of task-oriented dialogue.¹

2 Related Work

General Pre-trained Language Models, which are trained on massive general text such as Wikipedia and BookCorpus, can be roughly divided into two categories: uni-directional or bi-directional attention mechanisms. GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) are representatives of uni-directional language models using a Transformer decoder, where the objective is to maximize left-to-right generation likelihood. These models are commonly applied in natural language generation tasks. On the other hand, BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and their variants are pre-trained using a Transformer encoder with bi-directional token prediction. These models are usually evaluated on classification tasks such as GLUE benchmark (Wang et al., 2018) or span-based question answering tasks (Rajpurkar et al., 2016).

Some language models can support both uni-directional and bi-directional attention, such as UniLM (Dong et al., 2019). Conditional language model pre-training is also proposed. For example, CTRL (Keskar et al., 2019) is a conditional Transformer model, trained to condition on control codes that govern style, content, and task-specific behavior. Recently, multi-task language model pre-training with unified sequence-to-sequence generation is proposed. Text-to-text Transformer (T5) (Raffel et al., 2019) unifies multiple text modeling tasks and achieves promising results in various NLP benchmarks.

Dialogue Pre-trained Language Models are mostly trained on open-domain conversational data from Reddit or Twitter for dialogue response generation. TransferTransfo (Wolf et al., 2019) achieves good performance on ConvAI-2 dialogue competition using GPT-2. DialoGPT (Zhang et al., 2019) is an extension of GPT-2 that is pre-trained on Reddit data for open-domain response generation. ConveRT (Henderson et al., 2019a) pre-trained a dual transformer encoder for response selection task on large-scale Reddit (input, response) pairs. PLATO (Bao et al., 2019) uses both Twitter and Reddit data to pre-train a dialogue generation model with discrete latent variables. All of them are designed to cope with the response generation task for open-domain chatbots.

Pretraining for task-oriented dialogues, on the other hand, has few related works. Budzianowski and Vulić (2019) first apply the GPT-2 model to train on response generation task, which takes system belief, database result, and last dialogue turn as input to predict next system responses. It only uses one dataset to train its model because few public datasets have database information available. Henderson et al. (2019b) pre-trained a response selection model for task-oriented dialogues. They first pre-train on Reddit corpora and then fine-tune on target dialogue domains, but their training and fine-tuning code is not released. Peng et al. (2020) focus on the natural language generation (NLG) task, which assumes dialogue acts and slot-tagging results are given to generate a natural language response. Pre-training on a set of annotated NLG corpora can improve conditional generation quality using a GPT-2 model.

¹ github.com/jasonwu0731/ToD-BERT
Name                                # Dialogue   # Utterance   Avg. Turn   # Domain
MetaLWOZ (Lee et al., 2019)             37,884       432,036        11.4         47
Schema (Rastogi et al., 2019)           22,825       463,284        20.3         17
Taskmaster (Byrne et al., 2019)         13,215       303,066        22.9          6
MWOZ (Budzianowski et al., 2018)        10,420        71,410         6.9          7
MSR-E2E (Li et al., 2018)               10,087        74,686         7.4          3
SMD (Eric and Manning, 2017)             3,031        15,928         5.3          3
Frames (Asri et al., 2017)               1,369        19,986        14.6          3
WOZ (Mrkšić et al., 2016)                1,200         5,012         4.2          1
CamRest676 (Wen et al., 2016)              676         2,744         4.1          1

Table 1: Data statistics for task-oriented dialogue datasets.
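
As a quick consistency check, the size of the combined pre-training corpus can be re-derived by summing the rows of Table 1. The sketch below only copies the dialogue and utterance counts from the table; the names `TABLE_1` and `corpus_totals` are illustrative and do not come from the released code.

```python
# Rows of Table 1: dataset name -> (number of dialogues, number of utterances).
TABLE_1 = {
    "MetaLWOZ": (37_884, 432_036),
    "Schema": (22_825, 463_284),
    "Taskmaster": (13_215, 303_066),
    "MWOZ": (10_420, 71_410),
    "MSR-E2E": (10_087, 74_686),
    "SMD": (3_031, 15_928),
    "Frames": (1_369, 19_986),
    "WOZ": (1_200, 5_012),
    "CamRest676": (676, 2_744),
}


def corpus_totals(table):
    """Return (total dialogues, total utterances) summed over all rows."""
    total_dialogues = sum(dialogues for dialogues, _ in table.values())
    total_utterances = sum(utterances for _, utterances in table.values())
    return total_dialogues, total_utterances


if __name__ == "__main__":
    dialogues, utterances = corpus_totals(TABLE_1)
    print(f"{dialogues:,} dialogues, {utterances:,} utterances")
```

The sums come to 100,707 dialogues and 1,388,152 utterances, which agrees with the totals quoted in Section 3.1. The per-dataset domain counts are deliberately not summed here, since the text reports "over 60 domains" for the union rather than a row-wise total.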

3 Method

This section discusses each dataset used in our task-oriented pre-training and how we process the data. Then we introduce the selected pre-training base model and its objective functions.

3.1 Datasets

We collect nine different task-oriented datasets which are English, human-human and multi-turn. In total, there are 100,707 dialogues, which contain 1,388,152 utterances over 60 domains. Dataset statistics are shown in Table 1.

• MetaLWOZ (Lee et al., 2019): Meta-Learning Wizard-of-Oz is a dataset designed to help develop models capable of predicting user responses in unseen domains. This large dataset was created by crowdsourcing 37,884 goal-oriented dialogs, covering 227 tasks in 47 domains. The MetaLWOZ dataset is used as the fast adaptation task for DSTC8 (Kim et al., 2019) dialogue competition.

• Schema (Rastogi et al., 2019): Schema-guided dialogue has 22,825 dialogues and provides a challenging testbed for several tasks, in particular, dialogue state tracking. Each schema is a set of tracking slots, and each domain could have multiple possible schemas. This allows a single dialogue system to support many services and facilitates the simple integration of new services without requiring much training data. The Schema dataset is used as the dialogue state tracking task for DSTC8 (Kim et al., 2019) dialogue competition.

• Taskmaster (Byrne et al., 2019): This dataset includes 13,215 dialogues comprising six domains, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. One is a two-person Wizard of Oz approach in which one person acts like a robot, and the other is a self-dialogue approach in which crowdsourced workers wrote the entire dialog themselves. It has 22.9 average conversational turns in a single dialogue, which is the longest among all task-oriented datasets listed.

• MWOZ (Budzianowski et al., 2018): Multi-Domain Wizard-of-Oz dataset contains 10,420 dialogues over seven domains, and it has multiple domains in a single dialogue. It has a detailed description of the data collection procedure, user goal, system act, and dialogue state labels. Different from most of the existing corpora, it also provides full database information.

• MSR-E2E (Li et al., 2018): Microsoft end-to-end dialogue challenge has 10,087 dialogues in three domains, movie-ticket booking, restaurant reservation, and taxi booking. It also includes an experiment platform with built-in simulators in each domain.

• SMD (Eric and Manning, 2017): Stanford multi-domain dialogue is an in-car personal assistant dataset, comprising 3,031 dialogues and three domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. It is designed to smoothly interface with knowledge bases, where a knowledge snippet is attached with each dialogue as a piece of simplified database information.

• Frames (Asri et al., 2017): This dataset comprises 1,369 human-human dialogues with an average of 14.6 turns per dialogue, where users
are given some constraints to book a trip and assistants search a database to find appropriate trips. Unlike other datasets, it has labels to keep track of different semantic frames, which is the decision-making behavior of users throughout each dialogue.

• WOZ (Mrkšić et al., 2016) and CamRest676 (Wen et al., 2016): These two corpora use the same data collection procedure and same ontology from DSTC2 (Henderson et al., 2014). They are among the first task-oriented dialogue datasets that use Wizard of Oz style with text input instead of speech input, which improves the model's capacity for the semantic understanding instead of its robustness to automatic speech recognition errors.

3.2 TOD-BERT

We train our TOD-BERT based on BERT architecture using two loss functions: masked language modeling (MLM) loss and response contrastive loss (RCL). Note that the datasets we used can be used to pre-train any existing language model architecture, and here we select BERT because it is the most widely used model in NLP research. We use the BERT-base uncased model, which is a transformer self-attention encoder (Vaswani et al., 2017) with 12 layers and 12 attention heads with its hidden size $d_B = 768$.

To capture speaker information and the underlying interaction behavior in dialogue, we add two special tokens, [USR] and [SYS], to the byte-pair embeddings (Mrkšić et al., 2016). We prefix the special token to each user utterance and system response, and concatenate all the utterances in the same dialogue into one flat sequence, as shown in Figure 1. For example, for a dialogue $D = \{S_1, U_1, \ldots, S_n, U_n\}$, where $n$ is the number of dialogue turns and each $S_i$ or $U_i$ contains a sequence of words, the input of the pre-training model is processed as “[SYS] $S_1$ [USR] $U_1$ ...” with standard positional embeddings and segmentation embeddings.

Masked language modeling is a common pre-training strategy for BERT-like architectures, in which a random sample of tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM loss function is the cross-entropy loss on predicting the masked tokens. In the original implementation, random masking and replacement are performed once in the beginning and saved for the training duration. Here we conduct token masking dynamically during batch training. TOD-BERT is initialized from BERT, a good starting parameter set, then is further pre-trained on those task-oriented corpora. The MLM loss function is

$$L_{mlm} = -\sum_{m=1}^{M} \log P(x_m), \qquad (1)$$

where $M$ is the total number of masked tokens and $P(x_m)$ is the predicted probability of the token $x_m$ over the vocabulary size.

Response contrastive loss can also be used for dialogue language modeling since it does not require any additional human annotation. Pre-training with RCL can bring us several advantages: 1) we can learn a better representation for the [CLS] token, as it is essential for all the downstream tasks, and 2) we encourage the model to capture underlying dialogue sequential order, structure information, and response similarity.

Unlike the original next sentence prediction (NSP) objective in BERT pre-training, which concatenates two segments A and B to predict whether they are consecutive text with binary classification, we apply a dual-encoder approach (Henderson et al., 2019a) and simulate multiple negative samples. We first draw a batch of dialogues $\{D_1, \ldots, D_b\}$ and split each dialogue at a randomly selected turn $t$. For example, $D_1$ will be separated into two segments, one is the context $\{S^1_1, U^1_1, \ldots, S^1_t, U^1_t\}$ and the other is the response $\{S^1_{t+1}\}$. We use TOD-BERT to encode all the contexts and their corresponding responses separately.

Afterwards, we have a context matrix $C \in \mathbb{R}^{b \times d_B}$ and a response matrix $R \in \mathbb{R}^{b \times d_B}$ by taking the output [CLS] representations from the $b$ dialogues. We treat other responses in the same batch as randomly selected negative samples. The RCL objective function is

$$L_{rcl} = -\sum_{i=1}^{b} \log M_{i,i}, \qquad M = \mathrm{Softmax}(C R^{T}) \in \mathbb{R}^{b \times b}. \qquad (2)$$

Increasing batch size to a certain amount can obtain better performance on downstream tasks, especially for the response selection. The Softmax function normalizes the vector per row. In our setting, increasing batch size also means changing the positive and negative ratio in the contrastive learning. Batch size is a hyper-parameter that may be
limited by hardware. We also try different negative sampling strategies during pre-training such as local sampling (Saeidi et al., 2017), but do not observe significant change compared to random sampling.

Figure 1: Dialogue pre-training based on Transformer encoder with user and system special tokens. Two objective functions are used: masked language modeling and response contrastive learning.

Overall pre-training loss function is the weighted-sum of $L_{mlm}$ and $L_{rcl}$, and in our experiments, we simply sum them up. We gradually reduce the learning rate without a warm-up period. We train TOD-BERT with AdamW (Loshchilov and Hutter, 2017) optimizer with a dropout ratio of 0.1 on all layers and attention weights. GELU activation functions (Hendrycks and Gimpel, 2016) are used. Models are early-stopped using perplexity scores of a held-out development set, with mini-batches containing 32 sequences of maximum length 512 tokens. Experiments are conducted on two NVIDIA Tesla V100 GPUs.

4 Downstream Tasks

What we care about most in this paper is whether TOD-BERT, a pre-trained language model using aggregated task-oriented corpora, can show any advantage over BERT. Therefore, we avoid adding too many additional components on top of its architecture when fine-tuning on each downstream task. Also, we use the same architecture with a similar number of parameters for a fair comparison. All the model parameters are updated with a gradient clipping to 1.0 using the same hyper-parameters during fine-tuning. We select four crucial task-oriented downstream tasks to evaluate: intent recognition, dialogue state tracking, dialogue act prediction, and response selection. All of them are core components in modularized task-oriented systems (Wen et al., 2016).

Intent recognition task is a multi-class classification problem, where we input a sentence $U$ and models predict one single intent class over $I$ possible intents.

$$P_{int} = \mathrm{Softmax}(W_1(F(U))) \in \mathbb{R}^{I}, \qquad (3)$$

where $F$ is a pre-trained language model and we use its [CLS] embeddings as the output representation. $W_1 \in \mathbb{R}^{I \times d_B}$ is a trainable linear mapping. The model is trained with cross-entropy loss between the predicted distributions $P_{int}$ and the true intent labels.

Dialogue state tracking can be treated as a multi-class classification problem using a predefined ontology. Unlike intent recognition, we use dialogue history $X$ (a sequence of utterances) as input and a model predicts slot values for each (domain, slot) pair at each dialogue turn. Each corresponding value $v^j_i$, the $i$-th value for the $j$-th (domain, slot) pair, is passed into a pre-trained model and its representation is kept fixed during training.

$$S^j_i = \mathrm{Sim}(G_j(F(X)), F(v^j_i)) \in \mathbb{R}^{1}, \qquad (4)$$

where Sim is the cosine similarity function, and $S^j \in \mathbb{R}^{|v^j|}$ is the probability distribution of the $j$-th (domain, slot) pair over its possible values. $G_j$ is the slot projection layer of the $j$-th slot, and the number of layers $|G|$ is equal to the number of (domain, slot) pairs. The model is trained with cross-entropy loss summed over all the pairs.

Dialogue act prediction is a multi-label classification problem because a system response may contain multiple dialogue acts, e.g., request and inform at the same time. The model takes dialogue history as input and predicts a binary result for each possible dialogue act:

$$A = \mathrm{Sigmoid}(W_2(F(X))) \in \mathbb{R}^{N}, \qquad (5)$$

where $W_2 \in \mathbb{R}^{d_B \times N}$ is a trainable linear mapping, $N$ is the number of possible dialogue acts, and each value in $A$ lies in $[0, 1]$ after a Sigmoid layer. The model is trained with binary cross-entropy loss and the $i$-th dialogue act is considered as a triggered dialogue act if $A_i > 0.5$.

Response selection is a ranking problem, aiming to retrieve the most relevant system response from a candidate pool. We use a dual-encoder strategy (Henderson et al., 2019b) and compute similarity scores between source $X$ and target $Y$,

$$r_i = \mathrm{Sim}(F(X), F(Y_i)) \in \mathbb{R}^{1}, \qquad (6)$$
where $Y_i$ is the $i$-th response candidate and $r_i$ is its cosine similarity score. Source $X$ can be truncated, and we limit the context lengths to the most recent 256 tokens in our experiments. We randomly sample several system responses from the corpus as negative samples. Although they may not be true negative samples, it is common to train a ranker and evaluate its results this way (Henderson et al., 2019a).

5 Evaluation Datasets

We pick several datasets, OOS, DSTC2, GSIM, and MWOZ, for downstream evaluation. The first three corpora are not included in the task-oriented pre-training datasets. For MWOZ, to be fair, we do not include its test set dialogues during the pre-training stage. Details of each evaluation dataset are discussed in the following:

• OOS (Larson et al., 2019): The out-of-scope intent dataset is one of the largest annotated intent datasets, including 15,100/3,100/5,500 samples for the train, validation, and test sets, respectively. It covers 151 intent classes over ten domains, including 150 in-scope intents and one out-of-scope intent. The out-of-scope intent means that a user utterance does not fall into any of the predefined intents. Each of the intents has 100 training samples.

• DSTC2 (Henderson et al., 2014): DSTC2 is a human-machine task-oriented dataset that may include a certain system response noise. It has 1,612/506/1,117 dialogues for train, validation, and test sets, respectively. We follow Paul et al. (2019) to map the original dialogue act labels to universal dialogue acts, which results in 19 different system dialogue acts.

• GSIM (Shah et al., 2018a): GSIM is a human-rewritten machine-machine task-oriented corpus, including 1,500/469/1,039 dialogues for the train, validation, and test sets, respectively. We combine its two domains, movie and restaurant domains, into one single corpus. It is collected by Machines Talking To Machines (M2M) (Shah et al., 2018b) approach, a functionality-driven process combining a dialogue self-play step and a crowdsourcing step. We map its dialogue act labels to universal dialogue acts (Paul et al., 2019), resulting in 13 different system dialogue acts.

• MWOZ (Budzianowski et al., 2018): MWOZ is the most common benchmark for task-oriented dialogues, especially for dialogue state tracking. It has 8,420/1,000/1,000 dialogues for train, validation, and test sets, respectively. Across seven different domains, in total, it has 30 (domain, slot) pairs that need to be tracked in the test set. We use its revised version MWOZ 2.1, which has the same dialogue transcripts but with cleaner state label annotation.

6 Results

For each downstream task, we first conduct the experiments using the whole dataset, and then we simulate the few-shot setting to show the strength of our TOD-BERT. We run at least three times with different random seeds for each few-shot experiment to reduce data sampling variance, and we report its mean and standard deviation for these limited data scenarios. We investigate two versions of TOD-BERT; one is TOD-BERT-mlm that only uses MLM loss during pre-training, and the other is TOD-BERT-jnt, which is jointly trained with the MLM and RCL objectives. We compare TOD-BERT with BERT and other baselines, including two other strong pre-training models GPT-2 (Radford et al., 2019) and DialoGPT (Zhang et al., 2019). For a GPT-based model, we use mean pooling of its hidden states as its output representation, which we found to be better than using only the last token.

6.1 Linear Probe

Before fine-tuning each pre-trained model, we first investigate their feature extraction ability by probing their output representations. Probing methods are proposed to determine what information is carried intrinsically by the learned embeddings (Tenney et al., 2019). We probe the output representation using one single-layer perceptron on top of a “fixed” pre-trained language model and only fine-tune that layer for a downstream task with the same hyper-parameters. Table 3 shows the probing results of domain classification on MWOZ, intent identification on OOS, and dialogue act prediction on MWOZ. TOD-BERT-jnt achieves the highest performance in this setting, suggesting its representation contains the most useful information.

6.2 Intent Recognition

TOD-BERT outperforms BERT and other strong baselines in one of the largest intent recognition
Setting            Model          Acc (all)      Acc (in)       Acc (out)      Recall (out)
1-Shot             BERT           29.3% ± 3.4%   35.7% ± 4.1%   81.3% ± 0.4%   0.4% ± 0.3%
                   TOD-BERT-mlm   38.9% ± 6.3%   47.4% ± 7.6%   81.6% ± 0.2%   0.5% ± 0.2%
                   TOD-BERT-jnt   42.5% ± 0.1%   52.0% ± 0.1%   81.7% ± 0.1%   0.1% ± 0.1%
10-Shot            BERT           75.5% ± 1.1%   88.6% ± 1.1%   84.7% ± 0.3%   16.5% ± 1.7%
                   TOD-BERT-mlm   76.6% ± 0.8%   90.5% ± 1.2%   84.3% ± 0.2%   14.0% ± 1.3%
                   TOD-BERT-jnt   77.3% ± 0.5%   91.0% ± 0.5%   84.5% ± 0.4%   15.3% ± 2.1%
Full (100-Shot)    FastText*      -              89.0%          -              9.7%
                   SVM*           -              91.0%          -              14.5%
                   CNN*           -              91.2%          -              18.9%
                   GPT2           83.0%          94.1%          87.7%          32.0%
                   DialoGPT       83.9%          95.5%          87.6%          32.1%
                   BERT           84.9%          95.8%          88.1%          35.6%
                   TOD-BERT-mlm   85.9%          96.1%          89.5%          46.3%
                   TOD-BERT-jnt   86.6%          96.2%          89.9%          43.6%

Table 2: Intent recognition results on the OOS dataset, one of the largest intent corpora. Models with * are reported from Larson et al. (2019).
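
The 1-shot and 10-shot rows of Table 2 are described as being built by randomly sampling one or ten utterances from each intent class in the training set. A minimal sketch of such a sampler is given below, assuming the training data is a list of (utterance, intent) pairs; the function name `sample_k_shot` and the seeding scheme are illustrative choices, not taken from the released code.

```python
import random
from collections import defaultdict


def sample_k_shot(examples, k, seed=0):
    """Pick up to k examples per intent class.

    examples: list of (utterance, intent) pairs (assumed data layout).
    Returns a new list with at most k pairs for every intent.
    """
    rng = random.Random(seed)  # a fixed seed keeps one draw reproducible
    by_intent = defaultdict(list)
    for utterance, intent in examples:
        by_intent[intent].append((utterance, intent))

    subset = []
    for intent in sorted(by_intent):  # sorted for a deterministic class order
        pool = by_intent[intent]
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset
```

Repeating the draw with several seeds and averaging the resulting scores would mirror the reported mean and standard deviation over at least three runs.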

datasets, as shown in Table 2. We evaluate accuracy on all the data, the in-domain intents only, and the out-of-scope intent only. Note that there are two ways to predict out-of-scope intent, one is to treat it as an additional class, and the other is to set a threshold for prediction confidence. Here we report the results of the first setting. TOD-BERT-jnt achieves the highest in-scope and out-of-scope accuracy. Besides, we conduct 1-shot and 10-shot experiments by randomly sampling one and ten utterances from each intent class in the training set. TOD-BERT-jnt has 13.2% all-intent accuracy improvement and 16.3% in-domain accuracy improvement compared to BERT in the 1-shot setting.

                 Domain (acc)   Intent (acc)   Dialogue Act (F1-micro)
GPT2             63.5%          74.7%          85.7%
DialoGPT         63.0%          65.7%          84.2%
BERT             60.5%          71.1%          85.3%
TOD-BERT-mlm     63.9%          70.7%          83.5%
TOD-BERT-jnt     68.7%          77.8%          86.2%

Table 3: Probing results of different pre-trained language models using a single-layer perceptron.

6.3 Dialogue State Tracking

Two evaluation metrics are commonly used in dialogue state tracking task: joint goal accuracy and slot accuracy. The joint goal accuracy compares the predicted dialogue states to the ground truth at each dialogue turn. The ground truth includes slot values for all the possible (domain, slot) pairs. The output is considered as a correct prediction if and only if all the predicted values exactly match their ground truth values. On the other hand, the slot accuracy individually compares each (domain, slot, value) triplet to its ground truth label.

In Table 5, we compare BERT to TOD-BERT-jnt on the MWOZ 2.1 dataset and find the latter has 2.4% joint goal accuracy improvement. Since the original ontology provided by Budzianowski et al. (2018) is not complete (some labeled values are not included in the ontology), we create a new ontology of all the possible annotated values. We also list several well-known dialogue state trackers as reference, including DSTReader (Gao et al., 2019), HyST (Goel et al., 2019), TRADE (Wu et al., 2019), and ZSDST (Rastogi et al., 2019). We also report the few-shot experiments using 1%, 5%, 10%, and 25% data. Note that 1% of data has around 84 dialogues. TOD-BERT outperforms BERT in all the settings, which further shows the strength of task-oriented dialogue pre-training.

6.4 Dialogue Act Prediction

We conduct experiments on three different datasets and report micro-F1 and macro-F1 scores for the dialogue act prediction task, a multi-label classification problem. For the MWOZ dataset, we remove the domain information from the original system dialogue act labels. For example, the “taxi-inform” will be simplified to “inform”. This process reduces the number of possible dialogue acts from 31 to 13. For DSTC2 and GSIM corpora, we follow Paul et al. (2019) to apply universal dialogue act mapping that maps the original dialogue act labels to a general dialogue act format, resulting in 19 and 13 system dialogue acts in DSTC2 and GSIM, respectively. We run two other baselines, MLP and RNN, to further show the strengths of BERT-based
                            MWOZ (13)                      DSTC2 (19)                     GSIM (13)
Setting     Model           micro-F1       macro-F1       micro-F1       macro-F1       micro-F1       macro-F1
1% Data     BERT            84.0% ± 0.6%   66.7% ± 1.7%   77.1% ± 2.1%   25.8% ± 0.8%   67.3% ± 1.4%   26.9% ± 1.0%
            TOD-BERT-mlm    87.5% ± 0.6%   73.3% ± 1.5%   79.6% ± 1.0%   26.4% ± 0.5%   82.7% ± 0.7%   35.7% ± 0.3%
            TOD-BERT-jnt    86.9% ± 0.2%   72.4% ± 0.8%   82.9% ± 0.4%   28.0% ± 0.1%   78.4% ± 3.2%   32.9% ± 2.1%
10% Data    BERT            89.7% ± 0.2%   78.4% ± 0.3%   88.2% ± 0.7%   34.8% ± 1.3%   98.4% ± 0.3%   45.1% ± 0.2%
            TOD-BERT-mlm    90.1% ± 0.2%   78.9% ± 0.1%   91.8% ± 1.7%   39.4% ± 1.7%   99.2% ± 0.1%   45.6% ± 0.1%
            TOD-BERT-jnt    90.2% ± 0.2%   79.6% ± 0.7%   90.6% ± 3.2%   38.8% ± 2.2%   99.3% ± 0.1%   45.7% ± 0.0%
Full Data   MLP             61.6%          45.5%          77.6%          18.1%          89.5%          26.1%
            RNN             90.4%          77.3%          90.8%          29.4%          98.4%          45.2%
            GPT2            90.8%          79.8%          92.5%          39.4%          99.1%          45.6%
            DialoGPT        91.2%          79.7%          93.8%          42.1%          99.2%          45.6%
            BERT            91.4%          79.7%          92.3%          40.1%          98.7%          45.2%
            TOD-BERT-mlm    91.7%          79.9%          90.9%          39.9%          99.4%          45.8%
            TOD-BERT-jnt    91.7%          80.6%          93.8%          41.3%          99.5%          45.8%

Table 4: Dialogue act prediction results on three different datasets. The numbers reported are the micro and macro F1 scores, and each dataset has different numbers of dialogue acts.
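
To make the two columns of Table 4 concrete, the sketch below shows how micro-F1 and macro-F1 can be computed for multi-label dialogue act prediction once each Sigmoid output $A_i$ is thresholded at 0.5, as in Section 4. It is a plain-Python illustration: the function name `micro_macro_f1` is hypothetical, and scoring a label with no gold or predicted positives as F1 = 0 is an assumed convention that the paper does not specify.

```python
def micro_macro_f1(gold, probs, threshold=0.5):
    """Compute (micro-F1, macro-F1) for multi-label predictions.

    gold:  list of rows of 0/1 labels, one column per dialogue act.
    probs: list of rows of Sigmoid outputs with the same shape.
    """
    num_labels = len(gold[0])
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    for gold_row, prob_row in zip(gold, probs):
        for j, (g, p) in enumerate(zip(gold_row, prob_row)):
            predicted = p > threshold  # act j is triggered if A_j > 0.5
            if predicted and g:
                tp[j] += 1
            elif predicted and not g:
                fp[j] += 1
            elif g:
                fn[j] += 1

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        # Assumed convention: a label with no positives at all scores 0.
        return 2 * true_pos / denom if denom else 0.0

    # Micro-F1 pools the counts over all acts; macro-F1 averages per-act F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(num_labels)) / num_labels
    return micro, macro
```

Because macro-F1 weights every act equally, rare acts that are never predicted pull it down sharply, which is consistent with the macro scores in Table 4 being much lower than the micro scores on DSTC2 and GSIM.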

Setting     Model           Joint Acc      Slot Acc
1% Data     BERT            6.4% ± 1.4%    84.4% ± 1.0%
            TOD-BERT-mlm    9.9% ± 0.6%    86.6% ± 0.5%
            TOD-BERT-jnt    8.0% ± 1.0%    85.3% ± 0.4%
5% Data     BERT            19.6% ± 0.1%   92.0% ± 0.5%
            TOD-BERT-mlm    28.1% ± 1.6%   93.9% ± 0.1%
            TOD-BERT-jnt    28.6% ± 1.4%   93.8% ± 0.3%
10% Data    BERT            32.9% ± 0.6%   94.7% ± 0.1%
            TOD-BERT-mlm    39.5% ± 0.7%   95.6% ± 0.1%
            TOD-BERT-jnt    37.0% ± 0.1%   95.2% ± 0.1%
25% Data    BERT            40.8% ± 1.0%   95.8% ± 0.1%
            TOD-BERT-mlm    44.0% ± 0.4%   96.4% ± 0.1%
            TOD-BERT-jnt    44.3% ± 0.3%   96.3% ± 0.2%
Full Data   DSTReader*      36.4%          -
            HyST*           38.1%          -
            ZSDST*          43.4%          -
            TRADE*          45.6%          -
            GPT2            46.2%          96.6%
            DialoGPT        45.2%          96.5%
            BERT            45.6%          96.6%
            TOD-BERT-mlm    47.7%          96.8%
            TOD-BERT-jnt    48.0%          96.9%

Table 5: Dialogue state tracking results on MWOZ 2.1. We report joint goal accuracy and slot accuracy for the full data setting and the simulated few-shot settings.
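
The gap between the two columns of Table 5 follows from how the metrics are defined in Section 6.3: joint goal accuracy counts a turn as correct only if every (domain, slot) value matches, while slot accuracy scores each pair on its own. The sketch below illustrates both under an assumed data layout in which each turn is a dictionary from (domain, slot) to a value string, with unfilled slots given an explicit placeholder such as "none"; the function name `dst_accuracies` is hypothetical.

```python
def dst_accuracies(predictions, references):
    """Return (joint goal accuracy, slot accuracy) over a list of turns.

    predictions, references: lists of dicts mapping (domain, slot) -> value,
    one dict per dialogue turn. Every reference dict is assumed to list all
    tracked (domain, slot) pairs, using a placeholder for unfilled slots.
    """
    joint_correct = 0
    slot_correct = 0
    slot_total = 0
    for predicted, gold in zip(predictions, references):
        matches = [predicted.get(pair) == value for pair, value in gold.items()]
        joint_correct += all(matches)  # the turn counts only if every pair matches
        slot_correct += sum(matches)   # each pair is scored individually
        slot_total += len(matches)
    return joint_correct / len(references), slot_correct / slot_total
```

With 30 tracked pairs per turn, a single wrong value makes the whole turn wrong for the joint metric while barely moving slot accuracy, which is why slot accuracy stays above 84% even when joint goal accuracy is in the single digits.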

models. The MLP model simply takes bag-of-word embeddings to make dialogue act prediction, and the RNN model is a bi-directional GRU network.

In Table 4, one can observe that in full data scenario, TOD-BERT consistently works better than BERT and other baselines, no matter which datasets or which evaluation metrics. In the few-shot experiments, TOD-BERT-mlm outperforms BERT by 3.5% micro-F1 and 6.6% macro-F1 on MWOZ corpus in the 1% data scenario. We also found that 10% of training data can achieve good performance that is close to full data training.

[Figure 2 panels: (a), (b) BERT; (c), (d) TOD-BERT-mlm; (e), (f) TOD-BERT-jnt]

Figure 2: The tSNE visualization of BERT, TOD-BERT-mlm and TOD-BERT-jnt representations of system responses in the MWOZ test set. Different colors in the left-hand column mean different domains, and in the right-hand column represent different dialogue acts.

6.5 Response Selection

To evaluate response selection in task-oriented dialogues, we follow the k-to-100 accuracy, which is becoming a research community standard (Yang et al., 2018; Henderson et al., 2019a). The k-to-100
Data       Model         MWOZ 1-to-100  MWOZ 3-to-100  DSTC2 1-to-100 DSTC2 3-to-100 GSIM 1-to-100  GSIM 3-to-100
1% Data    BERT          7.8% ± 2.0%    20.5% ± 4.4%   3.7% ± 0.6%    9.6% ± 1.3%    4.0% ± 0.4%    10.3% ± 1.1%
1% Data    TOD-BERT-mlm  13.0% ± 1.1%   34.6% ± 0.4%   12.5% ± 6.7%   24.9% ± 10.7%  7.2% ± 4.0%    15.4% ± 8.0%
1% Data    TOD-BERT-jnt  -              -              37.5% ± 0.6%   55.9% ± 0.4%   12.5% ± 0.9%   26.8% ± 0.8%
10% Data   BERT          20.9% ± 2.6%   45.4% ± 3.8%   8.9% ± 2.3%    21.4% ± 3.1%   9.8% ± 0.1%    24.4% ± 1.2%
10% Data   TOD-BERT-mlm  22.3% ± 3.2%   48.7% ± 4.0%   19.0% ± 16.3%  33.8% ± 20.4%  11.2% ± 2.5%   26.0% ± 2.7%
10% Data   TOD-BERT-jnt  -              -              49.7% ± 0.3%   66.6% ± 0.1%   23.0% ± 1.0%   42.6% ± 1.0%
Full Data  GPT2          47.5%          75.4%          53.7%          69.2%          39.1%          60.5%
Full Data  DialoGPT      35.7%          64.1%          39.8%          57.1%          16.5%          39.5%
Full Data  BERT          47.5%          75.5%          46.6%          62.1%          13.4%          32.9%
Full Data  TOD-BERT-mlm  48.1%          74.3%          50.0%          65.1%          36.5%          60.1%
Full Data  TOD-BERT-jnt  65.8%          87.0%          56.8%          70.6%          41.0%          65.4%

Table 6: Response selection evaluation results on three corpora for the 1%, 10% and full data settings. We report 1-to-100 and 3-to-100 accuracy, which is similar to recall@1 and recall@3 given 100 candidates.
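
The 1-to-100 and 3-to-100 accuracies in Table 6 can be illustrated with a minimal sketch that ranks each true response against the other responses in the same batch, which serve as random negatives. The helper names `score_matrix` and `k_of_n_accuracy`, the dot-product scoring, and the tiny two-dimensional vectors are assumptions for illustration; the evaluation described here uses batches of 100 examples and model-produced representations.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def score_matrix(context_vecs, response_vecs):
    # scores[i][j] = similarity between context i and response j.
    # Assumption: a dot product between context and response vectors.
    return [[dot(c, r) for r in response_vecs] for c in context_vecs]


def k_of_n_accuracy(scores, k):
    """Fraction of contexts whose true response (the diagonal entry)
    ranks within the top k among all responses in the batch."""
    hits = 0
    for i, row in enumerate(scores):
        # Count in-batch candidates that score strictly higher than the
        # true response; ties are resolved in favor of the true response.
        better = sum(1 for s in row if s > row[i])
        hits += int(better < k)
    return hits / len(scores)


# Toy batch of 3 examples; responses[i] is the true response of contexts[i].
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
responses = [[0.9, 0.1], [0.8, 0.7], [0.5, 0.6]]
scores = score_matrix(contexts, responses)
acc_at_1 = k_of_n_accuracy(scores, k=1)
acc_at_2 = k_of_n_accuracy(scores, k=2)
```

In this toy batch the third context prefers another example's response, so it counts as a miss at k=1 and as a hit at k=2.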

metric is computed using a random batch of 100 examples so that responses from other examples in the same batch can be used as random negative candidates. This allows us to compute the metric across many examples in batches efficiently. While it is not guaranteed that the random negatives will indeed be “true” negatives, the 1-of-100 metric still provides a useful evaluation signal. During inference, we run five different random seeds to sample batches and report the average results.

In Table 6, we conduct response selection experiments on three datasets, MWOZ, DSTC2, and GSIM. TOD-BERT-jnt achieves 65.8% 1-to-100 accuracy and 87.0% 3-to-100 accuracy on MWOZ, which surpasses BERT by 18.3% and 11.5%, respectively. Similar results are also consistently observed on the DSTC2 and GSIM datasets, and the advantage of TOD-BERT-jnt is more evident in the few-shot scenario. We do not report TOD-BERT-jnt for the MWOZ few-shot setting because it would not be fair to compare it with the other models, as the full MWOZ training set is used for response contrastive learning during the pre-training stage. The response selection results are sensitive to the training batch size, since the larger the batch size, the harder the prediction. In our experiments, we set the batch size equal to 25 for all the models.

7 Visualization

In Figure 2, we visualize the embeddings of BERT, TOD-BERT-mlm, and TOD-BERT-jnt given the same input from the MWOZ test set. Each sample point is a system response representation, which is obtained by passing the response through a pre-trained model and reducing its high-dimensional features to a two-dimensional point using t-distributed stochastic neighbor embedding (tSNE). Since we know the true domain and dialogue act labels for each utterance, we use different colors to represent different domains and dialogue acts. As one can observe, TOD-BERT-jnt has clearer group boundaries than TOD-BERT-mlm, and both of them are better than BERT.

To analyze the results quantitatively, we run K-means, a common unsupervised clustering algorithm, on top of the output embeddings of BERT and TOD-BERT. We set K for K-means equal to 10 and 20. After the clustering, we can assign each utterance in the MWOZ test set to a predicted class. We then compute the normalized mutual information (NMI) between the clustering result and the actual domain label for each utterance. Here is what we observe: TOD-BERT consistently achieves higher NMI scores than BERT. For K=10, TOD-BERT has a 0.143 NMI score, and BERT only has 0.094. For K=20, TOD-BERT achieves a 0.213 NMI score, while BERT has 0.109.

8 Conclusion

We propose task-oriented dialogue BERT (TOD-BERT) trained on nine human-human and multi-turn task-oriented datasets across over 60 domains. TOD-BERT outperforms BERT on four dialogue downstream tasks, including intention classification, dialogue state tracking, dialogue act prediction, and response selection. It also has a clear advantage in the few-shot experiments when only limited labeled data is available. TOD-BERT is easy to deploy and will be open-sourced, allowing the NLP research community to apply or fine-tune it on any task-oriented conversational problem.

References

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul
Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.

Siqi Bao, Huang He, Fan Wang, and Hua Wu. 2019. PLATO: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it’s GPT-2–how can I help you? Towards the use of pre-trained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.

Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Andy Cedilnik, and Kyu-Young Kim. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.

Mihail Eric and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414.

Shuyang Gao, Abhishek Sethi, Sanchit Aggarwal, Tagyoung Chung, and Dilek Hakkani-Tur. 2019. Dialog state tracking: A neural reading comprehension approach. arXiv preprint arXiv:1908.01946.

Rahul Goel, Shachi Paul, and Dilek Hakkani-Tür. 2019. HyST: A hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883.

Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Ivan Vulić, et al. 2019a. ConveRT: Efficient and accurate conversational representations from transformers. arXiv preprint arXiv:1911.03688.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.

Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019b. Training neural response selection for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5392–5404, Florence, Italy. Association for Computational Linguistics.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Adam Atkinson Sungjin Lee, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, and Raghav Gupta. 2019. The eighth dialog system technology challenge. arXiv preprint.

Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, et al. 2019. An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027.

Sungjin Lee, Hannes Schulz, Adam Atkinson, Jianfeng Gao, Kaheer Suleman, Layla El Asri, Mahmoud Adada, Minlie Huang, Shikhar Sharma, Wendy Tay, and Xiujun Li. 2019. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges 8.

Xiujun Li, Sarah Panda, JJ (Jingjing) Liu, and Jianfeng Gao. 2018. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. In SLT 2018.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Nikola Mrkšić, Diarmuid O Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2016. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777.

Shachi Paul, Rahul Goel, and Dilek Hakkani-Tür. 2019. Towards universal dialogue act tagging for task-oriented dialogues. arXiv preprint arXiv:1907.03020.
Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2018. Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.

Marzieh Saeidi, Ritwik Kulkarni, Theodosia Togia, and Michele Sama. 2017. The effect of negative sampling strategy on capturing semantic similarity in document embeddings. In Proceedings of the 2nd Workshop on Semantic Deep Learning (SemDeep-2), pages 1–8.

Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. 2018a. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 41–51.

Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018b. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 808–819, Florence, Italy. Association for Computational Linguistics.

Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

A Appendices

[Figure 3 panels: (a) BERT; (b) TOD-BERT-mlm; (c) TOD-BERT-jnt]

Figure 3: The tSNE visualization of BERT and TOD-BERT representations of system responses in the MWOZ test set. Different colors mean different domains.

[Figure 4 panels: (a) BERT; (b) TOD-BERT-mlm; (c) TOD-BERT-jnt]

Figure 4: The tSNE visualization of BERT and TOD-BERT representations of system responses in the MWOZ test set. Different colors mean different dialogue acts.

[Figure 5 panels: (a) BERT; (b) TOD-BERT-mlm; (c) TOD-BERT-jnt]

Figure 5: The tSNE visualization of BERT and TOD-BERT representations of system responses in the MWOZ test set. Different colors mean different dialogue slots.
