TOD-BERT: Pre-Trained Natural Language Understanding For Task-Oriented Dialogue
5 Evaluation Datasets

We select several datasets, OOS, DSTC2, GSIM, and MWOZ, for downstream evaluation. The first three corpora are not included in the task-oriented datasets used for pre-training. For MWOZ, to be fair, we do not include its test set dialogues during the pre-training stage. Details of each evaluation dataset are discussed in the following:

• OOS (Larson et al., 2019): The out-of-scope intent dataset is one of the largest annotated intent datasets, including 15,100/3,100/5,500 samples for the train, validation, and test sets, respectively. It covers 151 intent classes over ten domains, including 150 in-scope intents and one out-of-scope intent. The out-of-scope intent means that a user utterance does not fall into any of the predefined intents. Each of the intents has 100 training samples.

• DSTC2 (Henderson et al., 2014): DSTC2 is a human-machine task-oriented dataset that may include some noise in the system responses. It has 1,612/506/1,117 dialogues for the train, validation, and test sets, respectively. We follow Paul et al. (2019) to map the original dialogue act labels to universal dialogue acts, which results in 19 different system dialogue acts.

• GSIM (Shah et al., 2018a): GSIM is a human-rewritten machine-machine task-oriented corpus, including 1,500/469/1,039 dialogues for the train, validation, and test sets, respectively. We combine its two domains, movie and restaurant, into one single corpus. It is collected with the Machines Talking To Machines (M2M) approach (Shah et al., 2018b), a functionality-driven process combining a dialogue self-play step and a crowdsourcing step. We map its dialogue act labels to universal dialogue acts (Paul et al., 2019), resulting in 13 different system dialogue acts.

• MWOZ (Budzianowski et al., 2018): MWOZ is the most common benchmark for task-oriented dialogue.

6 Results

For each downstream task, we first conduct the experiments using the whole dataset, and then we simulate the few-shot setting to show the strength of our TOD-BERT. We run each few-shot experiment at least three times with different random seeds to reduce data sampling variance, and we report the mean and standard deviation for these limited data scenarios. We investigate two versions of TOD-BERT: one is TOD-BERT-mlm, which only uses the MLM loss during pre-training, and the other is TOD-BERT-jnt, which is jointly trained with the MLM and RCL objectives. We compare TOD-BERT with BERT and other baselines, including two other strong pre-trained models, GPT-2 (Radford et al., 2019) and DialoGPT (Zhang et al., 2019). For a GPT-based model, we use mean pooling of its hidden states as its output representation, which we found works better than using only the last token.

6.1 Linear Probe

Before fine-tuning each pre-trained model, we first investigate their feature extraction ability by probing their output representations. Probing methods are proposed to determine what information is carried intrinsically by the learned embeddings (Tenney et al., 2019). We probe the output representation using one single-layer perceptron on top of a “fixed” pre-trained language model and only fine-tune that layer for a downstream task with the same hyper-parameters. Table 3 shows the probing results of domain classification on MWOZ, intent identification on OOS, and dialogue act prediction on MWOZ. TOD-BERT-jnt achieves the highest performance in this setting, suggesting its representation contains the most useful information.

6.2 Intent Recognition

TOD-BERT outperforms BERT and other strong baselines in one of the largest intent recognition
datasets, as shown in Table 2. We evaluate accuracy on all the data, the in-scope intents only, and the out-of-scope intent only. Note that there are two ways to predict the out-of-scope intent: one is to treat it as an additional class, and the other is to set a threshold on the prediction confidence. Here we report the results of the first setting. TOD-BERT-jnt achieves the highest in-scope and out-of-scope accuracy. In addition, we conduct 1-shot and 10-shot experiments by randomly sampling one and ten utterances from each intent class in the training set. TOD-BERT-jnt improves all-intent accuracy by 13.2% and in-scope accuracy by 16.3% over BERT in the 1-shot setting.

                                Acc (all)       Acc (in)        Acc (out)       Recall (out)
1-Shot           BERT           29.3% ± 3.4%    35.7% ± 4.1%    81.3% ± 0.4%    0.4% ± 0.3%
                 TOD-BERT-mlm   38.9% ± 6.3%    47.4% ± 7.6%    81.6% ± 0.2%    0.5% ± 0.2%
                 TOD-BERT-jnt   42.5% ± 0.1%    52.0% ± 0.1%    81.7% ± 0.1%    0.1% ± 0.1%
10-Shot          BERT           75.5% ± 1.1%    88.6% ± 1.1%    84.7% ± 0.3%    16.5% ± 1.7%
                 TOD-BERT-mlm   76.6% ± 0.8%    90.5% ± 1.2%    84.3% ± 0.2%    14.0% ± 1.3%
                 TOD-BERT-jnt   77.3% ± 0.5%    91.0% ± 0.5%    84.5% ± 0.4%    15.3% ± 2.1%
Full (100-Shot)  FastText*      -               89.0%           -               9.7%
                 SVM*           -               91.0%           -               14.5%
                 CNN*           -               91.2%           -               18.9%
                 GPT2           83.0%           94.1%           87.7%           32.0%
                 DialoGPT       83.9%           95.5%           87.6%           32.1%
                 BERT           84.9%           95.8%           88.1%           35.6%
                 TOD-BERT-mlm   85.9%           96.1%           89.5%           46.3%
                 TOD-BERT-jnt   86.6%           96.2%           89.9%           43.6%

Table 2: Intent recognition results on the OOS dataset, one of the largest intent corpora. Results for models marked with * are taken from Larson et al. (2019).

                Domain    Intent    Dialogue Act
                (acc)     (acc)     (F1-micro)
GPT2            63.5%     74.7%     85.7%
DialoGPT        63.0%     65.7%     84.2%
BERT            60.5%     71.1%     85.3%
TOD-BERT-mlm    63.9%     70.7%     83.5%
TOD-BERT-jnt    68.7%     77.8%     86.2%

Table 3: Probing results of different pre-trained language models using a single-layer perceptron.

6.3 Dialogue State Tracking

Two evaluation metrics are commonly used in the dialogue state tracking task: joint goal accuracy and slot accuracy. The joint goal accuracy compares the predicted dialogue states to the ground truth at each dialogue turn. The ground truth includes slot values for all the possible (domain, slot) pairs. The output is considered a correct prediction if and only if all the predicted values exactly match the ground truth values. Slot accuracy, on the other hand, individually compares each (domain, slot, value) triplet to its ground truth label.

In Table 5, we compare BERT to TOD-BERT-jnt on the MWOZ 2.1 dataset and find that the latter improves joint goal accuracy by 2.4%. Since the original ontology provided by Budzianowski et al. (2018) is not complete (some labeled values are not included in the ontology), we create a new ontology of all the possible annotated values. We also list several well-known dialogue state trackers as reference, including DSTReader (Gao et al., 2019), HyST (Goel et al., 2019), TRADE (Wu et al., 2019), and ZSDST (Rastogi et al., 2019). We also report the few-shot experiments using 1%, 5%, 10%, and 25% of the data. Note that 1% of the data is around 84 dialogues. TOD-BERT outperforms BERT in all the settings, which further shows the strength of task-oriented dialogue pre-training.

6.4 Dialogue Act Prediction

We conduct experiments on three different datasets and report micro-F1 and macro-F1 scores for the dialogue act prediction task, a multi-label classification problem. For the MWOZ dataset, we remove the domain information from the original system dialogue act labels. For example, “taxi-inform” is simplified to “inform”. This process reduces the number of possible dialogue acts from 31 to 13. For the DSTC2 and GSIM corpora, we follow Paul et al. (2019) and apply the universal dialogue act mapping, which maps the original dialogue act labels to a general dialogue act format, resulting in 19 and 13 system dialogue acts in DSTC2 and GSIM, respectively. We run two other baselines, MLP and RNN, to further show the strengths of BERT-based models.
                         MWOZ (13)                     DSTC2 (19)                    GSIM (13)
                         micro-F1       macro-F1       micro-F1       macro-F1       micro-F1       macro-F1
1% Data    BERT          84.0% ± 0.6%   66.7% ± 1.7%   77.1% ± 2.1%   25.8% ± 0.8%   67.3% ± 1.4%   26.9% ± 1.0%
           TOD-BERT-mlm  87.5% ± 0.6%   73.3% ± 1.5%   79.6% ± 1.0%   26.4% ± 0.5%   82.7% ± 0.7%   35.7% ± 0.3%
           TOD-BERT-jnt  86.9% ± 0.2%   72.4% ± 0.8%   82.9% ± 0.4%   28.0% ± 0.1%   78.4% ± 3.2%   32.9% ± 2.1%
10% Data   BERT          89.7% ± 0.2%   78.4% ± 0.3%   88.2% ± 0.7%   34.8% ± 1.3%   98.4% ± 0.3%   45.1% ± 0.2%
           TOD-BERT-mlm  90.1% ± 0.2%   78.9% ± 0.1%   91.8% ± 1.7%   39.4% ± 1.7%   99.2% ± 0.1%   45.6% ± 0.1%
           TOD-BERT-jnt  90.2% ± 0.2%   79.6% ± 0.7%   90.6% ± 3.2%   38.8% ± 2.2%   99.3% ± 0.1%   45.7% ± 0.0%
Full Data  MLP           61.6%          45.5%          77.6%          18.1%          89.5%          26.1%
           RNN           90.4%          77.3%          90.8%          29.4%          98.4%          45.2%
           GPT2          90.8%          79.8%          92.5%          39.4%          99.1%          45.6%
           DialoGPT      91.2%          79.7%          93.8%          42.1%          99.2%          45.6%
           BERT          91.4%          79.7%          92.3%          40.1%          98.7%          45.2%
           TOD-BERT-mlm  91.7%          79.9%          90.9%          39.9%          99.4%          45.8%
           TOD-BERT-jnt  91.7%          80.6%          93.8%          41.3%          99.5%          45.8%

Table 4: Dialogue act prediction results on three different datasets. The numbers reported are the micro- and macro-F1 scores; the number of dialogue acts in each dataset is given in parentheses.
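The micro-F1 and macro-F1 scores in Table 4 treat dialogue act prediction as multi-label classification, so each system turn carries a set of acts rather than a single label. The following is a minimal sketch of how the two scores could be computed after stripping the domain prefix from MWOZ-style labels. The function names (`strip_domain`, `micro_macro_f1`) and the toy turns are illustrative assumptions, not taken from the TOD-BERT code, and labels with no true positives are assumed to score an F1 of 0.

```python
from typing import Iterable, List, Optional, Set, Tuple


def strip_domain(act: str) -> str:
    """Drop a leading domain prefix, e.g. "taxi-inform" -> "inform"."""
    return act.split("-", 1)[1] if "-" in act else act


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]],
    pred: List[Set[str]],
    labels: Optional[Iterable[str]] = None,
) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for multi-label predictions.

    gold[i] and pred[i] are the sets of dialogue acts for turn i.
    If `labels` is omitted, every act seen in gold or pred is used
    (an assumption; a fixed act inventory could be passed instead).
    """
    label_list = sorted(labels) if labels is not None else sorted(set().union(*gold, *pred))
    counts = {lab: [0, 0, 0] for lab in label_list}  # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        for lab in label_list:
            if lab in p and lab in g:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1
    # Micro-F1 pools the counts over all labels before computing F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)
    # Macro-F1 averages the per-label F1 scores, so rare acts weigh as much as frequent ones.
    per_label = [f1_from_counts(*c) for c in counts.values()]
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    return micro, macro


# Toy example with hypothetical system turns.
gold_raw = [["taxi-inform", "taxi-request"], ["hotel-inform"], ["general-bye"]]
pred_raw = [["taxi-inform"], ["hotel-inform", "hotel-request"], ["general-bye"]]
gold_sets = [{strip_domain(a) for a in turn} for turn in gold_raw]
pred_sets = [{strip_domain(a) for a in turn} for turn in pred_raw]
micro_f1, macro_f1 = micro_macro_f1(gold_sets, pred_sets)
```

Because macro-F1 gives every act equal weight, a single rarely predicted act can pull it far below micro-F1, which is consistent with the large gap between the two columns for DSTC2 and GSIM.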
The MLP model uses embeddings to make dialogue act predictions, and the RNN model is a bi-directional GRU network.

In Table 4, one can observe that in the full-data scenario TOD-BERT-jnt matches or outperforms BERT and the other baselines on nearly every dataset and metric; the one exception is macro-F1 on DSTC2, where DialoGPT reaches 42.1% against 41.3%. In the few-shot experiments, TOD-BERT-mlm outperforms BERT by 3.5% micro-F1 and 6.6% macro-F1 on the MWOZ corpus in the 1% data scenario. We also found that training on 10% of the data achieves performance close to full-data training.

                         Joint Acc      Slot Acc
1% Data    BERT          6.4% ± 1.4%    84.4% ± 1.0%
           TOD-BERT-mlm  9.9% ± 0.6%    86.6% ± 0.5%
           TOD-BERT-jnt  8.0% ± 1.0%    85.3% ± 0.4%
5% Data    BERT          19.6% ± 0.1%   92.0% ± 0.5%
           TOD-BERT-mlm  28.1% ± 1.6%   93.9% ± 0.1%
           TOD-BERT-jnt  28.6% ± 1.4%   93.8% ± 0.3%
10% Data   BERT          32.9% ± 0.6%   94.7% ± 0.1%
           TOD-BERT-mlm  39.5% ± 0.7%   95.6% ± 0.1%
           TOD-BERT-jnt  37.0% ± 0.1%   95.2% ± 0.1%
25% Data   BERT          40.8% ± 1.0%   95.8% ± 0.1%
           TOD-BERT-mlm  44.0% ± 0.4%   96.4% ± 0.1%
           TOD-BERT-jnt  44.3% ± 0.3%   96.3% ± 0.2%
Full Data  DSTReader*    36.4%          -
           HyST*         38.1%          -
           ZSDST*        43.4%          -
           TRADE*        45.6%          -
           GPT2          46.2%          96.6%
           DialoGPT      45.2%          96.5%
           BERT          45.6%          96.6%
           TOD-BERT-mlm  47.7%          96.8%

Table 5: Dialogue state tracking results on MWOZ 2.1 (joint goal accuracy and slot accuracy).

[Figure 2 panels: (a), (b) BERT; (c), (d) TOD-BERT-mlm; (e), (f) TOD-BERT-jnt]

Figure 2: The tSNE visualization of BERT, TOD-BERT-mlm and TOD-BERT-jnt representations of system responses in the MWOZ test set. Different colors in the left-hand column represent different domains, and in the right-hand column they represent different dialogue acts.

6.5 Response Selection

To evaluate response selection in task-oriented dialogues, we follow the k-to-100 accuracy, which is becoming a research community standard (Yang et al., 2018; Henderson et al., 2019a). The k-of-100
metric is computed using a random batch of 100 examples so that responses from other examples in the same batch can be used as random negative candidates. This allows us to compute the metric across many examples in batches efficiently. While it is not guaranteed that the random negatives will indeed be “true” negatives, the 1-of-100 metric still provides a useful evaluation signal. During inference, we run five different random seeds to sample batches and report the average results.

In Table 6, we conduct response selection experiments on three datasets, MWOZ, DSTC2, and GSIM. TOD-BERT-jnt achieves 65.8% 1-to-100 accuracy and 87.0% 3-to-100 accuracy on MWOZ, which surpasses BERT by 18.3% and 11.5%, respectively. Similar results are also consistently observed on the DSTC2 and GSIM datasets, and the advantage of TOD-BERT-jnt is more evident in the few-shot scenario. We do not report TOD-BERT-jnt for the MWOZ few-shot setting because it would not be a fair comparison: the full MWOZ training set is used for response contrastive learning during the pre-training stage. The response selection results are sensitive to the training batch size, since the larger the batch size, the harder the prediction. In our experiments, we set the batch size to 25 for all the models.

                         MWOZ                            DSTC2                           GSIM
                         1-to-100        3-to-100        1-to-100        3-to-100        1-to-100        3-to-100
1% Data    BERT          7.8% ± 2.0%     20.5% ± 4.4%    3.7% ± 0.6%     9.6% ± 1.3%     4.0% ± 0.4%     10.3% ± 1.1%
           TOD-BERT-mlm  13.0% ± 1.1%    34.6% ± 0.4%    12.5% ± 6.7%    24.9% ± 10.7%   7.2% ± 4.0%     15.4% ± 8.0%
           TOD-BERT-jnt  -               -               37.5% ± 0.6%    55.9% ± 0.4%    12.5% ± 0.9%    26.8% ± 0.8%
10% Data   BERT          20.9% ± 2.6%    45.4% ± 3.8%    8.9% ± 2.3%     21.4% ± 3.1%    9.8% ± 0.1%     24.4% ± 1.2%
           TOD-BERT-mlm  22.3% ± 3.2%    48.7% ± 4.0%    19.0% ± 16.3%   33.8% ± 20.4%   11.2% ± 2.5%    26.0% ± 2.7%
           TOD-BERT-jnt  -               -               49.7% ± 0.3%    66.6% ± 0.1%    23.0% ± 1.0%    42.6% ± 1.0%
Full Data  GPT2          47.5%           75.4%           53.7%           69.2%           39.1%           60.5%
           DialoGPT      35.7%           64.1%           39.8%           57.1%           16.5%           39.5%
           BERT          47.5%           75.5%           46.6%           62.1%           13.4%           32.9%
           TOD-BERT-mlm  48.1%           74.3%           50.0%           65.1%           36.5%           60.1%
           TOD-BERT-jnt  65.8%           87.0%           56.8%           70.6%           41.0%           65.4%

Table 6: Response selection evaluation results on three corpora for the 1%, 10% and full data settings. We report 1-to-100 and 3-to-100 accuracy, which is similar to recall@1 and recall@3 given 100 candidates.

7 Visualization

In Figure 2, we visualize the embeddings of BERT, TOD-BERT-mlm, and TOD-BERT-jnt given the same input from the MWOZ test set. Each sample point is a system response representation, which is passed through a pre-trained model and has its high-dimensional features reduced to a two-dimensional point with tSNE. For each utterance, we use different colors to represent different domains and dialogue acts. As one can observe, TOD-BERT-jnt has clearer group boundaries than TOD-BERT-mlm, and both of them are better than BERT.

To analyze the results quantitatively, we run K-means, a common unsupervised clustering algorithm, on top of the output embeddings of BERT and TOD-BERT. We set K for K-means to 10 and 20. After the clustering, we can assign each utterance in the MWOZ test set to a predicted class. We then compute the normalized mutual information (NMI) between the clustering result and the actual domain label for each utterance. TOD-BERT consistently achieves higher NMI scores than BERT. For K=10, TOD-BERT has a 0.143 NMI score, and BERT only has 0.094. For K=20, TOD-BERT achieves a 0.213 NMI score, while BERT has 0.109.

8 Conclusion

We propose task-oriented dialogue BERT (TOD-BERT), trained on nine human-human, multi-turn task-oriented datasets across over 60 domains. TOD-BERT outperforms BERT on four dialogue downstream tasks: intent classification, dialogue state tracking, dialogue act prediction, and response selection. It also has a clear advantage in the few-shot experiments when only limited labeled data is available. TOD-BERT is easy to deploy and will be open-sourced, allowing the NLP research community to apply it to, or fine-tune it on, any task-oriented conversational problem.
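As a companion to the response selection evaluation in Section 6.5, the sketch below illustrates how a k-to-100 accuracy with in-batch random negatives could be computed. It assumes a dual-encoder setup in which each dialogue context and each response has already been encoded into a vector and a pair is scored by a dot product; that scoring function, the pessimistic tie-breaking, the dropping of incomplete batches, and all names (`k_to_n_accuracy`, `averaged_over_seeds`) are assumptions made for illustration and are not taken from the TOD-BERT release.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between a context vector and a response vector."""
    return sum(a * b for a, b in zip(u, v))


def k_to_n_accuracy(
    contexts: List[Vector],
    responses: List[Vector],
    k: int = 1,
    n: int = 100,
    seed: int = 0,
) -> float:
    """Fraction of examples whose true response ranks in the top k of n candidates.

    contexts[i] is paired with responses[i]; the other n - 1 responses in the
    same random batch serve as negatives (they are not guaranteed to be true negatives).
    """
    order = list(range(len(contexts)))
    random.Random(seed).shuffle(order)
    hits = total = 0
    # Incomplete trailing batches are dropped so every example sees exactly n candidates.
    for start in range(0, len(order) - n + 1, n):
        batch = order[start:start + n]
        for i in batch:
            true_score = dot(contexts[i], responses[i])
            # Pessimistic tie-breaking: a negative that ties with the true response outranks it.
            rank = sum(
                1 for j in batch
                if j != i and dot(contexts[i], responses[j]) >= true_score
            )
            hits += int(rank < k)
            total += 1
    return hits / total if total else 0.0


def averaged_over_seeds(
    contexts: List[Vector],
    responses: List[Vector],
    k: int = 1,
    n: int = 100,
    seeds: Sequence[int] = (0, 1, 2, 3, 4),
) -> float:
    """Average the metric over several batch samplings, one per random seed."""
    scores = [k_to_n_accuracy(contexts, responses, k, n, s) for s in seeds]
    return sum(scores) / len(scores)
```

Calling `averaged_over_seeds(contexts, responses, k=3)` would correspond to a 3-to-100 accuracy averaged over five batch samplings. Since the negatives are drawn at random, the score depends on which examples share a batch, which is why averaging over seeds is useful.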
