TOD-BERT: Pre-Trained Natural Language Understanding For Task-Oriented Dialogue
5 Evaluation Datasets

We select several datasets, OOS, DSTC2, GSIM, and MWOZ, for downstream evaluation. The first three corpora are not included in our pre-training task-oriented datasets. For MWOZ, to be fair, we do not include its test set dialogues during the pre-training stage. Details of each evaluation dataset are discussed in the following:

• OOS (Larson et al., 2019): The out-of-scope intent dataset is one of the largest annotated intent datasets, with 15,100/3,100/5,500 samples for the train, validation, and test sets, respectively. It covers 151 intent classes over ten domains, including 150 in-scope intents and one out-of-scope intent. The out-of-scope intent covers user utterances that do not fall into any of the predefined intents. Each intent has 100 training samples.
• DSTC2 (Henderson et al., 2014): DSTC2 is a human-machine task-oriented dataset that may contain some system response noise. It has 1,612/506/1,117 dialogues for the train, validation, and test sets, respectively. We follow Paul et al. (2019) to map the original dialogue act labels to universal dialogue acts, which results in 19 different system dialogue acts.
• GSIM (Shah et al., 2018a): GSIM is a human-rewritten machine-to-machine task-oriented corpus, with 1,500/469/1,039 dialogues for the train, validation, and test sets, respectively. We combine its two domains, movie and restaurant, into one single corpus. It was collected with the Machines Talking To Machines (M2M) approach (Shah et al., 2018b), a functionality-driven process combining a dialogue self-play step and a crowdsourcing step. We map its dialogue act labels to universal dialogue acts (Paul et al., 2019), resulting in 13 different system dialogue acts.

• MWOZ (Budzianowski et al., 2018): MWOZ is the most common benchmark for task-oriented dialogue.

6 Results

For each downstream task, we first conduct experiments using the whole dataset, and then we simulate the few-shot setting to show the strength of our TOD-BERT. We run each few-shot experiment at least three times with different random seeds to reduce data sampling variance, and we report the mean and standard deviation for these limited-data scenarios. We investigate two versions of TOD-BERT: TOD-BERT-mlm, which only uses the MLM loss during pre-training, and TOD-BERT-jnt, which is jointly trained with the MLM and RCL objectives. We compare TOD-BERT with BERT and other baselines, including two other strong pre-trained models, GPT-2 (Radford et al., 2019) and DialoGPT (Zhang et al., 2019). For the GPT-based models, we use mean pooling of the hidden states as the output representation, which we found works better than using only the last token.
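The following sketch illustrates this kind of masked mean pooling with the Hugging Face Transformers library; the model name, example utterances, and padding details are placeholders for illustration rather than the exact experimental setup.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

# Illustrative mean pooling of GPT-2 hidden states over non-padding tokens.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = GPT2Model.from_pretrained("gpt2")

batch = tokenizer(["i want to book a cheap hotel", "what time does it open"],
                  return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden)
```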
6.1 Linear Probe

Before fine-tuning each pre-trained model, we first investigate its feature extraction ability by probing its output representations. Probing methods are designed to determine what information is carried intrinsically by the learned embeddings (Tenney et al., 2019). We probe the output representations with a single-layer perceptron on top of a "fixed" pre-trained language model, fine-tuning only that layer for each downstream task with the same hyper-parameters. Table 3 shows the probing results of domain classification on MWOZ, intent identification on OOS, and dialogue act prediction on MWOZ. TOD-BERT-jnt achieves the highest performance in this setting, suggesting that its representation contains the most useful information.
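A minimal sketch of such a probing setup is shown below, assuming a BERT-style encoder, a first-token ([CLS]) representation, and an illustrative number of classes; these choices are assumptions for illustration rather than the exact configuration used in these experiments.

```python
import torch
import torch.nn as nn
from transformers import BertModel

# Probing sketch: freeze the pre-trained encoder and train only one linear layer.
encoder = BertModel.from_pretrained("bert-base-uncased")
for p in encoder.parameters():
    p.requires_grad = False             # keep the language model "fixed"

num_classes = 13                        # e.g., number of dialogue acts (illustrative)
probe = nn.Linear(encoder.config.hidden_size, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=5e-4)

def probe_logits(input_ids, attention_mask):
    with torch.no_grad():               # no gradients flow into the encoder
        out = encoder(input_ids=input_ids, attention_mask=attention_mask)
    cls = out.last_hidden_state[:, 0]   # first-token representation (assumed pooling)
    return probe(cls)
```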
6.2 Intent Recognition

TOD-BERT outperforms BERT and other strong baselines on one of the largest intent recognition datasets, as shown in Table 2.
Model             Acc (all)       Acc (in)        Acc (out)       Recall (out)
1-Shot
  BERT            29.3% ± 3.4%    35.7% ± 4.1%    81.3% ± 0.4%    0.4% ± 0.3%
  TOD-BERT-mlm    38.9% ± 6.3%    47.4% ± 7.6%    81.6% ± 0.2%    0.5% ± 0.2%
  TOD-BERT-jnt    42.5% ± 0.1%    52.0% ± 0.1%    81.7% ± 0.1%    0.1% ± 0.1%
10-Shot
  BERT            75.5% ± 1.1%    88.6% ± 1.1%    84.7% ± 0.3%    16.5% ± 1.7%
  TOD-BERT-mlm    76.6% ± 0.8%    90.5% ± 1.2%    84.3% ± 0.2%    14.0% ± 1.3%
  TOD-BERT-jnt    77.3% ± 0.5%    91.0% ± 0.5%    84.5% ± 0.4%    15.3% ± 2.1%
Full (100-Shot)
  FastText*       -               89.0%           -               9.7%
  SVM*            -               91.0%           -               14.5%
  CNN*            -               91.2%           -               18.9%
  GPT2            83.0%           94.1%           87.7%           32.0%
  DialoGPT        83.9%           95.5%           87.6%           32.1%
  BERT            84.9%           95.8%           88.1%           35.6%
  TOD-BERT-mlm    85.9%           96.1%           89.5%           46.3%
  TOD-BERT-jnt    86.6%           96.2%           89.9%           43.6%

Table 2: Intent recognition results on the OOS dataset, one of the largest intent corpora. Models with * are reported from Larson et al. (2019).
Model             Domain (acc)    Intent (acc)    Dialogue Act (F1-micro)
GPT2              63.5%           74.7%           85.7%
DialoGPT          63.0%           65.7%           84.2%
BERT              60.5%           71.1%           85.3%
TOD-BERT-mlm      63.9%           70.7%           83.5%
TOD-BERT-jnt      68.7%           77.8%           86.2%

Table 3: Probing results of different pre-trained language models using a single-layer perceptron.
We evaluate accuracy on all the data, the in-domain intents only, and the out-of-scope intent only. Note that there are two ways to predict the out-of-scope intent: one is to treat it as an additional class, and the other is to set a threshold on the prediction confidence. Here we report the results of the first setting. TOD-BERT-jnt achieves the highest in-scope and out-of-scope accuracy. In addition, we conduct 1-shot and 10-shot experiments by randomly sampling one and ten utterances from each intent class in the training set. TOD-BERT-jnt has a 13.2% all-intent accuracy improvement and a 16.3% in-domain accuracy improvement over BERT in the 1-shot setting.
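The two out-of-scope prediction settings can be contrasted with a small sketch over classifier logits; the threshold value and the out-of-scope class index below are illustrative assumptions, not values used in these experiments.

```python
import torch

# Two illustrative ways to flag the out-of-scope (OOS) intent from classifier logits.
def predict_with_extra_class(logits):
    # Setting 1 (reported here): OOS is simply one more class, e.g., index 150 of 151.
    return logits.argmax(dim=-1)

def predict_with_threshold(logits, threshold=0.7, oos_id=150):
    # Setting 2: classify over in-scope intents only and back off to OOS when the
    # best in-scope softmax confidence falls below a threshold.
    probs = torch.softmax(logits[:, :oos_id], dim=-1)
    conf, pred = probs.max(dim=-1)
    pred[conf < threshold] = oos_id
    return pred
```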
6.3 Dialogue State Tracking

Two evaluation metrics are commonly used in the dialogue state tracking task: joint goal accuracy and slot accuracy. The joint goal accuracy compares the predicted dialogue states to the ground truth at each dialogue turn, where the ground truth includes slot values for all of the possible (domain, slot) pairs. The output is considered a correct prediction if and only if all the predicted values exactly match the ground truth values. The slot accuracy, on the other hand, individually compares each (domain, slot, value) triplet to its ground truth label.
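A minimal sketch of these two metrics is given below, assuming each turn's state is represented as a dict that maps every possible (domain, slot) pair to a value, with "none" marking unfilled slots; this data layout is an assumption for illustration.

```python
# Illustrative computation of the two DST metrics described above.
def joint_goal_accuracy(predictions, ground_truths):
    # A turn counts as correct only if the full predicted state matches exactly.
    correct = sum(pred == gold for pred, gold in zip(predictions, ground_truths))
    return correct / len(ground_truths)

def slot_accuracy(predictions, ground_truths, all_slots):
    # Each (domain, slot) pair is scored independently at every turn.
    correct, total = 0, 0
    for pred, gold in zip(predictions, ground_truths):
        for slot in all_slots:
            correct += pred.get(slot, "none") == gold.get(slot, "none")
            total += 1
    return correct / total
```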
In Table 5, we compare BERT to TOD-BERT-jnt on the MWOZ 2.1 dataset and find that the latter has a 2.4% joint goal accuracy improvement. Since the original ontology provided by Budzianowski et al. (2018) is not complete (some labeled values are not included in the ontology), we create a new ontology of all the possible annotated values. We also list several well-known dialogue state trackers as references, including DSTReader (Gao et al., 2019), HyST (Goel et al., 2019), TRADE (Wu et al., 2019), and ZSDST (Rastogi et al., 2019). We also report few-shot experiments using 1%, 5%, 10%, and 25% of the data. Note that 1% of the data corresponds to around 84 dialogues. TOD-BERT outperforms BERT in all settings, which further shows the strength of task-oriented dialogue pre-training.

Model             Joint Acc       Slot Acc
1% Data
  BERT            6.4% ± 1.4%     84.4% ± 1.0%
  TOD-BERT-mlm    9.9% ± 0.6%     86.6% ± 0.5%
  TOD-BERT-jnt    8.0% ± 1.0%     85.3% ± 0.4%
5% Data
  BERT            19.6% ± 0.1%    92.0% ± 0.5%
  TOD-BERT-mlm    28.1% ± 1.6%    93.9% ± 0.1%
  TOD-BERT-jnt    28.6% ± 1.4%    93.8% ± 0.3%
10% Data
  BERT            32.9% ± 0.6%    94.7% ± 0.1%
  TOD-BERT-mlm    39.5% ± 0.7%    95.6% ± 0.1%
  TOD-BERT-jnt    37.0% ± 0.1%    95.2% ± 0.1%
25% Data
  BERT            40.8% ± 1.0%    95.8% ± 0.1%
  TOD-BERT-mlm    44.0% ± 0.4%    96.4% ± 0.1%
  TOD-BERT-jnt    44.3% ± 0.3%    96.3% ± 0.2%
Full Data
  DSTReader*      36.4%           -
  HyST*           38.1%           -
  ZSDST*          43.4%           -
  TRADE*          45.6%           -
  GPT2            46.2%           96.6%
  DialoGPT        45.2%           96.5%
  BERT            45.6%           96.6%
  TOD-BERT-mlm    47.7%           96.8%
  TOD-BERT-jnt    48.0%           96.9%

Table 5: Dialogue state tracking results (joint goal accuracy and slot accuracy) on the MWOZ 2.1 dataset.
6.4 Dialogue Act Prediction

We conduct experiments on three different datasets and report micro-F1 and macro-F1 scores for the dialogue act prediction task, a multi-label classification problem. For the MWOZ dataset, we remove the domain information from the original system dialogue act labels. For example, "taxi-inform" is simplified to "inform". This process reduces the number of possible dialogue acts from 31 to 13. For the DSTC2 and GSIM corpora, we follow Paul et al. (2019) and apply the universal dialogue act mapping, which maps the original dialogue act labels to a general dialogue act format, resulting in 19 and 13 system dialogue acts for DSTC2 and GSIM, respectively. We run two other baselines, MLP and RNN, to further show the strengths of BERT-based models. The MLP model simply takes bag-of-words embeddings to make dialogue act predictions, and the RNN model is a bi-directional GRU network.
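The label simplification and the multi-label scoring can be illustrated with the short sketch below; the example act names and the indicator matrices are toy placeholders, and scikit-learn's f1_score is used only to show the micro/macro averaging.

```python
from sklearn.metrics import f1_score

# Illustrative domain stripping for MWOZ system act labels, e.g. "taxi-inform" -> "inform".
def strip_domain(act):
    return act.split("-")[-1]

acts = ["taxi-inform", "hotel-request", "restaurant-inform"]
print(sorted({strip_domain(a) for a in acts}))   # ['inform', 'request']

# Multi-label scoring: y_true / y_pred are binary indicator matrices
# of shape (num_examples, num_dialogue_acts).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
print(f1_score(y_true, y_pred, average="micro"),
      f1_score(y_true, y_pred, average="macro"))
```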
In Table 4, one can observe that in the full-data scenario, TOD-BERT consistently works better than BERT and the other baselines, regardless of the dataset or the evaluation metric. In the few-shot experiments, TOD-BERT-mlm outperforms BERT by 3.5% micro-F1 and 6.6% macro-F1 on the MWOZ corpus in the 1% data scenario. We also find that 10% of the training data can achieve performance close to full-data training.

                  MWOZ (13)                     DSTC2 (19)                    GSIM (13)
Model             micro-F1       macro-F1       micro-F1       macro-F1       micro-F1       macro-F1
1% Data
  BERT            84.0% ± 0.6%   66.7% ± 1.7%   77.1% ± 2.1%   25.8% ± 0.8%   67.3% ± 1.4%   26.9% ± 1.0%
  TOD-BERT-mlm    87.5% ± 0.6%   73.3% ± 1.5%   79.6% ± 1.0%   26.4% ± 0.5%   82.7% ± 0.7%   35.7% ± 0.3%
  TOD-BERT-jnt    86.9% ± 0.2%   72.4% ± 0.8%   82.9% ± 0.4%   28.0% ± 0.1%   78.4% ± 3.2%   32.9% ± 2.1%
10% Data
  BERT            89.7% ± 0.2%   78.4% ± 0.3%   88.2% ± 0.7%   34.8% ± 1.3%   98.4% ± 0.3%   45.1% ± 0.2%
  TOD-BERT-mlm    90.1% ± 0.2%   78.9% ± 0.1%   91.8% ± 1.7%   39.4% ± 1.7%   99.2% ± 0.1%   45.6% ± 0.1%
  TOD-BERT-jnt    90.2% ± 0.2%   79.6% ± 0.7%   90.6% ± 3.2%   38.8% ± 2.2%   99.3% ± 0.1%   45.7% ± 0.0%
Full Data
  MLP             61.6%          45.5%          77.6%          18.1%          89.5%          26.1%
  RNN             90.4%          77.3%          90.8%          29.4%          98.4%          45.2%
  GPT2            90.8%          79.8%          92.5%          39.4%          99.1%          45.6%
  DialoGPT        91.2%          79.7%          93.8%          42.1%          99.2%          45.6%
  BERT            91.4%          79.7%          92.3%          40.1%          98.7%          45.2%
  TOD-BERT-mlm    91.7%          79.9%          90.9%          39.9%          99.4%          45.8%
  TOD-BERT-jnt    91.7%          80.6%          93.8%          41.3%          99.5%          45.8%

Table 4: Dialogue act prediction results on three different datasets. The numbers reported are micro- and macro-F1 scores, and each dataset has a different number of dialogue acts.
6.5 Response Selection

To evaluate response selection in task-oriented dialogues, we follow the k-to-100 accuracy, which is becoming a research community standard (Yang et al., 2018; Henderson et al., 2019a). The k-of-100 metric is computed using a random batch of 100 examples, so that responses from other examples in the same batch can be used as random negative candidates. This allows us to compute the metric across many examples in batches efficiently. While it is not guaranteed that the random negatives will indeed be "true" negatives, the 1-of-100 metric still provides a useful evaluation signal. During inference, we run five different random seeds to sample batches and report the average results.
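A minimal sketch of this in-batch evaluation is shown below; the dual-encoder embeddings, their dimensionality, and the dot-product similarity are illustrative assumptions rather than the exact scoring function used in these experiments.

```python
import torch
import torch.nn.functional as F

# Illustrative k-of-100 evaluation with in-batch negatives: each context is scored
# against the 100 candidate responses in its batch, and we check whether the true
# response ranks within the top k.
def k_of_100_accuracy(context_emb, response_emb, k=1):
    # context_emb, response_emb: (100, hidden); row i of each belongs to the same pair.
    scores = context_emb @ response_emb.t()               # (100, 100) similarity matrix
    topk = scores.topk(k, dim=-1).indices                 # top-k response indices per context
    targets = torch.arange(scores.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Toy usage with random, L2-normalized embeddings standing in for encoder outputs.
ctx, rsp = torch.randn(100, 768), torch.randn(100, 768)
print(k_of_100_accuracy(F.normalize(ctx, dim=-1), F.normalize(rsp, dim=-1), k=3))
```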
In Table 6, we conduct response selection experiments on three datasets: MWOZ, DSTC2, and GSIM. TOD-BERT-jnt achieves 65.8% 1-to-100 accuracy and 87.0% 3-to-100 accuracy on MWOZ, surpassing BERT by 18.3% and 11.5%, respectively. Similar results are consistently observed on the DSTC2 and GSIM datasets, and the advantage of TOD-BERT-jnt is more evident in the few-shot scenarios. We do not report TOD-BERT-jnt in the MWOZ few-shot setting because the comparison would not be fair: the full MWOZ training set is used for response contrastive learning during the pre-training stage. The response selection results are sensitive to the training batch size, since the larger the batch size, the harder the prediction. In our experiments, we set the batch size to 25 for all models.

                  MWOZ                          DSTC2                         GSIM
Model             1-to-100       3-to-100       1-to-100       3-to-100       1-to-100       3-to-100
1% Data
  BERT            7.8% ± 2.0%    20.5% ± 4.4%   3.7% ± 0.6%    9.6% ± 1.3%    4.0% ± 0.4%    10.3% ± 1.1%
  TOD-BERT-mlm    13.0% ± 1.1%   34.6% ± 0.4%   12.5% ± 6.7%   24.9% ± 10.7%  7.2% ± 4.0%    15.4% ± 8.0%
  TOD-BERT-jnt    -              -              37.5% ± 0.6%   55.9% ± 0.4%   12.5% ± 0.9%   26.8% ± 0.8%
10% Data
  BERT            20.9% ± 2.6%   45.4% ± 3.8%   8.9% ± 2.3%    21.4% ± 3.1%   9.8% ± 0.1%    24.4% ± 1.2%
  TOD-BERT-mlm    22.3% ± 3.2%   48.7% ± 4.0%   19.0% ± 16.3%  33.8% ± 20.4%  11.2% ± 2.5%   26.0% ± 2.7%
  TOD-BERT-jnt    -              -              49.7% ± 0.3%   66.6% ± 0.1%   23.0% ± 1.0%   42.6% ± 1.0%
Full Data
  GPT2            47.5%          75.4%          53.7%          69.2%          39.1%          60.5%
  DialoGPT        35.7%          64.1%          39.8%          57.1%          16.5%          39.5%
  BERT            47.5%          75.5%          46.6%          62.1%          13.4%          32.9%
  TOD-BERT-mlm    48.1%          74.3%          50.0%          65.1%          36.5%          60.1%
  TOD-BERT-jnt    65.8%          87.0%          56.8%          70.6%          41.0%          65.4%

Table 6: Response selection evaluation results on three corpora for the 1%, 10%, and full data settings. We report 1-to-100 and 3-to-100 accuracy, which is similar to recall@1 and recall@3 given 100 candidates.
7 Visualization

In Figure 2, we visualize the embeddings of BERT, TOD-BERT-mlm, and TOD-BERT-jnt given the same input from the MWOZ test set. Each sample point is a system response representation, which is passed through a pre-trained model and reduced from a high-dimensional feature vector to a two-dimensional point using t-distributed stochastic neighbor embedding (tSNE) for dimensionality reduction. Since we know the true domain and dialogue act labels for each utterance, we use different colors to represent different domains and dialogue acts. As one can observe, TOD-BERT-jnt has clearer group boundaries than TOD-BERT-mlm, and both are better than BERT.

Figure 2: The tSNE visualization of BERT, TOD-BERT-mlm, and TOD-BERT-jnt representations of system responses in the MWOZ test set. Different colors in the left-hand column represent different domains, and in the right-hand column different dialogue acts.
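A minimal sketch of this projection is given below; the random embeddings, domain ids, perplexity value, and output filename are placeholders standing in for the actual model outputs and plotting settings.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Illustrative 2-D t-SNE projection of system-response embeddings, colored by domain.
embeddings = np.random.randn(500, 768)            # placeholder for model outputs (N, H)
domain_ids = np.random.randint(0, 5, size=500)    # placeholder domain label per response

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=domain_ids, s=5, cmap="tab10")
plt.savefig("tsne_responses.png")
```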
To analyze the results quantitatively, we run K-means, a common unsupervised clustering algorithm, on top of the output embeddings of BERT and TOD-BERT. We set K for K-means to 10 and 20. After clustering, we can assign each utterance in the MWOZ test set to a predicted class. We then compute the normalized mutual information (NMI) between the clustering result and the actual domain label of each utterance. We observe that TOD-BERT consistently achieves higher NMI scores than BERT: for K=10, TOD-BERT has a 0.143 NMI score while BERT only has 0.094; for K=20, TOD-BERT achieves a 0.213 NMI score while BERT has 0.109.
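The following sketch shows this kind of clustering analysis with scikit-learn; the placeholder embeddings and domain labels stand in for the encoder outputs and the MWOZ annotations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Illustrative clustering analysis: cluster utterance embeddings with K-means and
# measure agreement with the true domain labels via NMI.
embeddings = np.random.randn(1000, 768)        # placeholder for utterance embeddings
domain_labels = np.random.randint(0, 7, 1000)  # placeholder domain id per utterance

for k in (10, 20):
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    print(k, normalized_mutual_info_score(domain_labels, clusters))
```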
8 Conclusion

We propose task-oriented dialogue BERT (TOD-BERT), trained on nine human-human and multi-turn task-oriented datasets spanning over 60 domains. TOD-BERT outperforms BERT on four dialogue downstream tasks, including intent classification, dialogue state tracking, dialogue act prediction, and response selection. It also has a clear advantage in the few-shot experiments, where only limited labeled data is available. TOD-BERT is easy to deploy and will be open-sourced, allowing the NLP research community to apply it to or fine-tune it on any task-oriented conversational problem.
References

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.
Siqi Bao, Huang He, Fan Wang, and Hua Wu. 2019. PLATO: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931.
Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pre-trained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.
Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Andy Cedilnik, and Kyu-Young Kim. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042-13054.
Mihail Eric and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414.
Shuyang Gao, Abhishek Sethi, Sanchit Aggarwal, Tagyoung Chung, and Dilek Hakkani-Tur. 2019. Dialog state tracking: A neural reading comprehension approach. arXiv preprint arXiv:1908.01946.
Rahul Goel, Shachi Paul, and Dilek Hakkani-Tür. 2019. HyST: A hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883.
Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Ivan Vulić, et al. 2019a. ConveRT: Efficient and accurate conversational representations from transformers. arXiv preprint arXiv:1911.03688.
Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263-272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.
Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, and Pei-Hao Su. 2019b. Training neural response selection for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5392-5404, Florence, Italy. Association for Computational Linguistics.
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Adam Atkinson, Sungjin Lee, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, and Raghav Gupta. 2019. The eighth dialog system technology challenge. arXiv preprint.
Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, et al. 2019. An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027.
Sungjin Lee, Hannes Schulz, Adam Atkinson, Jianfeng Gao, Kaheer Suleman, Layla El Asri, Mahmoud Adada, Minlie Huang, Shikhar Sharma, Wendy Tay, and Xiujun Li. 2019. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges 8.
Xiujun Li, Sarah Panda, JJ (Jingjing) Liu, and Jianfeng Gao. 2018. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. In SLT 2018.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Nikola Mrkšić, Diarmuid O Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2016. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint arXiv:1606.03777.
Shachi Paul, Rahul Goel, and Dilek Hakkani-Tür. 2019. Towards universal dialogue act tagging for task-oriented dialogues. arXiv preprint arXiv:1907.03020.
Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2018. Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.
Marzieh Saeidi, Ritwik Kulkarni, Theodosia Togia, and Michele Sama. 2017. The effect of negative sampling strategy on capturing semantic similarity in document embeddings. In Proceedings of the 2nd Workshop on Semantic Deep Learning (SemDeep-2), pages 1-8.
Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. 2018a. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 41-51.
Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018b. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 808-819, Florence, Italy. Association for Computational Linguistics.
Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164-174, Melbourne, Australia. Association for Computational Linguistics.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19-27.