Recipes for building an open-domain chatbot
Stephen Roller Emily Dinan Naman Goyal Da Ju
Mary Williamson Yinhan Liu∗ Jing Xu Myle Ott
Kurt Shuster Eric M. Smith Y-Lan Boureau Jason Weston
Facebook AI Research
Abstract
Building open-domain chatbots is a challenging area for machine learning research. While
prior work has shown that scaling neural models in the number of parameters and the size
of the data they are trained on gives improved
results, we show that other ingredients are important for a high-performing chatbot. Good
conversation requires a number of skills that an
expert conversationalist blends in a seamless
way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately,
while maintaining a consistent persona. We
show that large scale models can learn these
skills when given appropriate training data and
choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B
parameter models, and make our models and
code publicly available. Human evaluations
show our best models are superior to existing
approaches in multi-turn dialogue in terms of
engagingness and humanness measurements.
We then discuss the limitations of this work by
analyzing failure cases of our models.
1 Introduction
In this work, we provide recipes for building open-domain chatbots that perform well in human evaluations. It has been shown across the field of NLP (Devlin et al., 2019) and in conversational agents in particular (Dinan et al., 2020; Zhang et al., 2019; Adiwardana et al., 2020) that pre-training on large corpora is important. Beyond simply scaling models, the two main takeaways from our study are:
1. Blending Skills
Large improvements can be made by finetuning on data that emphasizes desirable conversational skills. We select tasks that make
∗ Work done while at Facebook; currently AI2 Incubator.
Human: If you could have dinner with one person, dead or alive, who would that be?
Generative BST 9.4B: That's a tough one. I think I'd have to go with Steve Jobs. I'd love to pick his brain.
Human: Ah, interesting choice. What do you know about Steve Jobs?
Generative BST 9.4B: He was one of the most influential people in the history of technology. He was a visionary.
Human: What questions would you ask him?
Generative BST 9.4B: I'd want to know how he came up with the idea for the iPod. It was so revolutionary at the time.
Human: Totally...the iPod changed the way everyone listens to music.
Generative BST 9.4B: It really did. It's crazy to think that it was released in 2001. That was a long time ago.
Human: Omg wow, that makes me feel old!
Generative BST 9.4B: I know, right? Time flies by so fast. I wish I could go back to those simpler times.
Human: Me too, friend, me too. Speaking of music, what's your favorite album to listen to lately?
Generative BST 9.4B: I've been listening to a lot of Led Zeppelin lately. They're my favorite band.

Figure 1: Paper author (left) conversing with our 9.4B parameter model (right). This example was cherry-picked. We release conversation logs with crowdworkers with our code, along with lemon-picked examples in Sec. 10.5.
the model focus on personality and engagingness, knowledge, and empathy, achieving
large gains by using the recently introduced
Blended Skill Talk (BST) set-up (Smith et al.,
2020), which targets those aspects by providing training data and initial conversational
context (personas and topics). Small models using BST can match or outperform larger
models that do not. While BST emphasizes
desirable traits, we also show this tuning can
minimize undesirable traits learnt from large
corpora, such as toxicity.
2. Generation Strategies
The choice of decoding algorithm is of critical
importance, and two models with the same
perplexity but different decoding algorithms
can give vastly different results. In particular
we show that the length of the bot’s utterances
is crucial to human judgments of quality –
too short and the responses are seen as dull or
showing a lack of interest, too long and the
bot appears to waffle and not listen. We show,
contrary to previous work which reports that
beam search is inferior to sampling (Holtzman
et al., 2019; Adiwardana et al., 2020), that
careful choice of search hyperparameters can
give strong results by controlling trade-offs.
In particular, constraining the minimum beam
length gives a crucial control of the dull versus
spicy spectrum of responses.
Human evaluation results are highly dependent
on the precise set-up one chooses. Model performance can be strongly affected by the specific instructions given to evaluators, such as whether a topic is given or not, the overall conversation length, and the
choice of human interlocutors, which may be difficult to jointly account for. We report performance
when employing crowdworkers in short multi-turn
conversations with no prompt. However, in addition to that, we believe releasing models is the
most reliable way to enable full insight into their
capabilities. We thus make publicly available our
large-scale, state of the art open-domain conversational agent, including code to fine-tune it, the
model weights, and code to evaluate it, so that our
setup is reproducible. In human evaluations of
engagingness our best model outperforms Meena
(Adiwardana et al., 2020) in a pairwise comparison
75% to 25%, and in terms of humanness by 65%
to 35% (both statistically significant, two-tailed
binomial test, p < 0.01).
While the performance of our bot at first sight
is very good, we do not believe we are yet close
to solving the problem of open-domain conversation. We thus discuss limitations of our models,
and initial attempts to solve them. In particular,
our models still display: a lack of in-depth knowledge if sufficiently interrogated; a tendency to stick
to simpler language; and a tendency to repeat oft-used phrases. We show how unlikelihood training
and retrieve-and-refine mechanisms are potential
avenues for fixing these problems; however, our
initial experiments with these methods are inconclusive. We thus discuss future possibilities for
alleviating these problems as well as methods to
clearly expose and evaluate them.
Figure 2: The Poly-encoder Transformer architecture
(Humeau et al., 2019) for retrieval encodes global
features of the context using multiple representations
(codes), which are attended to by each possible candidate response. This final attention mechanism gives
improved performance over a single global vector representation, whilst being tractable to compute.
2 Model architectures
We consider three types of architectures in this
work: retrieval, generative, and retrieve-and-refine
models. All three use Transformers (Vaswani et al.,
2017) as a base.
2.1 Retriever
Given a dialogue history (context) as input, retrieval systems select the next dialogue utterance
by scoring a large set of candidate responses and
outputting the highest scoring one. Typically, all
possible training set responses are used as the candidate set.
We employ the poly-encoder architecture of Humeau et al. (2019). Poly-encoders encode
global features of the context using multiple representations (n codes, where n is a hyperparameter), which are attended to by each possible candidate response, see Figure 2. This final attention
mechanism gives improved performance over a
single global vector representation (so-called “biencoders”), whilst still being tractable to compute
compared to simply concatenating input and output as input to a Transformer (so-called “crossencoders”). The poly-encoder has state of the
art performance on a number of dialogue tasks
when compared to other retrieval models, and also
gives comparable performance to the winning generative models on the ConvAI2 competition task
(Zhang et al., 2018) in terms of human evaluation
(Li et al., 2019b). We consider two poly-encoder
sizes: a 256M parameter model (from Smith et al., 2020) and a 622M parameter model which we trained here, both using n = 64 codes.
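As a concrete illustration of the attention step described above, the following is a minimal PyTorch sketch of scoring candidates against a set of context codes; it assumes the context and candidate encoders have already produced the code and candidate vectors, and it is a simplification rather than the released implementation.

import torch

def poly_encoder_score(context_codes, candidate_vecs):
    """
    context_codes:  (n_codes, d)  global context features ("codes")
    candidate_vecs: (n_cands, d)  one vector per candidate response
    Returns one score per candidate.
    """
    # Each candidate attends over the context codes ...
    attn_logits = candidate_vecs @ context_codes.t()      # (n_cands, n_codes)
    attn = torch.softmax(attn_logits, dim=-1)
    # ... producing a candidate-specific context representation,
    ctx_for_cand = attn @ context_codes                   # (n_cands, d)
    # which is scored against the candidate by dot product.
    return (ctx_for_cand * candidate_vecs).sum(dim=-1)    # (n_cands,)

# toy usage: d = 8, n = 4 codes, 3 candidate responses
scores = poly_encoder_score(torch.randn(4, 8), torch.randn(3, 8))
best_candidate = scores.argmax().item()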
2.2 Generator
We employ a standard Seq2Seq Transformer architecture to generate responses rather than retrieve
them from a fixed set. Our implementation is based
on the ParlAI version (Miller et al., 2017). We use
Byte-Level BPE tokenization (Radford et al., 2019)
trained on the pre-training data, as implemented in
HuggingFace’s Tokenizers.1
We consider three sizes of model: 90M parameters (following Shuster et al., 2019), 2.7B parameters and 9.4B parameters. Our 9.4B parameter
model has a 4 layer encoder, a 32 layer decoder
with 4096 dimensional embeddings, and 32 attention heads. Our 2.7B parameter model roughly
mimics the architectural choices of Adiwardana
et al. (2020), with 2 encoder layers, 24 decoder
layers, 2560 dimensional embeddings, and 32 attention heads.
2.3 Retrieve and Refine
Current generative models are known to have issues
with producing dull and repetitive responses which
are improved, but not resolved, by simply scaling
(Holtzman et al., 2019; Welleck et al., 2020; Li
et al., 2019a). Additionally, generative models are
known to hallucinate knowledge, and in general are
unable to read and access external knowledge other
than what is embedded in their model parameters,
which may be imperfect. One approach to try to
alleviate these problems is to combine a retrieval
step before generation, referred to as a retrieve and
refine model (Weston et al., 2018). We consider
two variants for the retrieval step: dialogue retrieval
and knowledge retrieval.
Dialogue Retrieval We can simply use a retrieval-based dialogue model in the retrieval step, as in Sec. 2.1. Given the dialogue history, the retrieval model is first used to produce a response. Rather than showing this response to the speaking partner, it is appended to the input sequence of the generator, along with a special separator token. The generator then outputs a response as normal given this modified input sequence. Retrieval models produce human-written utterances, which tend to include more vibrant language than the most high probability utterances of a standard generative model. Hence, if the generative model learns when to copy the elements of such an utterance, and when not to, it can provide improved responses. To build such models, we use the architectures considered in the previous two sections for the two components of the model.
Knowledge Retrieval We can also use the same
mechanism to first retrieve from a large knowledge
base, instead of retrieving an initial dialogue utterance. We can then condition the generation on
the retrieved knowledge, as done in models proposed for the Wizard of Wikipedia task (Dinan
et al., 2019c). We hence refer to this as a Wizard
Generative model, as the supervised training signal of how to use knowledge in dialogue comes
from the Wizard of Wikipedia task, even though
we multi-task on other tasks as well. We use the
same retrieval system as in that cited work, which
uses a TF-IDF-based inverted index lookup over
a Wikipedia dump2 to produce an initial set of
knowledge candidates. A Transformer retriever
model (the same as Sec. 2.1) is then used to rank
the candidates and select a single sentence which
is used to condition generation. We additionally
trained a Transformer-based classifier to choose
when to perform retrieval or not on a per-turn basis,
as some contexts do not require knowledge. This
was trained as a two-class classifier discriminating
between contexts that require knowledge or not in
our fine-tuning tasks, to be described in the next
section. We note all other models in this work do
not condition on retrieved knowledge.
3 Training Objectives

3.1 Ranking for Retrieval
To train the retrieval models, a cross-entropy loss is minimized in which the logits are $y_{cand_1}, \dots, y_{cand_n}$, where $y_{cand_1}$ is the score of the correct response and the others are sampled negatives. Following Humeau et al. (2019), during
training we use the other responses in the batch for
negatives. This allows for much faster training, as
we can reuse the embeddings computed for each
candidate, and also use a larger batch size. In our
training we are able to use batches of 512 elements.
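The in-batch negative scheme above can be sketched as follows; this is a minimal illustration assuming the contexts and gold responses of a batch have already been encoded into vectors, with the exact poly-encoder attention elided, and the embedding dimension (768) is an assumption.

import torch
import torch.nn.functional as F

def in_batch_ranking_loss(context_vecs, response_vecs):
    """
    context_vecs:  (B, d) encoded dialogue contexts
    response_vecs: (B, d) encoded gold responses (reused as negatives)
    """
    # Score every context against every response in the batch.
    logits = context_vecs @ response_vecs.t()      # (B, B)
    # The correct response for context i is response i;
    # all other responses in the batch act as negatives.
    targets = torch.arange(context_vecs.size(0))
    return F.cross_entropy(logits, targets)

# e.g., a batch of 512 elements, as used in our training
loss = in_batch_ranking_loss(torch.randn(512, 768), torch.randn(512, 768))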
1 https://github.com/huggingface/tokenizers
2 https://parl.ai/projects/wizard_of_wikipedia/
3.2 Likelihood Training for Generation
To train the generative models, we use the standard
Maximum Likelihood Estimation (MLE) approach.
Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$, minimize:

$$\mathcal{L}_{\text{MLE}}(p_\theta, x^{(i)}, y^{(i)}) = -\sum_{t=1}^{|y^{(i)}|} \log p_\theta\big(y_t^{(i)} \mid x^{(i)}, y_{<t}^{(i)}\big),$$

where $x^{(i)}$ is a gold input context, $y^{(i)}$ is a gold next-utterance, and $y_t^{(i)}$ is the $t$-th token of $y^{(i)}$.
3.3 α-blending for Retrieve and Refine
For retrieve and refine, simply appending dialogue
retrieval responses to the context of a generative
model and training with MLE unfortunately does
not yield satisfying results. As the correspondence
between gold label and retrieved utterance is not
necessarily clear, a trained model often opts to simply ignore the retrieval utterance, as was shown in
Weston et al. (2018). To ensure it is used, one can
replace the retrieved response instead with the gold
response α% of the time, treating α as a hyperparameter to be tuned. This gives a smooth transition
between retrieval and generator-only systems. For
knowledge retrieval we find this issue to be less
of a problem as the fine-tuning datasets used have
a clear correspondence between gold knowledge
conditioning and response, and in that case we only
use the gold knowledge during training.
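A minimal sketch of how such α-blended training inputs could be assembled for the dialogue retrieve-and-refine model is given below; the separator string and function name are illustrative assumptions, not the released code.

import random

SEP = " __retrieved__ "  # hypothetical separator token

def build_retnref_input(context, retrieved_response, gold_response, alpha=0.5):
    """Append either the retrieved utterance or, with probability alpha,
    the gold response to the generator's input, so the model learns that
    the appended text is worth copying from rather than ignoring it."""
    appended = gold_response if random.random() < alpha else retrieved_response
    return context + SEP + appended

# The generator is then trained with MLE to produce gold_response from
# build_retnref_input(context, retrieved_response, gold_response).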
3.4 Unlikelihood training for generation
An alternative method to combat the failures in
model generations is to change the loss function.
The unlikelihood loss (Welleck et al., 2020; Li et al.,
2019a) has been shown to help fix mismatches between human and model distributions across various axes, including decreasing repetitions and mitigating the issue of overrepresented vocabulary tokens.
The unlikelihood loss penalizes a set of tokens $\mathcal{C}_t$ at each time-step:

$$\mathcal{L}_{\text{UL}}^{(i)}(p_\theta, \mathcal{C}_{1:T}, x, y) = -\sum_{t=1}^{|y|} \sum_{y_c \in \mathcal{C}_t} \log\big(1 - p_\theta(y_c \mid x, y_{<t})\big),$$

where $\mathcal{C}_t \subseteq V$ is a subset of the vocabulary. The overall objective in unlikelihood training then consists of mixing the likelihood and unlikelihood losses,

$$\mathcal{L}_{\text{ULE}}^{(i)} = \mathcal{L}_{\text{MLE}}^{(i)} + \alpha \mathcal{L}_{\text{UL}}^{(i)}, \qquad (1)$$

where $\alpha \in \mathbb{R}$ is the mixing hyper-parameter.
Likelihood tries to model the overall sequence
probability distribution, while unlikelihood corrects for known biases. It does this via the set
of negative candidates Ct calculated at each step t;
typically one specifies in advance a method for
generating such candidates, for example the tokens
which have been repeated or overrepresented. Likelihood pushes up the probability of a gold token
$y_t^{(i)}$ while unlikelihood pushes down the probability of negative candidate tokens $y_c \in \mathcal{C}_t$. In this work
during training we keep a running count of the distribution of n-grams that appear when generating
from the model, and choose tokens as negative candidates from these n-grams when their counts are
above the human distribution counts as measured
from the gold responses.
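A minimal sketch of the mixed objective in Eq. (1) for a single sequence is shown below; it assumes the negative candidate sets C_t have already been chosen (e.g., tokens completing over-represented n-grams), and the tensor layout is an assumption.

import torch
import torch.nn.functional as F

def ul_mixed_loss(logits, gold_tokens, neg_candidates, alpha=0.25):
    """
    logits:         (T, V) model scores at each time step
    gold_tokens:    list/tensor of T gold next-token ids y_t
    neg_candidates: list of length T; neg_candidates[t] holds the token ids
                    in C_t to penalize at step t
    """
    log_probs = F.log_softmax(logits, dim=-1)   # (T, V)
    probs = log_probs.exp()

    # Likelihood term: -sum_t log p(y_t | ...)
    mle = -log_probs[torch.arange(len(gold_tokens)), gold_tokens].sum()

    # Unlikelihood term: -sum_t sum_{c in C_t} log(1 - p(c | ...))
    ul = 0.0
    for t, cands in enumerate(neg_candidates):
        for c in cands:
            ul = ul - torch.log1p(-probs[t, c] + 1e-8)

    return mle + alpha * ul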
4 Decoding
For generative models, at inference time, one must
choose a decoding method to generate a response
to the dialogue context given as input. In this work
we compare a number of well-known approaches.
4.1 Beam Search
Two widely used deterministic decoding approaches are greedy search and beam search. The
former can be seen as a special case of the latter.
Greedy search selects the highest probability token at each time step: $y_t = \arg\max p_\theta(y_t \mid x, y_{<t})$. Beam search maintains a fixed-size set of partially-decoded sequences, called hypotheses. At each time step, beam search forms new hypotheses by appending each token in the vocabulary to each existing hypothesis, scoring the resulting sequences, and then selecting the highest scoring ones.
We compare beam search for different beam
sizes in our experiments.
4.2 Sampling
An alternative is to sample from a model-dependent distribution at each step, $y_t \sim q(y_t \mid x, y_{<t}, p_\theta)$. In order to prevent sampling low probability tokens, a typical approach is to restrict sampling to a subset of the vocabulary at each step, and to sample according to those (renormalized) probabilities.
For sampling methods, we will compare top-k
sampling (Fan et al., 2018) and sample-and-rank
(Adiwardana et al., 2020). The latter performs
sampling S times, and selects the generated sample
with the highest probability.
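The two sampling schemes can be sketched as follows; step_log_probs is an assumed callback that returns next-token log-probabilities for a given prefix, and real implementations operate on batched tensors rather than Python dictionaries.

import math
import random

def top_k_sample(step_log_probs, prefix, k=40, max_len=128, eos=2):
    """Sample a sequence, restricting each step to the k most likely tokens."""
    seq, total_logp = list(prefix), 0.0
    for _ in range(max_len):
        logps = step_log_probs(seq)                  # dict: token id -> log p
        top = sorted(logps.items(), key=lambda kv: kv[1], reverse=True)[:k]
        # renormalize over the top-k subset and sample from it
        z = sum(math.exp(lp) for _, lp in top)
        r, acc = random.random() * z, 0.0
        for tok, lp in top:
            acc += math.exp(lp)
            if acc >= r:
                break
        seq.append(tok)
        total_logp += lp
        if tok == eos:
            break
    return seq, total_logp

def sample_and_rank(step_log_probs, prefix, num_samples=20, k=40):
    """Draw several samples and keep the one with highest model probability."""
    samples = [top_k_sample(step_log_probs, prefix, k) for _ in range(num_samples)]
    return max(samples, key=lambda s: s[1])[0]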
4.3 Response Length
Generating with a beam tends to produce short
generations that do not match the length statistics of
the human utterances they were trained on (Weston
et al., 2018). However, longer responses, if of high
quality, can be more engaging than very short ones.
While following the human distribution may not
give optimal performance for a bot – for example, it
may want to err on the side of brevity for improved
human evaluation, because that is less likely to
expose its failings – making its responses longer
may make them provide more information, and
make them less dull.
We consider two simple methods to control the
length of a model’s responses.
Minimum length The first method we consider
is a hard constraint on the minimum generation
length: the end token is forced to not be generated
until a minimum sequence length is achieved.
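One common way to realize this constraint is to mask the end-of-sequence token until the minimum length is reached, as in the minimal sketch below; the tensor shapes and the eos_id argument are assumptions.

import torch

def apply_min_length(logits, step, min_length, eos_id):
    """
    logits: (num_hyps, vocab) scores for the next token at this decoding step.
    Forbids generating EOS before `min_length` tokens have been produced,
    so beam search cannot end the response early.
    """
    if step < min_length:
        logits[:, eos_id] = -float("inf")
    return logits

# inside a decoding loop one would call, before selecting the next token:
#   logits = apply_min_length(logits, step, min_length=20, eos_id=EOS)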
Predictive length The second approach is to predict the length based on human-human conversation data. To do this we train a 4-class classifier by
binning the lengths of the next conversation turn
(e.g., < 10, < 20, < 30, or > 30 tokens). We use
the same architecture as the retrieval model for
this classifier. Then, at test time, the classifier is first used to predict the length of the next response, and the minimum generation length constraint is set to the corresponding prediction. Unlike the previous approach, this results in more natural variable
length conversation turns, whilst ensuring long responses when they seem natural. One drawback,
however, is that this procedure makes our system
more complex.
4.4 Subsequence Blocking
Sequence generation models are known to repeat subsequences (Holtzman et al., 2018), particularly in deterministic methods such as beam search, but also in sampling methods (Adiwardana et al., 2020). We implement standard beam blocking of
n-grams (Paulus et al., 2017) and use n = 3. We
consider both blocking repeated n-grams within
the generated utterance, and repeating of the input
sequence (previous utterances from either speaker).
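A minimal sketch of such n-gram blocking is given below: before extending a hypothesis, any token that would complete a 3-gram already present in the hypothesis (or, for context blocking, in the previous utterances) is disallowed. The function and variable names are illustrative.

def banned_next_tokens(hyp_tokens, context_tokens=None, n=3):
    """Return the set of tokens that would repeat an n-gram if appended
    to `hyp_tokens`. Pass `context_tokens` to also block n-grams that
    occur in the previous utterances (context blocking)."""
    if len(hyp_tokens) < n - 1:
        return set()
    prefix = tuple(hyp_tokens[-(n - 1):])        # last n-1 generated tokens
    banned = set()
    for source in (hyp_tokens, context_tokens or []):
        for i in range(len(source) - n + 1):
            ngram = tuple(source[i:i + n])
            if ngram[:-1] == prefix:
                banned.add(ngram[-1])
    return banned

# During beam search, tokens in banned_next_tokens(...) are given a score
# of -inf so they are never selected.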
5 Training Details
We detail the techniques we employ during pre-training and fine-tuning.
Pre-training Ranking models. We perform pretraining using the Fairseq (Ott et al., 2019) toolkit.
Our 256M parameter ranking model is identical to
the pre-trained model released by Humeau et al.
(2019). Our 622M model is pre-trained using a
simple Masked Language Model objective on the
same data and dictionary as the large Generative
models. We took all hyperparameter choices from
those recommended in RoBERTa (Liu et al., 2019).
Pre-training Generative models. We perform
pre-training using the Fairseq (Ott et al., 2019)
toolkit. Our 2.7B and 9.4B parameter models were
both trained using the Adam optimizer (Kingma
and Ba, 2014). In order to fit the larger models onto
nodes, we utilize Megatron-LM style model parallelism (Shoeybi et al., 2019), in which the Feed
Forward network (FFN) and Multihead Attention
layers of the Transformer are “vertically” sliced,
minimizing the need for communication across
GPUs. We also evaluated Adafactor (Shazeer and
Stern, 2018), which allows for larger batch sizes,
but we found it converged to a worse place than
Adam. In all cases, we use a variant of mixed precision training (Micikevicius et al., 2017), storing
gradients and optimizer state in FP32, but accumulating model parameters directly in FP16 (Ott et al.,
2019). A dynamic loss scalar is utilized to prevent gradient underflow (Micikevicius et al., 2017).
Both our 2.7B and 9.4B parameter models were
trained with batches of approximately 500k label BPE tokens. The 2.7B parameter model
trained for approximately 200k SGD updates with a
maximum learning rate of 2e-4, a linear warmup of
3125 steps, and an invsqrt LR scheduler (Vaswani
et al., 2017); the model had not converged when
we stopped. The 9.4B parameter model was trained
with a maximum learning rate of 1.15e-4 and 2400
warmup steps for a total of 200k SGD updates, and
did not appear to be overfitting.
Fine-tuning. We fine-tune our models using the
ParlAI toolkit (Miller et al., 2017), which specializes in training and evaluating dialogue models. As opposed to the above pre-training, we utilize GPipe-style model parallelism (Huang et al.,
2019), in which full layers are sharded across different GPUs, and each minibatch is further split
into micro-batches to ensure maximum throughput.
As in pre-training, we found that Adam outperformed Adafactor during fine-tuning, and we utilized Fairseq-style mixed precision training. Models were fine-tuned to convergence, with maximum
learning rates of between 1e-6 and 1e-5.
Figure 3: Sample conversation from the Blended Skill Talk dataset, which blends three skills that previous datasets
(ConvAI2, WoW, ED) have focused on. Individual utterances are annotated with the single-skill datasets they
are reminiscent of. The conversation here has been seeded with two utterances from WoW. For details about the
Guided and Unguided workers (U,G) set up, see Smith et al. (2020).
6 Training Data
We next discuss the training data we use, which is
all in English (#BenderRule).
6.1 Pre-training
pushshift.io Reddit We use a variant of Reddit
discussions, which has also been used in several
existing studies, see e.g. Yang et al. (2018); Mazaré
et al. (2018); Keskar et al. (2019); Shuster et al.
(2019). Following Humeau et al. (2019), we use
a previously existing Reddit dataset extracted and
obtained by a third party and made available on
pushshift.io (Baumgartner et al., 2020), training to
generate a comment conditioned on the full thread
leading up to the comment, spanning 1.5B training
examples from Reddit obtained from PushShift3
through July 2019. The subreddits cover a vast
range of topics, and hence the dataset is a good
candidate for helping train a dialogue model in
the open-domain case. We apply heuristic rules
to filter the dataset with the goal of providing a
cleaner training signal. We remove the comment
and all subsequent child comments if any of the
following conditions are met:
1. The author is a known bot.
2. It comes from a known non-English subreddit.
3. The comment is marked as removed / deleted.
4. It is longer than 2048 characters and does not contain spaces.
5. It is longer than 128 BPE tokens.
6. It is shorter than 5 characters.
7. It contains a URL.
8. It starts with a non-ASCII character.
9. It is further than depth 7 in the thread.
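A sketch of how these heuristics might be expressed as a single predicate over a comment record follows; the dictionary field names and the stand-in tokenizer are assumptions for illustration, not the exact preprocessing code.

def bpe_tokenize(text):
    # stand-in for the byte-level BPE tokenizer used in the paper
    return text.split()

def should_remove(comment, known_bots, non_english_subreddits, depth):
    """Return True if the comment (and all its child comments) should be dropped."""
    body = comment["body"]
    return (
        comment["author"] in known_bots                      # 1. known bot
        or comment["subreddit"] in non_english_subreddits    # 2. non-English subreddit
        or body in ("[removed]", "[deleted]")                 # 3. removed / deleted
        or (len(body) > 2048 and " " not in body)             # 4. very long with no spaces
        or len(bpe_tokenize(body)) > 128                      # 5. > 128 BPE tokens
        or len(body) < 5                                      # 6. < 5 characters
        or "http" in body                                     # 7. contains a URL (crude check)
        or (body and ord(body[0]) > 127)                      # 8. starts with non-ASCII
        or depth > 7                                          # 9. deeper than depth 7
    )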
Models were trained with maximum context and
response lengths set to 128 BPE tokens, and longer
examples were truncated. Our final dataset contains
1.50B comments totaling 56.8B label BPE tokens
and 88.8B context tokens.4 We divide the corpus
into 4096 roughly-equal sized chunks, stratified by
thread ID (such that no two comments from the
same post appear across folds), and reserve the
last two chunks for validation and test respectively,
each approximately 0.02% of the full dataset (∼
360k comments each).
3 https://files.pushshift.io/reddit/
4 Note that the 90M model discussed later in the paper uses a variant of the corpus with less filtering. See Shuster et al. (2019) for details.
6.2 Fine-tuning
Our pre-training data, though large, consists of group discussions rather than direct two-way conversational data. While it has a lot of
useful content, it also still has a lot of noise, even
after filtering. In contrast, the academic community has produced a number of smaller, but cleaner,
more focused tasks, typically collected via crowdworkers, which have been made publicly available.
These tasks can more accurately provide traits that
are desirable for our models. For example, the
ConvAI2 dataset (Zhang et al., 2018) focuses on
personality and engaging the other speaker, Empathetic Dialogues (Rashkin et al., 2019) focuses on
empathy, and Wizard of Wikipedia (Dinan et al.,
2019c) focuses on knowledge. Finally, Blended
Skill Talk (Smith et al., 2020) provides a dataset
that focuses on blending these skills.
ConvAI2: ConvAI2 is a dataset used at the
NeurIPS 2018 competition of the same name, and
is based on PersonaChat (Zhang et al., 2018; Dinan
et al., 2020). The training data of 140k utterances
involves paired crowdworkers having a conversation where they get to know each other, in which
each is given a role to play based on sentences describing their persona, which were also separately
crowdsourced (both speakers can see their own
persona description, but cannot see their partner’s
persona). The task thus involves getting to know
the other speaker and engaging them in friendly
conversation, both asking and answering questions
– useful skills for an open-domain conversational
agent. Models trained on this task are thus conditioned on the persona and the dialogue history,
which are concatenated. It was previously shown
this dataset helps provide more engaging dialogue,
and that the use of persona gives improved consistency for the bot.
Empathetic Dialogues (ED): Rashkin et al.
(2019) constructed the Empathetic Dialogues
dataset, which consists of 50k utterances of crowdworker conversations grounded in an emotional
situation. In each dialogue, one speaker describes
a personal situation and the other plays a “listener”
role, displaying empathy during the discussion.
Trained models are measured playing the part of
the empathetic listener. It was previously shown
fine-tuning models on this dataset helps them display more empathy in human evaluations.
Wizard of Wikipedia (WoW): The Wizard of
Wikipedia task involves discussing a given topic
in depth, where the goal is to both engage the partner as well as display expert knowledge (Dinan
et al., 2019c). The dataset consists of 194k utterances over 1250 topics, where each conversation
begins with a randomly chosen topic. A retrieval system over Wikipedia was used to ground the dialogues during the human-human crowdsourced conversations. The topics were also
crowdsourced and range from e-books to toga parties to showers. In most of our models we use
the simpler version of the task where we only use
the final conversations for fine-tuning, ignoring the
retrieval aspect of the task. For our knowledge retrieve and refine model (Sec. 2.3) we do also use
the gold retrieved knowledge (“checked sentence”)
for training the retrieval system. It was previously shown that generative models using such knowledge were rated higher in human evaluations than those without it when discussing topics in depth.
Blended Skill Talk: Blended Skill Talk (Smith
et al., 2020) aims to blend the previous three tasks
to combine the skills from them (engaging personality from ConvAI2, empathy from ED, and
knowledge from WoW) seamlessly during dialogue.
To that end, a dialogue dataset of 76k utterances
was collected with a guided and unguided human
speaker, where the guided speaker could select utterances suggested by bots trained on the three individual tasks, see Figure 3. It was shown that this
additional blended data, multi-tasked with the previous three tasks, helped maintain all three skills in
open-domain dialogue. In subsequent experiments
we will refer to the “BST tasks” as training on all
four tasks together.
In each blended dialogue, the model is provided
a two sentence persona to condition on following
PersonaChat, and additionally during one third of
the conversations a WoW topic name as well (see
Figure 3). During evaluations, we equip our models
with randomly chosen personas and, one third of
the time, topics from this set as well, mirroring the
way the model is trained.
7 Safety Characteristics
As models are trained to mimic human-human conversations, they can sometimes learn undesirable
features from this human-human data, such as the
use of toxic or biased language. The BST tasks
we use for fine-tuning were collected from crowd-
workers who were given explicit instructions to not
use such language, and hence are generally safer
than our pre-training data from pushshift.io Reddit.
Nevertheless, issues can still remain.
We have previously investigated building better
classifiers of toxic language by collecting adversarial toxic data that fools existing classifiers and
is then used as additional data to make them more
robust, in a series of rounds (Dinan et al., 2019b).
We can apply such a classifier at test time to detect toxic language before it is shown, but we note
that such classifiers are still not infallible. In our
experiments section we will gauge how often such
classifiers flag responses generated from the models.
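The kind of test-time gate this implies is sketched below: a generated response is suppressed if it contains a word from an unsafe list or is flagged by a trained safety classifier. The classifier interface and the canned fallback response are illustrative assumptions, not the ParlAI API.

def is_unsafe(response, unsafe_words, classifier, threshold=0.5):
    """Flag a response using both an unsafe word list and a learned classifier."""
    tokens = set(response.lower().split())
    if tokens & unsafe_words:
        return True
    # classifier is assumed to return P(unsafe | text)
    return classifier(response) > threshold

def safe_reply(response, unsafe_words, classifier):
    if is_unsafe(response, unsafe_words, classifier):
        # fall back to a canned, safe utterance (illustrative choice)
        return "Hey, do you want to talk about something else?"
    return response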
We have also previously conducted studies into
mitigating gender bias in dialogue through the use
of conditional generation, controlling the amount
of gendered words to be more neutral, with preliminary success (Dinan et al., 2019a). This is not
currently added to the system described in this paper, but should be considered for future updates.
8 Evaluation Methods
ACUTE-Eval While we employ and report automatic metrics, our main evaluation involves the
ACUTE-Eval procedure (Li et al., 2019b), whereby
evaluators are asked to make pairwise evaluations
of complete dialogues. An example of ACUTEEval is shown in Figure 4. ACUTE-Eval affords advantages over both single-turn pairwise and multiturn Likert evaluations. The explicit use of comparisons avoids the per annotator bias in numerical
(Likert) scores (e.g., annotators who tend to give
generous scores), and remedies many of the issues of sequential effects such as contrasting with
a previous example (Mathur et al., 2017), while
still providing the ability to expose issues that are
present only in multi-turn evaluations.
Furthermore, the pairwise setup facilitates replication and efficient reuse of data: conversations
collected in previous trials and by other systems
can be directly compared with a new system, without having to recollect additional data. This can
significantly reduce the resources needed by a new
evaluation, and ensure that multiple papers are comparing to prior work consistently. In particular, this
makes it possible to compare to logs from Meena
(Adiwardana et al., 2020) even though the model
itself has not been made publicly available.
Figure 4: ACUTE-Eval has human annotators directly compare multi-turn conversations with different systems.

We consider two evaluation questions, derived from Li et al. (2019b):
• Engagingness question: “Who would you prefer to talk to for a long conversation?”
• Humanness question: “Which speaker sounds more human?”
The phrasing of these questions was itself optimized in that work to maximize agreement, and we hence re-use those exact phrasings. It was
shown that different phrasings can result in weaker
levels of agreement, and that engagingness and
humanness clearly do not measure the same thing.
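Significance of such pairwise match-ups (e.g., the 75% vs. 25% engagingness result against Meena) can be checked with a two-tailed binomial test on the raw win counts, as in this small SciPy sketch; the counts used here are made up for illustration.

from scipy.stats import binomtest

def acute_eval_significance(wins, total):
    """Two-tailed binomial test of the null hypothesis that either
    system is equally likely to win a pairwise ACUTE-Eval trial."""
    result = binomtest(wins, n=total, p=0.5, alternative="two-sided")
    return wins / total, result.pvalue

# e.g., 105 wins out of 140 trials (illustrative numbers)
win_rate, p = acute_eval_significance(105, 140)
print(f"win rate {win_rate:.0%}, p = {p:.4g}")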
Self-Chat ACUTE-Eval Nevertheless, full human evaluations are time consuming and costly,
requiring humans to spend time conducting conversations with bots as well as scoring them. As an
alternative, it was shown in Li et al. (2019b) that
ACUTE-Eval can also work in “self-chat” mode,
where models are used for both sides of a conversation, instead of human-model chat. This eliminates
the requirement of the initial chat collection, and
conversations may be generated without human
involvement, dramatically reducing the resource
requirements of evaluation. Results from self-chat
experiments highly correlate with those of humanchat experiments, for most, but not all systems (Li
et al., 2019b). This mirrors other successes in using
self-play, self-chat, and simulated users to evaluate
dialogue systems (Fazel-Zarandi et al., 2017; Shah
et al., 2018a,b; Wei et al., 2018; Ghandeharioun
et al., 2019). We use this procedure for some of
our modeling and hyperparameter choices where
the full ACUTE-Eval would end up too costly, and
only use the full human-bot chat evaluation at the
final stage. In this work we use the BST-setting
to perform self-chats, i.e. models are given the
personas, topics and previous utterances to initiate the conversation, see Section 6.2 and Figure 3.
Note that when using deterministic methods such
as beam decoding, this prevents the models from
generating the same conversation repeatedly.
9 Related Work
The area of open-domain dialogue has made significant progress recently with end-to-end neural
approaches. The ConvAI2 competition at NeurIPS
2018 featured large pre-trained Transformers for
the top two winning teams (Dinan et al., 2020).
In particular, Wolf et al. (2019) pre-trained via
the method of Radford et al. (2018) using the
BooksCorpus dataset, resulting in the best perplexities and F1 scores. Since then, results have
improved further with the advent of larger, improved pre-training (Lewis et al., 2019; Shuster
et al., 2019). In general this extends beyond ConvAI2 to many open-domain dialogue datasets, such
as DailyDialog and Cornell Movies (He et al.,
2019), and also when multi-tasking across many of
these datasets, as we also do here (Shuster et al.,
2019; Smith et al., 2020).
A particular large-scale model of note that we
compare to in this work is Meena (Adiwardana
et al., 2020), a 2.6B parameter Transformer-based
model trained on 341 GB of text, that was shown to
be superior to variants of DialoGPT (Zhang et al.,
2019), Mitsuku5 , Cleverbot6 , and XiaoIce (Shum
et al., 2018; Zhou et al., 2020). The evaluation
metric used was SSA, the average of sensibleness
and specificity, as judged by human raters either
in static or interactive setups, which is shown to
highly correlate with asking raters how “humanlike”
the model is. We note however that the authors
themselves state it may not capture all aspects of
such a test, e.g. might not measure empathy. We
additionally note that neither Meena’s model, the
static “Mini Turing Benchmark” used in the paper,
nor the phrasing of the SSA evaluation question
provided to annotators was released, making certain comparisons difficult.

5 https://www.pandorabots.com/mitsuku/
6 https://www.cleverbot.com/
Model   ConvAI2 (K = 20)   WoW (K = 100)   ED (K = 100)   BST (K = 100)
256M    88.55              91.70           62.67          83.45
622M    89.96              93.22           70.15          82.11

Table 1: Hits@1/K of fine-tuned poly-encoder models on the validation set for BST datasets. Hits@1/K measures recall@1 when ranking the gold label among a set of K − 1 other random candidates.
Further, the human-bot
conversations were conducted by employees and
were not blind to the model type (in the logs they
say phrases such as “Hi Meena!”). In this work we
employ unbiased crowdworkers with reproducible
experiments, and use ACUTE-Eval (Sec. 8) to directly ask the humanness question, rather than a
proxy. Further, we also report results on engagingness as a main metric, because this measures
more closely whether a human will be interested in
talking to our bots.
10 Results & Analysis
We first present automatic evaluation results using
various metrics. As these are only ever a proxy
for human judgments on conversational quality, we
perform human evaluations and describe the results
in the subsequent sections.
10.1 Automatic Evaluations
Retriever We fine-tune the retrieval models on
ConvAI2, Wizard of Wikipedia, Empathetic Dialogues, and Blended Skill Talk datasets (BST variants of each7 ) and automatically evaluate them by
measuring hits@1/K on the validation sets of each
of these datasets. Results are shown in Table 1.
Generator Before fine-tuning, we assess the performance of our 90M, 2.7B, and 9.4B parameter
models by measuring perplexity on the validation
set from pushshift.io Reddit. For the 90M parameter model, results are reported from Shuster et al.
(2019), as we use that same model. Results are
shown in Table 2. Training curves for the pretrained models are also provided in Figure 5. We
note that the perplexity of our 2.7B and 9.4B parameter models are not directly comparable to that
of the 90M parameter model, as these models do
not share the same dictionary.
7 https://parl.ai/projects/bst
Figure 5: Validation PPL of different sized models (2.7B and 9.4B) on pushshift.io Reddit, plotted against SGD steps (thousands). The larger model achieves a better performance in fewer steps, consistent with other works (Kaplan et al., 2020; Li et al., 2020).
We also report perplexity both before and after
fine-tuning each of these models on the ConvAI2,
Wizard of Wikipedia, Empathetic Dialogues, and
Blended Skill Talk datasets. Results are shown in
Table 3. They show that fine-tuning gives relatively
large improvements in perplexity on these tasks,
which could hence translate into improved ability
at these skills when conducting open-domain dialogue.
Retrieve and Refine (RetNRef) We also report
perplexity on each of these datasets for our dialogue retrieve and refine variants in Table 3. We
note a small increase in perplexity – relative to
the standard generator models – on each of these
datasets. This small increase in perplexity was
also observed in Weston et al. (2018), even though
the retrieve and refine models outperformed the
baseline generator models in human evaluations
in those experiments. As such, we cannot rely on
automatic evaluations alone to assess the relative
performance of retrieve and refine and generator
models.
Safety We also analyzed the behavior of some
of our generative models in terms of unsafe generated sequences. We produced generations given
pushshift.io Reddit and ConvAI2 validation set
contexts using our 90M parameter models with
and without BST fine-tuning. We then assessed
whether those generations were safe or not using
two different methods: using an unsafe word list,
or the safety classifier of Dinan et al. (2019b), both
methods being available in ParlAI (Miller et al.,
2017). We also compare our generations to the
gold human responses, assessing whether they, too, are safe or not.
The results are given in Table 4. First, they show
humans do utter unsafe responses, which our models will likely imitate if provided in their training
data. ConvAI2, one of the BST datasets, contains
far fewer unsafe utterances from humans than
pushshift.io Reddit. This explains why, when we
fine-tune our models on the BST tasks, they also
reply with fewer unsafe utterances than models
trained on pushshift.io Reddit alone.
While lists of banned words are easier to filter
out of training, unsafe utterances consisting of otherwise safe words are harder to avoid – which is
what the safety classifier used can also detect. We
note that simply training on filtered data would not
solve this problem due to the tendency of generative models to copy their current context, so at
deploy time, they could still be provoked by unsafe
user contexts. We can of course apply these safety
classifiers at test/deploy time to further reduce the
unsafe responses from these models, but note that if
the classifier is erroneous, unsafe utterances could
still get through.
10.2 Self-Chat Evaluations
We next perform a number of self-chat ACUTEEvals (see Sec. 8) over various modeling choices,
using the engagingness question and ∼140 trials
per pair compared. This serves as an efficient alternative to a full evaluation in order for us to perform
model selection over a large number of choices.
We finally conduct a full evaluation on the selected
best performing models in the subsequent section.
Retrieval vs. Generator vs. RetNRef We first
compared the three model types described in Sec.
2: retrieval, generative and (dialogue) retrieve and
refine (RetNRef). We used the base 90M parameter generative model, the 256M parameter retrieval
model, while RetNRef combines both. All models
are fine-tuned on the BST tasks. For generation we
use standard beam search (beam size 10, no minimum beam decoding constraint, but with context
and response 3-gram blocking).
The results (Figure 6) show RetNRef outperforming the pure generation approach, but with
retrieval outperforming both. This initial result
comes with the caveat that relative performance
may be different for differently sized models, or
for different training or decoding strategies, as we
shall see. We explore along those axes in subsequent trials.
Name   Total Params    V     Lenc   Ldec   d      h    Steps   PPL
90M    87,508,992      55K   8      8      512    16   2.86M   25.6
2.7B   2,696,268,800   8K    2      24     2560   32   200K    13.3
9.4B   9,431,810,048   8K    4      32     4096   32   200K    12.2

Table 2: Perplexity on the validation set of pushshift.io Reddit for several generative Transformer models with given architecture settings. Note that perplexity is not directly comparable between the 90M models and the larger models as the 90M models use a different dictionary. Columns include the vocabulary size (V), number of encoder and decoder layers (Lenc, Ldec), embedding dimensionality (d), Multihead Attention heads (h), and training steps.
Model                            Size        ConvAI2   WoW     ED      BST     Avg.
pushshift.io Reddit Generative   90M         18.33     31.18   14.44   18.09   20.51
BST Generative                   90M         11.36     17.56   11.48   14.65   13.76
BST RetNRef                      256M/90M    11.79     18.37   11.87   14.62   14.16
pushshift.io Reddit Generative   2.7B        15.70     13.73   11.06   14.36   13.71
BST Generative                   2.7B         8.74      8.78    8.32   10.08    8.98
BST RetNRef                      622M/2.7B    9.31      9.28    9.93   10.59    9.78
pushshift.io Reddit Generative   9.4B        15.02     12.88   10.41   13.5    12.95
BST Generative                   9.4B         8.36      8.61    7.81    9.57    8.59

Table 3: Perplexity of the pre-trained and fine-tuned models on the validation set for BST datasets. Note that perplexity is not directly comparable between the 90M models and the larger models as 90M models use a different dictionary. Fine-tuning gives gains for each skill (task) compared to pre-training on pushshift.io Reddit alone.
Method       pushshift.io Reddit          ConvAI2
             Word List   Classifier       Word List   Classifier
Human        12.9%       18.5%            0.32%        3.8%
Reddit Gen    4.4%       17.8%            0.10%       12.1%
BST Gen       0.6%        9.5%            0.05%        1.6%

Table 4: Safety of utterances, before filtering through a safety classifier. We compare human, pre-trained and fine-tuned 90M model responses given pushshift.io Reddit and ConvAI2 contexts using either an unsafe word list or a trained classifier from Dinan et al. (2019b). The pushshift.io Reddit dataset contains more unsafe contexts, leading to more unsafe responses. Models fine-tuned on the safer BST tasks are less toxic than the pre-trained pushshift.io Reddit model on either type of dataset context.
This mirrors results found in some
recent papers comparing generation and retrieval
(Li et al., 2016; Dinan et al., 2019c). In order for
generation methods to do better, we need to improve their recipe.
Figure 6: Self-Chat ACUTE-Eval (engagingness) shows Retrieve and Refine (α = 0.5) outperforms its Generative (90M, beam search decoding) but not its Retrieval (256M) counterpart, all using BST fine-tuning. ∗ indicates significance (two-tailed binomial test, p < 0.05).

Generator Decoding choices We next compare different ways of controlling the response length in
beam search (Sec. 4.3): controlling the minimum
beam length (in terms of BPE tokens) with a fixed
hyperparameter, or by adjusting it with a predictor
of the optimal length.
The results, shown in Figure 7, indicate that both
methods improve significantly over not controlling
the length, as in standard beam search. In the remainder of the experiments in the paper we thus
chose a minimum beam length of 20 BPE tokens.
We then investigate the use of beam blocking; the results are shown in Figure 8. Blocking tends to increase performance, in line with other works,
Generative 2.7B model: Min Beam Length (Constrained vs. Unconstrained win %)
Min. Length 5               52      48
Min. Length 10              68∗∗    32∗∗
Min. Length 20              83∗∗    17∗∗
Min. Length 40              82∗∗    18∗∗
Predictive (5,10,15,20)     69∗∗    31∗∗
Predictive (10,20,30,40)    81∗∗    19∗∗

Figure 7: Self-Chat ACUTE-Eval (engagingness) shows controlling minimum beam length gives large gains in engagingness compared to not controlling it, according to humans, with 20 being best. All rows are significant (p < 0.01) except the first.
Generative 2.7B model: Beam Blocking (Block vs. None win %)
3-gram Context Blocks                 50    50
3-gram Response Blocks                54    46
3-gram Context + Response Blocks      59    41

Figure 8: Self-Chat ACUTE-Eval (engagingness): comparing beam-blocking variants. Blocking both context and response 3-grams during generation gives the highest scores; however, none of these results are significant.
although the results were not significant. We employ
full blocking in the remainder of our experiments.
Finally, we compare different values of beam
size to other search strategies: Top-k sampling, and
the sample and rank strategy of Adiwardana et al.
(2020) using Top-k (k = 40) and 20 samples.
The results are given in Figure 9, comparing
beam size 10 to alternatives. It appears there is
a sweet spot of beam size, where a value of 10
is superior to 1 or 30, which is then on par with
sampling methods, although none of these results
is significant. We employ beam size 10 in the remainder of our experiments.
Small vs. Large models We compare 90M vs.
2.7B parameter generative models in a pairwise test,
both with BST fine-tuning and with the decoding
settings we selected in the previous experiments.
The results (Figure 10) indicate improvements
from larger models, in line with previous results
(Adiwardana et al., 2020). We note that this comes
at the cost of increased computational resources
being required for training and deployment.
Generative 2.7B model: Alternative vs. Beam 10 + Block + Min. Length 20 (win %)
Beam size 1         45    55
Beam size 30        42    58
Sample + Rank       52    48
Top-k (k = 40)      50    50

Figure 9: Self-Chat ACUTE-Eval (engagingness): comparing different generation schemes. None of these results are statistically significant.
Generative models: 90M params vs. 2.7B params (win %): 43 vs. 57

Figure 10: Self-Chat ACUTE-Eval (engagingness) shows a win for a larger vs. smaller model, but this result is not statistically significant.
Pre-training vs. Fine-Tuning We compare fine-tuning our pre-trained generative model on the BST tasks, versus using pre-training only. The results (Figure 11) indicate large improvements from adjusting the model to focus on personality, knowledge and empathy, the three skills in BST.
Persona context vs. No context given The BST
tasks train models how to use context personas such
as "I design video games for a living", see Fig. 3.
This context can both improve the bot’s consistency
as well as add potential talking points that it can
work into the conversation. To tease apart the impact of adding context vs. fine-tuning on BST but
not using contexts at conversation time, we compared them against each other. The results, shown
in Figure 12 indicate a small win for employing
persona contexts, which we thus employ in all our
full evaluations in the next section.8
Likelihood vs. Unlikelihood We compare unlikelihood training (Sec. 3.4), whereby overexpressed n-grams are discouraged (α = 0.25), to conventional training (MLE). The unlikelihood training has the intended effect of making the system
less “dull” by not using the same common phrases
again and again. We note that this effect would
likely be larger if measured with longer or repeated
conversations with the same user. Nevertheless,
here we perform the same experimental setup as
before.
8 We also compared adding a Wizard of Wikipedia-based topic vs. not to the context, and in that case saw no discernible difference in evaluation scores.
Generative 2.7B model: Pre-training only vs. BST fine-tuning (win %): 39∗ vs. 61∗

Figure 11: Self-Chat ACUTE-Eval (engagingness) shows a significant gain (p < 0.05) for fine-tuning on the BST tasks.

Figure 12: Self-Chat ACUTE-Eval (engagingness) shows a small win (not significant) for using persona contexts after fine-tuning on the BST tasks.

We compare two models which are identical except for the training objective: both models are 2.7B parameters, BST fine-tuned with our best chosen decoding settings. The results (Figure 13) show a small gain for the unlikelihood-trained model, but this is not statistically significant.

Figure 13: Self-Chat ACUTE-Eval (engagingness), MLE vs. Unlikelihood training (penalizing overexpressed n-grams). The result is not statistically significant (165 trials).
10.3 Full (Human-Bot Chat) Evaluations
The previous section consisted of human pairwise evaluations to perform model selection, but
involved self-chats, not human-bot conversations.
In this section we take the learnings from those
evaluations, and evaluate some of the best choices
of model in our full human-bot evaluation setup.
For human-bot conversation data collection we
used the same setting proposed in (Adiwardana
et al., 2020): open-ended chat that begins with the
message "Hi!" from the human to the bot, and has
a minimum interactive conversation length of 14
turns, collecting 100 conversations per model via
crowdworkers. We do not apply a safety classifier
to our models, but we do apply it to the human
responses, and remove crowdworker conversations
that were flagged.
Retrieval vs. Generator vs. RetNRef We perform an evaluation (engagingness question) similar
to the self-chat version of Figure 6, except using
human-bot conversations, and the generative and
RetNRef models here use the improved decoding
choices. This results in stronger generation and
RetNRef models, which both now beat the retrieval
method, see Figure 14.

Figure 14: Human-bot ACUTE-Eval (engagingness): Retrieve and Refine (α = 0.5) and Generative (90M, beam search decoding, min beam size 20) beat Retrieval (256M). All results are significant (p < 0.01) except for RetNRef vs. Generative.

The main difference to our initial self-chat experiments (Figure 6) is that our decoding now generates longer responses using a minimum beam
length constraint. This makes the generative models now outperform the retrieval model, but it also
removes the gains from retrieve and refine over the
generative model. We note that if we remove the minimum beam length constraint from both retrieve and refine and the generative model, collect new human-bot chats, and run a pairwise ACUTE-Eval, we instead find that RetNRef has a statistically significant improvement over our generative model (p < 0.001).
Comparison to Meena We compare our models
to Meena (Adiwardana et al., 2020) by comparing
pairwise against the publicly available logs. We
note that only some of the logs were made available, as some toxic conversations were removed,
which may affect the evaluations, but we use all
logs that are publicly available. We compare them
with several variants of our models, using both
the engagingness and humanness questions. The
results are given in Figures 15 and 16. We first
observe several results that are in line with the selfchat results from the previous section:
(i) Using BST (BST Generative 2.7B) is superior to pre-training only (pushshift.io Reddit
Generative 2.7B)
(ii) Beam search with a minimum beam length
of 20 (BST Generative 2.7B) is superior to
having no minimum length (BST Generative
(2.7B) std. beam)
(iii) The larger BST Generative (2.7B) is superior to the smaller model BST Generative (90M).

Ours vs. Meena (engagingness win %)
BST Generative (2.7B) std. beam           50      50
pushshift.io Reddit Generative (2.7B)     53      47
BST RetNRef (256M/90M)                    60∗     40∗
BST Generative (90M)                      61∗     39∗
Wiz Generative (2.7B)                     61∗∗    39∗∗
BST Unlikelihood (2.7B)                   64∗∗    36∗∗
BST Generative (9.4B)                     67∗∗    33∗∗
BST RetNRef (622M/2.7B)                   70∗∗    30∗∗
BST Generative (2.7B)                     75∗∗    25∗∗

Figure 15: Human-Chat ACUTE-Eval of engagingness, various models compared to Meena. Our best models are considered more engaging than Meena; rows with ∗ (p < 0.05) and ∗∗ (p < 0.01) are statistically significant. Larger generative models with BST fine-tuning and length-controlled decoding work best.
We find RetNRef models (both dialogue version
and using knowledge retrieval) do not improve over
their generative counterparts when using the best
decoding schemes for the generative models. Our
largest BST Generative 9.4B model does well on
the humanness question, but performs worse on
engagingness compared to our 2.7B model, despite
having lower perplexity, showing correlation between these metrics is not straightforward. We verified this result further by performing an ACUTEEval of engagingness directly comparing the 2.7B
and 9.4B against each other, which resulted in a
56% win for the smaller model, aligning with the
other results. Future work should aim to understand
this result further.
Our best models improve significantly over
Meena, with BST Generative 2.7B winning 75%
of the time in pairwise match-ups for the engagingness question and 65% for the humanness question. Meena generally tends to fare better at the
humanness question than the engagingness question, which is in line with the goals and modeling
choices in that work.
Model vs. Human-human Chat Comparisons
Rather than comparing different models pairwise,
we can also compare a model directly to human
performance, by running ACUTE-Evals with a bot-human chat vs. a human-human chat. We test the same models in this setup using the human-human chat logs from Adiwardana et al. (2020).
Results are given in Figure 17. We see many of the
same trends, but find that human-human chats are
Ours vs. Meena (humanness win %)
BST Generative (2.7B) std. beam           46      54
BST RetNRef (256M/90M)                    49      51
pushshift.io Reddit Generative (2.7B)     56      44
BST Generative (90M)                      59      41
Wiz Generative (2.7B)                     59∗     41∗
BST RetNRef (622M/2.7B)                   65∗∗    35∗∗
BST Generative (2.7B)                     65∗∗    35∗∗
BST Generative (9.4B)                     66∗∗    34∗∗
BST Unlikelihood (2.7B)                   70∗∗    30∗∗

Figure 16: Human-Chat ACUTE-Eval of humanness, various models compared to Meena. Our best models are considered more humanlike than Meena; rows with ∗ and ∗∗ are statistically significant.
Model vs. Human (engagingness win %)
Meena (Adiwardana et al., 2020)           28∗∗    72∗∗
BST Generative (2.7B) std. beam           21∗∗    79∗∗
pushshift.io Reddit Generative (2.7B)     36∗∗    64∗∗
BST RetNRef (256M/90M)                    37∗∗    63∗∗
BST Generative (90M)                      42      58
BST Generative (9.4B)                     45      55
BST RetNRef (622M/2.7B)                   46      54
Wiz Generative (2.7B)                     47      53
BST Unlikelihood (2.7B)                   48      52
BST Generative (2.7B)                     49      51

Figure 17: ACUTE-Eval of engagingness of models vs. humans by comparing human-bot logs to human-human logs. Rows with ∗∗ are statistically significant.
a more challenging barometer for our models to be
compared to.
Response Length We show the average response
length statistics (in terms of BPE 8k dictionary tokens) of some of the models in Figure 18. We compare Generative BST (2.7B) with and without beam
length constraints. With the constraint (of 20), the
average response length is around 21 tokens, so the
beam search often ends as soon as the constraint
is fulfilled. In contrast, without the constraint the
average length is 9.5. Meena’s average length is
10.4, and for humans engaged in human-human chats it is 18.0. Humans speaking to models (or other humans) will often match response length if they are
engaged in the conversation, and there appears to be
correlation of their average response length with engagement (intuitively, humans are expending time
and energy typing keys on their keyboard, which
they are more likely to do if engaged).
Model                       Model    Human Partner
Meena                       10.4      8.2
BST Gen (2.7B) std beam      9.5     11.3
BST Gen (2.7B)              21.3     16.3
Human                       18.0     18.0

Figure 18: Response length statistics for various models. We note the best performing methods have longer response lengths, and humans interacting with them have longer response lengths in kind.
10.4 Example Successful Conversations
We give several examples of what we consider
successful conversations between crowdworkers
and the Generative BST 2.7B model in Figures
19 and 20. The topics span from cooking, music,
movies and pets to yoga, veganism, instruments
and malls – often with the model going into detail
when asked, naming relevant stores, bands, movies,
actors, pet species and pet names. We also provide two slightly more probing examples which
are conversations between a paper author and the
models in Figure 21. In the first example we ask
for comparison between Bach and Justin Bieber,
with fairly nuanced and detailed answers from the
bot. In the second example we ask the bot to write
a song, which it attempts to do, even though the
lyrics it generates could not be called deeply poetic.
10.5 Failure Cases and Model Extensions
While performance in the ACUTE-Eval setup appears at first sight to be very strong (e.g. 49% to
51% for our 2.7B generative model compared to
human-human logs), we do not believe we are anywhere near as close to solving the problem of open-domain conversation as this evaluation would indicate. Here, we highlight problems with our models,
and elucidate why our evaluation does not capture
them. Selected example failures from crowdworker
logs are given as conversation snippets in Figure
23, and further failures constructed by the paper
authors in Figure 24.
Vocabulary Usage It has been observed that generative models employing beam search decoding
(or other methods that approximately choose the
most likely utterance) tend to generate common
words too frequently, and rare words too infrequently, as compared to the human distribution
(Holtzman et al., 2018; Welleck et al., 2020; Li
et al., 2019a). In dialogue, humans can interpret this as technically correct but unengaging; in the extreme this becomes the so-called “I don’t know” problem, where models tend to output such noncommittal utterances. Using sampling to select
lower likelihood generations can help, but at the
risk of saying something which makes less sense.
It appears that even our best models using beam
search are still exhibiting such behavior. We have
found that encouraging longer generations helps, in that the model is forced
to generate something more detailed, but the problem still remains. Figure 22 shows the most commonly occurring 3-grams in the conversation logs
with crowdworkers for the BST Generative 2.7B
model, and their counts. Given that there are only
100 conversations, the expressions “do you like”,
“lot of fun”, “have any hobbies” etc. are clearly
over-expressed compared to human-human conversations. We note that the current evaluation
does not seem to expose this as boring because
the conversations are short and are evaluated separately. We applied unlikelihood training to reduce
this over-expression, which successfully reduced
this overexpression during training, and also in the
final conversation logs with humans, as shown in
Figure 22. Unfortunately, this made a very small
or negative impact in our ACUTE-Evals of engagingness, see Figures 15 and 17, although this did
score highly in terms of humanness, see Figure 16.
For engagingness, as explained, we believe this
is because the current evaluation technique employing short conversations cannot measure this
phenomenon well.
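Counts like those in Figure 22 are straightforward to compute for any set of logs; a minimal sketch, assuming simple whitespace tokenization and made-up utterances in place of the real conversation logs:

    from collections import Counter

    def three_grams(tokens):
        return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

    def most_common_three_grams(utterances, k=15):
        counts = Counter()
        for utterance in utterances:
            counts.update(three_grams(utterance.split()))  # whitespace tokenization (an assumption)
        return counts.most_common(k)

    # Made-up utterances standing in for the real crowdworker conversation logs.
    logs = ["do you have any hobbies ?", "do you have any pets ?", "that sounds like a lot of fun"]
    print(most_common_three_grams(logs, k=3))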
Nontrivial Repetition A related issue is that
generative models also have a tendency to repeat
(Holtzman et al., 2019). While beam blocking can
be applied as a band-aid to fix some of these problems, resulting in improved performance, deeper
issues remain. There remains a tendency for models to say that they have a pet dog as well if you say you have one, that they love walking it too, that they like the same bands as you, and so on. This is present both in our failure examples (Figures 23 and 24) and in our cherry-picked good examples, see Figures 19 and 20. We observe this in the logs of other generative systems, e.g., Meena, as well. While it can be engaging that the bot tends to agree with many things you say, controlling this behavior seems desirable. One
possibility is applying unlikelihood training for that
goal as well, to minimize context repeats (Li et al.,
2019a).

Figure 19: Cherry-picked crowdworker examples. Two conversations between different crowdworkers (left speakers) and the Generative BST 2.7B model (right speakers).

Adding a persona to the bot is another plausible way to do this. We have added simple
two-line personas following BST (see Figure 3), but these would need to be much more detailed to cover all possible cases, so it is unclear if that is a satisfactory solution. Perhaps one way to track this would be to ask human evaluators whether the bot is following its persona, as the current evaluation
setup is unlikely to penalize this copycat behavior.
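The beam blocking referred to in the repetition discussion above is usually implemented as n-gram blocking: a candidate token is rejected if it would complete an n-gram that already appears in the hypothesis (or, for context blocking, in the dialogue context). A simplified sketch of that check, not our exact decoder:

    def would_repeat_ngram(hypothesis, candidate, banned_source, n=3):
        """True if appending `candidate` to `hypothesis` completes an n-gram found in `banned_source`."""
        if len(hypothesis) < n - 1:
            return False
        new_ngram = tuple(hypothesis[-(n - 1):] + [candidate])
        seen = {tuple(banned_source[i:i + n]) for i in range(len(banned_source) - n + 1)}
        return new_ngram in seen

    hyp = "i have a dog . i have".split()
    # Self-blocking: "i have a" was already generated, so the candidate "a" is rejected.
    print(would_repeat_ngram(hyp, "a", banned_source=hyp))
    # Context blocking: reject copying a 3-gram straight from the partner's message.
    partner = "i have a golden retriever".split()
    print(would_repeat_ngram(hyp, "a", banned_source=partner))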
Contradiction and Forgetfulness Our models
do occasionally contradict themselves, see Figure
23, although we observed this happens less often
in the larger models. We believe that, due to the nature of language modeling, typical language patterns do not contain contradictions, but probing the model
with unusual responses would likely expose this
behavior again. A second related problem is what
appears as “forgetfulness” to the human observer,
where for example you tell the model you have
a dog, but then later in the conversation it asks what pets you have. This phenomenon can be attributed to the model failing to make the logical link that it should not ask that question, rather than the model actually “forgetting” (if
the previous response is in its dialogue context).
Again, we observe this relatively rarely, but we
believe it can be exposed further by probing the
model. While some recent work has posed possible
solutions for these issues (Li et al., 2019a), they
have not yet been fully resolved.
Knowledge and Factual Correctness In our experience it is actually relatively easy to goad our
models into making factual errors. Perhaps surprisingly, they appear relatively rarely in crowdworker
conversations with the bots. We believe this is due
to the nature of the evaluation conducted: the conversations start with “Hi!” and tend to cover only
shallow topics whereby the speakers get to know
each other, and they are rarely long enough to go
deeper into a topic. Exploring a more focused topic
of conversation would likely expose the model’s
weaknesses. On the contrary, it appears that the
model is good at dodging this issue. We observe
that our models often switch topics – avoiding the
challenge of going “deeper" – which could be a
side effect of the ConvAI2 dataset which exhibits
this behavior. The Wizard of Wikipedia dataset,
however, does not exhibit this behavior, and its
construction was specifically aimed to avoid this.
Figure 20: Cherry-picked crowdworker examples. Four conversations between different crowdworkers (left speakers) and the Generative BST 2.7B model (right speakers).

Figure 21: Cherry-picked author examples. Paper author (left speaker) conversations with the Generative BST 2.7B model (right speaker).

We implemented a model that directly incorporated reading Wikipedia (Wiz Generative 2.7B, Sec. 2.3),
and anecdotally one can find cases where it can employ knowledge that the pure sequence-to-sequence model cannot, see Figure 24. Unfortunately, reading knowledge had only a negative impact in
ACUTE-Evals compared to a similarly sized model
without knowledge retrieval, see Figure 17. We
believe this is due to a mixture of (i) deeper knowledge rarely being required in the current evaluation
setup; and (ii) the model attempting to use knowledge when there is no need, or using it incorrectly.
True open-domain dialogue agents should be able
to use knowledge effectively, and to achieve that
we have to be able to measure it effectively.
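To make the knowledge-conditioning idea concrete, the toy sketch below retrieves a passage by word overlap and prepends it to the dialogue context before generation; the overlap retriever, the "__knowledge__" marker, and the passages are illustrative assumptions and do not reflect the actual Wiz Generative 2.7B retrieval pipeline.

    def retrieve_passage(dialogue_context, passages):
        """Toy retriever: pick the passage with the largest word overlap with the context."""
        context_words = set(dialogue_context.lower().split())
        return max(passages, key=lambda p: len(context_words & set(p.lower().split())))

    def build_model_input(dialogue_context, passages, marker="__knowledge__"):
        """Prepend the retrieved passage, separated by an assumed special marker token."""
        return f"{marker} {retrieve_passage(dialogue_context, passages)}\n{dialogue_context}"

    passages = [
        "Amon Tobin is a Brazilian electronic musician.",
        "Harvard University is located in Cambridge, Massachusetts.",
    ]
    print(build_model_input("have you heard of amon tobin ?", passages))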
Conversation Length and Memory Our current evaluation involves very short (14-turn) one-shot conversations. Our bots would likely be repetitive and dull over the course of several days or weeks of conversation, as described above, and they are also currently completely incapable of remembering earlier conversations. Our generative architectures, which are standard Transformers, have a hard limit of 128 BPE tokens of history, so they cannot
possibly expand upon things they have learnt from
or about the user, refer to previous things they said,
etc. While several recent works have extended neural architectures to possess longer contexts (Dai
et al., 2019; Rae et al., 2020; Kitaev et al., 2020;
Beltagy et al., 2020), we have neither implemented
those, nor do we believe the current evaluation
setup is the right one for measuring their success.
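The practical effect of the 128-token history limit is a left-truncation of the flattened dialogue before encoding; a minimal sketch, using whitespace tokens as a stand-in for the real 8k BPE vocabulary and truncating at turn granularity for simplicity:

    MAX_HISTORY_TOKENS = 128   # the hard limit discussed above

    def truncate_history(turns, budget=MAX_HISTORY_TOKENS):
        """Keep the most recent turns that fit in the token budget; older turns simply fall away."""
        kept, used = [], 0
        for turn in reversed(turns):
            length = len(turn.split())        # whitespace stand-in for real BPE tokenization
            if used + length > budget:
                break
            kept.append(turn)
            used += length
        return list(reversed(kept))

    # A long hypothetical dialogue: the early turn mentioning the user's dog is eventually lost.
    dialogue = ["i have a dog named rex ."] + ["tell me more about your day ."] * 30
    print(truncate_history(dialogue)[0])   # no longer the turn about the dog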
Deeper Understanding Finally, while our models appear to chitchat with some degree of effectiveness, their ability to truly understand must be questioned. The contradiction and forgetfulness failure cases also emphasize this, but we give deeper
failure case examples in Figure 25. In the examples, the authors of this paper try to query the bot
whether it can understand two puns.

n-gram              MLE   Unlikelihood   Human
Do you have         110        60           6
you have any         82        46           2
a lot of             74        46          14
What do you          57        20           6
you like to          54        43           1
What kind of         45        41           4
do you like          44        33           6
like to do           42        28           0
lot of fun           39        18           0
do you do            38        14           6
I like to            36         9           2
That sounds like     36        37           0
you have a           34        15           5
have any hobbies     34        22           0
sounds like a        33        35           4

Figure 22: Counts of the most common 3-grams from the BST Generative 2.7B model (likelihood) in the conversation logs when talking to crowdworkers, compared to those of the same model trained with unlikelihood, and to human logs (for the same number of utterances).

The first pun requires understanding the semantic connection between hay, Harvard and horses, which the model
at one point claims it understands, but clearly does
not. Its lack of understanding can be strongly contrasted with its ability to describe knowledge about
the location of Harvard or horses. This recalls a
quote due to Feynman, “There’s a big difference
between knowing the name of something and knowing something”. We note that these models cannot
be taught a concept through further conversation,
so as-is they will always be stunted; see Weston (2016) and Hancock et al. (2019) for early work in this
direction. Further, these models, which are disembodied, also have no way of grounding to entities,
actions and experience in the world, which could
also stunt their abilities (Bisk et al., 2020). See
Urbanek et al. (2019); Prabhumoye et al. (2020)
for other work by some of the authors connecting
dialogue models to rich environments.
Further Notes on Evaluation Several of the previous points raised issues concerning our evaluation protocol. Our set-up involves short multi-turn
conversations with no instructions. Extending the
length should expose further weaknesses; however,
collecting long conversations with crowdworkers
is clearly difficult, and it is unclear how many turns
would be a sufficient test. We tried a preliminary
experiment of collecting 100 conversations twice
as long (so, 28 turns) to see the performance drop-off of our models. We compared the second half of the conversations to the shorter versions for the same 2.7B generative BST model, but did not see a statistically significant difference, indicating that the conversations either need to be longer, or that the whole conversation has to be evaluated at once.

Figure 23: Examples of issues when talking to crowdworkers with our Generative BST 2.7B model: nontrivial repetition (top example), forgetfulness (second example), contradiction (third example, Georgia is not in the Midwest), and hallucinating knowledge (fourth example, The Long Dark and The Forest are survival games, but not by the same authors).

If the latter is required,
this becomes difficult for a human annotator who
was not engaged in the conversation itself, as the
material to evaluate will get very large, so our current setup will not work. Another possibility is to
keep the conversations short, but to provide instruction instead. For example, the Wizard of Wikipedia
task (Dinan et al., 2019c) asks speakers to converse
in depth on a randomly chosen topic, changing the
nature of the conversations, and hence the skills the
model will be evaluated on.
Finally, when comparing to human performance,
the quality of the human conversations matters. In
Figure 17 we compared to logs of employees from
Adiwardana et al. (2020). Because they work at the same company, or perhaps know each other, these conversations are often rich and engaging. We also tried comparing to human-human crowdworker conversations. In that case the crowdworkers have no social connection before the conversation begins, and we believe this results in less engaging logs. When comparing to such human-human crowdworker conversations, which we took from the BST paper (Smith et al., 2020), we found our models perform better than when compared to employees: in an ACUTE-Eval of engagingness, our generative BST 2.7B model beats humans 56% to 44% (not statistically significant), whereas it scored 49% to 51% against employee chats. We also compared crowdworker humans directly to employee humans, with a 56% to 44% win for employees in terms of engagingness, and a 59% to 41% win in terms of humanness. We believe utilizing crowdworkers as a barometer for our models is desirable, as this can yield more replicable experiments, so finding a way to close this gap, perhaps with alternative ways of matching workers or with differing set-ups and instructions, remains a possible avenue of investigation.

Figure 24: Examples of issues created by paper authors with our Generative BST 2.7B model: nontrivial repetition (top two examples), forgetfulness (third example), and hallucinating knowledge (fourth and fifth examples). Wojciech Zaremba is an AI researcher born in Poland, and Amon Tobin is a Brazilian electronic musician, which the Wiz Generative model retrieves from Wikipedia correctly (last two examples). The Generative BST 2.7B model, which does not use retrieval, instead hallucinates an Italian football player and an American singer.

Figure 25: Examples of failure to deeply comprehend with our Generative BST 2.7B model. (Top) The model displays knowledge of various concepts without understanding what it knows, as indicated by its inability to comprehend the pun. (Bottom) The model does a good job of pretending to understand the pun, but actually does not. The two conversations are reproduced below.

Paper Author (human): Horses go to hayvard... get it?
Generative BST 2.7B: No, I don't get it. Can you explain it to me like I'm five?
Paper Author (human): Well, Harvard is a school for humans, and horses eat hay...
Generative BST 2.7B: Yes, I know that. But what does that have to do with the joke? I'm so confused.
Paper Author (human): Do you know what hay is?
Generative BST 2.7B: Yes, it's a type of grain. I just don't see how that's relevant to the joke.
Paper Author (human): Do you know what horses are?
Generative BST 2.7B: Yes, they are domesticated animals that have been bred over millennia for various purposes.
Paper Author (human): Do you know what Harvard is?
Generative BST 2.7B: Yes. It's a university in Boston, Massachusetts, in the United States.
Paper Author (human): So, what is Hayvard?
Generative BST 2.7B: I'm not sure, but I do know that it's located in Cambridge.
Paper Author (human): No, it isn't a real place.
Generative BST 2.7B: Oh, okay. I didn't know that. Thank you for informing me of that.
Paper Author (human): It's like a bad pun. Some people think it's funny.
Generative BST 2.7B: I see. Well, at least I learned something new today. Have a great day!
Paper Author (human): What did you learn?
Generative BST 2.7B: I learned that Boston University is in the Boston metropolitan area, which I did not know.

Human: Why did the mushroom go to the party?
Wiz Generative 2.7B: I'm not sure, but I'm glad it did. I love museums. What do you do for a living?
Human: No, it was supposed to be a joke.
Wiz Generative 2.7B: Oh, sorry. I didn't get it. What kind of joke was it? I like jokes.
Human: The answer is, "because he's a fun guy." Do you get it?
Wiz Generative 2.7B: No, I don't. Can you explain it to me like I'm a five year old?
Human: Yes. Mushrooms are fungi, and "fungi" is pronounced like "fun guy"
Wiz Generative 2.7B: Oh, okay. I get it now. Thank you for the explanation. I really appreciate it.
Human: Are you sure you understand? Can you explain it to me?
Wiz Generative 2.7B: Sure. It's a play on the word fungus, which is also pronounced fun guy.
Human: Close enough.
Wiz Generative 2.7B: I'm glad we could come to an agreement. Have a great day, fun guy!

Figure 26: Example of persona conditioning in our Generative BST 9.4B model. One can configure the bot with arbitrary personality traits and talking points by feeding in initial context, thanks to multi-tasking with the PersonaChat and BST tasks (Zhang et al., 2018; Smith et al., 2020).
11 Released code and models
We release our 90M, 2.7B and 9.4B parameter
pre-trained and fine-tuned generative models. Details are available at http://parl.ai/projects/recipes. We have also provided a script for interacting with the bot, with safety filtering built in. All code for fine-tuning, including the datasets themselves, is available in ParlAI (Miller et al., 2017). More details are available on the project page. Finally, code
for evaluating models using ACUTE-Eval (Li et al.,
2019b) is also available and described.
12 Discussion
While our methods have taken a step forward and
achieved improved performance in terms of engagingness and humanness according to human evaluations, we have certainly not yet arrived at a solution
to open-domain dialogue. There are still various issues with our models. Firstly, even our best models
still make mistakes: although relatively rarely, they
i) contradict or repeat themselves on occasion, ii)
tend to repeat the same phrases in separate conversations, and iii) hallucinate knowledge as seen in
other generative systems (Massarelli et al., 2019).
Each of these faults naturally leads to future research directions; we made some attempt to rectify
phrase repeats using unlikelihood (Li et al., 2019a)
in Sec. 3.4, and conditioning on knowledge (Dinan
et al., 2019c) in Sec. 2.3, but more needs to be
done.
As the human evaluations are on short dialogues
(14 turns), longer conversations would likely make
these issues appear much worse. Longer conversations would also expose that the Transformer architectures we use have a limited dialogue history.
A number of recent architectures attempt to incorporate longer memory, and that is also a fruitful
direction, although evaluation is more challenging
as long conversations have to be collected and evaluated. An alternative is to seed the conversation
with a topic or otherwise provide instructions to
the human speaker during evaluation to give the
conversation a certain focus, which would more
deeply probe the skills of the bot. On the modeling side, longer conversations could also make
the choice of context material provided to the bot
more salient. Besides helping with consistency, the
persona and topic that are given as initial context
in Blended Skill Talk can help models introduce
interesting talking points in the conversation. However, they would need to be far more detailed for
longer or repeated conversations to help the models be consistent and avoid repetition, and in our current experimental setup they did not affect evaluations strongly. We note that the context our model is
trained to be able to condition on can also be used
to configure a chatbot persona suitable for a given
desired role, see Figure 26 for an example.
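Persona conditioning of the kind shown in Figure 26 amounts to prepending a few persona lines to the initial context before the first turn. The sketch below uses the PersonaChat-style "your persona:" prefix as an assumed format; the exact input format expected by our released models should be taken from the project page rather than from this illustration.

    def build_persona_context(persona_lines, prefix="your persona:"):
        """Join persona facts into the initial context string fed to the model before turn one."""
        return "\n".join(f"{prefix} {line}" for line in persona_lines)

    # Hypothetical persona: arbitrary traits and talking points for the configured bot.
    persona = [
        "i work at a bookstore .",
        "my favorite band is led zeppelin .",
    ]
    print(build_persona_context(persona))
    # Dialogue turns are then appended below these lines as the conversation proceeds.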
For deployment of a chatbot, being well-behaved
remains a significant challenge. In particular, we
expect bots to have more integrity than the average
human (or to even be faultless), but they have much
less understanding of what they are saying than
humans. We have studied improved safety from
toxic language (Dinan et al., 2019b) and mitigating
gender bias in dialogue generation (Dinan et al.,
2019a) but much work remains to be done. While
we have made our models publicly available, we
have not mitigated all safety issues. We believe
their release can help the community work together
to understand further and fix these issues, and we
recommend their use for that line of research.
The work of Adiwardana et al. (2020) showed
that there is a correlation between human evaluation and perplexity, given a fixed decoding scheme.
Of course, language modeling and dialogue agent training have long optimized perplexity as a standard objective. We argue that while
this is important, other factors are also at play and
cannot be ignored: (1) the choice of training data
is paramount, as shown by our pushshift.io Reddit
(pre-training) vs. Blended Skill Talk experiments;
and (2) decoding algorithms make large differences
for the same fixed perplexity model (Sec. 10.2).
We find that while our 2.7B parameter model gives
large gains over our 90M parameter model, our
largest 9.4B model does not have a clear win in
human evaluations over our 2.7B model, despite
having lower perplexity. This is in line with other
results that show the story is more nuanced than it appears at first sight. For example, dialogue competitions
are not always won by the model with the lowest perplexity (Dinan et al., 2020), and it has been
shown that models that take a small hit in perplexity but provide gains at decoding time can give far
improved results (Welleck et al., 2020; Li et al.,
2019a). Further refining and understanding these
ingredients, and how they help to build the recipe
as a whole, remain important directions.
References
Daniel Adiwardana, Minh-Thang Luong, David R So,
Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang,
Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu,
et al. 2020. Towards a human-like open-domain
chatbot. arXiv preprint arXiv:2001.09977.
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020.
The pushshift reddit dataset.
arXiv preprint
arXiv:2001.08435.
Iz Beltagy, Matthew E Peters, and Arman Cohan.
2020. Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150.
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella
Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, and Joseph Turian.
2020. Experience grounds language. arXiv preprint
arXiv:2004.10151.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov.
2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint
arXiv:1901.02860.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2019a.
Queens are powerful too: Mitigating gender
bias in dialogue generation.
arXiv preprint
arXiv:1911.03842.
Emily Dinan, Samuel Humeau, Bharath Chintagunta,
and Jason Weston. 2019b. Build it break it fix it for
dialogue safety: Robustness from adversarial human
attack. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages
4537–4546, Hong Kong, China. Association for
Computational Linguistics.
Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban,
Ryan Lowe, Shrimai Prabhumoye, Alan W. Black,
Alexander Rudnicky, Jason Williams, Joelle Pineau,
Mikhail Burtsev, and Jason Weston. 2020. The
second conversational intelligence challenge (ConvAI2). In The NeurIPS ’18 Competition, pages 187–
208, Cham. Springer International Publishing.
Emily Dinan, Stephen Roller, Kurt Shuster, Angela
Fan, Michael Auli, and Jason Weston. 2019c. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International
Conference on Learning Representations.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings
of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 889–898.
Maryam Fazel-Zarandi, Shang-Wen Li, Jin Cao, Jared
Casale, Peter Henderson, David Whitney, and Alborz Geramifard. 2017. Learning robust dialog policies in noisy environments. In Proceedings of Workshop on Conversational AI.
Asma Ghandeharioun, Judy Hanwen Shen, Natasha
Jaques, Craig Ferguson, Noah Jones, Àgata
Lapedriza, and Rosalind W. Picard. 2019. Approximating interactive human evaluation with self-play
for open-domain dialog systems. Advances in Neural Information Processing Systems.
Braden Hancock, Antoine Bordes, Pierre-Emmanuel
Mazare, and Jason Weston. 2019. Learning from
dialogue after deployment: Feed yourself, chatbot!
In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages
3667–3684, Florence, Italy. Association for Computational Linguistics.
Tianxing He, Jun Liu, Kyunghyun Cho, Myle Ott, Bing
Liu, James Glass, and Fuchun Peng. 2019. Mixreview: Alleviate forgetting in the pretrain-finetune
framework for neural language generation models.
arXiv preprint arXiv:1910.07117.
Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine
Bosselut, David Golub, and Yejin Choi. 2018.
Learning to write with cooperative discriminators.
In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics, pages
1638–1649. ACL.
Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin
Choi. 2019. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan
Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019.
Gpipe: Efficient training of giant neural networks
using pipeline parallelism. In Advances in Neural
Information Processing Systems, pages 103–112.
Jiwei Li, Michel Galley, Chris Brockett, Georgios P
Spithourakis, Jianfeng Gao, and Bill Dolan. 2016.
A persona-based neural conversation model. arXiv
preprint arXiv:1603.06155.
Margaret Li, Stephen Roller, Ilia Kulikov, Sean
Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2019a. Don’t say that! making inconsistent dialogue unlikely with unlikelihood training.
arXiv preprint arxiv:1911.03860.
Margaret Li, Jason Weston, and Stephen Roller. 2019b.
ACUTE-EVAL: Improved dialogue evaluation with
optimized questions and multi-turn comparisons. In
NeurIPS workshop on Conversational AI.
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin,
Kurt Keutzer, Dan Klein, and Joseph E Gonzalez.
2020. Train large, then compress: Rethinking model
size for efficient training and inference of transformers. arXiv preprint arXiv:2002.11794.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A robustly optimized BERT pretraining
approach. arXiv preprint arXiv:1907.11692.
Luca Massarelli, Fabio Petroni, Aleksandra Piktus,
Myle Ott, Tim Rocktäschel, Vassilis Plachouras,
Fabrizio Silvestri, and Sebastian Riedel. 2019. How
decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587.
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux,
and Jason Weston. 2019. Poly-encoders: Architectures and pre-training strategies for fast and accurate
multi-sentence scoring. In Proceedings of the International Conference on Learning Representations.
Nitika Mathur, Timothy Baldwin, and Trevor Cohn.
2017. Sequence effects in crowdsourced annotations. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing,
pages 2860–2865.
Jared Kaplan, Sam McCandlish, Tom Henighan,
Tom B Brown, Benjamin Chess, Rewon Child, Scott
Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.
2020. Scaling laws for neural language models.
arXiv preprint arXiv:2001.08361.
Pierre-Emmanuel Mazaré, Samuel Humeau, Martin
Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, pages 2775–2779,
Brussels, Belgium. Association for Computational
Linguistics.
Nitish Shirish Keskar, Bryan McCann, Lav R
Varshney, Caiming Xiong, and Richard Socher.
2019. CTRL: A conditional transformer language
model for controllable generation. arXiv preprint
arXiv:1909.05858.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya.
2020. Reformer: The efficient transformer. arXiv
preprint arXiv:2001.04451.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Ves Stoyanov, and Luke Zettlemoyer.
2019.
BART: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and comprehension.
arXiv preprint
arXiv:1910.13461.
Paulius Micikevicius, Sharan Narang, Jonah Alben,
Gregory Diamos, Erich Elsen, David Garcia, Boris
Ginsburg, Michael Houston, Oleksii Kuchaiev,
Ganesh Venkatesh, et al. 2017. Mixed precision
training. arXiv preprint arXiv:1710.03740.
Alexander Miller, Will Feng, Dhruv Batra, Antoine
Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and
Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, pages 79–84.
ACL.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela
Fan, Sam Gross, Nathan Ng, David Grangier, and
Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint
arXiv:1904.01038.
Romain Paulus, Caiming Xiong, and Richard Socher.
2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
Shrimai Prabhumoye, Margaret Li, Jack Urbanek,
Emily Dinan, Douwe Kiela, Jason Weston, and
Arthur Szlam. 2020. I love your chain mail! making knights smile in a fantasy game world. arXiv
preprint arXiv:2002.02878.
Alec Radford, Karthik Narasimhan, Tim Salimans, and
Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners. OpenAI
Blog, 1(8).
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020.
Compressive transformers for long-range sequence
modelling. In International Conference on Learning Representations.
Hannah Rashkin, Eric Michael Smith, Margaret Li, and
Y-Lan Boureau. 2019. Towards empathetic opendomain conversation models: A new benchmark and
dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association
for Computational Linguistics.
Pararth Shah, Dilek Hakkani-Tür, Bing Liu, and
Gokhan Tür. 2018a. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume
3 (Industry Papers), pages 41–51, New Orleans Louisiana. Association for Computational Linguistics.
Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and
Larry Heck. 2018b. Building a conversational agent
overnight with dialogue self-play. arXiv preprint
arxiv:1801.04871.
Noam Shazeer and Mitchell Stern. 2018. Adafactor:
Adaptive learning rates with sublinear memory cost.
arXiv preprint arXiv:1804.04235.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri,
Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion
parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053.
Heung-yeung Shum, Xiao-dong He, and Di Li. 2018.
From Eliza to XiaoIce: challenges and opportunities
with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19(1):10–26.
Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, YLan Boureau, and Jason Weston. 2019. The dialogue dodecathlon: Open-domain knowledge and
image grounded conversational agents.
Eric Smith, Mary Williamson, Kurt Shuster, Jason Weston, and Y-Lan Boureau. 2020. Can you put it all
together: Evaluating conversational agents’ ability
to blend skills. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics. ACL.
Jack Urbanek, Angela Fan, Siddharth Karamcheti,
Saachi Jain, Samuel Humeau, Emily Dinan, Tim
Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in
a fantasy text adventure game. In Proceedings
of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683, Hong
Kong, China. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Wei Wei, Quoc V. Le, Andrew M. Dai, and Li-Jia Li.
2018. A goal-oriented neural conversation model by
self-play.
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In
International Conference on Learning Representations.
Jason Weston, Emily Dinan, and Alexander Miller.
2018. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the
2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational
AI, pages 87–92, Brussels, Belgium. Association for
Computational Linguistics.
Jason E Weston. 2016. Dialog-based language learning. In Advances in Neural Information Processing
Systems, pages 829–837.
Thomas Wolf, Victor Sanh, Julien Chaumond, and
Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. In NeurIPS Workshop on Conversational AI.
Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong,
Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan
Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations.
In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational
Linguistics.
Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur
Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you
have pets too? In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics, pages 2204–2213. ACL.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen,
Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing
Liu, and Bill Dolan. 2019. DialoGPT: Large-scale
generative pre-training for conversational response
generation. arXiv preprint arXiv:1911.00536.
Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum.
2020. The design and implementation of XiaoIce,
an empathetic social chatbot. Computational Linguistics, pages 1–62.