Learning To Respond With Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog
Figure 1: An example of stickers in a multi-turn dialog. Sticker response selector automatically selects the proper sticker based on multi-turn dialog history.

Users have to browse all the stickers they have collected and select the appropriate one, which is both difficult and time-consuming. Consequently, much research has focused on recommending appropriate emojis to users according to the chatting context. Existing works, such as [41], are mostly based on emoji recommendation, where they predict the probable emoji given the contextual information from multi-turn dialog systems. In contrast, other works [5, 6] recommend emojis based on the text and images posted by a user. However, the use of emojis is restricted due to their limited variety and small size, while stickers are more expressive and come in great variety. As for sticker recommendation, existing works such as [23] and apps like Hike or QQ directly match the text typed by the user to the short text tag assigned to each sticker. However, since there are countless ways of expressing the same emotion, it is impossible to capture all variants of an utterance as tags.

In this paper, we address the task of sticker response selection in multi-turn dialog, where an appropriate sticker is recommended based on the dialog history. There are two main challenges in this task: (1) To the best of our knowledge, no existing image recognition method is designed for sticker images, so capturing the semantic meaning of a sticker is challenging. (2) Understanding the multi-turn dialog history is crucial for sticker recommendation, and jointly modeling the candidate sticker with the multi-turn dialog is challenging. Herein, we propose a novel sticker recommendation model, namely the sticker response selector (SRS), for sticker response selection in multi-turn dialog. Specifically, SRS first learns representations of the dialog context history using a self-attention mechanism and learns the sticker representation with a convolutional network. Next, SRS conducts deep matching between the sticker and each utterance and produces the interaction results for every utterance.

2 https://github.com/gsh199449/stickerchat

2 RELATED WORK

We outline related work on sticker recommendation, visual question answering, visual dialog, and multi-turn response selection.

Sticker recommendation. Most previous works emphasize the use of emojis instead of stickers. For example, [5, 6] use a multimodal approach to recommend emojis based on the text and images in an Instagram post. However, emojis are typically used in conjunction with text, while stickers are independent information carriers. What is more, emojis are limited in variety, while there exists an abundance of different stickers. The most similar work to ours is [23], which generates recommended stickers by first predicting the next message the user is likely to send in the chat, and then substituting it with an appropriate sticker. However, more often than not the implication of a sticker cannot be fully conveyed by text, and in this paper we focus on directly generating sticker recommendations from the dialog history.

Visual question answering. Sticker recommendation involves the representation of and interaction between images and text, which is related to the Visual Question Answering (VQA) task [11, 15, 28, 29, 35]. Specifically, VQA takes an image and a corresponding natural language question as input and outputs the answer. It is a classification problem in which candidate answers are restricted to the most common answers appearing in the dataset, and it requires deep analysis and understanding of images and questions, such as image recognition and object localization [16, 27, 38, 42]. Current models can be classified into three main categories: early fusion models, later fusion models, and external knowledge-based models. One state-of-the-art VQA model is [25], which proposes an architecture, positional self-attention with co-attention, that does not require a recurrent neural network (RNN) for video question answering. [17] proposes an image-question-answer synergistic network, where candidate answers are coarsely scored according to their relevance to the image and question pair in the first stage. Then, answers with a high probability of being correct are re-ranked by synergizing with images and questions.
The difference between the sticker selection and VQA tasks is that the sticker selection task focuses more on the multi-turn multimodal interaction between stickers and utterances.

Visual dialog. Visual dialog extends the single-turn dialog task [14, 31] in VQA to a multi-turn one, where later questions may be related to former question-answer pairs. To solve this task, [26] transfers knowledge from a pre-trained discriminative network to a generative network with an RNN encoder, using a perceptual loss. [39] combines reinforcement learning and generative adversarial networks (GANs) to generate more human-like responses to questions, where the GAN helps overcome the relative paucity of training data and the tendency of the typical maximum-likelihood-estimation-based approach to generate overly terse answers. [21] demonstrates a simple symmetric discriminative baseline that can be applied to both predicting an answer and predicting a question in visual dialog. Unlike VQA and visual dialog tasks, in a sticker recommendation system the candidates are stickers rather than text.

Multi-turn response selection. Multi-turn response selection [10, 33, 43-45] takes a message and the utterances in its previous turns as input and selects a response that is natural and relevant to the context.

Table 1: Statistics of Response Selection Dataset.

                                   Train     Valid    Test
# context-sticker pairs            320,168   10,000   10,000
Avg. words of context utterance    7.54      7.50     7.42
Avg. users participating           5.81      5.81     5.79
4 PROBLEM FORMULATION

Before presenting our approach for sticker response selection in multi-turn dialog, we first introduce our notation and key concepts. Similar to multi-turn dialog response selection [40, 47], we assume that there is a multi-turn dialog context s = {u_1, ..., u_{T_u}} and a candidate sticker set C = {c_1, ..., c_{T_c}}, where u_i represents the i-th utterance in the multi-turn dialog. In the i-th utterance u_i = {x_1^i, ..., x_{T_x^i}^i}, x_j^i represents the j-th word in u_i, and T_x^i represents the total number of words in utterance u_i. In dialog context s, c_i represents a sticker image with a binary label y_i, indicating whether c_i is an appropriate response for s. T_u is the number of utterances in the dialog context and T_c is the number of candidate stickers. For each candidate set, there is only one ground-truth sticker, and the remaining ones are negative samples. Our goal is to learn a ranking model that can produce the correct ranking for each candidate sticker c_i; that is, select the correct sticker among all the other candidates. For the rest of the paper, we take the i-th candidate sticker c_i as an example to illustrate the details of our model and omit the candidate index i for brevity.
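To make the notation concrete, the following minimal sketch shows one possible in-memory representation of a single training instance; the class and field names are hypothetical and not prescribed by the paper.

from dataclasses import dataclass
from typing import List

@dataclass
class StickerSelectionInstance:
    """One training instance: a dialog context s and a candidate sticker set C."""
    utterances: List[List[str]]   # s = {u_1, ..., u_Tu}; each u_i is a list of words x_j^i
    candidates: List[str]         # C = {c_1, ..., c_Tc}; ids or paths of candidate sticker images
    labels: List[int]             # y_i in {0, 1}; exactly one candidate is the ground truth

example = StickerSelectionInstance(
    utterances=[["how", "are", "you", "?"], ["pretty", "tired", "today"]],
    candidates=["sticker_001.png", "sticker_042.png", "sticker_777.png"],
    labels=[0, 1, 0],
)
assert sum(example.labels) == 1  # only one ground-truth sticker per candidate set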
5 SRS MODEL

5.1 Overview

In this section, we propose our sticker response selector, abbreviated as SRS. An overview of SRS is shown in Figure 4. It can be split into four main parts:

• Sticker encoder is a convolutional neural network (CNN) based image encoding module that learns a sticker representation.
• Utterance encoder is a self-attention mechanism based module encoding each utterance u_i in the multi-turn dialog context s.
• Deep interaction network conducts deep matching between each sticker representation and each utterance, and outputs each interaction result.
• Fusion network learns the short-term dependency by the fusion RNN and the long-term dependency by the fusion Transformer, and finally outputs the matching score combining these features using an interaction function.

5.2 Sticker Encoder

Much research has been conducted to alleviate gradient vanishing [19] and reduce computational costs [18] in image modeling tasks. We utilize one of these models, i.e., the Inception-v3 [30] model, rather than a plain CNN, to encode the sticker image:

O, O_flat = Inception-v3(c),   (1)

where c is the sticker image. The sticker representation O ∈ R^{p×p×d} conserves the two-dimensional information of the sticker and will be used when associating stickers and utterances in §5.4. We use the original image representation output of Inception-v3, O_flat ∈ R^d, as another sticker representation. However, existing pre-trained CNN networks, including Inception-v3, are mostly built on real-world photos. Thus, directly applying the pre-trained networks to stickers cannot speed up the training process. In this dataset, the sticker author gives each sticker c an emoji tag which denotes the general emotion of the sticker. Hereby, we propose an auxiliary sticker classification task to help the model converge quickly, which uses O_flat to predict which emoji is attached to the corresponding sticker. More specifically, we feed O_flat into a linear classification layer and then use the cross-entropy loss L_s as the loss function of this classification task.
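For concreteness, the sketch below shows one way the sticker encoder of §5.2 could be wired up. It assumes the torchvision Inception-v3 interface (ImageNet weights are downloaded on first use); the choice of Mixed_7c as the layer that yields the spatial map O, the emoji vocabulary size of 50, and the tensor dimensions are illustrative assumptions rather than the exact configuration of SRS.

import torch
import torch.nn as nn
import torchvision

class StickerEncoder(nn.Module):
    """Sketch of the sticker encoder: Inception-v3 backbone plus an
    auxiliary emoji-classification head (Eq. 1 and the loss L_s)."""

    def __init__(self, num_emoji_classes: int = 50, d: int = 2048):
        super().__init__()
        # Pre-trained Inception-v3; the classification head is replaced so that
        # the forward pass yields the pooled feature O_flat in R^d.
        self.backbone = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
        self.backbone.aux_logits = False          # skip the built-in auxiliary branch
        self.backbone.fc = nn.Identity()
        # Capture the last spatial feature map as O in R^{p x p x d} via a hook
        # (tapping Mixed_7c is an assumption; p = 8 for 299 x 299 input).
        self._spatial = {}
        self.backbone.Mixed_7c.register_forward_hook(
            lambda module, inp, out: self._spatial.update(feat=out)
        )
        # Linear layer for the auxiliary emoji classification task.
        self.emoji_head = nn.Linear(d, num_emoji_classes)

    def forward(self, sticker: torch.Tensor):
        o_flat = self.backbone(sticker)                  # O_flat: (batch, d)
        o = self._spatial["feat"].permute(0, 2, 3, 1)    # O: (batch, p, p, d)
        emoji_logits = self.emoji_head(o_flat)           # input to the cross-entropy loss L_s
        return o, o_flat, emoji_logits

encoder = StickerEncoder()
images = torch.randn(4, 3, 299, 299)                    # a batch of sticker images c
emoji_tags = torch.randint(0, 50, (4,))
o, o_flat, emoji_logits = encoder(images)
loss_s = nn.functional.cross_entropy(emoji_logits, emoji_tags)

The auxiliary loss L_s is added only to speed up convergence; it is separate from the ranking objective introduced in §5.6.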
Figure 4: Overview of SRS. We divide our model into four ingredients: (1) Sticker encoder learns sticker representation; (2) Utterance encoder learns representation of each utterance; (3) Deep interaction network conducts deep matching interaction between sticker representation and utterance representation in different levels of granularity; (4) Fusion network combines the long-term and short-term dependency feature between interaction results produced by (3).
5.3 Utterance Encoder

To model the semantic meaning of the dialog context, we learn the representation of each utterance u_i. First, we use an embedding matrix e to map a one-hot representation of each word in each utterance u_i to a high-dimensional vector space. We denote e(x_j^i) as the embedding representation of word x_j^i. From these embedding representations, we use the attentive module from the Transformer [34] to model the temporal interactions between the words in an utterance. Attention mechanisms have become an integral part of compelling sequence modeling in various tasks [4, 9, 13, 25]. In our sticker selection task, we also need to let words fully interact with each other to model the dependencies between words without regard to their locations in the input sentence. The attentive module in the Transformer has three inputs: the query Q, the key K and the value V. We use three fully-connected layers with different parameters to project the embedding of the dialog context e(x_j^i) into three spaces:

Q_j^i = FC(e(x_j^i)),  K_j^i = FC(e(x_j^i)),  V_j^i = FC(e(x_j^i)).   (2)

The attentive module then takes each Q_j^i to attend to K_·^i, and uses the resulting attention distribution α_{j,·}^i ∈ R^{T_x^i} as weights to obtain the weighted sum of the values, as shown in Equation 4. Next, we add the original word representation to β_j^i as a residual connection, shown in Equation 5:

α_{j,k}^i = exp(Q_j^i · K_k^i) / Σ_{n=1}^{T_x^i} exp(Q_j^i · K_n^i),   (3)

β_j^i = Σ_{k=1}^{T_x^i} α_{j,k}^i · V_k^i,   (4)

ĥ_j^i = Dropout(e(x_j^i) + β_j^i),   (5)

where α_{j,k}^i denotes the attention weight between the j-th word and the k-th word in the i-th utterance. To prevent vanishing or exploding gradients, a layer normalization operation [24] is also applied to the output of the feed-forward layer, as shown in Equation 6:

h_j^i = norm(max(0, ĥ_j^i · W_1 + b_1) · W_2 + b_2),   (6)

where W_1, W_2, b_1, b_2 are all trainable parameters. h_j^i denotes the hidden state of the j-th word in the Transformer for the i-th utterance.
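The attentive module of Equations 2-6 can be sketched as follows; the hidden sizes, dropout rate and single-utterance batching are illustrative assumptions, and the dot products are deliberately left unscaled to follow Equation 3.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveModule(nn.Module):
    """Sketch of the utterance encoder's attentive module (Eqs. 2-6)."""

    def __init__(self, d_model: int = 256, d_ff: int = 512, dropout: float = 0.1):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)   # Eq. 2: Q = FC(e(x))
        self.w_k = nn.Linear(d_model, d_model)   #        K = FC(e(x))
        self.w_v = nn.Linear(d_model, d_model)   #        V = FC(e(x))
        self.dropout = nn.Dropout(dropout)
        self.ffn1 = nn.Linear(d_model, d_ff)     # Eq. 6: position-wise feed-forward layer
        self.ffn2 = nn.Linear(d_ff, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, T_x, d_model) -- word embeddings e(x_j) of one utterance
        q, k, v = self.w_q(emb), self.w_k(emb), self.w_v(emb)
        alpha = F.softmax(q @ k.transpose(-2, -1), dim=-1)   # Eq. 3 (unscaled dot product)
        beta = alpha @ v                                      # Eq. 4: weighted sum of V
        h_hat = self.dropout(emb + beta)                      # Eq. 5: residual connection + dropout
        h = self.norm(self.ffn2(F.relu(self.ffn1(h_hat))))    # Eq. 6: FFN + layer normalization
        return h                                              # (batch, T_x, d_model)

module = AttentiveModule()
h = module(torch.randn(2, 10, 256))   # two utterances, ten words each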
5.4 Deep Interaction Network

Now that we have the representations of the sticker and each utterance, we can conduct a deep matching between these components. On the one hand, there are emotional words in the dialog context history that match the expression of the sticker, such as "happy" or "sad". On the other hand, specific parts of the sticker can also match these corresponding words, such as dancing limbs or streaming eyes. Hence, we employ a bi-directional attention mechanism between a sticker and each utterance, that is, from utterance to sticker and from sticker to utterance, to analyze the cross-dependency between the two components. The interaction is illustrated in Figure 5.

We take the i-th utterance as an example and omit the index i for brevity. The two directed attentions are derived from a shared relation matrix M ∈ R^{(p^2)×T_u}, calculated from the sticker representation O ∈ R^{p×p×d} and the utterance representation h ∈ R^{T_u×d}. The score M_{kj} ∈ R in the relation matrix M indicates the relation between the k-th sticker representation unit O_k, k ∈ [1, p^2], and the j-th word h_j, j ∈ [1, T_u], and is computed as:

M_{kj} = σ(O_k, h_j),  σ(x, y) = w^⊤ [x ⊕ y ⊕ (x ⊗ y)],   (7)

where σ is a trainable scalar function that encodes the relation between two input vectors, ⊕ denotes a concatenation operation, and ⊗ is the element-wise multiplication.

Next, a max pooling operation is conducted on M, i.e., let τ_j^u = max(M_{:j}) ∈ R represent the attention weight on the j-th utterance word given by the sticker representation, corresponding to the "utterance-wise attention". This attention learns to assign high weights to the important words that are closely related to the sticker. We then obtain the weighted sum of hidden states as the "sticker-aware utterance representation" l:

l = Σ_{j=1}^{T_u} τ_j^u h_j.   (8)

Similarly, the sticker-wise attention learns which part of the sticker is most relevant to the utterance. Let τ_k^s = max(M_{k:}) ∈ R represent the attention weight on the k-th unit of the sticker representation. We use this to obtain the weighted sum of O_k, i.e., the "utterance-aware sticker representation" r:

r = Σ_{k=1}^{p^2} τ_k^s O_k.   (9)

After obtaining the two outputs from the co-attention module, we combine the sticker and utterance representations and finally get the ranking result. We first integrate the utterance-aware sticker representation r with the original sticker representation O_flat using an integrate function, named IF:

Q_1 = IF(O_flat, r),  IF(x, y) = FC(x ⊕ y ⊕ (x ⊗ y) ⊕ (x + y)).   (10)

We then add the sticker-aware utterance representation l to Q_1 and apply a fully-connected layer:

Q_2 = FC(Q_1 ⊕ l).   (11)
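A minimal sketch of the deep interaction network (Equations 7-11) is given below. It processes one sticker-utterance pair without batching, assumes the sticker units and word states share the same dimension d, and keeps the max-pooled scores unnormalized as in Equations 8 and 9; all dimension choices are illustrative.

import torch
import torch.nn as nn

class DeepInteractionNetwork(nn.Module):
    """Sketch of the bi-directional attention in Eqs. 7-11 (no batching)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.w = nn.Linear(3 * d, 1, bias=False)   # the trainable w in sigma(x, y), Eq. 7
        self.integrate = nn.Linear(4 * d, d)       # FC inside the integrate function IF, Eq. 10
        self.fc_q2 = nn.Linear(2 * d, d)           # Eq. 11

    def forward(self, o: torch.Tensor, o_flat: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # o: (p*p, d) flattened sticker units O_k; o_flat: (d,); h: (T_u, d) utterance word states
        p2, tu = o.size(0), h.size(0)
        x = o.unsqueeze(1).expand(p2, tu, -1)            # O_k broadcast over words
        y = h.unsqueeze(0).expand(p2, tu, -1)            # h_j broadcast over sticker units
        m = self.w(torch.cat([x, y, x * y], dim=-1)).squeeze(-1)   # Eq. 7: M in R^{p^2 x T_u}

        tau_u = m.max(dim=0).values                      # Eq. 8: utterance-wise attention weights
        tau_s = m.max(dim=1).values                      # Eq. 9: sticker-wise attention weights
        l = tau_u @ h                                    # sticker-aware utterance representation
        r = tau_s @ o                                    # utterance-aware sticker representation

        q1 = self.integrate(torch.cat([o_flat, r, o_flat * r, o_flat + r], dim=-1))  # Eq. 10
        q2 = self.fc_q2(torch.cat([q1, l], dim=-1))                                  # Eq. 11
        return q2

din = DeepInteractionNetwork()
q2 = din(torch.randn(64, 256), torch.randn(256), torch.randn(12, 256))  # p*p = 64 units, 12 words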
5.5 Fusion Network

Figure 6: Framework of fusion network.

5.5.1 Fusion RNN. The fusion RNN first reads the interaction results for each utterance, {Q_2^1, ..., Q_2^{T_u}}, and transforms them into a sequence of hidden states. In this paper, we employ the gated recurrent unit (GRU) [7] as the cell of the fusion RNN, which is popular in sequential modeling [12, 40]:

g_i = GRU(g_{i-1}, Q_2^i),   (12)

where g_i is the hidden state of the fusion RNN. Finally, we obtain the sequence of hidden states {g_1, ..., g_{T_u}}. One can replace the GRU with similar algorithms such as the LSTM [20]; we leave this study as future work.

5.5.2 Fusion Transformer. To model the long-term dependency and capture the salient utterances in the context, we employ the self-attention mechanism introduced in Equations 3-6. Concretely, given {Q_2^1, ..., Q_2^{T_u}}, we first employ three linear projection layers with different parameters to project the input sequence into three different spaces:

Q^i = FC(Q_2^i),  K^i = FC(Q_2^i),  V^i = FC(Q_2^i).   (13)

Then we feed these three matrices into the self-attention algorithm illustrated in Equations 3-6. Finally, we obtain the long-term interaction result {ĝ_1, ..., ĝ_{T_u}}.

5.5.3 Prediction Layer. To combine the interaction representations generated by the fusion RNN and the fusion Transformer, we employ the SUMULTI function proposed by [36] to combine these representations, which has been proven effective in various tasks:

g_i = ReLU(W_s [(ĝ_i − g_i) ⊙ (ĝ_i − g_i) ⊕ (ĝ_i ⊙ g_i)] + b_s).   (14)

The new interaction sequence {g_1, ..., g_{T_u}} is then boiled down to a matching vector g̃_{T_u} by another GRU-based RNN:

g̃_i = RNN(g̃_{i−1}, g_i).   (15)

We use the final hidden state g̃_{T_u} as the representation of the overall interaction result between the whole utterance context and the candidate sticker. Finally, we apply a fully-connected layer to produce the matching score ŷ of the candidate sticker:

ŷ = FC(g̃_{T_u}),   (16)

where ŷ ∈ (0, 1) is the matching score of the candidate sticker.
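The fusion network and prediction layer (Equations 12-16) can be sketched as below. A single-head nn.MultiheadAttention stands in for the fusion Transformer of Equations 13 and 3-6, the sigmoid that maps the final score into (0, 1) is an assumption implied by Equation 16, and all sizes are illustrative.

import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Sketch of the fusion network and prediction layer (Eqs. 12-16)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.fusion_rnn = nn.GRU(d, d, batch_first=True)      # Eq. 12: g_i = GRU(g_{i-1}, Q_2^i)
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)  # stand-in for Eq. 13
        self.w_sub = nn.Linear(2 * d, d)                       # Eq. 14 (SUMULTI-style combination)
        self.match_rnn = nn.GRU(d, d, batch_first=True)        # Eq. 15
        self.score = nn.Linear(d, 1)                           # Eq. 16

    def forward(self, q2_seq: torch.Tensor) -> torch.Tensor:
        # q2_seq: (batch, T_u, d) -- interaction results Q_2^1 .. Q_2^{T_u}
        g, _ = self.fusion_rnn(q2_seq)                         # short-term dependency
        g_hat, _ = self.attn(q2_seq, q2_seq, q2_seq)           # long-term dependency
        combined = torch.relu(
            self.w_sub(torch.cat([(g_hat - g) * (g_hat - g), g_hat * g], dim=-1))
        )                                                       # Eq. 14
        _, g_tilde = self.match_rnn(combined)                   # Eq. 15; keep the final hidden state
        return torch.sigmoid(self.score(g_tilde[-1]))           # Eq. 16: y_hat in (0, 1)

scores = FusionNetwork()(torch.randn(4, 6, 256))               # (batch, 1) matching scores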
5.6 Learning

Recall that we have a candidate sticker set C = {c_1, ..., c_{T_c}} which contains multiple negative samples and one ground-truth sticker. We use the hinge loss as our objective function:

L_r = Σ_N max(0, ŷ_negative − ŷ_positive + margin),   (17)

where ŷ_negative and ŷ_positive correspond to the predicted matching scores of a negative sample and the ground-truth sticker, respectively, and margin is the margin rescaling in the hinge loss. The gradient descent method is employed to update all the parameters of our model so as to minimize this loss function.
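A direct rendering of the hinge objective in Equation 17, with an illustrative margin value:

import torch

def hinge_ranking_loss(pos_score: torch.Tensor,
                       neg_scores: torch.Tensor,
                       margin: float = 0.3) -> torch.Tensor:
    """Eq. 17: sum over the N negatives of max(0, y_neg - y_pos + margin).

    pos_score:  scalar tensor, score of the ground-truth sticker.
    neg_scores: (N,) tensor, scores of the N negative candidates.
    The margin value here is illustrative; the paper treats it as a hyper-parameter.
    """
    return torch.clamp(neg_scores - pos_score + margin, min=0).sum()

loss_r = hinge_ranking_loss(torch.tensor(0.8), torch.tensor([0.6, 0.75, 0.4]))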
Table 2: Ablation models for comparison.

Acronym            Gloss
SRS w/o pretrain   SRS w/o pre-trained Inception-v3 model
SRS w/o Classify   SRS w/o emoji classification task
SRS w/o DIN        SRS w/o deep interaction network
SRS w/o FR         SRS w/o fusion RNN

6 EXPERIMENTAL SETUP

6.1 Research Questions

We list six research questions that guide the experiments:
• RQ1 (See §7.1): What is the overall performance of SRS compared with all baselines?
• RQ2 (See §7.2): What is the effect of each module in SRS?
• RQ3 (See §7.3): How does the performance change when the number of utterances changes?
• RQ4 (See §7.4): Can the co-attention mechanism successfully capture the salient parts of the sticker image and the important words in the dialog context?
• RQ5 (See §7.5): What is the influence of the similarity between candidate stickers?
• RQ6 (See §7.6): What is the influence of the parameter settings?

6.2 Baselines

(1) Synergistic: [17] proposes an image-question-answer synergistic network, where candidate answers with high probabilities of being correct are re-ranked by synergizing with image and question. This model achieves the state-of-the-art performance on the Visual Dialog v1.0 dataset [8].
(2) PSAC: [25] proposes the positional self-attention with co-attention architecture for the VQA task, which does not require RNNs for video question answering. We replace the output probability over the vocabulary with the probability over the candidate sticker set.
(3) SMN: [40] proposes a sequential matching network to address response selection for the multi-turn conversation problem. SMN first matches a response with each utterance in the context; the matching vectors are then accumulated in chronological order through an RNN, and the final matching score is calculated with the RNN.
(4) DAM: [47] extends the Transformer model [34] to the multi-turn response selection task, where representations of text segments are constructed using stacked self-attention. Then, truly matched segment pairs are extracted across context and response.
(5) MRFN: [32] proposes a multi-representation fusion network, where the representations can be fused into matching at an early stage, at the intermediate stage or at the last stage. This is the state-of-the-art model on the multi-turn response selection task.

For the three multi-turn response selection baselines above, we replace the candidate embedding RNN network with the image encoding CNN network Inception-v3, as used in our model. This network is initialized using a pre-trained model3 for all baselines and SRS.

6.3 Evaluation Metrics

Following [32, 47], we employ recall at position k in n candidates, R_n@k, as an evaluation metric, which measures whether the positive response is ranked in the top k positions of n candidates. Following [47], we also employ mean average precision (MAP) [3] as an evaluation metric. The statistical significance of differences observed between the performance of two runs is tested using a two-tailed paired t-test and is denoted using ▲ (or ▼) for strong significance at α = 0.01.
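The metrics can be computed as in the following sketch, which assumes exactly one positive candidate per sample, as in our dataset; function names and the toy scores are illustrative.

from typing import List

def recall_at_k(scores: List[float], positive_index: int, k: int) -> float:
    """R_n@k for one sample: 1.0 if the positive candidate is ranked in the top k."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if positive_index in ranking[:k] else 0.0

def average_precision(scores: List[float], positive_index: int) -> float:
    """AP for one sample; with a single positive it equals 1 / rank_of_positive."""
    rank = 1 + sum(s > scores[positive_index] for i, s in enumerate(scores) if i != positive_index)
    return 1.0 / rank

# Averaging over the test set gives R_10@k and MAP.
samples = [([0.9, 0.2, 0.4], 0), ([0.3, 0.8, 0.5], 2)]   # (candidate scores, positive index)
r_at_1 = sum(recall_at_k(s, p, 1) for s, p in samples) / len(samples)
map_score = sum(average_precision(s, p) for s, p in samples) / len(samples)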
Table 3: RQ1: Automatic evaluation comparison. Significant differences are with respect to MRFN.

                 MAP     R10@1    R10@2    R10@5
Visual Q&A methods
  Synergistic    0.593   0.438    0.569    0.798
  PSAC           0.662   0.533    0.641    0.836
Multi-turn response selection methods
  SMN            0.524   0.357    0.488    0.737
  DAM            0.620   0.474    0.601    0.813
  MRFN           0.684   0.557    0.672    0.853
SRS              0.709   0.590▲   0.703▲   0.872

Table 4: RQ2: Evaluation of different ablation models.

                    MAP     R10@1   R10@2   R10@5
SRS w/o pretrain    0.650   0.510   0.641   0.833
SRS w/o Classify    0.707   0.588   0.700   0.871
SRS w/o DIN         0.680   0.552   0.669   0.854
SRS w/o FR          0.677   0.551   0.663   0.863
SRS                 0.709   0.590   0.703   0.872

7 EXPERIMENTAL RESULTS

7.1 Overall Performance

For research question RQ1, we examine the performance of our model and the baselines in terms of each evaluation metric, as shown in Table 3. First, the performance of the multi-turn response selection models is generally consistent with their performance on text response selection datasets. SMN [40], an earlier work on the multi-turn response selection task with a simple structure, obtains the worst performance on both sticker response and text response selection. DAM [47] improves on the SMN model and achieves the second best performance. MRFN [32] is the state-of-the-art text response selection model and achieves the best performance among the baselines on our task as well. Second, the VQA models perform generally worse than the multi-turn response selection models, since the interaction between the multi-turn utterances and the sticker is important but is not taken into account by the VQA models. Finally, SRS achieves the best performance, with 3.36%, 5.92% and 3.72% improvements in MAP, R10@1 and R10@2, respectively, over the state-of-the-art multi-turn response selection model, MRFN, and with 6.80%, 10.69% and 8.74% significant increases over the state-of-the-art visual dialog model, PSAC. This proves the superiority of our model.

7.2 Ablation Study

For research question RQ2, we conduct ablation tests on the use of the pre-trained Inception-v3 model, the sticker classification loss, the deep interaction network and the fusion RNN, respectively. The evaluation results are shown in Table 4. The performance of every ablation model is worse than that of SRS under all metrics, which demonstrates the necessity of each component in SRS. We also find that the sticker classification task makes the least contribution to the overall performance, but this additional task speeds up the training process and helps our model converge quickly: it takes 19 hours to train SRS until convergence versus 30 hours to train SRS w/o Classify. The fusion RNN brings a significant contribution, improving the MAP and R10@1 scores by 4.43% and 7.08%, respectively. Besides, the deep interaction network also plays an important part; without this module, the interaction between the sticker and the utterances is hindered, leading to a 6.88% drop in R10@1.

7.3 Analysis of the Number of Utterances

For research question RQ3, in addition to comparing with various baselines, we also evaluate our model when reading different numbers of utterances to study how the performance relates to the number of context turns.

Figure 7 shows how the performance of SRS changes with respect to different numbers of utterance turns. We observe a similar trend for SRS on the first three evaluation metrics, MAP, R10@1 and R10@2: they first increase until the utterance number reaches 15, and then fluctuate as the utterance number continues to increase. There are two possible reasons for this phenomenon. The first reason might be that, when the information in the utterances is limited, the model can capture the features well, and thus when the amount of information increases, the performance gets better. However, the capacity of the model is limited, and when the amount of information exceeds its upper bound, the model gets confused by the overwhelming information. The second reason might be the usefulness of the utterance context. Utterances that occur too early before the sticker response may be irrelevant to the sticker and bring unnecessary noise. As for the last metric, the above observations do not hold. The R10@5 scores fluctuate when the utterance number is below 15, and drop when the utterance number increases further. The reason might be that R10@5 is not a strict metric, and it is easy to place the right sticker within the top half of the candidates. Thus, the growth of the information given to SRS does not help it perform better, while the noise it brings harms the performance. On the other hand, though the number of utterances changes from 3 to 20, the overall performance of SRS generally remains at a high level, which demonstrates the robustness of our model.

7.4 Analysis of Attention Distribution in the Interaction Process

Next, we turn to address RQ4. We show three cases with their dialog contexts in Figure 8. There are four stickers under each dialog context: one is the sticker selected by our model and the other three are randomly selected candidate stickers. As a main component of SRS, the deep interaction network comprises a bi-directional attention mechanism between the utterance and the sticker, where each word in the utterance and each unit in the sticker representation have a similarity score in the co-attention matrix. To visualize the sticker selection process and to demonstrate the interpretability of SRS, we visualize the sticker-wise attention τ^s (Equation 9) on the original sticker image and show some examples in Figure 8. The lighter the area is, the higher attention it gets. Facial expressions are an important part of sticker images; hence, we select several stickers with vivid facial expressions in Figure 8.
Figure 7: Performance of SRS on all metrics when reading different number of utterances.
Figure 8: Examples of sticker selection results produced by SRS. We show the selected sticker and three randomly selected candidate stickers with the attention heat maps. The lighter an area of the image is, the higher the attention weight it gets.
Take the fourth sticker in Case 1 for example, where the character has a winking eye and a smiling mouth. The highlights are accurately placed on the character's eye, indicating that the representation of this sticker is highly dependent on this part. Another example is the last sticker of Case 3: there are two question marks on the top right corner of the sticker image, which indicate that the girl is very suspicious. In addition to facial expressions, the characters' gestures can also convey emotions. Take the third sticker in Case 2 for example: the character in this sticker gives a thumbs up representing support, and we can see that the attention lies on his hand, indicating that the model learns the key point of his body language.

Furthermore, we randomly select three utterances from the test dataset and visualize the attention distribution over the words in each utterance, as shown in Figure 9. We use the weight τ_j^u of the j-th word (calculated in Equation 8) as the attention weight. We find that the attention module consistently gives higher attention weights to the salient words, such as "easy method", "make a lot of money" and "use China Mobile".
Figure 9: Examples of the attention weights of the dialog utterance. We translate Chinese to English word by word. The darker the area is, the higher weight the word gets.

Figure 11: Performance of SRS with different parameter settings.
7.5 Influence of Similarity between Candidates

In this section, we turn to RQ5 to investigate the influence of the similarity between candidates. The candidate stickers are sampled from the same set, and stickers in a set usually have a similar style. Thus, it is natural to ask: Can our model identify the correct sticker from a set of similar candidates? What is the influence of the similarity between candidate stickers? Hence, we use the Structural Similarity Index (SSIM) metric [2, 37] to calculate the average similarity among all candidates in a test sample and then aggregate all test samples into five groups according to their average similarities. We calculate the R10@1 of each group of samples, as shown in Figure 10. The x-axis is the average similarity between candidate stickers and the y-axis is the R10@1 score.

Figure 10: Performance of SRS on groups of different candidate similarity.

Not surprisingly, SRS achieves the best performance when the average similarity of the candidate group is low, and its performance drops as the similarity increases. However, we can also see that, though the similarity varies from minimum to maximum, the overall performance stays at a high level: the R10@1 scores of all five groups are above 0.42, and the highest score reaches 0.59. That is, our model is highly robust and keeps giving reasonable sticker responses.
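The grouping procedure can be sketched with scikit-image's SSIM implementation as below; loading, graying and resizing the stickers to a common shape is assumed to happen beforehand, and the five equal-width bins are an assumed reading of the grouping described above.

import numpy as np
from itertools import combinations
from skimage.metrics import structural_similarity

def average_candidate_ssim(stickers: list) -> float:
    """Mean pairwise SSIM among one sample's candidate stickers.

    `stickers` is a list of grayscale images as 2-D numpy arrays of equal size.
    """
    pairs = list(combinations(stickers, 2))
    return float(np.mean([structural_similarity(a, b, data_range=1.0) for a, b in pairs]))

def similarity_group(avg_ssim: float, num_groups: int = 5) -> int:
    """Bucket a sample into one of five equal-width similarity bins (assumed binning)."""
    idx = int(max(avg_ssim, 0.0) * num_groups)
    return min(idx, num_groups - 1)

# Toy usage with random "stickers"; real stickers would be loaded and resized first.
rng = np.random.default_rng(0)
candidates = [rng.random((64, 64)) for _ in range(10)]
group = similarity_group(average_candidate_ssim(candidates))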
7.6 Robustness of the Parameter Setting

Finally, we turn to RQ6 to investigate the robustness of the parameter setting. We train our model with different parameter settings, as shown in Figure 11. The hidden size of the RNN, CNN and the dense layers in our model is tuned from 50 to 250, and we use MAP and R_n@k to evaluate each model. As the hidden size grows from 50 to 100, the performance rises too; the increase in hidden size improves the MAP and R10@1 scores by 0.4% and 1.0%. When the hidden size grows further, from 100 to 250, the performance declines slightly. The increment of hidden size leads

8 CONCLUSION

In this paper, we propose the task of multi-turn sticker response selection, which recommends an appropriate sticker based on the multi-turn dialog context history without relying on external knowledge. To tackle this task, we propose the sticker response selector (SRS). Specifically, SRS first learns the representation of each utterance using a self-attention mechanism and learns the sticker representation with a CNN. Next, a deep interaction network is employed to fully model the dependency between the sticker and the utterances. The deep interaction network consists of a co-attention matrix that calculates the attention between each word in an utterance and each unit in the sticker representation. Then, a bi-directional attention is used to obtain the utterance-aware sticker representation and the sticker-aware utterance representations. Finally, a fusion network models the short-term and long-term relationships between the interaction results, and a fully-connected layer is applied to obtain the final selection result. Our model outperforms state-of-the-art methods on all metrics, and the experimental results also demonstrate the robustness of our model on data with different degrees of similarity between candidate stickers. In the near future, we aim to propose a personalized sticker response selection system.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their constructive comments. We would also like to thank Anna Hennig in the Inception Institute of Artificial Intelligence for her help on this paper. This work was supported by the National Key Research and Development Program of China (No. 2017YFC0804001) and the National Science Foundation of China (NSFC No. 61876196 and NSFC No. 61672058). Rui Yan is partially supported as a Young Fellow of Beijing Institute of Artificial Intelligence (BAAI).