Learning To Respond With Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog
Figure 1: An example of stickers in a multi-turn dialog. Sticker response selector automatically selects the proper sticker based on multi-turn dialog history.

Users have to browse all the stickers they have collected and select the appropriate one, which is both difficult and time-consuming. Consequently, much research has focused on recommending appropriate emojis to users according to the chatting context. Existing works, such as [41], are mostly based on emoji recommendation, where they predict the probable emoji given the contextual information from multi-turn dialog systems. In contrast, other works [5, 6] recommend emojis based on the text and images posted by a user. However, the use of emojis is restricted due to their limited variety and small size, while stickers are more expressive and come in great variety. As for sticker recommendation, existing works such as [23] and apps like Hike or QQ directly match the text typed by the user to the short text tag assigned to each sticker. However, since there are countless ways of expressing the same emotion, it is impossible to capture all variants of an utterance as tags.

In this paper, we address the task of sticker response selection in multi-turn dialog, where an appropriate sticker is recommended based on the dialog history. There are two main challenges in this task: (1) To the best of our knowledge, no existing image recognition method is designed for sticker images, so capturing the semantic meaning of a sticker is challenging. (2) Understanding the multi-turn dialog history is crucial for sticker recommendation, and jointly modeling the candidate sticker with the multi-turn dialog is challenging. Herein, we propose a novel sticker recommendation model, namely the sticker response selector (SRS), for sticker response selection in multi-turn dialog. Specifically, SRS first learns representations of the dialog context history using a self-attention mechanism and learns the sticker representation with a convolutional network. Next, SRS conducts deep matching between the sticker and each utterance and produces the interaction results for every utterance.

2 https://github.com/gsh199449/stickerchat

2 RELATED WORK

We outline related work on sticker recommendation, visual question answering, visual dialog, and multi-turn response selection.

Sticker recommendation. Most previous works emphasize the use of emojis instead of stickers. For example, [5, 6] use a multimodal approach to recommend emojis based on the text and images in an Instagram post. However, emojis are typically used in conjunction with text, while stickers are independent information carriers. What is more, emojis are limited in variety, while there exists an abundance of different stickers. The most similar work to ours is [23], which generates recommended stickers by first predicting the next message the user is likely to send in the chat, and then substituting it with an appropriate sticker. However, more often than not the implication of a sticker cannot be fully conveyed by text, and in this paper we focus on directly generating sticker recommendations from the dialog history.

Visual question answering. Sticker recommendation involves the representation of and interaction between images and text, which is related to the Visual Question Answering (VQA) task [11, 15, 28, 29, 35]. Specifically, VQA takes an image and a corresponding natural language question as input and outputs the answer. It is a classification problem in which candidate answers are restricted to the most common answers appearing in the dataset, and it requires deep analysis and understanding of images and questions, such as image recognition and object localization [16, 27, 38, 42]. Current models can be classified into three main categories: early fusion models, later fusion models, and external knowledge-based models. One state-of-the-art VQA model is [25], which proposes an architecture, positional self-attention with co-attention, that does not require a recurrent neural network (RNN) for video question answering. [17] proposes an image-question-answer synergistic network, where candidate answers are coarsely scored according to their relevance to the image and question pair in the first stage. Then, answers with a high probability of being correct are re-ranked by synergizing with images and questions.
The difference between the sticker selection and VQA tasks is that the sticker selection task focuses more on the multi-turn multimodal interaction between stickers and utterances.

Visual dialog. Visual dialog extends the single-turn dialog task [14, 31] in VQA to a multi-turn one, where later questions may be related to former question-answer pairs. To solve this task, [26] transfers knowledge from a pre-trained discriminative network to a generative network with an RNN encoder, using a perceptual loss. [39] combines reinforcement learning and generative adversarial networks (GANs) to generate more human-like responses to questions, where the GAN helps overcome the relative paucity of training data and the tendency of the typical maximum-likelihood-estimation-based approach to generate overly terse answers. [21] demonstrates a simple symmetric discriminative baseline that can be applied to both predicting an answer and predicting a question in visual dialog. Unlike VQA and visual dialog tasks, in a sticker recommendation system the candidates are stickers rather than text.

Multi-turn response selection. Multi-turn response selection [10, 33, 43-45] takes a message and the utterances in its previous turns as input and selects a response that is natural and relevant to the context.

Table 1: Statistics of Response Selection Dataset.

                                   Train     Valid    Test
# context-sticker pairs            320,168   10,000   10,000
Avg. words of context utterance    7.54      7.50     7.42
Avg. users participating           5.81      5.81     5.79
4 PROBLEM FORMULATION

Before presenting our approach for sticker response selection in multi-turn dialog, we first introduce our notation and key concepts. Similar to multi-turn dialog response selection [40, 47], we assume that there is a multi-turn dialog context s = {u_1, ..., u_{T_u}} and a candidate sticker set C = {c_1, ..., c_{T_c}}, where u_i represents the i-th utterance in the multi-turn dialog. In the i-th utterance u_i = {x_1^i, ..., x_{T_x^i}^i}, x_j^i represents the j-th word in u_i, and T_x^i represents the total number of words in utterance u_i. In dialog context s, c_i represents a sticker image with a binary label y_i, indicating whether c_i is an appropriate response for s. T_u is the number of utterances in the dialog context and T_c is the number of candidate stickers. For each candidate set, there is only one ground-truth sticker, and the remaining ones are negative samples. Our goal is to learn a ranking model that can produce the correct ranking for each candidate sticker c_i; that is, select the correct sticker among all the other candidates. For the rest of the paper, we take the i-th candidate sticker c_i as an example to illustrate the details of our model and omit the candidate index i for brevity.
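To make the notation concrete, the following minimal sketch shows one possible in-memory representation of a single training instance; the class and field names are hypothetical and not prescribed by the paper.

from dataclasses import dataclass
from typing import List

@dataclass
class StickerSelectionInstance:
    """One training instance: a dialog context s and a candidate sticker set C."""
    utterances: List[List[str]]   # s = {u_1, ..., u_Tu}; each u_i is a list of words x_j^i
    candidates: List[str]         # C = {c_1, ..., c_Tc}; ids or paths of candidate sticker images
    labels: List[int]             # y_i in {0, 1}; exactly one candidate is the ground truth

example = StickerSelectionInstance(
    utterances=[["how", "are", "you", "?"], ["pretty", "tired", "today"]],
    candidates=["sticker_001.png", "sticker_042.png", "sticker_777.png"],
    labels=[0, 1, 0],
)
assert sum(example.labels) == 1  # only one ground-truth sticker per candidate set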
5 SRS MODEL

5.1 Overview

In this section, we propose our sticker response selector, abbreviated as SRS. An overview of SRS is shown in Figure 4. It can be split into four main parts:

• Sticker encoder is a convolutional neural network (CNN) based image encoding module that learns a sticker representation.
• Utterance encoder is a self-attention mechanism based module encoding each utterance u_i in the multi-turn dialog context s.
• Deep interaction network conducts deep matching between each sticker representation and each utterance, and outputs each interaction result.
• Fusion network learns the short-term dependency by the fusion RNN and the long-term dependency by the fusion Transformer, and finally outputs the matching score combining these features using an interaction function.

5.2 Sticker Encoder

Much research has been conducted to alleviate gradient vanishing [19] and reduce computational costs [18] in image modeling tasks. We utilize one of these models, i.e., the Inception-v3 [30] model, rather than a plain CNN, to encode the sticker image:

O, O_flat = Inception-v3(c),   (1)

where c is the sticker image. The sticker representation O ∈ R^{p×p×d} conserves the two-dimensional information of the sticker and will be used when associating stickers and utterances in §5.4. We use the original image representation output of Inception-v3, O_flat ∈ R^d, as another sticker representation. However, existing pre-trained CNN networks, including Inception-v3, are mostly built on real-world photos. Thus, directly applying the pre-trained networks to stickers cannot speed up the training process. In this dataset, the sticker author gives each sticker c an emoji tag which denotes the general emotion of the sticker. Hereby, we propose an auxiliary sticker classification task to help the model converge quickly, which uses O_flat to predict which emoji is attached to the corresponding sticker. More specifically, we feed O_flat into a linear classification layer and then use the cross-entropy loss L_s as the loss function of this classification task.
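For concreteness, the sketch below shows one way the sticker encoder of §5.2 could be wired up. It assumes the torchvision Inception-v3 interface (ImageNet weights are downloaded on first use); the choice of Mixed_7c as the layer that yields the spatial map O, the emoji vocabulary size of 50, and the tensor dimensions are illustrative assumptions rather than the exact configuration of SRS.

import torch
import torch.nn as nn
import torchvision

class StickerEncoder(nn.Module):
    """Sketch of the sticker encoder: Inception-v3 backbone plus an
    auxiliary emoji-classification head (Eq. 1 and the loss L_s)."""

    def __init__(self, num_emoji_classes: int = 50, d: int = 2048):
        super().__init__()
        # Pre-trained Inception-v3; the classification head is replaced so that
        # the forward pass yields the pooled feature O_flat in R^d.
        self.backbone = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
        self.backbone.aux_logits = False          # skip the built-in auxiliary branch
        self.backbone.fc = nn.Identity()
        # Capture the last spatial feature map as O in R^{p x p x d} via a hook
        # (tapping Mixed_7c is an assumption; p = 8 for 299 x 299 input).
        self._spatial = {}
        self.backbone.Mixed_7c.register_forward_hook(
            lambda module, inp, out: self._spatial.update(feat=out)
        )
        # Linear layer for the auxiliary emoji classification task.
        self.emoji_head = nn.Linear(d, num_emoji_classes)

    def forward(self, sticker: torch.Tensor):
        o_flat = self.backbone(sticker)                  # O_flat: (batch, d)
        o = self._spatial["feat"].permute(0, 2, 3, 1)    # O: (batch, p, p, d)
        emoji_logits = self.emoji_head(o_flat)           # input to the cross-entropy loss L_s
        return o, o_flat, emoji_logits

encoder = StickerEncoder()
images = torch.randn(4, 3, 299, 299)                    # a batch of sticker images c
emoji_tags = torch.randint(0, 50, (4,))
o, o_flat, emoji_logits = encoder(images)
loss_s = nn.functional.cross_entropy(emoji_logits, emoji_tags)

The auxiliary loss L_s is added only to speed up convergence; it is separate from the ranking objective introduced in §5.6.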
Figure 4: Overview of SRS. We divide our model into four ingredients: (1) Sticker encoder learns sticker representation; (2) Utterance encoder learns representation of each utterance; (3) Deep interaction network conducts deep matching interaction between sticker representation and utterance representation in different levels of granularity; (4) Fusion network combines the long-term and short-term dependency feature between interaction results produced by (3).
5.3 Utterance Encoder

To model the semantic meaning of the dialog context, we learn the representation of each utterance u_i. First, we use an embedding matrix e to map a one-hot representation of each word in each utterance u_i to a high-dimensional vector space. We denote e(x_j^i) as the embedding representation of word x_j^i. From these embedding representations, we use the attentive module from the Transformer [34] to model the temporal interactions between the words in an utterance. Attention mechanisms have become an integral part of compelling sequence modeling in various tasks [4, 9, 13, 25]. In our sticker selection task, we also need to let words fully interact with each other to model the dependencies between words without regard to their locations in the input sentence. The attentive module in the Transformer has three inputs: the query Q, the key K and the value V. We use three fully-connected layers with different parameters to project the embedding of the dialog context e(x_j^i) into three spaces:

Q_j^i = FC(e(x_j^i)),  K_j^i = FC(e(x_j^i)),  V_j^i = FC(e(x_j^i)).   (2)

The attentive module then takes each Q_j^i to attend to K_·^i, and uses the resulting attention distribution α_{j,·}^i ∈ R^{T_x^i} as weights to obtain the weighted sum of the values, as shown in Equation 4. Next, we add the original word representation to β_j^i as a residual connection, shown in Equation 5:

α_{j,k}^i = exp(Q_j^i · K_k^i) / Σ_{n=1}^{T_x^i} exp(Q_j^i · K_n^i),   (3)

β_j^i = Σ_{k=1}^{T_x^i} α_{j,k}^i · V_k^i,   (4)

ĥ_j^i = Dropout(e(x_j^i) + β_j^i),   (5)

where α_{j,k}^i denotes the attention weight between the j-th word and the k-th word in the i-th utterance. To prevent vanishing or exploding gradients, a layer normalization operation [24] is also applied to the output of the feed-forward layer, as shown in Equation 6:

h_j^i = norm(max(0, ĥ_j^i · W_1 + b_1) · W_2 + b_2),   (6)

where W_1, W_2, b_1, b_2 are all trainable parameters. h_j^i denotes the hidden state of the j-th word in the Transformer for the i-th utterance.
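The attentive module of Equations 2-6 can be sketched as follows; the hidden sizes, dropout rate and single-utterance batching are illustrative assumptions, and the dot products are deliberately left unscaled to follow Equation 3.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveModule(nn.Module):
    """Sketch of the utterance encoder's attentive module (Eqs. 2-6)."""

    def __init__(self, d_model: int = 256, d_ff: int = 512, dropout: float = 0.1):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)   # Eq. 2: Q = FC(e(x))
        self.w_k = nn.Linear(d_model, d_model)   #        K = FC(e(x))
        self.w_v = nn.Linear(d_model, d_model)   #        V = FC(e(x))
        self.dropout = nn.Dropout(dropout)
        self.ffn1 = nn.Linear(d_model, d_ff)     # Eq. 6: position-wise feed-forward layer
        self.ffn2 = nn.Linear(d_ff, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, T_x, d_model) -- word embeddings e(x_j) of one utterance
        q, k, v = self.w_q(emb), self.w_k(emb), self.w_v(emb)
        alpha = F.softmax(q @ k.transpose(-2, -1), dim=-1)   # Eq. 3 (unscaled dot product)
        beta = alpha @ v                                      # Eq. 4: weighted sum of V
        h_hat = self.dropout(emb + beta)                      # Eq. 5: residual connection + dropout
        h = self.norm(self.ffn2(F.relu(self.ffn1(h_hat))))    # Eq. 6: FFN + layer normalization
        return h                                              # (batch, T_x, d_model)

module = AttentiveModule()
h = module(torch.randn(2, 10, 256))   # two utterances, ten words each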
5.4 Deep Interaction Network

Now that we have the representations of the sticker and each utterance, we can conduct a deep matching between these components. On the one hand, there are emotional words in the dialog context history that match the expression of the sticker, such as "happy" or "sad". On the other hand, specific parts of the sticker can also match these corresponding words, such as dancing limbs or streaming eyes. Hence, we employ a bi-directional attention mechanism between a sticker and each utterance, that is, from utterance to sticker and from sticker to utterance, to analyze the cross-dependency between the two components. The interaction is illustrated in Figure 5.

We take the i-th utterance as an example and omit the index i for brevity. The two directed attentions are derived from a shared relation matrix M ∈ R^{(p^2)×T_u}, calculated from the sticker representation O ∈ R^{p×p×d} and the utterance representation h ∈ R^{T_u×d}. The score M_{kj} ∈ R in the relation matrix M indicates the relation between the k-th sticker representation unit O_k, k ∈ [1, p^2], and the j-th word h_j, j ∈ [1, T_u], and is computed as:

M_{kj} = σ(O_k, h_j),  σ(x, y) = w^⊤ [x ⊕ y ⊕ (x ⊗ y)],   (7)

where σ is a trainable scalar function that encodes the relation between two input vectors, ⊕ denotes a concatenation operation, and ⊗ is the element-wise multiplication.

Next, a max pooling operation is conducted on M, i.e., let τ_j^u = max(M_{:j}) ∈ R represent the attention weight on the j-th utterance word given by the sticker representation, corresponding to the "utterance-wise attention". This attention learns to assign high weights to the important words that are closely related to the sticker. We then obtain the weighted sum of hidden states as the "sticker-aware utterance representation" l:

l = Σ_{j=1}^{T_u} τ_j^u h_j.   (8)

Similarly, the sticker-wise attention learns which part of the sticker is most relevant to the utterance. Let τ_k^s = max(M_{k:}) ∈ R represent the attention weight on the k-th unit of the sticker representation. We use this to obtain the weighted sum of O_k, i.e., the "utterance-aware sticker representation" r:

r = Σ_{k=1}^{p^2} τ_k^s O_k.   (9)

After obtaining the two outputs from the co-attention module, we combine the sticker and utterance representations and finally get the ranking result. We first integrate the utterance-aware sticker representation r with the original sticker representation O_flat using an integrate function, named IF:

Q_1 = IF(O_flat, r),  IF(x, y) = FC(x ⊕ y ⊕ (x ⊗ y) ⊕ (x + y)).   (10)

We then add the sticker-aware utterance representation l to Q_1 and apply a fully-connected layer:

Q_2 = FC(Q_1 ⊕ l).   (11)
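A minimal sketch of the deep interaction network (Equations 7-11) is given below. It processes one sticker-utterance pair without batching, assumes the sticker units and word states share the same dimension d, and keeps the max-pooled scores unnormalized as in Equations 8 and 9; all dimension choices are illustrative.

import torch
import torch.nn as nn

class DeepInteractionNetwork(nn.Module):
    """Sketch of the bi-directional attention in Eqs. 7-11 (no batching)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.w = nn.Linear(3 * d, 1, bias=False)   # the trainable w in sigma(x, y), Eq. 7
        self.integrate = nn.Linear(4 * d, d)       # FC inside the integrate function IF, Eq. 10
        self.fc_q2 = nn.Linear(2 * d, d)           # Eq. 11

    def forward(self, o: torch.Tensor, o_flat: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # o: (p*p, d) flattened sticker units O_k; o_flat: (d,); h: (T_u, d) utterance word states
        p2, tu = o.size(0), h.size(0)
        x = o.unsqueeze(1).expand(p2, tu, -1)            # O_k broadcast over words
        y = h.unsqueeze(0).expand(p2, tu, -1)            # h_j broadcast over sticker units
        m = self.w(torch.cat([x, y, x * y], dim=-1)).squeeze(-1)   # Eq. 7: M in R^{p^2 x T_u}

        tau_u = m.max(dim=0).values                      # Eq. 8: utterance-wise attention weights
        tau_s = m.max(dim=1).values                      # Eq. 9: sticker-wise attention weights
        l = tau_u @ h                                    # sticker-aware utterance representation
        r = tau_s @ o                                    # utterance-aware sticker representation

        q1 = self.integrate(torch.cat([o_flat, r, o_flat * r, o_flat + r], dim=-1))  # Eq. 10
        q2 = self.fc_q2(torch.cat([q1, l], dim=-1))                                  # Eq. 11
        return q2

din = DeepInteractionNetwork()
q2 = din(torch.randn(64, 256), torch.randn(256), torch.randn(12, 256))  # p*p = 64 units, 12 words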
5.5 Fusion Network

Figure 6: Framework of fusion network.

5.5.1 Fusion RNN. The fusion RNN first reads the interaction results for each utterance, {Q_2^1, ..., Q_2^{T_u}}, and transforms them into a sequence of hidden states. In this paper, we employ the gated recurrent unit (GRU) [7] as the cell of the fusion RNN, which is popular in sequential modeling [12, 40]:

g_i = GRU(g_{i-1}, Q_2^i),   (12)

where g_i is the hidden state of the fusion RNN. Finally, we obtain the sequence of hidden states {g_1, ..., g_{T_u}}. One can replace the GRU with similar algorithms such as the LSTM [20]; we leave this study as future work.

5.5.2 Fusion Transformer. To model the long-term dependency and capture the salient utterances in the context, we employ the self-attention mechanism introduced in Equations 3-6. Concretely, given {Q_2^1, ..., Q_2^{T_u}}, we first employ three linear projection layers with different parameters to project the input sequence into three different spaces:

Q^i = FC(Q_2^i),  K^i = FC(Q_2^i),  V^i = FC(Q_2^i).   (13)

Then we feed these three matrices into the self-attention algorithm illustrated in Equations 3-6. Finally, we obtain the long-term interaction result {ĝ_1, ..., ĝ_{T_u}}.

5.5.3 Prediction Layer. To combine the interaction representations generated by the fusion RNN and the fusion Transformer, we employ the SUMULTI function proposed by [36] to combine these representations, which has been proven effective in various tasks:

g_i = ReLU(W_s [(ĝ_i − g_i) ⊙ (ĝ_i − g_i) ⊕ (ĝ_i ⊙ g_i)] + b_s).   (14)

The new interaction sequence {g_1, ..., g_{T_u}} is then boiled down to a matching vector g̃_{T_u} by another GRU-based RNN:

g̃_i = RNN(g̃_{i−1}, g_i).   (15)

We use the final hidden state g̃_{T_u} as the representation of the overall interaction result between the whole utterance context and the candidate sticker. Finally, we apply a fully-connected layer to produce the matching score ŷ of the candidate sticker:

ŷ = FC(g̃_{T_u}),   (16)

where ŷ ∈ (0, 1) is the matching score of the candidate sticker.
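The fusion network and prediction layer (Equations 12-16) can be sketched as below. A single-head nn.MultiheadAttention stands in for the fusion Transformer of Equations 13 and 3-6, the sigmoid that maps the final score into (0, 1) is an assumption implied by Equation 16, and all sizes are illustrative.

import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Sketch of the fusion network and prediction layer (Eqs. 12-16)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.fusion_rnn = nn.GRU(d, d, batch_first=True)      # Eq. 12: g_i = GRU(g_{i-1}, Q_2^i)
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)  # stand-in for Eq. 13
        self.w_sub = nn.Linear(2 * d, d)                       # Eq. 14 (SUMULTI-style combination)
        self.match_rnn = nn.GRU(d, d, batch_first=True)        # Eq. 15
        self.score = nn.Linear(d, 1)                           # Eq. 16

    def forward(self, q2_seq: torch.Tensor) -> torch.Tensor:
        # q2_seq: (batch, T_u, d) -- interaction results Q_2^1 .. Q_2^{T_u}
        g, _ = self.fusion_rnn(q2_seq)                         # short-term dependency
        g_hat, _ = self.attn(q2_seq, q2_seq, q2_seq)           # long-term dependency
        combined = torch.relu(
            self.w_sub(torch.cat([(g_hat - g) * (g_hat - g), g_hat * g], dim=-1))
        )                                                       # Eq. 14
        _, g_tilde = self.match_rnn(combined)                   # Eq. 15; keep the final hidden state
        return torch.sigmoid(self.score(g_tilde[-1]))           # Eq. 16: y_hat in (0, 1)

scores = FusionNetwork()(torch.randn(4, 6, 256))               # (batch, 1) matching scores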
5.6 Learning

Recall that we have a candidate sticker set C = {c_1, ..., c_{T_c}} which contains multiple negative samples and one ground-truth sticker. We use the hinge loss as our objective function:

L_r = Σ_N max(0, ŷ_negative − ŷ_positive + margin),   (17)

where ŷ_negative and ŷ_positive correspond to the predicted matching scores of a negative sample and the ground-truth sticker, respectively, and margin is the margin rescaling in the hinge loss. The gradient descent method is employed to update all the parameters of our model so as to minimize this loss function.
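A direct rendering of the hinge objective in Equation 17, with an illustrative margin value:

import torch

def hinge_ranking_loss(pos_score: torch.Tensor,
                       neg_scores: torch.Tensor,
                       margin: float = 0.3) -> torch.Tensor:
    """Eq. 17: sum over the N negatives of max(0, y_neg - y_pos + margin).

    pos_score:  scalar tensor, score of the ground-truth sticker.
    neg_scores: (N,) tensor, scores of the N negative candidates.
    The margin value here is illustrative; the paper treats it as a hyper-parameter.
    """
    return torch.clamp(neg_scores - pos_score + margin, min=0).sum()

loss_r = hinge_ranking_loss(torch.tensor(0.8), torch.tensor([0.6, 0.75, 0.4]))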
Table 2: Ablation models for comparison.

Acronym            Gloss
SRS w/o pretrain   SRS w/o pre-trained Inception-v3 model
SRS w/o Classify   SRS w/o emoji classification task
SRS w/o DIN        SRS w/o deep interaction network
SRS w/o FR         SRS w/o fusion RNN

6 EXPERIMENTAL SETUP

6.1 Research Questions

We list six research questions that guide the experiments:
• RQ1 (See §7.1): What is the overall performance of SRS compared with all baselines?
• RQ2 (See §7.2): What is the effect of each module in SRS?
• RQ3 (See §7.3): How does the performance change when the number of utterances changes?
• RQ4 (See §7.4): Can the co-attention mechanism successfully capture the salient parts of the sticker image and the important words in the dialog context?
• RQ5 (See §7.5): What is the influence of the similarity between candidate stickers?
• RQ6 (See §7.6): What is the influence of the parameter settings?

6.2 Baselines

(1) Synergistic: [17] proposes an image-question-answer synergistic network, where candidate answers with high probabilities of being correct are re-ranked by synergizing with image and question. This model achieves the state-of-the-art performance on the Visual Dialog v1.0 dataset [8].
(2) PSAC: [25] proposes the positional self-attention with co-attention architecture for the VQA task, which does not require RNNs for video question answering. We replace the output probability over the vocabulary with the probability over the candidate sticker set.
(3) SMN: [40] proposes a sequential matching network to address response selection for the multi-turn conversation problem. SMN first matches a response with each utterance in the context; the matching vectors are then accumulated in chronological order through an RNN, and the final matching score is calculated with the RNN.
(4) DAM: [47] extends the Transformer model [34] to the multi-turn response selection task, where representations of text segments are constructed using stacked self-attention. Then, truly matched segment pairs are extracted across context and response.
(5) MRFN: [32] proposes a multi-representation fusion network, where the representations can be fused into matching at an early stage, at the intermediate stage or at the last stage. This is the state-of-the-art model on the multi-turn response selection task.

For the three multi-turn response selection baselines above, we replace the candidate embedding RNN network with the image encoding CNN network Inception-v3, as used in our model. This network is initialized using a pre-trained model3 for all baselines and SRS.

6.3 Evaluation Metrics

Following [32, 47], we employ recall at position k in n candidates, R_n@k, as an evaluation metric, which measures whether the positive response is ranked in the top k positions of n candidates. Following [47], we also employ mean average precision (MAP) [3] as an evaluation metric. The statistical significance of differences observed between the performance of two runs is tested using a two-tailed paired t-test and is denoted using ▲ (or ▼) for strong significance at α = 0.01.
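The metrics can be computed as in the following sketch, which assumes exactly one positive candidate per sample, as in our dataset; function names and the toy scores are illustrative.

from typing import List

def recall_at_k(scores: List[float], positive_index: int, k: int) -> float:
    """R_n@k for one sample: 1.0 if the positive candidate is ranked in the top k."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if positive_index in ranking[:k] else 0.0

def average_precision(scores: List[float], positive_index: int) -> float:
    """AP for one sample; with a single positive it equals 1 / rank_of_positive."""
    rank = 1 + sum(s > scores[positive_index] for i, s in enumerate(scores) if i != positive_index)
    return 1.0 / rank

# Averaging over the test set gives R_10@k and MAP.
samples = [([0.9, 0.2, 0.4], 0), ([0.3, 0.8, 0.5], 2)]   # (candidate scores, positive index)
r_at_1 = sum(recall_at_k(s, p, 1) for s, p in samples) / len(samples)
map_score = sum(average_precision(s, p) for s, p in samples) / len(samples)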
Table 3: RQ1: Automatic evaluation comparison. Significant differences are with respect to MRFN.

                 MAP     R10@1    R10@2    R10@5
Visual Q&A methods
  Synergistic    0.593   0.438    0.569    0.798
  PSAC           0.662   0.533    0.641    0.836
Multi-turn response selection methods
  SMN            0.524   0.357    0.488    0.737
  DAM            0.620   0.474    0.601    0.813
  MRFN           0.684   0.557    0.672    0.853
SRS              0.709   0.590▲   0.703▲   0.872

Table 4: RQ2: Evaluation of different ablation models.

                    MAP     R10@1   R10@2   R10@5
SRS w/o pretrain    0.650   0.510   0.641   0.833
SRS w/o Classify    0.707   0.588   0.700   0.871
SRS w/o DIN         0.680   0.552   0.669   0.854
SRS w/o FR          0.677   0.551   0.663   0.863
SRS                 0.709   0.590   0.703   0.872

7 EXPERIMENTAL RESULTS

7.1 Overall Performance

For research question RQ1, we examine the performance of our model and the baselines in terms of each evaluation metric, as shown in Table 3. First, the performance of the multi-turn response selection models is generally consistent with their performance on text response selection datasets. SMN [40], an earlier work on the multi-turn response selection task with a simple structure, obtains the worst performance on both sticker response and text response selection. DAM [47] improves on the SMN model and achieves the second best performance. MRFN [32] is the state-of-the-art text response selection model and achieves the best performance among the baselines on our task as well. Second, the VQA models perform generally worse than the multi-turn response selection models, since the interaction between the multi-turn utterances and the sticker is important but is not taken into account by the VQA models. Finally, SRS achieves the best performance, with 3.36%, 5.92% and 3.72% improvements in MAP, R10@1 and R10@2, respectively, over the state-of-the-art multi-turn response selection model, MRFN, and with 6.80%, 10.69% and 8.74% significant increases over the state-of-the-art visual dialog model, PSAC. This proves the superiority of our model.

7.2 Ablation Study

For research question RQ2, we conduct ablation tests on the use of the pre-trained Inception-v3 model, the sticker classification loss, the deep interaction network and the fusion RNN, respectively. The evaluation results are shown in Table 4. The performance of every ablation model is worse than that of SRS under all metrics, which demonstrates the necessity of each component in SRS. We also find that the sticker classification task makes the least contribution to the overall performance, but this additional task speeds up the training process and helps our model converge quickly: it takes 19 hours to train SRS until convergence versus 30 hours to train SRS w/o Classify. The fusion RNN brings a significant contribution, improving the MAP and R10@1 scores by 4.43% and 7.08%, respectively. Besides, the deep interaction network also plays an important part; without this module, the interaction between the sticker and the utterances is hindered, leading to a 6.88% drop in R10@1.

7.3 Analysis of the Number of Utterances

For research question RQ3, in addition to comparing with various baselines, we also evaluate our model when reading different numbers of utterances to study how the performance relates to the number of context turns.

Figure 7 shows how the performance of SRS changes with respect to different numbers of utterance turns. We observe a similar trend for SRS on the first three evaluation metrics, MAP, R10@1 and R10@2: they first increase until the utterance number reaches 15, and then fluctuate as the utterance number continues to increase. There are two possible reasons for this phenomenon. The first reason might be that, when the information in the utterances is limited, the model can capture the features well, and thus when the amount of information increases, the performance gets better. However, the capacity of the model is limited, and when the amount of information exceeds its upper bound, the model gets confused by the overwhelming information. The second reason might be the usefulness of the utterance context. Utterances that occur too early before the sticker response may be irrelevant to the sticker and bring unnecessary noise. As for the last metric, the above observations do not hold. The R10@5 scores fluctuate when the utterance number is below 15, and drop when the utterance number increases further. The reason might be that R10@5 is not a strict metric, and it is easy to place the right sticker within the top half of the candidates. Thus, the growth of the information given to SRS does not help it perform better, while the noise it brings harms the performance. On the other hand, though the number of utterances changes from 3 to 20, the overall performance of SRS generally remains at a high level, which demonstrates the robustness of our model.

7.4 Analysis of Attention Distribution in the Interaction Process

Next, we turn to address RQ4. We show three cases with their dialog contexts in Figure 8. There are four stickers under each dialog context: one is the sticker selected by our model and the other three are randomly selected candidate stickers. As a main component of SRS, the deep interaction network comprises a bi-directional attention mechanism between the utterance and the sticker, where each word in the utterance and each unit in the sticker representation have a similarity score in the co-attention matrix. To visualize the sticker selection process and to demonstrate the interpretability of SRS, we visualize the sticker-wise attention τ^s (Equation 9) on the original sticker image and show some examples in Figure 8. The lighter the area is, the higher attention it gets. Facial expressions are an important part of sticker images; hence, we select several stickers with vivid facial expressions in Figure 8.
Figure 7: Performance of SRS on all metrics when reading different number of utterances.
Figure 8: Examples of sticker selection results produced by SRS. We show the selected sticker and three randomly selected candidate stickers with the attention heat maps. The lighter an area of the image is, the higher the attention weight it gets.
Take the fourth sticker in Case 1 for example, where the character has a winking eye and a smiling mouth. The highlights are accurately placed on the character's eye, indicating that the representation of this sticker is highly dependent on this part. Another example is the last sticker of Case 3: there are two question marks on the top right corner of the sticker image, which indicate that the girl is very suspicious. In addition to facial expressions, the characters' gestures can also convey emotions. Take the third sticker in Case 2 for example: the character in this sticker gives a thumbs up representing support, and we can see that the attention lies on his hand, indicating that the model learns the key point of his body language.

Furthermore, we randomly select three utterances from the test dataset and visualize the attention distribution over the words in each utterance, as shown in Figure 9. We use the weight τ_j^u of the j-th word (calculated in Equation 8) as the attention weight. We find that the attention module consistently gives higher attention weights to the salient words, such as "easy method", "make a lot of money" and "use China Mobile".
Figure 9: Examples of the attention weights of the dialog utterance. We translate Chinese to English word by word. The darker the area is, the higher weight the word gets.

Figure 11: Performance of SRS with different parameter settings.
7.5 Influence of Similarity between Candidates

In this section, we turn to RQ5 to investigate the influence of the similarity between candidates. The candidate stickers are sampled from the same set, and stickers in a set usually have a similar style. Thus, it is natural to ask: Can our model identify the correct sticker from a set of similar candidates? What is the influence of the similarity between candidate stickers? Hence, we use the Structural Similarity Index (SSIM) metric [2, 37] to calculate the average similarity among all candidates in a test sample and then aggregate all test samples into five groups according to their average similarities. We calculate the R10@1 of each group of samples, as shown in Figure 10. The x-axis is the average similarity between candidate stickers and the y-axis is the R10@1 score.

Figure 10: Performance of SRS on groups of different candidate similarity.

Not surprisingly, SRS achieves the best performance when the average similarity of the candidate group is low, and its performance drops as the similarity increases. However, we can also see that, though the similarity varies from minimum to maximum, the overall performance stays at a high level: the R10@1 scores of all five groups are above 0.42, and the highest score reaches 0.59. That is, our model is highly robust and keeps giving reasonable sticker responses.
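The grouping procedure can be sketched with scikit-image's SSIM implementation as below; loading, graying and resizing the stickers to a common shape is assumed to happen beforehand, and the five equal-width bins are an assumed reading of the grouping described above.

import numpy as np
from itertools import combinations
from skimage.metrics import structural_similarity

def average_candidate_ssim(stickers: list) -> float:
    """Mean pairwise SSIM among one sample's candidate stickers.

    `stickers` is a list of grayscale images as 2-D numpy arrays of equal size.
    """
    pairs = list(combinations(stickers, 2))
    return float(np.mean([structural_similarity(a, b, data_range=1.0) for a, b in pairs]))

def similarity_group(avg_ssim: float, num_groups: int = 5) -> int:
    """Bucket a sample into one of five equal-width similarity bins (assumed binning)."""
    idx = int(max(avg_ssim, 0.0) * num_groups)
    return min(idx, num_groups - 1)

# Toy usage with random "stickers"; real stickers would be loaded and resized first.
rng = np.random.default_rng(0)
candidates = [rng.random((64, 64)) for _ in range(10)]
group = similarity_group(average_candidate_ssim(candidates))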
7.6 Robustness of the Parameter Setting

Finally, we turn to RQ6 to investigate the robustness of the parameter setting. We train our model with different parameter settings, as shown in Figure 11. The hidden size of the RNN, CNN and the dense layers in our model is tuned from 50 to 250, and we use MAP and R_n@k to evaluate each model. As the hidden size grows from 50 to 100, the performance rises too; the increase in hidden size improves the MAP and R10@1 scores by 0.4% and 1.0%. When the hidden size grows further, from 100 to 250, the performance declines slightly. The increment of hidden size leads

8 CONCLUSION

In this paper, we propose the task of multi-turn sticker response selection, which recommends an appropriate sticker based on the multi-turn dialog context history without relying on external knowledge. To tackle this task, we propose the sticker response selector (SRS). Specifically, SRS first learns the representation of each utterance using a self-attention mechanism and learns the sticker representation with a CNN. Next, a deep interaction network is employed to fully model the dependency between the sticker and the utterances. The deep interaction network consists of a co-attention matrix that calculates the attention between each word in an utterance and each unit in the sticker representation. Then, a bi-directional attention is used to obtain the utterance-aware sticker representation and the sticker-aware utterance representations. Finally, a fusion network models the short-term and long-term relationships between the interaction results, and a fully-connected layer is applied to obtain the final selection result. Our model outperforms state-of-the-art methods on all metrics, and the experimental results also demonstrate the robustness of our model on data with different degrees of similarity between candidate stickers. In the near future, we aim to propose a personalized sticker response selection system.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their constructive comments. We would also like to thank Anna Hennig in the Inception Institute of Artificial Intelligence for her help on this paper. This work was supported by the National Key Research and Development Program of China (No. 2017YFC0804001) and the National Science Foundation of China (NSFC No. 61876196 and NSFC No. 61672058). Rui Yan is partially supported as a Young Fellow of Beijing Institute of Artificial Intelligence (BAAI).