Emotionally Intelligent Chatbots A Systematic Lite

Hindawi
Human Behavior and Emerging Technologies

Volume 2022, Article ID 9601630, 23 pages
https://doi.org/10.1155/2022/9601630
Review Article
Emotionally Intelligent Chatbots: A Systematic Literature Review
Ghazala Bilquise ,1 Samar Ibrahim ,2 and Khaled Shaalan 3
1
Computer and Information Science Department, Higher Colleges of Technology, P.O. Box 15825, Dubai, UAE
2
School of Arts and Sciences, American University in Dubai, UAE
3
Informatics Department, The British University in Dubai, UAE
Correspondence should be addressed to Ghazala Bilquise; [email protected]
Received 2 July 2022; Revised 4 September 2022; Accepted 13 September 2022; Published 26 September 2022
Academic Editor: Zheng Yan
Copyright © 2022 Ghazala Bilquise et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Conversational technologies are transforming the landscape of human-machine interaction. Chatbots are increasingly being used
in several domains to substitute human agents in performing tasks, answering questions, giving advice, and providing social and
emotional support. Therefore, improving user satisfaction with these technologies is imperative for their successful integration.
Researchers are leveraging Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques to impart emotional
intelligence capabilities in chatbots. This study provides a systematic review of research on developing emotionally intelligent
chatbots. We employ a systematic approach to gather and analyze 42 articles published in the last decade. The review is aimed
at providing a comprehensive analysis of past research to discover the problems addressed, the techniques used, and the
evaluation measures employed by studies in embedding emotion in chatbot conversations. The study’s findings reveal that
most studies are based on an open-domain generative chatbot architecture. Researchers mainly address the issue of accurately
detecting the user’s emotion and generating emotionally relevant responses. Nearly 57% of the studies use an enhanced
Seq2Seq encoding and decoding of the input of the conversational model. Almost all the studies use both the automatic and
manual evaluation measures to evaluate the chatbots, with the BLEU measure being the most popular method for objective
evaluation.
1. Introduction benefits of integrating chatbots in service and social disci-

plines lead organizations to invest highly in this technology.
The advancement of conversational technologies has led to a However, research indicates that users are still uncomfort-
massive increase in the integration of chatbots in several able with chatbot communications and prefer interacting
domains. A chatbot is a dialog system that interacts with with a human agent [2]. Moreover, a review on chatbot
humans in natural language via text and voice or as an usability and user acceptance shows that people prefer natu-
embodied agent with multimodal communication [1]. Chat- ral communication over machine-like interactions and
bots are desirable by organizations because they provide believe that a human can understand them better [5]. The
proactive service and immediate assistance to consumers study also reveals that user satisfaction is imperative to suc-
and cut operational costs [2]. They are used extensively to cessfully integrating and adopting chatbots. Therefore,
automate several tasks such as tracking deliveries, making improving user engagement and satisfaction with chatbot
reservations, requesting flight information, and placing interactions has become crucial to provide a better experi-
orders. Their 24/7 availability and quick response to general ence and encourage users to embrace the technology [6].
queries make them an appealing solution for organizations. In the last few years, Artificial Intelligence (AI) and Nat-
More recently, chatbots are also being used to provide social ural Language Processing (NLP) technologies have been
and emotional support in healthcare and personal lives [3]. driving the development of chatbots to enable advanced
Chatbots are the fastest-growing communication chan- conversational capabilities [7]. Chatbots have evolved from
nel worldwide across multiple domains [4]. The enormous utilizing pattern matching and rule-based models to using
2 Human Behavior and Emerging Technologies
AI-powered deep learning technologies that drastically excel development; however, the review is focused on service-
in natural conversation [8]. The advancement in AI and oriented chatbots.
NLP has enabled the development of chatbots that generate There are only a handful of studies that have reviewed
dynamic responses that do not exist in the database, thus empathetic chatbots. A systematic review by Rapp et al. [5]
making the conversation natural. However, despite these focuses on the human-computer interaction (HCI) perspec-
technologies, the responses generated by the chatbots are tive of chatbot usage by investigating the usability and user
often dull and repetitive, which leads to user disengagement acceptance of human-like chatbots. Our study is differenti-
and frustration [9]. ated from this study by focusing on the technical aspect of
Understanding emotion and responding accordingly is empathetic chatbot development rather than the emotional
the essence of effective communication [10]. Hence, the or psychological aspect of user interaction. The study by
emerging trend in chatbot development is to create empa- Wardhana et al. [23] provides a review of empathetic chatbot
thetic and emotionally intelligent agents capable of detecting development based on the chatbot type, model, and inference
user sentiments and generating appropriate responses [11]. techniques. Another study by Pamungkas [24] provides a
Salovey and Mayer [12] proposed the term emotional intelli- survey on the approaches to building an empathetic chatbot.
gence, which refers to identifying, incorporating, compre- Ma et al. [25] survey empathetic dialog systems based on
hending, and controlling emotions. Emotions play a three aspects which include affective dialog, personalization,
significant role in making or breaking a conversation. Users and knowledge. Notwithstanding their recognized contribu-
get frustrated when chatbot responses are irrelevant [13], tions, these studies do not perform a thorough analysis using
while a chatbot that verbalizes emotions can enhance the a systematic approach. Moreover, the past reviews have not
user’s mood [14]. Moreover, users often anthropomorphize provided insights into the challenges and techniques of emo-
chatbots, which in turn influences their interaction and tion generation addressed by empirical studies. Furthermore,
behavior [15]. Chatbots that mimic human behavior and the studies do not provide researchers with the datasets and
emotions lead to increased rapport, higher motivation, and evaluation measures that are used in the development of
better engagement [16]. Therefore, researchers are investigat- emotion-aware chatbots. Our review offers a novel contribu-
ing ways to improve a chatbot’s empathetic and emotional tion to the study of emotionally intelligent chatbot develop-
capabilities [17]. Ongoing research focuses on conversational ment by providing an in-depth analysis of the challenges in
agents capable of perceiving the user’s emotion and respond- emotion generation, techniques used, and evaluation criteria
ing appropriately with emotional cues to better engage users. of empathetic chatbots. To the best of our knowledge, there is
no systematic review that investigates the development of
1.1. Problem Statement. Investigation into the development emotionally intelligent chatbots and their challenges, tech-
of emotionally intelligent chatbots is a recent trend as niques, and evaluations.
researchers continue to find better ways to generate Considering the above factors and the research gap, the
human-like empathetic conversations. Although chatbots objective of this paper is to provide a systematic literature
have existed for decades, the use of AI-driven techniques review of the most relevant studies that investigate the devel-
in empathetic conversational systems is relatively new. This opment of chatbots enriched with emotional capabilities. We
area of research is confronted with several challenges such aim to use a methodical approach to discover, categorize, and
as the accurate recognition of emotion and emotional state present our findings on several aspects of emotionally intelli-
of the user while keeping track of the history of the conver- gent chatbots and discover the gaps relevant to computer sci-
sation and generating appropriate responses that are not dull ence researchers interested in advancing research in this field.
and repetitive. Moreover, emotionally intelligent chatbots Our study, in particular, is aimed at comparing and contrast-
that generate diverse responses require a massive dataset ing the overall characteristics among the studies, such as
[1]. Therefore, it is imperative to gain insights into the data- chatbot language, the domain of study, and trends. We exam-
sets used by empirical studies. Furthermore, the perfor- ine the main problems tackled by researchers in developing
mance of a chatbot is measured by various evaluation emotion-aware conversational agents. We also investigate
strategies, making it vital to study the evaluation measures the techniques and approaches employed by studies in devel-
suitable for emotionally intelligent chatbots. Thus, it is oping chatbots. Lastly, we study the evaluation measures
imperative to study the state-of-the-art techniques in devel- used by the studies to evaluate their solutions. To that effect,
oping emotionally intelligent chatbots and report the find- our study analyzes contributions in this field and is aimed at
ings to the research community to further the development answering the following research questions:
in this field. RQ1: what are the general characteristics of the studies?
While several systematic reviews on chatbots exist in the RQ2: what problems are addressed by the studies?
literature, these studies differ from our study in their objec- RQ3: what approaches and techniques are employed in
tives. Some reviews examine chatbot applications and usage chatbot development?
in a variety of domains, such as healthcare [3], neuropsychi- RQ4: what evaluation measures are used to evaluate
atric disorders [18], education [19], business sectors [20], chatbot performance?
and personal assistants [21], with no focus on emotional The remaining sections of this study are structured as
aspects of the conversation or technical aspects of chatbot follows. Section 2 presents the background information on
development. The study byMohamad Suhaili et al. [22] pro- chatbots with an overview of chatbots, chatbot architecture,
vides deeper insights into the technical aspects of chatbot and the role of emotional intelligence. Section 3 details the
Human Behavior and Emerging Technologies 3
methodology of the systematic review and the phases precise answers. However, it cannot detect lexical errors
involved. Section 4 presents the findings of the study. Sec- and works well when the input message is well formed.
tion 5 presents the discussion of the results. The conclusion, Moreover, a rule-based chatbot answers user queries without
limitations, and further research avenues are presented in keeping track of previous responses and is ideal for a
Section 6. question-answer system.
A retrieval-based chatbot fetches responses from a size-
2. Background Information able predefined corpus using keyword matching or machine
learning techniques to get the most appropriate response.
This section presents an overview of chatbots, describing the Personal assistants such as Alexa, Siri, and Google Assistant
various classifications used in the literature for describing are retrieval-based as they respond to user requests by
them. We discuss the significance of emotional intelligence retrieving information from a broad range of sources [26].
in chatbots, followed by a general chatbot architecture of On the other hand, a generative chatbot generates responses
the different types of chatbots and methods of integrating using machine learning techniques, thereby constructing
emotional intelligence in chatbot technology. The following diverse responses by learning from the corpus. The responses
subsections introduce the concepts of chatbot development, are generated by translating input utterances to output data
in particular incorporating emotion in a chatbot in order to using statistical machine translation and predictive analytics
better understand the terminologies and classifications used techniques, thus making the conversation natural. A limita-
in the review. tion of the generative model is that it requires massive train-
ing data. This limitation has led to the development of
2.1. Overview of Chatbots. Chatbots, also known as conversa- generative chatbots mainly for open domains since domain-
tional agents, are dialog systems that interact with humans specific conversational data is not readily available [30, 31].
in natural language via text and voice or as embodied agents A recent trend is using a hybrid approach by integrating
with multimodal communication [1]. A chatbot’s primary retrieval-based and generative models to create task-
function is to respond to user requests provided in textual- oriented chatbots that possess human-like conversational
based or voice-based input. The chatbot processes the user skills to provide a better user experience [31].
input and generates an appropriate response. The ongoing quest for developing chatbots that mimic
There has been a surge in chatbot development in the humans is evident from its inception. One of the first chat-
last few years, with bot applications manifesting their pres- bots, ELIZA and PARRY, was based on pattern matching
ence in various domains [26]. Businesses deploy chatbots technology to imitate human responses [26]. Both chatbots
to provide efficient customer services by responding to cus- used a rule-based approach for generating responses based
tomer queries and automating tasks [20]. Chatbots are used on keywords limiting the conversation to a predefined set
for teaching and learning activities, student advising, and of responses. In 1995, ALICE [32] was developed using Arti-
administrative tasks [19]. Chatbots have become pervasive ficial Intelligence Markup Language (AIML) and was more
for psychiatric care and evaluation of medical diagnoses in sophisticated in generating human-like responses. Neverthe-
the healthcare sector, raising awareness [18, 27]. Chatbots less, these primitive dialog systems could not keep up with
are also popular as social companions [11]. Social chatbots the growing expectations of users in both conversational
are not designed to accomplish a specific task but rather to style and prediction of the user’s intent. Chatbots these days
engage with humans to fulfill their need for communication are AI-driven and powered by Natural Language Processing
and social belonging [28]. Chatbots offer a cost-effective (NLP) technologies that are capable of offering sophisticated
means of delivering services to consumers by eliminating solutions to meet the language and content expectations of
repetitive and time-consuming human-agent communica- end-users [26].
tion while enabling the agents to focus on high-end complex
tasks [2]. 2.2. Emotionally Intelligent Chatbots. Despite the prolifera-
Several taxonomies are used in the literature to classify tion of chatbots in our daily lives, recent studies have shown
chatbots. Hussain et al. [29] categorized chatbots based on that customers still prefer interacting with humans rather
their purpose as task-oriented and non-task-oriented. The than bots [2]. This resistance is attributed to the poor con-
primary function of a task-oriented chatbot is to respond versational skills of chatbots which make the interaction
to domain-specific user queries and often perform tasks such unnatural and machine-like leading to frustration and com-
as reserving a ticket. A non-task-oriented chatbot interacts munication breakdown [5]. Furthermore, end-users might
with humans in open-ended, domain-specific conversations, be more willing to interact with chatbots if they are enriched
also called open-domain chatbots. The primary function of with human-like interpersonal qualities [2]. Notwithstand-
these chatbots is to act as virtual companions where the dia- ing the limitation of chatbot conversational skills and high
log is open-ended. end-user expectations, conversational agents are still a desir-
Adamopoulou and Moussiades [9, 11] classified chatbots able solution for reducing operational costs. Therefore, it has
based on their response generation method as rule-based, become critical for businesses to bridge the gap between cus-
retrieval-based, and generative chatbots. A rule-based chat- tomer expectations and chatbot technology.
bot selects a response based on a predefined set of rules. Emotions play an integral part in an effective conversa-
The responses are not dynamic and often repetitive. The tion. A study by Xu et al. [33] reveals that nearly 40% of cus-
strength of a rule-based chatbot lies in its ability to provide tomers’ interaction with agents on social media is emotional
rather than informational. Several studies have shown that nal sources of data. Internal data sources might be embedded
emotionally intelligent conversations lead to a good user as a template or rules in Artificial Intelligence Markup
experience resulting in fewer communication breakdowns Language (AIML) to decipher user requests and retrieve
[5]. A qualitative study by Svikhnushina and Pu [34] responses. Additionally, the chatbot may construct its data-
revealed that users are more likely to engage with emotion- base internally from scratch or utilize existing databases out-
aware chatbots and are eager to have a natural conversa- fitted with their domains and functions. Alternatively,
tional experience with a virtual counterpart. Another study chatbots may use third-party APIs to obtain external data
by Ghandeharioun et al. [14] disclosed that emotionally sources [22].
enriched responses by a chatbot could lift a user’s mood,
thus enhancing customer experience and improving cus-
2.3.4. Natural Language Generator (NLG). Finally, the
tomer relationships. Xiao et al. [31] supported these findings
response generation component, NLG, is based on how the
by showing that users are more engaged with chatbots capa-
chatbot generates responses. It may use a retrieval-based,
ble of sensing and verbalizing emotions in the conversation.
rule-based, or generative model. Retrieval- and rule-based
It is evident from these studies that perceiving emotions and
models are simple in design and need essential intelligence
responding with an appropriate empathetic reply is crucial
to select the best response match. However, they have lim-
to enhancing user satisfaction with chatbot conversations.
ited usability and flexibility [22]. In comparison, the genera-
A vast amount of ongoing research is dedicated to inte-
tive model has incredible flexibility and can handle a variety
grating emotional capabilities in chatbots to enhance their
of domains. However, they can be highly complex and
conversational skills. AI-driven chatbots can detect user sen-
expensive, and they need an extra degree of intelligence.
timents in a conversation, thus triggering the chatbot to
Researchers who study emotionally intelligent chatbots
comprehend the user’s emotional state and generate an
have adopted the general chatbot architecture. They imple-
appropriate response. The following subsection presents an
mented the neural-based approach, and they use models that
overview of an AI-driven chatbot architecture.
enforce emotion-aware characteristics, such as emotion
2.3. AI-Driven Chatbot Architecture. Chatbots are composed embedding and reinforcement learning models, in addition
of several essential components, each playing an indispens- to encoder-decoder architectures that use Sequence-to-
able role and working together in a robust system that effec- Sequence learning [24].
tively serves its purpose. These components may be
incorporated into text-based or voice-based agents [1]. In 2.4. Deep Learning in Chatbot Conversations. Artificial neu-
most cases, these components are organized in a pipeline ral networks are machine learning algorithms that may be
based on their order of usage. Figure 1 presents the architec- supervised or unsupervised. Deep learning, being an unsu-
ture showing the main components of a chatbot architecture. pervised machine learning algorithm, can mimic how the
human brain develops patterns and employs them for mak-
2.3.1. Natural Language Processing (NLP). The first compo-
ing decisions [29]. There has been an increase in the use of
nent is the Natural Language Processing (NLP) unit that
deep learning neural networks in conversational modeling,
processes the structured input using tokenization, lemmati-
particularly Recurrent Neural Networks (RNNs), Sequence-
zation, and stemming techniques. Some chatbots apply these
to-Sequence (Seq2Seq) networks, and Long Short-Term
techniques to incoming user requests as a preprocessing
Memory (LSTM) networks [22].
strategy [22]. An additional Automatic Speech Recognition
RNN is an artificial neural network class and a type of
(ASR) component may exist in voice-based agents that
recursive artificial neural network. This method saves the
extract text from the audio stream. In addition, the architec-
output from a layer and feeds that saved output to the
ture may contain a nonverbal information extraction com-
new input to forecast the following output. In the context
ponent, which can detect nonverbal information, like the
of natural language, RNN captures the inherent sequential
user’s emotions [1].
nature of words, where the meaning of words is under-
2.3.2. Natural Language Understanding (NLU). The structured stood through their relationship to the previous words in
data collected by the NLP unit is passed on to the Natural Lan- the sentence. Due to this approach, RNNs are well suited
guage Understanding (NLU) component, which processes the for chatbots since understanding the user input and pro-
data using various strategies. Usually, in this component, data ducing contextually relevant responses is essential [29].
structures are parsed to understand the user’s intent and all Research in emotionally intelligent chatbots employs
particulars associated with that intent [35]. encoder-decoder architecture with Seq2Seq learning. The
Seq2Seq model utilizes RNN as its architecture, with an
2.3.3. Dialog Manager. The dialog manager component encoder processing the input and a decoder producing the
examines the understandable structured data, maintains output. This model was initially introduced in 2014 as a var-
the dialog framework such as the semantic frame, and iation of Ritter’s generative model incorporating advance-
encodes the data to determine what action should be taken ments in deep learning to enhance accuracy [29]. The
next. The dialog managers may request clarification from Seq2Seq model is applied to chatbots to transform input sta-
users if the semantic structure is incomplete to ensure that tus into output response. It is currently regarded as the
the dialog context is relevant and that all ambiguities are industry’s best practice for generating responses because
resolved [11]. The dialog manager relies on external or inter- Seq2Seq maximizes the likelihood of the response and is
NLP NLU
Text input
ASR Dialog manager

voice input
Non-verbal
information
Text output
Voice output NLG
Text-to-speech
Figure 1: Chatbot architecture.
capable of processing a large amount of data to generate the transforms the input text data into a numerical form that
optimal response [24]. is easily understood by the machine [24]. Various embed-
Despite its approximation of a good response, the Seq2- ding methods exist, such as character embedding, word
Seq function fails to meet the chatbot’s true purpose of sim- embedding, and sentence embedding. Word embedding is
ulating human-to-human communication [35]. Therefore, a compact vector representation of words in the lower-
LSTM, a type of RNN, is designed to overcome the long- dimensional space. It is possible to represent words and
term dependency problem of RNNs. LSTMs contain mem- phrases with matrices that produce massive sets of data as
ory cells and gates that can retain previous information for the size of the input increases, such as a bag of words and
long periods where input gates control the data stream, for- Term Frequency-Inverse Document Frequency (TF-IDF)
get gates, and output gates. The LSTM or Gated Recurrent [22]. Word2Vec and BERT models are the two popular word
Unit (GRU) is the dominant variant of RNNs used to learn embedding models used in neural networks, which can also
the conversational dataset in these models. An LSTM net- be used for emotion and semantic embedding. These models
work outperforms the traditional RNN and other sequence strive to maximize conditional probabilities for better word
learning networks and replaces these models in learning matching [37].
from experience. In terms of semantic relations among linguistic concepts,
Some studies have implemented LSTM with reinforce- the Valence, Arousal, and Dominance (VAD) [38] space is
ment learning tasks to get more generic responses and widely used as the primary source of structure since it
enable the chatbot to attain long-term conversation effec- accounts for about 70% of the variance in meaning. VAD
tiveness [35]. In addition to this model, research has shown ratings have also been used in empathetic tutoring, senti-
that the Conditional Variational Autoencoder (CVAE) ment analysis, and other affective computing applications.
model can also improve the diversity of responses. In CVAE, The three standard dimensions of emotion are Valence
a latent variable is used to learn a distribution over possible (the pleasantness of a stimulus), Arousal (the intensity of
conversational intents, and greedy decoders are used to gen- emotion produced by the stimulus), and Dominance (the
erate responses [36]. degree of power produced by the stimulus). There are three
levels of emotion intensity in these words: very low (e.g.,
2.5. Emotionally Intelligent Chatbot Technology. It is crucial dull), moderate (e.g., watchdog), and very high (e.g., insan-
to select preprocessing steps carefully when building an ity) [39].
emotionally intelligent chatbot, as different preprocessing The artificial neural-based approach is extensively used to
techniques suit different contexts. For example, the NLP develop emotionally intelligent chatbots. Artificial neural
process is primarily used to collect, tokenize, and parse network-based chatbots apply both the retrieval-based and gen-
information. Parsing is a technique that implements algo- erative approaches for producing responses. However, the
rithms where the input is deconstructed according to a pre- research trend is heading towards generative approaches [24]
defined rule, such as left-right or bottom-up [37]. as it offers diverse responses. This paper explores the research
Embedding techniques are commonly used in emotion- studies investigating AI technologies to generate emotionally
ally intelligent chatbot technologies. The embedding model intelligent responses to report state-of-the-art techniques.
3. Research Methodology comparison step of PICOC, we consider all possible

approaches, models, development, algorithms, and evalua-
This study explores existing literature on the development of tion metrics in developing emotionally intelligent chatbots.
emotionally intelligent chatbots by adopting the systematic The outcome determines our data coding requirements
review framework of Kitchenham and Charters [40]. This and results, including the knowledge of techniques used in
framework was chosen because it defines the guidelines for developing emotionally intelligent chatbot solutions and
conducting reviews in the technical field instead of other the problems addressed, the datasets, and the evaluation
frameworks like Tranfield et al. [41] that are more oriented metrics used. Finally, we define the context as only empiri-
towards qualitative studies in the medical field. A rigorous cal studies related to emotionally intelligent chatbot
theoretical framework is essential to guiding the comprehen- development.
sive data collection and inquiry methods required for our
investigation. Moreover, the methodical process ensures 3.1.2. Inclusion/Exclusion Criteria. Selecting articles for the
the reliability of our findings. The systematic literature review led us to outline essential criteria that define the char-
review guidelines by Kitchenham and Charters [40] outline acteristics of the studies included in the study. Table 1 sum-
a thorough method for collecting, analyzing, and document- marizes the inclusion/exclusion criteria applied for selecting
ing findings from secondary data sources. We aim to answer the articles. First, empirical studies related to the develop-
our research questions following this methodology to ment of chatbots with emotion-embedded responses were
uncover the latest trends and technologies to develop emo- included. Second, only peer-reviewed journal and confer-
tionally intelligent chatbots. ence papers were included in the study, thereby excluding
The review process is divided into three phases: planning books, book chapters, and reviews. Third, only articles pub-
the review, conducting the review, and reporting the results. lished in the English language were included to eliminate the
Each phase is further subdivided into several steps, each of bias that may result from poor translation. Finally, the study
which is described in the sections below. period was determined to be between 2011 and 2022, as
chatbot development with the integration of AI techniques
3.1. Planning the Review. In recent years, a vast amount of has emerged in recent years. A ten-year period is sufficient
research has been conducted to improve user satisfaction to view the emotionally intelligent chatbot research trend.
with chatbot conversations by detecting user sentiments
and generating appropriate emotional responses. Therefore, 3.1.3. Data Sources. Various data sources were considered to
it is crucial to provide researchers with the current state of retrieve relevant publications for this study, ranging from
the art regarding emotionally intelligent chatbots, including general to computer science topics. Accordingly, the search
the techniques used to embed emotions in computer- utilized the following six digital databases: Scopus, IEEE
generated responses, the datasets used, and the evaluation Xplore, ProQuest, ScienceDirect, ACM Digital Library, and
processes adopted to measure the performance of the EBSCO. Furthermore, we also used a manual snowballing
chatbots. method to identify additional relevant studies by exploring
To begin our systematic review, we start with the plan- references of all selected primary studies.
ning phase that defines the search strategy and the inclu-
3.1.4. Quality Assessment Checklist. Quality assessment is
sion/exclusion criteria and identify the data sources used
crucial in systematic reviews to ensure the validity of the
for selecting the articles of the study. Finally, we describe
results and reduce bias that may be caused due to the inclu-
the quality assessment checklist for assessing the quality of
sion of less robust studies [43]. Furthermore, the quality
the articles and set a threshold for their inclusion.
assessment also provides more detailed inclusion/exclusion
criteria [40].
3.1.1. Search Strategy. The primary aim of the search criteria To ensure a rigorous assessment of the included articles
is to investigate the latest advances in the development of in our review, we developed a quality assessment checklist
emotionally intelligent chatbots. To that effect, we con- consisting of eleven questions presented in Table 2. We con-
ducted a preliminary search of existing literature and sys- sidered the elements essential to our data extraction and
tematic reviews to understand our study’s context, coding phases, such as relevance to our study, clear identifi-
keywords, and scope. We used the Population, Intervention, cation of the problem statement, and validity of the results.
Comparison, Outcome, and Context (PICOC) method out- Furthermore, we also considered the source’s credibility,
lined by Petticrew and Roberts [42] as a guideline to define which we evaluated using the ranking of the journal/confer-
our research directions. In this regard, our study’s popula- ence and the number of citations of the study.
tion relates to the main keywords and their derivatives with
similar connotations for emotionally driven chatbots, such 3.2. Conducting the Review. We implement the plan by
as conversation agents, virtual or digital assistants for chat- searching and retrieving the articles in this phase. The arti-
bots, and empathy or feelings for emotion. We used these cles were retrieved in Jan 2022. The articles were further
keywords to define the search string for the search process screened using the inclusion/exclusion criteria and quality
presented in Section 3.2.1. The intervention in our study assessment checklist described in Section 2.
refers to the search context [42]. We used the identified key-
words to filter studies that meet our objectives: Emotional, 3.2.1. Search Process. An extensive range of search strategies
Chatbot, Conversational agent, and virtual assistant. In the was used to retrieve the studies from the identified databases
Table 1: Inclusion/exclusion criteria.
Inclusion criteria Exclusion criteria

Studies that are qualitative or not related to chatbot
Must be an empirical study on the development of chatbots
development
Must involve emotion detection in input and generation of appropriate Studies that do not consider emotion in chatbot
emotional response conversation
Must be a peer-reviewed journal or conference paper Book, book chapters, reviews, or articles in the press
Must be written in English Papers written in a language other than English
Must be published between 2011 and 2022 Papers published prior to 2011
Table 2: Quality assessment checklist.
# Question
Q1 Is the study relevant to our research?
Q2 Are the research aims and contributions clearly identified?
Q3 Is the problem statement clear?
Q4 Is the experimental setup described adequately?
Q5 Are the techniques/methods clearly explained and analyzed?
Q6 Are the results compared to previous studies/baseline?
Q7 Is sufficient data used for the evaluation of the model?
Q8 Is the proposed technique evaluated using established metrics?
Q9 Is the conclusion explained clearly and linked to the purpose of the study?
Q10 Is the source of the article credible (published in a ranked venue)?
Q11 Has the study been cited in other publications?
to raise the probability of identifying highly relevant studies. The visualization of the terms in the extracted studies
We used logical operators AND and OR by combining the reveals several clusters. This shows that there are various
keywords identified in the planning process. Furthermore, dimensions of studies on empathetic chatbots from the per-
the search was performed on the title, abstract, and key- spective of usage, applications, usability and user experience,
words to ensure that relevant studies were not left out. The and chatbot development. The clusters are tightly over-
following is the search query syntax used in all the identified lapped, indicating that several aspects of the studies are
databases: (“chat bot” OR “chatbot” OR “talkbot” OR “talk interrelated. The clusters show that the current research
bot” OR “personal assistant” OR “virtual assistant” OR “dig- trends on emotionally intelligent chatbots are on chatbot
ital assistant” OR “conversational agent”) AND (“emotional” response generation, chatbot effectiveness, chat evaluation,
OR “emotion” OR “emotions” OR “empathy” OR “sentiment” and usability. Considering only the clusters with highly
OR “feeling”). weighted terms, four main clusters can be seen in the visual-
In addition to the automated search, we also performed ization. The first and central cluster (red) includes the
the manual snowballing search as detailed in the planning following keywords: emotional intelligence, emotional, con-
process. The results of the search are presented in Table 3. versational agent, research, and emotional response. This
A total of 2219 results were retrieved with the highest num- cluster implies that research is active in this area and related
ber of studies from Scopus because it is generic and sources to chatbot empathetic response generation. The second clus-
publications from all domains. ter (purple) contains keywords such as effectiveness, conver-
After retrieving the search results, we performed a bib- sational agent, framework, patient, problem, and technique,
liometric analysis of the results to analyze the research areas. which entails that the area of research in this cluster is about
Figure 2 shows the visualization of the terms in the results, chatbot usage and effectiveness. In the third cluster (green),
constructed using VOSviewer [44]. The diagram presents the significant keywords are input utterance, factor, genera-
the significance and interconnections between the frequently tion model, and human evaluation, which implies that
occurring terms extracted from the abstract, title, and key- research is related more to evaluating chatbot technology.
word search results. The size of the shape and the label asso- Finally, in the fourth cluster, the main keywords are consis-
ciated with the term determines its importance. The color of tency, performance, human, affect, and technique, indicating
the terms determines the clusters in the visualization. Each that research in this cluster is about chatbot performance
cluster represents terms related to each other in that group. (light blue). Examining these clusters provides an idea of
Moreover, the distance between the clusters represents the where research findings are located for better analysis and
relatedness of the clusters. discussion of the studies.
Table 3: Search results. We conducted a thorough data analysis of all the relevant
features identified in the planning phase to accomplish this
Database Search results task. The metadata analysis includes various characteristics
Scopus 1003 that are essential to answering our research questions, such
IEEE Xplore 115 as the characteristics of the study in terms of publication
ProQuest 191 type and year. We also examine the technical aspects of the
ScienceDirect 487 study, such as the chatbot’s language of development, the
emotions detected and used, the problems addressed, the
ACM Digital Library 272
technique used for the development and evaluation mea-
EBSCO 121 sures of the chatbot, and the dataset used for evaluation.
Snowballing 30
Total 2219 3.3. Reporting the Review. The final phase of the systematic
review presents the study results. After a detailed and in-
depth analysis of the metadata extracted from the full-text
3.2.2. Article Selection. In this phase, we applied the inclu- review, we present our findings in Section 4 to answer each
sion/exclusion criteria to screen the retrieved articles for eli- research question.
gibility following the PRISMA [45] framework. This
framework provides a detailed guideline and structured 4. Results
approach to screen the documents. The steps of the screen-
ing process are outlined in Figure 3. This section presents the results obtained from the meta-
First, we removed the duplicate records. Then, we analysis and in-depth review of the included articles with
applied the inclusion/exclusion criteria to ensure that only reference to our research questions. We analyzed 42 journal
relevant articles were included. Each author independently and conference papers published in the span of 10 years to
performed a title and abstract screening of the studies to determine the state-of-the-art technologies used to develop
remove irrelevant articles. At this stage, most of the articles emotionally intelligent chatbots. The following subsections
(n = 1671) were excluded as they did not match the inclusion present the results of each research question.
criteria. As discovered in the network analysis of the search
4.1. RQ1: What Are the General Characteristics of the
terms, most of the articles were related to usability and user
Studies? This subsection presents the general characteristics
acceptance of chatbots. These articles were excluded as they
of the reviewed articles. We analyzed the distribution of
did not contribute to the context of our study. Next, we per-
the studies by the year of publication, the region of study,
formed a full-text screening of the remaining articles
the source type (journals vs. conference papers), the inter-
(n = 325) to assess relevance and eligibility. Each author per-
face language, the chatbot type, and the domain of study.
formed this step independently by equally dividing the stud-
These characteristics provide an overview of the develop-
ies to be reviewed. In cases where the eligibility was unclear,
ment trend of emotionally intelligent chatbots.
the authors discussed resolving the discrepancy. Finally, a
quality assessment was performed of the remaining articles 4.1.1. Source of the Articles. Figure 4 shows the distribution
(n = 57) after the full-text screening. In this step, the authors of studies by source type. Most of the studies included in
screened the articles initially screened by the other to reduce the review are peer-reviewed journals (n = 25), while confer-
bias and ensure that each article was reviewed twice. Finally, ence papers (n = 17) constitute 40% of the studies. The over-
42 studies were included in the systematic literature review. all distribution of the papers is balanced. Moreover, all the
sources of the studies were verified in the quality assessment
3.2.3. Quality Assessment. Each author performed the quality phase to ensure content validity and reduce bias resulting
assessment independently using the assessment checklist from inaccurate or poorly reported results.
presented in Table 2 using a scale from 0 to 1, where 1 rep-
resents that the criteria are wholly met, 0.5 represents par- 4.1.2. Publication Year. Figure 5 presents the distribution of
tially met, and 0 represents not met. We assigned one the reviewed articles by year of publication. It is evident
point to the article having at least two citations per year from the graph that there is significant interest in emotion-
regarding the number of citations. Table 4 presents the aware chatbots over time. The graph also reveals a sharp
detailed quality assessment of the included articles, showing increase in the investigation of emotionally intelligent chat-
that all included articles are of good quality. It must be noted bots in 2018. This may be attributed to the technological
that the quality assessment is a means to determine whether advancement of conversational technologies and a sudden
the selected article is relevant to the contribution of this surge of chatbot usage in 2016, referred to as the chatbot
study, with no attempt to criticize any of the studies and “tsunami” by Grudin and Jacques [26]. The Sequence-to-
their findings. Sequence model [46] published by Google became a basis
for most neural conversational agents, leading to the prolif-
3.2.4. Data Analysis and Coding. The objective of this phase eration of generative chatbot studies.
is to accurately record all findings of the study by collecting
metadata from the primary studies included in the review. 4.1.3. Chatbot Type. Figure 6 presents the distribution of
The metadata relates to the research questions of our study. studies by chatbot type. A chatbot may be text-based,
Emotion
nal feature
nal Affeec
e c t lm Art meth
met od
Emotiona
Emo ona
n l levell
Sentencee
Sen
Emotio
onal expressi
expressi
pression
on Con
Cons
nsis
n ist
iste
stteency
Post
Perfo
Perfor
rf maance
ance
ncc Tech
hniqu
iqu
qu
ue
Huee Text
Text
Input
Inpu
nput
np
pu
pu ssequen
sequ
seque
eq ccee
equ
Emotiona
naal intell
intelligen
tell
lligen
gence
ce Agentt
Agent
Naturall lang
Natu
Na nguage
ng Pa eentt
Pati
Output
O utpu
utput
t
Word
Wor Conv
onv
nvers
errsa
er
ers
rsation
nal
al agent
agent
gent Effecctive
iv ness
Human
Hu
Hum
Huma World
or
Fusion
Fusion
In
npu
nput Decode
der
Expe
E
Exp
xp
xpe
perime
p riment
ntt
Vector
ector
ctorr Fram
mewor
orrk
Q ityy
Qual
Qu Paper
aper
IIntere
Int
t stt Cont
onten
ent
nt
Posi
Pos
Positi
o ve emo
mo
otion
ion
on
Reewa
R eward
rd Respon
onsse
on se Convverrsation
C rsatio
tiiion
tionon Formu
mulaation
mu
Lft Emotio
Emotion expression
expressio
esssion
essi
siion
Goal
G oall
Res
Rese
R es arch
ch Capabi
ap
apabi
paability
p liityy
llit Model
Mod e Pr lem
Prob
Pr m
Xiaoice
Xi ccee
Cvaee Con
Conver
C on sation
s syste
system
ys m
Researcher
Researcher
ch
Daataset
D
Data ett
Ba line
Ba
Base ne
Seq ence
Sequ
Se
Seq ncce
nce
Majori
M o ty F or
Fact
Fa
Emotion
Emo ona
n l respon
na spon
ponse
ponse
Gen rat
Generat on model
rati
ra T
Type
Dialogue agent
Dialogue
Dialog age
a t Seq2seq
Seq2se q
Huma
Human
m evalua
man lu tio
tion
tion
o
Empdg
Empd
mpdg
dg Limi
imi
mi tiion
mitati
Certain
n emot
o ion
o Input uttera
utterance
Effe
Effect
ffect Accou
Accoun
ountt
Tone
Tone
on
Generation
ion process
Social
ciall med
me m
mediu
VOSviewer Terg
erg
Figure 2: Bibliometric analysis of search results.
voice-based, or multimodal. The majority of the work in lems and how studies have addressed the problems. This sec-
developing emotionally intelligent dialog systems is on tion describes each problem and the approaches employed
text-based chatbots. Text-based chatbots are more favored to resolve the problem.
than other forms due to the increased use of messaging tech-
nologies. Moreover, the chatbots need to be trained to pro- 4.2.1. Response Diversity. The studies highlight the limitation
duce an appropriate response. It is easier to train chatbots of the Seq2Seq model [46] as it produces dull and meaning-
to generate text rather than speech, as the training data for less responses. Therefore, several studies (n = 8) tackle the
text is more readily available than speech data. challenge of generating diverse responses that are emotion-
ally relevant. Asghar et al. [39] argued that neural conversa-
4.1.4. Domain of Study. Figure 7 shows that most emotion- tional models do not capture the complexity of emotions
ally intelligent chatbots have been developed for the open and often result in short and ambiguous responses. They
domain. These chatbots specialize in natural and emotion- used a heuristic search algorithm to ensure diversity in gen-
ally rich conversations without focusing on specific topics. erated responses. Multiple studies employed a CVAE-based
High development in this area is due to the lack of conversa- model to generate varied emotional responses [36, 49–52].
tional dataset availability. Moreover, the conversational Yao et al. [52] argue that a chatbot must generate diverse
dataset must be labeled with emotion as a preprocessing step responses for the same input to simulate a human-like con-
before using it in the conversational dataset. versation. Their model uses a latent space variable and six
emotion categories to generate multiple responses that gen-
4.1.5. Chatbot Language and Region of Study. Figure 8 erate multiple emotionally consistent responses.
reveals that English and Chinese are the two most predom- Similarly, Liu et al. [36] also generate several responses
inant interface languages to develop emotionally intelligent and select the most appropriate one based on grammar,
chatbots. Chang and Hsing [47] support our finding and meaning, and emotional score. Zhang et al. [53] argued that
claim that due to the popularity of social media in China, an intervention mechanism is needed to improve response
the Chinese language will soon be one of the most prevailing diversity. They consider the input emotion and model a
languages online. Figure 9 further shows that most of the responder state and topic preference to generate diverse
studies originated from China. This finding shows that responses.
China has played a leading role in developing empathetic
and emotion-aware chatbots since 2018. Moreover, one of 4.2.2. Content Relevance. Several studies (n = 9) focus on
the first emotionally intelligent chatbots, XiaoIce, developed content relevance to achieve a natural conversation in
by Microsoft, is vastly used in China to provide emotional human-computer dialog systems. Both Srinivasan et al.
support to users [48]. These findings reveal the popularity [54] and Sun et al. [55] used reinforcement learning with a
of chatbots in Chinese culture and that China is taking a lead reward function to ensure that the responses were content-
role in developing AI technologies. specific and emotionally relevant. Several studies embedded
topics and emotions in the decoder to generate appropriate
4.2. RQ2: What Problems Are Addressed in the Chatbot responses that are emotionally appropriate [53, 55–57].
Development? After conducting an in-depth analysis of the Huo et al. [49] augmented the encoder-decoder with a
studies, we identified seven main problems addressed by all topic-aware decoder to enhance the content relevance of
the studies. Figure 10 shows an overview of the main prob- the response. They differentiated words in output as
Scopus (n = 1003)
IEEE Xplore (n = 115) Additional records identified
ProQuest (n = 191) through snowballing
ScienceDirect (n = 487) (n = 30)
Identification
ACM digital library (n = 272)

EBSCO (n = 121)
Records identified after initial search

(n = 2219)
Records after duplicates removal

Excluded records (n = 223)
(n = 1996)
Screening
Records after applying inclusion/exclusion

criteria (title and abstract screening) Excluded records (n = 1671)
(n = 325)

Exclusion reasons:
(i) Studies related to usability, user
experience, and user acceptance of
Eligibility
Records after full-text screening chatbots (n = 142)

(n = 72) (ii) Studies proposing a chatbot solution
without implementation (n = 96)
(iii) Chatbot development does not
include emotional aspect (n = 13)
(iv) Full-text article not available (n = 2)
Records after quality assessment

(n = 57)
Included
Studies included in meta-analysis

(n = 42)
Figure 3: PRISMA flowchart.
emotion-related, keywords, and familiar words. In two sepa- tion using a classifier [60, 61], several studies argue that
rate studies, Wei et al. [58] and Wei et al. [59] focused on emotions are complex and cannot be captured by a coarse-
generating emotionally intelligent and content-relevant grained emotion label. To that effect, some studies predict
responses by embedding semantics and emotions in the the emotion by applying the principle of Valence and
input. Arousal (VA) to embed affective meaning for each word in
the input message [47, 62, 63]. Other studies built on the
4.2.3. Poor Emotion Capture. A large number of studies previous work and embedded each input word with a
(n = 14) focus on accurately detecting the emotion of the three-dimensional emotion embedding based on Valence,
input message. While some studies predicted the input emo- Arousal, and Dominance (VAD) [38] to achieve a more
Table 4: Quality assessment.
SNo Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Total %

1 1 1 1 1 1 1 1 1 1 1 0 10 91%
2 1 1 1 1 1 0.5 1 0.5 1 1 1 10 91%
3 1 1 1 1 1 1 1 1 1 1 0.5 10.5 95%
4 1 1 1 1 1 1 1 1 1 1 0.5 10.5 95%
5 1 1 1 1 1 1 1 1 1 0 1 10 91%
6 1 1 1 1 1 1 1 1 1 1 1 11 100%
7 1 1 1 1 1 0.5 1 1 1 1 1 10.5 95%
8 1 1 1 0.5 0 1 1 0.5 1 0.5 1 8.5 77%
9 1 1 1 1 1 1 1 1 1 1 1 11 100%
10 1 1 1 1 1 0 1 0.5 1 1 1 9.5 86%
11 1 1 1 1 0.5 0.5 1 1 1 0.5 1 9.5 86%
12 1 1 1 1 1 1 1 0 1 1 1 10 91%
13 1 1 1 1 0.5 1 1 1 1 1 0.5 10 91%
14 1 1 1 1 1 1 1 1 1 1 1 11 100%
15 1 1 1 1 1 1 1 1 0.5 1 1 10.5 95%
16 1 1 1 1 1 1 1 1 1 1 0 10 91%
17 1 1 1 1 1 1 1 1 1 1 0.5 10.5 95%
18 1 1 1 1 1 0 1 1 1 1 1 10 91%
19 1 1 1 1 1 1 1 1 1 1 1 11 100%
20 1 1 1 1 1 1 1 0.5 1 1 1 10.5 95%
21 1 1 1 1 1 1 1 1 1 1 1 11 100%
22 1 1 0.5 1 1 1 1 1 1 1 1 10.5 95%
23 1 1 1 1 1 1 1 1 1 1 0.5 10.5 95%
24 1 1 1 0.5 0.5 0.5 1 1 1 1 1 9.5 86%
25 1 1 0 1 1 1 1 1 1 1 1 10 91%
26 1 1 1 1 0.5 0.5 1 1 1 1 1 10 91%
27 1 1 0.5 1 1 0.5 1 0 1 1 1 9 82%
28 1 1 1 1 1 1 1 1 1 1 1 11 100%
29 1 1 1 1 1 1 1 1 1 1 0 10 91%
30 1 1 1 1 1 0 0.5 0.5 1 1 1 9 82%
31 1 1 1 1 1 1 1 1 0.5 1 1 10.5 95%
32 1 1 1 1 1 1 1 1 1 1 1 11 100%
33 1 1 1 1 0.5 0.5 1 1 0.5 1 0 8.5 77%
34 1 1 1 0.5 0.5 1 1 1 1 1 1 10 91%
35 1 1 1 1 1 1 1 1 1 1 1 11 100%
36 1 1 0.5 1 1 1 1 1 1 0 1 9.5 86%
37 1 1 1 1 1 1 1 1 1 1 0.5 10.5 95%
38 1 1 1 1 1 1 1 1 1 1 1 11 100%
39 1 1 1 1 1 1 1 1 1 0 1 10 91%
40 1 1 1 1 1 1 1 1 1 1 1 11 100%
41 1 1 0 1 1 1 1 1 1 1 1 10 91%
42 1 1 1 1 1 1 1 1 1 1 1 11 100%
fine-grained emotion detection [36, 39, 51, 64, 65]. Li et al. utterance. Lin et al. [68] identify the user’s emotional state
[66, 67] argue that words in messages are usually connected by employing a tracker that determines the various emo-
and show that capturing the connections of words enables a tional aspects of the input. They use multiple decoders to
deeper understanding of the user’s emotion. respond to each emotional category and generate an appro-
Some studies focus on detecting the user’s emotional priate response. Hasegawa et al. [69] argue that natural con-
state rather than just predicting the sentiment from a single versation is achieved only when the user’s emotional state is
25
20
15
40%
10
60%
5
0
Chinese
English/Chinese
English
Spanish
Japanese
Conference
Journal
Figure 4: Distribution of studies by article source.

Figure 8: Distribution by chatbot interface language.
12
10
On the other hand, Li et al. [66, 67] argue that it is cru-
cial to understand the reason behind the user’s emotion and
8 develop a chatbot that elicits the emotional cause by asking
6 appropriate questions. They generate the response based
4 on the chat history and the identified cause. Qiu et al. [70]
track the user’s emotional state using a transition network.
2 They model the dynamic emotion flow to predict emotions
0 based on past utterances and generate the most appropriate
2013 2017 2018 2019 2020 2021 2022
response.
Figure 5: Distribution of studies by publication year.
4.2.4. Irrelevant Emotional Responses. Several studies argue
that emotions generated using an NLP chatbot are often
40 not emotionally relevant and attempt to alleviate the prob-
35 lem by controlling the emotion exhibited in the response.
30 Several studies control the generated response by embedding
25 a target emotion in the response generator module [8, 61,
20 67–69, 71, 72]. Zhou et al. [61] use internal and external
15 memory to generate explicit emotional words in the
response. Niu and Bansal [72] conditioned the response gen-
10
erator to generate polite, rude, or neutral responses.
5
Several other studies argued that a predefined label to
0 condition the response generator suffers from poor quality
Text Multimodal Voice
of response [59], and furthermore, it cannot be assumed that
Figure 6: Distribution by chatbot type. the output emotion must be the same as the input emotion.
To this effect, some studies attempted to generate more
dynamic responses. Zhang et al. [73] generate multiple
2% 2% 10% responses for six emotional categories and select the most
appropriate response based on rankings. Similarly, Colombo
et al. [64] use two Seq2Seq models to generate several
responses and rank them based on emotion to get the most
86% appropriate response. Zhou et al. [74] add an additional
emotion classifier model for the responses over multiple
emotional distributions, generating two types of responses,
one for the specified emotion and one unspecified.
Business Healthcare
Education Open-domain
4.2.5. Lack of Emotionally Labeled Conversational Datasets.
Figure 7: Distribution by domain of study. One of the challenges of developing a chatbot using machine
learning is that it requires a massive dataset for training.
While several conversational datasets are available for the
predicted from historical conversational utterances rather open domain, datasets labeled with emotions are not readily
than a single utterance. They generate the response based available. Therefore, several studies resorted to classifying
on a predicted target emotion using past utterances. Simi- conversational data using a dynamic classifier as a prepro-
larly, Li et al. [50] also utilize conversational data to generate cessing technique. Few studies tackled the challenge of the
more relevant responses. lack of a publicly available labeled corpus of conversational
25
20
15
10
5
0
China
USA
Canada
Spain
Switzerland
Singapore
Japan
Hong Kong
Taiwan
USA
USA
Canada
Spain
USA
USA
USA
Taiwan
Canada
Switzerland
Canada
China
China
Singapore
China
China
Japan
Japan
Japan
Hong Kong
2013 2017 2018 2019 2020 2021 2022
(a) (b)
Figure 9: Distribution by region of study.
Reinforcement learning (n = 2)
Embed topic in decoder (n = 5)
Embed semantics in input (n = 2)
Assign responder state (n = 1)

Enhance language model (n = 2) Content relevance (n = 9)
Use CVAE model ( n = 5)

Poor language model (n = 2)
Heuristic algorithm (n = 1)
Control emotion in response (n = 8) Response diversity (n = 8)
Multiple responses with ranking (n = 1)
Identified problems
Irrelevant emotional responses (n = 12) and solutions Fine grained emotion capture (n = 10)
Dynamic emotion generation (n = 4) Poor emotion capture ( n = 14)
Detect emotion state (n = 4)
Develop emotionally labelled dataset (n = 2)
Strengthen input emotion (n = 1) Lack/imbalanced dataset (n = 4) Other (n = 6) Develop domain specific chatbot (n = 3)
Use reinforcement learning ( n = 1)

Voice/multimodal chatbot (n = 3)
Bilingual chat interface (n = 1)

n = number
Problem
of studies Solution
Figure 10: Mind map of problems and solutions.
data. Rashkin et al. [17] developed an empathetic dataset of To that effect, they developed an enhanced language model
25 thousand labeled conversations and tested it against well- for empathetic responses.
known neural models. Zhou and Wang [75] generated a
labeled dataset from Twitter using emojis as labels to depict 4.2.7. Other. While previous studies addressed challenges in
the emotion of the input. Their dataset consisted of 64 emo- enhancing an emotionally intelligent chatbot to enable per-
tional labels. Song et al. [76] argued that an emotionally ceiving emotion and generating appropriate responses, some
labeled dataset of conversations is usually imbalanced, which studies focused on other novel areas. We classified it into
leads to incorrect predictions. To alleviate the issue, they three main areas.
explicitly embedded emotional words in the input to
increase the strength of the emotion. On the other hand, Sri- (1) Domain-Specific Chatbot. Some studies addressed the
nivasan et al. [54] used reinforcement learning to address the problem specific to a domain. For example, Adikari et al.
unavailability of supervised training data. [78] stated that previous chatbots in the healthcare sector
mainly focused on question-answer systems. They developed
4.2.6. Poor Language Model. Two studies addressed the a rule-based chatbot that detects patient emotion using NLP
problem of a weak language model for emotional responses. techniques and generates a response using a template. In
Ghosh et al. [77] extended the LSTM language model another study based on the healthcare domain, Wang et al.
trained in a conversational speech corpus to generate text [79, 80] developed a chatbot that provides timely responses
enriched in emotion. In another study, Casas et al. [60] to users seeking emotional support. Hu et al. [8] claim that
attempt to understand the context and implicit emotions previous chatbots in customer care focused solely on gram-
expressed in input data to generate empathetic responses. mar and syntax. They highlight the significance of emotional
intelligence in customer care and develop a chatbot that 4.3.1. Response Generation Models
integrates tones in responses by embedding target tones
(empathetic or passionate) in output. (1) Seq2Seq-Based Model. Nearly 50% of the studies (n = 24)
developed an emotionally intelligent chatbot using a Seq2-
(2) Voice/Multimodal Chatbot. Few studies investigated Seq model, in which a query is represented by one sequence
emotionally intelligent voice-based and multimodal chat- of words and the response by another sequence. Studies have
bots. Griol et al. [81] enhance communication in virtual edu- been conducted to extend the model and improve the per-
cational environments by integrating emotion recognition in formance of Seq2Seq and address the limitation of having
social interaction with multiple modalities. The study uti- dull and meaningless responses by generating an appropri-
lizes user profile data and context information from the dia- ate emotional response.
log history to generate emotionally appropriate responses.
Hu et al. [82] claim that emotion recognition in vocal (2) CVAE Model. Some studies (n = 6) adopt the CVAE
responses is novel and explores emotion regulation in approach to develop an emotionally intelligent chatbot to
voice-based conversations. Their model comprehends the generate diverse and affective responses and overcome limi-
input emotion using acoustic cues and generates emotional tations created by adopting the Seq2Seq model. CVAE
responses by integrating emotional keywords in the gener- allows a more diverse response generator, but syntax and
ated response. grammar errors are compromised to a certain extent.
(3) Bilingual Chatbot Interface. Wang et al. [79] use a bilin- (3) Rule-Based Model. Only two reviewed studies use a rule-
gual decoding algorithm that captures the contextual infor- based approach to develop emotionally intelligent chatbots,
mation and generates emotional responses in two using a hybrid approach to combine lexicons and machine
languages. The model employs two decoders to generate pri- learning to achieve the desired results. The first study
mary and secondary language responses. extracts individual emotions from patient conversations
using NLP techniques based on a psychological emotion
It is essential to note that some of the problems identi- model proposed by Plutchik that sets up an emotion dictio-
fied are also applicable to chatbots that are not emotionally nary from a variety of pretrained language models such as
intelligent; however, the development of these chatbots faces Word2Vec and GloVe Bag of Words. Furthermore, they
additional complexities. For example, all chatbots are con- use AI techniques and multiple classifiers to detect the
fronted with the challenge of generating diverse and relevant group’s emotions. Both kinds of emotions, the group and
responses. However, the additional challenge for emotion- individual emotions, are used to capture the emotion expres-
ally intelligent chatbots is to ensure that the diverse and rel- sion sequence. A rule-based system is used to generate
evant response matches the emotion of the interlocutor. On responses based on negative emotions expressed by patients
the other hand, several challenges are specific to empathetic to predict and generate an automated personalized empa-
chatbots such as accurate detection of emotion, generation thetic alert [78]. The second study uses a rule-based
of emotional response, and lack of emotionally labeled approach to detect, predict, and build a statistical response
datasets. generator based on an utterance’s tags. The training data
were automatically obtained from Twitter, in which a classi-
fier is trained to predict and generate specific emotions
4.3. RQ3: What Approaches and Techniques Are Employed in based on conversational history [69].
Chatbot Development? This section discusses the various
approaches and techniques used in the studies to develop (4) Other Approaches. Ten studies (n = 10) utilize
an emotionally intelligent chatbot. Figure 11 presents a tax- approaches that do not fall into the previous categories.
onomy that classifies the major adopted models and divides Chen et al. [6] use an encoder-decoder architecture in which
them into four categories relating to response generation the semantic and multiresolution emotional contexts are
techniques: Seq2Seq model, rule-based model, CVAE-based encoded. In addition, they implement 2-CNN-based seman-
model, and other models. These studies further used three tics with an emotional discriminator used to capture fine-
different approaches to detect emotion in the input and grained emotion using NRC emotion vocabulary for
response: lexicon-based, machine-based, and hybrid method response generation. Wu et al. [83] use encoders-decoders
that combines both types of learning. Lexicon-based learn- that create emotional label datasets to generate various emo-
ing and machine-based learning are two distinct emotion tional responses. The model by Lin et al. [68] consists of an
detection techniques used in emotionally intelligent chat- emotion detector that uses a transformer encoder and an
bots; i.e., one captures the emotion using a dictionary, and empathetic listener. The model utilizes an independently
the second captures the emotion by training a classifier. In parameterized transformer decoder with a metalistener to
contrast, the hybrid model adopts both these techniques in fuse listeners’ information and produce an empathetic
emotion detection. The taxonomy diagram reveals that response. Casas et al. [60] used a pretrained DeepMoji Dai-
lexicon-based learning is the most used method by studies lyDialog dataset to build an emotion classifier using a labeled
that address the problem of capturing emotions accurately. training set to predict emotional states in text-based mes-
The machine learning approach enables the detection of sages. Furthermore, Griol et al. [81] combine information
emotion in a more coarse-grained approach. from the user profiles with emotional content extracted from
Selected articles (n = 42)
CVAE ( n = 6) Seq2Seq (n = 24) Rule-based (n = 2) Other (n = 10)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
Lexicon-based (n = 14) Machine learning (n = 17) Hybrid (n = 11)
Emotion capture (n = 1) Emotion capture (n = 1) Emotion capture (n = 2)

Emotionally relevance (n = 3) Emotionally relevance (n = 6) Emotionally relevance (n = 3)
Content relevance (n = 2) Content relevance (n = 4) Content relevance (n = 3)
Response diversity (n = 4) Response diversity (n = 1) Response diversity (n = 3)
Dataset availability (n = 1) Dataset availability (n = 1) Dataset availability (n = 2)
Language model (n = 1) Language model (n = 1) Language model (n = 0)
Other (n = 0) Other (n = 5) Other (n = 1)
Figure 11: Taxonomy of the selected articles.
the user’s utterances and apply an emotional recognizer in whereas Arousal measures emotion detection and activity.
the dialog manager to choose an adapted system response. This study is based on neural networks using a dialog corpus
Rashkin et al. [17] use a generative pretrained transformer that reflects a positive emotion elicitation strategy [62, 63].
and an emotion classifier trained on the DailyDialog (DD)
dataset to predict emotional states using an encoder- Chang and Hsing [47] propose a two-layered BiLSTM-
decoder model. However, Sun et al. [55] use a topic class based model where word embeddings are constructed by
embedding based on the LDA vector that generates a topic encoding forward and backward sequences of characters
keyword. An emotion embedding vector generates the emo- into a continuous latent space. They capture the emotion
tion keyword by reinforcement learning to generate accurate enriched with semantic representations to provide a capture
emotional responses. of more fine-grained emotions. Furthermore, Ghosh et al.
[77] used the Linguistic Inquiry and Word Count (LIWC)
4.3.2. Input Emotion Detection Models text analysis program based on a dictionary. Each word is
assigned an LIWC category in which the categories were
(1) Lexicon-Based Learning. In several studies (n = 14), selected based on their association with social, affective,
lexicon-based learning models are primarily used to detect and cognitive. They use the text analysis program to identify
and embed emotions to develop emotionally intelligent chat- keywords within a text and extract emotions and features.
bots. Asghar et al. [39] and Zhong et al. [65] adopt the 3D Another study applied the LDA model to derive a topic dic-
semantically augmented affective space VAD (Valence, tionary and specify the topic related to emotion, and this
Arousal, and Dominance) [38] paired with an external cog- technique would overcome the limitation of a supervised
nitively engineered affective dictionary in order to imple- labeled dataset [56].
ment emotion embedding techniques to enhance emotion
diversity. Furthermore, using the bidirectional Seq2Seq (2) Machine-Based Learning. Many studies (n = 17) used a
model with a reinforcement framework that provides machine learning approach for emotion classification solely.
rewards and adopting the VAD affective space to append Several studies use the Seq2Seq-based model with a GRU to
embedding emotion values would enable better emotion improve the Seq2Seq model and improve detecting and gen-
detection and allow to overcome limitations and generate erating response consistency [48, 58, 59, 61]. Some studies
an appropriate emotional response [54]. On the other hand, use a dynamic classifier and a BiLSTM to train the dataset
other studies apply emotion embedding using the VA vector to better capture emotion [53, 72]. In addition, many studies
based on two dimensions of emotion: Valence and Arousal. use the Seq2Seq attention model based on deep RNN and
Valence measures the positivity or negativity of emotion, pair it with a GRU to target the specific emotion-attention
[59]. GRU-RNN is an extension of a neural generator based not task-oriented-based. Some studies lack conversational
on gated neural networks by adding three additional cells and emotionally labeled datasets because of the limited pub-
(refinement, adjustment, and output cells) to capture, con- licly available datasets for training and evaluating the classi-
trol, and produce appropriate sentences [53]. Another study fier systems, which poses a significant challenge [17]. They
used a multilayer encoder-decoder extended with a Genera- propose a new methodology for empathetic dialog genera-
tive Adversarial Network (GAN). The discriminator output tion and introduce a novel dataset of conversations
data are used as rewards for reinforcement learning, pushing grounded in an emotional context. Table 5 provides details
the system to generate dialogs that are most similar to about the various datasets.
human dialogs [7]. Hu et al. [8] implemented a tone-aware
model based on LSTM by adding an indicator vector capable 4.4.2. Evaluation Measures. This section describes the
of controlling the tones of generated conversations that methods used by the reviewed studies to measure the overall
allowed embedding target tones of empathy and passion into performance of emotionally intelligent chatbots in generat-
chatbot responses. Moreover, Niu and Bansal [72] developed ing emotional responses. Almost all the studies have used
a model consisting of 2 layers of the BiLSTM decoder, both the automatic and manual evaluation methods to mea-
followed by a convolution layer with reinforcement rewards sure the effectiveness of their solution. In the automatic
and trained for polite and rude labels that employ an LSTM- method, a test set is used to evaluate the model by compar-
CNN politeness classifier to generate a polite response. ing the generated responses with the existing responses
using well-known metrics. The studies also used an auto-
(3) Hybrid Model. Several studies (n = 5) apply a hybrid mated method to measure the accuracy of emotion classifi-
model to overcome limitations created by adopting only cation. Moreover, most studies use automated metrics to
one approach. For example, in addition to using the VAD compare results against a baseline and other standard
lexicon vector presentation as an emotion embedding tech- models. Several studies compared their models against a
nique, many studies use Bidirectional LSTM (BiLSTM). This Seq2Seq baseline approach. On the other hand, the manual
affective classifier can train a Seq2Seq network in an method employs humans to rate the responses against spec-
encoder-decoder setting to label the sentences according to ified criteria.
their emotional content [64]. Likewise, Peng et al. [57]
increase the emotion intensity by pairing the VA lexicon- (1) Automatic Evaluation. Table 6 summarizes the metrics
based emotion model variations of autoencoders that pro- used in an automated method. It shows the evaluation met-
duce sentences containing a given sentiment or tense using rics for the response generation and classifies the input data.
an emotion classifier. This classifier can increase the inten- BLEU (Bilingual Evaluation Understudy) is the most com-
sity of emotional expression and identify or capture emotion mon metric to evaluate emotionally intelligent chatbot
and intensify emotions that do not include any sentiment. responses. It is derived from a precision tool that automati-
Similarly, Song et al. [76] paired the LDA topic model with cally compares machine translation efficiency with human
a classifier BiLTSM, and Huang et al. [71] used a BiLSTM translation. BLEU is used to estimate the overlap between
with LIWC (Linguistic Inquiry and Word Count) dictionary the generated and target responses. Thus, it measures how
to be trained on the dataset. well the emotional response has been developed. However,
BLEU’s low correlation with human judgment is not suitable
4.4. RQ4: What Evaluation Measures Are Used to Evaluate for measuring conversation generation [61].
Chatbot Performance? This section describes the datasets
used for evaluating the chatbot performance and the differ- Perplexity is another way to evaluate how well a selected
ent evaluation metrics used by the studies. model generates an emotional response—the lower the per-
plexity score, the better the generation performance.
4.4.1. Datasets. A conversational dataset is required to eval- Another measure is the Distinct-1 grams and Distinct-2
uate the performance of a chatbot. Moreover, the dataset grams that measure the diversity of the response. As a result,
must be labeled with emotional tags to feed the encoder with words with many repetitions are penalized, and sentences
emotional input and train the decoder to generate appropri- with many Distinct-n grams are rewarded. These metrics
ate output. are devoted exclusively to the property of a given sentence
Most studies have used conversational datasets from var- and require no reference to ground truth [22].
ious sources, including social media and online websites, as Accuracy, F1-score, precision, and recall are the most
shown in Table 5. The most popular datasets used are common metrics for measuring emotion classification.
Weibo, followed by Twitter, which are open-domain conver- Accuracy is the percentage of correctly predicted outcomes
sational datasets. Only one study used a domain-specific divided by the total amount of predictions [22]. F1-score is
conversational dataset for the healthcare domain [78]. Since also used to assess machine learning models (or classifiers)
none of these datasets are labeled with emotions, researchers as an alternative to accuracy. It measures how well the clas-
have used a machine learning, lexicon-based, or hybrid sifier balances precision and recall. In addition, it measures
approach to label the conversations with emotions. Several how the classifier balances between detecting or capturing
selected studies have used the NLPCC2013, NLPCC2014, the precise emotion and recalling it. Finally, data accuracy
and NLPCC2017 as corpora for labeling. These corpora in a dialog indicates the number of times the data is aligned
can only be used in an open domain where the chatbot is with the topic discussed. On the other hand, recall measures
Table 5: Datasets used by studies.
Datasets # of studies Studies

Weibo 15 [7, 47–49, 52, 55–57, 59, 61, 73, 74, 79, 83, 84]
Twitter 11 [8, 36, 48, 50, 51, 53, 68, 69, 74, 75, 82]
Fisher English Training Speech Corpus 3 [77, 82, 83]
DailyDialog labeled dataset 4 [17, 36, 60, 79]
Cornell Movie-Dialogs 4 [39, 54, 64, 65]
Conversations Support Group: dataset from Kaggle 1 [78]
Durban and Reddit 1 [84]
VoxCeleb 1 [82]
SEMAINE dataset 5 [17, 62, 63, 77, 83]
X-EMAC 1 [67]
Table 6: Automatic evaluation metrics.
Evaluation type Metric # of studies Studies

BLEU 24 [17, 39, 47, 49–53, 55, 57–59, 64, 65, 68, 69, 72–76, 79, 83]
Perplexity 20 [17, 36, 48, 49, 51, 52, 54, 55, 60–67, 72, 73, 75, 79]
Distinct-1 grams
Evaluation of generated responses 12 [36, 50, 53, 56–59, 64, 67, 74, 76, 80]
Distinct-2 grams
ROUGE 5 [39, 52, 54, 59, 66]
METEO 4 [39, 48, 52, 66]
F1 5 [47, 60, 71, 78, 80]
Precision 9 [17, 47, 48, 60, 67, 69, 71, 78, 84]
Evaluation of emotions
Recall 8 [47, 48, 60, 67, 69, 71, 78, 84]
Accuracy 22 [8, 17, 48–52, 55, 57, 58, 61, 66, 68, 71–73, 75, 76, 78, 80, 81, 84]
the number of replies that the chatbot can group into appro- must be able to converse in the user’s preferred language.
priate topics through human-computer interaction [22]. This is an avenue open for further research and exploration.
(2) Human Evaluation. Using human evaluators is another 5.2. Dataset Availability. A vast majority of research studies
way to measure the performance of emotionally intelligent focus on developing an emotionally intelligent chatbot for an
chatbots. Although automatic evaluation is more efficient open domain, whereas only a few have focused on the closed
and has fewer overheads than human evaluation, it does domain, using a rule-based approach for generating
not consider whether the generated emotional response is responses. And only one of the reviewed studies sourced a
appropriate and natural. Human evaluation is usually mea- domain-specific dataset for healthcare [78]. A generative
sured on a Likert scale. Several studies employed the Ama- chatbot that synthesizes human-like natural responses
zon Mechanical Turk (MTurk) participants (n = 5) for requires a massive dataset for training [39]. The unavailabil-
evaluation. Multiple studies (n = 14) used Fleiss’ kappa test ity of domain-specific conversational datasets is the main
to measure the annotator’s agreements and their consistency reason for the research gap in this field. A ripe area for
in rating [39]. Table 7 summarizes the evaluation criteria exploration for researchers is the development of domain-
used for human evaluation. specific datasets for education, business, and more as they
can provide appealing solutions for empathetic customer
service chatbots, advising chatbots, and more.
5. Discussion Moreover, the conversational datasets used for open-
domain chatbots are not emotionally labeled. The reviewed
5.1. Chatbot Interface Language. Chinese and English are the studies have used extensive preprocessing of the datasets
most popular chatbot interface languages used by retrieved from Twitter and other datasets to extract conver-
researchers to develop emotionally intelligent chatbots. The sations and classify them further with labels. However, an
conversational datasets for these languages are retrieved issue with this approach is that the dataset is usually imbal-
from Twitter and Weibo. Only one study proposed the anced, and the classification is usually prone to errors. Rash-
development of a bilingual chatbot [79]. In a multicultural kin et al. [17] addressed this challenge by developing a
environment, this is an essential solution where a chatbot dataset of emotionally labeled conversations. The dataset
Table 7: Manual evaluation criteria.
Evaluation criteria # of studies Studies

Emotion accuracy 3 [49, 53, 78]
Response emotion quality and specificity 4 [47, 57, 60, 83]
Response emotion reflection and expression 8 [55–57, 60, 74, 79, 80, 83]
Response emotion diversity 8 [8, 36, 39, 58, 59, 61, 64, 76]
Response emotion appropriateness 10 [8, 36, 39, 48, 58, 59, 61, 64, 69, 82]
Response empathetic emotion intensity 7 [8, 17, 66–68, 82, 84]
Emotion intensity 7 [8, 17, 66–68, 82, 84]
Response grammatical correctness 11 [36, 39, 48, 50, 51, 58, 59, 64, 76, 77, 80]
Response user preference 1 [64]
Response naturalness 4 [7, 50, 54, 80]
Response coherence 5 [7, 39, 54, 68, 80]
Response fluency 4 [17, 49, 68, 80]
Response relevance 13 [17, 36, 48, 49, 52, 57, 61, 66–68, 72, 74, 80, 84]
Response consistency 9 [7, 50–52, 54–56, 73, 84]
Response logic 5 [55–57, 62, 63]
Response intelligible 2 [62, 63]
Response context 1 [72]
Response politeness 1 [72]
consists of 25k conversational utterances. This is another diagram (Figure 11) shows that mainly lexicon-based
area of research that needs further exploration where approaches are used by studies that address the challenge
researchers may investigate the development of more emo- of emotion capture. Only four studies have attempted to
tionally labeled datasets to be used as the gold standard in capture the user’s emotional state from multiple historical
the open domain. utterances. Connecting the meanings and emotions from
previous utterances is essential to comprehend the user’s
5.3. Encoder-Decoder Model. Several studies use techniques emotional state and foster a continuous conversation. This
to enhance the previously adopted model for developing is still an unexplored area and requires further investigation.
emotionally intelligent chatbots, i.e., extending Seq2Seq to
overcome its dull and meaningless response limitations. 5.5. Voice-Based/Multimodal Chatbots. All of the chatbots
Many studies use a bidirectional classifier that is trained included in the review are text-based. Another area for
using an emotionally labeled dataset to develop the model exploration and further research is the development of
[64]. However, the limitation of such models is that conver- voice-based and multimodal chatbots that are domain-
sational models based on neural networks cannot capture specific.
the complexities of emotions and produce short and unclear
responses. More recent researchers have utilized the CVAE 5.6. Hybrid Chatbots. Finally, there are no studies investigat-
model to alleviate this problem and generate diverse emo- ing generative emotionally intelligent chatbots that are task-
tional responses. The studies have demonstrated that CVAE oriented. Non-task-oriented chatbots are usually rule-based
can solve this problem and increase the diversity of because they provide precise information but at the same
responses. Additionally, it overcomes the dullness and time suffer from machine-like responses. A task-oriented
meaninglessness of Seq2Seq. However, it impacts the syntax emotionally intelligent chatbot could assist the user in
of the responses [36]. A further area for exploration by accomplishing a task, such as making a reservation, placing
researchers is to enhance the CVAE model to make it more an order, and providing advising information, while embed-
robust to syntax errors. ding empathy in the conversation to eliminate user frustra-
tion and provide a good user experience. Moreover, such a
5.4. Emotion Detection and Embedding. The primary focus chatbot could trigger human intervention if required by
of most studies was to accurately detect the input emotion determining the user’s emotional state [6, 47, 74].
or the user’s emotional state and generate appropriate affec-
tive responses. Several studies indicate that emotions are 6. Conclusion
complex and cannot be captured accurately by a classifier
[47, 62, 63]. By adopting a lexicon-based learning approach This section includes a summary of the paper and its signif-
and using VAD vector spaces where each word is embedded icance, limitations, and new directions for future research.
with emotion, it is possible to overcome the inability of clas- Recent technological advances have made chatbots
sifiers to detect fine-grained emotion [39]. The taxonomy increasingly feasible to deliver information to various
Table 8: Studies included in the review.
# Study Purpose
To build a chatbot that captures the emotions of patients during interaction and accordingly updates human therapists to
1 [78]
provide timely care
To generate affective responses in an open-domain chatbot by using a three-method approach in an LSTM conversational
2 [39]
model
3 [60] To develop an empathetic chatbot that generates responses based on the user’s emotional state and the context of the message
4 [47] To incorporate emotional content into the response generation process to make chatbot responses more emotionally sound
To produce an affect-driven dialog system that generates multiple diverse emotional responses and ranks them based on
5 [64]
emotion
To extract the affect category of the input text using the Linguistic Inquiry and Word Count (LIWC) and generate
6 [77]
grammatically correct responses embedded with emotion
7 [81] To develop an embodied conversational agent that responds based on user profiles and emotional content
To predict the emotional state of the sender based on historical responses and accordingly generate an emotionally appropriate
8 [69]
response
9 [82] To build a voice-based conversational agent that embeds responses with emotion
10 [8] To develop a novel tone-aware chatbot that generates toned responses to user requests on social media
To embed emotions in the dialog based on input emotion and to tackle the problem of generic responses that are not
11 [71]
emotionally intelligent
To develop a topic-aware emotional response generation (TERG) model, which can not only exactly generate desired emotional
12 [49]
response but also perform well in topic relevance
13 [56] To embed emotion and topic in the input data to generate meaningful and emotionally relevant responses
14 [66] To develop and evaluate a multiresolution adversarial model that generates more empathetic responses
To elicit a topic-coherent response embedded with emotion using a loss function to predict the corresponding word in every
15 [50]
generation step
To develop an online empathetic chatbot influenced by emotion information using large-scale empathetic conversational
16 [67]
datasets to detect the user’s emotion or ask questions for self-disclosure
17 [68] To develop a model that selects an appropriate reaction by learning the context and underlying emotion
Used an affective lexicon to embed sentiments into the word vectors and used a CVAE-based dialog model to generate diverse
18 [36]
and emotional responses
19 [62] To develop an AI-driven chat-oriented dialog system that dynamically imitates human emotions in the conversation
20 [63] To elicit a more positive emotional valence throughout a chat-based interaction in order to promote positive emotional states
To develop three weakly supervised models that can generate diverse, polite (or rude) dialog responses using data from separate
21 [72]
style and dialog domains
To propose a generative model that fuses word- and sentence-level emotions to model the dialog text and learn emotional
22 [51]
expression in order to control the emotional feature of the generated response
To present a topic-enhanced emotional conversation generation model that incorporates emotional factors and topic
23 [57]
information into the conversation system
To use a custom-built empathetic conversational dataset and explore different ways of combining information from related
24 [17]
tasks that can lead to more empathetic responses
25 [76] To generate meaningful responses embedded with explicit or implicit emotion
26 [54] To develop a new approach of context-relevant emotional responses using the bidirectional Seq2Seq model
27 [7] To create an emotionally intelligent chatbot using emotional tags on the posts and recognize the emotional dimension
To use reinforcement learning with emotional editing constraints to generate more meaningful and customizable emotional
28 [55]
responses
To create a bilingual-aided interactive approach that can simultaneously and interactively generate bilingual emotional replies
29 [79]
to monolingual posts
To provide social support for community members in an online health community using a Seq2Seq model-based chatbot that
30 [80]
recognizes emotion and produces diverse responses
To build a unified neural architecture in order to encode the semantics and affect for generating more intelligent responses with
31 [59]
expressed emotions
32 [58] To extract the emotional and semantic information of the interlocutor to generate logical responses embedded with emotion
To develop an anthropomorphic model and present its ability to understand the human interlocutor using both the subjective
33 [83]
and objective measures
Table 8: Continued.
# Study Purpose
To create an empathetic conversation system that incorporates emotional factors added to semantics and to enhance the
34 [84]
context-response through a multitask learning framework
To design an artificial conversational chatting machine that generates nondeterministic responses providing the same input
35 [52]
with different emotional contexts that are empathetically coherent
To propose a multiemotional conversation system (MECS) and evaluate the model at both the context level and the emotion
36 [73]
level
To develop a dual-factor generation model that fits the conversation data and actively controls the generation of the response
37 [53]
with respect to sentiment or topic specificity
To develop an intelligent open-domain neural conversational model that produces responses that are syntactically and
38 [65]
semantically appropriate and rich in emotion
To propose a neural conversation generation with auxiliary emotional supervised models where the dialog generation system is
39 [74]
characterized by emotional intelligence
40 [61] To propose a model to generate emotional responses using internal and external memory
To present the design and implementation of XiaoIce, a multimodal chatbot that recognizes and responds with emotion in an
41 [48]
open-domain conversation
42 [75] To apply emotion detection with emojis using a reinforced CVAE model to generate affective responses that contain emojis
domains. Consequently, there are now a growing number of to test the quality of the responses. Several studies sourced
chatbots available for public use. Today, more attention is participants from MTurk or used other human judges to
being paid to the development of emotionally intelligent evaluate the response diversity and emotional relevance. Sta-
chatbots. Developing chatbots that can generate emotional tistical measures such as Fleiss’ kappa are used to determine
responses to user requests is challenging yet crucial to its the validity of human responses.
successful adoption. This study may have had limitations due to several fac-
In this study, we conducted a systematic literature review tors. First, there was a limited amount of time to conduct
exploring a spectrum of topics regarding the development of the study. Moreover, although six bibliographic databases
emotionally intelligent chatbots, exploring the technique of were used to retrieve relevant studies, the lack of research
embedding and generating emotional responses, the chal- due to the relatively new and emerging topic resulted in
lenges, the datasets used, and the evaluation processes used the possibility of specific unexplored areas that the readers
to measure the chatbot’s performance. This study was based may notice. Furthermore, due to limited resources, some
on available publications from 2011 to 2022 using six digital retrieval studies may not be thorough and may compromise
databases: Scopus, IEEE Xplore, ProQuest, ScienceDirect, the effectiveness of the study.
ACM Digital Library, and EBSCO. We use a systematic
approach to gather and assimilate our findings. This study Data Availability
is aimed at generating evidence-based guidelines for
researchers and developers to gain insights into emotionally The search keywords and databases used in the systematic
intelligent chatbot development research. Thus, researchers review are provided in the paper. Table 8 list of all the papers
and practitioners in the related fields will gain a deeper included in the systematic literature review. Furthermore,
understanding of emotionally intelligent chatbots based on the data encoding of the papers analyzed during the current
the findings of this study presented in the discussion section. study is available from the corresponding author upon rea-
Our study shows that Chinese is the most commonly sonable request.
used interface language in developing emotionally intelligent
chatbots. Weibo and Twitter datasets are the most popular
Conflicts of Interest
datasets used to develop open-domain AI-powered chatbots.
Additionally, most chatbots are developed for the open All authors declare that they have no conflicts of interest.
domain due to the availability of conversational datasets.
However, these datasets are not labeled; therefore, a com-
mon preprocessing step is to label the dataset using a classi-
References
fier, lexicon-based, or hybrid approach. Furthermore, we [1] M. Allouch, A. Azaria, and R. Azoulay, “Conversational
identified that the lexicon-based approach, such as the agents: goals, technologies, vision and challenges,” Sensors,
VAD vector, provides fine-grained emotion detection. Clas- vol. 21, no. 24, p. 8448, 2021.
sifiers are also used to detect emotion and generate diverse [2] M. Adam, M. Wessel, and A. Benlian, “AI-based chatbots in
responses, which is the ultimate objective of the evaluation. customer service and their effects on user compliance,” Elec-
Most studies use automatic and human evaluation measures. tronic Markets, vol. 31, no. 2, pp. 427–445, 2021.
BLEU and perplexity are the most commonly used metrics [3] M. Milne-ives, C. CockDe, E. Lim et al., “The effectiveness of
in automatic evaluation. Human evaluations are essential artificial intelligence conversational agents in health care:
systematic review,” Journal of Medical Internet Research, of the Association for Computational Linguistics, pp. 5370–
vol. 22, no. 10, article e20346, 2020. 5381, Florence, Italy, 2020.
[4] M. Moran, “25+ top chatbot statistics for 2022: usage, demo- [18] M. R. Pacheco-Lorenzo, S. M. Valladares-Rodríguez, L. E.
graphics, trends,” Startup Bonsai, 2022, September 2022, Anido-Rifón, and M. J. Fernández-Iglesias, “Smart conversa-
https://startupbonsai.com/chatbot-statistics/. tional agents for the detection of neuropsychiatric disorders:
[5] A. Rapp, L. Curti, and A. Boldi, “The human side of human- a systematic review,” Journal of Biomedical Informatics,
chatbot interaction: a systematic literature review of ten years vol. 113, article 103632, 2021.
of research on text-based chatbots,” International Journal of [19] C. W. Okonkwo and A. Ade-Ibijola, “Chatbots applications in
Human Computer Studies, vol. 151, article 102630, 2021. education: a systematic review,” Computers and Education:
[6] J. S. Chen, T. T. Y. Le, and D. Florence, “Usability and respon- Artificial Intelligence, vol. 2, article 100033, 2021.
siveness of artificial intelligence chatbot on online customer [20] A. Miklosik, N. Evans, A. Mahmood, and A. Qureshi, “The use
experience in e-retailing,” International Journal of Retail and of chatbots in digital business transformation: a systematic lit-
Distribution Management, vol. 49, no. 11, pp. 1512–1531, erature review,” IEEE Access, vol. 9, pp. 106530–106539, 2021.
2021. [21] A. de Barcelos Silva, M. M. Gomes, C. A. da Costa et al., “Intel-
[7] X. Sun, X. Chen, Z. Pei, and F. Ren, “Emotional human ligent personal assistants: a systematic literature review,”
machine conversation generation based on SeqGAN,” in Expert Systems with Applications, vol. 147, article 113193,
2018 First Asian Conference on Affective Computing and Intel- 2020.
ligent Interaction (ACII Asia), Beijing, China, May 2018. [22] S. Mohamad Suhaili, N. Salim, and M. N. Jambli, “Service
[8] T. Hu, A. Xu, Z. Liu et al., “Touch your heart: a tone-aware chatbots: a systematic review,” Expert Systems with Applica-
chatbot for customer care on social media,” in Proceedings of tions, vol. 184, p. 115461, 2021.
the 2018 CHI Conference on Human Factors in Computing Sys- [23] A. K. Wardhana, R. Ferdiana, and I. Hidayah, “Empathetic
tems, pp. 1–12, Montréal, Canada, 2018. chatbot enhancement and development: a literature review,”
[9] E. Adamopoulou and L. Moussiades, “An overview of chatbot in 2021 International Conference on Artificial Intelligence and
technology,” in BT - Artificial Intelligence Applications and Mechatronics Systems (AIMS), Bandung, Indonesia, April
Innovations, I. Maglogiannis, L. Iliadis, and E. Pimenidis, 2021.
Eds., p. 373, Springer International Publishing, 2020. [24] E. W. Pamungkas, Emotionally-aware chatbots: a survey, Cor-
[10] S.-M. Tan and T. W. Liew, “Multi-chatbot or single-chatbot? nell University Library, 2019, http://ezproxy.hct.ac.ae/
The effects of m-commerce chatbot interface on source credi- login?url=https://www.proquest.com/working-papers/
bility, social presence, trust, and purchase intention,” Human emotionally-aware-chatbots-survey/docview/2246534988/se-
Behavior and Emerging Technologies, vol. 2022, article 2.
2501538, 14 pages, 2022. [25] Y. Ma, K. L. Nguyen, F. Z. Xing, and E. Cambria, “A survey on
[11] E. Adamopoulou and L. Moussiades, “An Overview of Chatbot empathetic dialogue systems,” Information Fusion, vol. 64,
Technology,” in Artificial Intelligence Applications and Inno- pp. 50–70, 2020.
vations. AIAI 2020. IFIP Advances in Information and Com- [26] J. Grudin and R. Jacques, “Chatbots, humbots, and the quest
munication Technology, I. Maglogiannis, L. Iliadis, and E. for artificial general intelligence,” in Proceedings of the 2019
Pimenidis, Eds., vol. 584, p. 373, Springer, Cham, 2020. CHI Conference on Human Factors in Computing Systems,
[12] P. Salovey and J. D. Mayer, “Emotional intelligence,” Imagina- pp. 1–11, Glasgow, UK, May 2019.
tion, Cognition and Personality, vol. 9, no. 3, pp. 185–211, [27] M. Jovanovic, M. Baez, and F. Casati, “Chatbots as conversa-
1990. tional healthcare services,” IEEE Internet Computing, vol. 25,
[13] X. Wang and R. Nakatsu, “How do people talk with a virtual no. 3, pp. 44–51, 2021.
philosopher: log analysis of a real-world application,” in Enter- [28] H. Y. Shum, X. D. He, and D. Li, “From Eliza to XiaoIce: chal-
tainment Computing – ICEC 2013. ICEC 2013, J. C. Anacleto, lenges and opportunities with social chatbots,” Frontiers of
E. W. G. Clua, F. S. C. Silva, S. Fels, and H. S. Yang, Eds., Information Technology and Electronic Engineering, vol. 19,
vol. 8215 of Lecture Notes in Computer Science, pp. 132– no. 1, pp. 10–26, 2018.
137, Springer, Berlin, Heidelberg, 2013. [29] S. Hussain, O. Ameri Sianaki, and N. Ababneh, “A survey on
[14] A. Ghandeharioun, D. McDuff, M. Czerwinski, and K. Rowan, conversational agents/chatbots classification and design tech-
“Towards understanding emotional intelligence for behavior niques,” in Advances in Intelligent Systems and Computing,
change chatbots,” in 2019 8th International Conference on vol. 927, Springer International Publishing, 2019.
Affective Computing and Intelligent Interaction (ACII), pp. 8– [30] Z. Safi, A. Abd-alrazaq, M. Khalifa, and M. Househ, “Technical
14, Cambridge, UK, September 2019. aspects of developing chatbots for medical applications: scop-
[15] S. C. Paul, N. Bartmann, and J. L. Clark, “Customizability in ing review,” Journal of Medical Internet Research, vol. 22,
conversational agents and their impact on health engage- no. 12, article e19127, 2020.
ment,” Human Behavior and Emerging Technologies, vol. 3, [31] Z. Xiao, M. X. Zhou, W. Chen, H. Yang, and C. Chi, “If I hear
no. 5, pp. 1141–1152, 2021. you correctly: building and evaluating interview chatbots with
[16] J. C. Giger, N. Piçarra, P. Alves-Oliveira, R. Oliveira, and active listening skills,” in Proceedings of the 2020 CHI Confer-
P. Arriaga, “Humanization of robots: is it really such a good ence on Human Factors in Computing Systems, vol. 1–14,
idea?,” Human Behavior and Emerging Technologies, vol. 1, Hawai'i, USA, April 2020.
no. 2, pp. 111–123, 2019. [32] R. S. Wallace, “The anatomy of ALICE,” in Parsing the Turing
[17] H. Rashkin, E. M. Smith, M. Li, and Y. L. Boureau, “Towards Test, pp. 181–210, Springer, 2009.
empathetic open-domain conversation models: a new bench- [33] A. Xu, Z. Liu, Y. Guo, V. Sinha, and R. Akkiraju, “A new chat-
mark and dataset,” in Proceedings of the 57th Annual Meeting bot for customer service on social media,” in Proceedings of the
2017 CHI Conference on Human Factors in Computing Sys- the Twenty-Ninth International Joint Conference on Artificial
tems, pp. 3506–3510, Denver, Colarado, May 2017. Intelligence, pp. 3637–3643, Yokohama, Japan, July 2020.
[34] E. Svikhnushina and P. Pu, “Social and emotional etiquette of [51] D. Peng, M. Zhou, C. Liu, and J. Ai, “Human-machine dia-
chatbots: a qualitative approach to understanding user needs logue modelling with the fusion of word- and sentence-level
and expectations,” 2020, https://arxiv.org/abs/2006.13883. emotions,” Knowledge-Based Systems, vol. 192, article
[35] A. Hutapea, “Chatbot: architecture , design , & development,” 105319, 2020.
University of Pennsylvania School of Engineering and Applied [52] K. Yao, L. Zhang, T. Luo, D. Du, and Y. Wu, “Non-determin-
Science Department of Computer and Information Science, istic and emotional chatting machine: learning emotional con-
2017. versation generation using conditional variational
[36] M. Liu, X. Bao, J. Liu, P. Zhao, and Y. Shen, “Generating emo- autoencoders,” Neural Computing and Applications, vol. 33,
tional response by conditional variational auto-encoder in no. 11, pp. 5581–5589, 2021.
open- domain dialogue system,” Neurocomputing, vol. 460, [53] R. Zhang, J. Guo, Y. Fan, Y. Lan, and X. Cheng, “Dual-factor
pp. 106–116, 2021. generation model for conversation,” ACM Transactions on
[37] M. Aleedy, H. Shaiba, and M. Bezbradica, “Generating and Information Systems, vol. 38, no. 3, pp. 1–31, 2020.
analyzing chatbot responses using natural language process- [54] V. Srinivasan, S. Santhanam, and S. Shaikh, “Using reinforce-
ing,” International Journal of Advanced Computer Science ment learning with external rewards for open-domain natural
and Applications, vol. 10, no. 9, pp. 60–68, 2019. language generation,” Journal of Intelligent Information Sys-
[38] A. B. Warriner, V. Kuperman, and M. Brysbaert, “Norms of tems, vol. 56, no. 1, pp. 189–206, 2021.
valence, arousal, and dominance for 13,915 English lemmas,” [55] X. Sun, J. Li, X. Wei, C. Li, and J. Tao, “Emotional editing con-
Behavior Research Methods, vol. 45, no. 4, pp. 1191–1207, straint conversation content generation based on reinforce-
2013. ment learning,” Information Fusion, vol. 56, pp. 70–80, 2020.
[39] N. Asghar, P. Poupart, J. Hoey, X. Jiang, and L. Mou, “Affective [56] J. Li and X. Sun, “A syntactically constrained bidirectional-
neural response generation,” in Advances in Information asynchronous approach for emotional conversation genera-
Retrieval. ECIR 2018, G. Pasi, B. Piwowarski, L. Azzopardi, tion,” in Proceedings of the 2018 Conference on Empirical
and A. Hanbury, Eds., vol. 10772 of Lecture Notes in Com- Methods in Natural Language Processing, pp. 678–683, Brus-
puter Science(), pp. 154–166, Springer, Cham, 2018. sels, Belgium, 2020.
[40] B. Kitchenham and S. Charters, Guidelines for performing sys- [57] Y. Peng, Y. Fang, Z. Xie, and G. Zhou, “Topic-enhanced emo-
tematic literature reviews in software engineering, 2007. tional conversation generation with attention mechanism,”
[41] D. Tranfield, D. Denyer, and P. Smart, “Towards a methodol- Knowledge-Based Systems, vol. 163, pp. 429–437, 2019.
ogy for developing evidence‐informed management knowl- [58] W. Wei, J. Liu, X. Mao et al., “Target-guided emotion-aware
edge by means of systematic review,” British Journal of chat machine,” ACM Transactions on Information Systems,
Management, vol. 14, no. 3, pp. 207–222, 2003. vol. 39, no. 4, pp. 1–24, 2021.
[42] M. Petticrew and H. Roberts, Systematic Reviews in the Social [59] W. Wei, J. Liu, X. Mao et al., “Emotion-aware chat
Sciences: A Practical Guide, John Wiley & Sons, 2008. machine: automatic emotional response generation for
[43] L. Yang, H. Zhang, H. Shen et al., “Quality assessment in sys- human-like emotional interaction,” in Proceedings of the
tematic literature reviews: a software engineering perspective,” 28th ACM International Conference on Information and
Information and Software Technology, vol. 130, article 106397, Knowledge Management, pp. 1401–1410, Beijing, China,
2021. November 2019.
[60] J. Casas, T. Spring, K. Daher, E. Mugellini, O. A. Khaled, and
[44] N. J. Van Eck and L. Waltman, VOSviewer Manual, vol. 1,
P. Cudré-Mauroux, “Enhancing conversational agents with
no. 1, 2013, Univeristeit Leiden, Leiden, 2013.
empathic abilities,” in Proceedings of the 21st ACM Interna-
[45] D. Moher, A. Liberati, J. Tetzlaff, and D. G. Altman, “Preferred tional Conference on Intelligent Virtual Agents, pp. 41–47,
reporting items for systematic reviews and meta-analyses: the Japan, 2021.
PRISMA statement,” International Journal of Surgery, vol. 8,
[61] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu, “Emotional
no. 5, pp. 336–341, 2010.
chatting machine: emotional conversation generation with
[46] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence internal and external memory,” Proceedings of the AAAI Con-
learning with neural networks,” Advances in Neural Informa- ference on Artificial Intelligence, vol. 32, no. 1, pp. 730–738,
tion Processing Systems, vol. 4, pp. 3104–3112, 2014. 2018.
[47] Y.-C. Chang and Y.-C. Hsing, “Emotion-infused deep neural [62] N. Lubis, S. Sakti, K. Yoshino, and S. Nakamura, “Eliciting pos-
network for emotionally resonant conversation,” Applied Soft itive emotion through affect-sensitive dialogue response gener-
Computing, vol. 113, p. 107861, 2021. ation: a neural network approach,” in 32nd AAAI Conference
[48] L. Zhou, J. Gao, D. Li, and H.-Y. Shum, “The design and imple- on Artificial Intelligence, AAAI 2018, pp. 5293–5300, Louisi-
mentation of XiaoIce, an empathetic social chatbot,” Compu- ana, USA, 2018.
tational Linguistics, vol. 46, no. 1, pp. 53–93, 2020. [63] N. Lubis, S. Sakti, K. Yoshino, and S. Nakamura, “Positive
[49] P. Huo, Y. Yang, J. Zhou, C. Chen, and L. He, “TERG: topic- emotion elicitation in chat-based dialogue systems,” IEEE/
aware emotional response generation for chatbot,” in 2020 ACM Transactions on Audio Speech and Language Processing,
International Joint Conference on Neural Networks (IJCNN), vol. 27, no. 4, pp. 866–877, 2019.
Glasgow, UK, July 2020. [64] P. Colombo, W. Witon, A. Modi, J. Kennedy, and M. Kapadia,
[50] S. Li, S. Feng, D. Wang, K. Song, Y. Zhang, and W. Wang, “Affect-driven dialog generation,” in NAACL HLT 2019 - 2019
“EmoElicitor: an open domain response generation model Conference of the North American Chapter of the Association
with user emotional reaction awareness,” in Proceedings of for Computational Linguistics: Human Language Technologies
- Proceedings of the Conference, , Volume 1, pp. 3734–3743, [78] A. Adikari, D. de Silva, H. Moraliyage et al., “Empathic conver-
Minneapolis, Minnesota, 2019. sational agents for real-time monitoring and co-facilitation of
[65] P. Zhong, D. Wang, and C. Miao, “An affect-rich neural con- patient-centered healthcare,” Future Generation Computer
versational model with biased attention and weighted cross- Systems, vol. 126, pp. 318–329, 2022.
entropy loss,” Proceedings of the AAAI Conference on Artificial [79] J. Wang, X. Sun, and M. Wang, “Emotional conversation gen-
Intelligence, vol. 33, no. 1, pp. 7492–7500, 2019. eration with bilingual interactive decoding,” IEEE Transac-
[66] Q. Li, H. Chen, Z. Ren, P. Ren, Z. Tu, and Z. Chen, “EmpDG: tions on Computational Social Systems, vol. 9, no. 3, pp. 818–
multi-resolution interactive empathetic dialogue generation,” 829, 2021.
in Proceedings of the 28th International Conference on Compu- [80] L. Wang, D. Wang, F. Tian et al., “CASS: towards building a
tational Linguistics, pp. 4454–4466, Barcelona, Spain, 2021. social-support chatbot for online health community,” in Pro-
[67] Y. Li, K. Li, H. Ning et al., “Towards an online empathetic ceedings of the ACM on Human-Computer Interaction,
chatbot with emotion causes,” in 44th International ACM 5(CSCW1), 2021.
SIGIR Conference on Research and Development in Informa- [81] D. Griol, A. Sanchis, J. M. Molina, and Z. Callejas, “Developing
tion Retrieval, pp. 2041–2045, Association for Computing enhanced conversational agents for social virtual worlds,”
Machinery, 2021. Neurocomputing, vol. 354, pp. 27–40, 2019.
[68] Z. Lin, A. Madotto, J. Shin, P. Xu, and P. Fung, “MOEL: mix- [82] J. Hu, Y. Huang, X. Hu, and Y. Xu, “Enhancing the perceived
ture of empathetic listeners,” in Proceedings of the 2019 Confer- emotional intelligence of conversational agents through acous-
ence on Empirical Methods in Natural Language Processing tic cues,” in Extended Abstracts of the 2021 CHI Conference on
and the 9th International Joint Conference on Natural Lan- Human Factors in Computing Systems, May 2021.
guage Processing (EMNLP-IJCNLP), pp. 121–132, Hong Kong,
China, 2020. [83] J. Wu, S. Ghosh, M. Chollet, S. Ly, S. Mozgai, and S. Scherer,
“NADiA - towards neural network driven virtual human con-
[69] T. Hasegawa, N. Kaji, N. Yoshinaga, and M. Toyoda, “Predict-
versation agents,” in Proceedings of the 18th International Con-
ing and eliciting addressee’s emotion in online dialogue,” Pro-
ference on Intelligent Virtual Agents, pp. 2262–2264, Sydney,
ceedings of the 51st Annual Meeting of the Association for
Australia, November 2018.
Computational Linguistics, vol. 29, no. 1, pp. 90–99, 2013.
[84] R. Yan, “What if bots feel moods?,” in Proceedings of the 43rd
[70] L. Qiu, Y. Shiu, P. Lin et al., “What if bots feel moods?,” in Pro-
International ACM SIGIR Conference on Research and Devel-
ceedings of the 43rd International ACM SIGIR Conference on
opment in Information Retrieval, pp. 1161–1170, China, July
Research and Development in Information Retrieval,
2020.
pp. 1161–1170, China, July 2020.
[71] C. Huang, O. R. Zaïane, A. Trabelsi, and N. Dziri, “Automatic
dialogue generation with expressed emotions,” in Proceedings
of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
Technologies, Volume 2 (Short Papers), pp. 49–54, New
Orleans, Louisiana, 2018.
[72] T. Niu and M. Bansal, “Polite dialogue generation without par-
allel data,” Transactions of the Association for Computational
Linguistics, vol. 6, pp. 373–389, 2018.
[73] R. Zhang, Z. Wang, and D. Mai, “Building emotional conver-
sation systems using multi-task Seq2Seq learning,” in Natural
Language Processing and Chinese Computing. NLPCC 2017, X.
Huang, J. Jiang, D. Zhao, Y. Feng, and Y. Hong, Eds.,
vol. 10619 of Lecture Notes in Computer Science(), pp. 612–
621, Springer, Cham, 2018.
[74] G. Zhou, Y. Fang, Y. Peng, and J. Lu, “Neural conversation
generation with auxiliary emotional supervised models,”
ACM Transactions on Asian and Low-Resource Language
Information Processing, vol. 19, no. 2, pp. 1–17, 2019.
[75] X. Zhou and W. Y. Wang, “MojiTalk: generating emotional
responses at scale,” in Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1:
Long Papers), pp. 1-2, Melbourne, Australia, 2018.
[76] Z. Song, X. Zheng, L. Liu, M. Xu, and X. Huang, “Generating
responses with a specific emotion in dialog,” in Proceedings
of the 57th Annual Meeting of the Association for Computa-
tional Linguistics, pp. 3685–3695, Florence, Italy, 2020.
[77] S. Ghosh, M. Chollet, E. Laksana, L. P. Morency, and
S. Scherer, “Affect-LM: a neural language model for customiz-
able affective text generation,” in Proceedings of the 55th
Annual Meeting of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pp. 634–642, Vancouver, Can-
ada, 2017.

Emotionally Intelligent Chatbots A Systematic Lite

Uploaded by

Copyright:

Available Formats

Emotionally Intelligent Chatbots A Systematic Lite

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Emotionally Intelligent Chatbots A Systematic Lite

Uploaded by

Copyright:

Available Formats

Hindawi

Human Behavior and Emerging Technologies

Ghazala Bilquise ,1 Samar Ibrahim ,2 and Khaled Shaalan 3

Correspondence should be addressed to Ghazala Bilquise; [email protected]

Academic Editor: Zheng Yan

1. Introduction beneﬁts of integrating chatbots in service and social disci-

ASR Dialog manager

Figure 1: Chatbot architecture.

3. Research Methodology comparison step of PICOC, we consider all possible

Table 1: Inclusion/exclusion criteria.

Inclusion criteria Exclusion criteria

Table 2: Quality assessment checklist.

Figure 2: Bibliometric analysis of search results.

ACM digital library (n = 272)

Records identified after initial search

Records after duplicates removal

Records after applying inclusion/exclusion

Excluded records (n = 253)

Records after full-text screening chatbots (n = 142)

Records after quality assessment

Studies included in meta-analysis

Figure 3: PRISMA ﬂowchart.

Table 4: Quality assessment.

SNo Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Total %

Figure 4: Distribution of studies by article source.

Figure 9: Distribution by region of study.

Embed topic in decoder (n = 5)

Embed semantics in input (n = 2)

Assign responder state (n = 1)

Use CVAE model ( n = 5)

Use reinforcement learning ( n = 1)

Bilingual chat interface (n = 1)

Figure 10: Mind map of problems and solutions.

Selected articles (n = 42)

CVAE ( n = 6) Seq2Seq (n = 24) Rule-based (n = 2) Other (n = 10)

Lexicon-based (n = 14) Machine learning (n = 17) Hybrid (n = 11)

Emotion capture (n = 1) Emotion capture (n = 1) Emotion capture (n = 2)

Figure 11: Taxonomy of the selected articles.

Table 5: Datasets used by studies.

Datasets # of studies Studies

Table 6: Automatic evaluation metrics.

Evaluation type Metric # of studies Studies

Table 7: Manual evaluation criteria.

Evaluation criteria # of studies Studies

Table 8: Studies included in the review.

You might also like