LAUN Improved StarGAN For Facial Emotion Recognition


Received August 3, 2020, accepted August 31, 2020, date of publication September 3, 2020, date of current version September 16, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3021531

LAUN Improved StarGAN for Facial Emotion Recognition
XIAOHUA WANG 1,2,3, JIANQIAO GONG 1,2,3, MIN HU 1,2, YU GU 1,2 (Senior Member, IEEE), AND FUJI REN 2,4 (Senior Member, IEEE)
1 Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Hefei University of Technology, Hefei 230601, China
2 Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, School of Computer Science and Information Engineering, Hefei
University of Technology, Hefei 230601, China
3 Anhui Province Key Laboratory of Industry Safety and Emergency Technology, Hefei 230601, China
4 Graduate School of Advanced Technology & Science, University of Tokushima, Tokushima 7708502, Japan

Corresponding authors: Yu Gu ([email protected]) and Min Hu ([email protected])


This work was supported in part by the National Natural Science Foundation of China under Grant 61672202, in part by the State Key
Program of NSFC-Shenzhen Joint Foundation under Grant U1613217, and in part by the Fundamental Research Funds for the Central
Universities of China under Grant PA2019GDPK0076.

ABSTRACT In the field of facial expression recognition, deep learning is extensively used. However, insufficient and unbalanced facial training data in available public databases is a major challenge for improving the expression recognition rate. Generative Adversarial Networks (GANs) can translate faces one-to-one into different expressions, and the generated faces can be used to enhance databases. StarGAN can perform one-to-many translations for multiple expressions. Compared with original GANs, StarGAN can increase the efficiency of sample generation. Nevertheless, the generated faces show defects in essential areas, such as the mouth, and generated side-face images are blurry. To address these limitations, we improved StarGAN by modifying the reconstruction loss and adding the Contextual loss. We also replaced StarGAN’s original generator with the Attention U-Net. We call the resulting model the Contextual loss and Attention U-Net (LAUN) improved StarGAN. The U-shape structure and skip connections in Attention U-Net can effectively integrate the details and semantic features of images, and the network’s attention structure can focus on the essential areas of the human face. The experimental results demonstrate that the improved model can alleviate some flaws in the faces generated by the original StarGAN. Therefore, it can generate better-quality images of persons with different poses and expressions. The experiments were conducted on the Karolinska Directed Emotional Faces database, and the accuracy of facial expression recognition is 95.97%, 2.19% higher than that by using StarGAN. Meanwhile, the experiments were carried out on the MMI Facial Expression Database, and the accuracy of expression recognition is 98.30%, 1.21% higher than that by using StarGAN. Moreover, recognition performs better on databases enhanced by the LAUN improved StarGAN than on databases without enhancement.

INDEX TERMS Facial expression recognition, data enhancement, generative adversarial networks,
self-attention.

I. INTRODUCTION
According to the research of psychologist Mehrabian, in the process of human communication, only 7% of information is transmitted through language. In contrast, the amount of information conveyed by the facial expression is as high as 55% [1]. Therefore, it is necessary to interpret facial expression information effectively. Affective computing proposed by Picard is one of the most important embodiments in human-computer interaction and artificial intelligence [2]. Facial expression recognition is an important means to realize affective computing. At present, facial expression recognition is a task that gets an extensive study and plays an integral part in digital entertainment, medical care, human-computer interactions, etc. Ekman, a famous American psychologist, and his team were the first to apply facial expression recognition to clinical cases [3]. In infant care, facial expression recognition could effectively and timely understand the state of babies [4].

The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Cheng.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 8, 2020

Researchers are increasingly committed to giving computers the ability to perceive, recognize, and respond to human emotions. They are developing wearable computer systems and robots that can actively observe and give feedback [5].

Traditional expression recognition is divided into three stages: preprocessing, feature extraction, and facial expression classification. Extracting the features of expression is the key stage for facial expression recognition. Compared with traditional feature extraction algorithms, the deep learning method can achieve better performance. However, deep neural networks need large-scale data, and the scale of emotion datasets is limited [6], which is a challenge for facial expression recognition. Among existing methods, deep and complex network structures often cause overfitting. Therefore, data enhancement is an effective method to improve the performance of deep neural networks [7]. Traditional image enhancement methods are geometric transformation [8] and color space transformation [9]. Simard et al. [10] proposed to use rotation, translation, and tilt of the original images to increase the number of samples. By combining these three spatial transformations, they obtained a large quantity of samples. Wang et al. [11] increased the number of samples by changing the brightness of original images, which reduced the influence of light on expression recognition to a certain extent. Hu et al. [12] proposed Fusion Features of Center-Symmetric Local Signal Magnitude Pattern for feature extraction. This method effectively improves the rate of expression recognition. Unlike simple pixel and geometric transformations, Generative Adversarial Networks (GANs) can learn the features of the target dataset through their loss function. GANs have a generator and a discriminator, which compete against each other to produce images [13]. Zhu et al. [14] proposed CycleGAN, which combines adversarial loss and cycle consistency loss to translate images. Luan et al. [15] proposed disentangled representation learning GANs for generating faces with different poses. Because GANs can produce images with abundant features, they can be used for data enhancement. Therefore, we used them for data enhancement of facial expressions.

StarGAN [16] is a model that can produce multiple domains of images from one image. Given one facial image, it can generate facial images with different expressions while retaining the identity information of the input image. Therefore, it is a proper way to extend the dataset. However, the images generated by StarGAN still have flaws in some areas of the face, especially the face image with the 180-degree side face. To improve the quality of the generated pictures, we improved StarGAN by changing the reconstruction loss and adding the Contextual loss. Moreover, we replaced StarGAN’s generator with the Attention U-Net network to improve the model’s performance. The experiments showed that the Contextual loss and Attention U-Net (LAUN) improved StarGAN can generate higher quality facial images. In the existing facial expression recognition methods, many experiments are based on the frontal face, while the non-frontal face experiments are more challenging. There are two difficulties in data enhancement of non-frontal faces. One is that rotation of the head causes part of the face to be occluded, so the identity information of the face is lost [17]. The other is that the shape of the face texture presents nonlinear distortion as the posture changes, which causes confusion between different people [18]. In our experiments, we showed that the generated facial images still retain the same identity as the original images. We also showed that the model can still produce facial images with different expressions when the people in the original images have different poses, not only for frontal facial images. After enhancing datasets with generated facial images, the generated pictures and the original images were input into VGG16 [19] for emotion classification. VGG16 is a convolutional neural network model proposed by Simonyan and Zisserman. Because of its excellent performance in classification, VGG16 is widely used in various classification tasks. With our model, the accuracy of expression recognition in VGG16 is improved over using only the original pictures.

In brief, our contributions are as below.
1) We proposed a model based on StarGAN, which is used to generate facial images with different emotions. Meanwhile, it can retain the identity of a person.
2) We changed the reconstruction loss and added the Contextual loss, which can improve the quality of generated facial images.
3) We changed StarGAN’s generator and replaced it with the Attention U-Net network to enhance performance.
4) The experiments showed that the model is able to produce a human face with multiple emotions and different poses. Meanwhile, the experimental results showed that our improvement methods can alleviate some of the flaws in the face generated by the original StarGAN.

II. RELATED WORK
A. FACIAL EXPRESSION RECOGNITION
Facial expression has been widely adopted in various fields. In image retrieval, more and more applications implement retrieval through an image semantic index [20]–[22]. Emotional semantics are included in image retrieval, and facial expression is the primary manifestation of image semantics. Traditional facial expression recognition algorithms need to extract features of facial expression images manually. Dennis Gabor invented the Gabor wavelet [23], which is the application of Fourier transform in information theory. Gabor features describe the relationship between pixels according to the filtered image results, which greatly reduces the influence of illumination. It is also a common feature extraction method. Ojala et al. [24] proposed the Local Binary Pattern (LBP), which is insensitive to noise. The LBP is also easy to calculate and has excellent performance, which has led subsequent researchers to carry out many improvements and optimizations. The Histogram of Oriented Gradient [25] is based on the local feature descriptor. Because of its low computational complexity and strong


feature description ability, it is widely used in facial expression feature extraction research.

Facial expression recognition is a complex task for intelligence computation [26]. In machine learning, many methods can be used to classify facial expression images. The AdaBoost algorithm trains several different weak classifiers for the training set and combines the weak classifiers into strong classifiers with strong classification ability [27]. Xing et al. [28] used the AdaBoost classifier to classify the texture features of facial expressions and achieved good experimental results. Prabhakar et al. [29] applied the Support Vector Machine (SVM) model to classify facial expressions. Compared with the AdaBoost method, the SVM model was shown to have certain advantages in solving the classification problem of small samples.

In recent years, deep learning has become a popular method because of its good performance [30]–[32]. Compared to the manual feature extraction method, deep learning can automatically learn features and achieve a higher recognition rate in facial expression recognition. Yu et al. [33] used 9-layer Convolutional Neural Networks to carry out experiments on facial expression datasets, which greatly improved the recognition effect compared with traditional hand-crafted features. Kuo et al. [34] proposed an expression recognition architecture based on image frames and image sequences, which reduced the number of network parameters. Meanwhile, a hybrid illumination enhancement scheme was proposed to alleviate the overfitting problem in the training process. Wang et al. [35] proposed the Self-Cure Network, which suppresses the uncertainties efficiently and avoids model overfitting on uncertain facial images.

However, as networks become deeper and the number of parameters increases, networks tend to overfit. Data enhancement is a significant way to solve database shortage and imbalance. Zhan et al. [36] presented a model for facial expression generation, relying on the Conditional Adversarial Autoencoder (CAAE) [37]. Ding et al. [38] proposed ExprGAN, which can edit facial expressions and control the intensity of expression generation. Wang et al. [39] proposed Comp-GAN, which can transform facial expression and posture according to different input images. Bozorgtabar et al. [40] proposed a network named ExprADA, which can realize image-to-image conversion and achieve good performance in expression image generation. StarGAN has achieved excellent performance in face feature transformation [41]. Therefore, our model is based on StarGAN, which has been verified to retain the identity characteristics of a person in different postures while generating different expressions. However, we found that there are still some flaws in the face images generated by StarGAN. To obtain higher quality generated pictures, we improved StarGAN’s loss function and changed StarGAN’s generator.

B. GENERATIVE ADVERSARIAL NETWORKS
GANs have been extensively researched in recent years. The original GAN model consists of a generator and a discriminator. The training is carried out as a min-max two-player game between the two modules. The generator learns to generate images that the discriminator finds difficult to tell apart from real ones, while the discriminator learns to distinguish between real and fake images. Conditional GANs [42] add conditional information to GANs to control the generation of pictures, which enables the model to achieve supervised learning. The DCGAN [43] uses Convolutional Neural Networks to replace the multi-layer perceptron in the discriminator and generator. Meanwhile, in order to make the whole network differentiable, the pooling layer in the network is removed. The fully connected layer is replaced by the global pooling layer to reduce the computation. Karras et al. [44] proposed StyleGAN, an effective model to generate high-resolution images and transform style. Donahue et al. [45] proposed BigBiGAN, which can generate large-scale high-definition images. Meanwhile, BigBiGAN can also be used in unsupervised learning. At present, GANs have made remarkable achievements in the field of image generation. GANs can effectively generate realistic images, making them widely used in image translation, data enhancement, style conversion, and other fields.

III. PROPOSED METHOD
A. STAR GENERATIVE ADVERSARIAL NETWORKS
StarGAN can translate an image into images of multiple domains. Given both an image and a label as input, the generator learns to translate the image into the corresponding domain flexibly. The structure of StarGAN is shown in Fig. 1.

FIGURE 1. The structure of StarGAN.

1) ADVERSARIAL LOSS
The discriminator makes a distinction between real and fake images. Image $x$ and target domain label $c$ are input into the generator to generate the output image $G(x, c)$. A min-max objective function makes the fake pictures look like real ones.

$$L_{adv} = \mathbb{E}_{x}[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))] \tag{1}$$
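As a rough illustration of Eq. (1), the following Python sketch estimates the adversarial loss from two lists of discriminator outputs. The function name `adversarial_loss`, the `eps` stabilizer, and the sample probabilities are illustrative assumptions and are not taken from the paper.

```python
import math


def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Sample estimate of Eq. (1).

    d_real: values of D_src(x) for real images, each in (0, 1).
    d_fake: values of D_src(G(x, c)) for generated images, each in (0, 1).
    eps is a hypothetical stabilizer that keeps log() away from zero.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Discriminator that separates real from fake well: real scores high, fake scores low.
confident = adversarial_loss([0.9, 0.8], [0.1, 0.2])

# Generator that fools the discriminator: fake images also score high.
fooled = adversarial_loss([0.9, 0.8], [0.9, 0.8])
```

In this toy example `fooled` is lower than `confident`, which matches the roles in the min-max game: fooling the discriminator pushes the objective down, while a well-separating discriminator pushes it up.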


$D_{src}$ is the output of the classifier that judges whether the picture is real or fake. The generator tries to minimize $L_{adv}$, while the discriminator tries to maximize it.

2) CLASSIFICATION LOSS
The discriminator has two tasks. One is to distinguish real images from fake ones, and the other is to determine the category of the picture. The classification loss of real images trains the discriminator to find the true label of real images. The formula for classifying real pictures is as follows:

$$L_{cls}^{r} = \mathbb{E}_{x,c'}[-\log D_{cls}(c'|x)] \tag{2}$$

$c'$ is the expression label of the input image $x$, and $D_{cls}(c'|x)$ is the result of classifying real images by the discriminator. Meanwhile, the generator tries to produce fake images that can be classified as the target label $c$. Specifically, the equation for classifying generated pictures is defined as:

$$L_{cls}^{f} = \mathbb{E}_{x,c}[-\log D_{cls}(c|G(x, c))] \tag{3}$$

$D_{cls}(c|G(x, c))$ is the result of the discriminator classifying the generated picture. The discriminator should classify the real images correctly, so it minimizes $L_{cls}^{r}$. The generator tries to minimize $L_{cls}^{f}$ so that the generated images are categorized into the label $c$.

3) RECONSTRUCTION LOSS
The reconstruction loss makes reconstructed pictures similar to the original pictures. The generator must preserve the content of the original pictures, and the generated pictures must retain the same content. Therefore, the reconstruction loss is as follows:

$$L_{rec} = \mathbb{E}_{x,c,c'}[\|x - G(G(x, c), c')\|_{1}] \tag{4}$$

The generator takes image $x$ and label $c$ and generates the target image $G(x, c)$. Then $G(x, c)$ and the original label $c'$ are passed through the generator again to produce the reconstructed image $G(G(x, c), c')$. The generator minimizes the L1 loss between $x$ and this reconstruction.

4) FULL OBJECTIVE
The final training loss function has two parts. The loss function of the discriminator is as follows:

$$L_{D} = -L_{adv} + \lambda_{cls} L_{cls}^{r} \tag{5}$$

The loss function of the generator is as follows:

$$L_{G} = L_{adv} + \lambda_{cls} L_{cls}^{f} + \lambda_{rec} L_{rec} \tag{6}$$

$\lambda_{cls}$ and $\lambda_{rec}$ are the hyperparameters that balance the loss terms.

B. CONTEXTUAL LOSS
Training Convolutional Neural Networks for image transformation relies on loss functions. Calculating the difference between corresponding pixels is a common way to compare images. However, it is not always a good way to evaluate the similarity between images. When the original image and the generated image are aligned in space, a loss function that calculates the difference between pixels can achieve good results. However, if two images are not aligned in space, image generation with such a traditional loss function is not ideal. Unlike the common methods, the Contextual loss [46] compares image features to judge the similarity of images. Some deep neural networks can extract a series of high-dimensional features, which can be used to compare the similarity between images. VGG19 [19] is a classification Convolutional Neural Network, which has the ability to extract image features. We use VGG19 as the feature extraction network. Cosine distance can measure the similarity between different points. The cosine distance between $x_i$ and $y_j$ is defined as follows:

$$d_{ij} = 1 - \frac{(x_i - u_y) \cdot (y_j - u_y)}{\|x_i - u_y\|_{2}\,\|y_j - u_y\|_{2}} \tag{7}$$

where $u_y = \frac{1}{N}\sum_{j} y_j$. The distance is then normalized:

$$\tilde{d}_{ij} = \frac{d_{ij}}{\min_{k} d_{ik} + \varepsilon} \tag{8}$$

$\varepsilon$ is a tiny constant. The distance $\tilde{d}_{ij}$ is transformed into a similarity by exponentiation:

$$w_{ij} = \exp\left(\frac{1 - \tilde{d}_{ij}}{h}\right) \tag{9}$$

where $h > 0$ is a bandwidth parameter. The contextual similarity between features is the normalized, scale-invariant similarity:

$$CX_{ij} = \frac{w_{ij}}{\sum_{k} w_{ik}} \tag{10}$$

From the above formulas, we get the similarity between feature $y_j$ and feature $x_i$. For each feature point $y_j$ in the target image $y$, we traverse each feature point $x_i$ in the image $x$ to find the feature $x_i$ most similar to $y_j$. Finally, the contextual similarity of the two images is obtained by averaging these best-match similarities over all $y_j$. The formula is as follows:

$$CX(x, y) = CX(X, Y) = \frac{1}{N}\sum_{j}\max_{i} CX_{ij} \tag{11}$$

$x$ and $y$ are the images to be compared, and $X$ and $Y$ are the feature maps of the two images. $x_i$ and $y_j$ are the points in the two feature maps, respectively.

Finally, we get the loss function. The features are extracted from the two images, and the contextual similarity is obtained by calculating the above formulas. The contextual loss is as follows:

$$L_{CX}(x, y, l) = -\log\left(CX(\Phi_{l}(x), \Phi_{l}(y))\right) \tag{12}$$

$\Phi_{l}(x)$ and $\Phi_{l}(y)$ are the feature maps extracted by VGG19, and $l$ is the index of the VGG19 layer from which they are taken.
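The chain of Eqs. (7)-(12) can be sketched in plain Python as follows. This is a minimal illustration that operates on small lists of feature vectors rather than on VGG19 feature maps; the function name `contextual_loss`, the defaults `h=0.5` and `eps=1e-5`, and the toy features are assumptions chosen for the example, not values reported in the paper.

```python
import math


def contextual_loss(X, Y, h=0.5, eps=1e-5):
    """Contextual loss between two feature sets, following Eqs. (7)-(12).

    X, Y: lists of feature vectors (lists of floats) with the same dimension.
    h (bandwidth) and eps are hypothetical defaults for this sketch.
    """
    n, dim = len(Y), len(Y[0])
    # u_y: mean of the target features, used to center both sets (Eq. 7).
    mu = [sum(y[d] for y in Y) / n for d in range(dim)]

    def center_and_normalize(v):
        c = [v[d] - mu[d] for d in range(dim)]
        norm = math.sqrt(sum(t * t for t in c)) or eps
        return [t / norm for t in c]

    Xc = [center_and_normalize(x) for x in X]
    Yc = [center_and_normalize(y) for y in Y]

    cx = []
    for xi in Xc:
        # Eq. (7): cosine distance from x_i to every y_j.
        d = [1.0 - sum(a * b for a, b in zip(xi, yj)) for yj in Yc]
        # Eq. (8): normalize by the smallest distance in the row.
        d_min = min(d)
        d_tilde = [v / (d_min + eps) for v in d]
        # Eq. (9): turn distances into similarities.
        w = [math.exp((1.0 - v) / h) for v in d_tilde]
        # Eq. (10): normalize similarities over the row.
        total = sum(w)
        cx.append([v / total for v in w])

    # Eq. (11): for each y_j keep the best-matching x_i, then average.
    similarity = sum(max(row[j] for row in cx) for j in range(n)) / n
    # Eq. (12): negative log of the contextual similarity.
    return -math.log(similarity)


Y = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
loss_same = contextual_loss(Y, Y)                    # identical feature sets
loss_shuffled = contextual_loss(Y[::-1], Y)          # same features, different order
loss_diff = contextual_loss([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]], Y)
```

Because the loss treats each image as an unordered set of features, `loss_shuffled` stays near zero just like `loss_same`, while `loss_diff` is clearly larger since two of the three target features have no close match.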


FIGURE 2. The network architecture of Attention U-Net.

C. ATTENTION U-NET
Attention U-Net [47] originates from the U-Net network [48] and adds an attention mechanism to it. The model is used for image segmentation. It uses attention models to suppress the uncorrelated areas of the input image and highlight the features that are important for a specific task. Fig. 2 shows the model structure of Attention U-Net. The network downsamples the image four times for feature extraction, and upsampling adopts the skip connection at the same stage. The structure of the Attention Gate is shown in Fig. 3.

FIGURE 3. The structure of Attention Gate.

In Fig. 3, $a$ is the input feature map and $\alpha$ is the attention coefficient computed by the Attention Gate. $g$ is the gating signal gathered from a wide range, and it determines the focus regions for each pixel. The Attention Gate’s output is $b$, the element-wise product of the input feature map and the attention coefficients. $a$ is the feature map from the corresponding downsampling layer, and $g$ is the feature map of the previous layer in the upsampling process. $D_a$ and $D_g$ are 1 × 1 convolution kernels: $D_a$ convolves $a$ and $D_g$ convolves $g$. The convolution results are added, and the sum passes through ReLU, convolution, sigmoid, and resampling. The result is the attention coefficient $\alpha$, whose size is consistent with $a$. The range of $\alpha$ is 0 to 1, which makes the values of vital regions in the feature map larger and the values of unimportant regions smaller. The output of the Attention Gate is obtained by multiplying $a$ by the attention coefficient $\alpha$. The Attention Gate uses additive attention for fusion and calculates a single scalar attention value for each pixel vector. It fuses the structure information of the downsampling layer and the texture information of the current layer. Then, sigmoid normalization is used to obtain the regions with a strong correlation. The resulting $\alpha$ can identify significant image regions and prune feature responses to retain the activations associated with a specific task. $\alpha$ is multiplied with the current layer to emphasize the characteristics of the significant region in this layer. The Attention Gate is added to the skip connection, and its output is added to the feature map in upsampling. Therefore, one input of the Attention Gate is the input of the skip connection, which is the downsampling feature map of the corresponding layer. The other input of the Attention Gate is the feature map of the previous layer in upsampling, as shown in Fig. 2.

The Attention Gate learns to focus on the significant areas of the input, and contextual information passes through the Attention Gate. These Attention Gates generate soft region proposals that implicitly highlight salient features [49]. Meanwhile, they do not bring a large computational cost, nor do they need many model parameters. In brief, Attention U-Net is an improvement of the U-Net model, which increases the sensitivity of the model to foreground pixels without sophisticated heuristics. The model extracts image features on multiple image scales and uses skip connections, which improves performance.

D. IMPROVED RECONSTRUCTION LOSS
We need generated images that keep some crucial features of the original images, such as the identity information of the person. Reconstruction loss measures the difference between the authentic images and generated images through pixel comparison. In the original StarGAN, the L1 loss function is used as the reconstruction loss. Nevertheless, it is difficult for the L1 loss function to tell tiny shifts in images apart from significant defects in critical areas. Besides, the calculation of the L1 loss function is relatively complicated, and there may be multiple optimal solutions. Therefore, we replace the L1 loss function with the L2 loss function. The L2 loss function calculates the square of the difference between two pixels, so it amplifies points where there is a big difference. Meanwhile, a tiny movement between the original images and generated images produces only a small difference, and after squaring it does not grow as much as a massive difference between two corresponding pixels. In terms of gradient solution and convergence, the L2 loss function is also better than the L1 loss function. The L2 loss function is differentiable everywhere, and its gradient changes dynamically, so it can converge quickly [50]. Therefore, in our experiment, the L2 loss function is used as the reconstruction loss instead of the L1 loss function. The formula is as follows:

$$L_{rec}' = \mathbb{E}_{x,c,c'}[\|x - G(G(x, c), c')\|_{2}] \tag{13}$$

To improve the quality of the generated image and its similarity with the original images, the Contextual Loss is added to the loss function of StarGAN. The Contextual Loss can help StarGAN retain the essential features of the original


image and measure the similarity of images by comparing the pictures’ high-dimensional features. The Contextual Loss is an effective and simple solution when the original image and the generated image are not entirely aligned. It compares regions with semantic similarity and considers the context of the whole image. The core idea is to treat an image as a set of features, measure the similarity between images by the similarity between features, and ignore the spatial position of features. With this method, the generated image need not be spatially consistent with the original image. The result of feature comparison and calculation is the similarity of the two images. Contextual loss uses context similarity to represent image similarity. Context similarity compares the cosine distance between two points in two feature maps. Feature maps with a more similar distribution get higher similarity. When most features in one image have similar features in another, the two images are considered to be similar. On the contrary, when most features in one image do not have similar features in the other image, the two images are considered to be not similar. The features of the image can be extracted by a Convolutional Neural Network. In this paper, VGG19 is used to extract the image features. Compared with pixel-based loss, the feature-based loss is more robust to the spatial location of the pixel. Finally, the modified loss function is as follows:

$$L_{D} = -L_{adv} + \lambda_{cls} L_{cls}^{r} \tag{14}$$

$$L_{G} = L_{adv} + \lambda_{cls} L_{cls}^{f} + \lambda_{rec} L_{rec}' + \lambda_{CX} L_{CX}(x, y, l) \tag{15}$$

$\lambda_{cls}$, $\lambda_{rec}$, and $\lambda_{CX}$ are the hyperparameters that balance the loss terms.

FIGURE 4. The network architecture of LAUN improved StarGAN.

E. IMPROVED GENERATOR
We use Attention U-Net as the generator in our model. The structure of LAUN improved StarGAN is shown in Fig. 4. Downsampling consists of repeated applications of two 3 × 3 convolutions, each followed by a ReLU, and a 2 × 2 max pooling operation. The generator downsamples input images to get compressed features. Then, the generator upsamples the feature map to get generated images. The shallow layers capture some simple features from pictures in the feature extraction stage. The deep layers obtain more image information because of the increase of the receptive field and multiple convolution operations. Upsampling has a similar structure to downsampling, and it ensures that the final restored feature map integrates low-level features. The features at different scales are also fused for multi-scale prediction and deep supervision. Upsampling four times also restores edge information more precisely. Moreover, the Attention U-Net model adds the attention mechanism, which can ignore uncorrelated areas in the image and focus on the specific important features. The feature maps from the skip connection and the previous level enter the Attention Gate to activate vital feature areas. Self-attention techniques have been put forward to eliminate the dependency on external gating information. The model also adopts a soft attention mechanism, which is probabilistic and utilizes standard back-propagation without the need for Monte Carlo sampling [51]. Fig. 5 shows the difference between images from StarGAN and LAUN improved StarGAN. The face on the left is generated by StarGAN, and the face on the right is generated by LAUN improved StarGAN. We can see that the quality of the mouth generated by LAUN improved StarGAN is better than that of the mouth generated by StarGAN.

FIGURE 5. The images generated by StarGAN and LAUN improved StarGAN.

IV. EXPERIMENT
In this section, we describe experiments on StarGAN and LAUN improved StarGAN. In our experiments, we used StarGAN to generate faces with different emotions. After that, we improved StarGAN to generate higher quality images. Moreover, we used VGG16 as a classifier to get the accuracy of expression classification. The learning rate is 0.0001, and the batch size is 16. The resolution of input and output images is 128 × 128 pixels, and the number of training steps is 200000. Our models are trained with the Adam algorithm [52]. Adam is a common optimization algorithm that can effectively reduce the training time of deep neural models. The Adam hyperparameters in our experiments are β1 = 0.5 and β2 = 0.999. The device is the NVIDIA GTX2060 GPU.

A. DATABASES
The experiments are conducted on the KDEF database [53] and the MMI Facial Expression database [54]. The experimental results demonstrate the validity of our model.

1) THE KDEF DATABASE
The KDEF database has 4900 images of facial emotions, and the database contains 70 individuals. Each person displays


seven different emotions with afraid, angry, disgusted, happy, TABLE 2. The accuracy of different loss function and genereator on data
enhancement (KDEF database).
neutral, sad, and surprised. Each emotion has 700 images.
Also, the database contains five pan angles: −90 degrees,
−45 degrees, 0 degrees, 45 degrees, 90 degrees.

2) THE MMI FACIAL EXPRESSION DATABASE


The MMI Facial Expression Database aims to provide a lot of
visual data for the facial expression analysis community. The TABLE 3. The score of FID by different models on KDEF database.

database consists of over 2900 videos and high-resolution


images, which have 75 subjects. MMI Facial Expression
Database has six expressions: anger, disgust, fear, happiness,
sadness, and surprise. We selected 205 videos with expression
tags as the training set.

B. IMPLEMENTATION DETAIL
1) EXPERIMENT ON THE KDEF DATABASE
There are seven emotions in the KDEF database, and each expression has 700 pictures. Before the experiment, a small number of poor-quality images were removed, leaving 4829 images. The ratio of the training set to the test set is 8:2, so the training set has 3863 images and the test set has 966 images. The training images were fed into StarGAN, and seven different emotions were generated from each image, giving 27041 generated images. With the original training images added, the expanded dataset has 30904 images in total.

TABLE 1. The accuracy of different loss functions on data enhancement (KDEF database).

Next, the face images and target label information of the training dataset were put into StarGAN to generate face images with the target expressions. Our experiment changed the reconstruction loss function of StarGAN and added Contextual loss to the loss function. Finally, the generated images were put into VGG16 for training. We tested the model with the test dataset to obtain the facial expression recognition rates. The accuracies on the enhanced dataset are listed in TABLE 1. TABLE 1 shows that the recognition accuracy of VGG16 on the KDEF dataset is 93.78% before data enhancement. After using the original StarGAN for data enhancement, the accuracy of VGG16 is 94.00%, an increase of 0.22%. After adding only L2 loss or only contextual loss to StarGAN, the accuracy rates are 94.31% and 94.10%, respectively. After adding both L2 loss and contextual loss to StarGAN, the accuracy rate is 95.03%, which is 1.03% higher than the original StarGAN. In addition, we replaced the generator in the original model with the Attention U-Net network and generated face images to enhance the data. We call it LAUN improved StarGAN. The expression recognition results are shown in TABLE 2. Finally, the accuracy rate of VGG16 after data enhancement by LAUN improved StarGAN is 95.97%, which is 0.94% higher than that with only loss function modification, 1.97% higher than the original StarGAN, and 2.19% higher than that without data enhancement.

To further illustrate the quality of the generated pictures, we used Fréchet Inception Distance (FID) [55] (lower is better) as the evaluation metric for visual quality. FID is a common metric for evaluating images generated by GANs. It expresses the quality and diversity of generated images by comparing the feature statistics of generated images with those of real images. The experimental results are shown in TABLE 3, where LAUN improved StarGAN gets the best result. Compared with StarGAN, our method effectively improves the quality of generated images. In our experiments, it took about 15 hours to train StarGAN and about 10 hours to train LAUN improved StarGAN. In the test phase, each model took about 5 minutes to generate all images. The result shows that our model does not increase the time cost compared to the original StarGAN.

After the original StarGAN generated images, the enhanced dataset was put into VGG16 for training. The experimental results show that the model retains the person's identity information and can still generate different expressions for side faces. However, some parts of the images generated by the original StarGAN have defects, mainly concentrated around the mouth. In the experiment, we changed the loss function of StarGAN: the reconstruction loss was changed from the L1 loss to the L2 loss, and we added Contextual Loss. With these changes, the poor mouth generation quality was alleviated. Meanwhile, we used the Attention U-Net instead of the original generator. The U-Net integrates features, and the Attention Gate suppresses irrelevant areas and highlights important ones. Therefore, the quality of image generation is further improved. After adding Contextual Loss and Attention U-Net to the model, the quality of image generation improved, especially for the side faces. The resulting images are shown in Fig. 6. In brief, LAUN improved StarGAN


can produce images with better quality, and the accuracy of facial expression classification after data enhancement is also higher. TABLE 4 shows the comparison results between our model and other models.

FIGURE 6. Images generated by StarGAN using different models based on the KDEF database.

TABLE 4. Comparison of experimental results with other methods on the KDEF database.

2) EXPERIMENT ON THE MMI DATABASE
To further verify the validity of the model, we performed experiments on the MMI Database. We selected 205 videos with expression labels from this dataset. From each video, 20 frames were selected, covering the calm part at the beginning of the video, the climax part in the middle, and the calm part at the end. A total of 4100 pictures were obtained. The division ratio is 8:2, so the training set has 3280 pictures and the test set has 820 pictures. Each picture in the training set can generate different pictures according to different expressions. One image therefore generates six pictures, for a total of 19680 generated pictures. After putting the original images and the generated images together, the entire training set has 22960 images.

TABLE 5. FID scores on the MMI database.

We directly put the collated dataset into the StarGAN model for data generation. The generated pictures and the original training set were put together to form the training dataset. The resulting images are shown in Fig. 7. VGG16 is still the classifier in this experiment. The trained classifier was tested to get the final recognition result. After modifying the reconstruction loss function and adding Contextual loss, we experimented again to get the generated pictures and the accuracy of expression classification. Finally, we changed the generator of the model to Attention U-Net to obtain the generated pictures and the accuracy of expression classification. We used FID to compare the quality of the generated pictures, as shown in TABLE 5. It shows that LAUN improved StarGAN gets the best result. As on the KDEF database, it took about 15 hours to train StarGAN and about 10 hours to train LAUN improved StarGAN. In the test phase, each model took about 4 minutes to generate all images. The result shows that our model does not increase the time cost compared to the original StarGAN. Expression recognition rates are summarized in TABLE 6. The accuracy of VGG16 is 97.09% on the original MMI dataset and 97.46% with the original StarGAN, an increase of 0.37%. After adding L2 loss and contextual loss to StarGAN, the accuracy rate is 97.58%, which is 0.12% higher than the original StarGAN. The accuracy of the final LAUN improved StarGAN is 98.30%, which is 0.84% higher than the original StarGAN. Compared with no enhancement, the recognition rate of LAUN improved StarGAN is improved by 1.21%. After the dataset is enhanced, the expression recognition rate improves, and the images generated by the improved model are also of better quality than those of the original model. TABLE 7 shows the comparison results between our model and other models.

FIGURE 7. Images enhanced in different ways based on the MMI Database.

TABLE 6. The accuracy on the MMI database.

TABLE 7. Comparison of experimental results with other methods on the MMI database.

V. CONCLUSION
This paper presents the LAUN improved StarGAN for data enhancement. At present, expression datasets still contain insufficient and unbalanced samples. To alleviate this problem, GANs can be applied to enhance expression datasets. In addition, StarGAN can generate face images with different expressions to improve the accuracy of expression recognition. Furthermore, to address the defects caused


by StarGAN, we changed the reconstruction loss function of StarGAN and introduced the Contextual Loss. At the same time, we also replaced the model's generator with the Attention U-Net to improve the quality of image generation. Attention U-Net pays more attention to the important areas of pictures. The experimental results show that our model improves the quality of generated images and the classifier accuracy in emotion recognition. Meanwhile, the model can learn the identity and expression representations explicitly. Besides, the experiment also verifies that our model can generate the expression of the same person under different postures.

REFERENCES
[1] A. Mehrabian, “Communication without words,” Commun. Theory, vol. 6, pp. 193–200, 2008.
[2] R. W. Picard, Affective Computing. Cambridge, MA, USA: MIT Press, 2000.
[3] R. Ekman, What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). New York, NY, USA: Oxford Univ. Press, 1997.
[4] L. Ma, W. Chen, X. Fu, and T. Wang, “Emotional expression and micro-expression recognition in depressive patients,” Chin. Sci. Bull., vol. 63, no. 20, pp. 2048–2056, Jul. 2018.
[5] A. C. Cruz, B. Bhanu, and N. S. Thakoor, “Vision and attention theory based sampling for continuous facial emotion recognition,” IEEE Trans. Affect. Comput., vol. 5, no. 4, pp. 418–431, Oct. 2014.
[6] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei, “The unreasonable effectiveness of noisy data for fine-grained recognition,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 301–320.
[7] C. Han, H. Hayashi, L. Rundo, R. Araki, W. Shimoda, S. Muramatsu, Y. Furukawa, G. Mauri, and H. Nakayama, “GAN-based synthetic brain MR image generation,” in Proc. IEEE 15th Int. Symp. Biomed. Imag. (ISBI), Apr. 2018, pp. 734–738.
[8] W. Li, M. Li, Z. Su, and Z. Zhu, “A deep-learning approach to facial expression recognition with candid images,” in Proc. 14th IAPR Int. Conf. Mach. Vis. Appl. (MVA), May 2015, pp. 279–282.
[9] N. Kaur and N. Bawa, “Algorithm for fuzzy based compression of gray JPEG images for big data storage,” in Proc. 2nd Int. Conf. Contemp. Comput. Informat. (ICI), Dec. 2016, pp. 518–523.
[10] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in Proc. Icdar, vol. 3, 2003, pp. 1–6.
[11] S. Wang, “Facial affect detection using convolutional neural networks,” Stanford Univ., Stanford, CA, USA, Tech. Rep., 2016.
[12] M. Hu, C. Yang, Y. Zheng, X. Wang, L. He, and F. Ren, “Facial expression recognition based on fusion features of center-symmetric local signal magnitude pattern,” IEEE Access, vol. 7, pp. 118435–118445, 2019.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223–2232.
[15] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose-invariant face recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1415–1424.
[16] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8789–8797.
[17] Z. Zhu and Q. Ji, “Robust real-time face pose and facial expression recovery,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2006, pp. 681–688.
[18] C. Ding and D. Tao, “A comprehensive survey on pose-invariant face recognition,” ACM Trans. Intell. Syst. Technol., vol. 7, no. 3, pp. 1–42, Apr. 2016.
[19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[20] R. Hong, L. Li, J. Cai, D. Tao, M. Wang, and Q. Tian, “Coherent semantic-visual indexing for large-scale image retrieval in the cloud,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4128–4138, Sep. 2017.
[21] Q. Kai, “A painting image retrieval approach based on visual features and semantic classification,” in Proc. Int. Conf. Smart Grid Electr. Automat. (ICSGEA), Aug. 2019, pp. 195–198.
[22] L. Huang, C. Bai, Y. Lu, S. Chen, and Q. Tian, “Adversarial learning for content-based image retrieval,” in Proc. IEEE Conf. Multimedia Inf. Process. Retr. (MIPR), Mar. 2019, pp. 97–102.
[23] D. Gabor, “Theory of communication. Part 1: The analysis of information,” J. Inst. Electr. Eng. III, Radio Commun. Eng., vol. 93, no. 26, pp. 429–441, Nov. 1946.
[24] T. Ojala, M. Pietikainen, and D. Harwood, “Performance evaluation of texture measures with classification based on kullback discrimination of distributions,” in Proc. 12th Int. Conf. Pattern Recognit., vol. 1, 1994, pp. 582–585.
[25] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2005, pp. 886–893.
[26] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39–58, Jan. 2009.
[27] W. Hu, J. Gao, Y. Wang, O. Wu, and S. Maybank, “Online AdaBoost-based parameterized methods for dynamic distributed network intrusion detection,” IEEE Trans. Cybern., vol. 44, no. 1, pp. 66–82, Jan. 2014.
[28] Y. Xing and W. Luo, “Facial expression recognition using local Gabor features and AdaBoost classifiers,” in Proc. Int. Conf. Prog. Informat. Comput. (PIC), Dec. 2016, pp. 228–232.
[29] S. Prabhakar, J. Sharma, and S. Gupta, “Facial expression recognition in video using AdaBoost and SVM,” Int. J. Comput. Appl., vol. 104, no. 2, pp. 1–4, Oct. 2014.
[30] F. Ren and Q. Zhang, “An emotion expression extraction method for Chinese microblog sentences,” IEEE Access, vol. 8, pp. 69244–69255, 2020.
[31] F. Ren and Y. Zhou, “CGMVQA: A new classification and generative model for medical visual question answering,” IEEE Access, vol. 8, pp. 50626–50636, 2020.
[32] F. Ren, W. Liu, and G. Wu, “Feature reuse residual networks for insect pest recognition,” IEEE Access, vol. 7, pp. 122758–122768, 2019.
[33] Z. Yu and C. Zhang, “Image based static facial expression recognition with multiple deep network learning,” in Proc. ACM Int. Conf. Multimodal Interact. (ICMI), 2015, pp. 435–442.
[34] C.-M. Kuo, S.-H. Lai, and M. Sarkis, “A compact deep learning model for robust facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 2121–2129.
[35] K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing uncertainties for large-scale facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6897–6906.
[36] F. Zhang, T. Zhang, Q. Mao, and C. Xu, “Joint pose and expression modeling for facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3359–3368.
[37] Z. Zhang, Y. Song, and H. Qi, “Age progression/regression by conditional adversarial autoencoder,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5810–5818.
[38] H. Ding, K. Sricharan, and R. Chellappa, “ExprGAN: Facial expression editing with controllable expression intensity,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–6.
[39] W. Wang, Q. Sun, Y. Fu, T. Chen, C. Cao, Z. Zheng, G. Xu, H. Qiu, Y.-G. Jiang, and X. Xue, “Comp-GAN: Compositional generative adversarial network in synthesizing and recognizing facial expression,” in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 211–219.
[40] B. Bozorgtabar, D. Mahapatra, and J.-P. Thiran, “ExprADA: Adversarial domain adaptation for facial expression analysis,” Pattern Recognit., vol. 100, Apr. 2020, Art. no. 107111.
[41] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 172–189.
[42] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1125–1134.
[43] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” 2015, arXiv:1511.06434. [Online]. Available: http://arxiv.org/abs/1511.06434


[44] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4401–4410.
[45] J. Donahue and K. Simonyan, “Large scale adversarial representation learning,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 10541–10551.
[46] R. Mechrez, I. Talmi, and L. Zelnik-Manor, “The contextual loss for image transformation with non-aligned data,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 768–783.
[47] O. Oktay, J. Schlemper, L. Le Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention U-Net: Learning where to look for the pancreas,” 2018, arXiv:1804.03999. [Online]. Available: http://arxiv.org/abs/1804.03999
[48] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2015, pp. 234–241.
[49] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2014, arXiv:1409.0473. [Online]. Available: http://arxiv.org/abs/1409.0473
[50] L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep convolutional neural network for image deconvolution,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1790–1798.
[51] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.
[52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[53] D. Lundqvist, A. Flykt, and A. Öhman, “The Karolinska directed emotional faces - KDEF,” CD ROM Dept. Clin. Neurosci., Psychol. Sect., Karolinska Institutet, Solna, Sweden, Tech. Rep., 1998.
[54] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2005, p. 5.
[55] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6626–6637.
[56] C. L. Kempnich, D. Wong, N. Georgiou-Karistianis, and J. C. Stout, “Feasibility and efficacy of brief computerized training to improve emotion recognition in premanifest and early-symptomatic Huntington’s disease,” J. Int. Neuropsychol. Soc., vol. 23, no. 4, pp. 314–321, Apr. 2017.
[57] A. Ruiz-Garcia, M. Elshaw, A. Altahhan, and V. Palade, “Stacked deep convolutional auto-encoders for emotion recognition from facial expressions,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 1586–1593.
[58] P. Tarnowski, M. Kolodziej, A. Majkowski, and R. J. Rak, “Emotion recognition using facial expressions,” in Proc. ICCS, 2017, pp. 1175–1184.
[59] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-ResNet and the impact of residual connections on learning,” 2016, arXiv:1602.07261. [Online]. Available: http://arxiv.org/abs/1602.07261
[60] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.

XIAOHUA WANG received the Ph.D. degree in computer science from the Hefei Institute of Physical Science, Chinese Academy of Sciences, China, in 2005. She is currently an Associate Professor with the School of Computer and Information, Hefei University of Technology. Her research interests include affective computing, artificial intelligence, and visual pattern recognition.

JIANQIAO GONG is currently pursuing the master’s degree with the Hefei University of Technology. His research interests include image processing and facial expression recognition.

MIN HU received the M.S. degree in industrial automation from Anhui University, China, in 1994, and the Ph.D. degree in computer science from the Hefei University of Technology, Hefei, China, in 2004. She is currently a Professor with the School of Computer and Information, Hefei University of Technology. Her research interests include digital image processing, artificial intelligence, and data mining.

YU GU (Senior Member, IEEE) received the B.E. and D.E. degrees from the Special Classes for the Gifted Young, University of Science and Technology of China, Hefei, China, in 2004 and 2010, respectively. In 2006, he was an Intern with Microsoft Research Asia, Beijing, China, for seven months. From 2007 to 2008, he was a Visiting Scholar with the University of Tsukuba, Tsukuba, Japan. From 2010 to 2012, he was a JSPS Research Fellow with the National Institute of Informatics, Tokyo, Japan. He is currently a Professor and the Dean Assistant with the School of Computer and Information, Hefei University of Technology, Hefei. His current research interests include pervasive computing and affective computing. He is a member of ACM. He was a recipient of the IEEE Scalcom2009 Excellent Paper Award and the NLP-KE2017 Best Paper Award.

FUJI REN (Senior Member, IEEE) received the Ph.D. degree from the Faculty of Engineering, Hokkaido University, Sapporo, Japan, in 1991. He is currently a Professor with the Department of Information Science and Intelligent Systems, Tokushima University, Tokushima, Japan. His current research interests include natural language processing, machine translation, artificial intelligence, language understanding and communication, robust methods for dialogue understanding, and affective information processing and knowledge engineering.
