FACIAL EXPRESSION RECOGNITION USING CNN: A SURVEY
1RAVICHANDRA GINNE, 2KRUPA JARIWALA
1M.Tech student, Department of Computer Engineering, SVNIT, Surat
2Assistant Professor, Department of Computer Engineering, SVNIT, Surat
E-mail: [email protected], [email protected]
Abstract - Facial expression recognition (FER) has become an active research area with many applications in areas such as human-computer interfaces, human emotion analysis, psychological analysis and medical diagnosis. Popular methods for this task are based on geometry and appearance. Deep convolutional neural networks (CNNs) have been shown to outperform traditional methods in various visual recognition tasks, including facial expression recognition. Even though efforts have been made to improve the accuracy of CNN-based FER systems, existing methods may not be sufficient for practical applications. This study presents a general review of CNN-based FER systems and their strengths and limitations, which helps us understand and further improve FER systems.
International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-5, Issue-3, Mar.-2018
http://iraj.in
information. It partitions the image into non-overlapping regions, and each region is subsampled (down-sampled) by a non-linear function such as the maximum, minimum or average. Max pooling, which outputs the maximum value of each region, is the most common down-sampling function.
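As a quick illustration (not code from any of the surveyed papers), non-overlapping 2 x 2 max pooling can be sketched in NumPy as follows; the function name and input values are made up for the example:

```python
import numpy as np

def max_pool(image, size=2):
    """Down-sample by taking the maximum of each non-overlapping size x size region."""
    h, w = image.shape
    # Crop so the image divides evenly into pooling regions
    h, w = h - h % size, w - w % size
    # Reshape so each pooling region occupies axes 1 and 3, then reduce over them
    regions = image[:h, :w].reshape(h // size, size, w // size, size)
    return regions.max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 8, 3, 2],
              [7, 6, 1, 0]])
print(max_pool(x))  # -> [[4 8]
                    #     [9 3]]
```

Replacing `max` with `min` or `mean` in the final reduction gives the other pooling variants mentioned above.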
In the Rectified Linear Units (ReLU) layer, the activation function f(x) = max(0, x) is applied element-wise. ReLU introduces non-linearity into the network. Other functions used to introduce non-linearity are the hyperbolic tangent, the sigmoid, etc.
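The three activations above are one-liners in NumPy; this small comparison (illustrative values only) shows how each transforms the same inputs:

```python
import numpy as np

def relu(x):
    """Element-wise f(x) = max(0, x): negatives are clipped to zero."""
    return np.maximum(0, x)

def sigmoid(x):
    """Squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))     # negatives become 0, positives pass through unchanged
print(np.tanh(z))  # squashed into (-1, 1)
print(sigmoid(z))  # squashed into (0, 1)
```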
The fully connected layer of a CNN is located after several convolutional and pooling layers, and it is a traditional multi-layer perceptron (MLP). All neurons in this layer are fully connected to all activations in the previous layer.
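"Fully connected to all activations" amounts to a single matrix-vector product; a minimal NumPy sketch (the sizes here, 64 flattened features and 7 output classes, are illustrative, not taken from any surveyed model):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal(64)        # flattened output of the conv/pool layers
W = rng.standard_normal((7, 64)) * 0.01   # one weight per (output neuron, input) pair
b = np.zeros(7)                           # one bias per output neuron

# Every output neuron sees every input activation: a dense matrix product
logits = W @ features + b
print(logits.shape)  # (7,)
```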
In the loss layer, different loss functions suitable for different tasks are used. A softmax loss function is used for classifying an image into multiple classes.
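For illustration, the softmax over class scores and its cross-entropy loss can be sketched in NumPy as follows; the seven-class setup mirrors a typical FER label set, and the score values are made up:

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores to probabilities that sum to 1."""
    e = np.exp(logits - logits.max())  # shift by the max for numerical stability
    return e / e.sum()

def softmax_loss(logits, label):
    """Cross-entropy loss: negative log-probability of the true class index."""
    return -np.log(softmax(logits)[label])

# Illustrative scores for 7 expression classes
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0, 0.3, -0.5])
probs = softmax(logits)
print(probs.argmax())              # predicted class: 0
print(softmax_loss(logits, 0))     # low loss when the true class scores highest
```

The loss is small when the network assigns high probability to the correct class and grows without bound as that probability approaches zero, which is what drives training.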
The second network is Yu's structure [2], which contains five convolutional layers, three stochastic pooling layers, and three fully connected layers. The network has two convolutional layers prior to each pooling layer, except for the first layer.
The third network investigated is Kahou's structure [3], which contains three convolutional and pooling layers followed by an MLP of two layers.
The last one is the Caffe-ImageNet structure [10]. It was designed for classifying images from the ImageNet dataset into 1000 classes, but the output nodes were reduced to seven in the baseline CNN approach. Every convolutional and fully connected layer of all four networks is applied with a ReLU layer and a dropout layer.
Five test sets (FER-2013, SFEW2.0, CK+, KDEF and JAFFE) are chosen to perform tests with the four network structures. For the pre-processing of the input image, it was found that the histogram equalization method shows the most reliable performance for all four networks. It was also observed that Tang's network could achieve reasonably high accuracy on histogram-equalized images compared to the other network models. Based on this observation, they suggested Tang's simple network along with histogram equalization as the baseline model for carrying out further research.

C. FER with CNN ensemble
Kuang Liu et al. [11] have proposed a model consisting of many subnets that are structured differently. Each of these subnets is separately trained on a training set, and the subnets are then combined: the output layers are removed, the layers before the last layer are concatenated together, and finally this connected network is trained to output the final expression labels.
They evaluated their model using the Facial Expression Recognition 2013 (FER-2013) dataset. It contains grayscale images of faces of size 48 x 48 pixels. They divided the dataset into an 80% training set and a 20% validation set and trained the subnets separately. Each of the subnets achieved a different accuracy on the dataset. By combining and averaging the outputs of CNNs of different structures, their network reports better performance compared to a single CNN structure.

D. Stacked Deep Convolutional Auto-Encoders for Emotion Recognition from Facial Expressions
Ariel Ruiz-Garcia et al. [12] have studied the effect of reducing the number of convolutional layers and pre-training the deep CNN as a Stacked Convolutional Auto-Encoder (SCAE) in a greedy layer-wise unsupervised fashion for emotion recognition using facial expressions. They incorporated Batch Normalization (BN) for both the convolutional and fully connected layers in their model to accelerate training and improve classification performance.
In the SCAE emotion recognition model, each convolutional layer and its subsequent layers (BN, ReLU and max pooling) are treated as a single block, and an auto-encoder is created for each of these blocks. The first auto-encoder learns to reconstruct raw pixel data; the second auto-encoder learns to reconstruct the output of the first encoder, and so on. Finally, the fully connected layer is trained to associate the output of the last convolutional encoder with its corresponding label.
Their CNN with BN and the SCAE emotion recognizers are trained and tested on the KDEF [13] dataset. Applying the pre-training technique to initialize the CNN's weights using auto-encoders increased their model's performance to 92.53% and dramatically reduced the training time.

OBSERVATIONS

From the above study it is clear that, even though there are numerous approaches to facial expression recognition, new models are still being developed continuously. The reason for this is accuracy. Researchers are continuously trying to improve the accuracy of FER by proposing various architectures and models. They have also adopted other techniques in their architectures, as discussed in Section IV, to address the problem of accuracy. Efforts have been made to reduce the training time for better performance, and ensemble CNNs are used to improve the accuracy of facial expression recognition.

CONCLUSIONS

This paper includes a study of some facial expression recognition systems based on CNNs. Different architectures, approaches, requirements, databases of training/testing images and their performance have been studied here. Each method has its own strengths and limitations. This study helps to understand different kinds of models for facial expression recognition and to develop new CNN architectures with better performance and accuracy.

REFERENCES

[1] Ian J. Goodfellow et al., "Challenges in representation learning: A report on three machine learning contests", in Neural Information Processing, ICONIP, 2013, pp. 117-124.
[2] Z. Yu, C. Zhang, "Image based static facial expression recognition with multiple deep network learning", Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 435-442, November 2015.
[3] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, M. Mirza, "Combining modality specific deep neural networks for emotion recognition in video", Proceedings of the 15th ACM on International Conference on Multimodal Interaction, pp. 543-550, December 2013.
[4] B. Fasel, "Head-pose invariant facial expression recognition using convolutional neural networks", Proceedings of the
Fourth IEEE International Conference on Multimodal Interfaces, 2002, pp. 529-534.
[5] A. Uçar, "Deep Convolutional Neural Networks for facial expression recognition", 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Gdynia, 2017, pp. 371-375.
[6] M. Lyons, J. Budynek, S. Akamatsu, "Automatic classification of single facial images", IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 12, pp. 1357-1362, 1999.
[7] T. Kanade, J. F. Cohn, T. Yingli, "Comprehensive database for facial expression analysis", in: IEEE 4th International Conference on Automatic Face and Gesture Recognition, pp. 46-53, 2000.
[8] M. Shin, M. Kim and D. S. Kwon, "Baseline CNN structure analysis for facial expression recognition", 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), New York, NY, 2016, pp. 724-729.
[9] Y. Tang, "Deep learning using support vector machines", CoRR abs/1306.0239, 2013.
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, T. Darrell, "Caffe: Convolutional architecture for fast feature embedding", Proceedings of the ACM International Conference on Multimedia, pp. 675-678, November 2014.
[11] K. Liu, M. Zhang and Z. Pan, "Facial Expression Recognition with CNN Ensemble", 2016 International Conference on Cyberworlds (CW), Chongqing, 2016, pp. 163-166.
[12] A. Ruiz-Garcia, M. Elshaw, A. Altahhan and V. Palade, "Stacked deep convolutional auto-encoders for emotion recognition from facial expressions", 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, 2017, pp. 1586-1593.
[13] D. Lundqvist, A. Flykt, A. Öhman, The Karolinska Directed Emotional Faces — KDEF, CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institute, pp. 3-5, 1998.