Emotion Recognition in Audio and Video Using Deep Neural Networks
Abstract

Humans are able to comprehend information from multiple domains, e.g. speech, text and visuals. With the advancement of deep learning technology there has been significant improvement in speech recognition. Recognizing emotion from speech is an important aspect, and with deep learning technology emotion recognition has improved in accuracy and latency. There are still many challenges to improving accuracy. In this work, we attempt to explore different neural networks to improve the accuracy of emotion recognition. Among the architectures explored, we find that a (CNN+RNN) + 3DCNN multi-model architecture, which processes audio spectrograms and the corresponding video frames, gives an emotion prediction accuracy of 54.0% among 4 emotions and 71.75% among 3 emotions on the IEMOCAP[2] dataset.

1. Introduction

Emotion recognition is an important ability for good interpersonal relations and plays an important role in effective interpersonal communication. Recognizing emotions, however, can be hard; even for human beings, the ability to recognize emotions varies from person to person.

The aim of this work is to recognize emotions in audio and audio+video using deep neural networks. We attempt to understand bottlenecks in existing architectures and input data, and explore novel ways on top of existing architectures to increase emotion recognition accuracy.

The dataset we used is IEMOCAP[2], which contains 12 hours of audiovisual data of 10 people (5 females, 5 males) speaking in anger, happiness, excitement, sadness, frustration, fear, surprise, other and neutral state.

Our work mainly consists of two stages. First, we build neural networks to recognize emotions in audio by replicating and expanding upon the work of [13]. The inputs of the models are the audio spectrograms converted from the audio of an actor speaking a sentence, and the models give one output, which is the emotion the actor has when saying that sentence. The models only predict one of four emotions, namely happiness, anger, sadness, and the neutral state, which were chosen for comparison with [13]. The deep learning architectures we explored were CNN, CNN+RNN, and CNN+LSTM.

After achieving accuracy on audio comparable with [13], we build models which predict emotions using the audio spectrogram and the video frames of a video, since we believe video frames contain additional emotion-related information that can help us achieve better emotion prediction performance. The inputs of these models are the audio spectrogram and video frames, which are converted and extracted from the sound and images of a video recording of an actor speaking one sentence. The output of the models is still one of the four selected emotions mentioned above. Inspired by the work of [14], we explore a model made of two sub-networks: the first sub-network is a 3D CNN which takes in the video frames, and the second is a CNN+RNN which takes in the audio spectrogram; the last layers of the two sub-networks are concatenated and followed by a fully connected layer that outputs the prediction (a sketch of this fusion is given at the end of this section).

The metric we use for evaluation is the overall accuracy, for both the audio and audio+video models.
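As a rough illustration of this fusion, the following sketch (assuming PyTorch, with placeholder sub-networks and feature sizes rather than our exact configuration) shows how the last layers of the two streams can be concatenated and fed to a fully connected classifier.

# Minimal sketch of the two-stream fusion described above (assumes PyTorch).
# AudioNet (CNN+RNN) and VideoNet (3D CNN) stand in for our sub-networks;
# the feature sizes below are illustrative placeholders, not our exact config.
import torch
import torch.nn as nn

class AudioVideoFusion(nn.Module):
    def __init__(self, audio_net, video_net, audio_feat=256, video_feat=256, num_classes=4):
        super().__init__()
        self.audio_net = audio_net          # CNN+RNN over spectrograms -> (B, audio_feat)
        self.video_net = video_net          # 3D CNN over frame stacks  -> (B, video_feat)
        self.classifier = nn.Linear(audio_feat + video_feat, num_classes)

    def forward(self, spectrogram, frames):
        a = self.audio_net(spectrogram)     # last-layer audio features
        v = self.video_net(frames)          # last-layer video features
        fused = torch.cat([a, v], dim=1)    # concatenate the two streams
        return self.classifier(fused)       # emotion logits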
2. Related Work

Emotion recognition is an important research area that many researchers have worked on in recent years using various methods. Speech signals[9], facial expressions[4], and physiological changes[7] are some of the common signals researchers use to approach the emotion recognition problem. In this work, we use audio spectrograms and video frames to do emotion recognition.

It has been shown that emotion recognition accuracy can be improved by statistical learning of low-level features (frequency & signal power intensity) by the different layers of a deep learning network. Mel-scale spectrograms were demonstrated to be useful for speech recognition in [3]. State-of-the-art speech recognition methods that use linearly spaced audio spectrograms are described in [1] and [5]. Our work on emotion recognition using audio spectrograms follows the approach described in [13]. An audio spectrogram is an image of the audio signal which consists of 3 main components: 1. time on the x-axis; 2. frequency on the y-axis; 3. power intensity on the colorbar scale, which can be in decibels (dB), as shown in Fig. 1. [12] covers machine learning methods to extract temporal features from audio signals. The advantage of such machine learning models is that their training & prediction latency is good, but their prediction accuracy is low. A CNN model that uses audio spectrograms to detect emotion has better prediction accuracy compared to these machine learning models.

Figure 1. Example of audio spectrogram of anger emotion. Original time scale without noise cleanup.

Comparing the CNN networks used in [13] & [14] for training on audio spectrograms, [13] uses a wider kernel window size with zero padding while [14] uses a smaller window size and no zero padding. With a wider kernel window size we see a larger view of the input, which allows for more expressive power. In order to avoid losing features, the use of zero padding becomes important. The zero padding decreases as the number of CNN layers increases in the architecture used in [13]. [14] avoids adding zero padding in order to not consume extra virtual zero-energy coefficients which are not useful for extracting local features. One drawback that we see in [14] is that it does not compare performance between the audio model & the audio+video model. One advantage observed in [14] is that it does not do noise removal on the audio input data, while [13] uses noise removal techniques on the audio spectrogram before training the model.

To achieve better prediction accuracy, a natural progression of emotion recognition using audio spectrograms is to include facial features extracted from video frames. [11] & [6] implement facial emotion recognition using images and video frames respectively, but without audio. [14] & [8] implement neural network architectures which process audio spectrograms & video frames to recognize emotion. Both [14] and [8] implement a self-supervised model for cooperative learning of audio & video models on different datasets. [8] further does supervised learning on the pre-trained model to do classification. The models proposed by [14] and [8] are very similar; both are two-stream models that contain one part for audio data and one part for video data. The only difference is the way the kernel sizes, layer numbers, and input data dimensions are set. These hyperparameters are set differently because their input data is different: [14] tends to use smaller input sizes and kernel sizes because its input images only capture the mouth, which doesn't contain as much information as the images capturing the movement of a whole person used in [8].

3. Dataset & Features

3.1. Dataset

The dataset we use is the IEMOCAP [2] corpus, as it is the best known comprehensively labeled public corpus of emotional speech by actors. [10] used this IEMOCAP dataset to generate state-of-the-art results at the time. IEMOCAP contains 12 hours of audio and visual data of conversations between two persons (1 female and 1 male per conversation, with 5 females and 5 males in total), where each sentence in the conversations is labelled with one emotion: anger, happiness, excitement, sadness, frustration, fear, surprise, other or neutral state.

3.2. Data pre-processing

3.2.1 Audio Data Pre-processing

The IEMOCAP data corpus contains audio wav files of various time lengths, with markings of the actual emotion label for the corresponding time segments. The audio wav files in IEMOCAP are generated at a sample rate of 22KHz. The audio spectrogram is extracted from the wav file using the librosa python package with a sample rate of 44KHz. A 44KHz sample rate was used because, as per the Nyquist-Shannon sampling theorem, in order to fully recover a signal the sampling frequency should be at least twice the signal frequency. The audio signal frequency ranges from 20Hz to 20KHz; hence, 44KHz is a commonly used sampling rate. The spectrograms were generated in 2 segmentations: 1. the original time length of the utterance of a sentence or emotion; 2. each utterance clipped into 3 second clips. Another data segmentation that was done is with noise cleanup and without noise cleanup. We have named these segmentations DS I, DS II, DS III & DS IV. This data segmentation is summarized in Table 1.

1 https://librosa.github.io/librosa/index.html
2 https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem
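The following is a minimal sketch of this spectrogram extraction with librosa; the STFT parameters and figure settings shown are illustrative assumptions, not necessarily the exact values used for our dataset.

# Sketch of generating a linear-frequency spectrogram image from a wav clip
# with librosa (illustrative parameters; not necessarily our exact settings).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def wav_to_spectrogram_image(wav_path, out_png, sr=44100):
    y, sr = librosa.load(wav_path, sr=sr)               # resample to 44KHz
    stft = librosa.stft(y, n_fft=2048, hop_length=512)  # short-time Fourier transform
    s_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)  # power intensity in dB
    fig = plt.figure(figsize=(3, 2), dpi=100)            # roughly 300x200 pixel image
    librosa.display.specshow(s_db, sr=sr, hop_length=512, x_axis='time', y_axis='linear')
    plt.axis('off')                                      # keep only the spectrogram pixels
    plt.savefig(out_png, bbox_inches='tight', pad_inches=0)
    plt.close(fig)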
Dataset segmentation type                Noise cleanup   Name
Original time length of utterance        No              DS I
Clip each utterance into 3 second clips  No              DS II
Original time length of utterance        Yes             DS III
Clip each utterance into 3 second clips  Yes             DS IV

Table 1. Segmentation of input data generation.

intensity relative to where there is actual signal of interest, compared to Fig. 1, where some signal intensity (which is actually the noise) is observed throughout the time scale. The spectrogram images generated are of size 200x300 pixels.

The total count of 3 second audio spectrograms among the 4 different emotions is summarized in Table 2. As observed, the happy emotion count is significantly low, so we duplicated the happy data to reach a total count of 1600. Similarly, the anger emotion count was also duplicated. The sad & neutral data counts were reduced to match 1600 data points for each emotion (a sketch of this balancing is given after Table 2). A total of 6400 images is used for training the model. Data balance is crucial for the model to train well. 400 images from each emotion are used for model validation; the images used for validation are never part of the training set.
Emotion    Count of data points
Happy        786
Sad         1752
Anger       1458
Neutral     2118

Table 2. Count of data points for each emotion before balancing.
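A simple sketch of this balancing step is shown below, assuming the spectrogram file paths are grouped per emotion in a Python dictionary (the variable names are hypothetical).

# Sketch of balancing each emotion class to 1600 spectrograms by duplicating
# under-represented classes and truncating over-represented ones.
# `files_by_emotion` (a dict of emotion -> list of image paths) is hypothetical.
import random

def balance_classes(files_by_emotion, target=1600, seed=0):
    rng = random.Random(seed)
    balanced = {}
    for emotion, files in files_by_emotion.items():
        files = list(files)
        rng.shuffle(files)
        if len(files) >= target:
            balanced[emotion] = files[:target]       # e.g. sad, neutral are truncated
        else:
            reps = files * (target // len(files)) + files[: target % len(files)]
            balanced[emotion] = reps                  # e.g. happy, anger are duplicated
    return balanced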
contrastive loss:

L_{\mathrm{contrastive}} = \frac{1}{N} \sum_{n=1}^{N} \left( L_1^n + L_2^n \right)

where

L_1^n = y^n \, \lVert f_v(v^n) - f_a(a^n) \rVert_2^2,
L_2^n = (1 - y^n) \, \max\!\left( \eta - \lVert f_v(v^n) - f_a(a^n) \rVert_2,\; 0 \right)^2.
N is the number of datapoints in the dataset, v^n and a^n are the video frames and audio spectrogram of the n-th datapoint, f_v and f_a are the video and audio sub-networks, y^n is one if the video frames and audio spectrogram are from the same video and zero otherwise, and η is the margin hyperparameter. ||f_v(v^n) − f_a(a^n)||_2 should be small when the video frames and audio spectrogram are from the same video, and large when they come from different videos. Therefore, by minimizing the contrastive loss, the audio and video models are forced to output similar values when their inputs are from the same video, and very distinct values when they are not. This allows the model to learn the connection between audio and visual elements from the same video.
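A minimal PyTorch-style sketch of this loss is shown below; the margin value and tensor shapes are illustrative assumptions.

# Sketch of the contrastive loss above (assumes PyTorch). y is a float tensor
# that is 1 when the audio/video pair comes from the same video, 0 otherwise;
# margin plays the role of eta.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_feat, video_feat, y, margin=1.0):
    # Euclidean distance between the audio and video embeddings, per pair
    dist = F.pairwise_distance(video_feat, audio_feat, p=2)
    pos = y * dist.pow(2)                                      # matched pairs: pull together
    neg = (1 - y) * torch.clamp(margin - dist, min=0).pow(2)   # mismatched pairs: push apart
    return (pos + neg).mean()                                  # average over the batch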
After pre-training is done, we do supervised learning on the pre-trained model, where the input is the audio spectrogram and video frames of a video and the output is the predicted emotion, as shown in Fig. 4(a). The loss of our model is the cross entropy, with the same formula as in Equation 1.
The second training method is to do supervised training directly on the model, without the pre-training process.
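The sketch below illustrates the two schemes, assuming PyTorch; the data loaders, model object, and epoch counts are hypothetical placeholders rather than our exact training script.

# Sketch of the two training schemes (assumes PyTorch; loaders and models are
# hypothetical placeholders). Scheme 1: contrastive pre-training followed by
# supervised fine-tuning. Scheme 2: supervised training from scratch.
import torch
import torch.nn.functional as F

def pretrain(model, pretrain_loader, optimizer, epochs):
    for _ in range(epochs):
        for spec, frames, same_video in pretrain_loader:
            a = model.audio_net(spec)
            v = model.video_net(frames)
            loss = contrastive_loss(a, v, same_video)   # as sketched earlier
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def train_supervised(model, labeled_loader, optimizer, epochs):
    for _ in range(epochs):
        for spec, frames, label in labeled_loader:
            logits = model(spec, frames)                # fused prediction
            loss = F.cross_entropy(logits, label)       # cross entropy (Equation 1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()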
5.1. Hyperparameters

We started off with prediction on 4 emotions, and most of the work, results and analysis is based on these 4 emotions. Our validation accuracy did not go beyond 54.00%, and we saw overfitting during model training beyond this point. This led us to experiment with various hyperparameters in the optimizer and in the network model layers, e.g. kernel size, size of the input and output in each layer, dropout, batchnorm, data augmentation, and l1 & l2 regularization.

The Adam optimizer was used with a learning rate of 1e-4 to train the model, as this gave the best accuracy. We experimented with 1e-3 & 1e-5 and observed that the model did not train well with these settings. It was observed that a weight decay (the parameter that controls l2 regularization) of 0.01 in the Adam optimizer improved the accuracy by 1%. Weight decay values of 0.005 and 0.02 were also experimented with but did not help. All other parameters were kept at their defaults in the optimizer.
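A minimal sketch of this optimizer setting, assuming PyTorch (the model variable is a placeholder):

# Sketch of the optimizer setting described above (assumes PyTorch).
import torch

optimizer = torch.optim.Adam(
    model.parameters(),      # `model` is a placeholder for the network being trained
    lr=1e-4,                 # 1e-3 and 1e-5 trained poorly in our runs
    weight_decay=0.01,       # l2 regularization; improved accuracy by ~1%
)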
Enabling l1 regularization, data augmentation with rotation and cropping, and batchnorm resulted in no change or improvement in accuracy. This is possibly because the model had already learned all the features it could from the available data given the model architecture.

Tuning of the dropout probability was also experimented with, and optimal values of 0.2 for the last fully connected layer and 0.1 for the dropout in the RNN layer were obtained.

The input & output dimensions in the audio network layers were doubled & quadrupled, which resulted in an accuracy improvement of 1-2%. Increasing the input and output dimensions in the layers also resulted in high memory usage during model training. We attempted to extend this change to the video network, but due to the limited memory of 12GB on the machine we were unable to carry out this experiment. This leads us to strongly believe that there is room for improvement that needs more experimentation on a machine with larger memory. The accuracy improvement is also evident from the different model architectures we used, going from CNN to CNN+RNN and CNN+RNN+3DCNN, which have more parameters in the model to learn features better.

80% of the total data points were used for training and the rest for validation. A batch of 64 data points per iteration is used to train the model. A higher batch size resulted in long iteration times and high memory usage, thus 64 was picked.
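A sketch of this split and batching, assuming PyTorch and a hypothetical dataset object:

# Sketch of the 80/20 train/validation split and batching (assumes PyTorch;
# `dataset` is a hypothetical torch Dataset of spectrograms/frames + labels).
from torch.utils.data import DataLoader, random_split

n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)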
Using normalization in the image transformation, with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225] on all images, improved accuracy by 0.37%.
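A sketch of this normalization step, assuming torchvision transforms:

# Sketch of the image normalization described above (assumes torchvision).
from torchvision import transforms

spectrogram_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])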
5.2. Validation Set Accuracy

Table 3 summarizes the validation set accuracy obtained with the different architectures.

Architecture       Accuracy(%)   Data Aug.   Emotions
CNN                   52.23         No       H,S,A,N
CNN                   51.90         Yes      H,S,A,N
CNN+LSTM              39.77         No       H,S,A,N
CNN+LSTM              39.65         Yes      H,S,A,N
CNN+RNN               54.00         No       H,S,A,N
CNN+RNN               70.25         No       S,A,N
CNN+RNN+3DCNN         51.94         No       H,S,A,N
CNN+RNN+3DCNN         71.75         No       S,A,N

Table 3. Validation set accuracy of CNN, CNN+LSTM, CNN+RNN & CNN+RNN+3DCNN among 4 and 3 different emotions. H=Happy, S=Sad, A=Angry, N=Neutral.

5.3. Loss & Classification Accuracy History on CNN+RNN+3DCNN

Figure 5. Contrastive loss history curve on CNN+RNN+3DCNN.

Fig. 5 is the contrastive loss curve obtained during self-supervised model training. We ran the self-supervised training for 5 epochs and 10 epochs separately and fed the learned weights from these 2 experiments into the CNN+RNN+3DCNN model for classification training. We observed that the self-supervised model run with 5 epochs gave better classification accuracy, by 0.5%, compared to the run with 10 epochs. This could be attributed to overfitting of the weights learned in the self-supervised model when run with 10 epochs.

Fig. 6 is the softmax/cross entropy loss curve obtained for the best model, which is CNN+RNN+3DCNN. Since the loss is reported per iteration it appears noisy, but we observed that per epoch it decreases on a logarithmic scale.

Fig. 7 is the classification accuracy history on the best model, which is CNN+RNN+3DCNN. We obtained the best validation accuracy of 71.75% considering 3 emotions (sad, anger, neutral).
Figure 6. Loss history curve on CNN+RNN+3DCNN. The curve is noisy because it is generated per iteration.

Figure 7. Classification accuracy history on CNN+RNN+3DCNN after every 20 iterations, for a total of 1200 iterations.

Figure 8. Confusion matrix of true class vs. prediction in CNN+RNN.

Figure 9. Confusion matrix of true class vs. prediction in CNN+RNN+3DCNN.

5.4. Confusion Matrix on CNN+RNN & CNN+RNN+3DCNN

Fig. 8 is the confusion matrix obtained with CNN+RNN. From this confusion matrix we see that only the happy emotion is predicted poorly compared to the other emotions. This led us to explore the CNN+RNN & CNN+RNN+3DCNN architectures on only 3 emotions (instead of 4), to understand whether we see a performance improvement when switching from audio-only inputs to audio+video inputs. Fig. 9 is the confusion matrix obtained with the best model, which is CNN+RNN+3DCNN.
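A confusion matrix like those in Fig. 8 and Fig. 9 can be computed, for example, with scikit-learn; in the sketch below, y_true and y_pred are hypothetical arrays of validation labels and model predictions.

# Sketch of computing a confusion matrix over the validation set
# (assumes scikit-learn; y_true and y_pred are hypothetical label arrays).
from sklearn.metrics import confusion_matrix

labels = ["happy", "sad", "anger", "neutral"]
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true class, cols: predicted
print(cm)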
5.5. Results Analysis

From Table 3, considering 4 emotions, we can see that CNN+RNN is the best performing architecture and that data augmentation doesn't improve the accuracy. CNN does not work as well as CNN+RNN because CNN has the same architecture as the first few layers of CNN+RNN and is comparably simple; therefore, CNN+RNN learns higher-level features and performs better compared with CNN. CNN+LSTM does have a more complex architecture; however, when we were tuning the hyperparameters, we found that accuracy improved slightly when increasing the dropout probability in CNN+LSTM, indicating that CNN+LSTM could be overly complex for our dataset and training purpose. Also, adding model complexity requires more careful hyperparameter tuning, and since CNN+RNN was already giving a relatively good performance compared with [13], we decided not to spend more effort adjusting CNN+LSTM.

From Table 3 it is also evident that the CNN+RNN+3DCNN
architecture, which uses video frames along with the audio spectrogram, is the best considering 3 emotions, but the accuracy did not improve significantly over CNN+RNN. This is due to the fact that the cropping window used to focus on the face/head for recognizing facial emotion was large, as the actors are not facing the camera and they moved during their speech. Auto-detecting the face/head with a detection model and then cropping based on the bounding box would be ideal, and accuracy is expected to increase significantly. Considering 4 emotions, CNN+RNN+3DCNN performed worse compared to CNN+RNN because the model's prediction accuracy for the happy emotion itself is bad due to its low data count; adding video frames, which only capture facial expression from the side, then only confuses the model and it learns poorly.

That data augmentation does not increase the validation accuracy, and even makes the model perform slightly worse, could be because the images generated by cropping and rotation lose some emotion-related features, since these operations alter the frequency and time scale. This is similar to altering the pitch of the audio or reversing the audio of a sentence, and could confuse the model.

From the confusion matrix, we observed that happiness prediction is low compared to the other emotions. One possible reason for this is that the happiness data count is very low compared to the other emotions, and over-sampling the happiness data by repetition is not enough. More happiness data is expected to improve happiness prediction accuracy.

Comparing our results with [13], we lag their class accuracy by 5.4%, but comparing the overall accuracy considering 3 emotions, our work achieved an accuracy of 71.75%, which is better by 2.95%.

6. Conclusion/Future Work

Our work demonstrated emotion recognition using audio spectrograms through various deep neural networks like CNN, CNN+RNN & CNN+LSTM on the IEMOCAP[2] dataset. We then explored combining audio with video to achieve better performance through CNN+RNN+3DCNN. We demonstrated that CNN+RNN+3DCNN performs better as it learns emotion features from the audio signal (CNN+RNN) and also learns emotion features from facial expressions in the video frames (3DCNN), the two thus complementing each other.

To further improve the accuracy of our model we plan to explore several directions. We want to explore more noise removal algorithms and generate audio spectrograms without noise in them. This will help in analyzing whether removing noise actually helps, or whether it acts as a regularizer and we don't need to remove noise from the spectrograms. We also want to explore how the model predicts emotion in scenarios where multiple people are speaking. Next, we want to explore auto-cropping around the face/head in the video frames; we strongly believe it will significantly improve the prediction accuracy. As far as data augmentation is concerned, even though none of the direct data augmentation methods proved to be useful, adding a signal with very low amplitude and varying frequency onto the speech signal and then generating the audio spectrogram from the resulting signal would create unique data points and help in getting rid of model overfitting (sketched below). If there were machines/GPUs with more memory, we would want to experiment with increasing the input and output dimensions in each layer of the network to find the optimal point; there is definitely room to get better accuracy using this method. We then want to experiment with prediction latency among the different models and their architecture sizes. We also want to experiment more with the CNN+LSTM network and fine-tune it to see what is the best accuracy we can achieve with this model. We did try transfer learning using ResNet18 but didn't achieve good results; more experimentation is needed on how to do transfer learning with existing models. Lastly, we want to try the model on all the emotion labels in the dataset, understand the bottlenecks, and come up with neural network solutions that can predict with high accuracy.
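A sketch of this proposed augmentation is shown below; the amplitude and frequency sweep are illustrative assumptions, and the resulting waveform would then go through the same spectrogram generation as before.

# Sketch of the proposed augmentation: add a low-amplitude, frequency-varying
# signal to the speech waveform before regenerating the spectrogram.
# The amplitude and frequency range below are illustrative choices.
import numpy as np
import librosa

def augment_with_low_amplitude_tone(wav_path, sr=44100, amp=0.005):
    y, sr = librosa.load(wav_path, sr=sr)
    freq = np.linspace(100.0, 2000.0, len(y))     # slowly varying frequency (Hz)
    phase = 2 * np.pi * np.cumsum(freq) / sr      # integrate frequency to get phase
    noise = amp * np.sin(phase)                   # low-amplitude added signal
    return y + noise                              # feed into the spectrogram code above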
7. Link to GitHub code

https://github.com/julieeF/CS231N-Project

8. Contributions & Acknowledgements

Mandeep Singh:
Mandeep is a student at Stanford under SCPD. He has worked at Intel as a Design Automation Engineer for 8 years. Prior to joining Intel, he did a masters in electrical engineering specializing in analog & mixed-signal design at SJSU.

Yuan Fang:
Yuan is a master's student at Stanford in the ICME department. Her interests lie in machine learning & deep learning.

We would like to thank the CS231N Teaching Staff for guiding us through the project. We also want to thank Google Cloud Platform and Google Colaboratory for providing us free resources to carry out the experimentation involved in this work.

References

[1] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. H. Engel, L. Fan, C. Fougner, T. Han, A. Y. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Y. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. CoRR, abs/1512.02595, 2015.
[2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database, December 2008.
[3] L. Deng. A tutorial survey of architectures, algorithms, and
applications for deep learning. APSIPA Transactions on Sig-
nal and Information Processing, 3:e2, 2014.
[4] K. Gouta and M. Miyamoto. Emotion recognition: facial
components associated with various emotions. Shinrigaku
kenkyu: The Japanese journal of psychology, 71(3):211–
218, 2000.
[5] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos,
E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates,
and A. Y. Ng. Deep speech: Scaling up end-to-end speech
recognition, 2014.
[6] W. Hashim Abdulsalam, D. Al Hamdani, and M. Al Salam. Facial emotion recognition from videos using deep convolutional neural networks. 9:14–19, January 2019.
[7] J. Kim and E. André. Emotion recognition based on physio-
logical changes in music listening. IEEE transactions on pat-
tern analysis and machine intelligence, 30(12):2067–2083,
2008.
[8] B. Korbar, D. Tran, and L. Torresani. Co-training of au-
dio and video representations from self-supervised temporal
synchronization. CoRR, abs/1807.00230, 2018.
[9] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee. Emotion
recognition by speech signals. In Eighth European Confer-
ence on Speech Communication and Technology, 2003.
[10] J. Lee and I. Tashev. High-level feature representation us-
ing recurrent neural network for speech emotion recognition,
September 2015.
[11] S. Minaee and A. Abdolrashidi. Deep-emotion: Facial
expression recognition using attentional convolutional net-
work. CoRR, abs/1902.01019, 2019.
[12] G. Sahu. Multimodal speech emotion recognition and ambi-
guity resolution. CoRR, abs/1904.06022, 2019.
[13] A. Satt, S. Rozenberg, and R. Hoory. Efficient emotion recognition from speech using deep learning on spectrograms, August 2017.
[14] A. Torfi, S. M. Iranmanesh, N. M. Nasrabadi, and J. M. Daw-
son. Coupled 3d convolutional neural networks for audio-
visual recognition. CoRR, abs/1706.05739, 2017.