
Emotion Recognition in Audio and Video Using Deep Neural Networks

Mandeep Singh
SCPD, Stanford University
Stanford, CA
[email protected]

Yuan Fang
ICME, Stanford University
Stanford, CA
[email protected]

Abstract

Humans are able to comprehend information from multiple domains, e.g. speech, text and visuals. With the advancement of deep learning technology there has been significant improvement in speech recognition. Recognizing emotion from speech is an important aspect of this, and with deep learning technology emotion recognition has improved in accuracy and latency. There are still many challenges to improving accuracy. In this work, we attempt to explore different neural networks to improve emotion recognition accuracy. Among the architectures explored, we find that a (CNN+RNN) + 3DCNN multi-model architecture, which processes audio spectrograms and the corresponding video frames, gives an emotion prediction accuracy of 54.0% among 4 emotions and 71.75% among 3 emotions on the IEMOCAP [2] dataset.

1. Introduction

Emotion recognition is an important ability for good interpersonal relations and plays an important role in effective interpersonal communication. Recognizing emotions, however, can be hard; even for human beings, the ability to recognize emotions varies from person to person.

The aim of this work is to recognize emotions in audio and in audio+video using deep neural networks. In this work, we attempt to understand bottlenecks in existing architectures and input data, and we explore novel ways on top of existing architectures to increase emotion recognition accuracy.

The dataset we used is IEMOCAP [2], which contains 12 hours of audiovisual data of 10 people (5 females, 5 males) speaking in anger, happiness, excitement, sadness, frustration, fear, surprise, other and neutral states.

Our work mainly consists of two stages. First, we build neural networks to recognize emotions in audio by replicating and expanding upon the work of [13]. The input of the models is the audio spectrogram converted from the audio of an actor speaking a sentence, and the models give one output, which is the emotion the actor has when saying that sentence. The models only predict one of four different emotions, i.e. happiness, anger, sadness and the neutral state, which were chosen for comparison with [13]. The deep learning architectures we explored were CNN, CNN+RNN and CNN+LSTM.

After achieving comparably good accuracy on audio relative to [13], we build models which predict emotions using the audio spectrogram and the video frames of a video, since we believe video frames contain additional emotion-related information that can help us achieve better emotion prediction performance. The inputs of these models are the audio spectrogram and video frames, which are converted and extracted from the sound and images of a video recording of an actor speaking one sentence. The output of the models is still one of the four selected emotions mentioned above. Inspired by the work of [14], we explore a model made of two sub-networks; the first sub-network is a 3D CNN which takes in the video frames, the second one is a CNN+RNN which takes in the audio spectrogram, and the last layers of the two sub-networks are concatenated and followed by a fully connected layer that outputs the prediction.

The metric we use for evaluation is the overall accuracy, for both the audio and audio+video models.

2. Related Work

Emotion recognition is an important research area that many researchers have worked on in recent years using various methods. Using speech signals [9], facial expressions [4], and physiological changes [7] are some of the common approaches researchers take to the emotion recognition problem. In this work, we use audio spectrograms and video frames to do emotion recognition.

It has been shown that emotion recognition accuracy can be improved with statistical learning of low-level features (frequency & signal power intensity) by the different layers of a deep learning network. Mel-scale spectrograms for speech recognition were demonstrated to be useful in [3].
There have been state-of-the-art speech recognition methods that use linearly spaced audio spectrograms, as described in [1] & [5]. Our work on emotion recognition using audio spectrograms follows the approach described in [13]. An audio spectrogram is an image of an audio signal which consists of 3 main components, namely: 1. time on the x-axis; 2. frequency on the y-axis; 3. power intensity on the colorbar scale, which can be in decibels (dB), as shown in Fig. 1. [12] covers machine learning methods to extract temporal features from audio signals. The advantage of such machine learning models is that their training & prediction latency is good, but their prediction accuracy is low. A CNN model that uses audio spectrograms to detect emotion has better prediction accuracy compared to these machine learning models.

Figure 1. Example of audio spectrogram of anger emotion. Original time scale without noise cleanup.

Comparing the CNN networks used in [13] & [14] for training on audio spectrograms, [13] uses a wider kernel window size with zero padding while [14] uses a smaller window size and no zero padding. With a wider kernel window size we are able to see a larger view of the input, which allows for more expressive power. In order to avoid losing features, the use of zero padding becomes important; the zero padding decreases as the number of CNN layers increases in the architecture used in [13]. [14] avoids adding zero padding in order not to consume extra virtual zero-energy coefficients which are not useful in extracting local features. One drawback that we see in [14] is that it does not compare the performance of the audio model & the audio+video model being used. One strength observed in [14] is that it does not do noise removal on the audio input data, while [13] uses noise removal techniques on the audio spectrograms before training the model.

To achieve better prediction accuracy, a natural progression of emotion recognition using audio spectrograms is to include facial features extracted from the video frames. [11] & [6] implement facial emotion recognition using images and video frames respectively, but without audio. [14] & [8] implement neural network architectures which process audio spectrograms & video frames to recognize emotion. Both [14] and [8] implement a self-supervised model for cooperative learning of audio & video models, on different datasets. [8] further does supervised learning on the pre-trained model to do classification. The models proposed by [14] and [8] are very similar; both are two-stream models that contain one part for audio data and one part for video data. The only difference is the way the kernel sizes, number of layers and input data dimensions are set. These hyperparameters are set differently because their input data is different: [14] tends to use smaller input and kernel sizes because its input images only capture the mouth, which doesn't contain as much information as the images capturing the movement of a whole person used in [8].

3. Dataset & Features

3.1. Dataset

The dataset we use is the IEMOCAP [2] corpus, as it is the best known comprehensively labeled public corpus of emotional speech by actors. [10] uses this IEMOCAP dataset to generate results that were state of the art at the time. IEMOCAP contains 12 hours of audio and visual data of conversations between two persons (1 female and 1 male per conversation, with 5 females and 5 males in total), where each sentence in the conversations is labelled with one emotion: anger, happiness, excitement, sadness, frustration, fear, surprise, other or the neutral state.

3.2. Data pre-processing

3.2.1 Audio Data Pre-processing

The IEMOCAP corpus contains audio wav files of various time lengths, with markings of the actual emotion label for the corresponding time segments. The audio wav files in IEMOCAP are generated at a sample rate of 22KHz. The audio spectrogram is extracted from the wav file using the librosa python package (https://librosa.github.io/librosa/index.html) with a sample rate of 44KHz. A 44KHz sample rate was used because, as per the Nyquist-Shannon sampling theorem (https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem), in order to fully recover a signal the sampling frequency should be at least twice the signal frequency; the audio signal frequency ranges from 20Hz to 20KHz, hence 44KHz is a commonly used sampling rate. The spectrograms were generated in 2 segments: 1. the original time length of the utterance of a sentence or emotion; 2. each utterance clipped into 3 second clips. Another data segmentation that was done is with and without noise cleanup. We have named these segmentations DS I, DS II, DS III & DS IV; this data segmentation is summarized in Table 1, and model training is done on these data segments separately.

Dataset segmentation type                   Noise cleanup   Name
Original time length of utterance           No              DS I
Clip each utterance into 3 second clips     No              DS II
Original time length of utterance           Yes             DS III
Clip each utterance into 3 second clips     Yes             DS IV

Table 1. Segmentation of input data generation.
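As a rough sketch of this spectrogram-generation step (the STFT parameters, colormap handling and figure size below are our assumptions; only the 44KHz sample rate, the 3 second clips and the axis-free output follow from the description in this section):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_spectrogram(wav_path, out_path, sr=44100, clip_seconds=None):
    """Convert a wav file into a dB-scaled spectrogram image without axes."""
    y, _ = librosa.load(wav_path, sr=sr)           # resample to 44.1 KHz
    if clip_seconds is not None:                   # 3 second clips for DS II / DS IV
        y = y[: int(clip_seconds * sr)]
    stft = librosa.stft(y, n_fft=2048, hop_length=512)
    db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

    fig = plt.figure(figsize=(3, 2), dpi=100)      # roughly 300x200 pixels
    librosa.display.specshow(db, sr=sr, vmin=-60, vmax=0)   # fixed colorbar intensity range
    plt.axis('off')                                # axes & colorbar were found to hurt accuracy
    plt.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close(fig)

# save_spectrogram('utterance.wav', 'utterance_spec.png', clip_seconds=3)
```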
In order to get rid of background noise, we applied a bandpass filter between 1Hz and 30KHz (following https://timsainburg.com/noise-reduction-python.html; see https://github.com/julieeF/CS231N-Project/blob/master/load_single_wav_file.py). Denoising or noise cleanup of the input audio signal for data augmentation is also followed by [1]. Sentence utterances that are shorter than 3 seconds are padded with noise to maintain uniformity of the noise frequency and noise amplitude w.r.t. the noise in other parts of the signal. Initially, zero padding to the 3 second time scale was also experimented with, with noise then added at a signal-to-noise ratio (SNR) of 1 throughout the signal time length, but this resulted in distorting the original audio signal. The resulting signal is then denoised. Denoising helps in making the frequency, time scale and amplitude features of the input audio signal more visible, in the hope of getting better prediction accuracy per emotion.

All the audio spectrograms are generated with the same colorbar intensity scale (+/- 60dB) to maintain uniformity of the spectrograms across the different emotions; this is similar to normalization of the data. As seen in Fig. 2, after denoising only the signal that contains actual information remains with high power intensity or signal amplitude; the other regions in the spectrogram remain with low power intensity relative to where the actual signal of interest is. Compare this with Fig. 1, where some signal intensity is observed throughout the time scale, which is actually noise. The generated spectrogram images are of size 200x300 pixels.

Figure 2. Example of audio spectrogram of anger emotion. 3 sec audio clip with noise cleanup. Compare with Fig. 1.
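The band-pass filtering step could be sketched as follows. This is a generic Butterworth filter rather than the exact noise-reduction code linked above, and we cap the upper cutoff just below the Nyquist frequency, since a digital filter at a 44KHz sample rate cannot pass 30KHz:

```python
from scipy.signal import butter, sosfiltfilt

def bandpass(y, sr, low_hz=1.0, high_hz=30000.0, order=4):
    """Zero-phase Butterworth band-pass filter of a 1-D audio signal."""
    nyquist = sr / 2.0
    high = min(high_hz, 0.99 * nyquist)   # clip the upper cutoff below Nyquist
    sos = butter(order, [low_hz / nyquist, high / nyquist],
                 btype='band', output='sos')
    return sosfiltfilt(sos, y)

# y_filtered = bandpass(y, sr=44100)
```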
The total count of 3 second audio spectrograms among the 4 different emotions is summarized in Table 2. As observed, the happy emotion count is significantly low, so we duplicated the happy data to reach a total count of 1600. Similarly, the anger emotion count was also increased by duplication, and the sad & neutral data counts were reduced to match the 1600 data points per emotion. A total of 6400 images is used for training the model; data balance is crucial for the model to train well. 400 images from each emotion are used for model validation, and the images used for validation are never part of the training set.

Emotion    Count of data points
Happy      786
Sad        1752
Anger      1458
Neutral    2118

Table 2. Data count of each emotion.
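A minimal sketch of this re-balancing (oversampling by duplication and random undersampling to 1600 spectrograms per emotion; the data structure and random seed are assumptions):

```python
import random

def balance(files_by_emotion, target=1600, seed=0):
    """files_by_emotion: dict mapping emotion name -> list of spectrogram paths."""
    rng = random.Random(seed)
    balanced = {}
    for emotion, files in files_by_emotion.items():
        if len(files) < target:
            # oversample by duplication (e.g. Happy: 786 -> 1600)
            extra = [rng.choice(files) for _ in range(target - len(files))]
            balanced[emotion] = files + extra
        else:
            # undersample (e.g. Neutral: 2118 -> 1600)
            balanced[emotion] = rng.sample(files, target)
    return balanced
```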
At first, we started off with audio spectrograms that contained the x/y axes and the colorbar scale, but we removed the scales after learning that including the axes & scale could be contributing negatively to the prediction accuracy.

To try to improve per-class accuracy, the input audio spectrograms were data-augmented by cropping and rotation. Each image was cropped by 10 pixels from the top and resized back to 200x300 pixels; this cropping is done to simulate a small change of frequency in the emotion. Similarly, each image was also rotated by +/- 10 degrees. This rotation also simulates a frequency change, but it shifts the time scale as well; augmenting data in a way that changes the time scale is not preferred, hence the rotation was limited to a very small angle of 10 degrees. With cropping and rotation, the total count of data used for training becomes 19200. The model training was done separately with the original images and with the augmented images, for comparison. Horizontal flips of the images were avoided, as this would flip the time scale and enact a person speaking in reverse, which would lower the model prediction accuracy.
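The cropping and rotation above might look roughly like this with PIL (the fill colour and file names are assumptions):

```python
from PIL import Image

W, H = 300, 200   # spectrogram images are 200x300 pixels (height x width)

def crop_top(img, pixels=10):
    """Drop the top rows (a small shift of the frequency axis) and resize back."""
    return img.crop((0, pixels, W, H)).resize((W, H))   # box = (left, upper, right, lower)

def rotate(img, degrees=10):
    """Small +/-10 degree rotation; larger angles would distort the time scale too much."""
    return img.rotate(degrees, fillcolor=(0, 0, 0))

img = Image.open('utterance_spec.png').convert('RGB')
augmented = [crop_top(img), rotate(img, 10), rotate(img, -10)]
```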
Model training on the audio spectrograms that contain the full time length, rather than 3 seconds, was done separately. Each 3 second audio spectrogram was replaced with the corresponding full time length spectrogram, thus maintaining the data count for balancing.

A visual analysis of around 100 audio spectrograms was done. It was observed that the maximum frequency seen among all these spectrograms is around 8KHz. This means around 60% of the spectrogram image is blue and does not carry any information from an emotion perspective. All the input audio spectrograms were therefore cropped from the top by 60% and resized back to 200x300 pixels. An ideal method would be to generate the spectrograms with a fixed frequency scale, if the frequency range is known beforehand.
3.2.2 Video Data Pre-processing

Since our work also includes implementing a video model to see the room for improvement in emotion recognition accuracy, we also did video data pre-processing. For the video data, we first clipped each video file into sentences according to how we processed the audio files. This ensured that we are querying the part of the video file that corresponds to a given audio spectrogram. We then extracted 20 images per 3 seconds from each video avi file corresponding to a 3 sec audio spectrogram. The video contains both actors in the frame, hence the frames were cropped from the left or right to capture only the actor whose emotion is being recorded. We then cropped the video frames further to cover the actor's face/head. The final resolution of the video frames is 60x100. One limitation of the dataset is that in the video the actors are not speaking facing the camera, therefore the full facial expression corresponding to a given emotion is not visible.

While processing the extraction of audio spectrograms and video frames, it was observed that the memory usage on the machine was more than 12GB, which led to machine crashes. Therefore, to extract the data, each audio and video file was processed individually in batches: a Python script (https://github.com/julieeF/CS231N-Project/blob/master/load_single_Video.py) was launched individually through a unix shell script.
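A rough sketch of this frame-extraction step with OpenCV; the 20 sampled frames, the left/right crop and the 60x100 output follow the description above, while the exact face/head crop coordinates are omitted as assumptions:

```python
import cv2
import numpy as np

def extract_frames(avi_path, start_s, end_s, actor_side='left',
                   n_frames=20, out_size=(100, 60)):
    """Sample n_frames frames from [start_s, end_s] seconds of the video,
    keep the half of the frame containing the labelled actor, and resize
    to 100x60 (width x height)."""
    cap = cv2.VideoCapture(avi_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    for idx in np.linspace(start_s * fps, end_s * fps, n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        half = frame[:, : w // 2] if actor_side == 'left' else frame[:, w // 2:]
        # a further crop around the actor's face/head region would go here
        frames.append(cv2.resize(half, out_size))
    cap.release()
    return frames
```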
4. Methods & Model Architecture

In this section, we describe the models we have built for emotion recognition in audio in the 'Audio Models' subsection, and the models for emotion recognition in audio+video in the 'Audio+Video Models' subsection.

4.1. Audio Models

By replicating and expanding upon the network architecture used in [13] we formulate three different models. The first model is a CNN, which consists of three 2D convolutional layers with maxpooling followed by two fully connected layers, as shown in Fig. 3. In the second architecture, we add an LSTM layer after the convolutional layers of the CNN model; we call this model CNN+LSTM in this work. In the third model, we replace the LSTM layer with a vanilla RNN layer; this model is named CNN+RNN in this work. A graph of the CNN+RNN architecture is shown in Fig. 3.

Figure 3. Audio model architectures.

The loss we use for training the model is the cross entropy loss,

L_{\text{cross entropy}} = \frac{1}{N} \sum_{n=1}^{N} -\log\left( \frac{\exp(x^n_c)}{\sum_j \exp(x^n_j)} \right)    (1)

where N is the number of data points in the dataset, x^n_c is the score of the true class of the n-th data point, and x^n_j is the score of the j-th class for the n-th data point. Minimizing the cross entropy loss forces our model to learn the emotion-related features from the audio spectrogram, because the loss is minimal only when, for each data point, the score of the true class is remarkably larger than the scores of all the other classes.
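A minimal PyTorch sketch of the CNN+RNN audio model and the cross-entropy training objective follows. The channel counts, kernel sizes and hidden size are illustrative assumptions and are not the exact layer dimensions used in this work:

```python
import torch
import torch.nn as nn

class CNNRNN(nn.Module):
    """Three conv/pool blocks over the spectrogram, a vanilla RNN over the
    resulting time axis, and fully connected layers for 4 emotion classes."""
    def __init__(self, n_classes=4, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.RNN(input_size=64 * 25, hidden_size=hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                nn.Dropout(0.2), nn.Linear(64, n_classes))

    def forward(self, x):                        # x: (batch, 3, 200, 300)
        f = self.conv(x)                         # (batch, 64, 25, 37)
        f = f.permute(0, 3, 1, 2).flatten(2)     # treat the width (time) axis as a sequence
        _, h = self.rnn(f)                       # h: (1, batch, hidden)
        return self.fc(h.squeeze(0))             # class scores

model = CNNRNN()
criterion = nn.CrossEntropyLoss()                # Equation (1)
scores = model(torch.randn(8, 3, 200, 300))      # a dummy batch of spectrograms
loss = criterion(scores, torch.randint(0, 4, (8,)))
```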
4.2. Audio+Video Models

Inspired by the work of [14], our audio+video model is a two-stream network that consists of two sub-networks, as shown in Fig. 4(a). The first sub-network is the audio model, for which we choose the best-performing audio model we have built, CNN+RNN, as shown in Fig. 3. The architecture of this first sub-network is the same as the audio model except that it drops the original output layer, in order to produce high-level features of the audio spectrogram, as shown in Fig. 4 (CNN+RNN). The second sub-network is the video model, and it is made of four 3D convolutional layers and three 3D maxpooling layers, followed by two fully connected layers, as shown in Fig. 4 (3D CNN). Finally, the last layers of the two sub-networks are concatenated together, followed by one output layer, as shown in Fig. 4(a).

Figure 4. Audio+Video model architectures.
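A sketch of the fusion step, assuming audio_net and video_net are the two sub-networks with their original output layers removed (the feature sizes are assumptions):

```python
import torch
import torch.nn as nn

class AudioVideoNet(nn.Module):
    """Two-stream model: concatenate the last-layer features of the audio
    (CNN+RNN) and video (3D CNN) sub-networks, then classify."""
    def __init__(self, audio_net, video_net, audio_feat=64, video_feat=64, n_classes=4):
        super().__init__()
        self.audio_net = audio_net       # spectrogram -> (batch, audio_feat)
        self.video_net = video_net       # frames      -> (batch, video_feat)
        self.out = nn.Linear(audio_feat + video_feat, n_classes)

    def forward(self, spectrogram, frames):
        fa = self.audio_net(spectrogram)
        fv = self.video_net(frames)      # frames: (batch, 3, 20, 60, 100) for the 3D CNN
        return self.out(torch.cat([fa, fv], dim=1))
```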
We train this audio+video model using two different methods: semi-supervised training and supervised training. For the semi-supervised training method, we first pre-train our model using video frames and audio spectrograms taken from the same video and from different videos, as shown in Fig. 4(b). This forces the model to learn the correlation between the visual and auditory elements of a video. The input of the pre-training process has three distinct types: positive (the audio spectrogram and video frames are from the same video); hard negative (the audio spectrogram and video frames are from different videos with different emotions); and super hard negative (the audio spectrogram and video frames are from different videos with the same emotion). The loss function we use for pre-training is the contrastive loss,

L_{\text{contrastive}} = \frac{1}{N} \sum_{n=1}^{N} \left( L_1^n + L_2^n \right)

where

L_1^n = y^n \left\| f_v(v^n) - f_a(a^n) \right\|_2^2

L_2^n = (1 - y^n) \max\left( \eta - \left\| f_v(v^n) - f_a(a^n) \right\|_2,\ 0 \right)^2

N is the number of data points in the dataset, v^n and a^n are the video frames and audio spectrogram of the n-th data point, f_v and f_a are the video and audio sub-networks, y^n is one if the video frames and audio spectrogram are from the same video and zero otherwise, and \eta is the margin hyperparameter. \| f_v(v^n) - f_a(a^n) \|_2 should be small when the video frames and audio spectrogram are from the same video, and large when they come from different videos. Therefore, by minimizing the contrastive loss, the audio and video models are forced to output similar values when their inputs are from the same video, and very distinct values when they are not. This allows the model to learn the connection between the audio and visual elements of the same video.
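This pre-training loss can be implemented in a few lines; the sketch below assumes batched sub-network outputs and an arbitrary margin value:

```python
import torch

def contrastive_loss(fv, fa, y, margin=1.0):
    """fv, fa: (batch, d) outputs of the video and audio sub-networks.
    y: (batch,) tensor with 1 for matching audio/video pairs, 0 otherwise."""
    d = torch.norm(fv - fa, p=2, dim=1)                    # ||f_v(v) - f_a(a)||_2
    l1 = y * d.pow(2)                                      # pull matching pairs together
    l2 = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # push mismatched pairs apart
    return (l1 + l2).mean()
```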
After pre-training is done, we do supervised learning on the pre-trained model, where the input is the audio spectrogram and video frames of a video and the output is the predicted emotion, as shown in Fig. 4(a). The loss of our model is the cross entropy, with the same formula as in Equation 1.

The second training method is to do supervised training directly on the model, without the pre-training process.

5. Experiments & Results

For model evaluation, prediction accuracy is the key metric used. For comparison of results, the accuracy was compared with the accuracy reported in [13]. Since we balanced the data count, the overall accuracy and the class accuracy as reported in [13] are mathematically equal terms in our work. Our work aimed to achieve a prediction accuracy of around 60% considering 4 emotions.

We trained the model on all 4 segmentations of the generated dataset and observed that the data with the original time scale and without noise cleanup gives the best accuracy; the results reported are based on this dataset. Spectrograms with noise removed sound promising in theory, but they did not work, for 2 possible reasons. First, the algorithm used to remove noise reduces the signal amplitude, which may lead to some feature suppression; an algorithm that amplifies the signal back needs to be explored. Some techniques, e.g. subtracting the noise from the signal and multiplying the final signal by a constant, were explored, but they all resulted in signal distortion. Secondly, having noise in the spectrogram simulates a real scenario, and during model training the noise could indirectly act as a regularizer. [14] also does not remove noise from the input audio spectrograms.
5.1. Hyperparameters

We started off with prediction on 4 emotions, and most of the work, results and analysis are based on these 4 emotions. Our validation accuracy did not go beyond 54.00%, and we saw overfitting during model training beyond this point. This led us to experiment with various hyperparameters in the optimizer and in the network model layers, e.g. kernel size, size of the input and output in each layer, dropout, batchnorm, data augmentation, and l1 & l2 regularization.

The Adam optimizer was used with a learning rate of 1e-4 to train the model, as this gave the best accuracy. We experimented with 1e-3 & 1e-5 and observed that the model did not train well with these settings. It was observed that a weight decay (the parameter that controls l2 regularization) of 0.01 in the Adam optimizer improved the accuracy by 1%. Weight decay values of 0.005 and 0.02 were also experimented with but did not help. All other parameters were kept at their defaults in the optimizer. Enabling l1 regularization, data augmentation by rotation and cropping, or batchnorm resulted in no change or improvement in accuracy. This is possibly because the model has learned all the features it can from the available data given the model architecture.

Tuning of the dropout probability was also experimented with, and optimal values of 0.2 for the last fully connected layer and 0.1 for the dropout in the RNN layer were obtained.

The input & output dimensions in the audio network layers were doubled & quadrupled, which resulted in an accuracy improvement of 1-2%. Increasing the input and output dimensions in the layers also resulted in high memory usage during model training. We attempted to extend this learning to the video network, but due to the limited memory of 12GB on the machine we were unable to carry out this experiment. This leads us to strongly believe that there is room for improvement that needs more experimentation on a machine with larger memory. The accuracy improvement is also evident from the different model architectures we used, going from CNN to CNN+RNN to CNN+RNN+3DCNN, which have progressively more parameters with which to learn features.

80% of the total data points were used for training and the rest for validation. A batch of 64 data points per iteration is used to train the model. Larger batch sizes resulted in long iteration times and high memory usage; thus, 64 was picked.

Using normalization in the image transformation, with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225] on all images, improved accuracy by 0.37%.
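For reference, the optimizer and image normalization settings described above correspond to the following PyTorch configuration (the model here is only a placeholder):

```python
import torch
from torchvision import transforms

model = torch.nn.Linear(4, 4)   # placeholder; in practice the CNN+RNN(+3DCNN) model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)

# normalization applied to every spectrogram / video frame image
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```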
5.2. Validation Set Accuracy

Table 3 summarizes the validation set accuracy obtained with the different architectures.

Architecture      Accuracy (%)   Data Aug.   Emotions
CNN               52.23          No          H,S,A,N
CNN               51.90          Yes         H,S,A,N
CNN+LSTM          39.77          No          H,S,A,N
CNN+LSTM          39.65          Yes         H,S,A,N
CNN+RNN           54.00          No          H,S,A,N
CNN+RNN           70.25          No          S,A,N
CNN+RNN+3DCNN     51.94          No          H,S,A,N
CNN+RNN+3DCNN     71.75          No          S,A,N

Table 3. Validation set accuracy of CNN, CNN+LSTM, CNN+RNN & CNN+RNN+3DCNN among 4 and 3 different emotions. H=Happy, S=Sad, A=Angry, N=Neutral.

5.3. Loss & Classification Accuracy History on CNN+RNN+3DCNN

Fig. 5 is the contrastive loss curve obtained during self-supervised model training. We ran the self-supervised model with 5 epochs and 10 epochs separately and fed the weights learned in these 2 experiments into the CNN+RNN+3DCNN model for classification training. We observed that the self-supervised model run with 5 epochs gave better classification accuracy, by 0.5%, compared to the self-supervised model run with 10 epochs. This could be attributed to overfitting of the weights learned in the self-supervised model when run with 10 epochs.

Figure 5. Contrastive loss history curve on CNN+RNN+3DCNN.

Fig. 6 is the softmax/cross entropy loss curve obtained on the best model, which is CNN+RNN+3DCNN. Since the loss is reported per iteration it appears noisy, but we observed that per epoch it decreases on a logarithmic scale.

Figure 6. Loss history curve on CNN+RNN+3DCNN. The curve is noisy because it is generated per iteration.

Fig. 7 is the classification accuracy history on the best model, which is CNN+RNN+3DCNN. We obtained a best validation accuracy of 71.75% considering 3 emotions (sad, anger, neutral).

Figure 7. Classification accuracy history on CNN+RNN+3DCNN after every 20 iterations, for a total of 1200 iterations.
5.4. Confusion Matrix on CNN+RNN & CNN+RNN+3DCNN

Fig. 8 is the confusion matrix obtained with CNN+RNN. From this confusion matrix we see that only the happy emotion is predicted poorly compared to the other emotions. This led us to explore the CNN+RNN & CNN+RNN+3DCNN architectures on only 3 emotions (instead of 4), to understand whether we see a performance improvement when switching from audio-only inputs to audio+video inputs. Fig. 9 is the confusion matrix obtained with the best model, which is CNN+RNN+3DCNN.

Figure 8. Confusion matrix of true class vs. prediction in CNN+RNN.

Figure 9. Confusion matrix of true class vs. prediction in CNN+RNN+3DCNN.
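Confusion matrices such as Fig. 8 and Fig. 9 can be computed from the validation predictions, for example with scikit-learn (a sketch with toy predictions; the label ordering is an assumption):

```python
from sklearn.metrics import confusion_matrix

labels = ['Happy', 'Sad', 'Anger', 'Neutral']
# toy example; in practice y_true / y_pred are collected over the validation set
y_true = [0, 1, 2, 3, 1, 2]
y_pred = [0, 1, 2, 2, 1, 3]
cm = confusion_matrix(y_true, y_pred, labels=list(range(len(labels))))
print(cm)   # rows: true class, columns: predicted class
```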
5.5. Results Analysis

From Table 3, considering 4 emotions, we can see that CNN+RNN is the best performing architecture and that data augmentation doesn't improve the accuracy. CNN does not work as well as CNN+RNN because CNN has the same architecture as the first few layers of CNN+RNN and is comparably simple; therefore, CNN+RNN learns higher-level features and performs better than CNN. CNN+LSTM does have a more complex architecture; however, when we were tuning the hyperparameters, we found that accuracy improved slightly when we increased the dropout probability in CNN+LSTM, indicating that CNN+LSTM could be overly complex for our dataset and training purpose. Also, adding model complexity requires more careful hyperparameter tuning, and since CNN+RNN gives a relatively good performance compared with [13], we decided not to keep adjusting CNN+LSTM.

From Table 3, it is also evident that the CNN+RNN+3DCNN architecture, which uses video frames along with the audio spectrogram, is the best considering 3 emotions, but the accuracy did not improve significantly over CNN+RNN.
This is due to the fact that the cropping window used to focus on the face/head to recognize facial emotion was large, as the actors are not facing the camera and they moved during their speech. Auto-detecting the face/head with a detection model and then cropping based on the bounding box would be ideal, and accuracy would be expected to increase significantly. Considering 4 emotions, CNN+RNN+3DCNN performed worse than CNN+RNN because the model prediction accuracy for the happy emotion itself is bad due to its low data count; when the video frames, which only capture facial expression from the side, are then added, they only confuse the model further.

Data augmentation does not increase the validation accuracy and even makes the model perform slightly worse. This could be because the images generated by cropping and rotation lose some emotion-related features, since the augmentation alters the frequency and time scale. This is similar to altering the pitch of the audio or reversing the audio of a sentence, and could confuse the model.

From the confusion matrix, we observed that happiness prediction is low compared to the other emotions. One possible reason for this is that the happiness data count is very low compared to the other emotions, and over-sampling the happiness data by repetition is not enough. More happiness data is expected to improve happiness prediction accuracy.

Comparing our results with [13], we lag their class accuracy by 5.4%, but comparing the overall accuracy considering 3 emotions, our work achieved an accuracy of 71.75%, which is better by 2.95%.

6. Conclusion/Future Work

Our work demonstrated emotion recognition using audio spectrograms through various deep neural networks, namely CNN, CNN+RNN & CNN+LSTM, on the IEMOCAP [2] dataset. We then explored combining audio with video to achieve better performance through CNN+RNN+3DCNN. We demonstrated that CNN+RNN+3DCNN performs better as it learns emotion features from the audio signal (CNN+RNN) and also learns emotion features from facial expressions in the video frames (3DCNN), the two thus complementing each other.

To further improve the accuracy of our model we plan to explore various aspects. We want to explore more noise removal algorithms and generate audio spectrograms without noise in them; this will help in analyzing whether removing noise actually helps or whether it acts as a regularizer and we don't need to remove it from the spectrograms. We also want to explore how the model predicts emotion in scenarios where multiple people are speaking. Next, we want to explore auto-cropping around the face/head in the video frames; we strongly believe it will significantly improve the prediction accuracy. As far as data augmentation is concerned, even though none of the direct data augmentation methods proved to be useful, adding a signal with very low amplitude and varying frequency onto the speech signal and then generating the audio spectrogram from the resulting signal would create unique data points and help in getting rid of model overfitting. If there were machines/GPUs with more memory, we would want to experiment with increasing the input and output dimensions in each layer of the network to find the optimal point; there is definitely room to get better accuracy using this method. We then want to experiment with prediction latency among the different models and their architecture sizes. We also want to experiment more with the CNN+LSTM network and fine-tune it to see what is the best accuracy we can achieve with that model. We did try transfer learning using ResNet18 but didn't achieve good results; more experimentation is needed on how to transfer-learn from existing models. Lastly, we want to try the model on all the emotions in the dataset, understand the bottlenecks, and come up with neural network solutions that can predict with high accuracy.

7. Link to github code

https://github.com/julieeF/CS231N-Project

8. Contributions & Acknowledgements

Mandeep Singh:
Mandeep is a student at Stanford under SCPD. He has worked at Intel as a Design Automation Engineer for 8 years. Prior to joining Intel, he did a masters in electrical engineering specializing in analog & mixed-signal design at SJSU.

Yuan Fang:
Yuan is a master's student at Stanford in the ICME department. Her interests lie in machine learning & deep learning.

We would like to thank the CS231N Teaching Staff for guiding us through the project. We also want to thank Google Cloud Platform and Google Colaboratory for providing us free resources to carry out the experimentation involved in this work.

References

[1] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. H. Engel, L. Fan, C. Fougner, T. Han, A. Y. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Y. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu. Deep speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595, 2015.
[2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database, December 2008.
[3] L. Deng. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing, 3:e2, 2014.
[4] K. Gouta and M. Miyamoto. Emotion recognition: facial components associated with various emotions. Shinrigaku kenkyu: The Japanese Journal of Psychology, 71(3):211-218, 2000.
[5] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng. Deep speech: Scaling up end-to-end speech recognition, 2014.
[6] W. Hashim Abdulsalam, D. Al Hamdani, and M. Al Salam. Facial emotion recognition from videos using deep convolutional neural networks. 9:14-19, 01 2019.
[7] J. Kim and E. André. Emotion recognition based on physiological changes in music listening. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12):2067-2083, 2008.
[8] B. Korbar, D. Tran, and L. Torresani. Co-training of audio and video representations from self-supervised temporal synchronization. CoRR, abs/1807.00230, 2018.
[9] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee. Emotion recognition by speech signals. In Eighth European Conference on Speech Communication and Technology, 2003.
[10] J. Lee and I. Tashev. High-level feature representation using recurrent neural network for speech emotion recognition, September 2015.
[11] S. Minaee and A. Abdolrashidi. Deep-emotion: Facial expression recognition using attentional convolutional network. CoRR, abs/1902.01019, 2019.
[12] G. Sahu. Multimodal speech emotion recognition and ambiguity resolution. CoRR, abs/1904.06022, 2019.
[13] A. Satt, S. Rozenberg, and R. Hoory. Efficient emotion recognition from speech using deep learning on spectrograms, August 2017.
[14] A. Torfi, S. M. Iranmanesh, N. M. Nasrabadi, and J. M. Dawson. Coupled 3D convolutional neural networks for audio-visual recognition. CoRR, abs/1706.05739, 2017.
