Abstract—As a serious mood disorder, depression causes severe symptoms that affect how people feel, think, and handle daily activities such as sleeping, eating, or working. In this paper, a novel framework is proposed to estimate the Beck Depression Inventory II (BDI-II) values from video data. It uses a 3D convolutional neural network to automatically learn spatiotemporal features at two different scales of the face region, and then a Recurrent Neural Network (RNN) to learn further from the sequence of spatiotemporal features. This formulation, called RNN-C3D, models the local and global spatiotemporal information in consecutive facial expressions in order to predict the depression levels. Experiments on the AVEC2013 and AVEC2014 depression datasets show that the proposed approach is promising compared to the state-of-the-art visual-based depression analysis methods.

Index Terms—Automated visual-based depression analysis, nonverbal behavior, 3D Convolutional Neural Network (C3D), Recurrent Neural Network (RNN).

Mohamad Al Jazaery and Guodong Guo (Corresponding Author, E-mail: [email protected]) are with the Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.

I. INTRODUCTION

Major depressive disorder (MDD) is a mental illness characterized by low confidence, loss of interest in normally enjoyable activities, and low energy without a specific reason [1]. In 2015, MDD affected approximately 216 million people (3% of the world's population), with females affected about twice as often as males. It can also negatively affect a person's work or school life, as well as sleeping and eating habits and general health [1], [2]. Up to 60% of people who die by suicide had depression or another mood disorder [3]. As more people are severely affected, accurate MDD diagnosis becomes a priority. Automated systems therefore need to be developed to give an objective assessment and quick analysis of mood disorders, which can lead to better and more timely MDD therapy.

Machine-based health assessment systems offer an easy way to track a person's depression status online through human-machine interaction [4]. Specifically, automated mental health systems can detect and analyze the audio and visual behaviors that are particularly related to depression. Speech characteristics can be useful for depression analysis [5], [6], and several depression analysis methods have been developed using audio data [7], [8], [9], [10], [11], [12]. Studies show that visual expressions and gestures are very important for depression analysis as well [13], [14]. Many methods focus on the face region because, in human activities, more than half of the visual nonverbal behaviors are face and head gestures [15], [16], [17], [18], [19], [20]. This motivated us to extract deep features using two different facial visual cues. The first cue uses a tightly cropped, aligned face region and focuses mostly on facial expressions. The second cue uses a relatively larger face region that includes the full head, which helps capture head gestures and more dynamics around the face region. Even though audio-visual fusion gives better results than visual data alone for depression analysis [21], [22], [23], our work focuses on visual nonverbal data without utilizing the audio.

Depression diagnosis is usually based on the patient's verbal and action behaviors, as reported by the patients themselves or by friends. The Beck Depression Inventory-II (BDI-II), an estimate of depression level [24], ranges from 0 to 63, where 0-13 implies no depression, 14-19 mild depression, 20-28 moderate depression, and 29-63 severe depression. As a result, Automatic Depression Level Prediction (ADLP) can be formulated as a regression or a multi-class classification problem.

The aim of this paper is to predict depression levels from a person's visual expressions. A convolutional 3D network (C3D) [25] is used to model both the dynamics and the appearance in video data of people performing human-computer interaction tasks, responding to a number of questions such as: What is your favorite dish? What was your best gift, and why? Discuss a sad childhood memory. The ability of the C3D to learn both salient temporal and spatial information from a sequence of frames makes it appropriate for analyzing human visual behavior and predicting depression levels. After the deep spatiotemporal features learned by the C3D network are extracted, a Recurrent Neural Network (RNN) [26], [27] is used to learn the depression levels further. To the best of our knowledge, this is the first time the C3D and RNN methods have been explored for ADLP in videos.

Our extensive experiments on both the AVEC2013 and AVEC2014 datasets show that our approach is promising in comparison with the state-of-the-art visual-based depression analysis methods.

In the following, the related works and methods are reviewed briefly. Then, in Section III, the proposed framework is presented with details about the architecture and its submodules. After that, the evaluation of our method is demonstrated through experiments with each submodule of our framework. Finally, Sections V and VI present some discussions and draw conclusions.
Fig. 2. The flow diagram of the proposed method for predicting the Beck Depression Inventory (BDI) score using deep C3D and RNN in videos. Two different preprocessing methods are used to extract face features at two different scales. Then the RNN learns the dynamics further.
Each fully connected layer has 512 output units. The input is
a series of 16 RGB video frames with 128 × 128 pixels in the
spatial dimensions. The C3D structure is illustrated in Fig. 4.
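For concreteness, the following is a minimal PyTorch sketch of a C3D regressor matching the description above: 8 3D convolutions, 5 max-poolings, two 512-unit fully connected layers, and a single output trained with a Euclidean (MSE) loss on 16-frame, 128 x 128 clips. The channel widths follow the standard C3D configuration and are assumptions here; the paper's own implementation was built on a modified Caffe toolbox.

```python
"""Minimal sketch of a C3D regressor, under the assumptions stated above."""
import torch
import torch.nn as nn

class C3DRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        def block(cin, cout, n_convs, pool_kernel):
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv3d(cin if i == 0 else cout, cout,
                                     kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool3d(pool_kernel))
            return layers
        self.features = nn.Sequential(
            *block(3,   64, 1, (1, 2, 2)),   # conv1a + pool1
            *block(64, 128, 1, (2, 2, 2)),   # conv2a + pool2
            *block(128, 256, 2, (2, 2, 2)),  # conv3a/b + pool3
            *block(256, 512, 2, (2, 2, 2)),  # conv4a/b + pool4
            *block(512, 512, 2, (2, 2, 2)),  # conv5a/b + pool5  -> (512, 1, 4, 4)
        )
        self.fc6 = nn.Sequential(nn.Linear(512 * 1 * 4 * 4, 512), nn.ReLU(inplace=True))
        self.fc7 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True))
        self.out = nn.Linear(512, 1)  # single BDI-II score, trained with MSE (Euclidean) loss

    def forward(self, clip):             # clip: (batch, 3, 16, 128, 128)
        x = self.features(clip).flatten(1)
        fc6 = self.fc6(x)                # 512-d features, later fed to the RNN
        return self.out(self.fc7(fc6)), fc6

model = C3DRegressor()
scores, fc6_feats = model(torch.randn(2, 3, 16, 128, 128))
print(scores.shape, fc6_feats.shape)     # torch.Size([2, 1]) torch.Size([2, 512])
```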
As shown in [35], [36], head gestures are good visual cues, along with facial expressions, for depression analysis. Therefore, in addition to the C3D model trained on tightly aligned faces, another C3D model is trained on larger face regions, in order to explore the dynamics of both faces and heads in depression analysis. From now on, the model trained on aligned, tightly cropped faces is called the C3D Tight-Face model, and the model trained on larger face regions is called the C3D Loose-Face model.
Fig. 4. The C3D deep structure: 8 convolutions, 5 max-poolings, 2 fully connected layers, and a Euclidean loss output layer.
around 80K and 30K 16-frame training clips for AVEC2013 and AVEC2014, respectively.

- Face Detection and Alignment: For the C3D Tight-Face model, we used OpenFace [43] to detect facial landmarks. Then, in each frame, we cropped and aligned the facial region using the mouth, ear, and eye landmarks. This setting is used for both the training and the testing data. On the other hand, all available frames from the AVEC2013 and AVEC2014 training datasets are used to train the C3D Loose-Face model.
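The exact cropping parameters are not given in the paper, so the following is only an illustrative sketch of the two face scales, assuming per-frame landmarks and a face box are already available (e.g., from OpenFace). The paper aligns with mouth, ear, and eye landmarks; this sketch uses only the two eye centers, and the margin of the loose crop is a hypothetical value.

```python
"""Illustrative sketch of the tight (aligned) and loose (full-head) crops."""
import cv2
import numpy as np

def tight_aligned_face(frame, left_eye, right_eye, out_size=128, eye_frac=0.45):
    # Rotate so the eye line is horizontal and scale so the inter-ocular
    # distance fills a fixed fraction of the output patch.
    dx = float(right_eye[0] - left_eye[0])
    dy = float(right_eye[1] - left_eye[1])
    angle = np.degrees(np.arctan2(dy, dx))
    scale = (eye_frac * out_size) / np.hypot(dx, dy)
    cx = float((left_eye[0] + right_eye[0]) / 2.0)
    cy = float((left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D((cx, cy), angle, scale)
    # Move the eye midpoint to a canonical position in the output patch.
    M[0, 2] += out_size * 0.5 - cx
    M[1, 2] += out_size * 0.35 - cy
    return cv2.warpAffine(frame, M, (out_size, out_size))

def loose_face(frame, face_box, margin=0.6, out_size=128):
    # Enlarge the detected face box so the full head and some surrounding
    # context are kept, then resize to the network input size.
    x, y, w, h = face_box
    cx, cy = x + w / 2.0, y + h / 2.0
    s = max(w, h) * (1 + margin)
    x0, y0 = int(max(cx - s / 2, 0)), int(max(cy - s / 2, 0))
    x1 = int(min(cx + s / 2, frame.shape[1]))
    y1 = int(min(cy + s / 2, frame.shape[0]))
    return cv2.resize(frame[y0:y1, x0:x1], (out_size, out_size))
```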
2) C3D Training: To initialize the weights of the 3D convolutional layers, we use the weights of a C3D model pre-trained on the Sports-1M dataset and then fine-tuned on UCF101, released by [25]. As mentioned earlier, we changed the last fully connected layer from 4096 to 512 units to avoid over-fitting, and replaced the softmax loss with a Euclidean loss for the regression task. The implementation uses a modified version of the Caffe toolbox [44] that supports 3D convolutional networks [25]. The learning rate is fixed to $10^{-6}$ and the batch size to 35. The network converged after around 4000 iterations (2 epochs) for AVEC2013 and around 3000 iterations (4 epochs) for AVEC2014.
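This training configuration can be re-expressed, purely as a hedged sketch, in PyTorch (the paper itself used a modified Caffe toolbox). `C3DRegressor` refers to the earlier sketch, `pretrained_c3d.pth` is a placeholder checkpoint path, the `features.` key prefix matches that sketch rather than the released Caffe model, and the choice of plain SGD is an assumption, since the solver type is not stated.

```python
"""Hedged PyTorch re-expression of the fine-tuning setup described above."""
import torch
import torch.nn as nn

model = C3DRegressor()  # sketch from Section III

# Initialize only the 3D convolutional layers from the pre-trained checkpoint;
# the regression head (512-unit FCs, single output) is trained from scratch.
state = torch.load("pretrained_c3d.pth", map_location="cpu")      # hypothetical file
conv_weights = {k: v for k, v in state.items() if k.startswith("features.")}
model.load_state_dict(conv_weights, strict=False)

criterion = nn.MSELoss()                                  # Euclidean loss for BDI-II regression
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)  # fixed learning rate of 1e-6

def train_step(clips, bdi_scores):
    """clips: (35, 3, 16, 128, 128) mini-batch; bdi_scores: (35,) ground-truth labels."""
    optimizer.zero_grad()
    preds, _ = model(clips)
    loss = criterion(preds.squeeze(1), bdi_scores.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```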
TABLE I
DEPRESSION RECOGNITION RESULTS ON AVEC2013 (TEST SET).

Our Methods                                                 RMSE    MAE
C3D Tight-Face                                              9.64    7.50
RNN-C3D Tight-Face                                          9.33    7.39
C3D Loose-Face                                             10.04    8.15
RNN-C3D Loose-Face                                          9.84    7.95
C3D Tight-Face & Loose-Face weighted merge (2 models)       9.49    7.40
RNN-C3D Tight-Face & Loose-Face weighted merge (2 models)   9.28    7.37

TABLE II
DEPRESSION RECOGNITION RESULTS ON AVEC2014 (TEST SET).

Our Methods                                                 RMSE    MAE
C3D Tight-Face                                              9.66    7.48
RNN-C3D Tight-Face                                          9.49    7.41
C3D Loose-Face                                              9.81    7.73
RNN-C3D Loose-Face                                          9.76    7.66
C3D Tight-Face & Loose-Face weighted merge (2 models)       9.32    7.29
RNN-C3D Tight-Face & Loose-Face weighted merge (2 models)   9.20    7.22

3) RNN Training: For every two consecutive 16-frame clips, we extracted the C3D features from the Fc6 layer. The reason for choosing the Fc6 layer rather than Fc7 is that it immediately follows the last 3D convolutional layer, so its features retain some raw spatiotemporal information along with the depression labels. We applied PCA to reduce the feature dimension from 512 to 10. Then, a vanilla RNN [38] with 30 hidden units is used to learn the normalized depression labels from the input sequence of length 2. Specifically, the z-score is used to normalize the depression score labels, after computing the mean and standard deviation of the training labels, in order to make the network converge faster. We set the learning rate (LR) to $5 \times 10^{-5}$ and the LR decay to 0.9, with ReLU as the activation function. The network converged after 10 epochs.

4) Testing and Performance Measure: For the C3D models, the final depression prediction for each testing video is the median over all its 16-frame sample clips, while for the RNN-C3D models it is the median of the RNN outputs over all possible pairs of consecutive 16-frame clips in the video. Since the median is less affected by outliers than the mean, it is used to suppress outliers among the large number of samples in a video. The overall performance is measured using the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE), computed as
$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}, \qquad MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|,$$
where $N$ is the total number of video samples, $y_i$ denotes the ground truth of the $i$-th video sample, and $\hat{y}_i$ is the predicted value of the $i$-th video sample.

At first, we explored the performance of each individual C3D model, i.e., the Tight-Face or the Loose-Face model. From Table I (AVEC2013), when only the C3D Tight-Face model is used, the RMSE and MAE are 9.64 and 7.50, respectively, while they are 10.04 and 8.15 with the C3D Loose-Face model. From Table II (AVEC2014), the RMSE and MAE are 9.66 and 7.48 with the C3D Tight-Face model, and 9.81 and 7.73 with the C3D Loose-Face model. From both the AVEC2013 and AVEC2014 results, one can see that the C3D Tight-Face model is better than the C3D Loose-Face model.

Then, we explored the performance of merging the C3D Tight-Face and Loose-Face models. For AVEC2014, the merged result for a video is the median over all sub-clips from both models, giving an RMSE of 9.32 and an MAE of 7.29. On the other hand, to obtain the best merged result for AVEC2013, where the videos are much longer, a double weight is given to the C3D Tight-Face clip predictions. With this C3D Tight-Face & Loose-Face weighted merge, the RMSE and MAE are 9.49 and 7.40, respectively, which improves over each single model. These results highlight that tight faces are better than loose or larger faces for depression analysis, probably because there is more background noise in the loose or larger face regions. However, the best result is achieved by merging the Tight-Face and Loose-Face models together: the dynamics and appearance of the loose face data, which include the head gestures, are still useful, even though they are less accurate than the tight faces for depression analysis.
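To make the evaluation protocol concrete, here is a small numpy sketch of the median aggregation, the weighted merge (Tight-Face clip predictions counted twice, as done for AVEC2013; AVEC2014 uses a plain median over both models), and the RMSE/MAE metrics. All numbers in the example are made-up placeholders, not results from the paper.

```python
"""Minimal numpy sketch of the evaluation protocol described above."""
import numpy as np

def video_score(clip_preds):
    # Median over all 16-frame clips of one video (robust to outlier clips).
    return float(np.median(clip_preds))

def merged_video_score(tight_preds, loose_preds, tight_weight=2):
    # Weighted merge: Tight-Face clip predictions counted `tight_weight`
    # times, then a single median over the pooled predictions.
    pooled = list(tight_preds) * tight_weight + list(loose_preds)
    return float(np.median(pooled))

def rmse(y_true, y_pred):
    e = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(e ** 2)))

def mae(y_true, y_pred):
    e = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(np.abs(e)))

# Placeholder per-video clip predictions and BDI-II labels.
tight = {"vid_a": [12.5, 14.0, 13.2], "vid_b": [28.1, 26.4, 27.3]}
loose = {"vid_a": [16.0, 15.2, 14.8], "vid_b": [24.0, 25.5, 23.9]}
labels = {"vid_a": 13, "vid_b": 27}

preds = {v: merged_video_score(tight[v], loose[v]) for v in labels}
y_true = [labels[v] for v in labels]
y_pred = [preds[v] for v in labels]
print("RMSE %.2f  MAE %.2f" % (rmse(y_true, y_pred), mae(y_true, y_pred)))
```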
Fig. 6. Visualization of the C3D Tight-Face and Loose-Face models, using the method from [45]. C3D captures appearance in the first and last few frames, but then attends only to salient motion in the middle frames.
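The figure was produced with keras-vis [45]; as a rough, hypothetical illustration of the same idea, an input-gradient saliency map for a C3D regressor can be computed as follows. This is a generic sketch, not the paper's exact visualization procedure, and it assumes the `C3DRegressor` sketch given earlier.

```python
"""Generic input-gradient saliency sketch for a C3D regressor (assumption)."""
import torch

def clip_saliency(model, clip):
    """clip: (1, 3, 16, 128, 128) tensor. Returns one saliency map per frame."""
    model.eval()
    clip = clip.clone().requires_grad_(True)
    score, _ = model(clip)          # predicted BDI-II value for the clip
    score.sum().backward()          # gradient of the score w.r.t. the input voxels
    # Max over color channels -> (16, 128, 128): one map per frame.
    return clip.grad.abs().squeeze(0).amax(dim=0)

# Hypothetical usage with the earlier C3DRegressor sketch:
# sal = clip_saliency(model, torch.randn(1, 3, 16, 128, 128))
# print(sal.shape)  # torch.Size([16, 128, 128])
```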
Fig. 7. Comparison with the AVEC2013 competition results. Note that several of the listed methods utilize the audio data while our method uses only visual data. (V) indicates using video data, while (A) indicates using audio only.

Fig. 8. Comparison with the AVEC2014 competition results. Using video only is marked with (V), while audio only is marked with (A).

Their method used the Feature Dynamic History Histogram (FDHH) to capture the dynamics of the temporal movement from the sequence of features in the spatial space. Since it extracts FDHH from 2D convolutional (C2D) features, FDHH could also be applied to the C3D features, which may be examined in the future. The method in [22] was evaluated on only one dataset, AVEC2014. Other than those two methods, the RNN-C3D Tight-Face & Loose-Face model, our best approach, performs better than the other state-of-the-art methods on both the AVEC2013 and AVEC2014 datasets.

V. DISCUSSIONS

The AVEC2013 and AVEC2014 depression datasets were released as part of the Audio-Visual Emotion Challenge and Workshop 2013 and 2014, respectively. Even though our method uses only the visual data without making use of the audio, we compare our results to the competition results, many of which utilize both the audio and the visual data.

Compared to the competition results of the AVEC2013 and AVEC2014 challenges, as shown in Figures 7 and 8, our method has performance comparable to the top visual-based and audio-visual methods on both datasets. As the bars in Figures 7 and 8 represent errors, lower values mean better results.

VI. CONCLUSION

We have proposed a new approach to perform depression analysis from visual data. We have shown that the convolutional 3D model can learn and detect the salient spatiotemporal features to perform depression level prediction. We have also shown that the RNN model, as a global descriptor, can learn from the transitions of the local C3D spatiotemporal features and improve the results further. To the best of our knowledge, this is the first time C3D has been explored for depression level analysis. Our best result is obtained by merging the models learned from tight-cropped and loose face regions. Experiments on both the AVEC2013 and AVEC2014 datasets have shown that our visual-based approach is promising, compared to the state-of-the-art visual-based or audio-visual approaches. In the future, our RNN-C3D can be evaluated on more human behavior understanding problems.

ACKNOWLEDGMENT

The authors thank the organizers of AVEC2013/AVEC2014 for providing the data, Yu Zhu for useful discussions on [16], and the anonymous reviewers for comments that improved the paper.

REFERENCES

[1] National Institute of Mental Health (NIH). (2016) Depression. [Online]. Available: https://www.nimh.nih.gov/health/topics/depression/index.shtml
[2] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Arlington: American Psychiatric Publishing, 2013.
[3] V. A. Lynch and J. B. Duval, Forensic Nursing Science. Elsevier Health Sciences, 2010.
[4] "A cross-modal review of indicators for depression detection systems," in Proc. of the Fourth Workshop on Computational Linguistics and Clinical Psychology, Vancouver, Canada. [Online]. Available: http://aclweb.org/anthology/W17-3101
[5] D. J. France, R. G. Shiavi, S. Silverman, M. Silverman, and D. M. Wilkes, "Acoustical properties of speech as indicators of depression and suicidal risk," IEEE Trans. on Biomedical Engineering, vol. 47, no. 7, pp. 829–837, 2000.
[6] J. C. Mundt, A. P. Vogel, D. E. Feltner, and W. R. Lenderking, "Vocal acoustic biomarkers of depression severity and treatment response," Biological Psychiatry, vol. 72, no. 7, pp. 580–587, 2012.
[7] J. R. Williamson, T. F. Quatieri, B. S. Helfer, R. Horwitz, B. Yu, and D. D. Mehta, "Vocal biomarkers of depression based on motor incoordination," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge. ACM, 2013, pp. 41–48.
[8] H. Meng, D. Huang, H. Wang, H. Yang, M. Al-Shuraifi, and Y. Wang, "Depression recognition based on dynamic facial and vocal expression features using partial least square regression," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge, 2013, pp. 21–30.
[9] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, and J. Epps, "Diagnosis of depression by behavioural signals: a multimodal approach," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge, 2013, pp. 11–20.
[10] Y. Yang, C. Fairbairn, and J. F. Cohn, "Detecting depression severity from vocal prosody," IEEE Trans. on Affective Computing, vol. 4, no. 2, pp. 142–150, 2013.
[11] J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen, M. T. Padilla, F. Zhou, and F. D. La Torre, "Detecting depression from facial actions and vocal prosody," in Affective Computing and Intelligent Interaction and Workshops. IEEE, 2009, pp. 1–7.
[12] L. Chao, J. Tao, M. Yang, Y. Li, and J. Tao, "Multi-task sequence learning for depression scale prediction from video," in 2015 Int'l Conf. on Affective Computing and Intelligent Interaction, 2015, pp. 526–531.
[13] I. H. Jones and M. Pansa, "Some nonverbal aspects of depression and schizophrenia occurring during the interview," The Journal of Nervous and Mental Disease, vol. 167, no. 7, pp. 402–409, 1979.
[14] H. Ellgring, Non-verbal Communication in Depression. Cambridge University Press, 2007.
[15] R. L. Birdwhistell, "Toward analyzing American movement," Nonverbal Communication, pp. 134–143, 1974.
[16] Y. Zhu, Y. Shang, Z. Shao, and G. Guo, "Automated depression diagnosis based on deep networks to encode facial appearance and dynamics," IEEE Trans. on Affective Computing, vol. 8, no. 99, pp. 1–9, 2017.
[17] J. M. Girard, J. F. Cohn, M. H. Mahoor, S. M. Mavadati, Z. Hammal, and D. P. Rosenwald, "Nonverbal social withdrawal in depression: Evidence from manual and automatic analyses," Image and Vision Computing, vol. 32, no. 10, pp. 641–647, 2014.
[18] N. Firth, "Computers diagnose depression from our body language," New Scientist, vol. 217, no. 2910, pp. 18–19, 2013.
[19] S. Alghowinem, R. Goecke, M. Wagner, G. Parker, and M. Breakspear, "Eye movement analysis for depression detection," in the 20th IEEE Int'l Conf. on Image Processing, 2013, pp. 4220–4224.
[20] A. Pampouchidou, P. Simos, K. Marias, F. Meriaudeau, F. Yang, M. Pediaditis, and M. Tsiknakis, "Automatic assessment of depression based on visual cues: A systematic review," IEEE Trans. on Affective Computing, vol. PP, no. 99, pp. 1–1, 2017.
[21] H. Kaya and A. A. Salah, "Eyes whisper depression: A CCA based multimodal approach," in Proc. of the 22nd ACM Int'l Conf. on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 961–964. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654978
[22] A. Jan, H. Meng, Y. F. A. Gaus, and F. Zhang, "Artificial intelligent system for automatic depression level analysis through visual and vocal expressions," IEEE Trans. on Cognitive and Developmental Systems, vol. PP, no. 99, pp. 1–1, 2017.
[23] X. Ma, D. Huang, Y. Wang, and Y. Wang, "Cost-sensitive two-stage depression prediction using dynamic visual clues," in Computer Vision – ACCV 2016, S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, Eds. Cham: Springer Int'l Publishing, 2017, pp. 338–351.
[24] A. McPherson and C. R. Martin, "A narrative review of the Beck Depression Inventory (BDI) and implications for its use in an alcohol-dependent population," Journal of Psychiatric and Mental Health Nursing, vol. 17, no. 1, pp. 19–30, 2010. [Online]. Available: http://dx.doi.org/10.1111/j.1365-2850.2009.01469.x
[25] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "C3D: Generic features for video analysis," CoRR, vol. abs/1412.0767, 2014. [Online]. Available: http://arxiv.org/abs/1412.0767
[26] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, "Recurrent neural networks for emotion recognition in video," in Proc. of the 2015 ACM Int'l Conf. on Multimodal Interaction, 2015.
[27] Y. Fan, X. Lu, D. Li, and Y. Liu, "Video-based emotion recognition using CNN-RNN and C3D hybrid networks," in Proc. of the 18th ACM Int'l Conf. on Multimodal Interaction, 2016.
[28] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in IEEE Conf. on CVPR, 2008, pp. 1–8.
[29] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Proc. of the 6th ACM Int'l Conf. on Image and Video Retrieval, 2007, pp. 401–408.
[30] A. Jan, H. Meng, Y. F. A. Gaus, F. Zhang, and S. Turabzadeh, "Automatic depression scale prediction using facial expression dynamics and regression," in Proc. of the ACM 4th Int'l Workshop on Audio/Visual Emotion Challenge, 2014, pp. 73–80.
[31] H. Kaya, F. Çilli, and A. A. Salah, "Ensemble CCA for continuous emotion prediction," in Proc. of the ACM 4th Int'l Workshop on Audio/Visual Emotion Challenge, 2014, pp. 19–26.
[32] H. Pérez Espinosa, H. J. Escalante, L. Villaseñor-Pineda, M. Montes-y-Gómez, D. Pinto-Avedaño, and V. Reyez-Meza, "Fusing affective dimensions and audio-visual features from segmented video for depression recognition," in Proc. of the ACM 4th Int'l Workshop on Audio/Visual Emotion Challenge, 2014, pp. 49–55.
[33] R. Gupta, S. Sahu, C. Espy-Wilson, and S. Narayanan, "An affect prediction approach through depression severity parameter incorporation in neural networks," pp. 3122–3126, 2017.
[34] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conf. on CVPR, June 2009, pp. 248–255.
[35] J. Joshi, A. Dhall, R. Goecke, and J. F. Cohn, "Relative body parts movement for automatic depression analysis," in 2013 Humaine Association Conf. on Affective Computing and Intelligent Interaction, Sept 2013, pp. 492–497.
[36] J. Joshi, R. Goecke, G. Parker, and M. Breakspear, "Can body expressions contribute to automatic depression analysis?" in 2013 10th IEEE Int'l Conf. and Workshops on Automatic Face and Gesture Recognition (FG), April 2013, pp. 1–7.
[37] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[38] G. Taylor, "Recurrent neural network implemented with Theano," https://github.com/gwtaylor/theano-rnn, 2012.
[39] Q. V. Le, N. Jaitly, and G. E. Hinton, "A simple way to initialize recurrent networks of rectified linear units," CoRR, vol. abs/1504.00941, 2015. [Online]. Available: http://arxiv.org/abs/1504.00941
[40] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, "AVEC 2013: The continuous audio/visual emotion and depression recognition challenge," in Proc. of the 3rd ACM Int'l Workshop on Audio/Visual Emotion Challenge. ACM, 2013, pp. 3–10.
[41] M. Valstar, B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, and M. Pantic, "AVEC 2014: 3D dimensional affect and depression recognition challenge," in Proc. of the 4th Int'l Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 3–10.
[42] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," CoRR, vol. abs/1503.08909, 2015. [Online]. Available: http://arxiv.org/abs/1503.08909
[43] T. Baltrusaitis, P. Robinson, and L.-P. Morency, "Constrained local neural fields for robust facial landmark detection in the wild," in Proc. of the 2013 IEEE Int'l Conf. on Computer Vision Workshops, 2013.
[44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[45] R. Kotikalapudi and contributors, "keras-vis," https://github.com/raghakot/keras-vis, 2017.
[46] M. Kächele, M. Glodek, D. Zharkov, S. Meudt, and F. Schwenker, "Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression," in Proc. of the 3rd Int'l Conf. on Pattern Recognition Applications and Methods, 2014.
[47] L. Wen, X. Li, G. Guo, and Y. Zhu, "Automated depression diagnosis based on facial dynamic analysis and sparse coding," IEEE Trans. on Information Forensics and Security, vol. 10, no. 7, pp. 1432–1441, 2015.

Mohamad Al Jazaery received the B.E. degree in computer science from Damascus University, Damascus, Syria, in 2012. He is currently a CS graduate student at West Virginia University (WVU), working as a Research Assistant in the Computer Sciences Department. His research interests include computer vision, deep learning, machine learning, and face recognition.

Guodong Guo (M'07-SM'07) received the B.E. degree in automation from Tsinghua University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree in computer science from the University of Wisconsin-Madison, Madison, WI, USA. He is an Associate Professor with the Department of Computer Science and Electrical Engineering, West Virginia University (WVU), Morgantown, WV, USA. In the past, he visited and worked at several places, including INRIA, Sophia Antipolis, France; Ritsumeikan University, Kyoto, Japan; Microsoft Research, Beijing, China; and North Carolina Central University. He authored a book, Face, Expression, and Iris Recognition Using Learning-based Approaches (2008), co-edited two books, Mobile Biometrics (2017) and Support Vector Machines Applications (2014), and published about 100 technical papers. His research interests include computer vision, machine learning, and multimedia. He received the North Carolina State Award for Excellence in Innovation in 2008, Outstanding Researcher (2017-2018, 2013-2014), and New Researcher of the Year (2010-2011) at CEMR, WVU. He was selected as the "People's Hero of the Week" by BSJB under the Minority Media and Telecommunications Council (MMTC) on July 29, 2013. Two of his papers were selected as "The Best of FG'13" and "The Best of FG'15", respectively.