Transactions on Affective Computing
Abstract—As a serious mood disorder, depression causes severe symptoms that affect how people feel, think, and handle daily activities, such as sleeping, eating, or working. In this paper, a novel framework is proposed to estimate Beck Depression Inventory II (BDI-II) values from video data. It uses a 3D convolutional neural network to automatically learn spatiotemporal features at two different scales of the face region. Then, a Recurrent Neural Network (RNN) is used to learn further from the sequence of spatiotemporal information. This formulation, called RNN-C3D, can model the local and global spatiotemporal information from consecutive facial expressions in order to predict depression levels. Experiments on the AVEC2013 and AVEC2014 depression datasets show that the proposed approach is promising when compared to state-of-the-art visual-based depression analysis methods.

Index Terms—Automated Visual-based depression analysis, nonverbal behavior, 3D convolutional Neural Network (C3D), Recurrent Neural Network (RNN).

Mohamad AL Jazaery and Guodong Guo (Corresponding Author, E-mail: [email protected]) are with the Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA.

I. INTRODUCTION

Major depressive disorder (MDD) is a mental illness characterized by low confidence, loss of interest in normally fun activities, and low energy without a specific reason [1]. MDD affected approximately 216 million people (3% of the world's population) in 2015. Females are affected about twice as often as males. It can also negatively affect a person's work or school life, as well as sleeping, eating habits, and general health [1], [2]. Up to 60% of people who die by suicide had depression or other mood disorders [3]. As more people are badly affected, an accurate MDD diagnosis becomes a priority. Thus, automated machine systems need to be developed to give an objective assessment and a quick analysis of mood disorders, which can lead to better and more timely MDD therapy.

Machine-based health assessment systems offer an easy way to track a person's depression status online through human-machine interaction [4]. Specifically, automated mental health systems can detect and analyze the audio and visual behaviors that are particularly related to depression. Speech characteristics can be useful for depression analysis [5], [6], and some depression analysis methods have been developed using audio data [7], [8], [9], [10], [11], [12]. Studies show that visual expressions and gestures are very important in depression analysis as well [13], [14]. Specifically, many of the methods focus on the face region because, in human activities, more than half of the visual-based nonverbal behaviors are face and head gestures [15], [16], [17], [18], [19], [20], which motivated us to extract deep features using two different face visual cues. Our first visual cue uses a tightly cropped, aligned face region and focuses mostly on facial expressions. The second cue uses a relatively larger face region with the full head included, which helps capture head gestures and more dynamics around the face region. Even though audio-visual fusion shows better results than using visual data only to analyze depression [21], [22], [23], our work focuses on visual-based nonverbal data without utilizing the audio.

Depression diagnosis is usually based on the patients' verbal and action behaviors, as reported by the patients or their friends. The Beck Depression Inventory-II (BDI-II), an estimation method for depression levels [24], has depression values ranging from 0 to 63, where 0-13 implies no depression, 14-19 mild depression, 20-28 moderate depression, and 29-63 severe depression. As a result, Automatic Depression Level Prediction (ADLP) can be formulated as a regression or a multi-class classification problem.

The aim of this paper is to predict depression levels from a person's visual expression. A convolutional 3D network (C3D) [25] is used to model both the dynamics and the appearance in the video data, in which people perform Human-Computer Interaction tasks, responding to a number of questions such as: What is your favorite dish? What was your best gift, and why? Discuss a sad childhood memory. The ability of the C3D to learn both the salient temporal and spatial information from a sequence of frames makes it appropriate for analyzing human visual behavior and predicting depression levels. After the deep spatiotemporal features learned by the C3D net are extracted, a Recurrent Neural Network (RNN) [26], [27] is used to learn the depression levels further. To the best of our knowledge, this is the first work to explore the C3D and RNN methods for ADLP in videos.

Our extensive experiments on both the AVEC2013 and AVEC2014 datasets show that our approach is promising in comparison with the state-of-the-art visual-based depression analysis methods.

In the following, the related works and methods are reviewed briefly. Then, in Section III, the proposed framework is presented with details about its architecture and submodules. After that, the evaluation of our method is demonstrated [...] our framework. Finally, Sections V and VI present some [...]
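The BDI-II severity ranges quoted in the introduction show how the regression target could also be read as a four-class label. The following is a minimal sketch of that mapping; the function name `bdi_severity` and the class names are illustrative assumptions, not part of the paper's method.

```python
def bdi_severity(score: int) -> str:
    """Map a BDI-II score (0-63) to the severity band quoted in the text.

    The band names returned here are illustrative labels (an assumption).
    """
    if not 0 <= score <= 63:
        raise ValueError("BDI-II scores range from 0 to 63")
    if score <= 13:
        return "none"      # 0-13: no depression
    if score <= 19:
        return "mild"      # 14-19: mild depression
    if score <= 28:
        return "moderate"  # 20-28: moderate depression
    return "severe"        # 29-63: severe depression
```

Under the regression formulation, a predicted score would be rounded and clipped to 0-63 before applying such a mapping.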
Fig. 2. The flow diagram of the proposed method for predicting the Beck Depression Inventory (BDI) score using Deep C3D and RNN in videos. Two different preprocessing methods are used to extract face features at two different scales. Then the RNN learns the dynamics further.
Each fully connected layer has 512 output units. The input is
a series of 16 RGB video frames with 128 × 128 pixels in the
spatial dimensions. The C3D structure is illustrated in Fig. 4.
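To make the input format concrete, the sketch below shows one way a video could be cut into 16-frame clips of 128 × 128 RGB frames. The stride between clips and the channels-first layout are assumptions for illustration; the text does not state either, and the helper names are hypothetical.

```python
CLIP_LEN = 16      # frames per clip, as stated in the text
FRAME_SIZE = 128   # spatial size in pixels, as stated in the text
CHANNELS = 3       # RGB


def clip_ranges(num_frames, stride=CLIP_LEN):
    """Return (start, end) frame-index ranges for every full 16-frame clip.

    `stride` is an assumption: stride == CLIP_LEN gives non-overlapping
    clips, while a smaller stride gives overlapping clips.
    """
    last_start = num_frames - CLIP_LEN
    return [(s, s + CLIP_LEN) for s in range(0, last_start + 1, stride)]


def clip_shape():
    """Shape of one clip in an assumed (channels, frames, height, width) layout."""
    return (CHANNELS, CLIP_LEN, FRAME_SIZE, FRAME_SIZE)
```

For example, `clip_ranges(50)` yields three non-overlapping clips and drops the last two frames, since they do not fill a complete clip.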
As shown in [35], [36], the head gestures are good visual
cues along with the facial expressions for depression analysis.
In addition to a C3D model trained on tight, aligned faces, another C3D model is trained on larger face regions in order to explore the dynamics of faces and heads in depression analysis. From now on, the model trained on aligned, tight faces is called the C3D Tight-Face model, and the model trained on larger face regions is called the C3D Loose-Face model.
Fig. 4. C3D deep structure: 8 convolutions, 5 max-poolings, 2 fully connected layers, and a Euclidean loss output layer.
TABLE I (AVEC2013)

Our Methods                                                  RMSE    MAE
C3D Tight-Face                                               9.64    7.50
RNN-C3D Tight-Face                                           9.33    7.39
C3D Loose-Face                                              10.04    8.15
RNN-C3D Loose-Face                                           9.84    7.95
C3D Tight-Face & Loose-Face weighted merge (2 models)        9.49    7.40
RNN-C3D Tight-Face & Loose-Face weighted merge (2 models)    9.28    7.37

TABLE II
DEPRESSION RECOGNITION RESULTS ON AVEC2014 (TEST SET).

Our Methods                                                  RMSE    MAE
C3D Tight-Face                                               9.66    7.48
RNN-C3D Tight-Face                                           9.49    7.41
C3D Loose-Face                                               9.81    7.73
RNN-C3D Loose-Face                                           9.76    7.66
C3D Tight-Face & Loose-Face weighted merge (2 models)        9.32    7.29
RNN-C3D Tight-Face & Loose-Face weighted merge (2 models)    9.20    7.22

[...] around 80K and 30K 16-frame training clips for AVEC2013 [...]

- Face Detection and Alignment: For the C3D Tight-Face model, we used OpenFace [43] to detect face landmarks. Then, in each frame, we cropped and aligned the facial region using the mouth, ears, and eyes facial landmarks. This setting is used for both training and testing data. On the other hand, all available frames from the AVEC2013 and AVEC2014 training datasets are used to train the C3D Loose-Face model.

2) C3D Training: To initialize the weights of the 3D convolutional layers, we used the weights of a C3D model pre-trained on the Sports-1M dataset and then fine-tuned on UCF101. This model is released by [25]. As mentioned earlier, we changed the last fully-connected layer from 4096 to 512 units to avoid over-fitting and replaced the softmax loss function with a Euclidean distance loss for the regression task. The implementation uses a modified version of the Caffe toolbox [44] that supports 3-Dimensional Convolutional Networks [25]. The learning rate is fixed to $10^{-6}$ and the batch size to 35. The network converged after around 4000 iterations (2 epochs) and around 3000 iterations (4 epochs) for AVEC2013 and AVEC2014, respectively.
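As a quick consistency check on the training schedule above, the stated iteration counts, epoch counts, and batch size imply a number of training clips per epoch. The sketch below only does this arithmetic; the helper name `clips_per_epoch` is hypothetical.

```python
BATCH_SIZE = 35  # batch size stated in the text


def clips_per_epoch(iterations, epochs, batch_size=BATCH_SIZE):
    """Clips seen per epoch, given total iterations and the number of epochs."""
    return iterations * batch_size / epochs


# Figures taken from the text: ~4000 iterations (2 epochs) for AVEC2013,
# ~3000 iterations (4 epochs) for AVEC2014.
avec2013_clips = clips_per_epoch(4000, 2)   # 70000.0
avec2014_clips = clips_per_epoch(3000, 4)   # 26250.0
```

These values are of the same order as the "around 80K and 30K" 16-frame training clips mentioned earlier, assuming the second figure refers to AVEC2014 (the source sentence is truncated) and given that the iteration and epoch counts are themselves approximate.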
3) RNN Training: For every two consecutive 16-frame clips, we extracted the C3D features from the Fc6 layer. The reason for choosing the Fc6 layer rather than Fc7 is that it immediately follows the last 3D convolutional layer, so its features hold some raw spatiotemporal information along with the depression labels. We applied PCA to reduce the feature dimension from 512 to 10. Then, a vanilla RNN [38] with 30 hidden units is used to learn the normalized depression labels from the input sequence of length 2. Specifically, the z-score is used to normalize the depression score labels, after calculating the mean and standard deviation of the training data labels, in order to make the network converge faster. We set the learning rate (LR) to $5 \times 10^{-5}$ and the LR decay to 0.9. ReLU is used as the activation function. The network converged after 10 epochs.

4) Testing and Performance Measure: For the C3D models, the final depression prediction for each testing video is the median of the predictions over all 16-frame sample clips in the video. For the RNN-C3D models, it is the median of the RNN outputs over all possible pairs of consecutive 16-frame clips in the video. Since the median is less affected by outliers than the mean, it is used to limit the influence of outliers among the large number of samples in a video. The overall performance is measured using the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE). The RMSE is computed by $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$, and the MAE is computed by $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$, where $N$ is the total number of video samples, $y_i$ denotes the ground truth of the $i$-th video sample, and $\hat{y}_i$ is the predicted value of the $i$-th video sample.

At first, we explored the performance of each individual C3D deep model, i.e., the Tight-Face or the Loose-Face model. From Table I (AVEC2013), when only the C3D Tight-Face model is used, the RMSE and MAE are 9.64 and 7.50, respectively, while the RMSE and MAE are 10.04 and 8.15 when using the C3D Loose-Face model. From Table II (AVEC2014), when using the C3D Tight-Face model, the MAE and RMSE are 7.48 and 9.66, respectively, while the MAE and RMSE are 7.73 and 9.81, respectively, when using the C3D Loose-Face model. From both the AVEC2013 and AVEC2014 results, one can see that the C3D Tight-Face model is better than the C3D Loose-Face model.

Then, we explored the performance of merging the C3D Tight-Face and Loose-Face models. In AVEC2014, the final merged result for any video is the median over all sub-clips from both models. The resulting RMSE and MAE are 9.32 and 7.29, respectively. On the other hand, to obtain the best merging result for AVEC2013, where the videos are much longer, a double weight is given to the clip predictions of the C3D Tight-Face model. Using this C3D Tight-Face & Loose-Face weighted merge model, the obtained RMSE and MAE are 9.49 and 7.40, respectively, which improve over each single model.

The results highlight that tight faces are better than loose or larger faces for analyzing depression, probably because there is more background noise in the loose or larger face regions. However, the best result is achieved by merging the Tight-Face and Loose-Face models together. The dynamics and appearance of the loose face data, which include head gestures, are useful, even though they are less accurate than the tight faces for depression analysis.
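The video-level aggregation, the merge of the two models, and the two error measures described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and implementing the "double weight" by repeating the Tight-Face clip predictions before taking the median is an assumption about how that weighting could be realized.

```python
import math
from statistics import median


def video_prediction(clip_preds):
    """Video-level score: median of the clip-level predictions."""
    return median(clip_preds)


def merged_prediction(tight_preds, loose_preds, tight_weight=1):
    """Median over the clips of both models.

    Assumption: a weight of k for the Tight-Face model is realized by
    counting each of its clip predictions k times.
    """
    pooled = list(tight_preds) * tight_weight + list(loose_preds)
    return median(pooled)


def rmse(y_true, y_pred):
    """Root Mean Square Error over N videos."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean Absolute Error over N videos."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
```

With `tight_weight=1` this corresponds to the plain median over all sub-clips described for AVEC2014, and with `tight_weight=2` to the double weighting described for AVEC2013. If the models are trained on z-score-normalized labels, predictions would need to be mapped back to the BDI-II scale before computing either error.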
Fig. 6. Visualization of the C3D Tight-Face and Loose-Face models, using the method from [45]. C3D captures appearance for the first and last few frames but then only attends to salient motion in the middle frames.
Fig. 7. Comparison with the AVEC2013 competition results. Note that several of the listed methods utilize the audio data while our method only uses visual data. (V) indicates using video data, while (A) indicates audio only.

Fig. 8. Comparison with the AVEC2014 competition results. Methods using video only are marked with (V), while methods using audio are marked with (A).
Their method used the Features Dynamic History Histogram (FDHH) to capture the dynamics of the temporal movement from the sequence of the features in the spatial space. Since it extracts FDHH from C2D features, the FDHH can also be applied to the C3D features, which may be examined in the future. The method in [22] was evaluated on only one dataset, AVEC2014. Other than those two methods, the RNN-C3D Tight-Face & Loose-Face model, our best approach, performs better than the other state-of-the-art methods on both the AVEC2013 and AVEC2014 datasets.

In the future, our RNN-C3D can be evaluated on more human behavior understanding problems.

ACKNOWLEDGMENT

The authors thank the organizers of AVEC2013/AVEC2014 for providing the data, Yu Zhu for useful discussions on [16], and anonymous reviewers for comments to improve the paper.

REFERENCES
