A Hybrid Learning-Architecture for Mental Disorder Detection Using Emotion Recognition
Final Report
Bachelor of Technology
by
ABHIJITH ES (VML21CS006)
under the supervision of
Ms. SUVARNA V M
Assistant Professor
CERTIFICATE
This is to certify that the report entitled A Hybrid Learning-Architecture for Mental
Disorder Detection Using Emotion Recognition submitted by ABHIJITH ES
(VML21CS006) to the APJ Abdul Kalam Technological University in partial
fulfillment of the B.Tech degree in Computer Science and Engineering is a bona fide
record of the seminar work carried out by him under our guidance and supervision.
This report, in any form, has not been submitted to any other University or Institute for any purpose.
(Office Seal)
DECLARATION
ABHIJITH ES
CHEMPERI
VML21CS006
14-10-2024
ACKNOWLEDGEMENT
ABHIJITH ES
VML21CS006
Abstract
Mental illness has grown to become a prevalent and global health concern that affects
individuals across various demographics. Timely detection and accurate diagnosis of
mental disorders are crucial for effective treatment and support, as late diagnosis can
result in suicidal or harmful behaviors and, ultimately, death. To this end, the present
study introduces a novel pipeline for the analysis of facial expressions, leveraging both
the AffectNet and 2013 Facial Emotion Recognition (FER) datasets. Consequently,
this research goes beyond traditional diagnostic methods by contributing a system
capable of generating a comprehensive mental disorder dataset and concurrently
predicting mental disorders based on facial emotional cues. In particular, I introduce
a hybrid architecture for mental disorder detection that leverages the state-of-the-art
object detection algorithm YOLOv8 to detect and classify visual cues associated
with specific mental disorders. To achieve accurate predictions, an integrated learning
architecture based on the fusion of Convolutional Neural Networks (CNNs) and Vision
Transformer (ViT) models is developed to form an ensemble classifier that predicts
the presence of mental illness. To ensure transparency and interpretability, I integrate
techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) and
saliency maps to highlight the regions in the input image that contribute most to
the model's predictions, thus providing healthcare professionals with a clear
understanding of the features influencing the system's decisions and thereby
fostering trust and a more informed diagnostic process.
Contents

Abstract
Nomenclature
1 Introduction
  1.1 Overview
  1.2 General Background
    1.2.1 Mental Disorder Detection
  1.3 Problem statement
  1.4 Scope of the system
  1.5 Objective
2 Related Works
  2.1 PredictEYE: Personalized Time Series Model for Mental State Prediction Using Eye Tracking
    2.1.1 Abstract
    2.1.2 Methodology
    2.1.3 Conclusion
  2.2 Automated Detection of Human Mental Disorder
    2.2.1 Abstract
    2.2.2 Methodology
    2.2.3 Conclusion
  2.3 Machine Learning in ADHD and Depression Mental Health Diagnosis
    2.3.1 Abstract
    2.3.2 Methodology
    2.3.3 Conclusion
  2.4 An Investigation on the Audio-Video Data Based Estimation of Emotion Regulation Difficulties and Their Association With Mental Disorders
    2.4.1 Abstract
    2.4.2 Methodology
    2.4.3 Conclusion
3 Methodology
  3.1 Technology
    3.1.1 Deep Learning
    3.1.2 YOLOv8 Detection Model
    3.1.3 Convolutional Neural Networks (CNNs)
    3.1.4 Vision Transformers (ViT)
    3.1.5 Ensemble Learning
    3.1.6 Explainable AI (XAI) Tools
  3.2 Datasets
  3.3 Implementation details
4 Working
  4.1 Working principle
  4.2 Flow Chart
5 Results and Discussion
6 Conclusion
References

List of Figures

3.1 Structure of the analysis pipeline for human mental disorders detection through facial emotion recognition
4.1 A flow chart of the predictive logic adopted in this study

Nomenclature
Chapter 1
Introduction
1.1 Overview
1.2 General Background
The background of the paper highlights the need for innovative tools in mental
healthcare, particularly for early detection of disorders like depression and anxiety.
Current diagnostic methods rely on subjective, time-consuming processes such as self-
reporting and interviews, leading to risks of late diagnosis or misdiagnosis. Advances
in AI and machine learning offer potential for automated diagnostic tools, especially
in mental health [2]. Deep learning, particularly through facial emotion recognition
(FER) systems [1], can analyze facial expressions to detect emotions like happiness,
sadness, and fear, which are critical indicators of mental state. The paper proposes
a hybrid model that integrates Convolutional Neural Networks (CNNs) [2] [1] and
Vision Transformers (ViTs). CNNs excel at feature extraction from images, while
ViTs capture complex visual relationships. By combining these, the study improves
detection of facial expressions linked to mental disorders. Grad-CAM and saliency
maps are used to highlight the facial regions that influence the model's predictions,
making the results easier for healthcare professionals to interpret. The system is trained on datasets
like AffectNet and FER 2013 [1], categorizing emotions into mental health conditions
like anxiety and depression. The hybrid model achieves 81% accuracy, offering a
promising tool for early detection and better clinical decision-making.
1.2.1 Mental Disorder Detection
AI systems can uncover patterns invisible to human observers, enabling
earlier and more objective diagnosis. Techniques like Grad-CAM and saliency
maps improve transparency, making AI predictions more interpretable for healthcare
professionals. This approach offers faster, more accurate mental health diagnosis,
supporting clinical decisions.
1.3 Problem statement
Mental health disorders, such as depression and anxiety, are rapidly increasing
global health concerns, affecting millions of people across various demographics.
Current diagnostic methods for these disorders largely rely on subjective assessments,
interviews, and observation by mental health professionals, which can lead to delayed
or inaccurate diagnoses. These traditional approaches are time-consuming and prone
to human error, contributing to inconsistent treatment and care. Furthermore, mental
health conditions often manifest subtly, making early detection difficult. There is
a critical need for objective, efficient, and accurate tools that can assist healthcare
professionals in identifying mental disorders at an early stage. The challenge is
to develop a system that can automatically detect mental health disorders through
non-invasive, data-driven methods, such as analyzing facial expressions, speech, and
behavior [1] [4]. Such a system must not only be accurate but also transparent
and interpretable, allowing clinicians to trust and understand the AI’s decisions.
Without these advancements, mental health diagnoses will continue to be subjective,
inconsistent, and limited in their ability to provide timely interventions.
1.4 Scope of the system
The mental disorder detection system aims to enhance early diagnosis and treatment
of conditions like depression and anxiety using advanced AI techniques. It leverages
deep learning models like CNNs and Vision Transformers (ViTs) to analyze facial
expressions and detect emotional cues such as sadness, anger, or fear, which can
indicate mental health issues [1] [2]. The system primarily focuses on facial
emotion recognition but can also integrate additional data like speech patterns and
behavioral information to improve accuracy. Explainability is a key feature, with
techniques like Grad-CAM and saliency maps ensuring healthcare professionals can
interpret the AI’s decisions, increasing transparency and trust in clinical settings.
Designed for scalability, the system is adaptable for use in clinical environments,
telemedicine platforms, and mobile health applications. Its non-invasive nature
makes it accessible for diverse populations and suitable for remote mental health
assessments, particularly in underserved or rural areas with limited access to mental
health professionals. Remote monitoring enables real-time assessments without in-
person visits and allows for long-term tracking of a patient’s emotional state [4].
This can help clinicians adjust treatment plans based on patterns of mental health
improvement or deterioration. Beyond diagnostics, the system provides real-time
feedback during therapy sessions [4], helping therapists adjust their approach based
on a patient’s emotional cues, potentially leading to more effective treatment. By
incorporating continuous monitoring and real-time emotional analysis [4], the system
serves as both a diagnostic tool and a comprehensive support system for mental health
management, crisis intervention, and ongoing research in the field.
1.5 Objective
Chapter 2
Related Works
2.1 PredictEYE: Personalized Time Series Model for Mental State Prediction Using Eye Tracking
2.1.1 Abstract
Mental health is crucial for emotional, psychological, and social well-being, and early
intervention can help manage mental illnesses. The paper introduces a personalized
time series model called PredictEYE, designed to predict an individual’s mental
state and identify the specific scene responsible by analyzing eye-tracking data while
watching calm and stressful videos. The model uses a Long Short-Term Memory
(LSTM) deep learning regression model to predict feature sequences and a Random
Forest algorithm for mental state prediction, achieving 86.4% accuracy. PredictEYE
focuses on individual eye-tracking data, as it provides unique insights into mental
states, outperforming models that compare multiple participants. The model is
adaptable for continuous, non-invasive monitoring using webcam-based eye tracking,
making it suitable for various mental health applications.
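As an illustration of the two-stage design summarized above (an LSTM forecasting eye-tracking feature sequences, followed by a Random Forest for mental state prediction), the following minimal sketch uses synthetic data; all shapes, hyperparameters, and labels are assumptions of this illustration, not the paper's actual configuration.

```python
# Minimal sketch of PredictEYE's two-stage idea: an LSTM forecasts eye-tracking
# features; a Random Forest maps sequence summaries to a mental state label.
# All data here is synthetic and purely illustrative.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

SEQ_LEN, N_FEAT = 30, 4          # e.g. gaze x/y, pupil size, fixation duration

class EyeLSTM(nn.Module):
    def __init__(self, n_feat: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_feat)  # regress the next feature vector

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])           # prediction from last time step

# Stage 1: train the LSTM to forecast the next feature vector (regression).
x = torch.randn(256, SEQ_LEN, N_FEAT)          # synthetic sequences
y = torch.randn(256, N_FEAT)                   # synthetic "next step" targets
model = EyeLSTM(N_FEAT)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                             # a few epochs for illustration
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

# Stage 2: a Random Forest classifies mental state (calm vs. stressed)
# from summary statistics of the feature sequences.
feats = x.mean(dim=1).numpy()                  # crude per-sequence summary
labels = np.random.randint(0, 2, size=256)     # synthetic calm/stress labels
clf = RandomForestClassifier(n_estimators=100).fit(feats, labels)
print("train accuracy:", clf.score(feats, labels))
```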
2.1.2 Methodology
2.1.3 Conclusion
2.2 Automated Detection of Human Mental Disorder
2.2.1 Abstract
The increasing pressures of daily life can lead to stress, anxiety, and mood swings,
which may escalate into depression and more severe mental health issues.
Unfortunately, these emotional changes are often overlooked until it is too late, resulting
in severe consequences like suicidal intentions. This work presents a model that
detects and classifies observable facial behaviors to assess mental health. Using
a Haar feature-based cascade classifier, the model extracts facial features from the
FER+ dataset, while the VGG model classifies individuals as normal or abnormal.
For those identified as abnormal, the model predicts specific conditions such as
depression or anxiety based on their facial expressions. This approach aims to provide
timely assistance and support, achieving an overall prediction accuracy of 95%. By
facilitating early detection of emotional disturbances, the proposed model serves as a
valuable diagnostic tool for assessing mental health through facial expressions.
2.2.2 Methodology
For emotion classification, the model employs the VGG16 architecture through transfer learning. This pre-trained CNN reduces
computational costs and allows for fine-tuning the final layers to specialize in facial
emotion recognition relevant to mental health assessment. The model identifies
emotions such as fear, sadness, anger, and disgust, interpreting these classifications
in a mental health context. For instance, combinations of sadness and anger may
suggest depression, while fear and disgust could signal anxiety. The model’s
performance is evaluated using precision, recall, F1-score, and overall accuracy
metrics, achieving a 95% prediction accuracy. This underscores its potential for real-
world applications, particularly for early detection of mental disorders, facilitating
timely intervention. Overall, the methodology integrates multiple machine learning
techniques and preprocessing steps to accurately detect human mental disorders
through facial expressions.
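A minimal sketch of the transfer-learning setup summarized above is given below, using torchvision's VGG16 weights; the frozen backbone, the eight-class FER+ head, and the input size are assumptions of this illustration.

```python
# Sketch of VGG16 transfer learning for facial emotion recognition, as
# outlined above; the head size (8 FER+ classes) is an assumption.
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 8  # assumed FER+ label set

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the convolutional backbone; only the new head is fine-tuned.
for p in vgg.features.parameters():
    p.requires_grad = False

# Replace the final classification layer with an emotion head.
vgg.classifier[6] = nn.Linear(vgg.classifier[6].in_features, NUM_EMOTIONS)

optimizer = torch.optim.Adam(
    (p for p in vgg.parameters() if p.requires_grad), lr=1e-4
)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 RGB face crops.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_EMOTIONS, (8,))
optimizer.zero_grad()
loss = criterion(vgg(images), labels)
loss.backward()
optimizer.step()
```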
2.2.3 Conclusion
2.3 Machine Learning in ADHD and Depression Mental Health Diagnosis
2.3.1 Abstract
This paper examines machine learning methods for identifying Attention Deficit
Hyperactivity Disorder (ADHD) and depression, both of which have become
increasingly prevalent worldwide, partly due to the COVID-19 pandemic. Depression
affects about 19.7% of individuals over 16, while ADHD impacts approximately
7.2% of all age groups. The study explores various training and testing modalities,
including functional Magnetic Resonance Imaging (fMRI), Electroencephalography
(EEG), medical notes, video, and speech, to enhance the identification of these mental
health conditions. With rising mental health awareness and limited access to in-person
clinics, especially during the shift to remote consultations, there is a growing need for
reliable AI-based technologies to support healthcare systems, particularly in developed
countries.
2.3.2 Methodology
The methodology outlined in the paper on machine learning for diagnosing ADHD
and depression focuses on integrating diverse data sources and advanced algorithms
to improve detection and classification of these mental health disorders. It employs
both wearable devices, like EEG sensors for continuous brain activity monitoring,
and non-wearable technologies, such as functional Magnetic Resonance Imaging
(fMRI), which provide insights into brain structure and function. This multimodal
approach captures both dynamic and static aspects of mental health conditions.
Central to the methodology are machine learning models, particularly Support Vector
Machines (SVMs) and Convolutional Neural Networks (CNNs). SVMs excel in high-
dimensional spaces, making them suitable for complex neuroimaging and EEG data,
while CNNs analyze video data related to facial expressions and body language, which
are significant for mental health diagnostics. These models are trained on labeled
datasets confirming ADHD or depression diagnoses, allowing them to learn relevant
patterns and features. Feature extraction plays a critical role, with EEG data analyzed
for frequency band features linked to abnormal brain activity, and fMRI data assessed
for functional connectivity metrics between brain regions. Additionally, natural
language processing techniques are applied to medical notes and clinical histories
to extract vital symptom-related information. To address data scarcity, especially
concerning sensitive health information, data augmentation techniques modify existing
data to create additional training samples through methods like flipping, rotating, or
adding noise. This comprehensive multimodal approach enhances model accuracy by
broadening the feature set for analysis. Model performance is evaluated using metrics
like accuracy, precision, recall, and F1-score, and robustness is improved through
hyperparameter fine-tuning and regularization techniques to prevent overfitting.
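The augmentation techniques mentioned above (flipping, rotating, adding noise) can be sketched with torchvision transforms; the particular operations and parameters below are illustrative assumptions.

```python
# Illustrative data augmentation of the kind described above for enlarging
# scarce training sets; the chosen parameters are assumptions.
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Adds zero-mean Gaussian noise to a tensor image."""
    def __init__(self, std: float = 0.05):
        self.std = std

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.randn_like(x) * self.std

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # flipping
    transforms.RandomRotation(degrees=10),    # rotating
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05),               # adding noise
])

# Applying `augment` to each training image several times yields additional
# distinct samples from the same underlying data.
```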
2.3.3 Conclusion
The conclusion of the paper on machine learning in diagnosing ADHD and depression
highlights the potential of AI-driven approaches to improve mental health diagnostics
by utilizing diverse data sources, including EEG, fMRI, clinical notes, and audio-
visual inputs. Techniques like Support Vector Machines (SVMs) and Convolutional
Neural Networks (CNNs) enhance accuracy and provide deeper insights into ADHD
and depression. However, challenges remain, such as small and imbalanced datasets,
as well as privacy and ethical concerns regarding sensitive mental health data. Future
research should focus on refining data collection methods and expanding datasets
to improve the accessibility and scalability of AI-based diagnostic tools for early
detection and treatment of mental health conditions.
2.4 An Investigation on the Audio-Video Data Based Estimation of Emotion Regulation Difficulties and Their Association With Mental Disorders
2.4.1 Abstract
This study explores the use of audio-video data to estimate emotion regulation
difficulties (ERD), which are linked to various mental disorders. Traditional assessments
rely on self-report measures, prompting the investigation of additional modalities for
improved diagnosis. The research utilizes a locally collected dataset, consisting of
audio-video recordings from participants interacting with an automated system while
responding to questionnaires. Two methodologies were developed to estimate ERD,
which serves as an intermediate representation for assessing the severity of mental
disorders, specifically major depressive disorder (MDD) and post-traumatic stress
disorder (PTSD). The findings show that the estimated ERD correlates with self-
reported data, indicating its effectiveness in reflecting the severity levels of MDD and
PTSD, thereby supporting its potential as a valuable diagnostic tool.
2.4.2 Methodology
Two methodologies were developed to estimate ERD from the audio-video data: a regression model for predicting DERS subscale scores and a classification
model for estimating Likert scale responses. Both models were trained on the collected
dataset, emphasizing the fusion of audio and video features. Audio features were
extracted using the AVEC16-audio feature set, while video features, including action
units (AUs) and eye-gaze data, were obtained via the OpenFace toolbox. The extracted
features were summarized using a mean function over short time segments to create
a compact representation for analysis. The study also trained a regression model
to predict MDD and PTSD severity levels based on the estimated DERS scores,
comparing results with a baseline that directly estimated severity from audio-video
data. This demonstrated that using ERD as an intermediate representation provided
similar performance to direct estimation while reducing computational complexity.
Finally, the methodology was validated against existing datasets like DAIC-WOZ,
with performance evaluated using metrics such as mean absolute error (MAE) and F1-
score. Correlation analysis between DERS scores and mental disorder severity further
confirmed the validity of the methodology and the utility of ERD in assessing MDD
and PTSD.
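The mean-based feature summarization described above can be sketched as follows; the array shapes and segment length are illustrative assumptions.

```python
# Illustrative summarization of frame-level audio-video features by a mean
# over short time segments, as described above; shapes are assumptions.
import numpy as np

def segment_mean(features: np.ndarray, seg_len: int) -> np.ndarray:
    """Average frame-level features over consecutive segments of seg_len frames.

    features: (n_frames, n_feat) array, e.g. OpenFace action units + eye gaze.
    returns:  (n_segments, n_feat) compact representation.
    """
    n_frames, n_feat = features.shape
    n_segments = n_frames // seg_len
    trimmed = features[: n_segments * seg_len]   # drop any incomplete tail
    return trimmed.reshape(n_segments, seg_len, n_feat).mean(axis=1)

frames = np.random.rand(300, 20)       # 300 frames, 20 features (synthetic)
compact = segment_mean(frames, seg_len=30)
print(compact.shape)                   # (10, 20)
```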
2.4.3 Conclusion
This study investigated the estimation of emotion regulation difficulties (ERD) using
audio-video data and its correlation with mental disorders like major depressive
disorder (MDD) and post-traumatic stress disorder (PTSD). The findings indicate that
ERD can effectively predict the severity of these disorders, offering a computationally
efficient alternative to direct estimation methods. Validation against the DAIC-WOZ
dataset highlights its potential for broader applications across various psychological
conditions. The work emphasizes the importance of non-intrusive audio-video
modalities in improving mental health assessment and suggests future enhancements
in ERD estimation models and dataset expansion for better accuracy.
Chapter 3
Methodology
This study develops a system for detecting mental disorders through emotion
recognition from facial expressions, combining Convolutional Neural Networks (CNNs)
and Vision Transformers (ViTs) to enhance accuracy [1] [2]. It processes data from
AffectNet and FER 2013 using the YOLOv8 model to classify emotions linked to
disorders like anxiety and depression [1]. Explainability tools like Grad-CAM and
saliency maps are employed to ensure transparency in the model’s predictions.
3.1 Technology
• Deep Learning
• YOLOv8 Detection Model
• Convolutional Neural Networks (CNNs)
• Vision Transformers (ViT)
• Ensemble Learning
• Explainable AI (XAI) Tools
Figure 3.1: Structure of the analysis pipeline for human mental disorders
detection through facial emotion recognition.
This study develops a hybrid system for detecting mental disorders through emotion
recognition from facial expressions, using data from AffectNet and FER 2013 [1]. The
methodology employs the YOLOv8 object detection model to identify facial cues and
integrates Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in
an ensemble for accurate emotion recognition [2] [1]. Emotions are categorized into
classes related to disorders like anxiety and depression. To enhance interpretability,
explainable AI tools such as Grad-CAM and saliency maps are utilized to highlight
influential facial areas. This approach ensures reliable emotion detection linked to
mental health while providing transparency in decision-making.
3.1.1 Deep Learning
Deep learning models automatically learn hierarchical visual features
in images, enhancing the analysis of facial expressions. This hybrid architecture
improves accuracy in detecting mental disorders by understanding nuanced emotional
cues. Furthermore, the research employs explainable AI tools, such as Grad-CAM and
saliency maps, to clarify which facial image parts influence the model’s predictions [3].
This transparency builds trust among healthcare professionals, making the application
of deep learning both practical and interpretable. Ultimately, the study demonstrates
that deep learning can significantly enhance the detection of mental disorders by
leveraging neural networks for detailed facial expression analysis [1].
3.1.2 YOLOv8 Detection Model
YOLOv8 (You Only Look Once, version 8) enhances the accuracy and speed of its
predecessors, making it a versatile model for object detection, image segmentation,
and classification. By predicting bounding boxes and class probabilities directly from
input images in a single evaluation, it achieves high performance, suitable for real-time
applications like autonomous driving, security, and healthcare diagnostics [4].
The architecture of YOLOv8 can be broken down into several key components:
• Key Point Detection: Improves accuracy by predicting key points (corners or
centers) instead of relying on anchors, particularly for small or overlapping
objects.
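As a usage illustration, YOLOv8 is available through the ultralytics Python package; the checkpoint name, the dataset file emotions.yaml, and the training settings below are assumptions of this sketch.

```python
# Illustrative use of YOLOv8 via the ultralytics API for detecting facial
# emotional cues; "emotions.yaml" and the weight choice are assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                    # small pre-trained checkpoint

# Fine-tune on a custom dataset of annotated facial-emotion boxes (assumed).
model.train(data="emotions.yaml", epochs=50, imgsz=640)

# Single-pass inference: bounding boxes and class probabilities per image.
results = model("face.jpg")
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)        # class id, confidence, coords
```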
3.1.3 Convolutional Neural Networks (CNNs)
3.1.4 Vision Transformers (ViT)
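As noted above, ViTs capture global relationships among facial features. A minimal sketch of adapting a pretrained Vision Transformer for seven-class emotion recognition is given below; torchvision's vit_b_16 checkpoint and the head size are assumptions of this illustration.

```python
# Minimal sketch: pretrained ViT with a new head for 7 emotion classes
# (the class count follows the datasets used in this study; an assumption).
import torch
import torch.nn as nn
from torchvision import models

vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads.head = nn.Linear(vit.heads.head.in_features, 7)  # new emotion head

logits = vit(torch.randn(1, 3, 224, 224))   # ViT expects 224x224 RGB input
print(logits.shape)                          # torch.Size([1, 7])
```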
3.1.5 Ensemble Learning
3.1.6 Explainable AI (XAI) Tools
3.2 Datasets
• AffectNet Dataset
The AffectNet dataset is a comprehensive collection of over 1 million labeled
facial expression images, covering seven basic emotions: anger, disgust, fear,
happiness, sadness, surprise, and neutral. Its diversity in facial pose, lighting,
and ethnic backgrounds makes it ideal for real-world emotion recognition
applications [1] [2]. This dataset is essential for training deep learning
models like Convolutional Neural Networks (CNNs) and Vision Transformers
(ViTs), enabling them to generalize across various demographics and emotional
contexts. In this research, AffectNet serves as the primary source for validating
the model’s performance in predicting mental disorders, ensuring effective
handling of diverse facial expressions in mental health diagnostics.
• FER2013 Dataset
The FER2013 (Facial Expression Recognition 2013) dataset consists of 30,000
grayscale images, each 48x48 pixels, focused on the facial region and labeled
with seven emotions: anger, disgust, fear, happiness, sadness, surprise, and
neutral [1] [2]. Recognized as a benchmark for facial emotion recognition, it
provides well-curated data used in numerous studies. In this research, FER2013
served as the initial training set for deep learning models, establishing a baseline
before validation with the more diverse AffectNet dataset. Although smaller than
AffectNet, FER2013 is essential for learning core facial expressions, critical for
early-stage training in emotion-based mental health detection systems.
3.3 Implementation details
• The implementation begins with collecting the AffectNet and FER2013 datasets,
which contain a variety of labeled facial expressions like happiness, sadness,
and anger. These annotated images provide the foundation for training a deep
learning model to predict emotions and their links to mental disorders [1] [3].
• Images from the AffectNet and FER2013 datasets are resized to 64x64 pixels
and 48x48 pixels, respectively, and normalized by scaling pixel values to a range
of [0,1]. This preprocessing is crucial for preparing the data for deep learning
models and for accelerating training [1]; a preprocessing sketch is given after this list.
• Emotion classification is performed by three complementary models:
a custom CNN (MDNet), a pre-trained ResNet50, and a Vision Transformer
(ViT), leveraging the strengths of each model for improved predictions.
• The outputs from MDNet, ResNet50, and ViT are combined into an ensemble
classifier using a fusion strategy, leveraging MDNet’s detail capture, ResNet50’s
deep feature extraction, and ViT’s global context understanding. This ensemble
approach enhances the prediction accuracy of the mental disorder detection
system; a fusion sketch is given after this list.
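A minimal sketch of the preprocessing step above, assuming torchvision and a folder-per-class dataset layout; the paths are placeholders.

```python
# Illustrative preprocessing: resize to the stated sizes and scale pixel
# values to [0, 1] (ToTensor performs the scaling); paths are placeholders.
from torchvision import transforms
from torchvision.datasets import ImageFolder

affectnet_tf = transforms.Compose([
    transforms.Resize((64, 64)),   # AffectNet images -> 64x64
    transforms.ToTensor(),         # uint8 [0, 255] -> float [0, 1]
])
fer2013_tf = transforms.Compose([
    transforms.Grayscale(),        # FER2013 is grayscale
    transforms.Resize((48, 48)),   # FER2013 images -> 48x48
    transforms.ToTensor(),
])

affectnet = ImageFolder("data/affectnet", transform=affectnet_tf)  # assumed layout
fer2013 = ImageFolder("data/fer2013", transform=fer2013_tf)        # assumed layout
```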
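The exact fusion strategy is not spelled out here; one simple, commonly used variant averages the softmax outputs of the three models, as sketched below under that assumption.

```python
# Sketch of a late-fusion ensemble: average the class probabilities of
# MDNet, ResNet50, and ViT. The averaging rule is an assumption; the
# report does not pin down the exact fusion strategy.
import torch
import torch.nn.functional as F

def ensemble_predict(models, image: torch.Tensor) -> torch.Tensor:
    """Average softmax probabilities across models; image is (1, C, H, W)."""
    probs = [F.softmax(m(image), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)

# Usage, assuming mdnet, resnet50 and vit are trained classifiers sharing
# the same output classes:
#   fused = ensemble_predict([mdnet, resnet50, vit], face_tensor)
#   prediction = fused.argmax(dim=1)
```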
Chapter 4
Working
4.1 Working principle
• The hybrid learning architecture for mental disorder detection combines deep
learning techniques to analyze facial expressions and predict mental disorders
[3]. It begins with data acquisition and preprocessing using the AffectNet and
FER2013 datasets, which contain annotated facial expressions categorized by
basic emotions. The images are resized and normalized to ensure compatibility
with deep learning models, facilitating efficient processing [1].
• After detecting facial emotions with YOLOv8, the system employs a hybrid
learning approach that integrates three models: MDNet (a custom CNN),
ResNet50 (a pre-trained model), and Vision Transformer (ViT). MDNet focuses
on localized features, ResNet50 captures complex patterns, and ViT recognizes
global relationships among facial features. This combination enhances the
classification of mental disorders by leveraging the strengths of each model [1].
• The ensemble model classifies detected emotions into specific mental disorders
based on predefined relationships, such as linking fear and disgust to anxiety
disorders or sadness and anger to depression. It categorizes emotions into
four main groups: anxiety disorder, depressive disorder, no disorder, and other
disorders, offering diagnostic insights for healthcare intervention (see the mapping
sketch after this list).
• To keep predictions interpretable, Grad-CAM and saliency maps highlight the
facial regions that have a significant influence on the model's decision. For example, Grad-CAM might
show that the model focused on the mouth and eyes when detecting an expression
related to anxiety. This interpretability feature is crucial in building trust with
healthcare providers, as it allows them to understand the rationale behind the
system’s predictions, thus making the model’s decisions more transparent and
easier to validate [4].
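The emotion-to-disorder grouping described above can be expressed as a small lookup table; the fear/disgust and sadness/anger pairings come from the text, while the placement of the remaining emotions is an illustrative assumption.

```python
# Sketch of the emotion-to-disorder grouping described above. The fear/
# disgust and sadness/anger pairings follow the text; assigning the
# remaining emotions is an assumption of this illustration.
EMOTION_TO_DISORDER = {
    "fear": "anxiety disorder",
    "disgust": "anxiety disorder",
    "sadness": "depressive disorder",
    "anger": "depressive disorder",
    "happiness": "no disorder",
    "neutral": "no disorder",
    "surprise": "other disorders",
}

def classify_disorder(emotion: str) -> str:
    """Map a detected emotion label to a disorder group."""
    return EMOTION_TO_DISORDER.get(emotion, "other disorders")

print(classify_disorder("fear"))   # -> anxiety disorder
```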
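A compact Grad-CAM sketch for a ResNet50-style classifier is given below; targeting the last convolutional block follows common practice, and the placeholder input stands in for a preprocessed face image.

```python
# Minimal Grad-CAM sketch for a ResNet50-style classifier; the target layer
# choice (last conv block) follows common practice and is an assumption here.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
acts, grads = {}, {}

def fwd_hook(module, inputs, output):
    acts["v"] = output                         # feature maps of target layer

def bwd_hook(module, grad_input, grad_output):
    grads["v"] = grad_output[0]                # gradients w.r.t. feature maps

layer = model.layer4[-1]                       # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224, requires_grad=True)   # placeholder input
logits = model(image)
logits[0, logits.argmax()].backward()          # gradient of the top class

weights = grads["v"].mean(dim=(2, 3), keepdim=True)        # GAP over gradients
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0,1]
# `cam` can now be overlaid on the face image to show influential regions.
```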
4.2 Flow Chart
Figure 4.1: A flow chart of the predictive logic adopted in this study.
Chapter 5
Results and Discussion
The hybrid learning architecture for mental disorder detection achieved an overall
accuracy of 81%, excelling in identifying anxiety and depressive disorders. The
ensemble model, combining MDNet, ResNet50, and ViT, outperformed individual
models, yielding a precision of 78%, recall of 80%, and an F1-score of 79%. Excluding
ViT slightly improved accuracy, emphasizing the critical roles of CNN and ResNet50
in feature extraction. ROC-AUC scores of 0.75 for anxiety and 0.76 for depressive
disorders validated the model’s classification ability. Explainability techniques like
Grad-CAM highlighted key facial regions, enhancing the system’s transparency and
reliability for early diagnosis.
The hybrid learning architecture for mental disorder detection achieved
impressive results, with the ensemble model effectively utilizing CNN, ResNet50, and Vision
Transformers to excel in identifying depressive and anxiety disorders. Explainability
techniques like Grad-CAM enhanced transparency and trust in the model’s decision-
making. Future improvements could include integrating additional modalities, such as
speech or contextual information, to further boost classification accuracy in cases of
overlapping emotional cues.
Chapter 6
Conclusion
References
[2] C. Nash, R. Nair, and S. M. Naqvi, "Machine learning in ADHD and depression mental health diagnosis: A survey," IEEE Access, 2023.
[4] R. K. Gupta and R. Sinha, "An investigation on the audio-video data based estimation of emotion regulation difficulties and their association with mental disorders," IEEE Access, vol. 11, pp. 74324–74336, 2023.