
A Hybrid Learning-Architecture for Mental Disorder

Detection Using Emotion Recognition


A Seminar Report
submitted to
the APJ Abdul Kalam Technological University
in partial fulfillment of the requirements for the degree of

Bachelor of Technology
by
ABHIJITH ES (VML21CS006)
under the supervision of
Ms. SUVARNA V M
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


VIMAL JYOTHI ENGINEERING COLLEGE CHEMPERI
CHEMPERI P.O. - 670632, KANNUR, KERALA, INDIA
October 2024
DEPT. OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the report entitled A Hybrid Learning-Architecture for Mental
Disorder Detection Using Emotion Recognition submitted by ABHIJITH ES
(VML21CS006) to the APJ Abdul Kalam Technological University in partial
fulfillment of the B.Tech degree in Computer Science and Engineering is a bona fide
record of the seminar work carried out by him under our guidance and supervision.
This report in any form has not been submitted to any other University or Institute for
any purpose.

SEMINAR GUIDE

Ms. SUVARNA V M
Assistant Professor
Computer Science & Engineering
Vimal Jyothi Engineering College, Chemperi

SEMINAR COORDINATORS

Ms. CLARA JOSEPH
Assistant Professor
Computer Science & Engineering
Vimal Jyothi Engineering College, Chemperi

Ms. TINTU DEVASIA
Assistant Professor
Computer Science & Engineering
Vimal Jyothi Engineering College, Chemperi

Place: VJEC Chemperi
Date: 14-10-2024

Head of the Department
(Office Seal)
DECLARATION

I hereby declare that the seminar report "A Hybrid Learning-Architecture for
Mental Disorder Detection Using Emotion Recognition", submitted for partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology
of the APJ Abdul Kalam Technological University, Kerala, is a bona fide work done
by me under the supervision of Ms. SUVARNA V M, Assistant Professor of the Computer
Science and Engineering Department.
This submission represents my ideas in my own words, and where ideas or words
of others have been included, I have adequately and accurately cited and referenced
the original sources. I also declare that I have adhered to the ethics of academic honesty
and integrity and have not misrepresented or fabricated any data, idea, fact, or
source in my submission. I understand that any violation of the above will be a cause
for disciplinary action by the institute and/or the University and can also invoke penal
action from the sources which have thus not been properly cited or from whom proper
permission has not been obtained. This report has not previously formed the
basis for the award of any degree, diploma, or similar title of any other University.

ABHIJITH ES
CHEMPERI
VML21CS006
14-10-2024

ACKNOWLEDGEMENT

The successful presentation of this seminar on "A Hybrid Learning-Architecture
for Mental Disorder Detection Using Emotion Recognition" would not have been
possible without the invaluable contributions of those who made it happen.
I convey my thanks to my seminar guide Ms. Suvarna V M, Assistant Professor of the
Computer Science and Engineering Department, for providing the encouragement, constant
support, and guidance that were of great help in completing this seminar successfully. I
express my thanks to my seminar coordinators Ms. Clara Joseph, Assistant Professor
of the Computer Science and Engineering Department, and Ms. Tintu Devasia,
Assistant Professor of the Computer Science and Engineering Department, and to all staff
members and friends for all the help and coordination extended in bringing out this
seminar successfully and on time. Last but not least, I would like to express my gratitude
towards my parents for their kind cooperation and encouragement.

ABHIJITH ES
VML21CS006

Abstract

Mental illness has grown to become a prevalent and global health concern that affects
individuals across various demographics. Timely detection and accurate diagnosis of
mental disorders are crucial for effective treatment and support, as late diagnosis
can result in suicidal or harmful behaviors and, ultimately, death. To this end, the
present study introduces a novel pipeline for the analysis of facial expressions, leveraging both
the AffectNet and Facial Emotion Recognition 2013 (FER2013) datasets. Consequently,
this research goes beyond traditional diagnostic methods by contributing a system
capable of generating a comprehensive mental disorder dataset and concurrently
predicting mental disorders based on facial emotional cues. In particular, I introduce
a hybrid architecture for mental disorder detection that leverages the state-of-the-art
object detection algorithm YOLOv8 to detect and classify visual cues associated
with specific mental disorders. To achieve accurate predictions, an integrated learning
architecture based on the fusion of Convolutional Neural Networks (CNNs) and Vision
Transformer (ViT) models is developed to form an ensemble classifier that predicts
the presence of mental illness. To ensure transparency and interpretability, I integrate
techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) and
saliency maps to highlight the regions in the input image that contribute most
to the model’s predictions. This provides healthcare professionals with a clear
understanding of the features influencing the system’s decisions, thereby enhancing
trust and enabling a more informed diagnostic process.

Contents

Abstract

List of Figures

Nomenclature

1 Introduction
  1.1 Overview
  1.2 General Background
    1.2.1 Mental Disorder Detection
  1.3 Problem statement
  1.4 Scope of the system
  1.5 Objective

2 Related Works
  2.1 PredictEYE: Personalized Time Series Model for Mental State Prediction Using Eye Tracking
    2.1.1 Abstract
    2.1.2 Methodology
    2.1.3 Conclusion
  2.2 Automated Detection Of Human Mental Disorder
    2.2.1 Abstract
    2.2.2 Methodology
    2.2.3 Conclusion
  2.3 Machine Learning in ADHD and Depression Mental Health Diagnosis
    2.3.1 Abstract
    2.3.2 Methodology
    2.3.3 Conclusion
  2.4 An Investigation on the Audio-Video Data Based Estimation of Emotion Regulation Difficulties and Their Association With Mental Disorders
    2.4.1 Abstract
    2.4.2 Methodology
    2.4.3 Conclusion

3 Methodology
  3.1 Technology
    3.1.1 Deep Learning
    3.1.2 YOLOv8 Detection Model
    3.1.3 Convolutional Neural Networks (CNNs)
    3.1.4 Vision Transformers (ViT)
    3.1.5 Ensemble Learning
    3.1.6 Explainable AI (XAI) Tools
  3.2 Datasets
  3.3 Implementation details

4 Working
  4.1 Working principle
  4.2 Flow Chart

5 Result and Discussion

6 Conclusion

References

List of Figures

3.1 Structure of the analysis pipeline for human mental disorders detection through facial emotion recognition.

4.1 A flow chart of the predictive logic adopted in this study.

Nomenclature

ResNet50 Residual Network (50 layers)
YOLOv8 You Only Look Once (Version 8)
FER Facial Emotion Recognition
Grad-CAM Gradient-Weighted Class Activation Mapping
CNN Convolutional Neural Network
AI Artificial Intelligence
MDNet Mental Disorder Network
ViT Vision Transformer
XAI Explainable Artificial Intelligence
LSTM Long Short-Term Memory
GSR Galvanic Skin Response
VGG Visual Geometry Group
SVM Support Vector Machine
fMRI functional Magnetic Resonance Imaging
ERD Emotion Regulation Difficulties
PTSD Post-Traumatic Stress Disorder
DERS Difficulties in Emotion Regulation Scale
PHQ-8 Patient Health Questionnaire (8-item)
PCL-C PTSD Checklist–Civilian version
MAE Mean Absolute Error

Chapter 1

Introduction

1.1 Overview

The paper "A Hybrid Learning-Architecture for Mental Disorder Detection Using
Emotion Recognition" presents a deep learning system for detecting mental disorders
by analyzing facial expressions [1] [2]. The proposed model integrates Convolutional
Neural Networks (CNNs) with Vision Transformers (ViT) to improve prediction
accuracy for conditions like depression and anxiety [3]. The YOLOv8 object detection
algorithm is employed to analyze facial cues from datasets such as AffectNet and FER
2013, leading to the creation of a specialized dataset for mental disorder detection [1].
An ensemble classifier combines CNN-based MDNet, ViT, and ResNet50 to achieve
81% accuracy [1] [2]. To enhance interpretability, explainability techniques like Grad-
CAM and saliency maps are used to highlight the most influential facial regions for the
model’s decisions. This research aims to provide more accurate, explainable tools to
assist clinicians in diagnosing mental health conditions using emotion recognition.

1.2 General Background

The background of the paper highlights the need for innovative tools in mental
healthcare, particularly for early detection of disorders like depression and anxiety.
Current diagnostic methods rely on subjective, time-consuming processes such as self-
reporting and interviews, leading to risks of late diagnosis or misdiagnosis. Advances
in AI and machine learning offer potential for automated diagnostic tools, especially
in mental health [2]. Deep learning, particularly through facial emotion recognition
(FER) systems [1], can analyze facial expressions to detect emotions like happiness,
sadness, and fear, which are critical indicators of mental state. The paper proposes
a hybrid model that integrates Convolutional Neural Networks (CNNs) [2] [1] and
Vision Transformers (ViTs). CNNs excel at feature extraction from images, while
ViTs capture complex visual relationships. By combining these, the study improves
detection of facial expressions linked to mental disorders. Grad-CAM and saliency
maps are used to highlight the facial regions that influence the model’s predictions,
making them easier for healthcare professionals to interpret. The system is trained on datasets
like AffectNet and FER 2013 [1], categorizing emotions into mental health conditions
like anxiety and depression. The hybrid model achieves 81% accuracy, offering a
promising tool for early detection and better clinical decision-making.

1.2.1 Mental Disorder Detection

Mental disorder detection traditionally relies on clinical assessments and behavioral
observation, but advancements in AI now enable automated systems to identify
conditions like depression and anxiety using data such as facial expressions and speech
patterns [4]. Deep learning models like CNNs and Vision Transformers analyze
facial cues (e.g., sadness, anger) to detect mental health conditions. Multimodal
data integration, combining facial, speech, and physiological data, enhances detection
accuracy. AI systems can uncover patterns invisible to human observers, enabling
earlier and more objective diagnosis. Techniques like Grad-CAM and saliency
maps improve transparency, making AI predictions more interpretable for healthcare
professionals. This approach offers faster, more accurate mental health diagnosis,
supporting clinical decisions.

1.3 Problem statement

Mental health disorders, such as depression and anxiety, are rapidly increasing
global health concerns, affecting millions of people across various demographics.
Current diagnostic methods for these disorders largely rely on subjective assessments,
interviews, and observation by mental health professionals, which can lead to delayed
or inaccurate diagnoses. These traditional approaches are time-consuming and prone
to human error, contributing to inconsistent treatment and care. Furthermore, mental
health conditions often manifest subtly, making early detection difficult. There is
a critical need for objective, efficient, and accurate tools that can assist healthcare
professionals in identifying mental disorders at an early stage. The challenge is
to develop a system that can automatically detect mental health disorders through
non-invasive, data-driven methods, such as analyzing facial expressions, speech, and
behavior [1] [4]. Such a system must not only be accurate but also transparent
and interpretable, allowing clinicians to trust and understand the AI’s decisions.
Without these advancements, mental health diagnoses will continue to be subjective,
inconsistent, and limited in their ability to provide timely interventions.

1.4 Scope of the system

The mental disorder detection system aims to enhance early diagnosis and treatment
of conditions like depression and anxiety using advanced AI techniques. It leverages
deep learning models like CNNs and Vision Transformers (ViTs) to analyze facial
expressions and detect emotional cues such as sadness, anger, or fear, which can
indicate mental health issues [1] [2]. The system primarily focuses on facial
emotion recognition but can also integrate additional data like speech patterns and
behavioral information to improve accuracy. Explainability is a key feature, with
techniques like Grad-CAM and saliency maps ensuring healthcare professionals can
interpret the AI’s decisions, increasing transparency and trust in clinical settings.
Designed for scalability, the system is adaptable for use in clinical environments,
telemedicine platforms, and mobile health applications. Its non-invasive nature
makes it accessible for diverse populations and suitable for remote mental health
assessments, particularly in underserved or rural areas with limited access to mental
health professionals. Remote monitoring enables real-time assessments without in-
person visits and allows for long-term tracking of a patient’s emotional state [4].
This can help clinicians adjust treatment plans based on patterns of mental health
improvement or deterioration. Beyond diagnostics, the system provides real-time
feedback during therapy sessions [4], helping therapists adjust their approach based
on a patient’s emotional cues, potentially leading to more effective treatment. By
incorporating continuous monitoring and real-time emotional analysis [4], the system
serves as both a diagnostic tool and a comprehensive support system for mental health
management, crisis intervention, and ongoing research in the field.

1.5 Objective

The objective of the system described in the paper "A Hybrid Learning-Architecture
for Mental Disorder Detection Using Emotion Recognition" is to detect and prevent
mental disorders like depression and anxiety by using advanced AI models [4] [1]. The
report highlights the growing need for better diagnostic tools in mental health, focusing
on deep learning techniques, particularly a hybrid architecture combining CNNs and
Vision Transformers (ViTs) for facial expression analysis [2] [1]. This system offers
a non-invasive, scalable solution for early detection, applicable in clinical settings,
telemedicine, and mobile health platforms. The report emphasizes the importance
of transparency in AI, using explainability techniques like Grad-CAM and saliency
maps to help healthcare professionals interpret AI predictions. It also touches on
ethical considerations, such as data privacy and security in mental health diagnostics.
Looking forward, the report explores potential improvements, including the integration
of multimodal data and continuous monitoring for personalized care. The report
underscores how AI-driven mental health diagnostics can transform healthcare by
providing faster, more objective insights and detecting early warning signs. Traditional
methods are limited by subjectivity and time-consuming processes, while automated
systems can offer timely interventions and bridge gaps in access to care [1], especially
in underserved or remote areas. By integrating AI into telemedicine and mobile apps,
individuals can access mental health assessments from home, reducing barriers to
care. The system’s continuous monitoring allows for long-term tracking of emotional
states, enabling clinicians to adjust treatment plans based on gradual changes in mental
health. It aligns with the trend toward personalized healthcare by analyzing individual
emotional and behavioral cues to offer tailored therapy recommendations and self-care
strategies. The report ultimately demonstrates how AI integration improves clinical
practice and transforms the delivery of mental healthcare.

Chapter 2

Related Works

2.1 PredictEYE: Personalized Time Series Model for

Mental State Prediction Using Eye Tracking

2.1.1 Abstract

Mental health is crucial for emotional, psychological, and social well-being, and early
intervention can help manage mental illnesses. The paper introduces a personalized
time series model called PredictEYE, designed to predict an individual’s mental
state and identify the specific scene responsible by analyzing eye-tracking data while
watching calm and stressful videos. The model uses a Long Short-Term Memory
(LSTM) deep learning regression model to predict feature sequences and a Random
Forest algorithm for mental state prediction, achieving 86.4% accuracy. PredictEYE
focuses on individual eye-tracking data, as it provides unique insights into mental
states, outperforming models that compare multiple participants. The model is
adaptable for continuous, non-invasive monitoring using webcam-based eye tracking,
making it suitable for various mental health applications.

2.1.2 Methodology

The methodology of the "PredictEYE" study focuses on using eye-tracking data to
predict mental states like calm or stress by analyzing how individuals react to visual
stimuli, specifically videos, through their eye movements. The study employed a time-
series analysis using an LSTM deep learning model and a Random Forest algorithm.
Eye-tracking features such as pupil diameter, fixation duration, and blink duration
were collected from participants at 60 Hz while watching calm and stressful videos.
The LSTM model predicted future sequences of these features, while the Random
Forest classified participants’ mental states. The system’s accuracy was validated
using Galvanic Skin Response (GSR) data, a physiological measure that enhanced
the predictive power of the model. This dual approach, combining conscious visual
cues (eye movements) with subconscious physiological signals (GSR), provided a
more comprehensive understanding of mental states. The real-time nature of the data
collection allowed for dynamic mental state detection, making the system suitable for
continuous monitoring in real-world applications like stress management and mental
health monitoring during daily activities. The combination of LSTM for prediction and
Random Forest for classification helped develop a personalized model that accurately
predicts individual mental states.
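As a rough illustration of the two-stage design described above, the sketch below forecasts a window of eye-tracking features with an LSTM and then classifies the mental state with a Random Forest. The feature names, window size, and random data are illustrative assumptions, not the authors' implementation.

import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class EyeFeatureLSTM(nn.Module):
    """Forecasts the next step of eye-tracking features from a window."""
    def __init__(self, n_features=3, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):                 # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # predict the next feature vector

# Toy windows of (pupil diameter, fixation duration, blink duration) at 60 Hz.
X_windows = np.random.rand(200, 60, 3).astype("float32")
y_state = np.random.randint(0, 2, 200)    # 0 = calm, 1 = stress

forecaster = EyeFeatureLSTM()
with torch.no_grad():
    next_feats = forecaster(torch.from_numpy(X_windows)).numpy()

# Stage two: a Random Forest classifies the mental state from the forecasts.
clf = RandomForestClassifier(n_estimators=100).fit(next_feats, y_state)
print(clf.predict(next_feats[:5]))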

2.1.3 Conclusion

The PredictEYE system represents a significant advancement in mental state prediction
by integrating personalized time-series modeling with eye-tracking data. Its
combination of LSTM for prediction and Random Forest for classification allows for accurate
forecasting of mental states like calm or stress. By providing real-time feedback and
enabling continuous monitoring, PredictEYE enhances mental health diagnostics and
interventions, making it a promising tool for clinical and real-world applications.

2.2 Automated Detection Of Human Mental Disorder

2.2.1 Abstract

The increasing pressures of daily life can lead to stress, anxiety, and mood swings,
which may escalate into depression and more severe mental health issues. Unfor-
tunately, these emotional changes are often overlooked until it’s too late, resulting
in severe consequences like suicidal intentions. This work presents a model that
detects and classifies observable facial behaviors to assess mental health. Using
a Haar feature-based cascade classifier, the model extracts facial features from the
FER+ dataset, while the VGG model classifies individuals as normal or abnormal.
For those identified as abnormal, the model predicts specific conditions such as
depression or anxiety based on their facial expressions. This approach aims to provide
timely assistance and support, achieving an overall prediction accuracy of 95%. By
facilitating early detection of emotional disturbances, the proposed model serves as a
valuable diagnostic tool for assessing mental health through facial expressions.

2.2.2 Methodology

The research on automated detection of mental disorders utilizes advanced machine
learning techniques, specifically convolutional neural networks (CNNs), to classify
facial expressions as indicators of underlying mental health conditions such as
depression and anxiety. The study employs the FER+ dataset, which contains
grayscale images annotated with various emotional states, including anger, sadness,
happiness, and neutrality, essential for model training. Extensive preprocessing of the
data ensures consistency and optimal performance, resizing and normalizing images
to mitigate variations in scale, lighting, and orientation. Face detection is performed
using Haar-cascade classifiers, which isolate the facial region for further analysis. The
processed facial data is fed into a deep learning model for classification, utilizing

the VGG16 architecture through transfer learning. This pre-trained CNN reduces
computational costs and allows for fine-tuning the final layers to specialize in facial
emotion recognition relevant to mental health assessment. The model identifies
emotions such as fear, sadness, anger, and disgust, interpreting these classifications
in a mental health context. For instance, combinations of sadness and anger may
suggest depression, while fear and disgust could signal anxiety. The model’s
performance is evaluated using precision, recall, F1-score, and overall accuracy
metrics, achieving a 95% prediction accuracy. This underscores its potential for real-
world applications, particularly for early detection of mental disorders, facilitating
timely intervention. Overall, the methodology integrates multiple machine learning
techniques and preprocessing steps to accurately detect human mental disorders
through facial expressions.
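The detection and classification steps described above can be sketched as follows, using OpenCV's bundled Haar cascade for face isolation and a VGG16 base (transfer learning) for emotion classification. The head layers, input size, and class count are assumptions for illustration, not the study's exact configuration.

import cv2
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Haar feature-based cascade classifier shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(gray_img):
    """Return the first detected face region, or None."""
    faces = cascade.detectMultiScale(gray_img, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return gray_img[y:y + h, x:x + w]

# VGG16 via transfer learning: frozen convolutional base, new trainable head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(48, 48, 3))
base.trainable = False
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),   # emotion classes (assumed: 7)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")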

2.2.3 Conclusion

The conclusion of the research on "Automated Detection of Human Mental Disorder"
emphasizes the successful creation of a system that accurately detects and classifies
mental health conditions, such as depression and anxiety, through facial expression
analysis. The study highlights the need for early detection systems in light of the rising
prevalence of psychological disorders. Utilizing deep learning models like VGG16 and
Haar-cascade classifiers, the system achieves an impressive 95% prediction accuracy,
surpassing previous models. This automated approach provides accessible mental
health support, offering early warning mechanisms for individuals who may lack
access to professional help. Furthermore, the model’s efficiency with publicly available
datasets showcases its adaptability and potential for future enhancements.

2.3 Machine Learning in ADHD and Depression Men-

tal Health Diagnosis

2.3.1 Abstract

This paper examines machine learning methods for identifying Attention Deficit
Hyperactivity Disorder (ADHD) and depression, both of which have become in-
creasingly prevalent worldwide, partly due to the COVID-19 pandemic. Depression
affects about 19.7% of individuals over 16, while ADHD impacts approximately
7.2% of all age groups. The study explores various training and testing modalities,
including functional Magnetic Resonance Imaging (fMRI), Electroencephalography
(EEG), medical notes, video, and speech, to enhance the identification of these mental
health conditions. With rising mental health awareness and limited access to in-person
clinics, especially during the shift to remote consultations, there is a growing need for
reliable AI-based technologies to support healthcare systems, particularly in developed
countries.

2.3.2 Methodology

The methodology outlined in the paper on machine learning for diagnosing ADHD
and depression focuses on integrating diverse data sources and advanced algorithms
to improve detection and classification of these mental health disorders. It employs
both wearable devices, like EEG sensors for continuous brain activity monitoring,
and non-wearable technologies, such as functional Magnetic Resonance Imaging
(fMRI), which provide insights into brain structure and function. This multimodal
approach captures both dynamic and static aspects of mental health conditions.
Central to the methodology are machine learning models, particularly Support Vector
Machines (SVMs) and Convolutional Neural Networks (CNNs). SVMs excel in
high-dimensional spaces, making them suitable for complex neuroimaging and EEG data,
while CNNs analyze video data related to facial expressions and body language, which
are significant for mental health diagnostics. These models are trained on labeled
datasets confirming ADHD or depression diagnoses, allowing them to learn relevant
patterns and features. Feature extraction plays a critical role, with EEG data analyzed
for frequency band features linked to abnormal brain activity, and fMRI data assessed
for functional connectivity metrics between brain regions. Additionally, natural
language processing techniques are applied to medical notes and clinical histories
to extract vital symptom-related information. To address data scarcity, especially
concerning sensitive health information, data augmentation techniques modify existing
data to create additional training samples through methods like flipping, rotating, or
adding noise. This comprehensive multimodal approach enhances model accuracy by
broadening the feature set for analysis. Model performance is evaluated using metrics
like accuracy, precision, recall, and F1-score, and robustness is improved through
hyperparameter fine-tuning and regularization techniques to prevent overfitting.
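A minimal sketch of the augmentation step mentioned above, producing flipped, rotated, and noise-perturbed variants of each image; the specific angle, noise level, and array shapes are assumptions.

import numpy as np
from scipy.ndimage import rotate

def augment(image):
    """Return flipped, rotated, and noise-perturbed variants of one image."""
    flipped = np.fliplr(image)
    rotated = rotate(image, angle=10, reshape=False, mode="nearest")
    noisy = np.clip(image + np.random.normal(0, 0.05, image.shape), 0, 1)
    return [flipped, rotated, noisy]

images = np.random.rand(8, 48, 48)           # toy grayscale batch
augmented = [v for img in images for v in augment(img)]
print(len(augmented))                        # three extra samples per image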

2.3.3 Conclusion

The conclusion of the paper on machine learning in diagnosing ADHD and depression
highlights the potential of AI-driven approaches to improve mental health diagnostics
by utilizing diverse data sources, including EEG, fMRI, clinical notes, and audio-
visual inputs. Techniques like Support Vector Machines (SVMs) and Convolutional
Neural Networks (CNNs) enhance accuracy and provide deeper insights into ADHD
and depression. However, challenges remain, such as small and imbalanced datasets,
as well as privacy and ethical concerns regarding sensitive mental health data. Future
research should focus on refining data collection methods and expanding datasets
to improve the accessibility and scalability of AI-based diagnostic tools for early
detection and treatment of mental health conditions.

2.4 An Investigation on the Audio-Video Data Based

Estimation of Emotion Regulation Difficulties and

Their Association With Mental Disorders

2.4.1 Abstract

This study explores the use of audio-video data to estimate emotion regulation diffi-
culties (ERD), which are linked to various mental disorders. Traditional assessments
rely on self-report measures, prompting the investigation of additional modalities for
improved diagnosis. The research utilizes a locally collected dataset, consisting of
audio-video recordings from participants interacting with an automated system while
responding to questionnaires. Two methodologies were developed to estimate ERD,
which serves as an intermediate representation for assessing the severity of mental
disorders, specifically major depressive disorder (MDD) and post-traumatic stress
disorder (PTSD). The findings show that the estimated ERD correlates with self-
reported data, indicating its effectiveness in reflecting the severity levels of MDD and
PTSD, thereby supporting its potential as a valuable diagnostic tool.

2.4.2 Methodology

This study’s methodology focused on estimating emotion regulation difficulties (ERD)
and their correlation with mental disorders, particularly major depressive disorder
(MDD) and post-traumatic stress disorder (PTSD). Data collection occurred in a
controlled environment, where participants interacted with a computer agent that posed
questions from the Difficulties in Emotion Regulation Scale (DERS). Audio-video
data was captured alongside self-reported responses from online forms, including the
DERS, Patient Health Questionnaire (PHQ-8) for MDD, and the PTSD Checklist –
Civilian version (PCL-C). Two approaches were used to estimate ERD from the audio-
video data: a regression model for predicting DERS subscale scores and a classification
model for estimating Likert scale responses. Both models were trained on the collected
dataset, emphasizing the fusion of audio and video features. Audio features were
extracted using the AVEC16-audio feature set, while video features, including action
units (AUs) and eye-gaze data, were obtained via the OpenFace toolbox. The extracted
features were summarized using a mean function over short time segments to create
a compact representation for analysis. The study also trained a regression model
to predict MDD and PTSD severity levels based on the estimated DERS scores,
comparing results with a baseline that directly estimated severity from audio-video
data. This demonstrated that using ERD as an intermediate representation provided
similar performance to direct estimation while reducing computational complexity.
Finally, the methodology was validated against existing datasets like DAIC-WOZ,
with performance evaluated using metrics such as mean absolute error (MAE) and F1-
score. Correlation analysis between DERS scores and mental disorder severity further
confirmed the validity of the methodology and the utility of ERD in assessing MDD
and PTSD.
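The segment-wise summarization described above can be sketched as a simple mean-pooling step; the frame count, segment length, and feature dimension below are assumptions, not the study's values.

import numpy as np

def summarize(features, frames_per_segment=75):
    """Mean-pool (n_frames, n_features) into (n_segments, n_features)."""
    n = (len(features) // frames_per_segment) * frames_per_segment
    segments = features[:n].reshape(-1, frames_per_segment, features.shape[1])
    return segments.mean(axis=1)

au_gaze = np.random.rand(3000, 20)   # e.g., per-frame action units + eye gaze
print(summarize(au_gaze).shape)      # (40, 20)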

2.4.3 Conclusion

This study investigated the estimation of emotion regulation difficulties (ERD) using
audio-video data and its correlation with mental disorders like major depressive
disorder (MDD) and post-traumatic stress disorder (PTSD). The findings indicate that
ERD can effectively predict the severity of these disorders, offering a computationally
efficient alternative to direct estimation methods. Validation against the DAIC-WOZ
dataset highlights its potential for broader applications across various psychological
conditions. The work emphasizes the importance of non-intrusive audio-video
modalities in improving mental health assessment and suggests future enhancements
in ERD estimation models and dataset expansion for better accuracy.

Chapter 3

Methodology

This study develops a system for detecting mental disorders through emotion recog-
nition from facial expressions, combining Convolutional Neural Networks (CNNs)
and Vision Transformers (ViTs) to enhance accuracy [1] [2]. It processes data from
AffectNet and FER 2013 using the YOLOv8 model to classify emotions linked to
disorders like anxiety and depression [1]. Explainability tools like Grad-CAM and
saliency maps are employed to ensure transparency in the model’s predictions.

3.1 Technology

• Deep learning

• YOLOv8 detection model

• Convolutional Neural Networks (CNNs)

• Vision Transformers (ViT)

• Ensemble Learning

• Explainable AI (XAI) Tools

Figure 3.1: Structure of the analysis pipeline for human mental disorders
detection through facial emotion recognition.

This study develops a hybrid system for detecting mental disorders through emotion
recognition from facial expressions, using data from AffectNet and FER 2013 [1]. The
methodology employs the YOLOv8 object detection model to identify facial cues and
integrates Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in
an ensemble for accurate emotion recognition [2] [1]. Emotions are categorized into
classes related to disorders like anxiety and depression. To enhance interpretability,
explainable AI tools such as Grad-CAM and saliency maps are utilized to highlight
influential facial areas. This approach ensures reliable emotion detection linked to
mental health while providing transparency in decision-making.

3.1.1 Deep Learning

Deep learning is central to this research, enabling automated detection of mental
disorders through emotion recognition from facial expressions. Its multi-layered neural
networks model complex data patterns, making it suitable for image recognition tasks
that require capturing subtle features. The study uses Convolutional Neural Networks
(CNNs) to extract key facial patterns linked to mental states like anxiety and depression
[3] [2], allowing for precise emotion detection crucial for diagnosis. Additionally,
Vision Transformers (ViTs) are integrated to capture both local and global relationships
in images, enhancing the analysis of facial expressions. This hybrid architecture
improves accuracy in detecting mental disorders by understanding nuanced emotional
cues. Furthermore, the research employs explainable AI tools, such as Grad-CAM and
saliency maps, to clarify which facial image parts influence the model’s predictions [3].
This transparency builds trust among healthcare professionals, making the application
of deep learning both practical and interpretable. Ultimately, the study demonstrates
that deep learning can significantly enhance the detection of mental disorders by
leveraging neural networks for detailed facial expression analysis [1].

3.1.2 YOLOv8 Detection Model

YOLOv8 (You Only Look Once, version 8) improves on the accuracy and speed of its
predecessors, making it a versatile model for object detection, image segmentation,
and classification. By predicting bounding boxes and class probabilities directly from
input images in a single evaluation, it achieves high performance, suitable for real-time
applications like autonomous driving, security, and healthcare diagnostics [4].

The architecture of YOLOv8 can be broken down into several key components:

• Backbone Network: Utilizes CSPDarknet53 for feature extraction, capturing
multi-scale features for detecting objects of various sizes.

• Path Aggregation Network (PANet): Enhances information flow between layers,
improving detection accuracy for both large and small objects.

• Feature Pyramid Network (FPN): Combines with PANet to better localize
objects across scales while preserving spatial hierarchy.

• Anchor-Free Detection: Simplifies detection by directly predicting object
centers and sizes, eliminating the need for predefined anchor boxes.

• Key Point Detection: Improves accuracy by predicting key points (corners or
centers) instead of relying on anchors, particularly for small or overlapping
objects.

• Mosaic Augmentation: Randomly combines four images during training to help
the model generalize to various visual patterns.

• Label Smoothing: A regularization technique that prevents overconfidence in
predictions, reducing overfitting and enhancing performance on unseen data.
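For context, a brief usage sketch of YOLOv8 through the ultralytics package follows; the weight file and image path are placeholders rather than this paper's artifacts.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                  # pretrained nano checkpoint
results = model("face.jpg", conf=0.5)       # single-pass detection

for r in results:
    for box in r.boxes:
        print(box.xyxy, box.conf, box.cls)  # bounding box, confidence, class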

3.1.3 Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) have transformed computer vision by
enabling automatic feature extraction and classification, particularly in image and video
analysis [1] [2] [4]. They excel at capturing spatial hierarchies, using convolutional
layers that apply learnable filters to identify patterns in images, such as edges and
textures. This automation eliminates the need for domain-specific feature design,
allowing CNNs to learn important features directly from data. Pooling layers further
enhance efficiency by reducing spatial dimensions and minimizing overfitting risks,
crucial for real-time applications like detecting mental health disorders from facial
expressions [1] [2] [4]. In advanced applications, CNNs are often combined with
other architectures, such as Vision Transformers (ViTs), to form hybrid models that
improve detection accuracy. Techniques like Grad-CAM and saliency maps enhance
transparency and interpretability in healthcare applications by highlighting critical
regions in input images that influence predictions. Overall, CNNs offer a powerful
and efficient means for image-based tasks, particularly in detailed analyses such as
emotion recognition and mental disorder detection, solidifying their role as essential
tools in modern artificial intelligence.
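A minimal PyTorch sketch of this convolution-pooling-dropout pattern follows; the layer sizes are illustrative assumptions and are not the paper's MDNet.

import torch
import torch.nn as nn

class SmallEmotionCNN(nn.Module):
    """Two conv/pool stages followed by a dropout-regularized linear head."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.3),
            nn.Linear(64 * 12 * 12, n_classes),   # for 48x48 grayscale inputs
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = SmallEmotionCNN()(torch.randn(4, 1, 48, 48))
print(logits.shape)   # torch.Size([4, 7])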

3.1.4 Vision Transformers (ViT)

Vision Transformers (ViTs) represent a significant innovation in computer vision by
employing the transformer architecture from natural language processing. Unlike
Convolutional Neural Networks (CNNs) [1], which use localized filters for feature
extraction, ViTs treat images as sequences of fixed-size patches. These patches are
flattened and embedded into vectors, allowing the transformer to utilize self-attention
mechanisms to understand relationships between patches, capturing both local and
global context. ViTs excel at modeling long-range dependencies, enabling them
to grasp high-level reasoning in images, particularly in tasks like object detection
and segmentation. However, they require large datasets to train effectively and may
struggle with generalization on smaller datasets. When trained on extensive data, ViTs
can surpass CNNs in image classification and detection tasks. Recent research also
explores hybrid models that combine CNNs’ local feature extraction with ViTs’ global
attention for enhanced performance in various computer vision applications [4] [1].
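The patch-embedding step that distinguishes ViTs from CNNs can be sketched as follows; the 16x16 patch size and 768-dimensional tokens follow the common ViT-Base configuration and are assumptions here.

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)          # one RGB image
patch = 16
n_patches = (224 // patch) ** 2            # 196 patches

# Cut into 16x16 patches, flatten each, and project to a 768-d token.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, -1)
tokens = nn.Linear(3 * patch * patch, 768)(patches)
print(tokens.shape)                        # torch.Size([1, 196, 768])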

3.1.5 Ensemble Learning

Ensemble learning is a robust technique that improves predictive performance by
combining multiple models, leveraging their unique strengths to reduce bias and
variance. In mental disorder detection through emotion recognition, ensemble models
like Convolutional Neural Networks and Vision Transformers enhance accuracy by
capturing both local features and global dependencies in facial expressions [2] [1].
Techniques like stacking involve training various models independently and then using
a meta-model to combine their outputs. This approach achieved an accuracy of
around 81% in detecting mental health conditions such as anxiety and depression. By
addressing individual model weaknesses, ensemble learning offers a practical solution
for complex tasks like diagnosing mental disorders from subtle visual cues.
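A minimal sketch of the fusion idea: averaging (optionally weighted) class probabilities from several models, here stand-ins for MDNet, ResNet50, and ViT. The probability values and equal weighting are assumptions for illustration.

import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Weighted average of per-model class-probability vectors."""
    probs = np.stack(prob_list)            # (n_models, n_classes)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights)
    return (w[:, None] * probs).sum(axis=0) / w.sum()

p_mdnet  = np.array([0.6, 0.3, 0.1])   # e.g., anxiety / depression / none
p_resnet = np.array([0.5, 0.4, 0.1])
p_vit    = np.array([0.7, 0.2, 0.1])
print(ensemble_predict([p_mdnet, p_resnet, p_vit]).argmax())   # class 0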

3.1.6 Explainable AI (XAI) Tools

Explainable AI (XAI) encompasses techniques that enhance the interpretability and
transparency of machine learning models, especially crucial in high-stakes fields
like healthcare and finance. As AI systems increasingly influence critical decisions,
understanding their predictions becomes vital for accountability and trust. In ap-
plications like mental disorder detection via facial emotion recognition, XAI tools
such as Gradient-weighted Class Activation Mapping (Grad-CAM) and saliency maps
help visualize how specific facial regions contribute to predictions [3] [1], aiding
professionals in validating AI recommendations. Additional methods like LIME
(Local Interpretable Model-agnostic Explanations) and SHAP (Shapley Additive
Explanations) further clarify model behavior by detailing the contribution of individual
features. Overall, XAI fosters greater understanding of AI decisions, enhancing its
reliability in diagnosing mental health conditions.
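A compact, hand-rolled Grad-CAM sketch is shown below, using a torchvision ResNet50 and its last convolutional block as the target layer (both assumptions, standing in for the study's models): the layer's activations are weighted by the spatially averaged gradients of the predicted class score.

import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()      # stand-in backbone (assumption)
acts, grads = {}, {}

layer = model.layer4[-1]                   # assumed target layer
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224, requires_grad=True)
model(x)[0].max().backward()               # gradient of the top class score

weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # global-avg gradients
cam = torch.relu((weights * acts["a"]).sum(dim=1))    # weighted activations
cam = cam / (cam.max() + 1e-8)                        # normalized heatmap
print(cam.shape)                                      # torch.Size([1, 7, 7])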

3.2 Datasets

• AffectNet Dataset
The AffectNet dataset is a comprehensive collection of over 1 million labeled
facial expression images, covering seven basic emotions: anger, disgust, fear,
happiness, sadness, surprise, and neutral. Its diversity in facial pose, lighting,
and ethnic backgrounds makes it ideal for real-world emotion recognition
applications [1] [2]. This dataset is essential for training deep learning
models like Convolutional Neural Networks (CNNs) and Vision Transformers
(ViTs), enabling them to generalize across various demographics and emotional
contexts. In this research, AffectNet serves as the primary source for validating
the model’s performance in predicting mental disorders, ensuring effective
handling of diverse facial expressions in mental health diagnostics.

• FER2013 Dataset
The FER2013 (Facial Expression Recognition 2013) dataset consists of roughly 30,000
grayscale images, each 48x48 pixels, focused on the facial region and labeled
with seven emotions: anger, disgust, fear, happiness, sadness, surprise, and
neutral [1] [2]. Recognized as a benchmark for facial emotion recognition, it
provides well-curated data used in numerous studies. In this research, FER2013
served as the initial training set for deep learning models, establishing a baseline
before validation with the more diverse AffectNet dataset. Although smaller than
AffectNet, FER2013 is essential for learning core facial expressions, critical for
early-stage training in emotion-based mental health detection systems.

3.3 Implementation details

• The implementation begins with collecting the AffectNet and FER2013 datasets,
which contain a variety of labeled facial expressions like happiness, sadness,
and anger. These annotated images provide the foundation for training a deep
learning model to predict emotions and their links to mental disorders [1] [3].

• Images from the AffectNet and FER2013 datasets are resized to 64x64 pixels
and 48x48 pixels, respectively, and normalized by scaling pixel values to a range
of [0,1]. This preprocessing is crucial for preparing the data for deep learning
models and accelerating the training process [1] (a minimal sketch of this step
appears after this list).

• The backbone of YOLOv8 uses CSPDarknet53, a 53-layer CNN that extracts
detailed features from facial expressions. It generates bounding boxes and class
probabilities through anchor-free detection, enabling high-speed predictions
with precision [2].

• A hybrid learning architecture enhances mental disorder detection by integrating
a custom CNN (MDNet), a pre-trained ResNet50, and Vision Transformers
(ViT), leveraging the strengths of each model for improved predictions.

• The outputs from MDNet, ResNet50, and ViT are combined into an ensemble
classifier using a fusion strategy, leveraging MDNet’s detail capture, ResNet50’s
deep feature extraction, and ViT’s global context understanding. This ensemble
approach enhances the prediction accuracy of the mental disorder detection
system.

• The ensemble model is trained on a mental disorder dataset, categorizing
emotions into conditions like anxiety, depression, or no disorder. During
inference, it analyzes facial features to predict the presence of specific mental
disorders.

• Gradient-weighted Class Activation Mapping (Grad-CAM) and saliency maps
are utilized to visualize influential facial regions in the model’s predictions.
This interpretability fosters trust among healthcare professionals by clarifying
the rationale behind the system’s decisions regarding mental disorders [4].
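As referenced in the preprocessing item above, the following minimal sketch shows the resize-and-normalize step (64x64 for AffectNet, 48x48 for FER2013, pixel values scaled to [0, 1]); the file paths are placeholders.

import cv2

def preprocess(path, size):
    """Load a face image, resize it, and scale pixel values to [0, 1]."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (size, size))
    return img.astype("float32") / 255.0

affectnet_img = preprocess("affectnet_sample.jpg", 64)   # placeholder path
fer_img = preprocess("fer2013_sample.jpg", 48)           # placeholder path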

Chapter 4

Working

4.1 Working principle

• The hybrid learning architecture for mental disorder detection combines deep
learning techniques to analyze facial expressions and predict mental disorders
[3]. It begins with data acquisition and preprocessing using the AffectNet and
FER2013 datasets, which contain annotated facial expressions categorized by
basic emotions. The images are resized and normalized to ensure compatibility
with deep learning models, facilitating efficient processing [1].

• After preprocessing, the emotion detection phase uses YOLOv8, an advanced
object detection model known for its speed and accuracy. YOLOv8 identifies
key facial features by dividing images into a grid and predicting emotions
with associated class probabilities and bounding boxes. This enables real-time
predictions of emotional states, linking them to potential mental disorders [4].

• After detecting facial emotions with YOLOv8, the system employs a hybrid
learning approach that integrates three models: MDNet (a custom CNN),
ResNet50 (a pre-trained model), and Vision Transformer (ViT). MDNet focuses
on localized features, ResNet50 captures complex patterns, and ViT recognizes
global relationships among facial features. This combination enhances the
classification of mental disorders by leveraging the strengths of each model [1].

• The system employs feature fusion and ensemble classification by combining
outputs from MDNet, ResNet50, and ViT to enhance prediction accuracy and
robustness. This ensemble approach reduces the risk of overfitting or underfitting,
allowing for more informed decisions regarding a person’s emotional state
and associated mental disorders [2].

– MDNet (Custom CNN): Comprises convolutional layers for feature extraction,
max-pooling layers for dimensionality reduction, and dropout layers
for regularization to avoid overfitting.

– ResNet50 (Pre-trained Model): Utilizes skip connections to enable
information flow across layers without degradation.

– ViT (Vision Transformer): Captures long-range dependencies by processing
the image in patches, focusing on global context rather than just local
features.

• The ensemble model classifies detected emotions into specific mental disorders
based on predefined relationships, such as linking fear and disgust to anxiety
disorders or sadness and anger to depression. It categorizes emotions into
four main groups: anxiety disorder, depressive disorder, no disorder, and other
disorders, offering diagnostic insights for healthcare intervention (see the
mapping sketch after this list).

• Finally, the system incorporates explainability techniques to ensure transparency
in its predictions. Two key techniques used are Gradient-weighted Class
Activation Mapping (Grad-CAM) and saliency maps. These techniques generate
visual heatmaps that highlight the regions of the input image that had the most
significant influence on the model’s decision. For example, Grad-CAM might
show that the model focused on the mouth and eyes when detecting an expression
related to anxiety. This interpretability feature is crucial in building trust with
healthcare providers, as it allows them to understand the rationale behind the
system’s predictions, thus making the model’s decisions more transparent and
easier to validate [4].
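As referenced above, a minimal sketch of the emotion-to-disorder grouping: the four categories follow the list above, while the dictionary form and exact emotion keys are illustrative assumptions.

EMOTION_TO_DISORDER = {
    "fear": "anxiety disorder",
    "disgust": "anxiety disorder",
    "sadness": "depressive disorder",
    "anger": "depressive disorder",
    "happiness": "no disorder",
    "neutral": "no disorder",
    "surprise": "other disorders",
}

def map_emotions(detected):
    """Map detected emotions to the candidate disorder categories."""
    return sorted({EMOTION_TO_DISORDER.get(e, "other disorders") for e in detected})

print(map_emotions(["sadness", "anger"]))   # ['depressive disorder']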

4.2 Flow Chart

Figure 4.1: A flow chart of the predictive logic adopted in this study.

Chapter 5

Result and Discussion

The hybrid learning architecture for mental disorder detection achieved an overall
accuracy of 81%, excelling in identifying anxiety and depressive disorders. The
ensemble model, combining MDNet, ResNet50, and ViT, outperformed individual
models, yielding a precision of 78%, recall of 80%, and an F1-score of 79%. Excluding
ViT slightly improved accuracy, emphasizing the critical roles of CNN and ResNet50
in feature extraction. ROC-AUC scores of 0.75 for anxiety and 0.76 for depressive
disorders validated the model’s classification ability. Explainability techniques like
Grad-CAM highlighted key facial regions, enhancing the system’s transparency and
reliability for early diagnosis.
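For reference, metrics of this kind can be computed with scikit-learn as sketched below; the labels and scores here are toy stand-ins, not the study's outputs.

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # 1 = disorder present
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # model decisions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # model scores

print(precision_score(y_true, y_pred))   # flagged cases that are true
print(recall_score(y_true, y_pred))      # true cases that were caught
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))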

The hybrid learning architecture for mental disorder detection achieved impres-
sive results, with the ensemble model effectively utilizing CNN, ResNet50, and Vision
Transformers to excel in identifying depressive and anxiety disorders. Explainability
techniques like Grad-CAM enhanced transparency and trust in the model’s decision-
making. Future improvements could include integrating additional modalities, such as
speech or contextual information, to further boost classification accuracy in cases of
overlapping emotional cues.

Chapter 6

Conclusion

In conclusion, the approach presented in "A Hybrid Learning-Architecture for Mental Disorder
Detection Using Emotion Recognition" proved effective, achieving
high accuracy and reliability. By integrating CNN, ResNet50, and Vision Transformers
in an ensemble model, the system was able to leverage diverse strengths in feature
extraction and classification, particularly excelling in detecting depressive and anxiety
disorders. The use of explainability techniques like Grad-CAM and saliency maps
enhanced the system’s transparency, making it more interpretable and trustworthy
for potential clinical use. While the system demonstrated strong performance,
incorporating additional data modalities, such as speech or behavioral context, could
further refine its ability to accurately differentiate between mental disorders, ensuring
more comprehensive and early diagnosis. This system holds great potential for aiding
healthcare professionals in making informed decisions for mental health interventions.

References

[1] S. A. Hussein, A. E. R. S. Bayoumi, and A. M. Soliman, “Automated detection
of human mental disorder,” Journal of Electrical Systems and Information
Technology, vol. 10, no. 1, p. 9, 2023.

[2] C. Nash, R. Nair, and S. M. Naqvi, “Machine learning in ADHD and depression
mental health diagnosis: A survey,” IEEE Access, 2023.

[3] C. Jyotsna, J. Amudha, A. Ram, D. Fruet, and G. Nollo, “PredictEYE: Personalized
time series model for mental state prediction using eye tracking,” IEEE Access,
vol. 11, pp. 128383–128409, 2023.

[4] R. K. Gupta and R. Sinha, “An investigation on the audio-video data based
estimation of emotion regulation difficulties and their association with mental
disorders,” IEEE Access, vol. 11, pp. 74324–74336, 2023.

[5] J. Aina, O. Akinniyi, M. M. Rahman, V. Odero-Marah, and F. Khalifa, “A hybrid
learning-architecture for mental disorder detection using emotion recognition,”
IEEE Access, 2024.