
FACIAL EMOTION DETECTION

USING DEEP LEARNING


Ayush Tripathi, Rahul Shah, Sonu Gautham
Department of Information Technology

Prof. Pawan Kumar Chaurasiya, Department of Information Technology

Abstract

Emotion recognition is the task of processing a human facial expression and classifying it into one of several emotion categories. Such a task typically requires a feature extractor to detect the features, and a trained classifier to produce the label based on those features. The problem is that feature extraction may be distorted by variance in object location and lighting conditions in the image. In this project, we address this problem by using a deep learning algorithm called the Convolutional Neural Network (CNN). With this algorithm, image features can be extracted without user-defined feature engineering, and the classifier model is integrated with the feature extractor to produce the result when input is given. In this way, the method produces a feature-location-invariant image classifier that achieves higher accuracy than a traditional linear classifier when variance such as lighting noise and background environment appears in the input image [1]. The evaluation of the model shows that the accuracy on our lab-condition testing data set is 94.63%, while for in-the-wild emotion detection it achieves only around 37% accuracy.

1. DATA SET

The FER 2013 dataset, short for Facial Expression Recognition 2013 dataset, is a widely used benchmark dataset in the field of computer vision and emotion recognition. It consists of facial images labeled with seven different emotion categories.

Here's an overview of the FER 2013 dataset:

Emotion Categories: The dataset includes seven emotion categories: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral. These categories represent a range of facial expressions typically observed in human emotions.

Image Samples: The dataset contains 35,887 grayscale images in total. Each image is 48x48 pixels in size, representing a cropped facial region. The images exhibit variations in pose, lighting conditions, and facial expressions.

It's important to note that the FER 2013 dataset, while widely used, has some limitations. It may not cover the full range of possible facial expressions, and the resolution of the images is relatively low. Therefore, for more complex or fine-grained emotion recognition tasks, researchers may combine the FER 2013 dataset with other datasets or employ more advanced techniques.
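As an illustration, the following minimal sketch (not this project's exact code) loads the dataset, assuming the single-CSV layout distributed on Kaggle, where each row stores an emotion label, a string of space-separated pixel values, and a Usage split:

    import numpy as np
    import pandas as pd

    # Sketch: parse FER 2013 (Kaggle CSV layout assumed: emotion, pixels, Usage).
    data = pd.read_csv("fer2013.csv")

    # Each row stores a 48x48 grayscale face as space-separated pixel values.
    pixels = np.array([np.array(p.split(), dtype=np.uint8) for p in data["pixels"]])
    images = pixels.reshape(-1, 48, 48, 1)              # (35887, 48, 48, 1)
    labels = data["emotion"].to_numpy()                 # 0..6 = Angry .. Neutral

    train_mask = (data["Usage"] == "Training").to_numpy()
    train_images, train_labels = images[train_mask], labels[train_mask]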
2. DEEP LEARNING

Deep learning is a branch of machine learning based on artificial neural networks. It is capable of learning complex patterns and relationships within data; in deep learning, we don't need to explicitly program everything. It has become increasingly popular in recent years due to advances in processing power and the availability of large datasets. It is built on artificial neural networks (ANNs), also known as deep neural networks (DNNs). These neural networks are inspired by the structure and function of the human brain's biological neurons, and they are designed to learn from large amounts of data.

2.1 Deep Learning is a subfield of Machine Learning that involves the use of neural networks to model and solve complex problems. Neural networks are modeled after the structure and function of the human brain and consist of layers of interconnected nodes that process and transform data.

2.2 The key characteristic of Deep Learning is the use of deep neural networks, which have multiple layers of interconnected nodes. These networks can learn complex representations of data by discovering hierarchical patterns and features in the data. Deep Learning algorithms can automatically learn and improve from data without the need for manual feature engineering. A minimal sketch of such a stacked network appears at the end of this section.

2.3 Deep Learning has achieved significant success in various fields, including image recognition, natural language processing, speech recognition, and recommendation systems. Some of the popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs).

2.4 Training deep neural networks typically requires a large amount of data and computational resources. However, the availability of cloud computing and the development of specialized hardware, such as Graphics Processing Units (GPUs), has made it easier to train deep neural networks.
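To make the idea of stacked layers concrete, here is a minimal Keras sketch of a network with several interconnected layers; the layer sizes are illustrative, not taken from this project:

    import tensorflow as tf

    # Sketch: a small deep network with multiple hidden layers (sizes illustrative).
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(48 * 48,)),          # a flattened 48x48 image
        tf.keras.layers.Dense(256, activation="relu"),    # hidden layer 1
        tf.keras.layers.Dense(128, activation="relu"),    # hidden layer 2
        tf.keras.layers.Dense(7, activation="softmax"),   # one output per emotion class
    ])
    model.summary()                                       # lists the stacked layers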
3. Introduction to Transfer Learning

Humans are extremely skilled at transferring knowledge from one task to another. This means that when we face a new problem or task, we immediately recognize it and use the relevant knowledge we have gained from previous learning experiences, which makes it easy to complete our tasks quickly and efficiently. A user who can ride a bike and is asked to ride a motorbike is a good example: their experience with riding a bike will be helpful, since they can already balance and steer. This makes learning easier than it would be for a complete beginner. Such lessons are extremely useful in real life because they make us better and allow us to gain more experience.

The same approach was used to introduce transfer learning into machine learning: knowledge from a source task is used to solve a problem in the target task. Although most machine learning algorithms are designed for a single task, there is ongoing interest in developing transfer learning algorithms.

Let's return to the example of a model that was intended to identify a backpack in an image and will now be used to detect sunglasses. Because the model has been trained to recognise objects in the earlier layers, we can simply retrain the subsequent layers to understand what distinguishes sunglasses from other objects.

Why Transfer Learning?

One curious feature that many deep neural networks built on images share is the ability to detect edges, colours, intensity variations, and other features in the early layers. These features are not specific to any particular task or dataset: it doesn't matter whether we are detecting lions or cars, these low-level features must be detected in both cases. They are present regardless of the exact image data or cost function. Features learned in one task, such as detecting lions, can also be used to detect humans. This is exactly what transfer learning is. Nowadays, it is rare to train whole convolutional neural networks from scratch. Instead, it is common to take models pre-trained on a large image collection such as ImageNet (1.2 million images with 1000 categories), for example MobileNet, and then reuse their features to solve a new task.

4. MobileNetV2

MobileNetV2 is a convolutional neural network architecture that seeks to perform well on mobile devices. It is based on an inverted residual structure where the residual connections are between the bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. As a whole, the architecture of MobileNetV2 contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers.

Inverted Residuals with Linear Bottlenecks: MobileNetV2 introduces the concept of inverted residuals, which is contrary to the traditional residual blocks commonly used in deeper CNN architectures. Inverted residuals aim to reduce the computational cost while maintaining model accuracy. Additionally, linear bottlenecks are employed to reduce the number of channels in the network, further reducing computation and memory usage.
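In Keras, the retrain-only-the-head recipe described above, applied to MobileNetV2, might look like the following sketch; the input size and the new head are our assumptions, not the project's exact code:

    import tensorflow as tf

    # Sketch: keep MobileNetV2's pre-trained feature extractor, retrain a new head.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3),      # input size assumed
        include_top=False,              # drop the 1000-class ImageNet classifier
        weights="imagenet")
    base.trainable = False              # freeze the generic early layers

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(7, activation="softmax"),   # new 7-emotion head
    ])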
5. TENSORFLOW

TensorFlow is an open-source machine learning


framework developed by the Google Brain team. It
provides a comprehensive ecosystem of tools,
libraries, and resources for building and deploying
machine learning models. TensorFlow is designed
to handle both research and production
environments, making it a popular choice for a
wide range of applications in academia and
industry.

Key features and concepts of TensorFlow include:


6. Introduction to OpenCV

5.1Computational Graph: TensorFlow represents OpenCV (Open Source Computer Vision Library)
computations as a directed graph, known as a is an open-source computer vision and image
computational graph. Nodes in the graph represent processing library. It provides a comprehensive set
operations, and edges represent the flow of data of functions and algorithms for image and video
between operations. This graph-based approach analysis, including object detection, feature
allows for efficient execution and optimization of extraction, image enhancement, and more. OpenCV
complex computations. was originally developed by Intel and is now
maintained by the OpenCV community.

5.2 Flexibility: TensorFlow supports a wide range


of machine learning tasks, including deep learning, 6.1 Key features and capabilities of OpenCV
reinforcement learning, and traditional machine include:
learning algorithms. It provides various levels of 6.1.1 Image and Video Processing: OpenCV offers
abstraction, allowing developers to choose the level a wide range of functions to read, manipulate, and
of control and flexibility that best suits their needs. process images and videos. It provides tools for
basic operations such as resizing, cropping, and
filtering, as well as advanced techniques like edge
5.3 TensorFlow Keras: As mentioned earlier, detection, image segmentation, and optical flow
TensorFlow includes Keras as its official high-level analysis.
API for building neural networks. Keras provides a
user-friendly interface for designing and training
deep learning models, making it accessible to 6.1.2 Object Detection and Tracking: OpenCV
beginners while still maintaining extensibility and includes pre-trained models and algorithms for
flexibility. object detection and tracking. These algorithms can
identify and track objects in images and videos,
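For example, TensorFlow can trace an ordinary Python function into a reusable computational graph; a minimal sketch:

    import tensorflow as tf

    # Sketch: TensorFlow traces this Python function into a computational graph.
    @tf.function
    def affine(x, w, b):
        return tf.matmul(x, w) + b      # nodes: matmul and add; edges: tensors

    x = tf.ones((1, 3))
    w = tf.ones((3, 2))
    b = tf.zeros((2,))
    print(affine(x, w, b))              # runs the traced graph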
6. Introduction to OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source computer vision and image processing library. It provides a comprehensive set of functions and algorithms for image and video analysis, including object detection, feature extraction, image enhancement, and more. OpenCV was originally developed by Intel and is now maintained by the OpenCV community.

6.1 Key features and capabilities of OpenCV include:

6.1.1 Image and Video Processing: OpenCV offers a wide range of functions to read, manipulate, and process images and videos. It provides tools for basic operations such as resizing, cropping, and filtering, as well as advanced techniques like edge detection, image segmentation, and optical flow analysis.

6.1.2 Object Detection and Tracking: OpenCV includes pre-trained models and algorithms for object detection and tracking. These algorithms can identify and track objects in images and videos, enabling applications such as face detection, pedestrian detection, and object tracking.

6.1.3 Feature Extraction and Matching: OpenCV provides functions for extracting features from images and matching them between different images. These features can be used for tasks like image stitching, image recognition, and image retrieval.
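A short sketch of typical OpenCV usage; the file names are placeholders:

    import cv2

    # Sketch: common OpenCV operations (file names are placeholders).
    img = cv2.imread("face.jpg")                    # load an image (BGR order)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # convert to grayscale
    small = cv2.resize(gray, (48, 48))              # resize, e.g. to FER 2013's 48x48
    edges = cv2.Canny(gray, 100, 200)               # edge detection
    cv2.imwrite("edges.jpg", edges)                 # save the result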
7. Haar Cascade XML File

A Haar cascade XML file is a file format used by OpenCV for object detection using Haar-like features. It is a machine learning-based approach for detecting objects in images or video streams. The Haar cascade XML file contains a trained model that consists of a set of weak classifiers arranged in a cascade structure.

Here's how the Haar cascade XML file is used for object detection:

7.1 Training: The Haar cascade XML file is generated through a training process. It involves providing a large dataset of positive samples (images containing the object of interest) and negative samples (images without the object). The training algorithm iteratively learns and selects a set of discriminative features known as Haar-like features.

7.2 Features and Classifiers: Haar-like features are rectangular filters that are applied to image regions at different scales and positions. These features capture local image variations and can help distinguish between object and non-object regions. The weak classifiers are trained using these features to classify whether a particular region contains the object or not.

7.3 Object Detection: To detect objects using a Haar cascade XML file, the XML file is loaded into an application or script that utilizes the OpenCV library. The image or video frames are passed through the cascade, and at each stage, the region is evaluated by the weak classifiers. If a region is classified as a potential object, it proceeds to the next stage. If it fails at any stage, it is discarded. Regions passing through all stages are considered detections, and their locations are reported.

Haar cascade XML files have been widely used for object detection tasks, including face detection, pedestrian detection, and object recognition. OpenCV provides pre-trained Haar cascade XML files for various objects, which can be used directly for detection or can be further fine-tuned using additional training data.
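The detection steps above translate into a few lines of OpenCV; a minimal sketch using one of the pre-trained cascade files that ship with OpenCV (the image file name is a placeholder):

    import cv2

    # Sketch: face detection with a bundled Haar cascade XML file.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("photo.jpg")                   # placeholder file name
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:                      # reported detection locations
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)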

8. The Method

In this project, we use a deep learning technique called Convolutional Neural Networks (CNNs) to establish a classification model that combines feature extraction with classification. Since training from scratch typically takes more than a week to complete, we decided to fine-tune the existing MobileNetV2 model trained by Google. In this way, we avoid a large computation cost in both time and opportunity, while our data can still help the model fit the local variance and conditions.

8.1 The Procedure:

8.1.1 Obtain the raw dataset of cropped faces with correct labels.
8.1.2 Generate the train directory for the test before training.
8.1.3 Obtain the confusion matrix (shown below).
8.1.4 Obtain the model and train it on a Jupyter Notebook.
8.1.5 Evaluate the training result.
9. Test Before Training

The data is preprocessed to change the filename format to include labels, and FER 2013 is cropped to separate faces from the background. The test was conducted on the pre-trained MobileNetV2 model fine-tuned by Google. The confusion matrix is generated, along with demonstrations.

9.1 Data Augmentation and Training:

The dataset is divided into training, validation, and test portions. The team also enlarges the dataset by 30 times through the addition of random noise, rotation by random angles, and horizontal (mirror-like) flips. The dataset is also divided by label into 7 classes. The training is conducted on the MobileNetV2 model. The raw data is about 2-3 GB. The batch size is 32 for training and 16 for validation. The total number of epochs for this training is 25, with 448 steps per epoch. The optimizer is Adam, and Nesterov momentum was applied, with a base learning rate of 1e-4. The training takes about eight and a half hours to complete.
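A minimal sketch of this style of augmentation with Keras' ImageDataGenerator; the exact ranges and the directory layout are our assumptions, and injected random noise would need a custom preprocessing_function:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Sketch: augmentations of the kind described (values assumed).
    augmenter = ImageDataGenerator(
        rescale=1.0 / 255,          # normalize pixel values
        rotation_range=15,          # random rotation of angles (range assumed)
        horizontal_flip=True)       # mirror-like flip

    train_flow = augmenter.flow_from_directory(
        "data/train",               # hypothetical directory split by the 7 labels
        target_size=(224, 224),
        batch_size=32,              # training batch size used in this project
        class_mode="categorical")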
10. ReLU (Rectified Linear Unit)

ReLU is an activation function commonly used in neural networks and deep learning models. It introduces non-linearity into the network, allowing it to learn and approximate complex relationships between inputs and outputs.

The ReLU activation function is defined as follows:

ReLU(x) = max(0, x)

In other words, ReLU outputs the input value if it is positive or zero, and it outputs zero for any negative input. Mathematically, this can be represented as a piecewise linear function.

The benefits of ReLU activation include:

Simplicity and Computational Efficiency: ReLU is a simple function that is computationally efficient to compute, compared to more complex activation functions like sigmoid or tanh.

Avoiding Vanishing Gradient: ReLU helps alleviate the vanishing gradient problem, which can occur during training when gradients become very small in deep networks. ReLU allows the gradients to flow more easily, enabling more effective training of deep neural networks.

Sparse Activation: ReLU tends to produce sparse activations, where only a subset of the neurons are activated. This can lead to more efficient and expressive representations in the network.

Faster Convergence: ReLU has been observed to promote faster convergence during training compared to other activation functions, such as sigmoid or tanh.
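The definition above is one line of NumPy:

    import numpy as np

    # Sketch: ReLU as a piecewise linear function.
    def relu(x):
        return np.maximum(0, x)     # positives pass through, negatives become 0

    print(relu(np.array([-2.0, 0.0, 3.0])))    # -> [0. 0. 3.]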
11. SOFTMAX

Softmax is an activation function that is commonly used in neural networks, particularly in multi-class classification problems. It transforms a vector of real-valued numbers into a probability distribution over multiple classes, assigning a probability to each class.

The softmax function takes a vector of input values, often referred to as logits or scores, and applies an exponentiation and normalization process to produce the probabilities. The softmax function is defined as follows:

softmax(x_i) = exp(x_i) / sum(exp(x_j))

where x_i represents the input value for a particular class, exp() denotes the exponential function, and the sum is taken over all classes.

The softmax function ensures that the output probabilities sum to 1, making it suitable for multi-class classification tasks. It maps the input logits to a probability distribution, where higher values are associated with higher probabilities.

The benefits of softmax activation include:

11.1 Probability Interpretation: The softmax function assigns probabilities to each class, allowing for a clear interpretation of the model's confidence in its predictions. The highest probability indicates the most likely class.

11.2 Differentiation: The softmax function is differentiable, which is essential for training neural networks using gradient-based optimization algorithms, such as backpropagation.

11.3 Convergence: Softmax helps stabilize and guide the learning process during training by providing well-defined probability distributions, enabling more effective convergence.
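In NumPy, the formula above can be sketched as follows; subtracting the maximum logit before exponentiating is a standard trick for numerical stability and does not change the result:

    import numpy as np

    # Sketch: a numerically stable softmax.
    def softmax(x):
        z = x - np.max(x)           # shifting the logits avoids overflow in exp()
        e = np.exp(z)
        return e / e.sum()          # probabilities that sum to 1

    print(softmax(np.array([2.0, 1.0, 0.1])))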

12. ADAM

Adam (Adaptive Moment Estimation) is an optimization algorithm commonly used for training deep learning models. It combines concepts from both stochastic gradient descent (SGD) and momentum-based methods to efficiently update model parameters during the training process.

Here's an overview of the Adam optimization algorithm:

12.1 Adaptive Learning Rate: Adam adapts the learning rate for each parameter individually based on the historical gradients. It calculates adaptive learning rates by considering both the first and second moments of the gradients.

12.2 Momentum: Adam utilizes the concept of momentum, which helps accelerate convergence and overcome local optima. It maintains a moving average of past gradients, similar to other momentum-based methods like Nesterov accelerated gradient (NAG).

12.3 Parameter Updates: Adam computes the update for each parameter using the current gradient, the first and second moment estimates, and a learning rate. It updates the parameters along the direction of the accumulated gradients, adjusted by the learning rate.

12.4 Bias Correction: To address the initial bias caused by initializing the first and second moment estimates to zero, Adam performs bias correction during the initial iterations of training.

12.5 Hyperparameters: Adam introduces two important hyperparameters: the learning rate and the exponential decay rates for the first and second moments of the gradients. These hyperparameters can be tuned to control the optimization process and achieve better convergence.

12.6 The working of Adam is as follows:

12.6.1 Initialization: Adam initializes two important variables, namely the first-moment moving average (m) and the second-moment moving average (v), both with zeros. These variables keep track of the exponentially decaying averages of past gradients and squared gradients, respectively.

12.6.2 Compute Gradients: During each training iteration (batch), the gradients of the model's parameters with respect to the loss function are computed using backpropagation. These gradients represent the direction and magnitude of the update needed to minimize the loss.

12.6.3 Update Moving Averages: Adam updates the first-moment moving average (m) and the second-moment moving average (v) using the gradients computed in the previous step. The update equations are as follows:

m = β1 * m + (1 - β1) * gradients
v = β2 * v + (1 - β2) * (gradients ** 2)

where β1 and β2 are hyperparameters that control the decay rates of the moving averages. Typical values are β1 = 0.9 and β2 = 0.999.

12.6.4 Bias Correction: Since the moving averages are initialized with zeros, they may be biased towards zero at the beginning of training. To address this, Adam performs bias correction by scaling the moving averages:

m_hat = m / (1 - β1^t)
v_hat = v / (1 - β2^t)

where t is the current training iteration. This bias correction is crucial to stabilize the optimization process, especially in the early stages of training.

12.6.5 Update Model Parameters: Finally, Adam calculates the updates to the model's parameters based on the bias-corrected moving averages and applies them to the model:

parameter = parameter - learning_rate * m_hat / (sqrt(v_hat) + epsilon)

where learning_rate is the initial learning rate, and epsilon is a small constant (e.g., 1e-8) added to the denominator to prevent division by zero.

12.6.6 Repeat: Steps 12.6.2 to 12.6.5 are repeated for each training iteration (batch) until the model converges or reaches a predefined stopping criterion.
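The equations above condense into a few lines of NumPy; the default lr matches the 1e-4 base learning rate used in this project:

    import numpy as np

    # Sketch: one Adam update step, following the equations above.
    def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad          # first-moment moving average
        v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment moving average
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
        return param, m, v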
13. Approach to Apply CNN

From the previous sections, we understand that a CNN solves the issues of conventional methods, such as variance of feature location and noise surrounding the face. It does, however, have its own requirements for the training process. In this section, our own solution is presented to improve the model so that it fits the local dataset and constraints.

13.1 The Training

13.1.1 The Environment of Training

One of the disadvantages of a CNN is that it requires a very powerful GPU, or groups of powerful GPUs, to complete the training, because the GPU has dedicated memory for computing the gradient descent algorithm [9]. In addition, in order to construct the model for the machine to train, we also need a framework to build the network. There are several popular frameworks [9]:

• Caffe
• TensorFlow
• Keras

For the purpose of this project, we will use Keras in TensorFlow on a Jupyter Notebook machine to train our network, which satisfies the hardware requirement.

13.1.2 The Parameters of Training

The following are the parameters of training (a sketch of how they map onto a Keras training call appears at the end of this section):

• Training batch size: 128
• Steps/epoch: 148
• Validation batch size: 16/32
• Steps for validation: 100/50
• γ: 0.1
• Momentum: 0.9
• Nesterov momentum: True

In practice, we train the model two times, in runs of 30 epochs and 25 epochs, to obtain the final model.

13.1.3 Evaluation and Prediction

The evaluation of the model includes testing the model against the test portion of the original dataset and a comparison between the untrained model and the trained model. We generate the confusion matrix to analyze the model.
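As an illustration, the listed parameters map onto a Keras training call roughly as follows. This is a sketch under the assumption that SGD with Nesterov momentum (momentum 0.9, as listed) is the optimizer; model, train_flow, and val_flow stand for the fine-tuned network and hypothetical data generators:

    import tensorflow as tf

    # Sketch: training call matching the listed parameters (generators assumed).
    optimizer = tf.keras.optimizers.SGD(
        learning_rate=1e-4, momentum=0.9, nesterov=True)

    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(train_flow,                 # training generator (batch size 128)
              steps_per_epoch=148,
              epochs=30,                  # first run; a 25-epoch run follows
              validation_data=val_flow,   # validation generator
              validation_steps=100)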
14. Result and Functionality of Solution

14.1 The Result of Training Accuracy and Loss Function

After the training, the training accuracy reaches around 95%, and the validation accuracy reaches around 81%.

15. The Results of Tests on Lab-Condition Images

In order to test our model, we use the test portion mentioned earlier to conduct the test on images under lab conditions. The number of images is slightly more than 200, and because of the bias of the original dataset, the test dataset consists of mostly neutral images. In general, the accuracy is around 94% for the trained final model. However, the system essentially guesses the output at random when we run the test on the untrained model, thereby generating random results. These are shown as confusion matrices below.
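A confusion matrix of this kind can be generated with scikit-learn [8]; a minimal sketch, where test_images and test_labels are hypothetical placeholders for the held-out test portion:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Sketch: confusion matrix over the test portion (arrays hypothetical).
    y_prob = model.predict(test_images)         # per-class probabilities
    y_pred = np.argmax(y_prob, axis=1)          # most likely class per image
    print(confusion_matrix(test_labels, y_pred))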
Conclusion

Facial expressions play an important role in finding the roots of causes and issues in our day-to-day life. In an earlier era, we had thick registers to store details; this approach was entirely manual, as technology had not yet risen high. Then came the whole new technology of CCTV cameras, and then the launch of biometric systems. In the near future, the most widely used technology will be face recognition and feature classification.

• While truck drivers drive all night for continuous days, they need absolute rest during the daytime. If they don't, mishaps may happen. At toll gates, we can have this facial expression system, which will detect the fatigued, weary expression of the driver and prevent mishaps.

• Facial expression recognition can also be used for ASD (Autism Spectrum Disorder), where the person suffering cannot express their emotions due to certain abnormalities they face; this machine learning system comes into use here.
References

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.

[2] D. Duncan, G. Shine, and C. English, “Facial emotion recognition in real time.” Available: http://cs231n.stanford.edu/reports/2016/pdfs/022_Report.pdf [Accessed: 2023-07-12].

[3] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression,” 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops, pp. 94–101, 2010.

[4] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, “Coding facial expressions with Gabor wavelets,” Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 200–205, 1998.

[5] T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580), pp. 46–53, 2000.

[6] F.-F. Li, J. Johnson, and S. Yeung, “Class notes of CS231n, Stanford University.” Available: http://cs231n.github.io/ [Accessed: 2023-06-23].

[7] F. Chollet, Deep Learning with Python. Manning Publications, 2017.

[8] “Confusion matrix,” scikit-learn documentation. Available: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py [Accessed: 2023-07-10].

[9] “Keras documentation.” Available: https://keras.io [Accessed: 2023-07].
