
Industry Oriented Mini Project Report on

FACIAL KEYPOINTS DETECTION USING MACHINE LEARNING
A Mini Project report submitted in
partial fulfillment of the requirements for the award of the Degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

By

VADDI ABHILASH REDDY 16671A0554

Department of Computer Science and Engineering


J.B. Institute of Engineering & Technology
(UGC Autonomous, Permanently Affiliated to Jawaharlal Nehru Technological University, Hyderabad)

Bhaskar Nagar (Post), Moinabad Mandal, R.R. Dist.-500075


2019

J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY
(UGC Autonomous, Accredited by NAAC, Permanently Affiliated to JNTUH)
Bhaskar Nagar (Post), Moinabad Mandal, R.R. Dist.-500075

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the mini project report entitled FACIAL KEYPOINTS
DETECTION USING MACHINE LEARNING being submitted by VADDI
ABHILASH REDDY in partial fulfillment for the award of the Degree of Bachelor of
Technology in Computer Science and Engineering to the Jawaharlal Nehru Technological
University is a record of bona fide work carried out by him/her under my guidance and
supervision.
The results embodied in this mini project report have not been submitted to any
other University or Institute for the award of any Degree or Diploma.

Signature of Internal Guide Signature of Head of the Department


Mr. S. Sathish Kumar Dr. P. Srinivasa Rao
Assistant Professor Professor

J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

DECLARATION

I, VADDI ABHILASH REDDY, bearing Roll No. 16671A0554, a bona fide student of
J.B. Institute of Engineering and Technology, hereby declare that the mini project
titled “FACIAL KEYPOINTS DETECTION USING MACHINE LEARNING”, submitted in
partial fulfillment of the B.Tech Degree course of Jawaharlal Nehru Technological University,
is my original work carried out in the year 2019 under the guidance of Mr. S. SATHISH KUMAR,
ASSISTANT PROFESSOR, Computer Science & Engineering Department, and that it has not
previously formed the basis for any degree, diploma or other similar title submitted to
any university.

VADDI ABHILASH REDDY


Date:

ACKNOWLEDGEMENT

I thank my Principal, Dr. Niraj Upadhayaya, Professor in Computer Science and
Engineering, for extending his utmost support and cooperation in providing all the provisions,
and the Management for providing excellent facilities to carry out my project work.
I take it as a great privilege to express my heartfelt gratitude to Dr. P. Srinivasa Rao,
Professor and Head of the Department, for his valuable support, and to all senior faculty
members of the CSE department for their help during my course.
I would like to express my sincere gratitude to my guide, Mr. S. Sathish Kumar,
Assistant Professor, Computer Science and Engineering Department, whose knowledge and
guidance have motivated me to achieve goals I never thought possible. He has consistently
been a source of motivation, encouragement, and inspiration. The time I have spent working
under his supervision has truly been a pleasure.
Finally, special thanks to my parents and brother for their support and encouragement
throughout my life and this course. Thanks to all my friends and well-wishers for their
constant support.

VADDI ABHILASH REDDY


16671A0554

ABSTRACT
Nowadays, facial keypoints detection has become a very popular topic, and its
applications, such as Snapchat and “How old are you”, have attracted a large number of users.
The objective of facial keypoints detection is to find the facial keypoints in a given face,
which is very challenging because facial features vary greatly from person to person. Deep
learning has been applied to this problem in the form of neural networks and cascaded neural
networks, and the results of these structures are significantly better than earlier
state-of-the-art methods, such as feature extraction and dimension reduction algorithms. In our
project, we would like to locate the keypoints in a given image using deep architectures, to
not only obtain lower loss for the detection task but also accelerate the training and testing
process for real-world applications. We have constructed two basic neural network structures,
a one-hidden-layer neural network and a convolutional neural network, as our baselines, and we
have proposed an approach to better locate the coordinates of facial keypoints. Specifically,
we use a block of the pretrained Inception Model to extract intermediate features and use
different deep structures to compute the final output vector. The experimental results have
shown the effectiveness of deep structures for facial keypoints detection tasks; using the
pretrained Inception Model slightly improves detection performance compared to the baseline
methods.

TABLE OF CONTENTS

1. INTRODUCTION
1.1 Facial features are very different
1.2 Detecting keypoints has to be fast
1.3 Existing And Proposed System
2. LITERATURE SURVEY
3. SOFTWARE REQUIREMENTS ANALYSIS
3.1 Hardware requirements
3.2 Software requirements
4. BACKGROUND
4.1 Machine learning
4.2 Artificial neural networks
4.3 Activation function
4.4 Deep neural networks
4.5 Convolutional neural networks
4.6 Computational graph
4.7 TensorFlow
4.8 Graphics processing unit
4.9 Dataset
5. SYSTEM DESIGN
5.1 System Architecture
5.2 UML Diagrams
6. SYSTEM IMPLEMENTATION
6.1 Code
7. OUTPUT SCREENS
8. CONCLUSION
9. FUTURE ENHANCEMENTS
10. BIBLIOGRAPHY

TABLE OF FIGURES

Figure 4.1: Neuron with inputs, weights and bias applied
Figure 4.2: Neural network with multiple layers
Figure 4.3: Sigmoid Function
Figure 4.4: Tanh or Hyperbolic Tangent
Figure 4.5: ReLU
Figure 4.6: Leaky ReLU
Figure 4.7: Example of convolutional layer with kernel size 3 × 3 and padding 1
Figure 4.8: Example of 2 × 2 max-pooling with stride 2
Figure 4.9: Example of an image from the dataset with marked facial keypoints
Figure 4.10: Number of images in the dataset for each facial feature
Figure 5.1: System Architecture
Figure 5.2: Use Case Diagram
Figure 5.3: Sequence Diagram
Figure 5.4: State Chart Diagram
Figure 7.1: Checking for null fields in the file
Figure 7.2: Model Summary
Figure 7.3: Compiling the model
Figure 7.4: Training the model
Figure 7.5: Output Prediction
Figure 7.6: Output Prediction

1. INTRODUCTION

With the fast development of the computer vision area, more and more research works
and industry applications are focused on facial keypoints detection. Detecting keypoints in a
given face image acts as a fundamental part of many applications, including facial
expression classification, facial alignment, tracking faces in videos, and applications for
medical diagnosis. Thus, detecting facial keypoints both quickly and accurately, so that
detection can be used as a preprocessing step, has become a big challenge. There are two main
challenges for facial keypoints detection: one is that facial features vary greatly between
different people and under different external factors; the other is that we have to reduce
time complexity to achieve real-time keypoints detection.

1.1 Facial features are very different


Facial features differ greatly from person to person, which makes training the
regression model difficult. As in the object detection task, detecting facial keypoints under
different illumination conditions, positions and sizes is very challenging.

1.2 Detecting keypoints has to be fast


Detecting keypoints in face images is one of the first steps for many applications,
as mentioned before, for example analyzing facial expressions or detecting faces in images
and videos. Moreover, if we would like to fit the detection procedure into a real-time mobile
app, we have to complete the detection within seconds. Therefore, the computational complexity
of keypoints detection has to be lower than that of traditional image classification tasks.
Unlike other image classification tasks, where we only evaluate accuracy, we also have to
focus on the time spent on the task. When using traditional deep structures, the training and
testing processes tend to be much slower, which is not what we want for facial keypoints
detection.

1.3 Existing And Proposed System
There are two main kinds of state-of-the-art methods for detecting facial keypoints:
the first type uses feature extraction algorithms, such as Gabor features, and the other uses
probabilistic graphical models to specify the relationship between pixels and their neighbors.
With the development of deep learning, many deep structures designed for this task have been
explored recently. Different deep structures have been proposed for facial keypoints
detection, such as the deep convolutional cascade network, which can better deal with the two
main challenges mentioned before. Our objective is to locate 15 sets of facial keypoints when
given a raw facial image. The input is a set of raw 96 × 96 facial images with only grayscale
pixel values, and the output is a 30-dimensional vector indicating the (x, y) coordinates of
15 sets of facial keypoints. In our project, we use deep structures for facial keypoints
detection, which can learn well from different faces and overcome, to a great extent, the
variance between faces of different persons or under different conditions. Two widely used
models, a One Hidden Layer Neural Network and a Convolutional Neural Network, are designed as
baselines in our project. Most importantly, we have used the pretrained Inception Model to
explore techniques that reduce the computational complexity of detecting facial keypoints.
With sparsely connected layers and different numbers of filters of different sizes, the
Inception Model is able to better capture local features and reduce computational complexity
at the same time. Compared to our baseline models, we see a great improvement when using
pretrained Inception models to predict the locations of facial keypoints. The contributions of
this paper include:

1. Explore the performance of different deep structures on the task of detecting facial
keypoints, and evaluate their effectiveness using the MSE loss.
2. Use the recently proposed Inception Model for detecting facial keypoints. Although
the Inception Model is trained on an image classification task on ImageNet, the
experimental results show that its intermediate features adapt well to the facial
keypoints detection task.
3. Conduct experiments on real-world datasets from a Kaggle challenge; our model
can be easily extended to other facial detection tasks.

2. LITERATURE SURVEY
Traditional methods have explored feature extraction strategies, including texture-based
and shape-based features, and different types of graphical models to detect facial keypoints.
[10] proposed a method that uses Gabor wavelet feature based boosted classifiers,
which can detect 20 different facial keypoints under limited head rotation and different
illumination conditions.
[4] also uses Gabor features for detecting facial keypoints. Using a sample log-Gabor
response of a facial point, the authors have shown that locations on test facial images
that are similar to the sample facial points can be detected.
[6] focused on keypoints detection for textured 3D face recognition. Features are
extracted by fitting a surface to the neighborhood of a keypoint and sampling it on a uniform
grid. PCA is then used for dimensionality reduction, and the features are projected and
matched between probe and gallery faces. Other methods have designed probabilistic graphical
models to capture the relationship between different pixels and features to detect facial
keypoints. Using Markov Random Fields, [9] exploits the constellations that facial points can
form. The authors also use boosted regression to learn the mapping from the pixel appearance
of the area around the keypoints to the locations of the keypoints. These models have reached
high performance on aligned faces but still need improvements for faces under different
environmental conditions.
[5] proposed a model that detects the target location by local evidence aggregation.
By aggregating the estimates obtained from stochastically selected local appearance
information into a single robust prediction, rather than focusing on the target location only,
the proposed model solves the regression problem from a new perspective. Recent works have
focused on deep architectures for this detection task, since these structures can better
capture the high-level features of an image, in our problem a given face.
[7] proposed a carefully designed three-level convolutional network, which first
captures the global high-level features and then refines the initialization to locate the
positions of the keypoints. On the other hand, [3] used a pretrained DBN with a surrounding
feed-forward neural network and a linear Gaussian output layer to detect facial keypoints.
Different from our work, the deep structures mentioned before have not focused on the time
complexity, only on the correctness of the detected keypoints. Using the Inception Model, we
can train a much more complex model in less time, so considering the number of parameters,
the detection process is faster than with traditional deep structures, which can be adapted
to this task.

3. SOFTWARE REQUIREMENTS ANALYSIS

3.1 Hardware requirements:

● System : Intel i5, 2.4 GHz
● Hard disk : 1 TB
● Monitor : 15.6" VGA Color
● RAM : 8 GB

3.2 Software requirements:

● Operating system : Windows 10
● Packages : Python 3.6.x
● Tool : Anaconda (Jupyter Notebook)

4. BACKGROUND

4.1 Machine learning


Machine learning is a field of computer science that specializes in techniques that
give computers the ability to learn. The goal is to teach a computer system to solve tasks
without explicitly programming it. Machine learning tasks can be roughly divided into
multiple categories:
1. Supervised learning
   ● Classification
   ● Regression
2. Unsupervised learning
   ● Clustering
3. Reinforcement learning
Supervised learning is used to learn a function that maps an input to an output based
on example input-output pairs. When performing classification, the goal is to categorize
inputs into classes. To illustrate, imagine a training set of pictures containing various
kinds of fruits, one type per picture. For each image, we have a label that tells us which
type of fruit is in the picture. We can use this set of pairs to train a computer system
which we can later use to classify images that we do not have labels for. Another application
of supervised learning is solving regression tasks, where the output is a continuous value.
For example, the value of a house can be predicted based on its size. Classification tasks
can be solved by approaches such as decision trees or neural networks. Regression tasks can
be solved by algorithms such as linear regression, regression trees, neural networks and
others.
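As a minimal sketch of the regression setting just described, the lines below fit a linear model with scikit-learn; the house sizes and prices are invented purely for illustration and are not part of this project:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: house sizes in square metres and prices in thousands (invented values).
sizes = np.array([[50], [80], [120], [160]])
prices = np.array([100, 155, 240, 310])

model = LinearRegression()
model.fit(sizes, prices)         # learn from example (input, output) pairs
print(model.predict([[100]]))    # predict the price of a 100 square metre house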

4.2 Artificial neural networks


Artificial neural networks (or simply neural networks) are inspired by biological
brains. They can be used to solve a variety of tasks, such as computer vision,
medical diagnosis or speech recognition. Neural networks can improve their performance on a
certain task by taking examples into consideration. Such a process is simply called
learning.
A neural network (NN) consists of artificial neurons. An artificial neuron (simply
neuron) is inspired by the biological neuron. Artificial neurons are connected together. Each
connection is used to transmit a signal to another neuron. The receiving neuron can process it
and signal other neurons connected to it. Each neuron has a set of weights, associated with
the edges connecting it to the neurons in the previous layer. A neuron has one output value,
which is the weighted sum of its input values plus a bias, passed to an activation function.
The output value of a neuron is defined by the following formula:

Figure 4.1: Neuron with inputs, weights and bias applied
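Since the formula itself appears only in Figure 4.1, a minimal NumPy sketch of a single neuron's output is given below; the sigmoid activation and the input values are chosen purely for illustration:

import numpy as np

def neuron_output(x, w, b):
    # Weighted sum of inputs plus bias, passed to an activation function (sigmoid here).
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input values (invented)
w = np.array([0.4, 0.7, -0.2])   # one weight per input
b = 0.1                          # bias
print(neuron_output(x, w, b))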

The simplest type of neural network is called the perceptron. It consists of one neuron,
and it is able to decide whether an input defined by a vector of numbers belongs to some
specific class or not. Single-layer perceptrons are capable of solving linearly separable
problems. The neuron in a single-layer perceptron uses an activation function which maps the
weighted sum of inputs plus bias to the value 0 or 1. The perceptron algorithm was invented
in 1958 by Frank Rosenblatt.
Typically, neurons are organized into groups called layers. A neural network consists
of at least two layers, that is, an input and an output layer. It may also contain hidden
layers. A multi-layer perceptron (MLP) consists of at least three layers. The output and
hidden layers use non-linear activation functions. In contrast to the single-layer
perceptron, the multi-layer perceptron is capable of learning data that is not linearly
separable.

Figure 4.2: Neural network with multiple layers.

4.3 Activation function:


An activation function defines the output of a neuron given an input. There are many
activation functions; this subsection describes only a few of the commonly used ones, which
are visualised in the figures below.

4.3.1 Binary step:


A simple activation function which returns 0 or 1, representing whether the neuron is
firing or not.

4.3.2 Sigmoid function:


Defined by f(x) = 1 / (1 + e^(−x)), its curve has the shape of an “S”. It maps the
resulting values into the range between 0 and 1. Therefore, it is usually used in the output
layer for tasks where the output is a probability, for example binary classification.

Figure 4.3: Sigmoid Function

4.3.3 Tanh or hyperbolic tangent:


It is similar to the sigmoid function. It also has an “S”-like shape, and its range is
from −1 to 1. The advantage of Tanh over the Sigmoid function is that zero inputs will be
mapped near zero and negative inputs will be strongly negative.

Figure 4.4: Tanh or Hyperbolic Tangent

4.3.4 ReLU (Rectified Linear Unit):
Currently the most used activation function, since it works well for convolutional
neural networks and for deep neural networks in general. It computes the function
f(x) = max(0, x). The issue with ReLU is that all negative values become zero, which may
decrease the ability of the model to train properly.

Figure 4.5: ReLU

4.3.5 Leaky ReLU:


It attempts to fix that problem. Instead of returning 0 for x < 0, it returns a small
negative value, for example 0.1x.

Figure 4.6: Leaky ReLU
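For reference, the activation functions described above can be sketched in a few lines of NumPy; these are the textbook definitions, not code from this project:

import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1.0, 0.0)     # neuron fires (1) or not (0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # maps values into (0, 1)

def tanh(x):
    return np.tanh(x)                     # maps values into (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)             # zero for all negative inputs

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)  # small negative slope instead of zero

print(sigmoid(np.array([-2.0, 0.0, 2.0])))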

Activation functions share a common attribute: their derivatives can be easily
computed, which is used during learning to find the slope of the curve. The slope is needed
to know in which direction, and by how much, to adjust the weights and biases to find their
optimal values. By using activation functions with easily computed derivatives, we can save
some computation.
Sigmoid and Tanh functions share the same disadvantage: the slope of their curve
gets low for x values far from 0, which can slow down learning. Neither ReLU nor Leaky
ReLU has this problem for x > 0. When choosing an activation function for our model,
ReLU is the default choice. To measure the error of a neural network, one common function
is the cross entropy, H(p, q) = −Σ p(x) log q(x), where p is the ground truth and q is the
predicted value. It is used to measure how close the true probability distribution is to the
predicted probability distribution.

4.3.6 Mean squared error (MSE):


Another function that can be used to measure the error of a neural network. It is
defined by:

MSE = (1/n) Σ_i (y_i − ŷ_i)²

where n is the number of outputs, y_i is the actual value and ŷ_i is the predicted value.
MSE punishes large mistakes much more than small mistakes, and its output is always
positive. It is therefore commonly used as a cost function for regression tasks, where the
goal is to predict real values.
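As a small sketch, MSE can be computed directly in NumPy (the values below are invented):

import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between actual and predicted values.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred))  # the large 1.5 error dominates the result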

4.3.7 Gradient descent:


The goal of optimization is to find the optimal weights and biases that minimize the
loss function. The most common optimization algorithm is gradient descent. It uses the first
derivative of the function at a given point to find the direction in which the weights and
biases need to be tuned in order to lower the loss. There are multiple variants of gradient
descent.
Batch gradient descent uses the whole dataset to calculate the gradient of the loss
function. The descent can be very slow, because only one update is performed for the whole
dataset. Also, the whole dataset needs to fit in memory, which can be a problem, especially
for the very large amounts of data used to calculate the gradient.
Stochastic gradient descent (SGD) calculates the gradient and performs an update
for each example. Therefore, it is usually much faster. On the other hand, performing frequent
updates with high variance causes the loss function to oscillate. Mini-batch gradient descent
is a combination of the previous two approaches. It performs an update for every mini-batch of
size n. Calculating the gradient over n examples can lead to more stable convergence. The
batch size is usually chosen in the range between 50 and 256, but it may vary depending on the
dataset and application. Mini-batch gradient descent is generally the best choice, because it
combines the advantages of the other two approaches.
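A minimal sketch of mini-batch gradient descent on a one-dimensional linear regression problem is given below; the data, learning rate and batch size are invented for illustration:

import numpy as np

np.random.seed(0)
X = np.random.uniform(0, 1, size=512)                # toy inputs
y = 3.0 * X + 0.5 + np.random.normal(0, 0.05, 512)   # toy targets: y = 3x + 0.5 + noise

w, b = 0.0, 0.0
lr, batch_size = 0.5, 64
for epoch in range(200):
    idx = np.random.permutation(len(X))              # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = w * X[batch] + b - y[batch]            # prediction error on the mini-batch
        w -= lr * np.mean(err * X[batch])            # MSE gradient w.r.t. w (up to a factor of 2)
        b -= lr * np.mean(err)                       # MSE gradient w.r.t. b (up to a factor of 2)
print(w, b)                                          # should approach 3.0 and 0.5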

4.4 Deep neural networks


A deep neural network (DNN) is an artificial neural network with multiple hidden
layers between the input and output layers. There is no strict rule defining which neural
network is deep and which is not. Using multiple layers allows DNNs to learn more complex
relationships in the data. However, they require a lot of data for successful training, which
is to some extent addressed by transfer learning: we use a neural network model pre-trained
on a related task and reuse its feature-extracting abilities. Selecting the correct
hyperparameters (e.g. learning rate, activation function), training method and structure is
not always obvious, although we can suggest changes based on achieved results. DNNs require
much more computational power, especially during training. They also require a lot of memory,
which is a problem especially on mobile devices. Training takes longer compared to other
algorithms, and it is hard to understand what is going on under the hood (for example, why
the model chose a certain decision over another).

4.5 Convolutional neural networks


Convolutional neural networks (CNN) are a class of deep neural networks that have
proven successful in analysing visual imagery, for example image recognition, object
detection and classification.
The architecture of convolutional neural networks is designed to take advantage of the
dimensional structure of the input data. When working with images, this can be achieved by
preserving the relationship between pixels in a small region of the input image. Each neuron
in a convolutional layer uses a small group of pixels as its input, which means that all
inputs connected to a given neuron are close to each other in the input image. Connecting
only a local region of the input image to a neuron leads to fewer parameters compared to
fully connected layers.
The layers of a convolutional neural network have neurons organized into three
dimensions: width, height and depth. Neurons in a convolutional layer share weights with
other neurons at the same depth, which reduces the number of learnable parameters.
Convolutional neural networks typically consist of an input layer, multiple
convolutional layers, pooling layers, fully connected layers and an output layer.

4.5.1 Convolutional layer


The convolutional layer is the main component of a CNN. It is a stack of filters
(sometimes referred to as kernels) used to extract features from an input image. Each filter
is used to extract a certain feature. The number of weights is based on the kernel size. The
weights in each filter are shared among the neurons of that filter. The position of a certain
feature is not that important; what matters is whether the feature is present in the picture
or not. For example, imagine a kernel of size 3 × 3. Such a window moves across the image's
x and y axes by a stride which we define, and its output is based on whether a certain
feature is present or not. By using multiple filters we are able to detect multiple features,
which can be further analysed by the following layers.

Figure 4.7: Example of convolutional layer with kernel size 3 × 3 and padding 1

The kernel size, stride and number of filters are chosen based on the dataset, and
different values may bring different results. The usual kernel size is between 3 × 3 and
5 × 5. The number of filters in a convolutional layer usually increases the deeper the layer
is in the network, which improves the ability of the model to detect more complex,
higher-level features. With that said, it is better to try multiple values and compare their
results.
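The sketch below builds a single Keras convolutional layer with the 3 × 3 kernel mentioned above, over a 96 × 96 grayscale input like the images used later in this report; the filter count of 32 is an illustrative choice:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', activation='relu',
                 input_shape=(96, 96, 1)))
model.summary()
# 32 filters of shape 3x3x1 plus 32 biases = 320 learnable parameters;
# 'same' padding keeps the output at 96x96.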

4.5.2 Pooling layer


The pooling layer is used to reduce the number of parameters and the amount of
computation in the neural network. The most common type of pooling is max-pooling. It splits
each feature map from the previous convolutional layer into non-overlapping rectangles and
outputs the maximum value for each rectangle. The most common shape for such a rectangle is
2 × 2, which down-samples the previous activations to 25%. Max-pooling keeps the most
important features detected in the convolutional layer, and the depth remains unchanged.
Another type of pooling is average pooling, which outputs the average of all values in each
rectangle.

Figure 4.8: Example of 2 × 2 max-pooling with stride 2
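The 2 × 2 max-pooling with stride 2 from Figure 4.8 can be sketched directly in NumPy; the input array below is invented:

import numpy as np

def max_pool_2x2(a):
    # Split a 2D array into non-overlapping 2x2 blocks and keep each block's maximum.
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [3, 1, 4, 2]])
print(max_pool_2x2(a))   # [[6 4]
                         #  [7 9]]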

4.5.3 Fully connected layer


Neurons in fully connected layers have connections to all activations in the previous
layer. These densely connected layers are identical to the layers in a multi-layer
perceptron. They are used to further analyse the features detected in the previous
convolutional or pooling layers.

4.6 Computational graph


A computational graph organizes a computation. It is a directed graph which consists of
nodes and edges. Each node represents a variable or an operation. Edges represent passing
the result of an operation to another operation as an operand. A computation organized in a
graph can be executed in parallel by computing subgraphs that are independent of each other.
If a complex problem is split into simpler subproblems, each subproblem can be
solved only once and its solution stored. This technique can save computational time at the
cost of memory.
Derivatives are used to calculate how much the output changes with respect to all
parameters during backpropagation of the error while training a neural network. These partial
derivatives can be calculated very efficiently in a computational graph by using reverse-mode
differentiation, which tracks how each node in the graph influences one output. This is used
in neural networks to calculate how much a change in an input node affects the loss function.

4.7 TensorFlow
TensorFlow is a machine learning library developed by Google. It can be used to
create and execute neural network models. TensorFlow provides application programming
interfaces (APIs) in Python, C++, Java and Go. TensorFlow can run on a graphics processing
unit to speed up execution.
In TensorFlow, a neural network model is defined as a computational graph whose nodes
are tensors. A tensor is basically a multidimensional matrix. While building the graph, we
define how each tensor is computed based on other variable tensors. We can then run part of
this graph to achieve the desired results.
TensorBoard is a part of TensorFlow that can be used to visualise learning. Probably
the most useful information during training is the error on the training and testing datasets
per epoch. TensorBoard also provides a way to visualize the computational graph of the
model, which may be useful for large and complicated neural network architectures.
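As a minimal sketch, assuming the TensorFlow 1.x API that was current when this report was written, a graph is first defined and then executed in a session:

import tensorflow as tf  # assumes TensorFlow 1.x

# Build the graph: each node is a tensor, edges pass results between operations.
a = tf.constant(3.0, name='a')
b = tf.constant(4.0, name='b')
c = tf.multiply(a, b, name='c')   # c = a * b
d = tf.add(c, b, name='d')        # d = c + b

# Run part of the graph in a session to obtain the value of d.
with tf.Session() as sess:
    print(sess.run(d))  # 16.0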

4.8 Graphics processing unit


A graphics processing unit (GPU) is an application-specific integrated circuit designed
to accelerate the creation of images intended for output to a display device. It provides
efficient and powerful parallel computing as well as high-performance memory. Both of these
properties can be used to accelerate machine learning. To illustrate, we can speed up the
training of a neural network 5 to 10 times by using a GPU. The exact ratio depends on the
specific hardware as well as on the structure of the neural network. As described earlier,
training mostly consists of computing simple formulas over lots of data, and most of the
operations are performed with matrices. For example, matrix multiplication can be done
efficiently in parallel.

4.9 Dataset
The dataset, from a Kaggle facial keypoints detection challenge, contains 7049
grayscale images with a resolution of 96 × 96 pixels. Each facial keypoint is specified by
its x and y position in the image. The following 15 facial features are represented in the
dataset:
● left_eye_center, right_eye_center,
● left_eye_inner_corner, left_eye_outer_corner,
● right_eye_inner_corner, right_eye_outer_corner,
● left_eyebrow_inner_end, left_eyebrow_outer_end,
● right_eyebrow_inner_end, right_eyebrow_outer_end,
● nose_tip,
● mouth_left_corner, mouth_right_corner,
● mouth_center_top_lip, mouth_center_bottom_lip.

Figure 4.9: Example of an image from the dataset with marked facial keypoints

Even though the dataset contains 7049 images, only 2140 of them have all 15 keypoints
marked. We will use only the pictures that have all facial keypoints present, because we want
our neural network to predict all 15 keypoints.

Figure 4.10: Number of images in the dataset for each facial feature

We will split the dataset into two parts: one for training and one for testing. By
doing that, we can measure the performance of the model on training images as well as on
images that the neural network has never trained on. Both the training error and the
validation error are important when analysing the model's results. The training dataset will
contain 80% of the original dataset and the remaining 20% will be the testing dataset, that
is, 1712 training images and 428 testing images.
Normalizing the inputs to the neural network is a common practice. The pixels in the
input pictures are in the range from 0 to 255; we will scale that to the range [0, 1]. The
positions of the facial keypoints are in the range from 0 to 96. We want those values to have
mean 0 and variance close to 1, which can be achieved by the simple computation
y′ = (y − 48)/48. Using normalized data helps during the training of a neural network when
finding the gradient of the loss function. Imagine if we had values in the ranges [0, 1] and
[0, 1000]; the latter would have much more impact on the output of the neural network, and
that could lead to slower convergence.
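A short sketch of the normalization and the 80/20 split described above, using randomly generated placeholder arrays in place of the real data:

import numpy as np

# Placeholders shaped like the real data: 2140 images of 96x96 pixels, 30 keypoint values each.
X = np.random.randint(0, 256, size=(2140, 96, 96, 1)).astype('float32')
y = np.random.uniform(0, 96, size=(2140, 30)).astype('float32')

X = X / 255.0           # scale pixels from [0, 255] to [0, 1]
y = (y - 48.0) / 48.0   # map keypoints from [0, 96] to roughly [-1, 1], mean near 0

split = int(0.8 * len(X))                 # 80/20 split: 1712 training, 428 testing
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]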

5. SYSTEM DESIGN

5.1 System Architecture:

Figure 5.1: System Architecture

5.2 UML DIAGRAMS:

5.2.1 Use Case Diagrams:

Figure 5.2: Use Case Diagram

5.2.2 Sequence Diagrams:

Figure 5.3: Sequence Diagram

5.2.3 State Chart Diagram:

Figure 5.4: State Chart diagram

6. SYSTEM IMPLEMENTATION

6.1 Code
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Display images
import cv2
training_data = pd.read_csv('training.csv')    # load the Kaggle training set
training_data.head()
training_data.isnull().sum()                   # count missing values per column
training_data = training_data.dropna(axis=0)   # keep only rows with all keypoints present
training_data.isnull().sum()
training_data.iloc[2000]
rows = training_data.axes[0].tolist()
image = []
for row in rows:
    # Pixel values are stored as a space-separated string; empty entries become '0'.
    img = training_data['Image'][row].split(' ')
    img = ['0' if x == '' else x for x in img]
    image.append(img)
len(image)
image_list = np.array(image, dtype='float')
X_train = image_list.reshape(-1, 96, 96, 1)   # one 96x96x1 array per image
len(image_list)
len(X_train)
training = training_data.drop('Image', axis=1)   # keep only the 30 keypoint columns
y_train = []
for i in range(len(rows)):
    y = training.iloc[i, :]
    y_train.append(y)
y_train = np.array(y_train, dtype='float')
y_train = y_train / 96   # scale keypoint coordinates from [0, 96] to [0, 1]
plt.imshow(X_train[2].reshape(96, 96), cmap='gray')
for i in range(15):
    # Even indices hold x coordinates, odd indices hold y coordinates.
    plt.plot(96*y_train[2][2*i], 96*y_train[2][2*i+1], 'ro')
plt.show()
from keras.models import Sequential
from keras.layers import (Convolution2D, MaxPool2D, BatchNormalization,
                          Flatten, Dense, Dropout)
model = Sequential()
model.add(Convolution2D(32, (3,3), padding='same', input_shape=(96,96,1), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(32, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(64, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(96, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(128, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Convolution2D(128, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(256, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(512, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Convolution2D(512, (3,3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(30, activation='relu'))  # 30 outputs: (x, y) for 15 keypoints
model.summary()
model.compile(optimizer='adam',
loss='mean_squared_error',
metrics=['mae'])
model.fit(X_train,y_train,epochs = 500,batch_size = 256,validation_split = 0.2)
im = X_train[401]
im = im.reshape(1, 96, 96, 1)
pred = np.array(model.predict(im))
plt.imshow(im.reshape(96,96),cmap='gray')
for i in range(15):
    plt.plot(96*pred[0][2*i], 96*pred[0][2*i+1], 'ro')
plt.show()
# For a second, external test image
!wget https://i.ibb.co/mNvTH4W/abhi.jpg
test_img = cv2.imread("abhi.jpg",0)
type(test_img)
test_img = test_img.reshape(1, 96, 96, 1)   # assumes the downloaded image is already 96x96
test_pred = np.array(model.predict(test_img))
plt.imshow(test_img.reshape(96, 96), cmap='gray')
for i in range(15):
    plt.plot(96*test_pred[0][2*i], 96*test_pred[0][2*i+1], 'ro')
plt.show()

7. OUTPUT SCREENS

Figure 7.1: Checking for null fields in the file

Figure 7.2: Model Summary

Figure 7.3: Compiling the model

Figure 7.4: Training the model

Figure 7.5: Output Prediction

Figure 7.6: Output Prediction

8. CONCLUSION

We have proposed multiple approaches to the facial feature detection problem in this
document. We have used several types of artificial neural networks, including the state of
the art in image recognition. The results have shown various upsides and downsides of the
explored solutions. The results have also shown that a straightforward use of neural networks
did not perform well: the proposed models suffered from high bias as well as high variance.
After further analysing the results of the proposed approaches, we suggested several changes
aimed at improving the final outcome.
Here we have focused on the task of detecting facial keypoints in raw facial images.
Specifically, for a given 96 × 96 image, we predict 15 sets of (x, y) coordinates of facial
keypoints. Two traditional deep structures, a One Hidden Layer Neural Network and a
Convolutional Neural Network, are implemented as our baselines. We further explored a
sparsely connected Inception Model to reduce the computational complexity to fit the
requirements of detecting facial keypoints. Experiments conducted on a real-world Kaggle
dataset have shown the effectiveness of deep structures, especially the Inception Model.

9. FUTURE ENHANCEMENTS

As for our future work, we can explore from these few aspects:

1. We have already shown the effectiveness of the Inception Model when used as a
pretrained model, but the performance has a chance to improve if we train it from scratch.
2. As we can see from the results, using deep structures can increase time complexity
compared to other state-of-the-art methods, but the results improve a lot. What we can
do in the future is design a deep structure specifically for this task to further
improve the performance.
3. Different resolutions can greatly affect the results of facial keypoints detection; thus
we can try reducing the resolution of the given raw images and observe the variance in
performance to further evaluate our model.

10. BIBLIOGRAPHY

[1] M. Dantone, J. Gall, G. Fanelli, and L. Van Gool. Real-time facial feature detection using
conditional regression forests. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pages 2578–2585. IEEE, 2012.
[2] M. Gargesha and S. Panchanathan. A hybrid technique for facial feature point detection. In
Image Analysis and Interpretation, 2002. Proceedings. Fifth IEEE Southwest Symposium on,
pages 134–138. IEEE, 2002.
[3] M. Haavisto et al. Deep generative models for facial keypoints detection. 2013.
[4] E.-J. Holden and R. Owens. Automatic facial point detection. In Proc. Asian Conf. Computer
Vision, volume 2, page 2, 2002.
[5] B. Martinez, M. F. Valstar, X. Binefa, and M. Pantic. Local evidence aggregation for
regression-based facial point detection. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 35(5):1149–1163, 2013.
[6] A. S. Mian, M. Bennamoun, and R. Owens. Keypoint detection and local feature matching
for textured 3d face recognition. International Journal of Computer Vision, 79(1):1–12, 2008.
[7] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point
detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3476–3483, 2013.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1–9, 2015.
[9] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted
regression and graph models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE
Conference on, pages 2729–2736. IEEE, 2010.
[10] D. Vukadinovic and M. Pantic. Fully automatic facial feature point detection using gabor
feature based boosted classifiers. In Systems, Man and Cybernetics, 2005 IEEE International
Conference on, volume 2, pages 1692–1698. IEEE, 2005.

