FACIAL KEYPOINTS DETECTION USING MACHINE LEARNING
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
VADDI ABHILASH REDDY
J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY
(UGC Autonomous, Accredited by NAAC, Permanently Affiliated to JNTUH)
Bhaskar Nagar (Post), Moinabad Mandal, R.R. Dist.-500075
CERTIFICATE
This is to certify that the mini project report entitled FACIAL KEYPOINTS
DETECTION USING MACHINE LEARNING being submitted by VADDI
ABHILASH REDDY in partial fulfillment for the award of the Degree of Bachelor of
Technology in Computer Science and Engineering to the Jawaharlal Nehru Technological
University is a record of bonafide work carried out by him/her under my guidance and
supervision.
The results embodied in this mini project report have not been submitted to any
other University or Institute for the award of any Degree or Diploma.
J.B. INSTITUTE OF ENGINEERING & TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
Nowadays, facial keypoints detection has become a very popular topic, and applications built on it, such as Snapchat and "How old are you", have attracted a large number of users. The objective of facial keypoints detection is to find the facial keypoints in a given face, which is very challenging because facial features vary greatly from person to person. Deep learning has been applied to this problem in the form of neural networks and cascaded neural networks, and the results of these structures are significantly better than earlier state-of-the-art methods such as feature extraction and dimensionality reduction algorithms. In our project, we locate the keypoints in a given image using deep architectures, aiming not only to obtain lower loss on the detection task but also to accelerate training and testing for real-world applications. We have constructed two basic neural network structures, a one-hidden-layer neural network and a convolutional neural network, as our baselines, and we have proposed an approach to better locate the coordinates of facial keypoints. Specifically, we use a block of the pretrained Inception Model to extract intermediate features and use different deep structures to compute the final output vector. The experimental results show the effectiveness of deep structures for facial keypoints detection, and using the pretrained Inception Model slightly improves detection performance compared to the baseline methods.
TABLE OF CONTENTS
1. INTRODUCTION
   1.1 Facial features are very different
   1.2 Detecting keypoints has to be fast
   1.3 Existing and Proposed System
2. LITERATURE SURVEY
3. SOFTWARE REQUIREMENTS ANALYSIS
   3.1 Hardware requirements
   3.2 Software requirements
4. BACKGROUND
   4.1 Machine learning
   4.2 Artificial neural networks
   4.3 Activation function
   4.4 Deep neural networks
   4.5 Convolutional neural networks
   4.6 Computational graph
   4.7 TensorFlow
   4.8 Graphics processing unit
   4.9 Dataset
5. SYSTEM DESIGN
   5.1 System Architecture
   5.2 UML Diagrams
6. SYSTEM IMPLEMENTATION
   6.1 Code
7. OUTPUT SCREENS
8. CONCLUSION
9. FUTURE ENHANCEMENTS
10. BIBLIOGRAPHY
TABLE OF FIGURES
1. INTRODUCTION
With the rapid development of the computer vision field, more and more research works and industry applications focus on facial keypoints detection. Detecting keypoints in a given face image is a fundamental building block for many applications, including facial expression classification, facial alignment, face tracking in videos, and applications in medical diagnosis. Thus, detecting facial keypoints both quickly and accurately enough to serve as a preprocessing step has become a major challenge. There are two main difficulties: first, facial features vary greatly between different people and under different external conditions; second, the time complexity must be reduced to achieve real-time keypoints detection.
1.3 Existing and Proposed System
There are two main kinds of state-of-the-art methods for detecting facial keypoints: the first uses feature extraction algorithms, such as Gabor features, and the other uses probabilistic graphical models to specify the relationship between pixels and their neighbors. With the development of deep learning, many deep structures designed for this task have been explored recently, such as the deep convolutional cascade network, which can better handle the two main challenges mentioned above.
Our objective is to locate 15 sets of facial keypoints in a given raw facial image. The input is a set of 96 × 96 raw facial images with only grayscale pixel values, and the output is a 30-dimensional vector indicating the (x, y) coordinates of the 15 facial keypoints (see the shape sketch after the contribution list). In our project, we use deep structures for facial keypoints detection, which can learn well from different faces and overcome, to a great extent, the variance between faces of different persons or under different conditions. Two widely used models, a One Hidden Layer Neural Network and a Convolutional Neural Network, are designed as baselines. Most importantly, we have used the pretrained Inception Model to explore techniques that reduce the computational complexity of detecting facial keypoints. With sparsely connected layers and different numbers of filters of different sizes, the Inception Model can better capture local features and reduce computational complexity at the same time. Compared to our baseline models, we see a clear improvement when using pretrained Inception models to predict the locations of facial keypoints. The contributions of this paper include:
1. Explore the performance of different deep structures on the task of detecting facial keypoints, and evaluate their effectiveness using MSE loss.
2. Use the recently proposed Inception Model for detecting facial keypoints. Although the Inception Model is trained on the image classification task on ImageNet, the experimental results show that its intermediate features adapt well to the facial keypoints detection task.
3. Conduct experiments on a real-world dataset from a Kaggle challenge; our model can be easily extended to other facial detection tasks.
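To make this input/output contract concrete, the following minimal sketch (hypothetical array names, not part of the project code) shows the shapes involved:

import numpy as np

# A batch of N grayscale face images, 96 x 96 pixels, one channel.
X = np.zeros((32, 96, 96, 1), dtype=np.float32)   # a hypothetical batch of 32

# One 30-dimensional target/prediction vector per image,
# interpreted as 15 (x, y) keypoint coordinates.
y = np.zeros((32, 30), dtype=np.float32)

# Reshape a single 30-d vector into 15 (x, y) pairs.
keypoints = y[0].reshape(15, 2)   # row i holds (x_i, y_i)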
2. LITERATURE SURVEY
Traditional methods have explored feature extraction strategies, including texture-based and shape-based features, and different types of graphical models to detect facial keypoints.
[10] proposed a method that uses Gabor-wavelet-feature-based boosted classifiers, which can detect 20 different facial keypoints under limited head rotation and varying illumination conditions.
[4] also use Gabor features for detecting facial keypoints. Using a sample log-Gabor response of a facial point, the authors showed that locations on the test facial images that are similar to the sample facial points can be detected.
[6] focused on keypoints detection for textured 3D face recognition. Features are extracted by fitting a surface to the neighborhood of a keypoint and sampling it on a uniform grid; PCA is then used for dimensionality reduction before the features are projected and matched between a probe and a gallery face. Other methods have designed probabilistic graphical models to capture the relationships between different pixels and features in order to detect facial keypoints. Using Markov Random Fields, [9] exploits the constellations that facial points can form; the authors also use boosted regression to learn the mapping from the pixel appearance of the area around the keypoints to the locations of the keypoints. These models have reached high performance on aligned faces but still need improvements for faces under different environmental conditions.
[5] proposed a model that detects the target location by local evidence aggregation. By aggregating the estimates obtained from stochastically selected local appearance information into a single robust prediction, rather than focusing only on the target location, the proposed model approaches the regression problem from a new perspective. Recent works have focused on deep architectures for this detection task, since these structures can better capture the high-level features of an image, in our case a given face.
[7] proposed a carefully designed three-level cascade of convolutional networks, which first captures global high-level features and then refines the initialization to locate the positions of the keypoints. On the other hand, [3] used a pretrained DBN combined with a feed-forward neural network with a linear Gaussian output layer to detect facial keypoints. Different from our work, the deep structures mentioned above have focused only on the correctness of the detected keypoints, not on time complexity. Using the Inception Model, we can train a much more complex model in less time, so that, considering the number of parameters, the detection process is faster than with the traditional deep structures and can be adapted to this task.
3. SOFTWARE REQUIREMENTS ANALYSIS
4. BACKGROUND
connection is used to transmit a signal to another neuron. The receiving neuron can process the signal and signal other neurons connected to it. Each neuron has a set of weights, associated with the edges connecting it to the neurons in the previous layer. A neuron has one output value, which is the weighted sum of its input values plus a bias, passed through an activation function. The output value of a neuron is defined by the following formula:

y = f(w1·x1 + w2·x2 + … + wn·xn + b),

where x1, …, xn are the inputs, w1, …, wn the weights, b the bias, and f the activation function.

The simplest type of neural network is called the perceptron. It consists of one neuron and is able to decide whether an input defined by a vector of numbers belongs to some specific class or not. Single-layer perceptrons are capable of solving linearly separable problems. The neuron in a single-layer perceptron uses an activation function that maps the weighted sum of inputs plus bias to the values 0 or 1. The perceptron algorithm was invented in 1958 by Frank Rosenblatt.
Typically, neurons are organized into groups called layers. A neural network consists of at least two layers, an input and an output layer, and may also contain hidden layers. A multi-layer perceptron (MLP) consists of at least three layers; its output and hidden layers use a non-linear activation function. In contrast to the single-layer perceptron, the multi-layer perceptron is capable of learning data that is not linearly separable.
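As a minimal sketch of the neuron formula above (NumPy, with made-up weights and inputs; the sigmoid is just one possible activation function):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input values (hypothetical)
w = np.array([0.4, 0.1, -0.6])   # one weight per input
b = 0.2                          # bias

# Weighted sum of inputs plus bias, passed through the activation function.
y = sigmoid(np.dot(w, x) + b)
print(y)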
Figure 4.2: Neural network with multiple layers.
Figure 4.3: Sigmoid Function
4.3.4 ReLU (Rectified Linear Unit):
Currently the most widely used activation function, since it works well for convolutional neural networks and for deep neural networks in general. It computes the function f(x) = max(0, x). The issue with ReLU is that all negative values become zero, which may reduce the model's ability to train properly.
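A small NumPy sketch of ReLU and, for comparison, the Leaky ReLU variant shown in Figure 4.6, which passes a small fraction of negative inputs through instead of zeroing them (the 0.01 slope is a common but arbitrary choice):

import numpy as np

def relu(x):
    # f(x) = max(0, x): all negative values become zero.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Negative values are scaled by a small slope instead of zeroed.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [0.  0.  0.  1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.  1.5]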
Figure 4.6: Leaky ReLU
MSE = (1/n) · Σi (yi − ŷi)²,

where n is the number of outputs, y is the actual value and ŷ is the predicted value. MSE punishes large mistakes much more than small mistakes, and its output is always positive. For that reason, it is commonly used as a cost function for regression tasks where the goal is to predict real values.
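The formula translates directly into code; a tiny NumPy sketch with made-up values:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # actual values (hypothetical)
y_pred = np.array([1.1, 1.8, 3.5])   # predicted values

# Mean of the squared differences; large errors dominate the result.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)   # 0.1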
relationships between the data. However, they require a lot of data for successful training. This is to some extent addressed by transfer learning: we take a neural network model pre-trained on a related task and reuse its feature-extracting abilities. Selecting the correct hyperparameters (e.g., learning rate, activation function), training method, and structure is not always obvious, although we can suggest changes based on the achieved results. Deep networks require much more computational power, especially during training. They also require a lot of memory, which is a problem especially on mobile devices. Training takes longer compared to other algorithms, and it is hard to understand what is going on under the hood (for example, why the model chose one decision over another).
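A hedged Keras sketch of the transfer-learning idea (not the exact model used in this project): load a base network pretrained on ImageNet, freeze its weights, and attach a new regression head for the 30 keypoint coordinates. InceptionV3 expects three-channel inputs of at least 75 × 75 pixels, so the input shape here is an illustrative choice:

from keras.applications import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D, Input
from keras.models import Model

# Pretrained feature extractor; weights learned on ImageNet classification.
base = InceptionV3(weights='imagenet', include_top=False,
                   input_tensor=Input(shape=(96, 96, 3)))
base.trainable = False   # reuse the features, do not retrain them

# New head: regress the 30 keypoint coordinates.
x = GlobalAveragePooling2D()(base.output)
out = Dense(30)(x)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='mean_squared_error')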
picture or not. For example, imagine a kernel of size 3 × 3. Such a window moves across the image's x and y axes by a stride that we define, and its output reflects whether a certain feature is present or not. By using multiple filters we are able to detect multiple features, which can be further analysed by the following layers.
Figure 4.7: Example of a convolutional layer with kernel size 3 × 3 and padding 1
The kernel size, stride and number of filters are chosen based on the dataset, and different values may bring different results. The usual kernel size is between 3 × 3 and 5 × 5. The number of filters in a convolutional layer usually increases the deeper the layer is in the network, which improves the model's ability to combine low-level features into more complex ones. With that said, it is better to try multiple values and compare their results.
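A small Keras sketch (layer sizes are illustrative, not this project's final architecture) showing how kernel size, stride, and the number of filters are declared, and how 'same' padding keeps the spatial size:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# 32 filters, 3 x 3 kernel, stride 1; 'same' padding keeps the output at 96 x 96.
model.add(Conv2D(32, kernel_size=(3, 3), strides=(1, 1),
                 padding='same', activation='relu',
                 input_shape=(96, 96, 1)))
model.summary()   # output shape: (None, 96, 96, 32)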
important features detected in the convolutional layer; the depth remains unchanged. Another type of pooling is average pooling, which outputs the average of all values in each rectangle.
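In Keras, both pooling variants are single layers; a minimal sketch (hypothetical sizes) showing how a 2 × 2 window halves the width and height while leaving the depth unchanged:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', activation='relu',
                 input_shape=(96, 96, 1)))
# Max pooling keeps the strongest value in each 2 x 2 window:
# (96, 96, 32) -> (48, 48, 32). AveragePooling2D would take the mean instead.
model.add(MaxPool2D(pool_size=(2, 2)))
model.summary()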
one output. This is used in neural networks to calculate how much a change in an input node affects the loss function.
4.7 TensorFlow
TensorFlow is a machine learning library developed by Google. It can be used to
create and execute neural network models. TensorFlow provides application programming
interfaces (APIs) in Python, C++, Java and Go. TensorFlow can run on graphics processing
unit to speed up the execution.
In TensorFlow, a neural network model is defined as a computational graph whose nodes are tensors. A tensor is essentially a multidimensional array. While building the graph, we define how each tensor is computed from other variable tensors. We can then run part of this graph to achieve the desired results.
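A minimal sketch of this idea in the TensorFlow 1.x graph style that matches the description above (newer TensorFlow versions execute eagerly by default instead):

import tensorflow as tf

# Build the graph: define how each tensor is computed from other tensors.
a = tf.placeholder(tf.float32)   # input node, value supplied at run time
b = tf.constant(2.0)
c = a * b                        # c depends on a and b

# Run only the part of the graph needed to compute c.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0}))   # 6.0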
TensorBoard is a part of TensorFlow that can be used to visualise learning. Probably the most useful information during training is the error on the training and testing datasets per epoch. TensorBoard also provides a way to visualize the computational graph of the model, which can be useful for large and complicated neural network architectures.
4.9 Dataset
The dataset comes from a Kaggle facial keypoints detection challenge. It contains 7049 grayscale images with a resolution of 96 × 96 pixels. Each facial keypoint is specified by its x and y position in the image. The following 15 facial features are represented in the dataset:
- left_eye_center, right_eye_center,
- left_eye_inner_corner, left_eye_outer_corner,
- right_eye_inner_corner, right_eye_outer_corner,
- left_eyebrow_inner_end, left_eyebrow_outer_end,
- right_eyebrow_inner_end, right_eyebrow_outer_end,
- nose_tip,
- mouth_left_corner, mouth_right_corner,
- mouth_center_top_lip, mouth_center_bottom_lip.
Figure 4.9: Example of an image from the dataset with marked facial keypoints
Even though the dataset contains 7049 images, only 2140 of them have all 15 keypoints marked. We will use only the pictures that have all facial keypoints present, because we want our neural network to predict all 15 keypoints.
Figure 4.10: Number of images in the dataset for each facial feature
We will split the dataset into two parts, one for training and one for testing. By doing that, we can measure the performance of the model both on training images and on images the neural network has never trained on; both the training error and the validation error are important when analysing the model's results. The training dataset will contain 80% of the original dataset and the remaining 20% will be the testing dataset, that is, 1712 training images and 428 testing images.
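A sketch of this 80/20 split using scikit-learn (placeholder arrays stand in for the real images and keypoints; the fixed random_state only makes the split reproducible):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((2140, 96, 96, 1))   # images (placeholder values)
y = np.zeros((2140, 30))          # keypoint coordinates

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)
print(len(X_tr), len(X_te))   # 1712 428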
Normalizing the inputs to the neural network is a common practice. The pixels in the input pictures are in the range 0 to 255; we will scale them to the range [0, 1]. The positions of the facial keypoints are in the range 0 to 96, and we want those values to have mean 0 and variance 1. That can be achieved by the simple computation y′ = y/48 − 1. Using normalized data helps during training when computing the gradient of the loss function. Imagine we had values in the ranges [0, 1] and [0, 1000]: the latter would have much more impact on the output of the neural network, which could lead to slower convergence.
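Both normalizations from this paragraph in code (a sketch; img and points are hypothetical arrays):

import numpy as np

img = np.random.randint(0, 256, size=(96, 96)).astype(np.float32)
points = np.random.uniform(0, 96, size=(30,)).astype(np.float32)

img_scaled = img / 255.0             # pixel values into [0, 1]
points_norm = points / 48.0 - 1.0    # y' = y/48 - 1 maps [0, 96] to [-1, 1]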
5. SYSTEM DESIGN
5.2 UML DIAGRAMS:
5.2.2 Sequence Diagrams:
5.2.3 State Chart Diagram:
6. SYSTEM IMPLEMENTATION
6.1 Code
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Display images
import cv2
training_data = pd.read_csv('training.csv')   # Kaggle training set
training_data.head()
training_data.isnull().sum()                  # count missing values per column
training_data = training_data.dropna(axis=0)  # keep only rows with all 15 keypoints
training_data.isnull().sum()
training_data.iloc[2000]                      # inspect one sample row
rows = training_data.axes[0].tolist()
image = []
for row in rows:
    # Pixel values are stored as a space-separated string; empty strings
    # are treated as zeros.
    img = training_data['Image'][row].split(' ')
    img = ['0' if x == '' else x for x in img]
    image.append(img)
len(image)
image_list = np.array(image, dtype='float')
X_train = image_list.reshape(-1, 96, 96, 1)   # (N, 96, 96, 1) grayscale images
len(image_list)
len(X_train)
training = training_data.drop('Image', axis=1)
y_train = []
for i in range(len(rows)):
    y = training.iloc[i, :]   # the 30 keypoint coordinates for sample i
    y_train.append(y)
y_train = np.array(y_train, dtype='float')
y_train = y_train / 96   # scale coordinates from [0, 96] to [0, 1]
plt.imshow(X_train[2].reshape(96, 96), cmap='gray')
for i in range(15):
    # Even indices hold x, odd indices hold y; rescale back to pixel units.
    plt.plot(96 * y_train[2][2 * i], 96 * y_train[2][2 * i + 1], 'ro')
plt.show()
from keras.models import Sequential, Model
from keras.layers import (Activation, Convolution2D, MaxPooling2D,
                          BatchNormalization, Flatten, Dense, Dropout,
                          Conv2D, MaxPool2D, ZeroPadding2D)
from keras.layers.advanced_activations import LeakyReLU
model = Sequential()
# Stacked Conv + BatchNorm blocks with 2 x 2 max pooling in between;
# the number of filters grows with depth (32 -> 512).
model.add(Convolution2D(32, (3, 3), padding='same', input_shape=(96, 96, 1), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(32, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(64, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(96, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(128, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Convolution2D(128, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(256, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Convolution2D(512, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Convolution2D(512, (3, 3), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(30, activation='relu'))   # 30 outputs: 15 (x, y) coordinates in [0, 1]
model.summary()
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
model.fit(X_train, y_train, epochs=500, batch_size=256, validation_split=0.2)
im = X_train[401]
im = im.reshape(1, 96, 96, 1)   # add a batch dimension
pred = np.array(model.predict(im))
plt.imshow(im.reshape(96, 96), cmap='gray')
for i in range(15):
    plt.plot(96 * pred[0][2 * i], 96 * pred[0][2 * i + 1], 'ro')
plt.show()
# For a second image (downloaded from the web; must be a 96 x 96 grayscale image)
!wget https://i.ibb.co/mNvTH4W/abhi.jpg
test_img = cv2.imread("abhi.jpg", 0)   # 0 = load as grayscale
type(test_img)
test_img = test_img.reshape(1, 96, 96, 1)
test_pred = np.array(model.predict(test_img))
plt.imshow(test_img.reshape(96, 96), cmap='gray')
for i in range(15):
    plt.plot(96 * test_pred[0][2 * i], 96 * test_pred[0][2 * i + 1], 'ro')
plt.show()
7. OUTPUT SCREENS
Figure 7.2: Model Summary
Figure 7.4: Training the model
Figure 7.5: Output Prediction
8. CONCLUSION
9. FUTURE ENHANCEMENTS
As future work, we can explore the following aspects:
1. We have already shown the effectiveness of the Inception Model when used as a pretrained model, but the performance might improve further if we trained it from scratch.
2. As the results show, using deep structures can increase time complexity compared to other state-of-the-art methods, but the results improve considerably. In the future, we could design a deep structure specifically for this task to further improve performance.
3. Different resolutions can greatly affect facial keypoints detection results, so we could reduce the resolution of the given raw images and observe how the performance varies, to further evaluate our model.
10. BIBLIOGRAPHY
[1] M. Dantone, J. Gall, G. Fanelli, and L. Van Gool. Real-time facial feature detection using
conditional regression forests. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pages 2578–2585. IEEE, 2012.
[2] M. Gargesha and S. Panchanathan. A hybrid technique for facial feature point detection. In
Image Analysis and Interpretation, 2002. Proceedings. Fifth IEEE Southwest Symposium on,
pages 134–138. IEEE, 2002.
[3] M. Haavisto et al. Deep generative models for facial keypoints detection. 2013.
[4] E.-J. Holden and R. Owens. Automatic facial point detection. In Proc. Asian Conf. Computer
Vision, volume 2, page 2, 2002.
[5] B. Martinez, M. F. Valstar, X. Binefa, and M. Pantic. Local evidence aggregation for
regression-based facial point detection. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 35(5):1149–1163, 2013.
[6] A. S. Mian, M. Bennamoun, and R. Owens. Keypoint detection and local feature matching
for textured 3d face recognition. International Journal of Computer Vision, 79(1):1–12, 2008.
[7] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point
detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3476–3483, 2013.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1–9, 2015.
[9] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted
regression and graph models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE
Conference on, pages 2729–2736. IEEE, 2010.
[10] D. Vukadinovic and M. Pantic. Fully automatic facial feature point detection using gabor
feature based boosted classifiers. In Systems, Man and Cybernetics, 2005 IEEE International
Conference on, volume 2, pages 1692–1698. IEEE, 2005.