
2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC)

Convolutional Neural Network (CNN) for Image Detection and Recognition
Rahul Chauhan, Graphic Era Hill University, Dehradun, India, [email protected]
Kamal Kumar Ghanshala, Graphic Era University, Dehradun, India, [email protected]
R. C. Joshi, Graphic Era University, Dehradun, India, [email protected]

Abstract- Deep learning algorithms are designed to mimic the function of the human cerebral cortex. These algorithms are representations of deep neural networks, i.e. neural networks with many hidden layers. Convolutional neural networks are deep learning algorithms that can be trained on large datasets with millions of parameters; they take 2D images as input and convolve them with filters to produce the desired outputs. In this article, CNN models are built and their performance is evaluated on image recognition and detection datasets. The models are implemented on the MNIST and CIFAR-10 datasets. The accuracy of the models is 99.6% on MNIST and 80.17% on CIFAR-10, using real-time data augmentation and dropout on a CPU unit.

Keywords- Deep Learning, Handwritten Digit Recognition, Object Detection, Convolutional Neural Networks, MNIST, CIFAR-10, Dropout, Overfitting, Data Augmentation, Relu

978-1-5386-6373-8/18/$31.00 ©2018 IEEE

I. INTRODUCTION

Image recognition and detection is a classic machine learning problem. Detecting an object or recognizing an image from a digital image or a video is a very challenging task. Image recognition has applications in various fields of computer vision, including facial recognition, biometric systems, self-driving cars, emotion detection, image restoration, robotics and many more [1]. Deep learning algorithms have achieved great progress in the field of computer vision. Deep learning is an implementation of artificial neural networks with multiple hidden layers that mimics the functions of the human cerebral cortex. The layers of a deep neural network extract multiple features and hence provide multiple levels of abstraction, whereas shallow networks cannot extract or work on multiple features. The convolutional neural network is a powerful deep learning algorithm capable of dealing with millions of parameters and of saving computational cost by taking a 2D image as input, convolving it with filters (kernels) and producing output volumes.

The MNIST dataset contains handwritten digits and is used to test the performance of classification algorithms. Handwritten digit recognition has many applications such as OCR (optical character recognition), signature verification, interpretation and manipulation of texts and many more [2,3]. Handwritten digit recognition is an image classification and recognition problem, and there have been recent advancements in this field [4]. Another dataset is CIFAR-10, an object detection dataset in which objects are classified into 10 classes and detected in the test sets. It contains natural images and helps implement image detection algorithms.

In this paper, convolutional neural network models are implemented for image recognition on the MNIST dataset and object detection on the CIFAR-10 dataset. The implementation of the models is discussed and their performance is evaluated in terms of accuracy. The models are trained on a CPU unit only, and real-time data augmentation is used on the CIFAR-10 dataset. Along with that, dropout is used to reduce overfitting on the datasets.

The remaining sections of the paper are organized as follows: Section 2 gives a brief literature survey; Section 3 describes the classifier models with details of the techniques implemented; Section 4 evaluates the performance of the models and describes the results; Section 5 summarizes the work and outlines future work.

II. LITERATURE SURVEY

In recent years there have been great strides in building classifiers for image detection and recognition on various datasets using various machine learning algorithms. Deep learning, in particular, has shown improvement in accuracy on various datasets. Some of these works are described below.

Norhidayu binti Abdul Hamid et al. [2] evaluated performance on the MNIST dataset using 3 different classifiers: SVM (support vector machines), KNN (K-nearest neighbor) and CNN (convolutional neural networks). The multilayer perceptron did not perform well on that platform, as it did not reach the global minimum but remained stuck in a local optimum and could not recognize the digits 9 and 6 accurately. The other classifiers performed correctly, and it was concluded that the performance of the CNN can be improved by implementing the model on the Keras platform. Mahmoud M. Abu Ghosh et al. [5] implemented DNN (deep neural networks), DBN (deep belief networks) and CNN (convolutional neural networks) on the MNIST dataset and performed a comparative study. According to that work, DNN performed best, with an accuracy of 98.08%, while the others differed in error rate and execution time. Youssouf Chherawala et al. [6] built a vote-weighted RNN (recurrent neural network) model to determine the significance of feature sets. The significance is determined by weighted votes and their combination, and the model is an application of RNN. It extracts features from the Alex word images and then uses them to recognize handwriting. Alex Krizhevsky [7] uses a 2-layer
Convolutional Deep Belief Network on the CIFAR-10 dataset. The model classified the CIFAR-10 dataset with an accuracy of 78.90% on a GPU unit. The paper also describes in detail how the filters and their performance differ from model to model.

Yehya Abouelnaga et al. [8] built an ensemble of classifiers based on KNN. They used KNN in combination with CNN and reduced overfitting using PCA (Principal Component Analysis). The combination of these two classifiers improved the accuracy by about 0.7%.

Yann LeCun et al. [1] give a detailed introduction to deep learning and its algorithms. Algorithms such as backpropagation with the multilayer perceptron, convolutional neural networks and recurrent neural networks are discussed in detail with examples. They also mention the future scope of unsupervised learning in artificial intelligence.

Li Deng [10] presents a survey of deep learning, its applications, architectures and algorithms. The generative, discriminative and hybrid architectures are discussed in detail along with the algorithms that fall under the respective categories. CNNs, RNNs, autoencoders, DBNs and RBMs (restricted Boltzmann machines) are discussed with their various applications.

Fig. 1. A sample from the MNIST dataset
III. CLASSIFIER MODELS

A. Datasets

MNIST is the dataset used for image recognition, i.e. for recognition of handwritten digits [11]. The dataset has 70,000 images to train and test the model, split into 60,000 training images and 10,000 test images. Each image is 28x28 pixels (784 pixels), which are given as input to the system, and has one of 10 output class labels (0-9). Fig. 1 shows a sample picture from the MNIST dataset [13].

CIFAR-10 is the dataset used for object detection; it is a labeled subset of the 80 million tiny images dataset [12]. The dataset has 60,000 32x32 pixel color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Each class has 6,000 images. The training batch has 50,000 images and the test batch has 10,000 images. The test batch contains 1,000 randomly selected images from each class. Fig. 2 shows sample pictures from the CIFAR-10 dataset [14].

Fig. 2. Sample images from the CIFAR-10 dataset

B. CNN Models

Convolutional neural networks are deep learning algorithms that take input images and convolve them with filters or kernels to extract features. An NxN image is convolved with an fxf filter, and this convolution operation learns the same feature over the entire image [18]. The window slides after each operation and the features are learnt by the feature maps. The feature maps capture the local receptive field of the image and work with shared weights and biases [15,21]. Equation (1) shows the size of the output matrix with no padding and Equation (2) shows the convolution operation. Padding is used to preserve the size of the input image. With 'SAME' padding the output image size is the same as the input image size, and 'VALID' padding means no padding. The size of the output matrix with padding is given in Equation (3).

$N \times N \ast f \times f = N - f + 1$    (1)

$O = \sigma\left(b + \sum_{i=0}^{2} \sum_{j=0}^{2} w_{i,j}\, h_{a+i,\,b+j}\right)$    (2)

$N \times N \ast f \times f = (N + 2P - f)/s + 1$    (3)

Here, O is the output, N is the input size, f is the filter size, P is the padding, s is the stride, b is the bias, σ is the sigmoidal activation function, w is a 3x3 weight matrix of shared weights and $h_{x,y}$ is the input activation at position x, y. CNNs have applications in fields such as large-scale image recognition [17], emotion detection through speech [9][19], facial expression recognition [20], biometric systems, genomics and many others.
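The output-size relations and the single-window convolution described in this subsection can be checked with a few lines of code. The following is a minimal sketch in plain Python, not the implementation used in the paper: the function names are hypothetical, and the padded output size is assumed to follow the usual form floor((N + 2P - f)/s) + 1.

```python
import math


def conv_output_size(n, f, padding=0, stride=1):
    """Output width for an n x n input and an f x f filter.

    Assumes the usual relation floor((n + 2p - f) / s) + 1; with p = 0 and
    s = 1 this reduces to n - f + 1.
    """
    return (n + 2 * padding - f) // stride + 1


def sigmoid(z):
    """Sigmoidal activation used in the convolution equation."""
    return 1.0 / (1.0 + math.exp(-z))


def conv_activation(image, weights, bias, a, b):
    """One output activation for the window whose top-left corner is (a, b).

    Computes sigmoid(bias + sum_i sum_j weights[i][j] * image[a + i][b + j]).
    """
    f = len(weights)
    total = bias
    for i in range(f):
        for j in range(f):
            total += weights[i][j] * image[a + i][b + j]
    return sigmoid(total)


# 28x28 input with a 5x5 filter: no padding shrinks the map, padding 2 keeps it.
valid_size = conv_output_size(28, 5)
same_size = conv_output_size(28, 5, padding=2)

# Toy window: zero weights and zero bias give sigmoid(0).
toy_image = [[1.0] * 4 for _ in range(4)]
toy_weights = [[0.0] * 3 for _ in range(3)]
toy_output = conv_activation(toy_image, toy_weights, 0.0, 0, 0)
```

With a 5x5 filter, a padding of 2 is what keeps a 28x28 input at 28x28, which is the behaviour the text attributes to 'SAME' padding at stride 1.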
C. CNN model for MNIST dataset

The CNN model for the MNIST dataset is shown in Fig. 3. The input image is a vector with 784 pixel values. It is fed into the convolutional model, where the convolution layers along with filters generate the feature maps using the local receptive field. The pooling and fully connected layers follow the convolution layers. Dropout is
introduced after each convolutional layer. The pooling layers simplify the output after convolution. There are two types of pooling: max pooling and L2 pooling. Max pooling outputs the maximum activation in a 2x2 input region, while L2 pooling takes the square root of the sum of the squares of the activations in the 2x2 region. Finally, the fully connected layers connect every neuron of the max pooling layer to the output neurons. The configuration of the developed model is as follows:

• Batch size (training): 128, batch size (test): 256, number of epochs: 10
• Dropout: Yes
• Optimizer: RMS prop, learning rate = 0.001, parameter β = 0.9
• Keep_prob: 0.8

The architecture of the model is:

Convolution_layer 1 → Relu → Max_pool → dropout → Convolution_layer 2 → Relu → Max_pool → dropout → Convolution_layer 3 → Relu → Max_pool → fully_connected → dropout → output_layer → Result

The input is a 28x28 image, which is passed to the filters to generate feature maps. The first filter is of size 5x5x1x32 (32 features to learn in the first hidden layer), the second convolution layer uses 3x3x32x64 (64 features to learn in the second hidden layer), the third layer 3x3x64x128, the fourth layer (128*4*4, 625) and the last layer (625, 10). The stride is 1 for the convolution layers and 2 for the max-pooling layers. The stride defines the number of positions the window moves forward after one calculation. Generally, the stride is 1 for convolution layers and 2 for pooling layers. The accuracy of this model is 99.6%.

Fig. 3. CNN model for MNIST dataset. Layer parameters shown in the figure: input 28x28 image; Conv1 (5x5 filter, s=1, output 28x28x32); Conv2 and Conv3 (3x3 filter, s=1); max pooling (2x2, s=2); reshape to (128*4*4, 625); FC (625, 10); output layer.

D. CNN model for CIFAR-10 dataset

The configuration of the CNN model for the CIFAR-10 dataset is as follows:

• Batch size: 32, number of epochs: 50
• Optimizer: RMS prop, decay rate = 1e-6
• Data augmentation: True; rotation in the range 0-180, horizontal flip: TRUE, vertical flip: FALSE
• Dropout: True

The architecture is as follows:

Conv1 → Relu → Conv2 → Relu → Max_pooling → Dropout → Conv3 → Relu → Conv4 → Relu → Max_pooling → Dropout → Conv5 → Relu → Max_pooling → Dropout → Flatten → Dense → Relu → Dropout → Dense → Softmax → Result

Fig. 4 shows the architecture of the model. The 32x32 input image is given to the model, where the first convolutional layer learns 32 features through a 5x5 filter with 'same' padding. After the activation, another convolutional layer is stacked that learns 64 features through a 5x5 filter. Then a Relu activation is applied and the output is forwarded to a max pooling layer. After the max pooling layer, dropout is applied with a rate of 0.25. This entire block is repeated with 3x3 filters; this time the conv 3 layer learns 64 features, as does the conv 4 layer. All the remaining parameters are the same. The conv 5 layer learns 64 features with a 3x3 filter, followed by Relu, max pooling and dropout. The output is then flattened and passed to a fully connected layer with 512 outputs. Dropout is applied again with a rate of 0.5, followed by a dense layer with as many outputs as there are classes, i.e. 10, and a softmax layer for the final output. The accuracy of the model is 80.01% on the test dataset on a CPU unit.

Fig. 4. CNN model for CIFAR-10. Layer parameters shown in the figure: Conv1 (32 features, 5x5 filter, stride=1); Conv2 (64, 5x5, stride=1); Conv3 (64, 3x3, stride=1); Conv4 (64, 3x3, stride=1); Conv5 (64, 3x3, stride=1); max pooling (2x2, stride=2); dropout (0.25) after each of the three stages; dense (512) with dropout (0.5); dense with softmax.

E. Relu non-linearity activation function

There is a wide range of activation functions available when training neural network models. The most commonly used activations are sigmoid, tanh, relu and leaky relu. The relu non-linearity is a popular activation function in deep learning algorithms and has replaced the sigmoidal activation function, which is generally used for binary classification. The relu non-linearity $f(x) = \max(0, x)$ works several times faster than the tanh non-linearity $(e^{z} - e^{-z})/(e^{z} + e^{-z})$, and its output is either 0 or a positive number. With relu it is easier to train larger neural networks. Fig. 5 shows the graph of the relu non-linearity [16].

F. Overfitting and underfitting

Another aspect of training a deep neural network is the issue of high bias (resulting in underfitting) or high variance (resulting in overfitting). When the data is
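The (128*4*4, 625) shape quoted for the fourth layer of the MNIST model can be reproduced by tracing the spatial size of the feature maps through the three convolution and pooling stages. The sketch below is illustrative only: it assumes 'SAME' padding for every convolution and pooling layer, which the text does not state explicitly for this model, and the helper names are hypothetical.

```python
import math


def same_conv(size):
    """Assumed 'SAME' padding with stride 1: spatial size is unchanged."""
    return size


def same_pool(size, stride=2):
    """Assumed 'SAME' padding for pooling: output is ceil(size / stride)."""
    return math.ceil(size / stride)


def trace_mnist_shapes(size=28, channels=(32, 64, 128)):
    """Return (height, width, channels) after each conv + max-pool stage."""
    shapes = []
    for c in channels:
        size = same_pool(same_conv(size))
        shapes.append((size, size, c))
    return shapes


shapes = trace_mnist_shapes()
height, width, depth = shapes[-1]
flattened = height * width * depth  # input width of the fully connected layer
```

Under these assumptions the feature maps shrink from 28 to 14, 7 and finally 4, so the flattened input to the fully connected layer has 128*4*4 = 2048 values, matching the weight shape given in the text.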
underfitting, the network is not generalizing or fitting the data points well. In the case of high bias, which shows up on the training batch, the network needs to be trained longer or made much bigger with more hidden layers. In the case of overfitting, the model has high variance, i.e. it fits the training data too well and fails to generalize to the test sets. To reduce high variance, regularization techniques and data augmentation techniques can be implemented.

Fig. 5. Relu non-linearity

G. Dropout to reduce overfitting

Dropout is a regularization technique used to reduce overfitting [7]. In dropout, the network randomly deactivates some of its nodes based on a parameter. The probability parameter determines whether a node should remain in the network or not. Keep_prob is the probability of keeping a hidden node in the network. The activation is unaffected during this process, as it only determines whether to keep the node in the network or not.

H. Data Augmentation

Another technique to reduce overfitting is to train the model on larger datasets. If the dataset is limited, additional data can be created artificially using data augmentation techniques [7]. Data augmentation techniques distort and alter the images to obtain more data. Some of the techniques are:

• Mirroring - The images are flipped and laterally inverted.
• Random cropping - Cropping some parts of the image and creating subsets from the main image.
• Rotation - Rotating the images in any direction at various angles and generating new images.
• Color shifting - Shifting the RGB pixel values of the image to get a new coloured image.

I. RMS prop optimizer and learning rate

RMS prop, or root mean square prop, is an optimizer that works on the root mean square value of the change in gradients. The updates to the weights and biases are computed from the gradient with the help of the RMS value. The learning rate determines the size of the steps the algorithm takes to converge to the global minimum. If the learning rate is too high, the algorithm faces the exploding gradient problem, i.e. it takes larger steps and fails to converge at the local minimum. If the learning rate is too small, it faces the vanishing gradient problem, i.e. the gradient takes smaller steps and ultimately becomes so small that the changes in weights are insignificant. Thus the learning rate is a hyperparameter that needs to be finely tuned, which can be done with the help of learning rate decay. In learning rate decay, the learning rate decays exponentially after every epoch.

IV. RESULT ANALYSIS

The results of the experiments are as shown below:

• CNN model for MNIST dataset: accuracy of 99.6%, as shown in Fig. 6.

Fig. 6. Accuracy of CNN model on MNIST in 10 epochs

The result shows the accuracy as the number of epochs increases; the best accuracy achieved in recognizing the digits of the MNIST dataset is 99.6% with 10 epochs.

• CNN model for CIFAR-10 dataset: accuracy of 80.17% on the test set, as shown in Fig. 7.

Fig. 7. Accuracy of the CNN model in 50 epochs

V. CONCLUSION

The article discusses various aspects of deep learning, CNNs in particular, and performs image recognition and detection on the MNIST and CIFAR-10 datasets using a CPU unit only. The accuracy on MNIST is good, but the accuracy on CIFAR-10 can be improved by training for more epochs and on a GPU unit. The calculated accuracy on MNIST is 99.6% and on CIFAR-10 is 80.17%. The
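The RMS prop update and the exponential learning rate decay of Section III-I can be illustrated on a one-dimensional problem. This is a hedged sketch rather than the paper's training code: the update rule is the commonly used RMSprop form with decay parameter β, the decay schedule lr0 * rate ** epoch is an assumed form of "decays exponentially after every epoch", and all function names and the toy objective are hypothetical.

```python
import math


def rmsprop_step(w, grad, cache, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update for a single scalar weight.

    cache is the running average of squared gradients; the step is the
    gradient divided by its root mean square.
    """
    cache = beta * cache + (1.0 - beta) * grad * grad
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache


def decayed_lr(lr0, decay_rate, epoch):
    """Assumed exponential schedule: lr0 * decay_rate ** epoch."""
    return lr0 * decay_rate ** epoch


# Toy objective f(w) = w^2 with gradient 2w, minimised from w = 1.0.
w, cache = 1.0, 0.0
for epoch in range(200):
    lr = decayed_lr(0.05, 0.99, epoch)
    w, cache = rmsprop_step(w, 2.0 * w, cache, lr=lr)
```

Because the step is normalised by the running RMS of the gradient, its size is governed mainly by the learning rate, which is why a decaying schedule lets the iterate settle near the minimum instead of oscillating around it.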
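The role of Keep_prob in Section III-G can be made concrete with a small sketch. It assumes the common "inverted dropout" convention, in which surviving activations are divided by keep_prob so that their expected value is unchanged; the text does not say whether the paper's models rescale in this way, and the function name is hypothetical.

```python
import random


def dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob, zero it otherwise.

    Survivors are rescaled by 1 / keep_prob (inverted dropout, an assumption
    made here) so the expected activation stays the same.
    """
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)  # fixed seed so the sketch is reproducible
hidden = [1.0] * 10000
dropped = dropout(hidden, keep_prob=0.8, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
```

With keep_prob = 0.8, roughly 80% of the hidden units survive each pass and the mean activation stays close to its original value. At test time dropout is normally switched off, so no mask is applied.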
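The four augmentation techniques listed in Section III-H can be sketched on a tiny image stored as nested lists of (R, G, B) tuples. This is an illustrative example only, not the real-time augmentation pipeline used for CIFAR-10: the function names are hypothetical, and rotation is shown only for the 90 degree case rather than for arbitrary angles.

```python
import random


def mirror(image):
    """Horizontal flip: reverse the pixel order in every row."""
    return [row[::-1] for row in image]


def random_crop(image, size, rng):
    """Cut a size x size patch at a random position inside the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def rotate90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def shift_color(image, delta):
    """Add a per-channel offset to every pixel, clamped to 0..255."""
    return [
        [tuple(min(255, max(0, c + d)) for c, d in zip(pixel, delta)) for pixel in row]
        for row in image
    ]


image = [
    [(10, 20, 30), (40, 50, 60)],
    [(70, 80, 90), (250, 5, 0)],
]
flipped = mirror(image)
rotated = rotate90(image)
shifted = shift_color(image, (10, -10, 0))
patch = random_crop(image, 1, random.Random(0))
```

Each function returns a new image and leaves the input untouched, so several variants can be generated from one original during training.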
training accuracy on CIFAR-10 is 76.57 percent after 50 epochs. The accuracy on the training set may also be improved further by adding more hidden layers. This system can be implemented as an assistance system for machine vision for detecting natural language symbols.

REFERENCES

[1] Yann LeCun, Yoshua Bengio, Geoffrey Hinton, “Deep Learning”, Nature, Volume 521, pp. 436-444, Macmillan Publishers, May 2015.
[2] Norhidayu binti Abdul Hamid, Nilam Nur Binti Amir Sjarif, “Handwritten Recognition Using SVM, KNN and Neural Network”, www.arxiv.org/ftp/arxiv/papers/1702/1702.00723
[3] Cheng-Lin Liu, Kazuki Nakashima, Hiroshi Sako, Hiromichi Fujisawa, “Handwritten digit recognition: benchmarking of state-of-the-art techniques”, ELSEVIER, Pattern Recognition 36 (2003), pp. 2271-2285.
[4] Ping Kuang, Wei-Na Cao and Qiao Wu, “Preview on Structures and Algorithms of Deep Learning”, 11th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), IEEE, 2014.
[5] Mahmoud M. Abu Ghosh, Ashraf Y. Maghari, “A Comparative Study on Handwriting Digit Recognition Using Neural Networks”, IEEE, 2017.
[6] Youssouf Chherawala, Partha Pratim Roy and Mohamed Cheriet, “Feature Set Evaluation for Offline Handwriting Recognition Systems: Application to the Recurrent Neural Network”, IEEE Transactions on Cybernetics, Vol. 46, No. 12, December 2016.
[7] Alex Krizhevsky, “Convolutional Deep Belief Networks on CIFAR-10”. Available: https://www.cs.toronto.edu/~kriz/conv-cifar10-aug2010.pdf
[8] Yehya Abouelnaga, Ola S. Ali, Hager Rady, Mohamed Moustafa, “CIFAR-10: KNN-based ensemble of classifiers”, IEEE, March 2017.
[9] Caifeng Shan, Shaogang Gong, Peter W. McOwan, “Facial expression recognition based on Local binary patterns: A comprehensive study”, ELSEVIER, Image and Vision Computing 27, pp. 803-816, 2009.
[10] Li Deng, “A tutorial survey of architectures, algorithms, and applications of Deep Learning”, APSIPA Transactions on Signal and Information Processing (SIP), Volume 3, 2014.
[11] Yann LeCun, Corinna Cortes and Christopher J.C. Burges, “The MNIST Database of handwritten digits”. Available: http://yann.lecun.com/exdb/mnist/
[12] The CIFAR-10 dataset. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[13] MNIST dataset introduction, 2017. Available: http://corochann.com/mnist-dataset-introduction-1138.html
[14] Robust Vision Benchmark. Available: https://robust.vision/benchmark/about/
[15] Neural Network and Deep Learning. Available: http://neuralnetworksanddeeplearning.com/chap6.html
[16] Convolutional Neural Networks for Visual Recognition. Available: http://cs231n.github.io/neural-networks-1/
[17] Krizhevsky, Sutskever and Hinton, “ImageNet classification with deep convolutional neural networks”, Advances in Neural Information Processing Systems 25 (NIPS 2012), pp. 1106-1114, 2012.
[18] Zeiler, M. D. and Fergus, “Visualizing and understanding convolutional networks”, European Conference on Computer Vision, vol 8689, Springer, Cham, pp. 818-833, 2014.
[19] Zhengwei Huang, Min Dong, Qirong Mao and Yongzhao Zhan, “Speech Recognition using CNN”, IEEE/ACM Transactions on Audio, Speech and Language Processing, pp. 1533-1545, Volume 22, Issue 10, 2014, http://dx.doi.org/10.1145/2647868.2654984.
[20] Shima Alizadeh and Azar Fazel, “Convolutional Neural Networks for Facial Expression Recognition”, Computer Vision and Pattern Recognition, Cornell University Library, arXiv:1704.06756v1, 22 April 2017.
[21] Christian Szegedy, Wei Liu, Yangqing Jia et al., “Going Deeper with Convolutions”, Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Xplore, Boston, MA, USA, 2015.