This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2912200, IEEE Access
Review of Deep Learning Algorithms and Architectures
Ajay Shrestha, Ausif Mahmood (Senior Member, IEEE)
Department of Computer Science and Engineering, University of Bridgeport, 126 Park Ave, Bridgeport, CT 06604, USA
Corresponding author: Ajay Shrestha (e-mail: [email protected]).
ABSTRACT Deep learning (DL) is playing an increasingly important role in our lives. It has already made
a huge impact in areas such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting,
speech recognition, etc. The painstakingly handcrafted feature extractors used in the traditional learning,
classification and pattern recognition systems are not scalable for large-sized data sets. In many cases
depending on the problem complexity, deep learning can also overcome limitations of earlier shallow
networks that prevented efficient training and abstractions of hierarchical representations of multi-
dimensional training data. Deep Neural Network (DNN) uses multiple (deep) layers of units with highly
optimized algorithms and architectures. The paper reviews several optimization methods to improve accuracy
of the training and reduce training time. We delve into the math behind training algorithms used in recent
deep networks. We describe current shortcomings, enhancements and implementations. The review also
covers different types of deep architectures such as deep convolution networks, deep residual networks,
recurrent neural networks, reinforcement learning, variational autoencoders, and others.
INDEX TERMS Machine Learning Algorithm, Optimization, Artificial Intelligence, Deep Neural Network
Architectures, Convolution Neural Network, Backpropagation, Supervised and unsupervised learning
1. Introduction
A neural network is a machine learning (ML) technique that is inspired by and resembles the human
nervous system and the structure of the brain. It consists of processing units organized in input,
hidden and output layers. The nodes or units in each layer are connected to nodes in adjacent layers.
Each connection has a weight value. The inputs are multiplied by the respective weights and summed at
each unit. The sum then undergoes a transformation based on the activation function, which in most
cases is a sigmoid function, hyperbolic tangent (tanh) or rectified linear unit (ReLU). These
functions are used because they have a mathematically favorable derivative, making it easier to
compute partial derivatives of the error delta with respect to individual weights. Sigmoid and tanh
functions also squash the input into a narrow output range, i.e., 0 to 1 and -1 to +1 respectively.
They implement saturated nonlinearity, as the outputs plateau or saturate before/after the respective
thresholds. ReLU, on the other hand, exhibits both saturating and non-saturating behaviors with
$f(x) = \max(0, x)$. The output of the function is then fed as input to the subsequent unit in the
next layer. The result of the final output layer is used as the solution for the problem.
Neural networks can be used in a variety of problems including pattern recognition, classification,
clustering, dimensionality reduction, computer vision, natural language processing (NLP), regression,
predictive analysis, etc. Here is an example of image recognition.
Figure 1 shows how a deep neural network called Convolution Neural Network (CNN) can learn
hierarchical levels of representations from a low-level input vector and successfully identify the
higher-level object. The red squares in the figure are simply a gross generalization of the pixel
values of the highlighted section of the figure. CNNs can progressively extract higher
representations of the image after each layer and finally recognize the image.
The implementation of neural networks consists of the following steps:
1. Acquire training and testing data sets
2. Train the network
3. Make predictions with test data

The paper is organized in the following sections:
1. Introduction to Machine Learning
a. Background and Motivation
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
2. Classifications of Neural Networks
3. DNN Architectures
4. Training Algorithms
5. Shortcomings of Training Algorithms
6. Optimization of Training Algorithms
7. Architectures & Algorithms – Implementations
8. Conclusion
Figure 1. Image recognition by a CNN
1.1. Background
In 1957, Frank Rosenblatt created the perceptron, the first prototype of what we now know as a neural
network [1]. It had two layers of processing units that could recognize simple patterns. Instead of
undergoing more research and development, neural networks entered a dark phase of their history in
1969, when professors at MIT demonstrated that the perceptron could not even learn a simple XOR
function [2].
In addition, there was another finding that particularly dampened the motivation for DNN. The
universal approximation theorem showed that a single hidden layer was able to approximate any
continuous function [3]. It was mathematically proven as well [4], which further called the need for
DNN into question. While a single hidden layer could be used to learn, it was not efficient and was a
far cry from the convenience and capability afforded by the hierarchical abstraction of multiple
hidden layers of DNN that we know now. But it was not just the universal approximation theorem that
held back the progress of DNN. Back then, we didn't have a way to train a DNN either. These factors
prolonged the so-called AI winter, i.e., a phase in the history of artificial intelligence where it
didn't get much funding and interest, and as a result didn't advance much either.
A breakthrough in DNN occurred with the advent of the backpropagation learning algorithm. It was
proposed in the 1970s [5] but it wasn't until the mid-1980s [6] that it was fully understood and
applied to neural networks. The self-directed learning was made possible with the deeper
understanding and application of the backpropagation algorithm. The automation of feature extractors
is what differentiates DNNs from earlier generation machine learning techniques.
DNN is a type of neural network modeled as a multilayer perceptron (MLP) that is trained with
algorithms to learn representations from data sets without any manual design of feature extractors.
As the name Deep Learning suggests, it consists of a higher or deeper number of processing layers,
which contrasts with shallow learning models with fewer layers of units. The shift from shallow to
deep learning has allowed for more complex and non-linear functions to be mapped, as they cannot be
efficiently mapped with shallow architectures. This improvement has been complemented by the
proliferation of cheaper processing units such as the general-purpose graphic processing unit (GPGPU)
and large volumes of data (big data) to train from. While GPGPUs are less powerful than CPUs, the
number of parallel processing cores in them outnumbers CPU cores by orders of magnitude. This makes
GPGPUs better for implementing DNNs. In addition to the backpropagation algorithm and GPU, the
adoption and advancement of ML and particularly Deep Learning can be attributed to the explosion of
data or big data in the last 10 years. ML will continue to impact and disrupt all areas of our lives
from education, finance, governance, healthcare, manufacturing, marketing and others [7].
1.2. Motivation
Deep learning is perhaps the most significant development in the field of computer science in recent
times. Its impact has been felt in nearly all scientific fields. It is already disrupting and
transforming businesses and industries. There is a race among the world's leading economies and
technology companies to advance deep learning. There are already many areas where deep learning has
exceeded human level capability and performance, e.g., predicting movie ratings, decision to approve
loan applications, time taken by car delivery, etc. [8]. On March 27, 2019 the three deep learning
pioneers (Yoshua Bengio, Geoffrey Hinton, and Yann LeCun) were awarded the Turing Award, which is also
referred to as the "Nobel Prize" of computing [9]. While a lot has been accomplished, there is more to
advance in deep learning. Deep learning has the potential to improve human lives with more accurate
diagnosis of diseases like cancer [10], discovery of new drugs, and prediction of natural disasters
[11]. For example, [12] reported that a deep learning network was able to learn from 129,450 images
of 2,032 diseases and was able to diagnose at the same level as 21 board certified dermatologists.
Google AI [10] was able to beat the average accuracy of US board certified general pathologists in
grading prostate cancer, with 70% versus 61%.
The goal of this review is to cover the vast subject of deep learning and present a holistic survey
of dispersed information under one article. It presents novel work by collating the works of leading
authors from the wide scope and breadth of deep learning. Other review papers [13-16] focus on
specific areas and implementations without encompassing the full scope of the field. This review
covers the different types of deep learning network architectures, deep learning algorithms, their
shortcomings, optimization methods and the latest implementations and applications.
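As a concrete companion to the description of a processing unit in the Introduction, the weighted sum and activation can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `layer_output`, the bias terms, the 2-2-1 layer sizes and all numeric values are assumptions made for the example and are not taken from any cited work.

```python
import math


def sigmoid(z):
    """Squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """f(z) = max(0, z): zero for negative z, identity for positive z."""
    return max(0.0, z)


def layer_output(inputs, weight_rows, biases, activation):
    """Weighted sum of the inputs at each unit, followed by the activation."""
    outputs = []
    for weights, bias in zip(weight_rows, biases):
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical 2-2-1 network; all numbers are illustrative only.
x = [0.5, -1.0]
hidden = layer_output(x, [[1.0, -2.0], [-1.0, 0.5]], [0.0, 0.1], relu)
prediction = layer_output(hidden, [[1.0, -1.0]], [0.0], sigmoid)
```

The output of each layer is passed as the input of the next one, so a deeper network is obtained by chaining further `layer_output` calls.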
2. Classification of Neural Network

Neural Networks can be classified into the following
different types.
1. Feedforward Neural Network
2. Recurrent Neural Network (RNN)
3. Radial Basis Function Neural Network
4. Kohonen Self Organizing Neural Network
5. Modular Neural Network
In a feedforward neural network, information flows in just one direction from the input to the output
layer (via hidden nodes if any). They do not form any circles or loopbacks. Figure 2a shows a
particular type of implementation of a multilayer feedforward neural network with values and
functions computed along the forward pass path. Z is the weighted sum of the inputs and y represents
the non-linear activation function f of Z at each layer. W represents the weights between the two
units in the adjoining layers indicated by the subscript letters and b represents the bias value of
the unit.
Figure 2a. Feedforward neural network [6]
Unlike feedforward neural networks, the processing units in an RNN form a cycle. The output of a
layer becomes the input to the next layer, which is typically the only layer in the network, thus the
output of the layer becomes an input to itself, forming a feedback loop. This allows the network to
have memory about the previous states and use that to influence the current output. One significant
outcome of this difference is that, unlike a feedforward neural network, an RNN can take a sequence
of inputs and generate a sequence of output values as well, rendering it very useful for applications
that require processing sequences of time phased input data like speech recognition, frame-by-frame
video classification, etc.
Figure 2b. The unrolling of RNN in time [6]
Figure 2b demonstrates the unrolling of an RNN in time. E.g., if a 3-word sentence constitutes an
input, then each word would correspond to a layer and thus the network would be unfolded or unrolled
3 times into a 3-layer RNN.
Here is the mathematical explanation of the diagram: $x_t$ represents the input at time $t$. $U$,
$V$, and $W$ are the learned parameters that are shared by all steps. $O_t$ is the output at time
$t$. $S_t$ represents the state at time $t$ and can be computed as follows, where $f$ is the
activation function, e.g., ReLU.
$S_t = f(U x_t + W S_{t-1})$    (1)
Radial basis function neural network is used in classification, function approximation, time series
prediction problems, etc. It consists of input, hidden and output layers. The hidden layer includes a
radial basis function (implemented as a Gaussian function) and each node represents a cluster center.
The network learns to designate the input to a center and the output layer combines the outputs of
the radial basis function and weight parameters to perform classification or inference [17].
Kohonen self-organizing neural network self organizes the network model into the input data using
unsupervised learning. It consists of two fully connected layers, i.e., input layer and output layer.
The output layer is organized as a two-dimensional grid. There is no activation function and the
weights represent the attributes (position) of the output layer node. The Euclidean distance between
the input data and each output layer node with respect to the weights is calculated. The weights of
the closest node and its neighbors from the input data are updated to bring them closer to the input
data with the formula below [18].
$w_i(t+1) = w_i(t) + \alpha(t)\,\eta_{j^{*}i}\,(x(t) - w_i(t))$    (2)
Where $x(t)$ is the input data at time $t$, $w_i(t)$ is the $i$-th weight at time $t$ and
$\eta_{j^{*}i}$ is the neighborhood function between the $i$-th and $j$-th nodes.
Modular neural network breaks down a large network into smaller independent neural network modules.
The smaller networks perform specific tasks which are later combined as part of a single output of
the entire network [19].
DNNs are implemented in the following popular ways:
1. Sparse Autoencoders
2. Convolution Neural Networks (CNNs or ConvNets)
3. Restricted Boltzmann Machines (RBMs)
4. Long Short-Term Memory (LSTM)
Autoencoders are neural networks that learn features or encoding from a given dataset in order to
perform dimensionality reduction. Sparse Autoencoder is a variation of Autoencoders, where some of
the units output a value close to zero or are inactive and do not fire. Deep CNN uses multiple layers
of unit collections that interact with the input (pixel values in the case of an image) and result in
desired feature extraction. CNN finds its application in image recognition, recommender systems and
NLP. RBM is used to learn the probability distribution within the data set.
All these networks use backpropagation for training. Backpropagation uses gradient descent for error
reduction, by adjusting the weights based on the partial derivative of the error with respect to each
weight.
Neural Network models can also be divided into the following two distinct categories:
1. Discriminative
2. Generative
Discriminative model is a bottom-up approach in which data flows from the input layer via the hidden
layers to the output layer. They are used in supervised training for problems like classification
and regression. Generative models on the other hand are top-down and data flows in the opposite
direction. They are used in unsupervised pre-training and probabilistic distribution problems. If the
input x and corresponding label y are given, a discriminative model learns the probability
distribution p(y|x), i.e., the probability of y given x directly, whereas a generative model learns
the joint probability p(x,y), from which p(y|x) can be predicted [20]. In general, whenever labelled
data is available discriminative approaches are undertaken as they provide effective training, and
when labelled data is not available a generative approach can be taken [21].
Training can be broadly categorized into three types:
1. Supervised
2. Unsupervised
3. Semi-supervised
Supervised learning consists of labeled data which is used to train the network, whereas in
unsupervised learning there is no labeled data set, thus no learning based on feedback. In
unsupervised learning, neural networks are pre-trained using generative models such as RBMs and later
could be fine-tuned using standard supervised learning algorithms. It is then used on the test data
set to determine patterns or classifications. Big data has pushed the envelope even further for deep
learning with its sheer volume and variety of data. Contrary to our intuitive inclination, there is
no clear consensus on whether supervised learning is better than unsupervised learning. Both have
their merits and use cases. [22] demonstrated enhanced results with unsupervised learning using
unstructured video sequences for camera motion estimation and monocular depth. Modified Neural
Networks such as the Deep Belief Network (DBN) as described by Xue-Wen Chen et al. [23] use both
labeled and unlabeled data with supervised and unsupervised learning respectively to improve
performance. Developing a way to automatically extract meaningful features from labeled and
unlabeled high dimensional data space is challenging. Yann LeCun et al. assert that one way we could
achieve this would be to utilize and integrate both unsupervised and supervised learning [24].
Complementing unsupervised learning (with unlabeled data) with supervised learning (with labeled
data) is referred to as semi-supervised learning.
DNN and training algorithms have to overcome two major challenges: premature convergence and
overfitting. Premature convergence occurs when the weights and biases of the DNN settle into a state
that is only optimal at a local level and misses out on the global minimum of the entire
multi-dimensional space. Overfitting on the other hand describes a state when DNNs become so highly
tailored to a given training data set at a fine grain level that they become unfit, rigid and less
adaptable for any other test data set.
Along with different types of training, algorithms and architecture, we also have different machine
learning frameworks (Table 1) and libraries that have made training models easier. These frameworks
make complex mathematical functions, training algorithms and statistical modeling available without
having to write them on your own. Some provide distributed and parallel processing capabilities, and
convenient development and deployment features. Figure 3 shows a graph with various deep learning
libraries along with their GitHub stars from 2015-2018. GitHub is the largest hosting service
provider of source code in the world [25]. GitHub stars are indicative of how popular a project is on
GitHub. TensorFlow is the most popular DL library.
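The two update rules given in this section, Eq. (1) for the RNN state and Eq. (2) for the Kohonen weights, can be sketched as follows. Several choices here are assumptions made only for illustration: the function names, the use of ReLU as $f$, a Gaussian neighborhood function with a hypothetical width `sigma`, and nodes laid out on a one-dimensional line rather than the two-dimensional grid described in the text.

```python
import math


def relu(z):
    return max(0.0, z)


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_states(xs, U, W, s0, f=relu):
    """Eq. (1): S_t = f(U x_t + W S_{t-1}); U and W are shared by all steps."""
    states, s = [], s0
    for x in xs:
        s = [f(p + q) for p, q in zip(matvec(U, x), matvec(W, s))]
        states.append(s)
    return states


def som_update(weights, x, alpha, sigma=1.0):
    """Eq. (2) with an assumed Gaussian neighborhood around the closest node."""
    dists = [math.sqrt(sum((wk - xk) ** 2 for wk, xk in zip(w, x))) for w in weights]
    winner = dists.index(min(dists))  # node with the smallest Euclidean distance
    updated = []
    for i, w in enumerate(weights):
        eta = math.exp(-((i - winner) ** 2) / (2 * sigma ** 2))
        updated.append([wk + alpha * eta * (xk - wk) for wk, xk in zip(w, x)])
    return winner, updated


# Unrolling over a 3-step input sequence (illustrative values).
U = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5, 0.0], [0.0, 0.5]]
states = rnn_states([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], U, W, [0.0, 0.0])

# One Kohonen update for three nodes and a single input vector.
nodes = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
winner, new_nodes = som_update(nodes, [1.2, 0.8], alpha=0.5)
```

In the RNN part the same `U` and `W` are reused at every step, which is what the unrolled diagram expresses; in the Kohonen part every node moves toward the input, with the closest node moving the most.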
Figure 3. GitHub stars by Deep Learning Library [26]

TABLE 1. POPULAR DEEP LEARNING FRAMEWORKS AND LIBRARIES
Framework                     Institution                   License               1st Release
Caffe                         Berkeley AI Research          BSD / Free            2015
Microsoft Cognitive Toolkit   Microsoft                     MIT License / Free    2016
Gluon                         AWS and Microsoft             Open Source           2017
Keras                         Individual Author             MIT License / Free    2015
MXNet                         Apache Software Foundation    Apache 2.0 / Free     2015
TensorFlow                    Google Brain                  Apache 2.0 / Free     2015
Theano                        University of Montreal        BSD / Free            2008
Torch                         Ronan Collobert et al.        BSD / Free            2002
PyTorch                       Facebook                      BSD / Free            2016
Chainer                       Preferred Networks            BSD / Free            2015
Deeplearning4j                Adam Gibson et al.            Apache 2.0 / Free     2014

3. DNN Architectures
A deep neural network consists of several layers of nodes. Different architectures have been
developed to solve problems in different domains or use-cases. E.g., CNN is used most of the time in
computer vision and image recognition, and RNN is commonly used in time series
problems/forecasting. On the other hand, there is no clear winner for general problems like
classification as the choice of architecture could depend on multiple factors. Nonetheless [27]
evaluated 179 classifiers and concluded that parallel random forest or parRF_t, which is essentially
a parallel implementation of a variation of decision tree, performed the best. Below are four of the
most common architectures of deep neural networks.
1. Convolution Neural Network
2. Autoencoder
3. Restricted Boltzmann Machine (RBM)
4. Long Short-Term Memory (LSTM)

3.1. Convolution Neural Network
CNN is based on the human visual cortex and is the neural network of choice for computer vision
(image recognition) and video recognition. It is also used in other areas such as NLP, drug
discovery, etc. As shown in Figure 4, a CNN consists of a series of convolution and sub-sampling
layers followed by a fully connected layer and a normalizing (e.g., softmax function) layer. Figure 4
illustrates the well-known 7 layered LeNet-5 CNN architecture devised by LeCun et al. [28] for digit
recognition. The series of multiple convolution layers perform progressively more refined feature
extraction at every layer moving from input to output layers. Fully connected layers that perform
classification follow the convolution layers. Sub-sampling or pooling layers are often inserted
between successive convolution layers. A CNN takes a 2D $n \times n$ pixelated image as an input.
Each layer consists of groups of 2D neurons called filters or kernels. Unlike other neural networks,
neurons in each feature extraction layer of a CNN are not connected to all neurons in the adjacent
layers. Instead, they are only connected to the spatially mapped, fixed sized and partially
overlapping neurons in the previous layer's input image or feature map. This region in the input is
called the local receptive field. The lowered number of connections reduces training time and chances
of overfitting. All neurons in a filter are connected to the same number of neurons in the previous
input layer (or feature map) and are constrained to have the same sequence of weights and biases.
These factors speed up the learning and reduce the memory requirements for the network. Thus, each
neuron in a specific filter looks for the same pattern but in different parts of the input image.
Sub-sampling layers reduce the size of the network. In addition, along with local receptive fields
and shared weights (within the same filter), it effectively reduces the network's susceptibility to
shifts, scale and distortions of images [29]. Max/mean pooling or local averaging filters are often
used to achieve sub-sampling. The final layers of a CNN are responsible for the actual
classifications, where neurons between the layers are fully connected. Deep CNN can be implemented
with multiple series of weight-sharing convolution layers and sub-sampling layers. The deep nature of
the CNN results in high quality representations while maintaining locality, reduced parameters and
invariance to minor variations in the input image [30].
Figure 4. 7-layer Architecture of CNN for character recognition [28]
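The ideas of a local receptive field, shared weights within a filter, and sub-sampling can be illustrated with a minimal sketch. It is not the LeNet-5 implementation: the function names, the 5 x 5 input, the 2 x 2 edge-detecting kernel and the non-overlapping 2 x 2 max pooling are all assumptions chosen to keep the example small, and the "convolution" is computed without flipping the kernel, as is common in deep learning code.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide one shared kernel over the image; each output sees a local patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw)) + bias
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling that shrinks the feature map."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + a][c + b]
                 for a in range(size) for b in range(size))
             for c in cols]
            for r in rows]


# Illustrative 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # hypothetical filter that responds to that edge
feature_map = conv2d_valid(image, kernel)
pooled = max_pool(feature_map)
```

Every entry of `feature_map` is produced by the same four kernel weights, so the filter has 4 parameters regardless of the image size, and `pooled` keeps the edge response while reducing the map from 4 x 4 to 2 x 2.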
In most cases, backpropagation is used solely for training all parameters (weights and biases) in
CNN. Here is a brief description of the algorithm. The cost function with respect to an individual
training example $(x, y)$ in hidden layers can be defined as [31]:
$J(W, b; x, y) = \frac{1}{2} \| h_{W,b}(x) - y \|^2$    (3)
The equation for the error term $\delta$ for layer $l$ is given by [31]:
$\delta^{(l)} = ((W^{(l)})^T \delta^{(l+1)}) \cdot f'(z^{(l)})$    (4)
Where $\delta^{(l+1)}$ is the error for the $(l+1)$-th layer of a network whose cost function is
$J(W, b; x, y)$. $f'(z^{(l)})$ represents the derivative of the activation function.
$\nabla_{W^{(l)}} J(W, b; x, y) = \delta^{(l+1)} (a^{(l)})^T$    (5)
$\nabla_{b^{(l)}} J(W, b; x, y) = \delta^{(l+1)}$    (6)
Where $a$ is the input, such that $a^{(1)}$ is the input for the 1st layer (i.e., the actual input
image) and $a^{(l)}$ is the input for the $l$-th layer.
Error for the sub-sampling layer is calculated as [31]:
$\delta_k^{(l)} = upsample((W_k^{(l)})^T \delta_k^{(l+1)}) \cdot f'(z_k^{(l)})$    (7)
Where $k$ represents the filter number in the layer. In the sub-sampling layer, the error has to be
cascaded in the opposite direction, e.g., where mean pooling is used, upsample evenly distributes the
error to the previous input unit. And finally, here is the gradient w.r.t. feature maps [31]:
$\nabla_{W_k^{(l)}} J(W, b; x, y) = \sum_{i=1}^{m} (a_i^{(l)}) * rot90(\delta_k^{(l+1)}, 2)$    (8)
$\nabla_{b_k^{(l)}} J(W, b; x, y) = \sum_{a,b} (\delta_k^{(l+1)})_{a,b}$    (9)
Where $(a_i^{(l)}) * \delta_k^{(l+1)}$ represents the convolution between the error and the $i$-th
input in the $l$-th layer with respect to the $k$-th filter.
Algorithm 1 below represents a high-level description and flow of the backpropagation algorithm as
used in a CNN as it goes through multiple epochs until either the maximum iterations are reached or
the cost function target is met.
In addition to discriminative models such as image recognition, CNN can also be used for generative
models such as deconvolving images to make a blurry image sharper. [32] achieves this by leveraging
Fourier transformation to regularize inversion of the blurred images and denoising.
Different implementations of CNN have shown continuous improvement of accuracy in computer vision.
The improvements are tested against the same benchmark (ImageNet) to ensure unbiased results.
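For the fully connected layers, the cost in Eq. (3), the error term in Eq. (4) and the gradients in Eqs. (5) and (6) can be sketched as below. This is a minimal sketch under stated assumptions: the activation is taken to be the sigmoid (so that $f'(z) = a(1 - a)$), the network is a hypothetical 2-2-1 network with illustrative weights, and the convolution and sub-sampling cases of Eqs. (7) to (9) are not covered.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Return the activations of every layer; layers is a list of (W, b)."""
    activations = [x]
    for W, b in layers:
        z = [sum(w * a for w, a in zip(row, activations[-1])) + bj
             for row, bj in zip(W, b)]
        activations.append([sigmoid(v) for v in z])
    return activations


def cost(x, y, layers):
    """Eq. (3): half the squared error between the output and the target."""
    out = forward(x, layers)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))


def gradients(x, y, layers):
    """Eqs. (4)-(6): propagate the error backwards and collect gradients."""
    activations = forward(x, layers)
    # Error at the output layer: (h(x) - y) times the sigmoid derivative.
    delta = [(a - t) * a * (1 - a) for a, t in zip(activations[-1], y)]
    grads = []
    for l in range(len(layers) - 1, -1, -1):
        a_in = activations[l]
        grad_W = [[d * a for a in a_in] for d in delta]  # delta (a)^T, Eq. (5)
        grad_b = list(delta)                             # Eq. (6)
        grads.append((grad_W, grad_b))
        if l > 0:
            W = layers[l][0]
            # Eq. (4): (W^T delta) times f'(z) of the layer below.
            delta = [sum(W[j][i] * delta[j] for j in range(len(delta)))
                     * a_in[i] * (1 - a_in[i]) for i in range(len(a_in))]
    grads.reverse()
    return grads


# Hypothetical 2-2-1 network and a single training example.
layers = [([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]), ([[0.5, -0.6]], [0.2])]
x, y = [1.0, 0.5], [1.0]
grads = gradients(x, y, layers)
```

A gradient descent step would then subtract a small multiple (the learning rate) of each entry of `grads` from the corresponding weight or bias, and the result can be checked against a finite-difference estimate of `cost`.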
ALGORITHM 1. CNN BACKPROPAGATION ALGORITHM PSEUDO CODE
Initialize weights to small randomly generated values
Set learning rate to a small positive value
for iteration n = 1 to max iteration, or until the cost function criteria are met, do
    for image $x_1$ to $x_i$, do
        a. Forward propagate through convolution, pooling and then fully connected layers
        b. Derive the cost function value for the image
        c. Calculate the error term $\delta^{(l)}$ with respect to weights for each type of layer.
           Note that the error gets propagated from layer to layer in the following sequence
             i. fully connected layer
             ii. pooling layer
             iii. convolution layer
        d. Calculate the gradients $\nabla_{w_k^{(l)}}$ and $\nabla_{b_k^{(l)}}$ for weights and bias
           respectively for each layer. Gradient calculated in the following sequence
             i. convolution layer
             ii. pooling layer
             iii. fully connected layer
        e. Update weights
             $w_{ji}^{(l)} \leftarrow w_{ji}^{(l)} + \Delta w_{ji}^{(l)}$
        f. Update bias
             $b_j^{(l)} \leftarrow b_j^{(l)} + \Delta b_j^{(l)}$

Here are the well-known variations and implementations of the CNN architecture.
1. AlexNet:
a. CNN developed to run on the Nvidia parallel computing platform to support GPUs
2. Inception:
a. Deep CNN developed by Google
3. ResNet:
a. Very deep Residual network developed by Microsoft. It won 1st place in the ILSVRC 2015
competition on the ImageNet dataset.
4. VGG:
a. Very deep CNN developed for large scale image recognition
5. DCGAN:
a. Deep convolutional generative adversarial networks proposed by [33]. It is used in unsupervised
learning of hierarchy of feature representations in input objects.

3.2. Autoencoder
Autoencoder is a neural network that uses an unsupervised algorithm and learns the representation in
the input data set for dimensionality reduction and to recreate the original data set. The learning
algorithm is based on the implementation of backpropagation.
Figure 5. Linear representation of a 2D data input using PCA
Autoencoders extend the idea of principal component analysis (PCA). As shown in Figure 5, a PCA
transforms multi-dimensional data into a linear representation. Figure 5 demonstrates how a 2D input
data can be reduced to a linear vector using PCA. Autoencoders on the other hand can go further and
produce a nonlinear representation. PCA determines a set of linear variables in the directions with
largest variance. The $p$ dimensional input data points are represented as $m$ orthogonal directions,
such that $m \le p$, and constitute a lower (i.e., less than $p$) dimensional space. The original
data points are projected into the principal directions, thus omitting information in the
corresponding orthogonal directions. PCA focuses more on the variances rather than covariances and
correlations and it looks for the linear function with the most variance [34]. The goal is to
determine the direction with the least mean square error, which would then have the least
reconstruction error.
Autoencoders use encoder and decoder blocks of non-linear hidden layers to generalize PCA to perform
dimensionality reduction and eventual reconstruction of the original data. They use greedy layer by
layer unsupervised pre-training and fine-tuning with backpropagation [35]. Despite using
backpropagation, which is mostly used in supervised training, autoencoders are considered
unsupervised DNN because they regenerate the input $x^{(i)}$ itself instead of a different set of
target values $y^{(i)}$, i.e., $y^{(i)} = x^{(i)}$. Hinton et al. were able to achieve a near perfect
reconstruction of 784-pixel images using an autoencoder, showing that it performs far better than
PCA [36].
While performing dimensionality reduction, autoencoders come up with interesting representations of
the input vector in the hidden layer. This is often attributed to the smaller number of nodes in the
hidden layer or every second layer of the two-layer blocks. But even if there is a higher number of
nodes in the hidden layer, a sparsity constraint can be enforced on the hidden units to retain
interesting lower dimension representations of the inputs. To achieve sparsity, some nodes are
restricted from firing, i.e., the output is set to a value close to zero.
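One common way to impose this sparsity constraint, and the one formalized in Eqs. (10) and (11) of this section, is to penalize the distance between the average activation of each hidden unit and a small target value. The sketch below is illustrative only: the explicit Bernoulli form of the KL divergence is the standard one and is assumed here because the text does not spell it out, and the function names, the weight `beta = 3.0` and the activation values are hypothetical.

```python
import math


def mean_activations(hidden_activations):
    """Average activation of each hidden unit over the m training examples."""
    m = len(hidden_activations)
    return [sum(column) / m for column in zip(*hidden_activations)]


def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli variables with means rho and rho_hat."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparsity_penalty(hidden_activations, rho=0.03, beta=3.0):
    """Penalty added to the cost: beta times the summed KL terms."""
    return beta * sum(kl_divergence(rho, rho_hat)
                      for rho_hat in mean_activations(hidden_activations))


# Two examples, two hidden units (sigmoid outputs strictly between 0 and 1).
activations = [[0.02, 0.6], [0.04, 0.4]]
penalty = sparsity_penalty(activations)
```

Here the first unit has an average activation of about 0.03 and contributes almost nothing, while the second unit averages 0.5 and accounts for nearly all of the penalty, which is the behavior that pushes hidden units toward being inactive most of the time.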
Figure 6. Training stages in Autoencoder [36]
Figure 6 shows single layer feature detector blocks of RBMs used in pre-training, which is followed
by unrolling [36]. Unrolling combines the stacks of RBMs to create the encoder block and then
reverses the encoder block to create the decoder section, and finally the network is fine-tuned with
backpropagation [36].
Figure 7. Autoencoder nodes
Figure 7 illustrates a simplified representation of how autoencoders can reduce the dimension of the
input data and learn to recreate it in the output layer. Wang et al. [37] successfully implemented a
deep autoencoder with stacks of RBM blocks similar to Figure 6 to achieve better modeling accuracy
and efficiency than the proper orthogonal decomposition (POD) method for dimensionality reduction of
distributed parameter systems (DPSs). The equation below describes the average activation
$\hat{\rho}_j$ of the $j$-th unit of the 2nd layer, where $a_j^{(2)}(x^{(i)})$ is the activation of
that unit when the $i$-th input $x^{(i)}$ activates the neuron [38].
$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} [a_j^{(2)}(x^{(i)})]$    (10)
A sparsity parameter $\rho$ is introduced such that $\rho$ is very close to zero, e.g., 0.03, and
$\hat{\rho}_j = \rho$. To ensure that $\hat{\rho}_j = \rho$, a penalty term
$KL(\rho \| \hat{\rho}_j)$ is introduced such that the Kullback–Leibler (KL) divergence term
$KL(\rho \| \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$, and otherwise becomes large monotonically as
the difference between the two values grows [38]. Here is the updated cost function [38]:
$J_{sparse}(W, b) = J(W, b) + \beta \sum_{j=1}^{s_2} KL(\rho \| \hat{\rho}_j)$    (11)
Where $s_2$ equals the number of units in the 2nd layer and $\beta$ is the parameter that controls the
weight of the sparsity penalty term.
3.3. Restricted Boltzmann Machine (RBM)
Restricted Boltzmann Machine is an artificial neural network where we can apply an unsupervised
learning algorithm to build non-linear generative models from unlabeled data [39]. The goal is to
train the network to increase a function (e.g., product or log) of the probability of the vector in
the visible units so it can probabilistically reconstruct the input. It learns the probability
distribution over its inputs. As shown in Figure 8, an RBM is made of a two-layer network consisting
of the visible layer and the hidden layer. Each unit in the visible layer is connected to all units
in the hidden layer and there are no connections between the units in the same layer.
The energy (E) function of the configuration of the visible and hidden units, (v, h), is expressed in
the following way [40]:
$E(v, h) = -\sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij}$    (12)
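Equation (12) can be evaluated directly for a very small RBM, which also makes it possible to normalize $e^{-E(v,h)}$ over every configuration by brute force, as the probability definitions that follow in this section require. The sketch assumes binary (0/1) units, hypothetical function names and illustrative parameter values; enumerating all configurations is only feasible for toy sizes and is not how RBMs are trained in practice.

```python
import math
from itertools import product


def energy(v, h, a, b, W):
    """Eq. (12): energy of a joint configuration of visible and hidden units."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(v[i] * h[j] * W[i][j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction


def partition(a, b, W):
    """Sum of exp(-E) over all pairs of binary visible and hidden vectors."""
    return sum(math.exp(-energy(v, h, a, b, W))
               for v in product([0, 1], repeat=len(a))
               for h in product([0, 1], repeat=len(b)))


def prob_visible(v, a, b, W):
    """Probability of a visible vector: sum over hidden vectors, divided by Z."""
    total = sum(math.exp(-energy(v, h, a, b, W))
                for h in product([0, 1], repeat=len(b)))
    return total / partition(a, b, W)


# Toy RBM with 2 visible units and 1 hidden unit (illustrative values).
a, b, W = [0.1, -0.2], [0.3], [[0.5], [-0.4]]
e_low = energy((1, 0), (1,), a, b, W)
probabilities = {v: prob_visible(v, a, b, W) for v in product([0, 1], repeat=2)}
```

The values in `probabilities` sum to one, and the visible vector whose configurations have the lowest energies receives the largest share.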
$v_i$ and $h_j$ are the vector states of the visible unit $i$ and hidden unit $j$. $a_i$ and $b_j$
represent the biases of the visible and hidden units. $w_{ij}$ denotes the weight between the
respective visible and hidden units.
Figure 8. Restricted Boltzmann Machine
The partition function, $Z$, is represented by the sum over all possible pairs of visible and hidden
vectors [40].
$Z = \sum_{v,h} e^{-E(v,h)}$    (13)
The probability of every pair of visible and hidden vectors is given by the following [40].
$p(v, h) = \frac{1}{Z} e^{-E(v,h)}$    (14)
The probability of a particular visible layer vector is provided by the following [40].
$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$    (15)
As can be seen from the equations above, the probability of a configuration becomes higher as the
value of its energy function becomes lower. Thus during the training process, the weights and biases
of the network are adjusted to arrive at a lower energy and thus maximize the probability assigned to
the training vector. It is mathematically convenient to compute the derivative of the log probability
of a training vector.
$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}$    (16)
In the equation [40] above, $\langle v_i h_j \rangle_{data}$ and $\langle v_i h_j \rangle_{model}$
represent the expectations under the respective distributions.
RNN, i.e., the problem of vanishing gradients by letting gradients pass unaltered. As shown in the
illustration in Figure 9, LSTM consists of blocks of memory cell state through which the signal flows
while being regulated by input, forget and output gates. These gates control what is stored, read
and written on the cell. LSTM is used by Google, Apple and Amazon in their voice recognition
platforms [42].
Figure 9. LSTM Block with memory cell and gates
In Figure 9, $C$, $x$, $h$ represent cell, input and output values. Subscript $t$ denotes the time
step value, i.e., $t-1$ is from the previous LSTM block (or from time $t-1$) and $t$ denotes current
block values. The symbol $\sigma$ is the sigmoid function and $\tanh$ is the hyperbolic tangent
function. Operator + is the element-wise summation and x is the element-wise multiplication. The
computations of the gates are described in the equations below [41, 43].
$f_t = \sigma(W_f x_t + w_f h_{t-1} + b_f)$    (17)
$i_t = \sigma(W_i x_t + w_i h_{t-1} + b_i)$    (18)
$o_t = \sigma(W_o x_t + w_o h_{t-1} + b_o)$    (19)
$c_t = f_t \otimes c_{t-1} + i_t \otimes \sigma_c(W_c x_t + w_c h_{t-1} + b_c)$    (20)
$h_t = o_t \otimes \sigma_h(c_t)$    (21)
Thus, the adjustments in the weights can be denoted as
follows [40], where ϵ is the learning rate. Where 𝑓, 𝑖, 𝑜 are the forget, input and output gate vectors
respectively. 𝑊, 𝑤, 𝑏 𝑎𝑛𝑑 ⨂ represent weights of input,
∆𝑤𝑖𝑗 = 𝜖(〈𝑣𝑖 ℎ𝑗 〉𝑑𝑎𝑡𝑎 − 〈𝑣𝑖 ℎ𝑗 〉𝑚𝑜𝑑𝑒𝑙 ) (28) weights of recurrent output, bias and element-wise
multiplication respectively.
3.4. Long Short-Term Memory (LSTM) There is a smaller variation of the LSTM known as gated
LSTM is an implementation of the Recurrent Neural recurrent units (GRU). GRUs are smaller in size than LSTM
Network and was first proposed by Hochreiter et al. in 1997 as they don’t include the output gate, and can perform better
[41]. Unlike the earlier described feed forward network than LSTM on only some simpler datasets[44, 45].
architectures, LSTM can retain knowledge of earlier states LSTMs recurrent neural networks can keep track of long-
and can be trained for work that requires memory or state term dependencies. Therefore, they are great for learning
awareness. LSTM partly addresses a major limitation of
from sequence input data and building models that rely on context and earlier states. The cell block of LSTM retains pertinent information of previous states. The input, forget and output gates dictate the new data going into the cell, what remains in the cell and the cell values used in the calculation of the output of the LSTM block, respectively [41, 43]. Naul et al. demonstrated LSTM and GRU based autoencoders for automatic feature extraction [46].

3.5. Comparison of DNN Networks
Table 2 provides a compact summary and comparison of the different DNN architectures. The examples of implementations, applications, datasets and DL software frameworks presented in the table are not meant to be exhaustive. In addition, some of the categorized network architectures could be implemented in hybrid fashion. E.g., even though RBMs are generative models and their training is considered unsupervised, they can have elements of a discriminative model when training is fine-tuned with supervised learning. The table also provides examples of common applications for using different architectures.
TABLE 2. DNN NETWORK COMPARISON TABLE

| Network Type | Network Architecture | Model | Training Type | Training Algorithm | Implementation Sample | Common Application | Dataset Sample | Popular DL Framework (sample) |
|---|---|---|---|---|---|---|---|---|
| Feedforward Neural Network | CNN | Discriminative | Supervised | Gradient Descent based Backpropagation | Siamese Network, Deep CNN | Image recognition/classification | MNIST | TensorFlow, Caffe, Theano, Torch, Deeplearning4j, Microsoft Cognitive Toolkit, Keras, MXNet, PyTorch |
| Feedforward Neural Network | Residual Network | Discriminative | Supervised | Gradient Descent based Backpropagation | Deep ResNet; HighwayNet; DenseNet | Image recognition | ImageNet | TensorFlow, PyTorch, Keras |
| Feedforward Neural Network | Autoencoder | Generative | Unsupervised | Backpropagation | Sparse Autoencoders, Variational Autoencoders | Dimensionality Reduction; Encoding | MNIST | TensorFlow, Deeplearning4j, Keras |
| Feedforward Neural Network | Adversarial Networks | Generative & Discriminative | Unsupervised | Backpropagation | Generative Adversarial Network | Generate realistic fake data; Reconstruction of 3D models; Image improvement | CIFAR10 | TensorFlow, Keras |
| Feedforward Neural Network | RBM | Generative with Discriminative finetuning | Unsupervised | Gradient Descent based Contrastive divergence | Deep Belief Network; Deep Boltzmann Machine | Dimensionality Reduction; Feature learning; Topic modeling | MNIST | TensorFlow, Deeplearning4j, Keras, MXNet, Theano, Torch |
| Recurrent Neural Network | LSTM | Discriminative | Supervised | Gradient Descent & Backpropagation through Time | Deep RNN, Gated Recurrent Unit (GRU), Neural Machine Translation (NMT) | Natural Language Processing; Language Translation | MNIST Stroke Sequence | TensorFlow, Caffe, Theano, Torch, Deeplearning4j, Microsoft Cognitive Toolkit, Keras, MXNet, PyTorch |
| Radial Basis Function NN | RBF Network | Discriminative | Supervised and Unsupervised | K-means Clustering; Least Square Function | Radial Basis Function NN | Function approximation; Time series prediction | Fisher's Iris data set | TensorFlow |
| Kohonen Self Organizing NN | Nodes arranged in hexagonal or rectangular grid | Generative | Unsupervised | Competitive Learning | Kohonen Self Organizing NN | Dimensionality Reduction; Optimization problems; Clustering analysis | SPAMbase | TensorFlow |
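Returning to the RBM of Section 3.3, the energy, partition function and probabilities in Equations (12) to (16) can be checked by brute force on a very small network. The sketch below is only an illustration under stated assumptions: units are taken to be binary, the function names and the toy parameters `A`, `B`, `W` are hypothetical, and full enumeration is feasible only for a handful of units (practical training approximates the model expectation, e.g., with contrastive divergence as listed in Table 2).

```python
import itertools
import math


def energy(v, h, a, b, w):
    """Energy E(v, h) of one joint configuration, as in Equation (12)."""
    visible = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden = sum(b_j * h_j for b_j, h_j in zip(b, h))
    pairs = sum(v[i] * h[j] * w[i][j]
                for i in range(len(v)) for j in range(len(h)))
    return -visible - hidden - pairs


def states(n):
    """All binary vectors of length n (assumption: binary units)."""
    return list(itertools.product((0, 1), repeat=n))


def partition(a, b, w):
    """Partition function Z, Equation (13)."""
    return sum(math.exp(-energy(v, h, a, b, w))
               for v in states(len(a)) for h in states(len(b)))


def p_joint(v, h, a, b, w):
    """Joint probability p(v, h), Equation (14)."""
    return math.exp(-energy(v, h, a, b, w)) / partition(a, b, w)


def p_visible(v, a, b, w):
    """Marginal probability p(v), Equation (15)."""
    return sum(p_joint(v, h, a, b, w) for h in states(len(b)))


def log_prob_gradient(v, i, j, a, b, w):
    """<v_i h_j>_data - <v_i h_j>_model for one training vector, Equation (16)."""
    activation = b[j] + sum(v[k] * w[k][j] for k in range(len(v)))
    data_term = v[i] / (1.0 + math.exp(-activation))  # v_i * p(h_j = 1 | v)
    model_term = sum(p_joint(vv, hh, a, b, w) * vv[i] * hh[j]
                     for vv in states(len(a)) for hh in states(len(b)))
    return data_term - model_term


def weight_update(v, i, j, a, b, w, learning_rate):
    """Weight adjustment: learning rate times the gradient above."""
    return learning_rate * log_prob_gradient(v, i, j, a, b, w)


# Hypothetical toy parameters: 2 visible units, 2 hidden units.
A = [0.1, -0.2]
B = [0.3, 0.0]
W = [[0.5, -0.1], [-0.4, 0.2]]
```

For example, `p_visible((1, 0), A, B, W)` gives the probability the toy model assigns to the visible vector (1, 0), and `weight_update((1, 0), 0, 1, A, B, W, 0.1)` gives the corresponding change to one weight.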
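The LSTM gate computations of Section 3.4 (Equations (17) to (21)) can likewise be written out as a single time step. This is a minimal sketch, not a reference implementation: it assumes that $\sigma_c$ and $\sigma_h$ are both tanh, it uses plain Python lists instead of a tensor library, and the names `lstm_step`, `affine` and `zero_params` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b, where U plays the role of the recurrent weights w."""
    return [sum(W[r][k] * x[k] for k in range(len(x)))
            + sum(U[r][k] * h[k] for k in range(len(h)))
            + b[r]
            for r in range(len(b))]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'o', 'c' to (W, U, b)."""
    pre = {gate: affine(W, x_t, U, h_prev, b)
           for gate, (W, U, b) in params.items()}
    f_t = [sigmoid(z) for z in pre["f"]]          # forget gate, Equation (17)
    i_t = [sigmoid(z) for z in pre["i"]]          # input gate, Equation (18)
    o_t = [sigmoid(z) for z in pre["o"]]          # output gate, Equation (19)
    candidate = [math.tanh(z) for z in pre["c"]]  # assumption: sigma_c = tanh
    c_t = [f * c + i * g                          # cell state, Equation (20)
           for f, c, i, g in zip(f_t, c_prev, i_t, candidate)]
    h_t = [o * math.tanh(c)                       # output, Equation (21)
           for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(n_in, n_hidden):
    """All-zero weights and biases, handy for checking the gate arithmetic."""
    def gate():
        return ([[0.0] * n_in for _ in range(n_hidden)],
                [[0.0] * n_hidden for _ in range(n_hidden)],
                [0.0] * n_hidden)
    return {name: gate() for name in ("f", "i", "o", "c")}
```

With all-zero parameters every gate evaluates to 0.5, so the cell simply halves its previous state. Pushing the forget-gate bias high and the input-gate bias low makes the cell state pass through nearly unchanged, which is the mechanism the text credits for easing the vanishing gradient problem.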
4. Training Algorithms
The learning algorithm constitutes the main part of Deep Learning. The number of layers differentiates the deep neural network from shallow ones. The higher the number of layers, the deeper it becomes. Each layer can be specialized to detect a specific aspect or feature.
As indicated by Maryam M Najafabadi et al. [47], in the case of image (face) recognition, the first layer can detect edges and the second can detect higher features such as various parts of the face, e.g., ears, eyes, etc., and the third layer can go further up the complexity order by even learning facial shapes of various persons. Even though each layer might learn or detect a defined feature, the sequence is not always designed for it, especially in unsupervised learning. These feature extractors in each layer had to be manually programmed prior to the development of training algorithms such as gradient descent. These hand-crafted classifiers didn't scale for larger datasets or adapt to variation in the dataset. This message was echoed in the 1998 paper [28] by Yann LeCun et al., where they demonstrate that systems with more automatic learning and fewer manually designed heuristics yield far better pattern recognition.
Backpropagation provides a representation learning methodology, where raw data can be fed without the need to manually massage it for classifiers, and it will automatically find the representations needed for classification or recognition [6]. The goal of the learning algorithm is to find the optimal values for the weight vectors to solve a class of problem in a domain.
Some of the well-known training algorithms are:

1. Gradient Descent
2. Stochastic Gradient Descent
3. Momentum
4. Levenberg–Marquardt algorithm
5. Backpropagation through time

4.1. Gradient Descent
Gradient descent (GD) is the underlying idea in most machine learning and deep learning algorithms. It is based on the concept of Newton's Algorithm for finding the roots (or zero value) of a 2D function. To achieve this, we randomly pick a point on the curve and slide to the right or left along the x-axis based on the negative or positive value of the derivative or slope of the function at the chosen point until the value of the y-axis, i.e., the function f(x), becomes zero. The same idea is used in gradient descent, where we traverse or descend along a certain path in a multi-dimensional weight space as long as the cost function keeps decreasing and stop once the error rate ceases to decrease. Newton's method is prone to getting stuck in local minima if the derivative of the function at the current point is zero.
Likewise, this risk is also present when using gradient descent on a non-convex function. In fact, the impact is amplified in the multi-dimensional (each dimension represents a weight variable) and multi-layer landscape of a DNN, and it can result in a sub-optimal set of weights. The cost function is one half the square of the difference between the desired output and the current output, as shown below.

$$C = \frac{1}{2} \left( y_{expected} - y_{actual} \right)^2 \tag{22}$$

The backpropagation methodology uses gradient descent. In backpropagation, the chain rule and partial derivatives are employed to determine the error delta for any change in the value of each weight. The individual weights are then adjusted to reduce the cost function after every learning iteration over the training data set, resulting in a final multi-dimensional (multi-weight) landscape of weight values [6]. We process all the samples in the training dataset before applying the updates to the weights. This process is repeated until the objective (aka cost function) doesn't reduce any further.
Figure 10 shows the error derivatives in relation to the outputs in each hidden layer, which are the weighted summation of the error derivatives in relation to the inputs of the units in the layer above. E.g., once $\partial E / \partial z_k$ is calculated, the partial error derivative with respect to $w_{jk}$ is equal to $y_j \, \partial E / \partial z_k$.

Figure 10. Error calculation in Multilayer Neural Network [6]
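To make the chain rule statement above concrete, the following sketch computes the weight gradients of a single output unit and then applies plain gradient descent. It assumes a sigmoid output unit $k$ fed by hidden outputs $y_j$ and the cost of Equation (22); the function names and the learning rate are hypothetical choices for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_unit(y_hidden, weights):
    """Total input z_k and output y_k of one unit fed by hidden outputs y_j."""
    z_k = sum(w * y for w, y in zip(weights, y_hidden))
    return z_k, sigmoid(z_k)


def cost(y_expected, y_actual):
    """Cost function of Equation (22)."""
    return 0.5 * (y_expected - y_actual) ** 2


def weight_gradients(y_hidden, weights, y_expected):
    """dE/dw_jk = y_j * dE/dz_k, obtained with the chain rule."""
    _, y_k = output_unit(y_hidden, weights)
    dE_dy = y_k - y_expected           # derivative of (22) w.r.t. the output
    dE_dz = dE_dy * y_k * (1.0 - y_k)  # chain rule through the sigmoid
    return [y_j * dE_dz for y_j in y_hidden]


def gradient_descent(y_hidden, weights, y_expected, learning_rate=0.5, steps=200):
    """Repeatedly move the weights against the gradient of the cost."""
    for _ in range(steps):
        grads = weight_gradients(y_hidden, weights, y_expected)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights
```

Comparing `weight_gradients` against a finite-difference estimate of the cost is a simple way to confirm that the analytic derivative is right before using it for training.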
4.2. Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is the most common variation and implementation of gradient descent. In gradient descent, we process all the samples in the training dataset before applying the updates to the weights. In SGD, by contrast, updates are applied after running through a minibatch of n samples. Since we are updating the weights more frequently in SGD than in GD, we can converge towards the global minimum much faster.

4.3. Momentum
In standard SGD, the learning rate is used as a fixed multiplier of the gradient to compute the step size or update to the weight. This can cause the update to overshoot a potential minimum if the gradient is too steep, or delay the convergence if the gradient is noisy. Using the concept of momentum in physics, the momentum algorithm introduces a velocity variable $v$ that is computed as an exponentially decaying average of the gradient [48]. This helps prevent costly descent in the wrong direction. In the equations below, $\alpha \in [0,1)$ is the momentum parameter and $\epsilon$ is the learning rate.

$$\text{Velocity update: } v \leftarrow \alpha v - \epsilon g \tag{23}$$

$$\text{Actual update: } \theta \leftarrow \theta + v \tag{24}$$

4.4. Levenberg-Marquardt algorithm
The Levenberg-Marquardt algorithm (LMA) is primarily used in solving non-linear least squares problems such as curve fitting. In least squares problems, we try to fit the given data points with a function such that the sum of the squares of the errors between the actual data points and the points on the function is minimized. LMA uses a combination of gradient descent and the Gauss-Newton method. Gradient descent is employed to reduce the sum of the squared errors by updating the parameters of the function in the direction of the steepest descent, while the Gauss-Newton method minimizes the error by assuming the function to be locally quadratic and finding the minimum of the quadratic [49].
If the fitting function is denoted by $\hat{y}(t; p)$ and the $m$ data points are denoted by $(t_i, y_i)$, then the squared error can be written as [49]:

$$x^2(p) = \sum_{i=1}^{m} \left[ \frac{y(t_i) - \hat{y}(t_i; p)}{\sigma_{y_i}} \right]^2 \tag{25}$$

$$= (y - \hat{y}(p))^T W (y - \hat{y}(p)) \tag{26}$$

$$= y^T W y - 2 y^T W \hat{y} + \hat{y}^T W \hat{y} \tag{27}$$

where $W$ is the diagonal weighting matrix whose entries $W_{ii}$ are the inverse of the squared measurement error $\sigma_{y_i}$ for $y(t_i)$, i.e., $W_{ii} = 1/\sigma_{y_i}^2$.
The gradient of the squared error function with respect to the $n$ parameters can be denoted as [49]:

$$\frac{\partial}{\partial p} x^2 = 2 (y - \hat{y}(p))^T W \frac{\partial}{\partial p} (y - \hat{y}(p)) \tag{28}$$

$$= -2 (y - \hat{y}(p))^T W \left[ \frac{\partial \hat{y}(p)}{\partial p} \right] \tag{29}$$

$$= -2 (y - \hat{y})^T W J \tag{30}$$

$$h_{gd} = \alpha J^T W (y - \hat{y}) \tag{31}$$

where $J$ is the Jacobian matrix of size $m \times n$ used in place of $[\partial \hat{y} / \partial p]$, and $h_{gd}$ is the update in the direction of the steepest gradient descent.
The equation for the Gauss-Newton method update ($h_{gn}$) is as follows [49]:

$$[J^T W J] h_{gn} = J^T W (y - \hat{y}) \tag{32}$$

The Levenberg-Marquardt update ($h_{lm}$) is generated by combining the gradient descent and Gauss-Newton methods, resulting in the equation below [49]:

$$[J^T W J + \lambda \, \mathrm{diag}(J^T W J)] h_{lm} = J^T W (y - \hat{y}) \tag{33}$$

4.5. Backpropagation through time
Backpropagation through time (BPTT) is the standard method to train the recurrent neural network. As shown in Figure 2b, the unrolling of an RNN in time makes it appear like a feedforward network. But unlike the feedforward network, the unrolled RNN has the exact same set of weight values for each layer and represents the training process in the time domain. The backward pass through this time domain network calculates the gradients with respect to specific weights at each layer. It then averages the updates for the same weight at different time increments (or layers) and changes them to ensure the value of the weights at each layer continues to stay uniform.

4.6. Comparison of Deep Learning Algorithms
Table 3 provides a summary and comparison of common deep learning algorithms. The advantages and disadvantages are presented along with techniques to address the disadvantages. Gradient descent-based training is the most common type of training. Backpropagation through time is the backpropagation tailored for recurrent neural networks.
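As a concrete illustration of the momentum update in Equations (23) and (24), the sketch below applies it to a simple quadratic objective. The helper names `momentum_step` and `minimize`, the default hyperparameters and the test objective are assumptions made for this example only.

```python
def momentum_step(theta, velocity, grad, alpha, epsilon):
    """One update: v <- alpha * v - epsilon * g, then theta <- theta + v."""
    velocity = [alpha * v - epsilon * g for v, g in zip(velocity, grad)]
    theta = [t + v for t, v in zip(theta, velocity)]
    return theta, velocity


def minimize(grad_fn, theta, alpha=0.9, epsilon=0.1, steps=500):
    """Run momentum updates starting from zero velocity."""
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        theta, velocity = momentum_step(theta, velocity, grad_fn(theta),
                                        alpha, epsilon)
    return theta


def quadratic_gradient(theta):
    """Gradient of the hypothetical objective f(theta) = sum(theta_i ** 2)."""
    return [2.0 * t for t in theta]
```

Calling `minimize(quadratic_gradient, [1.0])` drives the parameter towards zero. Setting `alpha=0.0` recovers plain gradient descent with a fixed learning rate, which makes the effect of the velocity term easy to compare.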
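The Levenberg-Marquardt update of Equation (33) can be sketched for a two-parameter curve fit. Everything specific here is an assumption for illustration: the model $\hat{y}(t; p) = p_0 e^{p_1 t}$, an identity weighting matrix $W$ (unit measurement errors), a 2x2 solve by Cramer's rule, and the common heuristic of shrinking $\lambda$ after a successful step and growing it otherwise, which the text above does not prescribe.

```python
import math


def model(t, p):
    """Hypothetical fitting function y_hat(t; p) = p0 * exp(p1 * t)."""
    return p[0] * math.exp(p[1] * t)


def jacobian_row(t, p):
    """Partial derivatives of the model with respect to p0 and p1."""
    e = math.exp(p[1] * t)
    return [e, p[0] * t * e]


def squared_error(ts, ys, p):
    """Sum of squared errors with W = I (assumption)."""
    return sum((y - model(t, p)) ** 2 for t, y in zip(ts, ys))


def lm_step(ts, ys, p, lam):
    """Solve [J^T J + lam * diag(J^T J)] h = J^T (y - y_hat) for h."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    g = [0.0, 0.0]
    for t, y in zip(ts, ys):
        row = jacobian_row(t, p)
        residual = y - model(t, p)
        for r in range(2):
            g[r] += row[r] * residual
            for c in range(2):
                A[r][c] += row[r] * row[c]
    A[0][0] *= 1.0 + lam  # damping applied to the diagonal only
    A[1][1] *= 1.0 + lam
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    h0 = (g[0] * A[1][1] - A[0][1] * g[1]) / det
    h1 = (A[0][0] * g[1] - A[1][0] * g[0]) / det
    return [h0, h1]


def fit(ts, ys, p, lam=1e-2, iterations=50):
    """Accept a step only if it lowers the error; adapt lam accordingly."""
    err = squared_error(ts, ys, p)
    for _ in range(iterations):
        h = lm_step(ts, ys, p, lam)
        candidate = [p[0] + h[0], p[1] + h[1]]
        new_err = squared_error(ts, ys, candidate)
        if new_err < err:
            p, err, lam = candidate, new_err, lam / 10.0
        else:
            lam *= 10.0
    return p
```

A small $\lambda$ makes the step behave like Gauss-Newton (Equation (32)), while a large $\lambda$ shortens it towards a scaled gradient descent step, which is the blend the section describes.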
Contrastive divergence finds its use in probabilistic models such as RBMs. Evolutionary algorithms can be applied to hyperparameter optimization or to training models by optimizing weights. Reinforcement learning could be used in game theory, multi-agent systems and other problems where both exploitation and exploration need to be optimized.

TABLE 3. DEEP LEARNING ALGORITHM COMPARISON TABLE

| Algorithm | Advantages | Disadvantages | Techniques to address disadvantages |
|---|---|---|---|
| (Batch) Gradient Descent | Scales well after optimizations | Takes a long time to converge as weights are updated after the entire dataset pass; Local minima | Mini-Batch Gradient Descent; see Table 4 (local minima) |
| Stochastic Gradient Descent | Scales well after optimizations | Noisy error rates since it is calculated at every sample; Accuracy requires random order; Local minima | Mini-Batch Gradient Descent; Shuffle data after every epoch; see Table 4 (local minima) |
| Back Propagation through Time | Performs better than metaheuristics (e.g., genetic algorithm) | Hard to use in applications where online adaption is required, as the entire time series must be used | Truncate part of time instead of entire time |
| Contrastive divergence | Can create samples that appear to come from the input data distribution; Generative models; Pattern completion | Difficult to train | Get sampling from Monte Carlo Markov Chain |
| Evolutionary Algorithms | Is able to explore and exploit the solution space effectively | Takes a long time to run as it needs to test different combinations | Utilize cloud and GPUs |
| Reinforcement Learning (Q-learning) | Is able to balance exploration and exploitation | In some cases, reward is extremely rare | Work backwards from the reward state |

5. Shortcomings of Training Algorithms
There are several shortcomings with the standard use of training algorithms on DNNs. The most common ones are described here.

5.1. Vanishing and Exploding Gradients
Deep neural networks are prone to vanishing (or exploding) gradients due to the inherent way in which gradients (or derivatives) are computed layer by layer in a cascading manner, with each layer contributing to exponentially decreasing or increasing derivatives. Weights are increased or decreased based on gradients to reduce the cost function or error. Very small gradients can cause the network to take a long time to train, whereas large gradients can cause the training to overshoot and diverge. This is made worse by non-linear activation functions like the sigmoid and tanh functions that squash the outputs to a small range. Since a change in weight then has only a nominal effect on the output, training could take much longer. This problem can be mitigated using rectified linear activation functions like ReLU and proper weight initialization.

5.2. Local Minima
In a convex function, a local minimum is always the global minimum, which makes gradient descent based optimization foolproof. In nonconvex functions, however, backpropagation based gradient descent is particularly vulnerable to the issue of premature convergence into a local minimum. A local minimum, as shown in Figure 11, can easily be mistaken for the global minimum.

Figure 11. Gradient Descent

5.3. Flat Regions
Just like local minima, flat regions or saddle points (Figure 12) pose a similar challenge for gradient descent based optimization in nonconvex high-dimensional functions. The training algorithm could potentially be misled by this area as the gradient comes to a halt at this point.

Figure 12. Flat (saddle point marked with black dot) region in a nonconvex function
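The cascading effect described in Section 5.1 can be reproduced with a toy chain of single-unit layers, where the gradient of the output with respect to the input is a product of one factor per layer. The sketch below is a simplified illustration: the scalar chain, the function name `input_gradient` and the chosen weights are assumptions, not a model of a real network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, weights, activation):
    """d(output)/d(input) for a chain of one-unit layers, via the chain rule."""
    grad = 1.0
    a = x
    for w in weights:
        z = w * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)          # sigmoid derivative, at most 0.25
        else:                              # "relu"
            a = max(0.0, z)
            local = 1.0 if z > 0 else 0.0  # ReLU derivative
        grad *= w * local                  # one factor per layer
    return grad
```

With ten sigmoid layers and unit weights the gradient is bounded by 0.25 raised to the tenth power, while the same chain with ReLU keeps a gradient of 1 for a positive input. Weights larger than 1 with ReLU make the product grow instead, which corresponds to the exploding case.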
5.4. Steep Edges
Steep edges are another section of the optimization surface area where the steep gradient could cause the gradient descent-based weight updates to overshoot and miss a potential global minimum.

5.5. Training Time
Training time is an important factor to gauge the efficiency of an algorithm. It is not uncommon for graduate students to train their model for days or weeks in the computer lab. Most models require an exorbitant amount of time and large datasets to train. Oftentimes many of the samples from the datasets do not add value to the training process and in some cases, they introduce noise and adversely affect the training.

5.6. Overfitting
As we add more neurons to a DNN, it can undoubtedly model the network for more complex problems. A DNN can lend itself to high conformability to training data. But there is also a high risk of overfitting to the outliers and noise in the training data, as shown in Figure 13. This can result in delayed training and testing times and in lower quality prediction on the actual test data. E.g., in classification or clustering problems, overfitting can create a high order polynomial output that separates the decision boundary for the training set, which will take longer and result in degraded results for most test data sets. One way to overcome overfitting is to choose the number of neurons in the hidden layer wisely to match the problem size and type. There are some algorithms that can be used to approximate the appropriate number of neurons, but there is no magic bullet and the best bet is to experiment on each use case to get an optimal value.

Figure 13. Overfitting in Classification

6. Optimization of Training Algorithms
The goal of the DNN is to improve the accuracy of the model on test data. Training algorithms aim to achieve the end goal by reducing the cost function. The common root cause of three out of five shortcomings mentioned above is primarily that the training algorithms assume the problem area to be a convex function. The other problem is the high number of nodes and the sheer number of possible combinations of weight values they can have. While weights are learned by training on the dataset, there are additional crucial parameters, referred to as hyperparameters, that aren't directly learnt from the training dataset. These hyperparameters can take a range of values and add to the complexity of finding the optimal architecture and model. There is significant room for improvement to the standard training algorithms. Here are some of the popular ways to enhance the accuracy of DNNs.

6.1. Parameter Initialization Techniques
Since the solution space is so huge, the initial parameters have an outsized influence on how fast or slow the training converges, whether it converges at all, or whether it prematurely converges to a suboptimal point. Initialization strategies tend to be heuristic in nature. [50] proposed normalized initialization, where weights are initialized in the following manner.

$$W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right] \tag{34}$$

[51] proposed another technique called sparse initialization, where the number of non-zero incoming weights was capped at a certain limit, causing them to retain high diversity and reducing the chances of saturation.

6.2. Hyperparameter Optimization
The learning rate and regularization parameters constitute the commonly used hyperparameters in DNNs. The learning rate determines the rate at which the weights are updated. The purpose of regularization is to prevent overfitting, and the regularization parameter affects the degree of influence on the loss function. CNNs have additional hyperparameters, i.e., the number of filters, filter shapes, number of dropouts and max pooling shapes at each convolution layer, and the number of nodes in the fully connected layer. These parameters are very important for training and modeling a DNN. Coming up with an optimal set of parameter values is a challenging feat. Exhaustively iterating through each combination of hyperparameter values is computationally very expensive. For example, if training and evaluating a DNN with the full dataset takes ten minutes, then seven hyperparameters, each with eight potential values, will take ($8^7$ x 10 min), i.e., 20,971,520 minutes or almost 40 years, to exhaustively train and evaluate the network on all combinations of the hyperparameter values. Hyperparameters can be optimized with different metaheuristics. Metaheuristics are nature inspired guiding principles that can help in traversing the search space more intelligently, and much faster, than the exhaustive method.
Particle Swarm Optimization (PSO) is one type of metaheuristic that can be used for hyperparameter optimization. PSO is modeled on how birds fly
around in search of food or during migration. The velocity and location of birds (or particles) are adjusted to steer the swarm towards a better solution in the vast search space. Escalante et al. used PSO for hyperparameter optimization to build a competitive model that ranked among the top relative to other comparable methods [52].

Figure 14a. Genetic Algorithm [53]

Genetic algorithm (GA) is a metaheuristic that is commonly used to solve combinatorial optimization problems. It mimics the selection and crossover processes of species reproduction and how they contribute to the evolution and improvement of the species' prospects of survival. Figure 14a shows a high-level diagram of the GA. Figure 14b illustrates the crossover process, where parts of the respective genetic sequences of both parents are merged to form the new genetic sequence in the children. The goal is to find a population member (a sequence of numbers resembling DNA nucleotides) that meets the fitness requirement. Each population member represents a potential solution. Population members are selected based on different methods, e.g., elite, roulette, rank and tournament.

Figure 14b. Crossover in Genetic Algorithm

The elite method ranks population members by fitness and only uses high fitness members for the crossover process. The mutation process then makes random changes to the number sequence, and the entire process continues until a desired fitness or the maximum number of iterations is reached. [53, 54] propose parallelization and hybridization of GA to achieve better and faster results. Parallelization provides both speedup and better results, as we can periodically exchange population members between the distributed and parallel operations of genetic algorithms on different sets of population members. Hybridization is the process of mixing the primary algorithm (GA in this case) with other operations, like local search. Shrestha and Mahmood [53] incorporated the 2-Opt local search method into GA to improve the search for the optimal solution. [55] postulates that correctly performed exchanges (e.g., in GA) breed innovation and result in the creation of solutions to hard problems, just as collaboration and exchanges between individuals, organizations and societies do in real life. In addition to GA, other variations of evolution-based metaheuristics have also been used to evolve and optimize deep learning architectures and hyperparameters. E.g., [56] proposed the CoDeepNEAT framework based on a deep neuroevolution technique for finding an optimized architecture to match the task at hand.

6.3. Adaptive Learning Rates
Learning rates have a huge impact on training DNNs. They can speed up the training time, help navigate flat surfaces better and overcome pitfalls of non-convex functions. Adaptive learning rates allow us to change the learning rates for parameters in response to gradient and momentum. Several innovative methods have been proposed. [48] describes the following:

1. Delta-bar Algorithm
2. AdaGrad
3. RMSProp
4. Adam

In the Delta-bar algorithm, the learning rate of a parameter is increased if the partial derivative with respect to it keeps the same sign and decreased if the sign changes. AdaGrad
is more sophisticated [57] and prescribes an inversely proportional scaling of the learning rates to the square root of the cumulative squared gradient. AdaGrad is not effective for all DNN training. Since the change in the learning rate is a function of the historical gradient, AdaGrad becomes susceptible to premature convergence.
The RMSProp algorithm is a modification of the AdaGrad algorithm to make it effective in a nonconvex problem space. RMSProp replaces the summation of the squared gradient in AdaGrad with an exponentially decaying moving average of the squared gradient, effectively dropping the impact of the historical gradient [48]. Adam, which denotes adaptive moment estimation, is the latest evolution of the adaptive learning algorithms and integrates the ideas from AdaGrad, RMSProp and momentum [58]. Just like AdaGrad and RMSProp, Adam provides an individual learning rate for each parameter. Adam includes the benefits of both the earlier methods and does a better job of handling non-stationary objectives and both noisy and sparse gradient problems [58]. Adam uses the first moment (i.e., the mean) as well as the second moment (uncentered variance) of the gradients, utilizing exponential moving averages of the gradient and the squared gradient [58].

Figure 15. Multilayer network training cost on MNIST dataset using different adaptive learning algorithms [58]

Figure 15 shows the relative performance of the various adaptive learning rate mechanisms, where Adam outperforms the rest.

6.4. Batch Normalization
As the network is trained with variations to weights and parameters, the distribution of actual data inputs at each layer of the DNN changes too, often making them all too large or too small and thus making them difficult to train on, especially with activation functions that implement saturating nonlinearities, e.g., sigmoid and tanh functions. Ioffe and Szegedy [59] proposed the idea of batch normalization in 2015. It has made a huge difference in improving the training time and accuracy of DNNs. It updates the inputs to have a unit variance and zero mean at each mini-batch.

6.5. Supervised Pretraining
Supervised pretraining consists of breaking down complex problems into smaller parts, training the simpler models, and later combining them to solve the larger model. Greedy algorithms are commonly used in supervised pretraining of DNNs.

6.6. Dropout
There are a few commonly used methods to lower the risk of overfitting. In the dropout technique, we randomly choose units and nullify their weights and outputs so that they do not influence the forward pass or the backpropagation. Figure 16 shows a fully connected DNN on the left and a DNN with dropout on the right. The other methods include the use of regularization and simply enlarging the training dataset using label preserving techniques. Dropout works better than regularization to reduce the risk of overfitting and also speeds up the training process. [60] proposed the dropout technique and demonstrated significant improvement on supervised learning based DNNs for computer vision, computational biology, speech recognition and document classification problems.

Figure 16. DNN with and without Dropout

6.7. Training Speed up with Cloud and GPU processing
Training time is one of the key performance indicators of machine learning. Cloud computing and GPUs lend themselves very well to speeding up the training process. Cloud provides massive amounts of compute power, and now all major cloud vendors include GPU powered servers that can easily be provisioned and used for training DNNs on demand at competitive prices. Cloud vendor Amazon Web
Services' (AWS) P2 instances provide up to 40 thousand parallel GPU cores, and its P3 GPU instances are further optimized for machine learning [61].

6.8. Summary of DL Algorithms Shortcomings and Resolutions Techniques
Table 4 provides a summary of deep learning algorithm shortcomings and resolution techniques. The table also lists the causes and effects of the shortcomings.

TABLE 4. DL ALGORITHM SHORTCOMINGS & RESOLUTION TECHNIQUES

| Shortcomings | Cause | Effect | Optimizations to address Shortcomings |
|---|---|---|---|
| Vanishing and Exploding Gradients | Propagation of derivatives | Long training time; Overshoot global minima; Halts training | ReLu activation function; Weight initialization; Connection between the forget gate activations and the gradients computation in LSTM (RNN); Skipping connections (ResNet); Faster hardware (GPUs) |
| Local Minima / Flat regions | Gradient descent of non-convex problem space | Convergence into local minima | Adaptive learning rates; Parameter Re-initialization / initialization techniques |
| Steep Edges | Gradient descent of non-convex problem & steep spaces | Overshoot to miss global minima | Adaptive learning rates |
| Overfitting | Oversized nodes (network) relative to dataset; poor dataset | Poor accuracy of the model | Regularization; Dropout; Choosing correct size of network; Better training data |
| High Training Time | Large network (weights & hyperparameters), high dimensional data, others | Poor use of compute resources; wasted time | Adaptive learning rates; Using Cloud and GPUs |
| Hyperparameter selection | Extremely large solution space (NP-hard problem) for high dimensional problems | Convergence into local minima | Metaheuristics (e.g., Genetic Algorithm) for hyperparameter optimization; Random search; Sequential Model-based Algorithm Configuration |

7. Architectures & Algorithms – Implementations
This section describes different implementations of neural networks using a variety of training methods, network architectures and models. It also includes models and ideas that have been incorporated into machine learning in general.

7.1. Deep Residual Learning
The ability to add more layers to DNNs has allowed us to solve harder problems. Microsoft Research Asia (MSRA) applied 100- and 1000-layer deep residual networks (ResNet) on the CIFAR-10 dataset and won 1st place in the ILSVRC 2015 competition with a 152-layer DNN on the ImageNet dataset [62]. Figure 17 demonstrates a simplified version of Microsoft's winning deep residual learning model. Despite the depth of these networks, simply adding more layers to a DNN does not improve or guarantee results. To the contrary, it can degrade the quality of the solution. This makes training DNNs not so straightforward. The MSRA team was able to overcome the degradation by making the stacked layers match a residual mapping, instead of hoping that they match the desired mapping directly, with the following function [62]:

$$F(x) := H(x) - x \tag{30}$$

where $F(x)$ is the residual mapping and $H(x)$ is the desired mapping, and then by recasting the desired mapping at the end as $F(x) + x$ [62]. According to the MSRA team, it is much easier to optimize the residual mapping.

Figure 17. Deep Residual Learning model by MSRA at Microsoft

7.2. Oddball Stochastic gradient descent
All training data are not created equal. Some examples will have a higher training error than others. Yet, we assume that they are the same and thus use each training example the same number of times. Andrew Simpson [63] argues that this assumption is invalid and makes a case in his paper for the number of times a training example is used to be proportional to its respective training error. So, if a training example has a higher error rate, it will be used to train the network a higher number of times than the other training examples. Andrew Simpson [63] demonstrates his methodology, termed Oddball Stochastic Gradient Descent, with a training set of 1000 video frames. Simpson [63] created a training selection probability distribution for the training examples based on the error value and pegged the frequency of using each training example to the distribution.

7.3. Deep belief Network
Xue-wen Chen et al. [23] highlight the fact that conventional neural networks can easily get stuck in local
VOLUME XX, 2017 17
minima when the function is non-convex. They propose a DNN architecture called large scale deep belief network (DBN) that uses both labeled and unlabeled data to learn feature representations. DBNs are made up of layers of RBMs stacked together and learn the probability distribution of the input vectors. They employ unsupervised pre-training and fine-tuned supervised algorithms and techniques to mitigate the risk of getting trapped in local minima. Below is the equation [23] for the change in weights, where c is the momentum factor, α is the learning rate, and v and h are visible and hidden units respectively.

$$\Delta w_{ij}(t+1) = c\,\Delta w_{ij}(t) + \alpha\left(\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}\right) \tag{35}$$

Equations [23] for the probability distributions of the hidden and visible units:

$$p(h_j = 1 \mid v; W) = \sigma\left(\sum_{i=1}^{I} w_{ij} v_i + a_j\right) \tag{36}$$

$$p(v_i = 1 \mid h; W) = \sigma\left(\sum_{j=1}^{J} w_{ij} h_j + b_i\right) \tag{37}$$

7.4. Big data
Big data provides tremendous opportunity and challenge for deep learning. Big data is known for the 4 Vs (volume, velocity, veracity, variety). Unlike shallow networks, DNNs can handle the huge volume and variety of data, which significantly improves the training process and the ability to fit more complex models. On the flip side, the sheer velocity of data that is generated in real time can be daunting to process. Maryam M Najafabadi et al. [47] raise similar challenges in learning from real-time streaming data, such as monitoring credit card usage for fraud detection. They propose using parallel and distributed processing with thousands of CPU cores. In addition, we should also use cloud providers that support auto-scaling based on usage and workload. Not all data is of the same quality. In the case of computer vision, images from constrained sources, e.g., studios, are much easier to recognize than the ones from unconstrained sources like surveillance cameras. [64] proposes a method to utilize multiple images of the unconstrained source to enhance the recognition process.
Deep learning can help mine and extract useful patterns from big data and build models for inference, prediction and business decision making. Massive volumes of structured and unstructured data and media files are being generated today, which makes information retrieval very challenging. Deep learning can help with semantic indexing to make information more readily accessible in search engines [14, 65]. This involves building models that provide relationships between documents and the keywords they contain to make information retrieval more effective.

7.5. Generative top down connection (generative model)
Much of the training is usually implemented with a bottom-up approach, where discriminatory or recognition models are developed using backpropagation. A bottom-up model is one that takes the vector representation of input objects and computes higher level feature representations at subsequent layers, with a final discrimination or recognition pattern at the output layer. One of the shortcomings of backpropagation is that it requires labeled data to train. Geoffrey Hinton proposed a novel way of overcoming this limitation in 2007 [66]. He proposed a multi-layer DNN that used generative top-down connections as opposed to bottom-up connections to mimic the way we generate visual imagery in our dreams without the actual sensory input. In a top-down generative connection, the high-level data representations or the outputs of the network are used to generate the low-level raw vector representations of the original inputs, one layer at a time. The layers of feature representations learned with this approach can then be further perfected either in generative models such as auto-encoders or even standard recognition models [66].
In the generative model in Figure 18, since the correct upstream cause of the events in each layer is known, a comparison between the actual cause and the prediction made by the approximate inference procedure can be made, and the recognition weights $r_{ij}$ can be adjusted to increase the probability of correct prediction.

Figure 18. Learning multiple layers of representation

Here is the equation [66] for adjusting the recognition weights $r_{ij}$:

$$\Delta r_{ij} \propto h_i \left(h_j - \sigma\left(\sum_{i} h_i r_{ij}\right)\right) \tag{38}$$
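The RBM relations in Equations (35) to (37) of Section 7.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [23]: the function names, the toy 3x2 weight matrix, and the choice to approximate the model term of Equation (35) with a single reconstruction step using probabilities instead of sampled states are all illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden(v, W, a):
    """Eq. (36): p(h_j = 1 | v) for each hidden unit j; W[i][j] links v_i to h_j."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]

def p_visible(h, W, b):
    """Eq. (37): p(v_i = 1 | h) for each visible unit i."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]

def weight_update(prev_dw, v_data, W, a, b, c=0.5, alpha=0.1):
    """Eq. (35): momentum term plus learning rate times (data term - model term).

    Assumption: the model term is estimated from one reconstruction step.
    """
    h_data = p_hidden(v_data, W, a)
    v_model = p_visible(h_data, W, b)
    h_model = p_hidden(v_model, W, a)
    return [[c * prev_dw[i][j]
             + alpha * (v_data[i] * h_data[j] - v_model[i] * h_model[j])
             for j in range(len(a))]
            for i in range(len(b))]

# Toy setup: 3 visible units, 2 hidden units, all parameters zero.
W = [[0.0, 0.0] for _ in range(3)]
a = [0.0, 0.0]
b = [0.0, 0.0, 0.0]
previous_dw = [[0.0, 0.0] for _ in range(3)]
dw = weight_update(previous_dw, [1, 0, 1], W, a, b)
```

With all parameters at zero, every conditional probability is 0.5, so weights attached to active visible units receive a positive update and weights attached to the inactive unit receive a negative one.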
7.6. Pre-training with unsupervised Deep Boltzmann Machines
The vast majority of DNN training is based on supervised learning. In real life, our learning is based on both supervised and unsupervised learning. In fact, most of our learning is unsupervised. Unsupervised learning is more relevant in today’s age of big data analytics because most raw data is unlabeled and un-categorized [47]. One way to overcome the limitation of backpropagation, where it gets stuck in local minima, is to incorporate both supervised and unsupervised training. Top-down generative unsupervised learning is good for generalization because it essentially adjusts the weights by trying to match or recreate the input data one layer at a time [67]. After this effective unsupervised pre-training, we can always fine-tune the network with some labeled data. Geoffrey Hinton and Ruslan Salakhutdinov describe multiple layers of RBMs that are stacked together and trained layer by layer in a greedy, unsupervised way, essentially creating what is called the Deep Belief Network. They further modify the stacks to make them un-directed models with symmetric weights, thus creating the Deep Boltzmann Machine (DBM). A four-layer deep belief network and a four-layer deep Boltzmann machine are shown in Figure 19. In [67] the DBM layers were pre-trained one at a time using an unsupervised method and then tweaked using supervised backpropagation on the MNIST and NORB datasets as shown in Figure 20. They [67] obtained favorable results, validating the benefits of combining supervised and unsupervised learning methods.

Figure 19. Four-layer DBN & four-layer Deep Boltzmann Machine

Figure 20. Pretraining of stacked & altered RBM to create a DBM [67]

Here are the equations showing the probability distributions over the visible and two hidden layers in the DBM (after unsupervised pre-training) [67]:

$$p(v_i = 1 \mid h^1) = \sigma\left(\sum_{j} W^{1}_{ij} h^{1}_{j}\right) \tag{39}$$

$$p(h^{2}_{m} = 1 \mid h^1) = \sigma\left(\sum_{j} W^{2}_{jm} h^{1}_{j}\right) \tag{40}$$

$$p(h^{1}_{j} = 1 \mid v, h^2) = \sigma\left(\sum_{i} W^{1}_{ij} v_i + \sum_{m} W^{2}_{jm} h^{2}_{m}\right) \tag{41}$$

After unsupervised pre-training, the DBM is converted into a deterministic multi-layer neural network by fine-tuning the network with supervised learning using labeled data, as demonstrated in Figure 21. The approximate posterior distribution $q(h \mid v)$ is generated for each input vector and the marginals $q(h^{2}_{j} = 1 \mid v)$ are added as an additional input for the network as shown in Figure 21. Subsequently, backpropagation is used to fine-tune the network [67].

Figure 21. DBM getting initialized as deterministic neural network with supervised fine-tuning [67]
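Equation (41) is what distinguishes the DBM from a stack of directed layers: the first hidden layer receives input from the layer below and from the layer above at the same time. The following Python sketch evaluates that conditional for a toy network. The function name, the nested-list weight layout (`W1[i][j]` for visible-to-hidden, `W2[j][m]` for hidden-to-hidden) and the numbers are illustrative assumptions, not values from [67].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_h1_given_v_h2(v, h2, W1, W2):
    """Eq. (41): p(h1_j = 1 | v, h2) combines bottom-up and top-down input."""
    n_h1 = len(W2)
    return [
        sigmoid(
            sum(W1[i][j] * v[i] for i in range(len(v)))      # bottom-up from v
            + sum(W2[j][m] * h2[m] for m in range(len(h2)))  # top-down from h2
        )
        for j in range(n_h1)
    ]

# Toy DBM: 2 visible units, 2 units in h1, 1 unit in h2 (hypothetical weights).
W1 = [[1.0, -1.0],
      [0.5, 0.5]]
W2 = [[2.0],
      [1.0]]
probs = p_h1_given_v_h2(v=[1, 0], h2=[1], W1=W1, W2=W2)
```

For the first unit the bottom-up input (1.0) and the top-down input (2.0) add up, while for the second unit they cancel and the probability falls back to 0.5.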
7.7. Extreme Learning Machine (ELM)
There have been other variations of learning methodologies. While more layers allow us to extract more complex features and patterns, some problems might be solved faster and better with fewer layers. [68] proposed a four-layered CNN termed DeepBox that outperformed larger networks in speed and accuracy for evaluating objectness. ELM is another type of neural network with just one hidden layer. Linear models are learnt from the dataset in a single iteration by adjusting the weights between the hidden layer and the output, whereas the weights between the input and the hidden layers are randomly initialized and fixed [69].
ELM can converge much faster than backpropagation, but it can only be applied to simpler problems of classification and regression. Since proposing ELM in 2006, Guang-Bin Huang et al. came up with a multilayer version of ELM in 2016 [70] to take on more complex problems. They combined unsupervised multilayer encoding with the random initialization of the weights and demonstrated faster convergence, i.e., lower training time, than the state-of-the-art multilayer perceptron training algorithm.

7.8. Multiobjective Sparse Feature Learning Model
Maoguo Gong et al. [71] developed a multi-objective sparse feature learning (MO-SFL) model based on the auto encoder, where they used an evolutionary algorithm to optimize two competing objectives: the sparsity of hidden units and the reconstruction error (of the input vector of the AE). It fares better than models where the sparsity is determined by human intervention or less than optimal methods.
Since the time complexity of evolutionary algorithms is high, they [71] utilize self-adaptive multi-objective differential evolution (DE) based on decomposition (Sa-MODE/D) to cut down on time. By testing with the MNIST dataset and comparing the results with other implementations, they demonstrate that it has better results than the standard AE (auto encoder), SR-RBM (sparse response RBM) and SESM (sparse encoding symmetric machine). Their learning procedure continuously iterates between an evolutionary optimization step and stochastic gradient descent to optimize the reconstruction error [71].
• Step 1: Multi-objective optimization to select the most optimal point in the pareto frontier for both objectives
• Step 2: Optimize parameters θ and θ’ with stochastic gradient descent in the following reconstruction error function (of Auto Encoder), where D is the training data set and L(x, y) is the loss function with x representing the input and y representing the output, i.e., reconstructed input.

$$\sum_{x \in D} L\left(x, g_{\theta'}(f_{\theta}(x))\right) \tag{42}$$

Figure 22 shows a pareto frontier that can be used to achieve a compromise between two competing objective functions.

Figure 22. Pareto Frontier

7.9. Multiclass Semi-Supervised Learning Based on Kernel Spectral Clustering
Mehrkanoon et al. [72] proposed a multiclass learning algorithm based on Kernel Spectral Clustering (KSC) using both labeled and unlabeled data. The novelty of their proposal is the introduction of regularization terms added to the cost function of KSC, which allow labels or membership to be applied to unlabeled data examples. It is achieved in the following way [72]:
• Unsupervised learning based on kernel spectral clustering (KSC) is used as the core model
• A regularization term is introduced and labels (from labeled data) are added to the model

Figure 23. Spectral Clustering Representation
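To make the pareto frontier used in Step 1 of Section 7.8 concrete, the sketch below filters a set of candidate solutions down to the non-dominated ones when two objectives are minimized. It is a generic illustration, not the Sa-MODE/D procedure of [71]: the candidate values (a hypothetical sparsity penalty paired with a hypothetical reconstruction error) and the rule of picking the point with the smallest objective sum are assumptions made only for this example.

```python
def pareto_front(points):
    """Return the points that no other point dominates (both objectives minimized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical (sparsity penalty, reconstruction error) pairs.
candidates = [(0.10, 0.90), (0.20, 0.50), (0.30, 0.55),
              (0.40, 0.20), (0.60, 0.25), (0.80, 0.10)]
front = pareto_front(candidates)

# One simple (assumed) compromise rule: smallest sum of the two objectives.
compromise = min(front, key=lambda p: p[0] + p[1])
```

Points such as (0.30, 0.55) are dropped because another candidate is at least as good on both objectives; the remaining points trace the trade-off curve from which a compromise is chosen.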
Figure 23 illustrates data points in a spectral clustering representation. Spectral clustering (SC) is an algorithm that divides the data points in a graph using the Laplacian or double derivative operation, whereas KSC is simply an extension of SC that uses the Least Squares Support Vector Machines methodology [73].
Since unlabeled data is more abundantly available relative to labeled data, it would be beneficial to make the most of it with unsupervised or, in this case, semi-supervised learning.

7.10. Very Deep Convolutional Networks for Natural Language Processing
Deep CNNs have mostly been used in computer vision, where they are very effective. Conneau et al. [74] applied them to NLP for the first time, with up to 29 convolution layers. The goal is to analyze and extract layers of hierarchical representations from words and sentences at the syntactic, semantic and contextual level. One of the major reasons for the lack of earlier deep CNNs for NLP is that deeper networks tend to cause saturation and degradation of accuracy. This is in addition to the processing overhead of more layers. Kaiming He et al. [62] state that the degradation is not caused by overfitting but arises because deeper systems are difficult to optimize. [62] addressed this issue with shortcut connections between the convolution blocks to let the gradients propagate more freely, and they, along with [74], were able to validate the benefits of the shortcuts with 10/101/152-layers and 49 layers respectively. The architecture of Conneau et al. [74] consists of a series of convolution blocks separated by pooling that halves the resolution, followed by k-max pooling and classification at the end.

7.11. Metaheuristics
Metaheuristics can be used to train neural networks to overcome the limitation of backpropagation-based learning. When implementing metaheuristics as a training algorithm, each weight of the neural network connection is represented by a dimension in the multi-dimensional solution search space of the problem we are trying to solve. The goal is to come as near as possible to the optimal values of weights, i.e., a location in the search space that represents the global best solution. Particle Swarm Optimization (PSO) is a type of metaheuristic inspired by the movement of birds in the sky. It consists of particles, or candidate solutions, that move about in a search space to reach a near-optimal solution. In their paper [75], N. Krpan and D. Jakobovic ran parallel implementations using backpropagation and PSO. Their results demonstrate that while parallelization improves the efficacy of both algorithms, parallel backpropagation is efficient only on large networks, whereas parallel PSO has wider influence on various sizes of problems.
Similarly, W. Dong and M. Zhou [76] complemented PSO with a supervised learning control module to guide the search for global minima of an optimization problem. The supervised learning module provided real-time feedback with back diffusion (BD) to retain diversity and social attractor renewal to overcome stagnation [76].
Metaheuristics take high-level guidance inspired by nature and apply it to solve mathematical problems. In a similar way, [77] proposes incorporating the concepts of intelligent teacher and privileged information, which is essentially extra information available during training but not during evaluation or testing, into the DNN training process.

7.12. Genetic Algorithm
Genetic Algorithm (GA) is a metaheuristic that can be effectively used in training DNNs. GA mimics the evolutionary processes of selection, crossover and mutation. Each population member represents a possible solution with a set of weights. Unlike PSO, which includes only one operator for adjusting the solution, evolutionary algorithms like GA include various steps, i.e., selection, crossover and mutation methods [52]. Population members undergo several iterations of selection and crossover based on known strategies to achieve a better solution in the next iteration or generation. GA has undergone decades of improvement and refinement since it was first proposed in 1976 [78]. There are several ways to perform selection, e.g., elite, roulette, rank, tournament [79]. There are about a dozen ways to perform crossover described by Larrañaga et al. alone [80]. Selection methodologies represent exploration of the solution space and crossovers represent the exploitation of the selected solution candidates. The goal is to get better solutions through wider exploration and deeper exploitation. Additional tweaking can be introduced with mutation. Parallel clusters of GA can be executed independently in islands, with a few members exchanged between the islands every so often [81]. In addition, we can also utilize local search such as a greedy algorithm, Nearest Neighbor or K-opt algorithm to further improve the quality of the solution.
Lin et al. [82] demonstrated a successful incorporation of GA that resulted in better classification accuracy and performance of a Polynomial Neural Network. Standard GA operations including selection, crossover and mutation were used on parameters that included partial descriptions (PDs) of inputs in the first layer, bias and all input features [82]. GA was further enhanced with the incorporation of the concept of mitochondrial DNA (mtDNA). In evolution, it is evident from casual observation and simple reasoning that crossover of population members with too much similarity does not yield much variance in the offspring. Likewise, we can infer that in GA, selection and crossover between solutions that are very similar would not result in a high degree of exploration of the multi-dimensional solution space. In fact, it might run the risk of getting pigeonholed into a restricted pattern.
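The selection, crossover and mutation loop described in Section 7.12 can be sketched as follows. This is a minimal, generic GA under stated assumptions, not the method of any cited paper: the toy fitness function (squared distance of a three-weight vector from a hypothetical target), tournament selection, one-point crossover, Gaussian mutation, elitism, and every parameter value are illustrative choices.

```python
import random

TARGET = [0.5, -1.0, 2.0]  # hypothetical "ideal" weights for the toy problem

def fitness(weights):
    """Toy error to minimize: squared distance from TARGET."""
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))

def tournament(pop, rng, k=3):
    """Selection: the best of k randomly drawn members."""
    return min(rng.sample(pop, k), key=fitness)

def crossover(p1, p2, rng):
    """One-point crossover between two parents."""
    cut = rng.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:]

def mutate(child, rng, rate=0.2, scale=0.3):
    """Mutation: perturb each weight with probability `rate`."""
    return [w + rng.gauss(0, scale) if rng.random() < rate else w for w in child]

def run_ga(generations=60, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-3, 3) for _ in TARGET] for _ in range(pop_size)]
    initial_best = min(map(fitness, pop))
    for _ in range(generations):
        elite = min(pop, key=fitness)  # elitism: the best member always survives
        children = [
            mutate(crossover(tournament(pop, rng), tournament(pop, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        pop = [elite] + children
    return initial_best, min(map(fitness, pop))

initial_best, final_best = run_ga()
```

Because the best member is carried over unchanged each generation, the best error can never get worse; the other operators supply the exploration and exploitation discussed above.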
Diversity is the key to overcoming the risk of getting stuck in local minima. This risk can be mitigated by exploiting the idea of mtDNA. mtDNA represents one percent of the human chromosomes [83]. The concept of incorporating mitochondrial DNA into GA was introduced by Shrestha and Mahmood [53]. They describe a way to restrict crossover between population members or solution candidates based on the proximity of their mtDNA values [53]. Unlike the remaining 99% of DNA, mtDNA is only inherited from the female, thus it is a more continuous marker of lineage or genetic proximity. The premise behind this is that offspring of population members with similar genetic makeup do not help with overcoming the local minima. Figure 24 describes the parallel and distributed nature of their full implementation [53] along with the GA operators (selection, mutation and mtDNA incorporated crossover). The training process is enhanced [53] with the implementation of a continental model, where distributed servers run multiple threads, each running an instance of GA with mtDNA. Population members are then exchanged between the servers after a fixed number of iterations as shown in Figure 24.
Figure 24. Continental Model with mtDNA [53]

7.13. Neural Machine Translation (NMT)
Neural Machine Translation is a turnkey solution used in the translation of sentences. While it provides some improvement over traditional statistical machine translation (SMT), it is not scalable for large models or datasets. It also requires a lot of computational power for training and translation, and has difficulty with rare words. For these reasons, large tech companies like Google and Microsoft have both improved on NMT and have their own implementations, labeled Google Neural Machine Translation (GNMT) and Skype Translator respectively. GNMT, shown in Figure 25, consists of encoder and decoder LSTM blocks organized in layers and was presented in 2016 in [84]. It overcomes the shortcomings of NMT with an enhanced deep LSTM neural network that includes 8 encoder and 8 decoder layers, and a method to break down rare, difficult words to infer their meaning. On the Conference on Machine Translation benchmarks from 2014, GNMT achieved results on par with the state of the art for English-to-French and English-to-German [84].
Figure 25. GNMT Architecture [84] with encoder neural network on the left and decoder neural network on the right.

7.14. Multi-Instance Multi-Label Learning
Images in real life include multiple instances (objects) and need multiple labels to describe them. E.g., a picture of an office space could include a laptop computer, a desk, a cubicle and a person typing on the computer. Zhang et al. [85] proposed the MIML (Multi-Instance Multi-Label learning) framework and the corresponding MIMLBOOST and MIMLSVM algorithms for efficient learning of individual object labels in complex high level concepts, e.g., the office space. The goal is to learn $f: 2^{X} \to 2^{Y}$ from the dataset $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_m, Y_m)\}$, where $X_i \subseteq X$ represents a set of instances $\{x_{i1}, x_{i2}, \ldots, x_{i,n_i}\}$, $x_{ij} \in X$ $(j = 1, 2, \ldots, n_i)$, and $Y_i \subseteq Y$ represents a set of labels $\{y_{i1}, y_{i2}, \ldots, y_{i,l_i}\}$, $y_{ik} \in Y$ $(k = 1, 2, \ldots, l_i)$, where $n_i$ is the number of instances in $X_i$ and $l_i$ is the number of labels in $Y_i$ [85].
MIMLBOOST uses category-wise decomposition into traditional single instance & single label supervised learning, whereas MIMLSVM utilizes cluster-based feature transformation. So, instead of trying to learn the idea of complex entities (e.g., office space), [85] took the alternate route and learned the lower level individual objects and inferred the higher level concepts.

7.15. Adversarial Training
Machine learning training and deployment used to be done on isolated computers, but now they are increasingly being done in highly interconnected commercial production environments. Take a face recognition system where a network could be trained on a fleet of servers with a training dataset imported from an external data source, and the trained model could be deployed on another server which accepts API calls with real-time inputs (e.g., images of people entering a building) and responds with matches. The interconnected architecture exposes the machine learning system to a wide attack surface. The real-time input or the training dataset can be manipulated by an adversary to compromise the output (image match by the network) or the entire model respectively.
Adversarial machine learning is a relatively new field of research that takes into account these new threats to machine learning. According to [86], adversaries (e.g., email spammers) can exploit the lack of a stationary data distribution and manipulate the input (e.g., an actual spam email) so that it passes as a normal email. [86] demonstrates these and other vulnerabilities and discusses how application domain, features and data distribution can be used to reduce the risk and impact of such adversarial attacks.

7.16. Gaussian Mixture Model
Gaussian mixture model (GMM) is a statistical probabilistic model used to represent multiple normal (gaussian) distributions within a larger distribution using an EM (expectation maximization) algorithm in an unsupervised setting. E.g., a GMM could be used to represent the height distribution for a large population group with two gaussian distributions, for male and female sub-groups. Figure 26 below demonstrates a GMM with three gaussian distributions within itself.
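The height example can be written down directly, since a mixture density is a weighted sum of Gaussian densities. The sketch below is a one-dimensional illustration under stated assumptions: the component weights, means and standard deviations are made-up numbers rather than measured data, and the `responsibilities` helper only shows the per-component posterior that an EM procedure would compute in its expectation step, not a full fitting routine.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def gmm_pdf(x, components):
    """Mixture density; components is a list of (weight, mean, std)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)

def responsibilities(x, components):
    """Posterior probability that x was generated by each component."""
    parts = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(parts)
    return [p / total for p in parts]

# Hypothetical two-component height mixture in centimetres.
heights = [(0.5, 162.0, 7.0), (0.5, 176.0, 7.5)]

density_at_170 = gmm_pdf(170.0, heights)
posterior_at_185 = responsibilities(185.0, heights)
```

A tall observation such as 185 cm is attributed mostly to the component with the larger mean, which is the kind of soft assignment that makes the sub-groups recoverable without labels.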
Figure 26. GMM example with three components

GMM has been used primarily in speech recognition and tracking objects in video sequences. GMMs are very effective in extracting speech features and modeling the probability density function to a desired level of accuracy as long as we have sufficient components, and expectation maximization makes it easy to fit the model [87]. The probability density function for the GMM is given by the following [87]:

$$p(x) = \sum_{m=1}^{M} c_m \,\mathcal{N}(x; \mu_m, \Sigma_m), \quad (c_m > 0) \tag{43}$$

where $M$ is the number of gaussian components, $c_m$ is the weight of the $m$-th gaussian, and $\mathcal{N}(x; \mu_m, \Sigma_m)$ is the gaussian density of the random variable $x$ with mean vector $\mu_m$ and covariance matrix $\Sigma_m$.

7.17. Siamese Networks
The purpose of a siamese network is to determine the degree of similarity between two images. As shown in Figure 27, a siamese network consists of two identical CNN networks with identical weights and parameters. The two images to be compared are passed separately through the two twin CNNs and the respective output vector representations are evaluated using the contrastive loss function. The function is defined as follows [88]:

$$L(W, Y, \vec{X}_1, \vec{X}_2) = (1 - Y)\,\frac{1}{2}(D_w)^2 + (Y)\,\frac{1}{2}\left(\max(0, m - D_w)\right)^2 \tag{44}$$

$D_w$ represents the Euclidean distance between the two output vectors as shown in Figure 27. The label $Y$ is either 1 (indicates images are not the same) or 0 (indicates images are the same). $m$ represents a margin value greater than 0. The idea of siamese networks has been extended to come up with triplet networks, which include three identical networks and are used to assess the similarity of a given image with two other images.
Since the softmax layer outputs must match the number of classes, a standard CNN becomes impractical for problems that have a large number of classes. This issue doesn’t apply to siamese networks, as the number of outputs of the softmax in the twin networks is not required to match the number of classes [89]. This ability to scale to many more classes for classification extends the use of siamese networks beyond what a traditional CNN is used for. Siamese networks can be used for handwritten check recognition, signature verification, text similarity, etc.

Figure 27. Siamese network

7.18. Variational Autoencoders
As the name suggests, variational autoencoders (VAEs) are a type of autoencoder and consist of encoder and decoder parts as shown in Figure 28. They fall under the generative model class of neural networks and are used in unsupervised learning. VAEs learn a low dimensional representation (latent variable) that models the original high dimensional dataset as a gaussian distribution. The Kullback–Leibler (KL) divergence is a good way to compare distributions. Therefore, the loss function in a VAE is a combination of cross entropy (or mean squared error) to minimize the reconstruction error and KL divergence to make the compressed latent variable follow a gaussian distribution. We then sample from the probability distribution to generate new dataset samples that are representative of the original dataset. VAEs have found various applications, ranging from generating images in video games to de-noising pictures.

Figure 28. Variational Autoencoder

In Figure 28, $x$ is the input and $z$ is the encoded output (latent variable). $P(x)$ represents the distribution associated with $x$. $P(z)$ represents the distribution associated with $z$. The goal is to infer $P(z)$ based on $P(z|x)$ that follows a
certain distribution. The mathematical derivation for VAEs was originally proposed in [90]. Suppose we wanted to infer $P(z|x)$ based on some $Q(z|x)$, then we can try to minimize the KL divergence between the two:

$$D_{KL}\left[Q(z|x)\,\|\,P(z|x)\right] = \sum_{z} Q(z|x) \log\left[\frac{Q(z|x)}{P(z|x)}\right] \tag{45}$$

$$= E\left[\log\left[\frac{Q(z|x)}{P(z|x)}\right]\right] \tag{46}$$

$$= E\left[\log Q(z|x) - \log P(z|x)\right] \tag{47}$$

where $D_{KL}$ is the Kullback–Leibler (KL) divergence and $E$ represents expectation.
Using Bayes’ rule:

$$P(z|x) = \frac{P(x|z)P(z)}{P(x)} \tag{48}$$

$$D_{KL}\left[Q(z|x)\,\|\,P(z|x)\right] = E\left[\log Q(z|x) - \log \frac{P(x|z)P(z)}{P(x)}\right] \tag{49}$$

$$= E\left[\log Q(z|x) - \log P(x|z) - \log P(z)\right] + \log P(x) \tag{50}$$

To allow us to easily sample $P(z)$ and generate new data, we set $P(z)$ to the normal distribution, i.e., $N(0,1)$. If $Q(z|x)$ is represented as gaussian with parameters $\mu(x)$ and $\Sigma(x)$, then the KL divergence between $Q(z|x)$ and $P(z)$ can be derived in closed form as:

$$D_{KL}\left[N(\mu(x), \Sigma(x))\,\|\,N(0,1)\right] = \frac{1}{2} \sum_{k} \left(\exp(\Sigma(x)) + \mu^{2}(x) - 1 - \Sigma(x)\right) \tag{51}$$

7.19. Deep Reinforcement Learning
The primary idea of reinforcement learning is to make an agent learn from the environment with the help of random experimentation (exploration) and defined reward (exploitation). It consists of a finite number of states ($s_i$, representing agent and environment), actions ($a_i$) by the agent, the probability ($P_a$) of moving from one state to another based on action $a_i$, and the reward $R_a(s_i, s_{i+1})$ associated with moving to the next state with action $a$. The goal is to balance and maximize the current reward ($R$) and future reward ($\gamma \cdot \max[Q(s', a')]$) by predicting the best action as defined by the function $Q(s, a)$. $\gamma$ in the equation represents a fixed discount factor. $Q(s, a)$ is represented as the summation of the current reward ($R$) and the future reward ($\gamma \cdot \max[Q(s', a')]$) as shown below.

$$Q(s, a) = R + \gamma \cdot \max\left[Q(s', a')\right] \tag{52}$$

Reinforcement learning is specifically suited for problems that consist of both short-term and long-term rewards, e.g., games like chess, Go, etc. AlphaGo, Google’s program that beat the human Go champion, also uses reinforcement learning [91]. When we combine deep network architecture with reinforcement learning, we get deep reinforcement learning (DRL), which can extend the use of reinforcement learning to even more complex games and areas such as robotics, smart grids, healthcare, finance etc. [92]. With DRL, problems that were intractable with reinforcement learning can now be solved with a higher number of hidden layers of deep networks and a reinforcement learning based Q-learning algorithm that maximizes the reward for actions taken by the agent [13].

7.20. Generative Adversarial Network (GAN)
GANs consist of generative and discriminative neural networks. The generative network generates completely new (fake) data based on input data (unsupervised learning) and the discriminative network attempts to distinguish whether the data is real (from the training set) or generated. The generative network is trained to increase the probability of deceiving the discriminative network, i.e., to make the generated data indistinguishable from the original. GANs were proposed by Goodfellow et al. [93] in 2014. They have been very popular as they have many applications, both good and bad. E.g., [94] were able to successfully synthesize realistic images from text.

7.21. Multi-approach method for enhancing deep learning
Deep learning can be optimized in different areas. We discussed training algorithm enhancements, parallel processing, parameter optimizations and various architectures. All these areas can be simultaneously implemented in a framework to get the best results for specific problems. The training algorithms can be fine-tuned at different levels by incorporating heuristics, e.g., for hyperparameter optimization. The time to train a deep learning network model is a major factor to gauge the performance of an algorithm or network. Instead of training the network with the whole data set, we can pre-select a smaller but representative data set from the full training distribution set using instance selection methods [95] or Monte Carlo sampling [48]. An effective sampling method can help prevent overfitting, improve accuracy and speed up the learning process without compromising on the quality of the training dataset. Albelwi and Mahmood [96] designed a framework that combined dataset reduction, deconvolution network, correlation coefficient and an updated objective function. The Nelder-Mead method was used in optimizing the parameters of the objective function and the results were comparable to the latest known results on the MNIST dataset [96]. Thus, combining optimizations at multiple levels and
using multiple methods is a promising field of research and can lead to further advancement in machine learning.

8. Conclusion
In this tutorial, we provided a thorough overview of the neural networks and deep neural networks. We took a deeper dive into the well-known training algorithms and architectures. We highlighted their shortcomings, e.g., getting stuck in the local minima, overfitting and training time for large problem sets. We examined several state-of-the-art ways to overcome these challenges with different optimization methods. We investigated adaptive learning rates and hyperparameter optimization as effective methods to improve the accuracy of the network. We surveyed and reviewed several recent papers, studied them and presented their implementations and improvements to the training process. We also included tables to summarize the content in a concise manner. The tables provide a full view on how different aspects of deep learning are correlated.
Deep Learning is still in its nascent stage. There is tremendous opportunity for exploitation of current algorithms/architectures and further exploration of optimization methods to solve more complex problems. Training is currently constrained by overfitting, training time and is highly susceptible to getting stuck in local minima. If we can continue to overcome these challenges, deep learning networks will accelerate breakthroughs across all applications of machine learning and artificial intelligence.

5. Werbos, P.J., Beyond regression: new tools for prediction and analysis in the behavioral sciences. 1975.
6. LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. Nature, 2015. 521(7553): p. 436-44.
7. Jordan, M.I. and T.M. Mitchell, Machine learning: Trends, perspectives, and prospects. Science, 2015. 349(6245): p. 255-60.
8. Ng, A., Machine Learning Yearning: Technical Strategy for AI Engineers In the Era of Deep Learning. 2019: deeplearning.ai.
9. Metz, C., Turing Award Won by 3 Pioneers in Artificial Intelligence, in New York Times. 3/27/2019, The New York Times Company. p. B3.
10. Nagpal, K., et al., Development and Validation of a Deep Learning Algorithm for Improving Gleason Scoring of Prostate Cancer. CoRR, 2018. abs/1811.06497.
11. Nevo, S., et al., ML for Flood Forecasting at Scale. CoRR, 2019. abs/1901.09583.
12. Esteva, A., et al., Dermatologist-level classification of skin cancer with deep neural networks. Nature, 2017. 542: p. 115.
13. Arulkumaran, K., et al., Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine, 2017. 34(6): p. 26-38.
14. Gheisari, M., G. Wang, and M.Z.A. Bhuiyan. A Survey on Deep Learning in Big Data. in 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). 2017.
Conflicts of Interest: The authors declare no conflict of
15. Pouyanfar, S., et al., A Survey on Deep Learning:
interest. The founding sponsors had no role in the design of
Algorithms, Techniques, and Applications. ACM
the study; in the collection, analyses, or interpretation of Comput. Surv., 2018. 51(5): p. 1-36.
data; in the writing of the manuscript, and/or in the decision 16. Vargas, R., A. Mosavi, and R. Ruiz, Deep learning:
to publish the results. A review. 2017.
17. Buhmann, M.D. and M.D. Buhmann, Radial Basis
Functions. 2003: Cambridge University Press. 270.
ORCID: 18. Akinduko, A.A., E.M. Mirkes, and A.N. Gorban,
Ajay Shrestha: http://orcid.org/0000-0001-5595-5953 SOM: Stochastic initialization versus principal
components. Information Sciences, 2016. 364-365:
p. 213-221.
References 19. Chen, K., Deep and Modular Neural Networks, in
Springer Handbook of Computational Intelligence,
J. Kacprzyk and W. Pedrycz, Editors. 2015,
1. Rosenblatt, F., The perceptron: a probabilistic Springer Berlin Heidelberg: Berlin, Heidelberg. p.
model for information storage and organization in 473-494.
the brain. Psychol Rev, 1958. 65(6): p. 386-408. 20. Ng, A.Y. and M.I. Jordan, On discriminative vs.
2. Minsky, M. and S. Papert, Perceptrons; an generative classifiers: a comparison of logistic
introduction to computational geometry. 1969, regression and naive Bayes, in Proceedings of the
Cambridge, Mass.,: MIT Press. 258 p. 14th International Conference on Neural
3. Cybenko, G., Approximation by superpositions of a Information Processing Systems: Natural and
sigmoidal function. Mathematics of Control, Synthetic. 2001, MIT Press: Vancouver, British
Signals and Systems, 1989. 2(4): p. 303-314. Columbia, Canada. p. 841-848.
4. Hornik, K., Approximation capabilities of
multilayer feedforward networks. Neural Networks,
1991. 4(2): p. 251-257.
21. Bishop, C.M. and J. Lasserre. Generative or Discriminative? Getting the Best of Both Worlds. 2007.
22. Zhou, T., et al., Unsupervised Learning of Depth and Ego-Motion from Video. CoRR, 2017. abs/1704.07813.
23. Chen, X.W. and X. Lin, Big Data Deep Learning: Challenges and Perspectives. IEEE Access, 2014. 2: p. 514-525.
24. LeCun, Y., K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. in Proceedings of 2010 IEEE International Symposium on Circuits and Systems. 2010.
25. Gousios, G., et al., Lean GHTorrent: GitHub data on demand, in Proceedings of the 11th Working Conference on Mining Software Repositories. 2014, ACM: Hyderabad, India. p. 384-387.
26. AI-Index. Top deep learning Github repositories. AI Index 2019; Available from: https://github.com/mbadry1/Top-Deep-Learning.
27. Fernández-Delgado, M., et al., Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res., 2014. 15(1): p. 3133-3181.
28. LeCun, Y., et al., Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998: p. 2278-2324.
29. LeCun, Y. and Y. Bengio, Convolutional networks for images, speech, and time series, in The handbook of brain theory and neural networks, M.A. Arbib, Editor. 1998, MIT Press. p. 255-258.
30. Taylor, G.W., et al. Convolutional Learning of Spatio-temporal Features. in Computer Vision – ECCV 2010. 2010. Berlin, Heidelberg: Springer Berlin Heidelberg.
31. Ng, A. Convolutional Neural Network. Unsupervised Feature Learning and Deep Learning (UFLDL) Tutorial 2018 [cited 2018 7/21/2018]; Available from: http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/.
32. Schuler, C.J., et al. A Machine Learning Approach for Non-blind Image Deconvolution. in 2013 IEEE Conference on Computer Vision and Pattern Recognition. 2013.
33. Radford, A., L. Metz, and S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, 2015. abs/1511.06434.
34. Jolliffe, I.T., Principal component analysis. 2nd ed. Springer series in statistics. 2002, New York: Springer. xxix, 487 p.
35. Noda, K., et al. Multimodal integration learning of object manipulation behaviors using deep neural networks. in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2013.
36. Hinton, G.E. and R.R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks. Science, 2006. 313(5786): p. 504.
37. Wang, M., et al., Deep Learning-Based Model Reduction for Distributed Parameter Systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2016. 46(12): p. 1664-1674.
38. Ng, A. Autoencoders. Unsupervised Feature Learning and Deep Learning (UFLDL) Tutorial 2018 [cited 2018 7/21/2018]; Available from: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders.
39. Teh, Y.W. and G.E. Hinton, Rate-coded Restricted Boltzmann Machines for Face Recognition. 2001: p. 908-914.
40. Hinton, G.E., A Practical Guide to Training Restricted Boltzmann Machines, in Neural Networks: Tricks of the Trade: Second Edition, G. Montavon, G.B. Orr, and K.-R. Müller, Editors. 2012, Springer Berlin Heidelberg: Berlin, Heidelberg. p. 599-619.
41. Hochreiter, S. and J. Schmidhuber, Long Short-term Memory. Neural Computation, 1997. 9(8): p. 1735-80.
42. Metz, C., Apple is bringing the AI Revolution to your Phone, in Wired. 2016.
43. Gers, F.A., J. Schmidhuber, and F.A. Cummins, Learning to Forget: Continual Prediction with LSTM. Neural Computation, 2000. 12(10): p. 2451-2471.
44. Chung, J., et al., Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR, 2014. abs/1412.3555.
45. Cho, K., et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR, 2014. abs/1406.1078.
46. Naul, B., et al., A recurrent neural network for classification of unevenly sampled variable stars. Nature Astronomy, 2018. 2(2): p. 151-155.
47. Najafabadi, M.M., et al., Deep learning applications and challenges in big data analytics. Journal of Big Data, 2015. 2(1): p. 1.
48. Goodfellow, I., Y. Bengio, and A. Courville, Deep learning. Adaptive computation and machine learning. 2016, Cambridge, Massachusetts: The MIT Press. xxii, 775 pages.
49. Gavin, H.P., The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems. 2016.
50. Glorot, X. and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, PMLR. p. 249-256.
51. Martens, J., Deep learning via Hessian-free optimization, in Proceedings of the 27th International Conference on Machine Learning. 2010, Omnipress: Haifa, Israel. p. 735-742.
52. Escalante, H.J., M. Montes, and L.E. Sucar, Particle Swarm Model Selection. J. Mach. Learn. Res., 2009. 10: p. 405-440.
53. Shrestha, A. and A. Mahmood, Improving Genetic Algorithm with Fine-Tuned Crossover and Scaled Architecture. Journal of Mathematics, 2016. 2016: p. 10.
54. Sastry, K., D. Goldberg, and G. Kendall, Genetic Algorithms. 2005.
55. Goldberg, D.E., The design of innovation: Lessons from and for competent genetic algorithms. 2013: Springer, Boston, MA.
56. Miikkulainen, R., et al., Evolving Deep Neural Networks. CoRR, 2017. abs/1703.00548.
57. Duchi, J., E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 2011. 12: p. 2121-2159.
58. Kingma, D.P. and J. Ba, Adam: A Method for Stochastic Optimization. CoRR, 2014. abs/1412.6980.
59. Ioffe, S. and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR, 2015. abs/1502.03167.
60. Srivastava, N., et al., Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 2014. 15(1): p. 1929-1958.
61. Amazon Web Services. Amazon EC2 P2 & P3 Instances. Amazon EC2 Instance Types 2018 [cited 2018 7/21/2018]; Available from: https://aws.amazon.com/ec2/instance-types/p2/ & https://aws.amazon.com/ec2/instance-types/p3/
62. He, K., et al. Deep Residual Learning for Image Recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
63. Simpson, A.J.R., Uniform Learning in a Deep Neural Network via "Oddball" Stochastic Gradient Descent. CoRR, 2015. abs/1510.02442.
64. Best-Rowden, L., et al., Unconstrained Face Recognition: Identifying a Person of Interest From a Media Collection. IEEE Transactions on Information Forensics and Security, 2014. 9(12): p. 2144-2157.
65. Letsche, T.A. and M.W. Berry, Large-scale information retrieval with latent semantic indexing. Information Sciences - Informatics and Computer Science, Intelligent Systems, Applications: An International Journal, 1997. 100(1-4): p. 105-137.
66. Hinton, G.E., Learning multiple layers of representation. Trends in Cognitive Sciences, 2007. 11(10): p. 428-434.
67. Salakhutdinov, R. and G. Hinton, Deep Boltzmann Machines, in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, D. van Dyk and M. Welling, Editors. 2009, PMLR: Proceedings of Machine Learning Research. p. 448-455.
68. Kuo, W., B. Hariharan, and J. Malik, DeepBox: Learning Objectness with Convolutional Networks. CoRR, 2015. abs/1505.02146.
69. Huang, G.-B., Q.-Y. Zhu, and C.-K. Siew, Extreme learning machine: Theory and applications. Neurocomputing, 2006. 70(1): p. 489-501.
70. Tang, J., C. Deng, and G.B. Huang, Extreme Learning Machine for Multilayer Perceptron. IEEE Transactions on Neural Networks and Learning Systems, 2016. 27(4): p. 809-821.
71. Gong, M., et al., A Multiobjective Sparse Feature Learning Model for Deep Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2015. 26(12): p. 3263-3277.
72. Mehrkanoon, S., et al., Multiclass Semisupervised Learning Based Upon Kernel Spectral Clustering. IEEE Transactions on Neural Networks and Learning Systems, 2015. 26(4): p. 720-733.
73. Langone, R., et al., Kernel Spectral Clustering and applications. CoRR, 2015. abs/1505.00477.
74. Conneau, A., et al., Very Deep Convolutional Networks for Natural Language Processing. CoRR, 2016. abs/1606.01781.
75. Krpan, N. and D. Jakobovic. Parallel neural network training with OpenCL. in 2012 Proceedings of the 35th International Convention MIPRO. 2012.
76. Dong, W. and M. Zhou, A Supervised Learning and Control Method to Improve Particle Swarm Optimization Algorithms. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017. 47(7): p. 1135-1148.
77. Vapnik, V. and R. Izmailov, Learning using privileged information: similarity control and knowledge transfer. J. Mach. Learn. Res., 2015. 16(1): p. 2023-2049.
78. Sampson, J.R., Adaptation in Natural and Artificial Systems (John H. Holland). SIAM Review, 1976. 18(3): p. 529-530.
79. Mohd Razali, N. and J. Geraghty, Genetic Algorithms Performance with Different Selection Strategy in Solving TSP. 2010.
80. Larrañaga, P., et al., Genetic Algorithms for the Travelling Salesman Problem: A Review of Representations and Operators. Artificial Intelligence Review, 1999. 13(2): p. 129-170.
81. Whitley, D., A genetic algorithm tutorial. Statistics and Computing, 1994. 4(2): p. 65-85.
82. Lin, C.T., M. Prasad, and A. Saxena, An Improved Polynomial Neural Network Classifier Using Real-Coded Genetic Algorithm. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2015. 45(11): p. 1389-1401.
83. Guo, Y., et al., The use of next generation sequencing technology to study the effect of radiation therapy on mitochondrial DNA mutation. Mutation Research/Genetic Toxicology and Environmental Mutagenesis, 2012. 744(2): p. 154-160.
84. Wu, Y., et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR, 2016. abs/1609.08144.
85. Zhou, Z.-H., et al., Multi-instance multi-label learning. Artificial Intelligence, 2012. 176(1): p. 2291-2320.
86. Huang, L., et al., Adversarial machine learning, in Proceedings of the 4th ACM workshop on Security and artificial intelligence. 2011, ACM: Chicago, Illinois, USA. p. 43-58.
87. Yu, D. and L. Deng, Automatic Speech Recognition: A Deep Learning Approach. 2015: Springer, London.
88. Hadsell, R., S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). 2006.
89. Shrestha, A. and A. Mahmood. Enhancing Siamese Networks Training with Importance Sampling. in Proceedings of the 11th International Conference on Agents and Artificial Intelligence. 2019. Prague, Czech Republic: SciTePress.
90. Kingma, D.P. and M. Welling, Auto-Encoding Variational Bayes. arXiv e-prints, 2013.
91. Silver, D., et al., Mastering the game of Go with deep neural networks and tree search. Nature, 2016. 529: p. 484.
92. François-Lavet, V., et al., An Introduction to Deep Reinforcement Learning. CoRR, 2018. abs/1811.12560.
93. Goodfellow, I.J., et al. Generative Adversarial Networks. arXiv e-prints, 2014.
94. Reed, S., et al. Generative Adversarial Text to Image Synthesis. arXiv e-prints, 2016.
95. Brighton, H. and C. Mellish, Advances in Instance Selection for Instance-Based Learning Algorithms. Data Mining and Knowledge Discovery, 2002. 6(2): p. 153-172.
96. Albelwi, S. and A. Mahmood, A Framework for Designing the Architectures of Deep Convolutional Neural Networks. Entropy, 2017. 19(6).