Review of Deep Learning Algorithms and Architectures
This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2912200, IEEE Access
ABSTRACT Deep learning (DL) is playing an increasingly important role in our lives. It has already made
a huge impact in areas such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting,
and speech recognition. The painstakingly handcrafted feature extractors used in traditional learning,
classification and pattern recognition systems do not scale to large data sets. In many cases,
depending on the problem complexity, deep learning can also overcome the limitations of earlier shallow
networks that prevented efficient training and the abstraction of hierarchical representations of
multi-dimensional training data. A deep neural network (DNN) uses multiple (deep) layers of units with highly
optimized algorithms and architectures. This paper reviews several optimization methods that improve training
accuracy and reduce training time. We delve into the math behind the training algorithms used in recent
deep networks, and we describe current shortcomings, enhancements and implementations. The review also
covers different types of deep architectures, such as deep convolution networks, deep residual networks,
recurrent neural networks, reinforcement learning, variational autoencoders, and others.
INDEX TERMS Machine Learning Algorithm, Optimization, Artificial Intelligence, Deep Neural Network
Architectures, Convolution Neural Network, Backpropagation, Supervised and unsupervised learning
1. Introduction
A neural network is a machine learning (ML) technique that is inspired by and resembles the human
nervous system and the structure of the brain. It consists of processing units organized in input,
hidden and output layers. The nodes or units in each layer are connected to nodes in adjacent layers,
and each connection has a weight value. The inputs are multiplied by the respective weights and summed
at each unit. The sum then undergoes a transformation based on the activation function, which in most
cases is a sigmoid function, hyperbolic tangent (tanh) or rectified linear unit (ReLU). These functions
are used because they have a mathematically favorable derivative, making it easier to compute partial
derivatives of the error delta with respect to individual weights. The sigmoid and tanh functions also
squash the input into a narrow output range, i.e., 0 to 1 and -1 to +1 respectively. They implement
saturated nonlinearity, as the outputs plateau or saturate before/after their respective thresholds.
ReLU, on the other hand, exhibits both saturating and non-saturating behaviors with f(x) = max(0, x).
The output of the function is then fed as input to the subsequent unit in the next layer. The result of
the final output layer is used as the solution for the problem.

Neural networks can be used in a variety of problems including pattern recognition, classification,
clustering, dimensionality reduction, computer vision, natural language processing (NLP), regression,
predictive analysis, etc. Here is an example of image recognition.

Figure 1 shows how a deep neural network called a Convolution Neural Network (CNN) can learn hierarchical
levels of representations from a low-level input vector and successfully identify the higher-level
object. The red squares in the figure are simply a gross generalization of the pixel values of the
highlighted section of the figure. CNNs can progressively extract higher representations of the image
after each layer and finally recognize the image.

The implementation of neural networks consists of the following steps:
1. Acquire training and testing data set
2. Train the network
3. Make prediction with test data

The paper is organized in the following sections:
1. Introduction to Machine Learning
   a. Background and Motivation
2. Classifications of Neural Networks
3. DNN Architectures
4. Training Algorithms
5. Shortcomings of Training Algorithms
6. Optimization of Training Algorithms
7. Architectures & Algorithms – Implementations
8. Conclusion

Figure 1. Image recognition by a CNN

1.1. Background
In 1957, Frank Rosenblatt created the perceptron, the first prototype of what we now know as a neural
network [1]. It had two layers of processing units that could recognize simple patterns. Instead of
undergoing more research and development, neural networks entered a dark phase of their history in
1969, when professors at MIT demonstrated that the perceptron couldn’t even learn a simple XOR
function [2].

In addition, there was another finding that particularly dampened the motivation for DNN. The universal
approximation theorem showed that a single hidden layer was able to solve any continuous problem [3].
It was mathematically proven as well [4], which further questioned the validity of DNN. While a single
hidden layer could be used to learn, it was not efficient and was a far cry from the convenience and
capability afforded by the hierarchical abstraction of multiple hidden layers of DNN that we know now.
But it was not just the universal approximation theorem that held back the progress of DNN. Back then,
we didn’t have a way to train a DNN either. These factors prolonged the so-called AI winter, i.e., a
phase in the history of artificial intelligence where it didn’t get much funding and interest, and as a
result didn’t advance much either.

A breakthrough in DNN occurred with the advent of the backpropagation learning algorithm. It was
proposed in the 1970s [5], but it wasn’t until the mid-1980s [6] that it was fully understood and
applied to neural networks. The self-directed learning was made possible with the deeper understanding
and application of the backpropagation algorithm. The automation of feature extractors is what
differentiates DNNs from earlier generation machine learning techniques.

A DNN is a type of neural network modeled as a multilayer perceptron (MLP) that is trained with
algorithms to learn representations from data sets without any manual design of feature extractors. As
the name Deep Learning suggests, it consists of a higher or deeper number of processing layers, which
contrasts with shallow learning models that have fewer layers of units. The shift from shallow to deep
learning has allowed more complex and non-linear functions to be mapped, as they cannot be efficiently
mapped with shallow architectures. This improvement has been complemented by the proliferation of
cheaper processing units such as the general-purpose graphic processing unit (GPGPU) and large volumes
of data (big data) to train from. While GPGPU cores are individually less powerful than CPU cores, the
number of parallel processing cores in a GPGPU outnumbers CPU cores by orders of magnitude. This makes
GPGPUs better for implementing DNNs. In addition to the backpropagation algorithm and GPUs, the adoption
and advancement of ML and particularly Deep Learning can be attributed to the explosion of data or big
data in the last 10 years. ML will continue to impact and disrupt all areas of our lives, including
education, finance, governance, healthcare, manufacturing, marketing and others [7].

1.2. Motivation
Deep learning is perhaps the most significant development in the field of computer science in recent
times. Its impact has been felt in nearly all scientific fields. It is already disrupting and
transforming businesses and industries. There is a race among the world’s leading economies and
technology companies to advance deep learning. There are already many areas where deep learning has
exceeded human level capability and performance, e.g., predicting movie ratings, deciding whether to
approve loan applications, estimating car delivery times, etc. [8]. On March 27, 2019 the three deep
learning pioneers (Yoshua Bengio, Geoffrey Hinton, and Yann LeCun) were awarded the Turing Award, which
is also referred to as the “Nobel Prize” of computing [9]. While a lot has been accomplished, there is
more to advance in deep learning. Deep learning has the potential to improve human lives with more
accurate diagnosis of diseases like cancer [10], discovery of new drugs, and prediction of natural
disasters [11]. E.g., [12] reported that a deep learning network was able to learn from 129,450 images
of 2,032 diseases and was able to diagnose at the same level as 21 board certified dermatologists.
Google AI [10] was able to beat the average accuracy of US board certified general pathologists in
grading prostate cancer, with 70% accuracy versus 61%.

The goal of this review is to cover the vast subject of deep learning and present a holistic survey of
dispersed
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
connections reduces training time and chances of overfitting. All neurons in a filter are connected to
the same number of neurons in the previous input layer (or feature map) and are constrained to have the
same sequence of weights and biases. These factors speed up learning and reduce the memory requirements
of the network. Thus, each neuron in a specific filter looks for the same pattern but in different parts
of the input image. Sub-sampling layers reduce the size of the network. In addition, along with local
receptive fields and shared weights (within the same filter), sub-sampling effectively reduces the
network’s susceptibility to shifts, scale and distortions of images [29]. Max/mean pooling or local
averaging filters are often used to achieve sub-sampling. The final layers of a CNN are responsible for
the actual classification, where neurons between the layers are fully connected. A deep CNN can be
implemented with multiple series of weight-sharing convolution layers and sub-sampling layers. The deep
nature of the CNN results in high quality representations while maintaining locality, reduced
parameters and invariance to minor variations in the input image [30].
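
The weight sharing and sub-sampling described above can be illustrated with a small NumPy sketch. This is a minimal example under stated assumptions, not the implementation used in any cited work: the 6x6 image, the 3x3 vertical-edge kernel, and the helper names `conv2d_valid` and `max_pool2d` are all hypothetical choices made for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Every output unit reuses the same weights and bias.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel) + bias
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = (feature_map.shape[0] // size) * size
    w = (feature_map.shape[1] // size) * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Illustrative 6x6 image with a vertical edge between columns 2 and 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical 3x3 vertical-edge filter: 9 shared weights.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool2d(feature_map)            # 2x2 after sub-sampling

# Parameter count: one shared filter versus a fully connected layer
# mapping the same 36 inputs to the same 16 outputs.
shared_params = kernel.size + 1
dense_params = image.size * feature_map.size + feature_map.size

print(feature_map[0], pooled.shape, shared_params, dense_params)
```

In this toy setting the shared filter needs 10 parameters while an equivalent fully connected mapping would need 592, which is the memory saving the text refers to.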
Figure 6 shows single layer feature detector blocks of RBMs used in pre-training, which is followed by
unrolling [36]. Unrolling combines the stacks of RBMs to create the encoder block and then reverses the
encoder block to create the decoder section, and finally the network is fine-tuned with
backpropagation [36].

Figure 7 illustrates a simplified representation of how autoencoders can reduce the dimension of the
input data and learn to recreate it in the output layer. Wang et al. [37] successfully implemented a
deep autoencoder with stacks of RBM blocks similar to Figure 6 to achieve better modeling accuracy and
efficiency than the proper orthogonal decomposition (POD) method for dimensionality reduction of
distributed parameter systems (DPSs). The equation below describes the average activation of the j-th
unit of the 2nd layer, a_j^(2), when the i-th input x^(i) activates the neuron, averaged over the m
inputs [38].

$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} \left[ a_j^{(2)}\left(x^{(i)}\right) \right] \tag{10}$$

3.3. Restricted Boltzmann Machine (RBM)
A Restricted Boltzmann Machine is an artificial neural network to which we can apply an unsupervised
learning algorithm to build non-linear generative models from unlabeled data [39]. The goal is to train
the network to increase a function (e.g., product or log) of the probability of the vector in the
visible units so it can probabilistically reconstruct the input. It learns the probability distribution
over its inputs. As shown in Figure 8, an RBM is made of a two-layer network, called the visible layer
and the hidden layer. Each unit in the visible layer is connected to all units in the hidden layer and
there are no connections between the units in the same layer.

The energy (E) function of the configuration of the visible and hidden units, (v, h), is expressed in
the following way [40]:

$$E(v, h) = -\sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij} \tag{12}$$
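
As a numerical illustration of equation (12), the sketch below evaluates the energy of a tiny RBM and, by brute-force enumeration, the partition function Z and the marginal p(v) that this section defines from [40]. It is a minimal sketch under stated assumptions: the 2x2 network size, the bias and weight values, and the function names are hypothetical, and enumeration is only feasible for very small networks.

```python
import itertools
import numpy as np

def energy(v, h, a, b, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j w_ij."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Hypothetical parameters for an RBM with 2 visible and 2 hidden binary units.
a = np.array([0.1, -0.2])             # visible biases a_i
b = np.array([0.3, 0.0])              # hidden biases b_j
W = np.array([[0.5, -0.4],
              [0.2, 0.7]])            # W[i, j] connects visible i to hidden j

# All binary state vectors of length 2.
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=2)]

# Partition function: sum of exp(-E) over every (v, h) pair.
Z = sum(np.exp(-energy(v, h, a, b, W)) for v in states for h in states)

def p_visible(v):
    """Marginal probability of a visible vector, summing out the hidden units."""
    return sum(np.exp(-energy(v, h, a, b, W)) for h in states) / Z

probs = {tuple(int(x) for x in v): p_visible(v) for v in states}
total = sum(probs.values())
print(probs, total)
```

The marginals sum to one, and configurations with lower energy contribute larger terms exp(-E), which is why training pushes the energy of the training vectors down.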
v_i and h_j are the vector states of the visible unit i and hidden unit j. a_i and b_j represent the
biases of the visible and hidden units. w_ij denotes the weight between the respective visible and
hidden units.

The partition function, Z, is obtained by summing e^{-E(v,h)} over all possible pairs of visible and
hidden vectors [40]. It normalizes the probability of each pair:

$$p(v, h) = \frac{1}{Z} e^{-E(v, h)} \tag{14}$$

The probability of a particular visible layer vector is provided by the following [40].

$$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)} \tag{15}$$

As you can see from the equations above, the probability of a configuration becomes higher with a lower
energy function value. Thus, during the training process, the weights and biases of the network are
adjusted to arrive at a lower energy and thus maximize the probability assigned to the training vector.
It is mathematically convenient to compute the derivative of the log probability of a training vector.

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \tag{16}$$

In the equation [40] above, ⟨v_i h_j⟩_data and ⟨v_i h_j⟩_model represent the expectations under the
respective distributions. Thus, the adjustments in the weights can be denoted as follows [40], where ϵ
is the learning rate.

$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right) \tag{28}$$

3.4. Long Short-Term Memory (LSTM)
LSTM is an implementation of the Recurrent Neural Network and was first proposed by Hochreiter et al.
in 1997 [41]. Unlike the earlier described feed forward network architectures, LSTM can retain knowledge
of earlier states and can be trained for work that requires memory or state awareness. LSTM partly
addresses a major limitation of RNN, i.e., the problem of vanishing gradients, by letting gradients
pass unaltered. As shown in the illustration in Figure 9, LSTM consists of blocks of memory cell state
through which the signal flows while being regulated by input, forget and output gates. These gates
control what is stored, read and written on the cell. LSTM is used by Google, Apple and Amazon in their
voice recognition platforms [42].

Figure 9. LSTM Block with memory cell and gates

In Figure 9, C, x, h represent cell, input and output values. Subscript t denotes the time step value,
i.e., t − 1 is from the previous LSTM block (or from time t − 1) and t denotes current block values.
The symbol σ is the sigmoid function and tanh is the hyperbolic tangent function. Operator + is the
element-wise summation and × is the element-wise multiplication. The computations of the gates are
described in the equations below [41, 43].

$$f_t = \sigma(W_f x_t + w_f h_{t-1} + b_f) \tag{17}$$

$$i_t = \sigma(W_i x_t + w_i h_{t-1} + b_i) \tag{18}$$

$$o_t = \sigma(W_o x_t + w_o h_{t-1} + b_o) \tag{19}$$

$$c_t = f_t \otimes c_{t-1} + i_t \otimes \sigma_c(W_c x_t + w_c h_{t-1} + b_c) \tag{20}$$

$$h_t = o_t \otimes \sigma_h(c_t) \tag{21}$$

Where f, i, o are the forget, input and output gate vectors respectively. W, w, b and ⊗ represent
weights of input, weights of recurrent output, bias and element-wise multiplication respectively.
σ_c and σ_h are the activation functions applied to the cell input and cell state, commonly the
hyperbolic tangent.

There is a smaller variation of the LSTM known as gated recurrent units (GRU). GRUs are smaller in size
than LSTM as they don’t include the output gate, and can perform better than LSTM on only some simpler
datasets [44, 45].

LSTM recurrent neural networks can keep track of long-term dependencies. Therefore, they are great for
learning
from sequence input data and building models that rely on context and earlier states. The cell block of
LSTM retains pertinent information of previous states. The input, forget and output gates dictate the
new data going into the cell, what remains in the cell, and the cell values used in the calculation of
the output of the LSTM block, respectively [41, 43]. Naul et al. demonstrated LSTM and GRU based
autoencoders for automatic feature extraction [46].

3.5. Comparison of DNN Networks
Table 2 provides a compact summary and comparison of the different DNN architectures. The examples of
implementations, applications, datasets and DL software frameworks presented in the table are not
implied to be exhaustive. In addition, some of the categorization of the network architectures could be
implemented in hybrid fashion. E.g., even though RBMs are generative models and their training is
considered unsupervised, they can have elements of a discriminative model when training is fine-tuned
with supervised learning. The table also provides examples of common applications for using different
architectures.
Network | Architecture | Model type | Training type | Training algorithm | Common implementations | Common applications | Sample dataset | DL software frameworks
 |  |  |  | Backpropagation | DenseNet |  |  |
Autoencoder |  | Generative | Unsupervised | Backpropagation | Sparse Autoencoders, Variational Autoencoders | Dimensionality Reduction; Encoding | MNIST | TensorFlow, Deeplearning4j, Keras
Generative Adversarial Networks |  | Generative & Discriminative | Unsupervised | Backpropagation | Adversarial Network | Generate realistic fake data; Reconstruction of 3D models; Image improvement | CIFAR10 | TensorFlow, Keras
RBM |  | Generative with Discriminative finetuning | Unsupervised | Gradient Descent based Contrastive divergence | Deep Belief Network; Deep Boltzmann Machine | Dimensionality Reduction; Feature learning; Topic modeling | MNIST | TensorFlow, Deeplearning4j, Keras, MXNet, Theano, Torch
Recurrent Neural Network | LSTM | Discriminative | Supervised | Gradient Descent & Backpropagation through Time | Deep RNN, Gated Recurrent Unit (GRU), Neural Machine Translation (NMT) | Natural Language Processing; Language Translation | MNIST Stroke Sequence | TensorFlow, Caffe, Theano, Torch, Deeplearning4j, Microsoft Cognitive Toolkit, Keras, MXNet, PyTorch
Radial Basis Function NN | RBF Network | Discriminative | Supervised and Unsupervised | K-means Clustering; Least Square Function | Radial Basis Function NN | Function approximation; Time series prediction | Fisher's Iris data set | TensorFlow
Kohonen Self Organizing NN | Nodes arranged in hexagonal or rectangular grid | Generative | Unsupervised | Competitive Learning | Kohonen Self Organizing NN | Dimensionality Reduction; Optimization problems; Clustering analysis | SPAMbase | TensorFlow
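
To make one of the architectures compared above more concrete, the LSTM gate computations of equations (17) to (21) in Section 3.4 can be sketched in NumPy. This is a minimal illustration under stated assumptions rather than a reference implementation: σ_c and σ_h are taken to be tanh, and the layer sizes, random initial weights, and the names `lstm_step` and `init_params` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following equations (17)-(21)."""
    W, w, b = params["W"], params["w"], params["b"]
    f_t = sigmoid(W["f"] @ x_t + w["f"] @ h_prev + b["f"])    # forget gate (17)
    i_t = sigmoid(W["i"] @ x_t + w["i"] @ h_prev + b["i"])    # input gate (18)
    o_t = sigmoid(W["o"] @ x_t + w["o"] @ h_prev + b["o"])    # output gate (19)
    c_hat = np.tanh(W["c"] @ x_t + w["c"] @ h_prev + b["c"])  # candidate cell value
    c_t = f_t * c_prev + i_t * c_hat                          # cell state (20)
    h_t = o_t * np.tanh(c_t)                                  # block output (21)
    return h_t, c_t

def init_params(n_in, n_hidden, seed=0):
    """Small random input weights W, recurrent weights w, and zero biases b."""
    rng = np.random.default_rng(seed)
    gates = "fioc"
    return {
        "W": {g: rng.normal(0.0, 0.1, (n_hidden, n_in)) for g in gates},
        "w": {g: rng.normal(0.0, 0.1, (n_hidden, n_hidden)) for g in gates},
        "b": {g: np.zeros(n_hidden) for g in gates},
    }

# Run a short, illustrative sequence of five 3-dimensional inputs.
params = init_params(n_in=3, n_hidden=4)
h, c = np.zeros(4), np.zeros(4)
for x_t in np.ones((5, 3)):
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, equation (20) copies the previous cell state almost unchanged, which is the mechanism that lets information and gradients persist across many time steps.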
4.2. Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is the most common variation and implementation of gradient descent.
In gradient descent, we process all the samples in the training dataset before applying the updates to
the weights. In SGD, by contrast, updates are applied after running through a minibatch of n samples.
Since we are updating the weights more frequently in SGD than in GD, we can often converge towards a
minimum much faster.

4.3. Momentum

where the measurement error for y(t_i), i.e., σ_{y_i}, is the inverse of the weighting matrix W_ii.
The gradient of the squared error function in relation to the n parameters can be denoted as [49]:

$$\frac{\partial}{\partial \mathbf{p}} \chi^2 = 2 (y - \hat{y}(p))^T W \frac{\partial}{\partial \mathbf{p}} (y - \hat{y}(p)) \tag{28}$$

$$= -2 (y - \hat{y}(p))^T W \left[ \frac{\partial \hat{y}(p)}{\partial \mathbf{p}} \right] \tag{29}$$
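
The chain-rule step from equation (28) to equation (29) can be checked numerically. The sketch below compares the analytic form -2 (y - ŷ(p))^T W [∂ŷ(p)/∂p] against central finite differences. It is a minimal sketch under stated assumptions: the exponential model, the sample times, the parameter values, and the choice W_ii = 1/σ² (a common weighted least squares convention) are all hypothetical and not taken from [49].

```python
import numpy as np

def model(p, t):
    """Hypothetical two-parameter model y_hat(t; p) = p0 * exp(-p1 * t)."""
    return p[0] * np.exp(-p[1] * t)

def jacobian(p, t):
    """d y_hat / d p: one row per sample, one column per parameter."""
    e = np.exp(-p[1] * t)
    return np.column_stack([e, -p[0] * t * e])

def chi_squared(p, t, y, W):
    r = y - model(p, t)
    return r @ W @ r

def chi_squared_gradient(p, t, y, W):
    """Analytic gradient: -2 (y - y_hat)^T W J, valid for symmetric W."""
    r = y - model(p, t)
    return -2.0 * r @ W @ jacobian(p, t)

t = np.linspace(0.0, 2.0, 8)
y = model(np.array([2.0, 1.5]), t)     # synthetic, noise-free targets
sigma = np.full_like(t, 0.1)           # assumed measurement errors
W = np.diag(1.0 / sigma**2)            # assumed diagonal weighting matrix
p = np.array([1.0, 1.0])               # current parameter guess

analytic = chi_squared_gradient(p, t, y, W)

# Central finite differences as an independent check of the analytic form.
eps = 1e-6
numeric = np.array([
    (chi_squared(p + eps * d, t, y, W) - chi_squared(p - eps * d, t, y, W)) / (2 * eps)
    for d in np.eye(2)
])
print(analytic, numeric)
```

The negative sign arises because differentiating the residual (y - ŷ(p)) with respect to p gives the negative of the model Jacobian.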
Contrastive divergence finds its use in probabilistic models such as RBMs. Evolutionary algorithms can
be applied to hyperparameter optimization or to training models by optimizing weights. Reinforcement
learning could be used in game theory, multi-agent systems and other problems where both exploitation
and exploration need to be optimized.

TABLE 3. DEEP LEARNING ALGORITHM COMPARISON TABLE
Algorithm | Advantages | Disadvantages | Techniques to address disadvantages
(Batch) Gradient Descent | Scales well after optimizations | Takes a long time to converge as weights are updated after the entire dataset pass; Local minima | Mini-Batch Gradient Descent; Please see table 4
Stochastic Gradient Descent | Scales well after optimizations | Noisy error rates since it is calculated at every sample; Accuracy requires random order | Mini-Batch Gradient Descent; Shuffle data after every epoch

network to take a long time to train, whereas large gradients can cause the training to overshoot and
diverge. This is made worse by non-linear activation functions like sigmoid and tanh that squash the
outputs to a small range. Since a change in weight then has only a nominal effect on the output,
training could take much longer. This problem can be mitigated by using a non-saturating activation
function like ReLU and proper weight initialization.

5.2. Local Minima
In a convex function, any local minimum is also the global minimum, which makes gradient descent based
optimization reliable. In nonconvex functions, by contrast, backpropagation based gradient descent is
particularly vulnerable to premature convergence into a local minimum. A local minimum, as shown in
Figure 11, can easily be mistaken for the global minimum.
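
The sensitivity of gradient descent to local minima can be seen on a one-dimensional toy problem. The sketch below is illustrative only: the quartic loss, the learning rate, and the two starting points are hypothetical choices, not an example taken from the paper.

```python
def f(x):
    """Hypothetical nonconvex loss with one local and one global minimum."""
    return x**4 - 3.0 * x**2 + x

def grad_f(x):
    return 4.0 * x**3 - 6.0 * x + 1.0

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_right = gradient_descent(x0=2.0)    # settles in the shallower basin
x_left = gradient_descent(x0=-2.0)    # settles in the deeper basin

print(x_right, f(x_right))
print(x_left, f(x_left))
```

Both runs stop at points where the gradient is essentially zero, yet only the run started at -2.0 reaches the lower loss value. From the gradient alone, the optimizer cannot tell the two outcomes apart.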
5.4. Steep Edges
Steep edges are another section of the optimization surface area where the steep gradient could cause
the gradient descent-based weight updates to overshoot and miss a potential global minimum.

5.5. Training Time
Training time is an important factor to gauge the efficiency of an algorithm. It is not uncommon for
graduate students to train their model for days or weeks in the computer lab. Most models require an
exorbitant amount of time and large datasets to train. Often, many of the samples from the datasets do
not add value to the training process and in some cases, they introduce noise and adversely affect the
training.

5.6. Overfitting
As we add more neurons to a DNN, it can undoubtedly model the network for more complex problems. A DNN
can lend itself to high conformability to training data. But there is also a high risk of overfitting
to the outliers and noise in the training data, as shown in Figure 13. This can delay training and
testing and lower the quality of predictions on the actual test data. E.g., in classification or
cluster problems, overfitting can create a high order polynomial output that separates the decision
boundary for the training set, which will take longer and give degraded results for most test data
sets. One way to overcome overfitting is to choose the number of neurons in the hidden layer wisely to
match the problem size and type. There are some algorithms that can be used to approximate the
appropriate number of neurons, but there is no magic bullet and the best bet is to experiment on each
use case to get an optimal value.

Figure 13. Overfitting in Classification

6. Optimization of Training Algorithms
The goal of the DNN is to improve the accuracy of the model on test data. Training algorithms aim to
achieve the end goal by reducing the cost function. The common root cause of three out of five
shortcomings mentioned above is that the training algorithms assume the problem area to be a convex
function. The other problem is the high number of nodes and the sheer number of possible combinations
of weight values they can have. While weights are learned by training on the dataset, there are
additional crucial parameters referred to as hyperparameters that aren’t directly learnt from the
training dataset. These hyperparameters can take a range of values and add to the complexity of finding
the optimal architecture and model. There is significant room for improvement to the standard training
algorithms. Here are some of the popular ways to enhance the accuracy of DNNs.

6.1. Parameter Initialization Techniques
Since the solution space is so huge, the initial parameters have an outsized influence on how fast or
slow the training converges, if at all, or whether it prematurely converges to a suboptimal point.
Initialization strategies tend to be heuristic in nature. [50] proposed normalized initialization, where
weights are initialized in the following manner, with n_j and n_{j+1} denoting the number of units in
layers j and j+1.

$$W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right] \tag{34}$$

[51] proposed another technique called sparse initialization, where the number of non-zero incoming
weights is capped at a certain limit, causing them to retain high diversity and reducing the chances of
saturation.

6.2. Hyperparameter Optimization
The learning rate and regularization parameters constitute the commonly used hyperparameters in DNN.
The learning rate determines the rate at which the weights are updated. The purpose of regularization
is to prevent overfitting, and the regularization parameter affects the degree of influence on the loss
function. CNNs have additional hyperparameters, i.e., number of filters, filter shapes, number of
dropouts and max pooling shapes at each convolution layer, and number of nodes in the fully connected
layer. These parameters are very important for training and modeling a DNN. Coming up with an optimal
set of parameter values is a challenging feat. Exhaustively iterating through each combination of
hyperparameter values is computationally very expensive. For example, if training and evaluating a DNN
with the full dataset takes ten minutes, then seven hyperparameters, each with eight potential values,
will take (8^7 x 10 min), i.e., 20,971,520 minutes or almost 40 years to exhaustively train and evaluate
the network on all combinations of the hyperparameter values. Hyperparameters can be optimized with
different metaheuristics. Metaheuristics are nature inspired guiding principles that can help in
traversing the search space more intelligently yet much faster than the exhaustive method.

Particle Swarm Optimization (PSO) is another type of metaheuristic that can be used for hyperparameter
optimization. PSO is modeled around how birds fly
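
The normalized initialization of equation (34) in Section 6.1 can be sketched as follows. This is a minimal illustration assuming NumPy; the layer sizes (300 inputs, 100 outputs), the seed, and the function name `normalized_init` are hypothetical.

```python
import math
import numpy as np

def normalized_init(n_in, n_out, rng):
    """Draw an (n_out, n_in) weight matrix from U[-limit, limit],
    with limit = sqrt(6) / sqrt(n_in + n_out) as in equation (34)."""
    limit = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    weights = rng.uniform(-limit, limit, size=(n_out, n_in))
    return weights, limit

rng = np.random.default_rng(0)
W, limit = normalized_init(n_in=300, n_out=100, rng=rng)

# For U[-limit, limit] the variance is limit**2 / 3 = 2 / (n_in + n_out).
expected_var = 2.0 / (300 + 100)
print(limit, W.var(), expected_var)
```

The bound shrinks as the layers grow, so wide layers start with smaller weights. This keeps the summed input to each unit away from the saturated regions of sigmoid-like activations.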
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2912200, IEEE Access
Author Name: Preparation of Papers for IEEE Access (February 2017)
is more sophisticated [57] and prescribes an inversely proportional scaling of the learning rates to the square root of the cumulative squared gradient. AdaGrad is not effective for all DNN training. Since the change in the learning rate is a function of the historical gradient, the accumulated squared gradient can shrink the learning rate too early, which makes AdaGrad susceptible to premature convergence.
The RMSProp algorithm is a modification of the AdaGrad algorithm that makes it effective in a nonconvex problem space. RMSProp replaces the summation of squared gradients in AdaGrad with an exponentially decaying moving average of the squared gradient, effectively dropping the impact of the distant gradient history [48]. Adam, which denotes adaptive moment estimation, is the latest evolution of the adaptive learning algorithms and integrates the ideas from AdaGrad, RMSProp and momentum [58]. Just like AdaGrad and RMSProp, Adam provides an individual learning rate for each parameter. Adam includes the benefits of both earlier methods and does a better job of handling non-stationary objectives and problems with noisy or sparse gradients [58]. Adam uses the first moment of the gradients (i.e., the mean, as in momentum) as well as the second moment (the uncentered variance, as in RMSProp), each estimated with an exponential moving average [58].

functions. Ioffe and Szegedy [59] proposed the idea of batch normalization in 2015. It has made a huge difference in improving the training time and accuracy of DNNs. It normalizes the inputs of a layer to have zero mean and unit variance over each mini-batch.

6.5. Supervised Pretraining
Supervised pretraining consists of breaking down a complex problem into smaller parts, training the simpler models, and later combining them to solve the larger problem. Greedy algorithms are commonly used in supervised pretraining of DNNs.

6.6. Dropout
There are a few commonly used methods to lower the risk of overfitting. In the dropout technique, we randomly choose units and nullify their weights and outputs so that they do not influence the forward pass or the backpropagation. Figure 16 shows a fully connected DNN on the left and a DNN with dropout on the right. The other methods include the use of regularization and simply enlarging the training dataset using label-preserving techniques. Dropout works better than regularization to reduce the risk of overfitting and also speeds up the training process. [60] proposed the dropout technique and demonstrated significant improvement on supervised learning based DNNs for computer vision, computational biology, speech recognition and document classification problems.
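As a rough illustration of the dropout technique described above, the following Python sketch masks the activations of one layer during training. It is not taken from [60]; the function names `dropout_forward` and `dropout_backward`, the `drop_prob` parameter, and the rescaling of surviving units by 1/(1 - drop_prob) (often called inverted dropout) are assumptions made for this example.

```python
import random


def dropout_forward(activations, drop_prob, rng, training=True):
    """Apply (inverted) dropout to the activations of one layer.

    Returns the masked activations and the mask, so that the same units
    can be silenced again during backpropagation.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At evaluation time every unit is kept and nothing is rescaled.
        return list(activations), [1.0] * len(activations)
    keep_prob = 1.0 - drop_prob
    # Each unit is kept with probability keep_prob; kept units are scaled
    # by 1/keep_prob so the expected activation stays the same (assumption).
    mask = [1.0 / keep_prob if rng.random() < keep_prob else 0.0
            for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def dropout_backward(upstream_grads, mask):
    """Dropped units receive zero gradient; kept units use the same scale."""
    return [g * m for g, m in zip(upstream_grads, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    out, mask = dropout_forward([0.5, 1.0, 1.5, 2.0], drop_prob=0.5, rng=rng)
    print(out, dropout_backward([1.0, 1.0, 1.0, 1.0], mask))
```

Because the mask is reused in the backward pass, a dropped unit influences neither the forward pass nor the weight update for that mini-batch, which matches the behavior described in the text.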
minima when the function is non-convex. They propose a DNN architecture called large scale deep belief network (DBN) that uses both labeled and unlabeled data to learn feature representations. DBNs are made up of layers of RBMs stacked together and learn the probability distribution of the input vectors. They employ unsupervised pre-training and supervised fine-tuning algorithms and techniques to mitigate the risk of getting trapped in local minima. Below is the equation [23] for the change in weights, where c is the momentum factor, α is the learning rate, and v and h are the visible and hidden units respectively.

$$\Delta w_{ij}(t+1) = c\,\Delta w_{ij}(t) + \alpha\left(\langle v_i h_j\rangle_{data} - \langle v_i h_j\rangle_{model}\right) \tag{35}$$

The equations [23] for the conditional probability distributions of the hidden and visible units are given below, where σ is the logistic sigmoid function and a_j and b_i are the bias terms of the hidden and visible units.

$$p(h_j = 1 \mid \mathbf{v}; W) = \sigma\left(\sum_{i=1}^{I} w_{ij} v_i + a_j\right) \tag{36}$$

$$p(v_i = 1 \mid \mathbf{h}; W) = \sigma\left(\sum_{j=1}^{J} w_{ij} h_j + b_i\right) \tag{37}$$

7.4. Big data
Big data provides tremendous opportunity and challenge for deep learning. Big data is known for the 4 Vs (volume, velocity, veracity, variety). Unlike shallow networks, DNNs can handle the huge volume and variety of data, which significantly improves the training process and the ability to fit more complex models. On the flip side, the sheer velocity of data that is generated in real-time can be daunting to process. Najafabadi et al. [47] raise similar challenges in learning from real-time streaming data, such as credit card usage monitored for fraud detection. They propose using parallel and distributed processing with thousands of CPU cores. In addition, we should also use cloud providers that support auto-scaling based on usage and workload. Not all data is of the same quality. In the case of computer vision, images from constrained sources, e.g., studios, are much easier to recognize than the ones from unconstrained sources like surveillance cameras. [64] proposes a method to utilize multiple images from the unconstrained source to enhance the recognition process.
Deep learning can help mine and extract useful patterns from big data and build models for inference, prediction and business decision making. Massive volumes of structured and unstructured data and media files are being generated today, making information retrieval very challenging. Deep learning can help with semantic indexing to make information more readily accessible in search engines [14, 65]. This involves building models that provide relationships between documents and the keywords they contain to make information retrieval more effective.

7.5. Generative top down connection (generative model)
Much of the training is usually implemented with a bottom-up approach, where discriminative or recognition models are developed using backpropagation. A bottom-up model is one that takes the vector representation of input objects and computes higher level feature representations at subsequent layers, with a final discrimination or recognition pattern at the output layer. One of the shortcomings of backpropagation is that it requires labeled data to train. Geoffrey Hinton proposed a novel way of overcoming this limitation in 2007 [66]. He proposed a multi-layer DNN that used generative top-down connections as opposed to bottom-up connections to mimic the way we generate visual imagery in our dreams without the actual sensory input. In a top-down generative connection, the high-level data representations or the outputs of the network are used to generate the low-level raw vector representations of the original inputs, one layer at a time. The layers of feature representations learned with this approach can then be further perfected either in generative models such as auto-encoders or even standard recognition models [66].
In the generative model in Figure 18, since the correct upstream cause of the events in each layer is known, a comparison between the actual cause and the prediction made by the approximate inference procedure can be made, and the recognition weights $r_{ij}$ can be adjusted to increase the probability of correct prediction.

Figure 18. Learning multiple layers of representation

Here is the equation [66] for adjusting the recognition weights $r_{ij}$.

$$\Delta r_{ij} \propto h_i\left(h_j - \sigma\left(\sum_i h_i r_{ij}\right)\right) \tag{38}$$
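Equation (38) is a delta rule, and a minimal Python sketch of one update step is given here for illustration. The proportionality constant `lr`, the function name `recognition_update`, and the reading of h_i as the activity of the layer feeding the recognition connections and h_j as the actual state of the unit being predicted are assumptions for this example, not details taken from [66].

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recognition_update(r, h_lower, h_upper, lr=0.1):
    """Return recognition weights after one step of Equation (38).

    r[i][j]    weight from unit i (feeding layer) to unit j (predicted layer)
    h_lower[i] activities h_i of the feeding layer
    h_upper[j] actual states h_j, known because the generative model
               produced them
    lr         hypothetical step size standing in for the proportionality
    """
    new_r = [row[:] for row in r]  # leave the caller's weights untouched
    for j in range(len(h_upper)):
        # Prediction made by the recognition connections for unit j.
        predicted = sigmoid(sum(h_lower[i] * r[i][j]
                                for i in range(len(h_lower))))
        error = h_upper[j] - predicted
        for i in range(len(h_lower)):
            new_r[i][j] += lr * h_lower[i] * error
    return new_r


if __name__ == "__main__":
    weights = [[0.0], [0.0]]
    for _ in range(200):
        weights = recognition_update(weights, [1.0, 0.0], [1.0])
    print(weights, sigmoid(weights[0][0]))
```

Only weights leaving an active unit (h_i different from zero) change, and each step moves the predicted probability toward the actual state, which is the sense in which the update increases the probability of correct prediction.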
Figure 23 illustrates data points in a spectral clustering representation. Spectral clustering (SC) is an algorithm that divides the data points in a graph using the Laplacian or double derivative operation, whereas KSC is simply an extension of SC that uses the Least Squares Support Vector Machines methodology [73].
Since unlabeled data is more abundantly available relative to labeled data, it would be beneficial to make the most of it with unsupervised or, in this case, semi-supervised learning.

7.10. Very Deep Convolutional Networks for Natural Language Processing
Deep CNNs have mostly been used in computer vision, where they are very effective. Conneau et al. [74] applied them to NLP for the first time, with up to 29 convolution layers. The goal is to analyze and extract layers of hierarchical representations from words and sentences at the syntactic, semantic and contextual level. One of the major reasons for the lack of earlier deep CNNs for NLP is that deeper networks tend to cause saturation and degradation of accuracy. This is in addition to the processing overhead of more layers. Kaiming He et al. [62] state that the degradation is not caused by overfitting but arises because deeper systems are difficult to optimize. [62] addressed this issue with shortcut connections between the convolution blocks to let the gradients propagate more freely, and they, along with [74], were able to validate the benefits of the shortcuts with 10/101/152 layers and 49 layers respectively. The architecture of Conneau et al. [74] consists of a series of convolution blocks separated by pooling that halves the resolution, followed by k-max pooling and classification at the end.

7.11. Metaheuristics
Metaheuristics can be used to train neural networks to overcome the limitations of backpropagation-based learning. When implementing a metaheuristic as the training algorithm, each weight of the neural network connection is represented by a dimension in the multi-dimensional solution search space of the problem we are trying to solve. The goal is to come as near as possible to the optimal values of the weights, i.e., a location in the search space that represents the global best solution. Particle Swarm Optimization (PSO) is a type of metaheuristic inspired by the movement of birds in the sky. It consists of particles, or candidate solutions, that move about in a search space to reach a near-optimal solution. In their paper [75], N. Krpan and D. Jakobovic ran parallel implementations using backpropagation and PSO. Their results demonstrate that while parallelization improves the efficacy of both algorithms, parallel backpropagation is efficient only on large networks, whereas parallel PSO has wider influence on various sizes of problems.
Similarly, W. Dong and M. Zhou [76] complemented PSO with a supervised learning control module to guide the search for the global minimum of an optimization problem. The supervised learning module provided real-time feedback, with back diffusion (BD) to retain diversity and social attractor renewal to overcome stagnation [76].
Metaheuristics provide high level guidance inspired by nature and apply it to solve mathematical problems. In a similar way, [77] proposes incorporating the concepts of an intelligent teacher and privileged information, which is essentially extra information available during training but not during evaluation or testing, into the DNN training process.

7.12. Genetic Algorithm
Genetic Algorithm (GA) is a metaheuristic that can be effectively used in training DNNs. GA mimics the evolutionary processes of selection, crossover and mutation. Each population member represents a possible solution with a set of weights. Unlike PSO, which includes only one operator for adjusting the solution, evolutionary algorithms like GA include various steps, i.e., selection, crossover and mutation methods [52]. Population members undergo several iterations of selection and crossover based on known strategies to achieve a better solution in the next iteration or generation. GA has undergone decades of improvement and refinement since it was first proposed in 1976 [78]. There are several ways to perform selection, e.g., elite, roulette, rank, tournament [79]. Larrañaga et al. alone describe about a dozen ways to perform crossover [80]. Selection methodologies represent exploration of the solution space and crossovers represent the exploitation of the selected solution candidates. The goal is to get a better solution through wider exploration and deeper exploitation. Additional tweaking can be introduced with mutation. Parallel clusters of GA can be executed independently in islands, with a few members exchanged between the islands every so often [81]. In addition, we can also utilize local search such as a greedy algorithm, Nearest Neighbor or K-opt algorithm to further improve the quality of the solution.
Lin et al. [82] demonstrated a successful incorporation of GA that resulted in better classification accuracy and performance of a Polynomial Neural Network. Standard GA operations including selection, crossover and mutation were used on parameters that included partial descriptions (PDs) of inputs in the first layer, bias and all input features [82]. GA was further enhanced with the incorporation of the concept of mitochondrial DNA (mtDNA). In evolution, it is quite evident from casual observation and simple reasoning that crossover of population members with too much similarity does not yield much variance in the offspring. Likewise, we can infer that in GA, selection and crossover between solutions that are very similar would not result in a high degree of exploration of the multi-dimensional solution space. In fact, it might run the risk of getting pigeonholed into a restricted pattern.
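To make the selection, crossover and mutation steps of a Genetic Algorithm concrete, here is a small Python sketch that evolves the nine weights of a 2-2-1 sigmoid network on the XOR problem. It is an illustrative toy under stated assumptions, not the method of [82]: tournament selection, single-point crossover, Gaussian mutation, elitism, and the population size and mutation settings are all choices made for this example.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
       ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
N_WEIGHTS = 9  # 4 hidden weights + 2 hidden biases + 2 output weights + 1 bias


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def forward(w, x):
    h0 = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])
    h1 = sigmoid(w[3] * x[0] + w[4] * x[1] + w[5])
    return sigmoid(w[6] * h0 + w[7] * h1 + w[8])


def loss(w):
    """Mean squared error of one population member (a weight vector)."""
    return sum((forward(w, x) - y) ** 2 for x, y in XOR) / len(XOR)


def tournament(pop, rng, k=3):
    """Selection: the best of k randomly drawn members becomes a parent."""
    return min(rng.sample(pop, k), key=loss)


def crossover(a, b, rng):
    """Single-point crossover of two parent weight vectors."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(w, rng, rate=0.2, sigma=0.5):
    """Perturb each weight with probability `rate` by Gaussian noise."""
    return [wi + rng.gauss(0.0, sigma) if rng.random() < rate else wi
            for wi in w]


def evolve(pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = min(pop, key=loss)
        history.append(loss(best))
        children = [best]  # elitism: the best member survives unchanged
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children
    best = min(pop, key=loss)
    history.append(loss(best))
    return best, history


if __name__ == "__main__":
    weights, history = evolve()
    print("first loss:", history[0], "last loss:", history[-1])
```

Because the best member is always carried over, the best loss can never get worse from one generation to the next. How close it gets to zero depends on the seed and settings, and very similar parents produce nearly identical children, which is the loss of diversity discussed in the text.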
Figure 25. GNMT Architecture [84] with encoder neural network on the left and decoder neural network on the right.
certain distribution. The mathematical derivation for VAEs was originally proposed in [90]. Suppose we wanted to infer $P(z|x)$ based on some $Q(z|x)$; then we can try to minimize the KL divergence between the two:

$$D_{KL}[Q(z|x)\,\|\,P(z|x)] = \sum_{z} Q(z|x)\,\log\left[\frac{Q(z|x)}{P(z|x)}\right] \tag{45}$$

$$= E\left[\log\left[\frac{Q(z|x)}{P(z|x)}\right]\right] \tag{46}$$

$$= E\left[\log Q(z|x) - \log P(z|x)\right] \tag{47}$$

where $D_{KL}$ is the Kullback–Leibler (KL) divergence and E represents expectation.
Using Bayes' rule:

$$P(z|x) = \frac{P(x|z)\,P(z)}{P(x)} \tag{48}$$

$$D_{KL}[Q(z|x)\,\|\,P(z|x)] = E\left[\log Q(z|x) - \log\frac{P(x|z)\,P(z)}{P(x)}\right] \tag{49}$$

$$= E\left[\log Q(z|x) - \log P(x|z) - \log P(z)\right] + \log P(x) \tag{50}$$

To allow us to easily sample $P(z)$ and generate new data, we set $P(z)$ to a normal distribution, i.e., $N(0,1)$. If $Q(z|x)$ is represented as a Gaussian with parameters $\mu(x)$ and $\Sigma(x)$, then the KL divergence between $Q(z|x)$ and $P(z)$ can be derived in closed form as:

$$D_{KL}[N(\mu(x), \Sigma(x))\,\|\,N(0,1)] = \frac{1}{2}\sum_{k}\left(\exp(\Sigma(x)) + \mu^{2}(x) - 1 - \Sigma(x)\right) \tag{51}$$

In this form, $\Sigma(x)$ is read as the log-variance of each latent dimension k.

7.19. Deep Reinforcement Learning
The primary idea of reinforcement learning is to make an agent learn from the environment with the help of random experimentation (exploration) and defined reward (exploitation). It consists of a finite number of states ($s_i$, representing agent and environment), actions ($a_i$) by the agent, the probability ($P_a$) of moving from one state to another based on action $a_i$, and the reward $R_a(s_i, s_{i+1})$ associated with moving to the next state with action $a$. The goal is to balance and maximize the current reward ($R$) and the future reward ($\gamma \max[Q(s', a')]$) by predicting the best action as defined by the function $Q(s, a)$. $\gamma$ in the equation represents a fixed discount factor. $Q(s, a)$ is represented as the summation of the current reward ($R$) and the future reward ($\gamma \max[Q(s', a')]$) as shown below.

$$Q(s, a) = R + \gamma \max_{a'}\left[Q(s', a')\right] \tag{52}$$

Reinforcement learning is specifically suited for problems that consist of both short-term and long-term rewards, e.g., games like chess, Go, etc. AlphaGo, Google's program that beat the human Go champion, also uses reinforcement learning [91]. When we combine deep network architecture with reinforcement learning, we get deep reinforcement learning (DRL), which can extend the use of reinforcement learning to even more complex games and areas such as robotics, smart grids, healthcare, finance, etc. [92]. With DRL, problems that were intractable with reinforcement learning can now be solved with a higher number of hidden layers of deep networks and the reinforcement learning based Q-learning algorithm that maximizes the reward for actions taken by the agent [13].

7.20. Generative Adversarial Network (GAN)
GANs consist of generative and discriminative neural networks. The generative network generates completely new (fake) data based on input data (unsupervised learning) and the discriminative network attempts to distinguish whether the data is real (from the training set) or generated. The generative network is trained to increase the probability of deceiving the discriminative network, i.e., to make the generated data indistinguishable from the original. GANs were proposed by Goodfellow et al. [93] in 2014. They have been very popular as they have many applications, both good and bad. E.g., [94] were able to successfully synthesize realistic images from text.

7.21. Multi-approach method for enhancing deep learning
Deep learning can be optimized in different areas. We discussed training algorithm enhancements, parallel processing, parameter optimizations and various architectures. All these areas can be simultaneously implemented in a framework to get the best results for specific problems. The training algorithms can be fine-tuned at different levels by incorporating heuristics, e.g., for hyperparameter optimization. The time to train a deep learning network model is a major factor in gauging the performance of an algorithm or network. Instead of training the network with the full data set, we can pre-select a smaller but representative data set from the full training distribution set using instance selection methods [95] or Monte Carlo sampling [48]. An effective sampling method can help prevent overfitting, improve accuracy and speed up the learning process without compromising on the quality of the training dataset. Albelwi and Mahmood [96] designed a framework that combined dataset reduction, deconvolution network, correlation coefficient and an updated objective function. The Nelder-Mead method was used in optimizing the parameters of the objective function and the results were comparable to the latest known results on the MNIST dataset [96]. Thus, combining optimizations at multiple levels and
using multiple methods is a promising field of research and can lead to further advancement in machine learning.

8. Conclusion
In this tutorial, we provided a thorough overview of the neural networks and deep neural networks. We took a deeper dive into the well-known training algorithms and architectures. We highlighted their shortcomings, e.g., getting stuck in the local minima, overfitting and training time for large problem sets. We examined several state-of-the-art ways to overcome these challenges with different optimization methods. We investigated adaptive learning rates and hyperparameter optimization as effective methods to improve the accuracy of the network. We surveyed and reviewed several recent papers, studied them and presented their implementations and improvements to the training process. We also included tables to summarize the content in a concise manner. The tables provide a full view on how different aspects of deep learning are correlated.
Deep Learning is still in its nascent stage. There is tremendous opportunity for exploitation of current algorithms/architectures and further exploration of optimization methods to solve more complex problems. Training is currently constrained by overfitting and training time, and is highly susceptible to getting stuck in local minima. If we can continue to overcome these challenges, deep learning networks will accelerate breakthroughs across all applications of machine learning and artificial intelligence.

Conflicts of Interest: The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and/or in the decision to publish the results.

ORCID:
Ajay Shrestha: http://orcid.org/0000-0001-5595-5953

References

1. Rosenblatt, F., The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev, 1958. 65(6): p. 386-408.
2. Minsky, M. and S. Papert, Perceptrons; an introduction to computational geometry. 1969, Cambridge, Mass.: MIT Press. 258 p.
3. Cybenko, G., Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989. 2(4): p. 303-314.
4. Hornik, K., Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991. 4(2): p. 251-257.
5. Werbos, P.J., Beyond regression: new tools for prediction and analysis in the behavioral sciences. 1975.
6. LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. Nature, 2015. 521(7553): p. 436-44.
7. Jordan, M.I. and T.M. Mitchell, Machine learning: Trends, perspectives, and prospects. Science, 2015. 349(6245): p. 255-60.
8. Ng, A., Machine Learning Yearning: Technical Strategy for AI Engineers In the Era of Deep Learning. 2019: deeplearning.ai.
9. Metz, C., Turing Award Won by 3 Pioneers in Artificial Intelligence, in New York Times. 3/27/2019, The New York Times Company. p. B3.
10. Nagpal, K., et al., Development and Validation of a Deep Learning Algorithm for Improving Gleason Scoring of Prostate Cancer. CoRR, 2018. abs/1811.06497.
11. Nevo, S., et al., ML for Flood Forecasting at Scale. CoRR, 2019. abs/1901.09583.
12. Esteva, A., et al., Dermatologist-level classification of skin cancer with deep neural networks. Nature, 2017. 542: p. 115.
13. Arulkumaran, K., et al., Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine, 2017. 34(6): p. 26-38.
14. Gheisari, M., G. Wang, and M.Z.A. Bhuiyan. A Survey on Deep Learning in Big Data. in 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). 2017.
15. Pouyanfar, S., et al., A Survey on Deep Learning: Algorithms, Techniques, and Applications. ACM Comput. Surv., 2018. 51(5): p. 1-36.
16. Vargas, R., A. Mosavi, and R. Ruiz, Deep learning: A review. 2017.
17. Buhmann, M.D. and M.D. Buhmann, Radial Basis Functions. 2003: Cambridge University Press. 270.
18. Akinduko, A.A., E.M. Mirkes, and A.N. Gorban, SOM: Stochastic initialization versus principal components. Information Sciences, 2016. 364-365: p. 213-221.
19. Chen, K., Deep and Modular Neural Networks, in Springer Handbook of Computational Intelligence, J. Kacprzyk and W. Pedrycz, Editors. 2015, Springer Berlin Heidelberg: Berlin, Heidelberg. p. 473-494.
20. Ng, A.Y. and M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. 2001, MIT Press: Vancouver, British Columbia, Canada. p. 841-848.
21. Bishop, C.M. and J. Lasserre. Generative or Discriminative? Getting the Best of Both Worlds. 2007.
22. Zhou, T., et al., Unsupervised Learning of Depth and Ego-Motion from Video. CoRR, 2017. abs/1704.07813.
23. Chen, X.W. and X. Lin, Big Data Deep Learning: Challenges and Perspectives. IEEE Access, 2014. 2: p. 514-525.
24. LeCun, Y., K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. in Proceedings of 2010 IEEE International Symposium on Circuits and Systems. 2010.
25. Gousios, G., et al., Lean GHTorrent: GitHub data on demand, in Proceedings of the 11th Working Conference on Mining Software Repositories. 2014, ACM: Hyderabad, India. p. 384-387.
26. AI-Index. Top deep learning Github repositories. AI Index 2019; Available from: https://github.com/mbadry1/Top-Deep-Learning.
27. Fern, M., et al., Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res., 2014. 15(1): p. 3133-3181.
28. Lecun, Y., et al., Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998: p. 2278-2324.
29. LeCun, Y. and Y. Bengio, Convolutional networks for images, speech, and time series, in The handbook of brain theory and neural networks, A.A. Michael, Editor. 1998, MIT Press. p. 255-258.
30. Taylor, G.W., et al. Convolutional Learning of Spatio-temporal Features. in Computer Vision – ECCV 2010. 2010. Berlin, Heidelberg: Springer Berlin Heidelberg.
31. Ng, A. Convolutional Neural Network. Unsupervised Feature Learning and Deep Learning (UFLDL) Tutorial 2018 [cited 2018 7/21/2018]; Available from: http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/.
32. Schuler, C.J., et al. A Machine Learning Approach for Non-blind Image Deconvolution. in 2013 IEEE Conference on Computer Vision and Pattern Recognition. 2013.
33. Radford, A., L. Metz, and S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, 2015. abs/1511.06434.
34. Jolliffe, I.T., Principal component analysis. 2nd ed. Springer series in statistics. 2002, New York: Springer. xxix, 487 p.
35. Noda, K., et al. Multimodal integration learning of object manipulation behaviors using deep neural networks. in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2013.
36. Hinton, G.E. and R.R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks. Science, 2006. 313(5786): p. 504.
37. Wang, M., et al., Deep Learning-Based Model Reduction for Distributed Parameter Systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2016. 46(12): p. 1664-1674.
38. Ng, A. Autoencoders. Unsupervised Feature Learning and Deep Learning (UFLDL) Tutorial 2018 [cited 2018 7/21/2018]; Available from: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders.
39. Teh, Y.W. and G.E. Hinton, Rate-coded Restricted Boltzmann Machines for Face Recognition. 2001: p. 908-914.
40. Hinton, G.E., A Practical Guide to Training Restricted Boltzmann Machines, in Neural Networks: Tricks of the Trade: Second Edition, G. Montavon, G.B. Orr, and K.-R. Müller, Editors. 2012, Springer Berlin Heidelberg: Berlin, Heidelberg. p. 599-619.
41. Hochreiter, S. and J. Schmidhuber, Long Short-term Memory. Vol. 9. 1997. 1735-80.
42. Metz, C., Apple is bringing the AI Revolution to your Phone, in Wired. 2016.
43. Gers, F.A., J. Schmidhuber, and F.A. Cummins, Learning to Forget: Continual Prediction with LSTM. Neural Computation, 2000. 12(10): p. 2451-2471.
44. Chung, J., et al., Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. eprint arXiv:1412.3555, 2014: p. arXiv:1412.3555.
45. Cho, K., et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. eprint arXiv:1406.1078, 2014: p. arXiv:1406.1078.
46. Naul, B., et al., A recurrent neural network for classification of unevenly sampled variable stars. Nature Astronomy, 2018. 2(2): p. 151-155.
47. Najafabadi, M.M., et al., Deep learning applications and challenges in big data analytics. Journal of Big Data, 2015. 2(1): p. 1.
48. Goodfellow, I., Y. Bengio, and A. Courville, Deep learning. Adaptive computation and machine learning. 2016, Cambridge, Massachusetts: The MIT Press. xxii, 775 pages.
49. Gavin, H.P., The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems. 2016.
50. Xavier, G. and B. Yoshua, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 2010, PMLR. p. 249-256.
51. Martens, J., Deep learning via Hessian-free optimization, in Proceedings of the 27th International Conference on International