Thesis CNN FPGA California
A Thesis
Presented to the
Faculty of
In Partial Fulfillment of the Requirements for the Degree
Master of Science
In
Electrical Engineering
By
Mark A. Espinosa
2019
SIGNATURE PAGE
ACKNOWLEDGEMENTS
To my mom and dad who have always believed in me in all my endeavors all my life. To my wife who has
kept me going and cheered me on every step of the way. Thank you and love you all.
ABSTRACT
Machine Learning, and its sub-discipline Deep Learning, are quickly gaining popularity. Machine Learning
algorithms have been successfully deployed in a variety of applications such as Natural Language
Processing, Optical Character Recognition, and Speech Recognition. Deep Learning in particular is suited
to Computer Vision and Image Recognition tasks. The Convolutional Neural Networks employed in Deep
Learning train a set of weights and biases which, with each layer of the network, learn to
recognize key features in an image. This work set out to develop a scalable and modular FPGA
implementation for Convolutional Neural Networks. It was the objective of this work to attempt to develop
a system which could be configured to run as many layers as desired and test it using a currently defined
CNN configuration, AlexNet. This type of system would allow a developer to scale a design to fit any size
of FPGA from the most inexpensive to the costliest cutting-edge chip on the market. The objective of this
work was achieved, and all layers were accelerated including Convolution, Affine, ReLu, Max Pool, and
Softmax layers. The performance of the design was assessed, and its maximum performance was determined.
TABLE OF CONTENTS
ABSTRACT ...................................................................................................................................................iv
1.4 AlexNet..........................................................................................................................................5
3.1.2.2 Parameter Sharing........................................................................................................ 21
5.2 Data Format ............................................................................................................................... 44
6.1 Algorithm.................................................................................................................................... 59
6.3.10 Bias Address Register ....................................................................................................... 68
8.1 Algorithm.................................................................................................................................. 103
8.6 Max Pooling Layer - Submodule Definitions and Operations ............................................. 116
9.3.5 Probability 1 Register ..................................................................................................... 128
12.3 Simulation Performance ..................................................................................................... 156
LIST OF TABLES
Table 28: Bias Parameters Register Description .......................................................................................... 69
Table 44: Register List for the Max Pooling Layer Design ........................................................................ 106
Table 49: Input Data Address Register Bit Map ......................................................................................... 109
Table 51: Output Data Address Register Bit Map ...................................................................................... 109
Table 56: Output Parameters Register Description .................................................................................... 111
Table 60: Register List for the Softmax Layer Design ................................................................................ 124
Table 65: Input Data Address Register Bit Map ......................................................................................... 127
Table 67: Output Data Address Register Bit Map ...................................................................................... 127
Table 81: 10 Classes Used to Train the AlexNet Model ............................................................................. 143
Table 83: Memory Read and Write Transactions Per AlexNet Layer ......................................................... 156
Table 84: Simulation Execution time of each AlexNet Layer ...................................................................... 157
Table 85: Hardware execution times of each AlexNet Layer ...................................................................... 162
LIST OF FIGURES
Figure 9: Angel-Eye...................................................................................................................................... 12
Figure 16: The components of a typical convolutional neural network layer. ............................................ 25
Figure 25: Specifications for the Zynq XC7Z020 Artix-7 FPGA .................................................................. 40
Figure 27: Channel Architecture of Writes .................................................................................................. 43
Figure 33: Finite State Machine for the AXI Master Module. ...................................................................... 74
Figure 36: Number of Channel Units able to be used in the design. ............................................................ 81
Figure 38: Volume FIFOs with Mux and Router for the Input Volume FIFOs ............................................ 83
Figure 39: Weight FIFOs with Mux and Router for the Weight Data FIFOs............................................... 83
Figure 41: Shows how new data is read into the Channel Unit FIFOs ........................................................ 85
Figure 44: Shows the design of the Volume and Weight Mux blocks to select the data stream.................... 87
Figure 45: Shows the design of the Volume and Weight Mux blocks to select the data stream enable
signals. .......................................................................................................................................................... 88
Figure 47: Shows the design of the Volume and Weight Mux blocks. .......................................................... 91
Figure 48: Finite State Machine for the Convolution Layer Operation ....................................................... 95
Figure 49: Finite State Machine for the Affine Layer Operation ................................................................. 95
Figure 52: Shows the adder tree employed to sum the product data. ........................................................... 98
Figure 53: Shows the adder tree employed to sum the column data............................................................. 99
Figure 54: Shows the Finite State Machine for the DSP Accumulator logic. ............................................. 100
Figure 55: Shows the Finite State Machine for the Accumulator Relay. .................................................... 100
Figure 60: Top level Architecture of the Max Pooling Layer ..................................................................... 105
Figure 64: Finite State Machine for the Max Pool Layer Row Controller ................................................. 118
Figure 65: A max-heap viewed as (a) a binary tree and (b) an array. ......................................................... 119
Figure 67: Loaded row information is processed through the Heap Sorter ............................................... 121
Figure 68: Finite State Machine for the Heap Sorter ................................................................................. 122
Figure 70: Finite State Machine for the Softmax Layers AXI Master ........................................................ 132
Figure 72: Finite State Machine for the Exponential Function Logic ........................................................ 135
Figure 73: Finite State Machine for the Softmax Adder Wrapper .............................................................. 137
Figure 74: Finite State Machine for the Softmax Divider Wrapper ........................................................... 138
Figure 75: Finite State Machine for the Softmax Controller ...................................................................... 139
Figure 77: Loss and Training Accuracy over 1000 iterations .................................................................... 144
Figure 78: Convolution/Affine Layer Virtual Memory Test Bench ............................................................. 146
Figure 79: Convolution/Affine Layer Block RAM Test Bench .................................................................... 146
Figure 80: Max Pool Layer Virtual Memory Test Bench ........................................................................... 146
Figure 81: Max Pool Layer Block RAM Test Bench ................................................................................... 146
Figure 82: Softmax Layer Virtual Memory Test Bench .............................................................................. 147
Figure 83: Softmax Layer Block RAM Test Bench ..................................................................................... 147
Figure 99: Hardware Performance vs. Ops/Mem Trans. ratio .................................................................. 173
1.0 CURRENT WORK AND PRACTICE
In order to understand the overall motivation for this work, we must look at Deep Learning and its
pervasiveness in our everyday lives, how Convolutional Neural Networks are being implemented in FPGA,
and how FPGAs compare with other platforms such as CPUs and GPUs. This section will explore these
areas and will lend better insight as to how this work hopes to contribute to the field.
The best definition of what Machine Learning is comes from Professor Tom Mitchell of Carnegie Mellon
University. A computer program is said to learn from experience E with respect to some task T and some
performance measure P if its performance on T, as measured by P, improves with experience E.
(Mitchell, 1997) As this definition pertains to image
classification, the experience E would be the experience of having the program classify tens of thousands
of images. The task T would be the task of classifying images, and the performance measure P would be the
fraction of images the program classifies correctly.
Learning algorithms have been successfully applied to a wide variety of areas, games and medical diagnosis
among them. This list is by no means comprehensive, and learning algorithms are applied to new applications every day.
(Mohri et.al., 2014) Moreover, such applications correspond to a wide variety of learning problems. Some
of the most common include:
a. Classification: Assign a category to each item. For example, document classification may assign
items with categories such as politics, business, sports, or weather while image classification may
assign items with categories such as landscape, portrait, or animal. The number of categories in
such tasks is often relatively small, but it can be large, or even unbounded, in some difficult tasks. (Mohri et.al., 2014)
b. Regression: Predict a real value for each item. Examples of regression include prediction of stock
values or variations of economic variables. In this problem, the penalty for an incorrect prediction
depends on the magnitude of the difference between the true and predicted values, in contrast with
the classification problem, where there is typically no notion of closeness between the various categories. (Mohri et.al., 2014)
c. Ranking: Order items according to some criterion. Web search, e.g., returning web pages relevant
to a search query, is the canonical ranking example. Many other similar ranking problems arise in
the context of the design of information extraction or natural language processing systems. (Mohri
et.al., 2014)
d. Clustering: Partition items into homogeneous regions. Clustering is often performed to analyze
very large data sets. For example, in the context of social network analysis, clustering algorithms
are used to identify natural groups, or communities, within large sets of users. (Mohri et.al., 2014)
e. Dimensionality reduction: Transform an initial representation of items into a
lower-dimensional representation of these items while preserving some properties of the initial
representation. (Mohri et.al., 2014)
The two most commonly used types of Machine Learning algorithms are Supervised Learning and Unsupervised Learning:
a. Supervised learning: The learner receives a set of labeled examples as training data and makes
predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. (Mohri et.al., 2014)
b. Unsupervised learning: The learner exclusively receives unlabeled training data and makes
predictions for all unseen points. Since in general no labeled example is available in that setting, it
can be difficult to quantitatively evaluate the performance of the learner. Clustering and
dimensionality reduction are examples of unsupervised learning problems. (Mohri et.al., 2014)
As we can see from the preceding explanation, Machine Learning is used in modern Image Classification.
Since Image Classification is essentially the subject of this work, let’s now look at a specific example.
Machine learning programs can be trained in several different ways. In one type of training, the program is
shown a lot of pictures of different animals and each picture is labeled with the name of the animal; the cats
are all labeled "cat". The program will eventually learn that the animals that look like cats are called "cats"
without ever being programmed to call a picture of a cat a "cat". (Murnane, 2016)
The program does this by learning combinations of features that tend to appear together. Cats have visual
features, such as their body shape, long whiskers, and the way their faces look. These features make them
visually different from other animals. The program learns to associate this distinctive combination of
features with the word "cat". This learning process is usually called constructing a model of a cat.
(Murnane, 2016)
Once it has constructed the cat model, a machine learning program tests the model by trying to identify the
cats in a set of pictures it hasn't seen before. The program measures how well it did at identifying the new
cats and uses this information to adjust the model so it will do a better job of picking out cats the next time
it tries. The new model is then tested, its performance is evaluated, and it receives another adjustment. This
iterative process continues until the program has built a model that can identify cats with a high level of
accuracy. (Murnane, 2016)
1.2 Deep Learning
Deep learning carries out the machine learning process using an artificial neural net that is composed of a
number of levels arranged in a hierarchy. The network learns something simple at the initial level in the
hierarchy and then sends this information to the next level. The next level takes this simple information,
combines it into something that is a bit more complex, and passes it on to the third level. This process
continues as each level in the hierarchy builds something more complex from the input it received from the previous level. (Murnane, 2016)
Continuing the cat example, the initial level of a deep learning network might use differences in the light
and dark areas of an image to learn where edges or lines are in a picture of a cat. The initial level passes
this information about edges to the second level which combines the edges into simple shapes like a
diagonal line or a right angle. The third level combines the simple shapes into more complex objects like
ovals or rectangles. The next level might combine the ovals and rectangles into rudimentary whiskers, paws
and tails. The process continues until it reaches the top level in the hierarchy where the network has learned
to identify cats. While it was learning about cats, the network also learned to identify the other animals it
was shown. (Murnane, 2016)
1.3 Core Deep Learning Architectures
As we have seen in the above explanations of what Machine Learning and Deep Learning are, the most
critical aspect of performing this kind of Artificial Intelligence work is the data set being used to train the
neural network model. For the task of Image Classification, many researchers have been using the well-known ImageNet database.
ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet is a large lexical
database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms
(synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and
lexical relations. The resulting network of meaningfully related words and concepts can be navigated with a
browser. (Fellbaum, 1998) Each meaningful concept in WordNet, possibly described by multiple words or
word phrases, is called a “synonym set” or “synset”. There are more than 100,000 synsets in WordNet, the
majority of them nouns.
In 2010 ImageNet announced the first ever ImageNet Challenge, a competition to determine which Deep
Learning algorithm could best identify the content of photographs. The validation and test data from
ImageNet and for this competition consisted of 200,000 photographs, which were collected from flickr and
other search engines. These photographs were hand labeled with the presence or absence of 1000 distinct
object categories. The same competition has been held every year since then and announces the winning
algorithms in tasks such as Object Localization and Object Detection. (Li, F., et. al, 2015)
1.4 AlexNet
In 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton used the ImageNet database in order to
develop the Deep Learning architecture which came to be known as AlexNet. They trained a
large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet
LSVRC-2010 contest into 1000 different classes for the 2012 competition. (Li, F., et. al, 2017) They
achieved a top-1 error rate of 37.5% and a top-5 error rate of 17.0%. The neural
network consisted of 60 million parameters, 650,000 neurons, five Convolutional Layers followed by ReLu
and Max Pool Layers, three Fully Connected Layers, and a 1000-way Softmax Classifier. (Krizhevsky,
2012) Below is an illustration of the overall architecture of the AlexNet model. Their work was done on
two GTX 580 GPUs. This implementation was the first to use a Rectified Linear Unit as an activation layer
rather than a Sigmoid Activation function. This implementation also won the ImageNet LSVRC-2012
competition.
1.5 VGGnet
In 2014 Karen Simonyan and Andrew Zisserman working for the Visual Geometry Group at the University
of Oxford created VGGnet. Their work expanded on the Alexnet architecture by adding more
convolutional layers and using small receptive fields such as 3x3 or 1x1. Different configurations of their
network were tested, with each configuration following a generic design and differing only in depth,
from 11 to 13 to 19 weight layers. Each of these configurations had sub-configurations as well.
(Zisserman,2014)
During their experimentation, two VGGnet configurations VGG16 and VGG19 performed the best. Error
rates for VGG16 were a max of 27.3% and an average of 8.1%. Error rates for VGG19 were a max of
25.5% and an average of 8.0%. (Zisserman,2014) By adding more layers to the network the number of
parameters in both VGG16 and VGG19 increased to 138 and 144 million respectively. (Zisserman,2014)
Figure 4 shows the two VGGnet configurations as well as to how they compare to their predecessor
AlexNet. This VGGnet architecture won the 2014 ImageNet LSRVC Challenge. (Li, F., et. al, 2017)
1.6 ResNet
In 2015, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft created the ResNet
architecture which sought to improve on what VGGnet achieved by recognizing that network depth was
crucial to a more accurate architecture. (He et. al., 2015) They also recognized that attempting to train a
very deep network was plagued by the “vanishing gradient” problem which occurred during the back-
propagation phase of the neural network training. They proposed “deep residual learning” which is a new
framework that introduces “Shortcut Connections”. (He et. al., 2015) They hypothesized that this structure
would allow training and optimization to be much easier and solve the “vanishing gradient” problem. (Li, F., et. al, 2017)
The baseline for the Resnet architecture is VGGnet; however, the convolutional layers use only 3x3 filter
weight kernels. For their experimentation they constructed a 34-layer plain network with no Shortcut
connections and a 34 Layer network with shortcut connections, a Resnet. They also configured several
networks with incrementally increasing layer count from 34 layers to 152 layers. Overall, 34-layer ResNet
outperformed the 34-layer plain network and the average error rate achieved on the 152-layer network for
the 2015 ImageNet LSVRC competition was 3.57%. This network architecture won the 2015 ImageNet
LSVRC Challenge. (He et. al., 2015)
Figure 5: Residual Learning: a building block of the ResNet architecture.
(Li, F., et. al, 2017)
1.7 Convolutional Neural Networks in FPGA: Literature Review
Now that we have introduced the world of Deep Learning and highlighted the most celebrated
Convolutional Deep Neural Networks, let's now shift focus onto how Convolutional Neural Networks have
been implemented into FPGA. We must keep in mind that the architectures such as AlexNet, VGGnet, and
ResNet have been developed with only software development in mind. Therefore, these research groups
are composed primarily of computer scientists and mathematicians. Several works have implemented
Convolutional Neural Networks on FPGAs, and this section presents a few of them.
1.7.1 Up to Now
One of the most limiting hardware realizations for Deep Learning techniques on FPGAs is design size. The
trade-off between design reconfigurability and density means that FPGA circuits are often considerably less
dense than hardware alternatives, and so implementing large neural networks has not always been
possible. However, as modern FPGAs continue to exploit smaller feature sizes to increase density and
incorporate hardened computational units alongside generic FPGA fabric, deep networks have started to
be implemented on single-FPGA systems. A brief timeline of important events in FPGA deep learning follows.
1.7.2 Other Works
A group out of Neurocoms in South Korea published and presented their work at the International Joint
Conference on Neural Networks in 2015 where they implemented a Real-Time Video Object Recognition
System using Convolutional Neural Networks. They used a custom 5 Layer neural network architecture
developed first in Matlab. The system used grey-scale input images of 28x28 and each layer of the network
was instantiated as logic all at once. The output of each layer would feed into the input of the next layer.
The design employed the use of logic blocks they named synapses and receptors which formed the layers
for the neural network. (Figure 7) The input images were classified against a 10-class list. This group used
the Xilinx KC705 evaluation board and their logic operated at a frequency of 250MHz. Their measured
power consumption was 3.1 watts. They utilized 42,616 Look Up Tables, 32 Block RAMs, and 326
DSP48 Blocks. The data format used was 16-bit Fixed Point, and the group focused their work on real-time recognition.
Another group from the Institute of Semiconductors from the Chinese Academy of Sciences in Beijing
China attempted a small implementation in 2015. Their implementation ran on an Altera Arria V FPGA
board operating at 50 MHz. The input images were 32x32, used an 8-bit fixed-point data format, and passed through
3 Convolution Layers with Activation, 2 Pooling Layers, and 1 Softmax Classifier. This work
implemented custom processing elements which could be reconfigured when needed. (Figure 8) They
measured their performance based on how many images could be processed. (Li, H et.al., 2015)
Figure 8: Chinese Academy logic architecture
(Li, H et.al., 2015)
A joint effort between Tsinghua University and Stanford University in 2016 yielded the “Angel-Eye”
system. To accelerate the Convolution operation, this system runs on the Xilinx Zynq XC7Z045 platform
and uses an array of processing elements as shown in Figure 9. The logic on the FPGA runs at a clock
frequency of 150 MHz, uses a 16-bit fixed-point data format, consumes 9.63 watts of power, and achieved 187.80
GFLOPS performance while running the VGG16 ConvNet. The resource utilization was not specified in
their published work. The project that produced Angel-Eye also created a custom compiler which attempted
to minimize external memory access and thereby reduce the inherent latency with memory transactions.
Figure 9: Angel-Eye
(Left) Angel-Eye architecture. (Right) Processing Element
(Guo, K. et. al., 2016)
Another work from a group at Purdue University in 2016 developed a Convolutional Neural Network
accelerator which employed the use of custom software tools. These software tools were essentially
compilers which explored optimization methods to improve the performance of the neural network. This
work used the Xilinx Kintex-7 XC7K325T FPGA, achieved 58-115 GFLOPS performance, and ran a
ConvNet with the same number of layers as AlexNet but with smaller channels. (Dundar, A. et. al., 2016)
The resource utilization for their FPGA implementation was not specified.
The last work we will look at comes from the School of Computing, Informatics, and Decision Systems
Engineering at Arizona State University. In 2016, one of their research groups set out to create a scalable
FPGA implementation of a Convolutional Neural Network. This group realized that as ever newer Deep
Learning CNN configurations increase in layer count, an FPGA implementation would need to keep pace.
This group also created a CNN compiler to analyze the input CNN model’s structure and sizes, as well as
the degree of computing parallelism set by users, to generate and integrate parametrized CNN modules.
This implementation ran on a Stratix-V GXA7 FPGA operating with a clock frequency of 100 MHz, uses a
16-bit fixed-point data format, consumes 19.5 watts of power, and achieves 114.5 GFLOPS performance. Their
FPGA resource utilization is 256 DSPs, 112K Look-Up Tables, and 2,330 Block RAMs. Their design
employs the use of a shared multiplier bank for all the multiplication operations, as shown in Figure 10.
Figure 10: Arizona State Implementation
(Top) Convolution acceleration module block diagram. (Bottom) Integration of Scalable CNN Modules.
(Ma, Y et. al., 2016)
2.0 THESIS STATEMENT
As we have seen by reviewing the current state of the art, Machine Learning is a fast-growing field of
practice and research for Computer Scientists as well as Computer Engineers. This work sets out to design
and develop a scalable, reusable, and user-friendly FPGA implementation of Convolutional Neural
Networks for Image Classification tasks. This design would cater to the smaller, more inexpensive FPGAs
on the market as well as to larger cutting-edge devices.
As a road map for the rest of this work, let's look at what follows. First, we
survey the fundamental theory behind Convolutional Neural Networks in Section 3.0. Section 4.0
explores the overall goals of this work's FPGA implementation of Convolutional Neural Networks. Section
5.0 discusses the top-level architectural FPGA implementation involving multiple sub designs
communicating on an AXI Bus. Section 6 through Section 9 discuss how each layer of a Convolutional
Neural Network was accelerated on FPGA. Section 10 and Section 11 discuss how this implementation was
verified in simulation and on hardware respectively. Section 12 reviews the overall results of the work and
offers insight as to how to improve the overall performance. Section 13 closes this work with a brief
summary. The source code and all files pertaining to this work are too numerous to include in the Appendix
of this document. Therefore, links to the Github repositories containing the source files are provided.
3.0 FUNDAMENTAL CONCEPTS
In order to properly delve into the proposed design, we should first review some fundamental theory behind
Convolutional Neural Networks. This section will review the mathematical properties of CNNs as well as
how they are implemented in working systems; it serves as the survey portion of this
work.
Most of the texts available on the subject of Neural Networks delve more deeply into the genesis of Neural
Network research. Most texts cover primarily the Perceptron, Recurrent Neural Network, and LSTM
Neural Network models which have been used in a variety of applications in the decades of Neural
Network research. Finding comprehensive content and coverage of how Convolutional Neural Networks
are designed and function was very challenging. However, the latest information on the current state of
Deep Learning was found in a Stanford University course CS231n which Stanford made open to the public
to view the lectures and do the assignments. This avenue of research enabled this work to be accomplished;
without it, this work would have been at an extreme disadvantage.
Therefore, the section describing the operation of Convolutional Neural Networks is taken from the course
notes and the lectures on the subject. The mathematical formulations behind Convolutional Neural
Networks are taken from the newest text in Deep Learning from Goodfellow et. al. of the Massachusetts
Institute of Technology (Goodfellow, et. al, 2016). Goodfellow's treatment of the topic is very thorough.
Convolutional networks also known as convolutional neural networks, or CNNs, are a specialized kind of
neural network for processing data that has a known grid-like topology. Examples include time-series data,
which can be thought of as a 1-D grid taking samples at regular time intervals, and image data, which can
be thought of as a 2-D grid of pixels. Convolutional networks have been tremendously successful in
practical applications. The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution. Convolutional networks are simply neural networks that use
convolution in place of general matrix multiplication in at least one of their layers. (Goodfellow, et. al, 2016)
In its most general form, convolution is an operation on two functions of a real-valued argument. To
motivate the definition of convolution, we start with examples of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single
output x(t), the position of the spaceship at time t. Both x and t are real valued, that is, we can get a
different reading from the laser sensor at any instant in time. (Goodfellow, et. al, 2016)
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s
position, we would like to average several measurements. Of course, more recent measurements are more
relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We
can do this with a weighting function w(a), where a is the age of a measurement. (Goodfellow, et. al, 2016)
If we apply such a weighted average operation at every moment, we obtain a new function s providing a
smoothed estimate of the position of the spaceship:
$s(t) = \int x(a)\, w(t - a)\, da$
This operation is called convolution. The convolution operation is typically denoted with an asterisk:
$s(t) = (x * w)(t)$
In our example, w needs to be a valid probability density function, or the output will not be a weighted
average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is
presumably beyond our capabilities. These limitations are particular to our example, though. In general,
convolution is defined for any functions for which the above integral is defined and may be used for other
In convolutional network terminology, the first argument (in this example, the function x) to the
convolution is often referred to as the input, and the second argument (in this example, the function w) as
the kernel. The output is sometimes referred to as the feature map. (Goodfellow, et. al, 2016)
In our example, the idea of a laser sensor that can provide measurements at every instant is not realistic.
Usually, when we work with data on a computer, time will be discretized, and our sensor will provide data
at regular intervals. In our example, it might be more realistic to assume that our laser provides a
measurement once per second. The time index t can then take on only integer values. If we now assume
that x and w are defined only on integer t, we can define the discrete convolution:
$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$
In machine learning applications, the input is usually a multidimensional array of data, and the kernel is
usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to
these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly
stored separately, we usually assume that these functions are zero everywhere but in the finite set of points
for which we store the values. This means that in practice, we can implement the infinite summation as a
summation over a finite number of array elements. (Goodfellow, et. al, 2016)
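As a concrete illustration of the discrete convolution just defined, the short sketch below smooths a noisy 1-D signal with a small weighting kernel. The signal and kernel values are illustrative only and are not taken from this work.

import numpy as np

# Noisy 1-D sensor readings x(t), sampled once per "second" (illustrative data).
x = np.array([0.0, 1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2])

# Weighting kernel w(a): more weight on recent measurements (age a = 0),
# normalized so the output is a valid weighted average.
w = np.array([0.5, 0.3, 0.2])

# Discrete convolution s(t) = sum_a x(a) * w(t - a).  numpy.convolve flips the
# kernel, matching the mathematical definition; 'valid' keeps only positions
# where the kernel fully overlaps the signal.
s = np.convolve(x, w, mode='valid')
print(s)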
Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-
dimensional image I as our input, we probably also want to use a two-dimensional kernel K:
$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)$
Convolution is commutative, meaning we can equivalently write:
$S(i, j) = (K * I)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n)$
Usually the latter formula is more straightforward to implement in a machine learning library, because
there is less variation in the range of valid values of m and n. (Goodfellow, et. al, 2016)
The commutative property of convolution arises because we have flipped the kernel relative to the input, in
the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The
only reason to flip the kernel is to obtain the commutative property. While the commutative property is
useful for writing proofs, it is not usually an important property of a neural network implementation.
Instead, many neural network libraries implement a related function called the cross-correlation, which is
the same as convolution but without flipping the kernel:
$S(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n)$
Many machine learning libraries implement cross-correlation but call it convolution. In this text we follow
this convention of calling both operations convolution and specify whether we mean to flip the kernel or
not in contexts where kernel flipping is relevant. In the context of machine learning, the learning algorithm
will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on
convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an
algorithm without the flipping. It is also rare for convolution to be used alone in machine learning; instead
convolution is used simultaneously with other functions, and the combination of these functions does not
commute regardless of whether the convolution operation flips its kernel or not. (Goodfellow, et. al, 2016)
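The kernel-flipping distinction can be checked numerically. The sketch below, using SciPy's 2-D routines on arbitrary example arrays, verifies that cross-correlating with a flipped kernel reproduces true convolution, which is why libraries can use either operation under the name "convolution".

import numpy as np
from scipy.signal import convolve2d, correlate2d

# A small image I and kernel K (illustrative values).
I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# True convolution flips the kernel; cross-correlation does not.
conv = convolve2d(I, K, mode='valid')
xcorr = correlate2d(I, K, mode='valid')

# Cross-correlating with the flipped kernel reproduces the convolution.
assert np.allclose(conv, correlate2d(I, np.flip(K), mode='valid'))
print(conv)
print(xcorr)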
Figure 11: An example of 2-D convolution without kernel flipping.
We restrict the output to only positions where the kernel lies entirely within the image, called “valid”
convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the
output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
(Goodfellow, et. al, 2016)
Convolution leverages three important ideas that can help improve a machine learning system: sparse
interactions, parameter sharing, and equivariant representations. Moreover, convolution provides a means
for working with inputs of variable size. We now describe each of these ideas in turn. (Goodfellow, et. al,
2016)
3.1.2.1 Sparse Interactions
Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate
parameter describing the interaction between each input unit and each output unit. This means that every
output unit interacts with every input unit. Convolutional networks, however, typically have sparse
interactions (also referred to as sparse connectivity or sparse weights). This is accomplished by making the
kernel smaller than the input. For example, when processing an image, the input image might have
thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels
that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both
reduces the memory requirements of the model and improves its statistical efficiency. It also means that
computing the output requires fewer operations. These improvements in efficiency are usually quite large.
If there are m inputs and n outputs, then matrix multiplication requires m × n parameters, and the
algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections
each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k ×
n) runtime. For many practical applications, it is possible to obtain good performance on the machine
learning task while keeping k several orders of magnitude smaller than m. For graphical demonstrations of
sparse connectivity, see Figure 12 and Figure 13. In a deep convolutional network, units in the deeper
layers may indirectly interact with a larger portion of the input, as shown in Figure 14. This allows the
network to efficiently describe complicated interactions between many variables by constructing such
interactions from simple building blocks that each describe only sparse interactions. (Goodfellow, et. al,
2016)
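A quick back-of-the-envelope calculation makes the O(m x n) versus O(k x n) comparison concrete; the image and kernel sizes below are illustrative, not taken from this design.

# Dense vs. sparse (convolutional) connectivity, with illustrative sizes:
# m inputs, n outputs, and k connections per output.
m = 320 * 280          # input pixels
n = 320 * 280          # output units (same spatial size, one channel)
k = 3 * 3              # a 3x3 kernel touches only 9 inputs per output

dense_params = m * n   # fully connected: O(m * n) parameters and runtime
sparse_params = k * n  # sparsely connected: O(k * n) parameters and runtime

print(f"dense:  {dense_params:,} parameters")
print(f"sparse: {sparse_params:,} parameters  ({dense_params // sparse_params:,}x fewer)")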
3.1.2.2 Parameter Sharing
Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional
neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It
is multiplied by one element of the input and then never revisited. As a synonym for parameter sharing, one
can say that a network has tied weights, because the value of the weight applied to one input is tied to the
value of a weight applied elsewhere. In a convolutional neural net, each member of the kernel is used at
every position of the input (except perhaps some of the boundary pixels, depending on the design decisions
regarding the boundary). The parameter sharing used by the convolution operation means that rather than
learning a separate set of parameters for every location, we learn only one set. This does not affect the
runtime of forward propagation—it is still O(k × n)—but it does further reduce the storage requirements of
the model to k parameters. Recall that k is usually several orders of magnitude smaller than m. Since m and
n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus
dramatically more efficient than dense matrix multiplication in terms of the memory requirements and
statistical efficiency. For a graphical depiction of how parameter sharing works, see Figure 15.
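The snippet below illustrates parameter sharing in one dimension: a single three-element kernel supplies the weights for every output position, so the layer stores only k = 3 parameters regardless of the input length. The sizes are arbitrary and chosen only for illustration.

import numpy as np

# Parameter sharing: one k-element kernel is reused at every output position,
# so the layer stores only k weights (here k = 3) regardless of the input size.
x = np.random.randn(1000)    # 1000-sample input
kernel = np.random.randn(3)  # the only trainable parameters of this layer

# Each output reuses the same 3 weights; a dense layer mapping 1000 inputs to
# 998 outputs would instead need 1000 * 998 separate weights.
y = np.convolve(x, kernel, mode='valid')
print(kernel.size, "shared parameters produce", y.size, "outputs")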
Figure 12: Sparse connectivity, viewed from below.
We highlight one input unit, x3, and highlight the output units in s that are affected by this unit. (Top) When
s is formed by convolution with a kernel of width 3, only three outputs are affected by x3. (Bottom) When s is
formed by matrix multiplication, connectivity is no longer sparse, so all the outputs are affected by x3.
(Goodfellow, et. al, 2016)
Figure 14: Receptive Field.
The receptive field of the units in the deeper layers of a convolutional network
is larger than the receptive field of the units in the shallow layers. This effect increases if
the network includes architectural features like strided convolution (Figure 17) or pooling. This means that
even though direct connections in a convolutional net are
very sparse, units in the deeper layers can be indirectly connected to all or most of the
input image. (Goodfellow, et. al, 2016)
3.1.2.3 Equivariant Representations
In the case of convolution, the particular form of parameter sharing causes the layer to have a property
called equivariance to translation. To say a function is equivariant means that if the input changes, the
output changes in the same way.
Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if
we let g be any function that translates the input, that is, shifts it, then the convolution function is
equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a
function mapping one image function to another image function, such that I’=g(I) is the image function
with I’(x, y) = I(x −1, y). This shifts every pixel of I one unit to the right. If we apply this transformation to
I, then apply convolution, the result will be the same as if we applied convolution to I’, then applied the
transformation g to the output. When processing time-series data, this means that convolution produces a
sort of timeline that shows when different features appear in the input. (Goodfellow, et. al, 2016)
If we move an event later in time in the input, the exact same representation of it will appear in the output,
just later. Similarly, with images, convolution creates a 2-D map of where certain features appear in the
input. If we move the object in the input, its representation will move the same amount in the output. This
is useful for when we know that some function of a small number of neighboring pixels is useful when
applied to multiple input locations. For example, when processing images, it is useful to detect edges in the
first layer of a convolutional network. The same edges appear more or less everywhere in the image, so it is practical to share parameters across
the entire image. In some cases, we may not wish to share parameters across the entire image. For example,
if we are processing images that are cropped to be centered on an individual’s face, we probably want to
extract different features at different locations—the part of the network processing the top of the face needs
to look for eyebrows, while the part of the network processing the bottom of the face needs to look for a
chin. Convolution is not naturally equivariant to some other transformations, such as changes in the scale or
rotation of an image. Other mechanisms are necessary for handling these kinds of transformations.
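The equivariance property can be checked numerically. The sketch below uses a circular shift and wrap-around boundary handling (to sidestep edge effects in this small demo) and verifies that shifting then convolving gives the same result as convolving then shifting; the data are random and purely illustrative.

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
I = rng.random((8, 8))
K = rng.random((3, 3))

def g(img):
    """g(I): translate the image one pixel to the right (circularly)."""
    return np.roll(img, 1, axis=1)

# Convolve-then-shift equals shift-then-convolve: f(g(I)) == g(f(I)).
f_of_g = correlate2d(g(I), K, mode='same', boundary='wrap')
g_of_f = g(correlate2d(I, K, mode='same', boundary='wrap'))
print(np.allclose(f_of_g, g_of_f))  # True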
Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a
fixed-shape matrix. Convolution enables processing of some of these kinds of data. (Goodfellow, et. al,
2016)
A typical layer of a convolutional network consists of three stages (see Figure 16). In the first stage, the
layer performs several convolutions in parallel to produce a set of linear activations. In the second stage,
each linear activation is run through a nonlinear activation function, such as the rectified linear activation
function. This stage is sometimes called the detector stage. In the third stage, we use a pooling function to
modify the output of the layer further. (Goodfellow, et. al, 2016)
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby
outputs. For example, the max pooling operation reports the maximum output within a rectangular
neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2
norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
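A minimal sketch of the detector and pooling stages described above follows, applied to a stand-in for the linear activations produced by the convolution stage; the array sizes are illustrative only.

import numpy as np

def relu(x):
    """Detector stage: elementwise rectified linear activation, max(0, x)."""
    return np.maximum(0.0, x)

def max_pool_2x2(x):
    """Pooling stage: report the maximum within each non-overlapping 2x2 window."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Pretend 'z' is the set of linear activations produced by the convolution stage.
rng = np.random.default_rng(1)
z = rng.standard_normal((6, 6))

a = relu(z)            # second stage: nonlinearity
p = max_pool_2x2(a)    # third stage: spatial summary statistic
print(z.shape, "->", p.shape)   # (6, 6) -> (3, 3)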
3.2 Convolutional Neural Network in Practice
When discussing convolution neural networks, we usually do not refer exactly to the standard discrete
convolution operation as it is usually understood in the mathematical literature and as it was detailed in the
previous section. The functions used in practice differ slightly. Here we describe these differences in detail
and highlight some useful properties of the functions used in neural networks. (Goodfellow, et. al, 2016)
We also look more closely at how Convolutional Neural Networks are structured and what layers they are composed of.
First, when we refer to convolution in the context of neural networks, we usually mean an operation that
consists of many applications of convolution in parallel. This is because convolution with a single kernel
can extract only one kind of feature, albeit at many spatial locations. Usually we want each layer of our
network to extract many kinds of features, at many locations. (Goodfellow, et. al, 2016)
Additionally, the input is usually not just a grid of real values. Rather, it is a grid of vector-valued
observations. For example, a color image has a red, green and blue intensity at each pixel. In a multilayer
convolutional network, the input to the second layer is the output of the first layer, which usually has the
output of many different convolutions at each position. When working with images, we usually think of the
input and output of the convolution as being 3-D tensors, with one index into the different channels and two
indices into the spatial coordinates of each channel. Software implementations usually work in batch mode,
so they will actually use 4-D tensors, with the fourth axis indexing different examples in the batch, but we
will omit the batch axis in our description here for simplicity. (Goodfellow, et. al, 2016)
Because convolutional networks usually use multichannel convolution, the linear operations they are based
on are not guaranteed to be commutative, even if kernel flipping is used. These multichannel operations are
only commutative if each operation has the same number of output channels as input channels.
Assume we have a 4-D kernel tensor K with element Ki,j,k,l giving the connection strength between a unit in
channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between
the output unit and the input unit. Assume our input consists of observed data V with element Vi,j,k giving
the value of the input unit within channel i at row j and column k. Assume our output consists of Z with the
same format as V. If Z is produced by convolving K across V without flipping K, then
$Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m-1,\, k+n-1}\, K_{i,l,m,n}$
where the summation over l, m and n is over all values for which the tensor indexing operations inside the
summation are valid. In linear algebra notation we index into arrays using a 1 for the first entry. This
necessitates the −1 in the above formula. Programming languages such as C and Python index starting from
0 which renders the above expression even simpler. (Goodfellow, et. al, 2016)
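A direct, loop-based transcription of the formula above (in 0-indexed form, without kernel flipping and with unit stride) might look like the following; the tensor sizes are illustrative and the code favors clarity over speed.

import numpy as np

def multichannel_conv(V, K):
    """Z[i, j, k] = sum over l, m, n of V[l, j+m, k+n] * K[i, l, m, n]
    (the 0-indexed form of the formula above: cross-correlation with no
    kernel flipping, unit stride, 'valid' output)."""
    out_ch, in_ch, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, H - kh + 1, W - kw + 1))
    for i in range(out_ch):
        for j in range(H - kh + 1):
            for k in range(W - kw + 1):
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z

# Illustrative sizes: a 3-channel 8x8 input and four 3x3 kernels.
rng = np.random.default_rng(2)
V = rng.standard_normal((3, 8, 8))
K = rng.standard_normal((4, 3, 3, 3))
print(multichannel_conv(V, K).shape)  # (4, 6, 6)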
3.2.2 Stride
We may want to skip over some positions of the kernel to reduce the computational cost (at the expense of
not extracting our features as finely). We can think of this as down sampling the output of the full
convolution function. If we want to sample only every s pixels in each direction in the output, then we can
define a down sampled convolution function c such that
$Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} V_{l,\, (j-1)\times s+m,\, (k-1)\times s+n}\, K_{i,l,m,n}$
We refer to s as the stride of this down sampled convolution. It is also possible to define a separate stride
for each direction of motion. See Figure 17 for an illustration. (Goodfellow, et. al, 2016)
Figure 17: Convolution with a stride.
In this example, we use a stride of two. (Top)Convolution with a stride length of two implemented in a
single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to
convolution with unit stride followed by down sampling. Obviously, the two-step approach involving down
sampling is computationally wasteful, because it computes many values
that are then discarded. (Goodfellow, et. al, 2016)
3.2.3 Padding
One essential feature of any convolutional network implementation is the ability to implicitly zero pad the
input V to make it wider. Without this feature, the width of the representation shrinks by one pixel less than
the kernel width at each layer. Zero padding the input allows us to control the kernel width and the size of
the output independently. Without zero padding, we are forced to choose between shrinking the spatial
extent of the network rapidly and using small kernels—both scenarios that significantly limit the expressive
power of the network. See Figure 18 for an example. (Goodfellow, et. al, 2016)
Figure 18: The effect of zero padding on network size.
Consider a convolutional network with a kernel of width six at every layer. In this example, we do not use
any pooling, so only the convolution operation itself shrinks the network size. (Top)In this convolutional
network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at
each layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers,
and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional.
The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive,
and some shrinking is inevitable in this kind of architecture. (Bottom)By adding five implicit zeros to each
layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep
convolutional network. (Goodfellow, et. al, 2016)
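Stride and zero padding together determine the spatial size of a layer's output. The helper below implements the commonly used sizing formula (W - F + 2P)/S + 1 (as presented, e.g., in the CS231n course notes cited in this work); the example layer dimensions are illustrative.

def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1.
    Assumes the chosen stride divides the padded extent evenly."""
    assert (input_size - kernel_size + 2 * pad) % stride == 0, "stride does not fit evenly"
    return (input_size - kernel_size + 2 * pad) // stride + 1

# AlexNet's first layer is the classic example: 227x227 input, 11x11 kernels,
# stride 4, no padding -> 55x55 output.
print(conv_output_size(227, 11, stride=4, pad=0))   # 55
# Zero padding a 32x32 input by 1 with 3x3 kernels preserves the 32x32 size.
print(conv_output_size(32, 3, stride=1, pad=1))     # 32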
3.2.4 Putting It All Together
So now that we have reviewed the fundamental concepts of Convolutional Neural Networks, we now look
more closely as to how a system employing CNNs uses specific layers. As described in the previous
section, a simple Convolutional Network is a sequence of layers, and every layer of the network transforms
one volume of activations to another through a differentiable function. (Li, F., et. al, 2017) We use four
main types of layers:
a. Convolutional Layer
b. ReLu (Activation) Layer
c. Pooling Layer
d. Fully Connected (Affine) Layer
A full Convolutional Neural Network Architecture stacks these layers. Let’s briefly look at an example
architecture in order to understand how practical Convolutional Neural Networks are constructed. Let’s say
we have an input image from the CIFAR-10 data set of a car with dimensions 32x32x3. These dimensions
are 32 pixels high, 32 pixels wide, and 3 channels or colors in this case. Like the ImageNet database, the
CIFAR-10 dataset is a much smaller database of images with each image 32x32x3 and from only 10
classes. (Li, F., et. al, 2017) A simple ConvNet for CIFAR-10 classification could have the layer
architecture [INPUT – CONV – RELU – POOL – FC]. (Li, F., et. al, 2017)
Looking at these layers in more detail, we see that:
a. Input Image: The example image is of dimension 32x32x3 and will hold the raw pixel values of
the image, in this case an image of width 32, height 32, and with three color channels R, G, B. (Li, F., et. al, 2017)
b. CONV Layer: The CONV layer will compute the output of neurons that are connected to local
regions in the input, each computing a dot product between their weights and a small region they
are connected to in the input volume. This may result in a volume such as [32x32x12] if we decided to use 12 filters. (Li, F., et. al, 2017)
c. ReLu Layer: The RELU layer will apply an elementwise activation function, such as the max(0, x)
thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]). (Li, F., et. al,
2017)
d. Maxpool Layer: The POOL layer will perform a down sampling operation along the spatial
dimensions (width, height), resulting in volume such as [16x16x12]. (Li, F., et. al, 2017)
e. Affine / Fully Connected Layer: The FC (i.e. fully connected) layer will compute the class scores,
resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score,
such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name
implies, each neuron in this layer will be connected to all the numbers in the previous volume. (Li, F., et. al, 2017)
In this way, ConvNets transform the original image layer by layer from the original pixel values to the final
class scores. Note that some layers contain parameters and other don’t. In particular, the CONV/FC layers
perform transformations that are a function of not only the activations in the input volume, but also of the
parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will
implement a fixed function. The parameters in the CONV/FC layers are trained with gradient descent so
that the class scores that the ConvNet computes are consistent with the labels in the training set for each image. (Li, F., et. al, 2017)
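The volume sizes quoted in the list above can be tracked with a few lines of bookkeeping code; this is only a sketch of the example architecture, with the filter count of 12 taken from the text.

# Shape walk-through of the simple CIFAR-10 ConvNet described above
# (INPUT -> CONV -> RELU -> POOL -> FC), tracking only tensor dimensions.
input_shape = (32, 32, 3)     # 32x32 RGB image

conv_shape = (32, 32, 12)     # CONV: 12 filters, padding chosen to preserve 32x32
relu_shape = conv_shape       # RELU: elementwise max(0, x), shape unchanged
pool_shape = (16, 16, 12)     # POOL: 2x2 down sampling along width and height
fc_shape = (1, 1, 10)         # FC: one score per CIFAR-10 class

for name, shape in [("INPUT", input_shape), ("CONV", conv_shape),
                    ("RELU", relu_shape), ("POOL", pool_shape), ("FC", fc_shape)]:
    print(f"{name:6s} {shape}")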
Figure 20: Example Image Classified using Example Architecture.
The activations of an example ConvNet architecture. The initial volume stores the raw image pixels (left)
and the last volume stores the class scores (right). Each volume of activations along the processing path is
shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The
last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and
print the labels of each one.
(Li, F., et. al, 2017)
4.0 DCNN DESIGN AND IMPLEMENTATION
Up to this point we have reviewed the necessary background behind Deep Learning, reviewed several
celebrated Deep Convolutional Neural Network architectures, and reviewed how Deep Convolutional
Neural networks have been implemented into FPGA. Now that the necessary background has been
established in this subject, let’s now look at the particulars of this design.
This section will describe how this work implements Deep Convolutional Neural Networks in FPGA. We
will start with an overview of the design’s similarities and distinctions with previous works, the goal of the
design, and the tools which will be used. We will then move onto describing the overall architecture to be
implemented onto the FPGA. Finally, we will review the major sub-designs in detail: the
Convolutional/Affine Layer, ReLu Layer, Max Pooling Layer, and Softmax Layer.
4.1 Similarities
In reviewing the previous works by other groups who have implemented DCNNs on FPGAs we can see
that there are many similarities in their implementations and this work. Of course, there are a few aspects of
DCNNs which will need to be common across any design which undertakes the acceleration of DCNNs.
Therefore, necessary aspects such as the needed layers (i.e. convolution, ReLu, max pool, etc.) and adder
trees to sum the products of all channels will not be discussed in this section.
First, in previous works we see designs using sub-module intercommunication. Many of the previous
works which utilized separate sub-modules in their overall design used a communication bus protocol.
This approach has the benefit of allowing already designed intellectual property from a FPGA
manufacturer such as Intel or Xilinx to be used. This allows the design to be focused on the DCNN portion
of the overall task rather than the infrastructure of the design itself. Another benefit is hardware
microprocessors or implemented co-processors can communicate with the submodules themselves. This
provides a software developer as well as a hardware developer great insight into the state of the design and
is invaluable when debugging and verifying the design functionality. The only drawback is that a bus
protocol can add to the overall overhead of a design. This is due to the handshaking between sub-modules
in order to establish reliable communication. The presence of the bus protocol also requires more signal
routing which does use overall FPGA resources and can lead to more dwell time with no task being
executed. The drawbacks of a bus protocol can be successfully managed by carefully planning out the
Another common feature found in most of the previous works and in this work is the use of Digital Signal
Processing Slices. These slices are dedicated hardware which can perform multiply and add operations for
both floating point and fixed precision numbers. These slices are usually much faster than any custom
design implemented by a designer which would need to be described in an HDL language. Previous
designs have leveraged the parallel nature of FPGAs by using as many if not all of the DSP slices available on
the FPGA being used. In the design of DCNNs, as in many other designs, the more DSP slices that can be used in parallel, the better the achievable performance.
4.1.3 Scalability
An important feature found in other works and in this work is the notion of stepping through the DCNN.
As was seen in other works, software implementations of DCNNs are getting ever bigger year after year,
with the original ResNet work utilizing up to 152 layers. Implementing every single layer simultaneously in an
FPGA would be so resource intensive that the design would very quickly fail to fit into many FPGAs on the market.
However, other works as well as this one have taken the approach of implementing reusable designs which
can be programmed to perform the function of all the layers necessary in the DCNN architecture.
In the software environment, Deep Learning research is conducted with weight data as 64-bit double
precision floating point signed numbers. Although there have been works which have implemented
DCNNs with 32-bit single-precision numbers, there is a growing body of evidence that reducing the bit width
and format of these numbers can significantly improve overall performance. A common alteration of the
number format is to use 16-bit fixed-point numbers. However, a more interesting and arguably better
design would be to truncate the 32-bit single-precision number down to a 16-bit “half” precision number.
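As a rough illustration of the storage savings involved, the snippet below converts single-precision weights to half precision with NumPy; note that NumPy rounds during the conversion rather than strictly truncating the mantissa, so this only approximates the hardware truncation discussed here.

import numpy as np

weights32 = np.random.randn(1000).astype(np.float32)
weights16 = weights32.astype(np.float16)   # 32-bit single -> 16-bit half

print(weights32.nbytes, "bytes ->", weights16.nbytes, "bytes")
print("max conversion error:", np.max(np.abs(weights32 - weights16.astype(np.float32))))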
4.2 Distinctions
With the similarities between other works and this one fully described; we can now look at how this design
First, many previous works put a significant amount of effort into creating a custom compiler in order to
fully describe a Deep Convolutional Neural Network. This compiled network description would then be
loaded into the design. This was done in an effort to extract as much performance as possible, since the
compiler would analyze the proposed CNN configuration and generate scripts for the FPGA.
However, in this design the desire is to make the DCNN as useful as possible to a software designer and
hardware designer alike. In this work, the FPGA hardware is programmable and can be commanded by
function calls made on the design's microprocessor, which essentially perform register writes to the FPGA
implementation. Therefore, software can be written to command the DCNN FPGA Layers and trigger their
execution, all while avoiding a custom compiler. This is the approach taken in this work.
Another attempt made by many groups is to use High Level Synthesis (HLS) tools, which serve as a bridge
between software C++ code and a Hardware Description Language (HDL). These HLS tools attempt to
create synthesizable HDL code from user C code. The use of these HLS tools, while a valuable aid in
development, can lead to inefficiencies in the FPGA's resource utilization. Therefore, for this work the
design is written in VHDL rather than Verilog or HLS tools. There is a point of view that further
DCNN-on-FPGA research should focus on Deep Learning algorithms which cater mainly to hardware design.
Although this viewpoint is understandable and would be ideal, it ignores the very likely fact that much of
the Deep Learning algorithm development is and will continue to be conducted by Computer Scientists and
Mathematicians. Therefore, a good case can be made to attempt to accelerate the current Deep Learning
algorithms which are the state of the art. Doing this and providing a user-friendly interface
between the FPGA design and the software designer is an ideal scenario.
In previous works the design of a DCNN is largely tailored to the specific hardware board which the
research group is using. The specific number of DSPs needed for their design was a set value. This work
attempts to allow the number of DSPs used in the design to be configurable depending on the FPGA being
targeted. This would give a developer the ability to use a wide variety of FPGA chips instead of tailoring
their design to a single device.
Another feature of this work is the fact that each layer in the DCNN is modular and can be used in
conjunction with a bus protocol. This allows a developer to insert multiple instances of the Convolutional
Layer, Affine Layer, and Max Pooling Layer. The functions of each of these layers are broken up into
smaller submodules, which are described in the chapters that follow.
4.3 Goals
The primary goal of this design is to develop a working Deep Convolutional Neural Network FPGA
accelerator for image classification. Several features were desired in the design in order to allow for
flexibility and reuse:
1. All Layers Accelerated: For this design all the Layers in Deep Convolutional Neural Networks
were to be accelerated. These layers are the Convolutional Layer, ReLu Layer, Max Pooling
Layer, Affine Layer, and Softmax Layer. Therefore, an FPGA VHDL design is needed for each
of these layer types.
2. Modularity: In order to allow future code reuse, all layers in the design would need to be modular.
3. Scalability: In order to allow multiple FPGA platforms to be used, the design would have to be
able to scale the amount of FPGA resources it uses to the target device.
4. Programmability: This design would need to allow for software to perform register writes to
configure the layers of the neural network, thereby, allowing any size network to be run on the
FPGA.
4.4 Tools
Several tools were used throughout the development effort of implementing a Convolutional Neural
Network onto an FPGA. The Xilinx line of chips was selected due to its widespread use and this
author's previous familiarity with Xilinx products. Therefore, the tools used for this development effort
were from the wide array of tools Xilinx offers. The primary design environment was the Xilinx Vivado
2016.4 package (Figure 21) which served as the main design hub throughout development. From here each
layer type in the neural network was created as an AXI capable submodule. Vivado also provided the
integration with already made Xilinx IP such as the Microblaze Co-Processor and Memory Interface
Generator (MIG). Lastly, Vivado also provided a springboard for the software development which followed.
Another tool which allowed for hardware debugging and software design was the Xilinx Software
Development Kit (SDK) 2016.4, as shown in Figure 22. This tool provided the XSCT Console, which
enables a user to directly command and read any memory location or register in the FPGA design from a
PC workstation.
4.5 Hardware
For this work, two potential FPGA boards were selected due to their availability and multitude of features
allowing for easier integration and testing. The Zedboard shown in Figure 23(left) utilizes the Xilinx Zynq
XC7Z020 System-On-Chip which hosts an ARM Processor as well as an Artix-7 logic fabric. The Nexys
Video is a more powerful board which utilizes the Xilinx Artix-7 XC7A200T FPGA and is shown in
Figure 23(right).
Figure 23: (Left) Zedboard (Right) Nexys Video
The specifications for each FPGA are shown in Figure 24 and Figure 25. Looking at these figures we can
see that the XC7A200T is quite a bit more powerful than the XC7Z020. The XC7A200T has
215K logic cells, 13.1 Mb of Block RAM, 270K flip-flops, and 740 DSP slices. The XC7Z020 has 85K
logic cells, 4.9 Mb of Block RAM, 106K flip-flops, and 220 DSP slices. Therefore, depending on how
many resources this design takes up, there are two boards available to fit the design.
Figure 25: Specifications for the Zynq XC7Z020 Artix-7 FPGA
(Xilinx, 2017)
5.0 FPGA TOP-LEVEL ARCHITECTURAL DESIGN AND CONCEPT
In order to implement the Forward Pass of a Convolutional Neural Network, the multiple layers involved
need to be included in a top-level FPGA design. As was seen in Section 3.2.4, the layers which comprise a
network such as AlexNet are:
a. Convolution Layer
b. ReLu Layer
c. Max Pooling Layer
d. Affine (Fully Connected) Layer
e. Classification Layer
Each of these layers needs to pull data from the memory available on the FPGA board in order to execute
its operation, as shown in Figure 30. However, before jumping straight to the details of the CNN Layers, we
must first address several significant structural design decisions. Important considerations such as the Bus
Protocol, Data Format, and Floating-Point logic are addressed first.
The selection of a bus protocol allows for the system to be modular and flexible. When a design deploys a
bus for command and data transfers, any custom-designed logic block with a bus interface can attach to the
bus with no impact to other logic blocks in the system. Let's briefly describe how the AXI Bus protocol
works.
The AXI Bus protocol, which stands for Advanced eXtensible Interface, is part of ARM AMBA, a family
of micro controller buses first introduced in 1996. The first version of AXI was first included in AMBA
3.0, released in 2003. AMBA 4.0, released in 2010, includes the second version of AXI, AXI4. A robust
collection of third-party AXI tool vendors is available that provide a variety of verification, system
development, and performance characterization tools.
There are three types of AXI4 interfaces:
a. AXI4 — for high-performance memory-mapped interfaces; allows bursts of up to 256 data
transfer cycles with just a single address phase. (ARM, 2018)
b. AXI4-Lite — for simple, low-throughput memory-mapped communication (for example, to and
from control and status registers). This is a lightweight, single transaction memory mapped
interface. It has a small logic footprint and is a simple interface to work with both in design and
usage. (ARM, 2018)
c. AXI4-Stream — for high-speed streaming data. This removes the requirement for an address phase
altogether and allows unlimited data burst size. AXI4-Stream interfaces and transfers do not have
address phases and are therefore not considered to be memory-mapped. (ARM, 2018)
The AXI specifications describe an interface between a single AXI master and a single AXI slave,
representing IP cores that exchange information with each other. Memory mapped AXI masters and slaves
can be connected using a structure called an Interconnect block. The Xilinx AXI Interconnect IP contains
AXI-compliant master and slave interfaces and can be used to route transactions between one or more AXI
masters and slaves. Both AXI4 and AXI4-Lite interfaces consist of five different channels: the Read
Address, Read Data, Write Address, Write Data, and Write Response channels.
Data can move in both directions between the master and slave simultaneously, and data transfer sizes can
vary. The limit in AXI4 is a burst transaction of up to 256 data transfers. AXI4-Lite allows only one data
transfer per transaction. Figure 26 shows how an AXI4 Read transaction uses the Read address and Read
data channels. Figure 27 shows how an AXI4 Write transaction uses the Write address and Write data
channels.
As shown in Figure 26 and Figure 27, AXI4 provides separate data and address connections for Reads and
Writes, which allows simultaneous, bidirectional data transfer. AXI4 requires a single address and then
bursts up to 256 words of data. At a hardware level, AXI4 allows a different clock for each AXI master-
slave pair. In addition, the AXI protocol allows the insertion of register slices (often called pipeline stages)
to aid in timing closure.
AXI4-Lite is similar to AXI4 with some exceptions, the most notable of which is that bursting is not
supported. The AXI4-Stream protocol defines a single channel for transmission of streaming data. The
AXI4-Stream channel is modeled after the Write Data channel of AXI4. Unlike AXI4, AXI4-Stream
interfaces can burst an unlimited amount of data.
As mentioned earlier, memory mapped AXI masters and slaves can be connected using a structure called an
Interconnect block. Therefore, each master and slave in this design has a designated address which is used
by the Microblaze to address each master and slave. The Address Map is shown in Table 1.
5.2 Data Format
The decision of which data format to choose has significant impacts on an FPGA's resource utilization.
The data format is a consideration since the data values throughout a Convolutional Neural Network are not
integer values but can have a wide range of fractional precision. As was seen in the research, while using a
32-bit floating point single precision format may lead to increased accuracy, it inevitably leads to increased
FPGA resource utilization. Therefore, using a large format such as single precision floating point may be
overkill, especially since Convolutional Neural Networks regularly normalize the input data and convolved
data. Looking briefly at the data formats to choose from, we can compare their tradeoffs.
5.2.1 Fixed Point
The fixed-point number consists of integer and fraction parts. (Brown et al., 2002) It can be written in the
positional number form:

V(B) = \sum_{i=-k}^{n-1} b_i \times 2^i

where the number B has n integer bits and k fraction bits.
The position of the radix point is assumed to be fixed; hence the name fixed-point number. If the radix
point is not shown, then it is assumed to be to the right of the least significant digit, which means that the
number is an integer.
If we look at an example, let's say we have the value 11.3125 and we want to convert this number to a
16-bit fixed-point representation. The integer part of the number is:

(11)10 = (00001011)2

The fractional part is converted by repeatedly multiplying by 2 and taking the integer part of each product
as the next bit:

0.3125 x 2 = 0.625  ->  bit 0, remainder 0.625
0.625  x 2 = 1.25   ->  bit 1, remainder 0.25
0.25   x 2 = 0.5    ->  bit 0, remainder 0.5
0.5    x 2 = 1.0    ->  bit 1, remainder 0

(0.3125)10 = (.01010000)2

Combining the integer and fractional parts gives:

(11.3125)10 = (00001011.01010000)2
The 8-bit integer part of the 16-bit fixed-point data format has a range from 0 to 255. The 8-bit fractional
portion of the 16-bit fixed-point data format has a precision of 2^-8, approximately 3.9E-3. (Brown et al., 2002)
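As a quick check of the example above (added arithmetic, not additional source material), evaluating the positional sum of the reconstructed bit pattern recovers the original value:

V(B) = 2^3 + 2^1 + 2^0 + 2^{-2} + 2^{-4} = 8 + 2 + 1 + 0.25 + 0.0625 = 11.3125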
The Floating-Point Single Precision format is depicted below in Figure 28(a). The left-most bit is the sign
bit, 0 for positive and 1 for negative numbers. There is an 8-bit exponent field, E, and a 23-bit mantissa
field, M. The exponent is with respect to the radix 2. Because it is necessary to be able to represent both
very large and very small numbers, the exponent can be either positive or negative. Instead of simply using
an 8-bit signed number as the exponent, which would allow exponent values in the range -128 to 127, the
IEEE standard specifies the exponent in the excess-127 format. (Brown et al., 2002) In this format the
exponent is encoded as

Exponent = E - 127
In this way E becomes a positive integer. This format is convenient for adding and subtracting floating-
point numbers because the first step in these operations involves comparing the exponents to determine
whether the mantissas must be appropriately shifted to add/subtract the significant bits. The range of E is 0
to 255. The extreme values of E = 0 and E = 255 are taken to denote the exact zero and infinity,
respectively. Therefore, the normal range of the exponent is -126 to 127, which is represented by the
values of E from 1 to 254.
The mantissa is represented using 23 bits. The IEEE standard calls for a normalized mantissa, which means
that the most-significant bit is always equal to 1. Thus, it is not necessary to include this bit explicitly in the
mantissa field. Therefore, if M is the bit vector in the mantissa field, the actual value of the mantissa is
1.M, which gives a 24-bit mantissa. (Brown et al., 2002) Consequently, the floating-point format in Figure
28(a) represents numbers of the form ±1.M × 2^(E-127).
The size of the mantissa field allows the representation of numbers that have the precision of about seven
decimal digits. The exponent field range of 2^-126 to 2^127 corresponds to about 10^±38. (Brown et al.,
2002)
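Continuing the 11.3125 example from the fixed-point discussion (an added illustration, not data from this design), the single-precision encoding works out as follows:

11.3125 = (1011.0101)2 = 1.0110101 × 2^3
S = 0,  E = 3 + 127 = 130 = (10000010)2,  M = 01101010000000000000000

which packs into the 32-bit word 0 | 10000010 | 01101010000000000000000, or 0x41350000.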
The Floating-Point Double Precision format is depicted in Figure 28(b) and uses 64 bits. Both the exponent
and mantissa fields are larger. (Brown et. al., 2002) This format allows greater range and precision of
numbers. The exponent field has 11 bits, and it specifies the exponent in the excess-1023 format, where:
Exponent = E - 1023
The range of E is 0 to 2047, but again the values of E = 0 and E = 2047 are used to indicate the exact 0 and
infinity, respectively. Thus, the normal range of the exponent is -1022 to 1023, which is represented by the
values of E from 1 to 2046.
The mantissa field has 52 bits. Since the mantissa is assumed to be normalized, its actual value is again
1.M, which gives a 53-bit mantissa. This format allows representation of numbers that have the precision of
about 16 decimal digits and an exponent range corresponding to about 10^±308.
In 2008 the Institute of Electrical and Electronic Engineers released a revised Standard for Floating-Point
Arithmetic. (IEEE, 2008) This newer standard included a new floating-point format designated as
“binary16”, where “binary32” is known as Single Precision and “binary64” is known as Double Precision.
The binary16 format has recently come to be known colloquially as the “Half Precision” format. (IEEE,
2008)
The Floating-Point Half Precision format is depicted in Figure 28(c) and uses only 16 bits. Both the
exponent and mantissa fields are smaller than the Single and Double counterparts. This format allows for
greater range and precision of numbers than the fixed format but less than that of the Single or Double
Precision formats. The exponent field has 5 bits, and it specifies the exponent in the excess-15 format,
where,
Exponent = E - 15
The range of E is 0 to 31, but again the values of E = 0 and E = 31 are used to indicate the exact 0 and
infinity, respectively. Thus, the normal range of the exponent is -14 to 15, which is represented by the
values of E from 1 to 30.
The mantissa field has 10 bits. Since the mantissa is assumed to be normalized, its actual value is again
1.M, which gives an 11-bit mantissa. This format allows representation of numbers that have a precision of
about 4 decimal digits and an exponent range corresponding to roughly 10^±5.
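For comparison (again an added illustration), the same value 11.3125 in the half-precision format becomes:

S = 0,  E = 3 + 15 = 18 = (10010)2,  M = 0110101000

which packs into the 16-bit word 0 | 10010 | 0110101000, or 0x49A8. The value happens to be represented exactly here, but values requiring more than 10 mantissa bits would lose precision.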
Having looked at the various formats to choose from, we can see that the two which would be good
candidates for use in this implementation of a Convolutional Neural Network would be the Single and Half
Precision formats. While the Single Precision would give increased accuracy, the Half Precision would be
best suited for an Embedded System Design. Although a knee-jerk reaction would be to immediately jump
to implementing the Half Precision format, a few considerations lend merit to the argument that a Single
Precision system should first be implemented. The first of these considerations is another aspect of
Floating-Point Arithmetic in computer systems. Round-off error deals with the rules by which a computer
system will round toward the limit of its computational precision. Different systems may incorporate a
subtly different round off scheme which can generate large scale differences in a final answer when
multiple arithmetic operations are being performed. Another consideration would be commonality between
the outputs of a software model of the neural network and the outputs of an FPGA design implementing a
neural network. Having a commonality between a software model and the FPGA design eases the
inevitable debugging effort which takes place during the simulation and hardware phases of an FPGA
design. Therefore, the Single Precision format is selected as the primary format for this work. Time
permitting, the Half Precision format would be explored. This step-by-step approach allows for a baseline
design to be developed first. Then once the baseline is fully functional, the newer format can be tested.
5.3 DSP48 Floating Point Logic
With the main objective of this work being the implementation of a Convolutional Neural Network, the
task of designing the Floating-Point Arithmetic Logic was abstracted away by utilizing already-made
macros. The Xilinx DSP48E1 slice, located physically in the FPGA fabric, allows for the efficient
implementation of the required multiply and add operations.
All Xilinx 7 Series FPGAs have many dedicated, full-custom, low-power DSP slices, combining high
speed with small size while retaining system design flexibility. The DSP slices enhance the speed and
efficiency of many applications beyond digital signal processing, such as wide dynamic bus shifters,
memory address generators, wide bus multiplexers, and memory mapped I/O registers. The basic
functionality of the DSP48E1 slice is shown in Figure 29. (Xilinx UG479, 2018)
5.4 Top Level Architecture and Concept
With solutions for the Bus Protocol, Data Format, and Floating-Point logic decided, we can now look at the
overall architecture. In Figure 30, we can see the top-level block diagram of the FPGA design. With the goal of
modularity in mind, each CNN layer type is instantiated in the design. A Xilinx generated memory
interface is instantiated in order to interface with the board’s DDR Memory. The DDR Memory is the
main receptacle for all input, weight, bias, and output data sets being generated by this design. A Xilinx
Microblaze Co-Processor is also included which coordinates the operations and flow of data between the
layers of the neural network. A Xilinx UART core is instantiated to allow debugging and commanding
from an external computer. Each of the blocks of logic communicates through a central AXI Bus. This
AXI Bus forms the essential backbone of the system for all data transfer and commanding.
Figure 30: FPGA Top Level Architecture
5.5 Top Level Module Descriptions
The table below contains a brief description of each module in the top-level FPGA design. The benefit of
using the AXI Bus becomes clear when considering the significant amount of infrastructure which has
already been designed. The AXI Interconnect, Clocking Wizard, MicroBlaze Co-Processor, Processor
System Reset, Memory Interface Generator, and UART are all modules from existing Xilinx IP. This
allowed this work to focus on the design and development of the Deep Convolutional Neural Network
layers themselves.
CONV/AFFINE LAYER (Custom Module): Convolution Layer which executes the convolution operation
on image data. This module also performs the Affine Layer function.
The table below shows the top-level ports of the FPGA design. These ports are routed to physical pins on
the FPGA board.
Table 3: Top Level Port Description
5.7 Top Level Memory Map
In order to execute the AlexNet Convolutional Neural Network, each layer’s input, output, weights, and
bias data sets need to be saved to DDR Memory on the FPGA Board. With the 32-bit Single Precision data
format selected, we can calculate the system's memory requirements, which are shown in Table 4. With the
memory requirements determined, we can derive the memory map for the 32bit system which is shown in
Table 5.
Data Set        Dimensions                                              Size (Bytes)
…               … Filters: 256                                          …
Conv5           Height: 13, Width: 13, Channels: 256                    173056
Max Pool 3      Height: 6,  Width: 6,  Channels: 256                    36864
Weight 6        Height: 6,  Width: 6,  Channels: 256, Filters: 4096     150994944
Fully Conn. 6   Height: 1,  Width: 1,  Channels: 4096                   16384
Weight 7        Height: 1,  Width: 1,  Channels: 4096, Filters: 4096    67108864
Fully Conn. 7   Height: 1,  Width: 1,  Channels: 4096                   16384
Weight 8        Height: 1,  Width: 1,  Channels: 4096, Filters: 10      163840
Fully Conn. 8   Height: 1,  Width: 1,  Channels: 10                     40
Bias 1          96 values                                               384
Bias 2          256 values                                              1024
Bias 3          384 values                                              1536
Bias 4          384 values                                              1536
Bias 5          256 values                                              1024
Bias 6          4096 values                                             16384
Bias 7          4096 values                                             16384
Bias 8          10 values                                               40
TOTAL                                                                   236721414
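Each entry in the table follows directly from the data dimensions and the 4-byte single-precision word size; for example (added arithmetic only):

Conv5:    13 × 13 × 256 × 4       = 173056 bytes
Weight 6: 6 × 6 × 256 × 4096 × 4  = 150994944 bytes
Bias 6:   4096 × 4                = 16384 bytes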
Table 5: Top Level Memory Map
6.0 CONVOLUTION / AFFINE LAYER
6.1 Algorithm
Using the background covered thus far, FPGA hardware can be described which would allow the FPGA
circuitry to perform the Convolution operation. The Convolution Operation involves applying a weight
filter kernel to an input image in order to generate an output volume. These output volumes in effect filter
the image for certain features which are determined during training of the weight set of the Neural
Network. (Li, F., et al., 2017) In order to describe how a CNN can be implemented in an FPGA, we look
at the following example.
Here we can see that each element in the filter weight kernel is multiplied by its corresponding element in
the image data. So, for the case with a 3x3 filter weight kernel, 9 filter weight values are multiplied by 9
image data values, to generate 9 products. These 9 products are then summed together in order to produce
the output volume. This is done for each color of the input image and in subsequent volumes/filter maps,
each channel. This operation is again repeated for each subsequent weight filter set. So, if a 227x227x3
image has an 11x11x3x96 weight filter kernel set applied to it, with a stride and pad of 2, the resulting
output volume will have 96 channels, one for each weight filter in the set.
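In equation form (an added summary using the standard convolution-layer formulation, not notation drawn from this document), each output pixel for filter f is

out_f(i, j) = b_f + \sum_{c} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} W_f(m, n, c) \cdot X(iS + m - P, jS + n - P, c)

where K is the kernel size, S the stride, and P the padding, and the output spatial size is (W_in - K + 2P)/S + 1.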
The Convolutional / Affine Layer Module architecture is shown below in Figure 32. The module utilizes
an AXI4-Lite Slave Interface which provides a command and status interface between the MicroBlaze
processor and the internal logic of the module. The module also utilizes an AXI4 Master Interface which
reads and writes data to and from the DDR memory on the FPGA board. The AXI Master Interface
retrieves the input data, weights, and biases from specified DDR memory locations for the convolution and
fully connected operations then outputs the output volume to a specified region of memory.
6.3 Conv/Affine Layer - Register Set
The AXI Slave provides the following register set for the Convolution / Affine Layer shown in Table 6.
This layer design is very large and required quite a few registers to allow for reliable RTL operation and to
meet timing when generating the bit file to put on the actual FPGA board.
This register contains essential settings which determine the mode in which the Conv/Affine layer operates.
Table 8: Control Register Description
The status register breaks out important logic signals which may be used for debugging the design once it
is running in hardware.
Table 10: Status Register Description
6.3.3 Input Data Address Register
The value contained in this register points to the location in DDR memory where the layer input data
begins.
The value contained in this register points to the location in DDR memory where the layer will begin
outputting the resulting data from the Convolution or Fully Connected operations.
6.3.5 Weights Address Register
The value contained in this register points to the location in DDR memory where the layer weight kernel
data begins.
This register contains information on the size of the input image such as Height, Width, and number of
channels.
Table 18: Input Volume Parameters Register Description
This register contains information on the size of the output image such as Height, Width, and number of
channels.
6.3.8 Weight Parameters Register
This register contains information on the size of the weight filter kernel such as Height, Width, number of
channels, and number of filters.
This register contains information on the size of the input image such as Height, Width, and number of
channels.
Table 24: Convolution Parameters Register Description
The value contained in this register points to the location in DDR memory where the layer bias data begins.
Table 28: Bias Parameters Register Description
This register contains a precalculated value which is essential for the design to read the correct weight filter
kernel data from memory.
This register contains a precalculated value which is essential for the design to read the correct weight filter
kernel data from memory.
Table 31: Weight Multiple 0 Register Bit Map
This register contains a precalculated value which is essential for the design to read the correct input image
data from memory.
6.3.15 Input Multiple 1 Register
This register contains a precalculated value which is essential for the design to read the correct input image
data from memory.
This register contains a precalculated value which is essential for the design to write the resulting output
data to the correct location in memory.
Table 38: Output Multiple 0 Register Description
This register contains a precalculated value which is essential for the design to write the resulting output
data to the correct location in memory.
6.3.18 Affine Parameters 0 Register
This register contains information on the number of channels and filters in the current set being calculated.
6.4 Conv/Affine Layer - AXI Master
While the AXI Slave interface receives commands via the AXI Slave Bus, the AXI Master is responsible
for reading into the design all relevant data from memory and then writing the resulting volume of data
back to memory for the next layer in the Neural Network to use. The AXI Master logic is a large state
machine shown in Figure 33 with states specifically designed to retrieve the input data, weight filter kernel
data, and bias data from memory regions specified by the AXI Slave Register settings. This module starts
the Convolution and Affine operations and collects the results of these operations.
Figure 33: Finite State Machine for the AXI Master Module.
Due to the complexity, the state machine has been simplified for readability.
Due to the complexity of this large state machine, the figure was simplified in order to increase its
readability. The states are described below.
IDLE: The FSM checks various control signals in order to ensure that the logic downstream is ready to
receive data and begin its operation. This state also waits for a start command from the Control Register.
CAL_WEIGHTS_READ_LENGTH, FCS_ADDRESS_WEIGHTS,
FCS_READ_DATA_WEIGHTS: These states calculate the appropriate size of AXI read transfer to
perform based on the Weight data configuration registers in the AXI Slave interface. This is to ensure that
the correct weights are being read from memory for the Convolution and Affine operations to yield a
correct answer. Once all the appropriate weights have been read in, the state machine moves on.
FCS_LOADING_WEIGHTS: This state holds the state machine until the logic downstream signals that all
the weights have been loaded into the Channel Unit Weight FIFOs. For the convolution operation, on the
first iteration of the channel set, the state machine moves on to obtaining the biases for the Neural Network.
On subsequent iterations the state machine will move on to loading new input data. For the Affine operation,
on the first iteration of the channel set, the state machine moves on to holding for the operation to complete.
On subsequent iterations the state machine moves on to fetching the layer results from the previous iteration.
BIAS_READ_LENGTH, FCS_READ_ADDRESS_BIAS,
FCS_READ_DATA_BIAS: These states read in the Bias data for the current layer in the Neural Network.
The correct number of bias values is specified in the configuration register: Bias Parameters Register.
FCS_INPUT_VOLUME_SETUP, CALC_INPUT_VOLUME_LENGTH (and the related input-read states):
These states calculate the appropriate size of AXI read transfer to perform based on the input data
configuration registers in the AXI Slave interface. This is to ensure that the correct input data is being read
from memory for the Convolution and Affine operations to yield a correct answer. Once all the appropriate
input data has been read in, the state machine moves on.
FCS_PROCESSING_DATA: This state holds the state machine until the output FIFO buffer indicates
that there is data in the buffer. Once data is detected, the state machine moves on.
CALC_OUTPUT_WRITE_LENGTH, FCS_WRITE_ADDRESS_CONV_OUT,
FCS_WRITE_DATA_CONV_OUT, FCS_WRITE_RESPONSE_CONV_OUT:
These states calculate the appropriate size of AXI write transfers to perform based on the output volume
configuration registers in the AXI Slave interface. The state machine reads out all the data from the
previous Convolution or Affine operation and writes the data to the address specified in the Output Address
configuration register. This state is ultimately the last state in the whole state machine design. The state
machine will loop back if there are more channel iterations to complete and/or the full Convolution or Affine
operation has not yet finished.
FCS_STRIDE_PROCESSING, CALC_STRIDE_READ_LENGTH (and the related stride states): These
states are used only in the Convolutional operation. The Affine operation does not stride across the input volume. If the full
input volume has not yet been completed, additional rows of input data will be read from memory. The
number of rows read is specified by the Convolution Parameters configuration register. Once all the
additional stride rows are completely read in for each of the channels in the set being calculated, the design
moves on to the FCS_PROCESSING_DATA state to wait for output data from the Convolution operation to be
generated.
RCS_PREV_DATA_SETUP, CALC_PREV_READ_LENGTH (and the related previous-data states):
When the number of input channels exceeds the number of DSPs available for use, these previous-data
states will be used. These states are used in both the Convolutional and Affine situations. After the
Convolution or Affine operation has
completed for the input data volume, the whole state machine starts again to work on the next set of
channel data. However, this time before each operation, the previous data that was output is read back into
the design so that it can be summed with the current operation output.
6.5 Conv/Affine Unit – Design and Ports
A look inside the Convolution / Affine Unit is shown in Figure 34 below. The unit is comprised of four
major functional blocks. The input buffer is a FIFO buffer which receives data from the AXI Master and
reads it out to the Filter logic. Once the Filter logic has finished performing Convolution or an Affine
operation the data is passed through the ReLu Unit and output to the Output FIFO buffer. The contents of
the output buffer are then read out by the AXI Master interface and written to DDR Memory.
The ports for this submodule are shown in Table 43 below.
(Table 43, continued)
Port Name               Type                            Dir   Description
…                       …                               …     … piece of valid data
o_inbuff_full           std_logic                       Out   Input Buffer cannot accept any more data
o_inbuff_almost_full    std_logic                       Out   Input Buffer can accept one more piece of valid data
o_inbuff_valid          std_logic                       Out   Input Buffer contains valid data
o_outbuff_dout          std_logic_vector(15 downto 0)   Out   Output Buffer data to AXI Master Interface
o_outbuff_empty         std_logic                       Out   Output Buffer does not contain valid data
o_outbuff_almost_empty  std_logic                       Out   Output Buffer contains only one piece of valid data
o_outbuff_full          std_logic                       Out   Output Buffer cannot accept any more data
o_outbuff_almost_full   std_logic                       Out   Output Buffer can only accept one more piece of data
o_outbuff_valid         std_logic                       Out   Output Buffer contains valid data
6.6 Convolution Layer - Submodule Definitions and Operations
The following section will describe the design of each of the submodules involved in the Convolution /
Affine Layer.
The inner workings of the Filter submodule are shown below in Figure 35. The Filter submodule takes in
data from the input buffer FIFO and distributes the data across all the Channel Units in the Volume Weight
Multiply Logic group. This group takes the input data and weight data and uses the Xilinx DSP48
hardware embedded into the FPGA itself to perform a floating-point single precision multiplication. The
product results of the Floating-Point Multiplication operations are passed to the Accumulator Logic group.
There the product data is summed together in order to get the final summed kernel data that both
Convolution and Affine operations require. The Accumulator Logic group also sums the kernel data with
the bias data or with previous partial results, as described in the Accumulator Logic section below.
The heart of this design is the Channel Unit which involves the use of a FIFO buffer and glue logic to
perform vital operations of the Convolution and Affine layer in the Convolutional Neural Network.
Essentially this design heavily utilizes the available DSP resources on the FPGA chip. The number of
available DSPs dictates the maximum number of Channel Units instantiated into the design. Therefore, as
is shown in Figure 36, the maximum number of Channel Units n instantiated in the design is directly
determined by the number of DSP slices available.
Operationally each Channel Unit corresponds to one row of data of the input image or one row of data for
all the filters for the weight data set. Weight filter kernels in Convolutional Neural Networks typically
have a size of 3x3, 5x5, 7x7, or 11x11. Therefore, the height of the weight filter kernel will dictate the
number of input volume channels that can be operated on at any given time. So, the configuration of the
Channel Units and the overall Filter is directed by the available number of DSPs and the number of
channels being processed.
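The relationship between the available DSP resources and the number of Channel Units can be captured with a VHDL generic and a generate loop, as in the sketch below. The entity, generic, and port names here are hypothetical stand-ins and are not taken from this design's source; the point is only that retargeting a larger or smaller FPGA reduces to changing a single generic.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Illustrative sketch only: names are hypothetical, not this design's identifiers.
    entity channel_unit_array is
      generic (
        G_NUM_CHANNEL_UNITS : integer := 5  -- bounded by the DSP48 slices on the target FPGA
      );
      port (
        clk, rst      : in  std_logic;
        i_volume_data : in  std_logic_vector(31 downto 0);
        i_weight_data : in  std_logic_vector(31 downto 0);
        o_products    : out std_logic_vector(32*G_NUM_CHANNEL_UNITS-1 downto 0)
      );
    end entity channel_unit_array;

    architecture rtl of channel_unit_array is
    begin
      -- One Channel Unit (and therefore one DSP48 multiplier) per generate iteration.
      gen_channel_units : for i in 0 to G_NUM_CHANNEL_UNITS-1 generate
        u_channel_unit : entity work.channel_unit  -- hypothetical sub-entity
          port map (
            clk       => clk,
            rst       => rst,
            i_volume  => i_volume_data,
            i_weight  => i_weight_data,
            o_product => o_products(32*(i+1)-1 downto 32*i)
          );
      end generate gen_channel_units;
    end architecture rtl;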
The inner workings of the Channel Unit itself are shown below in Figure 37. Here we can see what was
described earlier. The Channel Unit block takes input data and weight data and uses the Xilinx DSP48
hardware embedded into the FPGA itself to perform a floating-point single precision multiplication.
Figure 37: Architecture of the Channel Unit
In order to more thoroughly explain the concept of operation of this design, let's look at an example case.
Let’s say we have an input data volume of dimension 227x227x3 which is 227 pixels high, 227 pixels
wide, and with 3 channels. Let’s also assume for this example we have a corresponding weight set of
dimensions 5x5x3x16. As shown in Figure 38 and Figure 39, both input volume data and weight data are
loaded into the channel units available. Since the weight filter kernel is 5 pixels high, only 5 rows of input
volume data of the 227 total rows are loaded into the FIFOs. The corresponding weight filter data for all
the weight filters is loaded into the weight FIFOs as well. In Figure 38 and Figure 39 it is important to
note that the data FIFOs are surrounded by logic blocks. These logic blocks are the Mux and the Router
logic. The Mux serves to multiplex the various data streams being written into the data FIFO. The Router
serves to direct where the flow of data is going between FIFOs in the Channel Unit.
Figure 38: Volume FIFOs with Mux and Router for the Input Volume FIFOs
Figure 39: Weight FIFOs with Mux and Router for the Weight Data FIFOs
For Convolutional Layers in a Neural Network the input data volume may be zero padded around the
borders of the input data volume. Therefore, the input data image will have top, bottom, right, and left
borders with zero value data. This practice of padding is also accomplished in this design as shown below
in Figure 40. Maintaining the earlier input and weight data example we add the additional configuration
that this example will have a padding of 2. We see in Figure 40(Top) that in this case the first two rows of
data going to the input volume FIFOs are zero filled. For the remaining 3 input volume FIFOs the first
three rows of input data are fed to the input volume data FIFOs. Note the first three rows of input data are
padded on the right and left sides with 2 zero data values. Similarly, the padding for the bottom border of
the input volume is handled in the same manner once the final rows of image data have been loaded.
When the design is given a new input data volume to use in the execution of a Convolution or Affine
operation, it proceeds as illustrated below in Figure 41. The Controller loads full rows of both input and
weight data into the Channel Unit FIFOs.
Figure 41: Shows how new data is read into the Channel Unit FIFOs
After the rows have been loaded into both Input Volume and Weight Data FIFOs, the Routers for the
Weight Data read out the first Weight Filter Kernel and hold it in registers. Once the kernel is ready, the
Routers on the Volume FIFOs begin to read out the input volume data while also accounting for stride
across the row itself. As both input data and weight kernel are read out simultaneously, they are sent to the
Channel Unit's DSP48 multiplier block. Also, simultaneously, the input volume data is written back into the
input data FIFOs, as shown below in Figure 42. Writing back the input data allows the same data to be
reused for execution with the next weight filter kernel, thereby reducing the number of AXI transactions
needed to re-fetch data from DDR memory.
After all the weight filter kernels have been used with the input data volume rows, the stride operation
begins. This is to satisfy the vertical stride required for the Convolution operation. To accomplish this
function, the Router for each Channel Unit reads out its FIFO's data and sends the data to the Mux of the
Channel Unit neighboring it. Looking at Figure 43, if we were to designate each Channel Unit as Rows 0
through 4 we can see that during the stride operation, new data enters the Row 4 FIFO. The data currently
in the Row 4 FIFO is redirected and written into the Row 3 FIFO. The data currently in the Row 3 FIFO is
redirected and written into the Row 2 FIFO. The data currently in the Row 2 FIFO is redirected and written
into the Row 1 FIFO. The data currently in the Row 1 FIFO is redirected and written into the Row 0 FIFO.
Lastly, the data currently in the Row 0 FIFO is read out by its Router and allowed to be lost since that data
is no longer needed.
After all the stride rows have been read in for each channel being processed, the Routers resume reading
out data to the DSP48 Floating Point multipliers. This stride and convolve pattern is repeated until all the
rows of the input image have been read into the FIFOs.
For the Affine Layer operation, the logic executes in much the same manner as with the Convolution
operation. The only key difference is that there is no striding across the image with a weight filter kernel.
The Affine Layer operation uses a unique weight for each input data pixel.
The Volume and Weight Mux blocks, as described earlier, essentially mux between the various input data
streams to the FIFOs of the Channel Units. Depending on the operation, the controller will assert an enable
signal to activate the new data stream, previous data stream, or recycled data stream. The new data stream is
used for new data entering the Channel Unit FIFO. The previous data stream is used during the stride
operation to allow for the neighboring Router to write its FIFOs content to this FIFO. The recycled data
stream is used during the convolution operation while the Volume Data Router is reading out input volume
data. As shown in Figure 44 and Figure 45, the structure is a straightforward mux with three select lines.
The inputs are the i_new_data, i_prev_data, and i_recycled_data streams, with all other input possibilities
tied to ground. The flip flop illustrates the clocked nature of this process. Adding the flip flop after the mux
registers the selected data before it is written into the FIFO.
Figure 44: Shows the design of the Volume and Weight Mux blocks to select the data stream.
Figure 45: Shows the design of the Volume and Weight Mux blocks to select the data stream enable
signals.
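A minimal VHDL sketch of this registered mux structure is shown below. The signal names are illustrative rather than the design's actual identifiers; only one enable is assumed to be asserted at a time, and all other input possibilities default to ground as in Figure 44.

    library ieee;
    use ieee.std_logic_1164.all;

    entity stream_mux is
      port (
        clk             : in  std_logic;
        i_new_en        : in  std_logic;
        i_prev_en       : in  std_logic;
        i_recycled_en   : in  std_logic;
        i_new_data      : in  std_logic_vector(31 downto 0);
        i_prev_data     : in  std_logic_vector(31 downto 0);
        i_recycled_data : in  std_logic_vector(31 downto 0);
        o_fifo_din      : out std_logic_vector(31 downto 0)
      );
    end entity stream_mux;

    architecture rtl of stream_mux is
    begin
      -- Registered mux: the flip flop captures the selected stream each clock.
      process(clk)
      begin
        if rising_edge(clk) then
          if i_new_en = '1' then
            o_fifo_din <= i_new_data;
          elsif i_prev_en = '1' then
            o_fifo_din <= i_prev_data;
          elsif i_recycled_en = '1' then
            o_fifo_din <= i_recycled_data;
          else
            o_fifo_din <= (others => '0');  -- all other inputs tied to ground
          end if;
        end if;
      end process;
    end architecture rtl;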
As described earlier, in each Channel Unit the Volume Router will read out the data from the Volume
FIFO. During operation it reads out input data as wide as the weight filter kernel width. For instance, if
the weight filter kernel has a size of 3x3, then the volume router will only read out 3 pieces of input volume
data from the Volume FIFO. According to the stride setting from the Convolution Parameters register, the
Volume Router will read out the next pieces of data to perform the Horizontal stride required by the
Convolution operation. Throughout the process of striding and convolving data, the Router always checks
to see if the downstream logic is ready to receive data. The state machine for the submodule is shown in
Figure 46 and the states are described in text below for simplicity.
IDLE: The FSM checks various control signals coming from the Controller in order to determine which
mode it will operate in. If given an enable signal to begin the vertical stride, it will move to the Snake Fill
state. If the Volume FIFO signals that it is not empty and has data, then the FSM will go to the Fill
state. If given the enable signal to empty the Volume FIFO it will go to the Top Row Empty state.
FILL: This state will read out the same number of input volume data pixels as dictated by the weight filter
kernel width. The data is stored in a register array for use in the Convolution and Affine operations.
SEND: Once the data array is filled according to the weight filter kernel width size, the input volume data
is sent to the DSP48 multipliers.
ACC_HOLD: After sending the data to the Multipliers the finite state machine waits until the Accumulator
Logic indicates it is ready to receive more data. For the Affine operation the cycle repeats by going back to
the Idle state. For the Convolution operation a horizontal stride is required, and the FSM therefore moves
on to the Shift state.
SHIFT: This state shifts all the input volume data down in the register array by the number of
strides indicated in the Convolution Parameters Register. The input volume data shifted out of the array is
no longer needed for the current kernel position and is discarded.
SNAKE_FILL: If the Controller indicates that the vertical stride is to be performed, this state reads out the
contents of the Volume FIFO and sends it back to the Controller for distribution to the correct Volume
Mux.
TOP_ROW_EMPTY: Once the vertical stride operation has completed, the old data in the top row needs to
be read out since it is no longer needed. This state reads out one row of data from the Volume FIFO.
EMPTY_RESET: This state informs the Controller that the row empty operation has completed, and the
FSM then returns to the Idle state.
Figure 46: Finite State Machine for the Volume Router.
As described earlier, the Weight Router reads in weight filter kernel data from the Weight FIFO as wide as
the Weight Parameters Register dictates. This data is also written into a register array within the Weight
Router. Once the data from the weight filter kernel is read into the register array, the FSM signals to the
Controller and Volume Router that the filter is loaded and ready. At this point the Convolution or Affine
operation begins. After the Convolution or Affine operation has completed for all rows in the Channel
Units, the next filter kernel is read out from the Weight FIFOs and used to execute the next round of
operations. The state machine for the submodule is shown in Figure 47 and the states are described in text
for simplicity.
IDLE: The FSM checks various control signals coming from the Controller which will kick off the rest of
the Weight Router's operation.
PRIMING: This state holds the FSM while data begins to be written into the Weight FIFOs. Once
there exists data in the Weight FIFOs the state machine will move on to load the weight filter kernel into
its register array.
LOAD_KERNEL: This state reads in weight filter kernel data from the Weight FIFO as wide as the
Weight Parameters Register dictates. Once the data from the weight filter kernel is read into the register
array, the FSM signals to the Controller and Volume Router that the filter is loaded and ready.
PROCESSING: As the Convolution or Affine operation executes, the weight filter kernel data is sent to
the DSP48 Multipliers. This continues until the entire row of input volume data has been processed with the
weight filter kernel. This cycle continues for all weight filter kernels.
EMPTY_WEIGHTS: When the Convolution or Affine operations have completed for the current weight
filters and current rows, the Weight FIFOs are emptied in order to clear them before the next set of weight
filter kernels is written in.
Figure 47: Finite State Machine for the Weight Router
For all the operations described thus far, the Controller unit has been the logic orchestrating the entire data
flow. There is such a large amount of data to keep track of, it seemed to be a good idea to have one central
controller to direct the operations of all the smaller submodules already discussed. Just like the AXI
Master module, the Controller finite state machine is massive, with many states. Describing every single
signal and state in a traditional ASM chart manner would be far too tedious and would burden the reader;
the FSM is better described succinctly as a general flow of states with a brief text description of what each
state does. As shown in Figure 48 and Figure 49, the finite state machine for the controller can take one of
two major paths. One path sets the Controller to perform Convolution while the other sets the Controller to
perform the Affine Fully Connected operation. The state machine which performs Convolution is shown in
Figure 48 and the states are described in text for simplicity below.
IDLE: The FSM checks for a start command from the Control Register as well as the affine enable signal
to determine if the layer will operate as a Convolution Layer or as an Affine Fully Connected Layer.
IS_NET_READY: This state will hold the FSM until it receives signals from all the Channel Units that
they are ready to process data. Once all the Channel Units are ready, the state machine moves on.
CALC_PARAMETERS: This state determines if there will be more channel iterations necessary to
completely process the data. As previously described, the availability of the DSP blocks in hardware puts a
limit on how many input channels can be processed at any given time. If the input data volume contains
more channels than there are available DSPs, then this state calculates how many iterations are required.
The size of the input image with padding is also determined here.
FETCH_AND_LOAD_WEIGHTS: This state reads in weight filter kernel data from the input buffer and
distributes it to the Weight FIFOs of the Channel Units.
PAD_VOLUME_TOP: If the contents of the Convolution Parameters Register indicate that the input
data volume should be padded and none of the input volume rows have been processed yet, the FSM will
begin to zero-fill the Volume FIFOs of the first rows. After the Volume FIFOs for the number of rows
indicated by the pad field of the Convolution Parameters Register have been zero filled, the state machine
moves on.
PAD_VOLUME_LEFT: If the contents of the Convolution Parameters Register indicate that the input
data volume should be padded, padding is added to every Volume FIFO before the actual image data is
written.
FETCH_VOLUME: During this state, the data contained in the input buffer FIFO is distributed to the
Volume FIFOs across all the Channel Units. If a padding value of 0 is given in the Convolution Parameters
Register, no padding values are inserted.
PAD_VOLUME_RIGHT: If the contents of the Convolution Parameters Register indicate that the input
data volume should be padded, padding is added to every Volume FIFO after the actual image data is
written.
WAIT_ONE_PAD: This state holds the state machine for one clock cycle to allow the data to finish being
written into the Volume FIFOs.
CONVOLUTION: This state holds the state machine until the Convolution operation is completed on the
current rows of data loaded in the Channel Units. Once the Controller receives a signal from the
downstream logic that the rows have been processed, the Controller FSM moves on to the Stride operations.
WAIT_ONE_SINGLE, PAD_SINGLE (and the other single-row states): These states perform the same
function as the PAD_VOLUME_TOP, PAD_VOLUME_LEFT, FETCH_VOLUME, PAD_VOLUME_RIGHT,
and WAIT_ONE_PAD states.
The only difference is that in these new states, only one row of new data is loaded into the Channel Units
versus several. These states are separate from the other states since after each chain of single row states is
executed, the vertical stride operation is performed by the SNAKE_FILL and EMPTY_TOP_ROW states.
SNAKE_FILL: After each single row is fetched by the single-row states (PVL_SINGLE, FV_SINGLE, and
so on), the vertical stride operation is allowed to occur one row at a time. This operation was covered earlier
in Figure 43.
EMPTY_TOP_ROW: After the vertical stride operation occurs, the last operation to complete the vertical
stride is to read out the old data in the top channel unit Volume FIFO. This data is no longer needed and is
discarded by the Volume Router downstream. This operation was covered earlier in Figure 43.
The state machine as shown in Figure 49 for the Affine Fully Connected Layer operation runs much like
the state machine for the Convolution Layer operation. The operation still requires input volume data,
weight filter kernel data, and bias data. The operation does not need, however, to stride across the input
volume with the weight filter kernel. The nature of the Affine Fully Connected Layer is such that every
item of input volume data has a unique weight associated with it. Therefore, all the needed data is read into
the Channel Units just as in the Convolution case. However, since the number of filters is greater than that
of the Convolution layer, the weights must be loaded a chunk at a time rather than all at once.
For instance, while the largest filter count for a Convolutional Layer is around 256 or 384, the Affine Layer can
have filter counts of 4096. This design will feed in only the number of filters dictated by the Affine
Parameters Register and execute the Affine operation on those filters. Once those are complete, another
chunk of filter data will be read into the Channel Units and processed. This cycle will continue until all
filters have been processed for the current input volume data loaded into the Volume FIFOs. With the rest
of the state machine being the same as with the Convolution FSM, the only different states are detailed.
EMPTY_WEIGHTS: After the Affine Fully Connected operation completes, this state will empty the
Weight FIFOs in the Channel Units. This is to clear the Weight FIFOs for the next chunk of filter data. This
state does not empty the Volume FIFOs since they are still being used for processing.
EMPTY_VOL_WEIGHTS: After the Affine Fully Connected operation completes for all filters, this state
will empty the Weight FIFOs and Volume FIFOs both in the Channel Units. This is to clear all the Channel
Unit FIFOs for the next set of filter and input volume data.
Figure 48: Finite State Machine for the Convolution Layer Operation
Figure 49: Finite State Machine for the Affine Layer Operation
6.6.7 Accumulator Logic
The inner workings of the Accumulator Logic group are shown below in Figure 50. As the DSP48
Multipliers begin to output valid product data and their accompanying valid signals, the DSP Accumulator
uses an adder tree to sum the results of the Convolution or Affine operation together. The results of the
adder tree are summed together to arrive at the Kernel Sum. This sum is written to the Accumulator FIFO.
The Accumulator Router reads the Accumulator FIFO data and uses a DSP48 Adder to perform a Floating-
Point Single Precision Addition between the Accumulator FIFO data and the Bias data or the Previous data.
As described earlier, the design is only able to process a limited number of input volume channels at a time.
This requires the design to process groups of channels at a time, creating an iterative processing scheme.
Therefore, the first iteration will add the Bias Data and subsequent iterations will add the Previous Data
together with the data from the Accumulator FIFO. The Previous Data refers to the results of the previous
channel-set iteration.
6.6.8 DSP Accumulator Sub-module
As previously stated, the DSP Accumulator uses an adder tree to sum the results of the Convolution or
Affine operation together. The adder tree can be seen in Figure 52. Since the upstream logic in the Channel
Units moves across the Input Volume row by row, the product results received by the DSP Accumulator
will be the size of the Weight Filter Kernel by as many channels as are being processed. Therefore, the result of
the adder tree will be the summation of the columns as shown in Figure 51. As we see in the figure if we
have input volume data X with 3 channels and Weight Filter Kernel W with a size 5x5, the result of the
Adder Tree is 5 wide and is the summation of all the product data down the columns.
Figure 52: Shows the adder tree employed to sum the product data.
The column sum data is saved in a register array to be used in the next operation. It is important to note
that the implementation of the adder tree registers the results of each adder layer. This was done in order to
pipeline the adder tree and help with timing closure later. Not using the pipelined registers would cause
there to be very long combinational paths in the design and would make it very hard for the Place and
Route tool to establish FPGA logic routes which would meet the 100 MHz timing requirement.
Once the column sum data is obtained, the column sums are again summed by a smaller adder tree to
obtain the single Kernel Sum data value. This is shown in Figure 53 where we see that register array data
being used as inputs to the small adder tree to obtain the single Kernel Sum. This process repeats as the
Channel Units produce data for the rest of the filters in the Weight set.
The process of putting the products from the Channel Units through adder trees and registers to obtain the
Kernel Sum requires the design to keep track of the valid signal as it propagates through the design.
Therefore, a shift register is employed which is sized to accommodate the predetermined clock cycle delay
between input to the adder tree and output. The input to the shift register is the product valid signal.
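Such a valid-delay shift register can be sketched as below; the latency generic is a placeholder for the predetermined adder-tree delay and is an assumed value, not one taken from this design.

    library ieee;
    use ieee.std_logic_1164.all;

    entity valid_delay is
      generic (
        G_LATENCY : natural := 12  -- assumed clock cycles from adder-tree input to output
      );
      port (
        clk     : in  std_logic;
        i_valid : in  std_logic;   -- product valid entering the adder tree
        o_valid : out std_logic    -- asserted when the matching sum emerges
      );
    end entity valid_delay;

    architecture rtl of valid_delay is
      signal shift_reg : std_logic_vector(G_LATENCY-1 downto 0) := (others => '0');
    begin
      process(clk)
      begin
        if rising_edge(clk) then
          -- Shift the valid flag one stage per clock so it stays aligned with
          -- the pipelined adder-tree result.
          shift_reg <= shift_reg(G_LATENCY-2 downto 0) & i_valid;
        end if;
      end process;
      o_valid <= shift_reg(G_LATENCY-1);
    end architecture rtl;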
Figure 53: Shows the adder tree employed to sum the column data.
The state machine shown in Figure 54 for the DSP Accumulator is very straightforward, since the DSP48
adder tree performs the bulk of the work. Each state is briefed below.
IDLE: The FSM checks for the product valid signals in order to kick off the accumulator. Once product
RUN: This state allows the column adder tree to continue adding the product data being input from the
Channel Units. Once the valid signal in the data valid shift register is detected after the delay of the adders
has lapsed, the state machine registers the resulting column sum value into a register array. This register
array is then used to feed the second smaller adder tree to obtain the kernel sum.
QUE: This state is precautionary and is used when the Accumulator FIFO is full and cannot accept new
Kernel Sum values. This state places the data in a very shallow buffer and signals to the upstream logic
that the DSP Accumulator is no longer ready to receive new data. This lack of readiness informs the Routers
in the Channel Units to continue to hold until the Accumulator is ready again.
EMPTY_QUE: If the Accumulator FIFO is no longer full and can accept data, this state first reads out the
contents of its small buffer before signaling to the Channel Unit Routers that the DSP Accumulator is ready
to receive data once again.
Figure 54: Shows the Finite State Machine for the DSP Accumulator logic.
Figure 55: Shows the Finite State Machine for the Accumulator Relay.
As described earlier, the design is only able to process a limited number of input volume channels at a time.
This requires the design to process groups of channels at a time creating an iterative processing scheme.
Therefore, the first iteration will add the Bias Data and subsequent iterations will add the Previous Data
together with the data from the Accumulator FIFO. The Previous Data refers to the results of the previous
channel-set iteration.
The state machine shown in Figure 55 for the Accumulator Relay is also very straightforward. Each state is
briefed below.
IDLE: The FSM checks for the existence of data in the Accumulator FIFO and the Bias FIFO. If there
exists data in the Accumulator FIFO and Bias FIFO, the state machine will move on.
ALL_CHANNELS: This state reads out one piece of data from each of the Bias and Accumulator FIFOs
and passes these values to the DSP48 Adder block. Once this occurs, the state machine moves on to the hold
state.
BIAS_ADDER_HOLD: This state holds the state machine until the Adder has had time to fully process
the Floating-Point Single Precision addition operation. This state also keeps track of which row, filter, and
pixel it is processing by maintaining counters. These counters are important since the Accumulator Relay
is the last logic block in the whole Convolution or Affine operation and signals to the AXI Master and
Controller when the operation has completed.
MORE_CHANNELS, PREV_ADDER_HOLD: These states perform much the same functions as the
first two, except in this case they add the Previous Data from the last Convolution / Affine execution to the
Accumulator FIFO data instead of the Bias Data.
7.0 RELU LAYER
Implementing the ReLu Layer is actually very simple. This is because the Rectified Linear Unit only
checks to see if a value is less than zero or greater than or equal to zero. Therefore, any value coming into
the ReLu unit will be forced to zero if it is negative and passed if it is not. This operation can be placed
upstream of the Convolutional Layer’s output buffer with each value being written into the output buffer
checked and modified as necessary. Of course, this ReLu unit is not utilized in the Fully Connected or Affine layers and therefore needs to be deactivated whenever the Convolution/Affine module is performing the Affine function. Deactivation of this layer is simply done by an AXI register write from the Microblaze Processor.
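To make the operation concrete, the check described above can be modeled in a few lines of Python; this is a software sketch only (the hardware is a small piece of logic in front of the output buffer), with the enable flag standing in for the AXI register write that deactivates the unit:

import numpy as np

def relu(feature_map, enabled=True):
    # Software model of the ReLu stage placed in front of the output buffer.
    # When 'enabled' is False (the Affine case noted above), data passes
    # through unmodified; otherwise negative values are forced to zero.
    if not enabled:
        return feature_map
    return np.maximum(feature_map, 0.0)

tile = np.array([[-1.5, 0.0], [2.25, -0.003]], dtype=np.float32)
print(relu(tile))                  # negatives forced to zero
print(relu(tile, enabled=False))   # unchanged, as in the Affine case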
8.0 MAX POOLING LAYER
8.1 Algorithm
In order to detail the Max Pooling operation let’s take a specific example. Let’s say we have an image of
size 13x13x256, a filter kernel with size 3x3, and a stride of 2. The Max Pooling operation involves sliding
a 3x3 kernel across 3 rows by 3 columns at a time as shown in Figure 57. The stride involves moving the
kernel across the image in increments of 2 pixels in both vertical and horizontal directions as shown in
Figure 57 and Figure 58. At each increment the kernel considers all the values in the 3x3 neighborhood
and chooses the greatest value as shown in Figure 59. This value is passed by the kernel to be the output
pixel value for the scaled down Max Pooled image. This process is repeated across the height, width, and
channels of the input image to generate a full Max Pooled image. Given an input image of 13x13x256, a 3x3 kernel, and a stride of 2, the resulting Max Pooled output volume is of size 6x6x256.
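For reference, the sliding-kernel operation described above can be modeled in Python as follows; this is a software sketch of the algorithm only, not the FIFO-based hardware implementation described in the following sections:

import numpy as np

def max_pool(image, kernel=3, stride=2):
    # 'image' is H x W x C; a kernel x kernel window slides with the given
    # stride and the largest value in each neighborhood is kept.
    h, w, c = image.shape
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    pooled = np.empty((out_h, out_w, c), dtype=image.dtype)
    for row in range(out_h):
        for col in range(out_w):
            window = image[row * stride:row * stride + kernel,
                           col * stride:col * stride + kernel, :]
            pooled[row, col, :] = window.max(axis=(0, 1))
    return pooled

# The 13x13x256 example above with a 3x3 kernel and stride of 2 -> 6x6x256
example = np.random.rand(13, 13, 256).astype(np.float32)
print(max_pool(example).shape)  # (6, 6, 256)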
Figure 58: Vertical stride across input image.
8.2 Max Pooling Layer - Architecture
The Max Pooling Layer Module architecture is shown below in Figure 60. The module utilizes an AXI4-
Lite Slave Interface which provides a command and status interface between the MicroBlaze processor and
the internal logic of the module. The module also utilizes an AXI4 Master Interface which reads and writes
data to and from the DDR memory on the FPGA board. The AXI Master Interface retrieves input data
from specified DDR memory locations for the Max Pooling operation then outputs the Max Pooled result to a specified region of memory.
8.3 Max Pooling Layer - Register Set
The AXI Slave provides the following register set for the Max Pooling Layer.
Table 44: Register List for the Max Pooling Layer Design
8.3.1 Control Register
This register allows for controlling the Max Pool Layer using a few essential signals.
Table 46: Control Register Description
8.3.2 Status Register
The status register breaks out important logic signals which may be used for debugging the design once it is running in hardware.
Table 48: Status Register Description
8.3.3 Input Data Address Register
The value contained in this register points to the location in the DDR memory where the layer input data begins.
8.3.4 Output Data Address Register
The value contained in this register points to the location in the DDR memory where the layer will begin writing its output data.
8.3.5 Input Parameters Register
This register contains information on the size of the input image such as Height, Width, and number of
channels.
8.3.6 Output Parameters Register
This register contains information on the size of the output image such as Height, Width, and number of channels.
Table 56: Output Parameters Register Description
Bit     Field Name        Initial Value   Description
31:24   Output_height     0x0             Value which specifies the height of the output volume
23:16   Output_width      0x0             Value which specifies the width of the output volume
15:0    Output_channels   0x0             Value which specifies the number of channels that will be produced by the max pooling operation
8.3.7 Kernel Parameters Register
This register contains information on the Max Pooling kernel such as its Height, Width, and Stride.

Bit     Field Name      Initial Value   Description
31:24   kernel_height   0x0             Value which specifies the height of the filter weight kernel
23:16   kernel_width    0x0             Value which specifies the width of the filter weight kernel
15:8    kernel_stride   0x0             Value which specifies the stride to be used when performing the max pooling operation
8.4 Max Pooling Layer - AXI Master
While the AXI Slave interface receives commands via the AXI Slave Bus, the AXI Master is responsible
for reading into the design all relevant data from memory and then writing the resulting volume of data
back to memory for the next layer in the Neural Network to use. The AXI Master logic is shown in Figure
61 with states specifically designed to retrieve the input data from memory regions specified by the AXI
Slave Register settings. This module starts the Max Pooling operations and collects the results.
The state machine is shown in Figure 61 and the states are described in text for simplicity below.
IDLE: The FSM checks for a start command from the Control Register as well as the signal indicating that the Max Pool Unit is ready, then begins reading in the first rows of input image data. Since the Max Pool Kernel is only 2x2 or 3x3, the kernel size in the Kernel Parameters Register dictates how many rows are read into the Input Buffer. Once the first few rows have been read into the design, the state machine moves on.
FINISH_READ_FIRST: This state holds the state machine until the max pooling operation has completed for the rows currently loaded into the design. Once the signal which indicates that the rows have been processed is detected, the state machine moves on to the write states. These states calculate the appropriate size of AXI write transfers to perform based on the Output Parameters configuration register in the AXI Slave interface. The state machine reads out all the data from the previous Max Pooling operation and writes the data to the address specified in the Output Address configuration register. This is ultimately the last state in the whole state machine design; the state machine will continue to process data if all the rows for the input image data have not yet been processed.
A look inside the Max Pool Unit is shown in Figure 62. The unit is comprised of five major functional
blocks. The input buffer is a FIFO buffer which receives data from the AXI Master and reads it out to the
Row Controller. The Row Controller sends the input data to the FIFO Network so that correct pixels are
being used in the Max Pool operation. Once the FIFO Network is fully primed, the max pooling filter
kernel is applied across the input image. This is achieved by using the Heap Sorter submodule to obtain the
max value in the 3x3 or 2x2 neighborhood. The resulting max value is written out to the Output FIFO
buffer. The same process is repeated until the max pooling kernel has covered all columns and rows of the input image.
Figure 62: Architecture of the Max Pool Unit
o_inbuff_empty           std_logic                       Out   Indicates Input buffer is empty with no valid data
o_inbuff_almost_empty    std_logic                       Out   Indicates input buffer is almost empty with one more valid piece of data
o_inbuff_full            std_logic                       Out   Indicates input buffer is full
o_inbuff_almost_full     std_logic                       Out   Indicates input buffer is about to fill up
o_outbuff_dout           std_logic_vector(15 downto 0)   Out   Output buffer Data out to AXI Master
o_outbuff_empty          std_logic                       Out   Indicates output buffer is empty with no valid data
o_outbuff_almost_empty   std_logic                       Out   Indicates output buffer is almost empty with one more valid piece of data
o_outbuff_full           std_logic                       Out   Indicates output buffer is full
o_outbuff_almost_full    std_logic                       Out   Indicates output buffer is about to fill up
o_outbuff_valid          std_logic                       Out   Indicates output buffer contains valid data
o_channel_complete       std_logic                       Out   Indicates that the Max Pooling Operation on the current volume has completed.
8.6 Max Pooling Layer - Submodule Definitions and Operations
The following section will describe the design of each of the submodules involved in the design of the Max Pooling Layer.
8.6.1 Row Controller Sub-module
The Row Controller receives input data from the AXI Master via the Input Buffer as shown in Figure 63.
The AXI Master will read the input image information from the DDR Memory in rows. Looking at Figure
63, let’s take an example case of a 13x13x256 input image with a Max Pool Kernel of size 3x3 and stride
of 2. With this example case in mind, three rows of input image information are read by the AXI Master
and passed down to the Row Controller. Once the Row Controller is signaled that the AXI Master has
loaded information into the Input Buffer FIFO, the Controller will load the three rows of image data into
the Row FIFOs. After the three rows of data are loaded, the Row Controller will send one 3x3 kernel's
worth of data to the Heap Sorter as shown in Figure 67. The Row Controller reads out 3 pixels from the
Row FIFOs each, for a 3x3 kernel to be sent to the Heap Sorter. It is important to note that the data read out
of each Row FIFO is also written back into the same Row FIFO, so the row data can be reused in the
striding process. The three Row FIFOs constitute the FIFO Network in this design. Once the Heap Sorter
has determined the Maximum Value from the 3x3 Kernel, that value is written to the Output Buffer.
This pattern is repeated each time the Kernel is moved across the input image by 2 pixels. The kernel is shifted across the image in both the horizontal and vertical directions.
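A simplified software analogy of this flow is given below; it models three buffered rows being walked by the kernel with the stride, with a maximum taken per neighborhood, standing in for the Row Controller handing one kernel's worth of data to the Heap Sorter (the FIFO write-back and row-shift mechanics are omitted):

import numpy as np

def pool_three_rows(row0, row1, row2, kernel=3, stride=2):
    # Three buffered rows are walked with the kernel stride and the maximum
    # of each 3x3 neighborhood is emitted, mirroring one kernel's worth of
    # data being handed to the Heap Sorter at a time.
    rows = np.stack([row0, row1, row2])
    out = []
    for col in range(0, rows.shape[1] - kernel + 1, stride):
        neighborhood = rows[:, col:col + kernel]   # one kernel's worth of data
        out.append(neighborhood.max())             # role played by the Heap Sorter
    return np.array(out)

r = np.arange(13, dtype=np.float32)
print(pool_three_rows(r, r + 100.0, r + 200.0))    # one output row of 6 values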
The state machine for the submodule is shown in Figure 64 and the states are described in text for
simplicity.
IDLE: The FSM checks for a start signal from the Control Register before moving on. This state also
checks for the existence of data in the input buffer as well as the signal indicating that a channel was
completed. Once data is available in the input buffer or a channel was completed, the state machine will
move on.
PRIME ROW 0, P01 WAIT, PRIME ROW 1, P12 WAIT, PRIME ROW 2, P2S WAIT: These states
read out the input image row data in the input buffer and write the rows to the Row FIFOs in the FIFO
Network. If the kernel is of dimension 2x2, only two FIFOs will be loaded with row data instead of three.
Therefore, in a 2x2 kernel case, the state machine will bypass the PRIME ROW 2 state entirely and move on to the PRIME SORTER state.
PRIME SORTER: This state reads out a piece of row data from the FIFO Network until the data read out
is as wide as the Max Pool Kernel. For instance, in a 2x2 kernel case, this state will read out twice from
each Row FIFO. As the data is being read out of each Row FIFO, the same data is also being written back into the same Row FIFO for reuse.
COLUMN STRIDE SHIFT: After the Heap Sorter has sorted out the previous kernel of data it had been
sent, this state strides across the rows loaded in the Row FIFOs. This is done by reading and discarding n-1
pieces of information each where n is the stride value. After the striding has completed for the rows
currently loaded in the Row FIFOs, the state machine will move on to load the next three rows of input
data.
ROW STRIDE SHIFT: If all the input image rows of data have yet to be processed through the Max
Pooling Layer, this state will perform a vertical stride down the image. This is accomplished by the Row
Controller connecting the output of the Row 0 FIFO to the input of the Row 1 FIFO and connecting the
output of the Row 1 FIFO to the input of the Row 2 FIFO. The input of the Row 0 FIFO receives the data
from the input buffer and the old data in the Row 2 FIFO is read out and allowed to be lost. One row of
data is moved at a time until the correct number of rows have been shifted according to the stride setting in the Kernel Parameters Register.
CLEAR_COUNTERS: This state performs the function in the state name. The state clears all counters.
CHANNEL CLEAR: Once all the rows of the input image have been processed for a channel, the Row FIFOs are read out until empty. This allows the next round of data to be loaded into fresh Row FIFOs that contain no stale data.
Figure 64: Finite State Machine for the Max Pool Layer Row Controller
8.6.2 Heap Sorter Sub-module
In order to explain the Heap Sorter module first let’s look at the Heapsort algorithm. The (binary) heap
data structure is an array object that we can view as a nearly complete binary tree as shown in Figure 65.
Each node of the tree corresponds to an element of the array. The tree is filled on all levels except possibly
the lowest, which is filled from the left up to a point. (Cormen et.al, 2009)
There are two kinds of binary heaps: max-heaps and min-heaps. In both kinds, the values in the nodes
satisfy a heap property, the specifics of which depend on the kind of heap. In a max-heap, the max-heap
property is that for every node other than the root, a parent node's value is greater than the child node's value. Thus, the largest element in a max-heap is stored at the root, and the subtree rooted at a node contains values no larger than that contained at the node itself. A min-heap is organized in the opposite way; the min-heap property is that for every node other than the root, a parent node's value is less than the child node's value. The smallest element in a min-heap is at the root. (Cormen et.al, 2009)
As shown in Figure 65, a max-heap can be viewed as a binary tree and an array. The number within the
circle at each node in the tree is the value stored at that node. The number above a node is the
corresponding index in the array. Above and below the array are lines showing parent-child relationships;
parents are always to the left of their children. (Cormen et.al, 2009)
Figure 65: A max-heap viewed as (a) a binary tree and (b) an array.
(Cormen et.al, 2009)
In order to maintain the max-heap property, we call the procedure Max-Heapify. This procedure lets the
value in a parent node “float down” in the max-heap so that the subtrees obey the max-heap property.
In order to fully understand the action of Max-Heapify let’s look at an example shown in Figure 66. In
Max-Heapify at each step the largest of the elements between parent or right and left children is
determined. If the root node is the largest, then the subtree is already a max-heap and the action is
complete. Otherwise, one of the two children has the largest element, and it is swapped with the parent
node. This may cause one of the subtrees to violate the max-heap property. Consequently, we perform the
Max-Heapify action recursively on that subtree. As shown in Figure 66, the initial configuration at node 2
violates the max-heap property since it is not larger than both children. The max-heap property is restored
for node 2 in Figure 66(b) by exchanging the value at node 2 with the value at node 4. This destroys the
max-heap property for node 4. The recursive execution of Max-Heapify swaps the value at node 4 with the
value at node 9. As shown in Figure 66(c) node 4 is fixed up, and the recursive execution of Max-Heapify on node 9 yields no further change, completing the operation.
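For reference, the Max-Heapify procedure discussed above can be written in a few lines of Python; this is an illustration of the textbook algorithm (0-based array indexing), not the hardware state machine itself:

def max_heapify(a, i, heap_size=None):
    # Let the value at index i "float down" until the subtree rooted at i
    # satisfies the max-heap property.
    if heap_size is None:
        heap_size = len(a)
    left, right = 2 * i + 1, 2 * i + 2
    largest = i
    if left < heap_size and a[left] > a[largest]:
        largest = left
    if right < heap_size and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]   # swap parent with larger child
        max_heapify(a, largest, heap_size)    # recurse on the disturbed subtree

# A 3x3 max-pool neighborhood flattened into an array: after building the
# heap, the maximum of the neighborhood sits at the root, a[0].
values = [4.0, 9.5, 1.25, 0.0, 7.75, 3.5, 2.0, 6.0, 8.25]
for node in range(len(values) // 2 - 1, -1, -1):
    max_heapify(values, node)
print(values[0])  # 9.5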
Now having explained the Heapsort Algorithm, we can now look at how it was implemented in this design.
After the Row Controller has loaded the image row data into the Row FIFOs, the Row Controller then
sends one kernel's worth of image data to the Heap Sorter Submodule as illustrated in Figure 67. Once the data is handed over to the Heap Sorter Submodule, the maximum value of that 3x3 or 2x2 neighborhood is determined and written to the Output Buffer.
The state machine for the submodule is shown in Figure 68 and it performs the Max-Heapify action as described above. The state machine executes recursively on the kernel data and swaps any node value that violates the max-heap property.
Figure 67: Loaded row information is processed through the Heap Sorter
Figure 68: Finite State Machine for the Heap Sorter
9.0 SOFTMAX LAYER
9.1 Algorithm
As seen in Section 3.0, a Convolutional Neural network is configured to arrive at several class scores and
thereby classify the image it is given. These scores are then passed through a loss or cost function and used
to arrive at a loss value as well as the probabilities. This is to measure how well the neural network has
classified the image it was given. Therefore, the loss will be high if we are doing a poor job of classifying
the image and low if we are doing well. (Li, F., et. al, 2017)
A popular classifier, which will compute the loss, is the Softmax Classifier. The Softmax classifier outputs
the normalized class probabilities. (Li, F., et. al, 2017) The Softmax Function is:
Li = −log( e^fyi / Σj e^fj )          Pi = e^fyi / Σj e^fj
Where we are using the notation fj to mean the j-th element of the vector of class scores f. The Softmax
Function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero
and one that sum to one. (Li, F., et. al, 2017)
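A minimal Python sketch of the probability calculation above, for reference (a software model only; the hardware implementation is described in the following sections):

import numpy as np

def softmax_probabilities(scores):
    # Exponentiate each class score and normalize by the sum of exponentials.
    exps = np.exp(np.asarray(scores, dtype=np.float64))
    return exps / exps.sum()

probs = softmax_probabilities([2.0, 1.0, 0.1])
print(probs, probs.sum())          # three probabilities that sum to 1.0
print(-np.log(probs[0]))           # loss Li when class 0 is the correct class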
Due to mathematical complexity in the FPGA implementation, only the Probabilities are output by the Softmax Layer. In order to implement the Softmax Function and arrive at the Probability, each of the mathematical operations involved (exponentiation, summation, and division) had to be implemented in hardware.
9.2 Softmax Layer - Architecture
The Softmax Layer Module architecture is shown below in Figure 69. The module utilizes an AXI4-Lite
Slave Interface which provides a command and status interface between the MicroBlaze processor and the
internal logic of the module. The module also utilizes an AXI4 Master Interface which reads and writes
data to and from the DDR memory on the FPGA board. The AXI Master Interface retrieves input data
from specified DDR memory locations for the Softmax operation then outputs the result to a specified
region of memory.
Figure 69: Top level Architecture of the Softmax Layer
9.3 Softmax Layer - Register Set
The AXI Slave provides the following register set for the Softmax Layer.
9.3.1 Control Register
This register allows for controlling the Softmax Layer using a few essential signals.
9.3.2 Status Register
The status register breaks out important logic signals which may be used for debugging the design once it is running in hardware.
Table 64: Status Register Description
9.3.3 Input Data Address Register
The value contained in this register points to the location in the DDR memory where the layer input data begins.
9.3.4 Output Data Address Register
The value contained in this register points to the location in the DDR memory where the layer will begin writing its output data.
9.3.5 Probability 1 Register
The register contains the highest probability as well as its associated class after the softmax classifier has completed execution.
9.3.6 Probability 2 Register
The register contains the second highest probability as well as its associated class after the softmax classifier has completed execution.
Table 72: Prediction 2 Register Description
9.3.7 Probability 3 Register
The register contains the third highest probability as well as its associated class after the softmax classifier has completed execution.
9.3.8 Probability 4 Register
The register contains the fourth highest probability as well as its associated class after the softmax classifier has completed execution.
9.3.9 Probability 5 Register
The register contains the fifth highest probability as well as its associated class after the softmax classifier has completed execution.
Table 78: Prediction 5 Register Description
9.4 Softmax Layer - AXI Master
While the AXI Slave interface receives commands via the AXI Slave Bus, the AXI Master is responsible
for reading into the design the scores data from memory and then writing the resulting probability and class
number to registers on the AXI Slave interface as well as to DDR Memory. The AXI Master logic is
shown in Figure 70 with states specifically designed to retrieve the scores data from memory regions
specified by the AXI Slave Register settings. This module starts the Softmax Function operation and collects the results.
The state machine is shown in Figure 70 and the states are described in text for simplicity below.
IDLE: The FSM checks for a start command from the Control Register before moving to start the Softmax Function operation. The next states calculate the appropriate size of AXI Read transfers to read in the scores data from the neural network. The scores data is written into the Input Buffer. Once the scores for the classes have been read into the design, the state machine moves on.
SOFTMAX_EXECUTING: This state holds the state machine until the softmax function has completed
executing.
These states calculate the appropriate size of AXI write transfers to perform based on the number of classes specified in the Control register in the AXI Slave interface. Once data is detected in the Output Buffer, these states will read out the contents of the Output Buffer, sort the data from largest to smallest, and finally write the resulting probabilities and class numbers to the AXI Slave registers and to DDR Memory.
Figure 70: Finite State Machine for the Softmax Layers AXI Master
A look inside the Softmax Unit is shown in Figure 71 below. The unit is comprised of seven major
functional blocks. The input buffer is a FIFO buffer which receives scores data from the AXI Master and
reads it out to the Softmax Controller. The Softmax Controller sends the scores data to the Exponential
Function logic so that the exponential of each of the scores is calculated. As the exponential of the scores is
calculated, the results of exponentiation are written into the Exponential Buffer FIFO. Once the
exponential of each score is calculated, the Softmax Controller signals the Softmax Adder Wrapper logic to
begin reading out the Exponential Buffer data and summing it all together. As the summation is executing
the data in the Exponential Buffer is written back into itself to be used during the Division operation. After
the summation is complete, the Softmax Adder Wrapper signals to the Softmax Divider Wrapper to begin
the division process. The results of the division process are sent back to the Softmax Controller where they are written to the Output Buffer.
Figure 71: Architecture of the Softmax Unit
o_outbuff_almost_empty   std_logic   Out   Indicates output buffer is almost empty with one more valid piece of data
o_outbuff_full           std_logic   Out   Indicates output buffer is full
o_outbuff_almost_full    std_logic   Out   Indicates output buffer is about to fill up
o_outbuff_valid          std_logic   Out   Indicates output buffer contains valid data
o_expbuff_empty          std_logic   Out   Indicates exponential buffer is empty with no valid data
o_expbuff_almost_empty   std_logic   Out   Indicates exponential buffer is almost empty with one more valid piece of data
o_expbuff_full           std_logic   Out   Indicates exponential buffer is full
o_expbuff_almost_full    std_logic   Out   Indicates exponential buffer is about to fill up
o_expbuff_valid          std_logic   Out   Indicates exponential buffer contains valid data
o_softmax_complete       std_logic   Out   Indicates that the Softmax classifier has completed.
o_busy                   std_logic   Out   Indicates that the Softmax Layer is still busy executing
The following section will describe the design of each of the submodules involved in the design of the Softmax Layer.
In order to implement the Softmax Function into the FPGA design, custom Exponential Function logic was
developed. This logic performs the Maclaurin Series Taylor Expansion on the input data to calculate,
through successive iterations, the exponential.(Weisstein, 2019) Being that the input data are scores and are
positive, there is no need to consider the possibility that the input data may be a negative value. (Weisstein,
2019)
Where n is the order up to which the expansion will calculate the answer. Refining the above equation and applying it to the exponential function, we can rewrite the equation as (Weisstein, 2019):

e^x = 1 + x + x^2/2! + x^3/3! + ... + x^n/n!
Therefore, to implement these mathematical functions, we must pre-calculate the factorials out to an order
n. As a starting point for this design the order n chosen was n=24. Once the factorials are pre-calculated
and their values preloaded into registers in the design, any input to the exponential function can be
calculated with iterations of addition and multiplication. This design employs the use of the DSP48
Floating Point Single Precision Adder and Multiplier logic blocks. The pre-calculated factorials are shown
in Table 80.
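The iterative scheme can be sketched in Python as follows; this is a software model of the method only, using double-precision arithmetic rather than the truncated single-precision constants of Table 80:

import math

def exp_maclaurin(x, order=24):
    # Approximate e^x by summing x**n / n! for n = 0 .. order-1, using
    # reciprocal factorials built up front (the role of Table 80).
    inv_fact = [1.0]
    for n in range(1, order):
        inv_fact.append(inv_fact[-1] / n)          # 1/n! from 1/(n-1)!
    total, power = 0.0, 1.0                        # power holds x**n
    for n in range(order):
        total += power * inv_fact[n]               # add x**n / n!
        power *= x
    return total

print(exp_maclaurin(3.0), math.exp(3.0))           # both close to 20.0855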
The state machine for the submodule is shown in Figure 72 and the states are described in text for
simplicity.
Figure 72: Finite State Machine for the Exponential Function Logic
IDLE: The FSM checks for a signal from the Softmax Controller that it is ready for the Exponential
Function to begin processing data. This state also checks that it is receiving valid data from the Softmax
Controller.
Table 80: Factorials Pre-calculated for use in Design

n     1/n! (Decimal)   1/n! (Hexadecimal)
0     1                3F800000
1     1                3F800000
2     0.5              3F000000
3     0.166667         3E2A0000
4     0.041667         3D2A0000
5     0.008333         3C080000
6     0.001389         3AB60000
7     0.000198         39500000
8     2.48E-05         37D00000
9     2.76E-06         36380000
10    2.76E-07         34930000
11    2.51E-08         32D70000
12    2.09E-09         310F0000
13    1.61E-10         2F300000
14    1.15E-11         2D490000
15    7.65E-13         2B570000
16    4.78E-14         29570000
17    2.81E-15         274A0000
18    1.56E-16         25340000
19    8.22E-18         23170000
20    4.11E-19         20F20000
21    1.96E-20         1EB80000
22    8.9E-22          1C860000
23    3.87E-23         1A3B0000
MULT_INPUT: This state loads the input data value and the current value in the Multiplication register to the DSP Multiplier. This is done to calculate x^n, where x is the input data value and n is the order.
MULT_HOLD: This state holds the state machine for the amount of delay needed for the Floating-Point Multiplier to generate a valid output.
MULT_FACT: This state sends the product from the multiplier along with the pre-calculated factorial to the Floating-Point Multiplier. This is done to calculate the value x^n/n!, where x is the input data value and n is the order.
MULT_FACT_HOLD: This state holds the state machine for the amount of delay needed for the Floating-Point Multiplier to generate a valid output. The states MULT_INPUT, MULT_HOLD, MULT_FACT, and MULT_FACT_HOLD are repeated iteratively 24 times in order to calculate the full list of x^n/n! values, where n is in the range 0 to 23. Each product is placed into an array for the later step of summing them all together.
SUM_ALL, SUM_HOLD: These states sum the entire contents of the array containing the x^n/n! values to produce the final approximation of the exponential.
WRITE_RESULT: After the Exponential Function has completed its execution, the result is written to the
Exponential Buffer.
Once the exponential of the scores is calculated by the Exponential Function and stored in the Exponential
Buffer, the Softmax Controller signals the Softmax Adder Wrapper to begin the summation process. The
summation employs the use of a DSP48 Floating Point Addition Logic block to perform the individual
additions.
Figure 73: Finite State Machine for the Softmax Adder Wrapper
The state machine for the submodule is shown in Figure 73 and the states are described in text for
simplicity.
IDLE: The FSM checks for a signal from the Softmax Controller that the exponentiation process is
complete for all scores. It also checks for the existence of data in the Exponential Buffer.
SET_OPERANDS: This state sets the addend and augend for the DSP Adder.
ADDER_HOLD: This state holds the state machine to account for the adder delay between valid input to
valid output. The result is registered for use in the next summation iteration. Both this state and SET
OPERANDS are executed iteratively until all the data in the Exponential Buffer has been summed together.
DIVIDER_HOLD: This state holds the state machine until the division operation has fully completed.
Once the summation process has completed, the Softmax Adder Wrapper signals the Softmax Divider
Wrapper to begin the division process. The division employs the use of a DSP48 Floating Point Divider
Logic block to perform the individual divisions. This logic divides the data in the Exponential Buffer by the sum produced by the Softmax Adder Wrapper.
Figure 74: Finite State Machine for the Softmax Divider Wrapper
The state machine for the submodule is shown in Figure 74 and the states are described in text for
simplicity.
IDLE: The FSM checks for a signal from the Softmax Adder Wrapper indicating that the summation
process has completed. This state also checks for the existence of data in the Exponential Buffer FIFO.
SET_OPERANDS: This state reads out one piece of data from the Exponential Buffer FIFO and sets the dividend and divisor for the DSP Divider.
DIVIDER_HOLD: This state holds the state machine to account for the divider delay between valid input and valid output.
CLEAR_HOLD: Once the division process completes, the state machine waits for a clear signal from the Softmax Controller.
For all the operations described thus far, the Softmax Controller has been the logic orchestrating the entire
data flow. With several logic blocks accessing one FIFO buffer in the design, it seemed to be a better
design to have one logic block control which submodule gets access to the buffer.
The state machine for the submodule is shown in Figure 75 and the states are described in text for
simplicity.
IDLE: The FSM checks for a signal from the Softmax Adder Wrapper indicating that the summation
process has completed. This state also checks for the existence of data in the Exponential Buffer FIFO.
CALC_EXP: This state checks for the existence of data in the Input Buffer as well as if the Exponential Function is ready to receive data. Once those conditions are satisfied, this state reads out a piece of data from the Input Buffer and sends it to the Exponential Function.
EXP_HOLD: This state holds the state machine until the Exponential Function signals that it is ready for
another piece of data. Both CALC_EXP and EXP_HOLD execute iteratively until all the input score data has been exponentiated.
SUM_EXP: This state signals the Softmax Adder Wrapper that the exponentiation process is complete and to begin the addition process. This state holds the state machine until the summation has completed. Also, the Softmax Controller gives the Softmax Adder Wrapper access to the Exponential Buffer by sending it a grant signal.
DIVIDE: This state gives the Softmax Divider Wrapper access to the Exponential Buffer and holds the state machine until the division process has completed.
CLEAR_HOLD: Once the Softmax Function has been allowed to fully execute for the given class scores,
this state clears the downstream logic in preparation for the next image classification.
10.0 SIMULATION VERIFICATION TESTING
The most challenging and time-consuming aspect of the development process is the verification of the
design in question. In order to properly verify any FPGA design, the design must be tested against its
various operating conditions. This kind of rigorous testing must be performed both in simulation and in the
hardware itself. Testing must incorporate running data through the design and analyzing the resulting
output to verify that the design is in fact performing as intended. To be able to definitively determine that a
design is operating as intended, a model of the design must be generated which mimics the design's operations and output. After the truth model is generated, the FPGA design can be simulated using the known inputs and comparing the output of the design with that of the truth model. After the design is fully verified in simulation, hardware testing can begin.
For this project absolutely no tools existed prior to commencing this work, nor were there peers available with prior experience. Therefore, a considerable amount of time was invested into
developing an array of tools in various software environments that would be capable of testing the design to
completion. Of course, the processes of discovering and learning these tools meant that the verification
effort was extremely time consuming and involved more total time to complete than the design itself.
Before moving forward with any work on a hardware design, the inner workings of Convolutional Neural
Networks themselves had to be understood. This involved researching for the most comprehensive
treatment of the Convolutional Neural Network material. The most comprehensive coverage, at the time
this work was researched, rested with the creators of the ImageNet Challenge out of Stanford University.
Their open source coursework from their Computer Science course cs231n enabled this work and their
class tools laid the foundation in gaining understanding of the subject. The Stanford tools utilized the
Anaconda software package which uses Python Notebooks. These tools aided in the fundamental
understanding of the subject as well as aided in developing a model. Using these tools other custom Python
tools were created which ran and trained an AlexNet Convolutional Neural Network. The output of each
layer, the trained Weights, and trained Biases of this AlexNet model were then used in the verification effort.
10.2 Trained Model
To develop a trained model of a Convolutional Neural Network, two major avenues were pursued in
parallel. One avenue was expanding on the Stanford tools and developing a model using the Anaconda
with Jupyter Notebooks in Python. The other avenue was to use the much vaunted Tensorflow Software
Package developed by Google which utilizes the GPU on a Desktop or Laptop computer to perform the training.
Of these two, the Tensorflow option seemed to be the most promising since it could train a Convolutional
Neural Network very quickly using the GPU and leverage the extremely large data set provided by the
ImageNet Challenge database. The goal of developing a model in Tensorflow was achieved. However,
although developing a CNN model in Tensorflow was successful, diving into its inner workings in order to
see the data pixel by pixel proved to be much more tedious and difficult to work with.
Therefore, expanding on the Stanford tools was used to train a CNN with a very small training set. This
was done since the objective of this work was to prove a CNN can be deployed efficiently on a small
FPGA and not to develop a brand-new CNN framework itself. For this work, proving that the layers are functionally correct was sufficient.
Using the Stanford tools, a small data set of 10 images was used to train an AlexNet Convolutional Neural
Network. These images were selected randomly from a list of 10 Classes shown in Table 81.
Table 81: 10 Classes Used to Train the AlexNet Model
4 n02690373 airliner
mobile phone
It is important to note that the ImageNet Challenge Database contains more than 100,000 individual images
spanning 1000 or more classes with each image differing in size. Therefore, additional Python tools
were created to search the database for specific files, determine their size, obtain their bounding box
boundaries, and finally resize the image to the 227x227 AlexNet input image size. The randomly selected
images which were used as the model’s training set are shown in Figure 76.
Figure 76: The ten randomly selected training images, numbered 1 through 10.
Using these 10 images as the training set for the AlexNet CNN, the network was trained over 1000
iterations as shown in Figure 77. The model was trained using the following parameters.
d. Iterations = 1000
e. Batch Size = 10
From these parameters and from looking at Figure 77, we can clearly see that the trained AlexNet model
has overfit its training set with a Training Accuracy of 100% and a Validation Accuracy of 0%. For the
purposes of this work, this situation is adequate since as stated before the objective is to prove that the
CNN is functional on the FPGA given a weight set and input image. It would be a subject of another work to train a fully generalized model on a much larger data set.
10.3 Matlab Model
After the custom Python tools were created and run for the AlexNet Convolutional Neural Network, the
outputs, trained Weights, and trained Biases for each layer were used to create truth data. The truth data
would later be used with the FPGA simulation to verify the design. This was accomplished by generating
Matlab scripts which mimicked the operations of the hardware FPGA design. Therefore, given input data,
weights, and biases the scripts would generate a comparable truth data to use in simulation as well as make
available all the variables involved in the operation. Giving insight to the internal variables of the
Convolution, Affine, Max Pool, and Softmax layers was a critical debugging tool and was used in tandem
with the FPGA simulations in order to spot where the logic of the design needed work. These scripts can be found with the rest of the project source files.
Using the files generated from the Matlab scripts, the FPGA designs for the layer types of the
Convolutional Neural Network were tested in simulation. The choice of simulation environment was
Modelsim PE Student Edition by Mentor Graphics which provided a great deal of flexibility and
dependability in its operation. This software simulation package shares much in common with the
professional level Mentor Graphics Questasim software package. Each layer type in the design was tested
individually in simulation thereby reducing the complexity of the simulation as a whole. Various attempts
at testing the layer types were made, however, the most useful method was to test the layer type in an actual
AXI configured test bench. This method was the most useful and ultimately led to design completion and
verification. Therefore, a Xilinx Vivado block diagram was created for each layer type in two different
configurations: a virtual memory configuration and a block RAM configuration. The Vivado Block
diagrams for each layer are shown in Figure 78 through Figure 83.
Figure 78: Convolution/Affine Layer Virtual Memory Test Bench
Figure 82: Softmax Layer Virtual Memory Test Bench
The virtual memory configuration required the design of a custom memory block which would read input,
weight, and bias data into arrays and allow the AXI bus to read and write from and to those arrays. This
virtual memory allowed for actual data to be used in the simulation and would check the resulting data against the truth data.
The block RAM configuration operated much simpler than the Virtual Memory configuration in that it only
required the use of a single BRAM instead of a custom memory block. Using a Xilinx block RAM for
simulation allowed this type of simulation to execute very quickly, whereas the virtual memory configuration executed much more slowly.
Therefore, both simulation types were used in tandem. The block RAM configuration was used to test overall execution and flow of the design and the Virtual Memory configuration was used to test the correctness of the output data.
During the simulation testing, the Microblaze Co-Processor was not used since doing so would have made
the testing overly complicated. Therefore, a custom tester module was written to take the place of the
Microblaze. This custom tester configures each of the layer’s configuration registers and initiates the
execution of the layer. Three identical testers were developed which would test the Convolution/Affine
Layer, Maxpool Layer, and Softmax Layer. The source code for all the test-benches can be found with the rest of the project source files.
Simulation testing was deemed complete when each of the Convolutional Neural Network layer types were
tested with simulated input data, weights, and biases and their output matched that of the truth file data generated by the Matlab model.
11.0 HARDWARE VERIFICATION AND TESTING
Once the simulation testing had completed, the Hardware Verification testing phase of the design
development proceeded. With many Embedded Systems projects, which comes first, the design or the
required hardware, is a bit of a chicken or the egg problem. Some projects in industry force upon a project a
required hardware platform which engineers are required to design their system to run on. Other projects in
industry allow for the design itself to drive the requirements that hardware components would need to meet.
For this work the scenario would be the latter, where after the design was somewhat finalized, only then
were the full hardware requirements known. This is consistent with a research and development type
situation where the process of developing a system for the first time fully reveals its needed resources.
After the design was mostly finalized, the design was run through the Xilinx Synthesis and
Implementation tools in order to generate a bit file for hardware integration. Mostly finalized, meaning that
no major structural changes would be made to the logic of the design. At this point it became clear that the
32-bit data configuration of the design would take too many resources and would not fit onto the Zedboard
as was previously hoped. Therefore, the alternative board, the Nexys Video, was used, which does have enough resources to fit the design.
11.2 Testing
Once the Convolutional Neural Network FPGA design met the timing requirements of 100MHz clock rate,
hardware testing moved along. As with the simulation testing, the hardware testing first tested each layer
type individually in a full Microblaze-AXI configuration. Now that these hardware configuration tests
included the Microblaze processor, the Xilinx SDK XSCT tool was used. This tool allows for direct
commanding from the PC computer through the Microblaze and onto the AXI bus running on the FPGA.
To aid in this effort, custom scripts written in the TCL language were made which would configure the registers of each layer, load the input data to DDR Memory from the PC hard drive, and execute the layer under test.
Of course, as it is with many Embedded Systems designs, the first integration with hardware often shows
the design does not function and kicks off the hardware debugging effort. At this level in development
there isn’t as much access to the individual signals of the design as there is with simulation. This lack of
observability necessitates the use of yet another tool to allow for the hardware debugging to take place at
all. The tool in question would be some sort of Logic Analyzer to scope the internal signals of the design
and bring them out for human eyes to analyze. Therefore, the Xilinx Integrated Logic Analyzer tool was
used extensively in order to peer inside the design during its operation and see where the logic needed to be
adjusted.
Each layer by design writes its output data to a specified place in the DDR memory. Again, using TCL
scripts this output data was read out of the FPGA board memory and written to a bin file for comparison
with truth data. The Beyond Compare software was used in order to perform a bit by bit comparison and
ensure that the output data of each layer matched the expected truth data.
Simulation testing and these hardware testing methods were used iteratively in order to test logic changes until each layer functioned correctly in hardware.
12.0 RESULTS AND DISCUSSION
As stated earlier, the objective of this work was to successfully implement a scalable, modular, and
programmable Convolutional Neural Network on a small FPGA. This would allow a cost effective and
versatile tool for Embedded Applications utilizing image recognition/classification. This work was
successful in this endeavor and this section will characterize the design's performance both with Simulation and Hardware measurements.
We know that each Convolution and Affine layer will perform a certain number of Floating-Point
Operations. Therefore, we can say that the total number of Floating-Point Operations divided by the time
of execution yields the Floating-Point Operations per Second. The calculations arriving at the total number
of Floating-Point Operations is as follows. Let’s use the first Convolutional Layer as an example.
The input image is 227x227x3 with a weight filter kernel of size 11x11x3x96 and an output volume of size
55x55x96. The design currently allows for 33 Channel Units. So, the total number of Floating-Point
Operations for one output volume row's worth of the Channel Units is:

CHU_FLOP = 11 × 96 × 55 × 33 = 1,916,640
The adder tree employs the use of 6 adder layers with the following number of adders per layer
Layer 1: 16 adders
Layer 2: 8 adders
Layer 3: 4 adders
Layer 4: 2 adders
Layer 5: 1 adder
Layer 6: 1 adder

ACCU_TREE_FLOP = 11 × 96 × 55 × 32 = 1,858,560
The kernel adder tree adds an additional 3 adder layers with the following number of adders per layer.
Layer 1: 5 adders
Layer 2: 3 adders
Layer 3: 1 adder

KERNEL_TREE_FLOP = 55 × 96 × 9 = 47,520
The total number of Floating-Point Operations for one output volume row's worth of the Accumulator is:

ACCU_FLOP = 1,858,560 + 47,520 = 1,906,080

BIAS_FLOP = 96
Since there are 55 output volume rows, the number of Floating-Point operations for the Channel Unit and Accumulator over the whole output volume is:

VOLUME_FLOP = 207,636,096
For the Convolution and Affine Layers, if an input image contains more channels than there are available
resources, the design will process channels in groups. Therefore, after each iteration of channel groups, the
output data calculated for the previous channel group must be summed with the output data calculated for the current channel group.

PREV_FLOP = 55 × 55 × 96 × (1 − 1) = 0
For the current example the input image contains a number of channels within the available resources and
would therefore not require previous output data to be summed with current output data. The same
calculations for each Convolution and Affine Layer are performed and shown in Table 82 to determine the total number of Floating-Point Operations per layer.
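The per-row counts above can be reproduced with the short Python sketch below. The whole-volume line is an inference (one decomposition that happens to match the quoted 207,636,096 figure) rather than a formula stated explicitly in the text:

# Per-row Floating-Point Operation counts for CONV1, following the figures
# quoted above (33 Channel Units, a 32-adder column tree, a 9-adder kernel tree).
kernel_w, filters, out_w, out_rows = 11, 96, 55, 55
channel_units, column_adders, kernel_adders = 33, 32, 9

chu_flop         = kernel_w * filters * out_w * channel_units   # 1,916,640
accu_tree_flop   = kernel_w * filters * out_w * column_adders   # 1,858,560
kernel_tree_flop = out_w * filters * kernel_adders              #    47,520
accu_flop        = accu_tree_flop + kernel_tree_flop            # 1,906,080
bias_flop        = filters                                      #        96

# One decomposition consistent with the quoted whole-volume figure (an
# inference, not stated explicitly in the text): 55 rows of Channel Unit and
# column adder tree work plus one bias addition per filter.
volume_flop = out_rows * (chu_flop + accu_tree_flop) + bias_flop
print(chu_flop, accu_flop, volume_flop)            # 1916640 1906080 207636096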
12.2 Memory Transactions
Another important performance characteristic is the number of memory transactions being initiated in order
to successfully read in all the relevant data and write out the correct result. This becomes important since
each transaction with off chip memory adds delay in the overall execution of the Convolution or Affine
Layer. Therefore, keeping the number of memory transactions to a minimum will aid the performance of
any design. Continuing with the example of the first Convolution layer we can estimate the number of
memory transactions for both the read and write cases. The Convolution Layer inputs four kinds of data;
input image data, weight filter kernel data, bias data, and previous output volume data.
The input image data is read into the design in a row by row basis. The first transaction will prime the
Volume FIFOs in the Channel Units by reading in the first few rows of each channel. So, in this case if we
have 3 channels of the input image, three transactions will load the first few rows.
Prime_FIFO_MM = 3
For the stride operation, each row for each channel is read from memory individually and loaded into the
appropriate FIFO. This is done as many times as the stride value dictates. Stride operations are performed until the entire input image is loaded into the design. The first Convolution layer specifies a stride of 4.

Stride_MM = 3 × 4 × (55 − 1) = 648
Both the memory transactions to load the first rows into the design as well as the stride memory
transactions are performed as many times as needed to read all the input image channels into the design.
InputRead_MM = Channel Iterations × (Stride_MM + Prime_FIFO_MM)
The weight filter kernels are read into the design as one large transaction spanning the allowable channels.
So, in the example with a weight filter kernel set of dimensions 11x11x3x96, the 11x11 kernel for all three channels is read in as one transaction. This is repeated for each filter and, in the case where there are more channels than available resources, for each channel-group iteration.

Weights_MM = 96 × 1 = 96

Bias_MM = 96
In the case where the number of input image channels exceeds the number of available resources, the
previous output volume results are read back into the design to sum these values with those of the current
output volume data. The previous output volume data is read into the design one row at a time.
PreviousData_MM = 96 × 55 × (1 − 1) = 0
For the current example the input image contains a number of channels within the available resources and would therefore not require previous output data to be summed with current output data. The same calculations for each Convolution and Affine Layer are performed and shown in Table 83 to determine the total number of memory transactions per layer.
Table 83: Memory Read and Write Transactions Per AlexNet Layer
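The read-transaction counting used above can be summarized in a short Python sketch; the helper below follows the CONV1 walk-through and omits the previous-output reads, which are zero whenever a single channel-group iteration suffices:

def conv_read_transactions(channels, stride, out_rows, filters, channel_iterations):
    # Rough count of the read transactions for a Convolution layer, following
    # the CONV1 walk-through above.
    prime_fifo   = channels                              # first rows, one burst per channel
    stride_reads = channels * stride * (out_rows - 1)    # one row per channel per stride step
    input_reads  = channel_iterations * (stride_reads + prime_fifo)
    weight_reads = filters * channel_iterations          # one kernel set per filter
    bias_reads   = filters
    return input_reads + weight_reads + bias_reads

# CONV1: 3 channels, stride 4, 55 output rows, 96 filters, 1 channel iteration.
print(conv_read_transactions(3, 4, 55, 96, 1))           # (648 + 3) + 96 + 96 = 843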
12.3 Simulation Performance
We can estimate the performance of each layer of the design by analyzing the simulations used during
simulation testing. During the operation of the design, each layer is configured and then allowed to start its
execution after a start signal has been given by command to the layer’s AXI Slave interface. After each
layer has completed its execution, a done or complete signal is set high by the layer’s logic. We can get an
idea of the total time of execution for each layer by comparing the start time to the complete time. These execution times are shown in Table 84.
Table 84: Simulation Execution time of each AlexNet Layer
The finalized design implementing a Convolutional Neural Network can be seen in Figure 85 and is an implementation of the architecture shown in Figure 30. The design was synthesized and implemented in the Xilinx Vivado development environment. The pertinent specifications for the synthesized design can be summed up in Figure 88, which shows the timing, utilization, and power summaries.
As shown in Figure 88 as well as in Figure 84, the implementation of a Convolutional Neural Network on
FPGA met its timing requirements with the constraint of a 100MHz input clock frequency.
As shown in Figure 88 as well as in Figure 86, the design currently is using ~70% of the FPGA Look-up
Tables, ~40% of the FPGA Flip Flops, and ~40% of the FPGA Block RAM. It is worth making special
mention that the design currently only uses ~16% of the FPGA DSPs. Figure 89 illustrates the FPGA
resource utilization.
As shown in Figure 88 as well as in Figure 87, the design currently consumes 1.374 watts of dynamic
power and 0.179 watts of static power with a total power consumption of 1.553 watts.
Figure 88: Synthesized and Implemented Design
Figure 89: Graphic showing usage of FPGA resources.
12.5 Hardware Performance
Just as how the Floating-Point Operations per second were calculated for the simulations, we can also
calculate the FLOPs in a similar fashion. In this case however the Integrated Logic Analyzer was used
along with an epoch counter running on the FPGA. The combination of the Vivado ILA and the epoch
counter aided greatly in the hardware debugging of the design. The epoch counter increments an epoch
count every time a cycle counter counts 100 cycles. With a 100MHz clock frequency, the epoch counter
increments every 1us. Both Cycle Counter and Epoch Counter are broken out of the design and fed to the
logic analyzer for display and analysis. The counters start when the start signal is given by command and
stop when the done signal is flagged by the logic signaling that the Convolution or Affine operation has completed.
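Converting an observed epoch count into a FLOPS figure is then a simple division, sketched below with a hypothetical epoch reading chosen purely for illustration:

def flops_from_epochs(total_operations, epoch_count):
    # With the 100 MHz clock, the epoch counter increments every 100 cycles,
    # i.e. every 1 us, so the execution time is simply epoch_count microseconds.
    execution_time_s = epoch_count * 1e-6
    return total_operations / execution_time_s

# Hypothetical reading for illustration only: 207,636,096 operations observed
# over 70,000 epochs (70 ms) works out to roughly 2.97 GFLOPS.
print(flops_from_epochs(207_636_096, 70_000) / 1e9)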
12.5.1 Output Results
The ten randomly selected images shown in Figure 76 from the 10 class list shown in Table 81 were run
through this hardware FPGA implementation of a Convolutional Neural Network in the AlexNet
configuration. All 10 images were classified correctly by the FPGA hardware with the image’s class being
correctly identified as being the one with the largest score out of 10 classes. As shown below in Table 86,
both the Python Model scores as well as the Hardware scores are shown for comparison. Each ImageNet
ID for each input image is given along with the proper class number out of the 10-class set. As can be seen
from the scores, both the Python and Hardware scores are virtually identical with the hardware achieving a
great deal of precision. Also provided are the results of the Softmax Classifier hardware probabilities
shown in Table 87. Looking at Table 87, we can see that the images are classified with 80-100%
probability.
Image #   Image ID           Model Class   Model Score   HW Class   HW Score Hex   HW Score Dec   %Diff
1         n02690373_3435     4             7.771         4          0x40F8778B     7.7646         0.0824
2         n02708093_1783     8             5.008         8          0x40A05553     5.0104         0.0479
3         n02992529_59933    10            4.844         10         0x409AF820     4.8428         0.0248
4         n02690373_6755     4             9.227         4          0x41138E41     9.2222         0.0520
5         n03930630_24834    5             6.361         5          0x40CB98F0     6.3624         0.0220
6         n02708093_3688     8             5.808         8          0x40BA17C8     5.8154         0.1272
7         n02690373_17389    4             9.753         4          0x411C0F50     9.7537         0.0072
8         n03063599_3650     3             8.181         3          0x41028B29     8.159          0.2696
9         n03063599_4149     3             5.665         3          0x40B4E50D     5.653          0.2123
10        n03063599_5006     3             7.136         3          0x40E41692     7.1278         0.1150
Table 87: 10 Images Softmax Probability Results
To see how each layer of the neural network is processing the input image, the results of each layer in the
neural network for an example image shown in Figure 90 are read out of the hardware memory and
converted to BIN file for display. This example image is run through the hardware and the results of the
first Convolutional Layer with the ReLu activation is shown in Figure 91. The results of each subsequent
AlexNet Layer are shown in Figure 92 through Figure 98. Each of these output images was read from
hardware memory and are results of the full AlexNet Convolutional Neural Network executing.
The image shown in Figure 91 is the result of Convolving the Input RGB image with the Weight filter and
bias data sets for the first convolution layer, the CONV1 layer. The input image shown in Figure 90 was
of dimension 227x227x3, the Weight Filter Kernel was of dimension 11x11x3x96, and the Bias data was
96 values. The Convolution operation used a stride of 4 columns and/or rows as well as a pad of 0. The
resulting image in Figure 91 is a mosaic of the output image or feature map with dimensions 55x55x96.
All 96 channels are displayed, and we can already see the weight filter kernel beginning to filter out and/or intensify certain features of the image.
After the image had been run though the CONV1 layer, the image is passed through the first Maxpooling
Layer, MAXPOOL1. As was described in a previous section, the Maxpooling layer applies a max value
kernel for a 3x3 neighborhood throughout the entire image data set. This kernel strides every 2 rows and/or
columns. Therefore, the image shrinks in size from 55x55x96 to 27x27x96. The resulting image shown in
Figure 92 shows how this process already begins to eliminate some of the lesser intensity detail.
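The layer dimensions quoted in this section all follow the standard output-size relation, which can be checked with a few lines of Python (a convenience sketch, not part of the hardware design):

def output_size(in_size, kernel, stride, pad=0):
    # Standard output-size relation obeyed by the layer dimensions above.
    return (in_size - kernel + 2 * pad) // stride + 1

print(output_size(227, 11, 4, 0))   # CONV1:    227 -> 55
print(output_size(55, 3, 2))        # MAXPOOL1:  55 -> 27
print(output_size(27, 5, 1, 2))     # CONV2:     27 -> 27
print(output_size(27, 3, 2))        # MAXPOOL2:  27 -> 13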
Figure 92: Example result of Maxpool Layer 1
Output Feature Map or Image has dimensions of 27 pixels high, 27 pixels wide, and 96 channels deep.
The image shown in Figure 93 is the result of Convolving the output of the MAXPOOL1 layer with the
Weight filter and bias data sets for the second convolution layer, the CONV2 layer. The MAXPOOL1
result shown in Figure 92 was of dimension 27x27x96, the Weight Filter Kernel was of dimension
5x5x96x256, and the Bias data was 256 values. The Convolution operation used a stride of 1 column and/or
row as well as a pad of 2. The resulting image in Figure 93 is a mosaic of the output image or feature map
with dimensions 27x27x256. All 256 channels are displayed, and we can especially see the difference
when compared to Figure 91 in that even more weight filters are beginning to suppress some features of the image while intensifying others.
Figure 93: Example result of Convolution/ReLu Layer 2.
Output Feature Map or Image has dimensions of 27 pixels high, 27 pixels wide, and 256 channels deep.
Same as with the MAXPOOL1 layer, the resulting image from the CONV2 layer is passed through the second Maxpooling Layer, MAXPOOL2. This Maxpooling layer also strides every 2 rows and/or columns as well as uses a 3x3 max filter neighborhood. Therefore, the image shrinks in size from 27x27x256 to 13x13x256, as shown in Figure 94.
Figure 94: Example result of Maxpool Layer 2
Output Feature Map or Image has dimensions of 13 pixels high, 13 pixels wide, and 256 channels deep.
Just as with the CONV1 and CONV2 layers, the CONV3 (see Figure 95), CONV4 (see Figure 96), and
CONV5 (see Figure 97) layers execute in the same manner. The only real difference is that between the
CONV3 through 5 layers, there is no Max pool operation. These three Convolution layers use a stride and
pad of 1 with a 3x3 weight filter kernel. Therefore, the output images between these three layers do not shrink in size and stay 13x13 even though their channel counts change. As the input image is processed
through these layers, we can see the weight filter kernels further filtering out some image features while
intensifying others.
Figure 95: Example result of Convolution/ReLu Layer 3.
Output Feature Map or Image has dimensions of 13 pixels high, 13 pixels wide, and 384 channels deep.
Figure 97: Example result of Convolution/ReLu Layer 5.
Output Feature Map or Image has dimensions of 13 pixels high, 13 pixels wide, and 256 channels deep.
Same as with the MAXPOOL1 and MAXPOOL2 layers, the resulting image from the CONV5 layer is passed through the third and last Maxpooling Layer, MAXPOOL3. This Maxpooling layer also strides every 2 rows and/or columns as well as uses a 3x3 max filter neighborhood. Therefore, the image shrinks in size from 13x13x256 to 6x6x256. The resulting image is shown in Figure 98.
Figure 98: Example result of Maxpool Layer 3
Output Feature Map or Image has dimensions of 6 pixels high, 6 pixels wide, and 256 channels deep.
Now that the design's performance has been characterized and the image classification ability has been assessed to be accurate, a discussion should be had as to why the system performs as it does.
The floating-point operations per second for the layer simulations shown in Table 84 and the FLOPs
gathered from the hardware layers shown in Table 85 are shown below in Table 88. In the table, although
the simulation FLOPS results come very close to approximating the actual hardware FLOPS achievable by
each layer, there is a growing percentage difference between the two performance assessments. This is
likely due to the simulation estimating an ideal model of the actual hardware and does not and cannot
consider real latencies present in the actual hardware. This difference has a compounding effect the more channel iterations a layer must perform.
Table 88: Simulation and Hardware FLOPs

Layer     SIM FLOPs   HW FLOPs    %Diff      Channel Iterations
CONV1     2.931 G     2.9530 G    0.742%     1
CONV2     128.468 M   113.796 M   12.893%    16
CONV3     37.156 M    29.106 M    27.660%    32
CONV4     26.645 M    20.830 M    27.914%    48
CONV5     26.574 M    20.763 M    27.986%    48
AFFINE1   176.322 M   113.884 M   54.827%    64
AFFINE2   87.677 M    38.077 M    130.258%   128
AFFINE3   33.919 M    20.229 M    83.233%    128
Also evident in Table 88 is the fact that the majority of the AlexNet layers in this implementation perform
at the MFLOPS range with the only GFLOPS performance seen by the first Convolutional Layer. With
current research at large on GPUs yielding performance in the tens or hundreds of GFLOPS, an explanation for this performance level is warranted. Beyond the raw operation counts, another aspect of the design which is of paramount importance is the number of Memory Transactions performed.
As shown in Table 89 the relationship between the Floating Point Operations to be performed and the
Memory Transactions needed can be seen when comparing the ratio between these two aspects and the
measured Floating Point Operations Per Second achieved in the design. Despite having the greatest
number of operations to perform, the first Convolution Layer was not slowed down by having to perform too many memory transactions. Therefore, this balance between the operations needed and the memory
transactions needed dictated the achievable performance. This relationship can be seen in both Figure 99
and Figure 100. Therefore, it can be stated that the best achievable performance with the design as it is
currently configured is approximately 3 GFLOPs. Let’s next discuss how this number can be increased.
Figure 100: Simulation Performance vs. Ops/Mem Trans. ratio
These relationships, derived from observations of this design, coincide exactly with the reasoning about
CNN system behavior put forward by a joint research collaboration between UCLA and Peking University in
China. That research group also observed in their FPGA CNN implementations the relationship between the
floating-point operations the system needs to perform, the number of memory transactions, and the
achieved FLOPS performance. They brought over an already existing modeling scheme from multicore
processing and applied it to FPGA CNN systems. The Roofline Model, as its originators from UC Berkeley
called it, relates system performance to off-chip memory traffic and the peak performance provided by the
hardware:
\[ \text{Attainable Perf.} = \min\big(\text{Computational Roof},\; \text{CTC Ratio} \times BW\big) \]
The equation formulates the attainable throughput of an application on a specific hardware platform.
Floating-point performance (GFLOPS) is used as the metric of throughput. The actual floating-point
performance of an application kernel can be no higher than the minimum of two terms. The first term
describes the peak floating-point throughput provided by all available computation resources in the system,
or the computational roof. Operations per unit of DRAM traffic, or the computation to communication
(CTC) ratio, characterizes the DRAM traffic needed by a kernel in a specific system implementation. The
second term bounds the maximum floating-point performance that the memory system can support for a
given CTC ratio. Figure 101 visualizes the roofline model with the computational roof and the I/O
bandwidth roof. Algorithm 2 in the figure has a higher computation to communication ratio, or better data
reuse, compared to Algorithm 1. From the figure, we can see that by fully utilizing all hardware
computation resources, Algorithm 2 can outperform Algorithm 1, in which computation resources are
under-utilized because of inefficient off-chip communication.
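The roofline bound is simple enough to express directly. The sketch below uses purely hypothetical numbers for the computational roof, CTC ratio, and bandwidth (not values measured from this design) to show how a kernel can come out either memory-bound or compute-bound:

    def attainable_gflops(computational_roof, ctc_ratio, bandwidth_gb_s):
        """Roofline model: attainable performance is the lesser of the compute roof
        and the bandwidth roof (CTC ratio x off-chip bandwidth)."""
        return min(computational_roof, ctc_ratio * bandwidth_gb_s)

    # Hypothetical examples only:
    print(attainable_gflops(6.6, ctc_ratio=0.8, bandwidth_gb_s=3.2))  # 2.56 GFLOPS, memory-bound
    print(attainable_gflops(6.6, ctc_ratio=4.0, bandwidth_gb_s=3.2))  # 6.6 GFLOPS, compute-bound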
12.6.3 Methods of Improvement
This implementation of a Convolutional Neural Network in an AlexNet configuration is a first-pass attempt
and leaves a lot of room for improvement and optimization. There are a few ways the performance of this
implementation could be increased, which would be areas for future work. Looking at Table 90, we can see
the differences in resource utilization and performance between other recent works and this one. Although
this implementation achieved lower GFLOPS performance, it uses far fewer chip resources than any of the
other implementations, and its estimated power consumption is also far lower.
[Table 90, "Data format" row: 16-bit fixed, Fixed(8-16b), 32-bit Float, 32-bit Float, 32-bit Float]
The current design depends heavily on the Channel Unit structure to perform the Convolution and Fully
Connected operations. As it currently stands, only 33 Channel Units, which use one DSP each, are
instantiated in the design. Increasing the number of DSPs used would allow more channels to be processed
at a given time. More channels per Convolution/Affine operation means that the required number of memory
transactions decreases, and as can be seen in Figure 99 and Figure 100, this would increase the performance
capability of each layer in the neural network. Increasing the DSP usage does not come without increasing
other logic block instantiations as well. Therefore, even though the FPGA has around 700 DSPs, not all of
them will likely be usable, since other FPGA resources would reach 100% utilization before then.
Another possible way of improving the achievable performance is to increase the clock frequency output by
the Phase-Locked Loop generating the FPGA system clock. The current design uses a 100 MHz system
clock. Simply doubling the clock frequency could increase the maximum achievable system performance to
approximately 6 GFLOPS. Increasing the clock frequency may require some logic optimization, since the
existing logic may not meet the timing requirements of the new clock constraints.
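As a rough first-order projection (a sketch only: it assumes throughput scales linearly with clock rate and with the number of Channel Units, and it ignores memory bottlenecks and timing closure), the effect of these two changes can be estimated as follows:

    def projected_gflops(base_gflops=3.0, base_clock_mhz=100, base_channel_units=33,
                         clock_mhz=100, channel_units=33):
        """First-order projection: scale the measured peak linearly with clock and parallelism."""
        return base_gflops * (clock_mhz / base_clock_mhz) * (channel_units / base_channel_units)

    print(projected_gflops(clock_mhz=200))                     # ~6.0 GFLOPS from doubling the clock
    print(projected_gflops(clock_mhz=100, channel_units=66))   # ~6.0 GFLOPS from doubling Channel Units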
A method employed by some other works is to reduce the number of bits used to represent data values
throughout the FPGA implementation of the Convolutional Neural Network. As discussed earlier, two
smaller data formats could be used instead of the one used in this work: 16-bit fixed point and half precision.
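As an illustration of the 16-bit fixed-point option, the sketch below quantizes float values assuming a Q8.8 split (8 integer bits, 8 fractional bits); this particular split is an assumption for the example and not a format choice made in this work:

    import numpy as np

    FRAC_BITS = 8  # assumed Q8.8 split for illustration

    def to_fixed16(x):
        """Quantize float values to 16-bit fixed point (round to nearest, saturate)."""
        q = np.round(np.asarray(x, dtype=np.float64) * (1 << FRAC_BITS))
        return np.clip(q, -(1 << 15), (1 << 15) - 1).astype(np.int16)

    def from_fixed16(q):
        return q.astype(np.float32) / (1 << FRAC_BITS)

    w = np.array([0.1234, -1.5, 0.0078], dtype=np.float32)
    print(from_fixed16(to_fixed16(w)))  # values rounded to the nearest 1/256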
Another possible way of improving performance would be to optimize the already existing logic. As was
stated earlier, this implementation is a first-pass attempt, and logic optimization for speed was sacrificed in
favor of functionality.
13.0 CONCLUSIONS
Machine Learning and its subdiscipline Deep Learning are quickly gaining popularity, as we have seen in
reviewing the literature and the overall state of the art. Machine Learning algorithms have been successfully
deployed in a variety of applications such as Natural Language Processing, Optical Character Recognition,
and Speech Recognition. Deep Learning in particular is best suited to Computer Vision and Image
Recognition tasks. The Convolutional Neural Networks employed in Deep Learning train a set of weights
and biases which, with each layer of the network, learn to recognize key features in an image. In recent
years, much of the work on Deep Learning Convolutional Neural Networks has been firmly in the realm of
Computer Science, with much of the computation performed on large Graphics Processing Unit cards in
desktop computer towers. However, as was seen in the literature, GPUs, while effective at processing large
amounts of image data, are extremely power hungry. Current FPGA implementations have mostly concerned
themselves with acceleration of the Convolutional Layer only, rather than the network as a whole.
Therefore, this work set out to develop a scalable and modular FPGA implementation for Convolutional
Neural Networks. It was the objective of this work to develop a system which could be configured to run as
many layers as desired and to test it using a currently defined CNN configuration, AlexNet. This type of
system would allow a developer to scale a design to fit any size of FPGA.
Overall, the thesis of this work was proven correct: such a system can be developed for a small Embedded
System. The objective of this work was achieved, and all layers were accelerated, including the Convolution,
Affine, ReLu, Max Pool, and Softmax layers. The performance of the design was assessed, and its maximum
achievable performance was determined to be approximately 3 GFLOPS. While a far cry from the most
cutting-edge GPU implementations, this design was a first attempt and can be optimized using several
approaches, which could be the subject of future work.
14.0 REFERENCES
2. Mohri, Mehryar, et al. Foundations of Machine Learning, MIT Press, 2014. ProQuest Ebook
3. Murnane, K. (Apr. 01, 2016). What Is Deep Learning And How Is It Useful?. Forbes.com.
and-how-is-it-useful/#302bbf5ed547
4. Siegel, E. (Apr. 07, 2018). Twelve Hot Deep Learning Applications Featured at Deep Learning
learning-applications-featured-at-deep-learning-world/9454/
5. Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (*
= equal contribution) ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
6. Christiane Fellbaum (1998, ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT
Press.
7. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet Classification with Deep
convolutional-neural-networks.pdf
8. Zisserman, A. & Simonyan, K. (2014). Very Deep Convolutional Networks For Large-Scale
9. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
10. Goodfellow, I., & Bengio, Y., & Courville, A., (2016). Convolutional Networks. In Dietterich, T.,
11. Ma, Y., & Suda, N., & Cao, Y., & Seo, J., & Vrudhula, S. (Sept. 29, 2016). Scalable and
Conference on Field Programmable Logic and Applications (FPL), 26th, Session S5b-
Compilation. doi:10.1109/FPL.2016.7577356
12. Zhang, C., & Li, P., & Sun, G., & Guan, Y., & Xiao, B., & Cong, J. (Feb. 22, 2015). Optimizing
FPGA-based Accelerator Design for Deep Convolutional Neural Networks. Proceedings of the
doi: 10.1145/2684746.2689060
13. Ahn, B. (Oct. 01, 2015). Real-time video object recognition using convolutional neural network.
10.1109/IJCNN.2015.7280718
14. Li, H., & Zhang, Z., & Yang, J., & Liu, L., & Wu, N. (Nov. 6, 2015). A novel vision chip
architecture for image recognition based on convolutional neural network. IEEE 2015
15. Motamedi, M., & Gysel, P., & Akella, V., & Ghiasi, S. (Jan. 28, 2016). Design space exploration
of FPGA-based Deep Convolutional Neural Networks. 2016 Asia and South Pacific Design
16. Lacey, G., & Taylor, G., & Areibi, S., (Feb. 13, 2016). Deep Learning on FPGAs: Past, Present,
17. Dundar, A., & Jin, J., & Martini, B., & Culurciello, E. (Apr. 08, 2016). Embedded Streaming
Deep Neural Networks Accelerator With Applications. IEEE Transactions on Neural Networks
18. Qiao, Y., & Shen, J., & Xiao, T., & Yang, Q., & Wen, M., & Zhang, C. (May 06, 2016). FPGA‐
accelerated deep convolutional neural networks for high throughput and energy efficiency.
Concurrency and Computation Practice and Experience. John Wiley & Sons Ltd.
19. Guo, K., & Sui, L., & Qiu, J., & Yao, S., & Han, S., & Wang, Y., & Yang, H. (July. 13, 2016).
Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware. IEEE
10.1109/ISVLSI.2016.129
20. Zhu, M., & Liu, L., & Wang, C., & Xie, Y. (Jun. 20, 2016). CNNLab: a Novel Parallel
Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis.
21. Li, F., et al. Image Classification Pipeline [PDF document]. Retrieved from Lecture Notes Online
Website: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture2.pdf
22. Li, F., et al. Linear Classification [HTML]. Retrieved from Lecture Notes Online Website:
http://cs231n.github.io/linear-classify/
23. Li, F., et al. Loss Functions and Optimization [PDF document]. Retrieved from Lecture Notes
24. Li, F., et al. Backpropagation and Neural Networks [PDF document]. Retrieved from Lecture
25. Li, F., et al. Convolutional Neural Networks [PDF document]. Retrieved from Lecture Notes
26. Li, F., et al. Training Neural Networks [PDF document]. Retrieved from Lecture Notes Online
Website: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
27. Li, F., et al. Neural Networks [HTML]. Retrieved from Lecture Notes Online Website:
http://cs231n.github.io/neural-networks-1/
28. Li, F., et al. Deep Learning Software [PDF document]. Retrieved from Lecture Notes Online
Website: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf
29. Li, F., et al. CNN Architectures [PDF document]. Retrieved from Lecture Notes Online Website:
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
30. Xilinx (2018). 7 Series DSP48E1 Slice [User Guide UG479]. Retrieved from
https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf
31. Weisstein, Eric W. "Maclaurin Series." From MathWorld--A Wolfram Web Resource.
http://mathworld.wolfram.com/MaclaurinSeries.html
32. Cormen, et al. (2009). Introduction to Algorithms, 3rd ed. Cambridge, Massachusetts: The MIT
Press.
33. Brown, S. & Vranesic, Z. (2002). Fundamentals of Digital Logic with Verilog Design, 1st ed. New
34. Xilinx (2017). Zynq-7000 All Programmable SoC Family Product Tables and Product Selection
7000-product-selection-guide.pdf
35. Xilinx (2018). All Programmable 7 Series Product Selection Guide. Retrieved from
https://www.xilinx.com/support/documentation/selection-guides/7-series-product-selection-
guide.pdf
36. ARM (2018). AMBA AXI and ACE Protocol Specification. Retrieved from
http://www.gstitt.ece.ufl.edu/courses/fall15/eel4720_5721/labs/refs/AXI4_specification.pdf
37. IEEE Computer Society (2008). IEEE Standard for Floating-Point Arithmetic (IEEE Std 754)
org.proxy.library.cpp.edu/stamp/stamp.jsp?tp=&arnumber=5976968
APPENDIX – Links to Repositories
The links to the GitHub repositories are provided here and serve as the appendix for this work due to the
large amount of content they embody.
https://github.com/maespinosa/Thesis_VHDL
https://github.com/maespinosa/Thesis_Python
https://github.com/maespinosa/Thesis_Documents
https://github.com/maespinosa/Thesis_Matlab_Models