
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3285279

A Reconfigurable CNN-based Accelerator Design for Fast and Energy-Efficient Object Detection System on Mobile FPGA
HEEKYUNG KIM1 (Student Member, IEEE), AND KYUWON KEN CHOI1 (Senior Member, IEEE)
1 Department of Electrical and Computer Engineering, Illinois Institute of Technology, IL 60616 USA ([email protected])
Corresponding author: Heekyung Kim (e-mail: [email protected]).
This work was supported by the Technology Innovation Program of the Ministry of Trade, Industry & Energy (MOTIE, Korea) and the Korea Electronics Technology Institute (KETI), South Korea (Software and Hardware Development of a cooperative autonomous driving control platform for commercial special and work-assist vehicles), under Grant 1415181272.

ABSTRACT In limited-resource edge computing circumstances, such as mobile devices, IoT devices, and electric vehicles, an energy-efficient, optimized convolutional neural network (CNN) accelerator implemented on a mobile Field Programmable Gate Array (FPGA) is becoming more attractive due to its high accuracy and scalability. In recent years, mobile FPGAs such as the Xilinx PYNQ-Z1/Z2 and Ultra96 have shown a clear advantage in scalability and flexibility for the implementation of deep learning-based object detection applications. They are also suitable for battery-powered systems, especially drones and electric vehicles, achieving energy efficiency in terms of both power consumption and size. However, they have low and limited performance for real-time processing. In this article, we introduce an accelerator design flow optimized at the register-transfer level (RTL) to achieve fast processing speed by applying low-power techniques to the FPGA accelerator implementation. In general, most accelerator optimization techniques are conducted at the system level on the FPGA. In this article, we propose a reconfigurable accelerator design for a CNN-based object detection system on a mobile FPGA. Furthermore, we present the applied RTL optimization design techniques, such as various types of clock gating, to eliminate residual signals and to deactivate unnecessarily active blocks. Based on the analysis of the CNN-based object detection architecture, we analyze and classify the common computing operation components of the convolutional neural network, such as multipliers and adders. We implement the multiplier/adder units as a universal computing unit and modularize it to suit a hierarchical RTL code structure. Experimental results show that the proposed design process reduces power consumption by 16% and improves hardware utilization and throughput by up to 58% and 15%, respectively.

INDEX TERMS FPGA Accelerator; CNN Accelerator; RT Level Design Techniques; Low Power Techniques; Reconfigurable Accelerator; CNN-based Object Detection; Low Power Consumption; High Performance; Mobile FPGA

I. INTRODUCTION

Convolutional Neural Network (CNN)-based object detection has been applied in various systems, including Field Programmable Gate Array (FPGA) devices, from personal mobile devices to industrial machines such as healthcare devices, smart surveillance systems, Advanced Driver Assistance Systems (ADAS), drones, and logistics robots [1]–[6]. To achieve high recognition accuracy, CNNs have become an essential feature of many diverse object detection devices, whether cloud-based or edge devices. The primary implementation issue of a CNN application is that its computing complexity is high and its power consumption is large when fast processing speed and high accuracy are required at the same time. High computing complexity also involves a large number of operation units and massive memory accesses. Dynamic power is consumed during the data transfer process and in the time-delay process of the computing operation. Real-time inference of CNN-based object detection seems impossible on mobile FPGA devices, which

VOLUME 11, 2023

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4

have limited hardware resources, such as memory size and lower processor performance. In these power- and hardware-resource-limited circumstances, to improve performance and reduce power consumption, many researchers have proposed CNN accelerators at various design levels, including the system level, application level, architecture level, and transistor level [7]–[10]. Recent studies have proposed a flexible CNN accelerator design for FPGA implementation at the transistor level and a flexible FPGA accelerator for various CNN architectures, from lightweight to large-scale CNNs [11]–[16].

Since CNN-based object detection applications have become common technology for unmanned drones, autonomous vehicles, in-vehicle ADAS, and industrial automation systems, researchers have been conducting CNN object detection-related research on the following topics: implementation on mobile FPGA-SoC boards for real-time processing, accelerator design for mobile FPGA System-on-Chip (SoC), and hardware optimization techniques. To overcome the lack of hardware resources on mobile FPGAs such as the Xilinx Ultra96 and Xilinx PYNQ-Z1, which are popular FPGA-SoC devices implemented on drones and IoT devices, many papers have been published to achieve high performance, low power consumption, and real-time processing speed [17]–[24]. The main focus of the implementation techniques in those papers is reducing the size of the CNN architecture, pre-processing the input feature map, tightening the pipelining design, adjusting the size of the input and output feature maps, and code optimization [9], [25]–[32]. Moreover, in our previous research, we verified that RT-level optimization is able not only to reduce the processing time but also to save dynamic power [33]–[35].

Therefore, in this work, we applied low-power techniques to the baseline RTL code of the CNN accelerator generated by Tensil and applied hardware optimization techniques to the proposed reconfigurable FPGA hardware accelerator design through the proposed automated optimization tool for RTL code. The rest of this paper is organized as follows: Section II introduces the low-power techniques at the RT level for energy-efficiently accelerating the CNN computing operation and overviews the basic RT-level optimization hardware design flow based on a baseline CNN accelerator RTL code generated by Tensil. Section III describes the architecture of the proposed accelerator and the design details of the data flow and processing modules in two parts: optimization & modularization and low-power techniques. Section IV presents the proposed hardware implementation. Section V discusses the implementation and simulation results together with previous works. Finally, the conclusions are given in Section VI.

FIGURE 1. Vivado HLS Design Flow

FIGURE 2. FPGA SoC Platform Design Architecture

II. BACKGROUND
To design a CNN accelerator on an FPGA-SoC board, the use of CAD tools and platforms is required. Each manufacturer provides CAD tools and development platforms for the implementation process and the reconfigurable components and parts on the FPGA (e.g., Vivado from Xilinx, Quartus Prime from Intel, and PYNQ). However, due to the closed-platform feature of the Xilinx FPGA products, in the High-Level Synthesis (HLS) design flow shown in Figure 1, the Vivado HLS system can verify the functionality of C/C++/SystemC code and convert the code to register-transfer level (RTL) code for FPGA hardware operation and optimization [7], [23], [36]; however, once the RTL code is generated by the Vivado HLS tool, the code is no longer readable or modifiable. On the other hand, the platform-based design flow can import VHDL/Verilog code as customized IP blocks, so that we can easily modify the hardware design at the RT level or gate level and intuitively configure the data flow for the Processing System (PS) and Programmable Logic (PL) through the Vivado IP Integrator.

A. PLATFORM-BASED DESIGN FLOW WITH RTL CODE
The platform-based design flow was introduced by Xilinx Vivado, an integrated design environment, as shown in Figure 1. The RTL code can be imported into an IP block, and it can be assembled with other peripheral IP blocks and PS IP blocks to generate the hardware design for the bitstream. Jupyter Notebook is the web-based primary computing environment of PYNQ, which is linked to Xilinx platforms [37]. PYNQ runs on Python in the Jupyter Notebook with a Linux kernel on the FPGA. However,

the Python library is not fully supported in PYNQ.

FIGURE 3. Conventional Register

FIGURE 4. Local Explicit Clock Enable (LECE)

FIGURE 5. Local Explicit Clock Gating (LECG)

FIGURE 6. Bus-Specific Clock Gating (BSCG)

B. BASELINE RTL CODE GENERATION FOR CNN ACCELERATOR: TENSIL
Tensil is a set of tools for designing accelerators, including an RTL generator, a model compiler, and a set of drivers [24]. The basic processing flow is that, using the selected machine learning accelerator architecture for the limited FPGA-SoC device, it generates the RTL code by using a model compiler. The primary advantage of Tensil is that it is able to create an accelerator without quantization or other degradation. Tensil applies only a few optimization techniques for the selected FPGAs, so the optimization is not effective enough. Previously, we applied our low-power techniques to CNN accelerator RTL code and verified the performance [33]. In Section III, we apply the techniques to the Tensil RTL code and evaluate their effectiveness on the FPGA-SoC board PYNQ-Z1.

C. LOW POWER TECHNIQUES AT RT-LEVEL
1) Low Power Clock Techniques
Clock Gating (CG) is a basic low-power technique that enhances performance and efficiency by disabling unnecessary clock cycles, as shown in Figure 3. Standby states are present in many parts of the CNN computing process, which leads to a significant amount of power consumption; CG eliminates unnecessary clock cycle occurrences. Local Explicit Clock Enable (LECE) [38]–[40] is a method that uses an ENABLE signal for a 2:1 multiplexer or multiplexed D flip-flop to update the output on the rising edge of the clock only when the ENABLE signal is high, as shown in Figure 4. The more bits are used as an input, the more ENABLE signals occur. Local Explicit Clock Gating (LECG) [38]–[40] has the same fundamental principle as LECE, as shown in Figure 5; however, LECG has the advantage of reducing power consumption in the case of multi-bit outputs by updating the whole output at once when the output update completes.

Bus-Specific Clock Gating (BSCG) [38]–[40] utilizes the clock gating technique and adjusts the EN signal based on a comparison of the I/O signals, as shown in Figure 6. In terms of power consumption, XOR gates are significantly lower-power logic gates in gate-level power analysis compared to AND/OR gates. Enhanced Clock Gating (ECG) [38], [40] consists of XOR gates that control the input clock signals and enable signals when considering multi-bit I/O data, as shown in Figure 7. The power reduction is maximized when the pipelines and I/O bit sizes are larger.

III. ARCHITECTURE DESIGN OF ACCELERATOR

FIGURE 7. Enhanced Clock Gating (ECG)

FIGURE 8. Proposed Processing System and Programmable Logic Unit

FIGURE 9. IP Block Design for CNN Object Detection

FIGURE 10. Flexible Accelerator Design Overview

A. ARCHITECTURE OVERVIEW
The block diagram in Figure 8 shows the data flow of the proposed processing unit design for an FPGA-based CNN object detection accelerator. For the programmable logic, each type of block is defined specifically and is modularized to enhance the implementation efficiency of various CNN models. The proposed architecture can be mainly divided into computing processing logic and a memory system, detailed as follows. In the memory system, there are three main functional components for on-chip and off-chip data transfer that prepare data for computation. First, the buffers are responsible for storing data. All the weights and intermediate feature maps are arranged in a layer-by-layer format and stored in external DRAM. When a tile of data is loaded into the on-chip input/weight/output ping-pong buffers, it is arranged in a unique format according to the requirement of the computation mode. Second, a dispatching module employs a Direct Memory Access (DMA) engine, through DMA descriptors generated by the DMA control module, to fetch required data from DRAM or save the results back to DRAM. Third, the on-chip data scheduling modules, consisting of scatter and gather modules, realize the serial-to-parallel or parallel-to-serial conversions that manipulate the data flow for the following computation or transmission.

B. THE PROPOSED RECONFIGURABLE ACCELERATOR HARDWARE ARCHITECTURE
As shown in Figure 9, the IP block design for the CNN object detection accelerator consists of referencing IP blocks and customized IP blocks (top_pynqz1_0). In the top_pynqz1_0 block, there are hierarchically defined multiply-accumulate units (MACs), POOLs, memory bandwidth, memory access scheduler, and CONV computing modules. The original Tensil RTL code does not have a hierarchical architecture; in that case, however, analysis of the RTL code would take a long time.

The primary feature of FPGA devices is reconfigurability. Therefore, to maximize the flexibility of the FPGA-SoC design, the proposed RTL code of the CNN accelerator was designed with hierarchical and modularized main modules, including MACs, Conv, Multiplier, Adder, MUX, and ALU, as shown in Figure 10. The figure shows that the proposed flexible accelerator design has the scalability to support different CNN architectures such as the YOLO series and ResNet20. After the modularization of the MAC unit, we applied our low-power techniques, such as clock gating, XOR gates, and OR gates for MUXs. This design is able to accommodate add-on detectors, such as Single-Shot Detectors (SSD) and Multibox detectors. For the memory access modules, such as InnerDualPortMem1, DualPortMem1, MemSplitter, and MemBoundarySplitter, memory partitioning techniques are applied. To accelerate the CPU computing operation, a memory reassignment technique has been applied so that the memory size and flow are changed once a pre-assigned computation is detected. For example, if the target CNN accelerator architecture uses fixed 16-bit data, then we can pre-assign the memory size prior to the input data or the weights. This is helpful for sequential computation operations such as the convolution operation.
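The clock-gating idea applied to these modules can be illustrated with a small behavioral model. The sketch below is illustrative only, not the authors' RTL: a BSCG-style register derives its enable by comparing the incoming data bus with the stored output, clocks the register only when they differ, and we count the register writes that gating eliminates on a feature-map-like stream with long standby runs.

```python
# Behavioral sketch of a bus-specific clock-gated (BSCG-style) register.
# Illustrative model, not the paper's RTL: the enable is derived by an
# XOR comparison of the input bus with the stored output, and the
# register is clocked only when they differ.

class GatedRegister:
    def __init__(self):
        self.q = 0          # stored output
        self.writes = 0     # clock edges that actually update the register

    def tick(self, d):
        enable = (d ^ self.q) != 0   # XOR comparison of input and output buses
        if enable:
            self.q = d
            self.writes += 1
        return self.q

class PlainRegister:
    def __init__(self):
        self.q = 0
        self.writes = 0

    def tick(self, d):
        self.q = d
        self.writes += 1    # clocked every cycle, even when d == q
        return self.q

# Feature-map-like stream with long runs of repeated values (standby states).
stream = [0, 0, 7, 7, 7, 3, 3, 0, 0, 0, 5, 5]

gated, plain = GatedRegister(), PlainRegister()
for d in stream:
    assert gated.tick(d) == plain.tick(d)   # outputs stay cycle-accurate equal

saved = plain.writes - gated.writes
print(f"writes: plain={plain.writes}, gated={gated.writes}, saved={saved}")
```

With this input pattern the gated register is clocked on only 4 of 12 cycles while producing identical outputs; in hardware, each suppressed edge is a clock toggle whose dynamic power is saved.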

FIGURE 11. Low Power Design Flow

FIGURE 12. Conventional MAC (Left), Proposed MAC (Right)

FIGURE 13. Pseudocode for the Proposed MAC Operation

IV. PROPOSED HARDWARE IMPLEMENTATION

A. LOW POWER HW DESIGN TECHNIQUES AT RTL
As shown in Figure 9, in this experiment our low-power optimization targets Tensil's processing flow. Step 1: Based on the neural architecture file, .tarch, Tensil generates the TCU RTL code for the basic hardware resource design, as shown in Figure 11. Step 2: After the RTL code is generated, we apply our low-power techniques, including LECG, split memory, BSC, and ECG. Step 3: Using Vivado, we design the hardware IP block; from the IP block design, we obtain the bitstream file. Step 4: Based on the customized bitstream, we implement the hardware accelerator for the CNN object detection algorithm. Step 5: We simulate it on the FPGA board and evaluate the power consumption of the target DNN-based object detection processing by using Vivado.

The optimized multiplier design, using a power-efficient adder block based on power analysis, was implemented at the transistor level, as shown in Figure 12 (right). In convolution computation, the complexity of the multipliers can cause dynamic power consumption and delays. To reduce the complexity of the adder and multiplier, we first tested the full adder designs through a transistor-level design process. For the MAC module, Bus-Specific Clock gating (BSC) is applied. In a conventional register, the data input is active and lasts until the end of the period; in this case, power can be wasted. When BSC is applied to register Z, the XOR can control the clock enable so that clock toggles are not wasted. An AND gate and a latch were added to safely disable the clock without allowing any glitches to reach the register clock.

B. PROPOSED MAC HARDWARE DESIGN
The detailed technique approaches are as follows: 1. The MAC unit is the major power-consuming unit of the convolution operation, in which data transmissions occur frequently. This technique is applied to remove the wasted clock toggles while the data input is deactivated. 2. The proposed adder group with BSC can reduce the wasted clock toggles, and thus the power consumption of the adder unit. 3. In stochastic multiplication, two unary bit-streams can be multiplied using an AND gate, and an OR gate can be applied instead of the MUX operation. Not only can the OR gate support parallel MAC operation in the same way as a MUX, it also results in reduced dynamic power consumption. Eventually, we utilized this parallel MAC structure using OR gates, as shown in Figure 12 (right). Figure 13 shows the proposed MAC pseudocode, to which the BSC and OR-based MAC computing operations have been applied.

C. FLEXIBLE ACCELERATOR DESIGN FOR MULTI-ARCHITECTURE AND OPTIMIZATION TECHNIQUES
Based on the analysis result of the target CNN architecture, we customize the pipelining of the data flow and assign maximized buffer capacity in the BRAM and external memory. Fixed-point numbers are able to reduce the computation resource consumption and the bandwidth requirements; however, to obtain high performance, the optimized bandwidth size should be defined by the analysis of the network architecture. Once the data transmission size is fixed, memory splitting and merging should be applied. Our CNN accelerator is based on the 16-bit fixed-point bandwidth given by the reference [24]. We modularize the RTL code based on thorough analysis, which makes modification easy when implementing the accelerator design.

FIGURE 14. GPU Testbed (GPU TITAN X)

FIGURE 15. FPGA Testbed (Xilinx PYNQ-Z1 FPGA)
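To make the 16-bit fixed-point data path concrete, the sketch below models a MAC in a Q8.8-style format. The Q8.8 split (8 integer, 8 fractional bits) is an assumption for illustration only; the paper states only that a 16-bit fixed-point bandwidth from [24] is used. The zero-operand skip mirrors the BSC idea of not toggling the accumulator register when the input is inactive.

```python
# Illustrative 16-bit fixed-point MAC model (Q8.8 assumed: 8 integer bits,
# 8 fractional bits). Behavioral sketch, not the paper's RTL.

FRAC_BITS = 8
SCALE = 1 << FRAC_BITS          # 256

def to_q88(x: float) -> int:
    """Quantize a real value to a signed Q8.8 integer."""
    q = int(round(x * SCALE))
    assert -(1 << 15) <= q < (1 << 15), "Q8.8 overflow"
    return q

def from_q88(q: int) -> float:
    return q / SCALE

def mac(acc: int, a: int, b: int) -> int:
    """acc += a*b in Q8.8; the product of two Q8.8 values is Q16.16,
    so shift right by FRAC_BITS to renormalize (arithmetic shift)."""
    return acc + ((a * b) >> FRAC_BITS)

# Dot product of a small weight/activation tile.
weights = [0.5, -1.25, 2.0, 0.0]
inputs  = [1.0,  0.5,  0.25, 3.0]

acc = 0
skipped = 0
for w, x in zip(weights, inputs):
    qw, qx = to_q88(w), to_q88(x)
    if qw == 0 or qx == 0:      # BSC-style gating: no register toggle for a zero operand
        skipped += 1
        continue
    acc = mac(acc, qw, qx)

print(from_q88(acc), "MAC updates skipped:", skipped)
```

For this tile the accumulator holds 96 in Q8.8, i.e., 0.375, and one of the four MAC updates is skipped; halving the bus from 32-bit float to 16-bit fixed point is what reduces the memory bandwidth requirement mentioned above.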

V. EXPERIMENT AND RESULTS WITH DISCUSSION

A. EXPERIMENT ENVIRONMENT
For the basic hardware platform, we chose the PYNQ-Z1 board instead of a regular ZYNQ-7020 board; PYNQ is an open-source project from AMD [37]. The board embeds a Xilinx ZYNQ-7020 and also provides a Jupyter-based framework with Python APIs. The PYNQ-Z1 board has an FPGA-SoC platform composed of PL and PS. The basic software development tool is Jupyter Notebook, a web-based software programming platform. It also supports the Python and C/C++ programming languages and other open-source libraries such as OpenCV. Our experiment environment is shown in Figure 14 and Figure 15. The imported CNN architecture is ResNet-20. It has 23 layers and was trained on CIFAR-10, which provides a test set of 10,000 images in several formats. We used the provided weight file and converted the ResNet model to ONNX format [41]. ONNX, a machine learning (ML) model converter, provides the converted ML model code in ONNX format. The Tensil compiler generates three artifacts: .tmodel, .tdata, and .tprog files. Once the .tmodel manifest for the model is loaded into the driver, it tells the driver where to locate the binary files, program data, and weights data. They were not open data, and we used them without any modification, which means the accuracy was not changed.

FIGURE 16. HW Resource Report Comparison of Tensil Sample Simulation and Our Work tested on PYNQ-Z1

B. FPGA IMPLEMENTATION RESULTS
Compared with Tensil's optimization result, we verified that more register buffers are activated in our proposed structure. Once we check the functionality and performance results, we can modify the structure through RTL code modification and then improve the specific hardware resources and power consumption of the design. Analyzing the result leads to improved performance of CNN processing. Figure 16 shows the power consumption reduction of the processing system unit. We were able to achieve 43.9 GOPs/W as a power efficiency result; compared to other FPGA board implementations, this is an increase of 1.37 times. The hardware resource utilization in DSPs increased 2.2 times from the result of [24].

C. POWER CONSUMPTION RESULTS
Our optimization decreases the dynamic power consumption by 16%. Also, the total on-chip power is reduced by 20% of the total power consumption. Once the global buffer is activated, the unused global clock buffer and the second global clock resource help to improve the performance of the design. Moreover, this can be the solution to some high fan-out signals to make the device fully functional. In the pipeline logic, inserting an intermediate flip-flop (FF) can improve the working speed of the device; however, too many flip-flops increase computational complexity. Our low-power techniques show better performance than that of FF insertion.

TABLE 1. Comparison with previous FPGA-based CNN accelerator implementations

                      [17]          [18]           [19]          [20]       [21]       [22]          [23]        [24]      Ours
Year                  2018          2018           2018          2018       2017       2018          2019        2022      2022
CNN Model             AlexNet       MobileNetV2    VGG16         VGG16      VGG19      VGG16         AP2D-Net    ResNet20  ResNet20
FPGA                  ZYNQ-XCZ7020  Intel Arria10  ZYNQ-XCZ7020  Virtex-7   Stratix V  Intel Arria   Ultra96     PYNQ-Z1   PYNQ-Z1
                                    SoC                          VX690t     GSMDS      10
Clock (MHz)           200           133            214           150        150        200           300         50        50
BRAMs                 268           1844**         85.5          1220       919**      2232**        162         198       523
DSPs                  218           1278           190           2160       1036       1518          287         73        167
LUTs                  49.8K         -              29.9K         -          -          -             54.3K       14.6K     15.2K
FFs                   61.2K         -              35.5K         -          -          -             94.3K       9.1K      41.2K
Precision (W,A)*      (16, 16)      (16, 16)       (8, 8)        (16, 16)   (16, 16)   (16, 16)      (8-16, 16)  (16, 16)  (16, 16)
Latency (s)           0.016         0.004          0.364         0.106      0.107      0.043         0.032       0.178     0.109
Throughput (GOPs)     80.35         170.6          84.3          290        364.4      715.9         130.2       55.0      63.3
Power (W)             2.21          -              -             35         25         -             5.59        1.714     1.440
Power Eff. (GOPs/W)   36.36         -              -             8.28       14.57      -             23.3        28.2      43.9
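The power-efficiency column is a derived quantity, throughput divided by power, and can be checked directly from the table for the rows that report both values; the short script below (an illustration, not part of the measurement flow) also recomputes the DSP utilization increase over the Tensil baseline [24] quoted in Section V-B.

```python
# Cross-checking Table 1's derived numbers: power efficiency = throughput / power.
# Only rows that report both throughput (GOPs) and power (W) are included.
rows = {
    "[17]": (80.35, 2.21),
    "[20]": (290.0, 35.0),
    "[21]": (364.4, 25.0),
    "[23]": (130.2, 5.59),
    "Ours": (63.3, 1.440),
}
for name, (gops, watts) in rows.items():
    print(f"{name}: {gops / watts:.2f} GOPs/W")

# DSP utilization versus the Tensil baseline [24]: 73 DSPs -> 167 DSPs.
print(f"DSP increase: {167 / 73:.2f}x")
```

Each recomputed entry agrees with the table up to rounding (e.g., "Ours" computes to about 43.96 GOPs/W versus the reported 43.9), and 167/73 is approximately 2.29, consistent with the "2.2 times" DSP figure in the text.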

VI. CONCLUSION
In this article, the proposed highly reconfigurable FPGA hardware accelerator showed improved performance in terms of processing speed and power consumption during inference of various CNNs. The hardware optimization was conducted mainly for two purposes: to improve the throughput and to reduce the power consumption. For improving performance, a minimized data transfer strategy was applied by assigning the maximum amount of buffers during the computations and by applying a controlled pipeline design for minimized data access. For achieving energy-efficient CNN object detection operation, beyond the data access control for minimized memory access, we proposed RT-level low-power reconfigured MAC units applied to the RTL code of the proposed accelerator, such as the advanced clock-gating adder, register Z with bus-specific clock gating, and the OR-based MAC architecture. The proposed hardware accelerator for ResNet-20 was implemented on the mobile FPGA-SoC PYNQ-Z1, and the power consumption was measured during inference operation. As a result, the throughput showed a 15% improvement compared with the baseline RTL code of the accelerator, power consumption was reduced by 16%, and hardware utilization was increased by 58%. The object detection processing speed was 9.17 FPS, which shows that real-time processing is feasible on a mobile FPGA.

ACKNOWLEDGEMENT
We thank our colleagues from KETI and KEIT, who provided insight and expertise that greatly assisted the research and greatly improved the manuscript.

REFERENCES
[1] A. K. Jameil and H. Al-Raweshidy, "Efficient CNN architecture on FPGA using high level module for healthcare devices," IEEE Access, vol. 10, pp. 60486–60495, 2022.
[2] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. Faughnan, "Smart surveillance as an edge network service: From Harr-cascade, SVM to a lightweight CNN," in 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), 2018, pp. 256–265.
[3] K. Haeublein, W. Brueckner, S. Vaas, S. Rachuj, M. Reichenbach, and D. Fey, "Utilizing PYNQ for accelerating image processing functions in ADAS applications," in ARCS Workshop 2019; 32nd International Conference on Architecture of Computing Systems, 2019, pp. 1–8.
[4] Z. Zhang, M. A. P. Mahmud, and A. Z. Kouzani, "FitNN: A low-resource FPGA-based CNN accelerator for drones," IEEE Internet of Things Journal, vol. 9, no. 21, pp. 21357–21369, 2022.
[5] C. Fu and Y. Yu, "FPGA-based power efficient face detection for mobile robots," in 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2019, pp. 467–473.
[6] X. Li, X. Gong, D. Wang, J. Zhang, T. Baker, J. Zhou, and T. Lu, "ABM-SpConv-SIMD: Accelerating convolutional neural network inference for industrial IoT applications on edge devices," IEEE Transactions on Network Science and Engineering, pp. 1–1, 2022.
[7] S. Tamimi, Z. Ebrahimi, B. Khaleghi, and H. Asadi, "An efficient SRAM-based reconfigurable architecture for embedded processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 3, pp. 466–479, 2019.
[8] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa, "Power efficient design of high-performance convolutional neural networks hardware accelerator on FPGA: A case study with GoogLeNet," IEEE Access, vol. 9, pp. 151897–151911, 2021.
[9] S. Lee, D. Kim, D. Nguyen, and J. Lee, "Double MAC on a DSP: Boosting the performance of convolutional neural networks on FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 5, pp. 888–897, 2019.
[10] S. Ullah, S. Rehman, M. Shafique, and A. Kumar, "High-performance accurate and approximate multipliers for FPGA-based hardware accelerators," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 2, pp. 211–224, 2022.
[11] X. Wu, Y. Ma, M. Wang, and Z. Wang, "A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 3, pp. 1185–1198, 2022.
[12] W. Liu, J. Lin, and Z. Wang, "A precision-scalable energy-efficient convolutional neural network accelerator," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 10, pp. 3484–3497, 2020.
[13] H. Irmak, D. Ziener, and N. Alachiotis, "Increasing flexibility of FPGA-based CNN accelerators with dynamic partial reconfiguration," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), 2021, pp. 306–311.
[14] W. Chen, D. Wang, H. Chen, S. Wei, A. He, and Z. Wang, "An asynchronous and reconfigurable CNN accelerator," in 2018 IEEE International Conference on Electron Devices and Solid State Circuits (EDSSC), 2018, pp. 1–2.
[15] C. Yang, Y. Wang, H. Zhang, X. Wang, and L. Geng, "A reconfigurable


HEEKYUNG KIM received the B.S. degree in electronic and electrical engineering from Hongik University, Seoul, Korea, in 2012, and the M.S. degree in electrical and computer engineering from the Illinois Institute of Technology, Chicago, in 2015. She is currently pursuing the Ph.D. degree in computer engineering at the Illinois Institute of Technology, Chicago.

Her current research interests include low-power and high-performance HW/SW optimization of CNN accelerator designs for FPGA and ASIC, wireless sensor networks and sensor data analysis, and embedded HW/SW system design for robotics, drones, and IoTs.

KYUWON KEN CHOI (Senior Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 2002. During his Ph.D. studies, he proposed and conducted several projects supported by NASA, DARPA, NSF, and SRC on power-aware computing and communication. From 2004, he was with the Takayasu Sakurai Laboratory, The University of Tokyo, Japan, as a Postdoctoral Research Associate, working on leakage power reduction circuit techniques. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Illinois Institute of Technology. Prior to joining IIT, he was a Senior CAD Engineer and a Technical Consultant for low-power system-on-chip (SoC) design at Samsung Semiconductor, Broadcom, and Sequence Design, and he has eight years of industry experience in VLSI chip design, from the compiler level to the circuit level. Over the last few years, using his novel low-power techniques, several processor and control VLSI chips were successfully fabricated in deep-submicrometer CMOS technology, and more than 80 peer-reviewed journal and conference papers have been published. He is currently the Director of the VLSI Design and Automation Laboratory (DA-Lab), IIT.

His research interests include ultra-low-power circuit and IC design for multimedia, mobile, and energy harvesting applications. He is also a TPC member for several IEEE circuit design conferences. Over the last seven years, DA-Lab has been awarded several projects on hardware design for modern applications, including IoT, AI, deep learning, unmanned vehicles, and energy harvesting, totaling about 1.6 million U.S. dollars from U.S. federal agencies and the Korean government. He has served as the President of the Korean-American Scientists and Engineers Association (KSEA) Chicagoland Chapter, and he is currently the Technical Group Director and a member of the Advisory Election Committee at the KSEA headquarters. He is also the Editor-in-Chief of the Journal of Pervasive Technologies and a Guest Editor of Springer and Wiley journals.
