A Reconfigurable CNN-based Accelerator Design For
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3285279
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2023.0322000
ABSTRACT In resource-limited edge computing environments such as mobile devices, IoT devices,
and electric vehicles, energy-efficient convolutional neural network (CNN) accelerators
implemented on mobile Field Programmable Gate Arrays (FPGAs) are becoming more attractive due to
their high accuracy and scalability. Mobile FPGAs such as the Xilinx PYNQ-Z1/Z2 and
Ultra96 offer scalability and flexibility for implementing deep learning-based
object detection applications. They are also suitable for battery-powered systems, especially
drones and electric vehicles, where low power consumption and small size are required.
However, their limited performance makes real-time processing difficult to achieve. In this article,
we introduce an accelerator design flow optimized at the register-transfer level (RTL), which achieves
fast programming speed by applying low-power techniques to the FPGA accelerator implementation.
In general, most accelerator optimization techniques are applied at the system level on the FPGA. In this
article, we propose a reconfigurable accelerator design for a CNN-based object detection system on mobile
FPGA. Furthermore, we present the RTL optimization techniques applied in the design, such as various
types of clock gating that eliminate residual signals and deactivate unnecessarily active
blocks. Based on an analysis of the CNN-based object detection architecture, we identify and classify the
common computing components of the convolutional neural network, such as multipliers
and adders. We implement the multiplier/adder unit as a universal computing unit and modularize it to
suit a hierarchical RTL code structure. Experimental results show that the proposed design process
improves power consumption, hardware utilization, and throughput by 16%, up to 58%, and
15%, respectively.
INDEX TERMS FPGA Accelerator; CNN Accelerator; RT Level Design Techniques; Low Power Techniques;
Reconfigurable Accelerator; CNN-based Object Detection; Low Power Consumption; High Performance;
Mobile FPGA
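As an illustration of the clock-gating idea summarized in the abstract, the following Python sketch models a multiply-accumulate (MAC) unit whose register is clocked only when an enable signal is high. It is a behavioral toy model, not the RTL of the proposed accelerator: the class name `GatedMac`, the function `dot_product`, and the zero-operand gating condition are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class GatedMac:
    """Behavioral model of a multiply-accumulate unit behind a clock gate."""

    acc: int = 0             # accumulator register
    clocked_cycles: int = 0  # cycles in which the gated clock reached the register
    total_cycles: int = 0    # all cycles presented to the clock gate

    def tick(self, a: int, b: int, enable: bool) -> int:
        """Advance one cycle; the register updates only when enable is high."""
        self.total_cycles += 1
        if enable:
            self.clocked_cycles += 1
            self.acc += a * b
        return self.acc


def dot_product(weights, activations):
    """Accumulate a dot product, gating the clock off for zero operands."""
    mac = GatedMac()
    for w, x in zip(weights, activations):
        # Assumed gating condition: a zero operand cannot change the sum,
        # so the accumulator register does not need a clock edge this cycle.
        mac.tick(w, x, enable=(w != 0 and x != 0))
    return mac


mac = dot_product([1, 0, 3, 0], [4, 5, 0, 7])
print(mac.acc, mac.clocked_cycles, mac.total_cycles)
```

In this model the ratio of `clocked_cycles` to `total_cycles` is a rough proxy for how often the register toggles. Actual power savings depend on the target device and the synthesized netlist, which this sketch does not capture.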
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3285279
HEEKYUNG KIM received the B.S. degree in electronic and electrical engineering from Hongik University,
Seoul, Korea, in 2012 and the M.S. degree in electrical and computer engineering from the Illinois
Institute of Technology, Chicago, in 2015. She is currently pursuing a Ph.D. degree in computer
engineering at the Illinois Institute of Technology, Chicago.
Her current research interests include low-power and high-performance HW/SW optimization for
CNN accelerator design on FPGA and ASIC, wireless sensor networks and sensor data analysis, and
embedded HW/SW system design for robotics, drones, and IoT devices.