Article
FPGA-Based Vehicle Detection and Tracking Accelerator
Jiaqi Zhai 1 , Bin Li 1,2, * , Shunsen Lv 1 and Qinglei Zhou 1
1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
2 Henan Key Laboratory of Network Cryptography Technology, Zhengzhou 450001, China
* Correspondence: [email protected]
Abstract: A convolutional neural network-based multiobject detection and tracking algorithm can be
applied to vehicle detection and traffic flow statistics, thus enabling smart transportation. Aiming at
the problems of the high computational complexity of multiobject detection and tracking algorithms,
a large number of model parameters, and difficulty in achieving high throughput with a low power
consumption in edge devices, we design and implement a low-power, low-latency, high-precision,
and configurable vehicle detector based on a field programmable gate array (FPGA) with YOLOv3
(You-Only-Look-Once-version3), YOLOv3-tiny CNNs (Convolutional Neural Networks), and the
Deepsort algorithm. First, we use a dynamic threshold structured pruning method based on a scaling
factor to significantly compress the detection model size on the premise that the accuracy does not
decrease. Second, a dynamic 16-bit fixed-point quantization algorithm is used to quantize the network parameters to reduce the memory occupation of the network model. Furthermore, we generate a
reidentification (RE-ID) dataset from the UA-DETRAC dataset and train the appearance feature
extraction network on the Deepsort algorithm to improve the vehicles’ tracking performance. Finally,
we implement hardware optimization techniques such as memory interlayer multiplexing, param-
eter rearrangement, ping-pong buffering, multichannel transfer, pipelining, Im2col+GEMM, and
Winograd algorithms to improve resource utilization and computational efficiency. The experimental
results demonstrate that the compressed YOLOv3 and YOLOv3-tiny network models decrease in
size by 85.7% and 98.2%, respectively. The dual-module parallel acceleration meets the demand of 6-way parallel video-stream vehicle detection with a peak throughput of 168.72 fps.
Sensors such as LiDAR, GPS, and cameras achieve better vehicle detection, but their deployment is costly and influenced by environmental factors [7,8]. With the rapid
development of convolutional neural networks, network architectures such as R-CNN
(region-based convolutional neural network) [9], Faster-RCNN (faster region-based con-
volutional neural network) [10], SSD (single shot multibox detector) [11], and YOLO [12]
have emerged and are applied to image classification,
object detection, and other fields [13]. Compared with two-stage networks such as R-CNN
and Faster-RCNN, the single-stage YOLO series of networks treats object detection as a regression problem; thus, it achieves a higher detection speed [14].
FPGAs are semicustom circuits that have a lower latency and higher parallelism
capability than CPUs and have a lower power consumption and lower cost than GPUs.
Compared to ASICs, they have a shorter design cycle, support faster design iteration, and are less costly. With the rapid development of deep learning frameworks, FPGAs are a highly suitable platform for accelerating the forward inference of deep learning models [14].
There are still many challenges in FPGA hardware acceleration. The performance
of hardware acceleration is directly related to the on-chip resources of FPGAs. How to
use limited hardware resources to design an efficient hardware acceleration architecture
is a very important research problem. For the YOLO series of neural networks, which are computationally intensive and have a huge number of parameters, a high memory access frequency, and complex control logic, improving the acceleration performance involves two difficulties: optimizing the computing process and optimizing the memory
exchange. In order to improve the accuracy and quantity of vehicle detection, networks
with higher accuracy are needed. However, due to their huge number of parameters and large amount of computation, these networks bring high resource and computation costs [15]. Reducing the number of parameters and the algorithmic complexity to improve the acceleration performance is therefore one of the difficulties. In addition, when data are exchanged between on-chip and off-chip memory, the lack of effective data organization leads to insufficient bandwidth utilization and low parallel read/write efficiency, which becomes a bottleneck that restricts efficient computing [16]. Optimizing
the data organization and memory exchange strategy to reduce the communication cost
between the on-chip storage and off-chip memory is another optimization route.
There have been many studies focusing on the FPGA-based convolutional neural
network acceleration. However, research on porting and optimizing large models largely remains theoretical and is rarely combined with specific application scenarios for deployment. Therefore, taking vehicle tracking and counting as the specific
application scenario, we study the deployment of the YOLO series network acceleration
on the FPGA side from the two aspects: the neural network compression and hardware
accelerator design. In addition, we retrain the appearance feature extraction network of
the Deepsort algorithm based on the self-generated dataset for vehicle tracking. The main
contributions of this work are summarized as follows:
• We trained the YOLOv3 and YOLOv3-tiny networks using the UA-DETRAC dataset [17]. Then, we applied a dynamic threshold structured pruning strategy based on binary search and a dynamic INT16 fixed-point quantization algorithm to compress the models.
• A reidentification dataset was generated based on the UA-DETRAC dataset and used
to train the appearance feature extraction network of the Deepsort algorithm with a
modified input size to improve the vehicle tracking performance.
• We designed and implemented a vehicle detector based on an FPGA using high
level synthesis (HLS) technology. At the hardware level, optimization techniques
such as the Im2col+GEMM and Winograd algorithms, parameter rearrangement, and
multichannel transmission are adopted to improve the computational throughput
and balance the resource occupancy and power consumption. Compared with the
other related work, vehicle detection performance with higher precision and higher
throughput is realized with lower power consumption.
• Our design adopts a loosely coupled architecture, which can flexibly switch between
the two detection models by changing the memory management module, optimizing
the balance between the software flexibility and high computing efficiency of the
dedicated chips.
The rest of this paper is organized as follows: Section 2 reviews the background knowledge and related work on the simplification of deep neural networks (DNNs) and convolutional neural network acceleration based on FPGAs. Section 3 introduces our strategies of neural network compression and accelerator optimization. Section 4 presents our experiments and analysis. Finally, we conclude the paper in Section 5.
2.2. Deepsort
Deepsort [18] is an online multitarget tracking algorithm. It considers both the detec-
tion frame parameters of the detection result and the appearance information of the tracked
object, combining the relevant information of the previous frame and the current frame for
prediction without considering the whole video at the time of detection. In the first frame
of the video to be detected, a unique track ID is assigned to the detection frame of each
target. Then, the detection object in the new frame is associated with the previously tracked
object using the Hungarian algorithm [19] to obtain a global minimum of the assignment
cost function. The cost function contains the spatial Mahalanobis distance [20] d(1), which
measures the difference between the detected frame and the position predicted based on
the previously known position of the object, and a visual distance d(2), which measures
the difference between the appearance of the currently detected object and the previous
appearance of the object. The cost function for assigning detected object j to track i is shown in (1), the spatial Mahalanobis distance is shown in (2), and the visual distance is shown in (3). The meanings of the parameters in the formulas are shown in Table 1.
Table 1. Meanings of the parameters in the cost function.
Variable: Meaning
λ: The parameter regulating the effect of the spatial Mahalanobis distance and the visual distance on the cost function.
y_i: The state vector of the i-th prediction frame.
S_i: The covariance matrix of the average tracking results between the detection frame and track i.
d_j: Detection box j.
r_j: The appearance descriptor extracted from detection box j.
R_i: The set of the last 100 appearance descriptors associated with track i.
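For reference, the following NumPy sketch combines the two distances into the assignment cost in the standard Deepsort manner [18], using the quantities of Table 1; the function name, the squared Mahalanobis form, and the gating-free combination are illustrative assumptions rather than the exact implementation deployed in this work.

```python
import numpy as np

def association_cost(d_j, y_i, S_i, r_j, R_i, lam=0.5):
    """Cost of assigning detection j to track i (standard Deepsort form [18]).

    d_j : detection state vector of box j
    y_i : predicted state vector of track i
    S_i : covariance matrix associated with track i
    r_j : unit-norm appearance descriptor of detection j
    R_i : array holding the last appearance descriptors kept for track i
    lam : lambda, weighting the spatial and visual distances
    """
    # d(1): (squared) Mahalanobis distance between detection and prediction
    diff = d_j - y_i
    d1 = float(diff.T @ np.linalg.inv(S_i) @ diff)

    # d(2): smallest cosine distance between the detection's appearance
    # descriptor and the descriptors stored for track i
    d2 = float(np.min(1.0 - R_i @ r_j))

    # weighted combination used as the assignment cost c(i, j)
    return lam * d1 + (1.0 - lam) * d2
```

The Hungarian algorithm [19] (for example, scipy.optimize.linear_sum_assignment) is then run on the matrix of these costs to obtain the global minimum of the assignment cost.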
data exchange, but their accelerator did not validate the acceleration performance for large
networks. Zhang et al. [29] proposed optimizing the on-chip cache using ping-pong operations to hide the data transfer latency and used the roofline model to search the accelerator design space. However, they only designed the hardware architecture.
Lu et al. [30] first used the Winograd algorithm in CNN operations to reduce the convolu-
tional computational complexity and proposed row buffers to achieve efficient data reuse.
Later, in [31], the Winograd algorithm was combined with CNN sparsity to improve accelerator performance, but the model used in the evaluation was
simple. Bao et al. [32] used a fixed-point quantization approach to reduce FPGA resource
consumption and proposed a buffer pipeline approach to further improve the accelerator
efficiency while reducing the resource and power overhead. Wang et al. [33] introduced a new unstructured sparse convolution algorithm with lower-bit quantization and an end-to-end design space search for a dedicated sparse convolution circuit architecture, which achieved high computational efficiency, but its performance-to-power ratio was relatively low.
The above studies have made great contributions to deploying AI directly on edge devices; however, as models become more complex, research on optimizing large models largely remains theoretical and is rarely deployed in conjunction with specific application scenarios.
Neural network accelerators contain complex operators and memory management modules, so directly describing them in an HDL (hardware description language) leads to a long development cycle and makes it difficult to explore the design space.
HLS describes the design in C/C++ at a high level and greatly improves development efficiency through the rapid conversion of high-level code to an FPGA implementation [34]. Many studies on neural network acceleration have been implemented based on HLS, and
the HLS tools for neural network acceleration have been improved and expanded to make
development easier and faster. We designed and implemented a vehicle detector based on
an FPGA using HLS.
between pruning rate and accuracy. Finally, the knowledge distillation strategy is used to
fine-tune the network accuracy.
[Figure: Channel pruning workflow comprising sparse regularization training, trimming of channels with small scaling factors, network simplification, and fine-tuning.]
Sparse regularization training first introduces a scaling factor for each channel, which
is used to multiply with the output of that channel. The scaling factors are trained jointly
with the network weights and are sparsely regularized during the training to identify
insignificant channels. The objective function of the sparse regularization training is shown
in (4), where (x, y) represents the input and target of training and W represents the trainable weight. The first term represents the CNN training loss, g(·) is the sparse penalty function on the scaling factors, with g(s) = |s|, and λ balances the effect of the two terms.
We use the subgradient descent algorithm to optimize the nonsmooth L1 penalty term.
$$L = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma). \qquad (4)$$
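As a concrete illustration, the penalty term of (4) can be added to a PyTorch training loop as sketched below; treating the BatchNorm γ parameters as the channel scaling factors follows network slimming [24], and the hyperparameter name lam is our own.

```python
import torch.nn as nn

def loss_with_channel_sparsity(model, base_loss, lam=1e-4):
    """Objective of Eq. (4): CNN training loss plus an L1 penalty on the
    per-channel scaling factors (here the BatchNorm gammas, as in [24]);
    lam plays the role of lambda and trades accuracy against channel sparsity."""
    l1 = sum(m.weight.abs().sum()                 # g(gamma) = |gamma|
             for m in model.modules()
             if isinstance(m, nn.BatchNorm2d))
    return base_loss + lam * l1
```

Because |γ| is not smooth at zero, the gradient of the penalty term is a subgradient (the sign of γ), which is exactly what the subgradient descent step mentioned above uses.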
The structure of the network after sparse regularization training is shown in Figure 3a.
We prune the channels whose contribution value is less than the threshold value to obtain
the network structure, as shown in Figure 3b.
Figure 3. Strategies for sparse regularized channel pruning. (a) Structure before pruning. (b) Structure
after pruning.
$$x_q = (\mathrm{int})(x \cdot 2^{Q}). \qquad (5)$$
$$V_{fixed} = f(w, Q) = \sum_{i=0}^{w-1} B_i \cdot 2^{-Q} \cdot 2^{i}, \quad B_i \in \{0, 1\}. \qquad (7)$$
In the stage of quantizing the weight and bias values, the optimal Q value of each layer is determined dynamically using the approach shown in Equations (8) and (9), so that the sum of absolute errors between the original and quantized weights and biases is minimized. $W^{l}_{float}$ and $b^{l}_{float}$ are the 32-bit floating-point values of the l-th layer weights and biases, respectively, and $W^{l}_{fixed}(w, Q)$ and $b^{l}_{fixed}(w, Q)$ are the 16-bit fixed-point values of the l-th layer weights and biases, respectively.
In the stage of quantization of inputs and outputs between layers, we find the optimal
Q value for each layer of the input–output feature map, and the optimal Q value is calcu-
lated as shown in Equations (10) and (11). For example, the RGB value of the input image
is scaled to the [0,1] interval in the preprocessing stage, and Q = 14 can be used to quantize
the input of the first layer when the bit width w = 16.
In the stage of quantizing the intermediate results, we find the best Q value for each layer of intermediate data using the approach shown in (12).
By quantizing in the above way, the model size can be further reduced to 50% of the pruned model size, reducing the consumption of computing, memory, and bandwidth resources.
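A minimal NumPy sketch of this dynamic scheme is given below; the exhaustive search over fractional lengths and the function names are illustrative, and rounding and saturation are added for robustness, so it should be read as a functional reference for Equations (5)-(12) rather than the exact implementation.

```python
import numpy as np

def quantize(x, Q, w=16):
    """Eq. (5): map float values to signed w-bit fixed point with Q fractional
    bits (rounding and saturation added)."""
    lo, hi = -2 ** (w - 1), 2 ** (w - 1) - 1
    return np.clip(np.round(x * 2.0 ** Q), lo, hi).astype(np.int64)

def dequantize(xq, Q):
    """Value represented by the fixed-point word (the viewpoint of Eq. (7))."""
    return xq.astype(np.float64) * 2.0 ** (-Q)

def best_Q(x, w=16):
    """Dynamically choose the Q that minimizes the sum of absolute
    quantization errors for one layer (the criterion behind Eqs. (8)-(12)),
    searching over non-negative fractional lengths."""
    errors = {Q: np.abs(x - dequantize(quantize(x, Q, w), Q)).sum()
              for Q in range(w)}
    return min(errors, key=errors.get)

# Per the text, image inputs scaled to [0, 1] can be quantized with Q = 14
# when the bit width is w = 16.
```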
[Figure: Overall accelerator architecture. The host (image processing, controller, memory management) exchanges image, weight, and bias data with the FPGA computing engines (e.g., GEMM) via DMA and off-chip DRAM.]
The accelerator consists of a host computer and an FPGA. The main tasks of the
host are image preprocessing, data quantization, nonmaximal suppression, and Deepsort
task scheduling. The host uses the controller to schedule the flow of data and uses the
memory manager to manage the interaction between DRAM and DMA. The FPGA is
responsible for the accelerated calculation of various computation-intensive tasks. The
host loads the configuration information of the current model at the beginning, and stores
the pre-quantized weight and bias data of the model in a continuous memory. Then, the
host extracts the input video into frame images and sends them to the controller module
in sequence. First, the control module sends the image to the data quantization module.
Then, it transfers the quantized image, the weight, and bias data of the current layer to
the FPGA on-chip memory through the optimized transmission method of ping-pong
double buffering and multi-channel transmission. After the acceleration of a specific
computing module, the result is sent back to the off-chip DRAM through the above method.
After completing the prediction of an image, the host performs the NMS (Non Maximum
Suppression) operation and transmits the result to the Deepsort tracking module. Finally, it
draws the tracking result into a new video stream in real time.
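To make the dataflow concrete, the host-side schedule can be sketched in Python as follows; load_tile, fpga_compute, and store_result are hypothetical placeholders for the DMA transfers and the accelerator call rather than a real driver API, and on the device the transfer of the next block runs concurrently with the computation on the current one.

```python
def run_layer(tiles, load_tile, fpga_compute, store_result):
    """Double-buffered (ping-pong) schedule: while the accelerator works on
    one on-chip buffer, the other buffer is refilled from DRAM.
    All callables here are illustrative stand-ins, not a real driver API."""
    buffers = [None, None]
    buffers[0] = load_tile(tiles[0])                # prefetch the first block
    for k in range(len(tiles)):
        if k + 1 < len(tiles):
            # in hardware this DMA transfer overlaps with the computation below
            buffers[(k + 1) % 2] = load_tile(tiles[k + 1])
        store_result(fpga_compute(buffers[k % 2]))  # compute on the current buffer
```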
[Figure: (a) Memory management for YOLOv3-tiny; (b) memory management for the route layer.]
Symbol Meaning
I The input feature map.
W The weights of the convolution layer.
B The bias of the convolution layer.
O The output feature map.
IH The height of the input feature map.
IW The width of the input feature map.
IC The number of input channels.
K The kernel size.
OH The height of the output feature map.
OW The width of the output feature map.
OC The number of output channels.
pad The padding.
S The stride.
Tx Parallelism of multiply-add operations on input feature maps.
Ty Parallelism of multiply-add operations on output feature maps.
Taking the weight parameters of the 12th layer of YOLOv3-tiny as an example, as shown in Figure 7, there are 1024 × 512 × 9 (X × Y × K²) parameters, where X = 1024, Y = 512, and K² = 9; X and Y represent the number of input and output feature maps, respectively. When
the convolutional loop block is divided according to Tx = 32, Ty = 4, and the weight
parameters are stored in row priority order, 524,288 parameter blocks with a size of 9
need to be read from the memory in the order of the arrow. After rearrangement, the
parameters are stored continuously, and 4096 parameter blocks with a size of 32 × 4 × 9
should be read from memory in the order of the arrows. Parameter rearrangement reduces the number of memory reads.
[Figure 7: the X × Y × K² weight parameters are regrouped into (X × Y)/(Tx × Ty) contiguous blocks of size Tx × Ty × K².]
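A NumPy sketch of this rearrangement is shown below; the assumed weight layout (output maps, input maps, K, K) and the tile traversal order are illustrative, chosen only to reproduce the block sizes quoted above.

```python
import numpy as np

def rearrange_weights(W, Tx=32, Ty=4):
    """Regroup convolution weights so that each Tx x Ty x K^2 tile consumed by
    one round of the accelerator lies contiguously in memory.
    W is assumed to have shape (Y, X, K, K) = (out maps, in maps, k, k)."""
    Y, X, K, _ = W.shape
    blocks = []
    for y0 in range(0, Y, Ty):                    # output-channel tiles
        for x0 in range(0, X, Tx):                # input-channel tiles
            tile = W[y0:y0 + Ty, x0:x0 + Tx]      # shape (Ty, Tx, K, K)
            blocks.append(tile.reshape(-1))       # Tx*Ty*K^2 contiguous values
    return np.concatenate(blocks)

# For the 12th layer of YOLOv3-tiny (X = 1024, Y = 512, K = 3) this yields
# (1024*512)/(32*4) = 4096 tiles of 32*4*9 = 1152 parameters each.
```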
[Figure: Multichannel transfer. Weight/bias data and pixel data move between off-chip DRAM and on-chip BRAM through multiple DMA channels (DMA0 ... DMAm+n), with the Tx input and Ty output feature maps split across n and m channels, respectively.]
The architecture determined by the above parameters reduces the number of feature maps transmitted by each channel from Tx + Ty to Tx/n or Ty/m without causing too much contention, which yields the best transmission delay.
[Figure 9. Pipelining of the convolution module: an interlayer pipeline across layers and an in-layer pipeline of four stages, namely line buffering, Im2col, GEMM, and output buffering.]
For the convolution module designed with the Im2col+GEMM algorithm, we use the
in-layer pipeline design shown in Figure 9. The entire convolution module is optimized
into a four-stage pipeline, corresponding to four subtasks: the line cache, Im2col function,
GEMM calculation, and result output.
$$O[oc][oh][ow] = \sum_{ic=0}^{IC-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} I[ic][oh \cdot S + i][ow \cdot S + j] \cdot W[oc][ic][i][j] + B[oc],$$
$$0 \le oc < OC, \quad 0 \le ic < IC, \quad 0 \le i, j < K \qquad (16)$$
$$OH = \frac{IH - K + 2\,pad}{S} + 1, \qquad OW = \frac{IW - K + 2\,pad}{S} + 1$$
$$O = I \times W + B \qquad (17)$$
CNN computing requires a large amount of memory, but the FPGA’s on-chip storage
resources cannot meet the requirement of storing such a large amount of data at a time [38].
Therefore, based on the locality of the convolution computation, the input feature map data and the corresponding weight parameters are divided into blocks. Each time, two pixel blocks of size Tx × tir × tic and the corresponding weight parameters of size Tx × Ty × K² are read from the off-chip DRAM. After all the on-chip data are calculated,
the result of size Ty × tor × toc is written back to the off-chip DRAM. The calculation of tor
is shown below:
$$tor = \frac{tir - K + 2\,pad}{S} + 1 \qquad (18)$$
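As a quick illustration of Equation (18), the tile arithmetic can be checked as follows; the numbers are only an example.

```python
def output_tile_rows(tir, K, pad, S):
    """Eq. (18): number of output rows produced from an input tile of tir rows."""
    return (tir - K + 2 * pad) // S + 1

# e.g., a 3x3 convolution with pad = 1 and stride S = 1 on a 26-row input tile
# produces tor = (26 - 3 + 2) // 1 + 1 = 26 output rows.
```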
We deeply analyze the characteristics of convolution operations with kernel sizes of 1 × 1
and 3 × 3, and design two convolution acceleration engines using the Im2col+GEMM [39]
and Winograd [40] algorithms, respectively, so as to reduce the computational complexity
and resource consumption.
[Figure: Im2col expansion. An IC × IH × IW input convolved with OC kernels of size K × K produces an OC × OH × OW output; Im2col unfolds the input into a matrix so that the convolution becomes a matrix multiplication.]
$$\mathrm{space\_ratio} = \frac{K^{2} \times OH \times OW}{IH \times IW} \qquad (19)$$
The extra space required is proportional to the squared kernel size K². As shown in Figure 12, a convolution kernel of size 1 × 1 does not consume extra space to store the feature map matrix; thus, the Im2col+GEMM algorithm is well suited to accelerating 1 × 1 convolutions.
[Kernel matrix [2] multiplied with the input feature map matrix [1 2 3 4] gives the output feature map matrix [2 4 6 8].]
Figure 12. Two-dimensional convolution with kernel size 1 × 1.
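A compact NumPy reference of the Im2col+GEMM transformation is given below (a functional sketch, not the HLS engine itself); the last lines reproduce the 1 × 1 example of Figure 12.

```python
import numpy as np

def im2col(x, K, S=1, pad=0):
    """Unfold an (IC, IH, IW) input into an (IC*K*K, OH*OW) matrix so that the
    convolution becomes a single matrix multiplication."""
    IC, IH, IW = x.shape
    OH = (IH - K + 2 * pad) // S + 1
    OW = (IW - K + 2 * pad) // S + 1
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    cols = np.zeros((IC * K * K, OH * OW), dtype=x.dtype)
    for oh in range(OH):
        for ow in range(OW):
            patch = xp[:, oh * S:oh * S + K, ow * S:ow * S + K]
            cols[:, oh * OW + ow] = patch.reshape(-1)
    return cols, OH, OW

def conv_im2col_gemm(x, W, B, S=1, pad=0):
    """Convolution as GEMM: (OC, IC*K*K) x (IC*K*K, OH*OW), plus the bias."""
    OC, IC, K, _ = W.shape
    cols, OH, OW = im2col(x, K, S, pad)
    out = W.reshape(OC, -1) @ cols + B.reshape(OC, 1)
    return out.reshape(OC, OH, OW)

# The 1x1 case of Figure 12: a kernel of value 2 applied to [1 2 3 4].
x = np.array([[[1., 2.], [3., 4.]]])          # IC = 1, IH = IW = 2
W = np.array([[[[2.]]]])                      # OC = 1, IC = 1, K = 1
print(conv_im2col_gemm(x, W, np.zeros(1)))    # -> [[[2. 4.] [6. 8.]]]
```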
Figure 13. Architecture of the convolution module based on the Im2col+GEMM algorithm.
Winograd convolution:
For the convolution operation with a convolution kernel size of 3 × 3, we designed a
Winograd convolution engine to accelerate the operation.
The Winograd algorithm accelerates the convolution operation by significantly re-
ducing the multiplication operation in the convolution [41]. F (m × m, r × r ) represents a
two-dimensional convolution function; its input is a convolution kernel of size r × r, and
the output is an output feature map of size m × m. We use Y to represent the output of this
function, which can be expressed in the form of Equation (20).
In Equation (22), W represents the convolution filter, I represents the input feature map, G is the convolution kernel transformation matrix of size r × (m + r − 1), A is the output feature map transformation matrix of size m × (m + r − 1), and B is the input feature map transformation matrix.
Figure 14. Architecture of the convolution module based on the Winograd algorithm.
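As a functional reference for the engine in Figure 14, the sketch below computes one F(2 × 2, 3 × 3) Winograd tile with the standard transformation matrices of [40]; the hardware applies the same transforms in fixed point and over the tiling described above, so this code is illustrative only.

```python
import numpy as np

# Transformation matrices of F(2x2, 3x3) (m = 2, r = 3), as in [40].
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """One Winograd tile: d is a 4x4 input tile and g a 3x3 kernel; the result
    is the 2x2 block of a stride-1 'valid' convolution (cross-correlation),
    computed with 16 element-wise multiplications instead of 36."""
    U = G @ g @ G.T          # transformed kernel, 4x4
    V = BT @ d @ BT.T        # transformed input tile, 4x4
    return AT @ (U * V) @ AT.T

# Quick check against a direct sliding-window computation.
d = np.arange(16, dtype=float).reshape(4, 4)
g = np.ones((3, 3))
direct = np.array([[d[i:i+3, j:j+3].sum() for j in range(2)] for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```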
[Figure: Module architecture with an input buffer, line buffer, and output buffer, parallel processing elements (PE1 ... PEn) operating on data segments, and a MUX selecting the result.]
Symbol Meaning
O_norm The output feature map after batch normalization.
γ The parameter that controls the variance of O_norm.
σ² The variance of O.
ε A small constant used to prevent numerical error.
O The output feature map.
µ The estimate of the mean of O.
β The parameter that controls the mean of O_norm.
Let $P = \frac{\gamma}{\sqrt{\sigma^{2} + \epsilon}}$ and $Q = \beta - \frac{\gamma \mu}{\sqrt{\sigma^{2} + \epsilon}}$; then, Equation (22) can be simplified into the form in (23).
$$O_{norm} = P \cdot O + Q. \qquad (23)$$
Substituting (17) into (23), we obtain (24).
$$O_{norm} = P(I \times W + B) + Q. \qquad (24)$$
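One common way to exploit (24) is to fold P and Q into the convolution weights and bias offline, so that no separate batch normalization step is needed at inference time; a minimal NumPy sketch under that assumption (per-output-channel parameters, names illustrative) is:

```python
import numpy as np

def fold_batchnorm(W, B, gamma, beta, mean, var, eps=1e-5):
    """Fold batch normalization into the preceding convolution using (23)-(24):
    O_norm = P*(I*W + B) + Q with P = gamma/sqrt(var + eps) and
    Q = beta - gamma*mean/sqrt(var + eps).
    W has shape (OC, IC, K, K); gamma, beta, mean, var, B have length OC."""
    P = gamma / np.sqrt(var + eps)            # per-output-channel scale
    W_folded = W * P[:, None, None, None]     # scale every filter
    B_folded = P * B + (beta - P * mean)      # absorb Q into the bias
    return W_folded, B_folded
```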
4. Experiments
4.1. Experimental Setup
We designed and simulated the proposed accelerator to verify the effectiveness of the
proposed optimization method. The training and pruning quantization were completed
by the NVIDIA Tesla V100 platform. The detection inference was implemented by the
CPU+FPGA heterogeneous platform. The chip we used is ZYNQ XC7Z035-FFG676-2. We
designed the IP cores of YOLOv3 and YOLOv3-tiny accelerators using Xilinx Vivado HLS
2021.2, and used Vivado 2021.2 for the synthesis and layout.
Model Pruning Rate mAP@0.5 Model Size (MB) Parameters (×10³) BFLOPs
YOLOv3 0 0.671 235.06 61523 65.864
YOLOv3 85% 0.711 33.55 8719 19.494
YOLOv3-tiny 0 0.625 33.10 8670 5.444
YOLOv3-tiny 85% 0.625 1.02 267 1.402
YOLOv3-tiny 85% + 30% 0.599 0.59 69 0.735
According to the data in Table 4, the size of the pruned YOLOv3 is reduced by 85%.
The detection accuracy mAP@0.5 is improved by 0.04, and the number of floating-point calculations required for convolution is reduced by 70.4%. The size of YOLOv3-tiny after two dynamic prunings is reduced by 98.2% at the cost of an mAP reduction of 0.026. The
computation of the convolution is reduced by 86.5%.
To show the detection effect more intuitively, we selected an image from the test set to
be detected using the models before and after compression, as shown in Figure 16.
The number of vehicles detected by each model is shown in Table 5. Among them,
the detection results of the YOLOv3-prune 85% model are consistent with those of the
original YOLOv3 model, both detecting 27 vehicles. The detection accuracy of the YOLOv3-
prune 85% model is slightly better than that of the original YOLOv3 model. The detection
performance of the YOLOv3-tiny-prune 85% + 30% model is slightly better than that of the
original YOLOv3-tiny model. The number of vehicles is increased from 24 to 26.
Figure 16. Comparison of detection results: (c) YOLOv3-prune 85%; (d) YOLOv3-tiny; (e) YOLOv3-tiny-prune 85%; (f) YOLOv3-tiny-prune 85% + 30%.
Figure 17. Comparison of tracking results before and after reidentification training.
Model | Video Stream | IDF1↑ | IDP↑ | IDR↑ | FP↓ | FN↓ | IDs↓ | MOTA↑ | MOTP↓
Deepsort | MVI_40701 | 76.4% | 82.5% | 71.1% | 1515 | 3706 | 53 | 66.8% | 0.118
RE-ID Deepsort | MVI_40701 | 79.6% | 86.4% | 76.2% | 1452 | 3686 | 27 | 67.5% | 0.117
Deepsort | MVI_40771 | 69.3% | 74.5% | 66.2% | 2409 | 2409 | 49 | 65.2% | 0.153
RE-ID Deepsort | MVI_40771 | 80.6% | 87.2% | 75.0% | 1015 | 2348 | 13 | 69.6% | 0.155
Deepsort | MVI_40863 | 55.2% | 80.6% | 42.0% | 2076 | 17746 | 51 | 39.2% | 0.138
RE-ID Deepsort | MVI_40863 | 56.1% | 82.0% | 42.6% | 2037 | 17382 | 35 | 40.5% | 0.138
The MVI_40701 video stream was captured in the daytime peak traffic flow scene and
shot from a forward overlooking angle. All evaluation indices show that the RE-ID Deepsort
algorithm has better performance than the Deepsort algorithm in tracking vehicles.
The MVI_40771 video stream was captured in the peak traffic flow scene at night
and shot from a forward overlooking angle. Compared with the Deepsort algorithm, the
RE-ID Deepsort algorithm improved significantly in several evaluation indices, especially
in reducing the number of IDs. The experimental results indicate that the model proposed
in this paper has an obvious improvement effect on vehicle detection in night scenes.
The MVI_40863 video stream was shot on a rainy day with heavy traffic. The shooting
angle was a side view. A large number of small cars were covered by large cars in this video
stream; thus, all tracking indices are inferior to those of the previous two video streams.
Except for the same MOTP, RE-ID Deepsort is better than Deepsort in other indicators.
We use the YOLOv3 and YOLOv3-tiny models after pruning and combine them with
the RE-ID Deepsort algorithm to conduct vehicle tracking counting experiments. Figure 18
is the result of vehicle detection and ID assignment. When a vehicle crosses the solid red
line in the figure, one is added to the traffic flow counter. Figure 19 compares the traffic
flow data collected by the model with the data collected by manual statistics.
Figure 18. Tracking using RE-ID Deepsort. (a) Detection result of MVI_40701. (b) Detection result of
MVI_40771. (c) Detection result of MVI_40863.
The scenarios we tested included peak traffic under daytime, nighttime, and rainy conditions. In these scenarios, vehicles move slowly and are relatively close to each other, which easily causes occlusion. A car located between two large vehicles will cause missed
detection. Especially in the third scenario, the camera is located on the side of the road;
thus, cars in the middle of the road are almost completely covered by large cars on the
side road when moving slowly, resulting in poor detection results. The accuracy rates of
YOLOv3 and YOLOv3-tiny were 96.15% and 92.3% in daytime conditions, 94.0% and 92.0%
in nighttime conditions, and 81.8% and 75.8% in rainy conditions, respectively.
the price is higher; our design with the Zynq-7000 outperforms it in terms of
cost efficiency and energy efficiency. Compared with the yolov3_adas_pruned_0_9 model
implemented based on Vitis AI and ZCU102, our pruned YOLOv3-tiny model has faster
forward inference speed and higher fps. Tajar et al. [45] implemented the YOLOv3-tiny network for vehicle detection on the Nvidia Jetson Nano. The throughput of their solution is not sufficient for applications requiring at least 24 fps, whereas we achieve
91.65 fps with the pruned YOLOv3-tiny model. We compare the cost efficiency of different
platforms, and the results demonstrate that our solution is the most cost-effective.
Item | Platform | CNN Model | Operation (GOP) | Throughput (fps) | Full Power (W) | Efficiency (GOPS/W) | Cost Efficiency (GOPS/$ ×10²)
Baseline1 | AMD R7 5800H (CPU) | YOLOv3-tiny | 0.735 | 10.01 | 45 | 0.16 | 1.96
Baseline2 | GeForce RTX 2060 | YOLOv3-tiny | 0.735 | 112.87 | 160 | 0.52 | 16.58
Baseline3 | XCZU9EG-FFVB1156 | yolov3-adas-pruned-0.9 | 5.5 | 84.1 | - | 3.71 | 4.16
Ref [45] | Nvidia Jetson Nano | YOLOv3-tiny | 1.81 | 17 | 10 | 3.08 | 24.62
This work | Zynq-7000 | YOLOv3-tiny | 0.735 | 91.65 | 12.51 | 5.43 | 46.51
Table 8 shows the comparison of our work with previous FPGA-based work. Since we designed pipeline processing at both intralayer and interlayer granularity, our computational efficiency is slightly higher than that of [14]. The resource consumption is slightly higher than that of [14] because of the separate upsampling computation module we designed to adapt the computation of the upsampling
layer in YOLOv3 and YOLOv3-tiny. However, due to our pruning strategy, we reduced
the computation of YOLOv3 and YOLOv3-tiny by factors of 3.4 and 7.4, respectively.
Our design has a significant increase in throughput, with a slightly better computational
performance. We doubled the performance of the convolutional computation compared
to the literature [37] due to the introduction of the Winograd convolutional acceleration
computation engine and multiple levels of pipeline processing. Since the literature [14,37]
only shows dynamic power consumption, for a fair comparison, we use dynamic energy
efficiency (GOPS/W) to compare with them and obtain a clear advantage. At the same
time, we also have better cost efficiency and DSP efficiency. Reference [33] uses lower
bit quantization precision and introduces a new sparse convolution algorithm, which
makes the DSP efficiency higher. They use an end-to-end design space search for a sparse
convolution-specific circuit architecture, making it computationally more efficient than our
design. Since our YOLOv3-tiny model is less computationally intensive after compression,
we have a higher detection speed. At the same time, our designs consume less power
and are more cost-effective. Ding et al. [46] proposed a resource-aware system-level
quantization framework, which takes into account both the accuracy of the object detection
algorithm and the hardware resource consumption during deployment. They implemented
the acceleration of the YOLOv2-tiny network on the Virtex-7 with more abundant resources
and superior performance, and achieved a high throughput. Our design deploys the more
advanced YOLOv3 and YOLOv3-tiny models at less than 10% of the overall resource
consumption of their design, and outperforms it in terms of the DSP efficiency. None of the
models in other works are trained on vehicle detection-specific datasets; thus, our model
has an advantage in vehicle detection scenarios.
Item: Ref [14] | Ref [37] | Ref [33] | Ref [46] | This Work
Basic information introduction
Platform: ZYNQ XC7Z020 | Zedboard | Arria-10 GX1150 | Virtex-7 XC7VX690T-2 | Zynq-7000
Precision: Fixed-16 | Fixed-16 | Int8 | Float-32 | Float-32 | Fixed-16
CNN Model: YOLOv2 | YOLOv2 | YOLOv2-tiny | YOLOv2 | YOLOv2-tiny | YOLOv3 | YOLOv3-tiny | YOLOv3 | YOLOv3-tiny
Dataset: COCO | COCO | VOC | VOC | UA-DETRAC
Hardware resource consumption
BRAM: 87.5 | 88 | 96% | 1320 | 98.5 (19.7%) | 132.5 (26.5%)
DSPs: 150 | 153 | 6% | 3456 | 301 (33.8%) | 144 (16.2%)
LUTs: 36,576 | 37,342 | 45% | 637,560 | 38,336 (22.3%) | 38,228 (22.2%)
FFs: 43,940 | 35,785 | 45% | 717,660 | 62,988 (18.3%) | 42,853 (12.5%)
Performance comparison
mAP: 0.481 | 0.481 | - | 0.744 | 0.548 | 0.711 | 0.599 | 0.711 | 0.599
Operations (GOP): 29.47 | 29.47 | 5.14 | 4.2 | 1.24 | 19.494 | 0.735 | 19.494 | 0.735
Freq (MHz): 150 | 150 | 204 | 200 | 210 | 230
Performance (GOP/s): 64.91 | 30.15 | 21.97 | 182.36 | 389.90 | 41.39 | 43.47 | 63.51 | 67.91
Throughput (fps): 2.20 | 1.02 | 4.27 | 61.90 | 314.2 | 2.12 | 59.14 | 3.23 | 91.65
Efficiency comparison
Cost Efficiency (GOPS/$ ×10²): 44.45 | 20.65 | 15.05 | 17.49 | 46.75 | 28.35 | 29.77 | 43.50 | 46.51
DSP Efficiency (GOPS/DSP): 0.433 | 0.197 | 0.144 | 2.004 | 0.113 | 0.138 | 0.144 | 0.441 | 0.472
Dynamic Power (W): 1.4 | 1.2 | 0.83 | - | - | 1.80 | 1.48 | 1.52 | 1.31
Full Power (W): - | - | - | 26 | 21 | 13.29 | 12.92 | 12.73 | 12.51
Dynamic Energy Efficiency (GOPS/W): 46.36 | 25.13 | 26.47 | - | - | 22.99 | 29.37 | 41.78 | 51.84
Full Energy Efficiency (GOPS/W): - | - | - | 7.01 | 18.57 | 3.11 | 3.36 | 4.99 | 5.43
Author Contributions: Conceptualization, J.Z. and B.L.; methodology, J.Z. and B.L.; software, J.Z.
and B.L.; validation, J.Z., B.L. and S.L.; formal analysis, J.Z., B.L. and S.L.; investigation, J.Z., B.L.
and Q.Z.; resources, J.Z., B.L., S.L. and Q.Z.; data curation, J.Z. and S.L.; writing—original draft
preparation, J.Z. and B.L.; writing—review and editing, J.Z., B.L. and S.L.; visualization, J.Z., B.L.
and S.L.; supervision, B.L. and Q.Z.; project administration, B.L. and Q.Z. All authors have read and
agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data sharing not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Jan, B.; Farman, H.; Khan, M.; Talha, M.; Din, I.U. Designing a Smart Transportation System: An Internet of Things and Big Data
Approach. IEEE Wirel. Commun. 2019, 26, 73–79. [CrossRef]
2. Lin, J.; Yu, W.; Yang, X.; Zhao, P.; Zhang, H.; Zhao, W. An Edge Computing Based Public Vehicle System for Smart Transportation.
IEEE Trans. Veh. Technol. 2020, 69, 12635–12651. [CrossRef]
3. Wang, S.; Djahel, S.; Zhang, Z.; McManis, J. Next Road Rerouting: A Multiagent System for Mitigating Unexpected Urban Traffic
Congestion. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2888–2899. [CrossRef]
4. Tseng, Y.-T.; Ferng, H.-W. An Improved Traffic Rerouting Strategy Using Real-Time Traffic Information and Decisive Weights.
IEEE Trans. Veh. Technol. 2021, 70, 9741–9751. [CrossRef]
5. Heitz, D.; Mémin, E.; Schnörr, C. Variational fluid flow measurements from image sequences: Synopsis and perspectives. Exp.
Fluids 2010, 48, 369–393. [CrossRef]
6. Zivkovic, Z.; van der Heijden, F. Efficient adaptive density estimation per image pixel for the task of background subtraction.
Pattern Recognit. Lett. 2006, 27, 773–780. [CrossRef]
7. Wang, G.; Wu, J.; Xu, T.; Tian, B. 3D Vehicle Detection With RSU LiDAR for Autonomous Mine. IEEE Trans. Veh. Technol. 2021, 70,
344–355. [CrossRef]
8. Nguyen, D.T.; Nguyen, T.N.; Kim, H.; Lee, H.-J. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN
for Object Detection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1861–1873. [CrossRef]
9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and
Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [CrossRef]
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv.
Neural Inf. Process. Syst. 2015, 28. Available online: https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html (accessed on 1 January 2023). [CrossRef]
11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer
Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer
International Publishing: Cham, Switzerland, 2016; pp. 21–37. [CrossRef]
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
14. Pei, S.; Wang, X. Research on FPGA-Accelerated Computing Model of YOLO Detection Network. Small Microcomput. Syst.
Available online: https://kns.cnki.net/kcms/detail/21.1106.TP.20210906.1741.062.html (accessed on 1 January 2023).
15. Zhao, M.; Peng, J.; Yu, S.; Liu, L.; Wu, N. Exploring Structural Sparsity in CNN via Selective Penalty. IEEE Trans. Circuits Syst.
Video Technol. 2022, 32, 1658–1666. [CrossRef]
16. Nguyen, D.T.; Kim, H.; Lee, H.-J. Layer-Specific Optimization for Mixed Data Flow With Mixed Precision in FPGA Design for
CNN-Based Object Detectors. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2450–2464. [CrossRef]
17. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.-C.; Qi, H.; Lim, J.; Yang, M.-H.; Lyu, S. UA-DETRAC: A new benchmark and protocol
for multi-object detection and tracking. Comput. Vis. Image Underst. 2020, 193, 102907. [CrossRef]
18. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017
IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [CrossRef]
19. Wright, M.B. Speeding up the hungarian algorithm. Comput. Oper. Res. 1990, 17, 95–96. [CrossRef]
20. De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D.L. The Mahalanobis distance. Chemom. Intell. Lab. Syst. 2000, 50, 1–18.
[CrossRef]
21. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both Weights and Connections for Efficient Neural Network. Adv. Neural Inf. Process.
Syst. 2015, 28. Available online: https://proceedings.neurips.cc/paper/2015/hash/ae0eb3eed39d2bcef4622b2499a05fe6-Abstract.html (accessed on 1 January 2023).
22. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning Filters for Efficient ConvNets. arXiv 2017, arXiv:1608.08710.
23. Yang, T.-J.; Howard, A.; Chen, B.; Zhang, X.; Go, A.; Sandler, M.; Sze, V.; Adam, H. NetAdapt: Platform-Aware Neural Network
Adaptation for Mobile Applications. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany,
8–14 September 2018; pp. 285–300.
24. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks Through Network Slimming. In
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 27–29 October 2017; pp. 2736–2744.
25. Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; Darrell, T. Rethinking the Value of Network Pruning. arXiv 2019, arXiv:1810.05270.
26. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going Deeper with Embedded
FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, New York, NY, USA, 21–23 February 2016; pp. 26–35. [CrossRef]
27. Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Matta, M.; Patetta, M.; Re, M.; Spanò, S. Approximated computing for
low power neural networks. TELKOMNIKA (Telecommun. Comput. Electron. Control.) 2019, 17, 1236–1241. [CrossRef]
28. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional
Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,
Monterey, CA, USA, 22–24 February 2017; pp. 45–54. [CrossRef]
29. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural
Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York,
NY, USA, 22–24 February 2015; pp. 161–170. [CrossRef]
30. Lu, L.; Liang, Y.; Xiao, Q.; Yan, S. Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs. In Proceedings of
the 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA,
USA, 30 April–2 May 2017; pp. 101–108. [CrossRef]
31. Lu, L.; Liang, Y. SpWA: An efficient sparse winograd convolutional neural networks accelerator on FPGAs. In Proceedings of the
55th Annual Design Automation Conference, San Francisco, CA, USA, 24–29 June 2018; pp. 1–6. [CrossRef]
32. Bao, C.; Xie, T.; Feng, W.; Chang, L.; Yu, C. A Power-Efficient Optimizing Framework FPGA Accelerator Based on Winograd for
YOLO. IEEE Access 2020, 8, 94307–94317. [CrossRef]
33. Wang, Z.; Xu, K.; Wu, S.; Liu, L.; Liu, L.; Wang, D. Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for
YOLOv2. IEEE Access 2020, 8, 116569–116585. [CrossRef]
34. McFarland, M.C.; Parker, A.C.; Camposano, R. The high-level synthesis of digital systems. Proc. IEEE 1990, 78, 301–318. [CrossRef]
35. Yeom, S.-K.; Seegerer, P.; Lapuschkin, S.; Binder, A.; Wiedemann, S.; Müller, K.R.; Samek, W. Pruning by explaining: A novel
criterion for deep neural network pruning. Pattern Recognit. 2021, 115, 107899. [CrossRef]
36. Shan, L.; Zhang, M.; Deng, L.; Gong, G. A Dynamic Multi-precision Fixed-Point Data Quantization Strategy for Convolutional
Neural Network. In Computer Engineering and Technology: 20th CCF Conference, NCCET 2016, Xi’an, China, 10–12 August 2016,
Revised Selected Papers; Springer: Singapore, 2016; pp. 102–111. [CrossRef]
37. Chen, C.; Xia, J.; Yang, W.; Li, K.; Chai, Z. A PYNQ-compliant Online Platform for Zynq-based DNN Developers. In Proceedings
of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 24–26 February
2019; p. 185. [CrossRef]
38. Qi, Y.; Zhou, X.; Li, B.; Zhou, Q. FPGA-based CNN image recognition acceleration and optimization. Comput. Sci. 2021, 48,
205–212. [CrossRef]
39. Liu, Z.-G.; Whatmough, P.N.; Mattina, M. Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile
CNN Inference. IEEE Comput. Archit. Lett. 2020, 19, 34–37. [CrossRef]
40. Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021.
41. Ji, Z.; Zhang, X.; Wei, Z.; Li, J.; Wei, J. A tile-fusion method for accelerating Winograd convolutions. Neurocomputing 2021, 460,
9–19. [CrossRef]
42. Adiono, T.; Putra, A.; Sutisna, N.; Syafalni, I.; Mulyawan, R. Low Latency YOLOv3-Tiny Accelerator for Low-Cost FPGA Using
General Matrix Multiplication Principle. IEEE Access 2021, 9, 141890–141913. [CrossRef]
43. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In
Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456.
44. Milan, A.; Leal-Taixe, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A Benchmark for Multi-Object Tracking. arXiv 2016, arXiv:1603.00831.
45. Taheri Tajar, A.; Ramazani, A.; Mansoorizadeh, M. A lightweight Tiny-YOLOv3 vehicle detection approach. J. Real-Time Image Process. 2021, 18, 2389–2401. [CrossRef]
46. Ding, C.; Wang, S.; Wang, N.; Xu, K.; Wang, Y.; Liang, Y. REQ-YOLO: A resource-aware, efficient quantization framework for
object detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, Seaside, CA, USA, 24–26 February 2019; pp. 33–42.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.