1. Introduction
Smoking behavior detection has gradually attracted more and more attention in recent
years. With the increase in public health awareness and a deeper understanding of the harms
of smoking, more and more individuals and organizations are beginning to focus on how to
effectively identify and prevent smoking behaviors (Ashare et al., 2021). Smoking behavior
detection involves using computer vision technology to automatically recognize and locate
human smoking behaviors in images or videos, thereby monitoring and controlling smoking
scenarios. This technology can be applied in practical applications such as public places,
factories, schools, etc. (Liu X. et al., 2023), helping to enforce smoking bans, strengthen the management of smoking areas, protect the environment, and reduce pollution (Shi et al., 2023).

Smoking behavior detection generally relies on deep learning models for training and inference, effectively reducing the manual cost of smoking detection and improving detection accuracy and efficiency (Tian et al., 2023). Generally speaking, these algorithms can be divided into two branches. One is sensor-based detection methods, such as inhalation sensor-based detection (Yu et al., 2022), lip sensor-based detection (Imtiaz et al., 2019), and hand sensor-based detection (Skinner et al., 2017). These methods face several challenges. They involve high computational load and complex manual feature extraction. Additionally, they exhibit weak feature representation capability and poor model generalization. As a result, solving smoking detection problems across various scenarios becomes quite challenging. The other is using convolutional neural network algorithms to extract features from images, thereby recognizing smoking targets. Common target detection frameworks include YOLO (Jiang P. et al., 2022), Faster R-CNN (Li et al., 2015), SSD (Leibe et al., 2016), and Heterogeneous Networks of Graph Neural Networks (GNNs) (Wang Y. et al., 2022). These algorithms learn and train from a large amount of smoking image data to achieve efficient and accurate target detection.

Despite the significant improvements in smoking detection due to deep convolutional networks, there are still some challenges. First, smoking detection needs to consider the influence of the surrounding environment on smoking images, such as intense illumination, complex backgrounds, and occlusion. These factors may cause biases or misjudgments in the model. Secondly, smoking behavior exhibits certain diversified characteristics. For instance, when recognizing cigarettes, information regarding shape and size needs to be noted. These characteristics also increase the difficulty of algorithm training and practical application. Finally, smoking detection requires the use of high-precision sensors and cameras, which can increase system development and maintenance costs. Also, in large-scale applications, one may need to consider hardware resource limitations, as well as other constraints. Consequently, in practical applications, due to the impact of the above factors, there may be problems such as false detection, missed detection, and a low detection rate, as shown in Figure 1. These issues may affect the accuracy and reliability of the detection results. Therefore, it is necessary to take corresponding measures to address these problems and improve the accuracy and reliability of the detection.

FIGURE 1: Example diagram of smoking error detection.

To address these issues, this paper proposes the YOLOv8-MNC algorithm, which is an improvement on the faster and more accurate YOLOv8, and applies it to smoking behavior detection. The main contributions are as follows:

1. Incorporating NWD Loss to mitigate the sensitivity of IoU to minor object position deviations, thereby enhancing the training accuracy.
2. Incorporating the Multi-head Self-attention Mechanism (MHSA) to boost the global feature learning ability of the target object in the convolution network.
3. Utilizing the lightweight general up-sampling operator CARAFE to replace the original nearest-neighbor interpolation up-sampling module, thereby reducing the loss of feature information during the up-sampling process.
4. Proposing the smoking behavior detection algorithm YOLOv8-MNC, based on YOLOv8. On our custom dataset, the detection accuracy during training reached 85.887%, with a mean Average Precision (mAP) that was 5.7% higher compared to the YOLOv8 algorithm.

The rest of this paper is structured as follows: section "2. Related works" provides a review of relevant works in the field of smoking behavior detection. Section "3. Materials and methods" delves into the enhanced YOLOv8-MNC algorithm framework and explicates the specifics of its implementation. In section "4. Experimental results," we assess the performance of our proposed method through a series of experimental tests. Finally, the paper concludes with a summary and outlines potential future directions.

2. Related works

Presently, methods for detecting smoking behavior primarily comprise traditional and computer vision-based approaches. Traditional methods employ smoke sensors to detect cigarette smoke, thereby identifying smoking behavior. Wu and Chen (2011) proposed a system for smoking behavior detection through facial analysis, which accurately and rapidly discerns whether individuals in images are smoking. Iwamoto et al. (2010) introduced a smoke detection method based on image sequences, utilizing convolutional neural networks (CNNs) to process continuous video frames and detect the presence of smoke. Ali et al. (2012) presented an automated system named mPuff for detecting inhalations of cigarette smoke from respiratory measurements. With the rapid development of computer vision and deep learning, an increasing number of smoking detection algorithms based on object detection have been proposed. Adebowale and Lwin (2019) put forward a deep learning architecture based on convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for detecting smoking behavior from respiratory signals. Rentao et al. (2019) proposed an indoor smoking behavior detection approach that adds a small-scale detection layer to the traditional YOLOv3-tiny network. Poonam et al. (2019) used the Faster RCNN algorithm for cigarette target detection, demonstrating robustness to lighting and deformations. Zhang et al. (2018) proposed a new smoking detection algorithm based on CNNs, which differentiates between non-smokers and smokers by recognizing the position and posture of smokers in photos or videos through feature extraction and classifiers. Liao and Zou (2020) proposed using DarkNet53 as the backbone feature extraction network and decoding the YOLOv3 model through bounding boxes after outputting the feature map to detect smoking behavior within the monitored area. Jiang X. et al. (2022) introduced a smoking behavior detection method based on the YOLOv5 network, which captures images using a camera and recognizes and locates smokers in the images using the YOLOv5 algorithm. Wang Z. et al. (2022) proposed an improved YOLOv5-based architecture with the addition of new data enhancement techniques such as RandomErasing and GaussianBlur to enhance the robustness of the model for real-time smoke detection.
Hu et al. (2022) introduced a fast detection algorithm for forest fire smoke using MVMNet, which is designed to extract and classify image features for smoke detection. Liu et al. (2022) proposed an IoT security solution named Adaptive multi-channel Bayesian graph attention network (AMGBA), aiming to address security issues in the Internet of Things. Xu et al. (2023) introduced a bimodal emotion recognition algorithm using mixed features of audio and speech context. Liu et al. (2020) presented a method for ESD soft fault detection based on Linux kernel function call analysis. Liu et al. (2018) proposed a method for heat exchange analysis in deep-sea spectral detection systems based on ROV, including detailed modeling. Liu Z. et al. (2023) discussed the graph structure learning method EGNN, focusing on its application in graph neural networks.

In the field of object detection, the challenge of accurately identifying small targets, such as cigarettes in smoking detection, has been a persistent issue. These small objects often occupy only a minor portion of the entire image, leading to difficulties in extracting precise position and feature information. Existing methods have approached this problem through various techniques, but limitations remain. Deep learning algorithms for small target detection commonly adopt methods that focus on multi-scale features, contextual information, and loss functions. In terms of multi-scale features, Lin et al. (2017a) utilized FPN to fuse high-resolution and high-semantic information for the Faster RCNN, achieving a 17.8% average precision for small target detection. Liu et al. (2019) improved scale invariance by suppressing inconsistencies in spatial-temporal feature fusion, achieving 43.9% AP with YOLOv3 on the MS COCO dataset. Gong et al. (2021) introduced a "fusion factor" to control information flow between deep and shallow network layers, enhancing small target learning efficiency. Regarding contextual information, Leng et al. (2021) proposed an internal-external network-based detector (ENe) that leverages target appearance and context, enhancing feature extraction, localization, and classification. Guan et al. (2018) proposed the Semantic Context Aware Network (SCAN), utilizing pyramid pooling to fuse multi-level context, thereby improving small target detection. In the realm of loss functions, Wang J. et al. (2021) used the Wasserstein distance to measure bounding box similarity, replacing standard IoU, and demonstrated that using NWD in R-CNN speeds up network convergence. Xu et al. (2022) proposed a Gaussian Receptive Field based Label Assignment (RFLA) strategy, enhancing tiny target detection and achieving a 24.8% average precision on the AI-TOD dataset. Akyon et al. (2022) presented SAHI (Slicing Aided Hyper Inference), an open-source framework for small target detection in high-resolution images. Zhang et al. (2020) introduced the MultiResolution Attention Extractor (MRAE) to learn attention weights across different layers, fusing features with weighted sums and improving small target detection precision without the need for GANs or data preprocessing.

YOLO is currently the most popular family of real-time object detectors, encompassing versions such as YOLOv5 (Zhu et al., 2021), YOLOv7 (Wang C. Y. et al., 2022), and YOLOv8. For example, YOLOv5 focuses on optimizing speed and efficiency, YOLOv7 introduces new features for better handling of small objects, and YOLOv8 further refines the architecture for improved accuracy and robustness. Compared to the previous version YOLOv4 (Bochkovskiy et al., 2020), both YOLOv5 and YOLOv7 have made improvements in speed and accuracy. However, YOLOv5 exhibits some drawbacks, such as deficiencies in small target detection and the need for improvements in dense target detection. YOLOv7 is also limited by training data, model structure, and hyperparameters, leading to performance degradation in certain situations.
YOLOv8, an anchor-free object detection algorithm, incorporates new network structures like PAN-FPN and a Decoupled Head, but it still struggles with small object recognition in complex scenes. For instance, during feature extraction by the neural network, small targets can be misled by large ones, and the features extracted from deep layers may lack sufficient small target information. This deficiency causes the algorithm to ignore small targets during the learning process, leading to poor detection performance. Compared to normal-sized objects, small-sized ones are more likely to be overlapped by other objects and partially obscured by objects of other sizes, making them difficult to distinguish and locate in an image. Existing methods have approached this problem through techniques such as multi-scale training, specialized loss functions, feature fusion, and attention mechanisms.

To address these issues, we propose a new detection algorithm, YOLOv8-MNC, based on the YOLOv8 algorithm. It significantly enhances the detection performance for small-sized objects while maintaining the detection effectiveness for normal-sized ones.

3. Materials and methods

3.1. Overview of the YOLOv8-MNC

YOLOv8 is the latest iteration of the YOLO series of detection models, renowned for their joint detection and segmentation capabilities. We have enhanced it and introduced it into the field of smoking detection. The architecture of our YOLOv8-MNC detector is shown in Figure 2. It consists of three parts: the backbone, the head, and the neck. YOLOv8-MNC is based on the CSP concept and improves on YOLOv5 by replacing the C3 module with the C2f module. Compared with the C3 module, the C2f module can better capture feature information and improve detection accuracy. At the same time, the CSP concept can effectively reduce the amount of computation and improve the running speed of the model. The C2f module borrows the ELAN idea from YOLOv7, combining C3 and ELAN, allowing it to maintain a lightweight design while obtaining richer gradient flow information. In the penultimate layer of the backbone, we still use the popular SPPF module, passing the feature map through three successive 5 × 5 Maxpools (equivalent to pooling at different scales) and then concatenating the output of each layer. This not only ensures accuracy for objects at different scales but also keeps the structure lightweight. We added the MHSA attention module to the SPPF stage of the backbone to help the convolutional network learn the global characteristics of the target object. The MHSA attention mechanism can adaptively adjust the weights between different features, so as to better capture the global information of the target object and improve the performance of the model. In the neck part, the feature fusion method we use is PAN-FPN, which enhances the fusion and utilization of feature layer information at different scales. We used three of the lightweight CARAFE upsampling operators and multiple C2f modules, along with a decoupled head structure, to form the neck module. The idea of decoupling the head, from YOLOX, is used in the last part of the neck; it separates the confidence and regression branches to improve training accuracy. The upsampling operator CARAFE replaces the original nearest-neighbor interpolation, reducing the loss of feature information during the upsampling process.

FIGURE 2: YOLOv8-MNC structure.

3.2. Improvement measures

3.2.1. MHSA module network structure

With the wide application of the Transformer in the field of Computer Vision (CV), models such as ViT (Wang Y. et al., 2021) for image classification tasks, and DETR (Carion et al., 2020) and Deformable DETR (Zhu et al., 2020) for object detection tasks, are all designed based on the Transformer concept. In the attention mechanism, Srinivas et al. (2021) proposed the Bottleneck Transformer module, which designed the Multi-Head Self-Attention Layer (MHSA) based on the Non-local idea. This structure reduces the number of parameters while optimizing the backbone feature extraction network. The structure of the multi-head self-attention layer is shown in Figure 3.

FIGURE 3: MHSA module network structure.

For the current input feature Z of size H × W × d, three different weight matrices W_K, W_Q, and W_V are first initialized. These initialized matrices, representing key, query, and value, are used to compute the representations of the input features. These representations are used in the self-attention mechanism to compute attention weights, and the input features are weighted and averaged to generate attention-enhanced feature representations. After these projections, q, k, and v, three vectors of dimension d, are obtained. Unlike the standard multi-head self-attention mechanism, this MHSA uses a spatial-attention-like scheme to handle position encoding. Rh and Rw are two learnable vectors, which serve as attention vector representations in the vertical and horizontal spatial directions. The sum of these two vectors gives a two-dimensional spatial encoding r. The vector dot product between r and q yields the spatial similarity, and the vector dot product between q and k yields the content similarity. After adding the two, the result is converted into attention weights through Softmax, and the dot product with v then yields the attention-enhanced feature representation.

Spatial similarity is derived from the dot product between the relative position encoding vector r and the query vector q, capturing the geometric structure within the data. Content similarity, on the other hand, is obtained from the dot product between the query vector q and the key vector k, focusing on semantic relationships. Together, these similarities provide a comprehensive understanding of both the geometric and semantic aspects of the input, enhancing the model's ability to recognize complex patterns in tasks such as object detection and image classification. The multi-head self-attention layer directly replaces the 3 × 3 convolution in the last residual block of ResNet, and the output feature can be used in various downstream tasks. It is a good way to enhance the model's ability to model input features and to perceive the relationships between different positions. The introduction of relative position encoding in the MHSA layer considers not only content information but also the relative distance between features at different positions, which can effectively correlate the information and position perception between objects.
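To make the computation above concrete, the following is a minimal PyTorch sketch of an MHSA layer with two-dimensional relative position encoding, in the spirit of the Bottleneck Transformer (Srinivas et al., 2021). The head count, feature sizes, and the use of 1 × 1 convolutions for the projections are illustrative assumptions, not the exact configuration of YOLOv8-MNC.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    def __init__(self, dim, height, width, heads=4):
        super().__init__()
        self.heads, self.dim_head = heads, dim // heads
        # 1x1 convolutions play the role of the weight matrices W_Q, W_K, W_V.
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Rh, Rw: learnable position encodings for the vertical and horizontal
        # directions; their broadcast sum is the 2D spatial encoding r.
        self.Rh = nn.Parameter(torch.randn(1, heads, self.dim_head, height, 1))
        self.Rw = nn.Parameter(torch.randn(1, heads, self.dim_head, 1, width))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).view(b, self.heads, self.dim_head, h * w)
        k = self.k(x).view(b, self.heads, self.dim_head, h * w)
        v = self.v(x).view(b, self.heads, self.dim_head, h * w)
        # Content similarity: dot product between q and k.
        content = torch.einsum('bhdi,bhdj->bhij', q, k)
        # Spatial similarity: dot product between q and r = Rh + Rw.
        r = (self.Rh + self.Rw).view(1, self.heads, self.dim_head, h * w)
        position = torch.einsum('bhdi,bhdj->bhij', q, r.expand(b, -1, -1, -1))
        # Sum of the two similarities -> Softmax -> weighted sum with v.
        attn = torch.softmax((content + position) / self.dim_head ** 0.5, dim=-1)
        out = torch.einsum('bhij,bhdj->bhdi', attn, v)
        return out.reshape(b, c, h, w)

# Example: a 512-channel 20 x 20 backbone feature map.
x = torch.randn(2, 512, 20, 20)
print(MHSA(512, 20, 20)(x).shape)  # torch.Size([2, 512, 20, 20])
```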
3.2.2. NWD

In YOLOv8, the Anchor-Free method is used for object detection. The core idea is to divide the input image into S × S grid units, each referred to as a "Cell." Within each Cell, B bounding boxes (abbreviated as BBox) are predicted. Each bounding box contains a center point coordinate (CP) and a width and height. These bounding boxes can cover the entire input image, thereby detecting all possible targets. Compared to traditional detection methods, the Anchor-Free method does not require predefined anchor boxes but predicts the target's position and category directly on the feature map.

In the entire Anchor mechanism, Intersection over Union (IoU) is an essential metric for determining positive and negative labels based on thresholds, or for excluding bounding boxes with high redundancy. In the training process, a large number of anchor boxes are generated. To obtain the anchor box's target category and the real box's offset to the anchor box, the calculation of IoU is utilized to acquire the anchor box's label. Similarly, in the prediction phase, a single target will generate multiple similar prediction boxes, significantly increasing the computational load. Hence, IoU is used as a threshold, adopting non-maximum suppression to get the optimal prediction box.

Small targets in an image often contain only a few pixels, lacking substantial appearance information and details. The Intersection over Union (IoU) and its extensions are highly sensitive to the positional deviation of small targets; even minor shifts can cause a significant drop in IoU, leading to errors in label allocation. When applied to algorithms based on the Anchor mechanism, this sensitivity can adversely affect detection performance. As illustrated in Figure 4, minor positional deviations can lead to considerable changes in IoU. Given the critical role of IoU in label allocation, even a slight numerical difference might cause what should be allocated to positive samples to be assigned to negative ones. Moreover, if the scale of some targets is too small, the overlap between the anchor box and the real box may never meet the threshold, resulting in an average number of positive samples allocated per ground-truth box of less than one.

FIGURE 4: (A) Objects on a tiny scale. (B) Normal proportion of objects.

IoU only works when bounding boxes overlap. Hence, GIoU (Lin et al., 2017b) was proposed to solve this problem by adding a penalty term. But when two bounding boxes contain each other, GIoU degrades to IoU. Subsequently, DIoU (Zheng et al., 2019) and CIoU (Zheng et al., 2020) were proposed to overcome these issues. However, GIoU, DIoU, and CIoU are all extensions of IoU, commonly used in loss functions. They still exhibit sensitivity to positional deviations of small target objects in label allocation.

To overcome these shortcomings, this paper adds NWD (Wang J. et al., 2021) to the CIoU loss function, with both components accounting for half of the total loss function. The primary step of NWD is to model the bounding box as a two-dimensional Gaussian distribution, then use NWD to measure the similarity of the derived Gaussian distributions. NWD can measure distribution similarity even in non-overlapping cases, and it is insensitive to the scale of the target. It is particularly suitable for measuring the similarity of small target objects.

For small target objects, since most real objects are unlikely to be standard rectangles, the bounding boxes often carry some background information. The information of the target object and the background information are concentrated at the center point and the boundary of the bounding box, respectively. Therefore, when creating a two-dimensional Gaussian distribution for the bounding box, the center pixel of the bounding box can be set to the highest weight, which then gradually decreases from the center point to the boundary. For a horizontal bounding box R, µ and Σ represent the mean vector and covariance matrix of the Gaussian distribution, which can be fitted into a two-dimensional Gaussian distribution N(µ, Σ), where:

$$R = (c_x, c_y, w, h), \quad \mu = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix} \tag{1}$$

In this way, the similarity between bounding boxes is transformed into the distance between Gaussian distributions, where (c_x, c_y) are the center coordinates of the bounding box, and w and h are the width and height. The Wasserstein distance is used to calculate the distribution distance. The second-order Wasserstein distance between two distributions µ1 and µ2 is as follows:

$$W_2^2(\mu_1, \mu_2) = \|m_1 - m_2\|_2^2 + \operatorname{Tr}\!\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\right)^{1/2}\right) \tag{2}$$

where

$$\mu_1 = N(m_1, \Sigma_1), \quad \mu_2 = N(m_2, \Sigma_2) \tag{3}$$

Using Gaussian distributions N1 and N2, where N1 represents bounding box A and N2 represents bounding box B, the formula can finally be simplified as:

$$W_2^2(N_1, N_2) = \left\| \left(c_{1x},\, c_{1y},\, \tfrac{w_1}{2},\, \tfrac{h_1}{2}\right)^{\mathrm{T}} - \left(c_{2x},\, c_{2y},\, \tfrac{w_2}{2},\, \tfrac{h_2}{2}\right)^{\mathrm{T}} \right\|_2^2 \tag{4}$$

where

$$A = (c_{1x}, c_{1y}, w_1, h_1), \quad B = (c_{2x}, c_{2y}, w_2, h_2) \tag{5}$$

As W_2^2(N_1, N_2) functions as a unit of distance rather than a similarity measure, and IoU operates as a ratio bounded between 0 and 1, the necessity to normalize this distance arises. This leads to the computation of the Normalized Wasserstein Distance (NWD), which yields a standardized measure suitable for comparison. The final normalized result is the NWD (Normalized Wasserstein Distance):

$$NWD(N_1, N_2) = \exp\!\left(-\frac{\sqrt{W_2^2(N_1, N_2)}}{C}\right) \tag{6}$$

where C is a constant set empirically; it is set to 12.8 in this paper.
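As a concrete illustration, the following NumPy sketch transcribes Eqs. (1)-(6): each box (cx, cy, w, h) is modeled as a Gaussian with mean (cx, cy) and covariance diag(w²/4, h²/4), so the squared second-order Wasserstein distance reduces to a Euclidean distance between the vectors (cx, cy, w/2, h/2), which is then normalized with C = 12.8 as in the paper. This is a direct transcription of the formulas, not the authors' training code.

```python
import numpy as np

def nwd(box_a, box_b, C=12.8):
    """Normalized Wasserstein distance between two boxes (cx, cy, w, h)."""
    a = np.array([box_a[0], box_a[1], box_a[2] / 2.0, box_a[3] / 2.0])
    b = np.array([box_b[0], box_b[1], box_b[2] / 2.0, box_b[3] / 2.0])
    w2 = np.sum((a - b) ** 2)        # squared 2-Wasserstein distance, Eq. (4)
    return np.exp(-np.sqrt(w2) / C)  # normalization, Eq. (6)

# Two small boxes shifted by 2 px keep a high similarity (~0.86), whereas
# their IoU has already dropped to about 0.33 -- the property motivating NWD.
print(nwd((10, 10, 4, 4), (12, 10, 4, 4)))

# In training, the paper weights the two loss terms equally:
# box_loss = 0.5 * (1 - ciou) + 0.5 * (1 - nwd)
```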
3.2.3. Lightweight upsampling operator CARAFE (Content-Aware ReAssembly of Features)

The original YOLOv8 feature fusion network employs nearest neighbor interpolation, using the grayscale value of the closest pixel among the neighboring pixels around the sampling point. This approach neglects the influence of the other neighboring pixels, and the grayscale value becomes discontinuous after resampling, leading to a loss of image quality. In contrast, the improved method introduces, within the PAN-FPN structure, the lightweight upsampling operator CARAFE (Content-Aware ReAssembly of Features) (Loy et al., 2019) to replace nearest neighbor interpolation. The CARAFE structure is mainly divided into two parts: the upsampling kernel prediction module and the feature recombination module. First, the upsampling kernel prediction module utilizes the input feature map to predict the upsampling kernel. Then, the feature recombination module uses the predicted upsampling kernel to recombine the features and complete the upsampling process. These recombined features can rectify the feature deviation that occurs during the fusion process. Characterized by low redundancy, a lightweight design, rapid computation, and strong feature fusion ability, the CARAFE operator is a significant enhancement. By replacing the upsampling in the feature fusion network with the CARAFE operator, the network can aggregate contextual information within a larger receptive field. This method abandons the nearest neighbor interpolation approach, opting instead for an adaptive, content-aware kernel sampling technique. The feature fusion network with the introduced CARAFE operator is depicted in Figure 5.

FIGURE 5: Structure of the lightweight upsampling operator CARAFE.

The CARAFE computation process can be divided into the following four parts:

(1) Channel Compression: The input H × W × C features are compressed to H × W × Cm to reduce the amount of computation in subsequent operations, where Cm is the number of compressed channels; in this paper, Cm is set to 64.
(2) Content Encoding and Upsampling Kernel Prediction: For the compressed feature map, an upsampling kernel of size σH × σW × Kup² is predicted using a convolutional layer with a kernel of size Kencoder × Kencoder, where Kup is the size of the predicted upsampling kernel; in this paper, Kup is 5 and Kencoder is 3.
(3) Upsampling Kernel Normalization: The predicted upsampling kernel is normalized by Softmax so that the convolution kernel weights sum to 1.
(4) Content-Aware Feature Recombination: The predicted upsampling kernel is convolved with the input features to obtain the recombined features.
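The four steps above map directly onto a few tensor operations. Below is a compact PyTorch sketch of the CARAFE operator under the settings stated in the text (Cm = 64, Kup = 5, Kencoder = 3) and an assumed upsampling factor of σ = 2; it is a plain reading of the published operator (Loy et al., 2019), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    def __init__(self, c, c_mid=64, scale=2, k_up=5, k_enc=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # (1) Channel compression: C -> Cm.
        self.comp = nn.Conv2d(c, c_mid, 1)
        # (2) Content encoding: predict scale^2 * Kup^2 kernel weights per pixel,
        # then PixelShuffle rearranges them to Kup^2 weights per output pixel.
        self.enc = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2, k_enc,
                             padding=k_enc // 2)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = self.shuffle(self.enc(self.comp(x)))  # (b, Kup^2, sH, sW)
        # (3) Softmax so each Kup x Kup reassembly kernel sums to 1.
        kernels = F.softmax(kernels, dim=1)
        # (4) Content-aware reassembly: weight Kup x Kup neighborhoods of x.
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)
        patches = patches.view(b, c, self.k_up ** 2, h, w)
        patches = patches.repeat_interleave(self.scale, dim=3)
        patches = patches.repeat_interleave(self.scale, dim=4)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (b, c, sH, sW)

x = torch.randn(1, 256, 40, 40)
print(CARAFE(256)(x).shape)  # torch.Size([1, 256, 80, 80])
```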
4. Experimental results

4.1. Dataset and experimental setup

For the specific task of smoking detection, this study relies on a self-constructed dataset, as public datasets are lacking in this domain. The dataset was assembled from smoking-related images sourced from the Internet through keyword searches and manual screening, as well as key frames extracted from recorded smoking video clips. The combined collection was then meticulously cleaned and screened to remove noise and outliers, with the aid of advanced image and video processing technologies, including deep learning-based image processing. The final dataset comprises a total of 11,629 images, all annotated using LabelImg in the PASCAL VOC format. Prior to training, the annotations were converted into the txt format required by YOLOv8, and the dataset was partitioned into training and validation sets at a 7:3 ratio. The detection task focuses solely on categorizing smoking behavior, labeled as "smoke" within the dataset. The dataset, as depicted in Figure 6, represents a comprehensive and carefully curated resource for the study's experimental needs.

FIGURE 6: Image of the dataset.

This study was conducted using the PyTorch deep learning framework, with code execution and model training carried out on the Inspur Artificial Intelligence platform server, equipped with an ASPEED Graphics Family (rev 41) graphics card. The system operates on Red Hat 4.8.5-44, utilizing Python 3.8, CUDA 11.3, and PyTorch 1.12.1. Specifically, the model was trained over 500 epochs to ensure comprehensive learning, with a learning rate of 0.01 to balance convergence speed and accuracy. The Stochastic Gradient Descent (SGD) optimizer was employed to efficiently update the model parameters, making it suitable for handling the large-scale dataset.

4.2. Model evaluation

This paper uses precision, recall, Average Precision (AP), and Mean Average Precision (mAP) as model accuracy evaluation indicators. AP represents the area under the Precision-Recall (PR) curve, and mAP represents the average of the AP over all classes. TP represents the number of correctly predicted positive samples, which reflects the performance of the model in accurately detecting positive samples. FN represents the number of positive samples that were incorrectly predicted as negative samples, revealing positive samples that the model may have missed. FP represents the number of negative samples that were incorrectly predicted as positive samples, indicating that the model may incorrectly label negative samples as positive. The specific formulas are as follows:

$$P = \frac{TP}{TP + FP} \tag{7}$$

$$R = \frac{TP}{TP + FN} \tag{8}$$

$$AP = \frac{\sum P}{\mathrm{Num(objects)}} \tag{9}$$

$$mAP = \frac{\sum AP}{\mathrm{Num(class)}} \tag{10}$$
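For reference, the following short sketch evaluates Eqs. (7)-(10) under the usual detection conventions. The trapezoidal integration of the PR curve is one common way to compute the area in Eq. (9); the paper does not specify its exact interpolation scheme.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)   # Eq. (7) and Eq. (8)

def average_precision(recalls, precisions):
    # Area under the precision-recall curve (Eq. 9), sorted by recall.
    r, p = np.asarray(recalls), np.asarray(precisions)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)  # Eq. (10)

p, r = precision_recall(tp=83, fp=17, fn=14)
print(round(p, 3), round(r, 3))  # 0.83 0.856
```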
TABLE 1 Comparison of different loss functions.

Loss function | [email protected]/%
EIoU | 80.965
Focal-EIoU | 81.484
CIoU | 81.805
SIoU | 81.86
CIoU + NWD | 82.777

4.3. Experimental results

4.3.1. Experimental comparison of different loss functions

To validate the effects of different loss functions, we used the YOLOv8 model as a baseline and selected CIoU (Zheng et al., 2020), SIoU (Gevorgyan, 2022), EIoU, Wise-IoU (Tong et al., 2023), Focal-EIoU (Zhang et al., 2021), and NWD (Wang J. et al., 2021) for experimental comparison. As shown in Table 1, the [email protected] values for EIoU, Focal-EIoU, CIoU, Wise-IoU, SIoU, and CIoU+NWD are 80.965, 81.484, 81.805, 81.883, 81.86, and 82.777, respectively. [email protected] is an important indicator for evaluating the performance of a target detection model, and a higher [email protected] value represents a more accurate detection ability of the model for the target object. We can observe that the CIoU+NWD loss function performs significantly better than the other loss functions in the experiment, obtaining the highest [email protected] value of 82.777. It is particularly worth noting that compared with the original CIoU, the [email protected] value of CIoU+NWD is increased by 1.293%. This demonstrates that the introduction of NWD effectively reduces the sensitivity to small object position deviations and solves the localization problem of small objects while improving training accuracy. Therefore, this further validates the effectiveness of incorporating NWD into the CIoU loss function.

4.3.2. Experimental comparison of different attention mechanisms

We have made improvements to the activation function in YOLOv8 by using CELU, and added a small object detection layer and an attention mechanism based on NWD for comparison. We selected 11 different attention mechanisms for comparison, including TripletAttention (Misra et al., 2020), CoTAttention (Li et al., 2021), ShuffleAttention (Yang, 2021), Polarized Self-Attention (Liu H. et al., 2021), GAM_Attention (Liu Y. et al., 2021), CAM_concat (Xiao et al., 2021), SKAttention (Li et al., 2019), GlobalContext (Cao et al., 2019), EffectiveSE (Lee and Park, 2019), ParNetAttention (Goyal et al., 2021), SimAM (Yang et al., 2021), SEAttention (Hu et al., 2018), and MHSA (Srinivas et al., 2021). As seen in Table 2, the Multi-head Self-attention Mechanism (MHSA) is introduced, which can consider multiple attention subspaces simultaneously, modeling the association relationships between different features more comprehensively and globally. This allows for better capture of the association and context information between features. In addition to having a [email protected]/% similar to the SimAM and ParNetAttention mechanisms, MHSA, compared with the other attention mechanisms, can focus on target features more accurately and improve the accuracy of target detection.

To verify the effectiveness of the proposed method in this paper, we conducted comparative experiments on the smoking dataset with several mainstream object detection methods, further validating the feasibility and superiority of the improved model. The detection results are shown in Table 3. The mainstream object detection algorithms include YOLOv3-tiny (Gong et al., 2019), YOLOv4-tiny (Jiang et al., 2020), YOLOv5 (Jocher et al., 2022), YOLOv6 (Li et al., 2022), YOLOv7 (Wang C. Y. et al., 2022), YOLOX-tiny (Ge et al., 2021), SSD (Leibe et al., 2016), RetinaNet (Lin et al., 2017b), and YOLOv8, compared with our model. It can be seen that our YOLOv8-MNC training result [email protected]/% is higher than that of YOLOv3-tiny, YOLOv4-tiny, YOLOv5, YOLOv6, YOLOv7, YOLOX-tiny, SSD, and RetinaNet by 8.674, 15.007, 3.935, 5.987, 15.867, 6.317, 22.067, and 19.317, respectively. In this experiment, the improved YOLOv8 model, YOLOv8-MNC, achieved 85.887, which is 5.797 higher than the original YOLOv8 model. This result proves that YOLOv8-MNC is superior to the other models, validating the efficiency of this model. At the same time, it also illustrates the effectiveness of our combination of NWD Loss, the multi-head self-attention mechanism (MHSA), and the use of the lightweight general-purpose upsampling operator CARAFE to replace the original nearest neighbor interpolation upsampling module. In addition, the fine-tuning of model parameters can yield more accurate and stable prediction results.
TABLE 3 Comparison with mainstream algorithms.

Detector | Backbone | [email protected]/%
YOLOv3-tiny (Gong et al., 2019) | DarkNet-53 | 77.213
YOLOv4-tiny (Jiang et al., 2020) | CSPDarknet53 | 70.88
YOLOv5 (Jocher et al., 2022) | CSPDarknet53 | 81.952
YOLOv6 (Li et al., 2022) | EfficientNet | 79.90
YOLOv7 (Wang C. Y. et al., 2022) | CBS+E-ELAN+MP | 70.02
YOLOX-tiny (Ge et al., 2021) | CSPDarknet-S | 79.57
SSD (Leibe et al., 2016) | VGG16 | 63.82
RetinaNet (Lin et al., 2017b) | ResNet50 | 66.57
YOLOv8 | CSPDarknet53 | 80.09
YOLOv8-MNC | CSPDarknet53 | 85.887

4.3.3. Ablation experiments

We proposed four improvements on the base of the YOLOv8 model: (1) introducing NWD, (2) adding the MHSA attention mechanism, (3) changing the stride of the first convolution in the backbone part of the YOLOv8 yaml file from 2 to 1, and (4) using the lightweight upsampling operator CARAFE. The improved model is evaluated using three indicators: parameters, GFLOPs, and [email protected]/%.

In Table 4, using the YOLOv8 model as a baseline, we introduced four key improvements to enhance its performance. The CELU activation function was adopted for its strong non-linear expression ability. A small target detection layer was added, increasing the [email protected]/% by 2.696. The introduction of Normalized Wasserstein Distance (NWD) further improved the [email protected]/% by 0.898, enhancing small target detection. The Multi-Head Self-Attention Mechanism (MHSA) and the lightweight universal upsampling operator CARAFE contributed additional improvements. Adjusting the stride of the first convolution from 2 to 1 also increased the [email protected]/%. The model improvement graph is shown in Figure 7. Figure 8 is the confusion matrix diagram of YOLOv8 and YOLOv8-MNC. Collectively, these enhancements led to a significant increase in [email protected]/%, with a notable rise in the True Positive box from 0.79 to 0.83, validating the effectiveness of the improvements and illustrating the model's increased precision and robustness.

TABLE 4 Ablation experiments.

YOLOv8 | Tiny object layer | NWD | MHSA | Backbone variant | CARAFE | Parameters | GFLOPs | [email protected]/%
√ | | | | | | 3011043 | 8.2 | 80.078
√ | √ | | | | | 2983204 | 12.6 | 82.774
√ | √ | √ | | | | 2983204 | 12.6 | 83.672
√ | √ | √ | √ | | | 3180580 | 12.8 | 84.303
√ | √ | √ | √ | √ | | 3180580 | 51.2 | 85.346
√ | √ | √ | √ | √ | √ | 3383036 | 55.5 | 85.887

FIGURE 7: Ablation experiment line graph.
FIGURE 8: Confusion matrix diagram.

In summary, the YOLOv8-MNC algorithm outperforms other algorithms due to the following key enhancements:

(1) NWD Loss Integration: The NWD loss function reduces sensitivity to small object position deviations, enhancing training accuracy. This is achieved by normalizing IoU weights according to the target object's size and introducing position-sensitive weights. These adjustments allow the model to predict the location and size of bounding boxes more accurately, paying more attention to details and reducing the impact of edge object position deviation.
(2) Inclusion of the MHSA Attention Mechanism: The addition of the MHSA attention mechanism enables the model to better capture relationships between different locations, scales, and semantics. By computing similarities between query and key vectors, the model can focus on important regions in the image, enhancing its perception of local details and global contextual information.
(3) Stride Improvement in the Backbone Part: By changing the stride from 2 to 1 in the YOLOv8 yaml file, the model captures more detailed features and provides more location information. This adjustment allows the convolution layer to move only one pixel at a time, capturing more nuanced information.
(4) Adoption of CARAFE for Upsampling: Replacing traditional upsampling methods with CARAFE improves the spatial perception of low-resolution input images. CARAFE's content-aware mechanism calculates from which surrounding local areas to gather information for reorganization, allowing for a more refined feature reorganization process. This ensures that the output quality matches the input, overcoming problems such as blurring and distortion in low-resolution images.

These improvements collectively contribute to the superior performance of YOLOv8-MNC, making it more sensitive and accurate in locating small targets, and enhancing its ability to process low-resolution information.

4.3.4. Algorithm analysis

To further intuitively demonstrate and evaluate the test effects and compare the feature extraction capabilities of YOLOv8 and the improved YOLOv8-MNC in small target detection, we need to pay attention to what key information the main network has extracted from the pictures. In this paper, we use the widely applicable Grad-CAM method to study the areas of interest underlying the network's output values. Grad-CAM (Gradient-weighted Class Activation Mapping), an improved version of CAM (Class Activation Mapping), uses the gradients of a specified class to help analyze the network's areas of interest for that class. By examining the network's areas of interest, we can analyze whether the network has learned the correct features or information, making this method significantly meaningful for the visualization of image classification.
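As an illustration of the procedure, the sketch below computes a Grad-CAM heat map with forward and backward hooks: the gradient of a chosen output score with respect to a convolutional feature map is channel-averaged and used to weight that map. It is shown on a torchvision ResNet purely for brevity; hooking a YOLOv8 backbone layer follows the same pattern, and the layer choice here is an assumption for the example.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
feats, grads = {}, {}
layer = model.layer4  # the convolutional layer to visualize (an assumption)

layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed test image
score = model(x)[0].max()        # score of the predicted class
score.backward()

weights = grads['a'].mean(dim=(2, 3), keepdim=True)  # channel importances
cam = F.relu((weights * feats['a']).sum(dim=1))      # weighted sum + ReLU
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224),
                    mode='bilinear', align_corners=False)
print(cam.shape)  # torch.Size([1, 1, 224, 224]): heat map over the input
```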
FIGURE 9: Grad-CAM heat maps of the YOLOv8 and YOLOv8-MNC model test results.

Figure 9 shows the Grad-CAM images after the two different networks processed the test set images. The brighter areas in the figure represent the areas the network pays more attention to. Observing the test results, it can be seen that the improved YOLOv8-MNC model covers more of the smoking target in the heat map and is brighter and more concentrated than YOLOv8. Therefore, with the help of NWD Loss, the MHSA attention mechanism, and the lightweight upsampling operator CARAFE, the model can pay more accurate attention to the targets, reflecting the model's efficiency and accuracy.

The performance of the smoking detection model can be challenged in real-world applications due to factors like poor visual conditions, pose and scale variations, occlusions, and real-time requirements. However, these challenges can be mitigated through strategies such as data augmentation to simulate diverse visual conditions, multi-scale training to handle scale variations, the integration of contextual or location information to manage occlusions, and model optimization to meet real-time demands. Implementing these strategies can enhance the model's robustness and adaptability, improving its performance in various real-world scenarios.

5. Conclusion

This paper presents a novel smoking behavior detection model, focusing on real-time performance and accuracy, particularly in detecting small targets like cigarettes. Built upon the YOLOv8 architecture, the model introduces several enhancements. The NWD Loss is implemented to reduce sensitivity to small object position deviations, improving training accuracy. The Multi-head Self-Attention Mechanism (MHSA) is added to bolster the convolutional network's global feature learning, and the lightweight CARAFE operator replaces the original nearest-neighbor interpolation, minimizing feature information loss during upsampling. These innovations collectively enhance both speed and accuracy. While the model demonstrates promising results on a self-made smoking dataset, its performance in real-world scenarios may be constrained by the limited diversity of the dataset. Future work should focus on collecting more varied and complex smoking datasets, reflecting a broader range of environmental factors, to further refine the model's generalization ability in complex and dynamic environments.

Data availability statement

The original contributions presented in this study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

ZW: conceptualization, methodology, resources, data curation, and writing—review and editing. LL: software, validation, and writing—original draft preparation. PS: formal analysis and investigation. All authors have read and agreed to the published version of the manuscript.
References
Adebowale, M., and Lwin, K. (2019). "Deep learning with convolutional neural network and long short-term memory for phishing detection," in Proceedings of the 13th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2019), (Ulkulhas: IEEE).

Akyon, F. C., Altinuc, S. O., and Temizel, A. (2022). Slicing aided hyper inference and fine-tuning for small object detection. arXiv [Preprint]. doi: 10.48550/arXiv.2202.06934

Ali, A. A., Hossain, S. M., Hovsepian, K., Plarre, K., and Kumar, S. (2012). "mPuff: Automated detection of cigarette smoking puffs from respiration measurements," in Proceedings of the 2012 ACM/IEEE 11th International Conference on Information Processing in Sensor Networks (IPSN), (Beijing: IEEE). doi: 10.1007/s13534-020-00147-8

Ashare, R. L., Bernstein, S. L., Schnoll, R., Gross, R., Catz, S. L., Cioe, P., et al. (2021). The United States National Cancer Institute's coordinated research effort on tobacco use as a major cause of morbidity and mortality among people with HIV. Nicotine Tob. Res. 23, 407–410. doi: 10.1093/ntr/ntaa155

Bochkovskiy, A., Wang, C. Y., and Liao, H. Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv [Preprint]. doi: 10.48550/arXiv.2004.10934

Cao, Y., Xu, J., Lin, S., Wei, F., and Hu, H. (2019). GCNet: Non-local networks meet squeeze-excitation networks and beyond. arXiv [Preprint]. doi: 10.48550/arXiv.1904.11492

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. arXiv [Preprint]. doi: 10.48550/arXiv.2005.12872

Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv [Preprint]. doi: 10.48550/arXiv.2107.08430

Gevorgyan, Z. (2022). SIoU loss: More powerful learning for bounding box regression. arXiv [Preprint]. doi: 10.48550/arXiv.2205.12740

Gong, H., Li, H., Xu, K., and Zhang, Y. (2019). "Object detection based on improved YOLOv3-tiny," in Proceedings of the 2019 Chinese Automation Congress (CAC), (Hangzhou: IEEE).

Gong, Y., Yu, X., Ding, Y., Peng, X., Zhao, J., and Han, Z. (2021). "Effective fusion factor in FPN for tiny object detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, (Waikoloa, HI: IEEE).

Goyal, A., Bochkovskiy, A., Deng, J., and Koltun, V. (2021). Non-deep networks. arXiv [Preprint]. doi: 10.48550/arXiv.2110.07641

Guan, L., Wu, Y., and Zhao, J. (2018). SCAN: Semantic context aware network for accurate small object detection. Int. J. Comput. Intell. Syst. 11:936.

Hu, J., Shen, L., and Sun, G. (2018). "Squeeze-and-excitation networks," in Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (Salt Lake City, UT: IEEE).

Hu, Y., Zhan, J., Zhou, G., Chen, A., Cai, W., Guo, K., et al. (2022). Fast forest fire smoke detection using MVMNet. Knowl. Based Syst. 241:108219.

Imtiaz, M. H., Senyurek, V. Y., Belsare, P., Nagaraja, K., and Sazonov, E. (2019). "Development of a smart IoT charger for wearable cigarette smoking monitor," in Proceedings of the SoutheastCon 2019, (Huntsville, AL: IEEE).

Iwamoto, K., Inoue, H., Matsubara, T., and Tanaka, T. (2010). Cigarette smoke detection from captured image sequences. Proc. SPIE 7538:753813.

Jiang, P., Ergu, D., Liu, F., Cai, Y., and Ma, B. (2022). A review of Yolo algorithm developments. Proc. Comput. Sci. 199, 1066–1073.

Jiang, X., Hu, H., Liu, X., Ding, R., Xu, Y., Shi, J., et al. (2022). A smoking behavior detection method based on the YOLOv5 network. J. Phys. Conf. Ser. 2232:012001.

Jiang, Z., Zhao, L., Li, S., and Jia, Y. (2020). Real-time object detection method based on improved YOLOv4-tiny. arXiv [Preprint]. doi: 10.48550/arXiv.2011.04244

Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon, Y., Michael, K., et al. (2022). Ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation. Honolulu: Zenodo.

Lee, Y., and Park, J. (2019). CenterMask: Real-time anchor-free instance segmentation. arXiv [Preprint]. doi: 10.48550/arXiv.1911.06667

Leibe, B., Matas, J., Sebe, N., and Welling, M. (2016). "SSD: Single shot multibox detector," in Computer Vision - ECCV 2016. Lecture Notes in Computer Science, eds B. Leibe, J. Matas, N. Sebe, and M. Welling (Cham: Springer), 21–37. doi: 10.1007/978-3-319-46448-0_2

Leng, J., Ren, Y., Jiang, W., Sun, X., and Wang, Y. (2021). Realize your surroundings: Exploiting context information for small object detection. Neurocomputing 433, 287–299.

Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., et al. (2022). YOLOv6: A single-stage object detection framework for industrial applications. arXiv [Preprint]. doi: 10.48550/arXiv.2209.02976

Li, J., Liang, X., Shen, S. M., Xu, T., and Yan, S. (2015). Scale-aware fast R-CNN for pedestrian detection. IEEE Trans. Multimedia 20, 985–996.

Li, X., Wang, W., Hu, X., and Yang, J. (2019). "Selective kernel networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (Long Beach, CA: IEEE), 510–519. doi: 10.1093/pcmedi/pbac011

Li, Y., Yao, T., Pan, Y., and Mei, T. (2021). Contextual transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45, 1489–1500.

Liao, J., and Zou, J. (2020). "Smoking target detection based on Yolo V3," in Proceedings of the 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), (Harbin: IEEE).

Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017a). "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (Honolulu, HI: IEEE), 2117–2125.

Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 99, 2999–3007.

Liu, H., Liu, F., Fan, X., and Huang, D. (2021). Polarized self-attention: Towards high-quality pixel-wise regression. Neurocomputing 506, 158–167.

Liu, Y., Shao, Z., and Hoffmann, N. (2021). Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv [Preprint]. doi: 10.48550/arXiv.2112.05561

Liu, S., Huang, D., and Wang, Y. (2019). Learning spatial fusion for single-shot object detection. arXiv [Preprint]. doi: 10.48550/arXiv.1911.09516

Liu, X., Li, X., Su, H., Zhao, Y., and Ge, S. S. (2023). The opening workspace control strategy of a novel manipulator-driven emission source microscopy system. ISA Trans. 134, 573–587. doi: 10.1016/j.isatra.2022.09.002

Liu, Z., Yang, D., Wang, Y., Lu, M., and Li, R. (2023). EGNN: Graph structure learning based on evolutionary computation helps more in graph neural networks. Appl. Soft Comput. 135:110040.

Liu, X., Maghlakelidze, G., Zhou, J., Izadi, O. H., Shen, L., Pommerenke, M., et al. (2020). Detection of ESD-induced soft failures by analyzing Linux kernel function calls. IEEE Trans. Device Mater. Reliabil. 20, 128–135.

Liu, X., Qi, F., Ye, W., Cheng, K., Guo, J., and Zheng, R. (2018). Analysis and modeling methodologies for heat exchanges of deep-sea in situ spectroscopy detection system based on ROV. Sensors 18:2729. doi: 10.3390/s18082729

Liu, Z., Yang, D., Wang, S., and Su, H. (2022). Adaptive multi-channel Bayesian graph attention network for IoT transaction security. Digital Commun. Netw. (in press). doi: 10.1016/j.dcan.2022.11.018

Loy, C. C., Lin, D., Wang, J., Chen, K., Xu, R., and Liu, Z. (2019). CARAFE: Content-aware reassembly of features. arXiv [Preprint]. doi: 10.48550/arXiv.1905.02188

Misra, D., Nalamada, T., Arasanipalai, A. U., and Hou, Q. (2020). Rotate to attend: Convolutional triplet attention module. arXiv [Preprint]. doi: 10.48550/arXiv.2010.03045

Poonam, G., Shashank, B. N., and Rao, A. G. (2019). Development of framework for detecting smoking scene in video clips. Indon. J. Electr. Eng. Comput. Sci. 13, 22–26.

Rentao, Z., Mengyi, W., Zilong, Z., Ping, L., and Qingyu, Z. (2019). "Indoor smoking behavior detection based on YOLOv3-tiny," in Proceedings of the 2019 Chinese Automation Congress (CAC), (Hangzhou), 22–24.

Shi, Y., Li, H., Fu, X., Luan, R., Wang, Y., Wang, N., et al. (2023). Self-powered difunctional sensors based on sliding contact-electrification and tribovoltaic effects for pneumatic monitoring and controlling. Nano Energy 110:108339.

Skinner, A., Stone, C. J., Doughty, H., and Munafo, M. R. (2017). StopWatch: A smartwatch based system for passive detection of cigarette smoking. PsyArXiv [Preprint]. doi: 10.31234/osf.io/75j57

Srinivas, A., Lin, T. Y., Parmar, N., Shlens, J., and Vaswani, A. (2021). Bottleneck transformers for visual recognition. arXiv [Preprint]. doi: 10.48550/arXiv.2101.11605

Tian, C., Xu, Z., Wang, L., and Liu, Y. (2023). Arc fault detection using artificial intelligence: Challenges and benefits. Math. Biosci. Eng. 20, 12404–12432. doi: 10.3934/mbe.2023552

Tong, Z., Chen, Y., Xu, Z., and Yu, R. (2023). Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv [Preprint]. doi: 10.48550/arXiv.2301.10051

Wang, C. Y., Bochkovskiy, A., and Liao, H. Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv [Preprint]. doi: 10.48550/arXiv.2207.02696

Wang, Y., Liu, Z., Xu, J., and Yan, W. (2022). Heterogeneous network representation learning approach for Ethereum identity identification. IEEE Trans. Comput. Soc. Syst. 10, 890–899.

Wang, Z., Wu, L., Li, T., and Shi, P. (2022). A smoke detection model based on improved YOLOv5. Mathematics 10:1190.

Wang, J., Xu, C., Yang, W., and Yu, L. (2021). A normalized Gaussian Wasserstein distance for tiny object detection. arXiv [Preprint]. doi: 10.48550/arXiv.2110.13389

Wang, Y., Huang, R., Song, S., Huang, Z., and Huang, G. (2021). Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition. arXiv [Preprint]. doi: 10.48550/arXiv.2105.15075

Wu, W. C., and Chen, C. Y. (2011). "Detection system of smoking behavior based on face analysis," in Proceedings of the Fifth International Conference on Genetic & Evolutionary Computing, (Kitakyushu: IEEE).

Xiao, J., Zhao, T., Yao, Y., Yu, Q., and Chen, Y. (2021). Context augmentation and feature refinement network for tiny object detection. Expert Syst. Appl. 211:1635.

Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., and Xia, G. S. (2022). "RFLA: Gaussian receptive field based label assignment for tiny object detection," in Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IX, (Berlin: Springer-Verlag), 526–543.

Xu, Y., Su, H., Ma, G., and Liu, X. (2023). A novel dual-modal emotion recognition algorithm with fusing hybrid features of audio signal and speech context. Complex Intell. Syst. 9, 951–963.

Yang, L., Zhang, R. Y., Li, L., and Xie, X. (2021). "SimAM: A simple, parameter-free attention module for convolutional neural networks," in Proceedings of the 38th International Conference on Machine Learning, (ML Research Press), 11863–11874.

Yang, Y. B. (2021). SA-Net: Shuffle attention for deep convolutional neural networks. New York, NY: Paperspace.

Yu, Q., Chen, J., Fu, W., Muhammad, K. G., Li, Y., Liu, W., et al. (2022). Smartphone-based platforms for clinical detections in lung-cancer-related exhaled breath biomarkers: A review. Biosensors 12:223. doi: 10.3390/bios12040223

Zhang, D., Jiao, C., and Wang, S. (2018). "Smoking image detection based on convolutional neural networks," in Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), (Chengdu: IEEE).

Zhang, F., Jiao, L., Li, L., Liu, F., and Liu, X. (2020). Multiresolution attention extractor for small object detection. arXiv [Preprint]. doi: 10.48550/arXiv.2006.05941

Zhang, Y. F., Ren, W., Zhang, Z., Jia, Z., Wang, L., and Tan, T. (2021). Focal and efficient IOU loss for accurate bounding box regression. arXiv [Preprint]. doi: 10.48550/arXiv.2101.08158

Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. (2019). Distance-IoU loss: Faster and better learning for bounding box regression. arXiv [Preprint]. doi: 10.48550/arXiv.1911.08287

Zheng, Z., Wang, P., Ren, D., Liu, W., Ye, R., Hu, Q., et al. (2020). Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 52, 8574–8586. doi: 10.1109/TCYB.2021.3095305

Zhu, X., Lyu, S., Wang, X., and Zhao, Q. (2021). "TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios," in Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), (Montreal, BC: IEEE).

Zhu, X., Su, W., Lu, L., Li, B., and Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv [Preprint]. doi: 10.48550/arXiv.2010.04159