Conference Template a4
Abstract—This work aims to develop an efficient and interpretable AI system that accurately detects and monitors entities within restricted spaces to improve security and safety. The proposed model combines explainable AI with a self-refined segment editing technique, which lets the model better discern and track persons in complex environments and improves its performance through iterative learning. Combining these two approaches allows the AI system to adapt to various scenarios and refine its tracking capabilities over time. The model is trained and evaluated on a custom dataset comprising video frames from multiple restricted areas, together with posture and gait datasets. The results demonstrate that the hybrid model outperforms traditional tracking methods in both accuracy and interpretability. Additionally, the model's explainable nature provides insight into its decision-making process, enabling better understanding of and trust in its output. The proposed hybrid Integrated Gradient with self-refined segment editing model (IG-SEPM) holds promise for applications in security, surveillance, and safety in restricted areas. Metrics such as mean Average Precision (mAP) or Intersection over Union (IoU) are used to measure the segment editing model's tracking accuracy on a standard sample. The comparative results show that the proposed model achieves an Average Precision of about 89.7%, higher than other existing work. The output from SEPM shows that the model provides a target class predicted probability of 0.3 for α = 0.15. The valid masked image is then interpreted in parallel using IG, which obtains a better target class predicted probability with average pixel gradients. The results show that the proposed work performs better than the existing system.

Keywords—Explainable Artificial Intelligence, Object Tracking, Segment Editing Model, Self-Refinement, Integrated Gradient

I. INTRODUCTION

In recent years, the increasing need for enhanced security and surveillance in restricted areas has driven significant advances in artificial intelligence (AI) technologies. Person tracking within such spaces helps ensure safety and prevent unauthorized access. Traditional person-tracking methods often struggle in complex and dynamic environments, leading to inaccuracies and limited interpretability. Person tracking in restricted areas is a critical task with significant implications for security and safety [1]. The ability to accurately and efficiently monitor individuals within confined spaces has become increasingly important in today's security landscape. Traditional person-tracking methods often struggle with occlusions, complex backgrounds, and varying lighting conditions, leading to inaccurate and unreliable results [2]. Person tracking in restricted areas has crucial real-world applications, including security surveillance in government and military facilities, public safety in airports and train stations, and critical infrastructure protection. Accurately tracking individuals while maintaining interpretability is valuable in these scenarios.

Explainable AI enables the model to learn iteratively from its predictions and improve its tracking capabilities over time [3]. By continuously refining its understanding of tracking dynamics, the system becomes adaptive to varying scenarios, offering increased accuracy in complex environments. The self-refined segment editing technique is introduced alongside explainable AI to further enhance the model's performance. Self-refined segment editing allows the system to analyse and modify object representations, enabling better discrimination of individuals in crowded or occluded scenes. This refinement process improves tracking results in challenging scenarios where traditional methods tend to falter.

Researchers have explored various techniques for person tracking in complex environments such as crowded public spaces, indoor settings, and outdoor areas with challenging lighting conditions. Traditional methods, such as Kalman filters and particle filters, have shown limitations in handling occlusions and maintaining track continuity in crowded scenes [4]. These limitations underline the importance and challenges of person tracking in restricted areas and the need for accurate and interpretable AI models for enhanced security and safety. Researchers have also investigated hybrid AI models that combine the strengths of different techniques to improve tracking performance. Hybrid models involving reinforcement learning, attention mechanisms, and feature fusion have shown promise in person tracking [5]. However, combining explainable AI with segment editing for this specific application remains under-explored.

The proposed hybrid model holds great promise for security, surveillance, and safety applications in restricted areas. This study intends to improve upon existing techniques for identifying individuals in confined regions by developing
2023 Innovations in Power and Advanced Computing Technologies (i-PACT)
an artificial intelligence (AI) system that is both accurate and easy to understand. Its ability to combine precise tracking with interpretability makes it suitable for real-world deployment, where reliable and transparent results are paramount. This paper presents the methodology for designing the hybrid Integrated Gradient with self-refined segment editing model. It also outlines the dataset compiled for evaluation, consisting of video frames captured in various restricted areas. Extensive testing and evaluation demonstrate the advantages of the proposed approach over conventional tracking techniques.

The rest of this paper is organized as follows. Section II reviews the existing work; Section III describes the dataset used for evaluation; Section IV discusses the implementation of the proposed work; Section V presents the experimental results and their implications; and Section VI gives the conclusion and future work.

II. LITERATURE REVIEW

The work in [6] proposed instance segmentation using deep learning with the Mask R-CNN model, which extends the Faster R-CNN architecture to predict pixel-level masks for each object instance. To extract features, Mask R-CNN uses a standard convolutional backbone network, while the Faster R-CNN core produces candidate object regions through a Region Proposal Network (RPN). To accurately extract features from the region proposals and avoid misalignments in the Region of Interest (RoI) pooling process, Mask R-CNN introduces RoIAlign, which interpolates feature values at exact locations in the feature map. In addition to the bounding box regression and classification heads of Faster R-CNN, Mask R-CNN includes an extra mask head that predicts the segmentation mask for each region proposal. Its ability to predict pixel-level masks has made it popular in various computer vision tasks that require precise object segmentation.

The researchers in [7] proposed DeepSORT (Deep Simple Online and Realtime Tracking), a deep learning-based multi-object tracking technique. It is an extension of the SORT (Simple Online and Realtime Tracking) algorithm, which combines the Kalman filter and the Hungarian algorithm for object tracking. DeepSORT improves upon SORT by incorporating a deep association metric to handle identity switches and occlusions, which makes it more robust in complex tracking scenarios and allows it to recover tracks more effectively when objects are temporarily obscured. DeepSORT is designed to operate in real time, tracking objects in video streams with low computational overhead.

The work in [8] presents a learning-based approach for online multi-object tracking, which can be relevant for tracking persons in restricted areas. The approach delivers strong results on several datasets used for online multi-object tracking, which supports the efficacy of its decision-making methodology in complicated tracking situations. The authors formulate the tracking process as a reinforcement learning problem. They design a reward function to guide decision-making and optimize the tracking strategy using deep reinforcement learning. The model employs temporal convolutional networks to capture temporal dependencies in the video sequence, allowing it to make informed decisions based on the tracking history. The entire tracking system is trained end-to-end using the reinforcement learning framework, enabling the model to adapt and improve its tracking performance over time.

The work in [9] proposes a dual matching attention network for multi-object tracking, which can be relevant for tracking persons in crowded areas. The approach uses dual matching attention networks to establish associations between objects across consecutive frames. The dual matching mechanism captures both appearance and motion cues for accurate object tracking. The networks also estimate the matching confidence between object pairs, which helps handle occlusions and improves tracking robustness. The proposed MOT algorithm is designed to operate in real time, making it suitable for online tracking applications.

To better understand the current trends, latest developments, and unexplored possibilities in deep learning-based video segmentation, the authors of [10] surveyed and summarized current research on deep learning techniques for video segmentation. The study surveys the popular deep learning architectures used in video segmentation, including Fully Convolutional Networks (FCNs), U-Net, Mask R-CNN, and other variants. It explores the applications of deep learning-based video segmentation in real-world settings such as surveillance, autonomous vehicles, and video editing. It also addresses the challenges of deep learning-based video segmentation and proposes potential future research directions.

Explainable AI techniques have gained attention as a way to address the black-box nature of deep learning models. These methods provide insights into decision-making, making AI systems more transparent and interpretable. The predictions of deep learning-based tracking models have been explained using methods such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) [11]. This survey paper covers various techniques and methodologies related to explainable AI, which can be integrated into a self-refined AI model to provide interpretable results.

A method called SELF-REFINE [12] is proposed for object detection, which leverages self-supervision and iterative refinement with self-feedback to improve the accuracy of object detection models. The approach uses self-supervised learning to exploit unlabelled data during training. It learns to generate pseudo-labels for the unlabelled data and uses them to refine the model iteratively. The model refines itself by learning from both labelled and pseudo-labelled data, and this iterative process helps it improve its performance over time. SELF-REFINE employs a self-feedback mechanism to handle incorrect pseudo-labels generated during the iterative process. The self-feedback
mechanism reduces the impact of incorrect labels and improves the robustness of the model.

In a survey of video object segmentation and tracking, the authors of [13] studied segment manipulation. It involves modifying object masks or segmentations to improve tracking accuracy and handle occlusions. Some works have used this technique with interactive video object segmentation, but its combination with explainable AI for person tracking in restricted areas remains to be explored.

The literature indicates that person tracking in restricted areas is challenging, and existing methods often lack accuracy and interpretability. The proposed hybrid IG-SEPM model offers a novel approach to address these challenges, combining the benefits of self-refinement, explainable AI, and segment editing for enhanced tracking performance and transparency in decision-making. Further research and experimentation are required to explore the full potential of this model and its effectiveness in real-world scenarios.

III. DATA SETS

This work uses the SA-1B [14] dataset for the experiment, as given in Table I. The SA-1B dataset includes a diverse set of video sequences. These sequences cover many challenges commonly encountered in real-world object-tracking scenarios, including occlusions, scale changes, fast camera motion, object deformations, background clutter, and illumination variations. The SA-1B dataset provides different video frames for training the model on the object tracking process.

TensorFlow or PyTorch: For deep learning model implementation.

OpenCV: For video processing and object tracking.

Scikit-learn: For data association and post-processing.

Data Preprocessing: Prepare the video sequences and ground truth annotations for training and evaluation. For the labelled data, ensure access to bounding boxes or segmentation masks for the target objects in each frame.

Initial Object Segmentation: Use an existing object segmentation model to obtain initial segmentation masks for the objects in the video frames. This could be achieved using popular architectures like Mask R-CNN.

Iterative Self-Refinement: Design an iterative process that feeds the initial segmentation masks into the model and obtains a refined segmentation. The model could incorporate feedback from itself to refine the learning mechanism.

Self-Feedback Mechanism: Implement a self-feedback mechanism to assess the quality of the refined segmentation in each iteration. The model can use metrics like Intersection over Union (IoU) with ground truth, or temporal consistency across frames, to determine whether further refinement is needed.
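The self-feedback check described in the steps above can be illustrated with a minimal sketch. It assumes binary masks stored as nested Python lists; the helper names `mask_iou` and `needs_refinement` and the threshold of 0.8 are hypothetical choices for illustration and are not taken from the paper's implementation.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over Union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # By convention, two empty masks agree perfectly.
    return inter / union if union else 1.0


def needs_refinement(current: Mask, reference: Mask, threshold: float = 0.8) -> bool:
    """Return True when the current mask still disagrees with the reference.

    `reference` may be a ground-truth mask (labelled data) or the mask of the
    previous frame (temporal consistency). `threshold` is an assumed value.
    """
    return mask_iou(current, reference) < threshold


# Toy example: the current frame's mask has one extra pixel.
prev_frame = [[0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
curr_frame = [[0, 1, 1, 1],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]

iou = mask_iou(curr_frame, prev_frame)
print(f"IoU = {iou:.2f}, refine again: {needs_refinement(curr_frame, prev_frame)}")
```

Using the previous frame as the reference gives a label-free consistency signal, while a ground-truth reference gives a direct accuracy signal; which one applies depends on whether the frame is annotated.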
and understandable to users. This is especially important in critical applications where trust and interpretability are essential for user acceptance and regulatory compliance.

Get point, box and text prompts from the user as input
Combine the prompted input into the decoded image
Get feedback from the mask decoder for the self-refine process
For iteration t = 0, …, T do
    Redefine the image and send it to the image encoder as feedback
End for
Evaluate Precision, Recall, F1 score and Intersection over Union (IoU) based on the confidence score
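The loop structure of the steps above can be sketched as follows. This is only an illustration under stated assumptions: `denoise_step` is a hypothetical toy stand-in for one pass through the image encoder and mask decoder (it fills holes and removes isolated pixels), the stopping rule "stop when the mask no longer changes" is assumed, and the metrics are computed pixel-wise against a ground-truth mask rather than from confidence scores.

```python
from typing import Callable, List, Tuple

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def denoise_step(mask: Mask) -> Mask:
    """Toy refinement pass: fill holes and drop isolated pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            # Count object pixels among the (up to 8) neighbours.
            n = sum(
                mask[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ) - mask[y][x]
            if mask[y][x] == 0 and n >= 6:
                out[y][x] = 1  # hole surrounded by object pixels
            elif mask[y][x] == 1 and n <= 1:
                out[y][x] = 0  # isolated speck
    return out


def self_refine(mask: Mask, step: Callable[[Mask], Mask],
                max_iters: int = 10) -> Tuple[Mask, int]:
    """Feed the mask back through `step` until it stops changing."""
    for t in range(max_iters):
        refined = step(mask)
        if refined == mask:
            return mask, t  # t passes changed the mask
        mask = refined
    return mask, max_iters


def precision_recall_f1(pred: Mask, truth: Mask) -> Tuple[float, float, float]:
    """Pixel-wise precision, recall and F1 of a predicted mask."""
    tp = fp = fn = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


truth = [[0, 0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 0]]
noisy = [[0, 0, 0, 0, 0, 1],   # isolated speck in the corner
         [0, 1, 1, 1, 0, 0],
         [0, 1, 0, 1, 0, 0],   # hole in the object
         [0, 1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 0]]

refined, passes = self_refine(noisy, denoise_step)
print("passes that changed the mask:", passes)
print("before:", precision_recall_f1(noisy, truth))
print("after: ", precision_recall_f1(refined, truth))
```

In a real system the `step` callable would wrap the segmentation model itself, and the fixed-point test could be replaced by an IoU threshold between successive masks.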
CNN-based Bi-LSTM    0.857    0.7178    0.132    0.55
Fig. 3. Masked image with SEPM
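The masked image is interpreted with Integrated Gradients (IG). In its standard formulation, IG attributes a prediction to each input feature by averaging the model's gradients along a straight path from a baseline to the input, with the path parameterised by α in [0, 1], and then scaling by the difference between input and baseline. The sketch below illustrates that standard formulation only; the tiny logistic model, its weights, the four "pixels" and the all-zero baseline are assumptions made for illustration and do not represent the network or settings used in this work.

```python
import math
from typing import List, Sequence


def predict(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Target-class probability of a toy logistic model (assumed model)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x: Sequence[float], weights: Sequence[float], bias: float) -> List[float]:
    """Analytic gradient of the probability with respect to each input."""
    p = predict(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x: Sequence[float], baseline: Sequence[float],
                         weights: Sequence[float], bias: float,
                         steps: int = 100) -> List[float]:
    """Approximate IG with a midpoint Riemann sum over alpha in [0, 1]."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights, bias)):
            avg_grad[i] += g / steps
    # Scale the average gradient by the input-baseline difference.
    return [(xi - b0) * g for xi, b0, g in zip(x, baseline, avg_grad)]


pixels = [0.9, 0.1, 0.8, 0.0]      # hypothetical masked-image pixels
baseline = [0.0, 0.0, 0.0, 0.0]    # all-black baseline (assumption)
weights = [2.0, -1.0, 1.5, 0.5]    # hypothetical model weights
bias = -1.0

attributions = integrated_gradients(pixels, baseline, weights, bias)
print("attributions:", [round(a, 4) for a in attributions])
print("sum of attributions:", round(sum(attributions), 4))
print("F(x) - F(baseline): ",
      round(predict(pixels, weights, bias) - predict(baseline, weights, bias), 4))
```

The last two printed values should agree closely: the attributions of IG sum to the difference between the prediction at the input and at the baseline, which is a useful sanity check for any IG implementation.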
Fig. 5. Model target class predicted probability and average pixel gradients over alpha

Fig. 6. Attribution visualization on the masked image, with the original image overlaid

VI. CONCLUSION AND FUTURE WORK

This study developed a novel hybrid intelligent model, namely the self-refined segment editing model with Integrated Gradient (IG-SEPM), for tracking objects and persons in a restricted area. The effectiveness of the proposed approach is evaluated against other deep learning-based object tracking techniques, such as Mask R-CNN, CNN-based Bi-LSTM, and DeepSORT, using multimodal and multi-camera data such as posture and gait data. A pilot study was carried out on an independent sample to evaluate the tracking efficiency of the segment editing model using relevant assessment measures such as mean Average Precision (mAP), Precision, Recall and F1. The comparative results show that the proposed model achieves an Average Precision of about 89.7%, higher than other existing work. The output from SEPM is passed in parallel to the IG model for interpreting the image, and the results show that the model provides a target class predicted probability of 0.3 for α = 0.15, as shown in Fig. 6. In future, this work will be extended by adding wearable sensors to track and assist the activity of patients and elderly people in their respective environments. This work will also be further developed to track human movement, with interpretation, in long-range videos.

REFERENCES

[1] Nuha H. Abdulghafoor and Hadeel N. Abdullah, “A novel real-time multiple objects detection and tracking framework for different challenges”, Alexandria Engineering Journal, Vol. 61, pp. 9637–9647, Dec 2022.
[2] Jun-Wei Hsieh and Yea-Shuan Huang, “Multiple-Person Tracking System For Content Analysis”, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 16, pp. 447–462, 2002.
[3] Lindsay Wells and Tomasz Bednarz, “Explainable AI and Reinforcement Learning—A Systematic Review of Current Approaches and Trends”, Front. Artif. Intell, Vol. 4, May 2021.
[4] Alper Akca and M. Önder Efe, “Multiple Model Kalman and Particle Filters and Applications: A Survey”, IFAC-PapersOnLine, Vol. 52, pp. 73–78, 2019.
[5] Sakorn Mekruksavanich and Anuchit Jitpattanakul, “Hybrid convolution neural network with channel attention mechanism for sensor-based human activity recognition”, Sci Rep, Vol. 13, July 2023.
[6] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN”, IEEE Trans Pattern Anal Mach Intell, pp. 386–397, Feb 2020.
[7] Nicolai Wojke, Alex Bewley, and Dietrich Paulus, “Simple Online and Realtime Tracking with a Deep Association Metric”, IEEE International Conference on Image Processing (ICIP), Feb 2017.
[8] Weiming Hu, et al., “SiamMask: A Framework for Fast Online Object Tracking and Segmentation”, IEEE Trans Pattern Anal Mach Intell, Vol. 45, pp. 3072–3089, 2023.
[9] Zhenzhen Wang, Shengyong Ding, and Huiwen Wang, “Online Multi-Object Tracking with Dual Matching Attention Networks”, Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] Tianfei Z., et al., “A Survey on Deep Learning Technique for Video Segmentation”, IEEE Trans Pattern Anal Mach Intell, Vol. 45, pp. 7099–7122, Nov 2022.
[11] Sameer Singh and Divyansh Kaushik, “Explainable AI: A Survey”, arXiv preprint, 2020.
[12] Xiaoshuai Zhang, Qixiang Ye, Yurong Chen, Mingyu You, and Jianfei Cai, “SELF-REFINE: Iterative Refinement with Self-Feedback”, Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[13] Rui Yao, Guosheng Lin, Shixiong Xia, Jiaqi Zhao, and Yong Zhou, “Video Object Segmentation and Tracking: A Survey”, ACM Transactions on Intelligent Systems and Technology, Vol. 11, pp. 1–47, May 2020.
[14] Alexander Kirillov, et al., “Segment Anything”, Meta AI Research, FAIR, arXiv:2304.02643v1, pp. 1–30, Apr 2023.
[15] Lucas Prado Osco, et al., “The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot”, Computer Vision and Pattern Recognition, arXiv:2306.16623, June 2023.
[16] Anupama Jha, et al., “Enhanced Integrated Gradients: improving interpretability of deep learning models using splicing codes as a case study”, Genome Biology, Vol. 21, June 2020.
[17] Olson, “Advanced Data Mining Techniques”, Springer, 1st ed., ISBN 3-540-76916-1, pp. 138, 2008.
[18] Beinan Wang, “A Parallel Implementation of Computing Mean Average Precision”, Computer Vision and Pattern Recognition, arXiv:2206.09504, June 2022.
[19] Andrei Kapishnikov, et al., “Guided Integrated Gradients: an Adaptive Path Method for Removing Noise”, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5050–5058, 2021.
[20] Michael Munn and David Pitman, “Explainable AI for Practitioners”, O'Reilly Media, Inc., ISBN: 9781098119133, Oct 2022.