2023 Innovations in Power and Advanced Computing Technologies (i-PACT)

A Novel Hybrid Integrated Gradient based Self-Refined Segment Editing parallel Model for Tracking Entities in Video Frames
Dr. Mohanraj G, School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
Dr. Nadesh R.K, School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
Dr. Marimuthu M, School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India
[email protected] [email protected] [email protected]

Abstract— This work aims to develop an efficient and interpretable AI system that can accurately detect and monitor entities within restricted spaces, ensuring enhanced security and safety. The proposed model leverages the strength of explainable AI and applies the self-refined segment editing technique, allowing the model to better discern and track persons in complex environments, which enhances model performance through iterative learning. Combining these two approaches allows the AI system to adapt to various scenarios and refine its tracking capabilities over time. The model is trained and evaluated on a custom dataset comprising video frames from multiple restricted areas with posture and gait datasets. The results demonstrate that the hybrid model outperforms traditional tracking methods, exhibiting higher accuracy and interpretability. Additionally, the model's explainable nature provides insights into its decision-making process, enabling better understanding of and trust in its output. The proposed hybrid Integrated Gradient with self-refined segment editing model (IG-SEPM) holds promising potential for applications in security, surveillance, and safety measures in restricted areas. Metrics such as mean Average Precision (mAP) or Intersection over Union (IoU) are used to measure the segment editing model's tracking accuracy on a standard sample. The comparative results show that the proposed model achieves an average precision of about 89.7%, which is better than other existing work. The output from SEPM shows that the model provides a target class predicted probability of 0.3 for α = 0.15. The valid masked image is then interpreted in parallel using IG, which obtains a better target class predicted probability with average pixel gradients. The results show that the proposed work performs better than the existing system.

Keywords—Explainable Artificial Intelligence, Object Tracking, Segment Editing Model, Self-Refinement, Integrated Gradient

I. INTRODUCTION

In recent years, the increasing need for enhanced security and surveillance in restricted areas has driven significant advancements in artificial intelligence (AI) technologies. Person tracking within such spaces ensures safety and prevents unauthorized access. Traditional person-tracking methods often struggle in complex and dynamic environments, leading to inaccuracies and limited interpretability. Person tracking in restricted areas is a critical task with significant implications for security and safety [1]. The ability to accurately and efficiently monitor individuals within confined spaces has become increasingly important in today's security landscape. Traditional person-tracking methods often struggle with occlusions, complex backgrounds, and varying lighting conditions, leading to inaccurate and unreliable results [2]. Person tracking in restricted areas has crucial real-world applications, including security surveillance in government and military facilities, public safety in airports and train stations, and critical infrastructure protection. Accurately tracking individuals while maintaining interpretability is valuable in these scenarios.

Explainable AI enables the model to learn iteratively from its predictions and improve its tracking capabilities over time [3]. By continuously refining its understanding of tracking dynamics, the system becomes adaptive to varying scenarios, offering increased accuracy in complex environments. The self-refined segment editing technique is introduced with explainable AI to further enhance the model's performance. Self-refined segment editing allows the system to analyse and modify object representations, enabling better discrimination of individuals in crowded or occluded scenes. This refinement process improves tracking results by handling challenging scenarios where traditional methods tend to falter.

Researchers have explored various techniques for person tracking in complex environments such as crowded public spaces, indoor settings, and outdoor areas with challenging lighting conditions. Traditional methods, such as Kalman filters and particle filters, have shown limitations in handling occlusions and maintaining track continuity in crowded scenes [4]. Person tracking in restricted areas is therefore both important and challenging, which underscores the need for accurate and interpretable AI models for enhanced security and safety. Researchers have investigated hybrid AI models that combine the strengths of different techniques to improve tracking performance. Hybrid models involving reinforcement learning, attention mechanisms, and feature fusion have shown promise in person tracking [5]. However, little research has been done on combining explainable AI with segment editing for this specific application.

The proposed hybrid model holds great promise for security, surveillance, and safety applications in restricted areas. This study intends to improve upon existing techniques of identifying individuals in confined regions by developing
an artificial intelligence (AI) system that is both accurate and easy to understand. Its ability to combine precise tracking with interpretability makes it suitable for real-world deployment, where reliable and transparent results are paramount. This paper presents the methodology for designing the hybrid Integrated Gradient with self-refined segment editing model. It also outlines the dataset compiled for evaluation, consisting of video frames captured in various restricted areas. The advantages of the suggested technology over conventional tracking techniques are demonstrated through extensive testing and evaluation.

The remaining sections are organized as follows: Section 2 reviews the existing literature; Section 3 describes the dataset used for evaluation; Section 4 discusses the implementation of the proposed work; Section 5 presents the experimental results and their implications; and Section 6 presents the conclusion and future work.

II. LITERATURE REVIEW

The work in [6] proposed instance segmentation using deep learning with the Mask R-CNN model, which extends the Faster R-CNN architecture to predict pixel-level masks for each object instance. To extract features, Mask R-CNN takes advantage of a standard convolutional network, while the Faster R-CNN backbone produces candidate object regions through a Region Proposal Network (RPN). To accurately extract features from the region proposals and avoid misalignments in the Region of Interest (RoI) pooling process, Mask R-CNN introduces RoIAlign, which interpolates feature values at exact locations in the feature map. In addition to the bounding box regression and classification heads of Faster R-CNN, Mask R-CNN includes an extra mask head that predicts the segmentation mask for each region proposal. Its ability to predict pixel-level masks has made it popular in various computer vision tasks that require precise object segmentation.

The researchers in [7] proposed DeepSORT (Deep Simple Online and Realtime Tracking), a deep learning-based multi-object tracking technique. It extends the SORT (Simple Online and Realtime Tracking) algorithm, which combines the Kalman filter and the Hungarian algorithm for object tracking. DeepSORT improves upon SORT by incorporating a deep association metric to handle identity switches and occlusions, which makes it more robust in complex tracking scenarios and allows it to recover tracks more effectively when objects are temporarily obscured. DeepSORT is designed to operate in real time, allowing it to track objects in video streams with low computational overhead.

The work in [8] presents a learning-based approach for online multi-object tracking, which can be relevant for tracking persons in restricted areas. The approach delivers strong results on several datasets used for online multi-object tracking, which is evidence that its decision-making methodology is effective in complicated tracking conditions. The authors formulate the tracking process as a reinforcement learning problem. They design a reward function to guide decision-making and optimize the tracking strategy using deep reinforcement learning. The model employs temporal convolutional networks to capture temporal dependencies in the video sequence, allowing it to make informed decisions based on the tracking history. The entire tracking system is trained end-to-end using the reinforcement learning framework, enabling the model to adapt and improve its tracking performance over time.

The work in [9] presents a dual matching attention network for multi-object tracking, which can be relevant for tracking persons in crowded areas. The approach utilizes dual matching attention networks to establish associations between objects across consecutive frames. The dual matching mechanism captures both appearance and motion cues for accurate object tracking. The dual matching attention networks also estimate the matching confidence between object pairs, which helps handle occlusions and improves tracking robustness. The proposed MOT algorithm [9] is designed to operate in real time, making it suitable for online tracking applications.

To better understand the current trends, latest developments, and unexplored possibilities in video segmentation using deep learning-based algorithms, the authors of [10] surveyed and summarized current research on deep learning techniques for video segmentation. The study surveys the popular deep learning architectures used in video segmentation, including Fully Convolutional Networks (FCNs), U-Net, Mask R-CNN, and other variants. It explores the various real-world applications of deep learning-based video segmentation, such as surveillance, autonomous vehicles, and video editing. It also addresses the challenges of deep learning-based video segmentation and proposes potential future research directions.

Explainable AI techniques have gained attention for addressing the black-box nature of deep learning models. These methods provide insights into decision-making, making AI systems more transparent and interpretable. The predictions of deep learning-based tracking models have been explained using methods such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) [11]. This survey paper covers various techniques and methodologies related to explainable AI, which can be integrated into a self-refined AI model to provide interpretable results.

A novel method called SELF-REFINE [12] is proposed for object detection, which leverages self-supervision and iterative refinement with self-feedback to improve the accuracy of object detection models. The approach uses self-supervised learning to leverage the unlabelled data during training. It learns to generate pseudo-labels for the unlabelled data and uses them to refine the model iteratively. The model iteratively refines itself by learning from labelled and pseudo-labelled data. The iterative process helps the model improve its performance over time. SELF-REFINE employs a self-feedback mechanism to handle incorrect pseudo-labels generated during the iterative process. The self-feedback
mechanism reduces the impact of incorrect labels and improves the robustness of the model.

Within the framework of video object segmentation and tracking, the authors of [13] studied segment editing. It involves modifying object masks or segmentations to improve tracking accuracy and handle occlusions. Some works have utilized this technique with interactive video object segmentation, but its combination with explainable AI for person tracking in restricted areas remains unexplored.

The literature indicates that person tracking in restricted areas is challenging, and existing methods often lack accuracy and interpretability. The proposed hybrid IG-SEPM model offers a novel approach to address these challenges, combining the benefits of self-refinement, explainable AI, and segment editing for enhanced tracking performance and transparency in decision-making. Further research and experimentation are required to explore the full potential of this model and its effectiveness in real-world scenarios.

III. DATA SETS

This work uses the SA-1B [14] dataset for the experiment, as given in Table I. The SA-1B dataset includes a diverse set of video sequences. These sequences cover many challenges commonly encountered in real-world object-tracking scenarios. Challenges may include occlusions, scale changes, fast camera motion, object deformations, background clutter, and illumination variations. The SA-1B dataset provides different video frames for training the model on the object tracking process.

TABLE I. DATASET USED FOR ANALYSIS

Data     Features of the Data
SA-1B    Total number of images: 11 million
         Total number of masks: 1.1 billion
         Average masks per image: 100
         Typical image size: 1500 by 2250 pixels
         Image themes: locations, objects, scenes
         Images or mask annotations with class labels

IV. IMPLEMENTATION OF PROPOSED MODEL

A. Self-Refined Segment Editing Model

Implementing a self-refined segment editing model [15] for tracking persons in restricted areas involves combining techniques from explainable AI and computer vision. A high-level overview of the implementation steps is shown in Fig 1.

The following is a general outline of the steps involved and the main libraries typically used in such implementations:

• Libraries Required:
  TensorFlow or PyTorch: For deep learning model implementation.
  OpenCV: For video processing and object tracking.
  Scikit-learn: For data association and post-processing.

• Data Preprocessing: Prepare the video sequences and ground truth annotations for training and evaluation. Ensure access to bounding boxes or segmentation masks for the target items in each frame of the labelled data.

• Initial Object Segmentation: Use an existing object segmentation model to obtain initial segmentation masks for the objects in the video frames. This could be achieved using popular architectures like Mask R-CNN.

• Iterative Self-Refinement: Design an iterative process that feeds the initial segmentation masks into the model and obtains refined segmentations. The model could incorporate feedback from itself to refine the learning mechanism.

• Self-Feedback Mechanism: Implement a self-feedback mechanism to assess the quality of the refined segmentation in each iteration. The model can use metrics like Intersection over Union (IoU) with ground truth or temporal consistency across frames to determine whether further refinement is needed.

Fig1. Architecture of proposed novel Self Refined Segment Editing Model

Algorithm 1: Self-refined segment editing model for tracking
Input: Data set with a single long video frame and a set of different types of video frames
Output: Masked video frames with Evaluation Metrics
Preprocess the dataset
Pre-train the model with
Initialize the segmentation using the predefined Mask R-CNN model as encoder
Let mask_generator = SAM_Mask_Generator(data)
Image_name = image_path.split()
Image = cv2.imread(image_path)
Masks = mask_generator.generate(Image)
Convert encoded images into embedded images as
Let Ed = {E1, E2, …, En}
Get point, box and text from the user as input
Combine the prompted input into the decoded image
Get feedback from the mask decoder for the self-refine process
For iteration t in 0…T do
    Redefine the image and send it to the image encoder as feedback
End for
Evaluate Precision, Recall, F1 score and Intersection over Union (IoU) based on the confidence score.

B. Integrating Integrated gradient with a self-refined segment editing model

Integrating Integrated gradient with a self-refined segment editing model [16] for object tracking can enhance the model's transparency and provide interpretable insights into its tracking decisions. Combining both approaches can achieve a more trustworthy and understandable object-tracking system, as shown in Fig 2.

• Explainable Segmentation: Incorporate Integrated gradient techniques into the segment editing model to provide human-interpretable explanations for the segmentation process. For instance, use feature visualization methods to highlight the critical regions of the input frame that influence the segmentation result. This allows users to understand why certain areas are segmented as part of an object or background.

• Rule-Based Segmentation: Consider using rule-based segmentation models that provide explicit rules for segmentation decisions. These rules can be generated using symbolic rule extraction or rule lists. The rules will offer understandable criteria for segment editing, making the segmentation more transparent.

• Local Explanations for Refinement: Utilize local explanations like Integrated gradient (IG) to interpret the self-refinement process. IG yields per-pixel attributions that explain the self-refinement behaviour for specific frames, allowing users to comprehend the reasoning behind particular refinement decisions.

• Attention Mechanisms for Occlusion Handling: If the segment editing model uses attention mechanisms, visualize the attention maps to understand which areas of the object are being focused on during occlusion handling. This may illuminate the model's handling of obscured items and lead to more discernible results.

• Interactive User Interface: Create an interactive user interface that displays the real-time segmentation masks, refinement steps, and explanations. This interface can allow users to interact with the model, explore the segmentation process, and gain insights into the tracking decisions.

Integrating Integrated gradient with a segment editing model for tracking improves the accuracy and robustness of the tracking process and makes the model more transparent and understandable to users. This is especially important in critical applications where trust and interpretability are essential for user acceptance and regulatory compliance.

Fig2. Architecture of proposed novel IG-SEPM model

Algorithm 2: Integrating Integrated gradient with a self-refined segment editing model for tracking
Input: Masked image from SEPM model
Output: Visualize image interpretability with Evaluation Metrics
model = keras.Sequential("Inception-V3")
model.build()
interp_img = inter_polate_image(baseline=black, img=img_name_tensors['Parrot'], alpha=0.15)
p_grad = calculate_gradients(img=interp_img, t_class_xid=55)
prd = model(interp_img)
prd_prob = softmax(prd, axis=-1)[:, 55]
attrib = IG(baseline=black, image=masked_img, t_class_xid=t_class_xid, k_steps=10)
attrib_mask = sumof(absolute(attrib), axis=-1)
Evaluate image interpretability using the image overlay with value 0.15

V. RESULT AND DISCUSSION

The performance of the base models, namely Mask R-CNN, CNN-based Bi-LSTM, and DeepSORT, is evaluated and compared. Each base model is combined with the proposed model for tracking objects in video frames with an image segmentation process, using SA-1B as the dataset for training and testing. Results for the four performance metrics (mAP, Precision, Recall, and F1-score) are tabulated in Table II. The results suggest that the novel hybrid model with Mask R-CNN as the base model has the highest mAP among the models. The Precision values for Mask R-CNN, CNN-based Bi-LSTM, and DeepSORT are 0.811, 0.717, and 0.705, respectively, as given in Table II. The Mask R-CNN-based hybrid model also gives the highest F1-score, indicating the superior performance of this model. Hence, the Mask R-CNN-based hybrid model is the most accurate model for tracking objects in video frames.

The precision value [17] may vary based on the model's confidence score, as shown in equations 1, 2 & 3.
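The evaluation measures defined in equations 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names and the counts used in the demo (8 true positives, 2 false positives, 8 false negatives) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Sequence


def precision(true_p: int, false_p: int) -> float:
    """Equation 1: True_P / (True_P + False_P)."""
    denom = true_p + false_p
    return true_p / denom if denom else 0.0


def recall(true_p: int, false_n: int) -> float:
    """Equation 2: True_P / (True_P + False_N)."""
    denom = true_p + false_n
    return true_p / denom if denom else 0.0


def f1_score(prec: float, rec: float) -> float:
    """Equation 3: harmonic mean of precision and recall."""
    denom = prec + rec
    return 2.0 * (rec * prec) / denom if denom else 0.0


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Equation 4: mean of the per-class AP values."""
    if not ap_per_class:
        raise ValueError("at least one class AP value is required")
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    # Hypothetical detection counts, for illustration only.
    p = precision(true_p=8, false_p=2)
    r = recall(true_p=8, false_n=8)
    print("precision:", p)
    print("recall:", r)
    print("F1:", f1_score(p, r))
    print("mAP:", mean_average_precision([0.5, 1.0]))
```

Because F1 is a harmonic mean, it always lies between the precision and recall values and can never exceed 1, which is a quick sanity check for any tabulated result.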
$$\text{Precision} = \frac{\text{True\_P}}{\text{True\_P} + \text{False\_P}} \tag{1}$$

$$\text{Recall} = \frac{\text{True\_P}}{\text{True\_P} + \text{False\_N}} \tag{2}$$

$$F1 = 2 \cdot \left( \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \right) \tag{3}$$

Here, True_P is true positive, False_P is false positive, and False_N is false negative.

To determine the mean AP [18], compute the AP for each class individually and then take the mean of these values across all categories, as given in equation 4.

$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{4}$$

Both false positives (False_P) and false negatives (False_N) are considered by the F1 score, which is based on a balance between precision and recall.

TABLE II. THE PERFORMANCE CRITERIA OF THE MASK R-CNN, CNN BASED BI-LSTM, AND DEEPSORT MODELS

Model               mAP     Precision   Recall   F1
Mask R-CNN          0.897   0.811       0.254    1.016
CNN based Bi-LSTM   0.857   0.717       0.138    0.552
DeepSORT            0.842   0.705       0.073    0.680

Consider the linear path connecting the baseline to the input image, using a simple baseline made up of a completely black image, and then look at the model's prediction for the target class along that path [19]. The formula for linear interpolation between two points x and y is shown in equation 5.

$$\alpha y + (1 - \alpha) x \tag{5}$$

Here α ranges from 0 to 1.

As more information is added to the baseline image, the model receives a clearer signal and becomes more confident about what is present in the picture. At α = 0, at the baseline, the model cannot produce an accurate prediction. As α moves along the straight line, more details in the image become visible, allowing the model to make accurate predictions.

By integrating the gradients along a straight-line path, Integrated Gradients avoids the problem of local gradients reaching saturation [20]. The objective is to accumulate each pixel's local gradients along a straight line from the baseline image to the input image. Each pixel's relevance value, based on its local gradients, adds to or subtracts from the model's overall output class probability, as shown in Fig 3, 4, & 5 respectively. For the model f, the relevance of the i-th pixel feature is defined as shown in equation 6.

$$IG_i(f, x, x') = (x_i - x'_i) \int_{\alpha=0}^{1} \frac{\partial f\left(x' + \alpha (x - x')\right)}{\partial x_i} \, d\alpha \tag{6}$$

Here i indexes the feature of an individual pixel, x is the image input tensor, x' is the baseline image input tensor, and α is the interpolation constant.

Fig3. Masked image with SEPM

Fig4. Interaction With Segmented Image

Unfortunately, the gradients can "saturate," meaning that the target class probability quickly approaches its maximum well before α reaches 1. By averaging the pixel gradient magnitudes, we can observe that the model learns the most while α is small, near α = 0.1. Beyond that, at α > 0.2, the gradients vanish, and no further information is acquired.
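The interpolation in equation 5 and the attribution integral in equation 6 can be approximated numerically. The sketch below is illustrative only and makes several assumptions: it uses NumPy instead of a deep learning framework, replaces the image classifier with a hypothetical toy model (a sigmoid over a weighted pixel sum, with an analytic gradient), and approximates the integral with the trapezoidal rule. All function names are invented for this example.

```python
import numpy as np


def interpolate_images(baseline, image, alphas):
    """Equation 5: points on the straight line from baseline to image."""
    alphas = alphas.reshape(-1, *([1] * baseline.ndim))
    return baseline[None] + alphas * (image - baseline)[None]


def toy_model(batch, weights):
    """Hypothetical stand-in classifier: sigmoid of a weighted pixel sum."""
    logits = np.sum(batch * weights, axis=tuple(range(1, batch.ndim)))
    return 1.0 / (1.0 + np.exp(-logits))


def toy_model_gradients(batch, weights):
    """Analytic gradient of toy_model with respect to every input pixel."""
    prob = toy_model(batch, weights)
    scale = (prob * (1.0 - prob)).reshape(-1, *([1] * weights.ndim))
    return scale * weights[None]


def integrated_gradients(baseline, image, weights, k_steps=200):
    """Equation 6, with the integral replaced by a trapezoidal sum."""
    alphas = np.linspace(0.0, 1.0, k_steps + 1)
    path = interpolate_images(baseline, image, alphas)
    grads = toy_model_gradients(path, weights)
    avg_grads = ((grads[:-1] + grads[1:]) / 2.0).mean(axis=0)
    return (image - baseline) * avg_grads


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((4, 4, 3))        # hypothetical tiny RGB "frame"
    weights = rng.normal(size=(4, 4, 3))
    black = np.zeros_like(image)         # black baseline, as in the text

    attributions = integrated_gradients(black, image, weights)
    # Collapse the colour channels into one relevance value per pixel.
    attribution_mask = np.abs(attributions).sum(axis=-1)

    gain = toy_model(image[None], weights)[0] - toy_model(black[None], weights)[0]
    print("sum of attributions:", attributions.sum())
    print("f(image) - f(baseline):", gain)
    print("mask shape:", attribution_mask.shape)
```

A useful check on any such implementation is the completeness property of Integrated Gradients: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline. The demo prints both quantities so they can be compared.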
Fig5. Model target class predicted probability and Average pixel gradients over alpha

Fig6. Attribution of visualization on mask image, and original image overlayed

VI. CONCLUSION AND FUTURE WORK

This study developed a novel hybrid intelligent model, namely the self-refined segment editing model with Integrated gradient (IG-SEPM), for tracking objects and persons in a restricted area. The effectiveness of the suggested approach is evaluated in comparison to other deep learning-based object tracking techniques such as Mask R-CNN, CNN-based Bi-LSTM, and DeepSORT, using multimodal and multi-camera data such as posture and gait data. A pilot study was carried out on an independent sample to evaluate the tracking efficiency of the segment editing model using relevant assessment measures such as mean Average Precision (mAP), Precision, Recall and F1. The comparative results show that the proposed model achieves an average precision of about 89.7%, which is better than other existing work. The output from SEPM is passed in parallel to the IG model for interpreting the image; the results show that the model provides a target class predicted probability of 0.3 for α = 0.15, as shown in Fig 6. In future, this work will be extended by adding wearable sensors to track and assist the activity of patients and elderly people in their respective environments. This work will be further developed to track human movement with interpretation in long-range videos.

REFERENCES
[1] Nuha H. Abdulghafoor, and Hadeel N. Abdullah, “A novel real-time multiple objects detection and tracking framework for different challenges”, Alexandria Engineering Journal, Vol. 61, pp. 9637 – 9647, Dec 2022.
[2] Jun-Wei Hsieh, and Yea-Shuan Huang, “Multiple-Person Tracking System For Content Analysis”, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 16, pp. 447 – 462, 2002.
[3] Lindsay Wells, and Tomasz Bednarz, “Explainable AI and Reinforcement Learning—A Systematic Review of Current Approaches and Trends”, Front. Artif. Intell, Vol. 4, May 2021.
[4] Alper Akca, and M. Önder Efe, “Multiple Model Kalman and Particle Filters and Applications: A Survey”, IFAC-PapersOnLine, Vol. 52, pp. 73 – 78, 2019.
[5] Sakorn Mekruksavanich, and Anuchit Jitpattanakul, “Hybrid convolution neural network with channel attention mechanism for sensor-based human activity recognition”, Sci Rep, Vol. 13, July 2023.
[6] He K, Gkioxari G, Dollar P, Girshick R, “Mask R-CNN”, IEEE Trans Pattern Anal Mach Intell, pp. 386 – 397, Feb 2020.
[7] Nicolai Wojke, Alex Bewley, Dietrich Paulus, “Simple Online and Realtime Tracking with a Deep Association Metric”, IEEE International Conference on Image Processing (ICIP), Feb 2017.
[8] Weiming Hu, et.al, “SiamMask: A Framework for Fast Online Object Tracking and Segmentation”, IEEE Trans Pattern Anal Mach Intell, Vol. 45, pp. 3072 – 3089, 2023.
[9] Zhenzhen Wang, Shengyong Ding, and Huiwen Wang, “Online Multi-Object Tracking with Dual Matching Attention Networks”, Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[10] Tianfei. Z, et.al, “A Survey on Deep Learning Technique for Video Segmentation”, IEEE Trans Pattern Anal Mach Intell, Vol. 45, pp. 7099 – 7122, Nov 2022.
[11] Sameer Singh and Divyansh Kaushik, “Explainable AI: A Survey”, arXiv preprint, 2020.
[12] Xiaoshuai Zhang, Qixiang Ye, Yurong Chen, Mingyu You, and Jianfei Cai, “SELF-REFINE: Iterative Refinement with Self-Feedback”, Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[13] Rui Yao, Guosheng Lin, Shixiong Xia, Jiaqi Zhao, and Yong Zhou, “Video Object Segmentation and Tracking: A Survey”, ACM Transactions on Intelligent Systems and Technology, Vol. 11, pp. 1 – 47, May 2020.
[14] Alexander Kirillov et.al., “Segment Anything”, Meta AI Research, FAIR, arXiv:2304.02643v1, pp. 1 – 30, Apr 2023.
[15] Lucas Prado Osco, et.al., “The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot”, Computer Vision and Pattern Recognition, arXiv:2306.16623, June 2023.
[16] Anupama Jha, et.al., “Enhanced Integrated Gradients: improving interpretability of deep learning models using splicing codes as a case study”, Genome Biology, Vol. 21, June 2020.
[17] Olson, “Advanced Data Mining Techniques”, Springer, 1st ed., ISBN 3-540-76916-1, pp. 138, 2008.
[18] Beinan Wang, “A Parallel Implementation of Computing Mean Average Precision”, Computer Vision and Pattern Recognition, arXiv:2206.09504, June 2022.
[19] Andrei Kapishnikov, et.al., “Guided Integrated Gradients: an Adaptive Path Method for Removing Noise”, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5050 – 5058, 2021.
[20] Michael Munn, and David Pitman, “Explainable AI for Practitioners”, O'Reilly Media, Inc., ISBN: 9781098119133, Oct 2022.