Abstract
This document presents a comprehensive study of pedestrian tracking using an integrated ap-
proach of DeepSORT and YOLOv8 (You-Only-Look-Once). Focused on enhancing Multiple Ob-
ject Tracking (MOT) accuracy and robustness, this research evaluates the proposed model against
state-of-the-art (SOTA) algorithms within the MOT16 benchmark dataset. Through a meticulous
investigation, the study reveals the nuanced interplay between model complexity, computational
efficiency, and performance accuracy.
It demonstrates that while the YOLOv8 medium model, optimized with a 0.5 detection thresh-
old and a 0.4 maximum cosine distance, shows promise, it falls short of SOTA methods in key
areas, particularly in detecting pedestrians under varied conditions and maintaining their identities
over time. The document identifies the critical limitations stemming from the reliance on pre-
trained models and the lack of ground truth annotations for the testing images, which significantly
impacts the model’s learning and predictive capabilities.
This work not only contributes to the academic discourse by outlining potential improvements
in data acquisition, model fine-tuning, and feature extraction techniques but also sets the stage for
future research. It underscores the importance of transfer learning, advanced post-processing, and
the development of comprehensive datasets to surmount the present challenges. Through these
insights, the document lays a foundational path toward achieving higher levels of detection and
association accuracy in pedestrian tracking technologies.
Resumo
Acknowledgements
I would like to express my heartfelt gratitude to my supervisor Professor João Manuel R.S. Tavares
for his invaluable guidance, reviews and suggestions, and continuous encouragement throughout
the journey of this work.
I would also like to extend my heartfelt appreciation to my dear friends and peers, whose
unwavering friendship and contributions went beyond academics, providing motivation and re-
silience when needed most. I am deeply grateful for their role in both my personal and academic
growth.
Last but not least, I am deeply indebted to my loving family for their unconditional love,
encouragement, and belief in my abilities. Their constant encouragement and sacrifices have
been my pillars of strength, motivating me to persevere through the challenges of this academic
endeavor.
Thank you all for being an indispensable part of this journey, and for making this academic
pursuit a reality. Your support and encouragement have made all the difference.
“Machine intelligence is the last invention that humanity will ever need to make.”
Nick Bostrom
Contents
1 Introduction
1.1 Context
1.2 Motivation
1.3 Objectives
1.4 Achieved Contributions
1.5 Document Structure
2 Background
2.1 Overview of MOT
2.2 Workflow of a MOT algorithm
2.3 Main Challenges of MOT
2.4 Object Detection
2.5 Object Tracking
2.5.1 Kalman Filter
2.5.2 Mean Shift
2.5.3 Optical Flow
2.5.4 DeepSORT
2.5.5 FairMOT
2.6 Existing Metrics
2.7 Summary
3 State-of-the-art
3.1 Articles Selection Methodology
3.2 Review Results
3.3 Summary
4 Methodology
4.1 Overview
4.2 Used Image Dataset
4.2.1 MOT16 Dataset Overview
4.2.2 Dataset Composition and Annotation
4.2.3 Relevance to Object Tracking
4.2.4 Dataset Source and Availability
4.2.5 Limitations
4.3 Object Detection
4.3.1 You Only Look Once
4.3.2 Architecture
4.3.3 Implementation
A Appendix
References
List of Figures
3.1 PRISMA diagram displaying the performed literature search in the Scopus database.
5.1 Comparison of performance of the algorithm using models YOLOv8n and YOLOv8m, respectively.
5.2 Comparison of performance of the algorithm using a detection threshold of 0.5 and 0.3, respectively.
5.3 Comparison of performance of the algorithm using a maximum cosine distance of 0.2 and 0.8, respectively.
5.4 Comparison of performance of the top state-of-the-art algorithms.
A.1 Frame from the MOT16-02 image sequence, adapted from [9].
A.2 Frame from the MOT16-04 image sequence, adapted from [9].
A.3 Frame from the MOT16-05 image sequence, adapted from [9].
A.4 Frame from the MOT16-09 image sequence, adapted from [9].
A.5 Frame from the MOT16-10 image sequence, adapted from [9].
A.6 Frame from the MOT16-11 image sequence, adapted from [9].
A.7 Frame from the MOT16-13 image sequence, adapted from [9].
Abbreviations
ACF Aggregated Channel Features
ACM Adaptive Calculation Method
AssA Association Accuracy
AssPr Association Precision
AssRe Association Recall
CCF CNN-Based Correlation Filter
CNN Convolutional Neural Network
COCO Microsoft Common Objects in Context
DCNN Deep Convolutional Neural Network
DetA Detection Accuracy
DetPr Detection Precision
DetRe Detection Recall
DL Deep Learning
FM Fragmentation
FPN Feature Pyramid Network
FPS Frames per Second
HMF Hybrid Motion Feature
HOTA Higher Order Tracking Accuracy
IDF1 Identity F1 Score
IMM Interactive Multiple Model
IoU Intersection over Union
KF Kalman Filter
KFR Kalman Filter Rectification
LocA Localization Accuracy
mAP Mean Average Precision
MF-SORT Simple Online Real-time with Motion Features
ML Mostly Lost
MOT Multiple Object Tracking
MOTA Multiple Object Tracking Accuracy
MOTP Multiple Object Tracking Precision
MOTS Multiple Object Tracking and Segmentation
MT Mostly Tracked
OSPA Optimal Sub-Pattern Assignment
PT Partly Tracked
R-CNN Region-Based Convolutional Neural Network
RL Recover from Long-term occlusion
ROI Region of Interest
RPN Region Proposal Network
RS Recover from Short-term occlusion
SCT Super Chained Tracker
SF Simple Filtering
SORT Simple Online Realtime Tracking
SOT Single Object Tracking
SOTA State-of-the-Art
SSD Single Shot Detector
SSDT Smart Social Distancing Tracker
SVM Support Vector Machines
TDE Tracking Distance Error
VS Visual Studio
YOLO You Only Look Once
Chapter 1
Introduction
This chapter presents a brief summary of the Dissertation. It starts by contextualizing the im-
portance of Multiple Object Tracking (MOT) in various fields like computer vision, robotics, and
surveillance. The research objectives are then presented, with the goal of improving the accuracy,
robustness, and real-time performance of MOT systems. The potential contributions of this work
are then proposed. Finally, the Dissertation structure is presented, including an overview of the
main chapters and their respective contributions.
This chapter serves as a foundational framework, laying the groundwork for the in-depth
examination of multiple object tracking in the following chapters.
1.1 Context
In an age of rapidly advancing technology, reliably tracking multiple objects has become critical,
with applications ranging from autonomous vehicles to video surveillance to augmented reality.
Due to the collective effort of various researchers and recent breakthroughs, in the foreseeable
future, it may be possible to consider a world in which machines have the perceptual abilities
to track and monitor multiple objects at the same time, allowing for efficient decision-making,
enhanced situational awareness, and seamless human-machine interactions.
The process of consistently estimating the trajectories and identities of multiple objects in a
video sequence is referred to as MOT. Despite its apparent simplicity, MOT remains a difficult
problem due to a variety of factors that impede accurate and robust tracking. Occlusions, in which
objects are temporarily hidden from view, appearance changes caused by viewpoint variations or
object deformations, object interactions resulting in complex motion patterns, and the presence of
cluttered backgrounds are among these factors.
Since object tracking first became popular about 20 years ago, several techniques and concepts
have been developed to raise the precision and effectiveness of the tracking models. Some of those
methods involve machine learning approaches, which are effective at predicting the target object
but call for the extraction of discriminatory features by professionals. In contrast, deep learning
techniques can extract these features independently and can cover more complex scenarios such
as dense traffic or crowded scenes.
1.2 Motivation
Due to the vast array of potential applications, object tracking has drawn a lot of interest in recent
years. Coupled with motion analysis and behavior understanding, it has shown great capability in
developing speech recognition and natural language understanding, gesture recognition, body and
face pose estimation, and facial expression analysis and recognition. These results translate into
an effort to allow machines to communicate more cleverly with their users and environment. In
light of a growing demand for security, object tracking has been applied to surveillance systems to
not only detect motion, but to classify it as human or nonhuman, perform face recognition, track
access control across multiple cameras, or detect suspicious behavior [10].
Following the success of employing deep convolutional neural networks (DCNN) for image
classification, object identification has made significant progress using deep learning approaches.
Traditional detection techniques were vastly outperformed by the new deep learning-based algo-
rithms [3].
1.3 Objectives
The goal of this project is to improve the accuracy and robustness of multiple object tracking in
complex environments, which can be further divided into three steps:
1. Thoroughly examine the current state-of-the-art algorithms in this field, understanding their
strengths, limitations, and potential for further improvement.
2. Implement and evaluate a promising novel method to enhance the accuracy of object detec-
tions in MOT. Applying deep learning algorithms to existing tracking methods is expected
to improve the detection and, consequently, the tracking of objects compared to more tradi-
tional detection methods.
3. Assess and compare the results obtained from the implemented method with existing bench-
marks. Through this comparative analysis, the aim is to gauge the effectiveness and perfor-
mance of the method against established tracking approaches, providing insights into its
potential contributions to the field of MOT.
The MOT16 dataset [9], which is to be used during this work, is part of the MOTChallenge,
a popular crowdsourced dataset that provides an easy way to compare the performance of state-
of-the-art tracking methods by presenting results from both public and private detections. These
results also include a comprehensive and complete set of evaluation metrics that can serve as
performance indicators.
1.4 Achieved Contributions
• Integration of a tracking algorithm based on DeepSORT and a YOLOv8 model and compar-
ison to current state-of-the-art algorithms.
Chapter 2
Background
This chapter provides a comprehensive overview of multiple object tracking (MOT) by introduc-
ing fundamental concepts, methodologies, and challenges. It explains MOT as a computer vision
research area involving the tracking of multiple objects over time using a combination of object de-
tection and tracking-by-detection techniques. The workflow of MOT algorithms is outlined while
also discussing the significance of object detection in MOT, comparing two-stage and one-stage
detection methods and highlighting the advancements in deep learning-based pedestrian detection.
Additionally, popular object tracking algorithms such as Kalman Filter, Mean Shift, Optical Flow,
DeepSORT, and FairMOT are introduced, emphasizing their features, advantages, and limitations
in object tracking applications.
2.1 Overview of MOT
MOT is a research area in the field of computer vision that involves the development of algorithms
and systems for tracking multiple objects within a scene over time. It typically operates by us-
ing a combination of object detection and tracking techniques. The detections are composed of
bounding boxes with the coordinates of objects the algorithm detected.
Consequently, MOT algorithms usually employ a tracking-by-detection method. This method
involves an independent detector that is applied to all video frames to obtain likely detections,
followed by a tracker which is run on the set of detections. The purpose of this method is to guide
the tracking process, which will assign the ID to the same bounding boxes that contain the same
target.
Object tracking algorithms might be classified differently depending on the type of information
they utilize for matching; specifically, color, contour, feature, and template-based object tracking
approaches are available. These algorithms can be further divided into batch (or offline) and
online tracking methods. While batch tracking methods can make use of future frames to identify
the target objects, online tracking methods use only past and current frames. This results in batch
tracking providing superior results, but at the cost of a higher computational load, which is not
ideal for real-time environments [1].
2.2 Workflow of a MOT algorithm
When designing a MOT algorithm, three issues need to be taken into account [2]: how to extract
a set of accurate detection data from video frames; how to measure the similarity of objects; and,
based on the results of the data association method, how to judge whether the objects between
frames are the same.
Currently, most MOT frameworks can be reduced to an assignment problem, and research
efforts have focused on improving the association stage. This project intends to focus on the
detection stage as it is accepted that the performance of the detector can have a significant impact
on the tracking results.
1. Detection stage: In the detection stage, an object detection algorithm is applied to each
frame of the video to identify and localize objects of interest, often referred to as "detec-
tions." The object detector generates bounding boxes around the detected objects, indicating
their positions in the frames. These detections serve as the initial input for the tracking pro-
cess.
2. Feature extraction/motion prediction stage: For each detection, appearance features are
extracted and/or a motion model is used to predict the object's likely position in subsequent
frames, providing the cues needed to compare detections across frames.
3. Affinity stage: The extracted appearance and motion features, along with any motion pre-
dictions, are used to compute similarity or distance scores between pairs of detections. The
affinity stage aims to measure the likelihood that two detections belong to the same real-
world object based on their feature similarities or distances.
4. Association stage: The similarity or distance scores obtained in the affinity stage are used
in the data association process. The goal of the association stage is to link detections across
frames to form coherent trajectories, also known as "tracks." This step involves determin-
ing which detections in the current frame should be associated with existing tracks or used
to initiate new tracks. Various data association methods, such as the Hungarian algorithm,
Kalman filters (KF), or graph-based optimization, are commonly used to solve this associa-
tion problem and establish the best possible associations. In this stage the MOT algorithm
also assigns a unique numerical ID to each track, maintaining consistency across frames.
This ID serves as a persistent identifier for each tracked object, allowing the algorithm to
track the same objects throughout the video sequence.
Figure 2.1: Stages involved in a MOT algorithm: it starts by analyzing the input frames at which
point the detection algorithm obtains the bounding box of each target. Then, the feature extraction
algorithm will extract the object’s features based on their appearance and/or motion. Finally, it will
compute each object’s similarity between frames and assign them a unique ID. Adapted from [1].
The combination of these stages forms a comprehensive workflow for MOT, allowing the al-
gorithm to detect and track multiple objects over time in video data. The effectiveness of such
algorithms depends on the quality of object detection, the choice of features and motion predic-
tors, the accuracy of the affinity computation, and the robustness of the data association method.
Different MOT algorithms may use various techniques and approaches in each stage to achieve
accurate and robust tracking performance.
2.3 Main Challenges of MOT
MOT algorithms present various difficulties [2] which have been heavily researched throughout
the years and even though they may share some similarities with single object tracking (SOT)
techniques, some problems are inherently unique to this type of task (illustrated in Fig. 2.2). There
have been successful attempts to apply SOT methods to MOT problems [11]. The most common
challenges are:
Figure 2.2: Difficulties of MOT: (a) Example of variability in the number of objects, (b) Example
of complex background, (c) Example of occlusion of objects, (d) Example of similarity of appear-
ance, (e) Example of appearance deformation of the objects. Adapted from [2].
• Long-term full occlusion: When an object is completely concealed from the camera for
a significant duration, it becomes challenging for the MOT algorithm to maintain its track
identity. Dealing with long-term occlusions requires sophisticated methods for handling
track re-identification when the object reappears [12].
• Variability in the number of objects: Objects entering and leaving the camera’s field
of view can cause the number of objects to vary over time. The MOT algorithm must cope
with dynamically changing object counts and handle object appearance/disappearance
smoothly [13].
• Similar appearance between objects: In scenarios where multiple objects have similar
visual features, such as objects from the same class, distinguishing between them becomes
challenging. Developing effective appearance modeling techniques and data association
methods is essential to handle this challenge.
• Complex video backgrounds: Adverse factors in the video environment, such as weather
conditions, lighting changes, shadows, and cluttered backgrounds, can hinder object detec-
tion and tracking. Robust object detection and feature extraction methods are required to
deal with complex backgrounds.
• Real-time scenarios: Many applications demand MOT algorithms that can process video
data in real time. Balancing computational complexity and tracking accuracy is
critical to ensure the algorithm’s efficiency and effectiveness in real-time environments.
Addressing these challenges often involves combining various techniques and strategies, such
as motion prediction, appearance modeling, data association optimization, and robust feature ex-
traction. MOT research continually strives to overcome these obstacles and improve the accuracy
and performance of tracking algorithms in diverse and challenging real-world scenarios.
2.4 Object Detection
Traditional object detection methods rely on hand-crafted features and explainable models. Deep
learning methods can learn directly from data, requiring less manual effort, and are known to be
more accurate at object detection tasks than traditional methods. However, they require large
amounts of labeled data to learn from and take longer to train, as they are computationally
intensive.
Object detection can be approached in two ways [2]: two-stage object detectors and one-stage
object detectors. One-shot detectors predict object bounding boxes and class probabilities directly
from the whole image or image grid in a single pass through the network, as illustrated in Fig. 2.3.
They use pre-defined anchor boxes of various shapes and sizes to predict object locations and
classes. The anchor boxes serve as references for generating object bounding boxes. These de-
tectors are faster than two-stage object detectors in general, but they may not be as accurate. You
Only Look Once (YOLO) and Single Shot Multibox Detector (SSD) are two examples of one-stage
object detectors. These detectors benefit from a faster inference speed compared to two-stage de-
tectors, making them suitable for real-time applications and scenarios with limited computational
resources, and possessing a simpler architecture and straightforward implementation. Contrarily,
they may struggle with precise localization, especially for small or closely packed objects, and
may be less accurate in handling complex scenes with heavy occlusions.
Conversely, two-stage object detectors have two stages: a proposal stage and a detection stage,
as illustrated in Fig. 2.4. During the proposal stage, the detector generates a set of bounding boxes
that may contain objects. In the first stage, they often use region proposal techniques like Selective
Search or EdgeBoxes to generate potential object regions. In the detection stage, the detector uses
these bounding boxes to identify the objects in the image. Two-stage object detectors are generally
more accurate than one-stage detectors but are slower because they must pass through the network
twice. Region-Based Convolutional Neural Networks (R-CNN) and Fast R-CNN are examples of
two-stage object detectors. These are generally more accurate, especially in complex scenes with
a large number of objects or significant occlusions, and have better handling of object instances
with varying scales and aspect ratios due to the two-stage architecture.
Pedestrian detection is a crucial task in intelligent video surveillance systems, but it has unique
challenges compared to generic object detection. Pedestrian objects have fixed aspect ratios but
vary greatly in scale, and they are often affected by crowding, occlusion, and blurring. Early
pedestrian detection algorithms used hand-crafted features and region classifiers like Support Vec-
tor Machines (SVM). However, deep learning methods have shown significant advancements in
pedestrian detection, achieving state-of-the-art results on public benchmarks.
Some notable deep learning approaches include a real-time framework using deep convolu-
tional networks and decision tree-based frameworks. Other methods use scale-aware architectures
to handle pedestrians at different scales. There are also techniques to address the impact of oc-
clusions, such as part-based models and occlusion-aware region of interest (ROI) pooling layers,
which integrate prior structure information and visibility prediction for better feature representa-
tions [3].
2.5 Object Tracking
2.5.1 Kalman Filter
The Kalman Filter is a popular and widely used mathematical technique for state estimation and
tracking in various fields, including object detection and tracking. It is an optimal recursive es-
timator that processes noisy measurements over time to estimate the state of a dynamic system.
In the context of object detection, the KF is employed to predict the position and velocity of an
object, track its movement, and handle uncertainties and noise in the measurements. However,
it does have some limitations, especially in handling complex and non-linear motion patterns or
dealing with significant occlusions and appearance changes.
In such cases, more advanced techniques, such as particle filters or deep learning-based track-
ers, may be employed to enhance tracking performance. Nevertheless, the KF remains a fun-
damental and useful tool for object detection and tracking, especially in scenarios where linear
motion models and moderate uncertainty are present.
2.5.4 DeepSORT
DeepSORT is an advanced object tracking algorithm that combines deep learning with the classical
SORT (Simple Online and Realtime Tracking) algorithm. It is specifically designed for real-time
multi-object tracking in challenging video streams with crowded scenes and occlusions. Deep-
SORT utilizes a deep learning-based object detection model, such as YOLO or Faster R-CNN,
to detect and localize objects in each frame. Feature embeddings are then extracted using a deep
neural network to represent the unique appearance characteristics of each object.
The algorithm employs the Hungarian algorithm to associate detected objects with
existing tracks, ensuring correct identity assignment even in complex scenarios. Additionally, a
KF predicts the next positions of tracked objects based on their historical motion patterns, main-
taining tracks during occlusions or when objects temporarily leave the camera’s view. DeepSORT
dynamically manages track initialization, termination, and deletion, creating new tracks for newly
detected objects and removing tracks for objects that have disappeared or exited the scene. This
integration of deep learning for detection and feature embeddings with the classical SORT algo-
rithm for data association and tracking management makes DeepSORT a powerful and reliable
solution for multi-object tracking in various applications, including surveillance and autonomous
vehicles.
2.5.5 FairMOT
FairMOT is an advanced object tracking algorithm designed to achieve high accuracy and real-
time performance in multi-object tracking scenarios. It emphasizes fairness and transparency in
computer vision research, aiming to address issues related to evaluation fairness and dataset bias.
The key features of FairMOT include fairness-aware training, which mitigates dataset bias
using the MixUp data augmentation technique. This improves the model’s generalization and
reduces the impact of biased training data. Additionally, FairMOT introduces online instance
alignment, enhancing tracking consistency and accuracy by aligning tracked objects to their corre-
sponding instances across frames. The algorithm employs a tracking head, a convolutional neural
network (CNN) head responsible for predicting object motion and estimating object associations
and trajectories.
FairMOT’s real-time performance, achieving high tracking accuracy in time-critical scenarios,
has garnered impressive results in various tracking benchmarks and competitions. Its focus on
fairness and transparency, coupled with its real-time capabilities, makes FairMOT a promising
solution for multi-object tracking applications that demand both accuracy and efficiency.
2.6 Existing Metrics
1. Accuracy: Multiple Object Tracking Accuracy (MOTA) is a metric used to evaluate the
performance of MOT algorithms. MOTA is calculated as one minus the sum of identity
switches, missed detections, and false positives, normalized by the total number of ground
truth objects (its standard formulation is given after this list). A higher MOTA score indicates
better performance of the tracker.
2. Precision: There are three metrics. Multiple Object Tracking Precision (MOTP), Tracking
Distance Error (TDE), and Optimal Sub-Pattern Assignment (OSPA). They describe how
precisely the objects are tracked by comparing the overlap between bounding boxes and/or
the distance between them.
3. Completeness: These metrics indicate how completely the ground truth trajectories are
tracked. They include Mostly Tracked (MT), Partly Tracked (PT), Mostly Lost (ML), and
Fragmentation (FM).
4. Robustness: Recover from Short-term occlusion (RS) and Recover from Long-term oc-
clusion (RL) can evaluate the algorithm's ability to recover from occlusion.
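For reference, the MOTA metric from item 1 above is commonly given as

\[ \mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t} \]

where, for each frame t, FN_t denotes the missed detections (false negatives), FP_t the false positives, IDSW_t the identity switches, and GT_t the number of ground truth objects.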
2.7 Summary
MOT focuses on tracking multiple objects in a scene over time using object detection and track-
ing techniques. The workflow of MOT algorithms includes stages like object detection, feature
extraction/motion prediction, affinity computation, and data association, all crucial for accurate
multi-object tracking in video data. However, MOT algorithms encounter challenges such as
handling long-term occlusion, varying object counts, similar appearances, complex backgrounds,
appearance deformation, and the need for real-time processing.
Object detection is a fundamental aspect of MOT, and deep learning-based methods have sig-
nificantly advanced pedestrian detection, achieving state-of-the-art results on public benchmarks.
Various object tracking algorithms, including Kalman Filter, Mean Shift, and Optical Flow, are
discussed, and advanced approaches like DeepSORT and FairMOT combine deep learning with
traditional methods for real-time and accurate multi-object tracking. The evaluation of MOT al-
gorithms relies on metrics like MOTA, MOTP, MT, PT, ML, and FM, which assess the accuracy,
completeness, and robustness of the tracking results.
Chapter 3
State-of-the-art
This chapter presents a comprehensive and systematic review of the literature related to multiple
object tracking (MOT) in computer vision. By synthesizing the existing literature, this systematic
review aims to provide a clear and structured understanding of the current state of MOT research,
shedding light on its potential applications and future directions.
3.1 Articles Selection Methodology
A systematic literature search was performed in the Scopus database, Figure 3.1, with the follow-
ing query: "TITLE-ABS-KEY (( mot OR "Multiple Object Tracking" ) AND ( "Kalman Filter*"
OR kf ) AND ( pedestrian* OR people OR human* )) AND ( LIMIT-TO ( LANGUAGE, "English"
))".
The search yielded a total of 17 results; however, after a brief analysis of their abstracts, 5 of
them were disregarded for the following reasons: not being relevant to the topic or focusing on
specific applications.
Figure 3.1: PRISMA diagram displaying the performed literature search in the Scopus database.
3.2 Review Results
In this section, each article is briefly presented, showcasing its proposed methods and results.
According to [21], a novel method named SearchTrack for Multiple Object Tracking and Seg-
mentation (MOTS) is presented, addressing the challenge of associating detected objects across
frames. This method employs an object-customized search network alongside motion-aware fea-
tures, integrating both object appearance and motion cues. A Kalman filter is maintained for each
object to encode predicted motion into these features, enhancing tracking accuracy. Experiments
conducted on the KITTI MOTS and MOT17 datasets demonstrate that SearchTrack outperforms
competitive methods, particularly in association accuracy. For instance, it achieves 71.5 Higher
Order Tracking Accuracy (HOTA) for cars, 57.6 HOTA for pedestrians on KITTI MOTS and 53.4
HOTA on MOT17, showcasing its effectiveness in improving association accuracy among 2D on-
line methods.
The authors of [22] introduced the Super Chained Tracker (SCT) model. This model integrates object
detection, feature manipulation, and representation learning into a single end-to-end solution for
online MOT. It uses adjacent frames as input, converting each frame into bounding box pairs and
chaining them using intersection over union (IoU), Kalman filtering, and bipartite matching. The
SCT model includes a joint attention module to enhance efficiency by focusing on informative
regions through expected confidence maps. This method allows for efficient and effective track-
ing, with significant improvements in Multiple Object Tracking Accuracy (MOTA) and Identity F1
(IDF1) scores on the MOT16 dataset compared to existing methods. It achieves a MOTA of 68.4%
and an IDF1 of 64.3%, demonstrating
qualitative and quantitative superiority over other techniques.
The authors of [23] proposed a Smart Social Distancing Tracker (SSDT) model for monitoring social dis-
tancing compliance using a deep learning-based approach. It integrates You-Only-Look-Once
(YOLOv4) for object detection with Simple Online and Real-time with motion features (MF-
SORT) Kalman Filter and brute force feature matching techniques to accurately identify and track
individuals in video footage, ensuring they maintain recommended social distances. The model
demonstrates superior performance in challenging conditions, including occlusions and lighting
variations, achieving a mean Average Precision (mAP) of 97% and a real-time processing speed
of 24 frames per second (FPS). The research focuses on the application of these technologies in
managing public health risks by enforcing social distancing norms effectively, offering a valuable
tool for public space management during the COVID-19 pandemic.
Kumar et al.[17] presented a system designed to address MOT within a specified zone, fo-
cusing on detecting, identifying, and tracking various objects. It employs YOLOv4 for object
detection and recognition, Kalman filters for motion prediction and feature generation, and the
DeepSORT algorithm for tracking. This combination enhances tracking accuracy, allowing for
real-time surveillance and the management of crowd gatherings. The methodology demonstrates
improved tracking performance, especially in applications such as traffic management and moni-
toring crowd density during the COVID-19 pandemic.
Similarly, Jindal et al.[24] outlined a methodology for real-time object tracking within a spec-
ified zone, focusing on the detection, identification, and tracking of objects using YOLOv4 for de-
tection and DeepSORT algorithm for tracking. The approach leverages Kalman filters for motion
prediction and feature extraction, aiming to enhance tracking accuracy. The system is evaluated
using the MOT18 dataset, achieving notable performance in terms of MOTA and FPS, indicating
its potential for real-time surveillance and crowd management applications.
Wang et al. [25] detailed a MOT framework that introduces a Hybrid Motion Feature (HMF)
integrating Intersection over Union (IoU), Euclidean distance, and area ratio. This approach,
alongside an Adaptive Calculation Method (ACM), aims to improve tracking accuracy by dy-
namically adjusting the weighting of motion and appearance features in the cost matrix. The
method also includes Kalman Filter Rectification (KFR) to handle irregular object motions. On
the MOT17 benchmark, it achieved a MOTA of 80.7%, an IDF1 score of 78.5%, and a HOTA of
64.0%, outperforming ByteTrack, the state-of-the-art model at the time.
Ren et al.[26] proposed a new tracking method utilizing the social force model to address
challenges in multi-pedestrian tracking, such as occlusions and lighting variations. This approach
differentiates between candidates and real pedestrians, using historical data and social forces to
predict movements and assign identities using the Hungarian algorithm. Tested on the MOT chal-
lenges dataset, it outperforms traditional algorithms in terms of tracking accuracy and processing
speed, indicating its effectiveness in complex environments.
Gai et al.[27] developed a tracking algorithm that combines YOLOv5 for high-performance
object detection with the DeepSORT algorithm for tracking. This integration aims to improve
tracking accuracy by utilizing Kalman filtering for motion prediction and the Hungarian algorithm
for matching detections to track identities. The method addresses challenges such as occlusion
and variable lighting conditions, demonstrating significant improvements in tracking performance
on pedestrian datasets.
Tsai et al.[28] presented a real-time, lightweight MOT method leveraging MobileNet for im-
proved processing speed in MOT tasks. It introduces a novel MobileNet-based MOT model with
an appearance embedding layer, a redesigned anchor box, and a feature pyramid network (FPN)
to enhance tracking accuracy. Additionally, a Simple Filtering (SF) method is proposed to re-
place the Kalman Filter in data association processing, significantly boosting processing speed
while maintaining competitive tracking performance. The method achieves high FPS rates on
both desktop and embedded platforms and presents a viable solution for applications requiring
real-time tracking with limited computational resources.
Yang et al.[29] introduced a novel pedestrian MOT method specifically designed to tackle the
challenges of heavy occlusions. It employs a regression network for refining predicted positions
to obtain precise bounding boxes and visibility scores. Different handling strategies are applied
to targets based on their occlusion status, significantly reducing false negatives and positives. A
motion model combining Kalman filter and camera motion compensation is developed to enhance
tracking robustness.
3.3 Summary
This chapter reviewed the state-of-the-art in MOT within computer vision, offering a structured
overview of current research, applications, and future directions. A systematic literature search in
Scopus yielded 17 relevant articles, with 12 selected after screening.
These studies introduced advancements in MOT techniques, including novel tracking meth-
ods. Other notable contributions focus on applications like social distancing monitoring, real-time
surveillance, and addressing challenges such as occlusion and lighting variations. The review
highlighted the diversity and innovation in current MOT research, emphasizing improvements in
tracking accuracy, processing speed, and application-specific solutions.
Chapter 4
Methodology
Following a brief exposition of the objectives and techniques used for this work, this chapter
explores, in greater detail, the algorithms responsible for the detection and tracking stages and the
dataset used.
4.1 Overview
The primary goal of this proposed methodology is to implement a DeepSORT framework for
robust pedestrian tracking in image sequences. Using You-Only-Look-Once (YOLOv8) in the
detection stage is expected to offer advantages over using DeepSORT alone:
1. Improved Object Detection: YOLOv8 provides better object detection capabilities, with
higher accuracy and an improved ability to detect small objects [32] which is crucial for the
initial detection stage before tracking.
2. Real-Time Performance: YOLOv8 is faster than previous versions, making it suitable for
real-time applications. This speed is essential for tracking objects in video streams.
3. Robustness to Occlusions: When used with a tracker like DeepSORT, YOLOv8 can help
identify the same object and assign it a unique ID from frame to frame even when the object
detector fails to detect the object in some frames [33].
4. Multiple Object Tracking: YOLOv8 can detect multiple objects in a single image, which
can then be tracked by DeepSORT [33].
YOLOv8 is an advanced version of the YOLO model, known for its superior speed and ac-
curacy in real-time object detection. It efficiently classifies and localizes pedestrians in the image
frames, providing precise bounding box information.
Subsequently, the detected pedestrian objects are fed into the DeepSORT algorithm. Deep-
SORT is an extension of the Simple Online Realtime Tracking (SORT) algorithm, incorporating
deep association metrics for enhanced tracking performance. It takes advantage of Kalman fil-
ters to estimate and predict the motion of pedestrians in the video stream, overcoming occlusion
challenges and maintaining accurate tracks even in complex scenarios. By combining the strength
of YOLOv8 for detection and DeepSORT for tracking, the aim is to create a robust pedestrian
tracking system capable of real-time and accurate results in various environmental conditions, as
illustrated in Fig. 4.1.
4.2 Used Image Dataset
4.2.1 MOT16 Dataset Overview
The Multiple Object Tracking Benchmark, specifically the MOT16 dataset, was chosen as the
primary dataset for evaluating and benchmarking the performance of the proposed multiple object
tracking system using DeepSORT and YOLOv8. The MOT16 dataset serves as a comprehensive
benchmark designed to assess the efficacy of tracking algorithms under diverse and challenging
real-world conditions.
This dataset comprises a collection of video sequences captured in various environments, rang-
ing from urban settings to crowded pedestrian areas. Sample frames from each of the training
sequences can be found in Appendix A. It encompasses challenges commonly encountered in mul-
tiple object tracking scenarios, including occlusions, scale variations, crowded scenes, and object
interactions. The dataset is particularly relevant to our research objectives due to its focus on
pedestrian tracking, aligning with the broader context of surveillance, autonomous navigation,
and human behavior analysis.
4.2.2 Dataset Composition and Annotation
The dataset contains detailed annotations for each object of interest in the video sequences, which
are primarily people, making it an invaluable resource for pedestrian tracking research. These
annotations, manually labeled to ensure accuracy, are typically bounding boxes around each ob-
ject. Additionally, the dataset is divided into training and testing sets. The training set comes with
publicly available ground truth data which allows for training and fine-tuning tracking algorithms.
The testing set, however, does not publicly provide the ground truth, requiring researchers to submit
their results for online evaluation with a limited number of tries and a waiting time of three days.
For this reason, the training set was primarily used for evaluation and benchmarking.
4.2.3 Relevance to Object Tracking
The MOT16 dataset is widely recognized as a standard benchmark for evaluating object-tracking
algorithms. Its real-world scenarios and diverse challenges make it an ideal choice for assessing
the performance of our proposed tracking system. By utilizing MOT16, the aim is to provide a
thorough evaluation under conditions representative of practical tracking applications, contributing
to the reliability and generalizability of our findings.
4.2.4 Dataset Source and Availability
All necessary data, along with associated annotations and documentation, is publicly available
and can be obtained from the official MOTChallenge website. This availability facilitates the
reproducibility of the experiments and enables other researchers to validate and build upon our
findings.
4.2.5 Limitations
Consistent with other datasets, MOT16 has limitations that researchers should be aware
of when using it to evaluate multiple object tracking. This dataset has been noted for having a
slow convergence [34], which refers to the phenomenon where an algorithm takes a considerable
amount of time or a large number of iterations to reach a satisfactory or optimal solution during
the training stage; having poor detection effects for small objects [34]; and a limited variety [9], as
the dataset might not encompass all potential real-world scenarios, despite having a wide range of
sequences shot in various lighting conditions, from various points of view, and in more crowded
situations.
4.3 Object Detection
This section describes the algorithm that was used throughout this project
for the object detection stage.
4.3.1 You Only Look Once
As previously mentioned in section 2.4, in the realm of computer vision and object detection,
the YOLO algorithm has established itself as a groundbreaking and influential paradigm. YOLO
offers an innovative approach to real-time object detection that has revolutionized the field and
found diverse applications across numerous domains. By spatially separating bounding boxes
and associating probabilities to each detected image using a single convolutional neural network
(CNN), the authors frame the object detection problem as a regression problem rather than a
classification task [35]. For this project, YOLOv8 was implemented in the object detection stage.
The most recent iteration of the state-of-the-art YOLO model, known as YOLOv8, is being
actively developed and maintained by the Ultralytics team and builds on YOLOv5's ar-
chitecture. YOLOv8 enables users to choose the ideal model for their particular use case from five
different scaled versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l
(large), and YOLOv8x (extra-large). As of the time of writing, Ultralytics has yet to provide
an official paper on this model, apart from the documentation on their website [36] and
GitHub [7].
4.3.2 Architecture
The architecture of YOLOv8 can be divided into two main parts: the backbone and the head, as
illustrated in Fig. 4.2. The backbone is designed using a modified version of the CSPDarknet53
architecture, which consists of 53 convolutional layers and employs cross-stage partial connec-
tions to improve information flow between the different layers. The head of YOLOv8 consists of
multiple convolutional layers followed by a series of fully connected layers. These layers are re-
sponsible for predicting bounding boxes, objectness scores, and class probabilities for the objects
detected in an image [7].
YOLOv8 builds upon the foundation set by its predecessors and introduces several new fea-
tures and optimizations to improve object detection accuracy and speed. These improvements
can be attributed to two major changes from YOLOv5: anchor-free detection and mosaic data
augmentation [37].
In object detection, anchor boxes are predefined bounding boxes of various shapes and sizes
that detect objects in images. Matching these anchor boxes with the actual objects in the image
helps the model predict the location and size of objects more accurately. The advantage of us-
ing anchor boxes is their ability to detect objects of varying sizes and aspect ratios. In contrast,
anchor-free detection methods do not rely on predefined anchor boxes. Instead, they directly pre-
dict the object boundaries. This method simplifies the detection process and reduces the model’s
complexity. Anchor-free methods are advantageous because they avoid the difficulties associated
with selecting appropriate anchor box sizes and shapes, and they can be more flexible and efficient,
particularly in scenarios where objects do not conform to standard shapes or sizes, Fig. 4.3.
While anchor boxes generally increase the mean Average Precision (mAP) scores in the train-
ing stage, they were removed in YOLOv8 because they make the model rigid and hard to fit to new
data, and because of the difficulty in clearly mapping irregularities with polygon anchor boxes [37].
Mosaic augmentation is a data augmentation technique used in deep learning, particularly in
object detection tasks. As illustrated in Fig. 4.4, it involves combining different training images
into a single image, thereby creating a new composite image. This technique allows a model to
see more objects in one image, enhancing its ability to learn from diverse contexts and scales.
Mosaic effectively increases the variability and complexity of the training data, which can lead to
improved detection performance, especially in scenarios with limited data or diverse object sizes.
Figure 4.4: Visual representation of a mosaic data augmentation method. Adapted from [6].
4.3.3 Implementation
Integrating YOLO into an object-tracking task involves a combination of real-time object detection
and subsequent tracking of these objects across frames. The first step is to install the dependencies,
such as Ultralytics which loads the YOLOv8 model and performs real-time object detection, and
imports the YOLO method. Subsequently, the model can be initialized with a specified version
with pre-trained weights and applied to each frame, whose results contain information about the
detected objects, including bounding box coordinates, confidence scores, and class labels.
The code iterates through the detected objects, filtering for pedestrians and discarding detec-
tions below a specified confidence threshold. The tracker, implemented in a custom Tracker class,
is updated with the current frame and the processed detections to associate the detections across
frames and assign unique track IDs.
Finally, it iterates through the tracked objects and draws bounding boxes around them on the
current frame while displaying the unique track ID associated with each tracked object.
Figure 4.5: Comparison between different YOLO versions. Adapted from [7].
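A minimal sketch of this detection-plus-tracking loop is given below. It assumes the ultralytics and opencv-python packages are installed; the tracker module, the Tracker interface (update(), tracks, bbox, track_id), and the input file name are illustrative assumptions rather than the verbatim project code.

import cv2
from ultralytics import YOLO
from tracker import Tracker  # hypothetical module holding the custom DeepSORT wrapper

model = YOLO("yolov8m.pt")   # pre-trained YOLOv8 medium weights
tracker = Tracker()
CONF_THRESHOLD = 0.5         # detection threshold used in this work
PERSON_CLASS_ID = 0          # "person" class index in the COCO label set

cap = cv2.VideoCapture("MOT16-02.mp4")  # illustrative input sequence
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Run YOLOv8 on the current frame and keep confident pedestrian detections
    results = model(frame)[0]
    detections = []
    for box in results.boxes:
        if int(box.cls) == PERSON_CLASS_ID and float(box.conf) >= CONF_THRESHOLD:
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            detections.append([x1, y1, x2, y2, float(box.conf)])
    # Associate detections across frames and draw each track with its unique ID
    tracker.update(frame, detections)
    for track in tracker.tracks:
        x1, y1, x2, y2 = map(int, track.bbox)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, str(track.track_id), (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()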
4.4 Object Tracking
4.4.1 DeepSORT
DeepSORT, as mentioned in section 2.5, is a sophisticated algorithm for multi-object tracking that
improves on the SORT algorithm with deep learning techniques. It tracks and matches objects in
video sequences using a combination of Kalman filtering and a deep association metric [38].
The motion of objects across frames is predicted using Kalman filtering, and high-quality
feature representations of the detected objects are extracted using a deep neural network. These
characteristics are critical for associating detections across multiple frames, especially when ob-
jects become occluded or drastically change appearance.
DeepSORT’s primary strength is its deep association metric, significantly improving tracking
consistency and accuracy, particularly in challenging scenarios. By integrating motion and ap-
pearance information, DeepSORT effectively manages identity switches and occlusions in object-
tracking scenarios.
The Kalman filter is a recursive estimator that optimally combines the information from a predic-
tion model and noisy measurements to estimate the state of a system. It operates in two funda-
mental steps: prediction and update.
Figure 4.6: Comparison between different YOLO models. mAP values are for single-model
single-scale on COCO val2017 dataset. Speed averaged over COCO val images using an Amazon
EC2 P4d instance. Adapted from [7].
The prediction step uses the previous state to predict the current state. The prediction of
the state is based on the system’s dynamics, which are usually defined by a set of mathematical
equations, such as position and velocity.
After the prediction step, the current measurement is used to correct the predicted state. This
step involves updating our prediction based on the difference between the predicted and measured
values, also known as the residual or innovation. The update step also adjusts the uncertainty
associated with the predicted state. It then repeats these two steps in a recursive manner to contin-
uously estimate the state of a dynamic system over time. The Kalman filter combines the predicted
state of the system and the latest measurement in a weighted average, whose weights are assigned
based on the relative uncertainty of the values, with more confidence placed in measurements with
lower estimated uncertainty [39].
The DeepSORT algorithm used in this work was based on Nwojke's implementation [40], whose Kalman filter
was implemented with an 8-dimensional state space containing the bounding box center position
(x,y), aspect ratio (a), height (h) and velocities. The filter follows a constant velocity model to
predict object movements and directly observes the bounding box parameters for updates. The
original implementation features:
1. An initialization function starts a new track with a given measurement by initializing the
mean and covariance of the state.
2. A predict function calculates the new mean and covariance based on the motion model and
process noise.
3. An update method where the Kalman Gain is computed to adjust the state estimate based on
the difference between the actual measurement and the predicted state.
4. A projection function projects the state estimate into the measurement space, preparing for
the update step by calculating the expected measurement and its uncertainty.
5. A gating distance function that calculates the Mahalanobis distance between predicted states
and new measurements to determine which measurements are likely to correspond to which
tracked objects.
6. A predefined Chi-Square distribution table to set thresholds for the gating distance, which
facilitates the decision on whether a measurement should be considered for updating a track.
Figure 4.7: Visual representation of the prediction and update steps of a Kalman filter. Adapted
from [8].
By continuously refining the state estimate with each new frame, Fig. 4.7, the Kalman Filter
improves tracking accuracy, especially in handling object movements and temporary occlusions.
It helps maintain consistent object identity across frames, enhancing the overall performance of
the tracking system.
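To make the two steps concrete, the following is a minimal NumPy sketch of a constant-velocity predict/update cycle over the 8-dimensional state described above; the matrices and the noise arguments Q and R are illustrative placeholders rather than the exact values of the original implementation.

import numpy as np

ndim, dt = 4, 1.0  # measured: (x, y, a, h); appended velocities give the 8-dim state

# Constant-velocity motion model: each position component advances by its velocity * dt
F = np.eye(2 * ndim)
for i in range(ndim):
    F[i, ndim + i] = dt
H = np.eye(ndim, 2 * ndim)  # observation model: only (x, y, a, h) are measured

def predict(mean, cov, Q):
    """Prediction step: project the state and its uncertainty one frame forward."""
    mean = F @ mean
    cov = F @ cov @ F.T + Q  # process noise Q grows the uncertainty
    return mean, cov

def update(mean, cov, z, R):
    """Update step: correct the prediction with a detection's box parameters z."""
    S = H @ cov @ H.T + R             # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)  # Kalman gain: weighs prediction vs. measurement
    innovation = z - H @ mean         # residual between measured and predicted box
    mean = mean + K @ innovation
    cov = cov - K @ S @ K.T
    return mean, cov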
Data association is the defining step where DeepSORT deviates from the SORT algorithm by
employing a combination of motion information and appearance features instead of relying solely
on intersection over union (IoU) [38]. The IoU is a performance metric used to evaluate the
accuracy of detection by quantifying the overlap between the predicted and the corresponding
ground truth bounding boxes. The IoU can be expressed as
\[ \mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} \tag{4.1} \]
where the "Area of Intersection" represents the area where the bounding boxes intersect and
the "Area of Union" is the total area of both bounding boxes.
Matching is done by the Hungarian algorithm using a cost matrix that takes into account the co-
sine distance for appearance similarity and the Mahalanobis distance for motion consistency [41].
The Mahalanobis distance accounts for the uncertainties in the Kalman filter predictions, while
the cosine distance compares the appearance feature vectors of the objects [38]. The squared Ma-
halanobis distance implemented in the Kalman Filter normalizes the distance of the point x from
the mean µ, taking into account the covariance of the distribution Σ, which is given by

\[ D_M^2(x) = (x - \mu)^{T} \Sigma^{-1} (x - \mu) \tag{4.2} \]

where $D_M^2(x)$ represents the squared Mahalanobis distance, $(x - \mu)^T$ is the transpose of the
vector difference between the point and the mean, and $\Sigma^{-1}$ is the inverse of the covariance matrix. A
high squared Mahalanobis distance indicates that the point is far from the mean, considering the
distribution’s variance and covariance structure.
This approach allows DeepSORT to combine both association metrics as a weighted sum:
\[ c_{i,j} = \lambda \, d_{i,j}^{(1)} + (1 - \lambda) \, d_{i,j}^{(2)} \tag{4.3} \]

where $c_{i,j}$ is the combined association cost for the i-th track and the j-th detection, $\lambda$ is a
weighting parameter that balances the influence of the two metrics, $d_{i,j}^{(1)}$ is the Mahalanobis
distance for motion information, and $d_{i,j}^{(2)}$ is the cosine distance for appearance information.
The data association employs a matching cascade algorithm that matches detections in order
from the most recently updated tracks to ensure the Mahalanobis distance does not favor the track
with a larger uncertainty [38]. The cascade approach helps in handling uncertainties in object
detection and tracking by considering the "age" of tracks and adjusting the matching strictness
accordingly.
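As a concrete illustration of Eqs. (4.2) and (4.3) together with the gating step, a sketch of a combined-cost assignment is given below; the function, the default λ, and the use of SciPy's linear_sum_assignment are simplifying assumptions, since the original implementation performs a matching cascade rather than a single global assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 9.4877  # 95% chi-square quantile for 4 degrees of freedom (4-dim measurement)

def associate(maha, cosine, lam=0.2):
    """Combine motion and appearance costs (Eq. 4.3) and solve the assignment.
    maha, cosine: (num_tracks, num_detections) cost matrices."""
    cost = lam * maha + (1.0 - lam) * cosine
    cost[maha > GATE] = 1e5                   # gate out implausible motion matches
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e5]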
In addition to the dataset, the MOTChallenge provides its own official evaluation kit TrackE-
val [42] which is a framework designed for the rigorous evaluation of object tracking algorithms.
It incorporates a comprehensive suite of evaluation metrics, including the advanced Higher Order
Tracking Accuracy (HOTA) metric, which assesses tracking performance across multiple dimen-
sions such as detection accuracy, identity retention, and overall tracking precision [43].
In this work, TrackEval was employed to benchmark the performance of a newly developed
tracker against the MOT16 dataset. Through the use of TrackEval, an extensive and uniform
evaluation of our tracker was made possible, facilitating a deep understanding of its capabilities
and areas for improvement. This evaluation framework enabled us to quantitatively compare our
tracker’s performance with existing state-of-the-art methods, providing valuable insights into its
efficacy in complex tracking scenarios typical of the MOTChallenge.
According to the instructions from the software, the resulting tracker was saved as a text file
containing one object instance per line with the following values: <frame>, <id>, <bb_left>,
<bb_top>, <bb_width>, <bb_height>, <conf>, <x>, <y>, <z>. For 2D challenges, the world
coordinates x,y,z are filled with -1 as they can be ignored.
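A minimal sketch of writing tracker output in this format might look as follows; the file name and the example values are illustrative:

def write_mot_line(f, frame, track_id, bb_left, bb_top, bb_width, bb_height, conf):
    # 2D challenge: the world coordinates x, y, z are ignored and set to -1.
    f.write(f"{frame},{track_id},{bb_left},{bb_top},{bb_width},{bb_height},{conf},-1,-1,-1\n")

with open("MOT16-02.txt", "w") as f:
    write_mot_line(f, frame=1, track_id=7, bb_left=912.0, bb_top=484.0,
                   bb_width=97.0, bb_height=109.0, conf=0.92)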
4.6 Computational Infrastructure and Software
This section describes the hardware and software tools used during the project’s development.
This work was developed on a desktop PC equipped with an AMD Ryzen 5 3600 6-core processor @3.60GHz, 16GB of DDR4 RAM @2133MHz, and an NVIDIA GeForce RTX 3070 GPU. The machine runs Microsoft Windows 10 Home, which is compatible with all the computational tools used during development, as well as the Windows Subsystem for Linux, which allows running Linux command-line tools and apps alongside the Windows command line, desktop, and store apps, and accessing Windows files from within Linux.
Python is a high-level, interpreted programming language known for its clear syntax, readability,
and versatility. Its ecosystem is replete with libraries and frameworks that cater to almost every
need of a data science, machine learning, or computer vision project. For a DeepSORT project,
libraries like TensorFlow, scikit-learn, scikit-image and specialized libraries such as those from
Ultralytics for object detection, provide robust tools to implement, train, and deploy models effec-
tively. Furthermore, Python’s strength in scientific computing is bolstered by libraries like NumPy,
SciPy, and Pandas. These libraries are essential for handling and processing the large volumes of
data typically involved in machine learning and computer vision projects.
Due to its popularity, Python has a vast and active community of developers and researchers.
This community support translates into a wealth of tutorials, forums, and discussions which can
be invaluable resources when troubleshooting issues, optimizing code, or seeking advice on best
practices for implementation.
Visual Studio Code (VS Code) is a free, open-source code editor developed by Microsoft, avail-
able for Windows, macOS, and Linux. It is highly regarded for its performance, flexibility, and
extensive range of features, including support for debugging, syntax highlighting, snippets, and
code refactoring. Additionally, it contains a vast marketplace of extensions, allowing developers
to customize their environment with tools and languages specific to their needs. This work was
developed primarily with this tool for creating, debugging and running scripts to implement the
tracking algorithm in Python.
Furthermore, Git integration was particularly useful in managing version control. GitHub is
a web-based platform for version control and collaboration that utilizes Git, a distributed version
control system designed to handle everything from small to very large projects with speed and
efficiency. It was mainly used as a repository to store and manage the project’s code by forking
from previous versions and building upon them.
4.7 Summary
The methodology chapter outlines the implementation of a DeepSORT framework integrated with
YOLOv8 for robust pedestrian tracking in image sequences. The main objectives include lever-
aging YOLOv8’s advanced object detection capabilities, such as improved accuracy, real-time
performance, and robustness to occlusions, to enhance the initial detection stage before tracking
with DeepSORT. DeepSORT, an extension of SORT, incorporates deep association metrics and
Kalman filters to predict and maintain the motion of pedestrians across video frames, even in
complex scenarios like occlusions.
The chapter details the use of the MOT16 dataset, a benchmark for evaluating multiple object
tracking systems under diverse and challenging conditions, emphasizing pedestrian tracking. The
dataset, with its comprehensive collection of video sequences and detailed annotations, serves as
a crucial tool for benchmarking the proposed tracking system’s performance.
In terms of technical implementation, YOLOv8’s architecture is discussed, highlighting its ef-
ficiency in classifying and localizing pedestrians through various model scales and the integration
of new features like anchor-free detection and mosaic data augmentation for improved detection
accuracy. The chapter also elaborates on the integration process of YOLOv8 for object detection
and the subsequent tracking with DeepSORT, detailing the algorithms’ workings, from feature
extraction and Kalman filtering to data association using motion and appearance information.
The evaluation of the tracking system’s performance utilizes the TrackEval framework from
MOTChallenge, providing a thorough and uniform assessment through advanced metrics like the
HOTA. This framework facilitates a quantitative comparison with existing methods, offering in-
sights into the proposed system’s effectiveness.
Finally, the computational infrastructure and development tools used in the project are de-
scribed, including the hardware specifications and software environments like Python, Visual Stu-
dio Code, and GitHub. This setup underscores the project’s reliance on robust computational
resources and efficient development practices to achieve its objectives in pedestrian tracking with
real-time accuracy and performance.
Chapter 5
Results and Discussion
The following chapter presents the empirical findings obtained with different models and parameters, which are then discussed to establish conclusions based on their performance. A range of established metrics was employed to quantify the model’s tracking accuracy and robustness: Higher Order Tracking Accuracy (HOTA), Detection Accuracy (DetA), Association Accuracy (AssA), Detection Precision (DetPr), Detection Recall (DetRe), Association Precision (AssPr), Association Recall (AssRe), and Localization Accuracy (LocA). Finally, based on the previous discussions, a comparison is made between the proposed model and state-of-the-art (SOTA) algorithms.
These metrics can be summarized as follows:
1. DetA: Measures how accurately the algorithm detects the objects present in the frames.
2. DetRe: Indicates the proportion of actual positives that were correctly identified.
3. DetPr: Reflects the proportion of positive identifications that were actually correct.
4. AssA: Evaluates how accurately the algorithm associates detected objects across frames.
5. AssRe: Measures the algorithm’s ability to maintain correct associations over time.
6. AssPr: Indicates the precision of maintaining correct associations without false links be-
tween objects.
7. HOTA: Balances the importance of detection and association accuracy, giving a holistic
view of the overall tracking performance.
8. LocA: Assesses how precisely the algorithm localizes the objects within the frames.
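For reference, [43] defines HOTA at each localization threshold α as the geometric mean of detection and association accuracy, averaged over a range of thresholds; these are the α values referred to in the plots discussed below:

HOTA_α = sqrt(DetA_α · AssA_α),    HOTA = (1/19) Σ_{α ∈ {0.05, 0.10, …, 0.95}} HOTA_α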
The detection threshold is a predefined value used in the object detection phase to filter out detections based on their confidence score. Confidence scores, typically ranging from 0 to 1,
indicate the model’s certainty that a detected object belongs to a specific class. When an object de-
tection model, such as YOLO, detects objects in an image, each detection comes with a confidence
score. The detection threshold is used to decide which detections to keep and which to discard,
as detections with confidence scores above the threshold are considered valid and passed on for
further processing, while those below the threshold are ignored. Setting the detection threshold
too high might result in missing valid objects, increasing false negatives, while setting it too low
could lead to many false positives.
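In practice, this filtering reduces to a simple comparison against the threshold; the following sketch uses an assumed list of detections, each carrying a confidence score:

DETECTION_THRESHOLD = 0.5  # the value adopted by the proposed model

# Illustrative detections; each carries a confidence score in [0, 1].
detections = [
    {"box": (912, 484, 97, 109), "conf": 0.91},  # kept
    {"box": (338, 210, 45, 120), "conf": 0.34},  # discarded
]
kept = [d for d in detections if d["conf"] >= DETECTION_THRESHOLD]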
As shown in Figure 5.2, when comparing detection threshold values of 0.5 and 0.3, respectively, both plots depict a very similar performance profile, with the second plot offering a slight improvement in AssA and AssPr at a negligible trade-off in AssRe, LocA, and DetPr.
Figure 5.1: Comparison of performance of the algorithm using models YOLOv8n and YOLOv8m, respectively.
Figure 5.2: Comparison of performance of the algorithm using a detection threshold of 0.5 and 0.3, respectively.
5.4 Effects of different maximum cosine distances
In tracking algorithms that use appearance features to match detections across frames, the cosine
distance between the feature vector of a new detection and existing tracks is calculated. A lower
cosine distance indicates a higher similarity. The max cosine distance parameter sets the maximum
allowed distance for considering a detection and a track to be a match. In theory, setting the max
cosine distance too low may result in the tracker being too strict, potentially causing frequent track
losses or missed matches. Conversely, setting it too high can lead to incorrect matches, which
means an increase in identity switches. According to Figure 5.3, both plots show similar trends across all metrics, with a slight difference in the AssRe and AssPr scores, which suggests that the variations in the maximum cosine distance threshold may not have a significant impact on the
overall performance of the tracking as measured by these metrics. While the specific impact of the
cosine distance threshold is not clear-cut, there is a slight indication that increasing the threshold
could potentially improve the association precision of the DeepSORT algorithm at higher alpha
values. However, since the scores and trends are largely similar, it could also be interpreted that
the performance of DeepSORT is robust to these changes in the threshold, at least for the metrics
and the range of alpha presented in these plots.
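For illustration, the appearance test can be sketched as follows, using random vectors as stand-ins for the re-identification features:

import numpy as np

MAX_COSINE_DISTANCE = 0.4  # the value adopted by the proposed model

def cosine_distance(a, b):
    # 1 - cos(theta) between L2-normalized feature vectors; 0 means
    # identical appearance, larger values mean less similar.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

track_feature = np.random.rand(128)      # stand-in for a stored track feature
detection_feature = np.random.rand(128)  # stand-in for a new detection's feature
is_candidate = cosine_distance(track_feature, detection_feature) <= MAX_COSINE_DISTANCE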
5.5 Proposed Model
Taking all of the experiments together, it was concluded that, given the marginal difference in performance between the tested thresholds, the proposed model uses the YOLOv8m model with a 0.5 detection threshold and a 0.4 maximum cosine distance.
As described in the second plot of Figure 5.1, a HOTA score of 0.06 indicates that the algo-
rithm’s overall performance in balancing detections and associations is low; a DetA score of 0.02
shows a very low accuracy in detecting pedestrians; a score of 0.2 in AssA implies moderate as-
sociation accuracy, showing some capability in correctly associating detections over time; DetRe
at 0.03 and DetPr at 0.07 show that the algorithm misses a large number of true positives and produces a large number of false positives, respectively; AssRe and AssPr scores of 0.29 suggest that the algorithm moderately recalls the correct associations over multiple frames and maintains those associations with moderate precision; and a LocA score of 0.62 suggests that when the algorithm does detect a
pedestrian, it localizes them relatively well within the frame.
5.6 Comparison Results With State-of-the-Art Methods
The proposed model was compared to three of the current top state-of-the-art algorithms, according to the MOT16 Challenge official website. The chosen algorithms were the three best-ranked that used public detections and tracked only pedestrians.
As shown by Figure 5.4, overall these methods outperform our proposed model in all metrics.
Both the proposed model and the SOTA algorithms were evaluated using the same data and the
results were similarly obtained from the TrackEval kit, so they can be compared directly.
The proposed model’s performance indicates that it struggles to correctly identify pedestrians and maintain their identities across frames, whereas the SOTA results reflect robust performance in both detecting pedestrians and associating them accurately over time. In terms of localization accuracy, DeepSORT has a moderate performance, while the SOTA methods are better at pinpointing the exact location of pedestrians within the frames.
Thus, it can be concluded that there is still room for improvement in terms of both detection
and tracking of objects.
Figure 5.3: Comparison of performance of the algorithm using a maximum cosine distance of 0.2
and 0.8, respectively.
5.7 Summary
This chapter presented findings from evaluating various models using metrics like HOTA, DetA,
AssA, and LocA. It explored the performance of YOLOv8 models under different conditions,
including changes in detection thresholds and cosine distances, ultimately selecting a medium
YOLOv8 model with specific settings for the proposed system. Despite some performance gains,
the proposed model demonstrated limitations in detection and tracking accuracy compared to
SOTA methods. The analysis revealed the proposed model’s challenges in accurately detecting
pedestrians and maintaining their identities, suggesting significant room for improvement in both
detection and tracking capabilities.
Chapter 6
Conclusion and Future Work
This chapter examines the investigation’s findings along with the effectiveness of the proposed model. After the findings are analyzed, the discussion turns to possible developments, shedding light on future directions that might resolve the identified weaknesses.
6.1 Conclusion
This dissertation aimed to implement and evaluate a Multiple Object Tracking (MOT) model and compare it to current state-of-the-art (SOTA) algorithms based on a set of established metrics to ensure homogeneity of results.
In this case, it involved integrating DeepSORT with a You-Only-Look-Once (YOLOv8) model.
The choice of YOLOv8, being a newer model, was predicated on the assumption that its cutting-
edge features could offer superior performance in terms of detection accuracy, tracking precision,
and computational efficiency. This investigation aimed to validate whether the incorporation of
YOLOv8 could indeed elevate the tracking system’s capabilities beyond the current benchmarks
established by state-of-the-art methods.
The key findings highlight the nuanced balance between model complexity, computational
efficiency, and performance accuracy. The selected YOLOv8 medium model, with a 0.5 detec-
tion threshold and a 0.4 maximum cosine distance, represents a strategic compromise aiming to
optimize tracking accuracy while maintaining computational viability. However, the comparison
with state-of-the-art algorithms reveals a significant performance gap, particularly in accurately
detecting pedestrians and maintaining their identities over time.
The underperformance of the YOLOv8 model, particularly in detecting pedestrians at a dis-
tance and in varied lighting conditions, can be attributed to its reliance on a pre-trained frame-
work. While pre-trained models offer the advantage of leveraging learned features from extensive
datasets, they can encounter difficulties when applied to specific scenarios that differ significantly
from their training environments. In the context of this study, the YOLOv8 model’s pre-training
did not adequately prepare it for the nuanced challenges of pedestrian detection, such as small
object sizes and complex visibility issues.
These limitations encountered due to a lack of training can be largely attributed to the inad-
equacies in the dataset used. A significant obstacle was the absence of ground truth annotations
for the testing images within the dataset. Ground truth data, which provides the definitive loca-
tions and classifications of objects within images, is essential for training machine learning models
accurately and effectively. Without these annotations, the model’s ability to learn and predict ac-
curately is severely compromised, especially in scenarios that demand high precision, such as
pedestrian detection at varying distances and under different lighting conditions.
The limitations of the proposed model, as outlined in the results, underscore the challenges
faced in achieving high levels of detection and association accuracy. These challenges highlight
the importance of continued research and development in the field to better meet or exceed the
benchmarks set by current state-of-the-art methods.
Appendix
Figure A.1: Frame from the MOT16-02 image sequence, adapted from [9].
Figure A.2: Frame from the MOT16-04 image sequence, adapted from [9].
Figure A.3: Frame from the MOT16-05 image sequence, adapted from [9].
Figure A.4: Frame from the MOT16-09 image sequence, adapted from [9].
Figure A.5: Frame from the MOT16-10 image sequence, adapted from [9].
Figure A.6: Frame from the MOT16-11 image sequence, adapted from [9].
Figure A.7: Frame from the MOT16-13 image sequence, adapted from [9].
References
[1] Gioele Ciaparrone, Francisco Luque Sánchez, Siham Tabik, Luigi Troiano, Roberto Taglia-
ferri, and Francisco Herrera. Deep learning in video multi-object tracking: A survey. Neu-
rocomputing, 381:61–88, 3 2020. doi:10.1016/J.NEUCOM.2019.11.023.
[2] Yan Dai, Ziyu Hu, Shuqi Zhang, and Lianjun Liu. A survey of detection-based video
multi-object tracking. Displays, 75:102317, 12 2022. doi:10.1016/J.DISPLA.2022.
102317.
[3] Xiongwei Wu, Doyen Sahoo, and Steven C.H. Hoi. Recent advances in deep learning for
object detection. Neurocomputing, 396:39–64, 7 2020. doi:10.1016/J.NEUCOM.2020.
01.085.
[6] Gabriel Mongaras. Yolox explanation — mosaic and mixup for data augmenta-
tion. Visited on 2023-11-28. URL: https://medium.com/mlearning-ai/
yolox-explanation-mosaic-and-mixup-for-data-augmentation-3839465a3adf.
[8] Furqan Asghar, Muhammad Talha, Sung Kim, and In-ho Ra. Simulation study on battery
state of charge estimation using kalman filter. Journal of Advanced Computational Intelli-
gence and Intelligent Informatics, 20:861–866, 11 2016. doi:10.20965/jaciii.2016.
p0861.
[9] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for
multi-object tracking. arXiv:1603.00831 [cs], March 2016. arXiv: 1603.00831. URL:
http://arxiv.org/abs/1603.00831.
[10] G. Stamou, M. Krinidis, Evangelos Loutas, Nikos Nikolaidis, and Ioannis Pitas. 2d and 3d
motion tracking in digital video. Handbook of Image and Video Processing, pages 491–517,
2005. doi:10.1016/B978-012119792-6/50093-0.
[11] Linyu Zheng, Ming Tang, Yingying Chen, Guibo Zhu, Jinqiao Wang, and Hanqing Lu. Im-
proving multiple object tracking with single object tracking. Proceedings of the IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition, pages 2453–2462,
2021. doi:10.1109/CVPR46437.2021.00248.
[12] Latha Anuj and M. T.Gopala Krishna. Multiple camera based multiple object tracking
under occlusion: A survey. IEEE International Conference on Innovative Mechanisms
for Industry Applications, ICIMIA 2017 - Proceedings, pages 432–437, 7 2017. doi:
10.1109/ICIMIA.2017.7975652.
[13] Raquel R. Pinho and João Manuel R.S. Tavares. Tracking features in image sequences
with kalman filtering, global optimization, mahalanobis distance and a management model.
CMES - Computer Modeling in Engineering and Sciences, 46:51–75, 2009.
[14] Niall O’ Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo
Velasco-Hernandez, Lenka Krpalkova, Daniel Riordan, and Joseph Walsh. Deep learning
vs. traditional computer vision. 943, 10 2019. URL: https://arxiv.org/abs/1910.
13796v1, doi:10.1007/978-3-030-17795-9.
[15] Jong Min Jeong, Tae Sung Yoon, and Jin Bae Park. Kalman filter based multiple objects
detection-tracking algorithm robust to occlusion. Proceedings of the SICE Annual Confer-
ence, pages 941–946, 10 2014. doi:10.1109/SICE.2014.6935235.
[16] Lorenzo Porzi, Markus Hofinger, Idoia Ruiz, Joan Serrat, Samuel Rota Bulo, and Peter
Kontschieder. Learning multi-object tracking and segmentation from automatic annotations.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pages 6845–6854, 2020. doi:10.1109/CVPR42600.2020.00688.
[17] Shailender Kumar, Vishal, Pranav Sharma, and Nitin Pal. Object tracking and counting in
a zone using yolov4, deepsort and tensorflow. Proceedings - International Conference on
Artificial Intelligence and Smart Systems, ICAIS 2021, pages 1017–1022, 3 2021. doi:
10.1109/ICAIS50930.2021.9395971.
[18] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the
fairness of detection and re-identification in multiple object tracking. International Journal
of Computer Vision, 129:3069–3087, 11 2021. doi:10.1007/S11263-021-01513-4.
[19] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance:
The clear mot metrics. EURASIP Journal on Image and Video Processing, 2008. doi:
10.1155/2008/246309.
[20] Wenhan Luo, Junliang Xing, Anton Milan, Xiaoqin Zhang, Wei Liu, and Tae Kyun Kim.
Multiple object tracking: A literature review. Artificial Intelligence, 293, 4 2021. doi:
10.1016/J.ARTINT.2020.103448.
[21] Zhong Min Tsai, Yu Ju Tsai, Chien Yao Wang, Hong Yuan Liao, Youn Long Lin, and Yung Yu Chuang. Searchtrack: Multiple object tracking with object-customized search and motion-aware features. BMVC 2022 - 33rd British Machine Vision Conference Proceedings, London, 2022.
[22] Shahzad Ahmad Qureshi, Lal Hussain, Qurat Ul Ain Chaudhary, Syed Rahat Abbas, Raja Ju-
naid Khan, Amjad Ali, and Ala Al-Fuqaha. Kalman filtering and bipartite matching based
super-chained tracker model for online multi object tracking in video sequences. Applied
Sciences (Switzerland), 12, 10 2022. doi:10.3390/APP12199538.
[23] Chhaya Gupta, Nasib Singh Gill, and Preeti Gulia. Ssdt: Distance tracking model based
on deep learning original scientific paper. International Journal of Electrical and Computer
Engineering Systems, 13:339–348, 2022. doi:10.32985/IJECES.13.5.2.
[24] Rajni Jindal, Aditya Panwar, Nishant Sharma, and Aman Rai. Object tracking in a zone
using deepsort, yolov4 and tensorflow. 2021 2nd International Conference for Emerging
Technology, INCET 2021, 5 2021. doi:10.1109/INCET51464.2021.9456443.
[25] Mingyan Wang, Bozheng Lit, Haoran Jiang, and Junjie Zhang. Multi-object tracking with
adaptive cost matrix. 2022 IEEE 24th International Workshop on Multimedia Signal Pro-
cessing, MMSP 2022, 2022. doi:10.1109/MMSP55362.2022.9948977.
[26] Hengle Ren, Fang Xu, Fengshan Zou, Kai Jia, Pei Di, and Jie Kang. Multi-pedestrian track-
ing based on social forces. 2018 International Conference on Intelligence and Safety for
Robotics, ISR 2018, pages 527–532, 11 2018. doi:10.1109/IISR.2018.8535956.
[27] Yuqiao Gai, Weiyang He, and Zilong Zhou. Pedestrian target tracking based on deepsort
with yolov5. Proceedings - 2021 2nd International Conference on Computer Engineering
and Intelligent Control, ICCEIC 2021, pages 1–5, 2021. doi:10.1109/ICCEIC54227.
2021.00008.
[28] Chi Yi Tsai and Yu Kai Su. Mobilenet-jde: a lightweight multi-object tracking model for
embedded systems. Multimedia Tools and Applications, 81:9915–9937, 3 2022. doi:10.
1007/S11042-022-12095-9.
[29] Jieming Yang, Hongwei Ge, Jinlong Yang, Yubing Tong, and Shuzhi Su. Online pedestrian
multiple-object tracking with prediction refinement and track classification. Neural Process-
ing Letters, 54:4893–4919, 12 2022. doi:10.1007/S11063-022-10840-7.
[30] M. V. Rahul, Revanur Ambareesh, and G. Shobha. Siamese network for underwater multiple
object tracking. ACM International Conference Proceeding Series, Part F128357:511–516,
2 2017. doi:10.1145/3055635.3056579.
[31] Lionel Rakai, Huansheng Song, Shi Jie Sun, Wentao Zhang, and Yanni Yang. Data as-
sociation in multiple object tracking: A survey of recent techniques. Expert Systems with
Applications, 192, 4 2022. doi:10.1016/J.ESWA.2021.116300.
[32] Augmented Startups. The benefits of using yolov8 for image segmentation tasks.
Visited on 2023-11-28. URL: https://www.augmentedstartups.com/blog/
the-benefits-of-using-yolov8-for-image-segmentation-tasks.
[34] Xiaoning Zhu, Yannan Jia, Sun Jian, Lize Gu, and Zhang Pu. Vitt: Vision transformer tracker. Sensors, 21:5608, 8 2021. URL: https://www.mdpi.com/1424-8220/21/16/5608, doi:10.3390/S21165608.
[38] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with
a deep association metric. In 2017 IEEE International Conference on Image Processing
(ICIP), pages 3645–3649. IEEE, 2017. doi:10.1109/ICIP.2017.8296962.
[39] MathWorks. Using kalman filter for object tracking. Visited on 2023-
11-28. URL: https://www.mathworks.com/help/vision/ug/
using-kalman-filter-for-object-tracking.html.
[40] Nicolai Wojke and Alex Bewley. Deep cosine metric learning for person re-identification. In
2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 748–756.
IEEE, 2018. doi:10.1109/WACV.2018.00087.
[41] Allan Kouidri. Mastering deep sort: The future of object tracking explained. Visited on 2023-11-28. URL: https://www.ikomia.ai/blog/deep-sort-object-tracking-guide#how-does-deep-sort-work.
[42] Jonathon Luiten and Arne Hoffhues. TrackEval, 2020. URL: https://github.com/JonathonLuiten/TrackEval.
[43] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixe, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129:548–578, 9 2020. URL: http://arxiv.org/abs/2009.07736, doi:10.1007/s11263-020-01375-2.