Applications of a Graph Theoretic Based Clustering Framework in Computer Vision and Pattern Recognition
Supervisor:
Prof. Andrea Prati
The Chair of the Doctoral Program:
Prof. Fabio Peron
Abstract
Recently, several clustering algorithms have been used to solve a variety of problems from different disciplines. This dissertation aims to address several challenging tasks in computer vision and pattern recognition by casting them as clustering problems. We propose novel approaches to solve multi-target tracking, visual geo-localization and outlier detection problems using a unified underlying clustering framework, i.e., dominant set clustering and its extensions, and present superior results over several state-of-the-art approaches.
Firstly, this dissertation presents a new framework for multi-target tracking in a single camera. We propose a novel data association technique using the Dominant Set Clustering (DSC) framework. We formulate the tracking task as finding dominant sets on a constructed undirected edge-weighted graph. Unlike most techniques, which are limited in temporal locality (i.e., only a few frames are considered), we utilize pairwise relationships (in appearance and position) between different detections across the whole temporal span of the video for data association in a global manner. Meanwhile, a temporal sliding window technique is utilized to find tracklets and perform further merging on them. Our robust tracklet merging step makes our tracker more robust to long-term occlusions. DSC leads us to a more accurate approach to multi-object tracking by considering all the pairwise relationships in a batch of frames; however, it has some limitations. Firstly, it finds target trajectories (clusters) one by one (peel-off strategy), causing a change in the scale of the problem. Secondly, the algorithm used to enumerate dominant sets, i.e., replicator dynamics, has quadratic computational complexity per iteration, which makes it impractical on larger graphs.
To address these problems and extend the tracking problem to multiple non-overlapping cameras, we propose a novel and unified three-layer hierarchical approach. Given a video and a set of detections (obtained by any person detector), we first solve within-camera tracking employing the first two layers of our framework and, then, in the third layer, we solve across-camera tracking by merging tracks of the same person in all cameras in a simultaneous fashion. To best serve our purpose, a constrained dominant set clustering (CDSC) technique, a parametrized version of standard quadratic optimization, is employed to solve both tracking tasks. The tracking problem is cast as
finding constrained dominant sets from a graph. That is, given a constraint set and a graph, CDSC generates a cluster (or clique), which forms a compact and coherent set containing a subset of the constraint set. The approach is based on a parametrized family of quadratic programs that generalizes the standard quadratic optimization problem. In addition to having a unified framework that simultaneously solves within- and across-camera tracking, the third layer helps link broken tracks of the same person occurring during within-camera tracking. A standard algorithm to extract constrained dominant sets from a graph is given by the so-called replicator dynamics, whose computational complexity is quadratic per step, which makes it impractical for large-scale applications. In this work, we propose a fast algorithm, based on dynamics from evolutionary game theory, which is efficient and scalable to large-scale real-world applications. In tests on several tracking datasets, we show that the proposed method outperforms competitive methods.
Another challenging task in computer vision is image geo-localization; here as well, we propose a novel approach which casts geo-localization as a clustering problem over local image features. Akin to existing approaches to the problem, our framework builds on low-level features which allow local matching between images. We cluster features from reference images using Dominant Set clustering, which affords several advantages over existing approaches. First, it permits a variable number of nodes in the cluster, which we use to dynamically select the number of nearest neighbors for each query feature based on its discrimination value. Second, this approach is several orders of magnitude faster than existing approaches. Finally, we use multiple weak solutions through constrained Dominant Set clustering on global image features, where we enforce the constraint that the query image must be included in the cluster. This second level of clustering also bypasses heuristic approaches to voting and selecting the reference image that matches the query. We evaluated the proposed framework on an existing and a new dataset and showed that it outperforms state-of-the-art approaches by a large margin.
Finally, we present a unified approach for simultaneous clustering and outlier detection in data. We utilize some properties of a family of quadratic optimization problems related to dominant sets. Unlike most, if not all, of the previous techniques, in our framework the number of clusters arises intuitively and outliers are obliterated automatically. The resulting algorithm discovers both parameters (the number of clusters and of outliers) from the data. Experiments on real and large-scale synthetic datasets demonstrate the effectiveness of our approach and the utility of carrying out clustering and outlier detection in a concurrent manner.
Acknowledgments
First and foremost, I thank Jehovah God for everything he has done for me.
I would like to express my gratitude to my supervisor Prof. Andrea Prati, for his scien-
tific guidance and support during my Ph.D. studies.
My sincere thanks also go to my co-authors: Prof. Marcello Pelillo, for his insightful comments, and Dr. Mubarak Shah, who provided me with the opportunity to join his team, CRCV at the University of Central Florida, as a research scholar for more than a year; without his precious support, great advice, and fruitful discussions, it would not have been possible to conduct most of my research.
I thank my external reviewers: Prof. Lamberto Ballan, and Prof. Christian Mich-
eloni for the time they spent on carefully reading the thesis and for their insightful
comments and suggestions.
I also thank my friends Josh, Leul, Sure, Tedo, Tinsu, Siyum ... for all the fun times we had and the moments we spent at 'BAUM', Campo, coffee breaks and so on. In particular, I am grateful to Josh for the stimulating discussions, for the sleepless nights we spent working together before deadlines, and for all the fun we had on our trips.
Last but not least, I would like to thank my family: my parents (Seni and Tare), my brother (Beck), and my aunty (Tiz) for supporting me spiritually throughout my Ph.D. study and my life in general. This is the fruit of your love, support, guidance and
prayers.
Preface
Introduction
cluster is formed, sensitivity of the clustering technique to changes that do not affect
the structure of data, and the way the data was structured.
As there does not exist a universal clustering approach which can be applied to all kinds of problems and data modalities, we need to find the right clustering approach based on the problem at hand. The focus of this dissertation is to propose unified, novel and robust approaches for problems in multi-target tracking, geo-localization and outlier detection. In all the proposed frameworks, we use the same graph-theoretic clustering approach, namely dominant set clustering (DSC), and its extension, the constrained dominant set clustering framework. The intrinsic properties of these approaches make them very suitable for the above-mentioned problems.
The dominant set clustering framework was first introduced by Pavan and Pelillo [110, 112], who showed its effectiveness in the areas of segmentation, image retrieval and group detection. Dominant set clustering is a graph-theoretic approach to pairwise data clustering which is motivated by the analogies between the intuitive concept of a cluster and that of a dominant set of vertices. It generalizes the maximal clique problem to edge-weighted graphs. Its correspondence with a linearly constrained quadratic optimization program over the standard simplex allows the use of straightforward and easily implementable continuous optimization techniques from evolutionary game theory. We follow a peel-off strategy to enumerate all possible clusters, that is, at each iteration we remove the extracted cluster from the graph. Even though it has several advantages over other clustering approaches, the iterative approach we follow to extract clusters causes a change in the scale of the problem, meaning that we do not have a theoretical guarantee that clusters found at consecutive iterations are local solutions of the original graph.
The constrained dominant set clustering framework was first proposed in [183]. It is a parametrized version of standard quadratic optimization. By properly controlling a regularization parameter which determines the structure and the scale of the underlying problem, we are able to extract groups of dominant-set clusters which are constrained to contain user-selected elements. A standard algorithm to extract constrained dominant sets from a graph is given by the so-called replicator dynamics, whose computational complexity is quadratic per step, which makes it impractical for large-scale applications. In this work, we propose a novel fast algorithm, based on dynamics from evolutionary game theory, which is efficient and scalable to large-scale real-world applications.
In this thesis, we aim to address problems in multi-target tracking in single and multiple cameras, large-scale image geo-localization, and outlier detection by proposing several new algorithms, which utilize the above-mentioned clustering frameworks.
Firstly, we propose a novel and efficient single-camera multi-target tracking approach, which formulates the tracking task as finding dominant sets in an auxiliary undirected edge-weighted graph. The nodes in the graph represent detection responses from consecutive frames and the edge weights depict the similarity between detection responses. In this formulation, the extracted cluster (dominant set) represents the trajectory of a target across consecutive frames. Unlike most techniques, which are limited in temporal locality (i.e., only a few frames are considered), we utilize pairwise relationships (in appearance and position) between different detections across the whole temporal span of the video for data association in a global manner. Meanwhile, a temporal sliding
window technique is utilized to find tracklets and perform further merging on them. Our robust tracklet merging step makes our tracker more robust to long-term occlusions.
Even though the approach shows competitive results against several state-of-the-art approaches, it suffers from the shortcomings of the underlying clustering framework, i.e., dominant sets. To solve the limitations of the above approach and extend multi-target tracking across multiple non-overlapping cameras, we propose a robust and unified three-layered hierarchical approach. We first determine tracks within each camera by solving data association, and later we associate tracks of the same person in different cameras in a unified approach, hence solving the across-camera tracking. Since appearance and motion cues of a target tend to be consistent over a short temporal window in single-camera tracking, tracklets are first generated within short temporal windows and later merged to form full tracks (or trajectories). To best serve our purpose, a constrained dominant set clustering (CDSC) technique is employed to solve both tracking tasks. The tracking problem is cast as finding constrained dominant sets from a graph. That is, given a constraint set and a graph, CDSC generates a cluster (or clique), which forms a compact and coherent set that contains all or part of the constraint set. Clusters represent tracklets and tracks in the first and second layers, respectively. The proposed within-camera tracker can robustly handle long-term occlusions, does not change the scale of the original problem as it does not remove nodes from the graph during the extraction of compact clusters, and is several orders of magnitude faster (close to real time) than existing methods. Also, the proposed across-camera tracking method, using CDSC followed by a refinement step, offers several advantages. More specifically, CDSC not only considers the affinity (relationship) between tracks observed in different cameras, but also considers the affinity among tracks from the same camera. Consequently, the proposed approach not only accurately associates tracks from different cameras but also makes it possible to link multiple short broken tracks obtained during within-camera tracking, which may belong to a single target track.
Next, we present a novel approach for the challenging problem of large-scale image geo-localization using image matching against a structured database of city-wide reference images with known GPS coordinates. This is done by finding correspondences between local features of the query and reference images. We first introduce automatic Nearest Neighbor (NN) selection into our framework, by exploiting the discriminative power of each NN feature and employing a different number of NNs for each query feature. That is, if the distances between a query feature and its reference NNs are similar, then the NNs are ambiguous, and we use several of them so that the optimization is afforded more choices to select the correct match. On the other hand, if a query feature has very few low-distance reference NNs, then we use fewer NNs to save computation. Thus, for some cases we use fewer NNs, while for others we use more, requiring on average approximately the same amount of computation but nonetheless improving performance. This also bypasses the manual tuning of the number of NNs to be considered, which can vary between datasets and is not straightforward. Next, we cluster features from reference images using Dominant Set clustering, which possesses several advantages over existing approaches. First, it permits a variable number of nodes in the cluster. Second, this approach is several orders of magnitude faster than existing approaches. Finally, we use multiple weak solutions through constrained Dominant
Set clustering on global image features, where we enforce the constraint that the query
image must be included in the cluster. This second level of clustering also bypasses
heuristic approaches to voting and selecting the reference image that matches the
query.
Finally, we present a modified dominant set clustering approach for simultaneous clustering and outlier detection from data (SCOD). Unlike most approaches, our method requires no prior knowledge of either the number of clusters or the number of outliers, which makes it more convenient for real applications. A naive way to apply dominant set clustering would be to set a threshold, say on cluster size, and label clusters smaller than the threshold as outliers. However, in the presence of many cluttered noise points (outliers) with a uniformly distributed similarity (very small internal coherency), the dominant set framework extracts the whole set as one big cluster; that is, cluster-size thresholding is handicapped in such cases. Thus, what is required is a more robust technique that can gracefully handle outlier clusters of different size and cohesiveness. The dominant set framework naturally provides a principled measure of a cluster's cohesiveness as well as a measure of each vertex's participation in its group (cluster). By virtue of this feature of the framework, we propose a technique which simultaneously discovers clusters and outliers in a computationally efficient manner.
In summary, this dissertation makes the following important contributions:
• In the proposed Dominant Set clustering based single-camera multi-target tracker, we formalize the tracking problem as finding dominant sets in a graph, where each cluster represents the trajectory of a single target across consecutive frames; clusters are extracted one by one iteratively, which causes a change in the scale of the problem.
• To address the limitations of the above approach and extend it to multiple non-overlapping cameras, we propose a new technique based on Constrained Dominant Set Clustering; this approach not only addresses the limitations of our previous approach but also handles tracking across multiple non-overlapping cameras in a unified manner.
• The dissertation also includes a novel approach to another very challenging task in computer vision, large-scale image geo-localization. This framework, also based on Dominant Set clustering and its extensions, relies on image matching: we first collect candidate matches based on local feature matching and later use Constrained Dominant Set Clustering to decide the best match among the short-listed candidates, unlike most previous approaches, which use a simple voting scheme.
• Finally, the dissertation proposes a novel approach, based on some properties of the dominant set clustering framework, for simultaneous clustering and outlier detection. Unlike most, if not all, previous approaches, the proposed method does not require any prior knowledge of either the number of clusters or the number of outliers.
The rest of the dissertation is structured as follows: in Chapter 1, we briefly introduce the dominant set and constrained dominant set clustering frameworks and present the newly proposed fast approach to extract constrained dominant sets. In Chapter 2, we present a new approach for multi-target tracking using Dominant Sets in a single-camera scenario. In Chapter 3, we present a robust multi-target multi-camera tracking approach using constrained dominant sets. A new approach for large-scale image geo-localization is presented in Chapter 4. Finally, in Chapter 5, a novel method is proposed for detecting outliers while extracting compact clusters.
Contents
6 Conclusion 79
Bibliography 81
List of Figures
1.1 Dominant set example: (a) shows a compact set (dominant set), (b) node
4 is added which is highly similar to the set {1,2,3} forming a new com-
pact set. (c) Node 5 is added to the set which has very low similarity
with the rest of the nodes and this is reflected in the value W_{{1,2,3,4,5}}(5). 3
2.2 Toy example of sliding window approach. Bounding boxes denote de-
tection responses, and the different colors represent different targets.
The sliding window moves one frame per step. This figure is best
viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Sample Tracking result on ETH-Sunny day sequence (first row repre-
sents our results, while second row represents the ground truth). This
figure is best viewed in color. . . . . . . . . . . . . . . . . . . . . . . 26
3.1 A general idea of the proposed framework. (a) First, tracks are deter-
mined within each camera, then (b) tracks of the same person from
different non-overlapping cameras are associated, solving the across-
camera tracking. Nodes in (a) represent tracklets and nodes in (b) repre-
sent tracks. The i-th track of camera j, T_j^i, is a set of tracklets that form a
clique. In (b) each clique in a different color represents the track of the same
person in non-overlapping cameras. Similar colors represent the same
person. (Best viewed in color) . . . . . . . . . . . . . . . . . . . . . . 28
3.2 The figure shows within-camera tracking, where short-tracklets from different segments are used as input to the first layer of our tracking. The resulting tracklets from the first layer are the inputs to the second layer, which determines a track for each person. The three dark green short-tracklets (s_1^2, s_1^10, s_1^7), shown by the dotted ellipse in the first layer, form a cluster resulting in tracklet t_1^2 in the second layer, as shown by the black arrow. In the second layer, each cluster, shown in purple, green and dark red colors, forms the track of a different target, as can be seen in the top row. Tracklets and tracks with the same color indicate the same target. The two green cliques (with two and three tracklets) represent the tracks of the person going in and out of the building (tracks T_1^p and T_1^2, respectively). . . . . . . . . . . . . . . . 33
3.3 Exemplar graph of tracks from three cameras. T_j^i represents the i-th track of camera j. Black
and colored edges, respectively, represent within- and across-camera relations of tracks. Colours
of the nodes depict track IDs, nodes with similar colour represent tracks of the same person, and
the thick lines show both within- and across-camera association. . . . . . . . . . . . . 36
3.4 Camera topology for DukeMTMC dataset. Detections from the overlap-
ping fields of view are not considered. More specifically, intersection
occurred between camera (8 & 2) and camera (5 & 3). . . . . . . . . . 39
3.5 Sample qualitative results of the proposed approach on DukeMTMC dataset. Bounding boxes
and lines with the same color indicate the same target (Best viewed in color). . . . . . . . 42
3.6 The results show the performance of our algorithm on MARS (both us-
ing CNN + XQDA) when the final ranking is done using membership
score (left) and using pairwise euclidean distance (right). . . . . . . . 43
3.7 CPU time taken for each track association using our proposed fast ap-
proach (FCDSC - fast CDSC) and CDSC. . . . . . . . . . . . . . . . . 44
3.8 The ratio of CPU time taken between CDSC and proposed fast approach
(FCDSC), computed as CPU time for CDSC/CPU time for FCDSC. . . 45
4.3 Exemplar output of the dominant set framework: Left: query, Right:
each row shows corresponding reference images of the first, second and
third local solutions (dominant sets), respectively, from top to bottom.
The number under each image shows the mode of the matched refer-
ence image, while those on the right of each image show the min-max
normalized scores of HSV, CNN6, CNN7 and GIST global features, re-
spectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Exemplar graph for post processing. Top left: reduced graph for Fig. 4.2
which contains unique matched reference images. Bottom left: Part of
the full graph which contains the gray circled nodes of the reduced graph
and the query. Top right: corresponding affinity of the reduced graph.
Bottom right: the outputs of the nearest neighbor approach, which considers only
the node's pairwise similarity (KNN(Q) = node 3, the dark red
node) and constrained dominant sets approach (CDS(Q) = node 2 which
is the yellow node). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 The top four rows are sample street view images from eight different
places of WorldCities dataset. The bottom two rows are sample user
uploaded images from the test set. . . . . . . . . . . . . . . . . . . . . 60
4.6 Comparison of our baseline (without post processing) and final method,
on overall geo-localization results, with state-of-the-art approaches on
the first dataset (102K Google street view images). . . . . . . . . . . . 62
4.7 The ratio of CPU time taken between GMCP based geo-localization
[178] and our approach, computed as CPU time for GMCP/CPU time
for DSC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.8 Comparison of overall geo-localization results using DSC with and with-
out post processing and state-of-the-art approaches on the WorldCities
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.9 Sample qualitative results taken from Pittsburgh area. The green ones
are the ground truth while yellow locations indicate our localization results. 65
4.10 Geo-localization results using different number of NN . . . . . . . . . 65
4.11 The effectiveness of constrained dominant set based post processing step
over simple voting scheme. . . . . . . . . . . . . . . . . . . . . . . . 66
4.12 Comparison of geo-localization results using different global features
for our post processing step. . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Exemplar plots: Left: Original data points with different colors which show possible clusters. Right: Extracted clusters and their cohesiveness measures, C with the affinity (A) and CL with the learned affinity (S) . . . . . . . . . . . . . . . 73
5.2 Results of the algorithm on synthetic datasets with respect to increasing
number of outliers (l). While fixing parameters as k = 10, m = 100, d =
32, and σ = 0.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Results of the algorithm on synthetic datasets with respect to increasing
dimension (d). While fixing parameters as k = 10, m = 100, l = 100, and
σ = 0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Results of the algorithm on synthetic datasets with respect to the stan-
dard deviation used to generate the data (σ). While fixing parameters as
k = 10, m = 100, l = 100, and d = 32 . . . . . . . . . . . . . . . . . . . 75
List of Tables
3.1 The results show detailed (for each camera C1 to C8) and average perfor-
mance (Av) of our and state-of-the-art approach [128] on the Test-Easy
sequence of DukeMTMC dataset. . . . . . . . . . . . . . . . . . . . . 40
3.2 The results show detailed (for each camera) and average performance
of our and state-of-the-art approach [128] on the Test-Hard sequence of
DukeMTMC dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Multi-camera performance of our and state-of-the-art approach [128] on
the Test-Easy sequence of DukeMTMC dataset. . . . . . . . . . . . . . 41
3.4 Multi-Camera performance of our and state-of-the-art approach [128]
on the Test-Hard sequence of DukeMTMC dataset. . . . . . . . . . . . 42
3.5 The table shows the comparison (based on rank-1 accuracy) of our ap-
proach with the state-of-the-art approaches: SDALF [50], HLBP [168],
BoW [190], BCov [97], LOMO [90], HOG3D [82] on MARS dataset. . 43
3.6 The results show the performance of our approach (using pairwise distance and mem-
bership score) and state-of-the-art approach [189] in solving within- and
across-camera ReID using average precision on MARS dataset using
CNN feature and different distance metrics. . . . . . . . . . . . . . . . 43
4.1 Results of the experiment, done on the 102k Google street view images
(Dts1) and WorldCities (Dts2) datasets, to see the impact of the post-
processing step when the candidates of reference images are obtained
by other image retrieval algorithms . . . . . . . . . . . . . . . . . . . 64
CHAPTER 1
Graph Theoretic Based Clustering Framework
The problem of clustering consists in organizing a set of objects into clusters (groups, subsets, or categories) in such a way that objects in the same cluster are similar to one another (internal homogeneity), while dissimilar objects are organized into different clusters (external inhomogeneity).
In particular, clustering algorithms are distinguished into two main classes: feature-based and similarity-based (pairwise) clustering. In feature-based clustering, objects are represented as points in a metric space, with distances reflecting the dissimilarity relations, while in similarity-based clustering objects are described indirectly by their respective similarity relations. In numerous real-world application domains it is not possible to find satisfactory features, and it is more natural to provide a measure of similarity instead; for example, when features consist of both continuous and categorical variables, or when the objects to be classified are represented in terms of graphs or structural representations.
A classical approach to pairwise clustering uses concepts and algorithms from graph
theory [46, 71]. Certainly, it is natural to map the data to be clustered to the nodes of
a weighted graph with edge weights representing similarity relations. These methods
are of significant interest since they cast clustering as pure graph-theoretic problems for
which a solid theory and powerful algorithms have been developed.
Graph-theoretic algorithms basically consist of searching for certain combinatorial structures in the similarity graph, such as a minimum spanning tree [175], a minimum cut [166], or a complete subgraph (clique) [71]. The authors of [7, 124] argue that the maximal clique is the strictest definition of a cluster. The authors of [110, 112] proposed the dominant set clustering framework, which generalizes the maximal clique problem to edge-weighted graphs, and later the authors of [183] proposed an approach which generalizes the dominant set framework.
In this dissertation, we use the Dominant Set clustering approach and its extension, Constrained Dominant Set clustering, to solve different problems in computer vision and pattern recognition. In the rest of this chapter, we briefly introduce these clustering frameworks and present the newly proposed faster approach, linear in complexity, to extract constrained dominant sets from a graph.
The data to be clustered are represented as an undirected edge-weighted graph G = (V, E, ω) with no self-loops, where V is the set of nodes (the objects to be clustered), E ⊆ V × V is the edge set and ω : E → R^+ is the (positive) edge-weight function. The graph is represented by its n × n weighted adjacency (similarity) matrix A = (a_ij), defined as

a_ij = ω(i, j)  if i ≠ j and (i, j) ∈ E,    a_ij = 0  otherwise.    (1.1)
Since there are no self-loops in the graph G, all entries on the main diagonal of A are zero. In an attempt to formally capture the notion of a cluster, we need some notation and definitions. Let S ⊆ V be a non-empty subset of vertices and let i ∈ V. The (average) weighted degree of i w.r.t. S is defined as:

AWDeg_S(i) = (1/|S|) Σ_{j∈S} a_ij    (1.2)
Moreover, when j ∉ S, we can measure the similarity between nodes j and i, relative to the average similarity between node i and its neighbors in S, as:

φ_S(i, j) = a_ij − AWDeg_S(i)    (1.3)

The weight of i w.r.t. S is then defined recursively as:

w_S(i) = 1   if |S| = 1,
w_S(i) = Σ_{j∈S\{i}} φ_{S\{i}}(j, i) w_{S\{i}}(j)   otherwise.    (1.4)
The total weight of the set S is then defined as W(S) = Σ_{i∈S} w_S(i). This recursive characterization of the weights provides a measure of the overall similarity between a vertex i and the vertices of S \ {i}, relative to the overall similarity among the vertices of S \ {i}. Based on it, [110, 112] characterize a non-empty set S ⊆ V as a dominant set if it satisfies the following two conditions:

1. w_S(i) > 0, for all i ∈ S (internal homogeneity);
2. w_{S∪{i}}(i) < 0, for all i ∉ S (external inhomogeneity).
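To make the recursive characterization concrete, the following is a minimal illustrative sketch in Python (it is not the thesis implementation; it assumes a symmetric affinity matrix with zero diagonal, and the toy weights are only loosely inspired by Figure 1.1):

import numpy as np

def awdeg(A, S, i):
    """Average weighted degree of node i w.r.t. the set S (eq. 1.2)."""
    S = list(S)
    return A[i, S].sum() / len(S)

def phi(A, S, i, j):
    """Relative similarity of node j w.r.t. node i and the set S (eq. 1.3)."""
    return A[i, j] - awdeg(A, S, i)

def w(A, S, i):
    """Weight of node i w.r.t. the set S (eq. 1.4), computed recursively."""
    S = set(S)
    if len(S) == 1:
        return 1.0
    R = S - {i}                                     # S \ {i}
    return sum(phi(A, R, j, i) * w(A, R, j) for j in R)

def total_weight(A, S):
    """Total weight W(S) as the sum of the node weights w_S(i)."""
    return sum(w(A, S, i) for i in S)

# toy affinity over 5 nodes: nodes 0-3 are mutually similar, node 4 is not
A = np.array([[ 0., 30., 21., 30.,  1.],
              [30.,  0., 22., 35.,  1.],
              [21., 22.,  0., 20.,  1.],
              [30., 35., 20.,  0.,  1.],
              [ 1.,  1.,  1.,  1.,  0.]])
print(total_weight(A, {0, 1, 2, 3}))   # cohesive set: positive total weight
print(w(A, {0, 1, 2, 3, 4}, 4))        # loosely attached node: negative weight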
Figure 1.1: Dominant set example: (a) shows a compact set (dominant set), (b) node 4 is added which
is highly similar to the set {1,2,3} forming a new compact set. (c) Node 5 is added to the set which
has very low similarity with the rest of the nodes, and this is reflected in the value W_{{1,2,3,4,5}}(5) < 0.
The main result presented in [110, 112] establishes an intriguing one-to-one correspondence between dominant sets and strict local maximizers of the following standard quadratic program:

maximize  f(x) = x^T A x    subject to  x ∈ Δ    (1.5)

where Δ = {x ∈ R^n : Σ_i x_i = 1 and x_i ≥ 0 for all i = 1...n} is the standard simplex. In particular, if S is a dominant set, then its weighted characteristic vector, which assigns to each i ∈ S the value w_S(i)/W(S) and zero to all vertices outside S, is a strict local maximizer of (1.5).
Conversely, under mild conditions, it turns out that if x is a strict local solution of
problem (1.5), then its "support" S = {i ∈ V : x_i > 0} is a dominant set. By
virtue of this result, we can find a dominant set by first localizing a solution of problem
(1.5) with an appropriate continuous optimization technique, and then picking up the
support set of the found solution. In this sense, we indirectly perform combinatorial
optimization via continuous optimization.
The standard approach for finding the local maxima of problem (1.5), as used in [112], is to use the replicator dynamics, a well-known family of algorithms from evolutionary game theory inspired by Darwinian selection processes. Let A be a non-negative real-valued n × n matrix; the discrete version of the dynamics is defined as:

x_i(t + 1) = x_i(t) (A x(t))_i / (x(t)^T A x(t)),   i = 1, ..., n.
6: end if
7: x ← δ(y − x) + x
8: end while
9: return x
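To illustrate how a local solution of problem (1.5) is obtained in practice, here is a minimal sketch of the discrete replicator dynamics started from the barycenter of the simplex (an illustrative implementation with an assumed tolerance and toy affinity, not the code used in the experiments):

import numpy as np

def replicator_dynamics(A, x=None, tol=1e-7, max_iter=10000):
    """Discrete replicator dynamics: x_i <- x_i * (A x)_i / (x^T A x).

    For a non-negative symmetric payoff matrix A the iteration stays in the
    simplex and does not decrease the objective x^T A x."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n) if x is None else np.asarray(x, dtype=float)
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# the support of the limit point (nodes with positive membership score)
# corresponds to a dominant set of the graph
A = np.array([[ 0., 30., 21., 30.,  1.],
              [30.,  0., 22., 35.,  1.],
              [21., 22.,  0., 20.,  1.],
              [30., 35., 20.,  0.,  1.],
              [ 1.,  1.,  1.,  1.,  0.]])
x = replicator_dynamics(A)
print(np.where(x > 1e-6)[0], x)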
By reiterating this process of infection and immunization, the dynamics drives the population to a state that cannot be infected by any other strategy. If this is the case, then x is an equilibrium, or fixed point, of the dynamics. The refinement loop of Algorithm 1 controls the number of iterations, allowing them to continue until x is within the range of the tolerance τ, which we empirically set to 10^-7. The range ε(x) is computed as:

ε(x) = Σ_{i∈J} min(x_i, (Ax)_i − x^T A x)^2.
The constrained dominant set framework is based on the following parametrized family of quadratic programs:

maximize  f_Q^α(x) = x^T (A − α I_Q) x    subject to  x ∈ Δ    (1.7)

where Δ = {x ∈ R^n : Σ_i x_i = 1 and x_i ≥ 0 for all i = 1...n}, x contains a membership score for each node, Q ⊆ V is the constraint set, and I_Q is the n × n diagonal matrix whose diagonal elements are set to 1 in correspondence to the vertices contained in V \ Q (the set V without the elements of Q) and to zero otherwise.
Let Q ⊆ V, with Q ≠ ∅, and let α > λ_max(A_{V\Q}), where λ_max(A_{V\Q}) is the largest eigenvalue of the principal submatrix of A indexed by the elements of V \ Q. If x is a local maximizer of f_Q^α in Δ, then σ(x) ∩ Q ≠ ∅, where σ(x) = {i ∈ V : x_i > 0}.
The above result provides us with a simple technique to determine dominant set clusters containing user-specified query vertices Q. Indeed, if Q is the set of vertices selected by the user, by setting

α > λ_max(A_{V\Q}),    (1.8)

we are guaranteed that all local solutions of (1.7) will have a support that necessarily contains elements of Q.
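As an illustration of this result, a minimal sketch that enforces a constraint set Q by shifting the diagonal of A and reusing the replicator_dynamics sketch above (the constant shift that keeps the payoffs non-negative is a standard trick and an assumption of this sketch, not a step stated in the thesis):

import numpy as np
# reuses the replicator_dynamics function from the previous sketch

def constrained_dominant_set(A, Q, eps=1e-6):
    """Extract a cluster whose support is guaranteed to intersect the constraint
    set Q, by maximizing x^T (A - alpha * I_Q) x over the simplex (problem 1.7)."""
    n = A.shape[0]
    out = np.ones(n, dtype=bool)
    out[list(Q)] = False                                          # vertices of V \ Q
    I_Q = np.diag(out.astype(float))
    alpha = np.linalg.eigvalsh(A[np.ix_(out, out)]).max() + eps   # eq. (1.8)
    B = A - alpha * I_Q
    # adding a constant to every entry leaves the maximizers over the simplex
    # unchanged and keeps the payoff matrix non-negative for the dynamics
    shift = min(B.min(), 0.0)
    x = replicator_dynamics(B - shift)
    return np.where(x > 1e-6)[0], x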
(InfImDyn). InfImDyn solves the problem in linear time. However, it needs the whole affinity matrix to extract a dominant set, which, more often than not, lies in a local range of the whole graph. Dominant Set Clustering (DSC) [112] is an iterative method which, at each iteration, peels off a cluster by running the replicator dynamics until convergence. The efficient out-of-sample extension of dominant sets [111] is another approach used to reduce the computational cost: it samples the nodes of the graph using a given sampling rate, which affects the efficacy of the framework. Liu et al. [93] proposed an iterative clustering algorithm which operates in two steps, Shrink and Expansion. These steps help reduce the runtime of replicator dynamics on the whole data, which might be slow. The approach has many limitations, such as its preference for sparse graphs with many small clusters, and its results are sensitive to several additional parameters. Another approach which tries to reduce the computational complexity of the standard quadratic program (StQP [14]) is proposed in [35].
All the above formulations, with their limitations, try to minimize the computational complexity of the StQP using the standard game dynamics, whose complexity is O(n^2) per iteration.
In this work, we propose a fast approach (listed in Algorithm 2), based on the InfImDyn approach which solves the StQP in O(n), for the recently proposed formulation x^T(A − α I_Q)x, which of course generalizes the StQP.
InfImDyn is a game dynamics inspired by evolutionary game theory. The dynamics extracts a dominant set using a two-step approach (infection and immunization) that iteratively increases the compactness measure of the objective function by driving (probability) distributions with lower payoff to extinction. By determining an infective distribution y ∈ Δ that satisfies the inequality (y − x)^T Ax > 0, the dynamics linearly combines the two distributions (x and y), thereby engendering a new population z which is immune to y and guarantees a maximum increase in the expected payoff. In our setting, given a set of instances (tracks, tracklets) and their affinity, we first assign all of them an equal probability (a distribution at the centre of the simplex, a.k.a. barycenter). The dynamics then drives the components of the initial distribution with lower affinity to extinction; those with higher affinity start getting higher values, while the others get lower ones. A selective function, S(x), is then run to check whether there is any infective distribution, i.e., a distribution which contains instances with a better association score. By iterating this process of infection and immunization, the dynamics is said to reach equilibrium when the population is driven to a state that cannot be infected by any other distribution, that is, when there is no distribution whose support contains a set of instances with a better association score. The selective function, however, needs the whole affinity matrix, which makes InfImDyn inefficient for large graphs. We propose an algorithm that reduces the search space using the Karush-Kuhn-Tucker (KKT) conditions of the constrained quadratic optimization, effectively enforcing the user constraints. In the constrained optimization framework [183], the algorithm computes the eigenvalue of the submatrix for every extraction of a compact set containing the user constraint set. Computing eigenvalues for large graphs is computationally intensive, which makes the whole algorithm inefficient.
In our approach, instead of running the dynamics over the whole graph, we localize it on a sub-matrix, selected using the dominant distribution, that is much smaller than the original one. To alleviate the issue with the eigenvalues, we utilize the properties of
eigenvalues: a good approximation for the parameter α is the maximum degree of the graph, which is of course larger than the largest eigenvalue of the corresponding submatrix. The computational complexity, apart from the eigenvalue computation, is then reduced to O(r), where r, the size of the sub-matrix on which the dynamics is run, is much smaller than the size of the original affinity matrix.
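A minimal sketch of this bound (illustrative only; it relies on the standard fact that, for a non-negative symmetric matrix, the largest eigenvalue is at most the maximum row sum, and eigenvalues of principal submatrices are no larger):

import numpy as np

def alpha_upper_bound(A, eps=1e-6):
    """Cheap choice of alpha: the maximum weighted degree of the graph, which
    upper-bounds lambda_max of A and of any of its principal submatrices."""
    return A.sum(axis=1).max() + eps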
Let us summarize the KKT conditions for the quadratic program reported in eq. (1.7). By adding Lagrangian multipliers, namely n non-negative constants μ_1, ..., μ_n and a real number λ, its Lagrangian function is defined as follows:

L(x, μ, λ) = f_Q^α(x) + λ(1 − Σ_{i=1}^{n} x_i) + Σ_{i=1}^{n} μ_i x_i.

The first-order KKT conditions require that 2[(A − α I_Q)x]_i − λ + μ_i = 0 for all i = 1...n, together with the complementary slackness condition Σ_{i=1}^{n} x_i μ_i = 0. Since both the x_i and the μ_i values are non-negative, the latter condition is equivalent to saying that i ∈ σ(x) implies μ_i = 0, from which we obtain:

[(A − α I_Q)x]_i = λ/2,   if i ∈ σ(x)
[(A − α I_Q)x]_i ≤ λ/2,   if i ∉ σ(x)    (1.9)
Let the "support" of x be σ(x) = {i ∈ V : x_i > 0} and let e_i denote the i-th unit vector (a zero vector whose i-th element is one).

Proposition 1. Given an affinity A and a distribution x ∈ Δ, if (Ax)_i > x^T A x − α x_Q^T x_Q for some i ∉ σ(x), then:
1. x is not a maximizer of the parametrized quadratic program (1.7);
2. e_i is a dominant distribution for x.

Proof. To show that the first claim holds, assume that x is a KKT point. Then

x^T (A − α I_Q) x = Σ_{i=1}^{n} x_i [(A − α I_Q)x]_i
and, by (1.9), each term with x_i > 0 equals x_i λ/2, so that x^T (A − α I_Q) x = λ/2 since Σ_i x_i = 1. For i ∉ σ(x), x_i = 0 implies [(A − α I_Q)x]_i = (Ax)_i, hence by (1.9)

(Ax)_i ≤ λ/2 = x^T (A − α I_Q) x,

which shows that the assumed inequality cannot hold and concludes the proof of the first claim.
For the second claim, e_i is a dominant distribution for x if it satisfies the inequality

e_i^T (A − α I_Q) x > x^T (A − α I_Q) x.

Since i ∉ σ(x), the left-hand side reduces to (Ax)_i, and the assumption of the proposition guarantees that the inequality holds.
The selected dominant distribution always increases the value of the objective function. Moreover, the objective function is bounded, which guarantees the convergence of the algorithm.
1.4 Summary
In the first part of this chapter, we briefly introduced the dominant set clustering framework, a pairwise clustering approach based on the notion of a dominant set, which can be seen as an edge-weighted generalization of a clique. In the second part of the chapter, we discussed the constrained dominant set approach, which finds a collection of dominant set clusters constrained to contain user-defined elements. The approach is based on some properties of a family of quadratic optimization problems related to dominant sets, which show that, by properly selecting a regularization parameter that controls the structure of the underlying function, we are able to "force" all solutions to contain user-specified elements. Finally, we presented a novel fast approach to extract constrained dominant sets from a graph.
CHAPTER 2
Multi-Object Tracking Using Dominant Sets
2.1 Introduction
In order to ensure the security of people and places by means of intelligent video surveillance, the current trend is to increase the number of installed cameras, both as a deterrent and to better monitor the surveyed area. As a consequence, we have witnessed an exponential increase in the amount of data to be watched and stored, inevitably requiring automatic processing of the videos for scene analysis and understanding, both for real-time analysis and a-posteriori mining. In this way, the large variety of video data provided by the installed cameras is automatically analyzed for event detection, object and people tracking, as well as behaviour analysis. These capabilities offer valid support to investigations and crime detection.
Tracking targets is a challenging task: variations in the type of camera, lighting conditions, scene settings (e.g., crowds or occlusions), noise in the images, the variable appearance of moving targets and the point of view of the camera must all be accounted for. Following multiple targets while robustly maintaining data association remains a largely open problem. In [173] a large experimental survey of various tracking approaches is presented, evaluating the suitability of each approach in different situations and under different constraints (e.g., assumptions on the background, motion model, occlusions, etc.).
In recent years, due to significant improvements in object detection, several researchers have proposed tracking methods that associate detection responses into tracks, also referred to as Association Based Tracking (ABT) techniques [3, 114, 116, 167, 170, 176]. An offline-trained people detector is used to generate detection responses; tracklets are then produced by linking these responses and are further associated into longer tracks. Similarity between tracklets (i.e., the linking probability) is based on motion smoothness and appearance similarity. The Hungarian algorithm is often used to find the global optimum [114, 167]. Compared to generative models, ABT is powerful in assessing the presence of objects in the scene, since it uses discriminatively trained detectors and needs no manual initialization. Association-based approaches can handle long-term occlusions between targets, and their complexity is polynomial in the number of targets present. In order to differentiate between different targets, speed and distance between tracklet pairs are often used as motion descriptors, whereas appearance descriptors are often based on global or part-based color histograms. Nevertheless, how to deal with motion in moving cameras and how to better distinguish nearby targets remain key issues that limit the performance of ABT.
In most of the ABT works [114,165], the affinity score between detection responses
is computed once and kept fixed for all the later processes. Conversely, in this proposal
we develop a more flexible approach where associations are made at two levels and the
affinity measure is iteratively refined based on the knowledge retrieved from the previ-
ous level. Moreover, most of the previous methods [9, 15, 76, 142] use locally-limited
temporal information by focusing only on the pairwise relationship of the tracklets,
rather than applying data association among multiple tracklets over the whole video in
a global manner. As a consequence, existing approaches are prone to identity switches
(i.e., assigning different labels/IDs to the same target) in cases where targets with small
spatial and appearance differences move together (which are rather common cases in
real security videos).
In this chapter, a dominant set clustering based tracker is proposed, which formu-
lates the tracking task as the problem of finding dominant set clusters in an auxiliary
undirected edge-weighted graph. Unlike the previous approaches, the proposed method
for data association combines both appearance and position information in a global
manner. Object appearance is modelled with a 9 × 9 covariance matrix feature descriptor [118], and the relative position between targets, which is less influenced by camera motion (viewing angle), is also computed. Since all the targets shift together when the camera moves, this motion feature is quite invariant to camera movements, making it a suitable representation applicable also to videos acquired by a moving camera.
Then, a temporal sliding window technique is utilized to find tracklets and perform
further merging on them. More specifically, given detection responses found along the
temporal window, we will represent them as a graph in which all the detections in each
frame are connected to all the other detections in the other frames, regardless of their
closeness in time, and the edge weight depicts both appearance and position similarity
between nodes.
A two-level association technique follows: first, low-level association is performed
linking detection responses of the last two consecutive frames of the window, which
helps differentiate difficult pairs of targets in a crowded scene; then, a global asso-
ciation is performed to form tracklets along the temporal window. Finally, different
tracklets of the same target are merged to obtain the final trajectory.
The main contributions presented in this chapter are:
• this chapter presents the first example of the use of dominant set clustering for
data association in multi-target tracking;
• the two-level association technique proposed in this chapter allows us to consider
efficiently (also thanks to the superior performance of dominant set clustering
on identifying compact structures in graphs) the temporal window at once, per-
forming data association in a global manner; this helps in handling long-lasting
occlusions and target disapperance/reappearance more properly;
• the consensus clustering technique developed to merge tracklets of the same target
and obtain the final trajectories is firstly introduced in this work, although it has
been used in different domains;
• the proposed technique outperforms state-of-the-art techniques on various publicly-
available challenging data sets.
The rest of the chapter is organized as follows: related work is discussed in Section 2.2, while Section 2.3 details our tracking framework (DSC tracker, hereinafter). Experimental results are shown in Section 2.4, followed by a summary in Section 2.5.
occlusions are present or targets with similar spatial positions and appearances exist. On the other hand, other researchers [9, 117, 176, 186] follow methods generally termed global: recently, this class of approaches has become more popular since it removes the limited-temporal-locality assumptions, allowing more global properties of a target to be incorporated during optimization, which helps overcome problems caused by noisy detection inputs.
In [186], the data association problem is mapped into a cost-flow network with a non-overlap constraint on trajectories. The optimal solution is found by a min-cost flow algorithm on the network. In [117], the graph is defined similarly to [186], and it is shown that dynamic programming can be used to find high-quality sub-optimal solutions. The work in [9] uses a grid of potential spatial locations of a target and solves the association problem using the K-Shortest Paths algorithm. Unfortunately, their method completely ignores appearance features in the data association process, which results in unwarranted identity switches in complex scenes. To address this limitation, the proposal in [141] incorporates global appearance constraints.
The above-mentioned global methods use a simplified version of the problem by only considering the relationships of detections in consecutive frames. Despite the good performance of these methods, they fall short in identifying targets with similar appearance and spatial positions. Conversely, in recent approaches [3, 176], no simplifications in the problem formulation are made; however, the proposed solutions are approximate due to the complexity of the models. More specifically, the authors of [3] proposed a continuous energy minimization based approach in which the local minimization of a non-convex energy function is performed exploiting the conjugate gradient method and periodic trans-dimensional jumps. Compared to the aforementioned global methods, this approach is more suitable for real-world tracking scenarios. On the negative side, however, the solutions found for the non-convex function can be attracted by local minima.
Recently, in [176] the tracking problem is represented in a more complete manner, where pairwise relationships between different detections across the temporal span of the video are considered, such that a complete K-partite graph is built. A target track is represented by a clique (a subgraph in which all the nodes are connected to each other), and the tracking problem is formulated as a constrained maximum weight clique problem. However, since a greedy local neighborhood search algorithm is used to solve the problem, this approach is also prone to local minima. Moreover, due to the heuristic line-fitting approach used for outlier detection, the approach is prone to identity switches, particularly when targets in close positions move together.
In order to obtain more accurate results with more efficiency (i.e., lower latency)
it is vital to use sliding temporal windows to track targets. Moreover, in the whole
video there might be many targets most of which may not have temporal overlap. As a
consequence, performing a global optimization over the whole video frames is not only
impractical, but also inefficient due to the huge search space and long latency, and also
needless, as many targets only appear for a limited period and not over the whole course of the video.
To generate tracklets over the sliding temporal window, we use overlapping neigh-
bouring windows of frames. A toy example is given in Figure 2.2 to ease the description
of the approach. Given an input video and the detection responses found within the first
sliding window, tracking is performed within the window. Then, the window is moved
with the step of one frame at a time, so that the new window will have more than 90%
overlap with the previous one, and the detection algorithm is applied on the new frame
(number 4 in Fig. 2.2) so as to generate detection responses. At this point, a low-level
association algorithm (detailed in section 2.3.1) is employed to associate the detection
responses found in the last two frames (3 and 4) which have high appearance simi-
larities and small spatial distances. Then, a global association algorithm (see section
2.3.1) is applied over all the frames in the current window in order to create consistent
tracklets. As a result, we are able to associate efficiently and accurately the new frame
detection responses to the right tracklet.
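The overall loop can be sketched as follows (purely illustrative: the detector and the callables low_level_association, global_association and merge_tracklets are hypothetical placeholders for the steps described above and in the following subsections, not functions of the thesis code):

def track(frames, detector, low_level_association, global_association,
          merge_tracklets, window_size=10):
    """Sketch of the sliding-window tracking loop described above."""
    window = []                 # per-frame detection responses in the current window
    tracklets = []
    for frame in frames:
        window.append(detector(frame))            # detections of the new frame
        if len(window) > window_size:
            window.pop(0)                         # slide by one frame (~90% overlap)
        if len(window) >= 2:
            # low-level association on the last two frames of the window
            pairs = low_level_association(window[-2], window[-1])
            # global (dominant set) association over the whole window
            tracklets = global_association(window, pairs, tracklets)
    return merge_tracklets(tracklets)             # merge tracklets into trajectories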
It is worth noticing that selecting the size of the sliding window is crucial. In fact,
there is a trade-off between latency and accuracy: the bigger the size (more frames)
the more accurate the association task will be since more data are available, but at the
cost of higher latency (preventing online real-time functionality); on the other hand, the
smaller the size, the quicker the results will be available, but with lower accuracy. In
our experiments (reported in section 2.4), we select a window size between 5 and 15
frames depending on the dynamics as well as frame rate of the video: the faster the
frame rate, the bigger the size.
Figure 2.2: Toy example of sliding window approach. Bounding boxes denote detection responses, and
the different colors represent different targets. The sliding window moves one frame per step. This
figure is best viewed in color.
it may happen that some people are not visible in the last two frames of the current
window. As a toy example, the low-level association in Figure 2.2(c) on frames 3 and 4 fails to create an association for the pink person, who is visible only in frame 4 and not in frame 3. As a consequence, a global association algorithm ought to be used over the
whole window. As a result, the tracklet generated in the previous window for the pink
person (between frames 1 and 2 in Figure 2.2(b)) can be successfully associated to the
new detection in frame 4 (see Figure 2.2(d)).
Similarly to the low-level association case, we denote the detection responses to
be tracked in one temporal window of f frames as an undirected edge weighted graph
Gn = (Vn , En , ωn ) with no self loop, represented as a similarity matrix An . One critical
point in the global association algorithm is that the knowledge about both the previous
window and the low-level association is exploited to update the new association matrix
A_n. In particular, let G_o^k be the subgraph representing the tracklet of person k from the previous window, with vertex set V_o^k. Then, the similarity matrix A_n = (a_{i,j}) is computed as

a_{i,j} = 1,            if (i, j) ∈ V_o^k or (i, j) ∈ V_s^k
a_{i,j} = 0,            if (i, j) ∉ E_n
a_{i,j} = ω_n(i, j),    otherwise    (2.1)

where V_s^k represents the tracklet of target k obtained from the low-level association, and (i, j) ∈ V_o^k or (i, j) ∈ V_s^k means that both node i and node j belong to the same tracklet of the k-th target obtained from the previous window or from the low-level association, respectively.
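For illustration, a minimal sketch of this construction (the boolean inputs are assumptions used only to make the example self-contained):

import numpy as np

def build_association_matrix(omega, same_tracklet, is_edge):
    """Build A_n following eq. (2.1).

    omega[i, j]         affinity omega_n(i, j) between detections i and j
    same_tracklet[i, j] True if i and j were already assigned to the same target
                        by the previous window or by the low-level association
    is_edge[i, j]       True if (i, j) is an edge of G_n
    """
    A = np.where(same_tracklet, 1.0, np.where(is_edge, omega, 0.0))
    np.fill_diagonal(A, 0.0)        # no self-loops
    return A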
In order to capture the similarity between detections (patches) of the same target and differentiate between different targets, it is essential that the graph is built using a meaningful and robust similarity measure. A node v_j^i is associated with a location feature l_j^i, given by the 2-dimensional spatial coordinates of the bottom center and upper left corner of the corresponding detection.
Moreover, we decided to model people appearance using covariance matrix feature
descriptors [118]. This representation has been adopted in various approaches [39, 40,
94, 100] thanks to its robustness in capturing information like color, shape and location
and also to its scale and rotation invariance. Considering d different pixel features extracted independently from the image patch, the resulting covariance matrix C is a symmetric d × d matrix, a low-dimensional representation in which the diagonal entries represent the variance of each feature and the off-diagonal entries represent their correlations.
Let Im be a three-channel color image and let Y be the W × H × d dimensional image feature extracted from Im, with Y(x, y) = ρ(Im, x, y), where the function ρ can be any mapping such as gradients, color, filter responses, intensity, etc. Let {t_i}_{i=1...M} be the d-dimensional feature points inside Y, with M = W × H. The image Im is represented by the d × d covariance matrix of the feature points:

C_R = (1/(M − 1)) Σ_{i=1}^{M} (t_i − μ)(t_i − μ)^T    (2.2)

where the vector μ is the mean of the corresponding features over the points of region R.
In our case, we model each pixel within a person patch with its HSV color values, its position (x, y), the first-order derivatives G_x and G_y of the intensity computed through the Sobel operator with respect to x and y, the gradient magnitude mag(x, y) = sqrt(G_x^2 + G_y^2) and the gradient angle Θ(x, y) = arctan(G_y / G_x). Therefore, each pixel of the patch is mapped to a 9-dimensional feature vector:

t_i = [x; y; H; S; V; G_x; G_y; mag(x, y); Θ(x, y)]^T.

Based on this 9-dimensional feature representation, the covariance descriptor of a patch is a 9 × 9 matrix.
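A minimal sketch of this descriptor using OpenCV and NumPy (an illustrative implementation, not the one used in the experiments; arctan2 is used in place of arctan(G_y/G_x) to avoid division by zero):

import cv2
import numpy as np

def covariance_descriptor(patch_bgr):
    """9 x 9 covariance descriptor of a person patch (eq. 2.2) built from the
    per-pixel feature vector [x, y, H, S, V, Gx, Gy, mag, theta]."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV).astype(np.float64)
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)        # first-order derivative in x
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)        # first-order derivative in y
    mag = np.sqrt(gx ** 2 + gy ** 2)
    theta = np.arctan2(gy, gx)
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    feats = np.stack([xs, ys, hsv[..., 0], hsv[..., 1], hsv[..., 2],
                      gx, gy, mag, theta], axis=-1).reshape(-1, 9)
    return np.cov(feats, rowvar=False)            # 1/(M-1) sum (t - mu)(t - mu)^T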
The distance between covariance matrices is computed using the technique proposed in [55, 118], i.e., the sum of the squared logarithms of their generalized eigenvalues. Formally, the distance between two matrices C_j^i and C_m^n is expressed as:

ϖ(C_j^i, C_m^n) = sqrt( Σ_{k=1}^{d} ln^2 λ_k(C_j^i, C_m^n) )    (2.3)

where λ_k(C_j^i, C_m^n), k = 1, ..., d, are the generalized eigenvalues of C_j^i and C_m^n.
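A minimal sketch of this dissimilarity (illustrative; the small ridge added to the matrices is an assumption used to keep the generalized eigenvalue problem well posed):

import numpy as np
from scipy.linalg import eigh

def covariance_distance(C1, C2, ridge=1e-6):
    """Dissimilarity of eq. (2.3): the square root of the sum of the squared
    logarithms of the generalized eigenvalues of (C1, C2)."""
    d = C1.shape[0]
    lam = eigh(C1 + ridge * np.eye(d), C2 + ridge * np.eye(d), eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))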
The weight of an edge between two nodes, $\omega : E \to \mathbb{R}^+$, represents both appearance and spatial similarity (correlation) between detection responses, and the distance between two nodes is computed as:
$$D(v_j^i, v_m^n) = \sqrt{\kappa \cdot \varpi(C_j^i, C_m^n)^2 + \iota \cdot d(l_j^i, l_m^n)^2} \qquad (2.4)$$
where $d(l_j^i, l_m^n) = \|l_j^i - l_m^n\|_2$ ($l_j^i$ and $l_m^n$ being the location feature vectors of the corresponding nodes) and $\kappa$ and $\iota$ are factors controlling the contributions of appearance and location features, respectively.
The affinity between the two nodes is defined as:
$$\omega(v_j^i, v_m^n) = \exp\!\left(-\frac{D(v_j^i, v_m^n)}{2\gamma^2}\right) \qquad (2.5)$$
where $\gamma$ is the regularization (scaling) term. The smaller the $\gamma$, the less likely detection responses are to be associated, leading to fewer identity switches but more fragments (interruptions of the correct trajectory), and vice versa.
Our task of finding the tracklet of one target in one temporal window requires identifying the target in each frame and then representing the feasible solution as a dominant set (subgraph) $G_f$, in which at most one node (detection response) is selected from each frame and the selected detection responses are highly coherent and similar to each other. Therefore, we represent the dominant set as a subgraph $G_f = (V_f, E_f, \omega_f)$. We ought to note that the feasible solution (tracklet) $G_f$ only contains the detection responses of one target and not all visible detections found in the temporal window. Figure 2.3(c) shows one feasible solution found from a current temporal window of size 3.
By solving DSC for the graph $G_n$ of the global association algorithm, we end up with an optimal solution, that is, a dominant set ($\equiv$ subgraph $G_f$) containing the detections of one target, which corresponds to the feasible solution that is most consistent and coherent in both spatial and appearance features over the whole span of the temporal window. Moreover, in order to find the tracklets of all the people found in a temporal window, the optimization problem (1.5) needs to be solved several times. At each iteration, the algorithm finds the tracklet with the highest internal coherency; then, the vertices selected in $G_f$ are removed from $G_n$ and the optimization is repeated to find the tracklet of the next target, and so on, until zero or few nodes remain in $G_n$. This approach is commonly referred to as the "peeling-off strategy".
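A minimal sketch of this peeling-off procedure is given below; the plain discrete replicator iteration and the support threshold used to read off cluster members are simplifying assumptions, not the exact implementation used in this work.

```python
import numpy as np

def replicator_dynamics(A, iters=1000, tol=1e-6):
    """Discrete replicator dynamics; returns a local maximizer of x'Ax on the simplex."""
    x = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(iters):
        x_new = x * (A @ x)
        s = x_new.sum()
        if s == 0:
            break
        x_new /= s
        if np.linalg.norm(x_new - x, 1) < tol:
            x = x_new
            break
        x = x_new
    return x

def peel_off_tracklets(A, support_thresh=1e-4, min_nodes=2):
    """Repeatedly extract dominant sets (tracklets) and remove them from the graph."""
    remaining = np.arange(A.shape[0])
    tracklets = []
    while remaining.size >= min_nodes:
        x = replicator_dynamics(A[np.ix_(remaining, remaining)])
        members = remaining[x > support_thresh]   # support of the dominant set
        if members.size < min_nodes:
            break
        tracklets.append(members)
        remaining = np.setdiff1d(remaining, members)  # peel off and repeat
    return tracklets
```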
Figure 2.3: Illustration of tracklet generation in one temporal window: (a) shows the possible detection responses in the window; (b) shows the graph built between the detection responses; (c) shows the tracking result of one iteration, containing the dominant set of one target as a subgraph; (d) shows the obtained trajectory. This figure is best viewed in color.
To merge the tracklets obtained in different temporal windows, the following approaches are used concurrently:
1. In the first approach, each tracklet is divided into two equal parts: each tracklet partition is represented as a node in a graph and the weight between the nodes (tracklets) is computed as the mean distance between their appearance features. Let $I$ and $J$ be two tracklets ($\equiv$ nodes); then the distance between the two tracklets is formulated as:
$$D_{I,J} = \frac{1}{|I|} \sum_{i \in I} \left\{ \frac{1}{|J|} \sum_{j \in J} \varpi(C_i, C_j) \right\} \qquad (2.6)$$
that the person will make a big change in his/her appearance. Hence, it is enough to represent a tracklet with the two representative detection responses having the highest characteristic-vector values. Yet again, this technique fails if a selected representative detection turns out to be a bad representative (e.g., it can be partially occluded), generating multiple identity switches.
DSC is performed three times, each time using one of the different approaches listed above ($A_\mu$, $A_M$ and $A_C$). The clustering results are then combined into a single affinity matrix according to the Evidence Accumulation paradigm [95]. We build a matrix $B = (b_{ij})$, known as the co-association matrix, where $b_{ij} = 0$ if $i = j$ and $b_{ij} = \varphi(i,j)$ otherwise, with $\varphi(i,j)$ being the fraction of times objects $i$ and $j$ were clustered together among all $|C|$ clustering results in the ensemble, computed as:
$$\varphi(i,j) = \frac{1}{|C|} \sum_{r=1}^{|C|} \delta(C_r(i), C_r(j)) \qquad (2.9)$$
where $C_r(i)$ is the label of the $i$-th object in the $r$-th clustering result and $\delta(n, m)$ is Kronecker's delta function, i.e. $\delta(n,m) = 1$ if $n = m$ and 0 otherwise.
To get the final clustering result, we run DSC on the co-association matrix.
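The evidence-accumulation step of eq. (2.9) can be sketched as follows, assuming each of the three clustering results is available as an array of integer labels (names are illustrative):

```python
import numpy as np

def co_association(labelings):
    """Build the co-association matrix B from an ensemble of clustering results.

    labelings: list of 1-D integer label arrays (e.g. the A_mu, A_M and A_C
    results), all over the same n objects.  b_ij is the fraction of clusterings
    in which i and j share a label; the diagonal is set to zero as in the text.
    """
    labelings = [np.asarray(lab) for lab in labelings]
    n = labelings[0].size
    B = np.zeros((n, n))
    for lab in labelings:
        B += (lab[:, None] == lab[None, :]).astype(float)  # Kronecker delta term
    B /= len(labelings)
    np.fill_diagonal(B, 0.0)
    return B  # DSC is then run again on this matrix to get the final clustering
```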
2.4 Experiments
2.4.1 Experimental setup
We evaluate our approach on three publicly available datasets: sequence S2L1 from the PETS2009 benchmark [47], the TUD-Stadtmitte dataset [3], and the "sunny day" sequence from the ETH mobile dataset [49], which are commonly used in previous multi-target tracking works. To make the comparison fair and transparent, we used publicly available detection responses [169, 171] and ground truth [88] in all our experiments. We also used publicly available code [101] for the evaluation.
Brief description of the datasets: PETS2009-S2L1-View 1 [47] consists of 795 frames and comprises challenges like non-linear target motion, targets in close proximity and several occlusions in the scene. TUD-Stadtmitte [3] consists of 179 frames recorded in a busy pedestrian street with a low camera angle, which generates frequent occlusions. Both these datasets are recorded using static cameras. On the contrary, the ETH dataset ("sunny day" sequence) [48] is recorded with a pair of moving cameras in a busy street scene. The cameras move forward and exhibit a panning motion at times, which potentially makes the exploitation of known motion models less reliable. Most targets move with a speed of 2-4 pixels per frame, whereas the camera exhibits a motion of around 20 pixels per frame. This causes imprecise results when using linear motion estimation. However, the relative positions of the targets do not change between two consecutive frames, since all targets move according to the same camera motion. The sequence contains 354 frames and the size of the detected people on the image plane varies significantly. Similar to [87], we used the sequence from the left camera. No ground plane information is used.
Parameter analysis: There is no single window size that works for all datasets; the appropriate size depends on the dynamics and frame rate of the video. Therefore,
we set the window size to 15, 10 and 5 for the PETS2009-S2L1, TUD-Stadtmitte and ETH-Sunny day sequences, respectively. Our algorithm performs well over a wide range of $\gamma$ (the scaling parameter in eq. 2.5), from 0.2 to 3.0. This good invariance (or low sensitivity) to the $\gamma$ parameter is due to the exploitation of the results from both the previous window and the low-level association along the current window to update our affinity matrix, which replaces most of the values of the affinity matrix (with the value 1 if the two nodes belong to the same cluster according to the results from the previous window or the low-level association, and with the value 0 otherwise).
For PETS2009-S2L1 and TUD-Stadtmitte, $\kappa$ and $\iota$ in eq. 2.4 (the factors controlling the contributions of appearance and position information, respectively) are set to one, which gives equal weight to appearance and position features. However, based on our experiments, the appearance features are more informative than position in our formulation; in fact, in many cases using appearance features only is sufficient for tracklet generation. In the case of the ETH-Sunny day sequence, we set $\kappa$ and $\iota$ to 1 and 1.25, respectively, to get the best result. Since this sequence is recorded with a moving camera, large changes in the poses and sizes of the people are present, resulting in significant changes in their appearance. Consequently, appearance information is less informative than position in this case.
Evaluation methodology: The correct and fair evaluation of multi-target tracking
systems relies mainly on two issues: the definition of proper and agreed evaluation
metrics; and the use of publicly-available ground truth on which the evaluation can be
based.
Regarding the evaluation metrics, we adopted those defined in [101], which follow the widely accepted CLEAR MOT protocol [11]. CLEAR MOT defines several measures:
• Multi-Object Tracking Accuracy (MOTA) combines all error types (false positives
(FP), false negatives/missing detections (FN) and identity switches (IDS)) - the
higher the better;
• Multi-object Tracking Precision (MOTP) measures the normalized distance be-
tween tracker output and ground truth location, i.e. the precision in the bounding
box (or center of gravity) localization - the higher the better;
• Mostly Tracked (MT) measures how many ground truth (GT) trajectories are suc-
cessfully tracked for at least 80% of frames - the higher the better;
• Mostly Lost (ML) measures how many of the ground truth (GT) trajectories are
tracked for less than 20% of the whole trajectory - the lower the better;
• identity switches (IDS): number of times that a tracked trajectory switches its
matched ground truth identity - the lower the better.
In addition to those metrics, we also compute the following values:
• False alarm per frame (FAF) measures the average false alarms per frame - the
lower the better;
• Precision (Prcsn) is the average of correctly matched detections per total detec-
tions in the tracking result - the higher the better;
• Recall (Rcll) computes the average of matched detections per total detections in
the ground truth - the higher the better.
All the reported results in Tables 2.1 and 2.2 are generated using tracking outputs
provided by the authors [101] with the same overlap threshold for CLEAR MOT met-
rics. Instead, quantitative results of the compared approach reported in Table 2.3 are
taken from [87].
Regarding the ground truth, in order to provide a common ground for further com-
parison, we used the publicly-available detection responses from [171] for PETS2009
and TUD datasets and those from [87] for ETH. As ground truth, for all the datasets we
used the one provided in [88]. It is worth noting that we achieved exactly the same results even when using a modified (stricter) ground truth, which ensures that each person keeps one unique ID throughout the whole video. In fact, it often happens that when a person disappears from the scene (due to occlusions or because he/she temporarily exits the scene), a different ID is assigned on reappearance (also in the public ground truth). The stricter ground truth, instead, assigns the same ID, which is not always the case in [88].
Figure 2.4: Tracking results on the PETS2009-S2L1 and TUD-Stadtmitte datasets. The first two rows refer to PETS2009-S2L1 (first row: our results; second row: ground truth), whereas the last two rows refer to TUD-Stadtmitte (third row: our results; fourth row: ground truth). This figure is best viewed in color.
comparable in ML and IDS. However, our approach obtains relatively lower performance in recall: this is mainly because the DSC tracker focuses on assigning the right person ID to the detection responses generated by the people detector, i.e. no motion model (linear or otherwise) is used to predict next locations. As a result, our method generates a higher number of false negatives.
Fig. 2.4 shows some visual tracking results on TUD-Stadtmitte sequence (third and
fourth row). Our approach is able to maintain correct IDs, as in the case of targets 4
and 8 on frame 61 or targets 10 and 11 on frame 104, regardless of their similarity in
position and appearance and the severe occlusions.
Table 2.2: Tracking results on TUD-Stadtmitte sequence. For all approaches the number of ground truth
(GT) trajectories is the same (10).
2.5 Summary
In this chapter, a dominant set clustering based tracker is proposed, which formulates
the tracking task as finding dominant set (cluster) on the constructed undirected edge
weighted graph. The development of a complete multi-target tracker based on dominant set clustering is the main contribution of this work. We utilized both appearance and position information for data association in a global manner, avoiding the temporally-limited approach typically present in previous works. Experimental results compared
with the state-of-the-art tracking approaches show the superiority of our tracker, es-
pecially in terms of MOTA and precision, as well as lower identity switches in some
cases. Generally speaking, the tradeoff between precision and recall in tracking is hard
to balance, thus resulting in lower performance on one of the two values. Our claim
is that precision is more relevant for tracking purposes and that assigning the right ID
to targets (people) when they re-appear after an occlusion or the exit from the scene
should be the key objective of tracking systems.
Figure 2.5: Sample Tracking result on ETH-Sunny day sequence (first row represents our results, while
second row represents the ground truth). This figure is best viewed in color.
Regarding the efficiency of the proposed approach, for a window size of 10 frames
with approximately 10 pedestrians, processing a frame using the full proposed frame-
work, excluding human detection, takes an average time of 1.6 and 0.003 s for affinity
matrix generation and tracklet generation, respectively. Moreover, it takes only 0.001 s
for tracklet merging step. These values are computed by running a non-optimised Mat-
lab code on a core i5 2.5 GHz machine. Using an optimised parallel implementation in
C, the algorithm is likely to work in real time.
As future directions of research, we would first extend the methodology to multi-camera setups. We believe that our approach can be extended to multiple cameras in a straightforward manner, since no motion or temporal information is used in it. Another possible future work consists in evaluating different similarity measures as well as considering different types of targets instead of people.
CHAPTER 3
Multi-target tracking in multiple nonoverlapping
cameras using constrained dominant sets
In this chapter, we present a unified three-layer hierarchical approach for solving track-
ing problems in multiple non-overlapping cameras. Given a video and a set of detec-
tions (obtained by any person detector), we first solve within-camera tracking employ-
ing the first two layers of our framework and, then, in the third layer, we solve across-
camera tracking by merging tracks of the same person in all cameras in a simultaneous
fashion. To best serve our purpose, a constrained dominant sets clustering (CDSC)
technique, a parametrized version of standard quadratic optimization, is employed to
solve both tracking tasks. The tracking problem is cast as finding constrained dominant sets from a graph. That is, given a constraint set and a graph, CDSC generates a cluster (or clique), which forms a compact and coherent set containing a subset of the constraint set. In addition to providing a unified framework that simultaneously solves within- and across-camera tracking, the third layer helps link broken tracks of the same person occurring during within-camera tracking. We have tested this approach on a very large and challenging dataset (namely, MOTchallenge DukeMTMC [128, 129, 145]) and show that the proposed framework outperforms the current state of the art. Even though the main focus of this work is multi-target tracking in non-overlapping cameras, the proposed approach can also be applied to the re-identification problem. To that end, we have also performed experiments on MARS [189], one of the largest and most challenging video-based person re-identification datasets, and have obtained excellent results. These experiments demonstrate the general applicability of the proposed framework to non-overlapping across-camera tracking and person re-identification tasks.
Figure 3.1: (a) Tracks $T_j^i$ obtained within cameras 1, 2 and 3; (b) association of the tracks of the same person across the three cameras.
3.1 Introduction
As the need for visual surveillance grows, a large number of cameras have been deployed to cover large and wide areas like airports, shopping malls and city blocks. Since the field of view of a single camera is limited, in most wide-area surveillance scenarios multiple cameras are required to cover larger areas. Using multiple cameras with overlapping fields of view is costly from both economic and computational standpoints. Therefore, camera networks with non-overlapping fields of view are preferred and widely adopted in real-world applications.
In the work presented in this chapter, the goal is to track multiple targets and maintain their identities as they move from one camera to another with non-overlapping fields of view. In this context, two problems need to be solved: within-camera data association (or tracking) and across-camera data association employing the tracks obtained from within-camera tracking. Although there has been significant progress on both problems separately, tracking multiple targets jointly within and across non-overlapping cameras remains a less explored topic. Most approaches that solve multi-target tracking in multiple non-overlapping cameras [18, 31, 34, 73, 75, 80, 115, 162] assume that tracking within each camera has already been performed and only solve the tracking problem across non-overlapping cameras; the results obtained by such approaches are far from optimal [162].
In this work, we propose a hierarchical approach in which we first determine tracks within each camera (Figure 3.1(a)) by solving data association, and later associate tracks of the same person in different cameras in a unified approach (Figure 3.1(b)), hence solving the across-camera tracking. Since appearance and motion cues of a target tend to be consistent in a short temporal window in single-camera tracking, solving the tracking problem in a hierarchical manner is common: tracklets are generated within short temporal windows first and later merged to form full tracks (or trajectories) [43, 151, 176]. Across-camera tracking is often more challenging than within-camera tracking, since the appearance of people may exhibit significant differences due to illumination variations and pose changes between cameras.
Therefore, this chapter proposes a unified three-layer framework to solve both within-
and across-camera tracking. In the first two layers, we generate tracks within each cam-
era and in the third layer we associate all tracks of the same person across all cameras
in a simultaneous fashion.
To best serve our purpose, a constrained dominant sets clustering (CDSC) technique, a parametrized version of standard quadratic optimization, is employed to solve both tracking tasks. The tracking problem is cast as finding constrained dominant sets from a graph. That is, given a constraint set and a graph, CDSC generates a cluster (or clique), which forms a compact and coherent set containing all or part of the constraint set. Clusters represent tracklets and tracks in the first and second layers, respectively. The proposed within-camera tracker robustly handles long-term occlusions, does not change the scale of the original problem as it does not remove nodes from the graph during the extraction of compact clusters, and is several orders of magnitude faster (close to real time) than existing methods. Also, the proposed across-camera tracking method, using CDSC followed by a refinement step, offers several advantages. More specifically, CDSC not only considers the affinity (relationship) between tracks observed in different cameras, but also takes into account the affinity among tracks from the same camera. As a consequence, the proposed approach not only accurately associates tracks from different cameras but also makes it possible to link multiple short broken tracks, obtained during within-camera tracking, that may belong to a single target. For instance, in Figure 3.1(a) tracks $T_1^3$ (the third track from camera 1) and $T_1^4$ (the fourth track from camera 1) are tracks of the same person that were mistakenly broken apart. However, in the third layer, as they are highly similar to tracks in camera 2 ($T_2^3$) and camera 3 ($T_3^3$), they form a clique, as shown in Figure 3.1(b). Such an across-camera formulation is able to associate these broken tracks with the rest of the tracks from different cameras, represented by the green cluster in Figure 3.1(b).
The main contributions of this chapter are summarized as follows:
The rest of the chapter is organized as follows. In Section 3.2, we review relevant
previous works. The overall proposed approach for the within- and across-camera tracking modules is summarized in Section 3.3, while Sections 3.3.1 and 3.3.2 provide more details on the two modules. In Section 1.3, we presented the approach used to further speed up our method. Experimental results are presented in Section 3.4. Finally, Section 3.5 summarizes the chapter.
3.2 Related Work
Object tracking is a challenging computer vision problem and has been one of the most active research areas for many years. In general, it can be divided into two broad categories: tracking in single and in multiple cameras. Single-camera object tracking associates object detections across frames in a video sequence, so as to generate the object motion trajectory over time. Multi-camera tracking aims to solve the handover problem from one camera view to another and hence establishes target correspondences among different cameras, so as to achieve consistent object labelling across all camera views. Early multi-camera target tracking research falls into the following categories. Target tracking with partially overlapping camera views has been researched extensively during the last decade [4, 20, 44, 81, 103, 159]. Multi-target tracking across multiple cameras with disjoint views has also been investigated [18, 31, 34, 73, 75, 80, 115, 162]. Approaches for overlapping fields of view compute the spatial proximity of tracks in the overlapping area, while approaches for tracking targets across cameras with disjoint fields of view leverage appearance cues together with spatio-temporal information.
Almost all early multi-camera research works try to address only across-camera
tracking problems, assuming that within-camera tracking results for all cameras are
given. Given tracks from each camera, similarity among tracks is computed and target
correspondence across cameras is solved, using the assumption that a track of a target in
one camera view can match with at most one target track in another camera view. Hun-
garian algorithm [85] and bipartite graph matching [73] formulations are usually used
to solve this problem. Very recently, however, researchers have argued that assump-
tions of cameras having overlapping fields of view and the availability of intra-camera
tracks are unrealistic [162]. Therefore, the work proposed in this chapter addresses the
more realistic problem by solving both within- and across-camera tracking in one joint
framework.
In the rest of this section, we first review the most recent works for single cam-
era tracking, and then describe the previous related works on multi-camera multi-view
tracking.
Single-camera target tracking associates target detections across frames in a video sequence in order to generate the target motion trajectory over time. Zamir et al. [176] formulate the tracking problem as a generalized maximum clique problem (GMCP), where the relationships between all detections in a temporal window are considered. In [176], a cost is assigned to each clique and the selected clique maximizes a score function. Nonetheless, the approach is prone to local optima, as it uses a greedy local neighbourhood search. Dehghan et al. [43] cast tracking as a generalized maximum multi-clique problem (GMMCP) and perform a joint optimization over all the tracks simultaneously. To handle outliers and associations of weak detections, they introduce dummy nodes. However, this solution is computationally expensive and the hard constraints in their optimization make the approach impractical for large graphs. Tesfaye et al. [151] consider all the pairwise relationships between detection responses in a temporal sliding window, which is used as input to their optimization based on a fully-connected edge-weighted graph. They formulate tracking as finding dominant set clusters and follow a peel-off strategy to enumerate all possible clusters, that is, at each iteration they
remove the cluster from the graph, which results in a change in the scale (number of nodes) of the original problem. In this chapter, we propose a multiple-target tracking approach which, in contrast to previous works, neither needs additional nodes to handle occlusions nor changes the scale of the problem.
Across-camera tracking aims to establish target correspondences among trajecto-
ries from different cameras so as to achieve consistent target labelling across all camera
views. It is a challenging problem due to the illumination and pose changes across cam-
eras, or track discontinuities due to the blind areas or miss detections. Existing across-
camera tracking methods try to deal with the above problems using appearance cues.
The variation in illumination of the appearance cues has been leveraged using different
techniques such as Brightness Transfer Functions (BTFs). To handle the appearance
change of a target as it moves from one camera to another, the authors in [74] show that
all brightness transfer functions from a given camera to another camera lie in a low di-
mensional subspace, which is learned by employing probabilistic principal component
analysis and used for appearance matching. Authors of [59] used an incremental learn-
ing method to model the colour variations and [120] proposed a Cumulative Brightness
Transfer Function, which is a better use of the available colour information from a
very sparse training set. Performance comparison of different variations of Brightness
Transfer Functions can be found in [45]. Authors in [147] tried to achieve color con-
sistency using colorimetric principles, where the image analysis system is modelled as
an observer and camera-specific transformations are determined, so that images of the
same target appear similar to this observer. Obviously, learning Brightness Transfer
Functions or color correction models requires large amount of training data and they
may not be robust against drastic illumination changes across different cameras. Therefore, recent approaches have combined them with spatio-temporal cues, which improves multi-target tracking performance [19, 32, 33, 57, 86, 188]. Chen et al. [33] utilized
human part configurations for every target track from different cameras to describe the
across-camera spatio-temporal constraints for across-camera track association, which is
formulated as a multi-class classification problem via Markov Random Fields (MRF).
Kuo et al. [86] used Multiple Instance Learning (MIL) to learn an appearance model,
which effectively combines multiple image descriptors and their corresponding simi-
larity measurements. The proposed appearance model combined with spatio-temporal
information improved across-camera track association solving the "target handover"
problem across cameras. Gao et al. [57] employ tracking results of different trackers
and use their spatio-temporal correlation, which help them enforce tracking consistency
and establish pairwise correlation among multiple tracking results. Zha et al. [188] for-
mulated tracking of multiple interacting targets as a network flow problem, for which
the solution can be obtained by the K-shortest paths algorithm. Spatio-temporal rela-
tionships among targets are utilized to identify group merge and split events. In [19]
spatio-temporal context is used to collect samples for discriminative appearance learning, where target-specific appearance models are learned to distinguish different people from each other, and a relative appearance context models inter-object appearance similarities for people walking in proximity, helping to disambiguate individual appearance matching across cameras.
The problem of target tracking across multiple non-overlapping cameras is also tack-
led in [128] by extending their previous single camera tracking method [129], where
they formulate the tracking task as a graph partitioning problem. The authors in [32] learn across-camera transfer models including both spatio-temporal and appearance cues: while a color transfer method is used to model the changes of color across cameras for learning the across-camera appearance transfer models, the spatio-temporal model is learned using an unsupervised topology-recovering approach. Recently, Chen et al. [31] argued that low-level information (appearance models and spatio-temporal information) is unreliable for tracking across non-overlapping cameras, and integrated contextual information such as social grouping behaviour. They formulate tracking using an online-
learned Conditional Random Field (CRF), which favours track associations that main-
tain group consistency. In this chapter, for tracks to be associated, besides their high
pairwise similarity (computed using appearance and spatio-temporal cues), their corre-
sponding constrained dominant sets should also be similar.
Another recent popular research topic, video-based person re-identification (ReID) [38, 50, 90, 97, 98, 161, 168, 174, 190], is closely related to across-camera multi-target tracking: both problems aim to match tracks of the same persons across non-overlapping cameras. However, across-camera tracking aims at a 1-to-1 correspondence between tracks of different cameras. Compared to most video-based ReID approaches, in which only the pairwise similarity between the probes and the gallery is exploited, our across-camera tracking framework not only considers the relationship between probes and gallery but also takes into account the relationships among tracks in the gallery.
3.3 Overall Approach
Figure 3.2 shows the proposed within-camera tracking framework. First, we divide a video into multiple short segments of 15 frames each and generate short-tracklets, where human detection bounding boxes with 70% overlap in two consecutive frames are connected [43]. Then, short-tracklets from 10 different non-overlapping segments are used as input to the first layer of our tracking framework; here the nodes are short-tracklets (Figure 3.2, bottom left). The resulting tracklets from the first layer are used as input to the second layer, that is, a tracklet from the first layer is now represented by a node in the second layer (Figure 3.2, bottom right). In the second layer, tracklets of the same person from different segments are associated, forming the tracks of a person within a camera.
Figure 3.2: The figure shows within-camera tracking, where short-tracklets from different segments are used as input to the first layer of tracking. The resulting tracklets from the first layer are the inputs to the second layer, which determines the tracks of each person. The three dark green short-tracklets ($s_1^2$, $s_1^{10}$, $s_1^7$), shown by the dotted ellipse in the first layer, form a cluster resulting in tracklet $t_1^2$ in the second layer, as shown with the black arrow. In the second layer, the clusters shown in purple, green and dark red form the tracks of different targets, as can be seen in the top row. Tracklets and tracks with the same color indicate the same target. The two green cliques (with two and three tracklets) represent the tracks of the person going in and out of the building (tracks $T_1^p$ and $T_1^2$, respectively).
We build an input graph $G(V, E, w)$, where nodes represent short-tracklets ($s_i^j$, the $j$-th short-tracklet of camera $i$) in the first layer (Figure 3.2, bottom left) and tracklets ($t_k^l$, the $l$-th tracklet of camera $k$) in the second layer (Figure 3.2, bottom right). The corresponding affinity matrix $A = \{a_{i,j}\}$ is built, where $a_{i,j} = w(i,j)$. The weight $w(i,j)$ is assigned to each edge by considering both motion and appearance similarity between the two nodes. Fine-tuned CNN features are used to model the appearance of a node; these features are extracted from the last fully-connected layer of an ImageNet pre-trained 50-layer Residual Network (ResNet-50) [67], fine-tuned on the trainval sequence of the DukeMTMC dataset. Similar to [176], we employ a global constant-velocity model to compute the motion similarity between two nodes.
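As an illustration of how such an edge weight could be assembled, the sketch below combines an appearance term on L2-normalized CNN features with a simple constant-velocity consistency term; the combination rule, parameter names and velocity estimate are assumptions for the sketch, not the exact model used in this work or in [176].

```python
import numpy as np

def appearance_similarity(f_a, f_b, gamma=1.0):
    """Similarity of two L2-normalized CNN features (illustrative)."""
    return np.exp(-np.linalg.norm(f_a - f_b) ** 2 / (2.0 * gamma ** 2))

def motion_similarity(pos_a, t_a, pos_b, t_b, vel_a, sigma=10.0):
    """Constant-velocity consistency: how close node b lies to where node a,
    extrapolated with its velocity, is expected to be at time t_b (a sketch)."""
    predicted = np.asarray(pos_a) + np.asarray(vel_a) * (t_b - t_a)
    err = np.linalg.norm(predicted - np.asarray(pos_b))
    return np.exp(-err ** 2 / (2.0 * sigma ** 2))

def edge_weight(node_a, node_b, alpha=0.5):
    """Combined weight w(i, j); alpha balancing appearance and motion is assumed."""
    app = appearance_similarity(node_a["feat"], node_b["feat"])
    mot = motion_similarity(node_a["pos"], node_a["t"],
                            node_b["pos"], node_b["t"], node_a["vel"])
    return alpha * app + (1.0 - alpha) * mot
```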
Determining cliques: In our formulation, a clique of graph G represents track-
where the cardinality $|\Gamma_i|$ is the number of nodes that form the $i$-th cluster and $\delta_j^i$ is the membership score of node $j$ obtained when assigned to cluster $\Gamma_i$. The normalization by the cardinality is important to avoid any unnatural bias towards smaller sets.
Algorithm (3), with the number of cameras under consideration ($I$) set to 1 and $Q$ set to the short-tracklets (resp. tracklets) in the first (resp. second) layer, is used to determine the constrained dominant sets, which correspond to tracklets (resp. tracks) in the first (resp. second) layer.
Given the tracks ($T_i^j$, the $j$-th track of camera $i$) of the different cameras from the previous step, we build a graph $G'(V', E', w')$, where nodes represent tracks and the corresponding affinity matrix $A$ depicts the similarity between tracks.
Assuming we have $I$ cameras and $A^{i \times j}$ represents the similarity among tracks of cameras $i$ and $j$, the final track-based affinity $A$ is built as
Figure 3.3 shows an exemplar graph for across-camera tracking among three cameras. $T_j^i$ represents the $i$-th track of camera $j$. Black and orange edges, respectively, represent within- and across-camera relations of the tracks. In the affinity $A$, $A^{i \times j}$ represents the black edges of camera $i$ if $i = j$, and otherwise represents the across-camera relations shown by the orange edges.
The colors of the nodes depict the track ID; nodes with the same color represent tracks of the same person. Due to several reasons, such as long occlusions, severe pose changes, or reappearance, a person may have more than one (broken) track within a camera. The green nodes of camera 1 (the second and the $p$-th tracks) typify two broken tracks of the same person, due to reappearance, as shown in Figure 3.2. The proposed unified approach, discussed in the next section, is able to deal with such cases.
Figure 3.3: Exemplar graph of tracks from three cameras. Tji represents the ith track of camera j. Black and
colored edges, respectively, represent within- and across-camera relations of tracks. Colours of the nodes depict
track IDs, nodes with similar colour represent tracks of the same person, and the thick lines show both within-
and across-camera association.
only in camera 1. As a general case, $C_j^i$ represents the $i$-th track set generated using the tracks of camera $j$ as constraint set, and $C_j$ is the set that contains all track sets generated using camera $j$ as constraint set, e.g. $C_1 = \{C_1^1, C_1^2, C_1^3\}$. We iteratively process all the cameras and then apply a track refinement step.
Though Algorithm (3) is also applicable to within-camera tracking, here we show the specific case of across-camera track association. Let $T$ represent the set of tracks from all the cameras and $C$ the set which contains the sets of tracks $C_p^i$ generated by our algorithm. $T_p^\vartheta$ typifies the $\vartheta$-th track from camera $p$, and $T_p$ contains all the tracks of camera $p$. The function $F(Q, A)$ takes as input a constraint set $Q$ and the affinity $A$, and provides as output all the $m$ local solutions $X^{n \times m}$ of program (1.7) that contain element(s) from the constraint set. This can be accomplished by iteratively finding a local maximizer of program (1.7) in $\Delta$, e.g. using game dynamics, and then changing the constraint set $Q$ until all members of the constraint set have been clustered.
1: while $p \le I$ do
2:    $Q \leftarrow T_p$    ▷ define constraint set
3:    $X \leftarrow F(Q, A)$
4:    $C_p^i \leftarrow \sigma(X^i)$, computed for all $i = 1 \dots m$
5:    $p \leftarrow p + 1$
6: end while
7: $C = \bigcup_{p=1}^{I} C_p$
8: OUTPUT: $\{C\}$
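A compact rendering of the listing above is sketched below; the solver F(Q, A) and the support-extraction operator σ are assumed to be supplied by the constrained dominant set machinery, and the threshold used to read off the support is illustrative.

```python
import numpy as np

def across_camera_association(A, camera_tracks, extract_cdsc, support_thresh=1e-4):
    """Sketch of the listing above.

    A             : global track-affinity matrix over all cameras.
    camera_tracks : list of index arrays; camera_tracks[p] holds the tracks T_p.
    extract_cdsc  : callable F(Q, A) returning a list of characteristic vectors
                    (one per local solution of program (1.7)); assumed to be
                    provided by the constrained dominant set solver.
    """
    C = []  # C[p] holds the track sets obtained with camera p as constraint set
    for Q in camera_tracks:                            # while p <= I
        solutions = extract_cdsc(Q, A)                 # X <- F(Q, A)
        sets_p = [np.flatnonzero(x > support_thresh)   # C_p^i <- sigma(X^i)
                  for x in solutions]
        C.append(sets_p)
    return C                                           # union over all cameras
```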
tracks from different cameras as different constraint sets. Let us assume we have $I$ cameras; $K_i$ represents the collection of track sets that contain track $i$, while $K_p^i$ is the subset of $K_i$ generated using the $p$-th camera as constraint set. $M_p^{li}$ is the membership score assigned to the $l$-th track in the set $C_p^i$.
We use two constraints during the track refinement stage, which help us refine false-positive associations.
Constraint-1: A track cannot appear in two different sets generated using the same constraint set, i.e. it must hold that
$$|K_p^i| \le 1.$$
Sets that do not satisfy the above inequality should be refined, as one or more tracks exist in different sets collected using the same constraint set $T_p$. The corresponding track is removed from all the sets which contain it and is assigned to the right set based on its membership score in each of the sets. Let us say the $l$-th track exists in $q$ different sets when the tracks from camera $p$ are taken as constraint set, i.e. $|K_p^l| = q$. The right set containing the track, $C_p^r$, is chosen as:
$$C_p^r = \operatorname*{arg\,max}_{C_p^i \in K_p^l} |C_p^i| \ast M_p^{li}$$
where $i = 1, \dots, |K_p^l|$. This must be normalized by the cardinality of the set to avoid a bias towards smaller sets.
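The resolution rule of Constraint-1 can be sketched as follows; the data layout and names are illustrative assumptions, not the original implementation.

```python
import numpy as np

def resolve_constraint_1(sets_p, memberships_p):
    """Keep each track in a single set among those built with camera p as
    constraint set, following the arg-max rule above (a sketch).

    sets_p        : list of index arrays, one per set C_p^i.
    memberships_p : list of arrays; memberships_p[i][k] is the membership score
                    of track sets_p[i][k] in set C_p^i.
    """
    owner = {}  # track -> (index of winning set, winning score)
    for i, (s, m) in enumerate(zip(sets_p, memberships_p)):
        for track, score in zip(s, m):
            cand = len(s) * score            # |C_p^i| * M_p^{li}
            if track not in owner or cand > owner[track][1]:
                owner[track] = (i, cand)
    # remove each track from every set except its winning one
    refined = [np.array([t for t in s if owner[t][0] == i])
               for i, s in enumerate(sets_p)]
    return refined
```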
Constraint-2: The maximum number of sets that contain track $i$ should equal the number of cameras under consideration. If we consider $I$ cameras, the cardinality of the collection of sets containing track $i$ is not larger than $I$, i.e.:
$$|K_i| \le I.$$
If there are sets that do not satisfy the above condition, the tracks are refined based on the cardinality of the intersection of the sets that contain the track, i.e. by enforcing the reciprocal properties of the sets which contain the track. Assume we collect sets of tracks
considering the tracks from camera $q$ as constraint set, and assume a track $\vartheta$ of the set $C_p^j$, $p \neq q$, exists in more than one set of $C_q$. The right set $C_q^r$ for $\vartheta$, considering the tracks from camera $q$ as constraint set, is chosen as:
$$C_q^r = \operatorname*{arg\,max}_{C_q^i \in K_q^\vartheta} |C_q^i \cap C_p^j|$$
where $i = 1, \dots, |K_q^\vartheta|$.
3.4 Experimental Results
Figure 3.4: Camera topology of the DukeMTMC dataset. Detections from the overlapping fields of view are not considered. More specifically, overlap occurs between cameras 8 & 2 and between cameras 5 & 3.
IDP and IDR), which can be applied both within and across cameras. These measure the tracker's performance not by how often ID switches occur, but by how long the tracker correctly tracks targets.
Identification precision IDP (recall IDR): is the fraction of computed (ground
truth) detections that are correctly identified.
Identification F-Score IDF1: is the ratio of correctly identified detections over the
average number of ground-truth and computed detections. Since MOTA and its related
performance measures under-report across-camera errors [128], we use them for the
evaluation of our single camera tracking results.
The performance of the algorithm for re-identification is evaluated employing rank-
1 based accuracy and confusion matrix using average precision (AP).
Implementation: In the implementation of our framework, we do not have parameters to tune. The affinity matrix $A$, adapting the kernel-trick distance function from [61], is constructed as follows:
$$A_{i,j} = 1 - \sqrt{\frac{K(x_i, x_i) + K(x_j, x_j) - 2\,K(x_i, x_j)}{2}},$$
where $K(x_i, x_j)$ is chosen as the Laplacian kernel
$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|_1).$$
The kernel parameter $\gamma$ is set as the inverse of the median of the pairwise distances.
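A small sketch of this affinity construction, assuming the track features are stacked row-wise in a matrix (names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kernel_affinity(X):
    """Affinity A built from the Laplacian-kernel distance above (a sketch).

    X : (n, d) matrix of track features, one row per track.
    """
    d1 = squareform(pdist(X, metric="cityblock"))        # pairwise L1 distances
    gamma = 1.0 / np.median(d1[np.triu_indices_from(d1, k=1)])
    K = np.exp(-gamma * d1)                              # Laplacian kernel
    diag = np.diag(K)
    # kernel-trick distance, then A_ij = 1 - sqrt(d_K / 2)
    dK = diag[:, None] + diag[None, :] - 2.0 * K
    return 1.0 - np.sqrt(np.maximum(dK, 0.0) / 2.0)
```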
In our similarity matrix for the final layer of the framework, which is sparse, we use spatio-temporal information based on the transition time and the zones involved when a person moves from one zone of a camera to a zone of another camera, learned from the Trainval sequence of the DukeMTMC dataset.
| Camera | Mthd | MOTA↑ | MT↑ | ML↓ | FP↓ | FN↓ | IDS↓ | IDF1↑ | IDP↑ | IDR↑ |
| C1 | [128] | 43.0 | 24 | 46 | 2,713 | 107,178 | 39 | 57.3 | 91.2 | 41.8 |
| C1 | Ours | 69.9 | 137 | 22 | 5,809 | 52,152 | 156 | 76.9 | 89.1 | 67.7 |
| C2 | [128] | 44.8 | 133 | 8 | 47,919 | 53,74 | 60 | 68.2 | 69.3 | 67.1 |
| C2 | Ours | 71.5 | 134 | 21 | 8,487 | 43,912 | 75 | 81.2 | 90.9 | 73.4 |
| C3 | [128] | 57.8 | 52 | 22 | 1,438 | 28,692 | 16 | 60.3 | 78.9 | 48.8 |
| C3 | Ours | 67.4 | 44 | 9 | 2,148 | 21,125 | 38 | 64.6 | 76.3 | 56.0 |
| C4 | [128] | 63.2 | 36 | 18 | 2,209 | 19,323 | 7 | 73.5 | 88.7 | 62.8 |
| C4 | Ours | 76.8 | 45 | 4 | 2,860 | 10,689 | 18 | 84.7 | 91.2 | 79.0 |
| C5 | [128] | 72.8 | 107 | 17 | 4,464 | 35,861 | 54 | 73.2 | 83.0 | 65.4 |
| C5 | Ours | 68.9 | 88 | 11 | 9,117 | 36,933 | 139 | 68.3 | 76.1 | 61.9 |
| C6 | [128] | 73.4 | 142 | 27 | 5,279 | 45,170 | 55 | 77.2 | 87.5 | 69.1 |
| C6 | Ours | 77.0 | 136 | 11 | 4,868 | 38,611 | 142 | 82.7 | 91.6 | 75.3 |
| C7 | [128] | 71.4 | 69 | 13 | 1,395 | 18,904 | 23 | 80.5 | 93.6 | 70.6 |
| C7 | Ours | 73.8 | 64 | 4 | 1,182 | 17,411 | 36 | 81.8 | 94.0 | 72.5 |
| C8 | [128] | 60.7 | 102 | 53 | 2,730 | 52,806 | 46 | 72.4 | 92.2 | 59.6 |
| C8 | Ours | 63.4 | 92 | 28 | 4,184 | 47,565 | 91 | 73.0 | 89.1 | 61.0 |
| Av | [128] | 59.4 | 665 | 234 | 68,147 | 361,672 | 300 | 70.1 | 83.6 | 60.4 |
| Av | Ours | 70.9 | 740 | 110 | 38,655 | 268,398 | 693 | 77.0 | 87.6 | 68.6 |
Table 3.1: The results show detailed (for each camera C1 to C8) and average performance (Av) of our
and state-of-the-art approach [128] on the Test-Easy sequence of DukeMTMC dataset.
The affinity between track $i$ and track $j$ is different from zero if and only if, based on the direction the person is moving and the spatio-temporal information, the two tracks could possibly be linked to form a trajectory (i.e., across-camera tracks of one person). However, this may have a drawback: broken tracks, or the track of a person who is standing and talking or doing something else in one camera, may result in a track that does not meet the spatio-temporal constraints. To deal with this problem, we add path-based information to the across-camera track similarity, as in [182]: if a track in camera $i$ and a track in camera $j$ can possibly form a trajectory, and track $j$ in turn has a linkage possibility with a track in camera $z$, the tracks in camera $i$ and camera $z$ are also considered as possibly linked.
The similarity between two tracks is computed using the Euclidean distance of their max-pooled features. The max-pooled feature is computed as the row-wise maximum of the feature vectors of the individual patches of the given track, extracted from the last fully-connected layer of an ImageNet pre-trained 50-layer Residual Network (ResNet-50) [67], fine-tuned on the Trainval sequence of the DukeMTMC dataset. The network is fine-tuned with a classification loss on the Trainval sequence, and the activations of its last fully-connected layer are extracted, L2-normalized and taken as visual features. Cross-view Quadratic Discriminant Analysis (XQDA) [90] is then used for pairwise distance computation between instances. For the experiments on MARS, the patch representation is obtained using the CNN features of [189]. The pairwise distances between instances are then computed in the XQDA, KISSME [84] and Euclidean spaces.
| Camera | Mthd | MOTA↑ | MT↑ | ML↓ | FP↓ | FN↓ | IDS↓ | IDF1↑ | IDP↑ | IDR↑ |
| C1 | [128] | 37.8 | 6 | 34 | 1,257 | 78,977 | 55 | 52.7 | 92.5 | 36.8 |
| C1 | Ours | 63.2 | 65 | 17 | 2,886 | 44,253 | 408 | 67.1 | 83.0 | 56.4 |
| C2 | [128] | 47.3 | 68 | 12 | 26,526 | 46,898 | 194 | 60.6 | 65.7 | 56.1 |
| C2 | Ours | 54.8 | 62 | 16 | 8,653 | 54,252 | 323 | 63.4 | 78.8 | 53.1 |
| C3 | [128] | 46.7 | 24 | 4 | 288 | 18,182 | 6 | 62.7 | 96.1 | 46.5 |
| C3 | Ours | 68.8 | 18 | 2 | 2,093 | 8,701 | 11 | 81.5 | 91.1 | 73.7 |
| C4 | [128] | 85.3 | 21 | 0 | 1,215 | 2,073 | 1 | 84.3 | 86.0 | 82.7 |
| C4 | Ours | 75.6 | 17 | 0 | 1,571 | 3,888 | 61 | 82.3 | 87.1 | 78.1 |
| C5 | [128] | 78.3 | 57 | 2 | 1,480 | 11,568 | 13 | 81.9 | 90.1 | 75.1 |
| C5 | Ours | 78.6 | 47 | 2 | 1,219 | 11,644 | 50 | 82.8 | 91.5 | 75.7 |
| C6 | [128] | 59.4 | 85 | 23 | 5,156 | 77,031 | 225 | 64.1 | 81.7 | 52.7 |
| C6 | Ours | 53.3 | 68 | 36 | 5,989 | 88,164 | 547 | 53.1 | 71.2 | 42.3 |
| C7 | [128] | 50.8 | 43 | 23 | 2,971 | 38,912 | 148 | 59.6 | 81.2 | 47.1 |
| C7 | Ours | 50.8 | 34 | 20 | 1,935 | 39,865 | 266 | 60.6 | 84.7 | 47.1 |
| C8 | [128] | 73.0 | 34 | 5 | 706 | 9,735 | 10 | 82.4 | 94.9 | 72.8 |
| C8 | Ours | 70.0 | 37 | 6 | 2,297 | 9,306 | 26 | 81.3 | 90.3 | 73.9 |
| Av | [128] | 54.6 | 338 | 103 | 39,599 | 283,376 | 652 | 64.5 | 81.2 | 53.5 |
| Av | Ours | 59.6 | 348 | 99 | 26,643 | 260,073 | 1637 | 65.4 | 81.4 | 54.7 |
Table 3.2: The results show detailed (for each camera) and average performance of our and state-of-
the-art approach [128] on the Test-Hard sequence of DukeMTMC dataset.
Table 3.3: Multi-camera performance of our and state-of-the-art approach [128] on the Test-Easy se-
quence of DukeMTMC dataset.
The results in Table 3.1 represent the performance on the Test-Easy sequence, while those in Table 3.2 show the performance on the Test-Hard sequence. For a fair comparison, we use the same detection responses obtained from MOTchallenge DukeMTMC as input to our method. In both cases, the rows labelled C1 to C8 report the within-camera tracking performance, while the last rows of the tables report the average performance over the 8 cameras. Both tables demonstrate that the proposed approach improves tracking performance on both sequences. On the Test-Easy sequence, the performance is improved by 11.5% in MOTA and 7% in IDF1, while on the Test-Hard sequence our method produces a 5% larger average MOTA score than [128] and a 1% improvement in IDF1. Tables 3.3 and 3.4 present the multi-camera performance of our approach and of the state-of-the-art approach [128] on the Test-Easy and Test-Hard sequences of the DukeMTMC dataset, respectively. We improve IDF1 on both the Test-Easy and Test-Hard sequences, by 4% and 3%, respectively.
Figure 3.5 depicts sample qualitative results. Each person is represented by two bounding boxes of the same color, which show the person's position at two specific times, and a track which shows the path s(he) follows. In the first row, all four targets are successfully tracked in the four cameras where they appear, even under significant illumination and pose changes. In the second row, target 714 is successfully tracked through three cameras; observe its significant illumination and pose changes from camera 5 to camera 7. In the third row, targets six, seven and eight, which move through camera 1, are tracked. The last row shows tracks of targets that appear in cameras 1 to 4.
Table 3.4: Multi-Camera performance of our and state-of-the-art approach [128] on the Test-Hard
sequence of DukeMTMC dataset.
Figure 3.5: Sample qualitative results of the proposed approach on DukeMTMC dataset. Bounding boxes and
lines with the same color indicate the same target (Best viewed in color).
Figure 3.6: The results show the performance of our algorithm on MARS (both using CNN + XQDA) when the final ranking is done using the membership score (left) and using the pairwise Euclidean distance (right).
| Methods | Rank-1 |
| HLBP + XQDA | 18.60 |
| BCov + XQDA | 9.20 |
| LOMO + XQDA | 30.70 |
| BoW + KISSME | 30.60 |
| SDALF + DVR | 4.10 |
| HOG3D + KISSME | 2.60 |
| CNN + XQDA [189] | 65.30 |
| CNN + KISSME [189] | 65.00 |
| Ours | 68.22 |
Table 3.5: The table shows the comparison (based on rank-1 accuracy) of our approach with the state-of-
the-art approaches: SDALF [50], HLBP [168], BoW [190], BCov [97], LOMO [90], HOG3D [82]
on MARS dataset.
Table 3.6: The results show the performance of our approach (using pairwise distance and membership score) and of the state-of-the-art approach [189] in solving within- and across-camera ReID, using average precision on the MARS dataset with CNN features and different distance metrics.
set is, we conduct an experiment on the MARS dataset computing the final ranking using the membership score and using pairwise distances. The confusion matrix in Fig. 3.6 shows the detailed results both within cameras (diagonals) and across cameras (off-diagonals), as we consider the tracks from each camera as queries. Given a query, a set
which contains the query is extracted using the constrained dominant set framework. Note that the constrained dominant set comes with membership scores for all members of the extracted set. Figure 3.6 shows the results when the final ranking is obtained using the membership scores (left) and using the pairwise Euclidean distance between the query and the extracted nodes (right). As can be seen from the results in Table 3.6 (average performance), the use of the membership score outperforms the pairwise-distance approach, since it captures the interrelations among targets.
Figure 3.7: CPU time taken for each track association using our proposed fast approach (FCDSC - fast
CDSC) and CDSC.
3.5 Summary
In this chapter, we presented a constrained dominant set clustering (CDSC) based framework for solving the multi-target tracking problem in multiple non-overlapping cameras.
Figure 3.8: The ratio of CPU time taken between CDSC and proposed fast approach (FCDSC), com-
puted as CPU time for CDSC/CPU time for FCDSC.
The proposed method follows a three-layer hierarchical approach, where within-camera tracking is solved using the first two layers of our framework, resulting in tracks for each person, and later, in the third layer, the proposed across-camera tracker merges the tracks of the same person across different cameras. Experiments on a challenging real-world dataset (MOTchallenge DukeMTMCT) validate the effectiveness of our model.
We further perform additional experiments to show effectiveness of the proposed
across-camera tracking on one of the largest video-based people re-identification datasets
(MARS). Here each query is treated as a constraint set and its corresponding members
in the resulting constrained dominant set cluster are considered as possible candidate
matches to their corresponding query.
There are a few directions we would like to pursue in future research. In this work, we consider static cameras with known topology, but it is important for the approach to be able to handle more challenging scenarios, where some views come from cameras with ego-motion (e.g., PTZ cameras or mobile devices) and the camera topology is unknown. Moreover, here we consider features from static images; however, we believe that video features, which could be extracted using an LSTM, could boost the performance and help us extend the method to handle more challenging scenarios.
CHAPTER 4
Large-scale Image Geo-Localization
Using Dominant Sets
This chapter presents a new approach for the challenging problem of geo-localization
using image matching in a structured database of city-wide reference images with
known GPS coordinates. We cast geo-localization as a clustering problem of local
image features. Akin to existing approaches to the problem, our framework builds on
low-level features which allow local matching between images. For each local feature
in the query image, we find its approximate nearest neighbors in the reference set. Next,
we cluster the features from reference images using Dominant Set clustering. We obtain
multiple clusters (different local maximizers) and obtain a robust final solution to the
problem using multiple weak solutions through constrained Dominant Set clustering on
global image features, where we enforce the constraint that the query image must be in-
cluded in the cluster. This second level of clustering also bypasses heuristic approaches
to voting and selecting the reference image that matches to the query. We evaluate the
proposed framework on an existing dataset of 102k street view images as well as a new
larger dataset of 300k images, and show that it outperforms the state-of-the-art by 20%
and 7%, respectively, on the two datasets.
4.1 Introduction
Image geo-localization, the problem of determining the location of an image using just its visual content, is remarkably difficult. Nonetheless, images often contain useful visual and contextual cues which allow us to determine the location of an image with variable confidence. The foremost of these cues are landmarks, architectural details, building textures and colors, in addition to road markings and surrounding vegetation.
Since the dynamics and the linear relaxation of the binary variables make our method extremely fast, we run it multiple times to obtain several local maxima as solutions. Next, we use a query-based variation of DSC to combine those solutions into a final robust solution. The query-based DSC uses the soft constraint that the query, or a group of queries, must always become part of the cluster, thus ensuring their membership in the solution. We use a fusion of several global features to compute the cost between the query and the reference images selected in the previous step. The members of the cluster from the reference set are used to find the geo-location of the query image. Note that the GPS location of the matching reference images is also used as a cost, in addition to visual features, to ensure both visual similarity and geographical proximity.
GPS-tagged reference image databases collected from user-uploaded images on Flickr have typically been used for the geo-localization task. The query images in our experiments have been collected from Flickr; however, the reference images were collected from Google Street View. The data collected through Flickr and Google Street View differ in several important aspects: the images downloaded from Flickr are often redundant and repetitive, where images of a particular building, landmark or street are captured multiple times by different users. Typically, popular or tourist spots have relatively more images in the testing and reference sets compared to less interesting parts of the urban environment. An important constraint during evaluation is that the distribution of testing images should be similar to that of the reference images. On the contrary, the Google Street View reference data used in this work contains only a single sample of each location of the city. However, Street View does provide spherical 360° panoramic views, approximately 12 meters apart, of most streets and roads. Thus, the images are uniformly distributed over different locations, independently of their popularity. The comprehensiveness of the data ensures that a correct match exists; at the same time, the sparsity (uniform distribution) of the data makes geo-localization difficult, since every location is captured in only a few of the reference images. The difficulty is compounded by the distorted, low-quality nature of the images as well.
The main contributions of this chapter are summarized as follows:
• We present a robust and computationally efficient approach for the problem of
large-scale image geo-localization by locating images in a structured database of
city-wide reference images with known GPS coordinates.
• We formulate the geo-localization problem in terms of a more generalized form of the dominant sets framework, which incorporates weights from the nodes in addition to the edges.
• We take a two-step approach to solve the problem. The first step uses local fea-
tures to find putative set of reference images (and is therefore faster), whereas the
second step uses global features and a constrained variation of dominant sets to re-
fine results from the first step, thereby, significantly boosting the geo-localization
performance.
• We have collected a new and more challenging high-resolution reference dataset (WorldCities dataset) of 300K Google Street View images.
The rest of the chapter is structured as follows. We present literature relevant to our
problem in Sec. 4.2, followed by technical details of the proposed approach in Sec.
4.3, while the constrained dominant set based post processing step is discussed in Sec. 4.4. This is followed by the dataset description in Sec. 4.5.1. Finally, we provide the results of our extensive evaluation in Sec. 4.5 and a summary in Sec. 4.6.
[Overview figure: dominant sets (DS1, DS2, DS3) extracted for a query Q are post-processed using constrained dominant sets, and the location of the query (e.g., 40.439, -80.004) is approximated from the GPS location of the best matching reference image.]
ξ(·) represents an operator which returns the local descriptor of the argument node, then q_i is removed, otherwise it is retained. That is, if the first NN is similar to the last NN (less than β), then the corresponding query feature along with its NNs are pruned since it is expected to be uninformative.
We empirically set both thresholds, θ and β, in Algorithm (4) and pruning step, respectively, to 0.7 and keep them fixed for all tests.
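To make the pruning test concrete, the following is a minimal sketch (not the thesis implementation), assuming Euclidean distances between L2-normalized local descriptors; function and variable names are illustrative.

import numpy as np

def prune_query_features(nn_local, beta=0.7):
    """Keep only query features whose first and last NN descriptors differ enough.

    nn_local : list of arrays; nn_local[i][m] = xi(v_i^m), descriptors of the NNs of q_i
    Returns the indices of the query features that survive the pruning step.
    """
    survivors = []
    for i, nns in enumerate(nn_local):
        # distance between the first and the last retained NN of this query feature
        d = np.linalg.norm(nns[0] - nns[-1])
        if d >= beta:
            survivors.append(i)   # informative feature: keep it and its NNs
        # otherwise the first NN is "similar to the last NN (less than beta)":
        # the query feature and all its NNs are pruned as uninformative
    return survivors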
5:     V_i = V_i ∪ {v_i^{m+1}}        ▷ If so, add v_i^{m+1} to our solution
6:     m = m + 1                      ▷ Go to the next neighbor
7:   else
8:     Break                          ▷ If not, stop adding and exit
9:   end if
10: end while
11: end procedure
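A rough, runnable counterpart of the loop above is sketched below. The admission test governed by θ in the earlier lines of Algorithm 4 is left as a caller-supplied predicate, so this is an illustrative sketch rather than the exact procedure.

def dynamic_nn_selection(nns, admit_next):
    """Sketch of the NN-growing loop of Algorithm 4 (lines 5-11 above).

    nns        : nearest neighbors of one query feature, ordered by distance
    admit_next : predicate standing in for the admission test (threshold theta)
                 defined in the earlier lines of Algorithm 4
    """
    V_i = [nns[0]]                # assumed initialization: start from the first NN
    m = 1
    while m < len(nns):
        if admit_next(V_i, nns[m]):
            V_i.append(nns[m])    # If so, add v_i^{m+1} to our solution
            m += 1                # Go to the next neighbor
        else:
            break                 # If not, stop adding and exit
    return V_i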
In our framework, the set of nodes, V, represents all NNs for each query feature which survives the pruning step. The edge set is defined as E = {(v_i^m, v_j^n) | i ≠ j}, which signifies that all the nodes in G are connected as long as their corresponding query features are not the same. The edge weight, ω : E → R^+, is defined as ω(v_i^m, v_j^n) = exp(−‖ψ(v_i^m) − ψ(v_j^n)‖² / 2γ²), where ψ(·) represents an operator which returns the global descriptor of the parent image of the argument node and γ is empirically set to 27. The edge weight, ω(v_i^m, v_j^n), represents a similarity between nodes v_i^m and v_j^n in terms of the global features of their parent images. The node score, ζ : V → R^+, is defined as ζ(v_i^m) = exp(−‖ξ(q_i) − ξ(v_i^m)‖² / 2γ²). The node score shows how similar the node v_i^m is to its corresponding query feature in terms of its local features.
Matching the query features to the reference features requires identifying the correct NNs from the graph G which maximize the weight, that is, selecting a node (NN) which forms a coherent (highly compact) set in terms of both global and local feature similarities.
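As an illustration of the definitions above, a small sketch of how the matching graph could be assembled is given below; the container names and dictionary keys are illustrative and γ is the bandwidth mentioned in the text.

import numpy as np

def build_matching_graph(nodes, query_local, gamma):
    """Sketch: edge weights and node scores of the matching graph of Sec. 4.3.

    nodes       : list of dicts, one per surviving NN v_i^m, with keys
                  'q'   -> index i of the query feature it belongs to,
                  'loc' -> xi(v_i^m), local descriptor of the node,
                  'glo' -> psi(v_i^m), global descriptor of its parent image
    query_local : query_local[i] = xi(q_i)
    """
    n = len(nodes)
    W = np.zeros((n, n))                       # edge weights omega
    z = np.zeros(n)                            # node scores zeta
    for a, na in enumerate(nodes):
        d = np.linalg.norm(query_local[na['q']] - na['loc'])
        z[a] = np.exp(-d**2 / (2 * gamma**2))  # similarity to its own query feature
        for b, nb in enumerate(nodes):
            if na['q'] != nb['q']:             # connect only across different query features
                dg = np.linalg.norm(na['glo'] - nb['glo'])
                W[a, b] = np.exp(-dg**2 / (2 * gamma**2))
    return W, z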
The affinity matrix A represents the global similarity among reference images and is built using GPS locations as a global feature, while the node score b shows how similar a reference image is to its corresponding query feature in terms of their local features. We formulate the following optimization problem, a more general form of the dominant set formulation:
similarity between the query image and the candidate reference images at the global
level, but simply accounts for local matching of features. Therefore, we deal with
this issue by proposing a post processing step, which considers the comparison of the
global features of the images and employs constrained dominant set, a framework that
generalizes the dominant sets formulation and its variant [110, 112].
In our post processing step, the user-selected query and the matched reference im-
ages are related using their global features and a unique local solution is then searched
which contains the union of all the dominant sets containing the query. As customary,
the resulting solution will be globally consistent both with the reference images and the
query image and due to the notion of centrality, each element in the resulting solution
will have a membership score, which depicts how similar a given reference image is
with the rest of the images in the cluster. So, by virtue of this property of constrained
dominant sets, we will select the image with the highest membership score as the final
best matching reference image and approximate the location of the query with the GPS
location of the best matched reference image.
Given a user-specified query, Q ⊆ V̂, we define the graph Ĝ = (V̂, Ê, ŵ), where the edges are defined as Ê = {(i, j) | i ≠ j, {i, j} ∈ DS_n ∨ (i ∈ Q ∨ j ∈ Q)}, i.e., all the nodes are connected as long as they do not belong to different local maximizers, DS_n, which represents the n-th extracted dominant set. The set of nodes V̂ represents all matched reference images (local maximizers) and the query image, Q. The edge weight ŵ : Ê → R^+ is defined as:

    ŵ(i, j) = ρ(i, j),     for i ≠ j, i ∈ Q ∨ j ∈ Q,
              B_n(i, j),   for i ≠ j, {i, j} ∈ DS_n,
              0,           otherwise,
where ρ(i, j) is an operator which returns the global similarity of two given images i and j, that is, ρ(i, j) = exp(−‖ψ(i) − ψ(j)‖² / 2γ²), B_n represents a sub-matrix of B which contains only members of DS_n, normalized by its maximum value, and finally B_n(i, j) returns the normalized affinity between the i-th and j-th members of DS_n. The graph Ĝ can be represented by an n × n affinity matrix Â = (ŵ(i, j)), where n is the
number of nodes in the graph. Given a parameter α > 0, let us define the following parameterized variant of program (1.5):

    f_Q^α(x) = x^⊤ (Â − α Î_Q) x,

where Î_Q is the n × n diagonal matrix whose diagonal elements are set to 1 in correspondence to the vertices contained in V̂ \ Q (the set V̂ without the elements of Q) and to zero otherwise.
Let Q ⊆ V̂, with Q ≠ ∅, and let α > λ_max(Â_{V̂\Q}), where λ_max(Â_{V̂\Q}) is the largest eigenvalue of the principal submatrix of Â indexed by the elements of V̂ \ Q. If x is a local maximizer of f_Q^α in ∆, then σ(x) ∩ Q ≠ ∅. A complete proof can be found in [183].
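In practice, a local maximizer of f_Q^α can be sought with the same replicator dynamics used for standard dominant sets, run on the (suitably shifted) matrix Â − αÎ_Q; shifting all entries by a constant does not change the maximizers over the simplex. The sketch below is an illustrative implementation under these assumptions, not the exact code used in the thesis.

import numpy as np

def constrained_dominant_set(A_hat, Q, tol=1e-7, max_iter=10000, support_tol=1e-5):
    """Sketch: extract a constrained dominant set containing the query set Q.

    A_hat : (n, n) symmetric affinity matrix of the graph G_hat
    Q     : indices of the query vertices
    """
    n = A_hat.shape[0]
    not_q = np.ones(n, dtype=bool)
    not_q[list(Q)] = False
    # alpha must exceed the largest eigenvalue of the principal submatrix on V_hat \ Q
    sub = A_hat[np.ix_(not_q, not_q)]
    alpha = (np.max(np.linalg.eigvalsh(sub)) if sub.size else 0.0) + 1e-3
    B = A_hat - alpha * np.diag(not_q.astype(float))   # A_hat - alpha * I_Q
    B = B - B.min()                                    # shift to non-negative payoffs
    x = np.full(n, 1.0 / n)                            # start from the barycenter
    for _ in range(max_iter):
        x_new = x * (B @ x)
        x_new /= x_new.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    support = np.where(x > support_tol)[0]             # sigma(x), intersects Q by the result above
    return support, x                                  # x also carries the membership scores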
The above result provides us with a simple technique to determine dominant-set
clusters containing user-specified query vertices. Indeed, if Q is a vertex selected by
Figure 4.2: Exemplar output of the dominant set framework: Left: query, Right: each row shows
corresponding reference images from the first, second and third local solutions (dominant sets), re-
spectively, from top to bottom. The number under each image shows the frequency of the matched
reference image, while those on the right side of each image show the min-max normalized scores
of HSV, CNN6, CNN7 and GIST global features, respectively. The filled colored circles on the upper right corner of the images are used as reference IDs of the images.
Let F_i be the number (cardinality) of local features which belong to the i-th reference image from the extracted sets, and let the total number of matched reference images be N. We build an affinity matrix B̂ of size S = Σ_{i=1}^{N} F_i + 1 (e.g., for the example in Fig. 4.2, the size S is 19). Fig. 4.4 shows the reduced graph for the matched reference images shown in Fig. 4.2. The upper left of Fig. 4.4 shows the part of the graph used for the post processing: it shows the relation that the query has with the matched reference images. The bottom left part of the figure shows how one can obtain the full graph from the reduced graph. For the example in Fig. 4.4, V̂ = {Q, 1, 2, 2, 2, 3, 4, 4, ..., 6}.
The advantages of using constrained dominant sets are numerous. First, it provides a unique local (and hence global) solution whose support coincides with the union of all the dominant sets of Ĝ which contain the query. Such a solution contains all the local solutions that have a strong relation with the user-selected query. As can be observed in Fig. 4.4 (bottom right), the constrained dominant set which contains the query Q, CDS(Q), is the union of all overlapping dominant sets (the query, one green,
Figure 4.3: Exemplar output of the dominant set framework: Left: query, Right: each row shows corre-
sponding reference images of the first, second and third local solutions (dominant sets), respectively,
from top to bottom. The number under each image shows the mode of the matched reference image,
while those on the right of each image show the min-max normalized scores of HSV, CNN6, CNN7
and GIST global features, respectively.
one dark red and three yellow nodes) containing the query as one of the members.
If we assume there is no cyan link between the green and yellow nodes, CDS(Q) will not be affected as long as there is a strong relation between the green node and the query. In addition, if noise reduces the strong affinity between the query and the green node, the strong relation through the cyan link will still preserve the final result. Second, in addition to fusing all the local solutions, leveraging the notion of centrality, one of the interesting properties of the dominant set framework is that it assigns to each image a score corresponding to how similar it is to the rest of the images in the solution. Therefore, it not only helps select the best local solution, but also choose the best final match from the chosen local solution. Third, an interesting property of the constrained dominant sets approach is that it considers not only the pairwise similarity between the query and the reference images, but also the similarity among the reference images. This property helps the algorithm avoid assigning wrongly high pairwise affinities. As an example, with reference to Fig. 4.4, if we consider only the nodes' pairwise affinities, the best solution would be the dark red node (score 1.00). However, using constrained dominant sets and considering the relation among the neighbors, the solution bounded by the red dotted rectangle can be found, and, by choosing the node with the highest membership score, the final best
Figure 4.4: Exemplar graph for post processing. Top left: reduced graph for Fig. 4.2 which contains
unique matched reference images. Bottom left: Part of the full graph which contains the gray
circled nodes of the reduced graph and the query. Top right: corresponding affinity of the reduced
graph. Bottom right: the outputs of the nearest neighbor approach, which considers only the nodes' pairwise similarity (KNN(Q) = node 3, the dark red node), and of the constrained dominant sets approach (CDS(Q) = node 2, the yellow node).
match is the yellow node which is more similar to the query image than the reference
image depicted by the dark red node (see Fig. 4.2).
Figure 4.5: The top four rows are sample street view images from eight different places of WorldCities
dataset. The bottom two rows are sample user uploaded images from the test set.
Rome, Milan and Paris), Australia (Sydney and Melbourne), and the USA (Vegas, Los Angeles, Phoenix, Houston, San Diego, Dallas, Chicago). The existence of similar buildings around the world, in terms of their wall designs, edges, shapes, colors, etc., makes this dataset more challenging than the others. Fig. 4.5 (top four rows) shows sample reference images taken from different place marks.
For the test set, we use 644 and 500 GPS-tagged user-uploaded images downloaded from Picasa, Flickr and Panoramio for the 102k Google street view images and WorldCities datasets, respectively. Fig. 4.5 (last two rows) shows sample test images. Throughout our experiments, we use all the reference images from around the world to find the best match with the query image, not just those of the ground truth city.
The proposed approach has then been compared with the results obtained by state-of-the-art methods. In Fig. 4.6, the horizontal axis shows the error threshold in meters and the vertical axis shows the percentage of the test set localized within that error threshold. Since the scope of this work is accurate image localization at a city-scale level, test images localized above 300 meters are considered failures.
The black (-*-) curve shows the localization results of the approach proposed in [139], which uses a vocabulary tree to localize images. The red (-o-) curve depicts the results of [177], where only the first NN of each query feature is considered as the best match, which makes the approach very sensitive to the selected query features.
Moreover, their approach suffers from the lack of global feature information. The green (-o-) curve illustrates the localization results of [178], which uses the generalized maximum clique problem (GMCP) to solve the feature matching problem and follows a voting scheme to select the best matching reference image. The black (-o- and −♦−) curves show the localization results of MAC and RMAC, (regional) maximum activation of convolutions [123, 154]. These approaches build compact feature vectors that encode several image regions without the need to feed multiple inputs to the network. The cyan (-o-) curve represents the localization results of NetVLAD [5], which aggregates mid-level (conv5) convolutional features extracted from the entire image into a compact single vector representation amenable to efficient indexing. The cyan (−♦−) curve depicts the localization results of NetVLAD fine-tuned on our dataset. The blue (−♦−) curve shows the localization results of the approach proposed in [135], which exploits geometric relations between different database images retrieved by a query to handle geometric burstness. The blue (-o-) curve shows the results of our baseline approach, that is, using a voting scheme to select the best matching reference image and estimate the location of the query image. We are able to make a 10% improvement w.r.t. the other methods with
only our baseline approach (without post processing). The magenta (-o-) curve illus-
trates geo-localization results of our proposed approach using dominant set clustering
based feature matching and constrained dominant set clustering based post processing.
As can be seen, our approach shows about 20% improvement over the state-of-the-art
techniques.
Computational Time. Fig. 4.7, on the vertical axis, shows the ratio between GMCP (numerator) and our approach (denominator) in terms of the CPU time taken for each query image to be localized. As is evident from the plot, this ratio ranges from 200 (our approach is 200x faster than GMCP) up to a maximum of even 750x.
We have also compared the performance of different algorithms on the new dataset of
300k Google street view images created by us. Similarly to the previous tests, Fig. 4.8
reports the percentage of the test set localized within a particular error threshold. Since
the new dataset is relatively more challenging, the overall performance achieved by all
the methods is lower than on the 102k image dataset.
From bottom to top of the graph in Fig. 4.8 we present the results of [123,154] black
(−♦− and -o-), [135] blue (−♦−), [177] red (-o-), [5] cyan (-o-), fine tuned [5] cyan
(−♦−), [178] green (-o-), our baseline approach without post processing blue (-o-) and our final approach with post processing magenta (-o-). The improvements obtained
with our method are lower than in the other dataset, but still noticeable (around 2% for
the baseline and 7% for the final approach).
Some qualitative results for Pittsburgh, PA are presented in Fig. 4.9.
4.5.3 Analysis
4.5.3.1 Outlier Handling
In order to show that our dominant set-based feature matching technique is robust in
handling outliers, we conduct an experiment by fixing the number of NNs (disabling
the dynamic selection of NNs) to different numbers. It is obvious that the higher the
Figure 4.6: Comparison of our baseline (without post processing) and final method, on overall geo-
localization results, with state-of-the-art approaches on the first dataset (102K Google street view
images).
number of NNs considered for each query feature, the higher the number of outlier NNs in the input graph will be, besides the increased computational cost and an elevated chance that query features whose NNs do not contain any inliers survive the pruning stage.
Fig. 4.10 shows the geo-localization results obtained by using GMCP and dominant set based feature matching on 102K Google street view images [178]. The graph shows the percentage of the test set localized within a distance of 30 meters as a function of the number of NNs. The blue curve shows the results using dominant sets: it is evident that when the number of NNs increases, the performance improves despite the fact that more outliers are introduced into the input graph. This is mainly because our framework takes advantage of the few inliers that are added along with many outliers. The red curve shows the results of GMCP based localization: as the number of NNs increases, the results begin to drop. This is mainly due to the fact that their approach imposes a hard constraint that at least one matching reference feature must be selected for each query feature, whether or not the matching feature is correct.
In order to show the effectiveness of the post processing step, we perform an exper-
iment comparing our constrained dominant set based post processing with a simple
Figure 4.7: The ratio of CPU time taken between GMCP based geo-localization [178] and our ap-
proach, computed as CPU time for GMCP/CPU time for DSC.
voting scheme to select the best matching reference image. The GPS information of
the best matching reference image is used to estimate the location of the query image.
In Fig. 4.11, the vertical axis shows the percentage of the test set localized within a particular error threshold, shown on the horizontal axis (in meters). The blue and magenta curves depict the geo-localization results of our approach using a simple voting scheme and constrained dominant sets, respectively. The green curve shows the results of GMCP based geo-localization. As is clear from the results, our approach with post processing exhibits superior performance compared to both GMCP and our baseline
approach.
Since our post-processing algorithm can be easily plugged into existing retrieval methods, we perform another experiment to determine how much improvement we can achieve with our post processing. We use the methods of [123, 154, 155] to obtain candidate reference images and employ as edge weights the similarity scores generated by the corresponding approaches. For each dataset, the first row of Table 4.1 reports the rank-1 results obtained by the existing algorithms, while the second row (w_post) shows the rank-1 results obtained after adding the proposed post-processing step on top of the retrieved images. For each query, we use the first 20 retrieved reference images. As the results in Table 4.1 demonstrate, we are able to make up to 7% and 4% improvements on the 102k Google street view images and WorldCities datasets, respectively. We ought to note that the total additional time required to perform the above post processing, for
Figure 4.8: Comparison of overall geo-localization results using DSC with and without post processing
and state-of-the-art approaches on the WorldCities dataset.
Table 4.1: Results of the experiment, done on the 102k Google street view images (Dts1) and WorldCities
(Dts2) datasets, to see the impact of the post-processing step when the candidates of reference images
are obtained by other image retrieval algorithms
The NetVLAD results are obtained from the features generated using the best trained model downloaded from the authors' project page [155]. Its fine-tuned version (NetVLAD*) is obtained from the model we fine-tuned using images within a 24 m range as a positive set and images with GPS locations farther than 300 m as a negative set.
The MAC and RMAC results are obtained using MAC and RMAC representations extracted from the fine-tuned VGG networks downloaded from the authors' webpage [123, 154].
Figure 4.9: Sample qualitative results taken from Pittsburgh area. The green ones are the ground truth
while yellow locations indicate our localization results.
The input graph for our post processing step utilizes the global similarity between the query and the matched reference images. A wide variety of global features can be used for the proposed technique. In our experiments, the similarity between the query and the corresponding matched reference images is computed between their global features,
Figure 4.11: The effectiveness of constrained dominant set based post processing step over simple voting
scheme.
using HSV, GIST, CNN6, CNN7 and fine-tuned NetVLAD. The performance of the
proposed post processing technique highly depends on the discriminative ability of the
global features used to build the input graph. Depending on how informative the feature
is, we dynamically assign a weight for each global feature based on the area under the
normalized score between the query and the matched reference images. To show the
effectiveness of this approach, we perform an experiment to find the location of our
test set images using both individual and combined global features. Fig. 4.12 shows
the results attained by using fine-tuned NetVLAD, CNN7, CNN6, GIST, HSV and by
combining them together. The combination of all the global features outperforms the
individual feature performance, demonstrating the benefits of fusing the global features
based on their discriminative abilities for each query.
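One simple way to realize such a weighting (shown for illustration only; the exact formula may differ) is to min-max normalize the scores of each global feature and use the area under the resulting curve as its weight, e.g.:

import numpy as np

def global_feature_weights(scores):
    """Illustrative sketch: weight each global feature (HSV, GIST, CNN6, CNN7,
    NetVLAD, ...) by the area under its min-max normalized score curve, computed
    over the matched reference images of the current query."""
    weights = {}
    for name, s in scores.items():
        s = np.sort(np.asarray(s, dtype=float))
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)            # min-max normalization
        weights[name] = np.trapz(s, dx=1.0 / max(len(s) - 1, 1))   # area under the curve
    total = sum(weights.values()) + 1e-12
    return {k: v / total for k, v in weights.items()}              # weights sum to one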
4.6 Summary
In this chapter, we proposed a novel framework for city-scale image geo-localization.
Specifically, we introduced dominant set clustering-based multiple NN feature match-
ing approach. Both global and local features are used in our matching step in order
to improve the matching accuracy. In the experiments, carried out on two large city-
scale datasets, we demonstrated the effectiveness of post processing employing the
novel constrained dominant set over a simple voting scheme. Furthermore, we showed
Figure 4.12: Comparison of geo-localization results using different global features for our post process-
ing step.
that our proposed approach is 200 times, on average, faster than GMCP-based ap-
proach [178]. Finally, the newly-created dataset (WorldCities) containing more than
300k Google Street View images used in our experiments is available to the public for
research purposes.
As a natural future direction of research, we can extend the results of this work for
estimating the geo-spatial trajectory of a video in a city-scale urban environment from
a moving camera with unknown intrinsic camera parameters.
CHAPTER 5
Simultaneous Clustering and Outlier Detection
using Dominant sets
In this chapter, we present a unified approach for simultaneous clustering and outlier
detection in data. We utilize some properties of a family of quadratic optimization
problems related to dominant sets, a well-known graph-theoretic notion of a cluster
which generalizes the concept of a maximal clique to edge-weighted graphs. Unlike
most of the previous techniques, in our framework the number of clusters arises intuitively and outliers are obliterated automatically; the resulting algorithm discovers both parameters from the data. Experiments on real and on large-scale synthetic datasets demonstrate the effectiveness of our approach and the utility of carrying out both clustering and outlier detection in a concurrent manner.
5.1 Introduction
In the literature, clustering and outlier detection are often treated as separate problems.
However, it is natural to consider them simultaneously. The problem of outlier detec-
tion is deeply studied in both communities of data mining and statistics [22, 64], with
different perspectives.
A classical statistical approach for finding outliers in multivariate data is Minimum
Covariance Determinant (MCD). The main objective of this approach is to identify a
subset of points which minimizes the determinant of the variance-covariance matrix
over all subsets of size n − l, where n is the number of multivariate data points and l is
the number of outliers. The resulting variance-covariance matrix can be integrated into
the Mahalanobis distance and used as part of a chi-square test to identify multivariate
outliers [133]. However, the high computational complexity makes it impractical for
high-dimensional datasets.
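For reference, a standard way to apply this classical test is sketched below; it only illustrates the baseline described above and is not part of the proposed method.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def mcd_outliers(X, quantile=0.975):
    """Flag multivariate outliers via MCD and robust Mahalanobis distances.

    X : (n, d) data matrix; points whose squared robust Mahalanobis distance
        exceeds the chi-square quantile with d degrees of freedom are flagged.
    """
    mcd = MinCovDet().fit(X)      # robust location and scatter estimate
    d2 = mcd.mahalanobis(X)       # squared Mahalanobis distances
    return d2 > chi2.ppf(quantile, df=X.shape[1])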
• it requires no prior knowledge of the number of clusters since the approach dis-
covers the number of clusters from the data.
• the effectiveness of the SCOD is tested on both synthetic and real datasets.
The rest of the chapter is organised as follows: Section 5.2 details our approach for enumerating dominant sets while obliterating outliers. Experimental results are shown in Section 5.3, followed by a summary in Section 5.4.
A good cluster has a high C value. The average global cohesiveness GC of a given similarity matrix can be computed by fixing the vector x to the barycenter of the simplex, specifically x_i = 1/N, where N is the size of the data and i = 1 ... N.
If the payoff matrix A is symmetric, then f(x) = x^⊤ A x is a strictly increasing function along any non-constant trajectory of any payoff-monotonic dynamics, of which
replicator dynamics are a special instance. This property, together with the cohesiveness measure, allowed us to modify the dominant set framework for SCOD.
Initializing the dynamics to the barycenter, say x_t at time t = 1, each step of the iteration implies an increase in f(x_{t+i}) for any i ≥ 1, which in turn entails that at convergence, at time t = c, the cohesiveness of the cluster, extracted as the support of x_c, is greater than the average global cohesiveness (GC), i.e., x_c^⊤ A x_c > GC.
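The quantities involved are easy to compute; the following is a minimal sketch (illustrative, assuming a symmetric non-negative affinity with zero diagonal) of the replicator dynamics and of the global cohesiveness GC evaluated at the barycenter.

import numpy as np

def replicator_dynamics(A, tol=1e-8, max_iter=10000):
    """Run standard replicator dynamics from the barycenter towards a local
    maximizer x_c of f(x) = x^T A x over the simplex."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)          # barycenter initialization
    for _ in range(max_iter):
        x_new = x * (A @ x)
        x_new /= x_new.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

def global_cohesiveness(A):
    """GC: cohesiveness of the whole dataset, f(x) evaluated at the barycenter."""
    x = np.full(A.shape[0], 1.0 / A.shape[0])
    return float(x @ A @ x)

# at convergence, the extracted cluster (support of x_c) satisfies x_c^T A x_c > GC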
In the dominant set framework, there are situations where false positive clusters arise.
First, a large number of very loosely coupled objects with similar affinities may emerge as a big cluster. This can be handled using the cohesiveness measure, since the dynamics initialized at the barycenter will not take any big step in such a case.
Secondly, a small compact set of outliers may form a cluster whose cohesiveness is greater than the average global cohesiveness of the original similarity. In our automated framework, to address these issues, a new affinity (S) is built from the original pairwise similarity (A) based on M-estimation from robust statistics.
To every candidate i, a weight which intuitively measures its average relationship with its local neighbors is assigned:
where w(i) = (1/|N_i|) Σ_{j∈N_i} A(i, j) and N_i is the set of the top N most similar items of object i, based on the pairwise similarity (A). The framework is not sensitive to the setting of the parameter N; in all the experiments we fixed it to 10% of the size of the data. One may also choose it based on the average distances among the objects. A similar study, though with a different intention, has been done in [23], which illustrated that this approach makes the system less sensitive to the parameter σ used to build the similarity.
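A small sketch of the neighbor weights described above, with N fixed to 10% of the data as in the experiments; the function name is illustrative.

import numpy as np

def neighbor_weights(A, frac=0.10):
    """w(i): average affinity of object i to its top-N most similar items."""
    n = A.shape[0]
    N = max(1, int(round(frac * n)))
    w = np.empty(n)
    for i in range(n):
        order = np.argsort(A[i])[::-1]   # most similar first
        order = order[order != i][:N]    # exclude the object itself
        w[i] = A[i, order].mean()
    return w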
The newly built affinity (S) can be used in different ways: first, we can use it to re-check whether the extracted sets are strict local solutions of (1.5) by setting A = S. Another simpler and more efficient way is to use it for the cohesiveness measure, i.e., an extracted set, to be a cluster, should satisfy the cohesiveness criterion in both affinities A and S.
Figure 5.1 illustrates the second case. The red points in the middle form a compact outlier set which forms a dominant set whose cohesiveness (C = 0.952) is greater than the average global cohesiveness (GC = 0.580). However, its cohesiveness in the newly built affinity (CL = 0.447) is less than the average global cohesiveness. Observe that the two true positive clusters (green and blue) have cohesiveness measures (in both affinities, C and CL) which are greater than the average global cohesiveness. Algorithm 5 summarizes the details.
Figure 5.1: Exemplar plots. Left: original data points, with different colors showing possible clusters. Right: extracted clusters and their cohesiveness measures, C with the affinity (A) and CL with the learned affinity (S); the two true clusters have (C, CL) = (0.987, 0.982) and (0.988, 0.984), the compact outlier set has (C, CL) = (0.952, 0.447), and GC = 0.580.
1: Outliers ← ∅
2: Clusters ← ∅
3: S ← Build new affinity from A using (5.1)
4: Initialize x to the barycenter and i and j to 1
5: GC ← x^⊤ A x
6: while size of A ≠ 0 do x_c ← Find local solution of (1.5)
7:   if x_c^⊤ S x_c < GC or x_c^⊤ A x_c < GC then
8:     O_j ← σ(x_c)            ▷ find the j-th outlier set
9:     j ← j + 1
10:  else
11:    DS_i ← σ(x_c)           ▷ find the i-th dominant set
12:    i ← i + 1
13:  end if
14:  Remove σ(x_c) from the affinity matrices S and A
15: end while
16: Clusters = ∪_{k=1}^{i} DS_k
17: Outliers = ∪_{k=1}^{j} O_k
OUTPUT: {Clusters, Outliers}
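A compact Python sketch of Algorithm 5 is given below; it assumes a replicator_dynamics routine like the one sketched earlier and a build_new_affinity function implementing Eq. (5.1), so it is an illustrative sketch rather than a reference implementation.

import numpy as np

def scod(A, build_new_affinity, replicator_dynamics, support_tol=1e-5):
    """Sketch of Algorithm 5: simultaneous clustering and outlier detection."""
    S = build_new_affinity(A)                 # new affinity from Eq. (5.1)
    idx = np.arange(A.shape[0])               # original indices still in play
    x0 = np.full(A.shape[0], 1.0 / A.shape[0])
    GC = float(x0 @ A @ x0)                   # average global cohesiveness
    clusters, outliers = [], []
    while idx.size > 0:
        x = replicator_dynamics(A)            # local solution of (1.5)
        sup = np.where(x > support_tol)[0]    # sigma(x_c)
        if x @ S @ x < GC or x @ A @ x < GC:
            outliers.append(idx[sup])         # next outlier set O_j
        else:
            clusters.append(idx[sup])         # next dominant set DS_i
        keep = np.setdiff1d(np.arange(idx.size), sup)
        A, S, idx = A[np.ix_(keep, keep)], S[np.ix_(keep, keep)], idx[keep]
    return clusters, outliers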
5.3 Experiments
In this section we evaluate the proposed approach on both synthetic and real datasets. First, we evaluate our method on synthetic datasets and present a quantitative analysis. Then, we present experimental results on the real datasets KDD-CUP and SHUTTLE.
A pairwise distance D among individuals is transformed into a similarity (edge weight)
where σ is chosen as the median distance among the items, and 1_P = 1 if P is true, 0 otherwise. We compare the results of our dominant set clustering based approach with k-means-- [24].
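A typical transformation of this kind, using the median pairwise distance as σ and the indicator 1_P to zero the diagonal, is the Gaussian kernel sketched below; this is for illustration only and the exact formula used here may differ.

import numpy as np

def distance_to_similarity(D):
    """Illustrative Gaussian-kernel transformation of a pairwise distance matrix
    D into an edge-weighted affinity A, with sigma set to the median distance."""
    sigma = np.median(D[np.triu_indices_from(D, k=1)])   # median pairwise distance
    A = np.exp(-D**2 / (2.0 * sigma**2))
    np.fill_diagonal(A, 0.0)                             # 1_{i != j}
    return A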
number of outliers increases. In Figure 5.3, the case where we vary the dimension, our approach scores relatively low in most of the measurements when the dimension is set to 2, but we get excellent results as the dimension increases. In the last case, in Figure 5.4, we can see that our method is invariant to the value of the standard deviation and gets close to 1 in most of the measurements.
Figure 5.2: Results of the algorithm on synthetic datasets with respect to an increasing number of outliers (l), while fixing the parameters k = 10, m = 100, d = 32, and σ = 0.2.
Figure 5.3: Results of the algorithm on synthetic datasets with respect to increasing dimension (d), while fixing the parameters k = 10, m = 100, l = 100, and σ = 0.1.
Figure 5.4: Results of the algorithm on synthetic datasets with respect to the standard deviation used to generate the data (σ), while fixing the parameters k = 10, m = 100, l = 100, and d = 32.
In this section we will discuss the performance of our approach on real datasets.
5.3.2.1 SHUTTLE
We first consider the “SHUTTLE” dataset, which is publicly available in the UCI Machine Learning repository [54]. This dataset contains 7 class labels, while the main three classes account for 99.6% of the dataset, with frequencies of 78.4%, 15.5%, and 5.6%, respectively. We took these three classes as non-outliers, while the rest (0.4%) are considered outliers. The dataset contains one categorical attribute, which is taken as the class label, and 9 numerical attributes. We use the training part of the dataset, which consists of 43,500 instances.
The results of our algorithm on this dataset are shown in Table 5.1. The precision is computed with respect to the outliers found by the algorithm against the ground truth outliers. Since the number of outliers l and the number of clusters k, required by k-means--, are typically not known exactly, we explore how their misspecification affects the results.
To investigate the influence of the number of clusters (k) on k-means--, we run the experiments with varying values of k while fixing the number of outliers to l = 0.4% (the correct value). As can be seen from Table 5.1, misspecification of the number of clusters has a negative effect on k-means--: the approach performs worse in all measurements as the number of clusters decreases. Our approach has the best result in all measurements. We can observe how providing k-means-- with different values of k results in worse performance, which highlights the advantage of our method, which is capable of automatically selecting the number of clusters and outliers from the data.
Table 5.1: Results on SHUTTLE dataset with fixed l and varying K
5.3.2.2 KDD-CUP
We further evaluate our approach on the 1999 KDD-CUP dataset, which contains instances describing connections of TCP packet sequences. Every row is labeled as intrusion or normal, along with its intrusion type. Since the dataset has both categorical and numerical attributes, for simplicity we consider only the 38 numerical attributes, after normalizing each of them to zero mean and unit standard deviation. Similarly to [24], we use 10% of the dataset for our experiment, that is, around
494,021 instances. There are 23 classes, while 98.3% of the instances belong to only 3 classes, namely smurf (56.8%), neptune (21.6%) and normal (19.6%). We took these three classes as non-outliers, while the rest (1.7%) are considered outliers.
The results of our algorithm on the KDD-CUP dataset are reported in Table 5.3. Here we also compare our results with k-means-- while taking different values of k. We see that both techniques perform quite well in terms of purity, that is, they are able to extract clusters which best match the ground truth labels, while our algorithm performs better in identifying outliers, with relatively good precision.
Table 5.3: Results on KDD-CUP dataset with fixed number of outliers while varying cluster number
5.4 Summary
In this chapter, we proposed a modified dominant set clustering problem for simultaneous clustering and outlier detection from data. Unlike most of the previous approaches, our method requires no prior knowledge of either the number of clusters or the number of outliers, which makes our approach more convenient for real applications. Moreover, our proposed algorithm is simple to implement and highly scalable. We tested the performance of SCOD on both large-scale synthetic and real datasets, and showed superior results.
CHAPTER 6
Conclusion
Chapter 3 presents a new tracking approach which is able to overcome the limitations of our DSC tracker, and also extends the problem to multiple cameras with non-overlapping fields of view. In that chapter, we presented a constrained dominant
set clustering (CDSC) based framework for solving multi-target tracking problem in
multiple non-overlapping cameras. The proposed method utilizes a three-layer hierar-
chical approach, where within-camera tracking is solved using first two layers of our
framework resulting in tracks for each person, and later in the third layer the proposed
across-camera tracker merges tracks of the same person across different cameras. Ex-
periments on a challenging real-world dataset (MOTchallenge DukeMTMCT) validate
the effectiveness of our model. We further perform additional experiments to show the effectiveness of the proposed across-camera tracking on one of the largest video-based people re-identification datasets (MARS). Here, each query is treated as a constraint set
and its corresponding members in the resulting constrained dominant set cluster are
considered as possible candidate matches to their corresponding query.
In Chapter 4, we proposed a novel framework for city-scale image geo-localization.
Specifically, we introduced dominant set clustering-based multiple NN feature match-
ing approach. Both global and local features are used in our matching step in order
to improve the matching accuracy. In the experiments, carried out on two large city-
scale datasets, we demonstrated the effectiveness of post processing employing the con-
strained dominant set over a simple voting scheme. We evaluate the proposed frame-
work on an existing dataset as well as a new larger dataset, and show that it outperforms
the state-of-the-art by 20% and 7%, respectively, on the two datasets. Furthermore, we
showed that our proposed approach is 200 times, on average, faster than GMCP-based
approach [178]. Moreover, the newly-created dataset (WorldCities) containing more
than 300k Google Street View images used in our experiments is available to the public
for research purposes.
Finally, in Chapter 5, we presented a modified dominant set clustering problem for simultaneous clustering and outlier detection from data. Unlike most of the previous approaches, our method requires no prior knowledge of either the number of clusters or the number of outliers, which makes our approach more convenient for real applications. Moreover, our proposed algorithm is simple to implement and highly scalable. We first tested the performance of SCOD on large-scale synthetic datasets, which confirmed that, in a controlled setup, the algorithm is able to achieve excellent results in an efficient manner. We then conducted further evaluation on real datasets and attained a significant improvement.
Bibliography
[1] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M Seitz, and Richard Szeliski. Building rome in a day.
In International Conference on Computer vision (ICCV), pages 72–79. IEEE, 2009.
[2] Stefano Alletto, Giuseppe Serra, Simone Calderara, Francesco Solera, and Rita Cucchiara. From ego to nos-
vision: Detecting social relationships in first-person views. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 580–585, 2014.
[3] Anton Andriyenko and Konrad Schindler. Multi-target tracking by continuous energy minimization. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 1265–1272, 2011.
[4] Nadeem Anjum and Andrea Cavallaro. Trajectory association and fusion across partially overlapping cam-
eras. In IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS), pages
201–206, 2009.
[5] Relja Arandjelovic, Petr Gronát, Akihiko Torii, Tomás Pajdla, and Josef Sivic. Netvlad: CNN architecture for
weakly supervised place recognition. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), pages 5297–5307, 2016.
[6] Relja Arandjelović and Andrew Zisserman. Dislocation: Scalable descriptor distinctiveness for location
recognition. In Asian Conference on Computer Vision, pages 188–204. Springer, 2014.
[7] JG Augston and J Minker. An analysis of some graph theoretical clustering techniques. J. of the ACM,
17(4):571–588, 1970.
[8] Yannis Avrithis, Yannis Kalantidis, Giorgos Tolias, and Evaggelos Spyrou. Retrieving landmark and non-
landmark images from community photo collections. In Proceedings of the 18th ACM international confer-
ence on Multimedia, pages 153–162. ACM, 2010.
[9] Jérôme Berclaz, François Fleuret, Engin Türetken, and Pascal Fua. Multiple object tracking using k-shortest
paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)), 33(9):1806–
1819, 2011.
[10] Alessandro Bergamo, Sudipta N Sinha, and Lorenzo Torresani. Leveraging structure from motion to learn
discriminative codebooks for scalable landmark classification. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 763–770, 2013.
[11] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR
MOT metrics. EURASIP J. Image and Video Processing, 2008, 2008.
[12] Immanuel M. Bomze. Evolution towards the maximum clique. J. Global Optimization, 10(2):143–164, 1997.
[13] Immanuel M. Bomze. On standard quadratic optimization problems. J. Global Optimization, 13(4):369–387,
1998.
[14] Immanuel M. Bomze. Branch-and-bound approaches to standard quadratic optimization problems. J. Global
Optimization, 22(1-4):17–37, 2002.
[15] William Brendel, Mohamed R. Amer, and Sinisa Todorovic. Multiobject tracking as maximum weight inde-
pendent set. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1273–1280,
2011.
[16] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. Lof: Identifying density-based
local outliers. SIGMOD Rec., 29(2):93–104, May 2000.
[17] Weiling Cai, Songcan Chen, and Daoqiang Zhang. Fast and robust fuzzy c-means clustering algorithms
incorporating local information for image segmentation. Pattern recognition, 40(3):825–838, 2007.
[18] Yinghao Cai, Kaiqi Huang, and Tieniu Tan. Human appearance matching across multiple non-overlapping
cameras. In International Conference on Pattern Recognition (ICPR), pages 1–4, 2008.
[19] Yinghao Cai and Gérard G. Medioni. Exploring context information for inter-camera multiple target tracking.
In IEEE Workshop on Applications of Computer Vision (WACV), pages 761–768, 2014.
[20] Simone Calderara, Rita Cucchiara, and Andrea Prati. Bayesian-competitive consistent labeling for people
surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(2), 2008.
[21] Song Cao and Noah Snavely. Graph-based discriminative learning for location recognition. In IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), pages 700–707, 2013.
[22] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv.,
41(3), 2009.
[23] Hong Chang and Dit-Yan Yeung. Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203,
2008.
[24] Sanjay Chawla and Aristides Gionis. k-means-: A unified approach to clustering and outlier detection. In
Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA.,
pages 189–197, 2013.
[25] Sanjay Chawla and Pei Sun. SLOM: a new measure for local spatial outliers. Knowl. Inf. Syst., 9(4):412–429,
2006.
[26] Chao-Yeh Chen and Kristen Grauman. Clues from the beaten path: Location estimation with bursty sequences
of tourist photos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1569–
1576, 2011.
[27] David M Chen, Georges Baatz, Kevin Köser, Sam S Tsai, Ramakrishna Vedantham, Timo Pylvänäinen,
Kimmo Roimela, Xin Chen, Jeff Bach, Marc Pollefeys, et al. City-scale landmark identification on mobile
devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 737–744, 2011.
[28] Hwei-Ju Chen, R. Gnanadesikan, and J. R. Kettenring. Statistical methods for grouping corporations. Indian
Statistical Institute, 36(1):1–28, 1974.
[29] Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. SODA, pages
826–835, 2008.
[30] Wei-Peng Chen, Jennifer C Hou, and Lui Sha. Dynamic clustering for acoustic target tracking in wireless
sensor networks. IEEE Transactions on Mobile Computing, 3(3):258–271, 2004.
[31] Xiaojing Chen and Bir Bhanu. Integrating social grouping for multi-target tracking across cameras in a crf
model. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), in press.
[32] Xiaotang Chen, Kaiqi Huang, and Tieniu Tan. Object tracking across non-overlapping views by learning
inter-camera transfer models. Pattern Recognition, 47(3):1126–1137, 2014.
[33] De Cheng, Yihong Gong, Jinjun Wang, Qiqi Hou, and Nanning Zheng. Part-aware trajectories association
across non-overlapping uncalibrated cameras. Neurocomputing, 230:30–39, 2017.
[34] Amit Chilgunde, Pankaj Kumar, Surendra Ranganath, and Weimin Huang. Multi-camera target tracking in
blind regions of cameras with non-overlapping fields of view. In British Machine Vision Conference (BMVC),
pages 1–10, 2004.
[35] Lingyang Chu, Shuhui Wang, Siyuan Liu, Qingming Huang, and Jian Pei. ALID: scalable dominant cluster
detection. Conference on Very Large DataBases (VLDB), 8(8):826–837, 2015.
[36] Keh-Shih Chuang, Hong-Long Tzeng, Sharon Chen, Jay Wu, and Tzong-Jer Chen. Fuzzy c-means clustering
with spatial information for image segmentation. Computerized Medical Imaging and Graphics, 30(1):9–15,
2006.
[37] A Cohen, R Gnanadesikan, JR Kettenring, and JM Landwehr. Methodological developments in some appli-
cations of clustering. Applications of statistics, pages 141–162, 1977.
[38] Dung Nghi Truong Cong, Catherine Achard, Louahdi Khoudour, and Lounis Douadi. Video sequences associ-
ation for people re-identification across multiple non-overlapping cameras. In IAPR International Conference
on Image Analysis and Processing (ICIAP), pages 179–189, 2009.
[39] Dalia Coppi, Simone Calderara, and Rita Cucchiara. Appearance tracking by transduction in surveillance
scenarios. In AVSS, pages 142–147, 2011.
[40] Dalia Coppi, Simone Calderara, and Rita Cucchiara. People appearance tracing in video by spectral graph
transduction. In IEEE International Conference on Computer Vision Workshops, pages 920–927, 2011.
[41] David J Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the world’s photos. In
Proceedings of the 18th international conference on World wide web, pages 761–770. ACM, 2009.
[42] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 886–893, 2005.
[43] Afshin Dehghan, Shayan Modiri Assari, and Mubarak Shah. GMMCP tracker: Globally optimal generalized
maximum multi clique problem for multiple object tracking. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 4091–4099, 2015.
[44] Carlos R. del-Blanco, Raúl Mohedano, Narciso N. García, Luis Salgado, and Fernando Jaureguizar. Color-
based 3d particle filtering for robust tracking in heterogeneous environments. In ACM/IEEE International
Conference on Distributed Smart Cameras (ICDSC), pages 1–10, 2008.
[45] Tiziana D’Orazio, Pier Luigi Mazzeo, and Paolo Spagnolo. Color brightness transfer function evaluation
for non overlapping multi camera tracking. In ACM/IEEE International Conference on Distributed Smart
Cameras (ICDSC), pages 1–6, 2009.
[46] Richard O Duda, Peter E Hart, and David G Stork. Pattern classification and scene analysis part 1: Pattern
classification. Wiley, Chichester, 2000.
[47] Anna Ellis and James M. Ferryman. PETS2010 and PETS2009 evaluation of results using individual ground
truthed single views. In IEEE International Conference on Advanced Video and Signal Based Surveillance,
pages 135–142, 2010.
[48] A. Ess, B. Leibe, K. Schindler, and L. van Gool. A mobile vision system for robust multi-person tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, June 2008.
[49] A. Ess, B. Leibe, K. Schindler, and L. van Gool. A mobile vision system for robust multi-person tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, June 2008.
[50] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-
identification by symmetry-driven accumulation of local features. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), pages 2360–2367, 2010.
[51] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with
discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI), 32(9):1627–1645, 2010.
[52] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe. Multi-target tracking using joint probabilistic data associ-
ation. In 19th IEEE Conference on Decision and Control including the symposium on Adaptive Processes,
volume 19,, 1980.
[53] Katerina Fragkiadaki, Weiyu Zhang, Geng Zhang, and Jianbo Shi. Two-granularity tracking: Mediating
trajectory and detection graphs for tracking under occlusions. In European Conference on Computer (ECCV),
pages 552–565, 2012.
[54] A. Frank and A. Asuncion. UCI machine learning repository. 2010.
[55] Wolfgang Förstner and Boudewijn Moonen. A metric for covariance matrices. Technical report,
Dept.Geodesy Geoinform. Stuttgart Univ., Stuttgart, Germany, 1999.
[56] Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. I know what you did last summer: object-
level auto-annotation of holiday snaps. In IEEE International Conference on Computer Vision (ICCV), pages
614–621. IEEE, 2009.
[57] Yue Gao, Rongrong Ji, Longfei Zhang, and Alexander G. Hauptmann. Symbiotic tracker ensemble toward
A unified tracking framework. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT),
24(7):1122–1131, 2014.
[58] Kathleen Garland. An experiment in automatic hierarchical document classification. Information Processing
& Management, 19(3):113–120, 1983.
[59] Andrew Gilbert and Richard Bowden. Tracking objects across cameras by incrementally learning inter-
camera colour calibration and patterns of activity. In European Conference on Computer Vision (ECCV),
pages 125–136, 2006.
[60] Petr Gronat, Guillaume Obozinski, Josef Sivic, and Tomas Pajdla. Learning and calibrating per-location
classifiers for visual place recognition. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 907–914, 2013.
[61] Robert Grossman, Roberto J. Bayardo, and Kristin P. Bennett, editors. ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining. ACM, 2005.
[62] Asaad Hakeem, Roberto Vezzani, Mubarak Shah, and Rita Cucchiara. Estimating geospatial trajectory of a
moving camera. In International Conference on Pattern Recognition (ICPR), volume 2, pages 82–87, 2006.
[63] Qiang Hao, Rui Cai, Zhiwei Li, Lei Zhang, Yanwei Pang, and Feng Wu. 3d visual phrases for landmark
recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3594–3601,
2012.
[64] D. Hawkins. Identification of outliers. Chapman and Hall,London, 1980.
[65] James Hays and Alexei A Efros. Im2gps: estimating geographic information from a single image. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[66] James Hays and Alexei A Efros. Large-scale image geolocalization. In Multimodal Location Estimation of
Videos and Images, pages 41–62. Springer, 2015.
[67] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778,
2016.
[68] R. Horst, P.M. Pardalos, and N. Van Thoai. Introduction to Global Optimization. Nonconvex Optimization
and Its Applications. Kluwer Academic Publishers, Dordrecht/Boston/London, 2000.
[69] Hayley Hung and Ben Kröse. Detecting f-formations as dominant sets. In International conference on
Multimodal Interfaces, pages 231–238. ACM, 2011.
[70] Nathan Jacobs, Scott Satkin, Nathaniel Roman, Richard Speyer, and Robert Pless. Geolocating static cameras.
In IEEE International Conference on Computer Vision (ICCV), pages 1–6. IEEE, 2007.
[71] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
[72] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: A review. ACM Comput. Surv.,
31(3):264–323, 1999.
[73] Omar Javed, Zeeshan Rasheed, Khurram Shafique, and Mubarak Shah. Tracking across multiple cameras
with disjoint views. In IEEE International Conference on Computer Vision (ICCV), pages 952–957, 2003.
[74] Omar Javed, Khurram Shafique, Zeeshan Rasheed, and Mubarak Shah. Modeling inter-camera space–time
and appearance relationships for tracking across non-overlapping views. Computer Vision and Image Under-
standing, 109(2):146–162, 2008.
[75] Omar Javed, Khurram Shafique, and Mubarak Shah. Appearance modeling for tracking in multiple non-
overlapping cameras. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), pages 26–33, 2005.
[76] Hao Jiang, Sidney Fels, and James J. Little. A linear programming approach for multiple object tracking. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[77] Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. Predicting good features for image geo-localization
using per-bundle vlad. In IEEE International Conference on Computer Vision (ICCV), pages 1170–1178,
2015.
[78] Edward Johns and Guang-Zhong Yang. From images to scenes: Compressing an image cluster into a single
scene model for place recognition. In IEEE International Conference on Computer Vision (ICCV), pages
874–881, 2011.
[79] Evangelos Kalogerakis, Olga Vesselova, James Hays, Alexei A Efros, and Aaron Hertzmann. Image sequence
geolocation with human travel priors. In IEEE International Conference on Computer Vision (ICCV), pages
253–260. IEEE, 2009.
[80] Jinman Kang, Isaac Cohen, and Gérard G. Medioni. Persistent objects tracking across multiple non overlap-
ping cameras. In IEEE Workshop on Applications of Computer Vision / IEEE Workshop on Motion and Video
Computing (WACV/MOTION, pages 112–119, 2005.
[81] Sohaib Khan and Mubarak Shah. Consistent labeling of tracked objects in multiple cameras with overlapping
fields of view. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1355–1360, 2003.
[82] Alexander Kläser, Marcin Marszalek, and Cordelia Schmid. A spatio-temporal descriptor based on 3D-
gradients. In British Machine Vision Conference (BMV), pages 1–10, 2008.
[83] Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB,
pages 392–403, 1998.
[84] Martin Köstinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large scale metric
learning from equivalence constraints. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), pages 2288–2295, 2012.
[85] Harold W Kuhn. Variants of the hungarian method for assignment problems. Naval Research Logistics
Quarterly, 3(4):253–258, 1956.
[86] Cheng-Hao Kuo, Chang Huang, and Ram Nevatia. Inter-camera association of multi-target tracks by on-line
learned appearance affinity models. In European Conference on Computer Vision (ECCV), pages 383–396,
2010.
[87] Cheng-Hao Kuo and Ram Nevatia. How does person identity recognition help multi-person tracking? In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1217–1224, 2011.
[88] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. Motchallenge 2015: Towards
a benchmark for multi-target tracking. arXiv:1504.01942 [cs], April 2015. arXiv: 1504.01942.
[89] Yunpeng Li, Noah Snavely, Dan Huttenlocher, and Pascal Fua. Worldwide pose estimation using 3d point
clouds. In European Conference on Computer Vision (ECCV), pages 15–29. Springer, 2012.
[90] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence
representation and metric learning. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), pages 2197–2206, 2015.
[91] Tsung-Yi Lin, Serge Belongie, and James Hays. Cross-view image geolocalization. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 891–898, 2013.
[92] Tsung-Yi Lin, Yin Cui, Serge Belongie, and James Hays. Learning deep representations for ground-to-aerial
geolocalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5007–
5015, 2015.
[93] Hairong Liu, Longin Jan Latecki, and Shuicheng Yan. Fast detection of dense subgraphs with iterative shrink-
ing and expansion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(9):2131–
2142, 2013.
[94] Yunpeng Liu, Guangwei Li, and Zelin Shi. Covariance tracking via geometric particle filtering. EURASIP
Journal on Advances in Signal Processing, 2010(1):1–22, 2010.
[95] André Lourenço, Samuel Rota Bulò, Nicola Rebagliati, Ana L. N. Fred, Mário A. T. Figueiredo, and Marcello
Pelillo. Probabilistic consensus clustering using evidence accumulation. Machine Learning, 98(1-2):331–
357, 2015.
[96] David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming, volume 2. Springer, 1984.
[97] Bingpeng Ma, Yu Su, and Frédéric Jurie. Covariance descriptor based on bio-inspired features for person
re-identification and face verification. Image Vision Computing, 32(6-7):379–390, 2014.
[98] Niall McLaughlin, Jesús Martínez del Rincón, and Paul C. Miller. Recurrent convolutional network for
video-based person re-identification. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1325–1334, 2016.
[99] Henry Medeiros, Johnny Park, and Avinash Kak. Distributed object tracking using a cluster-based kalman
filter in wireless camera networks. IEEE Journal of Selected Topics in Signal Processing, 2(4):448–463,
2008.
[100] Michael J. Metternich, Marcel Worring, and Arnold W. M. Smeulders. Color based tracing in real-life surveil-
lance data. T. Data Hiding and Multimedia Security, 5:18–33, 2010.
[101] Anton Milan, Laura Leal-Taixé, Konrad Schindler, and Ian Reid. Joint tracking and segmentation of multiple targets. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[102] Anton Milan, Konrad Schindler, and Stefan Roth. Detection- and trajectory-level exclusion in multiple object
tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3682–3689, 2013.
[103] Birgit Möller, Thomas Plötz, and Gernot A. Fink. Calibration-free camera hand-over for fast and reliable
person tracking in multi-camera setups. In International Conference on Pattern Recognition (ICPR), pages
1–4, 2008.
[104] Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configura-
tion. In VISAPP 2009 - Proceedings of the Fourth International Conference on Computer Vision Theory and
Applications, Lisboa, Portugal, February 5-8, 2009 - Volume 1, pages 331–340, 2009.
[105] Karsten Müller, Jane Neumann, Maren Grigutsch, D Yves von Cramon, and Gabriele Lohmann. Detecting
groups of coherent voxels in functional MRI data using spectral analysis and replicator dynamics. Journal of
Magnetic Resonance Imaging, 26(6):1642–1650, 2007.
[106] L. Fisher and J. W. Van Ness. Admissible clustering procedures. Biometrika, 58:91–104, 1971.
[107] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial
envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[108] Lionel Ott, Linsey Xiaolin Pang, Fabio Tozeto Ramos, and Sanjay Chawla. On integrated clustering and
outlier detection. In Neural Information Processing Systems, 8-13, Montreal, Canada, pages 1359–1367,
2014.
[109] Thrasyvoulos N Pappas. An adaptive clustering algorithm for image segmentation. IEEE Transactions on
Signal Processing, 40(4):901–914, 1992.
[110] Massimiliano Pavan and Marcello Pelillo. Dominant sets and hierarchical clustering. In IEEE International
Conference on Computer Vision (ICCV), pages 362–369, 2003.
[111] Massimiliano Pavan and Marcello Pelillo. Efficient out-of-sample extension of dominant-set clusters. In
Annual Conference on Neural Information Processing Systems (NIPS), pages 1057–1064, 2004.
[112] Massimiliano Pavan and Marcello Pelillo. Dominant sets and pairwise clustering. IEEE Transactions on
Pattern Analysis and Machine Intelligence (PAMI), 29(1):167–172, 2007.
[113] Marcello Pelillo. Replicator dynamics in combinatorial optimization. In Encyclopedia of Optimization,
Second Edition, pages 3279–3291. 2009.
[114] A. G. Amitha Perera, Chukka Srinivas, Anthony Hoogs, Glen Brooksby, and Wensheng Hu. Multi-object
tracking through simultaneous long occlusions and split-merge conditions. In IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition (CVPR), pages 666–673, 2006.
[115] Roman P. Pflugfelder and Horst Bischof. Tracking across non-overlapping views via geometry. In Interna-
tional Conference on Pattern Recognition (ICPR), pages 1–4, 2008.
[116] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Globally-optimal greedy algorithms for tracking
a variable number of objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 1201–1208, 2011.
[117] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Globally-optimal greedy algorithms for tracking
a variable number of objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 1201–1208, 2011.
[118] Fatih Porikli, Oncel Tuzel, and Peter Meer. Covariance tracking using model update based on Lie algebra. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 728–735,
2006.
[119] Vittal Premachandran and Ramakrishna Kakarala. Consensus of k-NNs for robust neighborhood selection on
graph-based manifolds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
1594–1601, 2013.
[120] Bryan James Prosser, Shaogang Gong, and Tao Xiang. Multi-camera matching using bi-directional cumula-
tive brightness transfer functions. In British Machine Vision Conference (BMVC), pages 1–10, 2008.
[121] Till Quack, Bastian Leibe, and Luc Van Gool. World-scale mining of objects and events from community
photo collections. In Proceedings of the 2008 international conference on Content-based image and video
retrieval, pages 47–56. ACM, 2008.
[122] Lawrence R Rabiner and Jay G Wilpon. Considerations in applying clustering techniques to speaker-
independent word recognition. The Journal of the Acoustical Society of America, 66(3):663–673, 1979.
[123] Filip Radenovic, Giorgos Tolias, and Ondrej Chum. CNN image retrieval learns from BoW: Unsupervised
fine-tuning with hard examples. In European Conference on Computer Vision (ECCV), pages 3–20, 2016.
[124] Vijay V Raghavan and CT Yu. A comparison of the stability characteristics of some graph theoretic clustering
methods. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), (4):393–402, 1981.
[125] Srikumar Ramalingam, Sofien Bouaziz, Peter Sturm, and Matthew Brand. Skyline2gps: Localization in urban
canyons using omni-skylines. In Intelligent Robots and Systems (IROS), pages 3816–3823, 2010.
[126] C Radhakrishna Rao. Cluster analysis applied to a study of race mixture in human populations. In Classifi-
cation and clustering: proceedings of an advanced seminar conducted by the Mathematics Research Center,
the University of Wisconsin-Madison, number 37, page 175. Academic Press, 1977.
[127] D. B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 1979.
[128] Ergys Ristani, Francesco Solera, Roger S. Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures
and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop
(ECCV workshop), pages 17–35, 2016.
[129] Ergys Ristani and Carlo Tomasi. Tracking multiple people online and in real time. In Asian Conference on
Computer Vision, pages 444–459. Springer, 2014.
[130] Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation
measure. In EMNLP-CoNLL, Prague, Czech Republic, pages 410–420, 2007.
[131] Samuel Rota Bulò and Immanuel M. Bomze. Infection and immunization: A new class of evolutionary game
dynamics. Games and Economic Behavior, 71(1):193–211, 2011.
[132] Samuel Rota Bulò, Marcello Pelillo, and Immanuel M. Bomze. Graph-based quadratic optimization: A fast
evolutionary approach. Computer Vision and Image Understanding, 115(7):984–995, 2011.
[133] Peter J. Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant
estimator. Technometrics, 41:212–223, 1998.
[134] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communi-
cations of the ACM, 18(11):613–620, 1975.
[135] Torsten Sattler, Michal Havlena, Konrad Schindler, and Marc Pollefeys. Large-scale location recognition and
the geometric burstiness problem. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1582–1590, 2016.
[136] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Fast image-based localization using direct 2d-to-3d match-
ing. In IEEE International Conference on Computer Vision, pages 667–674. IEEE, 2011.
[137] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Improving image-based localization by active correspon-
dence search. In European Conference on Computer Vision (ECCV), pages 752–765. Springer, 2012.
[138] Ann Scher, Michael Shneier, and Azriel Rosenfeld. Clustering of collinear line segments. Pattern Recognition,
14(2):85–91, 1982.
[139] Grant Schindler, Matthew A. Brown, and Richard Szeliski. City-scale location recognition. In IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[140] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 22(8):888–905, 2000.
[141] Horesh Ben Shitrit, Jérôme Berclaz, François Fleuret, and Pascal Fua. Tracking multiple people under global
appearance constraints. In IEEE International Conference on Computer Vision (ICCV), pages 137–144, 2011.
[142] Guang Shu, Afshin Dehghan, Omar Oreifej, Emily Hand, and Mubarak Shah. Part-based multiple-person
tracking with partial occlusion handling. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 1815–1821, 2012.
[143] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. arXiv preprint arXiv:1409.1556, 2014.
[144] G Siromoney, S Govindaraju, and M Bagavandas. Temple carvings of southern India. Perspectives in Com-
puting, 4:34–43, 1985.
[145] F. Solera, S. Calderara, E. Ristani, C. Tomasi, and R. Cucchiara. Tracking social groups within and across
cameras. IEEE Transactions on Circuits and Systems for Video Technology, 2016.
[146] Francesco Solera, Simone Calderara, Ergys Ristani, Carlo Tomasi, and Rita Cucchiara. Tracking social
groups within and across cameras. IEEE Transactions on Circuits and Systems for Video Technology,
27(3):441–453, 2017.
[147] Satyam Srivastava, Ka Ki Ng, and Edward J. Delp. Color correction for object tracking across multiple
cameras. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
1821–1824, 2011.
[148] George Stockman, Steven Kopstein, and Sanford Benett. Matching images to models for registration and
object detection via clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),
(3):229–241, 1982.
[149] EL Sutanto, JD Mason, and Kevin Warwick. Mean-tracking clustering algorithm for radial basis function
centre selection. International Journal of Control, 67(6):961–977, 1997.
[150] Yonatan Tariku Tesfaye and Marcello Pelillo. Multi-target tracking using dominant sets. M.Sc. thesis, Uni-
versità Ca’ Foscari Venezia, 2014.
[151] Yonatan Tariku Tesfaye, Eyasu Zemene, Marcello Pelillo, and Andrea Prati. Multi-object tracking using
dominant sets. IET computer vision, 10:289–298, 2016.
[152] Yonatan Tariku Tesfaye, Eyasu Zemene, Andrea Prati, Marcello Pelillo, and Mubarak Shah. Multi-
target tracking in multiple non-overlapping cameras using constrained dominant sets. arXiv preprint
arXiv:1706.06196, 2017.
[153] Sinisa Todorovic and Narendra Ahuja. Region-based hierarchical image matching. International Journal of
Computer Vision, 78(1):47–66, 2008.
[154] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of CNN
activations. In International Conference on Learning Representations (ICLR), 2016.
[155] Akihiko Torii, Josef Sivic, Masatoshi Okutomi, and Tomás Pajdla. Visual place recognition with repetitive
structures. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 37(11):2346–2359,
2015.
[156] Akihiko Torii, Josef Sivic, Tomas Pajdla, and Masatoshi Okutomi. Visual place recognition with repetitive
structures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 883–890, 2013.
[157] Gonzalo Vaca-Castano, Amir Roshan Zamir, and Mubarak Shah. City scale geo-spatial trajectory estimation
of a moving camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1186–
1193, 2012.
[158] Sebastiano Vascon, Eyasu Zemene Mequanint, Marco Cristani, Hayley Hung, Marcello Pelillo, and Vittorio
Murino. A game-theoretic probabilistic approach for detecting conversational groups. In Asian Conference
on Computer Vision, pages 658–675. Springer, 2014.
[159] Senem Velipasalar, Jason Schlessman, Cheng-Yao Chen, Wayne Hendrix Wolf, and Jaswinder Singh. A
scalable clustered camera system for multiple object tracking. EURASIP J. Image and Video Processing,
2008, 2008.
[160] Hongxing Wang, Chaoqun Weng, and Junsong Yuan. Multi-feature spectral clustering with minimax opti-
mization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[161] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by video ranking.
In European Conference on Computer Vision (ECCV), pages 688–703, 2014.
[162] Youlu Wang, Senem Velipasalar, and Mustafa Cenk Gursoy. Distributed wide-area multi-object tracking with
non-overlapping camera views. Multimedia Tools and Applications, 73(1):7–39, 2014.
[163] Tobias Weyand, Ilya Kostrikov, and James Philbin. PlaNet - photo geolocation with convolutional neural
networks. arXiv preprint arXiv:1602.05314, 2016.
[164] Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference
imagery. In IEEE International Conference on Computer Vision, pages 3961–3969, 2015.
[165] Bo Wu and Ramakant Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian
combination of edgelet based part detectors. International Journal of Computer Vision, 75(2):247–266, 2007.
[166] Zhenyu Wu and Richard Leahy. An optimal graph theoretic approach to data clustering: Theory and its
application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),
15(11):1101–1113, 1993.
[167] Junliang Xing, Haizhou Ai, and Shihong Lao. Multi-object tracking through occlusions by local tracklets
filtering and global tracklets association with detection responses. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1200–1207, 2009.
[168] Fei Xiong, Mengran Gou, Octavia I. Camps, and Mario Sznaier. Person re-identification using kernel-based
metric learning methods. In European Conference on Computer Vision (ECCV), pages 1–16, 2014.
[169] Bo Yang and Ram Nevatia. An online learned CRF model for multi-target tracking. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 2034–2041, 2012.
[170] Bo Yang and Ram Nevatia. Online learned discriminative part-based appearance models for multi-human
tracking. In European Conference on Computer Vision (ECCV), pages 484–498, 2012.
[171] Bo Yang and Ramakant Nevatia. Multi-target tracking by online learning a CRF model of appearance and
motion patterns. International Journal of Computer Vision, 107(2):203–217, 2014.
[172] Tao Yang, Stan Z. Li, Quan Pan, and Jing Li. Real-time multiple objects tracking with occlusion handling
in dynamic scenes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), pages 970–975, 2005.
[173] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), 2006.
[174] Jinjie You, Ancong Wu, Xiang Li, and Wei-Shi Zheng. Top-push video-based person re-identification. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1345–1353,
2016.
[175] Charles T Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions
on Computers, 100(1):68–86, 1971.
[176] Amir Roshan Zamir, Afshin Dehghan, and Mubarak Shah. GMCP-Tracker: Global multi-object tracking using
generalized minimum clique graphs. In European Conference on Computer Vision (ECCV), pages 343–356,
2012.
[177] Amir Roshan Zamir and Mubarak Shah. Accurate image localization based on Google Maps Street View. In
European Conference on Computer Vision (ECCV), pages 255–268. Springer, 2010.
[178] Amir Roshan Zamir and Mubarak Shah. Image geo-localization based on multiple nearest neighbor feature
matching using generalized graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),
36(8):1546–1558, 2014.
[179] Bernhard Zeisl, Torsten Sattler, and Marc Pollefeys. Camera pose voting for large-scale image-based local-
ization. In IEEE International Conference on Computer Vision (ICCV), pages 2704–2712, 2015.
[180] Eyasu Zemene, Leulseged Tesfaye Alemu, and Marcello Pelillo. Constrained dominant sets for retrieval. In
International Conference on Pattern Recognition, pages 2568–2573. IEEE, 2016.
[181] Eyasu Zemene, Leulseged Tesfaye Alemu, and Marcello Pelillo. Dominant sets for "constrained" image
segmentation. arXiv preprint arXiv:1707.05309, 2017.
[182] Eyasu Zemene and Marcello Pelillo. Path-based dominant-set clustering. In IAPR International Conference
on Image Analysis and Processing (ICIAP), pages 150–160, 2015.
[183] Eyasu Zemene and Marcello Pelillo. Interactive image segmentation using constrained dominant sets. In
European Conference on Computer Vision (ECCV), pages 278–294, 2016.
[184] Eyasu Zemene, Yonatan Tariku Tesfaye, Haroon Idrees, Andrea Prati, Marcello Pelillo, and Mubarak Shah.
Large-scale image geo-localization using dominant sets. arXiv preprint arXiv:1702.01238, 2017.
[185] Eyasu Zemene, Yonatan Tariku Tesfaye, Andrea Prati, and Marcello Pelillo. Simultaneous clustering and
outlier detection using dominant sets. In Pattern Recognition (ICPR), 2016 23rd International Conference
on, pages 2325–2330. IEEE, 2016.
[186] Li Zhang, Yuan Li, and Ramakant Nevatia. Global data association for multi-object tracking using network
flows. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[187] Shaoting Zhang, Ming Yang, Timothée Cour, Kai Yu, and Dimitris N. Metaxas. Query specific rank fusion
for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 37(4):803–815,
2015.
[188] Shu Zhang, Yingying Zhu, and Amit K. Roy-Chowdhury. Tracking multiple interacting targets in a camera
network. Computer Vision and Image Understanding, 134:64–73, 2015.
[189] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. MARS: A video
benchmark for large-scale person re-identification. In European Conference on Computer Vision (ECCV),
pages 868–884, 2016.
[190] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-
identification: A benchmark. In IEEE International Conference on Computer Vision (ICCV), pages 1116–
1124, 2015.
[191] Liang Zheng, Shengjin Wang, Lu Tian, Fei He, Ziqiong Liu, and Qi Tian. Query-adaptive late fusion for
image search and person re-identification. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), pages 1741–1750, 2015.
[192] Yan-Tao Zheng, Ming Zhao, Yang Song, Hartwig Adam, Ulrich Buddemeier, Alessandro Bissacco, Fernando
Brucher, Tat-Seng Chua, and Hartmut Neven. Tour the world: building a web-scale landmark recognition
engine. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages
1085–1092, 2009.
[193] Caiming Zhong, Xiaodong Yue, Zehua Zhang, and Jingsheng Lei. A clustering ensemble: Two-level-refined
co-association matrix with path-based transformation. Pattern Recognition, 48(8):2699–2709, 2015.
[194] Ray Zimmerman, Rick Jacobs, and James Farr. A comparison of the accuracy of four methods for clustering
jobs. Applied Psychological Measurement, 6(3):353–366, 1982.