CV MOT
• MOSSE Tracking
• Deep SORT
(*) The end of the “AI Winter” came about with the rediscovery of the “backpropagation” algorithm; LeCun
et al. (1998) achieved an error rate < 1% on MNIST (a handwritten digit recognition task).
(*) The “Deep Learning” era begins with the landmark “AlexNet” architecture: Krizhevsky et
al., 2012 (one of the first uses of GPUs, among other innovations).
Neurons & the Brain
Hebb’s Postulate
Neurons & the Brain
• The human brain contains ~10^11 neurons
• Each individual neuron connects to ~10^4 other neurons
• ~10^14 total synapses!
McCulloch & Pitts Neuron Model (1943)
Three components:
(1) A set of weighted inputs {w_i} that correspond to synapses.
(2) An “adder” that sums the input signals (analogous to the cell membrane, which collects
the electrical charge).
(3) An activation function (initially a threshold function) that decides whether the neuron
fires (“spikes”) for the current inputs.
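A minimal NumPy sketch of this three-component neuron; the AND-gate weights and threshold in the example are illustrative choices, not from the original notes:

```python
import numpy as np

def mp_neuron(x, w, theta):
    """McCulloch-Pitts neuron: weighted inputs, an 'adder', and a
    hard threshold activation."""
    s = np.dot(w, x)               # the 'adder' sums the weighted inputs
    return 1 if s >= theta else 0  # fires ('spikes') iff the sum reaches the threshold

# Illustrative example: a 2-input neuron computing logical AND (w = [1, 1], theta = 2).
print(mp_neuron(np.array([1, 1]), np.array([1, 1]), 2))  # -> 1
print(mp_neuron(np.array([1, 0]), np.array([1, 1]), 2))  # -> 0
```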
Neural Networks
(*) A Neural Network (NN) consists of a network of McCulloch/Pitts computational neurons (a single
layer was known historically as a “perceptron.”)
(*) NNs are universal function approximators – given enough hidden units, they can approximate an
arbitrarily complex mapping between inputs and outputs. While this fact speaks to the broad utility of
these models, NNs are nevertheless prone to overfitting. The core issue in most ML/AI models can be
reduced to the question of generalizability.
(*) Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-
linearity (e.g. sigmoid/tanh). NNs are typically trained using backpropagation. This method calculates the
gradient of a loss function (e.g. squared loss) with respect to all the weights (W) in the network. More
specifically, we use the chain rule to compute the ‘delta’ for the weight updates (one can think of this
delta as assigning a degree of ‘blame’ for misclassifications).
(*) Training a NN amounts to “tuning” the network weights. This process encompasses two
distinct phases: (1) a “forward phase”, in which a datum is passed “forward” through the
network (operationally, a sequence of dot products and activations); (2) a “backward phase”,
during which the weights in the network are updated according to the desired output (i.e. the
label) for the datum. This process is highly parallelizable. A sketch of both phases follows.
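A minimal NumPy sketch of one forward/backward pass for a one-hidden-layer network with sigmoid activations and squared loss; the layer sizes, datum, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))  # toy weights
x, y = np.array([0.5, -0.2]), np.array([1.0])              # one datum and its label
lr = 0.1

# Forward phase: a sequence of dot products and activations.
h = sigmoid(W1 @ x)       # hidden activations
y_hat = sigmoid(W2 @ h)   # network output

# Backward phase: the chain rule assigns a 'delta' (blame) to each layer
# for the squared loss L = 0.5 * (y_hat - y)^2.
delta2 = (y_hat - y) * y_hat * (1 - y_hat)  # output-layer delta
delta1 = (W2.T @ delta2) * h * (1 - h)      # hidden-layer delta

# Gradient-descent weight updates.
W2 -= lr * np.outer(delta2, h)
W1 -= lr * np.outer(delta1, x)
```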
(*) Conventionally, NNs are best suited to problems for which there exists a large amount of (diverse and)
labelled data; training for deep learning (NNs with many layers/neurons) can be lengthy (although training
algorithms have recently become more efficient); inference, however, is generally quick.
A Neural Network “Zoo”
Gradient Descent
(*) Backpropagation-based training is one particular instance of a larger paradigm of optimization algorithms
known as Gradient Descent (the descent counterpart of “hill climbing”).
(*) There exists a large array of nuanced methodologies for efficiently training NNs (particularly DNNs), including
the use of regularization, momentum, dropout, batch normalization, pre-training regimes, initialization
processes, etc.
(*) Traditionally, plain stochastic gradient descent with backpropagation has been used to train NNs; more
recently, the Adam stochastic optimization method (2014) – which still consumes backpropagated
gradients – has largely eclipsed vanilla SGD in practice:
https://arxiv.org/abs/1412.6980
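For reference, a sketch of the Adam update rule from the paper linked above; hyperparameter defaults follow Kingma & Ba, and the function signature here is my own:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014): gradient descent with
    bias-corrected first/second moment estimates of the gradient g."""
    m = b1 * m + (1 - b1) * g     # first moment (momentum-like average)
    v = b2 * v + (1 - b2) * g**2  # second moment (per-weight scaling)
    m_hat = m / (1 - b1**t)       # bias corrections (t is the step count, t >= 1)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v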
A Typical Workflow for Machine Learning
Some Deep Learning Techniques
• Momentum
• L2 and L1 regularization
• Batch normalization
Some Deep Learning Techniques
• “Dropout” and sparse representations
• Ensemble Modeling
• Parameter Initialization Strategies
• Adversarial Training
The key difference with CNNs is that neurons/activations are represented as 3D volumes. CNNs
additionally employ weight-sharing for computational efficiency; they are most commonly applied to
image data, in which case image feature activations are trained to be translation-invariant (convolution +
max pooling achieves this)
Convolutional Neural Networks
Intuitively, the network will learn filters that activate when they see some type of visual feature, such as an edge of some
orientation or a blotch of some color. Each CONV layer contains an entire set of filters, and each filter
produces a separate 2-dimensional activation map; these maps are stacked along the depth dimension in the CNN and
thus produce the output volume.
A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of activations to another
through a differentiable function. The three main types of layers to build CNN architectures are: Convolutional
Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). These layers are
stacked to form a full CNN architecture.
The convolutional layer computes the activations of various filters over the input image; pooling is used to
downsample the feature maps for computational savings; the fully-connected layers are used to compute class scores for
classification tasks.
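A minimal PyTorch sketch of such a CONV → POOL → FC stack; the layer sizes, 32x32 RGB inputs, and 10-class output are illustrative assumptions:

```python
import torch.nn as nn

# A small CONV -> POOL -> FC stack of the kind described above.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),  # conv: filter activation maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: downsample 32 -> 16
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16 -> 8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully-connected class scores
)
```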
Convolutional Neural Networks
A nice way to interpret CNNs via a brain analogy is to consider each entry in the 3D output volume as an output of a
neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right
spatially (since the same filter is used).
Each neuron is accordingly connected to only a local region of the input volume; the spatial extent of this connectivity
is a hyperparameter called the receptive field (i.e. the filter size, such as: 5x5).
$$f_j \;=\; \sum_{b\,\in\,\text{black pixels}} \text{intensity}(b) \;-\; \sum_{w\,\in\,\text{white pixels}} \text{intensity}(w)$$
• For each subwindow, compute features and send them to an ensemble classifier (learned via
boosting). If the classifier is positive (“face”), then a face is detected at this location and scale.
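As a sketch of the feature above: with an integral image (summed-area table), each rectangle sum – and hence each feature – can be evaluated in constant time per subwindow. The function names here are my own, illustrative:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: any rectangle sum can then be read in O(1)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1], computed from the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0: total -= ii[r0 - 1, c1 - 1]
    if c0 > 0: total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(ii, r, c, h, w):
    """Two-rectangle Haar-like feature: black (left) half minus white (right) half."""
    black = rect_sum(ii, r, c, r + h, c + w // 2)
    white = rect_sum(ii, r, c + w // 2, r + h, c + w)
    return black - white

img = np.arange(36.0).reshape(6, 6)   # toy 'image'
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 4, 4))  # feature over a 4x4 subwindow
```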
(*) The image is divided into small connected regions called cells, and for the pixels within
each cell, a histogram of gradient directions is compiled. For improved accuracy, the local
histograms can be contrast-normalized by calculating a measure of the intensity across a
larger region of the image, called a block, and then using this value to normalize all cells
within the block. This normalization results in better invariance to changes in illumination
and shadowing.
(*) HOG feature advantages: invariance to geometric and photometric transformations; HOG
features are particularly good at detecting people.
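A short example using scikit-image's HOG implementation with the cell/block scheme described above (9 orientation bins per 8x8-pixel cell, contrast normalization over 2x2-cell blocks); the sample image is an arbitrary choice:

```python
from skimage.feature import hog
from skimage import data, color

# Compute a HOG descriptor for a sample grayscale image.
img = color.rgb2gray(data.astronaut())
features = hog(img,
               orientations=9,            # gradient-direction histogram bins
               pixels_per_cell=(8, 8),    # cells
               cells_per_block=(2, 2),    # blocks for contrast normalization
               block_norm='L2-Hys')
print(features.shape)  # one long descriptor vector for the whole image
```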
DNNs: AlexNet (2012)
AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton; it uses CNNs with GPU
support. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than the
runner-up.
Among other innovations: AlexNet used GPUs, utilized ReLU (rectified linear unit) activations, and
“dropout” for training.
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
DNNs: AlexNet (2012)
DNNs: VGG (2014)
https://arxiv.org/pdf/1409.1556.pdf
DNNs: Inception (2015, Google)
https://arxiv.org/abs/1409.4842
• Team at Google (Szegedy et al.) produced an even deeper DNN (22 layers). No need
to pick filter sizes explicitly, as network learns combinations of filter sizes/pooling
steps; upside: newfound flexibility for architecture design (architecture parameters
themselves can be learned); downside: ostensibly requires a large amount of
computation – this can be reduced by using 1x1 convolutions for dimensionality
reduction (prior to expensive convolutional operations).
• The team achieved a new state of the art for classification and detection in the ImageNet
Large-Scale Visual Recognition Challenge 2014 (ILSVRC14), with a ~6.7% top-5 error rate for
classification.
Object Detection with Deep Learning: RCNN
• Girshick et al.* achieved state-of-the-art performance on several object detection
benchmarks using “regions with convolutional neural networks” (R-CNN).
• To avoid an exhaustive search over locations and scales, R-CNN utilizes a selective search
algorithm that reduces the overall computational overhead.
• Later extensions (e.g. Faster R-CNN) replace selective search with a region-proposal network
(RPN) that simultaneously predicts object bounds and objectness scores for proposals.
*R. Girshick, “Fast R-CNN,” in: Int. Conf. Comput. Vis., IEEE, 2015, pp. 1440–1448.
*R. Girshick, J. Donahue, T. Darrell, J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in: Conf. Comput. Vis.
Pattern Recognit., IEEE, 2014, pp. 580–587.
Object Detection with Deep Learning: YOLO
• YOLO (“You Only Look Once”) looks at the input image only once (hence the name).
• First, the algorithm divides the input image into a grid of 13x13 cells.
• YOLO also outputs a confidence score that tells us how certain it is that the
predicted bounding box actually encloses some object. This score doesn’t say
anything about what kind of object is in the box, just if the shape of the box
is any good.
Object Detection with Deep Learning: YOLO
• The predicted bounding boxes may look something like the following
(the higher the confidence score, the fatter the box is drawn).
• For each bounding box, the cell also predicts a class. This works just like a
classifier: it gives a probability distribution over all the possible classes.
https://arxiv.org/pdf/1506.02640.pdf
Object Detection with Deep Learning: YOLO
• The confidence score for the bounding box and the class prediction are combined into one final score that
tells us the probability that this bounding box contains a specific type of object. For example, the big fat
yellow box on the left is 85% sure it contains the object “dog”:
• Since there are 13×13 = 169 grid cells and each cell predicts 5 bounding boxes, we end up with 845
bounding boxes in total. It turns out that most of these boxes will have very low confidence scores, so we
only keep the boxes whose final score is 30% or more (you can change this threshold depending on how
accurate you want the detector to be).
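A toy NumPy sketch of this scoring step; the 169x5 grid and 20-class setup mirror the text, while the random scores are placeholders for real network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
box_conf = rng.random((169, 5))                          # P(box contains an object)
class_prob = rng.dirichlet(np.ones(20), size=(169, 5))   # P(class | object)

# Final score = box confidence x class probability.
scores = box_conf[..., None] * class_prob   # shape (169, 5, 20)
best_class = scores.argmax(axis=-1)         # predicted class per box
best_score = scores.max(axis=-1)            # final score per box

# Keep only boxes whose final score clears the threshold (e.g. 30%).
keep = best_score >= 0.30
print(f"{keep.sum()} of {best_score.size} boxes survive the threshold")
```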
Similarity Learning with DNNs
• Similarity learning (also called metric learning) is the process of training a metric to compute
the similarity between two entities.
• A Siamese Network is a NN that is trained to compare two inputs.
• Typically a Siamese network uses two encoders (the “sister” networks – they can be identical, with
shared weights); each encoder is fed one of the images in either a positive or negative pair.
• The network is trained using a “contrastive loss” based on the Euclidean distance between the image
encodings. Optimization is performed using a standard algorithm (e.g. backprop with Adam).
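A minimal PyTorch sketch of a contrastive loss of this kind; the margin value and function name are illustrative choices:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, label, margin=1.0):
    """Contrastive loss over a batch of encoding pairs (z1, z2).
    label = 1.0 for a positive (same-identity) pair, 0.0 for a negative pair.
    Positive pairs are pulled together; negative pairs are pushed apart
    until their Euclidean distance exceeds the margin."""
    d = F.pairwise_distance(z1, z2)  # Euclidean distance of the encodings
    loss = label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)
    return loss.mean()
```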
Similarity Learning with DNNs
• Importantly: Siamese networks can be used as one-shot classification models, which require
just one training example of each class you want to predict.
• A nice example is facial recognition. You would train a one-shot classification model
on a dataset that contains various angles, lighting conditions, etc., of a few people. Then, to
recognize whether a person X is in an image, you take one single photo of that person and
ask the model whether that person appears in the image (note: the model was not trained using
any pictures of person X).
Multiple-Object Tracking (MOT)
• The task of Multiple Object Tracking (MOT) generally divides into locating
multiple objects, maintaining their identities, and generating their individual
trajectories.
• As a mid-level task in CV, MOT can serve as an intermediate step for tasks such as
pose estimation, action recognition and behavior analysis.
Multiple-Object Tracking (MOT)
• Compared to single object tracking (SOT), MOT requires two additional tasks to be
solved: (1) determining the number of objects, and (2) maintaining their identities.
• Apart from the common challenges shared by SOT and MOT, further key issues
that complicate MOT include, among others: frequent occlusions, initialization and
termination of tracks, similar appearance among objects, and interactions among
multiple objects.
• Given an image sequence, denote the state of the $i$-th object in the $t$-th frame as $s_t^i$;
accordingly, let $S_t = (s_t^1, \ldots, s_t^{M_t})$ denote the states of all $M_t$ objects in the $t$-th frame,
and let $S_{1:t} = \{S_1, \ldots, S_t\}$ represent the sequential states of all objects from the first frame to
the $t$-th frame.
• Similarly, define $O_t = (o_t^1, \ldots, o_t^{M_t})$ as the collected observations for all $M_t$ objects in
the $t$-th frame; $O_{1:t}$ is defined analogously.
Multiple-Object Tracking (MOT)
• With these notational conventions, the objective of MOT is to find the
optimal sequential states of all objects, which can generally be modeled by
performing MAP estimation on the conditional distribution of the
sequential states given all the observations:

$$\widehat{S}_{1:t} = \arg\max_{S_{1:t}} P(S_{1:t} \mid O_{1:t})$$
• DBT (Detection-Based Tracking): objects are first detected and then linked into trajectories. Two key
issues for the DBT framework: the object detector must be trained in advance (offline), and the
performance of the tracking algorithm is highly dependent on the effectiveness of the detector (e.g.
R-CNN, YOLO).
• DFT (Detection-Free Tracking): requires manual initialization of a fixed number of objects in the first
frame, then localizes these objects in subsequent frames.
• DBT is more popular because new objects are discovered and disappearing objects are
terminated automatically. DFT cannot accommodate cases in which new objects appear – but
it is nevertheless free of pre-trained detectors (no deep learning needed).
Multiple-Object Tracking (MOT)
• Processing modes for MOT: online tracking or offline tracking.
• Two major considerations: (1) how to measure similarity (if at all) between
objects across frames; (2) how to recover object identities across frames based on
those similarity measures.
• (A) Appearance Model
• Appearance is an important cue for affinity computation in MOT. However, it is usually not considered the core
component of a MOT algorithm.
• Appearance models consist of (i) a visual representation (e.g. local features or region features such as optical flow,
HOG features, region covariance) and (ii) a statistical measure (using single cues, e.g. distance between color
histograms, or multiple cues involving a fusion of information, e.g. boosting, cascading, etc.).
Multiple-Object Tracking (MOT): Components of MOT
• (B) Motion Model
• The motion model captures the dynamic behavior of an object. In most cases objects
are assumed to move smoothly, thereby maintaining object-continuity conditions.
• (i) Linear motion model: the most popular motion model for MOT; assumes constant
velocity of objects; one can further impose velocity/position/acceleration smoothness
constraints.
• (ii) Non-linear motion model: a more complex model that can (often) yield better results
when linking tracklets of objects.
Multiple-Object Tracking (MOT): Components of MOT
• (C) Interaction Model
• A constraint set employed to avoid physical collisions in the MOT problem; intuition: two
objects can’t occupy the same physical space. Two types: (1) detection-level exclusion
(two distinct targets, e.g. customers, can’t be assigned to the same location in the same
frame), (2) trajectory-level exclusion (trajectories must maintain some minimal epsilon
separation).
• To model exclusion: one can use an exclusion graph to capture the constraints and
optimize inference so as to encourage connected nodes to have different labels; there is a
further distinction between soft and hard modeling.
Multiple-Object Tracking (MOT): Components of MOT
• (E) Occlusion Handling (critical challenge):
• One solution: the Kalman filter (one could also use a Gaussian process): it makes a
linear-system assumption and a Gaussian-distributed-object-states assumption; under
these conditions, the Kalman filter is the optimal estimator.
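A minimal constant-velocity Kalman filter sketch in NumPy: predict() can be iterated across occluded frames, and update() corrects the prediction once a detection reappears. The noise covariances and time step are illustrative assumptions:

```python
import numpy as np

# Constant-velocity model for one object; state = [x, y, vx, vy].
dt = 1.0
F_mat = np.array([[1, 0, dt, 0],    # state transition (linear motion)
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],         # we only observe position (x, y)
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2                # process noise (assumed)
R = np.eye(2) * 1e-1                # measurement noise (assumed)

def predict(x, P):
    """Project the state forward; used to bridge frames with occlusion."""
    return F_mat @ x, F_mat @ P @ F_mat.T + Q

def update(x, P, z):
    """Correct the prediction with a new detection z = [x, y]."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P
```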
Multiple-Object Tracking (MOT): Components of MOT
• (F) Inference
• One can also model the MOT problem as a bipartite graph matching problem; the two
sets of nodes are existing tracks and new observations/detections, and the weights
between nodes are affinity measures. This can be solved greedily or with the Hungarian
algorithm (a classic paradigm for matching-with-preferences problems: the “stable
marriage problem”). One can alternatively use min-cost flow / max-flow networks (used
for trajectories) or graphical models/conditional random fields.
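A small example of the bipartite matching step using SciPy's Hungarian-algorithm implementation; the affinity matrix is toy data:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy affinities between 3 existing tracks and 4 new detections
# (higher = more similar; values illustrative).
affinity = np.array([[0.9, 0.1, 0.2, 0.0],
                     [0.2, 0.8, 0.1, 0.1],
                     [0.1, 0.2, 0.0, 0.7]])

# The Hungarian algorithm minimizes cost, so negate the affinities.
rows, cols = linear_sum_assignment(-affinity)
for t, d in zip(rows, cols):
    print(f"track {t} -> detection {d} (affinity {affinity[t, d]:.1f})")
```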
Multiple-Object Tracking (MOT)
MOT evaluation uses several standard metric families. Accuracy: the MOTA metric (multiple object tracking
accuracy) combines the FP rate, FN rate, and mismatch (identity-switch) rate; precision: uses an IoU metric and/or
distance; completeness: how completely the ground-truth trajectories are tracked; robustness: a measure of
recovery from occlusion.
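For concreteness, a minimal IoU (intersection-over-union) computation for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
```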
Open-source implementations (surprisingly few for MOT, more for SOT): R-CNN, Fast R-CNN, Faster R-CNN,
YOLO, MOSSE tracker, SORT, Deep SORT, Intel OpenCV SDK.
Multiple-Object Tracking (MOT)
• Multiple cameras: the major question is how to fuse information from multiple cameras (one
can use a traditional multi-modal data ensemble approach); for cameras with non-
overlapping viewing regions, there are questions about how to re-identify subjects. How to
use the known geometry of the scene for inference is very much an open problem: one idea
is to project the large inference problem onto the entire floorplan (in 2-D) and generate
tracklets there; the question, again, is how to fuse the data streams. (How can this be
automated for new store geometries?)
• 3-D object tracking? Requires camera calibration and estimation of poses and scene layout.
• MOT with scene/context understanding: first analyze the image with scene understanding (one
could perform low-computation image segmentation), then use the results to provide
contextual information and scene structure; this approach could be highly valuable
(if not essential) for object tracking, gesture recognition, behavior classification, etc.
• MOT with deep learning (still open): a DL model can furnish a strong observation model that
performs, say, image classification, object detection, or SOT. One could (possibly) parallelize a
strong DL-based SOT to perform MOT.
• MOT with other CV tasks: symbiotic relationship between MOT and other conventional CV
tasks (inference for one helps with inference for the other and vice versa): segmentation,
human re-identification, face detection/recognition, action recognition, etc.
MOSSE tracker
• This algorithm was proposed for fast tracking using correlation filter methods.
Correlation filter-based tracking comprises the following steps:
• (1) Compute the FFT of the input image I and of the template T.
• (2) Multiply the image spectrum elementwise by the (complex-conjugate) template
spectrum; correlation in the spatial domain becomes multiplication in the frequency domain.
• (3) The result from (2) is transformed back to the spatial domain using the IFFT. The
position of the template object in the image I is given by the maximum of the IFFT
response.
MOSSE tracker
The aforementioned correlation filter-based technique has limitations in the
choice of T, since a single template image may not capture all the variations of an
object (e.g. rotation). MOSSE instead finds a filter that maps a set of training
images $I_i$ to desired correlation outputs $O_i$:
$$\min_{T^*} \sum_i \left| I_i \odot T^* - O_i \right|^2$$
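A NumPy sketch of the frequency-domain correlation step described above; the function name is mine, and preprocessing such as windowing (which MOSSE also uses) is omitted:

```python
import numpy as np

def correlate_fft(image, template):
    """Correlation-filter matching in the frequency domain:
    (1) FFT the image and the template, (2) multiply the image spectrum
    by the conjugate template spectrum, (3) IFFT back; the peak of the
    response marks the template's position."""
    F_img = np.fft.fft2(image)
    F_tmp = np.fft.fft2(template, s=image.shape)  # zero-pad to image size
    response = np.fft.ifft2(F_img * np.conj(F_tmp)).real
    return np.unravel_index(response.argmax(), response.shape)
```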
SORT (Simple Online and Realtime Tracking)
SORT is a recent algorithm for MOT (Bewley et al., 2016).
In the MOT problem setting, each frame has more than one object to
track. A generic method to solve this problem has two steps: (1) detect
objects in each frame; (2) associate the detections across frames (data association).
In the Deep SORT extension, the authors integrate appearance information to improve the
performance of SORT. This extension allows objects to be tracked through longer periods of
occlusion (effectively reducing the number of identity switches).
Specifically, they integrate motion and appearance information using two metrics.
(*) For motion information, they use the Mahalanobis distance between predicted Kalman
states and newly arrived measurements.
(*) To alleviate the problem of re-identification in the case of long-term occlusions, they
introduce a cosine metric applied to appearance descriptors of each identified object.
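A sketch of how the two metrics might be blended into a single association cost, in the spirit of Deep SORT's weighted combination; the mixing weight lam and the function names are illustrative assumptions:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two appearance descriptor vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def combined_affinity(d_mahalanobis, d_cosine, lam=0.5):
    """Weighted blend of the two association metrics: Mahalanobis
    distance from the Kalman prediction (motion) and cosine distance
    between appearance descriptors (appearance).
    lam is a hypothetical mixing weight for this sketch."""
    return lam * d_mahalanobis + (1 - lam) * d_cosine
```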
Multiple-Object Tracking (MOT) for Multiple Cameras
• Each detection in each camera frame becomes a node in a hypergraph; hyperedges denote
potential couplings across different cameras; edge orientations in the graph encode temporal
ordering.
• The MOT problem is thereby reduced to a constrained min-cost flow graph, and the tracking
problem can be efficiently solved using a binary linear programming algorithm.
Multiple-Object Tracking (MOT) for Multiple Cameras
• Other attempted solutions to the MOT problem have relied on homography-
based methods.
Eshel et al., Homography Based Multiple Camera Detection and Tracking of People in a Dense Crowd
Multiple-Object Tracking (MOT) for Multiple Cameras
• The Intel OpenCV SDK supports: R-CNN, YOLO, AlexNet, VGG, Inception,
MTT (baseline single camera), etc.
• Acceleration with GPUs, traditional CV task support, MTT (single
camera?), subject facial recognition, feature inference (gender, age),
and optimization for Intel hardware.
Further Issues and Problem Considerations
• Integration of “stitching” for tracking across multiple cameras.
• Training a robust person classifier (is a preexisting model sufficient?)
• Computation delegation (smart camera vs. cloud & on-line vs. offline)
• Identification of a group of consumers with a single tag (e.g. a family of shoppers).
• Methods to utilize the known geometry of the store (for the exclusion model).
• Fine-grained identification issues (e.g. is facial recognition, etc., a feature the model can use?)
Further Issues and Problem Considerations
• Meta-Data uses
• Gesture recognition
• Multi-modal resources for tracking (e.g. audio, sensors)
• Camera stream “parsimony” (e.g. when no individual detected in a frame, no need to process frame).
• Major challenge: Seamless integration of new inventory and store geometries for ambient computing; one-
shot learning.
• Automation or semi-automation of system calibration in stores
• The primary component of “ambient computing” is MOT; with MOT alone, one could potentially package a
meta-data pipeline using Movidius, etc., for commercial use (e.g. anonymous “heat map” tracking of
consumer behavior).