Research On Object Tracking Based On Siamese Network


Wang Yuanhui

ABSTRACT

Visual object tracking is a fundamental task in computer vision, which aims to predict the position and shape of a given target in each video frame. It has a wide range of applications in robot vision, video surveillance, unmanned driving, and other fields. With the rise of deep learning, neural networks have been employed in the mainstream frameworks for visual object tracking. Among them, methods built on the Siamese network architecture have shown excellent tracking performance. In this paper, I investigate the effects of different kinds of interference and different network structures on the target tracking performance of Siamese networks.

Index Terms: Visual Object Tracking, Deep Learning, Siamese Networks

1. INTRODUCTION

In this paper, I first investigate how to train a Siamese network and discuss its input and loss function. Then, I investigate the effects of different kinds of interference and different network structures on the target tracking performance of Siamese networks.

2. SIAMESE NETWORK

A Siamese neural network (SNN) [1] is a class of neural network architectures that contain two or more identical sub-networks. “Identical” here means they have the same configuration with the same parameters and weights. Parameter updates are mirrored across the sub-networks, and the network is used to find similarities between inputs by comparing their feature vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.

Figure 1. Architecture of the Siamese network

Traditionally, a neural network learns to predict multiple classes. This poses a problem when we need to add classes to or remove classes from the data: we have to update the neural network and retrain it on the whole data set. Deep neural networks also need a large volume of data on which to train. SNNs, on the other hand, learn a similarity function. Thus, we can train the SNN to decide whether two images are the same, which enables us to classify new classes of data without retraining the network.

Loss function: In the Siamese network, the loss function is the contrastive loss, which can effectively handle the relationship between paired data in the twin neural network. The expression for the contrastive loss is as follows:

$$L = \frac{1}{2N}\sum_{n=1}^{N}\left[\,Y D_W^{2} + (1 - Y)\max(m - D_W,\,0)^{2}\right], \qquad D_W = \lVert X_1 - X_2 \rVert_2 = \Big(\sum_{i=1}^{P}\big(X_1^{i} - X_2^{i}\big)^{2}\Big)^{1/2}$$

Here D_W represents the Euclidean distance between the two sample features X1 and X2, P represents the feature dimension of the sample, Y is the label indicating whether the two samples match (Y=1 indicates that they are similar or match, Y=0 that they do not), m is the preset margin threshold, and N is the number of samples. Y and D_W inside the sum are evaluated for the n-th sample pair.

3. THE PIONEER-SIAMFC

Siamese neural networks have been used in object tracking because of their two parallel inputs and their built-in similarity measurement. In object tracking, one input of the twin network is an exemplar image pre-selected by the user, and the other input is a larger search image; the network's job is to locate the exemplar inside the search image. By measuring the similarity between the exemplar and each part of the search image, the twin network produces a map of similarity scores. Furthermore, with a fully convolutional network, computing the similarity score of each sub-window can be replaced by a single cross-correlation layer.

Siamese networks are an important deep learning solution for object tracking. Representative trackers include SiamFC [2], SiamRPN [3], DaSiamRPN [5], and SiamRPN++ [4].

3.1. SiamFC

SiamFC pioneered the application of the Siamese network structure in the field of object tracking and significantly improved the tracking speed of deep learning trackers. Most later deep learning trackers were improved and optimized on the basis of this method. Its significance as a milestone is therefore comparable to that of KCF [13].
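To make the contrastive loss of Section 2 concrete, the following is a minimal NumPy sketch. It assumes the commonly used form of the loss, in which similar pairs (Y=1) are penalized by their squared Euclidean distance and dissimilar pairs (Y=0) by the squared shortfall from the margin m, averaged as 1/(2N). The function name `contrastive_loss` and the default margin are illustrative choices, not taken from the original.

```python
import numpy as np


def contrastive_loss(x1, x2, y, margin=1.0):
    """Contrastive loss over N feature pairs (illustrative sketch).

    x1, x2 : array-like of shape (N, P), the paired feature vectors.
    y      : array-like of shape (N,), 1 = similar pair, 0 = dissimilar pair.
    margin : the threshold m (assumed default, not from the paper).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    y = np.asarray(y, dtype=float)

    # Euclidean distance D_W between the two features of each pair.
    d = np.linalg.norm(x1 - x2, axis=1)

    # Similar pairs are pulled together; dissimilar pairs are pushed
    # apart until they are at least `margin` away from each other.
    similar_term = y * d ** 2
    dissimilar_term = (1.0 - y) * np.maximum(margin - d, 0.0) ** 2

    return float(np.sum(similar_term + dissimilar_term) / (2.0 * len(y)))


if __name__ == "__main__":
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [[3.0, 4.0], [0.1, 0.0]]
    # First pair is labelled similar, second pair dissimilar.
    print(contrastive_loss(a, b, [1, 0], margin=1.0))
```

A dissimilar pair that is already farther apart than the margin contributes nothing, which is what lets the network stop pushing once the two embeddings are well separated.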

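The idea from Section 3 of replacing per-window similarity scoring with a single cross-correlation layer can be sketched as follows. This is a plain NumPy illustration rather than the SiamFC implementation: the function name `cross_correlation` is hypothetical, the feature maps are synthetic, and the 6 × 6 × 128 and 22 × 22 × 128 sizes are the SiamFC feature sizes quoted in this section.

```python
import numpy as np


def cross_correlation(search_feat, exemplar_feat, bias=0.0):
    """Slide the exemplar features over the search features.

    search_feat   : array of shape (Hx, Wx, C)
    exemplar_feat : array of shape (Hz, Wz, C)
    Returns a score map of shape (Hx - Hz + 1, Wx - Wz + 1); each entry is
    the inner product of the exemplar with one sub-window, plus a bias.
    """
    hx, wx, cx = search_feat.shape
    hz, wz, cz = exemplar_feat.shape
    if cx != cz:
        raise ValueError("exemplar and search features need the same channel count")

    scores = np.empty((hx - hz + 1, wx - wz + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            window = search_feat[i:i + hz, j:j + wz, :]
            scores[i, j] = np.sum(window * exemplar_feat) + bias
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(6, 6, 128))      # synthetic exemplar features
    x = np.zeros((22, 22, 128))           # synthetic search features
    x[5:11, 9:15, :] = z                  # plant the exemplar at offset (5, 9)

    score_map = cross_correlation(x, z)
    peak = np.unravel_index(np.argmax(score_map), score_map.shape)
    print(score_map.shape, tuple(int(k) for k in peak))
```

The peak of the score map marks the offset at which the exemplar best matches the search features. A real tracker would compute the same quantity with a convolution primitive instead of Python loops.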
SiamFC adopts a fully convolutional Siamese network for object tracking. Its network structure is shown in Figure 2 below, with the two branches sharing weights. Z is the 127 × 127 template image, which corresponds to the target to be tracked, and X is the 255 × 255 search image. All we have to do is find the position of Z in X.

Figure 2. Architecture of SiamFC.

SiamFC has two branches corresponding to the two inputs Z and X. Both inputs are passed through φ, whose function is to extract features, generating feature maps of size 6 × 6 × 128 and 22 × 22 × 128, respectively. The feature extraction network φ adopts AlexNet [14], and its structure is as follows:

[Table: structure of the feature extraction network]

The generated feature maps are fed into the cross-correlation layer to produce a score map. In fact, the following calculation is performed:

$$f(z, x) = \varphi(z) * \varphi(x) + b\,\mathbb{1}$$

where $b\,\mathbb{1}$ takes the value b at every position, which is equivalent to an offset, and $*$ is the convolution of φ(x) with φ(z), which finds the part of X that is most similar to Z.

The Siamese network has two branches corresponding to the two inputs. Z and X are not arbitrary crops: they are obtained by expanding the target area with surrounding context, as shown in the following figure:

Figure 3. Training pairs extracted from the same video: exemplar image and corresponding search image from the same video. When a sub-window extends beyond the extent of the image, the missing portions are filled with the mean RGB value.

For the search image x, a 255 × 255 patch is cropped from the image, centred on the bounding box predicted in the previous frame. To improve tracking performance, several scales are searched. The original SiamFC uses 5 scales, 1.025^{-2, -1, 0, 1, 2}, of which 255 × 255 corresponds to scale 1. To improve the frame rate of the network, SiamFC-3s with three scales is proposed. When the crop for the template image or the search image extends beyond the image boundary, the missing pixels are filled with the mean value of the RGB channels.

To construct an effective loss function, the positions in the search area are divided into positive and negative samples: points within a certain range of the target are taken as positive samples, and points outside this range are taken as negative samples. For example, in the score map on the far right of the network structure diagram, the red points are positive samples and the blue points are negative samples. The ground truth in the score map is marked as follows:

$$y[u] = \begin{cases} +1, & k\,\lVert u - c \rVert \le R \\ -1, & \text{otherwise} \end{cases}$$

where c is the center of the object in the score map, u is any point in the score map, ‖u − c‖ is the Euclidean distance between u and c, R is the distance threshold, and k is the factor by which the network reduces the resolution of the score map. From the network structure, we can see that three layers downsample with a stride of 2, so the image is reduced by a factor of 2^3 = 8 after passing through the network.

The loss function adopted by SiamFC is the logistic loss, which has the following form:

$$\ell(y, v) = \log\big(1 + \exp(-y v)\big)$$

where v is the real-valued score of a point in the score map and y ∈ {+1, −1} is the label of that point. The formula above gives the loss of a single point in the score map. The overall loss of the score map is the mean of the losses of all points.

3.2. Deficiencies of SiamFC

SiamFC is an effective target tracking framework. It has the advantages of simplicity and high speed, but it also leaves much room for improvement:
• SiamFC can only estimate the central position of the target, not its size, so it can only adopt a simple multi-scale scheme plus regression, which increases the amount of calculation and is not accurate enough.
• Many tracking algorithms based on Siamese networks appeared after SiamFC, and these networks all use AlexNet as the baseline feature extractor. In fact, some researchers had already tried deeper networks, but found that directly using a pre-trained deep network decreased the accuracy of the tracking algorithm. Therefore, this has become a thorny
problem to be solved for a tracker based on the Siamese network.
• SiamFC can only distinguish between the target and a background without semantic information. When the background contains semantic objects, that is, when there is a distractor, the performance is not very good.

4. IMPROVED TRACKER BASED ON SIAMFC

4.1. SiamRPN

Similar to SiamFC, SiamRPN also has a Siamese network structure, and the weights of the two branches are shared. Each branch takes one image as input and extracts its features. As shown in the following figure, the template frame (top) corresponds to the template image, and the detection frame (bottom) corresponds to the search image. The two images are fed into the CNN for feature extraction, and the features are then passed to the cross-correlation layer.

Figure 4: Tracking as one-shot detection: the template branch predicts the weights (in gray) for kernels of region proposal subnetwork on detection branch using the first frame. Then the template branch is pruned and only the detection branch is retained. So the framework is modified to a local detection network.

RPN [15] stands for region proposal network. An RPN can be understood as selecting regions from an image or feature map to generate anchors. It has two branches: a classification branch and a regression branch. As shown in Figure 5, the proposed framework consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork for proposal generation. Specifically, there are two branches in the RPN subnetwork: one is in charge of foreground-background classification, and the other is used for proposal refinement. Image patches including the target objects are fed into the proposed framework, and the whole system is trained end-to-end.

Figure 5: Main framework of Siamese-RPN

The network takes a 127 × 127 template image and a 255 × 255 search image as input, and the CNN produces two outputs of size 6 × 6 × 256 and 22 × 22 × 256. The two feature maps are then passed to both the classification branch and the regression branch. Note that the convolution weights here are not shared.

In the classification branch, the template feature map and the search feature map each pass through a convolution layer, giving outputs of size 4 × 4 × (2k × 256) and 20 × 20 × 256, where k is the number of anchors generated for each grid cell. The aspect ratios of the anchors are [0.33, 0.5, 1, 2, 3]. The two feature maps are then correlated with each other (the operation denoted ⋆, a convolution): the 256 channels are convolved with each other and one channel is generated by weighted summation, which yields a 17 × 17 × 2k feature map. This is equivalent to dividing the search image into a 17 × 17 grid, with each grid cell generating k anchors. Every two channels form a group, so a total of k groups correspond to the k anchors. In the first channel of a group, target anchors are labelled 1 and background anchors 0; in the second channel, the background is 1 and the target is 0.

In the regression branch, the two feature maps are likewise each passed through a convolution layer. The operation is the same as in the classification branch and generates a 17 × 17 × 4k feature map. The four channels of each group correspond to the four values dx, dy, dw and dh of the bounding box, which are the offsets between the anchor and the ground truth.

Compared with the baseline algorithm SiamFC, SiamRPN achieves an improvement of more than five points (on the OTB100 and VOT15/16/17 data sets). At the same time, it runs faster (160 fps) and strikes a better balance between accuracy and speed.

4.2. SiamRPN++

The network structure of SiamRPN++ is shown in Figure 6 below. Both sides of the dotted line are network structure diagrams: the left side shows the feature extraction network, and the right side shows the RPN structure. The network structure of SiamRPN++ is very similar to that of SiamRPN, and SiamRPN++ adds several innovations on top of it.

Figure 6. Main framework of SiamRPN++. Given a target template and search region, the network outputs a dense prediction by fusing the outputs from multiple Siamese Region Proposal (SiamRPN) blocks. Each SiamRPN block is shown on the right.

An analysis of the training process shows that Siamese networks suffer from position bias when a deep neural network is used, because convolution padding destroys strict translation invariance. However, padding cannot be removed from the deep
network. To alleviate this problem and improve the tracking performance of the deep network, SiamRPN++ adds a location-balanced sampling strategy to the training process. Modifying the sampling strategy alleviates the location bias during training, so that the deep network can play its due role.

With this sampling strategy, the deep network finally becomes useful for the tracking task, and tracking performance is no longer constrained by network capacity. At the same time, to make better use of the deep network, SiamRPN++ uses multi-layer fusion. Shallow features carry more detailed information, while deep features carry more semantic information; after multi-layer fusion, the tracker can take both the details and the deep semantic information into account, which further improves performance.

In addition, the researchers proposed a new connection component, the depthwise separable correlation (DW). Compared with the previous up-channel correlation, DW greatly reduces the number of parameters, balances the parameters of the two branches, and makes training more stable with better convergence. To verify these claims, the researchers carried out detailed experiments. SiamRPN++ achieved state-of-the-art results on large data sets such as LaSOT and TrackingNet.

4.3. DaSiamRPN

The DaSiamRPN paper [5] first points out three problems of Siamese trackers:
1. Common Siamese tracking methods can only distinguish between the target and a background without semantic information. When the background contains semantic objects, that is, when there are distractors, the performance is not very good.
2. Most Siamese trackers cannot update the model in the tracking phase, and the trained model is the same for different specific targets. This brings high speed but correspondingly sacrifices accuracy.
3. In long-term tracking, Siamese trackers cannot deal well with challenges such as full occlusion and the target leaving the picture.

The authors found that SiamFC and SiamRPN give high scores to other similar objects besides the target. They note that non-semantic background samples far outnumber semantic ones: during training, most image pairs have no semantic background, and only a few contain semantic information. Therefore, the network only learns to distinguish between background and foreground. In the training stage, the authors introduce existing detection data sets to enrich the positive sample data, so as to improve the generalization ability of the tracker; they then enrich the hard negative sample data to improve the discrimination ability of the tracker.

Figure 7. Visualization of the response heatmaps of Siamese network trackers. (a) shows the search images. (b-e) show the heatmaps produced by SiamFC, SiamRPN, SiamRPN++ and DaSiamRPN.

The authors then propose three innovations.

Firstly, they use several kinds of image pairs to increase the generalization ability of the model. They expand the training data set: in addition to VID and YouTube-BB (which contain only 20 and 30 object classes, respectively), they also use ImageNet and COCO as training sets through data augmentation, which greatly increases the variety of objects. Negative image pairs containing semantic information are then used to increase the discrimination ability of the model. During training, the authors intentionally use negative pairs of objects that belong to the same category as the target but are not the target, so that the network can effectively distinguish different objects of the same kind, which increases robustness.

Secondly, they propose a distractor-aware module.

Finally, they propose a local-to-global search strategy for long-term tracking. By monitoring the detection score, we can judge whether the object has moved out of the frame: in practice, once the object leaves the picture, the score drops sharply. At this point, the algorithm expands the cropped local search region until the target is found. Because DaSiamRPN can effectively distinguish the background and distractors in the picture, the response value of the heat map increases only when the object reappears, after which local search is resumed.

5. CONCLUSION AND PROSPECT

The Siamese network structure is an excellent framework for tracking. Many researchers have worked on it and improved it from various aspects. For example, SiamBAN [6] and SiamCAR [7] introduce an anchor-free bounding box regression strategy, which reduces the amount of calculation and improves tracking speed. Siam R-CNN [8] combines re-detection with a Siamese network for tracking. In recent years, the transformer [9] has risen in the field of computer vision, and many algorithms that combine transformers with Siamese networks have emerged, such as SwinTrack [10] and TCTrack [11]. Attention has also turned to mining and exploiting historical frame information during tracking, for example in STMTrack [12].

Visual object tracking is a fundamental task in computer vision, which aims to predict the position and shape of a given target in each video frame. It has a wide range of applications in robot vision, video surveillance, unmanned driving, and other fields. As more and more excellent tracking algorithms appear, object tracking is moving towards long-term tracking and multi-object tracking.

6. REFERENCES

[1] Chicco, Davide (2020), "Siamese neural networks: an
overview", Artificial Neural Networks, Methods in Molecular
Biology, vol. 2190 (3rd ed.), New York City, New York, USA:
Springer Protocols, Humana Press, pp. 73–
[2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and
P. H. Torr. Fully-convolutional siamese networks for object
tracking. In ECCV Workshops, 2016
[3] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, Xiaolin Hu, High
Performance Visual Tracking with Siamese Region Proposal
Network. In CVPR, 2018.
[4] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang
Xing, Junjie Yan, SiamRPN++: Evolution of Siamese Visual
Tracking with Very Deep Networks, In CVPR, 2019.
[5] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan,
Weiming Hu, Distractor-aware Siamese Networks for Visual
Object Tracking, In ECCV 2018.
[6] Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang,
Rongrong Ji, Siamese Box Adaptive Network for Visual
Tracking, In CVPR 2020.
[7] Dongyan Guo, Jun Wang, Ying Cui, Zhenhua Wang,
Shengyong Chen, SiamCAR: Siamese Fully Convolutional
Classification and Regression for Visual Tracking, In CVPR
2020.
[8] Paul Voigtlaender, Jonathon Luiten, Philip H.S. Torr,
Bastian Leibe, Siam R-CNN: Visual Tracking by Re-Detection,
In CVPR 2020.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia
Polosukhin, Attention Is All You Need, In NIPS, 2017.
[10] Liting Lin, Heng Fan, Yong Xu, Haibin Ling, SwinTrack:
A Simple and Strong Baseline for Transformer Tracking, In
NeurIPS 2022.
[11] Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang,
Ziwei Liu, Changhong Fu, TCTrack: Temporal Contexts for
Aerial Tracking, In CVPR 2022.
[12] Zhihong Fu, Qingjie Liu, Zehua Fu, Yunhong Wang,
STMTrack: Template-free Visual Tracking with Space-time
Memory Networks, In CVPR 2021.
[13] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-
Speed Tracking with Kernelized Correlation Filters, IEEE
Transactions on Pattern Analysis & Machine Intelligence,
2015.
[14] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,
ImageNet Classification with Deep Convolutional Neural
Networks, Advances in neural information processing systems,
2012.
[15] Ross Girshick, Fast R-CNN, In ICCV, 2015.
