Research On Object Tracking Based On Siamese Network
Wang Yuanhui
3.1. SiamFC
SiamFC adopts a fully convolutional Siamese network to realize object tracking. Its network structure is shown in Figure 2 below, with the two branches sharing weights. Z, the 127 × 127 template image, corresponds to the target to be tracked, and X is the 255 × 255 search image. All we have to do is find the position of Z in X.

Figure 2. Architecture of SiamFC.

SiamFC has two branches corresponding to the two inputs Z and X, which are fed simultaneously into the function φ. Here φ extracts features, generating 6 × 6 × 128 and 22 × 22 × 128 feature maps; the feature extraction network implementing φ is AlexNet [14].

The generated feature maps are input into the cross-correlation layer to generate a score map. In effect, the following calculation is performed:

f(z, x) = φ(z) ⋆ φ(x) + b·1

where ⋆ denotes cross-correlation and b·1 is a bias added at every position of the score map.

To handle scale variation, SiamFC searches over 5 scales 1.025^{-2, -1, 0, 1, 2}, of which 255 × 255 corresponds to scale 1. To improve the FPS of the network, SiamFC-3s, which uses only three scales, is proposed. When the template or search crop extends beyond the image so that pixels are missing, the missing pixels are filled with the mean value of the RGB channels.

To construct an effective loss function, the locations in the search area are divided into positive and negative samples: points within a certain range of the target are taken as positive samples, and points outside this range as negative samples. For example, in the score map generated at the far right of the network structure diagram, the red points are positive samples and the blue points are negative samples. The ground truth in the score map is marked as follows:

y[u] = +1 if K·∣u − C∣ ≤ R, and y[u] = −1 otherwise

where C is the center of the object in the score map, u is any point in the score map, ∣u − C∣ is the Euclidean distance between u and C, R is the distance threshold, and K is the factor by which the score map is reduced after passing through the network. From the network structure we can see that there are three stride-2 stages (convolution and pooling), so the image is reduced by 2^3 = 8 times after passing through the network.

The loss function adopted by SiamFC is the logistic loss:

ℓ(y, v) = log(1 + exp(−y·v))

where v is the real-valued score at a single position and y ∈ {+1, −1} is its ground-truth label.
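As a concrete illustration, the scoring and labeling steps above can be sketched in NumPy. The feature maps here are random stand-ins for AlexNet features, and the helper names are ours, not from the paper; this is a minimal sketch, not the reference implementation.

```python
import numpy as np

def score_map(phi_z, phi_x, b=0.0):
    """Slide template features phi_z (h, w, c) over search features
    phi_x (H, W, c): f(z, x) = phi(z) * phi(x) + b at every offset."""
    h, w, _ = phi_z.shape
    H, W, _ = phi_x.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(phi_z * phi_x[i:i + h, j:j + w]) + b
    return out

def label_map(shape, center, R=16, K=8):
    """Ground truth: y[u] = +1 where K * |u - C| <= R, else -1."""
    ys, xs = np.indices(shape)
    dist = np.hypot(ys - center[0], xs - center[1])
    return np.where(K * dist <= R, 1.0, -1.0)

def logistic_loss(y, v):
    """Mean logistic loss log(1 + exp(-y * v)) over the score map."""
    return float(np.mean(np.log1p(np.exp(-y * v))))

# Toy run with SiamFC's feature-map sizes (random stand-in features).
rng = np.random.default_rng(0)
v = score_map(rng.standard_normal((6, 6, 128)),
              rng.standard_normal((22, 22, 128)))   # 17 x 17 score map
y = label_map(v.shape, center=(8, 8))               # +1 near center, -1 elsewhere
loss = logistic_loss(y, v)
```

Note how the 6 × 6 template correlated over the 22 × 22 search map yields the 17 × 17 score map described above.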
problem to be solved for a tracker based on the Siamese network. SiamFC can only distinguish the target from a background without semantic information; when a semantic object appears in the background, that is, when there is a distractor, the performance is poor.

4. IMPROVED TRACKER BASED ON SIAMFC

4.1. SiamRPN

Similar to SiamFC, SiamRPN also has a Siamese network structure, with the weights of the two branches shared. Each branch takes one image as input and extracts its features. As shown in the figure below, the template frame above corresponds to the template image, and the detection frame below corresponds to the search image. The two images are input into a CNN for feature extraction and then into the cross-correlation layer.

Figure 4. Tracking as one-shot detection: the template branch predicts the weights (in gray) for kernels of the region proposal subnetwork on the detection branch using the first frame. Then the template branch is pruned and only the detection branch is retained, so the framework is modified into a local detection network.

The full name of RPN [15] is region proposal network. RPN can be understood as selecting regions from an image or feature map by generating anchors. RPN has two branches, a classification branch and a regression branch. As shown in Figure 5, the proposed framework consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork for proposal generation. Specifically, there are two branches in the RPN subnetwork: one is in charge of foreground-background classification, and the other is used for proposal refinement. Image patches including the target objects are fed into the proposed framework, and the whole system is trained end-to-end.

The network's input template image is 127 × 127 and the search image is 255 × 255; the CNN generates two outputs of size 6 × 6 × 256 and 22 × 22 × 256. The two feature maps are then copied to the classification branch and the regression branch respectively. Note that the convolution weights here are not shared.

In the classification branch, the template feature map and the search feature map are each passed through a convolution layer, producing outputs of size 4 × 4 × (2k × 256) and 20 × 20 × 256, where k is the number of anchors generated for each grid cell. The aspect ratios of the anchors are [0.33, 0.5, 1, 2, 3]. The two feature maps are then cross-correlated (⋆ denotes the convolution operation): the 256 channels are correlated with each other and combined by weighted summation into one channel per filter, yielding a 17 × 17 × 2k feature map, which is equivalent to dividing the search image into 17 × 17 grid cells, each generating k anchors. Every two channels form a group, and the k groups correspond to the k anchors. In the first channel of each group, target anchors are labeled 1 and background 0; in the second channel, background is 1 and target is 0.

In the regression branch, two feature maps are likewise generated through convolution layers. The operation is the same as in the classification branch, producing a 17 × 17 × 4k feature map. The four channels of each group correspond to the four values dx, dy, dw, and dh of the bounding box, which measure the offset between the anchor and the ground truth.

On the basis of the baseline algorithm SiamFC, SiamRPN achieves an improvement of more than five points (on the OTB100 and VOT15/16/17 data sets); at the same time, it achieves a faster speed (160 FPS) and a better balance between accuracy and speed.

4.2. SiamRPN++

The network structure of SiamRPN++ is shown in Figure 6 below. Both sides of the dotted line are network structure diagrams: the left side is the feature extraction network, and the right side is the RPN structure. The network structure of SiamRPN++ is in fact very similar to that of SiamRPN; SiamRPN++ adds many innovations on the basis of SiamRPN.

Figure 6. Main framework of SiamRPN++. Given a target template and search region, the network outputs a dense prediction by fusing the outputs from multiple Siamese Region Proposal (SiamRPN) blocks. Each SiamRPN block is shown on the right.
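The SiamRPN head computation of Section 4.1, the up-channel correlation producing the 17 × 17 × 2k and 17 × 17 × 4k maps, can be sketched as follows. The grouping of the 2k × 256 kernel channels into 2k blocks of 256 is our assumption for illustration, and the features are random stand-ins.

```python
import numpy as np

def upchannel_xcorr(kernel, search, k_out):
    """Up-channel cross-correlation of the SiamRPN heads: the template
    output kernel (4, 4, k_out * c) is split into k_out filters of shape
    (4, 4, c), each correlated over the search features (H, W, c)."""
    kh, kw, _ = kernel.shape
    H, W, c = search.shape
    filters = kernel.reshape(kh, kw, k_out, c)   # assumed channel grouping
    out = np.empty((H - kh + 1, W - kw + 1, k_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = search[i:i + kh, j:j + kw]            # (4, 4, c)
            out[i, j] = np.einsum('hwkc,hwc->k', filters, patch)
    return out

k = 5                                     # anchors per grid cell
rng = np.random.default_rng(0)
cls_kernel = rng.standard_normal((4, 4, 2 * k * 256))
reg_kernel = rng.standard_normal((4, 4, 4 * k * 256))
search_feat = rng.standard_normal((20, 20, 256))
cls_map = upchannel_xcorr(cls_kernel, search_feat, 2 * k)   # 17 x 17 x 2k
reg_map = upchannel_xcorr(reg_kernel, search_feat, 4 * k)   # 17 x 17 x 4k
```

This makes the asymmetry visible: the template branch must emit 2k × 256 (or 4k × 256) channels while the search branch emits only 256, which is the imbalance the depthwise correlation of Section 4.2 later removes.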
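The regression values dx, dy, dw, dh of Section 4.1 are applied to an anchor box roughly as below. The exact parameterization is our assumption (the standard RPN form), since the text only states that the four values measure the anchor-to-ground-truth offset.

```python
import numpy as np

def decode_box(anchor, delta):
    """Apply RPN-style offsets (dx, dy, dw, dh) to an anchor box
    (cx, cy, w, h): shift the center, scale the size exponentially.
    Standard RPN parameterization -- an assumption, not spelled out
    in the text."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = delta
    return (cx + dx * w, cy + dy * h, w * np.exp(dw), h * np.exp(dh))

# A zero offset leaves the anchor unchanged.
box = decode_box((50.0, 50.0, 32.0, 64.0), (0.0, 0.0, 0.0, 0.0))
```

Under this parameterization the network predicts relative offsets, so one head serves anchors of every size and aspect ratio.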
network. In order to alleviate this problem and improve the tracking performance of a deep network, SiamRPN++ proposes adding a location-balanced sampling strategy in the training process. By modifying the sampling strategy to alleviate the location bias in training, the deep network can play its due role.

By adding this sampling strategy, the deep network can finally be effective in the tracking task, so that tracking performance is no longer constrained by network capacity. At the same time, to make better use of the deep network, SiamRPN++ uses multi-layer fusion. Shallow layers carry more detailed information, while deep layers carry more semantic information; after multi-layer fusion, the tracker can take both the details and the deep semantic information into account, further improving performance.

In addition, the researchers also proposed a new connection component, the depthwise separable correlation (DW). Compared with the previous up-channel correlation, DW greatly reduces the number of parameters, balances the parameters of the two branches, and makes training more stable and convergence better. To verify the above, the researchers conducted detailed experiments: SiamRPN++ achieves SOTA results on large data sets such as LaSOT and TrackingNet.

4.3. DaSiamRPN

This paper first points out three problems of Siamese-class algorithms:
1. Common Siamese-class tracking methods can only distinguish the target from a background without semantic information. When a semantic object appears in the background, that is, when there are distractors, the performance is poor.
2. Most Siamese trackers cannot update the model in the tracking phase, and the trained model is the same for different specific targets. This brings high speed but correspondingly sacrifices accuracy.
3. In long-term tracking applications, Siamese trackers cannot deal well with challenges such as full occlusion and the target moving out of the frame.

The authors found that SiamFC and SiamRPN give high scores to other similar objects besides the target. They note that targets without semantic information far outnumber those with semantic information: in training, most image pairs have a non-semantic background and few carry semantic information, so the network only learns to distinguish background from foreground. In the training stage, the authors therefore introduce existing detection data sets to enrich the positive sample data and improve the generalization ability of the tracker; they then enrich the hard negative sample data to improve its discrimination ability.

Figure 7. Visualization of the response heatmaps of Siamese network trackers. (a) shows the search images; (b-e) show the heatmaps produced by SiamFC, SiamRPN, SiamRPN++, and DaSiamRPN.

The authors then propose three innovations. First, they use multiple kinds of image pairs to increase the generalization ability of the model by expanding the training data set: in addition to VID and YouTube-BB (which contain only 20 and 30 object classes, respectively), they also use ImageNet and COCO as training sets through data augmentation, greatly increasing the variety of objects. Negative image pairs containing semantic information are then used to increase the discrimination ability of the model: during training, the authors deliberately use negative pairs of the same class as the target (but not the target itself), so that the network can effectively distinguish different objects of the same class, increasing robustness. Second, they propose a distractor-aware module. Finally, they propose a local-to-global strategy for long-term tracking: by monitoring the score, the tracker can judge whether the object has moved out of the frame, since once the object leaves the picture the score drops sharply. At that point the algorithm enlarges the cropped local search region until the target is found. Because DaSiamRPN can effectively distinguish the background and distractors in the picture, the response value of the heat map rises only when the object reappears, after which local search resumes.

5. CONCLUSION AND PROSPECT

The Siamese network structure for tracking is an excellent framework. Many researchers have built on it and improved it from various aspects. For example, SiamBAN [6] and SiamCAR [7] introduce an anchor-free bounding box regression strategy, which reduces the amount of computation and improves tracking speed. SiamRCNN [8] combines re-detection with a Siamese network for tracking. In recent years, the transformer [9] has risen in the field of computer vision, and many algorithms combining transformers with Siamese networks have emerged, such as SwinTrack [10] and TCTrack [11]. Some algorithms have also begun to mine and exploit historical frame information during tracking, for example STMTrack [12].

Visual object tracking is a fundamental task in computer vision which aims to predict the position and shape of a given target in each video frame. It has a wide range of applications in robot vision, video surveillance, unmanned driving, and other fields. As more and more excellent tracking algorithms appear, object tracking is moving in the direction of long-term tracking and multi-object tracking.
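Returning to the depthwise separable correlation (DW) of Section 4.2, the operation can be sketched as follows, with illustrative sizes and random stand-in features; the helper name is ours.

```python
import numpy as np

def depthwise_xcorr(kernel, search):
    """Depthwise cross-correlation: channel i of the template kernel is
    correlated only with channel i of the search features, so the output
    keeps the same channel count instead of multiplying it."""
    kh, kw, c = kernel.shape
    H, W, _ = search.shape
    out = np.empty((H - kh + 1, W - kw + 1, c))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # per-channel sum over the kh x kw window
            out[i, j] = np.sum(kernel * search[i:i + kh, j:j + kw],
                               axis=(0, 1))
    return out

# Illustrative sizes: a 4 x 4 x 256 template kernel over a 20 x 20 x 256
# search map yields a 17 x 17 x 256 response.
rng = np.random.default_rng(0)
resp = depthwise_xcorr(rng.standard_normal((4, 4, 256)),
                       rng.standard_normal((20, 20, 256)))
```

In contrast to the up-channel correlation, where the template branch must output 2k × 256 channels, DW keeps both branches at 256 channels and leaves the anchor predictions to light convolutions applied afterwards, which is why it balances the parameters of the two branches.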
6. REFERENCES