(!) Chen One-Shot Adversarial Attacks On Visual Tracking With Dual Attention CVPR 2020 Paper

One-shot Adversarial Attacks on Visual Tracking with Dual Attention

Xuesong Chen*1, Xiyu Yan*2, Feng Zheng†3, Yong Jiang2,4, Shu-Tao Xia2,4, Yong Zhao1, Rongrong Ji4,5

1 Peking University, School of ECE; 2 Tsinghua University; 3 Southern University of Science and Technology; 4 Peng Cheng Laboratory; 5 Xiamen University

[email protected], [email protected], [email protected]

Abstract

Almost all adversarial attacks in computer vision are aimed at pre-known object categories, which can be trained off-line for generating perturbations. But for visual object tracking, the tracked target categories are normally unknown in advance. Nevertheless, tracking algorithms also have potential risks of being attacked, which could be maliciously exploited to fool surveillance systems. Meanwhile, attacking tracking remains a challenging task, since the tracked target is model-free. Therefore, to help draw more attention to these potential risks, we study adversarial attacks on tracking algorithms. In this paper, we propose a novel one-shot adversarial attack method to generate adversarial examples for model-free single object tracking, where merely adding slight perturbations to the target patch in the initial frame causes state-of-the-art trackers to lose the target in subsequent frames. Specifically, the optimization objective of the proposed attack consists of two components and leverages dual attention mechanisms. The first component adopts a targeted attack strategy by optimizing the batch confidence loss with confidence attention, while the second applies a general perturbation strategy by optimizing the feature loss with channel attention. Experimental results show that our approach can significantly lower the accuracy of the most advanced Siamese network-based trackers on three benchmarks.

[Figure 1 shows four frames (#1, #50, #100, #150) of a video in two rows, labeled "Original" and "Perturbed"/"Attacked".]
Figure 1. We only slightly perturb the target patch in the initial frame of a video, resulting in tracking failure in subsequent frames. First row: the original video frames are successfully tracked. Second row: attacking the target only in the initial frame paralyzes the tracker. The green boxes represent the ground truth, and the red boxes represent the tracking results of the tracker.

* Equal contributions. This work was done while Xuesong Chen and Xiyu Yan were visiting the Feng Zheng Lab at SUSTech.
† Corresponding author.

1. Introduction

Visual Object Tracking (VOT) plays a significant role in practical security applications such as intelligent surveillance systems. Recent years have witnessed many breakthroughs in visual object tracking algorithms [2, 25, 5, 17, 28, 16], brought by the progress of deep learning. For example, the SiamRPN++ tracker [16], based on a Siamese network, has reached 91% precision on the OTB100 benchmark [30]. However, whether deep learning-based object tracking algorithms are as powerful as they seem is a question worth pondering.

Adversarial attacks on deep learning models in computer vision have attracted increasing interest in the past years [1]. Many adversarial attacks against deep networks have successfully fooled image classifiers and object detectors. For example, Szegedy et al. demonstrated that putting small perturbations in images that remain (almost) imperceptible to the human visual system could fool deep learning models into misclassification [24]. More recently, [26] created a small adversarial patch that is used as a cloaking device to hide persons from a person detector. Commonly, almost all these attacks are aimed not at model-free (i.e., arbitrary) targets but at pre-known categories. In fact, adding adversarial perturbations to the model-free target patch in a certain frame may cause state-of-the-art trackers to lose the target in subsequent frames, which could be maliciously used to fool surveillance systems. Thus, it is necessary to study adversarial attacks on visual object tracking algorithms to help improve their resistance to these potential risks.

However, attacking a tracker so that it loses the object across continuous video frames is a challenging task. First, online visual tracking cannot know the category of the tracked object in advance, nor learn it beforehand, because the tracked target is model-free and the video frames arrive continuously. Secondly, it is difficult to set an optimization objective for generating adversarial examples, since a successful attack on the tracking task is significantly different from one on a multi-class classification task, which merely needs to maximize the probability of the class with the second-highest confidence. Specifically, the tracking task in each frame amounts to classifying all candidate boxes into one positive sample and the rest into negative samples. Such a special binary classification problem makes it difficult to perform a successful attack if only one candidate box is selected to have its confidence increased.

To address these challenges, in this paper we study adversarial attacks against visual object tracking. Our attack targets a series of excellent trackers based on Siamese networks, in which tracking accuracy and efficiency are well balanced due to the unique advantages of off-line learning and the abandonment of similarity updating. For these trackers, we propose a one-shot attack framework: only slightly perturbing the pixel values of the object patch in the initial frame of a video achieves the goal of attacking the tracker in all subsequent frames, i.e., causing the tracking of SiamRPN to fail (see Fig. 1).

Specifically, a novel attack method with dual losses and dual attention mechanisms is explored to generate the adversarial perturbation on the target patch in the initial frame. The optimization objective of the proposed attack method consists of two components, and each loss is combined with a corresponding well-designed attention weight to further improve the attack ability. On the one hand, we formulate the Siamese network-based tracking problem as a specific classification task: the tracking candidates are treated as the labels of classification, and a successful match pairs the target template with the candidate box of maximum confidence. Thus, we can pertinently perturb the input so that the tracker matches "the best box" of our choosing. Here, we optimize the batch confidence loss by suppressing the confidences of excellent candidates and stimulating those of moderate candidates. To further distinguish the high-quality candidate boxes, a distance-oriented attention mechanism is adopted to widen the gap between excellent candidates. On the other hand, we apply a general perturbation strategy by optimizing a feature loss that maximizes the distance between the clean image and its adversarial example in the feature space, yielding a powerful attack. To further ensure the generalization ability of the one-shot attack, channel-wise activation-oriented attention over the feature maps is taken into account under the limited perturbation conditions.

Eventually, we evaluate our attacks on three tracking benchmarks, including OTB100 [30], LaSOT [4], and GOT10K [11]. The experimental results show that our approach can significantly lower the accuracy of the most advanced Siamese network-based trackers.

In summary, the key contributions of this paper are as follows.

• To the best of our knowledge, we are the first to study one-shot adversarial attacks against VOT. The proposed one-shot attack method against trackers based on Siamese networks can make them fail to track throughout a video by disturbing only the initial frame.

• We present a new optimization objective function with dual attention mechanisms to generate adversarial perturbations, ensuring the efficiency of the one-shot attack.

• Experimental results on three popular benchmarks show that our method is able to significantly lower the accuracy of state-of-the-art Siamese network-based trackers.

2. Background and Related Work

In this section, we first briefly describe the background of adversarial attack problems. Next, the development of adversarial attack methods in computer vision (CV) tasks is reviewed. Lastly, we discuss the trackers based on Siamese networks that are adopted as our attack targets in this work.

2.1. Background of Adversarial Attacks

It is necessary to introduce some common technical terms related to adversarial attacks on deep learning models in computer vision; the remainder of the paper follows the same definitions of these terms.

Adversarial example. An adversarial example is derived from a natural clean example by a specific algorithm so as to make a model produce an incorrect decision. It can be generated by global pixel perturbations of clean samples, or by adding adversarial patches to clean samples. Global pixel perturbation is applied in our work.

Adversarial attacks. According to the degree of the attacker's knowledge of the model, attacks can be classified into white-box attacks and black-box attacks. Also, according to the goal pursued by the attacker, they can be divided into targeted attacks and non-targeted attacks.

White-box attacks. When the attackers know all the knowledge of the model, including the structure, the parameters and the values of the trainable weights of the
neural network model, they can generate adversarial examples to mislead the model.

Black-box attacks. When the attackers have only limited or no information about the model, they must construct adversarial examples that can fool most machine learning models.

Targeted attacks. These are usually used to attack classifiers. In this case, the attacker wants to change the prediction result to some specified target category.

Non-targeted attacks. On the contrary, in this case the goal of the attacker is simply to make the classifier give a false prediction, regardless of which category the erroneous classification becomes. Our attack lies between these two cases.

Our work focuses on white-box, test-time attacks on visual object tracking algorithms; other families of attacks not directly relevant to our setting are not discussed here.

2.2. Adversarial Attacks in CV Tasks

Szegedy et al. [24] first proposed generating adversarial examples for classification models that successfully mislead the classifier. Following that, Goodfellow et al. [7] extended this line of work and created the Fast Gradient Sign Method (FGSM) to generate adversarial attacks on images. Other gradient-based attack methods include BIM [15], JSMA [22], DFool [20], Carlini and Wagner Attacks (C&W) [3], etc. Most of these attacks are directed at image classification, the most basic visual task.

Recently, there have been several explorations of adversarial attacks on higher-level tasks, such as semantic segmentation and object detection. For example, [31] first transforms the attack task into a generation task and proposes a Dense Adversary Generation (DAG) method that optimizes a loss function for generating adversarial examples, which are then used to attack segmentation and detection models based on deep networks. This transformation means that attacks are no longer limited to traditional gradient-based algorithms but can also employ generative models, such as GANs. Later, [26] presented an approach that generates adversarial patches for targets with large intra-class variety and successfully hid a person from a person detector.

Most recently, PAT [29] and SPARK [9] generate adversarial samples against VOT through iterative optimization over video frames. However, this online iterative attack strategy restricts their application scenarios. First, to generate adversarial sequences, they always need access to the weights of the models during the attack. Second, the forward-backward propagation iterations can hardly meet the real-time requirements of the tracking task.

2.3. Siamese Network-based Tracking

Visual Object Tracking (VOT) aims to predict the position and size of an object in a video sequence after the target has been specified in the first frame [18]. Recently, Siamese network-based trackers [25, 2, 8, 32, 27, 10] have drawn significant attention due to their simplicity and effectiveness. Bertinetto et al. [2] first proposed a network structure based on Siamese fully convolutional networks for object tracking (SiamFC). Since then, many state-of-the-art tracking algorithms have been proposed [32, 10, 17, 16, 28]. For example, the representative tracker SiamRPN [17] introduces a region proposal network after the Siamese network and combines classification and regression for tracking.

These Siamese trackers formulate the VOT problem as a cross-correlation problem and learn a tracking similarity map from deep models with a Siamese network structure, one branch learning the feature representation of the target, the other that of the search area. To ensure tracking efficiency, the offline-learned Siamese similarity function is often fixed at running time. Meanwhile, the target template is acquired in the initial frame and remains unchanged in the subsequent video frames.

In the tracking phase of each frame, the target template and the search region, which includes several candidate boxes, are fed into the Siamese network to generate a confidence map that represents the confidences of the candidate boxes. It is worth noting that Gaussian windows are widely applied to refine the confidences of the candidate boxes in the inference phase of tracking tasks. Different from the Non-Maximum Suppression (NMS) algorithm [21] used in detection tasks [23, 6] to suppress candidates with low confidences, the role of Gaussian windows in tracking is to weaken the confidences of candidate boxes far from the center location of the target predicted in the last frame. The reason the Gaussian window can be used effectively is the prior knowledge of the continuity of video frames in tracking tasks, namely that the target cannot move too far between two adjacent frames.

3. Methodology

In this section, we first introduce the problem definition of the proposed adversarial attack method for tracking algorithms. Then the one-shot attack framework against Siamese network-based trackers is detailed. Lastly, we elaborate on the proposed dual attention attack method.

3.1. Problem Definition

Our attack targets the most popular VOT pipeline, the Siamese network-based trackers described above, which formulate VOT as learning a general similarity map by cross-correlation between the feature representations learned for the target template and the search region (see Fig. 2). In these trackers, the offline-learned Siamese similarity function and the target template given in the first frame are fixed at running time. Such a tracking
process without model updates or template updates makes it possible to mount attacks. Note that other trackers with updating, such as CREST, MDNet and ATOM, are even more easily attacked, because the adversarial information will lead the features of the model to drift to the wrong space; they then almost cannot work after attacks due to the wrong updates. They are thus not discussed in this paper.

[Figure 2 diagram: two CNN branches process the template and the search region; a classification branch outputs softmax confidences and a regression branch outputs box offsets (dx, dy, dw, dh); forward iterations rank the candidate confidences, and back-propagation iterations update the perturbation using the confidence loss and the feature loss.]
Figure 2. The framework of the one-shot attack against Siamese trackers with dual attention mechanisms.

Although there are many existing attack methods for other high-level CV tasks such as detection and classification, attacking tracking tasks is quite a challenge because tracking differs greatly from these tasks. Specifically, we analyze the characteristics of Siamese trackers compared with detection and classification.

First, online visual tracking cannot know the category of the tracked object in advance, because the target position is only given in the first frame of a video for model training. Therefore it cannot learn a mechanism off-line to perturb the pixel values in advance, and it is impossible to generate general class-level adversarial perturbations, which are commonly used in attack algorithms against classification and detection.

Second, the concept of failed tracking is different from that of misclassification, where a targeted attack maximizes the probability of the category with the second-highest confidence until it exceeds the probability of the correct category. As explained above, Siamese trackers output confidence maps that measure the similarity between the target and the candidates in the search region. The candidate with the highest confidence in the ranking is selected as the prediction of the object. Simply maximizing the box with the second-highest confidence does not lead to failed tracking. For example, all anchors (candidate boxes) in SiamRPN are employed for regression to the location of the target, which enables a considerable number of anchors to accurately regress to the target location.

Last but not least, different from the NMS algorithm used in detection, Gaussian windows are widely applied to refine the box confidences in the tracking task, which makes it difficult to balance the strength and the success of the attack. For example, when only considering the power of the attack, the box farthest from the object is the best perturbation target. However, the confidences of distant boxes are more strongly suppressed by the Gaussian windows, and selecting these boxes as the target may result in a failed attack.

In response to these challenges, we propose several criteria for generating the adversarial perturbations.

Firstly, it is necessary to generate matching adversarial perturbations for arbitrary tracking targets because their category is unknown. Therefore, we propose to add an adversarial perturbation only in the initial frame of each video, namely a one-shot attack.

Secondly, our adversarial attack must be able to perturb a certain number of boxes, which can increase the success rate of the attack. Specifically, adding the perturbations can reduce the confidences of several high-quality boxes and raise the confidences of many low-quality boxes, so that the tracker outputs wrong prediction boxes with large deviations. Thus, we propose to learn the attack perturbations by optimizing a batch confidence loss. Besides, we need to consider a general attack by designing a feature loss function to ensure the attack power. Therefore, we introduce two optimization strategies: one is the batch confidence loss, while the other generally attacks all candidates in the feature space of the CNN.

Lastly, to further improve the attack power, we add attention mechanisms to both loss functions. On the batch confidence loss, we distinctively suppress different candidates using confidence attention. On the feature loss, we add attention over the channels of the feature map to distinguish the importance of different channels by feature attention, which is inspired by [12].

Considering these criteria, we propose the one-shot attack based on dual attention, which is detailed in the next two subsections.

3.2. One-shot Attack with Dual Loss

Given the initial frame and the ground-truth bounding box of the tracking target, we can obtain the target patch z. The goal of our one-shot attack is to generate an adversarial target image z* (z* = z + Δz) with a slight pixel-value perturbation Δz only in the initial frame, which makes the tracking results deviate from the ground truth (i.e., failed tracking). We define the adversarial example attacking the tracker as follows:

    z* = argmin L(z, z*),  s.t. |z_k − z*_k| ≤ ε,    (1)

where z_k denotes a pixel of the clean image z, z*_k the corresponding pixel of the adversary z*, and ε the maximum perturbation range of each pixel value in the image. In our experiments, ε is set to 16, since global perturbations of such intensity are considered an imperceptible change to the human visual system. The batch confidence loss function L1 and the feature loss function L2 are detailed below.

Batch Confidence Loss. Our one-shot attack only occurs in the initial frame of each video, so we simulate the tracking process in the initial frame (given the tracking template) to generate the adversarial example. Note that the test has not yet started in this phase for the general tracking task.

Following the Siamese trackers, we assume that the search region X is around the target and twice its size, and includes n candidates {x_1, ..., x_n}. Let f(z, x_i) denote the tracking model, which takes z ∈ R^m and x_i ∈ R^m as inputs and outputs the confidence of each candidate. The output confidences f(z, x_i) of the n candidates have a ranking R_{1:n}. Thus the batch confidence loss function can be defined as follows:

    L1 = Σ_{i ∈ R_{1:p}} f(z*, x_i) − Σ_{i ∈ R_{q:r}} f(z*, x_i),  s.t. |z_k − z*_k| ≤ ε,    (2)

where R_{1:p} denotes the first p entries and R_{q:r} the entries from q to r in the confidence ranking. The purpose of this batch-confidence approach is to suppress the candidates with high confidence and stimulate the candidates with moderate confidence.

Feature Loss. Considering the challenges posed by the Gaussian window, and to balance the strength and success of the attack power, we apply another strategy that generally attacks all candidates in the feature space of the CNN. Let φ(·) represent the feature map of the CNN; then the Euclidean distance between the feature maps of z and z* is maximized. Thus the loss function is defined as follows:

    L2 = − Σ_{j=1:C} ||φ_j(z*) − φ_j(z)||_2,  s.t. |z_k − z*_k| ≤ ε,    (3)

where C denotes the number of channels of the feature maps.

3.3. Dual Attention Attacks

Furthermore, we add attention mechanisms to both loss functions to further improve the attack power.

Confidence Attention. By applying the confidence attention mechanism to the loss function, we can distinguish the degree of suppression and stimulation for candidates with different confidences. Eq. (2) is rewritten as

    L1* = Σ_{i ∈ R_{1:p}} w_i · f(z*, x_i) − Σ_{i ∈ R_{q:r}} f(z*, x_i),  s.t. |z_k − z*_k| ≤ ε,    (4)

with w_i defined as

    w_i = 1 / (a + b · tanh(c · (d(x_i) − d(x_1)))),    (5)

where d(x_i) denotes the coordinate distance between the i-th candidate x_i and the first candidate x_1 in the sorted confidence list. Eq. (5) is inspired by the Shrinkage loss [19], in which a, b, and c are controlling hyper-parameters. Specifically, c stands for the shrinkage rate, and a and b together limit the weight w_i to the range (1/(a+b), 1/a).

Feature Attention. Because of the limited perturbation conditions, we further consider the channel-wise activation-guided attention of the feature maps to distinguish the importance of different channels, which guarantees the generalization ability of the one-shot attack. Similarly, Eq. (3) is rewritten as:

    L2* = − Σ_{j=1:C} ||w′_j · (φ_j(z*) − φ_j(z))||_2,  s.t. |z_k − z*_k| ≤ ε,    (6)

and w′_j is defined as

    w′_j = 1 / (a′ + b′ · tanh(c′ · (m(φ_j(z)) − m(φ_j(z))_min))),    (7)

where m(·) and m(·)_min stand for the mean of each channel φ_j(z) and the minimum mean value, and a′, b′ and c′ are controlling hyper-parameters.

Dual Attention Loss. We combine the advantages of L1* (accurate attacks) and L2* (general attacks), and eventually obtain the dual attention loss:

    L = αL1* + βL2*,    (8)

where the factors α and β are determined empirically. The goal of our optimizer is to minimize the total loss L. In the implementation, we use the Adam optimizer [13] to minimize the loss by iteratively perturbing the pixels along the gradient directions within the patch area; the generation process stops when the number of iterations reaches 100 or the first candidate of the current ranking Rτ[1] falls behind rank p in the initial ranking R0. The whole attack process is presented in Algorithm 1.
Algorithm 1: One-shot White-box Attack for VOT
Input: The target crop z in the first frame of a video; the tracker with Siamese network f(·, ·)
Output: An adversarial example z*.
1  Initialize the adversary z* = z;
2  Initialize the iteration variable τ = 0;
3  Feed the clean z and the search area X containing n candidates x_i into the network to get the confidence map f(z, x_i);
4  Sort f(z, x_i) and obtain the initial rank R0[1 : n];
5  Save the candidate indexes of the original rank R0[1 : n];
6  while number of iterations τ++ < 100 do
7      Sort f(z*, x_i) and get the new rank Rτ[1 : n];
8      if the rank of the candidate Rτ[1] > p in R0 then
9          break;
10     else
11         dual attention attack;
12         z* := z*_τ;
13     end
14 end

Table 1. Comparison of results of the original trackers (Org), random noise, and our attack for different Siamese trackers on the OTB100 dataset in terms of precision and success rate.

                 Precision (%)            Success Rate (%)
Trackers         Org    Noise   Ours      Org    Noise   Ours
SiamFC           76.5   73.4    27.1      57.8   56.0    32.3
SiamRPN          87.6   83.1    27.8      66.6   63.3    20.4
SiamRPN++(R)     91.4   85.0    33.7      69.6   64.9    25.2
SiamRPN++(M)     86.4   80.7    35.3      65.8   58.0    26.1
SiamMask         83.7   83.6    65.0      64.6   62.6    48.1

4. Attack Evaluation

In this section, we describe our experimental settings and analyze the attack results of the proposed dual attention attack algorithm against different trackers on three challenging tracking datasets: OTB100 [30], LaSOT [4], and GOT10K [11]. We then evaluate the effectiveness of the proposed method through ablation studies with various contrast experiments.

4.1. Experimental Setting

Attacked Targets. We show our adversarial attack results on four representative Siamese network-based trackers, including SiamFC [2], SiamRPN [17], SiamRPN++ [16], and SiamMask [28]. Besides, our experiments employ SiamRPN++ with two different backbones, ResNet-50 and MobileNet-v2, referred to as SiamRPN++(R) and SiamRPN++(M) respectively below.

Evaluation Metrics. For fair evaluation, standard evaluation methods are applied to measure the attack effect. On the OTB100 and LaSOT datasets, we applied one-pass evaluation (OPE) with the precision plot and success plot metrics. The precision plot reflects the center location error between tracking results and ground truth; the threshold distance is set to 20 pixels. Meanwhile, the success rate measures the overlap ratio between the detected box and the ground truth, which reflects the accuracy of tracking in scale. On the GOT10K dataset, we applied the Average Overlap (AO) between tracking results and ground truths over all frames and the Success Rate (SR) with a threshold of 0.50. We view successful attacks and failed trackings as consistent: the lower the accuracy of the tracking, the higher the success rate of the attack.

Implementation Details. Our algorithm is implemented in PyTorch and runs on an NVIDIA Tesla V100 GPU. For each attacked video, we use the Adam optimizer [13] to optimize the generated adversarial perturbation, with 100 iterations and a learning rate of 0.01. Based on the different purposes of the attention modules, we use different hyper-parameter settings. Specifically, for the confidence attention module, we set a = 0.5, b = 1.5 and c = 0.2, while for the feature attention module we set a′ = 2, b′ = −1, and c′ = 20. To balance the weight parameters α and β, we set β = 1, while α is a model-sensitive parameter in the range of 0.2 to 0.8 in our experiments. In Eq. (2), the hyper-parameters p, q, and r are set to 45 (9 · 5 anchors), 90, and 135, respectively. In addition, all the results presented below are averaged over five repeated experiments under these settings.

4.2. Overall Attack Results

Results on OTB100. Table 1 compares the overall results of these trackers on the OTB100 dataset. We compare random noise with our adversarial examples applied to the target patch in the initial frame and observe that random noise has very little impact on the tracking results, whereas our adversarial attack causes almost devastating results for the tracking methods. Specifically, the precision after adding random noise to SiamFC, SiamRPN, SiamRPN++(R), and SiamRPN++(M) is reduced by 3.1%, 4.5%, 6.4%, and 5.7% respectively, while the precision after adding adversarial perturbations to the corresponding trackers is greatly reduced by 49.4%, 59.8%, 57.7%, and 51.1%, respectively.

Fig. 3 shows the success and precision plots on the OTB100 dataset, comparing the results of the original trackers with the results after our corresponding attacks. We can see that the precision and success rates of the five trackers are significantly reduced after being attacked. In the precision plots, we observe that the proposed attack method has the best and second-best attack effects on SiamRPN and SiamRPN++(R), reducing the precision by 59.8% and 57.7%, respectively. Similarly, our attack method reduces the success rates of SiamRPN and SiamRPN++(R) by 46.2% and 44.4% respectively.

Results on LaSOT. We compare our attack against these
[Figure 3: precision plots of OPE on OTB100 (location error threshold, 0–50 pixels) and success plots of OPE on OTB100 (overlap threshold, 0–1) for SiamFC, SiamRPN, SiamRPN++(R), SiamRPN++(M), and SiamMask, with and without our attack (A_ prefix).]

Table 4. Ablation comparison studies of dual attention attacks, showing the precision and success rate of SiamRPN++ (ResNet-50) on OTB100.

Attack setting       Precision (%)   Success Rate (%)
Original             91.4            69.6
Random Noise         85.0            64.9
Attack by L1         38.8            29.1
Attack by L1*        37.1            27.6
Attack by L2         38.7            27.7
Attack by L2*        34.3            25.6
Attack by L1 + L2    37.5            26.9
Attack by L1* + L2*  33.7            25.2
Figure 3. Evaluation results of trackers with or without adversarial attacks on the OTB100 dataset.

Table 2. Comparison of the results of the original trackers, random noise, and our attack for different Siamese trackers on the LaSOT dataset, in terms of precision and success rate.

                Precision (%)        Success Rate (%)
Trackers        Org    Noise  Ours   Org    Noise  Ours
SiamFC          34.4   33.7   12.0   35.2   34.7   16.7
SiamRPN         42.4   42.2   10.8   43.3   43.1   14.7
SiamRPN++(R)    50.2   49.3   12.2   49.6   48.5   14.9
SiamRPN++(M)    45.5   45.5   11.4   45.2   44.9   14.7
SiamMask        46.3   46.0   34.3   46.5   46.3   37.1

Table 3. Comparison of the results of the original trackers, random noise, and our attack for different Siamese trackers on the GOT10k dataset, in terms of AO and SR0.50.

                AO (%)               SR0.50 (%)
Trackers        Org    Noise  Ours   Org    Noise  Ours
SiamFC          53.8   50.2   34.6   57.8   54.3   28.4
SiamRPN         60.8   56.1   31.2   71.4   65.2   26.5
SiamRPN++(R)    65.1   65.0   31.2   76.7   75.7   26.5
SiamRPN++(M)    64.1   61.0   39.4   75.0   70.2   34.7
SiamMask        64.4   64.1   55.6   76.5   75.9   64.1

trackers on the LaSOT dataset [4]. Table 2 shows that the trackers perform poorly overall after our attacks: the precision of the five trackers declines significantly, falling to 34.9%, 25.5%, 24.3%, 25.1%, and 74.1% of the original results for SiamFC, SiamRPN, SiamRPN++(R), SiamRPN++(M), and SiamMask, respectively.

Results on GOT10K. We also launch our attack against these five trackers on the large tracking dataset GOT10K [14]. Table 3 shows significant declines in the overall results of these trackers after the attacks: their AO drops to 64.3%, 51.3%, 61.5%, 47.9%, and 86.3% of the original values, respectively.

Analysis. From the attack results on these trackers across the various datasets, we find an interesting phenomenon: the simplest tracker, SiamFC, shows good robustness on both OTB100 and LaSOT, which we believe is due to under-fitting of the algorithm. More specifically, to some extent, SiamFC can be considered a SiamRPN with only one anchor. Generally, having too few anchors makes SiamFC unable to estimate the target accurately; at the same time, it reduces the risk of being attacked by adversarial samples. Moreover, our attack method has the strongest effect on SiamRPN, reducing its precision on OTB100 by 59.8% and its success rate by 46.2%. This can be attributed to the excessive head parameters that make SiamRPN difficult to train fully. To a certain extent, this problem is solved in SiamRPN++ with the help of multi-stage learning and more efficient cross-correlation; as we can see, SiamRPN++ is more robust and more difficult to attack. Finally, our attack has the weakest effect on SiamMask compared with the other trackers: the attack reduces its precision and success rate on OTB100 by only 18.7% and 16.5%, respectively. This can be attributed to the multi-task learning of SiamMask. Compared with SiamRPN and SiamRPN++, SiamMask adds a semantic segmentation branch and attends to the tracked object at the pixel level, which makes the learned features more robust.

4.3. Ablation Study of Dual Attention Attack

We implement a series of experiments to analyze and evaluate the contribution of each component of our dual attention attacks. We choose the current state-of-the-art tracker SiamRPN++(R) as the representative, and the tracking results on OTB100 are shown in Table 4.

Intuitively, we observe that random noise has very little impact on the tracking results, whereas our adversarial attacks cause a significant drop in tracking accuracy. Moreover, using the loss L1 and the loss L2 separately in our experiments also greatly reduces the tracking accuracy, and the two losses damage tracking to a similar degree; this is thanks to our selection strategy for the candidates of L1 and the global feature perturbation mechanism of L2. Second, we test the effectiveness of the distance-oriented confidence attention mechanism in the L1 component, namely L1*. Specifically, L1* further reduces the tracking accuracy by 1.7% and 1.5% on the precision and success rate metrics, respectively, relative to L1. At the same time, we validate the contribution of the activation-oriented feature attention mechanism in the L2 component, namely L2*, which further reduces the tracking performance by 4.4% and 2.1% for precision and success rate, respectively.
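As a sanity check, the relative figures quoted in the text can be recomputed directly from the values in Tables 2 and 4. The following is a small illustrative Python script with the table values transcribed by hand:

```python
# LaSOT precision (Org, Ours) per tracker, transcribed from Table 2.
lasot_precision = {
    "SiamFC":       (34.4, 12.0),
    "SiamRPN":      (42.4, 10.8),
    "SiamRPN++(R)": (50.2, 12.2),
    "SiamRPN++(M)": (45.5, 11.4),
    "SiamMask":     (46.3, 34.3),
}

def remaining_ratio(org, ours):
    """Attacked score expressed as a percentage of the original score."""
    return round(100.0 * ours / org, 1)

# e.g. SiamFC keeps only 34.9% of its original precision after the attack.
ratios = {name: remaining_ratio(*v) for name, v in lasot_precision.items()}

# Ablation deltas from Table 4 (SiamRPN++(R) on OTB100): the extra drop,
# in absolute points, contributed by each attention mechanism.
def gain(base, attended):
    return round(base[0] - attended[0], 1), round(base[1] - attended[1], 1)

confidence_gain = gain((38.8, 29.1), (37.1, 27.6))  # L1  -> L1*
feature_gain = gain((38.7, 27.7), (34.3, 25.6))     # L2  -> L2*
```

The remaining-precision ratios reproduce the 34.9%/25.5%/24.3%/25.1%/74.1% figures quoted for LaSOT, and the ablation deltas reproduce the 1.7/1.5-point and 4.4/2.1-point gains attributed to the two attention mechanisms.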

Figure 4. Qualitative evaluation of the one-shot adversarial attack on various trackers (Ground Truth, SiamFC, SiamRPN, SiamRPN++ (ResNet-50), SiamRPN++ (Mobile-v2), SiamMask) on the video examples Human7 and Human2 from the OTB100 dataset. For each of the two subfigures, the first column shows the adversarial example generated in the initial frame, except for the clean example in the first row. The green, blue, and red rectangles represent the bounding boxes of the ground-truth, the tracking results before attack, and the tracking results after attack, respectively.

Moreover, through the experimental analysis, we can see that the feature attention mechanism brings more gain than the confidence attention mechanism. The potential reason is that the attention mechanism of L1* narrows the candidates to a more appropriate range, so that all of the selected boxes contribute to the attack. In addition, the feature attention mechanism forces the algorithm to mine the channels that contribute most to the attack in the huge feature space, which effectively reduces the scope the L2 attack must cover. Besides, the attacking strategy that uses the two basic components L1* and L2* simultaneously improves on each of them individually. Finally, the dual attention attack method obtains the best attack result by employing both attention mechanisms simultaneously.

4.4. Qualitative Evaluation

Fig. 4 shows examples of adversarial attacks against various trackers. The initial frame perturbation for the five trackers is so subtle that it is difficult to observe with the human eye. Generally, adding adversarial attacks results in a large deviation of the tracking results. Among them, the attacks on SiamFC and SiamRPN are stronger when the target scale changes greatly. In contrast, the impact on the results of SiamRPN++ is not obvious, which is partly attributed to its robust feature extraction using deeper models.

5. Conclusion

In this work, we highlight adversarial perturbations against VOT to circumvent potential risks to surveillance systems. We focus on adversarial attacks for model-free single object tracking, and our attack targets a series of excellent trackers based on Siamese networks. We present a one-shot attack method that only slightly perturbs the pixel values of the initial frame of a video, resulting in tracking failure in subsequent frames. Experimental results prove that our approach can successfully attack advanced Siamese network-based trackers. We hope that more researchers will pay attention to adversarial attack and defense for tracking algorithms in the future.

Acknowledgement This work is supported in part by the National Natural Science Foundation of China under Grant 61972188, 61771273, the National Key Research and Development Program of China under Grant 2018YFB1800204, the R&D Program of Shenzhen under Grant JCYJ201805-08152204044, the Science and Technology Planning Project of Shenzhen (No. JCYJ20180503182133411), and the research fund of PCL Future Regional Network Facilities for Large-scale Experiments and Applications (PCL2018KP001).

References

[1] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.

[2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision Workshops (ECCV Workshops), pages 850–865, 2016.

[3] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Symposium on Security and Privacy (SP), pages 39–57, 2017.

[4] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5374–5383, 2019.

[5] Heng Fan and Haibin Ling. Sanet: Structure-aware network for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pages 42–49, 2017.

[6] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.

[7] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[8] Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, and Song Wang. Learning dynamic siamese network for visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1763–1771, 2017.

[9] Qing Guo, Xiaofei Xie, Lei Ma, Zhongguo Li, Wanli Xue, and Wei Feng. Spark: Spatial-aware online incremental attack against visual tracking. arXiv preprint arXiv:1910.0868, 2019.

[10] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. A twofold siamese network for real-time object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4834–4843, 2018.

[11] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981, 2018.

[12] Nathan Inkawhich, Wei Wen, Hai Helen Li, and Yiran Chen. Feature space perturbations yield more transferable adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7066–7074, 2019.

[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Cehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, et al. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision Workshops (ECCV Workshops), pages 0–0, 2018.

[15] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[16] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4282–4291, 2019.

[17] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8971–8980, 2018.

[18] Peixia Li, Dong Wang, Lijun Wang, and Huchuan Lu. Deep visual tracking: Review and experimental comparison. Pattern Recognition (PR), 76:323–338, 2018.

[19] Xiankai Lu, Chao Ma, Bingbing Ni, Xiaokang Yang, Ian Reid, and Ming-Hsuan Yang. Deep regression tracking with shrinkage loss. In Proceedings of the European Conference on Computer Vision (ECCV), pages 353–369, 2018.

[20] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2574–2582, 2016.

[21] Alexander Neubeck and Luc Van Gool. Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR'06), volume 3, pages 850–855. IEEE, 2006.

[22] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security & Privacy, 2016.

[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015.

[24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[25] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1420–1429, 2016.

[26] Simen Thys, Wiebe Van Ranst, and Toon Goedemé. Fooling automated surveillance cameras: adversarial patches to attack person detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2019.

[27] Qiang Wang, Zhu Teng, Junliang Xing, Jin Gao, Weiming Hu, and Stephen Maybank. Learning attentions: residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4854–4863, 2018.
[28] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1328–1338, 2019.

[29] Rey Reza Wiyatno and Anqi Xu. Physical adversarial textures that fool visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.

[30] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1834–1848, 2015.

[31] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1369–1378, 2017.

[32] Yunhua Zhang, Lijun Wang, Jinqing Qi, Dong Wang, Mengyang Feng, and Huchuan Lu. Structured siamese network for real-time visual tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 351–366, 2018.
