0% found this document useful (0 votes)
5 views

Scale-aware Automatic Augmentation for Object Detection

Uploaded by

Le Truc Quynh
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

Scale-aware Automatic Augmentation for Object Detection

Uploaded by

Le Truc Quynh
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Scale-aware Automatic Augmentation for Object Detection

Yukang Chen1 * † , Yanwei Li1† , Tao Kong2 , Lu Qi1 , Ruihang Chu1∗ , Lei Li2 , Jiaya Jia1.3
1 2 3
The Chinese University of Hong Kong ByteDance AI Lab SmartMore

Abstract Baseline
arXiv:2103.17220v1 [cs.CV] 31 Mar 2021

45 Scale-aware AutoAug
44
InstaBoost
We propose Scale-aware AutoAug to learn data augmen- 43
42 Stitcher GridMask
tation policies for object detection. We define a new scale- FCOS Mixup RandAug
41

AP (%)
aware search space, where both image- and box-level aug-
40 AutoAug-det ResNet-101 PSIS
mentations are designed for maintaining scale invariance. Mask R-CNN
39 Faster R-CNN ResNet-101
Dropblock ResNet-101
Upon this search space, we propose a new search met- 38 RetinaNet
ResNet-101
ric, termed Pareto Scale Balance, to facilitate search with 37
high efficiency. In experiments, Scale-aware AutoAug yields RetinaNet
36 ResNet-50
significant and consistent improvement on various object 53 59 64 67 74
detectors (e.g., RetinaNet, Faster R-CNN, Mask R-CNN, Inference time (ms/image)
and FCOS), even compared with strong multi-scale training Figure 1: Comparison with object detection augmentation strate-
baselines. Our searched augmentation policies are trans- gies on MS COCO dataset. Methods in the same vertical line are
ferable to other datasets and box-level tasks beyond object based upon the same detector. Scale-aware AutoAug outperforms
detection (e.g., instance segmentation and keypoint estima- both hand-crafted and learned strategies on various detectors.
tion) to improve performance. The search cost is much less
than previous automated augmentation approaches for ob-
grouped into color operations (e.g., brightness, contrast, and
ject detection. It is notable that our searched policies have
whitening) and geometric operations (e.g., re-scaling, flip-
meaningful patterns, which intuitively provide valuable in-
ping). Among them, geometric operations, such as multi-
sight for human data augmentation design. Code and mod-
scale training, improve scale robustness [39, 19]. Sev-
els will be available at https://github.com/Jia-Research-
eral hand-crafted data augmentation strategies were devel-
Lab/SA-AutoAug.
oped to improve performance and robustness of the detec-
tor [41, 42]. Previous work [17, 15] also improves box-level
augmentation by enriching foreground data. Though inspir-
1. Introduction ing performance gain has achieved, these data augmentation
Object detection, aiming to locate as well as classify var- strategies usually rely on heavy expert experience.
ious objects, is one of the core tasks in the computer vision. Automatic data augmentation policies were widely ex-
Due to the large scale variance of objects in real-world sce- plored in image classification [44, 50, 37, 35, 9]. Its poten-
narios, it raises concerns on how to bring the scale adap- tial for object detection, however, was not thoroughly re-
tation to the network efficiently. Previous work handles leased. One attempt to automatically learn data augmen-
this challenge mainly from two aspects, namely network ar- tation policies for object detectors is AutoAug-det [51]1 ,
chitecture and data augmentation. To make the network which performs color or geometric augmentation upon the
scale invariant, in-network feature pyramids [28, 47, 23] context of boxes. It does not fully consider the scale issue
and adaptive receptive fields [25] are usually employed. from image- and box-level, which are found, however, es-
Another crucial technique to enable scale invariance is data sential in object detector design [41, 42, 17]. Moreover, the
augmentation, which is independent of specific architec- heavy computational search cost (i.e., 400 TPU for 2 days)
tures, and can be generalized among multiple tasks. impedes it from vastly practical. Thus, scale-aware property
This paper focuses on data augmentation for object and efficiency issue are essential to address for searching
detection. Current data augmentation strategies can be augmentation in box-level tasks.
In this paper, we propose a new way to automatically
* This work was done during an internship at ByteDance AI Lab. Tao
Kong is responsible for correspondence. † Equal contribution. 1 We refer it as AutoAug-det [51] to distinguish from AutoAugment [9].
learn scale-aware data augmentation strategies for object hyper-parameter tuning and are usually task-specific [31].
detection and relevant box-level tasks. We first introduce Some commonly used strategies on image classification
scale-awareness to the search space from two image- and include random cropping, image mirroring, color shift-
box-levels. For image-level augmentations, zoom-in and ing/whitening [24], Cutout [12], and Mixup [49].
-out operations are included with their probabilities and Scale-wise augmentations also play a vital role for
zooming ratios for search. For box-level augmentations, the optimization of object detectors [46, 5]. For exam-
the augmenting areas are generalized with a new searchable ple, SNIPER [42] generates image crops around ground
parameter, i.e., area ratio. This makes box-level augmenta- truth instances with multi-scale training. YOLO-v4 [2]
tions adaptive to object scales. and Stitcher [8] introduce mosaic inputs that contain re-
Based on our scale-aware search space, we further pro- scaled sub-images. For box-level augmentation, Dwibedi
pose a new estimation metric to facilitate the search process et al. [15] improve detection performance with the cut-and-
with better efficiency. Previously, each candidate policy is paste strategy. And the visual context surrounding objects
estimated by the validation accuracy on a proxy task [9, 27], are modeled in [14]. Furthermore, InstaBoost [17] aug-
which lacks efficiency and accuracy to an extend. Our met- ments training images using annotated instance masks with
ric takes advantage of more specific statistics, that is, vali- a location probability map. However, these hand-crafted
dation accuracy and accumulated loss over different scales, designs still highly rely on expert efforts.
to measure the scale balance. We empirically show that it
Inspired by recent advancements in neural architecture
yields a clearly higher correlation coefficient with the actual
search (NAS) [52, 53, 38, 7], researchers try to learn aug-
accuracy than the previous proxy accuracy metric.
mentation policies from data automatically. An example
The proposed approach is distinguished from previous
is AutoAugment [9], which searches data augmentations
work from two aspects. First, different from hand-crafted
for image classification and achieves promising results.
policies, the proposed method utilizes automatic algorithms
PBA [22] uses population-based search method for better
to search among a large variety of augmentation candidates.
efficiency. Fast AutoAugment [27] applies Bayesian opti-
It is hard to be fully explored or achieved by human effort.
mization to learn data augmentation policies. RandAug [10]
Moreover, compared with previous learning-based meth-
removes the search process at the price of manually tailor-
ods, our approach fully explores the important scale issue in
ing the search space to a very limited volume. AutoAug-
both image-level and box-level. With the proposed search
Det [51] extends AutoAugment [9] to object detection by
space and evaluation metric, our method attains decent per-
taking box-level augmentations into consideration.
formance with much (i.e., 40×) less search cost.
The overall approach, called Scale-aware AutoAug, can
be easily instantiated for box-level tasks, which will be elab-
3. Scale-aware AutoAug
orated on in Sec. 3. To validate its effectiveness, we con-
duct extensive experiments on MS COCO and Pascal VOC In this section, we first briefly review the auto augmenta-
dataset [30, 16] with several anchor-based and anchor-free tion pipeline. Then, the scale-aware search space and esti-
object detectors, which are reported in Sec. 4.2. mation metric will be respectively elaborated in Sec. 3.2 and
In particular, with ResNet-50 backbone, the searched Sec. 3.3. We finally show the search framework in Sec. 3.4.
augmentation policies contribute non-trivial gains over the
strong MS baseline of RetinaNet [29], Faster R-CNN [39],
and FCOS [43], and achieve 41.3% AP, 41.8% AP, and 3.1. Review of AutoAug
42.6% AP, respectively. We further experiment with more
box-level tasks, like instance segmentation and keypoint de- Auto augmentation methods [9, 51, 22, 27, 26] com-
tection. Without bells-and-whistles, our improved FCOS monly formulate the process of finding the best augmenta-
model attains 51.4% AP with the search augmentation poli- tion policy as a search problem. To this end, three main
cies. Besides, our searched policies present meaningful pat- components are needed, namely the search space, search
terns, which provide intuitive insight for human knowledge. algorithm, and estimation metric. Search space may vary
according to tasks. For example, the search space [9, 22, 27]
2. Related Work is developed to image classification, while it is not the spec-
ified case for box-level tasks. As for search algorithms, re-
Data augmentation has been widely utilized for net- inforcement learning [52] and evolutionary algorithm [38]
work optimization and proven to be beneficial in vision are usually utilized to explore the search space in iterations.
tasks [11, 40, 39, 32, 33, 36]. Traditional approaches could During this procedure, each child model, which is opti-
be roughly divided into color operations (e.g., brightness, mized with the searched policy p, is evaluated on a designed
contrast, and whitening) and geometric operations (e.g., metric to estimate its effectiveness. This metric serves as
scaling, flipping, translation, and shearing), which require feedback for the search algorithm.
Figure 2: Scale-aware search space. It contains image-level and box-level augmentation. Image-level augmentation includes zoom-in
and zoom-out functions with probabilities and magnitudes for search. In box-level, we introduce scale-aware area ratios, which make
operations adaptive to objects in different scales. Augmented images are further generalized with the Gaussian map.

3.2. Scale-aware Search Space large-scale images would introduce an additional computa-
tional burden. To avoid this issue, we reserve the original
The designed scale-aware search space contains both
shape in the zoom-in function with random cropping.
image-level and box-level augmentations. The image-level
augmentations include zoom-in and zoom-out functions on After the search procedure, input images are randomly
the whole image. As for box-level augmentations, color and sampled from zoom-in, zoom-out, and original scale im-
geometric operations are searched for objects in images. ages with the searched P and M in each training iteration.
In other words, we sample from 3 resolutions, a larger one,
Image-level augmentations. To handle scale variations,
a small one and the original with the searched probabilities,
object detectors are commonly trained with image pyra-
i.e., {Pin , Pout , Pori }. To our best knowledge, no previ-
mids. However, these scale settings highly rely on hand-
ous work considers automatic scale-aware transformation
crafted selection. In our search space, we alleviate this bur-
search for object detection. Experiments validate the supe-
den by searchable zoom-in and zoom-out functions. As il-
riority over traditional multi-scale training in Tab. 2.
lustrated in the left part of Fig. 2, zoom-in and zoom-out
functions are specified by probabilities P and magnitudes Box-level augmentations. The box-level augmentations
M . Specifically, the probabilities Pin and Pout are searched are designed to conduct transformation for each object box.
in the range from 0 to 0.5. With this range, the existence of Different from [51], the proposed approach further smooths
original scale could be guaranteed with the probability the augmentations and relaxes it to contain learnable fac-
Pori = 1 − Pout − Pin . (1) tors, i.e., area ratio. In particular, previous box-level aug-
mentation [51] works exactly in the whole bounding box
The magnitude M represents the zooming ratio for each annotations without attenuation, which generate an obvi-
function. For the zoom-in function, we search a zooming ous boundary gap between the augmented and original re-
ratio from 0.5 to 1.0. For the zoom-out function, we search gion. The sudden appearance change could reduce the dif-
a zooming ratio from 1.0 to 1.5. For example, if a zooming ficulty for networks to locate the augmented objects, which
ratio of 1.5 is selected, it means that the input images might brings the gap between training and inference. To solve this
be increased by 1.5×. In traditional multi-scale training, issue, we extend the original rectangle augmentation to a
Table 1: Analysis on the context for scales. On well-trained
ResNet-101 detectors, APs drops and APl increases consistently
if contexts are removed in validation images.

with context AP APs APm APl


3 41.4 25.2 44.8 53.0
Faster R-CNN 7 40.5 18.0 45.7 56.1
∆ -0.9 -7.2 +0.9 +3.1
3 40.3 23.3 44.0 53.3
RetinaNet 7 39.8 16.7 44.4 57.7
(a) Comparison between square and gaussian transform.
∆ -0.5 -6.6 +0.4 +4.4

and bounding box annotations, the box (xc , yc , h, w)


could be represented with the central point (xc , yc ) and the
height/width h/w. We formulate the Gaussian map by

(x − xc )2 (y − yc )2
  
α(x, y) = exp − + . (3)
2σx2 2σy2
(b) Gaussian-based transform process.
Then, we define the augmentation area V as the integra-
Figure 3: An example of Gaussian-based box-level augmenta- tion of the Gaussian map, where
tion. It removes the original hard boundary and the augmented Z H Z W
areas are adjustable to the Gaussian variance. V = α(x, y) dxdy. (4)
0 0

gaussian-based manner. A visualization example of a box- The area ratio for the box-level augmentation is denoted
level color brightness operation is given in Fig. 3(a). We as r. Here, r(sbox ) = V /sbox is searchable for different
blend the original and the transformed pixels with a spatial- scales, which determines the spatially augmentation area for
wise Gaussian map α(x, y) by each object. Thus, the standard deviation factors, σx and σy ,
could be calculated as in Eq. (5). We provide the detailed
A = α(x, y) · I + (1 − α(x, y)) · T, (2) calculation process in the supplementary materials.
r r
where I, T , and A denotes the input, transformation func- W/H H/W
σx = h r, σy = w r. (5)
tion, and augmented region, respectively. This process is 2π 2π
depicted in Fig. 3(b). Actually, the designed gaussian-based
process softens the boundary gap in a more natural manner. Search space summary. Our search space contains both
The second issue in previous operations is the lack of image-level and box-level augmentations. For the image-
considering receptive fields and object scales. A common level augmentation, we search for the parameters of zoom-
belief is that neural networks largely rely on the context in and zoom-out operations. To keep consistent with
information to recognize objects [4, 34]. Experimentally, the convention [51], our box-level operations have 5 sub-
we find that it may not be correct for objects in all scales, policies, where sub-policy consists of a color operation as
while the effect varies with objects scales. This is demon- well as a geometric operation. Each operation contains two
strated with widely applied two-stage and one-stage detec- parameters, namely the probability and magnitude. The
tors, namely the Faster R-CNN [39] and RetinaNet [29]. As probability is sampled from a set of 6 discrete values, from
presented in Tab. 1, if we test it on the COCO validation set 0 to 1.0 with 0.2 as the interval. The magnitude repre-
with all context (background) pixels removed, its accuracy sents the strength factor for each operation with custom
on small objects, APs , dramatically declines from 25.2% to range values. We map the magnitude range to a standard-
18.0%. In contrast, APl increases from 53.0% to 56.1%. ized set of 6 discrete values, from 0 to 10 with 2 as the
It is consistent with that in RetinaNet [29]. This motivates interval. For box-level operations, there are 3 area ratios
us that augmentations merely inside/outside object boxes to search for small, middle and large scales. Each area ra-
may not deal with objects in all scales appropriately. To tio is independently searched in a discrete set of 10 values.
this end, we introduce a searchable parameter, area ratio, We list the details of these operations in the supplemen-
which makes the aug area adaptive to object sizes. tary materials. In summary, the total search space provides
Here, we generalize the Gaussian map with the param- (62 )2 × (((6 × 62 ) × (8 × 62 ))5 × 103 ) = 1.230 candidate
eter of area ratio. Given an image with size H × W policies, which is twice large as [51].
3.3. Scale-aware Estimation Metric Algorithm 1: Search Framework
Autoaugment methods commonly employ validation ac- Input : Plain model m, Initialized Population P,
curacies on a proxy task (a small subset with training im- Training set T, Val set V, Iterations #I.
ages) as the search metric. However, such a manner is found 1 f ∗ , p∗ ← (∞, ∅)
to be inaccurate and computationally expensive [10], which 2 for k ∈ (1 to #I) do
3 for p ∈ P do
will be further demonstrated in Fig. 4. In contrast, all oper-
4 mp , {Lpi∈S } ← finetune(m,T,θp )
ations in our search space, both image-level and box-level, p
5 {APi∈S } ← evaluate(mp , V)
have explicit relationships with each scale. Thanks for the 6 fp ← f ({Lpi∈S }, {APi∈S p
}) -- Eq.(7)
convenience, a scale-aware metric can be further proposed 7

if fp < f then
to capture more specific statistics of different scales. Specif- 8 f ∗ , p∗ ← (fp , p)
ically, the evaluation metric is established based on an ob-
servation that balanced optimization over different scales 9 P ← select-topk(P)
would be beneficial to training. Thus, a scale-aware metric 10 P ← mutate-crossover(P)
can be formulated with the record of the accumulated loss Output: The best augmentation policy p∗
and accuracy on different scales during fine-tuning.
Given a plain model trained without data augmentation, Table 2: Improvement details on RetinaNet ResNet-50.
we record its validation accuracy AP and accuracies on each
AP AP50 AP75 APs APm APl
scale APi with i ∈ S. For each candidate policy p, a child MS Baseline 38.2 57.3 40.5 23.0 41.6 50.3
model is further fine-tuned upon it. We record the accumu- Ours image-level 40.1 59.8 43.3 24.0 44.1 53.1
lated loss Lpi , the validation accuracies on each scale APip , + box-level 40.6 60.4 43.6 24.1 44.4 53.5
and the overall AP to formulate the objective function as + scale-aware area 41.3 61.0 44.1 25.2 44.5 54.6

min f ({Lpi∈S }, {APi∈S


p
}). (6) Table 3: Comparison with AutoAug-det on RetinaNet ResNet-50.
p
search cost AP APs APm APl
Balanced optimization over different scales is essential AutoAug-det [51] 800 TPU-days 36.7→39.0 - - -
to the performance and robustness of object detectors. An AutoAug-det† [51]2 800 TPU-days 38.2→40.3 23.6 43.9 53.8
intuitive way to measure the balance is the standard devia- Ours 20 GPU-days 38.2→41.3 25.2 44.5 54.6

tion σ({Lpi |i ∈ S}) of losses on various scales. However, Table 4: Search on RetinaNet ResNet-50 with different metrics.
we find it sometimes delves into a sub-optimal where other
scales are sacrificed to achieve the optimization balance. AP AP50 AP75 APs APm APl
proxy accuracy in [51] 40.0 59.7 42.5 23.9 44.1 52.6
Here, we adopt the principle of Pareto Optimality [1] to scale loss std σ 40.7 60.5 43.5 24.1 44.5 53.5
overcome this obstacle. In particular, we introduce a con- our metric - Eq. (7) 41.3 61.0 44.1 25.2 44.5 54.6
cept, named Pareto Scale Balance, to describe our objective:
the optimization over scales can not be better without hurt-
ing the accuracy of any other scale. To this end, we intro-
3.4. Search Framework
duce a penalization factor Φ to punish the scales Sb where Given the above search space and search metric, we de-
accuracy drops after fine-tuning with the policy p. There- scribe the search framework in this section. In this work,
fore, the metric function can be upgraded to the evolutionary algorithm, e.g., tournament selection algo-
rithm [38], is adopted as the search controller. Specifically,
f ({Lpi∈S }, {APi∈S
p
}) = σ({Lpi∈S }) · Φ({APi∈ p
b }), (7)
S a population of |P | policies are sampled from the search
p Q APi APi space in each iteration. After evaluating the sampled poli-
where Φ({APi∈ b }) =
S b AP p and AP p is the scale-
i∈S i i cies, we select the top k policies as parent for the next gen-
wise ratio of the original and the fine-tuned accuracy.
eration. Then, child policies are produced by mutation and
Compared with previous proxy-accuracy metrics, ours
crossover among the parent policies. This process is re-
is superior in computational efficiency and estimation ac-
peated for iterations until convergence.
curacy. Towards efficient computation, child models are
To evaluate augmentation policies, we first train a plain
fine-tuned upon the given plain model, instead of training
model with no data augmentation. Then, we fine-tune it
from scratch. We record the changes that resulted from the
upon each augmentation policy for n iterations and record
augmented fine-tuning to compute our metric. For accurate
the accumulated loss during optimization. We also record
estimation, more specific statistics, that is, AP and loss over
its accuracy before and after fine-tuning. With these statis-
various scales, is reasonable to receive a higher coefficient
tics, the search metric for each policy could be obtained.
with the actual performance. Experimentally, we carefully
This search framework is illustrated in Alg. 1.
compare the proposed search metric with the original accu-
racy metric to verify the effectiveness in Sec. 4.2. 2† means our implementation with the same baseline settings to ours.
Table 5: Improvements across detection frameworks. for 270k iterations. Multi-scale training baselines are en-
Models policy AP AP50 AP75 APs APm APl hanced by randomly selecting a scale between 640 to 800
RetinaNet: during training. We train models on 8 GPUs with a total 16
Baseline 36.6 55.7 39.1 20.8 40.2 49.4 images per batch. The learning rate is initialized as 0.02.
ResNet-503 MS Baseline 38.2 57.3 40.5 23.0 41.6 50.3 We set weight decay as 0.0001 and momentum as 0.9.
Ours 41.3 61.0 44.1 25.2 44.5 54.6
Baseline 38.8 59.1 42.3 21.8 42.7 50.2
ResNet-101 MS Baseline 40.3 59.8 42.9 23.2 44.0 53.2 4.2. Verification
Ours 43.1 62.8 46.0 26.2 46.8 56.7
Faster R-CNN:
In this section, we systematically evaluate our proposed
Baseline 37.6 57.8 41.0 22.2 39.9 48.4 Scale-aware AutoAug. We first present the improvements
ResNet-50 MS Baseline 39.1 60.8 42.6 24.1 42.3 50.3 from the search policy on the target task and then show its
Ours 41.8 63.3 45.7 26.2 44.7 54.1 transferability to other tasks and datasets. After that, we
Baseline 39.8 61.3 43.5 23.1 43.2 52.3
ResNet-101 MS Baseline 41.4 60.4 44.8 25.0 45.5 53.1 analyze the proposed search metric in detail.
Ours 44.2 65.6 48.6 29.4 47.9 56.7
Improvements analysis. The top block in Tab. 5 shows the
FCOS:
MS Baseline 40.8 59.6 43.9 26.2 44.9 51.9
improvements from our searched augmentation policy on
ResNet-50 RetinaNet. On ResNet-50 backbone, our searched augmen-
Ours 42.6 61.2 46.0 28.2 46.4 54.3
ResNet-101
MS Baseline 41.8 60.3 45.3 25.6 47.7 56.1 tation policy enhances the competitive multi-scale training
Ours 44.0 62.7 47.3 28.2 47.8 56.1 baseline to 41.3% AP by 3.1%. On ResNet-101, it achieves
Table 6: Improvements across tasks on Mask R-CNN. a 2.8% gains to 43.1% AP. We also perform experiments
upon the large scale jittering [13] in the supplementary ma-
Models policy APm/k APm/k
50 APm/k
75 APb APb50 APb75 terials. These improvements come from training data aug-
Instance Segmentation: mentations and introduce no additional cost to inference.
MS Baseline 36.4 58.8 38.7 40.4 61.9 44.0
ResNet-50
Ours 38.1 60.9 40.8 42.8 64.4 46.9
For a better understanding of the improvements, we
MS Baseline 37.9 60.4 40.4 42.3 63.8 46.6 show the component-wise improvements in Tab. 2. The
ResNet-101
Ours 40.0 63.2 42.9 45.3 66.4 49.8 image-level augmentations boost the performance by 1.9%
Keypoint Estimation: AP from 38.2% to 40.1%. Upon this, the non-scale-
MS Baseline 64.1 85.9 69.7 53.5 82.7 58.4
ResNet-50 aware box-level augmentations improve the performance to
Ours 65.7 86.6 71.7 55.5 84.2 60.9
ResNet-101
MS Baseline 65.1 86.5 71.2 54.8 83.2 60.0 40.6%. If it is further upgraded to be scale-aware, the per-
Ours 66.4 87.5 72.7 56.5 84.6 62.1
formance gets an additional 0.7% enhancement to 41.3%.
In contrast, in AutoAug-det [51], the box-level augmenta-
4. Experiments tions yield only 0.4% improvements. The improvements
mostly come from small and large objects, which verifies
4.1. Implementation Details the effectiveness of scale-aware box-level augmentations.
In addition, we compare with the previous state-of-the-
Policy search. In the search phase, we adopt RetinaNet [29]
art auto augmentation method in Tab. 3. On RetinaNet [29]
on ResNet-50 [21] backbone. We split the detection dataset
with ResNet-50 [21] backbone, AutoAug-det [51] im-
into a training set for child model training, a validation set
proves the baseline from 36.7% to 39.0% by 2.3% AP. For
for evaluation during search, and the test set val2017 for
better comparison, we implement the searched policy in
final evaluation. The validation set contains 5k images ran-
AutoAug-det [51] on our baseline. It is trained on the exact
domly sampled from the train2017 in MS COCO [30]
same settings, except for the data augmentation policy. It
and the remains are for child model training. Each child
improves the baseline from 38.2% to 40.3% by 2.1% AP.
model is fine-tuned for 1k iterations on the plain model,
It is inferior to our +3.1% improvement. For small objects,
which is just an arbitrary partially trained baseline model.
our searched policy gets a more balanced performance (i.e.,
In the evolutionary search, the evolution process is repeated
+1.6 APs ) thanks to the scale-aware search space and met-
for 10 iterations. The evolution population size is 50 and
ric. In terms of search cost, the data augmentation policy
the top 10 models are selected as subsequent parents.
in AutoAug-det [51] costs 800 TPU-days (400 TPUs on 2
Final policy evaluation. Models are trained with the days) for search, while our search costs only 8 GPUs (Tesla-
searched augmentation policy in the typical pre-training and V100) on 2.5 days. It is a 40× computational saving, with-
fine-tuning schedule on MS COCO dataset. The training out considering the machine type difference.
images are resized such that the shorter size is 800 pixels.
Transferability. Although our data augmentation policy is
Faster R-CNN and RetinaNet models are trained for 540k it-
searched in object detection on RetinaNet, we make com-
erations to fully show its potential, while others are trained
prehensive experiments to show its effectiveness to work on
3 FPN [28] is used as a default setting, unless -C4 is denoted. other object detectors, datasets and relevant tasks.
Table 7: Improvements on PASCAL VOC with Faster R-CNN on ResNet-50 backbone.
mAP plane bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
baseline 78.6 80.9 80.8 79.3 72.3 67.2 87.4 88.5 88.6 62.6 86.0 71.2 88.0 88.9 80.6 79.9 52.6 78.7 74.0 86.2 78.3
+ ours 81.6 88.7 88.2 80.1 74.1 73.6 88.3 89.1 88.9 68.1 87.2 73.8 88.4 88.9 87.5 87.1 56.2 79.0 79.7 87.2 78.6

Table 8: Comparison with state-of-the-art data augmentation methods for object detection.

Method Detector Backbone AP AP50 AP75 APs APm APl


Hand-crafted:
Dropblock [18] RetinaNet ResNet-50 38.4 56.4 41.2 - - -
Mix-up [49] Faster R-CNN ResNet-101 41.1 - - - - -
PSIS∗ [45] Faster R-CNN ResNet-101 40.2 61.1 44.2 22.3 45.7 51.6
Stitcher [8] Faster R-CNN ResNet-101 42.1 - - 26.9 45.5 54.1
GridMask [6] Faster R-CNN ResNeXt-101 42.6 65.0 46.5 - - -
InstaBoost∗ [17] Mask R-CNN ResNet-101 43.0 64.3 47.2 24.8 45.9 54.6
SNIP (MS test)∗ [41] Faster R-CNN ResNet-101-DCN-C4 44.4 66.2 49.9 27.3 47.4 56.9
SNIPER (MS test)∗ [42] Faster R-CNN ResNet-101-DCN-C4 46.1 67.0 51.6 29.6 48.9 58.1
Automatic:
AutoAug-det [51] RetinaNet ResNet-50 39.0 - - - - -
AutoAug-det [51] RetinaNet ResNet-101 40.4 - - - - -
AutoAug-det† [51] RetinaNet ResNet-50 40.3 60.0 43.0 23.6 43.9 53.8
AutoAug-det† [51] RetinaNet ResNet-101 41.8 61.5 44.8 24.4 45.9 55.9
RandAug [10] RetinaNet ResNet-101 40.1 - - - - -
RandAug† [10] RetinaNet ResNet-101 41.4 61.4 44.5 25.0 45.4 54.2
Ours:
Scale-aware AutoAug RetinaNet ResNet-50 41.3 61.0 44.1 25.2 44.5 54.6
Scale-aware AutoAug RetinaNet ResNet-101 43.1 62.8 46.0 26.2 46.8 56.7
Scale-aware AutoAug Faster R-CNN ResNet-101 44.2 65.6 48.6 29.4 47.9 56.7
Scale-aware AutoAug (MS test) Faster R-CNN ResNet-101-DCN-C4 47.0 68.6 52.1 32.3 49.3 60.4
Scale-aware AutoAug FCOS ResNet-101 44.0 62.7 47.3 28.2 47.8 56.1
Scale-aware AutoAug FCOS‡ ResNeXt-32x8d-101-DCN 48.5 67.2 52.8 31.5 51.9 63.0
Scale-aware AutoAug (1200 size) FCOS‡ ResNeXt-32x8d-101-DCN 49.6 68.5 54.1 35.7 52.5 62.4
Scale-aware AutoAug (MS test) FCOS‡ ResNeXt-32x8d-101-DCN 51.4 69.6 57.0 37.4 54.2 65.1

Table 9: Searched augmentation policy.


In object detection, we verify our policy on mainframe
Image-level (Zoom-in, 0.2, 4) (Zoom-out, 0.4, 10)
anchor-based one-stage, two-stage, and anchor-free detec- Box-level Color operations Geometric operations
tors. In addition to the previous RetinaNet experiments, we Sub-policy 1. (Color, 0.4, 2) (TranslateX, 0.4, 4)
show our results on Faster R-CNN and FCOS in Tab. 5. The Sub-policy 2. (Brightness, 0.2, 4) (Rotate, 0.4, 2)
improvements on Faster R-CNN are remarkable, i.e., +2.7% Sub-policy 3. (Sharpness, 0.4, 2) (ShearX, 0.2, 6)
and +2.8% on ResNet-50 and ResNet-101, respectively. On Sub-policy 4. (SolarizeAdd, 0.2, 2) (Hflip, 0.3, 0)
the anchor-free detector FCOS, it achieves 44.0% AP on Sub-policy 5. Original (TranslateY, 0.2, 8)
ResNet-101 with similar improvements. Area ratio Small - 6 Middle - 2 Large - 0.4
Our augmentation policy is feasible in any box-level
Search metric analysis. We compare our search metric
tasks. We validate its performance using Mask R-CNN [20]
with the proxy accuracy metric. For the proxy accuracy
on instance segmentation and keypoint estimation. Simi-
metric in [51], each model is trained on a subset training set,
lar improvements are consistently present in Tab. 6. For
5k images. For each search metric, we train 50 models with
instance segmentation, our Mask R-CNN model achieves
policies randomly sampled in the search space. Each model
40.0% mask AP on ResNet-101 backbone. In addition,
is trained for 90k iterations and evaluated on val2017 to
we also transfer our augmentation policy to PASCAL VOC
obtain the actual accuracy. Meanwhile, the proxy accu-
dataset. We train a Faster R-CNN model on ResNet-50 for
racy metric and our std-based metric are computed for each
48k iterations and divide the learning rate at 36k iterations.
model. We illustrate the Pearson coefficients in Fig. 4. Our
It improves the baseline by 3% mAP as in Tab. 7.
std-based metric is horizontally flipped in [0, 1] for better il-
‡ For FCOS ResNeXt-32x8d-101-DCN models, it is an improved ver- lustration. It shows that our metric has a clearly higher coef-
sion with ATSS [48] for performance boosting. ∗ results on test-dev. ficient to actual accuracies than the proxy accuracy metric.
Our comparisons on test-dev are in the supplementary materials. In addition, we use different metrics for search in Tab. 4.
Table 10: Scale variation issue on a clean Faster R-CNN.
AP AP50 AP75 APs APm APl
ResNet-50-C4 34.7 55.7 37.1 18.2 38.8 48.3
with MS train 34.8 55.6 37.3 18.9 39.2 47.6
with FPN 36.7 58.4 39.6 21.1 39.8 48.1
with Ours 36.8 58.0 39.5 21.0 41.2 49.1

Figure 4: Coefficients between actual accuracy and metrics. Our • Zoom-out has higher probability and magnitude than
metric presents a higher coefficient than the proxy accuracy [51]. zoom-int. This matches the fact that object detectors
usually have unsatisfied performance in small objects,
while zoom-out benefits detecting small objects.
The search metric in Eq.(7) is slightly better than the purely
std metric, thanks to the penalty factor. • The area ratio decreases dramatically from small scale
to large scale. Note that the area ratio is searched inde-
4.3. Comparison pendently in various scales from a set of discrete num-
We compare our final models on the MS COCO dataset bers. This phenomenon shows that augmentation in-
with other data augmentation methods in object detection. volving the context (area ratio larger than 1.0) would
The training settings are consistent with the implementation be beneficial to small and middle object recognition.
details mentioned before. As shown in Tab. 8, our augmen- • In box-level augmentations, geometric operations gen-
tation method on Faster R-CNN with ResNet-101 backbone erally have higher probability and magnitude than
achieves 44.2% AP, without any testing techniques. It is color operations. It intuitively reveals that geometric
better than the augmentation methods with the same back- operations, e.g., rotation, translation, shearing, might
bone, including InstaBoost [17] on Mask R-CNN (43.0% have more effect than color ones in object detection.
AP). To compare with the state-of-the-art hand-crafted aug-
mentations, SNIP [41] and SNIPER [42] on Faster R-CNN, The above patterns accord with our intuition and could pro-
we use the exactly the same settings, which includes multi- vide valuable insights to human knowledge.
scale testing on [480, 800, 1400] sizes, valid ranges, and Image/Feature pyramids v.s. Scale-aware AutoAug. Fea-
Soft-NMS [3]. No flipping or other enhancements are used. ture pyramid network [28] is proposed for solving the scale
Our model on the same backbone achieves 47.0% AP. It is variance issue in Faster R-CNN and has been widely used
better than the 46.1% SNIPER. We also compare with au- in this area. Here we show that our Scale-aware AutoAug
tomatic augmentation methods, AutoAug-det [51] and Ran- could be a substitute for FPN on Faster R-CNN detector
dAug [10]. For a fair comparison, we train them with the as in Tab. 10. Multi-scale training is commonly known to
same training settings to our methods on various backbones, be scale-invariant. However, on a clean baseline of Faster
denoted as † . They are inferior to ours. R-CNN [39] without FPN in 90k training iterations, it pro-
In addition to these common comparisons, we conduct vides almost no benefits. In contrast, our augmentation pol-
experiments on large-scale models to push the envelope of icy improves the baseline to the performance that requires
Scale-aware AutoAug. The baseline is the improved ver- training with FPN. Note that our augmentation policy is
sion of FCOS [48] on ResNeXt-32x8d-101-DCN backbone cost-free and requires no network modification.
with multi-scale training. It has 47.5% AP in the standard
single scale testing. Without any bells and whistles, Scale- 5. Conclusion
aware AutoAug enhances this strong baseline to 48.5% AP
by + 1.0% increase. It is further improved to 49.6% AP on In this work, we present Scale-aware AutoAug for object
larger training images with 1200 size. When it is equipped detection. It aims at the common scale variation issue with
with multi-scale testing, it is promoted to 51.4% AP. our search space and search metric. Scale-aware AutoAug
spends 20 GPU-days searching augmentation policies, 40
4.4. Discussion × saving compared to previous work. Our method shows
significant improvements over several strong baselines. Al-
Understanding the searched policy. Tab. 9 illustrates our though the augmentation policy is searched in object detec-
learned augmentation policy in details. We present each in- tion on the COCO dataset, it is transferable to other tasks
dividual augmentation in the format of {type, probability, and dataset. Thus, it provides a practical solution for re-
magnitude}. Probabilities are in the range of [0, 1.0]. The search and applications of augmentations in object detec-
magnitude ranges for augmentations are listed in the sup- tion. Finally, the searched augmentation policy have mean-
plementary materials. We measure it with 0 to 10 during ingful patterns, which might, in return, provide valuable in-
search. This searched policy presents meaningful patterns. sights for the hand-crafted data augmentation design.
References Zisserman. The pascal visual object classes challenge: A
retrospective. International Journal of Computer Vision,
[1] John Black, Nigar Hashimzade, and Gareth Myles. A dictio- 111(1):98–136, 2015. 2
nary of economics. Oxford university press, 2012. 5 [17] Haoshu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou,
[2] Alexey Bochkovskiy, Chien-Yao Wang, and Hong- Yonglu Li, and Cewu Lu. Instaboost: Boosting instance
Yuan Mark Liao. Yolov4: Optimal speed and accuracy of segmentation via probability map guided copy-pasting. In
object detection. CoRR, abs/2004.10934, 2020. 2 ICCV, pages 682–691, 2019. 1, 2, 7, 8, 11
[3] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and [18] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. Dropblock:
Larry S. Davis. Soft-nms - improving object detection with A regularization method for convolutional networks. In
one line of code. In ICCV, pages 5562–5570, 2017. 8 NeurIPS, pages 10750–10760, 2018. 7
[4] Ali Borji and Seyed Mehdi Iranmanesh. Empirical upper [19] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr
bound in object detection and more. CoRR, 2019. 4 Dollár, and Kaiming He. Detectron, 2018. 1
[5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu [20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, shick. Mask r-cnn. In ICCV, pages 2961–2969, 2017. 7
Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tian- [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
heng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Deep residual learning for image recognition. In CVPR,
Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, pages 770–778, 2016. 6
Chen Change Loy, and Dahua Lin. MMDetection: Open [22] Daniel Ho, Eric Liang, Xi Chen, Ion Stoica, and Pieter
mmlab detection toolbox and benchmark. CoRR, 2019. 2 Abbeel. Population based augmentation: Efficient learning
[6] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. of augmentation policy schedules. In ICML, pages 2731–
Gridmask data augmentation. CoRR, abs/2001.04086, 2020. 2741, 2019. 2
7 [23] Tao Kong, Anbang Yao, Yurong Chen, and Fuchun Sun.
[7] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Hypernet: Towards accurate region proposal generation and
Xinyu Xiao, and Jian Sun. Detnas: Backbone search for ob- joint object detection. In CVPR, pages 845–853, 2016. 1
ject detection. In Hanna M. Wallach, Hugo Larochelle, Alina [24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.
Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Ro- Imagenet classification with deep convolutional neural net-
man Garnett, editors, NeurIPS, pages 6638–6648, 2019. 2 works. In NIPS, pages 1106–1114, 2012. 2
[8] Yukang Chen, Peizhen Zhang, Zeming Li, Yanwei Li, Xi- [25] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang
angyu Zhang, Gaofeng Meng, Shiming Xiang, Jian Sun, and Zhang. Scale-aware trident networks for object detection.
Jiaya Jia. Stitcher: Feedback-driven data provider for object In ICCV, 2019. 1
detection. CoRR, abs/2004.12432, 2020. 2, 7 [26] Yonggang Li, Guosheng Hu, Yongtao Wang, Timothy M.
[9] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Va- Hospedales, Neil Martin Robertson, and Yongxing Yang.
sudevan, and Quoc V. Le. Autoaugment: Learning augmen- DADA: differentiable automatic data augmentation. In
tation strategies from data. In CVPR, pages 113–123, 2019. ECCV, 2020. 2
1, 2 [27] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and
Sungwoong Kim. Fast autoaugment. In NeurIPS, pages
[10] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V.
6662–6672, 2019. 2
Le. Randaugment: Practical automated data augmentation
[28] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He,
with a reduced search space. In NeurIPS, 2020. 2, 5, 7, 8
Bharath Hariharan, and Serge J. Belongie. Feature pyramid
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
networks for object detection. In CVPR, 2017. 1, 6, 8
and Fei-Fei Li. Imagenet: A large-scale hierarchical image
[29] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He,
database. In CVPR, pages 248–255, 2009. 2
and Piotr Dollár. Focal loss for dense object detection. In
[12] Terrance Devries and Graham W. Taylor. Improved regular- ICCV, pages 2999–3007, 2017. 2, 4, 6
ization of convolutional neural networks with cutout. CoRR, [30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
abs/1708.04552, 2017. 2 Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
[13] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Zitnick. Microsoft coco: Common objects in context. In
Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. ECCV, pages 740–755, 2014. 2, 6
SpineNet: Learning scale-permuted backbone for recogni- [31] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie
tion and localization. In CVPR, pages 11592–11601, 2020. Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning
6, 11 for generic object detection: A survey. IJCV, pages 261–
[14] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Mod- 318, 2020. 2
eling visual context is key to augmenting object detection [32] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian
datasets. In ECCV, pages 364–380, 2018. 2 Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C
[15] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, Berg. Ssd: Single shot multibox detector. In ECCV, pages
paste and learn: Surprisingly easy synthesis for instance de- 21–37, 2016. 2
tection. In ICCV, pages 1301–1310, 2017. 1, 2 [33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
[16] Mark Everingham, S. M. Ali Eslami, Luc J. Van Gool, convolutional networks for semantic segmentation. In
Christopher K. I. Williams, John M. Winn, and Andrew CVPR, pages 3431–3440, 2015. 2
[34] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard S. tive adversarial networks. arXiv preprint arXiv:1711.00648,
Zemel. Understanding the effective receptive field in deep 2017. 1
convolutional neural networks. In NeurIPS, pages 4898– [51] Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin,
4906, 2016. 4 Jonathon Shlens, and Quoc V. Le. Learning data augmenta-
[35] Luis Perez and Jason Wang. The effectiveness of data aug- tion strategies for object detection. In ECCV, 2020. 1, 2, 3,
mentation in image classification using deep learning. arXiv 4, 5, 6, 7, 8
preprint arXiv:1712.04621, 2017. 1 [52] Barret Zoph and Quoc V. Le. Neural architecture search with
[36] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. reinforcement learning. CoRR, abs/1611.01578, 2016. 2
Amodal instance segmentation with KINS dataset. In CVPR, [53] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V.
pages 3014–3023, 2019. 2 Le. Learning transferable architectures for scalable image
[37] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, recognition. In CVPR, pages 8697–8710, 2018. 2
Jared Dunnmon, and Christopher Ré. Learning to compose
domain-specific transformations for data augmentation. In
NeurIPS, pages 3236–3246, 2017. 1
[38] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V.
Le. Regularized evolution for image classifier architecture
search. In AAAI, pages 4780–4789, 2019. 2, 5
[39] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun.
Faster R-CNN: towards real-time object detection with re-
gion proposal networks. IEEE Trans. Pattern Anal. Mach.
Intell., 39(6):1137–1149, 2017. 1, 2, 4, 8
[40] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. CoRR,
2014. 2
[41] Bharat Singh and Larry S. Davis. An analysis of scale invari-
ance in object detection SNIP. In CVPR, pages 3578–3587,
2018. 1, 7, 8, 11
[42] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER:
efficient multi-scale training. In NeurIPS, pages 9333–9343,
2018. 1, 2, 7, 8, 11
[43] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos:
Fully convolutional one-stage object detection. In ICCV,
pages 9627–9636, 2019. 2
[44] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and
Ian Reid. A bayesian data augmentation approach for learn-
ing deep models. In NeurIPS, pages 2797–2806, 2017. 1
[45] Hao Wang, Qilong Wang, Fan Yang, Weiqi Zhang, and
Wangmeng Zuo. Data augmentation for object detection
via progressive and selective instance-switching. CoRR,
abs/1906.00358, 2019. 7, 11
[46] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen
Lo, and Ross Girshick. Detectron2. https://github.
com/facebookresearch/detectron2, 2019. 2
[47] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhen-
guo Li. Auto-fpn: Automatic network architecture adap-
tation for object detection beyond classification. In ICCV,
2019. 1
[48] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and
Stan Z. Li. Bridging the gap between anchor-based and
anchor-free detection via adaptive training sample selection.
In CVPR, pages 9756–9765, 2020. 7, 8
[49] Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Jun-
yuan Xie, and Mu Li. Bag of freebies for training object
detection neural networks. CoRR, abs/1902.04103, 2019. 2,
7
[50] Xinyue Zhu, Yifan Liu, Zengchang Qin, and Jiahong Li.
Data augmentation in emotion classification using genera-
Estimation metric
compute for policies

Select
Supplementary Materials

Sample policies in
Table A - 1. Comparison with methods on test-dev.
Search space
Method AP AP50 AP75 APs APm APl
Res101:
PSIS [45] 40.2 61.1 44.2 22.3 45.7 51.6
Population Evolution InstaBoost [17] 43.0 64.3 47.2 24.8 45.9 54.6
Ours 44.4 66.1 48.8 27.1 47.4 55.3
Res101-DCN-C4:
SNIP† [41] 44.4 66.2 49.9 27.3 47.4 56.9
Evaluate policies with SNIPER† [42] 46.1 67.0 51.6 29.6 48.9 58.1
Estimation metric Ours† 46.9 68.8 51.7 30.6 48.1 58.4

Figure A - 1. The overall evolutionary algorithm framework of


our search method for learning data augmentation policies. Table A - 2. Comparison on larg-scale jittering.
Original
Method AP AP50 AP75 APs APm APl
RetinaNet Res50:
A. Search framework review Baseline 40.1 59.7 43.0 23.7 44.1 54.4
Ours 41.6 61.6 44.4 25.4 45.4 55.6
We provide a review of our overall search method frame-
work in Fig. A - 1. We adopt the evolutionary algorithmSample policies in
for search, where a population of data augmentation poli- Search space
cies are randomly initialized and then evolved in iterations. Combining the above two equations, Eq. (10) and Eq. (11),
During search, policies are sampled from the search space. we can obtain the variance factors as
Then, they are trained and evaluated by ourPopulation
estimation met- Evolution
r r
ric. The computed metrics serve as feedback to update. Bet- W/H H/W
σx = h r, σy = w r. (12)
ter policies are generated in this framework over time. 2π 2π

B. Derivation of Gaussian deviation Evaluate policies with


C. Augmentation
Estimation metric
operations details
The standard deviation of the Gaussian map can be de- We list the details about all box-level operations in
rived as the following. Given the Gaussian map Tab. A - 3 with their description and magnitude ranges. Be-
 
(x − xc )2 (y − yc )2
 sides, we provide the visualization example of these aug-
f (x, y) = exp − + , (8) mentations in Fig. A - 2.
2σx2 2σy2

its integration among the image can be calculated as D. Removing context pixels
Z HZ W In Tab. 1 in the main paper, we evaluate well-trained
V = f (x, y) dxdy ≈ 2πδx δy . (9) models on validation images whose context (background)
0 0 pixels are removed. For better understanding, we provide
With the defination of the area ratio r = V /sbox and an example image as shown in Fig. A - 3.
sbox = hw, we can formulate their relationship as
E. Other Comparisons

r= δx δy . (10) Some methods in Tab. 8 are reported on test-dev. We
hw
show our counterparts on test-dev in Tab. 1. † denotes
Without loss of generality, the variance factors δx and δy that the multi-scale testing technique has been used.
should be correlated with the ratio of box height (width) We also perform experiments upon the large scale jitter-
and image height (width) to make the Gaussian map match ing [13], i.e., [0.5, 2.0] as in Tab. 2. The baseline is en-
the box aspect ratio. This can be represented as hanced to 40.1% AP, while ours stably achieves 41.6% AP.
  
h w
δx /δy = . (11)
H W
Table A - 3. Details about box-level operations with their description and magnitude ranges.

Operation Description Magnitude range


Brightness Control the object brightness. Magnitude = 0 represents the black, while magni- [0.1, 1.9]
tude = 1.0 means the original.
Color Control the color balance. Magnitude = 0 represents a black & white object, while [0.1, 1.9]
magnitude = 1.0 means the original.
Contrast Control the contrast of the object. Magnitude = 0 represents a gray object, while [0.1, 1.9]
magnitude = 1.0 means the original object.
Cutout Randomly set a square area of pixels to be gray. Magnitude represents the side length. [0, 60]
Equalize Equalize the histogram of the object area. -
Sharpness Control the sharpness of the object. Magnitude = 0 represents a blurred object, while [0.1, 1.9]
magnitude = 1.0 means the original object.
Solarize Invert all pixels above a threshold value. Magnitude represents the threshold. [0,256]
SolarizeAdd For pixels less than 128, add an amount to them. Magnitude represents the amount. [0,110]
Hflip Flip the object horizontally. -
Rotate Rotate the object to a degree. Magnitude represents the degree. [-30,30]
ShearX/Y Shear the object along the horizontal or vertical axis with a magnitude. [-0.3, 0.3]
TranslateX/Y Translate the object in the horizontal or vertical direction by magnitude pixels. [-150, 150]

Figure A - 2. Examples on different box-level operations with magnitudes random sampled.

Figure A - 3. An example image of removing context.

You might also like