Synthetic Examples Improve Generalization for Rare Classes

Sara Beery*⋄, Yang Liu*⋄, Dan Morris∧, Jim Piavis†, Ashish Kapoor†, Markus Meister⋄, Neel Joshi†, Pietro Perona⋄

⋄California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125
∧Microsoft AI for Earth, †Microsoft Research, 14820 NE 36th Street, Redmond, WA 98052

*denotes equal contribution

Abstract

The ability to detect and classify rare occurrences in images has important applications – for example, counting rare and endangered species when studying biodiversity, or detecting infrequent traffic scenarios that pose a danger to self-driving cars. Few-shot learning is an open problem: current computer vision systems struggle to categorize objects they have seen only rarely during training, and collecting a sufficient number of training examples of rare events is often challenging and expensive, and sometimes outright impossible. We explore in depth an approach to this problem: complementing the few available training images with ad-hoc simulated data.

Our testbed is animal species classification, which has a real-world long-tailed distribution. We present two natural world simulators, and analyze the effect of different axes of variation in simulation, such as pose, lighting, model, and simulation method, and we prescribe best practices for efficiently incorporating simulated data for real-world performance gain. Our experiments reveal that synthetic data can considerably reduce error rates for classes that are rare, that as the amount of simulated data is increased, accuracy on the target class improves, and that high variation of simulated data provides maximum performance gain.

1. Introduction

In recent years computer vision researchers have made substantial progress towards automated visual recognition across a wide variety of visual domains [57, 20, 51, 68, 47, 67, 8]. However, applications are hampered by the fact that in the real world the distribution of visual classes is long-tailed, and state-of-the-art recognition algorithms struggle to learn classes with limited data [69]. In some cases (such as recognition of rare endangered species) classifying rare occurrences correctly is crucial. Simulated data, which is plentiful and comes with annotation "for free", has been shown to be useful for various computer vision tasks [70, 50, 32, 53, 24, 61, 55, 49, 28, 26, 36]. However, an exploration of this approach in a long-tailed setting is still missing (see Section 2.4).

As a testbed, we focus on the effect of simulated data augmentation on the real-world application of recognizing animal species in camera trap images. Camera traps are heat- or motion-activated cameras placed in the wild to monitor animal populations and behavior. The processing of camera trap images is currently limited by human review capacity; consequently, automated detection and classification of animals is a necessity for scalable biodiversity assessment. A single sighting of a rare species is of immense importance. However, training data of rare species is, by definition, scarce. This makes this domain ideal for studying methods for training detection and classification algorithms with few training examples. We utilize a technique from [8] which tests performance at camera locations both seen (cis) and unseen (trans) during training in order to explicitly study generalization (see Section 3.1 for a more detailed explanation).

We introduce two novel natural world simulators based on popular 3D game development engines for generalizable, realistic and efficient synthetic data generation. We investigate the use of simulated data as augmentation during training, and how to best combine real data for common classes with simulated data for rare classes to achieve optimal performance across the class set at test time. We consider four different data simulation methods (see Fig. 1) and compare the effects of each on classification performance. Finally, we analyze the effect of both increasing the number of simulated images and controlling for axes of variation to provide best practices for leveraging simulated data for real-world performance gain on rare classes.
Figure 1: Day and night examples for each simulation method: (a) Real Camera Traps, (b) TrapCam-Unity, (c) TrapCam-AirSim, (d) Sim on Empty, (e) Real on Empty. We compare four different simulation methods and evaluate the effect of each on classification performance.

2. Related work

2.1. Visual Categorization Datasets

Large and well-annotated public datasets allow scientists to train, analyze, and compare the performance of different methods, and have provided large performance improvements over traditional vision approaches [64, 34, 31]. The most popular datasets used for this purpose are ImageNet, COCO, PascalVOC, and OpenImages, all of which are human-curated from images scraped from the web [17, 44, 21, 39]. These datasets cover a wide set of classes across both the manufactured and natural world, and are usually designed to provide "enough" data per class to avoid the low-data regime. More recently researchers have proposed datasets that focus specifically on long-tailed distributions [68, 8, 41]. The Caltech Camera Traps dataset [8] introduced the challenge of learning from limited locations, and generalizing to new locations.

2.2. Handling Imbalanced Datasets

Imbalanced datasets lead to bias in algorithm performance toward well-represented classes [12]. Algorithmic solutions often use a non-uniform cost per misclassification via weighted loss [19, 30, 29]. One example, focal loss, was recently proposed to deal with the large foreground/background imbalance in detection [43].

Data solutions employ data augmentation, either by 1) over-sampling the minority classes, 2) under-sampling the majority classes, or 3) generating new examples for the minority classes. When using mini-batch gradient descent, oversampling the minority classes is similar to weighted loss, as illustrated in the sketch below. Under-sampling the majority classes is non-ideal, as this reduces information about common classes. Our paper falls into the third category: generating new training data for rare classes. Data augmentation via pre-processing, using affine and photometric transformations, is a well-established tool for improving generalization [40, 33]. Data generation and simulation have begun to be explored as data augmentation methods; see Section 2.4.

Algorithmic and data solutions for imbalanced data are complementary; algorithmic advances can be used in conjunction with augmented training data.
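As a concrete illustration of the relationship between loss re-weighting and re-sampling (not code from this paper or the cited works; the class counts, labels, and data loader are placeholders), a minimal PyTorch sketch might look like:

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-class image counts for a long-tailed training set;
# the last class stands in for a rare class such as deer.
class_counts = torch.tensor([2514.0, 1800.0, 950.0, 44.0])

# Option 1: weighted loss -- mistakes on rare classes cost more.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
weighted_loss = nn.CrossEntropyLoss(weight=class_weights)

# Option 2: oversampling -- rare-class examples are drawn more often per mini-batch.
labels = torch.randint(0, len(class_counts), (1000,))   # placeholder per-image labels
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```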
2.3. Low-shot Learning

Low-shot learning attempts to learn categories from few examples [42]. Wang and Hebert [71] do low-shot classification by regressing from small-dataset classifiers to large-dataset classifiers. Hariharan and Girshick [27] look specifically at ImageNet, using classes that are unbalanced, some with large amounts of training data, and some with little training data. Metric learning learns a representation space where distance corresponds to similarity, and uses this as a basis for low-shot solutions [15]. We consider the low-shot regime with regard to real data for our rare target class, but investigate the use of added synthetic data based on a human-generated articulated model of the unseen class during training, instead of additional class-specific attribute labels at training and test time. This takes us outside of the traditional low-shot framework into the realm of domain transfer from simulated to real data.

2.4. Data Augmentation via Style Transfer, Generation, and Simulation

Image generation via generative adversarial networks (GANs) and recurrent neural networks (RNNs), as well as style transfer and image-to-image translation, have all been considered as sources for data augmentation [11, 25, 35, 52, 66, 45, 74]. These techniques require large amounts of data to generate realistic images, making them non-ideal solutions for low-data regimes. Though conditional generation allows for class-specific output, the results can be difficult to interpret or control.

Graphics engines such as Unreal [7, 72] and Unity [6] leverage the expertise of human artists and complex physics models to generate photorealistic simulated images, which can be used for data augmentation. Because ground truth is known at generation, simulated data has proved particularly useful for tasks requiring detailed and expensive annotation, such as keypoints, semantic segmentations, or depth information [70, 50, 32, 53, 24, 61, 55, 49, 28]. Varol et al. [70] use synthetically-generated humans placed on top of real image backgrounds as pretraining for human pose estimation, and suggest fine-tuning a synthetically-trained model on real data. [61] uses a combination of unlabeled real data and labeled simulated data of the same class to improve real-world performance on an eye-tracking task by using GANs [24]. This method requires a large number of unlabeled examples from the target class. [50, 32, 53] find that simulated data improves detection performance, and that the degree of realism and variability of simulation affects the amount of improvement. They consider only small sets of non-deformable man-made objects. Richter et al. [55] showed that a segmentation model for city scenes trained with a subset of their real dataset and a large synthetic set outperforms a model trained with the full real dataset. [49] proposes a dataset and benchmark for evaluating models for unsupervised domain transfer from synthetic to real data with all-simulated training data, as opposed to simulated data only for rare classes. While this literature is encouraging, a number of questions are left unexplored. The first is a careful analysis of when simulated data is useful and, in particular, whether it is useful in generalizing to new scenarios. Second, whether simulated data can be useful in highly complex and relatively unpredictable scenes such as natural scenes, as opposed to indoor and urban scenes. Third, whether it is just the synthetic objects or also the synthetic environments that contribute to learning.

2.5. Simulated Datasets

Previous efforts on synthetic dataset generation focus on non-deformable man-made objects and indoor scenes [62, 58, 73, 32, 53, 38], human pose/actions [70, 16], or urban scenes [56, 22, 55, 18, 26, 36].

Bondi et al. [10] previously released the AirSim-w data simulator within the domain of wildlife conservation, focused on creating aerial infrared imagery. The resolution and quality of the assets is designed to replicate data from 100 meters in the air, but is not realistic close-up. We contribute the first image data generators specifically for the natural world, with the ability to recreate natural environments and generate near-photorealistic images of animals within the scene, including real-world nuisance factors such as challenging pose, lighting, and occlusion.

3. Data and Simulation

3.1. Real Data

Our real-world training and test data comes from the Caltech Camera Traps (CCT) dataset [8]. CCT contains 243,187 images from 140 camera trap locations covering 30 classes of animals, curated from data provided by the United States Geological Survey and the National Park Service. We follow the CCT-20 data split laid out in [8], which was explicitly designed for in-depth generalization analysis. The split uses a subset of 57,868 images from 20 camera locations covering 15 classes in CCT to simultaneously investigate performance on locations seen during training and generalization performance to new locations. Bounding-box annotations are provided for all images in CCT-20, whereas the rest of CCT has only class labels. In the CCT-20 data split, cis-locations are defined as locations seen during training and trans-locations as locations not seen during training (see Fig. 3). Nine locations are used for trans-test data, one location for trans-validation data, and data from the remaining 10 locations is split between odd and even days, with odd days as cis-test data and even days as training and cis-validation data (95% of the data from even days for training, 5% for validation); a sketch of this protocol appears at the end of this subsection.

To study the effect of simulated data on rare species, we focus on deer, which are rare in CCT-20, with only 44 deer examples out of the 13,553 images in the training set (see Fig. 3). To focus on the performance of a single rare class, we remove the other two rare classes in CCT-20: badgers and foxes. We note that there are no deer images in the established CCT-20 trans sets. In reality, deer are far from uncommon: unlike a truly rare species, there exist sufficient images of deer in the CCT dataset outside of the CCT-20 locations to rigorously evaluate performance. To facilitate deeper investigation of generalization we have collected bounding-box annotations for an additional 16K images from CCT across 65 new locations, which we add to the trans-validation and trans-test sets to cover a wider variety of locations and classes (including deer). We call this augmented trans set trans+ (see Fig. 3) and will release the annotations at publication. To further analyze generalization, we also test on data containing deer from the iNaturalist 2017 dataset [68], which represents a domain shift to human-captured and human-selected photographs. We consider Odocoileus hemionus (mule deer) and Odocoileus virginianus (white-tailed deer) images from iNaturalist, the two species of deer seen in the CCT data. In the Supplementary Material we show results on an additional class, wolf.
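A minimal sketch of this location/day-parity protocol (assuming a simple list-of-dicts metadata schema rather than the released annotation format) follows:

```python
from datetime import datetime

def split_cct(images, train_locations):
    """Assign images to splits following the CCT-20 location/day protocol.

    `images` is assumed to be a list of dicts with 'location' and 'date'
    fields (an illustrative schema, not the released annotation format).
    """
    splits = {"train": [], "cis_test": [], "trans": []}
    for im in images:
        if im["location"] not in train_locations:
            splits["trans"].append(im)        # unseen locations -> trans val/test
            continue
        day = datetime.strptime(im["date"], "%Y-%m-%d").day
        if day % 2 == 1:
            splits["cis_test"].append(im)     # odd days: cis-test
        else:
            splits["train"].append(im)        # even days: train (+ cis-val subset)
    return splits
```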
3.2. Synthetic Data

To assess generality we leverage multiple collections of woodland and animal models to create two simulation environments, which we call TrapCam-Unity and TrapCam-AirSim. Both simulation environments and source code to generate images will be provided publicly, along with the data generated for this paper. To synthesize daytime images we varied the orientation of the simulated sun in both azimuth and elevation. To create images taken at night we used a spotlight attached to the simulated camera to simulate a white-light or IR flash and qualitatively match the low color saturation of the nighttime images. To simulate animals' eyeshine (a result of the reflection of camera flash from the back of the eye), we placed small reflective balls on top of the eyes of the model animals.
TrapCam-AirSim. We create a modular natural environment within Microsoft AirSim [60] that can be randomly populated with flora and fauna. The distribution and types of trees, bushes, rocks, and logs can be varied and randomly seeded to create a diverse set of landscapes, from an open plain to a dense forest. We used various off-the-shelf components such as an animal pack from Epic Studios [1] (Animals Vol 01: Forest Animals by GiM [2]), background terrain also from Unreal Marketplace [7], vegetation from SpeedTree [4], and rocks/obstructions from Megascans [3]. The actual area of the environment is small, at 50 meters, but the modularity allows many possible scenes to be built.

TrapCam-Unity. The Unity 3D game development engine is a popular game development tool that offers realistic graphics, real-time performance, and abundant 3D assets. We take advantage of the "Book of The Dead" environment [5], a near-photorealistic, open-source forest environment published by Unity to demonstrate its high definition rendering pipeline. This off-the-shelf environment is large and rich in detail, with a diversity of subregions that have significantly different statistics. We change the lighting and move throughout this large, static environment to collect data with various background scenes. We make use of 17 animated deer models from five off-the-shelf model sets, purchased from the Unity Asset Store and originally developed for game development, including the GiM models used in TrapCam-AirSim. A single gaming PC (Core i7 5820K, 16GB RAM, GTX 1080Ti) generates over 300,000 full-HD images with pixel-level instance annotation per day, and the throughput scales linearly with additional machines.

Figure 2: Cis vs. Trans: (a) Training images, (b) Cis test images, (c) Trans+ test images, (d) iNaturalist images. The cis-test data can be very similar to the training data: animals tend to behave similarly at a single location even across different days, so the images collected of each species are easy to memorize intra-location. The trans data has biases towards specific angles and lighting conditions that are different from those in the cis locations, and as such is very hard to learn from the training data. iNaturalist data represents a domain shift to human-curated images.

Figure 3: (Top) Number of training examples for each class. Deer are rare in the training locations from the CCT-20 data split. We focus on deer as a test species in order to investigate whether we can improve performance on a "rare" class. Since deer are not rare at other camera locations within the CCT dataset, we have enough test data to thoroughly evaluate the effect. (Bottom) Number of examples for each data split, for deer and other classes. In the CCT-20 data split there were no trans examples of deer. We added annotations to the trans val and test sets for an additional 16K images across 65 new locations from CCT, including 6K examples of deer. We call these augmented sets trans+.
Simulated animals on empty images. Similar to the data generated in [70], we generate synthetic images of deer by rendering deer on top of real camera trap images containing no animals, which we call Sim on Empty (see Fig. 1). We first generate animal foreground images by randomizing the location, orientation in azimuth, pose, and illumination of the deer, then paste the foreground images on top of the real empty images. A limitation is that the deer are not in realistic relationships or occlusion scenarios with the environment around them. We also note that the empty images used to construct this data come from both cis and trans locations, so Sim on Empty contains information about test-set backgrounds unavailable in the purely simulated sets. This choice is based on current camera trap literature, which first detects the presence of any animal, and then determines animal species [47, 8]. After the initial animal detection step, the empty images are known and can be utilized.

Segmented animals on empty images. We manually segment the 44 examples of deer from the training set and paste them at random on top of real empty camera trap images, which we call Real on Empty (see Fig. 1). This allows us to analyze whether the generalization challenge is related to memorizing the training deer+background or memorizing the training deer regardless of background. Similar to the Sim on Empty set, these images do not have realistic foreground/background relationships, and the empty images come from both cis and trans locations.
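A rough sketch of the compositing step shared by Sim on Empty and Real on Empty (assuming pre-rendered or pre-segmented RGBA cutouts and a pool of known-empty frames; the file names are placeholders):

```python
import random
from PIL import Image

def paste_on_empty(foreground_path, empty_path):
    """Composite an animal cutout (RGBA with transparent background) onto a
    real empty camera trap frame at a random location."""
    fg = Image.open(foreground_path).convert("RGBA")
    bg = Image.open(empty_path).convert("RGBA")

    # Randomize placement; scale jitter could be added similarly.
    max_x = max(bg.width - fg.width, 1)
    max_y = max(bg.height - fg.height, 1)
    offset = (random.randrange(max_x), random.randrange(max_y))

    bg.paste(fg, offset, mask=fg)   # the alpha channel acts as the paste mask
    return bg.convert("RGB")

# composited = paste_on_empty("deer_render_0001.png", "empty_frame_loc33.jpg")
```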
4. Experiments

Beery et al. [8] showed that detecting and localizing the presence of an "animal" (where all animals are grouped into a single class) both generalizes well to new locations and improves classification performance. We focus on classification of cropped ground-truth bounding boxes as opposed to training multi-class detectors in order to disambiguate classification and detection errors. We specifically investigate how added synthetic training data for rare classes affects model performance on both rare and common classes.

We find that the Inception-ResNet-V2 architecture [63] works best for the cropped-box classification task, based on performance comparison across architectures (see Supplementary Material). Most classification systems are pretrained on ImageNet, which contains animal classes. To ensure that our "rare" class is truly something the model is unfamiliar with, as opposed to something seen in pretraining, we pretrain our classifiers on no-animal ImageNet, a dataset we define by removing the "animal" subtree (all classes under synset node n00015388) from ImageNet. We use an initial learning rate of 0.0045, RMSprop with a momentum of 0.9 [65], and a square input resolution of 299. We employ random cropping (containing at least 65% of the region), horizontal flipping, color distortion, and blur as data augmentation. Model selection is performed using early stopping based on trans+ validation set performance [9].
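A sketch of a comparable training configuration is given below (PyTorch-flavored and illustrative only, not our exact pipeline; the learning rate, momentum, input resolution, and minimum crop fraction match the values above, while the color-jitter and blur strengths, the patience value, and the backbone implementation are assumptions):

```python
import torchvision.transforms as T

# Augmentation roughly matching the description above.
train_transform = T.Compose([
    T.RandomResizedCrop(299, scale=(0.65, 1.0)),   # keep at least 65% of the region
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    T.GaussianBlur(kernel_size=5),
    T.ToTensor(),
])

# model: an Inception-ResNet-V2 implementation, pretrained on "no-animal"
# ImageNet (ImageNet with the n00015388 "animal" subtree removed).
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.0045, momentum=0.9)

def should_stop(val_errors, patience=5):
    """Early stopping on trans+ validation error: stop once the best value
    has not improved for `patience` consecutive evaluations."""
    best_idx = val_errors.index(min(val_errors))
    return len(val_errors) - 1 - best_idx >= patience
```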
Figure 4: Error as a function of number of simulated images seen during training. We divide this plot into three regions. The leftmost region is the baseline performance with no simulated data, shown at x=0 (note the x-axis is in log scale). In the middle region, additional simulated training data increases performance on the rare class and does not harm the performance of the remaining classes (trend lines are visualized). The rightmost region, where many simulated images are added to the training set, results in a biased classifier, hurting the performance of the other classes (see Fig. 5 (b-c) for details). We compare the class error for "deer" and "other classes" in both the "cis" and "trans+" testing regimes. Lines marked "deer" use only the deer test images for the error computation. Lines marked "other classes" use all the images in the other classes (excluding deer) for the error computation. Error is defined as the number of incorrectly identified images divided by the number of images.

4.1. Effect of increase in simulated data

We explore the trade-off in performance when increasing the number of simulated images, from 5 to 1.4 million, spanning 5 log units (see Fig. 4). Very little simulated data is needed to see a trans+ performance boost: with as few as 5 simulated images we see a 10% decrease in per-class error on trans+ deer, with < 0.5% increase in average per-class error on the other trans+ classes. As we increase the number of simulated images, trans+ performance improves: with 100K simulated images we see a 39% decrease in trans+ deer error, with < 0.5% increase in error for the other trans classes. There exists some threshold (> 325K) where, if passed, an increase in simulated data noticeably biases the classifier towards the deer class (see Fig. 5): with 1.4 million simulated images, our trans+ deer error decreases by 88%, but it comes at the cost of a 13% increase in average per-class error across the other classes. At this point there is an overwhelming class prior towards deer: the next-largest class at training time would be opossums with 2,514 images, 3 orders of magnitude less.
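The per-class error and the average error over the remaining classes reported throughout this section can be computed as in the following sketch (an illustrative helper, not our exact evaluation code; the class name is a placeholder):

```python
import numpy as np

def per_class_error(y_true, y_pred, target="deer"):
    """Error as defined in Fig. 4 (misclassified images / total images),
    reported for one target class and averaged over the remaining classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    errors = {}
    for c in np.unique(y_true):
        mask = y_true == c
        errors[c] = np.mean(y_pred[mask] != c)
    other = [e for c, e in errors.items() if c != target]
    return errors.get(target), float(np.mean(other))
```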
Unsurprisingly, cis deer performance decreases with added simulated data. Although the images were taken on different days (train from even days, cis-test from odd days), the animals captured were to some extent creatures of habit. Thus, training and test images can be nearly identical from within the same locations (see Fig. 2). Almost all cis test deer images have at least one visually similar training image. As simulated data is added at training time, the model is forced to learn a more complex, varied representation of deer. As a result, we see cis deer performance decrease. To quantify robustness, we ran the 100K experiment three times. We found that trans+ deer error had a standard deviation of 2% and cis deer error had a standard deviation of 4%, whereas the average error across other classes had a standard deviation of 0.2% for both cis and trans.

We also investigate performance on deer images from iNaturalist [68], which are individually collected by humans and are usually relatively centered and well-focused (and therefore easier to classify) but represent a domain shift (see Fig. 2). Adding simulated data improves performance on the iNaturalist deer images (see Fig. 4), demonstrating the robustness and generality of the representation learned.

Figure 5: (a) Trans+ PR curves for the deer class: Note the development of a biased classifier as we add simulated training data. The baseline model (in blue) has high precision but suffers low recall. The model trained with 1.4M simulated images (in grey) has higher recall, but suffers a loss in precision. (b-c) Evidence of a biased classifier: Comparing the deer column in the confusion matrices for 100K and 1.4M simulated images, the model trained with 1.4M simulated images predicts more test images as deer.

Figure 6: Error as a function of variability of simulated images seen during training: 100K simulated deer images. Error is calculated as in Fig. 4, and all data is from TrapCam-Unity. Trans+ deer performance is highlighted. In the legend, "CCT" means the model was trained only on the CCT-20 training set with no added simulated data. "P" means "pose," "L" means "lighting," and "M" means "model," while the prefix "f" for "fixed" denotes which of these variables were controlled for a particular experiment. For example, "fPM" means the pose and the animal model were held fixed, while the lighting was allowed to vary. The variability of simulated data is extremely important; while all axes of variability matter, simulating nighttime images has the largest effect.

4.2. Effect of variation in simulation

In order to understand which aspects of the simulated data are most beneficial, we consider three dimensions of variation during simulation: pose, lighting, and animal model. Using the TrapCam-Unity simulator, we generate 100K daytime simulated images for each of these experiments. As a control, we create a set of data where the pose, lighting, and animal model are all fixed. We then create sets with varied pose, varied lighting, and varied animal model, each with the other variables held fixed. An additional set of data is generated varying all of the above. Unsurprisingly, the widest variation results in the best trans+ deer performance. The individual axes of variation do have an effect on performance, and some are more "valuable" than others (see Fig. 6). There are many more dimensions of variation that could be explored, such as simulated motion blur or variation in camera perspective. For CCT data, we find that adding simulated nighttime images has the largest effect on performance. We have determined that for deer, 49% of training images, 53% of cis test images, and 56% of trans+ test images were captured at night, using either IR or white flash.
Simulating only daytime images injects a prior towards deer being seen during the day. By training on half day and half night images we match the day/night prior for deer in the data. Not all species occur equally during the day or night; some are strictly nocturnal. Our results suggest that a good strategy is to determine the appropriate ratio of day to night images using your training set and match that ratio when adding simulated data.
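A minimal sketch of this ratio-matching strategy (the metadata schema, function name, and flag derivation are illustrative):

```python
def allocate_sim_images(train_metadata, target_class, n_sim):
    """Split a budget of simulated images into day/night so that the simulated
    data matches the day/night ratio of the real training images for the class.
    `train_metadata` is assumed to be a list of dicts with 'label' and a
    boolean 'night' flag (e.g., derived from flash/IR capture metadata)."""
    class_ims = [m for m in train_metadata if m["label"] == target_class]
    night_frac = sum(m["night"] for m in class_ims) / max(len(class_ims), 1)
    n_night = round(n_sim * night_frac)
    return {"night": n_night, "day": n_sim - n_night}

# For CCT deer, roughly 49% of training images are nighttime captures, so a
# 100K-image budget would be split approximately 49K night / 51K day:
# allocate_sim_images(train_metadata, target_class="deer", n_sim=100_000)
```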
4.3. Comparing simulated data generation methods

We compare the performance gain from 4 methods of data synthesis, using 100K added deer images for each (see Fig. 7). The animal model is controlled (each simulated set uses the same GiM deer model for these experiments) for fair comparison of the efficacy of each generation method. As an additional control, we consider oversampling the rare class. This creates the same sampling prior towards deer without introducing any new information. Oversampling performs worse than just training on the unbalanced training set by causing the model to overfit the deer class to the training images. By manually segmenting out the deer in the 44 training images and randomly pasting them onto empty backgrounds we see a large improvement in performance. Cis error goes down to 6% with this method of data augmentation, which makes sense in view of the strong similarities between the training and cis-test data (see Fig. 2). Real on Empty and Sim on Empty are able to approximate both "day" and "night" imagery: a deer pasted onto a nighttime empty image is actually a reasonable approximation of an animal illuminated by a flash at night (see Fig. 1). They also have the additional benefit of using backgrounds from both cis and trans sets, giving them trans information not provided by the simulated datasets. TrapCam-Unity with all variability enabled is our best-performing model without requiring additional segmentation annotations. If segmentation information is available, Real on Empty combined with TrapCam-Unity (50K of each) improves both cis and trans deer performance: trans deer error decreases to 36% (a 54% decrease compared to CCT only), with < 2% increase in error on trans other classes.

Figure 7: Error as a function of simulated data generation method: 100K simulated deer images. Per-class error is calculated as in Fig. 4. Trans+ deer performance is highlighted. Oversampling decreases performance, and there is a large boost in performance from incorporating real segmented animals on different backgrounds (Real on Empty). TrapCam-Unity with everything allowed to vary (model, lighting, pose, including nighttime simulation) gives us slightly better trans+ performance, without requiring additional segmentation annotations. Combining Real on Empty with TrapCam-Unity (50K of each) gives us the best trans+ deer performance.
4.4. Visualizing the representation of data

In order to visualize how the network represents simulated data vs. real data, we use PCA and tSNE [46] to cluster the activations of the final pre-logit layer of the network. These visualizations can be seen in Fig. 8. Interestingly, the model learns "deer" bimodally: simulated deer are clustered almost entirely separately from real deer, with a few datapoints of each ending up in the opposite cluster. Even though those clusters overlap only slightly, the network is surprisingly able to classify more deer images correctly.
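A minimal sketch of this embedding procedure (200-dimensional PCA followed by 2-dimensional t-SNE, as described in the Figure 8 caption; implemented here with scikit-learn, and the feature dimensions in the usage comment are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_activations(features):
    """Project pre-logit activations to 2D: 200-dim PCA, then 2-dim t-SNE.
    `features` is an (N, D) array of final pre-logit activations."""
    reduced = PCA(n_components=200).fit_transform(features)
    return TSNE(n_components=2).fit_transform(reduced)

# Illustrative usage on random features standing in for real activations:
# xy = embed_activations(np.random.randn(5000, 1536))
```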

Figure 8: Visualization of network activations, shown (a) with no simulated deer and (b) with 1.4M simulated deer: more deer are classified correctly as we add synthetic data, despite the synthetic data being clustered separately. The pink points are real deer, the brown are simulated day images, and the grey are simulated night images. Large markers are points that are classified correctly, while small markers are points classified incorrectly. The plots were generated by running 200-dimensional PCA over the activations at the last pre-logit layer of the network when running inference on the test sets, and then running 2-dimensional tSNE over the resulting PCA embedding.

5. Conclusions and Future Work

We present two fast, realistic natural world data simulators based on popular 3D game development engines. Our simulators have 3 major advantages. First, they are generalizable. Thanks to the abundant 3D assets available online in the game development community, integrating a new species in a new environment from off-the-shelf assets is simple and fast. Second, not only are the graphics near-photorealistic, the pipeline also generates animals with realistic pose, animation, and interactions with the environment. Third, data generation is efficient. A single gaming PC generates over 300,000 full-HD images with pixel-level instance annotation per day, and the throughput scales linearly with additional machines.

We explore using the simulated data to augment rare classes during training. Towards this goal, we compare multiple sources of natural-world data simulation, explicitly measure generalization via the cis-vs-trans paradigm, examine trade-offs in performance as the number of simulated images seen during training is increased, and analyze the effect of controlling for different axes of variation and data generation methods.
From our experiments we draw three main lessons. First: using synthetic data can considerably reduce error rates for classes that are rare, and with segmentation annotations we can reduce error rates even further by additionally randomly pasting segmented images of rare classes on empty background images. Second: as the amount of simulated data is increased, accuracy on the target class improves. However, with 1000x more simulated data than the common classes, we see negative effects on the performance of other classes due to the high class imbalance. Third: the variation of simulated data generated is very important, and maximum variation provides maximum performance gain.

While an increase in simulated data corresponds to an increase in target class performance, the representation of simulated data overlaps only rarely with real data (see Fig. 8). It remains to be studied whether embedding techniques [59], domain adaptation techniques [23, 75], or style transfer [24, 61] could be used to encourage a higher overlap in representation between the synthetic and real data, and if that overlap would lead to an increase in categorization accuracy. Additionally, the bias induced by adding large amounts of simulated data could be addressed with algorithmic solutions such as those in [14, 19, 30, 29]. We have not discussed the drawbacks related to model training with large quantities of synthetic data (epoch time, data storage, etc.). In the future, we will explore merging the simulator and classifier so that highly variable synthetic data could be requested "online" without storing raw frames.

Simulation is a fast, interpretable, and controllable method of data generation that is easy to use and easy to adapt to new classes. This allows for an integrated and evolving training pipeline with new classes of interest: simulated data can be generated iteratively based on needs or gaps in performance. Our analysis suggests a general methodology when using simulated data to improve rare-class performance: 1) generate small, variable sets of simulated data (even small sets can drive improvement), 2) add these sets to training and analyze performance to determine ideal ratios and dimensions of variation, 3) take advantage of the ease and speed of generation to create an abundance of data based on this ideal distribution, and determine an operating point for the number of added simulated images that optimizes performance between the rare target class and the other classes based on the project goal.

Further, the performance gains we have demonstrated, along with the data generation tools we contribute to the community, will allow biodiversity researchers focused on endangered species to improve classification performance on their target species. Adding each new species to the simulation tools currently requires the assistance of a graphics artist. However, automated 3D modeling techniques, such as those proposed in [37, 54, 13, 48], might eventually become an inexpensive and practical source of data to improve few-shot learning.

The improvement we have found in rare-class categorization is encouraging, and the release of our data generation tools and the data we have generated will provide a good starting point for other researchers studying imbalanced data, simulated data augmentation, or natural-world domains.

6. Acknowledgements

We would like to thank the USGS and NPS for providing data. This work was supported by NSF GRFP Grant No. 1745301; the views are those of the authors and do not necessarily reflect the views of the NSF. Compute was provided by Microsoft AI for Earth and AWS.
References

[1] Epic studios. http://epicstudios.com/. Accessed: 2019-03-21. 4
[2] Forest animals by GiM. https://www.unrealengine.com/marketplace/en-US/animals-vol-01-forest-animals. Accessed: 2019-03-21. 4
[3] Quixel megascans library. https://quixel.com/megascans. Accessed: 2019-03-21. 4
[4] Speedtree. https://store.speedtree.com/. Accessed: 2019-03-21. 4
[5] Unity book of the dead. https://unity3d.com/book-of-the-dead. Accessed: 2019-03-21. 4
[6] Unity game engine. https://unity3d.com/. Accessed: 2019-02-05. 2
[7] Unreal game engine. https://www.unrealengine.com/en-US/what-is-unreal-engine-4. Accessed: 2019-02-05. 2, 4
[8] S. Beery, G. Van Horn, and P. Perona. Recognition in terra incognita. In The European Conference on Computer Vision (ECCV), September 2018. 1, 2, 3, 5
[9] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012. 5
[10] E. Bondi, D. Dey, A. Kapoor, J. Piavis, S. Shah, F. Fang, B. Dilkina, R. Hannaford, A. Iyer, L. Joppa, et al. Airsim-w: A simulation environment for wildlife conservation with uavs. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies, page 40. ACM, 2018. 3
[11] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722–3731, 2017. 2
[12] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018. 2
[13] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? building 3d morphable models from 2d images. IEEE transactions on pattern analysis and machine intelligence, 35(1):232–244, 2013. 8
[14] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. arXiv preprint arXiv:1901.05555, 2019. 8
[15] Y. Cui, F. Zhou, Y. Lin, and S. Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1153–1162, 2016. 2
[16] C. R. de Souza, A. Gaidon, Y. Cabon, and A. M. López. Procedural generation of videos to train deep action recognition networks. 2017. 3
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009. 2
[18] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017. 3
[19] C. Elkan. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, volume 17, pages 973–978. Lawrence Erlbaum Associates Ltd, 2001. 2, 8
[20] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017. 1
[21] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. 2
[22] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4340–4349, 2016. 3
[23] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015. 8
[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. 1, 3, 8
[25] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015. 2
[26] S. Han, A. Fafard, J. Kerekes, M. Gartley, E. Ientilucci, A. Savakis, C. Law, J. Parhan, M. Turek, K. Fieldhouse, et al. Efficient generation of image chips for training deep learning algorithms. In Automatic Target Recognition XXVII, volume 10202, page 1020203. International Society for Optics and Photonics, 2017. 1, 3
[27] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proc. of IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, 2017. 2
[28] H. Hattori, V. N. Boddeti, K. Kitani, and T. Kanade. Learning scene-specific pedestrian detectors without real data. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3819–3827. IEEE, 2015. 1, 3
[29] H. He, Y. Bai, E. A. Garcia, and S. Li. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1322–1328. IEEE, 2008. 2, 8
[30] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9):1263–1284, 2008. 2, 8
[31] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2
[32] S. Hinterstoisser, O. Pauly, H. Heibel, M. Marek, and M. Bokeloh. An annotation saved is an annotation earned: Using fully synthetic training for object instance detection, 2019. 1, 3
[33] A. G. Howard. Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402, 2013. 2
[34] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE CVPR, 2017. 2
[35] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016. 2
[36] S. Ji, Y. Shen, M. Lu, and Y. Zhang. Building instance change detection from large-scale aerial images using convolutional neural networks and simulated samples. Remote Sensing, 11(11):1343, 2019. 1, 3
[37] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–386, 2018. 8
[38] E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: an interactive 3d environment for visual AI. CoRR, abs/1712.05474, 2017. 3
[39] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017. 2
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 2
[41] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. Lopez, and J. V. B. Soares. Leafsnap: A computer vision system for automatic plant species identification. In The 12th European Conference on Computer Vision (ECCV), October 2012. 2
[42] F.-F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006. 2
[43] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018. 2
[44] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 2
[45] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4990–4998, 2017. 2
[46] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008. 7
[47] M. S. Norouzzadeh, A. Nguyen, M. Kosmala, A. Swanson, C. Packer, and J. Clune. Automatically identifying wild animals in camera trap images with deep learning. arXiv preprint arXiv:1703.05830, 2017. 1, 5
[48] F. Pahde, M. Puscas, J. Wolff, T. Klein, N. Sebe, and M. Nabi. Low-shot learning from imaginary 3d model. arXiv preprint arXiv:1901.01868, 2019. 8
[49] X. Peng, B. Usman, K. Saito, N. Kaushik, J. Hoffman, and K. Saenko. Syn2real: A new benchmark for synthetic-to-real visual domain adaptation. arXiv preprint arXiv:1806.09755, 2018. 1, 3
[50] B. Pepik, R. Benenson, T. Ritschel, and B. Schiele. What is holding back convnets for detection? In German Conference on Pattern Recognition, pages 517–528. Springer, 2015. 1, 3
[51] R. Poplin, A. V. Varadarajan, K. Blumer, Y. Liu, M. V. McConnell, G. S. Corrado, L. Peng, and D. R. Webster. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering, page 1, 2018. 1
[52] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 2
[53] P. S. Rajpura, H. Bojinov, and R. S. Hegde. Object detection using deep cnns trained on synthetic images, 2017. 1, 3
[54] B. Reinert, T. Ritschel, and H.-P. Seidel. Animated 3d creatures from single-view video by skeletal sketching. In Graphics Interface, pages 133–141, 2016. 8
[55] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. Lecture Notes in Computer Science, pages 102–118, 2016. 1, 3
[56] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016. 3
[57] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 1
[58] M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931, 2017. 3
[59] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. 8
[60] S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pages 621–635. Springer, 2018. 4
[61] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 1, 3, 8
[62] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3
[63] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 5
[64] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016. 2
[65] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012. 5
[66] T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems, pages 2797–2806, 2017. 2
[67] G. van Horn, J. Barry, S. Belongie, and P. Perona. The Merlin Bird ID smartphone app (http://merlin.allaboutbirds.org/download/). 1
[68] G. Van Horn, O. Mac Aodha, Y. Song, A. Shepard, H. Adam, P. Perona, and S. Belongie. The inaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642, 2017. 1, 2, 3, 6
[69] G. Van Horn and P. Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017. 1
[70] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 1, 3, 5
[71] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision, pages 616–634. Springer, 2016. 2
[72] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, Y. Wang, and A. Yuille. Unrealcv: Virtual worlds for computer vision. ACM Multimedia Open Source Software Competition, 2017. 2
[73] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209, 2018. 3
[74] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017. 2
[75] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 289–305, 2018. 8
