
Synthetic Data for Object Classification in Industrial Applications

August Baaz 1, Yonan Yonan 1, Kevin Hernandez-Diaz 1,a, Fernando Alonso-Fernandez 1,b, Felix Nilsson 2
1 School of Information Technology (ITE), Halmstad University, Sweden
2 HMS Industrial Networks AB, Halmstad, Sweden
[email protected], [email protected], {kevher, feralo}@hh.se, [email protected]
a https://orcid.org/0000-0002-9696-7843
b https://orcid.org/0000-0002-1400-346X

Keywords: Synthetic Data, Object Classification, Machine Learning, Computer Vision, ResNet50

Abstract: One of the biggest challenges in machine learning is data collection. Training data is an important part since it determines how the model will behave. In object classification, capturing a large number of images per object and in different conditions is not always possible and can be very time-consuming and tedious. Accordingly, this work explores the creation of artificial images using a game engine to cope with limited data in the training dataset. We combine real and synthetic data to train the object classification engine, a strategy that has been shown to be beneficial in increasing the confidence of the decisions made by the classifier, which is often critical in industrial setups. To combine real and synthetic data, we first train the classifier on a massive amount of synthetic data, and then we fine-tune it on real images. Another important result is that the number of real images needed for fine-tuning is not very high, reaching top accuracy with just 12 or 24 images per class. This substantially reduces the requirement of capturing a large amount of real data.

1 INTRODUCTION

Popularized since 2015, Industry 4.0 (Xu et al., 2021) refers to integrating Computer Vision (CV), Artificial Intelligence (AI), Machine Learning (ML), the Internet of Things (IoT), and cloud computing into industrial processes. Some significant changes of Industry 4.0 are increased automation, self-optimization, and predictive maintenance. For example, object detection and image classification could significantly benefit industrial scenarios. Models need training data to learn, and the quality and quantity of such data are the most crucial factors in obtaining a reliable model. However, collecting data can be challenging and costly.

This research explores methods to minimize the data collection needed to train object recognition and classification. We aim to develop a system to recognize industrial products using a camera. It could monitor production lines and reduce repetitive human workload in tasks such as sorting, inventory keeping, and quality control. We use the ResNet50 (He et al., 2016) Convolutional Neural Network (CNN) as the classification architecture, in conjunction with methods to reduce the amount of data needed, exploring possibilities other than manually collecting a large number of images per class. Our most important contribution is the use of synthetic data rendered with a game engine. Synthetic data is then combined with real data, demonstrating by experiments that the classification network not only keeps a good accuracy but also increases its confidence in classifying the different objects.

Figure 1: Target objects to be classified. Device 1: CompactCom M40 Module EtherNet/IP IIoT Secure. D2: Wireless Bridge II Ethernet. D3: Communicator PROFINET IO-Device Modbus TCP server. D4: Edge Gateway with Switch. D5: X-gateway Modbus Plus Slave PROFINET-IRT Device. D6: Communicator PROFINET-IRT. D7: Edge Essential Sequence. D8: Anybus PROFINET to .NET Bridge. All devices can be found at www.anybus.com.

This project is a collaboration of Halmstad University with HMS Networks AB in Halmstad. HMS makes products that enable industrial equipment to communicate over various industrial protocols (HMS, 2022). They explore emerging technologies, and one crucial technology is AI, where they want to examine different applications of AI and vision technologies, e.g. (Nilsson et al., 2020), which may become part of future products. As shown in Figure 1, HMS products have simple shapes, although the system is potentially applicable to other products in the industry where sorting and flow control are needed.

2 RELATED WORKS

2.1 Object Classification

Image classification is a well-known CV field applied to various tasks (Al-Faraj et al., 2021). A CNN-based visual sorting system can be used in an inventory or a warehouse where items lack other tokens, such as a damaged barcode or an unreadable tag (Wang et al., 2020). Tailored to retail, (Femling et al., 2018) identified fruit and vegetables with a video camera attachable to a scale, which could relieve customers and cashiers of navigating through a menu.

A visual-based system is also beneficial for quality control in manufacturing. An operator can get tired after many quality checks and thus misclassify products. To avoid that, (Hachem et al., 2021) implemented ResNet50 for automatic quality control.

In recycling, waste has to be sorted to be recycled properly. This has been studied in (Gyawali et al., 2020) using CNNs, achieving an accuracy of 87%. Similarly, (Persson et al., 2021) developed a method to sort plastics from Waste from Electrical and Electronic Equipment (WEEE).

Surveillance is another field. (Jung et al., 2017) detected (using YOLOv4) and classified (using ResNet) various vehicle types, including cars, bicycles, buses and motorcycles. Similarly, (Svanstrom et al., 2021) developed a drone detector via sensor fusion, being able to distinguish drones from other typical objects, such as airplanes, helicopters, or birds.

2.2 Synthetic Data

Ship classification from overhead imagery is a largely unsolved problem in the maritime domain. The main issue is the lack of ground truth data. (Ward et al., 2018) addressed this by building a large-scale synthetic dataset using the Unity game engine and 3D models of ships, demonstrating that synthetic data increases performance dramatically while reducing the amount of real data required to train the models.

For car surveillance, game engines such as Grand Theft Auto V are an excellent way to generate real-looking synthetic images (Richter et al., 2016). (Tremblay et al., 2018) applied this, achieving an average precision of 79%, which is similar to (Jung et al., 2017) with real data. Thus, it is safe to say that similar results can be achieved training with synthetic data, with the advantage that it is far easier to collect. Also, in (Tremblay et al., 2018), the results of synthetic data far exceed the results of real data after fine-tuning with as few as 100 images.

This work is about sorting industrial products. We can assume that they have a CAD file used in their manufacturing process. 3D scanning is also an effective way. If the object cannot be 3D scanned and does not have a CAD file, Generative Adversarial Networks (GANs) can be used. GANs artificially create similar data using a discriminator that checks whether the feature distribution of the generated data looks close to that of the real data. Some notable GANs are StyleGAN for face generation (Karras et al., 2021) and CycleGAN (Zhu et al., 2017), which allows translating an image from one domain to another (e.g. indoor to outdoor, summer to winter, etc.).

Figure 2: Example of images from the different datasets.

3 METHODOLOGY

3.1 Data Acquisition and Synthesis

An overview of the different datasets created for this work is given in Table 1. Several HMS products are chosen as target objects (Figure 1).
Table 1: Datasets created for this work. The indicated devices are shown in Figure 1.

Initial stage of our research
Name       Data   Devices       Classes   Images/class   Scale    Rotation   Notes
TrainSet   Real   3,4,5,6,7,8   6         96             Random   Random     Light on/off
TestSet    Real   3,4,5,6,7,8   6         30             Random   Random     Cluttered background

Later stage of our research
Name            Data     Devices     Classes   Images/class   Scale    Rotation   Notes
TrainSet        Real     1,2,3,4,5   5         96             Random   Random     Light on/off
RedSet          Real     1,2,3,4,5   5         48             Random   Random     Light on, red background
CutSet          Real     1,2,3,4,5   5         48             Fixed    Random     RedSet with fixed scale + transparent background
TestSet         Real     1,2,3,4,5   5         30             Random   Random     Cluttered background
TestSet2        Real     1,2,3,4,5   5         30             Fixed    Random     TestSet with fixed scale
SynthVarSet     Synth.   1,2,3,4,5   5         2000           Random   Random     Tries to recreate TrainSet conditions
SynthFixedSet   Synth.   1,2,3,4,5   5         2000           Fixed    Random     Like SynthVarSet but with fixed scale

They are mostly routers and switches for industrial machines that HMS sells. Our research was conducted in two stages. In the initial one, we started to build a dataset of training and test data with devices 3 to 8. However, we did not have access to CAD files of all these initially used products. For this reason, we later employed devices 1 to 5, for which CAD files were available, enabling the creation of 3D synthetic data. A few experiments were conducted on the initial dataset before switching to the later one, and they were not re-run, so we keep the description of both datasets here. In the experimental section, we will make clear which one is being used in each particular experiment.

Real images of each product were captured using a smartphone with 4K resolution, creating the following datasets (see Figure 2):

• TrainSet: using the smartphone on a tripod to simulate a stationary camera, with 96 images per class. Each object is rotated on all sides, with lights on/off to vary the illumination in the room.
• RedSet: created exactly like TrainSet, except that it has a red backdrop that can be segmented digitally. This dataset has 48 pictures per class since it was only captured with lights on.
• CutSet: created from RedSet by segmenting the background and normalizing the object scale. The segmentation mask is created by applying thresholds to the HSV channels. Morphological opening and closing are also applied to clean artifacts. The background is transparent, allowing the addition of different backgrounds as desired to simulate different environments. In our experiments, it is replaced with a random RGB color during runtime (a processing sketch is given after this list).
• TestSet: for evaluation, with 30 images per class. It has cluttered backgrounds and a non-stationary camera, so the target objects may not be the only objects in the image.
• TestSet2: created by manually cutting the objects of interest out of TestSet so that they fill the entire frame.
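As an illustration of the CutSet processing described above, the following Python/OpenCV sketch segments the red backdrop with HSV thresholds, cleans the mask with morphological opening and closing, stores the object with a transparent background, and replaces that background with a random RGB colour at runtime. The threshold values, kernel size and function names are our assumptions, not the authors' implementation; the bounding-box crop that normalizes the object scale is omitted for brevity.

```python
import cv2
import numpy as np

def cut_out_object(bgr_image):
    """Segment the object from a red backdrop and return an RGBA image
    with a transparent background (rough sketch of the CutSet processing)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Placeholder HSV thresholds for a red backdrop (red wraps around hue 0).
    lower1, upper1 = np.array([0, 70, 50]), np.array([10, 255, 255])
    lower2, upper2 = np.array([170, 70, 50]), np.array([180, 255, 255])
    backdrop = cv2.inRange(hsv, lower1, upper1) | cv2.inRange(hsv, lower2, upper2)
    mask = cv2.bitwise_not(backdrop)  # object = everything that is not backdrop
    # Morphological opening and closing to clean small artifacts.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    rgba = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2BGRA)
    rgba[:, :, 3] = mask  # alpha channel is zero on the former backdrop
    return rgba

def random_background(rgba_image):
    """At training time, replace the transparent background with a random RGB colour."""
    color = np.random.randint(0, 256, size=3, dtype=np.uint8)
    alpha = rgba_image[:, :, 3:4] / 255.0
    composite = rgba_image[:, :, :3] * alpha + color * (1.0 - alpha)
    return composite.astype(np.uint8)
```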
Synthetic data is also generated, using Unity's Universal Render Pipeline (URP). It takes 16 ms to generate one image, or 3750 images per minute, which is a fast and reliable way to generate datasets in minutes. A Unity scene has been built with a room that can be randomized. The camera that renders the scene can be programmed to focus anywhere in the room, and its distance to the object can be varied. The intensity and rotation of the lighting can also be randomized, and the background can be replaced too. Unity offers a library called Perception (Borkman et al., 2021). Using this package, the images for the synthetic datasets are artificially generated. The same seed is provided for every product to generate the same random scene for every object, as can be seen in Figure 3.

A 3D model of the objects has to be obtained for Unity to render images. If the products have a CAD file, it can be converted into a 3D model. HMS provided us with all the CAD files for the products. With this, two synthetic datasets are created, each with 2000 images per class (see Figure 2). Every synthetic dataset also uses the same seed to generate the same randomness for every object:

• SynthVarSet: created with varying distances between the object and the camera to simulate different scaling, and with randomized orientation, light direction and background. This dataset tries to simulate and recreate TrainSet.
• SynthFixedSet: exactly like SynthVarSet, with randomized rotations and backgrounds. The difference is that the distance between the objects and the camera is fixed, so that every object fills the frame, thus normalizing the scale.
Figure 3: Synthetic images of device2 and device4 generated with the same seed.

3.2 System Overview

This research has developed an AI to classify industrial products (Figure 1) with a camera. This is about finding the type of object (class or category) that appears in the image. This significantly differs in complexity depending on the specific scenario. For this work, the following limitations are considered: i) the camera is stationary, located to one side, and angled towards the table where objects are located; ii) the camera is in colour; iii) the objects are in focus; iv) the table is well-lit, and the objects are visible; and v) only one product needs to be identified at a time. The objective is to identify products reliably (measured by accuracy on a test set not seen during training) regardless of their orientation or scale.

The model architecture is based upon ResNet50 pre-trained on ImageNet as a feature extractor. The network is connected to a single fully connected layer with dropout, followed by a five/six-neuron layer (the number of classes in our datasets) with softmax activation. The original ResNet50 has two fully connected layers of 4096 each, which make up a large portion of the weights (Reddy and Juliet, 2019). Using only a single layer at the end of ResNet50 is entirely arbitrary but common for transfer learning with ResNet. During training, we will test the optimal size of this fully connected layer, as well as the number of early layers that are frozen. Generally speaking, early layers find simple patterns that are general for vision tasks, such as lines or shapes, and they can be kept frozen. On the other hand, later layers find more complex patterns that are specific to each task (Yosinski et al., 2015), so it is expected to benefit accuracy to re-train these last layers on the task-specific data.
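The following Keras sketch shows one way to realize this architecture (ImageNet-pretrained ResNet50 backbone, a single fully connected layer with dropout, and a softmax layer with one neuron per class). The function name, input size, optimizer and default values are our assumptions; the end layer size, dropout rate and number of unfrozen layers are exposed as parameters because they are precisely what Sections 4.1.2 and 4.1.3 tune.

```python
import tensorflow as tf

def build_classifier(num_classes=5, end_layer_size=128,
                     unfrozen_layers=64, dropout_rate=0.0):
    """ResNet50 feature extractor + one fully connected layer + softmax head."""
    base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                          pooling="avg", input_shape=(224, 224, 3))
    # Keep the early layers frozen; only the last `unfrozen_layers` are re-trained.
    for layer in base.layers[:-unfrozen_layers]:
        layer.trainable = False
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(end_layer_size, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```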
4 EXPERIMENTS AND RESULTS

Many parameters affect the training and performance of a machine-learning model like ours. A series of experiments were done to find the best model parameters. The learning rate highly depends on other factors, so we tune it up and down in all tests to do a grid search. The results reported in each sub-section correspond to the learning rate that produces the best numbers. The batch size is kept constant at 64, except in Section 4.1.4, where the experiments demand changing this value.

The main evaluation metric employed is Accuracy, given by the fraction between correctly classified trials and the total amount of trials. For a given object (class) to be classified, we also employ i) Precision (P), the fraction between True Positives (number of correctly detected objects of the class) and the total amount of trials labeled as belonging to the class, and ii) Recall (R), the fraction between True Positives and the total amount of trials that belong to the class. Precision measures the proportion of trials labeled as a given class that are really objects of that class, whereas Recall measures the proportion of objects of a given class that are correctly associated with that class. A single measure summarizing P and R is the F1-score, their harmonic mean, computed as F1 = 2 x (P x R) / (P + R). Another tool is the confusion matrix, a table that provides the model predictions (x-axis) against the true class of an object (y-axis).
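These metrics can be computed directly from the predicted and true labels, e.g. with scikit-learn; a brief sketch (the function and variable names are ours):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def summarize(y_true, y_pred):
    """Metrics used in this section; y_true/y_pred are class labels per test trial."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=None),  # per class
        "recall": recall_score(y_true, y_pred, average=None),        # per class
        "f1": f1_score(y_true, y_pred, average=None),   # 2*(P*R)/(P+R) per class
        "confusion": confusion_matrix(y_true, y_pred),  # rows: true, cols: predicted
    }
```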
Figure 4: Example of data augmentation methods (top: isolated effect, bottom: combined effect on a single image).

4.1 Finding the Best Configuration

4.1.1 Data Augmentation

We first test different data augmentation methods (Figure 4) to see if they help to combat over-fitting and allow the models to better generalize against light and camera changes. These data augmentation experiments are the only ones carried out on the initial dataset with six classes that we gathered (Table 1, top). In all the remaining sections, the later dataset with five classes is employed. Because of that, the test results cannot be compared directly, but we believe that the conclusion of this sub-section, i.e. to use data augmentation, is still valid.

Experiments of this sub-section are done on real data, using TrainSet/TestSet as train/test sets, with 80% of TrainSet used for actual training and 20% for validation to stop training. Rotation (360°), cropping (0-30% on all sides), brightness change (50-120%), and zooming (100-150%) are used. For these experiments, the network is left with 32 unfrozen layers at the end and a fully connected layer (before softmax) of 256 elements. The obtained accuracy without/with data augmentation is 73.9/84.4%. Without data augmentation, most misclassified images are objects that appear far away (small scale). This suggests that zooming may have a significant effect on model performance. It is well known that CNNs often struggle to identify objects at different scales (Liu et al., 2018). However, the model was trained with all data augmentation methods applied together, so the effect of individual changes was not explored. Given these results, all subsequent models in the project were trained with data augmentation.
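A rough Keras equivalent of this augmentation policy is sketched below. The parameter values mirror the ranges listed above, but the mapping (e.g. random shifts standing in for cropping) and the directory layout are assumptions, since the paper does not specify the implementation.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Approximate mapping of the augmentations listed above; the authors' exact
# implementation and parameter values are not given in the paper.
datagen = ImageDataGenerator(
    rotation_range=360,           # random rotation over the full circle
    width_shift_range=0.3,        # stand-in for the 0-30% cropping on all sides
    height_shift_range=0.3,
    brightness_range=(0.5, 1.2),  # 50-120% brightness change
    zoom_range=(0.67, 1.0),       # magnification up to ~150% (factors < 1 zoom in)
    fill_mode="nearest",
    validation_split=0.2,         # 80/20 split of TrainSet for training/validation
)

# Assumed directory layout: TrainSet/<class_name>/<image>.jpg
train_iter = datagen.flow_from_directory("TrainSet/", target_size=(224, 224),
                                         batch_size=64, class_mode="sparse",
                                         subset="training")
val_iter = datagen.flow_from_directory("TrainSet/", target_size=(224, 224),
                                       batch_size=64, class_mode="sparse",
                                       subset="validation")
```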
Table 2: Left: effect of changing the end layer size (unfrozen layers set to 32). Right: effect of changing the number of unfrozen layers (end layer size set to 128).

(unfrozen layers = 32)           (end layer size = 128)
End layer size   Accuracy        Unfrozen layers   Accuracy
64               63.3            32                75.3
128              75.3            48                62.7
256              70.7            64                72.7
512              61.3            96                76.7
                                 128               53.3

4.1.2 ResNet Model Setup

Here, we test the optimal number of unfrozen layers at the end of ResNet50, and the size of the fully connected layer. Experiments are done on real data, using TrainSet/TestSet as train/test sets and an 80/20% split for actual training/validation. The more layers are left unfrozen, the more a network is prone to over-fitting if little training data is available, while too few unfrozen layers may make the model converge towards a lower accuracy. The size of the end layer also has an impact on the network training time and accuracy. Starting with 32 unfrozen layers and an end layer size of 512, we first reduce the size of the end layer by a factor of 2 (Table 2, left). A too large end layer size is seen to negatively affect accuracy. The model shows a significant improvement by decreasing the layer size down to 128; beyond that, accuracy decreases again. We then set the size of the end layer to 128 and progressively increase the number of unfrozen layers from 32 (Table 2, right). The best result is obtained with 96 layers unfrozen (55% of the network), and going beyond that hurts accuracy. The difference between 96 and 64 unfrozen layers is ~4%, but the model with 64 unfrozen layers (37% of the network) has been observed to be more consistent and less prone to over-fitting. Thus, 64 unfrozen layers and an end layer of 128 are identified as the optimal settings of this subsection, and will be used for all subsequent models.
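The two-stage search of Table 2 can be expressed as a simple loop around the build_classifier sketch given earlier; an illustrative outline only, assuming the train/validation/test iterators from the previous sketches.

```python
# Two-stage search as in Table 2: first vary the end layer size with 32 unfrozen
# layers, then vary the number of unfrozen layers with the end layer size fixed.
# build_classifier, train_iter, val_iter and a test_iter over TestSet are assumed
# to exist as in the previous sketches; the epoch count is illustrative.
def train_and_score(end_layer_size, unfrozen_layers, epochs=30):
    model = build_classifier(num_classes=5, end_layer_size=end_layer_size,
                             unfrozen_layers=unfrozen_layers)
    model.fit(train_iter, validation_data=val_iter, epochs=epochs, verbose=0)
    return model.evaluate(test_iter, verbose=0)[1]  # [loss, accuracy] -> accuracy

stage1 = {size: train_and_score(size, 32) for size in (512, 256, 128, 64)}
best_size = max(stage1, key=stage1.get)             # 128 in Table 2 (left)
stage2 = {n: train_and_score(best_size, n) for n in (32, 48, 64, 96, 128)}
```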
Table 3: Effect of changing dropout.

Dropout   Accuracy
0%        83.3
20%       60.7
40%       67.3
60%       65.3
80%       76.0
100%      43.3

4.1.3 Dropout

Dropout regularisation can help to make models more general and reduce over-fitting, but too high a dropout might make the model converge towards a lower accuracy. We start with a dropout rate of 0% applied after the fully connected layer, and then increase it in steps of 20% (Table 3). Experiments are done on real data, using TrainSet/TestSet as train/test sets and an 80/20% split for actual training/validation. The best result is obtained when no dropout at all is applied, yielding an accuracy of 83.3% on TestSet. One possible explanation of this result is that we are using a feature vector of only 128 elements to classify just 5 classes, so dropout is not providing any tangible benefit. This is quite small compared to feature vectors of 2048 or 4096 elements, which are common in CNN architectures, followed by a classification layer of 1000 elements (e.g. in ImageNet).

Table 4: Effect of reducing the size of the training set.

Images per class   Training images per class   Batch size   Accuracy %
12                 10                           8            62.0
24                 20                           16           72.7
48                 39                           32           73.3
96                 77                           64           72.7

4.1.4 Reduced Training Data

The TrainSet employed in previous subsections has 96 images per class. In an operational industrial system, taking such an amount of images per object that needs to be sorted may be inconvenient. In this subsection, we reduce the size of the training set to 48, 24 and 12 images per class to assess the practicality of capturing fewer images versus its effect on accuracy. Accuracy is computed on TestSet. As we decrease the amount of images per class, the mini-batch size is also reduced, since models tend to learn badly and overfit when the mini-batch size is too large compared with the dataset size. Results are given in Table 4. The change in accuracy between 96, 48 and 24 images is negligible, which is positive for our purposes.
Figure 5: Synthetic model trained on SynthVarSet (left) or SynthFixedSet (right) and tested on TestSet with different scales. Confusion matrices (true class vs. prediction) are shown for the SynthVarSet model with no zoom (accuracy 55.3%), zoom out 50% (52.7%) and zoom in 50% (58.7%), and for the SynthFixedSet model (accuracy 76%).

Table 5: Effect of fine-tuning the model trained with synthetic images of SynthFixedSet with real images of TrainSet.

Dataset size   Accuracy %   Improvement
96             79.3         3.3
48             80.7         4.7
24             80.0         4.0
12             79.3         3.3
4.1.5 Synthetic Data

Products that need to be sorted in the manufacturing industry likely have CAD models. Synthetic data generated from such models can be an alternative to taking pictures of each object. A synthetic dataset can be made many times larger, which can help against over-fitting and make the models generalize better.

We first train the model on SynthVarSet and compute accuracy results on TestSet (Figure 5, left part). The first sub-element ('no zoom') shows the results with the original SynthVarSet and TestSet datasets. As can be seen, the model struggles with specific objects (Device 2 and especially Device 1), which turn out to be the smallest objects (see Figure 1). Recall that in these two datasets, the objects appear with variable scale (Figure 2 and Table 1). To check this scale issue, the models were tested again on TestSet with images zoomed out and in by 50%. As can be seen, different devices get better or worse under zooming out or in. For example, Device 3 gets better when zooming in and worse when zooming out, while Device 5 behaves the opposite way. Also, Device 1 (the smallest one) is misclassified most of the time, no matter which option is used. The overall accuracy also increases slightly with zooming in. Since the objects to be detected fill a bigger portion of the image, the network may be able to detect them better.

The model is then trained on SynthFixedSet (Figure 5, right). SynthFixedSet is generated in a way that the scale is fixed, with objects filling the entire frame. As can be seen, this training provides the best balanced accuracy among all objects, and the best overall accuracy. Especially Device 1 (the smallest object) is brought to an accuracy similar to the other devices, very likely because the object now occupies a bigger portion of the training images, so the network can better 'see it' when it appears smaller on images of TestSet. The overall accuracy on TestSet is 76%, with the model trained on a synthetic dataset with 2000 images per object. From the experiments of Section 4.1.4, this size could likely be smaller without hurting performance, although such an option has not been tested.

4.1.6 Real and Synthetic Data Mix

A model trained on a large synthetic dataset may learn patterns from the synthetic data that do not apply to reality. Fine-tuning the model on a small number of real images may increase the real-world performance. Another approach would be to combine synthetic and real data in a single training round. However, the size difference between our real and synthetic datasets is large, which could prevent the real data from affecting the results much if the images are mixed together.

To test our assumption, the best synthetic model of the previous section (76% accuracy on TestSet) has been retrained on TrainSet with a lower learning rate. We also carry out the same data reduction of Section 4.1.4 and evaluate sizes of 96, 48, 24 and 12 images per class. Results are shown in Table 5. As can be seen, this fine-tuning provides an extra accuracy improvement, even with a small number of images. The models with 24 and 12 images do not perform much worse than those with 96 and 48, so the synthetic model can be noticeably improved with a small handful of real images.
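A minimal sketch of this synthetic-then-real training schedule, reusing the hypothetical build_classifier and the data iterators assumed in the earlier sketches (iterator names, learning rates and epoch counts are illustrative, not the authors' values):

```python
import tensorflow as tf

# Stage 1: train on the synthetic dataset (iterators over SynthFixedSet assumed).
model = build_classifier(num_classes=5, end_layer_size=128, unfrozen_layers=64)
model.fit(synth_train_iter, validation_data=synth_val_iter, epochs=30)

# Stage 2: fine-tune on a small number of real images with a lower learning rate.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(real_train_iter, validation_data=real_val_iter, epochs=10)

# Evaluate on the real TestSet.
model.evaluate(test_iter)
```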
Table 6: Accuracy of the best three models on different sets.

Model        TestSet   TestSet2   RedSet
RealModel    83.3%     87.3%      89.6%
SynthModel   76.0%     84.7%      82.1%
SynthTuned   80.7%     87.3%      87.1%

Table 7: Accuracy of the best three models on the TestSet2 dataset when predictions below 70% confidence are discarded. 'Confident proportion' is the amount of images with at least 70% confidence. 'Accuracy of confident' is the accuracy after discarding images with less than 70% confidence.

Model        Confident images   Confident proportion   Accuracy of confident
RealModel    89 / 150           59.3%                  97.8%
SynthModel   88 / 150           58.7%                  95.5%
SynthTuned   107 / 150          71.3%                  96.3%
4.2 Best Models' Analysis

The best three models from the previous sections are brought here for further analysis. We name them as:

• RealModel: from the dropout experiments of Section 4.1.3, trained on real data (TrainSet).
• SynthModel: from Section 4.1.5, trained on synthetic data (SynthFixedSet).
• SynthTuned: from Section 4.1.6, trained on synthetic data (SynthFixedSet) and fine-tuned on real data (TrainSet).

Their accuracy on several sets is summarized in Table 6. As can be observed, performance on TestSet2 is significantly better than on TestSet, with a noticeable improvement for the models that use synthetic data during training. A better performance on TestSet2 can be expected since the objects occupy the entire image and the cluttered background is removed (see Figure 2). To test which of the two components (object size or background) affects performance the most, we also report results on RedSet, which has objects with variable scale but a uniform red background. The performance on RedSet appears to be on par with TestSet2, or even better with RealModel, suggesting that eliminating a cluttered background has a more significant impact than normalizing the scale.

Comparatively, SynthTuned (trained on massive synthetic data and fine-tuned on real data) has a performance on par with RealModel (trained on just real data), so one may question the utility of the employed synthetic data augmentation. However, accuracy does not tell the full story of how well a model performs. Even if an object is identified correctly, the confidence of the classifier in such a decision matters. Setting a threshold on confidence is likely how an object classifier would be used in many practical scenarios. To test the effect of such practice, we set a confidence threshold of 70%, so decisions below this threshold are considered 'unsure'. Disregarding objects below this confidence gives the results shown in Table 7. It can be seen that the number of trials on which the classifier is confident is substantially higher with SynthTuned, revealing an important benefit of adding synthetic data to the training set. The overall accuracy of the three models is in a similar range (96-98%), but with SynthTuned this entails a higher number of images that are actually classified correctly.
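Such a confidence threshold is straightforward to apply on top of the softmax output; a small sketch (function names are ours, the 70% value as used above):

```python
import numpy as np

def classify_with_threshold(model, images, threshold=0.70):
    """Return a predicted class per image, or None when the model is 'unsure'."""
    probs = model.predict(images)        # softmax outputs, shape (N, num_classes)
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    return [int(p) if c >= threshold else None
            for p, c in zip(predictions, confidences)]

def confident_accuracy(decisions, labels):
    """Proportion of confident trials and accuracy over that subset (as in Table 7)."""
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    proportion = len(kept) / len(labels)
    accuracy = sum(d == y for d, y in kept) / max(len(kept), 1)
    return proportion, accuracy
```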
5 CONCLUSIONS

This paper has studied the utility of adding synthetically-generated data to the training of object classification models. One way to artificially create images is with a game engine, and many of the most popular game engines provide libraries specifically for synthetic data (Borkman et al., 2021; Qiu and Yuille, 2016). We focus on industrial production settings, where CAD models are often accessible for manufactured parts, making it possible to generate 2D and 3D synthetic images of them. Synthetic images can be rendered very quickly and effortlessly compared to capturing real data, simulating a wide variability of viewpoints, illumination, scale, etc. In addition, the dataset can be auto-labeled, avoiding errors in manual annotation, and the object's position in the image is known at pixel precision. Synthetic generation also offers many features that can be very hard to obtain with real data, like 3D labeling, segmentation, and human keypoint labels (Borkman et al., 2021).

Here, we train a ResNet50 model pre-trained on ImageNet to classify five different objects (Figure 1). These are objects commercialized by the collaborating partner of this research, HMS Networks AB in Halmstad. A dataset with images of each object type from different viewpoints has also been acquired, both of real and synthetic images (Figure 2) and with different scales, illumination and backgrounds (Table 1). We have conducted different experiments to find the optimal setting of the classifier, including data augmentation, the number of frozen layers of the network, the size of the end layer, and dropout. We also evaluated the impact of reduced training data and the incorporation of synthetic data into the training set. The latter is done by training the classifier first on a massive amount of synthetic data, and then fine-tuning it on real data. Even if the overall accuracy of models trained with synthetic+real data is on par with models trained with real data only, it has been observed that the addition of synthetic data helps to increase confidence in classification on a significant number of test images. This is an important advantage in industrial settings, where high confidence in the decision is critical in many situations. Another important contribution is that the amount of real data needed to fine-tune the model to reach top accuracy is not very high (just 12-24 images per class), greatly alleviating the need to obtain a substantial number of real images.

Scale and cluttered backgrounds have been identified as two relevant issues. When the objects fill the entire image frame (thus removing the impact of the background) or the background is set to a constant in the test data, a performance improvement is observed (Table 6). Training on images where the object fills the entire frame has also been shown to cope with smaller objects in the test data that are otherwise misclassified frequently (Figure 5). This work has considered stationary objects in a relatively simple and well-lit environment. An obvious improvement likely to appear in industrial settings is to allow motion between the camera and the objects, e.g. due to conveyor belts. To do so, further research in the detection and segmentation of moving objects is necessary before presentation to the classifier. Possible solutions to this, depending on the scene complexity, range from a traditional Mean Frame Subtraction (MFS) method to detect moving objects in simple setups where the background remains static for a long time (Tamersoy, 2009) to more elaborate trained approaches such as RetinaNet (Lin et al., 2020) or YOLOv4 (Bochkovskiy et al., 2020) object detectors. The latter are more tolerant to changes in scale, light, multiple objects, and motion, but they often need more training data. This, however, could be addressed with an approach based on synthetic data like the one followed in this paper.

In a warehouse, new products are coming in all the time. In our case, the classifier must be retrained to recognize each new class. An alternative for warehouses with many different products would be to expand the classifier without retraining it (Schulz et al., 2020). Using labels attached to products would be another approach to identify objects. For example, (Nemati et al., 2016) employs spiral codes, similar in concept to barcodes, but detectable at any orientation over 360 degrees (in contrast to barcodes, which need to be properly oriented). However, this would demand the manual attachment of labels to the objects.
ACKNOWLEDGEMENTS

This work has been carried out by August Baaz and Yonan Yonan in the context of their Bachelor Thesis at Halmstad University (Computer Science and Engineering), with the support of HMS Networks AB in Halmstad. Authors Hernandez-Diaz and Alonso-Fernandez thank the Swedish Research Council (VR) and the Swedish Innovation Agency (VINNOVA) for funding their research.
Yasinski, .I. et al. (2015). Understanding neural networks
Al-Faraj, S. et al. (2021). Cnn-based alphabet identification
through deep visualization. In JCMLW.
and sorting robotic arm. In ICCCES.
Zhu, .1.-Y. et al. (2017). Unpaired image-to-image transla-
Bochkovskiy, A. et al. (2020). Yolov4: Optimal speed and
tion using cycle-consistent adversarial net. In !CCV.
accuracy of object detection. CoRR, abs/2004.10934.
Borkman, S. et al. (2021 ). Unity perception: Generate syn-
thetic data for comp vis. CoRR, abs/2107.04259.
Femling, F., Olsson, A., Alonso-Fernandez, F. (2018). Fruit
