
Art Graffiti Detection in Urban Images using Deep Learning

Tacio Souza Bomfim, Éldman de Oliveira Nunes and Ángel Sánchez

Abstract Art graffiti can be considered a type of urban street art that is currently
present in most cities worldwide. Many artists who began as street artists have
successfully moved into mainstream art, including art galleries. In consequence, the
artistic graffiti produced by these authors has become valuable work that forms part
of the cultural heritage of cities. Once the economic value of these public art
initiatives is understood within the smart cities context, the preservation of artistic
graffiti (mainly against vandalism) becomes essential. This will make it possible for
municipal governments and urban planners to implement graffiti maintenance initiatives
in the future. In this context, this paper describes a deep learning-based methodology
to accurately detect urban graffiti in complex images. The different graffiti varieties
(i.e., 3D, stencil or wildstyle, among others) and the multiple variabilities present in
these artistic elements in street scenes (such as partial occlusions or their reduced
size) make this object detection problem very challenging. Our experimental results
using different datasets endorse the effectiveness of this proposal.

1 Introduction

The term graffiti refers to some types of writings or drawings made on a wall or
other surface, usually without permission and within public view [5]. Graffiti ranges
from simple written words to elaborate wall paintings. Nowadays, countless acts of
vandalism graffiti are committed daily against public and private properties around
the world. The costs caused by such damage are huge, and correspond to direct costs

Tacio Souza Bomfim
UNIFACS, 41720-200 Salvador (Bahia), Brazil, e-mail: [email protected]
Éldman de Oliveira Nunes
UNIFACS, 41720-200 Salvador (Bahia), Brazil, e-mail: [email protected]
Ángel Sánchez
ETSII - URJC, 28933 Móstoles (Madrid), Spain, e-mail: [email protected]


Fig. 1: Differences between graffiti types: (a) vandalism and (b) artistic.

of cleaning surfaces (buildings, public transport vehicles, among others) and loss
of value of properties repeatedly damaged by graffiti. For example, annual costs of
graffiti removal in USA were estimated to be more than USD 12 billion in 2015.
In contrast to graffiti that is considered vandalism, another type of graffiti is
becoming more widely recognized as a form of artwork [3]. Art graffiti is a form of
visual communication created in public places that is legally produced and usually
involves the use of public spaces. Nowadays, graffiti artists such as Banksy
have exhibited their graffiti-style paintings commercially in gallery and museum
spaces. Figure 1 shows sample images of vandalism and art graffiti, respectively.
Since art graffiti is considered a form of outdoor cultural heritage to be preserved,
municipal investments need to be dedicated to its conservation, as this kind of art
manifestation also produces an economic return [4]. The protection and dissemination
of this kind of heritage in cities through the use of Information and Communication
Technologies (ICTs) is currently an important aspect of the development of Smart
Cities [9]. On the one hand, cultural tourism demands the development of applications
that allow people with mobile phones to receive real-time information by taking a
photograph. On the other hand, it is necessary to detect and assess the degree of
conservation of this type of graffiti from captured images, in order to carry out
actions for the recovery of those urban elements that may have suffered some kind of
vandalism or deterioration [3][10].
Art graffiti usually features color and technique variations and is aesthetically
elegant, following the ideals of the fine arts, unlike vandalism graffiti. Artistic
graffiti has a wide variety of categories [4], which use different techniques. These
include: 3D, corresponding to drawings made with perspective that provide an effect
of depth; they present an illusion very close to reality, which makes this technique
much appreciated. The stencil, which has become very popular, mainly through the
artist Banksy; this mode uses shapes cut from cardboard, paper, metal, plastic or
other materials, together with spray paint, to create the design or text. The piece,
which is made freehand and contains at least three colors. And wildstyle, which is
based on letters drawn in a distorted and interconnected way, making this type of
graffiti very difficult to read; this technique usually also features points, arrows
and other elements, and has a strong connection with hip hop.
This paper describes a deep learning-based methodology to accurately detect
urban art graffiti in complex images. As far as we know, the problem of artistic
graffiti detection in images has not been previously researched using deep networks.
The works found in the literature investigate only vandalism graffiti problems, such
as detection [16], image retrieval [17], quantification [14], and graffiti classification
[11]. Most recent work on vandalism graffiti processing has applied some type of
convolutional neural network (CNN), such as VGG-16 or ResNet (see, for example,
references [11] or [6]).
Our paper is organized as follows. Section 2 describes the object detection problem
in general, and art graffiti detection as a particular case. Section 3 summarizes the
different YOLO detection models considered in this study. Our dataset of images
for the experiments is explained in Section 4. In Section 5, we describe the metrics,
the experiments performed and analyze the corresponding results. Finally, Section 6
concludes this study.

2 Detection of art graffiti

Object detection is a challenging task in Computer Vision that has received
considerable attention in recent years, especially with the development of Deep
Learning [18][15]. It has many applications related to video surveillance, autonomous
vehicles, robot vision and machine inspection, among many others. The problem consists
in recognizing and localizing certain classes of objects present in a static image or
in a video. Recognizing (or classifying) means determining the category (from a given
set of classes) of each object instance present in the scene, together with the
network's confidence value for that detection. Localizing consists in returning the
coordinates of the bounding box containing each considered object instance in the
scene. The detection problem is different from (semantic) instance segmentation,
where the goal is to identify, for each pixel of the image, the object instance (for
every considered type of object) to which the pixel belongs. Difficulties in the
object detection problem [18] include geometrical variations such as scale changes
(e.g., a small size ratio between the object and the image containing it) and
rotations of the objects (e.g., due to scene perspective, objects may not appear
frontal); partial occlusion of objects by other elements in the scene; and
illumination conditions (i.e., changes due to weather conditions, or natural versus
artificial light), among others. Note that some images may contain several combined
variabilities (e.g., small, rotated and partially occluded objects). In addition to
detection accuracy, another important aspect to consider is how to speed up the
detection task.
Detecting art graffiti in images can be considered an object detection problem
with the multiple variabilities indicated above. Art graffiti images are, in general,
much more diverse and elaborate than those containing vandalism graffiti (as shown
in Figure 1), and also contain many more colors and textures. Another aspect that
can hinder the detection of graffiti (both vandalism and art) in images is the
varying nature of the covered materials, i.e., the surfaces on which the graffiti
are painted [1]: for example, stone walls, wooden fences, metal boxes or glass
windows, among others.

3 Object detection deep architectures

This section outlines deep architectures for object detection, in particular the
YOLO model and the variants that were applied in this work.

3.1 YOLO models

Redmon and collaborators proposed in 2015 [13] a new object detector model
called YOLO (an acronym of "You Only Look Once"), which handles object detection
as a one-stage regression problem, taking an input image and learning simultaneously
the class probabilities and the bounding box coordinates of the objects. This first
version of YOLO was also called YOLOv1, and since then the successive improved
versions of this architecture (YOLOv2, YOLOv3, YOLOv4, and YOLOv5, respectively)
have gained much popularity within the Computer Vision community.
Unlike previous two-stage detection networks, such as R-CNN and Faster R-CNN,
the YOLO model uses only one-stage detection. That is, it can make predictions with
only one "pass" through the network. This feature makes the YOLO architecture
extremely fast: at least 1000 times faster than R-CNN and 100 times faster than
Fast R-CNN.
The architectures of all YOLO models share some similar components, which are
summarized next:
• Backbone: A convolutional neural network that accumulates and produces visual
features at different shapes and sizes. Classification models such as ResNet, VGG
and EfficientNet are used as feature extractors.
• Neck: A set of layers that receive the output features extracted by the Backbone
(at different resolutions), and integrate and blend these characteristics before
passing them on to the prediction stage. For example, models like Feature Pyramid
Networks (FPN) or Path Aggregation Networks (PAN) have been used for this purpose.
• Head: This component takes in features from the Neck and produces the bounding
box predictions. It performs classification along with regression on the features
and outputs the bounding box coordinates to complete the detection process.
Generally, it produces four output values per detection: the 𝑥 and 𝑦 center
coordinates, and the width and height of the detected object, respectively (see the
decoding sketch after this list).
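
As an illustration of this output encoding, the following minimal Python sketch
decodes one raw YOLOv3/v4-style anchor prediction into pixel coordinates; the
function name and arguments are ours, not from the paper.

import math

def decode_yolo_box(tx, ty, tw, th, cell_x, cell_y,
                    anchor_w, anchor_h, stride):
    """Decode one raw anchor prediction (YOLOv3/v4-style convention).

    (tx, ty, tw, th): raw network outputs for one anchor in one grid cell.
    (cell_x, cell_y): integer grid-cell indices; stride maps cells to pixels.
    (anchor_w, anchor_h): the anchor prior, in pixels.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sigmoid(tx) + cell_x) * stride   # box center x, in pixels
    by = (sigmoid(ty) + cell_y) * stride   # box center y, in pixels
    bw = anchor_w * math.exp(tw)           # box width, in pixels
    bh = anchor_h * math.exp(th)           # box height, in pixels
    return bx, by, bw, bh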

In the next subsections, we summarize the main specific features of the three
YOLO architectures used in our experiments: YOLOv4, YOLOv5 and YOLOv4-
tiny, respectively.

3.2 YOLOv4

YOLOv4 was released by Alexey Bochkovskiy et al. in their 2020 paper "YOLOv4:
Optimal Speed and Accuracy of Object Detection" [2]. This model outperforms other
convolutional detection models like EfficientNet and ResNeXt50.
Like YOLOv3, it has the Darknet53 model as its Backbone component. It runs at a speed
of 62 frames per second with an mAP of 43.5 percent on the COCO dataset.
As technical improvements with respect to YOLOv3, YOLOv4 introduces two new
groups of techniques: the bag of freebies and the bag of specials.
Bag of freebies (BOF) are a set of techniques that improve the model's performance
without increasing the inference cost. In particular:
• Data augmentation techniques: CutMix, MixUp, CutOut, ... (see the MixUp sketch
after this list)
• Bounding box regression loss types: MSE, IoU, CIoU, DIoU, ...
• Regularization techniques: Dropout, DropPath, DropBlock, ...
• Normalization techniques: Mini-batch, Iteration-batch, GPU normalization, ...
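
As a concrete example of one of these freebies, the sketch below implements the
core of MixUp for a pair of images. This is a simplified, classification-style
formulation; for detection, the boxes of both source images are kept and their loss
terms are weighted by the mixing coefficient. Names and the alpha default are
illustrative.

import numpy as np

def mixup(img_a, img_b, alpha=0.2):
    """Blend two equally-sized uint8 images with a Beta-sampled weight."""
    lam = np.random.beta(alpha, alpha)   # mixing coefficient in (0, 1)
    mixed = (lam * img_a.astype(np.float32)
             + (1.0 - lam) * img_b.astype(np.float32))
    return mixed.astype(np.uint8), lam   # lam also weights the two loss terms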
Bag of specials (BOS) consists of techniques that increase accuracy while also
increasing the computation cost. In particular:
• Spatial attention modules (SAM): Spatial Attention (SA), Channel-wise Attention
(CA), ...
• Non-max suppression (NMS) modules
• Non-linear activation functions: ReLU, SELU, Leaky ReLU, Mish, ...
• Skip connections: Weighted Residual Connections (WRC), Cross-Stage Partial
connections (CSP), ...
Figure 2 illustrates the layer structure of the YOLOv4 network used in our
experiments.

Fig. 2: Schematic representation of the YOLOv4 architecture.

3.3 YOLOv5

One month after the release of YOLOv4, the version 5 of this model, created by
Glenn Jocher, was published. This novelty caused a series of discussions in the
scientific community, first for not having been developed by the original author of
the YOLO network, then for not having published a release paper [8].
YOLOv5 uses the PyTorch framework. This version uses CSPDarknet53 as the Backbone,
only the PANet as the Neck, and the YOLO layer as the Head, the same as in previous
versions. An innovation of the YOLOv5 network is the self-learning of bounding box
anchors. YOLOv5 achieves the same, if not better, accuracy (mAP of 55.6) as the
YOLOv4 model while requiring less computational power.
Some of the technical improvements of YOLOv5 over the previous version of this
architecture are the following: an easier framework to train and test (PyTorch);
better data augmentation and loss calculations (using the PyTorch framework);
auto-learning of anchor boxes (they do not need to be added manually now); use of
cross-stage partial connections (CSP) in the Backbone; use of the path aggregation
network (PAN) in the Neck of the model; and support for YAML files, which greatly
enhances the layout and readability of model configuration files; among other
advantages.
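
For reference, the snippet below shows the typical PyTorch Hub entry point for
YOLOv5 inference. This is a sketch with a COCO-pretrained model and an illustrative
image path; detecting graffiti would require training the model on a dataset such
as ours.

import torch

# Load the small COCO-pretrained YOLOv5 model from the official repository.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on a street-scene photograph (path is illustrative).
results = model('street_scene.jpg')

# results.xyxy[0] is an (N, 6) tensor: x1, y1, x2, y2, confidence, class id.
for *box, conf, cls in results.xyxy[0].tolist():
    print(f'class={int(cls)}  conf={conf:.2f}  box={[round(v, 1) for v in box]}')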

3.4 YOLOv4-tiny

YOLOv4-tiny is a compressed version of YOLOv4, designed to train on machines
that have less computing power [7]. Its model weights are around 16 megabytes,
allowing it to train on 350 images in 1 hour on a Tesla P100 GPU. YOLOv4-tiny has
an inference time of 3 ms on the Tesla P100, making it one of the fastest object
detection models in existence.
YOLOv4-tiny incorporates a couple of changes with respect to the original YOLOv4
network that help it achieve these fast speeds. First and foremost, the number of
convolutional layers in the CSP backbone is compressed, with a total of 29 pretrained
convolutional layers. Additionally, the number of YOLO layers has been reduced to
two instead of three, and there are fewer anchor boxes per prediction.
YOLOv4-tiny achieves results that are comparatively competitive with YOLOv4 given
the size reduction: it reaches 40 mAP @.5 on the MS COCO dataset.

Table 1: Distribution of images in our dataset

Source          No. Images  No. Annotations
Own authorship  127         164
Other sources   395         439

Table 2: Distribution of graffiti sizes with respect to image sizes

Ratio size           Quantity
Smaller than 10%     177
Between 10% and 20%  127
Between 20% and 30%  91
Between 30% and 40%  68
Between 40% and 50%  48
Larger than 50%      92

4 Dataset

There is a lack of publicly available image datasets of art graffiti. Therefore,
to perform the experiments of this study, we created our own dataset of annotated
graffiti images.
The database used in this project was created with images of our own authorship and
also from sites with free copyright, such as Unsplash (https://unsplash.com/), Pixabay
(https://pixabay.com/pt/), Pexels (https://www.pexels.com/pt-br/) and Google Maps.
Table 1 shows the distribution of the set of images in terms of their source. A total
of 522 images were collected, containing 603 annotations of artistic graffiti.
Another analysis carried out on the created image dataset concerned the size ratios
of the annotated graffiti with respect to the corresponding image sizes. This
information is relevant for separating the training, validation and testing sets in
a uniform way. In addition, small graffiti are, in most cases, more difficult to
detect, so this information is quite useful when evaluating the results. Table 2
describes the number of graffiti according to their size ratio with respect to the
image size.
The number of art graffiti with a proportion between 0% and 20% is approximately
50% of the total. These detections tend to be more challenging for neural networks,
because the localization of small objects requires much higher precision [15].
Figure 3 shows two images with respective graffiti sizes of less than 10% and more
than 80% of the image size.
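
The binning of Table 2 can be reproduced with a few lines of Python; here we assume
the ratio is measured as annotated-box area over image area, and that each annotation
carries its box and image dimensions in pixels (the field names are ours).

def size_ratio_bins(annotations):
    """Bucket each annotated graffiti by bounding-box area / image area."""
    bins = {'<10%': 0, '10-20%': 0, '20-30%': 0,
            '30-40%': 0, '40-50%': 0, '>50%': 0}
    for a in annotations:
        ratio = (a['box_w'] * a['box_h']) / (a['img_w'] * a['img_h'])
        if ratio < 0.10:
            bins['<10%'] += 1
        elif ratio < 0.20:
            bins['10-20%'] += 1
        elif ratio < 0.30:
            bins['20-30%'] += 1
        elif ratio < 0.40:
            bins['30-40%'] += 1
        elif ratio < 0.50:
            bins['40-50%'] += 1
        else:
            bins['>50%'] += 1
    return bins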
Additionally, our graffiti dataset presents other variabilities, as shown in Figure 4.
These include large brightness differences (Figure 4.a), since in a real situation the
system must be able to detect graffiti at night or in low light, cloudy weather, rain,
shadow and other climatic variations; partial occlusions (Figure 4.b); and image
perspectives or inclinations (Figure 4.c).

Fig. 3: Two examples of graffiti sizes: (a) less than 10% and (b) more than 80% of
the image size.

Fig. 4: Examples of variabilities present in the graffiti dataset: (a) low contrast,
(b) partial occlusion, and (c) perspective.

Table 3: Distribution of the image dataset into training, validation and test sets.

Dataset     Images  Skewed  Occluded  Highly-contrasted
Train       395     88      66        43
Validation  42      9       8         2
Test        85      25      29        6

The image dataset was divided into three groups: training (with 74% of the images),
testing (16%) and validation (10%), respectively. Table 3 shows the distribution of
images, according to the considered variabilities, in these groups.
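
A minimal sketch of such a split follows (random shuffling only; the authors'
actual split also balanced the skewed, occluded and highly-contrasted categories
of Table 3):

import random

def split_dataset(image_paths, seed=42):
    """Split image paths into ~74% train, 10% validation, 16% test."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_train = round(0.74 * len(paths))
    n_val = round(0.10 * len(paths))
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test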

5 Experiments

This section describes the evaluation metrics employed, presents the different tests
performed using different YOLO architectures, and shows the respective quantitative
and qualitative results.

5.1 Evaluation metrics and computing resources

To measure and compare the results of each experiment, the following evaluation
metrics were calculated: Precision (Prec), Recall (Rec) and F1 score (F1), as follows:
$$\mathrm{Prec} = \frac{TP}{TP + FP} \qquad \mathrm{Rec} = \frac{TP}{TP + FN} \qquad F_1 = 2\,\frac{\mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} \tag{1}$$
For all the experiments performed, we also computed the Average Precision for the
detected class of "Artistic Graffiti" objects, using 11-point interpolation [12],
and represented it as Precision vs. Recall curves. In each of the tests, performance
was analyzed under different aspects, such as the size of the graffiti in relation
to the image size, its inclination, luminosity variations, and occlusion.
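
For concreteness, these metrics can be computed directly from the confusion-matrix
counts reported in the tests below; a minimal sketch (function names are ours),
checked here against the Test 1 counts of Table 4:

def precision_recall_f1(tp, fp, fn):
    """Eq. (1) from raw detection counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

def average_precision_11pt(pr_points):
    """11-point interpolated AP [12]: mean, over recall levels 0.0, 0.1,
    ..., 1.0, of the maximum precision attained at recall >= that level.
    `pr_points` is a list of (recall, precision) pairs of the P-R curve."""
    ap = 0.0
    for r_level in (i / 10 for i in range(11)):
        candidates = [p for r, p in pr_points if r >= r_level]
        ap += max(candidates) if candidates else 0.0
    return ap / 11.0

# Test 1 (Table 4): TP = 79, FP = 4, FN = 17.
print(precision_recall_f1(79, 4, 17))   # -> (0.9518..., 0.8229..., 0.8827...)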
It is important to point out that all training was carried out in the Google Colab
virtual environment, which provides a free cloud GPU processing service. All
detections in images were performed on a specific, lower-capacity GPU, in order to
standardize detection times. The machine used to perform the detections has the
following configuration:
• GPU manufacturer: Nvidia
• GPU model: GeForce MX110
• GPU memory: 2 GB
• Processor: Intel(R) Core(TM) i5-10210U

5.2 Experiments

Network training was carried out using three different input sizes (i.e., image
resolutions): 416 × 416 (the standard one for this network), 512 × 512 and
608 × 608, respectively. Additional tests were also executed by first increasing the
sharpness (i.e., the contrast) of the images by 50% prior to training the YOLOv4
detector. We also performed some tests with YOLOv5 and with the YOLOv4-tiny network
(suitable for limited computing resources), both at the smaller 416 × 416 image
resolution. The number of test graffiti objects in the images was 96 for all the
experiments.

Table 4: Confusion matrix corresponding to Test 1.

                     Predicted
                     Graffiti  No Graffiti
Actual  Graffiti     79        17
        No Graffiti  4         4

For the first four experiments (i.e., those using the YOLOv4 network), the following
network hyperparameter values were used (see the configuration sketch after this
list):
• Batch size: 64
• Subdivision: 64
• Momentum: 0.9
• Decay: 0.0005
• Learning rate: 0.001
• Max batches: 3000
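
These values map directly onto the [net] section of a Darknet-style configuration
file; a small illustrative helper follows (the output file name is hypothetical):

HYPERPARAMS = {
    'batch': 64,
    'subdivisions': 64,
    'momentum': 0.9,
    'decay': 0.0005,
    'learning_rate': 0.001,
    'max_batches': 3000,
}

def write_net_section(path='yolov4-graffiti.cfg'):
    """Write the hyperparameters above as a Darknet [net] section."""
    with open(path, 'w') as f:
        f.write('[net]\n')
        for key, value in HYPERPARAMS.items():
            f.write(f'{key}={value}\n')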

5.2.1 Test 1: Image resolution 416 × 416 (YOLOv4)

This test was performed with the test images resized to a spatial resolution of
416 × 416 pixels, keeping their aspect ratios. The main goal of this test was to
train the network with the default settings in order to verify the detection
accuracy on test images and compare the results with those of the remaining
experiments. Table 4 shows the confusion matrix of the results. The average
detection time per test image was 527 ms.

The respective values of Precision, Recall and F1-score for this experiment were
95.18%, 82.29% and 88.27%. Finally, the Precision-Recall curve corresponding to
this experiment is shown in Figure 5; the mAP (mean Average Precision) is
calculated in all experiments using the 11-point interpolation technique.

5.2.2 Test 2: Image resolution 512 × 512 (YOLOv4)

This test was performed with the test images resized to a spatial resolution of
512 × 512 pixels, keeping their aspect ratios. The main goal of this test and
Test 3 was to train the network with images of higher spatial resolution than in
the initial experiment, and to perform comparisons using the considered detection
metrics. Table 5 shows the confusion matrix with the results of this test. The
average detection time per test image was 772 ms.

The respective values of Precision, Recall and F1-score for this experiment were
89.18%, 68.75% and 77.64%. Finally, the Precision-Recall curve corresponding to
this experiment is shown in Figure 6.

Fig. 5: Precision-Recall curve of Test 1.

Table 5: Confusion matrix corresponding to Test 2.

                     Predicted
                     Graffiti  No Graffiti
Actual  Graffiti     66        30
        No Graffiti  8         7

Fig. 6: Precision-Recall curve of Test 2.

5.2.3 Test 3: Image resolution 608 × 608 (YOLOv4)

This test was performed with the test images resized to a higher spatial resolution
than in the previous test (608 × 608 pixels), keeping their aspect ratios. Table 6
shows the confusion matrix with the results of this test. The average detection
time per test image was 1,553 ms.

Table 6: Confusion matrix corresponding to Test 3.

                     Predicted
                     Graffiti  No Graffiti
Actual  Graffiti     69        27
        No Graffiti  12        8

Fig. 7: Precision-Recall curve of Test 3.

The respective values of Precision, Recall and F1-score for this experiment were
85.18%, 71.88% and 77.97%. Finally, the Precision-Recall curve corresponding to
this experiment is shown in Figure 7.

5.2.4 Test 4: Image resolution 416 × 416 with sharpening (YOLOv4)

In this experiment, an enhancement pre-processing step was applied to the images in
order to increase their contrast by a factor of 1.5 (i.e., an increase of 50%). This
sharpening transformation was applied to the training and validation images after
they were resized to 416 × 416, keeping their aspect ratio. The goal of this test is
to verify whether or not increasing image sharpness produces more accurate detection
results, since in some images the illumination conditions or the graffiti drawings
themselves can worsen the results.
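
A minimal sketch of this pre-processing using Pillow, assuming the "sharpness"
boost is implemented as a contrast enhancement (the text above equates the two);
an equivalent ImageEnhance.Sharpness call could be substituted:

from PIL import Image, ImageEnhance

def enhance_for_training(path, factor=1.5, max_size=(416, 416)):
    """Resize (keeping aspect ratio) and boost contrast by 50%."""
    img = Image.open(path).convert('RGB')
    img.thumbnail(max_size)                            # keeps aspect ratio
    return ImageEnhance.Contrast(img).enhance(factor)  # 1.0 = unchanged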
Table 7 shows the confusion matrix with the results of this test. The average
detection time per test image was 386 ms.

Table 7: Confusion matrix corresponding to Test 4.

                     Predicted
                     Graffiti  No Graffiti
Actual  Graffiti     83        13
        No Graffiti  5         2

Fig. 8: Precision-Recall curve of Test 4.

The respective values of Precision, Recall and F1-score for this experiment were
94.32%, 86.46% and 90.22%. Finally, the Precision-Recall curve corresponding to
this experiment is shown in Figure 8.

5.2.5 Test 5: Image resolution 416 × 416 (YOLOv5)

This test was performed using the images resized to the original resolution of
416 × 416 pixels, but using the YOLOv5 network with the following configuration
hyperparameters:
• Batch size: 64
• Momentum: 0.937
• Decay: 0.0005
• Learning rate: 0.001
• Max batches: 3000
The purpose of this experiment is to compare the results produced by the YOLOv4
and YOLOv5 models on the same test images.
Table 8 shows the confusion matrix with the results of this test. The average
detection time per test image was 386 ms.

Table 8: Confusion matrix corresponding to Test 5.

                     Predicted
                     Graffiti  No Graffiti
Actual  Graffiti     78        18
        No Graffiti  15        39

Fig. 9: Precision-Recall curve of Test 5.

The respective values of Precision, Recall and F1-score for this experiment were
83.87%, 81.25% and 82.54%. Finally, the Precision-Recall curve corresponding to
this experiment is shown in Figure 9.

5.2.6 Test 6: Image resolution 416 × 416 (YOLOv4-tiny)

The purpose of this test is to analyze the performance of this model, with more
limited resources, in comparison with the other considered architectures. The test
was performed with the test images resized to a spatial resolution of 416 × 416
pixels, keeping their aspect ratios. The training hyperparameter values for this
model were the following:

• Batch size: 64
• Subdivision: 64
• Momentum: 0.9
• Decay: 0.0005
• Learning rate: 0.00261
• Max Batches: 3000

Table 9: Confusion matrix corresponding to Test 6.

                     Predicted
                     Graffiti  No Graffiti
Actual  Graffiti     50        46
        No Graffiti  10        3

Fig. 10: Precision-Recall curve of Test 6.

Table 9 shows the confusion matrix of the results. The average detection time per
test image was 51 ms.

The respective values of Precision, Recall and F1-score for this experiment were
83.33%, 52.08% and 64.10%. Finally, the Precision-Recall curve corresponding to
this experiment is shown in Figure 10.

5.2.7 Analysis of results

Among all the tests performed, the one that obtained the highest global score (i.e.,
the F1-score) was Test 4, that is, input size 416 × 416 in YOLOv4 with sharpening
applied to the training and validation images, as shown in Figure 11. If the
evaluation measures are considered separately, Test 1 (input dimension 416) obtained
the highest Precision, while Test 4 (with sharpening) obtained the highest Recall.
We verified that the application of sharpening mainly improved the detection of
cases in which the artistic graffiti was faded or had few colors.

Fig. 11: Comparison of metric values between tests.

Regarding the test using YOLOv5, it obtained a worse result than YOLOv4 when
compared to Test 1, which had the same network input size (416 × 416). It is
important to point out that, in order to draw a conclusion about which network
performs best, it will be necessary to carry out more tests.
Test 6, with YOLOv4-tiny for mobile applications, had a low recall compared to the
other tests, but it obtained a precision very close to that of the other tests. As
previously mentioned, this model is an option for real-time object detection in
videos on mobile and embedded devices.
Finally, we also present some qualitative results corresponding to a randomly
chosen test image. Figure 12 shows, for this same image, the visualization of the
Intersection over Union (IoU) results corresponding to the six experiments carried
out. To clarify these visual results, Table 10 also lists the respective confidence
values returned by the network, together with the corresponding IoU results. It can
be noticed that the best confidence value corresponds to Test 1, working with the
lower image resolution (416 × 416), and the best IoU result corresponds to Test 5,
using YOLOv5 also at the lower resolution.
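
IoU here is the standard overlap measure between the predicted and ground-truth
boxes; a minimal sketch with corner-format boxes:

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0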
The reported poor results for the YOLOv4-tiny model (Test 6) in Table 10 correspond
to its execution with the same computing resources as the previous tests. Note that
YOLOv4-tiny can also be executed on a smartphone and can also work on videos in
real time. We achieved much better results for the Confidence and IoU metrics (0.93
and 0.84, respectively) using a Xiaomi 9T smartphone (with a 48 Mpx camera, 4K
video, 8 cores and a 6.39-inch display). For a better interpretation of the previous
comparative results between experiments, it is important to remark that the YOLOv4,
YOLOv5 and YOLOv4-tiny architectures have, respectively, approximately 6 × 10⁷,
4.6 × 10⁷ and 6 × 10⁶ training parameters.

Table 10: Respective network confidence and IoU values for the six experiments,
computed on the same test image shown in Figure 12.

            Test 1  Test 2  Test 3  Test 4  Test 5  Test 6
Confidence  0.88    0.64    0.49    0.87    0.77    0.70
IoU         0.76    0.85    0.84    0.75    0.92    0.45

Fig. 12: Qualitative detections for each of the six experiments on the same test image.

6 Conclusion

Object detection is a task that has received a lot of attention in recent years,
especially with the development of extremely fast networks arising from deep
learning and convolutional neural networks. This task consists of locating and
classifying objects in static or dynamic images (videos). Localization is done
through the coordinates of a bounding box that contains at least one instance of
the object, and classification consists of providing a confidence value for a
certain category of object.
This work deals with the detection, in static images, of art graffiti objects,
which are artistic manifestations considered popular art. These paintings follow
different techniques and styles, and have great potential to beautify cities, in
addition to becoming local tourist spots. Another interesting point is that this
form of expression is legal, unlike vandalism graffiti. To carry out the detection
of artistic graffiti, three YOLO network models were used (YOLOv4, YOLOv5 and
YOLOv4-tiny, respectively).
Six tests were performed, the first three using YOLOv4 at image dimensions of
416 × 416, 512 × 512 and 608 × 608, respectively. Test 4 also used YOLOv4 at
resolution 416 × 416, but applying a sharpening pre-processing to the training and
validation images. Tests 5 and 6 used YOLOv5 and YOLOv4-tiny, respectively. To
carry out the training, validation and testing, we used a database created with
images of our own authorship and also from sites with free copyright, such as
Unsplash (https://unsplash.com/), Pixabay (https://pixabay.com/pt/), Pexels
(https://www.pexels.com/pt-br/) and Google Maps, with a total of 522 images
containing 603 annotations of artistic graffiti.
Among all the tests performed, the one with the highest global score (F1-score) was
Test 4, that is, input size 416 × 416 in YOLOv4 with sharpening applied to the
training and validation images. This test achieved a precision of 94.3% and a
recall of 86.5%. The test with YOLOv5 obtained good evaluation metrics, but was
inferior to YOLOv4. YOLOv4-tiny also achieved a precision close to that of the
other tests, but it had a low recall.
As future work, we consider the application of YOLOv4-tiny to mobile devices, also
using the geolocation of images, in order to generate relevant information (e.g.,
for art graffiti maintenance purposes). For this task, it will be necessary to
improve the accuracy of this reduced detector. Another line of future work consists
in extending our detection problem to jointly consider artistic, vandalism and
other variants of graffiti images (i.e., a multiclass detection problem).

Acknowledgements We acknowledge the Spanish Ministry of Science and Innovation,
under the RETOS Programme, Grant No. RTI2018-098019-B-I00; and also the CYTED
Network "Ibero-American Thematic Network on ICT Applications for Smart Cities",
Grant No. 518RT0559.

References

1. Abdullah Alfarrarjeh, Dweep Trivedi, Seon Ho Kim, Hyunjun Park, Chao Huang, and Cyrus
Shahabi. Recognizing material of a covered object: A case study with graffiti. In 2019 IEEE
International Conference on Image Processing (ICIP), pages 2491–2495, 2019.
2. Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal
speed and accuracy of object detection. 04 2020.
3. Anna Collins. Graffiti: Vandalism or Art? Greenhaven Publishing LLC, 2017.
4. Fabiana Forte and Pierfrancesco De Paola. How can street art have economic value? Sustain-
ability, 11(3):580, 2019.
5. Marisa A Gómez. The writing on our walls: Finding solutions through distinguishing graffiti
art from graffiti vandalism. U. Mich. JL Reform, 26:633, 1992.
6. Mehmet Ergün Hatir, Mücahit Barstuğan, and Ismail Ince. Deep learning-based weathering
type recognition in historical stone monuments. Journal of Cultural Heritage, 45:193–203,
2020.
7. Zicong Jiang, Liquan Zhao, Shuaiyang Li, and Yanfei Jia. Real-time object detection
method based on improved YOLOv4-tiny. 11 2020.
8. M. Karthi, V. Muthulakshmi, R. Priscilla, P. Praveen, and K. Vanisri. Evolution of
YOLO-v5 algorithm for object detection: Automated detection of library books and
performance validation of dataset. pages 1–6, 2021.
9. Rida Khatoun and Sherali Zeadally. Smart cities: concepts, architectures, research opportuni-
ties. Communications of the ACM, 59(8):46–57, 2016.
10. Samuel Oliver Crichton Merrill. Graffiti at heritage places: vandalism as cultural significance
or conservation sacrilege? Time and Mind, 4(1):59–75, 2011.
11. Glauco R Munsberg, Pedro Ballester, Marco F Birck, Ulisses B Correa, Virginia O Andersson,
and Ricardo M Araujo. Towards graffiti classification in weakly labeled images using convo-
lutional neural networks. In Latin American Workshop on Computational Neuroscience, pages
39–48. Springer, 2017.
12. Rafael Padilla, Sergio L. Netto, and Eduardo A. B. da Silva. A survey on performance metrics
for object-detection algorithms. pages 237–242, 2020.
13. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified,
real-time object detection. pages 779–788, 06 2016.
14. Eric K Tokuda, Roberto M Cesar, and Claudio T Silva. Quantifying the presence of graffiti
in urban environments. In 2019 IEEE International Conference on Big Data and Smart
Computing (BigComp), pages 1–4. IEEE, 2019.
15. Kang Tong, Yiquan Wu, and Fei Zhou. Recent advances in small object detection based on
deep learning: A review. Image and Vision Computing, 97:103910, 03 2020.
16. Jing Wang, Zhijie Xu, and Michael O’Grady. Head curve matching and graffiti detection. Int.
Journal of Computer Vision, 14:9–14, 2010.
17. Chunlei Yang, Pak Chung Wong, William Ribarsky, and Jianping Fan. Efficient graffiti image
retrieval. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval,
pages 1–8, 2012.
18. Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A
survey. arXiv preprint arXiv:1905.05055, 2019.
