
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. X, NO. XX, MONTH YEAR

LogoDet-3K: A Large-Scale Image Dataset for Logo Detection
Jing Wang, Weiqing Min, Member, IEEE, Sujuan Hou, Member, IEEE, Shengnan Ma,
Yuanjie Zheng, Member, IEEE, Shuqiang Jiang, Senior Member, IEEE

arXiv:2008.05359v1 [cs.CV] 12 Aug 2020

Abstract—Logo detection has been gaining considerable attention because of its wide range of applications in the multimedia field, such as copyright infringement detection, brand visibility monitoring, and product brand management on social media. In this paper, we introduce LogoDet-3K, the largest fully annotated logo detection dataset, with 3,000 logo categories, about 200,000 manually annotated logo objects and 158,652 images. LogoDet-3K creates a more challenging benchmark for logo detection, owing to its higher coverage and wider variety in both logo categories and annotated objects compared with existing datasets. We describe the collection and annotation process of our dataset, and analyze its scale and diversity in comparison to other logo detection datasets. We further propose a strong baseline method, Logo-Yolo, which incorporates Focal loss and CIoU loss into the state-of-the-art YOLOv3 framework for large-scale logo detection. Logo-Yolo addresses the problems of multi-scale objects, logo sample imbalance and inconsistent bounding-box regression. It obtains about 4% improvement in average performance over YOLOv3, and larger improvements over several reported deep detection models on LogoDet-3K. Evaluations on three other existing datasets further verify the effectiveness of our method and demonstrate the generalization ability of LogoDet-3K on logo detection and retrieval tasks. The LogoDet-3K dataset is released to promote large-scale logo-related research and can be found at https://github.com/Wangjing1551/LogoDet-3K-Dataset.

Fig. 1: Statistics of LogoDet-3K categories and images. The abscissa represents the number of logo images; the ordinate represents the number of categories.

I. INTRODUCTION

Logo-related research has always been extensively studied in the field of multimedia [1], [2], [3], [4], [5]. As an important branch of logo research, logo detection [6], [7], [8] plays a critical role in various applications and services, such as intelligent transportation [9], brand visibility monitoring [10] and analysis [11], trademark infringement detection [1] and video advertising research [12].

Currently, deep-learning approaches have been widely used in logo detection, such as Faster R-CNN [13], SSD [14] and YOLOv3 [15]. Because they support the learning process of deep networks with millions of parameters, large-scale logo datasets are crucial for logo detection. However, most existing logo research focuses on small-scale datasets, such as BelgaLogos [16] and FlickrLogos-32 [2]. Recently, although some large-scale logo datasets have been proposed for recognition and detection, such as WebLogo-2M [17], PL2K [18] and Logo-2K+ [19], these datasets are either only labeled at the image level [17], [19] or not publicly available [18]. As is well known, the emergence of large-scale datasets with a diverse and general set of objects, such as ImageNet DET [20] and COCO [21], has contributed greatly to rapid advances in object detection. As a special case of object detection, logo detection lacks benchmarks with a large number of categories and well-defined annotations comparable to ImageNet DET [20] and COCO [21].

Therefore, we introduce LogoDet-3K, a new large-scale, high-quality logo detection dataset. Compared with existing logo datasets, LogoDet-3K has three distinctive characteristics. (1) Large-scale. LogoDet-3K consists of 3,000 logo categories, 158,652 images and 194,261 bounding boxes. It has larger coverage of logo categories and a larger quantity of annotated objects than existing logo datasets. (2) High-quality. Each image strictly conforms to a carefully designed construction pipeline, including logo image collection, logo image filtering and logo object annotation. (3) Highly challenging. Logo objects typically consist of mixed text and graphic symbols. Even the same logo can appear in different scenarios, with various non-rigid, coloring and lighting transformations. For example, a rigid logo object appearing in a real clothing image often becomes non-rigid, making it difficult to detect. As

J. Wang, S. Hou, S. Ma and Y. Zheng are with the School of Information Science and Engineering, Shandong Normal University, Shandong, 250358, China. Email: [email protected], [email protected], [email protected], [email protected]. W. Min and S. Jiang are with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China, and also with the University of Chinese Academy of Sciences, Beijing, 100049, China. Email: [email protected], [email protected].

Fig. 2: Image samples from various categories of LogoDet-3K.

shown in Fig. 1, our proposed LogoDet-3K dataset far exceeds existing logo datasets in both the number of categories and the number of images. Fig. 2 gives some image samples from various categories of LogoDet-3K. In addition, imbalanced samples and very small logo objects make this dataset more challenging.

We further propose a strong baseline method, Logo-Yolo, based on the YOLOv3 network architecture for logo detection. Logo-Yolo takes characteristics of LogoDet-3K, such as various logo object sizes, sample imbalance and different background scenarios, into consideration, and incorporates Focal Loss [22] into the state-of-the-art detection framework YOLOv3. CIoU loss [23] is further adopted to obtain more accurate regression results. Finally, we conduct comprehensive experiments on LogoDet-3K using several state-of-the-art object detection models and our proposed method, along with an ablation study and qualitative analysis.

This paper has three main contributions. (1) We introduce a new large-scale logo dataset, LogoDet-3K¹, with 3,000 classes, 194,261 objects and 158,652 images, which is the largest fully annotated logo detection dataset. (2) We propose a strong baseline method, Logo-Yolo, which adopts the YOLOv3 detection framework and combines Focal loss and CIoU loss to achieve better detection performance on LogoDet-3K. (3) We perform extensive experiments on LogoDet-3K using several baseline models and our method, which further verify the effectiveness of our method and the generalization ability of LogoDet-3K on logo detection and retrieval tasks.

The rest of this paper is organized as follows. Section II reviews related work. Section III gives the dataset construction process and statistics. Section IV elaborates the proposed large-scale logo detection method. Experimental results and analysis are reported in Section V. Finally, we conclude the paper and give future work in Section VI.

II. RELATED WORK

Our work is closely related to two research fields: (1) logo detection datasets and (2) logo detection methods.

A. Logo Detection Datasets

A large-scale dataset is an important factor for supporting advanced object detection algorithms, especially in the deep learning era, and logo detection is no exception. The first benchmark for logo detection is the BelgaLogos dataset [16], which contains only 37 logo categories totaling 10,000 images. Over the years, some larger logo datasets such as FlickrLogos-32 [2] and Logos in the Wild [24] have been proposed. However, these datasets lack diversity and coverage in logo categories and images. For example, FlickrLogos-32 only consists of 32 logo categories with 70 images per category. This is far less than the millions of images required

¹We will release the dataset upon publication.

TABLE I: Comparison between LogoDet-3K and existing logo datasets.


Datasets              #Logos  #Brands  #Images    #Objects  Supervision   Public
BelgaLogos [16] 37 37 10,000 2,695 Object-Level Yes
FlickrLogos-27 [2] 27 27 1,080 4,671 Object-Level Yes
FlickrLogos-32 [2] 32 32 8,240 5,644 Object-Level Yes
FlickrLogos-47 [2] 47 47 8,240 - Object-Level No
Logo-18 [25] 18 10 8,460 16,043 Object-Level No
Logo-160 [25] 160 100 73,414 130,608 Object-Level No
Logos-32plus [26] 32 32 7,830 12,302 Object-Level No
Top-Logo-10 [27] 10 10 700 - Object-Level No
SportsLogo [28] 20 20 2,000 - Object-Level No
CarLogo-51 [29] 51 51 11,903 - Image-Level No
WebLogo-2M [17] 194 194 1,867,177 - Image-Level Yes
Logos-in-the-Wild [24] 871 871 11,054 32,850 Object-Level Yes
QMUL-OpenLogo [30] 352 352 27,083 - Object-Level Yes
PL2K [18] 2,000 2,000 295,814 - Object-Level No
Logo-2K+ [19] 2,341 2,341 167,140 - Image-Level Yes
LogoDet-3K 3,000 2,864 158,652 194,261 Object-Level Yes

in deep learning. Some researchers have constructed larger datasets, such as WebLogo-2M [17], LOGO-Net [25] and PL2K [18]. However, WebLogo-2M is collected from online search engines and is only automatically labeled at the image level, with much noise, while PL2K and LOGO-Net are not publicly available.

In order to solve this problem, we propose LogoDet-3K, a large-scale, high-coverage and high-quantity dataset with 3,000 logo categories, 158,652 images and 194,261 objects. Table I summarizes the statistics of existing logo datasets and LogoDet-3K. We can see that LogoDet-3K has more logo categories and logo objects, which is more helpful for exploring data-driven deep learning techniques for logo detection.

B. Logo Detection

In previous years, DPM [31] and HOG [25] were widely used as traditional object detection methods. Later, with the development of convolutional neural networks, more and more works started to utilize deep learning techniques for logo detection, such as Faster RCNN [13], YOLO [15] and self-attention [32]. In general, deep learning based object detectors can be divided into two types: two-stage detectors and single-stage detectors. The popular two-stage detectors are the R-CNN series, such as Faster RCNN [13], which introduced the region proposal network and individual blocks to improve detection performance. In contrast, the single-stage paradigm aims to be a faster and more efficient solution by classifying anchors directly and then refining them without a proposal generation network; examples include SSD [14], RetinaNet [22] and the YOLO series [15]. Recently, the anchor-free method CornerNet [33] has been highly acclaimed, while SNIPER [34] and Cascade R-CNN [35] were introduced to further improve performance.

In general, logo detection has advanced little compared with generic object detection. An important reason is that the development of logo detection technology is limited by the size of logo datasets. Early logo detection methods were built on hand-crafted visual features (e.g. SIFT and HOG [25]) and conventional classification models (e.g. SVM [3]). Recently, deep learning techniques have been applied to logo detection [36], [37], [4], [38]. For example, Oliveira et al. [39] adopted pre-trained CNN models and used them as part of a Fast Region-Based Convolutional Network recognition pipeline. Fehérvári et al. [18] combined metric learning and basic object detection networks to achieve few-shot logo detection. Compared with existing logo detectors, our proposed Logo-Yolo is more effective for large-scale logo categories and logo sample imbalance.

III. LOGODET-3K

A. Dataset Construction

The construction of LogoDet-3K comprises three steps, namely logo image collection, logo image filtering and logo object annotation. Each image is manually examined and reviewed after filtering and annotation to guarantee the quality of LogoDet-3K. The dataset building process is detailed in the following subsections. Additionally, each logo name is assigned to one of nine super-classes based on daily needs and the main positioning of common enterprises, namely Clothing, Food, Transportation, Electronics, Necessities, Leisure, Medicine, Sport and Others. Table II gives the statistics of the super-classes of the LogoDet-3K dataset.

Logo Image Collection. A large-scale logo detection dataset should include comprehensive categories. Before crawling logo images, we built a comprehensive logo list based on the 'Forbes Global 2,000'² and other famous logo lists. Finally, we collected 3,000 logo names for our logo vocabulary, which covers nine super-classes.

Subsequently, we used each logo name from the logo vocabulary as the query to crawl logo images from the Google search engine. The top-500 retrieved results were kept for the logo

²https://www.forbes.com/global2000/list/tab:overall

Fig. 3: Multiple logo categories for some brands; these categories are distinguished by adding the suffix '-1', '-2'.

Fig. 4: Sorted distribution of images for each logo in LogoDet-3K.
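The size criteria of the logo image filtering step in Section III-A reduce to a simple per-image predicate. The sketch below is illustrative rather than the authors' actual tooling; the paper specifies the 300-pixel minimum side, while the 4.0 aspect-ratio limit is an assumed stand-in for "extreme aspect ratio".

```python
def keep_image(width, height, max_aspect_ratio=4.0):
    """Size/aspect-ratio filter from the dataset-cleaning step.

    Rejects images whose width or height is below 300 pixels, or
    whose aspect ratio is extreme (the 4.0 limit is an assumed
    value; the paper does not quantify 'extreme').
    """
    if width < 300 or height < 300:
        return False
    ratio = max(width, height) / min(width, height)
    return ratio <= max_aspect_ratio

# A 640x480 photo passes; a 1000x120 banner is rejected.
print(keep_image(640, 480))   # True
print(keep_image(1000, 120))  # False
```

The remaining criteria (duplicates, missing logos, out-of-vocabulary logos) require image content and labels, so they are omitted from this sketch.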

TABLE II: Data statistics on LogoDet-3K.

Root-Category   Sub-Categories  Images   Objects
Food            932             53,350   64,276
Clothes         604             31,266   37,601
Necessities     432             24,822   30,643
Others          371             15,513   20,016
Electronic      224             9,675    12,139
Transportation  213             10,445   12,791
Leisure         111             5,685    6,573
Sports          66              3,953    5,041
Medical         47              3,945    5,185
Total           3,000           158,652  194,261

relevance for each query. In order to increase the diversity of the dataset, we also crawled logo images from other online search engines, including Bing and Baidu. In order to crawl more relevant images, we changed the search terms by adding 'brand' or 'logo' to the search keywords. For example, there were many images of shoes without any logo in the 'Clarks' category, a famous British shoe company. We extended the search term to 'Clarks brand' or 'Clarks logo' and obtained more relevant logo images, as expected.

Logo Image Filtering. To guarantee data quality, we cleaned the collected images manually before annotating them. Considering that not all the collected images are acceptable, we checked each logo category to guarantee that it contained corresponding logo images with a suitable size and aspect ratio, via both automatic processing and manual cleaning. In particular, we removed the following logo images: (1) images with length or height less than 300 pixels, (2) images with an extreme aspect ratio, (3) duplicated images, (4) images without logos and (5) images whose logos are not included in the logo vocabulary. In addition, a brand may have different types of logos, such as a symbolic logo and a textual logo, or even more. In this case, the different types of logos are treated as different logo categories for that brand, similar to [24]. Fig. 3 shows some examples: the suffix '-1', '-2' is added to the logo name as the new logo category; for example, 'Lexus-1' denotes the 'Lexus' symbolic logo while 'Lexus-2' denotes its textual logo.

Logo Object Annotation. As the most important step in constructing logo detection datasets, the annotation process takes a lot of time. The final annotation results follow several criteria. For example, if a logo is occluded, the annotators are instructed to draw the box around its visible parts. If an image contains multiple logo instances, each logo object needs to be annotated. In order to ensure the annotation quality of LogoDet-3K, each bounding box was annotated manually, as close as possible to the logo object to avoid extra background. After finishing the above work, we inspected and examined all the annotated images labeled by the annotators. If an annotated image did not meet these requirements, it was rejected and re-annotated.

B. Dataset Statistics

Our resulting LogoDet-3K consists of 3,000 logo classes, 158,652 images and 194,261 logo objects. To delve into the

Fig. 5: Detailed statistics of LogoDet-3K: image and object distribution per category, the number of objects per image, and object size per image.

Fig. 6: Distributions of categories, images and objects from LogoDet-3K on super-classes.
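The focal loss of Eq. (1) in Section IV can be checked numerically. The following is a minimal scalar sketch (not the authors' implementation), using the α = 0.25 and β = 2 setting reported in the experimental setup.

```python
import math

def focal_loss(p, y, alpha=0.25, beta=2.0):
    """Focal loss of Eq. (1) for a single prediction.

    p is the model's estimated probability in (0, 1); y is the
    ground-truth class (1 for positive, 0 for negative). The
    (1 - p)**beta factor down-weights easy examples, while alpha
    balances positive and negative contributions.
    """
    if y == 1:
        return -alpha * (1 - p) ** beta * math.log(p)
    return -(1 - alpha) * p ** beta * math.log(1 - p)

# An easy, confident positive contributes far less loss than a
# badly misclassified one, which is what counters sample imbalance.
print(focal_loss(0.9, 1) < focal_loss(0.1, 1))  # True
```

With β = 0 and α = 0.5 (up to scale) this reduces to ordinary cross-entropy; increasing β shifts the training signal toward hard examples.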

details of our dataset, we provide statistics at the super-class and category level. Fig. 4 shows the distribution of images for each logo in LogoDet-3K; the thicker a columnar area in the histogram, the larger the proportion. From Fig. 4, we can see that an imbalanced distribution across different logo categories is one characteristic of LogoDet-3K, posing the challenge of effective logo detection with few samples.

In addition, Fig. 5 summarizes the distribution of images and categories in LogoDet-3K. Fig. 5 (A) shows the distribution of the number of images for each category, and Fig. 5 (B) the distribution of the number of objects of each class. As we can see, there exists an imbalanced distribution across logo objects and images for different logo categories. Fig. 5 (C) gives the number of objects in each image; most images contain one or two logo objects. As shown in Fig. 5 (D), LogoDet-3K is composed of 4.81% small instances (area < 32²), 29.79% medium instances (32² <= area <= 96²) and 65.40% large instances (area > 96²). The large percentage of small and medium logo objects (~35%) creates another challenge for logo detection on this dataset, since small logos are harder to detect.

We also provide the statistics of logo categories, images and logo objects for the 9 super-classes in Fig. 6, which directly shows their differences in numbers. The Food, Clothes and Necessities classes are larger in objects and images than the other classes.

IV. APPROACH

Taking the characteristics of LogoDet-3K into consideration, we propose a strong baseline, Logo-Yolo, for logo detection, which adopts the state-of-the-art deep detector YOLOv3 as the backbone to cope with small-scale and multi-scale logos. Since logo images contain few objects, more negative samples and hard samples are produced; we therefore utilize Focal Loss [22] to solve the problem of logo sample imbalance. In addition, we adopted K-means clustering to re-compute the pre-anchor sizes for LogoDet-3K to select the best anchor sizes, and introduced the recently proposed CIoU loss [23] to obtain more accurate regression results.

Improved Losses for Logo Detection. Fewer logo objects in an image produce more negative samples, leading to an imbalance between positive and negative samples. Focal Loss [22] was proposed to solve this problem of sample imbalance. Therefore, we incorporate the Focal Loss into the whole loss of Logo-Yolo; the classification loss is formulated as follows:

\[
\text{Focal Loss} =
\begin{cases}
-\alpha (1 - y')^{\beta} \log y', & y = 1 \\
-(1 - \alpha)\, {y'}^{\beta} \log(1 - y'), & y = 0
\end{cases}
\tag{1}
\]

where y ∈ {±1} is the ground-truth class and y' ∈ [0, 1] is the model's estimated probability from the activation function. Focal loss introduces two factors, α and β, where α is used to balance positive and negative samples, while β focuses more on difficult samples.

In addition, the Ln-norm loss widely adopted for bounding box regression is not tailored to the evaluation metric (Intersection over Union, IoU) in existing methods. We further incorporate the CIoU loss [23] into the whole loss of YOLOv3 to solve the inconsistency between the metric and the bounding-box regression in logo detection. The IoU-based loss can be defined as

\[
L_{CIoU} = 1 - IoU + R_{CIoU}(B_{pd}, B_{gt})
\tag{2}
\]

where R_CIoU is the penalty term for predicted box B_pd and target box B_gt.

CIoU loss considers three geometric factors in bounding box regression, namely overlap area, central-point distance and aspect ratio, to resolve the inconsistency between the metric and the box regression during logo detection. The method therefore minimizes the normalized distance between the central points of the two bounding boxes, and

the penalty term can be defined as

\[
R_{CIoU} = \frac{\varphi^2(b, b^{gt})}{c^2} + \alpha \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2
\tag{3}
\]

where b and b^gt denote the central points of B_pd and B_gt, φ(·) is the Euclidean distance, and c is the diagonal length of the smallest enclosing box covering the two boxes. α is a positive trade-off parameter, and w, h are the width and height of the predicted box, respectively.

Pre-anchors Design for Logo Detection. Anchor boxes are a set of initial fixed width-and-height candidate boxes. Those defined by the original network are no longer suitable for LogoDet-3K. Therefore, we use the K-means clustering algorithm to perform clustering analysis on the bounding boxes of objects in LogoDet-3K, and select the average overlap degree (Avg IoU) as the metric for analyzing the clustering results. We can obtain the number of anchor boxes based on the relationship between the number of samples and Avg IoU.

The aggregated Avg IoU objective function f can be expressed as

\[
f = \arg\max \frac{\sum_{i=1}^{k} \sum_{j=1}^{N_k} I_{IoU}(B, C)}{N}
\tag{4}
\]

where B represents a ground-truth sample, C represents the center of a cluster, N represents the total number of samples, and k represents the number of clusters. In general, we adopt the K-means clustering algorithm to select the number of candidate anchor boxes and their aspect-ratio dimensions.

V. EXPERIMENT

A. Experimental Setup

For parameter settings, we design pre-anchor boxes for the different object detectors via calculations on the LogoDet-3K dataset. In our method, the number of anchor boxes is set to 9, according to the relationship between the number of samples and Avg IoU obtained via K-means clustering. The final anchor centers are (53, 35), (257, 151), (75, 104), (271, 248), (159, 118), (134, 220), (270, 73), (115, 46) and (193, 58), which are the widths and heights of the corresponding cluster centers on the LogoDet-3K dataset. For the Focal loss of Logo-Yolo, α = 0.25 and β = 2.

For the evaluation metric, we use mean Average Precision (mAP) [40] with an IoU threshold of 0.5, meaning that a detection is considered positive if the IoU between the predicted box and the ground-truth box exceeds 50%.

For the experiment datasets, we define various data subsets as different benchmarks by random division of the overall LogoDet-3K dataset. Particularly, we divide the LogoDet-3K dataset into three subsets including 1,000, 2,000 and 3,000 categories, respectively. Through these experiments, we verify the robustness of our method as the number of categories and images increases. The statistics of the three sub-datasets are shown in Table III. In addition, we conduct experiments based on super-categories. The three super-classes with the largest numbers, Food, Clothes and Necessities, are also common logo categories in the real world. This experiment explores the detection effect of our method on common categories and the characteristics of these three category datasets. The statistics of the three subsets from these super-categories are shown in Table IV.

TABLE III: Statistics of three benchmarks.

Datasets         Classes  Images   Objects  Trainval  Test
LogoDet-3K-1000  1,000    85,344   101,345  75,785    11,236
LogoDet-3K-2000  2,000    116,393  136,815  103,356   13,037
LogoDet-3K       3,000    158,652  194,261  142,142   16,510

TABLE IV: Statistics of three super-classes.

Datasets     Classes  Images  Objects  Trainval  Test
Food         932      53,350  64,276   47,321    6,029
Clothes      604      31,266  37,601   27,732    3,534
Necessities  432      24,822  30,643   22,017    2,805

Experiments are performed with state-of-the-art object detectors: Faster R-CNN [13], SSD [14], RetinaNet [22], FPN [41], Cascade R-CNN [35], Distance-IoU [23] and YOLOv3 [15]. For their backbones, we adopt the general setting: ResNet-101 is selected as the backbone for Faster R-CNN, RetinaNet, FPN and Cascade R-CNN; DarkNet-53 is used as the backbone of YOLOv3 and Distance-IoU; and VGGNet-16 [42] is used for SSD. The experiments are conducted in the PyTorch and DarkNet frameworks on NVIDIA Tesla K80 and Tesla V100 GPUs.

B. Experimental Results

TABLE V: Comparison of baselines on different benchmarks (%).

Benchmarks       Methods              Backbones   mAP
LogoDet-3K-1000  Faster RCNN [13]     ResNet-101  45.16
                 SSD [14]             VGGNet-16   43.32
                 RetinaNet [22]       ResNet-101  52.10
                 FPN [41]             ResNet-101  49.63
                 Cascade R-CNN [35]   ResNet-101  48.14
                 Distance-IoU [23]    DarkNet-53  53.06
                 YOLOv3 [15]          DarkNet-53  55.21
                 Logo-Yolo            DarkNet-53  58.86
LogoDet-3K-2000  Faster RCNN [13]     ResNet-101  41.86
                 SSD [14]             VGGNet-16   38.97
                 RetinaNet [22]       ResNet-101  49.00
                 FPN [41]             ResNet-101  47.91
                 Cascade R-CNN [35]   ResNet-101  46.32
                 Distance-IoU [23]    DarkNet-53  51.69
                 YOLOv3 [15]          DarkNet-53  52.32
                 Logo-Yolo            DarkNet-53  56.42
LogoDet-3K       Faster RCNN [13]     ResNet-101  38.30
                 SSD [14]             VGGNet-16   34.47
                 RetinaNet [22]       ResNet-101  44.32
                 FPN [41]             ResNet-101  42.84
                 Cascade R-CNN [35]   ResNet-101  41.23
                 Distance-IoU [23]    DarkNet-53  46.34
                 YOLOv3 [15]          DarkNet-53  48.61
                 Logo-Yolo            DarkNet-53  52.28

Table V summarizes the results of the different detection models on the three subsets. Compared with existing baselines such as Faster RCNN, SSD and RetinaNet, the YOLOv3 detector obtains better results on the three subsets, with 55.21%, 52.32% and 48.61%, respectively. The results of YOLOv3 are higher than the Faster RCNN detector because there are more

Fig. 7: Some detection results of Logo-Yolo on LogoDet-3K.
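The CIoU regression loss of Eqs. (2) and (3), which drives the tighter box regression visible in these results, can be exercised on toy boxes. The sketch below is an illustration, not the authors' implementation; it assumes axis-aligned boxes in (x1, y1, x2, y2) form and computes the trade-off weight α as v / ((1 − IoU) + v), following the original CIoU formulation, where the paper only calls α "a positive trade-off parameter".

```python
import math

def ciou_loss(pred, gt):
    """CIoU loss of Eqs. (2)-(3) for boxes given as (x1, y1, x2, y2)."""
    # Overlap term: plain Intersection over Union.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)

    # Central-point term phi^2(b, b_gt) / c^2: squared center distance
    # normalized by the squared diagonal of the smallest enclosing box.
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    dist = ((px - gx) ** 2 + (py - gy) ** 2) / c2

    # Aspect-ratio consistency term of Eq. (3).
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + dist + alpha * v  # Eq. (2) with R_CIoU expanded

# A perfectly matching box gives zero loss; a shifted box is penalized.
print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0
```

Unlike an Ln-norm loss on box coordinates, this loss is aligned with the IoU evaluation metric while still giving a gradient when the boxes do not overlap at all.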

Fig. 8: Qualitative comparison on LogoDet-3K between YOLOv3 and Logo-Yolo. Green boxes: ground-truth boxes. Red boxes: correct detection boxes. Yellow boxes: incorrect detection boxes.

small logo objects and fewer objects for many images in real-world scenarios, and the one-stage method is more suitable for this case. Therefore, we use the one-stage YOLOv3 detector as the basis of our method.

We then compare the performance of Logo-Yolo with all baselines, and observe that Logo-Yolo achieves the best performance among these models. It is worth noting that the mAP of Logo-Yolo is 58.86%, 56.42% and 52.28% on the three benchmarks, a performance gain of 3.65%, 4.10% and 3.67% over YOLOv3 in Table V. Logo-Yolo achieves the best detection performance on all of the 1000-, 2000- and 3000-category datasets, which demonstrates the stability of the method.

Some detection results of Logo-Yolo are given in Fig. 7, including the regression bounding box and the classification accuracy. The red boxes represent prediction boxes and the green boxes are ground-truth boxes. Clearly, Logo-Yolo can detect objects that are occluded, ambiguous or small, and it obtains more accurate bounding box regression. As shown in Fig. 8, the detector YOLOv3 makes some detection mistakes, such as treating a person or a hamburger as a logo, and thus the bounding boxes of detected logos are inaccurate or missing. In contrast, our method obtains better performance both in bounding box regression and in the confidence of detected logos. In particular, our method has an advantage in small logo detection, as seen in the detected logos in the last two images of Fig. 8.

In addition, Table VI gives the comparison of the three super-classes across different methods. Compared with existing baselines, the Logo-Yolo detector also obtains better results, with 56.73%, 61.32% and 61.43% on the super-classes of Food, Clothes and Necessities, respectively, which are 3.24%, 4.31% and 3.75% higher than YOLOv3. This experiment also illustrates the effectiveness of our method. As we can see from Table VI, the number of Necessities categories is 172 less than the Clothes categories, but relatively similar detection results are obtained (61.32% vs 61.43%), indicating that the Necessities category dataset is more difficult to detect. Analyzing food logos, with their large number of categories and images, the detection performance on the 932 food categories

Fig. 9: The Precision-Recall curve of Logo-Yolo and YOLOv3. The larger the enclosing area under the curve, the better the detection effect.
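The "enclosing area under the curve" compared in Fig. 9 is what average precision measures. Below is a simplified sketch using interpolated precision over rectangular segments; the mAP [40] reported in the paper follows the PASCAL VOC evaluation protocol, which this sketch approximates rather than reproduces.

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve with interpolated precision.

    Each precision value is replaced by the maximum precision at any
    higher recall (standard interpolation), then rectangular areas are
    summed where recall increases. Simplified sketch, not the exact
    evaluation code behind the paper's mAP numbers.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangles over the recall increments.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# A detector holding precision 1.0 up to recall 0.5, dropping to
# precision 0.5 at full recall, scores AP = 0.75.
print(round(average_precision([0.5, 1.0], [1.0, 0.5]), 3))  # 0.75
```

Raising recall without losing precision, as Logo-Yolo does for small logos, directly enlarges this area and hence the reported mAP.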

Fig. 10: Left: Performance evaluation for different IoU thresholds. Right: The comparison of Logo-Yolo and YOLOv3 with increasing
iterations.

TABLE VI: Comparison of super-classes on different methods (%).

Benchmarks   Methods              Backbones   mAP
Food         Faster RCNN [13]     ResNet-101  47.32
             SSD [14]             VGGNet-16   46.18
             RetinaNet [22]       ResNet-101  51.46
             FPN [41]             ResNet-101  51.10
             Cascade R-CNN [35]   ResNet-101  52.46
             Distance-IoU [23]    DarkNet-53  53.11
             YOLOv3 [15]          DarkNet-53  53.49
             Logo-Yolo            DarkNet-53  56.73
Clothes      Faster RCNN [13]     ResNet-101  51.63
             SSD [14]             VGGNet-16   49.74
             RetinaNet [22]       ResNet-101  55.98
             FPN [41]             ResNet-101  55.62
             Cascade R-CNN [35]   ResNet-101  56.90
             Distance-IoU [23]    DarkNet-53  56.54
             YOLOv3 [15]          DarkNet-53  57.01
             Logo-Yolo            DarkNet-53  61.32
Necessities  Faster RCNN [13]     ResNet-101  52.22
             SSD [14]             VGGNet-16   50.03
             RetinaNet [22]       ResNet-101  54.01
             FPN [41]             ResNet-101  53.37
             Cascade R-CNN [35]   ResNet-101  55.49
             Distance-IoU [23]    DarkNet-53  57.20
             YOLOv3 [15]          DarkNet-53  57.68
             Logo-Yolo            DarkNet-53  61.43

is slightly lower than on the 1000-category subset (56.73% vs 58.86%). The result shows that food-related logo detection is more challenging.

C. Analysis

Since Logo-Yolo and YOLOv3 obtain the best detection performance, we next focus on an analysis comparing these two methods.

Dataset Scale. According to Table V, the drop of Logo-Yolo in mAP is 2.44% and 4.14% when the number of categories increases from 1,000 to 2,000 and from 2,000 to 3,000. Compared with YOLOv3 and the other baselines, our model achieves better performance on datasets of different scales, which demonstrates higher robustness on LogoDet-3K. We further calculate Precision and Recall to illustrate the accuracy and the missed detection rate. The Precision-Recall curves in Fig. 9 show the trade-off between Precision and Recall for YOLOv3 and Logo-Yolo; the larger the enclosed area under the curve, the better the detection performance. As shown in Fig. 9, Logo-Yolo significantly improves the recall rate, which indicates that our method alleviates the problem of missing small objects in logo detection.

Parameter Sensitivity. We evaluate the performance by varying the IoU threshold from 0.5 to 0.8 at intervals of 0.05. As shown in Fig. 10 (Left), Logo-Yolo (red curve) shows more stable performance than YOLOv3 (blue curve) as the IoU threshold changes. We also set different iterations to compare the convergence and accuracy

of the models. Fig. 10 (Right) shows higher performance with increasing iterations. It can be seen that our method converges at about 400,000 iterations and keeps higher accuracy than YOLOv3 throughout the training process.

TABLE VII: Evaluation of individual modules and pairs of modules of Logo-Yolo (%).

Model                                mAP
YOLOv3                               48.61
YOLOv3 + Pre-anchors Design          50.12
YOLOv3 + Focal Loss                  49.21
YOLOv3 + CIoU loss                   49.86
Logo-Yolo (w/o Pre-anchors Design)   49.92
Logo-Yolo (w/o Focal Loss)           51.50
Logo-Yolo (w/o CIoU loss)            50.64
Logo-Yolo                            52.28

TABLE VIII: The performance of Logo-Yolo on Top-Logo-10 (%).

Method                  mAP
Faster RCNN [13]        41.80
SSD [14]                38.70
YOLO [43]               44.58
YOLOv3 [15]             50.10
Logo-Yolo               52.17
Logo-Yolo (Pre-trained) 53.62

TABLE IX: The performance of Logo-Yolo on FlickrLogos-32 (%).

Method                  mAP
Bag of Words (BoW) [5]  54.50
Deep Logo [37]          74.40
BD-FRCN-M [39]          73.50
Faster RCNN [13]        70.20
YOLO [43]               68.70
YOLOv3 [15]             71.70
Logo-Yolo               74.62
Logo-Yolo (Pre-trained) 76.11

D. Ablation Study

We conduct a comprehensive analysis of the effects of the three sub-variables and of pairs of modules of Logo-Yolo. Table VII shows an ablation study of the different combinations of the Pre-anchors Design (K-means), Focal Loss and CIoU loss. First, the three modules are added to YOLOv3 individually, and the results improve by 1.51%, 0.60% and 1.25%, which proves the effectiveness of the Pre-anchors Design, Focal Loss and CIoU loss, respectively.

1.5 percent improvement after pre-training on LogoDet-3K, showing the better generalization ability of LogoDet-3K. We can also see similar trends on FlickrLogos-32 in Table IX. Overall, the evaluations on these two datasets verify the effectiveness of Logo-Yolo, and also show the better generalization ability of LogoDet-3K on other logo detection datasets.

In addition, we further select the QMUL-OpenLogo dataset to evaluate general object detection. This is the largest publicly available logo detection dataset, containing 352 categories and 27,083 images. To further exploit the fine-tuning capability of LogoDet-3K, we analyze the difference between LogoDet-3K pre-trained weights and QMUL-OpenLogo pre-trained weights.

According to Table X, our LogoDet-3K dataset shows strong generalization ability. Compared with the YOLOv3 and Logo-Yolo baselines, our models fine-tuned from LogoDet-3K for QMUL-OpenLogo detection significantly boost the performance, by 1.73 points (53.69% vs 51.96%) for YOLOv3 and 2.16 points (55.37% vs 53.21%) for Logo-Yolo, and the fine-tuned Logo-Yolo gains a further 1.68 points over the fine-tuned YOLOv3 (55.37% vs 53.69%). These results show the effectiveness of the pre-trained models and the Logo-Yolo method. By pre-training on the LogoDet-3K dataset with the 352 QMUL-OpenLogo categories removed (LogoDet-3K w/o QMUL-OpenLogo), we can still
tively. Then, we conduct the two modules experiments from achieve competitive results with 52.36% on the QMUL-
Logo-Yolo. The result of Logo-Yolo is higher than Logo-Yolo OpenLogo benchmark, 0.4 points higher than the result in
without Pre-anchors Design, which explains the effectiveness YOLOv3 method, and 1.25 points for Logo-Yolo. It shows that
of two losses. Similarly, compared to Logo-Yolo without the LogoDet-3K dataset has the generalization ability. Com-
Focal Loss or CIoU loss, our proposed method achieves pared with QMUL-OpenLogo, our LogoDet-3K benchmark
improvement, which demonstrates the effectiveness of another has much higher performance gain. By involving QMUL-
two modules for Logo-Yolo. OpenLogo Pre-training before LogoDet-3K, we can slightly
improve the YOLOv3 with 0.34. For the Logo-Yolo, the
E. Generalization Ability on Logo Detection QMUL-OpenLogo pre-training before LogoDet-3K can further
To evaluate the robustness and generalization ability of bring in 0.73 points gain. The results shows LogoDet-3K
Logo-Yolo architecture and its pre-trained models, we explore contains richer logo features than QMUL-OpenLogo dataset,
other two datasets Top-Logo-10 [27] and FlickrLogos-32 [2]. which can be widely used for logo detection.
The former contains 10 unique logo classes with 70 images for
each logo class, and the latter is a popular logo dataset with F. Generalization Ability on Logo Retrieval
full annotations, comprising 8,240 images from 32 categories. For the retrieval experiments, each of the ten FlickrLogos-
Logo-Yolo (per-trained) first loades the model trained on 32 train samples for each brand serves as query sample. This
LogoDet-3K, and is then trained on the target dataset while allows to assess the statistical significance of results similar
Logo-Yolo is directly trained on the target dataset with random to a 10-fold-cross-validation strategy. As shown in Table XI
parameter initialization. the ResNet101+Litw [24] is the better logo retrieval method.
Table VIII summarizes experimental results for Top-Logo- Detected logos are described by the feature extraction network
10. We observe that our method Logo-Yolo achieves better per- outputs where three different state-of-the-art classification ar-
formance compared with other models. There is further about chitectures, namely VGG16, ResNet101 and DenseNet161,
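This retrieval setup can be sketched in miniature as follows: detected logos are represented by feature vectors (here toy hand-made vectors standing in for network outputs) and gallery images are ranked by cosine similarity to the query. This is our own illustration under those assumptions, not the paper's pipeline:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_gallery(query_feature, gallery):
    # `gallery` maps image ids to feature vectors of detected logos;
    # return the ids sorted by decreasing similarity to the query.
    return sorted(gallery,
                  key=lambda k: cosine_similarity(query_feature, gallery[k]),
                  reverse=True)

# Toy 3-D "embeddings" standing in for CNN feature outputs.
query = (0.9, 0.1, 0.0)
gallery = {
    "img_same_brand": (1.0, 0.2, 0.1),
    "img_related": (0.5, 0.8, 0.1),
    "img_other": (0.0, 0.1, 1.0),
}
print(rank_gallery(query, gallery))
```

Retrieval mAP is then computed from such ranked lists per query; swapping the base network changes only how the feature vectors are produced, not the ranking step.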
In addition, we apply the proposed Logo-Yolo method to the FlickrLogos-32 retrieval experiments, both as a baseline network and pre-trained on LogoDet-3K. We use the recent retrieval-based detection method Deepvision [18], which adopts two different state-of-the-art classification architectures, ResNet101 and DenseNet161, with results of 52.62% and 50.78% mAP, respectively. With the LogoDet-3K pre-trained model applied to Deepvision [18], the results rise to 54.17% and 52.91% mAP, improvements of 1.55 and 2.13 points over Deepvision. These experimental results show that the pre-trained model generated by our dataset is also effective for the logo retrieval task, further illustrating the value of LogoDet-3K in logo-related research.

TABLE X: Generalization ability of general object detection results on the QMUL-OpenLogo dataset (%).
Method Pre-trained Dataset mAP
YOLO9000 [44] QMUL-OpenLogo 26.33
YOLOv2+CAL [30] QMUL-OpenLogo 49.17
FR-CNN+CAL [30] QMUL-OpenLogo 51.03
YOLOv3 QMUL-OpenLogo 51.96
YOLOv3 LogoDet-3K w/o QMUL-OpenLogo 52.36
YOLOv3 LogoDet-3K 53.69
YOLOv3 QMUL-OpenLogo -> LogoDet-3K 54.03
Logo-Yolo QMUL-OpenLogo 53.21
Logo-Yolo LogoDet-3K w/o QMUL-OpenLogo 54.46
Logo-Yolo LogoDet-3K 55.37
Logo-Yolo QMUL-OpenLogo -> LogoDet-3K 56.10

TABLE XI: Evaluation retrieval results on FlickrLogos-32 (%).
Method mAP
baseline [27] 36.00
ResNet101 32.70
DenseNet161 36.80
ResNet101+Litw [24] 46.40
DenseNet161+Litw [24] 44.80
Deepvision (ResNet101) 52.62
Deepvision (DenseNet161) 50.78
Deepvision (ResNet101+Pre-trained) 54.17
Deepvision (DenseNet161+Pre-trained) 52.91

Fig. 11: Qualitative results of some failure cases of Logo-Yolo. Green boxes denote the ground truth; red boxes represent correct logo detections, while yellow boxes are mistakes.

G. Discussion

Compared with existing methods, our proposed method obtains better detection performance than YOLOv3, especially on small objects and complex backgrounds in logo images. However, it cannot achieve high detection performance in some cases. Fig. 11 shows some failure cases of Logo-Yolo. Logo-Yolo has difficulty detecting smaller-scale logos, leading to missed detections, as in the third image of Fig. 11. In addition, logos of the same brand look similar and often appear in the same image, which causes problems in object classification, as in the fourth image. As Fig. 11 shows, our method performs worse when logos are occluded, blend into the background, or are very small. Logo detection on LogoDet-3K therefore still poses great challenges, such as the multi-label problem and the large-scale problem, which in turn highlights the comparative difficulty of the LogoDet-3K dataset.

VI. CONCLUSIONS

In this paper, we present the LogoDet-3K dataset, the largest logo detection dataset with full annotation, which has 3,000 logo categories, about 200,000 high-quality manually annotated logo objects and 158,652 images. Detailed analysis shows that LogoDet-3K is highly diverse and more challenging than previous logo datasets. It therefore establishes a more challenging benchmark and can benefit many existing localization-sensitive logo-related tasks, such as logo detection, logo retrieval and logo synthesis. In addition, we propose a new strong baseline method, Logo-Yolo, which achieves better detection performance than other state-of-the-art baselines; we also report results of various detection models and demonstrate the effectiveness of our method and the generalization ability of LogoDet-3K on three other logo datasets and on logo retrieval tasks. In the future, we hope LogoDet-3K will become a new benchmark dataset for a broad range of logo-related research. With the rapid development of major brands, real-time logo detection will become a trend of future research. We will continue to explore the characteristics of the LogoDet-3K dataset and pursue anchor-free and lightweight designs specifically for logo detection to achieve faster and more accurate logo detection.

REFERENCES

[1] Y. Gao, F. Wang, H. Luan, and T.-S. Chua, "Brand data gathering from live social media streams," in International Conference on Multimedia Retrieval, 2014, pp. 169–176.
[2] S. Romberg, L. G. Pueyo, R. Lienhart, and R. van Zwol, "Scalable logo recognition in real-world images," in ACM International Conference on Multimedia Retrieval, 2011, pp. 1–8.
[3] J. Revaud, M. Douze, and C. Schmid, "Correlation-based burstiness for logo retrieval," in ACM International Conference on Multimedia, 2012, pp. 965–968.
[4] Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis, "Scalable triangulation-based logo recognition," in ACM International Conference on Multimedia Retrieval, 2011, pp. 1–7.
[5] S. Romberg and R. Lienhart, "Bundle min-hashing for logo recognition," in ACM International Conference on Multimedia Retrieval, 2013, pp. 113–120.
[6] W.-Q. Yan, J. Wang, and M. S. Kankanhalli, "Automatic video logo detection and removal," Multimedia Systems, pp. 379–391, 2005.
[7] Y. Bao, H. Li, X. Fan, R. Liu, and Q. Jia, "Region-based CNN for logo detection," in Internet Multimedia Computing and Service, 2016, pp. 319–322.
[8] C. Eggert, D. Zecha, S. Brehm, and R. Lienhart, "Improving small object proposals for company logo detection," in ACM International Conference on Multimedia Retrieval, 2017, pp. 167–174.
[9] L. Yang, P. Luo, C. C. Loy, and X. Tang, "A large-scale car dataset for fine-grained categorization and verification," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3973–3981.
[10] Y. Gao, Y. Zhen, H. Li, and T. Chua, "Filtering of brand-related microblogs using social-smooth multiview embedding," IEEE Transactions on Multimedia, pp. 2115–2126, 2016.
[11] L. Liu, D. Dzyabura, and N. Mizik, "Visual listening in: Extracting brand image portrayed on social media," in AAAI Conference on Artificial Intelligence, 2018, pp. 71–77.
[12] Z. Cheng, X. Wu, Y. Liu, and X. Hua, "Video ecommerce++: Toward large scale online video advertising," IEEE Transactions on Multimedia, pp. 1170–1183, 2017.
[13] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Conference on Neural Information Processing Systems, 2015, pp. 91–99.
[14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, "SSD: single shot multibox detector," in European Conference on Computer Vision, 2016, pp. 21–37.
[15] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[16] J. Neumann, H. Samet, and A. Soffer, "Integration of local and global shape analysis for logo classification," Pattern Recognition Letters, pp. 1449–1457, 2002.
[17] H. Su, S. Gong, and X. Zhu, "WebLogo-2M: scalable logo detection by deep learning from the web," in IEEE International Conference on Computer Vision Workshops, 2017, pp. 270–279.
[18] I. Fehérvári and S. Appalaraju, "Scalable logo recognition using proxies," in IEEE Winter Conference on Applications of Computer Vision, 2019, pp. 715–725.
[19] J. Wang, W. Min, S. Hou, S. Ma, Y. Zheng, H. Wang, and S. Jiang, "Logo-2K+: a large-scale logo dataset for scalable logo classification," in AAAI Conference on Artificial Intelligence, 2020, pp. 6194–6201.
[20] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "ImageNet: a large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[21] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: common objects in context," in European Conference on Computer Vision, 2014, pp. 740–755.
[22] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in IEEE International Conference on Computer Vision, 2017, pp. 2999–3007.
[23] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," in AAAI Conference on Artificial Intelligence, 2020, pp. 12993–13000.
[24] A. Tüzkö, C. Herrmann, D. Manger, and J. Beyerer, "Open set logo detection and retrieval," in Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2018, pp. 284–292.
[25] S. C. Hoi, X. Wu, H. Liu, Y. Wu, H. Wang, H. Xue, and Q. Wu, "LOGO-Net: large-scale deep logo detection and brand recognition with deep region-based convolutional networks," arXiv preprint arXiv:1511.02462, 2015.
[26] S. Bianco, M. Buzzelli, D. Mazzini, and R. Schettini, "Deep learning for logo recognition," Neurocomputing, pp. 23–30, 2017.
[27] H. Su, X. Zhu, and S. Gong, "Deep learning logo detection with data expansion by synthesising context," in IEEE Winter Conference on Applications of Computer Vision, 2017, pp. 530–539.
[28] Y. Liao, X. Lu, C. Zhang, Y. Wang, and Z. Tang, "Mutual enhancement for detection of multiple logos in sports videos," in IEEE International Conference on Computer Vision, 2017, pp. 4856–4865.
[29] L. Xie, Q. Tian, W. Zhou, and B. Zhang, "Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb," Computer Vision and Image Understanding, pp. 31–41, 2014.
[30] H. Su, X. Zhu, and S. Gong, "Open logo detection challenge," in British Machine Vision Conference, 2018, pp. 111–119.
[31] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1627–1645, 2010.
[32] P. Gao, K. Lu, J. Xue, L. Shao, and J. Lyu, "A coarse-to-fine facial landmark detection method based on self-attention mechanism," IEEE Transactions on Multimedia, pp. 1–10, 2020.
[33] H. Law and J. Deng, "Cornernet: Detecting objects as paired keypoints," in European Conference on Computer Vision, 2018, pp. 765–781.
[34] B. Singh, M. Najibi, and L. S. Davis, "SNIPER: efficient multi-scale training," in Conference on Neural Information Processing Systems, 2018, pp. 9333–9343.
[35] Z. Cai and N. Vasconcelos, "Cascade R-CNN: delving into high quality object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
[36] S. Bianco, M. Buzzelli, D. Mazzini, and R. Schettini, "Logo recognition using CNN features," in International Conference on Image Analysis and Processing, 2015, pp. 438–448.
[37] F. N. Iandola, A. Shen, P. Gao, and K. Keutzer, "DeepLogo: hitting logo recognition with the deep neural network hammer," arXiv preprint arXiv:1510.02131, 2015.
[38] H. Su, S. Gong, and X. Zhu, "Scalable logo detection by self co-learning," Pattern Recognition, p. 107003, 2020.
[39] G. Oliveira, X. Frazão, A. Pimentel, and B. Ribeiro, "Automatic graphic logo detection via fast region-based convolutional networks," in International Joint Conference on Neural Networks, 2016, pp. 985–991.
[40] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," International Journal of Computer Vision, pp. 303–338, 2010.
[41] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 936–944.
[42] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015, pp. 1–14.
[43] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[44] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6517–6525.