
Preprint (2020) · arXiv:2010.03341v2 [cs.CV] 15 Oct 2020

Deep Learning in Diabetic Foot Ulcers Detection: A Comprehensive Evaluation

Moi Hoon Yapa,∗, Ryo Hachiumab,1 , Azadeh Alavic,1 , Raphael Brüngeld,1 , Manu Goyale,1 , Hongtao Zhuf,1 , Bill Cassidya , Johannes
Ruckertd , Moshe Olshanskyc , Xiao Huangf , Hideo Saitob , Saeed Hassanpoure , Christoph M. Friedrichd,g , David Ascherc , Anping
Songf , Hiroki Kajitah , David Gillespiea , Neil D. Reevesa , Joseph Pappachani , Claire O’Sheaj , Eibe Frankk
a Manchester Metropolitan University, John Dalton Building, Chester Street, Manchester M1 5GD, UK
b Keio University, Yokohama, Kanagawa, Japan
c Baker Heart and Diabetes Institute, 20 Commercial Road, Melbourne, VIC 3000, Australia
d Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge-Str. 42, 44227 Dortmund, Germany
e Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA
f Shanghai University, Shanghai 200444, China

g Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Hufelandstr. 55, 45122 Essen, Germany
h Keio University School of Medicine, Shinanomachi, Tokyo, Japan
i Lancashire Teaching Hospitals, Chorley, UK
j Waikato Diabetes Health Board, Hamilton, New Zealand
k Department of Computer Science, University of Waikato, Hamilton, New Zealand

ARTICLE INFO

Article history:
2000 MSC: 41A05, 41A10, 65D05, 65D17
Keywords: diabetic foot ulcers, object detection, machine learning, deep learning, DFUC2020

ABSTRACT

There has been a substantial amount of research on computer methods and technology for the detection and recognition of diabetic foot ulcers (DFUs), but there is a lack of systematic comparisons of state-of-the-art deep learning object detection frameworks applied to this problem. With the recent development and data sharing performed as part of the DFU Challenge (DFUC2020), such a comparison becomes possible: DFUC2020 provided participants with a comprehensive dataset consisting of 2,000 images for training each method and 2,000 images for testing them. The following deep learning-based algorithms are compared in this paper: Faster R-CNN, three variants of Faster R-CNN and an ensemble method; YOLOv3; YOLOv5; EfficientDet; and a new Cascade Attention Network. For each deep learning method, we provide a detailed description of model architecture, parameter settings for training, and additional stages including pre-processing, data augmentation and post-processing. We provide a comprehensive evaluation for each method. All the methods required a data augmentation stage to increase the number of images available for training and a post-processing stage to remove false positives. The best performance is obtained by Deformable Convolution, a variant of Faster R-CNN, with a mAP of 0.6940 and an F1-Score of 0.7434. Finally, we demonstrate that an ensemble based on different deep learning methods can enhance the F1-Score but not the mAP. Our results show that state-of-the-art deep learning methods can detect DFU with some accuracy, but there are many challenges ahead before they can be implemented in real world settings.

© 2020 Preprint.

1. Introduction

According to the International Diabetes Federation IDF (2019), there are approximately 463 million adults with diabetes worldwide. This number is expected to grow to 700 million by 2045. A person with diabetes has a 34% lifetime risk of developing a diabetic foot ulcer (DFU); in other words, 1 in every 3 people with diabetes will develop a DFU in their lifetime Armstrong et al. (2017). Infection of a DFU frequently leads to limb amputation, causing significant morbidity, psychological distress, and reduced quality of life and life expectancy. Prevention of DFU is the optimal management pathway; however, current prevention strategies rely on patient and clinician vigilance and place a high burden on global health services. It is essential to develop a technological solution capable of transforming current screening practices and vastly reducing the clinical time burden.

∗ Corresponding author: Tel.: +44 161 247 1503; e-mail: [email protected] (Moi Hoon Yap)
1 Authors with equal contribution.


With the emerging growth of deep learning, automated analysis of DFU has become possible. However, deep learning requires large-scale datasets to achieve results comparable with those of human experts. Currently, medical imaging researchers are working in isolation, and the majority of their research is not reproducible. To bridge the gap and to motivate data sharing amongst researchers and clinicians, Yap et al. (2020c,b) proposed diabetic foot ulcer challenges. This paper presents an overview of the state-of-the-art computer methods in DFU detection, provides an overview of the publicly available datasets, presents a comprehensive evaluation of popular object detection frameworks on DFU detection, proposes an ensemble method and a Cascade Attention DetNet for DFU detection, and conducts a comprehensive evaluation of the deep learning algorithms on the DFUC2020 dataset.

Fig. 1. The users of the DFUC2020 dataset across the world.
2. Related Work

The growing number of reported cases of diabetes has resulted in a corresponding growth in research interest in DFU. Early attempts at training deep learning models in this domain have shown promising results. Goyal et al. (2018, 2017, 2019) trained models capable of classification, localisation and segmentation. These models reported high levels of mAP, sensitivity and specificity in experimental settings. The localisation model was trained using Faster R-CNN with Inception v2 and two-tier transfer learning from the Microsoft Common Objects in Context (MS COCO) dataset. However, despite the high scoring performance measures, these models were trained and evaluated on small datasets (fewer than 2,000 images), so the results cannot be regarded as conclusive evidence of their efficacy in real-world settings.

Brown et al. (2017) created the MyFootCare mobile app, which was designed to encourage patient self-monitoring using diaries, goals and notifications. The app stores a log of patient foot images and is capable of semi-automated segmentation. This novel solution to maintaining foot records utilises a method of automatic photograph capture where the phone is placed on the floor and the patient is guided using voice feedback. However, this particular function of the system was not tested during the actual experiment, so it is not known how well it performed in real-world settings.

Wang et al. (2014, 2016) devised a method of consistent DFU image capture using a box with a glass surface containing mirrors which reflect the image back to a camera or mobile device. Cascaded two-stage support vector classification was used to ascertain the DFU region, followed by a two-stage super-pixel classification technique for segmentation and feature extraction. Despite being highly novel, this method exhibited a number of limitations, such as the risk of infection due to physical contact between wound and capture box. The design of the capture box also limited monitoring to DFUs present on the plantar surface of the foot. The sample size was also statistically insignificant, with only 35 images from real patients and 30 images of wound moulds.

3. Datasets

The DFU datasets provided by The Manchester Metropolitan University and The Lancashire Teaching Hospitals Goyal et al. (2018, 2020); Cassidy et al. (2020) are digital DFU image datasets with expert annotations. The aim of the publication of this data is to encourage more researchers to work in this domain and to conduct reproducible experiments. There are three types of datasets made publicly available for researchers. The first dataset consists of foot skin patches for DFUNet classification Goyal et al. (2018); the second dataset contains regions of interest for infection and ischaemia classification Goyal et al. (2020); and the third is the most recently published dataset for DFU detection Cassidy et al. (2020). The third dataset is the largest to date, and increased usage of this data is the driving force for the organisers of the DFUC contests. The researchers involved in organising the yearly DFU challenges Yap et al. (2020c,b), in conjunction with MICCAI conferences, aim to attract wider participation to improve the diagnosis and monitoring of foot ulcers and to raise awareness of diabetes and DFU. The Diabetic Foot Ulcers Grand Challenge (DFUC2020) datasets consist of 2,000 training images, 200 validation images and 2,000 testing images Cassidy et al. (2020); Goyal et al. (2019). The data consists of 2,496 ulcers in the training set and 2,097 ulcers in the testing set. To increase the level of challenge, some of the images in the testing set do not contain a DFU. The details of the dataset are described in Cassidy et al. (2020). To improve the performance of the deep learning methods and reduce the computational costs, the images were resized to 640 × 480. Since the release of the DFUC2020 training dataset on 27th April 2020, we have received requests from 39 international institutions (see Fig. 1). There were a total of 31 submissions from 11 teams to the grand challenge. We report the top scores from each team and discuss their methods according to the object detection approaches involved.

4. DFU Detection Methods

This section presents a comprehensive description of the DFU detection methods used, grouped according to the popular deep learning object detection algorithms they apply, i.e. Faster R-CNN, YOLOv3, YOLOv5 and EfficientDet. We also include descriptions of an ensemble method and a new Cascade Attention DetNet (CA-DetNet).
4.1. Faster R-CNN

Faster R-CNN Ren et al. (2015) is a two-stage object detection model. It generates a sparse set of candidate object locations with a Region Proposal Network (RPN) operating on shared feature maps, and then classifies each candidate proposal as foreground or background. After shared feature maps are extracted with a CNN, the first-stage RPN takes them as input and generates a set of candidate bounding box locations, each with an "objectness" score. The size of each anchor is configured using hyperparameters. The proposals are then passed to the region of interest pooling layer (RoI pooling) to generate sub-feature maps, which are converted to 4096-dimensional vectors and fed into fully connected layers. These layers feed a regression network that predicts bounding box offsets, and a classification network that predicts the class label of each bounding box proposal.

The RoI pooling layer quantizes a floating-number RoI to the discrete granularity of the feature map. This quantization introduces misalignments between the RoI and the extracted features. Therefore, the model evaluated in this paper employs a RoIAlign layer, introduced in Mask R-CNN He et al. (2017), instead of the RoI pooling layer. This removes the harsh quantization of RoI pooling, properly aligning the extracted features with the input.

Also, the Feature Pyramid Network (FPN) Lin et al. (2017) is employed as the backbone of the network. FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. Specifically, we employ ResNeXt101 Xie et al. (2017) with the FPN feature extraction backbone to extract the features.

4.1.1. Data Augmentation

In this challenge, the images in the dataset were captured from different viewpoint angles, with cameras of different focal lengths, and with varying levels of blur. Also, the training dataset contains only 2,000 images, which could be considered small for training deep learning models. Therefore, we employ various data augmentation techniques for robust prediction. Specifically, we employ the following augmentations (a code sketch follows the list):

• HSV and RGB: As the lighting conditions differ among images in the dataset, we apply random RGB and HSV shifts to the images. Specifically, we randomly add/subtract from 0 to 10 RGB values and 0 to 20 HSV values.

• Blurring: As the dataset contains images captured at different focal lengths, some images are blurred and contain camera noise. Therefore, we apply Gaussian and median blur filters with the filter size set to 3. The filters are applied with a probability of 0.1.

• Affine transformation: As the images are captured from different camera angles, we apply random affine transformations to the images. Specifically, we apply random shift, scaling (0.1), and rotation (90 degrees).

• Brightness: As the images are captured in various environments, we employ brightness and contrast data augmentation. More specifically, we randomly change the brightness and contrast on a scale from 0.1 to 0.3, with the probability set to 0.2.
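The augmentations above map closely onto off-the-shelf transform libraries. The following is a minimal sketch of such a pipeline using the Albumentations library; the paper does not name the library used, so the exact API calls and the bounding box handling shown here are assumptions.

```python
# Sketch of the augmentation pipeline described above (Albumentations assumed;
# the paper does not name the library, so treat this as illustrative only).
import albumentations as A

train_transform = A.Compose(
    [
        # Random RGB shift (0-10 per channel) and HSV shift (0-20).
        A.RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10, p=0.5),
        A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=20,
                             val_shift_limit=20, p=0.5),
        # Gaussian or median blur with kernel size 3, applied with probability 0.1.
        A.OneOf([A.GaussianBlur(blur_limit=3), A.MedianBlur(blur_limit=3)], p=0.1),
        # Random shift, scaling (0.1) and rotation (up to 90 degrees).
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=90, p=0.5),
        # Brightness/contrast jitter in the 0.1-0.3 range, probability 0.2.
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.2),
    ],
    # Bounding boxes must be transformed together with the image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```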
4.1.2. Model training and implementation

In this paper, we fine-tune a model pretrained on MS-COCO Lin et al. (2014). We employ the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and the weight decay set to 0.0001. During training, we employ a warm-up learning rate scheduling strategy, using lower learning rates in the early stages of training to overcome optimization difficulties. More specifically, we linearly increase the learning rate to 0.01 over the first 500 iterations, then multiply it by 0.1 at epochs 6, 12, and 30. We implemented the methods based on the mmdetection repository2.

2 https://github.com/open-mmlab/mmdetection
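In mmdetection, the optimizer and warm-up schedule described above can be expressed in the training configuration. A minimal sketch under these settings is shown below; warmup_ratio is an assumption, as the text only gives the target learning rate and the 500 warm-up iterations.

```python
# Sketch of an mmdetection-style optimizer/schedule config for the settings above.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
lr_config = dict(
    policy='step',
    warmup='linear',      # ramp the learning rate up linearly ...
    warmup_iters=500,     # ... over the first 500 iterations
    warmup_ratio=0.001,   # assumed starting fraction of the base lr
    step=[6, 12, 30])     # multiply the lr by 0.1 at these epochs
```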
4.1.3. Variants of Faster R-CNN

Several papers have proposed variants of Faster R-CNN. In this paper, we implement Faster R-CNN and three variants of Faster R-CNN, and ensemble the results. The three variants of Faster R-CNN are as follows:

• Cascade R-CNN Cai and Vasconcelos (2019): Cascade R-CNN is similar to Faster R-CNN, but the architecture of the RoI head (the module that predicts the bounding boxes and the category label) is different. Cascade R-CNN builds a cascade head on top of Faster R-CNN Ren et al. (2015) to refine detections progressively. Since the proposal boxes are refined by multiple box regression heads, Cascade R-CNN is optimal for more precise localization of objects.

• Deformable Convolution Zhu et al. (2019): Here, the basic architecture of the network is the same as in Faster R-CNN. However, we replace the convolution layer with a deformable convolution layer Zhu et al. (2018) at the second, third, and fourth ResNeXt blocks of the feature extractor. Deformable convolution adds 2D offsets to the regular grid sampling locations of standard convolution, enabling free-form deformation of the sampling grid. The offsets are learned from the feature maps via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.

• Prime Sample Attention (PISA) Cao et al. (2020): Here, the basic network architecture is again the same as in Faster R-CNN. PISA is motivated by two considerations: samples should not be treated as independent and equally important, and classification and localization are correlated. Thus, it employs a ranking strategy that places the positive samples with the highest IoUs around each object, and the negative samples with the highest scores in each cluster, at the top of the ranked list. This directs the focus of the training process via a simple re-weighting scheme. It also employs a classification-aware regression loss to jointly optimize the classification and regression branches.

4.1.4. Post-processing

At test time, we employ a test-time augmentation scheme: we augment the test image by applying two resolutions, and we also flip the test image. As a result, we augment a single image into four images and merge the predictions obtained for the four images. We employ soft NMS (non-maximum suppression) Bodla et al. (2017) with a confidence threshold of 0.5 as the post-processing of predicted bounding boxes.
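Soft NMS decays the confidence of boxes that overlap a higher-scoring box instead of discarding them outright. Below is a minimal sketch of the linear variant from Bodla et al. (2017); the paper does not state which decay variant was used, so the linear form here is an assumption.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, iou_thr=0.5, score_thr=0.001):
    """Linear soft NMS: down-weight overlapping boxes instead of removing them."""
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while scores.size and scores.max() > score_thr:
        i = scores.argmax()
        keep.append((boxes[i], scores[i]))
        scores[i] = 0                       # take the current top box out of the pool
        decay = np.where(iou(boxes[i], boxes) > iou_thr,
                         1 - iou(boxes[i], boxes), 1.0)
        scores *= decay                     # overlapping boxes are decayed, not dropped
    return keep
```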

4.1.5. Ensemble method

Generally, combining predictions from different models generalizes better and usually yields more accurate results than a single model. At the post-processing step of the Faster R-CNN variants, we employ soft NMS Bodla et al. (2017) to select the predicted bounding boxes for each method. Such methods work well on a single model, but they only select among the boxes and cannot effectively produce an averaged localization of predictions combined from various models. Therefore, after predicting the bounding boxes for each method, we ensemble these predicted bounding boxes using Weighted Boxes Fusion Solovyev et al. (2019). Unlike NMS-based methods, which simply exclude part of the predicted bounding boxes, the Weighted Boxes Fusion algorithm uses the confidence scores of all proposed bounding boxes to construct the averaged boxes. The reader is referred to Solovyev et al. (2019) for further details of the algorithm. We ensemble four models (pure Faster R-CNN, Cascade R-CNN, Faster R-CNN with Deformable Convolution, and Faster R-CNN with Prime Sample Attention). We set equal weights when fusing the predicted bounding boxes of each model.
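Solovyev et al. (2019) provide a reference implementation (the ensemble-boxes package); a sketch of how the four models' outputs could be fused with it is shown below. The per-model variable names are hypothetical placeholders, the iou_thr value is an assumption, and coordinates must be normalised to [0, 1] for this API.

```python
# Sketch of fusing the four Faster R-CNN variants with Weighted Boxes Fusion,
# using the reference implementation from Solovyev et al. (2019).
from ensemble_boxes import weighted_boxes_fusion

# One entry per model: boxes as [x1, y1, x2, y2] normalised to [0, 1].
boxes_list  = [faster_boxes, cascade_boxes, deform_boxes, pisa_boxes]
scores_list = [faster_scores, cascade_scores, deform_scores, pisa_scores]
labels_list = [faster_labels, cascade_labels, deform_labels, pisa_labels]

fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[1, 1, 1, 1],   # equal weights, as stated above
    iou_thr=0.55)           # assumed clustering threshold
```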
4.2. YOLO

You-Only-Look-Once (YOLO) Redmon et al. (2016) is a unified, real-time object detection algorithm that reformulates the object detection task as a single regression problem. YOLO employs a single neural network architecture to predict bounding boxes and class probabilities directly from full images. Hence, when compared to Faster R-CNN Ren et al. (2015), YOLO provides faster detection.

Over time, improvements of YOLO were implemented and released as distinct and independent software packages by the originators Redmon et al. (2016); Redmon and Farhadi (2017, 2018). As an effect of increasing publicity and popularity, a model zoo containing further YOLO adaptations emerged. Subsequently, further maintainers continued to improve the DarkNet3-based versions, and Bochkovskiy et al. (2020) created ports for other machine learning libraries such as PyTorch4 Paszke et al. (2019).

3 DarkNet GitHub repository: https://github.com/pjreddie/darknet (accessed 2020-08-29)
4 PyTorch website: https://pytorch.org/ (accessed 2020-08-29)

In this paper, two approaches are selected for DFU detection on the DFUC2020 dataset: YOLOv3 and YOLOv5. We discuss the networks and present descriptions of our implementation in the following subsections.

4.2.1. YOLOv3

YOLOv3 Redmon and Farhadi (2018) was developed as an improved version of YOLOv2 Redmon and Farhadi (2017). It employs a multi-scale schema, predicting bounding boxes at different scales. This allows YOLOv3 to be more effective at detecting smaller targets when compared to YOLOv2.

YOLOv3 uses dimension clusters as anchor boxes in order to predict bounding boxes around the desired objects in given images. Logistic regression is used to predict the objectness score for a given bounding box. Specifically, as illustrated in Fig. 2, the algorithm predicts the four coordinates of the bounding box (t_x, t_y, t_w, t_h) as in Equation 1.

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}        (1)
b_h = p_h · e^{t_h}

where (c_x, c_y) are the offsets from the top left corner of the image, and (p_w, p_h) are the bounding box prior width and height. The k-means clustering algorithm is used to determine the bounding box priors, while the sum of squared errors is used for training the network. Let t̂_∗ be the ground truth for some coordinate prediction, and t_∗ be the network prediction during training. Then, the gradient is t̂_∗ − t_∗, which can be easily computed by inverting Equation 1.
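Equation 1 maps the raw network outputs back to absolute box coordinates. The following is a small sketch of this decoding step (NumPy; a worked check of the math rather than the authors' code).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLOv3 outputs (Equation 1) into box centre and size.

    (cx, cy) is the offset of the grid cell from the image's top-left corner,
    (pw, ph) is the width/height of the anchor (bounding box prior).
    """
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh
```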
Fig. 2. Illustration of bounding boxes, dimension priors, and location prediction.

Model Pipeline

The backbone of YOLOv3 is a hybrid model called DarkNet-53 (shown in Table 1), which is employed for the feature extraction part of the algorithm. As the name indicates, DarkNet-53 is made of 53 convolutional layers that also take advantage of shortcut connections.

As the detection algorithm is required to detect only one type of object, the complexity of the problem is reduced from multi-class detection to single-object detection. Hence, for the purpose of detecting diabetic foot ulcers, we have employed a simplified version of YOLOv3.

Table 1. The architecture of DarkNet-53 used in YOLOv3.
     Type            Filters  Size
     Convolutional   32       3×3
     Convolutional   64       3×3 / 2
1×   Convolutional   32       1×1
     Convolutional   64       3×3
     Residual
     Convolutional   128      3×3 / 2
2×   Convolutional   64       1×1
     Convolutional   128      3×3
     Residual
     Convolutional   256      3×3 / 2
8×   Convolutional   128      1×1
     Convolutional   256      3×3
     Residual
     Convolutional   512      3×3 / 2
8×   Convolutional   256      1×1
     Convolutional   512      3×3
     Residual
     Convolutional   1024     3×3 / 2
4×   Convolutional   512      1×1
     Convolutional   1024     3×3
     Residual
     Avgpool                  Global
     Connected                1000
     Softmax

Training

We employ transfer learning by using the pre-trained DarkNet weights provided by Redmon and Farhadi (2018). Then, we train our detector in two steps, using the following settings: Adam optimizer with learning rate 1e-3, number of epochs = 100, batch size = 32, and 20% of the data used for validation.

First, we freeze the top DarkNet-53 layers and train the algorithm with the above settings. Then, we retrain the entire network for better performance. Similar to the original YOLOv3, our trained network extracts features at 3 different pre-defined scales, a concept similar to feature pyramid networks Lin et al. (2017). We then use the trained network to detect diabetic foot ulcers in blind test images.

Post-processing

As observed from Figure 3, in rare cases the resulting algorithm may produce double detections or false positives. To reduce such drawbacks, we include a post-processing stage.

Fig. 3. Illustration of two types of false positives: (top row) false positives from double detection; and (bottom row) false positives of the network.

Our post-processing consists of two stages. First, we identify double detections by flagging detected bounding boxes with more than 80% overlap; among the overlapping boxes, we keep only the one with the highest confidence. Second, we further post-process the results by removing any detection with confidence under 0.3, aiming to reduce the rate of false positive detections.
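The two post-processing stages described above reduce to a few lines of array manipulation. A minimal sketch is shown below; it assumes NumPy arrays of boxes in [x1, y1, x2, y2] form and an iou() helper such as the one given in Section 4.1.4.

```python
def yolo_postprocess(boxes, scores, overlap_thr=0.8, conf_thr=0.3):
    """Stage 1: among boxes overlapping by more than 80%, keep only the most
    confident one. Stage 2: drop remaining detections below 0.3 confidence."""
    order = scores.argsort()[::-1]          # most confident first
    kept = []
    for i in order:
        # Keep box i only if it does not heavily overlap an already-kept box.
        if all(iou(boxes[i], boxes[j:j + 1])[0] <= overlap_thr for j, _ in kept):
            kept.append((i, scores[i]))
    return [i for i, s in kept if s >= conf_thr]
```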
4.2.2. YOLOv5

YOLOv5 Jocher et al. (2020b) was first published in May 2020 by Glenn Jocher of Ultralytics LLC5 on GitHub6. Originally, it was an improved version of their well-known YOLOv3 implementation for PyTorch7 Jocher et al. (2020a), based on the original YOLOv3 Redmon and Farhadi (2018). However, due to the release of YOLOv4 Bochkovskiy et al. (2020) for the DarkNet framework8, which incorporated many improvements made in the PyTorch YOLOv3 implementation, the authors decided to name it YOLOv5 to avoid naming conflicts. Essentially, YOLOv5 can be labeled as "YOLOv4 for PyTorch". Unlike the original YOLOv3 and YOLOv4, no scientific paper has yet been published on the PyTorch port and its improvements. YOLOv5 is under active development, with new features and releases appearing on a weekly basis. At the time of writing, the latest release is v3.0, published on 20 August 2020.

The new features and improvements in YOLOv4/YOLOv5 are mainly focused on incorporating state-of-the-art techniques for activation functions, data augmentation, and post-processing into the established YOLO architecture to achieve the best possible object detection performance. One of the most notable new features is the novel mosaic loader data augmentation: four images are combined to form a new image, allowing detection of objects outside of their normal context and at smaller sizes, and reducing the need for large mini-batch sizes. Another new data augmentation technique is self-adversarial training (SAT), where images are generated to deceive the network. YOLOv5 claims accelerated inference and smaller model files compared to YOLOv4, allowing easy translation to mobile use cases.

The approach to DFU detection via YOLOv5 described in the following is based on the early version v1.0, commit a1c8406 (see footnotes 9 and 10), from 14 July 2020, which still posed several issues.

5 Ultralytics LLC website: https://www.ultralytics.com/ (accessed 2020-08-29)
6 YOLOv5 GitHub repository: https://github.com/ultralytics/yolov5/ (accessed 2020-08-29)
7 Ultralytics' YOLOv3 GitHub repository: https://github.com/ultralytics/yolov3 (accessed 2020-08-29)
8 YOLOv4 GitHub repository: https://github.com/AlexeyAB/darknet (accessed 2020-08-29)
9 YOLOv5 v1.0: https://github.com/ultralytics/yolov5/releases/tag/v1.0 (accessed 2020-09-12)
10 YOLOv5 GitHub commit a1c8406: https://github.com/ultralytics/yolov5/commit/a1c8406 (accessed 2020-08-29)

Pre-processing

Initially, image data of the training dataset was analyzed via AntiDupl11 in version 2.3.10 to identify duplicate images, yielding a set of 39 pair findings. A spatial analysis of duplicate pair annotation data was performed, utilizing the R language12 R Core Team (2020) in version 4.0.1 and the Simple Features for R (sf) package13 Pebesma (2018) in version 0.9-2. Originally, none of the duplicate pair images showed BBox intersections by themselves. After joining duplicate pair annotations, several intersections were detected, with a maximum of two involved BBoxes. These represented different annotations of the same wound in two duplicate images, now joined in one image. To resolve these, each intersection of a BBox1 and a BBox2 was merged into a BBox∪ by using their outer boundaries, as shown in Equation 2.

xmin_∪ = min(xmin_1, xmin_2)
ymin_∪ = min(ymin_1, ymin_2)
xmax_∪ = max(xmax_1, xmax_2)        (2)
ymax_∪ = max(ymax_1, ymax_2)

11 AntiDupl GitHub repository: https://github.com/ermig1979/AntiDupl (accessed 2020-08-29)
12 R language website: https://www.r-project.org/ (accessed 2020-08-29)
13 Simple Features for R (sf) GitHub repository: https://github.com/r-spatial/sf (accessed 2020-08-29)
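Equation 2 is simply the outer union of two axis-aligned boxes; a one-function sketch:

```python
def merge_bboxes(b1, b2):
    """Merge two annotations of the same wound (Equation 2).
    Boxes are (xmin, ymin, xmax, ymax); the result is their outer boundary."""
    return (min(b1[0], b2[0]), min(b1[1], b2[1]),
            max(b1[2], b2[2]), max(b1[3], b2[3]))
```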
The applied duplicate cleansing and annotation merging strategy resulted in n = 1961 images with k = 2453 annotations in the cleansed training dataset. Boundaries of merged BBoxes were checked for consistency. Afterwards, annotation data was converted to the image resolution-independent format used by YOLO implementations.

Reviewing the image data of all dataset parts (training, validation, and test) showed pronounced compression artifacts and color noise due to a high compression rate and downscaling to a low resolution. As both compression artifacts and color noise had derogatory effects on the detection performance, images were enhanced using a fast implementation of the non-local means algorithm Buades et al. (2005) for color images, utilizing the Python language14 in version 3.6.9 with the OpenCV on Wheels (opencv-python)15 package in version 4.2.0.34. The algorithm parameters were set to h = 1 (luminance component filter strength) and hColor = 1 (color component filter strength), with templateWindowSize = 7 (template patch size in pixels) and searchWindowSize = 21 (search window size in pixels).
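With opencv-python, the stated parameters map directly onto cv2.fastNlMeansDenoisingColored, OpenCV's fast non-local means implementation for colour images (the file name below is hypothetical):

```python
import cv2

img = cv2.imread("dfu_image.jpg")  # hypothetical file name
denoised = cv2.fastNlMeansDenoisingColored(
    img, None,
    h=1,                     # luminance component filter strength
    hColor=1,                # color component filter strength
    templateWindowSize=7,    # template patch size in pixels
    searchWindowSize=21)     # search window size in pixels
```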
The resulting images show less definitive compression artifact borders and notably reduced color noise. Some textures are also more pronounced. Examples of results at a macroscopic and a detail level are shown in Figure 4.

Fig. 4. Effects of the non-local means (NLM) algorithm, shown for two example images (a) and (e) of the training dataset in (b) and (f). At a macroscopic level the changes are not obvious. At a detail level, borders of compression artifacts on homogeneous areas and color noise of (c) are visibly reduced in (d). Vague textures of (g) are also more pronounced in (h).

14 Python language website: https://www.python.org/ (accessed 2020-08-29)
15 OpenCV on Wheels GitHub repository: https://github.com/skvark/opencv-python (accessed 2020-08-29)
Data Augmentation

As mentioned in the introduction of YOLOv5, it is basically a port of YOLOv4 for PyTorch, adapting the novelties of YOLOv4. Hence, in the following, these novelties are explained but ascribed to YOLOv4. Nonetheless, the described techniques also apply to YOLOv5.

A key factor in the improved performance of YOLOv4 over YOLOv3 is data augmentation, where additional training data is artificially generated by manipulating or combining existing training images to improve the robustness of the trained model. In Bochkovskiy et al. (2020), these techniques are referred to as "bag of freebies", meaning that they can be applied at training time and do not affect inference speed.

A first set of data augmentation techniques in YOLOv4 are pixel-wise adjustments, including photometric distortion (adjustments of brightness, contrast, hue, saturation, and noise) as well as geometric distortion (random scaling, cropping, flipping, and rotating). A second set of techniques tackles the problem of object occlusion. Here, techniques like random erase or CutOut DeVries and Taylor (2017) select a rectangle in the image to be filled with random or zero values. Hide-and-seek and grid mask work similarly, but select several regions. Similar techniques can also be applied to the feature maps, where DropOut Srivastava et al. (2014), DropConnect Wan et al. (2013), and DropBlock Ghiasi et al. (2018) randomly drop certain values during propagation to improve model robustness. Third, there are techniques that combine several images of the training set into one image. MixUp Zhang et al. (2017a) superimposes two images by multiplying their pixel values with a coefficient and doing the same for the labels. CutMix Yun et al. (2019) takes one image and covers a random rectangle with a region of another image, adjusting the labels according to the size of the region.

Two novel data augmentation techniques, Mosaic and Self-Adversarial Training (SAT), were introduced in YOLOv4. Mosaic augmentation is similar to CutMix but takes four images instead of two. These are placed in the four corners of the new image with random ratios, thereby allowing the model to detect objects in different contexts and at different sizes. This reduces the need for large mini-batch sizes. SAT generates deceiving images based on the response of the model to given images. Its goal is to get the model to not detect a previously detected object, and then adjust the network weights based on this new image. Mosaic data augmentation was disabled in this approach because it led to invalid bounding boxes (BBoxes).

Lastly, class label smoothing is applied to improve model robustness. Additional smoothing is based on relationships between categories, modelled through a label refinement network.
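Most of these "bag of freebies" techniques amount to a few lines each; as an illustration, a minimal CutOut-style occlusion augmentation (a generic sketch, not the YOLOv4/YOLOv5 implementation) looks like this:

```python
import numpy as np

def cutout(image, max_frac=0.3, rng=None):
    """Fill a random rectangle with zeros (CutOut-style occlusion augmentation).
    max_frac bounds the rectangle's side length as a fraction of the image."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    ch = int(h * rng.uniform(0.1, max_frac))
    cw = int(w * rng.uniform(0.1, max_frac))
    y, x = rng.integers(0, h - ch), rng.integers(0, w - cw)
    out = image.copy()
    out[y:y + ch, x:x + cw] = 0
    return out
```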
Model

YOLOv5 includes four different models, ranging from the smallest YOLOv5s with 7.5 million parameters (plain 7 MB, COCO pre-trained 14 MB) and 140 layers, to the largest YOLOv5x with 89 million parameters and 284 layers (plain 85 MB, COCO pre-trained 170 MB). In the approach considered in this paper, the pre-trained YOLOv5x model is used. Its architecture is displayed in Table 2, derived from YOLOv5's model export16.

The YOLOv5x model uses a two-stage detector that consists of a Cross Stage Partial Network (CSPNet) Wang et al. (2020) backbone trained on MS COCO Lin et al. (2014), and a model head using a Path Aggregation Network (PANet) Liu et al. (2018) for instance segmentation. Each BottleneckCSP unit consists of two convolutional layers with 1×1 and 3×3 filters. The backbone incorporates a Spatial Pyramid Pooling network (SPP) He et al. (2015), which allows for dynamic input image size and is robust against object deformations.

Table 2. Architecture of the YOLOv5x model.
Type            Repeat  Filters  Size
Backbone
Focus                   12       3×3
Convolutional           160      3×3
BottleneckCSP   4×      160      1×1 + 3×3
Convolutional           320      3×3
BottleneckCSP   12×     320      1×1 + 3×3
Convolutional           640      3×3
BottleneckCSP   12×     640      1×1 + 3×3
Convolutional           1280     3×3
SPP
BottleneckCSP   4×      1280     1×1 + 3×3
Head
Convolutional           640      1×1
Upsample                         2
BottleneckCSP   4×      640      1×1 + 3×3
Convolutional           320      1×1
Upsample                         2
BottleneckCSP   4×      320      1×1 + 3×3
Convolutional           320      3×3
BottleneckCSP   4×      640      1×1 + 3×3
Convolutional           640      3×3
BottleneckCSP   4×      1280     1×1 + 3×3
Detection

Training

The infrastructure used for training comprised a single NVIDIA® V10017 tensor core graphics processing unit (GPU) with 16 GB memory, as part of an NVIDIA® DGX-118 supercomputer for deep learning. YOLOv5 was set up using a provided Docker container19, executed via Nvidia-Docker20 in version 19.03.5.

Training was organized in two stages: initial training and self-training. The initial training stage uses the originally available training data to train a model. The self-training approach, also called pseudo-labelling, extends the available training data by inferring detections on images for which originally no annotation data is available Koitka and Friedrich (2017). This is realized using the model resulting from the initial training stage; yielded detections are then used as pseudo-annotation data. Resuming the initial training in the self-training stage with the extended training data generalizes the detection capabilities of the model.

16 YOLOv5 model export: https://github.com/ultralytics/yolov5/issues/251 (accessed 2020-09-21)
17 NVIDIA® V100: https://www.nvidia.com/en-us/data-center/v100/ (accessed 2020-08-30)
18 NVIDIA® DGX-1: https://www.nvidia.com/en-us/data-center/dgx-1/ (accessed 2020-08-30)
19 YOLOv5 Docker Hub container: https://hub.docker.com/r/ultralytics/yolov5 (accessed 2020-08-30)
20 Nvidia-Docker GitHub repository: https://github.com/NVIDIA/nvidia-docker (accessed 2020-08-30)
A five-fold cross-validation was performed for each training stage to approximate training optima. Both training stages used the default set of hyperparameters: optimizer = SGD, lr0 = 0.01, momentum = 0.937, weight_decay = 0.0005, giou = 0.05, cls = 0.58, cls_pw = 1.0, obj = 1.0, obj_pw = 1.0, iou_t = 0.2, anchor_t = 4.0, fl_gamma = 0.0, hsv_h = 0.014, hsv_s = 0.68, hsv_v = 0.36, degrees = 0.0, translate = 0.0, scale = 0.5, and shear = 0.0. A default seed value of 0 was used for model initialization. Both training stages were performed in the single-class training mode, with mosaic data augmentation deactivated due to issues regarding BBox positioning.

During the initial training stage, a base model was trained on the pre-processed training dataset for 60 epochs with a batch size of 30. This base model was initialized with weights from the MS COCO pre-trained YOLOv5x model. For the self-training approach, the base model was then used to create the extended training dataset for self-training. Pseudo-annotation data was inferred for the validation and test datasets, using the best-performing epoch 58 automatically saved by YOLOv5. The resulting extended training dataset held 4161 images, of which 3963 held 4638 wound annotations.

During the self-training stage, the base model training was resumed at its latest epoch, but trained further on the extended training dataset with a batch size of 20. Three final training states were created: one after an additional 30 epochs, another after an additional 40 epochs, and a final one after an additional 60 epochs of self-training (referred to as E60 SELF90, E60 SELF100, and E60 SELF120).
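In outline, the self-training stage works as follows. The sketch below is schematic, not the YOLOv5 CLI: detect() and train_model() stand in for the framework's inference and training entry points, and the data handling is simplified.

```python
# Schematic of the self-training (pseudo-labelling) procedure described above.
# detect() and train_model() are placeholders, not YOLOv5's actual API.
CONF_THR = 0.70  # only quite certain predictions become pseudo-labels

def self_train(base_model, labeled_set, unlabeled_images, extra_epochs):
    extended_set = list(labeled_set)
    for image in unlabeled_images:
        detections = detect(base_model, image)
        pseudo = [d for d in detections if d.confidence >= CONF_THR]
        if pseudo:  # images without confident detections contribute no labels
            extended_set.append((image, pseudo))
    # Resume training from the base model's latest epoch on the extended data.
    return train_model(base_model, extended_set,
                       epochs=extra_epochs, batch_size=20)
```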
Post-processing

The minimum confidence threshold for detection was set to 0.70, so only quite certain predictions were exported. This applies to the pseudo-annotation data of the extended training dataset created for self-training as well as to the final predictions.

Predictions for our experiments were inferred via the final training states E60 SELF90, E60 SELF100, and E60 SELF120, using the best epochs 88, 96, and 118 respectively. Another experiment was based on the training state E60 SELF100, involving the built-in test-time augmentation (TTA) and non-maxima suppression (NMS) features of YOLOv5 for inference.

TTA is a data augmentation method in which several augmented instances of an image are presented to the model. Predictions are made for each instance, and the predictions for the image together provide an ensemble of instance predictions. This can enable a model to detect objects it may not be able to detect in a "clean" image. However, TTA may also cause multiple distinct detections for the same object, which can harm evaluation scores. To tackle these, NMS was applied to collapse multiple intersecting detections into one BBox. The intersection over union (IoU) threshold was set low, to IoU ≥ 0.30, as in the case of multiple wounds in an image there was usually a distinct spatial demarcation. Thus, the risk of interfering detections of different wounds was low.
4.3. EfficientDet

The EfficientDet architecture Tan (2019) is an object detection network created by the Google Brain team, and utilises the EfficientNet ConvNet Tan and Le (2019) classification network as its backbone. EfficientDet uses feature fusion techniques in the form of a bidirectional feature pyramid network (BiFPN), which combines representations of input images at different resolutions. BiFPN adds weights to input features, which enables the network to learn the importance of each feature. The outputs from the BiFPN are then used to predict the class and to generate bounding boxes using bounding box regression. EfficientDet also utilises compound scaling, which allows all parts of the network to scale in accordance with the target hardware being used for training and inference Tan et al. (2020). An overview of the EfficientDet architecture is shown in Fig. 5.
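The BiFPN's weighted fusion is, per Tan et al. (2020), a "fast normalized fusion": each input feature map gets a learnable non-negative weight, and the weights are normalised by their sum. A sketch of the fusion step for a single BiFPN node (illustrative, not the authors' code):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally-shaped feature maps with learnable non-negative weights,
    as in the BiFPN of Tan et al. (2020): O = sum_i(w_i * I_i) / (eps + sum_j w_j)."""
    w = np.maximum(weights, 0)  # ReLU keeps the weights non-negative
    return sum(wi * f for wi, f in zip(w, features)) / (eps + w.sum())
```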
Fig. 5. The architecture of EfficientDet, redrawn from Tan et al. (2020).

4.3.1. Pre-processing

Since the dataset was captured with different types of camera devices and in varying lighting conditions, a color constancy algorithm, Shades of Gray (SoG), was used to handle variations in noise and lighting from the different capture devices hua Ng et al. (2019). Examples of pre-processed DFU images using SoG are shown in Fig. 6.

Fig. 6. Shades of Gray algorithm for pre-processing of the DFUC2020 dataset: (left) original images; (right) pre-processed images.
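Shades of Gray estimates the illuminant per channel with a Minkowski p-norm and rescales the channels accordingly. A minimal sketch is given below; p = 6 is a common default, but the norm used by the authors is not stated in the text, so it is an assumption here.

```python
import numpy as np

def shades_of_gray(image, p=6):
    """Shades of Gray colour constancy: estimate the illuminant per channel
    with a Minkowski p-norm and normalise the channels to a common gray."""
    img = image.astype(np.float64)
    illum = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    scale = illum.mean() / illum            # per-channel correction factors
    corrected = img * scale
    return np.clip(corrected, 0, 255).astype(np.uint8)
```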

4.3.2. Data Augmentation

Data augmentation techniques have proven to be an important tool for improving the performance of deep learning algorithms on various computer vision tasks Goyal and Yap (2018); Yap et al. (2020a). For the application of EfficientDet, we augmented the training data by applying identical transformations to the images and their associated bounding boxes. Random rotation and shear transformations were used to augment the DFUC2020 dataset. Shearing involves the displacement of the image at its corners, resulting in a skewed or deformed output. Examples of these types of data augmentation are shown in Fig. 7.

Fig. 7. Bounding box data augmentation on the DFUC2020 dataset.

4.3.3. Model

EfficientDet algorithms achieved state-of-the-art accuracy on the popular MS-COCO Lin et al. (2014) object detection dataset. EfficientDet pre-trained weights are classed from D0 to D7, with D0 having the fewest parameters and D7 the most. Tests on the MS-COCO dataset indicate that training using weights with more parameters results in better network accuracy; however, this comes at the cost of significantly increased training time. Given that the DFUC2020 dataset images were resized to 640×480, we selected the EfficientDet-D1 pre-trained weights for DFU detection Goyal and Hassanpour (2020).

4.3.4. Training

We trained the EfficientDet-D1 method on an NVIDIA Quadro RTX 8000 GPU (48 GB) with a batch size of 16, the SGD optimizer with a learning rate of 0.00005, momentum of 0.9, and the number of epochs set to 50. We used the validation accuracy and early stopping to select the final model for inference.

4.3.5. Post-processing

We further refined the EfficientDet architectures with a score threshold of 0.5 and removed overlapping bounding boxes to minimize the number of false positives. The scores of overlapping bounding boxes were compared, and the bounding box with the highest score was used as the final output.
4.4. Cascade Attention DetNet

4.4.1. Data Augmentation

Since the DFUC2020 dataset has only 2,000 images for training, we use two data augmentation methods to complement the dataset in order to avoid over-fitting when training models. A more generalized model can be obtained through data augmentation, making it adapt to the complex clinical environment. We use common data augmentation methods including horizontal and vertical image flipping, random noise, and a central scaling method (which scales with the ground truth as the center). Additionally, we increase the number of training images by using the visually coherent image mixup method Zhang et al. (2017b). The original purpose of this method is to overcome the problem of disturbance rejection. Since Zhang et al. (2019) introduced this method into object detection, many researchers have used it in data augmentation to enhance network robustness. The principle of this algorithm can be described as follows: we randomly select two sample images, and then generate a new sample image according to the mixup method of Equations 3 and 4.

x̂ = λ · x_i + (1 − λ) · x_j        (3)
ŷ = λ · y_i + (1 − λ) · y_j        (4)

where (x_i, y_i) and (x_j, y_j) are the image-label pairs of the two samples, and λ ∈ [0,1] is randomly generated from a Beta(α, α) distribution. The new sample (x̂, ŷ) is used for training. As shown in Fig. 8, two images of DFU are mixed in a certain ratio. We use Beta(1.5, 1.5) for the image synthesis.

Fig. 8. The effect of the visually coherent image mixup method.
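A sketch of the mixup step in Equations 3 and 4, with λ drawn from Beta(1.5, 1.5) as above. The box-label handling shown (carrying over the boxes of both source images) is one common convention and an assumption here, as the paper does not detail it.

```python
import numpy as np

def mixup(img_i, img_j, boxes_i, boxes_j, alpha=1.5):
    """Visually coherent image mixup (Equations 3 and 4): blend two images
    with lambda ~ Beta(alpha, alpha) and keep the boxes of both sources.
    Assumes both images share the same size (e.g. 640 x 480)."""
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img_i.astype(np.float64)
             + (1.0 - lam) * img_j.astype(np.float64)).astype(np.uint8)
    return mixed, boxes_i + boxes_j   # lists of boxes, concatenated
```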
Moreover, detecting DFU in a complex environment remains unsatisfactory. To improve detection, we use the mobile fuzzy method for data augmentation: as shown in Fig. 9, we blur every image with the mobile fuzzy method to increase the number of images in the training set.

Fig. 9. The effect of the mobile fuzzy method. (a) is the original image, and (b) is the image after blurring with the mobile fuzzy method.

4.4.2. Model

The Cascade R-CNN Cai and Vasconcelos (2017) is the first cascaded object detection model. Due to the superior performance of the cascade structure, it is widely used in the field of object detection Zhao et al. (2020). We use the cascade structure in conjunction with DetNet Li et al. (2018), which is designed to address the problems incurred by repeated down-sampling, as such a process reduces the accuracy of positioning. DetNet makes full use of dilated convolutions to enhance the receptive field instead of down-sampling repeatedly. The overall framework of our method, Cascade Attention DetNet (CA-DetNet), is shown in Fig. 10.

The detection of DFU differs from common object detection tasks. In common object detection tasks, objects can appear anywhere in the image; for the detection of DFU, the wounds can only appear on the foot, which is a good fit for an attention mechanism. We therefore add an attention mechanism to the DetNet. To this end, we adopt the mask branch of the Residual Attention Network Wang et al. (2017).
The Attention DetNet (A-DetNet) is composed of six stages. The first stage consists of a 7×7 convolution (with stride 2) layer and a max-pooling layer. The second, third and fourth stages contain an A-Resbody, and the fifth and sixth stages contain an A-Detbody. The A-Resbody and A-Detbody are similar to those in the original DetNet; the difference is that we add an attention branch into the Resbody and Detbody. The attention branch is like the mask branch of the Residual Attention Network, while we take the other parts of the original Resbody or Detbody as the trunk. The attention branch of the Resbody is made up of two zoom structures, each consisting of a max-pooling layer and an up-sampling layer, followed by two 1×1 convolution layers activated by sigmoid functions. Because down-sampling five times does not allow the feature map (20×15) to recover its original size by up-sampling, we add only one zoom structure to the attention branch of the A-Detbody. The feature map from the trunk is multiplied by the mask from the attention branch. To avoid consuming the value of the features and breaking the identity mapping, we follow the Residual Attention Network and add one to the mask.
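The key operation of the attention branch is thus: the trunk feature map is multiplied by (1 + mask), where the mask comes from the sigmoid-activated zoom structure. A schematic of this residual attention step is shown below (PyTorch-style pseudocode under the stated assumptions, not the authors' implementation):

```python
import torch

def apply_residual_attention(trunk, mask):
    """Combine trunk features with a sigmoid mask as in the Residual Attention
    Network: adding one to the mask preserves the identity mapping, so the
    attention can only emphasise features, never erase them."""
    assert trunk.shape == mask.shape    # mask is upsampled back to trunk size
    return trunk * (1.0 + mask)         # mask values lie in (0, 1) after sigmoid
```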

4.4.3. Training

For the cascade structure, we set the total number of cascade stages to 3, with the intersection over union (IoU) threshold set to 0.5, 0.6 and 0.7 for the three stages, respectively. During training, we use a pre-trained model to accelerate model convergence: the pre-trained DetNet model, which has been trained on the ImageNet dataset. We train on one GPU (NVIDIA Tesla P100) for 60 epochs, with a batch size of 4 and a learning rate of 0.001. The learning rate decreases by a factor of 10 at the 10th epoch, and by another factor of 10 at the 20th epoch. We optimize the model with the Adam optimizer.
Fig. 10. The framework of CA-DetNet. "Image" is the input image; "A-DetNet" is the backbone network; "Pool" is region-wise feature extraction; "H" is the network head; "B" is the bounding box and "C" the classification; "B0" denotes the proposals in all architectures. The structure of the A-DetNet is based on the DetNet. The attention mechanism is applied in the Resbody and Detbody; the different bottleneck blocks in the Detbody and Resbody are similar to those in the DetNet.

4.4.4. Post-processing

Noise from the external environment leads to many low-confidence bounding boxes, which reduce the performance of the detector. We therefore adopt a special threshold suppression method: we suppress bounding boxes with low confidence, except when the detector detects only one bounding box. We set the threshold to 0.5.
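The rule above keeps a lone detection regardless of its score, which avoids returning an empty prediction for an image the detector is unsure about. A small sketch:

```python
def threshold_suppress(boxes, scores, thr=0.5):
    """Suppress boxes below the confidence threshold, except when the detector
    returned only a single box, which is kept regardless of its score."""
    if len(boxes) == 1:
        return list(zip(boxes, scores))
    return [(b, s) for b, s in zip(boxes, scores) if s >= thr]
```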
5. Results and Analysis

We report and analyse the results obtained using the methods described above. The evaluation metrics are the number of true positives (TP), the number of false positives (FP), recall, precision, F1-Score and mean average precision (mAP), as described in the diabetic foot ulcer challenge 2020 Cassidy et al. (2020). For the common object detection task, mAP is used as the main evaluation metric. However, in this DFU task, a missed detection (a false negative) potentially has severe implications, as it may affect the quality of life of patients, and an incorrect detection (a false positive) could increase the financial burden on health services. Therefore, F1-Score is as important as mAP for performance evaluation.
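For reference, the reported precision, recall and F1-Score follow the usual definitions. The sketch below reproduces the Deformable Convolution row of Table 3 from its TP and FP counts and the 2,097 ground-truth ulcers in the testing set (Section 3).

```python
def detection_metrics(tp, fp, num_gt):
    precision = tp / (tp + fp)
    recall = tp / num_gt            # false negatives are the undetected ulcers
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Deformable Convolution row of Table 3: TP=1612, FP=628, 2,097 test ulcers.
p, r, f1 = detection_metrics(1612, 628, 2097)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.7196 0.7687 0.7434
```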
5.1. Faster R-CNN

Table 3. Results of pure Faster R-CNN, Cascade R-CNN, Faster R-CNN with Deformable Convolution v2, Faster R-CNN with Prime Sample Attention (PISA), and the ensemble method.
Methods    TP    FP   Recall  Precision  F1-Score  mAP
Faster     1512  683  0.7210  0.6888     0.7046    0.6338
Cascade    1483  649  0.7072  0.6956     0.7014    0.6309
Deform     1612  628  0.7687  0.7196     0.7434    0.6940
PISA       1495  444  0.7129  0.7710     0.7408    0.6518
Ensemble   1447  394  0.6900  0.7860     0.7349    0.6353

Table 3 summarizes the quantitative results of pure Faster R-CNN, its variants, and the final ensemble model. From the table, the performance of pure Faster R-CNN is on par with Cascade R-CNN. In contrast, employing the Deformable Convolution or PISA module significantly improves the performance. After ensembling the models, we reduce FP substantially, but TP is also reduced. Although the ensemble method improves the precision of DFU detection, it does not improve the overall score. Therefore, the best result is achieved by Deformable Faster R-CNN, with a mAP of 0.6940 and an F1-Score of 0.7434.

The qualitative results of Faster R-CNN with Deformable Convolution are summarized in Figure 11. It can be seen that our model successfully detected the ulcers even when they are small (bottom-right image) or the images are blurred (top-middle image). However, we observed a failure case in the top-right image, where the background texture of blood was incorrectly detected as a DFU. To improve prediction accuracy, the training data should be captured in varied environments so that the network is better able to discern between DFU and background objects.
Fig. 11. The qualitative results of Faster R-CNN with Deformable Convolution, which shows the best performance among the Faster R-CNN based methods. It is noted that the network is able to detect small ulcers, as shown in (a), (b) and (c). However, it generates a FP, as demonstrated in (d).

5.2. YOLOv3

Table 4 shows the final results of the proposed YOLOv3 method on the testing dataset. The results are reported for two different batch sizes, with and without post-processing.

As the results indicate, using a batch size of 50 leads to better overall performance than using a batch size of 32. The results also demonstrate that removing the overlaps leads to improvement in both F1-Score and Precision, while resulting in slight decreases in both mAP and Recall. As the gain outweighs the loss, we conclude that removing overlaps results in better overall performance.

While removing the detections with less than 0.3 confidence results in slightly better precision, it reduces recall, F1-Score and mAP. Therefore, unless precision is the priority, removing the low-confidence detections does not lead to improvement. Examples of final detections for YOLOv3 are presented in Figure 12.

Fig. 12. Examples of final detection output of trained YOLOv3, after post-processing.

Additionally, we added 60 copyright free images of healthy feet21 to the training set to observe the effect on detection performance. As the results show, this results in an improvement in F1-Score, but reduces the mAP.

21 Website: https://www.freepik.com/ (accessed 2020-08-29)
Table 4. YOLOv3: results for different settings, post-processing, and adding extra copyright free foot images. B50 and B32 compare the performance of the method with batch sizes 50 and 32. Overlap-Removed indicates overlap-removal post-processing. The coefficient 0.3 shows the impact of ignoring any prediction with confidence < 0.3. Extra demonstrates the effect of adding extra images of healthy feet.
Method             Base  Coefficient  Overlap-Removed  TP    FP   Recall  Precision  F1-Score  mAP
B50                50    0            ×                1572  676  0.7496  0.6993     0.7236    0.6560
B50 Overlap        50    0            ✓                1553  618  0.7406  0.7153     0.7277    0.6500
B32                32    0            ×                1452  605  0.6929  0.7060     0.6994    0.6053
B32 Overlap        32    0            ✓                1433  551  0.6834  0.7223     0.7023    0.5998
B32 Overlap conf   32    0.3          ✓                1386  490  0.6609  0.7388     0.6977    0.5835
B50 Extra          50    0            ×                1563  616  0.7454  0.7173     0.7311    0.6548
B50 Overlap Extra  50    0            ✓                1543  565  0.7358  0.7320     0.7339    0.6484

5.3. YOLOv5

Table 5 summarizes the results of YOLOv5. Fewer additional self-training epochs in method E60 SELF90 achieved better results than E60 SELF100 and E60 SELF120. Yet, the application of TTA with NMS on E60 SELF100 achieved the best results, in E60 SELF100 TTA NMS. Examples of detections of E60 SELF100 TTA NMS on the test set are shown in Figure 13; Figure 14 shows additional examples of false negative and false positive cases.

Table 5. YOLOv5: results of different submitted runs. The settings state the epochs for base and self-training as well as the use of test-time augmentation (TTA) and non-maxima suppression (NMS).
Method               Base  Self-training  TTA+NMS  TP    FP   Recall  Precision  F1-Score  mAP
E60 SELF90           60    30             ×        1504  474  0.7172  0.7604     0.7382    0.6270
E60 SELF100          60    40             ×        1496  485  0.7134  0.7552     0.7337    0.6165
E60 SELF120          60    60             ×        1502  478  0.7163  0.7586     0.7368    0.6201
E60 SELF100 TTA NMS  60    40             ✓        1507  498  0.7187  0.7516     0.7348    0.6294

Fig. 13. Examples of adequate predictions of YOLOv5 for different DFU sizes and compositions: (a) to (c) different wound sizes, (d) highly tilted wound, (e) ignored blood stain on dressing, (f) ignored scar and hyperkeratosis, (g) heterogeneous wound composition, (h) detected wound out of focus.

Fig. 14. Examples of false negative, false positive, inadequate, and questionable predictions of YOLOv5: (a) and (b) missed wounds, (c) and (d) painted finger nail and malformed toe nail, (e) and (f) too large and too small, (g) and (h) unclear detections (one, two, many?).

5.4. EfficientDet

Table 6 shows the results of EfficientDet on the DFUC2020 testing set, both with and without post-processing. As the results indicate, the numbers of both TP and FP cases are reduced with the post-processing method. However, the reduction in TP cases (from 1626 to 1593) is 2.02%, compared to 17.50% for FP cases (from 720 to 594). Hence, the post-processing method leads to an important improvement in both Precision (67.86% to 72.84%) and F1-Score (72.38% to 74.37%), with slight decreases in both mAP (57.82% to 56.94%) and Recall (77.44% to 75.97%). EfficientDet with the post-processing method achieved the highest F1-Score and Precision (least number of FP cases) in DFUC2020. Examples of final outputs of the refined EfficientDet architecture are shown in Fig. 15.

Table 6. EfficientDet. 'Before' is the result of EfficientDet without post-processing and 'After' is the result with post-processing.
Methods  TP    FP   Recall  Precision  F1-Score  mAP
Before   1626  770  0.7754  0.6786     0.7238    0.5782
After    1593  594  0.7597  0.7284     0.7437    0.5694

Fig. 15. The results of EfficientDet. (a) and (c) are the results of EfficientDet without post-processing; (b) and (d) are the results obtained with post-processing.
5.5. Cascade Attention DetNet

Table 7 summarizes the results of the Cascade Attention DetNet on the DFUC2020 testing dataset. The results are reported for two different data augmentation methods, two different backbones, and with or without a pre-trained model.

From the results, we observe that CA-DetNet with the two data augmentation methods and the pre-trained model achieves the best result. It achieves the highest score of 63.94% on mAP and 70.01% on F1-Score. The C-DetNet achieves the highest score of 74.11% on Recall, while the CA-DetNet with the mobile fuzzy method achieves the highest score of 66.67% on Precision.
Table 7. Cascade Attention DetNet: results for two backbones, two data augmentation methods (mobile fuzzy and mixup), and with or without a pre-trained model.

Backbone   Pre-trained  Mobile Fuzzy  Mixup  TP    FP    Recall  Precision  F1-Score  mAP
C-DetNet   ✓            ✓             ✓      1554  789   0.7411  0.6633     0.7000    0.6391
CA-DetNet  ×            ×             ×      1493  1089  0.7120  0.5782     0.6382    0.5963
CA-DetNet  ✓            ×             ×      1523  820   0.7263  0.6500     0.6860    0.6204
CA-DetNet  ✓            ×             ✓      1431  961   0.6824  0.5982     0.6376    0.5749
CA-DetNet  ✓            ✓             ×      1528  764   0.7287  0.6667     0.6963    0.6350
CA-DetNet  ✓            ✓             ✓      1554  788   0.7411  0.6635     0.7002    0.6394
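The mobile fuzzy augmentation is described above as helping the model adapt to noise from the external environment, such as blur from hand-held capture. As a hedged illustration, interpreting "mobile fuzzy" as a motion-blur style corruption, which may not match the authors' exact implementation, such an augmentation can be written as:

import cv2
import numpy as np

def motion_blur(image, kernel_size=9):
    # Convolve with a single-row averaging kernel to simulate horizontal
    # camera shake; kernel_size controls the blur strength (illustrative).
    kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    kernel[kernel_size // 2, :] = 1.0 / kernel_size
    return cv2.filter2D(image, -1, kernel)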

Moreover, training with a pre-trained model can accelerate the convergence of the model and improve its ability to detect DFU.

Our approach was effective for the vast majority of the detected cases, as shown in Fig. 16. However, due to the complex clinical environment, there are also some failure cases. From our observation, such failures are generally due to the false identification of toenails, interference from the external environment, and low image quality. For the false identification of toenails, we believe that the appearance of leuconychia is similar to wounds and that some cases of DFU are located on toenails; this problem is not easy to overcome. For the interference from objects present in the external environment, we believe that the background can sometimes interfere with detection; we use the attention mechanism to deal with this problem to some extent. For image quality, we observe that several images are blurry; we use data augmentation methods such as the mobile fuzzy method to partially address this problem. We speculate that these problems could be solved by a two-stage architecture whose first stage detects and segments the relevant area of the feet, although more labeled data may be required to achieve this goal.

Fig. 16. The results of CA-DetNet: illustration of successful DFU detections.

5.6. Comparison

The results from the popular deep learning object detection methods and the proposed CA-DetNet are comparable. Table 8 shows the overall results when evaluated on the DFUC2020 testing set, where we present the best mAP from each object detection method. Considering the ranking based on mAP, the best result is achieved by the variant of Faster R-CNN using Deformable Convolution, with an mAP of 0.6940. This method also achieves the highest number of TPs and the best Recall. It is noted that YOLOv5 achieved the lowest number of FPs, but it has a lower mAP and F1-Score.

Table 8. A summary based on the mAP ranking from each object detection method when evaluated on the DFUC2020 testing set.

Method        TP    FP   Recall  Precision  F1-Score  mAP
Faster R-CNN  1612  628  0.7687  0.7196     0.7434    0.6940
YOLOv3        1572  676  0.7496  0.6993     0.7236    0.6560
CA-DetNet     1554  788  0.7411  0.6635     0.7002    0.6394
YOLOv5        1507  498  0.7187  0.7516     0.7348    0.6294
EfficientDet  1593  594  0.7597  0.7284     0.7437    0.5694

In Table 9, the ranking according to F1-Score shows the highest F1-Score of 0.7437, obtained by EfficientDet; however, its mAP is only 0.5694. On the other hand, the Faster R-CNN approach achieves a comparable F1-Score of 0.7434 with a much higher mAP of 0.6940.

Table 9. A summary based on the F1-Score ranking from each object detection method when evaluated on the DFUC2020 testing set.

Method        TP    FP   Recall  Precision  F1-Score  mAP
EfficientDet  1593  594  0.7597  0.7284     0.7437    0.5694
Faster R-CNN  1612  628  0.7687  0.7196     0.7434    0.6940
YOLOv5        1504  474  0.7172  0.7604     0.7382    0.6270
YOLOv3        1543  565  0.7358  0.7320     0.7339    0.6484
CA-DetNet     1554  788  0.7411  0.6635     0.7002    0.6394

Fig. 17 visually compares the detection results on DFUs with less visible appearances. In Fig. 17(a), the ulcer was detected by all the methods. However, in Fig. 17(b), only Faster R-CNN and EfficientDet detected the ulcer. Fig. 17(c) is another challenging case, detected only by CA-DetNet and Faster R-CNN. In Fig. 17(d), we demonstrate a case where only Faster R-CNN successfully localised the ulcer.

In Section 5.1, we demonstrate that the ensemble method using Weighted Boxes Fusion did not improve the results of the four Faster R-CNN approaches. This observation suggests that additional experiments based on different deep learning approaches should be investigated. We run experiments based on combinations of two approaches (Faster R-CNN + (CA-DetNet / EfficientDet / YOLOv3 / YOLOv5)), three approaches, and a combination of all approaches, as summarised in Table 10.
Fig. 17. Visual comparisons of object detection methods when compared to the ground truth (in red): (a) an easy case where all the methods detected the ulcer; (b) a more challenging case detected by Faster R-CNN (green) and EfficientDet (yellow); (c) a challenging case detected by Faster R-CNN (green) and CA-DetNet (blue); and (d) a challenging case only detected by Faster R-CNN (green).
Table 10. A comparison of ensemble methods with different combinations of object detection frameworks, where FRCNN is Faster R-CNN, DetNet is CA-DetNet, EffDet is EfficientDet, and 'ALL methods' represents an ensemble method based on Faster R-CNN, CA-DetNet, EfficientDet, YOLOv3 and YOLOv5.

Method               TP    FP   Recall  Precision  F1-Score  mAP
FRCNN+DetNet         1510  426  0.7201  0.7800     0.7488    0.6619
FRCNN+EffDet         1502  345  0.7163  0.8132     0.7617    0.6425
FRCNN+YOLOv3         1423  310  0.6786  0.8211     0.7431    0.6205
FRCNN+YOLOv5         1453  350  0.6929  0.8059     0.7451    0.6421
FRCNN+YOLOv5+EffDet  1396  252  0.6657  0.8471     0.7455    0.6109
FRCNN+YOLOv5+DetNet  1384  295  0.6600  0.8243     0.7331    0.6132
FRCNN+DetNet+EffDet  1435  270  0.6843  0.8416     0.7549    0.6229
ALL methods          1277  198  0.6090  0.8658     0.7150    0.5642
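The ensembles in Table 10 fuse the per-image predictions of the individual detectors. As a hedged sketch of how such a fusion can be performed, using the open-source reference implementation of Weighted Boxes Fusion by Solovyev et al. (2019), with illustrative thresholds rather than the settings used for these runs:

# pip install ensemble-boxes
from ensemble_boxes import weighted_boxes_fusion

# Detections for one image from two models; boxes are normalised to [0, 1].
boxes_list = [
    [[0.10, 0.20, 0.40, 0.55]],                            # e.g. Faster R-CNN
    [[0.12, 0.22, 0.42, 0.50], [0.70, 0.70, 0.90, 0.90]],  # e.g. EfficientDet
]
scores_list = [[0.91], [0.84, 0.35]]
labels_list = [[0], [0, 0]]  # single class: ulcer

# Overlapping boxes are merged into confidence-weighted averages rather
# than being discarded, as they would be in NMS.
boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    iou_thr=0.55,       # overlap above which boxes are fused (illustrative)
    skip_box_thr=0.3)   # discard very low-confidence boxes (illustrative)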
From our observation, the ensemble methods reduce the numbers of TPs and FPs; i.e., the more approaches, the lower the numbers of TPs and FPs. Ensembling did not improve the mAP, but in the majority of the ensembles there is a notable improvement in Precision, which led to an improvement in F1-Score. The best F1-Score among the ensemble methods is 0.7617, achieved by ensembling Faster R-CNN with Deformable Convolution and EfficientDet.

Apart from fine-tuning each deep learning method to achieve maximum performance, the methods are highly dependent on the pre-processing stage, the selection of data augmentation, the post-processing methods, and the ensemble method. We address the limitations and future challenges of our work in the following section.

6. Discussion

In this section, we discuss the performance of each object detection method and future work to improve DFU detection. Whilst most of the results show an F1-Score of greater than 70%, there is much work to do to enable the use of deep learning algorithms in real-world settings.

Faster R-CNN based approaches detected DFU in the DFUC2020 testing set with high mAP and F1-Score. In addition, the variants of Faster R-CNN largely improve the performance of the original Faster R-CNN. After ensembling the results of four models, we managed to reduce the number of false positives, but the overall performance did not improve when compared to the individual variants of Faster R-CNN. The reason may be that, even though we are fusing the predictions of four models into one prediction, similar results are predicted among these four models because all of them are based on Faster R-CNN. Therefore, in future work, a one-stage object detection method such as CenterNet Zhou et al. (2019) could potentially be included in the ensemble method to produce more accurate results.

The YOLOv3 algorithm is able to reliably detect DFU and ranked third place in both the mAP and F1-Score rankings. We have observed that post-processing (by removing overlaps), along with removing low-confidence detections, leads to an improvement in Precision but at the expense of the number of true positives and Recall. Additionally, our analysis indicates that adding additional images of healthy feet, along with post-processing, can result in a higher F1-Score. We aim to further investigate the results of pre-processing, as well as studying a more effective post-processing scheme.

The YOLOv5 approach also demonstrated reliable detection performance, with an overall high Precision over the different model configurations. Application of the NLM algorithm for image enhancement and generalization via self-training helped to notably increase Precision further. Improvements from the applied duplicate cleansing and BBox merging were marginal due to the limited number of cases, but could prove beneficial on larger datasets. Application of TTA with NMS helped to further increase true positives at the cost of increased false positive detections, yet increased mAP and F1-Score.

However, the presented results may not be representative of YOLOv5's actual capabilities. Surprisingly, the least self-trained model performed best, indicating optimization potential in the configurations considered. Models with fewer self-training epochs may perform better. In addition, an early version (v1.0) of the network was applied during DFUC2020, in which the Mosaic data augmentation was not functioning correctly on custom data. At the time of writing, the more developed version v3.0 (see footnote 22) is available, featuring numerous improvements and bug fixes. For example, the activation function was changed from Leaky ReLU Maas et al. (2013) in versions v1.0 (used here) and v2.0 to hard-swish Howard et al. (2017), further increasing detection performance.

YOLOv5 is improving rapidly and its full potential could not be taken advantage of during DFUC2020. For example, Model Ensembling (see footnote 23) could allow further performance increases when fusing differently specialized models, as could investigation of Hyperparameter Evolution (see footnote 24). Hence, YOLOv5 could prove helpful for performing DFU detection tasks, particularly when considering implementation directly on mobile devices.

22 YOLOv5 v3.0: https://github.com/ultralytics/yolov5/releases/tag/v3.0 (accessed 2020-09-28)
23 YOLOv5 GitHub repository tutorial on Model Ensembling: https://github.com/ultralytics/yolov5/issues/318 (accessed 2020-09-28)
24 YOLOv5 GitHub repository tutorial on Hyperparameter Evolution: https://github.com/ultralytics/yolov5/issues/607 (accessed 2020-09-28)
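Self-training, as used for the YOLOv5 runs, expands the training data with the model's own confident predictions. A minimal sketch of one pseudo-labelling round is shown below; the torchvision-style output format and the 0.6 confidence threshold are assumptions for illustration, not the settings used in DFUC2020.

import torch

@torch.no_grad()
def generate_pseudo_labels(model, images, conf_thr=0.6):
    # Keep only confident detections on unlabeled images; these become
    # pseudo ground truth for a further round of fine-tuning.
    model.eval()
    pseudo_labels = []
    for img in images:
        det = model([img])[0]  # assumed output: {'boxes': Tensor[N,4], 'scores': Tensor[N]}
        keep = det['scores'] >= conf_thr
        pseudo_labels.append({'boxes': det['boxes'][keep],
                              'scores': det['scores'][keep]})
    return pseudo_labels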
The refined EfficientDet algorithm is able to detect DFU with a high recall rate. The pre-processing stage with the Shades of Gray algorithm improved the consistency of the images. We extensively used data augmentation techniques to learn the subtle features of DFUs of various sizes and severities. The post-processing stage refined the inference of the original EfficientDet method by removing overlapping bounding boxes. Due to the low mAP, further work will focus on investigating other options of EfficientDet, particularly EfficientDet-D7.
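The Shades of Gray pre-processing step can be sketched as follows. This is a generic implementation of the colour constancy algorithm with the commonly used Minkowski norm p = 6; the exact configuration used for DFUC2020 is not stated here.

import numpy as np

def shades_of_gray(image, p=6):
    # Estimate the illuminant per channel as a Minkowski p-norm mean,
    # then normalise so that a neutral illuminant maps to (1, 1, 1).
    img = image.astype(np.float64)
    illum = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illum = illum / np.linalg.norm(illum) * np.sqrt(3.0)
    corrected = img / illum
    return np.clip(corrected, 0, 255).astype(np.uint8)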
The performance of Cascade Attention DetNet on the DFUC2020 testing set is not entirely satisfactory. We evaluated our model on 10% of the DFUC2020 training set and it achieved 0.9 on mAP. We analyzed the possible reasons and consider that the model may be over-fitting, for which ensemble learning may provide a possible solution. We further aim to use appropriate data augmentation methods to improve the robustness of the model.

The ensemble methods based on the fusion of different backbones reduced the number of predicted bounding boxes substantially. Faster R-CNN with Deformable Convolution predicted 2240 bounding boxes, but after being ensembled with EfficientDet, it only predicted 1847 bounding boxes. The number of predicted bounding boxes dropped to 1475 when we ensembled the results from all five networks. Consequently, the ensemble methods reduced the numbers of TPs and FPs. It is crucial for future research to focus on true positives, i.e. correctly locating the DFUs. One aspect of overcoming this issue is understanding the threshold setting of the IOU. Our experiments use IOU ≥ 0.5, which is the guideline set by object detection for natural objects. However, medical imaging studies Drukker et al. (2002); Yap et al. (2008) used an IOU (or Jaccard Similarity Index) threshold of 0.4. When we evaluate the performance of the best ensemble method with IOU ≥ 0.4, the number of TPs increases to 1594, and with IOU ≥ 0.3, the number of TPs increases to 1668. With Faster R-CNN with Deformable Convolution, the number of TPs increases to 1743 and 1883 for IOU thresholds of 0.4 and 0.3, respectively.
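The effect of the IOU threshold can be reproduced with a simple greedy matcher. The sketch below is illustrative and may differ from the official DFUC2020 evaluation in tie-breaking details.

def iou(a, b):
    # IoU of two boxes in (x1, y1, x2, y2) format.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def count_true_positives(pred_boxes, gt_boxes, iou_thr=0.5):
    # Each ground-truth box may be matched at most once; lowering iou_thr
    # (e.g. to 0.4 or 0.3) admits looser localisations as TPs.
    matched, tp = set(), 0
    for p in pred_boxes:  # ideally processed in descending confidence order
        candidates = [(iou(p, g), j) for j, g in enumerate(gt_boxes)
                      if j not in matched]
        if not candidates:
            continue
        best_iou, best_j = max(candidates)
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    return tp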
tion for the use of GPUs for this challenge and sponsoring our
7. Conclusion event. A.A., D.B.A. and M.O. were supported by the National
Health and Medical Research Council [GNT1174405] and the
We conduct a comprehensive evaluation of the performance Victorian Government’s OIS Program.
of deep learning object detection networks for DFU detection.
While the overall results show the potential of automatically
References
localising the ulcers, there are many false positives, and the
networks struggle to discriminate ulcers from other skin con- Armstrong, D.G., Boulton, A.J., Bus, S.A., 2017. Diabetic foot ulcers and their
ditions. A possible solution to address this issue might be to recurrence. New England Journal of Medicine 376, 2367–2375.
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. YOLOv4: Optimal Speed
introduce a second classifier in the form of a negative dataset to and Accuracy of Object Detection. URL: https://arxiv.org/abs/
train future networks on. However, in reality, it may prove im- 2004.10934, arXiv:2004.10934.
possible to gather all possible negative examples for supervised Bodla, N., Singh, B., Chellappa, R., Davis, L.S., 2017. Soft-nms–improving
object detection with one line of code, in: Proceedings of the IEEE interna-
tional conference on computer vision, pp. 5561–5569.
Brown, R., Ploderer, B., Da Seng, L.S., Lazzarini, P., van Netten, J., 2017.
24 YOLOv5 GitHub repository tutorial on Hyperparameter Evolution: Myfootcare: a mobile self-tracking tool to promote self-care amongst people
https://github.com/ultralytics/yolov5/issues/607 (accessed with diabetic foot ulcers, in: Proceedings of the 29th Australian Conference
2020-09-28) on Computer-Human Interaction, pp. 462–466.
Buades, A., Coll, B., Morel, J.M., 2005. A non-local algorithm for image denoising, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), IEEE. pp. 60–65. doi:10.1109/cvpr.2005.38.
Cai, Z., Vasconcelos, N., 2017. Cascade R-CNN: Delving into high quality object detection.
Cai, Z., Vasconcelos, N., 2019. Cascade R-CNN: High quality object detection and instance segmentation. arXiv:1906.09756.
Cao, Y., Chen, K., Loy, C.C., Lin, D., 2020. Prime sample attention in object detection, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11580–11588.
Cassidy, B., Reeves, N.D., Joseph, P., Gillespie, D., O'Shea, C., Rajbhandari, S., Maiya, A.G., Frank, E., Boulton, A., Armstrong, D., et al., 2020. DFUC2020: Analysis towards diabetic foot ulcer detection. arXiv preprint arXiv:2004.11853.
DeVries, T., Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (accessed on 11 September 2020).
Drukker, K., Giger, M.L., Horsch, K., Kupinski, M.A., Vyborny, C.J., Mendelson, E.B., 2002. Computerized lesion detection on breast ultrasound. Medical Physics 29, 1438–1446.
Ghiasi, G., Lin, T.Y., Le, Q.V., 2018. DropBlock: A regularization method for convolutional networks, in: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 31. Curran Associates, Inc., pp. 10727–10737.
Goyal, M., Hassanpour, S., 2020. A refined deep learning architecture for diabetic foot ulcers detection. arXiv preprint arXiv:2007.07922.
Goyal, M., Reeves, N.D., Davison, A.K., Rajbhandari, S., Spragg, J., Yap, M.H., 2018. DFUNet: convolutional neural networks for diabetic foot ulcer classification. IEEE Transactions on Emerging Topics in Computational Intelligence, 1–12. doi:10.1109/TETCI.2018.2866254.
Goyal, M., Reeves, N.D., Rajbhandari, S., Ahmad, N., Wang, C., Yap, M.H., 2020. Recognition of ischaemia and infection in diabetic foot ulcers: Dataset and techniques. Computers in Biology and Medicine, 103616.
Goyal, M., Reeves, N.D., Rajbhandari, S., Yap, M.H., 2019. Robust methods for real-time diabetic foot ulcer detection and localization on mobile devices. IEEE Journal of Biomedical and Health Informatics 23, 1730–1741. doi:10.1109/JBHI.2018.2868656.
Goyal, M., Yap, M.H., 2018. Region of interest detection in dermoscopic images for natural data-augmentation. arXiv preprint arXiv:1807.10711.
Goyal, M., Yap, M.H., Reeves, N.D., Rajbhandari, S., Spragg, J., 2017. Fully convolutional networks for diabetic foot ulcer segmentation, in: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 618–623. doi:10.1109/SMC.2017.8122675.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 1904–1916. doi:10.1109/tpami.2015.2389824.
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
IDF, 2019. International Diabetes Federation: Facts & figures. https://www.idf.org/aboutdiabetes/what-is-diabetes/facts-figures.html.
Jocher, G., Kwon, Y., guigarfr, Veitch-Michaelis, J., perry0418, Ttayu, Marc, Bianconi, G., Baltacı, F., Suess, D., Chen, T., Yang, P., idow09, WannaSeaU, Xinyu, W., Shead, T.M., Havlik, T., Skalski, P., NirZarrabi, LukeAI, LinCoce, Hu, J., IlyaOvodov, GoogleWiki, Reveriano, F., Falak, Kendall, D., 2020a. ultralytics/yolov3: mAP@0.5:0.95 on COCO2014. doi:10.5281/zenodo.3785397.
Jocher, G., Stoken, A., Borovec, J., NanoCode012, ChristopherSTAN, Changyu, L., Laughing, Hogan, A., lorenzomammana, tkianai, yxNONG, AlexWang1900, Diaconu, L., Marc, wanghaoyang0106, ml5ah, Doug, Hatovix, Poznanski, J., Yu, L., changyu98, Rai, P., Ferriday, R., Sullivan, T., Xinyu, W., YuriRibeiro, Claramunt, E.R., hopesala, pritul dave, yzchen, 2020b. ultralytics/yolov5: v3.0. doi:10.5281/zenodo.3983579.
Koitka, S., Friedrich, C.M., 2017. Optimized convolutional neural network ensembles for medical subfigure classification, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction – 8th International Conference of the CLEF Association, CLEF 2017, Lecture Notes in Computer Science (LNCS). Springer International Publishing, pp. 57–68. doi:10.1007/978-3-319-65813-1_5.
Li, Z., Peng, C., Yu, G., Zhang, X., Sun, J., 2018. DetNet: A backbone network for object detection.
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017. Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer. pp. 740–755.
Liu, S., Qi, L., Qin, H., Shi, J., Jia, J., 2018. Path aggregation network for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.
Maas, A.L., Hannun, A.Y., Ng, A.Y., 2013. Rectifier nonlinearities improve neural network acoustic models, in: Proc. ICML, p. 3. URL: http://robotics.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf (accessed on 11 September 2020).
hua Ng, J., Goyal, M., Hewitt, B., Yap, M.H., 2019. The effect of color constancy algorithms on semantic segmentation of skin lesions, in: Medical Imaging 2019: Biomedical Applications in Molecular, Structural, and Functional Imaging, International Society for Optics and Photonics. p. 109530R.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library, in: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32. Curran Associates, Inc., pp. 8024–8035.
Pebesma, E., 2018. Simple Features for R: Standardized support for spatial vector data. The R Journal 10, 439–446. doi:10.32614/RJ-2018-009.
R Core Team, 2020. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. URL: https://www.R-project.org/.
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You Only Look Once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. doi:10.1109/cvpr.2016.91.
Redmon, J., Farhadi, A., 2017. YOLO9000: Better, faster, stronger, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. doi:10.1109/cvpr.2017.690.
Redmon, J., Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, pp. 91–99.
Solovyev, R., Wang, W., Gabruseva, T., 2019. Weighted boxes fusion: ensembling boxes for object detection models. arXiv:1910.13302.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.
Tan, M., Le, Q., 2019. EfficientNet: Rethinking model scaling for convolutional neural networks.
Tan, M., Pang, R., Le, Q.V., 2019. EfficientDet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070.
Tan, M., Pang, R., Le, Q.V., 2020. EfficientDet: Scalable and efficient object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790.
Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., Fergus, R., 2013. Regularization of neural networks using DropConnect, PMLR, Atlanta, Georgia, USA. pp. 1058–1066. URL: http://proceedings.mlr.press/v28/wan13.html.
Wang, C.Y., Mark Liao, H.Y., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.H., 2020. CSPNet: A new backbone that can enhance learning capability of CNN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391.
Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X., 2017. Residual attention network for image classification.
Wang, L., Pedersen, P.C., Agu, E., Strong, D.M., Tulu, B., 2016. Area determination of diabetic foot ulcer images using a cascaded two-stage SVM-based classification. IEEE Transactions on Biomedical Engineering 64, 2098–2109.
Wang, L., Pedersen, P.C., Strong, D.M., Tulu, B., Agu, E., Ignotz, R., 2014. Smartphone-based wound assessment system for patients with diabetes. IEEE Transactions on Biomedical Engineering 62, 477–488.
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995.
Yap, M.H., Edirisinghe, E.A., Bez, H.E., 2008. A novel algorithm for initial lesion detection in ultrasound breast images. Journal of Applied Clinical Medical Physics 9, 181–199.
Yap, M.H., Goyal, M., Osman, F., Marti, R., Denton, E., Juette, A., Zwiggelaar, R., 2020a. Breast ultrasound region of interest detection and lesion localisation. Artificial Intelligence in Medicine, 101880.
Yap, M.H., Reeves, N., Boulton, A., Rajbhandari, S., Armstrong, D., Maiya, A.G., Najafi, B., Frank, E., Wu, J., 2020b. Diabetic foot ulcers grand challenge 2021. doi:10.5281/zenodo.3715020.
Yap, M.H., Reeves, N.D., Boulton, A., Rajbhandari, S., Armstrong, D., Maiya, A.G., Najafi, B., Frank, E., Wu, J., 2020c. Diabetic foot ulcers grand challenge 2020. doi:10.5281/zenodo.3715016.
Yun, S., Han, D., Chun, S., Oh, S.J., Yoo, Y., Choe, J., 2019. CutMix: Regularization strategy to train strong classifiers with localizable features, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE. doi:10.1109/iccv.2019.00612.
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D., 2017a. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (accessed on 11 September 2020).
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D., 2017b. mixup: Beyond empirical risk minimization.
Zhang, Z., He, T., Zhang, H., Zhang, Z., Xie, J., Li, M., 2019. Bag of freebies for training object detection neural networks. arXiv preprint arXiv:1902.04103.
Zhao, W., Huang, H., Li, D., Chen, F., Cheng, W., 2020. Pointer defect detection based on transfer learning and improved Cascade-RCNN. Sensors 20, 4939.
Zhou, X., Wang, D., Krähenbühl, P., 2019. Objects as points. arXiv preprint arXiv:1904.07850.
Zhu, J., Fang, L., Ghamisi, P., 2018. Deformable convolutional neural networks for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters 15, 1254–1258.
Zhu, X., Hu, H., Lin, S., Dai, J., 2019. Deformable ConvNets v2: More deformable, better results, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9300–9308.
