Deep Learning in Diabetic Foot Ulcers Detection: A Comprehensive Evaluation
Moi Hoon Yap a,∗, Ryo Hachiuma b,1, Azadeh Alavi c,1, Raphael Brüngel d,1, Manu Goyal e,1, Hongtao Zhu f,1, Bill Cassidy a, Johannes Rückert d, Moshe Olshansky c, Xiao Huang f, Hideo Saito b, Saeed Hassanpour e, Christoph M. Friedrich d,g, David Ascher c, Anping Song f, Hiroki Kajita h, David Gillespie a, Neil D. Reeves a, Joseph Pappachan i, Claire O'Shea j, Eibe Frank k
a Manchester Metropolitan University, John Dalton Building, Chester Street, Manchester M1 5GD, UK
b Keio University, Yokohama, Kanagawa, Japan
c Baker Heart and Diabetes Institute, 20 Commercial Road, Melbourne, VIC 3000, Australia
d Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge-Str. 42, 44227 Dortmund, Germany
e Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA
f Shanghai University, Shanghai 200444, China
g Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Hufelandstr. 55, 45122 Essen, Germany
h Keio University School of Medicine, Shinanomachi, Tokyo, Japan
i Lancashire Teaching Hospitals, Chorley, UK
j Waikato Diabetes Health Board, Hamilton, New Zealand
k Department of Computer Science, University of Waikato, Hamilton, New Zealand
Keywords: diabetic foot ulcers, object detection, machine learning, deep learning, DFUC2020

There has been a substantial amount of research on computer methods and technology for the detection and recognition of diabetic foot ulcers (DFUs), but there is a lack of systematic comparisons of state-of-the-art deep learning object detection frameworks applied to this problem. With the recent development and data sharing performed as part of the DFU Challenge (DFUC2020), such a comparison becomes possible: DFUC2020 provided participants with a comprehensive dataset consisting of 2,000 images for training each method and 2,000 images for testing them. The following deep learning-based algorithms are compared in this paper: Faster R-CNN, three variants of Faster R-CNN and an ensemble method; YOLOv3; YOLOv5; EfficientDet; and a new Cascade Attention Network. For each deep learning method, we provide a detailed description of model architecture, parameter settings for training, and additional stages including pre-processing, data augmentation and post-processing. We provide a comprehensive evaluation for each method. All the methods required a data augmentation stage to increase the number of images available for training and a post-processing stage to remove false positives. The best performance was obtained by Deformable Convolution, a variant of Faster R-CNN, with a mAP of 0.6940 and an F1-Score of 0.7434. Finally, we demonstrate that an ensemble based on different deep learning methods can enhance the F1-Score but not the mAP. Our results show that state-of-the-art deep learning methods can detect DFU with some accuracy, but there are many challenges ahead before they can be implemented in real world settings.

© 2020 Preprint.
include descriptions of an ensemble method and a new Cascade Attention DetNet (CA-DetNet).

4.1. Faster R-CNN

Faster R-CNN Ren et al. (2015) is a two-stage object detection model, which generates a sparse set of candidate object locations with a Region Proposal Network (RPN) based on shared feature maps, and then classifies each candidate proposal as foreground or background. After extracting shared feature maps with a CNN, the first-stage RPN takes the shared feature maps as input and generates a set of bounding box candidate object locations, each with an "objectness" score. The size of each anchor is configured using hyperparameters. The proposals are then used in the region of interest pooling layer (RoI pooling) to generate subfeature maps. The subfeature maps are converted to 4096-dimensional vectors and fed forward into fully connected layers. These layers are then used as a regression network to predict bounding box offsets, with a classification network used to predict the class label of each bounding box proposal.

The RoI pooling layer quantizes a floating-number RoI to the discrete granularity of the feature map. This quantization introduces misalignments between the RoI and the extracted features. Therefore, the model evaluated in this paper employs a RoIAlign layer, introduced in Mask R-CNN He et al. (2017), instead of the RoI pooling layer. This removes the harsh quantization of the RoI pooling layer, properly aligning the extracted features with the input.

Also, the Feature Pyramid Network (FPN) Lin et al. (2017) is employed as the backbone of the network. FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. Specifically, we employ ResNeXt101 Xie et al. (2017) with the FPN feature extraction backbone to extract the features.
4.1.1. Data Augmentation
In this challenge, the images in the dataset were captured from different viewpoint angles, with cameras of different focal lengths, and with varying levels of blur. Also, the training dataset contains only 2,000 images, which could be considered small for training deep learning models. Therefore, we employ various data augmentation techniques for robust prediction; a sketch of the resulting pipeline is given after this list. Specifically, we employ the following augmentations:

• HSV and RGB: As the lighting conditions differ among images in the dataset, we apply random RGB and HSV shifts to the images. Specifically, we randomly add/subtract from 0 to 10 for the RGB values and from 0 to 20 for the HSV values.

• Blurring: As the dataset contains images captured at different focal lengths, some images are blurred and contain camera noise. Therefore, we apply Gaussian and median blur filters with the filter size set to 3. The filters are applied with a probability of 0.1.

• Affine transformation: As the images are captured from different camera angles, we apply random affine transformations to the images. Specifically, we apply random shift, scaling (0.1), and rotation (90 degrees).

• Brightness: As the images are captured in various environments, we employ brightness and contrast data augmentation. More specifically, we randomly change the brightness and contrast on a scale from 0.1 to 0.3, with the probability set to 0.2.
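These augmentations can be composed into a single pipeline, e.g. with the albumentations library; the library choice and the exact argument mapping are assumptions, as the implementation is not named here:

```python
import albumentations as A

# Sketch of the Section 4.1.1 augmentations (library choice is an assumption).
transform = A.Compose(
    [
        # Random RGB shift of up to 10 and HSV shift of up to 20.
        A.RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10),
        A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=20,
                             val_shift_limit=20),
        # Gaussian or median blur with filter size 3, applied with p = 0.1.
        A.OneOf([A.GaussianBlur(blur_limit=3), A.MedianBlur(blur_limit=3)],
                p=0.1),
        # Random shift, scaling (0.1), and rotation (90 degrees).
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=90),
        # Brightness/contrast change in [0.1, 0.3], applied with p = 0.2.
        A.RandomBrightnessContrast(brightness_limit=(0.1, 0.3),
                                   contrast_limit=(0.1, 0.3), p=0.2),
    ],
    # Keep bounding boxes consistent with the transformed image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```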
4.1.2. Model training and implementation
In this paper, we fine-tune a model pretrained on MS-COCO Lin et al. (2014). We employ the Stochastic Gradient Descent optimizer with a momentum of 0.9 and weight decay set to 0.0001. During training, we employ a warm-up learning rate scheduling strategy, using lower learning rates in the early stages of training to overcome optimization difficulties. More specifically, we linearly increase the learning rate to 0.01 in the first 500 iterations, then multiply it by 0.1 at epochs 6, 12, and 30. We implemented the methods based on the mmdetection repository (https://github.com/open-mmlab/mmdetection).
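The warm-up schedule translates directly into a small helper; this is a sketch of the stated schedule, not the mmdetection implementation:

```python
def learning_rate(iteration, epoch, base_lr=0.01, warmup_iters=500):
    """Warm-up schedule of Section 4.1.2: ramp linearly to base_lr over the
    first 500 iterations, then multiply by 0.1 at epochs 6, 12, and 30."""
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    decay_steps = sum(epoch >= milestone for milestone in (6, 12, 30))
    return base_lr * (0.1 ** decay_steps)
```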
4.1.3. Variants of Faster R-CNN
Several papers have proposed variants of Faster R-CNN. In this paper, we implement Faster R-CNN and three of its variants, and ensemble the results. The three variants of Faster R-CNN are as follows:

• Cascade R-CNN Cai and Vasconcelos (2019): Cascade R-CNN is similar to Faster R-CNN, but the architecture of the RoI head (the module that predicts the bounding boxes and the category label) is different. Cascade R-CNN builds a cascade head on top of Faster R-CNN Ren et al. (2015) to refine detections progressively. Since the proposal boxes are refined by multiple box regression heads, Cascade R-CNN is well suited for more precise localization of objects.

• Deformable Convolution Zhu et al. (2019): Here, the basic architecture of the network is the same as in Faster R-CNN. However, we replace the convolution layer with a deformable convolution layer Zhu et al. (2018) at the second, third, and fourth ResNeXt blocks of the feature extractor. The deformable convolution adds 2D offsets to the regular grid sampling locations of the standard convolution, enabling free-form deformation of the sampling grid. The offsets are learned from the feature maps via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.
• Prime Sample Attention Cao et al. (2020) (PISA): Here, the basic network architecture is again the same as in Faster R-CNN. PISA is motivated by two considerations: samples should not be treated as independent and equally important, and classification and localization are correlated. Thus, it employs a ranking strategy that places the positive samples with the highest IoUs around each object, and the negative samples with the highest scores in each cluster, at the top of the ranked list. This directs the focus of the training process via a simple re-weighting scheme. It also employs a classification-aware regression loss to jointly optimize the classification and regression branches.
4.1.4. Post-processing
At test time, we employ a test-time augmentation scheme: we augment the test image by applying two resolutions, and we also flip the test image. As a result, we augment a single image into four images and merge the predictions obtained for the four images. We employ soft NMS (non-maximum suppression) Bodla et al. (2017) with a confidence threshold of 0.5 as the post-processing of the predicted bounding boxes, sketched below.
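Soft-NMS decays the scores of overlapping boxes instead of discarding them outright; a minimal sketch of the Gaussian variant of Bodla et al. (2017):

```python
import math

def box_iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms(dets, sigma=0.5, score_thr=0.001):
    """Gaussian soft-NMS sketch: each detection is a dict with 'box' and
    'conf' entries; overlapping scores decay by exp(-iou^2 / sigma)."""
    dets = [dict(d) for d in dets]  # copy, since scores are modified
    kept = []
    while dets:
        best = max(dets, key=lambda d: d["conf"])
        dets.remove(best)
        kept.append(best)
        for d in dets:
            d["conf"] *= math.exp(-box_iou(best["box"], d["box"]) ** 2 / sigma)
        dets = [d for d in dets if d["conf"] >= score_thr]
    return kept
```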
4.2. YOLO

… DarkNet-based versions, and Bochkovskiy et al. (2020) created ports for other machine learning libraries such as PyTorch Paszke et al. (2019).

In this paper, two approaches are selected for DFU detection on the DFUC2020 dataset: YOLOv3 and YOLOv5. We discuss the networks and present descriptions of our implementations in the following subsections.

4.2.1. YOLOv3
YOLOv3 Redmon and Farhadi (2018) was developed as an improved version of YOLOv2 Redmon and Farhadi (2017). It employs a multi-scale schema, predicting bounding boxes at different scales. This allows YOLOv3 to be more effective at detecting smaller targets when compared to YOLOv2.

YOLOv3 uses dimension clusters as anchor boxes in order to predict bounding boxes around the desired objects in given images. Logistic regression is used to predict the objectness score for a given bounding box. Specifically, as illustrated in Fig. 2, the algorithm predicts the four coordinates of the bounding box (t_x, t_y, t_h, t_w) as in Equation 1.
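Equation 1 follows the standard YOLOv3 box parameterization, in which the network outputs (t_x, t_y, t_w, t_h) are decoded against the grid-cell offset (c_x, c_y) and the anchor prior (p_w, p_h); a minimal decoding sketch:

```python
import math

def decode_box(t, cell, anchor):
    """Standard YOLOv3 decoding (a sketch of Equation 1):
    bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    bw = pw * exp(tw),     bh = ph * exp(th)."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))
```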
Table 1. The architecture of DarkNet-53 used in YOLOv3.

Type             Filters   Size
Convolutional    32        3×3
Convolutional    64        3×3/2
1× Convolutional 32        1×1
   Convolutional 64        3×3
   Residual
Convolutional    128       3×3/2
2× Convolutional 64        1×1
   Convolutional 128       3×3
   Residual
Convolutional    256       3×3/2
8× Convolutional 128       1×1
   Convolutional 256       3×3
   Residual
Convolutional    512       3×3/2
8× Convolutional 256       1×1
   Convolutional 512       3×3
   Residual
Convolutional    1024      3×3/2
4× Convolutional 512       1×1
   Convolutional 1024      3×3
   Residual
Avgpool          Global
Connected        1000
Softmax

The model is trained with settings of epochs=100, batch size=32, and using 20% of the data for validation. First, we start by freezing the top DarkNet-53 layers and train the algorithm with the above settings. Then, we retrain the entire network for better performance. Similar to the original YOLOv3, our trained network extracts features at 3 different pre-defined scales, which is a similar concept to feature pyramid networks Lin et al. (2017). We then use the trained network for detecting diabetic foot ulcers in blind test images.

Post-processing
As observed from Figure 3, in rare cases the resulting algorithm may produce double detections or false positives. To reduce such drawbacks, we include a post-processing stage, sketched below.

Fig. 3. Illustration of two types of false positives: (top row) false positives from double detection; and (bottom row) false positives of the network.

Our post-processing consists of two stages. First, we identify double detections by flagging detected bounding boxes with more than 80% overlap; among the overlapping boxes we keep only the one with the highest confidence. Finally, we further post-process the results by removing any detection with confidence under 0.3, aiming to reduce the rate of false positive detections.
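A minimal sketch of this two-stage post-processing; treating the "80% overlap" criterion as an IoU threshold is an assumption:

```python
def box_iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def postprocess(dets, overlap_thr=0.80, conf_thr=0.30):
    """Stage 1: among boxes overlapping by more than 80%, keep only the most
    confident one. Stage 2: drop detections with confidence under 0.3."""
    dets = sorted(dets, key=lambda d: d["conf"], reverse=True)
    kept = []
    for d in dets:
        if all(box_iou(d["box"], k["box"]) <= overlap_thr for k in kept):
            kept.append(d)
    return [d for d in kept if d["conf"] >= conf_thr]
```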
4.2.2. YOLOv5
YOLOv5 Jocher et al. (2020b) was first published in May 2020 by Glenn Jocher of Ultralytics LLC (https://www.ultralytics.com/) on GitHub (https://github.com/ultralytics/yolov5/). Originally, it was an improved version of their well-known YOLOv3 implementation for PyTorch (https://github.com/ultralytics/yolov3) Jocher et al. (2020a), based on the original YOLOv3 Redmon and Farhadi (2018). However, due to the release of YOLOv4 Bochkovskiy et al. (2020) for the DarkNet framework (https://github.com/AlexeyAB/darknet), which incorporated many improvements made in the PyTorch YOLOv3 implementation, the authors decided to name it YOLOv5 to avoid naming conflicts. Essentially, YOLOv5 can be labeled as "YOLOv4 for PyTorch". Unlike the original YOLOv3 and YOLOv4, no scientific paper has yet been published on the PyTorch port and its improvements. YOLOv5 is under active development, with new features and releases appearing on a weekly basis. At the time of writing, the latest release is v3.0, published on 20 August 2020.

The new features and improvements in YOLOv4/YOLOv5 are mainly focused on incorporating state-of-the-art techniques for activation functions, data augmentation, and post-processing into the established YOLO architecture to achieve the best possible object detection performance. One of the most notable new features is the novel mosaic loader data augmentation: four images are combined to form a new image, allowing detection of objects outside of their normal context and at smaller sizes, and reducing the need for large mini-batch sizes. Another new data augmentation technique is self-adversarial training (SAT), where images are generated to deceive the network. YOLOv5 claims accelerated inference and smaller model files compared to YOLOv4, allowing easy translation to mobile use cases.

The approach to DFU detection via YOLOv5 described in the following is based on the early version v1.0 (https://github.com/ultralytics/yolov5/releases/tag/v1.0), commit a1c8406 (https://github.com/ultralytics/yolov5/commit/a1c8406) from 14 July 2020, which still posed several issues.
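A compact sketch of the mosaic loader idea mentioned above (image handling only; in the real loader the bounding boxes are also shifted and clipped into the same canvas):

```python
import random
import cv2
import numpy as np

def mosaic(images, size=640):
    """Combine four images into one mosaic canvas around a random center."""
    assert len(images) == 4
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    cx = random.randint(size // 4, 3 * size // 4)
    cy = random.randint(size // 4, 3 * size // 4)
    corners = [(0, 0, cx, cy), (cx, 0, size, cy),
               (0, cy, cx, size), (cx, cy, size, size)]
    for img, (x1, y1, x2, y2) in zip(images, corners):
        # Resize each image into its quadrant of the canvas.
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
```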
Pre-processing
Initially, image data of the training dataset was analyzed via AntiDupl (https://github.com/ermig1979/AntiDupl) in version 2.3.10 to identify duplicate images, yielding a set of 39 pair findings. A spatial analysis of duplicate pair annotation data was performed, utilizing the R language R Core Team (2020) in version 4.0.1 and the Simple Features for R (sf) package Pebesma (2018) in version 0.9-2. Originally, none of the duplicate pair images showed BBox intersections by themselves. After joining duplicate pair annotations, several intersections were detected, with a maximum of two involved BBoxes. These represented different annotations of the same wound in two duplicate images, now joined in one image. To resolve these, each intersection of a BBox1 and a BBox2 was merged into a new BBox by using their outer boundaries, as shown in Equation 2:

x̂_min = min(x_min1, x_min2)
ŷ_min = min(y_min1, y_min2)
x̂_max = max(x_max1, x_max2)        (2)
ŷ_max = max(y_max1, y_max2)

The applied duplicate cleansing and annotation merging strategy resulted in n = 1961 images with k = 2453 annotations in the cleansed training dataset. Boundaries of merged BBoxes were checked for consistency. Afterwards, annotation data was converted to the image resolution-independent format used by YOLO implementations.
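Equation 2 amounts to taking the outer boundaries of the two boxes; as a one-function sketch:

```python
def merge_bboxes(b1, b2):
    """Merge two intersecting BBoxes (xmin, ymin, xmax, ymax) into one
    via their outer boundaries, as in Equation 2."""
    return (min(b1[0], b2[0]), min(b1[1], b2[1]),
            max(b1[2], b2[2]), max(b1[3], b2[3]))
```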
Reviewing image data of all dataset parts (training, validation, and test) showed pronounced compression artifacts and color noise due to a high compression rate and downscaling to a low resolution. As both compression artifacts and color noise had derogatory effects on the detection performance, images were enhanced using a fast implementation of the non-local means algorithm Buades et al. (2005) for color images, utilizing the Python language (https://www.python.org/) in version 3.6.9 with the OpenCV on Wheels (opencv-python, https://github.com/skvark/opencv-python) package in version 4.2.0.34. The algorithm parameters were set to h = 1 (luminance component filter strength) and hColor = 1 (color component filter strength), with templateWindowSize = 7 (template patch size in pixels) and searchWindowSize = 21 (search window size in pixels).

Resulting images show less definitive compression artifact borders and notably reduced color noise. Some textures are also more pronounced. Examples of results at a macroscopic and a detail level are shown in Figure 4.

Fig. 4. Effects of the non-local means (NLM) algorithm are shown for two example images (a) and (e) of the training dataset in (b) and (f). At a macroscopic level the changes are not obvious. At a detail level, borders of compression artifacts on homogeneous areas and color noise of (c) are visibly reduced in (d). Vague textures of (g) are also more pronounced in (h).
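With opencv-python, the described enhancement corresponds to a single call; a sketch with the parameters reported above (file names are placeholders):

```python
import cv2

image = cv2.imread("dfu_image.jpg")  # placeholder file name
# Fast non-local means for color images: h=1 (luminance filter strength),
# hColor=1 (color filter strength), templateWindowSize=7, searchWindowSize=21.
denoised = cv2.fastNlMeansDenoisingColored(image, None, 1, 1, 7, 21)
cv2.imwrite("dfu_image_nlm.jpg", denoised)
```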
Data Augmentation
As mentioned in the introduction of YOLOv5, it is basically a port of YOLOv4 for PyTorch, adapting the novelties of YOLOv4. Hence, in the following, these novelties are explained but ascribed to YOLOv4. Nonetheless, the described techniques also apply to YOLOv5.

A key factor in the improved performance of YOLOv4 over YOLOv3 is data augmentation, where additional training data is artificially generated by manipulating or combining existing training images to improve the robustness of the trained model. In Bochkovskiy et al. (2020), these techniques are referred to as a "bag of freebies", meaning that they can be applied at training time and do not affect inference speed.

A first set of data augmentation techniques in YOLOv4 are pixel-wise adjustments, including photometric distortion (adjustments of brightness, contrast, hue, saturation, and noise of images) as well as geometric distortion (random scaling, cropping, flipping, and rotating). A second set of techniques tackles the problem of object occlusion.
Training used the default set of hyperparameters: optimizer = SGD, lr0 = 0.01, momentum = 0.937, weight decay = 0.0005, giou = 0.05, cls = 0.58, cls_pw = 1.0, obj = 1.0, obj_pw = 1.0, iou_t = 0.2, anchor_t = 4.0, fl_gamma = 0.0, hsv_h = 0.014, hsv_s = 0.68, hsv_v = 0.36, degrees = 0.0, translate = 0.0, scale = 0.5, and shear = 0.0. A default seed value of 0 was used for model initialization. Both training stages were performed in the single-class training mode, with mosaic data augmentation deactivated due to issues regarding BBox positioning.
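These defaults map onto the keys of the YOLOv5 (v1.0-era) hyperparameter file; as a Python dict, with the layout itself being illustrative:

```python
# Default YOLOv5 hyperparameters as listed above (a sketch).
hyp = {
    "lr0": 0.01, "momentum": 0.937, "weight_decay": 0.0005,
    "giou": 0.05, "cls": 0.58, "cls_pw": 1.0, "obj": 1.0, "obj_pw": 1.0,
    "iou_t": 0.2, "anchor_t": 4.0, "fl_gamma": 0.0,
    "hsv_h": 0.014, "hsv_s": 0.68, "hsv_v": 0.36,
    "degrees": 0.0, "translate": 0.0, "scale": 0.5, "shear": 0.0,
}
```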
During the initial training stage, a base model was trained on the pre-processed training dataset for 60 epochs with a batch size of 30. This base model was initialized with weights from the MS COCO pre-trained YOLOv5x model. For the self-training approach, the base model was then used to create the extended training dataset for self-training. Pseudo-annotation data was inferred for the validation and test datasets, using the best-performing epoch 58 automatically saved by YOLOv5. The resulting extended training dataset held 4161 images, of which 3963 held 4638 wound annotations.

During the self-training stage, the base model training was resumed at its latest epoch, but trained further on the extended training dataset with a batch size of 20. Three final training states were created: one after an additional 30 epochs, another after an additional 40 epochs, and a final one after an additional 60 epochs of self-training (referred to as E60 SELF90, E60 SELF100, and E60 SELF120).

Post-processing
The minimum confidence threshold for detection was set to 0.70, so only quite certain predictions were exported. This applies to the pseudo-annotation data of the extended training dataset created for self-training, as well as to the final predictions.

Predictions for our experiments were inferred via the final training states E60 SELF90, E60 SELF100, and E60 SELF120, using the best epochs 88, 96, and 118, respectively. Another experiment was based on the training state E60 SELF100, involving the built-in test-time augmentation (TTA) and non-maxima suppression (NMS) features of YOLOv5 for inference.

TTA is a data augmentation method in which several augmented instances of an image are presented to the model. Predictions are made for each instance, providing an ensemble of instance predictions for the image. This can enable a model to detect objects it may not be able to detect in a "clean" image. However, TTA may also cause multiple distinct detections for the same object that can harm evaluation scores. To tackle these, NMS was applied to collapse multiple intersecting detections into one BBox. The intersection over union (IoU) threshold was set low, to IoU ≥ 0.30, as in the case of multiple wounds in an image there was usually a distinct spatial demarcation. Thus, the risk of interfering detections of different wounds was low.
4.3. EfficientDet
The EfficientDet architecture Tan (2019) is an object detection network created by the Google Brain team, and utilises the EfficientNet ConvNet Tan and Le (2019) classification network as its backbone. EfficientDet uses feature fusion techniques in the form of a bidirectional feature pyramid network (BiFPN), which combines representations of input images at different resolutions. BiFPN adds weights to input features, which enables the network to learn the importance of each feature. The outputs from the BiFPN are then used to predict the class and generate bounding boxes using bounding box regression. EfficientDet also utilises compound scaling, which allows all parts of the network to scale in accordance with the target hardware being used for training and inference Tan et al. (2020). An overview of the EfficientDet architecture is shown in Fig. 5.

4.3.1. Pre-processing
Since the dataset was captured with different types of camera devices and lighting conditions, a color constancy algorithm, Shades of Gray (SoG), was used to handle variations in noise and lighting from the different capture devices hua Ng et al. (2019). Examples of pre-processed DFU images using SoG are shown in Fig. 6.
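Shades of Gray estimates the illuminant with a Minkowski p-norm and rescales the channels; a minimal sketch, where p = 6 is a common choice and the exact norm used here is an assumption:

```python
import numpy as np

def shades_of_gray(image, p=6):
    """Shades of Gray color constancy sketch for an RGB image in [0, 1]:
    estimate the per-channel illuminant via the Minkowski p-norm mean and
    rescale each channel so the illuminant becomes achromatic."""
    img = image.astype(np.float64)
    illuminant = np.power((img ** p).mean(axis=(0, 1)), 1.0 / p)
    gains = illuminant.mean() / illuminant  # per-channel correction gains
    return np.clip(img * gains, 0.0, 1.0)
```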
4.3.2. Data Augmentation
Data augmentation techniques have proven to be an important tool in improving the performance of deep learning algorithms for various computer vision tasks Goyal and Yap (2018); Yap et al. (2020a). For the application of EfficientDet, we augmented the training data by applying identical transformations to the images and associated bounding boxes for DFU detection. Random rotation and shear transformations were used to augment the DFUC2020 dataset. Shearing involves the displacement of the image at its corners, resulting in a skewed or deformed output. Examples of these types of data augmentation are shown in Fig. 7.

4.3.3. Model
EfficientDet algorithms achieved state-of-the-art accuracy on the popular MS-COCO Lin et al. (2014) object detection dataset. EfficientDet pre-trained weights are classed from D0 to D7, with D0 having the fewest parameters and D7 the most. Tests on the MS-COCO dataset indicate that training using weights with more parameters results in better network accuracy. However, this comes at the cost of significantly increased training time. Given that the DFUC2020 dataset images were resized to 640×480, we selected the EfficientDet-D1 pre-trained weights for DFU detection Goyal and Hassanpour (2020).

4.3.4. Training
We trained the EfficientDet-D1 method on an NVIDIA Quadro RTX 8000 GPU (48 GB) with a batch size of 16, the SGD optimizer with a learning rate of 0.00005, momentum of 0.9, and the number of epochs set to 50. We used the validation accuracy and early stopping to select the final model for inference.

4.3.5. Post-processing
We further refined the EfficientDet architectures with a score threshold of 0.5 and removed overlapping bounding boxes to reduce false positives.
4.4.3. Training
For the cascade structure, we set the total number of cascade stages to 3. For the intersection over union (IoU) threshold, we set it to 0.5, 0.6 and 0.7 for the three stages, respectively. During training, we use a pre-trained model to accelerate model convergence: the pre-trained model of DetNet, which has been trained on the ImageNet dataset. We train on one GPU (NVIDIA Tesla P100) for 60 epochs, with a batch size of 4 and a learning rate of 0.001. The learning rate de- […]

Fig. 9. The effect of the mobile fuzzy method. (a) is the original image, and (b) is the image after blurring with the mobile fuzzy method.

Fig. 10. The framework of CA-DetNet. "Image" is the input image. "A-DetNet" is the backbone network. "Pool" is region-wise feature extraction. "H" is the network head. "B" is a bounding box and "C" is a classification. "B0" is the proposals in all architectures. The structure of the A-DetNet is based on the DetNet. The attention mechanism is applied in the Resbody and Detbody. Different bottleneck blocks in the Detbody or Resbody are similar to those in the DetNet.
5. Results

We report and analyse the results obtained using the methods described above. The evaluation metrics are the number of true positives (TP), the number of false positives (FP), recall, precision, F1-Score and mean average precision (mAP), as described in the diabetic foot ulcer challenge 2020 Cassidy et al. (2020). For the common object detection task, mAP is used as the main evaluation metric. However, in this DFU task, a miss-detection (a false negative) potentially has severe implications, as it may affect the quality of life of patients, and an incorrect detection (a false positive) could increase the financial burden on health services. Therefore, the F1-Score is as important as mAP for performance evaluation.
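For reference, the count-based metrics reduce to the usual definitions (mAP additionally requires ranked predictions); the example reproduces the EfficientDet numbers reported in Section 5.6:

```python
def detection_scores(tp, fp, fn):
    """Recall, precision and F1-Score from detection counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# EfficientDet with post-processing (Table 8): TP=1593, FP=594 gives
# precision = 1593 / (1593 + 594) ≈ 0.7284; together with recall 0.7597,
# the F1-Score is ≈ 0.7437, matching the reported values.
print(1593 / (1593 + 594))
```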
5.1. Faster R-CNN
Table 3 summarizes the quantitative results of pure Faster R-CNN, its variants, and the final ensemble model. From the table, the performance of pure Faster R-CNN is on par with Cascade R-CNN. In contrast, employing the Deformable Convolution or PISA module significantly improves the performance. After we ensemble the models, FP is reduced substantially, but TP is also reduced. Although the ensemble method improves the precision of DFU detection, it does not improve the overall score. Therefore, the best result is achieved by Deformable Faster R-CNN, with a mAP of 0.6940 and an F1-Score of 0.7434.

The qualitative results of Faster R-CNN with Deformable Convolution are summarized in Figure 11. It can be seen that our model successfully detected the ulcers, even when they are small (bottom-right image) or the images are blurred (top-middle image). However, we observed a miss-detection in the top-right image, where the background texture of blood was incorrectly detected as a DFU. To improve prediction accuracy, the training data should be captured in various environments so that the network is better able to discern between DFU and background objects.

5.2. YOLOv3
Table 4 shows the final results of the proposed YOLOv3 method on the testing dataset. The results are reported for two different batch sizes, with and without post-processing. As the results indicate, using a batch size of 50 leads to a better overall performance compared to using a batch size of 32.
Additionally, we have added 60 copyright-free images of healthy feet (https://www.freepik.com/, accessed 2020-08-29).

Fig. 11. The qualitative results of Faster R-CNN with Deformable Convolution, which shows the best performance among the Faster R-CNN based methods. It is noted that the network is able to detect small ulcers, as shown in (a), (b) and (c). However, it generates FP, as demonstrated in (d).

Fig. 12. Examples of final detection output of trained YOLOv3, after post-processing.

5.3. YOLOv5
Table 5 summarizes the results of YOLOv5. Fewer additional self-training epochs in method E60 SELF90 achieved better results than E60 SELF100 and E60 SELF120. Yet, the application of TTA with NMS on E60 SELF100 achieved the best results, in E60 SELF100 TTA NMS. Examples of detections of E60 SELF100 TTA NMS on the test set are shown in Figure 13; Figure 14 shows additional examples of false negative and false positive cases.

Fig. 13. Examples of detections of E60 SELF100 TTA NMS on the test set: (a) small, (b) medium, (c) large, (d) tilted.

Fig. 14. Examples of false negative, false positive, inadequate, and questionable predictions of YOLOv5: (a) and (b) missed wounds, (c) and (d) painted finger nail and malformed toe nail, (e) and (f) too large and too small, (g) and (h) unclear detections (one, two, many?).
Table 4. YOLOv3: Results of different settings, post-processing, and adding extra copyright-free foot images. B50 and B32: compare the performance of the method with batch sizes 50 and 32. OverlapRemoved: indicates the performance of the method with overlap-removal post-processing. conf0.3: shows the impact of ignoring any prediction with < 0.3 confidence. Extra: demonstrates the effect on performance of adding extra images of healthy feet.
(Columns: Method; Settings: Base, Coefficient, Overlap-Removed; Metrics: TP, FP, Recall, Precision, F1-Score, mAP.)

Table 5. YOLOv5: Results of different submitted runs. The settings state the epochs for base and self-training as well as the use of test-time augmentation (TTA) and non-maxima suppression (NMS). Best results are highlighted in bold; the winning method is highlighted in gray.
(Columns: Method; Settings: Base, Self-training, TTA+NMS; Metrics: TP, FP, Recall, Precision, F1-Score, mAP.)

5.4. EfficientDet
Table 6 shows the results of EfficientDet on the DFUC2020 testing set, both with and without post-processing. As the results indicate, the numbers of both TP and FP cases are reduced with the post-processing method. However, with the post-processing method,
the reduction in TP cases (from 1626 to 1593) is 2.02%, compared to a 17.50% reduction in FP cases (from 720 to 594). Hence, the post-processing method leads to important improvements in both Precision (67.86% to 72.84%) and F1-Score (72.38% to 74.37%), with slight decreases in both mAP (57.82% to 56.94%) and Recall (77.44% to 75.97%). EfficientDet with the post-processing method achieved the highest F1-Score and Precision (lowest number of FP cases) in DFUC2020. Examples of final outputs by the refined EfficientDet architecture are shown in Fig. 15.

Fig. 15. The results of EfficientDet. (a) and (c) are the results of EfficientDet without post-processing; (b) and (d) are the results obtained with post-processing.

5.5. Cascade Attention DetNet
Table 7 summarizes the results of the Cascade Attention DetNet on the DFUC2020 testing dataset. The results are reported for two different data augmentation methods, two different backbones, and with or without a pre-trained model.

From the results, we observe that CA-DetNet with the two data augmentation methods and the pre-trained model achieves the best result: the highest scores of 63.94% on mAP and 70.01% on F1-Score. The C-DetNet achieves the highest score of 74.11% on Recall, while the CA-DetNet with the mobile fuzzy method achieves the highest score of 66.67% on Precision.

From the analysis, we see that the mobile fuzzy data augmentation method brings a striking effect, improving mAP by 1.46% and F1-Score by 1.03%. At the same time, using the single mixup method for data augmentation did not enhance the performance. The results suggest that the mobile fuzzy method can make the model adapt to the noise from the external environment, while the mixup method is detrimental. The attention mechanism contributes to the improved detection performance and increases mAP by 0.02% and F1-Score […]
… external environment and low image quality. For the false identifications, there are several images which are blurry. We use data augmentation methods like the mobile fuzzy method to partially address this problem. We speculate that if a two-stage architecture were designed, whose first stage is to detect and segment the relevant area of the feet, the above problems could be solved. However, more labeled data may be required to achieve this goal.

Fig. 16. The results of CA-DetNet: illustration of successful DFU detections.

5.6. Comparison
The results from the popular deep learning object detection methods and the proposed CA-DetNet are comparable. Table 8 shows the overall results when evaluated on the DFUC2020 testing set, where we present the best mAP from each object detection method. Considering the ranking based on mAP, the best result is achieved by the variant of Faster R-CNN using Deformable Convolution, with 0.6940. This method achieves the highest TP and the best Recall. It is noted that YOLOv5 achieved the lowest number of FP, but it has a lower mAP and F1-Score.

Table 8. Overall results on the DFUC2020 testing set.

Method        TP    FP    Recall   Precision   F1-Score   mAP
EfficientDet  1593  594   0.7597   0.7284      0.7437     0.5694
YOLOv5        1504  474   0.7172   0.7604      0.7382     0.6270
YOLOv3        1543  565   0.7358   0.7320      0.7339     0.6484
CA-DetNet     1554  788   0.7411   0.6635      0.7002     0.6394

In Table 9, the ranking according to F1-Score shows the highest F1-Score of 0.7437, obtained by EfficientDet; however, the mAP is only 0.5694. On the other hand, the Faster R-CNN approach achieves a comparable F1-Score of 0.7434 with a much higher mAP of 0.6940.

Fig. 17 visually compares the detection results on DFUs with less visible appearances. In Fig. 17(a), the ulcer was detected by all the methods. However, in Fig. 17(b), only Faster R-CNN and EfficientDet detected the ulcer. Fig. 17(c) is another challenging case, detected by CA-DetNet and Faster R-CNN. In Fig. 17(d), we demonstrate a case where only Faster R-CNN successfully localised the ulcer.

In Section 5.1, we demonstrated that the ensemble method using Weighted Boxes Fusion did not improve the results of the four Faster R-CNN approaches. This observation suggests that additional experiments based on different deep learning approaches should be investigated. We ran experiments based on combinations of two approaches (Faster R-CNN + (CA-DetNet / EfficientDet / YOLOv3 / YOLOv5)), three approaches, and a combination of all approaches, as summarised in Table 10.
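Weighted Boxes Fusion merges overlapping boxes from several models by confidence-weighted averaging; a usage sketch with the reference ensemble-boxes package of Solovyev et al. (2019) — whether our experiments used this exact package and these thresholds is an assumption:

```python
from ensemble_boxes import weighted_boxes_fusion

# Two models' predictions for one image; coordinates normalized to [0, 1].
boxes_list = [[[0.10, 0.20, 0.40, 0.50]],   # e.g. Faster R-CNN
              [[0.12, 0.22, 0.42, 0.52]]]   # e.g. EfficientDet
scores_list = [[0.90], [0.80]]
labels_list = [[0], [0]]                    # single class: ulcer

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[1, 1], iou_thr=0.5, skip_box_thr=0.05)
```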
… detection method and future work to improve DFU detection.
(YOLOv5 GitHub issue: https://github.com/ultralytics/yolov5/issues/318, accessed 2020-09-28)
… fusing differently specialized models, as well as investigation of Hyperparameter Evolution (https://github.com/ultralytics/yolov5/issues/607, accessed 2020-09-28). Hence, YOLOv5 could prove helpful for performing DFU detection tasks, particularly when considering implementation directly on mobile devices.

The refined EfficientDet algorithm is able to detect DFU with a high recall rate. The pre-processing stage with the Shades of Gray algorithm improved the consistency of the images. We extensively used data augmentation techniques to learn the subtle features of DFUs of various sizes and severity. The post-processing stage refined the inference of the original EfficientDet method by removing overlapping bounding boxes. Due to the low mAP, further work will focus on investigating other options of EfficientDet, particularly EfficientDet-D7.

The performance of the Cascade Attention DetNet on the DFUC2020 testing set is not entirely satisfactory. We evaluated our model on 10% of the DFUC2020 training set and it achieved 0.9 on mAP. We analyzed the possible reasons and consider that the model may be over-fitting, to which ensemble learning may provide a possible solution. We further aim to use appropriate data augmentation methods to improve the robustness of the model.

The ensemble methods based on the fusion of different backbones have reduced the number of predicted bounding boxes substantially. Faster R-CNN with Deformable Convolution predicted 2240 bounding boxes, but after being ensembled with EfficientDet, it predicted only 1847 bounding boxes. The number of predicted bounding boxes dropped to 1475 when we ensembled the results from all five networks. Consequently, the ensemble methods have reduced the numbers of TPs and FPs. It is crucial for future research to focus on true positives, i.e. correctly locating the DFUs. One aspect of overcoming this issue is to understand the IoU threshold setting. Our experiments use IoU ≥ 0.5, which is the guideline set for object detection of natural objects. However, medical imaging studies Drukker et al. (2002); Yap et al. (2008) used an IoU (or Jaccard Similarity Index) threshold of 0.4. When we evaluate the performance of the best ensemble method with an IoU threshold of 0.4, the number of TPs increases to 1594, and with IoU ≥ 0.3, the number of TPs increases to 1668. With Faster R-CNN with Deformable Convolution, the number of TPs increases to 1743 and 1883 for IoU thresholds of 0.4 and 0.3, respectively.
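The effect of the IoU threshold on the TP count can be reproduced with a greedy matcher; this is a sketch, and DFUC2020's exact matching protocol may differ:

```python
def box_iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def count_true_positives(preds, gts, iou_thr=0.5):
    """Count predictions matching an unmatched ground truth at IoU >= thr;
    lowering iou_thr to 0.4 or 0.3 raises the TP count, as discussed above."""
    matched, tp = set(), 0
    for p in sorted(preds, key=lambda d: d["conf"], reverse=True):
        for i, g in enumerate(gts):
            if i not in matched and box_iou(p["box"], g) >= iou_thr:
                matched.add(i)
                tp += 1
                break
    return tp
```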
7. Conclusion

We conduct a comprehensive evaluation of the performance of deep learning object detection networks for DFU detection. While the overall results show the potential of automatically localising ulcers, there are many false positives, and the networks struggle to discriminate ulcers from other skin conditions. A possible solution to address this issue might be to introduce a second classifier, in the form of a negative dataset to train future networks on. However, in reality it may prove impossible to gather all possible negative examples for supervised learning algorithms. This approach could also impact network size and complexity, which could negatively impact inference speed. Segmenting the foot from its surroundings might provide another possible solution to this problem, so that trained models do not have to account for objects in complex environments. Future research challenges include:

• Gather a larger-scale dataset with clinical annotations. This is the best solution for supervised machine learning algorithms. However, in the real world there are still barriers to data sharing. Additionally, clinical annotation is expensive and time consuming. It is important to encourage co-creation by machine learning and clinical experts to foster a better understanding of the annotated data.

• Create self-supervised and unsupervised deep learning algorithms for DFU detection. These methods were developed and implemented for natural object detection tasks and remain under-explored in medical imaging.

• For inspections of DFU, accurate delineation of an ulcer and its surrounding skin can help to measure the progress of the ulcer. Goyal et al. (2017) developed an automated segmentation algorithm for DFU. However, they experimented on a small dataset only, and future work will potentially enable larger-scale experimentation.

• The use of DFU classification systems that can be used by clinicians to analyse ulcer condition. Automated analysis and recognition of DFU can help to improve the diagnosis of DFUs. The next challenge (DFUC2021 Yap et al. (2020b)) will focus on multi-class DFU recognition.

• With the growth in the number of people diagnosed with diabetes, remote detection and monitoring of DFU can reduce the burden on health services. Research into the optimization of deep learning models for remote monitoring is another active research area that has the potential to change the healthcare landscape globally.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation for the use of GPUs for this challenge and for sponsoring our event. A.A., D.B.A. and M.O. were supported by the National Health and Medical Research Council [GNT1174405] and the Victorian Government's OIS Program.

References

Armstrong, D.G., Boulton, A.J., Bus, S.A., 2017. Diabetic foot ulcers and their recurrence. New England Journal of Medicine 376, 2367–2375.

Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. YOLOv4: Optimal speed and accuracy of object detection. URL: https://arxiv.org/abs/2004.10934, arXiv:2004.10934.

Bodla, N., Singh, B., Chellappa, R., Davis, L.S., 2017. Soft-NMS – improving object detection with one line of code, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569.

Brown, R., Ploderer, B., Da Seng, L.S., Lazzarini, P., van Netten, J., 2017. MyFootCare: a mobile self-tracking tool to promote self-care amongst people with diabetic foot ulcers, in: Proceedings of the 29th Australian Conference on Computer-Human Interaction, pp. 462–466.
Buades, A., Coll, B., Morel, J.M., 2005. A non-local algorithm for image denoising, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), IEEE, pp. 60–65. doi:10.1109/cvpr.2005.38.

Cai, Z., Vasconcelos, N., 2017. Cascade R-CNN: Delving into high quality object detection.

Cai, Z., Vasconcelos, N., 2019. Cascade R-CNN: High quality object detection and instance segmentation. arXiv:1906.09756.

Cao, Y., Chen, K., Loy, C.C., Lin, D., 2020. Prime sample attention in object detection, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11580–11588.

Cassidy, B., Reeves, N.D., Joseph, P., Gillespie, D., O'Shea, C., Rajbhandari, S., Maiya, A.G., Frank, E., Boulton, A., Armstrong, D., et al., 2020. DFUC2020: Analysis towards diabetic foot ulcer detection. arXiv preprint arXiv:2004.11853.

DeVries, T., Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. URL: https://arxiv.org/abs/1708.04552 (accessed on 11 September 2020).

Drukker, K., Giger, M.L., Horsch, K., Kupinski, M.A., Vyborny, C.J., Mendelson, E.B., 2002. Computerized lesion detection on breast ultrasound. Medical Physics 29, 1438–1446.

Ghiasi, G., Lin, T.Y., Le, Q.V., 2018. DropBlock: A regularization method for convolutional networks, in: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 31. Curran Associates, Inc., pp. 10727–10737.

Goyal, M., Hassanpour, S., 2020. A refined deep learning architecture for diabetic foot ulcers detection. arXiv preprint arXiv:2007.07922.

Goyal, M., Reeves, N.D., Davison, A.K., Rajbhandari, S., Spragg, J., Yap, M.H., 2018. DFUNet: convolutional neural networks for diabetic foot ulcer classification. IEEE Transactions on Emerging Topics in Computational Intelligence, 1–12. doi:10.1109/TETCI.2018.2866254.

Goyal, M., Reeves, N.D., Rajbhandari, S., Ahmad, N., Wang, C., Yap, M.H., 2020. Recognition of ischaemia and infection in diabetic foot ulcers: Dataset and techniques. Computers in Biology and Medicine, 103616.

Goyal, M., Reeves, N.D., Rajbhandari, S., Yap, M.H., 2019. Robust methods for real-time diabetic foot ulcer detection and localization on mobile devices. IEEE Journal of Biomedical and Health Informatics 23, 1730–1741. doi:10.1109/JBHI.2018.2868656.

Goyal, M., Yap, M.H., 2018. Region of interest detection in dermoscopic images for natural data-augmentation. arXiv preprint arXiv:1807.10711.

Goyal, M., Yap, M.H., Reeves, N.D., Rajbhandari, S., Spragg, J., 2017. Fully convolutional networks for diabetic foot ulcer segmentation, in: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 618–623. doi:10.1109/SMC.2017.8122675.

He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.

He, K., Zhang, X., Ren, S., Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 1904–1916. doi:10.1109/tpami.2015.2389824.

Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

IDF, 2019. International Diabetes Federation: Facts & figures. https://www.idf.org/aboutdiabetes/what-is-diabetes/facts-figures.html.

Jocher, G., Kwon, Y., guigarfr, Veitch-Michaelis, J., perry0418, Ttayu, Marc, Bianconi, G., Baltacı, F., Suess, D., Chen, T., Yang, P., idow09, WannaSeaU, Xinyu, W., Shead, T.M., Havlik, T., Skalski, P., NirZarrabi, LukeAI, LinCoce, Hu, J., IlyaOvodov, GoogleWiki, Reveriano, F., Falak, Kendall, D., 2020a. ultralytics/yolov3: [email protected]:0.95 on COCO2014. doi:10.5281/zenodo.3785397.

Jocher, G., Stoken, A., Borovec, J., NanoCode012, ChristopherSTAN, Changyu, L., Laughing, Hogan, A., lorenzomammana, tkianai, yxNONG, AlexWang1900, Diaconu, L., Marc, wanghaoyang0106, ml5ah, Doug, Hatovix, Poznanski, J., Yu, L., changyu98, Rai, P., Ferriday, R., Sullivan, T., Xinyu, W., YuriRibeiro, Claramunt, E.R., hopesala, pritul dave, yzchen, 2020b. ultralytics/yolov5: v3.0. doi:10.5281/zenodo.3983579.

Koitka, S., Friedrich, C.M., 2017. Optimized convolutional neural network ensembles for medical subfigure classification, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction – 8th International Conference of the CLEF Association, CLEF 2017, Lecture Notes in Computer Science (LNCS). Springer International Publishing, pp. 57–68. doi:10.1007/978-3-319-65813-1_5.

Li, Z., Peng, C., Yu, G., Zhang, X., Sun, J., 2018. DetNet: A backbone network for object detection.

Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017. Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, pp. 740–755.

Liu, S., Qi, L., Qin, H., Shi, J., Jia, J., 2018. Path aggregation network for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.

Maas, A.L., Hannun, A.Y., Ng, A.Y., 2013. Rectifier nonlinearities improve neural network acoustic models, in: Proc. ICML, p. 3. URL: http://robotics.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf (accessed on 11 September 2020).

hua Ng, J., Goyal, M., Hewitt, B., Yap, M.H., 2019. The effect of color constancy algorithms on semantic segmentation of skin lesions, in: Medical Imaging 2019: Biomedical Applications in Molecular, Structural, and Functional Imaging, International Society for Optics and Photonics, p. 109530R.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library, in: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32. Curran Associates, Inc., pp. 8024–8035.

Pebesma, E., 2018. Simple Features for R: Standardized support for spatial vector data. The R Journal 10, 439–446. doi:10.32614/RJ-2018-009.

R Core Team, 2020. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.

Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You Only Look Once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. doi:10.1109/cvpr.2016.91.

Redmon, J., Farhadi, A., 2017. YOLO9000: Better, faster, stronger, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. doi:10.1109/cvpr.2017.690.

Redmon, J., Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, pp. 91–99.

Solovyev, R., Wang, W., Gabruseva, T., 2019. Weighted boxes fusion: ensembling boxes for object detection models. arXiv:1910.13302.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.

Tan, M., Le, Q., 2019. EfficientNet: Rethinking model scaling for convolutional neural networks.

Tan, M., Pang, R., Le, Q.V., 2019. EfficientDet: Scalable and efficient object detection. URL: https://arxiv.org/pdf/1911.09070.

Tan, M., Pang, R., Le, Q.V., 2020. EfficientDet: Scalable and efficient object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790.

Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., Fergus, R., 2013. Regularization of neural networks using DropConnect, PMLR, Atlanta, Georgia, USA, pp. 1058–1066. URL: http://proceedings.mlr.press/v28/wan13.html.

Wang, C.Y., Mark Liao, H.Y., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.H., 2020. CSPNet: A new backbone that can enhance learning capability of CNN, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391.

Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.,