Deep Learning Approach to Diabetic Retinopathy Detection
Borys Tymchenko 1,a, Philip Marchenko 2,b and Dmitry Spodarets 3,c
1 Institute of Computer Systems, Odessa National Polytechnic University, Shevchenko av. 1, Odessa, Ukraine
2 Department of Optimal Control and Economical Cybernetics, Faculty of Mathematics, Physics and Information Technology, Odessa I.I. Mechnikov National University, Dvoryanskaya str. 2, Odessa, Ukraine
3 VITech Lab, Rishelevska St. 33, Odessa, Ukraine
a https://orcid.org/0000-0002-2678-7556
b https://orcid.org/0000-0001-9995-9454
c https://orcid.org/0000-0001-6499-4575
Keywords: Deep learning, diabetic retinopathy, deep convolutional neural network, multi-target learning, ordinal regression
Abstract: Diabetic retinopathy is one of the most threatening complications of diabetes that leads to permanent blindness
if left untreated. One of the essential challenges is early detection, which is very important for treatment
success. Unfortunately, the exact identification of the diabetic retinopathy stage is notoriously tricky and
requires expert human interpretation of fundus images. Simplification of the detection step is crucial and
can help millions of people. Convolutional neural networks (CNN) have been successfully applied in many
adjacent subjects, and for diagnosis of diabetic retinopathy itself. However, the high cost of big labeled
datasets, as well as inconsistency between different doctors, impede the performance of these methods. In
this paper, we propose an automatic deep-learning-based method for stage detection of diabetic retinopathy from a single photograph of the human fundus. Additionally, we propose a multistage approach to transfer learning, which makes use of similar datasets with different labeling. The presented method can be used as a screening method for early detection of diabetic retinopathy, with sensitivity and specificity of 0.99, and is ranked 54 of 2943 competing methods (quadratic weighted kappa score of 0.925466) on the APTOS 2019 Blindness Detection Dataset (13000 images).
1 INTRODUCTION

Diabetic retinopathy (DR) is one of the most threatening complications of diabetes, in which damage occurs to the retina and causes blindness. It damages the blood vessels within the retinal tissue, causing them to leak fluid and distort vision. Along with other diseases leading to blindness, such as cataracts and glaucoma, DR is one of the most frequent ailments, according to US, UK, and Singapore statistics (NCHS, 2019; NCBI, 2018; SNEC, 2019).

DR progresses through four stages:
• Mild non-proliferative retinopathy, the earliest stage, where only microaneurysms can occur;
• Moderate non-proliferative retinopathy, a stage characterized by loss of the blood vessels' ability to transport blood, due to their distortion and swelling as the disease progresses;
• Severe non-proliferative retinopathy, which results in deprived blood supply to the retina due to the increased blockage of more blood vessels, hence signaling the retina to grow fresh blood vessels;
• Proliferative diabetic retinopathy, the advanced stage, where growth factors secreted by the retina activate the proliferation of new blood vessels, which grow along the inside covering of the retina and into the vitreous gel filling the eye.

Each stage has its own characteristics and particular properties, so doctors may fail to take some of them into account and thus make an incorrect diagnosis. This leads to the idea of creating an automatic solution for DR detection.

At least 56% of new cases of this disease could be reduced with proper and timely treatment and monitoring of the eyes (Rohan T, 1989). However, the initial stage of this ailment has no warning signs, which makes early detection a real challenge. Moreover, even well-trained clinicians sometimes cannot reliably examine and evaluate the stage from diagnostic images of a patient's fundus (according to Google's research (Krause et al., 2017); see Figure 1). At the same time, doctors will most often agree when lesions are apparent. Furthermore, existing ways of diagnosing are quite inefficient due to their duration and the number of ophthalmologists involved in solving a patient's case. Such sources of disagreement cause wrong diagnoses and unstable ground truth for the automatic solutions developed to help at the research stage.

Figure 1: Google showed that ophthalmologists' diagnoses differ for the same fundus image. Best viewed in color.

Thus, algorithms for DR detection began to appear. The first algorithms were based on classical computer vision techniques and threshold setting (Michael D. Abrmoff and Quellec, 2010; Christopher E. Hann, 2009; Nathan Silberman and Subramanian, 2010). Nevertheless, in the past few years, deep learning approaches have proved their superiority over other algorithms in classification and object detection tasks (Harry Pratt, 2016). In particular, convolutional neural networks (CNN) have been successfully applied in many adjacent subjects and for the diagnosis of diabetic retinopathy itself (Shaohua Wan, 2018; Harry Pratt, 2016).

In 2019, APTOS (Asia Pacific Tele-Ophthalmology Society) and the competitive ML platform Kaggle challenged ML and DL researchers to develop a five-class automatic DR diagnosis solution (APTOS 2019 Blindness Detection Dataset). In this paper, we propose a transfer learning approach and an automatic method for detecting the stage of diabetic retinopathy from a single photograph of the human fundus. This approach is able to learn useful features even from a noisy and small dataset and could be used as a DR stage screening method in automatic solutions. This method was ranked 54 of 2943 methods in the APTOS 2019 Blindness Detection Competition and achieved a quadratic weighted kappa score of 0.925466.

2 RELATED WORK

Many research efforts have been devoted to the problem of early diabetic retinopathy detection. At first, researchers tried to use classical methods of computer vision and machine learning to provide a suitable solution to this problem. For instance, Priya et al. (Priya and Aruna, 2012) proposed a computer-vision-based approach for the detection of diabetic retinopathy stages using color fundus images. They extracted features from the raw images using image processing techniques and fed them to an SVM for binary classification, achieving a sensitivity of 98%, specificity of 96%, and accuracy of 97.6% on a test set of 250 images. Other researchers tried to fit different models for multiclass classification, e.g., applying PCA to images and fitting decision trees, naive Bayes, or k-NN (Conde et al., 2012), with best results of 73.4% accuracy and 68.4% F-measure on a dataset of 151 images with different resolutions.

With the growing popularity of deep-learning-based approaches, several methods that apply CNNs to this problem appeared. Pratt et al. (Harry Pratt, 2016) developed a CNN with data augmentation that can identify the intricate features involved in the classification task, such as microaneurysms, exudates, and hemorrhages in the retina, and consequently provide a diagnosis automatically and without user input. They achieved a sensitivity of 95% and an accuracy of 75% on 5,000 validation images. There are also other works on CNNs from other researchers (Carson Lam and Lindsey, 2018; Yung-Hui Li and Chung, 2019). It is useful to note that Asiri et al. reviewed a significant number of available methods and datasets, highlighting their pros and cons (Asiri et al., 2018). Besides, they pointed out the challenges to be addressed in designing and learning efficient and robust deep-learning algorithms for various problems in DR diagnosis and drew attention to directions for future research.

Other researchers also applied transfer learning with CNN architectures. Hagos et al. (Hagos and Kant, 2019) trained InceptionNet V3 for 5-class classification with ImageNet pretraining and achieved an accuracy of 90.9%. Sarki et al. (Rubina Sarki, 2019) trained ResNet50, Xception Nets, DenseNets, and VGG with ImageNet pretraining and achieved a best accuracy of 81.3%. Both teams of researchers used datasets provided by APTOS and Kaggle.
3 PROBLEM STATEMENT
3.1 Datasets
The image data used in this research was taken from
several datasets. We used an open dataset from Kag-
gle Diabetic Retinopathy Detection Challenge 2015
(EyePACs, 2015) for pretraining our CNNs. This
dataset is the largest available publicly. It consists
of 35126 fundus photographs for left and right eyes
of American citizens labeled with stages of diabetic
retinopathy:
• No diabetic retinopathy (label 0)
• Mild diabetic retinopathy (label 1)
• Moderate diabetic retinopathy (label 2)
• Severe diabetic retinopathy (label 3)
• Proliferative diabetic retinopathy (label 4)

In addition, we used other, smaller datasets: the Indian Diabetic Retinopathy Image Dataset (IDRiD) (Sahasrabuddhe and Meriaudeau, 2018), from which we used 413 fundus photographs, and the MESSIDOR (Methods to Evaluate Segmentation and Indexing Techniques in the field of Retinal Ophthalmology) dataset (Decencire et al., 2014), from which we used 1200 fundus photographs. As the original MESSIDOR dataset uses a different grading from the other datasets, we used the version that was relabeled to the standard grading by a panel of ophthalmologists (Google Brain, 2018).

As the evaluation was performed on the Kaggle APTOS 2019 Blindness Detection (APTOS2019) dataset (APTOS, 2019), we had access only to its training part. The full dataset consists of 18590 fundus photographs, divided into 3662 training, 1928 validation, and 13000 testing images by the organizers of the Kaggle competition. All datasets have similar class distributions; the distribution for APTOS2019 is shown in Figure 2.

Figure 2: Classes distribution in the APTOS2019 dataset.

As the different datasets have similar distributions, we considered this a fundamental property of this type of data. We made no modifications to the dataset distribution (undersampling, oversampling, etc.).

The smallest native size among all of the datasets is 640x480. A sample image from APTOS2019 is shown in Figure 3.

Figure 3: Sample of a fundus photo from the dataset.

3.2 Evaluation metric

In this research, we used the quadratic weighted Cohen's kappa score as our main metric. The kappa score measures the agreement between two ratings. The quadratic weighted kappa is calculated between the scores assigned by the human rater and the predicted scores. This metric varies from -1 (complete disagreement between raters) to 1 (complete agreement between raters). The definition of $\kappa$ is:

$$\kappa = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, o_{ij}}{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, e_{ij}}, \qquad (1)$$

where $k$ is the number of categories, and $o_{ij}$ and $e_{ij}$ are elements of the observed and expected matrices, respectively. $w_{ij}$ is calculated as follows:

$$w_{ij} = \frac{(i - j)^2}{(k - 1)^2}. \qquad (2)$$

Due to Cohen's kappa properties, researchers must interpret this ratio carefully. For instance, two pairs of raters with the same percentage of agreement but different proportions of ratings can have drastically different kappa values.

Another problem is the number of codes: as the number of codes grows, kappa becomes higher. Also, kappa may be low even when there are high levels of agreement and even when individual ratings are accurate. All of the above makes kappa a volatile ratio to analyze.

The main reason to use the kappa ratio is that we do not have access to the labels of the validation and test datasets. The kappa value for these datasets is obtained by submitting our model and runner code to the checking system on the Kaggle site. Moreover, we do not have explicit access to the images from the test dataset.

Along with the kappa score, we calculate the macro F1 score, accuracy, sensitivity, and specificity on a holdout dataset of 736 images taken from the APTOS2019 training data.
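For concreteness, the following is a minimal NumPy sketch of equations (1) and (2), assuming integer labels 0..k-1; the function name is ours. scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same quantity.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, k=5):
    """Quadratic weighted Cohen's kappa per equations (1) and (2)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed matrix o_ij: counts of (true, predicted) label pairs.
    observed = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Expected matrix e_ij: outer product of the marginal histograms,
    # scaled to the same total count as the observed matrix.
    hist_true = np.bincount(y_true, minlength=k)
    hist_pred = np.bincount(y_pred, minlength=k)
    expected = np.outer(hist_true, hist_pred) / len(y_true)

    # Quadratic weights w_ij = (i - j)^2 / (k - 1)^2.
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Perfect agreement gives kappa = 1; the weighted diagonal costs nothing.
assert quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]) == 1.0
```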
Figure 6: Feature embeddings with t-SNE. Ground truth (top) and predicted (bottom) classes. Best viewed in color.

[...] gradients of updating the corresponding heads' weights and further discourage the network from converging. The initial weights for every head were set to 1/3, and the network was then trained for five epochs to minimize the mean squared error function.

The difference between the prediction distributions of the regression head and the linear regression outputs is shown in Figure 7.

4.4.4 Regularization

At training time, we regularize our models for better robustness. We use conventional methods, e.g., weight decay (Krogh and Hertz, 1992) and dropout. We also penalize the network for overconfident predictions by using label smoothing (Szegedy et al., 2016). In addition to label smoothing for classification, we smooth the regression targets:

$$T_s = T + \Delta, \qquad \Delta \sim U(a, b),$$

where $T_s$ is the smoothed target label, $T$ is the original label, and $U$ is the uniform distribution. In this case, $-a = b = \frac{T_{i+1} - T_i}{3}$, where $T_i$ and $T_{i+1}$ are neighbouring discrete target labels. Applying this smoothing scheme, we reduce the impact of incorrect labeling.
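As an illustration, here is a minimal PyTorch sketch of this target smoothing scheme, assuming unit-spaced stage labels so that b = 1/3; the function name and `gap` parameter are ours, not the paper's code.

```python
import torch

def smooth_regression_target(target: torch.Tensor, gap: float = 1.0) -> torch.Tensor:
    """Add uniform noise U(a, b) with -a = b = gap / 3 to regression targets.

    `gap` is the distance between neighbouring discrete labels
    (1.0 for the stage labels 0..4 used here).
    """
    b = gap / 3.0
    # torch.rand_like draws from U(0, 1); rescale it to U(-b, b).
    delta = (torch.rand_like(target) * 2.0 - 1.0) * b
    return target + delta

# Example: smooth a batch of stage labels used as regression targets.
targets = torch.tensor([0.0, 1.0, 4.0])
print(smooth_regression_target(targets))
```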
4.4.5 Ensembling

For final scoring, we ensembled models with 3 encoder architectures, at the different resolutions that scored best on the holdout dataset: EfficientNet-B4 (380x380), EfficientNet-B5 (456x456) (Tan and Le, 2019), and SE-ResNeXt50 (380x380 and 512x512) (Hu et al., 2017).

Our best performing solution is an ensemble of 20 models (4 architectures x 5 folds) with test-time augmentations (horizontal flip, vertical flip, transpose, rotate, zoom). Overall, this scheme generates 200 predictions per fundus image. These predictions were averaged with a 0.25-trimmed mean to eliminate outliers from possibly overfitted models; the trimmed mean filters out outliers to reduce variance.

We used the Catalyst framework (Kolesnikov, 2018) based on PyTorch (Paszke et al., 2017) with GPU support. Evaluation of the whole ensemble was performed on an Nvidia P100 GPU in 9 hours, processing 2.5 seconds per image.
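The following is a minimal sketch of the aggregation step described above, assuming the per-image predictions from all models and TTA variants are already collected into one array; `scipy.stats.trim_mean` implements the trimming (here, cutting the given fraction from each tail).

```python
import numpy as np
from scipy.stats import trim_mean

def aggregate_predictions(preds: np.ndarray, trim: float = 0.25) -> float:
    """Aggregate per-image predictions (models x TTA variants) with a
    trimmed mean: the lowest and highest `trim` fractions are dropped
    before averaging, suppressing outliers from overfitted models."""
    return trim_mean(preds, proportiontocut=trim)

# Example: 200 predictions for one image (20 models x 10 TTA variants).
rng = np.random.default_rng(0)
preds = rng.normal(loc=2.0, scale=0.1, size=200)
preds[:3] = [4.9, 0.0, 4.5]  # a few outliers from overfitted models
stage = int(np.clip(np.rint(aggregate_predictions(preds)), 0, 4))
print(stage)  # rounded to the nearest DR stage
```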
5 RESULTS
As experimental results, we provide two tables with the metrics mentioned in the Evaluation metric section: the first reports results from local validation without TTA (Table 1), and the second with TTA (Table 2).
Our test stage was split into two parts: local testing and Kaggle testing. As ensembling performed best locally, we evaluated it on the Kaggle validation and test datasets.
On the local dataset of 736 images, the ensemble with TTA performed slightly worse than without it. However, the ensemble with TTA performed better on the testing dataset of 13000 images, as it generalizes better to unseen images.
The ensembles scored 0.818462/0.924746 validation/test QWK for the trimmed-mean ensemble without TTA and 0.826567/0.925466 for the trimmed-mean ensemble with TTA.
Additionally, we evaluated binary classification (DR/No DR) to check the best model's quality as a screening method (see Tables 1 and 2, last row).
The ensemble with TTA showed its stability in the final scoring, keeping a consistent rank (58 and 54 of 2943) on the validation and testing datasets, respectively.
6 INTERPRETATION

In medical applications, it is important to be able to interpret a model's predictions. While good performance on the validation dataset can be a criterion for selecting the best-trained model for production, it is insufficient for real-life use of this model.

By using SHAP (SHapley Additive exPlanations) (Lundberg and Lee, 2017), it is possible to visualize the features that contribute to the assessment of the disease stage. SHAP unites several previous methods and represents the only possible consistent and locally accurate additive feature attribution method. Using SHAP allows us to ensure that the model learns useful features during training and uses the correct features at inference time. Furthermore, in uncertain cases, visualization of salient features can help the physician focus on regions of interest where the features are most noticeable.

In Figure 8, we show an example visualization of SHAP values for one of the models from the ensemble. Red denotes features that increase the output value for a given class, and blue denotes features that decrease the output value for a given class. The overall intensity of the features denotes the saliency of the given region for the classification process.

Figure 8: SHAP analysis of sample images. Best viewed in color.
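As an illustration of this workflow, here is a minimal, self-contained sketch using the `shap` library's `GradientExplainer` with a PyTorch model; the stand-in CNN and random tensors are placeholders for a trained ensemble member and preprocessed fundus images, and the exact return format of `shap_values` may vary across shap versions.

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Stand-in 5-class CNN and random "fundus" batches (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 5),
)
model.eval()
background = torch.randn(20, 3, 64, 64)  # reference sample for the explainer
to_explain = torch.randn(2, 3, 64, 64)   # images to explain

# GradientExplainer approximates SHAP values with expected gradients.
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(to_explain)  # one array per output class

# shap.image_plot expects NHWC numpy arrays, so move channels last.
shap_nhwc = [np.transpose(sv, (0, 2, 3, 1)) for sv in shap_values]
images_nhwc = np.transpose(to_explain.numpy(), (0, 2, 3, 1))
shap.image_plot(shap_nhwc, images_nhwc)
```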
Model | QWK | Macro F1 | Accuracy | Sensitivity | Specificity
EfficientNet-B4 | 0.965 | 0.811 | 0.903 | 0.812 | 0.976
EfficientNet-B5 | 0.963 | 0.815 | 0.907 | 0.807 | 0.977
SE-ResNeXt50 (512x512) | 0.969 | 0.854 | 0.924 | 0.871 | 0.982
SE-ResNeXt50 (380x380) | 0.960 | 0.788 | 0.892 | 0.785 | 0.974
Ensemble (mean) | 0.968 | 0.840 | 0.921 | 0.8448 | 0.981
Ensemble (trimmed mean) | 0.971 | 0.862 | 0.929 | 0.860 | 0.983
Ensemble (trimmed mean, binary classification) | 0.981 | 0.989 | 0.986 | 0.991 | 0.991
Table 1: Results of experiments and metrics tracked, without using TTA.