Deep Learning Ensemble Method for Classifying Glaucoma Stages Using Fundus Photographs and Convolutional Neural Networks
Current Eye Research, 46:10, 1516–1524. DOI: 10.1080/02713683.2021.1900268
Hyeonsung Cho (a), Young Hoon Hwang (b), Jae Keun Chung (b), Kwan Bok Lee (b), Ji Sang Park (a), Hong-Gee Kim (c), and Jae Hoon Jeong (d)
(a) Intelligence and Robot System Research Group, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea; (b) Department of Ophthalmology, Chungnam National University Hospital, Daejeon, Republic of Korea; (c) Biomedical Knowledge Engineering Laboratory, Seoul National University, Seoul, Republic of Korea; (d) Department of Ophthalmology, Konyang University Hospital, Konyang University College of Medicine, Daejeon, Republic of Korea
CONTACT Jae Hoon Jeong [email protected] Department of Ophthalmology, Konyang University Hospital, 158 Gwanjeodong-ro, Seo-gu, Daejeon 35365, Republic of Korea.
Supplemental data for this article can be accessed on the publisher’s website.
© 2021 The Author(s). Published with license by Taylor & Francis Group, LLC.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/),
which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.
deep learning technologies that are prominent in the image pattern recognition field could be useful in glaucoma screening. Liu et al.18 established a large-scale database of 241,032 fundus images for glaucoma diagnosis and developed a CNN-based glaucoma diagnosis from the fundus images, as an advanced deep learning system intended for use in different settings with images of varying quality, patient ethnicity, and population sources. An et al.19 built a machine learning classification model that combines the information of color fundus images and OCT data to classify glaucomatous and healthy eyes; such a system should help to improve the diagnostic accuracy of detecting early glaucoma.

However, most previous studies have focused their deep learning techniques on either fundus photographs12,15–19 or OCT scans13,19 and presented their results in terms of whether or not glaucoma was present, omitting the stage of the disease. The assessment of glaucoma needs to include the structural and functional changes in the eye as the disease progresses. The purpose of this study is to propose and evaluate the performance of a new cost-effective glaucoma screening test for primary care using a deep learning ensemble method with fundus photographs and CNN, considering the various stages and structure-function correlations of the disease.

Materials and methods

This study employed a retrospective case-control design. Subjects from the Glaucoma Clinics of Konyang University Hospital and Kim's Eye Hospital were enrolled between March 2016 and June 2018. The study followed the tenets of the World Medical Association's Declaration of Helsinki. The Institutional Review Boards of Konyang University Hospital and Kim's Eye Hospital reviewed and approved the study protocol and waived informed consent for this study.

The fundus images were acquired by color imaging with a digital fundus camera (Nonmyd 7, Kowa Optimed, Tokyo, Japan) without pupil dilation. Glaucomatous structural changes were defined as images with any of the following conditions: a cup-to-disc ratio of 0.7 or greater, cup-to-disc ratio asymmetry of >0.2 between fellow eyes, neuroretinal rim thinning, notching or excavation, disc hemorrhages, and RNFL defects in red-free images with edges present at the optic nerve head margin. Subjects with the following conditions were excluded from this study: astigmatism with cylinder correction < –3.0 D or > +3.0 D; poor-quality fundus images that could interfere with glaucoma evaluation, such as those with media opacities and motion artifacts; other optic neuropathies induced by inflammatory, ischemic, compressive, or hereditary factors; and other retinal pathologies such as retinal detachment, age-related macular degeneration, myopic chorioretinal atrophy, diabetic retinopathy, macular hole, retinal vascular obstruction, and epiretinal membrane.

Standard automated perimetry using the Swedish interactive thresholding algorithm (SITA-Standard) of central 24-2 perimetry (Humphrey Field Analyzer II, Carl Zeiss Meditec, Dublin, CA, USA) was performed for each subject with selected fundus photographs. A visual field (VF) was considered reliable when the fixation loss was less than 20% and the false-positive rate was less than 33%. Only reliable VF data were included in the analysis, and testing was conducted on fundus photographs that were less than six months old. A glaucomatous VF defect was defined as three or more adjacent points in the pattern deviation chart with a deviation having less than 5% probability of being due to chance, of which at least one point has less than 1% probability, or a pattern standard deviation index with less than 5% probability.

Cross-sectional data of each eye from 2,801 subjects, including all fundus photographs (1 to 13 per eye) and single field analyses of VF tests (1 to 7 per eye) obtained between March 2016 and June 2018, were distributed to four glaucoma specialists (members of the Korean Glaucoma Society). Only one fundus image per eye, for a total of 4,445 fundus photographs, was selected as compatible with the ophthalmologic criteria of the CNN model. Reliable VF data acquired within 6 months of the selected fundus photograph were used, and data were split into folds at the eye level. In cases wherein both eyes of a glaucoma or unaffected control subject were eligible for the study, data from both eyes were chosen for inclusion. Based on the results of the structural and VF testing, the fundus photographs were labeled preliminarily according to the following five stages: unaffected control, preperimetric, mild, moderate, and severe glaucoma. Unaffected controls did not have any glaucomatous structural change or any VF defect. The preperimetric grade was defined as a definite structural glaucomatous change without any VF defect, and perimetric glaucoma was defined as a definite structural glaucomatous change with a corresponding VF defect. Perimetric glaucoma was graded on the MD value from the VF testing according to the Hodapp-Parrish-Anderson classification system.20 The mild group had an MD value greater than or equal to –6 dB, the moderate group had an MD value of –6 to –12 dB, and the severe group had an MD value of less than –12 dB.

Cross-validation of the preliminary label classifications was performed to maximize the efficiency of the CNN models. The pairing of fundus photograph and VF test results of each eye was reviewed by three other glaucoma specialists who did not participate in the preliminary labeling. Each specialist labeled fundus photographs according to the previous structural and functional criteria without seeing the results of the preliminary label classification. The results of the cross-validation were added to the data collection. Finally, each photograph upon which all glaucoma specialists agreed was included in a final dataset. The images of the final dataset averaged 2,270 pixels (SD, 391; 95% CI, 2,257–2,283) in height and 3,412 pixels (SD, 596; 95% CI, 3,392–3,432) in width.

The flow of the image processing in the current study is shown in Figure 1. Image preprocessing was undertaken to clean the photographs of marks that do not affect the reading, such as words, patient numbers, and the black area around the edges of the photograph. The red-free channel was extracted from the original color fundus image to obtain a high-resolution image of the RNFL. Data augmentation was performed to reduce overfitting and to maximize the training effect of the CNN models. The preprocessed images were rotated (90°, 180°, 270°) and enlarged by 25% centered on the mid-point of the original image. Finally, after filtering and resizing, the resolution of each photograph was converted to 299 × 299 × 3 (R × G × B) for input into the CNN architecture. To optimize the parameters of the CNN architectures, the processed fundus photographs were passed through various image filters (bilateral, Gaussian, histogram equalization, median, and sharpening), and the results were used as input.
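The per-image pipeline described above maps naturally onto a few OpenCV calls. The following is a minimal sketch, not the authors' released code: the filter parameters (kernel sizes, sigmas), the sharpening kernel, the green-channel approximation of the red-free view, the chained-filter order, and the [0, 1] scaling are all illustrative assumptions, since the paper does not report them.

    import cv2
    import numpy as np

    def red_free(image_bgr):
        # The red-free view is commonly approximated by the green channel
        # of the color image (assumption; the paper does not specify).
        return image_bgr[:, :, 1]

    def augment(image):
        # Rotations used in the paper: 90, 180, and 270 degrees.
        rotated = [np.ascontiguousarray(np.rot90(image, k)) for k in (1, 2, 3)]
        # A 25% enlargement centered on the image mid-point is equivalent
        # to cropping the central 80% and resizing back to full size.
        h, w = image.shape[:2]
        ch, cw = int(h * 0.8), int(w * 0.8)
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        zoomed = cv2.resize(image[y0:y0 + ch, x0:x0 + cw], (w, h))
        return rotated + [zoomed]

    SHARPEN_KERNEL = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)

    FILTERS = {
        "bypass": lambda img: img,
        "bilateral": lambda img: cv2.bilateralFilter(img, 9, 75, 75),
        "gaussian": lambda img: cv2.GaussianBlur(img, (5, 5), 0),
        # Histogram equalization works per 8-bit channel.
        "histeq": lambda img: cv2.equalizeHist(img) if img.ndim == 2
                  else cv2.merge([cv2.equalizeHist(c) for c in cv2.split(img)]),
        "median": lambda img: cv2.medianBlur(img, 5),
        "sharpen": lambda img: cv2.filter2D(img, -1, SHARPEN_KERNEL),
    }

    def all_filters(img):
        # The seventh filter variant applies the full filter set; the
        # application order here is an assumption.
        for name in ("bilateral", "gaussian", "histeq", "median", "sharpen"):
            img = FILTERS[name](img)
        return img

    def to_network_input(image, filter_name):
        filtered = FILTERS[filter_name](image)
        resized = cv2.resize(filtered, (299, 299))  # 299 x 299 network input
        return resized.astype(np.float32) / 255.0   # assumed [0, 1] scaling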
Figure 1. An overview of the image processing flow in the current study. The final input image for the convolutional neural network architecture was prepared from each photograph by image preprocessing, red-free channel extraction, data augmentation, filtering, and resizing.
Figure 2 shows a total of 56 CNN models comprising the combination of two color types of the fundus photograph (original and red-free images), seven types of image filters (the five single filters, a bypass, and a variant applying all of the filters), and four types of CNN architectures. To increase the diversity of the CNN models, we simulated architectures converting InceptionNet-V321 (ICFC1 or ICFC3) and InceptionResNet-V222 (IRFC1 or IRFC3) to versions with one or three fully connected layers. The final output stage of each of the four CNN architectures was composed of a softmax layer with three output nodes. The details of the CNN architectures used in this study can be found in Supplementary Fig. S1. A graphics processing unit supporting TensorFlow 1.8, CUDA 9.0, and 5,120 CUDA cores was used to train the 56 CNN models. The computer language used for system development was Python version 3.5, and OpenCV version 3.1 was used for the image processing of the fundus photographs. As this study applied 10-fold cross-validation, 90% of the whole dataset was used for training the CNN models and the rest was allocated for validation.

The final decision on the grading of the fundus photographs was made by averaging the probabilities of each class output by the 56 CNN models. The class with the highest probability was selected as the grade. The ensembled output for each class is calculated using the equation:

P_F(C_{k=0,1,2}) = \left( \sum_{i=1}^{N} P_s^{(i)}(C_{k=0,1,2}) \right) / N

where P_F(C_k) is the final probability of C_k, k is the class identifier, P_s^{(i)}(C_k) is the output of the i-th single CNN model, which is the probability of C_k in that model, and N is the number of models used in the ensemble method (a code sketch of this averaging rule is given at the end of this section).

The accuracy and the area under the receiver operating characteristic curve (AUROC) were used to compare the diagnostic performance between the best single CNN model, which showed the best performance out of the 56 models, and the ensemble method; the AUROC for each of the three classes (C0, C1, and C2) was evaluated as a performance index for classifying the stage of glaucoma.

The performance of the best single CNN model and that of the combination of the 56 CNN models were assessed and compared using the Shapiro-Wilk test and the paired t-test. Because the algorithms were run a total of 10 times to evaluate the performance using 10-fold cross-validation and the number of tests was less than 30, the Shapiro-Wilk test was performed to verify the normality of the data distribution. If the data satisfied a normal distribution, the paired t-test was used to compare model performance; otherwise, the Mann-Whitney U test was performed. Data were recorded and analyzed using R version 3.4.1 (R Foundation for Statistical Computing, Vienna, Austria), based upon a 5% probability of statistical significance.

Availability of materials and data

The datasets generated and/or analyzed during the current study are not publicly available because the research employs a retrospective case-control design with a waiver of informed consent, but they are available from the corresponding author on reasonable request and with permission of the Institutional Review Board of Konyang University Hospital. Some representative fundus images from the current study can be found in Supplementary Fig. S2.
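In code, the probability-averaging rule given earlier in this section reduces to a single mean over the stacked model outputs. A minimal sketch (not the authors' implementation; the array layout is an assumption):

    import numpy as np

    def ensemble_grade(model_probs):
        """model_probs: array of shape (N, 3) holding each model's softmax
        output for (C0, C1, C2); in this study, N = 56."""
        # P_F(C_k) = (1/N) * sum_i P_s^(i)(C_k)
        p_final = model_probs.mean(axis=0)
        return int(np.argmax(p_final)), p_final

For example, ensemble_grade(np.array([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])) returns class 0 with averaged probabilities (0.6, 0.3, 0.1).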
Figure 2. Concept diagram of the ensemble method combining 56 convolutional neural network models. The combination of two color channels of the fundus photograph, seven types of image filters, and four types of CNN architectures resulted in a total of 56 CNN models. The probabilities of the individual models were averaged for the final decision on the grading of the fundus photographs. CNN: convolutional neural network; ICFC1: InceptionNet-V3 with one fully connected layer; ICFC3: InceptionNet-V3 with three fully connected layers; IRFC1: InceptionResNet-V2 with one fully connected layer; IRFC3: InceptionResNet-V2 with three fully connected layers.
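As a rough illustration of the four architecture variants named in the caption, the sketch below builds either backbone with one or three fully connected layers and a three-node softmax head. It uses the modern tf.keras API as an approximation (the paper used TensorFlow 1.8), and the hidden-layer width is an assumption not stated in the paper.

    import tensorflow as tf

    def build_model(backbone="inception_v3", num_fc=1, fc_width=1024):
        # ICFC* variants use InceptionNet-V3; IRFC* use InceptionResNet-V2.
        if backbone == "inception_v3":
            base = tf.keras.applications.InceptionV3(
                include_top=False, pooling="avg", input_shape=(299, 299, 3))
        else:
            base = tf.keras.applications.InceptionResNetV2(
                include_top=False, pooling="avg", input_shape=(299, 299, 3))
        x = base.output
        for _ in range(num_fc):           # one FC layer (FC1) or three (FC3)
            x = tf.keras.layers.Dense(fc_width, activation="relu")(x)
        out = tf.keras.layers.Dense(3, activation="softmax")(x)  # C0, C1, C2
        return tf.keras.Model(base.input, out)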
Results

The final dataset consists of 3,460 fundus photographs from 2,204 subjects. The distribution and quantity per grade of the datasets before and after the data cross-validation are shown in Table 1. The number of images in some of the subgroups comprising the five glaucoma grades was insufficient to optimize the CNN models, so the final dataset was reclassified into three classes: unaffected controls (C0), early-stage glaucoma (the merged preperimetric and mild grades; C1), and late-stage glaucoma (C2). As described in Table 2, in each experiment the data were randomly sampled so that the training dataset and the validation dataset were equally distributed for each class, with a training-to-validation ratio of 9:1.

Table 1. Demographics of the dataset.

Group                      Before    After    Proportion (%) of remaining
Unaffected control          1,848    1,259    68.1
Preperimetric glaucoma        284      185    65.1
Mild glaucoma               1,045      784    75.0
Moderate glaucoma             570      563    98.8
Severe glaucoma               698      669    95.8
Total images                4,445    3,460    77.8
Number of patients          2,801    2,204    78.7

Before/After: counts before and after the data cross-validation.

The performance evaluation results of all 56 CNN models and the ensemble method are documented in Supplementary Tables S1-S5, and box plots are shown in Supplementary Figs. S3-S7. The best single CNN model differed depending on the performance index and classification. The model that used a single InceptionResNet-V2 and the sharpening filter (S_C_IRFC1) was the best in accuracy and in AUROC of C0, and the model that used a single InceptionResNet-V2 and all filters (A_C_IRFC1) was the best in average AUROC and in AUROC of C1 and C2. As shown in Table 3 and Figure 3, the ensemble of the 56 models had higher mean accuracy and AUROC, and lower variance of AUROC, compared with the best single model. The average accuracy and AUROC over all classes were 0.852 (95% CI, 0.835–0.869) and 0.950 (95% CI, 0.940–0.961) for the best single model, and 0.881 (95% CI, 0.856–0.907) and 0.975 (95% CI, 0.967–0.984) for the ensemble method, respectively. The AUROC values and curves for the best single model and the ensemble method according to the glaucoma stages are reported in Table 4 and Figure 4. The AUROC of C1 increased remarkably, from 0.905 (95% CI, 0.888–0.922) in the best single model to 0.951 (95% CI, 0.937–0.965) in the ensemble method. The performance differences between the two methods were confirmed to be statistically significant (P < .05) in both accuracy and AUROC. To demonstrate the variance in performance, receiver operating characteristic curves of individual folds are presented in Supplementary Figs. S8-S11.
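The class-balanced 9:1 split described above corresponds to stratified 10-fold cross-validation. A minimal sketch, assuming scikit-learn (the paper does not name the tooling used for fold assignment) and stratifying on one class label per eye; splitting at the eye level, as described in the Methods, keeps all images of an eye in the same fold:

    from sklearn.model_selection import StratifiedKFold

    def folds(eye_ids, labels, n_splits=10, seed=0):
        # One label per eye; stratification keeps the C0/C1/C2 proportions
        # equal across the 90% training and 10% validation partitions.
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, val_idx in skf.split(eye_ids, labels):
            yield train_idx, val_idx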
Table 3. Comparison of diagnostic performance between the best single CNN model and the ensemble method.

Metrics        Group                          Mean (SD)        95% CI (minimum to maximum)    Shapiro-Wilk normality test (P)    Paired t-test (P)
Accuracy (%)   Best single CNN (S_C_IRFC1)    85.2 (0.023)     83.5–86.9 (80.4–88.2)          0.888                              0.021
               Ensemble method                88.1 (0.034)     85.6–90.7 (84.3–94.1)
AUROC          Best single CNN (A_C_IRFC1)    0.950 (0.014)    0.940–0.961 (0.923–0.967)      0.508                              <0.001
               Ensemble method                0.975 (0.011)    0.967–0.984 (0.958–0.994)

AUROC = area under the receiver operating characteristic curve.
Figure 3. Comparison of diagnostic performance between the best single CNN model and the ensemble method. The red and green boxes indicate the best single CNN model in accuracy and average AUROC, respectively, and the blue box indicates the ensemble method.
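The significance testing reported in Table 3 follows a simple decision rule: check normality of the per-fold scores with the Shapiro-Wilk test, then apply the paired t-test, or fall back to the Mann-Whitney U test. A sketch with SciPy (the study itself used R 3.4.1, so this is a translation, not the original analysis code; recent SciPy result objects expose a .pvalue attribute):

    from scipy import stats

    def compare(single_scores, ensemble_scores, alpha=0.05):
        # single_scores / ensemble_scores: the 10 per-fold metric values.
        normal = (stats.shapiro(single_scores).pvalue > alpha and
                  stats.shapiro(ensemble_scores).pvalue > alpha)
        if normal:
            return "paired t-test", stats.ttest_rel(single_scores, ensemble_scores).pvalue
        return "Mann-Whitney U", stats.mannwhitneyu(single_scores, ensemble_scores).pvalue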
The agreement between the class predicted by the algorithms and the final dataset is summarized in Table 5. In most misprediction cases, the algorithms predicted the adjacent class: C1 was most often incorrectly predicted as C2, followed by C2 as C1, C0 as C1, and C1 as C0, for both the best single CNN model (S_C_IRFC1) and the ensemble method. In a comparison of the two algorithms, the ensemble method had higher proportions in all correct prediction cases and lower proportions in misprediction cases, with the single exception of C2 incorrectly predicted as C0 (0.2% in the best single CNN model and 0.4% in the ensemble method).
Table 4. Comparison of the AUROC between the best single CNN model and the ensemble method according to the glaucoma stages.

Class (class code)           Group                          Mean (SD)        95% CI (minimum to maximum)    Shapiro-Wilk normality test (P)    P
Unaffected control (C0)      Best single CNN (S_C_IRFC1)    0.980 (0.010)    0.972–0.987 (0.958–0.994)      0.043                              0.014*
                             Ensemble method                0.990 (0.006)    0.985–0.994 (0.983–1.000)
Early-stage glaucoma (C1)    Best single CNN (A_C_IRFC1)    0.905 (0.023)    0.888–0.922 (0.869–0.939)      0.966                              <0.001**
                             Ensemble method                0.951 (0.019)    0.937–0.965 (0.920–0.989)
Late-stage glaucoma (C2)     Best single CNN (A_C_IRFC1)    0.948 (0.022)    0.932–0.965 (0.901–0.975)      0.313                              <0.001**
                             Ensemble method                0.970 (0.018)    0.956–0.984 (0.942–0.992)

* Mann-Whitney U test. ** Paired t-test. AUROC = area under the receiver operating characteristic curve.
Figure 4. Receiver operating characteristic curves for the best single CNN model and the ensemble method according to the glaucoma stages. The red and blue lines indicate the receiver operating characteristic curves for the best single CNN model and the ensemble method in the unaffected controls (a), early-stage glaucoma (b), and late-stage glaucoma (c). The ensemble method achieved a significantly higher area under the receiver operating characteristic curve than the baseline model in all glaucoma stages, especially in early-stage glaucoma (b).
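The per-stage curves in Figure 4 are one-vs-rest ROC curves. A minimal sketch of how such a curve and its AUROC can be computed for one class with scikit-learn (an assumption; the paper does not state which library produced its curves):

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    def class_roc(labels, ensemble_probs, k):
        # labels: true classes in {0, 1, 2}; ensemble_probs: shape (n, 3).
        # One-vs-rest: class k is "positive", everything else "negative".
        y_true = (np.asarray(labels) == k).astype(int)
        fpr, tpr, _ = roc_curve(y_true, ensemble_probs[:, k])
        return fpr, tpr, auc(fpr, tpr)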
Table 5. Proportion of the predicted class according to the algorithms compared to the final dataset.

                                                                Final dataset
Predicted class (SD)              Algorithm                      Unaffected control    Early-stage glaucoma    Late-stage glaucoma
Predicted unaffected control      Best single CNN (S_C_IRFC1)    31.2% (1.3)           2.0% (1.4)              0.2% (0.4)
                                  Ensemble method                32.0% (0.8)           1.0% (0.9)              0.4% (0.5)
Predicted early-stage glaucoma    Best single CNN (S_C_IRFC1)    2.7% (1.5)            26.8% (1.6)             3.8% (1.1)
                                  Ensemble method                2.4% (1.4)            28.0% (2.6)             2.9% (1.4)
Predicted late-stage glaucoma     Best single CNN (S_C_IRFC1)    0.9% (1.0)            5.2% (1.4)              27.3% (1.9)
                                  Ensemble method                0.4% (0.7)            4.8% (1.8)              28.1% (2.2)
Discussion

In this study, we investigated the performance of a newly developed deep learning ensemble method to classify glaucoma stages using fundus photographs and CNN. The ensemble method demonstrated significantly better performance and accuracy than the baseline model. With this approach, the AUROC for early-stage glaucoma (0.951) suggested promising potential for glaucoma screening in primary care.

Traditionally, the fundus photograph has been an essential tool for glaucoma evaluation because of its convenience, affordability, and proven clinical usefulness in documenting the nerve's appearance at a given time, allowing more detailed scrutiny then and later comparison for change.6 Glaucoma screening of the general population using fundus photography has not been recommended,23,24 in part because the optic nerve head shows inter-individual variability and because the detection of structural change at its early stages usually depends on subjective interpretation.25,26 However, computer-aided diagnosis of fundus images, which can overcome this inter- and intra-observer variability, has shown promise in the diagnosis of glaucoma. Raghavendra et al.27 achieved the highest accuracy of 98.13% using only 18 layers of CNN.
Rogers et al.28 evaluated the performance of a deep learning-based artificial intelligence software for the detection of glaucoma from stereoscopic optic disc photographs in the European Optic Disc Assessment Study, and the system obtained a diagnostic performance and repeatability comparable to that of a large cohort of ophthalmologists and optometrists. Shibata et al.16 validated the diagnostic ability of a deep residual learning algorithm in highly myopic eyes, in which the detection of glaucoma is a challenging task because of their morphological difference from non-highly myopic eyes. Kim et al.29 developed a publicly available prototype web application for computer-aided diagnosis and localization of glaucoma in fundus images, integrating their predictive model.12–19,27–29

Although most recent studies have suggested considerable potential in this field, the various stages and structure-function correlations of glaucoma have received little attention. The results of our study agree with those of earlier investigations, with an accuracy of 83.4–98.1%13,17,27–29 and an AUROC of 0.887–0.996;12,14–19 moreover, this study enhanced the research by applying a third classification grade to the glaucoma severity based on functional tests. That third level of diagnostics can provide primary care with greater detail at an earlier stage, improving disease management, reducing the chances of blindness, and ultimately reducing the overall medical costs to the patient. Binary classification, such as normal versus glaucoma suspect or normal versus glaucoma, is not suitable for a glaucoma screening test, since the disease is irreversible and shows different structural changes at the early and advanced stages. Even though the current study adopted a ternary (C0, C1, C2) approach to classify the severity of glaucoma, the performance (averaged AUROC, 0.975) was equal or superior to the results of previous studies that adopted binary classification.12,13,15,16,18,19 For a screening tool, the fatal false negatives are the least-adjacent mispredictions, and even those were fewer in the ensemble method than in the best single CNN model in the current study (Table 5). The AUROC of C1 (0.951) in our study may have particular implications for combining deep learning techniques and fundus photographs in a glaucoma screening test.

Weak coordination between structure and function is another limitation of previous studies. Clinical data are often labeled by focusing only on structural tests, including fundus photographs12,15–19,27–29 and OCT scans,13,19 although glaucoma is a chronic progressive optic neuropathy with corresponding glaucomatous VF defects. In addition, the results of the current study were not inferior to an attempt to use deep learning for the analysis of a functional test (AUROC 0.926), in which preperimetric glaucomatous VFs could be distinguished from normal controls.30 Datasets reviewed through the combination of fundus photographs, the most accessible test, and the Humphrey VF test, a functional test mainly used for the diagnosis of glaucoma and grading of its stages, will enhance the performance of a deep learning model.

The second main feature of this study is the use of the ensemble method. The majority of previous studies used one CNN model.12,15–19,27 The experiments of this study confirmed that the performance of individual CNN models such as InceptionNet-V3 and InceptionResNet-V2 was not significantly different (Supplementary Tables S1-S5). On the other hand, the results obtained by assembling the learning of multiple CNN models, diversifying the conditions and characteristics of model learning, were found to be more advanced than when using only one CNN model in all aspects of bias and variance of the performance evaluation results (Figure 3 and Supplementary Figs. S3-S7). In fact, the single CNN model that used all filters showed a higher average AUROC and higher AUROC of C1 and C2 than the single-filter models, but was not superior to the ensemble method (Supplementary Tables S1-S5). By verifying each model that used the processed fundus photographs, we found that the readings of some fundus photographs differed from model to model. We hypothesized that these readings could be improved by combining several diverse CNN models, and that this constitutes the advantage of the ensemble method despite its architectural complexity.

Christopher et al.14 recently published their results on the performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. They stratified glaucomatous optic neuropathy by the degree of functional loss into two groups: a mild group with a VF mean deviation (MD) better than or equal to –6 dB and a moderate-to-severe group with a VF MD worse than –6 dB. Their deep learning model achieved an AUROC of 0.89 in identifying glaucomatous optic neuropathy with mild functional loss. It is difficult to directly compare the diagnostic performance of the present CNN algorithm with that of Christopher et al. because their dataset contains a greater number of fundus photographs (n = 14,822) from a more racially and ethnically diverse group of individuals than the current study. However, our ensemble method may help account for the better diagnostic accuracy in identifying mild-stage glaucoma (C1).

The third major feature of this study is the superior quality of the fundus photographs used in the CNN model. Li et al.15 used a dataset containing approximately 40,000 fundus photographs to identify glaucomatous optic neuropathy, and their AUROC was 0.986. Interestingly, the proposed ensemble method, with less than 10% of that dataset size, achieved an AUROC of 0.990 in distinguishing unaffected controls (C0) from glaucoma cases. As stated previously, classifying the stage of glaucoma was conducted by reviewing the fundus photographs together with reliable VF test data, and the final dataset was decided unanimously by the glaucoma specialists. Indeed, the model using the dataset after the cross-validation revealed excellent performance compared with before the cross-validation, even though the total number of fundus images decreased from 4,445 to 3,460 (77.8%) after the cross-validation. Although the detailed data are not shown, this was probably because ambiguous cases in the assessment of glaucoma, such as retinal changes due to high myopia and fundus photographs unrelated to the VF tests (and vice versa), were excluded after the cross-validation.

Nonetheless, this study has some limitations that need to be considered. First, the findings were obtained from highly population-specific (Korean) subjects. Furthermore, the good classification performance using the entire area of the fundus images, not limited to the optic disc area, may be related to the fact that RNFL defects are much easier to identify in Asians, who have more pigment in the retinal pigment epithelium layer, than in Caucasians.31 However, additional data acquisition and verification in different racial groups will be needed for a more generalizable model.
Second, it is necessary to investigate CNN models that can classify the grade of glaucoma in more detail. In particular, patients with glaucoma may have many of the comorbid conditions in our exclusion criteria, such as high myopia and discrepancy between structural and functional tests. Although it may be difficult to address fundamentally, further research on these challenging cases is expected to provide more information and generalizability to help in real clinical practice. Third, extra studies comparing the performance of the deep learning ensemble system against a panel of practitioners who read fundus photographs, including glaucoma specialists, general ophthalmologists, residents in ophthalmology, and non-ophthalmological physicians, may clarify its necessity and clinical effectiveness in primary care. Fourth, in order to collect enough fundus images, the cut-off value for the false-positive rate (33%) in this study was higher than the standard cut-off value (20%) in other studies. The proportions of false-positive rates exceeding 20% were higher in the unaffected control and preperimetric glaucoma groups, which may be related to the retrospective study design. Finally, additional studies on the image processing and on the optimization of the ensemble method, considering whether to average or to add weighted values per individual model, are needed to enhance the performance.

In conclusion, this study demonstrated a newly developed deep learning ensemble method and confirmed the possibility of classifying the severity of glaucoma using fundus photographs. It is suggested that the key to high performance may be improving the quality of the dataset and combining multiple CNN models. The CNN ensemble method proposed in this study can be used as a tool for a clinical decision support system to screen the early stages and to monitor the progression of glaucoma.
Competing interests

The authors declare that they have no competing interests with the contents of this article.

Funding

This work was supported by an Institute for Information & Communications Technology Planning & Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00046, "Basic Technology for Extracting High-Level Information from Multiple Sources Data based on Intelligent Analysis"). The funding organization had no role in the design or conduct of this research; Institute for Information & Communications Technology Planning & Promotion (IITP) grant funded by the Korea government (MSIT) [2017-0-00046].

ORCID

Jae Keun Chung http://orcid.org/0000-0003-2968-834X
Jae Hoon Jeong http://orcid.org/0000-0001-7311-7080

References

1. Tham YC, Li X, Wong TY, Quigley HA, Aung T, Cheng CY. Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology. 2014;121(11):2081–90. doi:10.1016/j.ophtha.2014.05.013.
2. Bourne RR, Taylor HR, Flaxman SR, Keeffe J, Leasher J, Naidoo K, Pesudovs K, White RA, Wong TY, Resnikoff S, et al. Number of people blind or visually impaired by glaucoma worldwide and in world regions 1990-2010: a meta-analysis. PLoS One. 2016;11(10):e0162229. doi:10.1371/journal.pone.0162229.
3. Varma R, Lee PP, Goldberg I, Kotak S. An assessment of the health and economic burdens of glaucoma. Am J Ophthalmol. 2011;152(4):515–22. doi:10.1016/j.ajo.2011.06.004.
4. Lee PP, Walt JG, Doyle JJ, Kotak SV, Evans SJ, Budenz DL, Chen PP, Coleman AL, Feldman RM, Jampel HD, et al. A multicenter, retrospective pilot study of resource use and costs associated with severity of disease in glaucoma. Arch Ophthalmol. 2006;124(1):12–19. doi:10.1001/archopht.124.1.12.
5. Gupta P, Zhao D, Guallar E, Ko F, Boland MV, Friedman DS. Prevalence of glaucoma in the United States: the 2005-2008 National Health and Nutrition Examination Survey. Invest Ophthalmol Vis Sci. 2016;57(6):2905–13. doi:10.1167/iovs.15-18469.
6. Myers JS, Fudemberg SJ, Lee D. Evolution of optic nerve photography for glaucoma screening: a review. Clin Exp Ophthalmol. 2018;46(2):169–76. doi:10.1111/ceo.13138.
7. Chauhan BC, Garway-Heath DF, Goni FJ, Rossetti L, Bengtsson B, Viswanathan AC, Heijl A. Practical recommendations for measuring rates of visual field change in glaucoma. Br J Ophthalmol. 2008;92(4):569–73. doi:10.1136/bjo.2007.135012.
8. Rountree L, Mulholland PJ, Anderson RS, Garway-Heath DF, Morgan JE, Redmond T. Optimising the glaucoma signal/noise ratio by mapping changes in spatial summation with area-modulated perimetric stimuli. Sci Rep. 2018;8(1):2172. doi:10.1038/s41598-018-20480-4.
9. Bizios D, Heijl A, Bengtsson B. Integration and fusion of standard automated perimetry and optical coherence tomography data for improved automated glaucoma diagnostics. BMC Ophthalmol. 2011;11:20. doi:10.1186/1471-2415-11-20.
10. Russell RA, Malik R, Chauhan BC, Crabb DP, Garway-Heath DF. Improved estimates of visual field progression using Bayesian linear regression to integrate structural information in patients with ocular hypertension. Invest Ophthalmol Vis Sci. 2012;53(6):2760–69. doi:10.1167/iovs.11-7976.
11. Malik R, Swanson WH, Garway-Heath DF. 'Structure-function relationship' in glaucoma: past thinking and current concepts. Clin Exp Ophthalmol. 2012;40(4):369–80. doi:10.1111/j.1442-9071.2012.02770.x.
12. Xiangyu C, Yanwu X, Damon Wing Kee W, Tien Yin W, Jiang L. Glaucoma detection based on deep convolutional neural network. Conf Proc IEEE Eng Med Biol Soc. 2015;2015:715–18.
13. Muhammad H, Fuchs TJ, De Cuir N, De Moraes CG, Blumberg DM, Liebmann JM, Ritch R, Hood DC. Hybrid deep learning on single wide-field optical coherence tomography scans accurately classifies glaucoma suspects. J Glaucoma. 2017;26(12):1086–94. doi:10.1097/IJG.0000000000000765.
14. Christopher M, Belghith A, Bowd C, Proudfoot JA, Goldbaum MH, Weinreb RN, Girkin CA, Liebmann JM, Zangwill LM. Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. Sci Rep. 2018;8(1):16685. doi:10.1038/s41598-018-35044-9.
15. Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125(8):1199–206. doi:10.1016/j.ophtha.2018.01.023.
16. Shibata N, Tanito M, Mitsuhashi K, Fujino Y, Matsuura M, Murata H, Asaoka R. Development of a deep residual learning algorithm to screen for glaucoma from fundus photography. Sci Rep. 2018;8(1):14665. doi:10.1038/s41598-018-33013-w.
17. Liu S, Graham SL, Schulz A, Kalloniatis M, Zangerl B, Cai W, Gao Y, Chua B, Arvind H, Grigg J. A deep learning-based algorithm identifies glaucomatous discs using monoscopic fundus photographs. Ophthalmol Glaucoma. 2018;1(1):15–22. doi:10.1016/j.ogla.2018.04.002.
18. Liu H, Li L, Wormstone IM, Qiao C, Zhang C, Liu P, Li S, Wang H, Mou D, Pang R. Development and validation of a deep learning system to detect glaucomatous optic neuropathy using fundus photographs. JAMA Ophthalmol. 2019;137(12):1353. doi:10.1001/jamaophthalmol.2019.3501.
19. An G, Omodaka K, Hashimoto K, Tsuda S, Shiga Y, Takada N, Kikawa T, Yokota H, Akiba M, Nakazawa T. Glaucoma diagnosis with machine learning based on optical coherence tomography and color fundus images. J Healthc Eng. 2019;2019:4061313. doi:10.1155/2019/4061313.
20. Hodapp E, Parrish RK, Anderson DR. Clinical decisions in glaucoma. St. Louis (MO): Mosby; 1993.
21. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. Paper presented at: CVPR 2016. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas (NV); 2016 Jun 27–30.
22. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the impact of residual connections on learning. Paper presented at: AAAI-17. Proceedings of the AAAI Conference on Artificial Intelligence; San Francisco (CA); 2017 Feb 4–9.
23. Moyer VA, U.S. Preventive Services Task Force. Screening for glaucoma: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2013;159(7):484–89. doi:10.7326/0003-4819-159-6-201309170-00686.
24. Pizzi LT, Waisbourd M, Hark L, Sembhi H, Lee P, Crews JE, Saaddine JB, Steele D, Katz LJ. Costs of a community-based glaucoma detection programme: analysis of the Philadelphia Glaucoma Detection and Treatment Project. Br J Ophthalmol. 2018;102(2):225–32. doi:10.1136/bjophthalmol-2016-310078.
25. Gaasterland DE, Blackwell B, Dally LG, Caprioli J, Katz LJ, Ederer F; Advanced Glaucoma Intervention Study Investigators. The Advanced Glaucoma Intervention Study (AGIS): 10. Variability among academic glaucoma subspecialists in assessing optic disc notching. Trans Am Ophthalmol Soc. 2001;99:177–84; discussion 184–85.
26. Jampel HD, Friedman D, Quigley H, Vitale S, Miller R, Knezevich F, Ding Y. Agreement among glaucoma specialists in assessing progressive disc changes from photographs in open-angle glaucoma patients. Am J Ophthalmol. 2009;147(1):39–44.e1. doi:10.1016/j.ajo.2008.07.023.
27. Raghavendra U, Fujita H, Bhandary SV, Gudigar A, Tan JH, Acharya UR. Deep convolution neural network for accurate diagnosis of glaucoma using digital fundus images. Inf Sci (Ny). 2018;441:41–49. doi:10.1016/j.ins.2018.01.051.
28. Rogers TW, Jaccard N, Carbonaro F, Lemij HG, Vermeer KA, Reus NJ, Trikha S. Evaluation of an AI system for the automated detection of glaucoma from stereoscopic optic disc photographs: the European Optic Disc Assessment Study. Eye (Lond). 2019;33(11):1791–97. doi:10.1038/s41433-019-0510-3.
29. Kim M, Han JC, Hyun SH, Janssens O, Van Hoecke S, Kee C, De Neve W. Medinoid: computer-aided diagnosis and localization of glaucoma using deep learning. Appl Sci. 2019;9(15):3064. doi:10.3390/app9153064.
30. Asaoka R, Murata H, Iwase A, Araie M. Detecting preperimetric glaucoma with standard automated perimetry using a deep learning classifier. Ophthalmology. 2016;123(9):1974–80. doi:10.1016/j.ophtha.2016.05.029.
31. Jonas JB, Dichtl A. Evaluation of the retinal nerve fiber layer. Surv Ophthalmol. 1996;40(5):369–78. doi:10.1016/S0039-6257(96)80065-8.