
Current Eye Research

ISSN: (Print) (Online) Journal homepage: www.tandfonline.com/journals/icey20

Deep Learning Ensemble Method for Classifying Glaucoma Stages Using Fundus Photographs and Convolutional Neural Networks

Hyeonsung Cho, Young Hoon Hwang, Jae Keun Chung, Kwan Bok Lee, Ji Sang Park, Hong-Gee Kim & Jae Hoon Jeong

To cite this article: Hyeonsung Cho, Young Hoon Hwang, Jae Keun Chung, Kwan Bok Lee, Ji Sang Park, Hong-Gee Kim & Jae Hoon Jeong (2021) Deep Learning Ensemble Method for Classifying Glaucoma Stages Using Fundus Photographs and Convolutional Neural Networks, Current Eye Research, 46:10, 1516-1524, DOI: 10.1080/02713683.2021.1900268

To link to this article: https://doi.org/10.1080/02713683.2021.1900268

© 2021 The Author(s). Published with license by Taylor & Francis Group, LLC.

Published online: 06 Apr 2021.
CURRENT EYE RESEARCH
2021, VOL. 46, NO. 10, 1516–1524
https://doi.org/10.1080/02713683.2021.1900268

Deep Learning Ensemble Method for Classifying Glaucoma Stages Using Fundus Photographs and Convolutional Neural Networks

Hyeonsung Cho (a), Young Hoon Hwang (b), Jae Keun Chung (b), Kwan Bok Lee (b), Ji Sang Park (a), Hong-Gee Kim (c), and Jae Hoon Jeong (d)

(a) Intelligence and Robot System Research Group, Electronics & Telecommunication Research Institute, Daejeon, Republic of Korea; (b) Department of Ophthalmology, Chungnam National University Hospital, Daejeon, Republic of Korea; (c) Biomedical Knowledge Engineering Laboratory, Seoul National University, Seoul, Republic of Korea; (d) Department of Ophthalmology, Konyang University Hospital, Konyang University College of Medicine, Daejeon, Republic of Korea

ABSTRACT

Purpose: This study developed and evaluated a deep learning ensemble method to automatically grade the stages of glaucoma depending on its severity.

Materials and Methods: After cross-validation by three glaucoma specialists, the final dataset comprised 3,460 fundus photographs taken from 2,204 patients, divided into three classes: unaffected controls, early-stage glaucoma, and late-stage glaucoma. The mean deviation value of standard automated perimetry was used to classify the glaucoma cases. We modeled 56 convolutional neural networks (CNN) with different characteristics and developed an ensemble system to derive the best performance by combining several modeling results.

Results: The proposed method, with an accuracy of 88.1% and an average area under the receiver operating characteristic of 0.975, demonstrates significantly better performance in classifying glaucoma stages than the best single CNN model, which has an accuracy of 85.2% and an average area under the receiver operating characteristic of 0.950. As a screening tool, false negatives are the most critical mispredictions, and they occurred less often with the proposed method than with the best single CNN model.

Conclusions: The method of averaging multiple CNN models can classify glaucoma stages from fundus photographs better than a single CNN model. The ensemble method would be useful as a clinical decision support system in glaucoma screening for primary care because it provides high and stable performance with a relatively small amount of data.

ARTICLE HISTORY
Received 19 June 2020; Revised 9 February 2021; Accepted 21 February 2021

KEYWORDS
Artificial intelligence; deep learning; diagnostic imaging; glaucoma; neural networks models

Introduction

Glaucoma, one of the leading causes of blindness, is found in approximately 3.54% of the global adult population, or approximately 64.3 million people, and this was expected to have increased to 76 million by 2020.1 According to a global report in 2010, glaucoma may be related to blindness in 2.1 million people and the severe loss of visual acuity in 4.2 million people.2 From an economic viewpoint, the disease results in substantial financial costs for both individuals and society, and these burdens increase as disease severity increases.3

Due to its chronic and irreversible nature, early detection of glaucoma is important, so that early management can slow the progression. Treatment outcomes are relatively good in the early stages, whereas advanced glaucoma often has a poor prognosis.4 Glaucoma screening has its limitations, including its cost and the fact that most patients lack motivating subjective symptoms until later stages of the disease.5 The cost factor comes from the need for advanced expertise and experience in reading the relatively inexpensive and accessible fundus photographs, or from the expensive optical coherence tomography (OCT) and standard automated perimetry.

There are fundamental shortcomings of fundus photography and standard automated perimetry in glaucoma screening beyond the economic aspects. The interpretation of disc photographs is inherently subjective because of the broad range of normal optic nerve appearance and its overlap with pathological findings.6 Furthermore, the major difficulty in detecting glaucoma, classifying its stage, and identifying progression of disease comes from the high variability and low disease signal in standard automated perimetry,7,8 and there have been several attempts to integrate retinal structure and visual function.9,10 Although combining structural and functional assessments has been shown to provide improved sensitivity and specificity over either modality alone,11 it is impossible to undertake as many tests as clinicians would like within a reasonable period of time in glaucoma screening. To overcome these limitations, several studies suggest that deep learning algorithms based on clinical image data show potential for use in early screening.12–17

Since Xiangyu et al.12 demonstrated a method to classify glaucoma and normal groups automatically by combining fundus photographs and convolutional neural networks (CNN), review of fundus photographs using CNN, one of the deep learning technologies prominent in the image pattern recognition field, could be useful in glaucoma screening.

CONTACT Jae Hoon Jeong [email protected] Department of Ophthalmology, Konyang University Hospital 158, Gwanjeodong-ro, Seo-gu, Daejeon (35365),
Republic of Korea.
Supplemental data for this article can be accessed on the publisher’s website.
© 2021 The Author(s). Published with license by Taylor & Francis Group, LLC.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/),
which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.

Liu et al.18 have established a large-scale database of fundus images (241,032) for glaucoma diagnosis and developed glaucoma diagnosis from these fundus images with CNN, so that an advanced deep learning system could be used in different settings with images of varying quality, patient ethnicity, and population sources. An et al.19 have built a machine learning classification model that combines the information of color fundus images and OCT data to classify glaucomatous and healthy eyes, and this system should help to improve the diagnostic accuracy of detecting early glaucoma.

However, most previous studies have focused their deep learning techniques on either fundus photographs12,15–19 or OCT scans13,19 and presented their results in terms of whether or not glaucoma was present, omitting the stage of the disease. The assessment of glaucoma needs to include the structural and functional changes in the eye as the disease progresses. The purpose of this study is to propose and evaluate the performance of a new cost-effective glaucoma screening test for primary care using a deep learning ensemble method with fundus photographs and CNN, considering the various stages and structure-function correlations of the disease.

Materials and methods

This study employed a retrospective case-control design. Subjects from the Glaucoma Clinic of Konyang University Hospital and Kim's Eye Hospital were enrolled between March 2016 and June 2018. The study followed the tenets of the World Medical Association's Declaration of Helsinki. The Institutional Review Boards of the Konyang University Hospital and Kim's Eye Hospital reviewed and approved the study protocol and exempted informed consent for this study.

The fundus images were acquired by color imaging with a digital fundus camera (Nonmyd 7, Kowa Optimed, Tokyo, Japan) without pupil dilation. Glaucomatous structural changes were defined as images with any of the following conditions: enlargement of the cup-to-disc ratio to 0.7 or greater, cup-to-disc ratio asymmetry of >0.2 between fellow eyes, neuroretinal rim thinning, notching or excavation, disc hemorrhages, and RNFL defects in red-free images with edges present at the optic nerve head margin. Subjects with the following conditions were excluded from this study: astigmatism with cylinder correction < – 3.0 D or > + 3.0 D; poor-quality fundus images that could interfere with glaucoma evaluation, such as media opacities and motion artifacts; other optic neuropathies induced by inflammatory, ischemic, compressive, and hereditary factors; and other retinal pathologies such as retinal detachment, age-related macular degeneration, myopic chorioretinal atrophy, diabetic retinopathy, macular hole, retinal vascular obstruction, and epiretinal membrane.

Standard automated perimetry using the Swedish interactive thresholding algorithm (SITA-Standard) of central 24–2 perimetry (Humphrey Field Analyzer II, Carl Zeiss Meditec, Dublin, CA, USA) was performed for each subject with selected fundus photographs. A visual field (VF) was considered to be reliable when the fixation loss was less than 20% and the false positive rate was less than 33%. Only reliable VF data were included in the analysis, and tests were conducted on fundus photographs that were less than six months old. A glaucomatous VF defect was defined as a cluster of three or more adjacent points in the pattern deviation chart that have a deviation with less than 5% probability of being due to chance, with at least one point less than 1%, or a pattern standard deviation index of less than 5%.

Cross-sectional data of each eye from 2,801 subjects, including all fundus photographs (1 to 13) and single field analyses of VF tests (1 to 7) between March 2016 and June 2018, were distributed to four glaucoma specialists (Korean Glaucoma Society members). Only one fundus image per eye, a total of 4,445 fundus photographs, was selected as compatible with the ophthalmologic criteria of the CNN model. Only reliable VF data obtained within 6 months of the selected fundus photograph were used, and data were split into folds at the eye level. In cases wherein both eyes of a glaucoma or unaffected control subject were eligible for the study, data from both eyes were chosen for inclusion. Based on the results of the structural and VF testing, the fundus photographs were labeled preliminarily according to the following five stages: unaffected control, preperimetric, mild, moderate, and severe glaucoma. Unaffected controls did not have any glaucomatous structural change or any VF defect. The preperimetric grade was defined as a definite structural glaucomatous change without any VF defects, and perimetric glaucoma was defined as a definite structural glaucomatous change with a corresponding VF defect. Perimetric glaucoma was graded on the MD value from the VF testing according to the Hodapp-Parrish-Anderson classification system.20 The mild group had an MD value greater than or equal to – 6 dB, the moderate group had an MD value of – 6 to – 12 dB, and the severe group had an MD value of less than – 12 dB.

Cross-validation of the preliminary label classifications was performed to maximize the efficiency of the CNN models. The pairing of fundus photograph and VF test results for each eye was reviewed by three other glaucoma specialists who did not participate in the preliminary labeling. Each specialist labeled fundus photographs according to the previous structural and functional criteria without seeing the results of the preliminary label classification. The results of the cross-validation were added to the data collection. Finally, each photograph upon which all glaucoma specialists agreed was included in a final dataset. The resolution of the final dataset averaged 2,270 pixels (SD, 391; 95% CI, 2,257–2,283) in height and 3,412 pixels (SD, 596; 95% CI, 2,144–3,432) in width.

The flow of the image processing in the current study is shown in Figure 1. Image preprocessing was undertaken to clean up marks on the photographs that do not affect the reading, such as words, patient numbers, and the black area around the edges of the photograph. The red-free channel was extracted from the original color fundus image to obtain a high-resolution image of the RNFL. Data augmentation was performed to reduce overfitting and to maximize the training effect of the CNN models. The preprocessed images were rotated (90°, 180°, 270°) and enlarged by 25% centered on the mid-points of the original image. Finally, the image resolution of each photograph was converted to 299 × 299 × 3 (R × G × B) for input into the CNN architecture after filtering and resizing of the image. To optimize the parameters of the CNN architectures, the processed fundus photographs were passed through various image filters (bilateral, Gaussian, histogram equalization, median, and sharpening), and the results were used as input.
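As an illustration, the pipeline just described can be sketched in Python with OpenCV (the library the study used for image processing). This is a hedged sketch, not the authors' code: the filter kernel sizes, the sharpening kernel, and the bilateral-filter parameters are assumptions, since the paper does not report them.

import cv2
import numpy as np

def red_free(img_bgr):
    # Red-free view: the green channel, which renders the RNFL with the most contrast.
    return img_bgr[:, :, 1]

def augment(img):
    # Rotations by 90/180/270 degrees plus a 25% center enlargement, per the text.
    h, w = img.shape[:2]
    views = [
        img,
        cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE),
        cv2.rotate(img, cv2.ROTATE_180),
        cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE),
    ]
    ch, cw = int(h / 1.25), int(w / 1.25)      # crop the center, then resize = 25% zoom
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    views.append(cv2.resize(img[y0:y0 + ch, x0:x0 + cw], (w, h)))
    return views

# The five filters named in the text plus a bypass; parameters are illustrative guesses.
SHARPEN_KERNEL = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
FILTERS = {
    "bypass":    lambda im: im,
    "bilateral": lambda im: cv2.bilateralFilter(im, 9, 75, 75),
    "gaussian":  lambda im: cv2.GaussianBlur(im, (5, 5), 0),
    "hist_eq":   lambda im: cv2.equalizeHist(im),   # expects a single-channel (red-free) image
    "median":    lambda im: cv2.medianBlur(im, 5),
    "sharpen":   lambda im: cv2.filter2D(im, -1, SHARPEN_KERNEL),
}

def to_network_input(im, size=299):
    # Resize to 299 x 299 and stack to three channels for the CNN input layer.
    im = cv2.resize(im, (size, size))
    if im.ndim == 2:
        im = cv2.cvtColor(im, cv2.COLOR_GRAY2BGR)
    return im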

Figure 1. An overview of the image processing flow in the current study. The final image for the convolutional neural network architecture was prepared for each photograph by image preprocessing, red-free channel extraction, data augmentation, filtering, and resizing.

Figure 2 shows the total of 56 CNN models, comprising combinations of two color types of the fundus photograph (original and red-free images), seven types of image filters (including a bypass and a filter combining all the rest), and four types of CNN architectures. To increase the diversity of the CNN models, we simulated architectures with one or three fully connected layers converted from InceptionNet-V3 (ref. 21) (ICFC1 or ICFC3) and InceptionResNet-V2 (ref. 22) (IRFC1 or IRFC3). The final output stage of each of the four types of CNN architectures was composed of a softmax layer with three output nodes. The details of the CNN architectures used in this study can be found in Supplementary Fig. S1. A graphics processing unit supporting TensorFlow 1.8, CUDA 9.0, and 5,120 CUDA cores was used to train the 56 CNN models. The computer language used for system development was Python version 3.5; OpenCV version 3.1 was used for the image processing of the fundus photographs. As this study applied 10-fold cross-validation, 90% of the whole dataset was used for training the CNN models and the rest was allocated for validation.

The final decision on the grading of the fundus photographs was made by averaging the per-class probabilities output by the 56 CNN models; the class with the highest probability was selected as the grade. The ensembled output for each class is calculated using the equation:

$$P_F\left(C_{k=0,1,2}\right) = \left(\sum_{i=1}^{N} P_s^{(i)}\left(C_{k=0,1,2}\right)\right) \Big/\, N$$

where $P_F(C_k)$ is the final probability of $C_k$, $k$ is the class identifier, $P_s^{(i)}(C_k)$ is the output probability of $C_k$ from a single CNN model $i$, and $N$ is the number of models used in the ensemble method.

The accuracy and the area under the receiver operating characteristic (AUROC) were used to compare the diagnostic performance between the best single CNN model, which showed the best performance out of the 56 models, and the ensemble method, and the AUROC for each of the three classes (C0, C1, and C2) was evaluated as a performance index for classifying the stage of glaucoma.

The performance of the best single CNN model and the system combination of 56 CNN models were assessed and compared using the Shapiro-Wilk test and the paired t-test. Because the algorithms were run a total of 10 times to evaluate the performance using 10-fold cross-validation, and the test number was thus less than 30, the Shapiro-Wilk test was performed to verify the normality of the data distribution. If the data satisfied a normal distribution, the paired t-test was used to compare the model performance; otherwise, the Mann-Whitney U test was performed. Data were recorded and analyzed using R version 3.4.1 (R Foundation for Statistical Computing, Vienna, Austria), based upon a 5% probability of statistical significance.
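In code, the averaging rule above amounts to soft voting: average the softmax vectors of the individual models and take the argmax. A minimal numpy sketch (illustrative only, not the authors' implementation) is:

import numpy as np

def ensemble_grade(model_probs):
    # model_probs: array of shape (N, 3); row i is P_s^(i) over (C0, C1, C2).
    # Returns the averaged class probabilities P_F and the selected grade index.
    probs = np.asarray(model_probs, dtype=float)
    p_final = probs.mean(axis=0)   # P_F(C_k) = (1/N) * sum_i P_s^(i)(C_k)
    return p_final, int(np.argmax(p_final))

# Toy usage with three models; in the paper N = 56.
p_final, grade = ensemble_grade([[0.7, 0.2, 0.1],
                                 [0.5, 0.4, 0.1],
                                 [0.6, 0.3, 0.1]])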

Figure 2. Concept diagram of the ensemble method combining 56 convolutional neural network models. The combination of two color channels of the fundus photograph, seven types of image filters, and four types of CNN architectures resulted in a total of 56 CNN models. The probabilities of each model were averaged for the final decision on the grading of the fundus photographs. CNN: convolutional neural network; ICFC1: InceptionNet-V3 with one fully connected layer; ICFC3: InceptionNet-V3 with three fully connected layers; IRFC1: InceptionResNet-V2 with one fully connected layer; IRFC3: InceptionResNet-V2 with three fully connected layers.
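The 2 × 7 × 4 factorial design behind the 56 models can be written down directly. The sketch below simply enumerates the grid shown in Figure 2; the setting names are our shorthand, not identifiers from the paper.

from itertools import product

COLOR_CHANNELS = ["color", "red_free"]                     # 2 color types
FILTER_TYPES = ["bypass", "bilateral", "gaussian",
                "hist_eq", "median", "sharpen", "all"]     # 7 filter settings
ARCHITECTURES = ["ICFC1", "ICFC3", "IRFC1", "IRFC3"]       # 4 CNN variants

MODEL_GRID = list(product(COLOR_CHANNELS, FILTER_TYPES, ARCHITECTURES))
assert len(MODEL_GRID) == 56   # 2 x 7 x 4 model configurations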

Availability of materials and data

The datasets generated and/or analyzed during the current study are not publicly available, because the research employs a retrospective case-control design with a waiver of informed consent, but they are available from the corresponding author on reasonable request and with permission of the Institutional Review Board of the Konyang University Hospital. Some representative fundus images from the current study can be found in Supplementary Fig. S2.

Results

The final dataset consists of 3,460 fundus photographs from 2,204 subjects. The distribution and quantity per grade of the datasets before and after the data cross-validation are shown in Table 1. The number of images in some of the subgroups of the five glaucoma grades was insufficient to optimize the CNN models, so the final dataset was reclassified into three classes: unaffected controls (C0), early-stage glaucoma (the merged preperimetric and mild grades; C1), and late-stage glaucoma (C2). As described in Table 2, in each experiment the data were randomly sampled so that the training dataset and the validation dataset were equally distributed for each class, and the ratio of training to validation was 9:1.

Table 1. Demographics of the dataset.

Group                  | Before cross-validation | After cross-validation | Proportion (%) remaining
Unaffected control     | 1,848                   | 1,259                  | 68.1
Preperimetric glaucoma |   284                   |   185                  | 65.1
Mild glaucoma          | 1,045                   |   784                  | 75.0
Moderate glaucoma      |   570                   |   563                  | 98.8
Severe glaucoma        |   698                   |   669                  | 95.8
Total images           | 4,445                   | 3,460                  | 77.8
Number of patients     | 2,801                   | 2,204                  | 78.7

The performance evaluation results of all 56 CNN models and the ensemble method are documented in supplementary Tables S1-5, and box plots are shown in supplementary Fig. S3-7. The best single CNN model differed depending on the performance index and classification: the model that used a single InceptionResNet-V2 and the sharpening filter (S_C_IRFC1) was the best in accuracy and in AUROC of C0, and the model that used a single InceptionResNet-V2 and all filters (A_C_IRFC1) was the best in AUROC of average, C1, and C2. As shown in Table 3 and Figure 3, the ensemble of the 56 models had higher mean accuracy and AUROC, and lower variance of AUROC, compared to the best single model. The average accuracy and AUROC over all classes were 85.2% (95% CI, 83.5–86.9) and 0.950 (95% CI, 0.940–0.961) for the best single model and 88.1% (95% CI, 85.6–90.7) and 0.975 (95% CI, 0.967–0.984) for the ensemble method, respectively. The AUROC values and curves for the best single model and the ensemble method according to the glaucoma stages are reported in Table 4 and Figure 4; the AUROC of C1 increased remarkably, from 0.905 (95% CI, 0.888–0.922) in the best single model to 0.951 (95% CI, 0.937–0.965) in the ensemble method. The performance differences between the two methods were statistically significant (P < .05) in both accuracy and AUROC.
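The test-selection logic described in the Materials and Methods (Shapiro-Wilk first, then paired t-test or Mann-Whitney U) can be expressed as a short decision procedure. The sketch below uses scipy for illustration; the study itself used R 3.4.1, and applying the normality check to each group of fold scores separately is our assumption.

from scipy import stats

def compare_folds(scores_a, scores_b, alpha=0.05):
    # Shapiro-Wilk normality check on each set of 10 fold scores, then a
    # paired t-test if normality holds, otherwise the Mann-Whitney U test.
    _, p_a = stats.shapiro(scores_a)
    _, p_b = stats.shapiro(scores_b)
    if p_a > alpha and p_b > alpha:
        _, p = stats.ttest_rel(scores_a, scores_b)
        return "paired t-test", p
    _, p = stats.mannwhitneyu(scores_a, scores_b)
    return "Mann-Whitney U", p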

Table 2. Distribution of the final dataset.

Class Name           | Class Code | Subgroup               | No. in training set | Proportion in training dataset (%) | No. in validation set | Proportion in validation dataset (%)
Unaffected control   | C0         | Unaffected control     | 2,448 | 33.3  | 272 | 33.3
Early-stage glaucoma | C1         | Preperimetric glaucoma | 1,224 | 16.7  | 136 | 16.7
                     |            | Mild glaucoma          | 1,224 | 16.7  | 136 | 16.7
                     |            | Subtotal               | 2,448 | 33.3  | 272 | 33.3
Late-stage glaucoma  | C2         | Moderate glaucoma      | 1,224 | 16.7  | 136 | 16.7
                     |            | Severe glaucoma        | 1,224 | 16.7  | 136 | 16.7
                     |            | Subtotal               | 2,448 | 33.3  | 272 | 33.3
Total                |            |                        | 7,344 | 100.0 | 816 | 100.0
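The balanced 9:1 sampling behind Table 2 can be approximated with a per-class split. This is a hedged numpy sketch assuming simple uniform sampling within each class; the study additionally equalized the per-class counts, which this sketch does not reproduce.

import numpy as np

def stratified_split(labels, val_frac=0.1, seed=0):
    # Random 9:1 training/validation split computed class by class, so the two
    # sets keep the same class proportions (cf. Table 2).
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(round(len(idx) * val_frac))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)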

Table 3. Comparison of diagnostic performance between the best single CNN model and the ensemble method.

Metrics      | Group                       | Mean (SD)     | 95% CI (min to max)       | Shapiro-Wilk normality test (P) | Paired t-test (P)
Accuracy (%) | Best single CNN (S_C_IRFC1) | 85.2 (0.023)  | 83.5–86.9 (80.4–88.2)     | 0.888                           | 0.021
             | Ensemble method             | 88.1 (0.034)  | 85.6–90.7 (84.3–94.1)     |                                 |
AUROC        | Best single CNN (A_C_IRFC1) | 0.950 (0.014) | 0.940–0.961 (0.923–0.967) | 0.508                           | < 0.001
             | Ensemble method             | 0.975 (0.011) | 0.967–0.984 (0.958–0.994) |                                 |

AUROC = area under the receiver operating characteristic.

Figure 3. Comparison of diagnostic performance between the best single CNN model and the ensemble method. The red and green boxes indicate the best single CNN model in accuracy and average AUROC, respectively, and the blue boxes indicate the ensemble method.

For a demonstration of the variance in performance, receiver operating characteristic curves of individual folds are presented in supplementary Fig. S8-11.

The agreement between the class predicted by the algorithms and the final dataset is summarized in Table 5. In most misprediction cases, the algorithms predicted the adjacent class: C1 was most often incorrectly predicted as C2, followed by C2 as C1, C0 as C1, and C1 as C0, for both the best single CNN model (S_C_IRFC1) and the ensemble method. In a comparison of the two algorithms, the ensemble method had higher proportions in all correct prediction cases and lower proportions in misprediction cases, except for the single case in which C2 was incorrectly predicted as C0 (0.2% in the best single CNN model and 0.4% in the ensemble method).

Table 4. Comparison of the AUROC between the best single CNN model and the ensemble method according to the glaucoma stages.

Class (class code)         | Group                       | Mean (SD)     | 95% CI (min to max)       | Shapiro-Wilk normality test (P) | P
Unaffected control (C0)    | Best single CNN (S_C_IRFC1) | 0.980 (0.010) | 0.972–0.987 (0.958–0.994) | 0.043                           | 0.014*
                           | Ensemble method             | 0.990 (0.006) | 0.985–0.994 (0.983–1.000) |                                 |
Early-stage glaucoma (C1)  | Best single CNN (A_C_IRFC1) | 0.905 (0.023) | 0.888–0.922 (0.869–0.939) | 0.966                           | <0.001**
                           | Ensemble method             | 0.951 (0.019) | 0.937–0.965 (0.920–0.989) |                                 |
Late-stage glaucoma (C2)   | Best single CNN (A_C_IRFC1) | 0.948 (0.022) | 0.932–0.965 (0.901–0.975) | 0.313                           | <0.001**
                           | Ensemble method             | 0.970 (0.818) | 0.956–0.984 (0.942–0.992) |                                 |

* Mann-Whitney U test. ** Paired t-test.
AUROC = area under the receiver operating characteristic.

Figure 4. Receiver operating characteristic curves for the best single CNN model and the ensemble method according to the glaucoma stages. The red and blue lines indicate receiver operating characteristic curves for the best single CNN model and the ensemble method, in the unaffected controls (a), the early-stage glaucoma (b), and the late-stage glaucoma (c). The ensemble method achieved a significantly higher area under the receiver operating characteristic compared to the baseline model in all glaucoma stages, especially in the early-stage glaucoma (b).

Table 5. Proportion of the predicted class according to the algorithms compared to the final dataset.

                                |                             | Final dataset      |                      |
Predicted class (SD)            | Algorithm                   | Unaffected control | Early-stage glaucoma | Late-stage glaucoma
Predicted unaffected control    | Best single CNN (S_C_IRFC1) | 31.2% (1.3)        | 2.0% (1.4)           | 0.2% (0.4)
                                | Ensemble method             | 32.0% (0.8)        | 1.0% (0.9)           | 0.4% (0.5)
Predicted early-stage glaucoma  | Best single CNN (S_C_IRFC1) | 2.7% (1.5)         | 26.8% (1.6)          | 3.8% (1.1)
                                | Ensemble method             | 2.4% (1.4)         | 28.0% (2.6)          | 2.9% (1.4)
Predicted late-stage glaucoma   | Best single CNN (S_C_IRFC1) | 0.9% (1.0)         | 5.2% (1.4)           | 27.3% (1.9)
                                | Ensemble method             | 0.4% (0.7)         | 4.8% (1.8)           | 28.1% (2.2)
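Table 5 is, in effect, a confusion matrix normalized by the total sample count. A small illustrative sketch of how such a table is computed (not the authors' code):

import numpy as np

def confusion_proportions(y_true, y_pred, n_classes=3):
    # Joint proportion table in the style of Table 5: cell (p, t) is the share
    # of all samples whose true class is t and whose predicted class is p.
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    return counts / counts.sum()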

Discussion

In this study, we investigated the performance of a newly developed deep learning ensemble method to classify glaucoma stages using fundus photographs and CNN. The ensemble method demonstrated significantly better performance and accuracy than the baseline model. Using this approach, the AUROC value for early-stage glaucoma (0.951) suggests promising potential for glaucoma screening in primary care.

Traditionally, the fundus photograph has been an essential tool for glaucoma evaluation because of its convenience, affordability, and clinical impact; it has proven useful for documenting the nerve's appearance at a given time, allowing more detailed scrutiny then and later comparison for change.6 Glaucoma screening by fundus photography has not been recommended for the general population,23,24 in part because the optic nerve head has inter-individual variability and because the detection of structural change at its early stages usually depends on subjective interpretation.25,26 However, computer-aided diagnosis of fundus images has shown promise in the diagnosis of glaucoma and can overcome this inter- and intra-observer variability. Raghavendra et al.27 achieved the highest accuracy of 98.13% using only 18 layers of CNN.

Rogers et al.28 evaluated the performance of a deep learning-based artificial intelligence software for detection of glaucoma from stereoscopic optic disc photographs in the European Optic Disc Assessment Study, and the system obtained a diagnostic performance and repeatability comparable to that of a large cohort of ophthalmologists and optometrists. Shibata et al.16 validated the diagnostic ability of a deep residual learning algorithm in highly myopic eyes, in which detecting glaucoma is a challenging task because of the morphological differences from non-highly myopic eyes. Kim et al.29 developed a publicly available prototype web application for computer-aided diagnosis and localization of glaucoma in fundus images, integrating their predictive model.12–19,27–29

Although most recent studies have suggested numerous potentials and visions for this field, the various stages and structure-function correlations of glaucoma have received little attention. The results of our study agree with those found in earlier investigations, with an accuracy of 83.4–98.1% 13,17,27–29 and an AUROC of 0.887–0.996;12,14–19 moreover, this study enhanced the research by applying a third classification grade to glaucoma severity based on functional tests. That third level of diagnostics can provide primary care with greater detail at an earlier stage, improving disease management, reducing the chances of blindness, and ultimately reducing the overall medical costs to the patient. Binary classification, such as normal versus glaucoma suspect or normal versus glaucoma, is not suitable for a glaucoma screening test, since the disease is irreversible and shows different structural changes at the early and advanced stages. Even though the current study adopted a ternary (C0, C1, C2) approach to classify the severity of glaucoma, the performance (averaged AUROC, 0.975) was equal or superior to the results of previous studies that adopted binary classification.12,13,15,16,18,19 For a screening tool, false negatives are the most critical of the adjacent mispredictions, and even these are fewer in the ensemble method than in the best single CNN model in the current study (Table 5). The AUROC of C1 (0.951) in our study may have particular implications for the combination of deep learning techniques and fundus photographs in glaucoma screening tests.

Weak coordination between structure and function is another limitation of previous studies. Clinical data are often labeled by focusing only on structural tests, including fundus photographs12,15–19,27–29 and OCT scans,13,19 although glaucoma is a chronic progressive optic neuropathy with corresponding glaucomatous VF defects. In addition, the results of the current study were not inferior to an attempt to use deep learning for analysis of a functional test (AUROC 0.926), in which preperimetric glaucomatous VFs could be distinguished from normal controls.30 Datasets reviewed by the combination of fundus photographs, the most accessible test, and the Humphrey VF test, a functional test mainly used for the diagnosis of glaucoma and grading of stages, will enhance the performance of a deep learning model.

The second main feature of this study is the use of the ensemble method. Most of the previous studies used one CNN model.12,15–19,27 The experiments of this study confirmed that the performance of individual CNN models such as InceptionNet-V3 and InceptionResNet-V2 was not significantly different (supplementary Tables S1-5). On the other hand, the results obtained by assembling the learning of multiple CNN models, by diversifying the conditions and characteristics of model learning, were found to be more advanced than when using only one CNN model in all aspects of bias and variance of the performance evaluation results (Figure 3 and Fig. S3-7). In fact, the single CNN model that used all filters showed higher AUROC of average, C1, and C2 than models using a single filter, but was not superior to the ensemble method (supplementary Tables S1-5). By verifying each model that used the processed fundus photographs, we found that the readings of some fundus photographs differed from model to model. We hypothesized that these readings could be improved by combining several diverse CNN models, and that this constitutes the advantage of the ensemble method despite its architectural complexity.

Christopher et al.14 recently published their results on the performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. They stratified glaucomatous optic neuropathy by the degree of functional loss into two groups: a mild group with a VF mean deviation (MD) better than or equal to – 6 dB and a moderate-to-severe group with a VF MD worse than – 6 dB. Their deep learning model achieved an AUROC of 0.89 in identifying glaucomatous optic neuropathy with mild functional loss. It is difficult to directly compare the diagnostic performance of the present CNN algorithm with that of Christopher et al. because their dataset contains a greater number of fundus photographs (n = 14,822), from a more racially and ethnically diverse group of individuals, than the current study. However, our ensemble method may help account for the better diagnostic accuracy in identifying mild-stage glaucoma (C1).

The third major feature of this study is the superior quality of the fundus photographs used in the CNN model. Li et al.15 used a dataset containing approximately 40,000 fundus photographs to identify glaucomatous optic neuropathy, and their AUROC was 0.986. Interestingly, the proposed ensemble method, with less than 10% of that dataset size, achieved an AUROC of 0.990 in distinguishing unaffected controls (C0) from glaucoma cases. As stated previously, classification of the stage of glaucoma was conducted by reviewing the fundus photographs together with reliable VF test data, and the final dataset was decided unanimously by the glaucoma specialists. The model using the dataset after the cross-validation revealed excellent performance compared to before the cross-validation, even though the total number of fundus images decreased from 4,445 to 3,460 (77.8%) after the cross-validation. Although the detailed data are not shown, this was probably because ambiguous cases in the assessment of glaucoma, such as retinal changes due to high myopia, fundus photographs unrelated to VF tests, and vice versa, were excluded after the cross-validation.

Nonetheless, this study has some limitations that need to be considered. First, the findings were obtained from highly population-specific (Korean) subjects. Furthermore, the good classifying performance using the entire area of the fundus images, not limited to the optic disc area, may be related to the fact that RNFL defects are much easier to identify in Asians, who have more pigment in the retinal pigment epithelium layer, than in Caucasians.31 However, additional data acquisition and verification in different racial groups will be needed for a more generalizable model.

Second, it is necessary to investigate CNN models that can classify the grade of glaucoma in more detail. In particular, patients with glaucoma may have many of the comorbid conditions in our exclusion criteria, such as high myopia and discrepancy between structural and functional tests. Although this may be difficult to address fundamentally, further research on these challenging cases is expected to provide more information and generalizability to help in real clinical practice. Third, extra studies comparing the performance of the deep learning ensemble system against a panel of clinicians reading fundus photographs, including glaucoma specialists, general ophthalmologists, residents in ophthalmology, and non-ophthalmological physicians, may clarify its necessity and clinical effectiveness in primary care. Fourth, in order to collect enough fundus images, the cut-off value for the false-positive rate (33%) in this study was higher than the standard cut-off value (20%) in other studies. The proportions of false-positive rates exceeding 20% were higher in the unaffected control and preperimetric glaucoma groups, which may be related to the retrospective study design. Finally, additional studies on the image processing and on the optimization of the ensemble method, considering whether to average or to add weighted values per individual model, are needed to enhance the performance.

In conclusion, this study demonstrated a newly developed deep learning ensemble method and confirmed the possibility of classifying the severity of glaucoma using fundus photographs. It is suggested that the keys to high performance may be improving the quality of the dataset and combining multiple CNN models. The CNN ensemble method proposed in this study can be used as a tool for a clinical decision support system to screen for the early stages and to monitor the progression of glaucoma.

Competing interests

The authors declare that they have no competing interests with the contents of this article.

Funding

This work was supported by an Institute for Information & Communications Technology Planning & Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00046, "Basic Technology for Extracting High-Level Information from Multiple Sources Data based on Intelligent Analysis"). The funding organization had no role in the design or conduct of this research.

ORCID

Jae Keun Chung http://orcid.org/0000-0003-2968-834X
Jae Hoon Jeong http://orcid.org/0000-0001-7311-7080

References

1. Tham YC, Li X, Wong TY, Quigley HA, Aung T, Cheng CY. Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology. 2014;121(11):2081–90. doi:10.1016/j.ophtha.2014.05.013.
2. Bourne RR, Taylor HR, Flaxman SR, Keeffe J, Leasher J, Naidoo K, Pesudovs K, White RA, Wong TY, Resnikoff S, et al. Number of people blind or visually impaired by glaucoma worldwide and in world regions 1990-2010: a meta-analysis. PLoS One. 2016;11(10):e0162229. doi:10.1371/journal.pone.0162229.
3. Varma R, Lee PP, Goldberg I, Kotak S. An assessment of the health and economic burdens of glaucoma. Am J Ophthalmol. 2011;152(4):515–22. doi:10.1016/j.ajo.2011.06.004.
4. Lee PP, Walt JG, Doyle JJ, Kotak SV, Evans SJ, Budenz DL, Chen PP, Coleman AL, Feldman RM, Jampel HD, et al. A multicenter, retrospective pilot study of resource use and costs associated with severity of disease in glaucoma. Arch Ophthalmol. 2006;124(1):12–19. doi:10.1001/archopht.124.1.12.
5. Gupta P, Zhao D, Guallar E, Ko F, Boland MV, Friedman DS. Prevalence of glaucoma in the United States: the 2005-2008 national health and nutrition examination survey. Invest Ophthalmol Vis Sci. 2016;57(6):2905–13. doi:10.1167/iovs.15-18469.
6. Myers JS, Fudemberg SJ, Lee D. Evolution of optic nerve photography for glaucoma screening: a review. Clin Exp Ophthalmol. 2018;46(2):169–76. doi:10.1111/ceo.13138.
7. Chauhan BC, Garway-Heath DF, Goni FJ, Rossetti L, Bengtsson B, Viswanathan AC, Heijl A. Practical recommendations for measuring rates of visual field change in glaucoma. Br J Ophthalmol. 2008;92(4):569–73. doi:10.1136/bjo.2007.135012.
8. Rountree L, Mulholland PJ, Anderson RS, Garway-Heath DF, Morgan JE, Redmond T. Optimising the glaucoma signal/noise ratio by mapping changes in spatial summation with area-modulated perimetric stimuli. Sci Rep. 2018;8(1):2172. doi:10.1038/s41598-018-20480-4.
9. Bizios D, Heijl A, Bengtsson B. Integration and fusion of standard automated perimetry and optical coherence tomography data for improved automated glaucoma diagnostics. BMC Ophthalmol. 2011;11:20. doi:10.1186/1471-2415-11-20.
10. Russell RA, Malik R, Chauhan BC, Crabb DP, Garway-Heath DF. Improved estimates of visual field progression using bayesian linear regression to integrate structural information in patients with ocular hypertension. Invest Ophthalmol Vis Sci. 2012;53(6):2760–69. doi:10.1167/iovs.11-7976.
11. Malik R, Swanson WH, Garway-Heath DF. 'Structure-function relationship' in glaucoma: past thinking and current concepts. Clin Exp Ophthalmol. 2012;40(4):369–80. doi:10.1111/j.1442-9071.2012.02770.x.
12. Xiangyu C, Yanwu X, Damon Wing Kee W, Tien Yin W, Jiang L. Glaucoma detection based on deep convolutional neural network. Conf Proc IEEE Eng Med Biol Soc. 2015;2015:715–18.
13. Muhammad H, Fuchs TJ, De Cuir N, De Moraes CG, Blumberg DM, Liebmann JM, Ritch R, Hood DC. Hybrid deep learning on single wide-field optical coherence tomography scans accurately classifies glaucoma suspects. J Glaucoma. 2017;26(12):1086–94. doi:10.1097/IJG.0000000000000765.
14. Christopher M, Belghith A, Bowd C, Proudfoot JA, Goldbaum MH, Weinreb RN, Girkin CA, Liebmann JM, Zangwill LM. Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. Sci Rep. 2018;8(1):16685. doi:10.1038/s41598-018-35044-9.
15. Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125(8):1199–206. doi:10.1016/j.ophtha.2018.01.023.
16. Shibata N, Tanito M, Mitsuhashi K, Fujino Y, Matsuura M, Murata H, Asaoka R. Development of a deep residual learning algorithm to screen for glaucoma from fundus photography. Sci Rep. 2018;8(1):14665. doi:10.1038/s41598-018-33013-w.
17. Liu S, Graham SL, Schulz A, Kalloniatis M, Zangerl B, Cai W, Gao Y, Chua B, Arvind H, Grigg J. A deep learning-based algorithm identifies glaucomatous discs using monoscopic fundus photographs. Ophthalmol Glaucoma. 2018;1(1):15–22. doi:10.1016/j.ogla.2018.04.002.
18. Liu H, Li L, Wormstone IM, Qiao C, Zhang C, Liu P, Li S, Wang H, Mou D, Pang R. Development and validation of a deep learning system to detect glaucomatous optic neuropathy using fundus photographs. JAMA Ophthalmol. 2019;137(12):1353. doi:10.1001/jamaophthalmol.2019.3501.
19. An G, Omodaka K, Hashimoto K, Tsuda S, Shiga Y, Takada N, Kikawa T, Yokota H, Akiba M, Nakazawa T. Glaucoma diagnosis with machine learning based on optical coherence tomography and color fundus images. J Healthc Eng. 2019;2019:4061313. doi:10.1155/2019/4061313.
20. Hodapp E, Parrish RK, Anderson DR. Clinical decisions in glaucoma. St. Louis (MO): Mosby; 1993.
21. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. Paper presented at: CVPR 2016, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas (NV).
22. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, inception-resnet and the impact of residual connections on learning. Paper presented at: AAAI-17, Proceedings of the AAAI Conference on Artificial Intelligence; 2017 Feb 4–9; San Francisco (CA).
23. Moyer VA; U.S. Preventive Services Task Force. Screening for glaucoma: U.S. preventive services task force recommendation statement. Ann Intern Med. 2013;159(7):484–89. doi:10.7326/0003-4819-159-6-201309170-00686.
24. Pizzi LT, Waisbourd M, Hark L, Sembhi H, Lee P, Crews JE, Saaddine JB, Steele D, Katz LJ. Costs of a community-based glaucoma detection programme: analysis of the Philadelphia Glaucoma detection and treatment project. Br J Ophthalmol. 2018;102(2):225–32. doi:10.1136/bjophthalmol-2016-310078.
25. Gaasterland DE, Blackwell B, Dally LG, Caprioli J, Katz LJ, Ederer F; Advanced Glaucoma Intervention Study Investigators. The Advanced Glaucoma Intervention Study (AGIS): 10. Variability among academic glaucoma subspecialists in assessing optic disc notching. Trans Am Ophthalmol Soc. 2001;99:177–84; discussion 184–5.
26. Jampel HD, Friedman D, Quigley H, Vitale S, Miller R, Knezevich F, Ding Y. Agreement among glaucoma specialists in assessing progressive disc changes from photographs in open-angle glaucoma patients. Am J Ophthalmol. 2009;147(1):39–44.e1. doi:10.1016/j.ajo.2008.07.023.
27. Raghavendra U, Fujita H, Bhandary SV, Gudigar A, Tan JH, Acharya UR. Deep convolution neural network for accurate diagnosis of glaucoma using digital fundus images. Inf Sci (Ny). 2018;441:41–49. doi:10.1016/j.ins.2018.01.051.
28. Rogers TW, Jaccard N, Carbonaro F, Lemij HG, Vermeer KA, Reus NJ, Trikha S. Evaluation of an AI system for the automated detection of glaucoma from stereoscopic optic disc photographs: the European Optic Disc Assessment Study. Eye (Lond). 2019;33(11):1791–97. doi:10.1038/s41433-019-0510-3.
29. Kim M, Han JC, Hyun SH, Janssens O, Van Hoecke S, Kee C, De Neve W. Medinoid: computer-aided diagnosis and localization of glaucoma using deep learning. Appl Sci. 2019;9(15):3064. doi:10.3390/app9153064.
30. Asaoka R, Murata H, Iwase A, Araie M. Detecting preperimetric glaucoma with standard automated perimetry using a deep learning classifier. Ophthalmology. 2016;123(9):1974–80. doi:10.1016/j.ophtha.2016.05.029.
31. Jonas JB, Dichtl A. Evaluation of the retinal nerve fiber layer. Surv Ophthalmol. 1996;40(5):369–78. doi:10.1016/S0039-6257(96)80065-8.
