10.1055@a 0981 6133
Authors
Bum-Joo Cho 1, 2, 3, Chang Seok Bang 3, 4, 5, Se Woo Park 4, 5, Young Joo Yang 3, 4, 5, Seung In Seo 4, 5, Hyun Lim 4, 5,
Woon Geon Shin 4, 5, Ji Taek Hong 4, 5, Yong Tak Yoo 6, Seok Hwan Hong 6, Jae Ho Choi 3, Jae Jun Lee 3, 7, Gwang Ho Baik 4, 5
Institutions
1 Department of Ophthalmology, Hallym University College of Medicine, Chuncheon, Korea
2 Interdisciplinary Program in Medical Informatics, Seoul National University College of Medicine, Seoul, Korea
3 Institute of New Frontier Research, Hallym University

ABSTRACT
Background Visual inspection, lesion detection, and differentiation between malignant and benign features are key aspects of an endoscopist’s role. The use of machine learning for the recognition and differentiation of images has been increasingly adopted in clinical practice. This
mentation of endoscopic screening programs, the proportion of patients with EGC at the time of diagnosis has increased [3, 4]. Although endoscopic screening programs have reduced gastric cancer mortality rates by 47 % [3], the detection of gastric neoplasms remains a challenge because it is dependent on the endoscopists’ experience, expertise, and skill [5]. Moreover, repeated endoscopic examinations have been associated with decreased mortality rates from gastric cancer [3] and longer inspection times have been associated with higher proportions of neoplasm detection [6] in Korean studies, indicating that one-time screenings are not a perfect method.
Endoscopy is used for both screening and diagnosing a variety of gastrointestinal diseases, including gastric neoplasms [5]. A high quality endoscopic examination is necessary to detect malignant and premalignant lesions, especially in areas where gastric cancer is prevalent. Detection of abnormal lesions is usually based on abnormal morphology or color chang-

exclusion criteria: 1) images with poor quality or low resolution that precluded proper classification (out of focus, artifacts, shadowing, etc.); 2) images from image-enhanced endoscopy; and 3) images without pathology results. After applying the exclusion criteria, the remaining images were included in the study. All images were de-identified by removing individual identifiers. Finally, a total of 5017 white-light images from 1269 individuals were included in the study. Of these, 812 images from 212 subjects were used as the test dataset. Table 1s (see the online-only supplementary material) shows the image category composition of the datasets used in the study.
The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Chuncheon Sacred Heart Hospital (2018 – 8).

Endoscopic procedure
Upper endoscopic examinations were performed either as part
No. of patients, n (%) Age, mean (SD), years Sex, M/F, n (% men)
Overall 200 88 (44.0) 112 (56.0) 62.5 (13.8) 61.3 (13.8) 61.3 (13.8) 146/54 (73.0) 68/20 (77.3) 78/34 (69.6)
AGC 28 15 (17.0) 13 (11.6) 73.8 (9.5) 69 (13.8) 79.2 (13.7) 23/5 (82.1) 14/1 (93.3) 9/4 (69.2)
EGC 46 16 (18.2) 30 (26.8) 70.5 (6.5) 72.4 (6.2) 69.4 (6.5) 34/12 (73.9) 9/7 (56.3) 25/5 (83.3)
HGD 26 14 (15.9) 12 (10.7) 63.8 (7.7) 64.6 (8.0) 62.9 (7.6) 17/9 (65.4) 10/4 (71.4) 7/5 (58.3)
LGD 30 8 (9.1) 22 (19.6) 64.2 (10.1) 65.9 (8.0) 63.6 (10.9) 25/5 (83.3) 6/2 (75.0) 19/3 (86.4)
Non-neoplasm 70 35 (39.8) 35 (31.3) 51.4 (14.1) 50.5 (13.4) 52.3 (14.9) 47/23 (67.1) 29/6 (82.9) 18/17 (51.4)
AGC, advanced gastric cancer; EGC, early gastric cancer; HGD, high grade dysplasia; LGD, low grade dysplasia; % is proportion of each category in the overall dataset.
a Test dataset. Rows, true category; columns, predicted category; values, n (% of row).
True | AGC | EGC | HGD | LGD | NON
AGC | 86 (79 %) | 10 (9 %) | 2 (2 %) | 1 (1 %) | 10 (9 %)
EGC | 31 (17 %) | 97 (52 %) | 4 (2 %) | 12 (6 %) | 41 (22 %)
HGD | 11 (19 %) | 20 (34 %) | 0 (0 %) | 13 (22 %) | 14 (24 %)
LGD | 4 (4 %) | 25 (25 %) | 4 (4 %) | 22 (22 %) | 44 (44 %)
NON | 9 (2 %) | 13 (4 %) | 4 (1 %) | 2 (1 %) | 333 (92 %)

b Prospective validation dataset. Rows, true category; columns, predicted category; values, n (% of row).
True | AGC | EGC | HGD | LGD | NON
AGC | 17 (61 %) | 2 (7 %) | 0 (0 %) | 0 (0 %) | 9 (32 %)
EGC | 1 (2 %) | 13 (28 %) | 0 (0 %) | 8 (17 %) | 24 (52 %)
HGD | 1 (4 %) | 3 (12 %) | 0 (0 %) | 7 (27 %) | 15 (58 %)
LGD | 0 (0 %) | 11 (37 %) | 1 (3 %) | 2 (7 %) | 16 (53 %)
NON | 1 (1 %) | 2 (3 %) | 0 (0 %) | 0 (0 %) | 67 (96 %)

▶ Fig. 1 Confusion matrix for per-category sensitivity of the Inception-Resnet-v2 model. a The test dataset. b The prospective validation
dataset. AGC, advanced gastric cancer; EGC, early gastric cancer; HGD, high grade dysplasia; LGD, low grade dysplasia; NON, non-neoplasm.
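The per-category sensitivities in Fig. 1b can be recomputed from the counts in the confusion matrix. The following is a minimal sketch, not the authors' analysis code: the counts are transcribed from Fig. 1b, the assumption that rows are the true category and columns the predicted category is inferred from the row totals (28, 46, 26, 30, and 70), and the names `CONFUSION_B` and `per_category_sensitivity` are illustrative only.

```python
# Counts transcribed from Fig. 1b (prospective validation dataset).
# Assumption: rows = true category, columns = predicted category.
LABELS = ["AGC", "EGC", "HGD", "LGD", "NON"]
CONFUSION_B = [
    [17, 2, 0, 0, 9],    # true AGC
    [1, 13, 0, 8, 24],   # true EGC
    [1, 3, 0, 7, 15],    # true HGD
    [0, 11, 1, 2, 16],   # true LGD
    [1, 2, 0, 0, 67],    # true NON
]


def per_category_sensitivity(matrix):
    """Return, for each true category, diagonal count / row total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


sensitivity = dict(zip(LABELS, per_category_sensitivity(CONFUSION_B)))
for label, value in sensitivity.items():
    print(f"{label}: {value * 100:.1f} %")
```

Under these assumptions, the sensitivity for HGD is zero because no HGD image lies on the diagonal of the matrix.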
Prospective validation
The prospective validation dataset comprised 200 images including 28 AGCs, 46 EGCs, 26 HGDs, 30 LGDs, and 70 non-neoplasms (74 cancers vs. 126 non-cancers, 130 neoplasms vs. 70 non-neoplasms). Detailed characteristics of the enrolled patients are shown in ▶ Fig. 3 and ▶ Table 1.
In classifying the prospective validation dataset into five categories, an endoscopist with the best performance showed an accuracy of 87.6 % (95 %CI 84.3 % – 90.9 %), whereas the Incep-

[Figure: ROC curves. Panel a; x-axis, False positive rate (1-specificity); legend, inception_v4, resnet_v2_152, inception_resnet_v2. Following panel title, Neoplasm ROC curve.]
▶ Fig. 3 Flow diagram of prospective validation. AGC, advanced gastric cancer; EGC, early gastric cancer; HGD, high grade dysplasia; LGD, low
grade dysplasia.
guish between neoplastic and non-neoplastic disease, sufficient tissue needs to be obtained for accurate diagnosis [5]. However, endoscopic biopsy is an invasive procedure that can result in mucosal damage and hemorrhage [5, 16]. Moreover, repeated biopsies or multiple biopsies over a wide area can lead to submucosal fibrosis, which impedes therapeutic procedures such as endoscopic submucosal dissection [5]. Therefore, precise prediction of the histological diagnosis during the endoscopic examination would decrease unnecessary biopsies [5].
The deep-learning models have the potential for clinical application during endoscopic procedures, although they cannot replace the procedure itself at the current time. Improving diagnostic ability during visual inspection is a constant goal for endoscopists. Although image-enhanced endoscopy has been widely adopted in clinical practice, it is not a perfect tool and more can be done to improve the accuracy of diagnosis. Automatic classification and diagnosis of ambiguous lesions can reduce unnecessary biopsies, procedures, or procedure-related adverse events [13].
Studies on the classification of gastric lesions by CNNs or other machine-learning models using endoscopic images have been limited to date. A previous study by Hirasawa et al. showed that 71 of 77 gastric cancers were correctly classified by the established CNN, with an overall sensitivity of 92.2 % and a PPV of 30.6 % [10]. However, the performance of this model might be overestimated because even when the CNN detected only one gastric cancer in multiple images of the same lesion, the answer was still considered to be correct [10]. The authors also claimed that all of the lesions that were missed by the CNN were superficially depressed and differentiated intramucosal cancers (corresponding to HGDs in our study) that
Category | Accuracy, % (95 %CI) | Sensitivity, % (95 %CI) | Specificity, % (95 %CI) | PPV, % (95 %CI) | NPV, % (95 %CI) | AUC (95 %CI)
Endoscopist 1
▪ AGC | 99.5 (97.3 – 99.9) | 96.4 (81.7 – 99.9) | 100 (97.9 – 100) | 100 | 99.4 (96.2 – 99.9) | 0.982 (0.953 – 0.996)
▪ EGC | 82 (76.0 – 87.1) | 56.5 (41.1 – 71.1) | 89.6 (83.7 – 93.9) | 61.9 (48.9 – 73.4) | 87.3 (83.2 – 90.6) | 0.731 (0.664 – 0.791)
▪ HGD | 84 (78.2 – 88.8) | 30.8 (14.3 – 51.8) | 92.0 (86.9 – 95.5) | 36.4 (21.0 – 55.1) | 89.9 (87.3 – 92.0) | 0.614 (0.542 – 0.681)
▪ LGD | 84 (78.2 – 88.8) | 50.0 (31.3 – 68.7) | 90.0 (84.5 – 94.1) | 46.9 (33.2 – 61.1) | 91.1 (87.7 – 93.6) | 0.700 (0.631 – 0.763)
▪ Non-neoplasm | 89.5 (84.4 – 93.4) | 90.0 (80.5 – 95.9) | 89.2 (82.6 – 94.0) | 81.8 (73.2 – 88.1) | 94.3 (89.1 – 97.1) | 0.896 (0.845 – 0.935)
Endoscopist 2
▪ AGC | 99 (96.4 – 99.9) | 100 (87.7 – 100) | 98.8 (95.9 – 99.9) | 93.3 (77.9 – 98.2) | 100 | 0.994 (0.971 – 1.000)
▪ HGD | 81.5 (75.4 – 86.6) | 30.8 (14.3 – 51.8) | 84.2 (78.2 – 89.2) | 21.6 (12.4 – 34.9) | 89.6 (86.9 – 91.8) | 0.575 (0.505 – 0.643)
▪ LGD | 82.5 (76.5 – 87.5) | 40 (22.7 – 59.4) | 90 (84.5 – 94.1) | 41.4 (27.4 – 57.0) | 89.5 (86.3 – 92.0) | 0.650 (0.580 – 0.716)
▪ Non-neoplasm | 89 (83.8 – 93.0) | 90 (80.5 – 95.9) | 88.5 (81.7 – 93.4) | 80.8 (72.2 – 87.2) | 94.3 (89.0 – 97.1) | 0.892 (0.841 – 0.932)
Endoscopist 3
▪ AGC | 98.5 (95.7 – 99.7) | 92.9 (76.5 – 99.1) | 99.4 (96.8 – 99.9) | 96.3 (78.6 – 99.5) | 98.8 (95.7 – 99.7) | 0.961 (0.924 – 0.983)
▪ EGC | 81.5 (75.4 – 86.6) | 50 (34.9 – 65.1) | 90.9 (85.2 – 94.9) | 62.2 (48.0 – 74.5) | 85.9 (81.9 – 89.1) | 0.705 (0.636 – 0.767)
▪ HGD | 85.5 (79.8 – 90.1) | 7.7 (0.9 – 25.1) | 97.1 (93.4 – 99.1) | 28.6 (7.6 – 66.2) | 87.6 (86.3 – 88.8) | 0.524 (0.452 – 0.595)
▪ LGD | 80 (73.8 – 85.3) | 56.7 (37.4 – 74.5) | 84.1 (77.7 – 89.3) | 38.6 (28.3 – 50.1) | 91.7 (87.9 – 94.3) | 0.704 (0.635 – 0.766)
▪ Non-neoplasm | 85.5 (79.8 – 90.1) | 90 (80.5 – 95.9) | 83.1 (75.5 – 89.1) | 74.1 (66 – 80.9) | 93.9 (88.4 – 96.9) | 0.865 (0.810 – 0.909)
Inception-Resnet-v2
▪ AGC | 93.0 (88.5 – 96.1) | 60.7 (40.6 – 78.5) | 98.3 (95.0 – 99.6) | 85.0 (64.0 – 94.8) | 93.9 (90.6 – 96.1) | 0.795 (0.732 – 0.849)
▪ EGC | 74.5 (67.9 – 80.4) | 28.3 (16.0 – 43.5) | 88.3 (82.2 – 92.9) | 41.9 (27.7 – 57.6) | 80.5 (77.3 – 83.3) | 0.583 (0.511 – 0.652)
▪ HGD | 86.4 (80.9 – 90.9) | 0 (0 – 13.2) | 99.4 (96.8 – 99.9) | 0 | 86.9 (86.7 – 87.0) | 0.497 (0.426 – 0.569)
▪ LGD | 78.5 (72.2 – 84.0) | 6.7 (0.8 – 22.1) | 91.2 (85.9 – 95.0) | 11.8 (3.1 – 35.6) | 84.7 (83.3 – 86.0) | 0.489 (0.418 – 0.561)
▪ Non-neoplasm | 66.5 (59.5 – 73) | 95.7 (88.0 – 99.1) | 50.8 (41.9 – 59.6) | 51.1 (46.6 – 55.7) | 95.7 (87.8 – 98.5) | 0.732 (0.665 – 0.792)
CI, confidence interval; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve; AGC, advanced gastric cancer; EGC, early gastric
cancer; HGD, high grade dysplasia; LGD, low grade dysplasia.
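The sensitivity intervals in the table above appear consistent with exact (Clopper-Pearson) binomial confidence intervals; for example, 0 of 26 HGD images gives an upper limit of 13.2 %. The source does not name the interval method, so this is an assumption. The sketch below is a standard-library illustration of that method, with the hypothetical helper names `binom_cdf` and `clopper_pearson`; it is not the software used in the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05, iters=100):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    if k == 0:
        lower = 0.0
    else:
        # Find p with P(X >= k | p) = alpha / 2; this tail grows with p.
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if 1 - binom_cdf(k - 1, n, mid) < alpha / 2:
                lo = mid
            else:
                hi = mid
        lower = (lo + hi) / 2
    if k == n:
        upper = 1.0
    else:
        # Find p with P(X <= k | p) = alpha / 2; this tail shrinks with p.
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > alpha / 2:
                lo = mid
            else:
                hi = mid
        upper = (lo + hi) / 2
    return lower, upper


# AGC sensitivity of the model: 17 of 28 images correctly classified.
low, high = clopper_pearson(17, 28)
print(f"{17 / 28 * 100:.1f} % ({low * 100:.1f} – {high * 100:.1f})")
```

Bisection is used here only to avoid a dependency on a statistics library; a beta-quantile implementation would give the same interval.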
were difficult to distinguish from gastritis, even for experienced endoscopists [10]. However, 69.4 % of the lesions that the CNN diagnosed as gastric cancer were benign, and the most common reasons for misdiagnosis were gastritis, atrophy, and intestinal metaplasia (corresponding to non-neoplasm in our study), all of which are very common in clinical practice [10]. The CNN models in our study showed the highest per-category sensitivity and NPV for non-neoplasm lesions in the prospective validation, which is presumed to be unaffected by the limitation of the previous study.
In terms of per-category performance, endoscopists demonstrated near-perfect performance in the diagnosis of AGC and the diagnostic performance for non-neoplasm also reached a substantial level. However, diagnostic performance for LGD or HGD was lower than that for other lesions, which was common not only among endoscopists but also with the Inception-Resnet-v2 model. AGC is a lesion that should not be missed by endoscopists because it is associated with significant mortality. The recognition of the characteristic morphology of AGC is emphasized during endoscopy training and the results of prospective validation presumably reflect endoscopists’ alertness for suspected AGC lesions. However, dysplasia is defined as a mucosal lesion that exhibits cytological atypia. It is categorized into low or high grade depending on
▶ Table 3 Diagnostic performance of endoscopists and the established convolutional neural network model in classifying gastric cancer or neoplasm on
endoscopic images in the prospective validation dataset.
Classifier | Accuracy, % (95 %CI) | Sensitivity, % (95 %CI) | Specificity, % (95 %CI) | PPV, % (95 %CI) | NPV, % (95 %CI) | AUC (95 %CI)
Cancer or non-cancer
▪ Endoscopist 1 | 97.5 (94.3 – 99.2) | 93.2 (84.9 – 97.8) | 100 (97.1 – 100) | 100 | 96.2 (91.5 – 98.3) | 0.966 (0.931 – 0.987)
▪ Endoscopist 2 | 82.5 (76.5 – 87.5) | 74.3 (62.8 – 84.8) | 87.3 (80.2 – 92.6) | 77.5 (68.1 – 84.7) | 85.3 (79.6 – 89.6) | 0.808 (0.747 – 0.860)
▪ Endoscopist 3 | 82 (76.0 – 87.1) | 68.9 (57.1 – 79.2) | 89.7 (83.0 – 94.4) | 79.7 (69.6 – 87.0) | 83.1 (77.7 – 87.4) | 0.793 (0.730 – 0.847)
▪ Inception-Resnet-v2 | 76 (69.5 – 81.7) | 50 (38.1 – 61.9) | 91.3 (84.9 – 95.6) | 77.1 (64.7 – 86.1) | 75.7 (71.1 – 79.7) | 0.706 (0.638 – 0.768)
Neoplasm or non-neoplasm
▪ Endoscopist 1 | 96.5 (92.9 – 98.6) | 94.6 (89.2 – 97.8) | 100 (97.9 – 100) | 100 | 96 (92.2 – 98.0) | 0.973 (0.948 – 0.988)
▪ Endoscopist 2 | 87.5 (82.1 – 91.7) | 88.5 (81.7 – 93.4) | 85.7 (75.3 – 92.9) | 92 (86.6 – 95.3) | 80 (71.1 – 86.7) | 0.871 (0.816 – 0.914)
▪ Inception-Resnet-v2 | 73.5 (66.8 – 79.5) | 63.8 (55.0 – 72.1) | 91.4 (82.3 – 96.8) | 93.3 (86.4 – 96.8) | 57.7 (51.7 – 63.4) | 0.776 (0.712 – 0.832)
CI, confidence interval; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve.
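The relationship between the columns of Table 3 can be illustrated with a 2 × 2 table. The sketch below rests on an assumption: the counts (37 true positives, 37 false negatives, 11 false positives, 115 true negatives) are not reported in the source. They were back-calculated from the Inception-Resnet-v2 row for cancer vs. non-cancer, given 74 cancers and 126 non-cancers. The function name `binary_metrics` is illustrative only.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity, specificity, PPV, and NPV as fractions."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # detected cancers / all cancers
        "specificity": tn / (tn + fp),   # correct non-cancers / all non-cancers
        "ppv": tp / (tp + fp),           # true cancers / all positive calls
        "npv": tn / (tn + fn),           # true non-cancers / all negative calls
    }


# Assumed counts, back-calculated from Table 3 (not reported in the source).
metrics = binary_metrics(tp=37, fn=37, fp=11, tn=115)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f} %")
```

With these assumed counts, the sketch returns the tabulated 76.0 %, 50.0 %, 91.3 %, 77.1 %, and 75.7 %.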
the degree of atypia at cellular level. Endoscopic findings appear varied but are most commonly in the form of flat or elevated lesions, which are hard to differentiate from EGC or even from non-neoplastic lesions.
Our established models displayed weakness, especially in the classification of HGD. The reason for this relatively lower performance (five-category classification) during prospective validation is presumed to be the difficulty in the diagnosis of HGD. (The AUCs of the established models for the classification of HGD are commonly the lowest in the test dataset and prospective validation dataset.) In real clinical practice, it is nearly impossible to accurately differentiate between HGD and EGC. Therefore, endoscopic ultrasound, image-enhanced endoscopy, or even confocal endomicroscopy are employed to resolve this issue. The number of HGD images in the validation dataset was the lowest of the five categories and this could have also affected the accuracy data. The establishment of models predicting depth of invasion of the lesions and enrollment of more HGD cases could resolve this issue and enhance machine learning.
The strength of this study is the enrollment of endoscopic images obtained from endoscopies performed in multiple hospitals over a long-term period and the prospective validation, which were conducted to reflect real practice patterns of endoscopists in Korea. Moreover, these models attempted to reduce the false-positive rate by presenting the probability of lesions in all types of gastric neoplasms (five categories or two categories) rather than giving only one definitive diagnosis. In addition, binary classification with cancer vs. non-cancer, or neoplasm vs. non-neoplasm would give on-site information for the accurate prediction of gastric lesions to endoscopists and would help in determining the necessity for a biopsy specimen.
Despite these strengths, the study has several limitations. First, the pitfalls inherent in retrospective studies make it difficult to exclude selection bias. Some of the included images taken from an older endoscopy system had low brightness/resolution compared with the recently adopted system. Second, the performance of the CNNs presented in this study might be influenced by the composition of the database (so-called spectrum bias), although the database enrolled consecutive patients. Third, pathological classification of the lesions into five categories could be different in areas outside of Korea. No generally accepted definition has been created for differentiating gastric epithelial dysplasia or cancer, especially between Japanese and Western pathologists [17]. Although a revised Vienna classification has been proposed to address the inconsistent diagnosis of gastric epithelial dysplasias, category 4 lesions (HGD and intramucosal cancer) could still be categorized in the EGC category in some countries [17]. Therefore, the diagnostic performance could change if the images were coded differently. Fourth, binary classification performance of the established model by prospective validation was also lower than that of the endoscopist with the best performance. Considering the relatively lower number of neoplasms compared with non-neoplastic lesions in the training dataset, this performance could be enhanced through the enrollment of a higher number of images with more balanced data, as we did not perform a class-balancing process in this study. Fifth, we used JPEG format for the model establishment. This compression standard is typically “lossy” and contains several user-defined settings that affect the image quality. Although JPEG was the only format that could be collected in the multicenter setting owing to technical problems, this format could induce a bias in terms of image quality. We established an unused dataset of PNG files and vali-
Development Program of the National Research Foundation (NRF) and funded by the Korean government, Ministry of Science and ICT (MSIT) (grant number NRF2017M3A9E8033207).

Competing interests
None.

References
[1] World Health Organization Fact Sheets on Cancer. Available from: http://www.who.int/mediacentre/factsheets/fs297/en Accessed: 13 August 2018
[2] de Vries AC, van Grieken NC, Looman CW et al. Gastric cancer risk in patients with premalignant gastric lesions: a nationwide cohort study in the Netherlands. Gastroenterology 2008; 134: 945 – 952
[3] Jun JK, Choi KS, Lee HY et al. Effectiveness of the Korean National Cancer Screening Program in reducing gastric cancer mortality. Gastroenterology 2017; 152: 1319 – 1328
[12] Chen PJ, Lin MC, Lai MJ et al. Accurate classification of diminutive colorectal polyps using computer-aided analysis. Gastroenterology 2018; 154: 568 – 575
[13] Komeda Y, Handa H, Watanabe T et al. Computer-aided diagnosis based on convolutional neural network system for colorectal polyp classification: preliminary experience. Oncology 2017; 93 (Suppl. 01): 30 – 34
[14] Jisu H, Bo-Yong P, Hyunjin P. Convolutional neural network classifier for distinguishing Barrett’s esophagus and neoplasia endomicroscopy images. Conf Proc IEEE Eng Med Biol Soc 2017; 2017: 2892 – 2895
[15] DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44: 837 – 845
[16] Anderson MA, Ben-Menachem T, Gan SI et al. Management of antithrombotic agents for endoscopic procedures. Gastrointest Endosc 2009; 70: 1060 – 1070
[17] Stolte M. The new Vienna classification of epithelial neoplasia of the gastrointestinal tract: advantages and disadvantages. Virchows Arch 2003; 442: 99 – 106