
Original article

Automated classification of gastric neoplasms in endoscopic images using a convolutional neural network

Authors
Bum-Joo Cho 1, 2, 3, Chang Seok Bang 3, 4, 5, Se Woo Park 4, 5, Young Joo Yang 3, 4, 5, Seung In Seo 4, 5, Hyun Lim 4, 5, Woon Geon Shin 4, 5, Ji Taek Hong 4, 5, Yong Tak Yoo 6, Seok Hwan Hong 6, Jae Ho Choi 3, Jae Jun Lee 3, 7, Gwang Ho Baik 4, 5

Institutions
1 Department of Ophthalmology, Hallym University College of Medicine, Chuncheon, Korea
2 Interdisciplinary Program in Medical Informatics, Seoul National University College of Medicine, Seoul, Korea
3 Institute of New Frontier Research, Hallym University College of Medicine, Chuncheon, Korea
4 Department of Internal Medicine, Hallym University College of Medicine, Chuncheon, Korea
5 Institute for Liver and Digestive Diseases, Hallym University, Chuncheon, Korea
6 Dudaji Inc., Seoul, Korea
7 Department of Anesthesiology and Pain Medicine, Hallym University College of Medicine, Chuncheon, Korea

submitted 27.11.2018
accepted after revision 19.6.2019

Bibliography
DOI https://doi.org/10.1055/a-0981-6133
Published online: 2019 | Endoscopy
© Georg Thieme Verlag KG Stuttgart · New York
ISSN 0013-726X
Cho Bum-Joo et al. Machine learning on gastric neoplasm images … Endoscopy

Corresponding author
Chang Seok Bang, MD, PhD, Department of Internal Medicine, Hallym University College of Medicine, Sakju-ro 77, Chuncheon, Gangwon-do 24253, South Korea
Fax: +82-33-2418064
[email protected]

Supplementary material
Online content viewable at: https://doi.org/10.1055/a-0981-6133

ABSTRACT
Background Visual inspection, lesion detection, and differentiation between malignant and benign features are key aspects of an endoscopist's role. The use of machine learning for the recognition and differentiation of images has been increasingly adopted in clinical practice. This study aimed to establish convolutional neural network (CNN) models to automatically classify gastric neoplasms based on endoscopic images.
Methods Endoscopic white-light images of pathologically confirmed gastric lesions were collected and classified into five categories: advanced gastric cancer, early gastric cancer, high grade dysplasia, low grade dysplasia, and non-neoplasm. Three pretrained CNN models were fine-tuned using a training dataset. The classifying performance of the models was evaluated using a test dataset and a prospective validation dataset.
Results A total of 5017 images were collected from 1269 patients, among which 812 images from 212 patients were used as the test dataset. An additional 200 images from 200 patients were collected and used for prospective validation. For the five-category classification, the weighted average accuracy of the Inception-Resnet-v2 model reached 84.6 %. The mean area under the curve (AUC) of the model for differentiating gastric cancer and neoplasm was 0.877 and 0.927, respectively. In prospective validation, the Inception-Resnet-v2 model showed lower performance compared with the endoscopist with the best performance (five-category accuracy 76.4 % vs. 87.6 %; cancer 76.0 % vs. 97.5 %; neoplasm 73.5 % vs. 96.5 %; P < 0.001). However, there was no statistical difference between the Inception-Resnet-v2 model and the endoscopist with the worst performance in the differentiation of gastric cancer (accuracy 76.0 % vs. 82.0 %) and neoplasm (AUC 0.776 vs. 0.865).
Conclusion The evaluated deep-learning models have the potential for clinical application in classifying gastric cancer or neoplasm on endoscopic white-light images.

Introduction

Gastric cancer remains a global health burden and is the fourth most common cause of cancer-related death worldwide [1]. Most early gastric cancers (EGCs) lack clinical signs or symptoms and are difficult to detect and treat in a timely manner without screening strategies. Patients with premalignant lesions, such as a gastric dysplasia, also have a considerable risk of developing gastric cancer [2]. Korea has the highest incidence of gastric cancer and adopted the National Cancer Screening Program in 1999 [3]. With the widespread implementation of endoscopic screening programs, the proportion of patients with EGC at the time of diagnosis has increased [3, 4]. Although endoscopic screening programs have reduced gastric cancer mortality rates by 47 % [3], the detection of gastric neoplasms remains a challenge because it is dependent on the endoscopists' experience, expertise, and skill [5]. Moreover, repeated endoscopic examinations have been associated with decreased mortality rates from gastric cancer [3] and longer inspection times have been associated with higher proportions of neoplasm detection [6] in Korean studies, indicating that one-time screenings are not a perfect method.

Endoscopy is used for both screening and diagnosing a variety of gastrointestinal diseases, including gastric neoplasms [5]. A high quality endoscopic examination is necessary to detect malignant and premalignant lesions, especially in areas where gastric cancer is prevalent. Detection of abnormal lesions is usually based on abnormal morphology or color changes in the mucosa, and diagnostic accuracy is known to improve through training and the use of optical techniques or chromoendoscopy [5, 7, 8]. The application of endoscopic imaging technologies such as narrow-band imaging, confocal imaging or magnifying techniques (so-called image-enhanced endoscopy) is also known to enhance diagnostic accuracy [7, 9]. However, examination solely with white-light endoscopy remains the most routine form of screening, and standardization of the procedure and improvements in the interpretation process to resolve the interobserver and intraobserver variability are needed in image-enhanced endoscopy. Therefore, meticulous inspection of the stomach, discrimination of the lesions, and a targeted biopsy are the key factors in diagnosing pathologic lesions [5].

Recently, the use of convolutional neural networks (CNNs) to recognize and differentiate medical images has been increasingly adopted in clinical practice. This technique has already shown promising diagnostic performance using endoscopic images, such as detecting gastric cancer [10], recognizing Helicobacter pylori infection [11], classifying colorectal polyps into neoplastic or non-neoplastic features [12, 13], and distinguishing Barrett's esophagus and neoplasia [14]. However, there have been no studies on the application of CNNs in the classification of gastric neoplasms based on white-light images. This study aimed to develop a deep-learning model to automatically classify gastric neoplasms based on white-light images and to evaluate the model performance.

Methods

Study sample
All still-cut white-light endoscopy photographs of pathologically confirmed gastric lesions were retrospectively collected from consecutive patients who underwent an upper endoscopy between 2010 and 2017 at two hospitals (Chuncheon and Dongtan Sacred Heart Hospitals). The images were retrieved in JPEG format from the picture archiving and communication database system of the participant hospitals. Images were from a 35-degree field of view with a resolution of 1280 × 640 pixels. Inappropriate images were excluded according to the following exclusion criteria: 1) images with poor quality or low resolution that precluded proper classification (out of focus, artifacts, shadowing, etc.); 2) images from image-enhanced endoscopy; and 3) images without pathology results. After applying the exclusion criteria, the remaining images were included in the study. All images were de-identified by removing individual identifiers. Finally, a total of 5017 white-light images from 1269 individuals were included in the study. Of these, 812 images from 212 subjects were used as the test dataset. Table 1s (see the online-only supplementary material) shows the image category composition of the datasets used in the study.

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Chuncheon Sacred Heart Hospital (2018 – 8).

Endoscopic procedure
Upper endoscopic examinations were performed either as part of routine check-ups, for diagnosis of symptomatic patients, or as a therapeutic procedure for neoplastic lesions. All of the patients fasted for more than 8 hours before the examination. All of the procedures were performed by six experienced endoscopists ( > 6000 cases). Endoscopic examination was performed using the GIF-Q260, H260 or H290 endoscopes (Olympus Optical Co., Ltd., Tokyo, Japan) with an endoscopic video imaging system (Evis Lucera CV-260 SL or Elite CV-290; Olympus Optical Co.). All neoplasm-suspected lesions detected during examination were endoscopically biopsied or resected using either endoscopic mucosal resection or endoscopic submucosal dissection. Lesions that could not be resected by endoscopic procedures were surgically resected, and the final pathology results were identified. Pathological diagnosis of endoscopic biopsy was made by two specialist pathologists, each with more than 10 years of experience. Positive findings in the biopsy specimen were cross-checked with another pathologist in Chuncheon Sacred Heart hospital. The final classification of EGC, advanced gastric cancer (AGC), high grade dysplasia (HGD), or low grade dysplasia (LGD) was made by combination of histological diagnosis and clinical findings by staff members of the gastroenterology department.

Building training and test datasets
All images were reviewed by two expert endoscopists (C.S.B and S.W.P) and grouped into five categories: AGC, EGC, HGD, LGD, and non-neoplasm. The non-neoplasm category included any form of gastritis, benign ulcers, erosions, polyps, or intestinal metaplasia, etc. In addition, the images were also classified into two categories from two perspectives: cancer vs. non-cancer, and neoplasm vs. non-neoplasm. The cancer category included AGC and EGC, and the non-cancer category included HGD, LGD, and non-neoplasms. The neoplasm category included AGC, EGC, HGD, and LGD. Some images were taken of the same lesion from a different angle, direction, and distance.

The entire dataset was divided into training and test datasets, which were mutually exclusive, using random sampling. Randomization was performed based on patients and not based on images. The ratio of the patient number for training and test datasets was set to be 5:1 for each category of gastric lesions.
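The patient-level 5:1 randomization described here can be sketched in a few lines of code. This is an illustrative sketch only, not the authors' code; the `(patient_id, category, image)` record format and the helper name `split_by_patient` are assumptions made for the example.

```python
import random
from collections import defaultdict

def split_by_patient(images, ratio=5, seed=42):
    """Split (patient_id, category, image) records into training and test sets.

    Patients (not images) are randomized within each lesion category, so all
    images of a given category from one patient land in the same dataset.
    """
    rng = random.Random(seed)
    patients_by_category = defaultdict(set)
    for patient_id, category, _ in images:
        patients_by_category[category].add(patient_id)

    test_keys = set()
    for category, ids in patients_by_category.items():
        ids = sorted(ids)
        rng.shuffle(ids)
        n_test = max(1, len(ids) // (ratio + 1))  # ~5:1 train:test by patients
        test_keys.update((category, pid) for pid in ids[:n_test])

    train = [r for r in images if (r[1], r[0]) not in test_keys]
    test = [r for r in images if (r[1], r[0]) in test_keys]
    return train, test
```

Because the held-out keys are (category, patient) pairs rather than patients alone, a patient with lesions of two different categories is randomized independently for each category, mirroring the behavior described in the text.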


▶ Table 1 Clinical characteristics of enrolled patients in the prospective validation dataset.

|              | No. of patients, n (%) |            |            | Age, mean (SD), years |            |            | Sex, M/F, n (% men) |             |             |
|--------------|------------------------|------------|------------|-----------------------|------------|------------|---------------------|-------------|-------------|
|              | Overall                | Kangdong   | Hallym     | Overall               | Kangdong   | Hallym     | Overall             | Kangdong    | Hallym      |
| Overall      | 200                    | 88 (44.0)  | 112 (56.0) | 62.5 (13.8)           | 61.3 (13.8) | 61.3 (13.8) | 146/54 (73.0)       | 68/20 (77.3) | 78/34 (69.6) |
| AGC          | 28                     | 15 (17.0)  | 13 (11.6)  | 73.8 (9.5)            | 69 (13.8)  | 79.2 (13.7) | 23/5 (82.1)         | 14/1 (93.3)  | 9/4 (69.2)   |
| EGC          | 46                     | 16 (18.2)  | 30 (26.8)  | 70.5 (6.5)            | 72.4 (6.2) | 69.4 (6.5)  | 34/12 (73.9)        | 9/7 (56.3)   | 25/5 (83.3)  |
| HGD          | 26                     | 14 (15.9)  | 12 (10.7)  | 63.8 (7.7)            | 64.6 (8.0) | 62.9 (7.6)  | 17/9 (65.4)         | 10/4 (71.4)  | 7/5 (58.3)   |
| LGD          | 30                     | 8 (9.1)    | 22 (19.6)  | 64.2 (10.1)           | 65.9 (8.0) | 63.6 (10.9) | 25/5 (83.3)         | 6/2 (75.0)   | 19/3 (86.4)  |
| Non-neoplasm | 70                     | 35 (39.8)  | 35 (31.3)  | 51.4 (14.1)           | 50.5 (13.4) | 52.3 (14.9) | 47/23 (67.1)        | 29/6 (82.9)  | 18/17 (51.4) |

AGC, advanced gastric cancer; EGC, early gastric cancer; HGD, high grade dysplasia; LGD, low grade dysplasia; % is proportion of each category in the overall dataset. Kangdong, Kangdong Sacred Heart hospital; Hallym, Hallym University Sacred Heart hospital.

Thus, lesions of the same category in a single patient were assigned together in one group into either the training or test dataset, respectively. Of note, if a patient had lesions of different categories concurrently, the lesions could belong to different datasets because lesions of different categories were randomized independently.

The training dataset was used to fine-tune pretrained CNN models to classify gastric lesions. The test dataset, which was not balanced to ensure a similar number of lesions in each category, was subsequently used to evaluate the performance of the CNN models. The best-performing model was validated during the next stage of the study and compared with endoscopists' performance.

Prospective validation dataset
Another unused dataset was collected from two different hospitals (Kangdong and Hallym University Sacred Heart Hospitals) to validate the established models and to compare their performance with that of three endoscopists. All still-cut white-light images of pathologically confirmed gastric lesions were prospectively collected from consecutive patients who underwent an upper endoscopy between December 2018 and February 2019, with the same exclusion criteria as those stated above. Finally, 200 images from 200 patients were selected for the prospective validation to compare the classifying performance of the established model with that of endoscopists.

After constructing the prospective validation dataset, three endoscopists (C.S.B., Y.J.Y., and J.T.H.) classified this dataset without knowing the final diagnosis. The mean experience of the endoscopists was 6.7 years (standard deviation [SD] 0.6). ▶ Table 1 shows the characteristics of the enrolled population in the prospective validation dataset.

Constructing CNN models
Three different CNN models were used in classifying endoscopic images, namely, Inception-v4, Resnet-152, and Inception-Resnet-v2. For all CNN models, pretrained models with the ImageNet dataset were adopted using transfer learning. Inception-v4 (https://arxiv.org/abs/1512.00567) is a revised version of CNN that achieved 21.2 % top-1 and 5.6 % top-5 error rates for single-frame evaluation on the ImageNet 2012 Challenge dataset, and was developed and released by Google, Inc. Resnet-152 (https://arxiv.org/abs/1603.05027) is an improved version of the deep residual network, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 competition and surpassed human performance on the ImageNet dataset. Inception-Resnet-v2 (https://arxiv.org/abs/1602.07261) is a variation of the Inception-v3 model that borrows some ideas from Microsoft's ResNet. The TensorFlow framework was adopted to implement all CNN models.

For all CNN models, a stochastic approximation of gradient descent optimization was done with the Adam optimizer. The initial learning rate, end learning rate, weight decay, and batch size were 0.01, 0.0001, 5e-05, and 30, respectively. The training dataset was preprocessed to enhance recognition performance by random cropping, resizing, flipping, and color adjustments implemented internally in each CNN model. A 5-fold cross-validation was carried out for all models, which means that the training set was further subdivided with a validation set for the selection of hyperparameters for each network. The hardware used for this study was NVIDIA's GeForce GTX 1080ti.

Main outcome measures
After constructing the CNN models using the training dataset, the performance of the models was evaluated using the test dataset and the prospective validation dataset. The main outcome


▶ Fig. 1 Confusion matrix for per-category sensitivity of the Inception-Resnet-v2 model. a The test dataset. b The prospective validation dataset. AGC, advanced gastric cancer; EGC, early gastric cancer; HGD, high grade dysplasia; LGD, low grade dysplasia; NON, non-neoplasm. Rows are the true category and columns the predicted category; cells show image counts (row percentages in parentheses).

a Test dataset:

| True \ Predicted | AGC       | EGC       | HGD      | LGD       | NON        |
|------------------|-----------|-----------|----------|-----------|------------|
| AGC              | 86 (79 %) | 10 (9 %)  | 2 (2 %)  | 1 (1 %)   | 10 (9 %)   |
| EGC              | 31 (17 %) | 97 (52 %) | 4 (2 %)  | 12 (6 %)  | 41 (22 %)  |
| HGD              | 11 (19 %) | 20 (34 %) | 0 (0 %)  | 13 (22 %) | 14 (24 %)  |
| LGD              | 4 (4 %)   | 25 (25 %) | 4 (4 %)  | 22 (22 %) | 44 (44 %)  |
| NON              | 9 (2 %)   | 13 (4 %)  | 4 (1 %)  | 2 (1 %)   | 333 (92 %) |

b Prospective validation dataset:

| True \ Predicted | AGC       | EGC       | HGD      | LGD       | NON       |
|------------------|-----------|-----------|----------|-----------|-----------|
| AGC              | 17 (61 %) | 2 (7 %)   | 0 (0 %)  | 0 (0 %)   | 9 (32 %)  |
| EGC              | 1 (2 %)   | 13 (28 %) | 0 (0 %)  | 8 (17 %)  | 24 (52 %) |
| HGD              | 1 (4 %)   | 3 (12 %)  | 0 (0 %)  | 7 (27 %)  | 15 (58 %) |
| LGD              | 0 (0 %)   | 11 (37 %) | 1 (3 %)  | 2 (7 %)   | 16 (53 %) |
| NON              | 1 (1 %)   | 2 (3 %)   | 0 (0 %)  | 0 (0 %)   | 67 (96 %) |

measurements were the classifying performance of the established models for the five categories, gastric cancer vs. non-cancer, and gastric neoplasm vs. non-neoplasm.

Statistical methods
To investigate the performance of the established CNN models, the area under the curve (AUC) was calculated. Furthermore, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were estimated from true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values. The following formulae were used to calculate performance parameters: sensitivity = TP/(TP + FN); specificity = TN/(FP + TN); PPV = TP/(TP + FP); NPV = TN/(FN + TN); and accuracy = (TP + TN)/(TP + FP + FN + TN).

Continuous variables are expressed as the mean (SD). Categorical variables are expressed as percentages with 95 % confidence intervals (CI). Fisher's exact test was used for the comparison of categorical variables, and the DeLong test was used for the comparison of AUC values [15]. A P value of < 0.05 was considered statistically significant and all tests were two-sided.

Analyses were performed using SPSS version 24.0 (IBM Corp., Armonk, New York, USA), R version 3.2.3 (R Foundation for Statistical Computing, Vienna, Austria), and Medcalc version 18.11.6 (Medcalc Software, Ostend, Belgium). The Fleiss' kappa statistic was calculated using a Microsoft Excel spreadsheet (http://www.ccitonline.org/jking/homepage/interrater.html, provided by Jason King, Ph.D.).

Results

Five-category classification performance
In the five-category classification, Inception-Resnet-v2 showed the best performance (accuracy 84.6 %, 95 %CI 83.69 % – 85.5 %) (weighted average of each class). The mean elapsed time for classifying one image in the test dataset was 0.0264 seconds (SD 0.0009). The change of validation accuracy by the number of epochs is presented in Fig. 1s. The performance reached a plateau after the number of training epochs reached 500.

The detailed per-category performance of the established models is described in Table 2s. The per-category AUC of the established models was highest for lesions with AGC (range 0.802 – 0.855), and lowest for lesions with HGD (range 0.491 – 0.522). Of note, the per-category sensitivity was highest for non-neoplasm lesions (range 74.2 % – 92.2 %); however, it was lowest for lesions with HGD (range 0 – 10.3 %). The confusion matrix for the per-category sensitivity of the Inception-Resnet-v2 model in the test dataset is presented in ▶ Fig. 1a.

Binary classification performance
The performance of the established models in classifying gastric lesions into cancer or neoplasm is presented in Table 3s. In determining whether the gastric lesion was cancer or not, Inception-Resnet-v2 showed the best performance (AUC 0.877, 95 %CI 0.851 – 0.901). The accuracy, sensitivity, and specificity in classifying gastric cancer were 81.9 % (95 %CI 79.3 % – 84.6 %), 75.9 % (95 %CI 79.3 % – 84.6 %), and 85.3 % (95 %CI 79.3 % – 84.6 %), respectively.
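The performance formulae listed under Statistical methods map directly onto code. The sketch below is illustrative only (not the study's analysis scripts): it applies the TP/FP/TN/FN definitions one-vs-rest to the ▶ Fig. 1a confusion matrix, with counts transcribed from the figure, and the helper names are assumptions for the example.

```python
def diagnostic_indices(tp, fp, tn, fn):
    """Performance parameters from TP/FP/TN/FN counts, per the formulae above."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Counts transcribed from Fig. 1a (test dataset).
# Rows = true category, columns = predicted category (AGC, EGC, HGD, LGD, NON).
matrix = [
    [86, 10, 2, 1, 10],    # AGC
    [31, 97, 4, 12, 41],   # EGC
    [11, 20, 0, 13, 14],   # HGD
    [4, 25, 4, 22, 44],    # LGD
    [9, 13, 4, 2, 333],    # NON
]

def one_vs_rest(matrix, k):
    """Collapse a multiclass confusion matrix to TP/FP/TN/FN for class k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp              # rest of the true-class row
    fp = sum(row[k] for row in matrix) - tp  # rest of the predicted-class column
    tn = total - tp - fn - fp
    return tp, fp, tn, fn

agc = diagnostic_indices(*one_vs_rest(matrix, 0))
```

Note that the per-category sensitivity this yields for AGC (86/109 ≈ 79 %) matches the row percentage shown in ▶ Fig. 1a, and the matrix totals 812 images, matching the test dataset size.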



For the classification of gastric neoplasms, Inception-Resnet-v2 showed the best performance (AUC 0.927, 95 %CI 0.908 – 0.944), and its accuracy, sensitivity, and specificity were 85.5 % (95 %CI 83.3 % – 87.8 %), 84.0 % (95 %CI 80.5 % – 87.3 %), and 87.3 % (95 %CI 83.8 % – 90.5 %), respectively. The AUCs for the binary classification of gastric lesions into cancer or neoplasm are presented in ▶ Fig. 2.

▶ Fig. 2 Area under the curve for the prediction of: a gastric cancer; b gastric neoplasm. AUC, area under the curve; ROC, receiver operating characteristic. (Panels show ROC curves, true positive rate vs. false positive rate, for inception_v4, resnet_v2_152, and inception_resnet_v2.)

Prospective validation
The prospective validation dataset comprised 200 images including 28 AGCs, 46 EGCs, 26 HGDs, 30 LGDs, and 70 non-neoplasms (74 cancers vs. 126 non-cancers; 130 neoplasms vs. 70 non-neoplasms). Detailed characteristics of the enrolled patients are shown in ▶ Fig. 3 and ▶ Table 1.

In classifying the prospective validation dataset into five categories, the endoscopist with the best performance showed an accuracy of 87.6 % (95 %CI 84.3 % – 90.9 %), whereas the Inception-Resnet-v2 model had an accuracy of only 76.4 % (95 %CI 72.1 % – 80.7 %) (weighted average of each category). The performance was significantly higher for the endoscopist with the best performance than for the established model (P < 0.001). The detailed per-category performance of the endoscopists and the Inception-Resnet-v2 model is described in ▶ Table 2. The confusion matrix for the per-category sensitivity of the Inception-Resnet-v2 model in prospective validation is described in ▶ Fig. 1b.

For the per-category performance of the five categories, endoscopists commonly showed the highest performance in the diagnosis of AGC (accuracy range 98.5 % – 99.5 %) and the second highest diagnostic performance in the diagnosis of non-neoplasm (accuracy range 85.5 % – 89.5 %). However, the diagnostic performance for LGD and HGD was lower than that for other lesions (accuracy range 80 % – 85.5 %), which was common not only among endoscopists but also with the Inception-Resnet-v2 model ( ▶ Table 2).

In determining whether the gastric lesion was cancer or not, the endoscopist with the best performance showed an accuracy of 97.5 % (95 %CI 94.3 % – 99.2 %), whereas the Inception-Resnet-v2 model had an accuracy of only 76.0 % (95 %CI 69.5 % – 81.7 %). The performance was significantly higher for the endoscopist with the best performance than for the established model (P < 0.001). However, there was no statistical difference in performance in differentiating gastric cancer between the model and the two remaining endoscopists (accuracy 76.0 % [95 %CI 69.5 % – 81.7 %] vs. 82.0 % [95 %CI 76.0 % – 87.1 %] and 82.5 % [95 %CI 76.5 % – 87.5 %]). The detailed per-category performance is described in ▶ Table 3.

For the classification of gastric neoplasms, the endoscopist with the best performance showed an accuracy of 96.5 % (95 %CI 92.9 % – 98.6 %), whereas the Inception-Resnet-v2 model had an accuracy of only 73.5 % (95 %CI 66.8 % – 79.5 %). The performance was significantly higher for the endoscopist with the best performance than for the established model (P < 0.001). However, there was no statistical difference in differentiating neoplasm compared with the endoscopist with the worst performance (AUC 0.776 [95 %CI 0.712 – 0.832] vs. 0.865 [95 %CI 0.810 – 0.909]). The detailed per-category performance is described in ▶ Table 3.

Interrater reliability between the three endoscopists (Fleiss' kappa) was 0.61 for the classification of five categories (P < 0.001), 0.64 for the classification of cancer (P < 0.001), and 0.70 for the classification of neoplasm (P < 0.001), which all represent substantial agreement.

▶ Fig. 3 Flow diagram of prospective validation. AGC, advanced gastric cancer; EGC, early gastric cancer; HGD, high grade dysplasia; LGD, low grade dysplasia.

Kangdong Sacred Heart hospital: upper gastrointestinal endoscopy performed (n = 2564; December 2018 n = 1200, January 2019 n = 571, February 2019 n = 793). Detected gastric neoplasms (n = 59): AGC n = 15 (25.4 %), EGC n = 18 (30.5 %), HGD n = 18 (30.5 %), LGD n = 8 (13.6 %). Exclusion of images with poor quality or low resolution (n = 6, 10.2 %; 2 in EGC, 4 in HGD category), leaving gastric neoplasms (n = 53): AGC n = 15 (28.3 %), EGC n = 16 (30.2 %), HGD n = 14 (26.4 %), LGD n = 8 (15.1 %). Selection of images with non-neoplasm category (n = 35).

Hallym University Sacred Heart hospital: upper gastrointestinal endoscopy performed (n = 3013; December 2018 n = 1268, January 2019 n = 954, February 2019 n = 791). Detected gastric neoplasms (n = 85): AGC n = 13 (15.3 %), EGC n = 32 (37.6 %), HGD n = 18 (21.2 %), LGD n = 22 (25.9 %). Exclusion of images with poor quality or low resolution (n = 8, 9.4 %; 2 in EGC, 6 in HGD category), leaving gastric neoplasms (n = 77): AGC n = 13 (16.9 %), EGC n = 30 (39.0 %), HGD n = 12 (15.6 %), LGD n = 22 (28.6 %). Selection of images with non-neoplasm category (n = 35).

Final enrollment (n = 200): AGC n = 28 (14 %), EGC n = 46 (23 %), HGD n = 26 (13 %), LGD n = 30 (15 %), non-neoplasm n = 70 (35 %).

Discussion

This study established the value of high performance models for the classification of gastric lesions, presenting an in situ probability of lesions, categorized into certain types, necessitating a targeted biopsy. During endoscopic screening procedures, these models may assist endoscopists in predicting the histology of ambiguous lesions and determining diagnostic or therapeutic strategies. Targeted biopsy of the presumed neoplastic lesion is the key technique for future therapeutic plans. In the case of ambiguous lesions, when it is difficult to distinguish between neoplastic and non-neoplastic disease, sufficient tissue needs to be obtained for accurate diagnosis [5]. However, endoscopic biopsy is an invasive procedure that can result in mucosal damage and hemorrhage [5, 16]. Moreover, repeated biopsies or multiple biopsies over a wide area can lead to submucosal fibrosis, which impedes therapeutic procedures such as endoscopic submucosal dissection [5]. Therefore, precise prediction of the histological diagnosis during the endoscopic examination would decrease unnecessary biopsies [5].

The deep-learning models have the potential for clinical application during endoscopic procedures, although they cannot replace the procedure itself at the current time. Improving diagnostic ability during visual inspection is a constant goal for endoscopists. Although image-enhanced endoscopy has been widely adopted in clinical practice, it is not a perfect tool and more can be done to improve the accuracy of diagnosis. Automatic classification and diagnosis of ambiguous lesions can reduce unnecessary biopsies, procedures, or procedure-related adverse events [13].

Studies on the classification of gastric lesions by CNNs or other machine-learning models using endoscopic images have been limited to date. A previous study by Hirasawa et al. showed that 71 of 77 gastric cancers were correctly classified by the established CNN, with an overall sensitivity of 92.2 % and a PPV of 30.6 % [10]. However, the performance of this model might be overestimated because even when the CNN detected only one gastric cancer in multiple images of the same lesion, the answer was still considered to be correct [10]. The authors also claimed that all of the lesions that were missed by the CNN were superficially depressed and differentiated intramucosal cancers (corresponding to HGDs in our study) that


▶ Table 2 Per-category diagnostic performance of three endoscopists and the established convolutional neural network model on endoscopic images
in the prospective validation dataset.

Model Diagnostic performance, % (95 %CI) AUC (95 %CI)

Accuracy Sensitivity Specificity PPV NPV

Endoscopist 1

▪ AGC 99.5 (97.3 – 99.9) 96.4 (81.7 – 99.9) 100 (97.9 – 100) 100 99.4 (96.2 – 99.9) 0.982 (0.953 – 0.996)

▪ EGC 82 (76.0 – 87.1) 56.5 (41.1 – 71.1) 89.6 (83.7 – 93.9) 61.9 (48.9 – 73.4) 87.3 (83.2 – 90.6) 0.731 (0.664 – 0.791)

▪ HGD 84 (78.2 – 88.8) 30.8 (14.3 – 51.8) 92.0 (86.9 – 95.5) 36.4 (21.0 – 55.1) 89.9 (87.3 – 92.0) 0.614 (0.542 – 0.681)

▪ LGD 84 (78.2 – 88.8) 50.0 (31.3 – 68.7) 90.0 (84.5 – 94.1) 46.9 (33.2 – 61.1) 91.1 (87.7 – 93.6) 0.700 (0.631 – 0.763)

▪ Non-neo- 89.5 (84.4 – 83.4) 90.0 (80.5 – 95.9) 89.2 (82.6 – 94.0) 81.8 (73.2 – 88.1) 94.3 (89.1 – 97.1) 0.896 (0.845 – 0.935)
plasm

Endoscopist 2

▪ AGC 99 (96.4 – 99.9) 100 (87.7 – 100) 98.8 (95.9 – 99.9) 93.3 (77.9 – 98.2) 100 0.994 (0.971 – 1.000)

Downloaded by: Georgetown University Medical Center. Copyrighted material.


▪ EGC 81 (74,9 – 86.2) 47.8 (32.9 – 63.1) 90.9 (85.2 – 94.9) 61.1 (46.7 – 73.8) 85.4 (81.5 – 88.5) 0.694 (0.625 – 0.757)

▪ HGD 81.5 (75.4 – 86.6) 30.8 (14.3 – 51.8) 84.2 (78.2 – 89.2) 21.6 (12.4 – 34.9) 89.6 (86.9 – 91.8) 0.575 (0.505 – 0.643)

▪ LGD 82.5 (76.5 – 87.5) 40 (22.7 – 59.4) 90 (84.5 – 94.1) 41.4 (27.4 – 57.0) 89.5 (86.3 – 92.0) 0.650 (0.580 – 0.716)

▪ Non-neo- 89 (83.8 – 93.0) 90 (80.5 – 95.9) 88.5 (81.7 – 93.4) 80.8 (72.2 – 87.2) 94.3 (89.0 – 97.1) 0.892 (0.841 – 0.932)
plasm

Endoscopist 3

▪ AGC 98.5 (95.7 – 99.7) 92.9 (76.5 – 99.1) 99.4 (96.8 – 99.9) 96.3 (78.6 – 99.5) 98.8 (95.7 – 99.7) 0.961 (0.924 – 0.983)

▪ EGC 81.5 (75.4 – 86.6) 50 (34.9 – 65.1) 90.9 (85.2 – 94.9) 62.2 (48.0 – 74.5) 85.9 (81.9 – 89.1) 0.705 (0.636 – 0.767)

▪ HGD 85.5 (79.8 – 90.1) 7.7 (0.9 – 25.1) 97.1 (93.4 – 99.1) 28.6 (7.6 – 66.2) 87.6 (86.3 – 88.8) 0.524 (0.452 – 0.595)

▪ LGD 80 (73.8 – 85.3) 56.7 (37.4 – 74.5) 84.1 (77.7 – 89.3) 38.6 (28.3 – 50.1) 91.7 (87.9 – 94.3) 0.704 (0.635 – 0.766)

▪ Non-neoplasm 85.5 (79.8 – 90.1) 90 (80.5 – 95.9) 83.1 (75.5 – 89.1) 74.1 (66.0 – 80.9) 93.9 (88.4 – 96.9) 0.865 (0.810 – 0.909)

Inception-Resnet-v2

▪ AGC 93.0 (88.5 – 96.1) 60.7 (40.6 – 78.5) 98.3 (95.0 – 99.6) 85.0 (64.0 – 94.8) 93.9 (90.6 – 96.1) 0.795 (0.732 – 0.849)

▪ EGC 74.5 (67.9 – 80.4) 28.3 (16.0 – 43.5) 88.3 (82.2 – 92.9) 41.9 (27.7 – 57.6) 80.5 (77.3 – 83.3) 0.583 (0.511 – 0.652)

▪ HGD 86.4 (80.9 – 90.9) 0 (0 – 13.2) 99.4 (96.8 – 99.9) 0 86.9 (86.7 – 87.0) 0.497 (0.426 – 0.569)

▪ LGD 78.5 (72.2 – 84.0) 6.7 (0.8 – 22.1) 91.2 (85.9 – 95.0) 11.8 (3.1 – 35.6) 84.7 (83.3 – 86.0) 0.489 (0.418 – 0.561)

▪ Non-neoplasm 66.5 (59.5 – 73.0) 95.7 (88.0 – 99.1) 50.8 (41.9 – 59.6) 51.1 (46.6 – 55.7) 95.7 (87.8 – 98.5) 0.732 (0.665 – 0.792)

CI, confidence interval; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve; AGC, advanced gastric cancer; EGC, early gastric
cancer; HGD, high grade dysplasia; LGD, low grade dysplasia.
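The per-category figures in ▶Table 2 are one-vs-rest statistics: for each category, that class is treated as "positive" and the remaining four as "negative" in the 5 × 5 confusion matrix. As a minimal sketch of how such figures are derived (the counts below are hypothetical and not the study's data):

```python
def one_vs_rest_metrics(confusion, k):
    """Per-category diagnostic metrics from a square confusion matrix.

    confusion[i][j] = number of images with true class i predicted as class j.
    Category k is treated as "positive"; all other categories as "negative".
    """
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k][j] for j in range(n)) - tp        # positives missed
    fp = sum(confusion[i][k] for i in range(n)) - tp        # false alarms
    tn = sum(sum(row) for row in confusion) - tp - fn - fp  # correct rejections
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "ppv":         tp / (tp + fp) if tp + fp else float("nan"),
        "npv":         tn / (tn + fn) if tn + fn else float("nan"),
    }

# Hypothetical 3-class confusion matrix (illustration only):
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 3, 7]]
print(one_vs_rest_metrics(cm, 0))  # class 0 vs. the rest
```

This also makes explicit why per-category accuracy can stay high (many true negatives) even when sensitivity for a rare category such as HGD is low.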

were difficult to distinguish from gastritis, even for experienced endoscopists [10]. However, 69.4 % of the lesions that the CNN diagnosed as gastric cancer were benign, and the most common reasons for misdiagnosis were gastritis, atrophy, and intestinal metaplasia (corresponding to non-neoplasm in our study), all of which are very common in clinical practice [10]. The CNN models in our study showed the highest per-category sensitivity and NPV for non-neoplasm lesions in the prospective validation, which is presumed to be unaffected by the limitation of the previous study.

In terms of per-category performance, endoscopists demonstrated near-perfect performance in the diagnosis of AGC, and the diagnostic performance for non-neoplasm also reached a substantial level. However, diagnostic performance for LGD or HGD was lower than that for other lesions, which was common not only among the endoscopists but also with the Inception-Resnet-v2 model. AGC is a lesion that should not be missed by endoscopists because it is associated with significant mortality. Recognition of the characteristic morphology of AGC is emphasized during endoscopy training, and the results of prospective validation are supposed to reflect endoscopists' alertness for suspected AGC lesions. Dysplasia, by contrast, is defined as a mucosal lesion exhibiting cytological atypia and is categorized as low or high grade depending on


▶ Table 3 Diagnostic performance of endoscopists and the established convolutional neural network model in classifying gastric cancer or neoplasm on
endoscopic images in the prospective validation dataset.

Model Diagnostic performance, % (95 %CI) AUC (95 %CI)

Accuracy Sensitivity Specificity PPV NPV

Cancer or non-cancer

▪ Endoscopist 1 97.5 (94.3 – 99.2) 93.2 (84.9 – 97.8) 100 (97.1 – 100) 100 96.2 (91.5 – 98.3) 0.966 (0.931 – 0.987)

▪ Endoscopist 2 82.5 (76.5 – 87.5) 74.3 (62.8 – 84.8) 87.3 (80.2 – 92.6) 77.5 (68.1 – 84.7) 85.3 (79.6 – 89.6) 0.808 (0.747 – 0.860)

▪ Endoscopist 3 82 (76.0 – 87.1) 68.9 (57.1 – 79.2) 89.7 (83.0 – 94.4) 79.7 (69.6 – 87.0) 83.1 (77.7 – 87.4) 0.793 (0.730 – 0.847)

▪ Inception-Resnet-v2 76 (69.5 – 81.7) 50 (38.1 – 61.9) 91.3 (84.9 – 95.6) 77.1 (64.7 – 86.1) 75.7 (71.1 – 79.7) 0.706 (0.638 – 0.768)

Neoplasm or non-neoplasm

▪ Endoscopist 1 96.5 (92.9 – 98.6) 94.6 (89.2 – 97.8) 100 (97.9 – 100) 100 96 (92.2 – 98.0) 0.973 (0.948 – 0.988)

▪ Endoscopist 2 87.5 (82.1 – 91.7) 88.5 (81.7 – 93.4) 85.7 (75.3 – 92.9) 92 (86.6 – 95.3) 80 (71.1 – 86.7) 0.871 (0.816 – 0.914)

▪ Endoscopist 3 85.5 (79.8 – 90.1) 83.1 (75.5 – 89.1) 90 (80.5 – 95.9) 93.9 (88.4 – 96.9) 74.1 (66.0 – 80.9) 0.865 (0.810 – 0.909)

▪ Inception-Resnet-v2 73.5 (66.8 – 79.5) 63.8 (55.0 – 72.1) 91.4 (82.3 – 96.8) 93.3 (86.4 – 96.8) 57.7 (51.7 – 63.4) 0.776 (0.712 – 0.832)

CI, confidence interval; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve.

the degree of atypia at the cellular level. Endoscopic findings appear varied but are most commonly flat or elevated lesions, which are hard to differentiate from EGC or even from non-neoplastic lesions.

Our established models displayed weakness, especially in the classification of HGD. The reason for this relatively lower performance in the five-category classification during prospective validation is presumed to be the difficulty of diagnosing HGD (the AUCs of the established models for the classification of HGD were commonly the lowest in both the test dataset and the prospective validation dataset). In real clinical practice, it is nearly impossible to accurately differentiate between HGD and EGC; therefore, endoscopic ultrasound, image-enhanced endoscopy, or even confocal endomicroscopy is employed to resolve this issue. In addition, the number of HGD images in the validation dataset was the lowest of the five categories, which could also have affected the accuracy data. Establishing models that predict the depth of invasion of lesions and enrolling more HGD cases could resolve this issue and enhance machine learning.

The strength of this study is the enrollment of endoscopic images obtained from endoscopies performed in multiple hospitals over a long-term period, together with the prospective validation, which was conducted to reflect the real practice patterns of endoscopists in Korea. Moreover, these models attempted to reduce the false-positive rate by presenting the probability of a lesion for all types of gastric neoplasm (five categories or two categories) rather than giving only one definitive diagnosis. In addition, binary classification as cancer vs. non-cancer, or neoplasm vs. non-neoplasm, would give endoscopists on-site information for the accurate prediction of gastric lesions and would help in determining the necessity for a biopsy specimen.

Despite these strengths, the study has several limitations. First, the pitfalls inherent in retrospective studies make it difficult to exclude selection bias. Some of the included images, taken with an older endoscopy system, had low brightness/resolution compared with the recently adopted system. Second, the performance of the CNNs presented in this study might be influenced by the composition of the database (so-called spectrum bias), although the database enrolled consecutive patients. Third, the pathological classification of lesions into five categories could differ in areas outside of Korea. No generally accepted definition has been created for differentiating gastric epithelial dysplasia from cancer, especially between Japanese and Western pathologists [17]. Although a revised Vienna classification has been proposed to address the inconsistent diagnosis of gastric epithelial dysplasias, category 4 lesions (HGD and intramucosal cancer) could still be placed in the EGC category in some countries [17]. Therefore, the diagnostic performance could change if the coding of images differed. Fourth, the binary classification performance of the established model in prospective validation was also lower than that of the best-performing endoscopist. Considering the relatively lower number of neoplasms compared with non-neoplastic lesions in the training dataset, this performance could be enhanced through the enrollment of a larger number of images with more balanced data, as we did not perform a class-balancing process in this study. Fifth, we used the JPEG format for model establishment. This compression standard is typically “lossy” and contains several user-defined settings that affect image quality. Although JPEG was the only format that could be collected in the multicenter setting owing to technical problems, this format could induce a bias in terms of image quality. We established an unused dataset of PNG files and validated the model in a prospective manner; further studies using only the TIFF or PNG file format would avoid this type of bias. Sixth, the mechanism by which the CNN classifies the lesions is not understood, although these models showed high diagnostic performance. This could be elucidated through research that explores the mechanism by dividing the images into various morphological factors that can be objectified. In summary, technical novelty and measures to minimize bias could improve the performance of the established models using datasets under more realistic conditions.

In conclusion, the proposed CNNs, which classified gastric cancers/neoplasms on white-light images, displayed high performance comparable to that of experienced endoscopists. These evaluated models have the potential for in situ add-on testing for the accurate prediction of gastric lesions.

Acknowledgments

This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) and funded by the Korean government, Ministry of Science and ICT (MSIT) (grant number NRF2017M3A9E8033207).

Competing interests

None.

References

[1] World Health Organization Fact Sheets on Cancer. Available from: http://www.who.int/mediacentre/factsheets/fs297/en Accessed: 13 August 2018
[2] de Vries AC, van Grieken NC, Looman CW et al. Gastric cancer risk in patients with premalignant gastric lesions: a nationwide cohort study in the Netherlands. Gastroenterology 2008; 134: 945 – 952
[3] Jun JK, Choi KS, Lee HY et al. Effectiveness of the Korean National Cancer Screening Program in reducing gastric cancer mortality. Gastroenterology 2017; 152: 1319 – 1328
[4] Bang CS, Baik GH, Shin IS et al. Endoscopic submucosal dissection for early gastric cancer with undifferentiated-type histology: a meta-analysis. World J Gastroenterol 2015; 21: 6032 – 6043
[5] Bang CS, Baik GH, Kim JH et al. Effect of training in upper endoscopic biopsy. Korean J Helicobacter Up Gastrointest Res 2015; 15: 33 – 38
[6] Park JM, Huo SM, Lee HH et al. Longer observation time increases proportion of neoplasms detected by esophagogastroduodenoscopy. Gastroenterology 2017; 153: 460 – 469
[7] Muguruma N, Miyamoto H, Okahisa T et al. Endoscopic molecular imaging: status and future perspective. Clin Endosc 2013; 46: 603
[8] Cotton PB, Barkun A, Ginsberg G et al. Diagnostic endoscopy: 2020 vision. Gastrointest Endosc 2006; 64: 395 – 398
[9] Cohen J, Safdi MA, Deal SE et al. Quality indicators for esophagogastroduodenoscopy. Am J Gastroenterol 2006; 101: 886 – 891
[10] Hirasawa T, Aoyama K, Tanimoto T et al. Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images. Gastric Cancer 2018; 21: 653 – 660
[11] Itoh T, Kawahira H, Nakashima H et al. Deep learning analyzes Helicobacter pylori infection by upper gastrointestinal endoscopy images. Endosc Int Open 2018; 6: E139 – E144
[12] Chen PJ, Lin MC, Lai MJ et al. Accurate classification of diminutive colorectal polyps using computer-aided analysis. Gastroenterology 2018; 154: 568 – 575
[13] Komeda Y, Handa H, Watanabe T et al. Computer-aided diagnosis based on convolutional neural network system for colorectal polyp classification: preliminary experience. Oncology 2017; 93 (Suppl. 01): 30 – 34
[14] Jisu H, Bo-Yong P, Hyunjin P. Convolutional neural network classifier for distinguishing Barrett’s esophagus and neoplasia endomicroscopy images. Conf Proc IEEE Eng Med Biol Soc 2017; 2017: 2892 – 2895
[15] DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44: 837 – 845
[16] Anderson MA, Ben-Menachem T, Gan SI et al. Management of antithrombotic agents for endoscopic procedures. Gastrointest Endosc 2009; 70: 1060 – 1070
[17] Stolte M. The new Vienna classification of epithelial neoplasia of the gastrointestinal tract: advantages and disadvantages. Virchows Arch 2003; 442: 99 – 106

