Research Paper 01
Background: Using deep learning techniques in image analysis is a dynamically emerging field. This
study aims to use a convolutional neural network (CNN), a deep learning approach, to automatically classify
esophageal cancer (EC) and distinguish it from premalignant lesions.
Methods: A total of 1,272 white-light images were collected from 748 subjects, including normal cases,
premalignant lesions, and cancerous lesions; 1,017 images were used to train the CNN, and another 255
images were used to evaluate the CNN architecture. Our proposed CNN structure consists of two
subnetworks (O-stream and P-stream). The original images were used as the inputs of the O-stream to
extract the color and global features, and the pre-processed esophageal images were used as the inputs of the
P-stream to extract the texture and detail features.
Results: The CNN system we developed achieved an accuracy of 85.83%, a sensitivity of 94.23%, and
a specificity of 94.67% after the fusion of the 2 streams was accomplished. The classification accuracies for the
normal esophagus, premalignant lesions, and EC were 94.23%, 82.5%, and 77.14%, respectively, which
represents a better performance than the Local Binary Patterns (LBP) + Support Vector Machine (SVM) and
Histogram of Gradient (HOG) + SVM methods. A total of 8 of the 35 (22.85%) EC lesions were categorized
as premalignant lesions because of their slightly reddish color and flat appearance.
Conclusions: The CNN system, with 2 streams, demonstrated high sensitivity and specificity with the
endoscopic images. It obtained better detection performance than the currently used methods based on the
same datasets and has great application prospects in assisting endoscopists to distinguish esophageal lesion
subclasses.
Keywords: Esophageal cancer (EC); endoscopic diagnosis; convolutional neural network (CNN); deep learning
Submitted Nov 11, 2019. Accepted for publication Feb 21, 2020.
doi: 10.21037/atm.2020.03.24
View this article at: http://dx.doi.org/10.21037/atm.2020.03.24
Figure 1 Sample images of the three types used in the CNN system. CNN, convolutional neural network. The red boxes indicate the location of the lesion.
and precancerous lesions included low-grade dysplasia and high-grade dysplasia.

Data preprocessing

The esophageal images were rescaled to 512×512 through a bilinear interpolation method to reduce the computational complexity (26).

Brightness variation of the endoscopic esophageal images might lead to intraclass differences, which can affect the results of the proposed network. Therefore, instead of the original endoscopic images, the following contrast-enhanced images were used as the inputs for the CNN:

I′(x, y; σ) = αI(x, y) + βG(x, y; ε) ∗ I(x, y) + γ [1]

A large difference and a clear “boundary effect” were observed between the foreground and background of the images. Images were cropped to 90% to eliminate the boundary effect. The original and preprocessed images are shown in Figure 2.

Data augmentation

To overcome overfitting on our small-scale esophageal image set, we adopted the following data augmentation measures before training the network. In the training dataset, spatial translation of 0–10 pixels, horizontal and vertical flipping, and slight shifting of between −10 and 10 pixels were employed (Figure 3).
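As a concrete illustration, the following is a minimal sketch (not the authors' released code) of the contrast enhancement in Eq. [1] and of the flip-and-shift augmentation described above. The coefficients α, β, γ and the Gaussian width σ are placeholders, since their values are not listed here, and a color H×W×3 image is assumed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift as ndi_shift

def enhance_contrast(img, alpha=4.0, beta=-4.0, gamma=128.0, sigma=10.0):
    """Eq. [1]: I'(x, y) = alpha*I + beta*(G_sigma * I) + gamma, per channel.
    The coefficient values are illustrative placeholders."""
    img = img.astype(np.float32)
    blurred = gaussian_filter(img, sigma=(sigma, sigma, 0))  # Gaussian smoothing of each channel
    enhanced = alpha * img + beta * blurred + gamma
    return np.clip(enhanced, 0, 255).astype(np.uint8)

def augment(img, max_shift=10, rng=None):
    """Random horizontal/vertical flip plus a shift of -10..10 pixels (Figure 3)."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]                      # vertical flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return ndi_shift(img, shift=(dy, dx, 0), order=0, mode="constant")
```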
Figure 3 Data augmentation with flipping (B) and mirroring (C) of the original image (A).
Figure 4 The exemplary architecture of the basic CNN. CNN, convolutional neural network.
The convolutional layer's main function was to extract the features of the image information from the upper layer. Convolution operations use local perception and weight sharing to reduce parameters. The calculation formula of the convolution layer was as follows:

x_j^L = f(x^{L−1} ∗ w_j^L + b_j^L) [2]

where x_j^L represents the output feature map of the j-th convolution kernel in the L-th layer, x^{L−1} represents the input feature map from the (L−1)-th layer, "∗" represents the convolution operation, w_j^L and b_j^L represent the weight and bias of the j-th convolutional kernel in the L-th layer, and f(·) represents the activation function. In this study, the ReLU activation function was used to alleviate the gradient dispersion problem.

The pooling layer performed dimensionality reduction on an input feature map, reduced parameters, and retained the main feature information. The layer also improved the robustness of the network structure to transformations such as rotation, translation, and stretching of images. The calculation formula of the pooling layer was as follows:

x_j^L = f(β_j^L down(x_j^{L−1}) + b_j^L) [3]

where down(·) represents a down-sampling function, and β and b represent weight and bias, respectively. In this study, we selected average pooling, which is defined as the following:

down(x_{m×m}) = mean(Σ_{a=1}^{m} Σ_{b=1}^{m} x_{ab}) [4]

Fully connected layer FC(c): each unit of the feature maps in the upper layer is connected with the c units of the fully connected layer. An output layer follows the fully connected layer.

The Softmax layer was used to normalize the input feature values into the range (0, 1) so that the output values y_m represented the probability of each category. The operation for the Softmax layer can be written as the following:

y_m = exp(θ_m x) / Σ_{m=1}^{n} exp(θ_m x) [5]

where y_m is the output probability of the m-th class, θ_m is the weight parameter of the m-th class, n is the total number of classes, and x represents the input neurons of the upper layer.
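To make these building blocks concrete, the following is a minimal tf.keras sketch, assuming the TensorFlow framework used later in the paper; the filter counts and layer depths are illustrative placeholders rather than the authors' configuration. It stacks convolution with ReLU (Eq. [2]), average pooling (Eqs. [3]-[4]), a fully connected layer, and a Softmax output over the three classes (Eq. [5]).

```python
import tensorflow as tf

def basic_cnn(input_shape=(512, 512, 3), num_classes=3):
    """Conv + ReLU (Eq. [2]), average pooling (Eqs. [3]-[4]), FC layer, Softmax (Eq. [5])."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),             # fully connected layer FC(256)
        tf.keras.layers.Dense(num_classes, activation="softmax"),  # class probabilities
    ])

model = basic_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```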
Construction of the two-stream CNN algorithm

A deep neural network structure called Inception-ResNet was employed to construct a reliable AI-based diagnostic system. Inception-ResNet achieved the best results of the moment in the ILSVRC image classification benchmark in 2017 (27). The proposed structure consists of 2 streams: the O-stream and the P-stream.

Inception networks can effectively solve the problem of computational complexity, and the ResNet network can reduce overfitting when the network becomes deeper. The Inception-ResNet network, which combines the Inception network with the ResNet network, achieves an improved performance on the test set of the ImageNet classification challenge (28). Figure 5 shows the basic structure of the Inception-ResNet module.

Figure 5 The basic structure of the Inception-ResNet module.

For clarity, H_L(x) denotes the transformation of the L-th building block, x is the input of the L-th building block, and H_L(x) is the desired output. The residual block explicitly forces the output to fit the residual mapping; that is, the stacked nonlinear layers are forced to learn the following transformation:

F_L(x) = H_L(x) − x [6]

Therefore, the transformation for the L-th building block is the following:

H_L(x) = F_L(x) + x [7]

The classic Inception-ResNet module consists of 1×1, 1×3, and 3×1 convolutional layers. The 1×1 convolutional layer is used to reduce the channel number, and the 1×3 and 3×1 convolutional layers are employed to extract spatial features.

Figure 6 shows that the O-stream and the P-stream employ the same network structure to allow effective feature fusion. The O-stream takes the original image as input and focuses on extracting the global features of the esophageal images. The P-stream takes the preprocessed image as input and focuses on extracting the texture features of the esophageal images (Figure 6).

Figure 6 The structure of the proposed two-stream CNN, in which Inception-ResNet-V2 is used as the basic CNN structure. The input of the O-stream is the original image, and the input of the P-stream is the preprocessed image. CNN, convolutional neural network.
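As an illustration of Eqs. [6]-[7] and of the module in Figure 5, here is a minimal sketch of an Inception-ResNet-style residual block, assuming tf.keras; the filter counts follow the figure (192, 256, and a 2048-channel linear projection), but the block is a simplified stand-in rather than the actual Inception-ResNet-V2 implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_resnet_block(x, channels=2048):
    """A branch of 1x1, 1x3, and 3x1 convolutions produces the residual F_L(x),
    which is added back to the input x so that the block outputs H_L(x) = F_L(x) + x."""
    branch = layers.Conv2D(192, (1, 1), padding="same", activation="relu")(x)
    branch = layers.Conv2D(256, (1, 3), padding="same", activation="relu")(branch)
    branch = layers.Conv2D(256, (3, 1), padding="same", activation="relu")(branch)
    # 1x1 linear projection back to the input channel count so the shapes match for the sum
    residual = layers.Conv2D(channels, (1, 1), padding="same", activation=None)(branch)
    out = layers.Add()([x, residual])           # H_L(x) = F_L(x) + x   (Eq. [7])
    return layers.Activation("relu")(out)

# Example: one block applied to a 16 x 16 feature map with 2048 channels (illustrative shape)
inputs = tf.keras.Input(shape=(16, 16, 2048))
outputs = inception_resnet_block(inputs)
block = tf.keras.Model(inputs, outputs)
```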
The results of the proposed network and the sub-streams for EC classification are presented in Table 3. The fusion of the 2 streams gives the final results. For the proposed structure, concatenation fusion is employed. For clarity, we define a concatenation fusion function f, two feature maps x^a and x^b, and a fused feature map y, where x^a ∈ R^{H×W×D}, x^b ∈ R^{H×W×D}, and y ∈ R^{H′×W′×D′}, and where H, W, and D are the height, width, and number of channels of the feature maps. The concatenation fusion y = f_cat(x^a, x^b) stacks the 2 features at the same location (i, j) across the feature channels d:

y_{i,j,d} = x^a_{i,j,d},   y_{i,j,D+d} = x^b_{i,j,d} [8]

where y ∈ R^{H×W×2D}.

Learning parameters

The key to achieving promising results is training a model with the correct weight parameters, which influence the performance of the entire structure. In training, the weight

Experiments and validation parameters

The proposed approaches were implemented in the TensorFlow deep learning framework, running on a PC with an NVIDIA GeForce GTX 1080Ti GPU (8 G) (NVIDIA CUDA framework 8.0 and the cuDNN library).

Figure 7 Training curves (loss vs. training iteration) of the proposed classification approach on the EC database. EC, esophageal cancer.

To eliminate contingencies in the classification results and to evaluate the performance of the proposed EC model, the results were quantitatively evaluated using 3 metrics: accuracy (ACC), sensitivity (SEN), and specificity (SPEC), defined as the following:

Sen = TP / (TP + FN) [9]

Spec = TN / (TN + FP) [10]

Acc = (TP + TN) / (TP + TN + FP + FN) [11]

where true positive (TP) is the number of positive images correctly detected, true negative (TN) is the number of negative images correctly detected, false positive (FP) is the number of negative images wrongly detected as esophageal lesion images, and false negative (FN) is the number of positive samples misclassified as negative.
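The following is a minimal NumPy sketch of the concatenation fusion in Eq. [8] and of the three metrics in Eqs. [9]-[11]; the array shapes and the counts in the usage example are illustrative placeholders, not values from the paper.

```python
import numpy as np

def concat_fusion(xa, xb):
    """Eq. [8]: y = f_cat(x^a, x^b) stacks two H x W x D feature maps into an H x W x 2D map."""
    assert xa.shape == xb.shape
    return np.concatenate([xa, xb], axis=-1)

def sensitivity(tp, fn):
    return tp / (tp + fn)                     # Eq. [9]

def specificity(tn, fp):
    return tn / (tn + fp)                     # Eq. [10]

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)    # Eq. [11]

# Usage with illustrative values only
xa = np.zeros((8, 8, 1536))
xb = np.zeros((8, 8, 1536))
print(concat_fusion(xa, xb).shape)            # -> (8, 8, 3072)
print(sensitivity(tp=49, fn=3), specificity(tn=180, fp=10), accuracy(tp=49, tn=180, fp=10, fn=3))
```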
In the evaluation phase, all the metrics were calculated based on the five-fold cross-validation results. The dataset was divided into training (80%) and testing (20%) datasets. The detailed data distribution of the EC database is shown in Table 2.

Table 2 Data distribution of the EC database (number of images)
            Total   Normal   Precancerous lesion   Cancer
Validation  126     53       38                    35
Test        129     54       39                    36
EC, esophageal cancer.

Results

A total of 748 patients were included in this analysis. Table 1 presents the sizes and demographics of the database. Overall, no significant age difference was observed between males and females in each group. However, the normal control group was, on average, 15 years younger than the other two groups. The cancer and precancerous lesion groups had more males than females, and both groups were around 60 years old.

Table 1 Sizes and demographics of the database (n, mean age, SD)
         Males               Females             Total
Normal   114, 45.6, 15.4     171, 47.5, 12.9     285, 46.8, 13.9
Total    432, 57.8, 12.8     316, 53.3, 13.0     748, 56.0, 13.1

The comparative results of the proposed network and the sub-streams (the O-stream and the P-stream) in the database are listed in Table 3. This database contains all images, including those of the normal esophagus, precancerous lesions, and EC, and the results are the overall ACC, SEN, and SPEC of each method. The O-stream focuses on exploiting the color and global features of the esophageal images, and its ACC by itself was 66.93%. Using the preprocessed image as the input, the P-stream focuses on exploiting the texture and detailed features of the esophageal images, and the ACC of the P-stream alone was 79.53%. The fusion of the two streams led to the best result of 85.83%.

Table 3 Results of the proposed network and the sub-streams in the EC database
                     SEN (%)   SPEC (%)   ACC (%)
O-stream             98.08     85.33      66.93
P-stream             96.15     88.00      79.53
Proposed structure   94.23     94.67      85.83
EC, esophageal cancer; SEN, sensitivity; SPEC, specificity; ACC, accuracy.

Table 4 shows the ACC of each category in the EC database based on the proposed network. The normal type was easier to identify, probably because the amount of data for the normal type was greater than that for the other two types.

Table 4 Results of the proposed network in the EC database
       Normal    Precancerous lesion   Cancer
ACC    94.23%    82.50%                77.14%
EC, esophageal cancer; ACC, accuracy.

Figure 8 presents the confusion matrix for the EC database, in which the diagonal values are the correctly classified samples of each category, and the off-diagonal values show the confusion between two categories. The method diagnosed 74 lesions in total as esophageal lesions (precancerous lesions or cancer); 3 of these were normal cases, giving a positive predictive value (PPV) of 95.94% and a negative predictive value (NPV) of 92.45%. The PPV and the NPV of EC were 87.09% and 91.67%, respectively. The accuracy of the cancer category was 77.14%, which implies that EC is easily confused with precancerous lesions.

Figure 8 Confusion matrix (true label vs. predicted label over the Normal, Precancer, and Cancer classes) of the proposed structure in the EC database. EC, esophageal cancer.

Table 5 demonstrates a comparison between our proposed method and the LBP + SVM and HOG + SVM methods on the same dataset. The total sensitivity, specificity, and accuracy of our method were 94.23%, 94.67%, and 85.83%, respectively, which are higher than those of the other methods.
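To make the evaluation protocol at the start of this section concrete, here is a minimal scikit-learn sketch of an 80/20 split followed by five-fold cross-validation with accuracy computed on each held-out fold; the random feature matrix, labels, and the logistic-regression stand-in classifier are illustrative assumptions, not the authors' CNN pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression  # stand-in for the CNN classifier

rng = np.random.default_rng(0)
features = rng.normal(size=(1272, 64))    # placeholder per-image feature vectors
labels = rng.integers(0, 3, size=1272)    # 0 = normal, 1 = premalignant, 2 = EC

# 80% training / 20% testing split, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

# Five-fold cross-validation on the training portion
fold_acc = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X_train, y_train):
    clf = LogisticRegression(max_iter=1000).fit(X_train[train_idx], y_train[train_idx])
    fold_acc.append(accuracy_score(y_train[val_idx], clf.predict(X_train[val_idx])))

print("mean cross-validation accuracy:", np.mean(fold_acc))
```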
Discussion

Endoscopy plays a crucial role in the diagnosis of EC, which is the sixth leading cause of cancer-related death (1). However, diagnosing EC at an early stage by endoscopy is difficult and requires experienced endoscopists. An alternative approach to EC classification is to use a deep learning method, which is helpful and has been applied in various fields, such as computer vision (29) and pattern recognition (30). Deep learning methods achieve complex function approximation through a nonlinear network structure and show powerful learning abilities (31). Compared with traditional recognition algorithms, deep learning combines feature selection or extraction and classifier determination into a single step and can learn features automatically, reducing the manual design workload (32).

The CNN model is one of the most important deep learning models for computer vision and image detection. In a recent study, Hirasawa et al. achieved the automatic detection of gastric cancer in endoscopic images by using a CNN-based diagnostic system and obtained an overall sensitivity of 92.2% and a PPV of 30.6% (20). Sakai et al. proposed a CNN-based detection scheme and achieved high accuracy in classifying early gastric cancer and the normal stomach (33). Our study developed a CNN-based framework to classify esophageal lesions with an overall accuracy of 85.83%. The images were preprocessed first, then the features of the image information were extracted and the images were annotated manually; finally, these images were used to train the CNN model. The model was applied to distinguish the normal esophagus and premalignant lesions from EC.

According to our study, the trained network achieved an accuracy of 85.83%, a sensitivity of 94.23%, and a specificity of 94.67% with the fusion of the 2 streams. The accuracy rates for classifying the normal esophagus, premalignant lesions, and EC were 94.23%, 82.5%, and 77.14%, respectively. LBP + SVM and HOG + SVM are classical machine learning methods; compared with them, the system we presented achieved better results (Table 5). Therefore, the CNN system we proposed can readily determine whether samples contain esophageal lesions. In some cases, however, there were discrepancies between EC and precancerous esophageal lesions. For instance, 8 of the 35 (22.85%) EC lesions were diagnosed by the CNN as premalignant lesions. The most probable reason for this misdiagnosis was that the cancerous lesions were extremely localized within the precancerous lesions and their surface characteristics were not obvious. Other reasons may include cancer that was hard to detect on the surface or a poor angle at which the image was taken.

Table 5 Comparison of the proposed network with other methods
                  SEN (%)   SPEC (%)   ACC (%)
LBP + SVM         63.27     64.36      64.75
HOG + SVM         57.93     59.82      60.40
Proposed method   94.23     94.67      85.83
SEN, sensitivity; SPEC, specificity; ACC, accuracy; LBP, Local Binary Patterns; SVM, Support Vector Machine; HOG, Histogram of Gradient.

The main contributions of this paper are twofold. First, an esophageal endoscopic database was built. The database includes 1,272 endoscopic images of 3 types (normal, premalignant, and cancerous), and each image has a classification label. Second, we presented a two-stream CNN that can automatically extract global and local features from endoscopic images.

A significant strength of the study is that our proposed two-stream CNN consists of 2 subnetworks (the O-stream and the P-stream). The original images were input to the O-stream to extract the color and global features, and the preprocessed esophageal images were input to the P-stream to extract the texture and detail features. The advanced Inception-ResNet-V2 was adopted as our CNN framework. Finally, the two-stream CNN effectively fused the features of the two streams and achieved promising results.

This study had some limitations. First, the detection of EC was based on white-light images only. Designing a universal detection system with images under more views, such as narrow-band imaging (NBI) and chromoendoscopy using indigo carmine, is possible. Second, our sample size was small, and we obtained all endoscopic images from a single center. The type of endoscopy and its image resolution are highly variable across facilities. Therefore, we will obtain endoscopic images from other centers and use other types of endoscopy in future research. Third, the anatomical structure of the squamocolumnar junction was sometimes misdiagnosed as EC, an error that endoscopists are unlikely to make. If CNNs can learn more systematically about normal anatomical structures and various lesions, the accuracy of EC detection will improve in the future.

In future studies, we will add precise localization of lesion areas and video analysis to allow for real-time computer-aided diagnosis of esophageal tumors.