model can automatically choose whether to generate report text based on image features, sentence topics, or text features. (4) We verified the performance of chest lesion recognition and report generation based on the publicly available IU X-ray dataset (Open I) [18].

2. RELATED WORKS

2.1. Medical Imaging Datasets

In recent years, deep neural networks have shown great potential in challenging tasks of medical image processing [12,13]. The rapid improvement partly depends on publicly accessible medical imaging datasets that cover multiple modalities and various body parts with quality annotation. In particular, images concerning chest diseases, e.g., chest X-rays and chest CT scans, are commonly used for clinical screening and diagnosis, and account for a large proportion of public datasets.

For instance, the NIH released the ChestX-ray14 dataset for thoracic lesion detection [14]. The National Cancer Institute (NCI) released the LIDC–IDRI dataset for early cancer detection in high-risk populations [15] and Data Science Bowl 2017 [16], a high-resolution CT scan dataset for lung cancer prediction. Stanford University presented CheXpert [17], a large-scale dataset that contains 224,316 chest radiographs of 65,240 patients. OpenI [18] contains chest X-ray reports of 3,955 patients and 7,470 chest X-ray images, and has become the benchmark for current research on imaging report generation. Recently, MIT released MIMIC-CXR-JPG v2.0.0 [19], a large dataset of 377,110 chest X-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center. In addition, during the COVID-19 outbreak, many small-scale datasets were released for developing AI-based COVID-19 diagnosis models. For instance, Yang et al. built the open-source dataset COVID-CT [41], which contains 349 COVID-19 CT images from 216 patients and 463 non-COVID-19 CT images. Li et al. introduced COV-CTR [42], a COVID-19 CT report dataset that contains 728 images collected from published papers and their corresponding paired Chinese reports.

2.2. Thoracic Lesion Recognition

In the early stage of image recognition, feature extraction methods such as histogram of oriented gradients (HOG) and scale-invariant feature transform (SIFT) were mainly used, and the extracted features were classified and recognized through classifiers [43]. Early image recognition tasks targeted specific recognition objects, lacked generalization ability, and relied on small sample sizes, so it was difficult to meet high recognition requirements in practical applications.

Thoracic Lesion Recognition (TLR) has long been a research focus in CAD. According to the types of identified lesions, TLR methods can be divided into two categories. One is single thoracic lesion recognition (sTLR), which focuses on the imaging characteristics of a particular type of lesion. It can assist the early screening and diagnosis of a specific disease, e.g., pulmonary nodule detection [20,21]. The other is multiple thoracic lesion recognition (mTLR), which targets multiple types of diseases or lesions, such as pulmonary nodules, pneumonia, pneumothorax, pleural effusion, atelectasis, pulmonary abscess, and pulmonary tuberculosis. The mTLR is more consistent with the radiologists' way of reading images, and can better support comprehensive diagnosis.

There are commonly two steps in mTLR: (1) multi-label classification (MLC) of the thoracic lesions revealed in chest radiography; (2) thoracic lesion localization, which identifies the specific regions and profiles of abnormal lesions in chest radiography. In recent years, deep learning models have started to outperform conventional statistical learning approaches [43,44] in the TLR task. A representative work is CheXNet, developed by Rajpurkar et al. [22], a 121-layer dense convolutional neural network (dense CNN) which detects 14 chest diseases simultaneously based on the ChestX-Ray14 dataset. Bar et al. [23] used a pretrained CNN model to extract high-dimensional features of medical images, and combined them with general GIST features and bag-of-visual-words (BoVW) features as the input of a support vector machine (SVM) to detect thoracic lesions. Wang et al. [14] developed a deep convolutional neural network (DCNN) for mTLR. Yao et al. [24] constructed a DenseNet-long short-term memory (DenseNet-LSTM) model to identify the 14 thoracic lesions by utilizing latent correlations between different lesions in chest X-ray images.

2.3. Visual Captioning and Medical Image Report Generation

Visual captioning aims at generating a descriptive sentence for a given image or video. Most state-of-the-art methods generate sequences based on CNN-RNN architectures and attention mechanisms [45–47]. In addition to the one-sequence generation in early studies, some efforts have been made toward generating longer paragraphs [11], which inspires research on medical image report generation. However, medical image reports are more professional and informative than natural image captions, which poses a greater challenge for generating clinically readable reports. Shin et al. first proposed a variant of the CNN-RNN framework to predict lesion tags of chest X-ray images [25]. Wang et al. [26] developed Latent Dirichlet Allocation-based topic models for imaging report generation. Kisilev et al. [27] proposed a CNN-based method for generating reports of classified mammography images. Wang et al. proposed the TieNet model [28], integrating a multi-attention model into the end-to-end CNN-RNN framework for performing disease classification and generating simple imaging reports. Jing et al. [29] constructed a hierarchical language model equipped with co-attention to better model the paragraphs, but it tends to produce normal findings. They went further to explore the complex structures of reports, and proposed a two-stage strategy that models the relationship between Findings and Impression [48]. Li et al. [30] proposed KERP, a knowledge-driven imaging report generation model, which constructed a graph transformer (GTR) for the dynamic transformation of text features and image features.

The difference between our proposed model and existing methods lies in that we classify chest X-rays into healthy or abnormal individuals based on the MLC module, and then combine report templates with a multi-attention-based hierarchical LSTM model to generate reports according to the nature of the given image (healthy/abnormal). To address the problem that nonvisual feature words are difficult to align with image features, TMRGM generates visual words and nonvisual words separately, based on features from different modalities.
Figure 1 Overview of the framework of the proposed template-based multi-attention report generation model (TMRGM).
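To make the decision flow in Figure 1 concrete, the sketch below outlines how a TMRGM-style pipeline could branch between the report template and the generative model. It is illustrative only: the module names (mlc_model, template_report, hierarchical_lstm_report) and their interfaces are hypothetical stand-ins for the components described in this paper, not the authors' released code.

```python
import torch


def generate_report(image: torch.Tensor, mlc_model, template_report, hierarchical_lstm_report):
    """Illustrative TMRGM-style report generation flow (hypothetical module names).

    1. The multi-label classification (MLC) module predicts probabilities for the
       semantic labels and returns the visual features of the chest X-ray.
    2. If the "Normal" label ranks first, the image is treated as a healthy
       individual and the fixed Findings/Impression template is returned.
    3. Otherwise, the top-10 predicted lesion labels and the image features are fed
       into the multi-attention hierarchical LSTM to generate the report text.
    """
    label_probs, image_features = mlc_model(image)     # per-label probabilities + image features
    normal_idx = mlc_model.label_index["normal"]       # position of the "Normal" label (assumed attribute)
    if label_probs.argmax().item() == normal_idx:      # "Normal" has the highest probability -> healthy branch
        return template_report()                       # report template for healthy individuals
    top10 = label_probs.topk(10).indices               # top-10 semantic labels used as lesion tags
    return hierarchical_lstm_report(image_features, top10)
```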
employed the co-attention mechanism to fuse the image features extracted by ResNet152 and the text features of the thoracic lesion labels predicted by the MLC module. The feature fusion model assigns corresponding weights to different image regions while generating sentences, so that it can focus on the related image regions and the thoracic lesion labels.

In particular, we first define a sentence LSTM model. At time t, let the image feature be $V$, the embedding vector of the 10 predicted thoracic lesion labels be $L$, the attention weight vector of the image feature be $\boldsymbol{\alpha}_V$, and the attention weight vector of the label text feature be $\boldsymbol{\alpha}_L$. We then fuse the image features and the label text features by computing the context feature vector $C_{VL}$ as follows:

$C_{VL}^{(t)} = W_{FC}\left[V_{att}^{(t)};\, L_{att}^{(t)}\right]$  (2)

$V_{att}^{(t)} = \boldsymbol{\alpha}_V \cdot V$  (3)

$L_{att}^{(t)} = \boldsymbol{\alpha}_L \cdot L$  (4)

$\boldsymbol{\alpha}_V = \mathrm{softmax}\left(f_{att}\left(V, h_{t-1}\right)\right)$  (5)

$\boldsymbol{\alpha}_L = \mathrm{softmax}\left(f_{att}\left(L, h_{t-1}\right)\right)$  (6)

In formula (2), $W_{FC}$ is a fully connected network layer, and $V_{att}^{(t)}$ and $L_{att}^{(t)}$ are the image feature and the text feature weighted by the co-attention mechanism at time t, as in formulas (3) and (4). $h_{t-1}$ represents the hidden state of the sentence LSTM at time t − 1, and $f_{att}$ is the attention function used in formulas (5) and (6) and defined in formulas (7) and (8), in which $W_{vat}$, $W_v$, $W_{vh}$, $W_{lat}$, $W_l$, and $W_{lh}$ are parameter matrices. Based on the context feature vector $C_{VL}$, we can predict the topic of each generated sentence.

$f_{att}\left(V, h_{t-1}\right) = W_{vat} \tanh\left(W_v V + W_{vh} h_{t-1}\right)$  (7)

$f_{att}\left(L, h_{t-1}\right) = W_{lat} \tanh\left(W_l L + W_{lh} h_{t-1}\right)$  (8)

$\mathrm{topic}^{(t)} = W_t \tanh\left(W_{th} h_t + W_{tc} C_{VL}^{(t)}\right)$  (10)

$\mathrm{stop}^{(t)} = \mathrm{softmax}\left(W_s \tanh\left(W_{sh1} h_{t-1} + W_{sh2} h_t\right)\right)$  (11)

3.4.3. Adaptive attention-based word LSTM for sentence generation

There are many nonvisual words in the report context, such as "evidence," "of," "acute," and "remain," which cannot be aligned directly to a specific image region; otherwise, in the training process, the gradient of the nonvisual words will degrade the alignment accuracy between visual words and image features. Therefore, we used an adaptive attention-based word LSTM model to generate sentences. During word generation, the adaptive attention mechanism decides whether to use the image feature, the sentence topic, or the context feature to generate the current word. Figure 2 shows the structure of the word LSTM model based on the adaptive attention mechanism.

The adaptive attention mechanism [34] is an extension of the soft attention model proposed by Xu et al. [35]. As shown in formulas (13) and (14), at timestamp t, the adaptive attention mechanism assigns weights $\boldsymbol{\alpha}_t$ to the local image features based on the hidden state $h_t$, thus reducing the uncertainty of generating new words.

$z_t = \omega_h^{T} \tanh\left(W_v V + W_g h_t\right)$  (12)

$\boldsymbol{\alpha}_t = \mathrm{softmax}\left(z_t\right)$  (13)

$C_{vt} = \boldsymbol{\alpha}_t \cdot V$  (14)

The adaptive attention also improves the LSTM by introducing a new sentinel gate $g_t$ and a visual sentinel vector $S_t$ as follows:

$g_t = \sigma\left(W_x x_t + W_{to}\, \mathrm{topic}^{(t)} + W_h h_{t-1}\right)$  (15)

$S_t = g_t \cdot \tanh\left(m_t\right)$  (16)
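For readers who prefer code, the co-attention fusion and the topic/stop predictions of the sentence LSTM (formulas (2)–(8), (10), and (11)) can be sketched in PyTorch as follows. This is a minimal illustration under assumed tensor shapes; the sentence-LSTM state update itself (formula (9) is not reproduced in this excerpt) is approximated with a standard LSTMCell, and the layer sizes are placeholders rather than the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SentenceLSTMStep(nn.Module):
    """One step of the co-attention sentence LSTM, following formulas (2)-(8), (10), (11).

    Assumed shapes: V is (n_regions, d_img) image region features, L is
    (n_labels, d_lbl) embeddings of the predicted thoracic lesion labels,
    h_prev / c_prev are the (d_hid,) sentence-LSTM states.
    """

    def __init__(self, d_img=2048, d_lbl=512, d_hid=512, d_att=512, d_ctx=512, d_topic=512):
        super().__init__()
        # f_att(V, h_{t-1}) = W_vat tanh(W_v V + W_vh h_{t-1})                    (7)
        self.W_v, self.W_vh, self.W_vat = nn.Linear(d_img, d_att), nn.Linear(d_hid, d_att), nn.Linear(d_att, 1)
        # f_att(L, h_{t-1}) = W_lat tanh(W_l L + W_lh h_{t-1})                    (8)
        self.W_l, self.W_lh, self.W_lat = nn.Linear(d_lbl, d_att), nn.Linear(d_hid, d_att), nn.Linear(d_att, 1)
        # C_VL^(t) = W_FC [V_att^(t); L_att^(t)]                                  (2)
        self.W_FC = nn.Linear(d_img + d_lbl, d_ctx)
        # sentence-LSTM state update (formula (9) not shown here; plain LSTMCell assumed)
        self.lstm = nn.LSTMCell(d_ctx, d_hid)
        # topic^(t) = W_t tanh(W_th h_t + W_tc C_VL^(t))                          (10)
        self.W_th, self.W_tc, self.W_t = nn.Linear(d_hid, d_att), nn.Linear(d_ctx, d_att), nn.Linear(d_att, d_topic)
        # stop^(t) = softmax(W_s tanh(W_sh1 h_{t-1} + W_sh2 h_t))                 (11)
        self.W_sh1, self.W_sh2, self.W_s = nn.Linear(d_hid, d_att), nn.Linear(d_hid, d_att), nn.Linear(d_att, 2)

    def forward(self, V, L, h_prev, c_prev):
        alpha_V = F.softmax(self.W_vat(torch.tanh(self.W_v(V) + self.W_vh(h_prev))), dim=0)   # (5)
        alpha_L = F.softmax(self.W_lat(torch.tanh(self.W_l(L) + self.W_lh(h_prev))), dim=0)   # (6)
        V_att = (alpha_V * V).sum(dim=0)                                                      # (3)
        L_att = (alpha_L * L).sum(dim=0)                                                      # (4)
        C_VL = self.W_FC(torch.cat([V_att, L_att], dim=-1))                                   # (2)
        h_t, c_t = self.lstm(C_VL.unsqueeze(0), (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))
        h_t, c_t = h_t.squeeze(0), c_t.squeeze(0)
        topic = self.W_t(torch.tanh(self.W_th(h_t) + self.W_tc(C_VL)))                        # (10)
        stop = F.softmax(self.W_s(torch.tanh(self.W_sh1(h_prev) + self.W_sh2(h_t))), dim=-1)  # (11)
        return topic, stop, h_t, c_t
```

In each sentence step, the topic vector is passed to the word LSTM described next, and the stop distribution decides whether another sentence should be generated.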
Figure 2 The structure of the word LSTM model based on the adaptive attention mechanism.
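Similarly, the adaptive attention step of the word LSTM (formulas (12)–(16)) can be sketched as below. All dimensions are assumptions; the final mixing of the visual sentinel with the visual context vector is not reproduced in this excerpt, so the code only notes that this step follows the adaptive attention model of Lu et al. [34].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttentionStep(nn.Module):
    """Sketch of the adaptive-attention word-LSTM step, following formulas (12)-(16).

    Assumed shapes: V is (n_regions, d) projected image region features, x_t the (d,)
    word embedding, topic_t the (d,) sentence topic, h_t / h_prev the current and
    previous hidden states, and m_t the (d,) memory cell of the word LSTM.
    """

    def __init__(self, d=512):
        super().__init__()
        self.W_v, self.W_g = nn.Linear(d, d), nn.Linear(d, d)
        self.w_h = nn.Linear(d, 1)                                   # z_t = w_h^T tanh(W_v V + W_g h_t)  (12)
        self.W_x, self.W_to, self.W_h = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, V, x_t, topic_t, h_t, h_prev, m_t):
        z_t = self.w_h(torch.tanh(self.W_v(V) + self.W_g(h_t)))      # (12), one score per image region
        alpha_t = F.softmax(z_t, dim=0)                               # (13), attention weights
        c_vt = (alpha_t * V).sum(dim=0)                               # (14), visual context vector
        g_t = torch.sigmoid(self.W_x(x_t) + self.W_to(topic_t) + self.W_h(h_prev))   # (15), sentinel gate
        s_t = g_t * torch.tanh(m_t)                                   # (16), visual sentinel
        # The word decoder then attends either to c_vt or to s_t; the mixing weight and
        # output layer follow the adaptive attention model of Lu et al. [34], whose
        # formulas are not reproduced in this excerpt.
        return c_vt, s_t
```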
image classification module, we calculate accuracy, specificity, and sensitivity. As to the imaging report generation module, we obtained BLEU [49], METEOR [50], ROUGE [51], and CIDEr [52] scores using the standard image captioning evaluation tool [53]; these metrics are commonly used in the fields of machine translation and image captioning.

4.2.3. Comparison methods

For thoracic lesion recognition and chest X-ray classification, we compare the influence of different image encoders on the classification models. As a comparison experiment, we built multiple CNN models, such as VGG19 [37], Densenet121 [38], SENet154 [39], and ResNet152 [31], to extract visual features, as shown in Tables 1 and 3.

For chest X-ray report generation, we compare our proposed method with state-of-the-art methods: TieNet [28], CoAtt [29], and Adapt-Att [34]. We also report TMRGM without introducing templates. Further, we perform a qualitative assessment of the generated radiology reports manually.

4.3. Results

4.3.1. Results of thoracic lesion recognition

Table 1 shows the experimental results of thoracic lesion recognition based on different MLC models. It can be seen that the ResNet152-based MLC module achieved a precision of 0.112, a recall@5 of 0.605, a recall@10 of 0.698, and an F1 score of 0.181, which outperform the other methods. However, the best precision was only 0.112. For one thing, the 587 semantic labels increased the difficulty of building high-precision classifiers, while the training set contains only 5,909 chest X-ray images. For another, the distribution of the semantic labels showed that each image in the training set contains two labels on average, which conflicts with our label selection strategy (top 10). We tried selecting the top 2 labels predicted by ResNet152 as a comparison, and achieved a precision of 0.311, a recall of 0.488, and an F1 score of 0.355.

Table 2 shows two examples of the ResNet152-based MLC module for thoracic lesion recognition, listing the predicted semantic labels for each image (including cardiomegaly, degenerative change, opacity, atelectases, atelectasis, scarring, normal, calcified granuloma, granuloma, and pleural effusion). For the first image, the model correctly identified three semantic labels, namely atelectases, atelectasis, and opacity. According to these three lesions, patients can go to the respiratory department for medical treatment. As to the other one, we recognized the lesion "cardiomegaly," which reminds the patient to see a cardiologist.

4.3.2. Results of chest X-ray image classification

According to whether the "Normal" label achieves the highest probability among the predicted labels, we classified chest X-ray images into healthy individuals and abnormal ones. We compare the ResNet152-based classification module with other CNN-based binary classification models, such as VGG19, Densenet121, SENet154, and Inception-V3 [40]. Table 3 shows the experimental results of chest X-ray image classification. The ResNet152-based classification model achieved the best accuracy of 0.73, the DenseNet121 achieved the best specificity of 0.803, and the SENet achieved the best sensitivity of 0.758. The ResNet152 achieved the best 95% confidence interval of the accuracy ([0.691, 0.769]), followed by the Densenet121 ([0.674, 0.754]) and the SENet154
([0.672, 0.752]). In the ResNet152-based classification model, an error case study reveals that some normal cases were misclassified as abnormal ones. A statistical analysis shows that the ratio of images from healthy and abnormal individuals in the training set was about 2:3, which indicates that the performance of chest X-ray image classification is to some extent affected by the data imbalance.

4.3.3. Templates for chest X-ray report generation

For generating the reports of healthy individuals, we manually constructed templates based on the Findings and Impression texts respectively. Specifically, the Impression section contains 63 subclasses of report sentences (partly shown in Table 4), and the Findings field contains 150 subclasses (partly shown in Table 5). According to the sum of the sentence frequencies in each subclass, we selected the top 2 high-frequency subclasses for the Impression and the top 4 subclasses for the Findings. The combination of the representative sentences from the selected six subclasses then forms a complete chest X-ray report template (see Table 6).

4.3.4. Results of chest X-ray report generation

Table 7 shows the results of chest X-ray report generation on the automatic metrics. The evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, are based on n-gram similarity between the generated sentences and the ground-truth sentences. The difference between these metrics lies in their strategies of n-gram similarity calculation and weight assignment. We compared our proposed TMRGM model with three state-of-the-art methods on the test set, as shown in Table 7, which demonstrates the comparable performance of TMRGM to the SOTA. Adapt-Att represents the hierarchical LSTM model based solely on the multi-attention mechanism; it achieved the best ROUGE of 0.316 and CIDEr of 0.387, suggesting that the hierarchical model is better for modeling paragraphs. Our TMRGM model obtained preferable BLEU scores and a METEOR of 0.183, which indicates high semantic similarity between the generated report sentences and the ground-truth sentences. By comparing the results of the TMRGM model and TMRGM without templates, we can see that the introduction of the chest X-ray report template improves the BLEU scores and METEOR, suggesting that the template-based report generation is linguistically in line with the reports of healthy individuals.

4.3.5. Qualitative analysis

In this section, we perform a qualitative analysis of the generated reports. Table 8 presents two abnormal cases of chest X-ray reports generated by the TMRGM model, and Table 9 shows an example of a template-based report generated for a healthy individual.

As shown in Table 8, for the upper case, two sentences of normal descriptions are semantically similar to the ground-truth
sentence, such as "pulmonary vascularity appear within normal limits." versus "pulmonary vasculature within normal limits"; and "no pleural effusion or pneumothorax is seen." versus "no pleural effusion. no pneumothorax." As to the second case in Table 8, the TMRGM model performs acceptably on generating abnormal descriptions of chest X-rays; e.g., the predicted sentence "stable cardiomegaly with prominent perihilar opacities which may represent scarring or edema" is semantically similar to the real sentence "findings concerning for interstitial edema or infection. heart size is mildly enlarged. there are diffusely increased interstitial opacities bilaterally."

Table 6 The complete template of chest X-ray reports of healthy individuals.

Section      Template
Impression   No acute cardiopulmonary abnormality
             No active disease
Findings     No focal consolidation, pleural effusion, or pneumothorax
             The lungs are clear
             The cardiomediastinal silhouette is within normal limits
             The heart is normal in size.

Table 9 describes the chest X-ray of a healthy individual from several aspects, such as the cardiopulmonary function ("no acute cardiopulmonary abnormality"), the pleural lesions ("no pneumothorax or pleural effusion"), the costal mediastinum outline ("the cardiomediastinal silhouette is within normal limits"), and the cardiac shape and size ("the heart is normal in size"). It can be observed that the descriptions of the multiple anatomic structures are grammatically and logically in accord with the ground-truth sentences, which demonstrates that the chest X-ray report template is highly similar to the real normal reports in the OpenI IU X-ray dataset.

As shown in Table 10, the visualization heat map reveals the attentive image region while generating a specific sentence. The highlights in the heat map represent the image features used to generate the corresponding sentence, and the darker the color, the greater the weight. However, it is hard to explain the correlation between generated sentences and image features.

5. DISCUSSION

Automatic chest X-ray report generation will help radiologists improve the efficiency of diagnosis and report writing. The proposed TMRGM model achieved comparable performance with SOTA models on chest X-ray report generation. However, it is still far from clinical usage in realistic scenarios.

First, in the training phase, we collected semantic labels and built report templates entirely based on the training reports from the IU X-ray dataset. We then tested our proposed model on another 500 samples. We found that in the generated reports of abnormal individuals, most sentences are normal descriptions, while the proportion of abnormal descriptions is relatively small. This problem may be due to the imbalance of normal and abnormal descriptions in the training set (in the IU X-ray dataset, each report contains 3.7 normal sentences and 2.6 abnormal sentences on average). Empirically, the data scale, completeness, normalization, and quality of imaging reports are important factors for training. One further improvement is to introduce high-quality parallel datasets, such as the recently released MIMIC-CXR dataset, so as to train the model better. It is also necessary for us to validate the generalization performance on external data sources.

Second, unlike common natural images, the differences of visual features in medical images are not obvious, and ambiguous situations are quite common, such as the same disease presenting with diverse visual features, or similar image features being attributed to different diseases. The TMRGM model extracted image features based on ResNet152, and involved the co-attention as well as the adaptive attention mechanism. The introduction of the adaptive attention mechanism chooses reasonable features for generating different kinds of words, which, to some extent, alleviates the problem of unaligned nonvisual words and image features. However,
by reviewing the visualization heat maps of TMRGM, we found it hard to explain the correlation between generated sentences and image features. One optimization strategy is to segment chest X-ray images by referring to the description sequence and body parts specified in reports, and then extract local image features respectively. Since each body part has specific semantic labels, the problem of image feature extraction and classification would be simplified. Another direction for improvement is to explore emerging explainable deep learning networks, combined with state-of-the-art data augmentation, for better understanding and interpreting radiology images.

Third, we selected the top 10 semantic labels from the MLC module as the thoracic lesions. Based on this rule, we achieved high recall but poor precision on thoracic lesion recognition. It is necessary to explore more reasonable label selection strategies. In addition, in view of the increasing number of open-access COVID-19 datasets, our method can be further optimized for assisting current COVID-19 diagnosis, such as identifying thoracic lesions and automatically writing radiology reports, to reduce the workload of doctors.

Fourth, the dictionary used by the TMRGM model to generate the medical imaging report contains anatomical locations like right, left, upper, and lower. However, due to the uneven distribution of words in the training set and the low frequency of anatomical locations, most of the generated reports do not contain accurate anatomical locations. This is also a limitation of this study. In further research, we will focus more on the location of the disease in medical images and on how to accurately generate descriptions of the anatomical locations.

6. CONCLUSION

In this paper, based on a systematic review of thoracic lesion recognition and medical imaging report generation, we proposed a template-based multi-attention model (TMRGM) for automatically generating reports of chest X-rays. By exploring the linguistic characteristics of report texts, we implemented different report generation methods for healthy individuals and abnormal ones respectively, and validated the effectiveness of TMRGM on the IU X-ray dataset. The model helps radiologists to quickly identify thoracic lesions and write high-quality chest X-ray reports, which facilitates the daily work of medical imaging examination and reduces the burden of image reading and report writing.
[9] C. Wang, A. Elazab, J. Wu, et al., Lung nodule classification using deep feature fusion in chest radiography, Comput. Med. Imaging Graph. 57 (2017), 10–18.
[10] P. Kisilev, E. Walach, E. Barkan, et al., From medical image to automatic medical report generation, IBM J. Res. Dev. 59 (2015), 2:1–2:7.
[11] H.C. Shin, K. Roberts, L. Lu, et al., Learning to read chest X-rays: recurrent neural cascade model for automated image annotation, in IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Las Vegas, NV, USA, (2016), pp. 2497–2506.
[12] G. Litjens, T. Kooi, B.E. Bejnordi, et al., A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017), 60.
[13] J.G. Lee, S. Jun, Y.W. Cho, et al., Deep learning in medical imaging: general overview, Korean J. Radiol. 18 (2017), 570–584.
[14] X. Wang, Y. Peng, L. Lu, et al., ChestX-Ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, (2017), pp. 3462–3471.
[15] S.G.A. Armato III, G. McLennan, L. Bidaut, et al., The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans, Med. Phys. 38 (2011), 915.
[16] B.A. Hamilton, Data Science Bowl 2017. https://www.kaggle.com/c/data-science-bowl-2017/overview/description
[17] J. Irvin, P. Rajpurkar, M. Ko, et al., CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison, in 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019, and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI 2019), Honolulu, Hawaii, USA, (2019), pp. 590–597.
[18] D. Demner-Fushman, M.D. Kohli, M.B. Rosenman, et al., Preparing a collection of radiology examinations for distribution and retrieval, J. Am. Med. Inform. Assoc. 23 (2016), 304–310.
[19] A. Johnson, M. Lungren, Y. Peng, et al., MIMIC-CXR-JPG — chest radiographs with structured labels (Version 2.0.0), PhysioNet, 2019.
[20] N. Tajbakhsh, K. Suzuki, Comparing two classes of end-to-end machine-learning models in lung nodule detection and classification: MTANNs vs. CNNs, Pattern Recognit. 63 (2017), 476–486.
[21] A.A.A. Setio, F. Ciompi, G. Litjens, et al., Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks, IEEE Trans. Med. Imaging 35 (2016), 1160–1169.
[22] P. Rajpurkar, J. Irvin, K. Zhu, et al., CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning, arXiv:1711.05225v3, 2017.
[23] Y. Bar, I. Diamant, L. Wolf, et al., Chest pathology identification using deep feature selection with non-medical training, Comput. Methods Biomech. Biomed. Eng. Imaging Visual. 6 (2016), 259–263.
[24] L. Yao, E. Poblenz, D. Dagunts, et al., Learning to diagnose from scratch by exploiting dependencies among labels, arXiv:1710.10501v2, 2017.
[25] H.C. Shin, L. Lu, L. Kim, et al., Interleaved text/image deep mining on a large-scale radiology database, J. Mach. Learn. Res. 17 (2017), 3729–3759.
[26] X. Wang, L. Lu, H. Shin, et al., Unsupervised category discovery via looped deep pseudo-task optimization using a large scale radiology image database, arXiv:1603.07965v1, 2016.
[27] P. Kisilev, E. Sason, E. Barkan, et al., Medical image description using multi-task-loss CNN, in: G. Carneiro et al. (Eds.), Deep Learning and Data Labeling for Medical Applications, Springer International Publishing, Cham, Switzerland, 2016.
[28] X. Wang, Y. Peng, L. Lu, et al., TieNet: text-image embedding network for common thorax disease classification and reporting in chest X-rays, in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018.
[29] B. Jing, P. Xie, E. Xing, On the automatic generation of medical imaging reports, arXiv:1711.08195v2 [cs.CL], 2017.
[30] C.Y. Li, X. Liang, Z. Hu, et al., Knowledge-driven encode, retrieve, paraphrase for medical image report generation, in Thirty-Third AAAI Conference on Artificial Intelligence, arXiv:1903.10122v1, Honolulu, Hawaii, USA, 2019.
[31] Z. Wu, C. Shen, A.V. Den Hengel, et al., Wider or deeper: revisiting the ResNet model for visual recognition, Pattern Recognit. 90 (2019), 119–133.
[32] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997), 1735.
[33] P. Cao, Z. Yang, L. Sun, et al., Image captioning with bidirectional semantic attention-based guiding of long short-term memory, Neural Process. Lett. 50 (2019), 103–119.
[34] J. Lu, C. Xiong, D. Parikh, et al., Knowing when to look: adaptive attention via a visual sentinel for image captioning, in IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA, (2017), pp. 3242–3250.
[35] K. Xu, J. Ba, R. Kiros, et al., Show, attend and tell: neural image caption generation with visual attention, Comput. Sci. 37 (2015), 2048–2057.
[36] J.G. Mork, A.J.J. Yepes, A.R. Aronson, The NLM Medical Text Indexer system for indexing biomedical literature, 2013. ii.nlm.nih.gov
[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proceedings of the 3rd International Conference on Learning Representations, arXiv:1409.1556v6, San Diego, CA, USA, 2015.
[38] G. Huang, Z. Liu, L.V.D. Maaten, et al., Densely connected convolutional networks, in 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017.
[39] J. Hu, L. Shen, S. Albanie, et al., Squeeze-and-excitation networks, in 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018.
[40] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., Rethinking the inception architecture for computer vision, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, (2016), pp. 2818–2826.
[41] X. Yang, X. He, J. Zhao, et al., COVID-CT-Dataset: a CT scan dataset about COVID-19, arXiv:2003.13865v3, 2020.
[42] M. Li, F. Wang, X. Chang, X. Liang, Auxiliary signal-guided knowledge encoder-decoder for medical report generation, arXiv:2006.03744v1, 2020.
[43] M. Kakar, D.R. Olsen, Automatic segmentation and recognition of lungs and lesion from CT scans of thorax, Comput. Med. Imaging Graph. 33 (2009), 72–82.
[44] C. Qin, D. Yao, Y. Shi, et al., Computer-aided detection in chest radiography based on artificial intelligence: a survey, BioMed. Eng. OnLine 17 (2018), 113.
[45] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, in 4th International Conference on Learning Representations, ICLR, arXiv:1511.06732v7, San Juan, Puerto Rico, 2016.
[46] S.J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, V. Goel, Self-critical sequence training for image captioning, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
[47] Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, Image captioning with semantic attention, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
[48] B. Jing, Z. Wang, E. Xing, Show, describe and conclude: on exploiting the structure information of chest X-ray reports, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019.
[49] K. Papineni, S. Roukos, T. Ward, W.J. Zhu, BLEU: a method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, PA, USA, (2002), pp. 311–318.
[50] A. Lavie, A. Agarwal, METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments, in Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, (2007), pp. 228–231.
[51] C.-Y. Lin, ROUGE: a package for automatic evaluation of summaries, in Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004. https://www.aclweb.org/anthology/W04-1013.pdf
[52] R. Vedantam, C.L. Zitnick, D. Parikh, CIDEr: consensus-based image description evaluation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, (2015), pp. 4566–4575.
[53] X. Chen, H. Fang, T. Lin, et al., Microsoft COCO captions: data collection and evaluation server, arXiv:1504.00325v2, 2015.