
Journal of Artificial Intelligence for Medical Sciences

In Press, Corrected Proof

DOI: https://doi.org/10.2991/jaims.d.210428.002; eISSN: 2666-1470
https://www.atlantis-press.com/journals/jaims/

TMRGM: A Template-Based Multi-Attention Model for X-Ray Imaging Report Generation

Xuwen Wang1, Yu Zhang1, Zhen Guo1, Jiao Li1,*

1 Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

ARTICLE INFO

Article History: Received 15 Nov 2020; Accepted 14 Apr 2021

Keywords: Chest X-ray; Deep learning; Thoracic abnormality recognition; Medical imaging report generation; Attention mechanism; Medical imaging report template

ABSTRACT

The rapid growth of medical imaging data brings heavy pressure to radiologists for imaging diagnosis and report writing. This paper aims to extract valuable information automatically from medical images to assist doctors in chest X-ray image interpretation. Considering the different linguistic and visual characteristics of reports for different groups of patients, we propose a template-based multi-attention report generation model (TMRGM) that handles healthy individuals and abnormal ones respectively. In this study, we developed an experimental dataset based on the IU X-ray collection to validate the effectiveness of the TMRGM model. Specifically, our method achieves a BLEU-1 of 0.419, a METEOR of 0.183, a ROUGE score of 0.280, and a CIDEr of 0.359, which are comparable with the SOTA models. The experimental results indicate that the proposed TMRGM model is able to simulate the reporting process, although there is still much room for improvement before clinical application.

© 2021 The Authors. Published by Atlantis Press B.V.
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

Medical imaging data is a key basis for the early screening, diagnosis, and treatment of diseases. In a real clinical scenario, professional radiologists review and analyze medical images empirically, then describe the imaging findings and write the diagnostic conclusions in semi-structured reports. However, the rapid growth of medical imaging data brings a heavy workload to radiologists for image reading and report writing. How to assist doctors in medical image interpretation has become an important and challenging task for computers.

In the last decade, interdisciplinary research combining medical imaging and advanced intelligence technology has grown rapidly [1]. Driven by large-scale open access image datasets, deep learning models, represented by convolutional neural networks (CNN) [2] and recurrent neural networks (RNN) [3], have pushed forward the development of computer-aided diagnosis (CAD) systems [4], which can effectively process large-scale multimodal medical images, detect abnormal lesions, and distinguish the nature of a lesion [5–9]. In the computer vision area, deep natural language processing (NLP) technology can be used to describe images by combining image features with text features. Inspired by this, more complex cognitive tasks such as visual captioning and medical image report generation have attracted growing attention in recent years [28–30].

However, despite the state-of-the-art progress, it is still challenging to generate clinically readable and interpretable reports. For example, existing methods perform better on generating short descriptions of images, but are incapable of diversifying the language and depicting long complex structures [10,11]. Linguistically, most studies treat visual words and nonvisual words (such as "there," "evidence," "seen," "to," etc.) equally, although the latter have no correlation with any image features and may be misleading for text generation. Additionally, in the real clinical setting, radiologists often write normal reports based on unified templates, and the reports of healthy individuals only describe normal organs or structures. Nevertheless, most studies treat the reports of healthy individuals and abnormal ones with similar methods; there is little difference between the generated reports of healthy individuals and sick ones, and such methods especially underperform in depicting rare abnormal findings.

To address this problem, we propose a novel framework for chest X-ray image interpretation and report generation that exploits the different structures of healthy and abnormal reports. The major contributions of this paper are summarized as follows. (1) We propose the template-based multi-attention report generation model (TMRGM), a new template-based multi-attention mechanism for chest X-ray report generation, which utilizes different strategies to generate imaging reports for healthy individuals and abnormal ones respectively. (2) To generate chest X-ray imaging reports for healthy individuals, we manually constructed a library of chest X-ray report templates. (3) To generate chest X-ray imaging reports for abnormal individuals, we integrate image features and text features via a co-attention mechanism and an adaptive attention mechanism; the model can automatically choose whether to generate report text based on image features, sentence topics, or text features. (4) We verified the performance of chest lesion recognition and report generation on the publicly available IU X-ray dataset (OpenI) [18].

* Corresponding author. Email: [email protected]
Xuwen Wang and Yu Zhang are co-first authors.

2. RELATED WORKS

2.1. Medical Imaging Datasets

In recent years, deep neural networks have shown great potential in challenging medical image processing tasks [12,13]. The rapid improvement partly depends on publicly accessible medical imaging datasets that cover multiple modalities and various body parts with quality annotations. In particular, images concerning chest diseases, e.g., chest X-rays and chest CT scans, are commonly used for clinical screening and diagnosis, and account for a large proportion of public datasets.

For instance, the NIH released the ChestX-ray14 dataset for thoracic lesion detection [14]. The National Cancer Institute (NCI) released the LIDC–IDRI dataset for early cancer detection in high-risk populations [15], as well as Data Science Bowl 2017 [16], a high-resolution CT scan dataset for lung cancer prediction. Stanford University presented CheXpert [17], a large-scale dataset that contains 224,316 chest radiographs of 65,240 patients. OpenI [18] contains chest X-ray reports of 3,955 patients and 7,470 chest X-ray images, and has become the benchmark for current research on imaging report generation. Recently, MIT released MIMIC-CXR-JPG v2.0.0 [19], a large dataset of 377,110 chest X-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center. In addition, during the COVID-19 outbreak, many small-scale datasets were released for developing AI-based COVID-19 diagnosis models. For instance, Yang et al. built the open-sourced dataset COVID-CT [41], which contains 349 COVID-19 CT images from 216 patients and 463 non-COVID-19 CT images. Li et al. introduced COV-CTR [42], a COVID-19 CT report dataset that contains 728 images collected from published papers and their corresponding paired Chinese reports.

2.2. Thoracic Lesion Recognition

In the early stage of image recognition, feature extraction methods such as the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT) were mainly used, with the extracted features classified and recognized by downstream classifiers [43]. Early image recognition tasks were targeted at specific recognition objects without generalization ability, and the sample sizes were small, so it was difficult to meet high recognition requirements in practical applications.

Thoracic lesion recognition (TLR) has long been a research focus in CAD. According to the types of identified lesions, TLR methods can be divided into two categories. One is single thoracic lesion recognition (sTLR), which focuses on the imaging characteristics of a particular type of lesion and can assist the early screening and diagnosis of a specific disease, e.g., pulmonary nodule detection [20,21]. The other is multiple thoracic lesion recognition (mTLR), which targets multiple types of diseases or lesions, such as pulmonary nodules, pneumonia, pneumothorax, pleural effusion, atelectasis, pulmonary abscess, pulmonary tuberculosis, etc. The mTLR setting is more consistent with the radiologists' way of reading images, and can better support comprehensive diagnosis.

There are commonly two steps in mTLR: (1) multi-label classification (MLC) of the thoracic lesions revealed in chest radiography; and (2) thoracic lesion localization, which identifies the specific regions and profiles of abnormal lesions in chest radiography. In recent years, deep learning models have started to outperform conventional statistical learning approaches [43,44] in the TLR task. A representative work is CheXNet, developed by Rajpurkar et al. [22], a 121-layer dense convolutional neural network (DenseNet) that detects 14 chest diseases simultaneously based on the ChestX-Ray14 dataset. Bar et al. [23] used a pretrained CNN model to extract high-dimensional features of medical images, and combined them with the general GIST feature and bag-of-visual-words (BoVW) features as the input of a support vector machine (SVM) to detect thoracic lesions. Wang et al. [14] developed a deep convolutional neural network (DCNN) for mTLR. Yao et al. [24] constructed a DenseNet-long short-term memory (DenseNet-LSTM) model to identify the 14 thoracic lesions by utilizing the latent correlations between different lesions in chest X-ray images.

2.3. Visual Captioning and Medical Image Report Generation

Visual captioning aims at generating a descriptive sentence for a given image or video. Most state-of-the-art methods generate sequences based on CNN-RNN architectures and attention mechanisms [45–47]. In addition to the one-sequence generation of early studies, some efforts have been made toward generating longer paragraphs [11], which inspired the research on medical image report generation. However, medical image reports are more professional and informative than natural image captions, which poses a greater challenge for generating clinically readable reports. Shin et al. first proposed a variant of the CNN-RNN framework to predict lesion tags of chest X-ray images [25]. Wang et al. [26] developed Latent Dirichlet Allocation-based topic models for imaging report generation. Kisilev et al. [27] proposed a CNN-based method for generating reports of classified mammography images. Wang et al. proposed the TieNet model [28], integrating a multi-attention model into an end-to-end CNN-RNN framework for performing disease classification and generating simple imaging reports. Jing et al. [29] constructed a hierarchical language model equipped with co-attention to better model paragraphs, but it tends to produce normal findings. They went further to explore the complex structures of reports, and proposed a two-stage strategy that models the relationship between Findings and Impression [48]. Li et al. [30] proposed KERP, a knowledge-driven imaging report generation model, which constructed a graph transformer (GTR) for the dynamic transformation of text features and image features.

The difference between our proposed model and existing methods lies in that we classify chest X-rays as healthy or abnormal based on an MLC module, and then combine report templates with a multi-attention-based hierarchical LSTM model to generate reports according to the nature of the given image (healthy/abnormal). To address the problem that nonvisual words are difficult to align with image features, TMRGM generates visual words and nonvisual words separately, based on features from different modalities.

Figure 1. Overview of the framework of the proposed template-based multi-attention report generation model (TMRGM).

3. METHOD

As shown in Figure 1, the proposed framework is comprised of three modules: (1) a chest X-ray classification (healthy/abnormal) module based on multi-label thoracic lesion recognition; (2) a template-based report generation module for healthy individuals; and (3) a multi-attention-based hierarchical LSTM module for abnormal individuals [32–34].

3.1. CNN-Based Thoracic Lesion Recognition

We define the identification of thoracic lesions as an MLC problem. Given a chest X-ray image, we first extract the image feature $V$ automatically using the ResNet152 model. Then we predict the probability distribution over the 587 semantic labels collected from the IU X-ray dataset [18] via an MLC module, $P \propto \exp(\mathrm{MLC}(V))$, which consists of a fully connected layer and a softmax layer. Finally, we select the top 10 semantic labels (abnormal lesions or normal) with the highest probability as the output of the thoracic lesion recognition model.
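As an illustration of this module, the following PyTorch sketch wires a ResNet152 backbone to a fully connected head with a softmax over the 587 labels and selects the top 10 predictions. The concrete layer layout is our assumption, not the authors' released code:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ThoracicMLC(nn.Module):
    """ResNet152 backbone + fully connected MLC head (illustrative sketch)."""

    def __init__(self, num_labels=587):
        super().__init__()
        backbone = models.resnet152(pretrained=True)
        # Keep everything up to (and including) global average pooling: 2048-d output.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.classifier = nn.Linear(2048, num_labels)

    def forward(self, images):
        feats = self.encoder(images).flatten(1)               # (batch, 2048)
        return torch.softmax(self.classifier(feats), dim=-1)  # P ∝ exp(MLC(V))

model = ThoracicMLC()
probs = model(torch.randn(2, 3, 224, 224))       # inputs resized to 224x224 (Section 4.1)
top10 = torch.topk(probs, k=10, dim=-1).indices  # top-10 semantic labels per image
```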
3.2. Chest X-Ray Image Classification

Considering the difference between the descriptions in normal and abnormal reports, the TMRGM model first determines whether the given medical image belongs to a healthy individual or an abnormal one, and then utilizes different methods to generate reports for these two types of images. According to the distribution of semantic labels predicted by the thoracic lesion recognition model, we classify chest X-ray images based on the MLC module. Let the image category be $C$ and the semantic label with the highest probability be $L_{max}$; the image classification criterion is as follows:

$C = \begin{cases} 1, & L_{max} = \text{normal} \\ 0, & L_{max} = \text{any other label} \end{cases}$  (1)

In formula (1), 1 represents images from healthy individuals and 0 represents images from abnormal individuals.

3.3. Template-Based Report Generation for Healthy Individuals

For healthy individuals, radiologists confirm that there are no abnormalities and depict the normal organs or tissue with similar descriptions. In view of this, we constructed a library of chest X-ray report templates for generating normal reports of healthy individuals. We first selected all the imaging reports of healthy individuals from the IU X-ray dataset, and then collected sentences from two text fields, "Findings" and "Impression." Since many sentences in imaging reports express similar medical meanings (e.g., "pulmonary vascularity is within normal limits" and "pulmonary vascularity is normal"), we sorted these sentences according to their frequency in the corresponding field and manually classified and labeled them. Specifically, first, we merged identical sentences (allowing for differences in number or tense) into a single sentence. Second, we categorized sentences that have similar medical meanings. Third, we annotated the key words in each sentence for further analysis. Fourth, we ranked the categories according to the sum of sentence frequencies in each category, and selected one representative sentence from each category to construct a normal template library. Fifth, since the "Findings" field contains 3.4 sentences and the "Impression" field contains 1.5 sentences on average, we chose the top 4 categories from "Findings" and the top 2 categories from "Impression." Finally, we use the representative sentences from the chosen 6 categories as the template sentences for generating normal imaging reports of healthy individuals, as sketched below.
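Since the template library is built by frequency ranking over normal-report sentences, a small sketch makes the procedure concrete. This is our illustrative reconstruction, not the authors' code; the category names are assumed to be assigned manually as described:

```python
def build_template(sentences_by_category, top_k):
    """Pick one representative sentence from each of the top-k categories.

    sentences_by_category: dict mapping a manually assigned category name
    to a list of (sentence, frequency) pairs from the normal reports.
    """
    # Rank categories by the summed frequency of their member sentences.
    ranked = sorted(
        sentences_by_category.items(),
        key=lambda kv: sum(freq for _, freq in kv[1]),
        reverse=True,
    )
    template = []
    for category, members in ranked[:top_k]:
        # The most frequent sentence serves as the representative.
        representative = max(members, key=lambda m: m[1])[0]
        template.append(representative)
    return template

# Hypothetical miniature of the Impression field (cf. Table 4).
impression = {
    "cardiopulmonary": [("no acute cardiopulmonary abnormality", 817)],
    "disease": [("no active disease", 384)],
    "heart/lung": [("heart size is normal and lungs are clear", 76)],
}
print(build_template(impression, top_k=2))
```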
3.4. Multi-Attention-Based Report Generation for Abnormal Individuals

3.4.1. Co-attention-based multimodal feature fusion

To better interpret abnormal findings, it is necessary to combine the local image features with the high-level thoracic lesion labels. We employ a co-attention mechanism to fuse the image features extracted by ResNet152 with the text features of the thoracic lesion labels predicted by the MLC module. The feature fusion model assigns corresponding weights to different image regions while generating sentences, so that it can focus on the related image regions and thoracic lesion labels.

In particular, we first define a sentence LSTM model. At time $t$, let the image feature be $V$, the embedding vectors of the 10 predicted thoracic lesion labels be $L$, the attention weight vector of the image feature be $\alpha_V$, and the attention weight vector of the label text feature be $\alpha_L$. We then fuse the image features and the label text features by computing the context feature vector $C_{VL}$ as follows:

$C_{VL}^{(t)} = W_{FC}\,[V_{att}^{(t)};\ L_{att}^{(t)}]$  (2)

$V_{att}^{(t)} = \alpha_V \cdot V$  (3)

$L_{att}^{(t)} = \alpha_L \cdot L$  (4)

$\alpha_V = \mathrm{softmax}(f_{att}(V, h_{t-1}))$  (5)

$\alpha_L = \mathrm{softmax}(f_{att}(L, h_{t-1}))$  (6)

In formula (2), $W_{FC}$ is a fully connected network layer, and $V_{att}^{(t)}$ and $L_{att}^{(t)}$ are the image feature and the text feature weighted by the co-attention mechanism at time $t$, as computed in formulas (3) and (4). $h_{t-1}$ is the hidden state of the sentence LSTM at time $t-1$, and $f_{att}$ is the attention function shown in formulas (7) and (8), in which $W_{vat}$, $W_v$, $W_{vh}$, $W_{lat}$, $W_l$, and $W_{lh}$ are parameter matrices. Based on the context feature vector $C_{VL}$, we can predict the topic of each generated sentence.

$f_{att}(V, h_{t-1}) = W_{vat} \tanh(W_v V + W_{vh} h_{t-1})$  (7)

$f_{att}(L, h_{t-1}) = W_{lat} \tanh(W_l L + W_{lh} h_{t-1})$  (8)
topics of each generated sentence.
new sentinel gate gt and a visual sentinel vector St as follows:
( ) ( )
fatt V, ht−1 = Wvat tanh Wv V + Wvh ht−1 (7) ( )
gt = σ Wx xt + Wto topic(t) + Wh ht−1 (15)

( ) ( )
fatt L, ht−1 = Wlat tanh Wl L + Wlh ht−1 (8) ( )
St = gt ⋅ tanh mt (16)

3.4.2. Sentence topic generation based on sentence


where mt is the memory cell of LSTM, Wx , Wto and Wh are the
LSTM
parameter matrix, σ is a sigmoid function, topic(t) is the topic vector
The sentence LSTM contains three parts: (1) a single-layer LSTM generated by the sentence LSTM. The sentinel gate gt determines
network, which generates the LSTM hidden state ht on time t based whether the model focuses on the image feature V or the visual sen-
on CVL ; (2) a topic generation network, which is a single-layer fully tinel vector St . Furthermore, based on the St , the adaptive attention
connected network for predicting the sentence topic vector topic(t) improves the context feature vector Ct as follows:
on time t based on C(t) and ht ; (3) a stop-control network that deter- ( )
VL
mines when to stop generating report text. It consists of a fully con- Ct = 𝛽t St + 1 − 𝛽t Cvt (17)
nected layer and a softmax function, and take the LSTM hidden
state ht and ht−1 as input to generate the stop vector stop(t) on time t. To compute 𝛽t ∈ [0, 1], we modified the attention weight 𝜶 t into
The formula for calculating ht , topic(t) , and stop(t) are as follows, in 𝜶 ′t . Then the probability distribution pt of current word can be cal-
which Wt , Wth , Wtc , Ws , Wsh1 and Wsh2 are parameter metrics. culated as the formula (20).
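The following sketch shows one way formulas (9)–(11) could be realized in PyTorch, using an LSTMCell unrolled over sentence steps. The two-way stop vector (continue/stop) is our reading of the softmax stop-control network:

```python
import torch
import torch.nn as nn

class SentenceLSTM(nn.Module):
    """Sentence-level LSTM with topic and stop-control heads (formulas 9-11)."""

    def __init__(self, hid=512):
        super().__init__()
        self.lstm = nn.LSTMCell(hid, hid)  # formula (9)
        self.W_th, self.W_tc, self.W_t = nn.Linear(hid, hid), nn.Linear(hid, hid), nn.Linear(hid, hid)
        self.W_sh1, self.W_sh2, self.W_s = nn.Linear(hid, hid), nn.Linear(hid, hid), nn.Linear(hid, 2)

    def step(self, c_vl, h_prev, m_prev):
        h, m = self.lstm(c_vl, (h_prev, m_prev))
        # Formula (10): topic vector from the hidden state and the fused context.
        topic = self.W_t(torch.tanh(self.W_th(h) + self.W_tc(c_vl)))
        # Formula (11): continue/stop distribution from consecutive hidden states.
        stop = torch.softmax(self.W_s(torch.tanh(self.W_sh1(h_prev) + self.W_sh2(h))), dim=-1)
        return h, m, topic, stop

net = SentenceLSTM()
h = m = torch.zeros(2, 512)
h, m, topic, stop = net.step(torch.randn(2, 512), h, m)  # one sentence step
```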
3.4.3. Adaptive attention-based word LSTM for sentence generation

There are many nonvisual words in the report context, such as "evidence," "of," "acute," and "remain," which cannot be aligned directly to a specific image region; worse, during training the gradients of such nonvisual words degrade the alignment accuracy between visual words and image features. Therefore, we use an adaptive attention-based word LSTM model to generate sentences. During word generation, the adaptive attention mechanism decides whether to use the image feature, the sentence topic, or the context feature to generate the current word. Figure 2 shows the structure of the word LSTM model based on the adaptive attention mechanism.

Figure 2. The structure of the word LSTM model based on the adaptive attention mechanism.

The adaptive attention mechanism [34] is an extension of the soft attention model proposed by Xu et al. [35]. As shown in formulas (13) and (14), at timestep $t$ the adaptive attention mechanism assigns weights $\alpha_t$ to the local image features based on the hidden state $h_t$, thus reducing the uncertainty of generating new words:

$z_t = \omega_h^T \tanh(W_v V + W_g h_t)$  (12)

$\alpha_t = \mathrm{softmax}(z_t)$  (13)

$C_{vt} = \alpha_t \cdot V$  (14)

The adaptive attention also extends the LSTM by introducing a new sentinel gate $g_t$ and a visual sentinel vector $S_t$:

$g_t = \sigma(W_x x_t + W_{to}\, topic^{(t)} + W_h h_{t-1})$  (15)

$S_t = g_t \cdot \tanh(m_t)$  (16)

where $m_t$ is the memory cell of the LSTM, $W_x$, $W_{to}$, and $W_h$ are parameter matrices, $\sigma$ is the sigmoid function, and $topic^{(t)}$ is the topic vector generated by the sentence LSTM. The sentinel gate $g_t$ determines whether the model focuses on the image feature $V$ or on the visual sentinel vector $S_t$. Furthermore, based on $S_t$, the adaptive attention refines the context feature vector $C_t$:

$C_t = \beta_t S_t + (1 - \beta_t)\, C_{vt}$  (17)

To compute $\beta_t \in [0, 1]$, we extend the attention weights $\alpha_t$ into $\alpha'_t$ by appending a sentinel score to the $n$ region scores, so that $\beta_t$ falls out as the last, $(n+1)$-th element of $\alpha'_t$; the probability distribution $p_t$ of the current word is then calculated as in formula (20):

$\alpha'_t = \mathrm{softmax}([z_t;\ \omega_h^T \tanh(W_s S_t + W_g h_t)])$  (18)

$\beta_t = \alpha'_t[n + 1]$  (19)

$p_t = \mathrm{softmax}(W_p (C_t + h_t))$  (20)
4.2. Experimental Settings


4. EXPERIMENTS 4.2.1. Implementation details
4.1. Preprocessing of Chest X-Ray Dataset We carried out experiments on Windows Sever 2012 R2, Intel(R)
Xeon(R) Gold 6130 64 CPU, 512GB memory, NVIDIA Tesla P100
Indiana University chest X-ray collection [18] is a public dataset 16GB * 4 GPUs. The codes of TMRGM are implemented under the
containing 7470 chest X-ray images and 3955 de-identified radi- PyTorch framework and are available at https://github.com/54649
ology reports, and is commonly used for assessing imaging report 2928/TMRGM.
generation models. Each report is comprised of several sections:
Impression, Findings, and Indication, etc. We select Findings and During the training process, the dimensions of hidden states in
Impression from the reports as our experimental data. The seman- sentence LSTM and word LSTM are set to 512. The dimension
tic labels annotated by MTI tools [36] are also collected for thoracic of thoracic lesion word embedding, sentence topic embedding
lesion recognition. and report word embedding are also set as 512. We adopt a pre-
trained ResNet152 as image encoder, which is fine-tuned on the
During the preprocessing stage, we resized all chest X-ray images training set for obtaining chest X-ray image features. For the tho-
into 224*224 pixels as the unified input of CNN model. The image racic lesion MLC module, the visual features are 2048 dimensions
quality is quite acceptable and we did not use additional data extracted from the last average polling layer of ResNet152. For the
augmentation technologies. For collected MTI labels, we removed multi-attention-based report generation module, visual features are
duplicates, lowercased all words, and obtained a set of 587 seman- extracted from the last convolutional layer, which yields a 7*7*2048
tic labels. For texts extracted from Findings and Impression, we feature map. We use Adam optimizer with the initial learning rate
performed sentence segmentation, lowercased, delimitated punc- of 0.0003 (dynamically reduced by 10% while the training error stop
tuations, special characters, and extra spaces, and then converted descending in 10 epochs), and the batch size is set as 16.
numbers into a unified identifier “num.” Further, we constructed
a dictionary base on the word frequency higher than 5 in imaging
reports, in which 1173 words were included. 4.2.2. Evaluation metrics
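The following is a minimal sketch of the report-text normalization described above. The authors' exact tokenization rules are not specified, so the regular expressions here are assumptions:

```python
import re
from collections import Counter

def normalize_report(text):
    """Lowercase, map numbers to 'num', strip special characters, collapse spaces."""
    text = text.lower()
    text = re.sub(r"\d+(\.\d+)?", " num ", text)  # unify numeric tokens
    text = re.sub(r"([.,])", r" \1 ", text)       # delimit punctuation kept as tokens
    text = re.sub(r"[^a-z., ]", " ", text)        # drop remaining special characters
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(reports, min_freq=5):
    """Keep words whose corpus frequency exceeds min_freq (1,173 words in the paper)."""
    counts = Counter(w for r in reports for w in normalize_report(r).split())
    return {w for w, c in counts.items() if c > min_freq}

sample = "Heart size is normal. The lungs are clear, no effusion seen in 2 views."
print(normalize_report(sample))
```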
4.2. Experimental Settings

4.2.1. Implementation details

We carried out our experiments on Windows Server 2012 R2 with an Intel(R) Xeon(R) Gold 6130 64-core CPU, 512 GB of memory, and four NVIDIA Tesla P100 16 GB GPUs. The code of TMRGM is implemented under the PyTorch framework and is available at https://github.com/546492928/TMRGM.

During the training process, the dimensions of the hidden states in the sentence LSTM and the word LSTM are set to 512. The dimensions of the thoracic lesion word embedding, the sentence topic embedding, and the report word embedding are also set to 512. We adopt a pretrained ResNet152 as the image encoder, which is fine-tuned on the training set to obtain chest X-ray image features. For the thoracic lesion MLC module, the visual features are the 2048-dimensional vectors extracted from the last average pooling layer of ResNet152. For the multi-attention-based report generation module, visual features are extracted from the last convolutional layer, which yields a 7×7×2048 feature map. We use the Adam optimizer with an initial learning rate of 0.0003 (dynamically reduced by 10% when the training error stops descending for 10 epochs), and the batch size is set to 16.

4.2.2. Evaluation metrics

We evaluated each submodule of our proposed method with different evaluation metrics. For the MLC of thoracic lesions, we calculate precision (P), recall (R), F1 score, Recall@5, Recall@10, and Recall@20. Specifically, Recall@N compares the number of correct labels among the top N predictions with the total number of labels in the ground truth. For the chest X-ray image classification module, we calculate accuracy, specificity, and sensitivity. For the imaging report generation module, we report BLEU [49], METEOR [50], ROUGE [51], and CIDEr [52], obtained with the standard image captioning evaluation tool [53]; these metrics are commonly used in the fields of machine translation and image captioning.
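For concreteness, a small sketch of the Recall@N computation as defined above (our own implementation of the stated definition; the example labels are hypothetical):

```python
def recall_at_n(predicted_probs, gold_labels, n):
    """Fraction of ground-truth labels recovered among the top-n predictions.

    predicted_probs: list of (label, probability) pairs for one image.
    gold_labels: set of ground-truth labels for the same image.
    """
    top_n = {label for label, _ in
             sorted(predicted_probs, key=lambda p: p[1], reverse=True)[:n]}
    return len(top_n & gold_labels) / len(gold_labels)

preds = [("cardiomegaly", 0.61), ("opacity", 0.20), ("normal", 0.09), ("scarring", 0.04)]
print(recall_at_n(preds, {"cardiomegaly", "opacity"}, n=2))  # -> 1.0
```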

Figure 3. A sample of a processed chest X-ray report.

Table 1. Results of the multi-label classification model based on different CNN encoders.

Methods | P | R | F1 Score | Recall@5 | Recall@10 | Recall@20
ResNet152 | 0.112 | 0.698 | 0.181 | 0.605 | 0.698 | 0.767
VGG19 | 0.091 | 0.618 | 0.150 | 0.560 | 0.618 | 0.635
Densenet121 | 0.112 | 0.682 | 0.180 | 0.595 | 0.682 | 0.756
SENet154 | 0.112 | 0.701 | 0.180 | 0.602 | 0.701 | 0.775
ResNet152 (top2) | 0.311 | 0.488 | 0.355 | – | – | –

4.2.3. Comparison methods

For thoracic lesion recognition and chest X-ray classification, we compare the influence of different image encoders on the classification models. As a comparison experiment, we simultaneously built multiple CNN models, such as VGG19 [37], Densenet121 [38], SENet154 [39], and ResNet152 [31], to extract visual features, as shown in Tables 1 and 3.

For chest X-ray report generation, we compare our proposed method with the state-of-the-art methods TieNet [28], CoAtt [29], and Adapt-Att [34]. We also report TMRGM without the template module. Further, we perform a manual qualitative assessment of the generated radiology reports.

4.3. Results

4.3.1. Results of thoracic lesion recognition

Table 1 shows the experimental results of thoracic lesion recognition based on different MLC models. It can be seen that the ResNet152-based MLC module achieved a precision of 0.112, a recall@5 of 0.605, a recall@10 of 0.698, and an F1 score of 0.181, which outperform the other methods. However, the best precision was only 0.112. For one thing, the 587 semantic labels increase the difficulty of building high-precision classifiers, while the training set contains only 5,909 chest X-ray images. For another, the distribution of the semantic labels shows that each image in the training set contains two labels on average, which conflicts with our label selection strategy (top 10). As a comparison, we tried selecting only the top 2 labels predicted by ResNet152, and achieved a precision of 0.311, a recall of 0.488, and an F1 score of 0.355.

Table 2 shows two examples of the ResNet152-based MLC module for thoracic lesion recognition. For the first image, the model correctly identified three semantic labels, namely atelectases, atelectasis, and opacity; according to these lesions, the patient can go to the respiratory department for medical treatment. For the other one, the model recognized the lesion "cardiomegaly," which reminds the patient to see a cardiologist.

4.3.2. Results of chest X-ray image classification

According to whether the "Normal" label achieves the highest probability among the predicted labels, we classified chest X-ray images into healthy individuals and abnormal ones. We compare the ResNet152-based classification module with other CNN-based binary classification models, namely VGG19, Densenet121, SENet154, and Inception-V3 [40]. Table 3 shows the experimental results of chest X-ray image classification. The ResNet152-based classification model achieved the best accuracy of 0.73, DenseNet121 achieved the best specificity of 0.803, and SENet154 achieved the best sensitivity of 0.758. ResNet152 achieved the best 95% confidence interval for accuracy ([0.691, 0.769]), followed by Densenet121 ([0.674, 0.754]) and SENet154 ([0.672, 0.752]).

Table 2. Examples of the ResNet152-based MLC module for thoracic lesion recognition (chest X-ray images not reproduced here).

Image 1 — Predicted labels: atelectases; atelectasis; opacity; cardiomegaly; scarring; degenerative change; calcified granuloma; normal; pleural effusion; granuloma. MTI labels: atelectases; opacity; atelectasis; hiatal hernia; infection.

Image 2 — Predicted labels: cardiomegaly; degenerative change; opacity; atelectases; atelectasis; scarring; normal; calcified granuloma; granuloma; pleural effusion. MTI labels: cardiomegaly.

Table 3. Results of chest X-ray classification.

Method | Accuracy | 95% Confidence Interval | TP | TN | FP | FN | Specificity | Sensitivity
ResNet152 | 0.73 | [0.691, 0.769] | 137 | 228 | 61 | 74 | 0.789 | 0.649
VGG19 | 0.578 | [0.535, 0.621] | 289 | 0 | 211 | 0 | 0 | 1
Densenet121 | 0.714 | [0.674, 0.754] | 125 | 232 | 57 | 86 | 0.803 | 0.592
SENet154 | 0.712 | [0.672, 0.752] | 160 | 196 | 93 | 51 | 0.678 | 0.758
Inception-V3 | 0.708 | [0.668, 0.748] | 128 | 226 | 63 | 83 | 0.782 | 0.607
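The metrics in Table 3 follow directly from the confusion-matrix counts. The paper does not state how the confidence intervals were computed; the normal-approximation (Wald) interval below reproduces the ResNet152 row, so we use it as an illustrative assumption:

```python
import math

def classification_metrics(tp, tn, fp, fn, z=1.96):
    """Accuracy, a normal-approximation 95% CI, specificity, and sensitivity."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    half = z * math.sqrt(acc * (1 - acc) / n)  # assumed CI method, not stated in the paper
    return {
        "accuracy": acc,
        "ci95": (acc - half, acc + half),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
    }

print(classification_metrics(tp=137, tn=228, fp=61, fn=74))
# accuracy 0.73, ci95 ~ (0.691, 0.769), specificity ~ 0.789, sensitivity ~ 0.649
```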

In the ResNet152-based classification model, an error case study reveals that some normal cases were misclassified as abnormal ones. A statistical analysis shows that the ratio of images from healthy and abnormal individuals in the training set was about 2:3, which indicates that the performance of chest X-ray image classification is to some extent affected by data imbalance.

4.3.3. Templates for chest X-ray report generation

For generating the reports of healthy individuals, we manually constructed templates based on the Findings and Impression texts respectively. Specifically, the Impression section contains 63 subclasses of report sentences (partly shown in Table 4), and the Findings field contains 150 subclasses (partly shown in Table 5). According to the sum of the sentence frequencies in each subclass, we selected the top 2 high-frequency subclasses for the Impression and the top 4 subclasses for the Findings. The combination of the representative sentences from the selected six subclasses then forms a complete chest X-ray report template (see Table 6).

4.3.4. Results of chest X-ray report generation

Table 7 shows the results of chest X-ray report generation on the automatic metrics. The evaluation metrics, BLEU, METEOR, ROUGE, and CIDEr, are based on the n-gram similarity between the generated sentences and the ground-truth sentences; the difference between these metrics lies in their strategies of n-gram similarity calculation and weight assignment. We compared our proposed TMRGM model with three state-of-the-art methods on the test set, as shown in Table 7, which demonstrates the comparable performance of TMRGM to the SOTA. Adapt-att represents the hierarchical LSTM model based solely on the multi-attention mechanism; it achieved the best ROUGE of 0.316 and CIDEr of 0.387, suggesting that the hierarchical model is better at modeling paragraphs. Our TMRGM model obtained the best BLEU scores and a METEOR of 0.183, which indicates high semantic similarity between the generated report sentences and the ground-truth sentences. By comparing the results of the TMRGM model and TMRGM without templates, we can see that the introduction of the chest X-ray report template improves the BLEU scores and METEOR, suggesting that template-based report generation is linguistically in line with the reports of healthy individuals.

4.3.5. Qualitative analysis

In this section, we perform a qualitative analysis of the generated reports. Table 8 presents two abnormal cases of chest X-ray reports generated by the TMRGM model, and Table 9 shows an example of a template-based report generated for a healthy individual.
Table 4. A sample of the manually annotated sentence classes in the Impression section.

Class | Representative Sentence | Frequency | Key Words
1 | No acute cardiopulmonary abnormality | 817 | Cardiopulmonary abnormality
2 | No active disease | 384 | Abnormality
3 | Heart size is normal and lungs are clear | 76 | Heart size; lung
4 | The heart size and cardiomediastinal silhouette are within normal limits | 67 | Heart size; cardiomediastinal silhouette
5 | No acute pulmonary disease | 55 | Pulmonary disease

Table 5. A sample of the manually annotated sentence classes in the Findings section.

Class | Representative Sentence | Frequency | Key Words
1 | No focal consolidation pleural effusion or pneumothorax | 579 | Pleural effusion; pneumothorax
2 | The lungs are clear | 550 | Lung
3 | The cardiomediastinal silhouette is within normal limits | 320 | Cardiomediastinal silhouette
4 | The heart is normal in size | 315 | Heart size
5 | Visualized osseous structures of the thorax are without acute abnormality | 163 | Thorax; osseous structure

Table 6. The complete template of chest X-ray reports for healthy individuals.

Section | Template
Impression | No acute cardiopulmonary abnormality. No active disease.
Findings | No focal consolidation pleural effusion or pneumothorax. The lungs are clear. The cardiomediastinal silhouette is within normal limits. The heart is normal in size.

As shown in Table 8, for the upper case, two sentences of normal descriptions are semantically similar to the ground-truth sentences, such as "pulmonary vascularity appear within normal limits" versus "pulmonary vasculature within normal limits," and "no pleural effusion or pneumothorax is seen" versus "no pleural effusion. no pneumothorax." As to the second case in Table 8, the TMRGM model performs acceptably in generating abnormal descriptions of chest X-rays; e.g., the predicted sentence "stable cardiomegaly with prominent perihilar opacities which may represent scarring or edema" is semantically similar to the real sentences "findings concerning for interstitial edema or infection. heart size is mildly enlarged. there are diffusely increased interstitial opacities bilaterally."

Table 9 describes the chest X-ray of a healthy individual from several aspects: the cardiopulmonary function ("no acute cardiopulmonary abnormality"), the pleural lesions ("no pneumothorax or pleural effusion"), the costal mediastinum outline ("the cardiomediastinal silhouette is within normal limits"), and the cardiac shape and size ("the heart is normal in size"). It can be observed that the descriptions of the multiple anatomic structures are grammatically and logically in accord with the ground-truth sentences, which demonstrates that the chest X-ray report template is highly similar to the real normal reports in the OpenI IU X-ray dataset.

As shown in Table 10, the visualization heat maps reveal the attentive image region while generating a specific sentence. The highlights in the heat map represent the image features used to generate the corresponding sentence, and the darker the color, the greater the weight. However, it is hard to explain the correlation between the generated sentences and the image features.

5. DISCUSSION

Automatic chest X-ray report generation will help radiologists improve the efficiency of diagnosis and report writing. The proposed TMRGM model achieved comparable performance with SOTA models on chest X-ray report generation. However, it is still far from clinical usage in realistic scenarios.

First, in the training phase, we collected semantic labels and built report templates entirely based on the training reports from the IU X-ray dataset, and then tested our proposed model on another 500 samples. We found that in the generated reports of abnormal individuals, most sentences are normal descriptions, while the proportion of abnormal descriptions is relatively small. This problem may be due to the imbalance of normal and abnormal descriptions in the training set (in the IU X-ray dataset, each report contains 3.7 normal sentences and 2.6 abnormal sentences on average). Empirically, the data scale, completeness, normalization, and quality of imaging reports are important factors for training. One further improvement is to introduce high-quality parallel datasets, such as the recently released MIMIC-CXR dataset, so as to train the model better. It is also necessary to validate the generalization performance on external data sources.

Second, unlike common natural images, the differences between visual features in medical images are not obvious, and ambiguous situations are quite common, such as the same disease presenting diverse visual features, or similar image features being attributed to different diseases. The TMRGM model extracts image features based on ResNet152, and involves the co-attention as well as the adaptive attention mechanism. The adaptive attention mechanism chooses reasonable features for generating different kinds of words, which to some extent alleviates the problem of unaligned nonvisual words and image features.
Table 7. Results of chest X-ray report generation.

Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr
TieNet [28] | 0.286 | 0.160 | 0.104 | 0.074 | 0.108 | 0.226 | –
CoAtt [29] | 0.303 | 0.181 | 0.121 | 0.084 | 0.132 | 0.249 | 0.175
Adapt-att [34] | 0.378 | 0.255 | 0.185 | 0.138 | 0.162 | 0.316 | 0.387
TMRGM (without template) | 0.380 | 0.259 | 0.188 | 0.141 | 0.163 | 0.317 | 0.391
TMRGM | 0.419 | 0.281 | 0.201 | 0.145 | 0.183 | 0.280 | 0.359
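As a pointer to how such scores are obtained: the paper uses the COCO captioning evaluation toolkit [53], but a minimal BLEU-1 computation can be illustrated with NLTK (an illustrative stand-in, not the toolkit the authors used):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = ["no acute cardiopulmonary abnormality".split()]
hypothesis = "no acute cardiopulmonary findings".split()
# weights=(1, 0, 0, 0) restricts the score to unigram precision, i.e., BLEU-1.
print(sentence_bleu(reference, hypothesis, weights=(1, 0, 0, 0)))  # -> 0.75
```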

Table 8. Examples of generated chest X-ray reports for abnormal individuals (chest X-ray images not reproduced here).

Case 1 — Generated report: "No acute cardiopulmonary abnormality. The heart size and pulmonary vascularity appear within normal limits. The lungs are free of focal airspace disease. No pleural effusion or pneumothorax is seen." Ground truth: "Right middle lobe airspace disease may reflect atelectasis or pneumonia. The cardiomediastinal silhouette is normal size and configuration. Pulmonary vasculature within normal limits. There is right middle lobe airspace disease may reflect atelectasis or pneumonia. No pleural effusion. no pneumothorax."

Case 2 — Generated report: "Stable cardiomegaly with prominent perihilar opacities which may represent scarring or edema. There is stable cardiomegaly. there is no pneumothorax." Ground truth: "Findings concerning for interstitial edema or infection. Heart size is mildly enlarged. There are diffusely increased interstitial opacities bilaterally. No focal consolidation pneumothorax or pleural effusion. No acute bony abnormality."

Table 9. An example of a generated chest X-ray report for a healthy individual (chest X-ray image not reproduced here).

Generated report: "No acute cardiopulmonary abnormality. No active disease. No pneumothorax or pleural effusion. The lungs are clear. The cardiomediastinal silhouette is within normal limits. The heart is normal in size." Ground truth: "No acute cardiopulmonary findings. No focal consolidation. No visualized pneumothorax. No pleural effusions. Heart size normal. The cardiomediastinal silhouette is unremarkable."

However, by reviewing the visualization heat maps of TMRGM, we found it is hard to explain the correlation between the generated sentences and the image features. One optimizing strategy is to segment chest X-ray images by referring to the description sequence and body parts specified in the reports, and then to extract local image features respectively. Since each body part has specific semantic labels, the problem of image feature extraction and classification would be much simplified. Another direction for improvement is to explore emerging explainable deep learning networks, combined with state-of-the-art data augmentation, for better understanding and interpreting radiology images.

Third, we selected the top 10 semantic labels from the MLC module as the thoracic lesions. Based on this rule, we achieved high recall but poor precision on thoracic lesion recognition. It is necessary to explore more reasonable label selection strategies. In addition, in view of the increasing number of open access COVID-19 datasets, our method can be further optimized for assisting current COVID-19 diagnosis, such as identifying thoracic lesions and automatically writing radiology reports, thereby reducing the workload of doctors.

Fourth, the dictionary used by the TMRGM model to generate the medical imaging reports contains anatomical locations like right, left, upper, and lower. However, due to the uneven distribution of words in the training set and the low frequency of anatomical locations, most of the generated reports do not contain accurate anatomical locations. This is a limitation of this study. In further research, we will focus more on the location of the disease in medical images and on how to accurately generate descriptions of the anatomical locations.

6. CONCLUSION

In this paper, based on a systematic review of thoracic lesion recognition and medical imaging report generation, we proposed a template-based multi-attention model (TMRGM) for automatically generating reports of chest X-rays. By exploring the linguistic characteristics of report texts, we implemented different report generation methods for healthy individuals and abnormal ones respectively, and validated the effectiveness of TMRGM on the IU X-ray dataset. The model can help radiologists quickly identify thoracic lesions and write high-quality chest X-ray reports, facilitating the daily work of medical imaging examination and reducing the burden of image reading and report writing.
Table 10. A visualization example of generated sentences and the corresponding heat maps (heat maps not reproduced here). Generated report sentences, each paired with an attention heat map: "no acute cardiopulmonary findings"; "the cardiomediastinal silhouette and pulmonary vasculature are within normal limits in size"; "no typical findings of pulmonary edema"; "the lungs and pleural spaces show no acute abnormality." Ground truth: "negative for acute abnormality. the cardiomediastinal silhouette is normal in size and contour. no focal consolidation pneumothorax or large pleural effusion. normal xxxx."

CONFLICT OF INTEREST

The authors declare they have no conflicts of interest.

AUTHORS' CONTRIBUTIONS

All authors have made significant contributions to the manuscript, including its conception and design, the analysis of the data, and the writing of the manuscript. All authors have reviewed all parts of the manuscript, take responsibility for its content, and approve its publication.

ACKNOWLEDGMENTS

This work has been supported by the National Natural Science Foundation of China (Grant No. 61906214), the Beijing Natural Science Foundation (Grant No. Z200016), the CAMS Innovation Fund for Medical Sciences (CIFMS) (Grant No. 2018-I2M-AI-016), and the Non-profit Central Research Institute Fund of the Chinese Academy of Medical Sciences (Grant No. 2018PT33024).

REFERENCES

[1] Interagency Working Group on Medical Imaging, Committee on Science, National Science and Technology Council, Roadmap for Medical Imaging Research and Development, Washington, D.C., USA, 2017, pp. 1–19. https://trumpwhitehouse.archives.gov/wp-content/uploads/2017/12/Roadmap-for-Medical-Imaging-Research-and-Development-2017.pdf
[2] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, vol. 3361, MIT Press, Cambridge, MA, USA, 1995. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.9297
[3] G.S. Lodwick, Computer-aided diagnosis in radiology: a research plan, Invest. Radiol. 1 (1966), 72–80.
[4] Z.C. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for sequence learning, arXiv:1506.00019v4, 2015.
[5] L. Ebner, M. Tall, K.R. Choudhury, et al., Variations in the functional visual field for detection of lung nodules on chest computed tomography: impact of nodule size, distance, and local lung complexity, Med. Phys. 44 (2017), 3483–3490.
[6] W. Sun, B. Zheng, W. Qian, Automatic feature learning using multichannel ROI based on deep structured algorithms for computerized lung cancer diagnosis, Comput. Biol. Med. 89 (2017), 530.
[7] W. Sun, T.B. Tseng, J. Zhang, et al., Enhancing deep convolutional neural network scheme for breast cancer diagnosis with unlabeled data, Comput. Med. Imaging Graph. 57 (2017), 4–9.
[8] A. Masood, B. Sheng, P. Li, et al., Computer-assisted decision support system in pulmonary cancer detection and stage classification on CT images, J. Biomed. Inform. 79 (2018), 117–128.
[9] C. Wang, A. Elazab, J. Wu, et al., Lung nodule classification using deep feature fusion in chest radiography, Comput. Med. Imaging Graph. 57 (2017), 10–18.
[10] P. Kisilev, E. Walach, E. Barkan, et al., From medical image to automatic medical report generation, IBM J. Res. Dev. 59 (2015), 2:1–2:7.
[11] H.C. Shin, K. Roberts, L. Lu, et al., Learning to read chest X-rays: recurrent neural cascade model for automated image annotation, in: IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 2497–2506.
[12] G. Litjens, T. Kooi, B.E. Bejnordi, et al., A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017), 60.
[13] J.G. Lee, S. Jun, Y.W. Cho, et al., Deep learning in medical imaging: general overview, Korean J. Radiol. 18 (2017), 570–584.
[14] X. Wang, Y. Peng, L. Lu, et al., ChestX-Ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 3462–3471.
[15] S.G. Armato III, G. McLennan, L. Bidaut, et al., The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans, Med. Phys. 38 (2011), 915.
[16] B.A. Hamilton, Data Science Bowl 2017. https://www.kaggle.com/c/data-science-bowl-2017/overview/description
[17] J. Irvin, P. Rajpurkar, M. Ko, et al., CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison, in: 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), Honolulu, HI, USA, 2019, pp. 590–597.
[18] D. Demner-Fushman, M.D. Kohli, M.B. Rosenman, et al., Preparing a collection of radiology examinations for distribution and retrieval, J. Am. Med. Inform. Assoc. 23 (2016), 304–310.
[19] A. Johnson, M. Lungren, Y. Peng, et al., MIMIC-CXR-JPG: chest radiographs with structured labels (version 2.0.0), PhysioNet, 2019.
[20] N. Tajbakhsh, K. Suzuki, Comparing two classes of end-to-end machine-learning models in lung nodule detection and classification: MTANNs vs. CNNs, Pattern Recognit. 63 (2017), 476–486.
[21] A.A.A. Setio, F. Ciompi, G. Litjens, et al., Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks, IEEE Trans. Med. Imaging 35 (2016), 1160–1169.
[22] P. Rajpurkar, J. Irvin, K. Zhu, et al., CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning, arXiv:1711.05225v3, 2017.
[23] Y. Bar, I. Diamant, L. Wolf, et al., Chest pathology identification using deep feature selection with non-medical training, Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 6 (2016), 259–263.
[24] L. Yao, E. Poblenz, D. Dagunts, et al., Learning to diagnose from scratch by exploiting dependencies among labels, arXiv:1710.10501v2, 2017.
[25] H.C. Shin, L. Lu, L. Kim, et al., Interleaved text/image deep mining on a large-scale radiology database, J. Mach. Learn. Res. 17 (2017), 3729–3759.
[26] X. Wang, L. Lu, H. Shin, et al., Unsupervised category discovery via looped deep pseudo-task optimization using a large scale radiology image database, arXiv:1603.07965v1, 2016.
[27] P. Kisilev, E. Sason, E. Barkan, et al., Medical image description using multi-task-loss CNN, in: G. Carneiro et al. (Eds.), Deep Learning and Data Labeling for Medical Applications, Springer International Publishing, Cham, Switzerland, 2016.
[28] X. Wang, Y. Peng, L. Lu, et al., TieNet: text-image embedding network for common thorax disease classification and reporting in chest X-rays, in: IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018.
[29] B. Jing, P. Xie, E. Xing, On the automatic generation of medical imaging reports, arXiv:1711.08195v2 [cs.CL], 2017.
[30] C.Y. Li, X. Liang, Z. Hu, et al., Knowledge-driven encode, retrieve, paraphrase for medical image report generation, in: Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 2019. arXiv:1903.10122v1
[31] Z. Wu, C. Shen, A. van den Hengel, et al., Wider or deeper: revisiting the ResNet model for visual recognition, Pattern Recognit. 90 (2019), 119–133.
[32] A. Graves, Long short-term memory, Neural Comput. 9 (1997), 1735.
[33] P. Cao, Z. Yang, L. Sun, et al., Image captioning with bidirectional semantic attention-based guiding of long short-term memory, Neural Process. Lett. 50 (2019), 103–119.
[34] J. Lu, C. Xiong, D. Parikh, et al., Knowing when to look: adaptive attention via a visual sentinel for image captioning, in: IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 3242–3250.
[35] K. Xu, J. Ba, R. Kiros, et al., Show, attend and tell: neural image caption generation with visual attention, Comput. Sci. 37 (2015), 2048–2057.
[36] J.G. Mork, A.J.J. Yepes, A.R. Aronson, The NLM Medical Text Indexer system for indexing biomedical literature, 2013. ii.nlm.nih.gov
[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 2015. arXiv:1409.1556v6
[38] G. Huang, Z. Liu, L. van der Maaten, et al., Densely connected convolutional networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017.
[39] J. Hu, L. Shen, S. Albanie, et al., Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018.
[40] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., Rethinking the Inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 2818–2826.
[41] X. Yang, X. He, J. Zhao, et al., COVID-CT-Dataset: a CT scan dataset about COVID-19, arXiv:2003.13865v3, 2020.
[42] M. Li, F. Wang, X. Chang, X. Liang, Auxiliary signal-guided knowledge encoder-decoder for medical report generation, arXiv:2006.03744v1, 2020.
[43] M. Kakar, D.R. Olsen, Automatic segmentation and recognition of lungs and lesion from CT scans of thorax, Comput. Med. Imaging Graph. 33 (2009), 72–82.
[44] C. Qin, D. Yao, Y. Shi, et al., Computer-aided detection in chest radiography based on artificial intelligence: a survey, BioMed. Eng. OnLine 17 (2018), 113.
[45] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, in: 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016. arXiv:1511.06732v7
[46] S.J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, V. Goel, Self-critical sequence training for image captioning, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017.
[47] Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, Image captioning with semantic attention, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
[48] B. Jing, Z. Wang, E. Xing, Show, describe and conclude: on exploiting the structure information of chest X-ray reports, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019.
[49] K. Papineni, S. Roukos, T. Ward, W.J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002, pp. 311–318.
[50] A. Lavie, A. Agarwal, METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, 2007, pp. 228–231.
[51] C.-Y. Lin, ROUGE: a package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004. https://www.aclweb.org/anthology/W04-1013.pdf
[52] R. Vedantam, C.L. Zitnick, D. Parikh, CIDEr: consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 4566–4575.
[53] X. Chen, H. Fang, T. Lin, et al., Microsoft COCO captions: data collection and evaluation server, arXiv:1504.00325v2, 2015.
