Medical Image Captioning via Generative Pretrained
Transformers
Alexander Selivanov1,† , Oleg Y. Rogov2,† , Daniil Chesakov3 , Artem Shelmanov3 , Irina
Fedulova1 , and Dmitry V. Dylov2,*
1 Alexander Selivanov and Irina Fedulova are with the Philips Innovation Labs Rus, Skolkovo Technopark 42 building
1 Bolshoi boulevard, Moscow, Russia, 121205.
2 Oleg Y. Rogov and Dmitry V. Dylov* are with the Skolkovo Institute of Science and Technology, Bolshoy blvd., 30/1, Moscow, Russia, 121205.
† Contributed equally
* Corresponding author e-mail: [email protected]
ABSTRACT
The proposed model for the automatic clinical caption generation problem combines the analysis of frontal chest X-Ray scans with structured patient information from the radiology records. We combine two language models, the Show-Attend-Tell and the GPT-3, to generate comprehensive and descriptive radiology records. The proposed combination of these models generates a textual summary with the essential information about the pathologies found, their location, and the 2D heatmaps localizing each pathology on the original X-Ray scans. The proposed model is tested on two medical datasets, the Open-I and MIMIC-CXR, and on the general-purpose MS-COCO dataset. The results measured with the natural language assessment metrics demonstrate its efficient applicability to chest X-Ray image captioning.
1 Introduction
Medical imaging is indispensable in the current diagnostic workflows. Out of the plethora of existing imaging modalities,
X-Ray remains one of the most widely-used visualization methods in many hospitals around the world, because it is inexpensive
and easily accessible1 . Analyzing and interpreting X-ray images is especially crucial for diagnosing and monitoring a wide
range of lung diseases, including pneumonia2 , pneumothorax3 , and COVID-19 complications4 .
Today, the generation of a free-text description based on clinical radiography results has become a convenient tool in
clinical practice5 . Having to study approximately 100 X-Rays daily5 , radiologists are overloaded by the necessity to report
their observations in writing, a tedious and time-consuming task that requires deep domain-specific knowledge. This manual annotation overload can lead to several problems, such as missed findings, inconsistent quantification, and delays in a patient's hospital stay, which increases the cost of treatment. Above all, the dependence of correct diagnosis on the qualification of the radiologist remains a major problem.
In the COVID-19 era, there is an even higher need for a robust image captioning5 framework. Thus, many healthcare systems outsource the medical image analysis task. Automatic generation of chest X-Ray medical reports using deep learning can assist clinicians and accelerate the diagnostic process. Providing automated support for this task has the potential to ease clinical workflows and improve both care quality and standardization. We propose to adapt a model that performs well on non-medical data to the medical domain.
1.3 Contributions
In the current paper, we address all the problems mentioned above. The contributions of this paper are the following:
• We introduce a new architecture for image captioning based on a combination of two language models with image attention (SAT) and text attention (GPT-3), which outperforms current state-of-the-art models
• We introduce a new preprocessing pipeline for radiology reports that yields higher NLG metrics
• We perform extensive experiments to show the effectiveness of the proposed methods
• Finally, we contribute to the deep learning community two language models trained on the large MIMIC-CXR dataset
The rest of the paper is organized as follows: section 2 describes the two language model architectures separately, section 3 provides the description of the proposed approach, section 4 describes the datasets used and the computing power utilized, subsection 5.1 and subsection 5.2 present and compare the results, while section 6 presents the conclusions of the paper.
2 Methods
2.1 Show Attend and Tell
Show Attend and Tell (SAT)11 is an attention-based image caption generation neural network. The attention-based technique yields well-interpretable results, which can be used by radiologists to verify their findings on the X-Ray. The attention module makes it possible to visualize where exactly the model ’sees’ a specific pathology. SAT consists of three blocks: the Encoder, the Attention module and the Decoder. It takes an image, encodes it, attends to each part of the image, and generates an L-length caption z, an encoded sequence of words from a vocabulary of size W:

z = \{z_1, z_2, \ldots, z_L\}, \quad z_i \in \mathbb{R}^{W} \qquad (1)
2.1.1 Encoder
The Encoder is a convolutional neural network that maps the input image into a set of C encoded feature vectors:

a = \{a_1, a_2, \ldots, a_C\}, \quad a_i \in \mathbb{R}^{D} \qquad (2)

Here C represents the number of channels in the output of the encoder. It depends on the type of the encoder used: 1024 for DenseNet-12136 , 512 for VGG-1637 , 2048 for InceptionV338 and ResNet-10139 . D is a configurable parameter representing the size of the encoded vectors. Features are extracted from the lower convolutional layer, prior to the fully connected layers, and are passed through an Adaptive Average Pooling layer. This allows the decoder to selectively focus on certain parts of an image by selecting a subset of all the feature vectors.
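As an illustration of this encoder pipeline, below is a minimal PyTorch sketch; the class and variable names (SATEncoder, encoded_size) are our own, and DenseNet-121 is used as an example backbone rather than the prescribed configuration:

```python
import torch
import torch.nn as nn
import torchvision


class SATEncoder(nn.Module):
    """Encodes an image into a C x D x D grid of feature vectors (illustrative sketch)."""

    def __init__(self, encoded_size: int = 8):
        super().__init__()
        backbone = torchvision.models.densenet121(weights=None)
        self.features = backbone.features                 # convolutional part only, C = 1024
        self.pool = nn.AdaptiveAvgPool2d((encoded_size, encoded_size))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> (batch, 1024, D, D)
        return self.pool(self.features(images))
```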
2.1.2 Decoder with attention module
The decoder is implemented as an LSTM neural network40 . It produces a caption by generating one word at every time step, conditioned on the attention (context) vector, the previous hidden state and the previously generated words. The LSTM can be represented by the following set of equations:
\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} =
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
T_{D+m+n,\,n} \begin{pmatrix} E z_{t-1} \\ h_{t-1} \\ \hat{a}_t \end{pmatrix} \qquad (3)

c_t = f_t \odot c_{t-1} + i_t \odot g_t \qquad (4)

h_t = o_t \odot \tanh(c_t) \qquad (5)
Vectors i_t, f_t, c_t, o_t, h_t represent the input/update gate activation vector, the forget gate activation vector, the memory (cell) state vector, the output gate activation vector, and the hidden state of the LSTM, respectively. T_{s,t} is an affine transformation R^s → R^t with non-zero bias. m denotes the embedding dimension, while n represents the LSTM dimension. σ and ⊙ stand for the sigmoid activation function and element-wise multiplication, respectively. E ∈ R^{m×L} is an embedding matrix. The vector â_t ∈ R^D holds the visual information from a particular input location of the image at time t; thus, â_t is called the context vector.
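For illustration, the recurrence in Eqs. (3)-(5) can be implemented with a standard LSTM cell that receives the concatenation of the previous word embedding and the context vector; the names below (embed_dim for m, decoder_dim for n, context_dim for the context vector size) and the example sizes are our own assumptions:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, decoder_dim, context_dim = 10000, 512, 512, 1024  # example sizes

embedding = nn.Embedding(vocab_size, embed_dim)                 # plays the role of E
lstm_cell = nn.LSTMCell(embed_dim + context_dim, decoder_dim)   # gates of Eq. (3)


def decoder_step(prev_word, context, h, c):
    """One decoder time step, Eqs. (3)-(5); the gating is handled inside LSTMCell."""
    lstm_input = torch.cat([embedding(prev_word), context], dim=1)
    h, c = lstm_cell(lstm_input, (h, c))
    return h, c
```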
Attention is a function φ that computes the context vector â_t from the encoded vectors a_i (2) produced by the encoder. The attention module generates a positive number α_i for each location i of the image. This number can be interpreted as the relative importance of location i among the others. The attention module is realized as a multi-layer perceptron (MLP) with a softmax activation function, conditioned on the previous hidden state h_{t−1} (5) of the LSTM. The attention module is depicted in Figure 1. The set of linear layers in the MLP is denoted as a function f_att. The weights α_ti are computed with the help of the following equations:

e_{ti} = f_{att}(a_i, h_{t-1}) \qquad (6)

\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{C} \exp(e_{tk})} \qquad (7)
The weights α_ti (7) sum to one: \sum_{i=1}^{C} \alpha_{ti} = 1. The context vector â_t is computed by the attention function φ from the set of encoded vectors a (2) and their corresponding weights α_ti (7): â_t = φ({a_i}, {α_ti}).
Figure 1. The attention module: the encoder features (1024 × 8 × 8) are flattened to 1024 × 64, transformed by linear layers to the attention dimension, summed, passed through a ReLU activation and a linear layer transforming to dimension 1, followed by a SoftMax.
According to the original paper, the function φ can be either ’soft’ or ’hard’ attention. Due to the specifics of the medical image captioning task, the function φ was chosen to be the ’soft’ attention, as it allows the model to focus on some specific parts of the X-Rays more than on others and to detect pathologies and major organs such as the heart, lungs, etc. It is named ’deterministic soft attention’ and is computed as a weighted sum: φ({a_i}, {α_ti}) = \sum_{i}^{C} \alpha_{ti} a_i. Hence, the context vector can be computed as:
\hat{a}_t = \sum_{i=1}^{C} \alpha_{ti}\, a_i \qquad (8)
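A minimal PyTorch sketch of this soft attention module (Eqs. 6-8) is given below; the class and argument names are our own:

```python
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """MLP attention scoring each encoded vector a_i against the previous hidden state."""

    def __init__(self, feature_dim: int, hidden_dim: int, attention_dim: int):
        super().__init__()
        self.att_features = nn.Linear(feature_dim, attention_dim)  # transforms a_i
        self.att_hidden = nn.Linear(hidden_dim, attention_dim)     # transforms h_{t-1}
        self.score = nn.Linear(attention_dim, 1)                   # e_{ti}, Eq. (6)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)                           # alpha_{ti}, Eq. (7)

    def forward(self, features: torch.Tensor, h_prev: torch.Tensor):
        # features: (batch, C, feature_dim), h_prev: (batch, hidden_dim)
        scores = self.score(self.relu(
            self.att_features(features) + self.att_hidden(h_prev).unsqueeze(1)
        )).squeeze(2)                                              # (batch, C)
        alpha = self.softmax(scores)                               # sums to one over C
        context = (features * alpha.unsqueeze(2)).sum(dim=1)       # Eq. (8)
        return context, alpha
```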
The initial memory state and hidden state of the LSTM are initialized with two separate multi-layer perceptrons (init-c and init-h) applied to the mean of the encoded vectors a_i (2) for faster convergence:
c_0 = f_{\text{init-c}}\Big(\frac{1}{C}\sum_{i=1}^{C} a_i\Big) \qquad (9)

h_0 = f_{\text{init-h}}\Big(\frac{1}{C}\sum_{i=1}^{C} a_i\Big) \qquad (10)
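A short sketch of this initialization is shown below; a single linear layer stands in for each MLP, and the names and sizes are our own:

```python
import torch
import torch.nn as nn

feature_dim, decoder_dim = 1024, 512             # example sizes

f_init_c = nn.Linear(feature_dim, decoder_dim)   # f_init-c, Eq. (9)
f_init_h = nn.Linear(feature_dim, decoder_dim)   # f_init-h, Eq. (10)


def init_lstm_state(features: torch.Tensor):
    """features: (batch, C, feature_dim); average over the C locations first."""
    mean_features = features.mean(dim=1)
    return f_init_h(mean_features), f_init_c(mean_features)
```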
To compute the output of the LSTM, representing a vector of probabilities for the next word, a ’deep output layer’40 was used. It looks at the LSTM state h_t (5), the context vector â_t (8), and the previously generated word z_{t−1} (2):

p(z_t \mid a, z_{t-1}) \propto \exp\big(L_o (E z_{t-1} + L_h h_t + L_z \hat{a}_t)\big) \qquad (11)
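A minimal PyTorch sketch of such a deep output layer follows; the layer and variable names are our own, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, decoder_dim, context_dim = 10000, 512, 512, 1024  # example sizes

L_o = nn.Linear(embed_dim, vocab_size)
L_h = nn.Linear(decoder_dim, embed_dim)
L_z = nn.Linear(context_dim, embed_dim)


def next_word_logits(prev_embedding, h_t, context_t):
    """Unnormalized scores of Eq. (11); a softmax turns them into probabilities."""
    return L_o(prev_embedding + L_h(h_t) + L_z(context_t))
```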
3 Proposed Architecture
We introduce two architectures for X-Ray image captioning. The overall goal of our approach is to improve the quality of
Encoder-Decoder generated clinical records by using the GPT-3 language model. The suggested model consists of two parts: the Encoder-Decoder (LSTM) with an attention module, and the GPT-3. While the Encoder with the LSTM detects pathologies and indicates the zones demanding higher attention, the GPT-3 takes its output as input and writes a comprehensive medical report.
There are two possible approaches to this task. The first one consists of forcing the models to learn a joint word distribution. Within this method (Fig. 2), both models A and B output scores for the next word in a sentence. Afterwards, by concatenating these scores and passing them through the feed-forward neural network C, we obtain the final scores for the upcoming word. The disadvantage of this approach is the following: the GPT-3 model has its own vocabulary built by the Byte Pair Tokenizer. This vocabulary is different from the one used by the Show Attend and Tell. We need to take from the continuous GPT-3 distribution only the scores corresponding to the words present in the Show Attend and Tell vocabulary. This turns the continuous distribution from the GPT-3 into a discrete one, and hence we do not use the full generative power of the GPT-3.
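Purely for illustration, the score fusion of this first approach could be sketched as follows; the vocabulary mapping sat_to_gpt_ids, the layer sizes, and the fusion network are our own assumptions rather than the exact configuration used:

```python
import torch
import torch.nn as nn

sat_vocab_size, gpt_vocab_size = 10000, 50257    # example sizes

# For every word of the SAT vocabulary, the id of a corresponding GPT token
# (a random placeholder here; words split into several byte-pair tokens make this lossy).
sat_to_gpt_ids = torch.randint(0, gpt_vocab_size, (sat_vocab_size,))

fusion = nn.Sequential(                          # feed-forward net C from Fig. 2
    nn.Linear(2 * sat_vocab_size, sat_vocab_size),
    nn.ReLU(),
    nn.Linear(sat_vocab_size, sat_vocab_size),
)


def fused_scores(sat_logits: torch.Tensor, gpt_logits: torch.Tensor) -> torch.Tensor:
    # sat_logits: (batch, sat_vocab), gpt_logits: (batch, gpt_vocab)
    gpt_on_sat_vocab = gpt_logits[:, sat_to_gpt_ids]   # discretizes the GPT distribution
    return fusion(torch.cat([sat_logits, gpt_on_sat_vocab], dim=1))
```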
Figure 2. The first approach: learning the joint distribution of the two models. The SAT (A) and the GPT-3 (B) each output logits for the next word; these logits are concatenated and passed through a feed-forward neural network (C) operating over the SAT vocabulary. The drawback is in sampling from the GPT-3 distribution.
The second method, shown in Fig. 3, consists of fine-tuning both models on the MIMIC-CXR dataset and using them one after another. Show Attend and Tell (A) gets an image as input and generates a report based on the findings located on the X-Ray by the attention module. It learns where to focus and gives a seed for the GPT-3 (B) to continue generating text. The GPT-3 was fine-tuned on MIMIC-CXR in a self-supervised manner using the Huggingface framework43: it learns to predict the next word in the text. The GPT-3 continues the report output by SAT and generates a detailed and complete clinical report based on the pathologies found by SAT. Such an approach is better for the GPT-3 as it gets more context as input (from SAT) than in the first approach. Thus, the second approach performs better and was hence chosen as the main architecture.
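A sketch of such self-supervised fine-tuning with the Huggingface framework43 is given below; the checkpoint name (gpt2 is used as a stand-in), the reports file path, and the hyperparameters are our own illustrative assumptions:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in checkpoint
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One preprocessed MIMIC-CXR report per line (hypothetical file name).
reports = load_dataset("text", data_files={"train": "mimic_cxr_reports.txt"})
tokenized = reports.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-mimic-cxr",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    # mlm=False gives the plain next-token (causal) language modelling objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```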
Figure 3. The second approach: the pretrained GPT-3 (B) continues the text generated by SAT (A); the SAT output is decoded with K-beam search before being passed to the GPT-3.
3.1 First Language Model
The first language model is the SAT Encoder-Decoder. Features from the encoder's last convolutional layer were taken and passed through the Adaptive Average Pooling layer. As a result, the encoded image parts were obtained. They can be represented by a tensor with the following dimensions: (batch size, C, D, D) (Eq. 2). C stands for the number of channels, i.e., how many different image regions are considered; D is the dimension of the encoded image region. Furthermore, a fine-tuning option for the encoder was added; it enables or disables the calculation of gradients for the encoder's parameters in its last layers. Then, at every time step, the decoder with the attention module observes the encoded small images with findings and generates a caption word by word. The Encoder output is received and flattened to the dimensions (batch size, C, D × D). Since captions are padded with a special <pad> token, they are sorted by decreasing length, and at every time step of word generation an effective batch size is computed in order not to process the <pad> tokens.
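A sketch of this per-time-step "effective batch size" trick is shown below; the variable names are ours, and the attention/LSTM call is only indicated:

```python
import torch


def decode_sorted(captions: torch.Tensor, caption_lengths: torch.Tensor):
    """captions: (batch, max_len); lengths count real tokens, excluding <pad>."""
    caption_lengths, sort_idx = caption_lengths.sort(descending=True)
    captions = captions[sort_idx]
    decode_lengths = (caption_lengths - 1).tolist()   # the last token is not fed as input
    for t in range(max(decode_lengths)):
        # only the first batch_size_t captions still have a real token at step t
        batch_size_t = sum(length > t for length in decode_lengths)
        step_words = captions[:batch_size_t, t]
        # ... run attention + LSTM cell for these batch_size_t sequences only ...
    return captions, decode_lengths
```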
The Show Attend and Tell model was trained using the teacher-forcing method, in which at each step the input to the model is the ground-truth word for this step rather than the previously generated word. As a result, we can consider the SAT as a language model A. It takes a tokenized text of length m and an image as input, and outputs a vector of probabilities for the next word at each time step t, where W is the SAT vocabulary size and L is the length of the generated report (Eq. 1); the probability vector P1 is computed as shown in Eq. 11.
During training, the LSTM outputs the word with the maximum probability after the softmax layer. This is a greedy approach, yet there is also an option to use K-beam search. The authors of46 used K-beam search during training; however, this is not a common approach. In our experiments, the greedy approach was used during training, and we applied K-beam search at the inference stage.
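A sketch of a teacher-forced training step built from the components above is given below; the decoder attributes (init_state, attention, step, output_layer) are hypothetical wrappers around the parts sketched earlier, not an exact API:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)   # assuming <pad> has index 0


def train_step(decoder, features, captions):
    """captions: (batch, T) ground-truth tokens; the model is fed the true word at
    every step (teacher forcing), not the word it generated at the previous step."""
    h, c = decoder.init_state(features)                          # Eqs. (9)-(10)
    loss = 0.0
    for t in range(captions.size(1) - 1):
        context, _ = decoder.attention(features, h)              # Eqs. (6)-(8)
        h, c = decoder.step(captions[:, t], context, h, c)       # Eqs. (3)-(5)
        logits = decoder.output_layer(h, context, captions[:, t])  # Eq. (11)
        loss = loss + criterion(logits, captions[:, t + 1])
    return loss
```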
3.2 Second Language Model
The second part of the proposed architecture is the GPT-3 language model. The GPT-3 is built from decoder blocks of the transformer architecture; each decoder block consists of masked self-attention and a feed-forward neural network (FFNN). The output yields the token probabilities, i.e., logits. The GPT-3 was pretrained separately on the MIMIC-CXR dataset and was then fine-tuned together with the SAT to enhance clinical reports.
We put a special token <start> at the end of the text generated by the SAT, allowing the GPT-3 to understand where to start the generation process. We also used K-beam search after the GPT-3 generation and took the second-best sentence from the output as a continuation. The pretrained GPT-3 performs as a separate language model B and generates good records based on the input text or tags. The GPT-3 generates the report until it emits the special <|endoftext|> token. We denote the length of the GPT-3 generated text as l.
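As an illustration of this inference-time chaining with the Huggingface generation API, consider the sketch below; the checkpoint path, the seed report, and the assumption that <start> belongs to the fine-tuned vocabulary are ours:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt-mimic-cxr")   # hypothetical fine-tuned model
model = AutoModelForCausalLM.from_pretrained("gpt-mimic-cxr")

sat_report = "no acute cardiopulmonary abnormality. heart size is normal."  # produced by SAT
prompt = sat_report + " <start>"          # marks where the GPT continuation begins

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,                          # K-beam search
    num_return_sequences=2,               # the second-best sequence is kept
    max_new_tokens=128,
    eos_token_id=tokenizer.eos_token_id,  # stops at <|endoftext|>
    pad_token_id=tokenizer.eos_token_id,
)
full_report = tokenizer.decode(outputs[1], skip_special_tokens=True)
```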
4 Experiments
4.1 Datasets
For the training and evaluation of medical image captioning, we use three publicly available datasets. Two of them are medical image datasets, and the third one is a general-purpose dataset.
MIMIC-CXR The MIMIC Chest X-Ray (MIMIC-CXR)53 dataset is a large publicly available dataset of chest radiographs in
DICOM format with free-text radiology reports. This dataset consists of 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA.
Open-I The Indiana University Chest X-Ray Collection (IU X-Ray)20 contains radiology reports associated with X-Ray images. This dataset contains 7,470 image-report pairs. All the reports include the following sections: impression, findings, tags, comparison, and indication. We use the concatenation of the impression and findings sections as the target captions.
MS COCO The Microsoft Common Objects in Context dataset (MS COCO)54 is a large-scale non-medical dataset for scene understanding. The dataset is commonly used for training and benchmarking object detection, segmentation, and captioning algorithms.
To help the models generate more accurate reports, we added the extracted labels to the beginning of each report. This allows the language models to know the summary of the report for a more precise description generation.
We additionally formed an abbreviation dictionary of 150+ words from the Unified Medical Language System (UMLS)57 . We also extended our dictionary with several commonly used medical terms from the Medical Concept Annotation Tool58 .
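An illustrative sketch of this preprocessing step follows; the abbreviation entries and label strings are examples we chose for the sketch, not the actual dictionary content:

```python
import re

# A few example entries; the full dictionary of 150+ abbreviations comes from UMLS.
ABBREVIATIONS = {
    "chf": "congestive heart failure",
    "copd": "chronic obstructive pulmonary disease",
    "svc": "superior vena cava",
}


def preprocess_report(report: str, labels: list) -> str:
    report = report.lower()
    for abbr, expansion in ABBREVIATIONS.items():
        report = re.sub(rf"\b{abbr}\b", expansion, report)
    # Prepend the extracted labels so the model sees a short summary of the findings first.
    summary = " ".join(f"{label}." for label in labels)
    return f"{summary} {report}"


print(preprocess_report("Stable CHF. No evidence of COPD exacerbation.",
                        ["no pneumothorax", "no edema"]))
```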
5.2 Discussion
The first language model (SAT) learned to generate a short summary at the beginning of the report, based on the findings from the X-Ray, and to provide the finding details. This gives a text generation seed for the second model. The performed preprocessing of the medical reports allowed us to achieve these high metrics. We also address the biased-data problem by applying domain-specific text preprocessing using the NegBio labeller. In a radiology database, the data is unbalanced because abnormal cases are rarer than normal ones. The NegBio labeller allowed us to obtain clinical records that are not negatively biased, as it added short sentences at the beginning of the ground-truth report, making this task closer (in some ways) to a classification task, where state-of-the-art models have already managed to achieve strong performance. The SAT also provides 2D heatmaps of pathology localization, assisting and accelerating the diagnostic process followed by clinicians.
The second language model, the Generative Pretrained Transformer (GPT-3), showed promising results in the medical
domain. It successfully continued texts from the first language model, taking into consideration all the findings provided. As
the GPT-3 is a large and capable transformer, it summarizes and provides more details on the findings. The natural language generation metrics support using the two language models one after another; such an approach can be considered a strong one for text generation. The SAT followed by the GPT-3 outperformed the reported state-of-the-art (SOTA) models on all three datasets considered.
Notably, the proposed approach beats SOTA models on MIMIC-CXR demonstrating the highest performance in all the metrics
measured. The performance for the main evaluation dataset, the Open-I, is also measured by the F1-score using micro-averaging
and demonstrates 0.861 vs. 0.840 for the proposed (SAT + GPT-3) model and the SAT, respectively.
Examples of the reports generated jointly via the Show-Attend-Tell + GPT-3 architecture are shown in Table 2. One may notice that some generated sentences are identical to the ground truth. For example, both the generated and the true reports for the first X-Ray contain “no acute cardiopulmonary abnormality". Some sentences are close in meaning, even if they differ in the chosen words and n-grams ("no pneumonia. no pleural effusion. no edema. ..." compared to “without pulmonary edema or pneumothorax").
6 Conclusions
In this paper, we introduced a new technique of combining two language models for the medical image captioning task. Principally, new preprocessing and squeezing approaches for clinical records were implemented along with a combined language model, where the first component is based on the attention mechanism and the second is a generative pretrained transformer.
Model            CIDEr   ROUGE_L  BLEU-1  BLEU-2  BLEU-3  BLEU-4

MIMIC-CXR
S&T8             0.886   0.300    0.307   0.201   0.137   0.093
Original SAT11   0.967   0.288    0.318   0.205   0.137   0.093
TieNet15         1.004   0.296    0.332   0.212   0.142   0.095
NLG24            1.153   0.307    0.352   0.223   0.153   0.104
SAT              1.986   0.478    0.634   0.549   0.451   0.383
SAT + GPT-3      1.989   0.480    0.725   0.626   0.505   0.418

Open-I
Co-Attention56   0.327   0.447    0.517   0.386   0.306   0.247
TieNet15         -       0.311    0.330   0.194   0.124   0.081
CNN-RNN8         0.111   0.267    0.316   0.211   0.140   0.095
LRCN60           0.190   0.278    0.369   0.229   0.149   0.138
ATT-RK14         0.155   0.323    0.369   0.226   0.151   0.108
CDGPT233         0.257   0.289    0.387   0.245   0.166   0.111
Original SAT11   0.320   0.361    0.433   0.281   0.194   0.138
SAT              0.699   0.413    0.407   0.258   0.210   0.125
SAT + GPT-3      0.701   0.450    0.520   0.390   0.296   0.235

MS-COCO
BRNN61           -       -        0.642   0.451   0.304   0.203
Original SAT11   -       -        0.718   0.504   0.357   0.250
SAT              1.300   0.592    0.815   0.663   0.516   0.395
SAT + GPT-3      1.360   0.606    0.821   0.672   0.529   0.409

Table 1. Reported mean performance using word-overlap metrics for two medical radiology datasets and one non-medical general-purpose dataset. Here SAT stands for the model implemented by us and trained with the preprocessed MIMIC-CXR data. BLEU-n denotes the BLEU score that uses up to n-grams.
Table 2. Examples of ground-truth reports and reports generated by the proposed Show-Attend-Tell + GPT-3 architecture (the corresponding chest X-Ray images are omitted here).

Example 1
Ground truth: Lungs remain well inflated without evidence of focal airspace consolidation, pleural effusions, pulmonary edema or pneumothorax. Irregularity in the right humeral neck is related to a known healing fracture secondary to recent fall. PA and lateral views of the chest at 09:55 are submitted.
Our prediction: no findings. no pneumonia. no pleural effusion. no edema. there is little change and no evidence of acute cardiopulmonary disease. no pneumonia, vascular congestion, pleural effusion. of incidental note is an azygos fissure, of no clinical significance. this raises possibility of a normal variant.

Example 2
Ground truth: Compared to prior chest radiographs through . Previous mild pulmonary edema has improved, moderate cardiomegaly and mediastinal vascular engorgement have not. ET tube, right transjugular temporary pacer lead are in standard placements and an esophageal drainage tube passes into the stomach and out of view. Pleural effusions are presumed but not substantial. No pneumothorax.
Our prediction: support devices present. no pneumothorax. pleural effusion present. lung opacity present. uncertain enlarged cardiomediastinum. no edema. atelectasis present. right internal jugular central line has its tip in distal superior vena cava. overall cardiac and mediastinal contours are likely stable given patient rotation on current study. lung volumes remain low with patchy opacities at both bases likely reflecting atelectasis. blunting of both costophrenic angles may reflect small effusions.
The proposed combination of models generates a descriptive textual summary with essential information on the found pathologies along with their location and severity. Besides, the 2D heatmaps localize each pathology on the original X-Ray scans. The results measured with the natural language generation metrics on both the MIMIC-CXR and the Open-I datasets speak for an efficient applicability to the chest X-Ray image captioning task. This approach also provides well-interpretable results and supports medical decision making.
We investigated various text generation approaches from the angle of automatic X-Ray captioning. We showed that the Show-Attend-Tell is a strong baseline outperforming models with Transformer-based decoders. With the help of the GPT-3 pretrained language model, we managed to improve this baseline. The simple method, in which the GPT-3 model finishes a report started by the Show-Attend-Tell model, yields significant improvements of the standard text generation scores.
7 Acknowledgements
The authors of this paper thank Alexander Panchenko and Alexander Shvets for the helpful discussion.
References
1. Irvin, J. et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019).
2. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Medical
Informatics Assoc. 23, 304–310 (2016). URL https://doi.org/10.1093/jamia/ocv080.
3. Chan, Y.-H., Zeng, Y.-Z., Wu, H.-C., Wu, M.-C. & Sun, H.-M. Effective pneumothorax detection for chest x-ray images
using local binary pattern and support vector machine. Journal of Healthcare Engineering 2018, 1–11 (2018).
4. Maghdid, H. S. et al. Diagnosing covid-19 pneumonia from x-ray and ct images using deep learning and transfer learning
algorithms. In Multimodal image exploitation and learning 2021, vol. 11734, 117340E (International Society for Optics
and Photonics, 2021).
5. Monshi, M. M. A., Poon, J. & Chung, V. Deep learning in generating radiology reports: A survey. Artificial Intelligence in
Medicine 106, 101878 (2020).
6. Gurgitano, M. et al. Interventional radiology ex-machina: impact of artificial intelligence on practice. La radiologia
medica 126, 998–1006 (2021).
7. Pavlopoulos, J., Kougia, V. & Androutsopoulos, I. A survey on biomedical image captioning. In Proceedings of the Second
Workshop on Shortcomings in Vision and Language, 26–36 (Association for Computational Linguistics, Minneapolis,
Minnesota, 2019). URL https://www.aclweb.org/anthology/W19-1803.
8. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the
IEEE conference on computer vision and pattern recognition, 3156–3164 (2015).
9. Shin, H.-C. et al. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation (2016).
1603.08486.
10. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate (2016).
1409.0473.
11. Xu, K. et al. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd
International Conference on International Conference on Machine Learning - Volume 37, ICML’15, 2048–2057 (JMLR.org,
2015).
12. Donahue, J. et al. Long-term recurrent convolutional networks for visual recognition and description. In 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2625–2634 (2015).
13. Zhang, Z., Xie, Y., Xing, F., McGough, M. & Yang, L. Mdnet: A semantically and visually interpretable medical image
diagnosis network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3549–3557 (2017).
14. You, Q., Jin, H., Wang, Z., Fang, C. & Luo, J. Image captioning with semantic attention. In 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 4651–4659 (2016).
15. Wang, X., Peng, Y., Lu, L., Lu, Z. & Summers, R. M. Tienet: Text-image embedding network for common thorax disease
classification and reporting in chest x-rays. CoRR abs/1801.04334 (2018). URL http://arxiv.org/abs/1801.
04334. 1801.04334.
16. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2577–2586 (Association
for Computational Linguistics, Melbourne, Australia, 2018). URL https://www.aclweb.org/anthology/
P18-1240.
17. Wang, X. et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and
localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
3462–3471 (2017).
18. Gale, W., Oakden-Rayner, L., Carneiro, G., Bradley, A. P. & Palmer, L. J. Producing radiologist-quality reports for
interpretable artificial intelligence. CoRR abs/1806.00340 (2018). URL http://arxiv.org/abs/1806.00340.
1806.00340.
19. Yuan, J., Liao, H., Luo, R. & Luo, J. Automatic radiology report generation based on multi-view image fusion and medical
concept enrichment (2019). 1907.09085.
20. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Medical
Informatics Assoc. 23, 304–310 (2016). URL https://doi.org/10.1093/jamia/ocv080.
21. Zhang, Y. et al. When radiology report generation meets knowledge graph. Proceedings of the AAAI Conference on
Artificial Intelligence 34, 12910–12917 (2020). URL https://doi.org/10.1609/aaai.v34i07.6989.
22. Rajpurkar, P. et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning (2017). 1711.
05225.
23. Tuluptceva, N., Bakker, B., Fedulova, I., Schulz, H. & Dylov, D. V. Anomaly detection with deep perceptual autoencoders
(2020). 2006.13265.
24. Liu, G. et al. Clinically accurate chest x-ray report generation (2019). 1904.02633.
25. Peng, Y. et al. Negbio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits
on Translational Science Proceedings 2017 (2017).
26. Ni, J., Hsu, C.-N., Gentili, A. & McAuley, J. Learning visual-semantic embeddings for reporting abnormal findings on chest
X-rays. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1954–1960 (Association for Compu-
tational Linguistics, Online, 2020). URL https://www.aclweb.org/anthology/2020.findings-emnlp.
176.
27. Syeda-Mahmood, T. et al. Chest x-ray report generation through fine-grained label learning (2020). 2007.13831.
28. Liu, J. et al. Align, attend and locate: Chest x-ray diagnosis via contrast induced attention network with limited supervision.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019).
29. Cohen, J. P., Hashir, M., Brooks, R. & Bertrand, H. On the limits of cross-domain generalization in automated x-ray
prediction. In Arbel, T. et al. (eds.) Proceedings of the Third Conference on Medical Imaging with Deep Learning, vol. 121
of Proceedings of Machine Learning Research, 136–155 (PMLR, 2020). URL http://proceedings.mlr.press/
v121/cohen20a.html.
30. Rodin, I., Fedulova, I., Shelmanov, A. & Dylov, D. V. Multitask and multimodal neural network model for interpretable
analysis of x-ray images. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (IEEE,
2019). URL https://doi.org/10.1109/bibm47256.2019.8983272.
31. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (Association for Computational
Linguistics, Minneapolis, Minnesota, 2019).
32. Ziegler, Z. M., Melas-Kyriazi, L., Gehrmann, S. & Rush, A. M. Encoder-agnostic adaptation for conditional language
generation (2020). URL https://openreview.net/forum?id=B1xq264YvH.
33. Alfarghaly, O., Khaled, R., Elkorany, A., Helal, M. & Fahmy, A. Automated radiology report generation using conditioned
transformers. Informatics in Medicine Unlocked 24, 100557 (2021). URL https://www.sciencedirect.com/
science/article/pii/S2352914821000472.
34. Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating radiology reports via memory-driven transformer (2020).
2010.16056.
35. Xiong, Y., Du, B. & Yan, P. Reinforced transformer for medical image captioning. In Machine Learning in Medical
Imaging, 673–680 (Springer International Publishing, 2019).
36. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269 (2017).
37. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. &
LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
2015, Conference Track Proceedings (2015). URL http://arxiv.org/abs/1409.1556.
38. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision.
CoRR abs/1512.00567 (2015). URL http://arxiv.org/abs/1512.00567. 1512.00567.
39. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition (2015). 1512.03385.
40. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural computation 9, 1735–1780 (1997).
41. Brown, T. B. et al. Language models are few-shot learners (2020). 2005.14165.
42. Vaswani, A. et al. Attention is all you need. In Guyon, I. et al. (eds.) Advances in Neural Information Processing
Systems, vol. 30 (Curran Associates, Inc., 2017). URL https://proceedings.neurips.cc/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
43. Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (Association for Computational
Linguistics, Online, 2020). URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
44. Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and
pattern recognition, 248–255 (Ieee, 2009).
45. Cohen, J. P. et al. TorchXRayVision: A library of chest X-ray datasets and models.
https://github.com/mlmed/torchxrayvision (2020). URL https://github.com/mlmed/torchxrayvision.
46. Wiseman, S. & Rush, A. M. Sequence-to-sequence learning as beam-search optimization (2016). 1606.02960.
47. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. 311–318 (2002).
48. Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. 10 (2004).
49. Banerjee, S. & Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human
judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, 65–72 (Association for Computational Linguistics, Ann Arbor, Michigan, 2005). URL https:
//www.aclweb.org/anthology/W05-0909.
50. Vedantam, R., Zitnick, C. L. & Parikh, D. Cider: Consensus-based image description evaluation. In 2015 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 4566–4575 (2015).
51. Anderson, P., Fernando, B., Johnson, M. & Gould, S. Spice: Semantic propositional image caption evaluation. In ECCV
(2016).
52. Chen, X. et al. Microsoft coco captions: Data collection and evaluation server (2015). 1504.00325.
53. Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text
reports. Scientific Data 6 (2019).
54. Lin, T.-Y. et al. Microsoft coco: Common objects in context (2014). 1405.0312.
55. Koziol, Q. et al. HDF5. In Encyclopedia of Parallel Computing, 827–833 (Springer US, 2011).
56. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational
Linguistics, Melbourne, Australia, 2018). URL https://www.aclweb.org/anthology/P18-1240.
57. Bodenreider, O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids
Research 32, 267D–270 (2004). URL https://doi.org/10.1093/nar/gkh061.
58. Kraljevic, Z. et al. Multi-domain clinical natural language processing with medcat: the medical concept annotation toolkit
(2021). 2010.01165.
59. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In
Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, 311–318 (Association for
Computational Linguistics, USA, 2002). URL https://doi.org/10.3115/1073083.1073135.
60. Donahue, J. et al. Long-term recurrent convolutional networks for visual recognition and description (2016). 1411.4389.
61. Karpathy, A. & Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions (2015). 1412.2306.