
Medical Image Captioning via Generative Pretrained Transformers
Alexander Selivanov1,†, Oleg Y. Rogov2,†, Daniil Chesakov3, Artem Shelmanov3, Irina Fedulova1, and Dmitry V. Dylov2,*
1 Alexander Selivanov and Irina Fedulova are with the Philips Innovation Labs Rus, Skolkovo Technopark, 42 building 1, Bolshoi boulevard, Moscow, Russia, 121205.
2 Oleg Y. Rogov and Dmitry V. Dylov are with the Skolkovo Institute of Science and Technology, Bolshoy blvd., 30/1, Moscow, Russia, 121205.
3 Daniil Chesakov and Artem Shelmanov are with the Sber AI Lab, Kutuzovsky Ave, 32 bld. 1, Moscow, Russia, 121170.

† Contributed equally
* Corresponding author e-mail: [email protected]

ABSTRACT

We address the problem of automatic clinical caption generation, combining the analysis of frontal chest X-Ray scans with structured patient information from the radiology records. We combine two language models, the Show-Attend-Tell and the GPT-3, to generate comprehensive and descriptive radiology records. The proposed combination of these models generates a textual summary with the essential information about the pathologies found, their location, and 2D heatmaps localizing each pathology on the original X-Ray scans. The proposed model is tested on two medical datasets, the Open-I and MIMIC-CXR, and on the general-purpose MS-COCO. The results, measured with natural language assessment metrics, demonstrate its efficient applicability to chest X-Ray image captioning.

1 Introduction
Medical imaging is indispensable in the current diagnostic workflows. Out of the plethora of existing imaging modalities,
X-Ray remains one of the most widely-used visualization methods in many hospitals around the world, because it is inexpensive
and easily accessible1 . Analyzing and interpreting X-ray images is especially crucial for diagnosing and monitoring a wide
range of lung diseases, including pneumonia2 , pneumothorax3 , and COVID-19 complications4 .
Today, the generation of a free-text description based on clinical radiography results has become a convenient tool in
clinical practice5 . Having to study approximately 100 X-Rays daily5 , radiologists are overloaded by the necessity to report
their observations in writing, a tedious and time-consuming task that requires deep domain-specific knowledge. This manual annotation overload can lead to several problems, such as missed findings, inconsistent quantification, and a prolonged patient stay in the hospital, which increases the cost of treatment. The dependence of a correct diagnosis on the qualification of the individual radiologist is another major concern.
In the COVID-19 era, there is an even higher need for a robust image captioning framework5. Thus, many healthcare systems outsource the medical image analysis task. Automatic generation of chest X-Ray medical reports using deep learning can assist and accelerate the diagnostic process performed by clinicians. Providing automated support for this task has the potential to ease clinical workflows and improve both care quality and standardization. We propose to adapt a model that performs well on non-medical data to the medical domain.

1.1 Medical background


Radiology is the medical discipline that uses medical imaging to diagnose and treat diseases. Today, radiology actively
implements new artificial intelligence approaches6 . There are three types of radiologists - diagnostic radiologists, interventional
radiologists and radiation oncologists. They all use medical imaging procedures such as X-Rays, computed tomography
(CT), magnetic resonance imaging (MRI), nuclear medicine, positron emission tomography (PET) and ultrasound. Diagnostic
radiologists interpret and report on images resulting from imaging procedures, diagnose the cause of a patient's symptoms, recommend treatment and offer additional clinical tests. They specialize in different parts of the human body: breast imaging (mammograms), cardiovascular radiology (heart and circulatory system), chest radiology (heart and lungs), gastrointestinal radiology (stomach, intestines and abdomen), etc. Interventional radiologists use radiology images to perform clinical
procedures with minimally invasive techniques. They are often involved in treating cancers or tumors, heart diseases, strokes,
blockages in the arteries and veins, fibroids in the uterus, back pains, liver and kidney problems. Radiation oncologists use
radiation therapy to treat cancer.

1.2 Technical background


Since image captioning is a multimodal problem, it draws significant attention from both the computer vision and natural language processing communities. The latest surveys on the medical image captioning task5, 7 offer a detailed description of domain knowledge from radiology and deep learning. The first architectures to address this problem were the CNN-RNN models of8 and9. However, the latter show satisfactory results only on single-pathology tasks.
With the emergence of the attention concept10, more papers began to use visual attention11–13, with13 being the first to apply attention to medical images. The authors of11 presented a model that can fix its attention on salient objects while generating the corresponding words in the output sequence. Shortly after the visual-attention concept was introduced, text-attention was proposed by the authors of14–16. They used both semantic and visual attention, which allowed them to reach high natural language
generation (NLG) metrics on medical datasets. The authors of15 introduced TieNet, a framework generating natural reports for the ChestX-ray14 dataset17. It was trained to solve several tasks, such as classification, localization, and text generation. It used a non-hierarchical CNN-LSTM approach together with attention to semantic and visual features, which allowed it to beat the then state-of-the-art results. In18, bone fracture X-Ray reports were generated by identifying image features and filling text templates. The authors of16 suggested a multi-task framework that can both predict tags and generate texts using a co-attention mechanism. This model is still not sufficient for producing an accurate diagnosis from X-Rays, as the produced texts contained repeated sentences due to a lack of contextual coherence in the hierarchical models. The authors of19 took advantage of a sentence-level attention mechanism in a late fusion fashion, exploiting multi-view images with both frontal and lateral views from the Open-I dataset20.
Authors of21 proposed to utilize a pre-constructed knowledge graph embedding module (extracted from the Open-I images
using Chexnet models22) on multiple disease findings to assist the report generation process. The authors of23 presented an anomaly detection method for detecting abnormalities on chest X-Rays with deep perceptual autoencoders. The authors of24 first generated topics for sentences using reinforcement learning (RL), followed by word-level decoding from the topic with attention to the original images; RL was used for tuning to optimize readability. We solve this problem with a simpler method without loss of quality. To extract topics, we use the NegBio labeller17, 25, which provides topics from clinical reports. We add these topics to the beginning of the medical report so that our model understands what exactly should be described in the generated text.
The paper in26 dives into reporting abnormal findings on radiology images. The proposed method learns conditional visual-semantic embeddings of radiology images and reports, which are further used to measure the similarity between image regions and medical reports; this is done by optimizing a triplet ranking loss. The authors of27 developed an algorithm that learns fine-grained descriptions of findings from images and uses their pattern of occurrence to retrieve and customize similar reports from a large report database. The work in28 proposed a Contrast Induced Attention Network (CIA-Net), using contrastive learning on aligned positive and negative samples for disease localization on chest X-Ray images. The work in29 studies the cross-domain performance, agreement between models, and model representations for X-Ray diagnostic prediction tasks. The authors test for concept similarity by regularizing a network to group tasks across multiple datasets together and observe variation across the tasks. The model in30 generates a short textual summary with essential information on the found pathologies along with their location and severity. That model is trained on only 2% of the MIMIC-CXR dataset and generates short reports. In contrast, in this work, we train on the whole MIMIC-CXR dataset and generate full-text reports.
The authors of31–35 attempted to use transformer-based models as decoders in the image captioning domain. The work in34 proposed to generate radiology reports through a custom transformer with an additional memory-driven unit. Another model was introduced in35, where the encoder detects regions of interest via a bottom-up attention module and extracts top-down visual features; in that study, the decoder is a custom transformer. The paper in32 proposes an approach called "pseudo self-attention". Its main idea is to incorporate the conditioning input as a pseudo history for a pretrained transformer: new key and value weights are added in the self-attention module and projected onto the decoder's self-attention space. The work in33 focuses on visual and weighted semantic features.

1.3 Contributions
In the current paper, we address all the problems mentioned above. The contributions of this paper are the following:
• We introduce a new architecture for image captioning, based on a combination of two language models with image attention (SAT) and text attention (GPT-3), which outperforms current state-of-the-art models.
• We introduce a new preprocessing pipeline for radiology reports that yields higher NLG metrics.
• We perform extensive experiments to show the effectiveness of the proposed methods.
• Finally, we contribute to the deep learning community two language models trained on the large MIMIC-CXR dataset.
The rest of the paper is organized as follows: section 2 describes the two language model architectures separately, section 3 provides the description of the proposed approach, section 4 describes the datasets used and the computing power utilized, subsection 5.1 and subsection 5.2 present and compare the results, and section 6 concludes the paper.

2 Methods
2.1 Show Attend and Tell
Show Attend and Tell (SAT)11 is an attention-based image caption generation neural network. The attention-based technique yields well-interpretable results, which can be used by a radiologist to verify the findings on the X-Ray. The attention module makes it possible to visualize where exactly the model 'sees' a specific pathology. SAT consists of three blocks: the Encoder, the Attention module, and the Decoder. It takes an image, encodes it, attends to each part of the image, and generates an L-length caption z, an encoded sequence of words from a vocabulary of size $W_{\mathrm{SAT}}$:

$z = \{z_1, \dots, z_L\}, \quad z_i \in \mathbb{R}^{W_{\mathrm{SAT}}} \qquad (1)$


2.1.1 Encoder
The Encoder is a convolutional neural network (CNN). It encodes an image and outputs a set of C vectors, each of which is a D-dimensional representation of the corresponding part of the image:

$a = \{a_1, \dots, a_C\}, \quad a_i \in \mathbb{R}^{D \times D} \qquad (2)$

Here C represents the number of channels in the output of the encoder. It depends on the type of encoder used: 1024 for DenseNet-12136, 512 for VGG-1637, 2048 for InceptionV338 and ResNet-10139. D is a configurable parameter representing the size of the encoded vectors. Features are extracted from a lower convolutional layer, prior to the fully connected layers, and are passed through an Adaptive Average Pooling layer. This allows the decoder to selectively focus on certain parts of an image by selecting a subset of all the feature vectors.
2.1.2 Decoder with attention module
The decoder is implemented as an LSTM neural network40. It produces a caption by generating one word at every time step, conditioned on the attention (context) vector, the previous hidden state, and the previously generated words. The LSTM can be represented by the following set of equations:
$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E z_{t-1} \\ h_{t-1} \\ \hat{a}_t \end{pmatrix} \qquad (3)$

$c_t = f_t \odot c_{t-1} + i_t \odot g_t \qquad (4)$

$h_t = o_t \odot \tanh(c_t) \qquad (5)$

The vectors $i_t$, $f_t$, $c_t$, $o_t$, $h_t$ represent the input/update gate activation vector, the forget gate activation vector, the memory (cell) state vector, the output gate activation vector, and the hidden state of the LSTM, respectively. $T_{s,t}$ is an affine transformation $\mathbb{R}^s \to \mathbb{R}^t$ with non-zero bias. $m$ denotes the embedding dimension, while $n$ represents the LSTM dimension. $\sigma$ and $\odot$ stand for the sigmoid activation function and element-wise multiplication, respectively. $E \in \mathbb{R}^{m \times L}$ is an embedding matrix. The vector $\hat{a}_t \in \mathbb{R}^D$ holds the visual information from a particular input location of the image at time $t$; thus, $\hat{a}_t$ is called the context vector.
Attention is a function $\varphi$ that computes the context vector $\hat{a}_t$ from the encoded vectors $a_i$ (2) produced by the encoder. The attention module generates a positive number $\alpha_i$ for each location $i$ in the image, which can be interpreted as the relative importance given to location $i$ among the others. The attention module is realized as a multi-layer perceptron (MLP) with a softmax activation function, conditioned on the previous hidden state $h_{t-1}$ (5) of the LSTM. The attention module is depicted in Figure 1. The set of linear layers in the MLP is denoted as a function $f_{\mathrm{att}}$. The weights $\alpha_{ti}$ are computed with the help of the following equations:

$e_{ti} = f_{\mathrm{att}}(a_i, h_{t-1}) \qquad (6)$

$\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{p=1}^{C} \exp(e_{tp})} \qquad (7)$

The sum of the weights $\alpha_{ti}$ (7) equals one: $\sum_{i=1}^{C} \alpha_{ti} = 1$. The context vector $\hat{a}_t$ is computed by the attention function $\varphi$ with the set of encoded vectors $a$ (2) and their corresponding weights $\alpha_{ti}$ (7) as inputs: $\hat{a}_t = \varphi(\{a_i\}, \{\alpha_{ti}\})$.

Figure 1. Attention module used in SAT: the flattened encoder output (1024 × 64) and the decoder hidden state h_t are each transformed by a linear layer to the attention dimension, summed and passed through a ReLU activation, projected by a linear layer to dimension 1, and normalized with a softmax to produce the weights α.

According to the original paper, the function $\varphi$ can be either 'soft' or 'hard' attention. Due to the specifics of the medical image captioning task, $\varphi$ was chosen to be the 'soft' attention, as it allows the model to focus more on some specific parts of the X-Ray than on others and to detect pathologies and major organs such as the heart, the lungs, etc. It is named 'deterministic soft attention' and is realized as a weighted sum, $\varphi(\{a_i\}, \{\alpha_{ti}\}) = \sum_{i=1}^{C} \alpha_{ti} a_i$. Hence, the context vector can be computed as:
$\hat{a}_t = \sum_{i=1}^{C} \alpha_{ti} a_i \qquad (8)$
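As a minimal illustration of Eqs. (6)–(8), such a soft-attention module can be sketched in PyTorch as follows; the class and dimension names (`encoder_dim`, `decoder_dim`, `attention_dim`) are our own illustrative choices rather than the authors' released code.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive (MLP) attention over encoded image regions, cf. Eqs. (6)-(8)."""
    def __init__(self, encoder_dim: int, decoder_dim: int, attention_dim: int):
        super().__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)   # projects each a_i
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)   # projects h_{t-1}
        self.full_att = nn.Linear(attention_dim, 1)                # produces the scores e_{ti}
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (batch, C, encoder_dim) region vectors; decoder_hidden: (batch, decoder_dim)
        att1 = self.encoder_att(encoder_out)                  # (batch, C, attention_dim)
        att2 = self.decoder_att(decoder_hidden).unsqueeze(1)  # (batch, 1, attention_dim)
        e = self.full_att(self.relu(att1 + att2)).squeeze(2)  # (batch, C), Eq. (6)
        alpha = self.softmax(e)                               # (batch, C), Eq. (7)
        context = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (batch, encoder_dim), Eq. (8)
        return context, alpha
```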

The initial memory state and hidden state of the LSTM are initialized with two separate multi-layer perceptrons (init-c and
init-h) with the encoded vectors ai (2) for a faster convergence:

$c_0 = f_{\mathrm{init\text{-}c}}\Big(\tfrac{1}{C}\sum_{i=1}^{C} a_i\Big) \qquad (9)$

$h_0 = f_{\mathrm{init\text{-}h}}\Big(\tfrac{1}{C}\sum_{i=1}^{C} a_i\Big) \qquad (10)$

To compute the output of the LSTM, representing a vector of probabilities for the next word, a 'deep output layer'40 was used. It looks at the LSTM state $h_t$ (5), the context vector $\hat{a}_t$ (8), and the previous word $z_{t-1}$ (2):

$P(z_t \mid \hat{a}_t, z_{t-1}) = \mathrm{softmax}\big(L_o (L_h h_t + L_a \hat{a}_t + E z_{t-1})\big) \qquad (11)$

where $L_o \in \mathbb{R}^{W \times m}$, $L_h \in \mathbb{R}^{m \times n}$, and $L_a \in \mathbb{R}^{m \times D}$ are trainable projection matrices, and $E \in \mathbb{R}^{m \times L}$ is the embedding matrix.
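The module below is our own illustrative rendering of Eq. (11) in PyTorch, not the authors' code; softmax of the returned scores gives the next-word distribution.

```python
import torch
import torch.nn as nn

class DeepOutputLayer(nn.Module):
    """Next-word scores from the LSTM state, the context vector and the previous word, cf. Eq. (11)."""
    def __init__(self, vocab_size: int, embed_dim: int, decoder_dim: int, encoder_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # E
        self.L_h = nn.Linear(decoder_dim, embed_dim)            # L_h
        self.L_a = nn.Linear(encoder_dim, embed_dim)            # L_a
        self.L_o = nn.Linear(embed_dim, vocab_size)              # L_o

    def forward(self, h_t, context, prev_word_ids):
        # h_t: (batch, decoder_dim), context: (batch, encoder_dim), prev_word_ids: (batch,) word indices
        scores = self.L_o(self.L_h(h_t) + self.L_a(context) + self.embedding(prev_word_ids))
        return scores   # apply softmax over the vocabulary to obtain P(z_t | a_t, z_{t-1})
```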


The authors of11 suggest using 'doubly stochastic attention', where $\sum_t \alpha_{ti} \approx 1$. This can be interpreted as encouraging the model to pay equal attention to every part of the image. Yet, this method is not relevant for X-Rays, as each part of the chest is in almost the same position from image to image. If the model has learned, e.g., that the heart is in its specific position, it does not have to search for the heart elsewhere. The model is trained in an end-to-end manner by minimizing the cross-entropy loss $L_{CE}$ between the softmaxed probability distribution of the next word and the true caption: $L_{CE} = -\log(P(z \mid a))$.

2.2 Generative Pretrained Transformer


The Generative Pretrained Transformer (GPT-3)41 is a large transformer-based language model with $1.75 \times 10^{11}$ parameters, trained on 570 GB of text. GPT-3 can be used to generate realistic text continuations in an arbitrary domain. Essentially, GPT-3 is a transformer that looks at a part of the sentence and predicts the next word, thus being a language model. The original transformer42 is made up of an encoder stack and a decoder stack, in which encoders and decoders are stacked upon each other, whereas GPT-3 is built using only decoder blocks. One decoder block consists of a Masked Self-Attention layer and a Feed-Forward neural network. It is called Masked because it attends only to previous inputs. The input should be encoded prior to entering the decoder block. In transformers, and in GPT-3 in particular, there are two subsequent encodings: Byte Pair Token Encoding
and Positional Encoding. Byte Pair Encoding (BPE) is a simple data compression technique that iteratively replaces the most
frequent pair of bytes in a sequence with a single, unused byte. The algorithm compresses data by finding the most frequently
occurring pairs of adjacent subtokens in the data and replacing all instances of the pair with a single subword. The algorithm
repeats this process until no further compression is possible. Such tokenization avoids adding a special <unk> token to the vocabulary, as any word can now be encoded and reconstructed as a combination of subwords from the vocabulary.
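To make the merge procedure concrete, a toy version of the BPE training loop (counting adjacent symbol pairs and merging the most frequent one) might look like the following; this is a didactic sketch, not the tokenizer used by GPT-3.

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """Learn BPE merge rules from a word-frequency dictionary (toy example)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():          # replace every occurrence of the pair
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    merged.append(symbols[i]); i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

print(bpe_merges({"pneumonia": 5, "pneumothorax": 4, "effusion": 3}, num_merges=5))
```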

3 Proposed Architecture
We introduce two architectures for X-Ray image captioning. The overall goal of our approach is to improve the quality of
Encoder-Decoder generated clinical records by using the GPT-3 language model. The suggested model consists of two parts:
the Encoder and Decoder (LSTM) with an attention module, and the GPT-3. While the Encoder with the LSTM detects pathologies and indicates zones demanding higher attention, the GPT-3 takes this output as input and writes a comprehensive medical report.
There are two possible approaches to this task. The first one consists of forcing the models to learn a joint word distribution. Within this method (Fig. 2), both models A and B output scores for the next word in a sentence. By concatenating these scores and pushing them through the feed-forward neural net C, we get the final scores for the upcoming word. The disadvantage of this approach is the following: the GPT-3 model has its own vocabulary built by the Byte Pair Tokenizer, which is different from the one used by the Show Attend and Tell. We need to take from the continuous GPT-3 distribution the separate scores corresponding to the words present in the Show Attend and Tell vocabulary. This turns the continuous distribution from the GPT-3 into a discrete one, and hence we do not use the full generative power of the GPT-3.
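A minimal sketch of this score fusion is shown below; the precomputed vocabulary mapping `sat_to_gpt_ids` and the `fusion_mlp` argument are assumptions standing in for the feed-forward network C of Fig. 2, not the authors' implementation.

```python
import torch

def joint_next_word_scores(sat_logits, gpt_logits, sat_to_gpt_ids, fusion_mlp):
    """Approach 1 (sketch): fuse SAT and GPT scores over the SAT vocabulary.

    sat_logits: (batch, W_SAT) scores from SAT; gpt_logits: (batch, V_GPT) scores from GPT;
    sat_to_gpt_ids: LongTensor of length W_SAT mapping each SAT word to a GPT token id (assumed precomputed).
    """
    # Keep only the GPT scores corresponding to SAT-vocabulary words, turning the
    # continuous GPT distribution into a discrete one (the drawback discussed above).
    gpt_scores_on_sat_vocab = gpt_logits.index_select(dim=-1, index=sat_to_gpt_ids)
    fused = torch.cat([sat_logits, gpt_scores_on_sat_vocab], dim=-1)
    return fusion_mlp(fused)   # final scores over the SAT vocabulary
```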

Figure 2. The first approach: learning the joint distribution of the two models. The SAT branch (A) encodes the X-Ray and produces logits over the SAT (MIMIC-CXR) vocabulary, the GPT-3 branch (B) byte-pair-tokenizes the medical report and produces logits over its own vocabulary, and a feed-forward network (C) concatenates the two and outputs the final next-word scores. The drawback is in sampling from the GPT-3 distribution.

The second method, shown in Fig. 3, consists of fine-tuning both models on the MIMIC-CXR dataset and using them one after another. Show Attend and Tell (A) gets an image as input and generates a report based on the findings located on the X-Ray with the attention module. It learns where to focus and gives a seed for the GPT-3 (B) to continue generating text. The GPT-3 was fine-tuned on MIMIC-CXR in a self-supervised manner using the Huggingface framework43; it learns to predict the next word in the text. The GPT-3 continues the report output by SAT and generates a detailed and complete clinical report based on the pathologies found by SAT. Such an approach is better for the GPT-3, as it gets more context as input (from SAT) than in the first approach. Thus, the second approach performs better and was hence chosen by the authors of this paper as the main architecture.
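At inference time, the second approach therefore reduces to a simple two-stage pipeline. The sketch below assumes a hypothetical `sat_model.generate_caption` interface and a Huggingface-style causal language model for the continuation step; it is an illustration of the idea, not the authors' code.

```python
def generate_report(image, sat_model, gpt_model, gpt_tokenizer, beam_size=4):
    """Two-stage report generation (sketch): SAT writes a seed, the GPT model continues it."""
    seed_text = sat_model.generate_caption(image, beam_size=beam_size)   # findings seed from SAT (assumed API)
    prompt = seed_text + " <start>"                                       # special token marking the handover
    input_ids = gpt_tokenizer.encode(prompt, return_tensors="pt")
    output_ids = gpt_model.generate(
        input_ids,
        num_beams=beam_size,                    # K-beam search at inference
        max_length=256,
        early_stopping=True,
        eos_token_id=gpt_tokenizer.eos_token_id,  # stop at <|endoftext|>
    )
    return gpt_tokenizer.decode(output_ids[0], skip_special_tokens=True)
```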

3.1 First Language Model


The first part of the suggested model is realized as the Show Attend and Tell model (SAT): the encoder encodes the image, and the LSTM decodes it into a sequence. The encoder encodes the input image with 3 or 1 color channels into a smaller image with 'learned' channels. The resulting encoded images can be interpreted as a summary representation of the findings in the X-Ray (Eq. 2). Encoders pretrained on ImageNet44 are not suitable for the medical image captioning task, as chest X-Rays do not contain objects or figures from everyday life. Thus, the DenseNet-121 from45 pretrained on the MIMIC-CXR dataset was taken. It was trained for the classification task on 18 labels: Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema,
Emphysema, Fibrosis, Effusion, Pneumonia, Pleural Thickening, Cardiomegaly, Nodule, Mass, Hernia, Lung Lesion, Fracture,
Lung Opacity, and Enlarged Cardiomediastinum. Hence, the last classification layer was removed and the features from the last convolutional layer were taken.
Figure 3. Second approach: the pretrained GPT-3 (B) continues the text generated by SAT (A). The SAT encoder and attention LSTM produce logits over the MIMIC-CXR vocabulary, and a K-beam search yields the text generated by SAT; this text is BPE-tokenized and fed to the GPT-3 decoder, whose logits are again decoded with a K-beam search into the continued report.

These features were passed through the Adaptive Average Pooling layer. As a result, the image
encoded parts were obtained. They can be represented by a tensor with the following dimensions: (batch size, C, D, D) (Eq. 2). C stands for the number of channels, i.e., how many different image regions to consider, and D is the dimension of an encoded image region. Furthermore, a fine-tuning switch for the encoder was added: it enables or disables the calculation of gradients for the encoder's parameters in the last layers. Then, at every time step, the decoder with the attention module observes the encoded small images with findings and generates the caption word by word. The encoder output is flattened to the dimensions (batch size, C, D × D). Since captions are padded with a special <pad> token, they are sorted by decreasing length, and at every time step of generating a word an effective batch size is computed in order not to process the <pad> tokens.
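A minimal sketch of this encoder stage is given below. It uses torchvision's DenseNet-121 as a stand-in (the paper takes MIMIC-CXR-pretrained weights from TorchXRayVision45, which accept 1-channel input directly); the class name and the channel-repeat trick are our own illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision import models

class XRayEncoder(nn.Module):
    """Truncated DenseNet-121 with adaptive pooling to a D x D grid of encoded regions (sketch)."""
    def __init__(self, encoded_size: int = 8, fine_tune: bool = False):
        super().__init__()
        densenet = models.densenet121(weights=None)   # the paper uses MIMIC-CXR-pretrained weights instead
        self.features = densenet.features              # drop the classification head
        self.pool = nn.AdaptiveAvgPool2d(encoded_size)
        for p in self.features.parameters():            # enable/disable gradients for fine-tuning
            p.requires_grad = fine_tune

    def forward(self, images):
        # images: (batch, 1, 224, 224); the stock torchvision model expects 3 channels,
        # so the grayscale channel is repeated here.
        x = images.repeat(1, 3, 1, 1) if images.size(1) == 1 else images
        feats = self.pool(self.features(x))             # (batch, C=1024, D, D)
        return feats.flatten(start_dim=2)               # (batch, C, D*D), cf. Eq. (2)
```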
The Show Attend and Tell model was trained using the Teacher-Forcing method, whereby at each step the input to the model is the ground-truth word for that step and not the previously generated word. As a result, we can consider the SAT as a language model A. It gets a tokenized text of length m and an image as input, and outputs a vector of probabilities for the next word at each time step t:

$A: (\text{text}, \text{image}) \to P_1\big(z_t \mid \text{true words} = z^{<1>} z^{<2>} \dots z^{<t-1>}, \text{image}\big), \quad t \in \{2, \dots, m, \dots, L\}, \quad P_1 \in \mathbb{R}^{m \times W_{\mathrm{SAT}}} \qquad (12)$

where $W_{\mathrm{SAT}}$ is the SAT vocabulary size and L is the length of the generated report (Eq. 1). $P_1$ is computed as shown in Eq. 11.
During training, the LSTM outputs the word with the maximum probability after the softmax layer. This is a greedy approach, yet there is also an option to use K-beam search. The authors of46 used K-beam search during training, although this is not a common approach. In our experiments, the greedy approach was used during training, and we applied the K-beam search at the inference stage.
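A generic K-beam search for the inference stage can be sketched as follows; the `step_fn(token_id, state)` interface returning next-word log-probabilities and an updated decoder state is an assumption for illustration (no length normalization is applied, for simplicity), not the authors' implementation.

```python
import torch

def beam_search(step_fn, init_state, start_id, end_id, beam_size=4, max_len=100):
    """Generic K-beam search over a one-step decoder function (inference-time sketch)."""
    beams = [([start_id], 0.0, init_state)]       # (tokens, cumulative log-prob, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            log_probs, new_state = step_fn(tokens[-1], state)   # 1D tensor of log-probabilities
            top_lp, top_ids = torch.topk(log_probs, beam_size)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp, new_state))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score, state in candidates[: beam_size * 2]:
            if tokens[-1] == end_id:
                finished.append((tokens, score))     # completed hypothesis
            else:
                beams.append((tokens, score, state))
            if len(beams) == beam_size:
                break
        if not beams:
            break
    if not finished:
        finished = [(tokens, score) for tokens, score, _ in beams]
    return max(finished, key=lambda c: c[1])[0]      # best-scoring token sequence
```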

3.2 Second Language Model
The second part of the proposed architecture is the GPT-3, which is a language model. The GPT-3 is built from decoder blocks using the transformer architecture; each decoder block consists of masked self-attention and a feed-forward neural network (FFNN). The output yields the token scores, i.e., logits. The GPT-3 was pretrained separately on the MIMIC-CXR dataset and was then fine-tuned together with the SAT to enhance the clinical reports.
We put a special token <start> at the end of the text generated by the SAT, allowing the GPT-3 to understand where to start the generation process. We also used the K-beam search after the GPT-3 generation and took the second-best sentence from the output as a continuation. The pretrained GPT-3 performs as a separate language model B and generates good records based on the input text or tags. The GPT-3 generates the report until it emits the special <|endoftext|> token. We denote the length of the GPT-3-generated text as l:

$B: \text{text} \to P_2\big(z_t \mid \text{true words} = z^{<1>} \dots z^{<L>}\, \texttt{<start>}\big), \quad t \in \{L+1, \dots, L+l\} \qquad (13)$

3.3 Combination of two language models


We use a combination of the two models, placing them sequentially: the SAT model extracts visual features from the image and allows us to focus on its specific parts, while the GPT-3 provides good and comprehensive text based on what is found by the first model. Thus, the predictions of the first model improve those of the second language model.

3.4 Evaluation metrics


The common evaluation metrics used for image captioning are: bilingual evaluation understudy (BLEU)47, recall-oriented understudy for gisting evaluation (ROUGE)48, metric for evaluation of translation with explicit ordering (METEOR)49, consensus-based image description evaluation (CIDEr)50, and semantic propositional image caption evaluation (SPICE)51. The Microsoft Common Objects in Context evaluation kit52 provides implementations of these metrics for the image captioning task.
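As a quick sanity check, corpus-level BLEU-n can also be computed with NLTK as below; the reported numbers in this paper come from the MS COCO evaluation kit, and the two token lists here are made-up examples.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["no", "acute", "cardiopulmonary", "abnormality"]]]            # one list of references per image
hypotheses = [["no", "evidence", "of", "acute", "cardiopulmonary", "disease"]]  # generated captions

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform n-gram weights for BLEU-n
    score = corpus_bleu(references, hypotheses, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```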

4 Experiments
4.1 Datasets
For training and evaluation of medical image captioning, we use three publicly available datasets. Two of them are medical
images datasets and the third one is a general-purpose one.
MIMIC-CXR The MIMIC Chest X-Ray (MIMIC-CXR)53 dataset is a large publicly available dataset of chest radiographs in
DICOM format with free-text radiology reports. This dataset consists of 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA.
Open-I The Indiana University Chest X-Ray Collection (IU X-Ray)20 contains radiology reports associated with X-Ray images. This dataset contains 7,470 image-report pairs. All the reports enclose the following sections: impression, findings, tags, comparison, and indication. We use the concatenation of impression and findings as the target captions.
MSCOCO The Microsoft Common Objects in Context dataset (MS COCO)54 is a large-scale non-medical dataset for scene understanding. The dataset is commonly used for training and benchmarking object detection, segmentation, and captioning algorithms.

4.2 Image preprocessing


A Hierarchical Data Format (HDF5)55 dataset was used to store all images. X-Rays are grayscale and have one channel. To process them with the pre-trained CNN DenseNet-121, we used 1-channel images. Each image was resized to 224×224 pixels, normalized to the range from 0 to 1, converted to the float32 type, and stored in the HDF5 dataset.
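A sketch of this storage step with h5py and Pillow is given below; the function name, the output file name, and the assumption of a flat list of image paths are ours.

```python
import h5py
import numpy as np
from PIL import Image

def build_hdf5(image_paths, out_path="xrays.h5", size=(224, 224)):
    """Resize, normalize, and store grayscale X-Rays in a single HDF5 dataset (sketch)."""
    with h5py.File(out_path, "w") as f:
        dset = f.create_dataset("images", shape=(len(image_paths), 1, *size), dtype="float32")
        for i, path in enumerate(image_paths):
            img = Image.open(path).convert("L").resize(size)     # grayscale, 224x224
            arr = np.asarray(img, dtype=np.float32) / 255.0       # normalize to [0, 1]
            dset[i, 0] = arr
    return out_path
```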

4.3 Image captions pre-processing


Following the logic in56, a medical report is considered to be the concatenation of the Impression and Findings sections; if both of these sections are empty, the report is excluded. This resulted in 360,666 DICOMs with reports for the MIMIC-CXR dataset. The text records are pre-processed by converting all tokens to lowercase and removing all non-alphanumeric tokens. For our experiments, we used 75% of the data for training, 24.75% for validation, and 0.25% for testing.
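A sketch of this cleaning and splitting step is shown below; the regular expression and the random split follow the description above, while the exact details of the authors' pipeline may differ.

```python
import re
import random
from typing import Optional

def preprocess_report(impression: str, findings: str) -> Optional[str]:
    """Concatenate Impression and Findings, lowercase, and drop non-alphanumeric tokens."""
    text = f"{impression} {findings}".strip()
    if not text:
        return None                                    # skip reports with both sections empty
    tokens = re.findall(r"[a-z0-9]+", text.lower())     # keep only alphanumeric tokens
    return " ".join(tokens)

def split_dataset(items, seed=42):
    """75% train / 24.75% validation / 0.25% test, as used in the experiments."""
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.75 * n), int(0.2475 * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```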
The MIMIC-CXR database was used to access metadata and labels derived from the free-text radiology reports. These labels were extracted using the NegBio tool17, 25, which outputs one of 14 pathologies along with its severity and/or absence. To generate

more accurate reports, we added the extracted labels to the beginning of the report. This allows language models to know the
summary of the report for a more precise description generation.
We additionally formed an abbreviation dictionary of 150+ words from the Unified Medical Language System (UMLS)57. We also extended our dictionary with several commonly used medical terms from the Medical Concept Annotation Tool58.

4.4 Training of the Neural Network


The pipeline is implemented using PyTorch. Experiments were conducted on a server running Ubuntu 16.04 (32 GB RAM). All models were trained on an NVIDIA Tesla V100 GPU (32 GB RAM). In all experiments, we use 5-fold cross-validation and report the mean performance. The SAT was trained for 70 epochs with a batch size of 16, an embedding dimension of 100, attention and decoder dimensions of 512, and a dropout value of 0.1. The encoder and decoder learning rates were $4 \times 10^{-7}$ and $3 \times 10^{-7}$, respectively. The cross-entropy loss was used for training. The best model is chosen according to the highest geometric mean of BLEU-n, as is done in other works59. SAT was trained with the Teacher-Forcing technique, while the greedy approach is used for computing the metrics. The GPT-3 small was fine-tuned on the MIMIC-CXR dataset for 30 epochs with a batch size of 4, a learning rate of $5 \times 10^{-5}$, an Adam epsilon of $1 \times 10^{-8}$, a block size of 1024, and clipping of gradients larger than 1.0. It was fine-tuned in a self-supervised manner as a language model. No data augmentation was applied.
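The fine-tuning configuration above maps roughly onto the Huggingface Trainer as sketched below; the publicly available GPT-2 small checkpoint and the toy report list are placeholders for the checkpoint and preprocessed MIMIC-CXR corpus used in the paper.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in checkpoint
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy corpus standing in for the preprocessed MIMIC-CXR reports.
reports = ["no acute cardiopulmonary abnormality",
           "small bilateral pleural effusions with adjacent atelectasis"]
encodings = tokenizer(reports, truncation=True, max_length=1024)
train_dataset = [{"input_ids": ids, "attention_mask": mask}
                 for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])]

args = TrainingArguments(
    output_dir="gpt-mimic-cxr",
    num_train_epochs=30,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,                                      # clip gradients larger than 1.0
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal-LM objective
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)
trainer.train()
```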

5 Results & Discussion


5.1 Quantitative results
The quantitative results for the baseline models, preceding works, and our models are presented in Table 1. Models were evaluated on the most common Open-I dataset, as well as on the large and rarely reported MIMIC-CXR dataset with free-text radiology reports. We implemented the most commonly used evaluation metrics: BLEU-n, CIDEr, and ROUGE. The proposed approach outperforms the existing models in terms of these NLG metrics.
We additionally provide illustrations of our model's performance in Table 2, containing the original X-Ray images from the MIMIC-CXR dataset, the ground-truth expert report, and the model prediction (SAT + GPT-3). We manually underlined the similarities and identical diagnoses in the texts to guide the eye.

5.2 Discussion
The first language model (SAT) learned to generate a short summary at the beginning of the report, based on the findings in the X-Ray, and to provide the finding details. This offers a seed that directs text generation for the second model. The performed preprocessing of the medical reports is what allowed us to obtain these high metrics. We also address the biased-data problem by applying domain-specific text preprocessing with the NegBio labeller. In a radiology database, the data are unbalanced because abnormal cases are rarer than normal ones. The NegBio labeller allowed us to obtain clinical diagnosis records that are not negatively biased, as it added short sentences at the beginning of the ground-truth report, making this task closer (in some ways) to a classification task, where state-of-the-art models have already managed to achieve strong performance. The SAT also provides 2D heatmaps of pathology localization, assisting and accelerating the diagnostic process performed by clinicians.
The second language model, the Generative Pretrained Transformer (GPT-3), showed promising results in the medical domain. It successfully continued the texts from the first language model, taking into consideration all the findings provided. As the GPT-3 is a large and powerful transformer, it summarizes the findings and provides more details on them. The natural language generation metrics support using the two language models sequentially; such an approach can be considered a strong one for text generation. The SAT followed by the GPT-3 outperformed the reported state-of-the-art (SOTA) models on all 3 datasets considered. Notably, the proposed approach beats the SOTA models on MIMIC-CXR, demonstrating the highest performance in all the metrics measured. The performance on the main evaluation dataset, the Open-I, is also measured by the micro-averaged F1-score and amounts to 0.861 vs. 0.840 for the proposed (SAT + GPT-3) model and the SAT, respectively.
Examples of the reports generated jointly via the Show-Attend-Tell + GPT-3 architecture are shown in Table 2. One may notice that some generated sentences are identical to the ground truth; for example, both the generated and the true report for the first X-Ray contain "no acute cardiopulmonary abnormality". Other sentences are close in their meaning even if they differ in the chosen words and n-grams ("no pneumonia. no pleural effusion. no edema. ..." compared to "without pulmonary edema or pneumothorax").

6 Conclusions
The authors of the current paper introduced a new technique of combining two language models for the medical image captioning task. Principally, new preprocessing and squeezing approaches for clinical records were implemented along with a combined language model, where the first component is based on an attention mechanism and the second represents a generative pretrained transformer. The proposed combination of models generates a descriptive textual summary with essential information on the found pathologies along with their location and severity.

MIMIC-CXR
  Model             CIDEr   ROUGE_L   BLEU-1   BLEU-2   BLEU-3   BLEU-4
  S&T8              0.886   0.300     0.307    0.201    0.137    0.093
  Original SAT11    0.967   0.288     0.318    0.205    0.137    0.093
  TieNet15          1.004   0.296     0.332    0.212    0.142    0.095
  NLG24             1.153   0.307     0.352    0.223    0.153    0.104
  SAT               1.986   0.478     0.634    0.549    0.451    0.383
  SAT + GPT-3       1.989   0.480     0.725    0.626    0.505    0.418

Open-I
  Model             CIDEr   ROUGE_L   BLEU-1   BLEU-2   BLEU-3   BLEU-4
  Co-Attention56    0.327   0.447     0.517    0.386    0.306    0.247
  TieNet15          -       0.311     0.330    0.194    0.124    0.081
  CNN-RNN8          0.111   0.267     0.316    0.211    0.140    0.095
  LRCN60            0.190   0.278     0.369    0.229    0.149    0.138
  ATT-RK14          0.155   0.323     0.369    0.226    0.151    0.108
  CDGPT233          0.257   0.289     0.387    0.245    0.166    0.111
  Original SAT11    0.320   0.361     0.433    0.281    0.194    0.138
  SAT               0.699   0.413     0.407    0.258    0.210    0.125
  SAT + GPT-3       0.701   0.450     0.520    0.390    0.296    0.235

MS-COCO
  Model             CIDEr   ROUGE_L   BLEU-1   BLEU-2   BLEU-3   BLEU-4
  BRNN61            -       -         0.642    0.451    0.304    0.203
  Original SAT11    -       -         0.718    0.504    0.357    0.250
  SAT               1.300   0.592     0.815    0.663    0.516    0.395
  SAT + GPT-3       1.360   0.606     0.821    0.672    0.529    0.409

Table 1. Reported mean performance using word-overlap metrics for two medical radiology datasets and one general-purpose non-medical dataset. Here SAT stands for the model implemented by us and trained on the preprocessed MIMIC-CXR data. BLEU-n denotes the BLEU score that uses up to n-grams.

Chest X-Ray 1.
Ground truth: Lungs remain well inflated without evidence of focal airspace consolidation, pleural effusions, pulmonary edema or pneumothorax. Irregularity in the right humeral neck is related to a known healing fracture secondary to recent fall. PA and lateral views of the chest at 09:55 are submitted.
Our prediction: no findings. no pneumonia. no pleural effusion. no edema. there is little change and no evidence of acute cardiopulmonary disease. no pneumonia, vascular congestion, pleural effusion. of incidental note is an azygos fissure, of no clinical significance. this raises possibility of a normal variant.

Chest X-Ray 2.
Ground truth: 1. Stable bilateral small pleural effusions and atelectasis. 2. Enlarged pulmonary artery, suggesting pulmonary hypertension. No significant interval change. Bilateral small pleural effusions and adjacent atelectasis are overall unchanged. The heart is top-normal in size, unchanged. The pulmonary artery is enlarged, suggesting pulmonary hypertension. No demand, focal consolidation to suggest pneumonia, or pneumothorax.
Our prediction: pleural effusion present. lung opacity present. no edema. cardiomegaly present. atelectasis present. as compared to previous radiograph, there is an increase in extent of a pre existing small left pleural effusion with subsequent atelectasis at left lung bases. otherwise, radiograph is unchanged. moderate cardiomegaly. mild fluid overload no overt pulmonary edema. no new focal parenchymal opacities suggesting pneumonia. unchanged position of right pectoral port a cath.

Chest X-Ray 3.
Ground truth: There is decrease in now small right pleural effusion. There is no pneumothorax. There is a new right pacer pigtail catheter. Cardiomediastinal contours are unchanged. Lines and tubes are in standard position. Left lower lobe opacities, a combination of pleural effusion and atelectasis, are unchanged.
Our prediction: uncertain pneumonia. pleural effusion present. lung opacity present. atelectasis present. bilateral pleural effusions, left greater than right. bibasilar opacities potentially atelectasis in setting of low lung volumes. infection be excluded. frontal and lateral views of chest demonstrate low lung volumes, which accentuate bronchovascular markings. there are small bilateral pleural effusions, right greater than left, with adjacent atelectasis. there is no focal consolidation pneumothorax. cardiomediastinal silhouette is within normal limits. surgical clips are seen in right upper quadrant of abdomen. aortic arch calcifications are noted.

Chest X-Ray 4.
Ground truth: Compared to prior chest radiographs through ___. Previous mild pulmonary edema has improved, moderate cardiomegaly and mediastinal vascular engorgement have not. ET tube, right transjugular temporary pacer lead are in standard placements and an esophageal drainage tube passes into the stomach and out of view. Pleural effusions are presumed but not substantial. No pneumothorax.
Our prediction: support devices present. no pneumothorax. pleural effusion present. lung opacity present. uncertain enlarged cardiomediastinum. no edema. atelectasis present. right internal jugular central line has its tip in distal superior vena cava. overall cardiac and mediastinal contours are likely stable given patient rotation on current study. lung volumes remain low with patchy opacities at both bases likely reflecting atelectasis. blunting of both costophrenic angles may reflect small effusions.

Table 2. Results of generated reports by the proposed SAT + GPT-3 model.

Besides, the 2D heatmaps localize each pathology on
the original X-Ray scans. The results measured with the natural language generation metrics on both the MIMIC-CXR and the
Open-I datasets speak for its efficient applicability to the chest X-Ray image captioning task. This approach also provides well-interpretable results and helps support medical decision making.
We investigated various approaches to text generation from the angle of automatic X-Ray captioning. We showed that the Show-Attend-Tell is a strong baseline that outperforms models with Transformer-based decoders. With the help of the GPT-3 pre-trained language model, we managed to improve this baseline. The simple method, whereby the GPT-3 model finishes the report started by the Show-Attend-Tell model, yields significant improvements in the standard text generation scores.

7 Acknowledgements
The authors of this paper thank Alexander Panchenko and Alexander Shvets for the helpful discussion.

References
1. Irvin, J. et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019).
2. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Medical
Informatics Assoc. 23, 304–310 (2016). URL https://doi.org/10.1093/jamia/ocv080.

3. Chan, Y.-H., Zeng, Y.-Z., Wu, H.-C., Wu, M.-C. & Sun, H.-M. Effective pneumothorax detection for chest x-ray images
using local binary pattern and support vector machine. Journal of Healthcare Engineering 2018, 1–11 (2018).
4. Maghdid, H. S. et al. Diagnosing covid-19 pneumonia from x-ray and ct images using deep learning and transfer learning
algorithms. In Multimodal image exploitation and learning 2021, vol. 11734, 117340E (International Society for Optics
and Photonics, 2021).
5. Monshi, M. M. A., Poon, J. & Chung, V. Deep learning in generating radiology reports: A survey. Artificial Intelligence in
Medicine 106, 101878 (2020).
6. Gurgitano, M. et al. Interventional radiology ex-machina: impact of artificial intelligence on practice. La radiologia
medica 126, 998–1006 (2021).
7. Pavlopoulos, J., Kougia, V. & Androutsopoulos, I. A survey on biomedical image captioning. In Proceedings of the Second
Workshop on Shortcomings in Vision and Language, 26–36 (Association for Computational Linguistics, Minneapolis,
Minnesota, 2019). URL https://www.aclweb.org/anthology/W19-1803.
8. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the
IEEE conference on computer vision and pattern recognition, 3156–3164 (2015).
9. Shin, H.-C. et al. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation (2016).
1603.08486.
10. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate (2016).
1409.0473.
11. Xu, K. et al. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd
International Conference on International Conference on Machine Learning - Volume 37, ICML’15, 2048–2057 (JMLR.org,
2015).
12. Donahue, J. et al. Long-term recurrent convolutional networks for visual recognition and description. In 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2625–2634 (2015).
13. Zhang, Z., Xie, Y., Xing, F., McGough, M. & Yang, L. Mdnet: A semantically and visually interpretable medical image
diagnosis network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3549–3557 (2017).
14. You, Q., Jin, H., Wang, Z., Fang, C. & Luo, J. Image captioning with semantic attention. In 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 4651–4659 (2016).
15. Wang, X., Peng, Y., Lu, L., Lu, Z. & Summers, R. M. Tienet: Text-image embedding network for common thorax disease
classification and reporting in chest x-rays. CoRR abs/1801.04334 (2018). URL http://arxiv.org/abs/1801.04334. 1801.04334.
16. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2577–2586 (Association
for Computational Linguistics, Melbourne, Australia, 2018). URL https://www.aclweb.org/anthology/P18-1240.
17. Wang, X. et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and
localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
3462–3471 (2017).
18. Gale, W., Oakden-Rayner, L., Carneiro, G., Bradley, A. P. & Palmer, L. J. Producing radiologist-quality reports for
interpretable artificial intelligence. CoRR abs/1806.00340 (2018). URL http://arxiv.org/abs/1806.00340.
1806.00340.
19. Yuan, J., Liao, H., Luo, R. & Luo, J. Automatic radiology report generation based on multi-view image fusion and medical
concept enrichment (2019). 1907.09085.
20. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Medical
Informatics Assoc. 23, 304–310 (2016). URL https://doi.org/10.1093/jamia/ocv080.
21. Zhang, Y. et al. When radiology report generation meets knowledge graph. Proceedings of the AAAI Conference on
Artificial Intelligence 34, 12910–12917 (2020). URL https://doi.org/10.1609/aaai.v34i07.6989.
22. Rajpurkar, P. et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning (2017). 1711.05225.

23. Tuluptceva, N., Bakker, B., Fedulova, I., Schulz, H. & Dylov, D. V. Anomaly detection with deep perceptual autoencoders
(2020). 2006.13265.
24. Liu, G. et al. Clinically accurate chest x-ray report generation (2019). 1904.02633.
25. Peng, Y. et al. Negbio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits
on Translational Science Proceedings 2017 (2017).
26. Ni, J., Hsu, C.-N., Gentili, A. & McAuley, J. Learning visual-semantic embeddings for reporting abnormal findings on chest
X-rays. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1954–1960 (Association for Computational Linguistics, Online, 2020). URL https://www.aclweb.org/anthology/2020.findings-emnlp.176.
27. Syeda-Mahmood, T. et al. Chest x-ray report generation through fine-grained label learning (2020). 2007.13831.
28. Liu, J. et al. Align, attend and locate: Chest x-ray diagnosis via contrast induced attention network with limited supervision.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019).
29. Cohen, J. P., Hashir, M., Brooks, R. & Bertrand, H. On the limits of cross-domain generalization in automated x-ray
prediction. In Arbel, T. et al. (eds.) Proceedings of the Third Conference on Medical Imaging with Deep Learning, vol. 121
of Proceedings of Machine Learning Research, 136–155 (PMLR, 2020). URL http://proceedings.mlr.press/v121/cohen20a.html.
30. Rodin, I., Fedulova, I., Shelmanov, A. & Dylov, D. V. Multitask and multimodal neural network model for interpretable
analysis of x-ray images. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (IEEE,
2019). URL https://doi.org/10.1109/bibm47256.2019.8983272.
31. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (Association for Computational
Linguistics, Minneapolis, Minnesota, 2019).
32. Ziegler, Z. M., Melas-Kyriazi, L., Gehrmann, S. & Rush, A. M. Encoder-agnostic adaptation for conditional language
generation (2020). URL https://openreview.net/forum?id=B1xq264YvH.
33. Alfarghaly, O., Khaled, R., Elkorany, A., Helal, M. & Fahmy, A. Automated radiology report generation using conditioned
transformers. Informatics in Medicine Unlocked 24, 100557 (2021). URL https://www.sciencedirect.com/science/article/pii/S2352914821000472.
34. Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating radiology reports via memory-driven transformer (2020).
2010.16056.
35. Xiong, Y., Du, B. & Yan, P. Reinforced transformer for medical image captioning. In Machine Learning in Medical
Imaging, 673–680 (Springer International Publishing, 2019).
36. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269 (2017).
37. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. &
LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
2015, Conference Track Proceedings (2015). URL http://arxiv.org/abs/1409.1556.
38. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision.
CoRR abs/1512.00567 (2015). URL http://arxiv.org/abs/1512.00567. 1512.00567.
39. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition (2015). 1512.03385.
40. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural computation 9, 1735–1780 (1997).
41. Brown, T. B. et al. Language models are few-shot learners (2020). 2005.14165.
42. Vaswani, A. et al. Attention is all you need. In Guyon, I. et al. (eds.) Advances in Neural Information Processing
Systems, vol. 30 (Curran Associates, Inc., 2017). URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
43. Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (Association for Computational
Linguistics, Online, 2020). URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.

44. Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and
pattern recognition, 248–255 (Ieee, 2009).
45. Cohen, J. P. et al. TorchXRayVision: A library of chest X-ray datasets and models (2020). URL https://github.com/mlmed/torchxrayvision.
46. Wiseman, S. & Rush, A. M. Sequence-to-sequence learning as beam-search optimization (2016). 1606.02960.
47. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. 311–318 (2002).
48. Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. 10 (2004).
49. Banerjee, S. & Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human
judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, 65–72 (Association for Computational Linguistics, Ann Arbor, Michigan, 2005). URL https://www.aclweb.org/anthology/W05-0909.
50. Vedantam, R., Zitnick, C. L. & Parikh, D. Cider: Consensus-based image description evaluation. In 2015 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 4566–4575 (2015).
51. Anderson, P., Fernando, B., Johnson, M. & Gould, S. Spice: Semantic propositional image caption evaluation. In ECCV
(2016).
52. Chen, X. et al. Microsoft coco captions: Data collection and evaluation server (2015). 1504.00325.
53. Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text
reports. Scientific Data 6 (2019).
54. Lin, T.-Y. et al. Microsoft coco: Common objects in context (2014). 1405.0312.
55. Koziol, Q. et al. HDF5. In Encyclopedia of Parallel Computing, 827–833 (Springer US, 2011).
56. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational
Linguistics, Melbourne, Australia, 2018). URL https://www.aclweb.org/anthology/P18-1240.
57. Bodenreider, O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids
Research 32, 267D–270 (2004). URL https://doi.org/10.1093/nar/gkh061.
58. Kraljevic, Z. et al. Multi-domain clinical natural language processing with medcat: the medical concept annotation toolkit
(2021). 2010.01165.
59. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In
Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, 311–318 (Association for
Computational Linguistics, USA, 2002). URL https://doi.org/10.3115/1073083.1073135.
60. Donahue, J. et al. Long-term recurrent convolutional networks for visual recognition and description (2016). 1411.4389.
61. Karpathy, A. & Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions (2015). 1412.2306.

