Visual Grounding of Whole Radiology Reports for 3D CT Images

Akimichi Ichinose1, Taro Hatsutani1, Keigo Nakamura1, Yoshiro Kitamura1, Satoshi Iizuka2, Edgar Simo-Serra3, Shoji Kido4, and Noriyuki Tomiyama4

1 Medical Systems Research & Development Center, FUJIFILM Corporation, Japan
2 Center for Artificial Intelligence Research, University of Tsukuba, Japan
3 Department of Computer Science and Engineering, Waseda University, Japan
4 Graduate School of Medicine, Osaka University, Japan
[email protected]

arXiv:2312.04794v1 [cs.CV] 8 Dec 2023

Abstract. Building a large-scale training dataset is an essential problem in the development of medical image recognition systems. Visual grounding techniques, which automatically associate objects in images with corresponding descriptions, can facilitate the labeling of large numbers of images. However, visual grounding of radiology reports for CT images remains challenging, because many kinds of anomalies are detectable via CT imaging, and the resulting report descriptions are long and complex. In this paper, we present the first visual grounding framework designed for CT image and report pairs covering various body parts and diverse anomaly types. Our framework combines two components: 1) anatomical segmentation of images, and 2) report structuring. The anatomical segmentation provides multiple organ masks of a given CT image and helps the grounding model recognize detailed anatomies. The report structuring helps to accurately extract information regarding the presence, location, and type of each anomaly described in the corresponding report. Given these two additional image/report features, the grounding model can achieve better localization. For verification, we constructed a large-scale dataset with region-description correspondence annotations for 10,410 studies of 7,321 unique patients. We evaluated our framework using grounding accuracy, the percentage of correctly localized anomalies, as a metric, and demonstrated that the combination of anatomical segmentation and report structuring improves performance by a large margin over the baseline model (66.0% vs. 77.8%). Comparison with prior techniques also showed the higher performance of our method.

Keywords: Deep Learning · Vision Language · Visual Grounding · Computed Tomography.

1 Introduction

Fig. 1. Comparison of the visual grounding task on X-ray images and on CT images (our task), with example report excerpts for each.

In recent years, a number of medical image recognition systems have been developed [6] to alleviate the increasing burden on radiologists [2,22,21]. In the

development of such systems, the task of manually labeling images is a significant bottleneck. Auto-labeling, the process of automatically assigning labels to images using machine learning algorithms, has emerged as a promising solution to this problem. In cases where plenty of image and caption pairs are available, one potential approach to auto-labeling is visual grounding [12], which utilizes natural language descriptions to identify and localize objects in images.

With the recent advances in cross-modal technology based on deep learning, many frameworks for visual grounding have been proposed [11,7]. Within the medical domain, several large-scale datasets with radiology reports are available (e.g. OpenI [3], MIMIC-CXR [9]), and these have enabled research on medical image visual grounding [25,1]. However, to the best of our knowledge, prior studies have focused on 2D X-ray images [28] or videos [15], and there has been no research applying visual grounding to 3D computed tomography (CT) images so far. Visual grounding on CT images poses the following difficulties: 1) Large number of anomaly types to detect: Existing research on visual grounding with X-ray images handled only chest X-ray images, where the number of anomaly types to detect is at most a dozen or so (e.g. 13 findings [8]). In contrast, our research handles CT images covering various parts of the human body. Consequently, the number of anomaly types to be detected exceeds one hundred. 2) Long and complex sentences: Radiology reports on X-ray images are often simple, noting only the presence or absence of anomalies. In CT examinations, on the other hand, a qualitative diagnosis of each anomaly is often performed, and multiple anomalies may be described simultaneously in a single sentence. The descriptions therefore tend to be long and complicated, spanning multiple sentences (Fig. 1). Visual grounding for CT images requires extracting the location and type of each anomaly from these complex sentences.
In this work, we propose a novel visual grounding framework for 3D CT images and radiology reports. The main idea is to separate the task into three parts: 1) anatomical segmentation of images, 2) report structuring, and 3) localization of described anomalies. In the anatomical segmentation, multiple organs and tissues are extracted using a deep learning based segmentation model and provided as landmarks. The report structuring model, which is based on BERT [5], is introduced to extract information about each anomaly from a complex report. Both of these features are fed into the grounding model (3) to incorporate medical domain knowledge, thereby enabling accurate visual grounding.
Our contributions are as follows:

– We show the first visual grounding results for 3D CT images that cover various body parts and anomalies.
– We introduce a novel grounding architecture that can leverage report structuring results on the presence/type/location of described anomalies.
– We validate the efficacy of the proposed framework using a large-scale dataset with region-description correspondence annotations.

2 Related Work

Visual Grounding The visual grounding task involves learning the correspondences between descriptions in a text and image regions from a given training set of region-description pairs [12]. There are mainly two approaches: one-stage and two-stage. Most studies follow a two-stage approach [14,17]. However, this approach usually employs a pre-trained object detector, which restricts the categories and attributes that can be grounded. Accordingly, recent studies are shifting toward the one-stage approach, in which visual grounding is performed by end-to-end training [27,4,10].

Vision-Language Tasks on Medical Images The existence of public datasets with paired images and reports [3,9,26] has accelerated research on cross-modal tasks in the medical field [25,16]. Inspired by the success of visual grounding, several studies of visual grounding for medical images and radiology reports have also been reported [28,1,23]. These studies utilized a large-scale dataset and an attention-based language interpretation model such as BERT [5] to ground the descriptions in the report. However, they have focused on X-ray images, and to the best of our knowledge, there have been no studies on CT images, which cover the entire body and are paired with complex reports.

3 Methods

We first formulate the problem. Then we explain the three key components of our framework: anatomical segmentation, report structuring, and anomaly localization. In our framework, the multiple organ labels obtained as the output of the anatomical segmentation encourage the grounding model to learn detailed anatomy, and the report structuring allows the grounding model to accurately extract the features of the target anomaly from complex sentences.

3.1 Problem Formulation

Our research assumes that a dataset of image-report pairs with region-description correspondence annotations is provided for training. We show the overall framework in Fig. 2. We denote an image and its paired report as I and T respectively. Let Ia be a label image in which multiple organs are extracted from I. Each report T contains descriptions of multiple (image) anomalies. We denote each anomaly as ti ∈ {t1, t2, ..., tN}. Given an image I and the corresponding organ label image Ia encoded as V ∈ R^{dz×dy×dx×d}, and a description of an anomaly ti encoded as Lti ∈ R^d, the goal of our framework is to generate a segmentation map Mti that represents the location of the anomaly ti.
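To make this notation concrete, the following is a minimal sketch of the tensor shapes involved, with illustrative sizes that the paper does not specify:

```python
import torch

# Illustrative sizes only; the paper does not fix dz, dy, dx, or d.
dz, dy, dx, d = 32, 32, 32, 128
N = 4  # number of anomalies t_1..t_N described in report T

V = torch.zeros(dz, dy, dx, d)   # visual embedding of I concatenated with Ia
L = torch.zeros(N, d)            # one embedding L_ti per described anomaly
M = torch.zeros(N, dz, dy, dx)   # one segmentation map M_ti per anomaly
```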

Fig. 2. The proposed framework for 3D-CT visual grounding: anatomical segmentation produces organ labels Ia, which are encoded together with the image by the Image Encoder into V; report structuring extracts anomaly-wise phrase groups (e.g. LIVER S1 lipiodol; LIVER S8 nodule 6mm) from T, which the Text Encoder and Anomaly-Wise Feature Aggregator turn into Lti; a source-target attention produces the localization map Mti.

3.2 Anatomical Segmentation


The task of the anatomical segmentation is to extract relevant anatomies that can serve as clues for visual grounding. We use commercial 3D image analysis software (Synapse 3D V6.8, FUJIFILM Corporation, Japan) to extract 32 organs and tissues (see Appendix Table A1). In this software, anatomies are extracted using U-Net based architectures [13,18]. The extracted anatomical label images are Ia.

3.3 Report Structuring


The tasks of the report structuring are as follows: 1) anatomical prediction,
2) phrase recognition, and 3) relationship estimation between phrases (See Ap-
pendix Fig. A1). The anatomical prediction is a sentence-wise prediction to de-
termine which organ or body part is mentioned in each sentence. The organs and
body parts to be recognized are shown in Appendix Table. A2. The sentences
Visual Grounding of Radiology Reports for CT Images 5

belonging to the same class are concatenated, then the phrase recognition and
the relationship estimation are performed for each class.
The phrase recognition module extracts phrases and classifies each of them
into 9 classes (See Appendix Table. A2). Subsequently, the relationship estima-
tion module determines whether there is a relationship between anomaly phrases
(e.g. ’nodule’, ’fracture’) and other phrases (e.g. ’6mm’, ’Liver S6’), resulting in
the grouping of phrases related to the same anomaly. If multiple anatomical
phrases are grouped in the same group, they are split into separate groups on a
rule basis (e.g. [‘right S1’, ‘left S6’, ‘nodule’] -> [‘right S1’, ‘nodule’], [‘left S6’,
‘nodule’]). More details of implementation and training methods are reported in
Nakano et al. [20] and Tagawa et al [24].
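As a minimal sketch of this rule-based splitting, assuming groups are represented as dictionaries keyed by phrase class (the key names here are hypothetical, not the authors' schema):

```python
from typing import Dict, List

# If relationship estimation put several anatomical-segment phrases in one
# group, duplicate the group so each anomaly keeps exactly one location.
def split_group(group: Dict[str, List[str]]) -> List[Dict[str, List[str]]]:
    segments = group.get("anatomical_segment", [])
    if len(segments) <= 1:
        return [group]
    out = []
    for seg in segments:
        g = {k: list(v) for k, v in group.items()}  # shallow copy per segment
        g["anatomical_segment"] = [seg]
        out.append(g)
    return out

# Example from the paper: ['right S1', 'left S6', 'nodule']
group = {"anatomical_segment": ["right S1", "left S6"], "lesion": ["nodule"]}
print(split_group(group))
# -> [{'anatomical_segment': ['right S1'], 'lesion': ['nodule']},
#     {'anatomical_segment': ['left S6'], 'lesion': ['nodule']}]
```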

3.4 Anomaly Localization

The task of the anomaly localization is to output a localization map of each anomaly mentioned in the input report T. The CT image I and the organ label image Ia are concatenated along the channel dimension and encoded by a convolutional backbone to generate a visual embedding V. The sentences in the report T are encoded by BERT [5] to generate embeddings for each character. Let r = {r1, r2, ..., rNC} be the set of character embeddings, where NC is the number of characters. Our framework next adopts the Anomaly-Wise Feature Aggregator (AFA). For each anomaly ti, the AFA generates a representative embedding Lti by aggregating the embeddings of related phrases based on the report structuring results. The final grounding result Mti is obtained by the following source-target attention:

Mti = sigmoid(Lti WQ (V WK)^T)    (1)

where WQ, WK ∈ R^{d×dn} are trainable variables. The overall architecture of this module is illustrated in Appendix Fig. A2.
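As a minimal sketch of Eq. (1), assuming illustrative sizes for d and dn (the hidden widths are not published):

```python
import torch

# Minimal sketch of Eq. (1); all sizes are illustrative.
dz, dy, dx, d, dn = 8, 8, 8, 128, 64

V = torch.randn(dz * dy * dx, d)   # flattened visual embedding
L_ti = torch.randn(d)              # anomaly embedding from the AFA
W_Q = torch.randn(d, dn) * 0.01    # trainable parameters in the real model
W_K = torch.randn(d, dn) * 0.01

# Query L_ti W_Q against per-voxel keys V W_K; the sigmoid of the dot
# products gives one value per voxel, reshaped into the map M_ti.
M_ti = torch.sigmoid((L_ti @ W_Q) @ (V @ W_K).T).reshape(dz, dy, dx)
print(M_ti.shape)  # torch.Size([8, 8, 8])
```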

Anomaly-Wise Feature Aggregator The results of the report structuring mti ∈ R^{NC} are defined as follows:

mtij = cj if the j-th character is related to the anomaly ti, and 0 otherwise,    (2)

mti = {mti1, mti2, ..., mtiNC}    (3)

where cj is the class index labeled by the phrase recognition module (let C be the number of classes). This module aggregates the character-wise embeddings based on the following formulas:

ek = {rj | mtij = k}    (4)

Lti = LSTM([vorgan; p1; e1; p2; e2; ...; pC; eC])    (5)


where vorgan and pk are trainable embeddings for each organ and each class label respectively, and [·; ·] stands for the concatenation operation. In this way, the embeddings of characters related to the anomaly ti are aggregated and concatenated, and a representative embedding of the anomaly is then generated by an LSTM layer. In the task of visual grounding focused on 3D CT images, the size of the dataset that can be created is relatively small. Considering this limitation, we use an LSTM layer, which has a strong inductive bias, to achieve high generalization performance.
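The following is a minimal PyTorch sketch of the AFA under these definitions; the sizes, class count, and variable names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Illustrative sizes: d = 768 embeddings, C = 9 phrase classes, 32 organs.
d, C, num_organs = 768, 9, 32

class AFA(nn.Module):
    def __init__(self):
        super().__init__()
        self.v_organ = nn.Embedding(num_organs, d)  # trainable v_organ
        self.p = nn.Embedding(C + 1, d)             # trainable class embeddings p_k
        self.lstm = nn.LSTM(d, d, batch_first=True)

    def forward(self, r, m_ti, organ_id):
        # r: (NC, d) character embeddings; m_ti: (NC,) class index per character
        # (0 = unrelated to anomaly t_i); organ_id: scalar organ index tensor.
        seq = [self.v_organ(organ_id).unsqueeze(0)]           # start with v_organ
        for k in range(1, C + 1):
            seq.append(self.p(torch.tensor(k)).unsqueeze(0))  # p_k
            e_k = r[m_ti == k]                                # Eq. (4)
            if e_k.numel() > 0:
                seq.append(e_k)
        seq = torch.cat(seq, dim=0).unsqueeze(0)              # (1, length, d)
        _, (h, _) = self.lstm(seq)                            # Eq. (5)
        return h[-1].squeeze(0)                               # L_ti with shape (d,)

afa = AFA()
r = torch.randn(200, d)                  # 200 characters from the text encoder
m_ti = torch.randint(0, C + 1, (200,))   # structuring result for anomaly t_i
L_ti = afa(r, m_ti, torch.tensor(3))     # organ index 3 (illustrative)
print(L_ti.shape)                        # torch.Size([768])
```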

4 Dataset and Implementation Details


4.1 Clinical Data
We retrospectively collected 10,410 CT studies (11,163 volumes / 7,321 unique patients) and 671,691 radiology reports from one university hospital in Japan. We assigned a bounding box to each anomaly described in the reports, as shown in Appendix Fig. A3. The total number of categories is about 130, combining anatomical regions and anomaly types (the details are in Fig. 4). For each anomaly, a correspondence annotation was made with the anomaly phrases in the report. The total number of annotated regions is 17,536 (head: 713 regions, neck: 285 regions, chest: 8,598 regions, and abdomen: 7,940 regions). We divide the data into 9,163/1,000/1,000 volumes as a training/validation/test split.

4.2 Implementation Details


We use a VGG-like network as the Image Encoder, with 15 3D-convolutional layers and 3 max pooling layers. For training, the voxel spacings in all three dimensions are normalized to 1.0 mm. CT values are linearly normalized to the range [0, 1]. The anatomical label image, in which only one label is assigned to each voxel, is also normalized to the range [0, 1], and the CT image and the label image are concatenated along the channel dimension. As our Text Encoder, we use a BERT with 12 transformer encoder layers, each with a hidden dimension of 768 and 12 heads in the multi-head attention. First, we pre-train the BERT using 6.7M sentences extracted from the reports on a Masked Language Model task. Then we train the whole architecture jointly using Dice loss [19], with the first 8 transformer encoder layers of the BERT frozen. Further information about the implementation is shown in Appendix Table A3.
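A sketch of this input preprocessing follows; the Hounsfield-unit clipping bounds are an assumption, since the paper states only linear normalization to [0, 1]:

```python
import numpy as np

HU_MIN, HU_MAX = -1024.0, 3071.0  # assumed clipping bounds (not in the paper)
NUM_LABELS = 32  # organ/tissue classes from the anatomical segmentation

def preprocess(ct: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """ct: (Z, Y, X) in Hounsfield units, resampled to 1.0 mm spacing;
    labels: (Z, Y, X) integer organ labels in [0, NUM_LABELS]."""
    ct_n = (np.clip(ct, HU_MIN, HU_MAX) - HU_MIN) / (HU_MAX - HU_MIN)
    lab_n = labels.astype(np.float32) / NUM_LABELS
    # Concatenate along a new channel dimension: (2, Z, Y, X)
    return np.stack([ct_n.astype(np.float32), lab_n], axis=0)

x = preprocess(np.random.randint(-1000, 1000, (64, 64, 64)).astype(np.float32),
               np.random.randint(0, NUM_LABELS + 1, (64, 64, 64)))
print(x.shape)  # (2, 64, 64, 64)
```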

5 Experiments
We conducted two kinds of experiments: a comparison study and an ablation study. The comparison study was made against TransVG [4] and MDETR [10], one-stage visual grounding approaches that established state-of-the-art performance on photos and captions. To adapt TransVG and MDETR to the 3D modality, their backbone was changed to a VGG-like network with 3D convolution layers, the same as in the proposed method. We refer to the proposed method without anatomical segmentation and report structuring as the baseline model.

5.1 Evaluation Metrics


We report segmentation performance using the Dice score, the mean intersection over union (mIoU), and the grounding accuracy. The output masks are thresholded to compute the mIoU and the grounding accuracy. The mIoU is defined as the average IoU over the thresholds {0.1, 0.2, 0.3, 0.4, 0.5}. The grounding accuracy is defined as the percentage of anomalies for which the IoU exceeds 0.1 at a threshold of 0.1.
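Under our reading of these definitions, the metrics can be sketched as follows:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray, thr: float) -> float:
    """IoU between a thresholded probability map and a boolean ground truth."""
    p = pred >= thr
    inter = np.logical_and(p, gt).sum()
    union = np.logical_or(p, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def miou(pred, gt, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    # Average IoU over the fixed set of thresholds.
    return float(np.mean([iou(pred, gt, t) for t in thresholds]))

def grounding_accuracy(preds, gts, thr=0.1, min_iou=0.1):
    # Fraction of anomalies whose IoU at threshold 0.1 exceeds 0.1.
    hits = [iou(p, g, thr) > min_iou for p, g in zip(preds, gts)]
    return float(np.mean(hits))
```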

5.2 Results
The experimental results of the two studies are shown in Table 1. Both MDETR and TransVG failed to achieve stable grounding on this task. A main difference between these models and our baseline model is the use of a source-target attention layer instead of a transformer. Transformer-based algorithms, which have many parameters and no strong inductive bias, are known to be difficult to generalize with such a relatively limited amount of training data. For this reason, the baseline model achieved much higher accuracy than the comparison methods.

Table 1. Results of the comparison/ablation studies. '-' represents 'not converged'.

Method       Anatomical Seg.  Report Struct.  Dice [%]  mIoU [%]  Accuracy [%]
MDETR [10]         -                -            N/A        -          -
TransVG [4]        -                -            N/A       8.5        21.8
Baseline           ✗                ✗           27.4      15.6        66.0
Proposed           ✓                ✗           28.1      16.6        67.9
Proposed           ✗                ✓           33.0      20.3        75.9
Proposed           ✓                ✓           34.5      21.5        77.8

The ablation study showed that both the anatomical segmentation and the report structuring improve performance. In Fig. 3 (upper row), we present several cases that give an intuitive understanding of each effect. Longer reports often mention more than one anomaly, making it difficult to recognize the grounding target and causing localization errors. The proposed method can explicitly indicate phrases such as the location and size of the target anomaly, reducing the risk of such failures. Fig. 3 (lower row) shows examples of grounding results when a query unrelated to the image is input. In this case, the grounding results were consistent with the anatomical phrases. The results suggest that the model performs grounding with an emphasis on anatomical information, backed by abundant anatomical knowledge.
The grounding performance for each combination of organ and anomaly type is shown in Fig. 4. The performance is relatively high for organ shape abnormalities (e.g. swelling, duct dilation) and for high-frequency anomalies in small organs (e.g. thyroid/prostate mass). For these anomaly types, our model is considered usable for automatic training data generation. On the other hand, the performance tends to be low for rare anomalies (e.g. mass in the small intestine) and anomalies in large body parts (e.g. limbs). Improving the grounding performance for these targets will be important future work.

Fig. 3. Grounding results for several input queries. Underlines in the input query indicate the target anomaly phrase to be grounded. The phrases highlighted in bold blue indicate the anatomical locations of the target anomaly. The red rectangles indicate the ground truth regions. Cases #4-#6 are the grounding results when an unrelated input query is input. The region surrounded by the red dashed line indicates the anatomical location corresponding to the input query.

6 Conclusion

In this paper, we proposed the first visual grounding framework for 3D CT images and reports. To deal with various types of anomalies throughout the body and with complex reports, we introduced a new approach that uses anatomical recognition results and report structuring results. The experiments showed the effectiveness of our approach, which achieved higher performance than prior techniques. However, in clinical practice, radiologists write reports by comparing multiple images, such as time-series images or multi-phase scans. Realizing such a sophisticated diagnostic process with a visual grounding model will be future research.

Fig. 4. Grounding performance for representative anomalies. The value in each cell is the average Dice score of the proposed method (rows: anomaly types such as mass, cyst, lymph node, calcification, swelling, wall thickening, duct dilation; columns: organs and body parts grouped into head, neck, chest, abdomen, and other).

References

1. Bhalodia, R., Hatamizadeh, A., Tam, L., Xu, Z., Wang, X., Turkbey, E., Xu, D.: Improving Pneumonia Localization via Cross-Attention on Medical Images and Reports. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 571–581. Springer (2021)
2. Dall, T.: The Complexities of Physician Supply and Demand: Projections from 2016 to 2030. IHS Markit Limited (2018)
3. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2016)
4. Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: TransVG: End-to-End Visual Grounding with Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1769–1779 (2021)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186 (2019)
6. Ebrahimian, S., Kalra, M.K., Agarwal, S., Bizzo, B.C., Elkholy, M., Wald, C., Allen, B., Dreyer, K.J.: FDA-regulated AI algorithms: Trends, Strengths, and Gaps of Validation Studies. Academic Radiology 29(4), 559–566 (2022)
7. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from Natural Language Expressions. In: Proceedings of the European Conference on Computer Vision. pp. 108–124. Springer (2016)
8. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 590–597 (2019)
9. Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.Y., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1), 317 (2019)
10. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR-Modulated Detection for End-to-End Multi-Modal Understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790 (2021)
11. Karpathy, A., Fei-Fei, L.: Deep Visual-Semantic Alignments for Generating Image Descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3128–3137 (2015)
12. Karpathy, A., Joulin, A., Fei-Fei, L.: Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In: Proceedings of Advances in Neural Information Processing Systems. pp. 1889–1897 (2014)
13. Keshwani, D., Kitamura, Y., Li, Y.: Computation of Total Kidney Volume from CT Images in Autosomal Dominant Polycystic Kidney Disease Using Multi-task 3D Convolutional Neural Networks. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 380–388. Springer (2018)
14. Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked Cross Attention for Image-Text Matching. In: Proceedings of the European Conference on Computer Vision. pp. 201–216 (2018)
15. Li, B., Weng, Y., Sun, B., Li, S.: Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video. arXiv preprint arXiv:2203.06667 (2022)
16. Li, Y., Wang, H., Luo, Y.: A Comparison of Pre-Trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. pp. 1999–2004. IEEE (2020)
17. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Advances in Neural Information Processing Systems 32, 13–23 (2019)
18. Masuzawa, N., Kitamura, Y., Nakamura, K., Iizuka, S., Simo-Serra, E.: Automatic Segmentation, Localization, and Identification of Vertebrae in 3D CT Images Using Cascaded Convolutional Neural Networks. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 681–690. Springer (2020)
19. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In: Proceedings of the International Conference on 3D Vision. pp. 565–571. IEEE (2016)
20. Nakano, N., Tagawa, Y., Ozaki, R., Taniguchi, T., Ohkuma, T., Suzuki, Y., Kido, S., Tomiyama, N.: Pre-training methods for creating a language model with embedded knowledge of radiology reports. In: Proceedings of the Annual Meeting of the Association for Natural Language Processing (2022)
21. Nishie, A., Kakihara, D., Nojo, T., Nakamura, K., Kuribayashi, S., Kadoya, M., Ohtomo, K., Sugimura, K., Honda, H.: Current radiologist workload and the shortages in Japan: how many full-time radiologists are required? Japanese Journal of Radiology 33, 266–272 (2015)
22. Rimmer, A.: Radiologist shortage leaves patient care at risk, warns royal college. BMJ: British Medical Journal (Online) 359 (2017)
23. Seibold, C., Reiß, S., Sarfraz, S., Fink, M.A., Mayer, V., Sellner, J., Kim, M.S., Maier-Hein, K.H., Kleesiek, J., Stiefelhagen, R.: Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding. arXiv preprint arXiv:2210.03416 (2022)
24. Tagawa, Y., Nakano, N., Ozaki, R., Taniguchi, T., Ohkuma, T., Suzuki, Y., Kido, S., Tomiyama, N.: Performance improvement of named entity recognition on noisy data using teacher-student training. In: Proceedings of the Annual Meeting of the Association for Natural Language Processing (2022)
25. Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9049–9058 (2018)
26. Yan, K., Wang, X., Lu, L., Summers, R.M.: DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging 5(3), 036501 (2018)
27. Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A Fast and Accurate One-Stage Approach to Visual Grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4683–4693 (2019)
28. You, D., Liu, F., Ge, S., Xie, X., Zhang, J., Wu, X.: AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 72–82. Springer (2021)

Fig. A1. Flowchart of report structuring. A report (inputs of our system are Japanese) is first split sentence-wise by the anatomical prediction (e.g. Liver, Spleen, Gallbladder); phrase recognition and relationship estimation then group the phrases (lesion, anatomical segment, measurement result, characteristics, contrast) belonging to the same anomaly, yielding structuring results such as: LIVER S1 lipiodol; LIVER S8 nodule 6mm; SPLEEN enlarged; KIDNEY left cyst.

Table A1. Output classes of anatomical segmentation

Brain: Cerebrospinal Fluid, Left Lateral Ventricle, Right Lateral Ventricle, Third Ventricle, Fourth Ventricle, Brainstem, Left Cerebellum, Right Cerebellum, Left Cerebrum, Right Cerebrum
Lung: Left Upper Lobe, Left Lower Lobe, Right Upper Lobe, Right Middle Lobe, Right Lower Lobe
Heart
Liver, Gallbladder, Stomach, Duodenum, Spleen, Pancreas
Kidney: Left Kidney, Right Kidney
Prostate, Bladder
Aorta
Bone: Cervical Vertebra, Thoracic Vertebra, Lumbar Vertebra, Left Rib, Right Rib

Table A2. Output classes of report structuring

Anatomical Prediction: Head, Brain, Ear, Nose, Neck, Oropharynx, Thyroid, Lung, Heart, Mediastinum, Breast, Pleural Cavity, Chest, Liver, Stomach, Gallbladder, Adrenal, Spleen, Pancreas, Kidney, Small Intestine, Colon, Prostate, Uterus, Bladder, Ovary, Abdomen, Abdominal Cavity, Limb, Aorta

Phrase Classification: Anatomical Segment, Lesion, Shape Abnormality, Diagnosis, Characteristics, Contrast Information, Quantity, Measurement Result, Temporal Change

Fig. A2. Anomaly localization architecture. The Conv block denotes three sets of Conv[3 × 3 × 3] → BatchRenorm → ReLU operations. The Image Encoder stacks Conv blocks of 8/16/32/64/128 channels with three 2×2×2 max-pooling layers, followed by a Conv3D of kernel size 1 (ch 128); the Text Encoder is a 12-layer transformer encoder (ch 768, 12 heads); the Anomaly-Wise Feature Aggregation uses a fully connected layer (ch 128) and an LSTM (ch 768); the map Mti is obtained by a matrix multiplication with sigmoid.
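A minimal PyTorch sketch of this Conv block and encoder layout follows; nn.BatchNorm3d stands in for Batch Renormalization, which PyTorch does not provide built in, and all other details are assumptions based on the figure:

```python
import torch
import torch.nn as nn

# Conv block from Fig. A2: three sets of Conv[3x3x3] -> norm -> ReLU.
def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    layers = []
    for i in range(3):
        layers += [
            nn.Conv3d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),  # stand-in for Batch Renormalization
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

# VGG-like Image Encoder: 15 conv layers (5 blocks x 3) and 3 max-pooling
# layers, with channel widths 8/16/32/64/128 as shown in Fig. A2.
encoder = nn.Sequential(
    conv_block(2, 8), nn.MaxPool3d(2),
    conv_block(8, 16), nn.MaxPool3d(2),
    conv_block(16, 32), nn.MaxPool3d(2),
    conv_block(32, 64),
    conv_block(64, 128),
)

x = torch.zeros(1, 2, 64, 64, 64)  # CT + organ-label channels
print(encoder(x).shape)            # torch.Size([1, 128, 8, 8, 8])
```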

Fig. A3. Examples of the ground truth data: CT images paired with report sentences, with a bounding box assigned to each described anomaly.



Table A3. Implementation Details

Optimizer: Adam
Initial Learning Rate: 1e-5
Learning Rate Schedule: Linearly increased to 1e-4 within the first 5,000 steps, then multiplied by 0.1 every 30,000 steps
Batch Size: 10
Normalization Method: Batch Renormalization
Data Augmentation for Image: Random crop, random rotation, random scaling, sharpness change, smoothing, and Gaussian noise addition
Data Augmentation for Text: Random deletion, random insertion, and random crop
Machine Learning Library: TensorFlow 2.3
GPU: NVIDIA Tesla V100 × 2
