Visual Grounding of Radiology Reports for 3D CT Images

Department of Computer Science and Engineering, Waseda University, Japan
Graduate School of Medicine, Osaka University, Japan
[email protected]
1 Introduction
In recent years, a number of medical image recognition systems have been developed [6] to alleviate the increasing burden on radiologists [2,22,21]. In the
Fig. 1. Comparison of the visual grounding task on an X-ray image and on a CT image.
– We show the first visual grounding results for 3D CT images that cover various body parts and anomalies.
– We introduce a novel grounding architecture that can leverage report structuring results of the presence/type/location of described anomalies.
– We validate the efficacy of the proposed framework using a large-scale dataset
with region-description correspondence annotations.
2 Related Work
3 Methods
Fig. 2. Overview of the proposed architecture. An anatomical segmentation model and an image encoder process the CT volume into visual features V; report structuring (e.g., LIVER S1 lipiodol; LIVER S8 nodule 6mm; SPLEEN enlarged; KIDNEY left cyst) and a text encoder process the report into text features T; a source-target attention module combines the two streams, trained on paired region-description data.
belonging to the same class are concatenated, and then phrase recognition and relationship estimation are performed for each class.
The phrase recognition module extracts phrases and classifies each of them into 9 classes (see Appendix Table A2). Subsequently, the relationship estimation module determines whether there is a relationship between anomaly phrases (e.g., 'nodule', 'fracture') and other phrases (e.g., '6mm', 'Liver S6'), grouping the phrases that relate to the same anomaly. If multiple anatomical phrases fall into the same group, the group is split by rule into separate groups (e.g., ['right S1', 'left S6', 'nodule'] → ['right S1', 'nodule'], ['left S6', 'nodule']). More details of the implementation and training methods are reported in Nakano et al. [20] and Tagawa et al. [24].
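As an illustration of this splitting rule (our own sketch, not the authors' implementation; the phrase-class labels are hypothetical):

```python
# Minimal sketch of the rule-based group splitting described above:
# a group containing several anatomical phrases is duplicated so that
# each resulting group pairs one anatomical phrase with the rest.

def split_group(group):
    """group: list of (text, phrase_class) tuples describing one anomaly."""
    anatomical = [p for p in group if p[1] == "anatomical"]
    others = [p for p in group if p[1] != "anatomical"]
    if len(anatomical) <= 1:
        return [group]
    return [[a] + others for a in anatomical]

# The example from the text:
print(split_group([("right S1", "anatomical"),
                   ("left S6", "anatomical"),
                   ("nodule", "anomaly")]))
# [[('right S1', 'anatomical'), ('nodule', 'anomaly')],
#  [('left S6', 'anatomical'), ('nodule', 'anomaly')]]
```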
where v_organ and p_k are trainable embeddings for each organ and each class label, respectively, and [·;·] denotes the concatenation operation. In this way, embeddings of the characters related to the anomaly t_i are aggregated and concatenated. Subsequently, a representative embedding of the anomaly is generated by an LSTM layer. In the task of visual grounding for 3D CT images, the size of the dataset that can be created is relatively small. Given this limitation, we use an LSTM layer, whose strong inductive bias helps achieve high generalization performance.
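A minimal sketch of this aggregation in TensorFlow (the library the paper reports using); the dimensions and the exact form of the concatenation are assumptions on our part:

```python
import tensorflow as tf

# Hypothetical dimensions; the paper does not specify them.
NUM_ORGANS, NUM_CLASSES, EMB_DIM, OUT_DIM = 32, 9, 128, 256

organ_emb = tf.keras.layers.Embedding(NUM_ORGANS, EMB_DIM)   # v_organ
class_emb = tf.keras.layers.Embedding(NUM_CLASSES, EMB_DIM)  # p_k
lstm = tf.keras.layers.LSTM(OUT_DIM)                         # aggregator

def anomaly_embedding(char_embs, organ_id, class_ids):
    """char_embs: (num_chars, EMB_DIM) embeddings of the characters
    belonging to anomaly t_i; organ_id / class_ids: integer labels."""
    v_organ = organ_emb(tf.constant([organ_id]))      # (1, EMB_DIM)
    p_k = class_emb(tf.constant(class_ids))           # (k, EMB_DIM)
    # [.;.]: concatenate organ, class, and character embeddings into one
    # sequence; the LSTM then produces the representative embedding.
    seq = tf.concat([v_organ, p_k, char_embs], axis=0)
    return lstm(seq[tf.newaxis])                      # (1, OUT_DIM)

print(anomaly_embedding(tf.random.normal((12, EMB_DIM)), 3, [0, 2]).shape)
```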
5 Experiments
We conducted two kinds of experiments: a comparison study and an ablation study. The comparison was made against TransVG [4] and MDETR [10], one-stage visual grounding approaches that established state-of-the-art performance on photographs and captions. To adapt TransVG and MDETR to the 3D modality, their backbones were changed to a VGG-like network with 3D convolution layers, the same as in the proposed method. We refer to the proposed method without anatomical segmentation and report structuring as the baseline model.
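A VGG-like 3D backbone of the kind described can be sketched as follows; the filter counts, depth, and input size are our assumptions, as the paper does not specify them:

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg3d_backbone(input_shape=(64, 128, 128, 1)):
    """VGG-like encoder with 3D convolutions; all hyperparameters here
    are illustrative assumptions, not the authors' configuration."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPool3D(2)(x)  # halve depth/height/width per stage
    return tf.keras.Model(inputs, x, name="vgg3d_backbone")

backbone = vgg3d_backbone()
print(backbone.output_shape)  # (None, 4, 8, 8, 256)
```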
5.2 Results
The experimental results of the two studies are shown in Table 1. Both MDETR and TransVG failed to achieve stable grounding in this task. A main difference between these models and our baseline model is the use of a source-target attention layer instead of a transformer. Transformer-based models with many parameters and no strong inductive bias are known to be difficult to generalize with such a relatively limited amount of training data. For this reason, the baseline model achieved much higher accuracy than the comparison methods.
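For concreteness, source-target attention here means cross-attention in which the text embedding queries the image features; a minimal sketch under assumed dimensions:

```python
import tensorflow as tf

DIM = 256
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=DIM // 4)

def source_target_attention(text_query, image_feats):
    """text_query: (batch, 1, DIM) anomaly embedding; image_feats:
    (batch, voxels, DIM) flattened 3D feature map from the encoder."""
    # The text embedding attends over the image features (source),
    # localizing the described anomaly without deep self-attention stacks.
    return attn(query=text_query, value=image_feats, key=image_feats)

out = source_target_attention(tf.random.normal((2, 1, DIM)),
                              tf.random.normal((2, 512, DIM)))
print(out.shape)  # (2, 1, 256)
```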
The ablation study showed that anatomical segmentation and report structuring each improve performance. In Fig. 3 (upper row), we show several cases that give an intuitive understanding of each effect. Longer reports often mention more than one anomaly, making it difficult to identify the grounding target and causing localization errors. The proposed method can explicitly indicate phrases such as the location and size of the target anomaly, reducing the risk of such failures. Fig. 3 (lower row) shows grounding results when a query unrelated to the image is input. In this case, the grounding results were less consistent with the anatomical phrases. This suggests that the model performs grounding with an emphasis on anatomical information, drawing on its abundant anatomical knowledge.
The grounding performance for each combination of organ and anomaly type is shown in Fig. 4. Performance is relatively high for organ-shape abnormalities (e.g., swelling, duct dilation) and for high-frequency anomalies in small organs (e.g., thyroid/prostate mass). For these anomaly types, our model is considered usable for automatic generation of training data. On the other hand, performance tends to be low for rare anomalies (e.g., mass in the small intestine) and anomalies in large body parts (e.g., limbs). Improving grounding performance for these targets remains important future work.
Fig. 3. Grounding results for several input queries. Underlines in the input query indicate the target anomaly phrase to be grounded. The phrases highlighted in bold blue indicate the anatomical locations of the target anomaly. The red rectangles indicate the ground-truth regions. Cases #4-#6 show the grounding results when an unrelated input query is given. The region surrounded by the red dashed line indicates the anatomical location corresponding to the input query.
6 Conclusion
In this paper, we proposed the first visual grounding framework for 3D CT images and reports. To deal with various types of anomalies throughout the body and with complex reports, we introduced a new approach that uses anatomical recognition results and report structuring results. The experiments showed the effectiveness of our approach, which achieved higher performance than prior techniques. In clinical practice, however, radiologists write reports by comparing multiple images, such as time-series images or multi-phase scans. Realizing such a sophisticated diagnostic process with a visual grounding model is left for future research.
Fig. 4. Grounding performance for representative anomalies. The value in each cell is the average Dice score of the proposed method. [Heatmap not reproduced: rows are anomaly types (Mass, Soft Tissue Tumor, High Density Area, Low Density Area, Lymph Node, Wall Thickening); columns are organs and body parts (Brain, Head, Neck, Throat, Thyroid, Chest, Lung, Breast, Mediastinum, Aorta, Abdomen, Liver, Gallbladder, Spleen, Pancreas, Stomach, Kidney, Adrenal, Small Intestine, Colon, Bladder, Prostate, Uterus, Ovary, Limb); scores range from 0.07 to 0.88.]
References
1. Bhalodia, R., Hatamizadeh, A., Tam, L., Xu, Z., Wang, X., Turkbey, E., Xu, D.: Improving Pneumonia Localization via Cross-Attention on Medical Images and Reports. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 571–581. Springer (2021)
2. Dall, T.: The Complexities of Physician Supply and Demand: Projections from 2016 to 2030. IHS Markit Limited (2018)
3. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2016)
4. Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: TransVG: End-to-End Visual Grounding with Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1769–1779 (2021)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186 (2019)
6. Ebrahimian, S., Kalra, M.K., Agarwal, S., Bizzo, B.C., Elkholy, M., Wald, C., Allen, B., Dreyer, K.J.: FDA-regulated AI algorithms: Trends, Strengths, and Gaps of Validation Studies. Academic Radiology 29(4), 559–566 (2022)
7. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from Natural Language Expressions. In: Proceedings of the European Conference on Computer Vision. pp. 108–124. Springer (2016)
8. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 590–597 (2019)
9. Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.Y., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1), 317 (2019)
10. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790 (2021)
11. Karpathy, A., Fei-Fei, L.: Deep Visual-Semantic Alignments for Generating Image Descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3128–3137 (2015)
12. Karpathy, A., Joulin, A., Fei-Fei, L.: Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In: Proceedings of Advances in Neural Information Processing Systems. pp. 1889–1897 (2014)
13. Keshwani, D., Kitamura, Y., Li, Y.: Computation of Total Kidney Volume from CT Images in Autosomal Dominant Polycystic Kidney Disease Using Multi-task 3D Convolutional Neural Networks. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 380–388. Springer (2018)
14. Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked Cross Attention for Image-Text Matching. In: Proceedings of the European Conference on Computer Vision. pp. 201–216 (2018)
15. Li, B., Weng, Y., Sun, B., Li, S.: Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video. arXiv preprint arXiv:2203.06667 (2022)
16. Li, Y., Wang, H., Luo, Y.: A Comparison of Pre-Trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. pp. 1999–2004. IEEE (2020)
17. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Advances in Neural Information Processing Systems 32, 13–23 (2019)
18. Masuzawa, N., Kitamura, Y., Nakamura, K., Iizuka, S., Simo-Serra, E.: Automatic Segmentation, Localization, and Identification of Vertebrae in 3D CT Images Using Cascaded Convolutional Neural Networks. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 681–690. Springer (2020)
19. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In: Proceedings of the International Conference on 3D Vision. pp. 565–571. IEEE (2016)
20. Nakano, N., Tagawa, Y., Ozaki, R., Taniguchi, T., Ohkuma, T., Suzuki, Y., Kido, S., Tomiyama, N.: Pre-training methods for creating a language model with embedded knowledge of radiology reports. In: Proceedings of the Annual Meeting of the Association for Natural Language Processing (2022)
21. Nishie, A., Kakihara, D., Nojo, T., Nakamura, K., Kuribayashi, S., Kadoya, M., Ohtomo, K., Sugimura, K., Honda, H.: Current radiologist workload and the shortages in Japan: how many full-time radiologists are required? Japanese Journal of Radiology 33, 266–272 (2015)
22. Rimmer, A.: Radiologist shortage leaves patient care at risk, warns Royal College. BMJ: British Medical Journal (Online) 359 (2017)
23. Seibold, C., Reiß, S., Sarfraz, S., Fink, M.A., Mayer, V., Sellner, J., Kim, M.S., Maier-Hein, K.H., Kleesiek, J., Stiefelhagen, R.: Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding. arXiv preprint arXiv:2210.03416 (2022)
24. Tagawa, Y., Nakano, N., Ozaki, R., Taniguchi, T., Ohkuma, T., Suzuki, Y., Kido, S., Tomiyama, N.: Performance improvement of named entity recognition on noisy data using teacher-student training. In: Proceedings of the Annual Meeting of the Association for Natural Language Processing (2022)
25. Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9049–9058 (2018)
26. Yan, K., Wang, X., Lu, L., Summers, R.M.: DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging 5(3), 036501 (2018)
27. Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A Fast and Accurate One-Stage Approach to Visual Grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4683–4693 (2019)
28. You, D., Liu, F., Ge, S., Xie, X., Zhang, J., Wu, X.: AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 72–82. Springer (2021)
Fig. A1. Example of report structuring (inputs to our system are in Japanese; shown translated). Report: "A good accumulation of lipiodol is seen in Liver S1. A 6mm large nodule is seen in S8 with slight hypo-absorption in portal and equilibrium phases. Although enhancement effect by contrast medium was noted on the previous MRI, no enhancement can be noted this time. The S6 nodule noted on the previous MRI cannot be identified. No portal vein tumor emboli. Spleen is enlarged. Gallbladder: n.p. …" Phrase recognition and relationship estimation convert the report into structured results pairing organ, anatomical segment, and lesion (e.g., LIVER S1 lipiodol).
Fig. A2. Anomaly localization architecture: The Conv block denotes three sets of
Conv[3 × 3 × 3] → BatchRenorm → ReLU operations.
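The Conv block described in the caption maps directly to a few lines of Keras; the following is a minimal sketch (the filter count is not specified in the caption and is left as a parameter):

```python
import tensorflow as tf

def conv_block(x, filters):
    """Three sets of Conv[3x3x3] -> BatchRenorm -> ReLU, as in Fig. A2."""
    for _ in range(3):
        x = tf.keras.layers.Conv3D(filters, kernel_size=3, padding="same")(x)
        x = tf.keras.layers.BatchNormalization(renorm=True)(x)  # Batch Renormalization
        x = tf.keras.layers.ReLU()(x)
    return x
```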
Item                         Value
Optimizer                    Adam
Initial Learning Rate        1e-5
Learning Rate Schedule       Linearly increased to 1e-4 within the first 5,000 steps, then multiplied by 0.1 every 30,000 steps
Batch Size                   10
Normalization Method         Batch Renormalization
Data Augmentation for Image  Random crop, random rotation, random scaling, sharpness change, smoothing, and Gaussian noise addition
Data Augmentation for Text   Random deletion, random insertion, and random crop
Machine Learning Library     TensorFlow 2.3
GPU                          NVIDIA Tesla V100 × 2
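For illustration, the learning-rate schedule in the table can be expressed as a custom Keras schedule; this is a sketch, and whether the 30,000-step decay counter starts at the end of warmup is our assumption:

```python
import tensorflow as tf

class WarmupStepDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup from 1e-5 to 1e-4 over the first 5,000 steps,
    then multiply the rate by 0.1 every 30,000 steps thereafter."""

    def __init__(self, base_lr=1e-5, peak_lr=1e-4,
                 warmup_steps=5000, decay_every=30000):
        self.base_lr, self.peak_lr = base_lr, peak_lr
        self.warmup_steps, self.decay_every = warmup_steps, decay_every

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.base_lr + (self.peak_lr - self.base_lr) * step / self.warmup_steps
        n_decays = tf.floor((step - self.warmup_steps) / self.decay_every)
        decayed = self.peak_lr * tf.pow(0.1, n_decays)
        return tf.where(step < self.warmup_steps, warmup, decayed)

optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupStepDecay())
```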