Visual Grounding of Radiology Reports for 3D CT Images

Department of Computer Science and Engineering, Waseda University, Japan
Graduate School of Medicine, Osaka University, Japan
[email protected]
1 Introduction
In recent years, a number of medical image recognition systems have been developed [6] to alleviate the increasing burden on radiologists [2,22,21]. In the
Fig. 1. Comparison of the visual grounding task on an X-ray image and on a CT image.
– We show the first visual grounding results for 3D CT images that cover various body parts and anomalies.
– We introduce a novel grounding architecture that can leverage report structuring results of the presence/type/location of described anomalies.
– We validate the efficacy of the proposed framework using a large-scale dataset
with region-description correspondence annotations.
2 Related Work
3 Methods
Fig. 2. Overview of the proposed architecture. An anatomical segmentation model and an image encoder process the CT volume into visual features V; report structuring (e.g., LIVER S1 lipiodol; LIVER S8 nodule 6mm; SPLEEN enlarged; KIDNEY left cyst) and a text encoder process the report into text features T; a source-target attention module combines the two streams, trained on paired region-description data.
belonging to the same class are concatenated, and then phrase recognition and relationship estimation are performed for each class.
The phrase recognition module extracts phrases and classifies each of them into 9 classes (see Appendix Table A2). Subsequently, the relationship estimation module determines whether there is a relationship between anomaly phrases (e.g., 'nodule', 'fracture') and other phrases (e.g., '6mm', 'Liver S6'), grouping the phrases that relate to the same anomaly. If multiple anatomical phrases fall into the same group, the group is split by rule into separate groups (e.g., ['right S1', 'left S6', 'nodule'] → ['right S1', 'nodule'], ['left S6', 'nodule']). More details of the implementation and training methods are reported in Nakano et al. [20] and Tagawa et al. [24].
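As an illustration of this splitting rule (our own sketch, not the authors' implementation; the phrase-class labels are hypothetical):

```python
# Minimal sketch of the rule-based group splitting described above:
# a group containing several anatomical phrases is duplicated so that
# each resulting group pairs one anatomical phrase with the rest.

def split_group(group):
    """group: list of (text, phrase_class) tuples describing one anomaly."""
    anatomical = [p for p in group if p[1] == "anatomical"]
    others = [p for p in group if p[1] != "anatomical"]
    if len(anatomical) <= 1:
        return [group]
    return [[a] + others for a in anatomical]

# The example from the text:
print(split_group([("right S1", "anatomical"),
                   ("left S6", "anatomical"),
                   ("nodule", "anomaly")]))
# [[('right S1', 'anatomical'), ('nodule', 'anomaly')],
#  [('left S6', 'anatomical'), ('nodule', 'anomaly')]]
```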
where v_organ and p_k are trainable embeddings for each organ and each class label, respectively, and [·;·] denotes the concatenation operation. In this way, embeddings of the characters related to the anomaly t_i are aggregated and concatenated. Subsequently, a representative embedding of the anomaly is generated by an LSTM layer. In the task of visual grounding for 3D CT images, the size of the dataset that can be created is relatively small. Given this limitation, we use an LSTM layer, whose strong inductive bias helps achieve high generalization performance.
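A minimal sketch of this aggregation in TensorFlow (the library the paper reports using); the dimensions and the exact form of the concatenation are assumptions on our part:

```python
import tensorflow as tf

# Hypothetical dimensions; the paper does not specify them.
NUM_ORGANS, NUM_CLASSES, EMB_DIM, OUT_DIM = 32, 9, 128, 256

organ_emb = tf.keras.layers.Embedding(NUM_ORGANS, EMB_DIM)   # v_organ
class_emb = tf.keras.layers.Embedding(NUM_CLASSES, EMB_DIM)  # p_k
lstm = tf.keras.layers.LSTM(OUT_DIM)                         # aggregator

def anomaly_embedding(char_embs, organ_id, class_ids):
    """char_embs: (num_chars, EMB_DIM) embeddings of the characters
    belonging to anomaly t_i; organ_id / class_ids: integer labels."""
    v_organ = organ_emb(tf.constant([organ_id]))      # (1, EMB_DIM)
    p_k = class_emb(tf.constant(class_ids))           # (k, EMB_DIM)
    # [.;.]: concatenate organ, class, and character embeddings into one
    # sequence; the LSTM then produces the representative embedding.
    seq = tf.concat([v_organ, p_k, char_embs], axis=0)
    return lstm(seq[tf.newaxis])                      # (1, OUT_DIM)

print(anomaly_embedding(tf.random.normal((12, EMB_DIM)), 3, [0, 2]).shape)
```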
5 Experiments
We conducted two kinds of experiments: a comparison study and an ablation study. The comparison was made against TransVG [4] and MDETR [10], one-stage visual grounding approaches that established state-of-the-art performance on photographs and captions. To adapt TransVG and MDETR to the 3D modality, their backbones were changed to a VGG-like network with 3D convolution layers, the same as in the proposed method. We refer to the proposed method without anatomical segmentation and report structuring as the baseline model.
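A VGG-like 3D backbone of the kind described can be sketched as follows; the filter counts, depth, and input size are our assumptions, as the paper does not specify them:

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg3d_backbone(input_shape=(64, 128, 128, 1)):
    """VGG-like encoder with 3D convolutions; all hyperparameters here
    are illustrative assumptions, not the authors' configuration."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPool3D(2)(x)  # halve depth/height/width per stage
    return tf.keras.Model(inputs, x, name="vgg3d_backbone")

backbone = vgg3d_backbone()
print(backbone.output_shape)  # (None, 4, 8, 8, 256)
```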
5.2 Results
The experimental results of the two studies are shown in Table 1. Both MDETR and TransVG failed to achieve stable grounding in this task. A main difference between these models and our baseline model is the use of a source-target attention layer instead of a transformer. Transformer-based models with many parameters and no strong inductive bias are known to be difficult to generalize with such a relatively limited amount of training data. For this reason, the baseline model achieved much higher accuracy than the comparison methods.
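For concreteness, source-target attention here means cross-attention in which the text embedding queries the image features; a minimal sketch under assumed dimensions:

```python
import tensorflow as tf

DIM = 256
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=DIM // 4)

def source_target_attention(text_query, image_feats):
    """text_query: (batch, 1, DIM) anomaly embedding; image_feats:
    (batch, voxels, DIM) flattened 3D feature map from the encoder."""
    # The text embedding attends over the image features (source),
    # localizing the described anomaly without deep self-attention stacks.
    return attn(query=text_query, value=image_feats, key=image_feats)

out = source_target_attention(tf.random.normal((2, 1, DIM)),
                              tf.random.normal((2, 512, DIM)))
print(out.shape)  # (2, 1, 256)
```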
The ablation study showed that anatomical segmentation and report structuring each improve performance. In Fig. 3 (upper row), we show several cases that give an intuitive understanding of each effect. Longer reports often mention more than one anomaly, making it difficult to identify the grounding target and causing localization errors. The proposed method can explicitly indicate phrases such as the location and size of the target anomaly, reducing the risk of such failures. Fig. 3 (lower row) shows grounding results when a query unrelated to the image is input. In this case, the grounding results were less consistent with the anatomical phrases. This suggests that the model performs grounding with an emphasis on anatomical information, drawing on its abundant anatomical knowledge.
The grounding performance for each combination of organ and anomaly type is shown in Fig. 4. Performance is relatively high for organ-shape abnormalities (e.g., swelling, duct dilation) and for high-frequency anomalies in small organs (e.g., thyroid/prostate mass). For these anomaly types, our model is considered usable for automatic generation of training data. On the other hand, performance tends to be low for rare anomalies (e.g., mass in the small intestine) and anomalies in large body parts (e.g., limbs). Improving grounding performance for these targets remains important future work.
Fig. 3. Grounding results for several input queries. Underlines in the input query indicate the target anomaly phrase to be grounded. The phrases highlighted in bold blue indicate the anatomical locations of the target anomaly. The red rectangles indicate the ground-truth regions. Cases #4-#6 show the grounding results when an unrelated input query is given. The region surrounded by the red dashed line indicates the anatomical location corresponding to the input query.
6 Conclusion
In this paper, we proposed the first visual grounding framework for 3D CT images and reports. To deal with various types of anomalies throughout the body and with complex reports, we introduced a new approach that uses anatomical recognition results and report structuring results. The experiments showed the effectiveness of our approach, which achieved higher performance than prior techniques. In clinical practice, however, radiologists write reports by comparing multiple images, such as time-series images or multi-phase scans. Realizing such a sophisticated diagnostic process with a visual grounding model is left for future research.
Fig. 4. Grounding performance for representative anomalies. The value in each cell is the average Dice score of the proposed method. [Heatmap not reproduced: rows are anomaly types (Mass, Soft Tissue Tumor, High Density Area, Low Density Area, Lymph Node, Wall Thickening); columns are organs and body parts (Brain, Head, Neck, Throat, Thyroid, Chest, Lung, Breast, Mediastinum, Aorta, Abdomen, Liver, Gallbladder, Spleen, Pancreas, Stomach, Kidney, Adrenal, Small Intestine, Colon, Bladder, Prostate, Uterus, Ovary, Limb); scores range from 0.07 to 0.88.]
References
1. Bhalodia, R., Hatamizadeh, A., Tam, L., Xu, Z., Wang, X., Turkbey, E., Xu, D.: Improving Pneumonia Localization via Cross-Attention on Medical Images and Reports. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 571–581. Springer (2021)
2. Dall, T.: The Complexities of Physician Supply and Demand: Projections from 2016 to 2030. IHS Markit Limited (2018)
3. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2016)
4. Deng, J., Yang, Z., Chen, T., Zhou, W., Li, H.: TransVG: End-to-End Visual Grounding with Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1769–1779 (2021)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186 (2019)
6. Ebrahimian, S., Kalra, M.K., Agarwal, S., Bizzo, B.C., Elkholy, M., Wald, C., Allen, B., Dreyer, K.J.: FDA-regulated AI algorithms: Trends, Strengths, and Gaps of Validation Studies. Academic Radiology 29(4), 559–566 (2022)
7. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from Natural Language Expressions. In: Proceedings of the European Conference on Computer Vision. pp. 108–124. Springer (2016)
8. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 590–597 (2019)
9. Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.Y., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1), 317 (2019)
10. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790 (2021)
11. Karpathy, A., Fei-Fei, L.: Deep Visual-Semantic Alignments for Generating Image Descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3128–3137 (2015)
12. Karpathy, A., Joulin, A., Fei-Fei, L.: Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In: Proceedings of Advances in Neural Information Processing Systems. pp. 1889–1897 (2014)
13. Keshwani, D., Kitamura, Y., Li, Y.: Computation of Total Kidney Volume from CT Images in Autosomal Dominant Polycystic Kidney Disease Using Multi-task 3D Convolutional Neural Networks. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 380–388. Springer (2018)
14. Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked Cross Attention for Image-Text Matching. In: Proceedings of the European Conference on Computer Vision. pp. 201–216 (2018)
15. Li, B., Weng, Y., Sun, B., Li, S.: Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video. arXiv preprint arXiv:2203.06667 (2022)
16. Li, Y., Wang, H., Luo, Y.: A Comparison of Pre-Trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. pp. 1999–2004. IEEE (2020)
17. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Advances in Neural Information Processing Systems 32, 13–23 (2019)
18. Masuzawa, N., Kitamura, Y., Nakamura, K., Iizuka, S., Simo-Serra, E.: Automatic Segmentation, Localization, and Identification of Vertebrae in 3D CT Images Using Cascaded Convolutional Neural Networks. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 681–690. Springer (2020)
19. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In: Proceedings of the International Conference on 3D Vision. pp. 565–571. IEEE (2016)
20. Nakano, N., Tagawa, Y., Ozaki, R., Taniguchi, T., Ohkuma, T., Suzuki, Y., Kido, S., Tomiyama, N.: Pre-training methods for creating a language model with embedded knowledge of radiology reports. In: Proceedings of the Annual Meeting of the Association for Natural Language Processing (2022)
21. Nishie, A., Kakihara, D., Nojo, T., Nakamura, K., Kuribayashi, S., Kadoya, M., Ohtomo, K., Sugimura, K., Honda, H.: Current radiologist workload and the shortages in Japan: how many full-time radiologists are required? Japanese Journal of Radiology 33, 266–272 (2015)
22. Rimmer, A.: Radiologist shortage leaves patient care at risk, warns Royal College. BMJ: British Medical Journal (Online) 359 (2017)
23. Seibold, C., Reiß, S., Sarfraz, S., Fink, M.A., Mayer, V., Sellner, J., Kim, M.S., Maier-Hein, K.H., Kleesiek, J., Stiefelhagen, R.: Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding. arXiv preprint arXiv:2210.03416 (2022)
24. Tagawa, Y., Nakano, N., Ozaki, R., Taniguchi, T., Ohkuma, T., Suzuki, Y., Kido, S., Tomiyama, N.: Performance improvement of named entity recognition on noisy data using teacher-student training. In: Proceedings of the Annual Meeting of the Association for Natural Language Processing (2022)
25. Wang, X., Peng, Y., Lu, L., Lu, Z., Summers, R.M.: TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9049–9058 (2018)
26. Yan, K., Wang, X., Lu, L., Summers, R.M.: DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging 5(3), 036501 (2018)
27. Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A Fast and Accurate One-Stage Approach to Visual Grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4683–4693 (2019)
28. You, D., Liu, F., Ge, S., Xie, X., Zhang, J., Wu, X.: AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 72–82. Springer (2021)
Fig. A1. Example of report structuring (inputs to our system are in Japanese; shown translated). Report: "A good accumulation of lipiodol is seen in Liver S1. A 6mm large nodule is seen in S8 with slight hypo-absorption in portal and equilibrium phases. Although enhancement effect by contrast medium was noted on the previous MRI, no enhancement can be noted this time. The S6 nodule noted on the previous MRI cannot be identified. No portal vein tumor emboli. Spleen is enlarged. Gallbladder: n.p. …" Phrase recognition and relationship estimation convert the report into structured results pairing organ, anatomical segment, and lesion (e.g., LIVER S1 lipiodol).
Fig. A2. Anomaly localization architecture: The Conv block denotes three sets of
Conv[3 × 3 × 3] → BatchRenorm → ReLU operations.
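The Conv block described in the caption maps directly to a few lines of Keras; the following is a minimal sketch (the filter count is not specified in the caption and is left as a parameter):

```python
import tensorflow as tf

def conv_block(x, filters):
    """Three sets of Conv[3x3x3] -> BatchRenorm -> ReLU, as in Fig. A2."""
    for _ in range(3):
        x = tf.keras.layers.Conv3D(filters, kernel_size=3, padding="same")(x)
        x = tf.keras.layers.BatchNormalization(renorm=True)(x)  # Batch Renormalization
        x = tf.keras.layers.ReLU()(x)
    return x
```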
Item                         Value
Optimizer                    Adam
Initial Learning Rate        1e-5
Learning Rate Schedule       Linearly increased to 1e-4 within the first 5,000 steps, then multiplied by 0.1 every 30,000 steps
Batch Size                   10
Normalization Method         Batch Renormalization
Data Augmentation for Image  Random crop, random rotation, random scaling, sharpness change, smoothing, and Gaussian noise addition
Data Augmentation for Text   Random deletion, random insertion, and random crop
Machine Learning Library     TensorFlow 2.3
GPU                          NVIDIA Tesla V100 × 2
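For illustration, the learning-rate schedule in the table can be expressed as a custom Keras schedule; this is a sketch, and whether the 30,000-step decay counter starts at the end of warmup is our assumption:

```python
import tensorflow as tf

class WarmupStepDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup from 1e-5 to 1e-4 over the first 5,000 steps,
    then multiply the rate by 0.1 every 30,000 steps thereafter."""

    def __init__(self, base_lr=1e-5, peak_lr=1e-4,
                 warmup_steps=5000, decay_every=30000):
        self.base_lr, self.peak_lr = base_lr, peak_lr
        self.warmup_steps, self.decay_every = warmup_steps, decay_every

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.base_lr + (self.peak_lr - self.base_lr) * step / self.warmup_steps
        n_decays = tf.floor((step - self.warmup_steps) / self.decay_every)
        decayed = self.peak_lr * tf.pow(0.1, n_decays)
        return tf.where(step < self.warmup_steps, warmup, decayed)

optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupStepDecay())
```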