
MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation


Daniele Molino^a, Francesco Di Feola^b, Eliodoro Faiella^{d,e}, Deborah Fazzini^c,
Domiziana Santucci^d, Linlin Shen^f, Valerio Guarrasi^{a,1}, Paolo Soda^{a,b,1,*}

a Research Unit of Computer Systems and Bioinformatics, Department of Engineering, Università Campus Bio-Medico di Roma, Roma, Italy
b Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University, Umeå, Sweden
c Department of Diagnostic Imaging and Stereotactic Radiosurgery, Centro Diagnostico Italiano S.p.A., Milano, Italy
d Department of Radiology and Interventional Radiology, Fondazione Policlinico Universitario Campus Bio-Medico, Rome, Italy
e Research Unit of Radiology and Interventional Radiology, Department of Medicine and Surgery, Università Campus Bio-Medico di Roma, Rome, Italy
f College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

arXiv:2501.04614v1 [cs.AI] 8 Jan 2025

Abstract
Artificial Intelligence is revolutionizing medical practice, enhancing diagnos-
tic accuracy and healthcare delivery. However, its adoption in medical
settings still faces significant challenges related to data availability and pri-
vacy constraints. Synthetic data has emerged as a promising solution to
mitigate these issues, addressing data scarcity while preserving privacy. Re-
cently, Latent Diffusion Models have emerged as a powerful tool for gen-
erating high-quality synthetic data. Meanwhile, the integration of different
modalities has gained interest, emphasizing the need for models capable of
handling multimodal medical data. Existing approaches struggle to integrate


Corresponding author: [email protected], [email protected]
Email addresses: [email protected] (Daniele Molino),
[email protected] (Francesco Di Feola), [email protected]
(Eliodoro Faiella), [email protected] (Deborah Fazzini),
[email protected] (Domiziana Santucci), [email protected] (Linlin
Shen), [email protected] (Valerio Guarrasi), [email protected],
[email protected] (Paolo Soda)
1
These authors equally contributed to the work and share senior authorship.

Preprint submitted to Information Fusion January 9, 2025


complementary information and lack the ability to generate modalities si-
multaneously. To address this challenge, we present MedCoDi-M, a 6.77-
billion-parameter model designed for multimodal medical data generation
that, following the Foundation Model paradigm, exploits contrastive learning
and a large quantity of data to build a shared latent space which captures the
relationships between different data modalities. Further, we introduce the
Multi-Prompt training technique, which significantly boosts MedCoDi-M’s
generation under different settings. We extensively validate MedCoDi-M:
first we benchmark it against five competitors on the MIMIC-CXR dataset,
a state-of-the-art dataset for Chest X-ray and radiological report generation.
Secondly, we perform a Visual Turing Test with expert radiologists to assess
the realism and clinical relevance of the generated data, ensuring alignment
with real-world scenarios. Finally, we assess the utility of MedCoDi-M in
addressing key challenges in the medical field, such as anonymization, data
scarcity and imbalance learning. The results are promising, demonstrating
the applicability of MedCoDi-M in medical contexts. Project page is at
https://cosbidev.github.io/MedCoDi-M/.
Keywords: Diffusion Models, Contrastive Learning, Self-Supervised
Learning, Generative AI, Chest X-rays, Radiological Report

1. Introduction
Artificial Intelligence (AI) is increasingly revolutionizing several fields, in-
cluding healthcare. Today, AI systems are capable of processing vast amounts
of medical data, revealing patterns often undetectable to the human eye and
enabling more accurate diagnostics, personalized treatments, and efficient
healthcare delivery [1]. Moreover, the capability to leverage multimodal data
represents a disruptive advancement in the medical field, enabling compre-
hensive diagnostic insights by integrating different data sources. However,
despite these advancements, the implementation of AI in healthcare faces sev-
eral challenges, primarily caused by data scarcity and privacy concerns [2].
The available datasets for training AI models are often limited in size, di-
versity, and scope, making training deep learning (DL) models a significant
challenge, as these models typically require extensive, high-quality data to
achieve strong performance. Without enough diverse data, models may be-
come biased, prone to overfitting, or unable to generalize well to new, unseen
cases, creating a bottleneck in the deployment of AI solutions in real-world
healthcare scenarios. Privacy regulations, such as the General Data Protec-
tion Regulation (GDPR) [3] in Europe and the Health Insurance Portability
and Accountability Act (HIPAA) [4] in the United States, although crucial
for protecting patient privacy, can hinder the collaborative efforts required
to gather large-scale datasets. To address these limitations, a novel stream
of research is focusing on multimodal synthetic data generation techniques.
This emerging approach involves creating artificial data that replicate the
complexity and diversity of real medical data, thus providing a solution to
bypass the constraints of real-world data scarcity and privacy concerns.

1.1. Generative AI
Generative AI has seen remarkable growth since 2014, when the introduc-
tion of Generative Adversarial Networks (GANs) [5] had a groundbreaking
influence on the research field, enabling the creation of realistic synthetic
data through adversarial training. Despite their early success, GANs face
inherent challenges, such as training instability, mode collapse, and diffi-
culty in generating fine-grained details, issues that limit their effectiveness
in the medical domain. Building on the foundation set by GANs, Diffu-
sion Models (DM) [6] have recently emerged as a more robust approach for
data generation. Through a multi-step denoising process, DMs demonstrate
an improved capacity for generating diverse, high-fidelity data, capturing
subtle variations and intricate details that are essential in medical imaging
applications, where both fidelity and diversity of synthetic data are crucial.
Latent diffusion models (LDMs) [7] have gained significant attention: as the
denoising process operates within a low-dimensional latent space [7], LDMs
require reduced computational resources, making them more practical and
accessible for deployment to a wider range of users and systems. Moreover,
due to their advanced conditioning mechanism, LDMs allow for fine-grained
control over the generation process [8, 9]. The conditioning mechanism lever-
ages encoders that extract meaningful representations from the input source,
enabling the synthesis of targeted features, such as specific anatomical struc-
tures or disease characteristics. This controlled generation process not only
enhances the model's flexibility but also its relevance in medical settings, as it
allows practitioners to generate synthetic data tailored to unique diagnostic
requirements or research needs. LDMs can be effectively adapted to a wide
range of downstream tasks and, with the appropriate pre-training, have the
potential to serve as robust foundation models [10]. However, most of these
models can only generate one modality from another, which can be a
significant limitation in the healthcare setting, where multiple modalities coexist
and interact. Outside the medical domain, significant advancements have
been made in multimodal data generation. Among these studies, CoDi [11]
stands as a pivotal work. By enabling the simultaneous generation of multi-
ple modalities from a shared latent space, CoDi significantly improves the
consistency and coherence of the generated outputs, allowing for any-to-any
generation, avoiding the pitfalls of a multi-step approach. The adaptation of
a similar approach for medical data generation could prove highly beneficial,
filling a critical gap in the availability of diverse and high-quality datasets
for research and diagnostic purposes. However, CoDi presents some limita-
tions when applied to such a setting: while it demonstrates the feasibility
of any-to-any generation in non-medical environments, its performance tends
to degrade when provided with multiple input modalities or their combina-
tions, a limitation that cannot be overlooked in the medical domain, where
reliability and consistency across modalities are critical.

1.2. Related Works


In recent years, there has been a growing interest in developing generative
models for X-ray generation, as they can give insight into a wide range
of medical conditions. Numerous studies have assessed the task of synthetic
Chest X-ray (CXR) generation through GANs, where most focused on a
specific pathology, like tuberculosis [12], pneumonia [13] or Covid-19 [14, 15].
Recently, LDMs have emerged as a promising approach for CXR generation:
RoentGen [16] was the first work to explore the adaptation of a pre-trained
LDM, named Stable Diffusion [7], for text-conditioned generation of CXRs
beyond few- or zero-shot setting [17, 18]. In such a work, they showed that
fine-tuning the UNet component is necessary in order to effectively adapt the
model to the medical domain, as it allows capturing the unique features and
nuances of medical images, thereby improving the quality and realism of the
generated CXRs.
In parallel with the rise of LDMs, there has been a growing interest in devel-
oping techniques to fuse textual and visual data, driving the development of
the first vision-language models, capable of processing and combining both
modalities [19, 20, 21, 22]. In the field of CXR generation, UniXGen [23]
leverages the transformer [24, 25] architecture for both X-ray and report gen-
eration. They adopted a vector quantization technique, named VQ-GAN [26],
to convert an X-ray into a set of discrete tokens, addressing both tasks as a
sequence generation problem. Additionally, their work emphasizes the
generation of different X-ray views, as each view contains distinct informative
content, enhancing the utility of generated data. However, their approach is
limited by its inability to generate multiple outputs simultaneously, requir-
ing separate processing for each modality, without an explicit mechanism to
guarantee coherence between the generated data. Building on this idea, Lee
et al. [27] proposed a similar approach for bidirectional X-ray and report
generation via a fine-tuned large language model, named LLM-CXR. Unlike
UniXGen, they only leveraged frontal chest X-rays, focusing on a single view
for their generation tasks, potentially limiting its applicability in more com-
prehensive clinical scenarios.
The main limitation of these works is that they overlook the complementary
nature of different medical data modalities and lack the ability to gener-
ate multimodal outputs simultaneously. This independent processing often
results in inconsistencies when modalities are synthesized separately, poten-
tially leading to outputs that lack clinical coherence. Such limitations hinder
their applicability in real-world healthcare settings, where seamless integra-
tion of multimodal data is essential to replicate the complexity of patient-
specific information accurately.

1.3. Contribution
Building on the success of CoDi, this work proposes MedCoDi-M, a novel
multi-prompt foundation model for multimodal medical data generation. By
taking advantage of contrastive learning techniques, used to build Foundation
Models [28], MedCoDi-M enables flexible, any-to-any generation across differ-
ent medical data modalities. Specifically, our generative process ensures that
MedCoDi-M can capture the complex interactions between different medi-
cal modalities. To this end, we propose a novel training approach, named
Multi-Prompt Training, to improve the model’s ability to fuse information
from multiple modalities, enhancing its capability to generate coherent and
accurate medical data.
The main contributions can be summarized as follows:

• We propose MedCoDi-M, a novel generative model that synthesizes multiple data modalities from a shared multimodal latent space.

• We introduce a Multi-Prompt training approach that boosts MedCoDi-M's generation capabilities when prompted by multiple data modalities.

• We thoroughly evaluate MedCoDi-M against existing state-of-the-art
models, showing its superior capabilities in terms of quality, realism
and clinical accuracy.

• We perform a Visual Turing Test, consisting of five evaluation tasks administered to three expert radiologists, to assess the clinical realism and diagnostic consistency of the generated data modalities.

• We evaluate the utility of synthetic data by showing its effectiveness in tackling three key challenges in the medical domain, i.e., Anonymization, Imbalance Learning and Data Scarcity.

This paper is organized as follows: Section 2 presents the methods employed in this work, detailing the architecture of MedCoDi-M, its training procedure, and the innovative Multi-Prompt Training strategy. Section 3 describes the dataset and preprocessing steps, emphasizing the characteristics of the medical data used. It then outlines the experimental setup, including the competitors, evaluation metrics, and configurations adopted to assess MedCoDi-M. Section 4 discusses the results, providing both quantitative and qualitative analyses, also introducing the findings from the Visual Turing Test. Finally, Section 5 summarizes the main contributions and discusses potential directions for future research.

2. Methods
Assuming M is the set of our modalities, let I = {I1 , I2 , ..., In } be any subset
of modalities used to prompt the generation and let O = {O1 , O2 , ..., Om }
be any subset of modalities to be generated, such that O ∩ I = ∅, with
I, O ⊆ M. It is important to note that this distinction is made solely for
expositional clarity; in practice, any modality from the set M can be used
both as an input or as an output, and the model is not restricted to specific
modality pairings.
The overall architecture of MedCoDi-M is depicted in Fig.1, which consists
of three blocks, each corresponding to a distinct training phase. In panel
(a), we align the feature representations extracted from the input modali-
ties by modality-specific prompt encoders into a shared latent space using
contrastive learning. In panel (b), we independently train an LDM for each
output modality, using the multi-prompt training approach for condition-
ing. Finally, in panel (c), we perform cross-modal alignment, enabling the
model to simultaneously generate any combination of output modalities. In
the following, we provide a rigorous description of each training step in Sections 2.1, 2.2 and 2.3.

Figure 1: Framework of MedCoDi-M - a) Shared Latent Space construction: input modalities are processed by modality-specific prompt encoders to extract feature representations, which are aligned using contrastive learning. b) Single-modality generation training: individual LDMs are trained for each output modality using the proposed Multi-Prompt training approach. This technique dynamically combines subsets of input modalities to form a conditioning vector, allowing the model to learn from various input configurations. c) Latent Cross-Modal Alignment: this phase enables simultaneous multimodal generation through mutual conditioning between LDMs.

2.1. Building a Shared Latent Space


We propose to align any input modalities within a shared latent space by
leveraging contrastive learning. This approach allows the model to be freely
conditioned on any input combination, even those absent in the training data.
Inspired by [11], we take advantage of an efficient technique called Bridg-
ing Alignment to align the representations extracted by modality-specific
prompt encoders. Following Fig 1.a, we first extract a feature representa-
tion hIj = PIj (Ij ) for every input modality Ij ∈ I, where PIj is the prompt
encoder for modality Ij . The latent space is constructed through a series of
pairwise training rounds, ensuring coherent alignment across all modalities
while reducing the computational complexity. Once the encoders are trained,
multimodal conditioning can be achieved by interpolating the representations
of each modality h_{I_1}, h_{I_2}, ..., h_{I_n}.

2.2. A Multi-prompt approach for single-modality generation


Training a multi-input, multi-output generative model requires extensive
training across diverse data sources, while maintaining high generation qual-
ity across all synthesis flows. To address these challenges, MedCoDi-M is de-
signed to be both composable and integrative, as it enables the independent
development of modality-specific LDMs, which can then be seamlessly inte-
grated into a unified framework. In the healthcare domain, information often
flows concurrently across multiple modalities: to emulate this phenomenon,
we developed a novel training approach, named Multi-Prompt Training. This
technique enhances MedCoDi-M’s conditioning capabilities, enabling it to be
effectively conditioned on multiple data modalities simultaneously. Let us re-
member that Oi is an output modality we aim to generate and I is the set
of input modalities used to prompt the LDM. Following Figure 1.b, we first
extract the latent representation h_{I_j} = P_{I_j}(I_j) for all the I_j in I using the
prompt encoders, now frozen, previously trained as described in Section 2.1.
Then, at each training iteration, the prompt sampling strategy Ω dynamically
selects and combines a random subset of input modalities I_p from I into a
conditioning vector ω. Given a total number of n modalities, there exist
2^{n-1} - 1 possible combinations, making the probability of drawing any possible
subset equal to p = \frac{1}{2^{n-1} - 1}. Once a combination is selected, their latent
representations are linearly combined to form a conditioning vector defined as:

\omega = \Omega(h_{I_1}, \ldots, h_{I_n}) = \sum_{j=1}^{n} \alpha_j h_{I_j}, \quad \text{with} \quad \sum_{j=1}^{n} \alpha_j = 1 \ \text{and} \ j \in \{I_p\}.   (1)

The resulting vector ω is then used as the conditioning for training the model
G_{O_i}. Following the reparametrization method proposed in [6], the training
objective can be expressed as [7]:

L_D = \mathbb{E}_{z,\epsilon,t}\left[\|\epsilon - \epsilon_\theta(z_t, t, \omega)\|_2^2\right],   (2)

where z_t is the latent variable of the diffusion process, progressively diffused
across time steps t ∼ [1, T] sampled from a uniform distribution, and ε_θ is a
denoising model with a UNet architecture parameterized by θ.
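For illustration, the following is a minimal sketch of how the prompt sampling strategy Ω of Eq. 1 could be implemented, assuming the modality-specific prompt embeddings have already been extracted; the variable names, the 768-dimensional embeddings and the random choice of the convex weights α_j are illustrative assumptions, not the authors' code.

```python
import itertools
import random
import torch

def sample_multi_prompt(embeddings):
    """Draw one of the 2^n - 1 non-empty subsets of input modalities
    uniformly at random and linearly combine their embeddings into a
    single conditioning vector with weights summing to 1 (Eq. 1)."""
    names = list(embeddings)
    subsets = [s for r in range(1, len(names) + 1)
               for s in itertools.combinations(names, r)]
    chosen = random.choice(subsets)            # uniform over non-empty subsets
    alphas = torch.rand(len(chosen))
    alphas = alphas / alphas.sum()             # convex weights (assumed random)
    return sum(a * embeddings[m] for a, m in zip(alphas, chosen))

# Hypothetical 768-dimensional prompt embeddings for two input modalities.
h = {"frontal_xray": torch.randn(768), "report": torch.randn(768)}
omega = sample_multi_prompt(h)                 # conditioning vector for the LDM
```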

2.3. Multi-output generation via Cross-modal Latent Alignment
The third training stage enables the simultaneous generation of any combi-
nation of output modalities, ensuring that each generative flow is aware of
the others. To this end, we incorporate two trainable components into each
LDM GOi : the first is an encoder VOi , that projects the latent variable of
the diffusion process zOi into a shared latent space; the second is a cross-
attention layer, that allows each LDM to attend to the generative process of
another model. Formally, let us consider two modalities, Oi and Oi+1 , being
jointly synthesized by GOi and GOi+1 and let zOi and zOi+1 denote their la-
tent variables at a generic diffusion step, respectively. Following Fig 1.c, the
encoder VOi+1 first projects zOi+1 into a shared latent space. Then, in each
layer of GOi , the cross-attention layer attends to VOi+1 (zOi+1 ).
For the diffusion model of modality O_i, the training objective in Eq. 2 becomes:

L_D^{O_i} = \mathbb{E}_{z,\epsilon,t}\left[\|\epsilon - \epsilon_{\theta_c}(z_{O_i}, V_{O_{i+1}}(z_{O_{i+1}}), t, \omega)\|_2^2\right],   (3)

where θ_c represents the parameters of the cross-attention layer in the UNet.
The training objective for the joint generation of O_i and O_{i+1} becomes
L_{Cross} = L_D^{O_i} + L_D^{O_{i+1}}.
By training only these two additional components while keeping the rest
of the model frozen, MedCoDi-M effectively learns to generate modalities
simultaneously while maintaining high-quality outputs.
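As a rough sketch of the two trainable components described above (the projector V into the shared latent space and the cross-attention layer), under the assumption that the diffusion latents are arranged as token sequences; the dimensions and module names are illustrative and do not come from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Projector V_{O_{i+1}} plus a cross-attention layer that lets the LDM
    of modality O_i attend to the latent of the jointly generated O_{i+1}."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.project = nn.Linear(dim, dim)                       # V_{O_{i+1}}
        self.attend = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z_self, z_other):
        # z_self, z_other: (batch, tokens, dim) latents at the same diffusion step
        v = self.project(z_other)                                # shared latent space
        attended, _ = self.attend(query=z_self, key=v, value=v)  # attend to partner
        return z_self + attended                                 # residual update

# Joint objective L_Cross: sum of the two conditioned denoising losses (Eq. 3).
def joint_loss(eps_i, eps_i_pred, eps_j, eps_j_pred):
    return F.mse_loss(eps_i_pred, eps_i) + F.mse_loss(eps_j_pred, eps_j)
```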

3. Experimental Configuration
This section provides a comprehensive overview of the experimental setup
adopted to evaluate the performance of MedCoDi-M. It begins by describing
the dataset used, including its key characteristics and preprocessing steps,
which ensure that the data is prepared appropriately for training and evalu-
ation. Next, the section details the implementation specifics of the proposed
framework, such as architectural choices, training configurations, and opti-
mization strategies. Additionally, it presents the state-of-the-art competitors
used for comparative analysis, highlighting their relevance and limitations in
the context of multimodal medical data generation. Finally, the evaluation
metrics are introduced, encompassing both quantitative and qualitative mea-
sures to thoroughly assess the realism, coherence, and clinical utility of the
generated outputs.

3.1. Materials
To achieve our purpose, it is crucial to leverage a multimodal medical dataset
that captures the complementary nature of different modalities, such as imag-
ing and textual data. Such datasets are essential for training models capable
of synthesizing clinically accurate and coherent outputs across diverse medi-
cal data types. We used the MIMIC-CXR [29] dataset, which contains 377,110
CXR images along with their corresponding radiology reports, for a total of
227,827 studies conducted at the Beth Israel Deaconess Medical Center in
Boston, MA, USA. In the dataset, images are acquired in frontal and lat-
eral projection. Due to significant anatomical differences, the two views offer
distinct yet complementary diagnostic information [30]. For example, car-
diovascular structures and the diaphragm can obscure up to 15% of the lung,
making certain pathologies undetectable in the frontal view alone [31]. The
lateral view, by providing a different perspective, enables the visualization
of lesions or abnormalities hidden behind these anatomical structures, thus
ensuring more accurate diagnosis [32]. For those reasons, we treated frontal
and lateral CXRs as distinct modalities. Each radiology report in the dataset
is divided in two sections: a finding section that provides a detailed descrip-
tion of both normal and abnormal features observed in the corresponding
CXR, and an impression section, which provides a concise summary of the
findings intended to support medical decision-making. In this work, we fo-
cused exclusively on the latter, as it offers a concise yet powerful summary
of the patient’s condition and it also complies with our text encoder, which,
following the implementation of [11], limits the length of the report to 77 tokens.
From the repository, we extracted a total of 154,721 X-rays from 78,584 stud-
ies, including all the patients for which the radiology report and both frontal
and lateral views were present. An illustrative example of a triplet is depicted
in Fig. 2.
Furthermore, we used the original uncompressed X-rays stored in DICOM
format [33] since, in medical imaging, subtle details are critical for accurate di-
agnosis, and compression can lead to unintended loss of information.
The X-ray preprocessing involved several steps to standardize and prepare
the data for model training. First, we examined the pixel spacing of each
image and resampled those with non-standard spacing to [0.139, 0.139], i.e.,
the value observed in 95.76% of the dataset. For the images with a Pho-
tometric Interpretation of MONOCHROME1, the pixel values were inverted to
ensure proper representation.

Figure 2: A sample of our dataset, composed of a Frontal X-ray, a Lateral X-ray and the corresponding radiology report.

Subsequently, we normalized the images by dividing every pixel by the maximum value allowed by their bit
depth, bringing the range to [0, 1]. Since the original scans are not square, we
chose not to modify their proportions through a direct resizing, as this could
distort important anatomical features. Additionally, extracting a square crop
was not a viable option, due to the impossibility of selecting a Region of Interest
(ROI) that would be universally applicable. Instead, we added zero-padding
around the images and then resized them to 256 × 256, to standardize the input
size while preserving the integrity of the visual content.
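A minimal sketch of the preprocessing pipeline described above (resampling to a common pixel spacing, MONOCHROME1 inversion, bit-depth normalization, zero-padding and resizing), assuming pydicom and OpenCV are available; this is an illustrative reimplementation, not the authors' code.

```python
import numpy as np
import pydicom
import cv2

def preprocess_cxr(path, target_spacing=0.139, out_size=256):
    """Resample, invert if needed, normalize, zero-pad and resize one DICOM CXR."""
    ds = pydicom.dcmread(path)
    img = ds.pixel_array.astype(np.float32)

    # Resample non-standard pixel spacing to [0.139, 0.139] mm.
    sy, sx = [float(v) for v in ds.PixelSpacing]
    img = cv2.resize(img, None, fx=sx / target_spacing, fy=sy / target_spacing)

    # Invert MONOCHROME1 scans so that higher values mean brighter pixels.
    if ds.PhotometricInterpretation == "MONOCHROME1":
        img = img.max() - img

    # Normalize to [0, 1] using the maximum value allowed by the bit depth.
    img = img / float(2 ** int(ds.BitsStored) - 1)

    # Zero-pad to a square canvas, then resize to the target resolution.
    h, w = img.shape
    side = max(h, w)
    canvas = np.zeros((side, side), dtype=np.float32)
    canvas[(side - h) // 2:(side - h) // 2 + h,
           (side - w) // 2:(side - w) // 2 + w] = img
    return cv2.resize(canvas, (out_size, out_size))
```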
To ensure an unbiased evaluation, we extracted a holdout test set before any
training procedure. This set consists of 33,588 samples, carefully selected
to guarantee no patient overlap between the training and test sets.

3.2. Model Architecture and Training


We delve here into the architectural choices made for each component of our
framework, with a specific focus on how every part of the model is trained.

3.2.1. Prompt Encoders


Given that our three modalities consist of texts and images, we adopt the
Contrastive Language-Image Pretraining (CLIP) [34] approach to leverage
a pretrained text-image paired encoder. This approach is composed of an
image and a text encoder, denoted as PX and PR , jointly trained on large-
scale datasets of text-image pairs. By doing so, the encoders learn a shared
representation space that effectively captures the semantics of both modal-
ities. In order to reduce the computational overhead, we decided to let a
single image encoder, i.e., a Vision Transformer (ViT), be responsible for both frontal
and lateral X-ray feature representations, while we leverage a masked self-
attention Transformer for the text encoding.
Given a batch of X-ray images X and their corresponding reports R, we ob-
tain the embeddings hX = PX (X), hR = PR (R) for both modalities. These
representations are then aligned through contrastive learning, using the In-
foNCE contrastive loss [35]:

L_{X,R} = -\log \frac{\exp(h_X^{i\top} h_R^{i} / \tau)}{\exp(h_X^{i\top} h_R^{i} / \tau) + \sum_{j \neq i} \exp(h_X^{i\top} h_R^{j} / \tau)}   (4)

where τ is the scalar temperature regulating the softness of the softmax
distribution, and i and j index, respectively, positive and negative pairs.
We adopt the symmetric loss L_{X,R} + L_{R,X} to bring the paired embeddings closer
together. An example of the generic training procedure for CLIP is illustrated
in Fig. 3.

Figure 3: CLIP training procedure.
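A minimal PyTorch sketch of the symmetric InfoNCE objective of Eq. 4 over a batch of paired X-ray and report embeddings; the L2-normalization of the embeddings and the default temperature value are assumptions in the spirit of CLIP rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(h_x, h_r, temperature=0.07):
    """Symmetric InfoNCE loss: matched X-ray/report pairs lie on the diagonal
    of the similarity matrix (positives); all other pairs act as negatives."""
    h_x = F.normalize(h_x, dim=-1)                   # (B, D) image embeddings
    h_r = F.normalize(h_r, dim=-1)                   # (B, D) report embeddings
    logits = h_x @ h_r.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(h_x.size(0), device=h_x.device)
    loss_xr = F.cross_entropy(logits, targets)       # L_{X,R}
    loss_rx = F.cross_entropy(logits.t(), targets)   # L_{R,X}
    return loss_xr + loss_rx                         # symmetric sum
```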

3.2.2. Latent Diffusion Model


X-ray Diffusion Model: The LDM for image generation adopts the same
architecture as Stable Diffusion 1.5 [7], where AutoKL [26] is used as the
variational autoencoder (VAE) to map the input modalities into a lower di-
mensional space. As stated in [17], the most effective approach for the adap-
tation of an LDM to the medical imaging domain is to fine-tune the UNet
component. Therefore, we kept the VAE frozen and only trained the UNet.
In total we trained two LDMs, one for the frontal X-rays and one for the
lateral X-rays, using a batch size of 512, a learning rate of 5 × 10−5 , and a
weight decay of 1 × 10−4 . Both models were trained for 100 epochs using the
AdamW optimizer.
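The training recipe above can be sketched as follows, with placeholder modules standing in for the frozen AutoKL VAE and the trainable UNet; only the stated hyperparameters (AdamW, learning rate 5 × 10−5, weight decay 1 × 10−4, 100 epochs, batch size 512) come from the text.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the AutoKL VAE and the denoising UNet.
vae = nn.Sequential(nn.Conv2d(1, 4, kernel_size=3, padding=1))
unet = nn.Sequential(nn.Conv2d(4, 4, kernel_size=3, padding=1))

for p in vae.parameters():
    p.requires_grad = False        # the VAE stays frozen during fine-tuning

# Only the UNet parameters are updated, as described above.
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-5, weight_decay=1e-4)
# Training then runs for 100 epochs with a batch size of 512.
```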

Report Diffusion Model: For the text generation, the UNet architec-
ture is based on [36], which introduced the fully-connected residual blocks
(FCResBlock). These expand the 768-dimensional text latent vectors into a
320-by-4 hidden feature and follow the residual block paradigm with Group-
Norms [37], SiLU [38], and skip connections. We adopt Optimus [39] as the
text VAE, which consists of a BERT [40] text encoder and a GPT-2 [41] text
decoder. Unlike the LDMs used for X-ray images, we decided to fine-tune
both the VAE and the UNet in two separate training rounds. This approach
is necessary as the model has to effectively adapt to a completely different
vocabulary. Following [39], the training process begins with a reconstruction
task, where the VAE is tasked with accurately reconstructing input radiol-
ogy reports from their latent representations. Once the first step is fulfilled,
the UNet is trained for report generation using a batch size of 1024 and a
learning rate of 1 × 10−5. The weight decay, the
optimizer configuration and the number of epochs remained consistent with
the X-ray LDMs.

3.3. Computational Analysis


To quantify the computational cost of our framework, we provide a detailed
breakdown of the number of parameters for each model component. Specif-
ically, the CLIP model contains 737 million parameters, while AutoKL has 101
million parameters, with two instances used in our framework. The Optimus
model consists of 241 million parameters, and the X-ray UNet model has
1.77 billion parameters, with two instances used. Finally, the Report UNet
model has 2.04 billion parameters. In total, the number of parameters for
all components combined amounts to 6.77 billion. All experiments were con-
ducted on a high-performance computing cluster equipped with four NVIDIA
A100 GPUs. The total computational time required across all experiments
was approximately 38.354 hours.

3.4. Competitors
To rigorously compare MedCoDi-M against established approaches for the
generation of X-rays and radiology reports, we include a total of five open
source competitors which, to the best of our knowledge, are the only works
with accessible and reproducible code as well as model weights. It is worth
noting that we selected UniXGen and LLM-CXR [23, 27] because they ad-
dress the same problem, i.e., bidirectional generation of CXRs and reports,
with different architectures from us, namely Transformer and LLM. Mean-
while, although it lacks bidirectional capabilities, we selected RoentGen [16]
as it was the first work to leverage a fine-tuned LDM for X-ray generation.
All competitors were trained on the MIMIC-CXR dataset, allowing us to use
them directly without requiring additional training.
• The original CoDi [11] model, without any adaptation to the medical
domain.
• MedCoDi, the same implementation but without the Multi-Prompt
training strategy, serving as an ablation of MedCoDi-M. This variant
highlights the importance of the Multi-Prompt training approach in
enhancing the model’s conditioning capabilities and generating outputs
that integrate information from multiple input modalities effectively.
• UniXGen [23], a transformer-based architecture for bidirectional CXR
and report generation. UniXGen employs a vector quantization tech-
nique to convert X-rays into discrete visual tokens, enabling the model
to treat both tasks as a sequence generation problem. UniXGen incorporates
tokens to generate view-specific X-rays. Additionally, multi-view CXRs
can be used to condition the report generation.
• LLM-CXR [27], a pretrained LLM fine-tuned for CXR understanding
and generation. It can perform both CXR-to-report and report-to-
CXR generation tasks. It is restricted to frontal chest X-rays and does
not consider multi-view or multimodal relationships, limiting its appli-
cability in comprehensive diagnostic workflows.
• RoentGen [16], a text-conditioned LDM fine-tuned for generating syn-
thetic frontal CXRs based on textual prompts. RoentGen adapts the
Stable Diffusion architecture to the medical domain by fine-tuning the
UNet component, enabling it to capture the unique characteristics of
chest X-rays.

3.5. Evaluation Metrics
We conducted both quantitative and qualitative assessments to evaluate the
performance of our approach. The first focuses on the statistical properties
of the generated data, while the second ensures that the outputs accurately
align with the expected clinical information.

3.5.1. Quantitative Metrics


To objectively evaluate the quality of the generated outputs, we employed two
well-established quantitative metrics, the FID Score and the BLEU Score,
that measure the statistical similarity between synthetic and real data, as
well as the linguistic coherence of generated clinical reports.

The Fréchet Inception Distance (FID) [42] measures the dissimilarity be-
tween real and synthetic samples in the feature space of an Inception-V3
model pre-trained on ImageNet [43], ranging in the interval [0, +∞) with
lower values indicating greater similarity. However, because the Inception-
Net is not trained on a medical dataset, it may yield misleading results [44].
To address this limitation, we also computed the FID with two additional back-
bones, i.e., XRV-DenseNet [45], an in-domain classification model trained to
detect pathologies in CXR scans, and XRV-MIMIC DenseNet [45], specifically
trained for MIMIC-CXR scan classification. However, to remain consistent
with other works, we report the results obtained using the latter backbone in Table A1.
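For reference, a minimal implementation of the FID between two sets of backbone features (e.g., Inception-v3 or XRV-DenseNet activations); the feature extraction step itself is omitted.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Frechet distance between two feature sets (rows are samples):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```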

The Bilingual Evaluation Understudy (BLEU) compares machine-generated


text to a set of references by calculating the n-gram overlap between the
two [46], ranging in the interval [0, 1]. Following the literature, here we
computed the BLEU score for n-grams of order 1, 2, 3 and 4. BLEU-
1 and BLEU-2 place greater emphasis on the consistency of the vocabulary
used, focusing on single words or word pairs, while BLEU-3 and BLEU-4
provide information about the semantic structure of the reports.
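A small example of how BLEU-1 to BLEU-4 can be computed with NLTK; the reports shown are illustrative, in practice the scores are averaged over the whole test set, and a smoothing function is used here only to avoid zero scores on very short examples.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["no acute cardiopulmonary process".split()]   # real report (tokenized)
candidate = "no acute cardiopulmonary abnormality".split() # generated report

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))              # BLEU-1 ... BLEU-4
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```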

3.5.2. Factual Correctness


Ensuring the factual correctness of the generated data is a crucial aspect of
evaluating the performance of MedCoDi-M. By using well-established classi-
fication models and rule-based tools, we assess how well the synthetic outputs
align with real-world diagnostic information.

X-rays Classification: To evaluate whether the models are capable of gen-
erating images that accurately reflect the information of the corresponding
clinical reports, we classified the generated samples using XRV-DenseNet [45],
a well-established classifier in the literature for CXR classification. Since
such a classifier is trained only on a subset of pathologies, we computed
the AUC and F1 scores for the following diseases: Atelectasis (Atl.), Car-
diomegaly (Cmgl.), Consolidation (Cnsl.), Edema (Edm.), Enlarged Car-
diomediastinum (Enl.), Lung Lesion (Les.), Lung Opacity (Opc.), Pleural
Effusion (Eff.), Pneumonia (Pnm.) and Pneumothorax (Ptx.), along with
micro, macro, and weighted averages. The micro average aggregates con-
tributions from all classes to provide an overall measure, the macro average
computes the metric independently for each class and averages them, and the
weighted average adjusts the macro average by accounting for the number of
samples per class. However, because not all scans have a defined label for
every pathology, we computed the performance for each class only when a
ground truth was available; Table 1 reports the number of samples for every
pathology in our test set.

Condition Samples Positives


Atl. 10561 1438
Cmgl. 10305 1146
Cnsl. 9911 254
Edm. 10230 774
Enl. 9215 83
Les. 9476 342
Opc. 11136 1980
Eff. 10558 1258
Pnm. 10454 853
Ptx. 9337 84

Table 1: Number of samples of every pathology for the X-rays classification task.
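A sketch of the per-class evaluation with masking of undefined labels, as described above; NaN entries mark pathologies without a ground truth for a given scan, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def masked_class_metrics(y_true, y_prob, threshold=0.5):
    """Per-class AUC and F1 computed only on samples whose ground-truth
    label is defined for that pathology (NaN marks missing labels)."""
    aucs, f1s = [], []
    for c in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, c])
        t, p = y_true[mask, c], y_prob[mask, c]
        aucs.append(roc_auc_score(t, p))
        f1s.append(f1_score(t, (p >= threshold).astype(int)))
    return aucs, f1s   # macro average = mean of these per-class values
```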

Report Classification: For report classification, we leveraged CheXpert-


Labeler [47], a rule-based natural language processing tool that reads a text
report and extracts whether it mentions the presence or absence of significant
radiologic findings. Since a rule-based classifier is used, it is not possible to
compute the AUC; instead, we reported the F1 score for the same subset of
diseases previously introduced, along with the No Finding (No F.) class. To re-
main consistent with the previous setup, we also reported the micro, macro,
and weighted averages for the F1 score. This task quantifies the ability of
the model to generate reports that align with the medical conditions seen in
the X-ray images, ensuring that the synthetic reports accurately reflect the
diagnostic information provided by the images.

3.5.3. Visual Turing Test


We performed a qualitative assessment of the data generated by MedCoDi-M,
through a Visual Turing Test performed by three expert radiologists. This
evaluation consisted of five independent tasks aimed at comparing synthetic
and real medical data, with both X-rays and clinical reports being assessed.
Each task was performed through a web-based platform, where experts eval-
uated the data using a 1-to-5 numeric scale. A score of 1 indicated the
poorest quality, while a score of 5 represented the highest level of quality
and coherence. The tasks are:

• General X-ray Realism: Experts rated the overall realism of a series


of 20 images, which included a mix of real and synthetic X-rays. A high
score indicates that the synthetic X-rays appear indistinguishable from
real clinical X-rays, with accurate anatomical structures and no visible
artifacts that could mislead a clinician in a diagnostic setting.

• General Report Realism: Experts reviewed 20 clinical reports, both


real and synthetic, rating their plausibility and realism. A high score
in this task implies that the synthetic reports accurately reflect the
clinical context, making use of appropriate medical terminology and
providing clinically relevant findings that would be consistent with a
real report.

• Report Coherence with Pair of X-rays: Experts evaluated the


consistency between 20 pairs of X-ray images and their associated reports.
The images were guaranteed to be real, while the reports were either
real or synthetic, presented in random order. A high score indicates
that the synthetic reports align accurately with the X-ray images, with
no contradictions or inconsistencies between the reported findings and
the visual evidence.

• Coherence Between Report and X-ray: Experts compared 20


pairs of X-ray images, both real and synthetic, in random order, with
the real clinical report to assess the plausibility of the X-ray given

Model T→F L→F L+T→F T→L F→L F+T→L
v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓
RoentGen 102.77 5.60 - - - - - - - - - -
UniXGen 81.75 7.28 - - 86.21 7.63 128.96 9.76 - - 133.38 10.20
LLM-CXR 71.91 6.83 - - - - - - - - - -
CoDi 541.44 107.23 520.02 83.12 539.35 107.17 522.00 83.01 540.66 105.63 525.80 82.52
MedCoDi 10.56 0.86 34.89 3.31 22.63 1.90 13.90 0.84 43.24 4.95 23.12 1.99
MedCoDi-M 10.67 0.93 12.04 0.48 11.51 0.43 14.00 0.96 13.75 0.48 11.97 0.34

Table 2: FID score for X-ray generation, with lower values indicating greater similarity.
XRV and v3 refer to the two backbones used to compute the score, respectively
XRV-DenseNet and Inception-v3. The “-” symbol indicates that the respective models
are not capable of performing the specified generation task.

the report. A high score reflects that the synthetic X-rays match the
findings described in the report, indicating that the generative model
correctly understood and translated the clinical context from the report
into the visual representation.

• Coherence Between X-ray Pairs: Experts assessed the consistency


between 20 pairs of frontal and lateral X-rays. One of the views was
guaranteed to be real, while the other was either real or synthetic,
presented in random order. A high score here indicates that the syn-
thetic view accurately represents the same clinical findings as the real
view, demonstrating proper anatomical alignment and no conflicting
features across different perspectives.

4. Results and Discussion


This section presents an in-depth analysis to assess MedCoDi-M’s perfor-
mance, providing both quantitative and qualitative evaluations and visual
examples.

4.1. X-ray Generation


Table 2 presents the FID scores on the test set for generating frontal (F) and
lateral (L) X-rays. The first column lists the models used for generation,
while the remaining columns show the performance achieved for different
generation settings using different prompts combination: from clinical report
(T) to frontal or lateral CXR (T→F, T→L), from lateral or frontal CXR to
the other view (L→F, F→L) and a combination of a clinical report and a
CXR image to the other view (L+T→F, F+T→L). The results in Table 2
show that MedCoDi-M consistently outperforms all the competitors for X-
ray generation. It is worth noting that the excessively high FID values for
CoDi highlight how crucial the fine-tuning step of an LDM is. This observation
is further supported by Figure 4, which shows one example of generation by
CoDi and MedCoDi-M with the same textual prompt, where the former fails
to generate any resemblance of an X-ray.

Figure 4: Generation comparison between CoDi and MedCoDi-M using the same textual
prompt, i.e., “No acute cardiopulmonary process”.

On the other side, the Multi-Prompt training technique shows its effective-
ness, as MedCoDi-M consistently outperforms MedCoDi in four configurations
(L→F, L+T→F, F→L, F+T→L), showing that the possibility to merge in-
formation in the shared latent space improves the generation results. How-
ever, since this approach leverages two image modalities alongside a single
textual modality, it tends to prioritize learning from visual prompts over
textual ones. As a result, we observe a slight drop in performance when
generation is based solely on clinical reports (T→F, T→L), although this
degradation is minimal compared to the significant boost the model gains in
other generation settings. In Fig. 5 and Fig. 6 several examples of synthetic
images from the different models are displayed. While at first glance the
synthetic X-rays may appear visually similar, the FID values reveal a clearer
resemblance of MedCoDi-M’s outputs to real samples, highlighting the effec-
tiveness of our approach.
Figure 5: Frontal Synthetic Samples generated by different baselines with the same input prompt, i.e., “No acute cardiopulmonary process”.

To further assess the generation capabilities of MedCoDi-M, we analyzed the factual correctness by performing a classification task on the patient’s pathology (Section 3.5.2). Specifically, for each sample in the test set, we generated a corresponding synthetic X-ray and evaluated it with the XRV classifier [45]. The results are presented in Table 3. Again, MedCoDi-M
achieves the highest performance in both AUC and F1 scores, demonstrating
its ability to effectively capture and represent the relevant clinical features
specified in the inputs. It is important to highlight that F1 scores for some
classes are notably low, likely due to the scarcity of positive samples for these
specific pathologies in the dataset, as shown in Table 1; however, MedCoDi-
M outperforms its competitors also for the less represented classes. Moreover, it
is worth noting that our synthetic data achieves better results than real
data. We attribute this result to the model’s strong understanding of the spe-
cific characteristics of the diseases, which makes the generated samples more
easily classifiable by the pre-trained DenseNet. This hypothesis is further
supported by the results of the Visual Turing Test, which will be presented
in Section 4.4, where the synthetic data demonstrated a high level of clinical
realism. Both these results indicate that MedCoDi-M not only produces vi-
sually realistic X-rays but also accurately reflects the clinical features of the
real images.

Figure 6: Lateral Synthetic Samples generated by different baselines with the same input
prompt, i.e., “No acute cardiopulmonary process”.

4.2. Report Generation


Table 4 presents the BLEU score on the test set for report generation across
three different generation settings: frontal CXR to report (F→T), lateral
CXR to report (L→T) and both frontal and lateral CXR to report (F+L→T).
These scores highlight the ability of each method to generate reports in com-
parison to reference ones, with higher BLEU scores indicating better per-
formances. The results show that MedCoDi-M outperforms the competitors
across all BLEU score metrics. Results suggest that not only does MedCoDi-M
utilize the same terminology as real clinical reports, as indicated by the best
BLEU-1 and BLEU-2, but it also generates sentences whose structure is con-
sistent with that of actual reports, as demonstrated by the better BLEU-3
and BLEU-4. This highlights a high degree of linguistic coherence and fi-
delity in reproducing both the content and phrasing of real-world medical
texts. Moreover, we computed BLEU scores for reports generated across

AUC
Task Model Atl. Cmgl. Cnsl. Edm. Enl. Les. Opc. Eff. Pnm. Ptx. Micro Macro Weighted
- Real Data .84 .91 .91 .93 .81 .78 .85 .95 .78 .86 .87 .86 .87
RoentGen .87 .93 .82 .80 .50 .59 .68 .94 .48 .58 .78 .72 .78
UniXGen .75 .78 .69 .81 .67 .67 .69 .76 .66 .66 .74 .71 .71
T→F LLM-CXR .89 .92 .87 .96 .80 .78 .85 .95 .80 .85 .89 .87 .87
MedCoDi .91 .95 .93 .96 .84 .82 .91 .97 .84 .88 .90 .90 .91
MedCoDi-M .86 .95 .91 .94 .82 .75 .86 .97 .86 .85 .88 .89 .90
MedCoDi .84 .88 .90 .91 .81 .76 .85 .95 .76 .82 .83 .85 .90
L→F
MedCoDi-M .86 .92 .93 .94 .83 .80 .89 .97 .80 .90 .87 .88 .90
UniXGen .75 .79 .65 .81 .66 .63 .68 .77 .63 .70 .73 .71 .71
T+L→F MedCoDi .89 .91 .92 .91 .82 .80 .86 .97 .77 .85 .88 .89 .90
MedCoDi-M .92 .96 .95 .97 .87 .85 .92 .98 .85 .93 .91 .92 .92

F1
Task Model Atl. Cmgl. Cnsl. Edm. Enl. Les. Opc. Eff. Pnm. Ptx. Micro Macro Weighted
- Real Data .49 .49 .35 .60 .09 .12 .60 .73 .35 .21 .49 .40 .53
RoentGen .31 .40 .00 .25 .00 .02 .40 .74 .04 .04 .36 .25 .43
UniXGen .25 .31 .05 .29 .06 .05 .37 .41 .00 .04 .24 .17 .34
T→F LLM-CXR .51 .53 .19 .60 .08 .07 .60 .74 .38 .16 .47 .39 .51
MedCoDi .60 .60 .33 .66 .10 .05 .64 .78 .38 .17 .53 .43 .59
MedCoDi-M .51 .60 .25 .59 .08 .11 .60 .78 .33 .12 .50 .41 .54
MedCoDi .51 .45 .32 .42 .06 .07 .59 .73 .27 .10 .43 .35 .49
L→F
MedCoDi-M .51 .46 .39 .57 .09 .13 .64 .79 .36 .17 .50 .41 .54
UniXGen .20 .31 .06 .27 .06 .07 .34 .43 .04 .07 .24 .19 .33
T+L→F MedCoDi .54 .55 .32 .64 .09 .06 .63 .75 .38 .17 .53 .44 .58
MedCoDi-M .58 .59 .39 .68 .13 .18 .68 .81 .40 .26 .57 .47 .60

Table 3: CXR classification with XRV - AUC and F1.

multiple scans of the same study. This evaluation assesses the model’s con-
sistency in generating coherent and reliable reports when presented with
scans from the same clinical case. The goal was to ensure that the model not
only excels in generating high-quality individual reports but also maintains
consistency across multiple related images. We report numerical results for
this evaluation in Table A2.

As we did for the synthetic X-rays, we performed a classification task on the


generated reports, to assess whether they capture the relevant clinical infor-
mation used to prompt MedCoDi-M. In terms of classification performance,
as shown in Table 5, MedCoDi-M also demonstrates superior capabilities.
For instance, it achieves the highest F1-score in most categories, including
Micro, Macro and Weighted average for all the three input combinations.
The results show that our approach not only excels in generating realistic
reports but also in accurately representing the patient’s condition provided
in the prompt. It is worth noting that the best performances are achieved

F→T L→T F+L→T
Methods
BLEU-1 ↑ BLEU-2 ↑ BLEU-3 ↑ BLEU-4 ↑ BLEU-1 ↑ BLEU-2 ↑ BLEU-3 ↑ BLEU-4 ↑ BLEU-1 ↑ BLEU-2 ↑ BLEU-3 ↑ BLEU-4 ↑
UniXGen .25 .16 .12 .09 .26 .16 .11 .07 .26 .17 .12 .09
LLM-CXR .25 .15 .10 .07 - - - - - - - -
MedCoDi .38 .27 .22 .18 .38 .28 .23 .18 .42 .32 .27 .22
MedCoDi-M .41 .30 .25 .20 .42 .32 .26 .21 .44 .34 .29 .24

Table 4: BLEU Score for Report Generation, with higher values indicating greater
similarity. The “-” symbol indicates that the respective models are not capable of
performing the specified generation task.

when both views are given as conditioning (F+L→T), demonstrating that


combining information from multiple perspectives enables the model to ex-
tract more comprehensive insights into the clinical picture. This integration
allows for a more accurate and detailed diagnosis, highlighting the model’s
capability to leverage diverse data to enhance diagnostic performance.

Task Model Atl. Cmgl. Cnsl. Edm. Enl. Les. Opc. Eff. Pnm. Ptx. No F. Micro Macro Weighted
UniXGen .18 .17 .03 .28 .01 .01 .01 .13 .10 .03 .74 .49 .15 .42
LLM-CXR .25 .23 .07 .40 .02 .03 .21 .36 .24 .02 .74 .46 .20 .44
F→T
MedCoDi .56 .56 .15 .67 .10 .26 .47 .71 .52 .37 .88 .67 .44 .67
MedCoDi-M .62 .61 .14 .70 .15 .27 .52 .79 .59 .34 .90 .71 .46 .73
UniXGen .23 .23 .05 .36 .04 .02 .20 .36 .16 .05 .73 .45 .19 .43
L→T MedCoDi .54 .58 .14 .65 .07 .23 .45 .75 .51 .30 .89 .68 .41 .68
MedCoDi-M .62 .61 .16 .70 .09 .26 .51 .78 .57 .32 .91 .71 .45 .73
UniXGen .22 .16 .06 .30 .01 .00 .17 .33 .17 .02 .75 .49 .16 .45
F+L→T MedCoDi .61 .60 .17 .70 .09 .31 .51 .78 .58 .27 .90 .72 .45 .72
MedCoDi-M .62 .61 .17 .70 .12 .27 .53 .79 .58 .37 .91 .73 .46 .73

Table 5: Report classification with CheXPert-Labeler.

4.3. Multi-Output Generation


Thanks to the training procedure described in Section 2.3, MedCoDi-M learns
to jointly generate two or more modalities while ensuring their coherence.
Since there is no established quantitative metric to assess whether the modali-
ties simultaneously generated are coherent, we computed the cosine similarity
between the two generated modalities Mi and Mj [11]:

cos(P_i(M_i), P_j(M_j))   (5)

where P_i, P_j are the prompt encoders which project the modalities into the shared
latent space. Intuitively, the closer the metric is to 1, the higher the align-
ment between the two modalities. Because none of the competitors is capable
of jointly generating more than one modality, we compared the multi-output
generation of MedCoDi-M with a multi-step approach, where every modal-
ity is synthesized independently by MedCoDi-M. As shown in Table 6, the
joint generation shows higher similarity between the generated samples.

Input Independent Gen. Multi-Output Gen.
T→F+L .61 .65
L→F+T .12 .22
F→L+T .19 .21

Table 6: Similarity Scores for Multi-Output Generation.

To further assess whether this improvement in similarity is also reflected in the


clinical correctness of the two jointly generated modalities, we conducted an
evaluation between the frontal X-rays and the report generated through in-
dependent generation, i.e., L→F + L→T, and Multi-Output generation, i.e.,
L→F+T. Specifically, we aimed to assess whether the two modalities gener-
ated were aligned in terms of clinical content. To do so, we classified both the
generated X-ray images and reports, obtaining two classification vectors for
each instance. To quantify the alignment, we computed the Hamming Dis-
tance [48], measuring the discrepancy between the classification labels derived
from the two outputs. In this context, smaller Hamming Distance values in-
dicate greater similarity between the labels and, therefore, better alignment
of the clinical information conveyed by the generated modalities. The re-
sult of this evaluation is illustrated in Figure 7, where the x-axis represents
the Hamming Distance, i.e., the number of differing labels, while the y-axis
shows the frequency of occurrences for each distance value. As shown in the
figure, the red curve (joint generation) peaks at lower Hamming Distance
values compared to the blue curve (independent generation). Specifically,
a large proportion of the samples generated via the joint approach exhibit
a Hamming Distance equal to zero, meaning the generated modalities are
highly aligned. Additionally, the fact that the red curve is lower than the
blue one at distances 1 and 2 further highlights how the jointly generated
modalities are clinically closer. Subsequently, both curves reach a plateau,
demonstrating that neither approach generates modalities that are entirely
incoherent. This suggests that generating the X-ray and the report simulta-
neously ensures better consistency in terms of clinical content, highlighting
the advantages of the Multi-Output approach.
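The comparison above can be reproduced with a few lines: each generated X-ray and report is classified, and the Hamming Distance counts the labels on which the two classifications disagree (function and variable names are illustrative).

```python
import numpy as np

def hamming_distance(labels_xray, labels_report):
    """Number of pathology labels on which the two classifications disagree."""
    return int(np.sum(np.asarray(labels_xray) != np.asarray(labels_report)))

def distance_histogram(label_pairs):
    """Histogram of Hamming Distances over all generated (X-ray, report) pairs."""
    distances = [hamming_distance(a, b) for a, b in label_pairs]
    return np.bincount(distances)

# Toy example with binary label vectors over ten pathologies.
pairs = [(np.zeros(10, dtype=int), np.zeros(10, dtype=int))]
print(distance_histogram(pairs))   # -> [1], i.e., one pair at distance 0
```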

Figure 7: Hamming Distance histograms for joint versus independent generation (x-axis: Hamming Distance; y-axis: number of samples).

4.4. Visual Turing Test


Table 7 presents the results of the Visual Turing Test conducted to as-
sess the quality and coherence of the synthetic medical data generated by
MedCoDi-M. The evaluation involved three expert radiologists, each with
10+ years of experience.

General X-ray Realism.


The average rating for real X-rays was 4.1 ± 0.9, demonstrating the radiolo-
gists’ high level of confidence in recognizing and evaluating authentic clinical
images. This result highlights the robustness of the evaluation process and
the expertise of the participants. Synthetic images generated by MedCoDi-M
scored 3.7 ± 0.8, which, while slightly lower, still represents a strong perfor-
mance, indicating the overall quality of MedCoDi-M’s outputs.

General Report Realism.


Real clinical reports received an average rating of 3.5 ± 0.9, highlighting
that even authentic reports are not always perceived as perfect. This may
be due to occasional discrepancies in interpretation between radiologists or
variability in the detail and clarity of the reports, which are often influenced
by stylistic differences among clinicians. In contrast, synthetic reports gen-
erated by MedCoDi-M achieved a higher average rating of 4.0 ± 0.9. This

Task Data Type Score
General X-ray Realism Real X-rays 4.1 ± 0.9
Synthetic X-rays 3.7 ± 0.8
General Report Realism Real Reports 3.5 ± 0.9
Synthetic Reports 4.0 ± 0.9
Report Coherence with X-ray Pair Real Reports 3.4 ± 1.3
Synthetic Reports 3.3 ± 1.1
Coherence Between Report and X-ray Real X-rays 3.7 ± 0.9
Synthetic X-rays 3.9 ± 0.8
Coherence Between X-ray Pairs Real X-rays 3.8 ± 0.7
Synthetic X-rays 3.6 ± 0.8

Table 7: Evaluation results for different tasks in the Visual Turing Test, comparing real
and synthetic data generated by MedCoDi-M. The scores range from 1 to 5, with higher
values indicating better performance.

result underscores the model’s ability to consistently use accurate medical


terminology and construct coherent and contextually appropriate narratives.

Report Coherence with Pair of X-rays.


The average score for real reports was 3.4 ± 1.3, while synthetic reports
achieved a comparable score of 3.3 ± 1.1. These relatively modest ratings
reflect the inherent difficulty of this task, which requires precise alignment
between diagnostic findings described in the report and the visual evidence
provided by the X-ray pair. Despite the challenge, MedCoDi-M demonstrates
the ability to generate diagnostic reports that are closely aligned with those
written by radiologists, highlighting the model’s capability to produce coher-
ent and contextually relevant reports, even in a highly demanding evaluation
scenario.

Coherence Between Report and X-ray. The average score for real X-rays
paired with their corresponding reports was 3.7 ± 0.9, while synthetic pairs
generated by MedCoDi-M achieved a slightly higher score of 3.9 ± 0.8. These
results highlight the strong capabilities of MedCoDi-M in generating X-rays
that align closely with the diagnostic context described in the corresponding
reports. Additionally, while the model’s robust understanding of pathologies
contributes to producing samples where disease features are prominently and
clearly represented, this characteristic might also pose a limitation. By gen-
erating images with overly evident pathological signs, the synthetic data may
fail to capture more subtle or ambiguous cases that often challenge radiolo-
gists in real-world clinical practice.

Coherence Between X-ray Pairs. The average score for real X-ray pairs was
3.8 ± 0.7, while synthetic pairs scored 3.6 ± 0.8. This suggests that MedCoDi-
M is able to guarantee anatomical consistency between frontal and lateral
views of the same subject, approaching the realism of real X-ray data.

4.5. About the utility of Synthetic Data in the Medical Field


To gain more insight into the generation performance of MedCoDi-M, we
investigate the use of synthetic data to address three key challenges in the
medical field, i.e., Anonymization, Imbalance Learning and Data Scarcity.
These issues are critical when working in real-world scenarios, where privacy
concerns, uneven class distributions and limited access to large datasets often
hinder the development of robust models. For this evaluation, we conducted
experiments using both Frontal and Lateral synthetic X-rays; for clarity and
conciseness, we present only the results for Frontal X-rays here, while the
results for Lateral X-rays are reported in the appendix (Tables A3, A4, and A5).

Anonymization. Medical data is subject to stringent privacy regulations.
However, traditional anonymization methods can lead to information loss.
By generating synthetic data that mimics real patients, it is possible to
mitigate privacy risks while preserving data quality. To assess whether
MedCoDi-M can tackle this challenge, we trained two DenseNet-121 classifiers
from scratch on two data splits: one composed of synthetic images only, the
other of real data only. Both models were then evaluated on the real test set
to assess whether training solely on synthetic data leads to a degradation in
performance. As shown in Table 8, the two classifiers achieve comparable
results: the classifier trained on synthetic data attains a slightly lower
AUROC, which is compensated by an improvement in F1-Score, suggesting that
synthetic data can effectively replace real data for training purposes without
significant degradation in model performance (a sketch of this protocol is
given after Table 8).

Training Data    AUROC                         F1-Score
                 Micro   Macro   Weighted      Micro   Macro   Weighted
Real             .75     .74     .74           .30     .20     .21
Synthetic        .74     .73     .73           .34     .23     .23

Table 8: Classification metrics for the Anonymization task.
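
For concreteness, the protocol described above can be sketched as follows; the
dataset loaders, label set, and training hyperparameters are illustrative
assumptions rather than the exact configuration used in our experiments.

# Sketch of the Anonymization experiment: train identical DenseNet-121
# classifiers on a real-only and a synthetic-only split, then evaluate both on
# the real test set. RealCXRDataset / SyntheticCXRDataset are hypothetical
# loaders returning (image, multi-hot label vector) pairs.
import torch
import torch.nn as nn
from torchvision.models import densenet121
from sklearn.metrics import roc_auc_score, f1_score

NUM_CLASSES = 14  # e.g., a CheXpert-style label set

def make_classifier():
    model = densenet121(weights=None)  # trained from scratch, no pretraining
    model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)
    return model

def train(model, loader, epochs=10, lr=1e-4, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()  # multi-label targets
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.float().to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    model.to(device).eval()
    probs, targets = [], []
    for x, y in loader:
        probs.append(torch.sigmoid(model(x.to(device))).cpu())
        targets.append(y)
    p, t = torch.cat(probs).numpy(), torch.cat(targets).numpy()
    return {"auroc_macro": roc_auc_score(t, p, average="macro"),
            "f1_macro": f1_score(t, (p > 0.5).astype(int), average="macro")}

# real_train, synthetic_train and real_test are hypothetical DataLoaders:
# metrics_real = evaluate(train(make_classifier(), real_train), real_test)
# metrics_syn  = evaluate(train(make_classifier(), synthetic_train), real_test)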

Imbalance Learning. Imbalance learning is a well-known challenge in machine
learning: many classifiers tend to be biased towards majority classes,
resulting in poor sensitivity to rare but critical conditions. One possible
solution is to artificially augment minority classes using synthetic data, thus
helping models to generalize better on imbalanced datasets. To assess
MedCoDi-M's capability to tackle this challenge, we focused on a 5-class
classification task, selecting the five most represented classes from our
dataset, which are listed in Table 9 along with the respective percentage of
positive samples. This choice was made to avoid excessive imbalance that would
have required generating an impractical number of synthetic samples. For this
experiment, we generated synthetic samples to build a training split that
ensured a balanced distribution of positive samples across all five classes
(a sketch of this balancing procedure is given after Table 9).

Condition        Positive (%)
Atelectasis      15.6%
Cardiomegaly     13.5%
Consolidation     2.99%
Edema            10.4%
Effusion         14.1%

Table 9: Percentage of positive samples in the training set for the conditions used in the
Imbalance Learning task.
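
As referenced above, the construction of the balanced split can be sketched as
follows; the label table layout and the generate_frontal_xray() call are
hypothetical stand-ins for our data pipeline and for MedCoDi-M's conditional
generation, and topping every class up to the most frequent one is an
illustrative balancing target.

# Sketch of the minority-class augmentation for the Imbalance Learning task.
# `labels` is a hypothetical DataFrame with one binary column per condition;
# generate_frontal_xray() stands in for a MedCoDi-M generation call.
import pandas as pd

CONDITIONS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion"]

def plan_augmentation(labels: pd.DataFrame) -> dict:
    """Synthetic positives needed per class to match the most frequent class."""
    counts = {c: int(labels[c].sum()) for c in CONDITIONS}
    target = max(counts.values())
    return {c: target - n for c, n in counts.items()}

def build_balanced_split(labels: pd.DataFrame, generate_frontal_xray) -> pd.DataFrame:
    rows = [labels]
    for condition, n_extra in plan_augmentation(labels).items():
        for _ in range(n_extra):
            # Single-label synthetic positives are a simplification; real
            # studies often present co-occurring findings.
            image_path = generate_frontal_xray(condition)
            record = {c: int(c == condition) for c in CONDITIONS}
            record["path"] = image_path
            rows.append(pd.DataFrame([record]))
    return pd.concat(rows, ignore_index=True)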

To evaluate the effectiveness of this strategy, we trained two separate
DenseNet-121 models from scratch: one on the original dataset, which maintained
the natural imbalance, and the other on the augmented split, where synthetic
samples were added to create a balanced partition. Both models were trained
using the same hyperparameters to ensure a fair comparison. In Table 10, we
compare the two classifiers' F1-Scores on the real test set: the results show
an improvement for all classes after synthetic data augmentation, indicating
that synthetic data can enhance the classifier's ability to generalize across
all conditions, ensuring better overall performance and sensitivity to rare,
clinically important cases.

Training Data    F1-Score
                 Atl.   Cmgl.   Cnsl.   Edm.   Eff.   Micro   Macro   Weighted
Real             .34    .42     .17     .67    .64    .50     .45     .45
Synthetic        .48    .48     .33     .69    .72    .58     .54     .54

Table 10: Classification metrics for the Imbalance Learning task.

Data Scarcity. The collection of large and well-annotated medical datasets
poses a significant challenge, especially when dealing with rare diseases or
newly emerging conditions. The scarcity of data hinders the development and
training of robust machine learning models, as these models often require
substantial amounts of high-quality data to generalize effectively. One
promising approach to address this issue is the augmentation of real datasets
with synthetic data, while preserving the real data distribution. By generating
synthetic samples that capture the underlying statistical properties of the
original data, we can allow models to train on a more comprehensive range of
samples, mitigating the problem of data scarcity while preserving real-world
variability and clinical relevance. To assess whether MedCoDi-M could tackle
this challenge, we conducted a series of experiments training a DenseNet-121
from scratch on several data splits, gradually increasing the proportion of
synthetic data generated by our model. Specifically, we expanded the training
set by incorporating synthetic data at the following proportions: 25%, 50%,
75%, 90%, and 95% of the total training data. Subsequently, all models were
evaluated on the real test set, allowing us to assess the impact of
progressively increasing amounts of synthetic data on the model's performance.
As shown in Table 11, our findings confirm that synthetic data can expand the
size of existing medical datasets, positively affecting model performance and
learning outcomes (a sketch of how such mixed splits can be assembled is given
after Table 11).

Syn %    AUROC                         F1-Score
         Micro   Macro   Weighted      Micro   Macro   Weighted
0%       .75     .74     .76           .27     .18     .18
25%      .68     .75     .68           .34     .24     .25
50%      .79     .76     .79           .32     .22     .22
75%      .76     .77     .77           .36     .24     .25
90%      .70     .75     .70           .38     .26     .27
95%      .76     .75     .77           .35     .24     .25

Table 11: Classification metrics for the Data Scarcity task.
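
A minimal sketch of this split construction is given below; keeping the real
set fixed and adding synthetic samples until they account for the desired
fraction of the total is an assumption made for illustration, as are the item
lists and the sampling without replacement.

# Sketch of building the expanded training splits for the Data Scarcity task.
# `real_items` and `synthetic_items` are hypothetical lists of (image, label)
# pairs; enough synthetic samples are assumed to be available for each split.
import random

def expand_with_synthetic(real_items, synthetic_items, synthetic_fraction, seed=0):
    """Keep all real samples and add synthetic ones until they make up
    `synthetic_fraction` of the expanded training set."""
    rng = random.Random(seed)
    n_real = len(real_items)
    # n_syn / (n_real + n_syn) = synthetic_fraction  ->  solve for n_syn
    n_syn = int(round(synthetic_fraction * n_real / (1.0 - synthetic_fraction)))
    split = list(real_items) + rng.sample(synthetic_items, n_syn)
    rng.shuffle(split)
    return split

# e.g., the 25% split of Table 11 (0% corresponds to the real data only):
# split_25 = expand_with_synthetic(real_items, synthetic_items, 0.25)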

5. Conclusions
This work presents MedCoDi-M, a novel foundation model specifically designed
for multimodal medical data generation, leveraging diffusion models and
contrastive learning. The results demonstrate that MedCoDi-M excels in
generating both realistic chest X-rays and high-quality radiology reports,
consistently outperforming state-of-the-art models across both quantitative
and factual correctness metrics. The main novelty introduced by MedCoDi-M is
the Multi-Prompt Training strategy, which plays a pivotal role in enhancing
cross-modal generation performance, helping the model to capture the complex
relationships between medical data modalities. This approach allows MedCoDi-M
to effectively integrate and align information from multiple sources in a
shared latent space, thereby improving both the visual fidelity of the
generated images and the clinical accuracy of the corresponding textual
reports. The ability of MedCoDi-M to seamlessly fuse diverse inputs ensures a
more cohesive representation, facilitating more accurate and context-aware
generation of synthetic medical data, as shown by the results in comparison
with MedCoDi.

The Visual Turing Test results demonstrate that MedCoDi-M performs well in
generating synthetic medical data that is both realistic and contextually
coherent. Across all tasks, synthetic data scored consistently between 3.0 and
4.0, indicating a high level of quality and alignment with real clinical data.
The model's high performance in tasks requiring consistency between different
image views and between reports and images further confirms its potential for
generating clinically relevant synthetic data. Moreover, the results show that
the synthetic data generated by MedCoDi-M hold significant promise for
addressing challenges in medical research and healthcare, such as
anonymization, data imbalance, and data scarcity, suggesting that synthetic
chest X-rays and radiology reports can offer new avenues for enhancing AI
integration in the medical field.

Future research could explore several promising directions to further extend
the capabilities of MedCoDi-M. One area of exploration involves scaling the
modular framework of MedCoDi-M to support additional modalities beyond 2D
chest X-rays, such as 3D medical images (e.g., CT, MRI) and time series (e.g.,
ECG, EEG). Additionally, there is significant potential to investigate more
advanced techniques for merging input embeddings, which could enhance the
integration of heterogeneous medical data sources, for a more holistic
understanding of patient conditions. The inclusion of longitudinal information
and temporal dynamics could also play a vital role in temporal data
generation, allowing MedCoDi-M to model disease progression and treatment
responses over time. This would be particularly beneficial for chronic
conditions where patient data evolve across multiple time points. Finally,
future work could explore the integration of active learning and reinforcement
learning paradigms into the training process, enabling MedCoDi-M to
iteratively refine its generative capabilities by receiving feedback from
expert clinicians or interacting with clinical environments. This would ensure
that the model continues to improve over time, staying up-to-date with the
latest medical knowledge and standards of care. Ultimately, MedCoDi-M
represents a significant step forward in the generation of synthetic medical
data, and with further research and development, it has the potential to
become a key tool in advancing medical AI across a wide range of applications.

Author Contributions
Daniele Molino: Conceptualization, Data curation, Formal analysis, Inves-
tigation, Methodology, Software, Validation, Visualization, Writing – original
draft, Writing – review & editing; Francesco Di Feola: Conceptualization,
Formal analysis, Investigation, Methodology, Supervision, Validation, Writ-
ing – original draft, Writing – review & editing; Eliodoro Faiella: Vali-
dation; Deborah Fazzini: Validation; Domiziana Santucci: Validation;
Linlin Shen: Validation, Writing – review & editing; Valerio Guarrasi:
Conceptualization, Formal analysis, Investigation, Methodology, Project ad-
ministration, Resources, Supervision, Validation, Writing – review & editing;
Paolo Soda: Conceptualization, Formal analysis, Funding acquisition,
Investigation, Methodology, Project administration, Resources, Supervision,
Writing – review & editing.

Acknowledgment
Daniele Molino is a Ph.D. student enrolled in the National Ph.D. in Artifi-
cial Intelligence, XL cycle, course on Health and life sciences, organized by
Università Campus Bio-Medico di Roma.
This work was partially funded by: i) Università Campus Bio-Medico di
Roma under the program “University Strategic Projects” within the project
“AI-powered Digital Twin for next-generation lung cancEr cAre (IDEA)”;
ii) PNRR MUR project PE0000013-FAIR; iii) Cancerforskningsfonden Norrland
project MP23-1122; iv) Kempe Foundation project JCSMK24-0094; v) the Italian
Ministry of Foreign Affairs and International Cooperation, grant number
PGR01156.
Resources are provided by the National Academic Infrastructure for Super-
computing in Sweden (NAISS) and the Swedish National Infrastructure for
Computing (SNIC) at Alvis @ C3SE, partially funded by the Swedish Re-
search Council through grant agreements no. 2022-06725 and no. 2018-
05973.

References
[1] Shuroug A Alowais, Sahar S Alghamdi, Nada Alsuhebany, Tariq Alqah-
tani, Abdulrahman I Alshaya, Sumaya N Almohareb, Atheer Aldairem,
Mohammed Alrashed, Khalid Bin Saleh, Hisham A Badreldin, et al.
Revolutionizing healthcare: the role of artificial intelligence in clinical
practice. BMC medical education, 23(1):689, 2023.
[2] Laith Alzubaidi, Jinshuai Bai, Aiman Al-Sabaawi, Jose Santamaría,
Ahmed Shihab Albahri, Bashar Sami Nayyef Al-dabbagh, Mohammed A
Fadhel, Mohamed Manoufali, Jinglan Zhang, Ali H Al-Timemy, et al.
A survey on deep learning tools dealing with data scarcity: defini-
tions, challenges, solutions, tips, and applications. Journal of Big Data,
10(1):46, 2023.
[3] General Data Protection Regulation GDPR. General data protection
regulation. Regulation (EU) 2016/679 of the European Parliament and
of the Council of 27 April 2016 on the protection of natural persons with
regard to the processing of personal data and on the free movement of
such data, and repealing Directive 95/46/EC, 2016.
[4] Accountability Act. Health insurance portability and accountability act
of 1996. Public law, 104:191, 1996.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial networks. Communications of the ACM, 63(11):139–
144, 2020.

[6] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion proba-
bilistic models. In Proceedings of the 34th International Conference on
Neural Information Processing Systems, NIPS ’20. Curran Associates
Inc., 2020.

[7] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser,


and Björn Ommer. High-Resolution Image Synthesis With Latent Diffu-
sion Models. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.

[8] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark
Chen. Hierarchical text-conditional image generation with clip latents.
arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

[9] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang,
Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu
Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image dif-
fusion models with deep language understanding. Advances in neural
information processing systems, 35:36479–36494, 2022.

[10] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran
Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine
Bosselut, Emma Brunskill, et al. On the opportunities and risks of
foundation models. arXiv preprint arXiv:2108.07258, 2021.

[11] Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit
Bansal. Any-to-Any Generation via Composable Diffusion. In Thirty-
seventh Conference on Neural Information Processing Systems, 2023.

[12] Daniel I Morís, Joaquim de Moura, Jorge Novo, and Marcos Ortega. Un-
supervised contrastive unpaired image generation approach for improv-
ing tuberculosis screening using chest X-ray images. Pattern Recognition
Letters, 164:60–66, 2022.

[13] Devansh Srivastav, Akansha Bajpai, and Prakash Srivastava. Improved


classification for pneumonia detection using transfer learning with GAN
based synthetic image augmentation. In 2021 11th international con-
ference on cloud computing, data science & engineering (confluence),
pages 433–437. IEEE, 2021.

[14] Yash Karbhari, Arpan Basu, Zong Woo Geem, Gi-Tae Han, and Ram
Sarkar. Generation of synthetic chest X-ray images and detection of
COVID-19: A deep learning based approach. Diagnostics, 11(5):895,
2021.

[15] MY Shams, OM Elzeki, Mohamed Abd Elfattah, T Medhat, and


Aboul Ella Hassanien. Why are generative adversarial networks vital
for deep neural networks? A case study on COVID-19 chest X-ray im-
ages. In Big data analytics and artificial intelligence against COVID-19:
innovation vision and approach, pages 147–162. Springer, 2020.

[16] Christian Bluethgen, Pierre Chambon, Jean-Benoit Delbrouck, Rogier


van der Sluijs, Malgorzata Polacin, Juan Manuel Zambrano Chaves,
Tanishq Mathew Abraham, Shivanshu Purohit, Curtis P Langlotz, and
Akshay S Chaudhari. A vision–language foundation model for the gen-
eration of realistic chest x-ray images. Nature Biomedical Engineering,
pages 1–13, 2024.

[17] Pierre Chambon, Christian Bluethgen, Curtis P Langlotz, and Akshay


Chaudhari. Adapting pretrained vision-language foundational models
to medical imaging domains. arXiv preprint arXiv:2210.04133, 2022.

[18] Kai Packhäuser, Lukas Folle, Florian Thamm, and Andreas Maier. Gen-
eration of anonymous chest radiographs using latent diffusion models
for training thoracic abnormality classification systems. In 2023 IEEE
20th International Symposium on Biomedical Imaging (ISBI), pages 1–5.
IEEE, 2023.

[19] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual
instruction tuning. Advances in neural information processing systems,
36, 2024.

[21] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain
Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican,
Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot
learning. Advances in neural information processing systems, 35:23716–
23736, 2022.

[22] Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating im-
ages with multimodal language models. Advances in Neural Information
Processing Systems, 36, 2024.

[23] Hyungyung Lee, Wonjae Kim, Jin-Hwa Kim, Tackeun Kim, Jihang Kim,
Leonard Sunwoo, and Edward Choi. Unified chest x-ray and radiology
report generation model with multi-view chest x-rays. arXiv preprint
arXiv:2302.12172, 3(7):8, 2023.

[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you
need. Advances in Neural Information Processing Systems, 30, 2017.

[25] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou


Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz
Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.
arXiv preprint arXiv:2009.14794, 2020.

[26] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers
for high-resolution image synthesis. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 12873–
12883, 2021.

[27] Suhyeon Lee, Won Jun Kim, Jinho Chang, and Jong Chul Ye. LLM-
CXR: Instruction-Finetuned LLM for CXR Image Understanding and
Generation. arXiv preprint arXiv:2305.11490, 2023.

[28] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed
Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno,
Ira Ktena, Anil Palepu, Basil Mustafa, Aakanksha Chowdhery, Yun
Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash,
Renee Wong, Sunny Virmani, Christopher Semturs, S. Sara Mahdavi,
Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Joelle Bar-
ral, Dale Webster, Greg S. Corrado, Yossi Matias, Karan Singhal, Pete
Florence, Alan Karthikesalingam, and Vivek Natarajan. Towards Gen-
eralist Biomedical AI. NEJM AI, 1(3):AIoa2300138, 2024.

[29] Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R


Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and
Steven Horng. MIMIC-CXR, a de-identified publicly available database
of chest radiographs with free-text reports. Scientific data, 6(1):317,
2019.

[30] KC Santosh and Laurent Wendling. Angular relational signature-based


chest radiograph image view classification. Medical & biological engi-
neering & computing, 56:1447–1458, 2018.

[31] Suhail Raoof, David Feigin, Arthur Sung, Sabiha Raoof, Lavanya
Irugulpati, and Edward C Rosenow III. Interpretation of plain chest
roentgenogram. Chest, 141(2):545–558, 2012.

[32] Mohammad Hashir, Hadrien Bertrand, and Joseph Paul Cohen. Quan-
tifying the value of lateral views in deep learning for chest x-rays. In
Medical Imaging with Deep Learning, pages 288–303. PMLR, 2020.

[33] W Dean Bidgood Jr, Steven C Horii, Fred W Prior, and Donald E
Van Syckle. Understanding and using DICOM, the data interchange
standard for biomedical imaging. Journal of the American Medical In-
formatics Association, 4(3):199–212, 1997.

[34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel
Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin,
Jack Clark, et al. Learning transferable visual models from natural
language supervision. In International conference on machine learning,
pages 8748–8763. PMLR, 2021.

[35] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learn-
ing with contrastive predictive coding. arXiv preprint arXiv:1807.03748,
2018.

[36] Xingqian Xu, Zhangyang Wang, Gong Zhang, Kai Wang, and Humphrey
Shi. Versatile diffusion: Text, images and variations all in one diffusion
model. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 7754–7765, 2023.

[37] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the


European conference on computer vision (ECCV), pages 3–19, 2018.

[38] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear
units for neural network function approximation in reinforcement learn-
ing. Neural networks, 107:3–11, 2018.

[39] Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiujun Li, Yizhe
Zhang, and Jianfeng Gao. Optimus: Organizing sentences via pre-
trained modeling of a latent space. arXiv preprint arXiv:2004.04092,
2020.
[40] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language under-
standing. In Proceedings of NAACL-HLT, pages 4171–4186, Minneapolis,
Minnesota, 2019.
[41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei,
Ilya Sutskever, et al. Language models are unsupervised multitask learn-
ers. OpenAI blog, 1(8):9, 2019.
[42] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard
Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update
rule converge to a local nash equilibrium. Advances in neural informa-
tion processing systems, 30, 2017.
[43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
Michael Bernstein, et al. Imagenet large scale visual recognition chal-
lenge. International journal of computer vision, 115:211–252, 2015.
[44] Lorenzo Tronchin, Rosa Sicilia, Ermanno Cordelli, Sara Ramella, and
Paolo Soda. Evaluating GANs in medical imaging. In Deep Generative
Models, and Data Augmentation, Labelling, and Imperfections: First
Workshop, DGM4MICCAI 2021, and First Workshop, DALI 2021, Held
in Conjunction with MICCAI 2021, Strasbourg, France, October 1, 2021,
Proceedings 1, pages 112–121. Springer, 2021.
[45] Joseph Paul Cohen, Joseph D Viviano, Paul Bertin, Paul Morrison,
Parsa Torabian, Matteo Guarrera, Matthew P Lungren, Akshay Chaud-
hari, Rupert Brooks, Mohammad Hashir, et al. TorchXRayVision: A
library of chest X-ray datasets and models. In International Conference
on Medical Imaging with Deep Learning, pages 231–249. PMLR, 2022.
[46] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu:
a method for automatic evaluation of machine translation. In Proceed-
ings of the 40th annual meeting of the Association for Computational
Linguistics, pages 311–318, 2002.

[47] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-
Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball,
Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset
with uncertainty labels and expert comparison. In Proceedings of the
AAAI conference on artificial intelligence, volume 33, pages 590–597,
2019.

[48] Richard W Hamming. Error detecting and error correcting codes. The
Bell system technical journal, 29(2):147–160, 1950.

Appendix
A1. FID Score computed with the XRV-mimic backbone

Model        T→F      L→F      L+T→F    T→L      F→L      F+T→L
RoentGen     9.38     -        -        -        -        -
UniXGen      17.08    -        16.36    22.12    -        23.07
LLM-CXR      9.66     -        -        -        -        -
CoDi         141.79   140.03   141.53   139.92   142.34   139.13
MedCoDi      1.63     2.04     2.26     2.56     2.10     1.80
MedCoDi-M    1.68     0.45     0.44     2.78     0.48     0.44

Table A1: FID score for X-ray generation with the XRV-mimic backbone, with lower values
indicating greater similarity. The '-' symbol indicates that the respective models are not
capable of performing the specified generation task. Results marked in bold denote the
best performance.
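
For reference, an FID of this kind can be sketched as follows, using a
TorchXRayVision DenseNet pretrained on MIMIC-CXR as the feature extractor; the
use of pooled convolutional features and the implied preprocessing are
illustrative assumptions rather than the exact evaluation pipeline.

# Sketch of an FID computation with an XRV-mimic backbone. Images are assumed
# to be (N, 1, 224, 224) tensors already normalized to the torchxrayvision range.
import numpy as np
import torch
import torch.nn.functional as F
import torchxrayvision as xrv
from scipy import linalg

backbone = xrv.models.DenseNet(weights="densenet121-res224-mimic_ch").eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> np.ndarray:
    maps = backbone.features(images)                    # convolutional trunk
    pooled = F.adaptive_avg_pool2d(F.relu(maps), (1, 1))
    return pooled.flatten(1).cpu().numpy()

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):                        # discard numerical noise
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))

# score = fid(extract_features(real_batch), extract_features(generated_batch))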

A2. Intra-study BLEU score evaluation

Model        BLEU-1   BLEU-2   BLEU-3   BLEU-4
UniXGen      .45      .35      .27      .23
MedCoDi      .59      .52      .48      .46
MedCoDi-M    .60      .53      .49      .47

Table A2: BLEU scores for the intra-study evaluation of the clinical report.
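
For reference, BLEU-1 through BLEU-4 between generated and reference reports
can be computed, e.g., with NLTK as sketched below; the whitespace tokenization
and the smoothing choice are illustrative assumptions rather than the exact
setup used for Table A2.

# Sketch of computing BLEU-1..4 for generated reports against reference reports.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def report_bleu(references, hypotheses):
    """references/hypotheses: lists of report strings, aligned by index."""
    refs = [[r.lower().split()] for r in references]   # one reference per report
    hyps = [h.lower().split() for h in hypotheses]
    smooth = SmoothingFunction().method1
    weights = {"BLEU-1": (1.0, 0, 0, 0),
               "BLEU-2": (0.5, 0.5, 0, 0),
               "BLEU-3": (1/3, 1/3, 1/3, 0),
               "BLEU-4": (0.25, 0.25, 0.25, 0.25)}
    return {name: corpus_bleu(refs, hyps, weights=w, smoothing_function=smooth)
            for name, w in weights.items()}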

A3. Anonymization, Imbalance Learning and Data Scarcity assessment with
Lateral X-rays

Training Data    AUROC                         F1-Score
                 Micro   Macro   Weighted      Micro   Macro   Weighted
Real             .77     .73     .73           .34     .23     .24
Synthetic        .75     .70     .69           .37     .26     .27

Table A3: Classification metrics for the Anonymization task with Lateral X-rays.

Training Data    F1-Score
                 Atl.   Cmgl.   Cnsl.   Edm.   Eff.   Micro   Macro   Weighted
Real             .32    .41     .13     .70    .63    .46     .40     .41
Synthetic        .34    .43     .15     .65    .63    .48     .42     .43

Table A4: Classification metrics for the Imbalance Learning task with Lateral X-rays.

Syn %    AUROC                         F1-Score
         Micro   Macro   Weighted      Micro   Macro   Weighted
0%       .76     .76     .76           .37     .26     .26
25%      .71     .77     .72           .38     .27     .28
50%      .74     .74     .74           .29     .19     .19
75%      .74     .78     .75           .36     .23     .24
90%      .73     .76     .73           .32     .20     .20
95%      .75     .74     .75           .36     .21     .22

Table A5: Classification metrics for the Data Scarcity task with Lateral X-rays.
