MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation
Abstract
Artificial Intelligence is revolutionizing medical practice, enhancing diagnos-
tic accuracy and healthcare delivery. However, its adaptation in medical
settings still faces significant challenges, related to data availability and pri-
vacy constraints. Synthetic data has emerged as a promising solution to
mitigate these issues, addressing data scarcity while preserving privacy. Re-
cently, Latent Diffusion Models have emerged as a powerful tool for gen-
erating high-quality synthetic data. Meanwhile, the integration of different modalities has gained interest, emphasizing the need for models capable of handling multimodal medical data. Existing approaches struggle to integrate
∗ Corresponding author: [email protected], [email protected]
Email addresses: [email protected] (Daniele Molino), [email protected] (Francesco Di Feola), [email protected] (Eliodoro Faiella), [email protected] (Deborah Fazzini), [email protected] (Domiziana Santucci), [email protected] (Linlin Shen), [email protected] (Valerio Guarrasi), [email protected], [email protected] (Paolo Soda)
1 These authors equally contributed to the work and share senior authorship.
1. Introduction
Artificial Intelligence (AI) is increasingly revolutionizing several fields, in-
cluding healthcare. Today, AI systems are capable of processing vast amounts
of medical data, revealing patterns often undetectable to the human eye and
enabling more accurate diagnostics, personalized treatments, and efficient
healthcare delivery [1]. Moreover, the capability to leverage multimodal data
represents a disruptive advancement in the medical field, enabling compre-
hensive diagnostic insights by integrating different data sources. However,
despite these advancements, the implementation of AI in healthcare faces several challenges, primarily caused by data scarcity and privacy concerns [2]. The available datasets for training AI models are often limited in size, diversity, and scope, making training deep learning (DL) models a significant challenge, as these models typically require extensive, high-quality data to achieve strong performance. Without enough diverse data, models may be-
come biased, prone to overfitting, or unable to generalize well to new, unseen
cases, creating a bottleneck in the deployment of AI solutions in real-world
healthcare scenarios. Privacy regulations, such as the General Data Protec-
tion Regulation (GDPR) [3] in Europe and the Health Insurance Portability
and Accountability Act (HIPAA) [4] in the United States, although crucial
for protecting patient privacy, can hinder the collaborative efforts required
to gather large-scale datasets. To address these limitations, a novel stream
of research is focusing on multimodal synthetic data generation techniques.
This emerging approach involves creating artificial data that replicate the
complexity and diversity of real medical data, thus providing a solution to
bypass the constraints of real-world data scarcity and privacy concerns.
1.1. Generative AI
Generative AI has seen remarkable growth since 2014, when the introduc-
tion of Generative Adversarial Networks (GANs) [5] had a groundbreaking
influence on the research field, enabling the creation of realistic synthetic
data through adversarial training. Despite their early success, GANs face
inherent challenges, such as training instability, mode collapse, and diffi-
culty in generating fine-grained details, issues that limit their effectiveness
in the medical domain. Building on the foundation set by GANs, Diffu-
sion Models (DM) [6] have recently emerged as a more robust approach for
data generation. Through a multi-step denoising process, DMs demonstrate
an improved capacity for generating diverse, high-fidelity data, capturing
subtle variations and intricate details that are essential in medical imaging applications, where both fidelity and diversity of synthetic data are crucial.
Latent diffusion models (LDMs) [7] have gained significant attention: as the
denoising process operates within a low-dimensional latent space [7], LDMs
require reduced computational resources, making them more practical and
accessible for deployment to a wider range of users and systems. Moreover,
due to their advanced conditioning mechanism, LDMs allow for fine-grained control over the generation process [8, 9]. The conditioning mechanism leverages encoders that extract meaningful representations from the input source,
enabling the synthesis of targeted features, such as specific anatomical struc-
tures or disease characteristics. This controlled generation process not only enhances the model's flexibility but also increases its relevance in medical settings, as it
allows practitioners to generate synthetic data tailored to unique diagnostic
requirements or research needs. LDMs can be effectively adapted to a wide
range of downstream tasks and, with the appropriate pre-training, have the
potential to serve as robust foundation models [10]. However, most of these
models can only generate one modality from another, which can be a significant limitation in the healthcare setting, where multiple modalities coexist
and interact. Outside the medical domain, significant advancements have
been made in multimodal data generation. Among these studies, CoDi [11] stands as a pivotal work. By enabling the simultaneous generation of multi-
ple modalities from a shared latent space, CoDi significantly improves the
consistency and coherence of the generated outputs, allowing for any-to-any
generation, avoiding the pitfalls of a multi-step approach. The adaptation of
a similar approach for medical data generation could prove highly beneficial,
filling a critical gap in the availability of diverse and high-quality datasets
for research and diagnostic purposes. However, CoDi presents some limitations when applied to such a setting: while it demonstrates the feasibility of any-to-any generation in non-medical environments, its performance tends to degrade when provided with multiple input modalities or their combinations, a limitation that cannot be overlooked in the medical domain, where reliability and consistency across modalities are critical.
eration of different X-ray views, as each view contains distinct informative
content, enhancing the utility of generated data. However, their approach is
limited by its inability to generate multiple outputs simultaneously, requir-
ing separate processing for each modality, without an explicit mechanism to
guarantee coherence between the generated data. Building on this idea, Lee et al. [27] proposed a similar approach for bidirectional X-ray and report generation via a fine-tuned large language model, named LLM-CXR. Unlike UniXGen, they only leveraged frontal chest X-rays, focusing on a single view for their generation tasks, which potentially limits its applicability in more comprehensive clinical scenarios.
The main limitation of these works is that they overlook the complementary
nature of different medical data modalities and lack the ability to gener-
ate multimodal outputs simultaneously. This independent processing often
results in inconsistencies when modalities are synthesized separately, poten-
tially leading to outputs that lack clinical coherence. Such limitations hinder
their applicability in real-world healthcare settings, where seamless integra-
tion of multimodal data is essential to replicate the complexity of patient-
specific information accurately.
1.3. Contribution
Building on the success of CoDi, this work proposes MedCoDi-M, a novel
multi-prompt foundation model for multimodal medical data generation. By
taking advantage of contrastive learning techniques, used to build Foundation Models [28], MedCoDi-M enables flexible, any-to-any generation across different medical data modalities. Specifically, our generative process ensures that
MedCoDi-M can capture the complex interactions between different medi-
cal modalities. To this end, we propose a novel training approach, named
Multi-Prompt Training, to improve the model’s ability to fuse information
from multiple modalities, enhancing its capability to generate coherent and
accurate medical data.
The main contributions can be summarized as follows:
• We thoroughly evaluate MedCoDi-M against existing state-of-the-art
models, showing its superior capabilities in terms of quality, realism
and clinical accuracy.
2. Methods
Assuming M is the set of our modalities, let I = {I1 , I2 , ..., In } be any subset
of modalities used to prompt the generation and let O = {O1 , O2 , ..., Om }
be any subset of modalities to be generated, such that O ∩ I = ∅, with
I, O ⊆ M. It is important to note that this distinction is made solely for
expositional clarity; in practice, any modality from the set M can be used
either as an input or as an output, and the model is not restricted to specific
modality pairings.
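As a purely illustrative aid, the snippet below enumerates the admissible prompt/output pairings implied by the constraint O ∩ I = ∅; the modality names are hypothetical placeholders and the snippet is not part of MedCoDi-M's codebase.

```python
from itertools import combinations

# Hypothetical modality identifiers; MedCoDi-M's actual interface is not shown here.
MODALITIES = ("frontal_xray", "lateral_xray", "report")

def valid_prompt_output_pairs(modalities):
    """Enumerate every non-empty pair of subsets (I, O) with O ∩ I = ∅."""
    pairs = []
    for k_in in range(1, len(modalities) + 1):
        for inputs in combinations(modalities, k_in):
            remaining = [m for m in modalities if m not in inputs]
            for k_out in range(1, len(remaining) + 1):
                for outputs in combinations(remaining, k_out):
                    pairs.append((set(inputs), set(outputs)))
    return pairs

for I, O in valid_prompt_output_pairs(MODALITIES):
    print(f"prompt with {sorted(I)} -> generate {sorted(O)}")
```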
The overall architecture of MedCoDi-M is depicted in Fig. 1, which consists
of three blocks, each corresponding to a distinct training phase. In panel
(a), we align the feature representations extracted from the input modali-
ties by modality-specific prompt encoders into a shared latent space using
contrastive learning. In panel (b), we independently train an LDM for each
output modality, using the multi-prompt training approach for condition-
ing. Finally, in panel (c), we perform cross-modal alignment, enabling the simultaneous generation of any combination of output modalities.
[Figure 1: Overview of the MedCoDi-M framework; panels (a), (b), and (c) correspond to the three training phases described above.]
multimodal conditioning can be achieved by interpolating the representations of each modality hI1, hI2, ..., hIn. The resulting vector ω is then used as the conditioning for training the model GOi. Following the reparametrization method proposed in [6], the training objective can be expressed as [7]:

\mathcal{L}_D^{O_i} = \mathbb{E}_{z,\epsilon,t} \left\| \epsilon - \epsilon_{\theta}\!\left(z_{O_i}, t, \omega\right) \right\|_2^2 \qquad (2)
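The following PyTorch sketch illustrates the idea behind this objective: modality embeddings already projected into the shared space are interpolated into a single conditioning vector ω, which is then fed to the conditioned denoiser. The uniform interpolation weights, the linear noise schedule, and the eps_model call signature are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def multi_prompt_condition(prompt_embeddings, weights=None):
    """Interpolate aligned modality embeddings h_I1, ..., h_In into a single vector ω.

    prompt_embeddings: list of (B, D) tensors already projected into the shared
    latent space by the modality-specific prompt encoders.
    """
    h = torch.stack(prompt_embeddings, dim=0)                      # (n, B, D)
    if weights is None:                                            # uniform interpolation (assumption)
        weights = torch.full((h.shape[0],), 1.0 / h.shape[0], device=h.device)
    return (weights.view(-1, 1, 1) * h).sum(dim=0)                 # ω, shape (B, D)

def ldm_loss(eps_model, z0, omega, num_timesteps=1000):
    """ε-prediction objective of Eq. (2): E ||ε − ε_θ(z_t, t, ω)||².

    eps_model(z_t, t, omega) stands in for the conditioned UNet G_Oi; the linear
    noise schedule below is a simplification for illustration.
    """
    b = z0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=z0.device)
    betas = torch.linspace(1e-4, 0.02, num_timesteps, device=z0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, *([1] * (z0.dim() - 1)))
    eps = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * eps   # forward diffusion
    return F.mse_loss(eps_model(z_t, t, omega), eps)
```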
2.3. Multi-output generation via Cross-modal Latent Alignment
The third training stage enables the simultaneous generation of any combi-
nation of output modalities, ensuring that each generative flow is aware of
the others. To this end, we incorporate two trainable components into each
LDM GOi : the first is an encoder VOi , that projects the latent variable of
the diffusion process zOi into a shared latent space; the second is a cross-
attention layer, that allows each LDM to attend to the generative process of
another model. Formally, let us consider two modalities, Oi and Oi+1 , being
jointly synthesized by GOi and GOi+1 and let zOi and zOi+1 denote their la-
tent variables at a generic diffusion step, respectively. Following Fig. 1(c), the
encoder VOi+1 first projects zOi+1 into a shared latent space. Then, in each
layer of GOi , the cross-attention layer attends to VOi+1 (zOi+1 ).
For the diffusion model of modality Oi, the training objective in Eq. 2 becomes:

\mathcal{L}_D^{O_i} = \mathbb{E}_{z,\epsilon,t} \left\| \epsilon - \epsilon_{\theta_c}\!\left(z_{O_i}, V_{O_{i+1}}(z_{O_{i+1}}), t, \omega\right) \right\|_2^2 \qquad (3)
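A minimal sketch of the two trainable components added in this stage is given below: a projector playing the role of V_{O_i}, and a cross-attention layer through which one generative flow attends to the other's projected latent. Channel sizes, normalization choices, and the way latents are flattened into token sequences are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentProjector(nn.Module):
    """Plays the role of V_Oi: projects a diffusion latent into the shared space."""
    def __init__(self, in_channels: int, shared_dim: int):
        super().__init__()
        # shared_dim must be divisible by the number of GroupNorm groups (8 here).
        self.proj = nn.Sequential(nn.Conv2d(in_channels, shared_dim, kernel_size=1),
                                  nn.GroupNorm(8, shared_dim))

    def forward(self, z):                        # z: (B, C, H, W)
        h = self.proj(z)                         # (B, D, H, W)
        return h.flatten(2).transpose(1, 2)      # token sequence (B, H*W, D)

class CrossModalAttention(nn.Module):
    """Cross-attention inserted in G_Oi: queries come from its own features,
    keys/values from the projected latent of the companion flow."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, own_tokens, other_tokens):
        attended, _ = self.attn(query=own_tokens, key=other_tokens, value=other_tokens)
        return self.norm(own_tokens + attended)  # residual connection

# Usage sketch (hypothetical shapes): own UNet features and the other flow's latent.
# own = torch.randn(2, 1024, 320)                          # (B, N, D) features inside G_Oi
# ctx = LatentProjector(4, 320)(torch.randn(2, 4, 32, 32))  # V_{O_{i+1}}(z_{O_{i+1}})
# fused = CrossModalAttention(320)(own, ctx)
```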
3. Experimental Configuration
This section provides a comprehensive overview of the experimental setup
adopted to evaluate the performance of MedCoDi-M. It begins by describing
the dataset used, including its key characteristics and preprocessing steps,
which ensure that the data is prepared appropriately for training and evalu-
ation. Next, the section details the implementation specifics of the proposed
framework, such as architectural choices, training configurations, and opti-
mization strategies. Additionally, it presents the state-of-the-art competitors
used for comparative analysis, highlighting their relevance and limitations in
the context of multimodal medical data generation. Finally, the evaluation
metrics are introduced, encompassing both quantitative and qualitative mea-
sures to thoroughly assess the realism, coherence, and clinical utility of the
generated outputs.
3.1. Materials
To achieve our purpose, it is crucial to leverage a multimodal medical dataset
that captures the complementary nature of different modalities, such as imag-
ing and textual data. Such datasets are essential for training models capable
of synthesizing clinically accurate and coherent outputs across diverse medi-
cal data types. We used the MIMIC-CXR [29] dataset, which contains 377,110 CXR images along with their corresponding radiology reports, for a total of 227,827 studies conducted at the Beth Israel Deaconess Medical Center in Boston, MA, USA. In the dataset, images are acquired in frontal and lateral projections. Due to significant anatomical differences, the two views offer
distinct yet complementary diagnostic information [30]. For example, car-
diovascular structures and the diaphragm can obscure up to 15% of the lung,
making certain pathologies undetectable in the frontal view alone [31]. The
lateral view, by providing a different perspective, enables the visualization
of lesions or abnormalities hidden behind these anatomical structures, thus
ensuring more accurate diagnosis [32]. For those reasons, we treated frontal
and lateral CXRs as distinct modalities. Each radiology report in the dataset is divided into two sections: a findings section, which provides a detailed description of both normal and abnormal features observed in the corresponding CXR, and an impression section, which provides a concise summary of the findings intended to support medical decision-making. In this work, we focused exclusively on the latter, as it offers a concise yet powerful summary of the patient's condition and also complies with our text encoder, which, following the implementation of [11], limits the report length to 77 tokens.
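For illustration, a CLIP-style tokenizer (as used in [11]) can enforce the 77-token limit on the impression section; the specific checkpoint name below is an assumption and may differ from the one used in MedCoDi-M.

```python
from transformers import CLIPTokenizer

# Assumed CLIP-style checkpoint; the exact text encoder used by MedCoDi-M may differ.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def encode_impression(impression: str, max_len: int = 77):
    """Tokenize the impression section, truncating it to the 77-token limit."""
    return tokenizer(impression,
                     padding="max_length",
                     truncation=True,
                     max_length=max_len,
                     return_tensors="pt")

batch = encode_impression("No acute cardiopulmonary process.")
print(batch["input_ids"].shape)  # torch.Size([1, 77])
```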
From the repository, we extracted a total of 154,721 X-rays from 78,584 studies, including all the patients for whom the radiology report and both the frontal and lateral views were present. An illustrative example of a triplet is depicted in Fig. 2.
Furthermore, we used the original uncompressed X-rays stored in DICOM format [33], since in medical imaging subtle details are critical for accurate diagnosis and compression can lead to an unintended loss of information.

Figure 2: A sample of our dataset, composed of a frontal X-ray, a lateral X-ray and the corresponding radiology report.

The X-ray preprocessing involved several steps to standardize and prepare the data for model training. First, we examined the pixel spacing of each image and resampled those with non-standard spacing to [0.139, 0.139] mm, i.e., the value observed in 95.76% of the dataset. For images with a Photometric Interpretation of MONOCHROME1, the pixel values were inverted to ensure a consistent representation. Subsequently, we normalized the images by dividing every pixel by the maximum value allowed by their bit depth, bringing the range to [0, 1]. Since the original scans are not square, we
chose not to modify their proportions through a direct resizing, as this could
distort important anatomical features. Additionally, extracting a square crop
was not a viable option, due to the impossibility of selecting a Region of Interest (ROI) that would be universally applicable. Instead, we added zero-padding around the images and then resized them to 256 × 256 to standardize the input size while preserving the integrity of the visual content.
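A possible implementation of this preprocessing pipeline is sketched below; the DICOM tags, interpolation choices, and libraries (pydicom, SciPy, OpenCV) are assumptions consistent with the description above rather than the authors' exact code.

```python
import cv2
import numpy as np
import pydicom
from scipy.ndimage import zoom

TARGET_SPACING = (0.139, 0.139)  # mm, the spacing observed in ~95.76% of the dataset
TARGET_SIZE = 256

def preprocess_xray(dicom_path: str) -> np.ndarray:
    """Sketch of the preprocessing pipeline described above (assumed DICOM tags)."""
    ds = pydicom.dcmread(dicom_path)
    img = ds.pixel_array.astype(np.float32)

    # 1) Resample non-standard pixel spacing to the dataset's dominant value.
    spacing = [float(s) for s in getattr(ds, "PixelSpacing", TARGET_SPACING)]
    factors = (spacing[0] / TARGET_SPACING[0], spacing[1] / TARGET_SPACING[1])
    if not np.allclose(factors, 1.0):
        img = zoom(img, factors, order=1)

    # 2) Invert MONOCHROME1 images so that higher values always mean brighter pixels.
    if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
        img = img.max() - img

    # 3) Normalize by the maximum value allowed by the stored bit depth -> [0, 1].
    img = img / float(2 ** int(ds.BitsStored) - 1)

    # 4) Zero-pad to a square without distorting anatomy, then resize to 256x256.
    h, w = img.shape
    side = max(h, w)
    padded = np.zeros((side, side), dtype=np.float32)
    top, left = (side - h) // 2, (side - w) // 2
    padded[top:top + h, left:left + w] = img
    return cv2.resize(padded, (TARGET_SIZE, TARGET_SIZE), interpolation=cv2.INTER_LINEAR)
```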
To ensure an unbiased evaluation, we extracted a holdout test set before any training procedure. This set consists of 33,588 samples, carefully selected to guarantee no patient overlap between the training and test sets.
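A patient-level split of this kind can be obtained, for example, with scikit-learn's GroupShuffleSplit; the metadata DataFrame and its subject_id column below are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(metadata: pd.DataFrame, test_fraction: float, seed: int = 0):
    """Hold out whole patients so that no subject appears in both splits.

    metadata: hypothetical DataFrame with one row per (frontal, lateral, report)
    triplet and a 'subject_id' column identifying the patient.
    """
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_fraction, random_state=seed)
    train_idx, test_idx = next(splitter.split(metadata, groups=metadata["subject_id"]))
    return metadata.iloc[train_idx], metadata.iloc[test_idx]
```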
and lateral X-ray feature representations, while we leverage a masked self-attention Transformer for text encoding.
Given a batch of X-ray images X and their corresponding reports R, we ob-
tain the embeddings hX = PX (X), hR = PR (R) for both modalities. These
representations are then aligned through contrastive learning, using the In-
foNCE contrastive loss [35]:
\mathcal{L}_{X,R} = -\log \frac{\exp\!\left(h_X^{i\top} h_R^{i} / \tau\right)}{\exp\!\left(h_X^{i\top} h_R^{i} / \tau\right) + \sum_{j \neq i} \exp\!\left(h_X^{i\top} h_R^{j} / \tau\right)} \qquad (4)
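In practice, Eq. (4) amounts to a cross-entropy over the batch similarity matrix, as in the hedged PyTorch sketch below; the L2 normalization and the temperature value are common defaults and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(h_x, h_r, tau=0.07):
    """InfoNCE alignment of X-ray and report embeddings, following Eq. (4).

    h_x, h_r: (B, D) batches produced by the prompt encoders P_X and P_R; the
    i-th rows form the positive pair. L2 normalization and tau=0.07 are common
    defaults, not values taken from the paper.
    """
    h_x = F.normalize(h_x, dim=-1)
    h_r = F.normalize(h_r, dim=-1)
    logits = h_x @ h_r.t() / tau                           # (B, B) similarity matrix
    targets = torch.arange(h_x.size(0), device=h_x.device)
    return F.cross_entropy(logits, targets)                # diagonal entries are positives
```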
variational autoencoder (VAE) to map the input modalities into a lower-dimensional space. As stated in [17], the most effective approach for adapting an LDM to the medical imaging domain is to fine-tune the UNet component. Therefore, we kept the VAE frozen and only trained the UNet. In total, we trained two LDMs, one for the frontal X-rays and one for the lateral X-rays, using a batch size of 512, a learning rate of 5 × 10−5, and a weight decay of 1 × 10−4. Both models were trained for 100 epochs using the AdamW optimizer.
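A minimal sketch of this setup, assuming generic unet and vae modules, is shown below; only the hyperparameters come from the text.

```python
import torch

def build_optimizer(unet: torch.nn.Module, vae: torch.nn.Module) -> torch.optim.AdamW:
    """Freeze the VAE and fine-tune only the UNet, mirroring the setup above.

    unet and vae are placeholders for whichever LDM implementation is used;
    only the learning rate and weight decay are taken from the text.
    """
    for p in vae.parameters():
        p.requires_grad = False  # keep the pretrained latent space fixed
    return torch.optim.AdamW(unet.parameters(), lr=5e-5, weight_decay=1e-4)
```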
Report Diffusion Model: For the text generation, the UNet architec-
ture is based on [36], which introduced the fully-connected residual blocks
(FCResBlock). These expand the 768-dimensional text latent vectors into a
320-by-4 hidden feature and follow the residual block paradigm with Group-
Norms [37], SiLU [38], and skip connections. We adopt Optimus [39] as the
text VAE, which consists of a BERT [40] text encoder and a GPT-2 [41] text
decoder. Unlike the LDMs used for X-ray images, we decided to fine-tune
both the VAE and the UNet in two separate training rounds. This approach
is necessary as the model has to effectively adapt to a completely different
vocabulary. Following [39], the training process begins with a reconstruction
task, where the VAE is tasked with accurately reconstructing input radiol-
ogy reports from their latent representations. Once this first step is completed, the UNet is trained for report generation with a batch size of 1024 and a learning rate of 1 × 10−5. The weight decay, the optimizer configuration, and the number of epochs remained consistent with the X-ray LDMs.
3.4. Competitors
To rigorously compare MedCoDi-M against established approaches for the generation of X-rays and radiology reports, we include a total of five open-source competitors which, to the best of our knowledge, are the only works with accessible and reproducible code as well as model weights. It is worth noting that we selected UniXGen and LLM-CXR [23, 27] because they address the same problem, i.e., bidirectional generation of CXRs and reports, with architectures different from ours, namely a Transformer and an LLM. Meanwhile, although it lacks bidirectional capabilities, we selected RoentGen [16] as it was the first work to leverage a fine-tuned LDM for X-ray generation.
All competitors were trained on the MIMIC-CXR dataset, allowing us to use
them directly without requiring additional training.
• The original CoDi [11] model, without any adaptation to the medical
domain.
• MedCoDi, an implementation identical to ours except that it omits the Multi-Prompt training strategy, serving as an ablation of MedCoDi-M. This variant
highlights the importance of the Multi-Prompt training approach in
enhancing the model’s conditioning capabilities and generating outputs
that integrate information from multiple input modalities effectively.
• UniXGen [23], a transformer-based architecture for bidirectional CXR and report generation. UniXGen employs a vector quantization technique to convert X-rays into discrete visual tokens, enabling the model to treat both tasks as sequence generation. UniXGen also incorporates dedicated tokens to generate view-specific X-rays. Additionally, multi-view CXRs can be used to condition the report generation.
• LLM-CXR [27], a pretrained LLM fine-tuned for CXR understanding
and generation. It can perform both CXR-to-report and report-to-
CXR generation tasks. It is restricted to frontal chest X-rays and does
not consider multi-view or multimodal relationships, limiting its appli-
cability in comprehensive diagnostic workflows.
• RoentGen [16], a text-conditioned LDM fine-tuned for generating syn-
thetic frontal CXRs based on textual prompts. RoentGen adapts the
Stable Diffusion architecture to the medical domain by fine-tuning the
UNet component, enabling it to capture the unique characteristics of
chest X-rays.
3.5. Evaluation Metrics
We conducted both quantitative and qualitative assessments to evaluate the
performance of our approach. The first focuses on the statistical properties
of the generated data, while the second ensures that the outputs accurately
align with the expected clinical information.
The Fréchet Inception Distance (FID) [42] measures the dissimilarity be-
tween real and synthetic samples in the feature space of an Inception-V3
model pre-trained on ImageNet [43], ranging in the interval [0, +∞) with
lower values indicating greater similarity. However, because the Inception network is not trained on a medical dataset, it may lead to misleading results [44]. To address this limitation, we also computed the FID with two additional backbones, i.e., XRV-DenseNet [45], an in-domain classification model trained to detect pathologies in CXR scans, and XRV-MIMIC DenseNet [45], specifically trained for MIMIC-CXR scan classification. However, to remain consistent with other works, we report the results obtained with the latter backbone in Table A1.
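Regardless of the backbone, the FID reduces to the Fréchet distance between two Gaussians fitted to the extracted features; a backbone-agnostic sketch is given below.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID between two sets of backbone features (one row per sample).

    The same routine can be fed Inception-v3, XRV-DenseNet, or XRV-MIMIC
    features, which is how the backbone-dependent scores are obtained.
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary noise
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```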
X-rays Classification: To evaluate whether the models are capable of gen-
erating images that accurately reflect the information of the corresponding
clinical reports, we classified the generated samples using XRV-DenseNet [45],
a well-established classifier in the literature for CXR classification. Since
such a classifier is trained only on a subset of pathologies, we computed
the AUC and F1 scores for the following diseases: Atelectasis (Atl.), Car-
diomegaly (Cmgl.), Consolidation (Cnsl.), Edema (Edm.), Enlarged Car-
diomediastinum (Enl.), Lung Lesion (Les.), Lung Opacity (Opc.), Pleural
Effusion (Eff.), Pneumonia (Pnm.) and Pneumothorax (Ptx.), along with
micro, macro, and weighted averages. The micro average aggregates con-
tributions from all classes to provide an overall measure, the macro average
computes the metric independently for each class and averages them, and the
weighted average adjusts the macro average by accounting for the number of
samples per class. However, because not all scans have a defined label for every pathology, we computed the performance for each class only when a ground truth was available; Table 1 reports the number of samples for each pathology in our test set.
Table 1: Number of samples for each pathology in the X-ray classification task.
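A possible way to compute the per-pathology scores while skipping undefined labels is sketched below; the NaN encoding of missing labels and the 0.5 decision threshold are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def masked_auc_f1(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5):
    """Per-pathology AUC/F1 computed only on samples with a defined ground truth.

    y_true: (N, C) array with 1/0 labels and NaN where the label is undefined;
    y_prob: (N, C) predicted probabilities from the CXR classifier. Assumes both
    classes remain present for every pathology after masking.
    """
    aucs, f1s = [], []
    for c in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, c])
        t, p = y_true[mask, c], y_prob[mask, c]
        aucs.append(roc_auc_score(t, p))
        f1s.append(f1_score(t, (p >= threshold).astype(int)))
    return np.array(aucs), np.array(f1s)
```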
and weighted averages for the F1 score. This task quantifies the ability of
the model to generate reports that align with the medical conditions seen in
the X-ray images, ensuring that the synthetic reports accurately reflect the
diagnostic information provided by the images.
Model T→F L→F L+T→F T→L F→L F+T→L
v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓ v3 ↓ XRV ↓
RoentGen 102.77 5.60 - - - - - - - - - -
UniXGen 81.75 7.28 - - 86.21 7.63 128.96 9.76 - - 133.38 10.20
LLM-CXR 71.91 6.83 - - - - - - - - - -
CoDi 541.44 107.23 520.02 83.12 539.35 107.17 522.00 83.01 540.66 105.63 525.80 82.52
MedCoDi 10.56 0.86 34.89 3.31 22.63 1.90 13.90 0.84 43.24 4.95 23.12 1.99
MedCoDi-M 10.67 0.93 12.04 0.48 11.51 0.43 14.00 0.96 13.75 0.48 11.97 0.34
Table 2: FID score for X-ray generation, with lower values indicating greater similarity.
XRV and v3 refer to the two backbones used to compute the score, respectively XRV-DenseNet and Inception-v3. The "-" symbol indicates that the respective models are not capable of performing the specified generation task.
the report. A high score reflects that the synthetic X-rays match the
findings described in the report, indicating that the generative model
correctly understood and translated the clinical context from the report
into the visual representation.
show that MedCoDi-M consistently outperforms all the competitors for X-ray generation. It is worth noting that the excessively high FID values for CoDi highlight how crucial the fine-tuning step of an LDM is. This observation is further supported by Figure 4, which shows one example of generation by CoDi and MedCoDi-M with the same textual prompt, where the former fails to generate anything resembling an X-ray.
Figure 4: Generation comparison between CoDi and MedCoDi-M using the same textual
prompt, i.e., “No acute cardiopulmonary process”.
On the other hand, the Multi-Prompt training technique proves its effectiveness, as MedCoDi-M consistently outperforms MedCoDi in four configurations (L→F, L+T→F, F→L, F+T→L), showing that the possibility of merging information in the shared latent space improves the generation results. However, since this approach leverages two image modalities alongside a single textual modality, it tends to prioritize learning from visual prompts over textual ones. As a result, we observe a slight drop in performance when generation is based solely on clinical reports (T→F, T→L), although this degradation is minimal compared to the significant boost the model gains in
other generation settings. In Fig. 5 and Fig. 6 several examples of synthetic
images from the different models are displayed. While at first glance the
synthetic X-rays may appear visually similar, the FID values reveal a clearer
resemblance of MedCoDi-M’s outputs to real samples, highlighting the effec-
tiveness of our approach.
To further assess the generation capabilities of MedCoDi-M, we analyzed
the factual correctness by performing a classification task on the patient’s
pathology (Section 3.5.2). Specifically, for each sample in the test set, we
Figure 5: Frontal Synthetic Samples generated by different baselines with the same input
prompt, i.e., “No acute cardiopulmonary process”.
Figure 6: Lateral Synthetic Samples generated by different baselines with the same input
prompt, i.e., “No acute cardiopulmonary process”.
AUC
Task    Model      Atl.  Cmgl. Cnsl. Edm.  Enl.  Les.  Opc.  Eff.  Pnm.  Ptx.  Micro Macro Weighted
-       Real Data  .84   .91   .91   .93   .81   .78   .85   .95   .78   .86   .87   .86   .87
T→F     RoentGen   .87   .93   .82   .80   .50   .59   .68   .94   .48   .58   .78   .72   .78
T→F     UniXGen    .75   .78   .69   .81   .67   .67   .69   .76   .66   .66   .74   .71   .71
T→F     LLM-CXR    .89   .92   .87   .96   .80   .78   .85   .95   .80   .85   .89   .87   .87
T→F     MedCoDi    .91   .95   .93   .96   .84   .82   .91   .97   .84   .88   .90   .90   .91
T→F     MedCoDi-M  .86   .95   .91   .94   .82   .75   .86   .97   .86   .85   .88   .89   .90
L→F     MedCoDi    .84   .88   .90   .91   .81   .76   .85   .95   .76   .82   .83   .85   .90
L→F     MedCoDi-M  .86   .92   .93   .94   .83   .80   .89   .97   .80   .90   .87   .88   .90
T+L→F   UniXGen    .75   .79   .65   .81   .66   .63   .68   .77   .63   .70   .73   .71   .71
T+L→F   MedCoDi    .89   .91   .92   .91   .82   .80   .86   .97   .77   .85   .88   .89   .90
T+L→F   MedCoDi-M  .92   .96   .95   .97   .87   .85   .92   .98   .85   .93   .91   .92   .92

F1
Task    Model      Atl.  Cmgl. Cnsl. Edm.  Enl.  Les.  Opc.  Eff.  Pnm.  Ptx.  Micro Macro Weighted
-       Real Data  .49   .49   .35   .60   .09   .12   .60   .73   .35   .21   .49   .40   .53
T→F     RoentGen   .31   .40   .00   .25   .00   .02   .40   .74   .04   .04   .36   .25   .43
T→F     UniXGen    .25   .31   .05   .29   .06   .05   .37   .41   .00   .04   .24   .17   .34
T→F     LLM-CXR    .51   .53   .19   .60   .08   .07   .60   .74   .38   .16   .47   .39   .51
T→F     MedCoDi    .60   .60   .33   .66   .10   .05   .64   .78   .38   .17   .53   .43   .59
T→F     MedCoDi-M  .51   .60   .25   .59   .08   .11   .60   .78   .33   .12   .50   .41   .54
L→F     MedCoDi    .51   .45   .32   .42   .06   .07   .59   .73   .27   .10   .43   .35   .49
L→F     MedCoDi-M  .51   .46   .39   .57   .09   .13   .64   .79   .36   .17   .50   .41   .54
T+L→F   UniXGen    .20   .31   .06   .27   .06   .07   .34   .43   .04   .07   .24   .19   .33
T+L→F   MedCoDi    .54   .55   .32   .64   .09   .06   .63   .75   .38   .17   .53   .44   .58
T+L→F   MedCoDi-M  .58   .59   .39   .68   .13   .18   .68   .81   .40   .26   .57   .47   .60

Table 3: AUC (top) and F1 (bottom) scores for the classification of generated frontal X-rays with XRV-DenseNet, with real data reported as a reference.
multiple scans of the same study. This evaluation assesses the model’s con-
sistency in generating coherent and reliable reports when presented with
scans from the same clinical case. The goal was to ensure that the model not
only excels in generating high-quality individual reports but also maintains
consistency across multiple related images. We report numerical results for
this evaluation in Table A2.
F→T L→T F+L→T
Methods
BLEU-1 ↑ BLEU-2 ↑ BLEU-3 ↑ BLEU-4 ↑ BLEU-1 ↑ BLEU-2 ↑ BLEU-3 ↑ BLEU-4 ↑ BLEU-1 ↑ BLEU-2 ↑ BLEU-3 ↑ BLEU-4 ↑
UniXGen .25 .16 .12 .09 .26 .16 .11 .07 .26 .17 .12 .09
LLM-CXR .25 .15 .10 .07 - - - - - - - -
MedCoDi .38 .27 .22 .18 .38 .28 .23 .18 .42 .32 .27 .22
MedCoDi-M .41 .30 .25 .20 .42 .32 .26 .21 .44 .34 .29 .24
Table 4: BLEU Score for Report Generation, with higher values indicating greater
similarity. The “-” symbol indicates that the respective models are not capable of
performing the specified generation task.
Task    Model      Atl.  Cmgl. Cnsl. Edm.  Enl.  Les.  Opc.  Eff.  Pnm.  Ptx.  No F. Micro Macro Weighted
F→T     UniXGen    .18   .17   .03   .28   .01   .01   .01   .13   .10   .03   .74   .49   .15   .42
F→T     LLM-CXR    .25   .23   .07   .40   .02   .03   .21   .36   .24   .02   .74   .46   .20   .44
F→T     MedCoDi    .56   .56   .15   .67   .10   .26   .47   .71   .52   .37   .88   .67   .44   .67
F→T     MedCoDi-M  .62   .61   .14   .70   .15   .27   .52   .79   .59   .34   .90   .71   .46   .73
L→T     UniXGen    .23   .23   .05   .36   .04   .02   .20   .36   .16   .05   .73   .45   .19   .43
L→T     MedCoDi    .54   .58   .14   .65   .07   .23   .45   .75   .51   .30   .89   .68   .41   .68
L→T     MedCoDi-M  .62   .61   .16   .70   .09   .26   .51   .78   .57   .32   .91   .71   .45   .73
F+L→T   UniXGen    .22   .16   .06   .30   .01   .00   .17   .33   .17   .02   .75   .49   .16   .45
F+L→T   MedCoDi    .61   .60   .17   .70   .09   .31   .51   .78   .58   .27   .90   .72   .45   .72
F+L→T   MedCoDi-M  .62   .61   .17   .70   .12   .27   .53   .79   .58   .37   .91   .73   .46   .73

Table 5: F1 score for the classification of generated reports, where "No F." denotes the No Finding class.
where Pi, Pj are the prompt encoders that project the modalities into the shared latent space. Intuitively, the closer the metric is to 1, the higher the alignment between the two modalities. Because none of the competitors is capable of jointly generating more than one modality, we compared the multi-output
generation of MedCoDi-M with a multi-step approach, where every modal-
ity is synthesized independently by MedCoDi-M. As shown in Table 6, the
joint generation shows higher similarity between the generated samples.
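Although the exact metric definition falls outside the excerpt above, a cosine-similarity reading of it, computed between the prompt-encoder embeddings of two jointly generated samples, can be sketched as follows; the encoder interfaces are placeholders.

```python
import torch.nn.functional as F

def cross_modal_alignment(gen_i, gen_j, P_i, P_j):
    """Cosine-similarity reading of the alignment metric between two modalities.

    P_i, P_j: modality-specific prompt encoders projecting into the shared latent
    space; gen_i, gen_j: batches of jointly generated samples for each modality.
    A value close to 1 indicates well-aligned embeddings.
    """
    h_i = F.normalize(P_i(gen_i), dim=-1)
    h_j = F.normalize(P_j(gen_j), dim=-1)
    return (h_i * h_j).sum(dim=-1).mean()
```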