Article
MAM-E: Mammographic Synthetic Image Generation with
Diffusion Models
Ricardo Montoya-del-Angel 1, * , Karla Sam-Millan 1 , Joan C. Vilanova 2 and Robert Martí 1
1 Computer Vision and Robotics Institute (ViCOROB), University of Girona, 17004 Girona, Spain;
[email protected] (K.S.-M.); [email protected] (R.M.)
2 Department of Radiology, Clínica Girona, Institute of Diagnostic Imaging (IDI) Girona, University of Girona,
17004 Girona, Spain; [email protected]
* Correspondence: [email protected]
Abstract: Generative models are used as an alternative data augmentation technique to alleviate the
data scarcity problem faced in the medical imaging field. Diffusion models have gathered special
attention due to their innovative generation approach, the high quality of the generated images, and
their relatively less complex training process compared with Generative Adversarial Networks. Still,
the implementation of such models in the medical domain remains at an early stage. In this work,
we propose exploring the use of diffusion models for the generation of high-quality, full-field digital
mammograms using state-of-the-art conditional diffusion pipelines. Additionally, we propose using
stable diffusion models for the inpainting of synthetic mass-like lesions on healthy mammograms.
We introduce MAM-E, a pipeline of generative models for high-quality mammography synthesis
controlled by a text prompt and capable of generating synthetic mass-like lesions on specific regions
of the breast. Finally, we provide quantitative and qualitative assessment of the generated images
and easy-to-use graphical user interfaces for mammography synthesis.
Citation: Montoya-del-Angel, R.; Sam-Millan, K.; Vilanova, J.C.; Martí, R. MAM-E: Mammographic Synthetic Image Generation with Diffusion Models. Sensors 2024, 24, 2076. https://doi.org/10.3390/s24072076

Academic Editors: Ruben Pauwels and Alexander Wong

Received: 13 December 2023; Revised: 12 March 2024; Accepted: 22 March 2024; Published: 24 March 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Data scarcity is an important problem faced in the medical imaging domain, caused by several factors, such as expensive image acquisition, processing, and labeling procedures, data privacy concerns, and the rare incidence of some pathologies [1]. This leads to a reduction in the volume of medical data available for the training of deep-learning models, which limits the models' performance and holds back the development of computer-aided systems, compared with non-medical imaging applications.

Generative models have been used to complement traditional data augmentation techniques and expand medical datasets, with generative adversarial networks (GANs) being, for many years, the state-of-the-art (SOTA) due to their high image quality and photorealism. Nevertheless, unstable training, low diversity generation, and low sample quality make the use of GAN-like architectures challenging for medical imaging synthesis [1]. Because medical diagnosis can depend on subtle changes in organ appearance reflected in the images, realistic high-quality synthetic generation is crucial for the reliable performance of computer-assisted diagnosis and intervention systems [2].

Diffusion models (DM) captured special attention from the generative models community when they were proposed by Dhariwal and Nichol [3] for image generation and outperformed GANs in 2021. Since then, applications and research papers for medical images have been published to explore this new image generation principle. For instance, Dorjsembe et al. [4] proposed using the original pipeline of diffusion models in computer vision, called denoising diffusion probabilistic models (DDPM) [5], for the generation of high-quality MRI of brain tumors. This first implementation of diffusion models for 3D medical images reached SOTA results and outperformed the baseline models based
on 3D GANs. Further advances in the field led to latent diffusion [6], which introduces
the use of a latent space for higher image resolution. Latent diffusion was used by Pinaya
et al. [7] to generate high-resolution 3D brain images, increasing the image resolution from
64 × 64 × 64 to 160 × 224 × 160 without requiring more GPU memory usage or overall
training time. The Fréchet inception distance (FID) [8] for image fidelity and the multi-scale
structural similarity index measure (MS-SSIM) for generation diversity were computed,
and in both cases DM surpassed the GANs’ baseline metrics.
A more controlled generation process can be achieved by feeding additional input
during training and inference. An example of this is stable diffusion (SD) [6], a conditional
diffusion model with text prompts as generation conditioning. An SD implementation for
medical images was introduced by Chambon et al. [9], who proposed a model for chest
X-ray generation. Their model, named RoentGen, was able to create visually convincing,
diverse chest X-rays, controlling the generation output using text prompts with radiology-
specific language. A key characteristic of this work is the use of SD weights pretrained
with natural images as the baseline. Instead of training from scratch, specific parts of the
network are fine-tuned to adapt the weights from their original domain to the specific chest X-ray medical domain. This DM fine-tuning approach is called DreamBooth and was first introduced by Ruiz et al. [10] for natural images.
Besides image generation, DM can be used for other tasks, such as super-resolution,
image denoising, and inpainting. Hung et al. [11] used a conditional diffusion model to
inpaint MRI prostate images to recover their original information, outperforming other
generation methods in both qualitative visual inspection and quantitative generation
metric comparison. Some works have explored lesion inpainting using DM for brain
MRI. Rouzrokh et al. [12] developed a DDPM to execute several inpainting tasks, such as
generating synthetic lesions or healthy tissue on slices of the 3D MRI volumes in various
sequences. Their model was capable of generating realistic tumoral lesions and tumor-free
brain tissue, although the performance of the model was only assessed visually.
Figure 1. Graphical user interface of MAM-E for generation of synthetic healthy mammograms.
2.1.1. OMI-H
We used a subset of the OPTIMAM Mammography Image Database, consisting of
approximately 40 k Hologic full-field digital mammograms (FFDM) from several UK
breast-screening centers and with different image views [25]. The dataset was composed of
mammograms with and without lesions (benign masses, malignant masses, and interval cancers), and expert annotations were included for the respective cases, including the coordinates of a bounding box surrounding the lesion. The images were in CC and MLO views, and no breast implant was present in any image.
2.1.2. VinDr-Mammo
A second dataset composed of approximately 20 k FFDM with breast-level assessment
and extensive lesion annotation was also used. It consists of 5000 mammography exams,
each with 4 standard views (CC and MLO for both lateralities), coming from two primary hospitals in Vietnam, giving a total of 20,000 images in DICOM format [26]. The metadata
of each image, consisting of both technical and clinical information, was also available in a
CSV file. We filtered the images so that only mammograms coming from a Siemens vendor
unit were used, obtaining a final set of 15,475 mammograms. The lesions present in the
images correspond to masses and in some cases masses with suspicious microcalcifications.
Figure 2. Resizing and cropping of an OMI-H mammogram. The same process was conducted for
VinDr mammograms.
Figure 3. Examples of training mammograms (real) and their respective text prompts for OMI-H (a,b) and VinDr (c,d): (a) “a Hologic mammogram in MLO view with small area”; (b) “a Hologic mammogram in CC view with big area”; (c) “a Siemens mammogram in CC view with very low density”; (d) “a Siemens mammogram in MLO view with very high density”.
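As an illustration of how such prompts can be assembled from the dataset metadata, the following minimal Python sketch builds Figure 3-style prompts; the field names (vendor, view, area, density) are hypothetical placeholders for the actual CSV columns and do not reproduce the exact implementation used here.

def build_prompt(record: dict) -> str:
    """Assemble a Figure 3-style text prompt from one metadata record.

    The keys 'vendor', 'view', 'area', and 'density' are hypothetical; they stand
    in for whatever columns the OMI-H and VinDr metadata actually use.
    """
    if record["vendor"].lower() == "hologic":
        # OMI-H prompts are conditioned on view and breast area
        return f"a hologic mammogram in {record['view']} view with {record['area']} area"
    # VinDr (Siemens) prompts are conditioned on view and breast density
    return f"a siemens mammogram in {record['view']} view with {record['density']} density"

# Example: build_prompt({"vendor": "HOLOGIC", "view": "MLO", "area": "small"})
# -> "a hologic mammogram in MLO view with small area"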
where q is a probability distribution from which the noisy version of the image at time t can be sampled, given x_{t−1}. The proposal of the DDPM framework [5] is to define q as a Gaussian (normal) distribution given by

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I),  (2)

where x_t is the output of the distribution sampling, √(1 − β_t) x_{t−1} is the mean, and β_t I is the variance of the distribution. Therefore, the sampling of the next noisy version of the image is essentially controlled by β, as its value affects both the mean and the variance of the sampling distribution. Selecting the manner in which β changes through time is called beta scheduling and is controlled by the noise scheduler. Two examples of beta scheduling are shown in Figure 5a,c.
with α̃_t = Π_{s=1}^{t} α_s and α_t = 1 − β_t, where α can be interpreted as a measure of how much information from the previous image is kept during the diffusion process. The importance of α̃_t, and therefore of β_t, can be understood by looking at Figure 5. For t values close to 0, the distribution from which we sample has µ ≈ 1 and σ ≈ 0, meaning that the sampled images are very similar to the original image. On the other hand, for large t values, where µ ≈ 0 and σ ≈ 1, the distribution is close to a standard normal distribution (SND) and the sampled image will be essentially pure Gaussian noise.
Finally, to be able to define the training goal in the reverse diffusion process, we express the sampling from the probability distribution in Equation (3) using the reparameterization trick [29]. The reparameterization trick allows us to write the generation of a sample X from a normal distribution N(µ, σ) as X = µ + σZ, where Z ∼ N(0, 1), i.e., Z is sampled from an SND. With this, the forward diffusion sampling process can be expressed by

x_t = √(α̃_t) x_0 + √(1 − α̃_t) ϵ,  (4)

where ϵ ∼ N(0, 1). The stochastic variable epsilon (ϵ) in Equation (4) is crucial to understanding the reverse diffusion process, as it is essentially the prediction target of the UNet.
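As a minimal illustration of Equation (4) (a sketch, not the training code used here), the closed-form forward diffusion sample can be drawn in PyTorch as follows, assuming a linear beta schedule:

import torch

T = 1000                                        # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear beta schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)   # alpha-tilde_t in the text

def forward_diffusion_sample(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t from q(x_t | x_0) using Equation (4).

    x0: clean images of shape (B, C, H, W); t: integer timesteps of shape (B,).
    Returns the noisy images x_t and the noise eps, the UNet's prediction target.
    """
    eps = torch.randn_like(x0)                          # eps ~ N(0, 1)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps, eps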
The reverse diffusion process can then be written as

p_θ(x_{0:T}) = p(x_T) Π_{t=1}^{T} p_θ(x_{t−1} | x_t),  (5)

where p_θ is the learned probability distribution from which the denoised images are sampled at each timestep, t. The subscript θ indicates that the distribution is parameterized as learned by the UNet. This also explains why the term p(x_T) has no subscript θ, as it is the starting point of the reverse process, i.e., pure Gaussian noise.
Assuming that p can also be modeled as a normal distribution, it can be expressed as

p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ_θ(x_t, t), Σ_θ(x_t, t)),  (6)

where µ_θ and Σ_θ are the learnable mean and variance of the reverse sampling distribution.
To reduce the training complexity, and because it has been shown to provide similar results, Σ_θ is fixed to β_t I; therefore, only µ_θ has to be learned. Due to limitations of space, the complete formulation of the optimization of the usual variational bound on the negative log likelihood is not fully described, but key considerations of this formulation are given instead. The first consideration is that µ_θ can be computed as

µ_θ(x_t, t) = (1/√α_t) (x_t − (β_t/√(1 − α̃_t)) ϵ_θ(x_t, t)),  (7)
Figure 6. Reverse diffusion process using a denoising UNet. The upblock layers are a mirror of the
downblock layers.
4. x_{t−1} = (1/√α_t) (x_t − (β_t/√(1 − α̃_t)) ϵ_θ(x_t, t)) + σ_t z
5. end for
6. Decode image using VAE
First, random Gaussian noise is sampled as a starting point. Then, the denoising
process is repeated for T steps. The loop consists of using the predicted noise, ϵθ , to
compute the distribution mean using Equation (7). By adding σt z to this mean term, we
are essentially sampling from the learned data distribution of the reverse diffusion process.
After the denoising process is finished, the image is sent back to the image space using the
VAE decoder.
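A simplified sketch of this sampling loop is shown below, assuming an unconditional diffusers-style UNet and VAE with the interfaces used here; it mirrors steps 1–6 but is not the exact inference code of MAM-E.

import torch

@torch.no_grad()
def ddpm_sample(unet, vae, shape, betas, device="cuda"):
    """Reverse diffusion loop following steps 1-6 (diffusers-style UNet and VAE assumed)."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                     # 1. start from pure Gaussian noise
    for t in reversed(range(len(betas))):                     # 2.-5. denoising loop over T steps
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = unet(x, t_batch).sample                         # predicted noise eps_theta(x_t, t)
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_cumprod[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * z                   # 4. sample x_{t-1} from the learned distribution
    return vae.decode(x / vae.config.scaling_factor).sample   # 6. decode the latent back to image space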
The inference process has two main hyperparameters to consider: number of timesteps,
T, and the guidance scale. First, the number of timesteps, T, will depend on the type of
sampling method that we use for denoising. The traditional DDPM sampling requires
approximately 100 steps to generate good-quality images, which is time consuming and represents a bottleneck in image generation. The best alternative we found was to use the DPM-Solver proposed by Lu et al. [32], which allows fast diffusion sampling with only 20 steps for good-quality image generation. In the Results section, we show how changing T affects the image quality.
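A hedged example of such an inference run with the Hugging Face diffusers library is shown below; the checkpoint path is a hypothetical placeholder for the fine-tuned MAM-E weights.

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# "path/to/mam-e-weights" is a hypothetical placeholder for the fine-tuned checkpoint.
pipe = StableDiffusionPipeline.from_pretrained("path/to/mam-e-weights", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)  # DPM-Solver sampling
pipe = pipe.to("cuda")

image = pipe(
    prompt="a hologic mammogram in MLO view with small area",
    num_inference_steps=20,   # T = 20 suffices with the DPM-Solver, vs. ~100 for DDPM sampling
    guidance_scale=4.0,       # second hyperparameter, discussed next
).images[0]
image.save("synthetic_mammogram.png")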
The second hyperparameter is called the guidance scale. Even though the SD archi-
tecture uses cross attention in several parts of the network, so that the generation process
focuses on the text prompt, in practice the model tends to ignore the text prompt at inference
time. To solve this issue, Ho and Salimans [33] proposed a technique called classifier-free
guidance. In essence, classifier-free guidance consists of generating two noise predictions, ϵ, at each step: one using the prompt (ϵ_text) and one without it (ϵ_free). Then, the difference between the prompt-generated noise and the free-generated noise is computed. This difference can be considered a vector in the image distribution space that points in the direction of the image described by the text. As such, we can scale this vector and add it to the free-generated noise to push the generation further in the direction of the text prompt. This geometrical trick is illustrated in Figure 7.
Figure 7. Classifier-free guidance geometrical interpretation. As the guidance scale increases, the im-
age is pushed further in the prompt direction.
Formally, the scaling factor is called the guidance scale, g, and the formulation can be summarized as

ϵ_guided = ϵ_free + g (ϵ_text − ϵ_free).
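In code, the guidance step reduces to a single line combining the two noise predictions; the following sketch only illustrates the formulation above and is not the pipeline's internal implementation.

import torch

def apply_guidance(eps_text: torch.Tensor, eps_free: torch.Tensor, g: float) -> torch.Tensor:
    """Classifier-free guidance: push the unconditioned prediction towards the prompted one.

    g = 1 recovers the unguided prediction; larger g follows the text prompt more closely.
    """
    return eps_free + g * (eps_text - eps_free)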
Figure 8 provides an overview of the full MAM-E pipeline proposed, which includes
both image generation and lesion inpainting tasks. The following sections explain the
specific details of each generation task.
Figure 8. Overall MAM-E pipeline combining both full-field generation and lesion inpainting tasks.
In dark green, the inputs needed for a full synthetic mammogram generation with lesion. In light
green, the optional input for lesion inpainting on real images, instead of full-field synthetic images.
In red, the outputs of each task.
Figure 10. Denoising UNet architecture used for the reverse diffusion process. The upblock structure
is a mirror of the downblock.
At training time, for each mammogram containing a lesion, two new elements are added per example: the mask and a masked version of the original image. The masked version is a copy of the original image in which the pixel values inside the bounding box are set to zero. Both the image and the masked image are first encoded
into the latent space using the VAE encoder. Then, the mask is reshaped to match the latent
representation size of the images. The text prompt and the timesteps were given similarly
to the full-field image generation training pipeline. The training of the reverse diffusion
process remains the same, except for one crucial difference: instead of feeding only the
latent representation to the UNet, the latent representation, the mask, and the masked
latent representation are stacked into one tensor and fed into the UNet, as seen in Figure 11.
This small change in the training process allows the network to pay attention only to the
pixels inside the mask, as the pixels outside it are always provided. This process is repeated
for each dataset, meaning that two inpainting models were trained, one for each dataset.
Figure 11. Inpainting training pipeline. The mask is reshaped to match the image size of the latent
representations (64 × 64). The same UNet as in the Stable Diffusion pipeline is used.
At inference time, unlike a normal SD inference pipeline, two extra inputs must be given: an image, on top of which the lesion will be inpainted, and a mask defining the region to inpaint. The diffusion process is carried out as explained in Section 2.3.5, with the only difference being the conditioning added by the two new inputs. Because the text prompt given at training time is the same for all samples, the same prompt must also be given at inference time, as described in Section 2.2.2.
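Assuming the fine-tuned weights are loaded into a diffusers inpainting pipeline, inference could look like the sketch below; the checkpoint path, file names, and prompt wording are placeholders, not the released configuration.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "path/to/mam-e-inpainting", torch_dtype=torch.float16   # hypothetical checkpoint path
).to("cuda")

image = Image.open("healthy_mammogram.png")   # mammogram on which the lesion will be inpainted
mask = Image.open("lesion_mask.png")          # white pixels mark the region to inpaint

result = pipe(
    prompt="a mammogram with a mass",         # fixed training prompt (placeholder wording)
    image=image,
    mask_image=mask,
    num_inference_steps=20,
    guidance_scale=4.0,
).images[0]
result.save("mammogram_with_lesion.png")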
To achieve a batch size of 256 on a single GPU, we used gradient accumulation, a technique that computes and sums the gradients of several consecutive mini-batches before updating the model weights. In our case, using a mini-batch size of 16 and 16 gradient accumulation steps, the resulting accumulated batch size is 256. This technique, however, increased the overall training time.
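A hedged sketch of this setup with the Accelerate library is shown below; model, optimizer, train_loader, and compute_loss are assumed to be defined elsewhere and are placeholders rather than the exact training script.

from accelerate import Accelerator

# 16 gradient accumulation steps x mini-batch of 16 = effective batch size of 256.
accelerator = Accelerator(gradient_accumulation_steps=16)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for batch in train_loader:
    with accelerator.accumulate(model):
        loss = compute_loss(model, batch)   # placeholder for the diffusion noise-prediction loss
        accelerator.backward(loss)          # gradients are summed across the 16 mini-batches
        optimizer.step()                    # the update is only applied once accumulation is complete
        optimizer.zero_grad()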
Gradient checkpointing is another technique to reduce GPU memory usage, releasing memory at the expense of a slower training time. Gradient checkpointing saves strategically selected activations throughout the computational graph of the model, so only a fraction of the activations must be re-computed for the gradients. A final memory reduction can be achieved by setting the optimizer gradients to None instead of zero after the weight updates have been completed. This will in general have a lower memory footprint and can modestly improve performance. Most of these techniques can be implemented directly using the Hugging Face Accelerate library and framework for distributed training and resource management.
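The remaining memory optimizations can be enabled with a few lines, as in this illustrative sketch for a diffusers UNet and a standard PyTorch optimizer:

# Recompute a subset of activations during the backward pass instead of storing all of them.
unet.enable_gradient_checkpointing()

# After each update, free the gradient tensors instead of filling them with zeros.
optimizer.zero_grad(set_to_none=True)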
3. Results
3.1. Healthy Mammogram Generation
As an initial experiment, we trained an unconditional diffusion model with the Hologic
dataset using the same text prompt for all images: “a mammogram”. The evolution of the
diffusion process as the training steps progress is shown in Figure 12. It can be seen that
from the first epoch the generated image has essentially no signs of residual Gaussian noise,
although the synthetic image does not resemble a mammogram. This implies that diffusion
models pretrained on natural images have learned to denoise images and that the new task
is to learn a new concept by finding its representation in the data distribution of the model.
We can also notice that in three epochs the model has learned the significant characteristics
of a mammogram and can generate realistic images. In the following epochs, the model
focuses on improving smaller details on the image, such as the edges of the breast and the
details of the breast parenchyma.
Figure 12. Training evolution of the diffusion process on an unconditional pretrained model at epochs
1, 3, 6, and 10.
Figure 13. Training evolution of SD with Hologic images at epochs 1, 3, 6, and 10. The prompt is: “a
mammogram in MLO view with small area”.
Figure 14. Training evolution of the diffusion process on a conditional pretrained model trained
with Siemens images at epochs 1, 3, 6, and 10. The prompt is: “a mammogram in CC view with
high density”.
Figure 15. Training evolution of the diffusion process on a conditional pretrained model trained with
both Siemens and Hologic images at epochs 1, 3, 7, and 40. The prompt is: “a siemens mammogram
in MLO view with high density and small area”.
remove the noise. The longest generation samples that we ran used T = 50, needing a
maximum of 4 s to denoise.
The guidance scale, on the other hand, played a more crucial role in the quality
and diversity of the generated images. Figure 16 shows the effect of the guidance scale
on the image generation. We observe that a guidance scale of 1 does not suffice for a
meaningful generation. This is a common behavior for stable diffusion pipelines, as the
image must be pushed further in the prompt direction (see Figure 7). It can be seen that
the increase in the guidance value not only generates a more meaningful image but also
adjusts the characteristics of the mammogram to better match the text prompt. For example,
at guidance 2, the mammogram still presents low breast density, contrary to the text prompt
description. At the subsequent guidance values of 3 and 4, the breast density increases, as does the overall quality of the image.
Figure 16. Guidance effect on the generation output. From upper-left to lower-right, the guidance
varies in a range from 1 to 4. Prompt: “A siemens mammogram in MLO view with small area and
very high density”.
Nevertheless, there exists a trade-off between prompt fidelity and generation diversity. If the guidance scale is high, the diversity of the generated images decreases, a behavior similar to the mode collapse suffered by GANs. To quantitatively assess this phenomenon, we computed the MS-SSIM metric for different guidance scale values. The MS-SSIM (multi-scale structural similarity index measure) is usually used to assess the generation diversity of generative models. MS-SSIM is an extension of the traditional SSIM metric and measures the similarity between two images based on luminance, contrast, and structure.
The mean and standard deviation of the MS-SSIM values among 20 images generated at different guidance values using the same text prompt were computed and are shown in Table 3. The experiment was repeated for both vendors and the combined model. It can be seen that, overall, the higher the guidance value, the lower the generation diversity, as the MS-SSIM value increases. This suggests that the value of the guidance scale must be carefully selected, as a very low value will generate low-quality images but with high diversity. Conversely, a high guidance value (above 6) will generate a mammogram more faithful to the prompt description but with low diversity. We note that the optimum guidance scale will depend on the model, so empirical experiments using the MS-SSIM metric are encouraged. Nevertheless, the experiments we performed on all three models suggest that a guidance scale of 4 will suffice for a meaningful and diverse generation.
Table 3. Guidance scale effect on the MS-SSIM metric for the three SD models. The lower the
MS-SSIM, the higher the image diversity.
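One way to compute such a diversity score is the mean pairwise MS-SSIM over a batch of generated images; the sketch below uses the torchmetrics implementation as an assumed example, since the exact implementation is not prescribed here.

from itertools import combinations
import torch
from torchmetrics.image import MultiScaleStructuralSimilarityIndexMeasure

ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)

def diversity_score(images: torch.Tensor):
    """Mean and standard deviation of pairwise MS-SSIM over a batch (N, 1, H, W) in [0, 1].

    A lower mean MS-SSIM indicates higher generation diversity (cf. Table 3).
    """
    scores = torch.stack([
        ms_ssim(images[i:i + 1], images[j:j + 1])
        for i, j in combinations(range(images.shape[0]), 2)
    ])
    return scores.mean(), scores.std()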
The idea of the negative prompt is to specify features that the user would like to avoid. For instance, in cases where a synthetic image presents a gray or white background, a negative prompt of “white background” or “no black background” has been shown to make the background black.
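In a diffusers-style pipeline, this corresponds to the negative_prompt argument, as in the hedged usage sketch below (pipe is the generation pipeline loaded as in the earlier sketch):

image = pipe(
    prompt="a siemens mammogram in MLO view with very high density",
    negative_prompt="white background, gray background",   # features to steer the generation away from
    num_inference_steps=20,
    guidance_scale=4.0,
).images[0]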
In the case of the inpainting task, the GUI has the option to upload the image that will
be inpainted, although a default image is available. An interactive drawing brush is then
activated, with which a lesion can be inpainted in any part of the mammogram, as shown
in Figure 19.
Given that the pretrained weights are available in the Hugging Face personal repository of the first author, and that the code to run the GUI interface is publicly available in the GitHub repository of the same authorship, all five GUIs can be run on graphics cards with approximately 4 GB of GPU memory.
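As an illustration of how such an interface can be wired up with the Gradio library [36] (a minimal sketch, not the published GUI code):

import gradio as gr

def generate(prompt: str, guidance_scale: float):
    # 'pipe' is a Stable Diffusion pipeline loaded as in the earlier inference sketch.
    return pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=20).images[0]

demo = gr.Interface(
    fn=generate,
    inputs=[gr.Textbox(label="Text prompt"),
            gr.Slider(1.0, 8.0, value=4.0, label="Guidance scale")],
    outputs=gr.Image(label="Synthetic mammogram"),
    title="MAM-E mammogram generation",
)
demo.launch()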
4. Discussion
Our stable diffusion models show satisfactory results for the mammography gen-
eration task, capable of synthesizing visually realistic mammograms that are difficult to
differentiate from real cases by a radiologist. Moreover, thanks to the text conditioning,
we are capable of controlling the generation process and obtaining mammograms with
specific features. Comparison with the work of Pinaya et al. [22] shows a similar visual
quality of our synthetic images, with the main difference being that our conditional models
control more than one image feature, namely vendor, view, breast density, and breast area.
Additionally, regarding image diversity, even though the experiments of Pinaya et al. [22] do not completely correspond to ours, as we computed scores for various guidance scales and datasets, Table 3 shows a lower MS-SSIM score for our generated images than their reported 0.5356, even for the largest guidance scales, indicating higher image diversity.
Nevertheless, selecting the proper diffusion hyperparameters is challenging as, in some
cases, the model may generate images with errors. Figure 20 shows some of the common
generation issues faced by our SD models, in this case coming from the same text prompt
at a guidance scale of 4.
These generation errors have different possible solutions, each with its own drawbacks and limitations. For instance, the noise residuals in Figure 20a could be removed if the inference diffusion steps are increased, leading to longer generation times. The gray background issue in Figure 20b could be solved by using a negative prompt, which essentially specifies features in the image that must be avoided, such as “white background” or “no black background”. The unsatisfied prompt description of Figure 20c and the nonsensical generation of Figure 20d could be solved by increasing the guidance scale at the expense of generation diversity, as previously discussed. Therefore, the optimal diffusion hyperparameters must be determined empirically for each individual model. For instance, as Table 3 shows, the optimal guidance scale value may not be the same across models. Moreover, in some cases, such as the Siemens model, the effect of the guidance scale on the MS-SSIM value may not be significant, and other image diversity metrics must be computed for a better-informed decision.
Another important outcome was the concept extrapolation property of the SD models. In this context, it means that semantic information can be shared between datasets at inference time. For instance, as seen in Figure 15, our model can generate a Siemens vendor mammogram while controlling the breast area, even though this characteristic was not labeled in the original VinDr dataset. This property is a powerful additional characteristic of diffusion-based models, as it increases the overall generation diversity of the model. Further exploration of this property is out of the scope of this work and will be addressed in future work.
The limitations of this work include the reduced resolution of the synthetic mammograms, which limits the use of our synthetic images in CAD systems that require higher resolution, such as micro-calcification detection. The pixel depth was also reduced from its original 16 bits to 8 bits to match the pretrained model requirements, which leads to some information loss and reduces the overall image contrast. Regarding model assessment, image synthesis is difficult to evaluate if not linked to a final application, such as a complete CAD pipeline implementation using synthetic images, but preliminary results are promising. Specific implementations include the use of the lesion inpainting pipeline for data augmentation in CAD systems where the number of lesion cases is much lower than the number of normal cases, as happens in a screening population. However, this implementation may require the generation of other mammographic findings, such as microcalcifications, architectural distortions, asymmetries, etc.
5. Conclusions
Conditional diffusion models are a promising generative approach for synthesizing mammograms with specific characteristics and properties. In this work, we showed that fine-tuning a stable diffusion (SD) model pretrained on natural images with mammography data is an effective strategy for the controlled generation of synthetic mammographic images. Additionally, we found that SD can be modified for the inpainting of synthetic lesions over healthy mammograms. The developed inpainting pipeline requires the modification of the input latent representation to include a mask so that the generation process focuses only on the masked region. Inference pipelines for these diffusion models were made accessible and ready to use through graphical user interfaces, and their weights and code can be found in the authors' personal repositories.
We found initial evidence that synthetic images coming from our SD implementation
could potentially be used for CAD systems in need of specific image characteristics, such
as breast density, breast area, mammographic unit vendor, and the presence of lesions.
A radiological assessment showed that the initial image quality is comparable with real mammograms, which indicates the prospective use of synthetic images for the training of radiologists or other educational usages. Moreover, explainable AI models allowed us to explore the behavior of a lesion classification model when processing our synthetic images, showing the sensitivity of the model to the synthesized lesions. Finally, we discovered a property of stable diffusion models we called concept extrapolation: when trained using datasets of populations with different labeled characteristics (breast density or area), stable diffusion allows controlling the generation process using labels not originally included in the training set of a specific dataset, augmenting the model's generation customization.
Future work includes using the pretrained weights of the most recent SD models, such as the 768 × 768 resolution model or the upcoming Stable Diffusion v3, which could allow higher-resolution and more diverse mammography generation. MAM-E could be used to generate population-based synthetic images for algorithm development and evaluation. Specifically, we plan to train complete CAD pipelines with and without synthetic images to analyze the improvement in model performance when augmenting with synthetic images. In terms of assessment, we aim to perform a new radiological assessment including more radiologists to avoid bias and provide a more representative clinical opinion of the images. For the lesion inpainting task, we plan to add text conditionals to the training process to generate specific types of mammographic findings, such as microcalcifications and architectural distortions, with different biopsy diagnoses. Finally, we plan to explore the effects and benefits of concept extrapolation in more detail.
Author Contributions: Conceptualization, R.M. and R.M.-d.-A.; methodology, R.M. and R.M.-d.-A.;
software, R.M.-d.-A.; validation, J.C.V. and K.S.-M.; formal analysis, J.C.V.; investigation, R.M.-d.-A.;
resources, R.M.; data curation, R.M.-d.-A.; writing—original draft preparation, R.M.-d.-A.; writing—
review and editing, R.M.; supervision, R.M. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was possible thanks to funding for the Erasmus+: Erasmus Mundus Joint
Master’s Degree (EMJMD) scholarship (2021–2023), with project reference 610600-EPP-1-2019-1-ES-
EPPKA1-JMD-MOB, the project VICTORIA, “PID2021-123390OB-C21” from the Ministerio de Ciencia
e Innovación of Spain, and the Joan Oró grant for the hiring of pre-doctoral research staff in training
(2023) “ref. BDNS 657443” from the Government of Catalonia.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: This study used a de-identified image database, and no informed consent from the patients was required.
Data Availability Statement: Both the OMI-DB [25] and VinDr [26] datasets used in this work are publicly available. The subset used in our experiments can be obtained from the dataset analysis found in the project repository and can also be obtained from the corresponding author upon reasonable request.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Kazerouni, A.; Aghdam, E.K.; Heidari, M.; Azad, R.; Fayyaz, M.; Hacihaliloglu, I.; Merhof, D. Diffusion models in medical
imaging: A comprehensive survey. Med. Image Anal. 2023, 88, 102846. [CrossRef]
2. Müller-Franzes, G.; Niehues, J.M.; Khader, F.; Arasteh, S.T.; Haarburger, C.; Kuhl, C.; Wang, T.; Han, T.; Nebelung, S.; Kather, J.N.;
et al. Diffusion Probabilistic Models beat GANs on Medical Images. arXiv 2022, arXiv:2212.07501.
3. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems;
Curran Associates, Inc.: Melbourne, VIC, Australia, 2021; Volume 34, pp. 8780–8794.
4. Dorjsembe, Z.; Odonchimed, S.; Xiao, F. Three-Dimensional Medical Image Synthesis with Denoising Diffusion Probabilistic
Models. Med. Imaging Deep Learn. 2022. Available online: https://openreview.net/forum?id=Oz7lKWVh45H (accessed on 21
March 2024).
5. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems; Curran
Associates, Inc.: Melbourne, VIC, Australia, 2020; Volume 33, pp. 6840–6851.
6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In
Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA,
18–24 June 2022; pp. 10674–10685. [CrossRef]
7. Pinaya, W.H.L.; Tudosiu, P.D.; Dafflon, J.; da Costa, P.F.; Fernandez, V.; Nachev, P.; Ourselin, S.; Cardoso, M.J. Brain Imaging Generation
with Latent Diffusion Models. In Proceedings of the Deep Generative Models: Second MICCAI Workshop, DGM4MICCAI 2022,
Held in Conjunction with MICCAI 2022, Singapore, 22 September 2022; Proceedings; Springer: Berlin/Heidelberg, Germany, 2022.
8. Bynagari, N.B. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Asian J. Appl. Sci. Eng.
2019, 8, 25–34. [CrossRef]
9. Chambon, P.; Bluethgen, C.; Delbrouck, J.B.; Van der Sluijs, R.; Połacin, M.; Chaves, J.M.Z.; Abraham, T.M.; Purohit, S.;
Langlotz, C.P.; Chaudhari, A. RoentGen: Vision-Language Foundation Model for Chest X-ray Generation. arXiv 2022,
arXiv:2211.12737.
10. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models
for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Vancouver, BC, Canada, 17–24 June 2023.
11. Hung, A.L.Y.; Zhao, K.; Zheng, H.; Yan, R.; Raman, S.S.; Terzopoulos, D.; Sung, K. Med-cDiff: Conditional Medical Image
Generation with Diffusion Models. Bioengineering 2023, 10, 1258. [CrossRef] [PubMed]
12. Rouzrokh, P.; Khosravi, B.; Faghani, S.; Moassefi, M.; Vahdati, S.; Erickson, B.J. Multitask Brain Tumor Inpainting with Diffusion
Models: A Methodological Report. arXiv 2022, arXiv:2210.12113.
13. Katzen, J.; Dodelzon, K. A review of computer aided detection in mammography. Clin. Imaging 2018, 52, 305–309. [CrossRef]
[PubMed]
14. Lopez-Almazan, H.; Javier Pérez-Benito, F.; Larroza, A.; Perez-Cortes, J.C.; Pollan, M.; Perez-Gomez, B.; Salas Trejo, D.; Casals, M.;
Llobet, R. A deep learning framework to classify breast density with noisy labels regularization. Comput. Methods Programs
Biomed. 2022, 221, 106885. [CrossRef] [PubMed]
15. Nguyen, H.T.X.; Tran, S.B.; Nguyen, D.B.; Pham, H.H.; Nguyen, H.Q. A novel multi-view deep learning approach for BI-RADS
and density assessment of mammograms. In Proceedings of the 2022 44th Annual International Conference of the IEEE
Engineering in Medicine & Biology Society (EMBC), Glasgow, Scotland, UK, 11–15 July 2022; pp. 2144–2148. [CrossRef]
16. Ma, W.; Zhao, Y.; Ji, Y.; Guo, X.; Jian, X.; Liu, P.; Wu, S. Breast Cancer Molecular Subtype Prediction by Mammographic Radiomic
Features. Acad. Radiol. 2019, 26, 196–201. [CrossRef] [PubMed]
17. Bonou, M.A.; Azouz, Z.B.; Nawres, K.; Sètchéou, R.; Dossou, J. Differentiation of Breast Cancer Immunohistochemical Status
Using Digital Mammography Radiomics Features. Health Sci. (IJMHRS) 2023, 12, 12–19.
18. Korkinof, D.; Rijken, T.; O’Neill, M.; Yearsley, J.; Harvey, H.; Glocker, B. High-Resolution Mammogram Synthesis using
Progressive Generative Adversarial Networks. arXiv 2019, arXiv:1807.03401.
19. Desai, S.D.; Giraddi, S.; Verma, N.; Gupta, P.; Ramya, S. Breast Cancer Detection Using GAN for Limited Labeled Dataset. In
Proceedings of the 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN),
Bhimtal, India, 25–26 September 2020; pp. 34–39. [CrossRef]
20. Alyafi, B.; Diaz, O.; Marti, R. DCGANs for Realistic Breast Mass Augmentation in X-ray Mammography. In Proceedings of the
Medical Imaging 2020: Computer-Aided Diagnosis, Houston, TX, USA, 16–19 February 2020; SPIE: Bellingham, WA, USA, 2019;
Volume 11314.
21. Montoya-del Angel, R.; Martí, R. MAM-E: Mammographic synthetic image generation with diffusion models. arXiv 2023,
arXiv:2311.09822.
22. Pinaya, W.H.L.; Graham, M.S.; Kerfoot, E.; Tudosiu, P.D.; Dafflon, J.; Fernandez, V.; Sanchez, P.; Wolleb, J.; da Costa, P.F.; Patel, A.;
et al. Generative AI for Medical Imaging: Extending the MONAI Framework. arXiv 2023, arXiv:2307.15208.
23. Kidder, B.L. Advanced image generation for cancer using diffusion models. Cancer Biol. 2023, preprint. [CrossRef]
24. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. In
Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 8821–8831.
25. Halling-Brown, M.D.; Warren, L.M.; Ward, D.; Lewis, E.; Mackenzie, A.; Wallis, M.G.; Wilkinson, L.S.; Given-Wilson, R.M.;
McAvinchey, R.; Young, K.C. OPTIMAM Mammography Image Database: A Large-Scale Resource of Mammography Images
and Clinical Data. Radiol. Artif. Intell. 2021, 3, e200103. [CrossRef] [PubMed]
26. Nguyen, H.T.; Nguyen, H.Q.; Pham, H.H.; Lam, K.; Le, L.T.; Dao, M.; Vu, V. VinDr-Mammo: A large-scale benchmark dataset for
computer-aided diagnosis in full-field digital mammography. Sci. Data 2023, 10, 277. [CrossRef]
27. Melnikow, J.; Fenton, J.J.; Whitlock, E.P.; Miglioretti, D.L.; Weyrich, M.S.; Thompson, J.H.; Shah, K. Supplemental Screening for
Breast Cancer in Women With Dense Breasts: A Systematic Review for the U.S.; Center for Healthcare Policy and Research, University
of California: Sacramento, CA, USA, 2016.
28. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermo-
dynamics. arXiv 2015, arXiv:1503.03585.
29. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114.
30. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning
Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
31. Frans, K.; Soros, L.B.; Witkowski, O. CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders. arXiv
2021, arXiv:2106.14843.
32. Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in
Around 10 Steps. arXiv 2022, arXiv:2206.00927.
33. Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. arXiv 2022, arXiv:2207.12598.
34. Chambon, P.; Bluethgen, C.; Langlotz, C.P.; Chaudhari, A. Adapting Pretrained Vision-Language Foundational Models to Medical
Imaging Domains. arXiv 2022, arXiv:2210.04133.
35. Dettmers, T.; Lewis, M.; Shleifer, S.; Zettlemoyer, L. 8-bit Optimizers via Block-wise Quantization. arXiv 2022, arXiv:2110.02861.
36. Abid, A.; Abdalla, A.; Abid, A.; Khan, D.; Alfozan, A.; Zou, J. Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild.
arXiv 2019, arXiv:1906.02569.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.