Lossy Image Compression with Foundation Diffusion Models

Lucas Relic 1,2, Roberto Azevedo 2, Markus Gross 1,2, and Christopher Schroers 2

1 ETH Zürich, Switzerland
{lucas.relic, grossm}@inf.ethz.ch
2 DisneyResearch|Studios, Zürich, Switzerland
{roberto.azevedo, christopher.schroers}@disneyresearch.com
1 Introduction
In today’s digital era, multimedia content dominates global internet traffic, mak-
ing the development of efficient compression algorithms increasingly important.
Traditional codecs, which use handcrafted transformations [1, 42], are now out-
performed by data-driven neural image compression (NIC) [7,8,33] methods that
optimize for both rate and distortion. Nevertheless, most current methods still
produce blurry and unrealistic images in extremely low bitrate settings [5,13,24].
This is the result of such methods being optimized for rate-distortion, where dis-
tortion is measured with pixel-wise metrics like mean-squared error (MSE) [10].
The rate-distortion-realism triple tradeoff [4, 10, 32] formalizes this phenomenon and states that optimizing for low distortion (i.e., pixel-wise error) necessarily results in unrealistic images, i.e., images that do not fall on the manifold of natural images. (We use "realism" and "perception" interchangeably, representing the similarity of the reconstructed image to other natural images.) In low-bitrate scenarios, however, it can be preferable to decode realistic (and thus more perceptually pleasant) images, even if that means lower performance in pixel-wise metrics [6].
Fig. 1: Visual examples of our proposed method and various classes of image compression codecs. Traditional (BPG [1]) and autoencoder-based (ELIC [20]) codecs suffer from blocking or blurring, and reconstructions from GAN-based (ILLM [34]) and previous diffusion-based (HFD [24]) methods contain high-frequency artifacts. Our proposal is as realistic as the original image while recovering a high level of detail. Bitrates are shown relative to our method. Best viewed digitally.
Generative compression methods [5, 32, 34] try to reconstruct realistic im-
ages by introducing GAN architectures and adversarial or perceptual losses. In
image generation, however, diffusion models [15] have now emerged as a power-
ful alternative, outperforming GANs [15] and achieving state-of-the-art realism
scores [37]. Diffusion models are thus a natural fit for generative image compres-
sion architectures targeting low-bitrate scenarios. Yet, their applicability and
adoption is hindered by large model size and prohibitively expensive training
times, requiring multiple GPU years and hundreds of thousands of dollars [14].
The introduction of open-source foundation models [11] has the potential to de-
mocratize these powerful models and provide strong priors that can be explored
for feature extraction or transfer learning on a variety of domains [19], for exam-
ple image generation [45], depth estimation [27], and even music generation [16].
The use of foundation diffusion models as a prior for image compression, how-
ever, is still an underexplored research area. Some works address this task [12,30]
but operate at extremely low bitrates (less than 0.03 bpp) where reconstructed
image content significantly differs from the original, limiting applicability. Only
Careil et al. [13] apply foundation diffusion models in a practical compression setting. However, they modify the base model architecture and thus require fine-tuning on a large dataset containing millions of images. Other works, which train the diffusion component from scratch, operate at relatively high bitrates [44]
where low pixel-wise distortion can be achieved, or perform image enhance-
ment [18, 24] rather than native end-to-end compression. Notably, all current work on diffusion-based image compression samples the output image from
pure noise, requiring the full diffusion sampling process, which can take up to
one minute per image [44] due to its iterative nature.
To advance the state of the art, we propose a novel image compression codec
that uses foundation latent diffusion models as a means to synthesize lost details,
particularly at low bitrate. Leveraging the similarities between quantization er-
ror and noise [7], we transmit a quantized image latent and perform a subset of
denoising steps at the receiver corresponding to the noise level (i.e., quantiza-
tion error) of the latent (similar to diffusion image editing techniques [31]). The
key components of our proposal are: i) the autoencoder from a foundation latent
diffusion model to transform an input image to a lower-dimensional latent space;
ii) a learned adaptive quantization and entropy encoder, enabling inference-time
control over bitrate within a single model; iii) a learned method to predict the
ideal denoising timestep, which allows for balancing between transmission cost
and reconstruction quality; and iv) a diffusion decoding process to synthesize
information lost during quantization. Unlike previous work, our formulation re-
quires only a fraction of iterative diffusion steps and can be trained on a dataset
of fewer than 100k images. We also directly optimize a distortion objective be-
tween input and reconstructed images, enforcing coherency to the input image
while maintaining highly realistic reconstructions (Fig. 1) due to the diffusion
backbone.
In sum, our contributions are:
– We propose a novel latent diffusion-based lossy image compression pipeline
that is able to produce highly realistic and detailed image reconstructions at
low bitrates.
– To achieve this, we introduce a novel parameter estimation module that si-
multaneously learns adaptive quantization parameters as well as the ideal
number of denoising diffusion steps, allowing a faithful and realistic recon-
struction for a range of target bitrates with a single model.
– We extensively evaluate state-of-the-art generative compression methods on
several datasets via both objective metrics and a user study. To the best of
our knowledge, this is the first user study that compares generative diffu-
sion models for image compression. Our experiments verify that our method
achieves state-of-the-art visual quality as measured by FID and that end users
subjectively prefer our reconstructions.
2 Related Work
Although diffusion models have seen significant successes in the machine learning
community, their use in the image compression domain is still limited.
Yang and Mandt [44] proposed the first transform-coding-based lossy com-
pression codec using diffusion models. They condition a diffusion model on a quantized latent representation of the input image, transmitted in the bitstream.
3 Background
Neural Image Compression. Lossy neural image codecs (NIC) are commonly
modeled as autoencoders, in which an encoder E transforms an image x to a
quantized latent ŷ = ⌊E(x)⌉, while a decoder D reconstructs an approximation
of the original image x̂ = D(ŷ). Based on Shannon’s rate-distortion theory [38],
during training, E and D are optimized to minimize the rate-distortion trade-off:
\mathcal{L}_{total} = \mathcal{L}_{bits}(\hat{\mathbf{y}}) + \lambda \mathcal{L}_{rec}(\mathbf{x}, \hat{\mathbf{x}})    (1)
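As a concrete reference, the sketch below expresses this objective in PyTorch. The encoder, decoder, and entropy model are illustrative placeholders, and the straight-through rounding is only one common way to keep quantization differentiable during training, not necessarily the strategy of any particular codec discussed here.

```python
import torch
import torch.nn.functional as F

def rd_loss(x, encoder, decoder, entropy_model, lam):
    """Generic rate-distortion objective of Eq. (1) for a neural codec."""
    y = encoder(x)                                      # analysis transform E(x)
    y_hat = y + (torch.round(y) - y).detach()           # straight-through rounding, ŷ = ⌊E(x)⌉
    likelihoods = entropy_model(y_hat)                  # element-wise likelihoods P(ŷ)
    rate = -torch.log2(likelihoods).sum() / x.shape[0]  # estimated bits per image (L_bits)
    x_hat = decoder(y_hat)                              # synthesis transform D(ŷ)
    distortion = F.mse_loss(x_hat, x)                   # L_rec as pixel-wise MSE
    return rate + lam * distortion                      # L_bits + λ · L_rec
```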
Fig. 2: Rate-distortion (Fig. 2a) and visual (Fig. 2b) comparisons of our method to
naively quantizing and entropy coding the latents of a latent diffusion model (Stable
Diffusion [37]). The LDM baseline requires nearly triple the bits to achieve comparable
performance to our method and severely degrades the image at lower bitrates. Perform-
ing additional diffusion steps still does not produce a realistic image (Fig. 2b, right).
The color gradient of the dots in Fig. 2a represents the number of denoising steps.
Diffusion. Diffusion models (DMs) [23, 39] are a class of generative models that define an iterative process q(x_t | x_{t-1}) that gradually destroys an input signal as t increases, and try to model the reverse process q(x_{t-1} | x_t). Empirically, the forward process is performed by adding Gaussian noise to the signal; thus, the reverse process becomes a denoising task. The diffusion model M_θ approximates the reverse process by estimating the noise ε_θ present in the image and using it to predict the previous step of the process:
\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}}\,\tilde{\mathbf{x}}_0 + \sqrt{1-\alpha_{t-1}}\,\epsilon_\theta, \quad \text{with} \quad \tilde{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\,\epsilon_\theta}{\sqrt{\alpha_t}}    (2)
where \tilde{\mathbf{x}}_0 is the predicted fully denoised sample from any given timestep t. Eq. (2) can be simplified to
\mathbf{x}_{t-1} = \mathcal{M}_\theta(\mathbf{x}_t, t)    (3)
where M_θ(·) is one forward pass of the diffusion model. It is therefore possible to sample from a DM by initializing x_T ∼ N(0, I) and performing T forward passes to produce a fully denoised image.
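For illustration, the sketch below implements one deterministic (DDIM-style) reverse step of Eq. (2) and the full sampling loop of Eq. (3). Here `model` stands in for M_θ, `alphas` is assumed to be a 1-D tensor of the cumulative noise-schedule values α_t from Eq. (2), and `timesteps` is a decreasing list of integer steps.

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas):
    """One reverse step of Eq. (2): predict ε_θ, form x̃_0, and step to x_{t-1}."""
    eps = model(x_t, t)                                   # ε_θ(x_t, t)
    x0_pred = (x_t - torch.sqrt(1.0 - alphas[t]) * eps) / torch.sqrt(alphas[t])
    return torch.sqrt(alphas[t_prev]) * x0_pred + torch.sqrt(1.0 - alphas[t_prev]) * eps

@torch.no_grad()
def sample(model, shape, timesteps, alphas):
    """Full generation by iterating Eq. (3): start from x_T ~ N(0, I) and denoise."""
    x = torch.randn(shape)
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):  # e.g. [T, ..., 1, 0]
        x = ddim_step(model, x, t, t_prev, alphas)
    return x
```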
Latent diffusion models (LDMs) [37] improve memory and computational
efficiency of DMs by moving the diffusion process to a spatially lower dimensional
latent space, encoded by a pre-trained variational autoencoder (VAE) [28]. Such
a latent space provides performance similar to that of the corresponding pixel-space DMs while requiring fewer parameters (and memory) [9]. These types of DMs are
trained in a VAE latent space where y = Evae (x) and a sampled latent y0 can
be decoded back to an image x̂ = Dvae (y0 ).
Since LDMs are based on VAEs, they can also be considered a type of
compression method. However, their applicability in lossy image compression
is hindered by inherent challenges. LDMs lack explicit training to produce dis-
crete representations, resulting in highly distorted reconstructions when used
for lossy compression [24], and cannot navigate the rate-distortion tradeoff. To
highlight such issues, Fig. 2 shows the performance of the same LDM used by
our method (without modifications) as a compression codec compared to our ap-
proach optimized for lossy image compression. In this experiment, we manually
sweep over a range of quantization and diffusion timestep parameters, encoding
the images under the different configurations. Specifically, we encode to the la-
tent space, quantize according to the chosen parameters, compress with zlib [2],
run the chosen number of denoising diffusion steps, and decode back to image
space. As shown, the unmodified LDM requires nearly 3x the bits to achieve com-
parable performance to our method and cannot produce realistic images at low
bitrates, regardless of the number of diffusion steps performed (Fig. 2b). Thus,
deploying LDMs for compression requires thoughtful consideration to maximize
their effectiveness.
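The baseline sweep described above can be sketched in a few lines. Here `vae_encode`, `vae_decode`, and `denoise` are placeholders for the pre-trained Stable Diffusion components, and the uniform scalar quantizer with a hand-picked step size is an assumption of this illustration rather than a detail of the experiment.

```python
import zlib
import numpy as np
import torch

def naive_ldm_codec(x, vae_encode, vae_decode, denoise, step_size, n_steps):
    """Use an unmodified LDM as a codec: quantize the latent, compress it with
    zlib, then run a few denoising steps before decoding back to image space."""
    y = vae_encode(x)                                     # image -> latent
    q = torch.round(y / step_size).to(torch.int8)         # uniform scalar quantization
    bitstream = zlib.compress(q.cpu().numpy().tobytes())  # lossless coding
    bpp = 8 * len(bitstream) / (x.shape[-2] * x.shape[-1])

    flat = np.frombuffer(zlib.decompress(bitstream), dtype=np.int8).copy()
    y_hat = torch.from_numpy(flat).reshape(q.shape).float() * step_size  # dequantize
    y_hat = denoise(y_hat, n_steps)   # treat residual quantization error as noise
    return vae_decode(y_hat), bpp
```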
4 Method
Fig. 3 shows the high-level architecture of our method. It is composed of a vari-
ational autoencoder (containing an encoder, Evae, and a decoder, Dvae), a quan-
tization and diffusion timestep parameter estimation network (Pϕ ), an entropy
model, and a latent diffusion model (Mθ ).
Our encoding process is performed as follows: First, the image x is encoded
into its latent representation y = Evae (x). Then, y is quantized by an adaptive
quantization method parameterized by γ (i.e., ẑ = Q(y, γ)). Finally, ẑ is entropy
encoded and stored or transmitted.
During decoding, the inverse quantization transformation computes ŷt =
Q−1 (ẑ, γ), which is then used as input to the generative LDM process over t
denoising steps to recover an approximation ŷ0 of the original latent represen-
tation y. Finally, ŷ0 is decoded by the VAE decoder into a reconstructed image
x̂ = Dvae (ŷ0 ). Algorithm 1 shows the complete encoding/decoding process.
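The sketch below mirrors this flow in the spirit of Algorithm 1; every argument is an illustrative stand-in for the corresponding component (E_vae, P_φ, Q, the entropy coder, M_θ, D_vae) rather than a concrete implementation.

```python
def encode(x, E_vae, P_phi, Q, entropy_encode):
    """Sender: image -> (bitstream, side information)."""
    y = E_vae(x)                            # latent representation y = E_vae(x)
    gamma, t = P_phi(y)                     # predicted quantization params γ and timestep t
    z_hat = Q(y, gamma)                     # ẑ = Q(y, γ)
    return entropy_encode(z_hat), gamma, t  # γ and t are transmitted as side information

def decode(bitstream, gamma, t, entropy_decode, Q_inv, M_theta, D_vae):
    """Receiver: (bitstream, side information) -> reconstructed image."""
    z_hat = entropy_decode(bitstream)
    y_t = Q_inv(z_hat, gamma)               # ŷ_t = Q⁻¹(ẑ, γ)
    y_0 = M_theta(y_t, t)                   # run t denoising steps on the noisy latent
    return D_vae(y_0)                       # x̂ = D_vae(ŷ_0)
```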
Fig. 3: Overview of our approach. The input image x is encoded into latent space
and transformed according to predicted parameters γ before quantization and entropy
coding. The quantized representation ẑ is transmitted with γ and predicted diffusion
timestep t as side information. At the receiver, the latent is inverse transformed, diffused
over t steps, and decoded back to image space.
To avoid extensive training time, we use Stable Diffusion v2.1 [37] for certain modules of our architecture, particularly Evae, Dvae, and Mθ. Note that our method works independently of the base model. We select Stable Diffusion as it is one of the few foundation latent diffusion models with publicly available code and model weights.
\hat{\mathbf{z}} = Q(\mathbf{y}, \gamma) = \lfloor \mathcal{T}(\mathbf{y}, \gamma) \rceil    (4)
Fig. 4: Intermediate states of the sequential denoising process in our decoder. Our
method predicts the optimal number of denoising steps, highlighted in red, to produce
the most perceptually pleasing output. Best viewed digitally.
\hat{\mathbf{y}}_t = Q^{-1}(\hat{\mathbf{z}}, \gamma) = \mathcal{T}^{-1}(\hat{\mathbf{z}}, \gamma)    (5)
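As an illustration only, the sketch below instantiates the transform T of Eqs. (4) and (5) as a per-channel affine scaling with γ = (scale, offset); this choice is an assumption of the example, a simple stand-in rather than the exact transform in our pipeline.

```python
import torch

def quantize(y, gamma):
    """ẑ = ⌊T(y, γ)⌉ with an illustrative per-channel affine transform T."""
    scale, offset = gamma              # e.g. shapes (C, 1, 1), broadcast over H×W
    return torch.round((y - offset) * scale)

def dequantize(z_hat, gamma):
    """ŷ_t = T⁻¹(ẑ, γ): invert the affine transform (the rounding error remains)."""
    scale, offset = gamma
    return z_hat / scale + offset
```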
4.4 Optimization
Following Eq. 1, we jointly optimize the tradeoff between the estimated coding
length of the bitstream and the quality of the reconstruction:
\mathcal{L} = -\log_2 P(\hat{\mathbf{z}}) + \lambda \,\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2.    (6)
We train our model on the Vimeo-90k [43] dataset and randomly crop the
images to 256×256px in each epoch. Our model is optimized for 300,000 steps
with learning rate 1e-4. We randomly sample λ ∈ {1, 5, 10, 20} at each gradient
update to train for multiple target bitrates within a single model.
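A schematic version of this multi-rate training loop is shown below; the assumption that a single `model` callable returns both the estimated rate −log₂ P(ẑ) and the reconstruction x̂ is made purely for brevity.

```python
import random
import torch

def train(model, data_loader, num_steps=300_000, lambdas=(1, 5, 10, 20)):
    """Optimize Eq. (6) with a freshly sampled λ at every gradient update."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _, x in zip(range(num_steps), data_loader):
        lam = random.choice(lambdas)                      # sample target trade-off point
        rate, x_hat = model(x)                            # -log2 P(ẑ) and reconstruction x̂
        loss = rate + lam * torch.mean((x - x_hat) ** 2)  # Eq. (6)
        opt.zero_grad()
        loss.backward()
        opt.step()
```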
While our main motivation was to utilize foundation models without signif-
icant modification, we do make minor adjustments in our pipeline to allow for
optimization of trainable modules upstream. During training, it is prohibitively
expensive to backpropagate the gradient through multiple passes of the dif-
fusion model as it runs during DDIM [40] sampling. Therefore, we perform
only one DDIM sampling iteration and directly use x̃0 as the fully denoised
data (see Eq. (2) and Appendix B). For the low timestep range our model op-
erates in, we observe that the difference between x̃0 and the true fully denoised
data x0 is minimal and sufficient for the optimization of the parameter esti-
mation module. At inference time, we perform the standard iterative DDIM
process. Additionally, during the diffusion sampling process, the timestep t is
Fig. 5: Rate-distortion (PSNR, MS-SSIM, LPIPS) and rate-realism (FID, computed on MS-COCO 30k) performance as a function of bitrate (bpp) for our method and the baselines.
5 Experiments
We compare our method to state-of-the-art generative and diffusion-based image
compression codecs via objective metrics and a subjective user study. Fig. 7 also
shows qualitative comparisons of our method against the baselines.
Metrics. We evaluate our proposal and baselines using: PSNR and MS-SSIM,
as measures of pixel-wise distortion; LPIPS, as a more perceptually-oriented
distortion metric; and FID [22], to evaluate the realism of the reconstructed
images. FID measures the similarity between the distributions of source and
distorted images, and thus has been widely used as a measure of realism [4, 32],
particularly for image generation tasks. Since FID requires a large number of
samples of both source and distorted images, we focus the FID comparison only
on MS-COCO 30k.
User study. As quantitative metrics often fail to capture the perceptual qual-
ity of image reconstructions [10, 41], we further perform a user study to assess
the visual quality of our results. The study is set up as a two-alternative forced
choice (2AFC), where each participant is shown the source image and recon-
structions from two methods and is asked to choose the reconstruction they
prefer. We select 10 samples from the Kodak dataset with the smallest difference in bitrate between our method and the other generative model baselines, namely CDC, ILLM, and HFD. (In a pilot study, we also considered including BPG and HiFiC; however, since they consistently performed worse than the other methods in our target bitrate range, and to avoid fatigue from long rating sessions, targeting a session time of around 15–20 min, we removed them from the study.) Thus, in a session, a participant is requested to do 60 pairwise comparisons. Each sample is center-cropped to 512×512px so that all images being compared are shown side-by-side at native resolution, i.e., without resampling. (The images used in the study and their respective bitrates can be found in the Supplementary Materials.) Participants can freely zoom and pan the images in a
synchronized way. However, because the methods locally provide different types
of reconstructions, we ask the participants to inspect the images in their entirety
before rating.
Following [32], we use the Elo [17] rating system to rank the methods. Elo
matches can be organized into tournaments, where ranking updates are applied
only at the end of the tournament. We perform two separate experiments, where a tournament is considered to be (1) a single comparison or (2) all image comparisons from the same user. As Elo scores depend on game order, a Monte Carlo
simulation is performed over 10,000 iterations, and we report the median score
for each method.
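For reference, the sketch below illustrates the single-comparison tournament variant of this procedure: each 2AFC outcome is one Elo game, the game order is shuffled in every Monte Carlo iteration, and the median rating per method is reported. The K-factor and initial rating are assumptions of the sketch, not values used in the study.

```python
import random
import statistics

def elo_ratings(games, k=32, init=1500, iters=10_000):
    """games: list of (winner, loser) method-name pairs from the 2AFC study.
    Returns the median Elo score of each method over shuffled game orders."""
    history = {m: [] for g in games for m in g}
    for _ in range(iters):
        ratings = {m: init for m in history}
        for winner, loser in random.sample(games, len(games)):  # Elo depends on order
            expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
            ratings[winner] += k * (1 - expected)
            ratings[loser] -= k * (1 - expected)
        for m, r in ratings.items():
            history[m].append(r)
    return {m: statistics.median(rs) for m, rs in history.items()}
```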
Fig. 6: Computed Elo ratings from the user study with Elo tournaments set for each comparison (left) or for each participant (right). Higher is better. The box extends to the first and third quartiles and the whiskers 1.5 × IQR further. Average bitrates: Ours 0.103 bpp, CDC 0.248 bpp, ILLM 0.153 bpp, HFD 0.108 bpp.
5.1 Results
Quantitative results. Fig. 5 shows the rate-distortion (as measured by PSNR,
MS-SSIM, and LPIPS) and rate-realism (as measured by FID) curves of our method and the baselines. Our method sets a new state of the art in realism of re-
constructed images, outperforming all baselines in FID-bitrate curves. In some
distortion metrics (namely, LPIPS and MS-SSIM), we outperform all diffusion-
based codecs while remaining competitive with the highest-performing genera-
tive codecs. As expected, our method and other generative methods suffer when
measured in PSNR as we favor perceptually pleasing reconstructions instead of
exact replication of detail (see Sec. 3).
User study. Fig. 6 shows the outcome of our user study. The methods are
ordered by human preference according to Elo scores. The average bitrate of the images for each method is reported in Fig. 6. As can be
seen in the Elo scores, our method significantly outperforms all the others, even
compared to CDC, which uses on average double the bits of our method. This
remains true regardless of the Elo tournament strategy used.
Fig. 7: Qualitative comparison of our method to the baselines. Images are labeled as
Method@bpp (bpp is also shown as a percentage of our method). Best viewed digitally.
We measure the average encoding/decoding time (excluding entropy coding) for all images from
the Kodak dataset. All benchmarks were performed on an NVIDIA RTX 3090
GPU. Our method processes an image in 3.49 seconds, nearly twice as fast as
CDC, which requires 6.87 seconds. ILLM processes an image in 0.27 seconds.
However, it is important to note that diffusion-based methods are in general
slower than other codecs due to their iterative denoising nature. Due to the Stable Diffusion backbone, our method is more complex than CDC (1.3B vs. 53.6M
parameters, respectively), while ILLM contains 181.5M parameters. However,
the large majority of our parameters come from the diffusion backbone. Our
trained modules (e.g., Pϕ and the entropy model) contain only 36M parameters.
Reducing the computational burden of diffusion models is an active research
area [25, 26], parallel to ours. Our method is fundamentally independent of the
chosen foundation model; thus, advances in reducing the complexity of Stable
Diffusion can ultimately also improve our proposal.
6 Conclusion
We proposed a novel lossy image compression codec based on foundation latent diffusion models that produces realistic image reconstructions at low to very low bitrates, outperforming previous generative codecs in both perceptual metrics
and subjective user preference. By combining the denoising capability of diffu-
sion models with the inherent characteristics of quantization noise, our method
predicts the ideal number of denoising steps to produce perceptually pleasing
reconstructions over a range of bitrates with a single model. Our formulation
has faster decoding time than previous diffusion codecs and, due to reusing
a foundation model backbone, a much lower training budget. Potential future
work includes the integration of more efficient backbone models [25, 26] and the
support for user control to navigate the rate-distortion-realism trade-off.
References
19. Graikos, A., Malkin, N., Jojic, N., Samaras, D.: Diffusion Models as Plug-and-Play
Priors. Advances in Neural Information Processing Systems 35, 14715–14728 (Dec
2022)
20. He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: ELIC: Efficient Learned
Image Compression With Unevenly Grouped Space-Channel Contextual Adaptive
Coding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 5718–5727 (2022)
21. Hendrycks, D., Gimpel, K.: Gaussian Error Linear Units (GELUs) (2023)
22. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained
by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems 30 (2017)
23. Ho, J., Jain, A., Abbeel, P.: Denoising Diffusion Probabilistic Models. In: Ad-
vances in Neural Information Processing Systems. vol. 33, pp. 6840–6851. Curran
Associates, Inc. (2020)
24. Hoogeboom, E., Agustsson, E., Mentzer, F., Versari, L., Toderici, G., Theis, L.:
High-Fidelity Image Compression with Score-based Generative Models (May 2023).
https://doi.org/10.48550/arXiv.2305.18231
25. Hoogeboom, E., Heek, J., Salimans, T.: Simple diffusion: End-to-end diffusion for
high resolution images. In: Proceedings of the 40th International Conference on Ma-
chine Learning. ICML’23, vol. 202, pp. 13213–13232. JMLR.org, Honolulu, Hawaii,
USA (Jul 2023)
26. Jabri, A., Fleet, D.J., Chen, T.: Scalable adaptive computation for iterative gener-
ation. In: Proceedings of the 40th International Conference on Machine Learning.
ICML’23, vol. 202, pp. 14569–14589. JMLR.org, Honolulu, Hawaii, USA (Jul 2023)
27. Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repur-
posing Diffusion-Based Image Generators for Monocular Depth Estimation (Dec
2023). https://doi.org/10.48550/arXiv.2312.02145
28. Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes. In: 2nd International
Conference on Learning Representations, ICLR (2014)
29. Kodak: PhotoCD PCD0992 (1993)
30. Lei, E., Uslu, Y.B., Hassani, H., Bidokhti, S.S.: Text + Sketch: Image Compression
at Ultra Low Rates (Jul 2023). https://doi.org/10.48550/arXiv.2307.01944
31. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided
Image Synthesis and Editing with Stochastic Differential Equations. In: Interna-
tional Conference on Learning Representations (2022)
32. Mentzer, F., Toderici, G., Tschannen, M., Agustsson, E.: High-Fidelity Generative
Image Compression (Oct 2020). https://doi.org/10.48550/arXiv.2006.09965
33. Minnen, D., Ballé, J., Toderici, G.: Joint Autoregressive and Hierarchical Priors
for Learned Image Compression (Sep 2018). https://doi.org/10.48550/arXiv.
1809.02736
34. Muckley, M.J., El-Nouby, A., Ullrich, K., Jegou, H., Verbeek, J.: Improving Statis-
tical Fidelity for Neural Image Compression with Implicit Local Likelihood Mod-
els. In: Proceedings of the 40th International Conference on Machine Learning. pp.
25426–25443. PMLR (Jul 2023)
35. Pasco, R.C.: Source coding algorithms for fast data compression (1976)
36. Qian, Y., Lin, M., Sun, X., Tan, Z., Jin, R.: Entroformer: A Transformer-based
Entropy Model for Learned Image Compression (Mar 2022). https://doi.org/
10.48550/arXiv.2202.05492
37. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-Resolution
Image Synthesis With Latent Diffusion Models. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
38. Shannon, C.E.: A Mathematical Theory of Communication. The Bell System Tech-
nical Journal 27, 379–423 (1948)
39. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep Unsuper-
vised Learning using Nonequilibrium Thermodynamics. In: Proceedings of the 32nd
International Conference on Machine Learning. pp. 2256–2265. PMLR (Jun 2015)
40. Song, J., Meng, C., Ermon, S.: Denoising Diffusion Implicit Models. In: Interna-
tional Conference on Learning Representations (Jan 2021)
41. Stein, G., Cresswell, J.C., Hosseinzadeh, R., Sui, Y., Ross, B.L., Villecroze, V., Liu,
Z., Caterini, A.L., Taylor, J.E.T., Loaiza-Ganem, G.: Exposing flaws of generative
model evaluation metrics and their unfair treatment of diffusion models (Oct 2023).
https://doi.org/10.48550/arXiv.2306.04675
42. Wallace, G.: The JPEG still picture compression standard. IEEE Transactions on
Consumer Electronics 38(1), xviii–xxxiv (1992). https://doi.org/10.1109/30.
125072
43. Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video Enhancement with
Task-Oriented Flow. International Journal of Computer Vision 127(8), 1106–1125
(Aug 2019). https://doi.org/10.1007/s11263-018-01144-2
44. Yang, R., Mandt, S.: Lossy Image Compression with Conditional Diffusion Models.
Advances in Neural Information Processing Systems 36, 64971–64995 (Dec 2023)
45. Zhang, L., Rao, A., Agrawala, M.: Adding Conditional Control to Text-to-Image
Diffusion Models. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 3836–3847 (2023)