0% found this document useful (0 votes)
21 views

IKIN - Diffusion Based Compression v1.8

Ikin codec

Uploaded by

jim.stiefelmaier
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views

IKIN - Diffusion Based Compression v1.8

Ikin codec

Uploaded by

jim.stiefelmaier
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

1

Diffusion-Based Compression
Bryan Westcott, Chris Vela

IKIN Inc., Austin, Texas

Abstract—This document presents a novel diffusion-based with any possible algorithm (including modern ML and AI)
video compression technique. We leverage the inherent expres- unless additional information is transmitted to the receiver.
siveness, photorealism and 3D awareness of denoising diffusion Unfortunately, that lossless lower bound is rather large and is
generative AI models as a powerful general-purpose prior that
only requires small complement of low-quality guidance data (or significantly (often several orders of magnitude) higher than
“hints”) to produce video with high spatio-temporal perceptual images (and videos) typically used for streaming.
quality and optional novel view synthesis. Our use of small, fine- For that reason, many lossy algorithms have been devel-
tuned, low-rank adaptations (e.g., LoRA) efficiently compresses oped, which permit additional size reduction at the expense
a batch of frames and, when combined with the general purpose of distortion in the reconstructed signal. However classical
base model, allows compression that can significantly exceed the
performance of state-of-the-art methods such as H.264 and H.265 algorithms must then decide how best to discard information,
in terms of perceptual and technical quality to compression size and some of the most successful methods exploit properties
ratios. of human perception by discarding (or reducing accuracy)
in information (and hence introducing distortion) that is less
apparent to humans.
I. I NTRODUCTION
This paper will present a novel diffusion-based video codec B. Perceptual compression models
and is arranged as follows:
Perhaps the most popular lossy image compression methods
• Sec. II provides theoretical background, context and
are JPEG for imagery and MP3 for audio, both of which
comparison to state of the art. exploit the frequency-dependent nature of human perception.
• Sec. III provides implementation details including neural
In the audio domain, psychoacoustic models [3] loosely rec-
architectures. ognize that humans are less sensitive to removal of higher
• Sec. IV provides a detailed description of metrics and
frequencies (including higher- frequency musical note harmon-
assessment methodology. ics). The MP3 audio standard converts sound from samples
• Sec. V provides a discussion of performance results.
in time to the frequency domain in order to focus loss of
• Sec. VI provides a discussion of our current and future
information to frequencies that humans are less sensitive to
work along with our approaches to address perceived and typically do not miss when attenuated or distorted.
limitations. A similar concept of “frequency” via the discrete co-
sine transform (DCT) applies to images as well and thus
II. BACKGROUND AND COMPARISON methods such as JPEG convert 8-by-8 blocks of pixels into
The term codec derives from the words coder/decoder the frequency domain. A similar psychovisual model loosely
and refers to the method used to prepare multimedia data recognizes that humans in general are less sensitive to high-
for transmission. Of interest to this paper are codecs which frequency spatial changes in intensity and color than to lower
provide video compression, however it will be instructive frequency, and further that errors in color (hue) are less notice-
to also consider static image compression. This section will able than errors in intensity [4]. By devoting less storage from
provide a brief history of both lossless and lossy video com- high-frequency information than lower-frequency information,
pression, including classic methods, more modern data-driven and by devoting less storage to color information rather than
neural deep learning approaches, and our novel diffusion-based brightness, JPEG is able to discard information that is less
methods. apparent to the user.
A key point that will be important later in this paper when
distinguishing our methods from other AI-based methods is
A. Lossless compression
that a more careful discarding of high-frequency and color
Shannon’s source coding theorem [1], [2] uses the concept information results in higher ratios of perceptual-quality to
of entropy from information theory to provide a lower bound file-size compared to a naive method such as simply down-
(minimum) on how small an image may be compressed before sampling pixels and interpolating upon reconstruction.
loss is virtually certain to occur; this limit cannot be exceeded
Copyright © 2023 IKIN, Inc. C. Temporal compression
Patents Pending An early video codec is MJPEG, which is essentially JPEG
Origination Date: 2023-December-01
Initial Version: 1.0, 2023-December-20 applied to each frame. This method does not account for
Current Revision: 1.8, 2024-September-23 any temporal redundancy and so, for example, a static scene
2

Experiment H.264 Guidance (Left) | Our Output (Right) H.265 Guidance (Left) | Our Output (Right)

Experiment 1
Balance
Size and Quality

Bandwidth Savings: 95% Bandwidth Savings: 94%


Guidance: 1024x1024
Quality Recovery: 84% Quality Recovery: 92%
Frame Rate: 60 FPS
Non-LoRA Guidance: 102 KiBps Non-LoRA Guidance: 69 Kibps

Experiment 2
Extreme Compression

Bandwidth Savings: 96% Bandwidth Savings: 94%


Guidance: 256x256
Quality Recovery: 61% Quality Recovery: 92%
Frame Rate: 60 FPS
Non-LoRA Guidance: 85 KiBps Non-LoRA Guidance: 77 KiBps
Figure 1. Visual comparison of conventionally-encoded (H.264/H.265) guidance (left of each image pair) and the decoded output of our method (right of each
pair). This comparison shows a sample of our method’s ability to maintain significantly higher perceptual quality while simultaneously providing significantly
higher bandwidth savings when compared to the conventionally-encoded H.264 and H.265 guidance. The results of two experiments are provided in the two
rows, where the second experiment (bottom row) shows the ability of our decoder to maintain high performance even when using conventionally-encoded
guidance at extremely low resolution (1/16 of total output pixels) and the lowest possible conventional encoder quality (CRF 51). A detailed discussion of
the methodology is provided in Sec. IV and a detailed discussion of the results in Sec. V. Detailed experiment parameters and metrics corresponding to this
output are in Table I and a high-level comparison of performance is provided in Fig. 4. The architecture associated with this output is shown in Fig. 2.

with no pixel changes would be completely encoded and information contained in the compressed file and not any ex-
transmitted for each frame and thus such a method which is ternal information. Any stream-specific ancillary information
useful for camera capture lacks practical applicability video should be considered as part of the transmitted information
transmission. size.
Modern standards include H.264 and H.265 [5] attempt to By contrast, most adult humans can leverage past experience
track temporal changes in the captured scene across frames to fill in missing information or “make sense of” corrupt
and thus reduce redundancies that would occur with a frame- imagery. For example, a human may recognize another hu-
by-frame compression method like MJPEG. For example, man figure from a heavily quantized image and understand
they may be efficient at tracking an object that is moving that a single pixel representing an eyeball typically contains
laterally to the camera frame. This often requires considerable eyelashes and an iris that is typically a subset of colors (e.g.,
computational expense on the part of the encoder, however. blue, brown, green); furthermore that human may even be able
However, unlike the human perceptual system, these meth- to mentally in-fill precise eye color if the individual in the
ods fundamentally lack strong 3D awareness and are thus blurry image is recognized as a familiar face.
limited in the information they are able to share across frames. The previously-discussed perceptual compression methods
To continue the example, these methods would not be efficient only use past experience in the form of human perception mod-
with an abrupt change from the front to the back of a person els that are explicitly programmed into the algorithm. While
which a human would typically understand and not interpret another class of more modern methods typically referred to as
as an entirely new person. compressive sensing (CS), showed that with prior knowledge
As we will show, our method exploits the implicit (or of the structure of a signal (e.g., that a signal is sparse in
optionally explicit) 3D awareness [6] of diffusion models to some domain) can be used to perfectly reconstruct below a
better provide spatio-temporal compression without expensive related fundamental limit (the Nyquist sampling limit) [7],
frame-to-frame tracking calculations. which allowed for extreme undersampling, including down to
a single pixel for image capture [8].
For many reasons CS has not become more prevalent
D. External information and prior experience outside perhaps specialized medical and defense applications
Prior to the widespread use and computational practicality [9], two reasons being very specific hardware requirements
of deep learning, decoding algorithms were required to use the that are not necessarily compatible with contemporary capture
3

sensors (e.g., interchangeable lens camera sensors) and also Although more complex networks may be used, this would
the restrictive requirement of operating in a sparse domain. create additional overhead in weights that must be transmitted
However, similar to CS, we will show that denoising diffusion and thus reduce (or even overpower) the compression savings.
can be viewed from a similar Bayesian perspective as a much
more general-purpose prior and can surpass these methods
F. Generative models
even in specialized medical domains [10]. It is important to
note that the objective of compressive sensing is to reduce raw One special class of ML methods recently developed at
measurement cost (often followed by compression for storage) the same time as the early ML-based compression methods
and to go straight to the compressed domain; the reason for are generative image models. Initially, generative adversarial
this is that raw measurement is expensive in terms hardware networks (GANs) were well known for their ability to gen-
requirements and may include, for example, excessive harmful erate “deep fakes” of humans; however these methods were
doses of radiation with medical radiographs. In the age of often focused on generating humans and lacked the ability
low-cost ultra-high definition capture devices to distributed to generate diverse settings (see generative learning trilemma
internet-based devices, a bigger concern is often transmission [12]). These GAN-based methods were also difficult to train
costs. We note however that compressive sensing applications due to issues such as mode collapse and catastrophic forgetting
in which collection hardware size, weight and power (SWAP) [13], thus they lacked practical applicability for compression
is the primary concern may also benefit from our diffusion- purposes at the time this paper was originally written.
based compression technology and is a separate area of our These GAN methods were soon surpassed in many ways
research. by denoising diffusion models which provided the ability
to generate photorealistic (or consistently stylized) images
including the ability to prompt with natural language as they
E. Deep learning methods
also incorporate large language model (LLM) artifacts. [14].
Early deep learning-based codecs [11] attempt to provide a As our method is agnostic to the diffusion implementation,
data-driven approach to lossy coding and reconstruction. As we will refer to the general class of score-based methods
with other deep ML, artificial neural networks (ANNs) can including (but not limited to) denoising diffusion probabilistic
learn weights from a large volume of data how to provide an models (DDPM) [15], denoising diffusion implicit models
algorithm that generalizes to a large number of images. An (DDIM) [16], latent diffusion models (LDM) [17] and latent
example is a self-supervised autoencoder architecture which consistency models (LCM) [18] as simply “diffusion” methods
learns a low-rank representation. Note that our diffusion-based for simplicity.
codec may use an autoencoder, but only as a part of a much These models were subsequently adapted from generative
more capable solution. purposes (e.g., generating novel scenes) to reconstruction
If a model of human perception is used to assess the quality (inpainting, deblurring, upsampling, etc.) [19]. While a naive
of the compression and subsequent decompression, the per- method of compression could consist of simply downsampling
ceptual lossy compression methods may be implicitly learned and image and subsequently upsampling with a diffusion
rather than explicitly programmed. The nonlinear nature of method, as we discussed in the JPEG background (Sec. II-B)
ANNs lends well to perceptual-based objectives (see section this method is suboptimal in terms of perceptual quality and
3.2.1 of [11]), when compared to linear methods such as JPEG, bias avoidance; instead we treat these models from a Bayesian
for which all operations other than quantization operations perspective as a practical and powerful prior allowing us to
are linear (where the common color space conversions, block more effectively restore information that would exceed normal
division, and DCT operations are linear. limits of information theory.
A key point that will be important later in this paper when One major limitation of these methods is the size; The
distinguishing our methods from other AI-based methods is expressiveness and extensibility of these methods (including
that as more information is lost, both ML algorithms and 3D awareness, lighting accuracy, etc.) requires large weights
humans must rely more on prior information, thus increasing (several gigabytes); fully fine-tuning these methods (just as
potential biases (e.g., misidentification of a particular person we discussed with early ML-based codecs) would require
from a blurry video); while these ML methods may reduce impractical transmission of these weights at a size that may far
this bias by fine-tuning to to specific videos (or sets of similar exceed any conventionally-compressed video. The solution for
videos), the entire weights must be periodically transmitted generative imagery, and on which we exploit in our method,
which produces much more overhead (and additional stor- is low-rank adaptation.
age/transmission) than our approach which may leverage small
adaptation information.
Additionally, the limited ability of these methods to encode G. Low-rank adaptation
spatio-temporal information (e.g., light, 3D, motion) is ap- Low-Rank Adaptation (LoRA) has been an extremely pop-
parent in the limited practical ability of these architectures ular performance-efficient fine tuning (PEFT) method with
to be used in a generative capacity (i.e., to generate novel LLMs [20] for its ability to fine tune a very large language
photorealistic or consistently stylized imagery), even when model to a domain-specific application (e.g., customer service
fine tuned, and thus models designed specifically for gener- for products with specialized and esoteric industry jargon)
ative applications were introduced (see the following section). with very small adaptation matrix, thus avoiding the need to
4

compute, store and transmit large adaptation matrices, includ- I. Comparison with naive, SOTA diffusion-based compression,
ing on a per-customer basis. It was subsequently applied to and SOTA NeRF-based compression
diffusion models [21] including with the popular dreambooth A naive approach to diffusion-based compression is to
adaptation method [22]. While we focus on LoRA for our simply downsample each image frame and use a diffusion-
experiments, other current and future PEFT methods for image based upsampler. This type of upsampling is typically used
and video diffusion models would be compatible with our only for final small levels (e.g., 2x-4x upsampling) The cost
approach. For example, although currently restricted and most and performance [28], [29] of upsampling limit the practicality
applicable to LLMs, sparse intervention approaches similar of this approach for compression. Our method is able to
to representational fine-tuning (ReFT) [23] and its low-rank achieve much lower quality guidance while including other
variations could serve as alternative PEFT methods if adapted forms (e.g., canny edges) as constraints which more closely
and proven effective for diffusion fine-tuning in a similar align with human perception. We may, however, still use such
manner that LoRA was adapted. upsampling to produce final output formats (including chang-
While a diffusion model can reconstruct a human or scene, a ing from portrait to landscape mode which uses outpainting
LoRA-adapted version reconstructs a specific person or scene. methods). For both use cases (naive compression and final
As with LLMs, it does not require that the original diffusion preparation) our method can take advantage of fine-tuning
model need ever be updated. It also provides an opportunity (e.g., LoRA) adaptation to further improve the performance
for stylization, where style could be a fantastical style, a of these methods, thus providing a potential improvement on
subtle “beauty-filter” style or simply a representation of the state of the art in both these areas.
original video style). While the LoRA weights introduce small One of the few diffusion-based compression methods [30],
additional overhead, we note that many applications in which which published after our method was conceived and pub-
the subject and environment do not change appreciably over lished, proposes a diffusion-based restoration process for a
time (e.g., videoconferencing in the same room with the same known and linear transmission model. As shown in [31],
subject) may never require retransmission beyond a single the modern DDPM models operate in the latent space, and
initial transfer. extensions of hard data consistency problems to the latent
It is important to note that a large portion of the information domain are computationally significant. Our method supports
in a sequence of frames is captured by these LoRA weights such guidance, but we are more flexible to variations in the
and can be viewed as a volumetrically-aware compression distortion (e.g., allow an adaptive codec such as H.265), and
of potentially deformable subjects (a task very difficult for we add (optional and potentially noisy) additional guidance
conventional methods and even more novel methods such as (e.g., canny edges, depth maps, etc.) which may for example
NeRF [24]). But as the LoRA weights encode the specialized be achieved with ControlNets [32]. Additionally, this other
ability (when combined with the original diffusion model) to method considers only an unmodified DDPM model at the
construct a range of scenes, we must give it some additional receiver, which is more computationally expensive than meth-
information as to what specifically to reconstruct, and thus ods such as DCIM and LCM which our method supports.
it requires only weak guidance information (e.g., extremely This other method does not recognize the power of fine-tuning
compressed video, canny edges, depth maps) to inform the per- adaptation (e.g., LoRA), which our method utilizes not only in
frame reconstruction. In many applications, this information the denoising UNet but also in the other attention mechanisms
may be compressed far beyond an acceptable level deemed such as temporal and ControlNet to improve reconstruction
aesthetically appropriate and may even be significantly smaller quality. Finally, they do not consider temporal attention or
than any LoRA weights for the same time window. other adaptation methods.
Neural radiance fields (NeRFs) [24] and its variants are
H. Video Diffusion a class of popular methods for 3D novel-view synthesis
(NVS) using implicit neural radiance fields. Although they
Following the popularity of image diffusion methods, sev- can produce excellent static 3D-models the inherent explicit
eral video-centric diffusion methods have been researched, 3D volumetric rendering requirements typically leads to signif-
some examples include stable video diffusion (SVD) [25], icantly more storage and computational needs when compared
Sora [26] and Lumiere [27]. One of the goals of this work is to to 2D diffusion methods with implicit 3D awareness [6]. This
further improve the temporal consistency of the sequence of can make these methods challenging for practical volumetric
images. Differences from image-based diffusion can include video rendering [33]. Also the size of even a per-frame LoRA-
adding attention in the temporal dimension in the denoising type adaptation may be significantly smaller than a single
UNet or the variational autoencoder. Our approach is compati- frame NeRF representation, even with more advanced methods
ble with these diffusion approaches and any future advances as for representation (e.g., Plenoxels [34]) and rendering (e.g.,
they mature. Such methods offer the potential to reduce latency Gaussian splatting [35]).
associated with single frame decoding, however the expansion
of attention to the temporal dimension adds considerable com-
putation and thus care must be taken such that the latency gains J. Diffusion connection to compressive sensing
are not overshadowed by increased computational complexity. In this section, we show how diffusing methods relate to and
Such multi-frame joint extensions also provide opportunities can outperform compressive sensing. If we consider a frame
for improving the quality of volumetric video. (or set of frames) x the standard forward diffusion process
5

[15], [36] is to progressively add noise in steps indicated by As z is trained on data related to y and θ is pretrained, we
t as have information related to p(y|θ, z).
√ If we instead treat the fine-tuning as additional observational
q(xt |xt−1 ) = N (xt ; αt xt−1 , (1 − αt )I) , data and assuming y is not independent of z, we can use the
where the constant αt controls the amount of noise added at joint probability
each step. In the reverse process, the noise is reversed using p(x, θ, y, z) = p(y, z|x, θ)p(x|θ)p(θ)
a neural network parameterized by θ in steps as
leaving the posterior conditioned on both y and z as
pθ (xt−1 |xt ) = N (xt−1 ; µθ (xt , t), Σθ (xt , t)) .
p(y, z|x, θ)p(x|θ)p(θ)
p(x, θ|y, z) = .
We can then define the joint probability over all xt as p(y, z)
TY
−1 And with pretrained θ we can simplify to
pθ (x0:T ) = p(xT ) pθ (xt−1 |xt ).
p(y, z|x, θ)p(x|θ)
t=0 p(x|θ, y, z) = . (2)
p(y, z|θ)
If we define the guidance data (to include the ControlNet-like
guidance and any optional prompting) as y, we can define Where again, the information related to p(y, z|θ) is available
the likelihood p(y|x). As the denoising network is pretrained, from fine tuning and we are left with the likelihood of both
we can equivalently use p(x|θ), omitting for simplicity the y and z, along with the p(x|θ) which serves as a Bayesian
time notation on x. Using Bayes’ theorem the posterior for x prior.
conditioned on a pretrained denoising network θ is While this section is only intended to show a basic relation-
ship between diffusion and compressive sensing, it provides
p(y|x, θ)p(x|θ) a theoretical justification for the remarkable performance of
p(x|y, θ) = .
p(y|θ) our diffusion-based compression shown in later sections. It
With score-based matching [37], [38]we take ∇x log(·) of also shows the potential for sensing and security applications,
both sides, and in doing so the score of the normalizing term particularly where both low SWAP and low transmission
in the denominator p(y|θ) goes to zero thus we are left with bandwidth are crucial to operations. Based on this analysis,
we can interpret diffusion methods as a more general version
∇x log p(x|y, θ) = ∇x log p(y|x, θ) + ∇x log p(x|θ). of compressive sensing which avoids many of the limitations
of other popular CS methods. Furthermore, rather than rely on
Thus the posterior estimate of x is related to the observations
a sparsity structure we can potentially use the entire corpus of
y and the diffusion model, represented by p(x|θ) serves as a
recorded imagery as a statistical prior.
prior similar to compressive sensing.
We may now compare score-matching approaches to a
popular method for compressive sensing commonly referred to III. I MPLEMENTATION D ETAILS
as variational inference (VI) [39]. With variational inference A. Diffusion-Based Codec Advantages
we would compute an approximate distribution p̂VI (x) by min- The major features and associated benefits of diffusion-
imizing the evidence lower bound (ELBO) of the Kullback- based compression include:
Leibler divergence (KL) as • The use of a latent-diffusion methods allows us to operate

ELBOVI = EqVI (x) [log p(y|x, θ)] − KL (p̂VI (x)∥p(x|θ)) in a latent space, achieving similar performance to early
ML-based codecs.
= EqVI (x) [log p(y|x, θ) + log p(x|θ) − log p̂VI (x)] .
• We leverage the inherent expressiveness, photorealism

When compared to score matching, the limitations for VI and 3D awareness of denoising diffusion generative mod-
include that p̂VI (x) are typically chosen to be computationally els, while also allowing for 3D novel view synthesis with-
tractable, that p̂VI (x) is itself an approximation, and that a out expensive explicit 3D representation and rendering
lower bound on the difference is minimized. Additionally, the while leveraging monocular depth estimation.
normalizing constants that may be dropped in score matching • Our use of LoRA-type adaptation allows for inexpen-

must be accounted for with VI. When we also consider as sive and tiny adaptation files which avoid compromising
previously mentioned that the data x must be sparse in its compression overhead for quality. This also allows us
domain and the restrictions on the noise distribution, we show to leverage high-quality third party models with minimal
how the diffusion-based methods can be more accurate than modification (even sent once or delivered with hardware).
other approaches for restoration-type applications common • By including spatio-temporal attention, which itself may

with CS. be fine tuned with a low-rank adaptation, we achieve


Continuing with the score-matching diffusion approach, if temporal aesthetic consistency and in addition to static-
we treat any fine-tuning z as prior information we have, using frame spatial consistency.
the chain rule from probability theory, • With the use of modern perceptual video quality assess-
ment metrics combined with guidance data that matches
p(y|x, θ, z)p(x|θ, z) the innate edge-based sensitivity of human perception we
p(x|y, θ, z) = . (1)
p(y|θ, z) ensure high per-frame aesthetic consistency and allow
6

for a perceptually-guided method for self-supervision and The per-frame guidance may be applied in multiple ways.
hyperoptimization. ControlNets [32] are popular methods in which specialized
• Our method only requires use of an ensemble of weak networks for each class of guidance data are pre-computed to
guidance data as “hints” thus allowing that data to be use guidance data for reconstruction, and these ControlNets
noisy and highly compressed without requiring explicit may themselves be fine-tuned with techniques such as control-
degradation models for the guidance data as we use the LoRA to improve reconstruction quality. Other methods such
primary diffusion model and LoRA-type customization as T2I-Adapters [42], energy-guided conditional diffusion
as a prior to provide photorealism. [43], ReSample and Latent-DPS [31] are examples of so-
• Although a number of other guidance methods are avail- called hard-data consistency methods. In the end, the final
able, our method currently uses three key classes of reconstructed frames should match closely with the original
guidance, which happen to align with human perception: data source using only the reduced data provided in the
(1) canny edges, which helps balance out distortion in the “Transmitted Information” block.
color and depth data, (2) monocular depth information, Hyper-optimization may be used to compute optimal train-
which may be obtained from a single camera using ing and inference parameters (e.g., guidance quality/size,
monocular ML/AI estimation methods, and is useful for LoRA update times, diffusion training parameters, diffusion
3D awareness and also should help inform conventional inference parameters) and may be used either in batch or
video’s 3D effects such as lighting, and (3) color infor- adaptive mode to inform both the transmitter and receiver.
mation, which allows for extrinsic illumination changes
and may itself be a very low-quality and resolution
IV. P ERFORMANCE A SSESSMENT M ETHODOLOGY
compressed video.
• That weak guidance data may also itself be compressed A. Equivalent SOTA codec file size fair comparison
with conventional codecs (e.g., H.264/265) to obtain all
The constant rate factor (CRF) [44] from an H.264/5
the temporal efficiency and interoperability advantages
file/stream determines the distortion level and hence smaller
of those solutions. We may further fine-tune additional
values produce less distortion and higher reconstruction qual-
adapters and ControlNet-like models on the idiosyncratic
ity, but at the expense of typically larger file sizes. The file size
distortions these conventional codecs impart on guidance
often abruptly stops increasing as CRF is decreased (around
data.
CRF 20-25 in our initial test data). This indicates that a
• By being fairly agnostic to the diffusion base method and
maximum quality is reached per video and the storage size
adaptation method we are able to leverage any speed or
of CRF equal to 1 will give the upper bound on storage size
quality advances in SOTA methods (for example [40] and
specific to the contents of the video being compressed.
[41]).
When combined with idiosyncratic differences in imple-
mentations of these codecs, we find that the mapping of
B. AI Architecture CRF to file size is not injective (one-to-one) and hence it
is not trivial to estimate optimal CRF from a compressed
A high-level conceptual architecture for our approach is
file. However, we may still use the minimum CRF as a
described in Fig. 2. Processing at the transmitter is divided into
method to compute the maximum equivalent file size required
per-frame and multi-frame processing. Multi-frame processing
of H.264/5 to compute the same quality of information as
is focused on generating one or more fine-tuned models (e.g.,
our diffusion-based reconstruction. To put in other words,
Low-Rank Adaptation networks or LoRA [20]) which can
given a diffusion-based pixel-domain (human-viewable post
better reconstruct a specific subject and/or scene. The per-
VAE decoded) output, we may use a minimum-CRF (CRF=1)
frame processing is focused on generating highly-compressible
encoding to provide a fair-comparison file size and quality
metadata (e.g., canny edges, depth maps, lossy compressed
metric (e.g., DOVER) of our method vs H.264/5.
images) which may be used to guide the reconstruction of a
single frame. Note that in some special cases (e.g., volumetric If we use the size of the minimum-CRF conventional codec
compression), the metadata may exceed the size of any orig- equivalent (as indicated by H.26x) as a reference and compare
inal multi-camera imagery but still require less data than an to the size of the total guidance (e.g., total size of canny
equivalent volumetric representation (e.g., Hologram, NeRF, edges, low-resolution color guidance and depth maps) and fine
etc.). tuning (e.g., LoRA weights), we may compute the following
On the receiver, the per-frame and multi-frame data is used equivalent compression ratio (ECR) as:
in indirect ways to avoid recomputing and re-transmitting the size (guidance) + size (fine-tuning)
expensive pre-trained primary diffusion model (e.g., SDXL ECRtotal =  . (3)
size encodeH.26x,CRF=1 (output)
[14]). The receiver is divided into two parts, spatio-temporal
attention and ControlNet-like guidance. The low-rank adap- Note that the choice of conventional codec (e.g., H.264 or
tation networks typically only modify the spatial attention H.265) will be apparent by the context. An equivalent com-
layers, and a temporal attention layer is also added to increase pression ratio that accounts only for the guidance is given by:
temporal perceptual continuity along with sharing information
size (guidance)
between frames, thus further improving compressibility and ECRguidance-only =  . (4)
thus reducing data rates. size encodeH.26x,CRF=1 (output)
7

Figure 2. High-level overview of one variation of the compression methodology: Shown here is processing that takes place on the transmitter, including
both multi-frame LoRA customization and per-frame guidance data generation, and processing that takes place at the receiver, including spatio-temporal
attention and per-frame hard-data consistency guidance. The specific guidance information types and receiver algorithms may vary depending on the use case
or any software performance advancements (e.g., probability flow for LCM or adversarial distillation for SDXL-Turbo). All receiver methods may be LoRA
customized, thus allowing a single transmission of the larger base diffusion model to be potentially reused for all video sources. Various hyperoptimization
methods may be used to dynamically fine-tune any diffusion hyperparameters needed for encoding or decoding. Any latent space conversions (e.g., the VAE
autoencoder used by stable diffusion) are compatible and are not explicitly shown here. Although LoRA adaptation is specifically used in this diagram for
clarity, other fine-tuning adaptation methods are applicable.

As the interval in which the fine-tuning is applicable and thus the LoRA storage cost per frame approaches zero
(Tfine-tuning ) is increased, we note that: as the cumulative length of video conferences increase. For
this reason, we are currently focusing on only the size of
lim ECRtotal = ECRguidance . the combined guidance when compared to the reconstructed
Tfine-tuning →∞
video and we are not accounting for the size of the fine-tuning
An alternative view is the bandwidth savings (BWS) which weights.
is then given as:
In Sec. V we will use the minimum-CRF size to compare
BWStotal = 1 − ECRtotal (5) our small dataset. We plan to assess a wider range of videos
including more diverse subjects and durations (e.g., VQEG
and [45]). We may also adjust distortion ratios (e.g., CRF for
H.264/265) to match quality metrics before providing a size
BWSguidance = 1 − ECRguidance . (6)
comparison, although those metrics are often in contention
One challenge with this method is that the reconstructed with one another so results will differ per metric.
artifacts in the video are also encoded. If for example, im-
perfect temporal attention leads to jitter or distortion, the
minimum-CRF file size will increase. As we are actively B. Perceptual image quality assessment metrics comparison
researching the improvement of video quality, we consider this
approach a more fair comparison of compression efficiency 1) Standard metrics with known references: Two standard
when combined with measures of quality such as DOVER methods for assessing image quality in image compression
to measure objective and subjective quality differences as and generation research are peak signal-to-noise ratio (PSNR)
artifacts that would increase minimum-CRF file size would and structured similarity (SSIM) [46]. Both of these methods
decrease perceptual quality. require a high-quality (near lossless) reference image for
A second complicating factor is that the fine-tuning adap- comparison against, and these methods do not account for
tation weights (e.g., LoRA) may be large for short samples. subjective or aesthetic image quality.
Our research has not yet determined the duration in time for 2) Perceptual image quality assessment: As recognized in
which LoRA weights are applicable. Nor have we explored classical perceptual methods such as JPEG/MJPEG humans do
possible incremental updates to LoRA weights over time. As not interpret all distortion equally and simple mean-squared
the storage size of the LoRA weights amortizes over a number error at the pixel level is not necessarily a good predictor of
of frames, we have not yet determined a fair assessment of human quality judgement. It was discovered that the neural
the typical LoRA overhead cost per frame. We note also that networks used in LPIPS serve as an unexpectedly effective
some applications such as video conferencing may have LoRA predictor of perceptual quality (with their proposed PIM
weights sent exactly once over a number of videos (or session) metric) [47].
8

3) Perceptual video quality: The Video Multi-Method As- the relative quality of the low-res guidance as a first order
sessment Fusion (VMAF) [48] predicts image quality as- analysis of relative quality.
sessment by accounting for visual processing capabilities of For this preliminary analysis, we will use the DOVER
humans, particularly temporal aspects. For example, details in metric to compare the low-quality (color) guidance data. In
high-motion sequences are less perceptible and thus temporal future studies we plan to account for the size of the final
distortions of this type typically are less detrimental to human guidance data and adjust the distortion factor (e.g., CRF) of
quality judgement. As the motivation for this metric shows, conventional codecs before applying the quality assessment.
we should consider temporal compression artifacts in addition We also plan to provide other quality assessments where
to spatial static image artifacts of a single frame. This metric practical, including known-reference methods.
provides such a method, although it has not found widespread
use due in part to the requirement of an input image. We have D. Historical performance improvement assessment of diffu-
empirically discovered that it is inferior to more modern meth- sion models
ods such as DOVER at predicting human quality assessment, When predicting the future rate of improvement in diffusion
particularly with diffusion-based methods. performance toward real-time denoising, it is useful to assess
4) Reference-free SOTA perceptual methods: The Disen- performance improvements in recent history. In Fig. 3 we
tangled Objective Video Quality Evaluator (DOVER) [49] is consider the improvement of both hardware and software
a SOTA method that provides a reference-free assessment technology relevant to diffusion methods. Rather than just
of video quality which accounts for both technical (objec- considering technical capabilities such as transistor count, we
tive distortion) and aesthetic (subjective human judgement instead consider the practical performance of a representative
prediction) metrics. We note that this method works quickly diffusion-based benchmark algorithm. As Stable Diffusion has
(compared to LPIPS) which makes it suitable for real-time become a base platform for one of the most popular and
hyperoptimization for adaptive encoding with our methods. As flexible diffusion implementations–the Diffusers library by
diffusion methods are often prone to the perception- distortion Hugging Face–we choose Stable Diffusion as implemented
trade-off [50] (e.g., unintentional denoising which improves by Hugging Face as this benchmark. It is also important
perceptual quality but technically increases distortion), we also to measure algorithm-specific performance as diffusion mod-
find that the DOVER method achieves a good balance to els are currently iterative (sequential) in nature and with
predict subjective human quality assessment although other inherent limits of Moore’s Law potentially (and arguably)
methods may be used in the future if perfect reconstruction is being reached have recently led to more parallel hardware
a goal. capacity development [51]. Most important is to use a large
and diverse dataset with peer-reviewed (and preferably open-
source) assessment methodology from reputable sources, so
C. Equivalent SOTA codec quality fair comparison
our historical performance analysis will rely on and cite
As noted in Sec. III, we use imagery of potentially ex- published sources. As the disparate datasets do not all use the
tremely low resolution and/or quality to provide hints. When same parameters (e.g., resolution and attention methods) we
comparing the quality between videos, we may use our so- measure performance improvement over successive milestones
called low-resolution guidance video and compare its video in which these parameters remain constant.
quality to that of the reconstructed video. In this sense, Performance improvements in hardware compute time are
the quality measures the restorative ability of the decoder relatively straightforward to measure. We simply measure
compared to the decoder output. If we compare the quality the compute time decrease for stable diffusion with the
of the reconstructed video to that of the original near-lossless same parameters across different generations of hardware.
(or lossless) reference video, we may determine the ratio The data from [52] provides a comparison of three gener-
of quality recovered from the input. We define the quality ations of Nvidia high-performance computing GPUs (V100,
recovery (QR) as: A100, H100) with release dates spanning nearly six years.
DOVER (output) As significant software improvements (specifically memory-
QR = . (7) efficient attention or xformers) has become prevalent, the
DOVER (reference)
datasets only compare the V100 and A100 with no memory-
One challenge with this approach is that it overlooks the efficient attention and only compare the A100 and H100
overhead required for the additional sources of guidance (depth with memory-efficient attention. However, as we are only
and canny edges). At this time, we have not yet assessed considering performance improvement ratios, we consider it
the minimum acceptable quality of the guidance data nor a reasonable measure of hardware performance improvement.
performed ablation studies to assess the importance of any While some advances such as half-precision floating-point
individual class of guidance. Furthermore, the binary canny computation (FP16) are clearly enabled by hardware develop-
edge data may be replaced by more compressible and tolerant ment, we consider these in the software category as software
line-art data, or it may be highly compressed and subsequently methods must be developed to use these methods without
restored via canny edge detection at the receiver with minimal sacrificing quality, modularity, extensibility and pre-trained
added computation. We are also exploring custom adapters and model reusability.
ControlNet-like models that may obviate the need for other Software performance is more challenging to measure,
guidance altogether. For these reasons we will consider only particularly with generative methods such as diffusion. As dis-
9

Sec. IV-C. We summarize the performance of our results


in Fig. 4, we show select frames for visual comparison in
Fig. 1 and provide all experiment parameters and metrics in
Table I. This performance analysis is preliminary, based on a
small sample of data, and includes the limitations discussed in
the methodology section. We expect significant performance
improvements with the methods outlined in Sec. VI, so our
goal here is to simply demonstrate the potential of diffusion-
based compression to significantly outperform conventional
SOTA compression.

A. Experiments Overview
Figure 3. Historical performance improvement for Stable Diffusion: A To test the performance of our diffusion based codec, we
detailed description of the methodology behind this chart is provided in performed two experiments in October-November 2023:
Sec. IV-D. This chart shows the cumulative performance improvements in
stable diffusion for various hardware and software milestone advancements • Experiment 1: In this experiment we aimed to balance
in order to show general trends in the industry. Using data from [52] we see both bandwidth savings and final quality. This experiment
that hardware advancements in Nvidia GPUs (V100, A100, H100) lead to
a cumulative practical improvement ratio of 3.11 over approximately a six
would be more relevant to scenarios in which networks
year period. Using data from [53] and conservatively assuming cumulative have sufficient capacity, so the goal is to save bandwidth
performance gains, we show that improvements to the Hugging Face Diffusers costs while maintaining high perceptual quality. Some
library improved by a cumulative practical improvement ratio of 3.61 over
a span of approximately 2 years since DDIM models were introduced by
examples could be streaming recorded content or video
adding support for half-precision (FP16) and memory efficient attention conferencing on a reliable network. For this experi-
(xformers). Recently, improvements to diffusion algorithms such as LCM ment, our guidance resolution is of the same resolution
and SDXL-Turbo have led to an additional performance improvement ratio
of approximately 4 (see Table 2 of [18]) and 12.5 (see Figure 10 of [41])
(1024x1024) as the final output for the color guidance
with similar image quality (FID/CLIP and user preference, respectively) over and half the resolution (512x512) for the other guidance.
approximately a one year period. When accumulating the Diffusers gains • Experiment 2: In this experiment we push the bandwidth
with the SDXL-Turbo advances and noting the logarithmic scale of the
dependant axis which itself measures changes, we see a significant (nearly 50-
savings to extremes by minimizing the size of the guid-
fold) and exponentially accelerating performance improvement in diffusion ance data. This experiment would be more relevant to
algorithms and software accompanied by exponential growth in practical scenarios in which network bandwidth is limited or unre-
hardware performance that is close to but slightly lagging the gains predicted
by Moore’s Law.
liable, for example cellular streaming with poor wireless
signal quality. For this experiment, all our guidance is a
much smaller resolution than the output (256x256).
cussed in Sec. IV-B, no single standard metric exists as being The experiments conducted used low-quality lossy H.264
definitive. As many diffusion applications which lack ground and H.265 to encode the guidance data. The canny-edges were
truth (our compression and diffusion-based upsampling being augmented with facial-feature outlines from a monocular face
two exceptions), most peer-reviewed performance assessments landmark detector and the monocular depth maps are median
with large and diverse samples of imagery use methods such filtered in time due to jitter in those estimators.
as the Fréchet inception distance (FID) which measure the
statistical similarity between two sets of imagery, CLIP metrics
which assess the agreement between the Language tokens B. Summary of Results
and the image, or human preference studies. We rely on In Fig. 4 we show a summary of the high level results.
peer-reviewed published results when considering algorithmic It shows the raw transmission sizes and quality metrics for
improvements, such as adversarial distillation for SDXL-Turbo both conventional H.26x codecs and our novel diffusion-based
[41] or probability-flow ordinary differential equation solvers codec, with the guidance type indicated by the plot color. The
for LCM [18]. For more simple computational improvements, source of this is Table I, and a more detailed discussion is
we rely on the published metrics from the well-established provided in the following sections.
diffusion package maintainers (e.g,. Hugging Face) [53]. As
the HF data source is not clear on whether the improvements
are incremental or cumulative differences, we assume the most C. Detailed Results
conservative interpretation that the performance gains stated The parameters used and associated raw metrics output
are cumulative. are provided in Table I. Note that sizes KiB and MiB refer
to base-2 kibibytes and mebibytes, respectively, rather than
V. R ESULTS AND PERFORMANCE base-10 kilobytes and megabytes. Various combinations of
In this section we detail our experiments, provide a high- CRFs and resolutions were used, with the most aggressive
level summary of results, provide a detailed discussion of compression experiment using color guidance of only 256x256
results for each experiment, and also provide visual samples resolution at the highest-loss CRF available (51); in that case,
from the final decoded output. We provide a detailed de- the guidance data (data transmitted) would be only 1/16 of the
scription of our assessment methodology in Sec. IV-A and final resolution pixels under extreme compression. The bottom
10

Figure 4. A high-level summary of performance results. The left shows total transmission size, including all guidance and omitting any adaptation weights
size due to the amortization reasons discussed in Sec. IV-A. The right shows a perceptual quality comparison using DOVER scores, where the conventional
codec value is the DOVER score of the color guidance data and the diffusion-based codec score is the DOVER score of the final output. As discussed in Sec.
IV-C, we can also view this as the quality recovered from the diffusion decoder. A description of the experiments if provided in Sec. V-A, a full discussion
of results is provided in Sec. V-C, a visual comparison of perceptual quality is provided in Fig. 1, and all data is from Table I.

two rows of Table I provide the final bandwidth savings (Eq. Table I
6) and quality recovery (Eq. 7). E XPERIMENT PARAMETERS AND RAW METRICS
Although visual quality results are best viewed as a video Experiment 1 Experiment 2
we present a single frame in time in Fig. 1 which shows Parameter / Metric H.264 H.265 H.264 H.265
for each experiment (rows) and each conventional encoder CRF Color 51 51 51 51
CRF Canny 42 46 40 42
(columns): pairs of low-resolution color guidance (left sub- CRF Depth 32 42 36 38
frames) and diffusion decoded output (right sub-frames). The Resolution Color 1024 1024 256 256
lower-resolution color guidance (256x256 resolution) are re- Resolution Canny 512 512 256 256
Resolution Depth 512 512 256 256
sized to 1024x1024 for comparison purposes. For convenience,
Depth Median Radius 5 sec 5 sec 5 sec 5 sec
a subset of metrics from Table I are repeated there. This Output Median Radius 2 frames 2 frames 2 frames 2 frames
Test Duration 30 sec 30 sec 10 sec 10 sec
figure shows the remarkable ability of our novel diffusion- Frame Rate 60 FPS 60 FPS 60 FPS 60 FPS
based compression to use extremely small guidance (AKA Size Color 689 KiB 619 KiB 91 KiB 105 KiB
Hints) to reconstruct extremely high quality imagery. Size Canny 1623 KiB 1140 KiB 512 KiB 452 KiB
Size Depth 745 KiB 299 KiB 248 KiB 214 KiB

Equivalent Output Size 56.6 MiB 34.2 MiB 19.1 MiB 13.3 MiB
Fine-Tuning Size 22.3 MiB 22.3 MiB 22.3 MiB 22.3 MiB
D. Discussion of Experiment 1: Balanced quality and band- All Guidance Size 2.98 MiB 2.01 MiB 0.83 MiB 0.75 MiB
width savings Equivalent Output Data Rate 1.89 MiBps 1.14 MiBps 1.91 MiBps 1.33 MiBps
All Guidance Data Rate 102 KiBps 69 KiBps 85 KiBps 77 KiBps
The results for Experiment 1 are shown in the first column
DOVER Color .17 .47 .06 .11
of Table I and the first row of Fig. 1. As discussed in Sec. IV-A, DOVER Input .85 .85 .85 .85
longer video clips will likely provide more amortization of the DOVER Output .71 .78 .52 .78

fine-tuning (LoRA) transmission cost, and thus we focus pri- Bandwidth Savings– 55% 29% N/A N/A
Include Fine-Tuning (Eq. 5)
marily on the guidance-only bandwidth savings, which shows Bandwidth Savings– 95% 94% 96% 94%
Guidance Only (Eq. 6)
92% and 95% savings over H.265 and H.264, respectively. Quality Recovery (Eq. 7) 84% 92% 61% 92%
However, even if we account for the size of the fine-tuning
LoRA (22.3 MiB) we see that we still produced a savings
of roughly 30% and 56% over H.265 and H.264, respec-
tively. With more careful tuning of the parameters, improved are evident. This demonstrates that in contrast to conventional
diffusion models models, more efficient fine-tuning weights codecs, the errors produced are more consistent with historical
(e.g., reducing precision to FP8), and guidance methods (e.g., information (the diffusion UNet prior and LoRA fine-tuning).
ControlNet) that are more tailored to the specific artifacts A more conventional codec would likely produce more local-
added to our color guidance, we expect the performance to ized distortion. This shows the ability of our novel diffusion-
improve. based method to provide more aesthetically-pleasing results at
very low data quality. It also reinforces the reason our analysis
prefers the DOVER metric over more purely technical metrics
E. Discussion of Experiment 2: Extreme compression such as PSNR and SSIM which are poorer reflections of our
The results for Experiment 2 are shown in the last column of performance. We note however, that such differences are not
Table I and the bottom row of Fig. 1. Upon close inspection of visible in Experiment 1 results (the top row of Fig. 1), as the
the bottom row of Fig. 1, some slight differences in expression guidance compression is much less extreme.
11

When considering the differences in expression, we note particularly susceptible to this behavior on so-called out-of-
that: distribution (OOD) tasks, in which the data used to train
• The perceptual quality of our method is far superior to the diffusion model differs significantly from the data being
the equivalent highly-compressed guidance data shown reconstructed, and when the available data is highly distorted
on the left half of each comparison in Fig. 1. This is also and thus has many reconstructions consistent with that data.
seen in the DOVER metrics which have color guidance The consequences for this behavior are particularly grave
DOVER metrics which (for H.265-encoded guidance) are for medical applications, but we note here that compression
0.11/0.85 whereas the our final output is 0.78/0.85. applications differ significantly.
• While we used pre-trained models available at the time, Medical applications often avoid sampling a full resolution
we believe more tailored guidance networks (e.g., Con- image as that may require significant time in a machine or
trolNets tuned to extreme compression artifacts) will additional doses of radiation, and thus these applications often
significantly improve performance at very low guidance do not have access to the ground-truth image, and are thus
data rates. concerned with compressive sensing applications (see Sec.
• We also believe that diffusion models which provide joint II-D). For compression, this information is readily available
latent-space encoding of multiple frames would provide a but is expensive to store and/or transmit, so the ground truth
larger effective batch size and thus improve performance may be used to fine-tune the diffusion model (e.g., via LoRA)
for small changes in scenery (e.g., facial expression). at the transmitter allowing for better reconstruction at the
In this experiment the equivalent video sizes are smaller than the LoRA fine-tuning data, due to the reduced resolution (1/16) of the guidance and the reduced duration (1/3) of the target video. We also know from Experiment 1 that the LoRA can amortize effectively over a longer video. For these reasons, we do not report bandwidth savings inclusive of the fine-tuning size. However, we note that:
• Omitting the size of the LoRA fine-tuning is still a fair characterization of bandwidth savings for applications such as video conferencing, in which a subject and/or scene are well characterized a priori and the LoRA weights are sent once.
• Longer videos will still amortize this fine-tuning weight out over their duration, as illustrated in the sketch following this list.
• We used performant technologies available at the time of this experiment, so other approaches such as improved LoRA training, including the option to reduce floating-point accuracy further (e.g., FP8 or even FP4), would further reduce the LoRA overhead.
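As a rough worked example of this amortization (all sizes and rates below are hypothetical placeholders, not measured values from our experiments):

```python
# Sketch: amortizing a one-time LoRA weight transfer over video duration.
# All sizes and rates are hypothetical placeholders, not measured values.
lora_bytes = 20e6         # one-time LoRA transfer (e.g., ~20 MB at FP16)
guidance_bps = 50e3       # steady-state guidance bitrate (bits/s)
conventional_bps = 500e3  # bitrate of a comparable conventional codec

def effective_bps(duration_s: float) -> float:
    """Average bitrate including the amortized one-time LoRA cost."""
    return guidance_bps + (lora_bytes * 8) / duration_s

for minutes in (1, 10, 60):
    bps = effective_bps(minutes * 60)
    savings = 100 * (1 - bps / conventional_bps)
    # Short clips may not break even; long videos amortize the LoRA away.
    print(f"{minutes:3d} min: {bps / 1e3:7.1f} kbit/s "
          f"(savings vs. conventional: {savings:.0f}%)")
```

Under these illustrative numbers, a one-minute clip does not break even, while an hour-long video approaches the steady-state guidance bitrate, which is the behavior observed qualitatively in Experiment 1.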
We conclude by noting that additional approaches to improve quality and bandwidth savings are discussed in Sec. VI.
VI. LIMITATIONS AND FUTURE WORK

A. Unintentional denoising

One interesting behavior of a denoising diffusion approach is that it may (based on tunable hyperparameters) denoise a noisy source image, for example one that is poorly focused, poorly exposed, or low-resolution. The use of fine-tuning adaptations such as LoRA mitigates this, as we may treat a low-quality video as a particular style to be preserved. In all cases, style may be separated out (with a distinct fine-tuning adaptation) and both adaptations (style and subject) may be added with tunable strengths, as sketched below.
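One possible mechanism for this tunable combination is the multi-adapter LoRA support in recent versions of the diffusers library; the adapter file names and blend weights below are illustrative placeholders, and this sketch is not our production pipeline:

```python
# Sketch: combining separately trained "style" and "subject" LoRA
# adaptations with tunable strengths via diffusers' multi-adapter API.
# Adapter files and weights are illustrative placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load the two independent low-rank adaptations.
pipe.load_lora_weights("subject_lora.safetensors", adapter_name="subject")
pipe.load_lora_weights("style_lora.safetensors", adapter_name="style")

# Blend them: the subject adaptation dominates, while the "low-quality
# video as a style" adaptation is applied at reduced strength.
pipe.set_adapters(["subject", "style"], adapter_weights=[1.0, 0.6])

image = pipe(prompt="a reconstructed frame", num_inference_steps=30).images[0]
```

Because the two adaptations are stored and transmitted separately, the receiver can retune the blend weights without any retransmission.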

B. Hallucination

Denoising diffusion models have a unique ability to generate realistic yet inaccurate imagery when used for inverse (reconstruction) applications [54]. This form of bias is often pejoratively referred to as hallucination. Diffusion models are particularly susceptible to this behavior on so-called out-of-distribution (OOD) tasks, in which the data used to train the diffusion model differs significantly from the data being reconstructed, and when the available data is highly distorted and thus has many reconstructions consistent with that data. The consequences of this behavior are particularly grave for medical applications, but we note here that compression applications differ significantly.

Medical applications often avoid sampling a full-resolution image, as that may require significant time in a machine or additional doses of radiation; thus these applications often do not have access to the ground-truth image and are instead concerned with compressive sensing (see Sec. II-D). For compression, this information is readily available but is expensive to store and/or transmit, so the ground truth may be used to fine-tune the diffusion model (e.g., via LoRA) at the transmitter, allowing for better reconstruction at the receiver. While a medical application may use a LoRA model to fine-tune a model for an individual, the purpose of medical imaging is often to detect OOD conditions (e.g., suspicious growths), which are by definition non-existent in an individual's historical images. For this reason, compression of high-quality source imagery (even in the medical domain) does not share the same issues as medical compressive sensing.

While our method is robust to OOD problems, we may still suffer from reconstruction errors at extreme compression levels. We note that all lossy compression algorithms share this issue, and that human beings are similarly prone to biases based on past experience when presented with distorted guidance information if only low-quality imagery storage is practical; there are nonetheless significant consequences to the apparently higher quality and realism of a diffusion-based reconstruction, particularly in law enforcement and defense applications. Our method allows several variables to be adapted based on the application; in particular, the quality of the guidance information may vary along with the update rate and rank of the LoRA-type adaptation.

However, we plan to more systematically develop solutions to this issue, one such goal being to compute and convey the confidence information (and associated uncertainty) of the reconstruction at the per-pixel level. To do this, we plan to apply inversion methods (e.g., null-text inversion) [55] and other statistical/geometric analyses of diffusion space [56], combined with established estimation theory, to predict uncertainty mapped back to the image domain. Although such errors are also less likely to persist across frames, the temporal correlation introduced by temporal attention will also be accounted for and communicated in this additional information. Sensing and decision theory using our diffusion-based compression methodology is a separate area of research with a wide range of additional opportunities.
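While our planned approach relies on inversion [55] and geometric analysis of diffusion space [56], a cruder per-pixel confidence estimate can already be sketched by Monte-Carlo sampling the stochastic decoder over several seeds. In the sketch below, `decode_frame` is a hypothetical stand-in for a guided diffusion reconstruction, not an API of any library:

```python
# Sketch: crude per-pixel confidence by sampling the stochastic decoder
# several times and measuring pixel-wise spread. `decode_frame` is a
# hypothetical stand-in for a guided diffusion reconstruction; this
# Monte-Carlo proxy is simpler than the inversion-based methods [55],
# [56] discussed above.
import numpy as np

def uncertainty_map(decode_frame, guidance, num_samples: int = 8):
    """Return (mean image, per-pixel std) over stochastic reconstructions."""
    samples = np.stack([
        decode_frame(guidance, seed=s) for s in range(num_samples)
    ])  # shape: (num_samples, H, W, 3), float in [0, 1]
    mean = samples.mean(axis=0)
    std = samples.std(axis=0).max(axis=-1)  # worst-case channel spread
    # High std marks low-confidence (hallucination-prone) pixels.
    return mean, std
```

Pixels where independent samples disagree are exactly those the guidance fails to pin down, so the std map is a natural candidate for the side-channel confidence information described above.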
C. Speed and latency

Diffusion models are not currently real-time, but they are improving quickly as video applications (mostly generative "text-to-video" applications) become popular. In Fig. 3 we see that the practical cumulative performance improvements for diffusion algorithms are growing exponentially. Additionally, mobile-based diffusion methods such as MediaPipe are beginning to appear [57], suggesting that these algorithms will eventually not require a high-powered GPU. Specialized silicon (FPGAs, ASICs, co-processors, etc.) is also expected to appear, as diffusion models have widespread applications at the consumer level, particularly for generative art and image manipulation, which may make a low-cost streaming media device practical.
We have several (patent-pending) methods we are currently exploring to improve speed at the algorithm level. Diffusion parallelization is a potential method to parallelize (either at the bit level or at the spatial-frequency-bin level) diffusion methods, which are inherently serial in nature and thus less able to parallelize across many-core hardware such as GPUs/TPUs. Of course, simpler parallelization techniques such as loop unrolling are also applicable, especially as the number of diffusion iterations continues to fall with methods such as SDXL-Turbo [41] and LCM [18]. Also, as mentioned in Sec. II-H, the use of multi-frame joint autoencoders for the latent-space conversions can use increased parallel capacity to decrease diffusion latency.
We recognize that it is not efficient to denoise each frame from noise independently, especially when considering that the previous frame may often be considered a "noisy" version of the subsequent frame. By using LoRA (and other) adaptations and various novel mathematical methods, we expect to fine-tune diffusion to adapt from one frame to the next, which we have coined structure-to-structure diffusion, in contrast to noise-to-structure denoising diffusion. We also expect that this may approach an asymptotic bound on the total number of iterations required, even as the frame rate increases, thus providing the unintuitive property of arbitrary reconstruction frame rate to allow for "bullet time". Even fast-motion scenes which introduce blur in the uncompressed information (at the sensor level) may be compensated by deblurring methods [19].
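The intuition behind warm-starting from the previous frame can be illustrated with an SDEdit-style partial-noising sketch; this is only an illustration of the idea, not our structure-to-structure method itself. `scheduler` is assumed to be a diffusers-style noise scheduler:

```python
# Sketch: starting denoising from the previous frame's latents rather
# than from pure noise, so only a fraction of the iterations must run.
# This SDEdit-style warm start illustrates the intuition behind
# structure-to-structure diffusion; it is not our full method.
import torch

def warm_start_latents(prev_latents, scheduler, num_steps=50, strength=0.3):
    """Noise the previous frame's latents only part-way up the chain."""
    scheduler.set_timesteps(num_steps)
    # Run only the final `strength` fraction of the schedule
    # (timesteps are ordered from most to least noisy).
    t_start = int(num_steps * (1.0 - strength))
    timesteps = scheduler.timesteps[t_start:]
    noise = torch.randn_like(prev_latents)
    noisy = scheduler.add_noise(prev_latents, noise, timesteps[:1])
    return noisy, timesteps  # feed these to the usual denoising loop
```

With strength 0.3, only 30% of the iterations are re-run per frame, and the fraction can shrink further as inter-frame changes shrink, which is the origin of the asymptotic-bound conjecture above.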
It is important to note that our method is flexible in that, depending on the application and implementation, it may require less computation to encode than to decode (a situation opposite to conventional codecs). This presents some interesting opportunities for edge-based devices, particularly under constrained uplink/upload bandwidths. Additionally, our method may include hybrid approaches in which an intermediary (e.g., a cloud, content delivery network, or near-edge server) provides conversion to and/or from our diffusion-based codec and a conventional codec.

D. Model sizes

We note here that many modern diffusion base models (e.g., SDXL [14]) are quite large. However, our approach leverages the ability of adaptation (e.g., LoRA) to fine-tune that model with an adaptation matrix which is orders of magnitude smaller, without requiring retransmission of the original weights. The original diffusion model may be delivered once, or even delivered with hardware, depending on the application. The size and update rate of the LoRA (and other adaptation methods) information is customizable and may itself be low in many instances (e.g., video conferencing), allowing a size-quality trade-off. The LoRA-type adaptation may be updated per frame, which still may provide a significant savings over comparable methods such as NeRF for volumetric (3D) applications, or it may be amortized over multiple frames. We are actively exploring low-rank adaptation training methods which increase the extensibility of these adaptations, thus requiring less transmission overhead.
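The size asymmetry is easy to quantify. A sketch with representative (assumed) dimensions, not the exact SDXL architecture:

```python
# Sketch: why a low-rank adaptation is orders of magnitude smaller than
# the base weights. Layer dimensions, rank, and layer count are
# representative assumptions, not the exact SDXL architecture.
d_in, d_out, rank = 1280, 1280, 8    # one attention projection, LoRA rank
base_params = d_in * d_out           # full weight matrix W
lora_params = rank * (d_in + d_out)  # low-rank factors B (d_out x r), A (r x d_in)
print(f"per-layer ratio: {base_params / lora_params:.0f}x "
      f"({base_params:,} vs {lora_params:,} parameters)")
# ~80x per layer here; at FP16, a rank-8 adaptation over a few hundred
# adapted projections totals tens of MB versus several GB of base weights.
```

Lowering the rank or the weight precision (FP8/FP4, as discussed in Sec. V) shrinks the transmitted adaptation further, at some cost in fidelity.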
E. Latent-space quality limitations

As noted in [58], the variational autoencoder used for latent-space diffusion methods such as SD and SDXL may introduce inherent distortion into the final image. This has the effect of putting an upper bound on reconstruction quality and may be challenging for extensions such as 3D NVS, which require view consistency. We are currently researching a novel method, which we refer to as dynamic warping, to address this limitation, in which areas of fine detail or areas humans are sensitive to (e.g., faces) are given more latent pixels than other areas (e.g., flat blue sky). Furthermore, we expect LoRA-type adaptation of refiner networks such as those used for SDXL (for the SDXL results in this paper we show only the base network) to further improve that quality limit.
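The VAE ceiling can be measured directly by encoding and decoding an image with no diffusion in between; any loss on this round trip bounds the final reconstruction quality. A minimal sketch using the public SDXL VAE (any latent diffusion VAE would illustrate the point):

```python
# Sketch: measuring the VAE round-trip bound on reconstruction quality
# (encode, then decode, with no diffusion in between). The checkpoint
# name is the public SDXL VAE, used here only for illustration.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval()

@torch.no_grad()
def vae_round_trip(image):  # image: (1, 3, H, W) float tensor in [-1, 1]
    latents = vae.encode(image).latent_dist.mode()
    return vae.decode(latents).sample  # loss vs. input caps final quality
```

Comparing `vae_round_trip(x)` against `x` (e.g., via PSNR) gives the per-image upper bound that dynamic warping is intended to raise.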
F. Limited HD resolution

The current release of the SDXL base model is limited to 1024² total pixels. This limits the native reconstruction to 720p HD resolution. We have several methods identified to address this limitation. We first note that, in addition to latency, the practical total image resolution has been steadily improving over time, so algorithmic improvements (e.g., distillation [41]) should lead to larger resolutions. Methods like SDXL already depart from simple diffusion and provide refiner networks, which (when LoRA-type adaptation is applied) should allow for high-quality resampling. With current technology, diffusion-based upsampling methods are also well established and often work best when upsampling ratios are small, making them excellent complements to our base decoder approach, along with other modern approaches such as ESRGAN [59] and Swin2SR [60]. Finally, as with the latent-space limitations of Sec. VI-E, we expect our novel dynamic warping to also assist in output resolutions that exceed base-model limitations, as this warping is expected to provide higher-quality upsampling with conventional cubic spline methods since nonuniform sample density is already increased in areas of high detail.
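A small-ratio (2x) super-resolution pass with Swin2SR [60] as a post-decoder step might look like the sketch below; the checkpoint name is a public example from the transformers documentation, and our pipeline may instead use diffusion-based upsampling or dynamic warping:

```python
# Sketch: small-ratio (2x) super-resolution with Swin2SR [60] as a
# post-decoder step. Checkpoint name is a public example; the input
# file name is an illustrative placeholder.
import torch
from PIL import Image
from transformers import AutoImageProcessor, Swin2SRForImageSuperResolution

ckpt = "caidas/swin2SR-classical-sr-x2-64"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Swin2SRForImageSuperResolution.from_pretrained(ckpt).eval()

image = Image.open("decoded_720p_frame.png").convert("RGB")
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs).reconstruction   # (1, 3, 2H, 2W)
upscaled = torch.clamp(output, 0.0, 1.0)      # clamp to valid pixel range
```

Because the ratio is small, such a pass complements rather than replaces the generative decoder, consistent with the observation above that upsampling methods work best at modest ratios.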
VII. ACKNOWLEDGEMENTS

Several IKIN employees contributed heavily to the research and content presented in this document. These contributors include: Jim Stiefelmaier, Kristy Tipton, Dusty Coleman, Richie Romero, Blake Fox, and Taylor Scott.

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[2] A. M. Andrew, "Information theory, inference, and learning algorithms, by David J. C. MacKay, Cambridge University Press, Cambridge, 2003, hardback, xii + 628 pp., ISBN 0-521-64298-1," Robotica, vol. 22, no. 3, pp. 348–349, 2004.
[3] K. Brandenburg, "MP3 and AAC explained," in Audio Engineering Society Conference: 17th International Conference: High-Quality Audio Coding. Audio Engineering Society, 1999.
[4] C.-Y. Wang, S.-M. Lee, and L.-W. Chang, "Designing JPEG quantization tables based on human visual system," Signal Processing: Image Communication, vol. 16, no. 5, pp. 501–506, 2001.
[5] Z.-N. Li, M. S. Drew, and J. Liu, "Modern video coding standards: H.264, H.265, and H.266," Fundamentals of Multimedia, pp. 423–478, 2021.
[6] J. Burgess, K.-C. Wang, and S. Yeung, "Viewpoint textual inversion: Unleashing novel view synthesis with pretrained 2D diffusion models," arXiv preprint arXiv:2309.07986, 2023.
[7] S. Qaisar, R. M. Bilal, W. Iqbal, M. Naureen, and S. Lee, "Compressive sensing: From theory to applications, a survey," Journal of Communications and Networks, vol. 15, no. 5, pp. 443–456, 2013.
[8] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, "Single-pixel imaging via compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83–91, 2008.
[9] B. L. Westcott and S. P. Stanners, "Systems and methods for direct emitter geolocation," patentimages.storage.googleapis.com/bd/aa/d6/f12fada9c8384b/US9377520.pdf, Jun. 28, 2016, US Patent 9,377,520.
[10] Y. Song, L. Shen, L. Xing, and S. Ermon, "Solving inverse problems in medical imaging with score-based generative models," arXiv preprint arXiv:2111.08005, 2021.
[11] Y. Yang, S. Mandt, L. Theis et al., "An introduction to neural data compression," Foundations and Trends in Computer Graphics and Vision, vol. 15, no. 2, pp. 113–200, 2023.
[12] Z. Xiao, K. Kreis, and A. Vahdat, "Tackling the generative learning trilemma with denoising diffusion GANs," arXiv preprint arXiv:2112.07804, 2021.
[13] H. Thanh-Tung and T. Tran, "Catastrophic forgetting and mode collapse in GANs," in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–10.
[14] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, "SDXL: Improving latent diffusion models for high-resolution image synthesis," arXiv preprint arXiv:2307.01952, 2023.
[15] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[16] J. Song, C. Meng, and S. Ermon, "Denoising diffusion implicit models," arXiv preprint arXiv:2010.02502, 2020.
[17] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[18] S. Luo, Y. Tan, L. Huang, J. Li, and H. Zhao, "Latent consistency models: Synthesizing high-resolution images with few-step inference," arXiv preprint arXiv:2310.04378, 2023.
[19] B. Kawar, M. Elad, S. Ermon, and J. Song, "Denoising diffusion restoration models," Advances in Neural Information Processing Systems, vol. 35, pp. 23593–23606, 2022.
[20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
[21] S. Ryu, "Low-rank adaptation for fast text-to-image diffusion fine-tuning," https://github.com/cloneofsimo/lora, 2023, [Online; accessed 01-Dec-2023].
[22] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.
[23] Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts, "ReFT: Representation finetuning for language models," arXiv preprint arXiv:2404.03592, 2024.
[24] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[25] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., "Stable video diffusion: Scaling latent video diffusion models to large datasets," arXiv preprint arXiv:2311.15127, 2023.
[26] Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao et al., "Sora: A review on background, technology, limitations, and opportunities of large vision models," arXiv preprint arXiv:2402.17177, 2024.
[27] O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, Y. Li, T. Michaeli et al., "Lumiere: A space-time diffusion model for video generation," arXiv preprint arXiv:2401.12945, 2024.
[28] Z. Yue, J. Wang, and C. C. Loy, "ResShift: Efficient diffusion model for image super-resolution by residual shifting," arXiv preprint arXiv:2307.12348, 2023.
[29] Y. Zhang, K. Zhang, Z. Chen, Y. Li, R. Timofte, J. Zhang, K. Zhang, R. Peng, Y. Ma, L. Jia et al., "NTIRE 2023 challenge on image super-resolution (x4): Methods and results," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1864–1883.
[30] S. F. Yilmaz, X. Niu, B. Bai, W. Han, L. Deng, and D. Gunduz, "High perceptual quality wireless image delivery with denoising diffusion models," arXiv preprint arXiv:2309.15889, 2023.
[31] B. Song, S. M. Kwon, Z. Zhang, X. Hu, Q. Qu, and L. Shen, "Solving inverse problems with latent diffusion models via hard data consistency," arXiv preprint arXiv:2307.08123, 2023.
[32] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
[33] T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe et al., "Neural 3D video synthesis from multi-view video," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5521–5531.
[34] S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa, "Plenoxels: Radiance fields without neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5501–5510.
[35] Z. Zhu, Z. Fan, Y. Jiang, and Z. Wang, "FSGS: Real-time few-shot view synthesis using Gaussian splatting," 2023.
[36] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning. PMLR, 2015, pp. 2256–2265.
[37] A. Hyvärinen and P. Dayan, "Estimation of non-normalized statistical models by score matching," Journal of Machine Learning Research, vol. 6, no. 4, 2005.
[38] Y. Song and S. Ermon, "Generative modeling by estimating gradients of the data distribution," Advances in Neural Information Processing Systems, vol. 32, 2019.
[39] S. Ji, Y. Xue, and L. Carin, "Bayesian compressive sensing," IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2346–2356, 2008.
[40] C. Si, Z. Huang, Y. Jiang, and Z. Liu, "FreeU: Free lunch in diffusion U-Net," arXiv preprint arXiv:2309.11497, 2023.
[41] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, "Adversarial diffusion distillation," arXiv preprint arXiv:2311.17042, 2023.
[42] C. Mou, X. Wang, L. Xie, J. Zhang, Z. Qi, Y. Shan, and X. Qie, "T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models," arXiv preprint arXiv:2302.08453, 2023.
[43] J. Yu, Y. Wang, C. Zhao, B. Ghanem, and J. Zhang, "FreeDoM: Training-free energy-guided conditional diffusion model," arXiv preprint arXiv:2303.09833, 2023.
[44] W. Robitza, "CRF guide (constant rate factor in x264, x265 and libvpx)," https://slhck.info/video/2017/02/24/crf-guide.html, 2017, [Online; accessed 01-Dec-2023].
[45] W. Huang, K. Jia, P. Liu, and Y. Yu, "Spatio-temporal information fusion network for compressed video quality enhancement," in 2023 Data Compression Conference (DCC). IEEE, 2023, pp. 343–343.
[46] A. Hore and D. Ziou, "Image quality metrics: PSNR vs. SSIM," in 2010 20th International Conference on Pattern Recognition. IEEE, 2010, pp. 2366–2369.
[47] S. Bhardwaj, I. Fischer, J. Ballé, and T. Chinen, "An unsupervised information-theoretic perceptual quality metric," Advances in Neural Information Processing Systems, vol. 33, pp. 13–24, 2020.
[48] R. Rassool, "VMAF reproducibility: Validating a perceptual practical video quality metric," in 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB). IEEE, 2017, pp. 1–2.

[49] H. Wu, E. Zhang, L. Liao, C. Chen, J. Hou, A. Wang, W. Sun, Q. Yan, and W. Lin, "Exploring video quality assessment on user generated contents from aesthetic and technical perspectives," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20144–20154.
[50] Y. Blau and T. Michaeli, "The perception-distortion tradeoff," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6228–6237.
[51] J. Shalf, "The future of computing beyond Moore's law," Philosophical Transactions of the Royal Society A, vol. 378, no. 2166, p. 20190061, 2020.
[52] Benchmarking diffuser models. https://github.com/LambdaLabsML/lambda-diffusers/blob/main/docs/benchmark.md. [Online; accessed 01-Dec-2023].
[53] Hugging Face Diffusers documentation: speed up inference. https://huggingface.co/docs/diffusers/optimization/fp16. [Online; accessed 01-Dec-2023].
[54] R. Barbano, A. Denker, H. Chung, T. H. Roh, S. Arridge, P. Maass, B. Jin, and J. C. Ye, "Steerable conditional diffusion for out-of-distribution adaptation in imaging inverse problems," arXiv preprint arXiv:2308.14409, 2023.
[55] R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, "Null-text inversion for editing real images using guided diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6038–6047.
[56] Y.-H. Park, M. Kwon, J. Choi, J. Jo, and Y. Uh, "Understanding the latent space of diffusion models through the lens of Riemannian geometry," arXiv preprint arXiv:2307.12868, 2023.
[57] Benchmarking diffuser models. https://github.com/LambdaLabsML/lambda-diffusers/blob/main/docs/benchmark.md. [Online; accessed 01-Dec-2023].
[58] A. Haque, M. Tancik, A. A. Efros, A. Holynski, and A. Kanazawa, "Instruct-NeRF2NeRF: Editing 3D scenes with instructions," arXiv preprint arXiv:2303.12789, 2023.
[59] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, "ESRGAN: Enhanced super-resolution generative adversarial networks," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[60] M. V. Conde, U.-J. Choi, M. Burchi, and R. Timofte, "Swin2SR: SwinV2 transformer for compressed image super-resolution and restoration," in European Conference on Computer Vision. Springer, 2022, pp. 669–687.

Bryan Westcott is Director of Applied Artificial Intelligence at IKIN Inc. His focus has been in AI-driven volumetric capture, manipulation and generation, which has naturally led to this work in diffusion-based compression. He holds a BS in Electrical and Computer Engineering and a Masters in Engineering from The University of Texas at Austin; his research focus was in statistical signal processing, wireless communications and electromagnetics. Previously, Bryan has worked at Lockheed Martin, L3 Communications (now L3Harris), and Cubic Defense Applications, where his roles included principal engineer and Director of Applied AI. Bryan spent more than a decade as a lead researcher developing and fielding novel airborne intelligence, surveillance and reconnaissance (ISR) solutions. His focus was on geolocation (time-frequency, array-based, differential, direct non-metadata, near-vertical incidence, and indoor) and also on signal processing (clustering, sensor fusion, beam-forming, sparse reconstruction, wireless interference cancellation, GPS denied-access navigation and compressive sensing) and novel wireless communications (ultrawideband radio-frequency ID systems). He served as principal investigator and data science lead for multiple Defense Advanced Research Projects Agency (DARPA) programs. His patent information is available at: http://independent.academia.edu/BryanWestcott.

Christopher Vela has 8+ years of experience creating rapid prototype data solutions for the DoD, DARPA, and startups. His solutions have been used by the Australian Army and the British Army and presented to the US Joint Artificial Intelligence Center and at the I/ITSEC 2019 Conference. He has experience creating data science solutions for transportation departments and social media analytics for a variety of Fortune 500 companies such as the NFL and Verizon. He majored in Statistics at Columbia University. His main focus is on signal analysis, computer vision, and cloud data solutions. Chris is a Principal Data Scientist at IKIN and leads the AI and data engineering for IKIN's volumetric video initiatives.
