Diffusion
# Sample an image
image_pipe().images[0]
We can re-create the sampling process step by step to get a better look at
what is happening as the model generates images. We initialize our sample
x with random noise and then run it through the model for 30 steps. On the
right, you can see the model’s prediction for what the final image will look
like at specific steps - note that the initial predictions are not particularly
good! Instead of jumping right to that final predicted image, we only
modify x by a small amount in the direction of the prediction (shown on the
left). We then feed this new, slightly better x through the model again for
the next step, hopefully resulting in a slightly improved prediction, which
can be used to update x a little more, and so on. With enough steps, the
model can produce some impressively realistic images.
for i, t in enumerate(image_pipe.scheduler.timesteps):
    # Get the model's noise prediction for the current x
    with torch.no_grad():
        noise_pred = image_pipe.unet(x, t).sample

    # Calculate what the updated sample should look like with the scheduler
    scheduler_output = image_pipe.scheduler.step(noise_pred, t, x)

    # Update x
    x = scheduler_output.prev_sample

    # Plot the model's predicted denoised images at this step
    pred_x0 = scheduler_output.pred_original_sample
    grid = torchvision.utils.make_grid(pred_x0, nrow=4).permute(1, 2, 0)
    axs[1].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
    axs[1].set_title(f"Predicted denoised images (step {i})")
    plt.show()
NOTE
Don’t worry if that code looks a bit intimidating - we’ll explain how this all works over
the course of this chapter. For now, just focus on the results.
This core idea of learning how to refine a ‘corrupted’ input gradually can be
applied to a wide range of tasks. In this chapter, we’ll focus on
unconditional image generation - that is, generating images that resemble
the training data, with no additional controls over what these generated
samples look like. Diffusion models have also been applied to audio, video,
text and more. And while most implementations use some variant of the
‘denoising’ approach that we’ll cover here, new approaches utilizing
different types of ‘corruption’ together with iterative refinement are
emerging that may move the field beyond the current focus on denoising
diffusion specifically. Exciting times!
dataset = load_dataset("huggan/smithsonian_butterflies_subset", split="train")
image_size = 64
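The image transforms themselves aren't shown above; a typical preprocess for this dataset (the exact choices here are an assumption, based on the transforms used later in the chapter) might look like this:
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),  # Resize to 64 x 64
    transforms.RandomHorizontalFlip(),            # Randomly flip (data augmentation)
    transforms.ToTensor(),                        # Convert to tensor in (0, 1)
    transforms.Normalize([0.5], [0.5]),           # Map to (-1, 1)
])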
Next, we need to create a dataloader to load the data in batches with these
transforms applied:
batch_size = 32
def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}
dataset.set_transform(transform)
train_dataloader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True
)
We can check that this worked by loading a single batch and inspecting the
images.
batch = next(iter(train_dataloader))
print('Shape:', batch['images'].shape,
      '\nBounds:', batch['images'].min().item(), 'to', batch['images'].max().item())
show_images(batch['images'][:8]*0.5 + 0.5)  # NB: we map back to (0, 1) for display
Shape: torch.Size([32, 3, 64, 64])
Bounds: -0.9921568632125854 to 1.0
Adding Noise
How do we gradually corrupt our data? The most common approach is to
add noise to the images. The amount of noise we add is controlled by a
noise schedule. Different papers and approaches tackle this in different
ways, which we’ll explore later in the chapter. For now, let’s see one
common approach in action based on the paper “Denoising diffusion
probabilistic models” by Ho et al. In the diffusers library, adding noise is
handled by something called a scheduler, which takes in a batch of images
and a list of ‘timesteps’ and determines how to create the noisy versions of
those images:
scheduler = DDPMScheduler(num_train_timesteps=1000, beta_start=0.001, beta_end=0.02)
timesteps = torch.linspace(0, 999, 8).long()
x = batch['images'][:8]
noise = torch.randn_like(x)  # Gaussian noise, as the DDPM formulation expects
noised_x = scheduler.add_noise(x, noise, timesteps)
show_images((noised_x*0.5 + 0.5).clip(0, 1))
During training, we’ll pick the timesteps at random. The scheduler takes
some parameters (beta_start and beta_end) which it uses to determine how
much noise should be present for a given timestep. We will cover
schedulers in more detail in section X.
The UNet
UNet is a convolutional neural network invented for tasks such as image
segmentation, where the desired output has the same spatial extent as the
input. It consists of a series of ‘downsampling’ layers that reduce the spatial
size of the input, followed by a series of ‘upsampling’ layers that increase
the spatial extent of the input again. Each downsampling layer also typically has a 'skip connection' that carries its output across to the input of the corresponding upsampling layer. This allows the upsampling layers to 'see' the higher-resolution representations from earlier in the network, which is especially valuable for tasks with image-like outputs where fine spatial detail matters.
The UNet architecture used in the diffusers library is more advanced than
the original UNet proposed in 2015 by Ronneberger et al, with additions
like attention and residual blocks. We’ll take a closer look later, but the key
feature here is that it can take in an input (the noisy image) and produce a
prediction that is the same shape (the predicted noise). For diffusion
models, the UNet typically also takes in the timestep as additional
conditioning, which again we will explore in the UNet deep dive section.
Here’s how we might create a UNet and feed our batch of noisy images
through it:
# Create a UNet2DModel
model = UNet2DModel(
    in_channels=3,   # 3 channels for RGB images
    sample_size=64,  # Specify our input size
    block_out_channels=(64, 128, 256, 512),  # N channels per layer
    down_block_types=("DownBlock2D", "DownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "AttnUpBlock2D", "UpBlock2D", "UpBlock2D"),
)
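To check that the shapes line up, we can feed a batch of noisy images (and their timesteps) through the model. A quick sketch, assuming the noised_x and timesteps from the scheduler example above and that the model has been moved to the same device:
with torch.no_grad():
    model_prediction = model(noised_x, timesteps).sample

model_prediction.shape  # Same shape as the input: torch.Size([8, 3, 64, 64])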
Training
Now that we have our model and our data ready, we can train it. We’ll use
the AdamW optimizer with a learning rate of 3e-4. For each training step,
we:
Load a batch of images.
Add noise to the images, choosing random timesteps to determine how
much noise is added.
Feed the noisy images into the model.
Calculate the loss, which is the mean squared error between the
model’s predictions and the target - which in this case is the noise that
we added to the images. This is called the noise or ‘epsilon’ objective.
You can find more information on the different training objectives in
section X.
Backpropagate the loss and update the model weights with the
optimizer.
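Putting those steps together, here's a minimal sketch of such a training loop (the epoch count, device handling and other small details are illustrative assumptions rather than the exact training script):
import torch
import torch.nn.functional as F

num_epochs = 50  # An illustrative choice
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
losses = []

for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Load a batch of images (already mapped to (-1, 1) by our transforms)
        clean_images = batch["images"].to(device)

        # Sample noise and random timesteps, then create the noisy images
        noise = torch.randn_like(clean_images)
        timesteps = torch.randint(
            0, scheduler.config.num_train_timesteps,
            (clean_images.shape[0],), device=device,
        ).long()
        noisy_images = scheduler.add_noise(clean_images, noise, timesteps)

        # Get the model's prediction (the 'epsilon' objective: predict the noise)
        noise_pred = model(noisy_images, timesteps, return_dict=False)[0]

        # Mean squared error between the prediction and the noise we added
        loss = F.mse_loss(noise_pred, noise)
        losses.append(loss.item())

        # Backpropagate and update the model weights
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()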
It takes an hour or so to run the above code on a GPU, so get some tea
while you wait or lower the number of epochs. Here’s what the loss curve
looks like after training:
Sampling
The diffusers library uses the idea of ‘pipelines’ which bundle together all
of the components needed to generate samples with a diffusion model:
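For example, once our model is trained we could bundle it with the scheduler into a DDPMPipeline and generate samples with a single call. A short sketch, assuming the model and scheduler from above:
from diffusers import DDPMPipeline

pipeline = DDPMPipeline(unet=model, scheduler=scheduler)
ims = pipeline(batch_size=4).images  # A list of PIL images
Under the hood, the pipeline runs a sampling loop much like the one below: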
for i, t in enumerate(scheduler.timesteps):
    ...  # Predict the noise with the model, then scheduler.step() to update the sample
This is the same code we used at the beginning of the chapter to illustrate
the idea of iterative refinement, but hopefully now you have a better
understanding of what is going on here. We start with a completely random
input, which is then refined by the model in a series of steps. Each step is a
small update to the input, based on the model’s prediction for the noise at
that timestep. We’re still abstracting away some complexity behind the call
to pipeline.scheduler.step() - in a later chapter we will dive
deeper into different sampling methods and how they work.
Evaluation
Generative model performance can be evaluated using FID scores (Fréchet
Inception Distance). FID scores measure how closely generated samples
match real-world samples by comparing statistics between feature maps
extracted from both sets of data using a pre-trained neural network. The
lower the score, the better the quality and realism of generated images
produced by a given model. FID scores are popular due to their ability to
provide an ‘objective’ comparison metric for different types of generative
networks without relying on human judgment.
As convenient as FID scores are, there are some important caveats to be
aware of:
The FID score for a given model depends on the number of samples
used to calculate it, so when comparing between models, we need to
make sure both reported scores are calculated using the same number
of samples. Common practice is to use 50,000 samples for this
purpose, although to save time, you may evaluate on a smaller number
of samples during development and only do the full evaluation once
you’re ready to publish the results.
When calculating FID, images are resized to 299px square images.
This makes it less useful as a metric for extremely low-res or high-res
images. There are also minor differences between how resizing is
handled by different deep learning frameworks, which can result in
small differences in the FID score! We recommend using a library such
as clean-fid to standardize the FID calculation.
The network used as a feature extractor for FID is typically a model
trained on the ImageNet classification task. When generating images in
a different domain, the features learned by this model may be less
useful. A more accurate approach would be to somehow train a
classification network on domain-specific data first, but this would
make it harder to compare scores between different papers and
approaches, so for now the ImageNet model is the standard choice.
If you save generated samples for later evaluation, the format and
compression can again affect the FID score. Avoid low-quality JPEG
images where possible.
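To make these points concrete, here's roughly what an FID calculation with the clean-fid library mentioned above looks like (a sketch; the folder paths are placeholders for directories of real and generated images):
from cleanfid import fid

score = fid.compute_fid("path/to/real_images", "path/to/generated_images")
print(f"FID: {score:.2f}")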
Even if you account for all these caveats, FID scores are just a rough
measure of quality and do not perfectly capture the nuances of what makes
images look more ‘real’. So, use them to get an idea of how one model
performs relative to another but also look at the actual images generated by
each model to get a better sense of how they compare. Human preference is
still the gold standard for quality in what is ultimately a fairly subjective
field!
Figure 1-1. Illustration of the different degradations used in the Cold Diffusion Paper
Nonetheless, adding noise remains the most popular approach for several
reasons:
We can easily control the amount of noise added, giving a smooth
transition from ‘perfect’ to ‘completely corrupted’. This is not the case
for something like reducing the resolution of an image, which may
result in ‘discrete’ transitions.
We can have many valid random starting points for inference, unlike
some methods which may only have a limited number of possible
initial (fully corrupted) states, such as a completely black image or a
single-pixel image.
So, for the moment at least, we’ll stick with adding noise as our corruption
method. Next, let’s take a closer look at how we add noise to our images.
Starting Simple
We have some images (x) and we’d like to combine them somehow with
some random noise.
x = next(iter(train_dataloader))['images'][:8]
noise = torch.rand_like(x)
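The corrupt function used below simply performs a linear mix between the image and the noise; a minimal version (the reshape of amount so it broadcasts over the batch is an implementation detail) could look like this:
def corrupt(x, noise, amount):
    """Corrupt the input x by mixing it with noise according to amount."""
    amount = amount.view(-1, 1, 1, 1)  # Make sure amount broadcasts over (B, C, H, W)
    return x * (1 - amount) + noise * amount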
Let’s see this in action on a batch of data, with the amount of noise varying
from 0 to 1:
amount = torch.linspace(0, 1, 8)
noised_x = corrupt(x, noise, amount)
show_images(noised_x*0.5 + 0.5)
This seems to be doing exactly what we want, smoothly transitioning from
the original image to pure noise. Now, we’ve created a noise schedule here
that takes in a value for ‘amount’ from 0 to 1. This is called the ‘continuous
time’ approach, where we represent the full path on a time scale from 0 to
1. Other approaches use a discrete time approach, with some large integer
number of ‘timesteps’ used to define the noise scheduler. We can wrap our
function into a class that converts from continuous time to discrete
timesteps and adds noise appropriately:
class SimpleScheduler():
    def __init__(self):
        self.num_train_timesteps = 1000

    def add_noise(self, x, noise, timesteps):
        amount = timesteps / self.num_train_timesteps
        return corrupt(x, noise, amount)

scheduler = SimpleScheduler()
timesteps = torch.linspace(0, 999, 8).long()
noised_x = scheduler.add_noise(x, noise, timesteps)
show_images(noised_x*0.5 + 0.5)
Now we have something that we can directly compare to the schedulers
used in the diffusers library, such as the DDPMScheduler we used during
training. Let’s see how it compares:
scheduler = DDPMScheduler(beta_end=0.01)
timesteps = torch.linspace(0, 999, 8).long()
noised_x = scheduler.add_noise(x, noise, timesteps)
show_images((noised_x*0.5 + 0.5).clip(0, 1))
The Maths
There are many competing notations and approaches in the literature. For
example, some papers parametrize the noise schedule in continuous-time
where t runs from 0 (no noise) to 1 (fully corrupted) - just like our
corrupt function in the previous section. Others use a discrete-time
approach with integer timesteps running from 0 to some large number T,
typically 1000. It is possible to convert between these two approaches the
way we did with our SimpleScheduler class - just make sure you’re
consistent when comparing different models. We’ll stick with the discrete-
time approach here.
A good place to start for pushing deeper into the maths is the DDPM paper
mentioned earlier. You can find an annotated implementation here which is
a great additional resource for understanding this approach.
The paper begins by specifying a single noise step to go from timestep t-1 to timestep t. Here's how they write it:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$

Here $\beta_t$ is defined for all timesteps t and is used to specify how much noise is added at each step. This notation can be a little intimidating, but what this equation tells us is that the noisier $x_t$ is a distribution with a mean of $\sqrt{1-\beta_t}\,x_{t-1}$ and a variance of $\beta_t$. In other words, $x_t$ is a mix of $x_{t-1}$ (scaled by $\sqrt{1-\beta_t}$) and some fresh Gaussian noise $\epsilon$ (scaled by $\sqrt{\beta_t}$):

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$$
To get the noisy input at timestep t, we could begin at t=0 and repeatedly apply this single step, but this would be very inefficient. Instead, we can find a formula to move to any timestep t in one go. We define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, which gives the closed form:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$
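As a quick sanity check, we can compute this closed form directly; the following sketch mirrors what scheduler.add_noise does internally (the beta values are the DDPM defaults and the variable names are illustrative):
betas = torch.linspace(0.0001, 0.02, 1000)     # A linear beta schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # alpha-bar for every timestep

t = 500  # An example timestep
noisy_x = alphas_cumprod[t].sqrt() * x + (1 - alphas_cumprod[t]).sqrt() * noise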
Our SimpleScheduler above just linearly mixes between the original image and noise, as we can see if we plot the scaling factors (equivalent to $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$ in the DDPM case):
plot_scheduler(SimpleScheduler())
A good noise schedule will ensure that the model sees a mix of images at
different noise levels. The best choice will differ based on the training data.
Visualizing a few more options, note that:
Setting beta_end too low means we never completely erase the image,
so the model will never see anything like the random noise used as a
starting point for inference.
Setting beta_end extremely high means that most of the timesteps are
spent on almost complete noise, which will result in poor training
performance.
Different beta schedules give different curves.
A Simple UNet
To better understand the structure of a UNet, let’s build a simple UNet from
scratch.
Figure 1-3. Our simple UNet architecture
class BasicUNet(nn.Module):
    """A minimal UNet implementation."""

    def __init__(self, in_channels=1, out_channels=1):
        super().__init__()
        self.down_layers = torch.nn.ModuleList([
            nn.Conv2d(in_channels, 32, kernel_size=5, padding=2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
            nn.Conv2d(64, 64, kernel_size=5, padding=2),
        ])
        self.up_layers = torch.nn.ModuleList([
            nn.Conv2d(64, 64, kernel_size=5, padding=2),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),
            nn.Conv2d(32, out_channels, kernel_size=5, padding=2),
        ])
        self.act = nn.SiLU()  # The activation function
        self.downscale = nn.MaxPool2d(2)
        self.upscale = nn.Upsample(scale_factor=2)

    def forward(self, x):
        h = []  # Store outputs for the skip connections
        for i, l in enumerate(self.down_layers):
            x = self.act(l(x))  # Through the layer and the activation function
            if i < 2:  # For all but the final down layer:
                h.append(x)  # Store output for skip connection
                x = self.downscale(x)  # Downscale ready for the next layer
        for i, l in enumerate(self.up_layers):
            if i > 0:  # For all except the first up layer
                x = self.upscale(x)  # Upscale
                x += h.pop()  # Fetch stored output (skip connection)
            x = self.act(l(x))  # Through the layer and the activation function
        return x
For comparison, here are the results on MNIST when using the UNet
implementation in the diffusers library, which features all of the above
improvements:
WARNING
This section will likely be expanded with results and more details in the future. We just
haven’t gotten around to training variants with the different improvements yet!
Alternative Architectures
More recently, a number of alternative architectures have been proposed for
diffusion models. These include:
Transformers. The DiT paper (“Scalable Diffusion Models with
Transformers”) by Peebles and Xie showed that a transformer-based
architecture can be used to train a diffusion model, with great results.
However, the compute and memory requirements of the transformer
architecture remain a challenge for very high resolutions.
The UViT architecture from the Simple Diffusion paper (Hoogeboom et al.) aims to
get the best of both worlds by replacing the middle layers of the UNet
with a large stack of transformer blocks. A key insight of this paper
was that focusing the majority of the compute at the lower resolution
blocks of the UNet allows for more efficient training of high-resolution
diffusion models. For very high resolutions, they do some additional
pre-processing using something called a wavelet transform to reduce
the spatial resolution of the input image while keeping as much
information as possible through the use of additional channels, again
reducing the amount of compute spent on the higher spatial
resolutions.
Recurrent Interface Networks. The RIN paper (Jabri et al) takes a
similar approach, first mapping the high-resolution inputs to a more
manageable and lower-dimensional ‘latent’ representation which is
then processed by a stack of transformer blocks before being decoded
back out to an image. Additionally, the RIN paper introduces an idea
of ‘recurrence’ where information is passed to the model from the
previous processing step, which can be beneficial for the kind of
iterative improvement that diffusion models are designed to perform.
It remains to be seen whether transformer-based approaches completely
supplant UNets as the go-to architecture for diffusion models, or whether
hybrid approaches like the UViT and RIN architectures will prove to be the
most effective.
Karras et al. (in the "Elucidating the design space" paper listed in the references) capture a similar idea via a parameter called c_skip, and unify the different
diffusion model formulations into a consistent framework. If you’re
interested in learning more about the different objectives, scalings and other
nuances of the different diffusion model formulations, we recommend
reading their paper for a more in-depth discussion.
Summary
In this chapter, we’ve seen how the idea of iterative refinement can be
applied to train a diffusion model capable of turning noise into beautiful
images. You’ve seen some of the design choices that go into creating a
successful diffusion model, and hopefully put them into practice by training
your own model. In the next chapter, we’ll take a look at some of the more
advanced techniques that have been developed to improve the performance
of diffusion models and to give them extraordinary new capabilities!
References
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion
probabilistic models.” Advances in Neural Information Processing Systems
33 (2020): 6840-6851.
Ronneberger, O., Fischer, P. and Brox, T., 2015. U-net: Convolutional
networks for biomedical image segmentation. In Medical Image Computing
and Computer-Assisted Intervention–MICCAI 2015: 18th International
Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18
(pp. 234-241). Springer International Publishing.
Bansal, Arpit, Eitan Borgnia, Hong-Min Chu, Jie S. Li, Hamid Kazemi,
Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. “Cold
diffusion: Inverting arbitrary image transforms without noise.” arXiv
preprint arXiv:2208.09392 (2022).
Hoogeboom, Emiel, Jonathan Heek, and Tim Salimans. “simple diffusion:
End-to-end diffusion for high resolution images.” arXiv preprint
arXiv:2301.11093 (2023).
Chang, Huiwen, Han Zhang, Jarred Barber, A. J. Maschinot, Jose Lezama,
Lu Jiang, Ming-Hsuan Yang et al. “Muse: Text-To-Image Generation via
Masked Generative Transformers.” arXiv preprint arXiv:2301.00704
(2023).
Chang, Huiwen, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman.
“Maskgit: Masked generative image transformer.” In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
11315-11325. 2022.
Rampas, Dominic, Pablo Pernias, Elea Zhong, and Marc Aubreville. “Fast
Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces.”
arXiv preprint arXiv:2211.07292 (2022).
Chen, Ting. “On the Importance of Noise Scheduling for Diffusion Models.”
arXiv preprint arXiv:2301.10972 (2023).
Peebles, William, and Saining Xie. “Scalable Diffusion Models with
Transformers.” arXiv preprint arXiv:2212.09748 (2022).
Jabri, Allan, David Fleet, and Ting Chen. “Scalable Adaptive Computation
for Iterative Generation.” arXiv preprint arXiv:2212.11972 (2022).
Salimans, Tim, and Jonathan Ho. “Progressive distillation for fast sampling
of diffusion models.” arXiv preprint arXiv:2202.00512 (2022).
Karras, Tero, Miika Aittala, Timo Aila, and Samuli Laine. “Elucidating the
design space of diffusion-based generative models.” arXiv preprint
arXiv:2206.00364 (2022).
Chapter 2. Building up to Stable
Diffusion
fashion_mnist = load_dataset("fashion_mnist")
clothes = fashion_mnist["train"]["image"][:8]
classes = fashion_mnist["train"]["label"][:8]
show_images(clothes, titles=classes, figsize=(4,2.5))
So class 0 means T-shirt/top, 2 is a pullover, and 9 means ankle boot. Here's a list of the
10 categories in Fashion MNIST:
https://huggingface.co/datasets/fashion_mnist#data-fields. We prepare our
dataset and dataloader similarly to how we did it in Chapter 3, with the
main difference that we’ll also include the class information as an input.
Instead of resizing, in this case we’ll pad our image inputs (which have a
size of 28 × 28 pixels) to 32 × 32, as we did in Chapter 3.
preprocess = transforms.Compose([
    transforms.RandomHorizontalFlip(),   # Randomly flip (data augmentation)
    transforms.ToTensor(),               # Convert to tensor (0, 1)
    transforms.Pad(2),                   # Add 2 pixels on all sides
    transforms.Normalize([0.5], [0.5]),  # Map to (-1, 1)
])
batch_size = 256
def transform(examples):
    images = [preprocess(image.convert("L")) for image in examples["image"]]
    return {"images": images, "labels": examples["label"]}
train_dataset = fashion_mnist["train"].with_transform(transform)
train_dataloader = torch.utils.data.DataLoader(
train_dataset, batch_size=batch_size, shuffle=True
)
model = UNet2DModel(
    in_channels=1,   # 1 channel for grayscale images
    out_channels=1,  # Output channels must also be 1
    sample_size=32,
    block_out_channels=(32, 64, 128, 256),
    norm_num_groups=8,
    num_class_embeds=10,  # Enable class conditioning
)
To make predictions with this model, we must pass in the class labels as
additional inputs to the forward method:
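For example, here's a quick sketch of a conditioned forward pass (the dummy inputs and device handling are assumptions, included just to show the shapes):
x = torch.randn(8, 1, 32, 32).to(device)   # A batch of dummy 'noisy' images
y = torch.randint(0, 10, (8,)).to(device)  # A class label for each image

with torch.no_grad():
    pred = model(x, timestep=7, class_labels=y).sample

pred.shape  # torch.Size([8, 1, 32, 32])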
Internally, both the timestep and the class label are turned into embeddings
that the model uses during its forward pass. At multiple stages throughout
the UNet, these embeddings are projected onto a dimension that matches
the number of channels in a given layer and are then added to the outputs of
that layer. This means the conditioning information is fed to every block of
the UNet, giving the model ample opportunity to learn how to use it
effectively.
scheduler = DDPMScheduler(num_train_timesteps=1000, beta_start=0.0001, beta_end=0.02)
timesteps = torch.linspace(0, 999, 8).long()
batch = next(iter(train_dataloader))
x = batch['images'][:8]
noise = torch.randn_like(x)  # Gaussian noise, as the DDPM formulation expects
noised_x = scheduler.add_noise(x, noise, timesteps)
show_images((noised_x*0.5 + 0.5).clip(0, 1))
Our training loop is almost exactly the same as in Chapter 3 too, except that we now pass the class labels for conditioning. Note that this is just additional information for the model; it doesn't affect how we define the loss function in any way.
# Get the model prediction for the noise - note the use of class_labels
noise_pred = model(noisy_images, timesteps,
                   class_labels=class_labels, return_dict=False)[0]
In this case we train for 25 epochs - full code can be found in the
supplementary material.
Sampling
Now we have a model that expects two inputs when making predictions: the
image and the class label. We can create samples by beginning with random
noise and then iteratively denoising, passing in whatever class label we’d
like to generate:
for i, t in tqdm(enumerate(scheduler.timesteps)):
    # Get model pred
    with torch.no_grad():
        noise_pred = model(sample, t, class_labels=class_labels).sample

    # Update sample with step
    sample = scheduler.step(noise_pred, t, sample).prev_sample
Figure 2-1. The architecture introduced in the Latent Diffusion Models paper. Note the VAE encoder
and decoder on the left for translating between pixel space and latent space
Latent diffusion tries to mitigate this issue by using a separate model called
a Variational Auto-Encoder (VAE). As we saw in Chapter 2, VAEs can
compress images to a smaller spatial dimension. The rationale behind this is
that images tend to contain a large amount of redundant information - given
enough training data, a VAE can hopefully learn to produce a much smaller
representation of an input image and then reconstruct the image based on
this small latent representation with a high degree of fidelity. The VAE used
in SD takes in 3-channel images and produces a 4-channel latent
representation with a reduction factor of 8 for each spatial dimension. That
is, a 512px square input image will be compressed down to a 4x64x64
latent.
By applying the diffusion process on these smaller latent representations
rather than on full-resolution images, we can get many of the benefits that
would come from using smaller images (lower memory usage, fewer layers
needed in the UNet, faster generation times…) and still decode the result
back to a high-resolution image once we’re ready to view it. This
innovation dramatically lowers the cost to train and run these models. The
paper that introduced this idea (High-Resolution Image Synthesis with
Latent Diffusion Models by Rombach et al) demonstrated the power of this
technique by training models conditioned on segmentation maps, class
labels and text. The impressive results led to further collaboration between
the authors and partners such as RunwayML, LAION, and EleutherAI to
train a more powerful version of the model, which later became Stable
Diffusion.
In this section we’ll explore all of the components that make this possible.
Figure 2-2. The text encoder turns an input string into text embeddings which are fed into the UNet
along with the timestep and the noisy latents.
For this to work, we need to create a numeric representation of the text that
captures relevant information about what it describes. To do this, SD
leverages a pre-trained transformer model based on CLIP, which was also
introduced in Chapter 2. The text encoder is a transformer model that takes
in a sequence of tokens and produces a 1024-dimensional vector for each
token (or 768-dimensional in the case of SD version 1, which we're using
for the demonstrations in this section). Instead of combining these vectors
into a single representation, we keep them separate and use them as
conditioning for the UNet. This allows the UNet to make use of the
information in each token separately, rather than just the overall meaning of
the entire prompt. Because we’re extracting these text embeddings from the
internal representation of the CLIP model, they are often called the
“encoder hidden states”. Figure 2-3 shows this text encoding process.
Figure 2-3. Diagram showing the text encoding process which transforms the input prompt into a set
of text embeddings (the encoder_hidden_states) which can then be fed in as conditioning to the UNet.
The first step to encode text is to follow a process called tokenization. This
converts a sequence of characters into a sequence of numbers, where each
number represents a group of various characters. Characters that are usually
found together (like most common words) can be assigned a single token
that represents the whole word or group. Long or complicated words, or
words with many inflections, may be translated to multiple tokens, where
each one usually represents a meaningful section of the word.
There is no single “best” tokenizer; instead, each language model comes with its own. Differences lie in the number of tokens supported and in the tokenization strategy: do we use single characters, groups of characters like those just described, or some other primitive unit? In the following
example we see how the tokenization of a phrase works with Stable
Diffusion’s tokenizer. Each word in our sentence is assigned a unique token
number (for example, photograph happens to be 8853 in the tokenizer’s
vocabulary). There are also additional tokens that are used to provide
additional context, such as the point where the sentence ends.
text_input = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt")
Once the text is tokenized, we can pass it through the text encoder to get the
final text embeddings that will be fed into the UNet:
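A sketch of that step, assuming the text_input from the tokenizer call above and that pipe is a loaded Stable Diffusion pipeline on device:
with torch.no_grad():
    text_embeddings = pipe.text_encoder(text_input.input_ids.to(device))[0]

text_embeddings.shape  # e.g. torch.Size([1, 77, 768]) for SD version 1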
We’ll go into more detail about how a transformer model processes a string
of tokens in the chapters focusing on transformer models.
Classifier-free guidance
It turns out that even with all of the effort put into making the text
conditioning as useful as possible, the model still tends to default to relying
mostly on the noisy input image rather than the prompt when making its
predictions. In a way, this makes sense - many captions are only loosely
related to their associated images and so the model learns not to rely too
heavily on the descriptions! However, this is undesirable when it comes
time to generate new images - if the model doesn’t follow the prompt then
we may get images out that don’t relate to our description at all.
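The fix, known as classifier-free guidance (CFG), is to make two noise predictions at each sampling step - one with the text conditioning and one with an 'unconditional' (empty prompt) embedding - and push the final prediction further in the direction suggested by the text. Here's a simplified sketch of the idea (the variable names are placeholders; the real pipeline batches the two predictions together, as we'll see later in this chapter):
# Two predictions: one conditioned on the prompt, one on the empty-string embedding
noise_pred_text = unet(latents, t, encoder_hidden_states=text_embeddings).sample
noise_pred_uncond = unet(latents, t, encoder_hidden_states=uncond_embeddings).sample

# Push the prediction away from the unconditional one by the guidance scale
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)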
Figure 2-4. Images generated from the prompt “An oil painting of a collie in a top hat” with CFG
scale 0, 1, 2 and 10 (left to right)
The VAE
The VAE is tasked with compressing images into a smaller latent
representation and back again. The VAE used with Stable Diffusion is a
truly impressive model. We won’t go into the training details here, but in
addition to the usual reconstruction loss and KL divergence described in
Chapter 2 they use an additional patch-based discriminator loss to help the
model learn to output plausible details and textures. This adds a GAN-like
component to training and helps to avoid the slightly blurry outputs that
were typical in previous VAEs. Like the text encoder, the VAE is usually
trained separately and used as a frozen component during the diffusion
model training and sampling process.
Figure 2-5. Encoding and decoding an image with the VAE
Let’s load an image and see what it looks like after being compressed and
decompressed by the VAE:
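A sketch of the encoding step, assuming im is a (1, 3, 512, 512) image tensor scaled to (-1, 1) and vae is the pipeline's pipe.vae (the 0.18215 scaling factor is the one used by SD v1):
with torch.no_grad():
    latents = vae.encode(im.to(device)).latent_dist.sample() * 0.18215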
latents.shape
torch.Size([1, 4, 64, 64])
And decoding back to image space, we get an output image that is almost
identical to the original. Can you spot the difference?
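The decoding step is the mirror image (again, a sketch):
with torch.no_grad():
    decoded = vae.decode(latents / 0.18215).sample

show_image(decoded[0].float() * 0.5 + 0.5)  # Map back to (0, 1) for display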
The UNet
The UNet used in Stable Diffusion is somewhat similar to the one we used in Chapter 3 for generating images. Instead of taking in a 3-channel image as the input, it takes in a 4-channel latent. The timestep embedding is fed in the same way as the class conditioning was in the example at the start of this chapter. But this UNet also needs to accept the text embeddings as
additional conditioning. Scattered throughout the UNet are cross-attention
layers. Each spatial location in the UNet can attend to different tokens in
the text conditioning, bringing in relevant information from the prompt. The
diagram in Figure 2-6 shows how this text conditioning (as well as timestep-based conditioning) is fed in at different points.
Figure 2-6. The Stable Diffusion UNet
The UNet for Stable Diffusion version 1 and 2 has around 860 million
parameters. The more recent SD XL has even more, at around (details
TBC), with most of the additional parameters being added at the lower-
resolution stages via additional channels in the residual blocks (N vs 1280
in the original) and additional transformer blocks.
NB: Stable Diffusion XL has not yet been publicly released, so this
section will be updated when more information is public.
# Some settings
prompt = ["Acrylic palette knife painting of a flower"]  # What we want to generate
height = 512                  # Default height of Stable Diffusion
width = 512                   # Default width of Stable Diffusion
num_inference_steps = 30      # Number of denoising steps
guidance_scale = 7.5          # Scale for classifier-free guidance
seed = 42                     # Seed for the random number generator
The first step is to encode the text prompt. Because we plan to do classifier-
free guidance, we’ll actually create two sets of text embeddings: one with
the prompt, and one representing an empty string. You can also encode a
negative prompt in place of the empty string, or combine multiple prompts
with different weightings, but this is the most common usage:
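A sketch of that encoding step (assuming pipe is a loaded Stable Diffusion pipeline on device; the calls mirror the tokenizer and text encoder usage from earlier in the chapter):
# Encode the prompt
text_input = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = pipe.text_encoder(text_input.input_ids.to(device))[0]

# Encode an empty prompt for the unconditional branch of classifier-free guidance
uncond_input = pipe.tokenizer([""] * len(prompt), padding="max_length",
                              max_length=pipe.tokenizer.model_max_length,
                              return_tensors="pt")
with torch.no_grad():
    uncond_embeddings = pipe.text_encoder(uncond_input.input_ids.to(device))[0]

# Concatenate so both can be fed through the UNet in a single batch
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])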
Next we create our random initial latents and set up the scheduler to use the
desired number of inference steps:
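Again as a sketch (the latent is 8x smaller than the image in each spatial dimension, with 4 channels; init_noise_sigma scales the noise to what the scheduler expects):
# Prepare the scheduler
pipe.scheduler.set_timesteps(num_inference_steps)

# Random initial latents, seeded for reproducibility
generator = torch.manual_seed(seed)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(device) * pipe.scheduler.init_noise_sigma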
Now we loop through the sampling steps, getting the model prediction at
each stage and using this to update the latents:
# Sampling loop
for i, t in enumerate(pipe.scheduler.timesteps):
    ...  # Predict the noise, apply guidance and update the latents (sketched below)

# Display the final image (decoded from the latents with the VAE - see below)
show_image(image[0].float());
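Filling in the elided loop body and the decoding step that produces image above, here's a sketch consistent with what the standard StableDiffusionPipeline does (the exact calls and the 0.18215 latent scaling factor are assumptions based on SD v1):
for i, t in enumerate(pipe.scheduler.timesteps):
    # Duplicate the latents so the unconditional and conditional predictions
    # can be made in one batch, then scale as the scheduler expects
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = pipe.scheduler.scale_model_input(latent_model_input, t)

    # Predict the noise with the UNet
    with torch.no_grad():
        noise_pred = pipe.unet(latent_model_input, t,
                               encoder_hidden_states=text_embeddings).sample

    # Classifier-free guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Scheduler step: update the latents
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# Decode the final latents into an image with the VAE
with torch.no_grad():
    image = pipe.vae.decode(latents / 0.18215).sample
image = (image / 2 + 0.5).clamp(0, 1)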
If you explore the source code for the StableDiffusionPipeline you’ll see
that the code above closely matches the call method used by the pipeline.
Hopefully this annotated version shows that there is nothing too magical
going on behind the scenes! Use this as a reference for when we encounter
other pipelines that add extra tricks to this foundation.
Figure 2-7. “An explosion of artistic creativity” - Image generated by the authors using Stable
Diffusion
Stable Diffusion was one such model, trained on a subset of LAION as part
of a collaboration between the researchers who had invented latent diffusion
models and an organization called Stability AI. Training a model like SD
requires a significant amount of GPU time. Even with the freely-available
LAION dataset, there aren’t many who could afford the investment. This is
why the public release of the model weights and code was such a big deal -
it marked the first time a powerful text-to-image model with similar
capabilities to the best closed-source alternatives was available to all. Stable
Diffusion’s public availability has made it the go-to choice for researchers
and developers looking to explore this technology over the past year.
Hundreds of papers build upon the base model, adding new capabilities or
finding innovative ways to improve its speed and quality. And innumerable
startups have found ways to integrate these rapidly-improving tools into
their products, spawning an entire ecosystem of new applications.
The months after the introduction of Stable Diffusion demonstrated the
impact of sharing these technologies in the open. SD is not the best text-to-
image model, but it IS the best model most of us had access to, so
thousands of people have spent their time making it better and building
upon that open foundation. We hope this example encourages others to
follow suit and share their work with the open-source community in the
future!
Summary
In this chapter we’ve seen how conditioning gives us new ways to control
the images generated by diffusion models. We’ve seen how latent diffusion
lets us train diffusion models more efficiently. We’ve seen how a text
encoder can be used to condition a diffusion model on a text prompt,
enabling powerful text-to-image capabilities. And we’ve explored how all
of this comes together in the Stable Diffusion model by digging into the
sampling loop and seeing how the different components work together. In
the next chapter, we’ll show some of the many additional capabilities that
can be added to diffusion models such as SD to take them beyond simple
image generation. And later, in part 2 of the book, you’ll learn how to fine-
tune SD to add new knowledge or capabilities to the model.
About the Authors
Pedro Cuenca is a machine learning engineer who works on diffusion
software, models, and applications at Hugging Face.
Apolinário Passos is a machine learning art engineer at Hugging Face,
working across teams on multiple machine learning for art and creativity
use cases.
Omar Sanseviero is a lead machine learning engineer at Hugging Face,
where he works at the intersection of open source, community, and product.
Previously, Omar worked at Google on Google Assistant and TensorFlow.
Jonathan Whitaker is a data scientist and deep learning researcher focused
on generative modeling. Besides his research and consulting work, his main
focus is on sharing knowledge, which he does via the DataScienceCastnet
YouTube channel and various free online resources he has created.