Stable Diffusion 3 Paper
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser * Sumith Kulal Andreas Blattmann Rahim Entezari Jonas Müller Harry Saini Yam Levi
Dominik Lorenz Axel Sauer Frederic Boesel Dustin Podell Tim Dockhorn Zion English
Kyle Lacey Alex Goodwin Yannik Marek Robin Rombach *
Stability AI
Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following
and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.
For a0 = 1, b0 = 0, a1 = 0 and b1 = 1, the marginals,
\[ p_t(z_t) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\, p_t(z_t \mid \epsilon), \tag{3} \]
are consistent with the data and noise distribution.

To express the relationship between zt, x0 and ε, we introduce ψt and ut as
\[ \psi_t(\cdot \mid \epsilon) : x_0 \mapsto a_t x_0 + b_t \epsilon, \tag{4} \]
\[ u_t(z \mid \epsilon) := \psi_t'\big(\psi_t^{-1}(z \mid \epsilon) \mid \epsilon\big). \tag{5} \]

Next, we use Equation (10) to reparameterize Equation (8) as a noise-prediction objective. One can derive various weighted loss functions that provide a signal towards the desired solution but might affect the optimization trajectory. For a unified analysis of different approaches, including classic diffusion formulations, we can write the objective in the following form (following Kingma & Gao (2023)):
\[ \mathcal{L}_w(x_0) = -\tfrac{1}{2}\, \mathbb{E}_{t \sim \mathcal{U}(t),\, \epsilon \sim \mathcal{N}(0, I)} \big[ w_t \lambda_t'\, \lVert \epsilon_\Theta(z_t, t) - \epsilon \rVert^2 \big], \]
where wt = −(1/2) λ′t b²t corresponds to LCFM. The EDM weighting corresponds to
\[ w_t^{\mathrm{EDM}} = \mathcal{N}\!\left(\lambda_t \mid -2 P_m, (2 P_s)^2\right)\left(e^{-\lambda_t} + 0.5^2\right). \tag{16} \]
(LDM-)Linear LDM (Rombach et al., 2022) uses a modification of the DDPM schedule (Ho et al., 2020). Both are variance preserving schedules, i.e. bt = √(1 − at²), and define at for discrete timesteps t = 0, …, T − 1 in terms of diffusion coefficients βt as at = (∏_{s=0}^{t} (1 − βs))^{1/2}. For given boundary values β0 and βT−1, DDPM uses βt = β0 + (t/(T−1))(βT−1 − β0) and LDM uses βt = (√β0 + (t/(T−1))(√βT−1 − √β0))².
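To make the two discrete schedules concrete, the following NumPy sketch computes βt for both variants and the induced at, bt. The boundary values passed in are illustrative placeholders, not the exact settings used in the paper.

```python
import numpy as np

def ddpm_linear_betas(beta_0, beta_T, T):
    # DDPM: linear interpolation of beta between the boundary values.
    t = np.arange(T)
    return beta_0 + t / (T - 1) * (beta_T - beta_0)

def ldm_linear_betas(beta_0, beta_T, T):
    # LDM: linear interpolation of sqrt(beta), then squared.
    t = np.arange(T)
    return (np.sqrt(beta_0) + t / (T - 1) * (np.sqrt(beta_T) - np.sqrt(beta_0))) ** 2

def variance_preserving_a_b(betas):
    # a_t = (prod_{s<=t} (1 - beta_s))^(1/2), b_t = sqrt(1 - a_t^2).
    a = np.sqrt(np.cumprod(1.0 - betas))
    b = np.sqrt(1.0 - a ** 2)
    return a, b

# Illustrative boundary values (not necessarily those used in the paper).
betas = ldm_linear_betas(beta_0=0.00085, beta_T=0.012, T=1000)
a_t, b_t = variance_preserving_a_b(betas)
```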
3.1. Tailored SNR Samplers for RF models

The RF loss trains the velocity vΘ uniformly on all timesteps in [0, 1]. Intuitively, however, the resulting velocity prediction target ε − x0 is more difficult for t in the middle of [0, 1], since for t = 0, the optimal prediction is the mean of p1, and for t = 1 the optimal prediction is the mean of p0. In general, changing the distribution over t from the commonly used uniform distribution U(t) to a distribution with density π(t) is equivalent to a weighted loss L_{w_t^π} with
\[ w_t^{\pi} = \frac{t}{1-t}\, \pi(t). \tag{18} \]
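As a reference point for the timestep densities described next, here is a minimal PyTorch sketch of one rectified-flow training step. It assumes the standard RF interpolation zt = (1 − t) x0 + t ε and a model that predicts the velocity; the timestep sampler defaults to U(0, 1) but can be swapped for any density π(t).

```python
import torch
import torch.nn.functional as F

def rf_training_loss(model, x0, timestep_sampler=None):
    # Draw per-sample timesteps, uniformly by default.
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device) if timestep_sampler is None else timestep_sampler(b)
    t_ = t.view(b, *([1] * (x0.ndim - 1)))

    # Rectified-flow interpolation between data and noise (a_t = 1 - t, b_t = t).
    eps = torch.randn_like(x0)
    z_t = (1 - t_) * x0 + t_ * eps

    # The velocity prediction target is eps - x0.
    return F.mse_loss(model(z_t, t), eps - x0)

# Example with a trivial stand-in model:
model = lambda z, t: torch.zeros_like(z)
loss = rf_training_loss(model, torch.randn(4, 3, 8, 8))
```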
Thus, we aim to give more weight to intermediate timesteps by sampling them more frequently. Next, we describe the timestep densities π(t) that we use to train our models.

Logit-Normal Sampling One option for a distribution that puts more weight on intermediate steps is the logit-normal distribution (Atchison & Shen, 1980). Its density,
\[ \pi_{\mathrm{ln}}(t; m, s) = \frac{1}{s\sqrt{2\pi}} \frac{1}{t(1-t)} \exp\!\left( -\frac{(\operatorname{logit}(t) - m)^2}{2 s^2} \right), \tag{19} \]
where logit(t) = log(t / (1 − t)), has a location parameter, m, and a scale parameter, s. The location parameter enables us to bias the training timesteps towards either data p0 (negative m) or noise p1 (positive m). As shown in Figure 11, the scale parameter controls how wide the distribution is.

In practice, we sample the random variable u from a normal distribution u ∼ N(u; m, s) and map it through the standard logistic function.
Mode Sampling with Heavy Tails The logit-normal density always vanishes at the endpoints 0 and 1. To study whether this has adverse effects on the performance, we also use a timestep sampling distribution with strictly positive density on [0, 1]. For a scale parameter s, we define
\[ f_{\mathrm{mode}}(u; s) = 1 - u - s \cdot \left( \cos^2\!\left(\tfrac{\pi}{2} u\right) - 1 + u \right). \tag{20} \]
For −1 ≤ s ≤ 2/(π − 2), this function is monotonic, and we can use it to sample from the implied density π_mode(t; s) = d/dt f_mode^{-1}(t). As seen in Figure 11, the scale parameter controls the degree to which either the midpoint (positive s) or the endpoints (negative s) are favored during sampling. This formulation also includes a uniform weighting π_mode(t; s = 0) = U(t) for s = 0, which has been used widely in previous works on Rectified Flows (Liu et al., 2022; Ma et al., 2024).

CosMap Finally, we also consider the cosine schedule (Nichol & Dhariwal, 2021) from Section 3 in the RF setting. In particular, we are looking for a mapping f : u ↦ f(u) = t, u ∈ [0, 1], such that the log-snr matches that of the cosine schedule: 2 log(cos(πu/2) / sin(πu/2)) = 2 log((1 − f(u)) / f(u)). Solving for f, we obtain for u ∼ U(u)
\[ t = f(u) = 1 - \frac{1}{\tan\!\left(\tfrac{\pi}{2} u\right) + 1}, \tag{21} \]
from which we obtain the density
\[ \pi_{\mathrm{CosMap}}(t) = \frac{d}{dt} f^{-1}(t) = \frac{2}{\pi - 2\pi t + 2\pi t^2}. \tag{22} \]
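All three densities can be sampled by drawing u and mapping it through the corresponding transform. A minimal PyTorch sketch (the location/scale values passed in are up to the user, not prescribed here):

```python
import torch

def sample_t_lognorm(n, m=0.0, s=1.0):
    # Logit-normal sampling: u ~ N(m, s), mapped through the logistic function.
    return torch.sigmoid(m + s * torch.randn(n))

def sample_t_mode(n, s=0.0):
    # Mode sampling with heavy tails: t = f_mode(u; s), u ~ U(0, 1); s = 0 is uniform.
    u = torch.rand(n)
    return 1 - u - s * (torch.cos(torch.pi / 2 * u) ** 2 - 1 + u)

def sample_t_cosmap(n):
    # CosMap: t = 1 - 1 / (tan(pi/2 * u) + 1), u ~ U(0, 1).
    u = torch.rand(n)
    return 1 - 1 / (torch.tan(torch.pi / 2 * u) + 1)
```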
4. Text-to-Image Architecture

For text-conditional sampling of images, our model has to take both modalities, text and images, into account. We use pretrained models to derive suitable representations and then describe the architecture of our diffusion backbone. An overview of this is presented in Figure 2.

Figure 2. Our model architecture. Concatenation and element-wise multiplication (∗) are indicated in the diagram. The RMS-Norm for Q and K can be added to stabilize training runs. Best viewed zoomed in.

Our general setup follows LDM (Rombach et al., 2022) for training text-to-image models in the latent space of a pretrained autoencoder. Similar to the encoding of images to latent representations, we also follow previous approaches (Saharia et al., 2022b; Balaji et al., 2022) and encode the text conditioning c using pretrained, frozen text models. Details can be found in Appendix B.2.

Multimodal Diffusion Backbone Our architecture builds upon the DiT (Peebles & Xie, 2023) architecture. DiT only considers class conditional image generation and uses a modulation mechanism to condition the network on both the timestep of the diffusion process and the class label. Similarly, we use embeddings of the timestep t and c_vec as inputs to the modulation mechanism. However, as the pooled text representation retains only coarse-grained information about the text input (Podell et al., 2023), the network also requires information from the sequence representation c_ctxt.
We construct a sequence consisting of embeddings of the text and image inputs. Specifically, we add positional encodings and flatten 2 × 2 patches of the latent pixel representation x ∈ R^{h×w×c} to a patch encoding sequence of length (1/2)h · (1/2)w. After embedding this patch encoding and the text encoding c_ctxt to a common dimensionality, we concatenate the two sequences. We then follow DiT and apply a sequence of modulated attention and MLPs.

Since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities. As shown in Figure 2b, this is equivalent to having two independent transformers for each modality, but joining the sequences of the two modalities for the attention operation, such that both representations can work in their own space yet take the other one into account.

For our scaling experiments, we parameterize the size of the model in terms of the model's depth d, i.e. the number of attention blocks, by setting the hidden size to 64 · d (expanded to 4 · 64 · d channels in the MLP blocks), and the number of attention heads equal to d.
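The following PyTorch sketch illustrates the joint attention step with two separate weight sets; it omits the modulation, MLPs, and the optional QK-normalization, and all class and variable names are ours rather than from any released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Each modality has its own QKV and output projections, but attention
    is computed over the concatenated text + image token sequence."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.proj_txt = nn.Linear(dim, dim)
        self.proj_img = nn.Linear(dim, dim)

    def forward(self, c: torch.Tensor, x: torch.Tensor):
        # c: text tokens (B, Lc, dim), x: latent image tokens (B, Lx, dim)
        B, Lc, D = c.shape
        qkv = torch.cat([self.qkv_txt(c), self.qkv_img(x)], dim=1)  # (B, Lc+Lx, 3D)
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(t):
            return t.view(B, -1, self.num_heads, D // self.num_heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(B, -1, D)
        c_out, x_out = out[:, :Lc], out[:, Lc:]
        return self.proj_txt(c_out), self.proj_img(x_out)

# Example sizing following the depth parameterization above (hidden size 64*d, d heads).
d = 24
attn = JointAttention(dim=64 * d, num_heads=d)
c_out, x_out = attn(torch.randn(1, 77, 64 * d), torch.randn(1, 256, 64 * d))
```

Wrapping such a block with per-modality modulation and MLP layers, as sketched in Figure 2, would yield one MM-DiT block.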
5. Experiments

5.1. Improving Rectified Flows

We aim to understand which of the approaches for simulation-free training of normalizing flows as in Equation 1 is the most efficient. To enable comparisons across different approaches, we control for the optimization algorithm, the model architecture, the dataset and samplers. In addition, the losses of different approaches are incomparable and also do not necessarily correlate with the quality of output samples; hence we need evaluation metrics that allow for a comparison between approaches. We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings (different guidance scales and sampling steps). We calculate the FID on CLIP features as proposed by Sauer et al. (2021). All metrics are evaluated on the COCO-2014 validation split (Lin et al., 2014). Full details on the training and sampling hyperparameters are provided in Appendix B.3.

5.1.1. Results

We train each of 61 different formulations on the two datasets. We include the following variants from Section 3:

• Both ε- and v-prediction loss with linear (eps/linear, v/linear) and cosine (eps/cos, v/cos) schedule.
• RF loss with π_mode(t; s) (rf/mode(s)) with 7 values for s chosen uniformly between −1 and 1.75, and
Figure 3. Rectified flows are sample efficient. Rectified Flows perform better than other formulations when sampling fewer steps. For 25 and more steps, only rf/lognorm(0.00, 1.00) remains competitive to eps/linear. (The plot shows FID over the number of sampling steps for edm(-1.20, 1.20), eps/linear, rf/lognorm(0.00, 1.00), rf, v/cos and v/linear.)

Table 4. Improved Captions. Using a 50/50 mixing ratio of synthetic (via CogVLM (Wang et al., 2023)) and original captions improves text-to-image performance. Assessed via the GenEval (Ghosh et al., 2023) benchmark; numbers are success rates [%].

                     Original Captions   50/50 Mix
  Color Attribution  11.75               24.75
  Colors             71.54               68.09
  Position            6.50               18.00
  Counting           33.44               41.56
  Single Object      95.00               93.75
  Two Objects        41.41               52.53
  Overall score      43.27               49.78
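Operationally, the 50/50 mix in Table 4 can be implemented by flipping a fair coin per training sample between its original and its synthetic caption. A minimal sketch with illustrative field names:

```python
import random

def pick_caption(example: dict, p_synthetic: float = 0.5) -> str:
    """Choose between the original and the synthetic (VLM-generated) caption.

    `example` is assumed to carry both caption variants under illustrative keys.
    """
    if example.get("synthetic_caption") and random.random() < p_synthetic:
        return example["synthetic_caption"]
    return example["original_caption"]
```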
Table 3. Improved Autoencoders. Reconstruction performance metrics for different channel configurations. The downsampling factor for all models is f = 8.

  Metric                      4 chn   8 chn   16 chn
  FID (↓)                     2.41    1.56    1.06
  Perceptual Similarity (↓)   0.85    0.68    0.45
  SSIM (↑)                    0.75    0.79    0.86
  PSNR (↑)                    25.12   26.40   28.62

Accordingly, the final performance of our algorithm depends not only on the training formulation, but also on the parameterization via a neural network and the quality of the image and text representations we use. In the following sections, we describe how we improve all these components before scaling our final method in Section 5.3.

5.2.1. Improved Autoencoders

Latent diffusion models achieve high efficiency by operating in the latent space of a pretrained autoencoder (Rombach et al., 2022), which maps an input RGB X ∈ R^{H×W×3} into a lower-dimensional space x = E(X) ∈ R^{h×w×d}. The reconstruction quality of this autoencoder provides an upper bound on the achievable image quality after latent diffusion training. Similar to Dai et al. (2023), we find that increasing the number of latent channels d significantly boosts reconstruction performance, see Table 3. Intuitively, predicting latents with higher d is a more difficult task, and thus models with increased capacity should be able to perform better for larger d, ultimately achieving higher image quality. We confirm this hypothesis in Figure 10, where we see that the d = 16 autoencoder exhibits better scaling performance in terms of sample FID. For the remainder of this paper, we thus choose d = 16.

5.2.2. Improved Captions

Betker et al. (2023) demonstrated that synthetically generated captions can greatly improve text-to-image models trained at scale. This is due to the oftentimes simplistic nature of the human-generated captions that come with large-scale image datasets, which overly focus on the image subject and usually omit details describing the background or composition of the scene, or, if applicable, displayed text (Betker et al., 2023). We follow their approach and use an off-the-shelf, state-of-the-art vision-language model, CogVLM (Wang et al., 2023), to create synthetic annotations for our large-scale image dataset. As synthetic captions may cause a text-to-image model to forget about certain concepts not present in the VLM's knowledge corpus, we use a ratio of 50 % original and 50 % synthetic captions.

To assess the effect of training on this caption mix, we train two d = 15 MM-DiT models for 250k steps, one on only original captions and the other on the 50/50 mix. We evaluate the trained models using the GenEval benchmark (Ghosh et al., 2023) in Table 4. The results demonstrate that the model trained with the addition of synthetic captions clearly outperforms the model that only utilizes original captions. We thus use the 50/50 synthetic/original caption mix for the remainder of this work.

5.2.3. Improved Text-to-Image Backbones

In this section, we compare the performance of existing transformer-based diffusion backbones with our novel multimodal transformer-based diffusion backbone, MM-DiT, as introduced in Section 4. MM-DiT is specifically designed to handle different domains, here text and image tokens, using (two) different sets of trainable model weights. More specifically, we follow the experimental setup from Section 5.1 and compare text-to-image performance on CC12M of DiT, CrossDiT (DiT but with cross-attending to the text tokens instead of sequence-wise concatenation (Chen et al., 2023)) and our MM-DiT. For MM-DiT, we compare models with two sets of weights and three sets of weights, where the latter handles the CLIP (Radford et al., 2021) and T5 (Raffel et al., 2019) tokens (cf. Section 4) separately. Note that DiT (w/ concatenation of text and image tokens as in Section 4) can be interpreted as a special case of MM-DiT with one shared set of weights for all modalities.
(A full page of qualitative samples follows, each labeled with its prompt, e.g. "a space elevator, cinematic scifi art", "an origami pig on fire in the middle of a dark room with a pentagram on the floor", and a long description of a waffle/hippopotamus hybrid creature.)
Figure 4. Training dynamics of model architectures. Comparative analysis of DiT, CrossDiT, UViT, and MM-DiT on CC12M, focusing on validation loss, CLIP score, and FID. Our proposed MM-DiT performs favorably across all metrics.

Figure 5. Effects of QK-normalization. Normalizing the Q- and K-embeddings before calculating the attention matrix prevents the attention-logit growth instability (left), which causes the attention entropy to collapse (right) and has been previously reported in the discriminative ViT literature (Dehghani et al., 2023; Wortsman et al., 2023). In contrast with these previous works, we observe this instability in the last transformer blocks of our networks. Maximum attention logits and attention entropies are shown averaged over the last 5 blocks of a 2B (d=24) model.

Finally, we consider the UViT (Hoogeboom et al., 2023) architecture as a hybrid between the widely used UNets and transformer variants. We analyze the convergence behavior of these architectures in Figure 4: Vanilla DiT underperforms UViT. The cross-attention DiT variant CrossDiT achieves better performance than UViT, although UViT seems to learn much faster initially. Our MM-DiT variant significantly outperforms the cross-attention and vanilla variants. We observe only a small gain when using three parameter sets instead of two (at the cost of increased parameter count and VRAM usage), and thus opt for the former option for the remainder of this work.

5.3. Training at Scale

Before scaling up, we filter and preencode our data to ensure safe and efficient pretraining. Then, all previous considerations of diffusion formulations, architectures, and data culminate in the last section, where we scale our models up to 8B parameters.

5.3.1. Data Preprocessing

Pre-Training Mitigations Training data significantly impacts a generative model's abilities. Consequently, data filtering is effective at constraining undesirable capabilities (Nichol, 2022). Before training at scale, we filter our data for the following categories: (i) Sexual content: We use NSFW-detection models to filter for explicit content. (ii) Aesthetics: We remove images for which our rating systems predict a low score. (iii) Regurgitation: We use a cluster-based deduplication method to remove perceptual and semantic duplicates from the training data; see Appendix E.2.

Precomputing Image and Text Embeddings Our model uses the output of multiple pretrained, frozen networks as inputs (autoencoder latents and text encoder representations). Since these outputs are constant during training, we precompute them once for the entire dataset. We provide a detailed discussion of our approach in Appendix E.1.

5.3.2. Finetuning on High Resolutions

QK-Normalization In general, we pretrain all of our models on low-resolution images of size 256² pixels. Next, we finetune our models on higher resolutions with mixed aspect ratios (see next paragraph for details). We find that, when moving to high resolutions, mixed precision training can become unstable and the loss diverges. This can be remedied by switching to full precision training, but that comes with a ∼2× performance drop compared to mixed-precision training. A more efficient alternative is reported in the (discriminative) ViT literature: Dehghani et al. (2023) observe that the training of large vision transformer models diverges because the attention entropy grows uncontrollably. To avoid this, Dehghani et al. (2023) propose to normalize Q and K before the attention operation. We follow this approach and use RMSNorm (Zhang & Sennrich, 2019) with learnable scale in both streams of our MMDiT architecture for our models, see Figure 2. As demonstrated in Figure 5, the additional normalization prevents the attention logit growth instability, confirming findings by Dehghani et al. (2023) and Wortsman et al. (2023), and enables efficient training at bf16-mixed (Chen et al., 2019) precision when combined with ε = 10⁻¹⁵ in the AdamW (Loshchilov & Hutter, 2017) optimizer. This technique can also be applied to pretrained models that have not used QK-normalization during pretraining: the model quickly adapts to the additional normalization layers and trains more stably. Finally, we would like to point out that although this method can generally help to stabilize the training of large models, it is not a universal recipe and may need to be adapted depending on the exact training setup.
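A minimal sketch of the QK-normalization described above: RMS-normalize the per-head query and key features with a learnable scale before computing attention. The module and variable names are ours, and the epsilon value is illustrative rather than an exact training setting.

```python
import torch
import torch.nn as nn

class QKRMSNorm(nn.Module):
    """RMS-normalize Q and K per head with a learnable scale before attention."""

    def __init__(self, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(head_dim))

    def _rms_norm(self, t: torch.Tensor) -> torch.Tensor:
        # Compute the norm in float32 for stability, then cast back (relevant for bf16 training).
        rms = torch.rsqrt(t.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (t.float() * rms * self.scale).to(t.dtype)

    def forward(self, q: torch.Tensor, k: torch.Tensor):
        # q, k: (batch, heads, seq_len, head_dim); normalizing bounds the attention
        # logits q @ k^T and thereby prevents the entropy collapse shown in Figure 5.
        return self._rms_norm(q), self._rms_norm(k)

# Usage inside an attention module, right before the softmax / SDPA call:
qk_norm = QKRMSNorm(head_dim=64)
q = torch.randn(2, 24, 128, 64)
k = torch.randn(2, 24, 128, 64)
q, k = qk_norm(q, k)
```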
Positional Encodings for Varying Aspect Ratios After training on a fixed 256 × 256 resolution we aim to (i) increase the resolution and (ii) enable inference with flexible aspect ratios. Since we use 2d positional fre-
Figure 8. Quantitative effects of scaling. We analyze the impact of model size on performance, maintaining consistent training
hyperparameters throughout. An exception is depth=38, where learning rate adjustments at 3 × 105 steps were necessary to prevent
divergence. (Top) Validation loss smoothly decreases as a function of both model size and training steps for both image (columns 1 and 2)
and video models (columns 3 and 4). (Bottom) Validation loss is a strong predictor of overall model performance. There is a marked
correlation between validation loss and holistic image evaluation metrics, including GenEval (Ghosh et al., 2023), column 1, human
preference, column 2, and T2I-CompBench (Huang et al., 2023), column 3. For video models we observe a similar correlation between
validation loss and human preference, column 4.
Figure 9. Impact of T5. We observe T5 to be important for complex prompts, e.g. those involving a high degree of detail or longer spelled text (rows 2 and 3). For most prompts, however, we find that removing T5 at inference time still achieves competitive performance. (Shown are samples using all text-encoders versus samples w/o T5 (Raffel et al., 2019) for prompts that require spelled-out text or fine detail.)

At inference time, we can use an arbitrary subset of all three text encoders. This offers means for trading off model performance for improved memory efficiency, which is particularly relevant for the 4.7B parameters of T5-XXL (Raffel et al., 2019) that require significant amounts of VRAM. Interestingly, we observe limited performance drops when using only the two CLIP-based text-encoders for the text prompts and replacing the T5 embeddings by zeros. We provide a qualitative visualization in Figure 9. Only for complex prompts involving either highly detailed descriptions of a scene or larger amounts of written text do we find significant performance gains when using all three text-encoders. These observations are also verified in the human preference evaluation results in Figure 7 (Ours w/o T5). Removing T5 has no effect on aesthetic quality ratings (50% win rate), and only a small impact on prompt adherence (46% win rate), whereas its contribution to the capabilities of generating written text is more significant (38% win rate).

6. Conclusion

In this work, we presented a scaling analysis of rectified flow models for text-to-image synthesis. We proposed a novel timestep sampling for rectified flow training that improves over previous diffusion training formulations for latent diffusion models and retains the favourable properties of rectified flows in the few-step sampling regime. We also demonstrated the advantages of our transformer-based MM-DiT architecture that takes the multi-modal nature of the text-to-image task into account. Finally, we performed a scaling study of this combination up to a model size of 8B parameters and 5 × 10²² training FLOPs. We showed that validation loss improvements correlate with both existing text-to-image benchmarks as well as human preference evaluations. This, in combination with our improvements in generative modeling and scalable, multimodal architectures, achieves performance that is competitive with state-of-the-art proprietary models. The scaling trend shows no signs of saturation, which makes us optimistic that we can continue to improve the performance of our models in the future.
References

Brooks, T., Holynski, A., and Efros, A. A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402, 2023.

Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., Balle, B., Ippolito, D., and Wallace, E. Extracting training data from diffusion models. In 32nd USENIX Security Symposium, 2023.

Dao, Q., Phung, H., Nguyen, B., and Tran, A. Flow matching in latent space, 2023.

Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A., Caron, M., Geirhos, R., Alabdulmohsin, I., Jenatton, R., Beyer, L., Tschannen, M., Arnab, A., Wang, X., Riquelme, C., Minderer, M., Puigcerver, J., Evci, U., Kumar, M., van Steenkiste, S., Elsayed, G. F., Mahendran, A., Yu, F., Oliver, A., Huot, F., Bastings, J., Collier, M. P., Gritsenko, A., Birodkar, V., Vasconcelos, C., Tay, Y., Mensink, T., Kolesnikov, A., Pavetić, F., Tran, D., Kipf, T., Lučić, M., Zhai, X., Keysers, D., Harmsen, J., and Houlsby, N. Scaling vision transformers to 22 billion parameters, 2023.

Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis, 2021.

Dockhorn, T., Vahdat, A., and Kreis, K. Score-based generative modeling with critically-damped langevin diffusion. arXiv preprint arXiv:2112.07068, 2021.

Dockhorn, T., Vahdat, A., and Kreis, K. Genie: Higher-order denoising diffusion solvers, 2022.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2020.

Esser, P., Chiu, J., Atighehchian, P., Granskog, J., and Germanidis, A. Structure and content-guided video synthesis with diffusion models, 2023.

Euler, L. Institutionum calculi integralis. Number Bd. 1 in Institutionum calculi integralis. imp. Acad. imp. Saènt., 1768. URL https://books.google.de/books?id=Vg8OAAAAQAAJ.

Fischer, J. S., Gui, M., Ma, P., Stracke, N., Baumann, S. A., and Ommer, B. Boosting latent diffusion with flow matching. arXiv preprint arXiv:2312.07360, 2023.

Ghosh, D., Hajishirzi, H., and Schmidt, L. Geneval: An object-focused framework for evaluating text-to-image alignment. arXiv preprint arXiv:2310.11513, 2023.

Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Fei-Fei, L., Essa, I., Jiang, L., and Lezama, J. Photorealistic video generation with diffusion models, 2023.

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.595. URL http://dx.doi.org/10.18653/v1/2021.emnlp-main.595.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2017.

Ho, J. and Salimans, T. Classifier-free diffusion guidance, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models, 2020.

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., and Salimans, T. Imagen video: High definition video generation with diffusion models, 2022.

Hoogeboom, E., Heek, J., and Salimans, T. Simple diffusion: End-to-end diffusion for high resolution images, 2023.

Huang, K., Sun, K., Xie, E., Li, Z., and Liu, X. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695–709, 2005. URL https://api.semanticscholar.org/CorpusID:1152227.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020.

Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. ArXiv, abs/2206.00364, 2022. URL https://api.semanticscholar.org/CorpusID:249240415.

Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., and Laine, S. Analyzing and improving the training dynamics of diffusion models. arXiv preprint arXiv:2312.02696, 2023.

Kingma, D. P. and Gao, R. Understanding diffusion objectives as the elbo with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.

Lee, S., Kim, B., and Ye, J. C. Minimizing trajectory curvature of ode-based generative models, 2023.

Lin, S., Liu, B., Li, J., and Yang, X. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5404–5411, 2024.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common Objects in Context, pp. 740–755. Springer International Publishing, 2014. ISBN 9783319106021. doi: 10.1007/978-3-319-10602-1_48. URL http://dx.doi.org/10.1007/978-3-319-10602-1_48.

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022.

Liu, X., Zhang, X., Ma, J., Peng, J., and Liu, Q. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation, 2023.

Loshchilov, I. and Hutter, F. Fixing weight decay regularization in adam. ArXiv, abs/1711.05101, 2017. URL https://api.semanticscholar.org/CorpusID:3312944.

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models, 2023.

Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., and Xie, S. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers, 2024.

Nichol, A. Dall-e 2 pre-training mitigations. https://openai.com/research/dall-e-2-pre-training-mitigations, 2022.

Nichol, A. and Dhariwal, P. Improved denoising diffusion probabilistic models, 2021.

NovelAI. Novelai improvements on stable diffusion, 2022. URL https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac.

Peebles, W. and Xie, S. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023. doi: 10.1109/iccv51070.2023.00387. URL http://dx.doi.org/10.1109/ICCV51070.2023.00387.

Pernias, P., Rampas, D., Richter, M. L., Pal, C. J., and Aubreville, M. Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models, 2023.

Pizzi, E., Roy, S. D., Ravindra, S. N., Goyal, P., and Douze, M. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14532–14542, 2022.

Po, R., Yifan, W., Golyanik, V., Aberman, K., Barron, J. T., Bermano, A. H., Chan, E. R., Dekel, T., Holynski, A., Kanazawa, A., et al. State of the art on diffusion models for visual computing. arXiv preprint arXiv:2310.07204, 2023.

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.

Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., and Chen, R. T. Q. Multisample flow matching: Straightening flows with minibatch couplings, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021.

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290, 2023.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. doi: 10.1109/cvpr52688.2022.01042. URL http://dx.doi.org/10.1109/CVPR52688.2022.01042.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation, pp. 234–241. Springer International Publishing, 2015. ISBN 9783319245744. doi: 10.1007/978-3-319-24574-4_28. URL http://dx.doi.org/10.1007/978-3-319-24574-4_28.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2014. URL https://api.semanticscholar.org/CorpusID:2930547.

Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., and Norouzi, M. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10, 2022a.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding, 2022b.

Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2022c.

Sauer, A., Chitta, K., Müller, J., and Geiger, A. Projected gans converge faster. Advances in Neural Information Processing Systems, 2021.

Sauer, A., Lorenz, D., Blattmann, A., and Rombach, R. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.

Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., and Taigman, Y. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023.

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., and Taigman, Y. Make-a-video: Text-to-video generation without text-video data, 2022.

Sohl-Dickstein, J. N., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. ArXiv, abs/1503.03585, 2015. URL https://api.semanticscholar.org/CorpusID:14888175.

Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6048–6058, 2023a.

Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. Understanding and mitigating copying in diffusion models. arXiv preprint arXiv:2305.20086, 2023b.

Song, Y., Sohl-Dickstein, J. N., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. ArXiv, abs/2011.13456, 2020. URL https://api.semanticscholar.org/CorpusID:227209335.

Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., Wolf, G., and Bengio, Y. Improving and generalizing flow-based generative models with minibatch optimal transport, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2017.

Villani, C. Optimal transport: Old and new. 2008. URL https://api.semanticscholar.org/CorpusID:118347220.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661–1674, 2011. URL https://api.semanticscholar.org/CorpusID:5560643.

Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., and Naik, N. Diffusion Model Alignment Using Direct Preference Optimization. arXiv:2311.12908, 2023.

Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.

Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., Co-Reyes, J. D., Gur, I., Kumar, A., Novak, R., Pennington, J., Sohl-dickstein, J., Xu, K., Lee, J., Gilmer, J., and Kornblith, S. Small-scale proxies for large-scale transformer training instabilities, 2023.

Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv:2206.10789, 2022.

Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In CVPR, pp. 12104–12113, 2022.

Zhang, B. and Sennrich, R. Root mean square layer normalization, 2019.
Supplementary
A. Background
Diffusion Models (Sohl-Dickstein et al., 2015; Song et al., 2020; Ho et al., 2020) generate data by approximating the
reverse ODE to a stochastic forward process which transforms data to noise. They have become the standard approach for
generative modeling of images (Dhariwal & Nichol, 2021; Ramesh et al., 2022; Saharia et al., 2022b; Rombach et al., 2022;
Balaji et al., 2022) and videos (Singer et al., 2022; Ho et al., 2022; Esser et al., 2023; Blattmann et al., 2023b; Gupta et al.,
2023). Since these models can be derived both via a variational lower bound on the negative likelihood (Sohl-Dickstein et al.,
2015) and score matching (Hyvärinen, 2005; Vincent, 2011; Song & Ermon, 2020), various formulations of forward- and
reverse processes (Song et al., 2020; Dockhorn et al., 2021), model parameterizations (Ho et al., 2020; Ho & Salimans, 2022;
Karras et al., 2022), loss weightings (Ho et al., 2020; Karras et al., 2022) and ODE solvers (Song et al., 2022; Lu et al., 2023;
Dockhorn et al., 2022) have led to a large number of different training objectives and sampling procedures. More recently,
the seminal works of Kingma & Gao (2023) and Karras et al. (2022) have proposed unified formulations and introduced
new theoretical and practical insights for training (Karras et al., 2022; Kingma & Gao, 2023) and inference (Karras et al.,
2022). However, despite these improvements, the trajectories of common ODEs involve partly significant amounts of
curvature (Karras et al., 2022; Liu et al., 2022), which requires increased amounts of solver steps and, thus, renders fast
inference difficult. To overcome this, we adopt rectified flow models whose formulation allows for learning straight ODE
trajectories.
Rectified Flow Models (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023) approach generative
modeling by constructing a transport map between two distributions through an ordinary differential equation (ODE). This
approach has close connections to continuous normalizing flows (CNF) (Chen et al., 2018) as well as diffusion models.
Compared to CNFs, Rectified Flows and Stochastic Interpolants have the advantage that they do not require simulation
of the ODE during training. Compared to diffusion models, they can result in ODEs that are faster to simulate than the
probability flow ODE (Song et al., 2020) associated with diffusion models. Nevertheless, they do not result in optimal
transport solutions, and multiple works aim to minimize the trajectory curvature further (Lee et al., 2023; Tong et al., 2023;
Pooladian et al., 2023). (Dao et al., 2023; Ma et al., 2024) demonstrate the feasibility of rectified flow formulations for
class-conditional image synthesis, (Fischer et al., 2023) for latent-space upsampling, and (Liu et al., 2023) apply the reflow
procedure of (Liu et al., 2022) to distill a pretrained text-to-image model (Rombach et al., 2022). Here, we are interested in
rectified flows as the foundation for text-to-image synthesis with fewer sampling steps. We perform an extensive comparison
between different formulations and loss weightings and propose a new timestep schedule for training of rectified flows with
improved performance.
Scaling Diffusion Models The transformer architecture (Vaswani et al., 2017) is well known for its scaling properties in
NLP (Kaplan et al., 2020) and computer vision tasks (Dosovitskiy et al., 2020; Zhai et al., 2022). For diffusion models,
U-Net architectures (Ronneberger et al., 2015) have been the dominant choice (Ho et al., 2020; Rombach et al., 2022; Balaji
et al., 2022). While some recent works explore diffusion transformer backbones (Peebles & Xie, 2023; Chen et al., 2023;
Ma et al., 2024), scaling laws for text-to-image diffusion models remain unexplored.
(Two pages of additional qualitative samples follow, each image labeled with its prompt, e.g. "A kangaroo holding a beer, wearing ski goggles and passionately singing silly songs.", "a massive alien space ship that is shaped like a pretzel.", and "photo of a bear wearing a suit and tophat in a river in the middle of a forest holding a sign that says "I cant bear it".")
B. On Flow Matching
B.1. Details on Simulation-Free Training of Flows
Following (Lipman et al., 2023), to see that u_t(z) generates p_t, we note that the continuity equation provides a necessary and sufficient condition (Villani, 2008):
\[ \frac{d}{dt} p_t(x) + \nabla \cdot \left[ p_t(x)\, v_t(x) \right] = 0 \;\;\leftrightarrow\;\; v_t \text{ generates the probability density path } p_t. \tag{26} \]

where c, c′ do not depend on Θ and line Equation (31) to line Equation (32) follows from:
\[ \mathbb{E}_{p_t(z \mid \epsilon),\, p(\epsilon)} \langle v_\Theta(z, t) \mid u_t(z \mid \epsilon) \rangle = \int \! dz \int \! d\epsilon \; p_t(z \mid \epsilon)\, p(\epsilon)\, \langle v_\Theta(z, t) \mid u_t(z \mid \epsilon) \rangle \tag{34} \]
\[ = \int \! dz \; p_t(z)\, \Big\langle v_\Theta(z, t) \;\Big|\; \int \! d\epsilon \; \frac{p_t(z \mid \epsilon)}{p_t(z)}\, p(\epsilon)\, u_t(z \mid \epsilon) \Big\rangle \tag{35} \]
\[ = \int \! dz \; p_t(z)\, \langle v_\Theta(z, t) \mid u_t(z) \rangle = \mathbb{E}_{p_t(z)} \langle v_\Theta(z, t) \mid u_t(z) \rangle, \tag{36} \]
where we extended with p_t(z) in line Equation (35) and used the definition of Equation (6) in line Equation (35) to Equation (36).
Figure 10. FID scores after training flow models with different sizes (parameterized via their depth) on the latent space of different
autoencoders (4 latent channels, 8 channels and 16 channels) as discussed in Section 5.2.1. As expected, the flow model trained on the
16-channel autoencoder space needs more model capacity to achieve similar performance. At depth d = 22, the gap between 8-chn and
16-chn becomes negligible. We opt for the 16-chn model as we ultimately aim to scale to much larger model sizes.
context conditioning c_ctxt^CLIP ∈ R^{77×2048}. Next, we encode c also to the final hidden representation, c_ctxt^T5 ∈ R^{77×4096}, of the encoder of a T5-v1.1-XXL model (Raffel et al., 2019). Finally, we zero-pad c_ctxt^CLIP along the channel axis to 4096 dimensions to match the T5 representation and concatenate it along the sequence axis with c_ctxt^T5 to obtain the final context representation c_ctxt ∈ R^{154×4096}. These two caption representations, c_vec and c_ctxt, are used in two different ways as described in Section 4.
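A sketch of how the context representation could be assembled from the shapes given above. The tensors here are random stand-ins for the actual frozen-encoder outputs, and the pooled dimension is illustrative.

```python
import torch
import torch.nn.functional as F

# Stand-ins with the shapes stated above (batch size 1):
clip_ctxt = torch.randn(1, 77, 2048)   # CLIP token representations, c_ctxt^CLIP
t5_ctxt = torch.randn(1, 77, 4096)     # T5-XXL encoder hidden states, c_ctxt^T5
clip_pooled = torch.randn(1, 2048)     # pooled CLIP outputs (dimension illustrative)

# Zero-pad the CLIP context along the channel axis to 4096 dims ...
clip_ctxt_padded = F.pad(clip_ctxt, (0, 4096 - clip_ctxt.shape[-1]))
# ... and concatenate along the sequence axis with the T5 context.
c_ctxt = torch.cat([clip_ctxt_padded, t5_ctxt], dim=1)  # shape (1, 154, 4096)
c_vec = clip_pooled                                     # fed to the modulation mechanism
```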
Figure 11. The mode (left) and logit-normal (right) distributions that we explore for biasing the sampling of training timesteps.
Figure 12. Qualitative effects of scaling. Displayed are examples demonstrating the impact of scaling training steps (left to right: 50k, 200k, 350k, 500k) and model sizes (top to bottom: depth=15, 30, 38) on PartiPrompts, highlighting the influence of training duration and model complexity. (Prompts include a raccoon in formal clothes painted in abstract cubism, a bowl of soup that looks like a monster made out of plasticine, two cups of coffee with latte art, and a smiling sloth in a leather jacket in front of a VW van.)
Figure 13. Comparison between base models and DPO-finetuned models. DPO-finetuning generally results in more aesthetically pleasing
samples with better spelling.
Direct Preference Optimization (DPO) (Rafailov et al., 2023) is a technique to finetune LLMs with preference data. Recently,
this method has been adapted to preference finetuning of text-to-image diffusion models (Wallace et al., 2023). In this
section, we verify that our model is also amenable to preference optimization. In particular, we apply the method introduced
in Wallace et al. (2023) to our 2B and 8B parameter base model. Rather than finetuning the entire model, we introduce
learnable Low-Rank Adaptation (LoRA) matrices (of rank 128) for all linear layers as is common practice. We finetune
these new parameters for 4k and 2k iterations for the 2B and 8B base models, respectively. We then evaluate the resulting models in a human preference study using a subset of 128 captions from the PartiPrompts set (Yu et al., 2022) (roughly three voters per prompt and comparison). Figure 14 shows that our base models can be effectively tuned for human preference.
Figure 13 shows samples of the respective base models and DPO-finetuned models.
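A minimal sketch of the LoRA setup described above: each linear layer is wrapped with a frozen base projection plus a trainable rank-128 update. The wrapper below is our own illustration, not the implementation used for the reported results.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 128, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity update
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

def add_lora(model: nn.Module, rank: int = 128) -> nn.Module:
    # Replace every nn.Linear in the model with a LoRA-wrapped version.
    for name, child in list(model.named_children()):
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, rank=rank))
        else:
            add_lora(child, rank=rank)
    return model
```

During DPO finetuning only the `lora_a` and `lora_b` parameters would be passed to the optimizer, keeping the base model weights untouched.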
Figure 14. Human preference evaluation between base models and DPO-finetuned models. Human evaluators prefer DPO-finetuned
models for both prompt following and general quality.
Table 7. Key figures for preencoding frozen input networks. Mem is the memory required to load the model on the GPU. FP [ms] is
the time per sample for the forward pass with per-device batch size of 32. Storage is the size to save a single sample. Delta [%] is how
much longer a training step takes, when adding this into the loop for the 2B MMDiT-Model (568ms/it).
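A sketch of the preencoding loop that Table 7 refers to: the frozen autoencoder and text encoders are run once over the dataset and their outputs are written to disk, so that training only reads precomputed tensors. The function and path names, and the encoder interfaces, are illustrative assumptions.

```python
import os
import torch

@torch.no_grad()
def precompute(dataset, autoencoder, text_encoders, out_dir="precomputed"):
    """Encode every sample once with the frozen input networks and save the results."""
    os.makedirs(out_dir, exist_ok=True)
    for i, (image, caption) in enumerate(dataset):
        latent = autoencoder.encode(image.unsqueeze(0))  # image latents (assumed API)
        ctxt, vec = text_encoders(caption)               # sequence + pooled text features
        torch.save(
            {"latent": latent.cpu(), "ctxt": ctxt.cpu(), "vec": vec.cpu()},
            os.path.join(out_dir, f"{i:09d}.pt"),
        )
```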
though no text manipulation tasks were included in the training data. We were not able to reproduce similar results when training an SDXL-based (Podell et al., 2023) editing model on the same data.
Figure 15. Zero-shot text manipulation and insertion with the 2B Edit model. (Shown are edits such as "change the word to UNOT" and "make the sign say MMDIT rules" applied to an image where "GO BIG OR GO UNET" is written on a blackboard.)
Details on Deduplication In accordance with the methods outlined by Carlini et al. (2023) and Somepalli et al. (2023a), we opt for SSCD (Pizzi et al., 2022) as the backbone for the deduplication process. The SSCD algorithm is a state-of-the-art technique for detecting near-duplicate images at scale, and it generates high-quality image embeddings that can be used for clustering and other downstream tasks. We also follow Nichol (2022) to decide on the number of clusters N. For our experiments, we use N = 16,000.

We utilize autofaiss (2023) for clustering, a library that simplifies the process of using Faiss (Facebook AI Similarity Search) for large-scale clustering tasks. Specifically, we leverage the FAISS index factory¹ functionality to train a custom index with a predefined number of centroids. This approach allows for efficient and accurate clustering of high-dimensional data, such as image embeddings.

Algorithm 1 details our deduplication approach. We ran an experiment to see how much data is removed by different SSCD thresholds, as shown in Figure 16b. Based on these results we selected four thresholds for the final run (Figure 16a).

¹ https://github.com/facebookresearch/faiss/wiki/The-index-factory
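The clustering-plus-threshold procedure can be sketched as follows, with plain FAISS k-means standing in for the autofaiss index-factory setup; Algorithm 1's exact bookkeeping is omitted, while N and the 0.5 similarity threshold come from the text above. Intended for a dataset-scale embedding matrix.

```python
import faiss
import numpy as np

def deduplicate(embeddings: np.ndarray, n_clusters: int = 16_000, threshold: float = 0.5):
    """Cluster SSCD embeddings and drop near-duplicates within each cluster."""
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(embeddings)  # so inner products are cosine similarities
    d = embeddings.shape[1]

    # Assign every embedding to one of N centroids (plain k-means for brevity).
    kmeans = faiss.Kmeans(d, n_clusters, niter=10, verbose=False)
    kmeans.train(embeddings)
    _, assignments = kmeans.index.search(embeddings, 1)

    keep = np.ones(len(embeddings), dtype=bool)
    for c in np.unique(assignments):
        idx = np.where(assignments.ravel() == c)[0]
        sims = embeddings[idx] @ embeddings[idx].T  # pairwise similarity within cluster
        for i in range(len(idx)):
            if not keep[idx[i]]:
                continue
            dup = np.where(sims[i, i + 1:] > threshold)[0] + i + 1
            keep[idx[dup]] = False  # drop later items too similar to a kept one
    return keep
```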
Figure 16. Results of deduplicating our training datasets for various filtering thresholds. (a) Final result of SSCD deduplication over the entire dataset. (b) Result of SSCD deduplication with various thresholds over 1000 random clusters.
Figure 17. SSCD-based deduplication prevents memorization. To assess how well our SSCD-based deduplication works, we extract memorized samples from small models trained specifically for this purpose and compare them before and after deduplication. We plot a comparison between the number of memorized samples before and after using SSCD with a threshold of 0.5 to remove near-duplicated samples. Carlini et al. (2023) mark images within a clique size of 10 as memorized samples; here we also explore different clique sizes. For all clique thresholds, SSCD is able to significantly reduce the number of memorized samples. Specifically, when the clique size is 10, models trained on the deduplicated training samples (cut off at SSCD = 0.5) show a 5× reduction in potentially memorized examples.