
DREAMFUSION: TEXT-TO-3D

USING 2D DIFFUSION
[Poole et al., ICLR 2023]

Team 15
20190156 Yun Kim
20190063 Ki Nam Kim

CS380: Introduction to Computer Graphics



Motivation: Text-to-3D using 2D Diffusion


Background
• The text-to-image task succeeded by training directly on large text-image pair datasets.

Examples: Stable Diffusion, UPainting

https://forums.fast.ai/t/new-paper-upainting-unified-text-to-image-diffusion-generation-with-cross-modal-guidance/101669

Q. Can we train a text-to-3D model directly using a text-3D object pair dataset?
A. No. Text-3D datasets are not big enough to train a 3D generative model directly.

Text-3D pair dataset (Objaverse, ~800K objects) vs. text-image pair dataset (LAION-5B, ~5B pairs)

https://paperswithcode.com/dataset/laion-5b
https://blog.allenai.org/objaverse-a-universe-of-annotated-3d-objects-718ef3d61fd6


Solution: Train the 3D model with 2D images! → One approach is using NeRF


Background: NeRF

• Optimize the 3D model via gradient descent such that its 2D renderings from random angles achieve a low loss.
• Only need 2D image data (no need for 3D data).
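To make the optimization-by-rendering idea concrete, here is a minimal sketch of an ordinary NeRF training step. It assumes a hypothetical differentiable render(nerf, pose) function and a set of posed ground-truth images; the names are illustrative, not taken from a specific NeRF codebase.

```python
import torch

def nerf_train_step(nerf, optimizer, images, poses):
    i = torch.randint(len(images), (1,)).item()       # pick a random training view
    rendered = render(nerf, poses[i])                  # differentiable volume rendering (hypothetical)
    loss = ((rendered - images[i]) ** 2).mean()        # photometric (L2) loss against the ground-truth photo
    optimizer.zero_grad()
    loss.backward()                                    # gradient descent on the NeRF weights
    optimizer.step()
    return loss.item()
```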


Issue with using NeRF for text-to-3D
• We need multiple images from various perspectives to train NeRF.
• However, in text-to-3D we don't have ground-truth images, only a single text prompt. → We can't train NeRF in the usual way.

“yellow lego bulldozer”

Q. Then how can we train NeRF without ground-truth images, using only a single text prompt?


Solution: DreamFusion (NeRF + text-to-image diffusion model)

Optimize the NeRF with gradient descent using a loss derived from a text-to-image diffusion model.

Components: NeRF (mip-NeRF 360) + text-to-image diffusion model (Imagen)


Simple diagram of Dreamfusion

Text prompt: "A yellow lego bulldozer"

The NeRF renders an image; initially, the NeRF renders a random image. The rendered image is fed, together with the text prompt, into the text-to-image diffusion model, which produces the SDS loss: it contains information on how to adjust the rendered image to align with the provided text. Backpropagating this signal optimizes the NeRF toward the ideal image.

Intuition: the diffusion model is like an AI artist capable of generating any image based on the provided text, and it tells the NeRF how to create a better image.
https://en.wikipedia.org/wiki/Bob_Ross

But NeRF requires multi-view images to be trained...
Q. How can the diffusion model provide multi-view image information to NeRF?
A. Append view-dependent text to the provided input text based on the location of the camera:
• Elevation angle > 60°: append "overhead view" at the end of the text, e.g., "A yellow lego bulldozer, overhead view".
• Elevation angle ≤ 60°: append "front view", "side view", or "back view", e.g., "A yellow lego bulldozer, front view".
For each sampled camera pose (R1, T1), (R2, T2), ..., the NeRF renders the scene from that pose and the diffusion model is conditioned on the corresponding view-augmented prompt. A sketch of this prompt augmentation is shown below.
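A small sketch of the view-dependent prompt augmentation. The 60° elevation threshold comes from the slide; the azimuth ranges used to pick front/side/back are an illustrative assumption, not the paper's exact rule.

```python
def view_dependent_prompt(text: str, elevation_deg: float, azimuth_deg: float) -> str:
    # Elevation above 60 degrees: the camera looks down on the object.
    if elevation_deg > 60:
        return f"{text}, overhead view"
    # Otherwise choose front/side/back from the azimuth (assumed ranges).
    a = azimuth_deg % 360
    if a < 45 or a >= 315:
        view = "front view"
    elif 135 <= a < 225:
        view = "back view"
    else:
        view = "side view"
    return f"{text}, {view}"

# e.g. view_dependent_prompt("A yellow lego bulldozer", 70, 0)
#      -> "A yellow lego bulldozer, overhead view"
```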

Text-to-Image Diffusion model

Text prompt: “A yellow lego bulldozer”

Denoising chain: x_T → x_{T-1} → ... → x_2 → x_1 → x_0, where every denoising step is conditioned on the text prompt.
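As a rough sketch, the chain above is just a loop. denoise_step is a hypothetical one-step denoiser (its internals are sketched further below, after the key point); the text conditioning is passed at every step.

```python
import torch

def sample_image(denoise_step, text_emb, T=1000, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                  # x_T: start from pure Gaussian noise
    for t in reversed(range(1, T + 1)):     # x_T -> x_{T-1} -> ... -> x_1 -> x_0
        x = denoise_step(x, t, text_emb)    # every step is conditioned on the text prompt
    return x                                # x_0: the generated image
```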


One denoising step: the noise-prediction model (a U-Net, with the text prompt encoded by a transformer) takes the noisy input image x_t and the timestep t and predicts the noise ε̂; removing that noise yields the denoised image x_{t-1}.

Key point!
The diffusion model doesn't predict the denoised image directly; it first predicts the noise ε̂ and then subtracts it to denoise the image.
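A minimal sketch of one such step under common DDPM-style assumptions. This is the denoise_step used in the loop sketched earlier; unet, alphas, and alpha_bars stand in for a pretrained text-conditioned noise predictor and its precomputed noise schedule, and are not the actual Imagen implementation.

```python
import torch

def denoise_step(x_t, t, text_emb):
    eps_hat = unet(x_t, t, text_emb)                    # predict the noise contained in x_t
    a_t, ab_t = alphas[t], alpha_bars[t]                # schedule values at timestep t (assumed tensors)
    # Remove the predicted noise to get the mean of x_{t-1} (DDPM-style update).
    mean = (x_t - (1 - a_t) / torch.sqrt(1 - ab_t) * eps_hat) / torch.sqrt(a_t)
    noise = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    return mean + torch.sqrt(1 - a_t) * noise           # x_{t-1}
```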


Dreamfusion Pipeline
Step by step, how the NeRF is optimized using the SDS loss:

1. Render an image x_0 from the NeRF.
2. Generate random noise ε.
3. Select a random denoise timestep t = random(1, T).
4. Add the noise to the rendering to make the noisy image x_t at timestep t.
5. Predict the noise: feed x_t, t, and the text prompt ("A yellow lego bulldozer") to the text-to-image diffusion model to obtain ε̂_t.
6. Subtract the injected noise from the predicted noise: ε̂_t − ε is the update direction that tells the NeRF how to render a better image.

A sketch of these six steps is shown below.
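A minimal sketch of steps 1-6, producing the residual ε̂_t − ε. Here render, diffusion, and alpha_bars are hypothetical placeholders (a differentiable NeRF renderer, a pretrained text-conditioned noise predictor, and its noise schedule); this is a sketch of the idea, not the paper's actual code.

```python
import torch

def sds_residual(nerf, camera, text_emb, diffusion, alpha_bars, T=1000):
    x0 = render(nerf, camera)                                # 1. render image from NeRF
    eps = torch.randn_like(x0)                               # 2. generate random noise
    t = torch.randint(1, T + 1, (1,)).item()                 # 3. select random denoise timestep
    ab = alpha_bars[t]
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * eps     # 4. noisy image at timestep t
    eps_hat = diffusion(x_t, t, text_emb)                    # 5. predict noise, conditioned on the text
    return eps_hat - eps, x0                                 # 6. the update direction, plus the rendering
```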

Gradient of the SDS loss:
∇L = (ε̂_t − ε) · (U-Net Jacobian) · (generator Jacobian)
(1) ε̂_t − ε: the update direction from step 6.
(2) U-Net Jacobian: the gradient through the diffusion model's noise predictor.
(3) Generator Jacobian: the gradient through the NeRF renderer, back to the NeRF parameters.

In practice, the U-Net Jacobian term is expensive to compute, and omitting it doesn't change the update direction. So just omit it:
∇L ≈ (ε̂_t − ε) · (generator Jacobian)

A sketch of the resulting update is shown below.
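A sketch of the resulting update, under the same assumptions as the previous snippet: ε̂_t is treated as a constant (no backpropagation through the U-Net), and the residual is injected directly as the gradient of the rendered image, so the chain rule only runs through the NeRF renderer. The weighting w_t stands in for the timestep weighting w(t) used in the paper; here it is a simple scalar placeholder.

```python
import torch

def sds_step(nerf, optimizer, camera, text_emb, diffusion, alpha_bars, w_t=1.0):
    residual, x0 = sds_residual(nerf, camera, text_emb, diffusion, alpha_bars)
    grad = w_t * residual.detach()      # omit the U-Net Jacobian: no gradient through the diffusion model
    optimizer.zero_grad()
    x0.backward(gradient=grad)          # generator Jacobian only: backprop through the NeRF renderer
    optimizer.step()
```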

Visualization of optimizing NeRF


Dreamfusion Pipeline in paper

Full pipeline figure from the paper: NeRF (mip-NeRF 360) + text-to-image diffusion model (Imagen)


Result: Qualitative Experiment

CLIP Encoder

- Extract semantic information from the input image.


- Encode the image into a feature vector that resides in the same space as the text feature vectors (see the sketch below).
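A small sketch of how such an encoder is used: the image and the text are mapped into the same embedding space and compared with cosine similarity. encode_image and encode_text stand in for a real CLIP implementation's interface and are assumptions here.

```python
import torch

def clip_similarity(clip, image, caption):
    img_feat = clip.encode_image(image)                        # image -> feature vector
    txt_feat = clip.encode_text(caption)                       # text  -> feature vector in the same space
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # normalize to unit length
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum(dim=-1)                   # cosine similarity
```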


Result: Qualitative Experiment

Qualitative comparison (figure):
• Optimize NeRF with CLIP (Dream Fields)
• Dream Fields reimplemented with the enhanced NeRF setting used in DreamFusion
• Optimize a mesh with CLIP (CLIP-Mesh)
• DreamFusion

DreamFusion generated the highest quality 3D objects!


Result: Quantitative Experiment

(R-Precision measures how accurately CLIP can find the correct text caption when shown a picture of a scene.)

Even though Dream Fields and CLIP-Mesh are trained with CLIP, DreamFusion outperforms them on CLIP R-Precision. This means that DreamFusion has generated the 3D objects that are best aligned with the text! A sketch of the metric is shown below.
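For reference, a sketch of how CLIP R-Precision can be computed, assuming the same hypothetical CLIP interface as above, with one rendering per text prompt and the i-th caption being the ground truth for the i-th rendering.

```python
import torch

def r_precision(clip, renderings, captions):
    imgs = torch.stack([clip.encode_image(r) for r in renderings])   # (N, D) image features
    txts = torch.stack([clip.encode_text(c) for c in captions])      # (N, D) text features
    imgs = imgs / imgs.norm(dim=-1, keepdim=True)
    txts = txts / txts.norm(dim=-1, keepdim=True)
    sims = imgs @ txts.T                           # similarity of every rendering to every caption
    top1 = sims.argmax(dim=1)                      # best-matching caption for each rendering
    correct = top1 == torch.arange(len(captions))
    return correct.float().mean().item()           # fraction of scenes retrieved correctly at rank 1
```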


Result: Examples

Experience more examples here: Dreamfusion 3D Gallery!


Limitation

• SDS loss is not a perfect loss function. It often produces oversmoothed results.

• DreamFusion uses the 64×64 Imagen model, so the image resolution is limited to 64×64.

Oversmoothed example
(64x64)


Limitation
Janus Problem
DreamFusion approximates the view direction by categorizing angles into four rough categories,
which are “overhead”, “front”, “side”, and “back”.
However, this method can lead to issues such as duplicated features (e.g., faces, eyes) appearing at different angles.

Example of Janus problem


Contribution
Papers inspired by Dreamfusion

Magic3D (CVPR 2023 highlight), ProlificDreamer (NeurIPS 2023 highlight)

- As the originator of using a 2D diffusion model to create 3D objects, this methodology led to many subsequent studies that improved the SDS loss, resulting in better text-to-3D models.
- This approach offers a revolutionary methodology for solving 3D-related tasks not by relying on scarce 3D data, but by utilizing abundant 2D data alone.


Thank You
Team 15
20190156 Yun Kim
20190063 Ki Nam Kim


Quiz
https://forms.gle/HG67Nz3DrawxLVkq7

Team 15
20190156 Yun Kim
20190063 Ki Nam Kim
