
MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation

Hanzhe Hu¹*  Zhizhuo Zhou²*  Varun Jampani³  Shubham Tulsiani¹

¹Carnegie Mellon University  ²Stanford University  ³Stability AI

https://mvd-fusion.github.io/
arXiv:2404.03656v1 [cs.CV] 4 Apr 2024


Figure 1. Single-view 3D Inference. Given an input RGB image, MVD-Fusion allows synthesizing multi-view RGB-D images using a
depth-guided attention mechanism for enforcing multi-view consistency. We visualize the input RGB image (left) and three synthesized
novel views (with generated depth in inset).

Abstract

We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models, these generations are not 3D-consistent and require a distillation process to generate a 3D output. We instead cast the task of 3D inference as directly generating mutually-consistent multiple views, and build on the insight that additionally inferring depth can provide a mechanism for enforcing this consistency. Specifically, we train a denoising diffusion model to generate multi-view RGB-D images given a single RGB input image, and leverage the (intermediate noisy) depth estimates to obtain reprojection-based conditioning to maintain multi-view consistency. We train our model using the large-scale synthetic dataset Objaverse as well as the real-world CO3D dataset comprising generic camera viewpoints. We demonstrate that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods. We also evaluate the geometry induced by our multi-view depth prediction and find that it yields a more accurate representation than other direct 3D inference approaches.
* Equal contribution.

1. Introduction

The task of recovering 3D from a single 2D image has witnessed a recent wave of generative-modeling based approaches. In particular, while initial 3D prediction methods pursued inference of volumetric [1, 6], point cloud [5, 47], or mesh [7, 13] representations, a class of recent approaches [20, 23, 54] instead formulates the task as learning (conditional) generation of novel views. By adapting large-scale pre-trained generative models, these methods can learn generalizable view synthesis that performs remarkably well even for generic objects in-the-wild. However, the synthesized novel views are not mutually consistent, and these 2D generative methods rely on a (costly) 'score distillation' [28] based optimization to then recover a consistent 3D. While this process can yield impressive results, it comes at the cost of a reduction in both the efficiency of inference and the diversity of the generations.

In this work, we seek to overcome these limitations and pursue an approach that allows directly generating diverse outputs. We do so by re-formulating the task of 3D inference as that of generating a set of (mutually consistent) multiple views, and learn a (conditional) generative prior to model this joint distribution. While some recent (and concurrent) methods do similarly 'co-generate' multiple views given a single input image [35, 36], these are typically not geometrically consistent. Instead, our approach is inspired by the recent work from Liu et al. [21], which incorporates a 3D bottleneck with unprojection and reprojection as an inductive bias for ensuring geometric consistency across views. In this work, we explore an alternate mechanism for enforcing such consistency. In particular, we formulate a depth-generation-guided approach that: a) allows improved generation via depth-based reprojection, and b) enables directly producing an estimate of the 3D geometry via the inferred multi-view 2.5D representation.

Our approach for enforcing multi-view consistency stems from a simple question: what does it mean for images to be 3D-consistent? Drawing inspiration from classical 3D reconstruction methods, one answer to this question is that if a pixel in one image corresponds to a point that is also visible in another, then the local appearance should match. However, how can we know where the 3D point corresponding to a pixel in one image may project in the other? This inspires our solution for generating multi-view consistent images, where we not only generate the RGB images but also reason about the corresponding depth for these generations (and thus allow such reprojection-based multi-view consistency). More specifically, we adapt existing 2D diffusion models to generate RGB-D images while adding multi-view projection based on (noisy) depth estimates to enforce 3D consistency.

We train our system using a large-scale synthetic dataset, and empirically demonstrate the efficacy of our approach on held-out objects as well as scanned real-world data. We show that our approach allows more accurate generation compared to prior state of the art while also (directly) generating plausible geometry via the synthesized depth images. Finally, we also highlight the ability of our method to sample diverse outputs and its ability to generalize zero-shot to in-the-wild generic objects.

2. Related Work

Single-view 3D Prediction. The task of inferring 3D from 2D images is a long-standing one in computer vision, and the predominant learning-based approach has been to frame it as a prediction task where a data-driven predictor is trained to output a 3D representation given image input. In particular, several deep learning based methods pursued this task by inferring a plethora of 3D representations such as volumetric 3D [1, 6, 40, 50], meshes [7, 13, 15, 16, 43, 49], point clouds [5, 26, 47], or neural implicit fields [17, 24, 41]. While these approaches have shown promising results, they often struggle to generate detailed outputs for complex objects owing to the ambiguity in unobserved regions. Indeed, any such regression-based approach fundamentally cannot model the inherent uncertainty in single-view reconstruction. In contrast, our generative modeling-based approach can synthesize high-fidelity outputs and also yield multiple modes.

3D Inference via 2D Diffusion. Instead of directly predicting 3D shapes in a feed-forward way, this line of work utilizes 2D diffusion priors to facilitate 3D inference. In particular, DreamFusion [28] and SJC [42] formulated a 'score distillation' objective that enabled incorporating pre-trained diffusion as a prior for optimization, and leveraged it to distill a 2D Stable Diffusion model for text-to-3D inference. Inspired by this, several works [3, 8, 23, 29, 33, 37, 48] adopt this pipeline to optimize a neural radiance field (NeRF [25]) for the single-view reconstruction task. For instance, RealFusion [23] extracts a 360° 3D reconstruction from a single image of an object by leveraging a 2D diffusion model with a single-image variant of textual inversion, whereas NeuralLift-360 [48] utilizes a depth-aware NeRF and learns to craft the scene by denoising diffusion models. However, as these methods only use pre-trained image diffusion models for 3D inference, they can suffer from implausible 3D outputs, e.g. the Janus effect, and do not always preserve the details in the observed image.

To circumvent this, SparseFusion [54] proposed to learn a novel-view diffusion model using an epipolar feature transformer to build view-conditioned features for sparse-view reconstruction, and distilled it to obtain more accurate 3D reconstructions. Moreover, Diffusion with Forward Models [39] applied 2D diffusion networks to denoise a consistent 3D scene.

[Figure 2 diagram: the input view and noisy target views x_t^{1:N} pass through depth-aware 3D attention (unprojection p = π⁻¹(u, d_t^n)) and feature aggregation to produce multi-view aware features z_t^{1:N}, which condition the diffusion U-Net via cross-attention to produce the denoised target view x_{t−1}^n.]

Figure 2. Approach Overview. MVD-Fusion learns a denoising diffusion model for generating multi-view RGB-D images given an
input RGB image. At each diffusion timestep t, MVD-Fusion uses the current (noisy) depth estimates to compute depth-projection-based
multi-view aware features (top). A novel-view diffusion based U-Net is modified to leverage these multi-view aware features as additional
conditioning while producing denoised estimates of both, RGB and depth (bottom).

Our approach builds on Zero-1-to-3 [20], which demonstrated that a large pre-trained image diffusion model can be finetuned for novel-view generation using a large-scale 3D dataset to achieve better generalization ability. While these methods are able to produce high-quality predictions, the reliance on score distillation sampling restricts them from obtaining diverse results, despite single-view 3D prediction being an under-constrained task.

Multi-view Image Generation. Unlike novel-view generation models which model the distribution over a single view given a reference image, many recent works have investigated generating multiple views at the same time using diffusion models, including text-based methods [35, 38] and image-based methods [21, 36]. Given a text conditioning, MVDiffusion [38] simultaneously generates all images with a global transformer to facilitate cross-view interactions. Similarly, MVDream [35] produces multi-view images via multi-view diffusion and leverages a self-attention layer to learn cross-view dependency and encourage multi-view consistency. While these methods rely on text input, Viewset Diffusion [36] adopts a similar approach for generating a multi-view image set given an input image, and subsequently infers a radiance field to ensure consistent geometry. While these methods, similar to our goal, can model the distribution over novel views, they do not leverage any geometric mechanism to enforce multi-view consistency. Perhaps most closely related to our work, SyncDreamer [21] proposes to use a 3D-aware feature attention mechanism that correlates the corresponding features across views to enforce multi-view consistency. Different from [21], we utilize depth information to learn consistency across views instead of a 3D bottleneck that contains redundant information. Finally, there have been several promising concurrent works which also pursue multi-view inference [11, 14, 19, 22, 34, 46], but we believe that our method of depth-guided multi-view diffusion represents a complementary advance to the techniques proposed in these.

3. Method

Given a single RGB image, our method generates a set of multi-view consistent RGB-D predictions. In addition to allowing the synthesis of the object from any desired set of views, the generated multi-view depth maps also conveniently yield a (coarse) point cloud representation of the geometry. To ensure multi-view consistency among the generated images, we model the joint distribution over a set of posed images by adding depth-guided 3D cross-attention layers on top of the pre-trained latent diffusion backbone from Stable Diffusion [32] and Zero-1-to-3 [20]. We first formalize this task of multi-view generation via denoising diffusion (Section 3.1), and then detail our specific approach to enforcing multi-view consistency (Section 3.2) and 2.5D image generation (Section 3.3). A diagram of our method is in Figure 2.

3.1. Multi-view Denoising Diffusion

A conditional denoising diffusion model can capture the distribution over a variable of interest x given some conditioning c. In particular, by learning a function ϵ(x_t, c, t) that learns to denoise an input with time-dependent corruption, diffusion models can allow sampling from the distribution p(x|c). Towards our goal of multi-view generation, we are interested in an instantiation of this framework where the conditioning corresponds to an observed RGB image y and a set of desired novel viewpoints {π^n}. Given these as input, we aim to generate a (mutually consistent) set of novel views {x^n} corresponding to the conditioning viewpoints, and thus seek to learn a denoising diffusion model that captures p({x^n} | y, {π^n}).
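To make this formulation concrete, below is a minimal PyTorch-style sketch of one training step for a conditional noise predictor ϵ(x_t, c, t): corrupt a clean sample with the standard forward process and regress the added noise. The tiny ConditionalDenoiser module, the toy tensor shapes, and the linear β schedule are illustrative placeholders rather than the actual MVD-Fusion architecture.

    import torch
    import torch.nn as nn

    class ConditionalDenoiser(nn.Module):
        # Placeholder for eps(x_t, c, t): a small MLP over flattened inputs.
        def __init__(self, x_dim, c_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + c_dim + 1, hidden), nn.SiLU(),
                nn.Linear(hidden, x_dim),
            )

        def forward(self, x_t, c, t):
            # Embed the integer timestep simply as a scalar in [0, 1].
            t_emb = t.float().unsqueeze(-1) / 1000.0
            return self.net(torch.cat([x_t, c, t_emb], dim=-1))

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # \bar{alpha}_t schedule

    model = ConditionalDenoiser(x_dim=64, c_dim=32)
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)

    x0 = torch.randn(8, 64)                            # clean targets (toy stand-in)
    cond = torch.randn(8, 32)                          # conditioning (input view + pose)

    t = torch.randint(0, T, (8,))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise       # forward corruption of x_0
    loss = ((model(x_t, cond, t) - noise) ** 2).mean() # simplified noise-prediction objective

    opt.zero_grad()
    loss.backward()
    opt.step()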

Figure 3. We visualize the unprojected point cloud obtained from a set of noisy RGB-D images at different timesteps during inference (input view, then T = 999 to T = 0). We observe the gradual denoising of geometry from a random point cloud to a point cloud that matches the input object.

To learn such a diffusion model, we need to formulate an approach that can predict the noise added to a set of corrupted multi-view images:

    {x_t^n} = { √(ᾱ_t) x^n + √(1 − ᾱ_t) ϵ^n }
    ϵ_pred = f({x_t^n}, y, t)

Instead of learning such a prediction model from scratch, we propose to adapt a pre-trained novel-view generative model from Zero-1-to-3 [20]. Specifically, this model captures the distribution over a single novel view p(x | y, π) given an RGB input by learning a denoising function ϵ_ϕ(y, x_t, π, t). While we aim to leverage this pre-trained large-scale module for efficient learning and generalization, it only models the distribution over a single novel view, whereas we aim to model the joint distribution over multiple views. To enable efficiently adapting it, we propose to first learn a separate module that computes view-aligned multi-view aware features {z_t^n}. We then modify the pre-trained single-view diffusion model to additionally leverage this multi-view aware conditioning:

    z_t^n = f_θ(y, {x_t^n}, {π^n}, t)    (1)
    ϵ_pred = {ϵ_ϕ′(y, x_t^n, π^n, z_t^n, t)}    (2)

3.2. Depth-guided Multi-view Consistency

Generating a set of consistent images requires the network to attend across different images within the set. SyncDreamer [21] proposes a way of achieving multi-view attention by unprojecting features from the set of images {x_t^0, x_t^1, ..., x_t^N} onto a 3D volume V and interpolating conditioning feature frustums {z_t^0, z_t^1, ..., z_t^N}. However, interpolating feature frustums linearly spaced across the whole 3D volume is an expensive operation that assumes no prior knowledge of object surfaces. In contrast, we propose to explicitly reason about the surface by additionally generating depth and biasing sampling near the possible surface.

Given a target view x_t^i, we obtain a feature frustum z_t^i by shooting rays and sampling features at 3D locations along the rays. For each ray, we sample D depth values near the expected surface and aggregate projected features from the target views {x_t^n} and the input view y. Let (z_t^i)_md be the feature for the m-th ray at the d-th depth in z_t^i. For a 3D point p_md^i corresponding to the feature (z_t^i)_md, we sample N + 1 features c_mdn from {x_t^n} and y. We also include the Plücker embedding of the query ray q_m and of the reference rays r_mdn from p_md^i to the N + 1 camera centers along with the sampled features as input into the transformer f_θ that predicts view-aligned multi-view aware features:

    (z_t^i)_md = Σ_{n=0}^{N} w_θ(v_mdn, t) f_θ(v_mdn, t),  where v_mdn = {c_mdn, r_mdn, q_m}    (3)

Here, w_θ(v_mdn, t) represents (normalized) weights predicted by the transformer, which are then used to aggregate the multi-view features to obtain the pointwise feature (z_t^i)_md.

Naively, we could sample a large number of depth points along each ray linearly spaced throughout the scene bound; however, such exhaustive sampling quickly becomes a memory constraint while possibly making the learning task more difficult, as the network may also observe features away from the surface. Thus, we sample D = 3 depths from a Gaussian distribution centered around an unbiased estimate of the depth given the noisy depth d_t, with a spread given by a scaled version of the denoising diffusion variance schedule (Eq. 4):

    d ∼ N( E[d_0],  k · √(1 − ᾱ_t) / √(ᾱ_t) ),  where d_t = √(ᾱ_t) d_0 + √(1 − ᾱ_t) ϵ  and  E[d_0] = d_t / √(ᾱ_t)    (4)

We then use these multi-view aware features {z_t^n} as conditioning input into our latent diffusion model in Section 3.3.
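As a rough illustration of Eqs. 3 and 4, the following sketch samples candidate depths around the unbiased estimate E[d_0] = d_t / √(ᾱ_t), lifts a query pixel to 3D, projects it into the posed views, gathers features, and aggregates them with normalized weights. It assumes a pinhole camera convention (x_cam = R x_world + t), toy poses, and a plain norm-based softmax standing in for the learned transformer weights w_θ; none of the names correspond to the released implementation.

    import torch
    import torch.nn.functional as F

    def sample_depths(d_t, alpha_bar_t, k=1.0, D=3):
        # D candidate depths around E[d_0] = d_t / sqrt(abar_t); the spread shrinks
        # as denoising progresses (Eq. 4, up to the scale k).
        e_d0 = d_t / alpha_bar_t.sqrt()
        sigma = k * (1.0 - alpha_bar_t).sqrt() / alpha_bar_t.sqrt()
        return e_d0 + sigma * torch.randn(D, *d_t.shape)

    def unproject(uv, depth, K, R, t):
        # Pixel (u, v) at a given depth in a camera with world-to-camera pose (R, t) -> world point.
        uv1 = torch.cat([uv, torch.ones_like(uv[..., :1])], dim=-1)
        x_cam = depth.unsqueeze(-1) * (torch.linalg.inv(K) @ uv1.unsqueeze(-1)).squeeze(-1)
        return (R.T @ (x_cam - t).unsqueeze(-1)).squeeze(-1)

    def project(p_world, K, R, t, H=32, W=32):
        # World point -> normalized image coordinates in [-1, 1] for grid_sample.
        x_cam = (R @ p_world.unsqueeze(-1)).squeeze(-1) + t
        uv = (K @ x_cam.unsqueeze(-1)).squeeze(-1)
        uv = uv[..., :2] / uv[..., 2:].clamp(min=1e-6)
        return 2.0 * uv / torch.tensor([W - 1.0, H - 1.0]) - 1.0

    def aggregate(feats):
        # Stand-in for the learned weights w_theta: a plain softmax over feature norms.
        w = torch.softmax(feats.norm(dim=-1), dim=0)
        return (w.unsqueeze(-1) * feats).sum(dim=0)

    # Toy example: one query pixel of one target view, N+1 = 4 posed feature maps.
    N1, C, H, W = 4, 256, 32, 32
    feat_maps = torch.randn(N1, C, H, W)
    K = torch.tensor([[35.0, 0.0, 16.0], [0.0, 35.0, 16.0], [0.0, 0.0, 1.0]])
    Rs = [torch.eye(3) for _ in range(N1)]
    ts = [torch.tensor([0.0, 0.0, 2.0]) for _ in range(N1)]

    alpha_bar_t = torch.tensor(0.5)
    uv = torch.tensor([12.0, 20.0])        # query pixel in the target view
    d_t = torch.tensor(1.4)                # its current (noisy) depth estimate

    samples = []
    for d in sample_depths(d_t, alpha_bar_t):
        p = unproject(uv, d, K, Rs[0], ts[0])          # lift the pixel near the expected surface
        per_view = []
        for n in range(N1):
            grid = project(p, K, Rs[n], ts[n]).view(1, 1, 1, 2)
            f = F.grid_sample(feat_maps[n:n + 1], grid, align_corners=True)
            per_view.append(f.view(C))
        samples.append(aggregate(torch.stack(per_view)))
    z = torch.stack(samples)               # one multi-view aware feature per depth sample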


Figure 4. Qualitative results for novel view synthesis on instances from Objaverse (top) and Google Scanned objects (bottom).
We compare our method with Zero-1-to-3 [20] and SyncDreamer [21]. We show the input image and two novel views generated by
each method. Zero-1-to-3 independently generates novel views which are not consistent (e.g. the person in Objaverse). While both,
SyncDreamer and MVD-Fusion yield consistent generations, we find that MVD-Fusion can generate more plausible output (e.g. the
Android image) and is more faithful to details in the input (e.g. the three cars).

3.3. Learning Multi-view 2.5D Diffusion

Inspired by the success of finetuning pretrained Stable Diffusion models, we adapt Zero-1-to-3 [20] as our multi-view novel-view synthesis backbone ϵ_ϕ. While Zero-1-to-3 is designed to only model single-view distributions and generate RGB output, we adapt it to predict an additional depth channel and to cross-attend to the multi-view aware features. First, we increase the input and output channels of the latent diffusion UNet backbone to predict normalized depth. While the image latents can be decoded into high-resolution images, our predicted depth map remains at the lower resolution. This multi-resolution approach to predicting RGB-D lets us use the frozen Stable Diffusion VAE to decode high-resolution RGB images. Moreover, we add additional residual cross-attention layers at multiple levels of the UNet to attend to our multi-view aware features. Finally, we modify the camera parameterization used in Zero-1-to-3 from a 3-DoF azimuth, elevation, and radius parameterization to the full perspective camera matrix. This makes our method capable of handling arbitrary camera poses in real datasets such as CO3D.
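A minimal sketch of the full-perspective camera conditioning described above, following the appendix detail that the CLIP image embedding and a flattened camera matrix are mapped by three fully connected layers to the 768-dimensional cross-attention input; the layer widths and inputs here are illustrative stand-ins rather than the released implementation.

    import torch
    import torch.nn as nn

    class CameraConditioner(nn.Module):
        # Maps a CLIP image embedding plus a flattened 4x4 camera matrix to a
        # 768-dimensional cross-attention token (layer sizes are illustrative).
        def __init__(self, clip_dim=768, cam_dim=16, out_dim=768):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(clip_dim + cam_dim, 1024), nn.SiLU(),
                nn.Linear(1024, 1024), nn.SiLU(),
                nn.Linear(1024, out_dim),
            )

        def forward(self, clip_embed, cam_matrix):
            cam_flat = cam_matrix.reshape(cam_matrix.shape[0], -1)
            return self.mlp(torch.cat([clip_embed, cam_flat], dim=-1))

    conditioner = CameraConditioner()
    clip_embed = torch.randn(2, 768)         # frozen CLIP image features (stand-in)
    cam = torch.eye(4).expand(2, 4, 4)       # full perspective camera matrices
    tokens = conditioner(clip_embed, cam)    # (2, 768) cross-attention conditioning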
During training, we finetune all the parameters of our network and follow [10] in using the simplified variational lower bound objective in Eq. 5. During inference, we follow [21] and use a classifier-free guidance scale of 2.0.

    L_DM = E_{x_0^n, ϵ^n, t} [ || ϵ^n − ϵ_ϕ′(y, x_t^n, π^n, z_t^n, t) || ],  where x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ϵ,  ϵ ∼ N(0, 1)    (5)
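Putting the pieces together, a schematic sketch of the inference loop: at every denoising step the multi-view aware features are recomputed from the current (noisy) RGB-D estimates and used to jointly denoise all target views. The two helper functions are stand-ins for the modules of Sections 3.2 and 3.3, not the released implementation.

    import torch

    def compute_multiview_features(x_t, y, poses, t):
        # Stand-in for f_theta (Sec. 3.2): depth-guided aggregation across views.
        return torch.zeros_like(x_t)

    def denoise_step(x_t, y, poses, z_t, t):
        # Stand-in for one reverse-diffusion update with the modified U-Net (Sec. 3.3),
        # including classifier-free guidance with scale 2.0.
        return 0.99 * x_t

    N, C, H, W = 8, 5, 32, 32              # 8 target views; 4 latent RGB channels + 1 depth channel
    y = torch.randn(1, 4, H, W)            # encoded input view
    poses = torch.eye(4).repeat(N, 1, 1)   # desired novel viewpoints (full camera matrices)
    x_t = torch.randn(N, C, H, W)          # start all target views from Gaussian noise

    for t in reversed(range(1000)):
        z_t = compute_multiview_features(x_t, y, poses, t)   # uses the current noisy depth estimates
        x_t = denoise_step(x_t, y, poses, z_t, t)            # jointly denoise the set of views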
4. Experiments

We train MVD-Fusion using a large-scale synthetic dataset (Section 4.1), and evaluate it on both synthetic and real-world objects for view synthesis (Section 4.2) and 3D reconstruction (Section 4.3). We show that our approach achieves more accurate view synthesis compared to the state of the art and also yields better 3D predictions than prior direct 3D inference methods. Finally, we present qualitative results on in-the-wild objects (Section 4.4).

Table 1. Results for novel view synthesis on the Objaverse dataset. We compare our method with two other baselines on 100 instances from the test set of the Objaverse dataset, same as in [20]. Our method outperforms existing baselines over three commonly used metrics: PSNR, SSIM and LPIPS.

    Method             PSNR ↑   SSIM ↑   LPIPS ↓
    Zero123 [20]       17.37    0.783    0.211
    SyncDreamer [21]   19.22    0.817    0.176
    MVD-Fusion         21.19    0.835    0.146

Figure 5. Sample Diversity. MVD-Fusion is capable of generating diverse samples given the same input. We show the input image (left) followed by views synthesized in three randomly generated samples. We observe that there is meaningful variation in uncertain regions, e.g. the eyes of the character and the colors on the screen vary across samples.

4.1. Experimental Setup

Datasets. We use the large-scale 3D dataset Objaverse [2] for training. Since Objaverse is large (800K instances) and contains several instances with poor texture, we filter the dataset with a CLIP score to remove instances that match a set of hand-crafted negative text prompts. Our filtered set contains around 400K instances. For each instance, we render 16 views from an elevation of 30 degrees and azimuth linearly spaced across 360 degrees. Additionally, we hold out a subset of Objaverse instances following [20] for evaluation, which consists of about 4k instances.

Beyond evaluating on these held-out Objaverse instances, we also evaluate our method on the Google Scanned Objects dataset (GSO) [4], which consists of high-quality scanned household items. For each object, we render 16 views evenly spaced in azimuth from an elevation of 30 degrees and choose one of them as the input image. For quantitative results, we randomly chose 30 objects to compute the metrics. Finally, to show the flexibility of our approach in modeling real-world datasets with general perspective cameras, as opposed to the common 3-DoF cameras used in Objaverse and GSO, we finetune and evaluate our model on CO3D [31]. We follow [52] to train on 41 categories and evaluate on the held-out set of all 51 categories.

Baselines. For the novel view synthesis task, we adopt Zero-1-to-3 [20] and SyncDreamer [21] as our baseline methods. Given an input image, Zero-1-to-3 can synthesize images from novel viewpoints. Built on Zero-1-to-3, SyncDreamer can simultaneously generate multiple images from different viewpoints with 3D consistency. For CO3D, we compare against PixelNeRF [51], as both Zero-1-to-3 and SyncDreamer are restricted to 3-DoF camera variation.

For 3D reconstruction, we compare our method with the aforementioned two methods together with RealFusion [23], Magic123 [29], One-2-3-45 [18], Point-E [27] and Shap-E [12]. Note that the diffusion-based methods require neural field optimization using either rendering objectives or distillation objectives (e.g. Zero-1-to-3 requires SDS distillation for extracting geometry, whereas SyncDreamer relies on training NeuS [44]), whereas our method allows 'directly' computing the geometry via un-projecting the predicted depth maps. To highlight this distinction, we categorize the reconstruction approaches as direct (One-2-3-45 [18], Point-E [27], Shap-E [12], and MVD-Fusion) or optimization-based (RealFusion [23], Magic123 [29], Zero-1-to-3 [20], and SyncDreamer [21]).

Metrics. For novel view synthesis, we adopt commonly used metrics: PSNR, SSIM [45], and LPIPS [53]. For the 3D reconstruction task, we report the Chamfer Distance between ground-truth and predicted point clouds.

Implementation Details. We train our model on a filtered version of the Objaverse dataset, which consists of about 400k instances. During training, for each instance, we randomly sample 5 views and choose the first one as the input view. We train the model for 400k iterations on 4 40G A100 GPUs using a total batch size of 16. We use the Adam optimizer with a learning rate of 1e-5. Even though we only train with 5 views, we can sample an arbitrary set during inference, as our depth-based projection followed by transformer-based aggregation trivially generalizes to incorporate more views. In our experiments, we render 16 views for each instance for evaluation.

4.2. Novel View Synthesis

Objaverse and Google Scanned Objects. We report quantitative results on the Objaverse dataset and GSO dataset in Table 1 and Table 2, respectively. For the Objaverse dataset, we use the held-out test set for evaluation, whereas we use a subset of 30 random objects from GSO.
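For reference, the PSNR reported in Tables 1 and 2 can be computed directly from the per-pixel error; a small numpy sketch assuming images normalized to [0, 1] (SSIM [45] and LPIPS [53] are computed with their standard implementations).

    import numpy as np

    def psnr(pred, target, max_val=1.0):
        # Peak signal-to-noise ratio in dB between two images in [0, max_val].
        mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")
        return 10.0 * np.log10((max_val ** 2) / mse)

    pred = np.random.rand(256, 256, 3)
    target = np.random.rand(256, 256, 3)
    print(f"PSNR: {psnr(pred, target):.2f} dB")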

Table 2. Results for novel view synthesis on the Google Scanned Objects (GSO) dataset. We compare our method with two other baselines on 100 instances randomly chosen from the GSO dataset. Our method achieves consistent improvement over baseline methods on PSNR and LPIPS, while being slightly worse than SyncDreamer on SSIM.

    Method             PSNR ↑   SSIM ↑   LPIPS ↓
    Zero123 [20]       17.42    0.756    0.207
    SyncDreamer [21]   18.95    0.796    0.176
    MVD-Fusion         19.53    0.790    0.175

Table 3. Results for 51-category novel view synthesis on CO3D. We significantly outperform PixelNeRF on perceptual quality.

    Method             PSNR ↑   SSIM ↑   LPIPS ↓
    PixelNeRF [51]     17.64    0.484    0.378
    MVD-Fusion         17.16    0.701    0.220

Figure 6. User preference percentage of MVD-Fusion against Zero-123, SyncDreamer (SyncD.), and ground truth (G.T.). Win rates: MVD-Fusion 56.1% vs. Zero-123 43.9%; MVD-Fusion 56.3% vs. SyncD. 43.7%; SyncD. 17.9% vs. G.T. 82.1%; Zero-123 24.8% vs. G.T. 75.2%; MVD-Fusion 31.4% vs. G.T. 68.6%.

Figure 7. Qualitative results for novel view synthesis on instances from CO3D (input, PixelNeRF, MVD-Fusion). MVD-Fusion is able to predict accurate and realistic novel views on real-world datasets with perspective camera poses.

We find that our method achieves consistent improvements over the baselines across metrics on both the in-distribution Objaverse dataset and the out-of-distribution GSO dataset. We also provide qualitative comparisons on the Objaverse dataset and GSO dataset in Figure 4. Although Zero-1-to-3 [20] produces visually reasonable images, it suffers from multi-view inconsistency across generated viewpoints. In contrast, SyncDreamer and MVD-Fusion are able to obtain multi-view consistency among generated images. However, SyncDreamer struggles to obtain high alignment with the input image, sometimes leading to consistent multi-view images with unreasonable appearance. Our method, on the other hand, is able to generate multi-view consistent images with better alignment with the image input and more plausible completions in unobserved regions. Moreover, we also note that SyncDreamer is trained on the whole Objaverse dataset and may have actually seen these instances, whereas they represent held-out instances for both MVD-Fusion and Zero-1-to-3.

In addition to visualizing the comparative results with baselines, we also highlight the ability of MVD-Fusion to generate multiple plausible outputs. In particular, since novel view synthesis from a single image is an under-constrained task, using a diffusion model can effectively generate more diverse samples given a single input image. As shown in Figure 5, MVD-Fusion is able to generate diverse plausible samples with different random seeds, e.g. varying textures in the front of the bus.

User Study. We run a user study by randomly selecting 40 instances from the Objaverse and GSO test sets and asking 43 users to make 860 pairwise comparisons (users are shown an input image and two generated novel views per method). We show results in Figure 6. Our method tends to be chosen over Zero-123 and SyncDreamer, and is also more competitive with the ground truth compared to these baselines.

Common Objects in 3D. Real-world data often have cameras that are not origin-facing, making methods that model 3-DoF origin-facing cameras [20, 21] unsuitable for real-world inference. We finetune our model on CO3D and show novel view synthesis results in Table 3. We also train a cross-category PixelNeRF [51] model as a baseline. While it is slightly better in PSNR (perhaps due to blurry mean predictions being optimal under uncertainty), our method vastly outperforms PixelNeRF in the perceptual metrics SSIM and LPIPS (see Figure 7).

4.3. Single-view Reconstruction

Unlike previous methods such as Zero-1-to-3 and SyncDreamer, which have to fit a radiance field to generated multi-view images to obtain the 3D shape, MVD-Fusion can directly obtain a point cloud. With multi-view RGB-D generations, we can simply unproject the foreground pixels and obtain the object point cloud. We show quantitative results in Table 4, where we compare our method against previous methods on the GSO dataset using Chamfer distance. We see that our method outperforms all of the methods that directly infer 3D shapes and most of the methods that require further optimization steps to get the 3D shapes.
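A sketch of this 'direct' geometry extraction: foreground pixels of each generated depth map are unprojected into a shared world frame to form a point cloud, which can then be compared against ground truth with a symmetric Chamfer distance. The pinhole convention, the toy poses, and the random masks below are assumptions of the sketch, not the evaluation code used in the paper.

    import torch

    def unproject_view(depth, mask, K, R, t):
        # Lift foreground pixels of one depth map into world coordinates,
        # assuming x_cam = R @ x_world + t and pixel = K @ x_cam.
        H, W = depth.shape
        v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        uv1 = torch.stack([u.float(), v.float(), torch.ones(H, W)], dim=-1)   # (H, W, 3)
        rays = (torch.linalg.inv(K) @ uv1.reshape(-1, 3).T).T                 # (H*W, 3)
        x_cam = rays * depth.reshape(-1, 1)
        x_world = (R.T @ (x_cam - t).T).T
        return x_world[mask.reshape(-1)]

    def chamfer(a, b):
        # Symmetric Chamfer distance between point clouds (N, 3) and (M, 3).
        d = torch.cdist(a, b)
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

    # Toy example with two views sharing identity extrinsics.
    K = torch.tensor([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
    depths = torch.rand(2, 64, 64) + 1.0
    masks = torch.rand(2, 64, 64) > 0.5
    R, t = torch.eye(3), torch.zeros(3)

    points = torch.cat([unproject_view(d, m, K, R, t) for d, m in zip(depths, masks)])
    print(points.shape, chamfer(points, points + 0.01).item())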


Figure 8. In-the-wild Generalization. We visualize the prediction from MVD-Fusion on in-the-wild internet images. We find that MVD-
Fusion is able to preserve the rich texture in the input images and model the rough geometry without post-processing.

Table 4. Results for 3D reconstruction on the Google Scanned Objects (GSO) dataset. 'Optimization' denotes methods that require additional training such as fitting an occupancy field to obtain 3D shapes. 'Direct' denotes methods that can directly output 3D predictions. Following [21], we report Chamfer Distance on the same 30 instances from the GSO dataset. Our method demonstrates consistent improvement over 'direct' methods and outperforms most of the 'optimization' methods.

    3D Extraction   Method              Chamfer Dist ↓
    Optimization    RealFusion [23]     0.082
                    Magic123 [29]       0.052
                    Zero123 [20]        0.034
                    SyncDreamer [21]    0.026
    Direct          One-2-3-45 [18]     0.063
                    Point-E [27]        0.043
                    Shap-E [12]         0.044
                    MVD-Fusion          0.031

4.4. In-the-wild Generalization

We also demonstrate the generalization ability of MVD-Fusion for reconstructing in-the-wild images from the internet. We show qualitative results depicting generated novel views and recovered point clouds in Figure 8. With challenging out-of-domain images as input, MVD-Fusion is still capable of generating consistent novel-view images and reasonable 3D shapes from a single-view observation.

5. Discussion

In this work, we presented MVD-Fusion, which allowed co-generating multi-view images given a single input image. Our approach allowed adapting a pre-trained large-scale novel-view diffusion model for generating multi-view RGB-D images, and enforced consistency among these via depth-guided projection. While our results showed improvements over prior state-of-the-art across datasets, there are several challenges that still remain. First, the multi-view consistency is encouraged via inductive biases in the network design but is not guaranteed, and the network may generate (slightly) inconsistent multi-view predictions. Moreover, while our inferred multi-view depth maps can yield a point cloud representation that captures the coarse geometry, these coarse depth maps do not capture the fine details visible in the generated views, and an optimization-based procedure may help extract these better. Finally, our approach has been trained on clean unoccluded instances and would not be directly applicable to cluttered scenes with partially visible objects; it remains an open research question to build systems that can deal with such challenging scenarios.

Acknowledgements

We thank Bharath Raj, Jason Y. Zhang, Yufei (Judy) Ye, Yanbo Xu, and Zifan Shi for helpful discussions and feedback. This work is supported in part by NSF GRFP Grant No. (DGE1745016, DGE2140739).

References

[1] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
[2] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
[3] Congyue Deng, Chiyu Jiang, Charles R Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov, et al. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. In CVPR, 2023.
[4] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In ICRA, 2022.
[5] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, 2017.
[6] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
[7] Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh r-cnn. In ICCV, 2019.
[8] Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. Nerfdiff: Single-image view synthesis with nerf-guided distillation from 3d-aware diffusion. In ICML, 2023.
[9] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[11] Zehuan Huang, Hao Wen, Junting Dong, Yaohui Wang, Yangguang Li, Xinyuan Chen, Yan-Pei Cao, Ding Liang, Yu Qiao, Bo Dai, and Lu Sheng. Epidiff: Enhancing multi-view synthesis via localized epipolar-constrained diffusion. In CVPR, 2024.
[12] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
[13] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[14] Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, and Aliaksandr Siarohin. Spad: Spatially aware multiview diffusers. In CVPR, 2024.
[15] Nilesh Kulkarni, Justin Johnson, and David F Fouhey. Directed ray distance functions for 3d scene reconstruction. In ECCV, 2022.
[16] Nilesh Kulkarni, Linyi Jin, Justin Johnson, and David F Fouhey. Learning to predict scene-level implicit 3d from posed rgbd data. In CVPR, 2023.
[17] Chen-Hsuan Lin, Chaoyang Wang, and Simon Lucey. Sdf-srn: Learning signed distance 3d object reconstruction from static images. In NeurIPS, 2020.
[18] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. In NeurIPS, 2023.
[19] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. In CVPR, 2024.
[20] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
[21] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Learning to generate multiview-consistent images from a single-view image. In ICLR, 2024.
[22] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In CVPR, 2024.
[23] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360° reconstruction of any object from a single image. In CVPR, 2023.
[24] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.
[25] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[26] K L Navaneet, Ansu Mathew, Shashank Kashyap, Wei-Chih Hung, Varun Jampani, and R Venkatesh Babu. From image collections to point clouds with self-supervised shape and pose networks. In CVPR, 2020.
[27] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
[28] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
[29] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. In ICLR, 2024.
[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
[31] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021.
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[33] Qiuhong Shen, Xingyi Yang, and Xinchao Wang. Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:2304.10261, 2023.
[34] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023.
[35] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. In ICLR, 2024.
[36] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion: (0-)image-conditioned 3d generative models from 2d data. In ICCV, 2023.
[37] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In ICCV, 2023.
[38] Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. In NeurIPS, 2023.
[39] Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, and Vincent Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. In NeurIPS, 2023.
[40] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
[41] Kalyan Alwala Vasudev, Abhinav Gupta, and Shubham Tulsiani. Pre-train, self-train, distill: A simple recipe for super-sizing 3d reconstruction. In CVPR, 2022.
[42] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, 2023.
[43] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
[44] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
[45] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. In TIP, 2004.
[46] Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xiang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su, and Jun Zhu. Crm: Single image to 3d textured mesh with convolutional reconstruction model. arXiv preprint arXiv:2403.05034, 2024.
[47] Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari. Multiview compressive coding for 3d reconstruction. In CVPR, 2023.
[48] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. In CVPR, 2023.
[49] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, 2019.
[50] Yufei Ye, Shubham Tulsiani, and Abhinav Gupta. Shelf-supervised mesh prediction in the wild. In CVPR, 2021.
[51] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, 2021.
[52] Jason Y. Zhang, Deva Ramanan, and Shubham Tulsiani. RelPose: Predicting probabilistic relative rotation for single objects in the wild. In ECCV, 2022.
[53] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[54] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023.

A. Architecture Details

Our network consists of modifications on top of Zero123 [20]. We describe each component of our network in detail.

VAE. We use the pretrained VAE from Stable Diffusion 1.4 [32]. We freeze the VAE.

UNet. We initialize our UNet with weights from Zero123 [20]. Zero123 has a novel view synthesis UNet that accepts one input image (4-channel latents) and one target noisy image (4-channel latents) along with the camera pose, and predicts a novel-view image latent (4 channels). We modify the input and output blocks to accommodate the prediction of an additional depth channel. Our UNet has 10 input channels and 5 output channels. For all experiments, we use only RGB images as input (4-channel latents) and pad the additional channel with zeros. The noisy target image is always 5 channels.

CLIP and Camera Pose Embedding. We follow Zero123 [20] in using the frozen CLIP [30] image encoder along with camera information as one of the inputs to the cross-attention layers in Stable Diffusion. However, instead of using an azimuth and elevation angle representation, we directly use a flattened camera matrix as input. We use 3 fully connected layers to map the CLIP image embedding and flattened camera matrix into a cross-attention input of dimension 768.

Depth-guided Multi-view Attention. After each of the existing cross-attention layers in the UNet, we add additional cross-attention layers that attend to view-aligned feature frustums sampled from our depth-guided multi-view attention module. Our depth-guided attention module is a 3-layer transformer that aggregates information across the noisy target latents from the current timestep as well as the input image latents. For each target view, we generate a feature frustum of shape (1, 256, 3, 32, 32), where the feature map is 32 by 32, with 3 depth samples and feature dimension 256. The depth dimension represents the number of depth points sampled along each ray and can be reduced down to just 1. Our transformer uses a hidden dimension of 256 with 8 heads. We use an additional fully connected layer to project our features into 768 dimensions, making them compatible with the existing cross-attention layers. A key difference between our multi-view cross-attention and text cross-attention is that, in our multi-view attention, each latent patch independently attends to the corresponding patch in the feature frustum.

B. Additional Results and Visualizations

Table 5. Ablation study on the Google Scanned Objects (GSO) dataset. We ablate the effect of the classifier-free guidance scale during inference. We randomly chose 30 instances from the dataset for evaluation. 'Scale' denotes the classifier-free guidance scale.

    Scale   PSNR ↑   SSIM ↑   LPIPS ↓
    1.0     19.34    0.775    0.199
    2.0     19.91    0.787    0.184
    3.0     19.38    0.778    0.189
    5.0     18.66    0.771    0.194

Ablating Classifier-free Guidance Scale. We further conduct experiments to ablate the effect of the classifier-free guidance scale. Proposed in [9], classifier-free guidance controls the faithfulness of the generated output to the conditional input. In our method, we use the classifier-free guidance scale ω to control the contribution of the classifier-free model: ϵ̂_ϕ′(y, x_t^n, π^n, z_t^n, t) = ω ϵ_ϕ′(y, x_t^n, π^n, z_t^n, t) + (1 − ω) ϵ_ϕ′(x_t^n, t), where ϵ_ϕ′(y, x_t^n, π^n, z_t^n, t) is our proposed multi-view diffusion model. In practice, we notice that a higher classifier-free guidance scale leads to better multi-view consistency. As shown in Table 5, we find that adopting a scale of 2 yields the best performance. Therefore, we use this classifier-free guidance scale for inference.
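The guidance combination above can be sketched as follows, with ω = 2.0 as used at inference; eps_cond and eps_uncond stand for the conditional and unconditional noise predictions at one denoising step, and the 5-channel shape reflects the RGB latent plus depth channel described above.

    import torch

    def classifier_free_guidance(eps_cond, eps_uncond, omega=2.0):
        # eps_hat = omega * eps(y, x_t, pi, z_t, t) + (1 - omega) * eps(x_t, t)
        return omega * eps_cond + (1.0 - omega) * eps_uncond

    eps_cond = torch.randn(1, 5, 32, 32)     # conditional prediction (RGB latent + depth channel)
    eps_uncond = torch.randn(1, 5, 32, 32)   # unconditional prediction
    eps_hat = classifier_free_guidance(eps_cond, eps_uncond)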
