pixelNeRF: Neural Radiance Fields from One or Few Images
Figure 1: NeRF from one or few images. We present pixelNeRF, a learning framework that predicts a Neural Radiance Field (NeRF) representation from a single (top) or few posed images (bottom). PixelNeRF can be trained on a set of multi-view images, allowing it to perform plausible novel view synthesis from very few input images without test-time optimization (bottom left). In contrast, NeRF has no generalization capabilities and performs poorly when only three input views are available (bottom right).
Abstract

We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields [27] involves optimizing the representation for every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by applying it to multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website: https://alexyu.net/pixelnerf.

1. Introduction

We study the problem of synthesizing novel views of a scene from a sparse set of input views. This long-standing problem has recently seen progress due to advances in differentiable neural rendering [27, 20, 24, 39]. Across these approaches, a 3D scene is represented with a neural network, which can then be rendered into 2D views. Notably, the recent method neural radiance fields (NeRF) [27] has shown impressive performance on novel view synthesis of a specific scene by implicitly encoding volumetric density and color through a neural network. While NeRF can render photorealistic novel views, it is often impractical as it requires a large number of posed images and a lengthy per-scene optimization.

In this paper, we address these shortcomings by proposing pixelNeRF, a learning framework that enables predicting NeRFs from one or several images in a feed-forward manner. Unlike the original NeRF network, which does not make use of any image features, pixelNeRF takes spatial image features aligned to each pixel as an input. This image conditioning allows the framework to be trained on a set of multi-view images, where it can learn scene priors to perform view synthesis from one or few input views. In contrast, NeRF is unable to generalize and performs poorly when few input images are available, as shown in Fig. 1.
Specifically, we condition NeRF on input images by first computing a fully convolutional image feature grid from the input image. Then, for each query spatial point x and viewing direction d of interest in the view coordinate frame, we sample the corresponding image feature via projection and bilinear interpolation. The query specification is sent along with the image features to the NeRF network that outputs density and color, where the spatial image features are fed to each layer as a residual. When more than one image is available, the inputs are first encoded into latent representations in each camera's coordinate frame, which are then pooled in an intermediate layer prior to predicting the color and density. The model is supervised with a reconstruction loss between a ground truth image and a view rendered using conventional volume rendering techniques. This framework is illustrated in Fig. 2.

                      NeRF   DISN    ONet     DVR      SRN   Ours
Learns scene prior?    ✗      ✓       ✓        ✓        ✓     ✓
Supervision            2D     3D      3D       2D       2D    2D
Image features         ✗     Local   Global   Global    ✗    Local
Allows multi-view?     ✓      ✓       ✗        ✗        ✓     ✓
View space?            -      ✗       ✗        ✗        ✗     ✓

Table 1: A comparison with prior works reconstructing neural scene representations. The proposed approach learns a scene prior for one or few-view reconstruction using only multi-view 2D image supervision. Unlike previous methods in this regime, we do not require a consistent canonical space across the training corpus. Moreover, we incorporate local image features to preserve local information, in contrast to methods that compress the structure and appearance into a single latent vector, such as Occupancy Networks (ONet) [25] and DVR [28].
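The per-point conditioning described above can be sketched in a few lines. The following is a minimal illustration in PyTorch; the intrinsics-based projection for π, the layer sizes, and the module names are assumptions made for the sketch, not the exact pixelNeRF implementation. In the multi-view case, the same sampling would be repeated per input view before the intermediate pooling mentioned above.

import torch
import torch.nn.functional as F

def sample_pixel_features(feat_grid, x_cam, K):
    """Project camera-space points x_cam (B, N, 3) with pinhole intrinsics K (B, 3, 3)
    and bilinearly sample the pixel-aligned feature grid feat_grid (B, C, H, W)."""
    uv = torch.einsum('bij,bnj->bni', K, x_cam)            # homogeneous pixel coordinates
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-8)        # perspective divide -> (u, v)
    H, W = feat_grid.shape[-2:]
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,      # normalize to [-1, 1] for grid_sample
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    feats = F.grid_sample(feat_grid, grid.unsqueeze(2),    # (B, C, N, 1)
                          mode='bilinear', align_corners=True)
    return feats.squeeze(-1).transpose(1, 2)               # (B, N, C), one feature per query point

class ConditionedNeRF(torch.nn.Module):
    """Tiny NeRF-style MLP in which the sampled image feature enters every layer as a residual."""
    def __init__(self, feat_dim=512, hidden=256, n_layers=4):
        super().__init__()
        self.inp = torch.nn.Linear(6, hidden)              # raw (x, d); positional encoding omitted for brevity
        self.feat_proj = torch.nn.ModuleList(
            [torch.nn.Linear(feat_dim, hidden) for _ in range(n_layers)])
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.out = torch.nn.Linear(hidden, 4)               # (sigma, r, g, b)

    def forward(self, x, d, feat):
        h = torch.relu(self.inp(torch.cat([x, d], dim=-1)))
        for layer, proj in zip(self.layers, self.feat_proj):
            h = torch.relu(layer(h) + proj(feat))            # residual image-feature conditioning at each layer
        sigma_rgb = self.out(h)
        return torch.relu(sigma_rgb[..., :1]), torch.sigmoid(sigma_rgb[..., 1:])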
PixelNeRF has many desirable properties for few-view novel-view synthesis. First, pixelNeRF can be trained on a dataset of multi-view images without additional supervision such as ground truth 3D shape or object masks. Second, pixelNeRF predicts a NeRF representation in the camera coordinate system of the input image instead of a canonical coordinate frame. This is not only integral for generalization to unseen scenes and object categories [41, 37], but also for flexibility, since no clear canonical coordinate system exists on scenes with multiple objects or real scenes. Third, it is fully convolutional, allowing it to preserve the spatial alignment between the image and the output 3D representation. Lastly, pixelNeRF can incorporate a variable number of posed input views at test time without requiring any test-time optimization.

We conduct an extensive series of experiments on synthetic and real image datasets to evaluate the efficacy of our framework, going beyond the usual set of ShapeNet experiments to demonstrate its flexibility. Our experiments show that pixelNeRF can generate novel views from a single image input for both category-specific and category-agnostic settings, even in the case of unseen object categories. Further, we test the flexibility of our framework, both with a new multi-object benchmark for ShapeNet, where pixelNeRF outperforms prior approaches, and with a simulation-to-real transfer demonstration on real car images. Lastly, we test the capabilities of pixelNeRF on real images using the DTU dataset [14], where, despite being trained on under 100 scenes, it can generate plausible novel views of a real scene from three posed input views.

2. Related Work

Novel View Synthesis. The long-standing problem of novel view synthesis entails constructing new views of a scene from a set of input views. Early work achieved photorealistic results but required densely captured views of the scene [19, 11]. Recent work has made rapid progress toward photorealism for both wider ranges of novel views and sparser sets of input views, by using 3D representations based on neural networks [27, 23, 26, 38, 42, 7]. However, because these approaches fit a single model to each scene, they require many input views and substantial optimization time per scene.

There are methods that can predict novel views from few input views or even single images by learning shared priors across scenes. Methods in the tradition of [35, 3] use depth-guided image interpolation [54, 10, 32]. More recently, the problem of predicting novel views from a single image has been explored [44, 47, 36, 5]. However, these methods employ 2.5D representations, and are therefore limited in the range of camera motions they can synthesize. In this work we infer a 3D volumetric NeRF representation, which allows novel view synthesis from larger baselines.

Sitzmann et al. [39] introduce a representation based on a continuous 3D feature space to learn a prior across scene instances. However, using the learned prior at test time requires further optimization with known absolute camera poses. In contrast, our approach is completely feed-forward and only requires relative camera poses. We offer extensive comparisons with this approach to demonstrate the advantages our design affords. Lastly, note that the concurrent work GRF [43] also adds image features to NeRF. A key difference is that we operate in view rather than canonical space, which makes our approach applicable in more general settings. Moreover, we extensively demonstrate our method's performance in few-shot view synthesis, while GRF shows very limited quantitative results for this task.

Learning-based 3D reconstruction. Advances in deep learning have led to rapid progress in single-view or multi-view 3D reconstruction. Many approaches [15, 12, 46, 53, 38, 33, 49, 25, 31] propose learning frameworks with various 3D representations that require ground-truth 3D models for supervision.
Multi-view supervision [50, 45, 21, 22, 39, 28, 8, 2] is less restrictive and more ecologically plausible. However, many of these methods [50, 45, 21, 22, 28] require object masks; in contrast, pixelNeRF can be trained from images alone, allowing it to be applied to scenes of two objects without modification.

Most single-view 3D reconstruction methods condition neural 3D representations on input images. The majority employs global image features [29, 6, 28, 25, 8], which, while memory efficient, cannot preserve details that are present in the image and often lead to retrieval-like results. Spatially-aligned local image features have been shown to achieve detailed reconstructions from a single view [49, 33]. However, both of these methods require 3D supervision. Our method is inspired by these approaches, but only requires multi-view supervision.

Within existing methods, the types of scenes that can be reconstructed are limited, particularly so for object-centric approaches (e.g. [46, 21, 12, 45, 38, 53, 25, 49, 28]). CoReNet [31] reconstructs scenes with multiple objects via a voxel grid with offsets, but it requires 3D supervision including the identity and placement of objects. In comparison, we formulate a scene-level learning framework that can in principle be trained on scenes of arbitrary structure.

Viewer-centric 3D reconstruction. For the 3D learning task, prediction can be done either in a viewer-centered coordinate system, i.e. view space, or in an object-centered coordinate system, i.e. canonical space. Most existing methods [49, 25, 28, 39] predict in canonical space, where all objects of a semantic category are aligned to a consistent orientation. While this makes learning spatial regularities easier, using a canonical space inhibits prediction performance on unseen object categories and scenes with more than one object, where there is no pre-defined or well-defined canonical pose. PixelNeRF operates in view space, which has been shown to allow better reconstruction of unseen object categories [37, 2] and to discourage memorization of the training set [41]. We summarize key aspects of our approach relative to prior work in Table 1.

3. Background: NeRF

We first briefly review the NeRF representation [27]. A NeRF encodes a scene as a continuous volumetric radiance field f of color and density. Specifically, for a 3D point x ∈ R³ and viewing direction unit vector d ∈ R³, f returns a differential density σ and RGB color c: f(x, d) = (σ, c). The volumetric radiance field can then be rendered into a 2D image via

    Ĉ(r) = ∫_{tn}^{tf} T(t) σ(t) c(t) dt,    (1)

where T(t) = exp(−∫_{tn}^{t} σ(s) ds) handles occlusion. For a target view with pose P, a camera ray can be parameterized as r(t) = o + t d, with the ray origin (camera center) o ∈ R³ and ray unit direction vector d ∈ R³. The integral is computed along r between pre-defined depth bounds [tn, tf]. In practice, this integral is approximated with numerical quadrature by sampling points along each pixel ray.

The rendered pixel value for camera ray r can then be compared against the corresponding ground truth pixel value, C(r), for all the camera rays of the target view with pose P. The NeRF rendering loss is thus given by

    L = Σ_{r ∈ R(P)} ‖Ĉ(r) − C(r)‖₂²,    (2)

where R(P) is the set of all camera rays of target pose P.

Limitations. While NeRF achieves state-of-the-art novel view synthesis results, it is an optimization-based approach using geometric consistency as the sole signal, similar to classical multi-view stereo methods [1, 34]. As such, each scene must be optimized individually, with no knowledge shared between scenes. Not only is this time-consuming, but in the limit of single or extremely sparse views, it is unable to make use of any prior knowledge of the world to accelerate reconstruction or for shape completion.

4. Image-conditioned NeRF

To overcome the NeRF representation's inability to share knowledge between scenes, we propose an architecture to condition a NeRF on spatial image features. Our model is comprised of two components: a fully-convolutional image encoder E, which encodes the input image into a pixel-aligned feature grid, and a NeRF network f which outputs color and density, given a spatial location and its corresponding encoded feature. We choose to model the spatial query in the input view's camera space, rather than a canonical space, for the reasons discussed in § 2. We validate this design choice in our experiments on unseen object categories (§ 5.2) and complex unseen scenes (§ 5.3). The model is trained with the volume rendering method and loss described in § 3.

In the following, we first present our model for the single-view case. We then show how this formulation can be easily extended to incorporate multiple input images.

4.1. Single-Image pixelNeRF

We now describe our approach to render novel views from one input image. We fix our coordinate system as the view space of the input image and specify positions and camera rays in this coordinate system.

Given an input image I of a scene, we first extract a feature volume W = E(I). Then, for a point on a camera ray x, we retrieve the corresponding image feature by projecting x onto the image plane to the image coordinates π(x) using
Figure 2: Proposed architecture in the single-view case. For a query point x along a target camera ray with view direction d, a
corresponding image feature is extracted from the feature volume W via projection and interpolation. This feature is then passed into the
NeRF network f along with the spatial coordinates. The output RGB and density value is volume-rendered and compared with the target
pixel value. The coordinates x and d are in the camera coordinate system of the input view.
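The volume-rendering step shown in Fig. 2 is the quadrature approximation of Eq. (1) together with the loss of Eq. (2). A common discretization, used by NeRF [27], takes samples t_1 < … < t_S along each ray with δ_i = t_{i+1} − t_i and composites Ĉ(r) ≈ Σ_i T_i (1 − exp(−σ_i δ_i)) c_i with T_i = exp(−Σ_{j<i} σ_j δ_j). Below is a minimal sketch of that compositing and of the rendering loss, assuming PyTorch tensors; it is an illustration under those assumptions, not the exact implementation used in the paper.

import torch

def composite_rays(sigma, rgb, t_vals):
    """Quadrature for Eq. (1): sigma (R, S) and rgb (R, S, 3) are the network outputs
    at depths t_vals (R, S) sampled between the bounds [tn, tf] along each of R rays."""
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)   # pad the last interval
    alpha = 1.0 - torch.exp(-sigma * deltas)                                     # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]  # accumulated T_i
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                              # estimated pixel colors (R, 3)

def rendering_loss(pred_rgb, gt_rgb):
    """Eq. (2): squared L2 error accumulated over the rays R(P) of the target view."""
    return ((pred_rgb - gt_rgb) ** 2).sum()

A simple way to obtain t_vals for this sketch is to space S samples between tn and tf per ray (e.g. with torch.linspace), optionally with stratified jitter.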
recognition, as well as domain transfer to real car photos
(Figure 5 column labels, repeated across three panels: Input, SoftRas, DVR, SRN, Ours, GT)
Figure 5: Category-agnostic single-view reconstruction. Going beyond the SRN benchmark, we train a single model to the 13 largest
ShapeNet categories; we find that our approach produces superior visual results compared to a series of strong baselines. In particular,
the model recovers fine detail and thin structure more effectively, even for outlier shapes. Quite visibly, images on monitors and tabletop
textures are accurately reproduced; baselines representing the scene as a single latent vector cannot preserve such details of the input image.
SRN’s test-time latent inversion becomes less reliable as well in this setting. The corresponding quantitative evaluations are available in
Table 4. Due to space constraints, we show objects with interesting properties here. Please see the supplemental for sampled results.
                  plane   bench   cbnt.   car     chair   disp.   lamp    spkr.   rifle   sofa    table   phone   boat    mean
↑ PSNR    DVR     25.29   22.64   24.47   23.95   19.91   20.86   23.27   20.78   23.44   23.35   21.53   24.18   25.09   22.70
          SRN     26.62   22.20   23.42   24.40   21.85   19.07   22.17   21.04   24.95   23.65   22.45   20.87   25.86   23.28
          Ours    29.76   26.35   27.72   27.58   23.84   24.22   28.58   24.44   30.60   26.94   25.59   27.13   29.18   26.80
↑ SSIM    DVR     0.905   0.866   0.877   0.909   0.787   0.814   0.849   0.798   0.916   0.868   0.840   0.892   0.902   0.860
          SRN     0.901   0.837   0.831   0.897   0.814   0.744   0.801   0.779   0.913   0.851   0.828   0.811   0.898   0.849
          Ours    0.947   0.911   0.910   0.942   0.858   0.867   0.913   0.855   0.968   0.908   0.898   0.922   0.939   0.910
↓ LPIPS   DVR     0.095   0.129   0.125   0.098   0.173   0.150   0.172   0.170   0.094   0.119   0.139   0.110   0.116   0.130
          SRN     0.111   0.150   0.147   0.115   0.152   0.197   0.210   0.178   0.111   0.129   0.135   0.165   0.134   0.139
          Ours    0.084   0.116   0.105   0.095   0.146   0.129   0.114   0.141   0.066   0.116   0.098   0.097   0.111   0.108
Table 4: Category-agnostic single-view reconstruction. Quantitative results for category-agnostic view-synthesis are presented, with a
detailed breakdown by category. Our method outperforms the state-of-the-art by significant margins in all categories.
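For reference, the PSNR column can be computed directly from a rendered view and its ground truth as below (images assumed to be float arrays in [0, 1]); SSIM [55] and LPIPS [52] are normally taken from their reference implementations rather than re-derived. This is a generic sketch, not code from the paper.

import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of identical shape."""
    mse = np.mean((np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)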
chairs and 3514 cars with a predefined split across object instances. All images have resolution 128 × 128.

A single model is trained for each object class with 50 random views per object instance, randomly sampling either one or two of the training views to encode. For testing, we use 251 novel views on an Archimedean spiral for each object in the test set of object instances, fixing 1-2 informative views as input. We report our performance in comparison with state-of-the-art baselines in Table 2, and show selected qualitative results in Fig. 4. We also include the quantitative results of baselines TCO [40] and dGQN [9] reported in [39] where applicable, and the values available in the recent works ENR [8] and GRF [43] in this setting.

PixelNeRF achieves noticeably superior results despite solving a problem significantly harder than SRN because we: 1) use feed-forward prediction, without test-time optimization, 2) do not use ground-truth absolute camera poses at test-time, 3) use view instead of canonical space.

Ablations. In Table 3, we show the benefit of using local features and view directions in our model for this category-specific setting. Conditioning the NeRF network on pixel-aligned local features instead of a global code (−Local vs Full) improves performance significantly, for both single and two-view settings. Providing view directions (−Dirs vs Full) also provides a significant boost. For these ablations, we follow an abbreviated evaluation protocol on ShapeNet chairs, using 25 novel views on the Archimedean spiral.

5.1.2 Category-agnostic Object Prior

While we found appreciable improvements over baselines in the simplest category-specific benchmark, our method is by no means constrained to it. We show in Table 4 and Fig. 5 that our approach offers a much greater advantage in the category-agnostic setting of [21, 28], where we train a single model to the 13 largest categories of ShapeNet. Please see the supplemental for randomly sampled results.

We follow community standards for 2D-supervised methods on multiple ShapeNet categories [28, 16, 21] and use the renderings and splits from Kato et al. [16], which provide 24 fixed elevation views of 64 × 64 resolution for each object instance. During both training and evaluation, a random view is selected as the input view for each object and shared across all baselines. The remaining 23 views are used as target views for computing metrics (see § 5).

5.2. Pushing the Boundaries of ShapeNet

Taking a step towards reconstruction in less controlled capture scenarios, we perform experiments on ShapeNet data in three more challenging setups: 1) unseen object categories, 2) multiple-object scenes, and 3) simulation-to-real
transfer on car images. In these settings, successful reconstruction requires geometric priors; recognition or retrieval alone is not sufficient.

Generalization to novel categories. We first aim to reconstruct ShapeNet categories which were not seen in training. Unlike the more standard category-agnostic task described in the previous section, such generalization is impossible with semantic information alone. The results in Table 5 and Fig. 6 suggest our method learns intrinsic geometric and appearance priors which are fairly effective even for objects quite distinct from those seen during training.

We loosely follow the protocol used for zero-shot cross-category reconstruction from [53, ?]. Note that our baselines [39, 28] do not evaluate in this setting, and we adapt them for the sake of comparison. We train on the airplane, car, and chair categories and test on 10 categories unseen during training, continuing to use the Kato et al. renderings described in § 5.1.2.

Multiple-object scenes. We further perform few-shot 360° reconstruction for scenes with multiple randomly placed and oriented ShapeNet chairs. In this setting, the network cannot rely solely on semantic cues for correct object placement and completion. The priors learned by the network must be applicable in an arbitrary coordinate system. We show in Fig. 7 and Table 5 that our formulation allows us to perform well on these simple scenes without additional design modifications. In contrast, SRN models scenes in a canonical space and struggles on held-out scenes.

We generate training images composed with 20 views randomly sampled on the hemisphere and render test images composed of a held-out test set of chair instances, with 50 views sampled on an Archimedean spiral. During training, we randomly encode two input views; at test time, we fix two informative views across the compared methods. In the supplemental, we provide example images from our dataset as well as additional quantitative results and qualitative comparisons with varying numbers of input views.

Sim2Real on Cars. We also explore the performance of pixelNeRF on real images from the Stanford cars dataset [18]. We directly apply the car model from § 5.1.1 without any fine-tuning. As seen in Fig. 8, the network trained on synthetic data effectively infers shape and texture of the real cars, suggesting our model can transfer beyond the synthetic domain.

Synthesizing the 360° background from a single view is nontrivial and out of the scope of this work. For this demonstration, the off-the-shelf PointRend [17] segmentation model is used to remove the background.

Figure 6: Generalization to unseen categories. We evaluate a model trained on planes, cars, and chairs on 10 unseen ShapeNet categories. We find that the model is able to synthesize reasonable views even in this difficult case.

Figure 7: 360° view prediction with multiple objects. We show qualitative results of our method compared with SRN on scenes composed of multiple ShapeNet chairs. We are easily able to handle this setting, because our prediction is done in view space; in contrast, SRN predicts in canonical space, and struggles with scenes that cannot be aligned in such a way.

               Unseen category                 Multiple chairs
        ↑ PSNR   ↑ SSIM   ↓ LPIPS      ↑ PSNR   ↑ SSIM   ↓ LPIPS
DVR      17.72    0.716    0.240          -        -        -
SRN      18.71    0.684    0.280        14.67    0.664    0.431
Ours     22.71    0.825    0.182        23.40    0.832    0.207

Table 5: Image quality metrics for challenging ShapeNet tasks. (Left) Average metrics on 10 unseen categories for models trained on only planes, cars, and chairs. See the supplemental for a breakdown by category. (Right) Average metrics for two-view reconstruction for scenes with multiple ShapeNet chairs.

Figure 8: Results on real car photos. We apply the car model from § 5.1.1 directly to images from the Stanford cars dataset [18]. The background has been masked out using PointRend [17]. The views are rotations about the view-space vertical axis.

5.3. Scene Prior on Real Images

Finally, we demonstrate that our method is applicable for few-shot wide baseline novel-view synthesis on real scenes in the DTU MVS dataset [14]. Learning a prior for view synthesis on this dataset poses significant challenges: not only does it consist of more complex scenes, without clear semantic similarities across scenes, it also contains inconsistent backgrounds and lighting between scenes.
(Figure 9 panel labels: Input: 3 views of held-out scene; Novel views; NeRF)
Figure 9: Wide baseline novel-view synthesis on a real image dataset. We train our model on distinct scenes in the DTU MVS
dataset [14]. Perhaps surprisingly, even in this case, our model is able to infer novel views with reasonable quality for held-out scenes
without further test-time optimization, all from only three views. Note the train/test sets share no overlapping scenes.
Moreover, under 100 scenes are available for training. We found that the standard data split introduced in MVSNet [51] contains overlap between scenes of the training and test sets. Therefore, for our purposes, we use a different split of 88 training scenes and 15 test scenes, in which there are no shared or highly similar scenes between the two sets. Images are down-sampled to a resolution of 400 × 300.

We train one model across all training scenes by encoding 3 random views of a scene. During test time, we choose a set of fixed informative input views shared across all instances. We show in Fig. 9 that our method can perform view synthesis on the held-out test scenes. We further quantitatively compare the performance of our feed-forward model with NeRF optimized to the same set of input views in Fig. 10. Note that training each of the 60 NeRFs took 14 hours; in contrast, pixelNeRF is applied to new scenes immediately without any test-time optimization.

Figure 10: PSNR of few-shot feed-forward DTU reconstruction. We show the quantiles of PSNR on DTU for our method and NeRF, given 1, 3, 6, or 9 input views. Separate NeRFs are trained per scene and number of input views, while our method requires only a single model trained with 3 encoded views.

6. Discussion

We have presented pixelNeRF, a framework to learn a scene prior for reconstructing NeRFs from one or a few images. Through extensive experiments, we have established that our approach can be successfully applied in a variety of settings. We addressed some shortcomings of NeRF, but there are challenges yet to be explored: 1) Like NeRF, our rendering time is slow, and in fact, our runtime increases linearly when given more input views. Further, some methods (e.g. [28, 21]) can recover a mesh from the image, enabling fast rendering and manipulation afterwards, while NeRF-based representations cannot be converted to meshes very reliably. Improving NeRF's efficiency is an important research question that can enable real-time applications. 2) As in the vanilla NeRF, we manually tune ray sampling bounds tn, tf and a scale for the positional encoding. Making NeRF-related methods scale-invariant is a crucial challenge. 3) While we have demonstrated our method on real data from the DTU dataset, we acknowledge that this dataset was captured under controlled settings and has matching camera poses across all scenes with limited viewpoints. Ultimately, our approach is bottlenecked by the availability of large-scale wide baseline multi-view datasets, limiting the applicability to datasets such as ShapeNet and DTU. Learning a general prior for 360° scenes in-the-wild is an exciting direction for future work.

Acknowledgements

We thank Shubham Goel and Hang Gao for comments on the text. We also thank Emilien Dupont and Vincent Sitzmann for helpful discussions.
References

[1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In ICCV, pages 72–79, 2009.
[2] Miguel Angel Bautista, Walter Talbott, Shuangfei Zhai, Nitish Srivastava, and Joshua M. Susskind. On the generalization of learning-based 3d reconstruction. In WACV, pages 2180–2189, January 2021.
[3] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In SIGGRAPH, pages 425–432, 2001.
[4] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], 2015.
[5] Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In ICCV, pages 4090–4100, 2019.
[6] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In CVPR, pages 5939–5948, 2019.
[7] Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, and Bing Zeng. Neural point cloud rendering via multi-plane projection. In CVPR, pages 7830–7839, 2020.
[8] Emilien Dupont, Miguel Angel Bautista, Alex Colburn, Aditya Sankar, Carlos Guestrin, Joshua Susskind, and Qi Shan. Equivariant neural rendering. In ICML, 2020.
[9] S. Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari Morcos, Marta Garnelo, Avraham Ruderman, Andrei Rusu, Ivo Danihelka, Karol Gregor, David Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, and Demis Hassabis. Neural scene representation and rendering. Science, 360:1204–1210, 06 2018.
[10] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deep stereo: Learning to predict new views from the world's imagery. In CVPR, pages 5515–5524, 2016.
[11] Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph. In SIGGRAPH, pages 43–54, 1996.
[12] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In CVPR, 2018.
[13] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
[14] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, pages 406–413, 2014.
[15] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In NeurIPS, 2017.
[16] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In CVPR, 2018.
[17] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. In CVPR, 2020.
[18] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[19] Marc Levoy and Pat Hanrahan. Light field rendering. In SIGGRAPH, pages 31–42, 1996.
[20] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural Sparse Voxel Fields. In NeurIPS, 2020.
[21] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In ICCV, 2019.
[22] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3d supervision. In NeurIPS, 2019.
[23] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1–65:14, July 2019.
[24] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021.
[25] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.
[26] Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. In CVPR, 2019.
[27] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[28] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In CVPR, 2020.
[29] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, June 2019.
[30] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[31] Stefan Popov, Pablo Bauszat, and Vittorio Ferrari. CoReNet: Coherent 3d scene reconstruction from a single rgb image. In ECCV, 2020.
[32] Gernot Riegler and Vladlen Koltun. Free view synthesis. In ECCV, pages 623–640, 2020.
[33] S. Saito, Z. Huang, R. Natsume, S. Morishima, H. Li, and A. Kanazawa. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, pages 2304–2314, 2019.
[34] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[35] Jonathan Shade, Steven Gortler, Li-wei He, and Richard Szeliski. Layered depth images. In SIGGRAPH, pages 231–242, 1998.
[36] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In CVPR, 2020.
[37] Daeyun Shin, Charless Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape representations for single view 3d object shape prediction. In CVPR, 2018.
[38] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning persistent 3d feature embeddings. In CVPR, 2019.
[39] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019.
[40] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Single-view to multi-view: Reconstructing unseen views with a convolutional network. CoRR abs/1511.06702, 1(2):2, 2015.
[41] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In CVPR, pages 3405–3414, 2019.
[42] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures, 2019.
[43] Alex Trevithick and Bo Yang. GRF: Learning a general radiance field for 3d scene representation and rendering. arXiv preprint arXiv:2010.04595, 2020.
[44] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In CVPR, 2020.
[45] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
[46] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
[47] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. SynSin: End-to-end view synthesis from a single image. In CVPR, 2020.
[48] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Interpretable transformations with encoder-decoder networks. In ICCV, pages 5737–5746, 2017.
[49] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomír Mech, and Ulrich Neumann. DISN: Deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, pages 490–500, 2019.
[50] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016.
[51] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[52] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[53] Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B Tenenbaum, William T Freeman, and Jiajun Wu. Learning to Reconstruct Shapes from Unseen Classes. In NeurIPS, 2018.
[54] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In ECCV, pages 286–301, 2016.
[55] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004.