pixelNeRF: Neural Radiance Fields from One or Few Images
Figure 1: NeRF from one or few images. We present pixelNeRF, a learning framework that predicts a Neural Radiance Field (NeRF) representation from a single (top) or few posed images (bottom). PixelNeRF can be trained on a set of multi-view images, allowing it to perform plausible novel view synthesis from very few input images without test-time optimization (bottom left). In contrast, NeRF has no generalization capabilities and performs poorly when only three input views are available (bottom right).
Abstract

We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields [27] involves optimizing the representation for every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by applying it to multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website: https://alexyu.net/pixelnerf.

1. Introduction

We study the problem of synthesizing novel views of a scene from a sparse set of input views. This long-standing problem has recently seen progress due to advances in differentiable neural rendering [27, 20, 24, 39]. Across these approaches, a 3D scene is represented with a neural network, which can then be rendered into 2D views. Notably, the recent method neural radiance fields (NeRF) [27] has shown impressive performance on novel view synthesis of a specific scene by implicitly encoding volumetric density and color through a neural network. While NeRF can render photorealistic novel views, it is often impractical as it requires a large number of posed images and a lengthy per-scene optimization.

In this paper, we address these shortcomings by proposing pixelNeRF, a learning framework that enables predicting NeRFs from one or several images in a feed-forward manner. Unlike the original NeRF network, which does not make use of any image features, pixelNeRF takes spatial image features aligned to each pixel as an input. This image conditioning allows the framework to be trained on a set of multi-view images, where it can learn scene priors to perform view synthesis from one or few input views. In contrast, NeRF is unable to generalize and performs poorly when few input images are available, as shown in Fig. 1.
Specifically, we condition NeRF on input images by first computing a fully convolutional image feature grid from the input image. Then, for each query spatial point x and viewing direction d of interest in the view coordinate frame, we sample the corresponding image feature via projection and bilinear interpolation. The query specification is sent along with the image features to the NeRF network that outputs density and color, where the spatial image features are fed to each layer as a residual. When more than one image is available, the inputs are first encoded into latent representations in each camera's coordinate frame, which are then pooled in an intermediate layer prior to predicting the color and density. The model is supervised with a reconstruction loss between a ground truth image and a view rendered using conventional volume rendering techniques. This framework is illustrated in Fig. 2.

                      NeRF   DISN    ONet     DVR      SRN   Ours
Learns scene prior?    ✗      ✓       ✓        ✓        ✓     ✓
Supervision            2D     3D      3D       2D       2D    2D
Image features         ✗     Local   Global   Global    ✗    Local
Allows multi-view?     ✓      ✓       ✗        ✗        ✓     ✓
View space?            -      ✗       ✗        ✗        ✗     ✓

Table 1: A comparison with prior works reconstructing neural scene representations. The proposed approach learns a scene prior for one or few-view reconstruction using only multi-view 2D image supervision. Unlike previous methods in this regime, we do not require a consistent canonical space across the training corpus. Moreover, we incorporate local image features to preserve local information, in contrast to methods that compress the structure and appearance into a single latent vector, such as Occupancy Networks (ONet) [25] and DVR [28].
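The per-point conditioning described above can be sketched in a few lines. The following is a minimal illustration in PyTorch; the intrinsics-based projection for π, the layer sizes, and the module names are assumptions made for the sketch, not the exact pixelNeRF implementation. In the multi-view case, the same sampling would be repeated per input view before the intermediate pooling mentioned above.

import torch
import torch.nn.functional as F

def sample_pixel_features(feat_grid, x_cam, K):
    """Project camera-space points x_cam (B, N, 3) with pinhole intrinsics K (B, 3, 3)
    and bilinearly sample the pixel-aligned feature grid feat_grid (B, C, H, W)."""
    uv = torch.einsum('bij,bnj->bni', K, x_cam)            # homogeneous pixel coordinates
    uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-8)        # perspective divide -> (u, v)
    H, W = feat_grid.shape[-2:]
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,      # normalize to [-1, 1] for grid_sample
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    feats = F.grid_sample(feat_grid, grid.unsqueeze(2),    # (B, C, N, 1)
                          mode='bilinear', align_corners=True)
    return feats.squeeze(-1).transpose(1, 2)               # (B, N, C), one feature per query point

class ConditionedNeRF(torch.nn.Module):
    """Tiny NeRF-style MLP in which the sampled image feature enters every layer as a residual."""
    def __init__(self, feat_dim=512, hidden=256, n_layers=4):
        super().__init__()
        self.inp = torch.nn.Linear(6, hidden)              # raw (x, d); positional encoding omitted for brevity
        self.feat_proj = torch.nn.ModuleList(
            [torch.nn.Linear(feat_dim, hidden) for _ in range(n_layers)])
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.out = torch.nn.Linear(hidden, 4)               # (sigma, r, g, b)

    def forward(self, x, d, feat):
        h = torch.relu(self.inp(torch.cat([x, d], dim=-1)))
        for layer, proj in zip(self.layers, self.feat_proj):
            h = torch.relu(layer(h) + proj(feat))            # residual image-feature conditioning at each layer
        sigma_rgb = self.out(h)
        return torch.relu(sigma_rgb[..., :1]), torch.sigmoid(sigma_rgb[..., 1:])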
PixelNeRF has many desirable properties for few-view novel-view synthesis. First, pixelNeRF can be trained on a dataset of multi-view images without additional supervision such as ground truth 3D shape or object masks. Second, pixelNeRF predicts a NeRF representation in the camera coordinate system of the input image instead of a canonical coordinate frame. This is not only integral for generalization to unseen scenes and object categories [41, 37], but also for flexibility, since no clear canonical coordinate system exists on scenes with multiple objects or real scenes. Third, it is fully convolutional, allowing it to preserve the spatial alignment between the image and the output 3D representation. Lastly, pixelNeRF can incorporate a variable number of posed input views at test time without requiring any test-time optimization.

We conduct an extensive series of experiments on synthetic and real image datasets to evaluate the efficacy of our framework, going beyond the usual set of ShapeNet experiments to demonstrate its flexibility. Our experiments show that pixelNeRF can generate novel views from a single image input for both category-specific and category-agnostic settings, even in the case of unseen object categories. Further, we test the flexibility of our framework, both with a new multi-object benchmark for ShapeNet, where pixelNeRF outperforms prior approaches, and with a simulation-to-real transfer demonstration on real car images. Lastly, we test the capabilities of pixelNeRF on real images using the DTU dataset [14], where, despite being trained on under 100 scenes, it can generate plausible novel views of a real scene from three posed input views.

2. Related Work

Novel View Synthesis. The long-standing problem of novel view synthesis entails constructing new views of a scene from a set of input views. Early work achieved photorealistic results but required densely captured views of the scene [19, 11]. Recent work has made rapid progress toward photorealism for both wider ranges of novel views and sparser sets of input views, by using 3D representations based on neural networks [27, 23, 26, 38, 42, 7]. However, because these approaches fit a single model to each scene, they require many input views and substantial optimization time per scene.

There are methods that can predict novel views from few input views or even single images by learning shared priors across scenes. Methods in the tradition of [35, 3] use depth-guided image interpolation [54, 10, 32]. More recently, the problem of predicting novel views from a single image has been explored [44, 47, 36, 5]. However, these methods employ 2.5D representations, and are therefore limited in the range of camera motions they can synthesize. In this work we infer a 3D volumetric NeRF representation, which allows novel view synthesis from larger baselines.

Sitzmann et al. [39] introduce a representation based on a continuous 3D feature space to learn a prior across scene instances. However, using the learned prior at test time requires further optimization with known absolute camera poses. In contrast, our approach is completely feed-forward and only requires relative camera poses. We offer extensive comparisons with this approach to demonstrate the advantages our design affords. Lastly, note that the concurrent work GRF [43] also adds image features to NeRF. A key difference is that we operate in view rather than canonical space, which makes our approach applicable in more general settings. Moreover, we extensively demonstrate our method's performance in few-shot view synthesis, while GRF shows very limited quantitative results for this task.

Learning-based 3D reconstruction. Advances in deep learning have led to rapid progress in single-view or multi-view 3D reconstruction. Many approaches [15, 12, 46, 53, 38, 33, 49, 25, 31] propose learning frameworks with various 3D representations that require ground-truth 3D models for supervision.
Multi-view supervision [50, 45, 21, 22, 39, 28, 8, 2] is less restrictive and more ecologically plausible. However, many of these methods [50, 45, 21, 22, 28] require object masks; in contrast, pixelNeRF can be trained from images alone, allowing it to be applied to scenes of two objects without modification.

Most single-view 3D reconstruction methods condition neural 3D representations on input images. The majority employs global image features [29, 6, 28, 25, 8], which, while memory efficient, cannot preserve details that are present in the image and often lead to retrieval-like results. Spatially-aligned local image features have been shown to achieve detailed reconstructions from a single view [49, 33]. However, both of these methods require 3D supervision. Our method is inspired by these approaches, but only requires multi-view supervision.

Within existing methods, the types of scenes that can be reconstructed are limited, particularly so for object-centric approaches (e.g. [46, 21, 12, 45, 38, 53, 25, 49, 28]). CoReNet [31] reconstructs scenes with multiple objects via a voxel grid with offsets, but it requires 3D supervision including the identity and placement of objects. In comparison, we formulate a scene-level learning framework that can in principle be trained on scenes of arbitrary structure.

Viewer-centric 3D reconstruction. For the 3D learning task, prediction can be done either in a viewer-centered coordinate system, i.e. view space, or in an object-centered coordinate system, i.e. canonical space. Most existing methods [49, 25, 28, 39] predict in canonical space, where all objects of a semantic category are aligned to a consistent orientation. While this makes learning spatial regularities easier, using a canonical space inhibits prediction performance on unseen object categories and scenes with more than one object, where there is no pre-defined or well-defined canonical pose. PixelNeRF operates in view space, which has been shown to allow better reconstruction of unseen object categories [37, 2] and to discourage memorization of the training set [41]. We summarize key aspects of our approach relative to prior work in Table 1.

3. Background: NeRF

We first briefly review the NeRF representation [27]. A NeRF encodes a scene as a continuous volumetric radiance field f of color and density. Specifically, for a 3D point x ∈ R³ and viewing direction unit vector d ∈ R³, f returns a differential density σ and RGB color c: f(x, d) = (σ, c). The volumetric radiance field can then be rendered into a 2D image via

    Ĉ(r) = ∫_{tn}^{tf} T(t) σ(t) c(t) dt,    (1)

where T(t) = exp(−∫_{tn}^{t} σ(s) ds) handles occlusion. For a target view with pose P, a camera ray can be parameterized as r(t) = o + t d, with the ray origin (camera center) o ∈ R³ and ray unit direction vector d ∈ R³. The integral is computed along r between pre-defined depth bounds [tn, tf]. In practice, this integral is approximated with numerical quadrature by sampling points along each pixel ray.

The rendered pixel value for camera ray r can then be compared against the corresponding ground truth pixel value, C(r), for all the camera rays of the target view with pose P. The NeRF rendering loss is thus given by

    L = Σ_{r ∈ R(P)} ‖Ĉ(r) − C(r)‖₂²,    (2)

where R(P) is the set of all camera rays of target pose P.

Limitations. While NeRF achieves state-of-the-art novel view synthesis results, it is an optimization-based approach using geometric consistency as the sole signal, similar to classical multi-view stereo methods [1, 34]. As such, each scene must be optimized individually, with no knowledge shared between scenes. Not only is this time-consuming, but in the limit of single or extremely sparse views, it is unable to make use of any prior knowledge of the world to accelerate reconstruction or for shape completion.

4. Image-conditioned NeRF

To overcome the NeRF representation's inability to share knowledge between scenes, we propose an architecture to condition a NeRF on spatial image features. Our model is comprised of two components: a fully-convolutional image encoder E, which encodes the input image into a pixel-aligned feature grid, and a NeRF network f which outputs color and density, given a spatial location and its corresponding encoded feature. We choose to model the spatial query in the input view's camera space, rather than a canonical space, for the reasons discussed in § 2. We validate this design choice in our experiments on unseen object categories (§ 5.2) and complex unseen scenes (§ 5.3). The model is trained with the volume rendering method and loss described in § 3.

In the following, we first present our model for the single-view case. We then show how this formulation can be easily extended to incorporate multiple input images.

4.1. Single-Image pixelNeRF

We now describe our approach to render novel views from one input image. We fix our coordinate system as the view space of the input image and specify positions and camera rays in this coordinate system.

Given an input image I of a scene, we first extract a feature volume W = E(I). Then, for a point on a camera ray x, we retrieve the corresponding image feature by projecting x onto the image plane to the image coordinates π(x) using
Figure 2: Proposed architecture in the single-view case. For a query point x along a target camera ray with view direction d, a
corresponding image feature is extracted from the feature volume W via projection and interpolation. This feature is then passed into the
NeRF network f along with the spatial coordinates. The output RGB and density value is volume-rendered and compared with the target
pixel value. The coordinates x and d are in the camera coordinate system of the input view.
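The volume-rendering step shown in Fig. 2 is the quadrature approximation of Eq. (1) together with the loss of Eq. (2). A common discretization, used by NeRF [27], takes samples t_1 < … < t_S along each ray with δ_i = t_{i+1} − t_i and composites Ĉ(r) ≈ Σ_i T_i (1 − exp(−σ_i δ_i)) c_i with T_i = exp(−Σ_{j<i} σ_j δ_j). Below is a minimal sketch of that compositing and of the rendering loss, assuming PyTorch tensors; it is an illustration under those assumptions, not the exact implementation used in the paper.

import torch

def composite_rays(sigma, rgb, t_vals):
    """Quadrature for Eq. (1): sigma (R, S) and rgb (R, S, 3) are the network outputs
    at depths t_vals (R, S) sampled between the bounds [tn, tf] along each of R rays."""
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)   # pad the last interval
    alpha = 1.0 - torch.exp(-sigma * deltas)                                     # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]  # accumulated T_i
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                              # estimated pixel colors (R, 3)

def rendering_loss(pred_rgb, gt_rgb):
    """Eq. (2): squared L2 error accumulated over the rays R(P) of the target view."""
    return ((pred_rgb - gt_rgb) ** 2).sum()

A simple way to obtain t_vals for this sketch is to space S samples between tn and tf per ray (e.g. with torch.linspace), optionally with stratified jitter.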
recognition, as well as domain transfer to real car photos
(Figure 5 column labels, repeated across three panels: Input, SoftRas, DVR, SRN, Ours, GT)
Figure 5: Category-agnostic single-view reconstruction. Going beyond the SRN benchmark, we train a single model to the 13 largest
ShapeNet categories; we find that our approach produces superior visual results compared to a series of strong baselines. In particular,
the model recovers fine detail and thin structure more effectively, even for outlier shapes. Quite visibly, images on monitors and tabletop
textures are accurately reproduced; baselines representing the scene as a single latent vector cannot preserve such details of the input image.
SRN’s test-time latent inversion becomes less reliable as well in this setting. The corresponding quantitative evaluations are available in
Table 4. Due to space constraints, we show objects with interesting properties here. Please see the supplemental for sampled results.
                  plane   bench   cbnt.   car     chair   disp.   lamp    spkr.   rifle   sofa    table   phone   boat    mean
↑ PSNR    DVR     25.29   22.64   24.47   23.95   19.91   20.86   23.27   20.78   23.44   23.35   21.53   24.18   25.09   22.70
          SRN     26.62   22.20   23.42   24.40   21.85   19.07   22.17   21.04   24.95   23.65   22.45   20.87   25.86   23.28
          Ours    29.76   26.35   27.72   27.58   23.84   24.22   28.58   24.44   30.60   26.94   25.59   27.13   29.18   26.80
↑ SSIM    DVR     0.905   0.866   0.877   0.909   0.787   0.814   0.849   0.798   0.916   0.868   0.840   0.892   0.902   0.860
          SRN     0.901   0.837   0.831   0.897   0.814   0.744   0.801   0.779   0.913   0.851   0.828   0.811   0.898   0.849
          Ours    0.947   0.911   0.910   0.942   0.858   0.867   0.913   0.855   0.968   0.908   0.898   0.922   0.939   0.910
↓ LPIPS   DVR     0.095   0.129   0.125   0.098   0.173   0.150   0.172   0.170   0.094   0.119   0.139   0.110   0.116   0.130
          SRN     0.111   0.150   0.147   0.115   0.152   0.197   0.210   0.178   0.111   0.129   0.135   0.165   0.134   0.139
          Ours    0.084   0.116   0.105   0.095   0.146   0.129   0.114   0.141   0.066   0.116   0.098   0.097   0.111   0.108
Table 4: Category-agnostic single-view reconstruction. Quantitative results for category-agnostic view-synthesis are presented, with a
detailed breakdown by category. Our method outperforms the state-of-the-art by significant margins in all categories.
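For reference, the PSNR column can be computed directly from a rendered view and its ground truth as below (images assumed to be float arrays in [0, 1]); SSIM [55] and LPIPS [52] are normally taken from their reference implementations rather than re-derived. This is a generic sketch, not code from the paper.

import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of identical shape."""
    mse = np.mean((np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)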
chairs and 3514 cars with a predefined split across object instances. All images have resolution 128 × 128.

A single model is trained for each object class with 50 random views per object instance, randomly sampling either one or two of the training views to encode. For testing, we use 251 novel views on an Archimedean spiral for each object in the test set of object instances, fixing 1-2 informative views as input. We report our performance in comparison with state-of-the-art baselines in Table 2, and show selected qualitative results in Fig. 4. We also include the quantitative results of baselines TCO [40] and dGQN [9] reported in [39] where applicable, and the values available in the recent works ENR [8] and GRF [43] in this setting.

PixelNeRF achieves noticeably superior results despite solving a problem significantly harder than SRN because we: 1) use feed-forward prediction, without test-time optimization, 2) do not use ground-truth absolute camera poses at test-time, 3) use view instead of canonical space.

Ablations. In Table 3, we show the benefit of using local features and view directions in our model for this category-specific setting. Conditioning the NeRF network on pixel-aligned local features instead of a global code (−Local vs Full) improves performance significantly, for both single and two-view settings. Providing view directions (−Dirs vs Full) also provides a significant boost. For these ablations, we follow an abbreviated evaluation protocol on ShapeNet chairs, using 25 novel views on the Archimedean spiral.

5.1.2 Category-agnostic Object Prior

While we found appreciable improvements over baselines in the simplest category-specific benchmark, our method is by no means constrained to it. We show in Table 4 and Fig. 5 that our approach offers a much greater advantage in the category-agnostic setting of [21, 28], where we train a single model to the 13 largest categories of ShapeNet. Please see the supplemental for randomly sampled results.

We follow community standards for 2D-supervised methods on multiple ShapeNet categories [28, 16, 21] and use the renderings and splits from Kato et al. [16], which provide 24 fixed elevation views of 64 × 64 resolution for each object instance. During both training and evaluation, a random view is selected as the input view for each object and shared across all baselines. The remaining 23 views are used as target views for computing metrics (see § 5).

5.2. Pushing the Boundaries of ShapeNet

Taking a step towards reconstruction in less controlled capture scenarios, we perform experiments on ShapeNet data in three more challenging setups: 1) unseen object categories, 2) multiple-object scenes, and 3) simulation-to-real
transfer on car images. In these settings, successful reconstruction requires geometric priors; recognition or retrieval alone is not sufficient.

Generalization to novel categories. We first aim to reconstruct ShapeNet categories which were not seen in training. Unlike the more standard category-agnostic task described in the previous section, such generalization is impossible with semantic information alone. The results in Table 5 and Fig. 6 suggest our method learns intrinsic geometric and appearance priors which are fairly effective even for objects quite distinct from those seen during training.

We loosely follow the protocol used for zero-shot cross-category reconstruction from [53, ?]. Note that our baselines [39, 28] do not evaluate in this setting, and we adapt them for the sake of comparison. We train on the airplane, car, and chair categories and test on 10 categories unseen during training, continuing to use the Kato et al. renderings described in § 5.1.2.

Multiple-object scenes. We further perform few-shot 360° reconstruction for scenes with multiple randomly placed and oriented ShapeNet chairs. In this setting, the network cannot rely solely on semantic cues for correct object placement and completion. The priors learned by the network must be applicable in an arbitrary coordinate system. We show in Fig. 7 and Table 5 that our formulation allows us to perform well on these simple scenes without additional design modifications. In contrast, SRN models scenes in a canonical space and struggles on held-out scenes.

We generate training images composed with 20 views randomly sampled on the hemisphere and render test images composed of a held-out test set of chair instances, with 50 views sampled on an Archimedean spiral. During training, we randomly encode two input views; at test time, we fix two informative views across the compared methods. In the supplemental, we provide example images from our dataset as well as additional quantitative results and qualitative comparisons with varying numbers of input views.

Sim2Real on Cars. We also explore the performance of pixelNeRF on real images from the Stanford cars dataset [18]. We directly apply the car model from § 5.1.1 without any fine-tuning. As seen in Fig. 8, the network trained on synthetic data effectively infers shape and texture of the real cars, suggesting our model can transfer beyond the synthetic domain.

Synthesizing the 360° background from a single view is nontrivial and out of the scope of this work. For this demonstration, the off-the-shelf PointRend [17] segmentation model is used to remove the background.

Figure 6: Generalization to unseen categories. We evaluate a model trained on planes, cars, and chairs on 10 unseen ShapeNet categories. We find that the model is able to synthesize reasonable views even in this difficult case.

Figure 7: 360° view prediction with multiple objects. We show qualitative results of our method compared with SRN on scenes composed of multiple ShapeNet chairs. We are easily able to handle this setting, because our prediction is done in view space; in contrast, SRN predicts in canonical space, and struggles with scenes that cannot be aligned in such a way.

               Unseen category                 Multiple chairs
        ↑ PSNR   ↑ SSIM   ↓ LPIPS      ↑ PSNR   ↑ SSIM   ↓ LPIPS
DVR      17.72    0.716    0.240          -        -        -
SRN      18.71    0.684    0.280        14.67    0.664    0.431
Ours     22.71    0.825    0.182        23.40    0.832    0.207

Table 5: Image quality metrics for challenging ShapeNet tasks. (Left) Average metrics on 10 unseen categories for models trained on only planes, cars, and chairs. See the supplemental for a breakdown by category. (Right) Average metrics for two-view reconstruction for scenes with multiple ShapeNet chairs.

Figure 8: Results on real car photos. We apply the car model from § 5.1.1 directly to images from the Stanford cars dataset [18]. The background has been masked out using PointRend [17]. The views are rotations about the view-space vertical axis.

5.3. Scene Prior on Real Images

Finally, we demonstrate that our method is applicable for few-shot wide baseline novel-view synthesis on real scenes in the DTU MVS dataset [14]. Learning a prior for view synthesis on this dataset poses significant challenges: not only does it consist of more complex scenes, without clear semantic similarities across scenes, it also contains inconsistent backgrounds and lighting between scenes.
(Figure 9 panel labels: Input: 3 views of held-out scene; Novel views; NeRF)
Figure 9: Wide baseline novel-view synthesis on a real image dataset. We train our model on distinct scenes in the DTU MVS
dataset [14]. Perhaps surprisingly, even in this case, our model is able to infer novel views with reasonable quality for held-out scenes
without further test-time optimization, all from only three views. Note the train/test sets share no overlapping scenes.
Moreover, under 100 scenes are available for training. We found that the standard data split introduced in MVSNet [51] contains overlap between scenes of the training and test sets. Therefore, for our purposes, we use a different split of 88 training scenes and 15 test scenes, in which there are no shared or highly similar scenes between the two sets. Images are down-sampled to a resolution of 400 × 300.

We train one model across all training scenes by encoding 3 random views of a scene. During test time, we choose a set of fixed informative input views shared across all instances. We show in Fig. 9 that our method can perform view synthesis on the held-out test scenes. We further quantitatively compare the performance of our feed-forward model with NeRF optimized to the same set of input views in Fig. 10. Note that training each of the 60 NeRFs took 14 hours; in contrast, pixelNeRF is applied to new scenes immediately without any test-time optimization.

Figure 10: PSNR of few-shot feed-forward DTU reconstruction. We show the quantiles of PSNR on DTU for our method and NeRF, given 1, 3, 6, or 9 input views. Separate NeRFs are trained per scene and number of input views, while our method requires only a single model trained with 3 encoded views.

6. Discussion

We have presented pixelNeRF, a framework to learn a scene prior for reconstructing NeRFs from one or a few images. Through extensive experiments, we have established that our approach can be successfully applied in a variety of settings. We addressed some shortcomings of NeRF, but there are challenges yet to be explored: 1) Like NeRF, our rendering time is slow, and in fact, our runtime increases linearly when given more input views. Further, some methods (e.g. [28, 21]) can recover a mesh from the image, enabling fast rendering and manipulation afterwards, while NeRF-based representations cannot be converted to meshes very reliably. Improving NeRF's efficiency is an important research question that can enable real-time applications. 2) As in the vanilla NeRF, we manually tune ray sampling bounds tn, tf and a scale for the positional encoding. Making NeRF-related methods scale-invariant is a crucial challenge. 3) While we have demonstrated our method on real data from the DTU dataset, we acknowledge that this dataset was captured under controlled settings and has matching camera poses across all scenes with limited viewpoints. Ultimately, our approach is bottlenecked by the availability of large-scale wide baseline multi-view datasets, limiting the applicability to datasets such as ShapeNet and DTU. Learning a general prior for 360° scenes in-the-wild is an exciting direction for future work.

Acknowledgements

We thank Shubham Goel and Hang Gao for comments on the text. We also thank Emilien Dupont and Vincent Sitzmann for helpful discussions.
References

[1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In ICCV, pages 72–79, 2009.
[2] Miguel Angel Bautista, Walter Talbott, Shuangfei Zhai, Nitish Srivastava, and Joshua M. Susskind. On the generalization of learning-based 3d reconstruction. In WACV, pages 2180–2189, January 2021.
[3] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In SIGGRAPH, pages 425–432, 2001.
[4] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], 2015.
[5] Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In ICCV, pages 4090–4100, 2019.
[6] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In CVPR, pages 5939–5948, 2019.
[7] Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, and Bing Zeng. Neural point cloud rendering via multi-plane projection. In CVPR, pages 7830–7839, 2020.
[8] Emilien Dupont, Miguel Angel Bautista, Alex Colburn, Aditya Sankar, Carlos Guestrin, Joshua Susskind, and Qi Shan. Equivariant neural rendering. In ICML, 2020.
[9] S. Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari Morcos, Marta Garnelo, Avraham Ruderman, Andrei Rusu, Ivo Danihelka, Karol Gregor, David Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, and Demis Hassabis. Neural scene representation and rendering. Science, 360:1204–1210, 06 2018.
[10] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deep stereo: Learning to predict new views from the world's imagery. In CVPR, pages 5515–5524, 2016.
[11] Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph. In SIGGRAPH, pages 43–54, 1996.
[12] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In CVPR, 2018.
[13] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
[14] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, pages 406–413, 2014.
[15] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In NeurIPS, 2017.
[16] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In CVPR, 2018.
[17] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. In CVPR, 2020.
[18] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[19] Marc Levoy and Pat Hanrahan. Light field rendering. In SIGGRAPH, pages 31–42, 1996.
[20] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural Sparse Voxel Fields. In NeurIPS, 2020.
[21] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In ICCV, 2019.
[22] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3d supervision. In NeurIPS, 2019.
[23] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1–65:14, July 2019.
[24] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021.
[25] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.
[26] Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. In CVPR, 2019.
[27] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[28] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In CVPR, 2020.
[29] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, June 2019.
[30] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[31] Stefan Popov, Pablo Bauszat, and Vittorio Ferrari. CoReNet: Coherent 3d scene reconstruction from a single rgb image. In ECCV, 2020.
[32] Gernot Riegler and Vladlen Koltun. Free view synthesis. In ECCV, pages 623–640, 2020.
[33] S. Saito, Z. Huang, R. Natsume, S. Morishima, H. Li, and A. Kanazawa. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, pages 2304–2314, 2019.
[34] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[35] Jonathan Shade, Steven Gortler, Li-wei He, and Richard Szeliski. Layered depth images. In SIGGRAPH, pages 231–242, 1998.
[36] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In CVPR, 2020.
[37] Daeyun Shin, Charless Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape representations for single view 3d object shape prediction. In CVPR, 2018.
[38] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning persistent 3d feature embeddings. In CVPR, 2019.
[39] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019.
[40] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Single-view to multi-view: Reconstructing unseen views with a convolutional network. CoRR abs/1511.06702, 1(2):2, 2015.
[41] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In CVPR, pages 3405–3414, 2019.
[42] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures, 2019.
[43] Alex Trevithick and Bo Yang. GRF: Learning a general radiance field for 3d scene representation and rendering. arXiv preprint arXiv:2010.04595, 2020.
[44] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In CVPR, 2020.
[45] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
[46] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
[47] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. SynSin: End-to-end view synthesis from a single image. In CVPR, 2020.
[48] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Interpretable transformations with encoder-decoder networks. In ICCV, pages 5737–5746, 2017.
[49] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomír Mech, and Ulrich Neumann. DISN: Deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, pages 490–500, 2019.
[50] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016.
[51] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[52] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[53] Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B Tenenbaum, William T Freeman, and Jiajun Wu. Learning to Reconstruct Shapes from Unseen Classes. In NeurIPS, 2018.
[54] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In ECCV, pages 286–301, 2016.
[55] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004.