InNeRF: Learning Interpretable Radiance Fields for Generalizable 3D Scene Representation and Rendering

Anonymous Authors

ABSTRACT
We propose Interpretable Neural Radiance Fields (InNeRF) for generalizable 3D scene representation and rendering. In contrast to previous image-based rendering, which relies on two independent processes of pooling-based fusion and MLP-based rendering, our framework unifies the source-view fusion and target-view rendering processes via an end-to-end interpretable Transformer-based network. InNeRF enables the investigation of deep relationships between the target rendering view and the source views that were previously neglected by pooling-based fusion and fragmented rendering procedures. As a result, InNeRF improves model interpretability by enhancing the shape and appearance consistency of a 3D scene in both the surrounding-view space and the ray-cast space. For a query 3D point to be rendered, InNeRF integrates both its projected 2D pixels from the surrounding source views and its adjacent 3D points along the query ray, and simultaneously decodes this information into the query 3D point representation. Experiments show that InNeRF outperforms state-of-the-art image-based neural rendering methods in both scene-agnostic and per-scene finetuning scenarios, especially when there is a considerable disparity between the source views and the rendering view. The interpretation experiment shows that InNeRF can explain a query rendering process.

CCS CONCEPTS
• Computing methodologies → Computer vision; Rendering.

KEYWORDS
Neural Rendering, Network Interpretability

1 INTRODUCTION
Novel view synthesis is a long-standing open problem concerned with rendering unseen views of a 3D scene given a set of observed views [16, 21]. Recent remarkable NeRF research [11, 12, 14, 18, 30] introduces neural radiance field scene representations, which use multi-layer perceptrons (MLPs) to map a continuous 3D location and view direction to a density and color.

However, these models need to optimize a specific 3D representation for each scene, which is time-consuming and does not learn the information shared among scenes. Subsequently, to learn prior knowledge from diverse scenes, researchers [4, 22, 25, 29] generalized the radiance field scene representation by incorporating a pooling-based multi-view feature as the conditional input. These prior NeRFs generally contain three basic components: a CNN-based single-view feature extraction module, a pooling-based multi-view fusion module, and an MLP-based NeRF module.

Despite the intrinsic connection between these modules, each module is designed and studied independently, making the overall framework disjointed. This incoherent framework design damages model interpretability in three ways: 1) separating the feature extraction of each source view overlooks their relevancy in representing the 3D scene; 2) pooling-based fusion cannot fully explore the complicated relationships among source views; and 3) an MLP that renders color and density from a single aggregated feature struggles to decode the intricate relationships between the observed views and the rendering view. The reason for this framework design is that previous NeRFs are built on MLPs, which are incapable of processing an arbitrary number of observed views. Consequently, they need an auxiliary fusion model to aggregate multi-view information, and pooling-based fusion provides such a straightforward technique.

This limitation also impairs the capability of NeRFs to learn a view-consistent 3D scene representation from observed views, especially when the source views have a more complicated relationship with the target view, e.g., when the observed source views are captured at camera poses that are very different from the camera pose of the target view. When the camera poses of the source views are similar to the rendering view, the source views and the target view are distributed in a local region of the 3D scene representation space, making it possible to approximate their relationship by a linear function as in previous work [4, 22, 25, 29]. However, as the difference between the observed views and the rendering view increases, the correlation becomes more complicated, making it challenging for these approaches to synthesize a realistic novel view. In this scenario, existing MLP-based NeRFs, which use a pooling-based function to fuse the multi-view information, are insufficient to tackle the challenge.

Therefore, the fundamental issue is how to free the intrinsic interpretability of NeRFs from the previously fragmented frameworks for learning generalizable radiance fields. To tackle this unmet need, we present Interpretable Neural Radiance Fields (InNeRF), an end-to-end Transformer-based architecture that unifies the source-view fusion and target-view rendering processes for generalizable 3D scene representation and rendering. In the rendering process of a query 3D point, InNeRF is divided into two stages: the first works in the surrounding-view space, integrating information from the projected 2D pixels in the surrounding source views for the query 3D point; the second works in the ray-cast space, fusing the neighboring 3D points along the query ray into the representation of the query 3D point, as shown in Fig. 1. This design provides our model with a comprehensive understanding of the shape and appearance consistency of a 3D scene in both the surrounding-view space and the ray-cast space. Furthermore, the Transformer-based framework, taking advantage of the attention mechanism, enables our rendering
process to learn the in-depth and complicated relationships between source views and the rendering view, which is essential for novel view synthesis. Therefore, InNeRF has improved interpretability and learns a more comprehensive general neural radiance field.

Our contributions can be summarized as follows:
• We propose Interpretable Neural Radiance Fields (InNeRF), a unified Transformer-based framework, to study deep correlations between observed and rendering views and simultaneously integrate this intricate information into a generalizable neural radiance field.
• InNeRF exploits the geometry and appearance consistency of a neural radiance field in both the surrounding-view space and the ray-cast space, strengthening its interpretability.
• Experiments show that InNeRF achieves more realistic rendering results than state-of-the-art methods in both scene-agnostic and per-scene fine-tuning settings, especially when source views are captured at camera poses that differ significantly from the rendering view.
• InNeRF explains a query rendering process by utilizing its attention layers. Experiments show that the interpretation of InNeRF is consistent with human perception.

2 RELATED WORK
Novel View Synthesis. The goal of novel view synthesis is to render unseen views of a scene from its multiple observed images. The essence of novel view synthesis is exploring and learning a view-consistent 3D scene representation from a sparse set of input views. Early work focused on modeling 3D shapes by discrete geometric 3D representations, such as mesh surfaces [7, 8, 17], point clouds [10, 19] and voxel grids [1, 24, 28]. Although explicit geometry-based representations are intuitive, they are discrete and sparse, making them incapable of producing high-resolution renderings of sufficient quality for complex scenes.

More recently, the neural radiance field (NeRF) [16] has shown a solid ability to synthesize novel views by representing continuous scenes as 5D radiance fields in MLPs. Nevertheless, NeRF optimizes each scene representation independently, which does not exploit the information shared among scenes and is time-consuming. Subsequently, researchers proposed models such as PixelNeRF [29], MVSNeRF [4] and IBRNet [25], which receive multiple observed views as conditional inputs to learn a general neural radiance field. These methods follow a divide-and-conquer strategy and have two separate components: a CNN feature extractor for each observed image and an MLP as the NeRF network. However, the pooling-based fusion models in these methods barely explore the complex relationships across multiple views for 3D scene understanding. Furthermore, processing each 3D point independently ignores the geometry consistency of the 5D radiance field of a scene.

Here, we propose an encoder-decoder Transformer framework, InNeRF, to represent the neural radiance field of a scene for novel view synthesis. Compared with the pooling-based fusion in previous work, InNeRF can explore deep relationships among multiple views and aggregate multi-view information into the coordinate-based scene representation through the attention mechanism in a unified network. Meanwhile, InNeRF can learn the consistency of shape and appearance in a scene by considering the corresponding information in the surrounding-view space and the ray-cast space.

Transformer. The Transformer recently emerged as a promising network framework and has achieved impressive performance in natural language processing [2, 20, 27] and computer vision [3, 5, 6, 9, 13, 31]. The main idea behind this approach is to utilize the multi-head self-attention operation to explore the dependencies within the input tokens and learn a global feature representation. In object detection, DETR [3] presents a new framework that combines a 2D CNN with a Transformer and predicts object detections in parallel as a sequence of output tokens. In image classification, ViT [6] demonstrates the impressive ability of the Transformer to learn global contexts even without using CNN features [23]. In 3D scene understanding, FlatFormer [13] introduces a new window attention mechanism to optimize computational efficiency and achieve improved performance in reconstruction.

For novel view synthesis, we introduce an end-to-end Transformer framework to implicitly model a continuous 3D scene as a neural radiance field representation. Our model leverages the advantage of the Transformer in exploring deep relationships among observed images to learn a consistent, generalizable 3D scene representation.
3 METHODOLOGY
3.1 Framework
We propose InNeRF to learn an interpretable generic radiance field representation for novel scenes. Given captured multi-view images $\{I_m\}_{m=1}^{M}$ ($M$ source views) of diverse scenes and their camera parameters $\{\Theta_m\}_{m=1}^{M}$ (camera poses, intrinsic parameters and scene bounds), InNeRF reconstructs a generic radiance field $F_{\mathrm{InNeRF}}$ to learn the prior knowledge:

$(\sigma, \mathbf{c}) \leftarrow F_{\mathrm{InNeRF}}\big((x, y, z), \mathbf{d};\, \{I_m, \Theta_m\}_m\big)$,   (1)

where $(x, y, z)$ is a 3D point location, $\mathbf{d}$ denotes the unit-length direction of a viewing ray, and the outputs are a differential volumetric density $\sigma$ and a directional emitted color $\mathbf{c}$.

As shown in Fig. 1, to render a query 3D point on a target-viewing ray, the proposed InNeRF proceeds in two stages: 1) in the surrounding-view space, our $\mathrm{Decoder}^{views}_{\sigma}$ (Sec. 3.2) and $\mathrm{Decoder}^{views}_{c}$ (Sec. 3.4) fuse the source views and the query spatial information ($(x, y, z)$, $\mathbf{d}$) into latent density and color representations for the query point; 2) in the ray-cast space, we use $\mathrm{Decoder}^{ray}_{\sigma}$ (Sec. 3.3) and $\mathrm{Decoder}^{ray}_{c}$ (Sec. 3.5) to enhance the query density and color representations by considering the neighboring points along the target ray. Finally, we obtain the density and color of the query point on the target-viewing ray.
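To make the two-stage query concrete, the following PyTorch-style sketch composes the four decoders of Secs. 3.2-3.5 for the sample points of one target ray. The function signature, module names, and tensor shapes are illustrative assumptions of ours, not the authors' released implementation.

```python
import torch
from torch import nn, Tensor

def query_ray(points: Tensor, t_dir: Tensor,
              src_rgb: Tensor, src_feat: Tensor, src_dir: Tensor,
              dec_sigma_views: nn.Module, dec_sigma_ray: nn.Module,
              dec_c_views: nn.Module, dec_c_ray: nn.Module,
              sigma_head: nn.Module, rgb_head: nn.Module):
    """Hypothetical two-stage InNeRF query for N sample points on one ray.

    points:   (N, 3)     sampled 3D locations (x, y, z) along the target ray
    t_dir:    (3,)       unit target-viewing direction d
    src_rgb:  (N, M, 3)  colors of the projected pixels in the M source views
    src_feat: (N, M, C)  U-Net features at the projected pixels
    src_dir:  (N, M, 3)  source viewing directions of the projected pixels
    """
    # Stage 1, density branch (surrounding-view space, Sec. 3.2):
    # fuse the M source-view tokens of every sample into a latent density code.
    x_sigma = dec_sigma_views(points, src_rgb, src_feat, src_dir)   # (N, D)

    # Stage 2, density branch (ray-cast space, Sec. 3.3):
    # let neighboring samples on the ray exchange density information.
    sigma_feat = dec_sigma_ray(x_sigma, points)                     # (N, D)
    sigma = sigma_head(sigma_feat)                                  # (N, 1)

    # Color branch, conditioned on the density features and the target
    # direction d (surrounding-view space, Sec. 3.4; ray-cast space, Sec. 3.5).
    y_color = dec_c_views(sigma_feat, t_dir, src_rgb, src_feat, src_dir)
    color_feat = dec_c_ray(y_color, points)                         # (N, D)
    rgb = rgb_head(color_feat)                                      # (N, 3)
    return sigma, rgb   # composited into a pixel by volume rendering
```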
3.2 Density Decoder in Surrounding-view Space
We first present our density decoder in the surrounding-view space ($\mathrm{Decoder}^{views}_{\sigma}$), which decodes the projected pixels in the source views into the query latent density code.

For each source view, we first extract its feature volume with a pre-trained, view-shared U-Net. A query 3D point $(x, y, z)$ is then projected into each source view $I_m$ by its camera projection matrix $\Theta_m$ to extract the corresponding RGB colors $\{\mathbf{c}^m_{src}\}_{m=1}^{M}$ and feature vectors $\{\mathbf{f}^m_{src}\}_{m=1}^{M}$ at the projected 2D pixel locations $\{\mathbf{p}^m\}_{m=1}^{M}$ through bilinear interpolation. In each source view, we also record the viewing direction $\{\mathbf{d}^m_{src}\}_{m=1}^{M}$ of the projected pixel given by the source camera pose. Based on this, we obtain the initial source-view embeddings $\{\mathbf{x}^m_0\}_{m=1}^{M}$ for the source views.
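As a concrete illustration of this projection step, the sketch below projects query points into one source view with a pinhole camera model and bilinearly samples its image and feature map via torch.nn.functional.grid_sample. The decomposition of the projection matrix into intrinsics K and a world-to-camera extrinsic (R, t) is an assumption about the pose convention, not a detail given in the paper.

```python
import torch
import torch.nn.functional as F

def sample_source_view(points, K, R, t, feat_map, image):
    """Project world-space points into one source view and sample RGB/features.

    points:   (N, 3)        query 3D locations
    K:        (3, 3)        camera intrinsics
    R, t:     (3, 3), (3,)  world-to-camera rotation and translation
    feat_map: (C, H, W)     U-Net feature volume of the source view
    image:    (3, H, W)     the source image itself
    """
    cam = points @ R.T + t                         # world -> camera coordinates
    pix = cam @ K.T                                # camera -> homogeneous pixels
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)  # perspective division

    H, W = image.shape[-2:]
    # grid_sample expects (x, y) coordinates normalized to [-1, 1]
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, -1, 1, 2)

    feats = F.grid_sample(feat_map[None], grid, align_corners=True)  # (1,C,N,1)
    rgb = F.grid_sample(image[None], grid, align_corners=True)       # (1,3,N,1)

    # Source viewing direction of each projected pixel: from the camera center
    # (-R^T t in world space) towards the 3D point, normalized to unit length.
    cam_center = -R.T @ t
    d_src = F.normalize(points - cam_center, dim=-1)
    return rgb[0, :, :, 0].T, feats[0, :, :, 0].T, d_src  # (N,3), (N,C), (N,3)
```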
Figure 1: Workflow of the proposed InNeRF. Module A is the density decoder in surrounding-view space (Sec. 3.2). Module B is the density decoder in ray-cast space (Sec. 3.3). Module C is the color decoder in surrounding-view space (Sec. 3.4). Module D is the color decoder in ray-cast space (Sec. 3.5).
For the query point, $\mathrm{Decoder}^{views}_{\sigma}$ receives the initial source-view embeddings $\{\mathbf{x}^m_0\}_{m=1}^{M}$ and a learnable query density embedding $\mathbf{x}^{\sigma}_0$ as its input $\mathbf{X}_0$. $\mathrm{Decoder}^{views}_{\sigma}$ can be formulated as follows:

$\mathbf{X}_0 = [\mathbf{x}^{\sigma}_0; \mathbf{x}^1_0; \mathbf{x}^2_0; \cdots; \mathbf{x}^M_0]$,   (2)
$\tilde{\mathbf{X}}_{l+1} = \mathrm{Norm}\big(\mathrm{Pixels{\times}Query}_{\sigma}(\mathbf{X}_l) + \mathbf{X}_l\big)$,   (3)
$\mathbf{X}_{l+1} = \mathrm{Norm}\big(\mathrm{FFN}(\tilde{\mathbf{X}}_{l+1}) + \tilde{\mathbf{X}}_{l+1}\big)$,   (4)

where $l$ denotes the index of a basic block ($l = 1, \cdots, L$), "Norm" is a layer normalization function, and "FFN" is a position-wise feed-forward network. At the $L$-th block, we obtain $\mathbf{X}_L = [\mathbf{x}^{\sigma}_L; \mathbf{x}^1_L; \mathbf{x}^2_L; \cdots; \mathbf{x}^M_L]$. In $\mathrm{Decoder}^{views}_{\sigma}$, we concatenate the embedding $\mathbf{x}^{\sigma}_L$ and the 3D coordinate location $(x, y, z)$ to form the latent density code for the query point.

The Pixels×Query Density Attention layers explore deep relationships among the source views and are defined as follows:

$\mathrm{Pixels{\times}Query}_{\sigma}(\mathbf{X}) = \text{MH-Attn}(\mathbf{X}, \mathbf{X}, \mathbf{X})$,   (5)

where the multi-head attention function is defined as:

$\text{MH-Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Cat}(\mathbf{A}_1, \cdots, \mathbf{A}_H)\,\mathbf{W}$,
where $\mathbf{A}_h = \mathrm{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h)$,
$\mathbf{Q}_h = \mathbf{Q}\mathbf{W}^{q}_h$; $\mathbf{K}_h = \mathbf{K}\mathbf{W}^{k}_h$; $\mathbf{V}_h = \mathbf{V}\mathbf{W}^{v}_h$.   (6)

Here, $\mathbf{W}^{q}_h, \mathbf{W}^{k}_h \in \mathbb{R}^{d_k \times d_h}$, $\mathbf{W}^{v}_h \in \mathbb{R}^{d_v \times d_h}$ and $\mathbf{W} \in \mathbb{R}^{H d_h \times d_k}$ are parameter matrices ($H \times d_h = d_k$, where $d_h$ is the feature dimension of each head). The Attention function is computed by

$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_k}}\right)\mathbf{V}$,   (7)

where the $N_q$ queries are stacked in $\mathbf{Q} = [\mathbf{q}_1; \mathbf{q}_2; \cdots; \mathbf{q}_{N_q}] \in \mathbb{R}^{N_q \times d_k}$, a set of $N_k$ key-value pairs are stacked in $\mathbf{K} = [\mathbf{k}_1; \mathbf{k}_2; \cdots; \mathbf{k}_{N_k}] \in \mathbb{R}^{N_k \times d_k}$ and $\mathbf{V} = [\mathbf{v}_1; \mathbf{v}_2; \cdots; \mathbf{v}_{N_k}] \in \mathbb{R}^{N_k \times d_v}$, and $d_k$ is used as a scaling factor for normalization. Our $\mathrm{Decoder}^{views}_{\sigma}$ is invariant to permutations of the source views and can receive an arbitrary number of source views.
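A minimal sketch of one basic block of this decoder, written with torch.nn.MultiheadAttention and following the post-norm residual structure of Eqs. (3)-(4), is shown below. The hidden sizes, number of heads, and number of blocks are placeholder choices, not values reported in the paper.

```python
import torch
from torch import nn

class PixelsQueryBlock(nn.Module):
    """One basic block of Decoder_sigma^views: Eq. (3) followed by Eq. (4)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ffn: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + M, d_model) -- the query density token followed by the M
        # source-view tokens; self-attention is permutation-invariant over M.
        a, _ = self.attn(x, x, x)          # Pixels x Query_sigma, Eq. (5)
        x = self.norm1(a + x)              # Eq. (3)
        x = self.norm2(self.ffn(x) + x)    # Eq. (4)
        return x


# Usage: stack L blocks and read off the first token as the density embedding.
blocks = nn.Sequential(*[PixelsQueryBlock() for _ in range(4)])
tokens = torch.randn(2, 1 + 10, 256)       # a batch of 2 query points, 10 views
x_sigma = blocks(tokens)[:, 0]             # (2, 256) latent density code
```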
3.3 Density Decoder in Ray-cast Space
The density decoder in the ray-cast space ($\mathrm{Decoder}^{ray}_{\sigma}$) decodes the density information of the query point by aggregating the density features of the neighboring 3D points along the target-viewing ray.

For the query point and its $2n$ neighboring points along the target-viewing ray, we denote $[\sigma^{i-n}_0; \cdots; \sigma^{i}_0; \cdots; \sigma^{i+n}_0]$ as their initial density representations at the input of $\mathrm{Decoder}^{ray}_{\sigma}$, where the query point is denoted as $P^i$ and the $2n$ neighboring points are $\{P^{i-n}, \cdots, P^{i-1}, P^{i+1}, \cdots, P^{i+n}\}$. Here, the initial density representation of each 3D point is computed via an FC layer from the $\mathrm{Decoder}^{views}_{\sigma}$ output for the corresponding point ($\sigma_0 = \mathrm{FC}(\mathbf{x}^{\sigma}_L \odot (x, y, z))$, where $\odot$ is the concatenation operation). Positional encodings $\mathbf{E}^{pos}$ are then added to the density representations of the neighboring points to keep their position information in the ray-cast space. Each positional encoding informs each point of its 3D spatial location and is computed using sine and cosine functions of different frequencies, as in [3].
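A small sketch of such a sinusoidal encoding is given below. Treating the signed sample index relative to the query point as the scalar "position" along the ray is our own illustrative assumption.

```python
import torch

def sinusoidal_encoding(positions: torch.Tensor, d_model: int) -> torch.Tensor:
    """Sine/cosine positional encodings of different frequencies.

    positions: (N,) scalar positions of the 2n + 1 ray samples
    returns:   (N, d_model) encodings E^pos added to the density tokens
    """
    assert d_model % 2 == 0
    i = torch.arange(d_model // 2, dtype=torch.float32)
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * 2 * i / d_model)
    angles = positions[:, None] * freqs[None, :]            # (N, d_model / 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


# Example: encode 2n + 1 = 9 samples centered on the query point.
pos = torch.arange(-4, 5, dtype=torch.float32)              # -n, ..., 0, ..., +n
e_pos = sinusoidal_encoding(pos, d_model=256)               # (9, 256)
```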
$\mathrm{Decoder}^{ray}_{\sigma}$ is formulated as follows:

$\mathbf{D}_0 = [\sigma^{i-n}_0; \cdots; \sigma^{i}_0; \cdots; \sigma^{i+n}_0] + \mathbf{E}^{pos}$,   (8)
$\tilde{\mathbf{D}}_{l+1} = \mathrm{Norm}\big(\mathrm{Points{\times}Query}_{\sigma}(\mathbf{D}_l) + \mathbf{D}_l\big)$,   (9)
$\mathbf{D}_{l+1} = \mathrm{Norm}\big(\mathrm{FFN}(\tilde{\mathbf{D}}_{l+1}) + \tilde{\mathbf{D}}_{l+1}\big)$,   (10)

where the Points×Query Density Attention layer is computed as $\mathrm{Points{\times}Query}_{\sigma}(\mathbf{D}) = \text{MH-Attn}(\mathbf{D}, \mathbf{D}, \mathbf{D})$, fusing information from the surrounding 3D points on the target-viewing ray. At the final block, $\mathrm{Decoder}^{ray}_{\sigma}$ outputs the density representation $\sigma^{i}_L$ of the query 3D point, and we then use an FC layer to project it to the density value.
3.4 Color Decoder in Surrounding-view Space
The color decoder in the surrounding-view space ($\mathrm{Decoder}^{views}_{c}$) decodes the projected pixels' information from the source views into the query color representation. $\mathrm{Decoder}^{views}_{c}$ can be formulated as follows:

$\tilde{\mathbf{Y}}_{l+1} = \mathrm{Norm}\big(\mathrm{Pixels{\times}Query}_{c}(\mathbf{Y}_l, \hat{\mathbf{X}}, \hat{\mathbf{C}}) + \mathbf{Y}_l\big)$,   (11)
$\mathbf{Y}_{l+1} = \mathrm{Norm}\big(\mathrm{FFN}(\tilde{\mathbf{Y}}_{l+1}) + \tilde{\mathbf{Y}}_{l+1}\big)$.   (12)

In the Pixels×Query Color Attention layers, the initial query color embedding is $\mathbf{Y}_0 = \mathrm{FC}(\sigma^{i}_L) \odot \mathbf{d}_{tgt}$, where $\sigma^{i}_L$ is the latent density representation from $\mathrm{Decoder}^{ray}_{\sigma}$ and $\mathbf{d}_{tgt}$ is the target-viewing direction of the query point. The Pixels×Query Color Attention layer is calculated as:

$\mathrm{Pixels{\times}Query}_{c}(\mathbf{Y}, \hat{\mathbf{X}}, \hat{\mathbf{C}}) = \text{MH-Attn}(\mathbf{Y}, \hat{\mathbf{X}}, \hat{\mathbf{C}})$,   (13)

where the value is $\hat{\mathbf{C}} = [\gamma(\mathbf{c}^1_{src}); \cdots; \gamma(\mathbf{c}^M_{src})]$ ($\gamma(\cdot)$ is an embedding function) and the key is $\hat{\mathbf{X}} = [\mathrm{FC}(\mathbf{x}^1_L) \odot \mathbf{d}^1_{src}; \cdots; \mathrm{FC}(\mathbf{x}^M_L) \odot \mathbf{d}^M_{src}]$, representing the projected pixels' representations in the source views. The output $\mathbf{Y}_L$ is the latent color code of the query 3D point.
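The cross-attention of Eq. (13) differs from the self-attention used in the density branch in that the query, key, and value tokens come from different sources. The sketch below is a hedged illustration: the hidden sizes are placeholders, and modeling the color embedding γ(·) as a small MLP is a choice of ours rather than a detail stated in the paper.

```python
import torch
from torch import nn

class PixelsQueryColorBlock(nn.Module):
    """One block of Decoder_c^views: cross-attention Eq. (13) inside Eqs. (11)-(12)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ffn: int = 1024):
        super().__init__()
        self.gamma = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))  # gamma(c_src)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y, x_hat, c_src):
        # y:     (B, 1, d_model)  query color embedding Y_l
        # x_hat: (B, M, d_model)  keys X_hat: per-view features fused with d_src
        # c_src: (B, M, 3)        projected source-view colors (values C_hat)
        a, _ = self.attn(query=y, key=x_hat, value=self.gamma(c_src))  # Eq. (13)
        y = self.norm1(a + y)                                          # Eq. (11)
        y = self.norm2(self.ffn(y) + y)                                # Eq. (12)
        return y
```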
3.5 Color Decoder in Ray-cast Space
The color decoder in the ray-cast space ($\mathrm{Decoder}^{ray}_{c}$) learns the query color by fusing the latent color codes of adjacent 3D points along the target ray in Points×Query Color Attention layers ($\mathrm{Points{\times}Query}_{c}(\mathbf{Z}) = \text{MH-Attn}(\mathbf{Z}, \mathbf{Z}, \mathbf{Z})$). $\mathrm{Decoder}^{ray}_{c}$ is represented as:

$\mathbf{Z}_0 = [\mathbf{z}^{i-n}_0; \cdots; \mathbf{z}^{i}_0; \cdots; \mathbf{z}^{i+n}_0] + \mathbf{E}^{pos}$,   (14)
$\tilde{\mathbf{Z}}_{l+1} = \mathrm{Norm}\big(\mathrm{Points{\times}Query}_{c}(\mathbf{Z}_l) + \mathbf{Z}_l\big)$,   (15)
$\mathbf{Z}_{l+1} = \mathrm{Norm}\big(\mathrm{FFN}(\tilde{\mathbf{Z}}_{l+1}) + \tilde{\mathbf{Z}}_{l+1}\big)$,   (16)

where the query latent color code from $\mathrm{Decoder}^{views}_{c}$ is assigned to the corresponding $\mathbf{z}^{i}_0$, and likewise for the adjacent $2n$ points in the ray-cast space.

Subsequently, after $\mathrm{Decoder}^{ray}_{c}$, we use an FC layer to project the output color embedding $\mathbf{z}^{i}_L$ to the predicted color value. The predicted density and color of each query point along a ray of the desired virtual camera are then passed to classical volume rendering. The implementation details of the network and training are described in the supplementary material.
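For completeness, the sketch below shows the classical volume rendering quadrature (as popularized by NeRF [16]) that composites the per-sample densities and colors of one ray into a pixel color; the variable names are ours.

```python
import torch

def volume_render(sigma: torch.Tensor, rgb: torch.Tensor,
                  t_vals: torch.Tensor) -> torch.Tensor:
    """Composite per-sample (sigma, rgb) along one ray into a pixel color.

    sigma:  (N,)   predicted volumetric densities (non-negative)
    rgb:    (N, 3) predicted colors in [0, 1]
    t_vals: (N,)   sample depths along the ray, in increasing order
    """
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, deltas.new_tensor([1e10])])   # last interval
    alpha = 1.0 - torch.exp(-sigma * deltas)                  # per-sample opacity
    # Accumulated transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([trans.new_ones(1), trans[:-1]])
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)                # (3,) pixel color


# Example with 64 samples on one ray.
t = torch.linspace(2.0, 6.0, 64)
pixel = volume_render(torch.rand(64), torch.rand(64, 3), t)
```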
4 EXPERIMENTS
The proposed approach is evaluated in the following experimental settings:
• Scene-agnostic setting: we train a single scene-agnostic model on a large training dataset that includes various camera setups and scene types, and we test its generalization ability to unseen scenes on all test scenes.
• Per-scene fine-tuning setting: our pretrained scene-agnostic model is finetuned on each test scene, and we evaluate each finetuned scene-specific model separately.

We train and evaluate our method on a collection of multi-view datasets containing both synthetic and real data, as in IBRNet [25]. For novel view synthesis, we quantitatively evaluate the rendered image quality using PSNR, SSIM [26] (higher is better), and LPIPS [32] (lower is better).
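As an illustration of how these three metrics are typically computed, the sketch below evaluates one rendered image against its ground truth with scikit-image and the lpips package. It is an evaluation sketch under our own assumptions (e.g., images stored as float arrays in [0, 1]); it is not the authors' evaluation script, and exact arguments may vary across library versions.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')   # LPIPS [32] with an AlexNet backbone

def evaluate_image(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float arrays of shape (H, W, 3) with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # lpips expects torch tensors of shape (1, 3, H, W) scaled to [-1, 1]
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return {'PSNR': psnr, 'SSIM': ssim, 'LPIPS': lp}
```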
4.1 Conditional Source-view Set
The experiments are designed to examine whether the proposed InNeRF can effectively learn a neural radiance field scene representation as the degree of variation between the conditional source-view set and the target rendering view changes. Here, we sample 10 views from the surrounding-view set as the conditional source-view set used to render a target view. Given the camera poses, we can compute and sort the difference between each surrounding view and the target rendering view (a sketch of one such ranking is given below).

Based on the sorted order, we construct $N_s$ conditional source-view sets ($\{S_i\}_{i=1}^{N_s}$) from the surrounding-view set to render each test view. For the real evaluation dataset, there are $N_s = 3$ sets, i.e., the top 10 ($S_1$), middle 10 ($S_2$), and bottom 10 ($S_3$) views. For the synthetic evaluation dataset, there are $N_s = 4$ sets, namely the top 10 ($S_1$), middle 10 ($S_2$), third-quartile 10 ($S_3$), and bottom 10 ($S_4$) views. Fig. 4 shows visual examples of $S_1$ and $S_4$ for illustration.
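The paper does not state the exact view-difference measure, so the sketch below uses one plausible choice: it ranks the surrounding views by the angle between their viewing axes and that of the target camera, then slices the ranked list into sets such as S1 (top 10) and S3/S4 (bottom 10). The pose convention (world-to-camera rotations whose third row is the viewing axis) is an assumption.

```python
import numpy as np

def build_source_view_sets(target_R, src_Rs, set_size=10):
    """Rank surrounding views by angular difference to the target view.

    target_R: (3, 3) world-to-camera rotation of the target view
    src_Rs:   list of (3, 3) rotations of the surrounding views
    Returns index sets from which S_1 (most similar) ... S_Ns (least similar)
    can be sliced.
    """
    z_tgt = target_R[2]                    # viewing axis of the target camera
    angles = []
    for R in src_Rs:
        cos_ang = float(np.clip(np.dot(R[2], z_tgt), -1.0, 1.0))
        angles.append(np.degrees(np.arccos(cos_ang)))
    order = np.argsort(angles)             # most similar views first
    mid = len(order) // 2
    return {
        'S1': order[:set_size],                                  # top 10
        'S2': order[mid - set_size // 2: mid + set_size // 2],   # middle 10
        'S3': order[-set_size:],                                 # bottom 10
    }
```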
4.2 Results
In both the scene-agnostic (Sec. 4.2.1) and per-scene fine-tuning (Sec. 4.2.2) experiments, we evaluate the competing methods in scenarios where the source views belong to the different source-view sets $\{S_i\}_{i=1}^{N_s}$ defined in Sec. 4.1. To render a testing view, each competing approach receives the same source-view set as input. In Sec. 4.2.3, we provide the interpretation results of InNeRF.

4.2.1 Scene-agnostic Experiments. In the scene-agnostic experiments, InNeRF is compared with PixelNeRF [29], MVSNeRF [4] and IBRNet [25] on the real forward-facing dataset [15] and the realistic synthetic dataset [16].

Tab. 1 shows that the proposed InNeRF outperforms the other methods on both datasets under the scene-agnostic setting. To facilitate the quantitative comparison, the best score for each metric is marked in bold. The results show that InNeRF has better generalization ability to novel scenes even though it is trained on datasets with noticeably different scenes and view distributions. The detailed per-scene results in the supplementary material also reveal that InNeRF performs better on each scene.

The superior generalization ability of InNeRF is also reflected in the qualitative results. As shown in Fig. 2, we compare the performance of the methods when rendering the same randomly selected testing view based on different source-view sets. The results of the other approaches contain more obvious artifacts than those of InNeRF and become even worse in the $S_3$ scenario, where the difference between the source views and the target view is larger than in $S_1$ and $S_2$.
Table 1: Quantitative comparison of methods on the scene-agnostic setting for the realistic synthetic dataset [16] and the real forward-facing dataset [15].

                                 PSNR ↑                              SSIM ↑                              LPIPS ↓
Dataset              S_i   PixelNeRF MVSNeRF IBRNet InNeRF    PixelNeRF MVSNeRF IBRNet InNeRF    PixelNeRF MVSNeRF IBRNet InNeRF
realistic synthetic  S1    21.20     22.47   25.31  26.45     0.857     0.874   0.913  0.922     0.161     0.143   0.104  0.092
realistic synthetic  S2    17.00     18.44   21.80  23.16     0.732     0.755   0.805  0.842     0.295     0.286   0.236  0.183
realistic synthetic  S3    15.88     17.43   20.99  22.70     0.660     0.687   0.749  0.810     0.355     0.328   0.270  0.211
realistic synthetic  S4    14.67     16.25   19.97  21.72     0.567     0.597   0.672  0.758     0.440     0.400   0.322  0.248
real forward-facing  S1    19.02     20.09   24.96  24.97     0.651     0.680   0.813  0.816     0.380     0.347   0.208  0.205
real forward-facing  S2    16.30     17.68   22.69  22.94     0.576     0.614   0.749  0.760     0.459     0.422   0.273  0.260
real forward-facing  S3    13.56     15.21   20.33  20.81     0.489     0.543   0.683  0.701     0.551     0.504   0.340  0.318
Table 2: Quantitative comparisons of methods on the per-scene fine-tuning setting for the realistic synthetic dataset [16] and the real forward-facing dataset [15].

                                 PSNR ↑                              SSIM ↑                              LPIPS ↓
Dataset              S_i   PixelNeRF MVSNeRF IBRNet InNeRF    PixelNeRF MVSNeRF IBRNet InNeRF    PixelNeRF MVSNeRF IBRNet InNeRF
realistic synthetic  S1    24.06     27.04   29.27  30.79     0.877     0.913   0.940  0.952     0.140     0.103   0.076  0.064
realistic synthetic  S2    20.15     23.30   25.91  27.76     0.770     0.813   0.847  0.881     0.263     0.221   0.187  0.142
realistic synthetic  S3    19.27     22.56   25.23  27.35     0.714     0.759   0.802  0.849     0.301     0.256   0.216  0.165
realistic synthetic  S4    18.23     21.57   24.33  26.65     0.639     0.689   0.739  0.803     0.358     0.306   0.254  0.195
real forward-facing  S1    20.72     23.32   26.61  26.65     0.693     0.758   0.847  0.853     0.325     0.260   0.177  0.173
real forward-facing  S2    18.28     21.11   24.69  24.99     0.625     0.696   0.788  0.811     0.384     0.313   0.225  0.212
real forward-facing  S3    15.66     18.62   22.62  23.25     0.544     0.623   0.727  0.767     0.458     0.377   0.276  0.256
Figure 2: Qualitative results for the Trex and the Fern scenes [15] under the scene-agnostic setting.
Figure 3: Qualitative results for the Fern scene [15] under the per-scene finetuning setting.
Figure 4: Qualitative results for the Hotdog scene under the per-scene finetuning setting. The source-view sets S1 and S4 are listed in the yellow frame.
As highlighted in the colored frames, the other methods cannot synthesize clean boundaries for the guardrails and fronds or recover thin structures.

From the above qualitative results, we observe a gradual degradation in the synthesized view as the difference between the source views and the target rendering view increases from $S_1$ to $S_3$. Similarly, in the quantitative results from $S_1$ to $S_3$, PSNR and SSIM both decrease while LPIPS increases for all competing methods. This reveals that the more the source views differ from the target rendering view, the more difficult novel view synthesis becomes. Tab. 1 also indicates that the advantage of InNeRF over the other methods becomes more significant as the difference between the source views and the target view increases. This demonstrates that InNeRF has a strong ability to explore complicated relationships between the source views and the target view and to learn a better scene representation in challenging scenarios. More results are provided in the supplementary material.

4.2.2 Per-scene Finetuning Experiments. In the per-scene finetuning experiment, the pretrained models of the competing methods are finetuned for each scene.

As shown in Tab. 2, InNeRF outperforms the other methods after per-scene finetuning. Similar to the scene-agnostic results, the per-scene finetuning results further validate that InNeRF provides more satisfactory novel view rendering than the other methods under different source-view settings. Meanwhile, the performance gaps between InNeRF and the other methods become larger than in the scene-agnostic setting, which indicates that per-scene finetuning can further realize the potential of InNeRF. Consistent with the quantitative results, Fig. 3 shows that InNeRF provides more realistic
Figure 5: Interpretation results of the finetuned InNeRF for a target view of the Chair scene based on source-view set S4.
view synthesis results with fewer artifacts than the other approaches.

In Fig. 4, InNeRF is compared with IBRNet on four source-view sets ($S_1$, $S_2$, $S_3$ and $S_4$). Here, we randomly select one view of the Hotdog scene as the target rendering view. To show the difference between the source-view sets, we display all source views of $S_1$ and $S_4$ at the bottom of Fig. 4. It is obvious that the view angles of the source views in $S_1$ are closer to the rendering view than those in $S_4$. The top two rows of Fig. 4 display the rendering results of the competing methods for the four source-view sets. The artifacts in the rendered views of IBRNet are perceptible in $S_2$ and become worse in $S_3$ and $S_4$. In contrast, the artifacts in the rendered views of InNeRF remain at a low level across all four source-view sets. This illustrates that InNeRF obtains better rendering results than IBRNet for different source-view sets, especially when there is a large difference between the source views and the rendering view.

Figure 6: Interpretation results of fine-tuned InNeRF for a target view of the Lego scene.

4.2.3 Analysis of Interpretability in InNeRF. Based on the attention mechanism, InNeRF utilizes the shape and appearance consistency in both the surrounding-view space and the ray-cast space, thus improving the model interpretability. Here, we evaluate the interpretability of InNeRF to examine whether it is consistent with human perception.

In the surrounding-view space, we visualize the attention of the different source views to a target 3D point to interpret its rendering in $\mathrm{Decoder}^{views}_{\sigma}$ and $\mathrm{Decoder}^{views}_{c}$. Similarly, in the ray-cast space, the rendering process of $\mathrm{Decoder}^{ray}_{\sigma}$ and $\mathrm{Decoder}^{ray}_{c}$ can be explored by visualizing the attention of the surrounding 3D points on the target-viewing ray to the target 3D point. Specifically, for a 2D region (a 5×5 pixel region) in the rendering view, we first compute the average depth value of the corresponding view directions for the target pixels based on our learned neural radiance field. We then retrieve the 3D point located closest to the average depth along the average viewing direction as the target-interpreted 3D point. For this target 3D point, we can explain its rendering process in both the surrounding-view and ray-cast spaces by visualizing the corresponding attention layers in InNeRF. A sketch of this interpretation procedure is given below.
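The sketch below illustrates the procedure just described. It assumes the decoders expose their attention maps (for example, the weight tensors returned by torch.nn.MultiheadAttention) and that a depth map has already been rendered from the learned radiance field; both are implementation assumptions of ours.

```python
import torch

def interpret_region(depth_map, ray_origins, ray_dirs, t_vals, attn_maps,
                     row, col, size=5):
    """Pick the target-interpreted 3D point for a size x size pixel region and
    read off the source-view attention assigned to it.

    depth_map:   (H, W)        rendered depths from the learned radiance field
    ray_origins: (H, W, 3)     camera centers of the rendering rays
    ray_dirs:    (H, W, 3)     unit viewing directions of the rendering rays
    t_vals:      (N,)          sample depths along each ray
    attn_maps:   (H, W, N, M)  attention of the M source views to each sample
    """
    rs = slice(row, row + size)
    cs = slice(col, col + size)
    avg_depth = depth_map[rs, cs].mean()
    avg_dir = torch.nn.functional.normalize(ray_dirs[rs, cs].mean(dim=(0, 1)), dim=0)
    origin = ray_origins[rs, cs].mean(dim=(0, 1))

    # Retrieve the ray sample closest to the average depth along the average
    # viewing direction; this is the target-interpreted 3D point.
    k = torch.argmin((t_vals - avg_depth).abs())
    point = origin + avg_dir * t_vals[k]

    # Attention of each source view to that sample, averaged over the region.
    view_attn = attn_maps[rs, cs, k].mean(dim=(0, 1))      # (M,)
    ranked_views = torch.argsort(view_attn, descending=True)
    return point, view_attn, ranked_views
```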
To analyze the interpretability of InNeRF, we provide an interpretation of a randomly selected testing view of the Chair scene based on source-view set $S_4$ in Fig. 5. The target rendering view is shown in Fig. 5 (a) and the target location for interpretation is marked as a red dot. For human visual perception, the source views are divided into two groups depending on whether they capture the target location (red dot) in the rendering view: Fig. 5 (b-1) shows the source views that capture the target location, and Fig. 5 (b-2) shows the source views that fail to capture it.

For the target location (red dot) in Fig. 5 (a), Fig. 5 (c) and (d) display the attention of the source views to the target location when rendering the query density and color in $\mathrm{Decoder}^{views}_{\sigma}$ and $\mathrm{Decoder}^{views}_{c}$, respectively. In Fig. 5 (c) and (d), the attention of the source views that are visible in Fig. 5 (b) is colored blue for clarity.
Figure 7: Interpretation results of InNeRF for (top) the wings of the nose spot, (middle) the leaves on the left side, and (bottom) the black tile in the novel views under the per-scene finetuning setting.
The source views (85, 41, and 61) with high attention values are consistent with the source views in which the target location is visible. This indicates that the attention layers in $\mathrm{Decoder}^{views}_{\sigma}$ and $\mathrm{Decoder}^{views}_{c}$ can identify the important source views in a way that matches human perception. Fig. 5 (e) depicts the density attention (green) and the color attention (orange) among the 3D points along the target-viewing ray for rendering the query 3D point in $\mathrm{Decoder}^{ray}_{\sigma}$ and $\mathrm{Decoder}^{ray}_{c}$. Here, the red index (83) denotes the retrieved 3D point for the target location in the rendering view. As shown in Fig. 5 (e), both the density attention and the color attention in $\mathrm{Decoder}^{ray}_{\sigma}$ and $\mathrm{Decoder}^{ray}_{c}$ exhibit a crest near the query 3D point, which illustrates that InNeRF in the ray-cast space takes the consistency of neighboring points into account when rendering the query point.

Fig. 6 shows the two source views with the highest density attention for the target location in the Lego scene of the realistic synthetic dataset, and the last two columns show the two source views with the lowest density attention. Given that the top-attention source views capture the target location (red frame), it is reasonable that they receive more attention for the query rendering.

Fig. 7 provides interpretation results on the forward-facing dataset. The leftmost column shows the rendering view and an enlarged region framed by a blue box. The second and third columns show the two source views with the highest density attention for the target location, and the last two columns show the two source views with the lowest density attention. For the different source-view sets, the top two source views for the framed leaf region both include the corresponding leaf region, while the last two source views do not. This indicates that the interpretation results are reasonable with respect to human perception.

5 CONCLUSION
We propose a unified Transformer-based NeRF framework to learn a general neural radiance field for novel view synthesis. The proposed framework can explore complex relationships between the source views and the target rendering view. Meanwhile, the framework improves intrinsic interpretability by utilizing the shape and appearance consistency of 3D scenes. Experiments demonstrate that InNeRF achieves state-of-the-art performance on real and synthetic datasets in both scene-agnostic and per-scene finetuning settings. In the future, we intend to extend InNeRF to conditional generative radiance fields, employing the learned prior knowledge to generate a more expressive and interpretable 3D scene representation for the conditional information.