AutoRecon: Automated 3D Object Discovery and Reconstruction
Yuang Wang Xingyi He Sida Peng Haotong Lin Hujun Bao Xiaowei Zhou†
State Key Lab of CAD&CG, Zhejiang University
Abstract
A fully automated object reconstruction pipeline is crucial for digital content creation. While the area of 3D reconstruction has witnessed profound developments, the removal of background to obtain a clean object model still relies on different forms of manual labor, such as bounding box
provided by a self-supervised 2D Vision Transformer (ViT) [3]. Specifically, we aggregate multi-view ViT features from input images to the SfM point cloud and then segment salient foreground points with a point cloud segmentation Transformer. To train the Transformer on large-scale unlabeled data, we devise a pseudo-ground-truth generation pipeline based on Normalized Cut [33] and show its ability to produce accurate segmentations and 3D bounding boxes upon training. For object reconstruction, we learn a neural scene representation within the estimated foreground bounding box from multi-view images. Our main idea is to reconstruct a decomposed scene representation with the help of explicit regularization provided by the previously decomposed point cloud. Finally, we can extract a clean object model and obtain high-quality object masks with foreground-only rendering.

We conduct experiments on the CO3D [29], BlendedMVS [45], and DTU [12] datasets to validate the effectiveness of the proposed pipeline. The experimental results show that our approach can automatically and robustly recover accurate 3D object models and high-quality segmentation masks from RGB videos, even with cluttered backgrounds.

In summary, we make the following contributions:

• We propose a fully-automated framework for reconstructing background-free object models from multi-view images without any annotation.

• We propose a coarse-to-fine pipeline for scene decomposition by first decomposing the scene in the form of an SfM point cloud, which then guides the decomposition of a neural scene representation.

• We propose an SfM point cloud segmentation Transformer and devise an unsupervised pseudo-ground-truth generation pipeline for its training.

• We demonstrate the possibility of automatically creating object datasets with 3D models, 3D bounding boxes, and 2D segmentation masks.

geometry constraints. The recent volume-rendering-based reconstruction methods [27, 40, 46] allow mask-free training but still require object mask supervision to produce background-free object models. Aside from object masks, existing methods also require manual annotation of the 3D spatial extent of the foreground object. Instead, we propose a fully-automated object reconstruction pipeline without any human labeling, which further improves the usability and scalability of 3D object reconstruction.

Decomposition of neural scene representations. Many recent works try to decompose neural scene representations (NSR). We categorize related works based on the annotations required. Explicit 3D geometric primitives provide simple but effective decompositions of different entities. NeRF++ [50] separates foreground and background with a sphere. 3D bounding boxes, manually annotated or predicted by category-specific models, are used for decomposed modeling of static and dynamic scenes [16, 24, 28]. Multi-view segmentation masks provide dense annotations for scene decomposition. It has been shown that semantic fields can be learned with multi-view semantic masks [15, 39, 51] for semantic scene decomposition. Moreover, decomposed object representations can also be built from multi-view object masks [42, 43]. To alleviate the annotation cost of multi-view segmentation, methods relying on human interactions [52] and different forms of sparse human annotations have been proposed, such as scribbles [30] and seed points [23]. Their decomposition is less stable, as they rely on handcrafted, non-dedicated features from various sources to distinguish the manually specified entities. Apart from learning discrete semantic labels, DFF [14] and N3F [38] distill 2D features into neural scene representations for query-based scene decomposition. However, they still require manually-provided queries, and their query-based nature is more suitable for local editing and impedes applications requiring global reasoning upon a scene, such as the decomposition of a salient object. Different from existing approaches, our pipeline requires no annotation and facilitates global reasoning.
as object localization [34], salient object detection, and semantic segmentation [49]. TokenCut [41] and DSM [21] show promising results by localizing and segmenting salient objects with spectral clustering. However, their 2D nature leads to multi-view inconsistency and unstable results when applied to object-centric videos. To overcome these limitations, we propose to perform unsupervised object discovery from videos in 3D, which facilitates coherent salient object discovery upon a global representation, instead of many isolated inferences upon local 2D observations.

3. Preliminaries

In this section, we briefly review the following preliminaries: the self-supervised ViT features used to segment point clouds, the Normalized Cut algorithm employed to generate pseudo segmentation labels, and the neural surface reconstruction method NeuS utilized for object reconstruction.

The SDF-based radiance field is rendered via volume rendering. Given a ray {r(t) = o + tv | t > 0}, where o denotes the camera center and v is the view direction, we can render its color Ĉ by

    \hat{C} = \int_{0}^{\infty} \omega(t)\, c(\mathbf{r}(t), \mathbf{v})\, dt,    (2)

where ω(t) is an unbiased and occlusion-aware weight function as detailed in [40].

Notably, the spatial extent of the foreground object of interest needs to be manually annotated; it is scaled into a unit sphere and represented with the SDF-based radiance field. The background region outside the sphere is represented with NeRF++ [50]. Since the object of interest can hardly be exclusively enclosed with a single sphere, the reconstructed object model includes background geometries, requiring manual post-processing for their removal.
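In practice, the integral in Eq. (2) is approximated by a discrete quadrature over samples along each ray. The sketch below is a minimal NumPy illustration of this discretization; the weight function shown is a generic alpha-compositing stand-in rather than the exact unbiased, occlusion-aware ω(t) of NeuS [40], and all function names are ours, not from the paper's code.

```python
import numpy as np

def render_ray_color(weights, colors):
    """Discretize Eq. (2): C_hat = sum_i w_i * c_i along one ray.

    weights: (N,) per-sample weights approximating omega(t_i) * delta_i.
    colors:  (N, 3) radiance c(r(t_i), v) predicted at each sample.
    """
    return (weights[:, None] * colors).sum(axis=0)

def alpha_compositing_weights(sigmas, deltas):
    """Generic alpha-compositing weights, shown only as a stand-in for the
    NeuS weight function omega(t), which is derived from the SDF instead."""
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-interval opacity
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    return alphas * transmittance

# Toy usage: 64 samples along a single ray with placeholder densities/colors.
t = np.linspace(0.0, 2.0, 65)
deltas = np.diff(t)
sigmas = np.random.rand(64)
colors = np.random.rand(64, 3)
w = alpha_compositing_weights(sigmas, deltas)
c_hat = render_ray_color(w, colors)  # rendered pixel color, shape (3,)
```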
Figure 2. Coarse Decomposition. Given an object-centric image sequence, we first reconstruct the semi-dense Structure-from-Motion (SfM) point cloud and extract pointwise features by aggregating multi-view 2D DINO features, which are semantically rich as illustrated by the PCA-projected colors. Then, we segment the foreground object from the SfM point cloud with a lightweight 3D Transformer, which takes pointwise features and a global [CLS] feature as input and predicts pointwise foreground (1) / background (0) labels. Finally, the 3D bounding box of the object and an optional ground plane are estimated from the decomposed point cloud.
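To make the feature aggregation step in Figure 2 concrete, the sketch below averages per-view 2D ViT features onto SfM points by projecting each point into the views that observe it. It assumes pinhole intrinsics, world-to-camera extrinsics, per-view dense feature maps already resized to image resolution, and SfM track visibility; the helper name and details are hypothetical and not taken from the AutoRecon implementation.

```python
import numpy as np

def aggregate_features_to_points(points, feat_maps, intrinsics, extrinsics, visibility):
    """Average multi-view 2D features onto 3D SfM points.

    points:      (P, 3) SfM points in world coordinates.
    feat_maps:   list of V arrays, each (H, W, C), per-view dense ViT features.
    intrinsics:  (V, 3, 3) pinhole K per view.
    extrinsics:  (V, 3, 4) world-to-camera [R|t] per view.
    visibility:  (P, V) bool, SfM track visibility of each point in each view.
    Returns:     (P, C) averaged pointwise features.
    """
    P = points.shape[0]
    C = feat_maps[0].shape[-1]
    feat_sum = np.zeros((P, C))
    feat_cnt = np.zeros((P, 1))
    points_h = np.concatenate([points, np.ones((P, 1))], axis=1)  # homogeneous (P, 4)

    for view, fmap in enumerate(feat_maps):
        H, W, _ = fmap.shape
        cam = (extrinsics[view] @ points_h.T).T          # (P, 3) camera-frame points
        uvw = (intrinsics[view] @ cam.T).T               # (P, 3) projective image coords
        uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        valid = visibility[:, view] & (cam[:, 2] > 0) & \
                (u >= 0) & (u < W) & (v >= 0) & (v < H)
        feat_sum[valid] += fmap[v[valid], u[valid]]      # nearest-neighbor feature lookup
        feat_cnt[valid] += 1

    return feat_sum / np.clip(feat_cnt, 1, None)
```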
field, similar to Geo-NeuS [7]. This regularization can speed up convergence, alleviate the shape-radiance ambiguity, and improve the reconstruction quality of thin structures. Instead of directly constraining the SDF value of each SfM point to zero as in [7], we take the noise of SfM point clouds into account. Specifically, we model the positional uncertainty τ(x) of each point x ∈ P_fg from the foreground SfM point cloud P_fg by its distances to the neighboring points, similar to θ(x) in Eq. (4). Then we constrain the unsigned distance |f(x)| of x to be smaller than τ(x):

    \mathcal{L}_{fg} = \frac{1}{N_{fg}} \sum_{\mathbf{x} \in P_{fg}} \max\left(|f(\mathbf{x})| - \tau(\mathbf{x}),\, 0\right).    (5)
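As a concrete reading of Eq. (5), the sketch below implements the regularizer in PyTorch. The per-point uncertainty τ(x) is estimated here from mean k-nearest-neighbor distances, which is only an assumption standing in for the θ(x)-style estimate of Eq. (4); the SDF network is likewise a placeholder.

```python
import torch

def point_uncertainty(points_fg, k=4):
    """Estimate tau(x) for each foreground SfM point from the mean distance
    to its k nearest neighbors (an assumption standing in for Eq. (4))."""
    d = torch.cdist(points_fg, points_fg)      # (N, N) pairwise distances
    knn_d, _ = d.topk(k + 1, largest=False)    # smallest distances, incl. self
    return knn_d[:, 1:].mean(dim=1)            # (N,) drop the zero self-distance

def foreground_sdf_loss(sdf_fn, points_fg, tau):
    """Eq. (5): penalize |f(x)| only where it exceeds the uncertainty tau(x)."""
    sdf = sdf_fn(points_fg).squeeze(-1)        # (N,) predicted SDF values
    return torch.clamp(sdf.abs() - tau, min=0.0).mean()

# Toy usage with a placeholder SDF (a sphere of radius 0.5).
pts = torch.randn(1024, 3)
tau = point_uncertainty(pts)
sdf_fn = lambda x: x.norm(dim=-1, keepdim=True) - 0.5
loss_fg = foreground_sdf_loss(sdf_fn, pts, tau)
```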
                         CO3D               BlendedMVS         DTU
                         AP@0.5   AP@0.7    AP@0.5   AP@0.7    AP@0.5   AP@0.7
TokenCut + Seg. Agg.     0.816    0.204     0.875    0.625     0.500    0.167
Ours NCut (ablation)     0.867    0.306     1.00     0.75      1.00     0.667
Ours Transformer         0.908    0.581     1.00     1.00      0.833    0.833

Table 1. Quantitative results of coarse decomposition: salient object detection AP at box-IoU thresholds of 0.5 and 0.7 on the CO3D, BlendedMVS, and DTU datasets.

Baselines. To the best of our knowledge, there is no existing baseline that holds the same setting as our coarse decomposition pipeline, which detects 3D salient objects from SfM point clouds without manual annotation for model training. Therefore, we devise two straightforward pipelines for comparison to demonstrate the effectiveness of our design. The first baseline is TokenCut + segmentation aggregation (TokenCut + Seg. Agg.). We first use TokenCut [41] for 2D salient object segmentation on each image and then aggregate multi-view segmentation confidences to the SfM point cloud by averaging. Finally, we segment the SfM point cloud with a threshold of 0.5 to determine the salient object's 3D region. The other baseline is our neural point cloud + NCut-based segmentation (Ours NCut), which is used to generate pseudo-GTs for training Ours Transformer.

Evaluation metrics. We use the Intersection-over-Union (IoU) metric with thresholds of 0.5 and 0.7 to evaluate the bounding box accuracy. The average precision (AP) is used for comparison.

Results. As shown in Table 1, our approach achieves substantially better salient object detection performance on all datasets, especially on the challenging CO3D [29] dataset. Instead of individually segmenting 2D images as in TokenCut + Seg. Agg., our strategy of aggregating multi-view 2D features and performing segmentation in 3D facilitates global reasoning about the salient object and eliminates multi-view inconsistency. The proposed Ours Transformer also outperforms the Ours NCut baseline on most datasets and metrics, although it is trained on pseudo-GTs generated by Ours NCut. We attribute this improvement to Ours Transformer's ability to accept point clouds with a higher density as inputs, its ability to capture global dependencies, and the extra learning on the dataset generated by Ours NCut.

5.3. Object reconstruction and 2D segmentation

We evaluate the reconstructed object geometry and 2D foreground mask renderings to demonstrate the capability of our approach to reconstruct and segment complex objects.

Baselines. For 3D reconstruction, we compare our method with the neural surface reconstruction baseline NeuS [40]. As for the evaluation of 2D segmentation, the proposed method is compared with the following baselines in two categories: 1) the single-view image segmentation baseline TokenCut [41], which performs 2D salient object segmentation on each image and does not consider multi-view information; and 2) the multi-view image segmentation baseline SemanticNeRF [51], which fuses noisy masks with a neural field and produces high-quality masks with neural rendering. Specifically, we use the segmentations from TokenCut as inputs for SemanticNeRF and evaluate its mask renderings.

Figure 6. Qualitative results of salient object reconstruction. Our method is compared with NeuS on the CO3D and BlendedMVS datasets. We present results with and without our explicit regularization to illustrate its effectiveness.

Scan ID                           1      2      3      4      5      6      Mean
NeuS (w/ annotated fg. region)    0.390  0.216  0.245  0.223  0.345  0.271  0.282
Ours (fully-automated)            0.411  0.200  0.240  0.218  0.379  0.264  0.285

Table 2. Quantitative results on the BlendedMVS dataset. We normalize the GT mesh so that its longest side equals one. Results on the Chamfer l2 distance are presented as percentages.

Evaluation metrics. We evaluate 3D reconstruction with the Chamfer l2 distance. Mask IoU and Boundary IoU [4] metrics are used to evaluate 2D segmentation, with the former reflecting overall segmentation quality and the latter focusing on boundary quality. The definitions of these metrics can be found in the appendix.

Results. For foreground object reconstruction, qualitative results are shown in Figs. 5 and 6, and quantitative results on the BlendedMVS dataset are presented in Table 2. The proposed fully-automated pipeline achieves comparable or better reconstruction quality compared with NeuS, which is provided with manually-annotated object regions and requires manual post-processing for background removal. Our pipeline eliminates this tedious labor and thus demonstrates the potential to automatically create large-scale datasets.

Our method also achieves better 2D segmentation accuracy on most of the evaluated scans, as shown in Table 3
                           CO3D                                    BlendedMVS                              DTU
Scan ID                    1     2     3     4     5     Mean      1     2     3     4     5     Mean      1     2     3     4     5     Mean
Mask IoU
  Ours                     0.933 0.951 0.958 0.962 0.934 0.947     0.959 0.987 0.916 0.936 0.977 0.955     0.931 0.969 0.961 0.959 0.903 0.945
  TokenCut (single-view)   0.784 0.888 0.976 0.975 0.966 0.918     0.785 0.904 0.919 0.855 0.943 0.881     0.829 0.921 0.905 0.955 0.971 0.916
  TokenCut + SemanticNeRF  0.825 0.861 0.952 0.980 0.914 0.906     0.972 0.906 0.924 0.877 0.941 0.924     0.828 0.921 0.907 0.957 0.975 0.918
Boundary IoU
  Ours                     0.912 0.937 0.839 0.771 0.843 0.860     0.816 0.914 0.767 0.896 0.817 0.842     0.628 0.842 0.752 0.707 0.613 0.877
  TokenCut (single-view)   0.635 0.832 0.877 0.839 0.887 0.814     0.493 0.562 0.664 0.688 0.695 0.620     0.572 0.693 0.525 0.636 0.803 0.646
  TokenCut + SemanticNeRF  0.701 0.819 0.847 0.822 0.769 0.792     0.512 0.578 0.699 0.730 0.642 0.632     0.539 0.633 0.522 0.661 0.836 0.638

Table 3. Quantitative results of 2D segmentation. We compare our foreground mask renderings with baselines on the Mask IoU and Boundary IoU metrics on multiple datasets, including CO3D, BlendedMVS, and DTU. AutoRecon outperforms the baselines on most of the scans.
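For reference, the following sketch shows one straightforward way to compute the two metrics reported in Table 3: Mask IoU over full binary masks and Boundary IoU [4] restricted to thin bands around the mask contours. The band width d and the erosion-based band extraction are common choices and assumptions here, not the exact evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def mask_iou(pred, gt):
    """Standard IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / max(union, 1)

def boundary_band(mask, d=5):
    """Pixels of `mask` within roughly d pixels of its contour (inner band),
    obtained by subtracting a d-times eroded mask."""
    eroded = binary_erosion(mask, iterations=d)
    return np.logical_and(mask, np.logical_not(eroded))

def boundary_iou(pred, gt, d=5):
    """Boundary IoU [4]: IoU restricted to the boundary bands of both masks."""
    return mask_iou(boundary_band(pred, d), boundary_band(gt, d))

# Toy usage on two synthetic masks.
gt = np.zeros((128, 128), bool); gt[32:96, 32:96] = True
pred = np.zeros((128, 128), bool); pred[34:98, 30:94] = True
print(mask_iou(pred, gt), boundary_iou(pred, gt))
```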
and visualized in Fig. 7. The results of the 2D salient object segmentation baseline TokenCut lack multi-view consistency and are noisy on scans with complex backgrounds. SemanticNeRF improves the initial TokenCut segmentations on some scans. The proposed method can handle complex objects and backgrounds and outperforms these baselines significantly on the Boundary IoU metric, which demonstrates its capability of producing high-quality segmentations.

Figure 7. Qualitative results of 2D segmentation. We show foreground segmentation on the challenging chair category in CO3D and an object in BlendedMVS with complex geometry.

5.4. Ablation studies

We conduct experiments to validate the effectiveness of our point cloud segmentation Transformer for the coarse decomposition and of the regularization terms for training our decomposed neural scene representation. More ablation studies are provided in the supplementary material.

Segmentation Transformer for coarse decomposition. We show the effectiveness of our 3D segmentation Transformer over our NCut-based pipeline through the higher 3D detection AP on multiple datasets, as shown in Table 1. Results show that although trained with pseudo-labels, the 3D

6. Conclusion

We present a novel pipeline for fully-automated object discovery and reconstruction from multi-view images, without any human annotation. Experiments conducted on multiple real-world datasets show the effectiveness of our method in building high-quality background-free object models. We also demonstrate the capability of our pipeline in producing high-quality segmentation masks, which are directly applicable to 2D supervised learning.

Limitations and future work. Problems faced by neural reconstruction methods remain in our pipeline, such as sensitivity to shadows and transient occluders, and degraded results on thin-structured and non-Lambertian objects. Storing multi-view ViT features is memory-intensive, which we expect can be alleviated by distance-preserving compression techniques. The reconstruction quality of SfM point clouds can be improved with refinement methods like [11, 17], which would further improve the quality of surface reconstruction and potentially eliminate reconstruction ambiguities. Our automated object reconstruction pipeline can be used to create large-scale 3D object datasets for graphics and perception tasks, such as training 2D segmentation networks and 3D generative models.

Acknowledgement. This work was supported by NSFC (No. 62172364), the ZJU-SenseTime Joint Lab of 3D Vision, and the Information Technology Center and State Key Lab of CAD&CG, Zhejiang University.
References

[1] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
[2] Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. MONet: Unsupervised scene decomposition and representation. arXiv:1901.11390, 2019.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In CVPR, 2021.
[4] Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C. Berg, and Alexander Kirillov. Boundary IoU: Improving object-centric image segmentation evaluation. In CVPR, 2021.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[6] Martin Engelcke, Adam R. Kosiorek, Oiwi Parker Jones, and Ingmar Posner. GENESIS: Generative scene inference and sampling with object-centric latent representations. In ICLR, 2020.
[7] Qiancheng Fu, Qingshan Xu, Yew-Soon Ong, and Wenbing Tao. Geo-NeuS: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction. In NeurIPS, 2022.
[8] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In CVPR, 2010.
[9] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In ICML, 2019.
[10] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In ICML, 2020.
[11] Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-free one-shot object pose estimation without CAD models. In NeurIPS, 2022.
[12] Rasmus Ramsbøl Jensen, Anders Lindbjerg Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, 2014.
[13] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, 2020.
[14] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing NeRF for editing via feature field distillation. In NeurIPS, 2022.
[15] Amit Pal Singh Kohli, Vincent Sitzmann, and Gordon Wetzstein. Semantic implicit neural scene representations with semi-supervised training. In 3DV, 2020.
[16] Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J. Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas Funkhouser. Panoptic neural fields: A semantic object-aware neural scene representation. In CVPR, 2022.
[17] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In ICCV, 2021.
[18] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In NeurIPS, 2020.
[19] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM TOG, 2019.
[20] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. SIGGRAPH Comput. Graph., 1987.
[21] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In CVPR, 2022.
[22] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021.
[23] Ashkan Mirzaei, Yash Kant, Jonathan Kelly, and Igor Gilitschenski. LaTeRF: Label and text driven object radiance fields. In ECCV, 2022.
[24] Norman Müller, Andrea Simonelli, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, and Peter Kontschieder. AutoRF: Learning 3D object radiance fields from single view observations. In CVPR, 2022.
[25] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM TOG, 2022.
[26] Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In CVPR, 2020.
[27] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In ICCV, 2021.
[28] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. In CVPR, 2021.
[29] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
[30] Zhongzheng Ren, Aseem Agarwala, Bryan Russell, Alexander G. Schwing, and Oliver Wang. Neural volumetric object selection. In CVPR, 2022.
[31] Mehdi S. M. Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd van Steenkiste, Filip Pavetic, Mario Lucic, Leonidas Guibas, Klaus Greff, and Thomas Kipf. Object scene representation transformer. In NeurIPS, 2022.
[32] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[33] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE TPAMI, 2000.
[34] Oriane Siméoni, Gilles Puy, Huy V. Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. In BMVC, 2021.
[35] Cameron Smith, Hong-Xing Yu, Sergey Zakharov, Fredo Durand, Joshua B. Tenenbaum, Jiajun Wu, and Vincent Sitzmann. Unsupervised discovery and composition of object light fields. arXiv:2205.03923, 2022.
[36] Karl Stelzner, Kristian Kersting, and Adam R. Kosiorek. Decomposing 3D scenes into objects via unsupervised volume segmentation. arXiv:2104.01148, 2021.
[37] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In CVPR, 2021.
[38] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural Feature Fusion Fields: 3D distillation of self-supervised 2D image representations. In 3DV, 2022.
[39] Suhani Vora, Noha Radwan, Klaus Greff, Henning Meyer, Kyle Genova, Mehdi S. M. Sajjadi, Etienne Pot, Andrea Tagliasacchi, and Daniel Duckworth. NeSF: Neural semantic fields for generalizable semantic segmentation of 3D scenes. arXiv:2111.13260, 2021.
[40] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
[41] Yangtao Wang, Xi Shen, Shell Hu, Yuan Yuan, James Crowley, and Dominique Vaufreydaz. Self-supervised transformers for unsupervised object discovery using normalized cut. In CVPR, 2022.
[42] Qianyi Wu, Xian Liu, Yuedong Chen, Kejie Li, Chuanxia Zheng, Jianfei Cai, and Jianmin Zheng. Object-compositional neural implicit surfaces. In ECCV, 2022.
[43] Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Learning object-compositional neural radiance field for editable scene rendering. In ICCV, 2021.
[44] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[45] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[46] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In NeurIPS, 2021.
[47] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. In NeurIPS, 2020.
[48] Hong-Xing Yu, Leonidas J. Guibas, and Jiajun Wu. Unsupervised discovery of object radiance fields. In ICLR, 2022.
[49] Andrii Zadaianchuk, Matthaeus Kleindessner, Yi Zhu, Francesco Locatello, and Thomas Brox. Unsupervised semantic segmentation with self-supervised object-centric representations. In ICLR, 2023.
[50] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv:2010.07492, 2020.
[51] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. In ICCV, 2021.
[52] Shuaifeng Zhi, Edgar Sucar, Andre Mouton, Iain Haughton, Tristan Laidlow, and Andrew J. Davison. iLabel: Revealing objects in neural fields. IEEE Robot. Autom. Lett., 2022.