AutoRecon: Automated 3D Object Discovery and Reconstruction

Yuang Wang Xingyi He Sida Peng Haotong Lin Hujun Bao Xiaowei Zhou†
State Key Lab of CAD&CG, Zhejiang University
The authors are affiliated with the ZJU-SenseTime Joint Lab of 3D Vision. † Corresponding author: Xiaowei Zhou.
arXiv:2305.08810v1 [cs.CV] 15 May 2023

Abstract

A fully automated object reconstruction pipeline is crucial for digital content creation. While the area of 3D reconstruction has witnessed profound developments, the removal of background to obtain a clean object model still relies on different forms of manual labor, such as bounding box labeling, mask annotations, and mesh manipulations. In this paper, we propose a novel framework named AutoRecon for the automated discovery and reconstruction of an object from multi-view images. We demonstrate that foreground objects can be robustly located and segmented from SfM point clouds by leveraging self-supervised 2D vision transformer features. Then, we reconstruct decomposed neural scene representations with dense supervision provided by the decomposed point clouds, resulting in accurate object reconstruction and segmentation. Experiments on the DTU, BlendedMVS and CO3D-V2 datasets demonstrate the effectiveness and robustness of AutoRecon. The code and supplementary material are available on the project page: https://zju3dv.github.io/autorecon/.

Figure 1. Overview of our fully-automated pipeline and results. Given an object-centric video, we achieve coarse decomposition by segmenting the salient foreground object from a semi-dense SfM point cloud, with pointwise-aggregated 2D DINO features [3]. Then we train a decomposed neural scene representation from multi-view images with the help of coarse decomposition results to reconstruct foreground objects and render multi-view consistent high-quality foreground masks. (Pipeline panels: Input Images; Structure-from-Motion; DINO Point Cloud; Coarse Segmentation; Foreground Reconstruction; Foreground Segmentation.)
1. Introduction

3D object reconstruction has long been investigated in computer vision. In this work, we focus on the specific setting of reconstructing a salient foreground object from multi-view images and automatically segmenting the object from the background without any annotation, which enables scalable 3D content creation for VR/AR and may open up the possibility to generate free 2D and 3D object annotations at a large scale for supervised-learning tasks.

Traditional multi-view stereo [8, 32] and recent neural scene reconstruction methods [40, 46] have attained impressive reconstruction quality. However, these methods cannot identify objects and the reconstructed object models are typically coupled with the surrounding background. A straightforward solution is utilizing the foreground object masks to obtain clean foreground object models. However, accurate 2D object masks are expensive to annotate, and salient object segmentation techniques [21, 34, 41] generally produce masks with limited granularity, thus degrading the reconstruction quality, especially for objects with thin structures. Recently, some methods [23, 30, 50] attempt to automatically decompose objects from 3D scenes given minimal human annotations, such as 3D object bounding boxes, scribbles or pixel labels. But the requirement of manual annotations limits the feasibility of more scalable 3D content creation.

In this paper, we propose a novel two-stage framework for the fully-automated 3D reconstruction of salient objects, as illustrated in Fig. 1. We first perform coarse decomposition to automatically segment the foreground SfM point cloud, and then reconstruct the foreground object geometry by learning an implicit neural scene representation under explicit supervision from the coarse decomposition. The key idea of our coarse decomposition is to leverage the semantic features provided by a self-supervised 2D Vision Transformer (ViT) [3].
Specifically, we aggregate multi-view ViT features from input images to the SfM point cloud and then segment salient foreground points with a point cloud segmentation Transformer. To train the Transformer on large-scale unlabeled data, we devise a pseudo-ground-truth generation pipeline based on Normalized Cut [33] and show its ability to produce accurate segmentations and 3D bounding boxes upon training. For object reconstruction, we learn a neural scene representation within the estimated foreground bounding box from multi-view images. Our main idea is to reconstruct a decomposed scene representation with the help of explicit regularization provided by the previously decomposed point cloud. Finally, we can extract a clean object model and obtain high-quality object masks with foreground-only rendering.

We conduct experiments on the CO3D [29], BlendedMVS [45], and DTU [12] datasets to validate the effectiveness of the proposed pipeline. The experimental results show that our approach can automatically and robustly recover accurate 3D object models and high-quality segmentation masks from RGB videos, even with cluttered backgrounds. In summary, we make the following contributions:

• We propose a fully-automated framework for reconstructing background-free object models from multi-view images without any annotation.

• We propose a coarse-to-fine pipeline for scene decomposition by first decomposing the scene in the form of an SfM point cloud, which then guides the decomposition of a neural scene representation.

• We propose an SfM point cloud segmentation Transformer and devise an unsupervised pseudo-ground-truth generation pipeline for its training.

• We demonstrate the possibility of automatically creating object datasets with 3D models, 3D bounding boxes, and 2D segmentation masks.

2. Related Work

Multi-view 3D object reconstruction. The reconstruction of 3D objects from multi-view images has long been studied with broad applications. Aside from multi-view images, accurate object masks are needed to separate the object of interest from its surroundings and optionally provide additional geometric constraints. Multi-view Stereo (MVS) methods [32, 44] recover a background-free reconstruction by recovering frame-wise depth maps, followed by fusing depth only within object masks. Recently, neural reconstruction methods, built upon differentiable neural renderers and scene representations, have witnessed profound development. Surface-rendering-based methods [26, 47] get rid of 3D supervision, but they still rely on object masks as substitutive geometry constraints. The recent volume-rendering-based reconstruction methods [27, 40, 46] allow mask-free training but still require object mask supervision to produce background-free object models. Aside from object masks, existing methods also require manual annotation of the 3D spatial extent of the foreground object. Instead, we propose a fully-automated object reconstruction pipeline without any human labeling, which further improves the usability and scalability of 3D object reconstruction.

Decomposition of neural scene representations. Many recent works try to decompose neural scene representations (NSR). We categorize related works based on the annotations required. Explicit 3D geometric primitives provide simple but effective decompositions of different entities. NeRF++ [50] separates foreground and background with a sphere. 3D bounding boxes manually annotated or predicted by category-specific models are used for decomposed modeling of static and dynamic scenes [16, 24, 28]. Multi-view segmentation masks provide dense annotations for scene decomposition. It has been shown that semantic fields can be learned with multi-view semantic masks [15, 39, 51] for semantic scene decomposition. Moreover, decomposed object representations can also be built from multi-view object masks [42, 43]. To alleviate the annotation cost of multi-view segmentation, methods relying on human interactions [52] and different forms of sparse human annotations have been proposed, such as scribbles [30] and seed points [23]. Their decomposition is less stable as they rely on handcrafted, non-dedicated features from various sources to distinguish the manually specified entities. Apart from learning discrete semantic labels, DFF [14] and N3F [38] distill 2D features into neural scene representations for query-based scene decomposition. However, they still require manually-provided queries, and their query-based nature is more suitable for local editing and impedes applications requiring global reasoning upon a scene, such as the decomposition of a salient object. Different from existing approaches, our pipeline requires no annotation and facilitates global reasoning.

Unsupervised object discovery. Unsupervised object discovery (UOD) aims at the unsupervised learning of object concepts from a large-scale dataset. Recently, many works strive for UOD with compositional generative modeling [2, 6, 9]. Slot Attention [18] facilitates the inference of object-centric representations directly from images. This idea is further extended to 3D-aware modeling and inference with neural radiance fields or light fields [31, 35, 36, 48]. However, these works have only been shown to work on synthetic datasets and are not applicable to complex real-world situations. Our method focuses on the recovery of decomposed single-object representations and is shown to work on real-world data such as casual video captures. Another recent trend makes use of self-supervised visual representations for unsupervised object discovery in various forms, such as object localization [34], salient object detection, and semantic segmentation [49].
TokenCut [41] and DSM [21] show promising results by localizing and segmenting salient objects with spectral clustering. However, their 2D nature leads to multi-view inconsistency and unstable results when applied to object-centric videos. To overcome these limitations, we propose to perform unsupervised object discovery from videos in 3D, which facilitates coherent salient object discovery upon a global representation, instead of many isolated inferences upon local 2D observations.

3. Preliminaries

In this section, we briefly review the following preliminaries: the self-supervised ViT features used to segment point clouds, the Normalized Cut algorithm employed to generate pseudo segmentation labels, and the neural surface reconstruction method NeuS utilized for object reconstruction.

Self-supervised ViTs. A Vision Transformer [5] flattens an H × W sized image I into a sequence of P × P sized 2D patches I_p. Each image patch is embedded with a trainable linear projection and added with a positional embedding. A special learnable [CLS] token is usually prepended to the sequence of patches for modeling global and contextual information. The 1D sequence of token embeddings is fed to several Transformer encoder blocks composed of multi-head self-attention (MSA) and MLP layers:

z′_ℓ = MSA(LN(z_{ℓ−1})) + z_{ℓ−1},
z_ℓ = MLP(LN(z′_ℓ)) + z′_ℓ,    (1)

where z_ℓ is the output of the ℓ-th Transformer encoder layer. It has been shown in [3] that self-supervised ViT features contain explicit semantic information such as scene layout and object boundaries, which is not found in the supervised counterparts.
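For readers who prefer code, the update in Eq. (1) can be written as a small PyTorch module. This is an illustrative sketch of a pre-norm Transformer encoder block, not the DINO implementation; the dimensions and head counts are placeholder assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block, following Eq. (1):
    z'_l = MSA(LN(z_{l-1})) + z_{l-1};  z_l = MLP(LN(z'_l)) + z'_l."""

    def __init__(self, dim: int = 384, num_heads: int = 6, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention with a residual connection.
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        z = z + attn_out
        # MLP with a residual connection.
        return z + self.mlp(self.norm2(z))

# Example: a [CLS] token prepended to 196 patch tokens of dimension 384.
tokens = torch.randn(1, 1 + 196, 384)
print(EncoderBlock()(tokens).shape)  # torch.Size([1, 197, 384])
```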
Normalized cut algorithm (NCut) [33]. Spectral clustering is a widely used clustering technique that originated from graph partitioning. Given a set of data points x_i, spectral clustering builds an undirected graph G = (V, E) and partitions it into two disjoint sets A and B. Each data point x_i corresponds to a vertex v_i, and the weight w(i, j) of each graph edge represents the similarity or the connectivity between two data points. Normalized Cut (NCut) is a widely used criterion for spectral clustering, which can be efficiently minimized by solving a generalized eigenvalue problem as shown in [33].
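Concretely, an approximate two-way NCut can be obtained from the eigenvector with the second-smallest eigenvalue of the generalized problem (D − W)y = λDy [33]. The sketch below assumes a dense affinity matrix W and simply thresholds that eigenvector at zero; it is not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W: np.ndarray) -> np.ndarray:
    """Approximate two-way Normalized Cut on a dense affinity matrix W (n x n).

    Solves (D - W) y = lambda * D y and thresholds the eigenvector with the
    second-smallest eigenvalue. Returns a boolean mask for one side of the cut.
    """
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                      # graph Laplacian
    eigvals, eigvecs = eigh(L, D)  # symmetric generalized eigenvalue problem
    fiedler = eigvecs[:, 1]        # second-smallest eigenvalue -> Fiedler vector
    return fiedler > 0             # simple zero threshold as the split point

# Toy example: two noisy 3D clusters with a Gaussian affinity.
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.1, (20, 3)), rng.normal(1, 0.1, (20, 3))])
dist2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
W = np.exp(-dist2 / 0.05)
print(ncut_bipartition(W).astype(int))  # separates the two clusters (up to a label flip)
```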
Neural surface reconstruction with NeuS. NeuS [40] uses the zero-level set of a signed distance function (SDF) f : R³ → R to represent a surface S = {x ∈ R³ | f(x) = 0} and models appearance with a radiance field c(x, v). The SDF-based radiance field is rendered via volume rendering. Given a ray {r(t) = o + tv | t > 0}, where o denotes the camera center and v is the view direction, we can render its color Ĉ by

Ĉ = ∫₀^∞ ω(t) c(r(t), v) dt,    (2)

where ω(t) is an unbiased and occlusion-aware weight function as detailed in [40].

Notably, the spatial extent of the foreground object of interest needs to be manually annotated, which is scaled into a unit-sphere and represented with the SDF-based radiance field. The background region outside the sphere is represented with NeRF++ [50]. Since the object of interest can hardly be exclusively enclosed with a single sphere, the reconstructed object model includes background geometries, requiring manual post-processing for its removal.
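In practice the integral in Eq. (2) is evaluated by quadrature over samples along each ray. The sketch below uses generic density-based alpha compositing to form the discrete weights; NeuS derives ω(t) from the SDF rather than from a free density, so this is only an illustration of the rendering step.

```python
import torch

def render_ray(color: torch.Tensor, density: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Quadrature of C_hat = integral of w(t) c(r(t), v) dt along one ray.

    color:   (N, 3) radiance at the N sample points
    density: (N,)   non-negative density at the sample points
    t:       (N,)   ascending sample depths along the ray
    Returns the composited RGB color of shape (3,).
    """
    delta = t[1:] - t[:-1]                     # distances between samples
    delta = torch.cat([delta, delta[-1:]])     # pad the last interval
    alpha = 1.0 - torch.exp(-density * delta)  # per-sample opacity
    # Accumulated transmittance T_i = prod_{j < i} (1 - alpha_j).
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                    # discrete counterpart of w(t) dt
    return (weights[:, None] * color).sum(dim=0)

# Example: 64 samples along a ray.
t = torch.linspace(0.1, 3.0, 64)
rgb = render_ray(torch.rand(64, 3), torch.rand(64), t)
print(rgb)
```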
4. Methods

An overview of our pipeline is illustrated in Fig. 1. Given an object-centric video, we aim to automatically decompose and reconstruct the salient foreground object whose high-quality 2D masks can be rendered from its reconstructed geometry. To achieve this goal, we propose a novel coarse-to-fine pipeline that decomposes a neural scene representation with the help of point cloud decomposition. Our coarse decomposition stage segments the foreground object from a scene-level SfM point cloud and estimates its compact 3D bounding box (Sec. 4.1). Then, a decomposed neural scene representation of the foreground object is recovered under explicit supervision of the coarse decomposition (Sec. 4.2).

4.1. Coarse decomposition of the salient object

To coarsely decompose the foreground object, we first reconstruct its SfM point cloud and fuse multi-view DINO [3] features on it. Then, the point cloud is segmented by our lightweight 3D segmentation Transformer, upon which a 3D bounding box of the salient foreground object is generated. Our coarse decomposition pipeline is shown in Fig. 2. Since we assume that no manual annotation is available, we devise an unsupervised point cloud segmentation pipeline to generate pseudo-ground-truth segmentations, as shown in Fig. 3. Upon training, the 3D segmentation Transformer outperforms our unsupervised pipeline and can be applied to point clouds at larger scales.

Figure 2. Coarse Decomposition. Given an object-centric image sequence, we first reconstruct the semi-dense Structure-from-Motion (SfM) point cloud and extract pointwise features by aggregating multi-view 2D DINO features, which are semantically rich as illustrated by the PCA-projected colors. Then, we segment the foreground object from the SfM point cloud with a lightweight 3D Transformer, which takes pointwise features and a global [CLS] feature as input and predicts pointwise labels (1/0). Finally, the 3D bounding box of the object and an optional ground plane are estimated from the decomposed point cloud. (Panels: 1. Neural Point Cloud Reconstruction; 2. 3D Segmentation Transformer.)
Neural point cloud reconstruction. We leverage the SfM point clouds for efficient coarse decomposition since SfM is usually performed prior to the dense reconstruction for camera pose recovery. Specifically, we use the recent semi-dense image matcher LoFTR [37] for SfM to reconstruct semi-dense point clouds. It can recover the complete geometry of foreground objects, even low-textured ones, which is discussed in [11]. This capability is appealing for robustly locating an object's complete spatial extent, which is less reliable with sparse keypoint-based SfM. To facilitate 3D segmentation, we lift self-supervised 2D ViT features to 3D. More specifically, frame-wise DINO-ViT features are extracted and aggregated onto the semi-dense point cloud, thanks to the explicit 3D-2D correspondences retained by SfM. We find that fusing multi-view features with a simple averaging operation is sufficient for our task. Additionally, frame-wise features of the [CLS] token globally describing each frame are also fused as a global description of our point cloud and further used as a segmentation prototype in our proposed 3D Transformer.
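The feature lifting step amounts to averaging the 2D DINO features observed along each SfM track. A minimal sketch, assuming per-frame feature maps and a list of (point_id, frame_id, u, v) observations recorded by SfM; the nearest-patch lookup here is an assumption rather than the exact sampling used in the released code.

```python
import numpy as np

def aggregate_point_features(feat_maps: dict, observations: list,
                             num_points: int, dim: int) -> np.ndarray:
    """Average multi-view 2D features onto SfM points via their 3D-2D tracks.

    feat_maps:    {frame_id: (H, W, dim) feature map}, e.g. DINO patch features
    observations: list of (point_id, frame_id, u, v) feature-map coordinates
    Returns a (num_points, dim) array of pointwise features (zero if unobserved).
    """
    feats = np.zeros((num_points, dim), dtype=np.float32)
    counts = np.zeros(num_points, dtype=np.int64)
    for point_id, frame_id, u, v in observations:
        fmap = feat_maps[frame_id]
        h, w = fmap.shape[:2]
        # Nearest-neighbor lookup of the patch feature covering (u, v).
        feats[point_id] += fmap[min(int(v), h - 1), min(int(u), w - 1)]
        counts[point_id] += 1
    observed = counts > 0
    feats[observed] /= counts[observed, None]   # simple multi-view average
    return feats

# Toy example: 2 frames, 3 points, 8x8 feature maps with 4 channels.
rng = np.random.default_rng(0)
fmaps = {0: rng.normal(size=(8, 8, 4)), 1: rng.normal(size=(8, 8, 4))}
obs = [(0, 0, 1, 1), (0, 1, 2, 3), (1, 0, 5, 5), (2, 1, 7, 0)]
print(aggregate_point_features(fmaps, obs, num_points=3, dim=4).shape)  # (3, 4)
```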
Point cloud segmentation with Transformer. As shown in Fig. 2, the neural point cloud already contains discriminative semantic features separating foreground and background. Therefore, we suppose that a concise network with proper inductive bias and trained with limited supervision is enough to probe the salient object in our task. We build an efficient point cloud Transformer with only two Transformer encoder layers and linear attention [13]. The global [CLS] token and pointwise tokens obtained by the previously built neural point cloud are added with positional encodings and transformed by the encoder. Then, the transformed [CLS] token is treated as an input-dependent segmentation prototype, which is correlated with pointwise features to produce a segmentation mask. Our design takes full advantage of the global information of the [CLS] token to reason about the salient object and its global-local relationship with other pointwise features for segmentation. The use of pre-trained 2D ViT features alleviates the reliance on large-scale data for training Transformers.
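A minimal sketch of the prototype-based head described above: the encoder output of the [CLS] token is used as a query that is correlated (dot product) with the per-point tokens to produce foreground logits. The layer count, dimensions, and the use of a standard PyTorch Transformer encoder instead of linear attention [13] are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PointCloudSegmenter(nn.Module):
    """Lightweight point-cloud segmentation Transformer (illustrative).

    Input:  point coordinates (B, N, 3), per-point DINO features (B, N, C),
            and a global [CLS] feature (B, C).
    Output: per-point foreground logits (B, N).
    """

    def __init__(self, feat_dim: int = 384, dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, xyz, point_feat, cls_feat):
        # Project features and add positional encodings of the 3D coordinates.
        tokens = self.proj(point_feat) + self.pos_mlp(xyz)         # (B, N, dim)
        cls_token = self.proj(cls_feat).unsqueeze(1)                # (B, 1, dim)
        out = self.encoder(torch.cat([cls_token, tokens], dim=1))   # (B, 1+N, dim)
        # The transformed [CLS] token acts as a segmentation prototype:
        # correlate it with each transformed point token to get logits.
        prototype, points = out[:, 0], out[:, 1:]
        return (points * prototype.unsqueeze(1)).sum(dim=-1)        # (B, N)

model = PointCloudSegmenter()
logits = model(torch.randn(2, 1024, 3), torch.randn(2, 1024, 384), torch.randn(2, 384))
print(logits.shape)  # torch.Size([2, 1024])
```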
Dataset generation with unsupervised segmentation. In order to generate training data for our 3D Transformer, we propose an unsupervised SfM segmentation pipeline, which can produce robust segmentations but is computationally more intensive. We propose to apply NCut on the previously built neural point clouds as it facilitates scene-level global reasoning. A large-scale dataset with pseudo-ground-truth segmentations can be automatically generated with the proposed pipeline. An overview is presented in Fig. 3.

To apply the NCut algorithm on our neural point cloud for 3D segmentation, we build a fully connected graph G = (V, E) using the neural point cloud, where each graph vertex V_i corresponds to a 3D point. We combine feature similarities and spatial affinities when modeling the edge weights w(i, j) between V_i and V_j. Though the DINO feature is semantically rich, it is hierarchical, and the salient object inferred is sometimes ambiguous in terms of its position in the part-whole hierarchy. We propose a grouped cosine similarity to avoid the saliency being dominated by a certain object part, especially for objects with complex structures.

Formally, denoting a group of multi-head attention features Z_i = {z_i^0, ..., z_i^{h−1}} from the h heads of an MSA module, we compute the grouped cosine similarity S* between Z_i and Z_j:

S*(Z_i, Z_j) = max_{k ∈ {0, ..., h−1}} S(z_i^k, z_j^k),    (3)

where S is the cosine similarity. The intuition is that taking the maximum similarity over a group of multi-head features assigns two points a high similarity if they are similar in any aspect, thus reducing the chance that the saliency is dominated by only a local part of an object. The foreground point cloud is then segmented with NCut on the graph defined above. An oriented 3D bounding box is subsequently inferred based on plane-aligned principal component analysis of the foreground point cloud. The pipeline is illustrated in Fig. 3. More details about our unsupervised segmentation pipeline are provided in the appendix.
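The grouped cosine similarity of Eq. (3) takes, for each pair of points, the maximum over the h per-head cosine similarities. The sketch below also shows one plausible way to combine it with a spatial affinity into edge weights w(i, j); the Gaussian spatial term and the product combination are assumptions, not the paper's exact formulation.

```python
import numpy as np

def grouped_cosine_similarity(Z: np.ndarray) -> np.ndarray:
    """Eq. (3): S*(Z_i, Z_j) = max_k cos(z_i^k, z_j^k).

    Z: (n, h, d) per-point features grouped into h attention heads.
    Returns an (n, n) similarity matrix.
    """
    Zn = Z / (np.linalg.norm(Z, axis=-1, keepdims=True) + 1e-8)  # unit-normalize per head
    per_head = np.einsum('ihd,jhd->hij', Zn, Zn)                  # (h, n, n) cosines
    return per_head.max(axis=0)                                   # max over heads

def edge_weights(Z: np.ndarray, xyz: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Combine feature similarity with a Gaussian spatial affinity (assumed form)."""
    feat_sim = np.clip(grouped_cosine_similarity(Z), 0.0, None)
    dist2 = ((xyz[:, None] - xyz[None]) ** 2).sum(-1)
    return feat_sim * np.exp(-dist2 / (2 * sigma ** 2))

# Toy example: 100 points, 6 heads, 64-dim head features.
rng = np.random.default_rng(0)
W = edge_weights(rng.normal(size=(100, 6, 64)), rng.normal(size=(100, 3)))
print(W.shape)  # (100, 100); W can be fed to the NCut routine sketched earlier
```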
Figure 3. Pseudo-ground-truth generation and label definition. To train our point cloud segmentation Transformer with unlabeled data, we propose an unsupervised pipeline to generate pseudo labels. We segment the downsampled neural point cloud with Normalized Cut [33] (NCut) and estimate a bounding box for the foreground points. Taking the segmentation noise into account, we treat NCut's foreground segmentations as positive samples and background points outside the bounding box as negative ones. Background segmentations located within the bounding box are regarded as segmentation noise and thus ignored in training.

4.2. Background-free salient object reconstruction

To reconstruct a background-free salient object model, we explicitly partition a scene with the coarse decomposition result and model each partition separately with neural scene representations [22, 40], which are trained upon multi-view posed images. Moreover, we incorporate multiple constraints into the optimization, which facilitates convergence and the decomposition of foreground objects from their surroundings. Fig. 4 illustrates our foreground modeling.

Figure 4. Salient object reconstruction and 2D mask rendering. We model the salient foreground object enclosed in the coarse bounding box with an SDF-based radiance field [40]. We use a decomposed scene representation consisting of separate fields for regions inside the bounding box, outside the bounding box, and near the ground plane. We regularize the optimization of the SDF-based radiance field with coarse decomposition results, i.e., the segmented foreground SfM point cloud and the estimated ground plane, for more robust foreground decomposition. After reconstruction, we can render high-quality multiview-consistent 2D object masks.

Decomposed scene modeling. Previous methods [40, 47] scale the foreground object into a unit sphere with manual annotation of its spatial extent and further rely on mask culling or manual manipulation of the reconstructed mesh to remove background geometries. Instead, we explicitly partition the scene into three parts with finer granularity without manual annotation, thanks to the object bounding box estimated in Sec. 4.1. More specifically, we use an SDF-based radiance field [40] to represent regions within the object bounding box and use a NeRF for regions outside. We use an additional tiny NeRF to model regions around the ground plane supporting the object, which can be located with the bottom plane of the object bounding box. Though there are overlaps between the inner region and the ground plane region, NeRF normally converges faster than the SDF-based radiance field and thus has an inductive bias to occupy the overlapped region. We use a foreground-object-aware version of the contraction function with the L∞ norm in MipNeRF-360 [1] to model unbounded scenes. More details are provided in the appendix.

Explicit regularization with coarse decomposition. We empirically find that decomposed modeling alone cannot robustly separate a foreground object from its surroundings, especially for thin structures and regions in close contact. Therefore, we leverage the geometric cues in the coarse decomposition results, including the segmented foreground SfM point cloud and the estimated ground plane, to provide extra regularization for training the SDF-based radiance field. Firstly, the unsigned distances |f(x)| of SfM points x ∈ P_g located on the estimated ground plane P_g are constrained to be larger than a lower bound θ(x):

L_g = (1 / N_g) Σ_{x ∈ P_g} max(θ(x) − |f(x)|, 0),  θ(x) = µ(x) + λ · σ(x),    (4)

where µ(x) and σ(x) are the mean and standard deviation of the unsigned distances between point x and its K nearest neighbors. This constraint prevents the foreground network from modeling the ground plane.

Moreover, the foreground SfM point cloud is regarded as a rough surface prior to regularize the signed distance
field, similar to Geo-NeuS [7]. This regularization can speed up convergence, alleviate the shape-radiance ambiguity and improve the reconstruction quality of thin structures. Instead of directly constraining the SDF value of each SfM point to zero like in [7], we take the noise of SfM point clouds into account. Specifically, we model the positional uncertainty τ(x) of each point x ∈ P_fg from the foreground SfM point cloud P_fg by its distances to the neighboring points, similar to θ(x) in Eq. (4). Then we constrain the unsigned distance |f(x)| of x to be smaller than τ(x):

L_fg = (1 / N_fg) Σ_{x ∈ P_fg} max(|f(x)| − τ(x), 0).    (5)

To further enhance high-quality foreground renderings with sharp boundaries, we add a beta distribution prior [19] L_bin on the accumulated weights O(r) of each ray r ∈ R_fg intersecting with the object bounding box. Finally, we use the eikonal term L_eik [10] on sampled foreground points. Our total loss is:

L = L_color + α L_eik + β L_g + γ L_fg + ζ L_bin.    (6)
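The two point-cloud regularizers can be sketched as follows: the per-point bounds θ(x) and τ(x) are derived from K-nearest-neighbor distance statistics, and Eqs. (4) and (5) become hinge penalties on the predicted |f(x)|. This is a schematic re-implementation under assumed values of K, λ, and the form of τ, not the released training code.

```python
import torch

def knn_distance_stats(points: torch.Tensor, k: int = 8):
    """Mean and std of distances from each point to its K nearest neighbors."""
    d = torch.cdist(points, points)                    # (N, N) pairwise distances
    knn = d.topk(k + 1, largest=False).values[:, 1:]   # drop the zero self-distance
    return knn.mean(dim=1), knn.std(dim=1)

def ground_plane_loss(sdf_abs: torch.Tensor, ground_pts: torch.Tensor,
                      lam: float = 1.0) -> torch.Tensor:
    """Eq. (4): push |f(x)| above theta(x) = mu(x) + lam * sigma(x) on ground points."""
    mu, sigma = knn_distance_stats(ground_pts)
    theta = mu + lam * sigma
    return torch.clamp(theta - sdf_abs, min=0.0).mean()

def foreground_surface_loss(sdf_abs: torch.Tensor, fg_pts: torch.Tensor) -> torch.Tensor:
    """Eq. (5): keep |f(x)| below the positional uncertainty tau(x) on foreground points."""
    mu, sigma = knn_distance_stats(fg_pts)
    tau = mu + sigma                                   # assumed form, analogous to theta
    return torch.clamp(sdf_abs - tau, min=0.0).mean()

# Example with random stand-ins for the SDF predictions at the SfM points.
fg, ground = torch.rand(500, 3), torch.rand(200, 3)
loss = 0.1 * ground_plane_loss(torch.rand(200), ground) \
     + 0.1 * foreground_surface_loss(torch.rand(500), fg)
print(loss)
```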
Foreground rendering and salient object extraction. With the reconstructed SDF-based radiance field of the foreground object, we can easily render its multi-view consistent 2D masks and extract its mesh. As our reconstruction models the foreground object and background with different fields, we simply compute the accumulated weights of our foreground field along each ray intersecting the object bounding box as object masks, which are binarized with a threshold of 0.5. We use Marching Cubes [20] to extract the object mesh. We can obtain a background-free object 3D model without post-processing thanks to our decomposed scene modeling.
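The mask and mesh extraction steps are conceptually simple; a sketch with scikit-image's marching cubes, assuming the accumulated foreground weights O(r) are available per pixel and the learned SDF has been sampled on a regular grid:

```python
import numpy as np
from skimage import measure

def masks_from_accumulated_weights(acc_weights: np.ndarray, thr: float = 0.5) -> np.ndarray:
    """Binarize per-pixel accumulated foreground weights O(r) into an object mask."""
    return (acc_weights > thr).astype(np.uint8)

def mesh_from_sdf_grid(sdf_grid: np.ndarray, voxel_size: float):
    """Extract the zero level set of an SDF sampled on a regular grid."""
    verts, faces, normals, _ = measure.marching_cubes(
        sdf_grid, level=0.0, spacing=(voxel_size,) * 3)
    return verts, faces, normals

# Toy example: a sphere of radius 0.3 inside the unit cube, sampled at 64^3.
grid = np.linspace(-0.5, 0.5, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x ** 2 + y ** 2 + z ** 2) - 0.3
verts, faces, _ = mesh_from_sdf_grid(sdf, voxel_size=1.0 / 63)
mask = masks_from_accumulated_weights(np.random.rand(480, 640))
print(verts.shape, faces.shape, mask.shape)
```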
4.3. Implementation details

The input images used for SfM reconstruction are resized to a maximum area of 720,000. We extract frame-wise 2D features from the ViT-S/8 version of DINO-ViT. We use our data generation pipeline to process 880 object-centric videos from the CO3D-V2 dataset, which includes various categories. All chair objects are kept as a holdout set for validation. This leads to 800 objects for training and 80 objects for validation. We train our 3D segmentation Transformer for 20 epochs. We use multiresolution hash encoding [25] and separate proposal MLPs [1] in all fields of our scene representation. We train our scene representation for 60k iterations, which takes 2 hours on a single NVIDIA V100 GPU. All loss weights in Eq. (6) are set to 0.1. The explicit regularization terms are only applied during the initial 15k iterations, with their loss weights gradually annealed to zero.

Figure 5. Background-free salient object reconstruction results. (Columns: Ref. Image; Fg. Rendering; Extracted Mesh. Rows: CO3D; BlendedMVS.)

5. Experiments

5.1. Datasets

We evaluate the proposed method on the CO3D-V2 [29], BlendedMVS [45] and DTU [12] datasets. CO3D contains 19,000 video sequences of objects from 50 MS-COCO categories. Many objects in CO3D contain thin structures, which are challenging for detection and segmentation from SfM point clouds. We use the CO3D dataset to evaluate 3D salient object detection and 2D segmentation to demonstrate the capabilities of our method on challenging objects and casual captures. To evaluate 3D detection, we manually annotate ground-truth 3D bounding boxes of 80 objects from the chair category based on the given MVS point clouds. Moreover, we annotate detailed 2D foreground masks of 5 objects to evaluate 2D segmentation.

The BlendedMVS and DTU datasets are widely used for 3D reconstruction. We use these datasets to evaluate 3D salient object detection, reconstruction, and 2D segmentation. Since meshes provided by BlendedMVS come with backgrounds, we manually segment foreground meshes and render multi-view masks of 5 objects for evaluation. Foreground object meshes are also used for producing ground-truth 3D bounding boxes. When evaluating reconstruction results, all meshes are preprocessed with object masks.

5.2. 3D salient object detection

In this part, we evaluate our coarse decomposition results based on the 3D bounding boxes inferred from segmented foreground point clouds. More details about generating bounding boxes can be found in the appendix.
                      CO3D              BlendedMVS        DTU
                      AP@0.5   AP@0.7   AP@0.5   AP@0.7   AP@0.5   AP@0.7
TokenCut + Seg. Agg.  0.816    0.204    0.875    0.625    0.500    0.167
Ours NCut (ablation)  0.867    0.306    1.00     0.75     1.00     0.667
Ours Transformer      0.908    0.581    1.00     1.00     0.833    0.833

Table 1. Quantitative results of 3D salient object detection. Our method is compared with baselines using the average precision (AP) of 3D bounding box IoU with different thresholds.
Baselines. To the best of our knowledge, there is no existing baseline that holds the same setting as our coarse decomposition pipeline, which detects 3D salient objects from SfM point clouds without manual annotation for model training. Therefore, we devise two straightforward pipelines for comparison to demonstrate the effectiveness of our design. The first baseline is TokenCut + segmentation aggregation (TokenCut + Seg. Agg.). We first use TokenCut [41] for 2D salient object segmentation on each image and then aggregate multi-view segmentation confidences to the SfM point cloud by averaging. Finally, we segment the SfM point cloud with a threshold of 0.5 to determine the salient object's 3D region. The other baseline is our neural point cloud + NCut-based segmentation (Ours NCut), which is used to generate pseudo-GTs for training Ours Transformer.

Evaluation metrics. We use the Intersection-over-Union (IoU) metric with thresholds of 0.5 and 0.7 to evaluate the bounding box accuracy. The average precision (AP) is used for comparison.

Results. As shown in Table 1, our approach achieves substantially better salient object detection performance on all datasets, especially on the challenging CO3D [29] dataset. Instead of individually segmenting 2D images as in TokenCut + Seg. Agg., our strategy of aggregating multi-view 2D features and performing segmentation in 3D facilitates global reasoning about the salient object and eliminates multi-view inconsistency. The proposed Ours Transformer also outperforms the Ours NCut baseline on most datasets and metrics, although it is trained on pseudo-GTs generated by Ours NCut. We attribute this improvement to Ours Transformer's ability to accept point clouds with a higher density as inputs, its ability to capture global dependencies, and the extra learning on the dataset generated by Ours NCut.

Figure 6. Qualitative results of salient object reconstruction. Our method is compared with NeuS on the CO3D and BlendedMVS datasets. We present results with and without our explicit regularization to illustrate its effectiveness. (Columns: NeuS; Ours; Ours w/o regularization.)

Scan ID                         1      2      3      4      5      6      Mean
NeuS (w/ annotated fg. region)  0.390  0.216  0.245  0.223  0.345  0.271  0.282
Ours (fully-automated)          0.411  0.200  0.240  0.218  0.379  0.264  0.285

Table 2. Quantitative results on the BlendedMVS dataset. We normalize the GT mesh so that its longest side equals one. Results on the Chamfer l2 distance are presented as percentages.

5.3. Object reconstruction and 2D segmentation

We evaluate the reconstructed object geometry and 2D foreground mask renderings to demonstrate the capability of our approach to reconstruct and segment complex objects.

Baselines. For 3D reconstruction, we compare our method with the neural surface reconstruction baseline NeuS [40]. For the evaluation of 2D segmentation, the proposed method is compared with the following baselines in two categories: 1) the single-view image segmentation baseline TokenCut [41], which performs 2D salient object segmentation on each image and does not consider multi-view information; and 2) the multi-view image segmentation baseline SemanticNeRF [51], which fuses noisy masks with a neural field and produces high-quality masks with neural rendering. Specifically, we use the segmentations from TokenCut as inputs for SemanticNeRF and evaluate its mask renderings.

Evaluation metrics. We evaluate 3D reconstruction with the Chamfer l2 distance. Mask IoU and Boundary IoU [4] metrics are used to evaluate 2D segmentation, with the former reflecting overall segmentation quality and the latter focusing on boundary quality. The definitions of these metrics can be found in the appendix.

Results. For foreground object reconstruction, qualitative results are shown in Figs. 5 and 6, and quantitative results on the BlendedMVS dataset are presented in Table 2. The proposed fully-automated pipeline achieves comparable or better reconstruction quality compared with NeuS, which is provided with manually annotated object regions and requires manual post-processing for background removal. Our pipeline eliminates this tedious labor and thus demonstrates the potential to automatically create large-scale datasets. Our method also achieves better 2D segmentation accuracy on most of the evaluated scans, as shown in Table 3 and visualized in Fig. 7. The results of the 2D salient object segmentation baseline TokenCut lack multi-view consistency and are noisy on scans with complex backgrounds. SemanticNeRF brings improvement over the initial TokenCut segmentations on some scans. The proposed method can handle complex objects and backgrounds and outperforms these baselines significantly on the Boundary IoU metric, which demonstrates its capability of producing high-quality segmentations.
                         CO3D                                BlendedMVS                          DTU
Scan ID                  1     2     3     4     5     Mean  1     2     3     4     5     Mean  1     2     3     4     5     Mean
Mask IoU
Ours                     0.933 0.951 0.958 0.962 0.934 0.947 0.959 0.987 0.916 0.936 0.977 0.955 0.931 0.969 0.961 0.959 0.903 0.945
TokenCut (single-view)   0.784 0.888 0.976 0.975 0.966 0.918 0.785 0.904 0.919 0.855 0.943 0.881 0.829 0.921 0.905 0.955 0.971 0.916
TokenCut + SemanticNeRF  0.825 0.861 0.952 0.980 0.914 0.906 0.972 0.906 0.924 0.877 0.941 0.924 0.828 0.921 0.907 0.957 0.975 0.918
Boundary IoU
Ours                     0.912 0.937 0.839 0.771 0.843 0.860 0.816 0.914 0.767 0.896 0.817 0.842 0.628 0.842 0.752 0.707 0.613 0.877
TokenCut (single-view)   0.635 0.832 0.877 0.839 0.887 0.814 0.493 0.562 0.664 0.688 0.695 0.620 0.572 0.693 0.525 0.636 0.803 0.646
TokenCut + SemanticNeRF  0.701 0.819 0.847 0.822 0.769 0.792 0.512 0.578 0.699 0.730 0.642 0.632 0.539 0.633 0.522 0.661 0.836 0.638

Table 3. Quantitative results of 2D segmentation. We compare our foreground mask renderings with baselines on Mask IoU and Boundary IoU metrics on multiple datasets, including CO3D, BlendedMVS, and DTU. AutoRecon outperforms baselines on most of the scans.
Figure 7. Qualitative results of 2D segmentation. We show foreground segmentation on the challenging chair category in CO3D and an object in BlendedMVS with complex geometry. (Columns: Ref. Image; TokenCut; TokenCut + SemanticNeRF; Ours. Rows: CO3D; BlendedMVS.)

5.4. Ablation studies

We conduct experiments to validate the effectiveness of our point cloud segmentation Transformer for the coarse decomposition and of the regularization terms for training our decomposed neural scene representation. More ablation studies are provided in the supplementary material.

Segmentation Transformer for coarse decomposition. We show the effectiveness of our 3D segmentation Transformer over our NCut-based pipeline by the higher 3D detection AP on multiple datasets, as shown in Table 1. Results show that although trained with pseudo-labels, the 3D detection accuracy improves significantly, especially on the CO3D dataset. Moreover, Ours Transformer runs ∼100× faster than Ours NCut when segmenting a point cloud with 10k points and is applicable to large-scale point clouds.

Explicit regularization for training the decomposed neural scene representation. The qualitative results in Fig. 6 demonstrate the effectiveness of explicit regularization in disentangling foreground objects from their surroundings. Regularization provided by the coarse decomposition also alleviates the shape-radiance ambiguity, as shown in the chair example.

6. Conclusion

We present a novel pipeline for fully-automated object discovery and reconstruction from multi-view images, without any human annotation. Experiments conducted on multiple real-world datasets show the effectiveness of our method in building high-quality background-free object models. We also demonstrate the capability of our pipeline in producing high-quality segmentation masks, which are directly applicable to 2D supervised learning.

Limitations and future work. Problems faced by neural reconstruction methods remain in our pipeline, such as sensitivity to shadows and transient occluders and degraded results on thin-structured and non-Lambertian objects. Storing multi-view ViT features is memory-intensive, which we expect can be alleviated by distance-preserving compression techniques. The reconstruction quality of SfM point clouds can be further improved with refinement methods like [11, 17], which can further improve the quality of surface reconstruction and potentially eliminate reconstruction ambiguities. Our automated object reconstruction pipeline can be used to create large-scale 3D object datasets for graphics and perception tasks, such as training 2D segmentation networks and 3D generative models.

Acknowledgement. This work was supported by NSFC (No. 62172364), the ZJU-SenseTime Joint Lab of 3D Vision, and the Information Technology Center and State Key Lab of CAD&CG, Zhejiang University.
References

[1] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
[2] Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. MONet: Unsupervised scene decomposition and representation. arXiv:1901.11390, 2019.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In CVPR, 2021.
[4] Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C Berg, and Alexander Kirillov. Boundary IoU: Improving object-centric image segmentation evaluation. In CVPR, 2021.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[6] Martin Engelcke, Adam R. Kosiorek, Oiwi Parker Jones, and Ingmar Posner. GENESIS: Generative scene inference and sampling with object-centric latent representations. In ICLR, 2020.
[7] Qiancheng Fu, Qingshan Xu, Yew-Soon Ong, and Wenbing Tao. Geo-Neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction. In NeurIPS, 2022.
[8] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In CVPR, 2010.
[9] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In ICML, 2019.
[10] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In ICML, 2020.
[11] Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-free one-shot object pose estimation without CAD models. In NeurIPS, 2022.
[12] Rasmus Ramsbøl Jensen, Anders Lindbjerg Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, 2014.
[13] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, 2020.
[14] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing NeRF for editing via feature field distillation. In NeurIPS, 2022.
[15] Amit Pal Singh Kohli, Vincent Sitzmann, and Gordon Wetzstein. Semantic implicit neural scene representations with semi-supervised training. In 3DV, 2020.
[16] Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas Funkhouser. Panoptic neural fields: A semantic object-aware neural scene representation. In CVPR, 2022.
[17] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In ICCV, 2021.
[18] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In NeurIPS, 2020.
[19] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM TOG, 2019.
[20] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. SIGGRAPH Comput. Graph., 1987.
[21] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In CVPR, 2022.
[22] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021.
[23] Ashkan Mirzaei, Yash Kant, Jonathan Kelly, and Igor Gilitschenski. LaTeRF: Label and text driven object radiance fields. In ECCV, 2022.
[24] Norman Müller, Andrea Simonelli, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, and Peter Kontschieder. AutoRF: Learning 3D object radiance fields from single view observations. In CVPR, 2022.
[25] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM TOG, 2022.
[26] Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In CVPR, 2020.
[27] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In ICCV, 2021.
[28] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. In CVPR, 2021.
[29] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
[30] Zhongzheng Ren, Aseem Agarwala, Bryan Russell, Alexander G Schwing, and Oliver Wang. Neural volumetric object selection. In CVPR, 2022.
[31] Mehdi S. M. Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd van Steenkiste, Filip Pavetic, Mario Lucic, Leonidas Guibas, Klaus Greff, and Thomas Kipf. Object scene representation transformer. In NeurIPS, 2022.
[32] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[33] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE TPAMI, 2000.
[34] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. In BMVC, 2021.
[35] Cameron Smith, Hong-Xing Yu, Sergey Zakharov, Fredo Durand, Joshua B. Tenenbaum, Jiajun Wu, and Vincent Sitzmann. Unsupervised discovery and composition of object light fields. arXiv:2205.03923, 2022.
[36] Karl Stelzner, Kristian Kersting, and Adam R Kosiorek. Decomposing 3D scenes into objects via unsupervised volume segmentation. arXiv:2104.01148, 2021.
[37] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In CVPR, 2021.
[38] Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural Feature Fusion Fields: 3D distillation of self-supervised 2D image representations. In 3DV, 2022.
[39] Suhani Vora, Noha Radwan, Klaus Greff, Henning Meyer, Kyle Genova, Mehdi S. M. Sajjadi, Etienne Pot, Andrea Tagliasacchi, and Daniel Duckworth. NeSF: Neural semantic fields for generalizable semantic segmentation of 3D scenes. arXiv:2111.13260, 2021.
[40] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NeurIPS, 2021.
[41] Yangtao Wang, Xi Shen, Shell Hu, Yuan Yuan, James Crowley, and Dominique Vaufreydaz. Self-supervised transformers for unsupervised object discovery using normalized cut. In CVPR, 2022.
[42] Qianyi Wu, Xian Liu, Yuedong Chen, Kejie Li, Chuanxia Zheng, Jianfei Cai, and Jianmin Zheng. Object-compositional neural implicit surfaces. In ECCV, 2022.
[43] Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Learning object-compositional neural radiance field for editable scene rendering. In ICCV, 2021.
[44] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[45] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[46] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In NeurIPS, 2021.
[47] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. In NeurIPS, 2020.
[48] Hong-Xing Yu, Leonidas J. Guibas, and Jiajun Wu. Unsupervised discovery of object radiance fields. In ICLR, 2022.
[49] Andrii Zadaianchuk, Matthaeus Kleindessner, Yi Zhu, Francesco Locatello, and Thomas Brox. Unsupervised semantic segmentation with self-supervised object-centric representations. In ICLR, 2023.
[50] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv:2010.07492, 2020.
[51] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J Davison. In-place scene labelling and understanding with implicit scene representation. In ICCV, 2021.
[52] Shuaifeng Zhi, Edgar Sucar, Andre Mouton, Iain Haughton, Tristan Laidlow, and Andrew J Davison. iLabel: Revealing objects in neural fields. IEEE Robotics and Automation Letters, 2022.
