
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

Wenxuan Guo1*  Xiuwei Xu1*  Ziwei Wang2  Jianjiang Feng1†  Jie Zhou1  Jiwen Lu1
1Tsinghua University   2Nanyang Technological University
{gwx22,xxw21}@mails.tsinghua.edu.cn   [email protected]
{jfeng,jzhou,lujiwen}@tsinghua.edu.cn
arXiv:2502.10392v1 [cs.CV] 14 Feb 2025

Abstract

In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference due to their two-stage or point-based architecture. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, since in the 3D visual grounding task the 3D scene representation must interact deeply with text features, sparse convolution-based architectures are inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation and text features in an efficient way via gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features by cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively fixes over-pruned regions by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves top inference speed and surpasses the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in [email protected] on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D respectively. The code is available at https://github.com/GWxuan/TSP3D.

Figure 1. Comparison of 3DVG methods on the ScanRefer dataset [3], plotting accuracy ([email protected]) against inference speed (frames per second). Our TSP3D surpasses existing methods in both accuracy and inference speed, achieving the first efficient 3DVG framework.

* Equal contribution. † Corresponding author.

1. Introduction

Incorporating multi-modal information to guide 3D visual perception is a promising direction. In recent years, 3D visual grounding (3DVG), also known as 3D instance referencing, has received increasing attention as a fundamental multi-modal 3D perception task. The aim of 3DVG is to locate an object in the scene given a free-form query description. 3DVG is challenging since it requires understanding both the 3D scene and the language description. Recently, with the development of 3D scene perception and vision-language models, 3DVG methods have shown remarkable progress [16, 22]. However, with 3DVG being widely applied in fields like robotics and AR/VR where inference speed is the main bottleneck, how to construct an efficient real-time 3DVG model remains a challenging problem.

Since the output format of 3DVG is similar to that of 3D object detection, early 3DVG methods [3, 14, 38, 39] usually adopt a two-stage framework, which first conducts detection to locate all objects in the scene and then selects the target object by incorporating text information. As there are many similarities between 3D object detection and 3DVG (e.g., both need to extract a representation of the 3D scene), much redundant feature computation arises from the independent adoption of the two models. As a result, two-stage methods are usually hard-pressed to handle real-time tasks. To solve this problem, single-stage methods [22, 35] have been presented, which generate the bounding box of the target directly from point clouds. This integrated design is more compact and efficient. However,
current single-stage 3DVG methods mainly build on a point-based architecture [25], where feature extraction contains time-consuming operations like farthest point sampling and kNN. They also need to aggressively downsample the point features to reduce computational cost, which may hurt the geometric information of small and thin objects [37]. For these reasons, current single-stage methods are still far from real-time (< 6 FPS) and their performance is inferior to that of two-stage methods, as shown in Fig. 1.

In this paper, we propose a new single-stage framework for 3DVG based on text-guided sparse voxel pruning, namely TSP3D. Inspired by state-of-the-art 3D object detection methods [29, 37], which achieve both leading accuracy and speed with multi-level sparse convolutional architectures, we build the first sparse single-stage 3DVG network. However, different from 3D object detection, in 3DVG the 3D scene representation should interact deeply with text features. Since the number of voxels is very large in a sparse convolution-based architecture, deep multi-modal interaction like cross-attention becomes infeasible due to unaffordable computational cost. To this end, we propose text-guided pruning (TGP), which first utilizes text information to jointly sparsify the 3D scene representation and enhance the voxel and text features. To mitigate the effect of pruning on delicate geometric information, we further present completion-based addition (CBA) to adaptively fix over-pruned regions with negligible computational overhead. Specifically, TGP prunes the voxel features according to the object distribution. It gradually removes background features and features of irrelevant objects, which generates text-aware voxel features around the target object for accurate bounding box prediction. Since pruning may mistakenly remove the representation of the target object, CBA utilizes text features to query a small set of voxel features from the complete backbone features, followed by pruning-aware addition to fix the over-pruned region. We conduct extensive experiments on the popular ScanRefer [3] and ReferIt3D [2] datasets. Compared with previous single-stage methods, TSP3D achieves top inference speed and surpasses the previous fastest single-stage method by 100% in FPS. TSP3D also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in [email protected] on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D respectively.

To summarize, our main contributions are as follows:
• To the best of our knowledge, this is the first work exploring sparse convolutional architectures for efficient 3DVG.
• To enable efficient feature extraction, we propose text-guided pruning and completion-based addition to sparsify voxels and adaptively fuse multi-level features.
• We conduct extensive experiments, and TSP3D outperforms existing methods in both accuracy and speed, demonstrating the superiority of the proposed framework.

2. Related Work

2.1. 3D Visual Grounding

3D visual grounding aims to locate a target object within a 3D scene based on natural language descriptions [19]. Existing methods are typically categorized into two-stage and single-stage approaches. Two-stage methods follow a detect-then-match paradigm. In the first stage, they independently extract features from the language query using pre-trained language models [7, 9, 24] and predict candidate 3D objects using pre-trained 3D detectors [21, 26] or segmenters [4, 17, 32]. In the second stage, they focus on aligning the vision and text features to identify the target object. Techniques for feature fusion include attention mechanisms with Transformers [13, 40], contrastive learning [1], and graph-based matching [10, 14, 39]. In contrast, single-stage methods integrate object detection and feature extraction, allowing for direct identification of the target object. Methods in this category include guiding keypoint selection using textual features [22], and measuring similarity between words and objects inspired by 2D image-language pre-trained models like GLIP [18], as in BUTD-DETR [16]. Methods like EDA [35] and G3-LQ [34] advance single-stage 3D visual grounding by enhancing multimodal feature discriminability through explicit text decoupling, dense alignment, and semantic-geometric modeling. MCLN [27] uses the 3D referring expression segmentation task to assist 3DVG in improving performance. However, existing two-stage and single-stage methods generally have high computational costs, hindering real-time applications. Our work aims to address these efficiency challenges by proposing an efficient single-stage method with a multi-level sparse convolutional architecture.

2.2. Multi-Level Convolutional Architectures

Recently, sparse convolutional architectures have achieved great success in the field of 3D object detection. Built on voxel-based representations [5, 8, 33] and sparse convolution operations [6, 11, 36], this kind of method shows great efficiency and accuracy when processing scene-level data. GSDN [12] first adopts multi-level sparse convolution with generative feature upsampling in 3D object detection. FCAF3D [29] simplifies the multi-level architecture with an anchor-free design, achieving leading accuracy and speed. TR3D [30] further accelerates FCAF3D by removing unnecessary layers and introducing a category-aware proposal assignment method. Moreover, DSPDet3D [37] introduces the multi-level architecture to 3D small object detection. Our proposed method draws inspiration from these approaches, utilizing a sparse multi-level architecture with sparse convolutions and an anchor-free design. This allows for efficient processing of 3D data, enabling real-time performance in 3D visual grounding tasks.
3. Method

In this section, we describe TSP3D for efficient single-stage 3DVG. We first analyze existing pipelines to identify current challenges and motivate our approach (Sec. 3.1). We then introduce text-guided pruning, which leverages text features to guide feature pruning (Sec. 3.2). To address the potential risk of pruning key information, we propose completion-based addition for multi-level feature fusion (Sec. 3.3). Finally, we detail the training loss (Sec. 3.4).

3.1. Architecture Analysis for 3DVG

Top-performing 3DVG methods [31, 34, 35] are mainly two-stage, i.e., a serial combination of 3D object detection and 3D object grounding. These separate calls of two approaches result in redundant feature extraction and a complex pipeline, making two-stage methods less efficient. To demonstrate the efficiency of existing methods, we compare the accuracy and speed of several representative methods on ScanRefer [3], as shown in Fig. 1. It can be seen that two-stage methods struggle in speed (< 3 FPS) due to the additional detection stage. Since 3D visual grounding is usually adopted in practical scenarios that require real-time inference under limited resources, such as embodied robots and VR/AR, the low speed of two-stage methods makes them less practical. On the other side, single-stage methods [22], which directly predict the referred bounding box from the observed 3D scene, are more suitable choices due to their streamlined processes. In Fig. 1, it can be observed that single-stage methods are significantly more efficient than their two-stage counterparts.

However, existing single-stage methods are mainly built on a point-based backbone [25], where the scene representation is extracted with time-consuming operations like farthest point sampling and set abstraction. They also employ a large transformer decoder to fuse text and 3D features over several iterations. Therefore, the inference speed of current single-stage methods is still far from real-time (< 6 FPS). The inference speed of specific components in different frameworks is analyzed and discussed in detail in the supplementary material. Inspired by the success of multi-level sparse convolutional architectures in 3D object detection [30], which achieve both leading accuracy and speed, we propose to build the first multi-level convolutional single-stage 3DVG pipeline.

TSP3D-B. Here we propose a baseline framework based on sparse convolution, namely TSP3D-B. Following the simple and effective multi-level architecture of FCAF3D [29], TSP3D-B utilizes three levels of sparse convolutional blocks for scene representation extraction and bounding box prediction, as shown in Fig. 2 (a). Specifically, the input point cloud P ∈ R^{N×6} with 6-dim features (3D position and RGB) is first voxelized and then fed into three sequential MinkResBlocks [6], which generate three levels of voxel features V_l (l = 1, 2, 3). As l increases, the spatial resolution of V_l decreases and the context information increases. Concurrently, the free-form text with l words is encoded by pre-trained RoBERTa [20] to produce the vanilla text tokens T ∈ R^{l×d}. With the extracted 3D and text representations, we iteratively upsample V_3 and fuse it with T to generate a high-resolution and text-aware scene representation:

    U_l = U_l^G + V_l,  U_l^G = GeSpConv(U'_{l+1})    (1)
    U'_{l+1} = Concat(U_{l+1}, T)    (2)

where U_3 = V_3 and GeSpConv denotes generative sparse convolution [12] with stride 2, which upsamples the voxel features and expands their spatial locations for better bounding box prediction. Concat is voxel-wise feature concatenation by duplicating T. The final upsampled feature map U_1 is concatenated with T and fed into a convolutional head to predict the objectness scores and regress the 3D bounding box. We select the box with the highest objectness score as the grounding result.

As shown in Fig. 1, TSP3D-B achieves an inference speed of 14.58 FPS, which is significantly faster than previous single-stage methods and demonstrates great potential for real-time 3DVG.

3.2. Text-guided Pruning

Though efficient, TSP3D-B exhibits poor performance due to the inadequate interaction between the 3D scene representation and text features. Motivated by previous 3DVG methods [16], a simple solution is to replace Concat with cross-modal attention to process voxel and text features, as shown in Fig. 2 (b). However, different from point-based architectures where the scene representation is usually aggressively downsampled, the number of voxels in a multi-level convolutional framework is very large.¹ In practical implementation, we find that the voxels expand almost exponentially with each upsampling layer, leading to a substantial computational burden for the self-attention and cross-attention over scene features. To address this issue, we introduce text-guided pruning (TGP) to construct TSP3D, as illustrated in Fig. 2 (c). The core idea of TGP is to reduce the feature amount by pruning redundant voxels and to guide the network to gradually focus on the final target based on textual features.

Overall Architecture. TGP can be regarded as a modified version of cross-modal attention, which reduces the number of voxels before the attention operation, thereby lowering the computational cost. To minimize the effect of pruning on the final prediction, we propose to prune the scene representation gradually.

¹ Compared to point-based architectures, the sparse convolutional framework provides higher resolution and more detailed scene representations, while also offering advantages in inference speed. For detailed statistics, please refer to the supplementary material.
Figure 2. Illustration of TSP3D. TSP3D builds on a multi-level sparse convolutional architecture. It iteratively upsamples the voxel features with text-guided pruning (TGP), and fuses multi-level features via completion-based addition (CBA). (a) to (d) on the right side illustrate various options for feature upsampling. (a) refers to simple concatenation with text features, which is fast but less accurate. (b) refers to feature interaction through cross-modal attention mechanisms, which is constrained by the large number of voxels. (c) represents our proposed TGP, which first prunes voxel features under textual guidance and thus enables efficient interaction between voxel and text features. (d) shows a simplified version of TGP that removes farthest point sampling and interpolation, combines the multi-modal feature interactions into one module and moves it before pruning.

At a higher level, where the number of voxels is not yet too large, TGP prunes fewer voxels. At a lower level, where the number of voxels is significantly increased by the upsampling operation, TGP prunes the voxel features more aggressively. The multi-level architecture of TSP3D consists of three levels and includes two feature upsampling operations. Therefore, we correspondingly configure two TGPs with different functions, referred to as scene-level TGP (level 3 to 2) and target-level TGP (level 2 to 1) respectively. Scene-level TGP aims to distinguish between objects and the background, specifically pruning the voxels on the background. Target-level TGP focuses on regions mentioned in the text, intending to preserve the target object and referential objects while removing other regions.

Details of TGP. Since the pruning is relevant to the description, we need to make the voxel features text-aware to predict a proper pruning mask. To reduce the computational cost, we perform farthest point sampling (FPS) on the voxel features to reduce their size while preserving the basic distribution of the scene. Next, we utilize cross-attention to interact with the text features and employ a simple MLP to predict the probability distribution M̂ for retaining each voxel. To prune the features U_l, we binarize and interpolate M̂ to obtain the pruning mask. This process can be expressed as:

    U_l^P = U_l ⊙ Θ(I(M̂, U_l) − σ)    (3)
    M̂ = MLP(CrossAtt(FPS(U_l), SelfAtt(T)))    (4)

where U_l^P is the pruned features, Θ is the Heaviside step function, ⊙ is the element-wise product, σ is the pruning threshold, and I represents linear interpolation based on the positions specified by U_l. After pruning, the scale of the scene features is significantly reduced, enabling internal feature interactions based on self-attention. Subsequently, we utilize self-attention and cross-attention to perceive the relative relationships among objects within the scene and to fuse multimodal features, resulting in updated features U_l'. Finally, through generative sparse convolutions, we obtain U_{l−1}^G.

Supervision for Pruning. The binary supervision mask M^sce for scene-level TGP is generated based on the centers of all objects in the scene, and the mask M^tar for target-level TGP is based on the target and relevant objects mentioned in the descriptions:

    M^sce = ⋃_{i=1}^{N} M(O_i),  M^tar = M(O^tar) ∪ ⋃_{j=1}^{K} M(O_j^rel)    (5)

where {O_i | 1 ≤ i ≤ N} indicates all objects in the scene, and O^tar and O_j^rel refer to the target and relevant objects respectively. M(O) represents the mask generated from the center of object O: it generates an L × L × L cube centered at the center of O to construct the supervision mask M, where locations inside the cube are set to 1 while others are set to 0.

Simplification. Although the above method can effectively prune voxel features to reduce the computational cost of cross-modal attention, there are some inefficient operations in the pipeline: (1) FPS is time-consuming, especially for large scenes; (2) there are two interactions between voxel features and text features, the first to guide pruning and the second to enhance the representation, which is somewhat redundant. We also empirically observe that the number of voxels is not large in level 3.
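A minimal NumPy sketch of the pruning step (Eq. 3) and the cube-shaped supervision mask (Eq. 5). The attention/MLP pipeline of Eq. (4) is stubbed with fixed keep-probabilities, and all coordinates and features are toy values:

```python
import numpy as np

# Sparse voxels as (coords, feats); the scores m_hat stand in for Eq. (4).
coords = np.array([[2, 3, 4], [10, 10, 10], [3, 3, 3]])   # voxel coordinates
feats = np.random.rand(3, 8)                              # voxel features U_l
m_hat = np.array([0.9, 0.1, 0.8])                         # stub for Eq. (4)
sigma = 0.5                                               # pruning threshold

keep = m_hat > sigma                       # Theta(M_hat - sigma) in Eq. (3)
pruned_coords, pruned_feats = coords[keep], feats[keep]   # U_l^P

def cube_mask(coords, centers, L=7):
    """Eq. (5): a voxel is positive if inside an LxLxL cube around any center."""
    half = L // 2
    mask = np.zeros(len(coords), dtype=bool)
    for c in centers:
        mask |= np.all(np.abs(coords - c) <= half, axis=1)
    return mask

# Scene-level supervision from one object center (toy value)
M_sce = cube_mask(coords, centers=np.array([[3, 3, 3]]))
print(pruned_coords.tolist(), M_sce.tolist())
# [[2, 3, 4], [3, 3, 3]] [True, False, True]
```

During training the predicted probabilities are supervised against such cube masks; at inference only the thresholding in the first half is applied.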
To this end, we propose a simplified version of TGP, as shown in Fig. 2 (d). We remove the FPS and merge the two multi-modal interactions into one. We also move the merged interaction operation before pruning. In this way, voxel features and text features first interact deeply for both feature enhancement and pruning. Because in level 3 the number of voxels is small and in levels 2 and 1 the voxels are already pruned, the computational cost of self-attention and cross-attention is always kept at a relatively low level.

Effectiveness of TGP. After pruning, the voxel count of U_1 is reduced to nearly 7% of its size in the variant without TGP, while the 3DVG performance is significantly boosted. TGP serves multiple functions, including: (1) facilitating the interaction of multi-modal features through cross-attention, (2) reducing the feature amount (number of voxels) through pruning, and (3) gradually guiding the network to focus on the mentioned target based on text features.

Figure 3. Illustration of completion-based addition. The upper figure (b) illustrates an example of over-pruning on the target. The lower figure (c) shows the completed features predicted by CBA.

3.3. Completion-based Addition

During the pruning process, some targets may be mistakenly removed, especially small or narrow objects, as shown in Fig. 3 (b). Therefore, the addition operation between the upsampled pruned features U_l^G and backbone features V_l described in Equation (1) plays an important role in mitigating the effect of over-pruning.

There are two alternative addition operations: (1) Full Addition. For the intersecting regions of V_l and U_l^G, features are directly added. For voxel features outside the intersection of U_l^G and V_l, which lack corresponding features in the other map, the missing voxel features are interpolated before addition. Due to the pruning process, U_l^G is sparser than V_l. In this way, full addition can fix almost all of the pruned region. But this operation is computationally heavy and makes the scene representation fail to focus on relevant objects, which deviates from the core idea of TGP. (2) Pruning-aware Addition. The addition is constrained to the locations of U_l^G. For voxels in U_l^G but not in V_l, interpolation from U_l^G is applied to complete the missing locations in V_l. It restricts the addition operation to the shape of the pruned features, potentially leading to an over-reliance on the results of the pruning process. If important regions are over-pruned, the network may struggle to detect targets with severely damaged geometric information.

Considering the unavoidable risk of pruning the query target, we introduce completion-based addition (CBA). CBA is designed to address the limitations of full and pruning-aware addition. It offers a more targeted and efficient way to integrate multi-level features, ensuring the preservation of essential details while keeping the additional computational overhead negligible.

Details of CBA. We first enhance the backbone features V_l with the text features T through cross-attention, obtaining V_l'. Then an MLP is adopted to predict the probability distribution of the target for region selection:

    M_l^tar = Θ(MLP(V_l') − τ)    (6)

where Θ is the step function and τ is the threshold determining voxel relevance. M_l^tar is a binary mask indicating potential regions of the mentioned target. Then, comparison of M_l^tar with U_l identifies missing voxels. The missing mask M_l^mis is derived as follows:

    M_l^mis = M_l^tar ∧ (¬ C(U_l^G, V_l))    (7)

where C(A, B) denotes the generation of a binary mask for A based on the shape of B: for positions in B, if there are corresponding voxel features in A, the mask at that position is set to 1; otherwise it is set to 0. Missing voxel features corresponding to M_l^mis are interpolated from U_l^G, filling in the gaps identified by the missing mask. The completed feature map U_l^cpl is computed by:

    U_l^cpl = V_l' ⊙ M_l^mis + I(U_l^G, M_l^mis)    (8)

where I represents linear interpolation on the feature map based on the positions specified in the mask. Finally, the original upsampled features are combined with the backbone features according to pruning-aware addition, and merged with the completion features to yield the updated U_l:

    U_l = Concat(U_l^G ← V_l, U_l^cpl)    (9)

where ← denotes pruning-aware addition, and Concat means concatenation of voxel features.
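The mask logic of Eqs. (6)-(7) can be sketched with coordinate sets. The MLP scores and all coordinates are toy stand-ins, and the feature interpolation of Eq. (8) is omitted; only how CBA finds over-pruned voxels is shown:

```python
# Voxel positions as coordinate sets: V_l is the (dense) backbone map,
# U_l^G is the pruned-and-upsampled map, so U_l^G is a subset of scene space.
V_coords = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}   # backbone V_l
UG_coords = {(0, 0, 0), (1, 0, 0)}                        # pruned U_l^G

# Eq. (6): target mask from (stubbed) MLP scores on text-enhanced V_l'
scores = {(0, 0, 0): 0.9, (1, 0, 0): 0.05, (2, 0, 0): 0.8, (3, 0, 0): 0.1}
tau = 0.15
M_tar = {c for c, s in scores.items() if s > tau}

# Eq. (7): likely-target voxels that exist in V_l but were pruned from U_l^G
C = V_coords & UG_coords        # C(U_l^G, V_l): positions present in both maps
M_mis = M_tar - C               # M_tar AND NOT C -> voxels to complete

print(sorted(M_mis))            # [(2, 0, 0)]
```

Voxel (2, 0, 0) scores high for the target but is absent from U_l^G, so CBA would interpolate its feature (Eq. 8) and merge it back (Eq. 9); in the real model the sets are sparse-tensor coordinate maps rather than Python sets.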
Table 1. Comparison of methods on the ScanRefer dataset evaluated at IoU thresholds of 0.25 and 0.5. TSP3D achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead on [email protected]. Notably, we are the first to comprehensively evaluate inference speed for 3DVG methods. The inference speeds of other methods are obtained through our reproduction.

| Method | Venue | Input | [email protected] | [email protected] | Speed (FPS) |
|---|---|---|---|---|---|
| Two-stage models | | | | | |
| ScanRefer [3] | ECCV'20 | 3D+2D | 41.19 | 27.40 | 6.72 |
| TGNN [14] | AAAI'21 | 3D | 37.37 | 29.70 | 3.19 |
| InstanceRefer [39] | ICCV'21 | 3D | 40.23 | 30.15 | 2.33 |
| SAT [38] | ICCV'21 | 3D+2D | 44.54 | 30.14 | 4.34 |
| FFL-3DOG [10] | ICCV'21 | 3D | 41.33 | 34.01 | Not released |
| 3D-SPS [22] | CVPR'22 | 3D+2D | 48.82 | 36.98 | 3.17 |
| BUTD-DETR [16] | ECCV'22 | 3D | 50.42 | 38.60 | 3.33 |
| EDA [35] | CVPR'23 | 3D | 54.59 | 42.26 | 3.34 |
| 3D-VisTA [41] | ICCV'23 | 3D | 45.90 | 41.50 | 2.03 |
| VPP-Net [31] | CVPR'24 | 3D | 55.65 | 43.29 | Not released |
| G3-LQ [34] | CVPR'24 | 3D | 56.90 | 45.58 | Not released |
| MCLN [27] | ECCV'24 | 3D | 57.17 | 45.53 | 3.17 |
| Single-stage models | | | | | |
| 3D-SPS [22] | CVPR'22 | 3D | 47.65 | 36.43 | 5.38 |
| BUTD-DETR [16] | ECCV'22 | 3D | 49.76 | 37.05 | 5.91 |
| EDA [35] | CVPR'23 | 3D | 53.83 | 41.70 | 5.98 |
| G3-LQ [34] | CVPR'24 | 3D | 55.95 | 44.72 | Not released |
| MCLN [27] | ECCV'24 | 3D | 54.30 | 42.64 | 5.45 |
| TSP3D (Ours) | — | 3D | 56.45 | 46.71 | 12.43 |

Table 2. Quantitative comparisons on Nr3D and Sr3D datasets. We evaluate under three pipelines, noting that the two-stage pipeline using ground-truth boxes is impractical for real-world applications. TSP3D exhibits significant superiority, with leads of +2.6% and +3.2% on Nr3D and Sr3D respectively.

| Method | Venue | Pipeline | Nr3D | Sr3D |
|---|---|---|---|---|
| InstanceRefer [39] | ICCV'21 | Two-stage (gt) | 38.8 | 48.0 |
| LanguageRefer [28] | CoRL'22 | Two-stage (gt) | 43.9 | 56.0 |
| 3D-SPS [22] | CVPR'22 | Two-stage (gt) | 51.5 | 62.6 |
| MVT [15] | CVPR'22 | Two-stage (gt) | 55.1 | 64.5 |
| BUTD-DETR [16] | ECCV'22 | Two-stage (gt) | 54.6 | 67.0 |
| EDA [35] | CVPR'23 | Two-stage (gt) | 52.1 | 68.1 |
| VPP-Net [31] | CVPR'24 | Two-stage (gt) | 56.9 | 68.7 |
| G3-LQ [34] | CVPR'24 | Two-stage (gt) | 58.4 | 73.1 |
| MCLN [27] | ECCV'24 | Two-stage (gt) | 59.8 | 68.4 |
| InstanceRefer [39] | ICCV'21 | Two-stage (det) | 29.9 | 31.5 |
| LanguageRefer [28] | CoRL'22 | Two-stage (det) | 28.6 | 39.5 |
| BUTD-DETR [16] | ECCV'22 | Two-stage (det) | 43.3 | 52.1 |
| EDA [35] | CVPR'23 | Two-stage (det) | 40.7 | 49.9 |
| MCLN [27] | ECCV'24 | Two-stage (det) | 46.1 | 53.9 |
| 3D-SPS [22] | CVPR'22 | Single-stage | 39.2 | 47.1 |
| BUTD-DETR [16] | ECCV'22 | Single-stage | 38.7 | 50.1 |
| EDA [35] | CVPR'23 | Single-stage | 40.0 | 49.7 |
| MCLN [27] | ECCV'24 | Single-stage | 45.7 | 53.4 |
| TSP3D (Ours) | — | Single-stage | 48.7 | 57.1 |
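The [email protected] metrics reported in Tabs. 1 and 2 count a prediction as correct when the 3D IoU between the predicted and ground-truth boxes exceeds the threshold. A minimal sketch for axis-aligned boxes, with toy box values:

```python
# Boxes as (xmin, ymin, zmin, xmax, ymax, zmax).
def iou3d(a, b):
    """Volume IoU of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        inter *= max(0.0, hi - lo)          # overlap along each axis
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    return inter / (vol(a) + vol(b) - inter)

def acc_at_iou(preds, gts, thr=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds thr."""
    hits = sum(iou3d(p, g) > thr for p, g in zip(preds, gts))
    return hits / len(gts)

p = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]   # predicted boxes
g = [(0, 0, 0, 2, 2, 2), (5, 5, 5, 6, 6, 6)]   # ground-truth boxes
print(acc_at_iou(p, g, 0.5))   # 0.5: first box matches exactly, second misses
```

ScanRefer reports this at thresholds 0.25 and 0.5; Nr3D/Sr3D provide object boxes, so accuracy there reduces to selecting the right box ([email protected] in Tab. 2).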
3.4. Training Loss

The loss is composed of several components: a pruning loss for TGP, a completion loss for CBA, and an objectness loss as well as a bounding box regression loss for the head. The pruning, completion, and objectness losses employ the focal loss to handle class imbalance. Supervision for the completion and classification losses is the same, which sets voxels near the target object center as positives while leaving others as negatives. For bounding box regression, we use the Distance-IoU (DIoU) loss. The total loss function is computed as the sum of these individual losses:

    L_total = λ1 L_pruning + λ2 L_com + λ3 L_class + λ4 L_bbox

where λ1, λ2, λ3 and λ4 are the weights of the different parts.

4. Experiments

4.1. Datasets

We maintain the same experimental settings as previous works, employing ScanRefer [3] and SR3D/NR3D [2] as datasets. ScanRefer: Built on ScanNet, ScanRefer includes 51,583 descriptions across scenes. Evaluation metrics focus on Acc@mIoU. ReferIt3D: ReferIt3D splits into Nr3D, with 41,503 human-generated descriptions, and Sr3D, containing 83,572 synthetic expressions. ReferIt3D simplifies the task by providing segmented point clouds for each object. The primary evaluation metric is accuracy in target object selection.

4.2. Implementation Details

TSP3D is implemented based on PyTorch [23]. The pruning thresholds are set at σ_sce = 0.7 and σ_tar = 0.3, and the completion threshold in CBA is τ = 0.15. The initial voxelization of the point cloud has a voxel size of 1 cm, while the voxel size for level-i features scales to 2^(i+2) cm. The supervision for pruning uses L = 7. The weights for all components of the loss function, λ1, λ2, λ3, λ4, are equal to 1. Training is conducted using four GPUs, while inference speeds are evaluated using a single consumer-grade GPU, an RTX 3090, with a batch size of 1.

4.3. Quantitative Comparisons

Performance on ScanRefer. We carry out comparisons with existing methods on ScanRefer, as detailed in Tab. 1. The inference speeds of other methods are obtained through our reproduction with a single RTX 3090 and a batch size of 1. For two-stage methods, the inference speed includes the time taken for object detection in the first stage. For methods using 2D image features and 3D point clouds as inputs, we do not account for the time spent extracting 2D features, assuming they can be obtained in advance; in practical applications, however, the acquisition of 2D features also impacts overall efficiency. TSP3D achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead on [email protected]. Notably, in the single-stage setting, TSP3D achieves fast inference speed, which is unprecedented among existing methods. This significant improvement is attributed to our method's efficient use of a multi-level architecture based on 3D sparse convolutions, coupled with text-guided pruning. By focusing computation only on salient regions of the point clouds, determined by textual cues, our model effectively reduces computational overhead while maintaining high accuracy. TSP3D also sets a benchmark for inference speed comparisons for future methodologies.

Performance on Nr3D/Sr3D. We evaluate our method on the SR3D and NR3D datasets, following the evaluation protocols of prior works like EDA [35] and BUTD-DETR [16] by using [email protected] as the accuracy metric. The results are shown in Tab. 2. Given that SR3D and NR3D provide ground-truth boxes and categories for all objects [...]

Table 3. Impact of the proposed TGP and CBA. Evaluated on ScanRefer.

| ID | TGP | CBA | [email protected] | [email protected] | Speed (FPS) |
|---|---|---|---|---|---|
| (a) | | | 40.13 | 32.87 | 14.58 |
| (b) | ✓ | | 55.20 | 46.15 | 13.22 |
| (c) | | ✓ | 41.34 | 33.09 | 13.51 |
| (d) | ✓ | ✓ | 56.45 | 46.71 | 12.43 |

Table 4. Influence of the two CBAs at different levels. Evaluated on ScanRefer.

| ID | CBA (level 2) | CBA (level 1) | [email protected] | [email protected] | Speed (FPS) |
|---|---|---|---|---|---|
| (a) | | | 55.20 | 46.15 | 13.22 |
| (b) | ✓ | | 55.17 | 46.06 | 12.79 |
| (c) | | ✓ | 56.45 | 46.71 | 12.43 |
| (d) | ✓ | ✓ | 56.22 | 46.68 | 12.19 |

Table 5. Influence of different feature upsampling methods. Evaluated on ScanRefer.

| ID | Method | [email protected] | [email protected] | Speed (FPS) |
|---|---|---|---|---|
| (a) | Simple concatenation | 40.13 | 32.87 | 14.58 |
| (b) | Attention mechanism | — | — | — |
| (c) | Text-guided pruning | 56.27 | 46.58 | 10.11 |
| (d) | Simplified TGP | 56.45 | 46.71 | 12.43 |

[...] pruning by TGP, thus increasing the network's robustness.

Influence of the Two CBAs. To explore the impact of CBAs at the two different levels, we conduct ablation experiments as depicted in Tab. 4. In the absence of CBA, we use pruning-aware addition as a substitute. The results indicate that the CBA at level 2 has negligible effects on the 3DVG task. This is primarily because the CBA at level 2 serves to supplement the scene-level TGP, which is expected to prune the background (a relatively simple task). Moreover, although some target features are pruned, they are compensated by two subsequent generative sparse convolutions. However, the CBA at level 1 enhances perfor-
in the scene, we consider three pipelines: (1) Two-stage mance by adapt completion for the target-level TGP. It is
using Ground-Truth Boxes, (2) Two-stage using Detected challenging to fully preserve target objects from deep up-
Boxes, and (3) Single-stage. In practical applications, the sampling features, especially for smaller or narrower tar-
Two-stage using Ground-Truth Boxes pipeline is unrealistic gets. The CBA at level 1, based on high-resolution back-
because obtaining all ground-truth boxes in a scene is infea- bone features, effectively complements the TGP.
sible. This approach can also oversimplify certain evalua- Feature Upsampling Techniques. We conduct experi-
tion scenarios. For example, if there are no other objects of ments to assess the effects of different feature upsampling
the same category as the target in the scene, the task reduces techniques, as detailed in Tab. 5. Using simple feature con-
to relying on the provided ground-truth category. Under the catenation (Fig. 2 (a)), while fast in inference speed, results
Single-stage setting, TSP3D exhibits significant superiority in poor performance. When we utilize an attention mech-
with peak performance of 48.7% and 57.1% on Nr3D and anism with stronger feature interaction, as shown in Fig. 2
Sr3D. TSP3D even outperforms previous works under the (b), the computation exceeds the limits of GPU due to the
pipeline of Two-stage using Detected Boxes, with leads of large number of voxels, making it impractical for real-world
+2.6% and +3.2% on NR3D and SR3D. applications. Consequently, we employ TGP to reduce the
feature amount, as illustrated in Fig. 2 (c), which signifi-
4.4. Ablation Study cantly improves performance and enables practical deploy-
ment. Building on TGP, we propose simplified TGP, as
Effectiveness of Proposed Components. To investigate
shown in Fig. 2 (d), that merges feature interactions before
the effects of our proposed TGP and CBA, we conduct ab-
and after pruning, achieving performance consistent with
lation experiments with module removal as shown in Tab. 3.
the original TGP while enhancing inference speed.
When TGP is not used, multi-modal feature concatenation
is employed as a replacement, as shown in Fig. 2 (a). When
4.5. Qualitative Results
CBA is not used, it is substituted with a pruning-based ad-
dition. The results demonstrate that TGP significantly en- Text-guided Pruning. To visually demonstrate the process
hances performance without notably impacting inference of TGP, we visualize the results of two pruning phases, as
time. This is because TGP, while utilizing a more complex shown in Fig. 4. In each example, the voxel features after
multi-modal attention mechanism for stronger feature fu- scene-level pruning, the features after target-level pruning,
sion, significantly reduces feature scale through text-guided and the features after target-level generative sparse convo-
pruning. Additionally, the performance improvement is also lution are displayed from top to bottom. It is evident that
due to the gradual guidance towards the target object by both pruning stages effectively achieve our intended effect:
both scene-level and target-level TGP. Using CBA alone the scene-level pruning filters out the background and re-
has a limited effect, as no voxels are pruned. Implement- tained object voxels, and the target-level pruning preserves
ing CBA on top of TGP further enhances performance, as relevant and target objects. Moreover, during the feature
CBA dynamically compensates for some of the excessive upsampling process, the feature amount nearly exponen-
(a) It is a gray trash can. (b) The chair is on the northwest side (c) A chrome water (d) There is a rectangular
The trash can sits in the of the table that is furthest northeast fountain. Is the lowest shelf. It is the one closest
in the room. The chair has a curved
corner by where the tv is. of two water fountains. to the curtain.
backside and four legs.

Figure 4. Visualization of the text-guided pruning process. In each example, the voxel features after scene-level TGP, target-level TGP
and the last upsampling layer are presented from top to bottom. The blue boxes represent the ground truth of the target, and the red
boxes denote the bounding boxes of relevant objects. TSP3D reduces the amount of voxel features through two stages of pruning and
progressively guides the network focusing towards the target.

(a) The brown square trash can is (b) The whiteboard is next to (c) This is a monitor in black color. (d) The book rack is left of the
The monitor is next to the wall bookshelf that looks like a right
to the right of the lamp. It is also two doors. The whiteboard is
angle with another. The book rack
behind the chair and to the right. a long, white rectangle. and at the back of maroon chair.
is black and has as circular bottom.

Figure 5. Visualization of the completion-based addition process. The blue points represent the voxel features output by the target-level
TGP, while the red points are the completion features predicted by the CBA. The blue boxes indicate the ground truth boxes. CBA
adaptively supplements situations where excessive pruning has occurred.

tially increases due to generative upsampling. Without TGP, 5. Conclusion


the voxel coverage would far exceed the range of the scene
point cloud, which is inefficient for inference. This also in- In this paper, we present TSP3D, an efficient sparse single-
tuitively explains the significant impact of our TGP on both stage method for real-time 3D visual grounding. Different
performance and inference speed. from previous 3D visual grounding frameworks, TSP3D
builds on multi-level sparse convolutional architecture for
Completion-based Addition. To clearly illustrate the efficient and fine-grained scene representation extraction.
function of CBA, we visualize the adaptive completion pro- To enable the interaction between voxel features and textual
cess in Fig. 5. The images below showcase several instances features, we propose text-guided pruning (TGP), which re-
of excessive pruning. TGP performs pruning based on deep duces the amount of voxel features and guides the network
and low-resolution features, which can lead to excessive to progressively focus on the target object. Additionally,
pruning, potentially removing entire or partial targets. This we introduce completion-based addition (CBA) for adap-
over-pruning is more likely to occur with small, as shown tive multi-level feature fusion, effectively compensating for
in Fig. 5 (a) and (c), narrow, as in Fig. 5 (b), or elongated instances of over-pruning. Extensive experiments demon-
targets, as in Fig. 5 (d). Our CBA effectively supplements strate the effectiveness of our proposed modules, resulting
the process using higher-resolution backbone features, thus in an efficient 3DVG method that achieves state-of-the-art
dynamically integrating multi-level features. accuracy and fast inference speed.
References

[1] Ahmed Abdelreheem, Ujjwal Upadhyay, Ivan Skorokhodov, Rawan Al Yahya, Jun Chen, and Mohamed Elhoseiny. 3dreftransformer: Fine-grained object identification in real-world scenes using natural language. In WACV, pages 3941–3950, 2022.
[2] Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In ECCV, pages 422–440. Springer, 2020.
[3] Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In ECCV, pages 202–221. Springer, 2020.
[4] Shaoyu Chen, Jiemin Fang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Hierarchical aggregation for 3d instance segmentation. In ICCV, pages 15467–15476, 2021.
[5] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In CVPR, pages 21674–21683, 2023.
[6] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In CVPR, pages 3075–3084, 2019.
[7] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[8] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In AAAI, pages 1201–1209, 2021.
[9] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] Mingtao Feng, Zhen Li, Qi Li, Liang Zhang, XiangDong Zhang, Guangming Zhu, Hui Zhang, Yaonan Wang, and Ajmal Mian. Free-form description guided 3d visual graph network for object grounding in point cloud. In ICCV, pages 3722–3731, 2021.
[11] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, pages 9224–9232, 2018.
[12] JunYoung Gwak, Christopher Choy, and Silvio Savarese. Generative sparse detection networks for 3d single-shot object detection. In ECCV, pages 297–313. Springer, 2020.
[13] Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, and Si Liu. Transrefer3d: Entity-and-relation aware transformer for fine-grained 3d visual grounding. In ACM MM, pages 2344–2352, 2021.
[14] Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, and Tyng-Luh Liu. Text-guided graph neural networks for referring 3d instance segmentation. In AAAI, pages 1610–1618, 2021.
[15] Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi-view transformer for 3d visual grounding. In CVPR, pages 15524–15533, 2022.
[16] Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, and Katerina Fragkiadaki. Bottom up top down detection transformers for language grounding in images and point clouds. In ECCV, pages 417–433. Springer, 2022.
[17] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. In CVPR, pages 4867–4876, 2020.
[18] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, pages 10965–10975, 2022.
[19] Daizong Liu, Yang Liu, Wencan Huang, and Wei Hu. A survey on text-guided 3d visual grounding: Elements, recent advances, and future directions. arXiv preprint arXiv:2406.05785, 2024.
[20] Yinhan Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[21] Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-free 3d object detection via transformers. In ICCV, pages 2949–2958, 2021.
[22] Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, and Si Liu. 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In CVPR, pages 16454–16463, 2022.
[23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 32, 2019.
[24] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
[25] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 30, 2017.
[26] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In ICCV, pages 9277–9286, 2019.
[27] Zhipeng Qian, Yiwei Ma, Zhekai Lin, Jiayi Ji, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. Multi-branch collaborative learning network for 3d visual grounding. In ECCV, pages 381–398. Springer, 2025.
[28] Junha Roh, Karthik Desingh, Ali Farhadi, and Dieter Fox. Languagerefer: Spatial-language model for 3d visual grounding. In CoRL, pages 1046–1056. PMLR, 2022.
[29] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Fcaf3d: Fully convolutional anchor-free 3d object detection. In ECCV, pages 477–493. Springer, 2022.
[30] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Tr3d: Towards real-time indoor 3d object detection. In ICIP, pages 281–285. IEEE, 2023.
[31] Xiangxi Shi, Zhonghua Wu, and Stefan Lee. Viewpoint-aware visual grounding in 3d scenes. In CVPR, pages 14056–14065, 2024.
[32] Thang Vu, Kookhoi Kim, Tung M Luu, Thanh Nguyen, and Chang D Yoo. Softgroup for 3d instance segmentation on point clouds. In CVPR, pages 2708–2717, 2022.
[33] Haiyang Wang, Lihe Ding, Shaocong Dong, Shaoshuai Shi, Aoxue Li, Jianan Li, Zhenguo Li, and Liwei Wang. Cagroup3d: Class-aware grouping for 3d object detection on point clouds. NeurIPS, 35:29975–29988, 2022.
[34] Yuan Wang, Yali Li, and Shengjin Wang. G^3-lq: Marrying hyperbolic alignment with explicit semantic-geometric modeling for 3d visual grounding. In CVPR, pages 13917–13926, 2024.
[35] Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, and Jian Zhang. Eda: Explicit text-decoupling and dense alignment for 3d visual grounding. In CVPR, pages 19231–19242, 2023.
[36] Xiuwei Xu, Ziwei Wang, Jie Zhou, and Jiwen Lu. Binarizing sparse convolutional networks for efficient point cloud analysis. In CVPR, pages 5313–5322, 2023.
[37] Xiuwei Xu, Zhihao Sun, Ziwei Wang, Hongmin Liu, Jie Zhou, and Jiwen Lu. 3d small object detection with dynamic spatial pruning. In ECCV. Springer, 2024.
[38] Zhengyuan Yang, Songyang Zhang, Liwei Wang, and Jiebo Luo. Sat: 2d semantics assisted training for 3d visual grounding. In ICCV, pages 1856–1866, 2021.
[39] Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Sheng Wang, Zhen Li, and Shuguang Cui. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In ICCV, pages 1791–1800, 2021.
[40] Lichen Zhao, Daigang Cai, Lu Sheng, and Dong Xu. 3dvg-transformer: Relation modeling for visual grounding on point clouds. In ICCV, pages 2928–2937, 2021.
[41] Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. In ICCV, pages 2911–2921, 2023.
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
Supplementary Material
We provide statistics and analysis for visual feature resolution (Sec. A), detailed comparisons of computational cost (Sec. B), detailed results on the ScanRefer dataset [3] (Sec. C), qualitative comparisons (Sec. D) and potential limitations (Sec. E) in this supplementary material.

A. Visual Feature Resolution of Different Architectures

To analyze the scene representation resolution of point-based and sparse convolutional architectures, we compare the resolution changes during the visual feature extraction process of EDA [35] and TSP3D-B, as illustrated in Fig. 6. For a thorough examination of the feature resolution of the sparse convolutional architecture, we consider TSP3D-B without incorporating TGP and CBA. The voxel numbers for TSP3D-B are based on the average statistics over the ScanRefer validation set. In point-based architectures, the number of point features is fixed and does not vary with the scene size. In contrast, the number of voxel features in sparse convolutional architectures tends to increase as the scene size grows. This adaptive adjustment ensures that features do not become excessively sparse when processing larger scenes. As shown in Fig. 6, point-based architectures perform aggressive downsampling, with the first downsampling step reducing 50,000 points to just 2,048 points. Moreover, the final scene representation consists of only 1,024 points, leading to a relatively coarse representation. By contrast, convolution-based architectures progressively downsample and refine the scene representation through a multi-level structure. Overall, the sparse convolutional architecture not only provides a high-resolution scene representation but also achieves faster inference speed compared to point-based architectures.

Figure 6. Feature resolution progression of point-based EDA and sparse convolutional TSP3D-B. SA, FP, SpConv, and FU represent set abstraction, feature propagation, sparse convolution, and feature upsampling, respectively. For the point-based architecture, the downsampling process is aggressive, with the first downsampling reducing 50,000 points directly to 2,048 points. Furthermore, the final scene representation consists of only 1,024 points. In contrast, the sparse convolutional architecture performs progressive downsampling and refines the scene representation through a multi-level structure. This approach not only provides a high-resolution scene representation but also achieves faster inference speed compared to the point-based architecture (14.58 FPS for TSP3D-B vs. 5.98 FPS for EDA).

B. Detailed Computational Cost of Different Architectures

We provide a detailed comparison of the inference speed of specific components across different architectures, as shown in Tab. 6. Two-stage methods tend to have slower inference speed and are significantly impacted by the efficiency of the detection stage, which is not the primary focus of the 3DVG task. Therefore, we focus our analysis solely on the computational cost of single-stage methods. We divide the networks of existing methods and TSP3D into several components: text decoupling, visual backbone, text backbone, multi-modal fusion, and the head. The inference speed of each of these components is measured separately.

Table 6. Detailed comparison of computational cost for different single-stage architectures on the ScanRefer dataset [3]. The numbers in the table represent frames per second (FPS). TSP3D demonstrates superior processing speed across all components compared to other methods, with the inference speed of the sparse convolution backbone being three times faster than that of the point-based backbone.

    Method            Text Decouple   Visual Backbone   Text Backbone   Multi-modal Fusion    Head    Overall
    3D-SPS [22]            —              10.88             80.39             13.25          166.67     5.38
    BUTD-DETR [16]       126.58           10.60             78.55             28.49           52.63     5.91
    EDA [35]             126.58           10.89             81.10             28.57           49.75     5.98
    MCLN [27]            126.58           10.52             76.92             23.26           41.32     5.45
    TSP3D (Ours)           —              31.88             81.21             28.67          547.32    12.43

Backbone. Except for TSP3D, the visual backbone in the other methods is PointNet++ [25], which has a high computational cost. This is precisely why we introduce a sparse convolution backbone, which achieves approximately three times the inference speed of PointNet++. As for the text backbone, both TSP3D and the other methods use the pretrained RoBERTa [20], so the inference speed of this component is largely consistent across methods.

Multi-modal Fusion. The multi-modal feature fusion primarily involves the interaction between textual and visual features, with different methods employing different modules. For instance, the multi-modal fusion in 3D-SPS mainly includes the description-aware keypoint sampling (DKS) and target-oriented progressive mining (TPM) modules, while methods like BUTD-DETR, EDA, and MCLN rely on cross-modal encoders and decoders for their fusion process. In our TSP3D, the multi-modal fusion involves feature upsampling, text-guided pruning (TGP), and completion-based addition (CBA). Notably, even though TSP3D progressively increases the resolution of scene features and integrates them with fine-grained backbone features, it still achieves superior inference speed. This is primarily due to the text-guided pruning, which significantly reduces the number of voxels and the computational cost.

Head and Text Decouple. In the designs of methods such as BUTD-DETR, EDA, and MCLN, the input text needs to be decoupled into several semantic components. Additionally, their heads do not output prediction scores directly. Instead, they output embeddings for each candidate object, which must be compared with the embeddings of each word in the text to compute similarities and determine the final output. These can be considered additional pre-processing and post-processing steps, with the latter significantly impacting computational efficiency. In contrast, our TSP3D directly predicts the matching scores between the objects and the input text, making its head inference speed over ten times faster than that of these methods.

C. Detailed Results on ScanRefer

Due to page limitations, we report only the overall performances and inference speeds in the main text. To provide detailed results and analysis, we include the accuracies of TSP3D and other methods across various subsets of the ScanRefer dataset [3], as shown in Tab. 7. TSP3D achieves state-of-the-art accuracy, even when compared with two-stage methods, leading by +1.13 in Acc@0.5. TSP3D also demonstrates a level of efficiency that previous methods lack. In the various subsets, TSP3D maintains accuracy comparable to both single-stage and two-stage state-of-the-art methods. Notably, the "multi-object" subset involves distinguishing the target object among numerous distractors of the same category within a more complex 3D scene. In this setting, TSP3D achieves a commendable performance of 42.37 in Acc@0.5, further demonstrating that TSP3D enhances attention to the target object in complex environments through text-guided pruning and completion-based addition, enabling accurate predictions of both the location and the shape of the target.

D. Qualitative Comparisons

To qualitatively demonstrate the effectiveness of our proposed TSP3D, we visualize the 3DVG results of TSP3D alongside EDA [35] on the ScanRefer dataset [3]. As shown in Fig. 7, the ground-truth boxes are marked in blue, with the predicted boxes of EDA and TSP3D displayed in red and green, respectively. EDA encounters challenges in locating relevant objects, identifying categories, and distinguishing appearance and attributes, as illustrated in Fig. 7 (a), (c), and (d). In contrast, our TSP3D gradually focuses attention on the target and relevant objects under textual guidance and enhances resolution through multi-level feature fusion, showcasing commendable grounding capabilities. Furthermore, Fig. 7 (b) illustrates that TSP3D performs better on small or narrow targets, as our proposed completion-based addition can adaptively complete the target shape based on high-resolution backbone feature maps.

E. Limitations and Future Work

Despite its leading accuracy and inference speed, TSP3D still has some limitations. First, the speed of TSP3D is slightly slower than that of TSP3D-B. While TSP3D leverages TGP to enable deep interaction between visual and text features in an efficient manner, it inevitably introduces additional computational overhead compared to naive concatenation. In future work, we aim to design new operations for multi-modal feature interaction to replace the heavy cross-attention mechanism. Second, the current input for 3DVG methods consists of reconstructed point clouds. We plan to extend this to an online setting using streaming RGB-D videos as input, which would support a broader range of practical applications.
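Component-wise FPS numbers like those reported in Tab. 6 are typically obtained by timing each module in isolation with a batch size of 1. A minimal, framework-free timing harness might look like the following (an illustrative sketch, not the authors' benchmarking code; with a GPU model one would also synchronize the device, e.g. with torch.cuda.synchronize(), before reading the clock):

```python
import time

def measure_fps(fn, *args, warmup=2, iters=10):
    """Return the average frames per second of `fn(*args)`.
    Warm-up runs are excluded so that one-time setup costs
    (allocation, JIT compilation) do not skew the average."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return iters / (time.perf_counter() - start)
```

Timing, say, only the visual backbone forward pass with this harness would yield one column of a table like Tab. 6; timing the whole model yields the overall FPS.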
Table 7. Detailed comparison of methods on the ScanRefer dataset [3] evaluated at IoU thresholds of 0.25 and 0.5. TSP3D achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead on Acc@0.5. In various subsets, TSP3D achieves comparable accuracy to both single-stage and two-stage state-of-the-art methods. Additionally, TSP3D demonstrates a level of efficiency that previous methods lack.

                                     Unique (~19%)    Multiple (~81%)      Accuracy       Inference
    Method              Venue        0.25     0.5     0.25     0.5      0.25     0.5     Speed (FPS)

    Two-Stage Model
    ScanRefer [3]       ECCV'20     76.33   53.51    32.73   21.11    41.19   27.40     6.72
    TGNN [14]           AAAI'21     68.61   56.80    29.84   23.18    37.37   29.70     3.19
    InstanceRefer [39]  ICCV'21     77.45   66.83    31.27   24.77    40.23   30.15     2.33
    SAT [38]            ICCV'21     73.21   50.83    37.64   25.16    44.54   30.14     4.34
    FFL-3DOG [10]       ICCV'21     78.80   67.94    35.19   25.7     41.33   34.01     Not released
    3D-SPS [22]         CVPR'22     84.12   66.72    40.32   29.82    48.82   36.98     3.17
    BUTD-DETR [16]      ECCV'22     82.88   64.98    44.73   33.97    50.42   38.60     3.33
    EDA [35]            CVPR'23     85.76   68.57    49.13   37.64    54.59   42.26     3.34
    3D-VisTA [41]       ICCV'23     77.40   70.90    38.70   34.80    45.90   41.50     2.03
    VPP-Net [31]        CVPR'24     86.05   67.09    50.32   39.03    55.65   43.29     Not released
    G3-LQ [34]          CVPR'24     88.09   72.73    51.48   40.80    56.90   45.58     Not released
    MCLN [27]           ECCV'24     86.89   72.73    51.96   40.76    57.17   45.53     3.17

    Single-stage Model
    3D-SPS [22]         CVPR'22     81.63   64.77    39.48   29.61    47.65   36.43     5.38
    BUTD-DETR [16]      ECCV'22     81.47   61.24    44.20   32.81    50.22   37.87     5.91
    EDA [35]            CVPR'23     86.40   69.42    48.11   36.82    53.83   41.70     5.98
    G3-LQ [34]          CVPR'24     88.59   73.28    50.23   39.72    55.95   44.72     Not released
    MCLN [27]           ECCV'24     84.43   68.36    49.72   38.41    54.30   42.64     5.45
    TSP3D (Ours)        —           87.25   71.41    51.04   42.37    56.45   46.71    12.43
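The Acc@0.25 and Acc@0.5 entries above count a prediction as correct when its 3D IoU with the ground-truth box reaches the threshold. For axis-aligned boxes the metric can be sketched as follows (illustrative helper names, not the official evaluation code):

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, w, l, h)."""
    inter = 1.0
    for i in range(3):  # overlap extent along x, y, z
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        inter *= max(0.0, hi - lo)
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(preds, gts, thresh):
    """Fraction of samples whose predicted box reaches IoU >= thresh
    with the ground truth (the Acc@0.25 / Acc@0.5 metric)."""
    hits = sum(iou_3d(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```

Running acc_at_iou once with thresh=0.25 and once with thresh=0.5 over the validation split would produce a pair of numbers comparable in form to one row of Tab. 7.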
Figure 7. Qualitative results of EDA [35] and our TSP3D on the ScanRefer dataset [3] (for each query, the ground-truth box, the EDA prediction, and the TSP3D prediction are shown). In each description, the red annotations indicate the target object. The orange annotations in (a) refer to relevant objects, while the yellow annotations in (d) denote the appearance or attributes of the target. The example groups are: (a) missing relevant objects, (b) narrow or small targets, (c) category error, and (d) appearance and attributes. The queries shown are: "This is the brown arm chair that is at the end of the chalk board. To the left of it there is a small bookshelf, it is brown. To the right of it and at the table is a blue chair. The brown chair is not under the table it is just next to the table."; "A black office chair. It is in between of a cabinet and the wall."; "It is a silver stainless steel refrigerator with a freezer on bottom. It is at the end of the cabinets under the window to the left."; "It is a gray trash can. The trash can sits in the corner by where the tv is."; "There is a black table with a rectangular glass top. It is to the right of a black dresser, of similar height that also has glass top."; "A small document organizer with shelves. There is a door entry near it."; "The chair is the middle one on the north side of the table. The chair is red and has two arms."; "This is a cushion chair with no arms. The chair is next to building." TSP3D demonstrates exceptional performance in locating relevant objects and narrow or small targets, identifying categories, and distinguishing appearance and attributes.