Wenxuan Guo1∗ Xiuwei Xu1∗ Ziwei Wang2 Jianjiang Feng1† Jie Zhou1 Jiwen Lu1
1 Tsinghua University    2 Nanyang Technological University
{gwx22,xxw21}@mails.tsinghua.edu.cn    [email protected]
{jfeng,jzhou,lujiwen}@tsinghua.edu.cn
∗ Equal contribution. † Corresponding author.
arXiv:2502.10392v1 [cs.CV] 14 Feb 2025
Abstract

In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods can hardly meet the requirements of real-time inference due to their two-stage or point-based architectures. Inspired by the success of the multi-level fully sparse convolutional architecture in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, since in the 3D visual grounding task the 3D scene representation needs to interact deeply with text features, the sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation and text features in an efficient way by gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features by cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively fixes the over-pruned region by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves top inference speed and surpasses the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in [email protected] on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D, respectively. The code is available at https://github.com/GWxuan/TSP3D.

Figure 1. Comparison of 3DVG methods on the ScanRefer dataset [3] (accuracy, [email protected], versus inference speed in frames per second). Our TSP3D surpasses existing methods in both accuracy and inference speed, achieving the first efficient 3DVG framework.

1. Introduction

Incorporating multi-modal information to guide 3D visual perception is a promising direction. In recent years, 3D visual grounding (3DVG), also known as 3D instance referencing, has received increasing attention as a fundamental multi-modal 3D perception task. The aim of 3DVG is to locate an object in the scene with a free-form query description. 3DVG is challenging since it requires understanding of both the 3D scene and the language description. Recently, with the development of 3D scene perception and vision-language models, 3DVG methods have shown remarkable progress [16, 22]. However, with 3DVG being widely applied in fields like robotics and AR/VR where inference speed is the main bottleneck, how to construct an efficient real-time 3DVG model remains a challenging problem.

Since the output format of 3DVG is similar to 3D object detection, early 3DVG methods [3, 14, 38, 39] usually adopt a two-stage framework, which first conducts detection to locate all objects in the scene, and then selects the target object by incorporating text information. As there are many similarities between 3D object detection and 3DVG (e.g. both of them need to extract the representation of the 3D scene), there will be much redundant feature computation during the independent adoption of the two models. As a result, two-stage methods can hardly handle real-time tasks. To solve this problem, single-stage methods [22, 35] are presented, which generate the bounding box of the target directly from point clouds. This integrated design is more compact and efficient. However, current single-stage 3DVG methods are mainly built on point-based architectures [25], where the feature extraction contains time-consuming operations like furthest point sampling and kNN. They also need to aggressively downsample the point features to reduce computational cost, which might hurt the geometric information of small and thin objects [37]. Due to these reasons, current single-stage methods are still far from real-time (< 6 FPS) and their performance is inferior to two-stage methods, as shown in Fig. 1.

In this paper, we propose a new single-stage framework for 3DVG based on text-guided sparse voxel pruning, namely TSP3D. Inspired by state-of-the-art 3D object detection methods [29, 37] which achieve both leading accuracy and speed with a multi-level sparse convolutional architecture, we build the first sparse single-stage 3DVG network. However, different from 3D object detection, in 3DVG the 3D scene representation needs to interact deeply with text features. Since the count of voxels is very large in a sparse convolution-based architecture, deep multi-modal interaction like cross-attention becomes infeasible due to unaffordable computational cost. To this end, we propose text-guided pruning (TGP), which first utilizes text information to jointly sparsify the 3D scene representation and enhance the voxel and text features. To mitigate the effect of pruning on delicate geometric information, we further present completion-based addition (CBA) to adaptively fix the over-pruned region with negligible computational overhead. Specifically, TGP prunes the voxel features according to the object distribution. It gradually removes background features and features of irrelevant objects, which generates text-aware voxel features around the target object for accurate bounding box prediction. Since pruning may mistakenly remove the representation of the target object, CBA utilizes text features to query a small set of voxel features from the complete backbone features, followed by pruning-aware addition to fix the over-pruned region. We conduct extensive experiments on the popular ScanRefer [3] and ReferIt3D [2] datasets. Compared with previous single-stage methods, TSP3D achieves top inference speed and surpasses the previous fastest single-stage method by 100% in FPS. TSP3D also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in [email protected] on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D, respectively.

To summarize, our main contributions are as follows:
• To the best of our knowledge, this is the first work exploring sparse convolutional architecture for efficient 3DVG.
• To enable efficient feature extraction, we propose text-guided pruning and completion-based addition to sparsify the voxel features and adaptively fuse multi-level features.
• We conduct extensive experiments, and TSP3D outperforms existing methods in both accuracy and speed, demonstrating the superiority of the proposed framework.

2. Related Work

2.1. 3D Visual Grounding

3D visual grounding aims to locate a target object within a 3D scene based on natural language descriptions [19]. Existing methods are typically categorized into two-stage and single-stage approaches. Two-stage methods follow a detect-then-match paradigm. In the first stage, they independently extract features from the language query using pre-trained language models [7, 9, 24] and predict candidate 3D objects using pre-trained 3D detectors [21, 26] or segmenters [4, 17, 32]. In the second stage, they focus on aligning the vision and text features to identify the target object. Techniques for feature fusion include attention mechanisms with Transformers [13, 40], contrastive learning [1], and graph-based matching [10, 14, 39]. In contrast, single-stage methods integrate object detection and feature extraction, allowing for direct identification of the target object. Methods in this category include guiding keypoint selection using textual features [22], and measuring similarity between words and objects inspired by 2D image-language pre-trained models like GLIP [18], as in BUTD-DETR [16]. Methods like EDA [35] and G³-LQ [34] advance single-stage 3D visual grounding by enhancing multimodal feature discriminability through explicit text decoupling, dense alignment, and semantic-geometric modeling. MCLN [27] uses the 3D referring expression segmentation task to assist 3DVG in improving performance. However, existing two-stage and single-stage methods generally have high computational costs, hindering real-time applications. Our work aims to address these efficiency challenges by proposing an efficient single-stage method with a multi-level sparse convolutional architecture.

2.2. Multi-Level Convolutional Architectures

Recently, the sparse convolutional architecture has achieved great success in the field of 3D object detection. Built on the voxel-based representation [5, 8, 33] and sparse convolution operation [6, 11, 36], these methods show great efficiency and accuracy when processing scene-level data. GSDN [12] first adopts multi-level sparse convolution with generative feature upsampling in 3D object detection. FCAF3D [29] simplifies the multi-level architecture with an anchor-free design, achieving leading accuracy and speed. TR3D [30] further accelerates FCAF3D by removing unnecessary layers and introducing a category-aware proposal assignment method. Moreover, DSPDet3D [37] introduces the multi-level architecture to 3D small object detection. Our proposed method draws inspiration from these approaches, utilizing a sparse multi-level architecture with sparse convolutions and an anchor-free design. This allows for efficient processing of 3D data, enabling real-time performance in 3D visual grounding tasks.
3. Method

In this section, we describe our TSP3D for efficient single-stage 3DVG. We first analyze existing pipelines to identify current challenges and motivate our approach (Sec. 3.1). We then introduce the text-guided pruning, which leverages text features to guide feature pruning (Sec. 3.2). To address the potential risk of pruning key information, we propose the completion-based addition for multi-level feature fusion (Sec. 3.3). Finally, we detail the training loss (Sec. 3.4).

3.1. Architecture Analysis for 3DVG

Top-performance 3DVG methods [31, 34, 35] are mainly two-stage, i.e., a serial combination of 3D object detection and 3D object grounding. These separate calls of the two approaches result in redundant feature extraction and a complex pipeline, making two-stage methods less efficient. To demonstrate the efficiency of existing methods, we conduct a comparison of accuracy and speed among several representative methods on ScanRefer [3], as shown in Fig. 1. It can be seen that two-stage methods struggle in speed (< 3 FPS) due to the additional detection stage. Since 3D visual grounding is usually adopted in practical scenarios that require real-time inference under limited resources, such as embodied robots and VR/AR, the low speed of two-stage methods makes them less practical. On the other side, single-stage methods [22], which directly predict the referred bounding box from the observed 3D scene, are more suitable choices due to their streamlined processes. In Fig. 1, it can be observed that single-stage methods are significantly more efficient than their two-stage counterparts.

However, existing single-stage methods are mainly built on point-based backbones [25], where the scene representation is extracted with time-consuming operations like furthest point sampling and set abstraction. They also employ a large transformer decoder to fuse text and 3D features over several iterations. Therefore, the inference speed of current single-stage methods is still far from real-time (< 6 FPS). The inference speed of specific components in different frameworks is analyzed and discussed in detail in the supplementary material. Inspired by the success of the multi-level sparse convolutional architecture in 3D object detection [30], which achieves both leading accuracy and speed, we propose to build the first multi-level convolutional single-stage 3DVG pipeline.

TSP3D-B. Here we propose a baseline framework based on sparse convolution, namely TSP3D-B. Following the simple and effective multi-level architecture of FCAF3D [29], TSP3D-B utilizes 3 levels of sparse convolutional blocks for scene representation extraction and bounding box prediction, as shown in Fig. 2 (a). Specifically, the input point clouds P ∈ R^{N×6} with 6-dim features (3D position and RGB) are first voxelized and then fed into three sequential MinkResBlocks [6], which generate three levels of voxel features V_l (l = 1, 2, 3). With the increase of l, the spatial resolution of V_l decreases and the context information increases. Concurrently, the free-form text with l words is encoded by the pre-trained RoBERTa [20] to produce the vanilla text tokens T ∈ R^{l×d}. With the extracted 3D and text representations, we iteratively upsample V_3 and fuse it with T to generate a high-resolution and text-aware scene representation:

U_l = U_l^G + V_l,   U_l^G = GeSpConv(U'_{l+1})   (1)

U'_{l+1} = Concat(U_{l+1}, T)   (2)

where U_3 = V_3, and GeSpConv denotes generative sparse convolution [12] with stride 2, which upsamples the voxel features and expands their spatial locations for better bounding box prediction. Concat is voxel-wise feature concatenation by duplicating T. The final upsampled feature map U_1 is concatenated with T and fed into a convolutional head to predict the objectness scores and regress the 3D bounding box. We select the box with the highest objectness score as the grounding result.
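To make the data flow of Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of one upsampling level, written for dense voxel grids rather than the sparse tensors actually used by TSP3D. All module and tensor names are ours, a transposed convolution stands in for the generative sparse convolution GeSpConv, and the 1×1 convolution after the concatenation (not part of Eq. (2)) simply restores the channel count.

```python
import torch
import torch.nn as nn

class TextFusedUpsample(nn.Module):
    """One level of Eqs. (1)-(2) on dense voxel grids (schematic only).

    U_lplus1 and V_l are (B, C, D, H, W) tensors; text tokens are (B, n, d).
    A ConvTranspose3d stands in for the generative sparse convolution.
    """

    def __init__(self, c_voxel: int, c_text: int):
        super().__init__()
        # After Concat(U_{l+1}, T), a 1x1 convolution restores the channel
        # count (an implementation convenience, not part of Eq. (2)).
        self.fuse = nn.Conv3d(c_voxel + c_text, c_voxel, kernel_size=1)
        self.upsample = nn.ConvTranspose3d(c_voxel, c_voxel, kernel_size=2, stride=2)

    def forward(self, u_lplus1, v_l, text_tokens):
        b, _, d, h, w = u_lplus1.shape
        # "Duplicating T": broadcast a pooled text embedding to every voxel.
        t = text_tokens.mean(dim=1)[:, :, None, None, None].expand(b, -1, d, h, w)
        u_prime = self.fuse(torch.cat([u_lplus1, t], dim=1))   # Eq. (2)
        u_g = self.upsample(u_prime)                           # GeSpConv in Eq. (1)
        return u_g + v_l                                       # U_l = U_l^G + V_l

# Toy usage: level-3 features on an 8^3 grid fused with level-2 features.
if __name__ == "__main__":
    layer = TextFusedUpsample(c_voxel=64, c_text=768)
    u3, v2 = torch.randn(1, 64, 8, 8, 8), torch.randn(1, 64, 16, 16, 16)
    tokens = torch.randn(1, 12, 768)       # RoBERTa-sized token features
    print(layer(u3, v2, tokens).shape)     # torch.Size([1, 64, 16, 16, 16])
```

In the real sparse setting, GeSpConv also creates new voxel locations around existing ones, which a dense transposed convolution only loosely imitates.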
As shown in Fig. 1, TSP3D-B achieves an inference speed of 14.58 FPS, which is significantly faster than previous single-stage methods and demonstrates great potential for real-time 3DVG.

3.2. Text-guided Pruning

Though efficient, TSP3D-B exhibits poor performance due to the inadequate interaction between the 3D scene representation and text features. Motivated by previous 3DVG methods [16], a simple solution is to replace Concat with cross-modal attention to process voxel and text features, as shown in Fig. 2 (b). However, different from point-based architectures where the scene representation is usually aggressively downsampled, the number of voxels in the multi-level convolutional framework is very large¹. In practical implementation, we find that the voxels expand almost exponentially with each upsampling layer, leading to a substantial computational burden for the self-attention and cross-attention of scene features. To address this issue, we introduce text-guided pruning (TGP) to construct TSP3D, as illustrated in Fig. 2 (c). The core idea of TGP is to reduce the feature amount by pruning redundant voxels and to guide the network to gradually focus on the final target based on textual features.

¹ Compared to point-based architectures, the sparse convolutional framework provides higher resolution and more detailed scene representations, while also offering advantages in inference speed. For detailed statistics, please refer to the supplementary material.
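As a rough sense of scale for this burden, the snippet below counts the pairwise attention scores for a few hypothetical voxel counts; these counts and the keep ratio are illustrative placeholders, not statistics from the paper or its supplementary material.

```python
# Illustrative only: attention over N voxels needs on the order of N^2 scores.
# The voxel counts and the 10% keep ratio below are made-up examples.
for n_voxels in (4_000, 16_000, 64_000):             # e.g. growing across levels
    full = n_voxels ** 2                              # without any pruning
    pruned = (n_voxels // 10) ** 2                    # if pruning keeps ~10% of voxels
    print(f"N={n_voxels:>6}: {full / 1e6:9.1f}M scores -> {pruned / 1e6:7.1f}M after pruning")
```

Because the cost is quadratic in the number of voxels, pruning buys a roughly quadratic saving in the attention stage, which is the effect TGP exploits.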
Figure 2. Illustration of TSP3D. TSP3D builds on a multi-level sparse convolutional architecture. It iteratively upsamples the voxel features with text-guided pruning (TGP), and fuses multi-level features via completion-based addition (CBA). (a) to (d) on the right side illustrate various options for feature upsampling. (a) refers to simple concatenation with text features, which is fast but less accurate. (b) refers to feature interaction through cross-modal attention mechanisms, which is constrained by the large number of voxels. (c) represents our proposed TGP, which first prunes voxel features under textual guidance and thus enables efficient interaction between voxel and text features. (d) shows a simplified version of TGP that removes farthest point sampling and interpolation, combines the multi-modal feature interactions into a whole, and moves it before pruning.
Overall Architecture. TGP can be regarded as a modified version of cross-modal attention, which reduces the number of voxels before the attention operation, thereby lowering computational cost. To minimize the effect of pruning on the final prediction, we propose to prune the scene representation gradually. At the higher level, where the number of voxels is not too large yet, TGP prunes fewer voxels, while at the lower level, where the number of voxels is significantly increased by the upsampling operation, TGP prunes the voxel features more aggressively. The multi-level architecture of TSP3D consists of three levels and includes two feature upsampling operations. Therefore, we correspondingly configure two TGPs with different functions, which are referred to as scene-level TGP (level 3 to 2) and target-level TGP (level 2 to 1) respectively. Scene-level TGP aims to distinguish between objects and the background, specifically pruning the voxels on the background. Target-level TGP focuses on regions mentioned in the text, intending to preserve the target object and referential objects while removing other regions.

Details of TGP. Since the pruning is relevant to the description, we need to make the voxel features text-aware to predict a proper pruning mask. To reduce the computational cost, we perform farthest point sampling (FPS) on the voxel features to reduce their size while preserving the basic distribution of the scene. Next, we utilize cross-attention to interact with the text features and employ a simple MLP to predict the probability distribution M̂ for retaining each voxel. To prune the features U_l, we binarize and interpolate M̂ to obtain the pruning mask. This process can be expressed as:

U_l^P = U_l ⊙ Θ(I(M̂, U_l) − σ)   (3)

M̂ = MLP(CrossAtt(FPS(U_l), SelfAtt(T)))   (4)

where U_l^P is the pruned features, Θ is the Heaviside step function, ⊙ is the element-wise product, σ is the pruning threshold, and I represents linear interpolation based on the positions specified by U_l. After pruning, the scale of the scene features is significantly reduced, enabling internal feature interactions based on self-attention. Subsequently, we utilize self-attention and cross-attention to perceive the relative relationships among objects within the scene and to fuse multimodal features, resulting in updated features U_l'. Finally, through generative sparse convolutions, we obtain U_{l−1}^G.
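The pruning step of Eqs. (3)-(4) can be sketched as follows. This is a schematic re-implementation under several assumptions, not the released TSP3D code: voxel features are handled as an (N, C) tensor with (N, 3) coordinates, text tokens are assumed to be already projected to the same channel width, and the position-based linear interpolation I is replaced by a nearest-sampled-point lookup.

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Greedy farthest point sampling over (N, 3) coordinates; returns m indices."""
    n = xyz.shape[0]
    xyz = xyz.float()
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = int(torch.randint(0, n, (1,)))
    for i in range(m):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)
        farthest = int(dist.argmax())
    return idx

class TextGuidedPruning(nn.Module):
    """Schematic version of Eqs. (3)-(4): score FPS-sampled voxels against the
    text, spread the scores to all voxels (nearest sampled point here, linear
    interpolation in the paper), and drop voxels whose score is below sigma."""

    def __init__(self, dim: int, sigma: float = 0.5, num_samples: int = 512):
        super().__init__()
        self.sigma, self.num_samples = sigma, num_samples
        self.text_self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.mask_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats, coords, text):
        # feats: (N, C) voxel features, coords: (N, 3), text: (L, C) tokens
        # already projected to the voxel channel width (an assumption here).
        t, _ = self.text_self_attn(text[None], text[None], text[None])      # SelfAtt(T)
        sample = farthest_point_sample(coords, min(self.num_samples, len(feats)))
        q = feats[sample][None]                                             # FPS(U_l)
        fused, _ = self.cross_attn(q, t, t)                                 # CrossAtt(., .)
        m_hat = torch.sigmoid(self.mask_mlp(fused)).squeeze(0).squeeze(-1)  # scores on samples
        # Propagate sampled scores to every voxel, then binarize with sigma (Theta).
        nearest = torch.cdist(coords.float(), coords[sample].float()).argmin(dim=1)
        keep = m_hat[nearest] > self.sigma
        return feats[keep], coords[keep]
```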
Supervision for Pruning. The binary supervision mask M^sce for scene-level TGP is generated based on the centers of all objects in the scene, and the mask M^tar for target-level TGP is based on the target and relevant objects mentioned in the descriptions:

M^sce = ⋃_{i=1}^{N} M(O_i),   M^tar = M(O^tar) ∪ ⋃_{j=1}^{K} M(O_j^rel)   (5)

where {O_i | 1 ≤ i ≤ N} indicates all objects in the scene, and O^tar and O^rel refer to the target and relevant objects respectively. M(O) represents the mask generated from the center of object O: it generates an L × L × L cube centered at the center of O to construct the supervision mask M, where locations inside the cube are set to 1 while others are set to 0.
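One possible way to realize the supervision masks of Eq. (5) over the voxel coordinates of one level is sketched below; the cube side (in voxel units) and all variable names are assumptions for illustration, not taken from the released code.

```python
import torch

def cube_mask(coords: torch.Tensor, centers: torch.Tensor, cube_size: float) -> torch.Tensor:
    """Union of L x L x L cubes around object centers, as in Eq. (5).

    coords:  (N, 3) voxel coordinates of the current level.
    centers: (K, 3) object centers expressed in the same coordinate frame.
    Returns a (N,) bool mask that is True inside any cube.
    """
    if centers.numel() == 0:
        return torch.zeros(coords.shape[0], dtype=torch.bool)
    half = cube_size / 2.0
    diff = (coords[:, None, :].float() - centers[None, :, :].float()).abs()  # (N, K, 3)
    return (diff <= half).all(dim=-1).any(dim=-1)

# M_sce unions the cubes of all objects in the scene; M_tar unions the target
# with its relevant objects (variable names below are illustrative only):
# m_sce = cube_mask(voxel_coords, all_centers, cube_size=L)
# m_tar = cube_mask(voxel_coords, torch.cat([target_center[None], relevant_centers]), cube_size=L)
```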
Simplification. Although the above-mentioned method can effectively prune voxel features to reduce the computational cost of cross-modal attention, there are some inefficient operations in the pipeline: (1) FPS is time-consuming, especially for large scenes; (2) there are two rounds of interaction between voxel features and text features, the first to guide pruning and the second to enhance the representation, which is somewhat redundant. We also empirically observe that the number of voxels is not large at level 3.

3.3. Completion-based Addition

[Figure: (a) Ground Truth, (b) Pruned Features, (c) Completion Features, (d) Prediction.]

(1) Full Addition. For voxel features within the intersection of U_l^G and V_l, the features are directly added. For voxel features outside the intersection of U_l^G and V_l, which lack corresponding features in the other map, the missing voxel features are interpolated before addition. Due to the pruning process, U_l^G is sparser than V_l. In this way, full addition can fix almost all the pruned region. But this operation is computationally heavy and makes the scene representation fail to focus on relevant objects, which deviates from the core idea of TGP. (2) Pruning-aware Addition. The addition is constrained to the locations of U_l^G. For voxels in U_l^G but not in V_l, interpolation from U_l^G is applied to complete the missing locations in V_l. It restricts the addition operation to the shape of the pruned features, potentially leading to an over-reliance on the results of the pruning process. If important regions are over-pruned, the network may struggle to detect targets with severely damaged geometric information.

Considering the unavoidable risk of pruning the query target, we introduce the completion-based addition (CBA). CBA is designed to address the limitations of full and pruning-aware additions. It offers a more targeted and efficient way of integrating multi-level features, ensuring the …
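The pruning-aware baseline described above can be sketched on coordinate/feature pairs as follows. This is only an illustrative approximation, not the paper's implementation: where V_l has no voxel at a kept location, the sketch borrows the nearest V_l feature as a stand-in for the interpolation described in the text. Full addition would instead iterate over the union of the two coordinate sets, and CBA additionally queries a small, text-selected set of completion voxels from the backbone features to repair over-pruned regions.

```python
import torch

def pruning_aware_add(coords_u, feats_u, coords_v, feats_v):
    """Pruning-aware addition (schematic): keep only the locations of the
    pruned, upsampled map U_l^G and add the backbone features V_l at those
    locations; missing V_l entries are filled from the nearest V_l voxel."""
    v_lookup = {tuple(c.tolist()): f for c, f in zip(coords_v, feats_v)}
    out = []
    for c, f in zip(coords_u, feats_u):
        key = tuple(c.tolist())
        if key in v_lookup:
            out.append(f + v_lookup[key])
        else:
            nearest = torch.cdist(c[None].float(), coords_v.float()).argmin()
            out.append(f + feats_v[nearest])
    return coords_u, torch.stack(out)
```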
…tions, coupled with the text-guided pruning. By focusing computation only on salient regions of the point clouds, determined by textual cues, our model effectively reduces computational overhead while maintaining high accuracy. TSP3D also sets a benchmark for inference speed comparisons for future methodologies.

Performance on Nr3D/Sr3D. We evaluate our method on the SR3D and NR3D datasets, following the evaluation protocols of prior works like EDA [35] and BUTD-DETR [16] by using [email protected] as the accuracy metric. The results are shown in Tab. 2. Given that SR3D and NR3D provide ground-truth boxes and categories for all objects in the scene, we consider three pipelines: (1) Two-stage using Ground-Truth Boxes, (2) Two-stage using Detected Boxes, and (3) Single-stage. In practical applications, the Two-stage using Ground-Truth Boxes pipeline is unrealistic because obtaining all ground-truth boxes in a scene is infeasible. This approach can also oversimplify certain evaluation scenarios. For example, if there are no other objects of the same category as the target in the scene, the task reduces to relying on the provided ground-truth category. Under the Single-stage setting, TSP3D exhibits significant superiority with peak performance of 48.7% and 57.1% on Nr3D and Sr3D. TSP3D even outperforms previous works under the pipeline of Two-stage using Detected Boxes, with leads of +2.6% and +3.2% on NR3D and SR3D.

4.4. Ablation Study

Effectiveness of Proposed Components. To investigate the effects of our proposed TGP and CBA, we conduct ablation experiments with module removal as shown in Tab. 3. When TGP is not used, multi-modal feature concatenation is employed as a replacement, as shown in Fig. 2 (a). When CBA is not used, it is substituted with a pruning-based addition. The results demonstrate that TGP significantly enhances performance without notably impacting inference time. This is because TGP, while utilizing a more complex multi-modal attention mechanism for stronger feature fusion, significantly reduces the feature scale through text-guided pruning. Additionally, the performance improvement is also due to the gradual guidance towards the target object by both scene-level and target-level TGP. Using CBA alone has a limited effect, as no voxels are pruned. Implementing CBA on top of TGP further enhances performance, as CBA dynamically compensates for some of the excessive pruning by TGP, thus increasing the network's robustness.

Influence of the Two CBAs. To explore the impact of CBAs at two different levels, we conduct ablation experiments as depicted in Tab. 4. In the absence of CBA, we use pruning-based addition as a substitute. The results indicate that the CBA at level 2 has negligible effects on the 3DVG task. This is primarily because the CBA at level 2 serves to supplement the scene-level TGP, which is expected to prune the background (a relatively simple task). Moreover, although some target features are pruned, they are compensated by two subsequent generative sparse convolutions. However, the CBA at level 1 enhances performance by adaptive completion for the target-level TGP. It is challenging to fully preserve target objects from deep upsampling features, especially for smaller or narrower targets. The CBA at level 1, based on high-resolution backbone features, effectively complements the TGP.

Feature Upsampling Techniques. We conduct experiments to assess the effects of different feature upsampling techniques, as detailed in Tab. 5. Using simple feature concatenation (Fig. 2 (a)), while fast in inference speed, results in poor performance. When we utilize an attention mechanism with stronger feature interaction, as shown in Fig. 2 (b), the computation exceeds the limits of the GPU due to the large number of voxels, making it impractical for real-world applications. Consequently, we employ TGP to reduce the feature amount, as illustrated in Fig. 2 (c), which significantly improves performance and enables practical deployment. Building on TGP, we propose the simplified TGP, as shown in Fig. 2 (d), which merges the feature interactions before and after pruning, achieving performance consistent with the original TGP while enhancing inference speed.

4.5. Qualitative Results

Text-guided Pruning. To visually demonstrate the process of TGP, we visualize the results of the two pruning phases, as shown in Fig. 4. In each example, the voxel features after scene-level pruning, the features after target-level pruning, and the features after the target-level generative sparse convolution are displayed from top to bottom. It is evident that both pruning stages effectively achieve our intended effect: the scene-level pruning filters out the background and retains object voxels, and the target-level pruning preserves relevant and target objects. Moreover, during the feature upsampling process, the feature amount grows nearly exponentially.
Figure 4. Visualization of the text-guided pruning process. In each example, the voxel features after scene-level TGP, target-level TGP and the last upsampling layer are presented from top to bottom. The blue boxes represent the ground truth of the target, and the red boxes denote the bounding boxes of relevant objects. TSP3D reduces the amount of voxel features through two stages of pruning and progressively guides the network to focus on the target.
Example queries (left to right): (a) "It is a gray trash can. The trash can sits in the corner by where the tv is." (b) "The chair is on the northwest side of the table that is furthest northeast in the room. The chair has a curved backside and four legs." (c) "A chrome water fountain. Is the lowest of two water fountains." (d) "There is a rectangular shelf. It is the one closest to the curtain."
Figure 5. Visualization of the completion-based addition process. The blue points represent the voxel features output by the target-level TGP, while the red points are the completion features predicted by the CBA. The blue boxes indicate the ground truth boxes. CBA adaptively supplements situations where excessive pruning has occurred.
Example queries (left to right): (a) "The brown square trash can is to the right of the lamp. It is also behind the chair and to the right." (b) "The whiteboard is next to two doors. The whiteboard is a long, white rectangle." (c) "This is a monitor in black color. The monitor is next to the wall and at the back of maroon chair." (d) "The book rack is left of the bookshelf that looks like a right angle with another. The book rack is black and has as circular bottom."
Figure 7. Qualitative results of EDA [35] and our TSP3D on the ScanRefer dataset [3]. In each description, the red annotations indicate the target object. The orange annotations in (a) refer to relevant objects, while the yellow annotations in (d) denote the appearance or attributes of the target. TSP3D demonstrates exceptional performance in locating relevant objects, narrow or small targets, identifying categories, and distinguishing appearance and attributes.
Example queries (left to right): (a) "There is a black table with a rectangular glass top. It is to the right of a black dresser, of similar height that also has glass top." (b) "A small document organizer with shelves. There is a door entry near it." (c) "The chair is the middle one on the north side of the table. The chair is red and has two arms." (Category Error) (d) "This is a cushion chair with no arms. The chair is next to building." (Appearance and Attributes)