Crowd-SAM: SAM As A Smart Annotator For Object Detection in Crowded Scenes
1 Introduction
Object detection in crowded scenes is a fundamental task in areas such as au-
tonomous driving and video surveillance. The primary focus lies in identifying
and locating densely packed common objects like pedestrians and vehicles, where
occlusions present significant challenges. Great progress has been made in recent
years, including two-stage methods [45, 60] and query-based methods [8, 22, 62].
However, these methods mainly follow a supervised manner and necessitate ex-
tensive labeled training samples, incurring a considerable annotation cost of ap-
proximately 42.4 seconds per object [38]. The density and complexity of crowded
scenes further aggravate the annotation burden: on average, an image contains
approximately 22 objects in CrowdHuman [37], compared with 7 in MS COCO [24].
The high cost of collecting object annotations drives the exploration of alter-
natives such as few-shot learning [34, 39, 44], weakly supervised learning [41, 53],
semi-supervised learning [29,30,42,49], and unsupervised learning [1,6,21,26,48].
The best-performing paradigm, semi-supervised object detection (SSOD), leverages
both labeled and unlabeled data for training and achieves great success on
common benchmarks, e.g., PASCAL VOC [15] and COCO [24]. Unfortunately,
SSOD introduces extra complexity such as complicated augmentations and on-
line pseudo-labeling.
Recently, prompt-based segmentation models have received increasing at-
tention due to their flexibility and scalability. Particularly, Segment Anything
Model (SAM) [20] shows a high capability to effectively and accurately predict
the masks of regions specified by prompts, given in the form of points, boxes, masks, or
text descriptions. Recognizing its exceptional potential, researchers have made
many efforts to adapt it for various vision tasks such as medical image recogni-
tion [31], remote sensing analysis [4, 12], industrial defect detection [52], etc.
Despite the huge progress [18,46,54] following SAM, applying SAM for object
detection in crowded scenes is seldom studied. In this paper, we investigate the
potential of SAM in such cases with two motivations. First, SAM is pre-trained
on a very large dataset i.e. SA-1B that contains most of the common objects
and it is thus reasonable to utilize the knowledge to facilitate labeling massive
data and training a brand-new detector. Second, SAM demonstrates a robust
segmentation ability in handling complicated scenes characterized by clustered
objects that are difficult for an object detector trained from scratch.
To this end, we propose Crowd-SAM, a smart annotator powered by SAM
for object detection in crowded scenes. As depicted in Fig. 1, we introduce a self-
prompting approach based on DINOv2 to alleviate the cost of human prompting.
Our method employs dense grids equipped with an Efficient Prompt Sampler
(EPS) to cover as many objects as possible at a moderate cost. To precisely
select the correct masks among multiple outputs in occluded scenes, we design a mask
selection module, termed Part-Whole Discrimination Network (PWD-Net), which
learns to identify the output with the highest quality in terms of Intersection over
Union (IoU) score. With a lightweight model design and a fast training schedule,
it delivers considerable performance on public benchmarks including CrowdHu-
man [37] and CityPersons [58].
Our contributions can be summarized as follows:
Fig. 1: Pipeline comparison between (a) SAM and (b) Crowd-SAM. Crowd-SAM only requires
a few labeled images and can automatically recognize target objects.
2 Related Work
Object Detection. General object detection aims to identify and locate objects
and is mainly divided into two categories: one-stage detectors and two-stage
detectors. One-stage detectors predict bounding boxes and class scores by using
image features [23, 27, 35], while two-stage detectors first generate region pro-
posals and then classify and refine them [9, 10, 36]. Recently, end-to-end object
detectors e.g. DETR [2, 55, 63] have replaced the hand-crafted modules such
as Non-Maximum Suppression (NMS) by adopting one-to-one matching in the
training phase, showing great potential in a wide variety of areas.
However, applying these detectors directly to pedestrian detection tends to
incur performance degradation because pedestrians are often densely packed and
heavily occluded. Early work [32] proposes to integrate extra features
into a pedestrian detector to explore low-level visual cues, while follow-up meth-
ods [5, 56] attempt to utilize the head areas for better representation learning.
In [56], an anchor is associated with two targets, the whole body, and the head
part, to achieve a more robust detector from joint training. AdaptiveNMS [25]
adjusts the NMS threshold by predicting the density of pedestrians. Alternative
methods focus on the design of loss functions to improve the training process. For
example, RepLoss [45] encourages the prediction consistency of the same target
while repelling predictions from different targets. Recently, Zheng et al. [62] models
the relation of queries to improve DETR-based detectors in crowded scenes and
achieves remarkable success. Although these works have pushed the boundaries
of object detection in crowded scenes to a new stage, they all rely on a large
number of labeled samples for training, which is labor-intensive. This limitation
inspires us to develop label-efficient detectors and automatic annotation tools,
with the help of SAM.
Few-Shot Object Detection (FSOD). This task aims to detect objects
of novel classes with limited training samples. FSOD methods can be roughly
classified into meta-learning based [16,51] and fine-tuning based ones [34,39,44].
Meta-RCNN [51] processes the query and support images in parallel via a siamese
network. The Region of Interest (RoI) features of the query are fused with the
class prototypes to effectively transfer knowledge learned from the support set.
TFA [44] proposes a simple two-stage fine-tuning method that only fine-tunes
the last layers of the network. FSCE [39] introduces a supervised contrastive loss
in the fine-tuning stage to mitigate misclassification issues. De-FRCN [34] stops
the gradient from the RPN and scales the gradient from R-CNN [36], followed
by a prototypical calibration block to refine the classification scores.
Fig. 2: The pipeline of Crowd-SAM, showing the interaction between different modules.
The DINO encoder and SAM are frozen during training. * denotes parameters that are
shared. For simplicity, the projection adapter of DINO is omitted.
Segment Anything Models. SAM [20], a visual foundation model for seg-
mentation tasks, is trained on the SA-1B dataset using a semi-supervised learn-
ing paradigm. Its exposure to this vast repository of training samples renders it a
highly capable class-agnostic model, effectively handling a wide range of objects
in the world. Despite its impressive performance in solving segmentation tasks,
it suffers from several issues like domain shift, inefficiency, class-agnostic design,
etc. HQ-SAM [18] is proposed to improve its segmentation quality by learning a
lightweight adapter. Fast-SAM [50] and Mobile-SAM [54] focus on accelerating the
inference of SAM via knowledge distillation. RSPrompter [4] enables SAM to
generate semantically distinct segmentation results for remote sensing images by
generating appropriate prompts. Med-SA [47] presents a space-depth transpose
method to adapt 2D SAM to 3D medical images and a hyper-prompting adapter
to achieve prompt-conditioned adaptation. Unfortunately, these approaches ne-
cessitate a considerable amount of labeled data for effective adaptation, making
them impractical for crowded scenes where annotation costs are prohibitive. Dif-
ferent from them, Per-SAM [57] and Matcher [28] teach SAM to recognize spec-
ified objects with only one or few instances by extracting training-free similarity
priors. SAPNet [46] builds a weakly-supervised pipeline for instance segmenta-
tion. Although these approaches reduce data requirements, they still fall short of
the demands of crowded scenes, such as pedestrian detection, particularly with
occlusions.
3 Method
3.1 Preliminaries
SAM [20] is a powerful and promising segmentation model that comprises three
main components: (a) an image encoder responsible for feature extraction; (b)
a prompt encoder designed to encode the geometric prompts provided by users;
and (c) a lightweight mask decoder that predicts the masks conditioned on the
given prompts. Leveraging extensive training data, SAM demonstrates impres-
sive zero-shot segmentation performance across various benchmarks. In particu-
lar, SAM makes use of points and boxes as prompts to specify regions of interest.
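For illustration, a minimal sketch of point-prompted inference with the public segment-anything package is given below; the checkpoint path and the image array are placeholders, not artifacts of this paper.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a ViT-L SAM checkpoint (path is a placeholder) and wrap it in a predictor.
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l.pth")
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)  # placeholder for a real RGB image
predictor.set_image(image)

# A single positive point prompt at (x, y); label 1 marks a foreground point.
masks, iou_scores, _ = predictor.predict(
    point_coords=np.array([[512.0, 384.0]]),
    point_labels=np.array([1]),
    multimask_output=True,  # SAM returns several candidate masks per prompt
)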
DINO [3] represents a family of Vision Transformers (ViT) [7] trained in
a self-supervised manner for general-purpose applications. During its
training, DINO employs a self-distillation strategy akin to BYOL [11], fostering
the learning of robust representations. DINOv2 [33] strengthens the foundation
of DINO by integrating several additional pre-training tasks, improving its scala-
bility and stability, especially for large models e.g. ViT-H (1 billion parameters).
Thanks to these enhancements, DINOv2 shows a strong representation ability, in
particular for the task of semantic segmentation.
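As a reference for how such features can be obtained, the sketch below loads DINOv2 through torch.hub and extracts dense patch tokens; the hub entry point and output key follow the public dinov2 repository and may differ across versions.

import torch

# Load DINOv2 ViT-L/14 from torch.hub (requires network access on first call).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

x = torch.randn(1, 3, 518, 518)  # placeholder image; side length must be a multiple of 14
with torch.no_grad():
    feats = dinov2.forward_features(x)      # dict of token outputs
patch_tokens = feats["x_norm_patchtokens"]  # (1, 37*37, 1024) dense patch embeddings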
Table 1: Comparison in terms of recall, average FPs, and decoding time (T) of different
grid sizes (N_G) adopted by the SAM generator on CrowdHuman [37]. The oracle model
derives the prompts by computing the centers of ground-truth boxes. The decoding time
is collected on a 3090 Ti GPU.

N_G        16      32      64      128     Oracle
Recall     33.6    58.0    63.4    76.0    91.4
avg. FPs   51      112     227     485     -
T (s)      0.059   0.22    0.83    3.2     0.045
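For reference, an N_G × N_G point grid such as those compared in Tab. 1 can be laid out at cell centers as in the illustrative sketch below; this is not necessarily the exact prompt generator used by Crowd-SAM.

import numpy as np

def make_point_grid(n_g: int, height: int, width: int) -> np.ndarray:
    """Return an (n_g * n_g, 2) array of (x, y) point prompts placed at cell centers."""
    xs = (np.arange(n_g) + 0.5) / n_g * width
    ys = (np.arange(n_g) + 0.5) / n_g * height
    xx, yy = np.meshgrid(xs, ys)
    return np.stack([xx.ravel(), yy.ravel()], axis=1)

grid_64 = make_point_grid(64, height=768, width=1024)  # 4096 candidate prompts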
To identify correct masks among the outputs, which are a mixture of correct masks,
backgrounds, and part-level masks, we design the Part-Whole Discrimination Network
(PWD-Net), which takes as input both the learned tokens from SAM and the semantic-rich
tokens from DINOv2 to re-evaluate all the outputs. Finally, to handle the redundancy
brought by the use of dense grids, we propose an Efficient Prompt Sampler (EPS)
to decode the masks at a moderate cost.
We introduce the details of our method in the following sections.
\mathcal{L}_{fg} = \mathrm{dice}(f(\hat{\mathbb{H}}), \mathbb{H}),   (1)

where f is an up-sampling function that resizes \hat{\mathbb{H}} to 256×256. During inference,
we apply a threshold t for mask binarization, which is simply set to 0.5 in our
experiments. The binarized masks are mapped to point prompts P_G, which only
contain the points in positive regions.
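For clarity, a minimal PyTorch sketch of the soft dice term in Eq. (1) is shown below, assuming the prediction and target are probability maps of the same size after up-sampling; it is an illustrative implementation, not necessarily the exact loss code used in the paper.

import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft dice loss; pred and target are (B, H, W) maps with values in [0, 1]."""
    pred = pred.flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(dim=1)
    denom = pred.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()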
Given the proposals generated in Sec. 3.3, our further aim is to efficiently decode
the dense prompts and accurately discriminate the generated masks. As depicted
in Fig. 3, each instance is covered by a set of prompts due to the density of the grids.
Since only one well-positioned prompt is required for SAM to predict a mask,
decoding all the prompts would not only waste computation but also introduce
more FPs from poorly located prompts.
Fig. 3: Illustration of EPS. PWD-Net produces valid masks with a threshold. In each
iteration, we prune prompts (marked with a cross) that fall inside valid masks.
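The pruning idea behind EPS can be summarized by the hedged sketch below, where decode_batch and score_masks are placeholders standing in for SAM decoding and PWD-Net scoring; it is a schematic, not the paper's implementation.

import numpy as np

def efficient_prompt_sampling(points, decode_batch, score_masks,
                              batch_size=64, score_thr=0.5):
    """points: (N, 2) array of (x, y) prompts; returns the list of kept masks."""
    remaining = np.asarray(points, dtype=float)
    kept_masks = []
    while len(remaining) > 0:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        masks = decode_batch(batch)          # (B, H, W) boolean masks from SAM
        scores = score_masks(masks)          # (B,) confidence scores from PWD-Net
        valid = [m for m, s in zip(masks, scores) if s > score_thr]
        kept_masks.extend(valid)
        if valid and len(remaining) > 0:
            covered = np.any(np.stack(valid), axis=0)   # union of valid masks
            xs = remaining[:, 0].astype(int)
            ys = remaining[:, 1].astype(int)
            remaining = remaining[~covered[ys, xs]]     # prune prompts inside valid masks
    return kept_masks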
PWD-Net is designed for two purposes: (i) estimating the quality of the related masks
if they are positive samples, and (ii) suppressing the scores of samples that fall in
background regions.
As illustrated in Fig. 2, for the masks generated from the N prompts, we leverage
the Mask Token and IoU Token within the mask decoder of SAM, along with the
semantic-rich features extracted by the self-supervised pre-trained model DINOv2 [33].
M and U are responsible for mask decoding and IoU prediction in the SAM mask decoder,
respectively; thus, we suppose that they contain shape-aware information, which is
helpful for discriminating the masks. These components enable us to compute a
discriminative confidence score S for each specific prompt in a few-shot scenario.
Initially, the refined IoU score S_iou is computed as follows:
S_{iou} = \text{Head}_{par}(\text{Concat}(\text{Repeat}(\mathcal{U}), \mathcal{M})) + \text{Head}_{IoU}(\mathcal{U}),   (2)
s^i_{target} = \begin{cases} \mathrm{IoU}(m^i, m^i_{GT}), & m^i_{GT} \in M^{fg}_{GT}, \\ 0, & m^i_{GT} \in M^{bg}_{GT}, \end{cases}   (4)

Here, s^i_{target} denotes the target score of the mask m^i generated by the i-th prompt,
and S_{target} = \{s^i_{target}\}_{i=1}^{N}, M = \{m^i\}_{i=1}^{N},
M_{GT} = \{m^i_{GT}\}_{i=1}^{N} = M^{bg}_{GT} \cup M^{fg}_{GT}.
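To make Eq. (2) concrete, the sketch below implements a parallel scoring head over the repeated IoU token and the mask tokens, whose output is added to the frozen SAM IoU head's prediction; the token dimension and MLP layout are illustrative assumptions, not the exact PWD-Net architecture.

import torch
import torch.nn as nn

class ParallelIoUHead(nn.Module):
    """Schematic Head_par of Eq. (2); dimensions and depth are illustrative."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.head_par = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, mask_tokens, iou_token, frozen_iou_scores):
        # mask_tokens: (B, K, C); iou_token: (B, C); frozen_iou_scores: (B, K).
        u_rep = iou_token.unsqueeze(1).expand_as(mask_tokens)  # Repeat(U)
        s_par = self.head_par(torch.cat([u_rep, mask_tokens], dim=-1)).squeeze(-1)
        return s_par + frozen_iou_scores                       # S_iou per mask hypothesis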
4 Experiments
Datasets. Following [62], we adopt CrowdHuman [37] as the benchmark to con-
duct main experiments and ablation studies. CrowdHuman [37] collects and an-
notates images containing crowded persons from the Internet. It contains 15,000,
Table 2: Comparative results (%) on CrowdHuman [37] val. All the SAM-based meth-
ods adopt ViT-L [7] as the pre-trained backbone (SRCNN denotes Sparse R-CNN [40],
a baseline in [62] and * represents using the multi-crop trick).
4,370, and 5,000 images for training, validation, and testing, respectively. We also
evaluate our method on CityPersons [58] for a realistic urban-scene scenario. Ad-
ditionally, we utilize OCHuman [61], which focuses on occluded
persons. For these pedestrian datasets, we use visible annotations (only including
visible areas of an object) for training and evaluation. To validate the extensi-
bility of Crowd-SAM, we further devise a multi-class version of Crowd-SAM by
adding a multi-class classifier. We employ 0.1% of the COCO [24] train-
val set for training and the COCO val set for validation. Besides, we validate
our method on a subset with occluded objects on COCO, i.e. COCO-OCC [17],
extracted by selecting the images whose objects have a high overlapping ratio.
Implementation Details. We utilize SAM (ViT-L) [20] and DINOv2 (ViT-
L) [33] as the base models for all experiments. In the fine-tuning stage, all their
parameters are frozen to avoid over-fitting. Instead of real GT, we use the pseudo
masks generated by SAM as Starget to supervise the learning of PWD-Net in
Eq. (5). These generated pseudo labels are of high quality as shown in Fig. 4(c).
We randomly pick the points from the pseudo masks as positive training samples
and the ones from the background as negative training samples. In the training
process, we use Adam [19] with a learning rate of 10^-5, a weight decay of 10^-4,
β1 = 0.9, and β2 = 0.99 for optimization. We train our module for 2,000 itera-
tions with a batch size of 1, which can be done on a single RTX 3090 Ti GPU
in several minutes. For more details, please refer to the Appendix.
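The stated optimization setting corresponds to the following PyTorch configuration; trainable_modules is a placeholder for the tunable Crowd-SAM components (e.g., PWD-Net and the binary classifier), not a name from the paper.

import torch

def build_optimizer(trainable_modules):
    # Adam with lr=1e-5, weight decay=1e-4, betas=(0.9, 0.99), as stated above.
    params = [p for m in trainable_modules for p in m.parameters() if p.requires_grad]
    return torch.optim.Adam(params, lr=1e-5, weight_decay=1e-4, betas=(0.9, 0.99))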
Evaluation Metrics. Following [62], we use AP at an IoU threshold of 0.5,
MR^-2, and Recall as metrics. Generally, higher AP and Recall and a lower MR^-2
indicate better performance.
Table 3: Comparative results (%) on OCHuman val, where AP_M and AP_H represent
AP in moderate and hard cases according to occlusion ratios, respectively.
benchmark, and COCO-OCC [17], which is a split of COCO that mainly consists
of images with a high occlusion ratio.
We compare our method with two supervised detectors, i.e. Faster R-CNN [36]
and BCNet [17], and report the results in Tab. 5. It can be seen that our Crowd-
SAM is comparable to the supervised detectors on both datasets and drops by only
1.4% AP from COCO to COCO-OCC. This minor drop
indicates that our method is robust to occlusions.
all three samplers. However, the default sampler suffers an out-of-memory error
when the grid size reaches 128, preventing it from being adopted in this setting.
As for the random sampler, its performance is constrained by K and a grid size
larger than 64 only leads to limited improvement, e.g. 0.1% AP. In contrast,
our EPS benefits from a much larger grid size and achieves a better result.
Ablation on PWD-Net. We compare PWD-Net to the variants that re-
place some tokens with a full-zero placeholder. We also consider two designs, di-
rectly tuning the IoU head or learning a parallel IoU head. As shown in Tab. 8,
all three tokens, i.e. the Mask Token M, IoU Token U, and Semantic Token O,
contribute to the final result. Particularly, once M is removed, the AP drops by
40.0%, which is a catastrophic decline. This degradation suggests that the mask
token contains shape-aware features that are essential for the part-whole dis-
crimination task. Notably, the AP drops by 2.8% when we tune the pre-trained
IoU Head, suggesting that it is prone to overfitting on the few labeled images. By
freezing the IoU head of SAM, PWD-Net can benefit more from the shape-aware
knowledge learned from massive segmentation data.
Table 7: Comparison (%) of different samplers on CrowdHuman [37] val. Full means
using all prompts. OOM represents out-of-memory errors, which occur when the GPU
memory is exhausted.
Table 8: Ablation results (%) on the design of PWD-Net. M, U, and O represent the
mask token, IoU token, and semantic token, respectively. For the IoU head, F means
freezing the original IoU head and training a parallel one, and T indicates tuning the
original IoU head.
Fig. 4: Qualitative comparison between Crowd-SAM (a) and De-FRCN (b). Crowd-
SAM predictions are more accurate, especially at the boundaries of persons. We also
plot the GT boxes (blue rectangles) and the generated masks (yellow regions), which
are of high quality (c). In (d), we plot our prompt filtering results, where preserved
prompts (red points) are much fewer than the removed ones (gray points). Zoom in for
a better view.
5 Conclusion
This paper proposes Crowd-SAM, a SAM-based framework, for object detection
and segmentation in crowded scenes, designed to streamline the annotation pro-
cess. For each image, Crowd-SAM generates dense prompts for high recall and
uses EPS to prune redundant prompts. To achieve accurate detection in occlu-
sion cases, Crowd-SAM employs PWD-Net which leverages several informative
tokens to select the best-fitting masks. Combined with the proposed modules,
Crowd-SAM achieves 78.4% AP on CrowdHuman, comparable to fully-supervised
detectors, validating that object detection in crowded scenes can benefit from
foundation models like SAM with data efficiency.
Acknowledgements
References
1. Bar, A., Wang, X., Kantorov, V., Reed, C.J., Herzig, R., Chechik, G., Rohrbach,
A., Darrell, T., Globerson, A.: Detreg: Unsupervised pretraining with region priors
for object detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14605–14615
(2022)
2. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-
to-end object detection with transformers. In: Eur. Conf. Comput. Vis. pp. 213–
229. Springer (2020)
3. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.:
Emerging properties in self-supervised vision transformers. In: Int. Conf. Comput.
Vis. pp. 9650–9660 (2021)
4. Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., Zou, Z., Shi, Z.: Rsprompter: Learn-
ing to prompt for remote sensing instance segmentation based on visual foundation
model. IEEE Transactions on Geoscience and Remote Sensing (2024)
5. Chi, C., Zhang, S., Xing, J., Lei, Z., Li, S.Z., Zou, X.: Pedhunter: Occlusion robust
pedestrian detector in crowded scenes. In: AAAI. vol. 34, pp. 10639–10646 (2020)
6. Dai, Z., Cai, B., Lin, Y., Chen, J.: Up-detr: Unsupervised pre-training for object
detection with transformers. In: IEEE Conf. Comput. Vis. Pattern Recog. pp.
1601–1610 (2021)
7. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth
16x16 words: Transformers for image recognition at scale. In: Int. Conf. Learn.
Represent. (2021)
8. Gao, F., Leng, J., Gan, J., Gao, X.: Selecting learnable training samples is all
detrs need in crowded pedestrian detection. In: ACM Int. Conf. Multimedia. pp.
2714–2722 (2023)
9. Girshick, R.: Fast r-cnn. In: Int. Conf. Comput. Vis. pp. 1440–1448 (2015)
10. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for ac-
curate object detection and semantic segmentation. In: IEEE Conf. Comput. Vis.
Pattern Recog. pp. 580–587 (2014)
11. Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doer-
sch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own
latent-a new approach to self-supervised learning. Adv. Neural Inform. Process.
Syst. 33, 21271–21284 (2020)
12. Gui, S., Song, S., Qin, R., Tang, Y.: Remote sensing object detection in the deep
learning era—a review. Remote Sensing 16(2), 327 (2024)
13. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Int. Conf. Comput.
Vis. pp. 2961–2969 (2017)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 770–778 (2016)
15. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal
visual object classes (voc) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
16. Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection
via feature reweighting. In: Int. Conf. Comput. Vis. pp. 8420–8429 (2019)
17. Ke, L., Tai, Y.W., Tang, C.K.: Deep occlusion-aware instance segmentation with
overlapping bilayers. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 4019–4028
(2021)
18. Ke, L., Ye, M., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F., et al.: Segment
anything in high quality. In: Adv. Neural Inform. Process. Syst. vol. 36 (2024)
19. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Int. Conf.
Learn. Represent. (2015)
20. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Int. Conf.
Comput. Vis. pp. 4015–4026 (2023)
21. Li, M., Wu, J., Wang, X., Chen, C., Qin, J., Xiao, X., Wang, R., Zheng, M., Pan,
X.: Aligndet: Aligning pre-training and fine-tuning in object detection. In: Int.
Conf. Comput. Vis. pp. 6866–6876 (2023)
22. Lin, M., Li, C., Bu, X., Sun, M., Lin, C., Yan, J., Ouyang, W., Deng, Z.: Detr for
crowd pedestrian detection. arXiv preprint arXiv:2012.06785 (2020)
23. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection. In: Int. Conf. Comput. Vis. pp. 2980–2988 (2017)
24. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: Eur. Conf. Comput.
Vis. pp. 2906–2917 (2014)
25. Liu, S., Huang, D., Wang, Y.: Adaptive nms: Refining pedestrian detection in a
crowd. In: IEEE Conf. Comput. Vis. Pattern Recog. (June 2019)
26. Liu, S., Li, Z., Sun, J.: Self-emd: Self-supervised object detection without imagenet.
arXiv preprint arXiv:2011.13677 (2020)
27. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd:
Single shot multibox detector. In: Eur. Conf. Comput. Vis. pp. 21–37. Springer
(2016)
28. Liu, Y., Zhu, M., Li, H., Chen, H., Wang, X., Shen, C.: Matcher: Segment anything
with one shot using all-purpose feature matching. In: Int. Conf. Learn. Represent.
(2024)
29. Liu, Y.C., Ma, C.Y., He, Z., Kuo, C.W., Chen, K., Zhang, P., Wu, B., Kira, Z.,
Vajda, P.: Unbiased teacher for semi-supervised object detection. In: Int. Conf.
Learn. Represent. (2021)
30. Liu, Y.C., Ma, C.Y., Kira, Z.: Unbiased teacher v2: Semi-supervised object de-
tection for anchor-free and anchor-based detectors. In: IEEE Conf. Comput. Vis.
Pattern Recog. pp. 9819–9828 (2022)
31. Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical
images. Nature Communications 15(1), 654 (2024)
32. Mao, J., Xiao, T., Jiang, Y., Cao, Z.: What can help pedestrian detection? In:
IEEE Conf. Comput. Vis. Pattern Recog. pp. 3127–3136 (2017)
33. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V.,
Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust
visual features without supervision. Trans. Mach. Learn Res. (2024)
34. Qiao, L., Zhao, Y., Li, Z., Qiu, X., Wu, J., Zhang, C.: Defrcn: Decoupled faster
r-cnn for few-shot object detection. In: Int. Conf. Comput. Vis. pp. 8681–8690
(2021)
35. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 779–
788 (2016)
36. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. In: Adv. Neural Inform. Process. Syst. vol. 28
(2015)
37. Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: Crowdhuman:
A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123
(2018)
38. Su, H., Deng, J., Fei-Fei, L.: Crowdsourcing annotations for visual object detection.
In: Workshops at the twenty-sixth AAAI conference on artificial intelligence (2012)
39. Sun, B., Li, B., Cai, S., Yuan, Y., Zhang, C.: Fsce: Few-shot object detection via
contrastive proposal encoding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp.
7352–7362 (2021)
40. Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L.,
Yuan, Z., Wang, C., et al.: Sparse r-cnn: End-to-end object detection with learnable
proposals. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14454–14463 (2021)
41. Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with
online instance classifier refinement. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 2843–2851 (2017)
42. Tang, Y., Chen, W., Luo, Y., Zhang, Y.: Humble teachers teach better students
for semi-supervised object detection. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 3132–3141 (2021)
43. Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object
detection. In: Int. Conf. Comput. Vis. pp. 9627–9636 (2019)
44. Wang, X., Huang, T.E., Darrell, T., Gonzalez, J.E., Yu, F.: Frustratingly simple
few-shot object detection. Int. Conf. Mach. Learn. (2020)
45. Wang, X., Xiao, T., Jiang, Y., Shao, S., Sun, J., Shen, C.: Repulsion loss: Detecting
pedestrians in a crowd. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7774–7783
(2018)
46. Wei, Z., Chen, P., Yu, X., Li, G., Jiao, J., Han, Z.: Semantic-aware sam for point-
prompted instance segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp.
3585–3594 (2024)
47. Wu, J., Fu, R., Fang, H., Liu, Y., Wang, Z., Xu, Y., Jin, Y., Arbel, T.: Medical
sam adapter: Adapting segment anything model for medical image segmentation.
arXiv preprint arXiv:2304.12620 (2023)
48. Xie, E., Ding, J., Wang, W., Zhan, X., Xu, H., Sun, P., Li, Z., Luo, P.: Detco:
Unsupervised contrastive learning for object detection. In: Int. Conf. Comput. Vis.
pp. 8392–8401 (2021)
49. Xu, M., Zhang, Z., Hu, H., Wang, J., Wang, L., Wei, F., Bai, X., Liu, Z.: End-
to-end semi-supervised object detection with soft teacher. In: Int. Conf. Comput.
Vis. pp. 3060–3069 (2021)
50. Xu, Z., Wenchao, D., Yongqi, A., Yinglong, D., Tao, Y., Min, L., Ming, T., Jinqiao,
W.: Fast segment anything. arXiv preprint arXiv:2306.12156 (2023)
51. Yan, X., Chen, Z., Xu, A., Wang, X., Liang, X., Lin, L.: Meta r-cnn: Towards
general solver for instance-level low-shot learning. In: Int. Conf. Comput. Vis. pp.
9577–9586 (2019)
52. Ye, Z., Lovell, L., Faramarzi, A., Ninic, J.: Sam-based instance segmentation mod-
els for the automation of masonry crack detection. arXiv preprint arXiv:2401.15266
(2024)
53. Zeng, Z., Liu, B., Fu, J., Chao, H., Zhang, L.: Wsod2: Learning bottom-up and
top-down objectness distillation for weakly-supervised object detection. In: Int.
Conf. Comput. Vis. (2019)
54. Zhang, C., Han, D., Qiao, Y., Kim, J.U., Bae, S.H., Lee, S., Hong, C.S.: Faster
segment anything: Towards lightweight sam for mobile applications. arXiv preprint
arXiv:2306.14289 (2023)
55. Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.Y.: DINO:
DETR with improved denoising anchor boxes for end-to-end object detection. In:
Int. Conf. Learn. Represent. (2023)
56. Zhang, K., Xiong, F., Sun, P., Hu, L., Li, B., Yu, G.: Double anchor r-cnn for
human detection in a crowd. arXiv preprint arXiv:1909.09998 (2019)
57. Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Qiao, Y., Gao, P., Li, H.:
Personalize segment anything model with one shot. In: Int. Conf. Learn. Represent.
(2024)
58. Zhang, S., Benenson, R., Schiele, B.: Citypersons: A diverse dataset for pedestrian
detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3213–3221 (2017)
59. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-
based and anchor-free detection via adaptive training sample selection. In: IEEE
Conf. Comput. Vis. Pattern Recog. pp. 9759–9768 (2020)
60. Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z.: Occlusion-aware r-cnn: Detecting
pedestrians in a crowd. In: Eur. Conf. Comput. Vis. pp. 637–653 (2018)
61. Zhang, S.H., Li, R., Dong, X., Rosin, P., Cai, Z., Han, X., Yang, D., Huang, H.,
Hu, S.M.: Pose2seg: Detection free human instance segmentation. In: IEEE Conf.
Comput. Vis. Pattern Recog. pp. 889–898 (2019)
62. Zheng, A., Zhang, Y., Zhang, X., Qi, X., Sun, J.: Progressive end-to-end object
detection in crowded scenes. In: IEEE Conf. Comput. Vis. Pattern Recog. pp.
857–866 (2022)
63. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable
transformers for end-to-end object detection. In: Int. Conf. Learn. Represent.
(2021)